



Proceedings of SSST-7

Seventh Workshop on

Syntax, Semantics and Structure in Statistical Translation

Marine Carpuat, Lucia Specia and Dekai Wu (editors)

SIGMT / SIGLEX Workshop
The 2013 Conference of the North American Chapter

of the Association for Computational Linguistics: Human Language Technologies

©2013 The Association for Computational Linguistics

209 N. Eighth Street
Stroudsburg, PA 18360
USA
Tel: +1-570-476-8006
[email protected]

ISBN 978-1-937284-47-3


Introduction

The Seventh Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-7) was held on 13 June 2013 following the NAACL 2013 conference in Atlanta, GA, USA. Like the first six SSST workshops in 2007, 2008, 2009, 2010, 2011 and 2012, it aimed to bring together researchers from different communities working in the rapidly growing field of structured statistical models of natural language translation.

We selected 8 papers for this year’s workshop, many of which reflect statistical machine translation’s movement toward not only tree-structured and syntactic models incorporating stochastic synchronous/transduction grammars, but also increasingly semantic models and the closely linked issues of deep syntax and shallow semantics.

In the third year since “Semantics” was explicitly added to the workshop name, the work exploring SMT’s connections to semantics has continued to grow. Carpuat shows that word sense disambiguation tasks can be viewed as a method for semantic evaluation of machine translation lexical choice. Singh studies the impact of the orthographic representation of Manipuri, a Sino-Tibetan language, on the task of SMT to and from English, and explores its impact on lexical ambiguity.

Several papers deepen our understanding of theoretical and practical issues associated with structured statistical translation models.

Maillette de Buy Wenniger and Sima’an show how to extend rules in a hierarchical phrase-based system with reordering information, by defining more specific nonterminals and augmenting rules with features. Huck, Vilar, Freitag and Ney present a detailed empirical study of cube pruning for hierarchical phrase-based systems. Herrmann, Niehues and Waibel incorporate a syntactic tree-based reordering method to model long-range reorderings in a phrase-based machine translation system, and combine reordering models at different levels of linguistic representation.

Saers, Addanki and Wu present an unsupervised method for inducing an Inversion Transduction Grammar based on the Minimum Description Length principle. Maillette de Buy Wenniger and Sima’an propose a precise definition of what it means for an Inversion Transduction Grammar to cover the word alignment of a sentence, and experiment with human and machine-made alignments. Kaeshammer explores the expressiveness of synchronous linear context-free rewriting systems for machine translation by computing derivation coverage on manually word aligned parallel text.

Thanks are due once again to our authors and our Program Committee for making the seventh SSST workshop another success.

Marine Carpuat, Lucia Specia, and Dekai Wu


Organizers:

Marine Carpuat, National Research Council Canada
Lucia Specia, University of Sheffield
Dekai Wu, Hong Kong University of Science and Technology

Program Committee:

Marianna Apidianak, LIMSI-CNRS
Wilker Aziz, University of Wolverhampton
Srinivas Bangalore, AT&T Labs Research
Yee Seng Chan, Raytheon BBN Technologies
Colin Cherry, National Research Council Canada
David Chiang, USC/ISI
John DeNero, Google
Marc Dymetman, Xerox Research Centre Europe
Alexander Fraser, Universität Stuttgart
Daniel Gildea, University of Rochester
Greg Hanneman, Carnegie Mellon University
Yifan He, New York University
Hieu Hoang, University of Edinburgh
Philipp Koehn, University of Edinburgh
Els Lefever, Hogeschool Gent
Chi-kiu Lo, HKUST
Daniel Marcu, SDL
Aurélien Max, LIMSI-CNRS & Univ. Paris Sud
Daniele Pighin, Google
Markus Saers, HKUST
Taro Watanabe, NICT
Deyi Xiong, Institute for Infocomm Research
Bowen Zhou, IBM Research


Table of Contents

A Semantic Evaluation of Machine Translation Lexical Choice
Marine Carpuat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

Taste of Two Different Flavours: Which Manipuri Script works better for English-Manipuri Language pair SMT Systems?

Thoudam Doren Singh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

Hierarchical Alignment Decomposition Labels for Hiero Grammar Rules
Gideon Maillette de Buy Wenniger and Khalil Sima’an . . . . . . . . . . . . . . . . . . . . 19

A Performance Study of Cube Pruning for Large-Scale Hierarchical Machine Translation
Matthias Huck, David Vilar, Markus Freitag and Hermann Ney . . . . . . . . . . . . . . 29

Combining Word Reordering Methods on different Linguistic Abstraction Levels for Statistical Machine Translation

Teresa Herrmann, Jan Niehues and Alex Waibel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

Combining Top-down and Bottom-up Search for Unsupervised Induction of Transduction Grammars
Markus Saers, Karteek Addanki and Dekai Wu . . . . . . . . . . . . . . . . . . . . . . . 48

A Formal Characterization of Parsing Word Alignments by Synchronous Grammars with Empirical Evidence to the ITG Hypothesis.

Gideon Maillette de Buy Wenniger and Khalil Sima’an . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

Synchronous Linear Context-Free Rewriting Systems for Machine Translation
Miriam Kaeshammer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68


Conference Program

Thursday, June 13, 2013

9:15–9:30 Opening Remarks

Session 1

9:30–10:00 A Semantic Evaluation of Machine Translation Lexical Choice
Marine Carpuat

10:00–10:30 Taste of Two Different Flavours: Which Manipuri Script works better for English-Manipuri Language pair SMT Systems?
Thoudam Doren Singh

10:30–11:00 Break

Session 2

11:00–11:30 Hierarchical Alignment Decomposition Labels for Hiero Grammar Rules
Gideon Maillette de Buy Wenniger and Khalil Sima’an

11:30–12:00 A Performance Study of Cube Pruning for Large-Scale Hierarchical Machine Translation
Matthias Huck, David Vilar, Markus Freitag and Hermann Ney

12:00–12:30 Combining Word Reordering Methods on different Linguistic Abstraction Levels for Statistical Machine Translation
Teresa Herrmann, Jan Niehues and Alex Waibel

12:30–2:00 Lunch


Thursday, June 13, 2013 (continued)

Session 3

2:00–3:00 Panel discussion: Meaning Representations for Machine Translation, with Jan Hajic, Kevin Knight, Martha Palmer and Dekai Wu

3:00–3:30 Combining Top-down and Bottom-up Search for Unsupervised Induction of Transduction Grammars
Markus Saers, Karteek Addanki and Dekai Wu

3:30–4:00 Break

Session 4

4:00–4:30 A Formal Characterization of Parsing Word Alignments by Synchronous Grammars with Empirical Evidence to the ITG Hypothesis.
Gideon Maillette de Buy Wenniger and Khalil Sima’an

4:30–5:00 Synchronous Linear Context-Free Rewriting Systems for Machine Translation
Miriam Kaeshammer


Proceedings of the 7th Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 1–10, Atlanta, Georgia, 13 June 2013. ©2013 Association for Computational Linguistics

A Semantic Evaluation of Machine Translation Lexical Choice

Marine Carpuat
National Research Council Canada

1200 Montreal Rd,
Ottawa, ON K1A 0R6

[email protected]

Abstract

While automatic metrics of translation quality are invaluable for machine translation research, deeper understanding of translation errors requires more focused evaluations designed to target specific aspects of translation quality. We show that Word Sense Disambiguation (WSD) can be used to evaluate the quality of machine translation lexical choice, by applying a standard phrase-based SMT system on the SemEval2010 Cross-Lingual WSD task. This case study reveals that the SMT system does not perform as well as a WSD system trained on the exact same parallel data, and that local context models based on source phrases and target n-grams are much weaker representations of context than the simple templates used by the WSD system.

1 Introduction

Much research has focused on automatically evaluating the quality of Machine Translation (MT) by comparing automatic translations to human translations on samples of a few thousand sentences. Many metrics (Papineni et al., 2002; Banerjee and Lavie, 2005; Gimenez and Marquez, 2007; Lo and Wu, 2011, for instance) have been proposed to estimate the adequacy and fluency of machine translation and evaluated based on their correlation with human judgements of translation quality (Callison-Burch et al., 2010). While these metrics have proven invaluable in driving progress in MT research, finer-grained evaluations of translation quality are necessary to provide a more focused analysis of translation errors. When developing complex MT systems, comparing BLEU or TER scores is not sufficient to understand what improved or what went wrong. Error analysis can of course be done manually (Vilar et al., 2006), but it is often too slow and expensive to be performed as often as needed during system development.

Several metrics have been recently proposed to evaluate specific aspects of translation quality such as word order (Birch et al., 2010; Chen et al., 2012). While word order is indirectly taken into account by BLEU, TER or METEOR scores, dedicated metrics provide a direct evaluation that lets us understand whether a given system’s reordering performance improved during system development. Word order metrics provide a complementary tool for targeting evaluation and analysis to a specific aspect of machine translation quality.

There has not been as much work on evaluating the lexical choice performance of MT: does an MT system preserve the meaning of words in translation? This is of course measured indirectly by commonly used global metrics, but a more focused evaluation can help us gain a better understanding of the behavior of MT systems.

In this paper, we show that MT lexical choice can be framed and evaluated as a standard Word Sense Disambiguation (WSD) task. We leverage existing WSD shared tasks in order to evaluate whether word meaning is preserved in translation. Let us emphasize that, just like reordering metrics, our WSD evaluation is meant to complement global metrics of translation quality. In previous work, intrinsic evaluations of lexical choice have been performed using either semi-automatically constructed data sets based on MT reference translations (Gimenez and Marquez, 2008; Carpuat and Wu, 2008), or manually constructed word sense disambiguation test beds that do not exactly match MT lexical choice (Carpuat and Wu, 2005). We will show how existing Cross-Lingual Word Sense Disambiguation tasks (Lefever and Hoste, 2010; Lefever and Hoste, 2013) can be directly seen as machine translation lexical choice (Section 2): their sense inventory is based on translations in a second language rather than arbitrary sense representations used in other WSD tasks (Carpuat and Wu, 2005); unlike in MT evaluation settings, human annotators can more easily provide a complete representation of all correct meanings of a word. Second, we show how using this task for evaluating the lexical choice performance of several phrase-based SMT systems (PBSMT) gives some insights into their strengths and weaknesses (Section 5).

2 Selecting a Word Sense Disambiguation Task to Evaluate MT Lexical Choice

Word Sense Disambiguation consists in determining the correct sense of a word in context. This challenging problem has been studied from a rich variety of perspectives in Natural Language Processing (see Agirre and Edmonds (2006) for an overview). The Senseval and SemEval series of evaluations (Edmonds and Cotton, 2001; Mihalcea and Edmonds, 2004; Agirre et al., 2007) have driven the standardization of methodology for evaluating WSD systems. Many shared tasks were organized over the years, providing evaluation settings that vary along several dimensions, including:

• target vocabulary: in all-words tasks, systems are expected to tag all content words in running text (Palmer et al., 2001), while in lexical sample tasks, the evaluation considers a smaller predefined set of target words (Mihalcea et al., 2004; Lefever and Hoste, 2010).

• language: English is by far the most studied language, but the disambiguation of words in other languages such as Chinese (Jin et al., 2007) has been considered.

• sense inventory: many tasks use WordNet senses (Fellbaum, 1998), but other sense representations have been used, including alternate semantic databases such as HowNet (Dong, 1998), or lexicalizations in one or more languages (Chklovski et al., 2004).

The Cross-Lingual Word Sense Disambiguation (CLWSD) task introduced at a recent edition of SemEval (Lefever and Hoste, 2010) is an English lexical sample task that uses translations in other European languages as a sense inventory. As a result, it is particularly well suited to evaluating machine translation lexical choice.

2.1 Translations as Word Sense Representations

The CLWSD task is essentially the same task as MT lexical choice: given English target words in context, systems are asked to predict translations in other European languages. The gold standard consists of translations proposed by several bilingual humans, as can be seen in Table 1. MT system predictions can be compared to human annotations directly, without introducing additional sources of ambiguity and mismatches due to representation differences. This contrasts with our previous work on evaluating MT on a WSD task (Carpuat and Wu, 2005), which used text annotated with abstract sense categories from the HowNet knowledge base (Dong, 1998). In HowNet, each word is defined using a concept, constructed as a combination of basic units of meaning, called sememes. Words that share the same concept can be viewed as synonyms. Evaluating MT using a gold standard of HowNet categories requires mapping translations from the MT output to the HowNet representation. Some categories are annotated with English translations, but additional effort is required in order to cover all translation candidates produced by the MT system.

2.2 Controlled Learning Conditions

Another advantage of the CLWSD task is that it provides controlled learning conditions (even though it is an unsupervised task with no annotated training data). The gold labels for CLWSD are learned from parallel corpora. As a result, MT lexical choice models can be estimated on the exact same data. Translations for English words in the lexical sample are extracted from a semi-automatic word alignment of sentences from the Europarl parallel corpus (Koehn, 2005). These translations are then manually clustered into senses. When constructing the gold annotation, human annotators are given occurrences of target words in context. For each occurrence, they select a sense cluster and provide all translations from this cluster that are correct in this specific context. Since three annotators contribute, each test occurrence is therefore tagged with a set of translations in another language, along with a frequency which represents the number of annotators who selected it. A more detailed description of the annotation process can be found in (Lefever and Hoste, 2010).

Target word: ring

English context: The twelve stars of the European flag are depicted on the outer ring.
Gold translations: anillo (3); círculo (2); corona (2); aro (1)

English context: The terrors which Mr Cash expresses about our future in the community have a familiar ring about them.
Gold translations: sonar (3); tinte (3); connotación (2); tono (1)

English context: The American containment ring around the Soviet bloc had been seriously breached only by the Soviet acquisition of military facilities in Cuba.
Gold translations: cerco (2); círculo (2); cordón (2); barrera (1); blindaje (1); limitación (1)

Table 1: Example of annotated CLWSD instances from the SemEval 2010 test set. For each gold Spanish translation, we are given the number of annotators who proposed it (out of 3 annotators).
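For concreteness, a gold-translation field from Table 1 can be read into a simple frequency map. The sketch below is our own illustration: the helper name and the assumption that fields are semicolon-separated with counts in parentheses follow the layout shown in Table 1, not an official scorer format.

```python
def parse_gold(gold_field):
    """Parse a gold-translation field such as 'anillo (3);circulo (2);aro (1);'
    into a dict mapping each Spanish lemma to the number of annotators
    (out of 3) who proposed it."""
    freqs = {}
    for item in gold_field.split(";"):
        item = item.strip()
        if not item:
            continue
        # each item looks like 'translation (count)'
        lemma, _, count = item.rpartition("(")
        freqs[lemma.strip()] = int(count.rstrip(")"))
    return freqs
```

For the first instance in Table 1, `parse_gold("anillo (3);circulo (2);corona (2);aro (1);")` yields `{'anillo': 3, 'circulo': 2, 'corona': 2, 'aro': 1}`.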

Again, this contrasts with our previous work on evaluating MT on a HowNet-based Chinese WSD task, where Chinese sentences were manually annotated with HowNet senses which were completely unrelated to the parallel corpus used for training the SMT system. Using CLWSD as an evaluation of MT lexical choice solves this issue and provides controlled learning conditions.

2.3 CLWSD evaluates the semantic adequacy of MT lexical choice

A key challenge in MT evaluation lies in deciding whether the meaning of the translation is correct when it does not exactly match the reference translation. METEOR uses WordNet synonyms and learned paraphrase tables (Denkowski and Lavie, 2010). MEANT uses vector-space based lexical similarity scores (Lo et al., 2012). While these methods lead to higher correlations with human judgements on average, they are not ideal for a fine-grained evaluation of lexical choice: similarity scores are defined independently of context and might give credit to incorrect translations (Carpuat et al., 2012). In contrast, CLWSD solves this difficult problem by providing all correct translation candidates in context according to several human annotators. These multiple translations provide a more complete representation of the correct meaning of each occurrence of a word in context.

The CLWSD annotation procedure is designed to easily let human annotators provide many correct translation alternatives for a word. Producing many correct annotations for a complete sentence is a much more expensive undertaking: crowdsourcing can help alleviate the cost of obtaining a small number of reference translations (Zbib et al., 2012), but acquiring a complete representation of all possible translations of a source sentence is a much more complex task (Dreyer and Marcu, 2012). Machine translation evaluations typically use between one and four reference translations, which provide a very incomplete representation of the correct semantics of the input sentence in the output language. CLWSD provides a more complete representation through the multiple gold translations available.

2.4 Limitations

The main drawback of using CLWSD to evaluate lexical choice is that CLWSD is a lexical sample task, which only evaluates disambiguation of 20 English nouns. This arbitrary sample of words does not let us target words or phrases that might be specifically interesting for MT.

In addition, the data available through the shared task does not let us evaluate complete translations of the CLWSD test sentences, since full reference translations are not available. Instead of using a WSD dataset for MT purposes, we could take the converse approach and automatically construct a WSD test set based on MT evaluation corpora (Vickrey et al., 2005; Gimenez and Marquez, 2008; Carpuat and Wu, 2008; Carpuat et al., 2012). However, this approach suffers from noisy automatic alignments between source and reference, as well as from a limited representation of the correct meaning of words in context due to the limited number of reference translations.

Other SemEval tasks such as the Cross-Lingual Lexical Substitution Task (Mihalcea et al., 2010) would also provide an appropriate test bed. We focused on the CLWSD task, since it uses senses drawn from the Europarl parallel corpus, and therefore offers more constrained settings for comparison between systems. The lexical substitution task targets verbs and adjectives in addition to nouns, and would therefore be an interesting test case to consider in future work.

2.5 Official and MT-centric Evaluation Metrics

In order to make comparison with other systems possible, we follow the standard evaluation framework defined for the task and score the output of all our systems using four different metrics, computed using the scoring tool made available by the organizers.

The difference between system predictions and gold standard annotations is quantified using precision and recall scores[1], defined as follows. Given a set T of test items and a set H of annotators, H_i is the set of translations proposed by all annotators h for instance i ∈ T. Each translation type res in H_i has an associated frequency freq_res, which represents the number of human annotators who selected res as one of their top 3 translations. Given a set of system answers A of items i ∈ T such that the system provides at least one answer, a_i : i ∈ A is the set of answers from the system for instance i. For each i, the scorer computes the intersection of the system answers a_i and the gold standard H_i.

Systems propose as many answers as deemed necessary, but the scores are divided by the number of guesses in order not to favor systems that output many answers per instance.

[1] In this paper, we focus on evaluating translation systems whose task is to produce a single complete translation for a given sentence. As a result, we only focus on the 1-best MT output and do not report the relaxed out-of-five evaluation setting also considered in the official SemEval task.

\[
\text{Precision} = \frac{1}{|A|} \sum_{a_i : i \in A} \frac{\sum_{res \in a_i} freq_{res}}{|a_i| \cdot |H_i|}
\]

\[
\text{Recall} = \frac{1}{|T|} \sum_{a_i : i \in T} \frac{\sum_{res \in a_i} freq_{res}}{|a_i| \cdot |H_i|}
\]

We also report Mode Precision and Mode Recall scores: instead of comparing system answers to the full set of gold standard translations H_i for an instance i ∈ T, the Mode Precision and Recall scores only use a single gold translation, which is the translation chosen most frequently by the human annotators.
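To make the definitions concrete, here is a small reimplementation of these precision and recall scores. This is our own sketch, not the official SemEval scorer; in particular, we treat |H_i| as the size of the multiset of annotator responses, i.e. the sum of the freq_res counts.

```python
def clwsd_scores(answers, gold, n_test):
    """CLWSD precision and recall over 1-best (or n-best) system answers.

    answers: dict instance id -> list of system translations (answered items only)
    gold:    dict instance id -> dict translation -> annotator count (freq_res)
    n_test:  total number of test instances |T|
    """
    def item_score(a_i, H_i):
        # credit for system answers found in the gold set, normalized by the
        # number of guesses |a_i| and by the gold multiset size |H_i|
        credit = sum(H_i.get(res, 0) for res in a_i)
        return credit / (len(a_i) * sum(H_i.values()))

    total = sum(item_score(a_i, gold[i]) for i, a_i in answers.items())
    precision = total / len(answers) if answers else 0.0
    recall = total / n_test  # unanswered instances contribute zero credit
    return precision, recall
```

For a single answered instance with gold counts {anillo: 3, círculo: 2} and 1-best answer "anillo", the item credit is 3/5 = 0.6; with two test instances and one answered, precision is 0.6 and recall is 0.3.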

In addition, we compute the 1-gram precision component of the BLEU score (Papineni et al., 2002), denoted as BLEU1 in the result tables[2]. In contrast with the official CLWSD evaluation scores described above, BLEU1 gives equal weight to all translation candidates, which can be seen as multiple references.
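BLEU1 as used here can be sketched as clipped unigram precision against multiple references (our own illustration; as noted, the length penalty of full BLEU is not applied):

```python
from collections import Counter

def bleu1(candidate, references):
    """Clipped 1-gram precision: each candidate token is credited at most
    as many times as it appears in the most generous single reference."""
    cand = Counter(candidate)
    max_ref = Counter()
    for ref in references:
        for tok, c in Counter(ref).items():
            max_ref[tok] = max(max_ref[tok], c)
    clipped = sum(min(c, max_ref[tok]) for tok, c in cand.items())
    return clipped / sum(cand.values())
```

For example, `bleu1(["the", "cat", "cat"], [["the", "cat"], ["a", "cat"]])` gives 2/3: "cat" is credited only once because no single reference contains it twice.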

3 PBSMT system

We use a typical phrase-based SMT system trained for English-to-Spanish translation. Its application to the CLWSD task affects the selection of training data and its preprocessing, but the SMT model design and learning strategies are exactly the same as for conventional translation tasks.

3.1 Model

We use the NRC’s PORTAGE phrase-based SMT system, which implements a standard phrasal beam-search decoder with cube pruning. Translation hypotheses are scored according to the following features:

• 4 phrase-table scores: phrasal translation probabilities with Kneser-Ney smoothing and Zens-Ney lexical smoothing in both translation directions (Chen et al., 2011)

• 6 hierarchical lexicalized reordering scores, which represent the orientation of the current phrase with respect to the previous block that could have been translated as a single phrase (Galley and Manning, 2008)

[2] Even though it does not include the length penalty used in the BLEU score.

4

• a word penalty, which scores the length of the output sentence

• a word-displacement distortion penalty

• a Kneser-Ney smoothed 5-gram Spanish language model

Weights for these features are learned using a batch version of the MIRA algorithm (Cherry and Foster, 2012). Phrase pairs are extracted from IBM4 alignments obtained with GIZA++ (Och and Ney, 2003). We learn phrase translation candidates for phrases of length 1 to 7.

Converting the PBSMT output for CLWSD requires a final straightforward mapping step. We use the phrasal alignment between SMT input and output to isolate the translation candidates for the CLWSD target word. When it maps to a multi-word phrase in the target language, we use the word within the phrase that has the highest IBM1 translation probability given the CLWSD target word of interest. Note that there is no need to perform any manual mapping between SMT output and sense inventories as in (Carpuat and Wu, 2005).
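This mapping step can be sketched as below. It is a simplified illustration: the phrase-pair list and the IBM1 probability function are hypothetical stand-ins for the decoder’s phrasal alignment output and lexical translation table, not PORTAGE internals.

```python
def clwsd_answer(target_word, phrase_pairs, ibm1_prob):
    """Extract the CLWSD answer for target_word from a decoded sentence.

    phrase_pairs: list of (source_tokens, target_tokens) pairs from the
                  decoder's phrasal alignment, in translation order
    ibm1_prob:    function giving p(target_token | source_token)
    """
    for src, tgt in phrase_pairs:
        if target_word in src:
            if len(tgt) == 1:
                return tgt[0]
            # multi-word target phrase: keep the token with the highest IBM1
            # translation probability given the CLWSD target word
            return max(tgt, key=lambda t: ibm1_prob(t, target_word))
    return None
```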

3.2 Data

The core training corpus is the exact same set of sentences from Europarl that were used to learn the sense inventory, in order to ensure that PBSMT knows the same translations as the human annotators who built the gold standard. There are about 900k sentence pairs, since only 1-to-1 alignments that exist in all the languages considered in CLWSD were used (Lefever and Hoste, 2010).

We exploit additional corpora from the WMT2012 translation task, using the full Europarl corpus to train language models, and for one experiment the news-commentary parallel corpus (see Section 9).

These parallel corpora are used to learn the translation, reordering and language models. The log-linear feature weights are learned on a development set of 3000 sentences sampled from the WMT2012 development test sets. They are selected based on their distance to the CLWSD trial and test sentences (Moore and Lewis, 2010).

We tokenize and lemmatize all English and Spanish text using the FreeLing tools (Padro and Stanilovsky, 2012). We use lemma representations to perform translation, since the CLWSD targets and translations are lemmatized.

4 WSD system

4.1 Model

We also train a dedicated WSD system for this task in order to perform a controlled comparison with the SMT system. Many WSD systems have been evaluated on the SemEval test bed used here; however, they differ in terms of resources used, training data and preprocessing pipelines. In order to control for these parameters, we build a WSD system trained on the exact same training corpus, preprocessing and word alignment as the SMT system described above.

We cast WSD as a generic ranking problem with linear models. Given a word in context x, translation candidates t are ranked according to the following model: f(x, t) = \sum_i \lambda_i \phi_i(x, t), where the \phi_i(x, t) represent binary features that fire when specific clues are observed in a context x.

Context clues are based on standard feature templates in many supervised WSD approaches (Florian et al., 2002; van Gompel, 2010; Lefever et al., 2011):

• words in a window of 2 words around the disambiguation target

• part-of-speech tags in a window of 2 words around the disambiguation target

• bag-of-word context: all nouns, verbs and adjectives in the context x

At training time, each example (x, t) is assigned a cost based on the translation observed in the parallel corpora: the cost is 0 if t = t_aligned, and 1 otherwise. Feature weights \lambda_i can be learned in many ways. We optimize logistic loss using stochastic gradient descent[3].
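A minimal sketch of the feature extraction and scoring described above follows. This is our own reconstruction, not the paper’s code: the feature-name scheme, the conjunction of each context clue with the candidate translation t, and the Penn Treebank-style tag prefixes N/V/J are all assumptions.

```python
def context_features(tokens, tags, i):
    """Binary context clues for the target word at position i, following the
    three template families: window words, window POS tags, bag-of-words."""
    feats = set()
    for off in (-2, -1, 1, 2):
        j = i + off
        if 0 <= j < len(tokens):
            feats.add(f"w[{off}]={tokens[j]}")   # words in a +/-2 window
            feats.add(f"p[{off}]={tags[j]}")     # POS tags in a +/-2 window
    for j, (tok, tag) in enumerate(zip(tokens, tags)):
        if j != i and tag[:1] in ("N", "V", "J"):
            feats.add(f"bow={tok}")              # content-word bag-of-words
    return feats

def score(weights, feats, t):
    """f(x, t) = sum_i lambda_i phi_i(x, t): each active context clue is
    conjoined with candidate translation t to form a binary feature."""
    return sum(weights.get((f, t), 0.0) for f in feats)
```

Candidates t are then ranked by `score`, highest first.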

4.2 Data

The training instances for the supervised WSD system are built automatically by (1) extracting all occurrences of English target words in context, and (2) annotating them with their aligned Spanish lemma.

[3] We use the optimizer from http://hunch.net/~vw, v7.1.2.


System      Prec.   Rec.    Mode Prec.  Mode Rec.  BLEU1
WSD         25.96   25.58   55.02       54.13      76.06
PBSMT       23.72   23.69   45.49       45.37      62.72
MFStest     21.35   21.35   44.50       44.50      65.50
MFStrain    19.14   19.14   42.00       42.00      59.70

Table 2: Main CLWSD results: PBSMT yields competitive results, but WSD outperforms PBSMT

We obtain a total of 33139 training instances for all targets (an average of 1656 per target, with a minimum of 30 and a maximum of 5414). Note that this process does not require any manual annotation.

5 WSD systems can outperform PBSMT

Table 2 summarizes the main results. PBSMT outperforms the most frequent sense baseline by a wide margin, and interestingly also yields better results than many of the dedicated WSD systems that participated in the SemEval task. However, PBSMT performance does not match that of the most frequent sense oracle (which uses sense frequencies observed in the test set rather than training set). The WSD system trained on the same word-aligned parallel corpus as the PBSMT system achieves the best performance. It also obtains better results than all but the top system in the official results (Lefever and Hoste, 2010).

The results in Table 2 are quite different from those reported by Carpuat and Wu (2005) on a Chinese WSD task. The Chinese-English PBSMT system performed much worse than any of the dedicated WSD systems on that task. While our WSD system outperforms PBSMT on the CLWSD task too, the difference is not as large, and the PBSMT system is competitive when compared to the full set of systems that were evaluated on this task. This confirms that the CLWSD task represents a fairer benchmark for comparing PBSMT with WSD systems.

6 Impact of PBSMT Context Models

What is the impact of PBSMT context models on lexical choice accuracy? Table 3 provides an overview of experiments where we vary the context size available to the PBSMT system. The main PBSMT system in the top row uses the default settings presented in Section 3.

System      Prec.   Rec.    Mode Prec.  Mode Rec.  BLEU1
PBSMT       23.72   23.69   45.49       45.37      62.72
max source phrase length l
l = 1       24.44   24.36   44.50       44.38      65.43
l = 3       24.27   24.22   46.52       46.41      64.33
n-gram LM order n
n = 3       23.60   23.55   44.58       44.47      61.62
n = 7       23.58   23.53   46.06       45.94      62.22
n = 2       23.40   23.35   44.75       44.63      63.02
n = 1       22.92   22.87   43.00       42.89      58.62
+bilingual LM
4-gram      23.89   23.84   45.49       45.37      62.62

Table 3: Impact of source and target context models on PBSMT performance

In the first set of experiments, we evaluate the impact of the source side context on CLWSD performance. Phrasal translations represent the core of PBSMT systems: they capture collocational context in the source language, and they are therefore less ambiguous than single words (Koehn and Knight, 2003; Koehn et al., 2003). The default PBSMT learns translations for source phrases of length ranging from 1 to 7 words.

Limiting the PBSMT system to translate shorter phrases (rows l = 1 and l = 3 in Table 3) surprisingly improves CLWSD performance, even though it degrades BLEU score on WMT test sets. The source context captured by longer phrases therefore does not provide the right disambiguating information in this context.

In the second set of experiments, we evaluate the impact of the context size in the target language, by varying the size of the n-gram language model used. The default PBSMT system used a 5-gram language model. Reducing the n-gram order to 3, 2, 1 and increasing it to 7 both degrade performance. Shorter n-grams do not provide enough disambiguating context, while longer n-grams are more sparse and perhaps do not generalize well outside of the training corpus.

Finally, we report a last experiment which uses a bilingual language model to enrich the context representation in PBSMT (Niehues et al., 2011). This language model is estimated on word pairs formed by target words augmented with their aligned source words. We use a 4-gram model, trained using Good-Turing discounting. This only results in small improvements (< 0.1) over the standard PBSMT system, and remains far below the performance of the dedicated WSD system.

System      Prec.   Rec.    Mode Prec.  Mode Rec.  BLEU1
+ hier      23.72   23.69   45.49       45.37      62.72
+ lex       23.69   23.64   46.66       46.54      62.22
dist        23.42   23.37   45.43       45.30      62.22

Table 4: Impact of reordering models: lexicalized reordering does not hurt lexical choice only when hierarchical models are used

These results show that source phrases are weak representations of context for the purpose of lexical choice. Target n-gram context is more useful than source phrasal context, which can surprisingly harm lexical choice accuracy.

7 Impact of PBSMT Reordering Models

While the phrase-table is the core of the PBSMT system, the reordering model used in our system is heavily lexicalized. In this section, we evaluate its impact on CLWSD performance. The standard PBSMT system uses a hierarchical lexicalized reordering model (Galley and Manning, 2008) in addition to the distance-based distortion limit. Unlike lexicalized reordering (Koehn et al., 2007), which models the orientation of a phrase with respect to the previous phrase, hierarchical reordering models define the orientation of a phrase with respect to the previous block that could have been translated as a single phrase.

In Table 4, we show that lexicalized reordering models benefit CLWSD performance, and that the hierarchical model performs slightly better than the non-hierarchical one overall.
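The orientation classes underlying the non-hierarchical lexicalized reordering model can be sketched as below (a simplified illustration over source spans; the hierarchical variant would first merge adjacent translated spans into blocks, which this sketch omits):

```python
def orientation(prev_span, cur_span):
    """Classify the orientation of the current source span relative
    to the previously translated one, as in lexicalized reordering:
    monotone if it follows immediately, swap if it immediately
    precedes, discontinuous otherwise. Spans are (start, end),
    end-exclusive, over source positions."""
    prev_start, prev_end = prev_span
    cur_start, cur_end = cur_span
    if cur_start == prev_end:
        return "monotone"
    if cur_end == prev_start:
        return "swap"
    return "discontinuous"

print(orientation((0, 2), (2, 4)))  # follows immediately: monotone
print(orientation((2, 4), (0, 2)))  # immediately precedes: swap
print(orientation((0, 2), (5, 7)))  # gap in between: discontinuous
```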

8 Impact of phrase translation selection

In this section, we consider the impact of various methods for selecting phrase translations on the lexical choice performance of PBSMT.

First, we investigate the impact of limiting the number of translation candidates considered for each source phrase in the phrase-table. The main PBSMT system uses t = 50 translation candidates per source phrase. Limiting that number to 20 and increasing it to 100 both have a very small impact on CLWSD.

System     Prec.    Rec.     Mode Prec.   Mode Rec.   BLEU1
PBSMT      23.72    23.69    45.49        45.37       62.72
Number t of translations per phrase
t = 20     23.68    23.63    45.66        45.54       62.32
t = 100    23.65    23.60    45.65        45.53       62.52
Other phrase-table pruning methods
stat sig   23.71    23.66    45.19        45.07       62.62

Table 5: Impact of translation candidate selection on PBSMT performance
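The candidate limit t amounts to a simple per-source filter over the phrase-table, which can be sketched as follows (an illustrative toy table ranked by a single translation probability; real systems rank by a combination of feature scores):

```python
def prune_top_t(phrase_table, t):
    """Keep only the t most probable translation candidates per
    source phrase, ranked by p(target | source)."""
    pruned = {}
    for src_phrase, candidates in phrase_table.items():
        ranked = sorted(candidates.items(), key=lambda kv: kv[1],
                        reverse=True)
        pruned[src_phrase] = dict(ranked[:t])
    return pruned

table = {"bank": {"banque": 0.5, "rive": 0.3, "berge": 0.2}}
print(prune_top_t(table, 2))  # keeps only "banque" and "rive"
```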

Second, we prune the phrase-table using a statistical significance test (Johnson et al., 2007). This pruning strategy aims to drastically decrease the size of the phrase-table without degrading translation performance by removing noisy phrase pairs.
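The statistic behind this kind of significance pruning is a one-tailed Fisher exact test on the co-occurrence counts of a phrase pair; a minimal sketch (not Johnson et al.'s implementation, and with illustrative counts) is:

```python
from math import comb

def fisher_pvalue(c_st, c_s, c_t, n):
    """One-tailed Fisher exact test p-value: probability of observing
    at least c_st joint occurrences of a phrase pair, given marginal
    counts c_s and c_t in n sentence pairs. Pairs whose p-value is
    not small enough are pruned from the phrase-table."""
    p = 0.0
    for k in range(c_st, min(c_s, c_t) + 1):
        p += comb(c_s, k) * comb(n - c_s, c_t - k) / comb(n, c_t)
    return p

# A "1-1-1" pair (each side and the pair seen exactly once in
# 10,000 sentence pairs) gets p = 1/n; such pairs are typically
# the first to be discarded.
print(fisher_pvalue(1, 1, 1, 10000))
```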

9 Impact of training corpus

Since increasing the amount of training data is a reliable way to improve translation performance, we evaluate the impact of training the PBSMT system on more than the Europarl data used for the controlled comparison with WSD. We augment the parallel training corpus with the WMT-12 News Commentary parallel data.4 This yields an additional training set of roughly 160k sentence pairs. We build linear mixture models to combine the translation, reordering and language models learned on the Europarl and News Commentary corpora (Foster and Kuhn, 2007). As can be seen in Table 6, this approach improves all CLWSD scores except for 1-gram precision. The decrease in 1-gram precision indicates that the addition of the news corpus introduces new translation candidates that differ from those used in the gold inventory. Interestingly, the additional data is not sufficient to match the performance of the WSD system learned on Europarl only (see Table 2). While additional data should be used when available, richer context features are valuable to make the most of existing data.

4 http://www.statmt.org/wmt12/translation-task.html
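The linear mixture combination mentioned above can be sketched as follows (a minimal illustration in the spirit of Foster and Kuhn (2007); the toy tables and the fixed weight are invented, whereas real mixture weights would be tuned on a development set):

```python
def mix_models(model_a, model_b, weight_a):
    """Linear mixture of two conditional probability tables,
    e.g. phrase tables trained on Europarl and News Commentary."""
    keys = set(model_a) | set(model_b)
    return {k: weight_a * model_a.get(k, 0.0)
               + (1 - weight_a) * model_b.get(k, 0.0)
            for k in keys}

europarl = {("bank", "banque"): 0.6, ("bank", "rive"): 0.4}
news = {("bank", "banque"): 0.9, ("bank", "rive"): 0.1}
print(mix_models(europarl, news, 0.7))
```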


System     Prec.    Rec.     Mode Prec.   Mode Rec.   BLEU1
Europarl   23.72    23.69    45.49        45.37       62.72
+ News     24.34    24.28    47.49        47.37       61.22

Table 6: Impact of training corpus on PBSMT performance: adding news parallel sentences helps precision and recall, but does not match WSD trained on Europarl only.

10 Conclusion

We use a SemEval Cross-Lingual WSD task to evaluate the lexical choice performance of a typical phrase-based SMT system. Unlike conventional WSD tasks, which rely on abstract sense inventories rather than translations, cross-lingual WSD provides a fair setting for comparing SMT with dedicated WSD systems. Unlike conventional evaluations of machine translation quality, the cross-lingual WSD task lets us isolate a specific aspect of translation quality and show how it is affected by different components of the phrase-based SMT system.

Unlike in previous evaluations on conventional WSD tasks (Carpuat and Wu, 2005), phrase-based SMT performance is on par with many dedicated WSD systems. However, the phrase-based SMT system does not perform as well as a WSD system trained on the exact same parallel data. Analysis shows that while many SMT components can potentially have an impact on SMT lexical choice, CLWSD accuracy is most affected by the length of source phrases and the order of target n-gram language models. Using shorter source phrases actually improves lexical choice accuracy. The official results for the CLWSD task at the SemEval 2013 evaluation provide further insights (Lefever and Hoste, 2013): our PBSMT system can achieve top precision as measured using the top prediction, as in this paper, but does not perform as well as other submitted systems when taking into account the top 5 predictions (Carpuat, 2013). This suggests that local context models based on source phrases and target n-grams are much weaker representations of context than the simple templates used by WSD systems, and that even strong PBSMT systems can benefit from context models developed for WSD.

New learning algorithms (Chiang et al., 2009; Cherry and Foster, 2012, for instance) finally make it possible for PBSMT to reliably learn from many more features than the typical system used here. Evaluations such as the CLWSD task will provide useful tools for analyzing the impact of these features on lexical choice and inform feature design in increasingly large and complex systems.

References

E. Agirre and P.G. Edmonds. 2006. Word Sense Disambiguation: Algorithms and Applications. Text, Speech, and Language Technology Series. Springer Science+Business Media B.V.

Eneko Agirre, Lluís Màrquez, and Richard Wicentowski, editors. 2007. Proceedings of SemEval-2007: 4th International Workshop on Semantic Evaluation, Prague, July.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgement. In Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-2005), Ann Arbor, Michigan, June.

Alexandra Birch, Miles Osborne, and Phil Blunsom. 2010. Metrics for MT evaluation: Evaluating reordering. Machine Translation, 24:15–26.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Kay Peterson, Mark Przybocki, and Omar F. Zaidan. 2010. Findings of the 2010 joint workshop on statistical machine translation and metrics for machine translation. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, WMT '10, pages 17–53.

Marine Carpuat and Dekai Wu. 2005. Evaluating the Word Sense Disambiguation Performance of Statistical Machine Translation. In Proceedings of the Second International Joint Conference on Natural Language Processing (IJCNLP), pages 122–127, Jeju Island, Republic of Korea.

Marine Carpuat and Dekai Wu. 2008. Evaluation of Context-Dependent Phrasal Translation Lexicons for Statistical Machine Translation. In Proceedings of the Sixth Conference on Language Resources and Evaluation (LREC 2008), Marrakech, May.

Marine Carpuat, Hal Daumé III, Alexander Fraser, Chris Quirk, Fabienne Braune, Ann Clifton, Ann Irvine, Jagadeesh Jagarlamudi, John Morgan, Majid Razmara, Aleš Tamchyna, Katharine Henry, and Rachel Rudinger. 2012. Domain adaptation in machine translation: Final report. In 2012 Johns Hopkins Summer Workshop Final Report.

Marine Carpuat. 2013. NRC: A machine translation approach to cross-lingual word sense disambiguation (SemEval-2013 Task 10). In Proceedings of SemEval.

Boxing Chen, Roland Kuhn, George Foster, and Howard Johnson. 2011. Unpacking and transforming feature functions: New ways to smooth phrase tables. In Proceedings of the Machine Translation Summit.

Boxing Chen, Roland Kuhn, and Samuel Larkin. 2012. PORT: a precision-order-recall MT evaluation metric for tuning. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 930–939.

Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 427–436, Montreal, Canada, June.

David Chiang, Kevin Knight, and Wei Wang. 2009. 11,001 new features for statistical machine translation. In NAACL-HLT 2009: Proceedings of the 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 218–226, Boulder, Colorado.

Timothy Chklovski, Rada Mihalcea, Ted Pedersen, and Amruta Purandare. 2004. The Senseval-3 Multilingual English-Hindi lexical sample task. In Proceedings of Senseval-3, Third International Workshop on Evaluating Word Sense Disambiguation Systems, pages 5–8, Barcelona, Spain, July.

Michael Denkowski and Alon Lavie. 2010. METEOR-NEXT and the METEOR paraphrase tables: improved evaluation support for five target languages. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, WMT '10, pages 339–342.

Zhendong Dong. 1998. Knowledge description: what, how and who? In Proceedings of the International Symposium on Electronic Dictionary, Tokyo, Japan.

Markus Dreyer and Daniel Marcu. 2012. HyTER: Meaning-equivalent semantics for translation evaluation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 162–171, Montreal, Canada, June.

Philip Edmonds and Scott Cotton. 2001. Senseval-2: Overview. In Proceedings of Senseval-2: Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 1–5, Toulouse, France.

Christiane Fellbaum, editor. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Radu Florian, Silviu Cucerzan, Charles Schafer, and David Yarowsky. 2002. Combining classifiers for word sense disambiguation. Natural Language Engineering, 8(4):327–341.

George Foster and Roland Kuhn. 2007. Mixture-model adaptation for SMT. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 128–135, Prague, Czech Republic, June.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 848–856.

Jesús Giménez and Lluís Màrquez. 2007. Linguistic Features for Automatic Evaluation of Heterogeneous MT Systems. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 256–264, Prague.

Jesús Giménez and Lluís Màrquez. 2008. Discriminative Phrase Selection for Statistical Machine Translation. In Learning Machine Translation. NIPS Workshop Series.

Peng Jin, Yunfang Wu, and Shiwen Yu. 2007. SemEval-2007 task 05: Multilingual Chinese-English lexical sample. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 19–23, Prague, Czech Republic, June.

Howard Johnson, Joel Martin, George Foster, and Roland Kuhn. 2007. Improving Translation Quality by Discarding Most of the Phrasetable. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 967–975.

Philipp Koehn and Kevin Knight. 2003. Feature-Rich Statistical Translation of Noun Phrases. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, July.

Philipp Koehn, Franz Och, and Daniel Marcu. 2003. Statistical Phrase-based Translation. In Proceedings of HLT/NAACL-2003, Edmonton, Canada, May.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit X, Phuket, Thailand, September.


Els Lefever and Véronique Hoste. 2010. SemEval-2010 task 3: Cross-lingual word sense disambiguation. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 15–20, Uppsala, Sweden, July.

Els Lefever and Véronique Hoste. 2013. SemEval-2013 task 10: Cross-lingual word sense disambiguation. In Proceedings of the 7th International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, USA, May.

Els Lefever, Véronique Hoste, and Martine De Cock. 2011. ParaSense or how to use parallel corpora for word sense disambiguation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 317–322, Portland, Oregon, USA, June.

Chi-kiu Lo and Dekai Wu. 2011. MEANT: an inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility via semantic frames. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT '11, pages 220–229.

Chi-kiu Lo, Anand Karthik Tumuluru, and Dekai Wu. 2012. Fully automatic semantic MT evaluation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 243–252.

Rada Mihalcea and Philip Edmonds, editors. 2004. Proceedings of Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain.

Rada Mihalcea, Timothy Chklovski, and Adam Kilgarriff. 2004. The Senseval-3 English lexical sample task. In Proceedings of Senseval-3, Third International Workshop on Evaluating Word Sense Disambiguation Systems, pages 25–28, Barcelona, Spain, July.

Rada Mihalcea, Ravi Sinha, and Diana McCarthy. 2010. SemEval-2010 Task 2: Cross-Lingual Lexical Substitution. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 9–14, Uppsala, Sweden, July.

Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, ACLShort '10, pages 220–224, Stroudsburg, PA, USA.

Jan Niehues, Teresa Herrmann, Stephan Vogel, and Alex Waibel. 2011. Wider context by using bilingual language models in machine translation. In Proceedings of the Sixth Workshop on Statistical Machine Translation, WMT '11, pages 198–206, Stroudsburg, PA, USA.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–52.

Lluís Padró and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards wider multilinguality. In Proceedings of the Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey, May.

Martha Palmer, Christiane Fellbaum, Scott Cotton, Lauren Delfs, and Hoa Trang Dang. 2001. English tasks: All-words and verb lexical sample. In Proceedings of Senseval-2, Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 21–24, Toulouse, France, July. SIGLEX, Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, July.

Maarten van Gompel. 2010. UvT-WSD1: A cross-lingual word sense disambiguation system. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 238–241, Uppsala, Sweden, July.

David Vickrey, Luke Biewald, Marc Teyssier, and Daphne Koller. 2005. Word-Sense Disambiguation for Machine Translation. In Joint Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP 2005), Vancouver.

David Vilar, Jia Xu, Luis Fernando D'Haro, and Hermann Ney. 2006. Error Analysis of Machine Translation Output. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC'06), pages 697–702, Genoa, Italy, May.

Rabih Zbib, Erika Malchiodi, Jacob Devlin, David Stallard, Spyros Matsoukas, Richard Schwartz, John Makhoul, Omar F. Zaidan, and Chris Callison-Burch. 2012. Machine translation of Arabic dialects. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT '12, pages 49–59.


Proceedings of the 7th Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 11–18, Atlanta, Georgia, 13 June 2013. ©2013 Association for Computational Linguistics

Taste of Two Different Flavours: Which Manipuri Script Works Better for English-Manipuri Language Pair SMT Systems?

Thoudam Doren Singh Centre for Development of Advanced Computing (CDAC), Mumbai

Gulmohor Cross Road No 9, Juhu Mumbai-400049, INDIA

[email protected]

Abstract

The statistical machine translation (SMT) system heavily depends on the sentence-aligned parallel corpus and the target language model. This paper points out some of the core issues in switching a language script and its repercussions on phrase-based statistical machine translation system development. The present task reports the outcome of English-Manipuri language pair phrase-based SMT on two aspects: a) Manipuri using the Bengali script, and b) Manipuri using the transliterated Meetei Mayek script. Two independent views of the training data and language models, for the Bengali-script-based SMT system and the transliterated Meetei Mayek-based SMT system, are presented and compared. The impact of the various language models is considerable in such a scenario. The BLEU and NIST scores show that the Bengali-script-based phrase-based SMT (PBSMT) outperforms the Meetei Mayek-based English-to-Manipuri SMT system. However, subjective evaluation shows slight variation against the automatic scores.

1 Introduction

The present finding is due to a sociolinguistic phenomenon called digraphia: a case of the Manipuri language (a resource-constrained Indian language spoken mainly in the state of Manipur) using two different scripts, namely the Bengali script1

1 http://unicode.org/charts/PDF/U0980.pdf

and Meetei Mayek2. Meetei Mayek (MM) is the original script, which was used until the 18th century to represent Manipuri text. Its earliest use is dated between the 11th and 12th centuries CE3. The Manipuri language is recognized by the Indian Union and has been included in the list of 8th Schedule languages by the 71st amendment of the constitution in 1992. In recent times, the Bengali script is being replaced by Meetei Mayek in schools, government departments and other administrative activities. It may be noted that Manipuri is the only Tibeto-Burman language which has its own script. Digraphia has implications for language technology as well, beyond the issues of language planning, language policy and language ideology. There are several examples of languages written in one script that was later replaced by another: Romanian originally used Cyrillic, then changed to Latin; Turkish and Swahili began with the Arabic script, then Latin; and many languages of former Soviet Central Asia abandoned the Cyrillic script after the dissolution of the USSR. The present study is a typical case where the natural language processing of an Indian language is affected by switching script.

Manipuri is monosyllabic, morphologically rich and highly agglutinative in nature. Tone is very prominent, so a special treatment of tonal words is absolutely necessary. The Manipuri language has 6 vowels and their tone counterparts, and 6 diphthongs and their tone counterparts. Thus, a Manipuri learner should know its tone system and the corresponding word meanings.

2 http://unicode.org/charts/PDF/UABC0.pdf
3 http://en.wikipedia.org/wiki/Meitei_language

Natural language processing for the Manipuri language is at an initial phase. We use a small parallel corpus and a sizable monolingual corpus collected from Manipuri news to develop an English-Manipuri statistical machine translation system. The Manipuri news texts are in the Bengali script, so we carry out transliteration from the Bengali script to Meetei Mayek as discussed in section 3. Typically, transliteration is carried out between two different languages, one as source and the other as target. But in our case, in order to kick-start the MT system development, Bengali script (in which most digital Manipuri text is available) to Meetei Mayek transliteration is carried out using different models. The performance of the rule-based transliteration is improved by integrating a conjunct and syllable handling module into the rule-based task along with the transliteration unit (TU). However, due to the tonal characteristics of this language, the accents of tonal words are lost when transliterating from the Bengali script. In other words, intonation matters in Manipuri text; the differentiation between Bengali characters such as �◌ (i) and ◌� (ee), or ◌� (u) and ◌� (oo), cannot be made using Meetei Mayek. This increases the lexical ambiguity of the transliterated Manipuri words in the Meetei Mayek script.

2 Related Work

Several SMT systems between English and morphologically rich languages have been reported. (Toutanova et al., 2007) reported improving an SMT system by applying word-form prediction models from a stem, using extensive morphological and syntactic information from the source and target languages. Contributions using a factored phrase-based model and a probabilistic tree transfer model at a deep syntactic layer were made by (Bojar and Hajič, 2008) for an English-to-Czech SMT system. (Yeniterzi and Oflazer, 2010) reported syntax-to-morphology mapping in factored phrase-based statistical machine translation (Koehn and Hoang, 2007) from English to Turkish, relying on syntactic analysis on the source side (English) and then encoding a wide variety of local and non-local syntactic structures as complex structural tags which appear as additional factors in the training data. On the target side (Turkish), they only perform morphological analysis and disambiguation, but treat the complete complex morphological tag as a factor instead of separating morphemes. (Bojar et al., 2012) pointed out several pitfalls when designing a factored model translation setup. All the above systems were developed using one script for each language on the source as well as the target side.

Manipuri has a relatively free word order, where the grammatical role of content words is largely determined by their case markers and not just by their positions in the sentence. A machine translation system between Manipuri and English is reported by (Singh and Bandyopadhyay, 2010b), who developed an English-Manipuri SMT system using morpho-syntactic and semantic information where the target case markers are generated based on the suffixes and semantic relations of the source sentence. The above-mentioned system was developed using Bengali-script-based Manipuri text. SMT systems between English and a morphologically rich, highly agglutinative language suffer badly if adequate training and language resources are not available. Moreover, the linguistic representation of the text has implications for several NLP aspects, not only machine translation systems. This is our first attempt to build and compare English-Manipuri language pair SMT systems using two different scripts of Manipuri.

3 Transliterated Parallel Corpora

The English-Manipuri parallel corpora and the Manipuri monolingual corpus collected from the news website www.thesangaiexpress.com are based on the Bengali script. The Bengali script has 52 consonants and 12 vowels. The modern-day Meetei Mayek script is made up of a core repertoire of 27 letters, alongside letters and symbols for final consonants, dependent vowel signs, punctuation, and digits. Meetei Mayek is a Brahmic script with consonants bearing the inherent vowel and vowel matras modifying it. However, unlike most other Brahmi-derived scripts, Meetei Mayek employs explicit final consonants which contain no final vowels. The use of the killer (which refers to its function of killing the inherent vowel of a consonant letter) is optional in spelling; for example, while ꯗꯔ may be read dara or dra, ꯗꯔ must be read dra. Syllable-initial combinations for vowels can occur in modern usage to represent diphthongs. Meetei Mayek has 27 letters (Iyek Ipee), 8 dependent vowel signs (Cheitap Iyek), 8 final consonants (Lonsum Iyek), 10 digits (Cheising Iyek) and 3 punctuation marks (Cheikhei, Lum Iyek and Apun Iyek).

Bengali Script → Meetei Mayek
�, �,  → K (Sam)
�, � → e (Na)
, � → f (Til)
�, � → F (Thou)
�, � → \ (Yang)
�, � → r (Dil)
�, � → R (Dhou)
�, � → B (Un)
�, � → T (Ee)
�, �, � → j (Rai)
�◌, ◌� → g (Inap)
◌�, ◌� → b (Unap)

Table 1 – Many-to-one mapping table

There is no possibility of a direct one-to-one mapping of the 27 Meetei Mayek letters (Iyek Ipee) to the Bengali script, as shown in Table 1; moreover, some Bengali characters (such as �, �, ◌ , !, ◌" etc.) have no corresponding direct mapping to Meetei Mayek, which results in loss of target representation. The syllable-based Bengali script to Meetei Mayek transliteration system outperforms the other known transliteration systems between the two scripts in the news domain in terms of precision and recall (Singh, 2012). The conjunct representation is many-to-many in nature for the bilingual transliteration task between the Bengali script and Meetei Mayek. Some example words using conjuncts are given as:

#$�� �� wDjlKe (press-na)

���%&�& �� rgKfDjgdsg (district-ki)

#�'( ��'�)* �� KlsDjlfjg\lGf (secretariate-ta)

#+',*- � � wlfDjOM (petrol)

And the Bengali script conjuncts and its constitu-ents along with the Meetei Mayek notation for the above examples are as given below:

#$ (pre) � + + ◌. � + #◌ � wDjl

�% (stri) � � + + ◌. � + �◌ � KfDjg

#( (cre) � & + � + #◌ � sDjl

#,* (tro) � + � + #◌* � fDjO

A sample transliterated parallel corpus between English and Manipuri is given in Table 2. These transliterations are based on the syllable-based model.

English On the part of the election department, IFCD have been intimated for taking up necessary measures.

Manipuri in Bengali Script

�'-/� ��+* 0 '12�& 1*�3&�4� 5�67�����*

��&*� 8-9* �9& +*�:;9* :<=�'> . Manipuri in Meetei Mayek

ꯢꯂꯦꯛꯁꯟ ꯗꯤꯄꯔꯇꯃꯦꯅꯇꯀꯤ ꯃꯌꯀꯧꯗꯒꯤ ꯑꯦꯢꯑꯦꯐꯁꯤꯗꯤꯗ ꯗꯔꯀꯔ ꯂꯧꯕ ꯊꯕꯛ

ꯄꯌꯈꯠꯅꯕ ꯈꯡꯍꯟꯈꯔꯦ ꯫

Gloss election departmentki maykeidagee IFCDda darkar leiba thabak paykhatnaba khanghankhre .

English In case of rising the water level of Nambul river, the gate is shut down and the engines are operated to pump out the water.

Manipuri in Bengali Script

&��4�@* �@�- �� '�-4� ��A �1*� B*A:CD9�� #4

E�� ��A-4* ��A E�� ��F��* G��* �HA'�*D4*

-*&+�� =*��� . Manipuri in Meetei Mayek

ꯀꯔꯤꯒꯝ ꯕ ꯅꯝꯕꯜ ꯇꯔꯦꯜ ꯒꯤ ꯏꯁꯤꯪ ꯏꯃꯌꯋꯪꯈꯠꯂꯛꯂꯕꯗꯤ ꯒꯦꯠ ꯑꯁꯤ ꯊꯤꯪꯂꯒ ꯏꯁꯤꯪ

ꯑꯁꯤ ꯢꯟꯖꯤꯟꯅ ꯑꯣꯢꯅ ꯆꯤꯪꯊꯣꯛꯂꯒ ꯂꯛꯄꯅꯤ ꯍꯌꯔꯤ ꯫

Gloss karigumba nambul turelgi eesing eemay waangkhatlaklabadi gate asi thinglaga eesing asi enginena oyna chingthoklaga laakpani hayri.

English The department has a gate at Samushang meant for draining out the flood water of Lamphelpat.

Manipuri in Bengali Script

*1�<�* ��+* 0 '12 E��4� #4 E1* -'I-+*�J&

��A �HA'�*K9* �L� .

Manipuri in Meetei Mayek

ꯁꯃꯁꯡꯗ ꯗ ꯤꯄꯔꯇꯃꯦꯅꯇ ꯑꯁꯤꯒꯤ ꯒꯦꯠ

ꯑꯃ ꯂꯝꯐꯦꯜꯄꯠꯀꯤ ꯏꯁꯤꯪ ꯆꯤꯪꯊꯣꯛꯅꯕꯊꯝ ꯃꯤ ꯫

Gloss samusangda department asigee gate ama lamphelpatki easing chingthoknaba thammee.

Table 2. Transliterated texts of the English-Manipuri parallel corpora and the corresponding gloss


4 Building SMT for English-Manipuri

The important resources for building SMT are the training and language modeling data. We use a small amount of parallel corpora for training and a sizable amount of monolingual Manipuri and English news corpora. So, we have two aspects of developing English-Manipuri language pair SMT systems, using the two different scripts for Manipuri. The moot question is which script will perform better. At the moment, we are developing only the baseline systems; downstream tools, whose script-specific performance would have affected the results beyond that of the transliteration system used in the task, are therefore not taken into account. In the SMT development process, apart from transliteration accuracy errors, the change of script used to represent Manipuri text has changed how NLP-related activities are carried out compared with the Bengali script, including towards improving factored models in the future. Lexical ambiguity is very common in this language, mostly due to its tonal characteristics. This has increased the need for a word sense disambiguation module, because of differences in the representation using Meitei Mayek. As part of this ongoing experiment, we augment the training data with 4600 manually prepared variants of verb and noun phrases to improve the overall accuracy and alleviate part of the data sparsity problem of the SMT system, along with an additional lexicon of 10000 entries between English and Manipuri to handle data sparsity and sense disambiguation during the training process. The English-Manipuri parallel corpus developed by (Singh and Bandyopadhyay, 2010a) is used in the experiment. The Moses4 toolkit (Koehn, 2007) is used for training with GIZA++5 and for decoding. Minimum error rate training (Och, 2003) for tuning is carried out using the development data for the two scripts.
Table 3 gives the corpus statistics of the English-Manipuri SMT system development.

4.1 Lexical Ambiguity

Manipuri is, by and large, a tonal language. Lexical ambiguity is very prominent even in the Bengali-script-based text representation. The degree of ambiguity worsens due to the convergence shown in Figure 1 and the many-to-one mapping shown in Table 1. As a result, the Bengali script to Meetei Mayek transliteration loses the meanings of several words in the transliterated output. Many aspects of translation can be best explained at a morphological, syntactic or semantic level. This implies that the phrase table and target language model are very much affected by using Meetei Mayek-based text, and hence so is the output of the SMT system. Thus, lexical ambiguity is one major reason why the transliterated Meetei Mayek script-based PBSMT suffers comparatively. Three examples of lexical ambiguity are given below:

4 http://www.statmt.org/moses/
5 http://www.fjoch.com/GIZA++.html

(a) �1 (mi) "spider" → ꯃꯤ (mi), meaning either spider or man
    1� (mee) "man" → ꯃꯤ (mi), meaning either spider or man
(b) �9* (sooba) "to work" → ꯁꯕ (suba), meaning either to work or to hit
    �9* (suba) "to hit" → ꯁꯕ (suba), meaning either to work or to hit
(c) ���9* (sinba) / ��9* (shinba) "substitution" → ꯁꯤꯟꯕ (sinba)
    ��9* (sheenba) "arrangement" → ꯁꯤꯟꯕ (sinba)
    ��9* (sheenba) "sour taste" → ꯁꯤꯟꯕ (sinba)

Figure 1. An example of convergence of TU (�� -su, ��-soo etc.) from Bengali Script to Meitei Mayek


The lexical ambiguity that arises is twofold: i) ambiguity introduced after transliteration into Meetei Mayek, as in examples (a) and (b); and ii) ambiguity that already exists before transliteration, as in example (c), for which the ambiguity is doubled after transliteration. Thus, the scripts function as a representation language for lexical ambiguity, like the semantic phrase sense disambiguation model for SMT (Carpuat and Wu, 2007).
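The many-to-one collapse in examples (a) and (b) can be sketched as a toy mapping (the romanized keys and the mapping entries are illustrative, not the paper's actual transliteration rules, since the Bengali glyphs did not survive extraction):

```python
# Two distinct source words map to one Meetei Mayek surface form,
# so the spider/man and work/hit distinctions are lost.
MAP = {
    "mi": "\uabc3\uabe4",     # "spider"  -> ꯃꯤ
    "mee": "\uabc3\uabe4",    # "man"     -> ꯃꯤ (collision)
    "sooba": "\uabc1\uabd5",  # "to work" -> ꯁꯕ
    "suba": "\uabc1\uabd5",   # "to hit"  -> ꯁꯕ (collision)
}

def transliterate(word):
    """Look up the (illustrative) Meetei Mayek form of a word."""
    return MAP[word]

# Four source forms collapse to just two target forms:
print(len(MAP), "source forms ->", len(set(MAP.values())), "target forms")
```

Any downstream model that sees only the target forms can no longer distinguish the collapsed senses, which is exactly the extra ambiguity the PBSMT system inherits.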

4.2 Language Modeling

The impact of the different language models is clearly seen in our experiment, mostly by way of lexical variation and the convergence characteristics shown in Figure 1. Four different language models are developed: a) a language model (LM1) on Bengali-script-based Manipuri text; b) a language model (LM2) on transliterated Manipuri Meetei Mayek text, where changes in language model parameters such as perplexity affect the overall translation decoding process; c) a language model (LM3) based on LM1, with the Manipuri text transliterated from the Bengali script to Meitei Mayek; and d) a language model (LM4) based on LM2, with the Manipuri text transliterated from Meetei Mayek to the Bengali script. SRILM (Stolcke, 2002) is used to build a trigram model with modified Kneser-Ney smoothing (Chen and Goodman, 1998), interpolated with lower-order estimates, which works best for the Manipuri language, using 2.3 million words of 180,000 Manipuri news sentences. There are variations in the language model parameters while switching the scripts.

The log probability and perplexity of the sentence (the first sentence from Table 2) using Bengali script [text garbled in extraction; see the gloss in Table 10] are:

logprob = -22.873  ppl = 193.774  ppl1 = 347.888

while the parameters for the same sentence using Meetei Mayek, i.e., "ꯢꯂꯦꯛꯁꯟ ꯗꯤꯄꯔꯇꯃꯦꯅꯇꯀꯤ ꯃꯌꯀꯧꯗꯒꯤ ꯑꯦꯢꯑꯦꯐꯁꯤꯗꯤꯗ ꯗꯔꯀꯔ ꯂꯧꯕ ꯊꯕꯛ ꯄꯌꯈꯠꯅꯕ ꯈꯡꯍꯟꯈꯔꯦ ꯫", are:

logprob = -26.7555  ppl = 473.752  ppl1 = 939.364
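These figures follow SRILM's reporting conventions: ppl normalizes the total log10 probability over the word count plus the end-of-sentence event, while ppl1 normalizes over the words alone. A minimal sketch; the sentence length of 9 words is inferred from the reported numbers rather than stated in the paper:

```python
def srilm_perplexities(logprob10, n_words):
    """Recover SRILM's ppl and ppl1 from a sentence's total log10 probability.

    ppl  normalizes over n_words + 1 events (the end-of-sentence token counts);
    ppl1 normalizes over the n_words actual words only.
    """
    ppl = 10 ** (-logprob10 / (n_words + 1))
    ppl1 = 10 ** (-logprob10 / n_words)
    return ppl, ppl1

# Bengali-script sentence: logprob = -22.873 over (inferred) 9 words.
ppl, ppl1 = srilm_perplexities(-22.873, 9)
# ppl comes out near 193.8 and ppl1 near 347.9, matching the reported values.
```

The same call with -26.7555 reproduces the Meetei Mayek figures, which supports the 9-word assumption.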

It is also observed that some n-gram entries present in one language model are not available in the other. For example,

-2.708879 1���* #HN9�� -0.3211589

is a bigram found in the Bengali-script language model but not in the Meetei Mayek one. Similarly,

-6.077539 ꯃꯗꯃꯀꯁꯇꯩꯍꯟꯖꯅꯤꯪꯗꯦ -0.06379553

is a bigram found in the Meetei Mayek language model but not in the Bengali-script one. Moreover, for the same n-gram present in both language models, the log(P(W)) and log(backoff-weight) values differ. Two bigram examples are given below:

-1.972813 1���* #�*D&+* -0.09325081
-6.077539 ꯃꯗꯗ ꯊꯣꯛꯂꯀꯄ -0.06379553

and

-1.759148 1���* #�*�&+* -0.3929711
-6.077539 ꯃꯗꯗ ꯊꯣꯔꯀꯄ -0.06379552
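The entries above are in the standard ARPA back-off format: a log10 probability, the n-gram itself, and an optional log10 back-off weight. A naive parsing sketch; it assumes no word in the n-gram itself looks like a number:

```python
def parse_arpa_ngram_line(line):
    """Parse one ARPA-format entry: 'log10(P) w1 ... wn [log10(backoff)]'.

    Simplification: the trailing field is treated as a back-off weight
    whenever it parses as a float, which misreads numeric-looking words.
    """
    fields = line.split()
    logprob = float(fields[0])
    try:
        backoff = float(fields[-1])
        words = fields[1:-1]
    except ValueError:
        backoff = None          # highest-order n-grams carry no back-off weight
        words = fields[1:]
    return logprob, tuple(words), backoff

logprob, ngram, backoff = parse_arpa_ngram_line("-1.972813 w1 w2 -0.09325081")
```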

4.3 Evaluation

The systems are developed using the following corpus statistics.

              # of Sentences   # of Words
Training          10000          231254
Development        5000          121201
Testing             500           12204

Table 3. Corpus Statistics

The SMT systems are evaluated using both automatic scoring and subjective evaluation.

4.4 Automatic Scoring

We compare automatic evaluation metric scores for the SMT systems. The various models developed are evaluated using the BLEU (Papineni et al., 2002) and NIST (Doddington, 2002) automatic scoring techniques. A higher NIST score indicates a better translation as measured by n-gram precision.


System                                        BLEU Score   NIST Score
Meetei Mayek based Baseline using LM2            11.05        3.57
Meetei Mayek based Baseline using LM3            11.81        3.33
Bengali Script based Baseline using LM1          15.02        4.01
Bengali Script based Baseline using LM4          14.51        3.82

Table 4. Automatic Scores of English to Manipuri SMT system

The BLEU metric measures n-gram precision with respect to the reference translation, combined with a brevity penalty.
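The brevity penalty and its combination with the geometric mean of the n-gram precisions (Papineni et al., 2002) can be sketched as:

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BLEU brevity penalty: 1 when the candidate is at least as long as the
    reference, exp(1 - r/c) when it is shorter."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1.0 - reference_len / candidate_len)

def bleu(precisions, candidate_len, reference_len):
    """Combine modified n-gram precisions p1..pN (geometric mean) with the
    brevity penalty; assumes all precisions are nonzero."""
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty(candidate_len, reference_len) * math.exp(log_mean)
```

A perfect candidate of matching length (all precisions 1.0) scores 1.0; shortening the candidate only ever lowers the score.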

System                            BLEU Score   NIST Score
Bengali Script based Baseline        12.12        4.27
Meetei Mayek based Baseline          13.74        4.31

Table 5. Automatic Scores of Manipuri to English SMT system

4.5 Subjective Evaluation

The subjective evaluation is carried out by two bilingual judges. The inter-annotator agreement is 0.3 on a scale of 1. The adequacy and fluency scales used in the subjective evaluation are given in Table 6 and Table 7.

Level   Interpretation
4       Full meaning is conveyed
3       Most of the meaning is conveyed
2       Poor meaning is conveyed
1       No meaning is conveyed

Table 6. Adequacy Scale

Level   Interpretation
4       Flawless with no grammatical error
3       Good output with minor errors
2       Disfluent and ungrammatical but with correct phrases
1       Incomprehensible

Table 7. Fluency Scale

Adequacy and fluency scores on 100 test sentences, broken down by sentence length, are given in Table 8 and Table 9, using the scales from Table 6 and Table 7.

System                           Sentence length   Fluency   Adequacy
Baseline using Bengali Script    <=15 words        3.13      3.16
                                 >15 words         2.21      2.47
Baseline using Meetei Mayek      <=15 words        3.58      3.47
                                 >15 words         2.47      2.63

Table 8. Scores of Adequacy and Fluency of English to Manipuri SMT system

System                           Sentence length   Fluency   Adequacy
Baseline using Bengali Script    <=15 words        2.39      2.42
                                 >15 words         2.01      2.14
Baseline using Meetei Mayek      <=15 words        2.61      2.65
                                 >15 words         2.10      1.94

Table 9. Scores of Adequacy and Fluency of Manipuri to English SMT system

5 Sample Translation Outputs

The following tables show translation outputs of the English-Manipuri and Manipuri-English PBSMT systems using the Bengali and Meetei Mayek scripts.

English: On the part of the election department, IFCD have been intimidated for taking up necessary measures.
Manipuri reference translation (Bengali script): [text garbled in extraction]
Gloss: election departmentki maykeidagee IFCDda darkar leiba thabak paykhatnaba khanghankhre .
Baseline translation output (Bengali script): [text garbled in extraction]

Table 10. English to Manipuri SMT system output using Bengali Script


English: On the part of the election department, IFCD have been intimidated for taking up necessary measures.
Manipuri reference translation (Meetei Mayek): ꯢꯂꯦꯛꯁꯟ ꯗꯤꯄꯔꯇꯃꯦꯅꯇꯀꯤ ꯃꯌꯀꯧꯗꯒꯤ ꯑꯦꯢꯑꯦꯐꯁꯤꯗꯤꯗ ꯗꯔꯀꯔ ꯂꯧꯕ ꯊꯕꯛ ꯄꯌꯈꯠꯅꯕ ꯈꯡꯍꯟꯈꯔꯦ ꯫
Gloss: election departmentki maykeidagee IFCDda darkar leiba thabak paykhatnaba khanghankhre .
Baseline translation output: ꯢꯂꯦꯛꯁꯟ ꯗꯤꯄꯔꯇꯃꯦꯅꯇꯟ ꯑꯦꯢꯑꯦꯐꯁꯤꯗꯤꯗ ꯗꯔꯀꯔ ꯂꯧꯕ ꯊꯕꯛ ꯄꯌꯈꯠꯅꯕ ꯍꯢꯈꯔꯦ ꯫

Table 11. English to Manipuri SMT system output using Meetei Mayek

Input Manipuri sentence (Bengali script): [text garbled in extraction]
Gloss: election departmentki maykeidagee IFCDda darkar leiba thabak paykhatnaba khanghankhre .
Reference English translation: On the part of the election department, IFCD have been intimidated for taking up necessary measures.
Baseline translation output: the election department notified IFCD to take up necessary steps

Table 12. Manipuri to English translation output using Bengali script

Input Manipuri sentence (Meetei Mayek): ꯢꯂꯦꯛꯁꯟ ꯗꯤꯄꯔꯇꯃꯦꯅꯇꯀꯤ ꯃꯌꯀꯧꯗꯒꯤ ꯑꯦꯢꯑꯦꯐꯁꯤꯗꯤꯗ ꯗꯔꯀꯔ ꯂꯧꯕ ꯊꯕꯛ ꯄꯌꯈꯠꯅꯕ ꯈꯡꯍꯟꯈꯔꯦ ꯫
Gloss: election departmentki maykeidagee IFCDda darkar leiba thabak paykhatnaba khanghankhre .
Reference English translation: On the part of the election department, IFCD have been intimidated for taking up necessary measures.
Baseline translation output: the election department intimidated IFCD to take up necessary steps

Table 13. Manipuri to English translation output using Meetei Mayek

The English to Manipuri SMT system output using Bengali Script obtains lower fluency and adequacy scores (Table 8) than the English to Manipuri SMT system output using the Meetei Mayek script. In the case of the Manipuri to English SMT system, the Meetei Mayek based SMT system outperforms the Bengali script based SMT in terms of both fluency and adequacy (Table 9), as well as in automatic scores (Table 5).

6 Conclusion and Discussion

The detailed study of grapheme-to-phoneme correspondence indicates missing tone for several words under the present Meetei Mayek script. Training on the Bengali script data is found to have higher vocabulary coverage throughout, since the representation uses finer glyphs than Meetei Mayek; hence the higher automatic scores for the English-to-Manipuri PBSMT system. However, on the subjective evaluation scores, the Meetei Mayek based SMT systems outperform the Bengali script based English-to-Manipuri SMT systems, in contrast to the automatic scores. In the case of the Manipuri-to-English PBSMT systems, both the automatic and subjective evaluation scores of the Meetei Mayek based systems outperform the Bengali script based systems. Statistical significance tests are performed to judge whether a change in score resulting from script switching reflects a change in overall translation quality. The systems show statistically significant differences in BLEU score as measured by the bootstrap resampling method (Koehn, 2004). In future, the experiments can be repeated with special treatment of individual morphemes on a reasonably sized parallel corpus. More notably, future SMT decoders may support handling two or more scripts of the same language, for languages like Manipuri.
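The paired bootstrap resampling test of Koehn (2004) can be sketched as follows. This simplified version resamples per-sentence quality scores, whereas the original resamples the sufficient statistics of corpus-level BLEU; the function name and representation are ours:

```python
import random

def paired_bootstrap(scores_a, scores_b, samples=1000, seed=0):
    """Paired bootstrap resampling (Koehn, 2004), simplified to operate on
    per-sentence scores.  Both systems translated the same test set, so the
    same resampled sentence indices are applied to both score lists.
    Returns the fraction of resamples in which system A beats system B."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]   # sample n sentences with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples
```

If A wins in at least 95% of resamples, the difference is conventionally considered significant at p < 0.05.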

Acknowledgments

I sincerely thank Dr. Zia Saquib, Executive Director, CDAC (Mumbai), Prof. Sivaji Bandyopadhyay, Jadavpur University, Kolkata, and the anonymous reviewers for their support and valuable comments.

References

Andreas Stolcke. 2002. SRILM – an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing.

Franz J. Och. 2003. Minimum error rate training in Statistical Machine Translation, Proceedings of ACL.


George Doddington. 2002. Automatic evaluation of Machine Translation quality using n-gram co-occurrence statistics. In Proceedings of HLT 2002, San Diego, CA.

Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of 40th ACL, Philadelphia, PA.

Kristina Toutanova, Hisami Suzuki and Achim Ruopp. 2008. Applying Morphology Generation Models to Machine Translation, In Proc. 46th Annual Meeting of the Association for Computational Linguistics.

Marine Carpuat and Dekai Wu. 2007. How Phrase Sense Disambiguation outperforms Word Sense Disambiguation for Statistical Machine Translation. 11th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI 2007), pages 43-52, Skövde, Sweden, September 2007.

Ondřej Bojar and Jan Hajič. 2008. Phrase-Based and Deep Syntactic English-to-Czech Statistical Machine Translation. Proceedings of the Third Workshop on Statistical Machine Translation, pages 143-146, Columbus, Ohio, USA.

Ondřej Bojar, Bushra Jawaid and Amir Kamran. 2012. Probes in a Taxonomy of Factored Phrase-Based Models. Proceedings of the 7th Workshop on Statistical Machine Translation of the Association for Computational Linguistics, pages 253-260, Montréal, Canada.

Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In EMNLP-2004: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, 25-26 July 2004, pages 388-395, Barcelona, Spain.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic.

Reyyan Yeniterzi and Kemal Oflazer. 2010. Syntax-to-Morphology Mapping in Factored Phrase-Based Statistical Machine Translation from English to Turkish. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 454-464, Uppsala, Sweden.

Stanley F. Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University Center for Research in Computing Technology.

Thoudam Doren Singh and Sivaji Bandyopadhyay. 2010a. Semi Automatic Parallel Corpora Extraction from Comparable News Corpora. In the International Journal of POLIBITS, Issue 41 (January-June 2010), ISSN 1870-9044, pages 11-17.

Thoudam Doren Singh and Sivaji Bandyopadhyay. 2010b. Manipuri-English Bidirectional Statistical Machine Translation Systems using Morphology and Dependency Relations. Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 83-91, COLING 2010, Beijing, August 2010.

Thoudam Doren Singh. 2012. Bidirectional Bengali Script and Meetei Mayek Transliteration of Web Based Manipuri News Corpus, In the Proceedings of the 3rd Workshop on South and Southeast Asian Natural Language Processing (SANLP) of COLING 2012, IIT Bombay, Mumbai, India, pages 181-189, 8th December, 2012.


Proceedings of the 7th Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 19-28, Atlanta, Georgia, 13 June 2013. ©2013 Association for Computational Linguistics

Hierarchical Alignment Decomposition Labels for Hiero Grammar Rules

Gideon Maillette de Buy Wenniger
Institute for Logic, Language and Computation
University of Amsterdam
Science Park 904, 1098 XH Amsterdam
The Netherlands
gemdbw AT gmail.com

Khalil Sima'an
Institute for Logic, Language and Computation
University of Amsterdam
Science Park 904, 1098 XH Amsterdam
The Netherlands
k.simaan AT uva.nl

Abstract

Selecting a set of nonterminals for the synchronous CFGs underlying hierarchical phrase-based models is usually done on the basis of a monolingual resource (like a syntactic parser). However, a standard bilingual resource like word alignments is itself rich with reordering patterns that, if clustered somehow, might provide labels of a different (possibly complementary) nature to monolingual labels. In this paper we explore a first version of this idea based on a hierarchical decomposition of word alignments into recursive tree representations. We identify five clusters of alignment patterns in which the children of a node in a decomposition tree are found and employ these five as nonterminal labels for the Hiero productions. Although this is our first non-optimized instantiation of the idea, our experiments show competitive performance with the Hiero baseline, exemplifying certain merits of this novel approach.

1 Introduction

The Hiero model (Chiang, 2007; Chiang, 2005) formulates phrase-based translation in terms of a synchronous context-free grammar (SCFG) limited to the inversion transduction grammar (ITG) (Wu, 1997) family. While the original Hiero approach works with a single nonterminal label (X) (besides the start nonterminal S), more recent work is dedicated to devising methods for extracting more elaborate labels for the phrase pairs and their abstractions into SCFG productions, e.g., (Zollmann and Venugopal, 2006; Li et al., 2012; Almaghout et al., 2011). All labeling approaches exploit monolingual parsers of some kind, e.g., syntactic, semantic or sense-oriented. The rationale behind monolingual labeling is often to make the probability distributions over alternative synchronous derivations of the Hiero model more sensitive to linguistically justified monolingual phrase context. For example, syntactic target-language labels in many approaches are aimed at improved target language modeling (fluency, cf. Hassan et al. (2007); Zollmann and Venugopal (2006)), whereas source-language labels provide suitable context for reordering (see Mylonakis and Sima'an (2011)). It is usually believed that the monolingual labels tend to stand for clusters of phrase pairs that are expected to be inter-substitutable, either syntactically or semantically (see Marton et al. (2012) for an illuminating discussion).

While we believe that monolingual labeling strategies are sound, in this paper we explore the complementary idea that the nonterminal labels could also signify bilingual properties of the phrase pair, particularly its characteristic word alignment patterns. Intuitively, an SCFG with nonterminal labels standing for alignment patterns should put more preference on synchronous derivations that mimic the word alignment patterns found in the training corpus, and thus possibly allow for better reordering. It is important to stress that these word alignment patterns are complementary to the monolingual linguistic patterns, and it is conceivable that the two can be combined effectively, but this remains beyond the scope of this article.

The question addressed in this paper is how to select word alignment patterns and cluster them into bilingual nonterminal labels. We explore a first instantiation of this idea, starting out from the following simplifying assumptions:


• The labels come from the word alignments only,
• The labels are coarse-grained, pre-defined clusters and not optimized for performance,
• The labels extend the binary set of ITG operators (monotone and inverted) into five such labels in order to cover non-binarizable alignment patterns.

Our labels are based on our own tree decompositions of word alignments (Sima'an and Maillette de Buy Wenniger, 2011), akin to Normalized Decomposition Trees (NDTs) (Zhang et al., 2008). In this first attempt we explore a set of five nonterminal labels that characterize alignment patterns found directly under nodes in the NDT projected for every word alignment in the parallel corpus during training. There is a range of work that exploits the monotone and inverted orientations of binary ITG within hierarchical phrase-based models, either as feature functions of lexicalized Hiero productions (Chiang, 2007; Zollmann and Venugopal, 2006), or as labels on non-lexicalized ITG productions, e.g., (Mylonakis and Sima'an, 2011). As far as we are aware, this is the first attempt at exploring a larger set of such word alignment derived labels in hierarchical SMT. Therefore, we expect that there are many variants that could improve substantially on our strong set of assumptions.

2 Hierarchical SMT models

Hierarchical SMT usually works with weighted instantiations of Synchronous Context-Free Grammars (SCFGs) (Aho and Ullman, 1969). SCFGs are defined over a finite set of nonterminals (start symbol included), a finite set of terminals and a finite set of synchronous productions. A synchronous production in an SCFG consists of two context-free productions (source and target) containing the same number of nonterminals on the right-hand side, with a bijective (1-to-1 and onto) function between the source and target nonterminals. Like the standard Hiero model (Chiang, 2007), we constrain our work to SCFGs which involve at most two nonterminals in every lexicalized production.

Given an SCFG G, a source sentence s is translated into a target sentence t by synchronous derivations d, each a finite sequence of well-formed substitutions of synchronous productions from G, see (Chiang, 2006). Standardly, for complexity reasons, most models make the assumption that the probability P(t | s) can be optimized through a single best derivation as follows:

    argmax_t P(t | s) = argmax_t Σ_{d∈G} P(t, d | s)    (1)
                      ≈ argmax_{d∈G} P(t, d | s)        (2)

This approximation can be notoriously problematic for labelled Hiero models because the labels tend to lead to many more derivations than in the original model, thereby aggravating the effects of this assumption. This problem is relevant for our work, and approaches to deal with it are Minimum Bayes-Risk decoding (Kumar and Byrne, 2004; Tromble et al., 2008), Variational Decoding (Li et al., 2009) and soft labeling (Venugopal et al., 2009; Marton et al., 2012; Chiang, 2010).

Given a derivation d, most existing phrase-based models approximate the derivation probability through a linear interpolation of a finite set of feature functions Φ(d) of the derivation d, mostly working with local feature functions φ_i of individual productions, the target-side yield string t of d (target language model features) and other heuristic features discussed in the experimental section:

    argmax_{d∈G} P(t, d | s) ≈ argmax_{d∈G} Σ_{i=1}^{|Φ(d)|} λ_i × φ_i    (3)

where λ_i is the weight of feature φ_i, optimized over a held-out parallel corpus by some direct error-minimization procedure like MERT (Och, 2003).
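A toy illustration of the Viterbi approximation of Equation 2 combined with the linear model of Equation 3; the dictionary representation of derivations and all names here are purely illustrative, not the internals of any decoder:

```python
def derivation_score(feature_values, weights):
    """Linear model score of one derivation: sum_i lambda_i * phi_i (Eq. 3)."""
    return sum(l * f for l, f in zip(weights, feature_values))

def best_derivation(derivations, weights):
    """Viterbi approximation (Eq. 2): return the single highest-scoring
    derivation instead of summing over all derivations per target string."""
    return max(derivations, key=lambda d: derivation_score(d["features"], weights))

# Hypothetical derivations; feature vectors might hold phrase probabilities,
# LM score, word penalty, etc.
derivations = [
    {"target": "das haus", "features": [2.0, 0.0]},
    {"target": "ein haus", "features": [0.0, 6.0]},
]
weights = [1.0, 0.5]
```

Note how the chosen translation is the yield of the single best derivation, even if the other target string accumulates more total probability mass over many derivations; this is exactly the effect labeling aggravates.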

3 Baseline: Hiero Grammars (single label)

Hiero Grammars (Chiang, 2005; Chiang, 2007) are a particular form of SCFGs that generalize phrase-based translation models to hierarchical phrase-based translation models. They allow only up to two (pairs of) nonterminals on the right-hand side of rules. Hierarchical rules are formed from fully lexicalized base rules (i.e. phrase pairs) by replacing a sub-span of the phrase pair that itself corresponds to a valid phrase pair with a variable X called a "gap". Two


gaps may be maximally introduced in this way,1 labeled as X1 and X2 respectively for distinction. The types of permissible Hiero rules are:

X → 〈α, γ〉 (4a)

X → 〈α X1 β, δ X1 ζ〉 (4b)

X → 〈α X1 β X2 γ , δ X1 ζ X2 η 〉 (4c)

X → 〈α X1 β X2 γ , δ X2 ζ X1 η 〉 (4d)

Here α, β, γ, δ, ζ, η are terminal sequences, possibly empty. Equation 4a corresponds to a normal phrase pair, 4b to a rule with one gap, and 4c and 4d to the monotone and inverting rules respectively.

An important extra constraint used in the original Hiero model is that rules must have at least one pair of aligned words, so that translation decisions are always based on some lexical evidence. Furthermore, the sum of terminals and nonterminals on the source side may not be greater than five, and nonterminals are not allowed to be adjacent on the source side.
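These extraction constraints can be summarized as a validity predicate. The symbol-list representation and the "X1"/"X2" naming are illustrative assumptions, not Hiero's actual data structures:

```python
def is_valid_hiero_rule(source_symbols, n_aligned_word_pairs):
    """Check the source-side extraction constraints of the original Hiero
    model (Chiang, 2007) on a candidate rule.  Nonterminals are the strings
    'X1'/'X2'; every other symbol is a terminal."""
    nonterminals = [s for s in source_symbols if s in ("X1", "X2")]
    if len(nonterminals) > 2:
        return False                  # at most two gaps
    if len(source_symbols) > 5:
        return False                  # <= 5 terminals + nonterminals on source side
    for left, right in zip(source_symbols, source_symbols[1:]):
        if left in ("X1", "X2") and right in ("X1", "X2"):
            return False              # no adjacent nonterminals on the source side
    return n_aligned_word_pairs >= 1  # at least one aligned word pair required
```

For instance, a rule with source side X1 ne X2 pas and two aligned word pairs passes, while X1 X2 pas fails on the adjacency constraint.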

4 Alignment Labeled Grammars

Labeling the Hiero grammar productions makes rules with gaps more restricted about what broad categories of rules are allowed to substitute for the gaps. In the best case this prevents overgeneralization and makes the translation distributions more accurate. In the worst case, however, it can also lead to overly restrictive rules, as well as sparse translation distributions. Despite these inherent risks, a number of approaches based on syntactically inspired labels have succeeded in improving the state of the art using monolingual labels, e.g., (Zollmann and Venugopal, 2006; Zollmann, 2011; Almaghout et al., 2011; Chiang, 2010; Li et al., 2012).

Unlabeled Hiero derivations can be seen as recursive compositions of phrase pairs. A single translation may be generated by different derivations (see equation 1), each standing for a choice of composition rules over a choice of a segmentation of the source-target sentence pair into a bag of phrase pairs. However, a synchronous derivation also induces an alignment between the different segments

1 The motivation for this restriction to two gaps is mainly a practical computational one, as it can be shown that translation complexity grows exponentially with the number of gaps.

that it composes together. Our goal here is to label the Hiero rules in order to exploit aspects of the alignment that a synchronous derivation induces.

We exploit the idea that phrase pairs can be efficiently grouped into maximally decomposed trees (normalized decomposition trees, NDTs) (Zhang et al., 2008). In an NDT every phrase pair is recursively decomposed at every level into the minimum number of its phrase constituents, so that the resulting structure is maximal in that it contains the largest number of nodes. In Figure 1 (left) we show an example alignment, and in Figure 1 (right) its associated NDT. The NDT shows pairs of source and target spans of (sub-)phrase pairs, governed at different levels of the tree by their parent node. In our example the root node splits into three phrase pairs, but these three phrase pairs together do not manage to cover the entire phrase pair of the parent because of the discontinuous translation structure 〈owe, sind ... schuldig〉. Consequently, a partially lexicalized structure is required, with three children corresponding to phrase pairs and lexical items covering the words left uncovered by these phrase pairs.

During grammar extraction we determine an Alignment Label for every left-hand side and gap of every rule we extract. This is done by looking at the NDT that decomposes their corresponding phrase pairs, and determining the complexity of the relation with their direct children in this tree. Complexity cases are ordered by preference, where the simpler label corresponding to the choice of maximal decomposition is preferred. We distinguish the following five cases, ordered by increasing complexity:

1. Monotonic: If the alignment can be split into two monotonically ordered parts.
2. Inverted: If the alignment can be split into two inverted parts.
3. Permutation: If the alignment can be factored as a permutation of more than 3 parts.2
4. Complex: If the alignment cannot be factored as a permutation of parts, but the phrase does contain at least one smaller phrase pair (i.e., it is composite).
5. Atomic: If the alignment does not allow the existence of smaller (child) phrase pairs.

2 Permutations of just 3 parts never occur in an NDT, as they can always be further decomposed as a sequence of two binary nodes.
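The five-way preference cascade can be sketched as follows, assuming each NDT node is summarized by the target-side permutation of its children, or None when no factorization into a permutation of parts exists; this input representation is our assumption:

```python
def alignment_label(child_permutation, has_sub_phrase_pair=True):
    """Assign one of the five alignment labels to an NDT node.

    child_permutation: target-side order of the node's source-side children,
    e.g. [0, 1] for a monotone binary split or [1, 3, 0, 2]; None when the
    node cannot be factored into a permutation of parts.
    has_sub_phrase_pair: whether any smaller phrase pair exists inside the node.
    """
    if child_permutation == [0, 1]:
        return "Monotonic"
    if child_permutation == [1, 0]:
        return "Inverted"
    if child_permutation is not None and len(child_permutation) > 3:
        return "Permutation"   # 3-part permutations never occur in an NDT
    if has_sub_phrase_pair:
        return "Complex"       # not a permutation of parts, but composite
    return "Atomic"            # no smaller (child) phrase pairs exist
```

For instance, the 4-part permutation [1, 3, 0, 2] (as in the "i want to stress two points" example of Figure 2) receives the Permutation label.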


[Figure 1 content: word alignment between "we owe this to our citizens" (English) and "das sind wir unsern burgern schuldig" (German), and its NDT with root span pair ([1,6],[1,6]) dominating the nodes ([1,1],[3,3]), ([3,3],[1,1]) and ([5,6],[4,5]), the latter dominating ([5,5],[4,4]) and ([6,6],[5,5]).]

Figure 1: Example of a complex word alignment, taken from Europarl English-German data (left), and its associated Normalized Decomposition Tree (Zhang et al., 2008) (right).

We show examples of each of these cases in Figure 2. Furthermore, in Figure 3 we show an example of an alignment labeled Hiero rule based on one of these alignment examples.

Our kind of labels has a completely different flavor from monolingual labels in that they cannot be seen as identifying linguistically meaningful clusters of phrase pairs. These labels are mere latent bilingual clusters, and the translation model must marginalize over them (equation 1) or use Minimum Bayes-Risk decoding.

4.1 Features: Relations over labels

In this section we describe the features we use in our experiments. To be unambiguous we first need to introduce some terminology. Let r be a translation rule. We use p to denote probabilities estimated using simple relative frequency estimation from the word-aligned sentence pairs of the training corpus. Then src(r) is the source side of the rule, including the source side of the left-hand-side label. Similarly, tgt(r) is the target side of the rule, including the target side of the left-hand-side label. Furthermore, un(src(r)) is the source side without any nonterminal labels, and analogously for un(tgt(r)).

4.1.1 Basic Features

We use the following phrase probability features:
• p(tgt(r)|src(r)): phrase probability of the target side given the source side
• p(src(r)|tgt(r)): phrase probability of the source side given the target side

We reinforce those with the following phrase probability smoothing features:
• p(tgt(r)|un(src(r)))
• p(un(src(r))|tgt(r))
• p(un(tgt(r))|src(r))
• p(src(r)|un(tgt(r)))
• p(un(tgt(r))|un(src(r)))
• p(un(src(r))|un(tgt(r)))

We also add the following features:

• pw(tgt(r)|src(r)), pw(src(r)|tgt(r)): lexical weights based on terminal symbols, as in phrase-based and hierarchical phrase-based MT
• p(r|lhs(r)): generative probability of a rule given its left-hand-side label

We use the following set of basic binary features, with value 1 by default and value exp(1) if the corresponding condition holds:
• ϕGlue(r): exp(1) if the rule is a glue rule
• ϕlex(r): exp(1) if the rule has only terminals on the right-hand side
• ϕabs(r): exp(1) if the rule has only nonterminals on the right-hand side
• ϕst_w_tt(r): exp(1) if the rule has terminals on the source side but not on the target side
• ϕtt_w_st(r): exp(1) if the rule has terminals on the target side but not on the source side
• ϕmono(r): exp(1) if the rule has no inverted pair of nonterminals

Furthermore we use:
• ϕra(r): phrase penalty, exp(1) for all rules
• exp(ϕwp(r)): word penalty, the exponent of the number of terminals on the target side
• ϕrare(r) = exp(1 / #(Σ_{r′∈C} δ(r, r′))): rarity penalty, with #(Σ_{r′∈C} δ(r, r′)) being the count of rule r in the corpus C.
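The simple relative frequency estimation behind the phrase probability features can be sketched as follows; labels and the smoothing variants are ignored here, and the representation of rule occurrences as (source, target) pairs is our own:

```python
from collections import Counter

def phrase_probabilities(rule_occurrences):
    """Relative-frequency estimates p(tgt|src) and p(src|tgt) from a list of
    extracted (source_side, target_side) rule occurrences."""
    joint = Counter(rule_occurrences)
    src_counts = Counter(src for src, _ in rule_occurrences)
    tgt_counts = Counter(tgt for _, tgt in rule_occurrences)
    # Each conditional divides the joint count by the matching marginal count.
    p_t_given_s = {r: c / src_counts[r[0]] for r, c in joint.items()}
    p_s_given_t = {r: c / tgt_counts[r[1]] for r, c in joint.items()}
    return p_t_given_s, p_s_given_t
```

The smoothing features listed above apply the same estimator after stripping the nonterminal labels from one or both sides, which pools counts across label variants of the same rule.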

4.1.2 Binary Reordering Features

Besides the basic features we want to use extra sets of binary features that are specially designed to directly learn the desirability of certain broad classes of reordering patterns, beyond the way this is already implicitly learned for particular lexicalized rules by the introduction of reordering labels.3

These features can be seen as generalizations of the most simple feature that penalizes/rewards mono-

3 We did some initial experiments with such features in Joshua, but have not yet managed to get them working in Moses with MBR. Since these experiments are inconclusive without MBR, we leave them out here.


[Figure 2 content: one example word alignment per label type. Monotone: "this is an important matter" / "das ist ein wichtige angelegenheit". Inversion: "we all agree on this" / "das sehen wir alle". Permutation: "i want to stress two points" / "auf zwei punkte mochte ich hinweisen". Complex: "we owe this to our citizens" / "das sind wir unsern burgern schuldig". Atomic: "it would be possible" / "kann mann".]

Figure 2: Different types of Alignment Labels

tone order ϕmono(r) from our basic feature set. The new features we want to introduce "fire" for a specific combination of reordering labels on the left-hand side and one or both gaps, plus optionally the information whether the rule itself inverts its gaps and whether or not it is abstract.

5 Experiments

We evaluate our method on one language pair, using German as source and English as target. The data is derived from parliament proceedings sourced from the Europarl corpus (Koehn, 2005), with WMT-07 development and test data. We used a maximum sentence length of 40 for filtering. We employ either 200K or (approximately) 1000K sentence pairs for training, 1K for development and 2K for testing (single reference per source sentence). Both the baseline and our method decode with a 3-gram language model smoothed with modified Kneser-Ney discounting (Chen and Goodman, 1998), trained on the target side of the full original training set (approximately 1000K sentences).

We compare against state-of-the-art hierarchical translation (Chiang, 2005) baselines, based on the Joshua (Ganitkevitch et al., 2012) and Moses (Hoang et al., 2007) translation systems with default decoding settings. We use our own grammar extrac-


X Monotone2

X Complex

Figure 3: Example of a labeled Hiero ruleX Complex→ 〈we owe X Atomic1 to X Monotone2 ,X Atomic1 sind wir X Monotone2 schuldig 〉extracted from the Complex example in Figure 2 by re-placing the phrase pairs 〈this, das〉 and 〈our citizens , un-sern burgern〉 with (labeled) variables.

tor for the generation of all grammars, including the baseline Hiero grammars. This enables us to use the same features (as far as applicable given the grammar formalism) and assures true comparability of the grammars under comparison.

5.1 Training and Decoding Details

In this section we discuss the choices and settings we used in our experiments. Our initial experiments

4 We later discovered we needed to add the flag "--return-best-dev" in Moses to actually get the weights from the best development run; our initial experiments had not used this. This explains the somewhat unfortunate drop in performance in our Analysis Experiments.


Decoding Type   System Name    200K
Lattice MBR     Hiero          26.44
                Hiero-RL       26.72
Viterbi         Hiero          26.23
                Hiero-RL-PPL   26.16

Table 1: Initial results. Lowercase BLEU results for German-English trained on 200K sentence pairs.4 The top rows display results for our experiments using Moses (Hoang et al., 2007) with Lattice Minimum Bayes-Risk decoding5 (Tromble et al., 2008) in combination with Batch Mira (Cherry and Foster, 2012) for tuning. Below are results for experiments with Joshua (Ganitkevitch et al., 2012) using Viterbi decoding (i.e. no MBR) and PRO (Hopkins and May, 2011) for tuning.

were done on Joshua (Ganitkevitch et al., 2012), using the Viterbi best derivation. The second set of experiments was done on Moses (Hoang et al., 2007), using Lattice Minimum Bayes-Risk decoding5 (Tromble et al., 2008) to sum over derivations.

5.1.1 General Settings

To train our system we use the following settings. We use the standard Hiero grammar extraction constraints (Chiang, 2007), but for our reordering labeled grammars we use them with some modifications. In particular, while basic Hiero allows only phrase pairs with source spans up to 10 and forbids abstract rules, we allow extraction of fully abstract rules without length constraints. Furthermore, we allow their application without length constraints during decoding. Following common practice, we use simple relative frequency estimation to estimate the phrase probabilities, lexical probabilities and generative rule probability respectively.6

5 After submission we were told by Moses support that in fact neither normal Minimum Bayes-Risk (MBR) nor Lattice MBR are operational in Moses Chart.

6 Personal correspondence with Andreas Zollmann further reinforced the authors' appreciation of the importance of this feature, introduced in (Zollmann and Venugopal, 2006; Zollmann, 2011). Strangely enough, this feature seems to be unavailable in the standard Moses (Hoang et al., 2007) and Joshua (Ganitkevitch et al., 2012) grammar extractors, which also implement SAMT grammar extraction.

5.1.2 Specific choices and settings: Joshua Viterbi experiments

Based on experiments reported in (Mylonakis and Sima'an, 2011; Mylonakis, 2012), we opted not to label the (fully lexicalized) phrase pairs, but instead label them with a generic PhrasePair label and use a set of switch rules from all other labels to the PhrasePair label to enable transitions between Hiero rules and phrase pairs.

We train our systems using PRO (Hopkins and May, 2011) as implemented in Joshua by Ganitkevitch et al. (2012). We use the standard tuning, where all features are treated as dense features. We allow up to 30 tuning iterations. We further follow the PRO settings introduced in (Ganitkevitch et al., 2012), but use 0.5 for the coefficient Ψ that interpolates the weights learned at the current iteration with those from the previous iteration. We use the final learned weights for decoding with the log-linear model and report lowercase BLEU scores for the tuned test set.

5.1.3 Specific choices and settings: Moses Lattice MBR experiments

As mentioned before, we use Moses (Hoang et al., 2007) for our second experiment, in combination with Lattice Minimum Bayes-Risk Decoding [5] (Tromble et al., 2008). Furthermore, we use Batch Mira (Cherry and Foster, 2012) for tuning, with a maximum of 10 tuning iterations for the 200K training set and 30 for the 1000K training set. [7]

For our Moses experiments we mainly worked with a uniform labeling policy, labeling phrase pairs with alignment labels in the same way as normal rules. This is motivated by the fact that, since we are using Minimum Bayes-Risk decoding, the risks of sparsity from labeling are much reduced. Labeling everything also has the advantage that reordering information can be fully propagated in derivations starting from the lowest (phrase) level. We also ran experiments with the generic phrase pair labeling, since there were reasons to believe this could decrease sparsity and potentially lead to better results. [8]

[7] We are mostly interested in the relative performance of our system in comparison to the baseline for the same settings. Nevertheless, it might be that the labeled systems, which have more smoothing features, suffer relatively more from too few tuning iterations than the baseline, which does not have these extra features and thus may be easier to tune. This was one of the reasons to increase the number of tuning iterations from 10 to 30 in our later experiments on 1000K. Whether or not Minimum Bayes-Risk decoding is used is crucial, as we explained before in Section 1. The main reason we opted for Batch Mira over PRO is that it is more commonly used in Moses systems, and in any case it is at least superior to MERT (Och, 2003) in most cases.

5.2 Initial Results

We report lowercase BLEU scores for experiments with and without Lattice Minimum Bayes-Risk (MBR) decoding (Tromble et al., 2008). Table 1 bottom shows the results of our first experiments with Joshua, using the Viterbi derivation and no MBR decoding to sum over derivations. We display scores for the Hiero baseline (Hiero) and the (partially) alignment-labeled system (Hiero-AL-PPL), which uses alignment labels for Hiero rules and PhrasePair to label all phrase pairs. Scores are around 26.2 BLEU for both systems, with only marginal differences. In summary, our labeled systems are at best comparable to the Hiero baseline.

Table 1 top shows the results of our second experiment, with Moses and Lattice MBR [5]. Here our (fully) alignment-labeled system (Hiero-AL) achieves a score of 26.72 BLEU, compared to 26.44 BLEU for the Hiero baseline (Hiero): a small improvement of 0.28 BLEU points.

5.3 Advanced experiments

We now report lowercase BLEU scores for more detailed analysis experiments with and without Lattice Minimum Bayes-Risk (MBR) decoding [5], where we varied other training and decoding parameters in the Moses environment. In particular, in this set of experiments we choose the best tuning parameter settings over 30 Batch Mira iterations (as opposed to the weights returned by default, which were used in the previous experiments). We also explore variations in tuning with a decoder that works with Viterbi/MBR, and final testing with Viterbi/MBR.

In Table 2, the top rows show the results of our experiments using MBR decoding. We display scores

[8] We discovered that the Moses chart decoder does not allow fully abstract unary rules in the current implementation, which makes direct usage of unary (switch) rules impossible. Switch rules and other unaries can still be emulated, though, by adapting the grammar and using multiple copies of rules with different labels. This blows up the grammar a bit, but at least works in practice.

Decoding Type   System Name     200K    1000K

Lattice MBR     Hiero           27.19   28.39
                Hiero-AL        26.61   28.32
                Hiero-AL-PPL    26.89   28.41

Viterbi         Hiero           26.80   28.57
                Hiero-AL        –       28.36

Table 2: Analysis Results. Lowercase BLEU results for German-English trained on 200K and 1000K sentence pairs using Moses (Hoang et al., 2007) in combination with Batch Mira (Cherry and Foster, 2012) for tuning. Top rows display results for our experiments with Lattice Minimum Bayes-Risk Decoding [5] (Tromble et al., 2008). Below are results for experiments using Viterbi decoding (i.e. no MBR) for tuning. Results on 200K were run with 10 tuning iterations, results on 1000K with 30 tuning iterations.

for the Hiero baseline (Hiero) and the fully/partially alignment-labeled systems Hiero-AL and Hiero-AL-PPL. In the preceding set of experiments, MBR decoding clearly showed improved performance over Viterbi, particularly for our labeled system.

On the small training set of 200K we observe that the Hiero baseline achieves 27.19 BLEU and thus beats the labeled systems, Hiero-AL with 26.61 BLEU and Hiero-AL-PPL with 26.89 BLEU, by a good margin. On the bigger dataset of 1000K and with more tuning iterations (30), all systems perform better. When using Lattice MBR, Hiero achieves 28.39 BLEU, Hiero-AL 28.32 BLEU, and Hiero-AL-PPL 28.41 BLEU. These are insignificant differences in performance between the labeled and unlabeled systems.

Table 2 bottom shows the results of our second set of experiments, with Viterbi decoding. Here, the baseline Hiero system achieves a score of 26.80 BLEU on the small 200K training set. We also conducted another set of experiments on the larger training set of 1000K, this time with Viterbi decoding. The Hiero baseline with Viterbi scores 28.57 BLEU, while Hiero-AL scores 28.36 BLEU under the same conditions.

It is puzzling that Hiero with Viterbi decoding (for 1000K) performs better than the same system with MBR decoding. However, after submission we were told by Moses support that neither normal MBR nor Lattice MBR are operational in Moses Chart. This means that the effect of MBR on our labels in fact remains undecided, and more work is needed in this direction. The small decrease in performance for the labeled system relative to Hiero (in Viterbi) is possibly the result of the labeled system being more brittle and harder to tune than the Hiero system. This hypothesis needs further exploration.

While a whole set of experimental questions remains open, we think that, based on this preliminary but nevertheless considerable set of experiments, our labels do not always improve performance compared with the Hiero baseline. It is possible that these labels, under a more advanced implementation via soft constraints (as opposed to hard labeling), could provide the empirical evidence for our theoretical choices. A further concern regarding the labels is that our current choice (5 labels) is heuristic and not optimized for the training data. It remains to be seen whether proper learning of these labels as latent variables optimized for the training data, or the use of soft constraints, can shed more light on the use of alignment labels in hierarchical SMT.

5.4 Analysis

While we did not have time to do a deep comparative analysis of the properties of the grammars, a few things can be said based on the results. First of all, we have seen that alignment labels do not always improve over the Hiero baseline. In earlier experiments we observed some improvement when the labeled grammar was used in combination with Minimum Bayes-Risk decoding, but not without it. In later experiments with different tuning settings (Mira), the improvements evaporated and, in fact, the Viterbi Hiero baseline surprisingly turned out to be the best of all systems.

Our use of MBR is theoretically justified by the importance of aggregating over the derivations of the output translations when labeling Hiero variables: statistically speaking, if the labels split the occurrences of the phrase pairs, they will lead to multiple derivations per Hiero derivation, each carrying a fraction of the score. This is in line with earlier work on the effect of spurious ambiguity, e.g. Variational Decoding (Li et al., 2009). Yet, in the case of our model, there is also a conceptual explanation for the need to aggregate over different derivations of the same sentence pair. The decomposition of a word alignment into hierarchical decomposition trees has an interesting property: the simpler (less reordering) a word alignment, the more (binary) decomposition trees, and in our model derivations, it will have. Hence, aggregating over the derivations is a way to gather evidence for the complexity of the alignment patterns that our model can fit between a given source-target sentence pair. However, in the current experimental setting, where final tuning with Mira is crucial and where the use of MBR within Moses is still not standard, we cannot reap the full benefit of our theoretical analysis concerning the fit of MBR for our model's alignment labels.
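For reference, the Minimum Bayes-Risk decision rule (Kumar and Byrne, 2004) underlying this argument can be written as follows, with the translation posterior obtained by summing over all derivations d that yield the same output string e (this notation is added here for exposition and does not appear in this form in the paper):

```latex
\hat{e} \;=\; \operatorname*{argmin}_{e'} \sum_{e} L(e, e')\, P(e \mid f),
\qquad
P(e \mid f) \;=\; \sum_{d:\ \mathrm{yield}(d) = e} P(e, d \mid f)
```

Viterbi decoding instead keeps only the single best derivation, so label-induced splitting of derivation probability mass directly penalizes it.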

6 Conclusion

We presented a novel method for labeling Hiero variables with nonterminals derived from the hierarchical patterns found in recursive decompositions of word alignments into tree representations. Our experiments, based on a first instantiation of this idea with a fixed set of labels not optimized to the training data, show promising performance. Our early experiments suggested that these labels have merit, whereas later experiments with more varied training and decoder settings showed these results to be unstable.

Empirical results aside, our approach opens up a whole new line of research aimed at improving the state of the art in hierarchical SMT by learning these latent alignment labels directly from standard word alignments, without special use of syntactic or other parsers. The fact that such labels are in principle complementary to monolingual information is an exciting perspective which we may explore in future work.

Acknowledgements

This work is supported by The Netherlands Organization for Scientific Research (NWO) under grant nr. 612.066.929. This work was also sponsored by the BIG Grid project for the use of the computing and storage facilities, with financial support from the Netherlands Organization for Scientific Research (NWO) under grant BG-087-12. The authors would like to thank the people from the Joshua team at Johns Hopkins University, in particular Yuan Cao, Jonathan Weese, Matt Post and Juri Ganitkevitch, for their helpful replies to questions regarding Joshua and its PRO and Packing implementations.


References

Alfred V. Aho and Jeffrey D. Ullman. 1969. Syntax directed translations and the pushdown assembler. J. Comput. Syst. Sci., 3(1):37–56.

Hala Almaghout, Jie Jiang, and Andy Way. 2011. CCG contextual labels in hierarchical phrase-based SMT. In Proceedings of the 15th Annual Conference of the European Association for Machine Translation (EAMT-2011), May.

Stanley F. Chen and Joshua T. Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Computer Science Group, Harvard University.

Colin Cherry and George Foster. 2012. Batch tuning strategies for statistical machine translation. In HLT-NAACL, pages 427–436.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the ACL, pages 263–270, June.

David Chiang. 2006. An introduction to synchronous grammars.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

David Chiang. 2010. Learning to translate with source and target syntax. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1443–1452.

Juri Ganitkevitch, Yuan Cao, Jonathan Weese, Matt Post, and Chris Callison-Burch. 2012. Joshua 4.0: Packing, PRO, and paraphrases. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 283–291, Montreal, Canada, June. Association for Computational Linguistics.

Hany Hassan, Khalil Sima'an, and Andy Way. 2007. Supertagged phrase-based statistical machine translation. In Proceedings of ACL 2007, pages 288–295.

Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Richard Zens, Alexandra Constantin, Marcello Federico, Nicola Bertoldi, Chris Dyer, Brooke Cowan, Wade Shen, Christine Moran, and Ondrej Bojar. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL, Demonstration Session, pages 177–180.

Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1352–1362.

Shankar Kumar and William Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation. In HLT-NAACL, pages 169–176.

Zhifei Li, Jason Eisner, and Sanjeev Khudanpur. 2009. Variational decoding for statistical machine translation. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 593–601.

Junhui Li, Zhaopeng Tu, Guodong Zhou, and Josef van Genabith. 2012. Using syntactic head information in hierarchical phrase-based translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 232–242.

Yuval Marton, David Chiang, and Philip Resnik. 2012. Soft syntactic constraints for Arabic–English hierarchical phrase-based translation. Machine Translation, 26(1-2):137–157.

Markos Mylonakis and Khalil Sima'an. 2011. Learning hierarchical translation structure with linguistic annotations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 642–652.

Markos Mylonakis. 2012. Learning the Latent Structure of Translation. Ph.D. thesis, University of Amsterdam.

Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the ACL, pages 160–167.

Khalil Sima'an and Gideon Maillette de Buy Wenniger. 2011. Hierarchical translation equivalence over word alignments. Technical Report PP-2011-38, Institute for Logic, Language and Computation.

Roy W. Tromble, Shankar Kumar, Franz Och, and Wolfgang Macherey. 2008. Lattice minimum Bayes-risk decoding for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 620–629.

Ashish Venugopal, Andreas Zollmann, Noah A. Smith, and Stephan Vogel. 2009. Preference grammars: softening syntactic constraints to improve statistical machine translation. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 236–244.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23:377–404.

Hao Zhang, Daniel Gildea, and David Chiang. 2008. Extracting synchronous grammar rules from word-level alignments in linear time. In Proceedings of the 22nd International Conference on Computational Linguistics, pages 1081–1088.

Andreas Zollmann and Ashish Venugopal. 2006. Syntax augmented machine translation via chart parsing. In NAACL 2006 Workshop on Statistical Machine Translation, June.

Andreas Zollmann. 2011. Learning Multiple-Nonterminal Synchronous Grammars for Statistical Machine Translation. Ph.D. thesis, Carnegie Mellon University.


Proceedings of the 7th Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 29–38, Atlanta, Georgia, 13 June 2013. ©2013 Association for Computational Linguistics

A Performance Study of Cube Pruning for Large-Scale Hierarchical Machine Translation

Matthias Huck1 and David Vilar2 and Markus Freitag1 and Hermann Ney1

1 Human Language Technology and Pattern Recognition Group, RWTH Aachen University, D-52056 Aachen, Germany
<surname>@cs.rwth-aachen.de

2 DFKI GmbH, Alt-Moabit 91c, D-10559 Berlin, Germany
[email protected]

Abstract

In this paper, we empirically investigate the impact of critical configuration parameters in the popular cube pruning algorithm for decoding in hierarchical statistical machine translation. Specifically, we study how the choice of the k-best generation size affects translation quality and resource requirements in hierarchical search. We furthermore examine the influence of two different granularities of hypothesis recombination. Our experiments are conducted on the large-scale Chinese→English and Arabic→English NIST translation tasks. Besides standard hierarchical grammars, we also explore search with restricted recursion depth of hierarchical rules based on shallow-1 grammars.

1 Introduction

Cube pruning (Chiang, 2007) is a widely used search strategy in state-of-the-art hierarchical decoders. Some alternatives and extensions to the classical algorithm as proposed by David Chiang have since been presented in the literature, e.g. cube growing (Huang and Chiang, 2007), lattice-based hierarchical translation (Iglesias et al., 2009b; de Gispert et al., 2010), and source cardinality synchronous cube pruning (Vilar and Ney, 2012). Standard cube pruning remains the commonly adopted decoding procedure in hierarchical machine translation research at the moment, though. The algorithm has meanwhile been implemented in many publicly available toolkits, for example in Moses (Koehn et al., 2007; Hoang et al., 2009), Joshua (Li et al., 2009a), Jane (Vilar et al., 2010), cdec (Dyer et al., 2010), Kriya (Sankaran et al., 2012), and NiuTrans (Xiao et al., 2012). While the plain hierarchical approach to machine translation (MT) is only formally syntax-based, cube pruning can also be utilized for decoding with syntactically or semantically enhanced models, for instance those by Venugopal et al. (2009), Shen et al. (2010), Xie et al. (2011), Almaghout et al. (2012), Li et al. (2012), Williams and Koehn (2012), or Baker et al. (2010).

Here, we look into the following key aspects of hierarchical phrase-based translation with cube pruning:

• Deep vs. shallow grammar.
• k-best generation size.
• Hypothesis recombination scheme.

We conduct a comparative study of all combinations of these three factors in hierarchical decoding and present detailed experimental analyses with respect to translation quality and search efficiency. We focus on two tasks which are of particular interest to the research community: the Chinese→English and Arabic→English NIST OpenMT translation tasks.

The paper is structured as follows: We briefly outline some important related work in the following section. We subsequently give a summary of the grammars used in hierarchical phrase-based translation, including a presentation of the difference between a deep and a shallow-1 grammar (Section 3). Essential aspects of hierarchical search with the cube pruning algorithm are explained in Section 4. We show how the k-best generation size is defined (we apply the limit without counting recombined candidates), and we present the two different hypothesis recombination schemes (recombination T and recombination LM). Our empirical investigations and findings constitute the major part of this work: In Section 5, we first describe our setup in detail, then conduct a number of comparative experiments with varied parameters on the two translation tasks, and finally analyze and discuss the results. We conclude the paper in Section 6.

2 Related Work

Hierarchical phrase-based translation (HPBT) was first proposed by Chiang (2005). Chiang also introduced the cube pruning algorithm for hierarchical search (Chiang, 2007). It is basically an adaptation of one of the k-best parsing algorithms by Huang and Chiang (2005). Good descriptions of the cube pruning implementation in the Joshua decoder have been provided by Li and Khudanpur (2008) and Li et al. (2009b). Xu and Koehn (2012) implemented hierarchical search with the cube growing algorithm in Moses and compared its performance to Moses' cube pruning implementation. Heafield et al. recently developed techniques to speed up hierarchical search by means of an improved language model integration (Heafield et al., 2011; Heafield et al., 2012; Heafield et al., 2013).

3 Probabilistic SCFGs for HPBT

In hierarchical phrase-based translation, a probabilistic synchronous context-free grammar (SCFG) is induced from a bilingual text. In addition to continuous lexical phrases, hierarchical phrases with usually up to two gaps are extracted from the word-aligned parallel training data.

Deep grammar. The non-terminal set of a standard hierarchical grammar comprises two symbols which are shared by source and target: the initial symbol S and one generic non-terminal symbol X. Extracted rules of a standard hierarchical grammar are of the form X → ⟨α, β, ∼⟩, where ⟨α, β⟩ is a bilingual phrase pair that may contain X, i.e. α ∈ ({X} ∪ V_F)⁺ and β ∈ ({X} ∪ V_E)⁺, where V_F and V_E are the source and target vocabulary, respectively. The ∼ relation denotes a one-to-one correspondence between the non-terminals in α and in β. A non-lexicalized initial rule and a special glue rule complete the grammar. We denote standard hierarchical grammars as deep grammars here.

Shallow-1 grammar. Iglesias et al. (2009a) propose a limitation of the recursion depth for hierarchical rules with shallow grammars. In a shallow-1 grammar, the generic non-terminal X of the standard hierarchical approach is replaced by two distinct non-terminals XH and XP. By changing the left-hand sides of the rules, lexical phrases are allowed to be derived from XP only, hierarchical phrases from XH only. On all right-hand sides of hierarchical rules, the X is replaced by XP. Gaps within hierarchical phrases can thus be filled with continuous lexical phrases only, not with hierarchical phrases. The initial and glue rules are adjusted accordingly.
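The relabeling described above can be sketched schematically as follows; the non-terminal names XH and XP follow the text, while the token-list rule encoding is a simplification invented for illustration:

```python
def to_shallow1(rhs_tokens):
    """Relabel one extracted rule for a shallow-1 grammar.

    Assumes the generic non-terminal is spelled exactly "X" in the
    token list.  Hierarchical phrases (containing a gap) get the new
    left-hand side XH, lexical phrases get XP, and every gap on the
    right-hand side is relabeled XP, so gaps can only be filled by
    continuous lexical phrases.
    """
    is_hierarchical = "X" in rhs_tokens
    lhs = "XH" if is_hierarchical else "XP"
    rhs = ["XP" if tok == "X" else tok for tok in rhs_tokens]
    return lhs, rhs

# A hierarchical rule "the X of X" and a lexical phrase "the house":
print(to_shallow1(["the", "X", "of", "X"]))  # ('XH', ['the', 'XP', 'of', 'XP'])
print(to_shallow1(["the", "house"]))         # ('XP', ['the', 'house'])
```

Since no right-hand side ever contains XH after this transformation, a derivation can nest hierarchical rules at most one level deep, which is exactly the shallow-1 restriction.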

4 Hierarchical Search with Cube Pruning

Hierarchical search is typically carried out with a parsing-based procedure. The parsing algorithm is extended to handle translation candidates and to incorporate language model scores via cube pruning.

The cube pruning algorithm. Cube pruning operates on a hypergraph which represents the whole parsing space. This hypergraph is built employing a customized version of the CYK+ parsing algorithm (Chappelier and Rajman, 1998). Given the hypergraph, cube pruning expands at most k derivations at each hypernode. [1] The pseudocode of the k-best generation step of the cube pruning algorithm is shown in Figure 1. This function is called in bottom-up topological order for all hypernodes. A heap of active derivations A is maintained. A initially contains the first-best derivations for each incoming hyperedge (line 1). Active derivations are processed in a loop (line 3) until a limit k is reached or A is empty. If a candidate derivation d is recombinable, the RECOMBINE auxiliary function recombines it and returns true; otherwise (for non-recombinable candidates) RECOMBINE returns false. Non-recombinable candidates are appended to the list D of k-best derivations (line 6). This list will be sorted before the function terminates (line 8). The PUSHSUCC auxiliary function (line 7) updates A with the next best derivations following d along the hyperedge. PUSHSUCC determines the cube order by processing adjacent derivations in a specific sequence (of predecessor hypernodes along the hyperedge and phrase translation options). [2]

[1] The hypergraph on which cube pruning operates can be constructed based on other techniques, such as tree automata, but CYK+ parsing is the dominant approach.

k-best generation size. Candidate derivations are generated by cube pruning best-first along the incoming hyperedges. A problem results from the language model integration, though: as soon as language model context is considered, monotonicity properties of the derivation cost can no longer be guaranteed. Thus, even for single-best translation, k-best derivations are collected in a buffer in a beam search manner and finally sorted according to their cost. The k-best generation size is consequently a crucial parameter of the cube pruning algorithm.

Hypothesis recombination. Partial hypotheses with states that are indistinguishable from each other are recombined during search. We define two notions of when to consider two derivations indistinguishable, and thus when to recombine them:

Recombination T. The T recombination scheme recombines derivations that produce identical translations.

Recombination LM. The LM recombination scheme recombines derivations with identical language model context.
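The two granularities can be sketched as recombination state keys, where derivations mapping to the same key are recombined; the dictionary-based derivation representation and the 4-gram assumption here are illustrative, not the toolkit's actual data structures:

```python
def state_key_T(derivation):
    # Recombination T: only derivations producing the identical
    # translation string are indistinguishable.
    return derivation["translation"]

def state_key_LM(derivation, n=4):
    # Recombination LM: only the boundary words an n-gram LM can
    # still see matter, so this key is coarser than T and therefore
    # recombines strictly more hypotheses.
    words = derivation["translation"].split()
    return tuple(words[:n - 1]), tuple(words[-(n - 1):])

d1 = {"translation": "the big house near the red car"}
d2 = {"translation": "the big house by the red car"}
# Different translations: kept apart under T, recombined under LM.
```

This makes the containment relation explicit: any pair recombined under T is also recombined under LM, but not vice versa.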

Recombination is conducted within the loop of the k-best generation step of cube pruning. Recombined derivations do not increment the generation count; the k-best generation limit is thus effectively applied after recombination. [3] In general, more phrase translation candidates per hypernode are being considered (and need to be rated with the language model) in the recombination LM scheme compared to the recombination T scheme. The more partial hypotheses can be recombined, the more iterations of the inner code block of the k-best generation loop are possible. The same internal k-best

[2] See Vilar (2011) for the pseudocode of the PUSHSUCC function and other details which are omitted here.

[3] Whether recombined derivations contribute to the generation count or not is a configuration (or implementation) decision. Please note that some publicly available toolkits count recombined derivations by default.

Input: a hypernode and the size k of the k-best list
Output: D, a list with the k-best derivations

1  let A ← heap({(e, 1^|e|) | e ∈ incoming edges})
2  let D ← [ ]
3  while |A| > 0 ∧ |D| < k do
4      d ← pop(A)
5      if not RECOMBINE(D, d) then
6          D ← D ++ [d]
7      PUSHSUCC(d, A)
8  sort D

Figure 1: k-best generation with the cube pruning algorithm.

generation size results in a larger search space for recombination LM. We will examine how the overall number of loop iterations relates to the k-best generation limit. By measuring the number of derivations as well as the number of recombination operations on our test sets, we will be able to give an insight into how large the fraction of recombinable candidates is for different configurations.
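The k-best generation step of Figure 1 can be sketched in runnable form as follows; as a simplification of the real cube (which also iterates over predecessor hypernodes and translation options), each incoming hyperedge here contributes one cost-sorted candidate list, and the recombination key defaults to the translation string (the T scheme):

```python
import heapq

def kbest_node(edges, k, key=lambda d: d[1]):
    """k-best generation at one hypernode, following Figure 1.

    `edges` is a list of candidate lists, one per incoming hyperedge,
    each sorted by cost; a candidate is a (cost, translation) pair.
    Recombined candidates do not count toward the limit k.
    """
    # A initially holds the first-best candidate of every edge (line 1).
    A = [(cands[0][0], ei, 0) for ei, cands in enumerate(edges) if cands]
    heapq.heapify(A)
    D, seen = [], set()
    while A and len(D) < k:                    # line 3
        cost, ei, j = heapq.heappop(A)         # line 4: best active candidate
        d = edges[ei][j]
        if key(d) not in seen:                 # RECOMBINE check (line 5)
            seen.add(key(d))
            D.append(d)                        # line 6
        # PUSHSUCC: next-best candidate along the same hyperedge (line 7)
        if j + 1 < len(edges[ei]):
            heapq.heappush(A, (edges[ei][j + 1][0], ei, j + 1))
    return sorted(D)                           # line 8

edges = [[(1.0, "a"), (3.0, "a"), (4.0, "b")], [(2.0, "c")]]
print(kbest_node(edges, 3))  # [(1.0, 'a'), (2.0, 'c'), (4.0, 'b')]
```

Note that the recombined duplicate (3.0, "a") does not consume a slot, so the more expensive "b" still enters the list; this is exactly the "limit applied after recombination" behavior described above.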

5 Experiments

We conduct experiments which evaluate performance in terms of both translation quality and computational efficiency, i.e. translation speed and memory consumption, for combinations of deep or shallow-1 grammars with the two hypothesis recombination schemes and an exhaustive range of k-best generation size settings. Empirical results are presented on the Chinese→English and Arabic→English 2008 NIST tasks (NIST, 2008).

5.1 Experimental Setup

We work with parallel training corpora of 3.0 M Chinese–English sentence pairs (77.5 M Chinese / 81.0 M English running words after preprocessing) and 2.5 M Arabic–English sentence pairs (54.3 M Arabic / 55.3 M English running words after preprocessing), respectively. Word alignments are created by aligning the data in both directions with GIZA++ and symmetrizing the two trained alignments (Och and Ney, 2003). When extracting phrases, we apply several restrictions, in particular a maximum length of ten on source and target side for lexical phrases, a length limit of five on source and ten on target side for hierarchical phrases (including non-terminal symbols), and no more than two gaps per phrase.
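Symmetrization of the two directional alignments can be sketched with its two basic combination operations; the actual heuristics used with GIZA++ output (grow-diag-final and variants, Och and Ney, 2003) start from the intersection and selectively add links from the union, which this minimal sketch omits:

```python
def symmetrize(src2tgt, tgt2src):
    """Combine two directional word alignments.

    `src2tgt` holds (i, j) links from the source-to-target alignment
    and `tgt2src` holds (j, i) links from the target-to-source one.
    Returns (intersection, union): the high-precision and high-recall
    link sets from which growing heuristics pick a final alignment.
    """
    forward = set(src2tgt)
    backward = {(i, j) for (j, i) in tgt2src}
    return forward & backward, forward | backward

inter, union = symmetrize({(0, 0), (1, 2)}, {(0, 0), (2, 1), (3, 1)})
print(sorted(inter))  # [(0, 0), (1, 2)]
print(sorted(union))  # [(0, 0), (1, 2), (1, 3)]
```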


Table 1: Data statistics for the test sets. Numbers have been replaced by a special category symbol.

                Chinese MT08    Arabic MT08
Sentences            1 357          1 360
Running words       34 463         45 095
Vocabulary           6 209          9 387

The decoder loads only the best translation options per distinct source side with respect to the weighted phrase-level model scores (100 for Chinese, 50 for Arabic). The language models are 4-grams with modified Kneser-Ney smoothing (Kneser and Ney, 1995; Chen and Goodman, 1998) which have been trained with the SRILM toolkit (Stolcke, 2002).

During decoding, a maximum length constraint of ten is applied to all non-terminals except the initial symbol S. Model weights are optimized with MERT (Och, 2003) on 100-best lists. The optimized weights are obtained (separately for deep and for shallow-1 grammars) with a k-best generation size of 1 000 for Chinese→English and of 500 for Arabic→English, and are kept fixed for all setups. We employ MT06 as development sets. Translation quality is measured in truecase with BLEU (Papineni et al., 2002) on the MT08 test sets. Data statistics for the preprocessed source sides of both the Chinese→English MT08 test set and the Arabic→English MT08 test set are given in Table 1.

Our translation experiments are conducted with the open source translation toolkit Jane (Vilar et al., 2010; Vilar et al., 2012). The core implementation of the toolkit is written in C++. We compiled with GCC version 4.4.3 using its -O2 optimization flag. We employ the SRILM libraries to perform language model scoring in the decoder. In binarized version, the language models have a size of 3.6G (Chinese→English) and 6.2G (Arabic→English). Language models and phrase tables have been copied to the local hard disks of the machines. In all experiments, the language model is completely loaded beforehand. Loading time of the language model and any other initialization steps are not included in the measured translation time. Phrase tables are in the Jane toolkit's binarized format. The decoder initializes the prefix tree structure, required nodes get loaded from secondary storage into main memory on demand, and the loaded content is cleared each time a new input sentence is to be parsed. There is nearly no overhead due to unused data in main memory. We do not rely on memory mapping. Memory statistics are with respect to virtual memory. The hardware was equipped with RAM well beyond the requirements of the tasks, and sufficient memory was reserved for the processes.

5.2 Experimental Results

Figures 2 and 3 depict how the Chinese→English and Arabic→English setups behave in terms of translation quality. The k-best generation size in cube pruning is varied between 10 and 10 000. The four graphs in each plot illustrate the results for combinations of deep grammar and recombination scheme T, deep grammar and recombination scheme LM, shallow grammar and recombination scheme T, as well as shallow grammar and recombination scheme LM. Figures 4 and 5 show the corresponding translation speed in words per second for these settings. The maximum memory requirements in gigabytes are given in Figures 6 and 7. In order to better visualize the trade-offs between translation quality and resource consumption, we plot translation quality against time requirements in Figures 8 and 9, and translation quality against memory requirements in Figures 10 and 11. Translation quality and model score (averaged over all sentences; higher is better) are nicely correlated for all configurations, as can be concluded from Figures 12 through 15.

5.3 Discussion

Chinese→English. For Chinese→English translation, the system with the deep grammar generally performs a bit better with respect to quality than the shallow one, which accords with the findings of other groups (de Gispert et al., 2010; Sankaran et al., 2012). The LM recombination scheme yields slightly better quality than the T scheme, and with the shallow-1 grammar it outperforms the T scheme at any given fixed amount of time or memory allocation (Figures 8 and 10).

Shallow-1 translation is up to roughly 2.5 times faster than translation with the deep grammar. However, the shallow-1 setups are considerably slowed down at higher k-best sizes as well, while the effort pays off only very moderately. Overall, the

Figure 2: Chinese→English translation quality (truecase); BLEU [%] over the k-best generation size, NIST Chinese-to-English (MT08).

Figure 3: Arabic→English translation quality (truecase); BLEU [%] over the k-best generation size, NIST Arabic-to-English (MT08).

Figure 4: Chinese→English translation speed; words per second over the k-best generation size.

Figure 5: Arabic→English translation speed; words per second over the k-best generation size.

Figure 6: Chinese→English memory requirements; gigabytes over the k-best generation size.

Figure 7: Arabic→English memory requirements; gigabytes over the k-best generation size.

Figure 8: Trade-off between translation quality and speed for Chinese→English.

Figure 9: Trade-off between translation quality and speed for Arabic→English.

Figure 10: Trade-off between translation quality and memory requirements for Chinese→English.

Figure 11: Trade-off between translation quality and memory requirements for Arabic→English.

shallow-1 grammar at a k-best size between 100 and 1 000 seems to offer a good compromise between quality and efficiency. Deep translation with k = 2 000 and the LM recombination scheme promises high quality translation, but note the rapid increase in memory consumption beyond k = 1 000 with the deep grammar. At k ≤ 1 000, memory consumption is not an issue in either the deep or the shallow system, but translation speed starts to drop already at k > 100.

Arabic→English. Shallow-1 translation produces competitive quality for Arabic→English translation (de Gispert et al., 2010; Huck et al., 2011). The LM recombination scheme boosts the BLEU scores slightly. The systems with deep grammar are slowed down strongly with every increase of the k-best size, and their memory consumption likewise inflates early. We actually stopped running experiments with deep grammars for Arabic→English at k = 7 000 for the T recombination scheme and at k = 700 for the LM recombination scheme, because 124G of memory no longer sufficed for higher k-best sizes. The memory consumption of the shallow systems stays nearly constant across a large range of the surveyed k-best sizes, but Figure 11 reveals a plateau where more resources do not improve translation quality. Increasing k from 100 to 2 000 in the shallow setup with LM recombination gains half a BLEU point, but reduces speed by a factor of more than 10.

Figure 12: Relation of translation quality and average model score for Chinese→English (deep grammar).

Figure 13: Relation of translation quality and average model score for Arabic→English (deep grammar).

Figure 14: Relation of translation quality and average model score for Chinese→English (shallow-1 grammar).

Figure 15: Relation of translation quality and average model score for Arabic→English (shallow-1 grammar).

Actual number of derivations. We measured the number of hypernodes (Table 2), the number of actually generated derivations after recombination, and the number of generated candidate derivations including recombined ones (or, equivalently, loop iterations in the algorithm from Figure 1) for selected limits k (Tables 3 and 4). The ratio of the average number of derivations per hypernode after and before recombination remains consistently low for all recombination scheme T setups. For the setups with the LM recombination scheme, this recombination factor rises with larger k, i.e. the fraction of recombinable candidates increases. The increase is remarkably pronounced for Arabic→English with the deep grammar. The steep slope of the recombination factor may be interpreted as an indicator of undesired overgeneration of the deep grammar on the Arabic→English task.
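The recombination factor reported in Tables 3 and 4 is just the ratio between the candidate derivations generated per hypernode (including recombined ones) and the derivations kept after recombination. A minimal sketch of this bookkeeping (the function name is our own), using the k = 10 deep-grammar Chinese→English numbers from Table 3:

```python
def recombination_factor(incl_recombined, after_recombination):
    """Avg. candidate derivations per hypernode (incl. recombined ones)
    divided by the avg. derivations surviving recombination."""
    return incl_recombined / after_recombination

# k = 10, deep grammar, Chinese->English (Table 3):
print(round(recombination_factor(11.7, 10.0), 2))  # T scheme:  1.17
print(round(recombination_factor(18.2, 10.0), 2))  # LM scheme: 1.82
```

A factor close to 1 (T scheme) means almost every generated candidate is kept; a growing factor (LM scheme at large k) means an increasing share of candidates is merged away by recombination.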

6 Conclusion

We systematically studied three key aspects of hierarchical phrase-based translation with cube pruning: deep vs. shallow-1 grammars, the k-best generation size, and the hypothesis recombination scheme. In a series of empirical experiments, we revealed the trade-offs between translation quality and resource requirements at a more fine-grained level than is typically done in the literature.


Table 2: Average number of hypernodes per sentence and average length of the preprocessed input sentences on the NIST Chinese→English (MT08) and Arabic→English (MT08) tasks.

                                 Chinese→English        Arabic→English
                                 deep     shallow-1     deep     shallow-1
  avg. #hypernodes per sentence  480.5    200.7         896.4    308.4
  avg. source sentence length        25.4                   33.2

Table 3: Detailed statistics about the actual number of derivations on the NIST Chinese→English task (MT08).

deep grammar
          recombination T                        recombination LM
      k   after recomb.  incl. recomb.  factor   after recomb.  incl. recomb.  factor
     10          10.0          11.7      1.17           10.0         18.2       1.82
    100          99.9         120.1      1.20           99.9        275.8       2.76
   1000         950.1        1142.3      1.20          950.1       4246.9       4.47
  10000        9429.8       11262.8      1.19         9418.1      72008.4       7.65

shallow-1 grammar
          recombination T                        recombination LM
      k   after recomb.  incl. recomb.  factor   after recomb.  incl. recomb.  factor
     10           9.7          11.3      1.17            9.6         13.6       1.41
    100          90.8         105.2      1.16           90.4        168.6       1.86
   1000         707.3         811.3      1.15          697.4       2143.4       3.07
  10000        6478.1        7170.4      1.11         6202.8      34165.6       5.51

(Entries are avg. #derivations per hypernode; "factor" = incl. recombined / after recombination.)

Table 4: Detailed statistics about the actual number of derivations on the NIST Arabic→English task (MT08).

deep grammar
          recombination T                        recombination LM
      k   after recomb.  incl. recomb.  factor   after recomb.  incl. recomb.  factor
     10          10.0          18.3      1.83           10.0         71.5       7.15
    100          98.0         177.4      1.81           98.0       1726.0      17.62
    500         482.1         849.0      1.76          482.1      14622.1      30.33
   1000         961.8        1675.0      1.74              –            –          –

shallow-1 grammar
          recombination T                        recombination LM
      k   after recomb.  incl. recomb.  factor   after recomb.  incl. recomb.  factor
     10           9.6          12.1      1.26            9.6         16.6       1.73
    100          80.9         105.2      1.30           80.2        193.8       2.42
   1000         690.1         902.1      1.31          672.1       2413.0       3.59
  10000        5638.6        7149.5      1.27         5275.1      31283.6       5.93

(Entries are avg. #derivations per hypernode; "factor" = incl. recombined / after recombination.)


Acknowledgments

This work was partly achieved as part of the Quaero Programme, funded by OSEO, French State agency for innovation. This material is also partly based upon work supported by the DARPA BOLT project under Contract No. HR0011-12-C-0015. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287658.

References

Hala Almaghout, Jie Jiang, and Andy Way. 2012. Extending CCG-based Syntactic Constraints in Hierarchical Phrase-Based SMT. In Proc. of the Annual Conf. of the European Assoc. for Machine Translation (EAMT), pages 193–200, Trento, Italy, May.

Kathryn Baker, Michael Bloodgood, Chris Callison-Burch, Bonnie Dorr, Nathaniel Filardo, Lori Levin, Scott Miller, and Christine Piatko. 2010. Semantically-Informed Syntactic Machine Translation: A Tree-Grafting Approach. In Proc. of the Conf. of the Assoc. for Machine Translation in the Americas (AMTA), Denver, CO, USA, October/November.

Jean-Cedric Chappelier and Martin Rajman. 1998. A Generalized CYK Algorithm for Parsing Stochastic CFG. In Proc. of the First Workshop on Tabulation in Parsing and Deduction, pages 133–137, Paris, France, April.

Stanley F. Chen and Joshua Goodman. 1998. An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report TR-10-98, Computer Science Group, Harvard University, Cambridge, MA, USA, August.

David Chiang. 2005. A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proc. of the Annual Meeting of the Assoc. for Computational Linguistics (ACL), pages 263–270, Ann Arbor, MI, USA, June.

David Chiang. 2007. Hierarchical Phrase-Based Translation. Computational Linguistics, 33(2):201–228, June.

Adria de Gispert, Gonzalo Iglesias, Graeme Blackwood, Eduardo R. Banga, and William Byrne. 2010. Hierarchical Phrase-Based Translation with Weighted Finite-State Transducers and Shallow-n Grammars. Computational Linguistics, 36(3):505–533.

Chris Dyer, Adam Lopez, Juri Ganitkevitch, Jonathan Weese, Ferhan Ture, Phil Blunsom, Hendra Setiawan, Vladimir Eidelman, and Philip Resnik. 2010. cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models. In Proc. of the ACL 2010 System Demonstrations, pages 7–12, Uppsala, Sweden, July.

Kenneth Heafield, Hieu Hoang, Philipp Koehn, Tetsuo Kiso, and Marcello Federico. 2011. Left Language Model State for Syntactic Machine Translation. In Proc. of the Int. Workshop on Spoken Language Translation (IWSLT), pages 183–190, San Francisco, CA, USA, December.

Kenneth Heafield, Philipp Koehn, and Alon Lavie. 2012. Language Model Rest Costs and Space-Efficient Storage. In Proc. of the 2012 Joint Conf. on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 1169–1178, Jeju Island, Korea, July.

Kenneth Heafield, Philipp Koehn, and Alon Lavie. 2013. Grouping Language Model Boundary Words to Speed K-Best Extraction from Hypergraphs. In Proc. of the Human Language Technology Conf. / North American Chapter of the Assoc. for Computational Linguistics (HLT-NAACL), Atlanta, GA, USA, June.

Hieu Hoang, Philipp Koehn, and Adam Lopez. 2009. A Unified Framework for Phrase-Based, Hierarchical, and Syntax-Based Statistical Machine Translation. In Proc. of the Int. Workshop on Spoken Language Translation (IWSLT), pages 152–159, Tokyo, Japan, December.

Liang Huang and David Chiang. 2005. Better k-best Parsing. In Proc. of the 9th Int. Workshop on Parsing Technologies, pages 53–64, October.

Liang Huang and David Chiang. 2007. Forest Rescoring: Faster Decoding with Integrated Language Models. In Proc. of the Annual Meeting of the Assoc. for Computational Linguistics (ACL), pages 144–151, Prague, Czech Republic, June.

Matthias Huck, David Vilar, Daniel Stein, and Hermann Ney. 2011. Advancements in Arabic-to-English Hierarchical Machine Translation. In 15th Annual Conference of the European Association for Machine Translation, pages 273–280, Leuven, Belgium, May.

Gonzalo Iglesias, Adria de Gispert, Eduardo R. Banga, and William Byrne. 2009a. Rule Filtering by Pattern for Efficient Hierarchical Translation. In Proc. of the 12th Conf. of the Europ. Chapter of the Assoc. for Computational Linguistics (EACL), pages 380–388, Athens, Greece, March.

Gonzalo Iglesias, Adria de Gispert, Eduardo R. Banga, and William Byrne. 2009b. Hierarchical Phrase-Based Translation with Weighted Finite State Transducers. In Proc. of the Human Language Technology Conf. / North American Chapter of the Assoc. for Computational Linguistics (HLT-NAACL), pages 433–441, Boulder, CO, USA, June.

Reinhard Kneser and Hermann Ney. 1995. Improved Backing-Off for M-gram Language Modeling. In Proc. of the International Conf. on Acoustics, Speech, and Signal Processing, volume 1, pages 181–184, Detroit, MI, USA, May.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proc. of the Annual Meeting of the Assoc. for Computational Linguistics (ACL), pages 177–180, Prague, Czech Republic, June.

Zhifei Li and Sanjeev Khudanpur. 2008. A Scalable Decoder for Parsing-Based Machine Translation with Equivalent Language Model State Maintenance. In Proceedings of the Second Workshop on Syntax and Structure in Statistical Translation, SSST '08, pages 10–18, Columbus, OH, USA, June.

Zhifei Li, Chris Callison-Burch, Chris Dyer, Sanjeev Khudanpur, Lane Schwartz, Wren Thornton, Jonathan Weese, and Omar Zaidan. 2009a. Joshua: An Open Source Toolkit for Parsing-Based Machine Translation. In Proc. of the Workshop on Statistical Machine Translation (WMT), pages 135–139, Athens, Greece, March.

Zhifei Li, Chris Callison-Burch, Sanjeev Khudanpur, and Wren Thornton. 2009b. Decoding in Joshua: Open Source, Parsing-Based Machine Translation. The Prague Bulletin of Mathematical Linguistics, (91):47–56, January.

Junhui Li, Zhaopeng Tu, Guodong Zhou, and Josef van Genabith. 2012. Using Syntactic Head Information in Hierarchical Phrase-Based Translation. In Proc. of the Workshop on Statistical Machine Translation (WMT), pages 232–242, Montreal, Canada, June.

NIST. 2008. Open Machine Translation 2008 Evaluation. http://www.itl.nist.gov/iad/mig/tests/mt/2008/.

Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51, March.

Franz Josef Och. 2003. Minimum Error Rate Training for Statistical Machine Translation. In Proc. of the Annual Meeting of the Assoc. for Computational Linguistics (ACL), pages 160–167, Sapporo, Japan, July.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proc. of the Annual Meeting of the Assoc. for Computational Linguistics (ACL), pages 311–318, Philadelphia, PA, USA, July.

Baskaran Sankaran, Majid Razmara, and Anoop Sarkar. 2012. Kriya - An end-to-end Hierarchical Phrase-based MT System. The Prague Bulletin of Mathematical Linguistics, (97):83–98, April.

Libin Shen, Jinxi Xu, and Ralph Weischedel. 2010. String-to-Dependency Statistical Machine Translation. Computational Linguistics, 36(4):649–671, December.

Andreas Stolcke. 2002. SRILM – an Extensible Language Modeling Toolkit. In Proc. of the Int. Conf. on Spoken Language Processing (ICSLP), volume 3, Denver, CO, USA, September.

Ashish Venugopal, Andreas Zollmann, Noah A. Smith, and Stephan Vogel. 2009. Preference Grammars: Softening Syntactic Constraints to Improve Statistical Machine Translation. In Proc. of the Human Language Technology Conf. / North American Chapter of the Assoc. for Computational Linguistics (HLT-NAACL), pages 236–244, Boulder, CO, USA, June.

David Vilar and Hermann Ney. 2012. Cardinality pruning and language model heuristics for hierarchical phrase-based translation. Machine Translation, 26(3):217–254, September.

David Vilar, Daniel Stein, Matthias Huck, and Hermann Ney. 2010. Jane: Open Source Hierarchical Translation, Extended with Reordering and Lexicon Models. In Proc. of the Workshop on Statistical Machine Translation (WMT), pages 262–270, Uppsala, Sweden, July.

David Vilar, Daniel Stein, Matthias Huck, and Hermann Ney. 2012. Jane: an advanced freely available hierarchical machine translation toolkit. Machine Translation, 26(3):197–216, September.

David Vilar. 2011. Investigations on Hierarchical Phrase-Based Machine Translation. Ph.D. thesis, RWTH Aachen University, Aachen, Germany, November.

Philip Williams and Philipp Koehn. 2012. GHKM Rule Extraction and Scope-3 Parsing in Moses. In Proc. of the Workshop on Statistical Machine Translation (WMT), pages 388–394, Montreal, Canada, June.

Tong Xiao, Jingbo Zhu, Hao Zhang, and Qiang Li. 2012. NiuTrans: An Open Source Toolkit for Phrase-based and Syntax-based Machine Translation. In Proc. of the ACL 2012 System Demonstrations, pages 19–24, Jeju, Republic of Korea, July.

Jun Xie, Haitao Mi, and Qun Liu. 2011. A Novel Dependency-to-String Model for Statistical Machine Translation. In Proc. of the Conf. on Empirical Methods for Natural Language Processing (EMNLP), pages 216–226, Edinburgh, Scotland, UK, July.

Wenduan Xu and Philipp Koehn. 2012. Extending Hiero Decoding in Moses with Cube Growing. The Prague Bulletin of Mathematical Linguistics, (98):133–142, October.


Proceedings of the 7th Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 39–47, Atlanta, Georgia, 13 June 2013. ©2013 Association for Computational Linguistics

Combining Word Reordering Methods on Different Linguistic Abstraction Levels for Statistical Machine Translation

Teresa Herrmann, Jan Niehues, Alex Waibel
Institute for Anthropomatics

Karlsruhe Institute of Technology
Karlsruhe, Germany

{teresa.herrmann,jan.niehues,alexander.waibel}@kit.edu

Abstract

We describe a novel approach to combining lexicalized, POS-based and syntactic tree-based word reordering in a phrase-based machine translation system. Our results show that each of the presented reordering methods leads to improved translation quality on its own. Their strengths, however, can be combined to achieve further improvements. We present experiments on German-English and German-French translation. We report improvements of 0.7 BLEU points by adding tree-based and lexicalized reordering. Up to 1.1 BLEU points can be gained by POS and tree-based reordering over a baseline with lexicalized reordering. A human analysis comparing subjective translation quality, as well as a detailed error analysis, shows the impact of our presented tree-based rules in terms of improved sentence quality and a reduction of errors related to missing verbs and verb positions.

1 Introduction

One of the main difficulties in statistical machine translation (SMT) is presented by the different word orders between languages. Most state-of-the-art phrase-based SMT systems handle it within phrase pairs or during decoding by allowing words to be swapped while translation hypotheses are generated. An additional reordering model might be included in the log-linear model of translation. However, these methods can cover reorderings only over a very limited distance. Recently, reordering as preprocessing has drawn much attention. The idea is to detach the reordering problem from the decoding process and to apply a reordering model prior to translation in order to facilitate a monotone translation.

Encouraged by the improvements that can be achieved with part-of-speech (POS) reordering rules (Niehues and Kolss, 2009; Rottmann and Vogel, 2007), we apply such rules on a different linguistic level. We abstract from the words in the sentence and learn reordering rules based on syntactic constituents in the source language sentence. Syntactic parse trees represent the sentence structure and show the relations between constituents in the sentence. Relying on syntactic constituents instead of POS tags should help to model the reordering task more reliably, since sentence constituents are moved as whole blocks of words, thus keeping the sentence structure intact.

In addition, we combine the POS-based and syntactic tree-based reordering models and also add a lexicalized reordering model, which is used in many state-of-the-art phrase-based SMT systems nowadays.

2 Related Work

The problem of word reordering has been addressed by several approaches over the last years.

In a phrase-based SMT system, reordering can be achieved during decoding by allowing swaps of words within a defined window. Lexicalized reordering models (Koehn et al., 2005; Tillmann, 2004) include information about the orientation of adjacent phrases that is learned during phrase extraction. This reordering method, which affects the scoring of translation hypotheses but does not generate new reorderings, is used e.g. in the open source machine translation system Moses (Koehn et al., 2007).

Syntax-based (Yamada and Knight, 2001) or syntax-augmented (Zollmann and Venugopal, 2006) MT systems address the reordering problem by embedding syntactic analysis in the decoding process. Hierarchical MT systems (Chiang, 2005) construct a syntactic hierarchy during decoding, which is independent of linguistic categories.

To our best knowledge, Xia and McCord (2004) were the first to model the word reordering problem as a preprocessing step. They automatically learn reordering rules for English-French translation from source and target language dependency trees. Afterwards, many followed these footsteps. Earlier approaches craft reordering rules manually based on syntactic or dependency parse trees or POS tags designed for particular languages (Collins et al., 2005; Popovic and Ney, 2006; Habash, 2007; Wang et al., 2007). Later there were more and more approaches using data-driven methods. Costa-jussà and Fonollosa (2006) frame the word reordering problem as a translation task and use word class information to translate the original source sentence into a reordered source sentence that can be translated more easily. A very popular approach is to automatically learn reordering rules based on POS tags or syntactic chunks (Popovic and Ney, 2006; Rottmann and Vogel, 2007; Zhang et al., 2007; Crego and Habash, 2008). Khalilov et al. (2009) present reordering rules learned from source and target side syntax trees. More recently, Genzel (2010) proposed to automatically learn reordering rules from IBM1 alignments and source side dependency trees. In DeNero and Uszkoreit (2011) no parser is needed, but the sentence structure used for learning the reordering model is induced automatically from a parallel corpus. Among these approaches, most are able to cover short-range reorderings, and some store reordering variants in a word lattice, leaving the selection of the path to the decoder. Long-range reorderings are addressed by manual rules (Collins et al., 2005) or using automatically learned rules (Niehues and Kolss, 2009).

Motivated by the POS-based reordering models in Niehues and Kolss (2009) and Rottmann and Vogel (2007), we present a reordering model based on the syntactic structure of the source sentence. We intend to cover both short-range and long-range reordering more reliably by abstracting to constituents extracted from syntactic parse trees instead of working only with morphosyntactic information on the word level. Furthermore, we combine POS-based and tree-based models and additionally include a lexicalized reordering model. Altogether we apply word reordering on three different levels: a lexicalized reordering model on the word level, POS-based reordering on the morphosyntactic level, and syntax tree-based reordering on the constituent level. In contrast to previous work, we use original syntactic parse trees instead of binarized parse trees or dependency trees. Furthermore, our goal is to especially address long-range reorderings involving verb constructions.

3 Motivation

When translating from German to English, different word order is the most prominent problem. Especially the verb needs to be shifted over long distances in the sentence, since the position of the verb differs in German and English sentences. The finite verbs in the English language are generally located at the second position in the sentence. In German this is only the case in a main clause. In German subordinate clauses the verb is at the final position, as shown in Example 1.

Example 1:
Source:       ..., nachdem ich eine Weile im Internet gesucht habe.
Gloss:        ..., after I a while in-the internet searched have.
POS Reord.:   ..., nachdem ich habe eine Weile im Internet gesucht.
POS Transl.:  ..., as I have for some time on the Internet.

The example shows first the source sentence and an English gloss. POS Reord. presents the reordered source sentence as produced by the POS rules. This should be the source sentence according to target language word order. POS Transl. shows the translation of the reordered sequence. We can see that some cases remain unresolved. The POS rules succeed in putting the auxiliary habe/have in the right position in the sentence. But the participle, carrying the main meaning of the sentence, is not shifted together with the auxiliary. During translation it is dropped from the sentence, rendering it unintelligible.

A reason why the POS rules do not shift both parts of the verb might be that the rules operate on the word level only and treat every POS tag independently of the others. A reordering model based on syntactic constituents can help with this. Additional information about the syntactic structure of the sentence allows us to identify which words belong together and should not be separated, but shifted as a whole block. Abstracting from the word level to the constituent level also provides the advantage that even though reorderings are performed over long sentence spans, the rules consist of fewer reordering units (constituents which themselves consist of constituents or words) and can be learned more reliably.

4 Tree-based Reordering

In order to encourage linguistically meaningful reorderings, we learn rules based on syntactic tree constituents. While the POS-based rules are flat and perform the reordering on a sequence of words, the tree-based rules operate on subtrees in the parse tree, as shown in Figure 1.

Figure 1: Example reordering rule based on subtrees: VP(PTNEG NP VVPP) → VP(PTNEG VVPP NP)

A syntactic parse tree contains both the word-level categories, i.e. parts-of-speech, and higher order categories, i.e. constituents. In this way it provides information about the building blocks of a sentence that belong together and should not be taken apart by reordering. Consequently, the tree-based reordering operates both on the word level and on the constituent level to make use of all available information in the parse tree. It is able to handle long-range reorderings as well as short-range reorderings, depending on how many words the reordered constituents cover. The tree-based reordering rules should also be more stable and introduce less random word shuffling than the POS-based rules.

The reordering model consists of two stages. The first is the rule extraction, where the rules are learned by searching the training corpus for crossing alignments which indicate a reordering between source and target language. The second is the application of the learned reordering rules to the input text prior to translation.

4.1 Rule Extraction

As shown in Figure 1, we learn rules like this:

VP PTNEG NP VVPP → VP PTNEG VVPP NP

where the first item in the rule is the head node of the subtree and the rest represent the children. In the second part of the rule, the children are indexed so that children of the same category cannot be confused. Figure 2 shows an example for rule extraction: a sentence in its syntactic parse tree representation, the sentence in the target language, and an automatically generated alignment. A reordering occurs between the constituents VVPP and NP.

Figure 2: Example training sentence used to extract reordering rules: the source parse tree for "Wir haben nicht künstliche Szenarien gewählt", word-aligned to the target sentence "We didn't choose artificial scenarios"; each constituent is annotated with its target-side alignment range (min-max).

In a first step, the reordering rule has to be found. We extract the rules from a word-aligned corpus where a syntactic parse tree is provided for each source side sentence. We traverse the tree top down and scan each subtree for reorderings, i.e. crossings of alignment links between source and target sentence. If there is a reordering, we extract a rule that rearranges the source side constituents according to the order of the corresponding words on the target side. Each constituent in a subtree comprises one or more words. We determine the lowest (min) and highest (max) alignment point for each constituent c_k and thus determine the range of the constituent on the target side. This can be formalized as min(c_k) = min{j | f_i ∈ c_k, a_i = j} and max(c_k) = max{j | f_i ∈ c_k, a_i = j}. To illustrate the process, we have annotated the parse tree in Figure 2 with the alignment points (min-max) for each constituent.

After defining the range, we check for the following conditions in order to determine whether to extract a reordering rule.

1. all constituents have a non-empty range
2. source and target word order differ

First, for each subtree, at least one word in each constituent needs to be aligned. Otherwise it is not possible to determine a conclusive order. Second, we check whether there is actually a reordering, i.e. the target language words are not in the same order as the constituents in the source language: min(c_k) > min(c_{k+1}) and max(c_k) > max(c_{k+1}).
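The range computation and the two extraction conditions can be sketched as follows. This is a simplified illustration with names and data structures of our own choosing; it derives the reordered child order by sorting constituents by their target-side ranges rather than by the exact pairwise min/max comparison of the text:

```python
def target_range(constituent, alignment):
    """Lowest/highest target position aligned to any source word of the
    constituent, or None if the constituent is completely unaligned."""
    points = [j for i in constituent for j in alignment.get(i, [])]
    return (min(points), max(points)) if points else None

def extract_rule(children, alignment):
    """children: one list of source word positions per child constituent.
    Returns the target-side child order if a crossing alignment (i.e. a
    reordering) is found, else None."""
    ranges = [target_range(c, alignment) for c in children]
    if any(r is None for r in ranges):           # condition 1: non-empty ranges
        return None
    order = sorted(range(len(ranges)), key=lambda k: ranges[k])
    if order == list(range(len(children))):      # condition 2: orders must differ
        return None
    return order

# Subtree VP -> PTNEG NP VVPP from the Figure 2 sentence:
# nicht(3)->didn't(2), kuenstliche(4)->artificial(4),
# Szenarien(5)->scenarios(5), gewaehlt(6)->choose(3)
alignment = {3: [2], 4: [4], 5: [5], 6: [3]}
print(extract_rule([[3], [4, 5], [6]], alignment))  # [0, 2, 1]: PTNEG VVPP NP
```

The returned order [0, 2, 1] corresponds exactly to the rule VP PTNEG NP VVPP → VP PTNEG VVPP NP from Figure 1.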

Once we find a reordering rule to extract, we calculate the probability of this rule as a relative frequency: the number of times such a reordering occurred in all subtrees of the training corpus, divided by the total number of occurrences of this subtree in the corpus. We only store rules for reorderings that occur more than 5 times in the corpus.
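Counting-wise, this is a relative frequency estimate with a minimum count cut-off. A small sketch (the counter names and the example counts are illustrative only; the threshold of 5 is from the text):

```python
from collections import Counter

# Hypothetical counters filled while traversing the training corpus:
# subtree_counts[lhs]        = occurrences of a subtree pattern
# rule_counts[(lhs, rhs)]    = occurrences of a specific reordering of it
subtree_counts = Counter({"VP PTNEG NP VVPP": 40})
rule_counts = Counter({("VP PTNEG NP VVPP", "VP PTNEG VVPP NP"): 8})

MIN_COUNT = 5  # reorderings seen 5 times or fewer are discarded

rules = {
    (lhs, rhs): count / subtree_counts[lhs]
    for (lhs, rhs), count in rule_counts.items()
    if count > MIN_COUNT
}
print(rules)  # {('VP PTNEG NP VVPP', 'VP PTNEG VVPP NP'): 0.2}
```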

4.1.1 Partial Rules

The syntactic parse trees of German sentences are quite flat, i.e. a subtree usually has many children. When a rule is extracted, it always consists of the head of the subtree and all its children. The application requires that the applicable rule matches the complete subtree: the head and all its children. However, most of the time only some of the children are actually involved in a reordering. There are also many different subtree variants that are quite similar. In verb phrases or noun phrases, for example, modifiers such as prepositional phrases or adverbial phrases can be added nearly arbitrarily. In order to generalize the tree-based reordering rules, we extend the rule extraction. We do not only extract the rules from the complete child sequence, but also from any continuous child sequence in a constituent.

This way, we extract generalized rules which can be applied more often. Formally, for each subtree h → c_1 c_2 ... c_n that matches the constraints presented in Section 4.1, we extend the basic rule extraction to all contiguous child sequences: for all i, j with 1 ≤ i < j ≤ n, we also extract h → c_i ... c_j. It could be argued that the partial rules might not be as reliable as the specific rules. In Section 6 we will show that such generalizations are meaningful and can have a positive effect on the translation quality.
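The extended extraction over contiguous child sequences can be sketched as follows (the function name is ours):

```python
def partial_child_sequences(children):
    """All contiguous child sequences c_i ... c_j with i < j, i.e.
    every span of at least two adjacent children, including the full
    child sequence itself."""
    n = len(children)
    return [children[i:j + 1] for i in range(n) for j in range(i + 1, n)]
```

For a three-child constituent this yields the two adjacent pairs plus the full sequence, exactly the spans from which partial rules are extracted.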

4.2 Rule Application

During the training of the system all reordering rules are extracted from the parallel corpus. Prior to translation, the rules are applied to the original source text. Each rule is applied independently, producing a reordering variant of that sentence. The original sentence and all reordering variants are stored in a word lattice which is later used as input to the decoder. The rules may be applied recursively to already reordered paths. If more than one rule can be applied, all paths are added to the lattice unless the rules generate the same output; in this case only the rule with the highest probability is applied.

The edges in a word lattice for one sentence are assigned transition probabilities as follows. In the monotone path with the original word order, all transition probabilities are initially set to 1. In a reordered path, the first branching transition is assigned the probability of the rule that generated the path. All other transition probabilities in this path are set to 1. Whenever a reordered path branches from the monotone path, the probability of the branching edge is subtracted from the probability of the monotone edge. However, a minimum probability of 0.05 is reserved for the monotone edge. The score of the complete path is computed as the product of the transition probabilities. During decoding, the best path is searched for by including the score for the current path, weighted by the weight for the reordering model, in the log-linear model. In order to enable efficient decoding, we limit the lattice size by only applying rules with a probability higher than a predefined threshold.
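The probability bookkeeping at one branch point might look as follows (a hedged sketch; the subtraction with the 0.05 floor follows the description above, while the function name and the handling of multiple rules at one node are our assumptions):

```python
MIN_MONOTONE_PROB = 0.05  # minimum probability reserved for the monotone edge

def branch_probabilities(monotone_prob, rule_probs):
    """At one branching point of the word lattice, the first edge of
    each reordered path keeps the probability of the rule that created
    it; that mass is subtracted from the monotone edge, which never
    drops below the reserved minimum.  All other edges keep probability
    1, so a path's score is the product of its transition probabilities."""
    for p in rule_probs:
        monotone_prob = max(MIN_MONOTONE_PROB, monotone_prob - p)
    return monotone_prob, list(rule_probs)
```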

4.2.1 Recursive Rule Application

As mentioned above, the tree-based rules may be applied recursively. That means that after one rule is applied to the source sentence, a reordered path may


[Figure 3 (parse tree not reproduced): the German clause "Ich kann Ihnen nur sagen, dass ich aus anderen Bundesländern schon viele Anfragen bekommen habe ...", parsed into constituents (KOUS, PPER, PP, ADV, NP, VVPP, VAFIN), aligned with the English translation "I may just tell you that I got already lots of requests from other federal states".]

Figure 3: Example parse tree with separated verb particles

be reordered again. The reason is the structure of the syntactic parse trees. Verbs and their particles are typically not located within the same subtree. Hence, they cannot be covered by one reordering rule; a separate rule is extracted for each subtree. Figure 3 demonstrates this with an example. The two parts that belong to the verb in this German sentence, namely bekommen and habe, are not located within the same constituent. The finite verb habe forms a constituent of its own and the participle bekommen forms part of the VP constituent. In English the finite verb and the participle need to be placed next to each other. In order to rearrange the source language words according to the target language word order, the following two reordering movements need to be performed: the finite verb habe needs to be placed before the VP constituent, and the participle bekommen needs to be moved to the first position within the VP constituent. Only if both movements are performed can the right word order be generated.

However, the reordering model only considers one subtree at a time when extracting reordering rules. In this case two rules are learned, but if they are applied to the source sentence separately, they will end up in separate paths in the word lattice. The decoder then has to choose which path to translate: the one where the finite verb is placed before the VP constituent, or the path where the participle is at the first position in the VP constituent.

To counter this drawback, the rules may be applied recursively to the new paths created by our reordering rules. We use the same rules, but newly created paths are fed back into the queue of sentences to be reordered. However, we only apply the rules to parts of the reordered sentence that are still in the original word order, and we restrict the recursion depth.
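A simplified sketch of the queue-based recursion (we model the depth limit and duplicate filtering, but omit the restriction to still-monotone parts; `apply_rules` is a hypothetical function returning all single-rule reorderings of a sentence):

```python
from collections import deque

def apply_rules_recursively(sentence, apply_rules, max_depth=3):
    """Breadth-first recursive rule application: each newly created
    reordering variant is fed back into the queue to be reordered
    again, up to a fixed recursion depth, and duplicate variants are
    suppressed so they enter the lattice only once."""
    queue = deque([(tuple(sentence), 0)])
    seen = {tuple(sentence)}
    variants = []
    while queue:
        current, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for variant in apply_rules(list(current)):
            key = tuple(variant)
            if key not in seen:
                seen.add(key)
                variants.append(list(variant))
                queue.append((key, depth + 1))
    return variants
```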

5 Combining reordering methods

In order to get a deeper insight into their individual strengths, we compare the reordering methods on different linguistic levels and also combine them to investigate whether the gains can be increased. We address the word level using lexicalized reordering, the morphosyntactic level by POS-based reordering and the constituent level by tree-based reordering.

5.1 POS-based and tree-based rules

The training of the POS-based reordering is performed as described in (Rottmann and Vogel, 2007) for short-range reordering rules, such as VVIMP VMFIN PPER → PPER VMFIN VVIMP. Long-range reordering rules trained according to (Niehues and Kolss, 2009) include gaps matching longer spans of arbitrary POS sequences (VAFIN * VVPP → VAFIN VVPP *). The POS-based reordering used in our experiments always includes both short-range and long-range rules.
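A long-range rule pattern with a gap can be matched roughly like this (our own sketch, reading `*` as one or more arbitrary POS tags; this is an illustration of the notation, not the cited implementation):

```python
def matches(pos_seq, pattern):
    """True if the POS sequence matches the pattern, where '*' matches
    a span of one or more arbitrary POS tags."""
    if not pattern:
        return not pos_seq
    head, rest = pattern[0], pattern[1:]
    if head == '*':
        # try every non-empty span for the gap
        return any(matches(pos_seq[i:], rest)
                   for i in range(1, len(pos_seq) + 1))
    return bool(pos_seq) and pos_seq[0] == head and matches(pos_seq[1:], rest)
```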

The tree-based rules are trained separately as described above. First the POS-based rules are applied to the monotone path of the source sentence, and then the tree-based rules are applied independently, producing separate paths.

5.2 Rule-based and lexicalized reordering

As described in Section 4.2, we create word lattices that encode the reordering variants. The lexicalized reordering model stores, for each phrase pair, the probabilities of the possible reordering orientations at the incoming and outgoing phrase boundaries: monotone, swap and discontinuous. In order to apply the lexicalized reordering model on lattices, the original position of each word is stored in the lattice. While the translation hypothesis is generated, the reordering orientation with respect to the original position of the words is checked at each phrase boundary. The probability of the respective orientation is included as an additional score.

6 Results

The tree-based models are applied to German-English and German-French translation. Results are measured in case-sensitive BLEU (Papineni et al., 2002).

6.1 General System Description

First we describe the general system architecture which underlies all the systems used later on. We use a phrase-based decoder (Vogel, 2003) that takes word lattices as input. Optimization is performed using MERT with respect to BLEU. All POS-based or tree-based systems apply monotone translation only. Baseline systems without reordering rules use a distance-based reordering model. In addition, a lexicalized reordering model as described in (Koehn et al., 2005) is applied where indicated. POS tags and parse trees are generated using the TreeTagger (Schmid, 1994) and the Stanford Parser (Rafferty and Manning, 2008).

6.1.1 Data

The German-English system is trained on the data provided for the WMT 2012. news-test2010 and news-test2011 are used for development and testing. The data used for training, development and testing the German-French system is similar to the WMT data, except that 2 references are available. The training corpus for the reordering models consists of the word-aligned Europarl and News Commentary corpora, where POS tags and parse trees are generated for the source side.

6.2 German-English

We built systems using POS-based and tree-based reordering and show the impact of the individual models as well as their combination on the translation quality. The results are presented in Table 1.

For each system, two different setups were evaluated: first with a distance-based reordering model only (noLexRM), and second with an additional lexicalized reordering model (LexRM). The baseline system, which uses no reordering rules at all, allows a reordering window of 5 in the decoder for both setups. For all systems where reordering rules are applied, monotone translation is performed: since the rules take over the main reordering job, only monotone translation is necessary from the reordered word lattice input. In this experiment, we compare the tree-based rules with and without recursion, and the partial rules.

System                 |   noLexRM   |    LexRM
                       | Dev   Test  | Dev   Test
Baseline (no rules)    | 22.82 21.06 | 23.54 21.61
POS                    | 24.33 21.98 | 24.42 22.15
Tree                   | 24.01 21.92 | 24.24 22.01
Tree rec.              | 24.37 21.97 | 24.53 22.19
Tree rec. + par.       | 24.31 22.21 | 24.65 22.27
POS + Tree             | 24.57 22.21 | 24.91 22.47
POS + Tree rec.        | 24.61 22.39 | 24.81 22.45
POS + Tree rec. + par. | 24.80 22.45 | 24.78 22.70

Table 1: German-English

Compared to the baseline system using distance-based reordering only, 1.4 BLEU points can be gained by applying combined POS-based and tree-based reordering. The tree rules including partial rules and recursive application alone already achieve a better performance than the POS rules, but using them all in combination leads to an improvement of 0.4 BLEU points over the POS-based reordering alone. When lexicalized reordering is added, the relative improvements are similar: 1.1 BLEU points compared to the baseline and 0.55 BLEU points over the POS-based reordering. We can therefore argue that the individual rule types as well as the lexicalized reordering model seem to address complementary reordering issues and can be combined successfully to obtain an even better translation quality.

We applied only tree rules with a probability of 0.1 and higher. Partial rules require a threshold of 0.4 to be applied, since they are less reliable. In order to prevent the lattices from growing too large, the recursive rule application is restricted to a maximum recursion depth of 3. These values were set according to the results of preliminary experiments investigating the impact of the rule probabilities on the translation quality. Normal rules and partial rules are not mixed during recursive application.

With the best system we performed a final experiment on the official test set of the WMT 2012 and achieved a score of 23.73, which is 0.4 BLEU points better than the best constrained submission.

6.3 Translation Examples

Example 2 shows how the translation of the sentence presented above is improved by adding the tree-based rules. We can see that using tree constituents in the reordering model indeed addresses the problem of verb particles and especially missing verb parts in German.

Example 2:
Src:   ..., nachdem ich eine Weile im Internet gesucht habe.
Gloss: ..., after I a while in-the Internet searched have.
POS:   ... as I have for some time on the Internet.
+Tree: ... after I have looked for a while on the Internet.

Example 3 shows another aspect of how the tree-based rules work. With the help of the tree-based reordering rules, it is possible to relocate the separated prefix of German verbs and find the correct translation. The verb vorschlagen consists of the main verb (MV) schlagen (here conjugated as schlägt) and the prefix (PX) vor. Depending on the verb form and sentence type, the prefix must be separated from the main verb and is located in a different part of the sentence. The two parts of the verb can also have individual meanings. Although the translation of the verb stem would be correct if it were the full verb, not recognizing the separated prefix and ignoring it in translation corrupts the meaning of the sentence. With the help of the tree-based rules, the dependency between the main verb and its prefix is resolved and the correct translation can be chosen.

6.4 German-French

The same experiments were performed on German-French translation. For this language pair, similar improvements could be achieved by combining POS-based and tree-based reordering rules and additionally applying a lexicalized reordering model. Table 2 shows the results. Up to 0.7 BLEU points could be gained by adding tree rules, and another 0.1 by lexicalized reordering.

System                 |   noLexRM   |    LexRM
                       | Dev   Test  | Dev   Test
POS                    | 41.29 38.07 | 42.04 38.55
POS + Tree             | 41.94 38.47 | 42.44 38.57
POS + Tree rec.        | 42.35 38.66 | 42.80 38.71
POS + Tree rec. + par. | 42.48 38.79 | 42.87 38.88

Table 2: German-French

6.5 Binarized Syntactic Trees

Even though related work using syntactic parse trees in SMT for reordering purposes (Jiang et al., 2010) has reported an advantage of binarized parse trees over standard parse trees, our model did not benefit from binarized parse trees. It seems that the flat hierarchical structure of standard parse trees enables our reordering model to learn the order of the constituents most efficiently.

7 Human Evaluation

7.1 Sentence-based comparison

In order to gain an additional perspective on the impact of our tree-based reordering, we also provide a human evaluation of the translation output of the German-English system without the lexicalized reordering model. 250 translation hypotheses were selected to be annotated. For each input sentence, two translations generated by different systems were presented, one applying POS-based reordering only and the other applying both POS-based and tree-based reordering rules. The hypotheses were anonymized and presented in random order.

Example 3:
Src:   Die RPG Byty schlägt ihnen in den Schreiben eine Mieterhöhung von ca. 15 bis 38 Prozent vor.
Gloss: The RPG Byty proposes-MV them in the letters a rent increase of ca. 15 to 38 percent proposes-PX
POS:   The RPG Byty beats them in the letter, a rental increase of around 15 to 38 percent.
+Tree: The RPG Byty proposes them in the letters a rental increase of around 15 to 38 percent.

System                     | BLEU  | wins | %
POS rules                  | 21.98 | 58   | 23.2
POS + Tree rules rec. par. | 22.45 | 127  | 50.8

Table 3: Human evaluation of translation quality

Table 3 shows the BLEU scores of the analyzed systems and the manual judgement of comparative, subjective translation quality. In 50.8% of the sentences, the translation generated by the system using tree-based rules was judged to be better, whereas in 23.2% of the cases the system without tree-based rules was rated better. For 26% of the sentences the translation quality was very similar. Consequently, in 76.8% of the cases the tree-based system produced a translation that is either better or of the same quality as the system using POS rules only.

7.2 Analysis of verbs

Since the verbs in German-to-English translation were one of the issues that the tree-based reordering model should address, a more detailed analysis was performed on the first 165 sentences. We especially investigated the changes regarding the verbs between the translations stemming from the system using POS-based reordering only and the one using both the POS-based and the tree-based model. We examined three aspects of the verbs in the two translations. Each change introduced by the tree-based reordering model was first classified as either an improvement or a degradation of the translation quality. Secondly, it was assigned to one of the following categories: exist, position or form. In the case of an improvement, exist means that a verb existed in the translation due to the tree-based model which did not exist before. A degradation in this category means that a verb was removed from the translation when including the tree-based reordering model. An improvement or degradation in the category position or form means that the verb position or verb form was improved or degraded, respectively.

Table 4 shows the results of this analysis. In total, 48 of the verb changes were identified as improvements, while only 16 were regarded as degradations of translation quality. Improvements mainly concern improved verb position in the sentence and verbs that could only be translated with the help of the tree-based rules. Even though some degradations were also introduced by the tree-based reordering model, the improvements outweigh them.

Type         | all | exist | position | form
Improvements | 48  | 22    | 21       | 5
Degradations | 16  | 2     | 11       | 3

Table 4: Manual analysis of verbs

8 Conclusion

We have presented a reordering method based on syntactic tree constituents to model long-range reorderings in SMT more reliably. Furthermore, we combined reordering methods addressing different linguistic abstraction levels. Experiments on German-English and German-French translation showed that the best translation quality could be achieved by combining POS-based and tree-based rules. Adding a lexicalized reordering model increased the translation quality even further. In total we could reach up to 0.7 BLEU points of improvement by adding tree-based and lexicalized reordering compared to only POS-based rules. Up to 1.1 BLEU points were gained over a baseline system using a lexicalized reordering model.

A human evaluation showed a preference for the POS+tree-based reordering method in most cases. A detailed analysis of the verbs in the translation outputs revealed that the tree-based reordering model indeed addresses verb constructions and improves the translation of German verbs.

Acknowledgments

This work was partly achieved as part of the Quaero Programme, funded by OSEO, French State agency for innovation. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 287658.


References

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, ACL '05, pages 263–270, Stroudsburg, PA, USA.

Michael Collins, Philipp Koehn, and Ivona Kucerova. 2005. Clause Restructuring for Statistical Machine Translation. In Proceedings of ACL 2005, Ann Arbor, Michigan.

Marta R. Costa-jussà and José A. R. Fonollosa. 2006. Statistical Machine Reordering. In Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), Sydney, Australia.

Josep M. Crego and Nizar Habash. 2008. Using Shallow Syntax Information to Improve Word Alignment and Reordering for SMT. In ACL-HLT 2008, Columbus, Ohio, USA.

John DeNero and Jakob Uszkoreit. 2011. Inducing sentence structure from parallel corpora for reordering. In Proceedings of EMNLP 2011, pages 193–203, Edinburgh, Scotland, UK.

Dmitriy Genzel. 2010. Automatically learning source-side reordering rules for large scale machine translation. In Proceedings of COLING 2010, Beijing, China.

Nizar Habash. 2007. Syntactic preprocessing for statistical machine translation. In Proceedings of the 11th MT Summit.

Jie Jiang, Jinhua Du, and Andy Way. 2010. Improved phrase-based SMT with syntactic reordering patterns learned from lattice scoring. In Proceedings of AMTA 2010, Denver, CO, USA.

M. Khalilov, J. A. R. Fonollosa, and M. Dras. 2009. A new subtree-transfer approach to syntax-based reordering for statistical machine translation. In Proceedings of the 13th Annual Conference of the European Association for Machine Translation (EAMT'09), pages 198–204, Barcelona, Spain.

Philipp Koehn, Amittai Axelrod, Alexandra Birch Mayne, Chris Callison-Burch, Miles Osborne, and David Talbot. 2005. Edinburgh system description for the 2005 IWSLT speech translation evaluation. In International Workshop on Spoken Language Translation.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Annual Meeting of the Association for Computational Linguistics, volume 45, page 2.

Jan Niehues and Muntsin Kolss. 2009. A POS-Based Model for Long-Range Reorderings in SMT. In Fourth Workshop on Statistical Machine Translation (WMT 2009), Athens, Greece.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. Technical Report RC22176 (W0109-022), IBM Research Division, T. J. Watson Research Center.

Maja Popović and Hermann Ney. 2006. POS-based Word Reorderings for Statistical Machine Translation. In International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy.

Anna N. Rafferty and Christopher D. Manning. 2008. Parsing three German treebanks: lexicalized and unlexicalized baselines. In Proceedings of the Workshop on Parsing German, Columbus, Ohio.

Kay Rottmann and Stephan Vogel. 2007. Word Reordering in Statistical Machine Translation with a POS-Based Distortion Model. In TMI, Skövde, Sweden.

Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In International Conference on New Methods in Language Processing, Manchester, UK.

Christoph Tillmann. 2004. A unigram orientation model for statistical machine translation. In Proceedings of HLT-NAACL 2004: Short Papers, pages 101–104.

Stephan Vogel. 2003. SMT Decoder Dissected: Word Reordering. In International Conference on Natural Language Processing and Knowledge Engineering, Beijing, China.

Chao Wang, Michael Collins, and Philipp Koehn. 2007. Chinese syntactic reordering for statistical machine translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 737–745.

Fei Xia and Michael McCord. 2004. Improving a statistical MT system with automatically learned rewrite patterns. In Proceedings of COLING 2004, Geneva, Switzerland.

Kenji Yamada and Kevin Knight. 2001. A syntax-based statistical translation model. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, ACL '01, pages 523–530, Stroudsburg, PA, USA.

Yuqi Zhang, Richard Zens, and Hermann Ney. 2007. Chunk-Level Reordering of Source Language Sentences with Automatically Learned Rules for Statistical Machine Translation. In HLT-NAACL Workshop on Syntax and Structure in Statistical Translation, Rochester, NY, USA.

Andreas Zollmann and Ashish Venugopal. 2006. Syntax augmented machine translation via chart parsing. In Proceedings of the Workshop on Statistical Machine Translation, StatMT '06, pages 138–141, Stroudsburg, PA, USA.


Proceedings of the 7th Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 48–57, Atlanta, Georgia, 13 June 2013. ©2013 Association for Computational Linguistics

Combining Top-down and Bottom-up Search for Unsupervised Induction of Transduction Grammars

Markus SAERS and Karteek ADDANKI and Dekai WU
Human Language Technology Center

Dept. of Computer Science and Engineering
Hong Kong University of Science and Technology

{masaers|vskaddanki|dekai}@cs.ust.hk

Abstract

We show that combining both bottom-up rule chunking and top-down rule segmentation search strategies in purely unsupervised learning of phrasal inversion transduction grammars yields significantly better translation accuracy than either strategy alone. Previous approaches have relied on incrementally building larger rules by chunking smaller rules bottom-up; we introduce a complementary top-down model that incrementally builds shorter rules by segmenting larger rules. Specifically, we combine iteratively chunked rules from Saers et al. (2012) with our new iteratively segmented rules. These integrate seamlessly because both stay strictly within a pure transduction grammar framework, inducing under matching models during both training and testing, instead of decoding under a completely different model architecture than is assumed during the training phases, which violates an elementary principle of machine learning and statistics. To be able to drive induction top-down, we introduce a minimum description length objective that trades off maximum likelihood against model size. We show empirically that combining the more liberal rule chunking model with a more conservative rule segmentation model results in significantly better translations than either strategy in isolation.

1 Introduction

In this paper we combine both bottom-up chunking and top-down segmentation as search directions in the unsupervised pursuit of an inversion transduction grammar (ITG); we also show that the combination of the resulting grammars is superior to either of them in isolation. For the bottom-up chunking approach we use the method reported in Saers et al. (2012), and for the top-down segmentation approach we introduce a minimum description length (MDL) learning objective. The new learning objective is similar to the Bayesian maximum a posteriori objective, and makes it possible to learn top-down, which is impossible using maximum likelihood, as the initial grammar that rewrites the start symbol to all sentence pairs in the training data already maximizes the likelihood of the training data. Since both approaches result in stochastic ITGs, they can easily be combined into a single stochastic ITG, which allows for seamless combination. The point of our present work is that the two different search strategies result in very different grammars, so that the combination of them is superior in terms of translation accuracy to either of them in isolation.

The transduction grammar approach has the advantage that induction, tuning and testing are optimized on the exact same underlying model. This used to be a given in machine learning and statistical prediction, but has been largely ignored in the statistical machine translation (SMT) community, where most current SMT approaches learn phrase translations that (a) require enormous amounts of run-time memory, and (b) contain a high degree of redundancy. In particular, phrase-based SMT models such as Koehn et al. (2003) and Chiang (2007) often search for candidate translation segments and transduction rules by committing to a word alignment that is completely alien to the grammar, as it is learned with very different models (Brown et al. (1993), Vogel et al. (1996)), whose output is then combined heuristically to form the alignment actually used to extract lexical segment translations (Och


and Ney, 2003). The fact that it is even possible to improve the performance of a phrase-based direct translation system by tossing away most of the learned segmental translations (Johnson et al., 2007) illustrates the above points well.

Transduction grammars can also be induced from

treebanks instead of unannotated corpora, which cuts down the vast search space by enforcing additional, external constraints. This approach was pioneered by Galley et al. (2006), and there has been a lot of research since, usually referred to as tree-to-tree, tree-to-string and string-to-tree, depending on where the analyses are found in the training data. This complicates the learning process by adding external constraints that are bound to match the translation model poorly; grammarians of English should not be expected to care about its relationship to Chinese. It does, however, constitute a way to borrow nonterminal categories that help the translation model.

It is also possible for the word alignments leading

to phrase-based SMT models to be learned through transduction grammars (see for example Cherry and Lin (2007), Zhang et al. (2008), Blunsom et al. (2008), Saers and Wu (2009), Haghighi et al. (2009), Blunsom et al. (2009), Saers et al. (2010), Blunsom and Cohn (2010), Saers and Wu (2011), Neubig et al. (2011), Neubig et al. (2012)). Even when the SMT model is hierarchical, most of the information encoded in the grammar is tossed away when the learned model is reduced to a word alignment. A word alignment can only encode the lexical relationships that exist between a sentence pair according to a single parse tree, which means that the rest of the model, i.e. the alternative parses and the syntactic structure, is ignored.

The minimum description length (MDL) objective

that we will be using to drive the learning provides a way to escape the maximum-likelihood-of-the-data-given-the-model optimum that we start out with. However, going only by MDL will also lead to a degenerate case, where the size of the grammar is allowed to shrink regardless of how unlikely the corpus becomes. Instead, we will balance the length of the grammar with the probability of the corpus given the grammar. This has a natural Bayesian interpretation where the length of the grammar acts as a prior over the structure of the grammar.

Similar approaches have been used before, but to

induce monolingual grammars. Stolcke and Omohundro (1994) use a method similar to MDL called Bayesian model merging to learn the structure of hidden Markov models as well as stochastic context-free grammars. The SCFGs are induced by allowing sequences of nonterminals to be replaced with a single nonterminal (chunking) as well as allowing two nonterminals to merge into one. Grünwald (1996) uses it to learn nonterminal categories in a context-free grammar. It has also been used to interpret visual scenes by classifying the activity that goes on in a video sequence (Si et al., 2011). Our work in this paper is markedly different even to the previous NLP work in that (a) we induce an inversion transduction grammar (Wu, 1997) rather than a monolingual grammar, and (b) we focus on learning the terminal segments rather than the nonterminal categories.

Similar Bayesian approaches to finding the

model structure of ITGs have been tried before, but only to generate alignments that mismatched translation models are then trained on, rather than using the ITG directly as the translation model, which we do. Zhang et al. (2008) use variational Bayes with a sparsity prior over the parameters to prevent the size of the grammar from exploding when allowing adjacent terminals in the Viterbi biparses to chunk together. Blunsom et al. (2008), Blunsom et al. (2009) and Blunsom and Cohn (2010) use Gibbs sampling to find good phrasal translations. Neubig et al. (2011) and Neubig et al. (2012) use a method more similar to ours, but with a Pitman-Yor process as prior over the structures.

The idea of iteratively segmenting the existing

sentence pairs to find good phrasal translations has also been tried before; Vilar and Vidal (2005) introduce the Recursive Alignment Model, which recursively determines whether a bispan is a good enough translation on its own (using IBM model 1), or whether it should be split into two bispans (in either straight or inverted order). The model uses the length of the input sentence to determine whether to split or not, and uses very limited local information about the split point to determine where to split. Training the parameters is done with a maximum likelihood objective. In contrast, our model is one single generative model (as opposed to an ad hoc model), trained with a minimum description length objective (rather than trying to maximize the probability of the training data).

The rest of the paper is structured so that we first

take a closer look at the minimum description length principle that will be used to drive the top-down search (Section 2). We then show how the top-down grammar is learned (Sections 3 and 4), before showing how we combine the new grammar with that of Saers et al. (2012) (Section 5). We then detail the experimental setup that will substantiate our claims empirically (Section 6) before interpreting the results of those experiments (Section 7). Finally, we offer some conclusions (Section 8).

2 Minimum description length

The minimum description length principle is about finding the optimal balance between the size of a model and the size of some data given the model (Solomonoff (1959), Rissanen (1983)). Consider the information-theoretic problem of encoding some data with a model, and then sending both the encoded data and the information needed to decode the data (the model) over a channel; the minimum description length would be the minimum number of bits sent over the channel. The encoded data can be interpreted as carrying the information necessary to disambiguate the ambiguities or uncertainties that the model has about the data. Theoretically, the model can grow in size and become more certain about the data, and it can shrink in size and become more uncertain about the data. An intuitive interpretation of this is that the exceptions, which are part of the encoded data, can be moved into the model itself. By doing so, the size of the model increases, but there is no longer an exception that needs to be conveyed about the data. Some "exceptions" occur frequently enough that it is a good idea to incorporate them into the model, and some do not; finding the optimal balance minimizes the total description length.

Formally, the description length (DL) is:

DL (M,D) = DL (D|M) + DL (M) (1)

where M is the model and D is the data. Note the clear parallel to probabilities that have been moved into the logarithmic domain.

In natural language processing, we never have complete data to train on, so we need our models to generalize to unseen data. A model that is very certain about the training data runs the risk of not being able to generalize to new data: it is over-fitting. This is bad enough when estimating the parameters of a transduction grammar, and catastrophic when inducing the structure of the grammar. The key concept that we want to capture when learning the structure of a transduction grammar is generalization: the property that allows it to translate new, unseen input. The challenge is to pin down what generalization actually is, and how to measure it.

One property of generalization for grammars is that it will lower the probability of the training data. This may seem counterintuitive, but can be understood as moving some of the probability mass away from the training data and putting it on unseen data. A second property is that rules that are specific to the training data can be eliminated from the grammar (or replaced with less specific rules that generate the same thing). The second property shortens the description of the grammar, and the first makes the description of the corpus given the grammar longer. That is: generalization raises the first term and lowers the second in Equation 1. A good generalization lowers the total description length, whereas a poor one raises it; a good generalization trades a little data certainty for more model parsimony.

2.1 Measuring the length of a corpus

The information-theoretic view of the problem also gives a hint at the operationalization of length. Shannon (1948) stipulates that the number of bits it takes to encode that a probabilistic variable has taken a certain value can be as little as the negative logarithmic probability of that outcome. Following this, the description length of the parallel corpus given the transduction grammar is DL(C|G) = -log2(P(C|G)), where C is the corpus and G is the grammar.

2.2 Measuring the length of an ITG

Since information theory deals with encoding sequences of symbols, we need some way to serialize an inversion transduction grammar (ITG) into a message whose length can be measured.

To serialize an ITG, we first need to determine the alphabet that the message will be written in. We need one symbol for every nonterminal, L0-terminal and L1-terminal. We will also make the assumption that all these symbols are used in at least one rule, so that it is sufficient to serialize the rules in order to express the entire grammar. To serialize the rules, we need some kind of delimiter to know where one rule starts and the next ends; we exploit the fact that we also need to specify whether a rule is straight or inverted (unary rules are assumed to be straight), and merge these two functions into one symbol. This gives us the union of the symbols of the grammar and the set {[], ⟨⟩}, where [] signals the beginning of a straight rule, and ⟨⟩ signals the beginning of an inverted rule. The serialized format of a rule is: rule type/start marker, followed by the left-hand side nonterminal, followed by all right-hand side symbols. The symbols on the right-hand sides are either nonterminals or biterminals—pairs of L0-terminals and L1-terminals that model translation equivalences. The serialized form of a grammar is the serialized form of all its rules, concatenated. Consider the following toy grammar:

S → A,    A → ⟨AA⟩,    A → [AA],
A → have/有,    A → yes/有,    A → yes/是

Its serialized form would be:

[]SA⟨⟩AAA[]AAA[]Ahave有[]Ayes有[]Ayes是

Now we can again turn to information theory to arrive at an encoding for this message. Assuming a uniform distribution over the symbols, each symbol will require -log2(1/N) bits to encode, where N is the number of different symbols—the type count. The above example has 8 symbols, meaning that each symbol requires 3 bits. The entire message is 23 symbols long, which means that we need 69 bits to encode it.
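The 69-bit figure can be reproduced mechanically. Below is a minimal sketch (the function and the rule-tuple representation are our own illustration, not the authors' code) that serializes a rule set and measures its length under a uniform code over the symbol alphabet:

```python
import math

def serialized_length_bits(rules):
    """Length in bits of a serialized ITG under a uniform code over its symbols.

    rules: list of (marker, lhs, rhs) where marker is "[]" (straight) or
    "<>" (inverted), lhs is a nonterminal, and rhs lists nonterminals and
    terminal symbols (a biterminal contributes its L0 and L1 symbols
    separately, matching the symbol counts in the text).
    """
    message = []
    for marker, lhs, rhs in rules:
        message.extend([marker, lhs, *rhs])
    n_types = len(set(message))              # alphabet size N
    bits_per_symbol = -math.log2(1 / n_types)
    return len(message) * bits_per_symbol

# The toy grammar from the text: 8 symbol types, 23 message symbols.
toy = [
    ("[]", "S", ["A"]),
    ("<>", "A", ["A", "A"]),
    ("[]", "A", ["A", "A"]),
    ("[]", "A", ["have", "有"]),
    ("[]", "A", ["yes", "有"]),
    ("[]", "A", ["yes", "是"]),
]
```

Running `serialized_length_bits(toy)` yields 69.0 bits, matching the calculation above.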

3 Model initialization

Rather than starting out with a general transduction grammar and fitting it to the training data, we do the exact opposite: we start with a transduction grammar that fits the training data as well as possible, and generalize from there. The transduction grammar that fits the training data best is the one where the start symbol rewrites to the full sentence pairs that it has to generate. It is also possible to add any number of nonterminal symbols in the layer between the start symbol and the bisentences without altering the probability of the training data. We take advantage of this by allowing one intermediate symbol, so that the start symbol conforms to the normal form and always rewrites to precisely one nonterminal symbol. This violates the MDL principle, as the introduction of new symbols, by definition, makes the description of the model longer, but conforming to the normal form of ITGs was deemed more important than strictly minimizing the description length. Our initial grammar thus looks like this:

S → A,
A → e_0..T_0 / f_0..V_0,
A → e_0..T_1 / f_0..V_1,
...,
A → e_0..T_N / f_0..V_N

where S is the start symbol, A is the nonterminal, N is the number of sentence pairs in the training corpus, T_i is the length of the ith output sentence (which makes e_0..T_i the ith output sentence), and V_i is the length of the ith input sentence (which makes f_0..V_i the ith input sentence).
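As a sketch (our own illustrative representation, not the authors' implementation), the initialization step can be written as:

```python
def initialize_grammar(corpus):
    """Build the initial ITG: S rewrites to A, and A rewrites to every
    full sentence pair of the training corpus as one biterminal.

    corpus: list of (output_tokens, input_tokens) pairs.
    Rules are (marker, lhs, rhs) tuples; "[]" marks a straight rule.
    """
    rules = [("[]", "S", ["A"])]
    for e_tokens, f_tokens in corpus:
        # One maximally specific terminal rule per training bisentence.
        rules.append(("[]", "A", [(tuple(e_tokens), tuple(f_tokens))]))
    return rules
```

This grammar assigns its entire probability mass to the training bisentences, which is the starting point that the segmentation procedure of Section 4 then generalizes.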

4 Model generalization

To generalize the initial inversion transduction grammar, we need to identify parts of the existing biterminals that could validly be used in isolation, and allow them to combine with other segments. This is the very feature that allows a finite transduction grammar to generate an infinite set of sentence pairs. Doing so moves some of the probability mass, which was concentrated in the training data, to unseen data—the very definition of generalization. Our general strategy is to propose a number of sets of biterminal rules along with places to segment them, evaluate how the description length would change if we were to apply one of these sets of segmentations to the grammar, and commit to the best set. That is: we do a greedy search over the power set of possible segmentations of the rule set. As we will see, this intractable problem can be reasonably efficiently approximated, which is what we have implemented and tested.

The key component in the approach is the ability to evaluate how the description length would change if a specific segmentation were made in the grammar.


This can then be extended to a set of segmentations, which only leaves the problem of generating suitable sets of segmentations.

The key to a successful segmentation is to maximize the potential for reuse: any segment that can be reused saves model size. Consider the terminal rule:

A → five thousand yen is my limit/我最多出五千日元

(Chinese gloss: ’wŏ zùi dūo chū wŭ qīan rì yúan’). This rule can be split into three rules:

A → ⟨AA⟩,
A → five thousand yen/五千日元,
A → is my limit/我最多出

Note that the original rule consists of 16 symbols (in our encoding scheme), whereas the three new rules consist of 4 + 9 + 9 = 22 symbols. It is reasonable to believe that the bracketing inverted rule is already in the grammar, but this still leaves 18 symbols, which is decidedly longer than 16 symbols—and we need the length to become shorter if we want to see a net gain, since the length of the corpus given the grammar is likely to be longer with the segmented rules. What we really need to do is find a way to reuse the lexical rules that came out of the segmentation. Now suppose the grammar also contained this terminal rule:

A → the total fare is five thousand yen/总共的费用是五千日元

(Chinese gloss: ’zŏng gòng de fèi yòng shì wŭ qīan rì yúan’). This rule can also be split into three rules:

A → [AA],

A → the total fare is/总共的费用是,

A → five thousand yen/五千日元

Again, we assume that the structural rule is already present in the grammar; the old rule was 19 symbols long, and the two new terminal rules are 12 + 9 = 21 symbols long. Again we are out of luck, as the new rules are longer than the old one, and three rules are likely to be less probable than one rule during parsing. The way to make this work is to realize that the two existing rules share a bilingual affix—a biaffix: "five thousand yen" translating into "五千日元". If we make the two changes at the same time, we get rid of 16 + 19 = 35 symbols worth of rules, and introduce a mere 9 + 9 + 12 = 30 symbols worth of rules (assuming the structural rules are already in the grammar). Making these two changes at the same time is essential, as the five saved symbols can be used to offset the likely increase in the length of the corpus given the grammar. And of course: the more rules we can find with shared biaffixes, the more likely we are to find a good set of segmentations.

Our algorithm takes advantage of the above observation by focusing on the biaffixes found in the training data. Each biaffix defines a set of lexical rules paired up with a possible segmentation. We evaluate the biaffixes by estimating the change in description length associated with committing to all the segmentations defined by a biaffix. This allows us to find the best set of segmentations, but rather than committing only to the one best set, we collect all sets which would improve the description length, and try to commit to as many of them as possible. The pseudocode for our algorithm is as follows:

G                     // The grammar
biaffixes_to_rules    // Maps biaffixes to the rules they occur in
biaffixes_delta = []  // A list of biaffixes and their DL impact on G
for each biaffix b:
    delta = eval_dl(b, biaffixes_to_rules[b], G)
    if delta < 0:
        biaffixes_delta.push(b, delta)
sort_by_delta(biaffixes_delta)
for each b:delta pair in biaffixes_delta:
    real_delta = eval_dl(b, biaffixes_to_rules[b], G)
    if real_delta < 0:
        G = make_segmentations(b, biaffixes_to_rules[b], G)
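The symbol accounting in the worked example (16 vs. 22, 19 vs. 21, and the combined 35 vs. 30) can be reproduced with a small helper. This is our own sketch; in particular, treating each Chinese character as one terminal symbol is an assumption about the encoding:

```python
def rule_symbols(out_tokens, in_tokens, n_nonterminals=0):
    # type/start marker + left-hand side + right-hand side symbols
    return 1 + 1 + n_nonterminals + len(out_tokens) + len(in_tokens)

old1 = rule_symbols("five thousand yen is my limit".split(), list("我最多出五千日元"))
old2 = rule_symbols("the total fare is five thousand yen".split(), list("总共的费用是五千日元"))
shared = rule_symbols("five thousand yen".split(), list("五千日元"))
rest1 = rule_symbols("is my limit".split(), list("我最多出"))
rest2 = rule_symbols("the total fare is".split(), list("总共的费用是"))

saved = (old1 + old2) - (shared + rest1 + rest2)  # net symbols saved
```

With the structural rules assumed already in the grammar, `old1` is 16, `old2` is 19, the three lexical rules cost 9 + 9 + 12, and `saved` is 5, matching the text.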

The methods eval_dl, sort_by_delta and make_segmentations respectively evaluate the impact on description length that committing to a biaffix would cause, sort a list of biaffixes according to this delta, and apply all the changes associated with a biaffix to the grammar.

Evaluating the impact on description length breaks down into two parts: the difference in description length of the grammar, DL(G') - DL(G) (where G' is the grammar that results from applying all the changes that committing to a biaffix dictates), and the difference in description length of the corpus given the grammar, DL(C|G') - DL(C|G). These two quantities are simply added up to get the total change in description length.

The difference in grammar length is calculated as described in Section 2.2. The difference in description length of the corpus given the grammar can be calculated by biparsing the corpus, since DL(C|G') = -log2(P(C|p')) and DL(C|G) = -log2(P(C|p)), where p' and p are the rule probability functions of G' and G respectively. Biparsing is, however, a very costly process that we do not want to have inside a loop. Instead, we assume that we have the original corpus probability (through biparsing outside the loop), and estimate the new corpus probability from it in closed form. Given that we are splitting the rule r0 into the three rules r1, r2 and r3, and that the probability mass of r0 is distributed uniformly over the new rules, the new rule probability function p' will be identical to p, except that:

p'(r0) = 0,
p'(r1) = p(r1) + (1/3) p(r0),
p'(r2) = p(r2) + (1/3) p(r0),
p'(r3) = p(r3) + (1/3) p(r0)

Since we have eliminated all the occurrences of r0 and replaced them with combinations of r1, r2 and r3, the probability of the corpus given this new rule probability function will be:

P(C|p') = P(C|p) · (p'(r1) p'(r2) p'(r3)) / p(r0)

To make this into a description length, we take the negative logarithm of the above, which results in:

DL(C|G') = DL(C|G) - log2( (p'(r1) p'(r2) p'(r3)) / p(r0) )

The difference in description length of the corpus given the grammar can now be expressed as:

DL(C|G') - DL(C|G) = -log2( (p'(r1) p'(r2) p'(r3)) / p(r0) )

To calculate the impact of a set of segmentations, we need to take all the changes into account in one go. We do this in a two-pass fashion: first calculating the new probability function (p') and the change in grammar description length (taking care not to count the same rule twice), and then, in a second pass, calculating the change in corpus description length.
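A minimal sketch of the closed-form estimate for a single three-way split (our own illustrative code; the rule representation and function name are assumptions, and the full algorithm applies this over whole sets of segmentations):

```python
import math

def corpus_dl_delta(p, r0, r1, r2, r3):
    """Estimate DL(C|G') - DL(C|G) when r0 is split into r1, r2, r3,
    redistributing r0's probability mass uniformly over the new rules.

    p: dict mapping rules to probabilities under the current grammar.
    """
    share = p[r0] / 3.0
    p_new = dict(p)
    p_new[r0] = 0.0
    for r in (r1, r2, r3):
        p_new[r] = p_new.get(r, 0.0) + share
    # -log2( p'(r1) p'(r2) p'(r3) / p(r0) )
    return -math.log2(p_new[r1] * p_new[r2] * p_new[r3] / p[r0])
```

This delta is typically positive (the corpus becomes more expensive to encode) and is offset against the savings in grammar description length before committing to a segmentation.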

5 Model combination

The model we learn by iteratively subsegmenting the training data is guaranteed to be parsimonious while retaining a decent fit to the training data. These are desirable qualities, but there is a real risk that we failed to make some generalization that we should have made; to counter this risk, we can use a model trained under more liberal conditions. We chose the approach taken by Saers et al. (2012) for two reasons: (a) the model has the same form as our model, which means that we can integrate it seamlessly, and (b) their aims are similar to ours but their method differs significantly; specifically, they let the model grow in size as long as the data shrinks in size. Both these qualities make it a suitable complement to our model.

Assuming we have two grammars, Ga and Gb, that we want to combine, the interpolation parameter α determines the probability function of the combined grammar such that:

pa+b (r) = αpa (r) + (1 − α)pb (r)

for all rules r in the union of the two rule sets, where pa+b is the rule probability function of the combined grammar and pa and pb are the rule probability functions of Ga and Gb respectively. Some initial experiments indicated that an α value of about 0.4 was reasonable (when Ga was the grammar obtained through the training scheme outlined above, and Gb was the grammar obtained through the training scheme of Saers et al. (2012)), so we use 0.4 in this paper.
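The interpolation itself is straightforward; a sketch (our own code, treating each grammar's probability function as a rule-to-probability dict) is:

```python
def interpolate_grammars(pa, pb, alpha=0.4):
    """Linear interpolation over the union of two rule sets:
    p_{a+b}(r) = alpha * pa(r) + (1 - alpha) * pb(r)."""
    return {
        r: alpha * pa.get(r, 0.0) + (1.0 - alpha) * pb.get(r, 0.0)
        for r in set(pa) | set(pb)
    }
```

Rules absent from one grammar simply contribute zero probability from that side, so the result remains a proper distribution when both inputs are.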

6 Experimental setup

We have made the claim that iterative top-down segmentation guided by the objective of minimizing the description length gives a better precision grammar than iterative bottom-up chunking, and that the combination of the two gives superior results to either approach in isolation. We have outlined how this can be done in practice, and we now substantiate that claim empirically.

Figure 1: Description length in bits over the different iterations of top-down search. The lower portion represents DL(G) and the upper portion represents DL(C|G).

We will initialize a stochastic bracketing inversion transduction grammar (BITG) to rewrite its one nonterminal symbol directly into all the sentence pairs of the training data (iteration 0). We will then segment the grammar iteratively a total of seven times (iterations 1–7). For each iteration we will record the change in description length and test the grammar. Each iteration requires us to biparse the training data, which we do with the cubic-time algorithm described in Saers et al. (2009), with a beam width of 100.

As training data, we use the IWSLT07 Chinese–English data set (Fordyce, 2007), which contains 46,867 sentence pairs of training data, 506 Chinese sentences of development data with 16 English reference translations, and 489 Chinese sentences with 6 English reference translations each as test data; all the sentences are taken from the travel domain. Since Chinese is written without whitespace, we use a tool that tries to clump characters together into more "word like" sequences (Wu, 1999).

As the bottom-up grammar, we will reuse the grammar learned in Saers et al. (2012); specifically, we will use the BITG that was bootstrapped from a bracketing finite-state transduction grammar (BFSTG) that has been chunked twice, giving biterminals where the monolingual segments are 0–4 tokens long. The bottom-up grammar is trained on the same

Figure 2: Number of rules learned during top-down search over the different iterations.

data as our model.

To test the learned grammars as translation models, we first tune the grammar parameters to the training data using expectation maximization (Dempster et al., 1977) and parse forests acquired with the above-mentioned biparser, again with a beam width of 100. To do the actual decoding, we use our in-house ITG decoder. The decoder uses a CKY-style parsing algorithm (Cocke, 1969; Kasami, 1965; Younger, 1967) and cube pruning (Chiang, 2007) to integrate the language model scores. The decoder builds an efficient hypergraph structure which is then scored using both the induced grammar and the language model. The weights for the language model and the grammar are tuned towards BLEU (Papineni et al., 2002) using MERT (Och, 2003). We use the ZMERT (Zaidan, 2009) implementation of MERT, as it is robust and flexible while being loosely coupled with the decoder. We use SRILM (Stolcke, 2002) to train a trigram language model on the English side of the training data. To evaluate the quality of the resulting translations, we use BLEU and NIST (Doddington, 2002).

7 Experimental results

The results from running the experiments detailed in the previous section can be summarized in four graphs. Figures 1 and 2 show the size of our new, segmenting model during induction, in terms of description length and in terms of rule count. The initial ITG is at iteration 0, where the vast majority


Figure 3: Variations in BLEU score over different iterations. The thin line represents the baseline bottom-up search (Saers et al., 2012), the dotted line represents the top-down search, and the thick line represents the combined results.

of the size is taken up by the model (DL(G)), and very little by the data (DL(C|G))—just as we predicted. The trend over the induction phase is a sharp decrease in model size and a moderate increase in data size, with the overall size constantly decreasing. Note that, although the number of rules rises, the total description length decreases. Again, this is precisely what we expected. The size of the model learned according to Saers et al. (2012) is close to 30 Mbits—far off the chart. This shows that our new top-down approach is indeed learning a more parsimonious grammar than the bottom-up approach.

Figures 3 and 4 show the translation quality of the learned model. The thin flat lines show the quality of the bottom-up approach (Saers et al., 2012), whereas the thick curves show the quality of the new top-down model presented in this paper without (dotted line) and with the bottom-up model (solid line). Although the MDL-based model is better than the old model, the combination of the two is still superior. It is particularly encouraging to see that the over-fitting that seems to take place after iteration 3 with the MDL-based approach is ameliorated by the bottom-up model.

8 Conclusions

We have introduced a purely unsupervised learning scheme for phrasal stochastic inversion transduction grammars that is the first to combine two opposing ways of searching for the phrasal translations: a bottom-up rule chunking approach driven by a maximum likelihood (ML) objective, and a top-down rule segmenting approach driven by a minimum description length (MDL) objective. The combination approach takes advantage of the fact that the conservative top-down MDL-driven rule segmenting approach learns a very parsimonious, yet competitive, model when compared to a liberal bottom-up ML-driven approach. Results show that the combination of the two opposing approaches is significantly superior to either of them in isolation.

Figure 4: Variations in NIST score over different iterations. The thin line represents the baseline bottom-up search (Saers et al., 2012), the dotted line represents the top-down search, and the thick line represents the combined results.

9 Acknowledgements

This material is based upon work supported in part by the Defense Advanced Research Projects Agency (DARPA) under BOLT contract no. HR0011-12-C-0016, and GALE contract nos. HR0011-06-C-0022 and HR0011-06-C-0023; by the European Union under the FP7 grant agreement no. 287658; and by the Hong Kong Research Grants Council (RGC) research grants GRF620811, GRF621008, and GRF612806. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, the EU, or RGC.


References

P. Blunsom and T. Cohn. Inducing synchronous grammars with slice sampling. In HLT/NAACL-2010, pages 238–241, Los Angeles, California, June 2010.

P. Blunsom, T. Cohn, and M. Osborne. Bayesian synchronous grammar induction. In Proceedings of NIPS 21, Vancouver, Canada, December 2008.

P. Blunsom, T. Cohn, C. Dyer, and M. Osborne. A Gibbs sampler for phrasal synchronous grammar induction. In Proceedings of ACL/IJCNLP, pages 782–790, Suntec, Singapore, August 2009.

P. F. Brown, S. A. Della Pietra, V. J. Della Pietra, and R. L. Mercer. The Mathematics of Machine Translation: Parameter estimation. Computational Linguistics, 19(2):263–311, 1993.

C. Cherry and D. Lin. Inversion transduction grammar for joint phrasal translation modeling. In Proceedings of SSST, pages 17–24, Rochester, New York, April 2007.

D. Chiang. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228, 2007.

J. Cocke. Programming languages and their compilers: Preliminary notes. Courant Institute of Mathematical Sciences, New York University, 1969.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.

G. Doddington. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the 2nd International Conference on Human Language Technology Research, pages 138–145, San Diego, California, 2002.

C. S. Fordyce. Overview of the IWSLT 2007 evaluation campaign. In Proceedings of IWSLT, pages 1–12, 2007.

M. Galley, J. Graehl, K. Knight, D. Marcu, S. DeNeefe, W. Wang, and I. Thayer. Scalable inference and training of context-rich syntactic translation models. In Proceedings of COLING/ACL-2006, pages 961–968, Sydney, Australia, July 2006.

Peter Grünwald. A minimum description length approach to grammar inference. Lecture Notes in Artificial Intelligence, (1040):203–216, 1996.

A. Haghighi, J. Blitzer, J. DeNero, and D. Klein. Better word alignments with supervised ITG models. In Proceedings of ACL/IJCNLP-2009, pages 923–931, Suntec, Singapore, August 2009.

H. Johnson, J. Martin, G. Foster, and R. Kuhn. Improving translation quality by discarding most of the phrasetable. In Proceedings of EMNLP-CoNLL-2007, pages 967–975, Prague, Czech Republic, June 2007.

T. Kasami. An efficient recognition and syntax analysis algorithm for context-free languages. Technical Report AFCRL-65-00143, Air Force Cambridge Research Laboratory, 1965.

P. Koehn, F. J. Och, and D. Marcu. Statistical Phrase-Based Translation. In Proceedings of HLT/NAACL-2003, volume 1, pages 48–54, Edmonton, Canada, May/June 2003.

G. Neubig, T. Watanabe, E. Sumita, S. Mori, and T. Kawahara. An unsupervised model for joint phrase alignment and extraction. In Proceedings of ACL/HLT-2011, pages 632–641, Portland, Oregon, June 2011.

G. Neubig, T. Watanabe, S. Mori, and T. Kawahara. Machine translation without words through substring alignment. In Proceedings of ACL-2012, pages 165–174, Jeju Island, Korea, July 2012.

F. J. Och and H. Ney. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51, 2003.

F. J. Och. Minimum error rate training in statistical machine translation. In Proceedings of ACL-2003, pages 160–167, Sapporo, Japan, July 2003.

K. Papineni, S. Roukos, T. Ward, and W. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL-2002, pages 311–318, Philadelphia, Pennsylvania, July 2002.

J. Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 11(2):416–431, June 1983.

M. Saers and D. Wu. Improving phrase-based translation via word alignments from Stochastic Inversion Transduction Grammars. In Proceedings of SSST-3, pages 28–36, Boulder, Colorado, June 2009.

M. Saers and D. Wu. Principled induction of phrasal bilexica. In Proceedings of EAMT-2011, pages 313–320, Leuven, Belgium, May 2011.

M. Saers, J. Nivre, and D. Wu. Learning stochastic bracketing inversion transduction grammars with a cubic time biparsing algorithm. In Proceedings of IWPT’09, pages 29–32, Paris, France, October 2009.

M. Saers, J. Nivre, and D. Wu. Word alignment with stochastic bracketing linear inversion transduction grammar. In Proceedings of HLT/NAACL-2010, pages 341–344, Los Angeles, California, June 2010.

M. Saers, K. Addanki, and D. Wu. From finite-state to inversion transductions: Toward unsupervised bilingual grammar induction. In Proceedings of COLING 2012: Technical Papers, pages 2325–2340, Mumbai, India, December 2012.

C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, 623–656, July, October 1948.

Z. Si, M. Pei, B. Yao, and S. Zhu. Unsupervised learning of event and-or grammar and semantics from video. In Proceedings of the 2011 IEEE International Conference on Computer Vision (ICCV), pages 41–48, November 2011.

R. J. Solomonoff. A new method for discovering the grammars of phrase structure languages. In IFIP Congress, pages 285–289, 1959.

A. Stolcke and S. Omohundro. Inducing probabilistic grammars by Bayesian model merging. In R. C. Carrasco and J. Oncina, editors, Grammatical Inference and Applications, pages 106–118. Springer, 1994.

A. Stolcke. SRILM – an extensible language modeling toolkit. In Proceedings of ICSLP-2002, pages 901–904, Denver, Colorado, September 2002.

J. M. Vilar and E. Vidal. A recursive statistical translation model. In ACL-2005 Workshop on Building and Using Parallel Texts, pages 199–207, Ann Arbor, June 2005.

S. Vogel, H. Ney, and C. Tillmann. HMM-based Word Alignment in Statistical Translation. In Proceedings of COLING-96, volume 2, pages 836–841, 1996.

D. Wu. Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora. Computational Linguistics, 23(3):377–403, 1997.

Z. Wu. LDC Chinese segmenter, 1999.

D. H. Younger. Recognition and parsing of context-free languages in time n^3. Information and Control, 10(2):189–208, 1967.

O. F. Zaidan. Z-MERT: A Fully Configurable Open Source Tool for Minimum Error Rate Training of Machine Translation Systems. The Prague Bulletin of Mathematical Linguistics, 91:79–88, 2009.

H. Zhang, C. Quirk, R. C. Moore, and D. Gildea. Bayesian learning of non-compositional phrases with synchronous parsing. In Proceedings of ACL-08: HLT, pages 97–105, Columbus, Ohio, June 2008.


Proceedings of the 7th Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 58–67, Atlanta, Georgia, 13 June 2013. ©2013 Association for Computational Linguistics

A Formal Characterization of Parsing Word Alignments by Synchronous Grammars with Empirical Evidence to the ITG Hypothesis

Gideon Maillette de Buy Wenniger∗
University of Amsterdam
[email protected]

Khalil Sima’an∗
University of Amsterdam
[email protected]

Abstract

Deciding whether a synchronous grammar formalism generates a given word alignment (the alignment coverage problem) depends on finding an adequate instance grammar and then using it to parse the word alignment. But what does it mean to parse a word alignment by a synchronous grammar? This is formally undefined until we define an unambiguous mapping between grammatical derivations and word-level alignments. This paper proposes an initial, formal characterization of alignment coverage as intersecting two partially ordered sets (graphs) of translation equivalence units, one derived by a grammar instance and another defined by the word alignment. As a first sanity check, we report extensive coverage results for ITG on automatic and manual alignments. Even for the ITG formalism, our formal characterization makes explicit many algorithmic choices often left underspecified in earlier work.

1 Introduction

The training data used by current statistical machine translation (SMT) models consists of source and target sentence pairs aligned together at the word level (word alignments). For the hierarchical and syntactically-enriched SMT models, e.g., (Chiang, 2007; Zollmann and Venugopal, 2006), this training data is used for extracting statistically weighted Synchronous Context-Free Grammars (SCFGs). Formally speaking, a synchronous grammar defines a set of (source-target) sentence pairs derived synchronously by the grammar. Contrary to common

∗ Institute for Logic, Language and Computation.

belief, however, a synchronous grammar (see e.g., (Chiang, 2005; Satta and Peserico, 2005)) does not accept (or parse) word alignments. This is because a synchronous derivation generates a tree pair with a bijective binary relation (links) between their nonterminal nodes. For deciding whether a given word alignment is generated/accepted by a given synchronous grammar, it is necessary to interpret the synchronous derivations down to the lexical level. However, it is not yet formally defined how to unambiguously interpret the synchronous derivations of a synchronous grammar as word alignments. One major difficulty is that synchronous productions, in their most general form, may contain unaligned terminal sequences. Consider, for instance, the relatively non-complex synchronous production

〈X → α X(1) β X(2) γ X(3), X → σ X(2) τ X(1) µ X(3)〉

where superscript (i) stands for aligned instances of nonterminal X and all Greek symbols stand for arbitrary non-empty terminal sequences. Given a word-aligned sentence pair, it is necessary to bind the terminal sequences by alignments consistent with the given word alignment, and then parse the word alignment with the thus enriched grammar rules. This is not complex if we assume that each of the source terminal sequences is contiguously aligned with a contiguous target sequence, but difficult if we assume arbitrary alignments, including many-to-one and non-contiguously aligned chunks.

One important goal of this paper is to propose a formal characterization of what it means to synchronously parse a word alignment. Our formal characterization is borrowed from the "parsing as intersection" paradigm, e.g., (Bar-Hillel et al., 1964; Lang, 1988; van Noord, 1995; Nederhof and Satta,


2004). Conceptually, our characterization makes use of three algorithms. Firstly, parse the unaligned sentence pair with the synchronous grammar to obtain a set of synchronous derivations, i.e., trees. Secondly, interpret the word alignment as generating a set of synchronous trees representing the recursive translation equivalence relations of interest1 perceived in the word alignment. And finally, intersect the sets of nodes in the two sets of synchronous trees to check whether the grammar can generate (parts of) the word alignment. The formal detail of each of these three steps is provided in Sections 3 to 5.

We think that alignment parsing is relevant for current research because it highlights the difference between alignments in training data and alignments accepted by a synchronous grammar (learned from data). This is useful for the literature on learning from word-aligned parallel corpora (e.g., (Zens and Ney, 2003; DeNero et al., 2006; Blunsom et al., 2009; Cohn and Blunsom, 2009; Riesa and Marcu, 2010; Mylonakis and Sima’an, 2011; Haghighi et al., 2009; McCarley et al., 2011)). A theoretical, formalized characterization of the alignment parsing problem is likely to improve the choices made in empirical work as well. We exemplify our claims by providing yet another empirical study of the stability of the ITG hypothesis. Our study highlights some of the technical choices left implicit in preceding work, as explained in the next section.

2 First application to the ITG hypothesis

A grammar formalism is a whole set/family of synchronous grammars. For example, ITG (Wu, 1997) defines a family of inversion transduction grammars differing among them in the exact set of synchronous productions, terminals and nonterminals. Given a synchronous grammar formalism and an input word alignment, a relevant theoretical question is whether there exists an instance synchronous grammar that generates the word alignment exactly. We will refer to this question as the alignment coverage problem. In this paper we propose an approach to the alignment coverage problem using the three-step solution proposed above for parsing word alignments by arbitrary synchronous grammars.

1The translation equivalence relations of interest may vary in kind, as we will exemplify later. The known phrase pairs are merely one possible kind.

Most current use of synchronous grammars is limited to a subclass using a pair of nonterminals, e.g., (Chiang, 2007; Zollmann and Venugopal, 2006; Mylonakis and Sima'an, 2011), thereby remaining within the confines of the ITG formalism (Wu, 1997). On the one hand, this is for reasons of computational complexity. On the other, this choice relies on existing empirical evidence for what we will call the “ITG hypothesis", freely rephrased as follows: the ITG formalism is sufficient for representing a major percentage of the reorderings in translation data in general.

Although checking whether a word alignment can be generated by ITG is far simpler than for arbitrary synchronous grammars, there is striking variation in the approaches taken in the existing literature, e.g., (Zens and Ney, 2003; Wellington et al., 2006; Søgaard and Wu, 2009; Carpuat and Wu, 2007; Søgaard and Kuhn, 2009; Søgaard, 2010). Søgaard and Wu (Søgaard and Wu, 2009) justifiably observe that the literature studying ITG alignment coverage makes conflicting choices in method and data, and reports significantly diverging alignment coverage scores. We hypothesize here that the major conflicting choices in method (what to count and how to parse) are likely due to the absence of a well-understood, formalized method for parsing word alignments even under ITG. In this paper we apply our formal approach to the ITG case, contributing new empirical evidence concerning the ITG hypothesis.

For our empirical study we exemplify our approach by detailing an algorithm dedicated to ITG in Normal Form (NF-ITG). While our algorithm is in essence equivalent to existing algorithms for checking the binarizability of permutations, e.g., (Wu, 1997; Huang et al., 2009), the formal foundations preceding it concern nailing down the choices made in parsing arbitrary word alignments, as opposed to (bijective) permutations. The formalization is our way to resolve some of the major points of difference in the existing literature.

We report new coverage results for ITG parsing of manual as well as automatic alignments, showing the contrast between the two kinds. While the latter seem built for phrase extraction, trading off precision for recall, the former are heavily marked with idiomatic expressions. Our coverage results make explicit a relevant dilemma. To hierarchically parse the current automatic word alignments exactly, we will need more general synchronous reordering mechanisms than ITG, with increased risk of exponential parsing algorithms (Wu, 1997; Satta and Peserico, 2005). But if we abandon these word alignments, we will face the exponential problem of learning to reorder arbitrary permutations, cf. (Tromble and Eisner, 2009). Our results also exhibit the importance of explicitly defining the units of translation equivalence when studying (ITG) coverage of word alignments. The more complex the choice of translation equivalence relations, the more difficult it is to parse the word alignments.

3 Translation equivalence in MT

In (Koehn et al., 2003), a translation equivalence unit (TEU) is a phrase pair: a pair of contiguous substrings of the source and target sentences such that the words on the one side align only with words on the other side (formal definitions follow). The hierarchical phrase pairs (Chiang, 2005; Chiang, 2007) are extracted by replacing one or more sub-phrase pairs, contained within a phrase pair, by pairs of linked variables. This defines a subsumption relation between hierarchical phrase pairs (Zhang et al., 2008). Actual systems, e.g., (Koehn et al., 2003; Chiang, 2007), set an upper bound on the length or the number of variables in the synchronous productions. For the purposes of our theoretical study, these practical limitations are irrelevant.

We give two definitions of translation equivalence for word alignments.2 The first one makes no assumptions about the contiguity of TEUs, while the second does require them to be contiguous substrings on both sides (i.e., phrase pairs).

As usual, s = s1...sm and t = t1...tn are the source and target sentences respectively. Let sσ be the source word at position σ in s and tτ the target word at position τ in t. An alignment link a ∈ a in a word alignment a is a pair of positions 〈σ, τ〉 such that 1 ≤ σ ≤ m and 1 ≤ τ ≤ n. For the sake of brevity, we will often talk about alignments without explicitly mentioning the associated source and target words, knowing that these can be readily obtained from the pair of positions and the sentence pair 〈s, t〉. Given a subset a′ ⊆ a we define wordss(a′) = {sσ | ∃X : 〈σ, X〉 ∈ a′} and wordst(a′) = {tτ | ∃X : 〈X, τ〉 ∈ a′}.

2Unaligned words tend to complicate the formalization unnecessarily. As usual, we also require that unaligned words first be grouped with aligned words adjacent to them before translation equivalence is defined for an alignment. This standard strategy allows us to informally discuss unaligned words in the following without loss of generality.

Now we consider triples (s′, t′, a′) such that a′ ⊆ a, s′ = wordss(a′) and t′ = wordst(a′). We define the translation equivalence units (TEUs) in the set TE(s, t, a) as follows:

Definition 3.1 (s′, t′, a′) ∈ TE(s, t, a) iff 〈σ, τ〉 ∈ a′ ⇒ (for all X, if 〈σ, X〉 ∈ a then 〈σ, X〉 ∈ a′) ∧ (for all X, if 〈X, τ〉 ∈ a then 〈X, τ〉 ∈ a′).

In other words, if some alignment link involving source position σ or target position τ is included in a′, then all alignments in a containing that position are in a′ as well. This definition allows a variety of complex word alignments such as the so-called Cross-serial Discontiguous Translation Units and Bonbons (Søgaard and Wu, 2009).
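Definition 3.1 amounts to a closure test on a subset of alignment links: a link set a′ is a TEU iff every link of a that shares a source or target position with a′ is itself in a′. As an illustration, here is a minimal Python sketch; the function name and the representation of links as (source, target) index pairs are ours, not the paper's.

```python
def is_teu(a_sub, a):
    """Definition 3.1 as a closure check: a_sub (a non-empty subset of the
    alignment a) is a TEU iff every link of a touching a source or target
    position used by a_sub is itself contained in a_sub."""
    a_sub, a = set(a_sub), set(a)
    if not a_sub or not a_sub <= a:
        return False
    src = {s for s, _ in a_sub}
    tgt = {t for _, t in a_sub}
    return all(link in a_sub for link in a
               if link[0] in src or link[1] in tgt)
```

For example, with a = {(0, 0), (0, 1), (1, 2)}, the singleton {(0, 0)} is not a TEU (it omits the link (0, 1) sharing source position 0), while {(0, 0), (0, 1)} and {(1, 2)} are.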

We also define the subsumption relation (partial order) <a as follows:

Definition 3.2 A TEU u2 = (s2, t2, a2) subsumes (<a) a TEU u1 = (s1, t1, a1) iff a1 ⊂ a2. The subsumption order will be represented by u1 <a u2.

Based on the subsumption relation we can partition TE(s, t, a) into two disjoint sets: atomic TEAtom(s, t, a) and composed TEComp(s, t, a).

Definition 3.3 u1 ∈ TE(s, t, a) is atomic iff there exists no u2 ∈ TE(s, t, a) such that u2 <a u1.

Now the set TEAtom(s, t, a) is simply the set of all atomic translation equivalents, and the set of composed translation equivalents is TEComp(s, t, a) = TE(s, t, a) \ TEAtom(s, t, a).

Based on the general definition of translation equivalence, we can now give a more restricted definition that allows only contiguous translation equivalents (phrase pairs):

Definition 3.4 (s′, t′, a′) constitutes a contiguoustranslation equivalent iff:

1. (s′, t′, a′) ∈ TE(s, t, a) and

2. Both s′ and t′ are contiguous substrings of s and t respectively.

This set of translation equivalents is the unlimited set of phrase pairs known from phrase-based machine translation (Koehn et al., 2003). The relation <a as well as the division into atomic and composed TEUs can straightforwardly be adapted to contiguous translation equivalents.
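Under Definition 3.4, the contiguous TEUs are exactly the (unlimited) phrase pairs of phrase-based MT. The following sketch enumerates them via the usual consistency check; it assumes 0-based positions, ignores the grouping of unaligned words (footnote 2), and uses hypothetical names.

```python
def contiguous_teus(a, m, n):
    """Enumerate contiguous TEUs (Definition 3.4) of an alignment a over a
    source of length m and target of length n, as pairs of (inclusive)
    source and target spans whose links are mutually closed."""
    a = set(a)
    pairs = []
    for i in range(m):
        for j in range(i, m):
            tgt = {t for s, t in a if i <= s <= j}
            if not tgt:
                continue  # require at least one link inside the span
            lo, hi = min(tgt), max(tgt)
            # closure: no link from the projected target span [lo, hi]
            # may point back outside the source span [i, j]
            if all(i <= s <= j for s, t in a if lo <= t <= hi):
                pairs.append(((i, j), (lo, hi)))
    return pairs
```

For the inverted alignment {(0, 1), (1, 0)} over a 2x2 sentence pair this yields three phrase pairs: the two single-word units and the full sentence pair.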

4 Grammatical translation equivalence

The derivations of a synchronous grammar can be interpreted as deriving a partially ordered set of TEUs as well. A finite derivation S →+ 〈s, t, aG〉 of an instance grammar G is a finite sequence of term-rewritings, where at each step of the sequence a single nonterminal is rewritten using a synchronous production of G. The set of finite derivations of G defines a language: a set of triples 〈s, t, aG〉 consisting of a source string of terminals s, a target string of terminals t and an alignment between their grammatical constituents. Crucially, the alignment aG is obtained by recursively interpreting the alignment relations embedded in the synchronous grammar productions in the derivation for all constituents, and concerns constituent alignments (as opposed to word alignments).

Grammatical translation equivalents TEG(s, t) A synchronous derivation S →+ 〈s, t, aG〉 can be viewed as a deductive proof that 〈s, t, aG〉 is a grammatical translation equivalence unit (grammatical TEU). Along the way, a derivation also proves other constituent-level (sub-sentential) units to be TEUs.

We define a sub-sentential grammatical TEU of 〈s, t, aG〉 to consist of a triple 〈sx, tx, ax〉, where sx and tx are two subsequences3 (of s and t respectively), derived synchronously from the same constituent X in some non-empty “tail" of a derivation S →+ 〈s, t, aG〉; importantly, by the workings of G, the alignment ax ⊆ aG fulfills the requirement that a word in sx or in tx is linked to another word by aG iff it is also linked that way by ax (i.e., no alignment links start out from terminals in sx or tx and link to terminals outside them). We will denote by TEG(s, t) the set of all grammatical TEUs for the sentence pair 〈s, t〉 derived by G.

3A subsequence of a string is a subset of the word–position pairs that preserves the order but does not necessarily constitute a contiguous substring.

Figure 2: Alignment with both contiguous and discontiguous TEUs (example from Europarl En-Ne).

Subsumption relation <G(s,t) Besides deriving TEUs, a derivation also shows how the different TEUs compose together into larger TEUs according to the grammar. We are interested in the subsumption relation: one grammatical TEU/constituent (u1) subsumes another (u2) (written u2 <G(s,t) u1) iff the latter (u2) is derived within a finite derivation of the former (u1).4

The set of grammatical TEUs for a finite set of derivations of a given sentence pair is the union of the sets defined for the individual derivations. Similarly, the relation between TEUs for a set of derivations is defined as the union of the individual relations.

5 Alignment coverage by intersection

Let a word-aligned sentence pair 〈s, t, a〉 be given, and let us assume that we have a definition of an ordered set TE(s, t, a) with partial order <a. We will say that a grammar formalism covers a iff there exists an instance grammar G that fulfills two intersection equations simultaneously:5

(1) TE(s, t, a) ∩ TEG(s, t) = TE(s, t, a)

(2) <a ∩ <G(s,t) = <a

In the second equation, the intersection of partial orders is based on the standard view that these are in essence also sets of ordered pairs. In practice, it is sufficient to implement an algorithm that shows that G derives every TEU in TE(s, t, a), and that the subsumption relation <a between TEUs in a is realized by the derivations of G that derive TE(s, t, a). In effect, this way every TEU that subsumes other TEUs must be derived recursively, while the minimal, atomic units (not subsuming any others) must be derived using the lexical productions (endowed with internal word alignments) of NF-ITG. Again, the rationale behind this choice is that the atomic units constitute fixed translation expressions (idiomatic TEUs) which cannot be composed from other TEUs, and hence belong in the lexicon. We will exhibit coverage algorithms for doing so for NF-ITG for the two kinds of semantic interpretations of word alignments.

4Note that we define this relation exhaustively, thereby defining the set of paths in synchronous trees derived by the grammar for 〈s, t〉. Hence, the subsumption relation can be seen to define a forest of synchronous trees.

5In this work we have restricted this definition to the full-coverage (i.e., subset) version, but it is imaginable that other measures can be based on the cardinality (size) of the intersection in terms of covered TEUs, following measures found in (Søgaard and Kuhn, 2009; Søgaard and Wu, 2009). We leave this to future work.

Figure 1: Alignment with only contiguous TEUs (example from LREC En-Fr).
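Since an intersection equals its left operand exactly when that operand is a subset of the right one, equations (1) and (2) reduce to two subset tests. A trivial sketch (the function name is ours; TEUs and subsumption pairs are assumed to be represented as hashable values):

```python
def covers_alignment(te_a, te_g, order_a, order_g):
    """Equations (1) and (2) as subset tests: every TEU of the word
    alignment, and every subsumption pair between such TEUs, must also
    be derived by the grammar instance G."""
    return te_a <= te_g and order_a <= order_g
```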

A note on dedicated instances of NF-ITG Given a translation equivalence definition over word alignments TE(s, t, a), the lexical productions for a dedicated instance of NF-ITG are defined6 by the set {X → u | u ∈ TEAtom(s, t, a)}. This means that the lexical productions have atomic TEUs at the right-hand side, including alignments between the words of the source and target terminals. In the sequel, we will only talk about dedicated instances of NF-ITG and hence will not explicitly repeat this every time.

Given two grammatical TEUs u1 and u2, an NF-ITG instance allows their concatenation either in monotone [] or inverted <> order iff they are adjacent on the source and target sides. This fact implies that for every composed translation equivalent u ∈ TE(s, t, a) we can check whether it is derivable by a dedicated NF-ITG instance by checking whether it recursively decomposes into adjacent pairs of TEUs down to the atomic-TEU level. Note that by doing so, we are also implicitly checking whether the subsumption order between the TEUs in TE(s, t, a) is realized by the grammatical derivation (i.e., <G(s,t) ⊆ <a). Formally, an aligned sentence pair 〈s, t, a〉 is split into a pair of TEUs 〈s1, t1, a1〉 and 〈s2, t2, a2〉 that can be composed back using the [] and <> productions. If such a split exists, the splitting is conducted recursively for each of 〈s1, t1, a1〉 and 〈s2, t2, a2〉 until both are atomic TEUs in TE(s, t, a). This recursive splitting is the check of binarizability; an algorithm is described in (Huang et al., 2009).

6Unaligned words add one wrinkle to this scheme: informally, we consider a TEU u formed by attaching unaligned words to an atomic TEU also as atomic iff u is absolutely needed to cover the aligned sentence pair.
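For the special case in which the word alignment is a permutation, the recursive-splitting check can be sketched with a standard stack-based reduction in the spirit of the binarizability tests cited above (the function name is ours): a permutation is derivable by NF-ITG iff adjacent spans can be repeatedly merged into contiguous ranges until one span remains.

```python
def binarizable(perm):
    """Check whether a permutation (the target position of each source
    word) decomposes recursively into adjacent monotone/inverted pairs,
    i.e. whether an NF-ITG derivation exists. Greedy shift-reduce over
    target-range spans (lo, hi)."""
    stack = []
    for x in perm:
        stack.append((x, x))
        # merge the top two spans while they form one contiguous range,
        # in either monotone or inverted order
        while len(stack) >= 2:
            (a1, b1), (a2, b2) = stack[-2], stack[-1]
            if b1 + 1 == a2 or b2 + 1 == a1:
                stack[-2:] = [(min(a1, a2), max(b1, b2))]
            else:
                break
    return len(stack) == 1
```

The classic non-binarizable pattern 2-4-1-3 (here 0-based: [1, 3, 0, 2]) is rejected, while any composition of monotone and inverted swaps is accepted.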

6 A simple algorithm for ITG

We exemplify the grammatical coverage check for (normal-form) ITG by employing a standard tabular algorithm based on CYK (Younger, 1967). The algorithm works in two phases, creating a chart containing TEUs with associated inferences. In the initialization phase (Algorithm 1), atomic translation equivalents are added as atomic inferences to the chart for all source spans that correspond to translation equivalents and contain no smaller translation equivalents. In the second phase, based on the atomic inferences, the simple rules of NF-ITG are applied to add inferences for increasingly larger chart entries. An inference is added (Algorithms 2 and 3) iff a chart entry can be split into two sub-entries for which inferences already exist, and furthermore the union of the sets of target positions of those two entries forms a consecutive range.7 The addMonotoneInference and addInvertedInference calls in Algorithm 3 mark the composite inferences as produced by monotone and inverted productions respectively.

7We are not treating unaligned words formally here. For unaligned source and target words, we have to generate the different inferences corresponding to different groupings with their neighboring aligned words. Using pre-processing, we set aside the unaligned words and then parse the remaining word alignment fully. After parsing, by post-processing, we introduce into the parse table atomic TEUs that include the unaligned words.


InitializeChart
Input: 〈s, t, a〉
Output: Initialized chart for atomic units

for spanLength ← 2 to n do
    for i ← 0 to n − spanLength + 1 do
        j ← i + spanLength − 1
        u ← {〈X, Y〉 : X ∈ {i ... j}}
        if u ∈ TEAtom(s, t, a) then
            addAtomicInference(chart[i][j], u)
        end
    end
end

Algorithm 1: Algorithm that initializes the chart with atomic sub-sentential TEUs. In order to be atomic, a TEU may not contain smaller TEUs that consist of a proper subset of the alignments (and associated words) of the TEU.

ComputeTEUsNFITG
Input: 〈s, t, a〉
Output: TRUE/FALSE for coverage

InitializeChart(chart)
for spanLength ← 2 to n do
    for i ← 0 to n − spanLength + 1 do
        j ← i + spanLength − 1
        if chart[i][j] ∈ TE(s, t, a) then
            continue
        end
        for k ← i + 1 to j do
            a′ ← chart[i][k − 1] ∪ chart[k][j]
            if (chart[i][k − 1] ∈ TE(s, t, a)) ∧ (chart[k][j] ∈ TE(s, t, a)) ∧ (a′ ∈ TE(s, t, a)) then
                addTEU(chart, i, j, k, a′)
            end
        end
    end
end
if chart[0][n − 1] ≠ ∅ then
    return TRUE
else
    return FALSE
end

Algorithm 2: Algorithm that incrementally builds composite TEUs using only the rules allowed by NF-ITG.

addTEU
Input:
    chart – the chart
    i, j, k – the lower, upper and split-point indices
    a′ – the TEU to be added
Output: chart with TEU a′ added in the intended entry

if Max({Yt : 〈Xs, Yt〉 ∈ chart[i][k − 1]}) < Max({Yt : 〈Xs, Yt〉 ∈ chart[k][j]}) then
    addMonotoneInference(chart[i][j], a′)
else
    addInvertedInference(chart[i][j], a′)
end

Algorithm 3: Algorithm that adds a TEU and associated inference to the chart.

7 Experiments

Data Sets We use manually and automatically aligned corpora. The manually aligned corpora come from two datasets. The first (Graca et al., 2008) consists of six language pairs: Portuguese–English, Portuguese–French, Portuguese–Spanish, English–Spanish, English–French and French–Spanish. These datasets contain 100 sentence pairs each and distinguish Sure and Possible alignments. Following (Søgaard and Kuhn, 2009), we treat these two equally. The second manually aligned dataset (Padó and Lapata, 2006) contains 987 sentence pairs from the English–German part of Europarl, annotated using the Blinker guidelines (Melamed, 1998). The automatically aligned data comes from Europarl (Koehn, 2005) in three language pairs (English–Dutch, English–French and English–German). The corpora are automatically aligned using GIZA++ (Och and Ney, 2003) in combination with the grow-diag-final-and heuristic. With a sentence-length cutoff of 40 on both sides, these contain respectively 945k, 949k and 995k sentence pairs.

Grammatical Coverage (GC) is defined as the percentage of word alignments (sentence pairs) in a parallel corpus that can be covered by an instance of the grammar (NF-ITG) (cf. Section 5). Clearly, GC depends on the chosen semantic interpretation of word alignments: contiguous TEs (phrase pairs) or discontiguous TEs.
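Computed over a corpus, GC is simply the fraction of sentence pairs accepted by the coverage check, expressed as a percentage. A sketch (the covered predicate is a stand-in for the parser of Section 6):

```python
def grammatical_coverage(corpus, covered):
    """GC as a percentage: the share of aligned sentence pairs in `corpus`
    for which the predicate `covered` (e.g. an NF-ITG coverage parser)
    returns True."""
    hits = sum(1 for pair in corpus if covered(pair))
    return 100.0 * hits / len(corpus)
```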


Alignments Set                               GC contiguous TEs   GC discontiguous TEs

Hand aligned corpora
English–French                               76.0                75.0
English–Portuguese                           78.0                78.0
English–Spanish                              83.0                83.0
Portuguese–French                            78.0                74.0
Portuguese–Spanish                           91.0                91.0
Spanish–French                               79.0                74.0
LREC Corpora Average                         80.83±5.49          79.17±6.74
English–German                               45.427              45.325

Automatically aligned corpora
English–Dutch                                45.533              43.57
English–French                               52.84               49.95
English–German                               45.59               43.72
Automatically aligned corpora average        47.99±4.20          45.75±3.64

Table 1: The grammatical coverage (GC) of NF-ITG for different corpora, dependent on the interpretation of word alignments: contiguous translation equivalence or discontiguous translation equivalence.

Results Table 1 shows the Grammatical Coverage (GC) of NF-ITG for the different corpora under the two alternative definitions of translation equivalence. The first thing to notice is that there is only a small difference between the Grammatical Coverage scores for the two definitions. The difference is on the order of a few percentage points; the largest difference is seen for Portuguese–French (79% vs. 74% Grammatical Coverage), while for some language pairs there is no difference. For the automatically aligned corpora the absolute difference is on average about 2%. We attribute this to the fact that there are only very few discontiguous TEUs that can be covered by NF-ITG in this data.

The second thing to notice is that the scores are much higher for the corpora from the LREC dataset than for the manually aligned English–German corpus. The approximately double source and target length of the manually aligned English–German corpus, in combination with somewhat less dense alignments, makes this corpus much harder than the LREC corpora. Intuitively, one would expect that more alignment links make alignments more complicated. This turns out not always to be the case. Further inspection of the LREC alignments shows that these alignments often consist of parts that are completely linked. Such completely linked parts are by definition treated as atomic TEUs, which could make the alignments look simpler. This contrasts with the situation in the manually aligned English–German corpus, where on average fewer alignment links exist per word. Examples 1 and 2 show that dense alignments can be simpler than less dense ones. This is because the density sometimes implies idiomatic TEUs, which leads to rather flat lexical productions. We think that idiomatic TEUs reasonably belong in the lexicon.

When we look at the results for the automatically aligned corpora in the lowest rows of the table, we see that these are comparable to the results for the manually aligned English–German corpus (and much lower than the results for the LREC corpora). This could be explained not only by the fact that the manually aligned English–German data also comes from Europarl, but possibly also by the fact that the manual alignments themselves were obtained by initialization with the GIZA++ alignments. In any case, the manually and automatically acquired alignments for this data are not too different from the perspective of NF-ITG. Further differences might exist if we employed another class of grammars, e.g., full SCFGs.

On the one hand, we find that manual alignments are well but not fully covered by NF-ITG. On the other, the automatic alignments are not covered well by NF-ITG. This suggests that these automatic alignments are difficult to cover by NF-ITG, and the reason could be that these alignments are built heuristically by trading precision for recall, cf. (Och and Ney, 2003). Søgaard (Søgaard, 2010) reports that full ITG provides gains of a few percentage points over NF-ITG.

Overall, we find that our results for the LREC data are far higher than Søgaard's (Søgaard, 2010) results but lower than the upper bounds of (Søgaard and Wu, 2009). A similar observation holds for the English–German manually aligned Europarl data, albeit the maximum sentence length (15) used in (Søgaard and Wu, 2009; Søgaard, 2010) differs from ours (40). We attribute the difference between our results and Søgaard's to our choice to adopt lexical productions of NF-ITG that contain their own internal alignments, determined by the atomic TEUs of the word alignment. Our results differ substantially from those of (Søgaard and Wu, 2009), who report upper bounds (indeed our results still fall within these upper bounds for the LREC data).

8 Related Work

The array of work described in (Zens and Ney, 2003; Wellington et al., 2006; Søgaard and Wu, 2009; Søgaard and Kuhn, 2009; Søgaard, 2010) concentrates on methods for calculating upper bounds on the alignment coverage for all ITGs, including NF-ITG. Interestingly, these upper bounds are determined by filtering out/excluding complex alignment phenomena known formally to be beyond (NF-)ITG. None of these earlier efforts explicitly discussed the dilemmas of instantiating a grammar formalism or how to formally parse word alignments.

The work in (Zens and Ney, 2003; Søgaard and Wu, 2009), defining and counting TEUs, provides a far tighter upper bound than (Wellington et al., 2006), who use the disjunctive interpretation of word alignments, interpreting multiple alignment links of the same word as alternatives. We adopt the conjunctive interpretation of word alignments, like the majority of work in MT, e.g., (Ayan and Dorr, 2006; Fox, 2002; Søgaard and Wu, 2009; Søgaard, 2010).

In deviation from earlier work, (Søgaard and Kuhn, 2009; Søgaard and Wu, 2009; Søgaard, 2010) discuss TEUs defined over word alignments explicitly, and define evaluation metrics based on TEUs. In particular, Søgaard (Søgaard, 2010) writes that he employs “a more aggressive search" for TEUs than earlier work, thereby arriving at far tighter upper bounds on hand-aligned data. Our results seem to back this claim but, unfortunately, we could not pin down the formal details of his procedure.

More remotely related, the work described in (Huang et al., 2009) presents a binarization algorithm for the productions of an SCFG instance (as opposed to a formalism). Although somewhat related, this is different from checking whether there exists an NF-ITG instance (which has to be determined) that covers a word alignment.

In contrast with earlier work, we present the alignment coverage problem as an intersection of two partially ordered sets (graphs). The partial order over TEUs as well as the formal definition of parsing as intersection in this work are novel elements, making explicit the view of word alignments as automata generating partially ordered sets.

9 Conclusions

In this paper we provide a formal characterization of the problem of determining the coverage of a word alignment by a given grammar formalism as the intersection of two partially ordered sets. These partially ordered sets of TEUs can be formalized in terms of hypergraphs implementing forests (packed synchronous trees), and the coverage as the intersection between sets of synchronous trees, generalizing the trees of (Zhang et al., 2008).

Practical explorations of our findings for the benefit of models of learning reordering are underway. In future work we would like to investigate the extension of this work to other limited subsets of SCFGs. We will also investigate the possibility of devising ITGs with explicit links between terminal symbols in the productions, exploring different kinds of linking.

Acknowledgements We thank the reviewers for their helpful comments, and thank Mark-Jan Nederhof for illuminating discussions on parsing as intersection. This work is supported by The Netherlands Organization for Scientific Research (NWO) under grant nr. 612.066.929.


References

Necip Ayan and Bonnie Dorr. 2006. Going beyond AER: an extensive analysis of word alignments and their impact on MT. In Proc. of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the ACL, pages 9–16, Morristown, NJ, USA.

Yehoshua Bar-Hillel, Micha Perles, and Eli Shamir. 1964. On formal properties of simple phrase structure grammars. In Y. Bar-Hillel, editor, Language and Information: Selected Essays on their Theory and Application, chapter 9, pages 116–150. Addison-Wesley, Reading, Massachusetts.

Phil Blunsom, Trevor Cohn, Chris Dyer, and Miles Osborne. 2009. A Gibbs sampler for phrasal synchronous grammar induction. In ACL/AFNLP, pages 782–790.

Marine Carpuat and Dekai Wu. 2007. Improving statistical machine translation using word sense disambiguation. In Proc. of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2007), pages 61–72.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the ACL, pages 263–270, June.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

Trevor Cohn and Phil Blunsom. 2009. A Bayesian model of syntax-directed tree to string grammar induction. In EMNLP, pages 352–361.

John DeNero, Daniel Gillick, James Zhang, and Dan Klein. 2006. Why generative phrase models underperform surface heuristics. In Proceedings of the Workshop on SMT, pages 31–38.

Heidi J. Fox. 2002. Phrasal cohesion and statistical machine translation. In Proceedings of EMNLP, pages 304–311, Stroudsburg, PA, USA. Association for Computational Linguistics.

Joao Graca, Joana Pardal, Luísa Coheur, and Diamantino Caseiro. 2008. Building a golden collection of parallel multi-language word alignment. In LREC'08, Marrakech, Morocco. European Language Resources Association (ELRA).

Aria Haghighi, John Blitzer, John DeNero, and Dan Klein. 2009. Better word alignments with supervised ITG models. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 923–931, Suntec, Singapore, August. Association for Computational Linguistics.

Liang Huang, Hao Zhang, Daniel Gildea, and Kevin Knight. 2009. Binarization of synchronous context-free grammars. Computational Linguistics, 35(4):559–595.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proc. of the Human Language Technology Conference, HLT-NAACL, May.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In Proc. of MT Summit.

Bernard Lang. 1988. Parsing incomplete sentences. In Proceedings of COLING, pages 365–371.

J. Scott McCarley, Abraham Ittycheriah, Salim Roukos, Bing Xiang, and Jian-Ming Xu. 2011. A correction model for word alignments. In Proceedings of EMNLP, pages 889–898.

Dan Melamed. 1998. Annotation style guide for the Blinker project, version 1.0. Technical Report IRCS TR #98-06, University of Pennsylvania.

Markos Mylonakis and Khalil Sima'an. 2011. Learning hierarchical translation structure with linguistic annotations. In Proceedings of HLT/NAACL-2011.

Mark-Jan Nederhof and Giorgio Satta. 2004. The language intersection problem for non-recursive context-free grammars. Inf. Comput., 192(2):172–184.

Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Sebastian Padó and Mirella Lapata. 2006. Optimal constituent alignment with edge covers for semantic projection. In ACL-COLING'06, pages 1161–1168, Stroudsburg, PA, USA. Association for Computational Linguistics.

Jason Riesa and Daniel Marcu. 2010. Hierarchical search for word alignment. In Proceedings of ACL, pages 157–166.

Giorgio Satta and Enoch Peserico. 2005. Some computational complexity results for synchronous context-free grammars. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 803–810, Vancouver, British Columbia, Canada, October. Association for Computational Linguistics.

Anders Søgaard and Jonas Kuhn. 2009. Empirical lower bounds on alignment error rates in syntax-based machine translation. In SSST '09, pages 19–27, Stroudsburg, PA, USA. Association for Computational Linguistics.

Anders Søgaard and Dekai Wu. 2009. Empirical lower bounds on translation unit error rate for the full class of inversion transduction grammars. In Proceedings of the 11th International Workshop on Parsing Technologies (IWPT-2009), 7–9 October 2009, Paris, France, pages 33–36. The Association for Computational Linguistics.

Anders Søgaard. 2010. Can inversion transduction grammars generate hand alignments? In Proceedings of the 14th Annual Conference of the European Association for Machine Translation (EAMT).

Roy Tromble and Jason Eisner. 2009. Learning linear or-dering problems for better translation. In Proceedingsof EMNLP’09, pages 1007–1016, Singapore.

Gertjan van Noord. 1995. The intersection of finite stateautomata and definite clause grammars. In Proceed-ings of ACL, pages 159–165.

Benjamin Wellington, Sonjia Waxmonsky, and I. DanMelamed. 2006. Empirical lower bounds on the com-plexity of translational equivalence. In Proceedings ofthe Annual Meeting of the Association for Computa-tional Linguistics (ACL).

Dekai Wu. 1997. Stochastic inversion transductiongrammars and bilingual parsing of parallel corpora.Computational Linguistics, 3(23):377–403.

D.H. Younger. 1967. Recognition and parsing ofcontext-free languages in time n3. Information andControl, 10(2):189–208.

Richard Zens and Hermann Ney. 2003. A comparativestudy on reordering constraints in statistical machinetranslation. In Proceedings of the Annual Meeting ofthe ACL, pages 144–151.

Hao Zhang, Daniel Gildea, and David Chiang. 2008. Extracting synchronous grammar rules from word-level alignments in linear time. In Proceedings of COLING, pages 1081–1088.

Andreas Zollmann and Ashish Venugopal. 2006. Syntax augmented machine translation via chart parsing. In Proceedings of the North-American Chapter of the ACL (NAACL’06), pages 138–141.


Proceedings of the 7th Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 68–77, Atlanta, Georgia, 13 June 2013. ©2013 Association for Computational Linguistics

Synchronous Linear Context-Free Rewriting Systems for Machine Translation

Miriam Kaeshammer
University of Düsseldorf

Universitätsstraße 1
40225 Düsseldorf, Germany

Abstract

We propose synchronous linear context-free rewriting systems as an extension of synchronous context-free grammars in which synchronized non-terminals span k ≥ 1 continuous blocks on each side of the bitext. Such discontinuous constituents are required for inducing certain alignment configurations that occur relatively frequently in manually annotated parallel corpora and that cannot be generated with less expressive grammar formalisms. As part of our investigations concerning the minimal k that is required for inducing manual alignments, we present a hierarchical aligner in the form of a deduction system. We find that by restricting k to 2 on both sides, 100% of the data can be covered.

1 Introduction

The most prominent paradigms in statistical machine translation are phrase-based translation models (Koehn et al., 2003) and tree-based approaches using some form of synchronous context-free grammar (SCFG) (Chiang, 2007; Zollmann and Venugopal, 2006; Hoang and Koehn, 2010), in particular inversion transduction grammar (ITG) (Wu, 1997). The rules of the translation models are usually learned from word-aligned parallel corpora. Synchronous grammars also induce alignments between words in the bitext when simultaneously recognizing words via the application of a synchronous rule (Wu, 1997). Due to their central role, it is important that a synchronous grammar formalism is powerful enough to generate all alignment configurations that occur in hand-aligned parallel corpora

(i)  a b c d      (ii)  a b               (iii)  a1 b a2
     b d a c            a1 b1 a2 b2              b1 a b2

[alignment links between the word rows are omitted in this rendering]

Figure 1: (i) inside-out alignment (Wu, 1997); (ii) cross-serial discontinuous translation unit (Søgaard and Kuhn, 2009); (iii) bonbon alignment (Simard et al., 2005)

that are taken to be a gold standard of translational equivalence (Wellington et al., 2006).

The empirical adequacy of phrase-based and SCFG-based translation models has been called into question (Wellington et al., 2006; Søgaard and Kuhn, 2009; Søgaard and Wu, 2009; Søgaard, 2010) because they are unable to induce certain alignment configurations. In the alignments in Figure 1, the translation units a, b, c, and d cannot be independently generated by a binary SCFG. Due to a reordering component, phrase-based systems can handle (i), but neither (ii) nor (iii). These phenomena, however, occur relatively frequently in hand-aligned parallel corpora. Wellington et al. (2006) found that complex structures such as inside-out alignments occur in 5% of English-Chinese sentence pairs, and in the study of Søgaard and Kuhn (2009) between 1.6% (for Danish-English data) and 12.1% (for Danish-Spanish data) of all translation units are discontinuous, i.e. not derivable by ITGs in normal form.

As Wellington et al. (2006) already noted for inside-out alignments, discontinuous constituents are required for binary synchronous derivations of the alignment configurations under consideration. This is illustrated in Figure 2: the yields of A[2]


[Figure 2 shows binary synchronous derivation trees over (i) a b c d / b d a c and (iii) a1 b a2 / b1 a b2, with co-indexed non-terminals A[1]–A[4].]

Figure 2: Synchronous derivations: co-indexed non-terminals are generated synchronously. Note that many other derivations that induce the same alignment structures are possible, but all of them involve at least one discontinuous constituent.

and A[3] in (i) are discontinuous on the target side; in (iii) the yield of A[1] is discontinuous on the source side and the yield of A[2] is discontinuous on the target side. We therefore propose to augment tree-based approaches such that they can account for discontinuous constituents in the source and/or target derivation. This implies going beyond the power of context-free grammars.

In the monolingual parsing community, linear context-free rewriting systems (LCFRS) have been established as an appropriate formalism for the modeling of discontinuous structure (Maier and Lichte, 2011; Kuhlmann and Satta, 2009). LCFRS is an extension of CFG in which non-terminals can span k ≥ 1 continuous blocks of a string; k is termed the fan-out of the non-terminal. If k = 1 for all non-terminals, the grammar is a CFG. Recent work shows that probabilistic data-driven parsing with LCFRS is indeed feasible and gives acceptable results (Maier, 2010; Evang and Kallmeyer, 2011; van Cranenburgh, 2012; Maier et al., 2012; Kallmeyer and Maier, 2013). It seems timely to transfer these findings to statistical machine translation.

In this work, we introduce the notion of synchronous LCFRS for translation and show how the alignments in Figure 1 are induced. Since the parsing complexity of LCFRS, and thus of synchronous LCFRS as well, depends directly on k, the number of blocks that a non-terminal in the grammar may span, an investigation concerning the empirically required k is carried out on manually aligned data. For this purpose, we present a parallel parser for an all-accepting synchronous LCFRS that is used to validate hierarchical alignments for a given k. This extends the work of Wellington et al. (2006) and Søgaard (2010) from a methodological point of view, as will be explained in Section 5. In particular, we will revise the results that Søgaard (2010) presented concerning the coverage of ITG. Our experiments furthermore include data sets that have not been used in previous similar studies.

2 Synchronous LCFRS for Translation

2.1 LCFRS

An LCFRS¹ (Vijay-Shanker et al., 1987; Weir, 1988) is a tuple G = (N, T, V, P, S) where N is a finite set of non-terminals with a function dim: N → ℕ determining the fan-out of each A ∈ N; T and V are disjoint finite sets of terminals and variables; S ∈ N is the start symbol with dim(S) = 1; and P is a finite set of rewriting rules

A(α_1, …, α_dim(A)) → A_1(X^(1)_1, …, X^(1)_dim(A_1)) ⋯ A_m(X^(m)_1, …, X^(m)_dim(A_m))

where A, A_1, …, A_m ∈ N, X^(i)_j ∈ V for 1 ≤ i ≤ m, 1 ≤ j ≤ dim(A_i), and α_i ∈ (T ∪ V)* for 1 ≤ i ≤ dim(A), for a rank m ≥ 0. For all r ∈ P, every variable X in r occurs exactly once in the left-hand side (LHS) and exactly once in the right-hand side (RHS) of r. r describes how the yield of the LHS non-terminal is computed from the yields of the RHS non-terminals. The yield of S is the language of the grammar. Figure 3 shows a sample LCFRS with more explanations.

The rank of G is the maximal rank of any of its rules, and its fan-out is the maximal fan-out of any of its non-terminals. G is called a (u, v)-LCFRS if it has rank u and fan-out v.

2.2 Synchronous LCFRS

We define synchronous LCFRS (SLCFRS) in parallel to synchronous CFG; see for example Satta

¹We use the syntax of simple range concatenation grammars (Boullier, 1998), a formalism that is equivalent to LCFRS.


A(ab, cd) → ε              ⟨ab, cd⟩ is in the yield of A
A(aXb, cYd) → A(X, Y)      if ⟨X, Y⟩ is in the yield of A, then so is ⟨aXb, cYd⟩
S(XY) → A(X, Y)            if ⟨X, Y⟩ is in the yield of A, then ⟨XY⟩ is in the yield of S

Figure 3: Sample LCFRS for L = {aⁿbⁿcⁿdⁿ | n > 0}
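The yield semantics of the Figure 3 grammar can be simulated directly. The following Python sketch (ours, not part of the paper; the function name is our own) iterates the two A rules to enumerate yield pairs of the fan-out-2 non-terminal A and then applies the S rule:

```python
def in_sample_language(w: str) -> bool:
    """Membership test for L = {a^n b^n c^n d^n | n > 0}, obtained by
    iterating the yield semantics of the sample LCFRS rules."""
    # A(ab, cd) -> eps: base yield pair of the fan-out-2 non-terminal A
    yields_of_a = {("ab", "cd")}
    # A(aXb, cYd) -> A(X, Y): wrap both blocks of an existing yield pair
    for _ in range(len(w)):  # derivation depth is bounded by |w|
        yields_of_a |= {("a" + x + "b", "c" + y + "d")
                        for x, y in yields_of_a}
    # S(XY) -> A(X, Y): concatenate the two blocks of A
    return any(x + y == w for x, y in yields_of_a)

print(in_sample_language("aabbccdd"))  # True
print(in_sample_language("aabbcd"))   # False
```

The fan-out-2 non-terminal is what lets the a/b blocks and the c/d blocks grow in lockstep, which a CFG non-terminal (fan-out 1) cannot do.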

and Peserico (2005). An SLCFRS is a tuple G = (N_s, N_t, T_s, T_t, V_s, V_t, P, S_s, S_t) where N_s, T_s, V_s, S_s, resp. N_t, T_t, V_t, S_t, are defined as for LCFRS. They denote the alphabets for the source and target side respectively. P is a finite set of synchronous rewriting rules ⟨r_s, r_t, ∼⟩ where r_s and r_t are LCFRS rewriting rules based on N_s, T_s, V_s and N_t, T_t, V_t respectively, and ∼ is a bijective mapping of the non-terminals in the RHS of r_s to the non-terminals in the RHS of r_t. This link relation is represented by co-indexation in the synchronous rules. During a derivation, the yields of two co-indexed non-terminals have to be explained by one synchronous rule. ⟨S_s, S_t⟩ is the start pair.

We call the tuple (N_s, T_s, V_s, P_s, S_s) the source side grammar G_s and (N_t, T_t, V_t, P_t, S_t) the target side grammar G_t, where P_s is the set of all r_s in P and P_t is the set of all r_t in P. The rank u of G is the maximal rank of G_s and G_t, and the fan-out v of G is the sum of the fan-outs of G_s and G_t. We will sometimes write v_{v_Gs|v_Gt} to make clear how the fan-out of G is distributed over the source and the target side. As in the monolingual case, a corresponding grammar G is called a (u, v)-SLCFRS.

As an example consider the rules in Figure 4. They translate cross-serial dependencies into nested ones. The rank of the corresponding grammar is 2 and its fan-out is 4_{2|2}.

Note that instead of defining an SLCFRS, one could also set the fan-out of each non-terminal in an LCFRS to ≥ 2, set dim(S) = 2, and formulate synchronization between the arguments of the non-terminals. The main disadvantage is that this requires N_s = N_t. Furthermore, this seems less perspicuous than SLCFRS when moving from SCFG to mild context-sensitivity. Generalized Multitext Grammar (Melamed et al., 2004) is another weakly equivalent grammar formalism.

In correspondence to ITG and normal-form ITG (NF-ITG) (Søgaard and Wu, 2009), we say an

⟨A(a, c) → ε , C(a, c) → ε⟩
⟨B(b, d) → ε , D(bd) → ε⟩
⟨A(aX, cZ) → A[1](X, Z) , C(aX, Zc) → C[1](X, Z)⟩
⟨B(bY, dU) → B[1](Y, U) , D(bYd) → D[1](Y)⟩
⟨S(XYZU) → A[1](X, Z) B[2](Y, U) , S(XYZ) → C[1](X, Z) D[2](Y)⟩

Figure 4: Sample SLCFRS for L = {⟨aⁿbᵐcⁿdᵐ, aⁿbᵐdᵐcⁿ⟩ | n, m > 0}

SLCFRS G is in normal form if the following two conditions hold: (a) the rank of G is at most 2, and (b) for all r ∈ P, the LHS arguments of r_s and r_t contain either terminals or variables, but no mixture of both. The grammar in Figure 4 is not in normal form.
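The two normal-form conditions are easy to check mechanically. A minimal Python sketch (our own encoding, not the paper's: LHS arguments as strings over lowercase terminals and uppercase variables, one triple per synchronous rule):

```python
def is_normal_form(rules):
    """Check the SLCFRS normal-form conditions for rules given as
    (lhs_args_source, lhs_args_target, rank) triples."""
    def pure(arg):
        # (b) an argument may contain terminals or variables, not both
        return arg.islower() or arg.isupper() or arg == ""
    for args_s, args_t, rank in rules:
        if rank > 2:  # (a) rank at most 2
            return False
        if not all(pure(a) for a in args_s + args_t):
            return False
    return True

# The third rule of Figure 4 mixes the terminal 'a' with the variable 'X'
# in one argument ("aX"), so that grammar is not in normal form:
print(is_normal_form([(["aX", "cZ"], ["aX", "Zc"], 1)]))  # False
```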

While ITGs constrain the non-terminals in the RHS of the target side to appear in the same or in the reverse order compared to the non-terminals in the RHS of the source side, we do not impose such ordering constraints (on the variables) for SLCFRS. However, it is obvious that a (2, 2_{1|1})-SLCFRS is equivalent to an ITG of rank 2 and that a (2, 2_{1|1})-SLCFRS in normal form is equivalent to an NF-ITG.

2.3 Alignment Capacity

A translation unit is a maximally connected subgraph of a given alignment structure. Typically this is the smallest unit from which translation models are learned. During a synchronous derivation, we interpret simultaneously recognized terminals as aligned (Wu, 1997); they thus correspond to a translation unit. We call the synchronous derivation tree a hierarchical alignment. Many-to-many alignments are interpreted conjunctively. This means that to induce a given translation unit, a grammar has to be able to generate the complete translation unit, and not just one of the corresponding word alignments. The last point has been argued for in Søgaard and Kuhn (2009).
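The definition of a translation unit as a maximally connected subgraph can be made concrete: given a set of word-alignment links, the units are the connected components of the bipartite alignment graph. A minimal Python sketch (ours, not from the paper):

```python
from collections import defaultdict

def translation_units(links):
    """Group word-alignment links (source_index, target_index) into
    translation units, i.e. maximally connected subgraphs of the
    bipartite alignment graph. Returns (source_set, target_set) pairs."""
    adj = defaultdict(set)
    for i, j in links:
        adj[("s", i)].add(("t", j))
        adj[("t", j)].add(("s", i))
    seen, units = set(), []
    for start in adj:
        if start in seen:
            continue
        # depth-first search collects one connected component
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        units.append(({i for side, i in comp if side == "s"},
                      {j for side, j in comp if side == "t"}))
    return units

# Cross-serial DTU (ii) from Figure 1: source "a b", target "a1 b1 a2 b2"
print(translation_units({(0, 0), (0, 2), (1, 1), (1, 3)}))
```

For that example the result is the two discontinuous units ({0}, {0, 2}) and ({1}, {1, 3}), each of which a grammar must generate as a whole under the conjunctive interpretation.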

SLCFRS are able to induce the alignment structures under consideration (those in Figure 1). This is exemplified by the rules given in Figure 5.

Clearly, there exist many different possible hierarchical alignments for a given alignment structure. The underlying constraints for the grammars in Figure 5 are that (a) each translation unit is represented by


(i)
⟨A(a) → ε , A(a) → ε⟩
⟨A(Xb) → A[1](X) , A(b, Y) → A[1](Y)⟩
⟨A(Xc) → A[1](X) , A(Y1, Y2c) → A[1](Y1, Y2)⟩
⟨A(Xd) → A[1](X) , A(Y1dY2) → A[1](Y1, Y2)⟩

(ii)
⟨A(a) → ε , A(a1, a2) → ε⟩
⟨A(Xb) → A[1](X) , A(Y1b1Y2b2) → A[1](Y1, Y2)⟩
or
⟨A(b) → ε , A(b1, b2) → ε⟩
⟨A(aX) → A[1](X) , A(a1Y1a2Y2) → A[1](Y1, Y2)⟩

(iii)
⟨A(a1, a2) → ε , A(a) → ε⟩
⟨A(X1bX2) → A[1](X1, X2) , A(b1Yb2) → A[1](Y)⟩
or
⟨A(b) → ε , A(b1, b2) → ε⟩
⟨A(a1Xa2) → A[1](X) , A(Y1aY2) → A[1](Y1, Y2)⟩

Figure 5: SLCFRS rules that induce the alignments in Figure 1. For (i) many other derivations are possible, since there are 4! possibilities to combine the translation units in a binary way. The rules shown correspond to Figure 2(i).

exactly one rule, and (b) each rule aligns exactly one translation unit and combines it with at most one already established synchronous constituent.

ITG and NF-ITG do not generate the same class of alignments (Søgaard and Wu, 2009). In parallel, a (2, v)-SLCFRS in normal form does not generate the same class of alignments as an unrestricted (2, v)-SLCFRS. Consider, for example, a discontinuous translation unit d with two gaps on the source side and a grammar G with fan-out 3_{2|1}. G in normal form cannot induce d. In general, for generating x gaps, a fan-out of x + 1 is required. However, without the normal-form requirement, G can possibly induce d with a rule that combines the terminals of d with the constituents that fill the gaps.

2.4 Parsing Complexity

LCFRS in normal form can be parsed in O(n^{3k}) where k is the fan-out of the grammar (Seki et al., 1991). This result can be transferred to SLCFRS: an SLCFRS with fan-out v is essentially an LCFRS with fan-out v + 1. However, because of the start non-terminal S with dim(S) = 2, all non-terminals A ∈ N with dim(A) ≥ 2, and the special interpretation of the source/target side, meaning that variables occur either on the source or target side but

⟨T(α_s) → ε , T(β_t) → ε⟩
⟨A(α_1) → T[1](α_1) , A(β_1) → T[1](β_1)⟩
⟨A(α_1) → A[1](α_2) A[2](α_3) , A(β_1) → A[1](β_2) A[2](β_3)⟩

where α_s ∈ (T_s*)^{k_0}, β_t ∈ (T_t*)^{k′_0}, α_i ∈ (V_s+)^{k_j}, β_i ∈ (V_t+)^{k′_j} for 0 < k_j ≤ k_s, 0 < k′_j ≤ k_t, 0 < i ≤ 3, 0 ≤ j ≤ 3

Figure 6: All-accepting SLCFRS in normal form with fan-out v = k_s + k_t

cannot change sides, no items that cross or involve the additional gap have to be built during parsing. Bitext parsing with SLCFRS in normal form can therefore also be performed in O(n^{3v}) where n = max(n_s, n_t), or more specifically in O(n_s^{3v_Gs} · n_t^{3v_Gt}), where n_s and n_t are the lengths of the source and target input strings respectively.

3 Empirical Investigation

Since parsing complexity with SLCFRS is determined by the fan-out v of the grammar, we conduct an investigation to find out which v would be required to fully cover the alignment configurations that occur in manually aligned parallel corpora.

3.1 Bottom-Up Hierarchical Aligner

Our study is based on alignment validation (Søgaard, 2010), i.e. we check whether an alignment structure can be generated by an all-accepting SLCFRS with a specific v. Such a grammar is depicted in Figure 6. Note in particular that it leaves open how to compose the yield of the LHS non-terminal from the two RHS constituents. To be able to use the grammar for parsing, one would have to spell out all combination possibilities.

Instead, we use the idea of a bottom-up hierarchical aligner (Wellington et al., 2006). It works very much like a synchronous parser, but the constraints for inferences are the word alignments and potentially other things, not the rewriting rules of a grammar. Initial constituents are built from the word alignments; then constituents are combined with each other. The goal is to find a constituent that completely covers the input. In our case, the constraints for the hierarchical aligner come from the translation units, the fan-out v_{k_s|k_t} of the simulated grammar, and possibly a normal-form requirement.

We specify the hierarchical aligner in terms of a deduction system (Shieber et al., 1995). Deduction rules have the form

    A_1 … A_m
    ─────────  C
        B

where A_1 … A_m and B are items, i.e. intermediate parsing results, and C is a list of conditions on A_1 … A_m and B. The interpretation is that if A_1 … A_m can be deduced and the conditions C hold, then B can be deduced. Our items have the form ⟨[X_s, ρ_s], [X_t, ρ_t]⟩ where X_s ∈ N_s and X_t ∈ N_t of the simulated grammar. All-accepting grammars usually have only one non-terminal symbol, but we need a distinction between pre-terminal constituents T and general constituents A in order to simulate both SLCFRS in normal form and the full class. ρ_s and ρ_t characterize the spans of the synchronous constituent on the source and target side respectively. We view them as bit vectors where ρ_s(i) = 1 means that s_i is in the yield of X_s, and ρ_t(i′) = 1 that t_{i′} is in the yield of X_t. ⟨s_0…n, t_0…n′⟩ is the input sentence pair, which is segmented into m disjoint translation units ⟨D_s^(m), D_t^(m)⟩ based on the given word alignment structure. D_s^(m) and D_t^(m) are sets of word indices into s and t respectively. We furthermore specify some useful operations on bit vectors. The ∪ operator combines bit vectors of the same length into a new bit vector by an elementwise or, while the intersection ∩ of two bit vectors is the elementwise and. 0^l is a bit vector ρ such that ρ(i) = 0 for all 0 ≤ i ≤ l. The function b(ρ) returns the number of blocks of ρ, i.e. the number of continuous sequences of 1s in ρ.
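These bit-vector operations translate directly into code. A small Python sketch (ours, for illustration), representing bit vectors as lists of 0/1 entries:

```python
def bv_union(p, q):
    """The union operator: elementwise OR of two equal-length bit vectors."""
    return [a | b for a, b in zip(p, q)]

def bv_disjoint(p, q):
    """True iff the intersection (elementwise AND) is the zero vector 0^l."""
    return all(a & b == 0 for a, b in zip(p, q))

def blocks(p):
    """b(rho): the number of maximal continuous runs of 1s."""
    count, prev = 0, 0
    for bit in p:
        if bit == 1 and prev == 0:
            count += 1
        prev = bit
    return count

print(blocks([1, 1, 0, 1, 0, 1]))  # 3
```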

Figure 7 shows the deduction rules of the hierarchical aligner that simulate an all-accepting SLCFRS in normal form. Scan builds T items from translation units, Unary creates A items from T items, and Binary combines two A items into a larger A item. Via the side conditions, A items are only created if they respect the specified fan-out v_{k_s|k_t} of the all-accepting grammar. If the hierarchical aligner finds an A item that spans ⟨s, t⟩, the alignment structure of ⟨s, t⟩ is valid, i.e. it can be induced by an SLCFRS in normal form with fan-out v_{k_s|k_t}.

Since we are also interested in the empirical alignment capacity of SLCFRS without the normal-form restriction, we present an extended deduction system in Figure 8. The additional rules lead to the simulation of an SLCFRS of rank 2 where terminals and variables can be combined in the arguments of the LHS non-terminals of the rewriting rules. Note in particular that the generation of T items is not constrained by a maximally allowed v_{k_s|k_t}. For the computation of the items, we use standard chart parsing techniques, maintaining a chart and an agenda.
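For concreteness, the normal-form validation can be sketched as an agenda-driven closure in Python. This is our own simplified reconstruction of the Scan/Unary/Binary rules, not the author's implementation; it assumes every word occurs in some translation unit and collapses Scan and Unary into the initialization step:

```python
def reachable(n_s, n_t, units, k_s, k_t):
    """Return True iff the alignment given by `units` (pairs of source/target
    index sets) can be hierarchically aligned by a normal-form SLCFRS with
    fan-out k_s on the source side and k_t on the target side."""
    def bv(indices, n):
        return tuple(1 if i in indices else 0 for i in range(n))

    def blocks(p):  # b(rho): number of continuous runs of 1s
        return sum(1 for i, x in enumerate(p) if x and (i == 0 or not p[i - 1]))

    def disjoint(p, q):  # intersection is the zero vector
        return all(not (a and b) for a, b in zip(p, q))

    def union(p, q):
        return tuple(a | b for a, b in zip(p, q))

    # Scan + Unary: one item per translation unit, filtered by the fan-out bound
    agenda = [(bv(ds, n_s), bv(dt, n_t)) for ds, dt in units]
    agenda = [(s, t) for s, t in agenda if blocks(s) <= k_s and blocks(t) <= k_t]
    chart = set(agenda)
    goal = (bv(range(n_s), n_s), bv(range(n_t), n_t))

    while agenda:
        s1, t1 = agenda.pop()
        for s2, t2 in list(chart):
            # Binary: combine disjoint items, respecting the fan-out bound
            if disjoint(s1, s2) and disjoint(t1, t2):
                new = (union(s1, s2), union(t1, t2))
                if (blocks(new[0]) <= k_s and blocks(new[1]) <= k_t
                        and new not in chart):
                    chart.add(new)
                    agenda.append(new)
    return goal in chart

# Inside-out alignment (i) of Figure 1: valid with k = 2|2 but not with k = 1|1
units = [({0}, {2}), ({1}, {0}), ({2}, {3}), ({3}, {1})]
print(reachable(4, 4, units, 2, 2), reachable(4, 4, units, 1, 1))  # True False
```

The failing k = 1|1 run mirrors the claim that NF-ITG cannot induce inside-out alignments: every binary combination of the four units creates two blocks on the source or target side.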

3.2 Data

We use manually aligned parallel corpora for our study.² Data sets that have previously been used in similar experiments, e.g. in Wellington et al. (2006), Søgaard and Wu (2009), and Søgaard (2010), are those from Martin et al. (2005) for English-Romanian and English-Hindi, the English-French data from Mihalcea and Pedersen (2003), the Europarl data set described in Graça et al. (2008) for the six combinations of English, French, Portuguese and Spanish, the English-German Europarl data that was created for Padó and Lapata (2006), and data sets with Danish as the source language that are part of the Parole corpus of the Copenhagen Dependency Treebank (Buch-Kromann et al., 2009).

We furthermore perform our study on data sets that, to the best of our knowledge, have not been evaluated in a similar setting before. Those are the English-Swedish gold alignments documented in Holmqvist and Ahrenberg (2011), the English-Inuktitut data used in Martin et al. (2005), more English-German data³, the English-Spanish data set of Lambert et al. (2005), and English-Dutch alignments that are part of the Dutch Parallel Corpus (Macken, 2010). Characteristics of the data sets are presented in the last columns of Table 1.

3.3 Method

We apply the bottom-up hierarchical alignment algorithm in various configurations to each manually aligned sentence pair. If a goal item is found, the alignment structure can be induced with the formalism in question. We measure the number of sentence pairs for which a hierarchical alignment was reached over the total number of sentence pairs. Søgaard (2010) refers to this as alignment reachability, which is the inverse of the parse failure rate (Wellington et al., 2006).

²Whenever there are sure (S) and possible (P) alignments annotated, we use both.

³By T. Schoenemann, from http://user.phil-fak.uni-duesseldorf.de/˜tosch/downloads.html


Scan:
  ─────────────────────  ⟨D_s, D_t⟩ a translation unit
  ⟨[T, ρ_s], [T, ρ_t]⟩
  where ρ_s(i) = 1 if i ∈ D_s, otherwise ρ_s(i) = 0, and ρ_t(i′) = 1 if i′ ∈ D_t, otherwise ρ_t(i′) = 0

Unary:
  ⟨[T, ρ_s], [T, ρ_t]⟩
  ─────────────────────  b(ρ_s) ≤ k_s, b(ρ_t) ≤ k_t
  ⟨[A, ρ_s], [A, ρ_t]⟩

Binary:
  ⟨[A, ρ¹_s], [A, ρ¹_t]⟩, ⟨[A, ρ²_s], [A, ρ²_t]⟩
  ─────────────────────────────────────────────  ρ¹_s ∩ ρ²_s = 0^n, ρ¹_t ∩ ρ²_t = 0^{n′}, b(ρ³_s) ≤ k_s, b(ρ³_t) ≤ k_t
  ⟨[A, ρ³_s], [A, ρ³_t]⟩
  where ρ³_s = ρ¹_s ∪ ρ²_s and ρ³_t = ρ¹_t ∪ ρ²_t

Goal:
  ⟨[A, ρ_s], [A, ρ_t]⟩  where ρ_s(i) = 1 for all 0 ≤ i ≤ n and ρ_t(i′) = 1 for all 0 ≤ i′ ≤ n′

Figure 7: CYK deduction system for an all-accepting SLCFRS in normal form with fan-out v_{k_s|k_t}

UnaryMixed:
  ⟨[T, ρ^T_s], [T, ρ^T_t]⟩, ⟨[A, ρ^A_s], [A, ρ^A_t]⟩
  ─────────────────────────────────────────────────  ρ^T_s ∩ ρ^A_s = 0^n, ρ^T_t ∩ ρ^A_t = 0^{n′}, b(ρ_s) ≤ k_s, b(ρ_t) ≤ k_t
  ⟨[A, ρ_s], [A, ρ_t]⟩
  where ρ_s = ρ^T_s ∪ ρ^A_s and ρ_t = ρ^T_t ∪ ρ^A_t

BinaryMixed:
  ⟨[T, ρ^T_s], [T, ρ^T_t]⟩, ⟨[A, ρ¹_s], [A, ρ¹_t]⟩, ⟨[A, ρ²_s], [A, ρ²_t]⟩
  ────────────────────────────────────────────────────────────────────────
  ⟨[A, ρ³_s], [A, ρ³_t]⟩
  side conditions: ρ^T_s ∩ ρ¹_s = 0^n, ρ¹_s ∩ ρ²_s = 0^n, ρ²_s ∩ ρ^T_s = 0^n, ρ^T_t ∩ ρ¹_t = 0^{n′}, ρ¹_t ∩ ρ²_t = 0^{n′}, ρ²_t ∩ ρ^T_t = 0^{n′}, b(ρ³_s) ≤ k_s, b(ρ³_t) ≤ k_t
  where ρ³_s = ρ^T_s ∪ ρ¹_s ∪ ρ²_s and ρ³_t = ρ^T_t ∪ ρ¹_t ∪ ρ²_t

Figure 8: Additional inference rules for the deduction system in Figure 7, for simulating an SLCFRS of rank 2 without the normal-form restriction.

                     SLCFRS-NF                u = 2                            Søgaard (2010)    Data
                     v=2_{1|1}   v=4_{2|2}    v=2_{1|1}  v=4_{2|2}             NF-ITG   ITG      #SPs   min    med    max
                     (= NF-ITG)               (= ITG)

Martin   en-ro (30)  45.07       97.85        95.07      100.00                -        -        447    2|2    20|19  96|94
         en-hi (40)  82.73       100.00       96.36      100.00 (1|2, 2|1)     -        -        115    1|1    10|12  45|58
         en-iu (40)  40.66       95.60        100.00     100.00                -        -        100    10|3   26|10  79|26
Padó     en-de (15)  73.74       100.00       94.41      100.00 (1|2)          38.97    45.13    987    5|5    24|23  40|40
Mihal.   en-fr       67.56       98.88        95.30      100.00                *76.98   *81.75   447    2|2    16|17  30|30
Graça    en-fr       73.00       100.00       95.00      100.00 (1|2)          65.00    68.00    100    4|4    11|13  14|21
         en-pt       76.00       100.00       98.00      100.00 (1|2, 2|1)     65.00    67.00    100    4|3    11|12  14|21
         en-es       82.00       100.00       96.00      100.00 (1|2, 2|1)     73.00    74.00    100    4|4    11|11  14|24
         pt-fr       73.00       97.00        92.00      100.00 (1|2)          63.00    63.00    100    3|4    12|13  21|21
         pt-es       90.00       99.00        99.00      100.00 (1|2, 2|1)     80.00    81.00    100    3|4    12|11  21|24
         es-fr       74.00       100.00       91.00      100.00 (1|2)          68.00    68.00    100    4|4    11|13  24|21
CDT      da-en (25)  72.90       98.93        97.80      100.00                -        -        5464   1|1    16|17  89|98
         da-de (25)  64.87       98.42        94.94      100.00 (1|2, 2|1)     *47.62   *49.35   449    1|1    17|18  75|74
         da-es (25)  66.61       97.68        97.50      100.00                *30.68   *35.54   807    1|1    16|18  78|97
         da-it (25)  69.01       97.65        97.95      100.00                *60.00   *60.00   1514   1|1    16|19  78|268
Holmqv.  en-sv (30)  82.83       99.78        95.60      100.00                -        -        1164   1|1    21|19  40|40
Schoen.  en-de (40)  29.15       94.74        76.11      100.00                -        -        300    1|1    21|22  77|79
Lambert  en-es (40)  47.15       97.83        94.85      100.00                -        -        500    4|4    26|27  90|99
Macken   en-nl (30)  57.14       98.86        94.86      100.00                -        -        699    1|1    20|19  107|105

Table 1: Alignment reachability scores of our experiments and those of Søgaard (2010), plus characteristics of the data sets. The numbers in parentheses after the language pairs are the sentence length cut-offs used in our experiments. The results marked with * are not directly comparable to ours because different versions of the data sets were used.


3.4 Results

Table 1 shows the results. It confirms that NF-ITG is not capable of generating the majority of alignment configurations. However, when allowing discontinuous constituents with maximally two blocks on each side (v = 4_{2|2}), NF-SLCFRS induces all alignments present in six of the data sets and reaches scores > 97 for the other data sets, except two of them for which scores are still > 94.7.

For grammars without the normal-form constraint, alignment reachability is generally higher. We tested grammars of rank 2 and found that over 90% of the sentence pairs in each data set can be induced without the necessity of discontinuous constituents (except for the data set Schoen.). Such grammars roughly correspond to successfully applied translation models, e.g. in Hiero (Chiang, 2007). Nevertheless, our experiments show that the gold alignments contain a proportion of structures that cannot be generated by ITGs. With a (2, 4_{2|2})-SLCFRS, all occurring alignment configurations are captured. For some data sets, a fan-out of 3 is enough to induce all alignments. This is indicated by 1|2 and 2|1.

Going back to grammars in normal form, the sentence pairs that cannot be induced with a grammar of fan-out 4_{2|2} all display translation units that require three (or very rarely four) blocks on at least the source or the target side. An interesting observation is that only the English-Inuktitut data can nevertheless be generated with fan-out 4, by distributing the allowed discontinuity unequally: with an NF-SLCFRS with fan-out 4_{3|1}, the alignment reachability is 100. This is not surprising given the fact that Inuktitut is a polysynthetic language.

Previous results by Søgaard (2010) concerning the coverage of ITG and NF-ITG on hand-aligned data, repeated for convenience in Table 1, are much lower than ours and therefore present a highly distorted picture concerning the empirical need for discontinuous constituents. This is due to the fact that the implementation⁴ used for the experiments handles unaligned words incorrectly. They are added deterministically to the first constituent that encounters them, which leads to false negatives, as further explained in Figure 9. After fixing this issue, the same results as for NF-SLCFRS with v = 2_{1|1} are

⁴http://cst.dk/anders/itg-search.html

[Figure 9 diagram: a synchronous ITG parse chart over the sentence pair a b c d / a′ b′ c′ d′ with numbered constituents 1–6, omitted.]

Figure 9: Synchronous ITG parse chart produced by the implementation from Søgaard (2010): c "belongs to" constituent 6 while c′ "belongs to" constituent 5. When trying to combine 4 and 3, c and c′ are not considered as unaligned because they are already part of a constituent, and neither 5 nor 6 can be combined with 3 without creating a discontinuous constituent. The algorithm cannot find a larger continuous constituent, so the alignment validation returns false. However, this simple alignment structure lies within the power of NF-ITG and ITG.

obtained. Another problem of the implementation concerns discontinuous translation units. Søgaard's alignment validation returns false if the words in the gap are aligned, although such configurations are induced by unrestricted ITG; see Søgaard and Wu (2009, Section 3.2.1).

4 Discussion

Our experiments show that by moving from synchronous grammars with only continuous constituents to grammars that allow two blocks per constituent, (almost) all manual alignments can be generated, depending on whether the normal form is enforced or not. Given the parsing complexity that comes with allowing discontinuities, this is a promising finding, since it has already been shown for monolingual parsing that restricting the fan-out to 2 drastically reduces parsing times (Maier et al., 2012). In the future, we might also investigate whether refraining from ill-nested structures (Maier and Lichte, 2011) is a reasonable option for tree-based machine translation in order to reduce complexity (Gómez-Rodríguez et al., 2010).

Even though bitext parsing complexity for SLCFRS is prohibitively high, we expect that, given the techniques that have been developed for translation with SCFG, SLCFRS will find its application as a translation model. In practice, only source-side parsing is performed for translation, and various pruning methods are applied to reduce the search space (e.g. in Chiang (2005), Yamada and Knight (2002) and many others).

It should also be mentioned that it is not yet clear how alignment reachability scores relate to machine translation quality and evaluation. We can nevertheless infer from the presented results that what is considered as translationally equivalent by the annotators of the data sets and their guidelines is beyond the search space of SCFG. A supplementary study could furthermore investigate translation unit error rates (Søgaard and Kuhn, 2009) for the data sets, under the assumption of a hierarchical SLCFRS alignment with a specific fan-out.

5 Related Work

Our empirical investigation extends previous studies and thus provides new insights. Both Wellington et al. (2006) and Søgaard (2010) use a bottom-up hierarchical alignment algorithm with the goal of investigating the alignment complexity of manually aligned parallel corpora. Søgaard (2010) is however only interested in the alignment reachability of ITG and NF-ITG, and nothing beyond. We have furthermore revealed that the presented results underestimate the alignment capacity of ITG and NF-ITG.

The study of Wellington et al. (2006) is very similar to ours in that the number of blocks in discontinuous constituents that are required for hierarchical alignment is investigated. The word alignments are however treated disjunctively, which means that in the case of n-to-m alignments with n, m ≥ 1, it is enough to induce one of the involved alignments. With this methodology, a large class of the discontinuities we are interested in, e.g. cross-serial discontinuous translation units, is ignored. The failure rates they present are therefore much lower than ours. Wellington et al. (2006) also show that when constraining synchronous derivations by monolingual syntactic parse trees on the source and/or target side, allowing discontinuous constituents becomes even more important for inducing gold alignments.

We are of course not the first to propose a translation model that is expressive enough to induce the alignments in Figure 1. Following up on a translation model proposed by Simard et al. (2005), Galley and Manning (2010) extend the phrase-based approach by allowing discontinuous phrase pairs. Their system outperforms a phrase-based system and a system based on SCFG of rank 2. In a way, our proposal to use SLCFRS is the syntax-based counterpart to their approach. Methods to integrate linguistic constituency information into the so far only formal tree-based approach can be directly transferred from the SCFG-based approaches to SLCFRS. In contrast, it is not obvious how to include such information in phrase-based systems.

Søgaard (2008) proposes to use an even more expressive formalism than LCFRS, namely range concatenation grammar, and to exploit its ability to copy substrings during the derivation. The downsides of this approach are already mentioned in Søgaard and Kuhn (2009); for example, no tight probability estimation is possible for such a grammar.

The necessity of moving towards mildly context-sensitive formalisms for translation modeling has also been advocated by Melamed (Melamed et al., 2004; Melamed, 2004). That step was however not motivated by the induction of specific complex translation units, but rather by the general observation that discontinuous constituents are necessary for synchronous derivations using linguistically motivated grammars. Discontinuous constituents also emerge when binarizing synchronous grammars with continuous yields and rank ≥ 4 (Melamed, 2003; Rambow and Satta, 1999).

6 Conclusion

Motivated by the finding that synchronous CFG cannot induce certain alignment configurations, we suggest using synchronous LCFRS instead, which allows for discontinuities. Even though our empirical investigation shows that with exclusively continuous derivations more manual alignments can be captured than previously reported, there are still many aligned sentence pairs that can only be generated when setting the fan-out of the translation grammar to > 2. It remains to be determined how such more accurate and more expressive models relate to translation quality.


Acknowledgments

I would like to thank Laura Kallmeyer and Wolfgang Maier for discussions and comments, the reviewers for their suggestions, Anders Søgaard for discussions concerning ITG and his implementation, and Zdeněk Žabokrtský for helping with the CDT data. This research was funded by the German Research Foundation as part of the project Grammar Formalisms beyond Context-Free Grammars and their Use for Machine Learning Tasks.

References

Pierre Boullier. 1998. Proposal for a Natural Language Processing syntactic backbone. Technical Report 3342, INRIA.

Matthias Buch-Kromann, Iørn Korzen, and Henrik Høeg Müller. 2009. Uncovering the lost structure of translations with parallel treebanks. Copenhagen Studies in Language, 38:199–224.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics, pages 263–270.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33(2):201–228.

Kilian Evang and Laura Kallmeyer. 2011. PLCFRS parsing of English discontinuous constituents. In Proceedings of the 12th International Conference on Parsing Technologies (IWPT), pages 104–116.

Michel Galley and Christopher D. Manning. 2010. Accurate non-hierarchical phrase-based translation. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 966–974.

Carlos Gómez-Rodríguez, Marco Kuhlmann, and Giorgio Satta. 2010. Efficient parsing of well-nested Linear Context-Free Rewriting Systems. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 276–284.

João de Almeida Varelas Graça, Joana Paulo Pardal, Luísa Coheur, and Diamantino António Caseiro. 2008. Building a golden collection of parallel multi-language word alignment. In The 6th International Conference on Language Resources and Evaluation (LREC 2008).

Hieu Hoang and Philipp Koehn. 2010. Improved translation with source syntax labels. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 409–417. Association for Computational Linguistics.

Maria Holmqvist and Lars Ahrenberg. 2011. A gold standard for English–Swedish word alignment. In Proceedings of the 18th Nordic Conference of Computational Linguistics NODALIDA 2011, pages 106–113.

Laura Kallmeyer and Wolfgang Maier. 2013. Data-driven parsing using Probabilistic Linear Context-Free Rewriting Systems. Computational Linguistics, 39(1). Accepted for publication.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 48–54. Association for Computational Linguistics.

Marco Kuhlmann and Giorgio Satta. 2009. Treebank grammar techniques for non-projective dependency parsing. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 478–486.

Patrik Lambert, Adrià de Gispert, Rafael Banchs, and José B. Mariño. 2005. Guidelines for word alignment evaluation and manual alignment. Language Resources and Evaluation, 39:267–285.

Lieve Macken. 2010. An annotation scheme and gold standard for Dutch-English word alignment. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10).

Wolfgang Maier and Timm Lichte. 2011. Characterizing discontinuity in constituent treebanks. In Formal Grammar 2009, Revised Selected Papers, volume 5591 of LNAI. Springer.

Wolfgang Maier, Miriam Kaeshammer, and Laura Kallmeyer. 2012. Data-driven PLCFRS parsing revisited: Restricting the fan-out to two. In Proceedings of the Eleventh International Conference on Tree Adjoining Grammars and Related Formalisms (TAG+11).

Wolfgang Maier. 2010. Direct parsing of discontinuous constituents in German. In Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages.

Joel Martin, Rada Mihalcea, and Ted Pedersen. 2005. Word alignment for languages with scarce resources. In ACL Workshop on Building and Using Parallel Texts.

I. Dan Melamed, Giorgio Satta, and Benjamin Wellington. 2004. Generalized multitext grammars. In Proceedings of the 42nd Annual Conference of the Association for Computational Linguistics (ACL).

I. Dan Melamed. 2003. Multitext grammars and synchronous parsers. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 79–86.

I. Dan Melamed. 2004. Statistical machine translation by parsing. In Proceedings of the 42nd Annual Conference of the Association for Computational Linguistics (ACL).

Rada Mihalcea and Ted Pedersen. 2003. An evaluation exercise for word alignment. In Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, pages 1–10.

Sebastian Padó and Mirella Lapata. 2006. Optimal constituent alignment with edge covers for semantic projection. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 1161–1168.

Owen Rambow and Giorgio Satta. 1999. Independent parallelism in finite copying parallel rewriting systems. Theoretical Computer Science, 223(1-2):87–120.

Giorgio Satta and Enoch Peserico. 2005. Some computational complexity results for synchronous context-free grammars. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 803–810.

Hiroyuki Seki, Takashi Matsumura, Mamoru Fujii, and Tadao Kasami. 1991. On Multiple Context-Free Grammars. Theoretical Computer Science, 88(2):191–229.

Stuart M. Shieber, Yves Schabes, and Fernando C. N. Pereira. 1995. Principles and implementation of deductive parsing. Journal of Logic Programming, 24(1&2):3–36.

Michel Simard, Nicola Cancedda, Bruno Cavestro, Marc Dymetman, Eric Gaussier, Cyril Goutte, Kenji Yamada, Philippe Langlais, and Arne Mauser. 2005. Translating with non-contiguous phrases. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 755–762.

Anders Søgaard and Jonas Kuhn. 2009. Empirical lower bounds on alignment error rates in syntax-based machine translation. In Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation (SSST '09). Association for Computational Linguistics.

Anders Søgaard and Dekai Wu. 2009. Empirical lower bounds on translation unit error rate for the full class of inversion transduction grammars. In Proceedings of the 11th International Conference on Parsing Technologies, pages 33–36.

Anders Søgaard. 2008. Range concatenation grammars for translation. In Proceedings of Coling 2008: Companion volume: Posters.

Anders Søgaard. 2010. Can inversion transduction grammars generate hand alignments? In Proceedings of the 14th Annual Conference of the European Association for Machine Translation (EAMT).

Andreas van Cranenburgh. 2012. Efficient parsing with Linear Context-Free Rewriting Systems. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics.

K. Vijay-Shanker, David Weir, and Aravind K. Joshi. 1987. Characterizing structural descriptions produced by various grammatical formalisms. In Proceedings of the 25th Annual Meeting of the Association for Computational Linguistics.

David Weir. 1988. Characterizing Mildly Context-Sensitive Grammar Formalisms. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA.

Benjamin Wellington, Sonjia Waxmonsky, and I. Dan Melamed. 2006. Empirical lower bounds on the complexity of translational equivalence. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 977–984.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

Kenji Yamada and Kevin Knight. 2002. A decoder for syntax-based statistical MT. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.

Andreas Zollmann and Ashish Venugopal. 2006. Syntax augmented machine translation via chart parsing. In Proceedings of the Workshop on Statistical Machine Translation, HLT/NAACL.


Author Index

Addanki, Karteek, 48

Carpuat, Marine, 1

Freitag, Markus, 29

Herrmann, Teresa, 39

Huck, Matthias, 29

Kaeshammer, Miriam, 68

Maillette de Buy Wenniger, Gideon, 19, 58

Ney, Hermann, 29

Niehues, Jan, 39

Saers, Markus, 48

Sima'an, Khalil, 19, 58

Singh, Thoudam Doren, 11

Vilar, David, 29

Waibel, Alex, 39

Wu, Dekai, 48
