
    Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 452–463, Sofia, Bulgaria, August 8-9, 2013. © 2013 Association for Computational Linguistics

    A Phrase Orientation Model for Hierarchical Machine Translation

    Matthias Huck and Joern Wuebker and Felix Rietig and Hermann Ney
    Human Language Technology and Pattern Recognition Group

    Computer Science Department
    RWTH Aachen University
    D-52056 Aachen, Germany

    {huck,wuebker,rietig,ney}@i6.informatik.rwth-aachen.de

    Abstract

    We introduce a lexicalized reordering model for hierarchical phrase-based machine translation. The model scores monotone, swap, and discontinuous phrase orientations in the manner of the one presented by Tillmann (2004). While this type of lexicalized reordering model is a valuable and widely-used component of standard phrase-based statistical machine translation systems (Koehn et al., 2007), it is commonly not employed in hierarchical decoders.

    We describe how phrase orientation probabilities can be extracted from word-aligned training data for use with hierarchical phrase inventories, and show how orientations can be scored in hierarchical decoding. The model is empirically evaluated on the NIST Chinese→English translation task. We achieve a significant improvement of +1.2 %BLEU over a typical hierarchical baseline setup and an improvement of +0.7 %BLEU over a syntax-augmented hierarchical setup. On a French→German translation task, we obtain a gain of up to +0.4 %BLEU.

    1 Introduction

    In hierarchical phrase-based translation (Chiang, 2005), a probabilistic synchronous context-free grammar (SCFG) is induced from bilingual training corpora. In addition to continuous lexical phrases as in standard phrase-based translation, hierarchical phrases with usually up to two non-terminals are extracted from the word-aligned parallel training data.

    Hierarchical decoding is typically carried out with a parsing-based procedure. The parsing algorithm is extended to handle translation candidates and to incorporate language model scores via cube pruning (Chiang, 2007). During decoding, a hierarchical translation rule implicitly specifies the placement of the target part of a sub-derivation which is substituting one of its non-terminals in a partial hypothesis. The hierarchical phrase-based model thus provides an integrated reordering mechanism. The reorderings which are being conducted by the hierarchical decoder are a result of the application of SCFG rules, which generally means that there must have been some evidence in the training data for each reordering operation. At first glance one might be tempted to believe that any additional designated phrase orientation modeling would be futile in hierarchical translation as a consequence of this. We argue that such a conclusion is false, and we will provide empirical evidence in this work that lexicalized phrase orientation scoring can be highly beneficial not only in standard phrase-based systems, but also in hierarchical ones.

    The purpose of a phrase orientation model is to assess the adequacy of phrase reordering during search. In standard phrase-based translation with continuous phrases only and left-to-right hypothesis generation (Koehn et al., 2003; Zens and Ney, 2008), phrase reordering is implemented by jumps within the input sentence. The choice of the best order for the target sequence is made based on the language model score of this sequence and a distortion cost that is computed from the source-side jump distances. Though the space of admissible reorderings is in most cases constrained by a maximum jump width or coverage-based restrictions (Zens et al., 2004) for efficiency reasons, the basic approach of arbitrarily jumping to uncovered positions on source side is still very permissive. Lexicalized reordering models assist the decoder in making good reordering decisions. Phrase-based decoding allows for a straightforward integration of lexicalized reordering models which assign different scores depending on how a currently translated phrase has been reordered with respect to its context. Popular lexicalized reordering models for phrase-based translation distinguish three orientation classes: monotone, swap, and discontinuous (Tillmann, 2004; Koehn et al., 2007; Galley and Manning, 2008). To obtain such a model, scores for the three classes are calculated from the counts of the respective orientation occurrences in the word-aligned training data for each extracted phrase. The left-to-right orientation of phrases during phrase-based search can be easily determined from the start and end positions of continuous phrases. Approximations may need to be adopted for the right-to-left scoring direction.

    The utility of phrase orientation models in standard phrase-based translation is plausible and has been empirically established in practice. In hierarchical phrase-based translation, some other types of lexicalized reordering models have been investigated recently (He et al., 2010a; He et al., 2010b; Hayashi et al., 2010; Huck et al., 2012a), but in none of them are the orientation scores conditioned on the lexical identity of each phrase individually. These models are rather word-based and applied on block boundaries. Experimental results obtained with these other types of lexicalized reordering models have been very encouraging, though.

    There are certain reasons why assessing the adequacy of phrase reordering should be useful in hierarchical search:

    • Although phrase reorderings are always a result of the application of SCFG rules, the decoder is still able to choose from many different parses of the input sentence.

    • The decoder can furthermore choose from many translation options for each given parse, which result in different reorderings and different phrases being embedded in the reordering non-terminals.

    • All other models only weakly connect an embedded phrase with the hierarchical phrase it is placed into, in particular as the set of non-terminals of the hierarchical grammar only contains two generic non-terminal symbols.

    We therefore investigate phrase orientation modeling for hierarchical translation in this work.

    2 Outline

    The remainder of the paper is structured as follows: We briefly outline important related publications in the following section. We subsequently give a summary of some essential aspects of the hierarchical phrase-based translation approach (Section 4). Phrase orientation modeling and a way in which a phrase orientation model can be trained for hierarchical phrase inventories are explained in Section 5. In Section 6 we introduce an extension of hierarchical search which enables the decoder to score phrase orientations. Empirical results are presented in Section 7. We conclude the paper in Section 8.

    3 Related Work

    Hierarchical phrase-based translation was proposed by Chiang (2005). He et al. (2010a) integrated a maximum entropy based lexicalized reordering model with word- and class-based features. Different classifiers for different rule patterns are trained for their model (He et al., 2010b). A comparable discriminatively trained model which applies a single classifier for all types of rules was developed by Huck et al. (2012a). Hayashi et al. (2010) explored the word-based reordering model by Tromble and Eisner (2009) in hierarchical translation.

    For standard phrase-based translation, Galley and Manning (2008) introduced a hierarchical phrase orientation model. Similar to previous approaches (Tillmann, 2004; Koehn et al., 2007), it distinguishes the three orientation classes monotone, swap, and discontinuous. However, it differs in that it is not limited to model local reordering phenomena, but allows for phrases to be hierarchically combined into blocks in order to determine the orientation class. This has the advantage that probability mass is shifted from the rather uninformative default category discontinuous to the other two orientation classes, which model the location of a phrase more specifically. In this work, we transfer this concept to a hierarchical phrase-based machine translation system.

    4 Hierarchical Phrase-Based Translation

    The non-terminal set of a standard hierarchical grammar comprises two symbols which are shared by source and target: the initial symbol S and one generic non-terminal symbol X. The generic non-terminal X is used as a placeholder for the gaps within the right-hand side of hierarchical translation rules as well as on all left-hand sides of the translation rules that are extracted from the parallel training corpus.

    Extracted rules of a standard hierarchical grammar are of the form X → 〈α, β, ∼〉, where 〈α, β〉 is a bilingual phrase pair that may contain X, i.e. α ∈ ({X} ∪ V_F)+ and β ∈ ({X} ∪ V_E)+, where V_F and V_E are the source and target vocabulary, respectively. The non-terminals on the source side and on the target side of hierarchical rules are linked in a one-to-one correspondence. The ∼ relation defines this one-to-one correspondence. In addition to the extracted rules, a non-lexicalized initial rule

        S → 〈X∼0, X∼0〉    (1)

    is engrafted into the hierarchical grammar, as well as a special glue rule

        S → 〈S∼0 X∼1, S∼0 X∼1〉    (2)

    that the system can use for serial concatenation of phrases as in monotonic phrase-based translation. The initial symbol S is the start symbol of the grammar.

    Hierarchical search is conducted with a customized version of the CYK+ parsing algorithm (Chappelier and Rajman, 1998) and cube pruning (Chiang, 2007). A hypergraph which represents the whole parsing space is built employing CYK+. Cube pruning operates in bottom-up topological order on this hypergraph and expands at most k derivations at each hypernode.
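    To make the decoding loop concrete, here is a minimal sketch of cube pruning over such a hypergraph. It is an illustration, not the actual decoder implementation: the node/edge attributes (incoming, tails, cost, kbest) are invented for the sketch, language model state handling and hypothesis recombination are omitted, and the k-best lists of all tail nodes are assumed to have been computed beforehand in bottom-up topological order.

```python
import heapq
from itertools import count

_tiebreak = count()  # heap tie-breaker so edge objects are never compared directly

def kbest_derivations(node, k):
    """Expand at most k derivations at one hypernode in best-first order."""
    heap, seen, kbest = [], set(), []
    for edge in node.incoming:
        _push(heap, seen, edge, (0,) * len(edge.tails))  # seed: best sub-derivations
    while heap and len(kbest) < k:
        cost, _, edge, idx = heapq.heappop(heap)
        kbest.append((cost, edge, idx))
        for pos in range(len(idx)):  # neighbors: advance one tail index by one
            _push(heap, seen, edge, idx[:pos] + (idx[pos] + 1,) + idx[pos + 1:])
    return kbest

def _push(heap, seen, edge, idx):
    """Score the derivation (edge, idx) and enqueue it at most once."""
    if (id(edge), idx) in seen:
        return
    seen.add((id(edge), idx))
    if any(i >= len(tail.kbest) for tail, i in zip(edge.tails, idx)):
        return  # some tail node has no i-th best derivation
    cost = edge.cost + sum(tail.kbest[i][0] for tail, i in zip(edge.tails, idx))
    heapq.heappush(heap, (cost, next(_tiebreak), edge, idx))
```

    Processing the hypernodes bottom-up and assigning node.kbest = kbest_derivations(node, k) yields the pruned derivation forest on which all models, including the phrase orientation model introduced below, are scored.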

    5 Modeling Phrase Orientation for Hierarchical Machine Translation

    The phrase orientation model we are using was introduced by Galley and Manning (2008). To model the sequential order of phrases within the global translation context, the three orientation classes monotone (M), swap (S) and discontinuous (D) are distinguished, each in both left-to-right and right-to-left direction. In order to capture the global rather than the local context, previous phrases can be merged into blocks if they are consistent with respect to the word alignment. A phrase is in monotone orientation if a consistent monotone predecessor block exists, and in swap orientation if a consistent swap predecessor block exists. Otherwise it is in discontinuous orientation.

    Given a sequence of source words f_1^J and a sequence of target words e_1^I, a block 〈f_{j1}^{j2}, e_{i1}^{i2}〉 (with 1 ≤ j1 ≤ j2 ≤ J and 1 ≤ i1 ≤ i2 ≤ I) is consistent with respect to the word alignment A ⊆ {1, …, I} × {1, …, J} iff

        ∃(i, j) ∈ A : i1 ≤ i ≤ i2 ∧ j1 ≤ j ≤ j2   ∧   ∀(i, j) ∈ A : (i1 ≤ i ≤ i2 ↔ j1 ≤ j ≤ j2)    (3)
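    Definition (3) translates directly into code. A minimal sketch in Python, with A given as a set of (i, j) pairs and 1-based indices as in the text:

```python
def is_consistent(A, i1, i2, j1, j2):
    """Check whether the block <f_{j1..j2}, e_{i1..i2}> is consistent
    w.r.t. the word alignment A (Eq. 3)."""
    if not any(i1 <= i <= i2 and j1 <= j <= j2 for (i, j) in A):
        return False  # first conjunct: at least one link inside the block
    # second conjunct: every link lies completely inside or completely outside
    return all((i1 <= i <= i2) == (j1 <= j <= j2) for (i, j) in A)
```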

    Consistency is based upon two conditions in this definition: (1.) at least one source and target position within the block must be aligned, and (2.) words from inside the source interval may only be aligned to words from inside the target interval and vice versa. These are the same conditions as those that are applied for the extraction of standard continuous phrases. The only difference is that length constraints are applied to phrases, but not to blocks.

    Figure 1 illustrates the extraction of monotone, swap, and discontinuous orientation classes in left-to-right direction from word-aligned bilingual training samples. The right-to-left direction works analogously.

    [Figure 1: Extraction of the orientation classes monotone, swap, and discontinuous from word-aligned training samples. The examples show the left-to-right orientation of the shaded phrases. The dashed rectangles indicate how the predecessor words are merged into blocks with regard to their word alignment. Panels: (a) monotone phrase orientation; (b) swap phrase orientation; (c) discontinuous phrase orientation.]

    We found that this concept can be neatly plugged into the hierarchical phrase-based translation paradigm, without having to resort to approximations in decoding, which is necessary to determine the right-to-left orientation in a standard phrase-based system (Cherry et al., 2012). To train the orientations, the extraction procedure from the standard phrase-based version of the reordering model can be used with a minor extension. The model is trained on the same word-aligned data from which the translation rules are extracted. For each training sentence, we extract all phrases of unlimited length that are consistent with the word alignment, and store their corners in a matrix. The corners are distinguished by their location: top-left, top-right, bottom-left, and bottom-right. For each bilingual phrase, we determine its left-to-right and right-to-left orientation by checking for adjacent corners.
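    The corner bookkeeping can be sketched as follows. The data layout is illustrative (the paper does not fix one): for every alignment-consistent block we record its (target end, source end) cell and its (target end, source start) cell, which suffices to decide the left-to-right orientation of a phrase by a simple adjacency lookup.

```python
def lr_orientation_at_extraction(block_ends, block_starts, i1, j1, j2):
    """Left-to-right orientation of a phrase with target span starting at i1
    and source span [j1, j2]. block_ends holds the (target_end, source_end)
    cells and block_starts the (target_end, source_start) cells of all blocks
    that are consistent with the word alignment of the sentence pair."""
    if (i1 - 1, j1 - 1) in block_ends:    # a predecessor block ends right before
        return "M"                        # the phrase on both sides: monotone
    if (i1 - 1, j2 + 1) in block_starts:  # a predecessor block starts right after
        return "S"                        # the phrase's source span: swap
    return "D"                            # no adjacent consistent block: discontinuous
```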

    The lexicalized orientation probability for the orientation O ∈ {M, S, D} and the phrase pair 〈α, β〉 is estimated as its smoothed relative frequency:

        p(O) = N(O) / Σ_{O′ ∈ {M,S,D}} N(O′)    (4)

        p(O|α, β) = (σ · p(O) + N(O|α, β)) / (σ + Σ_{O′ ∈ {M,S,D}} N(O′|α, β))    (5)

    Here, N(O) denotes the global count and N(O|α, β) the lexicalized count for the orientation O. σ is a smoothing constant.
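    In code, the estimation amounts to a smoothed relative frequency per phrase pair. A sketch; the default value sigma=0.5 is purely illustrative, as the paper only states that σ is a smoothing constant:

```python
from collections import defaultdict

N_lex = defaultdict(lambda: defaultdict(int))  # N(O | alpha, beta)
N_glob = defaultdict(int)                      # N(O)

def p_orientation(O, phrase_pair, sigma=0.5):
    """Smoothed relative frequency of orientation O in {'M', 'S', 'D'}
    for the given phrase pair, following Eqs. (4) and (5)."""
    p_global = N_glob[O] / sum(N_glob.values())                   # Eq. (4)
    n = N_lex[phrase_pair]
    return (sigma * p_global + n[O]) / (sigma + sum(n.values()))  # Eq. (5)
```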

    To determine the orientation frequency for a hierarchical phrase with non-terminal symbols, the orientation counts of all those phrases are accumulated from which a sub-phrase is cut out and replaced by a non-terminal symbol to obtain this hierarchical phrase. Figure 2 gives an example.

    [Figure 2: Accumulation of orientation counts for hierarchical phrases during extraction. The hierarchical phrase 〈f2 X∼0 f4, e2 X∼0 e4〉 (dark shaded) can be extracted from all three training samples. Its orientation is identical to the orientation of the continuous phrase (lightly shaded) which the sub-phrase is cut out of, respectively. Note that the actual lexical content of the sub-phrase may differ. For instance, the sub-phrase 〈f3, e3〉 is being cut out in Fig. 2a, and the sub-phrase 〈f6, e6〉 is being cut out in Fig. 2b. Panels, with running left-to-right orientation counts for 〈f2 X∼0 f4, e2 X∼0 e4〉: (a) a monotone orientation: N(M) = 1, N(S) = 0, N(D) = 0; (b) another monotone orientation: N(M) = 2, N(S) = 0, N(D) = 0; (c) a swap orientation: N(M) = 2, N(S) = 1, N(D) = 0.]

    Negative logarithms of the values are used as costs in the log-linear model combination (Och and Ney, 2002). Cost 0 for all orientations is assigned to the special rules which are not extracted from the training data (initial and glue rule).


    [Figure 3: Scoring with the orientation classes monotone, swap, and discontinuous. Each picture shows exactly one hierarchical phrase. The block which replaces the non-terminal X during decoding is embedded with the orientation of this non-terminal X within the hierarchical phrase. The examples show the left-to-right orientation of the non-terminal. The left-to-right orientation can be detected from the word alignment of the hierarchical phrase, except for cases where the non-terminal is in boundary position on target side. Panels: (a) monotone non-terminal orientation; (b) swap non-terminal orientation; (c) discontinuous non-terminal orientation.]

    6 Phrase Orientation Scoring in Hierarchical Decoding

    Our implementation of phrase orientation scoring in hierarchical decoding is based on the observation that hierarchical rule applications, i.e. the usage of rules with non-terminals within their right-hand sides, settle the target sequence order. Monotone, swap, or discontinuous orientations of blocks are each due to monotone, swap, or discontinuous placement of non-terminals which are being substituted by these blocks.

    The problem of phrase orientation scoring can thus be mostly reduced to three steps which need to be carried out whenever a hierarchical rule is applied (a code sketch follows the list):

    1. Determining the orientations of the non-terminals in the rule.

    2. Retrieving the proper orientation cost of the topmost rule application in the sub-derivation which corresponds to the embedded block for the respective non-terminal.

    3. Applying the orientation cost to the log-linear model combination for the current derivation.
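    A compact sketch of the three steps, in the left-to-right direction. All attribute and helper names here are invented for illustration; lr_orientation is sketched in Section 6.1 below:

```python
def rule_orientation_cost(rule, subderivations, model):
    """Left-to-right orientation cost contributed by one application of a
    hierarchical rule; subderivations[n] is the derivation substituted for
    the rule's n-th non-terminal."""
    total = 0.0
    for n, nt in enumerate(rule.nonterminals):
        orient = lr_orientation(rule.alignment, nt.target_pos, nt.source_pos)  # step 1
        top = subderivations[n].top_rule                                       # step 2
        total += model.cost(top.phrase_pair, orient)
    return total  # step 3: added to the log-linear score of the current derivation
```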

    The orientation of a non-terminal in a hierarchical rule is dependent on the word alignments in its context. Figure 3 depicts three examples.[1] We however need to deal with special cases where a non-terminal orientation cannot be established at the moment when the hierarchical rule is considered. We first describe the non-degenerate case (Section 6.1). Afterwards we briefly discuss our strategy in the special situation of boundary non-terminals where the non-terminal orientation cannot be determined from information which is inherent to the hierarchical rule under consideration (Section 6.3).

    [1] Note that even maximal consecutive lexical intervals (either on source or target side) are not necessarily aligned in a way which makes them consistent bilingual blocks. In Fig. 3a, e4 is for instance aligned both below and above the non-terminal. In Fig. 3c, neither 〈f1 f2, e1 e2〉 nor 〈f1 f2, e3 e4〉 would be valid continuous phrases (the same holds for 〈f3 f4, e1 e2〉 and 〈f3 f4, e3 e4〉). We actually need the generalization of the phrase orientation model to hierarchical phrases as described in Section 5 for this reason. Otherwise we would be able to just score neighboring consistent sub-blocks with a model that does not account for hierarchical phrases with non-terminals.

    We focus on left-to-right orientation scoring; right-to-left scoring is symmetric.

    6.1 Determining Orientations

    In order to determine the orientation class of a non-terminal, we rely on the word alignments within the phrases. With each phrase, we store the alignment matrix that has been seen most frequently during phrase extraction. Non-terminal symbols on target side are assumed to be aligned to the respective non-terminal symbols on source side according to the ∼ relation. In the alignment matrix, the rows and columns of non-terminals can obviously contain only exactly this one alignment link.

    [Figure 4: Determining the orientation class during decoding. Starting from the last previous aligned target position, a box is spanned across the relevant alignment links onto the corner of the non-terminal. The box is then checked for consistency. Panels: (a) last previous aligned target position; (b) initial box; (c) expansion of the initial box; (d) the final box is a consistent left-to-right monotone predecessor block of the non-terminal.]

    Starting from the last previous aligned target position to the left of the non-terminal, the algorithm expands a box that spans across the other relevant alignment links onto the corner of the non-terminal. Afterwards it checks whether the areas on the opposite sides of the non-terminal position are non-aligned in the source and target intervals of this box. The non-terminal is in discontinuous orientation if the box is not a consistent block. If the box is a consistent block, the non-terminal is in monotone orientation if its source-side position is larger than the maximum of the source-side interval of the box, and in swap orientation if its source-side position is smaller than the minimum of the source-side interval of the box.
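    The following is a simplified sketch of this check, not the authors' exact procedure: instead of alternating downward/upward and leftward expansions, it grows the box in all directions at once until it is alignment-closed, which is intended to yield the same monotone/swap/discontinuous decisions. All data conventions (a set of (target, source) link pairs per rule) are assumptions of the sketch.

```python
def closure(links, i_lo, i_hi, j_lo, j_hi):
    """Grow the box until it is alignment-closed: any link that is inside
    the box on one side only is pulled completely inside."""
    changed = True
    while changed:
        changed = False
        for (i, j) in links:
            if (i_lo <= i <= i_hi) != (j_lo <= j <= j_hi):
                i_lo, i_hi = min(i_lo, i), max(i_hi, i)
                j_lo, j_hi = min(j_lo, j), max(j_hi, j)
                changed = True
    return i_lo, i_hi, j_lo, j_hi

def lr_orientation(links, nt_i, nt_j):
    """Left-to-right orientation of the non-terminal occupying cell
    (nt_i, nt_j) of a rule-internal alignment (a set of (target, source)
    pairs); non-boundary case only, cf. Section 6.3."""
    lexical = {(i, j) for (i, j) in links if (i, j) != (nt_i, nt_j)}
    prev = [(i, j) for (i, j) in lexical if i < nt_i]
    assert prev, "boundary non-terminal: scoring must be delayed (Section 6.3)"
    i0 = max(i for (i, _) in prev)  # last previous aligned target position
    cols = [j for (i, j) in prev if i == i0]
    i_lo, i_hi, j_lo, j_hi = closure(lexical, i0, i0, min(cols), max(cols))
    if i_hi >= nt_i or j_lo <= nt_j <= j_hi:
        return "D"  # the box is no consistent predecessor block: discontinuous
    return "M" if nt_j > j_hi else "S"
```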

    Figure 4 illustrates how the procedure operates. In left-to-right direction, an initial box is spanned from the last previous aligned target position to the lower (monotone) or upper (swap) left corner of the non-terminal. In the example, starting from 〈f3, e5〉 (Fig. 4a), this initial box is spanned to the lower left corner by iterating from f3 to f4 and expanding its target interval to the minimum aligned target position within these two rows of the alignment matrix. The initial box covers 〈f3 f4, e3 e4 e5〉 (Fig. 4b). The procedure then repeatedly checks whether the box needs to be expanded, alternating to the bottom (monotone) or top (swap) and to the left, until no alignment links below or to the left of the box break the consistency. Two box expansions are conducted in the example: the first one expands the initial box below, resulting in a larger box which covers 〈f1 f2 f3 f4, e3 e4 e5〉 (Fig. 4c); the second one expands this new box to the left, resulting in a final box which covers 〈f1 f2 f3 f4, e1 e2 e3 e4 e5〉 (Fig. 4d) and does not need to be expanded towards the lower left corner any more. Afterwards the procedure examines whether the final box is a consistent block by inspecting whether the areas on the opposite side of the non-terminal position are non-aligned in the intervals of the box (areas with waved lines in Fig. 4d). These areas do not contain alignment links in the example: the orientation class of the non-terminal is monotone as it has a consistent left-to-right monotone predecessor block. (Suppose an alignment link 〈f5, e2〉 would break the consistency: the orientation class would then be discontinuous as the final box would not be a consistent block.)

    Orientations of non-terminals could basically be precomputed and stored in the translation table. We however compute them on demand during decoding. The computational overhead did not seem to be too severe in our experiments.

    6.2 Scoring Orientations

    Once the orientation is determined, the proper orientation cost of the embedded block needs to be retrieved. We access the topmost rule application in the sub-derivation which corresponds to the embedded block for the respective non-terminal and read the orientation model costs for this rule. The special case of delayed scoring for boundary non-terminals as described in the subsequent section is recursively processed if necessary. The retrieved orientation costs of the embedded blocks of all non-terminals are finally added to the log-linear model combination for the current derivation.

    6.3 Boundary Non-Terminals

    Cases where a non-terminal orientation cannot be established at the moment when the hierarchical rule is considered arise when a non-terminal symbol is in a boundary position on target side. We define a non-terminal to be in (left or right) boundary position iff no symbols are aligned between the phrase-internal target-side index of the non-terminal and the (left or right) phrase boundary. Left boundary positions of non-terminals are critical for left-to-right orientation scoring, right boundary positions for right-to-left orientation scoring. We denote non-terminals in boundary position as boundary non-terminals.

    The procedure as described in Section 6.1 is not applicable to boundary non-terminals because a last previous aligned target position does not exist. If it is impossible to determine the final non-terminal orientation in the hypothesis from information which is inherent to the phrase, we are forced to delay the orientation scoring of the embedded block. Our solution in these cases is to heuristically add fractional costs of all orientations the non-terminal can still eventually turn out to get placed in (cf. Figure 5). We do so because not adding an orientation cost to the derivation would give it an unjustified advantage over other ones. As soon as an orientation is constituted in an upper hypernode, any heuristic and actual orientation costs can be collected by means of a recursive call. Note that monotone or swap orientations in upper hypernodes can top-down transition into discontinuous orientations for boundary non-terminals, depending on existing phrase-internal alignment links in the context of the respective boundary non-terminal. In the derivation at the upper hypernode, the heuristic costs are subtracted and the correct actual costs added. Delayed scoring can lead to search errors; in order to keep them confined, the delayed scoring needs to be done separately for all derivations, not just for the first-best sub-derivations along the incoming hyperedges.

    [Figure 5: Left boundary non-terminal symbols. The orientations the non-terminal can eventually turn out to get placed in differ depending on existing alignment links in the rest of the phrase. Delayed left-to-right scoring is not required in cases as in Fig. 5d. Fractional costs for the possible orientations are temporarily applied in the other cases and recursively corrected as soon as an orientation is constituted in an upper hypernode. Panels: (a) left boundary non-terminal that can be placed in left-to-right monotone or discontinuous orientation when the phrase is embedded into another one; (b) left boundary non-terminal that can be placed in left-to-right discontinuous or swap orientation; (c) left boundary non-terminal that can be placed in left-to-right monotone, swap, or discontinuous orientation; (d) left boundary non-terminal that can only be placed in left-to-right discontinuous orientation.]
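    The paper does not spell out the exact form of the fractional costs. One plausible realization of the delayed-scoring heuristic, shown purely as a hedged sketch, is to charge the probability mass of all orientations the boundary non-terminal can still end up in:

```python
import math

def heuristic_boundary_cost(p_orientation, phrase_pair, possible):
    """Temporary cost for a boundary non-terminal whose orientation is still
    open: charge the mass of all orientations it can still be placed in.
    Once the orientation is constituted in an upper hypernode, this value is
    subtracted again and the actual cost -log p(O | alpha, beta) is added,
    recursively through the affected sub-derivations."""
    mass = sum(p_orientation(O, phrase_pair) for O in possible)
    return -math.log(mass)
```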

    7 Experiments

    We evaluate the effect of phrase orientation scoring in hierarchical translation on the Chinese→English 2008 NIST task[2] and on the French→German language pair using the standard WMT[3] newstest sets for development and testing.

    7.1 Experimental Setup

    We work with a Chinese–English parallel training corpus of 3.0 M sentence pairs (77.5 M Chinese / 81.0 M English running words). To train the French→German baseline system, we use 2.0 M sentence pairs (53.1 M French / 45.8 M German running words) that are partly taken from the Europarl corpus (Koehn, 2005) and have partly been collected within the Quaero project.[4]

    Word alignments are created by aligning the data in both directions with GIZA++[5] and symmetrizing the two trained alignments (Och and Ney, 2003). When extracting phrases, we apply several restrictions, in particular a maximum length of ten on source and target side for lexical phrases, a length limit of five on source and ten on target side for hierarchical phrases (including non-terminal symbols), and no more than two non-terminals per phrase.

    A standard set of models is used in the baselines, comprising phrase translation probabilities and lexical translation probabilities in both directions, word and phrase penalty, binary features marking hierarchical rules, glue rule, and rules with non-terminals at the boundaries, three simple count-based binary features, phrase length ratios, and a language model. The language models are 4-grams with modified Kneser-Ney smoothing (Kneser and Ney, 1995; Chen and Goodman, 1998) which have been trained with the SRILM toolkit (Stolcke, 2002).

    [2] http://www.itl.nist.gov/iad/mig/tests/mt/2008/
    [3] http://www.statmt.org/wmt13/translation-task.html
    [4] http://www.quaero.org
    [5] http://code.google.com/p/giza-pp/

    Model weights are optimized against BLEU (Papineni et al., 2002) with MERT (Och, 2003) on 100-best lists. For Chinese→English we employ MT06 as development set; MT08 is used as unseen test set. For French→German we employ newstest2009 as development set; newstest2008, newstest2010, and newstest2011 are used as unseen test sets. During decoding, a maximum length constraint of ten is applied to all non-terminals except the initial symbol S. Translation quality is measured in truecase with BLEU and TER (Snover et al., 2006). The results on MT08 are checked for statistical significance over the baseline. Confidence intervals have been computed using bootstrapping for BLEU and Cochran's approximate ratio variance for TER (Leusch and Ney, 2009).
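    For the BLEU significance testing, bootstrap resampling over test sentences is the standard recipe; a minimal sketch, where corpus_metric is a hypothetical helper that aggregates per-sentence sufficient statistics (for BLEU: clipped n-gram counts and lengths, not per-sentence scores) into a corpus-level score:

```python
import random

def bootstrap_interval(sent_stats, corpus_metric, n=1000, level=0.95):
    """Percentile bootstrap confidence interval for a corpus-level score."""
    scores = sorted(
        corpus_metric([random.choice(sent_stats) for _ in sent_stats])
        for _ in range(n)
    )
    k = int((1.0 - level) / 2.0 * n)
    return scores[k], scores[-(k + 1)]
```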

    7.2 Chinese→English Experimental Results

    Table 1 comprises all results of our empirical evaluation on the Chinese→English task.

    We first compare the performance of the phrase orientation model in left-to-right direction only with the performance of the phrase orientation model in left-to-right and right-to-left direction (bidirectional). In all experiments, monotone, swap, and discontinuous orientation costs are treated as being from different feature functions in the log-linear model combination: we assign a separate scaling factor to each of the orientations. We have three more scaling factors than in the baseline for left-to-right direction only, and six more scaling factors for bidirectional phrase orientation scoring. As can be seen from the results table, the left-to-right model already yields a gain of 1.1 %BLEU over the baseline on the unseen test set (MT08). The bidirectional model performs just slightly better (+1.2 %BLEU over the baseline). With both models, the TER is reduced significantly as well (-1.1 / -1.3 compared to the baseline). We adopted the discriminative lexicalized reordering model (discrim. RO) that has been suggested by Huck et al. (2012a) for comparison purposes. The phrase orientation model provides clearly better translation quality in our experiments.


    NIST Chinese→English                            MT06 (Dev)        MT08 (Test)
                                                   BLEU[%] TER[%]    BLEU[%] TER[%]
    HPBT Baseline                                   32.6    61.2      25.2    66.6
    + discrim. RO                                   33.0    61.3      25.8    66.0
    + phrase orientation (left-to-right)            33.3    60.7      26.3    65.5
    + phrase orientation (bidirectional)            33.2    60.6      26.4    65.3
    + swap rule                                     32.8    61.7      25.8    66.6
      + discrim. RO                                 33.1    61.2      26.0    66.1
      + phrase orientation (bidirectional)          33.3    60.7      26.5    65.3
      + binary swap feature                         33.2    61.0      25.9    66.2
        + discrim. RO                               33.2    61.3      26.2    66.1
        + phrase orientation (bidirectional)        33.6    60.5      26.6    65.1
    + soft syntactic labels                         33.4    60.8      26.1    66.4
      + phrase orientation (bidirectional)          33.7    60.1      26.8    65.1
    + phrase-level s2t+t2s DWL + triplets           34.3    60.1      27.7    65.0
      + discrim. RO                                 34.8    59.8      27.7    64.7
      + phrase orientation (bidirectional)          35.3    59.0      28.4    63.7

    Table 1: Experimental results for the NIST Chinese→English translation task (truecase). On the test set, bold font indicates results that are significantly better than the baseline (p < .05).

    As a next experiment, we bring in more reordering capabilities by augmenting the hierarchical grammar with a single swap rule

        X → 〈X∼0 X∼1, X∼1 X∼0〉    (6)

    supplementary to the initial rule and glue rule. The swap rule allows adjacent phrases to be transposed. The setup with swap rule and bidirectional phrase orientation model is about as good as the setup with just the bidirectional phrase orientation model and no swap rule. If we furthermore mark the swap rule with a binary feature (binary swap feature), we end up at an improvement of +1.4 %BLEU over the baseline. The phrase orientation model again provides higher translation quality than the discriminative reordering model.

    In a third experiment, we investigate whether the phrase orientation model also has a positive influence when integrated into a syntax-augmented hierarchical system. We configured a hierarchical setup with soft syntactic labels (Stein et al., 2010), a syntactic enhancement in the manner of preference grammars (Venugopal et al., 2009). On MT08, the syntax-augmented system performs 0.9 %BLEU above the baseline setup. We achieve an additional improvement of +0.7 %BLEU and -1.3 TER by including the bidirectional phrase orientation model. Interestingly, the translation quality of the setup with soft syntactic labels (but without phrase orientation model) is worse than that of the setup with phrase orientation model (but without soft syntactic labels) on MT08. The combination of both extensions provides the best result, though.

    In a final experiment, we took a very strong setup which improves over the baseline by 2.5 %BLEU through the integration of phrase-level discriminative word lexicon (DWL) models and triplet lexicon models in source-to-target (s2t) and target-to-source (t2s) direction. The models have been presented by Hasan et al. (2008), Bangalore et al. (2007), and Mauser et al. (2009). We apply them in a similar manner as proposed by Huck et al. (2011). In this strong setup, the discriminative reordering model gives gains on the development set which barely carry over to the test set. Adding the bidirectional phrase orientation model, in contrast, results in a nice gain of +0.7 %BLEU and a reduction of 1.3 points in TER on the test set, even on top of the DWL and triplet lexicon models.

    7.3 French→German Experimental Results

    Table 2 comprises the results of our empirical evaluation on the French→German task.

    The left-to-right phrase orientation model boosts the translation quality by up to 0.3 %BLEU. The reduction in TER is in a similar order of magnitude. The bidirectional model again performs a bit better, with an improvement of up to 0.4 %BLEU and a maximal reduction in TER of 0.6 points.


    French→German                            newstest2008    newstest2009    newstest2010    newstest2011
                                             BLEU[%] TER[%]  BLEU[%] TER[%]  BLEU[%] TER[%]  BLEU[%] TER[%]
    HPBT Baseline                             15.2    71.7    15.0    71.7    15.7    69.5    14.2    72.2
    + phrase orientation (left-to-right)      15.1    71.4    15.3    71.4    15.9    69.2    14.5    71.8
    + phrase orientation (bidirectional)      15.4    71.1    15.4    71.3    15.9    69.1    14.6    71.6

    Table 2: Experimental results for the French→German translation task (truecase). newstest2009 is used as development set.

    8 Conclusion

    In this paper, we introduced a phrase orientation model for hierarchical machine translation. The training of a lexicalized reordering model which assigns probabilities for monotone, swap, and discontinuous orientation of phrases was generalized from standard continuous phrases to hierarchical phrases. We explained how phrase orientation scoring can be implemented in hierarchical decoding and conducted a number of experiments on a Chinese→English and a French→German translation task. The results indicate that phrase orientation modeling is a very suitable enhancement of the hierarchical paradigm.

    Our implementation will be released as part of Jane (Vilar et al., 2010; Vilar et al., 2012; Huck et al., 2012b), the RWTH Aachen University open source statistical machine translation toolkit.[6]

    [6] http://www.hltpr.rwth-aachen.de/jane/

    Acknowledgments

    This work was partly achieved as part of the Quaero Programme, funded by OSEO, French State agency for innovation. This material is also partly based upon work supported by the DARPA BOLT project under Contract No. HR0011-12-C-0015. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no 287658.

    References

    Srinivas Bangalore, Patrick Haffner, and Stephan Kanthak. 2007. Statistical Machine Translation through Global Lexical Selection and Sentence Reconstruction. In Proc. of the Annual Meeting of the Assoc. for Computational Linguistics (ACL), pages 152–159, Prague, Czech Republic, June.

    Jean-Cédric Chappelier and Martin Rajman. 1998. A Generalized CYK Algorithm for Parsing Stochastic CFG. In Proc. of the First Workshop on Tabulation in Parsing and Deduction, pages 133–137, Paris, France, April.

    Stanley F. Chen and Joshua Goodman. 1998. An Empirical Study of Smoothing Techniques for Language Modeling. Technical Report TR-10-98, Computer Science Group, Harvard University, Cambridge, MA, USA, August.

    Colin Cherry, Robert C. Moore, and Chris Quirk. 2012. On Hierarchical Re-ordering and Permutation Parsing for Phrase-based Decoding. In Proc. of the Workshop on Statistical Machine Translation (WMT), pages 200–209, Montréal, Canada, June.

    David Chiang. 2005. A Hierarchical Phrase-Based Model for Statistical Machine Translation. In Proc. of the Annual Meeting of the Assoc. for Computational Linguistics (ACL), pages 263–270, Ann Arbor, MI, USA, June.

    David Chiang. 2007. Hierarchical Phrase-Based Translation. Computational Linguistics, 33(2):201–228, June.

    Michel Galley and Christopher D. Manning. 2008. A Simple and Effective Hierarchical Phrase Reordering Model. In Proc. of the Conf. on Empirical Methods for Natural Language Processing (EMNLP), pages 847–855, Honolulu, HI, USA, October.

    Saša Hasan, Juri Ganitkevitch, Hermann Ney, and Jesús Andrés-Ferrer. 2008. Triplet Lexicon Models for Statistical Machine Translation. In Proc. of the Conf. on Empirical Methods for Natural Language Processing (EMNLP), pages 372–381, Honolulu, HI, USA, October.

    Katsuhiko Hayashi, Hajime Tsukada, Katsuhito Sudoh, Kevin Duh, and Seiichi Yamamoto. 2010. Hierarchical Phrase-based Machine Translation with Word-based Reordering Model. In Proc. of the Int. Conf. on Computational Linguistics (COLING), pages 439–446, Beijing, China, August.

    Zhongjun He, Yao Meng, and Hao Yu. 2010a. Extending the Hierarchical Phrase Based Model with Maximum Entropy Based BTG. In Proc. of the Conf. of the Assoc. for Machine Translation in the Americas (AMTA), Denver, CO, USA, October/November.

    Zhongjun He, Yao Meng, and Hao Yu. 2010b. Maximum Entropy Based Phrase Reordering for Hierarchical Phrase-based Translation. In Proc. of the Conf. on Empirical Methods for Natural Language Processing (EMNLP), pages 555–563, Cambridge, MA, USA, October.

    Matthias Huck, Saab Mansour, Simon Wiesler, and Hermann Ney. 2011. Lexicon Models for Hierarchical Phrase-Based Machine Translation. In Proc. of the Int. Workshop on Spoken Language Translation (IWSLT), pages 191–198, San Francisco, CA, USA, December.

    Matthias Huck, Stephan Peitz, Markus Freitag, and Hermann Ney. 2012a. Discriminative Reordering Extensions for Hierarchical Phrase-Based Machine Translation. In Proc. of the Annual Conf. of the European Assoc. for Machine Translation (EAMT), pages 313–320, Trento, Italy, May.

    Matthias Huck, Jan-Thorsten Peter, Markus Freitag, Stephan Peitz, and Hermann Ney. 2012b. Hierarchical Phrase-Based Translation with Jane 2. The Prague Bulletin of Mathematical Linguistics, 98:37–50, October.

    Reinhard Kneser and Hermann Ney. 1995. Improved Backing-Off for M-gram Language Modeling. In Proc. of the Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), volume 1, pages 181–184, Detroit, MI, USA, May.

    Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical Phrase-Based Translation. In Proc. of the Human Language Technology Conf. / North American Chapter of the Assoc. for Computational Linguistics (HLT-NAACL), pages 127–133, Edmonton, Canada, May/June.

    Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proc. of the Annual Meeting of the Assoc. for Computational Linguistics (ACL), Demo and Poster Sessions, pages 177–180, Prague, Czech Republic.

    Philipp Koehn. 2005. Europarl: A Parallel Corpus for Statistical Machine Translation. In Proc. of the MT Summit X, Phuket, Thailand, September.

    Gregor Leusch and Hermann Ney. 2009. Edit distances with block movements and error rate confidence estimates. Machine Translation, 23(2):129–140, December.

    Arne Mauser, Saša Hasan, and Hermann Ney. 2009. Extending Statistical Machine Translation with Discriminative and Trigger-Based Lexicon Models. In Proc. of the Conf. on Empirical Methods for Natural Language Processing (EMNLP), pages 210–218, Singapore, August.

    Franz Josef Och and Hermann Ney. 2002. Discriminative Training and Maximum Entropy Models for Statistical Machine Translation. In Proc. of the Annual Meeting of the Assoc. for Computational Linguistics (ACL), pages 295–302, Philadelphia, PA, USA, July.

    Franz Josef Och and Hermann Ney. 2003. A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51, March.

    Franz Josef Och. 2003. Minimum Error Rate Training for Statistical Machine Translation. In Proc. of the Annual Meeting of the Assoc. for Computational Linguistics (ACL), pages 160–167, Sapporo, Japan, July.

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proc. of the Annual Meeting of the Assoc. for Computational Linguistics (ACL), pages 311–318, Philadelphia, PA, USA, July.

    Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A Study of Translation Edit Rate with Targeted Human Annotation. In Proc. of the Conf. of the Assoc. for Machine Translation in the Americas (AMTA), pages 223–231, Cambridge, MA, USA, August.

    Daniel Stein, Stephan Peitz, David Vilar, and Hermann Ney. 2010. A Cocktail of Deep Syntactic Features for Hierarchical Machine Translation. In Proc. of the Conf. of the Assoc. for Machine Translation in the Americas (AMTA), Denver, CO, USA, October/November.

    Andreas Stolcke. 2002. SRILM – An Extensible Language Modeling Toolkit. In Proc. of the Int. Conf. on Spoken Language Processing (ICSLP), volume 2, pages 901–904, Denver, CO, USA, September.

    Christoph Tillmann. 2004. A Unigram Orientation Model for Statistical Machine Translation. In Proceedings of HLT-NAACL 2004: Short Papers, HLT-NAACL-Short '04, pages 101–104, Boston, MA, USA.

    Roy Tromble and Jason Eisner. 2009. Learning Linear Ordering Problems for Better Translation. In Proc. of the Conf. on Empirical Methods for Natural Language Processing (EMNLP), pages 1007–1016, Singapore, August.

    Ashish Venugopal, Andreas Zollmann, Noah A. Smith, and Stephan Vogel. 2009. Preference Grammars: Softening Syntactic Constraints to Improve Statistical Machine Translation. In Proc. of the Human Language Technology Conf. / North American Chapter of the Assoc. for Computational Linguistics (HLT-NAACL), pages 236–244, Boulder, CO, USA, June.

    David Vilar, Daniel Stein, Matthias Huck, and Hermann Ney. 2010. Jane: Open Source Hierarchical Translation, Extended with Reordering and Lexicon Models. In Proc. of the Workshop on Statistical Machine Translation (WMT), pages 262–270, Uppsala, Sweden, July.

    David Vilar, Daniel Stein, Matthias Huck, and Hermann Ney. 2012. Jane: an advanced freely available hierarchical machine translation toolkit. Machine Translation, 26(3):197–216, September.

    Richard Zens and Hermann Ney. 2008. Improvements in Dynamic Programming Beam Search for Phrase-Based Statistical Machine Translation. In Proc. of the Int. Workshop on Spoken Language Translation (IWSLT), pages 195–205, Waikiki, HI, USA, October.

    Richard Zens, Hermann Ney, Taro Watanabe, and Eiichiro Sumita. 2004. Reordering Constraints for Phrase-Based Statistical Machine Translation. In Proc. of the Int. Conf. on Computational Linguistics (COLING), pages 205–211, Geneva, Switzerland, August.
    463