

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1676–1688, October 25–29, 2014, Doha, Qatar. © 2014 Association for Computational Linguistics

Word Translation Prediction for Morphologically Rich Languages with Bilingual Neural Networks

Ke Tran  Arianna Bisazza  Christof Monz
Informatics Institute, University of Amsterdam

Science Park 904, 1098 XH Amsterdam, The Netherlands
{m.k.tran,a.bisazza,c.monz}@uva.nl

Abstract

Translating into morphologically rich languages is a particularly difficult problem in machine translation due to the high degree of inflectional ambiguity in the target language, which is often only poorly captured by existing word translation models. We present a general approach that exploits source-side contexts of foreign words to improve translation prediction accuracy. Our approach is based on a probabilistic neural network which requires neither linguistic annotation nor manual feature engineering. We report significant improvements in word translation prediction accuracy for three morphologically rich target languages. In addition, preliminary results for integrating our approach into a large-scale English-Russian statistical machine translation system show small but statistically significant improvements in translation quality.

1 Introduction

The ability to make context-sensitive translation decisions is one of the major strengths of phrase-based SMT (PSMT). However, the way PSMT exploits source-language context has several limitations, as pointed out, for instance, by Quirk and Menezes (2006) and Durrani et al. (2013). First, the amount of context used to translate a given input word depends on the phrase segmentation, with hypotheses resulting from different segmentations competing with one another. Another issue is that, given a phrase segmentation, each source phrase is translated independently from the others, which can be problematic especially for short phrases. As a result, the prediction of a source phrase's translation cannot access useful linguistic clues in the source sentence that lie outside the scope of the phrase.

Lexical weighting tackles the problem of unreliable phrase probabilities, typically associated with long phrases, but does not alleviate the problem of context segmentation. An important share of the translation selection task is then left to the language model (LM), which is certainly very effective but can only leverage target-language context. Moreover, decisions taken at early decoding stages, such as the common practice of retaining only the top n translation options for each source span, depend only on the translation models and on the target context available in the phrase.

Source-context-based translation models (Gimpel and Smith, 2008; Mauser et al., 2009; Jeong et al., 2010; Haque et al., 2011) naturally address these limitations. These models can exploit an unbounded context of the input text, but they assume that target words can be predicted independently from each other, which makes them easy to integrate into state-of-the-art PSMT systems. Even though the independence assumption is made on the target side, these models have shown the benefits of utilizing source context, especially when translating into morphologically rich languages. One drawback of previous research on this topic, though, is that it relied on rich sets of manually designed features, which in turn required the availability of linguistic annotation tools like POS taggers and syntactic parsers.

In this paper, we specifically focus on improving the prediction accuracy for word translations. Achieving high levels of word translation accuracy is particularly challenging for language pairs where the source language is morphologically poor, such as English, and the target language is morphologically rich, such as Russian, i.e., language pairs with a high degree of surface realization ambiguity (Minkov et al., 2007). To address this problem we propose a general approach based on bilingual neural networks (BNN) exploiting source-side contextual information.

This paper makes a number of contributions. Unlike previous approaches, our models do not require any form of linguistic annotation (Minkov et al., 2007; Kholy and Habash, 2012; Chahuneau et al., 2013), nor do they require any feature engineering (Gimpel and Smith, 2008). Moreover, besides directly predicting fully inflected forms as Jeong et al. (2010) do, our approach can also model stem and suffix prediction explicitly. Prediction accuracy is evaluated with respect to three morphologically rich target languages (Bulgarian, Czech, and Russian), showing that our approach consistently yields substantial improvements over a competitive baseline. We also show that these improvements in prediction accuracy can be beneficial in an end-to-end machine translation scenario by integrating our models into a large-scale English-Russian PSMT system. Finally, a detailed analysis shows that our approach induces a positive bias on phrase translation probabilities, leading to a better ranking of the translation options employed by the decoder.

2 Lexical coverage of SMT models

The first question we ask is whether translation can be improved by a more accurate selection of the translation options already existing in the SMT models, as opposed to generating new options. To answer this question we measure the lexical coverage of a baseline PSMT system trained on English-Russian.¹ We choose this language pair because of the morphological richness on the target side: Russian is characterized by a highly inflectional morphology with a particularly complex nominal declension (six core cases, three genders and two number categories). As suggested by Green and DeNero (2012), we compute the recall of reference tokens in the set of target tokens that the decoder could produce in a translation of the source, that is, the target tokens of all phrase pairs that matched the input sentence and that were actually used for decoding.² We call this the decoder's lexical search space. Then, we compare the reference/search-space recall against the reference/MT-output recall, that is, the percentage of reference tokens that also appeared in the 1-best translation output by the SMT system. Results for the WMT12 benchmark are presented in Table 1. From the first two rows, we see that only a rather small part of the correct target tokens available to the decoder are actually produced in the 1-best MT output (50% against 86%). Although our word-level analysis does not directly estimate phrase-level coverage, these numbers suggest that a large potential for translation improvement lies in better lexical selection during decoding.

¹ Training data and SMT setup are described in Section 6.

Token recall:
  reference/MT-search-space        86.0%
  reference/MT-output              50.0%
  stem-only reference/MT-output    12.3%
    of which reachable             11.2%

Table 1: Lexical coverage analysis of the baseline SMT system (English-Russian wmt12).

To quantify the importance of morphology, we count how many reference tokens matched the MT output only at the stem level³ and for how many of those the correct surface form existed in the search space (reachable matches). These two numbers represent the upper bound of the improvement achievable by a model that only predicts suffixes given the target stems. As shown in Table 1, such a model could potentially increase the reference/MT-output recall by 12.3% with generation of new inflected forms, and by 11.2% without. Thus, also when it comes to morphology, generation seems to be of secondary importance compared to better selection in our experimental setup.
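The recall figures in Table 1 are, in principle, simple set computations over tokens. The sketch below illustrates them on invented tokens, with a crude truncation function as a stand-in for the Snowball stemmer; it is not the actual WMT12 analysis, only an illustration of the quantities being measured.

```python
# Sketch of the lexical coverage analysis of Section 2 (hypothetical data).
def token_recall(reference, candidates):
    """Fraction of reference tokens that occur in the candidate token set."""
    cand = set(candidates)
    hits = sum(1 for tok in reference if tok in cand)
    return hits / len(reference)

def stem(token):
    # Crude stand-in for the Russian Snowball stemmer used in the paper.
    return token[:4]

reference = ["конституционность", "закона", "индианы"]
search_space = ["конституционность", "закона", "закон", "индиана"]
mt_output = ["конституционность", "закон", "индиана"]

print("reference/MT-search-space:", token_recall(reference, search_space))
print("reference/MT-output:      ", token_recall(reference, mt_output))

# Stem-only matches: reference tokens missed at the surface level but matched at
# the stem level, and how many of those have their surface form in the search
# space (reachable).
missed = [t for t in reference if t not in set(mt_output)]
stem_only = [t for t in missed if stem(t) in {stem(o) for o in mt_output}]
reachable = [t for t in stem_only if t in set(search_space)]
print("stem-only:", len(stem_only), "of which reachable:", len(reachable))
```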

3 Predicting word translations in context

It is standard practice in PSMT to use word-to-word translation probabilities as an additional phrase score. More specifically, state-of-the-art PSMT systems employ the maximum-likelihood estimate of the context-independent probability of a target word given its aligned source word, P(t_j | s_i), for each word alignment link a_ij.
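For concreteness, this context-independent baseline can be sketched as a relative-frequency estimate over alignment links; the example links below are invented and only illustrate the counting.

```python
from collections import Counter, defaultdict

# Sketch of the MLE lexical translation probability P(t|s) from word-aligned data.
links = [("law", "закон"), ("law", "закона"), ("law", "закон"),
         ("of", "закона")]          # hypothetical alignment links (source, target)

pair_counts = Counter(links)
src_counts = Counter(s for s, _ in links)

p_mle = defaultdict(dict)
for (s, t), c in pair_counts.items():
    p_mle[s][t] = c / src_counts[s]

print(p_mle["law"])   # {'закон': 0.666..., 'закона': 0.333...}
```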

² This corresponds to the top 30 phrases sorted by weighted phrase, lexical and LM probabilities, for each source span. Koehn (2004) and our own experience suggest that using more phrases has little or no impact on MT quality.

³ Word segmentation for this analysis is performed by the Russian Snowball stemmer; see also Section 5.3.


[the constitutionality of the] [indiana law] [.]
[конституционность] [индиана закон] [.]

Figure 1: Fragment of an English sentence and its incorrect Russian translation produced by the baseline SMT system. Square brackets indicate phrase boundaries.

The main goal of our work is to improve the estimation of such probabilities by exploiting the context of s_i, which in turn we expect will result in better phrase translation selection. Figure 1 illustrates this idea: the translation of "law" in this example has the wrong case, nominative instead of genitive. Due to the rare word "Indiana/индиана", the target LM must back off to the bigram history and does not penalize this choice sufficiently. However, a model that has access to the word "of" in the near source context could bias the translation of "law" towards the correct case.

We then model P(t_j | c_{s_i}), with the source context c_{s_i} defined as a fixed-length word sequence centered around s_i:

c_{s_i} = s_{i-k}, ..., s_i, ..., s_{i+k}

Our definition of context is similar to the (n-1)-word history used in n-gram LMs. Similarly to previous work in source-context-sensitive translation modeling (Jeong et al., 2010; Chahuneau et al., 2013), target words are predicted independently from each other, which allows for an efficient decoding integration. We are particularly interested in translating into morphologically rich languages, where source context can provide useful information for the prediction of the target translation: for example, the gender of the subject in a source sentence constrains the morphology of the translation of the source verb. Therefore, we integrate the notions of stem and suffix directly into the model. We assume the availability of a word segmentation function g that takes a target word t as input and returns its stem and suffix: g(t) = (σ, µ). Then, the conditional probability p(t_j | c_{s_i}) can be decomposed into a stem probability and a suffix probability:

p(t_j | c_{s_i}) = p(σ_j | c_{s_i}) · p(µ_j | c_{s_i}, σ_j)    (1)

These two probabilities can be estimated separately, which yields two subtasks:

1. predict the target stem σ given the source context c_s;
2. predict the target suffix µ given the source context c_s and the target stem σ.
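As a small illustration of Equation 1, the sketch below combines a toy segmentation function g with invented stem and suffix probability tables; none of these values come from trained models, and the one-character "suffix" only stands in for a real segmenter.

```python
# Sketch of p(t|c) = p(stem|c) * p(suffix|c, stem) with a toy segmentation g.
def g(word):
    """Toy segmentation: split off a one-character 'suffix' (stand-in for a real segmenter)."""
    return (word[:-1], word[-1]) if len(word) > 1 else (word, "")

p_stem = {"закон": 0.6, "прав": 0.3}                    # p(stem | context), invented
p_suffix = {("закон", "а"): 0.7, ("закон", "е"): 0.2}   # p(suffix | context, stem), invented

def p_word(word):
    stem, suffix = g(word)
    return p_stem.get(stem, 0.0) * p_suffix.get((stem, suffix), 0.0)

print(p_word("закона"))   # 0.6 * 0.7 = 0.42
```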

Based on the results of our analysis, we focus on the selection of existing translation candidates. We thus restrict our prediction to a set of possible target candidates depending on the task, instead of considering all target words in the vocabulary. More specifically, for each source word s_i, our candidate generation function returns the set of target words T_s = {t_1, ..., t_m} that were aligned to s_i in the parallel training corpus, which in turn corresponds to the set of target words that the SMT system can produce for a given source word. In practice, we use a pruned version of T_s to speed up training and reduce noise (see details in Section 5).

As for the morphological models, given T_s and g, we can obtain L_s = {σ_1, ..., σ_k}, the set of possible target stem translations of s, and M_σ = {µ_1, ..., µ_l}, the set of possible suffixes for a target stem σ. The use of L_s and M_σ is similar to the stemming and inflection operations in (Toutanova et al., 2008), while the set T_s is similar to the GEN function in (Jeong et al., 2010).⁴

⁴ Note that our suffix generation function M_σ is restricted to the forms observed in the target monolingual data, but not to those aligned to a source word s, which opens the possibility of generating inflected forms that are missing from the translation models. We leave this possibility to future work.

Our approach differs crucially from previous work (Minkov et al., 2007; Chahuneau et al., 2013) in that it does not require linguistic features such as part-of-speech tags and syntactic trees on the source side. The proposed models automatically learn features that are relevant for each of the modeled tasks, directly from word-aligned data. To make the approach completely language independent, the word segmentation function g can be trained with an unsupervised segmentation tool. The effects of using different word segmentation techniques are discussed in Section 5.

4 Bilingual neural networks for translation prediction

Probabilistic neural network (NN), or continuous space, language models have received increasing attention over the last few years and have been applied to several natural language processing tasks (Bengio et al., 2003; Collobert and Weston, 2008; Socher et al., 2011; Socher et al., 2012). Within statistical machine translation, they have been used for monolingual target language modeling (Schwenk et al., 2006; Le et al., 2011; Duh et al., 2013; Vaswani et al., 2013), n-gram translation modeling (Son et al., 2012), phrase translation modeling (Schwenk, 2012; Zou et al., 2013; Gao et al., 2014) and minimal translation modeling (Hu et al., 2014). The recurrent neural network LMs of Auli et al. (2013) are primarily trained to predict target word sequences; however, they also experiment with an additional input layer representing source-side context.

Our models differ from most previous work in neural language modeling in that we predict a target translation given a source context, while previous models predict the next word given a target word history. Unlike previous work in phrase translation modeling with NNs, our models have the advantage of accessing source context that can fall outside the phrase boundaries.

We now describe our models in a general setting, predicting target translations given a source context, where target translations can be either words, stems or suffixes.⁵

⁵ The source code of our models is available at https://bitbucket.org/ketran

4.1 Neural network architecture

Following a common approach in deep learning for NLP (Bengio et al., 2003; Collobert and Weston, 2008), we represent each source word s_i by a column vector r_{s_i} ∈ R^d. Given a source context c_{s_i} = s_{i-k}, ..., s_i, ..., s_{i+k} of k words on the left and k words on the right of s_i, the context representation r_{c_{s_i}} ∈ R^{(2k+1)×d} is obtained by concatenating the vector representations of all words in c_{s_i}:

r_{c_{s_i}} = r_{s_{i-k}} ⊕ ... ⊕ r_{s_{i+k}}

Our main BNN architecture for word or stem prediction (Figure 2a) is a feed-forward neural network (FFNN) with one hidden layer: a matrix W_1 ∈ R^{n×(2k+1)d} connecting the input layer to the hidden layer, a matrix W_2 ∈ R^{|V_t|×n} connecting the hidden layer to the output layer, and a bias vector b_2 ∈ R^{|V_t|}, where |V_t| is the size of the target translation vocabulary. The target translation distribution P_BNN(t | c_{s_i}) for a given source context c_{s_i} is computed by a forward pass:

P_BNN(t | c_{s_i}) = softmax(W_2 φ(W_1 r_{c_{s_i}}) + b_2)    (2)

where φ is a nonlinearity (tanh, sigmoid or rectified linear units). The parameters of the neural network are θ = {r_{s_i}, W_1, W_2, b_2}.

The suffix prediction BNN is obtained by adding the target stem representation r_σ to the input layer (see Figure 2b).
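A minimal numpy sketch of the forward pass in Equation 2, using invented toy sizes (d = 4, k = 1, n = 8 hidden units, 5 candidate translations) and a rectifier for φ; the embeddings and weight matrices are random stand-ins, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n, n_cands = 4, 1, 8, 5            # toy sizes: embedding dim, context radius, hidden, candidates

R = rng.normal(size=(10, d))             # source word embeddings (10-word toy vocabulary)
W1 = rng.normal(size=(n, (2 * k + 1) * d))
W2 = rng.normal(size=(n_cands, n))
b2 = np.zeros(n_cands)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def p_bnn(context_ids):
    """P_BNN(t | c_{s_i}) for a window of 2k+1 source word ids (Equation 2)."""
    r_c = np.concatenate([R[i] for i in context_ids])   # concatenated context representation
    h = np.maximum(0.0, W1 @ r_c)                        # rectified hidden layer
    return softmax(W2 @ h + b2)

print(p_bnn([3, 7, 1]))   # distribution over the 5 candidate translations
```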

4.2 Model variants

We encounter two major issues with FFNNs: (i) they do not provide a natural mechanism to compute the word surface conditional probability p(t|c_s) given the individual stem probability p(σ|c_s) and suffix probability p(µ|c_s, σ), and (ii) they do not provide the flexibility to capture long-range dependencies among words if these lie outside the source context window. Hence, we consider two BNN variants: a log-bilinear model (LBL) and a convolutional neural network model (ConvNet). The LBL could potentially address (i) by factorizing target representations into target stem and suffix representations, whereas ConvNets offer the advantage of modeling variable input length (ii) (Kalchbrenner et al., 2014).

Log-bilinear model. The FFNN models stem and suffix probabilities separately. A log-bilinear model instead could directly model word prediction through a factored representation of target words, similarly to Botha and Blunsom (2014). Thus, no probability mass would be wasted on stem-suffix combinations that are not in the candidate generation function. The LBL model specifies the conditional distribution of the word translation t_j ∈ T_{s_i} given a source context c_{s_i}:

P_θ(t_j | c_{s_i}) = exp(s_θ(t_j, c_{s_i})) / Σ_{t'_j ∈ T_{s_i}} exp(s_θ(t'_j, c_{s_i}))    (3)

We use an additional set of word representations q_{t_j} ∈ R^n for the target translations t_j. The LBL model computes a predictive representation q of a source context c_{s_i} by taking a linear combination of the source word representations r_{s_{i+m}} with position-dependent weight matrices C_m ∈ R^{n×d}:

q = Σ_{m=-k}^{k} C_m r_{s_{i+m}}    (4)

The score function s_θ(t_j, c_{s_i}) measures the similarity between the predictive representation q and the target representation q_{t_j}:

s_θ(t_j, c_{s_i}) = q^T q_{t_j} + b_h^T q_{t_j} + b_{t_j}    (5)


Figure 2: Feed-forward BNN architectures for predicting target translations: (a) word model (similar to the stem model), and (b) suffix model with an additional vector representation r_σ for target stems σ.

Here b_{t_j} is the bias term associated with the target word t_j, and b_h ∈ R^n are the representation biases. s_θ(t_j, c_{s_i}) can be seen as the negative energy function of the target translation t_j and its context c_{s_i}. The parameters of the model are thus θ = {r_{s_i}, C_m, q_{t_j}, b_h, b_{t_j}}. Our log-bilinear model is a modification of the log-bilinear model proposed for n-gram language modeling in (Mnih and Hinton, 2007).
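A small numpy sketch of the LBL scoring of Equations 3–5, with invented toy dimensions; all parameters are random stand-ins rather than trained values.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n, n_cands = 4, 1, 6, 3                     # toy sizes

R = rng.normal(size=(10, d))                      # source word embeddings
Q = rng.normal(size=(n_cands, n))                 # target candidate representations q_t
C = rng.normal(size=(2 * k + 1, n, d))            # position-dependent matrices C_m
b_h = rng.normal(size=n)                          # representation bias
b_t = np.zeros(n_cands)                           # per-candidate bias

def lbl_probs(context_ids):
    """P(t | c) over the candidate set, Equations 3-5; m indexes the 2k+1 window positions."""
    q = sum(C[m] @ R[i] for m, i in enumerate(context_ids))   # predictive representation (Eq. 4)
    scores = Q @ q + Q @ b_h + b_t                             # s(t, c) = q^T q_t + b_h^T q_t + b_t (Eq. 5)
    e = np.exp(scores - scores.max())
    return e / e.sum()                                         # softmax over candidates (Eq. 3)

print(lbl_probs([2, 5, 8]))
```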

Convolutional neural network model. This model (Figure 3) computes the predictive representation q by applying a sequence of 2k convolutional layers {L_1, ..., L_{2k}}. The source context c_{s_i} is represented as a matrix m_{c_{s_i}} ∈ R^{d×(2k+1)}:

m_{c_{s_i}} = [r_{s_{i-k}}; ...; r_{s_{i+k}}]    (6)

Figure 3: Convolutional neural network model. Edges with the same color indicate the same kernel weight matrix.

Each convolutional layer L_i consists of a one-dimensional filter m_i ∈ R^{d×2}. Each row of m_i is convolved with the corresponding row in the previous layer, resulting in a matrix whose number of columns decreases by one. Thus, after 2k convolutional layers, the network transforms the source context matrix m_{c_{s_i}} into a feature vector q ∈ R^d. A fully connected layer with weight matrix W followed by a softmax layer is placed after the last convolutional layer L_{2k} to perform classification. The parameters of the convolutional neural network model are θ = {r_{s_i}, m_j, W}. Here, we focus on fixed-length input; however, convolutional neural networks may be used to model variable-length input (Kalchbrenner et al., 2014; Kalchbrenner and Blunsom, 2013).
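A rough numpy sketch of this column reduction, with toy sizes (d = 4, k = 2); the embeddings and filters are random stand-ins, and np.convolve over each row plays the role of the per-row width-2 filters.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 4, 2                                        # toy embedding dim and context radius
R = rng.normal(size=(10, d))                       # source word embeddings
filters = rng.normal(size=(2 * k, d, 2))           # one width-2 filter per layer and per row

def conv_context(context_ids):
    """Reduce a d x (2k+1) context matrix to a feature vector q in R^d."""
    m = np.stack([R[i] for i in context_ids], axis=1)          # d x (2k+1)
    for f in filters:                                          # 2k convolutional layers
        # convolve each row with its width-2 filter: the number of columns shrinks by one
        m = np.stack([np.convolve(m[r], f[r][::-1], mode="valid")
                      for r in range(d)])
    return m[:, 0]                                             # q in R^d

print(conv_context([1, 3, 5, 7, 9]))
```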

4.3 Training

In training, for each example (t, c_s), we maximize the conditional probability P_θ(t | c_s) of the correct target label t. The contribution of the training example (t, c_s) to the gradient of the log conditional probability is given by:

∂/∂θ log P_θ(t | c_s) = ∂/∂θ s_θ(t, c_s) − Σ_{t' ∈ T_s} P_θ(t' | c_s) ∂/∂θ s_θ(t', c_s)

Note that in the gradient we do not sum over all target translations T but over the set of possible candidates T_s of a source word s. In practice |T_s| ≤ 200 with our pruning settings (see Section 5.1), thus the training time for one example does not depend on the vocabulary size. Our training criterion can be seen as a form of contrastive estimation (Smith and Eisner, 2005); however, we explicitly move probability mass from competing candidates to the correct translation candidate, thus obtaining more reliable estimates of the conditional probabilities.
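With respect to the candidate scores, the gradient above reduces to the familiar "one-hot minus softmax" form, because the normalization runs only over T_s. A minimal numpy sketch with invented scores:

```python
import numpy as np

def grad_log_prob(scores, gold_index):
    """Gradient of log P(t_gold | c) w.r.t. the candidate scores s(t', c).

    P is a softmax over the candidate set only, so the gradient is
    one-hot(gold) - P, matching the contrastive update of Section 4.3.
    """
    e = np.exp(scores - scores.max())
    p = e / e.sum()
    g = -p
    g[gold_index] += 1.0
    return g

scores = np.array([1.2, -0.3, 0.5, 0.0])    # invented scores for |T_s| = 4 candidates
print(grad_log_prob(scores, gold_index=0))  # positive for the gold candidate, negative elsewhere
```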

The BNN parameters are initialized randomly according to a zero-mean Gaussian. We regularize the models with L2. As an alternative to the L2 regularizer, we also experiment with dropout (Hinton et al., 2012), where neurons are randomly zeroed out with dropout rate p. This technique is known to be useful in computer vision tasks but has rarely been used in NLP tasks. In the FFNN, we use dropout after the hidden layer, while in the ConvNet, dropout is applied after the last convolutional layer. The dropout rate p is set to 0.3 in our experiments. We use rectified nonlinearities⁶ in the FFNN and after each convolutional layer in the ConvNet. We train our BNN models with standard stochastic gradient descent.

5 Evaluating word translation prediction

In this section, we assess the ability of our BNN models to predict the correct translation of a word in context. In addition to English-Russian, we also consider translation prediction for Czech and Bulgarian. As members of the Slavic language family, Czech and Bulgarian are also characterized by highly inflectional morphology. Czech, like Russian, displays a very rich nominal inflection with as many as 14 declension paradigms. Bulgarian, unlike Russian, is not affected by case distinctions but is characterized by a definiteness suffix.

5.1 Experimental setup

The following parallel corpora are used to train the BNN models:

• English-Russian: WMT13 data (News Commentary and Yandex corpora);
• English-Czech: CzEng 1.0 corpus (Bojar et al., 2012) (Web Pages and News sections);
• English-Bulgarian: a mix of crawled news data, TED talks and Europarl proceedings.

Detailed corpus statistics are given in Table 2. For each language pair, accuracies are measured on a held-out set of 10K parallel sentences.

To prepare the candidate generation function, each dataset is first word-aligned with GIZA++, then a bilingual lexicon with maximum-likelihood probabilities (Pmle) is built from the symmetrized alignment. After some frequency and significance pruning,⁷ the top 200 translations sorted by Pmle(t|s) · Pmle(s|t) are kept as candidate word translations for each source word in the vocabulary. Word alignments are also used to train the BNN models: each alignment link constitutes a training sample, with no special treatment of unaligned words and 1-to-many alignments.
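A sketch of this candidate generation step: build relative-frequency probabilities in both directions from alignment links, apply frequency cut-offs, and keep the top candidates per source word ranked by Pmle(t|s)·Pmle(s|t). The cut-off values mirror those reported in the paper, but the log-odds-ratio filter is omitted and the example links are invented.

```python
from collections import Counter

def build_candidates(links, min_word_freq=5, min_pair_freq=2, top_k=200):
    """Return, per source word, candidate targets ranked by Pmle(t|s) * Pmle(s|t)."""
    pair = Counter(links)
    src = Counter(s for s, _ in links)
    trg = Counter(t for _, t in links)
    cands = {}
    for s in src:
        if src[s] < min_word_freq:
            continue
        scored = []
        for (s2, t), c in pair.items():
            if s2 != s or c < min_pair_freq or trg[t] < min_word_freq:
                continue
            scored.append((t, (c / src[s]) * (c / trg[t])))
        cands[s] = [t for t, _ in sorted(scored, key=lambda x: -x[1])[:top_k]]
    return cands

# Toy usage with invented links and lowered thresholds.
links = [("law", "закон")] * 3 + [("law", "закона")] * 2 + [("of", "закона")] * 2
print(build_candidates(links, min_word_freq=2, min_pair_freq=2))
```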

The context window size k is set to 3 (corresponding to 7-grams) and the dimensionality of the source word representations to 100 in all experiments. The number of hidden units in our feed-forward neural networks and the target translation embedding size in the LBL models are set to 200. All models are trained for 10 iterations with the learning rate set to 0.001.

⁶ We find that using rectified linear units gives better results than sigmoid and tanh.

⁷ Each lexicon is pruned with minimum word frequency 5, minimum source-target word pair frequency 2, and minimum log odds ratio 10.

              En-Ru    En-Cs    En-Bg
Sentences     1M       1M       0.8M
Src. tokens   26.5M    19.2M    19.3M
Trg. tokens   24.7M    16.7M    18.9M
Src. T/T      .0109    .0105    .0051
Trg. T/T      .0247    .0163    .0104

Table 2: BNN training corpora statistics: number of sentences, tokens, and type/token ratio (T/T).

5.2 Word, stem and suffix prediction accuracy

We measure accuracy at top-n, i.e., how often the correct translation was among the top n candidates sorted by a model. For each subtask (word, stem and suffix prediction) the BNN model is compared to the context-independent maximum-likelihood baseline Pmle(t|s), on which the PSMT lexical weighting score is based. Note that this is a more realistic baseline than the uniform models sometimes reported in the literature. The oracle corresponds to the percentage of aligned source-target word pairs in the held-out set that are covered by the candidate generation function. Of the missing links, about 4% are due to lexicon pruning. Results for all three language pairs are presented in Table 3. In this series of experiments, the morphological BNNs utilize unsupervised segmentation models trained on each target language following Lee et al. (2011).⁸
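Accuracy at top-n simply checks whether the reference translation of each aligned source token appears among a model's n best-scored candidates. A minimal sketch with invented candidate rankings:

```python
def top_n_accuracy(examples, n):
    """examples: list of (gold_target, candidates_ranked_by_model) pairs."""
    hits = sum(1 for gold, ranked in examples if gold in ranked[:n])
    return hits / len(examples)

examples = [("закона", ["закон", "закона", "законе"]),
            ("права",  ["права", "право"]),
            ("суда",   ["суд", "судья", "суде"])]
print(top_n_accuracy(examples, 1), top_n_accuracy(examples, 3))  # 0.33..., 0.66...
```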

As shown in Table 3, the BNN models outperform the baseline by a large margin in all tasks and languages. In particular, word prediction accuracy at top-1 increases by +6.4%, +24.6% and +9.0% absolute in English-Russian, English-Czech and English-Bulgarian respectively, without the use of any features based on linguistic annotation. While the baseline and oracle differences among languages can be explained by different levels of overlap between training and held-out set, we cannot easily explain why the Czech BNN performance is so much higher.

⁸ We use the C++ implementation available at http://groups.csail.mit.edu/rbg/code/morphsyn


Model        En-Ru         En-Cs           En-Bg
Word prediction (%):
Baseline     33.0 / 50.1   42.0 / 59.9     47.9 / 66.0
Word BNN     39.4 / 56.6   66.6 / 81.4     56.9 / 74.0
             +6.4 / +6.5   +24.6 / +21.5   +9.0 / +8.0
Oracle       79.5          90.2            86.9
Stem prediction (%):
Baseline     40.7 / 58.2   46.1 / 64.3     51.9 / 70.1
Stem BNN     45.1 / 62.5   66.1 / 81.6     56.7 / 74.4
             +4.4 / +4.3   +20.0 / +17.3   +4.8 / +4.3
Suffix prediction (%):
Baseline     71.2 / 85.6   78.8 / 93.2     81.5 / 92.4
Suffix BNN   77.0 / 89.7   91.9 / 97.4     87.7 / 94.9
             +5.8 / +4.1   +13.1 / +4.2    +6.2 / +2.5

Table 3: BNN prediction accuracy (top-1/top-3) compared to a context-independent maximum-likelihood baseline.

When comparing the three prediction subtasks, we find that word prediction is the hardest task, as expected. Stem prediction accuracies are considerably higher than word prediction accuracies in Russian, but almost equal in the other two languages. Finally, baseline accuracies for suffix prediction are by far the highest, ranging between 71.2% and 81.5%, which is primarily explained by the smaller number of candidates to choose from. Also on this task, the BNN model achieves considerable gains of +5.8%, +13.1% and +6.2% at top-1, without the need for manual feature engineering.

From these figures, it is hard to predict whether word BNNs or morphological BNNs will have a better effect on SMT performance. On one hand, the word-level BNN achieves the highest gain over the MLE baseline. On the other, the stem- and suffix-level BNNs provide two separate scoring functions, whose weights can be directly tuned for translation quality. A preliminary answer to this question is given by the SMT experiments presented in Section 6.

5.3 Effect of word segmentation

This section analyzes the effect of using different segmentation techniques. We consider two supervised tagging methods that produce a lemma and an inflection tag for each token in a context-sensitive manner: TreeTagger (Sharoff et al., 2008) for Russian and the Morce tagger (Spoustová et al., 2007) for Czech.⁹ Finally, we employ the Russian Snowball rule-based stemmer as a light-weight, context-insensitive segmentation technique.¹⁰

⁹ Annotation included in the CzEng 1.0 corpus release.

¹⁰ http://snowball.tartarus.org/algorithms/russian/stemmer.html

Figure 4: Effect of different word segmentation techniques (U: unsupervised, S: supervised, R: rule-based stemmer) on stem and suffix prediction accuracy. The dark part of each bar stands for top-1, the light one for top-3 accuracy.

As shown in Figure 4, accuracies for both stem and suffix prediction vary noticeably with the segmentation used. However, higher stem accuracies correspond to lower suffix accuracies and vice versa, which can be mainly attributed to a general tendency of one tool to segment more or less than another. In summary, the unsupervised segmentation methods and the light-weight stemmer appear to perform comparably to the supervised methods.

5.4 Effect of training data size

We examine the predictive power of our models with respect to the size of the training data. Table 4 shows the accuracies of stem and suffix models trained on 200K and 1M English-Russian sentence pairs with unsupervised word segmentation. Surprisingly, we observe only a minor loss when we decrease the training data size, which suggests that our models are robust even on a small data set.

# Train sent.   Stem Acc.     Suffix Acc.
1M              45.1 / 62.5   77.0 / 89.7
200K            44.6 / 61.8   75.7 / 88.6

Table 4: Accuracy at top-1/top-3 (%) of stem and suffix BNNs with different training data sizes.

5.5 Fine-grained evaluation

We evaluate the suffix BNN model at the part-of-speech (POS) level. Table 5 provides suffix prediction accuracy per POS for En-Ru. For this analysis, the Russian data is segmented by TreeTagger. Additionally, we report the average number of suffixes per stem given the part-of-speech.

Our results are consistent with the findings of Chahuneau et al. (2013):¹¹ the prediction of adjectives is more difficult than that of other POS, while Russian verb prediction is relatively easier in spite of the higher number of suffixes per stem. These differences reflect the importance of source versus target context features in the prediction of the target inflection: for instance, adjectives agree in gender with the nouns they modify, but this may only be inferred from the target context.

POS        A      V      N      M      P
Acc. (%)   49.6   61.9   62.8   84.5   64.4
|M_σ|      18.2   18.4   9.2    7.1    13.3

Table 5: Suffix prediction accuracy at top-1 (%), broken down by category (A: adjectives, V: verbs, N: nouns, M: numerals, P: pronouns). |M_σ| denotes the average number of suffixes per stem.

5.6 Neural Network variants

Table 6 shows the stem and suffix accuracies of the BNN variants on English-Czech. Although none of the variants outperforms our main FFNN architecture, we observe similar performance by the LBL on stem prediction, and by the ConvNet on suffix prediction. This suggests that future work could exploit their additional flexibility (see Section 4.2) to improve the BNN predictive power. As for the low suffix accuracy of the LBL, it can be explained by the absence of a nonlinear transformation. Nonlinearity is important for the suffix model, where the prediction of the target suffix µ_j often does not depend linearly on s_i and σ_j. The predictive representation of the target stem in the LBL stem model, however, mainly depends on the source representation r_{s_i} through a position-dependent weight matrix C_0. Thus, we observe a smaller accuracy drop in the stem model than in the suffix model. Conversely, the ConvNet performs poorly on stem prediction because it captures the meaning of the whole source context instead of emphasizing the importance of the source word s_i as the main predictor of the target translation t_j.

¹¹ Chahuneau et al. (2013) report an average accuracy of 63.1% for the prediction of A, V, N, M suffixes. When we train our model on the same dataset (news-commentary) we obtain a comparable result (64.7% vs. 63.1%).

Unexpectedly, no improvement is obtained by the use of the dropout regularizer (see Section 4.3).

Model         Stem Acc.     Suffix Acc.
FFNN          66.1 / 81.6   91.9 / 97.4
FFNN+do       64.6 / 81.1   91.5 / 97.5
LBL           63.6 / 79.6   86.4 / 96.4
ConvNet+do    58.6 / 75.6   90.3 / 96.9

Table 6: Accuracies at top-1/top-3 (%) of stem and suffix models. +do indicates dropout instead of the L2 regularizer. FFNN is our main architecture.

6 SMT experiments

While the main objective of this paper is to improve the prediction accuracy of word translations (see Section 5), we are also interested in knowing to what extent these improvements carry over to an end-to-end machine translation task. To this end, we integrate the translation prediction models described in Section 4 into our existing English-Russian SMT system.

For each phrase pair matching the input, the phrase BNN score P_BNN-p is computed as follows:

P_BNN-p(s, t, a) = ∏_{i=1}^{|s|}  { (1/|{a_i}|) Σ_{j ∈ {a_i}} P_BNN(t_j | c_{s_i})   if |{a_i}| > 0
                                  { Pmle(NULL | s_i)                                 otherwise

where a is the word-level alignment of the phrase pair (s, t) and {a_i} is the set of target positions aligned to s_i. If a source-target link cannot be scored by the BNN model, we give it a P_BNN probability of 1 and increment a separate count feature ε. Note that the same phrase pair can get different BNN scores if used in different source-side contexts.
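A sketch of this phrase-level score: for each source position, average the BNN word probabilities over its aligned target positions, with unaligned source words falling back to Pmle(NULL|s). The probability tables, the 1e-6 floor, and the example phrase pair below are invented stand-ins for the trained models.

```python
import math

def phrase_bnn_score(src_phrase, trg_phrase, alignment, p_bnn, p_mle_null):
    """Return (log P_BNN-p, epsilon); alignment maps source position i to a set of target positions."""
    log_score, eps = 0.0, 0
    for i, s_word in enumerate(src_phrase):
        targets = alignment.get(i, set())
        if not targets:
            log_score += math.log(p_mle_null.get(s_word, 1e-6))
            continue
        probs = [p_bnn(trg_phrase[j], i) for j in sorted(targets)]
        if any(p is None for p in probs):
            eps += 1           # unscorable term: treat its probability as 1, count it separately
            continue
        log_score += math.log(sum(probs) / len(probs))
    return log_score, eps

# Toy usage with an invented scoring function.
p_bnn = lambda t, i: {"закона": 0.4, "закон": 0.1}.get(t)
score, eps = phrase_bnn_score(["indiana", "law"], ["закона", "индианы"],
                              {1: {0}}, p_bnn, {"indiana": 0.05})
print(score, eps)
```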

Our baseline is an in-house phrase-based (Koehn et al., 2003) statistical machine translation system very similar to Moses (Koehn et al., 2007). All system runs use hierarchical lexicalized reordering (Galley and Manning, 2008; Cherry et al., 2012), distinguishing between monotone, swap, and discontinuous reordering, all with respect to left-to-right and right-to-left decoding. Other features include linear distortion, bidirectional lexical weighting (Koehn et al., 2003), word and phrase penalties, and finally a word-level 5-gram target LM trained on all available monolingual data with modified Kneser-Ney smoothing (Chen and Goodman, 1999).


Corpus         Lang.    #Sent.   #Tok.
paral. train   EN       1.9M     48.9M
               RU                45.9M
Wiki dict.     EN/RU    508K     –
mono. train    RU       21.0M    390M
WMT2012        EN       3K       64K
WMT2013        EN       3K       56K

Table 7: SMT training and test data statistics. All numbers refer to tokenized, lowercased data.

The distortion limit is set to 6 and for each source phrase the top 30 translation candidates are considered. When translating into a morphologically rich language, data sparsity issues in the target language become particularly apparent. To compensate for this, we also experiment with a 5-gram suffix-based LM in addition to the surface-based LM (Müller et al., 2012; Bisazza and Monz, 2014).

The BNN models are integrated as additional log-probability feature functions (log P_BNN-p): one feature for the word prediction model, or two features for the stem and suffix models respectively, plus the penalty feature ε.

Table 7 shows the data used to train our English-Russian SMT system. The feature weights for all approaches were tuned using pairwise ranking optimization (Hopkins and May, 2011) on the wmt12 benchmark (Callison-Burch et al., 2012). During tuning, 14 PRO parameter estimation runs are performed in parallel on different samples of the n-best list after each decoder iteration. The weights of the individual PRO runs are then averaged and passed on to the next decoding iteration. Performing weight estimation independently for a number of samples corrects for some of the instability that can be caused by individual samples. The wmt13 set (Bojar et al., 2013) was used for testing. We use approximate randomization (Noreen, 1989) to test for statistically significant differences between runs (Riezler and Maxwell, 2005).

Translation quality is measured with case-insensitive BLEU[%] using one reference translation. As shown in Table 8, statistically significant improvements over the respective baseline (Baseline and Base+suffLM) are marked N, at the p < .01 level. Integrating our bilingual neural network approach into our SMT system yields small but statistically significant improvements of 0.4 BLEU over a competitive baseline.

SMT system          wmt12 (dev)   wmt13 (test)
Baseline            24.7          18.9
+ stem/suff. BNN    25.1          19.3N
Base+suffLM         24.5          19.2
+ word BNN          24.5          19.3
+ stem/suff. BNN    24.7          19.6N

Table 8: Effect of our BNN models on English-Russian translation quality (BLEU[%]).

We can also see that it is beneficial to add a suffix-based language model to the baseline system. The biggest improvement is obtained by combining the suffix-based language model and our BNN approach, yielding 0.7 BLEU over a competitive, state-of-the-art baseline, of which 0.4 BLEU is due to our BNNs. Finally, one can see that the BNNs modeling stems and suffixes separately perform better than a BNN directly predicting fully inflected forms.

To better understand the BNN effect on the SMT system, we analyze the set of phrase pairs that are employed by the decoder to translate each sentence. This set is ranked by the weighted combination of phrase translation and lexical weighting scores, the target language model score and, if available, the phrase BNN scores. As shown in Table 9, the morphological BNN models have a positive effect on the decoder's lexical search space, increasing the recall of reference tokens among the top 1 and top 3 phrase translation candidates. The mean reciprocal rank (MRR) also improves from 0.655 to 0.662. Looking at the 1-best SMT output, we observe a slight increase of the reference/output recall (50.0% to 50.7%), which is less than the increase we observe for the top 1 translation candidates (57.6% to 59.0%). One possible explanation is that the new, more accurate translation distributions are overruled by other SMT model scores, like the target LM, that are based on traditional maximum-likelihood estimates. While the suffix-based LMs proved beneficial in our experiments, we speculate that higher gains could be obtained by coupling our approach with a morphology-aware neural LM like the one recently presented by Botha and Blunsom (2014).

Token recall (wmt12):                 Baseline   +BNN
reference/MT-search-space [top-1]     57.6%      59.0%
reference/MT-search-space [top-3]     70.7%      70.9%
reference/MT-search-space [top-30]    86.0%      85.0%
reference/MT-search-space [MRR]       0.655      0.662
reference/MT-output                   50.0%      50.7%
stem-only reference/MT-output         12.3%      11.5%
  of which reachable                  11.2%      10.3%

Table 9: Target word coverage analysis of the English-Russian SMT system before and after adding the morphological BNN models.

7 Related work

While most relevant literature has been discussed in earlier sections, the following approaches are particularly related to ours. Minkov et al. (2007) and Toutanova et al. (2008) address target inflection prediction with a log-linear model based on rich morphological and syntactic features. Their model exploits target context and is applied to inflect the output of a stem-based SMT system, whereas our models predict target words (or pairs of stem and suffix) independently and are integrated into decoding. Chahuneau et al. (2013) address the same problem with another feature-rich discriminative model that can be integrated into decoding, like ours, but they also use it to inflect stemmed phrases on the fly. It is not clear what part of their SMT improvements is due to the generation of new phrases and what part to better scoring. Jeong et al. (2010) predict surface word forms in context, similarly to our word BNN, and integrate the scores into the SMT system. Unlike us, they rely on linguistically feature-rich log-linear models to do so. Gimpel and Smith (2008) propose a similar approach to directly predict phrases in context, instead of words.

All those approaches employ features that capture the global structure of source sentences, like dependency relations. By contrast, our models access only local context in the source sentence, yet they achieve accuracy gains comparable to those of models that also use global sentence structure.

8 Conclusions

We have proposed a general approach to predicting word translations in context using bilingual neural network architectures. Unlike previous NN approaches, we model word, stem and suffix distributions in the target language given context in the source language. Instead of relying on manually engineered features, our models automatically learn abstract word representations and features that are relevant for the modeled task directly from word-aligned parallel data. Our preliminary results with the LBL and ConvNet architectures suggest that further improvements may be achieved by factorizing target representations or by dynamically modeling the source context size. Evaluated on three morphologically rich languages, our approach achieves considerable gains in word, stem and suffix prediction accuracy over a context-independent maximum-likelihood baseline. Finally, we have shown that the proposed BNN models can be tightly integrated into a phrase-based SMT system, resulting in small but statistically significant BLEU improvements over a competitive, large-scale English-Russian baseline.

Our analysis shows that the number of correct target words occurring in highly scored phrase translation candidates increases after integrating the morphological BNNs. However, only a few of these end up in the 1-best translation output. Future work will investigate the benefits of coupling our BNN models with target language models that also exploit abstract word representations, such as those of Botha and Blunsom (2014) and Auli et al. (2013).

Acknowledgments

This research was funded in part by the Netherlands Organization for Scientific Research (NWO) under project numbers 639.022.213 and 612.001.218. We would like to thank Ekaterina Garmash for helping with the error analysis of the English-Russian translations.

References

Michael Auli, Michel Galley, Chris Quirk, and Geoffrey Zweig. 2013. Joint language and translation modeling with recurrent neural networks. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1044–1054, Seattle, Washington, USA, October.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.

Arianna Bisazza and Christof Monz. 2014. Class-based language modeling for translating into morphologically rich languages. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics, pages 1918–1927, Dublin, Ireland.

Ondrej Bojar, Zdenek Žabokrtský, Ondrej Dušek, Petra Galušcáková, Martin Majliš, David Marecek, Jirí Maršík, Michal Novák, Martin Popel, and Aleš Tamchyna. 2012. The joy of parallelism with CzEng 1.0. In Proceedings of LREC 2012, Istanbul, Turkey, May. ELRA, European Language Resources Association.

Ondrej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria, August. Association for Computational Linguistics.

Jan A. Botha and Phil Blunsom. 2014. Compositional Morphology for Word Representations and Language Modelling. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, June.

Chris Callison-Burch, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 workshop on statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 10–51, Montréal, Canada, June. Association for Computational Linguistics.

Victor Chahuneau, Eva Schlinger, Noah A. Smith, and Chris Dyer. 2013. Translating into morphologically rich languages with synthetic phrases. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1677–1687, Seattle, USA, October.

Stanley F. Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Computer Speech and Language, 4(13):359–393.

Colin Cherry, Robert C. Moore, and Chris Quirk. 2012. On hierarchical re-ordering and permutation parsing for phrase-based decoding. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 200–209, Montréal, Canada, June. Association for Computational Linguistics.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th Annual International Conference on Machine Learning, volume 12, pages 2493–2537.

Kevin Duh, Graham Neubig, Katsuhito Sudoh, and Hajime Tsukada. 2013. Adaptation data selection using neural language models: Experiments in machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 678–683, Sofia, Bulgaria, August.

Nadir Durrani, Alexander Fraser, and Helmut Schmid. 2013. Model with minimal translation units, but decode with phrases. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1–11, Atlanta, Georgia, USA, June.

Michel Galley and Christopher D. Manning. 2008. A simple and effective hierarchical phrase reordering model. In EMNLP '08: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 848–856, Morristown, NJ, USA. Association for Computational Linguistics.

Jianfeng Gao, Xiaodong He, Wen-tau Yih, and Li Deng. 2014. Learning continuous phrase representations for translation modeling. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 699–709. Association for Computational Linguistics.

Kevin Gimpel and Noah A. Smith. 2008. Rich source-side context for statistical machine translation. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 9–17, Columbus, Ohio, USA.

Spence Green and John DeNero. 2012. A class-based agreement model for generating accurately inflected translations. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, ACL '12, pages 146–155, Stroudsburg, PA, USA. Association for Computational Linguistics.

Rejwanul Haque, Sudip Kumar Naskar, Antal Bosch, and Andy Way. 2011. Integrating source-language context into phrase-based statistical machine translation. Machine Translation, 25(3):239–285, September.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.

Mark Hopkins and Jonathan May. 2011. Tuning as ranking. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1352–1362, Edinburgh, Scotland, UK, July. Association for Computational Linguistics.

Yuening Hu, Michael Auli, Qin Gao, and Jianfeng Gao. 2014. Minimum translation modeling with recurrent neural networks. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 20–29, Gothenburg, Sweden, April. Association for Computational Linguistics.

Minwoo Jeong, Kristina Toutanova, Hisami Suzuki, and Chris Quirk. 2010. A discriminative lexicon model for complex morphology. In The Ninth Conference of the Association for Machine Translation in the Americas.

Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1700–1709, Seattle, USA, October.


Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 655–665. Association for Computational Linguistics.

Ahmed El Kholy and Nizar Habash. 2012. Translate, predict or generate: Modeling rich morphology in statistical machine translation. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT).

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proceedings of HLT-NAACL 2003, pages 127–133, Edmonton, Canada.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Philipp Koehn. 2004. Pharaoh: A beam search decoder for phrase-based statistical machine translation models. In Robert E. Frederking and Kathryn B. Taylor, editors, Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA 2004), pages 115–124.

Hai-Son Le, Ilya Oparin, Alexandre Allauzen, Jean-Luc Gauvain, and François Yvon. 2011. Structured output layer neural network language model. In Proceedings of ICASSP.

Yoong Keok Lee, Aria Haghighi, and Regina Barzilay. 2011. Modeling syntactic context improves morphological segmentation. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 1–9, Portland, Oregon, USA, June. Association for Computational Linguistics.

Arne Mauser, Saša Hasan, and Hermann Ney. 2009. Extending statistical machine translation with discriminative and trigger-based lexicon models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, EMNLP '09, pages 210–218, Stroudsburg, PA, USA. Association for Computational Linguistics.

Einat Minkov, Kristina Toutanova, and Hisami Suzuki. 2007. Generating complex morphology for machine translation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 128–135.

Andriy Mnih and Geoffrey E. Hinton. 2007. Three new graphical models for statistical language modelling. In Proceedings of the 24th International Conference on Machine Learning, pages 641–648, New York, NY, USA.

Thomas Müller, Hinrich Schütze, and Helmut Schmid. 2012. A comparative investigation of morphological language modeling for the languages of the European Union. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 386–395, Montréal, Canada, June. Association for Computational Linguistics.

Eric W. Noreen. 1989. Computer Intensive Methods for Testing Hypotheses: An Introduction. Wiley-Interscience.

Chris Quirk and Arul Menezes. 2006. Do we need phrases? Challenging the conventional wisdom in statistical machine translation. In Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pages 9–16, New York City, USA, June. Association for Computational Linguistics.

Stefan Riezler and John T. Maxwell. 2005. On some pitfalls in automatic evaluation and significance testing for MT. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 57–64, Ann Arbor, Michigan, June. Association for Computational Linguistics.

Holger Schwenk, Daniel Dechelotte, and Jean-Luc Gauvain. 2006. Continuous space language models for statistical machine translation. In Proceedings of the COLING/ACL 2006 Conference, pages 723–730, Sydney, Australia, July. Association for Computational Linguistics.

Holger Schwenk. 2012. Continuous space translation models for phrase-based statistical machine translation. In Proceedings of COLING.

Serge Sharoff, Mikhail Kopotev, Tomaz Erjavec, Anna Feldman, and Dagmar Divjak. 2008. Designing and evaluating a Russian tagset. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08), Marrakech, Morocco. European Language Resources Association (ELRA). http://www.lrec-conf.org/proceedings/lrec2008/.

Noah A. Smith and Jason Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics.

Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151–161, Stroudsburg, PA, USA. Association for Computational Linguistics.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211. Association for Computational Linguistics.

Le Hai Son, Alexandre Allauzen, and François Yvon. 2012. Continuous space translation models with neural networks. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 39–48, Stroudsburg, PA, USA. Association for Computational Linguistics.

Drahomíra Spoustová, Jan Hajic, Jan Votrubec, Pavel Krbec, and Pavel Kveton. 2007. The best of two worlds: Cooperation of statistical and rule-based taggers for Czech. In Proceedings of the Workshop on Balto-Slavonic Natural Language Processing, pages 67–74, Prague, Czech Republic, June. Association for Computational Linguistics.

Kristina Toutanova, Hisami Suzuki, and Achim Ruopp. 2008. Applying morphology generation models to machine translation. In Proceedings of the Association for Computational Linguistics.

Ashish Vaswani, Yinggong Zhao, Victoria Fossum, and David Chiang. 2013. Decoding with large-scale neural language models improves translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1387–1392, Seattle, October.

Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1393–1398, Seattle, USA, October.
