
Proceedings of NAACL-HLT 2018, pages 493–499, New Orleans, Louisiana, June 1–6, 2018. © 2018 Association for Computational Linguistics

Incremental Decoding and Training Methods for Simultaneous Translation in Neural Machine Translation

Fahim Dalvi* [email protected]
Nadir Durrani* [email protected]
Hassan Sajjad [email protected]
Stephan Vogel [email protected]

Qatar Computing Research Institute – HBKU

Abstract

We address the problem of simultaneous translation by modifying the Neural MT decoder to operate with dynamically built encoder and attention. We propose a tunable agent which decides the best segmentation strategy for a user-defined BLEU loss and Average Proportion (AP) constraint. Our agent outperforms the previously proposed Wait-if-diff and Wait-if-worse agents (Cho and Esipova, 2016) on BLEU with lower latency. Secondly, we propose data-driven changes to Neural MT training to better match the incremental decoding framework.

1 Introduction

Simultaneous translation is a desirable attribute in Spoken Language Translation, where the translator is required to keep up with the speaker. In a lecture or meeting translation scenario, where utterances are long or the end of a sentence is not clearly marked, the system must operate on a buffered sequence. Generating translations for such incomplete sequences presents a considerable challenge for machine translation, more so for syntactically divergent language pairs (such as German-English), where the context required to correctly translate a sentence appears much later in the sequence, and prematurely committing to a translation leads to a significant loss in quality.

Various strategies to select appropriate segmentation points in a streaming input have been proposed (Fügen et al., 2007; Bangalore et al., 2012; Sridhar et al., 2013; Yarmohammadi et al., 2013; Oda et al., 2014). A downside of this approach is that the MT system translates the resulting segments independently of each other, ignoring context. Even if the segmenter selects perfect points at which to segment the input stream, the MT system still requires lexical history to make correct decisions.

*These authors contributed equally to this work.

The end-to-end nature of the Neural MT architecture (Sutskever et al., 2014; Bahdanau et al., 2015) provides a natural mechanism¹ to integrate stream decoding. Specifically, the recurrent property of the encoder and decoder components provides an easy way to maintain historical context in a fixed-size vector.

We modify the neural MT architecture to operate in an online fashion, where i) the encoder and the attention are updated dynamically as new input words are added, through a READ operation, and ii) the decoder generates output from the available encoder states, through a WRITE operation. The decision of when to WRITE is learned through a tunable segmentation agent, based on user-defined thresholds. Our incremental decoder significantly outperforms the chunk-based decoder and approaches the oracle performance within a deficit of 2 BLEU points across 4 language pairs, with a moderate delay. We additionally explore whether modifying Neural MT training to match the decoder can improve performance. While we observed significant restoration of quality when chunk decoding was matched with chunk-based NMT training, the same did not hold for our proposed incremental training matched to the incremental decoding framework.

The remainder of the paper is organized as follows: Section 2 describes modifications to the NMT decoder to enable stream decoding. Section 3 describes various agents for learning a READ/WRITE strategy. Section 4 presents the evaluation and results. Section 5 describes modifications to NMT training to mimic the corresponding decoding strategy, and Section 6 concludes the paper.

¹As opposed to the traditional phrase-based decoder (Moses), which requires pre-computation of the phrase table, future-cost estimation, and separate maintenance of each stateful feature (language model, OSM (Durrani et al., 2015), etc.).


[Figure 1 graphic: the incremental decoder is run after each new Arabic source word; the hypotheses generated in iterations 1-4 are "Security", "Food security", "Food security matter", and "Food security is an important matter", with nw = 0, 2, 0, 4 words committed respectively.]

Figure 1: A decoding pass over a 4-word source sentence. nw denotes the number of words the agent chose to commit. Green nodes = committed words, blue nodes = newly generated words in the current iteration. Words marked in red are discarded, as the agent chooses not to commit them.

2 Incremental Decoding

Problem: In a stream decoding scenario, the entire source sequence is not readily available. The translator must either wait for the sequence to finish in order to compute the encoder state, or commit partial translations at several intermediate steps, potentially losing contextual information.

Chunk-based Decoder: A straightforward way to enable simultaneous translation is to chop the incoming input after every N tokens. A drawback of this approach is that the translation and segmentation processes operate independently of each other, and the previous contextual history is not considered when translating the current chunk. This information is important for generating grammatically correct and coherent translations.
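To make the baseline concrete, here is a minimal sketch of chunk-based decoding in Python; the translate callable and the default chunk size are our own placeholders, not the paper's implementation:

def chunk_decode(source_tokens, translate, chunk_size=6):
    # Translate fixed-size chunks independently and concatenate the outputs.
    # No encoder state or target history is shared across chunks.
    output = []
    for i in range(0, len(source_tokens), chunk_size):
        output.extend(translate(source_tokens[i:i + chunk_size]))
    return output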

Incremental Decoding: The RNN-based NMT framework provides a natural mechanism to preserve context and accommodate streaming. The decoder maintains the entire target history through the previous decoder state alone. But to enable incremental neural decoding, we have to address the following constraints: i) how do we dynamically build the encoder and attention with the streaming input? ii) what is the best strategy to pre-commit translations at several intermediate points?

Inspired by Cho and Esipova (2016), we modify the NMT decoder to operate as a sequence of READ and WRITE operations. The former reads the next word from the buffered source sequence and translates it using the available context; the latter is computed through an AGENT, which decides how many words should be committed from the generated translation. Note that when a translation is generated in the READ operation, the already committed target words remain unchanged, i.e. the generation is continued from the last committed target word using the saved decoder state. See Algorithm 1 for details. The AGENT decides how many target words to WRITE after every READ operation, and has complete control over the context each target word gets to see before being committed, as well as the overall delay incurred. Figure 1 shows the incremental decoder in action, where the agent decides not to commit any target words in iterations 1 and 3. The example shows an instance where incorrectly translated words are discarded when more context becomes available. Given this generic framework, we describe several AGENTS in Section 3, trained to optimize the BLEU loss and latency.

Algorithm 1: Incremental decoding
  s: source sequence
  s': available source sequence
  t_c: committed target sequence
  t: current decoded sequence for s'
  n_w: number of tokens to commit

  s' ← empty
  for token in s do                              ▷ READ operation
      s' ← s' + token
      t ← NMTDECODER(s', t_c)
      if s' ≠ s then
          n_w ← AGENT(s', t_c, t)
      else
          n_w ← length(t) − length(t_c)          ▷ commit all new words if we have seen the entire source
      end if
      t'_c ← GETNEWTOKENS(t_c, t, n_w)
      t_c ← t_c + t'_c                           ▷ WRITE operation
  end for

  function GETNEWTOKENS(t_c, t, n_w)
      start ← length(t_c) + 1
      end ← start + n_w
      return t[start : end]
  end function
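The following Python sketch mirrors the READ/WRITE loop of Algorithm 1; nmt_decoder and agent are assumed callables standing in for the trained model and the segmentation agent, and the variable names are ours:

def get_new_tokens(committed, hypothesis, n_new):
    # Return the next n_new tokens of the hypothesis beyond the committed prefix.
    start = len(committed)
    return hypothesis[start:start + n_new]

def incremental_decode(source, nmt_decoder, agent):
    src_prefix = []   # s': source tokens read so far
    committed = []    # t_c: target tokens already committed
    for token in source:                                # READ operation
        src_prefix.append(token)
        # The decoder continues generation from the committed prefix,
        # reusing the saved decoder state.
        hypothesis = nmt_decoder(src_prefix, committed)
        if len(src_prefix) < len(source):
            n_w = agent(src_prefix, committed, hypothesis)
        else:
            # Entire source seen: commit all remaining words.
            n_w = len(hypothesis) - len(committed)
        committed += get_new_tokens(committed, hypothesis, n_w)   # WRITE operation
    return committed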

Beam Search: Independent of the agent being used, the modified NMT architecture incurs some complexities for beam decoding. For example, if at some iteration the decoder generates 5 new words but the agent decides to commit only 2 of them, the best hypothesis at the 2nd word may not be the same as the one at the 5th word. Hence, the agent has to re-rank the hypotheses at the last target word it decides to commit; future hypotheses then continue from this selected hypothesis. See Figure 2 for a visual representation. The overall utility of beam decoding is reduced in the case of incremental decoding, because it is necessary to commit to and retain only one beam at several points in order to start producing output with minimal delay.

[Figure 2 graphic: beam-search trees for normal decoding and incremental decoding, with READ and WRITE steps and beam selections marked.]

Figure 2: Beam search in normal decoding vs. incremental decoding. Green nodes indicate the hypothesis selected by the agent to WRITE. Since we cannot change what we have already committed, the other nodes (marked in yellow) are discarded and future hypotheses originate from the selected hypothesis alone. Normal beam search is executed for consecutive READ operations (blue nodes).
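As a rough illustration of this re-ranking step (our own sketch, under the assumption that each hypothesis carries its cumulative log-probability at every prefix length; not the released code):

def commit_and_prune(beam, n_committed, n_new):
    # beam: list of (tokens, prefix_scores), where prefix_scores[i] is the
    # cumulative log-probability of tokens[:i + 1].
    if n_new == 0:
        return [], beam            # nothing committed; keep the full beam
    cut = n_committed + n_new
    # Re-rank by the score at the last committed position rather than at the
    # final generated word.
    best_tokens, best_scores = max(beam, key=lambda h: h[1][cut - 1])
    # Future hypotheses originate from the selected hypothesis alone.
    return best_tokens[n_committed:cut], [(best_tokens, best_scores)]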

3 Segmentation Strategies

In this section, we discuss the different AGENTS that we evaluated in our modified incremental decoder. To measure the latency of these agents, we use the Average Proportion (AP) metric as defined by Cho and Esipova (2016). AP is calculated as the total number of source words each target word required before being committed, normalized by the product of the source and target lengths. It varies between 0 and 1, with lower being better. See supplementary material for details.
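For reference, this definition (following Cho and Esipova, 2016) can be written as follows, where s(t) is our notation for the number of source words that had been read when target word y_t was committed:

  \mathrm{AP}(\mathbf{x}, \mathbf{y}) \;=\; \frac{1}{|\mathbf{x}|\,|\mathbf{y}|} \sum_{t=1}^{|\mathbf{y}|} s(t)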

Wait-until-end: The WUE agent waits for the entire source sentence before decoding, and serves as an upper bound on the performance of our agents, albeit with the worst AP = 1.

Wait-if-worse/diff: We reimplemented the baseline agents described in Cho and Esipova (2016). The Wait-if-Worse (WIW) agent WRITES a target word if its probability does not decrease after a READ operation. The Wait-if-Diff (WID) agent instead WRITES a target word if the target word remains unchanged after a READ operation.
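Schematically, and with an assumed interface (before and after are the decoder's top next-word candidate and its probability before and after the latest READ; this is our simplification, not the original code), the two rules can be written as:

def wait_if_worse(before, after):
    # WIW: WRITE the word if its probability did not decrease after the READ.
    word, prob_before = before
    _, prob_after = after
    return 1 if prob_after >= prob_before else 0

def wait_if_diff(before, after):
    # WID: WRITE the word if the predicted word is unchanged after the READ.
    return 1 if before[0] == after[0] else 0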

Static Read and Write: The STATIC-RW agent is inspired by the chunk-based decoder and tries to resolve its shortcomings while maintaining its simplicity. The primary drawback of the chunk-based decoder is the loss of context across chunks. Our agent starts by performing S READ operations, followed by repeated blocks of RW WRITES and RW READS until the end of the source sequence. The number of WRITE and READ operations is kept the same to ensure that the gap between the source and target sequences does not grow over time. The initial S READ operations essentially create a buffer of S tokens, allowing some future context to be used by the decoder. Note that the latency induced by this agent occurs only at the beginning, and remains constant for the rest of the sentence. This method in fact introduces a class of AGENTS parameterized by their S, RW values. We tune S and RW to select the specific AGENT that satisfies the user-defined BLEU-loss and AP thresholds.
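Under the Algorithm 1 interface, a STATIC-RW agent can be sketched as below (a simplification that only looks at the number of source words read so far; parameter names are ours):

def static_rw_agent(num_read, S, RW):
    # Commit nothing during the initial buffer of S READs, then commit RW
    # target words after every further block of RW READs.
    if num_read < S:
        return 0
    if (num_read - S) % RW == 0:
        return RW
    return 0

Algorithm 1 then commits any remaining target words once the entire source has been read.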

4 Evaluation

Data: We trained systems for 4 language pairs: German-English, Arabic-English, Czech-English, and Spanish-English, using the data made available for IWSLT (Cettolo et al., 2014). See supplementary material for data statistics. These language pairs present a diverse set of challenges for this problem, with Arabic and Czech being morphologically rich, German being syntactically divergent, and Spanish introducing local reorderings with respect to English.

NMT System: We trained 2-layered LSTM encoder-decoder models with attention using the seq2seq-attn implementation (Kim, 2016). Please see supplementary material for settings.

Results: Figure 3 shows the results of the various streaming agents. Our proposed STATIC-RW agent outperforms the other methods while maintaining an AP < 0.75, with a loss of less than 0.5 BLEU points on Arabic, Czech, and Spanish. This was found to be consistent across all test sets 2011-2014 (see the "small" models in Figure 4). In the case of German, the loss at AP < 0.75 was around 1.5 BLEU points. The syntactic divergence and rich morphology of German pose a bigger challenge and require larger context than the other language pairs. For example, the conjugated verb in a German verb complex appears in the second position, while the main verb almost always occurs at the end of the sentence/phrase (Durrani et al., 2011). Our methods are also comparable to the more sophisticated techniques involving Reinforcement Learning to learn an agent, introduced by Gu et al. (2017) and Satija and Pineau (2016), but without the overhead of expensive training for the agent.

Figure 3: Results for various streaming AGENTS (WID, WIW, WUE, C6 (chunk decoding with N=6), and S,RW for STATIC-RW) on the tune set. For each AP bucket, we only show the agents with the top 3 BLEU scores in that bucket, with the remaining listed in descending order of their BLEU scores.

Scalability: The preliminary results were obtained using models trained on the TED corpus only. We conducted further experiments by training models on larger datasets (see the supplementary material for data sizes) to check whether our findings scale. We fine-tuned (Luong and Manning, 2015; Sajjad et al., 2017b) our models with the in-domain data to avoid domain disparity. We then re-ran our agents with the best S,RW values (with an AP under 0.75) for each language pair. Figure 4 ("large" models) shows that the BLEU loss from the respective oracle increased when the models were trained on bigger data sizes. This could be attributed to the increased lexical ambiguity introduced by the large amount of out-of-domain data, which can only be resolved with additional contextual information. However, our results were still better than the WIW agent, which also has an AP value above 0.8. Allowing a similar AP, our STATIC-RW agents were able to limit the BLEU loss to 1.5 for all language pairs except German-English. Detailed test results are available in the supplementary material.

Figure 4: Averaged results on test sets (2011-2014) using the models trained on small and large datasets, with AP ≤ 0.75. Detailed test-wise results are available in the supplementary material.

5 Incremental Training

The limitation of the previously described decoding approaches (chunk-based and incremental) is the mismatch between training and decoding: training is carried out on full sentences, whereas at test time the decoder generates hypotheses based on incomplete source information. This discrepancy between training and decoding can be potentially harmful. In Section 2, we presented two methods to address the partial-input decoding problem, the chunk decoder and the incremental decoder. We now train models to match the corresponding decoding scenarios.


5.1 Chunk Training

In chunk-based training, we simply split each training sentence into chunks of N tokens.² The corresponding target sentence for each chunk is generated by taking the span of target words that are word-aligned³ with the words in the source span. Chunking the data into smaller segments increases the training time significantly. To overcome this problem, we train a model on full sentences using all the data and then fine-tune it with the in-domain chunked data.
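A minimal sketch of this data preparation (our interpretation: the target span runs from the first to the last target word aligned to the source chunk; alignments are (source index, target index) pairs such as those produced by fast-align):

def chunk_training_pairs(src_tokens, tgt_tokens, alignments, N=6):
    pairs = []
    for start in range(0, len(src_tokens), N):
        end = start + N
        # Target positions aligned to any source word in the current chunk.
        tgt_idx = [j for (i, j) in alignments if start <= i < end]
        if not tgt_idx:
            continue
        tgt_span = tgt_tokens[min(tgt_idx):max(tgt_idx) + 1]
        pairs.append((src_tokens[start:end], tgt_span))
    return pairs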

5.2 Add-M Training

Next, we formulate a training mechanism to match the incremental decoding described in Section 2. One way to achieve this is to force the attention onto a local span of encoder states and block it from giving weight to the non-local (rightward) encoder states. The hope is that, in the case of long-range dependencies, the model learns to predict these dependencies without the entire source context. Such a training procedure is non-trivial, as it requires dynamic inputs to the attention mechanism during training, and during backpropagation encoder states that have been seen by the attention mechanism more often would dynamically receive more gradient updates. We leave this idea as future work and focus instead on a data-driven technique that mimics this kind of training, described below.

We start with the first N words of a source sentence and generate the target words that are aligned to them. We then generate further training instances with N+M, N+2M, N+3M, ... source words, until the end of the sentence has been reached.⁴ The resulting training data roughly mimics the decoding scenario, in which the source-side context is gradually built up. The downside of this method is that the data size increases quadratically, making training infeasible. To overcome this, we fine-tune a model trained on full sentences with the in-domain corpus generated using this method.
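Reusing the alignment-based extraction above, the Add-M instance generation might be sketched as follows (again an illustration under our own interpretation that the target side is the prefix up to the last aligned word, not the released pipeline):

def add_m_training_pairs(src_tokens, tgt_tokens, alignments, N=6, M=1):
    pairs = []
    length = N
    while True:
        end = min(length, len(src_tokens))
        # Target prefix covering all words aligned to the current source prefix.
        tgt_idx = [j for (i, j) in alignments if i < end]
        if tgt_idx:
            pairs.append((src_tokens[:end], tgt_tokens[:max(tgt_idx) + 1]))
        if end == len(src_tokens):
            break
        length += M
    return pairs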

²Although randomly segmenting the source sentence based on the number of tokens is a naïve approach that does not take linguistic properties into account, our goal here was to exactly match the training with the chunk-based decoding scenario.

³We used fast-align (Dyer et al., 2013) for alignments.
⁴We trained with N = 6 and M = 1 for our experiments.

Figure 5: Averaged test-set results on various training modifications.

5.3 Results

The results in Figure 5 show that matching chunk decoding with the corresponding chunk-based training significantly improves performance, with a gain of up to 12 BLEU points. However, we were not able to improve upon our incremental decoder, with the results deteriorating notably. One reason for this degradation is that the training and decoding scenarios are still not perfectly matched. The training pipeline in this case also sees the beginnings of sentences much more often, which could lead to unnatural distributions being inferred within the model.

6 Conclusion

We addressed the problem of simultaneous translation by modifying the Neural MT decoder architecture. We presented a tunable agent which decides the best segmentation strategy based on user-defined BLEU loss and AP constraints. Our results showed improvements over the previously established WIW and WID methods. We additionally modified Neural MT training to match the decoding scenarios: while this significantly improved chunk-based decoding, we did not observe any improvement using Add-M training. The code for our incremental decoder and agents has been made available.⁵ In the future, we would like to change the training model to dynamically build the encoder and the attention model in order to match our incremental decoder.

⁵https://github.com/fdalvi/seq2seq-attn-stream


References

Ahmed Abdelali, Kareem Darwish, Nadir Durrani, and Hamdy Mubarak. 2016. Farasa: A fast and furious segmenter for Arabic. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. Association for Computational Linguistics, San Diego, California, pages 11–16. http://www.aclweb.org/anthology/N16-3003.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR. http://arxiv.org/pdf/1409.0473v6.pdf.

Srinivas Bangalore, Vivek Kumar Rangarajan Sridhar, Prakash Kolan, Ladan Golipour, and Aura Jimenez. 2012. Real-time incremental speech-to-speech translation of dialogs. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Montreal, Canada, pages 437–445.

Mauro Cettolo, Jan Niehues, Sebastian Stüker, Luisa Bentivogli, and Marcello Federico. 2014. Report on the 11th IWSLT Evaluation Campaign. In Proceedings of the International Workshop on Spoken Language Translation, Lake Tahoe, US.

Kyunghyun Cho and Masha Esipova. 2016. Can neural machine translation do simultaneous translation? CoRR abs/1606.02012.

Nadir Durrani, Helmut Schmid, and Alexander Fraser. 2011. A Joint Sequence Translation Model with Integrated Reordering. In Proceedings of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT'11). Portland, OR, USA.

Nadir Durrani, Helmut Schmid, Alexander Fraser, Philipp Koehn, and Hinrich Schütze. 2015. The Operation Sequence Model – Combining N-Gram-based and Phrase-based Statistical Machine Translation. Computational Linguistics 41(2):157–186.

Chris Dyer, Victor Chahuneau, and Noah A. Smith. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of NAACL'13.

Andreas Eisele and Yu Chen. 2010. MultiUN: A Multilingual Corpus from United Nation Documents. In Proceedings of the Seventh Conference on International Language Resources and Evaluation. Valletta, Malta.

Christian Fügen, Alex Waibel, and Muntsin Kolss. 2007. Simultaneous translation of lectures and speeches. Machine Translation 21(4):209–252.

Jiatao Gu, Graham Neubig, Kyunghyun Cho, and Victor O.K. Li. 2017. Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Association for Computational Linguistics, Valencia, Spain, pages 1053–1062.

Yoon Kim. 2016. Seq2seq-attn. https://github.com/harvardnlp/seq2seq-attn.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the ACL 2007. Prague, Czech Republic.

Minh-Thang Luong and Christopher D. Manning. 2015. Stanford Neural Machine Translation Systems for Spoken Language Domains. In Proceedings of the International Workshop on Spoken Language Translation. Da Nang, Vietnam.

Yusuke Oda, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Optimizing segmentation strategies for simultaneous speech translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Baltimore, Maryland, pages 551–556. http://www.aclweb.org/anthology/P14-2090.

Hassan Sajjad, Fahim Dalvi, Nadir Durrani, Ahmed Abdelali, Yonatan Belinkov, and Stephan Vogel. 2017a. Challenging Language-Dependent Segmentation for Arabic: An Application to Machine Translation and Part-of-Speech Tagging. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Vancouver, Canada.

Hassan Sajjad, Nadir Durrani, Fahim Dalvi, Yonatan Belinkov, and Stephan Vogel. 2017b. Neural Machine Translation Training in a Multi-Domain Scenario. In Proceedings of the 14th International Workshop on Spoken Language Technology (IWSLT-14).

Harsh Satija and Joelle Pineau. 2016. Simultaneous machine translation using deep reinforcement learning. In Abstraction in Reinforcement Learning Workshop. International Conference on Machine Learning, New York, USA.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 1715–1725. http://www.aclweb.org/anthology/P16-1162.


Vivek Kumar Rangarajan Sridhar, John Chen, Srinivas Bangalore, Andrej Ljolje, and Rathinavelu Chengalvarayan. 2013. Segmentation strategies for streaming speech translation. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Atlanta, Georgia, pages 230–238. http://www.aclweb.org/anthology/N13-1023.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Mahsa Yarmohammadi, Vivek Kumar Rangarajan Sridhar, Srinivas Bangalore, and Baskaran Sankaran. 2013. Incremental segmentation and decoding strategies for simultaneous translation. In Proceedings of the Sixth International Joint Conference on Natural Language Processing. Asian Federation of Natural Language Processing, Nagoya, Japan, pages 1032–1036. http://www.aclweb.org/anthology/I13-1141.
