




Learning to Translate in Real-time with Neural Machine Translation

Jiatao Gu†, Graham Neubig♦, Kyunghyun Cho‡ and Victor O.K. Li†

†The University of Hong Kong   ♦Carnegie Mellon University   ‡New York University
†{jiataogu, vli}@eee.hku.hk   ♦[email protected]   ‡[email protected]

Abstract

Translating in real-time, a.k.a. simultaneous translation, outputs translation words before the input sentence ends, which is a challenging problem for conventional machine translation methods. We propose a neural machine translation (NMT) framework for simultaneous translation in which an agent learns to make decisions on when to translate from the interaction with a pre-trained NMT environment. To trade off quality and delay, we extensively explore various targets for delay and design a method for beam-search applicable in the simultaneous MT setting. Experiments against state-of-the-art baselines on two language pairs demonstrate the efficacy of the proposed framework both quantitatively and qualitatively.¹

1 Introduction

Simultaneous translation, the task of translating content in real-time as it is produced, is an important tool for real-time understanding of spoken lectures or conversations (Fügen et al., 2007; Bangalore et al., 2012). Different from the typical machine translation (MT) task, in which translation quality is paramount, simultaneous translation requires balancing the trade-off between translation quality and time delay to ensure that users receive translated content in an expeditious manner (Mieno et al., 2015). A number of methods have been proposed to solve this problem, mostly in the context of phrase-based machine translation. These methods are based on a segmenter, which receives the input one word at a time, then decides when to send it to an MT system that translates each segment independently (Oda et al., 2014) or with a minimal amount of language model context (Bangalore et al., 2012).

¹ Code and data can be found at https://github.com/nyu-dl/dl4mt-simul-trans.


Figure 1: Example output from the proposed framework in DE → EN simultaneous translation. The heat-map represents the soft alignment between the incoming source sentence (left, up-to-down) and the emitted translation (top, left-to-right). The length of each column represents the number of source words being waited for before emitting the translation. Best viewed when zoomed digitally.


Independently of simultaneous translation, the accuracy of standard MT systems has greatly improved with the introduction of neural-network-based MT systems (NMT) (Sutskever et al., 2014; Bahdanau et al., 2014). Very recently, there have been a few efforts to apply NMT to simultaneous translation, either through heuristic modifications to the decoding process (Cho and Esipova, 2016), or through the training of an independent segmentation network that chooses when to perform output using a standard NMT model (Satija and Pineau, 2016). However, the former model lacks the capability to learn the appropriate timing with which to perform translation, and the latter model uses a standard NMT model as-is, lacking a holistic design of the modeling and learning within the simultaneous MT context. In addition, neither model has demonstrated gains over previous segmentation-based baselines, leaving questions of their relative merit unresolved.



In this paper, we propose a unified design for learning to perform neural simultaneous machine translation. The proposed framework is based on formulating translation as an interleaved sequence of two actions: READ and WRITE. Based on this, we devise a model connecting the NMT system and these READ/WRITE decisions. An example of how translation is performed in this framework is shown in Fig. 1, and detailed definitions of the problem and proposed framework are described in §2 and §3. To learn which actions to take when, we propose a reinforcement-learning-based strategy with a reward function that considers both quality and delay (§4). We also develop a beam-search method that performs search within the translation segments (§5).

We evaluate the proposed method on English-Russian (EN-RU) and English-German (EN-DE) translation in both directions (§6). The quantitative results show strong improvements compared to both the NMT-based algorithm and conventional segmentation methods. We also extensively analyze the effectiveness of the learning algorithm and the influence of the trade-off in the optimization criterion by varying the target delay. Finally, qualitative visualization is utilized to discuss the potential and limitations of the framework.

2 Problem Definition

Suppose we have a buffer of input words X = {x_1, ..., x_{T_s}} to be translated in real-time. We define the simultaneous translation task as sequentially making two interleaved decisions: READ or WRITE. More precisely, the translator READs a source word x_η from the input buffer in chronological order as translation context, or WRITEs a translated word y_τ onto the output buffer, resulting in an output sentence Y = {y_1, ..., y_{T_t}} and an action sequence A = {a_1, ..., a_T} consisting of T_s READs and T_t WRITEs, so T = T_s + T_t.
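As a concrete illustration of the bookkeeping (the action names and helper below are illustrative, not the paper's code), an interleaved decision sequence can be represented as a plain list, and T = T_s + T_t holds by construction:

```python
# Minimal sketch of the interleaved READ/WRITE bookkeeping for a toy example.
READ, WRITE = "READ", "WRITE"

def split_counts(actions):
    """Return (Ts, Tt, T): number of READs, WRITEs, and total actions."""
    ts = sum(1 for a in actions if a == READ)
    tt = sum(1 for a in actions if a == WRITE)
    return ts, tt, len(actions)

# A toy trajectory: read two source words, write one target word, and so on.
A = [READ, READ, WRITE, READ, WRITE, WRITE]
Ts, Tt, T = split_counts(A)
assert T == Ts + Tt   # T = Ts + Tt by construction
print(Ts, Tt, T)      # -> 3 3 6
```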

Similar to standard MT, we have a measure Q(Y) to evaluate the translation quality, such as the BLEU score (Papineni et al., 2002). For simultaneous translation we are also concerned with the fact that each action incurs a time delay D(A). D(A) will mainly be influenced by the delay caused by READ, as this entails waiting for a human speaker to continue speaking (about 0.3s per word for an average speaker), while WRITE consists of generating a few words from a machine translation system, which is possible on the order of milliseconds. Thus, our objective is finding an optimal policy that generates decision sequences with a good trade-off between higher quality Q(Y) and lower delay D(A). We elaborate on exactly how to define this trade-off in §4.2.

Figure 2: Illustration of the proposed framework: at each step, the NMT environment (left) computes a candidate translation. The recurrent agent (right) will take the observation, including the candidates, and send back a decision: READ or WRITE.


In the following sections, we first describe how to connect the READ/WRITE actions with the NMT system (§3), and how to optimize the system to improve simultaneous MT results (§4).

3 Simultaneous Translation with Neural Machine Translation

The proposed framework is shown in Fig. 2, and can be naturally decomposed into two parts: the environment (§3.1) and the agent (§3.2).

3.1 Environment

Encoder: READ  The first element of the NMT system is the encoder, which converts the input words X = {x_1, ..., x_{T_s}} into context vectors H = {h_1, ..., h_{T_s}}. Standard NMT uses bi-directional RNNs as encoders (Bahdanau et al., 2014), but this is not suitable for simultaneous processing, as using a reverse-order encoder requires knowing the final word of the sentence before beginning processing. Thus, we utilize a simple left-to-right uni-directional RNN as our encoder:

h_η = φ_UNI-ENC(h_{η−1}, x_η)    (1)
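For instance, the incremental nature of Eq. 1 can be sketched with a single GRU cell that consumes one source embedding at a time and grows the prefix context H^η. The layer sizes and random weights below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

# Sketch of the uni-directional encoder step h_eta = phi_UNI-ENC(h_{eta-1}, x_eta).
rng = np.random.RandomState(0)
d_emb, d_hid = 8, 16
W = rng.randn(3 * d_hid, d_emb) * 0.1   # input weights for update/reset/candidate gates
U = rng.randn(3 * d_hid, d_hid) * 0.1   # recurrent weights
b = np.zeros(3 * d_hid)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x_emb):
    """One GRU update: returns h_eta given h_{eta-1} and the embedding of x_eta."""
    zr = sigmoid(W[:2 * d_hid] @ x_emb + U[:2 * d_hid] @ h_prev + b[:2 * d_hid])
    z, r = zr[:d_hid], zr[d_hid:]
    h_tilde = np.tanh(W[2 * d_hid:] @ x_emb + U[2 * d_hid:] @ (r * h_prev) + b[2 * d_hid:])
    return (1 - z) * h_prev + z * h_tilde

# Encode a prefix of the source one word at a time, growing the context H^eta.
H = []
h = np.zeros(d_hid)
for x_emb in rng.randn(5, d_emb):   # 5 source word embeddings read so far
    h = gru_step(h, x_emb)
    H.append(h)                     # H^eta = {h_1, ..., h_eta}
```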

Decoder: WRITE  Similar to standard MT, we use an attention-based decoder. In contrast, we only reference the words that have been read from the input when generating each target word:

c^η_τ = φ_ATT(z_{τ−1}, y_{τ−1}, H^η)
z^η_τ = φ_DEC(z_{τ−1}, y_{τ−1}, c^η_τ)
p(y | y_{<τ}, H^η) ∝ exp[φ_OUT(z^η_τ)]    (2)

where z_{τ−1} and y_{τ−1} represent the previous decoder state and output word, respectively, and H^η represents the incomplete input states, i.e. H^η is a prefix of H. As the WRITE action calculates the probability of the next word on the fly, we use greedy decoding at each step:

y^η_τ = arg max_y p(y | y_{<τ}, H^η)    (3)

Note that y^η_τ and z^η_τ correspond to H^η and are the candidates for y_τ and z_τ. The agent described in the next section decides whether to take these candidates or wait for better predictions.
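A minimal sketch of how the candidate in Eqs. 2–3 could be computed over the prefix H^η follows. The attention, decoder update, and output layer here are simplified stand-ins (single projections with random weights) rather than the paper's exact parameterization:

```python
import numpy as np

rng = np.random.RandomState(1)
d_hid, vocab = 16, 50                       # illustrative sizes
W_att = rng.randn(d_hid, d_hid) * 0.1       # bilinear attention score (simplified phi_ATT)
W_out = rng.randn(vocab, 2 * d_hid) * 0.1   # simplified phi_OUT over [z; c]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def write_candidate(z_prev, H_prefix):
    """Compute c^eta_tau, z^eta_tau and the greedy candidate word y^eta_tau (Eqs. 2-3)."""
    H = np.stack(H_prefix)                       # only the words read so far
    alpha = softmax(H @ (W_att @ z_prev))        # attention over the prefix
    c = alpha @ H                                # context vector c^eta_tau
    z = np.tanh(z_prev + c)                      # toy decoder update standing in for phi_DEC
    p = softmax(W_out @ np.concatenate([z, c]))  # p(y | y_<tau, H^eta)
    y = int(np.argmax(p))                        # greedy candidate (Eq. 3)
    return c, z, y

H_prefix = [rng.randn(d_hid) for _ in range(4)]  # four source words read so far
c, z, y = write_candidate(np.zeros(d_hid), H_prefix)
```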

3.2 Agent

A trainable agent is designed to make decisions A = {a_1, ..., a_T}, a_t ∈ A, sequentially based on observations O = {o_1, ..., o_T}, o_t ∈ O, and then control the translation environment properly.

Observation  As shown in Fig. 2, we concatenate the current context vector c^η_τ, the current decoder state z^η_τ and the embedding vector of the candidate word y^η_τ into a continuous observation, o_{τ+η} = [c^η_τ; z^η_τ; E(y^η_τ)], to represent the current state.

Action  Similarly to prior work (Grissom II et al., 2014), we define the following set of actions:

• READ: the agent rejects the candidate and waits to encode the next word from the input buffer;
• WRITE: the agent accepts the candidate and emits it as the prediction into the output buffer.

Policy  How the agent chooses actions based on the observation defines the policy. In our setting, we utilize a stochastic policy π_θ parameterized by a recurrent neural network, that is:

s_t = f_θ(s_{t−1}, o_t)
π_θ(a_t | a_{<t}, o_{≤t}) ∝ g_θ(s_t)    (4)

where s_t is the internal state of the agent, which is updated recurrently, yielding the distribution of the action a_t. Based on the policy of our agent, the overall algorithm for greedy decoding is shown in Algorithm 1. The algorithm outputs the translation result and a sequence of observation-action pairs.
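Before the full algorithm, the recurrent policy of Eq. 4 can be sketched as a single recurrent update over the observation followed by a two-way softmax. The sizes and the simplified recurrent cell are assumptions for illustration only:

```python
import numpy as np

rng = np.random.RandomState(2)
d_obs, d_agent = 40, 32                          # illustrative sizes; o_t = [c; z; E(y)]
W_in = rng.randn(d_agent, d_obs) * 0.1
W_rec = rng.randn(d_agent, d_agent) * 0.1
W_act = rng.randn(2, d_agent) * 0.1              # two actions: READ (0) / WRITE (1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def policy_step(s_prev, o_t):
    """s_t = f_theta(s_{t-1}, o_t); pi_theta(a_t | .) = softmax(g_theta(s_t)) as in Eq. 4."""
    s_t = np.tanh(W_in @ o_t + W_rec @ s_prev)   # simplified recurrent update f_theta
    pi = softmax(W_act @ s_t)                    # distribution over {READ, WRITE}
    return s_t, pi

s, o = np.zeros(d_agent), rng.randn(d_obs)
s, pi = policy_step(s, o)
a = rng.choice(2, p=pi)                          # sample a_t ~ pi_theta during training
```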

Algorithm 1 Simultaneous Greedy Decoding

Require: NMT system φ, policy π_θ, τ_MAX, input buffer X, output buffer Y, state buffer S.
1: Init x_1 ⇐ X, h_1 ← φ_ENC(x_1), H^1 ← {h_1}
2: z_0 ← φ_INIT(H^1), y_0 ← ⟨s⟩
3: τ ← 0, η ← 1
4: while τ < τ_MAX do
5:   t ← τ + η
6:   y^η_τ, z^η_τ, o_t ← φ(z_{τ−1}, y_{τ−1}, H^η)
7:   a_t ∼ π_θ(a_t; a_{<t}, o_{<t}), S ⇐ (o_t, a_t)
8:   if a_t = READ and x_η ≠ ⟨/s⟩ then
9:     x_{η+1} ⇐ X, h_{η+1} ← φ_ENC(h_η, x_{η+1})
10:    H^{η+1} ← H^η ∪ {h_{η+1}}, η ← η + 1
11:    if |Y| = 0 then z_0 ← φ_INIT(H^η)
12:  else if a_t = WRITE then
13:    z_τ ← z^η_τ, y_τ ← y^η_τ
14:    Y ⇐ y_τ, τ ← τ + 1
15:    if y_τ = ⟨/s⟩ then break
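Algorithm 1 can be paraphrased as the Python sketch below; `nmt_step`, `encode_next`, and `policy` are hypothetical callables wrapping the environment (φ) and the agent (π_θ) described above, not the released code:

```python
# Sketch of Algorithm 1 (simultaneous greedy decoding) under assumed interfaces.
def simultaneous_greedy_decode(source, nmt_step, encode_next, policy, t_max, EOS="</s>"):
    H = [encode_next(None, source[0])]       # read the first source word
    z, y = None, "<s>"
    eta, out, trace = 1, [], []               # trace collects (o_t, a_t) pairs
    s_agent = None
    for _ in range(t_max):
        cand_y, cand_z, o = nmt_step(z, y, H)  # candidate for the next target word (Eqs. 2-3)
        s_agent, a = policy(s_agent, o)        # READ or WRITE
        trace.append((o, a))
        if a == "READ" and eta < len(source):
            H.append(encode_next(H[-1], source[eta]))   # extend the prefix H^eta
            eta += 1
        else:                                  # WRITE: commit the candidate
            z, y = cand_z, cand_y
            out.append(y)
            if y == EOS:
                break
    return out, trace
```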

4 Learning

The proposed framework can be trained using reinforcement learning. More precisely, we use a policy gradient algorithm together with variance reduction and regularization techniques.

4.1 Pre-training

We need an NMT environment for the agent to explore and use to generate translations. Here, we simply pre-train the NMT encoder-decoder on full sentence pairs with maximum likelihood, and assume the pre-trained model is still able to generate reasonable translations even on incomplete source sentences. Although this is likely sub-optimal, our NMT environment based on uni-directional RNNs can treat incomplete source sentences in a manner similar to shorter source sentences and has the potential to translate them more-or-less correctly.

4.2 Reward Function

The policy is learned in order to increase a reward for the translation. At each step the agent will receive a reward signal r_t based on (o_t, a_t). To evaluate a good simultaneous machine translation, the reward must consider both quality and delay.

Quality  We evaluate the translation quality using metrics such as BLEU (Papineni et al., 2002). The BLEU score is defined as the weighted geometric average of the modified n-gram precisions BLEU_0, multiplied by the brevity penalty BP to punish short translations. In practice, the vanilla BLEU score is not a good metric at the sentence level because, being a geometric average, the score reduces to zero if one of the precisions is zero. To avoid this, we use a smoothed version of BLEU in our implementation (Lin and Och, 2004).



BLEU(Y, Y*) = BP · BLEU_0(Y, Y*),    (5)

where Y* is the reference and Y is the output. We decompose BLEU and use the difference of partial BLEU scores as the reward, that is:

r^Q_t = ∆BLEU_0(Y, Y*, t)   if t < T,
r^Q_t = BLEU(Y, Y*)         if t = T.    (6)

where Y^t is the cumulative output at step t (Y^0 = ∅), and ∆BLEU_0(Y, Y*, t) = BLEU_0(Y^t, Y*) − BLEU_0(Y^{t−1}, Y*). Obviously, if a_t = READ, no new words are written into Y, yielding r^Q_t = 0. Note that we do not multiply by BP until the end of the sentence, as it would heavily penalize partial translation results.
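The decomposed quality reward of Eq. 6 can be sketched as follows; `partial_bleu` is a hypothetical stand-in for the smoothed BLEU_0 of Lin and Och (2004), and `final_bleu` for the full BP-adjusted score:

```python
# Sketch of the quality reward r^Q_t (Eq. 6). partial_bleu / final_bleu are
# hypothetical stand-ins for smoothed BLEU_0 and the BP-adjusted BLEU respectively.
def quality_rewards(prefixes, reference, partial_bleu, final_bleu):
    """prefixes[t] is the cumulative output Y^t after step t (prefixes[0] == [])."""
    rewards = []
    T = len(prefixes) - 1
    for t in range(1, T + 1):
        if t < T:
            # difference of partial BLEU_0 scores; READ steps add no words, so the delta is 0
            rewards.append(partial_bleu(prefixes[t], reference)
                           - partial_bleu(prefixes[t - 1], reference))
        else:
            rewards.append(final_bleu(prefixes[T], reference))  # full BLEU only at the end
    return rewards
```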

Delay  As another critical feature, delay judges how much time is wasted waiting for the translation. Ideally we would directly measure the actual time delay incurred by waiting for the next word. For simplicity, however, we suppose that listening for one more word always consumes the same amount of time. We define two measurements, global and local, respectively:

• Average Proportion (AP): following the definition in (Cho and Esipova, 2016), where X, Y are the source and decoded sequences respectively, and s(τ) denotes the number of source words that have been waited for when decoding word y_τ:

  0 < d(X, Y) = (1 / (|X| |Y|)) Σ_τ s(τ) ≤ 1
  d_t = 0  if t < T,   d_t = d(X, Y)  if t = T.    (7)

  d is a global delay metric, which defines the average waiting proportion of the source sentence when translating each word.

• Consecutive Wait length (CW): in speech translation, listeners are also concerned with long silences during which no translation occurs. To capture this, we also consider how many words were waited for (READ) consecutively between translating two words. For each action, where we initially define c_0 = 0:

  c_t = c_{t−1} + 1  if a_t = READ,   c_t = 0  if a_t = WRITE.    (8)

• Target Delay: we further define a "target delay" for both d and c, denoted d* and c* respectively, as different simultaneous translation applications may have different requirements on delay. In our implementation, the reward function for delay is written as:

  r^D_t = α · [sgn(c_t − c*) + 1] + β · ⌊d_t − d*⌋+    (9)

  where α ≤ 0, β ≤ 0. (A small sketch of these delay measures and this reward follows the list.)
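The following sketch computes the two delay measures and the delay reward for an interleaved action sequence; the coefficient values are illustrative, and ⌊x⌋+ in Eq. 9 is taken here as the positive part max(0, x), which is an assumption about the notation:

```python
# Sketch of the delay measures (Eqs. 7-8) and the delay reward (Eq. 9).
def sign(x):
    return (x > 0) - (x < 0)

def delay_rewards(actions, alpha=-0.5, beta=-0.1, c_star=2, d_star=0.5):
    Ts = sum(a == "READ" for a in actions)    # |X|
    Tt = sum(a == "WRITE" for a in actions)   # |Y|
    c, read_so_far, ap_sum, out = 0, 0, 0, []
    for t, a in enumerate(actions, start=1):
        if a == "READ":
            c += 1                            # consecutive wait length (Eq. 8)
            read_so_far += 1
        else:
            c = 0
            ap_sum += read_so_far             # s(tau): source words read before writing y_tau
        d = ap_sum / (Ts * Tt) if t == len(actions) else 0.0   # AP counted at the end (Eq. 7)
        # Eq. 9, assuming the positive-part reading of the second term
        r_d = alpha * (sign(c - c_star) + 1) + beta * max(0.0, d - d_star)
        out.append(r_d)
    return out

print(delay_rewards(["READ", "READ", "WRITE", "READ", "WRITE"]))
```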

Trade-off between quality and delay  A good simultaneous translation system requires balancing the trade-off between translation quality and time delay. Obviously, achieving the best translation quality and the shortest translation delay is in a sense contradictory. In this paper, the trade-off is achieved by balancing the rewards r_t = r^Q_t + r^D_t provided to the system, that is, by adjusting the coefficients α, β and the target delays d*, c* in Eq. 9.

4.3 Reinforcement Learning

Policy Gradient  We freeze the pre-trained parameters of the NMT model, and train the agent using the policy gradient (Williams, 1992). The policy gradient maximizes the expected cumulative future reward J = E_{π_θ}[Σ_{t=1}^{T} r_t], whose gradient is

∇_θ J = E_{π_θ}[ Σ_{t′=1}^{T} ∇_θ log π_θ(a_{t′} | ·) R_{t′} ]    (10)

where R_t = Σ_{k=t}^{T} [r^Q_k + r^D_k] is the cumulative future reward for the current observation and action. In practice, Eq. 10 is estimated by sampling multiple action trajectories from the current policy π_θ and collecting the corresponding rewards.

Variance Reduction  Directly using the policy gradient suffers from high variance, which makes learning unstable and inefficient. We thus employ the variance reduction techniques suggested by Mnih and Gregor (2014). We subtract from R_t the output of a baseline network b_ϕ to obtain R̂_t = R_t − b_ϕ(o_t), and then center and re-scale the reward as R̃_t = (R̂_t − b) / √(σ² + ε), with a running average b and standard deviation σ. The baseline network is trained to minimize the squared loss:

L_ϕ = E_{π_θ}[ Σ_{t=1}^{T} ‖R_t − b_ϕ(o_t)‖² ]    (11)

We also regularize the negative entropy of the policy to facilitate exploration.
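A sketch of the variance-reduction step (baseline subtraction, centering and re-scaling, and the baseline loss of Eq. 11) is shown below; the linear baseline and the running statistics are illustrative assumptions:

```python
import numpy as np

# Sketch of the variance reduction applied before the policy-gradient update.
rng = np.random.RandomState(3)
d_obs = 40
w_baseline = np.zeros(d_obs)            # a linear baseline b_phi, purely for illustration
running_mean, running_std = 0.0, 1.0

def reduce_variance(returns, observations, momentum=0.9, eps=1e-8):
    """Subtract the baseline b_phi(o_t), then center and re-scale with running stats."""
    global running_mean, running_std
    b_phi = observations @ w_baseline                  # b_phi(o_t)
    centered = returns - b_phi                         # R_hat_t = R_t - b_phi(o_t)
    running_mean = momentum * running_mean + (1 - momentum) * centered.mean()
    running_std = momentum * running_std + (1 - momentum) * centered.std()
    scaled = (centered - running_mean) / np.sqrt(running_std ** 2 + eps)
    baseline_loss = np.mean((returns - b_phi) ** 2)    # Eq. 11, minimized w.r.t. phi
    return scaled, baseline_loss

R = rng.randn(6)              # cumulative future rewards R_t for one trajectory
O = rng.randn(6, d_obs)       # corresponding observations o_t
R_tilde, L_phi = reduce_variance(R, O)
```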


Algorithm 2 Learning with Policy Gradient

Require: NMT system φ, agent θ, baseline ϕ
1: Pre-train the NMT system φ using MLE;
2: Initialize the agent θ;
3: while stopping criterion fails do
4:   Obtain a translation pair: {(X, Y*)};
5:   for (Y, S) ∼ Simultaneous Decoding do
6:     for (o_t, a_t) in S do
7:       Compute the quality reward: r^Q_t;
8:       Compute the delay reward: r^D_t;
9:       Compute the baseline: b_ϕ(o_t);
10:  Collect the future rewards: {R_t};
11:  Perform variance reduction: {R̃_t};
12:  Update: θ ← θ + λ_1 ∇_θ [J − κ H(π_θ)]
13:  Update: ϕ ← ϕ − λ_2 ∇_ϕ L_ϕ

The overall learning algorithm is summarized in Algorithm 2. For efficiency, instead of updating with stochastic gradient descent (SGD) on a single sentence, both the agent and the baseline are optimized using a minibatch of multiple sentences.
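Putting the pieces together, one mini-batch update of Algorithm 2 might look like the following sketch; `decode`, `reward_fn`, `reduce_variance`, `policy_grad` and `baseline_grad` are hypothetical hooks standing in for the components described above, not the released implementation:

```python
# Sketch of one mini-batch update of Algorithm 2 (the NMT environment stays frozen).
def cumulative_future_rewards(r_quality, r_delay):
    """R_t = sum_{k=t}^{T} (r^Q_k + r^D_k), computed right to left."""
    total, out = 0.0, []
    for q, d in reversed(list(zip(r_quality, r_delay))):
        total += q + d
        out.append(total)
    return out[::-1]

def train_step(batch, agent, baseline, decode, reward_fn, reduce_variance,
               policy_grad, baseline_grad, lr_agent=1e-4, lr_baseline=1e-3,
               entropy_coef=0.01, n_samples=5):
    for X, Y_ref in batch:
        for _ in range(n_samples):                    # sample several trajectories per pair
            Y, trace = decode(X, agent)               # (o_t, a_t) pairs from Algorithm 1
            r_q, r_d = reward_fn(Y, Y_ref, trace)     # Eqs. 6 and 9
            R = cumulative_future_rewards(r_q, r_d)
            R_tilde, loss_b = reduce_variance(R, trace)
            # ascent on J - kappa * H(pi_theta); descent on the baseline loss (Eq. 11)
            agent.params += lr_agent * policy_grad(trace, R_tilde, entropy_coef)
            baseline.params -= lr_baseline * baseline_grad(loss_b)
```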

5 Simultaneous Beam Search

In previous sections we described a simultaneous greedy decoding algorithm. In standard NMT it has been shown that beam search, where the decoder keeps a beam of k translation trajectories, greatly improves translation quality (Sutskever et al., 2014), as shown in Fig. 3 (A).

It is non-trivial to directly apply beam search in simultaneous machine translation, as beam search waits until the last word to write down the translation. Based on our assumption that WRITE does not incur delay, we can perform a simultaneous beam search when the agent chooses to WRITE consecutively: we keep multiple beams of translation trajectories in a temporary buffer and output the best path when the agent switches to READ. As shown in Fig. 3 (B) & (C), this searches for a relatively better path while keeping the delay unchanged.

Note that we do not re-train the agent for simultaneous beam search. At each step we simply input the observation of the current best trajectory into the agent for making the next decision.
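A sketch of the search restricted to one consecutive run of WRITEs follows; `expand` is a hypothetical hook returning scored one-token continuations of a partial translation, and `agent_says_write` queries the agent with the current best trajectory only, as described above:

```python
# Sketch of beam search within a single consecutive WRITE segment.
def write_segment_beam(prefix, score, expand, agent_says_write, beam_size=5, max_len=50):
    """Search inside one WRITE segment; the delay is unchanged because no READ
    happens until the best path is emitted."""
    beams = [(score, prefix)]
    while True:
        best_score, best_prefix = max(beams, key=lambda b: b[0])
        if not agent_says_write(best_prefix) or len(best_prefix) >= max_len:
            return best_score, best_prefix           # agent switches to READ: emit best path
        candidates = []
        for s, p in beams:
            for token, logp in expand(p, beam_size): # hypothetical scorer over next tokens
                candidates.append((s + logp, p + [token]))
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam_size]
```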

6 Experiments

6.1 Settings

Dataset  To extensively study the proposed simultaneous translation model, we train and evaluate it on two different language pairs, "English-German (EN-DE)" and "English-Russian (EN-RU)", in both directions.

Figure 3: Illustrations of (A) beam search, (B) simultaneous greedy decoding and (C) simultaneous beam search.

We use the parallel corpora available from WMT'15² for both pre-training the NMT environment and learning the policy. We utilize newstest-2013 as the validation set to evaluate the proposed algorithm. Both the training set and the validation set are tokenized and segmented into sub-word units with byte-pair encoding (BPE) (Sennrich et al., 2015). We only use sentence pairs where both sides are less than 50 BPE subword symbols long for training.

Environment & Agent Settings  We pre-trained the NMT environments for both language pairs and both directions following the same settings as (Cho and Esipova, 2016). We further built our agents, using a recurrent policy with 512 GRUs and a softmax function to produce the action distribution. All our agents are trained with policy gradient using the Adam optimizer (Kingma and Ba, 2014), with a mini-batch size of 10. For each sentence pair in a batch, 5 trajectories are sampled. For testing, instead of sampling we pick the action with the highest probability at each step.

Baselines  We compare the proposed methods against previously proposed baselines. For fair comparison, we use the same NMT environment:

• Wait-Until-End (WUE): an agent that starts to WRITE only when the last source word is seen. In general, we expect this to achieve the best quality of translation. We perform both greedy decoding and beam search with this method.

• Wait-One-Step (WOS): an agent that WRITEs after each READ. Such a policy is problematic when the source and target language pairs have different word orders or lengths (e.g. EN-DE).

² http://www.statmt.org/wmt15/


[Figure 4 plots (a) BLEU, (b) AP and (c) CW on the EN → RU validation set against training mini-batches (×1000), for target delays AP ∈ {0.3, 0.5, 0.7} and CW ∈ {2.0, 5.0, 8.0}, together with the Wait-Until-End and Wait-One-Step baselines.]

Figure 4: Learning progress curves for various delay targets on the validation dataset for EN → RU. In each run we only keep one target for one delay measure; for instance, when using the AP target, the coefficient α in Eq. 9 is set to 0.

[Figure 5 panels: (a) EN→RU, (b) RU→EN, (c) EN→DE, (d) DE→EN; x-axis: Average Proportion, y-axis: BLEU.]

Figure 5: Delay (AP) vs. BLEU for all four language pair-directions. The point-pairs shown are the results of simultaneous greedy decoding and beam search (beam size = 5), respectively, with models trained for various delay targets (CW = 8, 5, 2; AP = 0.3, 0.5, 0.7). For each target, we select the model that maximizes the quality-to-delay ratio (BLEU/AP) on the validation set. The baselines are also plotted (WOS, WUE, WID, WIW).

• Wait-If-Worse/Wait-If-Diff (WIW/WID): as proposed by Cho and Esipova (2016), the algorithm first pre-READs the next source word, and accepts this READ when the probability of the most likely target word decreases (WIW), or when the most likely target word changes (WID) (see the sketch after this list).

• Segmentation-based (SEG) (Oda et al., 2014): a state-of-the-art segmentation-based algorithm that optimizes the segmentation to achieve the highest quality score. In this paper, we tried the simple greedy method (SEG1) and the greedy method with POS constraints (SEG2).
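For reference, the two hand-crafted WIW/WID criteria can be sketched as a single decision rule; `best_word_prob` is a hypothetical hook returning the most likely next target word and its probability given the source prefix read so far:

```python
# Sketch of the Wait-If-Worse / Wait-If-Diff criteria of Cho and Esipova (2016).
def should_read(best_word_prob, src_prefix, next_src_word, mode="WIW"):
    word_now, prob_now = best_word_prob(src_prefix)
    word_peek, prob_peek = best_word_prob(src_prefix + [next_src_word])
    if mode == "WIW":
        return prob_peek < prob_now    # accept the READ if the best word became less likely
    return word_peek != word_now       # WID: accept the READ if the best word changed
```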

6.2 Quantitative Analysis

In order to evaluate the effectiveness of our reinforcement learning algorithm with different reward functions, we vary the target delays d* ∈ {0.3, 0.5, 0.7} and c* ∈ {2, 5, 8} in Eq. 9 separately, and train agents with α and β adjusted to values that provided stable learning for each language pair according to the validation set.

Learning Curves  As shown in Fig. 4, we plot the learning progress for EN-RU translation with different target settings. It clearly shows that our algorithm effectively increases translation quality for all the models, while pushing the delay close, if not all of the way, to the target value. It can also be noted from Fig. 4 (a) and (b) that there exists a strong correlation between the two delay measures, implying the agent can learn to decrease both AP and CW simultaneously.



Figure 6: Delay (CW) vs. BLEU score for EN → RU, for models trained with delay targets CW = 8, 5, 2 and AP = 0.3, 0.5, 0.7, against the baselines (WOS, WUE, SEG1, SEG2).


Quality vs. Delay  As shown in Fig. 5, the trade-off between translation quality and delay behaves very similarly across both language pairs and directions. The smaller the delay (AP or CW) the learning algorithm targets, the lower the quality (BLEU score) of the output translation. It is also interesting to observe that it is more difficult for "→EN" translation to achieve a low AP target while maintaining good quality, compared to "EN→". In addition, the models optimized on AP tend to perform better than those optimized on CW, especially in "→EN" translation. German and Russian sentences tend to be longer than English ones, hence requiring more consecutive waits before the next English symbol can be emitted.

vs. Baselines  In Fig. 5 and 6, the points closer to the upper-left corner achieve a better trade-off. Compared to WUE and WOS, which can ideally achieve the best quality (but the worst delay) and the best delay (but poor quality) respectively, all of our proposed models find a good balance between quality and delay. Some of the proposed models achieve BLEU scores close to WUE while having much smaller delay.

Compared to the method of Cho and Esipova (2016) based on two hand-crafted rules (WID, WIW), in most cases our proposed models find better trade-off points, with a few exceptions. We also observe that the baseline models have trouble controlling the delay within a reasonable range. In contrast, by optimizing towards a given target delay, our proposed model is stable while maintaining good translation quality.

We also compared against Oda et al. (2014)'s state-of-the-art segmentation algorithm (SEG). As shown in Fig. 6, it is clear that although SEG can work with various segmentation lengths (CW), the proposed model outputs high-quality translations at a much smaller CW. We conjecture that this is due to the independence assumption in SEG, while the RNNs and attention mechanism in our model make it possible to look at the whole history when deciding each translated word.

w/ Beam-Search  We also plot the results of simultaneous beam search instead of greedy decoding. It is clear from Fig. 5 and 6 that most of the proposed models achieve a visible increase in quality together with a slight increase in delay. This is because beam search can help to avoid bad local minima. We also observe that simultaneous beam search does not bring as much improvement as it does in the standard NMT setting. In most cases, the smaller the delay the model achieves, the less beam search helps, as it requires longer consecutive WRITE segments for extensive search to be necessary. One possible solution is to consider the beam uncertainty in the agent's READ/WRITE decisions. We leave this to future work.

6.3 Qualitative Analysis

In this section, we perform a more in-depth analysis using examples from both the EN-RU and EN-DE pairs, in order to have a deeper understanding of the proposed algorithm and its remaining limitations. We only perform greedy decoding to simplify visualization.

EN→RU  As shown in Fig. 8, since both English and Russian are Subject-Verb-Object (SVO) languages, the corresponding words may share the same order in both languages, which makes simultaneous translation easier. It is clear that the larger the target delay (AP or CW) is set, the more words are read before translating the corresponding words, which in turn results in better translation quality. We also note that a very early WRITE commonly causes bad translation. For example, for AP=0.3 & CW=2, both models choose to WRITE the word "The" at the very beginning, which is unreasonable since Russian has no articles and there is no word corresponding to it. One good feature of using NMT is that the more words the decoder READs, the longer the history that is kept, rendering simultaneous translation easier.


[Figure 7 panels: (a) Simultaneous Neural Machine Translation, (b) Neural Machine Translation.]

Figure 7: Comparison of DE→EN examples using the proposed framework and the standard NMT system, respectively. Both heat-maps share the same layout as Fig. 1. The verb "gedeckt" is incorrectly translated in simultaneous translation.

[Figure 8 table: columns Source | AP=0.3 | AP=0.7 | CW=2 | CW=8, showing the Russian outputs emitted after each English source word is read. Summary row: BLEU=39/AP=0.46, BLEU=64/AP=0.77, BLEU=54/CW=1.76, BLEU=64/CW=2.55.]

Figure 8: Given the example input sentence (leftmost column), we show the outputs of models trained for various delay targets. For these outputs, each row corresponds to one source word and shows the words (possibly none) emitted after reading this word. The corresponding source and target words are in the same color for all model outputs.


DE→EN  As shown in Fig. 1 and 7 (a), where we visualize the attention weights as a soft alignment between the progressive input and output sentences, the highest weights lie basically along the diagonal. This indicates that our simultaneous translator works by waiting for enough source words with high alignment weights and then switching to write them.

DE-EN translation is likely more difficult, as German frequently uses Subject-Object-Verb (SOV) constructions. As shown in Fig. 1, when a sentence (or a clause) starts, the agent has learned a policy of READing multiple steps to approach the verb (e.g. serviert and gestorben in Fig. 1).


Such a policy is still limited when the verb is very far from the subject. For instance, in Fig. 7, the simultaneous translator achieves almost the same translation as standard NMT except for the verb "gedeckt", which corresponds to "covered" in the NMT output. Since there are too many words between the verb "gedeckt" and the subject "Kosten für die Kampagne werden", the agent gives up reading (otherwise it would cause a large delay and a penalty) and WRITEs "being paid" based on the decoder's hypothesis. This is one of the limitations of the proposed framework, as the NMT environment is trained on complete source sentences, and it may be difficult to predict a verb that has not yet been seen in the source sentence. One possible remedy is to fine-tune the NMT model on incomplete sentences to boost its prediction ability. We leave this as future work.

7 Related Work

Researchers commonly consider the problem of simultaneous machine translation in the scenario of real-time speech interpretation (Fügen et al., 2007; Bangalore et al., 2012; Fujita et al., 2013; Rangarajan Sridhar et al., 2013; Yarmohammadi et al., 2013). In this approach, the incoming speech stream to be translated is first recognized and segmented based on an automatic speech recognition (ASR) system. The translation model then works independently on each of these segments, potentially limiting the quality of translation. To avoid using a fixed segmentation algorithm, Oda et al. (2014) introduced a trainable segmentation component into their system, so that the segmentation leads to better translation quality. Grissom II et al. (2014) proposed a similar framework, however based on reinforcement learning. All these methods still rely on translating each segment independently, without previous context.

Recently, two research groups have tried to apply the NMT framework to the simultaneous translation task. Cho and Esipova (2016) proposed a similar waiting process; however, their waiting criterion is manually defined without learning. Satija and Pineau (2016) proposed a method similar to ours in overall concept, but it differs significantly from our proposed method in many details. The biggest difference is that they proposed to use an agent that passively reads a new word at each step. Because of this, it cannot consecutively decode multiple steps, rendering beam search difficult. In addition, they lack a comparison to any existing approaches. On the other hand, we perform an extensive experimental evaluation against state-of-the-art baselines, demonstrating the relative utility both quantitatively and qualitatively.

The proposed framework is also related to some recent efforts in online sequence-to-sequence (SEQ2SEQ) learning. Jaitly et al. (2015) proposed a SEQ2SEQ ASR model that takes fixed-sized segments of the input sequence and outputs tokens based on each segment in real-time. It is trained with alignment information using supervised learning. A similar idea for online ASR was proposed by Luo et al. (2016). Similar to Satija and Pineau (2016), they also used reinforcement learning to decide whether to emit a token while reading a new input at each step. Although sharing some similarities, ASR is very different from simultaneous MT, with a more intuitive definition of segmentation. In addition, Yu et al. (2016) recently proposed an online alignment model to help sentence compression and morphological inflection. They regarded the alignment between the input and output sequences as a hidden variable, and performed transitions over the input and output sequences. By contrast, the proposed READ and WRITE actions do not necessarily need to be performed on aligned words (e.g. in Fig. 1), and are learned to balance the trade-off between quality and delay.

8 Conclusion

We propose a unified framework for neural simultaneous machine translation. To trade off quality and delay, we extensively explore various targets for delay and design a method for beam search applicable in the simultaneous MT setting. Experiments against state-of-the-art baselines on two language pairs demonstrate the efficacy of the framework both quantitatively and qualitatively.

Acknowledgments

KC acknowledges the support of Facebook, Google (Google Faculty Award 2016) and NVIDIA (GPU Center of Excellence 2015-2016). GN acknowledges the support of the Microsoft CORE program. This work was also partly supported by Samsung Electronics (Project: "Development and Application of Larger-Context Neural Machine Translation").


References

[Bahdanau et al.2014] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[Bangalore et al.2012] Srinivas Bangalore, Vivek Kumar Rangarajan Sridhar, Prakash Kolan, Ladan Golipour, and Aura Jimenez. 2012. Real-time incremental speech-to-speech translation of dialogs. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 437–445. Association for Computational Linguistics.

[Cho and Esipova2016] Kyunghyun Cho and Masha Esipova. 2016. Can neural machine translation do simultaneous translation? arXiv preprint arXiv:1606.02012.

[Fügen et al.2007] Christian Fügen, Alex Waibel, and Muntsin Kolss. 2007. Simultaneous translation of lectures and speeches. Machine Translation, 21(4):209–252.

[Fujita et al.2013] Tomoki Fujita, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2013. Simple, lexicalized choice of translation timing for simultaneous speech translation. In INTERSPEECH.

[Grissom II et al.2014] Alvin Grissom II, He He, Jordan Boyd-Graber, John Morgan, and Hal Daumé III. 2014. Don't until the final verb wait: Reinforcement learning for simultaneous machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1342–1352, Doha, Qatar, October. Association for Computational Linguistics.

[Jaitly et al.2015] Navdeep Jaitly, Quoc V. Le, Oriol Vinyals, Ilya Sutskever, and Samy Bengio. 2015. An online sequence-to-sequence model using partial conditioning. arXiv preprint arXiv:1511.04868.

[Kingma and Ba2014] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[Lin and Och2004] Chin-Yew Lin and Franz Josef Och. 2004. Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, page 605. Association for Computational Linguistics.

[Luo et al.2016] Yuping Luo, Chung-Cheng Chiu, Navdeep Jaitly, and Ilya Sutskever. 2016. Learning online alignments with continuous rewards policy gradient. arXiv preprint arXiv:1608.01281.

[Mieno et al.2015] Takashi Mieno, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Speed or accuracy? A study in evaluation of simultaneous speech translation. In INTERSPEECH.

[Mnih and Gregor2014] Andriy Mnih and Karol Gregor. 2014. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030.

[Oda et al.2014] Yusuke Oda, Graham Neubig, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2014. Optimizing segmentation strategies for simultaneous speech translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 551–556, Baltimore, Maryland, June. Association for Computational Linguistics.

[Papineni et al.2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318. Association for Computational Linguistics.

[Rangarajan Sridhar et al.2013] Vivek Kumar Rangarajan Sridhar, John Chen, Srinivas Bangalore, Andrej Ljolje, and Rathinavelu Chengalvarayan. 2013. Segmentation strategies for streaming speech translation. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 230–238, Atlanta, Georgia, June. Association for Computational Linguistics.

[Satija and Pineau2016] Harsh Satija and Joelle Pineau. 2016. Simultaneous machine translation using deep reinforcement learning. Abstraction in Reinforcement Learning Workshop, ICML 2016.

[Sennrich et al.2015] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

[Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

[Williams1992] Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.

[Yarmohammadi et al.2013] Mahsa Yarmohammadi, Vivek Kumar Rangarajan Sridhar, Srinivas Bangalore, and Baskaran Sankaran. 2013. Incremental segmentation and decoding strategies for simultaneous translation. In IJCNLP, pages 1032–1036.

[Yu et al.2016] Lei Yu, Jan Buys, and Phil Blunsom. 2016. Online segment to segment neural transduction. arXiv preprint arXiv:1609.08194.