Neural Machine Translation Basics II
Rico Sennrich
University of Edinburgh
R. Sennrich Neural Machine Translation Basics II 1 / 56
Modelling Translation
Suppose that we have:
- a source sentence S of length m: (x_1, ..., x_m)
- a target sentence T of length n: (y_1, ..., y_n)

We can express translation as a probabilistic model:

T* = argmax_T p(T|S)

Expanding using the chain rule gives:

p(T|S) = p(y_1, ..., y_n | x_1, ..., x_m)
       = ∏_{i=1}^{n} p(y_i | y_1, ..., y_{i-1}, x_1, ..., x_m)
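The chain-rule factorization above can be sketched in a few lines: log p(T|S) is the sum of per-token conditional log-probabilities. The `cond_prob` function below is a hypothetical stand-in for a trained model's softmax output, not an actual NMT system.

```python
import math

def score_translation(source, target, cond_prob):
    """Return log p(T|S) = sum_i log p(y_i | y_<i, S)."""
    log_p = 0.0
    for i, y in enumerate(target):
        # conditional probability of token y given the target prefix and the source
        log_p += math.log(cond_prob(source, target[:i], y))
    return log_p

# hypothetical conditional distribution, for demonstration only
def cond_prob(source, prefix, y):
    return 0.5  # pretend every token has probability 0.5

src = ["Ich", "liebe", "dich"]
tgt = ["I", "love", "you", "</s>"]
print(score_translation(src, tgt, cond_prob))  # 4 * log(0.5)
```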
Language Model for Translation?
[Diagram: RNN with inputs x_1..x_3 = "Ich liebe dich", hidden states h_1..h_3, and outputs y_1..y_3 = "I love you"]
Language Model for Translation?
[Diagram: RNN with inputs x_1..x_3 = "Je t' aime", hidden states h_1..h_3, and outputs y_1..y_3 = "I love you"]
Language Model for Translation?
[Diagram: RNN with inputs x_1..x_2 = "Te quiero", hidden states h_1..h_2, and outputs y_1..y_2 = "I love" — source and target lengths need not match]
Language Model for Translation?
[Diagram: RNN language model over the concatenated sequence; inputs x_1..x_8 = "<s> Ich liebe dich <translate> I love you", outputs y_1..y_8 = "Ich liebe dich <translate> I love you </s>"]
Differences Between Translation and Language Model
Target-side language model:

p(T) = ∏_{i=1}^{n} p(y_i | y_1, ..., y_{i-1})

Translation model:

p(T|S) = ∏_{i=1}^{n} p(y_i | y_1, ..., y_{i-1}, x_1, ..., x_m)

We could just treat the sentence pair as one long sequence, but:
- we do not care about p(S)
- we may want a different vocabulary and network architecture for the source text

→ Use separate RNNs for source and target.
Encoder-Decoder for Translation
[Diagram: encoder RNN (h_1..h_4) reads "natürlich hat john spaß"; its final state initializes the decoder RNN (s_1..s_5), which generates "of course john has fun"]
Summary vector
The last encoder hidden state "summarises" the source sentence.

With multilingual training, we can potentially learn a language-independent meaning representation.

[Sutskever et al., 2014]
Neural Machine Translation Basics II
1 Attention Model
2 Decoding
3 Open-Vocabulary Neural Machine Translation
4 More on Architectures
  - Deep Networks
  - Layer Normalization
  - NMT with Self-Attention
Summary vector as information bottleneck
Problem: sentence length
- a fixed-size representation degrades as sentence length increases [Cho et al., 2014]
- reversing the source brings some improvement [Sutskever et al., 2014]

Solution: attention
- compute the context vector as a weighted average of the source hidden states
- weights are computed by a feed-forward network with softmax activation
Encoder-Decoder with Attention
[Diagram: the decoder additionally receives a context vector, a weighted average of the encoder states h_1..h_4 over "natürlich hat john spaß"; the attention weights (here 0.7, 0.1, 0.1, 0.1) are recomputed at each decoder step and shift across source positions as the decoder generates "of course john has fun"]
Attentional encoder-decoder: Maths
Simplifications of the model by [Bahdanau et al., 2015] (for illustration):
- plain RNN instead of GRU
- simpler output layer
- we do not show bias terms
- decoder follows the Look, Update, Generate strategy [Sennrich et al., 2017]

Details in https://github.com/amunmt/amunmt/blob/master/contrib/notebooks/dl4mt.ipynb

Notation:
- W, U, E, C, V are weight matrices (of different dimensionality):
  E: one-hot to embedding (e.g. 50000 × 512)
  W: embedding to hidden (e.g. 512 × 1024)
  U: hidden to hidden (e.g. 1024 × 1024)
  C: context (2× hidden) to hidden (e.g. 2048 × 1024)
  V_o: hidden to one-hot (e.g. 1024 × 50000)
- separate weight matrices for encoder and decoder (e.g. E_x and E_y)
- input X of length T_x; output Y of length T_y
Attentional encoder-decoder: Maths
Encoder:

→h_j = 0                                    if j = 0
→h_j = tanh(→W_x E_x x_j + →U_x →h_{j-1})   if j > 0

←h_j = 0                                    if j = T_x + 1
←h_j = tanh(←W_x E_x x_j + ←U_x ←h_{j+1})   if j ≤ T_x

h_j = (→h_j, ←h_j)
Attentional encoder-decoder: Maths
Decoder:

s_i = tanh(W_s ←h_1)                               if i = 0
s_i = tanh(W_y E_y y_{i-1} + U_y s_{i-1} + C c_i)  if i > 0

t_i = tanh(U_o s_i + W_o E_y y_{i-1} + C_o c_i)
y_i = softmax(V_o t_i)

Attention model:

e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
α_ij = softmax(e_ij)      (normalised over source positions j)
c_i = Σ_{j=1}^{T_x} α_ij h_j
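A minimal numeric sketch of the attention equations above, using NumPy. All weights are randomly initialized stand-ins and the dimensions are illustrative, not the 512/1024 sizes from the notation slide.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# dimensions: T_x source positions, encoder state size, decoder state size, attention size
Tx, dim_h, dim_s, dim_a = 4, 6, 5, 3
rng = np.random.default_rng(0)
H = rng.normal(size=(Tx, dim_h))        # encoder states h_1..h_Tx
s_prev = rng.normal(size=dim_s)         # previous decoder state s_{i-1}
W_a = rng.normal(size=(dim_a, dim_s))
U_a = rng.normal(size=(dim_a, dim_h))
v_a = rng.normal(size=dim_a)

# e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), for each source position j
e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in H])
alpha = softmax(e)                      # attention weights, sum to 1
c = alpha @ H                           # context vector c_i = sum_j alpha_ij h_j

print(alpha.sum())  # 1.0 (up to floating point)
```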
Attention model
Side effect: we obtain an "alignment" between source and target sentence.

Applications:
- visualisation
- replacing unknown words with a back-off dictionary [Jean et al., 2015]
- ...

Image: Kyunghyun Cho
http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
Attention is not alignment
[Figure 7 from the paper: BLEU scores with varying sentence length (source, subword count), comparing neural and phrase-based systems. SMT outperforms NMT for sentences longer than 60 subword tokens; for very long sentences (80+), NMT quality is much worse due to too-short output.]
3.5 Word Alignment
The key contribution of the attention model in neural machine translation (Bahdanau et al., 2015) was the imposition of an alignment of the output words to the input words. This takes the shape of a probability distribution over the input words which is used to weigh them in a bag-of-words representation of the input sentence.

Arguably, this attention model does not functionally play the role of a word alignment between the source and the target, at least not in the same way as its analog in statistical machine translation. While in both cases alignment is a latent variable that is used to obtain probability distributions over words or phrases, the attention model arguably has a broader role. For instance, when translating a verb, attention may also be paid to its subject and object, since these may disambiguate it. To further complicate matters, the word representations are products of bidirectional gated recurrent neural networks, which have the effect that each word representation is informed by the entire sentence context.

But there is a clear need for an alignment mechanism between source and target words. For instance, prior work used the alignments provided by the attention model to interpolate word translation decisions with traditional probabilistic dictionaries (Arthur et al., 2016), for the introduction of coverage and fertility models (Tu et al., 2016), etc.
But is the attention model in fact the proper means? To examine this, we compare the soft alignment matrix (the sequence of attention vectors) with word alignments obtained by traditional word alignment methods. We use incremental fast-align (Dyer et al., 2013) to align the input and output of the neural machine translation system.

[Figure 8 from the paper: Word alignment for English–German, comparing the attention model states (green boxes with probability in percent if over 10) with alignments obtained from fast-align (blue outlines), for the sentence pair "relations between Obama and Netanyahu have been strained for years." / "die Beziehungen zwischen Obama und Netanjahu sind seit Jahren angespannt."]
See Figure 8 for an illustration. We compare the word attention states (green boxes) with the word alignments obtained with fast-align (blue outlines). For most words, these match up pretty well. Both attention states and fast-align alignment points are a bit fuzzy around the function words have-been/sind.

However, the attention model may settle on alignments that do not correspond with our intuition or alignment points obtained with fast-align. See Figure 9 for the reverse language direction, German–English. All the alignment points appear to be off by one position. We are not aware of any intuitive explanation for this divergent behavior — the translation quality is high for both systems.

We measure how well the soft alignments (attention model) of the NMT system match the alignments of fast-align with two metrics:
• a match score that checks for each output word if the aligned input word according to fast-align is indeed the input word that received the highest attention probability, and

• a probability mass score that sums up the probability mass given to each alignment point obtained from fast-align.

[Figure 9 from the paper: Mismatch between attention states and desired word alignments (German–English), for the sentence pair "das Verhältnis zwischen Obama und Netanyahu ist seit Jahren gespannt." / "the relationship between Obama and Netanyahu has been stretched for years."]

[Koehn and Knowles, 2017]
In these scores, we have to handle byte pair encoding and many-to-many alignments.^11

In our experiment, we use the neural machine translation models provided by Edinburgh^12 (Sennrich et al., 2016a). We run fast-align on the same parallel data sets to obtain alignment models and use them to align the input and output of the NMT system. Table 3 shows alignment scores for the systems. The results suggest that, while drastic, the divergence for German–English is an outlier. We note, however, that we have seen such a large divergence also under different data conditions.

Note that the attention model may produce better word alignments by guided alignment training (Chen et al., 2016; Liu et al., 2016), where supervised word alignments (such as the ones produced by fast-align) are provided to model training.

^11 (1) NMT operates on subwords, but fast-align is run on full words. (2) If an input word is split into subwords by byte pair encoding, then we add their attention scores. (3) If an output word is split into subwords, then we take the average of their attention vectors. (4) The match scores and probability mass scores are computed as averages over output word-level scores. (5) If an output word has no fast-align alignment point, it is ignored in this computation. (6) If an output word is fast-aligned to multiple input words, then (6a) for the match score: count it as correct if the n aligned words are among the top n highest scoring words according to attention, and (6b) for the probability mass score: add up their attention scores.

^12 https://github.com/rsennrich/wmt16-scripts
Language Pair     Match   Prob.
German–English    14.9%   16.0%
English–German    77.2%   63.2%
Czech–English     78.0%   63.3%
English–Czech     76.1%   59.7%
Russian–English   72.5%   65.0%
English–Russian   73.4%   64.1%

Table 3: Scores indicating overlap between attention probabilities and alignments obtained with fast-align.
3.6 Beam Search
The task of decoding is to find the full sentence translation with the highest probability. In statistical machine translation, this problem has been addressed with heuristic search techniques that explore a subset of the space of possible translations. A common feature of these search techniques is a beam size parameter that limits the number of partial translations maintained per input word.

There is typically a straightforward relationship between this beam size parameter and the model score of resulting translations, and also their quality score (e.g., BLEU). While there are diminishing returns for increasing the beam parameter, improvements in these scores can typically be expected with larger beams.

Decoding in neural translation models can be set up in similar fashion. When predicting the next output word, we may not only commit to the highest-scoring word prediction but also maintain the next best scoring words in a list of partial translations. We record with each partial translation the word translation probabilities (obtained from the softmax), extend each partial translation with subsequent word predictions, and accumulate these scores. Since the number of partial translations explodes exponentially with each new output word, we prune them down to a beam of the highest-scoring partial translations.

As in traditional statistical machine translation decoding, increasing the beam size allows us to explore a larger part of the space of possible translations and hence find translations with better model scores.

However, as Figure 10 illustrates, increasing the beam size does not consistently improve translation quality. In fact, in almost all cases, worse translations are found beyond an optimal beam size setting (we are using again Edinburgh's WMT

[Koehn and Knowles, 2017]
Attention model
Question: how can the model translate correctly if the attention is off by one?

Answer: information can also flow along the recurrent connections, so there is no guarantee that attention corresponds to alignment.
What If You Want Attention to be Alignment-like?
One solution: guided alignment training [Chen et al., 2016]
1 compute alignments with an external tool (IBM models)
2 if multiple source words align to the same target word, normalize so that Σ_j A_ij = 1
3 modify the objective function of NMT training:
  - minimize target sentence cross-entropy (as before)
  - minimize divergence between the model attention α and the external alignment A:

  H(A, α) = −(1/T_y) Σ_{i=1}^{T_y} Σ_{j=1}^{T_x} A_ij log α_ij
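The guided-alignment cross-entropy H(A, α) translates directly into code. A and α below are toy matrices for a 2-word target and a 2-word source, chosen only for illustration.

```python
import numpy as np

def guided_alignment_loss(A, alpha, eps=1e-12):
    """H(A, alpha) = -(1/Ty) * sum_i sum_j A_ij * log(alpha_ij)."""
    Ty = A.shape[0]
    return -(A * np.log(alpha + eps)).sum() / Ty

A = np.array([[1.0, 0.0],      # target word 1 aligned to source word 1
              [0.0, 1.0]])     # target word 2 aligned to source word 2
alpha = np.array([[0.9, 0.1],  # model attention, rows sum to 1
                  [0.2, 0.8]])
print(guided_alignment_loss(A, alpha))
```

The loss is minimized (together with the usual cross-entropy) when the attention rows match the external alignment rows.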
Attention model
attention model also works with images:
[Cho et al., 2015]
Attention model
[Cho et al., 2015]
Neural Machine Translation Basics II
1 Attention Model
2 Decoding
3 Open-Vocabulary Neural Machine Translation
4 More on Architectures
  - Deep Networks
  - Layer Normalization
  - NMT with Self-Attention
Application of Encoder-Decoder Model
Scoring (a translation):
p(La, croissance, économique, s'est, ralentie, ces, dernières, années, . | Economic, growth, has, slowed, down, in, recent, years, .) = ?

Decoding (a source sentence):
generate the most probable translation of a source sentence

y* = argmax_y p(y | Economic, growth, has, slowed, down, in, recent, years, .)
Decoding
Exact search:
- generate every possible sentence T in the target language
- compute the score p(T|S) for each
- pick the best one

This is intractable: there are |vocab|^N translations for output length N.
→ we need an approximate search strategy
Decoding
Approximate search (1): greedy search
- at each time step, compute the probability distribution p(y_i | S, y_<i)
- select y_i according to some heuristic:
  - sampling: sample from p(y_i | S, y_<i)
  - greedy search: pick argmax_y p(y_i | S, y_<i)
- continue until we generate <eos>

[Diagram: greedy decoding of "hello world ! <eos>", with per-token probabilities 0.946, 0.957, 0.928, 0.999 and accumulated costs]

Greedy search is efficient, but suboptimal.
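Greedy search can be sketched as follows, assuming a `step` function that returns the model's distribution p(y_i | S, y_<i). The toy probability table is hypothetical, chosen to reproduce the "hello world !" example above.

```python
import numpy as np

def greedy_decode(source, step, eos, max_len=20):
    prefix = []
    for _ in range(max_len):
        p = step(source, prefix)   # p(y_i | S, y_<i)
        y = int(np.argmax(p))      # greedy: pick the most probable token
        prefix.append(y)
        if y == eos:
            break
    return prefix

# hypothetical 4-token vocabulary: 0=hello, 1=world, 2=!, 3=<eos>
def step(source, prefix):
    table = {0: [0.9, 0.05, 0.03, 0.02],   # first emit "hello"
             1: [0.05, 0.9, 0.03, 0.02],   # then "world"
             2: [0.02, 0.03, 0.9, 0.05],   # then "!"
             3: [0.01, 0.01, 0.01, 0.97]}  # then <eos>
    return np.array(table[min(len(prefix), 3)])

print(greedy_decode(None, step, eos=3))  # [0, 1, 2, 3]
```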
Decoding
Approximate search (2): beam search
- maintain a list of K hypotheses (the beam)
- at each time step, expand each hypothesis k: p(y_i^k | S, y_<i^k)
- select the K hypotheses with the highest total probability:

  ∏_i p(y_i^k | S, y_<i^k)

[Diagram: beam search with K = 3, expanding hypotheses such as "hello world !" and pruning to the three highest-scoring partial translations at each step]

- relatively efficient: beam expansion is parallelisable
- currently the default search strategy in neural machine translation
- a small beam (K ≈ 10) offers a good speed-quality trade-off
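The beam-search loop above can be sketched as follows: expand every hypothesis, score by accumulated log-probability, and prune to the K best. This is a bare sketch without length normalization; the `step` function is the same hypothetical toy distribution as in the greedy-search sketch.

```python
import numpy as np

def beam_search(source, step, eos, K=3, max_len=20):
    beam = [([], 0.0)]                     # (prefix, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, lp in beam:
            p = step(source, prefix)
            for y, py in enumerate(p):     # expand every hypothesis with every token
                candidates.append((prefix + [y], lp + np.log(py)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for prefix, lp in candidates[:K]:  # prune to the K best
            if prefix[-1] == eos:
                finished.append((prefix, lp))
            else:
                beam.append((prefix, lp))
        if not beam:
            break
    finished.extend(beam)                  # in case nothing reached <eos>
    return max(finished, key=lambda c: c[1])[0]

# hypothetical toy distribution, as in the greedy-search sketch
def step(source, prefix):
    table = {0: [0.9, 0.05, 0.03, 0.02],
             1: [0.05, 0.9, 0.03, 0.02],
             2: [0.02, 0.03, 0.9, 0.05],
             3: [0.01, 0.01, 0.01, 0.97]}
    return np.array(table[min(len(prefix), 3)])

print(beam_search(None, step, eos=3, K=3))  # [0, 1, 2, 3]
```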
Beam Search in Practice
- translation quality goes down with large beam sizes (!)
- pruning hypotheses to a small beam removes states that are improbable but have low entropy (label bias)
- models have a bias towards short translations
  → the cost is often normalized by length
- if the model starts copying input, future copying has high probability
[Figure 10 from the paper: BLEU with varying beam sizes (1 to 1,000), unnormalized and length-normalized, for Czech–English, English–Czech, German–English, English–German, Romanian–English, English–Romanian, Russian–English, and English–Russian. For large beams, quality decreases, especially when not normalizing scores by sentence length.]

[Koehn and Knowles, 2017]
Ensembles
- combine the decisions of multiple classifiers by voting
- an ensemble will reduce error if these conditions are met:
  - base classifiers are accurate
  - base classifiers are diverse (make different errors)
Ensembles in NMT
- vote at each time step to explore the same search space
  (better than decoding with one model and reranking an n-best list with the others)
- voting mechanism: typically average the (log-)probabilities:

  log P(y_i | S, y_<i) = (1/M) Σ_{m=1}^{M} log P_m(y_i | S, y_<i)

- requirements for voting at each time step:
  - same output vocabulary
  - same factorization of Y
  - but: internal network architecture may be different
- we still use reranking in some situations
  - example: combine left-to-right decoding and right-to-left decoding
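The per-step voting formula above is a one-liner in practice; the per-model distributions below are hypothetical stand-ins for the softmax outputs of M trained models.

```python
import numpy as np

def ensemble_log_probs(model_probs):
    """log P(y_i|S, y_<i) = (1/M) * sum_m log P_m(y_i|S, y_<i)."""
    logs = np.log(np.stack(model_probs))  # shape (M, vocab)
    return logs.mean(axis=0)

p1 = np.array([0.7, 0.2, 0.1])  # model 1's distribution over a 3-token vocabulary
p2 = np.array([0.5, 0.4, 0.1])  # model 2's distribution
avg = ensemble_log_probs([p1, p2])
print(int(np.argmax(avg)))  # token 0 wins under both models
```

Averaging log-probabilities corresponds to a geometric mean of the distributions; since search only compares scores, the result need not be renormalized.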
Neural Machine Translation Basics II
1 Attention Model
2 Decoding
3 Open-Vocabulary Neural Machine Translation
4 More on Architectures
  - Deep Networks
  - Layer Normalization
  - NMT with Self-Attention
Text Representation
How do we represent text in NMT?
- 1-hot encoding
  - lookup of word embedding for input
  - probability distribution over vocabulary for output
- large vocabularies
  - increase network size
  - decrease training and decoding speed
- typical network vocabulary size: 10,000–100,000 symbols

[Table: representation of "cat" — vocabulary index (the, cat, is, ..., mat), 1-hot vector (0, 1, 0, ..., 0), embedding (0.1, 0.3, 0.7, 0.5)]
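The equivalence between a 1-hot lookup and selecting a row of the embedding matrix can be checked in a few lines (toy vocabulary and random embedding values, for illustration only):

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "is": 2, "mat": 3}
# embedding matrix: one row per vocabulary entry, 4-dimensional embeddings
E = np.random.default_rng(0).normal(size=(len(vocab), 4))

one_hot = np.zeros(len(vocab))
one_hot[vocab["cat"]] = 1.0

# multiplying by a 1-hot vector is just a row lookup
assert np.allclose(one_hot @ E, E[vocab["cat"]])
```

This is why implementations use an index lookup rather than an actual matrix product.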
Problem
- translation is an open-vocabulary problem
- many training corpora contain millions of word types
- productive word formation processes (compounding; derivation) allow formation and understanding of unseen words
- names and numbers are morphologically simple, but open word classes
Non-Solution: Ignore Rare Words
- replace out-of-vocabulary words with UNK
- a vocabulary of 50,000 words covers 95% of text
- this gets you 95% of the way...
  ... if you only care about automatic metrics

Why 95% is not enough: rare outcomes have high self-information.

source                               The indoor temperature is very pleasant.
reference                            Das Raumklima ist sehr angenehm.
[Bahdanau et al., 2015]              Die UNK ist sehr angenehm. ✗
[Jean et al., 2015]                  Die Innenpool ist sehr angenehm. ✗
[Sennrich, Haddow, Birch, ACL 2016]  Die Innen+ temperatur ist sehr angenehm. ✓
Solution 1: Back-off Models
Back-off models [Jean et al., 2015, Luong et al., 2015]:
- replace rare words with UNK at training time
- when the system produces UNK, align the UNK to a source word, and translate this word with a back-off method

source                   The indoor temperature is very pleasant.
reference                Das Raumklima ist sehr angenehm.
[Bahdanau et al., 2015]  Die UNK ist sehr angenehm. ✗
[Jean et al., 2015]      Die Innenpool ist sehr angenehm. ✗

Limitations:
- compounds: hard to model 1-to-many relationships
- morphology: hard to predict inflection with a back-off dictionary
- names: if alphabets differ, we need transliteration
- alignment: the attention model is unreliable
Solution 2: Subword Models
MT is an open-vocabulary problem:
- compounding and other productive morphological processes
  they charge a carry-on bag fee.
  sie erheben eine Hand|gepäck|gebühr.
- names
  Obama (English; German)
  Обама (Russian)
  オバマ (o-ba-ma) (Japanese)
- technical terms, numbers, etc.
Subword units
Segmentation algorithms: wishlist
- open-vocabulary NMT: encode all words through a small vocabulary
- encoding generalizes to unseen words
- small text size
- good translation quality

Our experiments [Sennrich et al., 2016]: after preliminary experiments, we propose:
- character n-grams (with a shortlist of unsegmented words)
- segmentation via byte pair encoding (BPE)
Byte pair encoding for word segmentation
Bottom-up character merging:
- starting point: character-level representation
  → computationally expensive
- compress the representation based on information theory
  → byte pair encoding [Gage, 1994]
- repeatedly replace the most frequent symbol pair ('A', 'B') with 'AB'
- hyperparameter: when to stop
  → controls the vocabulary size

word               freq
'l o w</w>'        5
'l o w e r</w>'    2
'n e w e s t</w>'  6
'w i d e s t</w>'  3

initial vocabulary: l, o, w</w>, w, e, r</w>, n, s, t</w>, i, d
merges learned (in order): (e, s) → es; (es, t</w>) → est</w>; (l, o) → lo
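The BPE learning loop can be sketched in a few lines; on the slide's toy corpus it recovers the merges es, est</w>, lo. (A robust implementation would match whole symbols rather than raw substrings, as the reference subword-nmt code does; the simple `str.replace` below is fine for this toy corpus.)

```python
import collections

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with the merged symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in vocab.items()}

# the slide's corpus, with symbols separated by spaces
vocab = {"l o w</w>": 5, "l o w e r</w>": 2,
         "n e w e s t</w>": 6, "w i d e s t</w>": 3}
merges = []
for _ in range(3):  # number of merges is the stopping hyperparameter
    best = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)  # [('e', 's'), ('es', 't</w>'), ('l', 'o')]
```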
Byte pair encoding for word segmentation
Why BPE?
- open-vocabulary: operations learned on the training set can be applied to unknown words
- compression of frequent character sequences improves efficiency
  → trade-off between text length and vocabulary size

Applying the learned merges to the unseen word 'l o w e s t</w>':
  e s → es:           'l o w es t</w>'
  es t</w> → est</w>: 'l o w est</w>'
  l o → lo:           'lo w est</w>'
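Applying learned merges to a new word is just a replay of the merge list in the order it was learned, which is why the segmentation generalizes to words never seen in training:

```python
def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word[:-1]) + [word[-1] + "</w>"]  # characters + end-of-word marker
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]           # replace the pair with the merged symbol
            else:
                i += 1
    return symbols

# merge list from the slide's example
merges = [("e", "s"), ("es", "t</w>"), ("l", "o")]
print(apply_bpe("lowest", merges))  # ['lo', 'w', 'est</w>']
```

"lowest" never occurred in the toy training corpus, yet every resulting symbol is in the learned vocabulary.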
Subword NMT: Translation Quality
[Bar chart: BLEU scores for EN-DE and EN-RU]

system                                              EN-DE  EN-RU
word-level NMT (with back-off) [Jean et al., 2015]  22.0   19.1
subword-level NMT: BPE                              22.8   20.4
Subword NMT: Translation Quality
[Plot: NMT results EN-RU — unigram F1 (y-axis, 0 to 0.8) versus training set frequency rank (x-axis, 10^0 to 10^6, with vocabulary thresholds marked at 50,000 and 500,000). Systems: subword-level NMT (BPE), subword-level NMT (char bigrams), word-level (with back-off), word-level (no back-off).]
Examples
system                      sentence
source                      health research institutes
reference                   Gesundheitsforschungsinstitute
word-level (with back-off)  Forschungsinstitute
character bigrams           Fo|rs|ch|un|gs|in|st|it|ut|io|ne|n
BPE                         Gesundheits|forsch|ungsin|stitute

source                      rakfisk
reference                   ракфиска (rakfiska)
word-level (with back-off)  rakfisk → UNK → rakfisk
character bigrams           ra|kf|is|k → ра|кф|ис|к (ra|kf|is|k)
BPE                         rak|f|isk → рак|ф|иска (rak|f|iska)
Subword Models: Variants
- morphologically motivated subword units [Sánchez-Cartagena and Toral, 2016, Tamchyna et al., 2017, Huck et al., 2017, Pinnis et al., 2017]
- probabilistic segmentation and sampling [Kudo, 2018]
Solution 3: Character-level Models
Advantages:
- (mostly) open-vocabulary
- no heuristic or language-specific segmentation
- the neural network can learn from raw character sequences

Drawbacks:
- increased sequence length slows training/decoding (reported ×2–×8 increase in training time)
- memory requirement: trade-off between embedding matrix size and sequence length

Active research on specialized architectures [Ling et al., 2015, Luong and Manning, 2016, Chung et al., 2016, Lee et al., 2016]
Character-level NMT: Current Results (as of August 29, 2018)
[Figure 1 from the paper: Test BLEU for character and BPE translation as architectures scale from 1 BiLSTM encoder layer and 2 LSTM decoder layers (1×2+2) to the standard 6×2+8. The y-axis spans 6 BLEU points for each language pair.]

[Figure 2 from the paper: BLEU versus training corpus size in millions of sentence pairs, for the EnFr language pair.]

[Figure 3 from the paper: Training time per sentence versus total number of layers (encoder plus decoder) in the model.]
The systems differ most in the number of lexical choice errors, and in the extent to which they drop content. The latter is surprising, and appears to be a side-effect of a general tendency of the character models to be more faithful to the source, verging on being overly literal. An example of dropped content is shown in Table 5 (top).

Regarding lexical choice, the two systems differ not only in the number of errors, but in the nature of those errors. In particular, the BPE model had more trouble handling German compound nouns. Table 5 (bottom) shows an example which exhibits two compound errors, including one where the character system is a strict improvement, translating Bunsenbrenner into bunsen burner instead of bullets. The second error follows another common pattern, where both systems mishandle the German compound (Chemiestunden / chemistry lessons), but the character system fails in a more useful way.

Error Type        BPE  Char
Lexical Choice    19   8
  Compounds       13   1
  Proper Names    2    1
  Morphological   2    2
  Other lexical   2    4
Dropped Content   7    0

Table 4: Error counts out of 100 randomly sampled examples from the DeEn test set.

We also found that both systems occasionally mistranslate proper names. Both fail by attempting to translate when they should copy over, but the BPE system's errors are harder to understand, as they involve semantic translation, rendering Britta Hermann as Sir Leon, and Esme Nussbaum as smiling walnut.^9 The character system's one observed error in this category was phonetic rather than semantic, rendering Schotten as Scottland.

Interestingly, we also observed several instances where the model correctly translates the German 24-hour clock into the English 12-hour clock; for example, 19.30 becomes 7:30 p.m. This deterministic transformation is potentially in reach for both models, but we observed it only for the character system in this sample.

^9 The BPE segmentations for these names were: _Britta _Herr mann and _Es me _N uss baum
Figure 1: Test BLEU for character and BPE translation as architectures scale from 1 BiLSTM encoder layer and 2 LSTM decoder layers (1×2+2) to our standard 6×2+8. The y-axis spans 6 BLEU points for each language pair.
For deep LSTMs, char-level systems can achieve higher BLEU than BPE
...but training time (per sentence) is 8x higher
[Cherry et al., 2018]
R. Sennrich Neural Machine Translation Basics II 42 / 56
Neural Machine Translation Basics II
1 Attention Model
2 Decoding
3 Open-Vocabulary Neural Machine Translation
4 More on Architectures
  Deep Networks
  Layer Normalization
  NMT with Self-Attention
R. Sennrich Neural Machine Translation Basics II 43 / 56
Deep Networks
increasing model depth often increases model performance
example: stack RNN:
h_{i,1} = g(U_1 h_{i−1,1} + W_1 E_x x_i)
h_{i,2} = g(U_2 h_{i−1,2} + W_2 h_{i,1})
h_{i,3} = g(U_3 h_{i−1,3} + W_3 h_{i,2})
. . .
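The stacked recurrence above can be sketched in a few lines of NumPy. Dimensions, random weights, and the choice of g = tanh are illustrative assumptions, not details from the lecture:

```python
import numpy as np

H, E, L = 4, 3, 3                    # hidden size, embedding size, layers (illustrative)
rng = np.random.default_rng(0)
U = [rng.standard_normal((H, H)) * 0.1 for _ in range(L)]
W = [rng.standard_normal((H, E if k == 0 else H)) * 0.1 for k in range(L)]

def step(h_prev, x_emb):
    # h_{i,1} = g(U_1 h_{i-1,1} + W_1 E_x x_i); each higher layer reads the layer below
    hs, inp = [], x_emb
    for k in range(L):
        h = np.tanh(U[k] @ h_prev[k] + W[k] @ inp)
        hs.append(h)
        inp = h
    return hs

h = [np.zeros(H) for _ in range(L)]
for x in rng.standard_normal((5, E)):   # a toy sequence of 5 embedded tokens
    h = step(h, x)
print(h[-1].shape)                      # top-layer state
```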
R. Sennrich Neural Machine Translation Basics II 44 / 56
Deep Networks
often necessary to combat vanishing gradient:
residual connections between layers:
h_{i,1} = g(U_1 h_{i−1,1} + W_1 E_x x_i)
h_{i,2} = g(U_2 h_{i−1,2} + W_2 h_{i,1}) + h_{i,1}
h_{i,3} = g(U_3 h_{i−1,3} + W_3 h_{i,2}) + h_{i,2}
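As a sketch (hypothetical sizes, g = tanh), the residual variant only adds the previous layer's output back in for layers above the first, since layer 1 reads the embedding, which may have a different dimensionality:

```python
import numpy as np

H, L = 4, 3
rng = np.random.default_rng(1)
U = [rng.standard_normal((H, H)) * 0.1 for _ in range(L)]
W = [rng.standard_normal((H, H)) * 0.1 for _ in range(L)]

def step_residual(h_prev, x_emb):
    # h_{i,k} = g(U_k h_{i-1,k} + W_k h_{i,k-1}) + h_{i,k-1}  for k > 1
    hs, inp = [], x_emb
    for k in range(L):
        h = np.tanh(U[k] @ h_prev[k] + W[k] @ inp)
        if k > 0:                       # residual connection from the layer below
            h = h + inp
        hs.append(h)
        inp = h
    return hs

h = step_residual([np.zeros(H)] * L, rng.standard_normal(H))
print(len(h), h[-1].shape)
```

The residual path gives gradients a shortcut through the depth of the network, which is what makes deep stacks trainable.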
R. Sennrich Neural Machine Translation Basics II 45 / 56
Layer Normalization
if input distribution to NN layer changes, parameters need to adapt to this covariate shift
especially bad: RNN state grows/shrinks as we go through sequence
normalization of layers reduces shift, and improves training stability
re-center and re-scale each layer a (with H units)
two learned parameters, gain g and bias b, restore the original representation power
µ = (1/H) ∑_{i=1}^{H} a_i

σ = √( (1/H) ∑_{i=1}^{H} (a_i − µ)² )

h = (g/σ) ⊙ (a − µ) + b

R. Sennrich Neural Machine Translation Basics II 46 / 56
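The normalization above is easy to verify numerically; the sketch below assumes a single activation vector a (in practice it is applied per layer and per time step):

```python
import numpy as np

def layer_norm(a, g, b, eps=1e-6):
    # re-center and re-scale a, then restore representation power with g and b
    mu = a.mean()
    sigma = np.sqrt(((a - mu) ** 2).mean())
    return g / (sigma + eps) * (a - mu) + b

a = np.array([1.0, 2.0, 3.0, 4.0])
h = layer_norm(a, g=np.ones(4), b=np.zeros(4))
print(round(h.mean(), 6), round(h.std(), 6))  # ≈ 0.0 and ≈ 1.0
```

With g = 1 and b = 0 the output has mean ≈ 0 and standard deviation ≈ 1 regardless of the input scale, which is exactly the reduction of covariate shift the slide describes.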
Layer Normalization and Deep Models: Results from UEDIN@WMT17
system                 CS→EN  DE→EN  LV→EN  RU→EN  TR→EN  ZH→EN
                        2017   2017   2017   2017   2017   2017
baseline                27.5   32.0   16.4   31.3   19.7   21.7
+layer normalization    28.2   32.1   17.0   32.3   18.8   22.5
+deep model             28.9   33.5   16.6   32.7   20.6   22.9
layer normalization and deep models generally improve quality
layer normalization also speeds up convergence when training (fewer updates needed)
R. Sennrich Neural Machine Translation Basics II 47 / 56
Neural Machine Translation Basics II
1 Attention Model
2 Decoding
3 Open-Vocabulary Neural Machine Translation
4 More on Architectures
  Deep Networks
  Layer Normalization
  NMT with Self-Attention
R. Sennrich Neural Machine Translation Basics II 48 / 56
Attention Is All You Need [Vaswani et al., 2017]
main criticism of recurrent architecture: recurrent computations cannot be parallelized
core idea: instead of recurrence, use attention mechanism to condition hidden states on context → self-attention
specifically, attend over previous layer of deep network
Self-Attention
Convolution Self-Attention
https://nlp.stanford.edu/seminar/details/lkaiser.pdf
R. Sennrich Neural Machine Translation Basics II 49 / 56
Attention Is All You Need [Vaswani et al., 2017]
Transformer architecture: stack of N self-attention layers
self-attention in decoder is masked
decoder also attends to encoder states
Add & Norm: residual connection andlayer normalization
RNN can learn to count raw text; Transformer needs positional encoding
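The sinusoidal positional encoding proposed by Vaswani et al. (2017) can be sketched as follows (a minimal NumPy version, not the lecture's code; assumes an even d_model):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 8)
print(pe.shape)  # one encoding vector per position, added to the embeddings
```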
Figure 1: The Transformer - model architecture.
3.1 Encoder and Decoder Stacks
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
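The causal mask used in the decoder can be sketched directly; this is a stripped-down toy version without the learned query/key/value projections, where X stands in for one layer's input states:

```python
import numpy as np

def masked_self_attention(X):
    # scores above the diagonal are set to -inf, so softmax gives them weight 0:
    # position i attends only to positions <= i
    d_k = X.shape[-1]
    scores = X @ X.T / np.sqrt(d_k)
    n = len(scores)
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w, w @ X

X = np.random.default_rng(0).standard_normal((4, 8))
w, out = masked_self_attention(X)
print(np.triu(w, 1))  # all zeros: no attention to future positions
```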
3.2 Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
[Vaswani et al., 2017]
R. Sennrich Neural Machine Translation Basics II 50 / 56
Multi-Head Attention
basic attention mechanism in AIAYN: Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QK^T / √d_k) V
query Q is decoder/encoder state (for attention/self-attention)
key K and value V are encoder hidden states
multi-head attention: use h parallel attention mechanisms with low-dimensional, learned projections of Q, K, and V
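A direct NumPy transcription of the scaled dot-product formula (shapes below are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 8))    # 2 queries, d_k = 8
K = rng.standard_normal((5, 8))    # 5 keys,    d_k = 8
V = rng.standard_normal((5, 16))   # 5 values,  d_v = 16
print(attention(Q, K, V).shape)    # (2, 16)
```

Each query yields one output row: a convex combination of the value vectors, weighted by the softmax over scaled query-key dot products.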
Scaled Dot-Product Attention Multi-Head Attention
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.
3.2.1 Scaled Dot-Product Attention
We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension d_k, and values of dimension d_v. We compute the dot products of the query with all keys, divide each by √d_k, and apply a softmax function to obtain the weights on the values.
In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as:
Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)
The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of 1/√d_k. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
While for small values of d_k the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of d_k [3]. We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients.4 To counteract this effect, we scale the dot products by 1/√d_k.
3.2.2 Multi-Head Attention
Instead of performing a single attention function with d_model-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding d_v-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.
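The project-attend-concatenate-project scheme can be sketched as follows (toy dimensions; random matrices stand in for the learned projections, and values use the same per-head dimension as keys for simplicity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, heads, WO):
    # project h times, attend in parallel, concatenate, project once more
    outs = []
    for WQ, WK, WV in heads:
        q, k, v = Q @ WQ, K @ WK, V @ WV
        outs.append(softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v)
    return np.concatenate(outs, axis=-1) @ WO

d_model, h = 16, 4
d_k = d_model // h                     # per-head dimension, as in the paper
rng = np.random.default_rng(0)
heads = [tuple(rng.standard_normal((d_model, d_k)) * 0.1 for _ in range(3))
         for _ in range(h)]
WO = rng.standard_normal((h * d_k, d_model)) * 0.1
Q = rng.standard_normal((3, d_model))  # 3 decoder states as queries
M = rng.standard_normal((6, d_model))  # 6 encoder states as keys and values
out = multi_head_attention(Q, M, M, heads, WO)
print(out.shape)
```

Because each head attends over its own low-dimensional projection, the total cost stays close to single-head attention with d_model-dimensional vectors.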
4 To illustrate why the dot products get large, assume that the components of q and k are independent random variables with mean 0 and variance 1. Then their dot product, q · k = ∑_{i=1}^{d_k} q_i k_i, has mean 0 and variance d_k.
[Vaswani et al., 2017]
R. Sennrich Neural Machine Translation Basics II 51 / 56
Multi-Head Attention
motivation for multi-head attention: different heads can attend to different states
Attention Visualizations
(Figure: Input-Input Layer 5 — encoder self-attention weights over the sentence "It is in this spirit that a majority of American governments have passed new laws since 2009 making the registration or voting process more difficult.")
Figure 3: An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb 'making', completing the phrase 'making...more difficult'. Attentions here shown only for the word 'making'. Different colors represent different heads. Best viewed in color.
R. Sennrich Neural Machine Translation Basics II 52 / 56
Comparison
empirical comparison difficult:
some components could be mix-and-matched
  choice of attention mechanism
  choice of positional encoding
  hyperparameters and training tricks
different test sets and/or evaluation scripts
R. Sennrich Neural Machine Translation Basics II 53 / 56
Comparison
SOCKEYE [Hieber et al., 2017] (EN-DE; newstest2017)
system         BLEU
deep LSTM      25.6
Convolutional  24.6
Transformer    27.5
[Chen et al., 2018] (EN-DE; newstest2014)
system       BLEU
GNMT (RNN)   24.7
Transformer  27.9
RNMT+ (RNN)  28.5
R. Sennrich Neural Machine Translation Basics II 54 / 56
Why Self-Attention?
our understanding of neural networks lags behind empirical progress
there are some theoretical arguments why architectures work well...(e.g. self-attention reduces distance in network between words)
...but these are very speculative
Targeted Evaluation [Tang et al., 2018]
quality of long-distance agreement is similar between GRU and Transformer
strength of Transformer: better word sense disambiguation
uni-directional LSTMs, and the decoder is a stack of 8 uni-directional LSTMs. The size of embeddings and hidden states is 512. We apply layer normalization and label smoothing (0.1) in all models. We tie the source and target embeddings. The dropout rate of embeddings and Transformer blocks is set to 0.1. The dropout rate of RNNs and CNNs is 0.2. The kernel size of CNNs is 3. Transformers have an 8-head attention mechanism.
To test the robustness of our findings, we also test a different style of RNN architecture, from a different toolkit. We evaluate bi-deep transitional RNNs (Miceli Barone et al., 2017) which are state-of-art RNNs in machine translation. We use the bi-deep RNN-based model (RNN-bideep) implemented in Marian (Junczys-Dowmunt et al., 2018). Different from the previous settings, we use the Adam optimizer with β1 = 0.9, β2 = 0.98, ε = 10^−9. The initial learning rate is 0.0003. We tie target embeddings and output embeddings. Both the encoder and decoder have 4 layers of LSTM units, only the encoder layers are bi-directional. LSTM units consist of several cells (deep transition): 4 in the first layer of the decoder, 2 cells everywhere else.
We use training data from the WMT17 shared task.2 We use newstest2013 as the validation set, and use newstest2014 and newstest2017 as the test sets. All BLEU scores are computed with SacreBLEU (Post, 2018). There are about 5.9 million sentence pairs in the training set after preprocessing with Moses scripts. We learn a joint BPE model with 32,000 subword units (Sennrich et al., 2016). We employ the model that has the best perplexity on the validation set for the evaluation.
4.2 Overall Results
Table 2 reports the BLEU scores on newstest2014 and newstest2017, the perplexity on the validation set, and the accuracy on long-range dependencies.3 Transformer achieves the highest accuracy on this task and the highest BLEU scores on both newstest2014 and newstest2017. Compared to RNNS2S, ConvS2S has slightly better results regarding BLEU scores, but a much lower accuracy on long-range dependencies. The RNN-bideep model achieves distinctly better BLEU scores and a higher accuracy on long-range dependencies.
2http://www.statmt.org/wmt17/translation-task.html
3 We report average accuracy on instances where the distance between subject and verb is longer than 10 words.
However, it still cannot outperform Transformers on any of the tasks.
Model        2014  2017  PPL  Acc(%)
RNNS2S       23.3  25.1  6.1  95.1
ConvS2S      23.9  25.2  7.0  84.9
Transformer  26.7  27.5  4.5  97.1
RNN-bideep   24.7  26.1  5.7  96.3

Table 2: The results of different NMT models, including the BLEU scores on newstest2014 and newstest2017, the perplexity on the validation set, and the accuracy of long-range dependencies.
Figure 2 shows the performance of different architectures on the subject-verb agreement task. It is evident that Transformer, RNNS2S, and RNN-bideep perform much better than ConvS2S on long-range dependencies. However, Transformer, RNNS2S, and RNN-bideep are all robust over long distances. Transformer outperforms RNN-bideep for distances 11-12, but RNN-bideep performs equally or better for distance 13 or higher. Thus, we cannot conclude that Transformer models are particularly stronger than RNN models for long distances, despite achieving higher average accuracy on distances above 10.
(Figure: accuracy versus subject-verb distance, 1 to >15, y-axis 0.5 to 1.0, for RNNS2S, RNN-bideep, ConvS2S, and Transformer)
Figure 2: Accuracy of different NMT models on the subject-verb agreement task.
4.2.1 CNNs
Theoretically, the performance of CNNs will drop when the distance between the subject and the verb exceeds the local context size. However, ConvS2S is also clearly worse than RNNS2S for subject-verb agreement within the local context size.
In order to explore how the ability of ConvS2S to capture long-range dependencies depends on the local context size, we train additional systems,
R. Sennrich Neural Machine Translation Basics II 55 / 56
Further Reading
primary literature:
Sutskever, Vinyals, Le (2014): Sequence to Sequence Learning with Neural Networks.
Bahdanau, Cho, Bengio (2014): Neural Machine Translation by Jointly Learning to Align and Translate.
Vaswani et al. (2017): Attention Is All You Need
secondary literature:
Koehn (2017): Neural Machine Translation.
Neubig (2017): Neural Machine Translation and Sequence-to-sequence Models: A Tutorial
Cho (2015): Natural Language Understanding with Distributed Representation
R. Sennrich Neural Machine Translation Basics II 56 / 56
Bibliography I
Bahdanau, D., Cho, K., and Bengio, Y. (2015).Neural Machine Translation by Jointly Learning to Align and Translate.In Proceedings of the International Conference on Learning Representations (ICLR).
Chen, M. X., Firat, O., Bapna, A., Johnson, M., Macherey, W., Foster, G., Jones, L., Schuster, M., Shazeer, N., Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Chen, Z., Wu, Y., and Hughes, M. (2018).
The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation.
In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–86, Melbourne, Australia. Association for Computational Linguistics.
Chen, W., Matusov, E., Khadivi, S., and Peter, J. (2016).Guided Alignment Training for Topic-Aware Neural Machine Translation.CoRR, abs/1607.01628.
Cherry, C., Foster, G., Bapna, A., Firat, O., and Macherey, W. (2018).Revisiting Character-Based Neural Machine Translation with Capacity and Compression.ArXiv e-prints.
Cho, K., Courville, A., and Bengio, Y. (2015).Describing Multimedia Content using Attention-based Encoder-Decoder Networks.
Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. (2014).On the Properties of Neural Machine Translation: Encoder–Decoder Approaches.In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111,Doha, Qatar. Association for Computational Linguistics.
Chung, J., Cho, K., and Bengio, Y. (2016).A Character-level Decoder without Explicit Segmentation for Neural Machine Translation.CoRR, abs/1603.06147.
R. Sennrich Neural Machine Translation Basics II 57 / 56
Bibliography II
Gage, P. (1994).A New Algorithm for Data Compression.C Users J., 12(2):23–38.
Hieber, F., Domhan, T., Denkowski, M., Vilar, D., Sokolov, A., Clifton, A., and Post, M. (2017).Sockeye: A Toolkit for Neural Machine Translation.CoRR.
Huck, M., Riess, S., and Fraser, A. (2017).
Target-side Word Segmentation Strategies for Neural Machine Translation.
In Proceedings of the Second Conference on Machine Translation, Volume 1: Research Papers, Copenhagen, Denmark. Association for Computational Linguistics.

Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2015).
On Using Very Large Target Vocabulary for Neural Machine Translation.
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Beijing, China. Association for Computational Linguistics.

Koehn, P. and Knowles, R. (2017).
Six Challenges for Neural Machine Translation.
In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver. Association for Computational Linguistics.

Kudo, T. (2018).
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates.
In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.
R. Sennrich Neural Machine Translation Basics II 58 / 56
Bibliography III
Lee, J., Cho, K., and Hofmann, T. (2016).Fully Character-Level Neural Machine Translation without Explicit Segmentation.ArXiv e-prints.
Ling, W., Trancoso, I., Dyer, C., and Black, A. W. (2015).Character-based Neural Machine Translation.ArXiv e-prints.
Luong, M.-T. and Manning, D. C. (2016).
Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models.
In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1054–1063. Association for Computational Linguistics.

Luong, T., Sutskever, I., Le, Q., Vinyals, O., and Zaremba, W. (2015).
Addressing the Rare Word Problem in Neural Machine Translation.
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 11–19, Beijing, China. Association for Computational Linguistics.

Pinnis, M., Krislauks, R., Deksne, D., and Miks, T. (2017).
Neural Machine Translation for Morphologically Rich Languages with Improved Sub-word Units and Synthetic Data.
In Text, Speech, and Dialogue - 20th International Conference, TSD 2017, pages 237–245, Prague, Czech Republic.

Sánchez-Cartagena, V. M. and Toral, A. (2016).
Abu-MaTran at WMT 2016 Translation Task: Deep Learning, Morphological Segmentation and Tuning on Character Sequences.
In Proceedings of the First Conference on Machine Translation, pages 362–370, Berlin, Germany. Association for Computational Linguistics.
R. Sennrich Neural Machine Translation Basics II 59 / 56
Bibliography IV
Sennrich, R., Firat, O., Cho, K., Birch, A., Haddow, B., Hitschler, J., Junczys-Dowmunt, M., Läubli, S., Miceli Barone, A. V., Mokry, J., and Nadejde, M. (2017).
Nematus: a Toolkit for Neural Machine Translation.
In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain.

Sennrich, R., Haddow, B., and Birch, A. (2016).
Neural Machine Translation of Rare Words with Subword Units.
In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany.

Sutskever, I., Vinyals, O., and Le, Q. V. (2014).
Sequence to Sequence Learning with Neural Networks.
In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, pages 3104–3112, Montreal, Quebec, Canada.
Tamchyna, A., Weller-Di Marco, M., and Fraser, A. (2017).Modeling Target-Side Inflection in Neural Machine Translation.In Second Conference on Machine Translation (WMT17).
Tang, G., Müller, M., Rios, A., and Sennrich, R. (2018).Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures.In EMNLP 2018, Brussels, Belgium. Association for Computational Linguistics.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017).Attention Is All You Need.CoRR, abs/1706.03762.
R. Sennrich Neural Machine Translation Basics II 60 / 56