Neural Machine Translation Basics II
Rico Sennrich
University of Edinburgh
R. Sennrich Neural Machine Translation Basics II 1 / 56
Modelling Translation
Suppose that we have:
- a source sentence S of length m: (x_1, ..., x_m)
- a target sentence T of length n: (y_1, ..., y_n)

We can express translation as a probabilistic model:

T* = argmax_T p(T|S)

Expanding using the chain rule gives:

p(T|S) = p(y_1, ..., y_n | x_1, ..., x_m)
       = ∏_{i=1}^{n} p(y_i | y_1, ..., y_{i-1}, x_1, ..., x_m)
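The chain-rule factorization above can be sketched in a few lines: log p(T|S) is the sum of per-token conditional log-probabilities. The `cond_prob` function below is a hypothetical stand-in for a trained model's softmax output, not an actual NMT system.

```python
import math

def score_translation(source, target, cond_prob):
    """Return log p(T|S) = sum_i log p(y_i | y_<i, S)."""
    log_p = 0.0
    for i, y in enumerate(target):
        # conditional probability of token y given the target prefix and the source
        log_p += math.log(cond_prob(source, target[:i], y))
    return log_p

# hypothetical conditional distribution, for demonstration only
def cond_prob(source, prefix, y):
    return 0.5  # pretend every token has probability 0.5

src = ["Ich", "liebe", "dich"]
tgt = ["I", "love", "you", "</s>"]
print(score_translation(src, tgt, cond_prob))  # 4 * log(0.5)
```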
Language Model for Translation?
[Diagram: RNN with inputs x_1..x_3 = "Ich liebe dich", hidden states h_1..h_3, and outputs y_1..y_3 = "I love you"]
Language Model for Translation?
[Diagram: RNN with inputs x_1..x_3 = "Je t' aime", hidden states h_1..h_3, and outputs y_1..y_3 = "I love you"]
Language Model for Translation?
[Diagram: RNN with inputs x_1..x_2 = "Te quiero", hidden states h_1..h_2, and outputs y_1..y_2 = "I love" — source and target lengths need not match]
Language Model for Translation?
[Diagram: RNN language model over the concatenated sequence; inputs x_1..x_8 = "<s> Ich liebe dich <translate> I love you", outputs y_1..y_8 = "Ich liebe dich <translate> I love you </s>"]
Differences Between Translation and Language Model
Target-side language model:

p(T) = ∏_{i=1}^{n} p(y_i | y_1, ..., y_{i-1})

Translation model:

p(T|S) = ∏_{i=1}^{n} p(y_i | y_1, ..., y_{i-1}, x_1, ..., x_m)

We could just treat the sentence pair as one long sequence, but:
- we do not care about p(S)
- we may want a different vocabulary and network architecture for the source text

→ Use separate RNNs for source and target.
Encoder-Decoder for Translation
[Diagram: encoder RNN (h_1..h_4) reads "natürlich hat john spaß"; its final state initializes the decoder RNN (s_1..s_5), which generates "of course john has fun"]
Summary vector
The last encoder hidden state "summarises" the source sentence.

With multilingual training, we can potentially learn a language-independent meaning representation.

[Sutskever et al., 2014]
Neural Machine Translation Basics II
1 Attention Model
2 Decoding
3 Open-Vocabulary Neural Machine Translation
4 More on Architectures
  - Deep Networks
  - Layer Normalization
  - NMT with Self-Attention
Summary vector as information bottleneck
Problem: sentence length
- a fixed-size representation degrades as sentence length increases [Cho et al., 2014]
- reversing the source brings some improvement [Sutskever et al., 2014]

Solution: attention
- compute the context vector as a weighted average of the source hidden states
- weights are computed by a feed-forward network with softmax activation
Encoder-Decoder with Attention
[Diagram: the decoder additionally receives a context vector, a weighted average of the encoder states h_1..h_4 over "natürlich hat john spaß"; the attention weights (here 0.7, 0.1, 0.1, 0.1) are recomputed at each decoder step and shift across source positions as the decoder generates "of course john has fun"]
Attentional encoder-decoder: Maths
Simplifications of the model by [Bahdanau et al., 2015] (for illustration):
- plain RNN instead of GRU
- simpler output layer
- we do not show bias terms
- decoder follows the Look, Update, Generate strategy [Sennrich et al., 2017]

Details in https://github.com/amunmt/amunmt/blob/master/contrib/notebooks/dl4mt.ipynb

Notation:
- W, U, E, C, V are weight matrices (of different dimensionality):
  E: one-hot to embedding (e.g. 50000 × 512)
  W: embedding to hidden (e.g. 512 × 1024)
  U: hidden to hidden (e.g. 1024 × 1024)
  C: context (2× hidden) to hidden (e.g. 2048 × 1024)
  V_o: hidden to one-hot (e.g. 1024 × 50000)
- separate weight matrices for encoder and decoder (e.g. E_x and E_y)
- input X of length T_x; output Y of length T_y
Attentional encoder-decoder: Maths
Encoder:

→h_j = 0                                    if j = 0
→h_j = tanh(→W_x E_x x_j + →U_x →h_{j-1})   if j > 0

←h_j = 0                                    if j = T_x + 1
←h_j = tanh(←W_x E_x x_j + ←U_x ←h_{j+1})   if j ≤ T_x

h_j = (→h_j, ←h_j)
Attentional encoder-decoder: Maths
Decoder:

s_i = tanh(W_s ←h_1)                               if i = 0
s_i = tanh(W_y E_y y_{i-1} + U_y s_{i-1} + C c_i)  if i > 0

t_i = tanh(U_o s_i + W_o E_y y_{i-1} + C_o c_i)
y_i = softmax(V_o t_i)

Attention model:

e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j)
α_ij = softmax(e_ij)      (normalised over source positions j)
c_i = Σ_{j=1}^{T_x} α_ij h_j
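A minimal numeric sketch of the attention equations above, using NumPy. All weights are randomly initialized stand-ins and the dimensions are illustrative, not the 512/1024 sizes from the notation slide.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# dimensions: T_x source positions, encoder state size, decoder state size, attention size
Tx, dim_h, dim_s, dim_a = 4, 6, 5, 3
rng = np.random.default_rng(0)
H = rng.normal(size=(Tx, dim_h))        # encoder states h_1..h_Tx
s_prev = rng.normal(size=dim_s)         # previous decoder state s_{i-1}
W_a = rng.normal(size=(dim_a, dim_s))
U_a = rng.normal(size=(dim_a, dim_h))
v_a = rng.normal(size=dim_a)

# e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j), for each source position j
e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in H])
alpha = softmax(e)                      # attention weights, sum to 1
c = alpha @ H                           # context vector c_i = sum_j alpha_ij h_j

print(alpha.sum())  # 1.0 (up to floating point)
```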
Attention model
Side effect: we obtain an "alignment" between source and target sentence.

Applications:
- visualisation
- replacing unknown words with a back-off dictionary [Jean et al., 2015]
- ...

Image: Kyunghyun Cho
http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/
Attention is not alignment
[Figure 7 from the paper: BLEU scores with varying sentence length (source, subword count), comparing neural and phrase-based systems. SMT outperforms NMT for sentences longer than 60 subword tokens; for very long sentences (80+), NMT quality is much worse due to too-short output.]
3.5 Word Alignment
The key contribution of the attention model in neural machine translation (Bahdanau et al., 2015) was the imposition of an alignment of the output words to the input words. This takes the shape of a probability distribution over the input words which is used to weigh them in a bag-of-words representation of the input sentence.

Arguably, this attention model does not functionally play the role of a word alignment between the source and the target, at least not in the same way as its analog in statistical machine translation. While in both cases alignment is a latent variable that is used to obtain probability distributions over words or phrases, the attention model arguably has a broader role. For instance, when translating a verb, attention may also be paid to its subject and object, since these may disambiguate it. To further complicate matters, the word representations are products of bidirectional gated recurrent neural networks, which have the effect that each word representation is informed by the entire sentence context.

But there is a clear need for an alignment mechanism between source and target words. For instance, prior work used the alignments provided by the attention model to interpolate word translation decisions with traditional probabilistic dictionaries (Arthur et al., 2016), for the introduction of coverage and fertility models (Tu et al., 2016), etc.
But is the attention model in fact the proper means? To examine this, we compare the soft alignment matrix (the sequence of attention vectors) with word alignments obtained by traditional word alignment methods. We use incremental fast-align (Dyer et al., 2013) to align the input and output of the neural machine translation system.

[Figure 8 from the paper: Word alignment for English–German, comparing the attention model states (green boxes with probability in percent if over 10) with alignments obtained from fast-align (blue outlines), for the sentence pair "relations between Obama and Netanyahu have been strained for years." / "die Beziehungen zwischen Obama und Netanjahu sind seit Jahren angespannt."]
See Figure 8 for an illustration. We compare the word attention states (green boxes) with the word alignments obtained with fast-align (blue outlines). For most words, these match up pretty well. Both attention states and fast-align alignment points are a bit fuzzy around the function words have-been/sind.

However, the attention model may settle on alignments that do not correspond with our intuition or alignment points obtained with fast-align. See Figure 9 for the reverse language direction, German–English. All the alignment points appear to be off by one position. We are not aware of any intuitive explanation for this divergent behavior — the translation quality is high for both systems.

We measure how well the soft alignments (attention model) of the NMT system match the alignments of fast-align with two metrics:
• a match score that checks for each output word if the aligned input word according to fast-align is indeed the input word that received the highest attention probability, and

• a probability mass score that sums up the probability mass given to each alignment point obtained from fast-align.

[Figure 9 from the paper: Mismatch between attention states and desired word alignments (German–English), for the sentence pair "das Verhältnis zwischen Obama und Netanyahu ist seit Jahren gespannt." / "the relationship between Obama and Netanyahu has been stretched for years."]

[Koehn and Knowles, 2017]
In these scores, we have to handle byte pair encoding and many-to-many alignments.^11

In our experiment, we use the neural machine translation models provided by Edinburgh^12 (Sennrich et al., 2016a). We run fast-align on the same parallel data sets to obtain alignment models and use them to align the input and output of the NMT system. Table 3 shows alignment scores for the systems. The results suggest that, while drastic, the divergence for German–English is an outlier. We note, however, that we have seen such a large divergence also under different data conditions.

Note that the attention model may produce better word alignments by guided alignment training (Chen et al., 2016; Liu et al., 2016), where supervised word alignments (such as the ones produced by fast-align) are provided to model training.

^11 (1) NMT operates on subwords, but fast-align is run on full words. (2) If an input word is split into subwords by byte pair encoding, then we add their attention scores. (3) If an output word is split into subwords, then we take the average of their attention vectors. (4) The match scores and probability mass scores are computed as averages over output word-level scores. (5) If an output word has no fast-align alignment point, it is ignored in this computation. (6) If an output word is fast-aligned to multiple input words, then (6a) for the match score: count it as correct if the n aligned words are among the top n highest scoring words according to attention, and (6b) for the probability mass score: add up their attention scores.

^12 https://github.com/rsennrich/wmt16-scripts
Language Pair     Match   Prob.
German–English    14.9%   16.0%
English–German    77.2%   63.2%
Czech–English     78.0%   63.3%
English–Czech     76.1%   59.7%
Russian–English   72.5%   65.0%
English–Russian   73.4%   64.1%

Table 3: Scores indicating overlap between attention probabilities and alignments obtained with fast-align.
3.6 Beam Search
The task of decoding is to find the full sentence translation with the highest probability. In statistical machine translation, this problem has been addressed with heuristic search techniques that explore a subset of the space of possible translations. A common feature of these search techniques is a beam size parameter that limits the number of partial translations maintained per input word.

There is typically a straightforward relationship between this beam size parameter and the model score of resulting translations, and also their quality score (e.g., BLEU). While there are diminishing returns for increasing the beam parameter, improvements in these scores can typically be expected with larger beams.

Decoding in neural translation models can be set up in similar fashion. When predicting the next output word, we may not only commit to the highest-scoring word prediction but also maintain the next best scoring words in a list of partial translations. We record with each partial translation the word translation probabilities (obtained from the softmax), extend each partial translation with subsequent word predictions, and accumulate these scores. Since the number of partial translations explodes exponentially with each new output word, we prune them down to a beam of the highest-scoring partial translations.

As in traditional statistical machine translation decoding, increasing the beam size allows us to explore a larger part of the space of possible translations and hence find translations with better model scores.

However, as Figure 10 illustrates, increasing the beam size does not consistently improve translation quality. In fact, in almost all cases, worse translations are found beyond an optimal beam size setting (we are using again Edinburgh's WMT

[Koehn and Knowles, 2017]
Attention model
Question: how can the model translate correctly if the attention is off by one?

Answer: information can also flow along the recurrent connections, so there is no guarantee that attention corresponds to alignment.
What If You Want Attention to be Alignment-like?
One solution: guided alignment training [Chen et al., 2016]
1 compute alignments with an external tool (IBM models)
2 if multiple source words align to the same target word, normalize so that Σ_j A_ij = 1
3 modify the objective function of NMT training:
  - minimize target sentence cross-entropy (as before)
  - minimize divergence between the model attention α and the external alignment A:

  H(A, α) = −(1/T_y) Σ_{i=1}^{T_y} Σ_{j=1}^{T_x} A_ij log α_ij
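The guided-alignment cross-entropy H(A, α) translates directly into code. A and α below are toy matrices for a 2-word target and a 2-word source, chosen only for illustration.

```python
import numpy as np

def guided_alignment_loss(A, alpha, eps=1e-12):
    """H(A, alpha) = -(1/Ty) * sum_i sum_j A_ij * log(alpha_ij)."""
    Ty = A.shape[0]
    return -(A * np.log(alpha + eps)).sum() / Ty

A = np.array([[1.0, 0.0],      # target word 1 aligned to source word 1
              [0.0, 1.0]])     # target word 2 aligned to source word 2
alpha = np.array([[0.9, 0.1],  # model attention, rows sum to 1
                  [0.2, 0.8]])
print(guided_alignment_loss(A, alpha))
```

The loss is minimized (together with the usual cross-entropy) when the attention rows match the external alignment rows.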
Attention model
attention model also works with images:
[Cho et al., 2015]
Attention model
[Cho et al., 2015]
Neural Machine Translation Basics II
1 Attention Model
2 Decoding
3 Open-Vocabulary Neural Machine Translation
4 More on Architectures
  - Deep Networks
  - Layer Normalization
  - NMT with Self-Attention
Application of Encoder-Decoder Model
Scoring (a translation):
p(La, croissance, économique, s'est, ralentie, ces, dernières, années, . | Economic, growth, has, slowed, down, in, recent, years, .) = ?

Decoding (a source sentence):
generate the most probable translation of a source sentence

y* = argmax_y p(y | Economic, growth, has, slowed, down, in, recent, years, .)
Decoding
Exact search:
- generate every possible sentence T in the target language
- compute the score p(T|S) for each
- pick the best one

This is intractable: there are |vocab|^N translations for output length N.
→ we need an approximate search strategy
Decoding
Approximate search (1): greedy search
- at each time step, compute the probability distribution p(y_i | S, y_<i)
- select y_i according to some heuristic:
  - sampling: sample from p(y_i | S, y_<i)
  - greedy search: pick argmax_y p(y_i | S, y_<i)
- continue until we generate <eos>

[Diagram: greedy decoding of "hello world ! <eos>", with per-token probabilities 0.946, 0.957, 0.928, 0.999 and accumulated costs]

Greedy search is efficient, but suboptimal.
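Greedy search can be sketched as follows, assuming a `step` function that returns the model's distribution p(y_i | S, y_<i). The toy probability table is hypothetical, chosen to reproduce the "hello world !" example above.

```python
import numpy as np

def greedy_decode(source, step, eos, max_len=20):
    prefix = []
    for _ in range(max_len):
        p = step(source, prefix)   # p(y_i | S, y_<i)
        y = int(np.argmax(p))      # greedy: pick the most probable token
        prefix.append(y)
        if y == eos:
            break
    return prefix

# hypothetical 4-token vocabulary: 0=hello, 1=world, 2=!, 3=<eos>
def step(source, prefix):
    table = {0: [0.9, 0.05, 0.03, 0.02],   # first emit "hello"
             1: [0.05, 0.9, 0.03, 0.02],   # then "world"
             2: [0.02, 0.03, 0.9, 0.05],   # then "!"
             3: [0.01, 0.01, 0.01, 0.97]}  # then <eos>
    return np.array(table[min(len(prefix), 3)])

print(greedy_decode(None, step, eos=3))  # [0, 1, 2, 3]
```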
Decoding
Approximate search (2): beam search
- maintain a list of K hypotheses (the beam)
- at each time step, expand each hypothesis k: p(y_i^k | S, y_<i^k)
- select the K hypotheses with the highest total probability:

  ∏_i p(y_i^k | S, y_<i^k)

[Diagram: beam search with K = 3, expanding hypotheses such as "hello world !" and pruning to the three highest-scoring partial translations at each step]

- relatively efficient: beam expansion is parallelisable
- currently the default search strategy in neural machine translation
- a small beam (K ≈ 10) offers a good speed-quality trade-off
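The beam-search loop above can be sketched as follows: expand every hypothesis, score by accumulated log-probability, and prune to the K best. This is a bare sketch without length normalization; the `step` function is the same hypothetical toy distribution as in the greedy-search sketch.

```python
import numpy as np

def beam_search(source, step, eos, K=3, max_len=20):
    beam = [([], 0.0)]                     # (prefix, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, lp in beam:
            p = step(source, prefix)
            for y, py in enumerate(p):     # expand every hypothesis with every token
                candidates.append((prefix + [y], lp + np.log(py)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for prefix, lp in candidates[:K]:  # prune to the K best
            if prefix[-1] == eos:
                finished.append((prefix, lp))
            else:
                beam.append((prefix, lp))
        if not beam:
            break
    finished.extend(beam)                  # in case nothing reached <eos>
    return max(finished, key=lambda c: c[1])[0]

# hypothetical toy distribution, as in the greedy-search sketch
def step(source, prefix):
    table = {0: [0.9, 0.05, 0.03, 0.02],
             1: [0.05, 0.9, 0.03, 0.02],
             2: [0.02, 0.03, 0.9, 0.05],
             3: [0.01, 0.01, 0.01, 0.97]}
    return np.array(table[min(len(prefix), 3)])

print(beam_search(None, step, eos=3, K=3))  # [0, 1, 2, 3]
```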
Beam Search in Practice
- translation quality goes down with large beam sizes (!)
- pruning hypotheses to a small beam removes states that are improbable but have low entropy (label bias)
- models have a bias towards short translations
  → the cost is often normalized by length
- if the model starts copying input, future copying has high probability
[Figure 10 from the paper: BLEU with varying beam sizes (1 to 1,000), unnormalized and length-normalized, for Czech–English, English–Czech, German–English, English–German, Romanian–English, English–Romanian, Russian–English, and English–Russian. For large beams, quality decreases, especially when not normalizing scores by sentence length.]

[Koehn and Knowles, 2017]
Ensembles
- combine the decisions of multiple classifiers by voting
- an ensemble will reduce error if these conditions are met:
  - base classifiers are accurate
  - base classifiers are diverse (make different errors)
Ensembles in NMT
- vote at each time step to explore the same search space
  (better than decoding with one model and reranking an n-best list with the others)
- voting mechanism: typically average the (log-)probabilities:

  log P(y_i | S, y_<i) = (1/M) Σ_{m=1}^{M} log P_m(y_i | S, y_<i)

- requirements for voting at each time step:
  - same output vocabulary
  - same factorization of Y
  - but: internal network architecture may be different
- we still use reranking in some situations
  - example: combine left-to-right decoding and right-to-left decoding
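The per-step voting formula above is a one-liner in practice; the per-model distributions below are hypothetical stand-ins for the softmax outputs of M trained models.

```python
import numpy as np

def ensemble_log_probs(model_probs):
    """log P(y_i|S, y_<i) = (1/M) * sum_m log P_m(y_i|S, y_<i)."""
    logs = np.log(np.stack(model_probs))  # shape (M, vocab)
    return logs.mean(axis=0)

p1 = np.array([0.7, 0.2, 0.1])  # model 1's distribution over a 3-token vocabulary
p2 = np.array([0.5, 0.4, 0.1])  # model 2's distribution
avg = ensemble_log_probs([p1, p2])
print(int(np.argmax(avg)))  # token 0 wins under both models
```

Averaging log-probabilities corresponds to a geometric mean of the distributions; since search only compares scores, the result need not be renormalized.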
Neural Machine Translation Basics II
1 Attention Model
2 Decoding
3 Open-Vocabulary Neural Machine Translation
4 More on Architectures
  - Deep Networks
  - Layer Normalization
  - NMT with Self-Attention
Text Representation
How do we represent text in NMT?
- 1-hot encoding
  - lookup of word embedding for input
  - probability distribution over vocabulary for output
- large vocabularies
  - increase network size
  - decrease training and decoding speed
- typical network vocabulary size: 10,000–100,000 symbols

[Table: representation of "cat" — vocabulary index (the, cat, is, ..., mat), 1-hot vector (0, 1, 0, ..., 0), embedding (0.1, 0.3, 0.7, 0.5)]
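The equivalence between a 1-hot lookup and selecting a row of the embedding matrix can be checked in a few lines (toy vocabulary and random embedding values, for illustration only):

```python
import numpy as np

vocab = {"the": 0, "cat": 1, "is": 2, "mat": 3}
# embedding matrix: one row per vocabulary entry, 4-dimensional embeddings
E = np.random.default_rng(0).normal(size=(len(vocab), 4))

one_hot = np.zeros(len(vocab))
one_hot[vocab["cat"]] = 1.0

# multiplying by a 1-hot vector is just a row lookup
assert np.allclose(one_hot @ E, E[vocab["cat"]])
```

This is why implementations use an index lookup rather than an actual matrix product.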
Problem
- translation is an open-vocabulary problem
- many training corpora contain millions of word types
- productive word formation processes (compounding; derivation) allow formation and understanding of unseen words
- names and numbers are morphologically simple, but open word classes
Non-Solution: Ignore Rare Words
- replace out-of-vocabulary words with UNK
- a vocabulary of 50,000 words covers 95% of text
- this gets you 95% of the way...
  ... if you only care about automatic metrics

Why 95% is not enough: rare outcomes have high self-information.

source                               The indoor temperature is very pleasant.
reference                            Das Raumklima ist sehr angenehm.
[Bahdanau et al., 2015]              Die UNK ist sehr angenehm. ✗
[Jean et al., 2015]                  Die Innenpool ist sehr angenehm. ✗
[Sennrich, Haddow, Birch, ACL 2016]  Die Innen+ temperatur ist sehr angenehm. ✓
Solution 1: Back-off Models
Back-off models [Jean et al., 2015, Luong et al., 2015]:
- replace rare words with UNK at training time
- when the system produces UNK, align the UNK to a source word, and translate this word with a back-off method

source                   The indoor temperature is very pleasant.
reference                Das Raumklima ist sehr angenehm.
[Bahdanau et al., 2015]  Die UNK ist sehr angenehm. ✗
[Jean et al., 2015]      Die Innenpool ist sehr angenehm. ✗

Limitations:
- compounds: hard to model 1-to-many relationships
- morphology: hard to predict inflection with a back-off dictionary
- names: if alphabets differ, we need transliteration
- alignment: the attention model is unreliable
Solution 2: Subword Models
MT is an open-vocabulary problem:
- compounding and other productive morphological processes
  they charge a carry-on bag fee.
  sie erheben eine Hand|gepäck|gebühr.
- names
  Obama (English; German)
  Обама (Russian)
  オバマ (o-ba-ma) (Japanese)
- technical terms, numbers, etc.
Subword units
Segmentation algorithms: wishlist
- open-vocabulary NMT: encode all words through a small vocabulary
- encoding generalizes to unseen words
- small text size
- good translation quality

Our experiments [Sennrich et al., 2016]: after preliminary experiments, we propose:
- character n-grams (with a shortlist of unsegmented words)
- segmentation via byte pair encoding (BPE)
Byte pair encoding for word segmentation
Bottom-up character merging:
- starting point: character-level representation
  → computationally expensive
- compress the representation based on information theory
  → byte pair encoding [Gage, 1994]
- repeatedly replace the most frequent symbol pair ('A', 'B') with 'AB'
- hyperparameter: when to stop
  → controls the vocabulary size

word               freq
'l o w</w>'        5
'l o w e r</w>'    2
'n e w e s t</w>'  6
'w i d e s t</w>'  3

initial vocabulary: l, o, w</w>, w, e, r</w>, n, s, t</w>, i, d
merges learned (in order): (e, s) → es; (es, t</w>) → est</w>; (l, o) → lo
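The BPE learning loop can be sketched in a few lines; on the slide's toy corpus it recovers the merges es, est</w>, lo. (A robust implementation would match whole symbols rather than raw substrings, as the reference subword-nmt code does; the simple `str.replace` below is fine for this toy corpus.)

```python
import collections

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with the merged symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in vocab.items()}

# the slide's corpus, with symbols separated by spaces
vocab = {"l o w</w>": 5, "l o w e r</w>": 2,
         "n e w e s t</w>": 6, "w i d e s t</w>": 3}
merges = []
for _ in range(3):  # number of merges is the stopping hyperparameter
    best = get_pair_counts(vocab).most_common(1)[0][0]
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)  # [('e', 's'), ('es', 't</w>'), ('l', 'o')]
```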
Byte pair encoding for word segmentation
Why BPE?
- open-vocabulary: operations learned on the training set can be applied to unknown words
- compression of frequent character sequences improves efficiency
  → trade-off between text length and vocabulary size

Applying the learned merges to the unseen word 'l o w e s t</w>':
  e s → es:           'l o w es t</w>'
  es t</w> → est</w>: 'l o w est</w>'
  l o → lo:           'lo w est</w>'
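Applying learned merges to a new word is just a replay of the merge list in the order it was learned, which is why the segmentation generalizes to words never seen in training:

```python
def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = list(word[:-1]) + [word[-1] + "</w>"]  # characters + end-of-word marker
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]           # replace the pair with the merged symbol
            else:
                i += 1
    return symbols

# merge list from the slide's example
merges = [("e", "s"), ("es", "t</w>"), ("l", "o")]
print(apply_bpe("lowest", merges))  # ['lo', 'w', 'est</w>']
```

"lowest" never occurred in the toy training corpus, yet every resulting symbol is in the learned vocabulary.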
Subword NMT: Translation Quality
[Bar chart: BLEU scores for EN-DE and EN-RU]

system                                              EN-DE  EN-RU
word-level NMT (with back-off) [Jean et al., 2015]  22.0   19.1
subword-level NMT: BPE                              22.8   20.4
Subword NMT: Translation Quality
[Plot: NMT results EN-RU — unigram F1 (y-axis, 0 to 0.8) versus training set frequency rank (x-axis, 10^0 to 10^6, with vocabulary thresholds marked at 50,000 and 500,000). Systems: subword-level NMT (BPE), subword-level NMT (char bigrams), word-level (with back-off), word-level (no back-off).]
Examples
system                      sentence
source                      health research institutes
reference                   Gesundheitsforschungsinstitute
word-level (with back-off)  Forschungsinstitute
character bigrams           Fo|rs|ch|un|gs|in|st|it|ut|io|ne|n
BPE                         Gesundheits|forsch|ungsin|stitute

source                      rakfisk
reference                   ракфиска (rakfiska)
word-level (with back-off)  rakfisk → UNK → rakfisk
character bigrams           ra|kf|is|k → ра|кф|ис|к (ra|kf|is|k)
BPE                         rak|f|isk → рак|ф|иска (rak|f|iska)
Subword Models: Variants
- morphologically motivated subword units [Sánchez-Cartagena and Toral, 2016, Tamchyna et al., 2017, Huck et al., 2017, Pinnis et al., 2017]
- probabilistic segmentation and sampling [Kudo, 2018]
Solution 3: Character-level Models
Advantages:
- (mostly) open-vocabulary
- no heuristic or language-specific segmentation
- the neural network can learn from raw character sequences

Drawbacks:
- increased sequence length slows training/decoding (reported ×2–×8 increase in training time)
- memory requirement: trade-off between embedding matrix size and sequence length

Active research on specialized architectures [Ling et al., 2015, Luong and Manning, 2016, Chung et al., 2016, Lee et al., 2016]
Character-level NMT: Current Results (as of August 29, 2018)
[Figure 1 from the paper: Test BLEU for character and BPE translation as architectures scale from 1 BiLSTM encoder layer and 2 LSTM decoder layers (1×2+2) to the standard 6×2+8. The y-axis spans 6 BLEU points for each language pair.]

[Figure 2 from the paper: BLEU versus training corpus size in millions of sentence pairs, for the EnFr language pair.]

[Figure 3 from the paper: Training time per sentence versus total number of layers (encoder plus decoder) in the model.]
The systems differ most in the number of lexical choice errors, and in the extent to which they drop content. The latter is surprising, and appears to be a side-effect of a general tendency of the character models to be more faithful to the source, verging on being overly literal. An example of dropped content is shown in Table 5 (top).

Regarding lexical choice, the two systems differ not only in the number of errors, but in the nature of those errors. In particular, the BPE model had more trouble handling German compound nouns. Table 5 (bottom) shows an example which exhibits two compound errors, including one where the character system is a strict improvement, translating Bunsenbrenner into bunsen burner instead of bullets. The second error follows another common pattern, where both systems mishandle the German compound (Chemiestunden / chemistry lessons), but the character system fails in a more useful way.

Error Type        BPE  Char
Lexical Choice    19   8
  Compounds       13   1
  Proper Names    2    1
  Morphological   2    2
  Other lexical   2    4
Dropped Content   7    0

Table 4: Error counts out of 100 randomly sampled examples from the DeEn test set.

We also found that both systems occasionally mistranslate proper names. Both fail by attempting to translate when they should copy over, but the BPE system's errors are harder to understand, as they involve semantic translation, rendering Britta Hermann as Sir Leon, and Esme Nussbaum as smiling walnut.^9 The character system's one observed error in this category was phonetic rather than semantic, rendering Schotten as Scottland.

Interestingly, we also observed several instances where the model correctly translates the German 24-hour clock into the English 12-hour clock; for example, 19.30 becomes 7:30 p.m. This deterministic transformation is potentially in reach for both models, but we observed it only for the character system in this sample.

^9 The BPE segmentations for these names were: _Britta _Herr mann and _Es me _N uss baum
Figure 1: Test BLEU for character and BPE translation as architectures scale from 1 BiLSTM encoder layer and 2 LSTM decoder layers (1×2+2) to our standard 6×2+8. The y-axis spans 6 BLEU points for each language pair.
For deep LSTMs, char-level systems can achieve higher BLEU than BPE
...but training time (per sentence) is 8x higher
[Cherry et al., 2018]
R. Sennrich Neural Machine Translation Basics II 42 / 56
Neural Machine Translation Basics II
1 Attention Model
2 Decoding
3 Open-Vocabulary Neural Machine Translation
4 More on Architectures
  Deep Networks
  Layer Normalization
  NMT with Self-Attention
R. Sennrich Neural Machine Translation Basics II 43 / 56
Deep Networks
increasing model depth often increases model performance
example: stack RNN:
h_{i,1} = g(U_1 h_{i−1,1} + W_1 E_x x_i)
h_{i,2} = g(U_2 h_{i−1,2} + W_2 h_{i,1})
h_{i,3} = g(U_3 h_{i−1,3} + W_3 h_{i,2})
. . .
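The stacked recurrence above can be sketched in a few lines of NumPy. Dimensions, random weights, and the choice of g = tanh are illustrative assumptions, not details from the lecture:

```python
import numpy as np

H, E, L = 4, 3, 3                    # hidden size, embedding size, layers (illustrative)
rng = np.random.default_rng(0)
U = [rng.standard_normal((H, H)) * 0.1 for _ in range(L)]
W = [rng.standard_normal((H, E if k == 0 else H)) * 0.1 for k in range(L)]

def step(h_prev, x_emb):
    # h_{i,1} = g(U_1 h_{i-1,1} + W_1 E_x x_i); each higher layer reads the layer below
    hs, inp = [], x_emb
    for k in range(L):
        h = np.tanh(U[k] @ h_prev[k] + W[k] @ inp)
        hs.append(h)
        inp = h
    return hs

h = [np.zeros(H) for _ in range(L)]
for x in rng.standard_normal((5, E)):   # a toy sequence of 5 embedded tokens
    h = step(h, x)
print(h[-1].shape)                      # top-layer state
```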
R. Sennrich Neural Machine Translation Basics II 44 / 56
Deep Networks
often necessary to combat vanishing gradient:
residual connections between layers:
h_{i,1} = g(U_1 h_{i−1,1} + W_1 E_x x_i)
h_{i,2} = g(U_2 h_{i−1,2} + W_2 h_{i,1}) + h_{i,1}
h_{i,3} = g(U_3 h_{i−1,3} + W_3 h_{i,2}) + h_{i,2}
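As a sketch (hypothetical sizes, g = tanh), the residual variant only adds the previous layer's output back in for layers above the first, since layer 1 reads the embedding, which may have a different dimensionality:

```python
import numpy as np

H, L = 4, 3
rng = np.random.default_rng(1)
U = [rng.standard_normal((H, H)) * 0.1 for _ in range(L)]
W = [rng.standard_normal((H, H)) * 0.1 for _ in range(L)]

def step_residual(h_prev, x_emb):
    # h_{i,k} = g(U_k h_{i-1,k} + W_k h_{i,k-1}) + h_{i,k-1}  for k > 1
    hs, inp = [], x_emb
    for k in range(L):
        h = np.tanh(U[k] @ h_prev[k] + W[k] @ inp)
        if k > 0:                       # residual connection from the layer below
            h = h + inp
        hs.append(h)
        inp = h
    return hs

h = step_residual([np.zeros(H)] * L, rng.standard_normal(H))
print(len(h), h[-1].shape)
```

The residual path gives gradients a shortcut through the depth of the network, which is what makes deep stacks trainable.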
R. Sennrich Neural Machine Translation Basics II 45 / 56
Layer Normalization
if input distribution to NN layer changes, parameters need to adapt to this covariate shift
especially bad: RNN state grows/shrinks as we go through sequence
normalization of layers reduces shift, and improves training stability
re-center and re-scale each layer a (with H units)
two learned parameters, gain g and bias b, restore the original representation power
µ = (1/H) ∑_{i=1}^{H} a_i

σ = √( (1/H) ∑_{i=1}^{H} (a_i − µ)² )

h = (g/σ) ⊙ (a − µ) + b

R. Sennrich Neural Machine Translation Basics II 46 / 56
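The normalization above is easy to verify numerically; the sketch below assumes a single activation vector a (in practice it is applied per layer and per time step):

```python
import numpy as np

def layer_norm(a, g, b, eps=1e-6):
    # re-center and re-scale a, then restore representation power with g and b
    mu = a.mean()
    sigma = np.sqrt(((a - mu) ** 2).mean())
    return g / (sigma + eps) * (a - mu) + b

a = np.array([1.0, 2.0, 3.0, 4.0])
h = layer_norm(a, g=np.ones(4), b=np.zeros(4))
print(round(h.mean(), 6), round(h.std(), 6))  # ≈ 0.0 and ≈ 1.0
```

With g = 1 and b = 0 the output has mean ≈ 0 and standard deviation ≈ 1 regardless of the input scale, which is exactly the reduction of covariate shift the slide describes.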
Layer Normalization and Deep Models: Results from UEDIN@WMT17
system                 CS→EN  DE→EN  LV→EN  RU→EN  TR→EN  ZH→EN
                        2017   2017   2017   2017   2017   2017
baseline                27.5   32.0   16.4   31.3   19.7   21.7
+layer normalization    28.2   32.1   17.0   32.3   18.8   22.5
+deep model             28.9   33.5   16.6   32.7   20.6   22.9
layer normalization and deep models generally improve quality
layer normalization also speeds up convergence when training (fewer updates needed)
R. Sennrich Neural Machine Translation Basics II 47 / 56
Neural Machine Translation Basics II
1 Attention Model
2 Decoding
3 Open-Vocabulary Neural Machine Translation
4 More on Architectures
  Deep Networks
  Layer Normalization
  NMT with Self-Attention
R. Sennrich Neural Machine Translation Basics II 48 / 56
Attention Is All You Need [Vaswani et al., 2017]
main criticism of recurrent architecture: recurrent computations cannot be parallelized
core idea: instead of recurrence, use attention mechanism to condition hidden states on context → self-attention
specifically, attend over previous layer of deep network
Self-Attention
Convolution Self-Attention
https://nlp.stanford.edu/seminar/details/lkaiser.pdf
R. Sennrich Neural Machine Translation Basics II 49 / 56
Attention Is All You Need [Vaswani et al., 2017]
Transformer architecture: stack of N self-attention layers
self-attention in decoder is masked
decoder also attends to encoder states
Add & Norm: residual connection andlayer normalization
RNN can learn to count raw text; Transformer needs positional encoding
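The sinusoidal positional encoding proposed by Vaswani et al. (2017) can be sketched as follows (a minimal NumPy version, not the lecture's code; assumes an even d_model):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 8)
print(pe.shape)  # one encoding vector per position, added to the embeddings
```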
Figure 1: The Transformer - model architecture.
3.1 Encoder and Decoder Stacks
Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension d_model = 512.
Decoder: The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, we employ residual connections around each of the sub-layers, followed by layer normalization. We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
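The causal mask used in the decoder can be sketched directly; this is a stripped-down toy version without the learned query/key/value projections, where X stands in for one layer's input states:

```python
import numpy as np

def masked_self_attention(X):
    # scores above the diagonal are set to -inf, so softmax gives them weight 0:
    # position i attends only to positions <= i
    d_k = X.shape[-1]
    scores = X @ X.T / np.sqrt(d_k)
    n = len(scores)
    scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w, w @ X

X = np.random.default_rng(0).standard_normal((4, 8))
w, out = masked_self_attention(X)
print(np.triu(w, 1))  # all zeros: no attention to future positions
```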
3.2 Attention
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
[Vaswani et al., 2017]
R. Sennrich Neural Machine Translation Basics II 50 / 56
Multi-Head Attention
basic attention mechanism in AIAYN: Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QK^T / √d_k) V
query Q is decoder/encoder state (for attention/self-attention)
key K and value V are encoder hidden states
multi-head attention: use h parallel attention mechanisms with low-dimensional, learned projections of Q, K, and V
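A direct NumPy transcription of the scaled dot-product formula (shapes below are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 8))    # 2 queries, d_k = 8
K = rng.standard_normal((5, 8))    # 5 keys,    d_k = 8
V = rng.standard_normal((5, 16))   # 5 values,  d_v = 16
print(attention(Q, K, V).shape)    # (2, 16)
```

Each query yields one output row: a convex combination of the value vectors, weighted by the softmax over scaled query-key dot products.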
Scaled Dot-Product Attention Multi-Head Attention
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several attention layers running in parallel.
3.2.1 Scaled Dot-Product Attention
We call our particular attention "Scaled Dot-Product Attention" (Figure 2). The input consists of queries and keys of dimension d_k, and values of dimension d_v. We compute the dot products of the query with all keys, divide each by √d_k, and apply a softmax function to obtain the weights on the values.
In practice, we compute the attention function on a set of queries simultaneously, packed together into a matrix Q. The keys and values are also packed together into matrices K and V. We compute the matrix of outputs as:
Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)
The two most commonly used attention functions are additive attention [2], and dot-product (multiplicative) attention. Dot-product attention is identical to our algorithm, except for the scaling factor of 1/√d_k. Additive attention computes the compatibility function using a feed-forward network with a single hidden layer. While the two are similar in theoretical complexity, dot-product attention is much faster and more space-efficient in practice, since it can be implemented using highly optimized matrix multiplication code.
While for small values of d_k the two mechanisms perform similarly, additive attention outperforms dot product attention without scaling for larger values of d_k [3]. We suspect that for large values of d_k, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients.4 To counteract this effect, we scale the dot products by 1/√d_k.
3.2.2 Multi-Head Attention
Instead of performing a single attention function with d_model-dimensional keys, values and queries, we found it beneficial to linearly project the queries, keys and values h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively. On each of these projected versions of queries, keys and values we then perform the attention function in parallel, yielding d_v-dimensional output values. These are concatenated and once again projected, resulting in the final values, as depicted in Figure 2.
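The project-attend-concatenate-project scheme can be sketched as follows (toy dimensions; random matrices stand in for the learned projections, and values use the same per-head dimension as keys for simplicity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(Q, K, V, heads, WO):
    # project h times, attend in parallel, concatenate, project once more
    outs = []
    for WQ, WK, WV in heads:
        q, k, v = Q @ WQ, K @ WK, V @ WV
        outs.append(softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v)
    return np.concatenate(outs, axis=-1) @ WO

d_model, h = 16, 4
d_k = d_model // h                     # per-head dimension, as in the paper
rng = np.random.default_rng(0)
heads = [tuple(rng.standard_normal((d_model, d_k)) * 0.1 for _ in range(3))
         for _ in range(h)]
WO = rng.standard_normal((h * d_k, d_model)) * 0.1
Q = rng.standard_normal((3, d_model))  # 3 decoder states as queries
M = rng.standard_normal((6, d_model))  # 6 encoder states as keys and values
out = multi_head_attention(Q, M, M, heads, WO)
print(out.shape)
```

Because each head attends over its own low-dimensional projection, the total cost stays close to single-head attention with d_model-dimensional vectors.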
4 To illustrate why the dot products get large, assume that the components of q and k are independent random variables with mean 0 and variance 1. Then their dot product, q · k = ∑_{i=1}^{d_k} q_i k_i, has mean 0 and variance d_k.
[Vaswani et al., 2017]
R. Sennrich Neural Machine Translation Basics II 51 / 56
Multi-Head Attention
motivation for multi-head attention: different heads can attend to different states
Attention Visualizations
(Figure: Input-Input Layer 5 — encoder self-attention weights over the sentence "It is in this spirit that a majority of American governments have passed new laws since 2009 making the registration or voting process more difficult.")
Figure 3: An example of the attention mechanism following long-distance dependencies in the encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of the verb 'making', completing the phrase 'making...more difficult'. Attentions here shown only for the word 'making'. Different colors represent different heads. Best viewed in color.
R. Sennrich Neural Machine Translation Basics II 52 / 56
Comparison
empirical comparison difficult:
some components could be mix-and-matched
  choice of attention mechanism
  choice of positional encoding
  hyperparameters and training tricks
different test sets and/or evaluation scripts
R. Sennrich Neural Machine Translation Basics II 53 / 56
Comparison
SOCKEYE [Hieber et al., 2017] (EN-DE; newstest2017)
system         BLEU
deep LSTM      25.6
Convolutional  24.6
Transformer    27.5
[Chen et al., 2018] (EN-DE; newstest2014)
system       BLEU
GNMT (RNN)   24.7
Transformer  27.9
RNMT+ (RNN)  28.5
R. Sennrich Neural Machine Translation Basics II 54 / 56
Why Self-Attention?
our understanding of neural networks lags behind empirical progress
there are some theoretical arguments why architectures work well...(e.g. self-attention reduces distance in network between words)
...but these are very speculative
Targeted Evaluation [Tang et al., 2018]
quality of long-distance agreement is similar between GRU and Transformer
strength of Transformer: better word sense disambiguation
uni-directional LSTMs, and the decoder is a stack of 8 uni-directional LSTMs. The size of embeddings and hidden states is 512. We apply layer normalization and label smoothing (0.1) in all models. We tie the source and target embeddings. The dropout rate of embeddings and Transformer blocks is set to 0.1. The dropout rate of RNNs and CNNs is 0.2. The kernel size of CNNs is 3. Transformers have an 8-head attention mechanism.
To test the robustness of our findings, we also test a different style of RNN architecture, from a different toolkit. We evaluate bi-deep transitional RNNs (Miceli Barone et al., 2017) which are state-of-art RNNs in machine translation. We use the bi-deep RNN-based model (RNN-bideep) implemented in Marian (Junczys-Dowmunt et al., 2018). Different from the previous settings, we use the Adam optimizer with β1 = 0.9, β2 = 0.98, ε = 10^−9. The initial learning rate is 0.0003. We tie target embeddings and output embeddings. Both the encoder and decoder have 4 layers of LSTM units, only the encoder layers are bi-directional. LSTM units consist of several cells (deep transition): 4 in the first layer of the decoder, 2 cells everywhere else.
We use training data from the WMT17 shared task.2 We use newstest2013 as the validation set, and use newstest2014 and newstest2017 as the test sets. All BLEU scores are computed with SacreBLEU (Post, 2018). There are about 5.9 million sentence pairs in the training set after preprocessing with Moses scripts. We learn a joint BPE model with 32,000 subword units (Sennrich et al., 2016). We employ the model that has the best perplexity on the validation set for the evaluation.
4.2 Overall Results
Table 2 reports the BLEU scores on newstest2014 and newstest2017, the perplexity on the validation set, and the accuracy on long-range dependencies.3 Transformer achieves the highest accuracy on this task and the highest BLEU scores on both newstest2014 and newstest2017. Compared to RNNS2S, ConvS2S has slightly better results regarding BLEU scores, but a much lower accuracy on long-range dependencies. The RNN-bideep model achieves distinctly better BLEU scores and a higher accuracy on long-range dependencies.
2http://www.statmt.org/wmt17/translation-task.html
3 We report average accuracy on instances where the distance between subject and verb is longer than 10 words.
However, it still cannot outperform Transformers on any of the tasks.
Model        2014  2017  PPL  Acc(%)
RNNS2S       23.3  25.1  6.1  95.1
ConvS2S      23.9  25.2  7.0  84.9
Transformer  26.7  27.5  4.5  97.1
RNN-bideep   24.7  26.1  5.7  96.3

Table 2: The results of different NMT models, including the BLEU scores on newstest2014 and newstest2017, the perplexity on the validation set, and the accuracy of long-range dependencies.
Figure 2 shows the performance of different architectures on the subject-verb agreement task. It is evident that Transformer, RNNS2S, and RNN-bideep perform much better than ConvS2S on long-range dependencies. However, Transformer, RNNS2S, and RNN-bideep are all robust over long distances. Transformer outperforms RNN-bideep for distances 11-12, but RNN-bideep performs equally or better for distance 13 or higher. Thus, we cannot conclude that Transformer models are particularly stronger than RNN models for long distances, despite achieving higher average accuracy on distances above 10.
(Figure: accuracy versus subject-verb distance, 1 to >15, y-axis 0.5 to 1.0, for RNNS2S, RNN-bideep, ConvS2S, and Transformer)
Figure 2: Accuracy of different NMT models on the subject-verb agreement task.
4.2.1 CNNs
Theoretically, the performance of CNNs will drop when the distance between the subject and the verb exceeds the local context size. However, ConvS2S is also clearly worse than RNNS2S for subject-verb agreement within the local context size.
In order to explore how the ability of ConvS2S to capture long-range dependencies depends on the local context size, we train additional systems,
R. Sennrich Neural Machine Translation Basics II 55 / 56
Further Reading
primary literature:
Sutskever, Vinyals, Le (2014): Sequence to Sequence Learning with Neural Networks.
Bahdanau, Cho, Bengio (2014): Neural Machine Translation by Jointly Learning to Align and Translate.
Vaswani et al. (2017): Attention Is All You Need
secondary literature:
Koehn (2017): Neural Machine Translation.
Neubig (2017): Neural Machine Translation and Sequence-to-sequence Models: A Tutorial
Cho (2015): Natural Language Understanding with Distributed Representation
R. Sennrich Neural Machine Translation Basics II 56 / 56
Bibliography I
Bahdanau, D., Cho, K., and Bengio, Y. (2015).Neural Machine Translation by Jointly Learning to Align and Translate.In Proceedings of the International Conference on Learning Representations (ICLR).
Chen, M. X., Firat, O., Bapna, A., Johnson, M., Macherey, W., Foster, G., Jones, L., Schuster, M., Shazeer, N., Parmar, N., Vaswani, A., Uszkoreit, J., Kaiser, L., Chen, Z., Wu, Y., and Hughes, M. (2018).
The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation.
In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–86, Melbourne, Australia. Association for Computational Linguistics.
Chen, W., Matusov, E., Khadivi, S., and Peter, J. (2016).Guided Alignment Training for Topic-Aware Neural Machine Translation.CoRR, abs/1607.01628.
Cherry, C., Foster, G., Bapna, A., Firat, O., and Macherey, W. (2018).Revisiting Character-Based Neural Machine Translation with Capacity and Compression.ArXiv e-prints.
Cho, K., Courville, A., and Bengio, Y. (2015).Describing Multimedia Content using Attention-based Encoder-Decoder Networks.
Cho, K., van Merrienboer, B., Bahdanau, D., and Bengio, Y. (2014).On the Properties of Neural Machine Translation: Encoder–Decoder Approaches.In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111,Doha, Qatar. Association for Computational Linguistics.
Chung, J., Cho, K., and Bengio, Y. (2016).A Character-level Decoder without Explicit Segmentation for Neural Machine Translation.CoRR, abs/1603.06147.
R. Sennrich Neural Machine Translation Basics II 57 / 56
Bibliography II
Gage, P. (1994).A New Algorithm for Data Compression.C Users J., 12(2):23–38.
Hieber, F., Domhan, T., Denkowski, M., Vilar, D., Sokolov, A., Clifton, A., and Post, M. (2017).Sockeye: A Toolkit for Neural Machine Translation.CoRR.
Huck, M., Riess, S., and Fraser, A. (2017).
Target-side Word Segmentation Strategies for Neural Machine Translation.
In Proceedings of the Second Conference on Machine Translation, Volume 1: Research Papers, Copenhagen, Denmark. Association for Computational Linguistics.

Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2015).
On Using Very Large Target Vocabulary for Neural Machine Translation.
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Beijing, China. Association for Computational Linguistics.

Koehn, P. and Knowles, R. (2017).
Six Challenges for Neural Machine Translation.
In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver. Association for Computational Linguistics.

Kudo, T. (2018).
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates.
In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, Melbourne, Australia. Association for Computational Linguistics.
R. Sennrich Neural Machine Translation Basics II 58 / 56
Bibliography III
Lee, J., Cho, K., and Hofmann, T. (2016).Fully Character-Level Neural Machine Translation without Explicit Segmentation.ArXiv e-prints.
Ling, W., Trancoso, I., Dyer, C., and Black, A. W. (2015).Character-based Neural Machine Translation.ArXiv e-prints.
Luong, M.-T. and Manning, D. C. (2016).
Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models.
In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1054–1063. Association for Computational Linguistics.

Luong, T., Sutskever, I., Le, Q., Vinyals, O., and Zaremba, W. (2015).
Addressing the Rare Word Problem in Neural Machine Translation.
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 11–19, Beijing, China. Association for Computational Linguistics.

Pinnis, M., Krislauks, R., Deksne, D., and Miks, T. (2017).
Neural Machine Translation for Morphologically Rich Languages with Improved Sub-word Units and Synthetic Data.
In Text, Speech, and Dialogue - 20th International Conference, TSD 2017, pages 237–245, Prague, Czech Republic.

Sánchez-Cartagena, V. M. and Toral, A. (2016).
Abu-MaTran at WMT 2016 Translation Task: Deep Learning, Morphological Segmentation and Tuning on Character Sequences.
In Proceedings of the First Conference on Machine Translation, pages 362–370, Berlin, Germany. Association for Computational Linguistics.
R. Sennrich Neural Machine Translation Basics II 59 / 56
Bibliography IV
Sennrich, R., Firat, O., Cho, K., Birch, A., Haddow, B., Hitschler, J., Junczys-Dowmunt, M., Läubli, S., Miceli Barone, A. V., Mokry, J., and Nadejde, M. (2017).
Nematus: a Toolkit for Neural Machine Translation.
In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68, Valencia, Spain.

Sennrich, R., Haddow, B., and Birch, A. (2016).
Neural Machine Translation of Rare Words with Subword Units.
In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany.

Sutskever, I., Vinyals, O., and Le, Q. V. (2014).
Sequence to Sequence Learning with Neural Networks.
In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, pages 3104–3112, Montreal, Quebec, Canada.
Tamchyna, A., Weller-Di Marco, M., and Fraser, A. (2017).Modeling Target-Side Inflection in Neural Machine Translation.In Second Conference on Machine Translation (WMT17).
Tang, G., Müller, M., Rios, A., and Sennrich, R. (2018).Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures.In EMNLP 2018, Brussels, Belgium. Association for Computational Linguistics.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017).Attention Is All You Need.CoRR, abs/1706.03762.
R. Sennrich Neural Machine Translation Basics II 60 / 56