Convolutional over Recurrent Encoder for Neural Machine Translation
Praveen Dakwale and Christof Monz
Annual Conference of the European Association for Machine Translation
2017
Convolutional over Recurrent Encoder for Neural Machine Translation
2
Neural Machine Translation
• End-to-end neural network with an RNN architecture, where the output of one RNN (the decoder) is conditioned on another RNN (the encoder).
• c is a fixed-length vector representation of the source sentence encoded by the RNN.
• Attention mechanism:
• (Bahdanau et al., 2015): compute the context vector as a weighted average of the annotations of the source hidden states.
Published as a conference paper at ICLR 2015 (excerpt, Bahdanau et al.):

The decoder is often trained to predict the next word y_{t'} given the context vector c and all the previously predicted words {y_1, …, y_{t'−1}}. In other words, the decoder defines a probability over the translation y by decomposing the joint probability into the ordered conditionals:

    p(y) = ∏_{t=1}^{T} p(y_t | {y_1, …, y_{t−1}}, c),    (2)

where y = (y_1, …, y_{T_y}). With an RNN, each conditional probability is modeled as

    p(y_t | {y_1, …, y_{t−1}}, c) = g(y_{t−1}, s_t, c),    (3)

where g is a nonlinear, potentially multi-layered, function that outputs the probability of y_t, and s_t is the hidden state of the RNN. It should be noted that other architectures such as a hybrid of an RNN and a de-convolutional neural network can be used (Kalchbrenner and Blunsom, 2013).
3 LEARNING TO ALIGN AND TRANSLATE

In this section, we propose a novel architecture for neural machine translation. The new architecture consists of a bidirectional RNN as an encoder (Sec. 3.2) and a decoder that emulates searching through a source sentence during decoding a translation (Sec. 3.1).

3.1 DECODER: GENERAL DESCRIPTION
Figure 1: The graphical illustration of the proposed model trying to generate the t-th target word y_t given a source sentence (x_1, x_2, …, x_{T_x}).
In a new model architecture, we define each conditional probability in Eq. (2) as:

    p(y_i | y_1, …, y_{i−1}, x) = g(y_{i−1}, s_i, c_i),    (4)

where s_i is an RNN hidden state for time i, computed by

    s_i = f(s_{i−1}, y_{i−1}, c_i).

It should be noted that unlike the existing encoder–decoder approach (see Eq. (2)), here the probability is conditioned on a distinct context vector c_i for each target word y_i.
The context vector c_i depends on a sequence of annotations (h_1, …, h_{T_x}) to which an encoder maps the input sentence. Each annotation h_i contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence. We explain in detail how the annotations are computed in the next section.

The context vector c_i is, then, computed as a weighted sum of these annotations h_j:

    c_i = ∑_{j=1}^{T_x} α_{ij} h_j.    (5)

The weight α_{ij} of each annotation h_j is computed by

    α_{ij} = exp(e_{ij}) / ∑_{k=1}^{T_x} exp(e_{ik}),    (6)

where

    e_{ij} = a(s_{i−1}, h_j)

is an alignment model which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state s_{i−1} (just before emitting y_i, Eq. (4)) and the j-th annotation h_j of the input sentence.
We parametrize the alignment model a as a feedforward neural network which is jointly trained with all the other components of the proposed system. Note that unlike in traditional machine translation, …
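The attention computation in Eqs. (4)–(6) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the alignment model a is modeled here as a small single-hidden-layer feedforward network with hypothetical weight names `W_a`, `U_a`, `v_a`, and all dimensions are toy values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def attention_context(s_prev, H, W_a, U_a, v_a):
    # e_ij = a(s_{i-1}, h_j): a tiny feedforward alignment model (Eq. 6)
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in H])
    alpha = softmax(e)                    # attention weights alpha_ij
    c = (alpha[:, None] * H).sum(axis=0)  # context vector c_i (Eq. 5)
    return c, alpha

rng = np.random.default_rng(0)
Tx, d = 5, 8                       # toy source length and hidden size
H = rng.standard_normal((Tx, d))   # encoder annotations h_1 .. h_Tx
s_prev = rng.standard_normal(d)    # previous decoder state s_{i-1}
W_a = rng.standard_normal((d, d))
U_a = rng.standard_normal((d, d))
v_a = rng.standard_normal(d)

c, alpha = attention_context(s_prev, H, W_a, U_a, v_a)
assert np.isclose(alpha.sum(), 1.0)  # weights form a distribution over source positions
assert c.shape == (d,)               # context vector has the annotation dimension
```

The softmax in Eq. (6) guarantees the weights α_{ij} sum to one, so c_i is a convex combination of the annotations.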
Convolutional over Recurrent Encoder for Neural Machine Translation
3
[Diagram: standard attention-based NMT — a two-layer RNN encoder maps source words x_1 … x_i to hidden states h; attention weights α combine the encoder states into a context vector c_j for each two-layer decoder state s_j, which emits target words y_1 … y_j.]
Convolutional over Recurrent Encoder for Neural Machine Translation
4
Why do RNNs work for NMT?
✦ Recurrently encode the history of large, variable-length input sequences
✦ Capture long-distance dependencies, which are common in natural language text
Convolutional over Recurrent Encoder for Neural Machine Translation
5
RNN for NMT:
✤ Disadvantages:
	✤ Slow: doesn't allow parallel computation within a sequence
	✤ Non-uniform composition: for each state, the first word is processed many times while the last is processed only once
	✤ Dense representation: each h_i is a compact summary of the source sentence up to word i
	✤ Focus on a global representation rather than on local features
Convolutional over Recurrent Encoder for Neural Machine Translation
6
CNN in NLP:
✤ Unlike RNNs, CNNs apply over a fixed-size window of the input
	✤ This allows parallel computation
✤ Represent a sentence in terms of features:
	✤ a weighted combination of multiple words or n-grams
✤ Very successful in learning sentence representations for various tasks
	✤ Sentiment analysis, question classification (Kim 2014; Kalchbrenner et al. 2014)
Convolutional over Recurrent Encoder for Neural Machine Translation
7
Convolution over Recurrent encoder (CoveR):
✤ Can CNNs help NMT?
	✤ Instead of single recurrent outputs, we can use a composition of multiple hidden-state outputs of the encoder
✤ Convolution over recurrent:
	✤ We apply multiple layers of fixed-size convolution filters over the output of the RNN encoder at each time step
	✤ This can provide wider context about the relevant features of the source sentence
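The idea above can be sketched in NumPy: stack width-3 convolutions, with zero padding and residual connections, over the sequence of RNN encoder outputs. This is a toy sketch with made-up dimensions and random filters, not the trained model (the paper uses 3 layers, filter width 3, and 1000-dimensional states).

```python
import numpy as np

def conv_layer(H, W, b):
    """One width-3 convolutional layer over RNN outputs H (shape T x d),
    zero-padded on both sides so the sequence length is preserved."""
    T, d = H.shape
    Hp = np.vstack([np.zeros((1, d)), H, np.zeros((1, d))])  # pad both ends
    # each output position i combines h_{i-1}, h_i, h_{i+1}
    out = np.stack([np.tanh(Hp[i:i + 3].reshape(-1) @ W + b) for i in range(T)])
    return out + H  # residual connection between layers

rng = np.random.default_rng(0)
T, d = 6, 4                        # toy sequence length and hidden size
H = rng.standard_normal((T, d))    # RNN encoder outputs h_1 .. h_T
layers = [(rng.standard_normal((3 * d, d)) * 0.1, np.zeros(d)) for _ in range(3)]

CN = H
for W, b in layers:                # 3 stacked conv layers, as in the CoveR setup
    CN = conv_layer(CN, W, b)
assert CN.shape == H.shape         # one feature vector CN_i per source position
```

After three width-3 layers, each CN_i has mixed in information from a 7-state window around h_i, which is the "wider context" the slide refers to.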
Convolutional over Recurrent Encoder for Neural Machine Translation
8
CoveR model
[Diagram: CoveR model — a two-layer RNN encoder over x_1 … x_i, followed by CNN layers with zero padding on both sides producing features CN_i; attention weights α combine the CNN outputs into context vectors c'_j = ∑_i α_{ji} CN_i for the two-layer decoder, which emits y_1 … y_j.]
Convolutional over Recurrent Encoder for Neural Machine Translation
9
Convolution over Recurrent encoder:
✤ Each vector CN_i now represents a feature produced by multiple kernels over h_i
✤ Relatively uniform composition of multiple previous states and the current state
✤ Simultaneous, hence faster, processing at the convolutional layers
PBML ??? MAY 2017 (excerpt):

[Figure 1. NMT encoder-decoder framework]

[Figure 2. Convolution over Recurrent model]

The first convolutional layer computes, at each position i, a feature CN¹_i by applying a filter of width w over a window of encoder hidden states centered at i:

    CN¹_i = σ(θ · h_{i−⌊(w−1)/2⌋ : i+⌊(w−1)/2⌋} + b)

Stacking such layers yields outputs CN¹, CN², …, CNⁿ; the attention mechanism then computes the context vector c'_i over the top-layer outputs in place of c_i.
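The substitution at the heart of CoveR, attending over the convolutional features CN_i instead of the raw RNN annotations, can be shown in a few lines. This is a toy sketch with random scores standing in for the alignment model's output, not the trained system.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
T, d = 5, 4
CN = rng.standard_normal((T, d))   # top convolutional-layer outputs CN_1 .. CN_T
scores = rng.standard_normal(T)    # alignment scores e_j (stand-in for the alignment model)

alpha = softmax(scores)                       # attention weights alpha_{j,1..T}
c_prime = (alpha[:, None] * CN).sum(axis=0)   # c'_j = sum_i alpha_{ji} CN_i

assert c_prime.shape == (d,)
assert np.isclose(alpha.sum(), 1.0)
```

Only the attended vectors change relative to the baseline; the decoder and the attention formula itself stay exactly as in Eqs. (5)–(6).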
Convolutional over Recurrent Encoder for Neural Machine Translation
10
Related work:
✤ Gehring et al. 2017:
	✦ Completely replace the RNN encoder with a CNN
	✦ A simple replacement doesn't work; position embeddings are required to model dependencies
	✦ Requires 6-15 convolutional layers to compete with a 2-layer RNN
✤ Meng et al. 2015:
	✦ For phrase-based MT, use a CNN language model as an additional feature
Convolutional over Recurrent Encoder for Neural Machine Translation
11
Experimental setting:
✤ Data:
	✦ WMT-2015 En-De training data: 4.2M sentence pairs
	✦ Dev: WMT 2013 test set
	✦ Test: WMT 2014 and WMT 2015 test sets
✤ Baseline:
	✦ Two-layer unidirectional LSTM encoder
	✦ Embedding size, hidden size = 1000
	✦ Vocabulary: source 60k, target 40k
Convolutional over Recurrent Encoder for Neural Machine Translation
12
Experimental setting:
✤ CoveR:
	✦ Encoder: 3 convolutional layers over the RNN output
	✦ Decoder: same as the baseline
	✦ Convolutional filters of size 3
	✦ Output dimension: 1000
	✦ Zero padding on both sides at each layer, no pooling
	✦ Residual connections (He et al., 2015) between intermediate layers
Convolutional over Recurrent Encoder for Neural Machine Translation
13
Experimental setting:
✤ Deep RNN encoder:
	✦ Comparing the 2-layer RNN encoder baseline to CoveR is unfair
		• The improvement may be due simply to the increased number of parameters
	✦ We therefore also compare with a deep RNN encoder with 5 layers
	✦ The 2 decoder layers are initialized through a non-linear transformation of the encoder's final states
Convolutional over Recurrent Encoder for Neural Machine Translation
14
BLEU scores (* = significant at p < 0.05)

| Model            | Dev  | wmt14 | wmt15 |
| Baseline         | 17.9 | 15.8  | 18.5  |
| Deep RNN encoder | 18.3 | 16.2  | 18.7  |
| CoveR            | 18.5 | 16.9* | 19.0* |

Result
✤ Compared to the baseline:
	✦ +1.1 BLEU for WMT-14 and +0.5 for WMT-15
✤ Compared to the deep RNN encoder:
	✦ +0.7 BLEU for WMT-14 and +0.3 for WMT-15
Convolutional over Recurrent Encoder for Neural Machine Translation
15
#parameters and decoding speed

| Model            | #parameters (millions) | avg sec/sent |
| Baseline         | 174                    | 0.11         |
| Deep RNN encoder | 283                    | 0.28         |
| CoveR            | 183                    | 0.14         |

Result
✤ CoveR model:
	✤ Slightly slower than the baseline but faster than the deep RNN encoder
	✤ Slightly more parameters than the baseline but fewer than the deep RNN encoder
	✤ Improvements are not just due to an increased number of parameters
Convolutional over Recurrent Encoder for Neural Machine Translation
16
Qualitative analysis:
✤ Increased output length
✤ With additional context, the CoveR model generates complete translations
P. Dakwale, C. Monz: Convolutional over Recurrent Encoder for NMT (1–12)

Table 2. Translation examples. Words in bold show correct translations produced by our model as compared to the baseline.

| Model     | Avg sent length |
| Baseline  | 18.7 |
| Deep RNN  | 19.0 |
| CoveR     | 19.9 |
| Reference | 20.9 |
Convolutional over Recurrent Encoder for Neural Machine Translation
17
Qualitative analysis:
✤ More uniform attention distribution
✤ Generation of correct composite words
Convolutional over Recurrent Encoder for Neural Machine Translation
18
Qualitative analysis:
✤ More uniform attention distribution
[Figure 3. Attention distribution for Baseline]
[Figure 4. Attention distribution for CoveR model]
✤ The baseline translates 'itinerary' as 'strecke' (road, distance)
	✤ It pays attention only to 'itinerary' for this position
✤ CoveR translates 'itinerary' as 'reiseroute'
	✤ It also pays attention to the final verb
Convolutional over Recurrent Encoder for Neural Machine Translation
19
Conclusion:
✤ CoveR: multiple convolutional layers over an RNN encoder
✤ Significant improvements over a standard LSTM baseline
✤ Increasing the number of LSTM layers improves slightly, but convolutional layers perform better
✤ Faster, and with fewer parameters, than a fully RNN encoder of the same size
✤ The CoveR model can improve coverage and provide a more uniform attention distribution
Questions ?
Thanks