Convolutional over Recurrent Encoder for Neural Machine Translation
Praveen Dakwale and Christof Monz
Annual Conference of the European Association for Machine Translation
2017
Convolutional over Recurrent Encoder for Neural Machine Translation
2
Neural Machine Translation
• End-to-end neural network with an RNN architecture, where the output of one RNN (the decoder) is conditioned on another RNN (the encoder).
• c is a fixed-length vector representation of the source sentence encoded by the RNN.
• Attention mechanism:
• (Bahdanau et al., 2015): compute the context vector as a weighted average of the annotations of the source hidden states.
Published as a conference paper at ICLR 2015 (excerpt, Bahdanau et al.):

The decoder is often trained to predict the next word y_{t'} given the context vector c and all the previously predicted words {y_1, …, y_{t'−1}}. In other words, the decoder defines a probability over the translation y by decomposing the joint probability into the ordered conditionals:

    p(y) = ∏_{t=1}^{T} p(y_t | {y_1, …, y_{t−1}}, c),    (2)

where y = (y_1, …, y_{T_y}). With an RNN, each conditional probability is modeled as

    p(y_t | {y_1, …, y_{t−1}}, c) = g(y_{t−1}, s_t, c),    (3)

where g is a nonlinear, potentially multi-layered, function that outputs the probability of y_t, and s_t is the hidden state of the RNN. It should be noted that other architectures such as a hybrid of an RNN and a de-convolutional neural network can be used (Kalchbrenner and Blunsom, 2013).
3 LEARNING TO ALIGN AND TRANSLATE

In this section, we propose a novel architecture for neural machine translation. The new architecture consists of a bidirectional RNN as an encoder (Sec. 3.2) and a decoder that emulates searching through a source sentence during decoding a translation (Sec. 3.1).

3.1 DECODER: GENERAL DESCRIPTION
Figure 1: The graphical illustration of the proposed model trying to generate the t-th target word y_t given a source sentence (x_1, x_2, …, x_{T_x}).
In a new model architecture, we define each conditional probability in Eq. (2) as:

    p(y_i | y_1, …, y_{i−1}, x) = g(y_{i−1}, s_i, c_i),    (4)

where s_i is an RNN hidden state for time i, computed by

    s_i = f(s_{i−1}, y_{i−1}, c_i).

It should be noted that unlike the existing encoder–decoder approach (see Eq. (2)), here the probability is conditioned on a distinct context vector c_i for each target word y_i.
The context vector c_i depends on a sequence of annotations (h_1, …, h_{T_x}) to which an encoder maps the input sentence. Each annotation h_i contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence. We explain in detail how the annotations are computed in the next section.

The context vector c_i is, then, computed as a weighted sum of these annotations h_j:

    c_i = ∑_{j=1}^{T_x} α_{ij} h_j.    (5)

The weight α_{ij} of each annotation h_j is computed by

    α_{ij} = exp(e_{ij}) / ∑_{k=1}^{T_x} exp(e_{ik}),    (6)

where

    e_{ij} = a(s_{i−1}, h_j)

is an alignment model which scores how well the inputs around position j and the output at position i match. The score is based on the RNN hidden state s_{i−1} (just before emitting y_i, Eq. (4)) and the j-th annotation h_j of the input sentence.
We parametrize the alignment model a as a feedforward neural network which is jointly trained with all the other components of the proposed system. Note that unlike in traditional machine translation, …
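The attention computation in Eqs. (4)–(6) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the alignment model a is modeled here as a small single-hidden-layer feedforward network with hypothetical weight names `W_a`, `U_a`, `v_a`, and all dimensions are toy values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def attention_context(s_prev, H, W_a, U_a, v_a):
    # e_ij = a(s_{i-1}, h_j): a tiny feedforward alignment model (Eq. 6)
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in H])
    alpha = softmax(e)                    # attention weights alpha_ij
    c = (alpha[:, None] * H).sum(axis=0)  # context vector c_i (Eq. 5)
    return c, alpha

rng = np.random.default_rng(0)
Tx, d = 5, 8                       # toy source length and hidden size
H = rng.standard_normal((Tx, d))   # encoder annotations h_1 .. h_Tx
s_prev = rng.standard_normal(d)    # previous decoder state s_{i-1}
W_a = rng.standard_normal((d, d))
U_a = rng.standard_normal((d, d))
v_a = rng.standard_normal(d)

c, alpha = attention_context(s_prev, H, W_a, U_a, v_a)
assert np.isclose(alpha.sum(), 1.0)  # weights form a distribution over source positions
assert c.shape == (d,)               # context vector has the annotation dimension
```

The softmax in Eq. (6) guarantees the weights α_{ij} sum to one, so c_i is a convex combination of the annotations.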
Convolutional over Recurrent Encoder for Neural Machine Translation
3
[Diagram: standard attention-based NMT — a two-layer RNN encoder maps source words x_1 … x_i to hidden states h; attention weights α combine the encoder states into a context vector c_j for each two-layer decoder state s_j, which emits target words y_1 … y_j.]
Convolutional over Recurrent Encoder for Neural Machine Translation
4
Why do RNNs work for NMT?
✦ Recurrently encode the history of large, variable-length input sequences
✦ Capture long-distance dependencies, which are common in natural language text
Convolutional over Recurrent Encoder for Neural Machine Translation
5
RNN for NMT:
✤ Disadvantages:
	✤ Slow: doesn't allow parallel computation within a sequence
	✤ Non-uniform composition: for each state, the first word is processed many times while the last is processed only once
	✤ Dense representation: each h_i is a compact summary of the source sentence up to word i
	✤ Focus on a global representation rather than on local features
Convolutional over Recurrent Encoder for Neural Machine Translation
6
CNN in NLP:
✤ Unlike RNNs, CNNs apply over a fixed-size window of the input
	✤ This allows parallel computation
✤ Represent a sentence in terms of features:
	✤ a weighted combination of multiple words or n-grams
✤ Very successful in learning sentence representations for various tasks
	✤ Sentiment analysis, question classification (Kim 2014; Kalchbrenner et al. 2014)
Convolutional over Recurrent Encoder for Neural Machine Translation
7
Convolution over Recurrent encoder (CoveR):
✤ Can CNNs help NMT?
	✤ Instead of single recurrent outputs, we can use a composition of multiple hidden-state outputs of the encoder
✤ Convolution over recurrent:
	✤ We apply multiple layers of fixed-size convolution filters over the output of the RNN encoder at each time step
	✤ This can provide wider context about the relevant features of the source sentence
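The idea above can be sketched in NumPy: stack width-3 convolutions, with zero padding and residual connections, over the sequence of RNN encoder outputs. This is a toy sketch with made-up dimensions and random filters, not the trained model (the paper uses 3 layers, filter width 3, and 1000-dimensional states).

```python
import numpy as np

def conv_layer(H, W, b):
    """One width-3 convolutional layer over RNN outputs H (shape T x d),
    zero-padded on both sides so the sequence length is preserved."""
    T, d = H.shape
    Hp = np.vstack([np.zeros((1, d)), H, np.zeros((1, d))])  # pad both ends
    # each output position i combines h_{i-1}, h_i, h_{i+1}
    out = np.stack([np.tanh(Hp[i:i + 3].reshape(-1) @ W + b) for i in range(T)])
    return out + H  # residual connection between layers

rng = np.random.default_rng(0)
T, d = 6, 4                        # toy sequence length and hidden size
H = rng.standard_normal((T, d))    # RNN encoder outputs h_1 .. h_T
layers = [(rng.standard_normal((3 * d, d)) * 0.1, np.zeros(d)) for _ in range(3)]

CN = H
for W, b in layers:                # 3 stacked conv layers, as in the CoveR setup
    CN = conv_layer(CN, W, b)
assert CN.shape == H.shape         # one feature vector CN_i per source position
```

After three width-3 layers, each CN_i has mixed in information from a 7-state window around h_i, which is the "wider context" the slide refers to.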
Convolutional over Recurrent Encoder for Neural Machine Translation
8
CoveR model
[Diagram: CoveR model — a two-layer RNN encoder over x_1 … x_i, followed by CNN layers with zero padding on both sides producing features CN_i; attention weights α combine the CNN outputs into context vectors c'_j = ∑_i α_{ji} CN_i for the two-layer decoder, which emits y_1 … y_j.]
Convolutional over Recurrent Encoder for Neural Machine Translation
9
Convolution over Recurrent encoder:
✤ Each vector CN_i now represents a feature produced by multiple kernels over h_i
✤ Relatively uniform composition of multiple previous states and the current state
✤ Simultaneous, hence faster, processing at the convolutional layers
PBML ??? MAY 2017 (excerpt):

[Figure 1. NMT encoder-decoder framework]

[Figure 2. Convolution over Recurrent model]

The first convolutional layer computes, at each position i, a feature CN¹_i by applying a filter of width w over a window of encoder hidden states centered at i:

    CN¹_i = σ(θ · h_{i−⌊(w−1)/2⌋ : i+⌊(w−1)/2⌋} + b)

Stacking such layers yields outputs CN¹, CN², …, CNⁿ; the attention mechanism then computes the context vector c'_i over the top-layer outputs in place of c_i.
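The substitution at the heart of CoveR, attending over the convolutional features CN_i instead of the raw RNN annotations, can be shown in a few lines. This is a toy sketch with random scores standing in for the alignment model's output, not the trained system.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
T, d = 5, 4
CN = rng.standard_normal((T, d))   # top convolutional-layer outputs CN_1 .. CN_T
scores = rng.standard_normal(T)    # alignment scores e_j (stand-in for the alignment model)

alpha = softmax(scores)                       # attention weights alpha_{j,1..T}
c_prime = (alpha[:, None] * CN).sum(axis=0)   # c'_j = sum_i alpha_{ji} CN_i

assert c_prime.shape == (d,)
assert np.isclose(alpha.sum(), 1.0)
```

Only the attended vectors change relative to the baseline; the decoder and the attention formula itself stay exactly as in Eqs. (5)–(6).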
Convolutional over Recurrent Encoder for Neural Machine Translation
10
Related work:
✤ Gehring et al. 2017:
	✦ Completely replace the RNN encoder with a CNN
	✦ A simple replacement doesn't work; position embeddings are required to model dependencies
	✦ Requires 6-15 convolutional layers to compete with a 2-layer RNN
✤ Meng et al. 2015:
	✦ For phrase-based MT, use a CNN language model as an additional feature
Convolutional over Recurrent Encoder for Neural Machine Translation
11
Experimental setting:
✤ Data:
	✦ WMT-2015 En-De training data: 4.2M sentence pairs
	✦ Dev: WMT 2013 test set
	✦ Test: WMT 2014 and WMT 2015 test sets
✤ Baseline:
	✦ Two-layer unidirectional LSTM encoder
	✦ Embedding size, hidden size = 1000
	✦ Vocabulary: source 60k, target 40k
Convolutional over Recurrent Encoder for Neural Machine Translation
12
Experimental setting:
✤ CoveR:
	✦ Encoder: 3 convolutional layers over the RNN output
	✦ Decoder: same as the baseline
	✦ Convolutional filters of size 3
	✦ Output dimension: 1000
	✦ Zero padding on both sides at each layer, no pooling
	✦ Residual connections (He et al., 2015) between intermediate layers
Convolutional over Recurrent Encoder for Neural Machine Translation
13
Experimental setting:
✤ Deep RNN encoder:
	✦ Comparing the 2-layer RNN encoder baseline to CoveR is unfair
		• The improvement may be due simply to the increased number of parameters
	✦ We therefore also compare with a deep RNN encoder with 5 layers
	✦ The 2 decoder layers are initialized through a non-linear transformation of the encoder's final states
Convolutional over Recurrent Encoder for Neural Machine Translation
14
BLEU scores (* = significant at p < 0.05)

| Model            | Dev  | wmt14 | wmt15 |
| Baseline         | 17.9 | 15.8  | 18.5  |
| Deep RNN encoder | 18.3 | 16.2  | 18.7  |
| CoveR            | 18.5 | 16.9* | 19.0* |

Result
✤ Compared to the baseline:
	✦ +1.1 BLEU for WMT-14 and +0.5 for WMT-15
✤ Compared to the deep RNN encoder:
	✦ +0.7 BLEU for WMT-14 and +0.3 for WMT-15
Convolutional over Recurrent Encoder for Neural Machine Translation
15
#parameters and decoding speed

| Model            | #parameters (millions) | avg sec/sent |
| Baseline         | 174                    | 0.11         |
| Deep RNN encoder | 283                    | 0.28         |
| CoveR            | 183                    | 0.14         |

Result
✤ CoveR model:
	✤ Slightly slower than the baseline but faster than the deep RNN encoder
	✤ Slightly more parameters than the baseline but fewer than the deep RNN encoder
	✤ Improvements are not just due to an increased number of parameters
Convolutional over Recurrent Encoder for Neural Machine Translation
16
Qualitative analysis:
✤ Increased output length
✤ With additional context, the CoveR model generates complete translations
P. Dakwale, C. Monz: Convolutional over Recurrent Encoder for NMT (1–12)

Table 2. Translation examples. Words in bold show correct translations produced by our model as compared to the baseline.

| Model     | Avg sent length |
| Baseline  | 18.7 |
| Deep RNN  | 19.0 |
| CoveR     | 19.9 |
| Reference | 20.9 |
Convolutional over Recurrent Encoder for Neural Machine Translation
17
Qualitative analysis:
✤ More uniform attention distribution
✤ Generation of correct composite words
Convolutional over Recurrent Encoder for Neural Machine Translation
18
Qualitative analysis:
✤ More uniform attention distribution
[Figure 3. Attention distribution for Baseline]
[Figure 4. Attention distribution for CoveR model]
✤ The baseline translates 'itinerary' as 'strecke' (road, distance)
	✤ It pays attention only to 'itinerary' for this position
✤ CoveR translates 'itinerary' as 'reiseroute'
	✤ It also pays attention to the final verb
Convolutional over Recurrent Encoder for Neural Machine Translation
19
Conclusion:
✤ CoveR: multiple convolutional layers over an RNN encoder
✤ Significant improvements over a standard LSTM baseline
✤ Increasing the number of LSTM layers improves slightly, but convolutional layers perform better
✤ Faster, and with fewer parameters, than a fully RNN encoder of the same size
✤ The CoveR model can improve coverage and provide a more uniform attention distribution
Questions ?
Thanks