Advances in Neural Machine Translation

Rico Sennrich, Alexandra Birch, Marcin Junczys-Dowmunt
Institute for Language, Cognition and Computation
University of Edinburgh

November 1, 2016

Slides: homepages.inf.ed.ac.uk/rsennric/amta2016-tutorial.pdf


Machine Translation

Why we need MT
- human translation industry: ≈ 666 million words / day [Pym et al., 2012]
- MT industry: over 100 billion words / day [Turovsky, 2016]
- demand for translation far outpaces what is humanly possible to produce
  → we need fast, high-quality MT

Neural Machine Translation

[Figure: Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/]

Edinburgh's* WMT results over the years

BLEU on newstest2013 (EN→DE):

                     2013   2014   2015   2016
  phrase-based SMT   20.3   20.9   20.8   21.5
  syntax-based SMT   19.4   20.2   22.0   22.1
  neural MT             –      –   18.9   24.7

*NMT 2015 from U. Montréal: https://sites.google.com/site/acl16nmt/

neural MT has already moved from academia into production

Why neural MT?

- a single, end-to-end trained neural network replaces a collection of weak features
- good generalization via continuous space representations
  → modelling of dependencies over long distances

why now?
- neural translation dates back to at least the 80s [Allen, 1987]
- large-scale neural MT is now possible thanks to:
  - large amounts of training data
  - exponential growth in computational power (GPUs!)
  - algorithmic advances

Neural Machine Translation

1 Neural Networks — Basics
2 Recurrent Neural Networks and LSTMs
3 Attention-based NMT Model
4 Where are we now? Evaluation and challenges
  - Evaluation results
  - Comparing neural and phrase-based machine translation
5 Recent Research in Neural Machine Translation

Linear Regression

Parameters: θ = [θ0, θ1]
Model: hθ(x) = θ0 + θ1x

[Figure: scatter plot of Population (x) vs. Profit (y), shown first without a fit and then
with the candidate lines y = −5.00 + 1.50x, y = −6.00 + 2.00x, y = −2.50 + 1.00x and
y = −3.90 + 1.19x]

The cost (or loss) function

We try to find parameters θ ∈ R² such that the cost function J(θ) is minimal:

    J : R² → R
    θ = argmin_{θ∈R²} J(θ)

Mean Square Error:

    J(θ) = 1/(2m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²
         = 1/(2m) Σ_{i=1}^{m} (θ0 + θ1x^(i) − y^(i))²

where m is the number of data points in the training set.
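Not from the slides: a minimal NumPy sketch of the model and cost defined above. The toy data is invented for illustration and will not reproduce the J values quoted below.

    import numpy as np

    def h(theta, x):
        # linear model h_theta(x) = theta0 + theta1 * x
        return theta[0] + theta[1] * x

    def cost(theta, x, y):
        # mean square error J(theta) = 1/(2m) * sum_i (h_theta(x^(i)) - y^(i))^2
        m = len(x)
        return np.sum((h(theta, x) - y) ** 2) / (2 * m)

    # toy data (invented for illustration)
    x = np.array([6.1, 5.5, 8.5, 7.0, 5.8])
    y = np.array([17.6, 9.1, 13.7, 11.9, 6.8])
    print(cost(np.array([-5.0, 1.5]), x, y))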

The cost (or loss) function

Cost of the candidate fits on the example data:

    J([−5.00, 1.50]) = 6.1561
    J([−6.00, 2.00]) = 19.3401
    J([−2.50, 1.00]) = 4.7692
    J([−3.90, 1.19]) = 4.4775

[Figure: the Population/Profit scatter plot with each candidate line]

The cost (or loss) function

So, how do we find θ = argmin_{θ∈R²} J(θ) computationally?

[Figure: surface plot of J(θ) over θ0 ∈ [−10, 10] and θ1 ∈ [−1, 4], with J values up to ≈ 800]

(Stochastic) gradient descent

    θj := θj − α ∂/∂θj J(θ)     for each j

[Figure: gradient descent trajectory on the J(θ) surface, shown after 0, 1, 20, 200 and
10000 steps with α = 0.01, after 10000 steps with α = 0.005 and α = 0.02, and after
10 steps with α = 0.025]

(Stochastic) gradient descent

How do we calculate ∂/∂θj J(θ)?

    ∂/∂θj J(θ) = ∂/∂θj 1/(2m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²

               = 2 · 1/(2m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · ∂/∂θj (hθ(x^(i)) − y^(i))

               = 1/m Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · ∂/∂θj Σ_{k=0}^{n} θk x_k^(i)

               = 1/m Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) x_j^(i)

The update rule once again

For linear regression we have the following model:

    hθ(x) = θ0 + θ1x

and we repeat until convergence (θ0 and θ1 should be updated simultaneously):

    θ0 := θ0 − α · 1/m Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))
    θ1 := θ1 − α · 1/m Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) x^(i)
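Not from the slides: a sketch of this update loop in NumPy (batch gradient descent; the function name and the fixed step count instead of a convergence test are my own choices).

    import numpy as np

    def gradient_descent(x, y, alpha=0.01, steps=10000):
        # batch gradient descent for h_theta(x) = theta0 + theta1 * x
        theta = np.zeros(2)
        m = len(x)
        for _ in range(steps):
            err = theta[0] + theta[1] * x - y              # h_theta(x^(i)) - y^(i)
            grad = np.array([err.mean(), (err * x).mean()])
            theta = theta - alpha * grad                   # simultaneous update of theta0, theta1
        return theta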

To summarize what we have learned

When approaching a machine learning problem, we need:
- a suitable model (here: a linear model)
- a suitable cost (or loss) function (here: mean square error)
- an optimization algorithm (here: a variant of SGD)
- the gradient(s) of the cost function (if required by the optimization algorithm)

Side note: algorithms for finding the minimum without the gradient
- for linear regression: the normal equations (exact)
- random search
- genetic algorithms
- ...

A neuron

[Diagram: inputs 1, x1, x2, ..., xn (input layer) connect with weights θ0, θ1, θ2, ..., θn
to a single neuron (layer 1)]

    Input function:      z = Σ_{i=0}^{n} θi xi     (x0 = 1 is the bias input)
    Activation function: g(z)
    Output:              g(z)

Linear regression and neural networks

The same neuron with the identity activation is exactly linear regression:

    Input function:      z = Σ_{i=0}^{n} θi xi
    Activation function: g(z) = z

The logistic function (remember this one!)

[Figure: the logistic (sigmoid) curve g(z) = 1 / (1 + e^−z)]

A more typical neuron (binary logistic regression)

[Diagram: the same neuron, now with a sigmoid activation]

    Input function:      z = Σ_{i=0}^{n} θi xi
    Activation function: g(z) = 1 / (1 + e^−z)

Binary logistic regression

Model:

    hθ(x) = g(Σ_{i=0}^{n} θi xi) = 1 / (1 + e^−Σ_{i=0}^{n} θi xi)

Cost function (binary cross-entropy):

    J(θ) = −1/m [ Σ_{i=1}^{m} y^(i) log hθ(x^(i)) + (1 − y^(i)) log(1 − hθ(x^(i))) ]

Gradient:

    ∂J(θ)/∂θj = 1/m Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) x_j^(i)
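Not from the slides: a NumPy sketch of the model, cost and gradient above, assuming X carries a leading column of ones so that θ0 acts as the bias.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cost_and_gradient(theta, X, y):
        # X: (m, n+1) with a leading column of ones, y: (m,) labels in {0, 1}
        m = X.shape[0]
        h = sigmoid(X @ theta)
        J = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))   # binary cross-entropy
        grad = X.T @ (h - y) / m                                 # 1/m * sum_i (h - y) * x_j
        return J, grad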

Multi-class logistic regression and neural networks

[Diagram: inputs 1, x1, x2, ..., xn fully connected (weights θ^(0), ..., θ^(2)) to an output
layer with g(z) = softmax(z), producing P(c = 0 | Θ, X), P(c = 1 | Θ, X), P(c = 2 | Θ, X)]

Multi-class logistic regression

Model:

    hΘ(x) = [P(k | x, Θ)]_{k=1,...,c} = softmax(Θx)     where Θ = (θ^(1), ..., θ^(c))

Cost function:

    J(Θ) = −1/m Σ_{i=1}^{m} Σ_{k=1}^{c} δ(y^(i), k) log P(k | x^(i), Θ)

    where δ(x, y) = 1 if x = y, and 0 otherwise

Gradient:

    ∂J(Θ)/∂Θ_{j,k} = −1/m Σ_{i=1}^{m} (δ(y^(i), k) − P(k | x^(i), Θ)) x_j^(i)

May look complicated, but can be looked up!
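Not from the slides: a sketch of the softmax model, cross-entropy cost and gradient above; the one-hot matrix Y plays the role of δ(y^(i), k).

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=1, keepdims=True)      # for numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)

    def cost_and_gradient(Theta, X, y, c):
        # X: (m, n+1) with a leading ones column, y: (m,) class ids in 0..c-1, Theta: (n+1, c)
        m = X.shape[0]
        P = softmax(X @ Theta)                    # (m, c) class probabilities
        Y = np.eye(c)[y]                          # one-hot targets = delta(y^(i), k)
        J = -np.mean(np.sum(Y * np.log(P), axis=1))
        grad = -X.T @ (Y - P) / m                 # (n+1, c), matches the slide formula
        return J, grad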


Deep learning: multi-layer neural networks

[Diagram: inputs x1, ..., xn, two hidden layers of units a^(1)_1..a^(1)_3 and a^(2)_1..a^(2)_3
(each with a bias unit), and an output unit a^(3)_1, connected by weights Θ^(l)_{i,j} and
biases β^(l)]

Why multiple layers?

[Figure: two classes of dots in [−1, 1]² that no straight line can separate]

Can a linear model separate these dots?

    h(x) = θ0 + θ1x1 + θ2x2                                   (a linear decision boundary: no)

    h(x) = θ0 + θ1x1 + θ2x2 + θ3x1² + θ4x1x2 + θ5x2²           (hand-crafted quadratic features: yes)

    h(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4 + θ5x5   where x3 = x2², ...
                                                              (the same, written as feature expansion)

    h(x) = σ(Θ2 σ(Θ1 x))   where |Θ1| = 3×3, |Θ2| = 3×1        (a small neural network learns the features)
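Not from the slides: a sketch of the last model, a two-layer network with three hidden units (random weights only; training it on the dots would use the cost and gradient machinery from the earlier slides).

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def two_layer_net(x, Theta1, Theta2):
        # h(x) = sigmoid(Theta2 @ sigmoid(Theta1 @ x)) for one example x = [1, x1, x2]
        hidden = sigmoid(Theta1 @ x)       # 3 hidden units
        return sigmoid(Theta2 @ hidden)    # scalar output in (0, 1)

    x = np.array([1.0, 0.3, -0.4])               # bias, x1, x2
    Theta1 = np.random.randn(3, 3) * 0.1         # 3x3, as on the slide
    Theta2 = np.random.randn(1, 3) * 0.1         # the slide's 3x1 matrix, stored as a 1x3 row
    print(two_layer_net(x, Theta1, Theta2))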

Feed-forward network language models

[Figure: feed-forward language model architecture]
Source: Philipp Koehn, draft chapter on neural machine translation.

Backpropagation – forward step

[Diagram: the 3-layer network from before, annotated with the forward computation]

    a^(0) = x
    z^(1) = Θ^(1) a^(0) + β^(1)     g^(1)(x) = tanh(x)     a^(1) = g^(1)(z^(1))
    z^(2) = Θ^(2) a^(1) + β^(2)     g^(2)(x) = tanh(x)     a^(2) = g^(2)(z^(2))
    z^(3) = Θ^(3) a^(2) + β^(3)     g^(3)(x) = tanh(x)     a^(3) = g^(3)(z^(3))

The four fundamental equations of Backpropagation

    δ^L = ∇_{a^L} J(Θ) ⊙ (g^L)′(z^L)                    (BP1)

    δ^l = ((Θ^{l+1})^T δ^{l+1}) ⊙ (g^l)′(z^l)            (BP2)

    ∇_{β^l} J(Θ) = δ^l                                   (BP3)

    ∇_{Θ^l} J(Θ) = δ^l (a^{l−1})^T                        (BP4)

(⊙ is the element-wise product.)

The Backpropagation Algorithm

For one training example (x, y):

- Input: set the activations of the input layer, a^0 = x
- Forward step: for l = 1, ..., L calculate
      z^l = Θ^(l) a^{l−1} + β^l   and   a^l = g^l(z^l)
- Output error δ^L: calculate the vector
      δ^L = ∇_{a^L} J(Θ) ⊙ (g^L)′(z^L)
- Error backpropagation: for l = L−1, L−2, ..., 1 calculate
      δ^l = ((Θ^{l+1})^T δ^{l+1}) ⊙ (g^l)′(z^l)
- Gradients:
      ∇_{Θ^l} J(Θ) = δ^l (a^{l−1})^T   and   ∇_{β^l} J(Θ) = δ^l
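Not from the slides: a NumPy sketch of this algorithm for the tanh network of the forward-step slide, assuming a squared-error cost (function names and data layout are my own).

    import numpy as np

    def forward(x, Thetas, betas):
        # forward step: returns pre-activations z^l and activations a^l (tanh everywhere)
        a, zs, activations = x, [], [x]
        for Theta, beta in zip(Thetas, betas):
            z = Theta @ a + beta
            a = np.tanh(z)
            zs.append(z)
            activations.append(a)
        return zs, activations

    def backprop(x, y, Thetas, betas):
        # backpropagation for one example under a squared-error cost J = 1/2 * ||a^L - y||^2
        zs, activations = forward(x, Thetas, betas)
        grads_Theta = [None] * len(Thetas)
        grads_beta = [None] * len(betas)
        # BP1: output error, with tanh'(z) = 1 - tanh^2(z)
        delta = (activations[-1] - y) * (1 - np.tanh(zs[-1]) ** 2)
        for l in range(len(Thetas) - 1, -1, -1):
            grads_Theta[l] = np.outer(delta, activations[l])    # BP4
            grads_beta[l] = delta                               # BP3
            if l > 0:
                # BP2: propagate the error to the previous layer
                delta = (Thetas[l].T @ delta) * (1 - np.tanh(zs[l - 1]) ** 2)
        return grads_Theta, grads_beta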

Backpropagation – backward step

[Diagram: the same network, annotated with the backward computation]

    δ^(3) = (a^(3) − y) ⊙ (1 − tanh²(z^(3)))
    δ^(2) = (Θ^(3))^T δ^(3) ⊙ (1 − tanh²(z^(2)))
    δ^(1) = (Θ^(2))^T δ^(2) ⊙ (1 − tanh²(z^(1)))

Backpropagation and stochastic gradient descent

One iteration:
- For all parameters Θ = (Θ^1, ..., Θ^L) create zero-valued helper matrices ∆ = (∆^1, ..., ∆^L)
  of the same size (β omitted for simplicity).
- For the m examples in the batch, i = 1, ..., m:
  - perform backpropagation for example (x^(i), y^(i)) and store the gradients ∇_Θ J^(i)(Θ)
  - ∆ := ∆ + 1/m ∇_Θ J^(i)(Θ)
- Update the weights: Θ := Θ − α∆

More complicated network architectures

- Textbook backpropagation is formulated in terms of layers, weights, biases, activations,
  weighted inputs, ...
- Actual architectures can contain concatenation of bidirectional RNN states, ...
- What's the derivative of the "concatenation" operation?

Computing derivatives with reverse-mode autodiff

    f(x1, x2) = sin(x1) + x1x2

    ∂f/∂x1 = ?          ∂f/∂x2 = ?

Computation graphs to the rescue

[Diagram: expression graph for f(x1, x2) — x1 feeds into sin and ∗, x2 feeds into ∗, and the
results are combined by +]

Computation graphs to the rescue

Forward pass, one intermediate value per node:

    w1 = x1          w2 = x2
    w3 = w1 · w2     w4 = sin(w1)
    w5 = w3 + w4

Computation graphs to the rescue

Reverse pass, propagating derivatives from the output back to the inputs:

    ∂f/∂w5 = 1
    ∂f/∂w4 = ∂f/∂w5 · ∂w5/∂w4 = 1
    ∂f/∂w3 = ∂f/∂w5 · ∂w5/∂w3 = 1
    ∂f/∂w2 = ∂f/∂w3 · ∂w3/∂w2 = w1
    ∂f/∂w1 = ∂f/∂w4 · ∂w4/∂w1 + ∂f/∂w3 · ∂w3/∂w1 = cos(w1) + w2

    ∂f/∂x1 = ∂f/∂w1 = cos(x1) + x2
    ∂f/∂x2 = ∂f/∂w2 = x1
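Not from the slides: the same forward and reverse pass written out in plain Python (the input values at the end are made up for illustration).

    import math

    def f_and_grads(x1, x2):
        # forward pass for f(x1, x2) = sin(x1) + x1*x2, one value per graph node
        w1, w2 = x1, x2
        w3 = w1 * w2
        w4 = math.sin(w1)
        w5 = w3 + w4
        # reverse pass: d(f)/d(node), propagated from the output back to the inputs
        w5_bar = 1.0
        w4_bar = w5_bar * 1.0                           # d(w3 + w4)/dw4
        w3_bar = w5_bar * 1.0                           # d(w3 + w4)/dw3
        w1_bar = w4_bar * math.cos(w1) + w3_bar * w2    # contributions from sin and *
        w2_bar = w3_bar * w1
        return w5, w1_bar, w2_bar

    print(f_and_grads(0.5, 2.0))   # value, df/dx1 = cos(0.5) + 2, df/dx2 = 0.5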

Computation graph for neural networks

    a = softmax(x · W + b)
    o = mean(sum(log(a) ⊙ y))

[Diagram: the corresponding graph — inputs "x" and "y", parameters "W" and "b", and nodes
•, +, softmax, log, ×, sum, mean producing the "cost"]

Computation graph for neural networks

    a0 = x
    a1 = ReLU(a0 · W0 + b0)
    a2 = ReLU(a1 · W1 + b1)
    a3 = a2 · W2 + b2
    o1 = softmax(a3)                   ("scores")
    o2 = mean(crossentropy(a3, y))     ("cost")

[Diagram: the corresponding graph with inputs "x" and "y" and parameters W0, b0, W1, b1, W2, b2]

Neural Machine Translation

2 Recurrent Neural Networks and LSTMs

Recurrent neural networks (RNNs)

[Figure: an RNN cell and its unrolling over time]
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

    h_t = tanh(W_h · h_{t−1} + W_x · x_t + b)

or equivalently, writing [h_{t−1}, x_t] for the concatenation of the two vectors:

    h_t = tanh(W · [h_{t−1}, x_t] + b)
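Not from the slides: a sketch of this recurrence in NumPy.

    import numpy as np

    def rnn_step(h_prev, x_t, W_h, W_x, b):
        # one step of a vanilla RNN: h_t = tanh(W_h h_{t-1} + W_x x_t + b)
        return np.tanh(W_h @ h_prev + W_x @ x_t + b)

    def rnn_forward(xs, W_h, W_x, b):
        # run the RNN over a sequence of input vectors, returning all hidden states
        h = np.zeros(W_h.shape[0])
        hs = []
        for x_t in xs:
            h = rnn_step(h, x_t, W_h, W_x, b)
            hs.append(h)
        return hs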

Recurrent neural network language models

[Figure: recurrent language model architecture]
Source: Philipp Koehn, draft chapter on neural machine translation.

Fun with RNNs

Andrej Karpathy: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
- character-level language models
- Python code generation
- poetry generation
- ...

RNNs and long distance dependencies

[Figures: unrolled RNNs illustrating short- and long-range dependencies]
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs

Long Short-Term Memory (LSTM)

[Figures: the repeating module of an LSTM]
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs

Long Short-Term Memory (LSTM) – Step-by-step

[Figures: the LSTM cell broken down into forget gate, input gate, cell-state update and
output gate]
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs

Gated Recurrent Units (GRUs)

[Figure: the GRU cell]
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs

RNNs in Encoder-Decoder architectures

[Figure: encoder RNN and decoder RNN connected through the encoder's final hidden state]

Neural Machine Translation

3 Attention-based NMT Model

Neural Machine Translation

[Figure: Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/]

Translation modelling

Decomposition of the translation problem (for NMT):
- a source sentence S of length m is a sequence x1, ..., xm
- a target sentence T of length n is a sequence y1, ..., yn

    T* = argmax_T p(T|S)

    p(T|S) = p(y1, ..., yn | x1, ..., xm)
           = ∏_{i=1}^{n} p(yi | y0, ..., yi−1, x1, ..., xm)

Translation modelling

Difference from a language model
- target-side language model:

      p(T) = ∏_{i=1}^{n} p(yi | y0, ..., yi−1)

- translation model:

      p(T|S) = ∏_{i=1}^{n} p(yi | y0, ..., yi−1, x1, ..., xm)

- we could just treat the sentence pair as one long sequence, but:
  - we do not care about p(S) (S is given)
  - we may want a different vocabulary and network architecture for the source text
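Not from the slides: the chain-rule decomposition written as a scoring loop; `cond_prob` is a purely hypothetical callback standing in for whatever model supplies p(yi | y<i, S).

    import math

    def score_translation(target_tokens, source_tokens, cond_prob):
        # log p(T|S) = sum_i log p(y_i | y_0..y_{i-1}, x_1..x_m)
        # cond_prob(prefix, source, token) is a hypothetical callback returning that probability
        logp = 0.0
        prefix = []
        for token in target_tokens:
            logp += math.log(cond_prob(prefix, source_tokens, token))
            prefix.append(token)
        return logp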

Translating with RNNs

Encoder-decoder [Sutskever et al., 2014, Cho et al., 2014]
- two RNNs (LSTM or GRU):
  - the encoder reads the input and produces hidden state representations
  - the decoder produces output, based on the last encoder hidden state
- encoder and decoder are learned jointly
  → supervision signal from parallel text is backpropagated

[Figure: Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-2/]

Summary vector

- the last encoder hidden state "summarizes" the source sentence
- with multilingual training, we can potentially learn a language-independent meaning
  representation

[Figure from Sutskever et al., 2014]

Summary vector as information bottleneck

- can a fixed-size vector represent the meaning of an arbitrarily long sentence?
- empirically, quality decreases for long sentences
- reversing the source sentence brings some improvement [Sutskever et al., 2014]

[Figure from Sutskever et al., 2014]

Attentional encoder-decoder

Encoder
- goal: avoid the bottleneck of the summary vector
- use a bidirectional RNN, and concatenate forward and backward states
  → annotation vector h_i
- represent the source sentence as a vector of n annotations
  → variable-length representation

[Figure: Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/]

Attentional encoder-decoder

Attention
- problem: how to incorporate the variable-length context into the hidden state?
- the attention model computes the context vector as a weighted average of the annotations
- the weights are computed by a feedforward neural network with softmax activation

[Figure: Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/]

Attentional encoder-decoder: math

Simplifications of the model by [Bahdanau et al., 2015] (for illustration):
- plain RNN instead of GRU
- simpler output layer
- we do not show bias terms

Notation
- W, U, E, C, V are weight matrices (of different dimensionality):
    E   one-hot to embedding           (e.g. 50000 × 512)
    W   embedding to hidden            (e.g. 512 × 1024)
    U   hidden to hidden               (e.g. 1024 × 1024)
    C   context (2× hidden) to hidden  (e.g. 2048 × 1024)
    Vo  hidden to one-hot              (e.g. 1024 × 50000)
- separate weight matrices for encoder and decoder (e.g. Ex and Ey)
- input X of length Tx; output Y of length Ty

Attentional encoder-decoder: math

Encoder:

    →h_j = 0                                     if j = 0
    →h_j = tanh(→W_x E_x x_j + →U_x →h_{j−1})    if j > 0

    ←h_j = 0                                     if j = Tx + 1
    ←h_j = tanh(←W_x E_x x_j + ←U_x ←h_{j+1})    if j ≤ Tx

    h_j = (→h_j, ←h_j)

Page 112: Advances in Neural Machine Translationhomepages.inf.ed.ac.uk/rsennric/amta2016-tutorial.pdf · Advances in Neural Machine Translation Rico Sennrich, Alexandra Birch, Marcin Junczys-Dowmunt

Attentional encoder-decoder: math

decoder

$$s_i = \begin{cases} \tanh(W_s \overleftarrow{h}_1) & \text{if } i = 0 \\ \tanh(W_y E_y y_i + U_y s_{i-1} + C c_i) & \text{if } i > 0 \end{cases}$$

$$t_i = \tanh(U_o s_{i-1} + W_o E_y y_{i-1} + C_o c_i)$$

$$y_i = \operatorname{softmax}(V_o t_i)$$

attention model

$$e_{ij} = v_a^\top \tanh(W_a s_{i-1} + U_a h_j)$$

$$\alpha_{ij} = \operatorname{softmax}(e_{ij})$$

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$
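One attention step, written out in NumPy for illustration only (the weight matrices are stand-ins for the learned parameters above, not a reference implementation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(s_prev, H, Wa, Ua, va):
    """Score every annotation h_j against the previous decoder state s_{i-1},
    normalise the scores, and return the weighted average as context c_i."""
    e = np.array([va @ np.tanh(Wa @ s_prev + Ua @ h_j) for h_j in H])   # scores e_ij
    alpha = softmax(e)                                                   # weights alpha_ij
    return alpha @ H, alpha                                              # context c_i, weights
```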


Attention model

attention model
side effect: we obtain an alignment between source and target sentence

information can also flow along recurrent connections, so there is no guarantee that attention corresponds to alignment

applications:
visualisation
replace unknown words with a back-off dictionary [Jean et al., 2015]
...

Figure: Kyunghyun Cho, http://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/


Attention model

attention model also works with images:

[Cho et al., 2015]


Applications

score a translation
p(La, croissance, économique, s’est, ralentie, ces, dernières, années, . | Economic, growth, has, slowed, down, in, recent, years, .) = ?

generate the most probable translation of a source sentence → decoding
y* = argmax_y p(y | Economic, growth, has, slowed, down, in, recent, years, .)


Decoding

exact search
generate every possible sentence T in the target language

compute the score p(T|S) for each

pick the best one

intractable: |vocab|^N translations for output length N → we need an approximate search strategy


Decoding

approximate search/1
at each time step, compute the probability distribution P(y_i | X, y_<i)
select y_i according to some heuristic:

sampling: sample from P(y_i | X, y_<i)
greedy search: pick argmax_y P(y_i | X, y_<i)

continue until we generate <eos>

efficient, but suboptimal
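A sketch of greedy decoding. It assumes a hypothetical model.step(annotations, prefix) that returns the next-word distribution P(y_i | X, y_<i); this interface is made up for illustration.

```python
import numpy as np

def greedy_decode(model, annotations, eos_id, max_len=100):
    prefix = []
    for _ in range(max_len):
        probs = model.step(annotations, prefix)   # P(y_i | X, y_<i) over the vocabulary
        y = int(np.argmax(probs))                 # greedy: always pick the most probable word
        prefix.append(y)
        if y == eos_id:                           # stop once <eos> is generated
            break
    return prefix
```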


Decoding

approximate search/2: beam search
maintain a list of K hypotheses (the beam)

at each time step, expand each hypothesis k with $p(y_i^k \mid X, y_{<i}^k)$
select the K hypotheses with the highest total probability:

$$\prod_i p(y_i^k \mid X, y_{<i}^k)$$

relatively efficient

currently default search strategy in neural machine translation

small beam (K ≈ 10) offers good speed-quality trade-off
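A minimal beam-search sketch over the same hypothetical model.step() interface; real decoders add length normalisation and batching, which are omitted here.

```python
import numpy as np

def beam_search(model, annotations, eos_id, K=10, max_len=100):
    beam = [(0.0, [])]                                    # (cumulative log prob, prefix)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beam:
            probs = model.step(annotations, prefix)       # P(y_i | X, y_<i)
            for y in np.argsort(probs)[-K:]:              # expand only the K best continuations
                candidates.append((logp + float(np.log(probs[y])), prefix + [int(y)]))
        beam = sorted(candidates, key=lambda h: h[0], reverse=True)[:K]
        finished += [h for h in beam if h[1][-1] == eos_id]
        beam = [h for h in beam if h[1][-1] != eos_id]
        if not beam:                                      # every live hypothesis has ended
            break
    return max(finished + beam, key=lambda h: h[0])[1]
```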


Ensembles

at each timestep, combine the probability distributions of M different ensemble components.

combine operator: typically average the (log-)probabilities:

$$\log P(y_i \mid X, y_{<i}) = \frac{1}{M} \sum_{m=1}^{M} \log P_m(y_i \mid X, y_{<i})$$

requirements:
same output vocabulary
same factorization of Y

internal network architecture may be different

source representations may be different (extreme example: ensemble-like model with different source languages [Junczys-Dowmunt and Grundkiewicz, 2016])
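Ensemble decoding only changes the step that produces the next-word distribution: average the log-probabilities of the M component models. A sketch, reusing the hypothetical step() interface from above:

```python
import numpy as np

def ensemble_step(models, annotations, prefix):
    """Average log P_m(y_i | X, y_<i) over M models that share an output vocabulary."""
    logps = np.stack([np.log(m.step(annotations, prefix)) for m in models])
    return logps.mean(axis=0)
```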


Ensembles

recent ensemble strategies in NMT:
ensemble of 8 independent training runs with different hyperparameters/architectures [Luong et al., 2015a]

ensemble of 8 independent training runs with different random initializations [Chung et al., 2016]

ensemble of 4 checkpoints of the same training run [Sennrich et al., 2016a]
→ probably less effective, but only requires one training run

Figure: BLEU of a single model vs. an ensemble of 4 checkpoints for eight WMT16 translation directions (EN→CS, EN→DE, EN→RO, EN→RU, CS→EN, DE→EN, RO→EN, RU→EN) [Sennrich et al., 2016a]


Neural Machine Translation

1 Neural Networks — Basics

2 Recurrent Neural Networks and LSTMs

3 Attention-based NMT Model

4 Where are we now? Evaluation and challenges
  Evaluation results
  Comparing neural and phrase-based machine translation

5 Recent Research in Neural Machine Translation


State of Neural MT

attentional encoder-decoder networks have become state of the art on various MT tasks
your mileage may vary depending on:

language pair and text type
amount of training data
type of training resources (monolingual?)
hyperparameters


Attentional encoder-decoders (NMT) are SOTA

system          BLEU   official rank
uedin-nmt       34.2   1
metamind        32.3   2
uedin-syntax    30.6   3
NYU-UMontreal   30.8   4
online-B        29.4   5-10
KIT/LIMSI       29.1   5-10
cambridge       30.6   5-10
online-A        29.9   5-10
promt-rule      23.4   5-10
KIT             29.0   6-10
jhu-syntax      26.6   11-12
jhu-pbmt        28.3   11-12
uedin-pbmt      28.4   13-14
online-F        19.3   13-15
online-G        23.8   14-15

Table: WMT16 results for EN→DE

system          BLEU   official rank
uedin-nmt       38.6   1
online-B        35.0   2-5
online-A        32.8   2-5
uedin-syntax    34.4   2-5
KIT             33.9   2-6
uedin-pbmt      35.1   5-7
jhu-pbmt        34.5   6-7
online-G        30.1   8
jhu-syntax      31.0   9
online-F        20.2   10

Table: WMT16 results for DE→EN

(highlighting in the original tables marks pure NMT systems and systems with an NMT component)


Attentional encoder-decoders (NMT) are SOTA

system           BLEU   official rank
uedin-nmt        25.8   1
NYU-UMontreal    23.6   2
jhu-pbmt         23.6   3
cu-chimera       21.0   4-5
cu-tamchyna      20.8   4-5
uedin-cu-syntax  20.9   6-7
online-B         22.7   6-7
online-A         19.5   15
cu-TectoMT       14.7   16
cu-mergedtrees   8.2    18

Table: WMT16 results for EN→CS

system        BLEU   official rank
online-B      39.2   1-2
uedin-nmt     33.9   1-2
uedin-pbmt    35.2   3
uedin-syntax  33.6   4-5
online-A      30.8   4-6
jhu-pbmt      32.2   5-7
LIMSI         31.0   6-7

Table: WMT16 results for RO→EN

system          BLEU   official rank
uedin-nmt       31.4   1
jhu-pbmt        30.4   2
online-B        28.6   3
PJATK           28.3   8-10
online-A        25.7   11
cu-mergedtrees  13.3   12

Table: WMT16 results for CS→EN

system             BLEU   official rank
uedin-nmt          28.1   1-2
QT21-HimL-SysComb  28.9   1-2
KIT                25.8   3-7
uedin-pbmt         26.8   3-7
online-B           25.4   3-7
uedin-lmu-hiero    25.9   3-7
RWTH-SYSCOMB       27.1   3-7
LIMSI              23.9   8-10
lmu-cuni           24.3   8-10
jhu-pbmt           23.5   8-11
usfd-rescoring     23.1   10-12
online-A           19.2   11-12

Table: WMT16 results for EN→RO


Attentional encoder-decoders (NMT) are SOTA

system           BLEU   official rank
PROMT-rule       22.3   1
amu-uedin        25.3   2-4
online-B         23.8   2-5
uedin-nmt        26.0   2-5
online-G         26.2   3-5
NYU-UMontreal    23.1   6
jhu-pbmt         24.0   7-8
LIMSI            23.6   7-10
online-A         20.2   8-10
AFRL-MITLL-phr   23.5   9-10
AFRL-MITLL-verb  20.9   11
online-F         8.6    12

Table: WMT16 results for EN→RU

system               BLEU   official rank
amu-uedin            29.1   1-2
online-G             28.7   1-3
NRC                  29.1   2-4
online-B             28.1   3-5
uedin-nmt            28.0   4-5
online-A             25.7   6-7
AFRL-MITLL-phr       27.6   6-7
AFRL-MITLL-contrast  27.0   8-9
PROMT-rule           20.4   8-9
online-F             13.5   10

Table: WMT16 results for RU→EN

system        BLEU   official rank
uedin-pbmt    23.4   1-4
online-G      20.6   1-4
online-B      23.6   1-4
UH-opus       23.1   1-4
PROMT-SMT     20.3   5
UH-factored   19.3   6-7
uedin-syntax  20.4   6-7
online-A      19.0   8
jhu-pbmt      19.1   9

Table: WMT16 results for FI→EN

system           BLEU   official rank
online-G         15.4   1-3
abumatra-nmt     17.2   1-4
online-B         14.4   1-4
abumatran-combo  17.4   3-5
UH-opus          16.3   4-5
NYU-UMontreal    15.1   6-8
abumatran-pbsmt  14.6   6-8
online-A         13.0   6-8
jhu-pbmt         13.8   9-10
UH-factored      12.8   9-12
aalto            11.6   10-13
jhu-hltcoe       11.9   10-13
UUT              11.6   11-13

Table: WMT16 results for EN→FI


Neural Machine Translation

1 Neural Networks — Basics

2 Recurrent Neural Networks and LSTMs

3 Attention-based NMT Model

4 Where are we now? Evaluation and challenges
  Evaluation results
  Comparing neural and phrase-based machine translation

5 Recent Research in Neural Machine Translation


Interlude: why is (machine) translation hard?

ambiguity
words are often polysemous, with different translations for different meanings

system       sentence
source       Dort wurde er von dem Schläger und einer weiteren männlichen Person erneut angegriffen.
reference    There he was attacked again by his original attacker and another male.
uedin-nmt    There he was attacked again by the racket and another male person.
uedin-pbsmt  There, he was at the club and another male person attacked again.

Schläger → attacker, racket, (golf) club

racket: https://www.flickr.com/photos/128067141@N07/15157111178 / CC BY 2.0
attacker: https://commons.wikimedia.org/wiki/File:Wikibully.jpg
golf club: https://commons.wikimedia.org/wiki/File:Golf_club,_Callawax_X-20_8_iron_-_III.jpg / CC-BY-SA-3.0


Interlude: why is (machine) translation hard?

word order
there are systematic word order differences between languages. We need to generate words in the correct order.

system       sentence
source       Unsere digitalen Leben haben die Notwendigkeit, stark, lebenslustig und erfolgreich zu erscheinen, verdoppelt [...]
reference    Our digital lives have doubled the need to appear strong, fun-loving and successful [...]
uedin-nmt    Our digital lives have doubled the need to appear strong, lifelike and successful [...]
uedin-pbsmt  Our digital lives are lively, strong, and to be successful, doubled [...]


Interlude: why is (machine) translation hard?

grammatical marking system
grammatical distinctions can be marked in different ways, for instance through word order (English) or inflection (German). The translator needs to produce the appropriate marking.

English  ... because the dog chased the man.
German   ... weil der Hund den Mann jagte.


Interlude: why is (machine) translation hard?

multiword expressions
the meaning of non-compositional expressions is lost in a word-to-word translation

system       sentence
source       He bends over backwards for the team, ignoring any pain.
reference    Er zerreißt sich für die Mannschaft, geht über Schmerzen drüber.
             (lit: he tears himself apart for the team)
uedin-nmt    Er beugt sich rückwärts für die Mannschaft, ignoriert jeden Schmerz.
             (lit: he bends backwards for the team)
uedin-pbsmt  Er macht alles für das Team, den Schmerz zu ignorieren.
             (lit: he does everything for the team)


Interlude: why is (machine) translation hard?

subcategorization
words only allow for specific categories of syntactic arguments, which often differ between languages.

English  he remembers his medical appointment.
German   er erinnert sich an seinen Arzttermin.
English  *he remembers himself to his medical appointment.
German   *er erinnert seinen Arzttermin.

agreement
inflected forms may need to agree over long distances to satisfy grammaticality.

English  they can not be found
French   elles ne peuvent pas être trouvées


Interlude: why is (machine) translation hard?

morphological complexity
the translator may need to analyze/generate morphologically complex words that were not seen before.

German   Abwasserbehandlungsanlage
English  waste water treatment plant
French   station d’épuration des eaux résiduaires

system       sentence
source       Titelverteidiger ist Drittligaabsteiger SpVgg Unterhaching.
reference    The defending champions are SpVgg Unterhaching, who have been relegated to the third league.
uedin-nmt    Defending champion is third-round pick SpVgg Underhaching.
uedin-pbsmt  Title defender Drittligaabsteiger Week 2.


Interlude: why is (machine) translation hard?

open vocabulary
languages have an open vocabulary, and we need to learn translations for words that we have only seen rarely (or never)

system       sentence
source       Titelverteidiger ist Drittligaabsteiger SpVgg Unterhaching.
reference    The defending champions are SpVgg Unterhaching, who have been relegated to the third league.
uedin-nmt    Defending champion is third-round pick SpVgg Underhaching.
uedin-pbsmt  Title defender Drittligaabsteiger Week 2.


Interlude: why is (machine) translation hard?

discontinuous structures
a word (sequence) can map to a discontinuous structure in another language.

English  I do not know
French   Je ne sais pas

system       sentence
source       Ein Jahr später machten die Fed-Repräsentanten diese Kürzungen rückgängig.
reference    A year later, Fed officials reversed those cuts.
uedin-nmt    A year later, FedEx officials reversed those cuts.
uedin-pbsmt  A year later, the Fed representatives made these cuts.


Interlude: why is (machine) translation hard?

discourse
the translation of referential expressions depends on discourse context, which sentence-level translators have no access to.

English  I made a decision. Please respect it.
French   J’ai pris une décision. Respectez-la s’il vous plaît.
French   J’ai fait un choix. Respectez-le s’il vous plaît.


Interlude: why is (machine) translation hard?

assorted other difficulties
underspecification

ellipsis

lexical gaps

language change

language variation (dialects, genres, domains)

ill-formed input


Comparison between phrase-based and neural MT

human analysis of NMT (reranking) [Neubig et al., 2015]
NMT is more grammatical:

word order
insertion/deletion of function words
morphological agreement

minor degradation in lexical choice?


Comparison between phrase-based and neural MT

analysis of IWSLT 2015 results [Bentivogli et al., 2016]
human-targeted translation error rate (HTER) based on automatic translation and human post-edit

4 error types: substitution, insertion, deletion, shift

                                 HTER (no shift)          HTER
system                           word   lemma   %∆        (shift only)
PBSMT [Ha et al., 2015]          28.3   23.2    -18.0     3.5
NMT [Luong and Manning, 2015]    21.7   18.7    -13.7     1.5

word-level is closer to lemma-level performance: better at inflection/agreement

improvement on lemma-level: better lexical choice

fewer shift errors: better word order


Fluency

Figure: WMT16 direct assessment results. Bar charts of adequacy and fluency for ONLINE-B vs. UEDIN-NMT on CS→EN, DE→EN, RO→EN and RU→EN; the NMT system's gain is roughly +1% in adequacy and +13% in fluency.


Why is neural MT output more grammatical?

phrase-based SMT
log-linear combination of many “weak” features

data sparseness triggers back-off to smaller units

strong independence assumptions

neural MT
end-to-end trained model

generalization via continuous space representation

output conditioned on full source text and target history


Neural Machine Translation

1 Neural Networks — Basics

2 Recurrent Neural Networks and LSTMs

3 Attention-based NMT Model

4 Where are we now? Evaluation and challenges
  Evaluation results
  Comparing neural and phrase-based machine translation

5 Recent Research in Neural Machine Translation


Efficiency

speed bottlenecks
matrix multiplication → use of highly parallel hardware (GPUs)
size of the output layer scales with vocabulary size. Solutions:

LMs: hierarchical softmax; noise-contrastive estimation; self-normalization
NMT: approximate softmax through a subset of the vocabulary [Jean et al., 2015, Mi et al., 2016, L’Hostis et al., 2016] (see the sketch below)
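A sketch of the vocabulary-selection idea: score only a candidate subset of the output vocabulary instead of every row of the output matrix. The candidate list (e.g. frequent words plus likely translations of the source words) and the variable names are illustrative, not the implementation of any of the cited systems.

```python
import numpy as np

def subset_softmax(t_i, Vo, candidate_ids):
    """Restrict the output softmax to a candidate subset of the vocabulary."""
    logits = Vo[candidate_ids] @ t_i           # score only the selected rows of Vo
    logits -= logits.max()
    p = np.exp(logits)
    return candidate_ids, p / p.sum()          # distribution over the candidates only
```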

NMT training vs. decoding (on a fast GPU)
training: slow (1-3 weeks)
decoding: fast (100 000–500 000 sentences / day)*

* with NVIDIA Titan X and amuNMT (https://github.com/emjotde/amunmt)


Efficiency

aggressive batching during decoding:
compute all prefixes in the beam in a single batch
compute multiple sentences in a single batch

8-bit inference [Wu et al., 2016]

knowledge distillation: student network mimics teacher [Kim and Rush, 2016]

Figure: overview of the knowledge distillation approaches of [Kim and Rush, 2016]: word-level knowledge distillation (cross-entropy between the student and teacher distributions for each target word), sequence-level knowledge distillation (student trained on the teacher's highest-scoring beam-search output), and sequence-level interpolation (student trained on the beam candidate most similar to the reference).

word-level knowledge distillation minimizes the cross-entropy with the teacher's distribution instead of the observed data:

$$L_{KD}(\theta; \theta_T) = -\sum_{k=1}^{|V|} q(y = k \mid x; \theta_T) \, \log p(y = k \mid x; \theta)$$

in practice this is interpolated with the usual negative log-likelihood loss:

$$L(\theta; \theta_T) = (1 - \alpha) L_{NLL}(\theta) + \alpha L_{KD}(\theta; \theta_T)$$

[Kim and Rush, 2016]
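A sketch of the word-level distillation loss above, with NumPy arrays standing in for the student and teacher output distributions (illustrative only, not the authors' code):

```python
import numpy as np

def word_kd_loss(student_probs, teacher_probs):
    """Cross-entropy between the teacher's and the student's word distributions.
    Both arrays have shape (target_length, vocab)."""
    return -np.sum(teacher_probs * np.log(student_probs))

def mixed_loss(student_probs, teacher_probs, gold_ids, alpha=0.5):
    """Interpolate the usual NLL on the gold words with the distillation loss."""
    nll = -np.sum(np.log(student_probs[np.arange(len(gold_ids)), gold_ids]))
    return (1 - alpha) * nll + alpha * word_kd_loss(student_probs, teacher_probs)
```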


Open-vocabulary translation

Why is vocabulary size a problem?
the size of one-hot input/output vectors is linear in the vocabulary size

large vocabularies are space inefficient

large output vocabularies are time inefficient

typical network vocabulary size: 30 000–100 000

What about out-of-vocabulary words?
the training set vocabulary is typically larger than the network vocabulary (1 million words or more)
at translation time, we regularly encounter novel words:

names: Barack Obama
morphologically complex words: Hand|gepäck|gebühr (’carry-on bag fee’)
numbers, URLs etc.


Open-vocabulary translation

Solutionscopy unknown words, or translate with back-off dictionary[Jean et al., 2015, Luong et al., 2015b, Gulcehre et al., 2016]→ works for names (if alphabet is shared), and 1-to-1 aligned words

use subword units (characters or others) for input/output vocabulary→ model can learn translation of seen words on subword level→ model can translate unseen words if translation is transparent

active research area [Sennrich et al., 2016c,Luong and Manning, 2016, Chung et al., 2016, Ling et al., 2015,Costa-jussà and Fonollosa, 2016, Zhao and Zhang, 2016,Lee et al., 2016]


Core idea: transparent translations

transparent translations
some translations are semantically/phonologically transparent

morphologically complex words (e.g. compounds):
solar system (English)
Sonnen|system (German)
Nap|rendszer (Hungarian)

named entities:
Obama (English; German)
Обама (Russian)
オバマ (o-ba-ma) (Japanese)

cognates and loanwords:
claustrophobia (English)
Klaustrophobie (German)
Клаустрофобия (Russian)


Subword neural machine translation

Flat representation [Sennrich et al., 2016c, Chung et al., 2016]
sentence is a sequence of subword units

Hierarchical representation [Ling et al., 2015, Luong and Manning, 2016]

sentence is a sequence of words

words are a sequence of subword units

Figure: illustration of the C2W compositional model [Ling et al., 2015]: a word vector is composed from its character sequence by a bidirectional LSTM over character embeddings, replacing the word lookup table.

open question: should attention be at the level of words or subwords?


Subword neural machine translation

Choice of subword unitcharacters: small vocabulary, long sequences

morphemes (?): hard to control vocabulary size

hybrid choice: shortlist of words, subwords for rare words

variable-length character n-grams: byte-pair encoding (BPE)

it is an open research question which subword segmentation is the best choice in terms of efficiency and effectiveness.


Byte pair encoding [Gage, 1994]

algorithm
iteratively replace the most frequent byte pair in the sequence with an unused byte

aaabdaaabac
ZabdZabac    (Z=aa)
ZYdZYac      (Y=ab)
XdXac        (X=ZY)
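A toy sketch of the compression algorithm on the running example (ties between equally frequent pairs are broken arbitrarily here, so intermediate choices may differ from the slide):

```python
from collections import Counter

def bpe_compress(seq, new_symbols=("Z", "Y", "X")):
    """Iteratively replace the most frequent adjacent pair with a fresh symbol."""
    seq, merges = list(seq), {}
    for sym in new_symbols:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges[sym] = a + b
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(sym); i += 2
            else:
                out.append(seq[i]); i += 1
        seq = out
    return "".join(seq), merges

print(bpe_compress("aaabdaaabac"))   # e.g. ('XdXac', {'Z': 'aa', 'Y': 'ab', 'X': 'ZY'})
```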


Byte pair encoding for word segmentation

bottom-up character merging
iteratively replace the most frequent pair of symbols (’A’,’B’) with ’AB’

apply on the dictionary, not on the full text (for efficiency)

output vocabulary: character vocabulary + one symbol per merge

word                   freq
’l o w </w>’           5
’l o w e r </w>’       2
’n e w e s t </w>’     6
’w i d e s t </w>’     3

freq   symbol pair          new symbol
9      (’e’, ’s’)        →  ’es’
9      (’es’, ’t’)       →  ’est’
9      (’est’, ’</w>’)   →  ’est</w>’
7      (’l’, ’o’)        →  ’lo’
7      (’lo’, ’w’)       →  ’low’
...
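The same procedure as a Python sketch, operating on the word-frequency dictionary above (a simplified version of the published algorithm; ties between equally frequent pairs are broken by first occurrence here):

```python
from collections import Counter

def pair_stats(vocab):
    """Count symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for pair in zip(word, word[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with the merged symbol."""
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = freq
    return merged

vocab = {('l','o','w','</w>'): 5, ('l','o','w','e','r','</w>'): 2,
         ('n','e','w','e','s','t','</w>'): 6, ('w','i','d','e','s','t','</w>'): 3}
for _ in range(5):
    best = pair_stats(vocab).most_common(1)[0][0]
    print(best)        # ('e','s'), ('es','t'), ('est','</w>'), ('l','o'), ('lo','w')
    vocab = merge_pair(best, vocab)
```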


Byte pair encoding for word segmentation

why BPE?
don’t waste time on frequent character sequences
→ trade-off between text length and vocabulary size

open-vocabulary:
learned operations can be applied to unknown words

alternative view: character-level model on compressed text

segmenting an unseen word with the learned merges:
’l o w e s t </w>’
’l o w es t </w>’      after (’e’, ’s’) → ’es’
’l o w est </w>’       after (’es’, ’t’) → ’est’
’l o w est</w>’        after (’est’, ’</w>’) → ’est</w>’
’lo w est</w>’         after (’l’, ’o’) → ’lo’
’low est</w>’          after (’lo’, ’w’) → ’low’
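Applying the learned merges to an unseen word such as "lowest" (a sketch: merges are applied here simply in the order they were learned, which reproduces the segmentation above):

```python
def apply_bpe(word, merges):
    """Greedily apply the learned merge operations, in the order they were learned."""
    symbols = list(word) + ['</w>']
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b); i += 2
            else:
                out.append(symbols[i]); i += 1
        symbols = out
    return symbols

merges = [('e','s'), ('es','t'), ('est','</w>'), ('l','o'), ('lo','w')]
print(apply_bpe('lowest', merges))    # ['low', 'est</w>']
```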


Fully Character-level NMT [Lee et al., 2016]

character-to-character model requires no language-specific segmentation

drawback: RNN over characters is slow (especially attention!)

(shorter) segment sequences are obtained from characters via convolution and max-pooling layers

Figure: encoder architecture of [Lee et al., 2016]: character embeddings are passed through a convolution + ReLU layer, max pooling with stride 5, a highway network, and a bidirectional GRU over the resulting (much shorter) segment embeddings.


Architecture variants

an incomplete selection
different encoder architectures:

convolution network [Kalchbrenner and Blunsom, 2013, Kalchbrenner et al., 2016]
TreeLSTM [Eriguchi et al., 2016]

modifications to the attention mechanism [Luong et al., 2015a, Feng et al., 2016, Zhang et al., 2016]

deeper networks [Zhou et al., 2016, Wu et al., 2016]

coverage model [Mi et al., 2016, Tu et al., 2016b, Tu et al., 2016a]


Sequence-level training

problem: at training time, the target-side history is reliable; at test time, it is not.

solution: instead of using the gold context, sample from the model to obtain the target context [Shen et al., 2016, Ranzato et al., 2016, Bengio et al., 2015, Wiseman and Rush, 2016]

the more efficient cross-entropy training remains in use to initialize weights


Trading-off target and source context

system       sentence
source       Ein Jahr später machten die Fed-Repräsentanten diese Kürzungen rückgängig.
reference    A year later, Fed officials reversed those cuts.
uedin-nmt    A year later, FedEx officials reversed those cuts.
uedin-pbsmt  A year later, the Fed representatives made these cuts.

problem
RNN is locally normalized at each time step

given Fed: as previous word, Ex is very likely in training data: p(Ex|Fed:) = 0.55

label bias problem: locally-normalized models may ignore input in low-entropy state

potential solutions (speculative)
sampling at training time

bidirectional decoder [Liu et al., 2016, Sennrich et al., 2016a]

context gates to trade off source and target context [Tu et al., 2016]


Monolingual Training Data

why monolingual data for phrase-based SMT?
more training data ✓
more appropriate training data (domain adaptation) ✓
relax independence assumptions ✓

why monolingual data for neural MT?
more training data ✓
more appropriate training data (domain adaptation) ✓
relax independence assumptions ✗


Training data: monolingual

Solutions/1
shallow fusion: rescore the beam with a language model [Gülçehre et al., 2015]

deep fusion: extra, LM-specific hidden layer [Gülçehre et al., 2015]

(a) Shallow Fusion (Sec. 4.1) (b) Deep Fusion (Sec. 4.2)

Figure 1: Graphical illustrations of the proposed fusion methods.

…learned by the LM from monolingual corpora is not overwritten. It is possible to use monolingual corpora as well while finetuning all the parameters, but in this paper, we alter only the output parameters in the stage of finetuning.

4.2.1 Balancing the LM and TM
In order for the decoder to flexibly balance the input from the LM and TM, we augment the decoder with a "controller" mechanism. The need to flexibly balance the signals arises depending on the work being translated. For instance, in the case of Zh-En, there are no Chinese words that correspond to articles in English, in which case the LM may be more informative. On the other hand, if a noun is to be translated, it may be better to ignore any signal from the LM, as it may prevent the decoder from choosing the correct translation. Intuitively, this mechanism helps the model dynamically weight the different models depending on the word being translated.

The controller mechanism is implemented as a function taking the hidden state of the LM as input and computing

$g_t = \sigma(v_g^\top s_t^{LM} + b_g)$,  (7)

where $\sigma$ is a logistic sigmoid function and $v_g$ and $b_g$ are learned parameters.

The output of the controller is then multiplied with the hidden state of the LM. This lets the decoder use the signal from the TM fully, while the controller controls the magnitude of the LM signal.

In our experiments, we empirically found that it was better to initialize the bias $b_g$ to a small, negative number. This allows the decoder to decide the importance of the LM only when it is deemed necessary.


In all our experiments, we set $b_g = -1$ to ensure that $g_t$ is initially 0.26 on average.

[Gülçehre et al., 2015]
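A minimal numpy sketch, with made-up sizes, of the deep-fusion controller in eq. (7) from the excerpt above: a scalar gate computed from the LM hidden state rescales the LM signal before it is combined with the translation model's hidden state.

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    d = 6                              # hypothetical LM hidden size
    v_g = 0.1 * np.random.randn(d)     # learned vector v_g
    b_g = -1.0                         # small negative bias, as in the excerpt above
    s_lm = np.random.randn(d)          # hidden state of the language model at step t
    g_t = sigmoid(v_g @ s_lm + b_g)    # controller output g_t in (0, 1)
    scaled_lm_state = g_t * s_lm       # LM hidden state scaled by the controller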


Training data: monolingual

Solutions/2
decoder is already a language model. Train encoder-decoder with added monolingual data [Sennrich et al., 2016b]

$t_i = \tanh(U_o s_{i-1} + V_o E_y y_{i-1} + C_o c_i)$

$y_i = \mathrm{softmax}(W_o t_i)$

how do we get an approximation of the context vector $c_i$?
dummy source context (moderately effective)
automatically back-translate monolingual data into the source language

name                                      2014   2015
PBSMT [Haddow et al., 2015]               28.8   29.3
NMT [Gülçehre et al., 2015]               23.6   -
shallow fusion [Gülçehre et al., 2015]    23.7   -
deep fusion [Gülçehre et al., 2015]       24.0   -
NMT baseline                              25.9   26.7
+back-translated monolingual data         29.5   30.4

Table: DE→EN translation performance (BLEU) on WMT training/test sets.
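The back-translation recipe behind the last row can be summarised as a small pipeline sketch; train() and translate() are hypothetical stand-ins for a full NMT training/decoding setup, not functions from any particular toolkit.

    def back_translation_training(parallel_pairs, mono_target, train, translate):
        # 1. train a reverse (target->source) model on the existing parallel data
        reverse_model = train(pairs=[(tgt, src) for (src, tgt) in parallel_pairs])
        # 2. translate monolingual target-side sentences into the source language
        synthetic_pairs = [(translate(reverse_model, tgt), tgt) for tgt in mono_target]
        # 3. train the final source->target model on real + synthetic pairs
        return train(pairs=parallel_pairs + synthetic_pairs)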


Training data: multilingual

Multi-source translation [Zoph and Knight, 2016]
we can condition on multiple input sentences


Figure 2: Multi-source encoder-decoder model for MT. We have two source sentences (C B A and K J I) in different languages. Each language has its own encoder; it passes its final hidden and cell state to a set of combiners (in black). The output of a combiner is a hidden state and cell state of the same dimension.

…the input gate of a typical LSTM cell. In equation 4, there are two forget gates indexed by the subscript i that serve as the forget gates for each of the incoming cells for each of the encoders. In equation 5, o represents the output gate of a normal LSTM. i, f, o, and u are all size-1000 vectors.

2.3 Multi-Source Attention
Our single-source attention model is modeled off the local-p attention model with feed input from Luong et al. (2015b), where hidden states from the top decoder layer can look back at the top hidden states from the encoder. The top decoder hidden state is combined with a weighted sum of the encoder hidden states to make a better hidden state vector $\tilde{h}_t$, which is passed to the softmax output layer. With input-feeding, the hidden state from the attention model is sent down to the bottom decoder layer at the next time step.

The local-p attention model from Luong et al. (2015b) works as follows. First, a position to look at in the source encoder is predicted by equation 9:

$p_t = S \cdot \mathrm{sigmoid}(v_p^\top \tanh(W_p h_t))$  (9)

S is the source sentence length, and $v_p$ and $W_p$ are learned parameters, with $v_p$ being a vector of dimension 1000 and $W_p$ being a matrix of dimension 1000 x 1000.

After $p_t$ is computed, a window of size 2D + 1 is looked at in the top layer of the source encoder centered around $p_t$ (D = 10). For each hidden state in this window, we compute an alignment score $a_t(s)$ between 0 and 1. This alignment score is computed by equations 10, 11 and 12:

$a_t(s) = \mathrm{align}(h_t, h_s) \exp\!\left(\frac{-(s - p_t)^2}{2\sigma^2}\right)$  (10)

$\mathrm{align}(h_t, h_s) = \frac{\exp(\mathrm{score}(h_t, h_s))}{\sum_{s'} \exp(\mathrm{score}(h_t, h_{s'}))}$  (11)

$\mathrm{score}(h_t, h_s) = h_t^\top W_a h_s$  (12)

In equation 10, $\sigma$ is set to D/2 and s is the source index for that hidden state. $W_a$ is a learnable parameter of dimension 1000 x 1000.

Once all of the alignments are calculated, $c_t$ is created by taking a weighted sum of all source hidden states multiplied by their alignment weight.

The final hidden state sent to the softmax layer is given by:

$\tilde{h}_t = \tanh(W_c[h_t; c_t])$  (13)

We modify this attention model to look at both source encoders simultaneously. We create a context vector from each source encoder, named $c^1_t$ and $c^2_t$, instead of just the $c_t$ in the single-source attention model:

$\tilde{h}_t = \tanh(W_c[h_t; c^1_t; c^2_t])$  (14)

In our multi-source attention model we now have two $p_t$ variables, one for each source encoder.

benefits:
one source text may contain information that is underspecified in the other → possible quality gains

drawbacks:
we need multiple source sentences at training and decoding time
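A small numpy sketch, with made-up dimensions, of the multi-source combination in eq. (14): the decoder hidden state is merged with one attention context vector per encoder before the softmax layer.

    import numpy as np

    d = 4                                 # hypothetical hidden size
    W_c = 0.1 * np.random.randn(d, 3 * d)
    h_t = np.random.randn(d)              # top decoder hidden state
    c1_t = np.random.randn(d)             # context vector from the attention over source 1
    c2_t = np.random.randn(d)             # context vector from the attention over source 2
    h_tilde = np.tanh(W_c @ np.concatenate([h_t, c1_t, c2_t]))   # fed to the softmax layer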


Training data: multilingual

Multilingual models [Dong et al., 2015, Firat et al., 2016a]
we can share layers (encoder/decoder/attention) of the model across language pairs

Figure 2: Multi-task learning framework for multiple-target language translation

Figure 3: Optimization for end to multi-end model

3.4 Translation with Beam Search
Although parallel corpora are available for the encoder and the decoder modeling in the training phase, the ground truth is not available during test time. During test time, translation is produced by finding the most likely sequence via beam search:

$Y = \arg\max_Y p(Y_{T_p} \mid S_{T_p})$  (15)

Given the target direction we want to translate to, beam search is performed with the shared encoder and a specific target decoder, where the search space belongs to the decoder $T_p$. We adopt a beam search algorithm similar to the one used in SMT systems (Koehn, 2004), except that we only utilize scores produced by each decoder as features. The beam size is 10 in our experiments for speed-up considerations. Beam search ends when the end-of-sentence symbol eos is generated.


benefits:
transfer learning from one language pair to the others
scalability: no need for N² − N independent models for N languages

drawbacks:
no successful generalization to language pairs with no training data (but: synthetic training data works [Firat et al., 2016b])
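A toy sketch (purely illustrative; the callables and language codes are hypothetical) of the one-to-many setting: a single shared encoder feeds one decoder per target language.

    def translate_multilingual(src_sentence, target_lang, shared_encoder, decoders):
        annotations = shared_encoder(src_sentence)   # encoder parameters shared across pairs
        return decoders[target_lang](annotations)    # language-specific decoder

    # decoders might map language codes to decoding functions, e.g.
    # decoders = {"es": decode_es, "fr": decode_fr, "pt": decode_pt, "nl": decode_nl}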


Training data: multilingual

Multilingual models [Lee et al., 2016]
single, character-level encoder trained on multiple languages

more compact model
occasional quality improvements over single language pairs
robust towards (synthetic) code-switched input


Fully Character-Level Multilingual NMT
Jason Lee and Kyunghyun Cho, 2016 (in preparation)

Model details,

▸ RNNSearch model

▸ Source-Target character level

▸ CNN+RNN encoder

▸ Bi-scale decoder

▸ {Fi ,De,Cs,Ru}→ En

Training,

▸ Mix mini-batches

▸ Use bi-text only

Firat and Cho: https://ufal.mff.cuni.cz/mtm16/files/12-recent-advances-and-future-of-neural-mt-orhat-firat.pdf


Training data: other tasks

Multi-task models [Luong et al., 2016]
other tasks can be modelled with sequence-to-sequence models

we can share layers between translation and other tasks


Figure 1: Sequence to sequence learning examples – (left) machine translation (Sutskever et al., 2014) and (right) constituent parsing (Vinyals et al., 2015a).

…and German by up to +1.5 BLEU points over strong single-task baselines on the WMT benchmarks. Furthermore, we have established a new state-of-the-art result in constituent parsing with 93.0 F1. We also explore two unsupervised learning objectives, sequence autoencoders (Dai & Le, 2015) and skip-thought vectors (Kiros et al., 2015), and reveal their interesting properties in the MTL setting: the autoencoder helps less in terms of perplexities but more on BLEU scores compared to skip-thought.

2 SEQUENCE TO SEQUENCE LEARNING

Sequence to sequence learning (seq2seq) aims to directly model the conditional probability p(y|x) of mapping an input sequence, $x_1, \dots, x_n$, into an output sequence, $y_1, \dots, y_m$. It accomplishes this goal through the encoder-decoder framework proposed by Sutskever et al. (2014) and Cho et al. (2014). As illustrated in Figure 1, the encoder computes a representation s for each input sequence. Based on that input representation, the decoder generates an output sequence, one unit at a time, and hence decomposes the conditional probability as:

$\log p(y|x) = \sum_{j=1}^{m} \log p(y_j \mid y_{<j}, x, s)$  (1)

A natural model for sequential data is the recurrent neural network (RNN), which is used by most of the recent seq2seq work. These works, however, differ in terms of: (a) architecture – from unidirectional, to bidirectional, and deep multi-layer RNNs; and (b) RNN type – long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) and the gated recurrent unit (Cho et al., 2014).

Another important difference between seq2seq works lies in what constitutes the input representation s. The early seq2seq work (Sutskever et al., 2014; Cho et al., 2014; Luong et al., 2015b; Vinyals et al., 2015b) uses only the last encoder state to initialize the decoder and sets s = [ ] in Eq. (1). Recently, Bahdanau et al. (2015) proposes an attention mechanism, a way to provide seq2seq models with a random access memory, to handle long input sequences. This is accomplished by setting s in Eq. (1) to be the set of encoder hidden states already computed. On the decoder side, at each time step, the attention mechanism will decide how much information to retrieve from that memory by learning where to focus, i.e., computing the alignment weights for all input positions. Recent work such as (Xu et al., 2015; Jean et al., 2015a; Luong et al., 2015a; Vinyals et al., 2015a) has found that it is crucial to empower seq2seq models with the attention mechanism.

3 MULTI-TASK SEQUENCE-TO-SEQUENCE LEARNING

We generalize the work of Dong et al. (2015) to the multi-task sequence-to-sequence learning setting that includes the tasks of machine translation (MT), constituency parsing, and image caption generation. Depending on which tasks are involved, we propose to categorize multi-task seq2seq learning into three general settings. In addition, we will discuss the unsupervised learning tasks considered as well as the learning process.

3.1 ONE-TO-MANY SETTING

This scheme involves one encoder and multiple decoders for tasks in which the encoder can be shared, as illustrated in Figure 2. The input to each task is a sequence of English words. A separate decoder is used to generate each sequence of output units, which can be either (a) a sequence of tags …



NMT as a component in log-linear models

Log-linear models
model ensembling is well-established

reranking output of phrase-based/syntax-based systems with NMT [Neubig et al., 2015] (see the reranking sketch below)

incorporating NMT as a feature function into PBSMT [Junczys-Dowmunt et al., 2016]
→ results depend on relative performance of PBSMT and NMT

[Bar chart: BLEU for phrase-based SMT, neural MT, and the hybrid system on English→Russian and Russian→English (values shown: 22.8, 27.5, 26.0, 28.1, 25.9, 29.9).]
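A toy sketch of n-best reranking with an NMT feature in a log-linear model, as referenced in the reranking bullet above; the feature names, values, and weights are made up for illustration.

    def rerank(nbest, weights):
        def score(hyp):
            return sum(weights[name] * value for name, value in hyp["features"].items())
        return sorted(nbest, key=score, reverse=True)

    nbest = [
        {"text": "hypothesis A", "features": {"pbsmt": -4.1, "lm": -7.3, "nmt": -3.0}},
        {"text": "hypothesis B", "features": {"pbsmt": -3.8, "lm": -7.0, "nmt": -4.5}},
    ]
    best = rerank(nbest, weights={"pbsmt": 0.3, "lm": 0.2, "nmt": 0.5})[0]["text"]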


Linguistic Features [Sennrich and Haddow, 2016]
a.k.a. Factored Neural Machine Translation

motivation: disambiguate words by POS

English        German
close_verb     schließen
close_adj      nah
close_noun     Ende

source         We thought a win like this might be close_adj.
reference      Wir dachten, dass ein solcher Sieg nah sein könnte.
baseline NMT   *Wir dachten, ein Sieg wie dieser könnte schließen.


Linguistic Features: Architecture

use separate embeddings for each feature, then concatenate

baseline: only word feature

$E(\mathrm{close}) = \begin{bmatrix} 0.5 \\ 0.2 \\ 0.3 \\ 0.1 \end{bmatrix}$

$|F|$ input features:

$E_1(\mathrm{close}) = \begin{bmatrix} 0.4 \\ 0.1 \\ 0.2 \end{bmatrix}$,  $E_2(\mathrm{adj}) = \begin{bmatrix} 0.1 \end{bmatrix}$,  $E_1(\mathrm{close}) \,\|\, E_2(\mathrm{adj}) = \begin{bmatrix} 0.4 \\ 0.1 \\ 0.2 \\ 0.1 \end{bmatrix}$
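A minimal numpy sketch, with made-up vocabularies and embedding sizes, of factored input embeddings: each input feature has its own lookup table, and the per-feature vectors are concatenated into the word's input representation.

    import numpy as np

    word_vocab = {"close": 0, "we": 1}
    pos_vocab = {"adj": 0, "verb": 1, "noun": 2}
    E_word = 0.1 * np.random.randn(len(word_vocab), 3)   # 3-dimensional word embeddings
    E_pos = 0.1 * np.random.randn(len(pos_vocab), 1)     # 1-dimensional POS embeddings

    def embed(word, pos):
        # concatenate the per-feature embedding vectors
        return np.concatenate([E_word[word_vocab[word]], E_pos[pos_vocab[pos]]])

    x = embed("close", "adj")   # 4-dimensional vector, analogous to E1(close) || E2(adj)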


Linguistic Features: Results

BLEU, baseline vs. +linguistic features:

                        English→German   German→English   English→Romanian
baseline                27.8             31.4             23.8
+linguistic features    28.4             32.9             24.8


Further Reading

secondary literature
lecture notes by Kyunghyun Cho: [Cho, 2015]

chapter on Neural Network Models in "Statistical Machine Translation" by Philipp Koehn: http://mt-class.org/jhu/assets/papers/neural-network-models.pdf


(A small selection of) Resources

NMT tools
dl4mt-tutorial (theano) https://github.com/nyu-dl/dl4mt-tutorial

(our branch: nematus https://github.com/rsennrich/nematus)

nmt.matlab https://github.com/lmthang/nmt.matlab

seq2seq (tensorflow) https://www.tensorflow.org/versions/r0.8/tutorials/seq2seq/index.html

neural monkey (tensorflow) https://github.com/ufal/neuralmonkey

seq2seq-attn (torch) https://github.com/harvardnlp/seq2seq-attn


Do it yourself

sample files and instructions for training an NMT model: https://github.com/rsennrich/wmt16-scripts

pre-trained models to test decoding (and for further experiments): http://statmt.org/rsennrich/wmt16_systems/

lab on installing/using Nematus: http://ufal.mff.cuni.cz/mtm16/files/13-nematus-lab-rico-sennrich.pdf


Bibliography I

Allen, R. B. (1987). Several studies on natural language and back-propagation. In Proceedings of the IEEE First International Conference on Neural Networks, pages 335–341, San Diego, CA. IEEE.

Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations (ICLR).

Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015). Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. CoRR, abs/1506.03099.

Bentivogli, L., Bisazza, A., Cettolo, M., and Federico, M. (2016). Neural versus Phrase-Based Machine Translation Quality: a Case Study. In EMNLP 2016.

Cho, K. (2015). Natural Language Understanding with Distributed Representation. CoRR, abs/1511.07916.

Cho, K., Courville, A., and Bengio, Y. (2015). Describing Multimedia Content using Attention-based Encoder-Decoder Networks.

Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar. Association for Computational Linguistics.


Bibliography II

Chung, J., Cho, K., and Bengio, Y. (2016). A Character-level Decoder without Explicit Segmentation for Neural Machine Translation. CoRR, abs/1603.06147.

Costa-jussà, M. R. and Fonollosa, J. A. R. (2016). Character-based Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 357–361. Association for Computational Linguistics.

Dong, D., Wu, H., He, W., Yu, D., and Wang, H. (2015). Multi-Task Learning for Multiple Language Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1723–1732, Beijing, China. Association for Computational Linguistics.

Eriguchi, A., Hashimoto, K., and Tsuruoka, Y. (2016). Tree-to-Sequence Attentional Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 823–833. Association for Computational Linguistics.

Feng, S., Liu, S., Li, M., and Zhou, M. (2016). Implicit Distortion and Fertility Models for Attention-based Encoder-Decoder NMT Model. CoRR, abs/1601.03317.

Firat, O., Cho, K., and Bengio, Y. (2016a). Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 866–875. Association for Computational Linguistics.


Bibliography III

Firat, O., Sankaran, B., Al-Onaizan, Y., Yarman-Vural, F. T., and Cho, K. (2016b). Zero-Resource Translation with Multi-Lingual Neural Machine Translation. CoRR, abs/1606.04164.

Gage, P. (1994). A New Algorithm for Data Compression. C Users Journal, 12(2):23–38.

Gülçehre, Ç., Firat, O., Xu, K., Cho, K., Barrault, L., Lin, H., Bougares, F., Schwenk, H., and Bengio, Y. (2015). On Using Monolingual Corpora in Neural Machine Translation. CoRR, abs/1503.03535.

Gulcehre, C., Ahn, S., Nallapati, R., Zhou, B., and Bengio, Y. (2016). Pointing the Unknown Words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 140–149. Association for Computational Linguistics.

Ha, T.-L., Niehues, J., Cho, E., Mediani, M., and Waibel, A. (2015). The KIT translation systems for IWSLT 2015. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pages 62–69.

Haddow, B., Huck, M., Birch, A., Bogoychev, N., and Koehn, P. (2015). The Edinburgh/JHU Phrase-based Machine Translation Systems for WMT 2015. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 126–133, Lisbon, Portugal. Association for Computational Linguistics.


Bibliography IV

Jean, S., Cho, K., Memisevic, R., and Bengio, Y. (2015). On Using Very Large Target Vocabulary for Neural Machine Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10, Beijing, China. Association for Computational Linguistics.

Junczys-Dowmunt, M., Dwojak, T., and Sennrich, R. (2016). The AMU-UEDIN Submission to the WMT16 News Translation Task: Attention-based NMT Models as Feature Functions in Phrase-based SMT. In Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, pages 316–322, Berlin, Germany. Association for Computational Linguistics.

Junczys-Dowmunt, M. and Grundkiewicz, R. (2016). Log-linear Combinations of Monolingual and Bilingual Neural Machine Translation Models for Automatic Post-Editing. In Proceedings of the First Conference on Machine Translation, pages 751–758, Berlin, Germany. Association for Computational Linguistics.

Kalchbrenner, N. and Blunsom, P. (2013). Recurrent Continuous Translation Models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle. Association for Computational Linguistics.

Kalchbrenner, N., Espeholt, L., Simonyan, K., van den Oord, A., Graves, A., and Kavukcuoglu, K. (2016). Neural Machine Translation in Linear Time. ArXiv e-prints.

Kim, Y. and Rush, A. M. (2016). Sequence-Level Knowledge Distillation. CoRR, abs/1606.07947.


Bibliography V

Lee, J., Cho, K., and Hofmann, T. (2016). Fully Character-Level Neural Machine Translation without Explicit Segmentation. ArXiv e-prints.

L'Hostis, G., Grangier, D., and Auli, M. (2016). Vocabulary Selection Strategies for Neural Machine Translation. ArXiv e-prints.

Ling, W., Trancoso, I., Dyer, C., and Black, A. W. (2015). Character-based Neural Machine Translation. ArXiv e-prints.

Liu, L., Utiyama, M., Finch, A., and Sumita, E. (2016). Agreement on Target-bidirectional Neural Machine Translation. In NAACL HLT 2016, San Diego, CA.

Luong, M., Le, Q. V., Sutskever, I., Vinyals, O., and Kaiser, L. (2016). Multi-task Sequence to Sequence Learning. In ICLR 2016.

Luong, M.-T. and Manning, C. D. (2015). Stanford Neural Machine Translation Systems for Spoken Language Domains. In Proceedings of the International Workshop on Spoken Language Translation 2015, Da Nang, Vietnam.

Luong, M.-T. and Manning, C. D. (2016). Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1054–1063. Association for Computational Linguistics.


Bibliography VI

Luong, T., Pham, H., and Manning, C. D. (2015a). Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421, Lisbon, Portugal. Association for Computational Linguistics.

Luong, T., Sutskever, I., Le, Q., Vinyals, O., and Zaremba, W. (2015b). Addressing the Rare Word Problem in Neural Machine Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 11–19, Beijing, China. Association for Computational Linguistics.

Mi, H., Sankaran, B., Wang, Z., and Ittycheriah, A. (2016). A Coverage Embedding Model for Neural Machine Translation. ArXiv e-prints.

Mi, H., Wang, Z., and Ittycheriah, A. (2016). Vocabulary Manipulation for Neural Machine Translation. CoRR, abs/1605.03209.

Neubig, G., Morishita, M., and Nakamura, S. (2015). Neural Reranking Improves Subjective Quality of Machine Translation: NAIST at WAT2015. In Proceedings of the 2nd Workshop on Asian Translation (WAT2015), pages 35–41, Kyoto, Japan.

Pym, A., Grin, F., and Sfreddo, C. (2012). The status of the translation profession in the European Union, volume 7. European Commission, Luxembourg.


Bibliography VII

Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. (2016). Sequence Level Training with Recurrent Neural Networks. In ICLR 2016.

Sennrich, R. and Haddow, B. (2016). Linguistic Input Features Improve Neural Machine Translation. In Proceedings of the First Conference on Machine Translation, Volume 1: Research Papers, pages 83–91, Berlin, Germany. Association for Computational Linguistics.

Sennrich, R., Haddow, B., and Birch, A. (2016a). Edinburgh Neural Machine Translation Systems for WMT 16. In Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, pages 368–373, Berlin, Germany. Association for Computational Linguistics.

Sennrich, R., Haddow, B., and Birch, A. (2016b). Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany. Association for Computational Linguistics.

Sennrich, R., Haddow, B., and Birch, A. (2016c). Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.

Shen, S., Cheng, Y., He, Z., He, W., Wu, H., Sun, M., and Liu, Y. (2016). Minimum Risk Training for Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany.


Bibliography VIII

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, pages 3104–3112, Montreal, Quebec, Canada.

Tu, Z., Liu, Y., Lu, Z., Liu, X., and Li, H. (2016). Context Gates for Neural Machine Translation. ArXiv e-prints.

Tu, Z., Lu, Z., Liu, Y., Liu, X., and Li, H. (2016a). Coverage-based Neural Machine Translation. CoRR, abs/1601.04811.

Tu, Z., Lu, Z., Liu, Y., Liu, X., and Li, H. (2016b). Modeling Coverage for Neural Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 76–85. Association for Computational Linguistics.

Turovsky, B. (2016). Ten years of Google Translate. https://googleblog.blogspot.co.uk/2016/04/ten-years-of-google-translate.html.

Wiseman, S. and Rush, A. M. (2016). Sequence-to-Sequence Learning as Beam-Search Optimization. CoRR, abs/1606.02960.


Bibliography IX

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Kaiser, Ł., Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., and Dean, J. (2016). Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. ArXiv e-prints.

Zhang, B., Xiong, D., and Su, J. (2016). Recurrent Neural Machine Translation. CoRR, abs/1607.08725.

Zhao, S. and Zhang, Z. (2016). An Efficient Character-Level Neural Machine Translation. ArXiv e-prints.

Zhou, J., Cao, Y., Wang, X., Li, P., and Xu, W. (2016). Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation. Transactions of the Association for Computational Linguistics, Volume 4, pages 371–383.

Zoph, B. and Knight, K. (2016). Multi-Source Neural Translation. In NAACL HLT 2016.
