Edinburgh’s Neural Machine Translation Systems
Barry Haddow
University of Edinburgh
October 27, 2016
Collaborators
Rico Sennrich, Alexandra Birch
Edinburgh’s WMT Results Over the Years
[Bar chart: BLEU on newstest2013 (EN→DE)]

                      2013   2014   2015   2016
  phrase-based SMT    20.3   20.9   20.8   21.5
  syntax-based SMT    19.4   20.2   22     22.1
  neural MT             -      -      -    24.7
Neural Machine Translation: Encoder-Decoder with Attention

[Image: Philipp Koehn]
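Since the slide itself is only an image, here is a minimal numeric sketch of one attention step (additive, Bahdanau-style attention, one member of the encoder-decoder-with-attention family shown). All shapes, parameter names, and the random initialisation are illustrative assumptions, not Edinburgh's implementation.

```python
import numpy as np

# Minimal sketch of one attention step in an encoder-decoder model.
# Dimensions and parameter names are illustrative assumptions.

rng = np.random.default_rng(0)
src_len, enc_dim, dec_dim, att_dim = 6, 8, 8, 5

H = rng.normal(size=(src_len, enc_dim))    # encoder states, one per source word
s = rng.normal(size=(dec_dim,))            # current decoder hidden state

W_h = rng.normal(size=(att_dim, enc_dim))  # attention parameters (toy init)
W_s = rng.normal(size=(att_dim, dec_dim))
v = rng.normal(size=(att_dim,))

# Alignment scores: relevance of each source position to the next target word.
e = np.tanh(H @ W_h.T + s @ W_s.T) @ v     # shape (src_len,)
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                       # softmax -> attention weights
context = alpha @ H                        # weighted sum of encoder states

print(alpha.round(3))                      # weights sum to 1 over source positions
print(context.shape)                       # (enc_dim,)
```

The context vector is what lets the output be conditioned on the full source text, as the next slide contrasts with phrase-based models.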
Neural versus Phrase-based MT
phrase-based SMT: learn segment-segment correspondences from bitext
→ training is a multistage pipeline of heuristics
→ strong independence assumptions
→ “fixed” trade-off between features

neural MT: learn a mathematical function on vectors from bitext
→ end-to-end trained model
→ output conditioned on full source text and target history
→ non-linear dependence on information sources
Innovations in Edinburgh’s WMT16 Systems
1. Subword models to allow translation of rare/unknown words
   → since networks have a small, fixed vocabulary
2. Back-translated monolingual data as additional training data
   → allows us to make use of extensive monolingual resources
3. Combination of left-to-right and right-to-left models
   → reduces the “label-bias” problem (see the reranking sketch after this list)
4. Pervasive dropout
   → technical device to improve training with small data sets
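As an illustration of point 3, a hedged sketch of reranking an n-best list with a right-to-left model: the left-to-right system proposes scored candidates, and a separately trained R2L model rescores each one. `score_r2l` is a hypothetical placeholder rather than Edinburgh's implementation, and the equal interpolation weights are an arbitrary choice.

```python
# Sketch of right-to-left (R2L) reranking of an n-best list. The R2L scorer
# is a dummy; a real system would return the log-probability of each
# hypothesis under a separately trained right-to-left NMT model.

def score_r2l(hypothesis: str) -> float:
    # Hypothetical stand-in for the R2L model's log-probability.
    return -0.1 * len(hypothesis.split())

# n-best list from the left-to-right model: (hypothesis, L2R log-prob).
nbest = [
    ("the house is small", -1.2),
    ("the house is tiny", -1.5),
    ("the home is small", -1.4),
]

# Rerank by the sum of L2R and R2L scores (equal weights, for illustration).
best = max(nbest, key=lambda item: item[1] + score_r2l(item[0]))
print(best[0])
```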
Problem with Word-level Models
they charge a carry-on bag fee.
sie erheben eine Hand|gepäck|gebühr.

Neural MT architectures have a small and fixed vocabulary
translation is an open-vocabulary problem
  productive word formation (example: compounding)
  names (may require transliteration)
Why Subword Models?
transparent translations
  many translations are semantically/phonologically transparent
  → translation via subword units possible

morphologically complex words (e.g. compounds):
  solar system (English)
  Sonnen|system (German)
  Nap|rendszer (Hungarian)

named entities:
  Barack Obama (English; German)
  Барак Обама (Russian)
  バラク・オバマ (ba-ra-ku o-ba-ma) (Japanese)

cognates and loanwords:
  claustrophobia (English)
  Klaustrophobie (German)
  Клаустрофобия (Russian)
Byte-pair Encoding
Start with maximally split words (i.e. characters)
Use statistics to identify groups to merge
Proceed until a pre-determined number of merge operations is learnt

Examples:
  system       sentence
  source       health research institutes
  reference    Gesundheitsforschungsinstitute
  word-level   Forschungsinstitute
  subword      Gesundheits|forsch|ungsin|stitute

  source       rakfisk
  reference    ракфиска (rakfiska)
  word-level   rakfisk → UNK → rakfisk
  subword      rak|f|isk → рак|ф|иска (rak|f|iska)
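The merge-learning loop fits in a few lines. The sketch below follows the published BPE procedure of Sennrich et al. (2016); the toy vocabulary (with an end-of-word marker `</w>`) and the merge count are illustrative.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count the frequency of each adjacent symbol pair in the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single new symbol."""
    bigram = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {bigram.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus, maximally split into characters, with an end-of-word marker
# so that merges never cross word boundaries.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

num_merges = 10  # pre-determined number of merge operations
for _ in range(num_merges):
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)
```

At test time the learnt merges are replayed in the same order on new words, which is how unseen compounds like Gesundheitsforschungsinstitute above get segmented into known units.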
Subword Models
translation of many rare/unknown words is transparent
subword units allow open-vocabulary NMT without back-off model
substantial gains in translation quality, especially for rare words
Monolingual Training Data
why monolingual data for phrase-based SMT?
  relax independence assumptions ✓
  more training data ✓
  more appropriate training data (domain adaptation) ✓

why monolingual data for neural MT?
  relax independence assumptions ✗
  more training data ✓
  more appropriate training data (domain adaptation) ✓
Monolingual Data in NMT
encoder-decoder already conditions on previous target words
no architecture change required to learn from monolingual data
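In symbols (a standard statement of the model, not spelled out on the slide): the decoder defines

\[ p(y \mid x) \;=\; \prod_{t=1}^{|y|} p\big(y_t \mid y_{<t},\, x\big) \]

so a monolingual target sentence is only missing the source \(x\) to condition on, which motivates the dummy-source and back-translation methods on the next slide.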
Monolingual Training Instances
Problem: we have no source context for monolingual training instances

Solutions: two methods to deal with the missing source context:
  empty/dummy source context
  → danger of unlearning conditioning on source
  produce a synthetic source sentence via back-translation
  → get an approximation of the source context (see the sketch below)
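A minimal sketch of the back-translation route, assuming a trained target→source model already exists. The `translate` function is a labeled dummy standing in for real decoding, and the sentences are invented.

```python
# Sketch of building synthetic parallel data by back-translation.
# `translate` is a placeholder: a real pipeline would decode the monolingual
# target sentences with a trained target->source NMT system.

def translate(sentence: str) -> str:
    # Hypothetical stand-in for target->source decoding.
    return "<synthetic source for: %s>" % sentence

# Real bitext: (source, target) pairs.
real_bitext = [("sie erheben eine Gebühr", "they charge a fee")]

# Monolingual target-language data.
target_mono = ["the weather was excellent", "we arrived late"]

# Synthetic pairs: machine-translated source side, genuine target side.
synthetic = [(translate(t), t) for t in target_mono]

# Train the source->target system on the union, as usual.
training_data = real_bitext + synthetic
for src, tgt in training_data:
    print(src, "|||", tgt)
```

The target side of every synthetic pair is genuine text, so the decoder's language-model role is trained on clean data even though the source side is only approximate.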
Evaluation: WMT 15 English↔German
[Bar charts: BLEU]

English→German:
  syntax-based   24.4
  parallel       23.6
  +monolingual   24.6
  +synthetic     26.5

German→English:
  PBSMT          29.3
  parallel       26.7
  +synthetic     30.4
  +synth-ens4    31.6
Why is monolingual data helpful?
Domain adaptation effect
Reduces over-fitting
Improves fluency
Putting it all together: WMT16 Results
[Bar chart: BLEU for EN→CS, EN→DE, EN→RO, EN→RU, CS→EN, DE→EN, RO→EN and RU→EN, comparing parallel data / +synthetic data / +ensemble / +R2L reranking]
Human Pairwise Ranking Scores
direction   BLEU rank   human rank
EN→CS       1 of 9      1 of 20
EN→DE       1 of 11     1 of 15
EN→RO       2 of 10     1–2 of 12
EN→RU       1 of 8      2–5 of 12
CS→EN       1 of 4      1 of 12
DE→EN       1 of 6      1 of 10
RO→EN       2 of 5      2 of 7
RU→EN       3 of 6      5 of 10

NB: Human rankings include online systems, and for EN→CS, extra systems from the tuning task
NMT vs. PBMT: An extended test
Training and test drawn from UN corpus
  multi-parallel, 11M lines
  Arabic, Chinese, English, French, Russian, Spanish
Apply BPE, but no monolingual data
NMT systems trained for 8 days
Evaluate using BLEU on 4000 sentences
[Junczys-Dowmunt et al., 2016]
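For reference, since all evaluations here report BLEU: the standard definition (Papineni et al., 2002; not restated on the slides) is

\[ \mathrm{BLEU} \;=\; \mathrm{BP} \cdot \exp\Big( \sum_{n=1}^{4} \tfrac{1}{4} \log p_n \Big), \qquad \mathrm{BP} \;=\; \min\big(1,\; e^{\,1 - r/c}\big) \]

where \(p_n\) are the modified n-gram precisions, \(c\) is the total length of the system output, and \(r\) is the reference length.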
NMT vs. PBMT: UN data

[Results chart: BLEU comparison across the UN corpus language pairs]
Conclusions and Outlook
Conclusions
  neural MT is SOTA on many tasks
subword models and back-translated data contributed to success
NMT has gone from lab to deployment, much faster than expected
Problems and Opportunities
  Inclusion of terminology, placeholders, markup
Interpretation and manual changes to models
Computer-aided and interactive MT
Incremental training
Extra knowledge sources (context, multimodal)
Sharing across languages, domains
Acknowledgments

Some of the research presented here was conducted in cooperation with Samsung Electronics Polska sp. z o.o. - Samsung R&D Institute Poland.

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement 645452 (QT21).
Thank you
WMT16 Results (BLEU)
EN→DE: uedin-nmt 34.2, metamind 32.3, NYU-UMontreal 30.8, cambridge 30.6, uedin-syntax 30.6, KIT/LIMSI 29.1, KIT 29.0, uedin-pbmt 28.4, jhu-syntax 26.6

DE→EN: uedin-nmt 38.6, uedin-pbmt 35.1, jhu-pbmt 34.5, uedin-syntax 34.4, KIT 33.9, jhu-syntax 31.0

EN→CS: uedin-nmt 25.8, NYU-UMontreal 23.6, jhu-pbmt 23.6, cu-chimera 21.0, uedin-cu-syntax 20.9, cu-tamchyna 20.8, cu-TectoMT 14.7, cu-mergedtrees 8.2

CS→EN: uedin-nmt 31.4, jhu-pbmt 30.4, PJATK 28.3, cu-mergedtrees 13.3

RO→EN: uedin-pbmt 35.2, uedin-nmt 33.9, uedin-syntax 33.6, jhu-pbmt 32.2, LIMSI 31.0

EN→RO: QT21-HimL-SysComb 28.9, uedin-nmt 28.1, RWTH-SYSCOMB 27.1, uedin-pbmt 26.8, uedin-lmu-hiero 25.9, KIT 25.8, lmu-cuni 24.3, LIMSI 23.9, jhu-pbmt 23.5, usfd-rescoring 23.1

EN→RU: uedin-nmt 26.0, amu-uedin 25.3, jhu-pbmt 24.0, LIMSI 23.6, AFRL-MITLL 23.5, NYU-UMontreal 23.1, AFRL-MITLL-verb-annot 20.9

RU→EN: amu-uedin 29.1, NRC 29.1, uedin-nmt 28.0, AFRL-MITLL 27.6, AFRL-MITLL-contrast 27.0

Highlighted in the original: Edinburgh NMT; system combination with Edinburgh NMT
WMT16: English→German
#   score   range   system
1    0.49    1      UEDIN-NMT
2    0.40    2      METAMIND
3    0.29    3      UEDIN-SYNTAX
4    0.17    4      NYU-MONTREAL
5   −0.01    5-10   ONLINE-B
    −0.01    5-10   KIT-LIMSI
    −0.02    5-10   CAMBRIDGE
    −0.02    5-10   ONLINE-A
    −0.03    5-10   PROMT-RULE
    −0.05    6-10   KIT
6   −0.14   11-12   JHU-SYNTAX
    −0.15   11-12   JHU-PBMT
7   −0.26   13-14   UEDIN-PBMT
    −0.33   13-15   ONLINE-F
    −0.34   14-15   ONLINE-G

Highlighted in the original: neural MT systems; systems with neural components
Fluency and Adequacy (German→ English)
                 Adequacy        Fluency
UEDIN-NMT        0.204   75.8    0.339   77.5
ONLINE-A         0.095   72.7    0.094   70.1
ONLINE-B         0.086   72.2    0.015   68.4
UEDIN-SYNTAX     0.065   71.5    0.141   71.8
KIT              0.062   71.4    0.192   72.7
UEDIN-PBMT       0.042   70.9    0.004   68.6
JHU-PBMT         0.019   70.5    0.084   70.5
ONLINE-G         0.009   70.2   −0.067   65.3
ONLINE-F        −0.204   64.0   −0.348   57.8
JHU-SYNTAX      −0.261   62.4   −0.237   62.5
[Bojar et al. WMT 2016]
Neural MT and Phrase-based SMT
                          Neural MT            Phrase-based SMT
translation quality       ✓
model size                ✓
training time                                  ✓
model interpretability                         ✓
decoding efficiency       ✓                    ✓
toolkits                  ✓ (for simplicity)   ✓ (for maturity)
special hardware          GPU                  lots of RAM