Edinburgh’s Neural Machine Translation Systems
Barry Haddow
University of Edinburgh
October 27, 2016
Collaborators
Rico Sennrich, Alexandra Birch
Edinburgh’s WMT Results Over the Years
[Bar chart: BLEU on newstest2013 (EN→DE)]

                      2013   2014   2015   2016
  phrase-based SMT    20.3   20.9   20.8   21.5
  syntax-based SMT    19.4   20.2   22     22.1
  neural MT             -      -      -    24.7
Neural Machine Translation: Encoder-Decoder with Attention

[Image: Philipp Koehn]
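Since the slide itself is only an image, here is a minimal numeric sketch of one attention step (additive, Bahdanau-style attention, one member of the encoder-decoder-with-attention family shown). All shapes, parameter names, and the random initialisation are illustrative assumptions, not Edinburgh's implementation.

```python
import numpy as np

# Minimal sketch of one attention step in an encoder-decoder model.
# Dimensions and parameter names are illustrative assumptions.

rng = np.random.default_rng(0)
src_len, enc_dim, dec_dim, att_dim = 6, 8, 8, 5

H = rng.normal(size=(src_len, enc_dim))    # encoder states, one per source word
s = rng.normal(size=(dec_dim,))            # current decoder hidden state

W_h = rng.normal(size=(att_dim, enc_dim))  # attention parameters (toy init)
W_s = rng.normal(size=(att_dim, dec_dim))
v = rng.normal(size=(att_dim,))

# Alignment scores: relevance of each source position to the next target word.
e = np.tanh(H @ W_h.T + s @ W_s.T) @ v     # shape (src_len,)
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                       # softmax -> attention weights
context = alpha @ H                        # weighted sum of encoder states

print(alpha.round(3))                      # weights sum to 1 over source positions
print(context.shape)                       # (enc_dim,)
```

The context vector is what lets the output be conditioned on the full source text, as the next slide contrasts with phrase-based models.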
Neural versus Phrase-based MT
phrase-based SMT: learn segment-segment correspondences from bitext
→ training is a multistage pipeline of heuristics
→ strong independence assumptions
→ “fixed” trade-off between features

neural MT: learn a mathematical function on vectors from bitext
→ end-to-end trained model
→ output conditioned on full source text and target history
→ non-linear dependence on information sources
Innovations in Edinburgh’s WMT16 Systems
1. Subword models to allow translation of rare/unknown words
   → since networks have a small, fixed vocabulary
2. Back-translated monolingual data as additional training data
   → allows us to make use of extensive monolingual resources
3. Combination of left-to-right and right-to-left models
   → reduces the “label-bias” problem (see the reranking sketch after this list)
4. Pervasive dropout
   → technical device to improve training with small data sets
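As an illustration of point 3, a hedged sketch of reranking an n-best list with a right-to-left model: the left-to-right system proposes scored candidates, and a separately trained R2L model rescores each one. `score_r2l` is a hypothetical placeholder rather than Edinburgh's implementation, and the equal interpolation weights are an arbitrary choice.

```python
# Sketch of right-to-left (R2L) reranking of an n-best list. The R2L scorer
# is a dummy; a real system would return the log-probability of each
# hypothesis under a separately trained right-to-left NMT model.

def score_r2l(hypothesis: str) -> float:
    # Hypothetical stand-in for the R2L model's log-probability.
    return -0.1 * len(hypothesis.split())

# n-best list from the left-to-right model: (hypothesis, L2R log-prob).
nbest = [
    ("the house is small", -1.2),
    ("the house is tiny", -1.5),
    ("the home is small", -1.4),
]

# Rerank by the sum of L2R and R2L scores (equal weights, for illustration).
best = max(nbest, key=lambda item: item[1] + score_r2l(item[0]))
print(best[0])
```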
Problem with Word-level Models
they charge a carry-on bag fee.
sie erheben eine Hand|gepäck|gebühr.

Neural MT architectures have a small and fixed vocabulary
translation is an open-vocabulary problem
  productive word formation (example: compounding)
  names (may require transliteration)
Why Subword Models?
transparent translations
  many translations are semantically/phonologically transparent
  → translation via subword units possible

morphologically complex words (e.g. compounds):
  solar system (English)
  Sonnen|system (German)
  Nap|rendszer (Hungarian)

named entities:
  Barack Obama (English; German)
  Барак Обама (Russian)
  バラク・オバマ (ba-ra-ku o-ba-ma) (Japanese)

cognates and loanwords:
  claustrophobia (English)
  Klaustrophobie (German)
  Клаустрофобия (Russian)
Byte-pair Encoding
Start with maximally split words (i.e. characters)
Use statistics to identify groups to merge
Proceed until a pre-determined number of merge operations is learnt

Examples:
  system       sentence
  source       health research institutes
  reference    Gesundheitsforschungsinstitute
  word-level   Forschungsinstitute
  subword      Gesundheits|forsch|ungsin|stitute

  source       rakfisk
  reference    ракфиска (rakfiska)
  word-level   rakfisk → UNK → rakfisk
  subword      rak|f|isk → рак|ф|иска (rak|f|iska)
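The merge-learning loop fits in a few lines. The sketch below follows the published BPE procedure of Sennrich et al. (2016); the toy vocabulary (with an end-of-word marker `</w>`) and the merge count are illustrative.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count the frequency of each adjacent symbol pair in the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of `pair` into a single new symbol."""
    bigram = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {bigram.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus, maximally split into characters, with an end-of-word marker
# so that merges never cross word boundaries.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

num_merges = 10  # pre-determined number of merge operations
for _ in range(num_merges):
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(best)
```

At test time the learnt merges are replayed in the same order on new words, which is how unseen compounds like Gesundheitsforschungsinstitute above get segmented into known units.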
Subword Models
translation of many rare/unknown words is transparent
subword units allow open-vocabulary NMT without back-off model
substantial gains in translation quality, especially for rare words
Monolingual Training Data
why monolingual data for phrase-based SMT?
  relax independence assumptions ✓
  more training data ✓
  more appropriate training data (domain adaptation) ✓

why monolingual data for neural MT?
  relax independence assumptions ✗
  more training data ✓
  more appropriate training data (domain adaptation) ✓
Monolingual Data in NMT
encoder-decoder already conditions on previous target words
no architecture change required to learn from monolingual data
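In symbols (a standard statement of the model, not spelled out on the slide): the decoder defines

\[ p(y \mid x) \;=\; \prod_{t=1}^{|y|} p\big(y_t \mid y_{<t},\, x\big) \]

so a monolingual target sentence is only missing the source \(x\) to condition on, which motivates the dummy-source and back-translation methods on the next slide.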
Monolingual Training Instances
Problem: we have no source context for monolingual training instances

Solutions: two methods to deal with the missing source context:
  empty/dummy source context
  → danger of unlearning conditioning on source
  produce a synthetic source sentence via back-translation
  → get an approximation of the source context (see the sketch below)
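A minimal sketch of the back-translation route, assuming a trained target→source model already exists. The `translate` function is a labeled dummy standing in for real decoding, and the sentences are invented.

```python
# Sketch of building synthetic parallel data by back-translation.
# `translate` is a placeholder: a real pipeline would decode the monolingual
# target sentences with a trained target->source NMT system.

def translate(sentence: str) -> str:
    # Hypothetical stand-in for target->source decoding.
    return "<synthetic source for: %s>" % sentence

# Real bitext: (source, target) pairs.
real_bitext = [("sie erheben eine Gebühr", "they charge a fee")]

# Monolingual target-language data.
target_mono = ["the weather was excellent", "we arrived late"]

# Synthetic pairs: machine-translated source side, genuine target side.
synthetic = [(translate(t), t) for t in target_mono]

# Train the source->target system on the union, as usual.
training_data = real_bitext + synthetic
for src, tgt in training_data:
    print(src, "|||", tgt)
```

The target side of every synthetic pair is genuine text, so the decoder's language-model role is trained on clean data even though the source side is only approximate.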
Evaluation: WMT 15 English↔German
[Bar charts: BLEU]

English→German:
  syntax-based   24.4
  parallel       23.6
  +monolingual   24.6
  +synthetic     26.5

German→English:
  PBSMT          29.3
  parallel       26.7
  +synthetic     30.4
  +synth-ens4    31.6
Why is monolingual data helpful?
Domain adaptation effect
Reduces over-fitting
Improves fluency
Putting it all together: WMT16 Results
[Bar chart: BLEU for EN→CS, EN→DE, EN→RO, EN→RU, CS→EN, DE→EN, RO→EN and RU→EN, comparing parallel data / +synthetic data / +ensemble / +R2L reranking]
Human Pairwise Ranking Scores
direction   BLEU rank   human rank
EN→CS       1 of 9      1 of 20
EN→DE       1 of 11     1 of 15
EN→RO       2 of 10     1–2 of 12
EN→RU       1 of 8      2–5 of 12
CS→EN       1 of 4      1 of 12
DE→EN       1 of 6      1 of 10
RO→EN       2 of 5      2 of 7
RU→EN       3 of 6      5 of 10

NB: Human rankings include online systems, and for EN→CS, extra systems from the tuning task
NMT vs. PBMT: An extended test
Training and test drawn from UN corpus
  multi-parallel, 11M lines
  Arabic, Chinese, English, French, Russian, Spanish
Apply BPE, but no monolingual data
NMT systems trained for 8 days
Evaluate using BLEU on 4000 sentences
[Junczys-Dowmunt et al., 2016]
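For reference, since all evaluations here report BLEU: the standard definition (Papineni et al., 2002; not restated on the slides) is

\[ \mathrm{BLEU} \;=\; \mathrm{BP} \cdot \exp\Big( \sum_{n=1}^{4} \tfrac{1}{4} \log p_n \Big), \qquad \mathrm{BP} \;=\; \min\big(1,\; e^{\,1 - r/c}\big) \]

where \(p_n\) are the modified n-gram precisions, \(c\) is the total length of the system output, and \(r\) is the reference length.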
NMT vs. PBMT: UN data

[Results chart: BLEU comparison across the UN corpus language pairs]
Conclusions and Outlook
Conclusions
  neural MT is SOTA on many tasks
subword models and back-translated data contributed to success
NMT has gone from lab to deployment, much faster than expected
Problems and Opportunities
  Inclusion of terminology, placeholders, markup
Interpretation and manual changes to models
Computer-aided and interactive MT
Incremental training
Extra knowledge sources (context, multimodal)
Sharing across languages, domains
Acknowledgments

Some of the research presented here was conducted in cooperation with Samsung Electronics Polska sp. z o.o. - Samsung R&D Institute Poland.

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement 645452 (QT21).
Thank you
WMT16 Results (BLEU)
EN→DE: uedin-nmt 34.2, metamind 32.3, NYU-UMontreal 30.8, cambridge 30.6, uedin-syntax 30.6, KIT/LIMSI 29.1, KIT 29.0, uedin-pbmt 28.4, jhu-syntax 26.6

DE→EN: uedin-nmt 38.6, uedin-pbmt 35.1, jhu-pbmt 34.5, uedin-syntax 34.4, KIT 33.9, jhu-syntax 31.0

EN→CS: uedin-nmt 25.8, NYU-UMontreal 23.6, jhu-pbmt 23.6, cu-chimera 21.0, uedin-cu-syntax 20.9, cu-tamchyna 20.8, cu-TectoMT 14.7, cu-mergedtrees 8.2

CS→EN: uedin-nmt 31.4, jhu-pbmt 30.4, PJATK 28.3, cu-mergedtrees 13.3

RO→EN: uedin-pbmt 35.2, uedin-nmt 33.9, uedin-syntax 33.6, jhu-pbmt 32.2, LIMSI 31.0

EN→RO: QT21-HimL-SysComb 28.9, uedin-nmt 28.1, RWTH-SYSCOMB 27.1, uedin-pbmt 26.8, uedin-lmu-hiero 25.9, KIT 25.8, lmu-cuni 24.3, LIMSI 23.9, jhu-pbmt 23.5, usfd-rescoring 23.1

EN→RU: uedin-nmt 26.0, amu-uedin 25.3, jhu-pbmt 24.0, LIMSI 23.6, AFRL-MITLL 23.5, NYU-UMontreal 23.1, AFRL-MITLL-verb-annot 20.9

RU→EN: amu-uedin 29.1, NRC 29.1, uedin-nmt 28.0, AFRL-MITLL 27.6, AFRL-MITLL-contrast 27.0

Highlighted in the original: Edinburgh NMT; system combination with Edinburgh NMT
WMT16: English→German
#   score   range   system
1    0.49    1      UEDIN-NMT
2    0.40    2      METAMIND
3    0.29    3      UEDIN-SYNTAX
4    0.17    4      NYU-MONTREAL
5   −0.01    5-10   ONLINE-B
    −0.01    5-10   KIT-LIMSI
    −0.02    5-10   CAMBRIDGE
    −0.02    5-10   ONLINE-A
    −0.03    5-10   PROMT-RULE
    −0.05    6-10   KIT
6   −0.14   11-12   JHU-SYNTAX
    −0.15   11-12   JHU-PBMT
7   −0.26   13-14   UEDIN-PBMT
    −0.33   13-15   ONLINE-F
    −0.34   14-15   ONLINE-G

Highlighted in the original: neural MT systems; systems with neural components
Fluency and Adequacy (German→ English)
                 Adequacy        Fluency
UEDIN-NMT        0.204   75.8    0.339   77.5
ONLINE-A         0.095   72.7    0.094   70.1
ONLINE-B         0.086   72.2    0.015   68.4
UEDIN-SYNTAX     0.065   71.5    0.141   71.8
KIT              0.062   71.4    0.192   72.7
UEDIN-PBMT       0.042   70.9    0.004   68.6
JHU-PBMT         0.019   70.5    0.084   70.5
ONLINE-G         0.009   70.2   −0.067   65.3
ONLINE-F        −0.204   64.0   −0.348   57.8
JHU-SYNTAX      −0.261   62.4   −0.237   62.5
[Bojar et al. WMT 2016]
Neural MT and Phrase-based SMT
                          Neural MT            Phrase-based SMT
translation quality       ✓
model size                ✓
training time                                  ✓
model interpretability                         ✓
decoding efficiency       ✓                    ✓
toolkits                  ✓ (for simplicity)   ✓ (for maturity)
special hardware          GPU                  lots of RAM