
Institute of Computational Linguistics

Contrastive Evaluation of Larger-context Neural Machine Translation

Kolloquium Talk 2018 (KOLLO, 4/10/18)

Mathias Müller


Larger-context neural machine translation



Why larger context?


Source: However, the European Central Bank (ECB) took an interest in it in a report on virtual currencies published in October. It describes bitcoin as "the most successful virtual currency," […].

Target: Dennoch hat die Europäische Zentralbank (EZB) in einem im Oktober veröffentlichten Bericht über virtuelle Währungen Interesse hierfür gezeigt. Sie beschreibt Bitcoin als "die virtuelle Währung mit dem größten Erfolg" […].

(example taken from newstest2013.{de,en})



Why larger context?


Source: It describes bitcoin as "the most successful virtual currency".

Target: Es beschreibt den Bitcoin als "die erfolgreichste virtuelle Währung".


How to incorporate larger context?


Open question; preliminary works:
• gated auxiliary context or "warm start" decoder initialization with a document summary (Wang et al., 2017)
• additional encoder and attention network for the previous source sentence (Jean et al., 2017)
• concatenate the previous source sentence, marked with a prefix (Tiedemann and Scherrer, 2017)
• both source and target context (Miculicich Werlen et al., submitted)
• hierarchical attention, among other solutions (Bawden et al., submitted)


Additional encoder and attention network


• on top of Nematus (Sennrich et al., 2017), which follows standard practice: an encoder-decoder framework with attention (Bahdanau et al., 2014)

• encoder and decoder are gated recurrent units (GRUs), a variant of RNNs

• the decoder is a GRU conditioned on the source sentence context, which in turn is produced by the encoder and modulated by attention

• we also condition on preceding sentences, with additional encoders and separate attention networks


Recurrent neural networks refresher



RNN variant: gated recurrent unit (GRU)


Figure taken from Chung et al. (2014)
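As a quick reminder of what the figure shows, one GRU step can be written in a few lines. The sketch below is a plain NumPy illustration with hypothetical parameter names (biases omitted), following the formulation of Chung et al. (2014); it is not code from Nematus.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step; `params` is a dict of weight matrices (hypothetical names, no biases).

    z_t = sigmoid(W_z x_t + U_z h_{t-1})            # update gate
    r_t = sigmoid(W_r x_t + U_r h_{t-1})            # reset gate
    h~_t = tanh(W_h x_t + U_h (r_t * h_{t-1}))      # candidate state
    h_t = (1 - z_t) * h_{t-1} + z_t * h~_t          # interpolation
    """
    z = sigmoid(params["W_z"] @ x_t + params["U_z"] @ h_prev)
    r = sigmoid(params["W_r"] @ x_t + params["U_r"] @ h_prev)
    h_tilde = np.tanh(params["W_h"] @ x_t + params["U_h"] @ (r * h_prev))
    return (1.0 - z) * h_prev + z * h_tilde
```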


Conditional gated recurrent unit (cGRU)


Detailed formulas: https://github.com/nyu-dl/dl4mt-tutorial/blob/master/docs/cgru.pdf
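The linked note gives the exact formulas; schematically, the cGRU interleaves two GRU transition steps with an attention read over the source annotations. The sketch below is only an illustration with hypothetical callables standing in for parameterised networks, not the Nematus implementation:

```python
def cgru_step(y_prev_emb, s_prev, src_annotations, gru1, gru2, attention):
    """Conditional GRU step, following the dl4mt cGRU note (schematic only).

    gru1(x, h), gru2(x, h) and attention(annotations, query) are hypothetical
    callables; src_annotations are the encoder states of the source sentence.
    """
    s_intermediate = gru1(y_prev_emb, s_prev)              # first transition step
    context = attention(src_annotations, s_intermediate)   # attend over source annotations C
    s_new = gru2(context, s_intermediate)                   # second transition step
    return s_new, context
```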


Extension of cGRU for n contexts


Detailed formulas: https://github.com/bricksdont/ncgru/blob/master/ct.pdf
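Again only as an illustration: per additional context, the extension adds one further attention read and one further GRU transition step after the standard cGRU steps. The callables below are hypothetical stand-ins; the exact formulation is in the linked ct.pdf.

```python
def ncgru_step(y_prev_emb, s_prev, src_annotations, extra_annotations,
               gru_in, gru_out, extra_grus, attention, extra_attentions):
    """Sketch of a cGRU extended to n contexts (previous source or target
    sentences). Interfaces are hypothetical, not the Nematus API.
    """
    s = gru_in(y_prev_emb, s_prev)
    context = attention(src_annotations, s)
    s = gru_out(context, s)
    # additional deep-transition steps, one per extra context
    for annotations, att, gru in zip(extra_annotations, extra_attentions, extra_grus):
        extra_context = att(annotations, s)
        s = gru(extra_context, s)
    return s
```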


How to incorporate larger context?


• Additional encoder and attention networks for previous context (Jean et al., 2017) in Nematus

• Technically: an extension of deep transition (Pascanu et al., 2013) with additional GRU steps that attend to contexts other than the current source sentence

• Intuitively: while generating the next word, the decoder has access to the previous source or target sentence

• Multiple encoders share most of their parameters because embedding matrices are tied (Press and Wolf, 2017); see the sketch below
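As a rough illustration of the tying, all encoders can look tokens up in one shared matrix, so adding context encoders adds almost no embedding parameters. The names and sizes below are made up for the example.

```python
import numpy as np

VOCAB_SIZE, EMB_DIM = 50000, 512  # hypothetical sizes
shared_embeddings = np.random.randn(VOCAB_SIZE, EMB_DIM).astype(np.float32) * 0.01

def embed(token_ids):
    """Every encoder (main and context) uses this same lookup table."""
    return shared_embeddings[token_ids]
```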


Actual systems we have trained


• Nematus systems with standard parameters, similar to Edinburgh’s WMT 17 submissions

• English to German (why?)
• Training data from WMT 17

1) Baseline system without additional context
2) + source context: 1 previous source sentence, if any
3) + target context: 1 previous target sentence, if any
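The slides do not show the data preparation; a minimal sketch of what "previous source/target sentence, if any" could look like as a preprocessing step (not the authors' actual script) is:

```python
def add_previous_context(documents):
    """Attach the previous source and target sentence to each sentence pair,
    using an empty string when there is none ("if any").

    documents: list of documents, each a list of (src, tgt) pairs in order.
    """
    examples = []
    for doc in documents:
        prev_src, prev_tgt = "", ""
        for src, tgt in doc:
            examples.append({
                "src": src,
                "tgt": tgt,
                "prev_src": prev_src,  # context for system 2)
                "prev_tgt": prev_tgt,  # context for system 3)
            })
            prev_src, prev_tgt = src, tgt
    return examples
```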


How to evaluate larger-context systems?


• Need: evaluation that focuses on specific linguistic phenomena
• Challenge set for contrastive evaluation

Source: Despite the fact that it is a part of China, Hong Kong determines its currency policy separately.

Target: Hongkong bestimmt, obwohl es zu China gehört, seine Währungspolitik selbst.

Contrastive: Hongkong bestimmt, obwohl er zu China gehört, seine Währungspolitik selbst.

(example taken from newstest2009)
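Contrastive evaluation then amounts to scoring the correct and the contrastive translation with the trained model and checking which one gets the higher probability. A minimal sketch, assuming a hypothetical score(src, tgt) function that returns a model log-probability (e.g. obtained with a scorer such as Nematus' scoring mode; larger-context systems would also pass the context):

```python
def contrastive_accuracy(examples, score):
    """Fraction of examples where the model assigns a higher probability to
    the correct translation than to the contrastive variant.

    Each example is a dict with (hypothetical) keys "src", "target" and
    "contrastive"; `score` is the hypothetical scoring function described above.
    """
    correct = 0
    for ex in examples:
        if score(ex["src"], ex["target"]) > score(ex["src"], ex["contrastive"]):
            correct += 1
    return correct / len(examples)
```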


How to evaluate larger-context systems?


• Previous work with manually constructed sets: Guillou and Hardmeier (2016); Isabelle et al. (2017); Bawden et al. (submitted)

• Larger-scale automatic sets: Sennrich (2017); Rios et al. (2017); Burlot and Yvon (2017); ours


Our test set of contrastive examples


• Sources: WMT, CS Corpus, OpenSubtitles
• Good candidates extracted automatically after linguistic processing (parsing, coreference resolution)
• Focused on personal pronouns
• Roughly 600k examples
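To illustrate the idea of automatic extraction (much simplified; the real pipeline relies on parsing and coreference resolution to select good candidates), contrastive variants for an English 'it' can be generated by swapping the aligned German pronoun:

```python
GERMAN_THIRD_PERSON = {"er", "sie", "es"}

def make_contrastive_variants(target_tokens, pronoun_index):
    """Given a tokenized German target sentence and the position of the pronoun
    that translates English 'it', produce contrastive variants by substituting
    the other two pronouns. A toy sketch only, ignoring casing and agreement
    elsewhere in the sentence.
    """
    original = target_tokens[pronoun_index].lower()
    variants = []
    for pronoun in sorted(GERMAN_THIRD_PERSON - {original}):
        tokens = list(target_tokens)
        tokens[pronoun_index] = pronoun
        variants.append(" ".join(tokens))
    return variants

# make_contrastive_variants("Es beschreibt den Bitcoin".split(), 0)
# -> ["er beschreibt den Bitcoin", "sie beschreibt den Bitcoin"]
```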


Results: BLEU


System     newstest2015 (dev)   newstest2017 (test)
Baseline   24.80                23.02
C10        22.68                21.47
C11        24.48                22.38


Contrastive scores where EN pronoun is ‘it’


                      Baseline   C10    C11
Overall performance   0.44       0.47   0.64

           Baseline   C10    C11
it : er    0.18       0.27   0.50
it : es    0.84       0.76   0.83
it : sie   0.30       0.39   0.62


Contrastive scores where EN pronoun is ‘it’


                 Baseline   C10    C11
intrasegmental   0.61       0.60   0.67
extrasegmental   0.41       0.45   0.64

distance   Baseline   C10    C11
0          0.61       0.60   0.67
1          0.36       0.43   0.64
2          0.46       0.43   0.58
3          0.53       0.53   0.66
3+         0.67       0.56   0.76


Current activities


Last steps for the contrastive evaluation experiments:
• Publish our resource and work at WMT 18

Ongoing work:
• inductive biases of fully convolutional (Gehring et al., 2017) or self-attention ("transformer") models (Vaswani et al., 2017); collaboration with Edinburgh
• Low-resource experiments with Romansh: pretraining transformer models with self-attentional language models (adaptation of Ramachandran et al., 2017)


Thanks!


Code currently here: https://gitlab.cl.uzh.ch/mt/nematus-context2


Bibliography


Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).

Bawden, Rachel, et al. “Evaluating Discourse Phenomena in Neural Machine Translation”. (Submitted to NAACL 2018)

Burlot, Franck, and François Yvon. "Evaluating the morphological competence of Machine Translation Systems." Proceedings of the Second Conference on Machine Translation. 2017.

Chung, Junyoung, et al. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).

Gehring, Jonas, et al. "Convolutional sequence to sequence learning." arXiv preprint arXiv:1705.03122 (2017).

Guillou, Liane, and Christian Hardmeier. "PROTEST: A Test Suite for Evaluating Pronouns in Machine Translation." LREC. 2016.

Isabelle, Pierre, Colin Cherry, and George Foster. "A Challenge Set Approach to Evaluating Machine Translation." Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017.


Jean, Sébastien, et al. "Does Neural Machine Translation Benefit from Larger Context?" arXiv preprint arXiv:1704.05135 (2017).

Miculicich Werlen, Lesly, et al. “Self-Attentive Residual Decoder for Neural Machine Translation.” (Submitted to NAACL 2018)

Pascanu, Razvan, et al. "How to construct deep recurrent neural networks." In Proceedings of the Second International Conference on Learning Representations (ICLR 2014)

Press, Ofir, and Lior Wolf. "Using the Output Embedding to Improve Language Models." Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Vol. 2. 2017.

Ramachandran, Prajit, Peter Liu, and Quoc Le. "Unsupervised Pretraining for Sequence to Sequence Learning." Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017.

Rikters, Matīss, Mark Fishel, and Ondřej Bojar. "Visualizing neural machine translation attention and confidence." The Prague Bulletin of Mathematical Linguistics 109.1 (2017): 39-50.


Rios Gonzales, Annette, Laura Mascarell, and Rico Sennrich. "Improving Word Sense Disambiguation in Neural Machine Translation with Sense Embeddings." Proceedings of the Second Conference on Machine Translation. 2017.

Sennrich, Rico, et al. "Nematus: a Toolkit for Neural Machine Translation." Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics. 2017.

Sennrich, Rico. "How Grammatical is Character-level Neural Machine Translation? Assessing MT Quality with Contrastive Translation Pairs." Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Vol. 2. 2017.

Tiedemann, Jörg, and Yves Scherrer. "Neural Machine Translation with Extended Context." Proceedings of the Third Workshop on Discourse in Machine Translation. 2017.

Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.

Wang, Longyue, et al. "Exploiting Cross-Sentence Context for Neural Machine Translation." Proceedings of EMNLP. 2017.


Appendix: Notions of depth in RNNs


• generally three types of depth (Pascanu et al., 2013):

• stacked layers (each layer individually recurrent)
• deep transition (units not individually recurrent)
• deep output (units not individually recurrent)

• in Nematus, the decoder is implemented as a cGRU with deep transition and deep output

• crucially: attention over source sentence vectors C is a deep transition step
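To make the distinction concrete, here is a minimal sketch (hypothetical cell interfaces, not Nematus code) contrasting stacked recurrence with deep transition steps inside a single time step:

```python
def stacked_rnn(inputs, layers):
    """Stacked depth: each layer is individually recurrent over the whole sequence.
    `layers` are callables layer(x, h) -> h that accept an initial state of None."""
    states = [None] * len(layers)
    outputs = []
    for x in inputs:
        h = x
        for i, layer in enumerate(layers):
            states[i] = layer(h, states[i])
            h = states[i]
        outputs.append(h)
    return outputs

def deep_transition_step(x, h_prev, first_cell, extra_cells):
    """Deep transition: several GRU micro-steps within one time step; the extra
    cells are not individually recurrent over the sequence. In the Nematus
    decoder, attention over the source annotations C sits between two such
    steps (cf. the cGRU sketch above)."""
    h = first_cell(x, h_prev)
    for cell in extra_cells:
        h = cell(h)  # each extra step further transforms the intermediate state
    return h
```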