
Institute of Computational Linguistics

Contrastive Evaluation of Larger-context Neural Machine Translation

Kolloquium Talk 2018 (KOLLO, 4/10/18)

Mathias Müller


Larger-context neural machine translation



Why larger context?


Source: However, the European Central Bank (ECB) took an interest in it in a report on virtual currencies published in October. It describes bitcoin as "the most successful virtual currency," […].

Target: Dennoch hat die Europäische Zentralbank (EZB) in einem im Oktober veröffentlichten Bericht über virtuelle Währungen Interesse hierfür gezeigt. Sie beschreibt Bitcoin als "die virtuelle Währung mit dem größten Erfolg" […].

(example taken from newstest2013.{de,en})



Why larger context?


Source: It describes bitcoin as "the most successful virtual currency".

Target: Es beschreibt den Bitcoin als "die erfolgreichste virtuelle Währung".


How to incorporate larger context?


Open question; preliminary works:
• gated auxiliary context or "warm start" decoder initialization with a document summary (Wang et al., 2017)
• additional encoder and attention network for the previous source sentence (Jean et al., 2017)
• concatenate the previous source sentence, marked with a prefix (Tiedemann and Scherrer, 2017)
• both source and target context (Miculicich Werlen et al., submitted)
• hierarchical attention, among other solutions (Bawden et al., submitted)


Additional encoder and attention network


• on top of Nematus (Sennrich et al., 2017), which follows standard practice: an encoder-decoder framework with attention (Bahdanau et al., 2014)

• encoder and decoder are gated recurrent units (GRUs), a variant of RNNs

• the decoder is a GRU conditioned on the source sentence context, which in turn is produced by the encoder and modulated by attention

• we also condition on preceding sentences, with additional encoders and separate attention networks


Recurrent neural networks refresher



RNN variant: gated recurrent unit (GRU)


Figure taken from Chung et al. (2014)
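As a quick reminder of what the figure shows, one GRU step can be written in a few lines. The sketch below is a plain NumPy illustration with hypothetical parameter names (biases omitted), following the formulation of Chung et al. (2014); it is not code from Nematus.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step; `params` is a dict of weight matrices (hypothetical names, no biases).

    z_t = sigmoid(W_z x_t + U_z h_{t-1})            # update gate
    r_t = sigmoid(W_r x_t + U_r h_{t-1})            # reset gate
    h~_t = tanh(W_h x_t + U_h (r_t * h_{t-1}))      # candidate state
    h_t = (1 - z_t) * h_{t-1} + z_t * h~_t          # interpolation
    """
    z = sigmoid(params["W_z"] @ x_t + params["U_z"] @ h_prev)
    r = sigmoid(params["W_r"] @ x_t + params["U_r"] @ h_prev)
    h_tilde = np.tanh(params["W_h"] @ x_t + params["U_h"] @ (r * h_prev))
    return (1.0 - z) * h_prev + z * h_tilde
```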


Conditional gated recurrent unit (cGRU)


Detailed formulas: https://github.com/nyu-dl/dl4mt-tutorial/blob/master/docs/cgru.pdf
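The linked note gives the exact formulas; schematically, the cGRU interleaves two GRU transition steps with an attention read over the source annotations. The sketch below is only an illustration with hypothetical callables standing in for parameterised networks, not the Nematus implementation:

```python
def cgru_step(y_prev_emb, s_prev, src_annotations, gru1, gru2, attention):
    """Conditional GRU step, following the dl4mt cGRU note (schematic only).

    gru1(x, h), gru2(x, h) and attention(annotations, query) are hypothetical
    callables; src_annotations are the encoder states of the source sentence.
    """
    s_intermediate = gru1(y_prev_emb, s_prev)              # first transition step
    context = attention(src_annotations, s_intermediate)   # attend over source annotations C
    s_new = gru2(context, s_intermediate)                   # second transition step
    return s_new, context
```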


Extension of cGRU for n contexts


Detailed formulas: https://github.com/bricksdont/ncgru/blob/master/ct.pdf
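Again only as an illustration: per additional context, the extension adds one further attention read and one further GRU transition step after the standard cGRU steps. The callables below are hypothetical stand-ins; the exact formulation is in the linked ct.pdf.

```python
def ncgru_step(y_prev_emb, s_prev, src_annotations, extra_annotations,
               gru_in, gru_out, extra_grus, attention, extra_attentions):
    """Sketch of a cGRU extended to n contexts (previous source or target
    sentences). Interfaces are hypothetical, not the Nematus API.
    """
    s = gru_in(y_prev_emb, s_prev)
    context = attention(src_annotations, s)
    s = gru_out(context, s)
    # additional deep-transition steps, one per extra context
    for annotations, att, gru in zip(extra_annotations, extra_attentions, extra_grus):
        extra_context = att(annotations, s)
        s = gru(extra_context, s)
    return s
```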


How to incorporate larger context?


• Additional encoder and attention networks for previous context (Jean et al., 2017) in Nematus

• Technically: an extension of deep transition (Pascanu et al., 2013) with additional GRU steps that attend to contexts other than the current source sentence

• Intuitively: while generating the next word, the decoder has access to the previous source or target sentence

• Multiple encoders share most of their parameters because embedding matrices are tied (Press and Wolf, 2017); see the sketch below
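As a rough illustration of the tying, all encoders can look tokens up in one shared matrix, so adding context encoders adds almost no embedding parameters. The names and sizes below are made up for the example.

```python
import numpy as np

VOCAB_SIZE, EMB_DIM = 50000, 512  # hypothetical sizes
shared_embeddings = np.random.randn(VOCAB_SIZE, EMB_DIM).astype(np.float32) * 0.01

def embed(token_ids):
    """Every encoder (main and context) uses this same lookup table."""
    return shared_embeddings[token_ids]
```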


Actual systems we have trained


• Nematus systems with standard parameters, similar to Edinburgh’s WMT 17 submissions

• English to German (why?)
• Training data from WMT 17

1) Baseline system without additional context
2) + source context: 1 previous source sentence, if any
3) + target context: 1 previous target sentence, if any
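The slides do not show the data preparation; a minimal sketch of what "previous source/target sentence, if any" could look like as a preprocessing step (not the authors' actual script) is:

```python
def add_previous_context(documents):
    """Attach the previous source and target sentence to each sentence pair,
    using an empty string when there is none ("if any").

    documents: list of documents, each a list of (src, tgt) pairs in order.
    """
    examples = []
    for doc in documents:
        prev_src, prev_tgt = "", ""
        for src, tgt in doc:
            examples.append({
                "src": src,
                "tgt": tgt,
                "prev_src": prev_src,  # context for system 2)
                "prev_tgt": prev_tgt,  # context for system 3)
            })
            prev_src, prev_tgt = src, tgt
    return examples
```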


How to evaluate larger-context systems?


• Need: evaluation that focuses on specific linguistic phenomena
• Challenge set for contrastive evaluation

Source: Despite the fact that it is a part of China, Hong Kong determines its currency policy separately.

Target: Hongkong bestimmt, obwohl es zu China gehört, seine Währungspolitik selbst.

Contrastive: Hongkong bestimmt, obwohl er zu China gehört, seine Währungspolitik selbst.

(example taken from newstest2009)
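Contrastive evaluation then amounts to scoring the correct and the contrastive translation with the trained model and checking which one gets the higher probability. A minimal sketch, assuming a hypothetical score(src, tgt) function that returns a model log-probability (e.g. obtained with a scorer such as Nematus' scoring mode; larger-context systems would also pass the context):

```python
def contrastive_accuracy(examples, score):
    """Fraction of examples where the model assigns a higher probability to
    the correct translation than to the contrastive variant.

    Each example is a dict with (hypothetical) keys "src", "target" and
    "contrastive"; `score` is the hypothetical scoring function described above.
    """
    correct = 0
    for ex in examples:
        if score(ex["src"], ex["target"]) > score(ex["src"], ex["contrastive"]):
            correct += 1
    return correct / len(examples)
```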


How to evaluate larger-context systems?


• Previous work with manually constructed sets: Guillou and Hardmeier (2016); Isabelle et al. (2017); Bawden et al. (submitted)

• Larger-scale automatic sets: Sennrich (2017); Rios et al. (2017); Burlot and Yvon (2017); ours


Our test set of contrastive examples


• Sources: WMT, CS Corpus, OpenSubtitles
• Good candidates extracted automatically after linguistic processing (parsing, coreference resolution)
• Focused on personal pronouns
• Roughly 600k examples
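To illustrate the idea of automatic extraction (much simplified; the real pipeline relies on parsing and coreference resolution to select good candidates), contrastive variants for an English 'it' can be generated by swapping the aligned German pronoun:

```python
GERMAN_THIRD_PERSON = {"er", "sie", "es"}

def make_contrastive_variants(target_tokens, pronoun_index):
    """Given a tokenized German target sentence and the position of the pronoun
    that translates English 'it', produce contrastive variants by substituting
    the other two pronouns. A toy sketch only, ignoring casing and agreement
    elsewhere in the sentence.
    """
    original = target_tokens[pronoun_index].lower()
    variants = []
    for pronoun in sorted(GERMAN_THIRD_PERSON - {original}):
        tokens = list(target_tokens)
        tokens[pronoun_index] = pronoun
        variants.append(" ".join(tokens))
    return variants

# make_contrastive_variants("Es beschreibt den Bitcoin".split(), 0)
# -> ["er beschreibt den Bitcoin", "sie beschreibt den Bitcoin"]
```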


Results: BLEU


System     newstest2015 (dev)   newstest2017 (test)
Baseline   24.80                23.02
C10        22.68                21.47
C11        24.48                22.38


Contrastive scores where EN pronoun is ‘it’


                      Baseline   C10    C11
Overall performance   0.44       0.47   0.64

           Baseline   C10    C11
it : er    0.18       0.27   0.50
it : es    0.84       0.76   0.83
it : sie   0.30       0.39   0.62


Contrastive scores where EN pronoun is ‘it’


                 Baseline   C10    C11
intrasegmental   0.61       0.60   0.67
extrasegmental   0.41       0.45   0.64

distance   Baseline   C10    C11
0          0.61       0.60   0.67
1          0.36       0.43   0.64
2          0.46       0.43   0.58
3          0.53       0.53   0.66
3+         0.67       0.56   0.76


Current activities


Last steps for the contrastive evaluation experiments:
• Publish our resource and work at WMT 18

Ongoing work:
• inductive biases of fully convolutional (Gehring et al., 2017) or self-attention ("transformer") models (Vaswani et al., 2017); collaboration with Edinburgh
• Low-resource experiments with Romansh: pretraining transformer models with self-attentional language models (adaptation of Ramachandran et al., 2017)


Thanks!


Code currently here: https://gitlab.cl.uzh.ch/mt/nematus-context2


Bibliography


Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).

Bawden, Rachel, et al. “Evaluating Discourse Phenomena in Neural Machine Translation”. (Submitted to NAACL 2018)

Burlot, Franck, and François Yvon. "Evaluating the morphological competence of Machine Translation Systems." Proceedings of the Second Conference on Machine Translation. 2017.

Chung, Junyoung, et al. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).

Gehring, Jonas, et al. "Convolutional sequence to sequence learning." arXiv preprint arXiv:1705.03122 (2017).

Guillou, Liane, and Christian Hardmeier. "PROTEST: A Test Suite for Evaluating Pronouns in Machine Translation." LREC. 2016.

Isabelle, Pierre, Colin Cherry, and George Foster. "A Challenge Set Approach to Evaluating Machine Translation." Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017.


Jean, Sébastien, et al. "Does Neural Machine Translation Benefit from Larger Context?" arXiv preprint arXiv:1704.05135 (2017).

Miculicich Werlen, Lesly, et al. “Self-Attentive Residual Decoder for Neural Machine Translation.” (Submitted to NAACL 2018)

Pascanu, Razvan, et al. "How to construct deep recurrent neural networks." In Proceedings of the Second International Conference on Learning Representations (ICLR 2014)

Press, Ofir, and Lior Wolf. "Using the Output Embedding to Improve Language Models." Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Vol. 2. 2017.

Ramachandran, Prajit, Peter Liu, and Quoc Le. "Unsupervised Pretraining for Sequence to Sequence Learning." Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017.

Rikters, Matīss, Mark Fishel, and Ondřej Bojar. "Visualizing neural machine translation attention and confidence." The Prague Bulletin of Mathematical Linguistics 109.1 (2017): 39-50.


Rios Gonzales, Annette, Laura Mascarell, and Rico Sennrich. "Improving Word Sense Disambiguation in Neural Machine Translation with Sense Embeddings." Proceedings of the Second Conference on Machine Translation. 2017.

Sennrich, Rico, et al. "Nematus: a Toolkit for Neural Machine Translation." Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics. 2017.

Sennrich, Rico. "How Grammatical is Character-level Neural Machine Translation? Assessing MT Quality with Contrastive Translation Pairs." Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Vol. 2. 2017.

Tiedemann, Jörg, and Yves Scherrer. "Neural Machine Translation with Extended Context." Proceedings of the Third Workshop on Discourse in Machine Translation. 2017.

Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems. 2017.

Wang, Longyue, et al. "Exploiting Cross-Sentence Context for Neural Machine Translation." Proceedings of EMNLP. 2017.


Appendix: Notions of depth in RNNs


• generally three types of depth (Pascanu et al., 2013):

• stacked layers (each layer individually recurrent)
• deep transition (units not individually recurrent)
• deep output (units not individually recurrent)

• in Nematus, the decoder is implemented as a cGRU with deep transition and deep output

• crucially: attention over source sentence vectors C is a deep transition step
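To make the distinction concrete, here is a minimal sketch (hypothetical cell interfaces, not Nematus code) contrasting stacked recurrence with deep transition steps inside a single time step:

```python
def stacked_rnn(inputs, layers):
    """Stacked depth: each layer is individually recurrent over the whole sequence.
    `layers` are callables layer(x, h) -> h that accept an initial state of None."""
    states = [None] * len(layers)
    outputs = []
    for x in inputs:
        h = x
        for i, layer in enumerate(layers):
            states[i] = layer(h, states[i])
            h = states[i]
        outputs.append(h)
    return outputs

def deep_transition_step(x, h_prev, first_cell, extra_cells):
    """Deep transition: several GRU micro-steps within one time step; the extra
    cells are not individually recurrent over the sequence. In the Nematus
    decoder, attention over the source annotations C sits between two such
    steps (cf. the cGRU sketch above)."""
    h = first_cell(x, h_prev)
    for cell in extra_cells:
        h = cell(h)  # each extra step further transforms the intermediate state
    return h
```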