Document Context Neural Machine Translation with Memory Networks

Sameen Maruf, Gholamreza Haffari
Faculty of Information Technology
Monash University
July 17, 2017
Overview

1 Introduction
2 Document MT as Structured Prediction
3 Document NMT with MemNets
4 Experiments and Analysis
5 Conclusion
6 References
Introduction

Why document-level machine translation?

Most MT models translate sentences independently.

Discourse phenomena are therefore ignored, e.g. pronominal anaphora and lexical consistency, which may involve long-range dependencies.
Why document-level machine translation?

Statistical MT approaches to document MT do not yield significant empirical improvements [Hardmeier and Federico, 2010, Gong et al., 2011, Garcia et al., 2014].

Previous context-NMT models use only local context, and report deteriorated performance when using the target-side context [Jean et al., 2017, Wang et al., 2017, Bawden et al., 2018].

We incorporate global source and target document contexts.
Document MT as Structured Prediction

[figure: graphical model over the document's source and target sentences]

Two types of factors: $f_\theta(y_t \,;\, x_t, x_{-t})$ and $g_\theta(y_t \,;\, y_{-t})$
Document MT as Structured Prediction

Training objective: maximise $P(y_1, \ldots, y_{|d|} \mid x_1, \ldots, x_{|d|})$

$\Longrightarrow$ maximise the pseudo-likelihood:

$$\arg\max_\theta \prod_{t=1}^{|d|} P_\theta(y_t \mid x_t, y_{-t}, x_{-t}) \qquad (1)$$

where $f_\theta$ and $g_\theta$ are subsumed in $P_\theta(y_t \mid x_t, y_{-t}, x_{-t})$.
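The pseudo-likelihood in Eq. (1) simply sums per-sentence conditional log-probabilities over one document. A minimal sketch, where `log_prob` is a hypothetical stand-in for the model's $\log P_\theta(y_t \mid x_t, y_{-t}, x_{-t})$:

```python
def pseudo_log_likelihood(doc_src, doc_trg, log_prob):
    """Sum log P(y_t | x_t, y_-t, x_-t) over all sentences t of one document.

    `log_prob` is a hypothetical stand-in for the trained model's
    per-sentence conditional log-probability.
    """
    total = 0.0
    for t in range(len(doc_src)):
        x_rest = doc_src[:t] + doc_src[t + 1:]  # x_{-t}: the other source sentences
        y_rest = doc_trg[:t] + doc_trg[t + 1:]  # y_{-t}: the other target sentences
        total += log_prob(doc_src[t], doc_trg[t], x_rest, y_rest)
    return total
```

During training all of $y_{-t}$ is available from the reference document, so each sentence's term can be computed independently.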
Document MT as Structured Prediction

Challenge: at test time, the target document is not given.

Solution: coordinate ascent (i.e., iterative decoding).
Iterative Decoding

[figure: iterative decoding sweeps over the document]
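The coordinate-ascent idea can be sketched as follows; `translate` is a hypothetical single-sentence decoder (the real model would run beam search conditioned on the memory contexts):

```python
def iterative_decode(doc_src, translate, sweeps=2):
    """Coordinate-ascent (iterative) decoding sketch.

    Pass 0 drafts every target sentence with an empty target context;
    each later sweep re-translates one sentence at a time, conditioning
    on the current draft of all *other* target sentences.
    `translate(x_t, x_rest, y_rest)` is a hypothetical sentence-level decoder.
    """
    n = len(doc_src)
    # Initial draft: source context only, no target context yet.
    doc_trg = [translate(doc_src[t], doc_src[:t] + doc_src[t + 1:], [])
               for t in range(n)]
    for _ in range(sweeps):
        for t in range(n):
            x_rest = doc_src[:t] + doc_src[t + 1:]   # x_{-t}
            y_rest = doc_trg[:t] + doc_trg[t + 1:]   # y_{-t}, current draft
            doc_trg[t] = translate(doc_src[t], x_rest, y_rest)
    return doc_trg
```

Each update holds every other coordinate (sentence) fixed, so the pseudo-likelihood cannot decrease under exact per-sentence decoding.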
Document NMT with MemNets

[figure: model architecture] $\Longrightarrow P_\theta(y_t \mid x_t, y_{-t}, x_{-t})$
Document NMT with MemNets

Memory-to-Context:

$$s_{t,j} = \mathrm{GRU}\big(s_{t,j-1},\, E_T[y_{t,j-1}],\, c_{t,j},\, c_t^{\mathrm{src}},\, c_t^{\mathrm{trg}}\big)$$
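A toy numpy sketch of the Memory-to-Context update: the previous target embedding, the sentence-level attention context, and the two document-memory contexts are concatenated into the input of a standard GRU cell. All sizes and weights here are illustrative stand-ins, not the paper's parameters (biases are omitted for brevity):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, x, W, U):
    """Plain GRU update with the three gate weights stacked in W and U."""
    Wz, Wr, Wh = np.split(W, 3)
    Uz, Ur, Uh = np.split(U, 3)
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))  # candidate state
    return (1 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
H, E, C = 4, 3, 4                   # toy hidden, embedding, context sizes

emb_prev  = rng.normal(size=E)      # E_T[y_{t,j-1}]
c_attn    = rng.normal(size=C)      # c_{t,j}: sentence-level attention context
c_src_mem = rng.normal(size=C)      # c_t^src: source document memory context
c_trg_mem = rng.normal(size=C)      # c_t^trg: target document memory context
x = np.concatenate([emb_prev, c_attn, c_src_mem, c_trg_mem])

W = rng.normal(size=(3 * H, x.size))
U = rng.normal(size=(3 * H, H))
s = gru_step(np.zeros(H), x, W, U)  # s_{t,j}: next decoder state
```

The design choice here is that the memory contexts influence every decoding step through the recurrent state, rather than only the output layer.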
Document NMT with MemNets

Memory-to-Output:

$$y_{t,j} \sim \mathrm{softmax}\big(W_y \cdot r_{t,j} + W_{ym} \cdot c_t^{\mathrm{src}} + W_{yt} \cdot c_t^{\mathrm{trg}} + b_y\big)$$
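In the Memory-to-Output variant the two document-memory contexts instead feed the output layer directly, each through its own projection. A toy numpy sketch with illustrative (randomly initialised) sizes and weights:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # stabilised
    return e / e.sum()

rng = np.random.default_rng(1)
V, H, C = 6, 4, 4                      # toy vocab, hidden, context sizes
W_y  = rng.normal(size=(V, H))         # projects r_{t,j}
W_ym = rng.normal(size=(V, C))         # projects c_t^src
W_yt = rng.normal(size=(V, C))         # projects c_t^trg
b_y  = np.zeros(V)

r_tj  = rng.normal(size=H)             # pre-output decoder representation
c_src = rng.normal(size=C)             # source document memory context
c_trg = rng.normal(size=C)             # target document memory context

# Distribution over the vocabulary for target word y_{t,j}.
p = softmax(W_y @ r_tj + W_ym @ c_src + W_yt @ c_trg + b_y)
```

Here the document context biases only the word prediction at each position, leaving the recurrent state untouched.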
Document NMT with MemNets

Use only the source, only the target, or both external memories.

Use the Memory-to-Context / Memory-to-Output architectures to incorporate the different contexts.
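At a high level, either memory is read the same way: attend over representations of the document's other sentences and take a weighted sum. A generic soft-attention read (the dot-product scoring below is an assumption for illustration; the model's exact scoring function may differ):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def read_memory(query, memory):
    """Soft-attention read over a memory of sentence representations:
    score every cell against the query, then return the
    attention-weighted sum as the document context vector."""
    alpha = softmax(memory @ query)   # one attention weight per memory cell
    return alpha @ memory             # weighted sum of memory cells

rng = np.random.default_rng(2)
memory = rng.normal(size=(5, 4))      # 5 other sentences, 4-dim representations (toy)
query  = rng.normal(size=4)           # representation of the current sentence
c_doc  = read_memory(query, memory)   # plays the role of c_t^src or c_t^trg
```

For the source memory the cells come from encoding the other source sentences; for the target memory, from the current draft translations of the other target sentences.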
Experiments and Analysis

Experimental Setup

Training/dev/test corpora statistics:

        corpus            #docs (H)    #sents (K)   avg doc len
Fr→En   Ted-Talks         10/1.2/1.5   123/15/19    123/128/124
Et→En   Europarl v7       150/10/18    209/14/25    14/14/14
De→En   News-Commentary   49/.9/1.6    191/2/3      39/23/19

Evaluation metrics: BLEU, METEOR

Baselines:
Context-free baseline (S-NMT)
Local source context baselines: [Jean et al., 2017] & [Wang et al., 2017]
Memory-to-Context Results (BLEU)

        S-NMT   S-NMT+src   S-NMT+trg   S-NMT+both
Fr→En   20.85   21.91       21.74       22
De→En    9.18   10.2         9.97       10.54
Et→En   20.42   22.1        21.94       22.32
Memory-to-Output Results (BLEU)

        S-NMT   S-NMT+src   S-NMT+trg   S-NMT+both
Fr→En   20.85   21.8        21.76       21.77
De→En    9.18    9.98       10.04       10.23
Et→En   20.42   21.5        21.82       22.2
Main Results (BLEU)

        [Jean et al., 2017]   [Wang et al., 2017]   S-NMT+src   S-NMT+both
Fr→En   21.95                 21.87                 21.91       22
De→En   10.26                 10.14                 10.2        10.54
Et→En   21.67                 22.06                 22.1        22.32
Example translation

Source:              qimonda taidab lissaboni strateegia eesmarke.
Target:              qimonda meets the objectives of the lisbon strategy.
S-NMT:               <UNK> is the objectives of the lisbon strategy.
+Src Mem:            the millennium development goals are fulfilling the millennium goals of the lisbon strategy.
+Trg Mem:            in writing. - (ro) the lisbon strategy is fulfilling the objectives of the lisbon strategy.
+Both Mems:          qimonda fulfils the aims of the lisbon strategy.
[Wang et al., 2017]: <UNK> fulfils the objectives of the lisbon strategy.
Example translation (contd.)

Source:              ... et riigis kehtib endiselt lukasenka diktatuur, mis rikub inim- ning etnilise vahemuse oigusi.
Target:              ... this country is still under the dictatorship of lukashenko, breaching human rights and the rights of ethnic minorities.
S-NMT:               ... the country still remains in a position of lukashenko to violate human rights and ethnic minorities.
+Src Mem:            ... the country still applies to the brutal dictatorship of human and ethnic minority rights.
+Trg Mem:            ... the country still keeps the <UNK> dictatorship that violates human rights and ethnic rights.
+Both Mems:          ... the country still persists in lukashenko's dictatorship that violate human rights and ethnic minority rights.
[Wang et al., 2017]: ... there is still a regime in the country that is violating the rights of human and ethnic minority in the country.
Conclusion

Proposed a model that incorporates global source and target document contexts.

Proposed effective training and decoding methodologies for our model.

Future work: investigate document-context NMT models that incorporate specific discourse-level phenomena.
References

Hardmeier, C. and Federico, M. (2010). Modelling pronominal anaphora in statistical machine translation. International Workshop on Spoken Language Translation.

Gong, Z., Zhang, M. and Zhou, G. (2011). Cache-based document-level statistical machine translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Garcia, E. M., Espana-Bonet, C. and Marquez, L. (2014). Document-level machine translation as a re-translation process. Procesamiento del Lenguaje Natural, 53:103–110.

Jean, S., Lauly, S., Firat, O. and Cho, K. (2017). Does Neural Machine Translation Benefit from Larger Context? arXiv:1704.05135.

Wang, L., Tu, Z., Way, A. and Liu, Q. (2017). Exploiting Cross-Sentence Context for Neural Machine Translation. Proceedings of the Conference on Empirical Methods in Natural Language Processing.

Bawden, R., Sennrich, R., Birch, A. and Haddow, B. (2018). Evaluating Discourse Phenomena in Neural Machine Translation. Proceedings of NAACL-HLT 2018.