Get To The Point: Summarization with Pointer-Generator Networks_acl17_Paper Introduction
1. 2017.06.26 NAIST Natural Language Processing Lab, D1 Masayoshi Kondo. About Neural Summarization @ 2017.
Get To The Point: Summarization with Pointer-Generator Networks (ACL17)
Abigail See (Stanford University), Peter J. Liu (Google Brain), Christopher D. Manning (Stanford University)
6. 00: Introduction
Abstractive summarization exhibits undesirable behaviors: inaccurately reproducing factual details, an inability to deal with out-of-vocabulary (OOV) words, and repeating itself.
Task axes: short text (1 or 2 sentences) vs. long text (more than 3 sentences); single document (e.g., headline generation) vs. multiple documents (e.g., opinion mining); summary length varies accordingly. This paper targets long-text, single-document summarization.
9.-12. 00: Introduction
[Architecture diagrams, built up over four slides: a Bi-LSTM encoder and an RNN decoder with attention. The attention distribution over the input sequence yields a context vector, which combines with the decoder state to produce a predicted vocab distribution. The pointer-generator extension computes a generation probability pgen from the context vector, decoder state, and decoder input; the final predicted vocab distribution mixes the vocab distribution (weight pgen) with the attention distribution over the source (weight 1 - pgen).]
13. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5.
Experiments 6. Results 7. Discussion 8. Conclusion
14. 00: Our Models 2.1 Sequence-to-sequence attention model
[Encoder] [Decoder]
$e_i^t = v^\top \tanh(W_h h_i + W_s s_t + b_{attn})$
$a^t = \mathrm{softmax}(e^t)$
$h_t^* = \sum_i a_i^t h_i$
Encoder hidden state: $h_i$ / decoder hidden state: $s_t$ / context vector: $h_t^*$
References: Neural machine translation by jointly learning to align and translate [Bahdanau et al., ICLR15]; Abstractive text summarization using sequence-to-sequence RNNs and beyond [Nallapati et al., CoNLL16]
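To make the update concrete, here is a minimal numpy sketch of one attention step as defined above; the function name, variable names, and shapes are illustrative assumptions, not the authors' code:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attention_step(h, s_t, W_h, W_s, v, b_attn):
    # h: encoder hidden states, shape (src_len, enc_dim)
    # s_t: decoder hidden state, shape (dec_dim,)
    e_t = np.tanh(h @ W_h + s_t @ W_s + b_attn) @ v  # scores e_i^t, shape (src_len,)
    a_t = softmax(e_t)                               # attention distribution a^t
    h_star = a_t @ h                                 # context vector h_t^* = sum_i a_i^t h_i
    return a_t, h_star
```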
15. 00: Our Models 2.2 Pointer-generator network
[Attention architecture as before, plus a soft switch between generating and copying]
$p_{gen} = \sigma(w_{h^*}^\top h_t^* + w_s^\top s_t + w_x^\top x_t + b_{ptr})$
$P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i: w_i = w} a_i^t$
Final probability distribution: $P(w)$ / generation probability: $p_{gen}$
Context vector: $h_t^*$ / decoder state: $s_t$ / decoder input: $x_t$ / vector parameters: $w_{h^*}, w_s, w_x$
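A small numpy sketch of how the final distribution can be assembled over an extended vocabulary (the fixed vocab plus per-article source OOVs); names such as final_distribution, src_ids, and n_src_oov are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def final_distribution(p_vocab, a_t, src_ids, h_star, s_t, x_t,
                       w_h, w_s, w_x, b_ptr, n_src_oov):
    # p_gen = sigma(w_h . h_t^* + w_s . s_t + w_x . x_t + b_ptr)
    p_gen = sigmoid(w_h @ h_star + w_s @ s_t + w_x @ x_t + b_ptr)
    # extended vocab = fixed vocab + OOV words appearing in this source article
    p = np.zeros(p_vocab.size + n_src_oov)
    p[:p_vocab.size] = p_gen * p_vocab           # generation part
    # copy part: P(w) += (1 - p_gen) * sum_{i: w_i = w} a_i^t
    np.add.at(p, src_ids, (1.0 - p_gen) * a_t)   # src_ids: extended-vocab id of each source token
    return p, p_gen
```

Scattering the attention mass onto source-token ids is what lets the model emit OOV words: a word outside the fixed vocab can still get probability through the copy part.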
16. 00: Our Models 2.3 Coverage mechanism
Coverage vector: $c^t = \sum_{t'=0}^{t-1} a^{t'}$ (the attention distributions from decoder timesteps 1, 2, ..., t-1, summed).
$c^t$ is an (unnormalized) distribution over the source document words: it records how much attention each source position has received so far.
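A tiny illustrative sketch of the accumulation, assuming per-step attention arrays of shape (src_len,):

```python
import numpy as np

def coverage_vector(attn_history, src_len):
    # c^t = sum_{t'=0}^{t-1} a^{t'}: running, unnormalized total of past attention
    c_t = np.zeros(src_len)
    for a in attn_history:
        c_t += a
    return c_t
```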
17. 00: Our Models 2.3 Coverage mechanism
The coverage vector is fed to the attention mechanism:
$e_i^t = v^\top \tanh(W_h h_i + W_s s_t + w_c c_i^t + b_{attn})$
Coverage loss: $\mathrm{covloss}_t = \sum_i \min(a_i^t, c_i^t)$
Total loss: $\mathrm{loss}_t = -\log P(w_t^*) + \lambda \sum_i \min(a_i^t, c_i^t)$
Intuition: $c_i^t$ records how much attention encoder position i has already received up to decoder step t, so $\min(a_i^t, c_i^t)$ is large only when the decoder attends again to an already-covered position; backpropagating through this term discourages the decoder from repeatedly attending to (and hence repeating) the same part of the encoder input.
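A sketch of the coverage loss and the per-step training loss under the same assumptions as above (p_target stands for $P(w_t^*)$, the probability assigned to the gold target word):

```python
import numpy as np

def coverage_loss(a_t, c_t):
    # covloss_t = sum_i min(a_i^t, c_i^t)
    return np.minimum(a_t, c_t).sum()

def step_loss(p_target, a_t, c_t, lam=1.0):
    # loss_t = -log P(w_t^*) + lambda * covloss_t
    return -np.log(p_target) + lam * coverage_loss(a_t, c_t)
```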
18. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5.
Experiments 6. Results 7. Discussion 8. Conclusion
19. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5.
Experiments 6. Results 7. Discussion 8. Conclusion
20. 00: Dataset
CNN/Daily Mail dataset: online news articles.
Source (article): avg 781 tokens; vocab size 150k.
Target (summary): avg 3.75 sentences / 56 tokens; vocab size 60k.
Settings: used the pre-processing scripts of Nallapati et al. (2016); used the original text (the non-anonymized version of the data).
Dataset size: train 287,226 / validation 13,368 / test 11,496.
21. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5.
Experiments 6. Results 7. Discussion 8. Conclusion
22. 00: Experiments
Model details: hidden layer 256 dims; word embeddings 128 dims, learned from scratch (no pre-training); two vocab settings: large (src 150k / trg 60k) and small (src 50k / trg 50k).
Setting details: optimizer Adagrad, init lr 0.15, init accumulator value 0.1; no regularization terms; max gradient-clipping norm 2; early stopping on the validation set; batch size 16; beam size 4 (for test).
Environment and procedure: single Tesla K40m GPU. Training: src truncated to 400 tokens, trg to 100 tokens. Test: src 400 tokens, trg up to 120 tokens. Evaluation: ROUGE scores (F1) and METEOR scores.
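The settings above, gathered into a config sketch for reference; the key names are mine, only the values come from the slide:

```python
config = {
    "hidden_dim": 256,        # encoder/decoder hidden layer size
    "emb_dim": 128,           # word embeddings, learned from scratch
    "vocab_size": 50_000,     # small setting; large setting: 150k src / 60k trg
    "optimizer": "Adagrad",
    "lr": 0.15,               # initial learning rate
    "adagrad_init_acc": 0.1,  # initial accumulator value
    "max_grad_norm": 2.0,     # gradient clipping; no other regularization
    "batch_size": 16,
    "beam_size": 4,           # test-time beam search
    "max_enc_steps": 400,     # src truncated to 400 tokens
    "max_dec_steps": 100,     # trg truncated to 100 tokens (up to 120 at test)
}
```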
23. 00: Experiments
Training time (computational cost):
Proposed model: 230,000 iters (12.8 epochs), about 3 days + 4 hours.
Baseline model: 600,000 iters (33 epochs); 50k vocab: 4 days + 14 hours; 150k vocab: 8 days + 21 hours.
Other settings: coverage loss weight λ = 1; the coverage phase is trained for about 3,000 further iterations.
Inspection: with λ = 2 the coverage loss overwhelmed the primary loss; conversely, a coverage model trained without the coverage loss did not reduce attention repetition.
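A schematic of the implied two-phase schedule; model.train_step and data.next_batch are hypothetical placeholders, not real API calls:

```python
def train(model, data):
    # Phase 1: train the pointer-generator without coverage
    # (230,000 iterations, about 12.8 epochs).
    for _ in range(230_000):
        model.train_step(data.next_batch(), coverage=False)
    # Phase 2: switch on the coverage mechanism with loss weight
    # lambda = 1 and fine-tune for about 3,000 further iterations.
    for _ in range(3_000):
        model.train_step(data.next_batch(), coverage=True, cov_loss_weight=1.0)
```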
24. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5.
Experiments 6. Results 7. Discussion 8. Conclusion
27. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5.
Experiments 6. Results 7. Discussion 8. Conclusion
28. 00: Discussion 7.1 Comparison with extractive systems
On ROUGE-1 and ROUGE-2, the extractive lead-3 baseline is hard to beat. Two likely reasons: (1) news articles tend to put the most important information at the beginning, so the opening sentences are strong summary material; (2) consistently with this, truncating articles to 400 tokens (about 20 sentences) gave higher ROUGE scores than using 800 tokens.
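For reference, the lead-3 baseline is trivial to implement; a naive sketch with a regex sentence splitter (the splitter is my assumption, since the baseline's actual sentence segmentation is not specified here):

```python
import re

def lead_3(article: str) -> str:
    # Lead-3 baseline: the "summary" is just the first three sentences.
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    return " ".join(sentences[:3])
```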
29. 00: Discussion 7.1 Comparison with extractive systems
ROUGE itself rewards the lead-3 strategy: since reference summaries overlap heavily with the opening of the article, safe extractive choices (first-appearing content, original phrasing) score well on ROUGE.
30. 00: Discussion 7.1 Comparison with extractive systems
The same holds for METEOR: although METEOR also rewards stem, synonym, and paraphrase matches rather than only exact matches, the lead-3 baseline still scores higher than the abstractive models.
31. 00: Discussion 7.1 Comparison with extractive systems
We believe that investigating this issue further is an important direction for future work.
7.2 How abstractive is our model?
We have shown that our pointer mechanism makes our abstractive system more reliable, copying factual details correctly more often. But does the ease of copying make our system any less abstractive?
32. 00: Discussion 7.2 How abstractive is our model?
Abstractiveness is measured via n-gram overlap with the src article: the fraction of n-grams in the generated summary that also appear in the source (copied rather than novel n-grams).
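A sketch of one way to compute such an overlap statistic, as the rate of novel n-grams; the function name and pre-tokenized inputs are assumptions:

```python
def novel_ngram_rate(summary_tokens, article_tokens, n):
    # Fraction of the summary's n-grams that do NOT appear in the src article.
    ngrams = lambda toks: {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    summ = ngrams(summary_tokens)
    return len(summ - ngrams(article_tokens)) / len(summ) if summ else 0.0
```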
33. 00: Discussion 7.2 How abstractive is our model?
[Fig. 5 and Fig. 7 from the paper; example article fragment: "X beat Y on ..."]
34. 00: Discussion 7.2 How abstractive is our model?
Observing pgen: during training it starts around 0.30 and rises to about 0.53 by the end of training; at test time it averages 0.17, i.e., the model leans heavily on copying from the src article.
35. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5.
Experiments 6. Results 7. Discussion 8. Conclusion