Get To The Point: Summarization with Pointer-Generator Networks_acl17_Paper Introduction
1. 2017.06.26 NAIST Natural Language Processing Lab, D1 Masayoshi Kondo. About Neural Summarization @ 2017.
Get To The Point: Summarization with Pointer-Generator Networks (ACL17)
Abigail See (Stanford University), Peter J. Liu (Google Brain), Christopher D. Manning (Stanford University)
6. 00: Introduction
Abstractive summarization exhibits undesirable behaviors: inaccurately reproducing factual details, an inability to deal with out-of-vocabulary (OOV) words, and repeating itself.
Task axes: short text (1 or 2 sentences) vs. long text (more than 3 sentences); single document (e.g., headline generation) vs. multiple documents (e.g., opinion mining); summary length varies accordingly. This paper targets long-text, single-document summarization.
9.-12. 00: Introduction
[Architecture diagrams, built up over four slides: a Bi-LSTM encoder and an RNN decoder with attention. The attention distribution over the input sequence yields a context vector, which combines with the decoder state to produce a predicted vocab distribution. The pointer-generator extension computes a generation probability pgen from the context vector, decoder state, and decoder input; the final predicted vocab distribution mixes the vocab distribution (weight pgen) with the attention distribution over the source (weight 1 - pgen).]
13. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5.
Experiments 6. Results 7. Discussion 8. Conclusion
14. 00: Our Models 2.1 Sequence-to-sequence attention model
[Encoder] [Decoder]
$e_i^t = v^\top \tanh(W_h h_i + W_s s_t + b_{attn})$
$a^t = \mathrm{softmax}(e^t)$
$h_t^* = \sum_i a_i^t h_i$
Encoder hidden state: $h_i$ / decoder hidden state: $s_t$ / context vector: $h_t^*$
References: Neural machine translation by jointly learning to align and translate [Bahdanau et al., ICLR15]; Abstractive text summarization using sequence-to-sequence RNNs and beyond [Nallapati et al., CoNLL16]
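To make the update concrete, here is a minimal numpy sketch of one attention step as defined above; the function name, variable names, and shapes are illustrative assumptions, not the authors' code:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attention_step(h, s_t, W_h, W_s, v, b_attn):
    # h: encoder hidden states, shape (src_len, enc_dim)
    # s_t: decoder hidden state, shape (dec_dim,)
    e_t = np.tanh(h @ W_h + s_t @ W_s + b_attn) @ v  # scores e_i^t, shape (src_len,)
    a_t = softmax(e_t)                               # attention distribution a^t
    h_star = a_t @ h                                 # context vector h_t^* = sum_i a_i^t h_i
    return a_t, h_star
```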
15. 00: Our Models 2.2 Pointer-generator network
[Attention architecture as before, plus a soft switch between generating and copying]
$p_{gen} = \sigma(w_{h^*}^\top h_t^* + w_s^\top s_t + w_x^\top x_t + b_{ptr})$
$P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i: w_i = w} a_i^t$
Final probability distribution: $P(w)$ / generation probability: $p_{gen}$
Context vector: $h_t^*$ / decoder state: $s_t$ / decoder input: $x_t$ / vector parameters: $w_{h^*}, w_s, w_x$
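A small numpy sketch of how the final distribution can be assembled over an extended vocabulary (the fixed vocab plus per-article source OOVs); names such as final_distribution, src_ids, and n_src_oov are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def final_distribution(p_vocab, a_t, src_ids, h_star, s_t, x_t,
                       w_h, w_s, w_x, b_ptr, n_src_oov):
    # p_gen = sigma(w_h . h_t^* + w_s . s_t + w_x . x_t + b_ptr)
    p_gen = sigmoid(w_h @ h_star + w_s @ s_t + w_x @ x_t + b_ptr)
    # extended vocab = fixed vocab + OOV words appearing in this source article
    p = np.zeros(p_vocab.size + n_src_oov)
    p[:p_vocab.size] = p_gen * p_vocab           # generation part
    # copy part: P(w) += (1 - p_gen) * sum_{i: w_i = w} a_i^t
    np.add.at(p, src_ids, (1.0 - p_gen) * a_t)   # src_ids: extended-vocab id of each source token
    return p, p_gen
```

Scattering the attention mass onto source-token ids is what lets the model emit OOV words: a word outside the fixed vocab can still get probability through the copy part.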
16. 00: Our Models 2.3 Coverage mechanism
Coverage vector: $c^t = \sum_{t'=0}^{t-1} a^{t'}$ (the attention distributions from decoder timesteps 1, 2, ..., t-1, summed).
$c^t$ is an (unnormalized) distribution over the source document words: it records how much attention each source position has received so far.
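A tiny illustrative sketch of the accumulation, assuming per-step attention arrays of shape (src_len,):

```python
import numpy as np

def coverage_vector(attn_history, src_len):
    # c^t = sum_{t'=0}^{t-1} a^{t'}: running, unnormalized total of past attention
    c_t = np.zeros(src_len)
    for a in attn_history:
        c_t += a
    return c_t
```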
17. 00: Our Models 2.3 Coverage mechanism
The coverage vector is fed to the attention mechanism:
$e_i^t = v^\top \tanh(W_h h_i + W_s s_t + w_c c_i^t + b_{attn})$
Coverage loss: $\mathrm{covloss}_t = \sum_i \min(a_i^t, c_i^t)$
Total loss: $\mathrm{loss}_t = -\log P(w_t^*) + \lambda \sum_i \min(a_i^t, c_i^t)$
Intuition: $c_i^t$ records how much attention encoder position i has already received up to decoder step t, so $\min(a_i^t, c_i^t)$ is large only when the decoder attends again to an already-covered position; backpropagating through this term discourages the decoder from repeatedly attending to (and hence repeating) the same part of the encoder input.
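A sketch of the coverage loss and the per-step training loss under the same assumptions as above (p_target stands for $P(w_t^*)$, the probability assigned to the gold target word):

```python
import numpy as np

def coverage_loss(a_t, c_t):
    # covloss_t = sum_i min(a_i^t, c_i^t)
    return np.minimum(a_t, c_t).sum()

def step_loss(p_target, a_t, c_t, lam=1.0):
    # loss_t = -log P(w_t^*) + lambda * covloss_t
    return -np.log(p_target) + lam * coverage_loss(a_t, c_t)
```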
18. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5.
Experiments 6. Results 7. Discussion 8. Conclusion
19. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5.
Experiments 6. Results 7. Discussion 8. Conclusion
20. 00: Dataset
CNN/Daily Mail dataset: online news articles.
Source (article): avg 781 tokens; vocab size 150k.
Target (summary): avg 3.75 sentences / 56 tokens; vocab size 60k.
Settings: used the pre-processing scripts of Nallapati et al. (2016); used the original text (the non-anonymized version of the data).
Dataset size: train 287,226 / validation 13,368 / test 11,496.
21. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5.
Experiments 6. Results 7. Discussion 8. Conclusion
22. 00: Experiments
Model details: hidden layer 256 dims; word embeddings 128 dims, learned from scratch (no pre-training); two vocab settings: large (src 150k / trg 60k) and small (src 50k / trg 50k).
Setting details: optimizer Adagrad, init lr 0.15, init accumulator value 0.1; no regularization terms; max gradient-clipping norm 2; early stopping on the validation set; batch size 16; beam size 4 (for test).
Environment and procedure: single Tesla K40m GPU. Training: src truncated to 400 tokens, trg to 100 tokens. Test: src 400 tokens, trg up to 120 tokens. Evaluation: ROUGE scores (F1) and METEOR scores.
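The settings above, gathered into a config sketch for reference; the key names are mine, only the values come from the slide:

```python
config = {
    "hidden_dim": 256,        # encoder/decoder hidden layer size
    "emb_dim": 128,           # word embeddings, learned from scratch
    "vocab_size": 50_000,     # small setting; large setting: 150k src / 60k trg
    "optimizer": "Adagrad",
    "lr": 0.15,               # initial learning rate
    "adagrad_init_acc": 0.1,  # initial accumulator value
    "max_grad_norm": 2.0,     # gradient clipping; no other regularization
    "batch_size": 16,
    "beam_size": 4,           # test-time beam search
    "max_enc_steps": 400,     # src truncated to 400 tokens
    "max_dec_steps": 100,     # trg truncated to 100 tokens (up to 120 at test)
}
```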
23. 00: Experiments
Training time (computational cost):
Proposed model: 230,000 iters (12.8 epochs), about 3 days + 4 hours.
Baseline model: 600,000 iters (33 epochs); 50k vocab: 4 days + 14 hours; 150k vocab: 8 days + 21 hours.
Other settings: coverage loss weight λ = 1; the coverage phase is trained for about 3,000 further iterations.
Inspection: with λ = 2 the coverage loss overwhelmed the primary loss; conversely, a coverage model trained without the coverage loss did not reduce attention repetition.
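A schematic of the implied two-phase schedule; model.train_step and data.next_batch are hypothetical placeholders, not real API calls:

```python
def train(model, data):
    # Phase 1: train the pointer-generator without coverage
    # (230,000 iterations, about 12.8 epochs).
    for _ in range(230_000):
        model.train_step(data.next_batch(), coverage=False)
    # Phase 2: switch on the coverage mechanism with loss weight
    # lambda = 1 and fine-tune for about 3,000 further iterations.
    for _ in range(3_000):
        model.train_step(data.next_batch(), coverage=True, cov_loss_weight=1.0)
```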
24. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5.
Experiments 6. Results 7. Discussion 8. Conclusion
27. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5.
Experiments 6. Results 7. Discussion 8. Conclusion
28. 00: Discussion 7.1 Comparison with extractive systems
On ROUGE-1 and ROUGE-2, the extractive lead-3 baseline is hard to beat. Two likely reasons: (1) news articles tend to put the most important information at the beginning, so the opening sentences are strong summary material; (2) consistently with this, truncating articles to 400 tokens (about 20 sentences) gave higher ROUGE scores than using 800 tokens.
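For reference, the lead-3 baseline is trivial to implement; a naive sketch with a regex sentence splitter (the splitter is my assumption, since the baseline's actual sentence segmentation is not specified here):

```python
import re

def lead_3(article: str) -> str:
    # Lead-3 baseline: the "summary" is just the first three sentences.
    sentences = re.split(r"(?<=[.!?])\s+", article.strip())
    return " ".join(sentences[:3])
```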
29. 00: Discussion 7.1 Comparison with extractive systems
ROUGE itself rewards the lead-3 strategy: since reference summaries overlap heavily with the opening of the article, safe extractive choices (first-appearing content, original phrasing) score well on ROUGE.
30. 00: Discussion 7.1 Comparison with extractive systems
The same holds for METEOR: although METEOR also rewards stem, synonym, and paraphrase matches rather than only exact matches, the lead-3 baseline still scores higher than the abstractive models.
31. 00: Discussion 7.1 Comparison with extractive systems
We believe that investigating this issue further is an important direction for future work.
7.2 How abstractive is our model?
We have shown that our pointer mechanism makes our abstractive system more reliable, copying factual details correctly more often. But does the ease of copying make our system any less abstractive?
32. 00: Discussion 7.2 How abstractive is our model?
Abstractiveness is measured via n-gram overlap with the src article: the fraction of n-grams in the generated summary that also appear in the source (copied rather than novel n-grams).
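A sketch of one way to compute such an overlap statistic, as the rate of novel n-grams; the function name and pre-tokenized inputs are assumptions:

```python
def novel_ngram_rate(summary_tokens, article_tokens, n):
    # Fraction of the summary's n-grams that do NOT appear in the src article.
    ngrams = lambda toks: {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    summ = ngrams(summary_tokens)
    return len(summ - ngrams(article_tokens)) / len(summ) if summ else 0.0
```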
33. 00: Discussion 7.2 How abstractive is our model?
[Fig. 5 and Fig. 7 from the paper; example article fragment: "X beat Y on ..."]
34. 00: Discussion 7.2 How abstractive is our model?
Observing pgen: during training it starts around 0.30 and rises to about 0.53 by the end of training; at test time it averages 0.17, i.e., the model leans heavily on copying from the src article.
35. 1. Introduction 2. Our Models 3. Related Work 4. Dataset 5.
Experiments 6. Results 7. Discussion 8. Conclusion