Document Summarization
based on Structured Learning
with Hidden Information
Hung-yi Lee
Outline
Introduction of summarization
Unsupervised summarization
Summarization as binary classification
Structured Learning for summarization
Paragraph boundaries as hidden information in Structured Learning
For spoken document (lecture recordings, etc.)
Experiments
Conclusion
INTRODUCTION
Task Introduction
Extractive document summarization
Select the indicative sentences
Cascade the sentences to form a summary
Document:
Two food critics have eaten meat that was grown in a lab.
It is the first time anyone has eaten artificial meat.
The experiment is part of a project run by Google co-founder Sergey Brin.
He invested over $380,000 in research for the burger.
Summary:
It is the first time anyone has eaten artificial meat.
The experiment is part of a project run by Google co-founder Sergey Brin.
The number of sentences selected as the summary is decided by a predefined ratio (e.g. 10%)
UNSUPERVISED SUMMARIZATION
Unsupervised Summarization
Maximum Marginal Relevance (MMR)
Document (e.g. transcriptions of the lecture recordings):
x1: Haruhi Suzumiya…
x2: Deep …
x3: Haruhi Suzumiya…
x4: Structured …
x5: Not very geeky…
……
A good summary should include as many important sentences as possible
The redundancy of a good summary should be minimized.
Rank the sentences in the document by R(x)
R(x): importance of sentence x (e.g. does x include the words frequently mentioned in the document?)
Ranked list: x3, x1, x2, x4, x5, ……
x3, the top-ranked sentence, is selected first; x2 and x4 are selected next, while x1 (similar to x3) is passed over
Cascade the selected sentences based on their order in the document
Summary: x2, x3, x4
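The MMR selection above can be sketched in a few lines of Python. This is only a minimal illustration, not the exact formulation used in the talk: the function name `mmr_select`, the `trade_off` parameter and the toy importance and similarity values are all assumptions made for this example.

```python
def mmr_select(importance, similarity, num_select, trade_off=0.5):
    """Greedily pick sentences that are important but not redundant."""
    selected = []
    candidates = set(range(len(importance)))
    while candidates and len(selected) < num_select:
        def mmr_score(i):
            # Redundancy: highest similarity to any sentence already selected.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return trade_off * importance[i] - (1 - trade_off) * redundancy
        best = max(sorted(candidates), key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)  # cascade in document order


# Toy data (assumed values): x1 and x3 are near-duplicates.
importance = [0.8, 0.6, 0.9, 0.5, 0.1]  # R(x1) .. R(x5)
similarity = [[0.1] * 5 for _ in range(5)]
similarity[0][2] = similarity[2][0] = 0.9
summary = mmr_select(importance, similarity, num_select=3)
print(["x%d" % (i + 1) for i in summary])
```

With these toy values x3 is picked first, x1 is penalized for being similar to x3, and the summary in document order is x2, x3, x4.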
EXTRACTIVE SUMMARIZATION
AS BINARY CLASSIFICATION
Supervised Approach
Training data
document 1: the 2nd and 4th sentences form the summary
document 2: the 3rd sentence forms the summary
document 3: the 1st and 2nd sentences form the summary
……
Use the training data to learn a model for summarization
Binary Classification
for Extractive Summarization
Summarization can be simply taken as a binary classification problem.
A binary classifier (e.g. SVM, DNN) labels each sentence:
Sentence 1: -1
Sentence 2: +1
Sentence 3: +1
Sentence 4: -1
The sentences labeled +1 (Sentence 2 and Sentence 3) form the summary
Drawbacks for Binary Classification
A binary classifier considers each sentence individually
More advanced machine learning techniques
LSA is useful for summarization
LSA is useful for summarization
LSA is useful for summarization
Hello ……
LSA is Latent semantic analysis
LSA is useful for summarization
Repeat again
……
Document → Summary
Summary should be succinct
To generate a good summary, “global information” cannot be ignored.
STRUCTURED LEARNING FOR
EXTRACTIVE SUMMARIZATION
Structured Learning
for Extractive Summarization
Input: whole document
Output: summary
Structured learning considers the whole document when deciding which sentences are selected in the summary
[Figure: candidate summaries for a document with 3 sentences]
Structured Learning
for Extractive Summarization - Evaluation
Learn an evaluation function $F(d, s_d)$
$d$: document, $s_d$: a set of sentences in $d$
$F(d, s_d)$: how good it is to take the sentence set $s_d$ as the summary of document $d$
Structured Learning
for Extractive Summarization - Evaluation
The evaluation function $F(d, s_d)$ considers
A good summary should include as many important sentences as possible
The redundancy of a good summary should be minimized.
Structured Learning
for Extractive Summarization - Evaluation
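The evaluation function $F(d, s_d)$ considers
$$F(d, s_d) = \sum_{x_i \in s_d} R(x_i) - \lambda \sum_{x_i, x_j \in s_d} Sim(x_i, x_j)$$
$$\text{s.t.} \quad \sum_{x_i \in s_d} L(x_i) \le K \quad \text{(constraint of length)}$$
$R(x_i)$: importance of a sentence
$Sim(x_i, x_j)$: redundancy, the similarity of sentence pairs
$\lambda$: parameter to balance importance and redundancy
$\sum_{x_i \in s_d} L(x_i)$: length of the selected summary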
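As a rough illustration of the evaluation function and its length constraint, the sketch below scores a candidate sentence set. The function names, the toy values and the choice to count each unordered sentence pair once are assumptions for this example; the slides do not specify how pairs are counted.

```python
from itertools import combinations


def evaluate(selected, importance, similarity, lam):
    """F(d, s_d): summed importance minus lambda times pairwise similarity."""
    chosen = sorted(selected)
    relevance = sum(importance[i] for i in chosen)
    # Assumption: each unordered pair (x_i, x_j) is counted once.
    redundancy = sum(similarity[i][j] for i, j in combinations(chosen, 2))
    return relevance - lam * redundancy


def within_length(selected, lengths, max_length):
    """Length constraint: total length of the selected sentences <= K."""
    return sum(lengths[i] for i in selected) <= max_length


# Toy values (assumed): sentences 0 and 2 are near-duplicates.
importance = [0.8, 0.6, 0.9]
similarity = [[0.0, 0.1, 0.9], [0.1, 0.0, 0.1], [0.9, 0.1, 0.0]]
lengths = [12, 7, 10]  # e.g. words per sentence (assumed unit)

print(evaluate({0, 2}, importance, similarity, lam=0.5))
print(evaluate({1, 2}, importance, similarity, lam=0.5))
print(within_length({1, 2}, lengths, max_length=20))
```

In this toy case the set {1, 2} scores higher than {0, 2} even though sentence 0 is more important than sentence 1, because the redundancy term penalizes the near-duplicate pair.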
Structured Learning
for Extractive Summarization - Evaluation
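The evaluation function $F(d, s_d)$ considers
$$\begin{aligned}
F(d, s_d) &= \sum_{x_i \in s_d} R(x_i) - \lambda \sum_{x_i, x_j \in s_d} Sim(x_i, x_j) \\
&= \sum_{x_i \in s_d} \omega_0^T \cdot f_0(x_i) - \lambda \sum_{x_i, x_j \in s_d} Sim(x_i, x_j)
\end{aligned}$$
$R(x_i)$ is the inner product of
(1) $f_0(x_i)$: feature for the sentence $x_i$
(2) $\omega_0$: weight vector
→ The parameters can be jointly learned from training data.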
Structured Learning
for Extractive Summarization - Evaluation
Features for a sentence $f_0(x_i)$
Semantic feature (32): PLSA with 32 topics
Similarity to the whole document (1): PLSA-based similarity score
Prosodic feature (60): Pause (12), Duration (15), Pitch (20), Energy (13)
Key term related feature (2): number of key terms in an utterance; number of key terms occurring for the first time in the document
Sentence length (1)
Position of sentence in a document (1)
Significance score (1): sum of TF-IDF in a sentence
Inference: solving “arg max” problem
Structured Learning
for Extractive Summarization - Inference
For a document $d$ with 3 sentences, enumerate all the possible sentence sets $s_d$
(each sentence is either in $s_d$ or not in $s_d$)
summary = $\arg\max_{s_d} F(d, s_d)$
Reference: McDonald, Ryan. "A Study of Global Inference
Algorithms in Multi-Document Summarization."
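The enumeration described above can be sketched as a brute-force search over all sentence subsets. This is a minimal illustration only: `best_summary`, `toy_score` and the toy values are assumptions for this example, and exhaustive search visits 2^n subsets, so it is practical only for very short documents.

```python
from itertools import combinations


def best_summary(num_sentences, score_fn, is_feasible=lambda s: True):
    """Return the feasible sentence set with the highest score."""
    best_set, best_score = frozenset(), float("-inf")
    for size in range(num_sentences + 1):
        for subset in combinations(range(num_sentences), size):
            candidate = frozenset(subset)
            if not is_feasible(candidate):
                continue
            score = score_fn(candidate)
            if score > best_score:
                best_set, best_score = candidate, score
    return best_set, best_score


# Toy evaluation function (assumed values): sentences 0 and 2 are redundant.
importance = [0.8, 0.6, 0.9]
similar_pairs = {frozenset({0, 2}): 0.9}


def toy_score(s):
    redundancy = sum(sim for pair, sim in similar_pairs.items() if pair <= s)
    return sum(importance[i] for i in s) - redundancy


best_set, best_score = best_summary(3, toy_score, is_feasible=lambda s: len(s) <= 2)
print(sorted(best_set), best_score)
```

Here the `is_feasible` callback stands in for the length constraint; with the toy values the search returns sentences 1 and 2.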
Structured Learning
for Extractive Summarization - Training
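$$\begin{aligned}
F(d, s_d) &= \sum_{x_i \in s_d} R(x_i) - \lambda \sum_{x_i, x_j \in s_d} Sim(x_i, x_j) \\
&= \sum_{x_i \in s_d} \omega_0^T \cdot f_0(x_i) - \lambda \sum_{x_i, x_j \in s_d} Sim(x_i, x_j) \\
&= \omega_0^T \sum_{x_i \in s_d} f_0(x_i) - \lambda \sum_{x_i, x_j \in s_d} Sim(x_i, x_j) \\
&= \begin{bmatrix} \omega_0 \\ \lambda \end{bmatrix} \cdot \begin{bmatrix} \sum_{x_i \in s_d} f_0(x_i) \\ -\sum_{x_i, x_j \in s_d} Sim(x_i, x_j) \end{bmatrix} = w \cdot \phi(d, s_d)
\end{aligned}$$
Trained by structured perceptron or structured SVM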
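To make the rewriting into $w \cdot \phi(d, s_d)$ concrete, the sketch below builds the joint feature vector and checks that the inner product matches the direct computation. The names `joint_feature` and `dot` and all toy numbers are assumptions for this example; each unordered sentence pair is counted once.

```python
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def joint_feature(selected, features, similarity):
    """phi(d, s_d): summed sentence features, then minus the summed similarity."""
    chosen = sorted(selected)
    dim = len(features[0])
    summed = [sum(features[i][k] for i in chosen) for k in range(dim)]
    redundancy = sum(similarity[i][j] for i, j in combinations(chosen, 2))
    return summed + [-redundancy]


# Toy values (assumed): two-dimensional f0 for three sentences.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
similarity = [[0.0, 0.1, 0.9], [0.1, 0.0, 0.1], [0.9, 0.1, 0.0]]
omega0, lam = [0.5, 0.3], 0.5
w = omega0 + [lam]  # w stacks omega0 and lambda

selected = {0, 2}
via_phi = dot(w, joint_feature(selected, features, similarity))
direct = sum(dot(omega0, features[i]) for i in selected) - lam * similarity[0][2]
print(via_phi, direct)
```

Both routes give the same score, which is what lets a structured perceptron or structured SVM learn $\omega_0$ and $\lambda$ together as one weight vector.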
PARAGRAPH BOUNDARIES
AS HIDDEN INFORMATION
IN STRUCTURED LEARNING
Paragraph for Summarization
Paragraph boundaries are helpful for summarization
For example, consecutive sentences in a paragraph cluster are more likely to be selected together
The paragraph boundaries are not directly available in spoken documents (e.g. lecture recordings)
[Figure: sentence sequence … $x_{i-2}, x_{i-1}, x_i$, …, $x_{i+3}$ … evaluated by $F(d, s_d)$]
Evaluation Function
with Paragraph Boundaries
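$$F(d, s_d, H_d) = \sum_{x_i \in s_d} R(x_i) - \lambda \sum_{x_i, x_j \in s_d} Sim(x_i, x_j) + \sum_{h_k \in H_d} C(s_d, h_k) + \sum_{h_k \in H_d} S(h_k)$$
First two terms: original evaluation function
Last two terms: paragraph related terms
$C(s_d, h_k)$: given the paragraph set, how good it is to take $s_d$ as summary
$S(h_k)$: goodness of the paragraph set
[Figure: sentences … $x_{i-2}, x_{i-1}, x_i, x_{i+1}, x_{i+2}, x_{i+3}$ … grouped into paragraphs $h_k, h_{k+1}$, …, which form the paragraph set $H_d$]
For speech, paragraph boundaries are also obtained by machine.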
Evaluation Function
with Paragraph Boundaries
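$$F(d, s_d, H_d) = \sum_{x_i \in s_d} R(x_i) - \lambda \sum_{x_i, x_j \in s_d} Sim(x_i, x_j) + \sum_{h_k \in H_d} C(s_d, h_k) + \sum_{h_k \in H_d} S(h_k)$$
$$F(d, s_d, H_d) = \sum_{x_i \in s_d} \omega_0^T \cdot f_0(x_i) - \lambda \sum_{x_i, x_j \in s_d} Sim(x_i, x_j) + \sum_{h_k \in H_d} \omega_1^T \cdot f_1(s_d, h_k) + \sum_{h_k \in H_d} \omega_2^T \cdot f_2(h_k)$$
Evaluation Function
with Paragraph Boundaries - f1(sd, hk)
One dimension in $f_1(s_d, h_k)$: purity
Purity of a paragraph: the larger of the fraction of its sentences included in the sentence set $s_d$ and the fraction not included in $s_d$
Lower purity: sentences $x_i, …, x_{i+6}$ split into two paragraphs $h_k, h_{k+1}$ with purities $\max(\frac{1}{3}, \frac{2}{3})$ and $\max(\frac{1}{4}, \frac{3}{4})$
Higher purity: the same sentences split into three paragraphs $h_{k-1}, h_k, h_{k+1}$ with purities $\max(0, 1)$, $\max(1, 0)$ and $\max(0, 1)$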
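The purity dimension can be sketched as below. This is an illustration based on the max(·, ·) values shown on the slide; the function name `purity` and the sentence indices are assumptions for this example.

```python
from fractions import Fraction


def purity(paragraph, selected):
    """Larger of the fractions of sentences inside and outside the summary."""
    in_summary = sum(1 for i in paragraph if i in selected)
    fraction = Fraction(in_summary, len(paragraph))
    return max(fraction, 1 - fraction)


# Seven sentences, indexed 0..6; sentences 0, 4, 5, 6 are in the summary (toy choice).
selected = {0, 4, 5, 6}

# Lower purity: two paragraphs that mix selected and unselected sentences.
print(purity([0, 1, 2], selected), purity([3, 4, 5, 6], selected))

# Higher purity: three paragraphs, each entirely inside or outside the summary.
print(purity([0], selected), purity([1, 2, 3], selected), purity([4, 5, 6], selected))
```

Exact fractions are used only so that the values line up with the slide's 2/3 and 3/4; plain floats would work equally well as a feature.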
Evaluation Function
with Paragraph Boundaries - f2(hk)
One dimension in $f_2(h_k)$
Average of similarity scores for all pairs of sentences within a paragraph
e.g. for a paragraph $h_k$ with sentences $x_i, x_{i+1}, x_{i+2}$: $\frac{1}{3}(pair_1 + pair_2 + pair_3)$
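A minimal sketch of this average pairwise similarity follows. The function name, the toy similarity values and the choice to return 0 for a single-sentence paragraph (which has no pairs) are assumptions for this example.

```python
from itertools import combinations


def average_pairwise_similarity(paragraph, similarity):
    """Mean similarity over all sentence pairs inside one paragraph."""
    pairs = list(combinations(paragraph, 2))
    if not pairs:
        return 0.0  # assumption: a one-sentence paragraph has no pairs to average
    return sum(similarity[i][j] for i, j in pairs) / len(pairs)


# Toy similarity matrix (assumed values) for three sentences.
similarity = [[1.0, 0.6, 0.3], [0.6, 1.0, 0.9], [0.3, 0.9, 1.0]]
print(average_pairwise_similarity([0, 1, 2], similarity))
```

For a three-sentence paragraph there are three pairs, so the result is (0.6 + 0.3 + 0.9) / 3, matching the one-third factor on the slide.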
Evaluation Function
with Paragraph Boundaries
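$$F(d, s_d, H_d) = \sum_{x_i \in s_d} \omega_0^T \cdot f_0(x_i) - \lambda \sum_{x_i, x_j \in s_d} Sim(x_i, x_j) + \sum_{h_k \in H_d} \omega_1^T \cdot f_1(s_d, h_k) + \sum_{h_k \in H_d} \omega_2^T \cdot f_2(h_k)$$
$$w = \begin{bmatrix} \omega_0 \\ \lambda \\ \omega_1 \\ \omega_2 \end{bmatrix}, \qquad F(d, s_d, H_d) = w \cdot \phi(d, s_d, H_d)$$
Also trained by structured perceptron or structured SVM (with hidden information)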
Inference with Paragraph Boundaries
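Evaluation function $F(d, s_d, H_d)$
[Figure: candidate pairs of sentence set and paragraph set with example scores 0.9, 0.4, 0.1, -0.2, -0.3, 0.5, 0.7, -0.4, -0.8, 0.6, …]
The candidate with the maximum score gives the generated summary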
Training with Paragraph Boundaries
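Training data: $(d, \hat{s}_d)$, where $\hat{s}_d$ is the reference summary of $d$
$F(d, s_d, H_d) = w \cdot \phi(d, s_d, H_d)$
Start from an initialized $w$, then alternate between two steps:
(1) Find the most likely paragraph boundaries $\hat{H}_d$ given $w$ and $\hat{s}_d$
(2) Learn $w$ with $\hat{s}_d$ and $\hat{H}_d$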
Training
with Paragraph Boundaries
Most likely paragraph boundaries:
$$\hat{H}_d = \arg\max_{H_d} w \cdot \phi(d, \hat{s}_d, H_d)$$
$\hat{s}_d$: reference summary labeled by human (answer of training data)
$\hat{H}_d$: most likely paragraph boundaries
For a document $d$ with 3 utterances: enumerate all possible paragraph boundaries and keep the one that maximizes the evaluation function
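The search for the most likely paragraph boundaries can be sketched by enumerating every way to cut the sentence sequence into contiguous paragraphs. This is only an illustration: `segmentations`, `most_likely_paragraphs` and `toy_score` are assumed names, and `toy_score` is a stand-in for $w \cdot \phi$ with the reference summary fixed, using purity plus a weighted within-paragraph similarity with made-up values.

```python
from itertools import combinations, product


def segmentations(num_sentences):
    """Yield every split of sentences 0..n-1 (n >= 1) into contiguous paragraphs."""
    for cuts in product([False, True], repeat=num_sentences - 1):
        paragraphs, current = [], [0]
        for i, cut in enumerate(cuts, start=1):
            if cut:
                paragraphs.append(current)
                current = [i]
            else:
                current.append(i)
        paragraphs.append(current)
        yield paragraphs


def most_likely_paragraphs(num_sentences, score_fn):
    """Return the segmentation that maximizes the evaluation function."""
    return max(segmentations(num_sentences), key=score_fn)


# Toy stand-in for w . phi with the reference summary fixed (assumed values).
reference_summary = {0}
similarity = {(0, 1): 0.1, (0, 2): 0.1, (1, 2): 0.9}


def toy_score(paragraphs):
    total = 0.0
    for paragraph in paragraphs:
        fraction = sum(i in reference_summary for i in paragraph) / len(paragraph)
        total += max(fraction, 1 - fraction)  # purity term
        pairs = list(combinations(paragraph, 2))
        if pairs:  # cohesion term, weighted by an assumed factor of 3
            total += 3.0 * sum(similarity[p] for p in pairs) / len(pairs)
    return total


best = most_likely_paragraphs(3, toy_score)
print(best)
```

A document with n sentences has 2^(n-1) contiguous segmentations, so a 3-utterance document has four candidates; for longer documents exhaustive enumeration would have to be replaced by something more efficient.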
Training
with Paragraph Boundaries
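Evaluation function $F(d, s_d, H_d) = w \cdot \phi(d, s_d, H_d)$
Learn $w$ so that the score is high for the reference summary $\hat{s}_d$ together with its most likely paragraph boundaries $\hat{H}_d$
[Figure: candidates with example scores 0.9, 0.4, 0.1, -0.2, -0.3, 0.5, 0.7, -0.4, -0.8, 0.6, …]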
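The two alternating training steps can be sketched with a perceptron-style update. This is a simplified illustration under stated assumptions, not the structured SVM with hidden variables used in the referenced papers: all names (`train_latent_perceptron`, `toy_phi`, the `"joined"`/`"split"` labels) and the toy data are hypothetical, and both arg max steps are done by plain enumeration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_latent_perceptron(examples, phi, summaries, paragraph_sets, dim, epochs=5):
    """Alternate between guessing hidden paragraphs and updating w."""
    w = [0.0] * dim

    def score(d, s, h):
        return dot(w, phi(d, s, h))

    for _ in range(epochs):
        for d, s_ref in examples:
            # Step 1: most likely paragraphs for the reference summary under w.
            h_ref = max(paragraph_sets(d), key=lambda h: score(d, s_ref, h))
            # Step 2: best (summary, paragraphs) pair under the current w.
            s_pred, h_pred = max(
                ((s, h) for s in summaries(d) for h in paragraph_sets(d)),
                key=lambda pair: score(d, pair[0], pair[1]),
            )
            if s_pred != s_ref:
                # Perceptron update: move w toward the reference features.
                gold, pred = phi(d, s_ref, h_ref), phi(d, s_pred, h_pred)
                w = [wi + g - p for wi, g, p in zip(w, gold, pred)]
    return w


# Toy problem (assumed): pick one of two sentences; two hidden paragraphings.
features = [[1.0, 0.0], [0.0, 1.0]]


def toy_phi(d, s, h):
    # Features of the selected sentence plus one paragraph-related feature.
    return d[s] + [1.0 if h == "split" else 0.0]


examples = [(features, 1)]  # the reference summary is sentence index 1
w = train_latent_perceptron(
    examples,
    toy_phi,
    summaries=lambda d: range(len(d)),
    paragraph_sets=lambda d: ["joined", "split"],
    dim=3,
)
print(w)
```

After one update the learned weights prefer the reference sentence, so later epochs leave w unchanged. A structured SVM would additionally require the reference to beat the other candidates by a margin.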
EXPERIMENTS
Experimental Setup
Corpus: a course offered at National Taiwan University
Speech recognition is needed
Annotators produced three reference summaries
for each spoken document
Experimental Result
Ratio  Measure  MMR             Binary          Structured    Structured Learning
                                Classification  Learning      + Hidden
                (unsupervised)  (supervised)    (supervised)  (supervised)
10%    ROUGE-1  0.40            0.41            0.43          0.44
10%    ROUGE-2  0.18            0.18            0.22          0.22
10%    ROUGE-L  0.40            0.41            0.42          0.43
30%    ROUGE-1  0.55            0.54            0.56          0.57
30%    ROUGE-2  0.34            0.34            0.35          0.36
30%    ROUGE-L  0.54            0.53            0.56          0.56
CONCLUSION
Conclusion
Structured learning, which considers “global information”, is helpful in summarization
The performance of structured learning is further improved by considering paragraph boundaries as hidden information.
Reference
Hung-yi Lee, Yu-yu Chou, Yow-Bang Wang, Lin-shan Lee,
"Supervised Spoken Document Summarization Jointly
Considering Utterance Importance and Redundancy by
Structured Support Vector Machine", InterSpeech, 2012
Sz-Rung Shiang, Hung-yi Lee, Lin-shan Lee, "Supervised
Spoken Document Summarization Based on Structured
Support Vector Machine with Utterance Clusters as Hidden
Variables", InterSpeech, 2013
Thanks for your attention!
The slides are modified from the talk of 向思蓉 (Sz-Rung Shiang) at Interspeech 2013 (Lyon, France)
The research was done with 向思蓉 (Sz-Rung Shiang) and 周宥宇 (Yu-yu Chou)
Appendix
Training Process
Evaluation function $F(d, s_d, H_d)$
The one using the reference summary and the most likely cluster set should score high: higher than the others with margin $\Delta(\hat{s}_d, s_d)$
$$\Delta(\hat{s}_d, s_d) = 1 - ROUGE(s_d)$$
where ROUGE($s_d$) is the ROUGE-1 F measure when…
$s_d$ is the generated summary
$\hat{s}_d$ is the reference summary (labeled by human)
[Figure: candidates with example scores 0.9, 0.4, 0.1, -0.2, -0.3, 0.5, 0.7, -0.4, -0.8, 0.6, …]
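The margin above can be sketched with a simplified ROUGE-1 F measure based on clipped unigram overlap. This is an assumption-laden illustration: it uses a single reference summary and whitespace tokens, whereas the corpus has three reference summaries per document and a standard ROUGE toolkit may differ in tokenization and other details.

```python
from collections import Counter


def rouge1_f(generated_tokens, reference_tokens):
    """Simplified ROUGE-1 F measure: clipped unigram overlap, one reference."""
    gen, ref = Counter(generated_tokens), Counter(reference_tokens)
    overlap = sum((gen & ref).values())  # multiset intersection clips counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def margin(generated_tokens, reference_tokens):
    """Delta: 1 minus the ROUGE-1 F measure of the generated summary."""
    return 1.0 - rouge1_f(generated_tokens, reference_tokens)


reference = "a b e".split()
print(margin("a b c d".split(), reference))
print(margin(reference, reference))
```

A generated summary identical to the reference has margin 0, and the margin grows toward 1 as the unigram overlap shrinks, so worse candidates must be separated from the reference by a larger gap.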