Document Summarization based on Structured …speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015/Structured...Summary: It is the first time anyone has eaten artificial meat. The experiment

Document Summarization

based on Structured Learning

with Hidden Information

Hung-yi Lee

Outline

Introduction of summarization

Unsupervised summarization

Summarization as binary classification

Structured Learning for summarization

Paragraph boundaries as hidden information in Structured Learning

For spoken documents (lecture recordings, etc.)

Experiments

Conclusion

INTRODUCTION

Task Introduction

Extractive document summarization

Select the indicative sentences

Cascade the sentences to form a summary

Document:

Two food critics have eaten meat that was grown in a lab.

It is the first time anyone has eaten artificial meat.

The experiment is part of a project run by Google co-founder Sergey Brin.

He invested over $380,000 in research for the burger.

Summary:

It is the first time anyone has eaten artificial meat.

The experiment is part of a project run by Google co-founder Sergey Brin.

Task Introduction

The number of sentences selected as the summary is decided by a predefined ratio (e.g. 10%)

UNSUPERVISED SUMMARIZATION

Unsupervised Summarization

Maximum Marginal Relevance (MMR)

document (e.g. transcriptions of the lecture recordings)

x1: Haruhi Suzumiya (涼宮春日)…

x2: Deep …

x3: Haruhi Suzumiya (涼宮春日)…

x4: Structured …

x5: Not really an otaku…

……

A good summary should include as many important sentences as possible.

The redundancy of a good summary should be minimized.



Rank by R(x)

R(x): importance of sentence x (e.g. does x include the words frequently mentioned in the document?)

document: x1, x2, x3, x4, x5, ……

ranked by R(x): x3, x1, x2, x4, x5, ……

Select from the top of the ranked list: x3, x2, x4 (x1 is skipped; like x3 it is about Haruhi Suzumiya (涼宮春日), so it would be redundant)

Cascade based on the order in the document

Summary: x2, x3, x4
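The rank, select, and cascade steps above can be sketched in a few lines of Python. This is only a simplified illustration, not the exact MMR formulation: it assumes the importance scores `R` and the pairwise similarities `sim` are already computed, and it uses a hypothetical threshold `max_sim` to decide when a sentence is redundant. All names and numbers are made up for the example.

```python
def mmr_select(R, sim, budget, max_sim=0.5):
    """Greedy selection in the spirit of MMR (simplified sketch).

    R[i]      : importance of sentence i (assumed precomputed)
    sim[i][j] : similarity between sentences i and j (assumed precomputed)
    budget    : number of sentences allowed in the summary
    max_sim   : hypothetical redundancy threshold
    """
    # Rank sentence indices by importance R(x), most important first.
    ranked = sorted(range(len(R)), key=lambda i: R[i], reverse=True)
    selected = []
    for i in ranked:
        if len(selected) == budget:
            break
        # Skip a sentence that is too similar to one already selected.
        if any(sim[i][j] > max_sim for j in selected):
            continue
        selected.append(i)
    # Cascade based on the order in the document.
    return sorted(selected)


# Toy data mirroring the slide: x1 and x3 are near duplicates.
R = [0.8, 0.7, 0.9, 0.6, 0.1]
sim = [[0.1] * 5 for _ in range(5)]
sim[0][2] = sim[2][0] = 0.9
print(mmr_select(R, sim, budget=3))  # indices of x2, x3, x4
```

With these toy scores the ranking is x3, x1, x2, x4, x5 and the selected indices are 1, 2, 3 (zero-based), i.e. x2, x3, x4 in document order.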

EXTRACTIVE SUMMARIZATION

AS BINARY CLASSIFICATION

Supervised Approach

Training data

document 1: 2nd and 4th sentences form the summary

document 2: 3rd sentence forms the summary

document 3: 1st and 2nd sentences form the summary

……

Use the training data to learn a model for summarization

Binary Classification

for Extractive Summarization

Summarization can be simply taken as a binary classification problem.

A binary classifier (e.g. SVM, DNN) gives each sentence a label:

Sentence 1: -1

Sentence 2: +1

Sentence 3: +1

Sentence 4: -1

Summary: Sentence 2, Sentence 3
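As a rough sketch of this view, the snippet below labels each sentence with a linear classifier and keeps the sentences labeled +1. The one-dimensional features, the weight `w`, and the bias `b` are made-up values for illustration; in practice they would come from a trained SVM or DNN.

```python
def classify_sentences(features, w, b=0.0):
    """Label each sentence +1 or -1, looking at one sentence at a time."""
    labels = []
    for x in features:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        labels.append(1 if score > 0 else -1)
    return labels


def summary_from_labels(sentences, labels):
    """Keep the sentences labeled +1, in document order."""
    return [s for s, label in zip(sentences, labels) if label == 1]


sentences = ["Sentence 1", "Sentence 2", "Sentence 3", "Sentence 4"]
features = [[0.1], [0.9], [0.8], [0.2]]  # hypothetical per-sentence features
labels = classify_sentences(features, w=[1.0], b=-0.5)
print(labels)
print(summary_from_labels(sentences, labels))
```

Because every sentence is scored on its own, two sentences with the same features always receive the same label, even when selecting both would be redundant.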

Drawbacks for Binary Classification

A binary classifier considers each sentence individually

More advanced machine learning techniques

LSA is useful for summarization



Hello ……

LSA is Latent semantic analysis


Repeat again

……

Document

Summary

A summary should be succinct

To generate a good summary, “global information” cannot be ignored.

STRUCTURED LEARNING FOR

EXTRACTIVE SUMMARIZATION

Structured Learning

for Extractive Summarization

Input: whole document

Output: summary

[Figure: structured learning considers the whole document and outputs the summary, i.e. which sentences are selected in the summary; illustrated for a document with 3 sentences]

Structured Learning

for Extractive Summarization - Evaluation

Learn an evaluation function $F(d, s_d)$

$d$: document, $s_d$: a set of sentences in $d$

$F(d, s_d)$: how good it is to take the sentence set $s_d$ as the summary of document $d$

Structured Learning


The evaluation function $F(d, s_d)$ considers two criteria:

A good summary should include as many important sentences as possible.

The redundancy of a good summary should be minimized.

Structured Learning



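$$F(d, s_d) = \sum_{x_i \in s_d} R(x_i) - \lambda \sum_{x_i, x_j \in s_d} Sim(x_i, x_j)$$

$$\text{s.t.} \quad \sum_{x_i \in s_d} L(x_i) \le K \quad \text{(constraint of length)}$$

$R(x_i)$: importance of a sentence

$\sum_{x_i, x_j \in s_d} Sim(x_i, x_j)$: redundancy, the similarity of sentence pairs

$\lambda$: parameter to balance importance and redundancy

$\sum_{x_i \in s_d} L(x_i)$: length of the selected summary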
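To make the evaluation function concrete, here is a minimal Python sketch that computes the importance term, the redundancy term, and the length constraint for a candidate sentence set. It assumes that the redundancy sum runs over unordered pairs of distinct sentences and that `R`, `sim`, and `lengths` are precomputed; the toy numbers are invented.

```python
from itertools import combinations


def evaluate(selected, R, sim, lam):
    """F(d, s_d): summed importance minus lam times summed pairwise similarity."""
    importance = sum(R[i] for i in selected)
    # Assumption: each unordered pair of selected sentences is counted once.
    redundancy = sum(sim[i][j] for i, j in combinations(sorted(selected), 2))
    return importance - lam * redundancy


def within_length(selected, lengths, K):
    """Length constraint: the total length of the selected sentences is at most K."""
    return sum(lengths[i] for i in selected) <= K


R = [0.8, 0.7, 0.9]
sim = [[0.0, 0.1, 0.9],
       [0.1, 0.0, 0.1],
       [0.9, 0.1, 0.0]]
lengths = [1, 1, 1]

for candidate in ({0, 2}, {1, 2}):
    print(sorted(candidate), evaluate(candidate, R, sim, lam=1.0),
          within_length(candidate, lengths, K=2))
```

In this toy case the two most important sentences (indices 0 and 2) are also the most similar pair, so the set {1, 2} scores higher than {0, 2}.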

Structured Learning






$$F(d, s_d) = \sum_{x_i \in s_d} \omega_0^T \cdot f_0(x_i) - \lambda \sum_{x_i, x_j \in s_d} Sim(x_i, x_j)$$

$R(x_i)$ is the inner product of

(1) $f_0(x_i)$: feature for the sentence $x_i$

(2) $\omega_0$: weight vector

→ The parameters can be jointly learned from training data.

Structured Learning


Features for a sentence $f_0(x_i)$

Semantic feature (32): PLSA with 32 topics

Similarity to the whole document (1): PLSA-based similarity score

Prosodic feature (60): Pause (12), Duration (15), Pitch (20), Energy (13)

Key term related feature (2): number of key terms in an utterance; number of key terms occurring for the first time in the document

Sentence length (1)

Position of sentence in a document (1)

Significance score (1): sum of TF-IDF in a sentence

Inference: solving “arg max” problem

Structured Learning

for Extractive Summarization - Inference

For a document d with 3 sentences, each sentence is either in $s_d$ or not in $s_d$.

Enumerate all the possible sentence sets $s_d$ and take the best one as the summary:

$$\text{summary} = \arg\max_{s_d} F(d, s_d)$$

Reference: McDonald, Ryan. "A Study of Global Inference

Algorithms in Multi-Document Summarization."
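The enumeration can be sketched directly for a small document. The code below is an illustrative brute-force search, assuming precomputed importance scores `R`, similarities `sim`, sentence lengths, and a length bound `K`; it tries every subset, so it is only practical for a handful of sentences.

```python
from itertools import combinations


def best_summary(R, sim, lengths, K, lam=1.0):
    """Brute-force arg max of F(d, s_d) over all sentence sets within length K."""
    n = len(R)
    best_set, best_score = (), float("-inf")
    for size in range(n + 1):
        for cand in combinations(range(n), size):
            if sum(lengths[i] for i in cand) > K:
                continue  # violates the length constraint
            score = sum(R[i] for i in cand) - lam * sum(
                sim[i][j] for i, j in combinations(cand, 2)
            )
            if score > best_score:
                best_set, best_score = cand, score
    return list(best_set), best_score


R = [0.8, 0.7, 0.9]
sim = [[0.0, 0.1, 0.9],
       [0.1, 0.0, 0.1],
       [0.9, 0.1, 0.0]]
print(best_summary(R, sim, lengths=[1, 1, 1], K=2))
```

A document with n sentences has 2^n candidate sets, so exhaustive search grows exponentially with document length; more efficient inference strategies are the subject of the McDonald reference above.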

Structured Learning

for Extractive Summarization - Training




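$$F(d, s_d) = \sum_{x_i \in s_d} \omega_0^T \cdot f_0(x_i) - \lambda \sum_{x_i, x_j \in s_d} Sim(x_i, x_j)$$

$$= \omega_0^T \sum_{x_i \in s_d} f_0(x_i) - \lambda \sum_{x_i, x_j \in s_d} Sim(x_i, x_j)$$

$$= \begin{bmatrix} \omega_0 \\ \lambda \end{bmatrix} \cdot \begin{bmatrix} \sum_{x_i \in s_d} f_0(x_i) \\ -\sum_{x_i, x_j \in s_d} Sim(x_i, x_j) \end{bmatrix} = w \cdot \phi(d, s_d)$$

Trained by structured perceptron or structured SVM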
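Once the evaluation function is written as an inner product of a weight vector and a joint feature vector, a structured perceptron update becomes straightforward. The sketch below is a minimal illustration under several assumptions: summaries have a fixed number of sentences (`size`), inference is done by enumeration, and the toy features and similarities are invented. It is not the training setup of the referenced papers, which use a structured SVM.

```python
from itertools import combinations


def phi(features, sim, selected):
    """Joint feature vector: summed sentence features, then minus the redundancy."""
    dim = len(features[0])
    summed = [sum(features[i][k] for i in selected) for k in range(dim)]
    redundancy = sum(sim[i][j] for i, j in combinations(sorted(selected), 2))
    return summed + [-redundancy]


def dot(w, v):
    return sum(a * b for a, b in zip(w, v))


def predict(w, features, sim, size):
    """Inference by enumeration over all sentence sets of a fixed size."""
    n = len(features)
    return max(combinations(range(n), size),
               key=lambda s: dot(w, phi(features, sim, s)))


def perceptron_epoch(w, data, size):
    """One pass of structured perceptron updates over (features, sim, reference)."""
    for features, sim, reference in data:
        guess = predict(w, features, sim, size)
        if set(guess) != set(reference):
            good = phi(features, sim, reference)
            bad = phi(features, sim, guess)
            # Move w toward the reference summary and away from the wrong guess.
            w = [wi + g - b for wi, g, b in zip(w, good, bad)]
    return w


features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sim = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 0.0],
       [0.0, 0.0, 0.0]]
data = [(features, sim, (0, 2))]  # reference summary: sentences 0 and 2
w = perceptron_epoch([0.0, 0.0, 0.0], data, size=2)
print(w, predict(w, features, sim, size=2))
```

The last component of `w` plays the role of λ. A plain perceptron does not force it to stay non-negative, which is one reason this is only a sketch.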

PARAGRAPH BOUNDARIES

AS HIDDEN INFORMATION

IN STRUCTURED LEARNING

Paragraph for Summarization

Paragraph boundaries are helpful for summarization

For example, consecutive sentences in a paragraph cluster

are more likely to be selected together

The paragraph boundaries are not directly available in spoken documents (e.g. lecture recordings)

[Figure: the sentence sequence … $x_{i-2}$, $x_{i-1}$, $x_i$, …, $x_{i+3}$ … scored by $F(d, s_d)$]

Evaluation Function

with Paragraph Boundaries

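$$F(d, s_d, H_d) = \sum_{x_i \in s_d} R(x_i) - \lambda \sum_{x_i, x_j \in s_d} Sim(x_i, x_j) + \sum_{h_k \in H_d} C(s_d, h_k) + \sum_{h_k \in H_d} S(h_k)$$

First two terms: the original evaluation function

Last two terms: paragraph related terms

$C(s_d, h_k)$: given the paragraph set, how good it is to take $s_d$ as summary

$S(h_k)$: goodness of the paragraph set

Paragraph set $H_d$: the sentences … $x_{i-2}$, $x_{i-1}$, $x_i$, $x_{i+1}$, $x_{i+2}$, $x_{i+3}$ … are grouped into paragraphs … $h_k$, $h_{k+1}$ …

For speech, paragraph boundaries are also obtained by machine.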

Evaluation Function


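$$F(d, s_d, H_d) = \sum_{x_i \in s_d} R(x_i) - \lambda \sum_{x_i, x_j \in s_d} Sim(x_i, x_j) + \sum_{h_k \in H_d} C(s_d, h_k) + \sum_{h_k \in H_d} S(h_k)$$

$$= \sum_{x_i \in s_d} R(x_i) - \lambda \sum_{x_i, x_j \in s_d} Sim(x_i, x_j) + \sum_{h_k \in H_d} \omega_1^T \cdot f_1(s_d, h_k) + \sum_{h_k \in H_d} \omega_2^T \cdot f_2(h_k)$$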

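Evaluation Function

with Paragraph Boundaries - f1(sd, hk)

One dimension in $f_1(s_d, h_k)$: Purity

Purity of a paragraph: the larger of the fraction of its sentences included in the sentence set $s_d$ and the fraction not included in $s_d$

Example with sentences $x_i$, $x_{i+1}$, …, $x_{i+6}$:

Lower purity: two paragraphs $h_k$, $h_{k+1}$ with purity $\max(\frac{1}{3}, \frac{2}{3})$ and $\max(\frac{1}{4}, \frac{3}{4})$

Higher purity: three paragraphs $h_{k-1}$, $h_k$, $h_{k+1}$ with purity $\max(0, 1)$, $\max(1, 0)$, $\max(0, 1)$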
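The purity values in the figure can be reproduced with a short function. The sketch below assumes seven sentences indexed 0 to 6, of which the third and fourth are in the summary. That choice is a guess that happens to yield the fractions shown on the slide; it is not taken from the original figure.

```python
def purity(paragraph, selected):
    """Larger of the fractions of a paragraph's sentences inside / outside the summary."""
    inside = sum(1 for i in paragraph if i in selected)
    return max(inside, len(paragraph) - inside) / len(paragraph)


selected = {2, 3}  # assumed summary sentences (zero-based indices)

lower = [[0, 1, 2], [3, 4, 5, 6]]       # two paragraphs
higher = [[0, 1], [2, 3], [4, 5, 6]]    # three paragraphs

print([purity(h, selected) for h in lower])   # 2/3 and 3/4
print([purity(h, selected) for h in higher])  # all equal to 1
```

A segmentation whose paragraphs are either fully inside or fully outside the summary gets purity 1 everywhere, which matches the intuition that sentences of one paragraph tend to be selected together.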


Evaluation Function

with Paragraph Boundaries - f2(hk)

One dimension in $f_2(h_k)$: average of similarity scores for all pairs of sentences within a paragraph

For a paragraph $h_k$ with sentences $x_i$, $x_{i+1}$, $x_{i+2}$ there are three pairs:

$$\frac{1}{3}(pair_1 + pair_2 + pair_3)$$
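A small sketch of this feature, assuming a precomputed similarity matrix `sim` with invented values. Returning 0 for a paragraph with a single sentence is a choice made here for illustration; the slide does not say how that case is handled.

```python
from itertools import combinations


def average_pairwise_similarity(paragraph, sim):
    """Average of sim over all unordered pairs of sentences in one paragraph."""
    pairs = list(combinations(paragraph, 2))
    if not pairs:
        return 0.0  # assumption: a one-sentence paragraph contributes 0
    return sum(sim[i][j] for i, j in pairs) / len(pairs)


sim = [[0.0, 0.3, 0.6],
       [0.3, 0.0, 0.9],
       [0.6, 0.9, 0.0]]
print(average_pairwise_similarity([0, 1, 2], sim))  # (0.3 + 0.6 + 0.9) / 3
```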

Evaluation Function



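$$F(d, s_d, H_d) = \sum_{x_i \in s_d} \omega_0^T \cdot f_0(x_i) - \lambda \sum_{x_i, x_j \in s_d} Sim(x_i, x_j) + \sum_{h_k \in H_d} \omega_1^T \cdot f_1(s_d, h_k) + \sum_{h_k \in H_d} \omega_2^T \cdot f_2(h_k)$$

$$w = \begin{bmatrix} \omega_0 \\ \lambda \\ \omega_1 \\ \omega_2 \end{bmatrix}, \qquad F(d, s_d, H_d) = w \cdot \phi(d, s_d, H_d)$$

Also trained by structured perceptron or structured SVM (with hidden information)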

Inference with Paragraph Boundaries

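[Figure: the evaluation function $F(d, s_d, H_d)$ scores each candidate combination of sentence set $s_d$ and paragraph set $H_d$ (example scores: 0.9, 0.4, 0.1, -0.2, -0.3, 0.5, 0.7, -0.4, -0.8, 0.6); the candidate with the maximum score gives the generated summary]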

Training with Paragraph Boundaries

Training data: $(d, \hat{s}_d)$, where $\hat{s}_d$ is the reference summary of $d$ (labeled by human)

Starting from an initialized $w$, training alternates between two steps:

(1) Find the most possible paragraph boundaries $\hat{H}_d$ given $w$ and $\hat{s}_d$:

$$\hat{H}_d = \arg\max_{H_d} w \cdot \phi(d, \hat{s}_d, H_d)$$

For a document d with 3 utterances: take the reference summary (answer of training data), enumerate all possible paragraph boundaries, and keep the paragraph boundary that maximizes the evaluation function.

(2) Learn $w$ with $\hat{s}_d$ and $\hat{H}_d$, so that the evaluation function $F(d, \hat{s}_d, \hat{H}_d)$ is high.
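The step that finds the most possible paragraph boundaries can be sketched by enumeration. With the reference summary fixed, the importance and redundancy terms do not depend on the paragraph set, so only the two paragraph related terms need to be compared. The code below assumes one-dimensional paragraph features (purity and average pairwise similarity) with scalar weights `w1` and `w2`; all values are invented for illustration.

```python
from itertools import combinations, product


def segmentations(n):
    """Yield every split of sentences 0..n-1 into contiguous paragraphs."""
    for cuts in product([False, True], repeat=n - 1):
        paragraphs, current = [], [0]
        for i, cut in enumerate(cuts, start=1):
            if cut:  # a paragraph boundary before sentence i
                paragraphs.append(current)
                current = [i]
            else:
                current.append(i)
        paragraphs.append(current)
        yield paragraphs


def purity(paragraph, selected):
    inside = sum(1 for i in paragraph if i in selected)
    return max(inside, len(paragraph) - inside) / len(paragraph)


def cohesion(paragraph, sim):
    pairs = list(combinations(paragraph, 2))
    if not pairs:
        return 0.0  # assumption: a one-sentence paragraph contributes 0
    return sum(sim[i][j] for i, j in pairs) / len(pairs)


def best_paragraphs(n, reference, sim, w1, w2):
    """Arg max over paragraph sets of the paragraph related terms."""
    def score(paragraphs):
        return sum(w1 * purity(h, reference) + w2 * cohesion(h, sim)
                   for h in paragraphs)
    return max(segmentations(n), key=score)


sim = [[0.0, 0.9, 0.1],
       [0.9, 0.0, 0.2],
       [0.1, 0.2, 0.0]]
print(best_paragraphs(3, reference={0, 1}, sim=sim, w1=1.0, w2=2.0))
```

For 3 utterances there are 4 possible segmentations; in general there are 2^(n-1), so exhaustive enumeration is only feasible for short documents. With the toy weights above, the first two utterances form one paragraph and the third stands alone.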



EXPERIMENTS

Experimental Setup

Corpus: a course offered at National Taiwan University

Speech recognition is needed

Annotators produced three reference summaries for each spoken document

Experimental Result

                     UNSUPERVISED  SUPERVISED
Ratio  Evaluation    MMR           Binary          Structured  Structured Learning
       Measure                     Classification  Learning    + Hidden
10%    ROUGE-1       0.40          0.41            0.43        0.44
10%    ROUGE-2       0.18          0.18            0.22        0.22
10%    ROUGE-L       0.40          0.41            0.42        0.43
30%    ROUGE-1       0.55          0.54            0.56        0.57
30%    ROUGE-2       0.34          0.34            0.35        0.36
30%    ROUGE-L       0.54          0.53            0.56        0.56


CONCLUSION

Conclusion

Structured learning, which considers “global information”, is helpful in summarization

The performance of structured learning is further improved by considering paragraph boundaries as hidden information.

Reference

Hung-yi Lee, Yu-yu Chou, Yow-Bang Wang, Lin-shan Lee, "Supervised Spoken Document Summarization Jointly Considering Utterance Importance and Redundancy by Structured Support Vector Machine", InterSpeech, 2012

Sz-Rung Shiang, Hung-yi Lee, Lin-shan Lee, "Supervised Spoken Document Summarization Based on Structured Support Vector Machine with Utterance Clusters as Hidden Variables", InterSpeech, 2013

Thanks for your attention!

The slides are modified from the talk of 向思蓉 (Sz-Rung Shiang) at Interspeech 2013 (Lyon, France)

This research was done together with 向思蓉 (Sz-Rung Shiang) and 周宥宇 (Yu-yu Chou)

Appendix

Training Process

The candidate using the reference summary and the most possible cluster set should score high on the evaluation function $F(d, s_d, H_d)$: higher than the others with margin $\Delta(\hat{s}_d^i, s_d^i)$

[Figure: example scores of the candidates: 0.9, 0.4, 0.1, -0.2, -0.3, 0.5, 0.7, -0.4, -0.8, 0.6]

$$\Delta(\hat{s}_d, s_d) = 1 - ROUGE(s_d)$$

where $ROUGE(s_d)$ is the ROUGE-1 F measure when

$s_d$ is the generated summary

$\hat{s}_d$ is the reference summary (labeled by human)
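The margin can be illustrated with a simplified ROUGE-1 F measure. The sketch below counts overlapping unigrams on whitespace tokens against a single reference, with no stemming or stop-word handling; the official ROUGE toolkit does more than this, so the numbers are only indicative.

```python
from collections import Counter


def rouge1_f(generated, reference):
    """Simplified ROUGE-1 F measure on whitespace-separated tokens."""
    gen, ref = Counter(generated.split()), Counter(reference.split())
    overlap = sum((gen & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def margin(generated, reference):
    """Loss used as the margin: 1 minus ROUGE-1 F of the generated summary."""
    return 1.0 - rouge1_f(generated, reference)


print(margin("the cat sat", "the cat ran fast"))  # 3/7, about 0.43
print(margin("the cat sat", "the cat sat"))
```

A generated summary identical to the reference has margin 0, and one sharing no words with it has margin 1, so worse summaries must be separated from the reference by a larger gap.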
