A New Approach to Unsupervised Text Summarization


Agenda

Introduction
The Approach
Diversity-Based Summarization
Test Data and Evaluation Procedure
Results and Discussion
Conclusion and Future Work

Introduction

Supervised: approaches typically make use of human-made summaries or extracts to find features or parameters of summarization algorithms.

Problem: this presumes that human-made summaries are reliable enough.

Unsupervised: approaches determine relevant parameters without regard to human-made summaries.

Introduction (cont’d)

Validity?

Introduction (cont’d)

Experiment: a large group of university students was asked to identify the 10% of sentences in a text (drawn from various domains in a newspaper corpus) that they believed to be most important.

The reported result was a rather modest 25% agreement among their choices.

Problem:

1. Reliability

2. Portability

The Approach

Evaluate summaries:

Not in terms of how well they match human-made extracts.

Not in terms of how much time it takes for humans to make relevance judgments on them.

But in terms of how well they represent source documents in usual IR tasks such as document retrieval and text categorization.

The Approach (cont’d)

Extraction: extracts lack fluency or cohesion.

But humans are able to perform as well reading 20%-30% extracts as reading the original full text.

Diversity-Based Summarization

Problem: which sentences are the most important, in the sense that they can represent the text?

Katz makes the important observation that the number of occurrences of a content word in a document does not depend on the document's length.

That is, the per-document frequencies of individual content words do not grow proportionally with the length of a document.

Diversity-Based Summarization (cont’d)

Two important properties of text:

1. Redundancy – how repetitive concepts are.

2. Diversity – how many different concepts are in the text.

Much of the prior work focuses on redundancy; little of it takes up the problem of diversity.

MMR (maximal marginal relevance)


Method:

1. Find Diversity – find diverse topic areas in the text.

2. Reduce-Redundancy – from each topic area, identify the most important sentence and take that sentence as a representative of the area.

A summary is then the set of sentences generated by Reduce-Redundancy.


Find Diversity: built upon the K-means clustering algorithm, extended with a Minimum Description Length (MDL) version of X-means.

X-means is an extension of K-means with the added functionality of estimating K, which in K-means must be supplied by the user.


$\mu_j$: the coordinates of the centroid with index $j$.

$x_i$: the coordinates of the $i$-th data point.

$(i)$: the index of the centroid closest to data point $i$. For example, $\mu_{(j)}$ denotes the centroid associated with data point $j$.

$c_i$: the cluster with index $i$.


K-means: a hard clustering algorithm that produces a clustering of the input data points into K disjoint subsets.

It starts with some randomly chosen initial points; a bad choice of initial centers can have adverse effects on clustering performance.

The best solution is the one that minimizes distortion.
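The K-means procedure described above can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in the presented work; the function names, the `seed` parameter, and the choice to initialize centroids from randomly sampled input points are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iterations=100, seed=0):
    """Cluster `points` (a list of tuples) into k groups.

    Returns (centroids, assignment), where assignment[i] is the index
    of the centroid closest to points[i].
    """
    rng = random.Random(seed)
    # Random initialization (an assumption): k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: attach every point to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the solution is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, assignment

if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(kmeans(pts, 2))
```

Because the starting centroids are random, different seeds can settle in different local minima on less cleanly separated data, which is the sensitivity to initialization noted above.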


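Distortion is defined as the averaged sum of squares of the Euclidean distances between the objects of a cluster and its centroid.

For a clustering solution $S = \{c_1, \ldots, c_k\}$, its distortion is

$$\mathrm{Distortion}(S) = \sum_{i=1}^{k} \frac{1}{|c_i|} \sum_{x_j \in c_i} \lVert x_j - \mu_{(i)} \rVert^2$$

where

$c_i$: a cluster

$x_j$: an object in $c_i$

$\mu_{(i)}$: the centroid of $c_i$

$|\cdot|$: the cardinality function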
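As a small worked illustration of this definition, the sketch below computes distortion as the sum, over clusters, of the mean squared Euclidean distance between each object and its cluster centroid. The function name and the representation of a clustering as a list of lists of tuples are assumptions made for the example.

```python
def distortion(clusters):
    """Sum over clusters of the mean squared distance to the centroid.

    `clusters` is a list of clusters; each cluster is a non-empty list
    of points given as equal-length tuples of numbers.
    """
    total = 0.0
    for members in clusters:
        if not members:
            continue  # an empty cluster contributes nothing
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        squared = sum(
            (p[d] - centroid[d]) ** 2 for p in members for d in range(dim)
        )
        total += squared / len(members)  # average within the cluster
    return total

if __name__ == "__main__":
    # Centroid of the first cluster is (1, 0); each member is 1 away from it.
    print(distortion([[(0, 0), (2, 0)], [(5, 5)]]))
```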


Problems of K-means:

The user must supply the number of clusters.

It is prone to converging to local minima.


X-means: globally searches the space of centroid locations to find the best way of partitioning the input data.

It resorts to a model selection criterion known as the Bayesian Information Criterion (BIC) to decide whether to split a cluster.

A cluster is split when the information gain from splitting it, as measured by BIC, is greater than the gain from keeping it as it is.
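The split decision can be illustrated with a simplified BIC score. The sketch below assumes a hard-assignment mixture of spherical Gaussians with one shared variance and uses the generic form BIC = log-likelihood − (p/2)·log n; this is an illustrative stand-in, not necessarily the exact criterion used by X-means or in the presented work. The names `bic` and `should_split`, and the small variance floor, are hypothetical.

```python
import math

def _sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    dim = len(cluster[0])
    centroid = [sum(p[d] for p in cluster) / len(cluster) for d in range(dim)]
    return sum((p[d] - centroid[d]) ** 2 for p in cluster for d in range(dim))

def bic(clusters):
    """Simplified BIC of a hard clustering (higher is better).

    Assumes spherical Gaussians sharing one variance, estimated by
    maximum likelihood as SSE / (n * dim).
    """
    n = sum(len(c) for c in clusters)
    dim = len(clusters[0][0])
    k = len(clusters)
    # Floor the variance so that identical points do not cause log(0).
    variance = max(sum(_sse(c) for c in clusters) / (n * dim), 1e-12)
    log_likelihood = (
        sum(len(c) * math.log(len(c) / n) for c in clusters)  # mixing weights
        - 0.5 * n * dim * math.log(2 * math.pi * variance)
        - 0.5 * n * dim  # SSE / (2 * variance) at the ML variance estimate
    )
    free_params = k * dim + (k - 1) + 1  # centroids, mixing weights, variance
    return log_likelihood - 0.5 * free_params * math.log(n)

def should_split(parent, child_a, child_b):
    """Split only if the two-cluster model scores higher than the parent."""
    return bic([child_a, child_b]) > bic([parent])

if __name__ == "__main__":
    a = [(0.0,), (0.1,), (-0.1,), (0.05,), (-0.05,)]
    b = [(10.0,), (10.1,), (9.9,), (10.05,), (9.95,)]
    print(should_split(a + b, a, b))
```

In this sketch, two well-separated groups favor the split, while cutting one evenly spread group in half does not, because the likelihood gain fails to outweigh the penalty for the extra parameters.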





Modification of X-means: replacing BIC by MDL.










Reduce-Redundancy: uses a simple sentence weighting model (the Z-model), which takes the weight $W(s)$ of a given sentence $s$ as the sum of the tf·idf values of the index terms in that sentence.

$x$: an index term

$tf(x)$: the frequency of term $x$ in the document

$idf(x)$: the inverse document frequency of $x$
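A minimal sketch of this sentence weighting follows. The exact tf and idf variants are not specified here, so the raw term frequency, the idf form log(N / df), and the choice to count each distinct index term of a sentence once are all assumptions of the example, as are the function names.

```python
import math
from collections import Counter

def idf(term, collection):
    """Assumed idf variant: log(N / df) over a collection of term sets."""
    df = sum(1 for doc_terms in collection if term in doc_terms)
    return math.log(len(collection) / df) if df else 0.0

def sentence_weight(sentence_terms, document_tf, collection):
    """Sum of tf(x) * idf(x) over the distinct index terms x of a sentence.

    `document_tf` maps each term to its frequency in the whole document
    that contains the sentence.
    """
    return sum(
        document_tf[x] * idf(x, collection) for x in set(sentence_terms)
    )

if __name__ == "__main__":
    collection = [{"a", "b"}, {"a"}, {"c"}]   # term sets of three documents
    document_tf = Counter(["a", "b", "b"])    # tf(a) = 1, tf(b) = 2
    print(sentence_weight(["a", "b"], document_tf, collection))
```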


Z-model sentence selection:

1. Determine the weights of the sentences in the text.

2. Sort them in decreasing order.

3. Select the top sentences.

The sentence weight is further normalized by sentence length.

For each cluster, find the sentence with the best W(s) score and take it as the representative of the cluster.

This minimizes the loss of the resulting summary's relevance to a potential query.
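The two selection strategies can be contrasted in a short sketch: the Z-model ranks all sentences by weight and keeps the top ones, whereas the diversity-based step keeps the best-scoring sentence of each cluster. Sentence weights are assumed to be precomputed (and already length-normalized, if desired); the function names and the use of sentence indices are hypothetical.

```python
def z_model_select(weights, n):
    """Indices of the n highest-weighted sentences, in decreasing order."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return ranked[:n]

def cluster_representatives(clusters, weights):
    """For each non-empty cluster of sentence indices, pick the index
    of the sentence with the highest weight."""
    return [max(cluster, key=lambda i: weights[i]) for cluster in clusters if cluster]

if __name__ == "__main__":
    weights = [0.2, 0.9, 0.5, 0.7]
    print(z_model_select(weights, 2))                          # global top 2
    print(cluster_representatives([[0, 2], [1, 3]], weights))  # one per cluster
```

With these weights the global ranking picks sentences 1 and 3, which sit in the same cluster, while the per-cluster choice picks one sentence from each topic area.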


Problem: the extraction process does not preserve the statistical properties of a source text; index terms are often left statistically indistinguishable after the process.

Solution: extrapolate the frequencies of index terms in extracts in order to estimate their true frequencies in the source texts.


Extrapolation formula

$$E(k \mid k \ge m) = \frac{\sum_{r \ge m} r \, p_r}{\sum_{r \ge m} p_r}$$

$p_r$: the probability of a given word occurring $r$ times in the document.

$m \ge 0$

In these experiments, only index terms with two or more occurrences in the document are considered, so the extrapolation is $E(k \mid k \ge 2)$.
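The conditional expectation E(k | k ≥ m) can be computed directly once the probabilities p_r are available. The sketch below assumes they are supplied as a finite list `p`, with `p[r]` the probability of the word occurring r times; how these probabilities are estimated is outside the scope of the example, and the function name is hypothetical.

```python
def extrapolate(p, m=2):
    """Expected number of occurrences k, given k >= m.

    `p[r]` is the probability that the word occurs exactly r times.
    """
    tail = sum(p[m:])  # probability mass of the event k >= m
    if tail == 0:
        raise ValueError("no probability mass at or above m")
    return sum(r * p[r] for r in range(m, len(p))) / tail

if __name__ == "__main__":
    p = [0.5, 0.2, 0.2, 0.1]  # made-up distribution over 0..3 occurrences
    print(extrapolate(p, m=2))
```

For this made-up distribution, the estimate for m = 2 is (2·0.2 + 3·0.1) / 0.3, or about 2.33 occurrences.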

Test Data and Evaluation Procedure

BMIR-J2: Benchmark for Japanese IR system version 2, a test collection of 5080 news articles published in Japan in 1994.


F-measure

$$F = \frac{2PR}{P + R}$$

$P$: precision

$R$: recall
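As a quick illustration, the sketch below computes precision, recall, and the balanced F-measure 2PR / (P + R) for one query from a set of retrieved document IDs and a set of relevant ones. The balanced form and the function name are assumptions of the example.

```python
def f_measure(retrieved, relevant):
    """Balanced F-measure for one query, from two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0  # also covers empty retrieved or relevant sets
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    # 2 of 4 retrieved documents are relevant; 2 of 3 relevant ones are found.
    print(f_measure({1, 2, 3, 4}, {2, 4, 6}))
```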


Two sets of experiments:

Strict relevance scheme (SRS): takes only A-labeled documents as relevant to the query.

Moderate relevance scheme (MRS): takes both A- and B-labeled documents as relevant.


Summarization methods:

1. Z-model

2. Diversity-based summarizer with the standard K-means (DBS/K)

3. Diversity-based summarizer with XM-means (DBS/XM)

Compression rates range from 20% to 50%.


Experiment procedure:

1. At each compression rate, run the Z-model, DBS/K, and DBS/XM on the entire BMIR-J2 collection to produce their respective pools of extracts.

2. For each query from BMIR-J2, perform a search on each generated pool and score performance with the uninterpolated average F-measure.

Results and Discussion





Conclusion and Future Work

Diversity-based summarization (DBS/XM) was found to be superior to relevance-based summarization (Z-model) when the loss of information in extracts is measured in terms of retrieval performance.

Future Work:

Extending the current DBS framework to deal with multi-document summarization.

Speech summarization with audio input and output.

Text categorization.
