A New Approach to Unsupervised Text Summarization


Agenda

Introduction
The Approach
Diversity-Based Summarization
Test Data and Evaluation Procedure
Results and Discussion
Conclusion and Future Work

Introduction

Supervised: typically makes use of human-made summaries or extracts to find features or parameters of summarization algorithms.

Problem: the human-made summaries must be reliable enough.

Unsupervised: determines relevant parameters without regard to human-made summaries.

Introduction (cont’d)

Validity?

Introduction (cont’d)

Experiment: a large group of university students was asked to identify the 10% of sentences in a text (from various domains in a newspaper corpus) that they believed to be most important.

The reported result was a rather modest 25% agreement among their choices.

Problem

1. Reliability

2. Portability

The Approach

Evaluate summaries:

Not in terms of how well they match human-made extracts.

Not in terms of how much time it takes for humans to make relevance judgments on them.

But in terms of how well they represent source documents in usual IR tasks such as document retrieval and text categorization.

The Approach (cont’d)

Extraction: extracts lack fluency or cohesion.

But humans are able to perform as well reading 20%-30% extracts as reading the original full text.

Diversity-Based Summarization

Problem: what are the most important sentences that can represent the text?

Katz makes the important observation that the number of occurrences of content words in a document does not depend on the document’s length.

That is, the frequencies per document of individual content words do not grow proportionally with the length of a document.

Diversity-Based Summarization (cont’d)

Two important properties of text

1. Redundancy – how repetitive concepts are.

2. Diversity – how many different concepts are in the text.

Much of the prior work focuses on redundancy; little of it takes issue with the problem of diversity.

MMR (maximal marginal relevance)

Diversity-Based Summarization (cont’d)

Method

1. Find-Diversity – find diverse topic areas in the text.

2. Reduce-Redundancy – from each topic area, identify the most important sentence and take that sentence as a representative of the area.

A summary is then the set of sentences generated by Reduce-Redundancy.

Diversity-Based Summarization (cont’d)

Find-Diversity: built upon the K-means clustering algorithm, extended with a Minimum Description Length (MDL) version of X-means.

X-means is an extension of K-means with the added functionality of estimating K, which in K-means must be supplied by the user.

Diversity-Based Summarization (cont’d)

μj – the coordinates of the centroid with index j.

xi – the coordinates of the i-th data point.

(i) – the index of the centroid closest to data point i.

Ex. μ(j) denotes the centroid associated with data point j.

ci – the cluster with index i.

Diversity-Based Summarization (cont’d)

K-means: a hard clustering algorithm that partitions the input data points into K disjoint subsets.

It starts from some randomly chosen initial points. A bad choice of initial centers can have adverse effects on clustering performance.

The best solution is the one that minimizes distortion.
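
A minimal sketch of the K-means loop described above, in plain Python. This is an illustration, not the presenters' implementation: the function name `kmeans`, the `iterations` cap, and the fixed `seed` are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, k, iterations=100, seed=0):
    """Hard-cluster `points` into k disjoint subsets.

    Returns (centroids, labels), where labels[i] is the index of the
    centroid closest to points[i].
    """
    rng = random.Random(seed)
    # Random initial centers; a bad draw can lead to a poor local minimum.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: each point goes to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = mean_point(members)
    return centroids, labels
```

Because the result depends on the initial centers, a common practice is to run the loop from several seeds and keep the solution with the lowest distortion.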

Diversity-Based Summarization (cont’d)

Define distortion as the averaged sum of squares of Euclidean distances between the objects of a cluster and its centroid.

For some clustering solution S = {c_1, . . . , c_k}, its distortion is

\[ \sum_{i=1}^{k} \frac{1}{|c_i|} \sum_{x_j \in c_i} \lVert x_j - \mu(i) \rVert^2 \]

where

c_i - a cluster

x_j - an object in c_i

μ(i) - the centroid of c_i

| ・ | - the cardinality function
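
As a small worked illustration of the definition above, the sketch below computes distortion for a clustering given as a list of clusters. It assumes the verbal definition is read as: for each cluster, the mean squared Euclidean distance of its objects to the centroid, summed over all clusters. The summation over clusters and the name `distortion` are assumptions.

```python
def distortion(clusters):
    """Sum over clusters of the mean squared distance to the cluster centroid.

    `clusters` is a list of non-empty lists of points (tuples of floats).
    """
    total = 0.0
    for cluster in clusters:
        n = len(cluster)  # |c_i|, the cardinality of the cluster
        centroid = [sum(coords) / n for coords in zip(*cluster)]
        squared = sum(
            sum((x - m) ** 2 for x, m in zip(point, centroid))
            for point in cluster
        )
        total += squared / n
    return total


# Grouping the two nearby points together gives a lower distortion
# than grouping a nearby point with the distant one.
good = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0)]]
bad = [[(0.0, 0.0)], [(2.0, 0.0), (5.0, 5.0)]]
print(distortion(good), distortion(bad))
```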

Diversity-Based Summarization (cont’d)

Problems of K-means

The user must supply the number of clusters.

It is prone to getting stuck in local minima.

Diversity-Based Summarization (cont’d)

X-means: globally searches the space of centroid locations to find the best way of partitioning the input data.

It resorts to a model selection criterion known as the Bayesian Information Criterion (BIC) to decide whether to split a cluster.

It splits a cluster when the information gain from splitting it, as measured by BIC, is greater than the gain from keeping that cluster as it is.
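
The split decision can be sketched as a comparison of two BIC scores: one for the cluster kept whole, one for the same points divided into two children. The slides do not give the BIC formula, so the code below assumes one common form from the X-means literature (identical spherical Gaussians with a pooled variance estimate). The names `bic` and `should_split` are hypothetical, and the score convention here is "larger is better".

```python
import math


def _centroid(cluster):
    n = len(cluster)
    return [sum(coords) / n for coords in zip(*cluster)]


def bic(clusters):
    """BIC of a clustering under identical spherical Gaussians (assumed form)."""
    k = len(clusters)                      # number of clusters in the model
    r = sum(len(c) for c in clusters)      # total number of points
    m = len(clusters[0][0])                # dimensionality of the data
    sse = 0.0
    for c in clusters:
        mu = _centroid(c)
        sse += sum(sum((x - u) ** 2 for x, u in zip(p, mu)) for p in c)
    # Pooled variance estimate, guarded against degenerate (zero) spread.
    variance = max(sse / max(r - k, 1), 1e-12)
    log_likelihood = 0.0
    for c in clusters:
        rn = len(c)
        log_likelihood += (
            -rn / 2 * math.log(2 * math.pi)
            - rn * m / 2 * math.log(variance)
            - (rn - k) / 2
            + rn * math.log(rn)
            - rn * math.log(r)
        )
    free_parameters = (k - 1) + m * k + 1  # priors, centroids, variance
    return log_likelihood - free_parameters / 2 * math.log(r)


def should_split(parent, child_a, child_b):
    """Split when the two-child model scores higher than the parent alone."""
    return bic([child_a, child_b]) > bic([parent])
```

With clearly bimodal data the two-child model wins; with evenly spread data the penalty for the extra parameters outweighs the gain in fit, so the cluster is kept as it is.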

Diversity-Based Summarization (cont’d)

Modification of X-means: replacing BIC by MDL.

Diversity-Based Summarization (cont’d)

Reduce-Redundancy: uses a simple sentence weighting model (the Z-model), which takes the weight of a given sentence as the sum of the tf ・ idf values of the index terms in that sentence.

x - an index term

tf(x) - the frequency of term x in the document

idf(x) - the inverse document frequency of x
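
A minimal sketch of such a sentence weight, under stated assumptions: the slide's formula is not shown, so the code uses a plain product tf(x) × idf(x) summed over the distinct terms of the sentence, with idf(x) = log(N / df(x)). The original model may damp tf or define idf differently. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def inverse_document_frequency(term, collection):
    """idf(x) = log(N / df(x)); 0.0 for terms that occur in no document."""
    df = sum(1 for doc_terms in collection if term in doc_terms)
    return math.log(len(collection) / df) if df else 0.0


def sentence_weight(sentence, document, collection):
    """Sum of tf(x) * idf(x) over the distinct index terms x of `sentence`.

    sentence   -- list of index terms
    document   -- list of sentences (the document containing `sentence`)
    collection -- list of term sets, one per document, used for idf
    """
    tf = Counter(term for s in document for term in s)
    return sum(
        tf[x] * inverse_document_frequency(x, collection)
        for x in set(sentence)
    )


# Toy data: "sat" occurs in every document, so it contributes nothing.
collection = [{"cat", "sat"}, {"dog", "sat"}]
document = [["cat", "sat"], ["cat"]]
print(sentence_weight(document[0], document, collection))
```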

Diversity-Based Summarization (cont’d)

Z-model sentence selection

1. Determine the weights of the sentences in the text.

2. Sort them in decreasing order.

3. Select the top sentences.

The sentence weight is further normalized for sentence length.

In each cluster, find the sentence with the best W(s) score and take it as the representative of the cluster.

The aim is to minimize the loss of the resulting summary’s relevance to potential queries.
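
The two selection strategies above can be sketched side by side: plain Z-model selection takes the top-weighted sentences overall, while Reduce-Redundancy takes the best sentence of each cluster. The sketch assumes precomputed weights W(s) and cluster labels, and it normalizes by dividing the weight by the number of terms in the sentence, which is an assumption since the slides do not say how the normalization is done. Both function names are hypothetical.

```python
def top_sentences(weights, n):
    """Z-model selection: indices of the n highest-weighted sentences."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return ranked[:n]


def cluster_representatives(sentences, weights, labels):
    """Reduce-Redundancy: index of the best-scoring sentence in each cluster.

    sentences -- list of sentences, each a list of index terms
    weights   -- W(s) for each sentence
    labels    -- cluster label of each sentence
    Returns the chosen sentence indices in document order.
    """
    best = {}  # cluster label -> (normalized score, sentence index)
    for idx, (sentence, weight, label) in enumerate(zip(sentences, weights, labels)):
        # Length normalization (assumed form): weight per index term.
        score = weight / len(sentence) if sentence else 0.0
        if label not in best or score > best[label][0]:
            best[label] = (score, idx)
    return sorted(idx for _, idx in best.values())


sentences = [["a", "b"], ["c"], ["d", "e", "f"], ["g"]]
weights = [4.0, 1.0, 3.0, 2.0]
labels = [0, 0, 1, 1]
print(top_sentences(weights, 2))
print(cluster_representatives(sentences, weights, labels))
```

In this toy input the two strategies disagree on the second pick: the long third sentence has a high raw weight, but after length normalization the short fourth sentence represents its cluster.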

Diversity-Based Summarization (cont’d)

Problem: the process does not preserve the statistical properties of source texts, which are often left statistically indistinguishable after the process.

Solution: extrapolate the frequencies of index terms in extracts in order to estimate their true frequencies in the source texts.

Diversity-Based Summarization (cont’d)

Extrapolation formula

\[ E(k \mid k \ge m) = \frac{\sum_{r \ge m} r \, p_r}{\sum_{r \ge m} p_r} \]

p_r - the probability of a given word occurring r times in the document.

m ≥ 0

In these experiments, index terms are those with two or more occurrences in the document, so the extrapolation would be E(k | k ≥ 2).
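
As a numerical illustration, E(k | k ≥ m) can be computed as an ordinary conditional expectation over the distribution p_r. The sketch below assumes p_r is available as a finite list indexed by r. How p_r is estimated is not stated on this slide, so the example distribution is purely hypothetical.

```python
def conditional_expectation(pmf, m):
    """E(k | k >= m) for a distribution given as pmf[r] = p_r, r = 0, 1, ..."""
    tail_mass = sum(p for r, p in enumerate(pmf) if r >= m)
    if tail_mass == 0:
        raise ValueError("no probability mass at or above m")
    return sum(r * p for r, p in enumerate(pmf) if r >= m) / tail_mass


# Hypothetical distribution over occurrence counts 0..3.
pmf = [0.5, 0.2, 0.2, 0.1]
print(conditional_expectation(pmf, 2))  # expected count given at least 2
```

For this distribution the value is (2 × 0.2 + 3 × 0.1) / (0.2 + 0.1) = 7/3, while with m = 0 the function reduces to the plain mean, 0.9.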

Test Data and Evaluation Procedure

BMIR-J2: Benchmark for Japanese IR Systems, version 2, a test collection of 5080 news articles published in Japan in 1994.

Test Data and Evaluation Procedure

F-measure

P – Precision

R – Recall
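
A minimal sketch of an F-measure computed from P and R for a single query. The slide's formula is not shown, so the code assumes the balanced form F = 2PR / (P + R). A weighted variant is possible, and the function name is hypothetical.

```python
def f_measure(retrieved, relevant):
    """Balanced F-measure (assumed form) of a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0  # also covers empty retrieved or relevant sets
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


# 2 of 4 retrieved documents are relevant (P = 1/2); 2 of 3 relevant found (R = 2/3).
print(f_measure({1, 2, 3, 4}, {2, 4, 5}))
```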

Test Data and Evaluation Procedure

Two sets of experiments

Strict relevance scheme (SRS): takes only A-labeled documents as relevant to the query.

Moderate relevance scheme (MRS): takes both A- and B-labeled documents as relevant.
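
The two schemes differ only in which judgment labels count as relevant, which can be sketched as a small lookup. The data layout (a dict from document id to label), the function name, and the "C" label used for a non-relevant document are assumptions for illustration.

```python
# Labels accepted as relevant under each scheme.
ACCEPTED_LABELS = {
    "SRS": {"A"},        # strict: A-labeled documents only
    "MRS": {"A", "B"},   # moderate: A- and B-labeled documents
}


def relevant_documents(judgments, scheme):
    """Return the ids of documents counted as relevant under `scheme`."""
    accepted = ACCEPTED_LABELS[scheme]
    return {doc_id for doc_id, label in judgments.items() if label in accepted}


# Hypothetical judgments for one query.
judgments = {"d1": "A", "d2": "B", "d3": "C"}
print(relevant_documents(judgments, "SRS"))
print(relevant_documents(judgments, "MRS"))
```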

Test Data and Evaluation Procedure

Summarization methods

1. Z-model

2. Diversity-based summarizer with the standard K-means (DBS/K)

3. Diversity-based summarizer with XM-means (DBS/XM)

Compression rates range from 20% to 50%.

Test Data and Evaluation Procedure

Experiment procedure

1. At each compression rate, run the Z-model, DBS/K and DBS/XM on the entire BMIR-J2 collection to produce their respective pools of extracts.

2. For each query from BMIR-J2, perform a search on each pool generated, and score performance with the uninterpolated average F-measure.

Results and Discussion

Conclusion and Future Work

Diversity-based summarization (DBS/XM) was found to be superior to relevance-based summarization (Z-model) when the loss of information in extracts is measured in terms of retrieval performance.

Future Work

Extending the current DBS framework to deal with multi-document summarization.

Speech summarization with audio input and output.

Text categorization.