Text Summarization

http://net.pku.edu.cn/~wbia
黄连恩
[email protected]
School of Information Engineering, Peking University
12/17/2013

Overview

What is summarization?

Columbia Newsblaster

The academic version

http://newsblaster.cs.columbia.edu/index.html

What is the input?

News, or clusters of news
  a single article or several articles on a related topic
Email and email threads
Scientific articles
Health information: patients and doctors
Meeting summarization
Video

What is the output?

Keywords
Highlighted information in the input
Chunks of text or speech taken directly from the input, or a paraphrase and aggregation of the input in novel ways

Modality: text, speech, video, graphics

Ideal stages of summarization

Analysis
  Input representation and understanding
Transformation
  Selecting important content
Realization
  Generating novel text corresponding to the gist of the input

Most current systems

Use shallow analysis methods
  Rather than full understanding
Work by sentence selection
  Identify important sentences and piece them together to form a summary

Types of summaries

Extracts
  Sentences from the original document are displayed together to form a summary
Abstracts
  Material is transformed: paraphrased, restructured, shortened

Extractive summarization

Each sentence is assigned a score that reflects how important and contentful it is

Data-driven approaches
  Word statistics
  Cue phrases
  Section headers
  Sentence position
Knowledge-based systems
  Discourse information
    Resolve anaphora, text structure
  Use external lexical resources
    WordNet, adjective polarity lists, opinion
Using machine learning

What are summaries useful for?

Relevance judgments
  Does this document contain information I am interested in?
  Is this document worth reading?
Save time
  Reduce the need to consult the full document

Recent development

2013.3, Yahoo bought news reading app Summly for $30 million!

2013.4, Google purchased Wavii for more than $30 million!

Multi-document summarization

Very useful for presenting and organizing search results
  Many results are very similar, and grouping closely related documents helps cover more event facets

Summarizing similarities and differences between documents

How to deal with redundancy?

Author JK Rowling has won her legal battle in a New York court to get an unofficial Harry Potter encyclopaedia banned from publication.

A U.S. federal judge in Manhattan has sided with author J.K. Rowling and ruled against the publication of a Harry Potter encyclopedia created by a fan of the book series.

Shallow techniques not likely to work well

Global optimization for content selection

What is the best summary? vs What is the best sentence?

Form all summaries and choose the best
  What is the problem with this approach?
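
One way to see the difficulty is to count the candidate summaries. The sketch below assumes a summary is any subset of a fixed number of sentences; the function name and the sizes used are illustrative and do not come from the slides.

```python
from math import comb

def count_candidate_summaries(n_sentences, summary_size):
    """Number of distinct sentence subsets of the given size."""
    return comb(n_sentences, summary_size)

# Even a small input yields far too many candidates to score one by one.
print(count_candidate_summaries(100, 5))
```

Under this assumption the number of candidates grows combinatorially with the input size, which is why content selection is usually treated as an optimization problem instead of an enumeration.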

Information ordering

In what order to present the selected sentences?
  An article with permuted sentences will not be easy to understand
Very important for multi-document summarization
  Sentences coming from different documents

Automatic summary edits

Some expressions might not be appropriate in the new context
  References:
    he
    Putin
    Russian Prime Minister Vladimir Putin
  Discourse connectives
    However, moreover, subsequently

Requires more sophisticated NLP techniques

Before

Pinochet was placed under arrest in London Friday by British police acting on a warrant issued by a Spanish judge. Pinochet has immunity from prosecution in Chile as a senator-for-life under a new constitution that his government crafted. Pinochet was detained in the London clinic while recovering from back surgery.

After

Gen. Augusto Pinochet, the former Chilean dictator, was placed under arrest in London Friday by British police acting on a warrant issued by a Spanish judge. Pinochet has immunity from prosecution in Chile as a senator-for-life under a new constitution that his government crafted. Pinochet was detained in the London clinic while recovering from back surgery.

Before

Turkey has been trying to form a new government since a coalition government led by Yilmaz collapsed last month over allegations that he rigged the sale of a bank. Ecevit refused even to consult with the leader of the Virtue Party during his efforts to form a government. Ecevit must now try to build a government. Demirel consulted Turkey's party leaders immediately after Ecevit gave up.

After

Turkey has been trying to form a new government since a coalition government led by Prime Minister Mesut Yilmaz collapsed last month over allegations that he rigged the sale of a bank. Premier-designate Bulent Ecevit refused even to consult with the leader of the Virtue Party during his efforts to form a government. Ecevit must now try to build a government. President Suleyman Demirel consulted Turkey's party leaders immediately after Ecevit gave up.

Traditional Approaches

1) word frequency based method

Hans Peter Luhn (“father of Information Retrieval”): The Automatic Creation of Literature Abstracts - 1958

Image: Courtesy IBM

Luhn’s method: basic idea

Target documents: technical literature
The method is based on the following assumptions:
  Frequency of word occurrence in an article is a useful measurement of word significance
  Relative position of these significant words within a sentence is a useful measurement of sentence significance
Based on limited capabilities of machines (IBM 704): no semantic information

Why word frequency?

Important words are repeated throughout the text
  examples are given in favor of a certain principle
  arguments are given for a certain principle
Technical literature: one word, one notion
Simple and straightforward algorithm: cheap to implement (processing time is costly)
Note that different forms of the same word are counted as the same word

When significant?

Words with too low a frequency are not significant
Words with too high a frequency are also not significant (e.g. “the”, “and”)
Removing low-frequency words is easy
  Set a minimum frequency threshold
Removing common (high-frequency) words:
  Setting a maximum frequency threshold (statistically obtained)
  Comparing to a common-word list

Figure 1 from [Luhn, 1958]
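
The two thresholds and the common-word list can be sketched as a simple filter. This is a minimal illustration, not Luhn's original procedure: the function name, the parameter names and the threshold values are assumptions, and word forms are only lower-cased, not stemmed.

```python
from collections import Counter

def significant_words(tokens, min_count=2, max_fraction=0.2,
                      common_words=frozenset()):
    """Keep words that are neither too rare nor too common."""
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    keep = set()
    for word, count in counts.items():
        if count < min_count:             # below the minimum frequency threshold
            continue
        if count / total > max_fraction:  # above the maximum frequency threshold
            continue
        if word in common_words:          # listed as a common word
            continue
        keep.add(word)
    return keep

tokens = "the cat sat on the mat the cat ran to the dog and the dog ran".split()
print(sorted(significant_words(tokens)))
```

In this toy input "the" is dropped as too frequent and the words that occur once are dropped as too rare.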

Using relative position

Where the greatest number of high-frequency words are found closest together, the probability is very high that representative information is given

Based on the characteristic that an explanation of a certain idea is represented by words close together (e.g. sentences, paragraphs, chapters)

The significance factor

The “significance factor” of a sentence reflects the number of occurrences of significant words within a sentence and the linear distance between them due to non-significant words in between

Only consider portion of sentence bracketed by significant words with maximum of 5 non-significant words in between, e.g. “ (*) - - - [ * - * * - - * - - * ] - - (*) “

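Significance factor formula: $(\Sigma[*])^2 / |[\,\cdot\,]|$, i.e. (number of significant words in the bracket)$^2$ / (total number of words in the bracket); in the above example $5^2 / 10 = 2.5$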
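
The bracketing rule and the formula can be sketched as follows. This is a minimal illustration under stated assumptions: the sentence is given as a list of booleans (one per word, true for significant words), the function name is hypothetical, and when a sentence contains several brackets the highest factor is returned.

```python
def significance_factor(is_significant, max_gap=5):
    """Best (significant words)^2 / (bracket length) over all brackets.

    A bracket starts and ends with a significant word, and consecutive
    significant words in it are separated by at most max_gap other words.
    """
    positions = [i for i, flag in enumerate(is_significant) if flag]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1                      # still inside the current bracket
        else:
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1           # open a new bracket
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))

# The bracketed portion of the slide example: * - * * - - * - - *
bracket = [c == "*" for c in "*-**--*--*"]
print(significance_factor(bracket))
```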

Generating the abstract

For every sentence the significance factor is calculated

The sentences with a significance factor higher than a certain cut-off value are returned (alternatively the N highest-valued sentences can be returned)

For large texts, it can also be applied to subdivisions of the text

No evaluation of the results is presented in the journal paper!

2) Position based method

H.P. Edmundson: New methods in Automatic Extracting - 1969


IBM 7090 - Courtesy IBM

Lead method

Claim: Important sentences occur at the beginning (and/or end) of texts.

Lead method: just take first sentence(s)!

Experiments: In 85% of 200 individual paragraphs the topic sentences occurred in initial position and in 7% in final position (Baxendale, 58).

Only 13% of the paragraphs of contemporary writers start with topic sentences (Donlan, 80).

Cue-Phrase method

Claim 1: Important sentences contain ‘bonus phrases’, such as significantly, In this paper we show, and In conclusion, while non-important sentences contain ‘stigma phrases’ such as hardly and impossible.

Claim 2: These phrases can be detected automatically (Kupiec et al. 95; Teufel and Moens 97).

Method: Add to sentence score if it contains a bonus phrase, penalize if it contains a stigma phrase.

Four methods for weighting

Weighting methods:
  Cue Method
  Key Method
  Title Method
  Location Method

The weight of a sentence is a linear combination of the weights obtained with the above four methods

The highest-weighted sentences are included in the abstract

Target documents: technical literature

Cue Method

Based on the hypothesis that the probable relevance of a sentence is affected by the presence of pragmatic words (e.g. “Significant”, “Greatest”, “Impossible”, “Hardly”)

Three types of Cue words:
  Bonus words: positively affecting the relevance of a sentence (e.g. “Significant”, “Greatest”)
  Stigma words: negatively affecting the relevance of a sentence (e.g. “Impossible”, “Hardly”)
  Null words: irrelevant

Obtaining Cue words

The lists were obtained by statistical analyses of 100 documents:
  Dispersion (λ): number of documents in which the word occurred
  Selection ratio (η): ratio of number of occurrences in extractor-selected sentences to number of occurrences in all sentences

Bonus words: $\eta > t^{\eta}_{high}$
Stigma words: $\eta < t^{\eta}_{low}$
Null words: $\lambda > t_{\lambda}$ and $t^{\eta}_{low} < \eta < t^{\eta}_{high}$
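
The three conditions can be read as a small decision rule. The sketch below is only an illustration: the function name, the argument names and the threshold values in the usage lines are assumptions, and words that satisfy none of the conditions are left unclassified.

```python
def classify_cue_word(dispersion, selection_ratio, t_disp, t_low, t_high):
    """Assign a word to the Bonus, Stigma or Null list, or to none."""
    if selection_ratio > t_high:
        return "bonus"
    if selection_ratio < t_low:
        return "stigma"
    if dispersion > t_disp:
        # Here t_low <= selection_ratio <= t_high: widespread but uninformative.
        return "null"
    return None

print(classify_cue_word(60, 0.8, t_disp=40, t_low=0.1, t_high=0.5))
print(classify_cue_word(90, 0.3, t_disp=40, t_low=0.1, t_high=0.5))
```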

Resulting Cue lists

Bonus list (783): comparatives, superlatives, adverbs of conclusion, value terms, etc.

Stigma list (73): anaphoric expressions, belittling expressions, etc.

Null list (139): ordinals, cardinals, the verb “to be”, prepositions, pronouns, etc.


Cue weight of sentence

Tag all Bonus words with weight b > 0, all Stigma words with weight s < 0, all Null words with weight n = 0

Cue weight of sentence: Σ (Cue weight of each word in sentence)
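
A minimal sketch of this sum, assuming the Bonus and Stigma lists are given as sets of lower-case words; the function name and the default values b = 1 and s = -1 are illustrative choices, not values from the paper.

```python
def cue_weight(words, bonus, stigma, b=1.0, s=-1.0):
    """Sum of per-word cue weights: b for Bonus, s for Stigma, 0 otherwise."""
    total = 0.0
    for word in words:
        w = word.lower()
        if w in bonus:
            total += b
        elif w in stigma:
            total += s
        # Null words and all remaining words contribute 0.
    return total

bonus, stigma = {"significant", "greatest"}, {"impossible", "hardly"}
print(cue_weight("We show a significant result".split(), bonus, stigma))
```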


Key Method

Principle based on [Luhn], counting the frequency of words.

Algorithm differs:
  Create key glossary of all non-Cue words in the document which have a frequency larger than a certain threshold
  Weight of each key word in the key glossary is set to the frequency with which it occurs in the document
  Assign key weight to each word which can be found in the key glossary
  If word is not in key glossary, key weight: 0
  No relative position is used (unlike [Luhn])

Key weight of sentence: Σ (Key weight of each word in sentence)
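
The glossary construction and the sentence weight can be sketched as below. The function names, the representation of the document as a list of lower-case words and the threshold value are assumptions made for illustration.

```python
from collections import Counter

def key_glossary(doc_words, cue_words, threshold):
    """Non-Cue words whose document frequency exceeds the threshold."""
    counts = Counter(w for w in doc_words if w not in cue_words)
    return {w: c for w, c in counts.items() if c > threshold}

def key_weight(sentence_words, glossary):
    """Sum of key weights; words outside the glossary weigh 0."""
    return sum(glossary.get(w, 0) for w in sentence_words)

doc = "summary model summary model summary the the the data".split()
glossary = key_glossary(doc, cue_words={"the"}, threshold=1)
print(glossary, key_weight(["summary", "of", "model"], glossary))
```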


Title Method

Based on the hypothesis that an author conceives title as circumscribing the subject matter of the document (similarly for headings vs. paragraphs)

Create title glossary consisting of all non-Null words in the title, subtitle and headings of the document

Words are given a positive title weight if they appear in this glossary

Title words are given a larger weight than heading words

Title weight of sentence: Σ (Title weight of each word in sentence)


Location Method

Based on the hypothesis that:
  Sentences occurring under certain headings are positively relevant
  Topic sentences tend to occur very early or very late in a document and its paragraphs
Global idea:
  Give each sentence below a heading the same weight as the heading itself (note that this is independent from the Title Method): Heading weight
  Give each sentence a certain weight based on its position: Ordinal weight

Location weight of sentence: Ordinal weight of sentence + Heading weight of sentence

Location Method: Heading weight

Compare each word in a heading with the pre-stored Heading dictionary

If the word occurs in this dictionary, assign it a weight equal to the weight it has in the dictionary

Heading weight of a heading: Σ (heading weight of each word in heading)

Heading weight of a sentence = Heading weight of its heading


Creating the Heading dictionary

The Heading dictionary was created by listing all words in the headings of 120 documents and calculating the selection ratio for each word:

Selection ratio (η): ratio of number of occurrences in extractor-selected sentences to number of occurrences in all headings

Deletions from this list were made on the basis of low frequency and unrelatedness to the desired information types (subject, purpose, conclusion, etc.)

Weights were given to the words in the Heading dictionary proportional to the selection ratio

The resulting Heading dictionary contained 90 words

Location Method: Ordinal weight

Sentences of the first paragraph are tagged with weight O1

Sentences of the last paragraph are tagged with weight O2

The first sentence of a paragraph is tagged with weight O3

The last sentence of a paragraph is tagged with weight O4

Ordinal weight of sentence: O1 + O2 + O3 + O4
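
A minimal sketch of the ordinal weight, assuming each of O1 to O4 contributes only when its condition holds for the sentence; the function name, the zero-based indices and the example weights are illustrative.

```python
def ordinal_weight(par_idx, sent_idx, n_pars, n_sents_in_par, o1, o2, o3, o4):
    """Add up the positional weights that apply to one sentence."""
    weight = 0
    if par_idx == 0:                       # sentence in the first paragraph
        weight += o1
    if par_idx == n_pars - 1:              # sentence in the last paragraph
        weight += o2
    if sent_idx == 0:                      # first sentence of its paragraph
        weight += o3
    if sent_idx == n_sents_in_par - 1:     # last sentence of its paragraph
        weight += o4
    return weight

# First sentence of the first paragraph in a 5-paragraph document.
print(ordinal_weight(0, 0, n_pars=5, n_sents_in_par=4, o1=3, o2=2, o3=1, o4=1))
```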

Generating the abstract

Calculate the weight of a sentence: aC + bK + cT + dL, with a,b,c,d constant positive integers, C: Cue Weight, K: Key weight, T: Title weight, L: Location weight

The values of a, b, c and d were obtained by manually comparing the generated automatic abstracts with the desired (human made) abstract

Return the highest N sentences under their proper headings as the abstract (including title)

N is calculated by taking a percentage of the size of the original documents, in this journal paper 25% is used
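
The combination and the 25% selection can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, the default coefficients are placeholders (the paper tuned them by hand), N is rounded up, and the selected sentences are returned in document order without the heading structure.

```python
import math

def sentence_weight(cue, key, title, location, a=1, b=1, c=1, d=1):
    """Linear combination aC + bK + cT + dL."""
    return a * cue + b * key + c * title + d * location

def top_sentences(weights, fraction=0.25):
    """Indices of the N highest-weighted sentences, in document order."""
    n = max(1, math.ceil(fraction * len(weights)))
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:n])

weights = [3, 9, 1, 7, 2, 5, 4, 0]
print(top_sentences(weights))
```

Setting b to 0 in `sentence_weight` gives a combination that ignores the Key weight.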


Which combination is best?

All combinations of C, K, T and L were tried to see which result had (on average) the most overlap with the handmade extract

As can be seen in the figure below (only the interesting results are shown), the Key method was omitted and only C, T and L are used to create the best abstract

Surprising result! (Luhn used only keywords to create the abstract)

Figure 4 from [Edmundson, 1969]


Evaluation

Evaluation was done on unseen data (40 technical documents), comparison with handmade abstracts

Result: 44% of the sentences co-selected, 66% similarity between abstracts (human judge)

Random ‘abstract’: 25% of the sentences co-selected, 34% similarity between abstracts

Another evaluation criterion: ‘extract-worthiness’
  Result: 84% of the sentences selected are extract-worthy
  Therefore: for one document many possible abstracts (differing in length and content)

3) Machine-learning method

Ask people to select sentences
Use these as training examples for machine learning
  Each sentence is represented as a number of features
  Based on the features, distinguish sentences that are appropriate for a summary and sentences that are not

Run on new inputs

Scoring sentences

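For each sentence s, the probability P that it will be included in the summary S given the k features $F_1, \ldots, F_k$ is calculated (Bayes’ rule):

$$P(s \in S \mid F_1, \ldots, F_k) = \frac{P(F_1, \ldots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, \ldots, F_k)}$$

Assuming statistical independence of the features:

$$P(s \in S \mid F_1, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\; P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$$

$P(s \in S)$ is constant, and $P(F_j \mid s \in S)$ and $P(F_j)$ can be estimated directly from the training set by counting occurrences

This function assigns each s a score which can be used to select sentences for inclusion in the abstract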
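
Estimating the probabilities by counting and scoring a sentence under the independence assumption can be sketched as below. The sketch assumes discrete feature values, applies no smoothing, and skips feature values never seen in training; the function names and the toy data are illustrative.

```python
from collections import Counter

def train(examples):
    """examples: list of (feature_tuple, in_summary) pairs."""
    n = len(examples)
    n_pos = sum(1 for _, in_summary in examples if in_summary)
    all_counts, pos_counts = Counter(), Counter()
    for features, in_summary in examples:
        for j, value in enumerate(features):
            all_counts[(j, value)] += 1
            if in_summary:
                pos_counts[(j, value)] += 1
    return n, n_pos, all_counts, pos_counts

def score(features, model):
    """P(s in S) * prod_j P(F_j | s in S) / P(F_j), estimated from counts."""
    n, n_pos, all_counts, pos_counts = model
    p = n_pos / n
    for j, value in enumerate(features):
        if all_counts[(j, value)] == 0:
            continue  # unseen feature value: skipped (an assumption)
        p *= (pos_counts[(j, value)] / n_pos) / (all_counts[(j, value)] / n)
    return p

# Toy features: (is_paragraph_initial, contains_fixed_phrase)
data = [((True, True), True), ((True, False), True),
        ((False, True), False), ((False, False), False)]
model = train(data)
print(score((True, True), model), score((False, True), model))
```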

The training material

188 documents with professionally created abstracts from the scientific/technical domain, the average length of the abstracts is 3 sentences (3.5% of the total size of the document)

Sentences from the abstract were matched to the original document:
  79% direct sentence matches
  3% direct joins (2 sentences combined)
  18% no direct match or join possible

Therefore the maximum performance of the automatic system is 82%

Evaluation

Too little material
  Cross-validation used to evaluate
Two evaluation measures
  Fraction of manually selected sentences which were reproduced correctly: average result 35%
  Fraction of the matchable selected sentences which were reproduced correctly: average result 42%

Performance of features (2nd measure):

Feature          Individual % sentences correct   Cumulative % sentences correct
Paragraph        33                               33
Fixed Phrases    29                               42
Length Cut-off   24                               44
Thematic Word    20                               42
Uppercase Word   20                               42

4) Discourse-based method

Claim: The multi-sentence coherence structure of a text can be constructed, and the ‘centrality’ of the textual units in this structure reflects their importance.

Tree-like representation of texts in the style of Rhetorical Structure Theory (Mann and Thompson, 88).

Use the discourse representation in order to determine the most important textual units. Attempts:
  (Ono et al., 1994) for Japanese.
  (Marcu, 1997, 2000) for English.

Rhetorical parsing (Marcu,97)

[With its distant orbit {– 50 percent farther from the sun than Earth –} and slim atmospheric blanket,1] [Mars experiences frigid weather conditions.2] [Surface temperatures typically average about –60 degrees Celsius (–76 degrees Fahrenheit) at the equator and can dip to –123 degrees C near the poles.3] [Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion,4] [but any liquid water formed that way would evaporate almost instantly5] [because of the low atmospheric pressure.6]

[Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop,7] [most Martian weather involves blowing dust or carbon dioxide.8] [Each winter, for example, a blizzard of frozen carbon dioxide rages over one pole, and a few meters of this dry-ice snow accumulate as previously frozen carbon dioxide evaporates from the opposite polar cap.9] [Yet even on the summer pole, {where the sun remains in the sky all day long,} temperatures never warm enough to melt frozen water.10]

Rhetorical parsing (2)

Use discourse markers to hypothesize rhetorical relations
  rhet_rel(CONTRAST, 4, 5)
  rhet_rel(CONTRAST, 4, 6)
  rhet_rel(EXAMPLE, 9, [7,8])
  rhet_rel(EXAMPLE, 10, [7,8])

Use semantic similarity to hypothesize rhetorical relations
  if similar(u1, u2) then
    rhet_rel(ELABORATION, u2, u1)
    rhet_rel(BACKGROUND, u1, u2)
  else
    rhet_rel(JOIN, u1, u2)

  rhet_rel(JOIN, 3, [1,2])
  rhet_rel(ELABORATION, [4,6], [1,2])

Use the hypotheses in order to derive a valid discourse representation of the original text.

Rhetorical parsing (3)

[Figure: rhetorical structure tree over units 1 to 10; relation labels in the tree: Evidence, Cause, Contrast, Elaboration, Background, Justification, Concession, Antithesis, Example]

Summarization = selection of the most important units

2 > 8 > 3, 10 > 1, 4, 5, 7, 9 > 6

Discourse method: Evaluation

(using a combination of heuristics for rhetorical parsing disambiguation)

TREC Corpus (fourfold cross-validation)

Reduction  Method   Recall   Precision  F-score
10%        Humans   83.20%   75.95%     79.41%
10%        Program  63.75%   72.50%     67.84%
10%        Lead     82.91%   63.45%     71.89%
20%        Humans   82.83%   64.93%     72.80%
20%        Program  61.79%   60.83%     61.31%
20%        Lead     70.91%   46.96%     56.50%

Scientific American Corpus

Level     Method                 Rec.     Prec.    F-score
Clause    Humans                 72.66%   69.63%   71.27%
Clause    Program (training)     67.57%   73.53%   70.42%
Clause    Program (no training)  51.35%   63.33%   56.71%
Clause    Lead                   39.68%   39.68%   39.68%
Sentence  Humans                 78.11%   79.37%   78.73%
Sentence  Program (training)     69.23%   64.29%   66.67%
Sentence  Program (no training)  57.69%   51.72%   54.54%
Sentence  Lead                   54.22%   54.22%   54.22%

5) VS based method

Based on word probability

$$weight(S) = \frac{\sum_{i=1}^{n} \log(p_i)}{n}$$

  S is a sentence of length n
  $p_i$ is the probability of the i-th word in the sentence

Based on word tf.idf

$$weight(S) = \frac{\sum_{i=1}^{n} tf_i \cdot idf_i}{n}$$
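
Both weights are length-normalized sums over the words of a sentence and can be sketched directly. The function names and the dictionaries holding word probabilities, tf and idf values are assumptions for illustration; the natural logarithm is used, and every word is assumed to have an entry in the dictionaries.

```python
import math

def probability_weight(sentence, word_prob):
    """Average log-probability of the words in the sentence."""
    return sum(math.log(word_prob[w]) for w in sentence) / len(sentence)

def tfidf_weight(sentence, tf, idf):
    """Average tf.idf of the words in the sentence."""
    return sum(tf[w] * idf[w] for w in sentence) / len(sentence)

sentence = ["a", "b"]
print(probability_weight(sentence, {"a": 0.5, "b": 0.25}))
print(tfidf_weight(sentence, tf={"a": 2, "b": 1}, idf={"a": 0.5, "b": 2.0}))
```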

Centrality measures

How representative is a sentence of the overall content of a document

The more similar a sentence is to the document, the more representative it is

$$centrality(S_i) = \frac{1}{K} \sum_{j \neq i} sim(S_i, S_j)$$
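
A minimal sketch of this centrality score, assuming cosine similarity over raw word counts as the similarity function and K equal to the number of other sentences; neither choice is fixed by the slide, and the function names are illustrative.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token lists, using raw counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centrality(sentences):
    """Average similarity of each sentence to all other sentences."""
    k = len(sentences) - 1  # assumption: K = number of other sentences
    if k <= 0:
        return [0.0] * len(sentences)
    return [
        sum(cosine(s, t) for j, t in enumerate(sentences) if j != i) / k
        for i, s in enumerate(sentences)
    ]

print(centrality([["a", "b"], ["a", "b"], ["c"]]))
```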

Evaluation

Comparing Text Against Text

Which human summary makes a good gold standard? Many summaries are good

At what granularity is the comparison made?

When can we say that two pieces of text match?

Variation impacts evaluation

Comparing content is hard
  All kinds of judgment calls
Paraphrases
  VP vs. NP
    Ministers have been exchanged
    Reciprocal ministerial visits
  Length and constituent type
    Robotics assists doctors in the medical operating theater
    Surgeons started using robotic assistants

Nightmare: only one gold standard

System may have chosen an equally good sentence but not in the one gold standard

Pinochet arrested in London on Oct 16 at a Spanish judge’s request for atrocities against Spaniards in Chile.

Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government

In DUC 2001 (one gold standard), human model had significant impact on scores (McKeown et al)

Five human summaries needed to avoid changes in rank (Nenkova and Passonneau)

DUC2003 data
  3 topic sets, 1 highest scoring and 2 lowest scoring
  10 model summaries

Scoring

Two main approaches used in DUC

ROUGE (Lin and Hovy)

Pyramids (Nenkova and Passonneau)

Problems:
  Are the results stable?
  How difficult is it to do the scoring?

DUC – Document Understanding Conference

Established and funded by DARPA TIDES
Run by independent evaluator NIST
Open to summarization community
Annual evaluations on common datasets, 2001-present
Tasks
  Single document summarization
  Headline summarization
  Multi-document summarization
  Multi-lingual summarization
  Focused summarization

DUC Evaluation

Gold Standard
  Human summaries written by NIST
  From 2 to 9 summaries per input set
Multiple metrics
  Manual
    Coverage (early years)
    Pyramids (later years)
    Responsiveness (later years)
    Quality questions
  Automatic
    Rouge (-1, -2, -skipbigrams, LCS, BE)
Granularity
  Manual: sub-sentential elements
  Automatic: sentences

ROUGE: Recall-Oriented Understudy for Gisting Evaluation

Rouge: n-gram co-occurrence metrics measuring content overlap

$$\text{ROUGE} = \frac{\text{count of n-gram overlaps between candidate and model summaries}}{\text{total n-grams in model summaries}}$$
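
The ratio can be sketched as a small function. This is an illustration and not the official ROUGE toolkit: it assumes pre-tokenized input, clips each overlap at the candidate's n-gram count, and pools counts over all model summaries; the function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, models, n=1):
    """Overlapping n-grams divided by total n-grams in the model summaries."""
    cand = ngrams(candidate, n)
    overlap = total = 0
    for model in models:
        ref = ngrams(model, n)
        overlap += sum(min(count, cand[gram]) for gram, count in ref.items())
        total += sum(ref.values())
    return overlap / total if total else 0.0

candidate = "the cat sat".split()
models = ["the cat ran".split()]
print(rouge_n(candidate, models, n=1), rouge_n(candidate, models, n=2))
```

Because the denominator counts n-grams in the model summaries, the score is recall-oriented: a longer candidate can only keep or raise it.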

ROUGE

Experimentation with different units of comparison: unigrams, bigrams, longest common subsequence, skip-bigrams, basic elements

Automatic and thus easy to apply

Important to consider confidence intervals when determining differences between systems
  Scores falling within the same interval are not significantly different
  Rouge scores place systems into large groups: it can be hard to definitively say one is better than another

Sometimes results are unintuitive:
  Multilingual scores as high as English scores
  Use in speech summarization shows no discrimination

Good for training regardless of intervals: can see trends

LexPageRank: Prestige in Multi-Document Text Summarization

Gunes Erkan and Dragomir R. Radev

ACL 2004

Abstract

This paper considers an approach for computing sentence importance based on the concept of eigenvector centrality (prestige): LexPageRank

In this model, a sentence connectivity matrix is constructed based on cosine similarity

The experimental results using DUC2004 show that this approach outperforms centroid-based summarization and is quite successful compared to other summarization systems

Introduction

Text summarization is the process of automatically creating a compressed version of a given text that provides useful information for the user

This summarization approach is to assess the centrality of each sentence in a cluster and include the most important ones in the summary

The paper introduces two new measures of centrality, Degree and LexPageRank, inspired by the prestige concept in social networks

Sentence centrality and centroid-based summarization

Extractive summarization produces summaries by choosing a subset of the sentences in the original documents

Centrality of a sentence is often defined in terms of the centrality of the words that it contains

The centroid of a cluster is a pseudo-document which consists of words that have frequency*IDF scores above a predefined threshold

In centroid-based summarization (Radev et al., 2000), the sentences that contain more words from the centroid of the cluster are considered central

Centroid-based summarization has given promising results in the past

Prestige-based sentence centrality

We hypothesize that the sentences that are similar to many of the other sentences in a cluster are more central (or prestigious) to the topic

There are two issues
  How to define similarity between two sentences
    Cosine
  How to compute the overall prestige of a sentence given its similarity to other sentences
    Degree centrality
    Eigenvector centrality and LexPageRank


A cluster may be represented by a cosine similarity matrix


Most of its entries are nonzero


Degree centrality
  Since we are interested in significant similarities in the matrix, we can eliminate some low values by defining a threshold, so that the cluster can be viewed as an undirected graph
  We define degree centrality as the degree of each node in the similarity graph
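
Thresholding the similarity matrix and counting neighbours can be sketched as follows. The sketch assumes the matrix is given as a list of lists and leaves out self-similarity when counting; the function name, the threshold value and the toy matrix are illustrative.

```python
def degree_centrality(sim, threshold):
    """Number of other sentences whose similarity exceeds the threshold."""
    n = len(sim)
    return [
        sum(1 for j in range(n) if j != i and sim[i][j] > threshold)
        for i in range(n)
    ]

sim = [
    [1.0, 0.3, 0.05],
    [0.3, 1.0, 0.2],
    [0.05, 0.2, 1.0],
]
print(degree_centrality(sim, threshold=0.1))
```

With a threshold of 0.1 the weak link between the first and third sentences disappears, so the middle sentence has the highest degree.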




Issue for degree centrality
  Several unwanted sentences may vote for each other and raise their prestige
  This situation can be avoided by considering where the votes come from and taking the prestige of the voting node into account in weighting each node

Eigenvector centrality and LexPageRank
  PageRank (Page et al., 1998) is a method proposed for assigning a prestige score to each page in the web, independent of a specific query
  The score depends on the number of pages that link to that page as well as the individual scores of the linking pages

The PageRank of page A:

$$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$$

  $T_1, \ldots, T_n$: pages that link to page A
  d: damping factor
  $C(T_i)$: the number of outgoing links from page $T_i$

This recursively defined value can be computed by forming the binary adjacency matrix of the web, normalizing this matrix so that row sums equal 1, and finding the principal eigenvector of the normalized matrix

The PageRank of the i-th page equals the i-th entry in the eigenvector


This method can be easily applied to the cosine similarity graph to find the most prestigious sentences in a document

We called this new measure of sentence similarity LexPageRank


damping factor = 1
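
Applying the PageRank recursion to a thresholded similarity graph can be sketched with plain iteration. This is an illustration under assumptions, not the authors' implementation: self-links are left out, the update is repeated a fixed number of times, and the threshold and the value d = 0.85 in the usage line are arbitrary choices.

```python
def lexpagerank(sim, threshold, d=0.85, iterations=100):
    """Iterate PR(i) = (1 - d) + d * sum_j PR(j) / C(j) over graph neighbours."""
    n = len(sim)
    # Binary adjacency matrix: link i-j when similarity exceeds the threshold.
    adj = [
        [1 if i != j and sim[i][j] > threshold else 0 for j in range(n)]
        for i in range(n)
    ]
    out_degree = [sum(row) for row in adj]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - d) + d * sum(scores[j] / out_degree[j]
                              for j in range(n) if adj[j][i])
            for i in range(n)
        ]
    return scores

sim = [
    [1.0, 0.3, 0.05],
    [0.3, 1.0, 0.2],
    [0.05, 0.2, 1.0],
]
print(lexpagerank(sim, threshold=0.1))
```

In this toy graph the middle sentence is linked to both others and receives the highest score.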


Advantage over Centroid
  It accounts for information subsumption among sentences
  It prevents unnaturally high IDF scores from boosting up the score of a sentence that is unrelated to the topic

Experiments on DUC 2004 data

DUC 2004 data was used in our experiments
  Task 2 involves summarization of 50 TDT English clusters
  Task 4 is to produce summaries of machine translation output (in English) of 24 Arabic TDT documents

The recall-based measure Rouge is adopted, and 665-byte summaries for each cluster are produced


MEAD summarization toolkit
  Extractive multi-document summarization
  Consists of three components
    Feature extractor (document -> feature vector)
      Centroid, Position and Length
    Combiner (feature vector -> scalar value)
    Reranker (the scores are adjusted upward or downward)
      MMR (Maximal Marginal Relevance), CSIS (Cross-Sentence Information Subsumption)


Thank You!

Q&A

HOMEWORK

Read one of the following papers and write a reading report:
  SentTopic-MultiRank: a novel ranking model for multi-document summarization. In COLING’12
  RelationListwise for query-focused multi-document summarization. In COLING’12
  A supervised aggregation framework for multi-document summarization. In COLING’12
  Query-Focused Multidocument Summarization Based on Query-Sensitive Feature Space. In CIKM’12
  Optimized Event Storyline Generation based on Mixture-Event-Aspect Model. In EMNLP’13
