Overview
What is summarization?
What is the input?
News, or clusters of news: a single article or several articles on a related topic
Email and email threads
Scientific articles
Health information: patients and doctors
Meeting summarization
Video
What is the output?
Keywords
Highlighted information in the input
Chunks of text or speech taken directly from the input, or paraphrases that aggregate the input in novel ways
Modality: text, speech, video, graphics
Ideal stages of summarization
Analysis: input representation and understanding
Transformation: selecting important content
Realization: generating novel text corresponding to the gist of the input
Most current systems
Use shallow analysis methods rather than full understanding
Work by sentence selection: identify important sentences and piece them together to form a summary
Types of summaries
Extracts: sentences from the original document are displayed together to form a summary
Abstracts: material is transformed (paraphrased, restructured, shortened)
Extractive summarization
Each sentence is assigned a score that reflects how important and contentful it is
Data-driven approaches
  Word statistics
  Cue phrases
  Section headers
  Sentence position
Knowledge-based systems
  Discourse information: resolve anaphora, text structure
  Use external lexical resources: WordNet, adjective polarity lists, opinion
Using machine learning
What are summaries useful for?
Relevance judgments
  Does this document contain information I am interested in?
  Is this document worth reading?
Save time
  Reduce the need to consult the full document
Recent development
2013.3, Yahoo bought news reading app Summly for $30 million!
2013.4, Google purchased Wavii for more than $30 million!
Multi-document summarization
Very useful for presenting and organizing search results
  Many results are very similar, and grouping closely related documents helps cover more event facets
Summarizing similarities and differences between documents
How to deal with redundancy?
Author JK Rowling has won her legal battle in a New York court to get an unofficial Harry Potter encyclopaedia banned from publication.
A U.S. federal judge in Manhattan has sided with author J.K. Rowling and ruled against the publication of a Harry Potter encyclopedia created by a fan of the book series.
Shallow techniques not likely to work well
Global optimization for content selection
What is the best summary? vs What is the best sentence?
Form all summaries and choose the best
What is the problem with this approach?
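One way to see the difficulty with forming all summaries is to count them. The sketch below assumes, purely for illustration, that a summary is any subset of exactly k sentences chosen from n input sentences; the function name is hypothetical.

```python
import math


def count_candidate_summaries(num_sentences: int, summary_size: int) -> int:
    """Number of distinct summaries made of exactly `summary_size` sentences.

    Assumption: a summary is an unordered subset of the input sentences.
    """
    return math.comb(num_sentences, summary_size)


# The count grows combinatorially with the size of the input.
for n in (20, 50, 100):
    print(n, count_candidate_summaries(n, 5))
```

Even for a 5-sentence summary of a 100-sentence input there are tens of millions of candidates, which is why exhaustive enumeration is usually replaced by greedy or approximate selection.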
Information ordering
In what order to present the selected sentences?
  An article with permuted sentences will not be easy to understand
Very important for multi-document summarization
  Sentences coming from different documents
Automatic summary edits
Some expressions might not be appropriate in the new context
  References: “he”, “Putin”, “Russian Prime Minister Vladimir Putin”
  Discourse connectives: however, moreover, subsequently
Requires more sophisticated NLP techniques
Before
Pinochet was placed under arrest in London Friday by British police acting on a warrant issued by a Spanish judge. Pinochet has immunity from prosecution in Chile as a senator-for-life under a new constitution that his government crafted. Pinochet was detained in the London clinic while recovering from back surgery.
After
Gen. Augusto Pinochet, the former Chilean dictator, was placed under arrest in London Friday by British police acting on a warrant issued by a Spanish judge. Pinochet has immunity from prosecution in Chile as a senator-for-life under a new constitution that his government crafted. Pinochet was detained in the London clinic while recovering from back surgery.
Before
Turkey has been trying to form a new government since a coalition government led by Yilmaz collapsed last month over allegations that he rigged the sale of a bank. Ecevit refused even to consult with the leader of the Virtue Party during his efforts to form a government. Ecevit must now try to build a government. Demirel consulted Turkey's party leaders immediately after Ecevit gave up.
After
Turkey has been trying to form a new government since a coalition government led by Prime Minister Mesut Yilmaz collapsed last month over allegations that he rigged the sale of a bank. Premier-designate Bulent Ecevit refused even to consult with the leader of the Virtue Party during his efforts to form a government. Ecevit must now try to build a government. President Suleyman Demirel consulted Turkey's party leaders immediately after Ecevit gave up.
Traditional Approaches
1) word frequency based method
Hans Peter Luhn (“father of Information Retrieval”): The Automatic Creation of Literature Abstracts - 1958
Image: Courtesy IBM
Luhn’s method: basic idea
Target documents: technical literature
The method is based on the following assumptions:
  Frequency of word occurrence in an article is a useful measurement of word significance
  Relative position of these significant words within a sentence is also a useful measurement of word significance
Based on the limited capabilities of machines (IBM 704): no semantic information
Why word frequency?
Important words are repeated throughout the text
  Examples are given in favor of a certain principle
  Arguments are given for a certain principle
  Technical literature: one word, one notion
Simple and straightforward algorithm: cheap to implement (processing time is costly)
Note that different forms of the same word are counted as the same word
When significant?
Words with too low a frequency are not significant
Words with too high a frequency are also not significant (e.g. “the”, “and”)
Removing low-frequency words is easy: set a minimum frequency threshold
Removing common (high-frequency) words:
  Setting a maximum frequency threshold (statistically obtained)
  Comparing to a common-word list
Figure 1 from [Luhn, 1958]
Using relative position
Where the greatest number of high-frequency words are found closest together, the probability is very high that representative information is given
Based on the observation that the explanation of an idea is expressed by words that occur close together (e.g. within sentences, paragraphs, chapters)
The significance factor
The “significance factor” of a sentence reflects the number of occurrences of significant words within a sentence and the linear distance between them due to non-significant words in between
Only consider portion of sentence bracketed by significant words with maximum of 5 non-significant words in between, e.g. “ (*) - - - [ * - * * - - * - - * ] - - (*) “
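Significance factor formula: $(\Sigma[*])^2 / |[.]|$, i.e. the squared number of significant words inside the bracket divided by the total number of words inside the bracket ($5^2 / 10 = 2.5$ in the above example)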
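The significance factor can be sketched in a few lines. This is a minimal illustration, not Luhn's original program: the function name and the boolean-list input are assumptions, and the sentence is taken to be already tokenized and tagged as significant or non-significant. The test pattern is the bracketed portion of the example above.

```python
from typing import List


def significance_factor(is_significant: List[bool], max_gap: int = 5) -> float:
    """Best cluster score in one sentence: (significant words)^2 / cluster length.

    A cluster starts and ends with a significant word and never contains a
    run of more than `max_gap` non-significant words.
    """
    positions = [i for i, flag in enumerate(is_significant) if flag]
    if not positions:
        return 0.0

    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 > max_gap:
            # Gap too wide: score the finished cluster and open a new one.
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 0
        count += 1
        prev = pos
    # Score the last open cluster.
    return max(best, count ** 2 / (prev - start + 1))


# Bracketed portion of the slide example: * - * * - - * - - *
bracket = [ch == "*" for ch in "*-**--*--*"]
print(significance_factor(bracket))
```

The bracketed span holds 5 significant words in a window of 10 words, which gives the 2.5 quoted on the slide.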
Generating the abstract
For every sentence the significance factor is calculated
The sentences with a significance factor higher than a certain cut-off value are returned (alternatively the N highest-valued sentences can be returned)
For large texts, it can also be applied to subdivisions of the text
No evaluation of the results is present in the journal paper!
2) Position based method
H.P. Edmundson: New methods in Automatic Extracting - 1969
IBM 7090 - Courtesy IBM
Lead method
Claim: Important sentences occur at the beginning (and/or end) of texts.
Lead method: just take first sentence(s)!
Experiments:
  In 85% of 200 individual paragraphs the topic sentences occurred in initial position and in 7% in final position (Baxendale, 58).
  Only 13% of the paragraphs of contemporary writers start with topic sentences (Donlan, 80).
Cue-Phrase method
Claim 1: Important sentences contain ‘bonus phrases’, such as significantly, In this paper we show, and In conclusion, while non-important sentences contain ‘stigma phrases’ such as hardly and impossible.
Claim 2: These phrases can be detected automatically (Kupiec et al. 95; Teufel and Moens 97).
Method: Add to sentence score if it contains a bonus phrase, penalize if it contains a stigma phrase.
Four methods for weighting
Weighting methods: Cue Method, Key Method, Title Method, Location Method
The weight of a sentence is a linear combination of the weights obtained with the above four methods
The highest-weighted sentences are included in the abstract
Target documents: technical literature
Cue Method
Based on the hypothesis that the probable relevance of a sentence is affected by the presence of pragmatic words (e.g. “Significant”, “Greatest”, “Impossible”, “Hardly”)
Three types of Cue words:
  Bonus words: positively affecting the relevance of a sentence (e.g. “Significant”, “Greatest”)
  Stigma words: negatively affecting the relevance of a sentence (e.g. “Impossible”, “Hardly”)
  Null words: irrelevant
Obtaining Cue words
The lists were obtained by statistical analyses of 100 documents:
  Dispersion (λ): number of documents in which the word occurred
  Selection ratio (η): ratio of number of occurrences in extractor-selected sentences to number of occurrences in all sentences
Bonus words: $\eta > t^{\eta}_{high}$
Stigma words: $\eta < t^{\eta}_{low}$
Null words: $\lambda > t_{\lambda}$ and $t^{\eta}_{low} < \eta < t^{\eta}_{high}$
Resulting Cue lists
Bonus list (783): comparatives, superlatives, adverbs of conclusion, value terms, etc.
Stigma list (73): anaphoric expressions, belittling expressions, etc.
Null list (139): ordinals, cardinals, the verb “to be”, prepositions, pronouns, etc.
Cue weight of sentence
Tag all Bonus words with weight b > 0, all Stigma words with weight s < 0, all Null words with weight n = 0
Cue weight of sentence: Σ (Cue weight of each word in sentence)
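The cue weight above can be illustrated with a small sketch. The two word sets are hypothetical stand-ins built from the examples on the slides (the real lists held 783 bonus, 73 stigma and 139 null words), and the default values of b and s are placeholders, not values from the paper.

```python
from typing import Iterable

# Hypothetical miniature cue dictionary, for illustration only.
BONUS_WORDS = {"significant", "greatest"}
STIGMA_WORDS = {"impossible", "hardly"}


def cue_weight(words: Iterable[str], b: float = 1.0, s: float = -1.0) -> float:
    """Sum of per-word cue weights: b > 0 for bonus, s < 0 for stigma, 0 otherwise."""
    total = 0.0
    for word in words:
        word = word.lower()
        if word in BONUS_WORDS:
            total += b
        elif word in STIGMA_WORDS:
            total += s
        # Null words contribute n = 0; words in no list are also treated
        # as 0 here, which is an assumption of this sketch.
    return total


print(cue_weight("The greatest and most significant result".split()))
print(cue_weight("This is hardly significant".split()))
```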
Key Method
Principle based on [Luhn], counting the frequency of words.
Algorithm differs:
  Create key glossary of all non-Cue words in the document which have a frequency larger than a certain threshold
  Weight of each key word in the key glossary is set to the frequency with which it occurs in the document
  Assign key weight to each word which can be found in the key glossary
  If a word is not in the key glossary, its key weight is 0
  No relative position is used (unlike [Luhn])
Key weight of sentence: Σ (Key weight of each word in sentence)
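A minimal sketch of the Key Method as described above, assuming pre-tokenized lowercase words. The function names, the `cue_words` argument and the threshold value are illustrative choices, not details taken from Edmundson's paper.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def build_key_glossary(doc_words: List[str], cue_words: Set[str],
                       min_freq: int = 2) -> Dict[str, int]:
    """Map each non-Cue word whose frequency exceeds `min_freq` to that frequency."""
    counts = Counter(w for w in doc_words if w not in cue_words)
    return {w: c for w, c in counts.items() if c > min_freq}


def key_weight(sentence_words: Iterable[str], glossary: Dict[str, int]) -> int:
    """Sum of key weights; words outside the glossary weigh 0."""
    return sum(glossary.get(w, 0) for w in sentence_words)


doc = "parser parser parser tree tree the the the the".split()
glossary = build_key_glossary(doc, cue_words={"the"})
print(glossary, key_weight(["the", "parser", "tree"], glossary))
```

Unlike the significance factor, nothing here depends on where the key words sit inside the sentence.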
Title Method
Based on the hypothesis that an author conceives title as circumscribing the subject matter of the document (similarly for headings vs. paragraphs)
Create title glossary consisting of all non-Null words in the title, subtitle and headings of the document
Words are given a positive title weight if they appear in this glossary
Title words are given a larger weight than heading words
Title weight of sentence: Σ (Title weight of each word in sentence)
Location Method
Based on the hypothesis that:
  Sentences occurring under certain headings are positively relevant
  Topic sentences tend to occur very early or very late in a document and its paragraphs
Global idea:
  Give each sentence below its heading the same weight as the heading itself (note that this is independent from the Title Method): Heading weight
  Give each sentence a certain weight based on its position: Ordinal weight
Location weight of sentence: Ordinal weight of sentence + Heading weight of sentence
Location Method: Heading weight
Compare each word in a heading with the pre-stored Heading dictionary
If the word occurs in this dictionary, assign it a weight equal to the weight it has in the dictionary
Heading weight of a heading: Σ (heading weight of each word in heading)
Heading weight of a sentence = Heading weight of its heading
Creating the Heading dictionary
The Heading dictionary was created by listing all words in the headings of 120 documents and calculating the selection ratio for each word:
Selection ratio (η): ratio of number of occurrences in extractor-selected sentences to number of occurrences in all headings
Deletions from this list were made on the basis of low frequency and unrelatedness to the desired information types (subject, purpose, conclusion, etc.)
Weights were given to the words in the Heading dictionary proportional to the selection ratio
The resulting Heading dictionary contained 90 words
Location Method: Ordinal weight
Sentences of the first paragraph are tagged with weight O1
Sentences of the last paragraph are tagged with weight O2
The first sentence of a paragraph is tagged with weight O3
The last sentence of a paragraph is tagged with weight O4
Ordinal weight of sentence: O1 + O2 + O3 + O4
Generating the abstract
Calculate the weight of a sentence: aC + bK + cT + dL, with a,b,c,d constant positive integers, C: Cue Weight, K: Key weight, T: Title weight, L: Location weight
The values of a, b, c and d were obtained by manually comparing the generated automatic abstracts with the desired (human made) abstract
Return the highest N sentences under their proper headings as the abstract (including title)
N is calculated by taking a percentage of the size of the original documents, in this journal paper 25% is used
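The combination and selection step can be sketched as follows. The four per-sentence weights are assumed to be already computed; the default coefficients are placeholders (the slides do not give the values of a, b, c and d), and measuring document size in sentences is an assumption of this sketch.

```python
from typing import List, Sequence, Tuple


def select_sentences(features: Sequence[Tuple[float, float, float, float]],
                     a: int = 1, b: int = 1, c: int = 1, d: int = 1,
                     fraction: float = 0.25) -> List[int]:
    """Return indices of the top-scoring sentences, in document order.

    `features` holds one (C, K, T, L) tuple per sentence: cue, key, title
    and location weight.
    """
    scores = [a * C + b * K + c * T + d * L for (C, K, T, L) in features]
    n = max(1, round(fraction * len(features)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])


features = [(1, 0, 2, 3), (0, 5, 0, 0), (0, 0, 0, 0), (2, 0, 1, 1),
            (0, 0, 0, 1), (0, 1, 0, 0), (0, 0, 0, 0), (0, 0, 1, 0)]
print(select_sentences(features))        # all four methods
print(select_sentences(features, b=0))   # Key Method switched off
```

Setting a coefficient to 0 switches a method off, which is one way to compare combinations of C, K, T and L.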
Which combination is best?
All combinations of C, K, T and L were tried to see which result had (on average) the most overlap with the handmade extract
As can be seen in the figure below (only the interesting results are shown), the Key method was omitted and only C, T and L are used to create the best abstract
Surprising result! (Luhn used only keywords to create the abstract)
Figure 4 from [Edmundson, 1969]
Evaluation
Evaluation was done on unseen data (40 technical documents), comparison with handmade abstracts
Result: 44% of the sentences co-selected, 66% similarity between abstracts (human judge)
Random ‘abstract’: 25% of the sentences co-selected, 34% similarity between abstracts
Another evaluation criterion: ‘extract-worthiness’
  Result: 84% of the sentences selected are extract-worthy
  Therefore: for one document many abstracts are possible (differing in length and content)
3) Machine-learning method
Ask people to select sentences
Use these as training examples for machine learning
  Each sentence is represented as a number of features
  Based on the features, distinguish sentences that are appropriate for a summary from sentences that are not
Run on new inputs
Scoring sentences
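For each sentence s, the probability that it will be included in the summary S given the k features $F_1, \ldots, F_k$ is calculated (Bayes’ rule):
$P(s \in S \mid F_1, \ldots, F_k) = \frac{P(F_1, \ldots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, \ldots, F_k)}$
Assuming statistical independence of the features:
$P(s \in S \mid F_1, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\, P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$
$P(s \in S)$ is constant, and $P(F_j \mid s \in S)$ and $P(F_j)$ can be estimated directly from the training set by counting occurrences
This function assigns each s a score which can be used to select sentences for inclusion in the abstract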
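A minimal sketch of this kind of scoring, assuming binary features and estimating every probability by counting over a labeled training set. The function names and the add-one smoothing are assumptions made here to keep the example small and to avoid zero counts; they are not described on the slides.

```python
from typing import List, Sequence, Tuple

Model = Tuple[float, List[float], List[float]]


def train(feature_rows: Sequence[Sequence[int]], in_summary: Sequence[int]) -> Model:
    """Estimate P(s in S), P(F_j = 1 | s in S) and P(F_j = 1) by counting."""
    n = len(feature_rows)
    k = len(feature_rows[0])
    n_pos = sum(in_summary)
    prior = n_pos / n
    p_f_given_s, p_f = [], []
    for j in range(k):
        on_pos = sum(1 for row, y in zip(feature_rows, in_summary) if y and row[j])
        on_all = sum(1 for row in feature_rows if row[j])
        # Add-one smoothing (an assumption of this sketch).
        p_f_given_s.append((on_pos + 1) / (n_pos + 2))
        p_f.append((on_all + 1) / (n + 2))
    return prior, p_f_given_s, p_f


def score(row: Sequence[int], model: Model) -> float:
    """P(s in S) * prod_j P(F_j | s in S) / P(F_j) under feature independence."""
    prior, p_f_given_s, p_f = model
    result = prior
    for j, value in enumerate(row):
        numerator = p_f_given_s[j] if value else 1 - p_f_given_s[j]
        denominator = p_f[j] if value else 1 - p_f[j]
        result *= numerator / denominator
    return result


rows = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
labels = [1, 1, 0, 0, 1, 0]
model = train(rows, labels)
print(score([1, 1], model), score([0, 0], model))
```

Because the independence assumption is only an approximation, the result is best read as a ranking score rather than a calibrated probability.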
The training material
188 documents with professionally created abstracts from the scientific/technical domain, the average length of the abstracts is 3 sentences (3.5% of the total size of the document)
Sentences from the abstract were matched to the original document:
  79% direct sentence matches
  3% direct joins (2 sentences combined)
  18% no direct match or join possible
Therefore the maximum performance of the automatic system is 82%
Evaluation
Too little material: cross-validation used to evaluate
Two evaluation measures:
  Fraction of manually selected sentences which were reproduced correctly: average result 35%
  Fraction of the matchable selected sentences which were reproduced correctly: average result 42%
Performance of features (2nd measure):
Feature | Individual % sentences correct | Cumulative % sentences correct
Paragraph | 33 | 33
Fixed Phrases | 29 | 42
Length Cut-off | 24 | 44
Thematic Word | 20 | 42
Uppercase Word | 20 | 42
4) Discourse-based method
Claim: The multi-sentence coherence structure of a text can be constructed, and the ‘centrality’ of the textual units in this structure reflects their importance.
Tree-like representation of texts in the style of Rhetorical Structure Theory (Mann and Thompson, 88).
Use the discourse representation in order to determine the most important textual units.
Attempts: (Ono et al., 1994) for Japanese; (Marcu, 1997, 2000) for English.
Rhetorical parsing (Marcu,97)
[With its distant orbit {– 50 percent farther from the sun than Earth –} and slim atmospheric blanket,1] [Mars experiences frigid weather conditions.2] [Surface temperatures typically average about –60 degrees Celsius (–76 degrees Fahrenheit) at the equator and can dip to –123 degrees C near the poles.3] [Only the midday sun at tropical latitudes is warm enough to thaw ice on occasion,4] [but any liquid water formed that way would evaporate almost instantly5] [because of the low atmospheric pressure.6]
[Although the atmosphere holds a small amount of water, and water-ice clouds sometimes develop,7] [most Martian weather involves blowing dust or carbon dioxide.8] [Each winter, for example, a blizzard of frozen carbon dioxide rages over one pole, and a few meters of this dry-ice snow accumulate as previously frozen carbon dioxide evaporates from the opposite polar cap.9] [Yet even on the summer pole, {where the sun remains in the sky all day long,} temperatures never warm enough to melt frozen water.10]
Rhetorical parsing (2)
Use discourse markers to hypothesize rhetorical relations
  rhet_rel(CONTRAST, 4, 5)
  rhet_rel(CONTRAST, 4, 6)
  rhet_rel(EXAMPLE, 9, [7,8])
  rhet_rel(EXAMPLE, 10, [7,8])
Use semantic similarity to hypothesize rhetorical relations
  if similar(u1, u2) then
    rhet_rel(ELABORATION, u2, u1)
    rhet_rel(BACKGROUND, u1, u2)
  else
    rhet_rel(JOIN, u1, u2)
  rhet_rel(JOIN, 3, [1,2])
  rhet_rel(ELABORATION, [4,6], [1,2])
Use the hypotheses in order to derive a valid discourse representation of the original text.
Rhetorical parsing (3)
[Figure: rhetorical structure tree over units 1-10 of the Mars text; relation labels shown: Evidence, Cause, Contrast, Elaboration, Background, Justification, Concession, Antithesis, Example]
Summarization = selection of the most important units
2 > 8 > 3, 10 > 1, 4, 5, 7, 9 > 6
Discourse method: Evaluation
(using a combination of heuristics for rhetorical parsing disambiguation)
Reduction | Method | Recall | Precision | F-score
10% | Humans | 83.20% | 75.95% | 79.41%
10% | Program | 63.75% | 72.50% | 67.84%
10% | Lead | 82.91% | 63.45% | 71.89%
20% | Humans | 82.83% | 64.93% | 72.80%
20% | Program | 61.79% | 60.83% | 61.31%
20% | Lead | 70.91% | 46.96% | 56.50%
TREC Corpus (fourfold cross-validation)
Level | Method | Rec. | Prec. | F-score
Clause | Humans | 72.66% | 69.63% | 71.27%
Clause | Program (training) | 67.57% | 73.53% | 70.42%
Clause | Program (no training) | 51.35% | 63.33% | 56.71%
Clause | Lead | 39.68% | 39.68% | 39.68%
Sentence | Humans | 78.11% | 79.37% | 78.73%
Sentence | Program (training) | 69.23% | 64.29% | 66.67%
Sentence | Program (no training) | 57.69% | 51.72% | 54.54%
Sentence | Lead | 54.22% | 54.22% | 54.22%
Scientific American Corpus
5) VS based method
Based on word probability
  $weight(S) = \frac{1}{n} \sum_{i=1}^{n} \log(p_i)$
  S is a sentence of length n
  $p_i$ is the probability of the i-th word in the sentence
Based on word tf.idf
  $weight(S) = \frac{1}{n} \sum_{i=1}^{n} tf_i \cdot idf_i$
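Both sentence weights can be sketched directly, reading each as an average over the n words of the sentence: the mean log probability in the first case and the mean tf.idf value in the second. Estimating word probabilities as relative frequencies in the input, and the toy tf and idf tables, are assumptions of this sketch.

```python
import math
from collections import Counter
from typing import Dict, List


def word_probabilities(doc_words: List[str]) -> Dict[str, float]:
    """Relative frequency of each word in the input (assumed estimator)."""
    counts = Counter(doc_words)
    total = len(doc_words)
    return {w: c / total for w, c in counts.items()}


def probability_weight(sentence: List[str], probs: Dict[str, float]) -> float:
    """Average log probability of the words in the sentence."""
    return sum(math.log(probs[w]) for w in sentence) / len(sentence)


def tfidf_weight(sentence: List[str], tf: Dict[str, float],
                 idf: Dict[str, float]) -> float:
    """Average tf * idf of the words in the sentence."""
    return sum(tf[w] * idf[w] for w in sentence) / len(sentence)


probs = word_probabilities(["a", "a", "b", "c"])
print(probability_weight(["a", "a"], probs), probability_weight(["a", "b"], probs))
print(tfidf_weight(["a", "b"], tf={"a": 2, "b": 1}, idf={"a": 0.5, "b": 2.0}))
```

Dividing by n keeps long sentences from being favored merely for containing more words.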
Centrality measures
How representative is a sentence of the overall content of a document
The more similar a sentence is to the document, the more representative it is
$centrality(S_i) = \frac{1}{K} \sum_{j \neq i} sim(S_i, S_j)$
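A small sketch of this centrality measure, scoring each sentence by its summed similarity to every other sentence divided by K. Taking K to be the number of sentences and using cosine similarity over raw word counts are assumptions made for the example.

```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centrality_scores(sentences: List[str]) -> List[float]:
    """centrality(S_i) = (1 / K) * sum over j != i of sim(S_i, S_j)."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    k = len(vectors)
    return [
        sum(cosine(vectors[i], vectors[j]) for j in range(k) if j != i) / k
        for i in range(k)
    ]


print(centrality_scores(["the cat sat", "the cat ran", "dogs bark loudly"]))
```

The third sentence shares no words with the others, so its centrality is 0.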
Evaluation
Comparing Text Against Text
Which human summary makes a good gold standard? Many summaries are good
At what granularity is the comparison made?
When can we say that two pieces of text match?
Variation impacts evaluation
Comparing content is hard: all kinds of judgment calls
Paraphrases: VP vs. NP
  Ministers have been exchanged
  Reciprocal ministerial visits
Length and constituent type
  Robotics assists doctors in the medical operating theater
  Surgeons started using robotic assistants
Nightmare: only one gold standard
System may have chosen an equally good sentence but not in the one gold standard
Pinochet arrested in London on Oct 16 at a Spanish judge’s request for atrocities against Spaniards in Chile.
Former Chilean dictator Augusto Pinochet has been arrested in London at the request of the Spanish government
In DUC 2001 (one gold standard), human model had significant impact on scores (McKeown et al)
Five human summaries needed to avoid changes in rank (Nenkova and Passonneau)
DUC2003 data: 3 topic sets (1 highest scoring and 2 lowest scoring), 10 model summaries
Scoring
Two main approaches used in DUC
ROUGE (Lin and Hovy)
Pyramids (Nenkova and Passonneau)
Problems: Are the results stable? How difficult is it to do the scoring?
DUC – Document Understanding Conference
Established and funded by DARPA TIDES
Run by independent evaluator NIST
Open to summarization community
Annual evaluations on common datasets, 2001-present
Tasks
  Single document summarization
  Headline summarization
  Multi-document summarization
  Multi-lingual summarization
  Focused summarization
DUC Evaluation
Gold standard
  Human summaries written by NIST
  From 2 to 9 summaries per input set
Multiple metrics
  Manual: Coverage (early years), Pyramids (later years), Responsiveness (later years), Quality questions
  Automatic: ROUGE (-1, -2, -skip-bigrams, LCS, BE)
Granularity
  Manual: sub-sentential elements
  Automatic: sentences
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
ROUGE: n-gram co-occurrence metrics measuring content overlap
$\text{ROUGE-N} = \frac{\text{count of n-gram overlaps between candidate and model summaries}}{\text{total n-grams in the model summaries}}$
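The ratio above can be sketched as a recall-style n-gram count. This is an illustrative simplification rather than the official ROUGE toolkit: tokenization is plain whitespace splitting, no stemming or stop-word handling is applied, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, models: List[str], n: int = 1) -> float:
    """Overlapping n-grams divided by the total n-grams in the model summaries."""
    cand = ngrams(candidate.lower().split(), n)
    overlap = 0
    total = 0
    for model in models:
        ref = ngrams(model.lower().split(), n)
        # Clip each match by how often the n-gram occurs in the candidate.
        overlap += sum(min(count, cand[gram]) for gram, count in ref.items())
        total += sum(ref.values())
    return overlap / total if total else 0.0


candidate = "the cat sat on the mat"
models = ["the cat was on the mat"]
print(rouge_n(candidate, models, n=1), rouge_n(candidate, models, n=2))
```

The denominator counts n-grams in the model summaries, which is what makes the measure recall-oriented.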
ROUGE
Experimentation with different units of comparison: unigrams, bigrams, longest common subsequence, skip-bigrams, basic elements
Automatic and thus easy to apply
Important to consider confidence intervals when determining differences between systems
  Scores falling within the same interval are not significantly different
  ROUGE scores place systems into large groups: it can be hard to definitively say one is better than another
Sometimes results are unintuitive:
  Multilingual scores as high as English scores
  Use in speech summarization shows no discrimination
Good for training regardless of intervals: can see trends
LexPageRank: Prestige in Multi-Document Text Summarization
Gunes Erkan and Dragomir R. Radev
ACL 2004
Abstract
This paper considers an approach for computing sentence importance based on the concept of eigenvector centrality (prestige): LexPageRank
In this model, a sentence connectivity matrix is constructed based on cosine similarity
The experimental results using DUC2004 show that this approach outperforms centroid-based summarization and is quite successful compared to other summarization systems
Introduction
Text summarization is the process of automatically creating a compressed version of a given text that provides useful information for the user
This summarization approach is to assess the centrality of each sentence in a cluster and include the most important ones in the summary
Introduce two new measures for centrality, Degree and LexPageRank, inspired by the prestige concept in social networks
Sentence centrality and centroid-based summarization
Extractive summarization produces summaries by choosing a subset of the sentences in the original documents
Centrality of a sentence is often defined in terms of the centrality of the words that it contains
The centroid of a cluster is a pseudo-document which consists of words that have frequency*IDF scores above a predefined threshold
In centroid-based summarization (Radev et al., 2000), the sentences that contain more words from the centroid of the cluster are considered central
Centroid-based summarization has given promising results in the past
Prestige-based sentence centrality
We hypothesize that the sentences that are similar to many of the other sentences in a cluster are more central (or prestigious) to the topic
There are two issues:
  How to define similarity between two sentences: cosine
  How to compute the overall prestige of a sentence given its similarity to other sentences: degree centrality, eigenvector centrality and LexPageRank
Prestige-based sentence centrality
A cluster may be represented by a cosine similarity matrix
Prestige-based sentence centrality
Most of them are nonzero
Prestige-based sentence centrality
Degree centrality
  Since we are interested in significant similarities in the matrix, we can eliminate some low values by defining a threshold, so that the cluster can be viewed as an undirected graph
  We define degree centrality as the degree of each node in the similarity graph
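Degree centrality over a thresholded similarity matrix can be sketched as below. The matrix values and thresholds are made up for illustration, and self-similarity on the diagonal is not counted as an edge here, which is a choice of this sketch.

```python
from typing import List


def degree_centrality(sim: List[List[float]], threshold: float = 0.1) -> List[int]:
    """Number of other sentences whose similarity exceeds the threshold."""
    n = len(sim)
    return [
        sum(1 for j in range(n) if j != i and sim[i][j] > threshold)
        for i in range(n)
    ]


# Hypothetical cosine similarity matrix for three sentences.
sim = [
    [1.00, 0.50, 0.05],
    [0.50, 1.00, 0.30],
    [0.05, 0.30, 1.00],
]
print(degree_centrality(sim, threshold=0.1))
print(degree_centrality(sim, threshold=0.4))
```

Raising the threshold removes weak edges, so the choice of threshold directly changes which sentences look central.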
Prestige-based sentence centrality
Issue for degree centrality
  Several unwanted sentences vote for each other and raise their prestige
  This situation can be avoided by considering where the votes come from and taking the prestige of the voting node into account in weighting each node
Eigenvector centrality and LexPageRank
  PageRank (Page et al., 1998) is a method proposed for assigning a prestige score to each page in the Web independent of a specific query
  The score depends on the number of pages that link to that page as well as the individual scores of the linking pages
Prestige-based sentence centrality
The PageRank of page A:
$PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$
$T_1, \ldots, T_n$: pages that link to page A; d: damping factor; $C(T_i)$: the number of outgoing links from page $T_i$
This recursively defined value can be computed by forming the binary adjacency matrix of the Web, normalizing this matrix so that row sums equal 1, and finding the principal eigenvector of the normalized matrix
The PageRank of the i-th page equals the i-th entry in the eigenvector
Prestige-based sentence centrality
This method can be easily applied to the cosine similarity graph to find the most prestigious sentences in a document
We called this new measure of sentence similarity LexPageRank
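Applying PageRank to the similarity graph can be sketched with power iteration. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a normalized variant in which the teleport term is (1 - d) / N so that scores sum to 1, builds a binary adjacency matrix by thresholding cosine similarities, and keeps self-loops because each sentence has similarity 1 with itself. The matrix, threshold and iteration count are made-up values.

```python
from typing import List


def lexpagerank(sim: List[List[float]], threshold: float = 0.1,
                d: float = 0.85, iterations: int = 100) -> List[float]:
    """Power iteration on the row-normalized, thresholded similarity graph."""
    n = len(sim)
    adj = [[1.0 if sim[i][j] > threshold else 0.0 for j in range(n)]
           for i in range(n)]
    row_sums = [sum(row) for row in adj]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on its score split evenly over its edges.
            incoming = sum(scores[j] * adj[j][i] / row_sums[j]
                           for j in range(n) if row_sums[j])
            new_scores.append((1 - d) / n + d * incoming)
        scores = new_scores
    return scores


# Hypothetical similarities: sentence 1 is similar to all the others.
sim = [
    [1.0, 0.6, 0.0, 0.0],
    [0.6, 1.0, 0.5, 0.4],
    [0.0, 0.5, 1.0, 0.0],
    [0.0, 0.4, 0.0, 1.0],
]
print(lexpagerank(sim))
print(lexpagerank(sim, d=1.0))
```

With d = 1 the iteration reduces to finding the principal eigenvector of the normalized matrix, as described for PageRank above.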
Prestige-based sentence centrality
damping factor = 1
Prestige-based sentence centrality
Advantages over Centroid
  It accounts for information subsumption among sentences
  It prevents unnaturally high IDF scores from boosting the score of a sentence that is unrelated to the topic
Experiments on DUC 2004 data
DUC 2004 data was used in our experiments
  Task 2 involves summarization of 50 TDT English clusters
  Task 4 is to produce summaries of machine translation output (in English) of 24 Arabic TDT documents
Recall-based measure ROUGE is adopted, and 665-byte summaries for each cluster are produced
Experiments on DUC 2004 data
MEAD summarization toolkit
  Extractive multi-document summarization
  Consists of three components
    Feature extractor (document -> feature vector): Centroid, Position and Length
    Combiner (feature vector -> scalar value)
    Reranker (the scores are adjusted upward or downward): MMR (Maximal Marginal Relevance), CSIS (Cross-Sentence Information Subsumption)
weight
Threshold
Experiments on DUC 2004 data
Centroid
Thank You!
Q&A
HOME WORK
Read one of the following papers and write a reading report:
  SentTopic-MultiRank: a novel ranking model for multi-document summarization. In COLING’12
  RelationListwise for query-focused multi-document summarization. In COLING’12
  A supervised aggregation framework for multi-document summarization. In COLING’12
  Query-Focused Multidocument Summarization Based on Query-Sensitive Feature Space. In CIKM’12
  Optimized Event Storyline Generation based on Mixture-Event-Aspect Model. In EMNLP’13