AUTOMATED TEXT SUMMARIZATION:
ON BASELINES, QUERY-BIAS AND EVALUATIONS!
By
Rahul Katragadda
200607007
rahul [email protected]
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Master of Science (by Research) in
Computer Science & Engineering
Search and Information Extraction Lab
Language Technologies Research Center
International Institute of Information Technology
Hyderabad, India
December 2009
Copyright © 2009 Rahul Katragadda
All Rights Reserved
Dedicated to all those people, who made me believe I am better than I was.
INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Automated Text Summarization: On Baselines, Query-Bias and Evaluations!” by Rahul Katragadda (200607007), submitted in partial fulfillment for the award of the degree of Master of Science (by Research) in Computer Science & Engineering, has been carried out under my supervision and has not been submitted elsewhere for a degree.
Date Advisor:
Dr. Vasudeva Varma
Associate Professor
IIIT Hyderabad
Acknowledgements
I am grateful to my advisor, Dr Vasudeva Varma, for his advice and for his efforts in managing the Search and Information Extraction Lab of the Language Technologies Research Center, where I have had the pleasure to work for the whole duration of my MS by Research studies. I have also been fortunate to get timely advice and quick feedback in many of my initial meetings with Dr Prasad Pingali, who as my mentor taught me what it means to do research. As the administrative assistant for SIEL, Mr Mahender Kumar has saved me a great amount of paperwork; Mr Babji tirelessly took care of our system requirements; and Mr Bhupal motivated me through the years.
I warmly acknowledge all the people that have had — and would definitely continue to have — a significant impact on the way I work. Among these are Prof Karen Sparck Jones, Prof Kathy McKeown and Dr Ani Nenkova, among others. I would also like to take this opportunity to thank Dr Charles L A Clarke for those numerous interactions that brought clarity to the subject matter of this thesis. Thanks, Charlie; with access to your expertise and your belief in my work, I could move on from a rejection at ICON-08 to an acceptance at ACL-09. I owe my deepest gratitude to Prof Petri Myllymaki and Dr Wray Buntine for their continuous guidance on and off research during my internship at the CoSCo group, University of Helsinki. Thanks to the members of the CoSCo group, who showed me that research can, and should, be fun.
You are known by the company you keep and I have been fortunate to have
some great colleagues at SIEL, especially my lunch group comprising Sowmya,
Suman, Praneeth, Swathi and Santosh, who provided me comfort at all times and
chipped in with their critical review when I badly needed one. I would like to
make a special mention of Suman Kumar Chalawadi for those countless midnight discussions on query-focus, query-bias and summarization evaluations, when he
had no clue of any word I uttered and yet kept staring at the board and listening to
my words. I must acknowledge Mahesh Mohan, Sai Sathyanarayan, Samar Hussain
and Romanch Agarwal for those countless, endless discussions on all aspects of
life. I am also hugely indebted to my batchmates Ravitej and Sai Mahesh, and my
colleagues Sowmya and Swathi for constantly keeping track of my progress.
Three and a half years is a long time, and I have been blessed with some of the
best men (and women) of IIIT’s sports fraternity who helped me through my ups
and downs. I would like to thank Mr Kamalakar and the Physical Education Center
for providing me all the sports facilities that complemented my academic life. I
thank Abhilash, Ashwath and Satish for some of the most memorable moments —
those medals lying in my room and beyond — I would cherish all my life. Also
important was Mr Sitaram and family at the coffee shop who provided us with
continuous source of ‘double’ TEA. I am delighted to acknowledge the staff of
the G4S securities, especially Mehdi Hasan, Raghu, Laxman and others who have
eased our lives and provided a secure campus. I also have high regards to all the
members of house-keeping staff who took care of room no 348, Old Boys Hostel
(OBH) where I stayed for 3 full years.
Many anonymous reviewers of my papers have shaped my thinking and writing habits thus far, and the manuscript of this thesis was reviewed by Prof Kamal Karlapalem and Dr Kishore S Prahallad, who put great effort into reviewing this work; I thank them all for their time and efforts. I also thank Praveen Bysani and Vijay Bharat for their patience and understanding of my way of research, especially Praveen, for whom I had to grow to reach his expectations of a mentor.
I am extremely grateful to my parents Rajyasree and Ravi Kumar, to my sister
Roshna, to my cousin Namratha and my friend Geetha for their continuous support
and discrete (and continuous) inquiries about the progress of my studies.
And to all those who kept on asking “When will you graduate?”, this is it. :)
Synopsis
With the advent of the Internet/WWW and the proliferation of textual news, rich media, social networking, et cetera, there has been an unprecedented surge in content on the web. With this huge amount of information available on the world wide web, there is a pressing need for ‘Information Access’ systems that would help users with an information need by providing the relevant information in a concise, pertinent format. There are various modes of Information Access, including Information Retrieval, Text Mining, Machine Translation, Text Categorization, Text Summarization, et cetera. In this thesis, we study certain aspects of “Text Summarization” as a technology for ‘Information Access’.
The majority of recent work in the area of Text Summarization addresses sub-problems like Ranked Sentence Ordering, Generic Single Document Summarization, Generic Multi-Document Summarization, Query-Focused Multi-Document Summarization and Query-Focused Update Summarization. In this thesis we focus on
two tasks, specifically “Query-Focused Multi-Document Summarization (QFMDS)”
and “Query-Focused Update Summarization (QFUS)”. The QFMDS task has generated a lot of community interest, for the obvious reason of being close to a real-world application of open-domain question answering. Given a set of N relevant
documents on a general topic and a specific query (or information need) within the
topic, the task is to generate a relevant and fluent 250 word summary. The focus
of this thesis lies around four issues dealing with query-focused multi-document
summarization. They are:
1. Impact of query-bias on summarization
2. Simple and strong baselines for update summarization
3. Language modeling extension to update summarization
4. An automated intrinsic content evaluation measure
In this thesis we identify two dependent yet different terms ‘query-bias’ and
‘query-focus’ with respect to the QFMDS task and show that most of the auto-
mated summarization systems are trying to be query-biased rather than being query-
focused. In the context of this problem, we show evidence from multiple sources
to display the inherent bias introduced by the automated systems. First, we theoretically explain how a naïve classifier-based summarizer can be enhanced greatly
by biasing the algorithm to use just the query-biased sentences. Second, on an
‘information nugget’ based evaluation data, we show that most of the participat-
ing systems were query-biased. Third, we further build formal generative models,
namely binomial and multinomial models, to model the likelihood of a system be-
ing query-biased. Such a modeling revealed a high positive correlation between
‘a system being query-biased’ and ‘automated evaluation score’. Our results also
underscore the difference between human and automated summaries. We show that
when asked to produce query-focused summaries, humans do not rely to the same extent on the repetition of query terms.
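The query-bias notion discussed above can be made concrete with a toy measurement: the fraction of summary sentences that contain at least one query term. The sketch below is a minimal illustration under assumed sentence-splitting and tokenization rules, not the exact procedure used in this thesis.

```python
import re

def query_biased_fraction(summary, query_terms):
    """Fraction of summary sentences sharing at least one term with the query.

    A sentence that repeats a query term is treated as 'query-biased'.
    """
    # Naive split on sentence-final punctuation (a simplifying assumption).
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', summary.strip()) if s]
    terms = {t.lower() for t in query_terms}
    biased = sum(1 for s in sentences
                 if terms & set(re.findall(r"[a-z']+", s.lower())))
    return biased / len(sentences) if sentences else 0.0

summary = ("The court ruled on the appeal. Observers praised the decision. "
           "The appeal had been pending for years.")
print(query_biased_fraction(summary, ["appeal", "ruling"]))  # 2 of 3 sentences
```

A heavily query-biased system would score high on such a fraction, while human-written query-focused summaries would be expected to score lower.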
The QFUS task is a natural extension of the QFMDS task. In the update summarization task, a series of news stories (a news stream) on a particular topic is tracked over a period of time and made available for summarization. Two or more sets of
documents are available: an initial set, followed by multiple updated sets of news
stories. The problem is to generate query-focused multi-document summaries for
the initial set and then produce update summaries for the updated sets assuming the
user has already read the preceding sets of documents. This is a relatively new task
and our work in this context is twofold. First, we define a sentence-position-based baseline summarizer, which is genre dependent. In the context of the new task, we argue that the current baseline in the update summarization task is a poor performer in content evaluations and hence cannot help in tracking the progress of the field. A stronger baseline, such as one based on “sentence position”, would be robust and able to help track that progress. Second, we describe a generic extension
to language modeling based approaches to tailor them towards the update summa-
rization task. We show that a simple Context Adjustment (CA) based on ‘stream dynamics’ can help generate better summaries when compared to the base language modeling approach.
In Text Summarization, as in other Information Access methodologies,
evaluation is a crucial component of the system development process. In language
technologies research, Automated Evaluation is often viewed as supplementary to
the task itself since knowing how to evaluate would lead to knowing how to perform
the task. In the case of summarization, knowing how best to evaluate summaries
would help in knowing how best to summarize. Usually, manual evaluations are
used to evaluate summaries and compare performance of different systems against
each other. However, these manual evaluations are time consuming and difficult to repeat, and hence infeasible at scale. Keeping this view of text summarization and evaluation, we propose an evaluation framework based on the generative modeling paradigm.
We describe a ‘multinomial model’ built on the distribution of key-terms in docu-
ment collections and how they are captured in peers. We show that such a simple
paradigm can be used to create evaluation measures that are comparable to the
state-of-the-art evaluations used at DUC and TAC. In particular, we observed that
in comparison with other metrics, our approach has been very good at capturing “overall responsiveness”, apart from pyramid-based manual scores.
Contents
Table of Contents x
List of Tables xiii
List of Figures xiv
1 Introduction 1
1.1 Introduction to Text Summarization . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Human Abstraction or Professional Summarizing . . . . . . . . . . 2
1.2 Different Categories of Summarization . . . . . . . . . . . . . . . . . . . . 4
1.3 Approaches to Automated Text Summarization . . . . . . . . . . . . . . . 6
1.4 Summarization Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.1 Evaluation of Content and Readability . . . . . . . . . . . . . . . . 8
1.5 Focused Summarization Evaluation Workshops . . . . . . . . . . . . . . . 10
1.5.1 pre-DUC era . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5.2 Document Understanding Conferences (DUC) . . . . . . . . . . . 11
1.5.3 Text Analysis Conferences (TAC) . . . . . . . . . . . . . . . . . . 12
1.6 Objective and Scope of the Research . . . . . . . . . . . . . . . . . . . . . 12
1.7 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2 Problems Addressed 16
2.1 Understanding query-bias in summarization systems . . . . . . . . . . . . 16
2.2 Sentence Position Baseline for Update Summarization . . . . . . . . . . . 19
2.3 Language Modeling approaches for Update Summarization . . . . . . . . . 21
2.4 Alternative Automated Summarization Evaluations . . . . . . . . . . . . . 22

3 Related Work 24
3.1 Approaches to Text Summarization . . . . . . . . . . . . . . . . . . . . . . 24
3.1.1 Heuristic Approaches to Text Summarization . . . . . . . . . . . . 24
3.1.2 Language Modeling Approaches . . . . . . . . . . . . . . . . . . . 26
3.1.3 Linguistic structure or discourse based summarization . . . . . . . 27
3.1.4 Machine Learning Approaches . . . . . . . . . . . . . . . . . . . . 28
3.2 Summarization Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Evaluation of Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.1 ROUGE [Lin, 2004b] . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.2 Pyramid Evaluation [Nenkova et al., 2007] . . . . . . . . . . . . . 34
3.3.3 Other content evaluation measures . . . . . . . . . . . . . . . . . . 37
3.4 Evaluation of Readability . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.1 Manual Evaluation of Readability . . . . . . . . . . . . . . . . . . 38
3.4.2 Automated Evaluation of Readability . . . . . . . . . . . . . . . . 39
4 Impact of Query-Bias on Text Summarization 41
4.1 Introduction to Query-Bias vs Query-Focus . . . . . . . . . . . . . . . . . 43
4.1.1 Query-biased content in human summaries . . . . . . . . . . . . . 44
4.2 Theoretical Justification on query-bias affecting summarization performance 45
4.2.1 Equi-probable summarization setting . . . . . . . . . . . . . . . . 47
4.2.2 Query-biased Equi-probable summarization setting . . . . . . . . . 47
4.2.3 Performance of Equi-probable summarization . . . . . . . . . . . . 48
4.2.4 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Performance of participating systems from DUC 2007 . . . . . . . . . . . 50
4.4 Query-Bias in Summary Content Units (SCUs) . . . . . . . . . . . . . . . 52
4.5 Formalizing Query-Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5.1 The Binomial Model . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5.2 The Multinomial Model . . . . . . . . . . . . . . . . . . . . . . . 57
4.5.3 Correlation of ROUGE metrics with likelihood of query-bias . . . . 58
4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.7 Conclusive Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.8 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5 Baselines for Update Summarization 63
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1.1 Sentence Position . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.1.2 Introduction to the Position Hypothesis . . . . . . . . . . . . . . . 68
5.2 Sub-Optimal Sentence Position Policy (SPP) . . . . . . . . . . . . . . . . . 68
5.2.1 Sentence Position Yield and Optimal Position Policy (OPP) . . . . 68
5.2.2 Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2.3 Pyramid Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3 A Summarization Algorithm based on SPP . . . . . . . . . . . . . . . . . . 72
5.3.1 Query-Focused Multi-Document Summarization . . . . . . . . . . 73
5.3.2 Update Summarization Task . . . . . . . . . . . . . . . . . . . . . 73
5.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4 Baselines in Summarization Tasks . . . . . . . . . . . . . . . . . . . . . . 76
5.5 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 77
6 A Language Modeling Extension for Update Summarization 79
6.1 Update Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.2 Language Modeling approach to IR and Summarization . . . . . . . . . . . 81
6.2.1 Probabilistic Hyperspace Analogue to Language (PHAL) . . . . . . 81
6.3 Language Modeling Extension . . . . . . . . . . . . . . . . . . . . . . . . 82
6.3.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3.2 Signature Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.3.3 Context Adjustment . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

7 Alternative (Automated) Summarization Evaluations 88
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.2 Current Summarization Evaluations . . . . . . . . . . . . . . . . . . . . . 89
7.3 Automated Content Evaluations . . . . . . . . . . . . . . . . . . . . . . . 91
7.4 Generative Modeling of Reference Summaries . . . . . . . . . . . . . . . . 92
7.4.1 Binomial Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.4.2 Multinomial Model . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.5 Signature Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.5.1 Query Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.5.2 Model Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.5.3 POS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.6 Experiments and Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.7.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.8 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

8 Conclusions 104
8.1 Contributions of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 104
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
8.3 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.4 Unpublished Manuscripts . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

Bibliography 111
List of Tables
4.1 Percentage of query-biased content in document collections . . . . . . . . 45
4.2 Percentage of query-biased content in model summaries . . . . . . . . . . 45
4.3 Ratio of Query-bias densities . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4 ROUGE scores with confidence intervals for the equi-probable summarizer . 49
4.5 Systems that performed well in content overlap metrics . . . . . . . . . . . 51
4.6 Systems that performed well based on Linguistic Quality Evaluations . . . 51
4.7 Systems that did not perform based on content overlap metrics . . . . . . . 52
4.8 Statistics on Query-Biased Sentences . . . . . . . . . . . . . . . . . . . . . 54
4.9 Rank based on ROUGE-2, Avg likelihood of emitting query-biased sentence, ROUGE-2 scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.10 Rank, Avg likelihood of emitting a query-biased sentence, ROUGE-2 scores 58
4.11 Correlation of ROUGE measures with log-likelihood scores for automated systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.12 Correlation of ROUGE measures with log-likelihood scores for humans . . 59

5.1 Performance of SPP algorithm at QFMDS task . . . . . . . . . . . . . . . 74
5.2 Cluster-wise Performance of SPP algorithm in Update Summarization task 75
5.3 Performance of SPP algorithm based on various metrics . . . . . . . . . . . 75

6.1 Impact of Context Adjustment on PHAL . . . . . . . . . . . . . . . . . . . 87

7.1 Cluster A Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.2 Cluster B Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
List of Figures
4.1 The process and structure of Pyramid Annotations and source mappings . . 53
4.2 SCU annotation of a source document . . . . . . . . . . . . . . . . . . . . 53
4.3 Distribution of relevant and irrelevant sentences in query-biased corpus . . 55
4.4 Impact of query-bias in the classification view of Text Summarization Process 60

5.1 An example of Highlights of a news story . . . . . . . . . . . . . . . . . . 66
5.2 Impact of location in identifying highlights . . . . . . . . . . . . . . . . . 67
5.3 A sample mapping of SCU annotation to source document sentences. An excerpt from mapping of topic D0701A of DUC 2007 QF-MDS task . . . . 70
5.4 Sentence Position Yield for small documents . . . . . . . . . . . . . . . . 71
5.5 Sentence Position Yield for large documents . . . . . . . . . . . . . . . . . 72

7.1 Sample output generated by an evaluation metric . . . . . . . . . . . . . . 93
Chapter 1
Introduction
With the advent of the internet and the proliferation of news and social networking related activity, online content in the form of the general web, newswire, social networking content, etc. seems to have no bounds. For every information need an online user might have, the world wide web (WWW) provides too much information. In this context it is often desirable to have a summary of the related documents to get a brief overview of the document collection. There are several scenarios:
1. A user enters the information need and the search engine provides the search results
along with a summary snippet for each document.
2. A user enters the information need and the search engine provides the search results
along with a summary of all the documents.
3. A user enters the information need and the search engine provides the search results
along with a summary of all the documents specifically marked by the user as being
relevant.
4. A user enters the information need and the search engine provides the search results
in clusters, with related results in the same cluster. Associated with each of these
clusters is a summary of all the documents within the cluster. And so on.
CHAPTER 1. INTRODUCTION
In the following sections we introduce the notion of automated text summarization,
briefly discuss the various approaches taken to solve the problem, and introduce summa-
rization evaluation; all of which would be further discussed in deeper details in Chapter 3.
1.1 Introduction to Text Summarization
“Text Summarization” research began in bits and pieces, in various forms, ever since [Luhn, 1958] described it as an informal idea. It was later formally developed by [Edmundson, 1969] among others, while [Jones, 1993, Jones, 1998, Sparck Jones, 2007] gave much-needed clarity to the direction in which summarization research was to flow. Psychological studies on how humans perform the task of text summarization helped the understanding of the process involved [Kieras, 1985]. Since the beginning of the TIPSTER-SUMMAC program, followed by DUC and now TAC, and equally supported by other workshops on summarization, a lot of effort has gone into creating focused problem domains and solving the partial problems that have the global goal of automated text summarization.
1.1.1 Human Abstraction or Professional Summarizing
The field of automatic summarization is fortunate in that there are still human experts who
carry out summarization as part of their professional life. These are professional abstrac-
tors, who are skilled in the art of constructing summaries. Their employers are usually
abstracting services and information publishers, in particular, producers of bibliographic
databases.
We could gain valuable insights for automated summarization by studying these expert abstractors and the way they carry out their summarization activities. The insights gained would be valuable wherever we put ourselves on the continuum from partially to fully automated summarization.
The stages of abstracting Different scholars have come up with different decompositions of the abstracting process. Cremmins [Cremmins, 1982] decomposed the process of abstracting into four ‘approximate stages’: “Focusing on the basic features”, “Identifying the information”, “Extracting, organizing, and reducing the relevant information” and “Refining the relevant information”. Pinto Molina [Molina, 1995] also suggested four stages: interpretation, selection, reinterpretation and synthesis. Here, interpretation involves reading and understanding the document; selection refers to the selection of pertinent information given the users’ needs; reinterpretation is the process of interpreting the pertinent information to ascertain facts; and synthesis refers to the process of generating the output abstract from the pertinent information.
The most detailed work on studying human abstractors comes from Endres-Niggemeyer
[Endres-Niggemeyer, 1998], who carried out an empirical study of the verbal protocols and
behavior of six abstractors. Based on her findings, she described the human summarization process as a three-stage process:
• Document Exploration. An initial exploration of the source documents to identify their genre and style, such that an appropriate scheme can be selected based on the abstractor’s prior knowledge of document types and their information structure.
• Relevance Assessment. Important and relevant information is aggregated to construct
what she calls a theme, which is a structured mental representation of what the doc-
ument is about.
• Summary Production. Theme and scheme are used to produce a final summary after applying certain cutting and pasting operations. Professional abstractors do not usually invent anything; they follow the original author as closely as possible and reintegrate the most important points of a document by drawing from a pool of standard sentence patterns accumulated over years of experience in writing abstracts.
Expert practice in automated summarization systems It is interesting to note that
many of these aspects of expert practice are used in automated summarization systems,
although most of these systems are confined to extracts rather than abstracts. However,
we do not find the entire set of strategies in a single summarizer. Some summarizers focus on leveraging specific shallow features, such as cue-phrase occurrence, location features, etc., as in [Luhn, 1958, Edmundson, 1969]; others use a discourse-level representation, as in [Marcu, 1999b]. Still others focus on cut-and-paste operations to construct summaries and to edit and revise them, as in [Jing, 2001]. It is easy to see a mapping between these different methods as part of an overall strategy as devised by Endres-Niggemeyer. For instance, what features to use and which information to use usually depends on the genre and style of the document(s), which represents the Document Exploration stage. The use of either shallow features or stronger language modeling (or any relevance ranking measure) would fall into the Relevance Assessment stage. And finally, any cut-and-paste-like operations would fall into Summary Production.
1.2 Different Categories of Summarization
The goal of Summarization has been “to extract informative content from an information source and present the most important content (possibly with added context) to the user in a condensed form and in a manner sensitive to the user’s or application’s need”. Summarization has been categorized based on various attributes; the following are a few simple examples, categorized based on the media being summarized:
• Text Summarization. Summarization of textual media in the form of digital text is referred to as “Text Summarization”. There have been over five decades of research in this area, thanks to a number of reasonably complex tasks designed at focused workshops such as the TIPSTER Text Summarization Evaluation Conference (TIPSTER-SUMMAC), the Document Understanding Conferences (DUC), etc. This thesis revolves around the “Text Summarization” problem, and the abundant literature in this
area is explained elsewhere in this thesis.
• Video/Multimedia Summarization. Text is just one form of communication medium. There is a huge amount of information available in the form of speech, images and videos that could be leveraged to generate meaningful summaries. If the source input constitutes any one of these multimedia components, that leads us to create mono-medium summaries, e.g. speech summaries or video summaries. On the other hand, if multiple types of “content formats” together form the corpora, then it leads towards generating multimedia summaries. Usually, videos do not stand by themselves and are augmented by speech, text (in transcripts), tags and so on. Since a video summarization system has these other attributes it can make use of to generate summaries, it is also called multimedia summarization.
• Speech Summarization. In recent years the amount of multimedia data available has increased rapidly, especially due to the increase of broadcasting channels and the availability of cheap and efficient mass storage. In this era of information explosion, there is a great need for systems that can distill this huge amount of data automatically with less complexity and time. Speech summarization is very useful in a wide variety of applications: in the case of broadcast news, it serves the purpose of summarizing the important contents of a show; meeting summarization helps individuals not present at a meeting to know the key issues discussed and the important decisions taken in it; and summarization of voice mail or voice messages saves an individual the time of listening to all messages.
• Opinion Summarization. “What do people think about ?”: this is perhaps the question that best describes the scope of the problem. In a world in which millions of people write their opinions about any issue in blogs, news sites, review sites or social media, the distillation of knowledge from this huge amount of unstructured information is a challenging task. Sentiment Analysis and Opinion Mining are two areas related to Natural Language Processing and Text Mining that deal with the identification of opinions and attitudes in natural language texts. Recently, Opinion Mining has received huge interest in the information systems and language technologies communities. Recent efforts in the opinion mining community have seen the emergence of focused evaluations such as the International Conference on Weblogs and Social Media (ICWSM), the Text Analysis Conference (TAC) opinion-summarization/opinion-QA task, the Workshop on Mining User Generated Content and the Workshop on Opinion Mining and Sentiment Analysis (WOMSA), among others.
1.3 Approaches to Automated Text Summarization
With the advent of strong evaluation forums such as TIPSTER-SUMMAC, DUC and now TAC, summarization research has seen a huge gain in the number of different approaches being taken, from simple heuristics-based approaches [Luhn, 1958, Katragadda et al., 2009] to language modeling based approaches [J et al., 2005, Maheedhar Kolla, 2007] to linguistic-structure-based ones, such as those based on lexical chains and discourse connectives [Li et al., 2007, Marcu, 2000]. A more detailed description of the various methods employed for automated text summarization is given in Section 3.1, but here we briefly touch on some of the general approaches available.
Heuristic Approaches to Text Summarization From the simple term-frequency-based approaches of Luhn’s early work [Luhn, 1958] to date, the simplest of heuristics often surpass the more complex and convincing theories of “importance in a discourse”: for example, Inverse Document Frequency [Jones, 1972], Document Frequency [Bysani et al., 2009], etc. Characteristics of the genre are also usually strong indicators of important text. This has been shown in [Edmundson, 1969, Lin and Hovy, 1997], and even in today’s context such an approach generates a strong baseline, as we will see later in this thesis. Some of these approaches are described in Chapters 3 and 5.
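As a rough sketch of such a frequency heuristic, the following ranks sentences by the summed frequency of their content terms, in the spirit of Luhn's term-frequency scoring (the stopword list and the scoring function are simplifying assumptions, not Luhn's exact method):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was", "for"}

def top_sentences(document, k=1):
    """Rank sentences by the summed document frequency of their content terms."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', document.strip()) if s]
    freq = Counter(w for w in re.findall(r"[a-z']+", document.lower())
                   if w not in STOPWORDS)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower())
                   if w not in STOPWORDS)

    return sorted(sentences, key=score, reverse=True)[:k]

doc = ("Summarization is hard. Summarization research grew steadily. "
       "Cats sleep all day.")
print(top_sentences(doc))  # the sentence richest in frequent content terms
```

Despite its simplicity, this kind of scorer remains a competitive starting point against far more elaborate systems.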
Language Modeling Approaches Language modeling based approaches have a
long history in information retrieval research and are a notable success story there. Language
modeling is one type (among many) of probabilistic modeling technique [Hiemstra, 2009].
Language models were applied to information retrieval by a number of researchers in the late
1990s [Ponte, 1998, Hiemstra, 1998, Miller et al., 1999]. They originate from the probabilistic
models of language generation developed for automatic speech recognition systems.
For information retrieval, a language model is built for each document. Under this
approach, the language model of this thesis would assign an exceptionally high probability
to the word "summarization", indicating that this thesis would be a good candidate for re-
trieval if the query contains this word. That is, in language modeling we seek the probability
of a query given a document.
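The query-likelihood idea can be sketched as follows. This is an illustrative toy with add-one (Laplace) smoothing, not the model of any particular system discussed in this thesis; real retrieval models typically use more refined smoothing such as Dirichlet or Jelinek-Mercer.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc_tokens, vocab_size, alpha=1.0):
    """log P(query | document) under a unigram document language
    model with add-alpha (Laplace) smoothing."""
    tf = Counter(doc_tokens)
    n = len(doc_tokens)
    return sum(
        math.log((tf[w] + alpha) / (n + alpha * vocab_size))
        for w in query
    )

docs = {
    "d1": "summarization systems compress text text".split(),
    "d2": "retrieval systems rank documents".split(),
}
vocab = {w for toks in docs.values() for w in toks}
query = "summarization text".split()
# Rank documents by how likely their language model is to generate the query.
ranked = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], len(vocab)),
    reverse=True,
)
```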
In the case of text summarization, a considerable body of literature builds on language
modeling. In particular, [Lawrie, 2003] used language models to identify content-bearing
words, using relative entropy (KL divergence) to compare the language model of a topic set
against that of a generic English corpus. Earlier, [Berger and Mittal, 2000] generated
summaries by creating a language model of the document, selecting terms that should oc-
cur in a summary, and then combining the terms using a trigram language model to gen-
erate readable summaries. More recently, substantial work on update summarization has been
reported by [Schiffman, 2007, Maheedhar Kolla, 2007, Bysani et al., 2009]. A relevance based
language modeling approach to text summarization was applied in [Jagarlamudi, 2006]
and is described in Chapters 3 and 6.
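Lawrie's idea of comparing a topic language model against a background model can be illustrated with a small sketch. The per-word KL contributions below are a simplified stand-in for her actual formulation: words whose topic-model probability far exceeds their background probability are treated as content-bearing.

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram language model over a fixed vocabulary."""
    tf = Counter(tokens)
    n = len(tokens)
    return {w: (tf[w] + alpha) / (n + alpha * len(vocab)) for w in vocab}

def kl_contributions(topic_lm, background_lm):
    """Per-word contribution to KL(topic || background); words with
    large positive contributions are treated as content-bearing."""
    return {
        w: p * math.log(p / background_lm[w])
        for w, p in topic_lm.items()
    }

topic = "flood relief flood damage relief workers".split()
background = "the economy the market flood report".split()
vocab = set(topic) | set(background)
contrib = kl_contributions(
    unigram_lm(topic, vocab), unigram_lm(background, vocab)
)
top_word = max(contrib, key=contrib.get)
```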
Linguistic structure or discourse based summarization There have been elaborate
theories of discourse structure based on centering theory and Rhetorical Structure Theory
(RST), and there have been attempts to capitalize on them to build summarization systems.
In particular, Marcu's experiments [Marcu, 1999b] indicated that discourse trees are good
indicators of importance in text. [Silber and McCoy, 2000, Barzilay and Elhadad, 1997,
Li et al., 2007] used lexical chains to combine surface features with discourse-like features
to generate readable summaries.
Machine Learning Approaches Recent advances in machine learning have been
adapted to the summarization problem through the years, and locational features have been
consistently used to identify the salience of a sentence. Representative work on 'learning'
sentence extraction includes training a binary classifier [Kupiec et al., 1995], train-
ing a Markov model [Conroy et al., 2004], training a CRF [Shen et al., 2007], and learning
a pairwise ranking of sentences [Toutanova et al., 2007].
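A minimal sketch in the spirit of such feature-based sentence extraction is given below. The features and the hand-set weights are purely illustrative; in [Kupiec et al., 1995] the corresponding weights are learned from labelled document/extract pairs.

```python
def sentence_features(sentence, position, doc_len,
                      cue_phrases=("in summary", "in conclusion")):
    """Features in the spirit of [Kupiec et al., 1995]: locational,
    length and cue-phrase indicators for one sentence."""
    words = sentence.lower().split()
    return {
        "lead": 1.0 if position == 0 else 0.0,            # first sentence
        "position": 1.0 - position / max(doc_len - 1, 1),  # earlier is better
        "long": 1.0 if len(words) > 5 else 0.0,            # length cut-off
        "cue": 1.0 if any(c in sentence.lower() for c in cue_phrases) else 0.0,
    }

def score(features, weights):
    return sum(weights[k] * v for k, v in features.items())

# Hand-set, purely illustrative weights; a real system learns these.
weights = {"lead": 2.0, "position": 1.0, "long": 0.5, "cue": 1.5}
doc = [
    "Storms battered the coast overnight causing floods",
    "Power was lost in several districts",
    "In conclusion officials promised a full review of defences",
]
scores = [
    score(sentence_features(s, i, len(doc)), weights)
    for i, s in enumerate(doc)
]
```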
1.4 Summarization Evaluation
Evaluation offers many advantages to automatic summarization. As has been the case with
other language understanding technologies, it can foster the creation of reusable resources
and infrastructure; it creates an environment for comparison and replication of results; and
it introduces an element of competition to produce better results [Hirschman and Mani, 2001].
Summarization Evaluation, like Machine Translation (MT) evaluation (or any other
Natural Language Processing systems’ evaluation), can be broadly classified into two cate-
gories [Jones and Galliers, 1996]. The first, an intrinsic evaluation, tests the summarization
system in itself. The second, an extrinsic evaluation, tests the summarization system based
on how it affects the completion of some other task. In the past, intrinsic evaluations have
mainly assessed the informativeness and coherence of summaries, while extrinsic
evaluations have been used to test the impact of summarization on tasks like reading com-
prehension, relevance assessment, etc.
1.4.1 Evaluation of Content and Readability
Summarization evaluation is categorized by the two main characteristics of a summary: its
content and its form. Evaluations of content are called 'Content Evaluations' and evaluations
of form are called 'Readability/Fluency Evaluations'. There are various manual evaluation
metrics for Content Evaluation, including Likert-scale rating of summaries, Pyramid
evaluation, and Content Responsiveness. For Readability/Fluency Evaluation there
is only the Likert-scale rating of summaries by human experts, which stands as the reference
judgment. Another human-evaluated metric, Overall Responsiveness, combines the
evaluation of both content and form. All of these are described in greater detail in
Section 3.2.
Automated Content Evaluation Since the manual methods of evaluation are time-
consuming, difficult to perform and require human expertise, and are hence expensive and
non-repeatable, there is a pressing need to develop automated counterparts that can
act as surrogates for manual evaluation while being repeatable and inexpensive.
Since the mid-1990s there has been a great deal of work on automated content evaluation, and
useful evaluation tools such as ROUGE [Lin, 2004b] and Basic Elements [Hovy et al., 2006]
have been developed and successfully tested over the years. Manual and automated content
evaluation measures are discussed in depth in Section 3.2. Soon after the DUC era,
as the TAC programme began, a task on automated evaluation called "Automated Evalu-
ation of Summaries of Peers (AESOP)" was introduced at TAC 2009. The AESOP task is
briefly described in Section 3.2, and a detailed description is provided in Chapter 7,
where we discuss our approach to automated summarization evaluation.
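To give a concrete feel for such reference-based metrics, here is a simplified bigram-recall computation in the style of ROUGE-N. It is a sketch, not the official ROUGE implementation, which adds stemming, stopword options and multi-reference jackknifing.

```python
from collections import Counter

def rouge_n(peer, references, n=2):
    """N-gram recall of a peer summary against reference summaries,
    with clipped counts, in the style of ROUGE-N [Lin, 2004b]."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    peer_grams = ngrams(peer)
    matched = total = 0
    for ref in references:
        ref_grams = ngrams(ref)
        total += sum(ref_grams.values())
        # Clip each reference n-gram's credit by its count in the peer.
        matched += sum(min(c, peer_grams[g]) for g, c in ref_grams.items())
    return matched / total if total else 0.0

refs = [
    "the dam burst after heavy rain",
    "heavy rain caused the dam to burst",
]
peer = "heavy rain made the dam burst"
score = rouge_n(peer, refs, n=2)  # 5 of the 11 reference bigrams matched
```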
Automated Readability Evaluation For the Readability/Fluency aspects of sum-
mary evaluation there has been little dedicated research on summaries. Discourse-
level constraints on adjacent sentences, indicative of coherence and good text flow, have been
investigated relatively well [Lapata, 2003, Lapata and Barzilay, 2005, Barzilay and Lapata, 2008].
In many applications, as with "Overall Responsiveness" for text summaries, fluency is as-
sessed in combination with other qualities. In machine translation, approaches
such as BLEU [Papineni et al., 2002] use n-gram overlap with a reference to judge the "overall
goodness" of a translation. Related work in Natural Language Generation (NLG)
[Wan et al., 2005, Mutton et al., 2007] directly set a goal of sentence-level fluency, regard-
less of content. [Chae and Nenkova, 2009] performed a systematic study
of how syntactic features can distinguish machine-generated translations from
human translations. Another related work, [Pitler and Nenkova, 2008], investigated the im-
pact of certain linguistic surface features, syntactic features, entity-coherence features and
discourse features on the readability of the Wall Street Journal (WSJ) corpus.
Automated readability assessment is described further in Section 3.2.
1.5 Focused Summarization Evaluation Workshops
1.5.1 pre-DUC era
Before truly large-scale summarization evaluation took place in the form of the Document
Understanding Conferences (DUC), there was no major forum or community where com-
mon evaluation benchmarks could be set and focused tasks could be accomplished.
TIPSTER-SUMMAC In May 1998, the U.S. government completed the TIPSTER Text
Summarization Evaluation (SUMMAC), the first large-scale, developer-independent
evaluation of automatic text summarization systems. Two main extrinsic evaluation tasks
were defined, based on activities typically carried out by information analysts in the U.S.
Government. In the ad-hoc task, the focus was on indicative summaries tailored to a
particular topic. In the categorization task, the evaluation sought to find out
whether a generic summary could effectively present enough information to allow an an-
alyst to quickly and correctly categorize a document. The final, question-answering task
involved an intrinsic evaluation where a topic-related summary for a document was eval-
uated in terms of its "informativeness", namely, the degree to which it contained answers
found in the source document to a set of topic-related questions. SUMMAC established
definitively, in a large-scale evaluation, that automatic text summarization is very effective
in relevance assessment tasks. Summaries at relatively low compression rates (17% for ad-
hoc, 10% for categorization) allowed for relevance assessment almost as accurate as with
full-text (5% degradation in F-score for ad-hoc and 14% degradation for categorization,
both degradations not being statistically significant), while reducing decision-making time
by 40% (categorization) and 50% (ad-hoc).
1.5.2 Document Understanding Conferences (DUC)
There was much interest and activity in the late 1990s aimed at building powerful multi-
purpose information systems. The governmental agencies involved included DARPA, ARDA
and NIST. Their programmes, for example DARPA's TIDES (Translingual Information
Detection Extraction and Summarization) programme, ARDA's Advanced Question & An-
swering Program and NIST's TREC (Text Retrieval Conferences) programme, covered a range
of subprogrammes, each focusing on different tasks requiring their own evaluation designs.
Within TIDES, and among other researchers interested in document understanding, a
group formed that has been focusing on summarization and the evaluation of summa-
rization systems. Part of the initial evaluation for TIDES called for a workshop, held in
the fall of 2000, to explore different ways of summarizing a common set of documents. Ad-
ditionally, a road-mapping effort was started in March 2000 to lay plans for a long-term
evaluation effort in summarization [Harman and Over, 2002].
Out of the initial workshop and the road-mapping effort grew a continuing eval-
uation in the area of text summarization called the Document Understanding Conferences
(DUC, http://duc.nist.gov/). Sponsored by the Advanced Research and Development
Activity (ARDA), the conference series was run by the National Institute of Standards and
Technology (NIST) to further progress in summarization and enable researchers to participate
in large-scale experiments. DUC ran from 2001 to 2007, at which point TAC took over.
1.5.3 Text Analysis Conferences (TAC)
There has been a growing recognition of the importance of community-wide evaluations for
research in information technologies. The Text Analysis Conference (TAC,
http://www.nist.gov/tac/) is a series of workshops that provides the infrastructure for
large-scale evaluation of Natural Language Processing technology. TAC began where DUC
ended; it continues the work of the DUC and TREC-QA communities, while also bringing in
the Recognizing Textual Entailment (RTE) community, for the collaborative betterment of
the evaluation of NLP technologies.
In recent years, at the Document Understanding Conferences, text summarization re-
search evolved through task-focused evaluations ranging from 'generic single-document
summarization' to 'query-focused multi-document summarization (QFMDS)'. The QFMDS
task models the real-world complex question answering task wherein, given a topic and a
set of 25 relevant documents, the task is to synthesize a fluent, well-organized 250-word
summary of the documents that answers the question(s) in the topic statement. Recent fo-
cus in the community has been on the query-focused update summarization task at DUC
and TAC. The update task is to produce short (~100-word) multi-document update
summaries of newswire articles under the assumption that the user has already read a set of
earlier articles. The purpose of each update summary is to inform the reader of new
information about a particular topic.
1.6 Objective and Scope of the Research
In this work we characterize the role of two well-known 'linguistic features', sentence
position and query-bias, in the task of automated text summarization. We show that
top-performing summarization systems have induced 'query-bias', picking only those
sentences that contain at least one query term, while trying to improve their performance
on query-focused multi-document
summarization tasks. Later we use another feature, 'sentence position', to show that
state-of-the-art systems are unable to perform any better than a simple baseline system.
We also propose changing the baselines used for the current short-summary task
of update summarization. Later we describe an extension to text summarization systems
based on the language modeling framework, adapting them to the update summarization task.
We show that it is possible to apply context adjustment techniques based on signature
terms to improve the performance of a PHAL based language modeling system. Finally, we
use the generative modeling paradigm to create alternative evaluation metrics comparable
to the state-of-the-art automated evaluation technologies for text summarization, namely
ROUGE and Basic Elements.
1.7 Organization of the Thesis
The main focus of this thesis is spread across several pertinent issues related to query-
focused multi-document summarization. In this thesis we aim to identify the role of
query-bias and sentence position in query-focused multi-document summarization,
provide a language modeling based extension for update summarization, and finally pro-
pose a framework for automated evaluation of summaries. The rest of the thesis is orga-
nized as follows:
Chapter 2 introduces our research goals within the purview of this thesis, elaborating on the
context in which the importance of the problem is seen. Each section in that chapter de-
scribes one of the four problems we address in this thesis and how we approach a
meaningful solution.
Chapter 3 describes the relevant literature in the context of this thesis. In this chapter
we first discuss some of the related approaches to text summarization algorithms, mainly in
the context of query-focused multi-document summarization. We categorize the related
approaches under the following four heads: Heuristic approaches, Language Modeling
approaches, Linguistic Structure or Discourse based approaches and Machine Learning
approaches. Later in the chapter, we describe procedures for summarization evaluation.
We describe in detail some of the manual and automated evaluation methods for evaluation
of content of a summary. At the end of the chapter, we describe the approaches taken to
manual evaluation of readability/fluency of summaries and the efforts taken to automate
the process.
Chapter 4 describes a problem in the context of query-focused multi-document summa-
rization research. We characterize the reliance of various automated summarization algo-
rithms on query-term occurrence to define the relevance of a sentence towards a query. In a
theoretical setting we show that classifier-like summarization algorithms can improve their
performance by biasing towards query terms. Then, in multiple practical settings, we show
that automated systems are indeed biased towards sentences containing query terms.
Chapter 5 explores the position hypothesis based on the relevant literature, discussing
work both for and against the hypothesis. We then describe a sub-optimal position pol-
icy derived from an interesting new dataset that facilitates identification of the relevance of a
subset of sentences. We apply the position policy thus derived to build an algorithm that gen-
erates summaries, and show that such an algorithm performs better at generating short
summaries. Later in the chapter we describe the baselines used in summarization tasks and
argue that a position based baseline is indeed a better baseline for the update summarization
task.
Chapter 6 describes a simple framework for adapting language modeling based approaches
to text summarization to the update summarization task. First we introduce the update summa-
rization problem in the context of a news stream. We describe the language modeling
approaches taken in IR and summarization and follow up with the description of proba-
bilistic hyperspace analogue to language (PHAL). In the next sections we describe how we
extend PHAL using context adjustment based on signature terms.
Chapter 7 explores the area of summarization evaluation in general and the need for
consistent, robust automated evaluation systems. In the sections that follow we describe the
current approaches to text summarization evaluation and the need for various automated
evaluation systems. Later we describe a generative modeling based formalism to evaluate
summaries based on their sentence-level likelihood of including signature terms. Finally,
we discuss two approaches to generating signature terms, to validate the applicability of the
framework.
Finally, Chapter 8 concludes this thesis by summarizing the work done and expanding
upon the contributions of this thesis. This chapter also provides a detailed account of
foreseeable future work. At the end of the chapter we list the related publications that
have emanated from this work.
Chapter 2
Problems Addressed
The goal of this thesis is to identify a few important problems pertinent to automated text
summarization and provide solutions to them. Following are the issues that we deal with
in this thesis:
1. We examine the role of query-bias in query-focused multi-document summarization.
2. We examine the role of sentence position in identifying genre specific relevant con-
tent.
3. We examine how a language modeling based summarization system can be adapted
to handle the update summarization problem.
4. We build multiple alternative automated evaluation systems based on a generative
modeling paradigm by modeling the occurrence of signature terms.
This research focuses on these four major problems, which are discussed in detail below:
2.1 Understanding query-bias in summarization systems
From 2005 until 2007, a query-focused multi-document summarization task was con-
ducted as part of the annual Document Understanding Conference. This task models a real-
world complex question answering scenario, where systems need to synthesize, from a set
of 25 documents, a brief (250-word), well-organized, fluent answer to an information need.
Query-focused summarization is a topic of ongoing importance within the summarization
and question answering communities. Most of the work in this area has been conducted
under the guise of “query-focused multi-document summarization”, “descriptive question
answering”, or even “complex question answering”.
One of the issues studied since the inception of automatic summarization is that of hu-
man agreement: different people choose different content for their summaries [Rath et al., 1961,
van Halteren and Teufel, 2003, Nenkova et al., 2007]. Even the same person may not be
able to produce the same summary at a later time [Rath et al., 1961]. Humans vary in the
material they choose to include in a summary and in how they express the content. Their judg-
ments of summary quality vary from one person to another and across time for a single
person [Harman and Over, 2004]. Later, it was assumed [Dang, 2005] that having a ques-
tion/query to provide focus would improve agreement between any two human-authored
model summaries, as well as between a model summary and an automated summary. This
agreement in content is at the heart of the approaches taken for the automated summa-
rization evaluation techniques [Lin, 2004b, Nenkova et al., 2007]. It has been noted that
these reference based content evaluations would be more robust if multiple gold-standard
summaries were used [Lin, 2004a, van Halteren and Teufel, 2003], and there have been ex-
clusive studies on how many references are required to obtain stable evaluation results
[Lin, 2004a, Nenkova et al., 2007].
In trying to further understand the process of text summarization, the steering commit-
tee of DUC decided to constrain the summarization process based on two major parameters
that could produce summaries with widely different content: query-focus and granularity.
Having a question/query to focus the summary was intended to improve the agreement
between the model summaries. Additionally, for DUC 2005, the NIST assessor who de-
veloped each topic also specified the desired granularity (level of generalization) of the
summary. Granularity was a way to express one type of user preference; one user might
17
CHAPTER 2. PROBLEMS ADDRESSED
want a “general” background or overview summary, while another user might want “spe-
cific” details that would allow him to answer questions about specific events or situations.
This parameter, “granularity”, was withdrawn from further DUCs since NIST assessors
found that the size of the summary plays a much bigger role in determining what informa-
tion to include, than a granularity specification. Almost all NIST assessors tried to write
their summaries according to the granularity requested, but some “specific” summaries
ended up being very general given the large amount of information and small space al-
lowance. Despite this, all the NIST assessors appreciated the theory behind the granularity
specification.
There has been a plethora of research on query-focused multi-document summarization
from the summarization and question answering communities. New approaches to the
problem crop up every year under the competitive aegis of DUC, and most automated sum-
marization systems are optimized on ROUGE or some other evaluation metric. The task
of query-focused multi-document summarization seeks to improve agreement in content
among human-authored model summaries. Query-focus also aids automated summariz-
ers in directing the summary at specific topics, which may result in better agreement with
these model summaries. However, while query focus correlates with performance, we show
that high-performing automatic systems produce summaries with disproportionately higher
query-term occurrence than do human summarizers. Experimental evidence suggests that
automatic systems heavily rely on query-term occurrence and repetition to achieve good
performance.
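The query-term occurrence measurements behind this claim can be sketched as a simple density statistic: the fraction of summary sentences containing at least one query term. The tokenization below is deliberately naive and purely illustrative.

```python
def query_biased_density(sentences, query_terms):
    """Fraction of sentences containing at least one query term."""
    q = {t.lower() for t in query_terms}
    biased = sum(
        1 for s in sentences
        # Naive tokenization: lowercase and strip trailing punctuation.
        if q & {w.lower().strip(".,") for w in s.split()}
    )
    return biased / len(sentences) if sentences else 0.0

query = ["election", "results"]
summary = [
    "Election results were delayed.",
    "Observers questioned the results.",
    "Turnout was high.",
]
density = query_biased_density(summary, query)  # 2 of 3 sentences are biased
```

Comparing this density for source documents, human model summaries and system summaries is what exposes the disproportionate bias of automatic systems.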
In Chapter 4, based on a corpus study, we show that query-biased sentence density differs
between document collections and human summaries. We further the argument by
analyzing text summarization as a naïve classification problem, showing theoretically that
a simple sentence-picking algorithm can perform better when picking sentences from
query-biased sentences only. In Section 4.4 we use Summary Content Units (SCUs) to
show that most of the participating systems are influenced by query-bias. Later, in Sec-
tion 4.5, we build formal generative models capturing the relationship between the presence
of (or amount of) query-bias and ROUGE-2 scores. We also speculate that while the bino-
mial model captures the overall influence of query-bias and shows that most humans have a
similar strategy informed by query-bias, the multinomial model clearly distinguishes systems
that are heavily biased from those that are not, using the granularity of query-bias. In the
end we mathematically explain how certain top-performing systems have induced query-
bias while trying to improve their performance on query-focused multi-document
summarization tasks.
2.2 Sentence Position Baseline for Update Summarization
The position hypothesis states that the importance of a sentence can be based on its ordinal po-
sition. For instance, Baxendale [Baxendale, 1958] found that in 85% of paragraphs the
first sentence was a 'topic sentence'. A study of expository prose showed [Dolan, 1980]
that only 13% of professional writers start with topic sentences, and [Singer and Dolan, 1980]
maintain that the main idea of a text might appear anywhere in the paragraph or not be
stated at all. Arriving at a negative conclusion, [Paijmans, 1994] found that words with
higher informative content do not cluster in first or last sentences. In contrast, in psycho-
logical studies, Kieras confirmed [Kieras, 1985] the importance of the position of a mention
within a text.
The position of a sentence in a document, or of a word in a sentence, can be an indicator
of the importance of that sentence or word in certain genres. Such features are called locational
features, and the sentence position feature deals with the presence of key sentences at specific
locations in the text. Sentence position has been well studied in summarization research
since its inception, as early as Edmundson's work [Edmundson, 1969], and has had a great in-
fluence on our understanding of genre-specific characteristics. Later studies have shown
[Lin and Hovy, 1997] that for genres such as news, a position based feature
performs very well and can capture major thematic words in a discourse.
Throughout the literature as summarization research followed trends from generic single-
document summarization, to generic multi-document summarization, to focused multi-
document summarization, two major baselines persisted throughout the evalu-
ations. These two baselines are:
1. First N words of the document (or of the most recent document).
2. First sentence from each document, in chronological order, until the length requirement is
reached.
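The two baselines can be sketched as follows, assuming plain-text documents supplied in chronological order; splitting sentences on periods is an illustrative simplification.

```python
def baseline_first_n_words(documents, n=100):
    """Baseline 1: first N words of the most recent document."""
    return documents[-1].split()[:n]

def baseline_lead_sentences(documents, n=100):
    """Baseline 2: first sentence of each document, in chronological
    order, until the length budget is reached."""
    summary = []
    for doc in documents:  # assumed to be in chronological order
        summary.extend(doc.split(".")[0].split())
        if len(summary) >= n:
            break
    return summary[:n]

docs = [
    "Old news first sentence. More detail.",
    "Newer first sentence. Even more detail.",
]
lead = baseline_lead_sentences(docs, n=8)
```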
The first baseline performs poorly in content evaluations on all manual and automatic
metrics; however, since it doesn't disturb the original flow and ordering of a document,
these summaries are linguistically the best. The second baseline has not been beaten
by automated systems on content evaluation, but the linguistic aspects of summary
quality are compromised in such a summary, which is usually very poor on readability.
In Chapter 5, we describe a sentence position based summarizer built on a
sentence position policy created from the evaluation testbed of recent summarization tasks
at the Document Understanding Conferences (DUC). We show that the summarizer thus built is
able to outperform most comparable systems. Based on an interesting corpus explained in
Section 4.4, we derive a Sub-optimal Sentence Position Policy (SPP), inspired in most parts
by the work on the "Optimal Position Policy" of Lin and Hovy [Lin and Hovy, 1997].
We further use the SPP to generate multi-document summaries for the QF-MDS task and the
update summarization task. Our experiments also show that such a method performs
better at producing short summaries (up to 100 words) than longer ones. Based on certain
studies in the literature, we speculate that the success of SPP in the update summarization
task is due to the nature of the text collections and genre. Later, in Section 5.4, we argue that a
position based baseline works better for all short-summary tasks and hence should be
used as a baseline for the update summarization task. Such a stronger
content based baseline would allow evaluation of progress on the problem over the
years.
2.3 Language Modeling approaches for Update Summarization
The key to "update summarization" is a real-world setting where a user needs to keep track
of a hot topic continuously, at random intervals of time. There is a lot of activity on the hot
topic, and many documents are generated within a short span. The user cannot deal with
this proliferation of information and hence requires a summarization engine that generates
a very targeted, informative summary. The user then has access to the information in
the form of a summary, and if he needs to know more he can consult the source
documents. After using the summarization engine for a while, say he takes a break (it's
Christmas!) and then comes back for a summary of the recent activity on the topic.
A normal summarizer would just generate a summary of the recent documents and present
it to him. The idea of update summarization, however, is to filter out information that has
already appeared in previous articles, whether or not it was presented to the user as part
of a previous summary. In effect, it adds another layer of redundancy checking to avoid
repeated information.
The updates on the topic need to filter out redundant information while preserving the
informativeness of the content. The task of update summarization thus has two components:
a normal query-focused multi-document summarization for the first cluster (of documents)
on the topic, and an update summary generation procedure which also produces
query-focused summaries, under the assumption that the user has already gone through the previ-
ous document cluster(s). In the current work, we approach the problem under a sentence-
extractive summarization paradigm, using an existing language modeling framework. Here,
we see the "update summary generation" task as a language model smoothing problem.

Language modeling based approaches have been used frequently for text summarization.
Recently, relevance based language modeling approaches have been applied
[Jagarlamudi, 2006] to the query-focused multi-document summarization
task. In Chapter 6, we describe a novel framework for update summarization based on
the language modeling paradigm. We choose to employ context adjustment techniques on
language models of the novel clusters, biased by the previous clusters. We show that the
resulting adjusted language models perform better than the plain language model in the case of
pHAL, a probabilistic version of the Hyperspace Analogue to Language.
2.4 Alternative Automated Summarization Evaluations
Evaluation is a critical component in the area of automatic summarization; it is used both
to rank participant systems in shared tasks, such as the summarization tracks
at TAC 2008 and 2009 and their DUC predecessors, and by developers whose goal is to im-
prove their summarization systems. Summarization evaluation can foster the creation of
reusable resources and infrastructure; it creates an environment for comparison and repli-
cation of results; and it introduces an element of competition to produce better results
[Hirschman and Mani, 2001]. However, manual evaluation of the large number of documents
necessary for a relatively unbiased view is often infeasible, especially since multiple eval-
uations are needed over time to track incremental improvement in systems. Therefore, there
is an urgent need for reliable automatic metrics that can perform evaluation in a fast and
consistent manner.
Summarization evaluation can be broadly classified into two categories: intrinsic and
extrinsic. Intrinsic evaluations measure the quality of the created automated summary
directly, and require some reference against which to judge summarization quality.
Intrinsic evaluations have taken two major forms: manual, in
which one or more people evaluate the system-produced summary, and automatic, in which
the summary is evaluated without a human in the loop. The content or informativeness
of a summary has been evaluated with various manual metrics. Earlier, NIST assessors
rated each summary on a 5-point scale from "very poor" to "very good". Since 2006,
NIST has used the Pyramid framework to measure content re-
sponsiveness. In the pyramid method, as explained in Section 3.2, assessors first extract all
possible “information nuggets” or Summary Content Units (SCUs) from human-produced
model summaries on a given topic. Each SCU has a weight associated with it based on the
number of model summaries in which this information appears. The final score of a peer
summary is based on the recall of nuggets in the peer.
All forms of manual assessment are time-consuming, expensive and not repeatable,
whether scoring summaries on a Likert scale or evaluating peers against "nugget pyramids"
as in the pyramid method. Such assessment does not help system developers, who
would ideally like a fast, reliable and, most importantly, automated evaluation metric
that can be used to track incremental improvements in their systems. So despite
the strong manual evaluation criteria for informativeness, time-tested automated methods
such as ROUGE and Basic Elements (BE) have been regularly tested for their correlation
with manual evaluation metrics like 'modified pyramid score', 'content responsiveness'
and 'overall responsiveness' of a summary. Each of these metrics is described in
Section 3.2. The creation and testing of automatic evaluation metrics is therefore
an important research avenue; the goal is to create automated evaluation metrics that
correlate very highly with these manual metrics.
In Chapter 7, we motivate the need for alternative summarization evaluation
systems for both content and readability. In the context of the TAC AESOP (Automatically
Evaluating Summaries Of Peers) task, we describe the problem with content evaluation
metrics and how a good metric must behave. We describe how a well-known generative
model can be used to create automated evaluation systems comparable to the state-of-the-
art. Our method is based on a multinomial distribution of key terms (or signature
terms) in document collections, and the likelihood that they are captured in peers.
Chapter 3
Related Work
In this chapter, we survey related work in the area of text summarization. In what fol-
lows, we mainly concentrate on four major categories of approaches to text summarization,
namely Heuristic Approaches, Language Modeling Approaches, Linguistic Structure or
Discourse based Approaches and Machine Learning Approaches. We also describe summa-
rization evaluation methodologies, from manual evaluation to automated evaluation tech-
niques for content and form.
3.1 Approaches to Text Summarization
Heuristics and linguistic cues were tried out early in Information Retrieval
research, while the late 90s saw a surge of language modeling for Information Retrieval.
Almost all of these techniques have since been tried for text summarization. The
relevant related approaches are described below.
3.1.1 Heuristic Approaches to Text Summarization
Heuristics from linguistic analysis have been of great help in automated text summariza-
tion. Analysis of various linguistic phenomena in source text has led to interesting
experimental conclusions for text summarization. Term frequency has played a crucial
role, from its application to summarization as a heuristic in [Luhn, 1958] to the more formal,
mathematical treatment in [Nenkova et al., 2006], which showed that a simple un-
igram language model can be used to generate state-of-the-art summaries.
Inverse Document Frequency (idf)
Inverse Document Frequency [Jones, 1972] has been used repeatedly in IR and text sum-
marization since its introduction as a heuristic for Information Retrieval. The inverse doc-
ument frequency is a measure of the general importance of a term, obtained by dividing
the number of all documents by the number of documents containing the term, and then
taking the logarithm of that quotient. This is now a standard feature in more sophisticated
summarization systems [Radev et al., 2004, Daume III and Marcu, 2004, Shen et al., 2007,
Toutanova et al., 2007].
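As an illustration, the idf of a term over a small collection can be computed as follows. This is a minimal sketch: the example documents and the zero fallback for unseen terms are assumptions for illustration, not details from the cited work.

```python
import math

def idf(term, documents):
    """log(N / df): total documents over documents containing the term."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0

# Hypothetical three-document collection, each document a set of tokens.
docs = [{"police", "killed", "the", "gunman"},
        {"the", "gunman", "escaped"},
        {"a", "quiet", "day"}]
```

Here "the" occurs in two of the three documents and so receives a lower idf than "police", which occurs in only one.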
Document Frequency (df)
Document Frequency is another relatively simple feature that performs very strongly in
multi-document summarization and can be seen as an inverse of Inverse Docu-
ment Frequency (IDF). While conventional wisdom advises the use of IDF, [Schilder and Kondadadi, 2008]
used document frequency to some advantage in query-focused multi-document summarization
tasks. More recently, [Bysani et al., 2009] used Document Frequency to illustrate
the power of such a feature for the update summarization task at TAC 2008 and 2009.
Though the usage of IDF is supported by conventional wisdom, this author favors Doc-
ument Frequency, which seems more applicable here for the following reasons. IDF was
designed for settings where term specificity had to be addressed statistically over a huge
corpus of unrelated documents, as in Information Retrieval. In text summarization, however,
retrieval has already been performed and a set of related documents is available a priori.
It seems intuitive that, within a set of related documents, a term should be redundant across
documents as well as within each document in order to qualify as a topic-
specific term.
Sentence Position (sp)
Sentence position has been extensively studied since its introduction to summarization by
[Edmundson, 1969]. Earlier, [Baxendale, 1958] gave a straightforward definition of po-
sition-based importance: "title plus first and last sentence of paragraph are important".
Later, [Lin and Hovy, 1997] empirically characterized position as a genre-depen-
dent feature and derived a position policy: an ordering of sentence positions by priority of importance.
Further research on the sentence position feature is described in detail in Chapter 5.
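A position policy in this sense can be sketched as a ranked list of preferred sentence positions. The reciprocal-rank scoring and the example policy below are illustrative assumptions, not the scheme of [Lin and Hovy, 1997]:

```python
def position_score(sent_index, policy):
    """Score a sentence by the rank of its position in a genre-specific
    position policy (an ordered list of preferred positions)."""
    if sent_index in policy:
        return 1.0 / (policy.index(sent_index) + 1)
    return 0.0

# Hypothetical newswire-style policy: lead sentences matter most.
newswire_policy = [0, 1, 2]
```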
3.1.2 Language Modeling Approaches
Language modeling based approaches have been prominent in the text summarization litera-
ture since the early work of Luhn [Luhn, 1958], who used term frequencies as a
heuristic to indicate the importance of terms for summaries. Though this early work was
heuristic, it paved the way for more sophisticated, mathematically driven language mod-
els. [Allan et al., 2001] describe a language modeling based approach to temporal
summaries, in which language models capture novelty and usefulness.
[Jagarlamudi, 2006] has shown how the relevance based language modeling paradigm can
be applied to automated text summarization, specifically for the query-focused multi-
document summarization task. Recently, [Nenkova et al., 2006] showed reasonable
success in multi-document text summarization using 'just' unigram language models. In
Chapter 6 we use the Probabilistic Hyperspace Analogue to Language (PHAL) described in
[Jagarlamudi, 2006] as the language modeling mechanism and extend it to the update
summarization task.
Lawrie [Lawrie, 2003] defines summarization in terms of "probabilistic language mod-
els" and uses this definition to explore techniques for automatically generating topic
hierarchies. She used language models to define two concepts for each word, 'topicality' and 'predictive-
ness', which capture the topic-orientedness of a word and the existence of sub-topic
hierarchies under it.
Chandan Kumar [Kumar, 2009] describes an "information loss" based framework for
'generic multi-document summarization' in the language modeling paradigm. He
treats summarization as a decision-making process: each sentence is scored by comparing
two language models (a document model and a world model) via relative entropy, which
enables the generation of informative summaries. The framework is based on minimizing
the Bayesian risk of losing an informative sentence.
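The relative-entropy comparison at the core of such frameworks can be sketched as a KL divergence between two smoothed unigram models. The additive smoothing and the toy models below are assumptions for illustration, not the exact formulation of [Kumar, 2009]:

```python
import math
from collections import Counter

def kl_divergence(p_counts, q_counts, eps=1e-3):
    """KL(P || Q) between two unigram models given as term-count Counters,
    with additive smoothing over the union vocabulary."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + eps * len(vocab)
    q_total = sum(q_counts.values()) + eps * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (p_counts[w] + eps) / p_total
        q = (q_counts[w] + eps) / q_total
        kl += p * math.log(p / q)
    return kl

# Hypothetical document and world models.
doc_model = Counter("airbus deal airbus order".split())
world_model = Counter("stocks fell on news of the deal".split())
```

The divergence is zero when the two models agree and grows as they diverge, which is the property exploited when ranking sentences by how document-specific their language is.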
In the context of update summarization, a lot of work has taken shape using language
modeling. First, [Maheedhar Kolla, 2007] described a cluster based language model that
uses background modeling to efficiently generate update summaries; two meth-
ods of background modeling were used, based on "documents in the previous cluster" and on the
"summary of the previous cluster". Second, in related work, [Bysani et al., 2009] coined the
term "Novelty Factor", based on the ratio of the distribution of a word in the current
cluster to that in previous clusters. This can be seen as an application of language modeling
at the document level.
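The document-level intuition behind such a factor can be sketched as follows. The exact normalization used by [Bysani et al., 2009] may differ; treat this as an illustrative ratio only, with hypothetical clusters:

```python
def novelty_factor(term, current_cluster, previous_cluster):
    """Share of a term's document frequency that comes from the current
    cluster; high values flag terms that are new to the update."""
    df_cur = sum(1 for doc in current_cluster if term in doc)
    df_prev = sum(1 for doc in previous_cluster if term in doc)
    total = df_cur + df_prev
    return df_cur / total if total else 0.0

# Hypothetical clusters of tokenized documents.
earlier = [{"quake", "struck", "city"}, {"quake", "damage"}]
update = [{"rescue", "teams", "arrive"}, {"rescue", "quake"}]
```

A term like "rescue", absent from the earlier cluster, gets the maximum score of 1.0, while a term carried over from the earlier cluster scores lower.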
3.1.3 Linguistic structure or discourse based summarization
Lexical Chains
The notion of cohesion, introduced in [Halliday and Hasan, 1976], captures part of the in-
tuition behind lexical chains. Cohesion is a device for "sticking together" different parts of a text,
achieved through the use of semantically related terms, co-reference, ellipsis and conjunc-
tions. Among the various cohesion-building devices, 'lexical cohesion' is the most easily
identifiable and most frequent type, and it can be a very important source for the 'flow' of
informative content.
There is a close connection between discourse structure and cohesion. Related words
tend to co-occur within a discourse unit of the text, so cohesion is one of the surface
indicators of discourse structure, and 'lexical chains' can be used to identify it. The first
computational model for lexical chains was presented in [Morris and Hirst, 1991], which
used Roget's Thesaurus as its knowledge base. All later models used some knowledge
source or other, such as WordNet or dictionaries. [Barzilay and Elhadad, 1997] observed
certain limitations in earlier approaches to lexical chains. A major issue was
"greedy disambiguation", which results from greedy sense selection. [Barzilay, 1997]
addressed this issue with a less greedy algorithm that constructs all possible in-
terpretations of the source text using lexical chains, selects the interpretations
with the strongest cohesion, and then uses these "strong chains" to generate a summary of
the original document. They also demonstrated the usefulness of lexical chains as a
source representation for automated text summarization. Later, [Silber and McCoy, 2000]
presented an O(n) algorithm, linear in the number of nouns present in the source document.
They proposed an alternative to the scoring algorithm presented in [Barzilay, 1997]
and showed that their algorithm is more efficient while still producing summaries of
similar quality.
Discourse
Daniel Marcu led focused research on the applicability of discourse structure to text
summarization [Marcu, 2000]. He showed that discourse trees are good indicators of tex-
tual importance [Marcu, 1999b] and devised a discourse parsing algorithm to identify dis-
course connectives that signal importance. He also showed that incorporating various
heuristics into a discourse-based summarization framework improves its performance.
3.1.4 Machine Learning Approaches
Recent advances in machine learning have been adapted to the summarization problem through
the years, using various features to identify the salience of a sentence. Representative work
in 'learning' sentence extraction includes training a binary classifier [Kupiec et al., 1995],
training a Markov model [Conroy et al., 2004], training a CRF [Shen et al., 2007], and
learning pairwise rankings of sentences [Toutanova et al., 2007].
[Kupiec et al., 1995] use learning to combine various shallow heuristics/features
(cue phrases, location, sentence length, word frequency and title), using a corpus of re-
search papers with manually produced abstracts. [Conroy et al., 2004] applied Hidden
Markov Model training for sentence scoring; their HMM was trained on the
NIST DUC 03 task 5 novelty data. Though HMMs were successfully applied to text
summarization, they cannot fully exploit linguistic features, since they must assume
independence among features for tractability. Unsupervised approaches, on the other hand,
rely on heuristics that are difficult to generalize; hence [Shen et al., 2007] applied a Con-
ditional Random Field to take full advantage of features that may be dependent on each
other.
3.2 Summarization Evaluation
Summarization Evaluation, like Machine Translation (MT) evaluation (or any other NLP
systems’ evaluation), can be broadly classified into two categories [Jones and Galliers, 1996].
The first, an intrinsic evaluation, tests the summarization system in itself. The second, an
extrinsic evaluation, tests the summarization system based on how it affects the completion
of some other task. In the past, intrinsic evaluations have mainly assessed the informativeness
and coherence of summaries, while extrinsic evaluations have been used to test
the impact of summarization on tasks like reading comprehension, relevance assessment,
etc.
Intrinsic Evaluations Intrinsic evaluations are those in which the quality of the automatically
created summary is measured directly. They require some standard or model
against which to judge summarization quality, and this standard is usually operational-
ized by utilizing an existing abstract/text dataset or by having humans create model sum-
maries [Jing et al., 1998]. Intrinsic evaluations have taken two major forms: manual, in
which one or more people evaluate the system-produced summary, and automatic, in which
the summary is evaluated without a human in the loop. Both types involve human
judgments of some sort, and with them their inherent variability.
Extrinsic Evaluation Extrinsic evaluations measure indirectly how well a
summary performs, by measuring performance on a task putatively dependent on the quality
of the summary. They require the selection of an appropriate task that could
use summarization, and measure the effect of using automatic summaries instead of the original
text. Critical issues here are the selection of a sensible real task and of metrics that will be
sensitive to differences in summary quality.
Assessment of Evaluations Overall, the literature on text summarization shows, along
with some definite progress in summarization technology, that automated sum-
mary evaluation is more complex than it originally appeared to be. A simple dichotomy
between intrinsic and extrinsic evaluations is too crude, and by comparison with other Nat-
ural Language Information Processing (NLIP) tasks, evaluation at the intrinsic end of the
range of possibilities is of limited value. The forms of gold-standard quasi-evaluation that
have been thoroughly useful for other tasks like speech transcription or machine transla-
tion, and to some lesser extent for information extraction or question answering, are
less indicative of potential value for summaries than in those cases. At the same time,
even with such apparently fine-grained forms of summarization evaluation as
nugget comparisons, it is difficult, given the often complex systems involved, to attribute particular
performance effects to particular system features or to discriminate among the sys-
tems. All this makes the task of evaluation in context extremely problematic. Such a Catch-22
situation is displayed in [Lin and Hovy, 2003a, Lin and Hovy, 2003b]: they
attribute poor system performance (for extractive summarizing) to disagreement among
human gold standards, concluding that humans ought to agree more; but attempting to specify summarizing
requirements so as to achieve this may be as misconceived as it is impossible. Similar
issues arise with Marcu's development of test corpora from existing source-summary data
[Marcu, 1999a].
3.3 Evaluation of Content
Content evaluation refers to quantifying the informativeness of a summary. In-
formativeness assessment aims at the summary's information content. As a summary of a
source becomes shorter, less information from the source can be preserved in
the summary. Therefore, one measure of informativeness is how much informa-
tion from the source is preserved in the summary; another is how much informa-
tion from a reference summary is covered by the system summary. In the
following subsections we describe the major content evaluation metrics, ROUGE and the
Pyramid method, and give a brief overview of other metrics available as alternatives.
3.3.1 ROUGE [Lin, 2004b]
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes mea-
sures that automatically determine the quality of a summary by comparing it to other (ideal)
summaries created by humans. The measures count the number of overlapping units, such
as n-grams, word sequences, and word pairs, between the computer-generated summary to
be evaluated and the human summaries. The ROUGE package [Lin, 2004b] contains the
following four measures: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S. A short
description of each follows:
ROUGE-N Formally, ROUGE-N is an n-gram recall between a candidate summary and
a set of reference summaries. ROUGE-N is computed as follows:
ROUGE-N = \frac{\sum_{S \in \{ReferenceSummaries\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{ReferenceSummaries\}} \sum_{gram_n \in S} Count(gram_n)}    (3.1)
where n stands for the length of the n-gram, gram_n is the n-gram itself, and Count_match(gram_n)
is the maximum number of n-grams co-occurring in the candidate summary and the set of ref-
erence summaries. ROUGE-N is a recall-oriented measure
because the denominator of the equation is the total number of n-grams occur-
ring on the reference summary side.
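Equation 3.1 can be sketched as follows. This minimal version omits the stemming, stopword removal and jackknifing options of the released ROUGE package:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of the n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n):
    """N-gram recall: clipped matches over total reference n-grams (Eq. 3.1)."""
    cand = ngram_counts(candidate, n)
    match = total = 0
    for ref in references:
        ref_grams = ngram_counts(ref, n)
        total += sum(ref_grams.values())
        match += sum(min(count, cand[g]) for g, count in ref_grams.items())
    return match / total if total else 0.0
```

For instance, `rouge_n("police kill the gunman".split(), [["police", "killed", "the", "gunman"]], 1)` yields 0.75, since three of the four reference unigrams are matched.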
ROUGE-L A sequence Z = [z_1, z_2, \ldots, z_k] is a subsequence of another sequence
X = [x_1, x_2, \ldots, x_m] if there exists a strictly increasing sequence [i_1, i_2, \ldots, i_k]
of indices of X such that for all j = 1, 2, \ldots, k we have x_{i_j} = z_j [Cormen et al., 1990].
Given two sequences X and Y, the longest common subsequence (LCS) of X and Y is a
common subsequence of maximum length.
In applying LCS to summarization evaluation, a summary sentence is treated as a
sequence of words. The intuition is that the longer the LCS of two summary sentences
is, the more similar the two summaries are. An LCS-based recall measure estimates the
similarity between two summaries X of length m and Y of length n, where X is a
reference summary sentence and Y is a candidate summary sentence, as follows:
R_{lcs} = \frac{LCS(X, Y)}{m}    (3.2)
One advantage of using LCS is that it does not require consecutive matches, but in-
sequence matches that reflect sentence-level word order, as n-grams do. Another advan-
tage is that it automatically includes the longest in-sequence common n-grams, so no
pre-defined n-gram length is necessary. By awarding credit only to in-sequence unigram
matches, ROUGE-L also captures sentence-level structure in a natural way. However, LCS
has one main disadvantage: it counts only the main in-sequence words, so the
alternative LCSes and shorter sequences are not reflected in the final score.
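The LCS recall of Equation 3.2 can be sketched with the classic dynamic-programming LCS. This is a minimal version; ROUGE-L as released also reports precision and an F-measure:

```python
def lcs_length(x, y):
    """Length of the longest common subsequence, via the standard
    O(m*n) dynamic-programming table."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            if xi == yj:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l_recall(reference, candidate):
    """R_lcs = LCS(X, Y) / m, with m the reference length (Eq. 3.2)."""
    return lcs_length(reference, candidate) / len(reference)
```

With reference "police killed the gunman" and candidate "police kill the gunman", the LCS is [police, the, gunman], giving a recall of 0.75.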
ROUGE-W LCS has many nice properties [Lin, 2004b]; unfortunately, basic LCS
does not differentiate LCSes with different spatial relations
within their embedding sequences. For example, given a reference sequence X and two
candidate sequences Y1 and Y2 as follows:
X : [A, B, C, D, E, F, G]
Y1 : [A, B, C, D, H, I, K]
Y2 : [A, H, B, K, C, I, D]
Y1 and Y2 have the same ROUGE-L score. However, in this case, Y1 should be a
better choice than Y2 because Y1 has consecutive matches. To improve the basic LCS
method, we can simply remember the length of consecutive matches encountered so far in
the regular two-dimensional dynamic programming table used to compute LCS. [Lin, 2004b] calls this
weighted LCS (WLCS).
ROUGE-S A skip-bigram is any pair of words in their sentence order, allowing for ar-
bitrary gaps. Skip-bigram co-occurrence statistics measure the overlap of skip-bigrams
between a candidate translation and a set of reference translations. Consider the following
example sentences:
S1. police killed the gunman
S2. police kill the gunman
S3. the gunman kill police
S4. the gunman police killed
Each sentence has C(4, 2) = 6 skip-bigrams. For example, S1 has the following skip-
bigrams: ("police killed", "police the", "police gunman", "killed the", "killed gunman",
"the gunman").
S2 has three skip-bigram matches with S1 ("police the", "police gunman", "the
gunman"). S3 has one skip-bigram match with S1 ("the gunman"), and S4 has two skip-
bigram matches with S1 ("police killed", "the gunman"). Given translations X of length
m and Y of length n, where X is the reference translation and Y is a candidate transla-
tion, skip-bigram recall and precision are computed as follows:
R_{skip2} = \frac{SKIP2(X, Y)}{C(m, 2)}    (3.3)

P_{skip2} = \frac{SKIP2(X, Y)}{C(n, 2)}    (3.4)
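Equations 3.3 and 3.4 can be sketched directly. This version treats skip-bigrams as a set, which is adequate for sentences without repeated word pairs:

```python
from itertools import combinations
from math import comb

def skip_bigrams(tokens):
    """All in-order word pairs with arbitrary gaps."""
    return set(combinations(tokens, 2))

def rouge_s(reference, candidate):
    """Skip-bigram recall and precision (Eqs. 3.3 and 3.4)."""
    matches = len(skip_bigrams(reference) & skip_bigrams(candidate))
    recall = matches / comb(len(reference), 2)
    precision = matches / comb(len(candidate), 2)
    return recall, precision
```

For S1 and S2 above this gives a recall and precision of 3/6 = 0.5 each.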
ROUGE-SU One potential problem with ROUGE-S is that it gives no credit to
a candidate sentence that has no word pair co-occurring with its
references. For example, the following sentence has a ROUGE-S score of zero:
S5. gunman the killed police
S5 is the exact inverse of S1 and there is no skip-bigram match between them. However, we
would like to differentiate sentences like S5 from sentences that have no single-
word co-occurrence with S1. To achieve this, a simple extension of ROUGE-S called ROUGE-
SUn is employed, where n is the maximum skip distance for a bigram. ROUGE-SU includes all the
skip-bigrams counted by ROUGE-S plus all unigrams, and hence removes the above problem.
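ROUGE-SU's fix can be sketched by adding unigrams to the pool of matching units. This version has no skip-distance cap; ROUGE-SUn would additionally discard pairs with a gap larger than n:

```python
from itertools import combinations

def su_units(tokens):
    """Skip-bigrams (in-order pairs, any gap) plus unigrams."""
    return set(combinations(tokens, 2)) | {(w,) for w in tokens}

def rouge_su_recall(reference, candidate):
    """Fraction of reference SU units matched by the candidate."""
    matching = len(su_units(reference) & su_units(candidate))
    return matching / len(su_units(reference))
```

S5 ("gunman the killed police") now scores 4/10 against S1 rather than zero, since its four unigrams all match even though no skip-bigram does.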
3.3.2 Pyramid Evaluation [Nenkova et al., 2007]
The most common way to evaluate the informativeness of automated summaries is to com-
pare them with human-authored reference summaries. For decades, the task of automatic
summarization was cast as a sentence selection problem: systems were developed
to identify the most important sentences in the input to form a summary. It was
then appropriate to generate human reference summaries by asking people to produce sum-
maries by picking representative sentences. Systems were evaluated using metrics such as
precision and recall [Salton et al., 1997], which measured the extent to which automated
summarizers selected sentences that might be selected by human summarizers. Over
time, several undesirable effects associated with this approach came to light:
1. Human Variation. Content selection is not a deterministic process [Salton et al., 1997,
Marcu, 1997, Mani, 2001]. Different people choose different sentences to include in
a summary, and even the same person can select different sentences at different times
[Rath et al., 1961]. Such observations led to concerns about the use of a sin-
gle reference summary and suggest that multiple human references would provide a
better ground for comparison.
2. Analysis Granularity. The issue of human variation aside, even comparing the
degree of sentence co-selection is not always justified. Even if a system does not
choose exactly the same sentence as the human summarizer, the sentence it
picks might have considerable overlap in content with one or more sentences from
the reference summaries. Thus, partial matches below the level of a sentence need to be
accounted for.
3. Semantic Equivalence. Another issue related to granularity is semantic equivalence.
Especially in newswire summarization (and more so in the multi-document case),
different input sentences can convey the same meaning even if they are worded differently. Humans
would pick only one of the multiple equivalent alternatives to include in the summary,
and a system would be unfairly penalized if it selected one of the other, equally appropriate,
options.
4. Extracts or Abstracts. When humans are asked to write a summary of a text, they
do not normally pick sentences from the input and concatenate them to form the
summary. Instead, they pick the important information and use their own words to
synthesize an informative, readable summary. Thus, the exact matching of sentences
required by precision and recall measures is not at all
feasible. As the field grows and continuously moves towards more non-extractive
summarizers, we clearly need to move to methods that can handle semantic equiv-
alence at varying levels of granularity.
The pyramid method provides a unified way of addressing the issues outlined
above. Its key assumption is the need for multiple human-authored
reference summaries which, taken together, yield a gold standard for system output.
SCUs. SCUs are semantically motivated subsentential units; they are variable in length
but no bigger than a sentential clause. SCUs emerge from the annotation of a collection
of human summaries for the same input. They are identified by noting information that
is repeated across summaries, whether the repetition is as small as a modifier of a noun
or as large as a clause. Sentences corresponding to information that appears in only one
summary are broken down into clauses, each of which is one SCU in the pyramid. A weight
is associated with each SCU, indicating the number of summaries in which it appeared.
Pyramids. Unlike many gold standards, a pyramid represents the opinions of multiple
human summary writers, each of whom has written a model summary for the input set
of documents. A key feature of a pyramid is that it quantitatively represents agreement
among the human summaries: SCUs that appear in more of the human summaries are
weighted more highly, allowing important content (that appears in
many human summaries) to be differentiated from less important content. Such weighting is necessary in sum-
marization evaluation, given that different people choose somewhat different information
when asked to write a summary for the same set of documents. More details on SCUs, the
procedure for identifying them, and the method for scoring a new summary against a pyramid
are given in [Nenkova et al., 2007].
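Once SCUs have been manually annotated, scoring a peer is mechanical. The sketch below computes the original pyramid score, i.e. the observed SCU weight over the maximum weight achievable with the same number of SCUs; the SCU names and weights are hypothetical, and variants such as the modified pyramid score normalize differently:

```python
def pyramid_score(peer_scus, scu_weights):
    """Observed SCU weight divided by the weight of an optimal summary
    containing the same number of SCUs."""
    observed = sum(scu_weights[scu] for scu in peer_scus)
    ideal = sum(sorted(scu_weights.values(), reverse=True)[:len(peer_scus)])
    return observed / ideal if ideal else 0.0

# Hypothetical pyramid built from four model summaries.
weights = {"scu_a": 4, "scu_b": 3, "scu_c": 1, "scu_d": 1}
```

A peer expressing scu_a and scu_c scores (4 + 1) / (4 + 3), about 0.71, since an optimal two-SCU summary would have expressed scu_a and scu_b.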
3.3.3 Other content evaluation measures
Over the last decade there has been tremendous interest in summarization research, and
systems have cropped up from every corner of the world. Thanks to focused evaluations (like
DUC), the summarization community created a clear roadmap for the future of automated
summarization. Evaluation has always been a priority for the community, and from time to
time it has been revived.
The following evaluation techniques, listed in order of appearance, never attained the
prominence of ROUGE or the Pyramid method:
• SEE
• Relative Utility
• Basic Elements [Hovy et al., 2006]
• N-gram graphs
Recently, [Tratz and Hovy, 2008] extended Basic Elements recall based evaluation by
applying a set of transformations to the basic elements. Some research [Ani et al., 2005]
has been devoted to automating the pyramid method. Recent tasks at the focused
evaluations at TAC have targeted the aspect of "Automatically Evaluating Summaries Of
Peers"1. Research on automated summarization evaluation without human mod-
els [Louis and Nenkova, 2009] has also been pursued and is of considerable interest to the community.
3.4 Evaluation of Readability
Readability evaluation refers to quantifying the form of a summary. Read-
ability/fluency assessment aims at the summary's surface form. If a summary is picked
1http://nist.gov/tac/2009/Summarization/index.html
verbatim from a document, there is less chance of distortion, and hence the readability
of the source can be preserved in the summary.
3.4.1 Manual Evaluation of Readability
At the Document Understanding Conferences (DUC) and now at the Text Analysis Confer-
ence (TAC), the readability of summaries is assessed using five linguistic quality questions,
which measure qualities of the summary that do not involve comparison with a refer-
ence summary or topic of focus. The linguistic qualities measured are Grammaticality,
Non-redundancy, Referential clarity, Focus, and Structure and coherence.
Grammaticality A summary should have no datelines, system-internal formatting, capi-
talization errors or obviously ungrammatical sentences (e.g., fragments, missing compo-
nents) that make the text difficult to read.
Non-redundancy There should be no unnecessary repetition in the summary. Unneces-
sary repetition might take the form of whole sentences that are repeated, repeated facts,
or the repeated use of a noun or noun phrase (e.g., "Prasad Pingali") when a pronoun ("he")
would suffice.
Referential clarity It should be easy to identify who or what the pronouns and noun
phrases in the summary refer to. If a person or other entity is mentioned, it should
be clear what their role in the story is. A reference is unclear if an entity is
mentioned but its identity or relation to the story remains unclear.
Focus The summary should have a focus; sentences should only contain information that
is related to the rest of the summary.
Structure and Coherence The summary should be well-structured and well-organized.
The summary should not just be a heap of related information, but should build from sen-
tence to sentence into a coherent body of information about a topic.
3.4.2 Automated Evaluation of Readability
For the readability/fluency aspect of automated summary evaluation there has not been
much dedicated research on summaries. Discourse-level constraints on adjacent
sentences, which are indicative of coherence and good text flow, have been investigated
fairly well [Lapata, 2003, Barzilay and Lapata, 2008]. In many applications, as in
the "overall responsiveness" of text summaries, fluency is assessed in combination with other
qualities. In machine translation, approaches such as BLEU [Papineni et al., 2002] use n-gram overlap
with a reference to judge the "overall goodness" of a translation. With
BLEU, overlaps of higher-order n-grams are meant to capture fluency considerations, while
all the n-gram overlaps together contribute to the translation's "content goodness". By
contrast, some related work in NLG [Wan et al., 2005, Mutton et al., 2007] directly targets
sentence-level fluency regardless of content. [Wan et al., 2005] build upon
the premise that syntactic information from a parser can capture sentence fluency more
robustly than language models, giving a more direct indication of the degree of ungram-
maticality. The idea is extended in [Mutton et al., 2007], where four parsers are used and
artificially generated sentences with varying levels of fluency are evaluated with impressive
success.
Recently, [Chae and Nenkova, 2009] performed a systematic study of how syntactic
features can distinguish machine-generated translations from human translations;
they were also able to distinguish 'well formed' translations from 'low fluency' trans-
lations. In other related work, [Pitler and Nenkova, 2008] investigated the impact of
certain linguistic surface features, syntactic features, entity coherence features and dis-
course features on the readability of the Wall Street Journal (WSJ) corpus. Their investi-
gations revealed that while surface features like the average number of words per sentence
and the average number of characters per word are not good predictors, there are syntac-
tic, semantic, and discourse features that do correlate highly with readability. Further,
[Feng et al., 2009, Feng, 2009] developed a tool for automatically rating the readability of
texts for adult users with intellectual disabilities.
There has been no known work on characterizing the readability/fluency of a
summary directly. Applying the above methods, in particular the syntactic and semantic
features of [Chae and Nenkova, 2009, Pitler and Nenkova, 2008], to create an automated
metric for evaluating summaries would be an interesting area of research.
Chapter 4
Impact of Query-Bias on Text
Summarization
In the context of the Document Understanding Conferences, the task of Query-Focused
Multi-Document Summarization seeks to improve agreement in content among human-
generated model summaries. This agreement is essential to the content evaluation process
as described earlier in Chapter 3. Query-focus was also assumed to aid the automated sum-
marizers in directing the summary at specific topics, which would result in better agreement
of automated summaries with the model summaries. However, while query focus correlates
with performance, we show that high-performing automatic systems produce summaries
with disproportionately higher query term density than human summarizers do. Experimental
evidence suggests that automatic systems rely heavily on query term occurrence and
repetition to achieve good performance.
Human Agreement One of the issues studied since the inception of automatic summa-
rization is that of human agreement: different people choose different content for their sum-
maries [Rath et al., 1961, van Halteren and Teufel, 2003, Nenkova et al., 2007]. Even the
same person may not produce the same summary at a later time [Rath et al., 1961]. Humans
vary in what material they choose to include in a summary and how they express the content,
and their judgments of summary quality vary from one person to another and across
time for a single person [Harman and Over, 2004]. Later, it was assumed [Dang, 2005]
that having a question/query to provide focus would improve agreement between any two
human-generated model summaries, as well as between a model summary and an auto-
mated summary. This agreement in content is at the heart of the approaches taken for the
automated summarization evaluation techniques [Lin, 2004b, Nenkova et al., 2007]. It has
been noted that these reference-based content evaluations would be more robust if multiple
gold-standard summaries were used [Lin, 2004a, van Halteren and Teufel, 2003], and
there have been dedicated studies on how many references are required to obtain stable
evaluation results [Lin, 2004a, Nenkova et al., 2007].
Query-focus and granularity In trying to further understand the process of text summa-
rization, the steering committee of DUC decided to constrain the summarization process
based on two major parameters that could produce summaries with widely different con-
tent: query-focus and granularity. Having a question/query to focus the summary was
intended to improve the agreement between the model summaries. Additionally, for DUC
2005, the NIST assessor who developed each topic also specified the desired granularity
(level of generalization) of the summary. Granularity was a way to express one type of
user preference; one user might want a “general” background or overview summary, while
another user might want “specific” details that would allow him to answer questions about
specific events or situations. This parameter, “granularity”, was withdrawn from later
DUCs since NIST assessors found that the size of the summary plays a much bigger role
than a granularity specification in determining what information to include. Almost all NIST
assessors tried to write their summaries according to the granularity requested, but some
“specific” summaries ended up being very general given the large amount of information
and small space allowance. Despite this, all the NIST assessors appreciated the theory
behind the granularity specification.
Research in query-focused summarization From 2005 until 2007, a query-focused
multi-document summarization task was conducted as part of the annual Document Understanding
Conference. This task models a real-world complex question answering scenario,
where systems need to synthesize, from a set of 25 documents, a brief (250-word),
well-organized, fluent answer to an information need. Query-focused summarization is a topic
of ongoing importance within the summarization and question answering communities.
Most of the work in this area has been conducted under the guise of “query-focused multi-
document summarization”, “descriptive question answering”, or even “complex question
answering”. A recent addition at the Text Analysis Conference (TAC), the Update
Summarization task, is a natural extension of “query-focused multi-document summarization”,
as discussed in Section 1.5.
4.1 Introduction to Query-Bias vs Query-Focus
The term ‘query-bias’, with respect to a sentence, is precisely defined to mean that the
sentence has at least one query term within it. The term ‘query-focus’ is less precisely
defined, but is related to the cognitive task of focusing a summary on the query, which we
assume humans do naturally. In other words, the human generated model summaries (and
other human authored summaries) are assumed to be query-focused.
Query-bias in sentences could be seen as a trivial way of trying to find sentence rele-
vance to the query [Gupta et al., 2007]. We follow the same intuition, and define query-
biased sentences as those sentences that have at least one content word in common with the
query. Content words are all those words that are not stop words; the stop words for our
purposes have been obtained from the Rainbow Classification Toolkit [McCallum, 1996]. In
order to study how query-bias influences human/system content selection choices, we used
the 45 test sets of the query-focused multi-document summarization task from DUC 2007.
For each set, the input for summarization was available, along with four model summaries
for the input, and the summaries produced by all the automatic summarizers that participated
in DUC 2007. Each input set consisted of 25 documents and the summaries were
250 words long.
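The query-bias test described above can be sketched in a few lines of Python. This is an illustrative sketch, not the thesis implementation; in particular, the tiny STOPWORDS set stands in for the full stop word list from the Rainbow Classification Toolkit.

```python
# Sketch of the query-bias test: a sentence is query-biased iff it shares
# at least one content word with the query. STOPWORDS is a tiny
# illustrative stand-in for the Rainbow toolkit's stop word list.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "is", "are", "was",
             "what", "describe"}

def content_words(text):
    """Lowercased, punctuation-stripped tokens of `text` that are not stop words."""
    return {w.strip(".,;:?!").lower() for w in text.split()} - STOPWORDS

def is_query_biased(sentence, query):
    """True iff the sentence has at least one content word in common with the query."""
    return bool(content_words(sentence) & content_words(query))
```

Averaging this predicate over all sentences of a document collection or of a set of model summaries yields the densities P′ and Q′ reported in Tables 4.1 and 4.2.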
4.1.1 Query-biased content in human summaries
While performing corpus studies on the DUC dataset¹, we computed the amount of
query-bias in the document collections and in the gold-standard human summaries (model
summaries). We observed that the amount of query-bias in model summaries was considerably
higher than that in the document collections. Based on this observation, we
began investigations addressing the question: “Does query-bias affect sentence selection in
Multi-Document Summarization?”
In our initial investigations we observed that in the DUC 2007 dataset, around 35% of
the sentences in the source document set were query-biased, as seen in Table 4.1². Similarly,
nearly 58% of the sentences in model summaries were query-biased, as seen in
Table 4.2. After checking for consistency across the DUC datasets of three years (the DUC
2005, 2006 and 2007 Query-Focused Multi-Document Summarization tasks), we confirmed
the relation between the query-biased content in document collections and model summaries.
Tables 4.1 and 4.2 summarize the findings³. As seen in Table 4.3, R, the ratio of P′
(from Table 4.1) to Q′ (from Table 4.2), is small, indicating that model summaries are
denser than document collections in terms of query-bias.
¹The multi-document summarization corpus for DUC 2005, 2006 and 2007.
²All the counts of sentences in Tables 4.1, 4.2, 4.5, 4.6 and 4.7 are averages over all topics.
³The 2005 data had a variable number of documents per topic and also included the granularity aspect in the summarization process; hence, the data is somewhat perturbed, but the overall patterns are the same.
Year’s Dataset Sentences(D) Biased Sentences(Dq) P′ = Dq/D (in %)
2005 935.2 192.06 20.54
2006 703.72 236.14 33.56
2007 548.96 195.76 35.66
Table 4.1 Percentage of query-biased content in document collections
Year’s Dataset Sentences(M ) Biased Sentences(Mq) Q′ = Mq/M (in %)
2005 81.08 42.16 52.00
2006 58.78 30.7 52.23
2007 52.91 30.67 57.96
Table 4.2 Percentage of query-biased content in model summaries
Dataset Ratio, R = P ′/Q′
2005 0.3950
2006 0.6425
2007 0.6152
Table 4.3 Ratio of Query-bias densities
4.2 Theoretical Justification of Query-Bias Affecting Summarization Performance
Text summarization research, in recent years, has mostly been approached as a sentence
extraction problem, where key sentences from the input are extracted and concatenated, in a
meaningful order, to form a summary. Such a formulation is easily captured by a classifier-
like approach, and most of the current and popular approaches to text summarization are
based on classifying a sentence as relevant or irrelevant. There is ample research that has
used a large number of simple features to train a classifier to predict relevance of a sen-
tence; more on such classifier-based approaches are described briefly in Section 3.1. In this
section, we show how any simple classifier-like algorithm’s performance can be improved
by biasing the summarizer towards query-biased sentences.
It has been shown earlier [Gupta et al., 2007] that term frequency or term-likelihood
based summarizers perform better when they use only the query-biased set of sentences for
ranking. Our primary hypothesis is that most (if not all) of the systems that perform well
in the summarization task rely very heavily, knowingly or unknowingly, on query-bias.
Later in this chapter our results also show that when humans summarize, most of them
follow a similar strategy informed by query-bias.
The “query-focused multi-document summarization task” was defined based on
the observation that ‘query-focus’ provides consensus among the model summaries, and
hence reasonable clarity on what an automated text summarizer must produce. It is
important to note that the automated systems try to imitate the same phenomenon when they
bias towards query terms while using shallow approaches to the problem. However,
there have been no studies on how ‘query-bias’ affects human/automatic summarization,
unlike frequency-based measures, which have been well studied since Luhn’s early work
[Luhn, 1958] and were shown to be of importance by [Nenkova et al., 2006]. This
led us to investigate whether there is some relationship between query-focused summarization
and unigram query-bias density.
In the following sections we first describe an Equi-probable Automatic Summarization
algorithm (Section 4.2.1), then we show how the algorithm performs theoretically, when
constrained on query-bias (Section 4.2.2) and finally discuss its practical value in analyzing
the problem at hand. The central idea in showcasing such a summarizer is to show that a
query-bias based summarizer can perform better than a summarizer without query-bias.
That is, theoretically, it is possible to build summarization algorithms whose performance
improves by biasing on query terms.
4.2.1 Equi-probable summarization setting
In this section we describe an ‘equi-probable’ sentence selection algorithm for automatic
text summarization. Since all sentences are equally likely to be part of the summary, such
an algorithm selects the sentences to be included in the summary at random. In such a
scenario, the probability of picking any particular sentence to be included in the summary
is 1/|D|, where |D| is the total number of sentences in the input.
In this setting, which we call the standard setting, the summarization problem is seen
as a sentence classification problem. Given a sentence s, the classifier should be able to
classify it as relevant for inclusion in the summary or not. Let D be the set of sentences in
the document collection, M the set of sentences in the model summaries, and S(x) the
summarization function. Then, the goal of the summarization function S(x) is to generate
a set of sentences that belong to M.
That is,
∀s ∈ S (D) : s ∈ M
i.e., S (D) ⊂ M
Let there be |D| sentences in D, and |M| sentences in M. Assuming that redundancy
is allowed while generating a summary, and that each sentence is equally likely to be
picked, the probability of picking k sentences that contribute towards the model summaries is

P(S(D) ⊂ M) = (|M| / |D|)^k    (4.1)
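Under the redundancy assumption of Equation 4.1, the equi-probable summarizer's chance of drawing k sentences that all belong to M can be checked with a small Monte Carlo sketch. This is illustrative code, not the thesis implementation, and the set sizes passed in are arbitrary.

```python
import random

def equiprobable_hit_prob(D_size, M_size, k, trials=200_000, seed=42):
    """Monte Carlo estimate of P(S(D) subset of M) for the equi-probable
    summarizer: k uniform picks with replacement (redundancy allowed,
    as in Eq. 4.1) must all land inside the model-summary set M."""
    rng = random.Random(seed)
    hits = sum(
        all(rng.randrange(D_size) < M_size for _ in range(k))
        for _ in range(trials)
    )
    return hits / trials

def closed_form(D_size, M_size, k):
    """Eq. 4.1: (|M| / |D|) ** k."""
    return (M_size / D_size) ** k
```

With the DUC 2007 averages (|D| ≈ 549, |M| ≈ 53) and k = 2, both the estimate and the closed form fall below one percent, which is why unguided random extraction performs so poorly.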
4.2.2 Query-biased Equi-probable summarization setting
Assuming we constrain the summarization problem based on query-bias, to create the
constrained setting, the following transformations take place. Let the sets Dq ⊂ D and
Mq ⊂ M be defined as the query-biased components of D and M respectively. Then, the goal
of the summarization function is to generate a set of sentences that belong to Mq. That is,

∀s ∈ S(Dq) : s ∈ Mq

i.e., S(Dq) ⊂ Mq
Also, since model summaries are created based on the explicit guidelines⁴ provided,
it is safe to assume that sentences generated by human assessors are not restructured or
regenerated in a way that deviates too far from the source sentence(s). That is, M ⊂ D and
Mq ⊂ Dq.
Now, if we were to pick the k sentences for the summary from Dq, then the probability of
picking those sentences from Mq is

P(S(Dq) ⊂ Mq) = ( ((|Mq|/|M|) · |M|) / ((|Dq|/|D|) · |D|) )^k    (4.2)

⟹ P(S(Dq) ⊂ Mq) = { (|Mq|/|M|) / (|Dq|/|D|) }^k · { |M|/|D| }^k    (4.3)
(4.3)
The second term in Equation 4.3 is equivalent to Equation 4.1, and hence the probability
of picking the right sentences depends on the first term, that is, λ = { (|Mq|/|M|) / (|Dq|/|D|) }^k.
Here, λ is the k-th power of the ratio of unigram query-bias densities in model summaries
and document collections. If this ratio is greater than 1, then λ > 1 and the probability of
picking the right information (which is the classification accuracy) increases. The ratio is
the exact inverse of the term R that has been empirically calculated in Table 4.3. Based on
those results, we observe that λ > 1. Therefore, the probability of finding the right
information increases as we work on the biased subset of the data. Note that this improved
probability is for the case of the equi-probable sentence selection criterion.
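Using the DUC 2007 average sentence counts from Tables 4.1 and 4.2, the per-pick base of λ can be computed directly. This is a quick numerical check, not part of the thesis code.

```python
def bias_density_ratio(Mq, M, Dq, D):
    """Per-pick base of lambda: (|Mq|/|M|) / (|Dq|/|D|). A value above 1
    means restricting the equi-probable summarizer to query-biased
    sentences raises its chance of emitting model-summary sentences."""
    return (Mq / M) / (Dq / D)

# Average sentence counts for DUC 2007 (Tables 4.1 and 4.2):
ratio = bias_density_ratio(Mq=30.67, M=52.91, Dq=195.76, D=548.96)
# ratio is about 1.63, i.e. the inverse of R = 0.6152 from Table 4.3
```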
4.2.3 Performance of Equi-probable summarization
Theoretically, the summarizer we just described (Section 4.2.1) should perform better when
constrained on query-bias. To measure its practical value we implemented the above
summarizer. We disallowed duplicate sentences in the summary, since redundancy is of no use
in an actual algorithm and was allowed in the previous discussion only to ease understanding.
For evaluation we computed the ROUGE suite of evaluation metrics [Lin, 2004b], and we
report the ROUGE-2 and ROUGE-SU4 metrics, which were the official automatic evaluation
metrics for DUC 2007.

⁴Guidelines for human summary writers are available at http://duc.nist.gov/duc2005/assessor.summarization.instructions.pdf
The ROUGE scores of the equi-probable summarizer in both the standard setting and the
constrained setting are shown in Table 4.4. The scores provided are averages over 1000
runs of each setting, so that the randomness of the algorithm does not significantly affect
the results. There is a significant improvement in both metrics, with a 9.3% improvement
in ROUGE-2 and a 4.3% improvement in ROUGE-SU4 scores. The results of such a naive
summarizer are statistically better than at least 8 and 10 automated systems of DUC 2007,
based on the ROUGE-2 and ROUGE-SU4 metrics respectively.
Setting ROUGE-2 ROUGE-SU4
standard 0.06970 (95%-conf.int. 0.06530 - 0.07364) 0.12692 (95%-conf.int. 0.12279 - 0.13082)
constrained 0.07621 (95%-conf.int. 0.07150 - 0.08101) 0.13230 (95%-conf.int. 0.12779 - 0.13665)
Table 4.4 ROUGE scores with confidence intervals for the equi-probable summarizer.
4.2.4 Observations
We have shown in Section 4.2.3 that an equi-probable summarizer improves its performance
under the constraint of query-bias. Given that the classification and ranking algorithms
currently employed for the summarization task perform considerably better than
equi-probable selection (all but 8-10 systems outperform this algorithm), this bias will
certainly help in obtaining summaries that are closer to the model summaries.
For a better ranking algorithm, the probability of selecting an important sentence is
greater than under equi-probable selection, as the algorithm is better informed about
important sentences. This also means that many of the systems that already perform
reasonably well in sentence selection may not be able to utilize query-bias to further improve
performance. In fact, those systems whose algorithms already rely on query-bias and are
already performing well may not improve their performance by introducing a bias towards
query terms. On the contrary, they could actually worsen their sentence selection, as they
could be losing out on important sentences from (D − Dq) if the system’s precision in
picking unbiased sentences is higher than its precision in picking biased sentences.
4.3 Performance of participating systems from DUC 2007
In this section, we analyze the systems that participated in the DUC 2007 Query-Focused
Multi-Document Summarization Task. We segregated the systems into a few categories
based on their performance in various evaluation metrics. Since we are interested
in the effect of query-bias on content selection, we ended up with the following categories:

1. Systems that performed very well in ROUGE/BE (content overlap metrics), seen in Table 4.5.⁵

2. Systems that performed very well in Linguistic Quality, seen in Table 4.6.

3. Systems that performed poorly in content overlap metrics, seen in Table 4.7.
Discussion Table 4.5 shows the 5 top performing systems and their respective percentages
of query-bias. It is notable that all the systems in this category had a very high percentage
of query-biased sentences.⁶ On the contrary, Table 4.7 shows the exact opposite: systems

⁵Tables 4.5, 4.6, 4.7, 4.9 and 4.10 are ordered by rank of the systems within their respective categories.
⁶System 29 has a comparatively lower P′ since, in their approach, many sentence simplification operations were performed and the source sentences were distorted.
System ID Sentences(S) Biased Sentences(F) P′ = F/S (in %)
15 9.88 8.48 85.83
24 11.6 9.26 79.91
29 19.04 13.46 70.69
4 7.84 7.33 93.49
13 10.11 9.64 95.35
Table 4.5 Systems that performed well in content overlap metrics
System ID Sentences(S) Biased Sentences(F) P′ = F/S (in %)
23 7.44 6.88 92.47
4 7.84 7.33 93.49
14 8.7333 8 91.60
5 7.82 7.24 92.32
17 10.07 9.33 92.65
Table 4.6 Systems that performed well based on Linguistic Quality Evaluations.
that do not perform well on content metrics have very low amounts of query-bias.
This shows that there are observable patterns relating query-bias and informative
content. But there were exceptions too. Some systems (System IDs 5 and 17, for
example) had a good amount of query-bias but did not perform well on content evaluations.
However, these were systems that performed well (apart from the baseline, System ID 1)
on linguistic quality evaluations. This is a major insight, since some systems might
be biasing towards query terms precisely so that they generate coherent summaries.⁷
Indeed, as we see in Table 4.6, all the top performing systems in linguistic quality
evaluations were biased towards query terms. This analysis shows that working at the level

⁷See Focus, Structure and Coherence in Section 3.4.1.
System ID Sentences(S) Biased Sentences(F) P′ = F/S (in %)
16 9.8 4.67 47.65
27 13.33 4.53 33.98
6 8.87 5.11 57.61
10 14.36 8.4 58.50
11 10.29 6.44 62.59
12 9.84 5.62 57.11
Table 4.7 Systems that did not perform well based on content overlap metrics
of top-performing (and worst-performing) systems does not fully resolve the intuition with
which the experiments were conducted; a much deeper analysis based on theoretically
sound methods is required to assess whether all systems are indeed query-biased.
4.4 Query-Bias in Summary Content Units (SCUs)
Summary content units, referred to as SCUs hereafter, are semantically motivated subsentential
units that are variable in length but no bigger than a sentential clause. SCUs are
constructed from annotation of a collection of human summaries on a given document
collection. SCUs are identified by noting information that is repeated across these human
summaries. The repeated information can be as small as a modifier of a noun phrase or as
large as a clause. The evaluation method based on overlapping SCUs in human and
automatic summaries is called the pyramid method [Nenkova et al., 2007].
The University of Ottawa has organized the pyramid annotation data such that for some
of the sentences in the original document collection, a list of corresponding content units
is known [Copeck et al., 2006]. In Figure 4.1 we visualize the structure of pyramid
annotations and the source sentence mapping done at Ottawa. On the left of Figure 4.1
we can see how models and peers interact to form a pyramid structure, as explained by
[Nenkova et al., 2007]. There is a ‘one-to-many’ mapping between an SCU of a pyramid
and sentences in any of the peers; a fact could be represented in different sentences,
at possibly different granularities. On the right side of Figure 4.1, we can see how
each sentence in a peer can be mapped to a sentence in the source document collection by
a ‘one-to-many’ relationship. Such a one-to-many relationship exists only under the
assumption that all the peers are sentence-extractive summaries. Hence, overall, it
is possible to obtain a ‘one-to-many’ mapping between an SCU and one or more source
sentences.
Figure 4.1 The process and structure of Pyramid Annotations and source mappings
Figure 4.2 SCU annotation of a source document.
Dataset Total Relevant Biased relevant Irrelevant Biased irrelevant % bias in relevant % bias in irrelevant
DUC 2005 24831 1480 1127 1912 1063 76.15 55.60
DUC 2006 14747 1047 902 1407 908 86.15 71.64
DUC 2007 12832 924 782 975 674 84.63 69.12

Table 4.8 Statistical information on counts of query-biased sentences.

A sample of such an SCU mapping for a document from topic D0701A of the DUC
2007 QF-MDS corpus is shown in Figure 4.2. Three sentences are seen in the figure, among
which two have been annotated with system IDs and SCU weights wherever applicable.
The first sentence has not been picked by any of the summarizers participating in Pyramid
Evaluations, hence it is unknown if the sentence would have contributed to any SCU. The
second sentence was picked by 8 summarizers and that sentence contributed to an SCU of
weight 3, hence it is a relevant sentence. The third sentence in the example was picked by
one summarizer, however, it did not contribute to any SCU, so it is an irrelevant sentence.
This example shows all three types of sentences available in the corpus: unknown
samples, relevant samples and irrelevant samples.
We extracted the relevant and irrelevant samples in the source documents from these
annotations, i.e., sentences of the second and third types shown in Figure 4.2. Figure 4.3
shows the distribution of query-bias in the source document collections. Among a total of
≈ 13000 sentences available in the source collections, around 2000 sentences (14.8%) were
annotated as either relevant or irrelevant. When we analyzed the relevant set, we found
that 84.63% of its sentences were query-biased. On the irrelevant set, we found that
69.12% of the sentences were query-biased. That is, on average, 76.67% of the sentences
picked by any automated summarizer are query-biased. All the above numbers are based
on the DUC 2007 dataset, shown in boldface in Table 4.8.
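The percentages in the DUC 2007 row of Table 4.8 follow directly from the sentence counts; a simple arithmetic check, not thesis code:

```python
def pct_biased(biased, total):
    """Percentage of query-biased sentences in a sample."""
    return 100.0 * biased / total

# DUC 2007 row of Table 4.8:
rel = pct_biased(782, 924)                   # bias among relevant sentences, about 84.6%
irr = pct_biased(674, 975)                   # bias among irrelevant sentences, about 69.1%
overall = pct_biased(782 + 674, 924 + 975)   # pooled over all annotated sentences, about 76.7%
```

The pooled figure is the 76.67% quoted above: it weights the relevant and irrelevant sets by their sizes rather than averaging the two percentages.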
Figure 4.3 Distribution of relevant and irrelevant sentences in query-biased corpus

The above experiment shows that whether the systems pick relevant sentences or not,
they are more likely to pick sentences that are query-biased. This could mean that these
systems are biasing towards query terms in order to closely follow relevant content, as
we explained in Section 4.2. However, there is one caveat: the annotated sentences come
only from the summaries of systems that participated in the pyramid evaluations. Since
only 13 of the 32 participating systems were evaluated using pyramid evaluations, the
dataset is limited. Despite this, it is very clear that at least those systems that participated
in pyramid evaluations have been biased towards query terms, or at least have been better
at correctly identifying important sentences among the query-biased sentences than among
the query-unbiased sentences.
4.5 Formalizing Query-Bias
Our search for a formal method to capture the relation between occurrence of query-biased
sentences in the input and in summaries resulted in building binomial and multinomial
model distributions. The distributions estimated were then used to obtain the likelihood of
a query-biased sentence being emitted into a summary by each system.
For the DUC 2007 data, there were 45 summaries from each of the 32 systems (labeled
1-32), among which 2 were baselines (labeled 1 and 2), and 18 summaries from each of the
10 human summarizers (labeled A-J). We computed the log-likelihood, log(L[summary; p(Ci)]),
of all human and machine summaries from the DUC 2007 query-focused multi-document
summarization task, under both of the distributions described below (see Sections 4.5.1 and 4.5.2).
4.5.1 The Binomial Model
We represent the set of sentences as a binomial distribution over the two types of sentences.
Let Ci ∈ {C0, C1} denote the classes of sentences without and with query-bias, respectively.
For each sentence s ∈ Ci in the input collection, we associate a probability p(Ci) of it being
emitted into a summary. Query-biased sentences are assigned lower emission probabilities,
because the occurrence of query-biased sentences in the input is less likely: on average each
topic has 549 sentences, among which 196 contain a query term, which means only 35.6% of
the sentences in the input were query-biased. Hence, the likelihood function here denotes the
likelihood of a summary containing non-query-biased sentences, and summaries of humans
and systems must have low likelihood under this model if they rely on query-bias.
The likelihood of a summary then is:

L[summary; p(Ci)] = (N! / (n0! n1!)) · p(C0)^n0 · p(C1)^n1    (4.4)
Where N is the number of sentences in the summary, and n0 + n1 = N; n0 and n1 are
the cardinalities of C0 and C1 in the summary. Table 4.9 shows various systems with their
ranks based on ROUGE-2 and the average log-likelihood scores. The ROUGE [Lin, 2004b]
suite comprises n-gram overlap metrics that have been shown to correlate highly
with human evaluations of content responsiveness. ROUGE-2 and ROUGE-SU4 have been
the official ROUGE metrics for evaluating the query-focused multi-document summarization
task since DUC 2005.
ID rank LL ROUGE-2 ID rank LL ROUGE-2 ID rank LL ROUGE-2
1 31 -1.9842 0.06039 J -3.9465 0.13904 24 4 -5.8451 0.11793
C -2.1387 0.15055 E -3.9485 0.13850 9 12 -5.9049 0.10370
16 32 -2.2906 0.03813 10 28 -4.0723 0.07908 14 14 -5.9860 0.10277
27 30 -2.4012 0.06238 21 22 -4.2460 0.08989 5 23 -6.0464 0.08784
6 29 -2.5536 0.07135 G -4.3143 0.13390 4 3 -6.2347 0.11887
12 25 -2.9415 0.08505 25 27 -4.4542 0.08039 20 6 -6.3923 0.10879
I -3.0196 0.13621 B -4.4655 0.13992 29 2 -6.4076 0.12028
11 24 -3.0495 0.08678 19 26 -4.6785 0.08453 3 9 -7.1720 0.10660
28 16 -3.1932 0.09858 26 21 -4.7658 0.08989 8 11 -7.4125 0.10408
2 18 -3.2058 0.09382 23 7 -5.3418 0.10810 17 15 -7.4458 0.10212
D -3.2357 0.17528 30 10 -5.4039 0.10614 13 5 -7.7504 0.11172
H -3.4494 0.13001 7 8 -5.6291 0.10795 32 17 -8.0117 0.09750
A -3.6481 0.13254 18 19 -5.6397 0.09170 22 13 -8.9843 0.10329
F -3.8316 0.13395 15 1 -5.7938 0.12448 31 20 -9.0806 0.09126
Table 4.9 Rank, averaged log-likelihood score based on the binomial model, and true ROUGE-2 score for the summaries of various systems in the DUC’07 query-focused multi-document summarization task.
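The log-likelihood in Equation 4.4 is straightforward to compute; below is a hedged sketch (not the thesis code) that uses the log-gamma function for the factorials to stay numerically stable for longer summaries:

```python
import math

def binomial_loglik(n0, n1, p0):
    """Log of Eq. 4.4 for a summary with n0 non-query-biased and n1
    query-biased sentences; p0 is the emission probability of a
    non-query-biased sentence and p1 = 1 - p0."""
    N = n0 + n1
    log_coeff = math.lgamma(N + 1) - math.lgamma(n0 + 1) - math.lgamma(n1 + 1)
    return log_coeff + n0 * math.log(p0) + n1 * math.log(1 - p0)
```

With p0 = 0.644 (the input-side proportion of non-query-biased sentences), a summary that leans heavily on query-biased sentences scores a much lower log-likelihood than one mirroring the input distribution, which is exactly the signal the systems in Table 4.9 are ordered by.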
4.5.2 The Multinomial Model
In the previous section (Section 4.5.1), we described the binomial model, where we
classified each sentence as being query-biased or not. If we instead quantify the
amount of query-bias in a sentence, we associate each sentence with one of k+1 possible
classes, leading to a multinomial distribution. Let Ci ∈ {C0, C1, C2, . . . , Ck} denote the
levels of query-bias, where Ci is the set of sentences having exactly i query terms.

The number of sentences per class varies greatly, with C0 bagging a high percentage of
the sentences (64.4%) and {C1, C2, . . . , Ck} distributing the remaining 35.6% among
themselves. Since the distribution is highly skewed, distinguishing systems based on
log-likelihood scores under this model is easier and perhaps more accurate. As before,
summaries of humans and systems must have low likelihood under this model if they rely
on query-bias.
The likelihood of a summary then is:

L[summary; p(Ci)] = (N! / (n0! n1! · · · nk!)) · p(C0)^n0 · p(C1)^n1 · · · p(Ck)^nk    (4.5)
Where N is the number of sentences in the summary, and n0 + n1 + · · · + nk = N; n0,
n1,· · · ,nk are respectively the cardinalities of C0, C1, · · · ,Ck, in the summary. Table 4.10
shows various systems with their ranks based on ROUGE-2 and the average log-likelihood
scores.
ID rank LL ROUGE-2 ID rank LL ROUGE-2 ID rank LL ROUGE-2
1 31 -4.6770 0.06039 10 28 -8.5004 0.07908 5 23 -14.3259 0.08784
16 32 -4.7390 0.03813 G -9.5593 0.13390 9 12 -14.4732 0.10370
6 29 -5.4809 0.07135 E -9.6831 0.13850 22 13 -14.8557 0.10329
27 30 -5.5110 0.06238 26 21 -9.7163 0.08989 4 3 -14.9307 0.11887
I -6.7662 0.13621 J -9.8386 0.13904 18 19 -15.0114 0.09170
12 25 -6.8631 0.08505 19 26 -10.3226 0.08453 14 14 -15.4863 0.10277
2 18 -6.9363 0.09382 B -10.4152 0.13992 20 6 -15.8697 0.10879
C -7.2497 0.15055 25 27 -10.7693 0.08039 32 17 -15.9318 0.09750
H -7.6657 0.13001 29 2 -12.7595 0.12028 7 8 -15.9927 0.10795
11 24 -7.8048 0.08678 21 22 -13.1686 0.08989 17 15 -17.3737 0.10212
A -7.8690 0.13254 24 4 -13.2842 0.11793 8 11 -17.4454 0.10408
D -8.0266 0.17528 30 10 -13.3632 0.10614 31 20 -17.5615 0.09126
28 16 -8.0307 0.09858 23 7 -13.7781 0.10810 3 9 -19.0495 0.10660
F -8.2633 0.13395 15 1 -14.2832 0.12448 13 5 -19.3089 0.11172
Table 4.10 Rank, averaged log-likelihood score based on the multinomial model, and true ROUGE-2 score for the summaries of various systems in the DUC’07 query-focused multi-document summarization task.
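Equation 4.5 generalizes the binomial computation directly; again an illustrative sketch rather than the thesis code:

```python
import math

def multinomial_loglik(counts, probs):
    """Log of Eq. 4.5: counts[i] is n_i, the number of summary sentences
    in class C_i, and probs[i] is p(C_i); probs should sum to 1."""
    ll = math.lgamma(sum(counts) + 1)  # log N!
    for n_i, p_i in zip(counts, probs):
        ll -= math.lgamma(n_i + 1)     # log n_i!
        if n_i:
            ll += n_i * math.log(p_i)
    return ll
```

With two classes this reduces to the binomial case of Equation 4.4, and a summary concentrated in a single certain class has log-likelihood zero.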
4.5.3 Correlation of ROUGE metrics with likelihood of query-bias
Tables 4.9 and 4.10 display the log-likelihood scores of various systems in descending
order, along with their respective ROUGE-2 scores. We computed the Pearson correlation
coefficient (ρ) between ROUGE-2 and log-likelihood, and between ROUGE-SU4 and
log-likelihood. This was computed separately for systems (IDs 1-32), giving r1, and for
humans (IDs A-J), giving r2, and for both distributions.
For the binomial model, r1 = -0.66 and r2 = 0.39 were obtained. This indicates a strong
negative correlation between the likelihood of occurrence of non-query-biased sentences and
the ROUGE-2 score, that is, a strong positive correlation between the likelihood of
occurrence of query-biased sentences and the ROUGE-2 score. For human summarizers, in
contrast, there is a weak negative correlation between the likelihood of occurrence of
query-biased sentences and the ROUGE-2 score. The same correlation analysis applies to the
ROUGE-SU4 scores: r1 = -0.66 and r2 = 0.38.
A similar analysis with the multinomial model is reported in Tables 4.11 and 4.12, which show the correlations between ROUGE and log-likelihood scores for systems8 and humans9.
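As a concrete sketch, the correlation computation above can be reproduced with a plain Pearson r. The score lists below are illustrative placeholders, not the actual per-system values from the tables.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative placeholder scores for five systems (not the thesis data):
log_likelihood = [-4.74, -6.86, -8.03, -14.28, -19.31]
rouge_2 = [0.038, 0.085, 0.099, 0.124, 0.112]

print(round(pearson(log_likelihood, rouge_2), 2))
```

In practice one would also compute a p-value for each r against the number of systems (N = 32) or humans (N = 10), as reported in the footnotes to Tables 4.11 and 4.12.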
ρ ROUGE-2 ROUGE-SU4
binomial -0.66 -0.66
multinomial -0.73 -0.73
Table 4.11 Correlation of ROUGE measures with log-likelihood scores for automated systems
ρ ROUGE-2 ROUGE-SU4
binomial 0.39 0.38
multinomial 0.15 0.09
Table 4.12 Correlation of ROUGE measures with log-likelihood scores for humans
8 All the results in Table 4.11 are statistically significant (p < 0.00004, N = 32).
9 None of the results in Table 4.12 are statistically significant (p > 0.265, N = 10).
4.6 Discussion
Based on our observations in this chapter, it is clear that most automated systems bias themselves towards query terms in trying to generate summaries that are closer to human summaries. We have shown through various experiments that query-bias is likely directly responsible for otherwise poor algorithms producing better summaries. Viewing these algorithms through the classification view of summarization (see Figure 4.4), we observe that most systems ignore sentences that do not contain query terms. Discarding sentences based on such a naive surface feature (mere term-bias) means that much of the informative content in the non-biased sentences is lost. Hence, the behavior of each algorithm needs to be examined along these lines before such issues surface at a later stage.
Figure 4.4 Impact of query-bias in the classification view of the Text Summarization Process
Figure 4.4 illustrates how the various summarization functions S1(D), S2(D), · · · are query-biased in the current context of query-focused multi-document summarization.
Topic sentences are color-shaded as query-biased and unbiased sentences, and the same mapping is shown for each summarizer's output.
4.7 Conclusive Remarks
1. Automated systems are query-biased in their approach to query-focused multi-document summarization.
2. The binomial model clearly captures the relation between query-bias and ROUGE scores. It also shows how human models are collectively similar in their strategy from the query-bias point of view.
3. The multinomial model reinforces what the binomial model shows while revealing finer differences in the way systems are query-biased. In particular, it lets us differentiate systems that are biased towards sentences with multiple query terms from those biased towards sentences containing a single query term.
4.8 Chapter Summary
Our results underscore the differences between human and machine-generated summaries. Based on a Summary Content Unit (SCU) level analysis of query-bias, we argue that most systems are able to find important sentences only among query-biased sentences. More importantly, we show that, on average, 76.67% of the sentences picked by any automated summarizer are query-biased. When asked to produce query-focused summaries, humans do not rely to the same extent on the repetition of query terms.
We further confirm, based on the likelihood of emitting a non-query-biased sentence, that there is a strong negative correlation between systems' likelihood scores and ROUGE scores, which suggests that systems try to improve performance on ROUGE metrics by being biased towards the query terms. Humans, on the other hand, do not rely on query-bias, though we do not have statistically significant evidence for this. We have also speculated that the multinomial model better captures the variance across systems, since it distinguishes among query-biased sentences by quantifying the amount of query-bias.
From our point of view, most extractive summarization algorithms are formalized on top of a bag-of-words query model; the innovation in individual approaches has been in the actual algorithm built on top of that query model. We speculate that the real difference between human and automated summarizers could lie in the way the query (or relevance) is represented. Traditional query models from the IR literature have been used in summarization research thus far, and though some previous work [Amini and Usunier, 2007] tries to address this issue using contextual query expansion, new models for representing the query are perhaps the only way to induce topic-focus in the summary. IR-like query models, designed to handle 'short keyword queries', are perhaps not capable of handling the 'elaborate queries' found in summarization. Since the notion of query-focus is apparently missing from current algorithms, future summarization algorithms must incorporate it by design.
Chapter 5
Baselines for Update Summarization:
The Sentence Position Hypothesis
In this chapter, we describe a summarizer built on a sentence position policy derived from the evaluation testbed of recent summarization tasks at the Document Understanding Conferences (DUC). We show that the resulting summarizer outperforms most systems participating in task-focused summarization evaluations at the Text Analysis Conference (TAC) 2008. Our experiments also show that such a method performs better at producing short summaries (up to 100 words) than longer summaries. Further, we discuss the baselines traditionally used in summarization evaluation and suggest reviving an old baseline to suit the current summarization task at TAC: the Update Summarization task.
5.1 Introduction
Document summarization has received a lot of attention since the early work of Luhn [Luhn, 1958], in which statistical information derived from word frequency and distribution was used to compute a relative measure of significance, first for individual words and then for sentences. Later, Edmundson [Edmundson, 1969] introduced four clues for identifying significant words (topics) in a text. Among them, title and location are related to position methods, while the other two are the presence of cue words and of high-frequency content words. Edmundson assigned positive weights to sentences according to their ordinal position in the text, giving more weight to the first sentence of the first paragraph and the last sentence of the last paragraph.
5.1.1 Sentence Position
The position of a sentence in a document, or of a word in a sentence, gives good clues to the importance of that sentence or word. Such features are called locational features, and the sentence position feature deals with the presence of key sentences at specific locations in the text. Sentence position has been well studied in summarization research since its inception, beginning with Edmundson's work [Edmundson, 1969]. Earlier, Baxendale [Baxendale, 1958] defined a position method in a very straightforward way as the title plus the first and last sentence of a paragraph. But since the paradigmatic discourse structure differs significantly across subject domains and text genres, a position method should be more specific. Dolan [Dolan, 1980] stated that a study of topic sentences in expository prose showed that only 13% of paragraphs of contemporary professional writers began with topic sentences. Singer and Dolan [Singer and Dolan, 1980] maintain that the main idea of a paragraph can appear anywhere in the paragraph, or not be stated at all. Arriving at a completely negative conclusion, Paijmans [Paijmans, 1994] conducted experiments on the relation between word position and significance, and found that "words with high information content according to tf·idf-based weighting schemes do not cluster in the first and last sentences of paragraphs". In contrast, Kieras, in psychological studies [Kieras, 1985], confirmed the importance of the position of a mention within a text.
[Edmundson, 1969] The four basic methods employed by Edmundson to extract sentences are Cue, Key, Title and Location. Among these, title and location are the closest to position-based methods. The title method is based on the hypothesis that an author conceives the title as circumscribing the subject matter of the document; likewise, when the author partitions the body of the document into major sections, he summarizes each by choosing an appropriate heading. In this work Edmundson showed that the hypothesis that words of the title and headings are positively relevant was accepted at the 99 percent level of significance. For the location method, the hypothesis is that:
1. sentences occurring under certain headings are positively relevant; and
2. topic sentences tend to occur very early or very late in a document and its paragraphs.
As we shall see, this finding is replicated (in part) in our results.
[Baxendale, 1958] Baxendale’s investigation was based on a sample of 200 paragraphs
to determine where the important words are most likely to be found. He concluded that in
85% of the paragraphs, the first sentence was a topic sentence and in 7% of the paragraphs,
the final one.
[Lin and Hovy, 1997] Lin and Hovy [Lin and Hovy, 1997] describe a method for automated training and evaluation of an Optimal Position Policy, a method of locating likely positions of topic-bearing sentences based on genre-specific regularities of discourse structure. They provide an empirical validation of the position hypothesis as laid down by Edmundson [Edmundson, 1969]. Most of our work follows the structure of the experiments done by Lin and Hovy, whose details are discussed at length in the next few sections.
[Kastner and Monz, 2009] In the marginally related problem of key fact extraction, Kastner and Monz [Kastner and Monz, 2009] show similar results on the position hypothesis. They use the position feature to automatically identify news highlights, a feature of CNN's web-based news service that is currently produced manually.

The usefulness of 'key fact extraction' is shown by its use on CNN's website1, which
1http://www.cnn.com/
since 2006 has preceded most of its news stories with a list of story highlights; see2 Figure 5.1. The advantage of news highlights over full-text summaries is that they are much 'easier on the eye' and better suited to quick skimming. Until [Kastner and Monz, 2009] was published, CNN.com was the only provider of such a service. [Kastner and Monz, 2009] studied how far this process could be automated. As part of their study they identified how the position of a sentence in the text can be crucial to its inclusion in the highlights. Intuitively, facts of greater importance will be placed at the beginning of the text, and this is supported by their data, as shown in Figure 5.2. This work is very recent and was published in parallel with the work reported in [Katragadda et al., 2009].

Figure 5.1 An example of Highlights of a news story
Figure 5.2 Impact of location in identifying highlights

2 Figures 5.1 and 5.2 have been taken with permission from the original authors of the paper.

Machine Learning based approaches Position information has been used quite frequently in single-document summarization. Indeed, a simple baseline system that takes the first 'l' sentences as the summary outperforms most summarization systems at DUC 2004 [Barzilay and Lee, 2004]. [Zajic et al., 2002] also use position in scoring candidate summary sentences. In multi-document summarization, various systems have used position as a feature in scoring candidate sentences. Recent advances in machine learning have been adapted to the summarization problem over the years, and locational features have consistently been used to identify the salience of a sentence. Representative work on 'learning' sentence extraction includes training a binary classifier [Kupiec et al., 1995], training a Markov model [Conroy et al., 2004], training a CRF [Shen et al., 2007], and learning pairwise rankings of sentences [Toutanova et al., 2007]. An interesting use of the position feature appears in [tau Yih et al., 2007], where word position is exploited to score a sentence based on the average position of a word, using both generative and discriminative scoring functions for the words.
5.1.2 Introduction to the Position Hypothesis
The position hypothesis states that the importance of a sentence in a text is related to the ordinal position of the sentence in the text. As described earlier, each genre may have a different position of stress; for example, in technical literature the initial and final paragraphs are important. There have been negative conclusions in the literature: [Paijmans, 1994] states that words with high information content do not cluster in the first and last sentences of a paragraph, while [Singer and Dolan, 1980] argue that the main idea of a paragraph can appear anywhere in the text.

The purposes of this study are to clarify these contradictions in the literature, to test the intuitions and results mentioned above, and to verify the hypothesis that the importance of a sentence in a text is indeed related to its ordinal position. We also argue that such a genre-based summarizer would be a good baseline for short-summary tasks such as the Update Summarization task.
5.2 Sub-Optimal Sentence Position Policy (SPP)
Given a large text collection and a way to approximate relevance for a reasonably large subset of sentences, we can identify significant positional attributes for the genre of the collection. Our experiments are based on the work described in [Lin and Hovy, 1997], whose experiments on the Ziff-Davis corpus gave great insight into the selective power of the position method.
5.2.1 Sentence Position Yield and Optimal Position Policy (OPP)
Lin and Hovy [Lin and Hovy, 1997] provide an empirical validation of the position hypothesis. They describe a method of deriving an Optimal Position Policy for a collection of texts within a genre, as long as a small set of topic keywords is defined for each text. They define the sentence yield (strength of relevance) of a sentence based on the mention of topic keywords in the sentence.
The positional yield is defined as the average sentence yield at that position across documents: the yield of each sentence position is computed by counting the number of different keywords contained in the sentence at that position in each document, and averaging over all documents. An Optimal Position Policy (OPP) is then derived from the decreasing values of positional yield.
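A minimal sketch of this derivation, under the assumption that each document arrives as a list of tokenized sentences with an accompanying set of topic keywords (the function name and data shapes are our own, not from [Lin and Hovy, 1997]):

```python
from collections import defaultdict

def optimal_position_policy(documents, keywords_per_doc):
    """Order sentence positions by decreasing positional yield.

    documents:        one list of sentences per document; each sentence
                      is a list of word tokens
    keywords_per_doc: one set of topic keywords per document
    """
    totals = defaultdict(float)  # position -> summed sentence yield
    counts = defaultdict(int)    # position -> number of documents covering it
    for doc, keywords in zip(documents, keywords_per_doc):
        for pos, sentence in enumerate(doc, start=1):
            # Sentence yield: number of distinct topic keywords mentioned.
            totals[pos] += len(keywords & set(sentence))
            counts[pos] += 1
    # Positional yield: average sentence yield at each position.
    yields = {pos: totals[pos] / counts[pos] for pos in totals}
    return sorted(yields, key=yields.get, reverse=True)
```

The returned list of positions, read left to right, is the policy: earlier entries are the positions most likely to bear topic content.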
Their experiments were grounded in the assumption that the abstract is an ideal representation of the central topic(s) of a text. For their evaluations, they used the abstract to check whether the sentences found by their Optimal Position Policy are indeed a good selection, using precision and recall measures to establish those findings.
At our disposal we had data from pyramid evaluations that maps sentences to content units in the gold-standard summaries. These annotations have the useful property that each sentence can derive a relevance score for itself.
5.2.2 Documents
There is a wide variety of document types across genres. In our newswire collection we identified two primary types of document: small and large. The distinction is made based on the total number of sentences in the document: all documents with more sentences than a threshold are considered large. We experimented with thresholds varying from 10 to 35 sentences and found that the documents' distribution into the two categories was acceptable when thresholded at 20 sentences. This decision is also supported by the observation that the last sentences of a document are more important than those in the middle [Baxendale, 1958].

Sentence Position Yield (SPY) is obtained separately for both types of document. For a small document, sentence positions take values 1 through 20. For a large document, we compute SPY for positions 1 through 20, label the last 15 sentences 136 through 150, and label 'any other sentence' 100. It can be seen in Figure 5.5 that sentences that come from neither the leading nor the trailing part of large documents do not contribute much content to the summaries.
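A sketch of this labeling scheme follows; the function name and the handling of documents short enough for the two windows to overlap are our own assumptions.

```python
def position_label(index, doc_len, threshold=20):
    """Map a 1-based sentence index to its position label.

    Small documents (doc_len <= threshold) keep their ordinal positions.
    Large documents keep positions 1 through 20, label their last 15
    sentences 136 through 150, and lump every other sentence under 100.
    """
    if doc_len <= threshold or index <= 20:
        return index
    tail_start = doc_len - 15 + 1
    if index >= tail_start:
        # Final sentence -> 150, second-to-last -> 149, and so on.
        return 150 - (doc_len - index)
    return 100
```

For example, in a 40-sentence (large) document, sentence 5 keeps label 5, sentence 25 is lumped under label 100, and sentence 40 receives label 150.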
5.2.3 Pyramid Data
Summary content units, referred to as SCUs hereafter, are semantically motivated, sub-sentential units that are variable in length but no bigger than a sentential clause. SCUs emerge from the annotation of a collection of human summaries for the same input. They are identified by noting information that is repeated across summaries, whether the repetition is as small as a modifier of a noun phrase or as large as a clause. The weight an SCU obtains is directly proportional to the number of reference summaries that support that piece of information. The evaluation method based on overlapping SCUs in human and automatic summaries is described in the Pyramid method [Nenkova et al., 2007].
The University of Ottawa has organized the pyramid annotation data such that for
some of the sentences in the original document collection (those that were picked by sys-
tems participating in pyramid evaluation), a list of corresponding content units is known
[Copeck et al., 2006]. We described this data in detail in Section 4.4.
A sample of the SCU mapping is reproduced in Figure 5.3. We used this data to identify the locations in a document from which most sentences were picked, and which of those locations were most responsive to the query in terms of content.
Figure 5.3 A sample mapping of SCU annotations to source document sentences. An excerpt from the mapping of topic D0701A of the DUC 2007 QF-MDS task.
For each SCU, a weight is given in the pyramid annotations. A sentential score can thus be defined as the sum of the weights of all the contributing SCUs of the sentence; for an unknown sample and a negative sample, the sentential score is 0. For example, the score of the second sentence in Figure 5.3 is 3, contributed by a single SCU, while that of the first and third sentences is 0.
For each sentence position, the sentential score averaged over all documents is what we call the Sentence Position Yield. SPY for small and large documents is shown in Figures 5.4 and 5.5. Based on these values, a simple position policy was framed as shown below. A position policy is an ordered set whose elements appear in decreasing order of importance; within a subset, each sub-element is equally important and treated alike.
Figure 5.4 Sentence Position Yield for small documents.
{s1, S1, {s2, S2, s3}, {S3, s4, s5, s6, s7, s8, s20}, {S4, s9}, . . . }
In the above position policy, sentence positions in small documents and large documents are represented by si and Sj respectively.
The position policy described above provides an ordering of sentence positions based on very accurate 'relevance' annotations on sentences. However, a large subset of sentences is not annotated with either a positive or a negative relevance judgment; hence, the policy is derived from a high-precision, low-recall corpus3 for sentence relevance. If all sentences were annotated with such judgments, the policy could have been different. For this reason we call the policy derived above a Sub-optimal Position Policy (SPP).
5.3 A Summarization Algorithm based on SPP
The goal of creating a position policy was to test its effectiveness as a summarization algorithm. The simple heuristic above is easily turned into an algorithm by assigning a score to each distinct set in the policy: all s1 positions get the highest weight, followed by the next-best weight for all S1, and so on.

As can be observed, the summary might end up consisting only of the first sentence of each document, which is tolerable as long as the summary contains no redundant information. Hence we also use a simple unigram-match redundancy measure that rejects a sentence if it matches any of the already selected sentences in at least 40% of its content words. We also disallow sentences longer than 25 content words.

3 DUC 2005 and 2006 data has been used for learning the SPP. In the further experiments in Section 5.3, DUC 2007 and TAC 2008 data have been used as test data.
We applied the above algorithm to generate multi-document summaries for two tasks: the Query-Focused Multi-Document Summarization (QF-MDS) task of DUC 2007 and the Query-Focused Update Summarization task of TAC 2008.
5.3.1 Query-Focused Multi-Document Summarization
The query-focused multi-document summarization task at DUC models a real-world complex question answering task: given a topic and a set of 25 relevant documents, the task is to synthesize a fluent, well-organized 250-word summary of the documents that answers the question(s) in the topic statement/narration.
The summaries from the above algorithm for QF-MDS were evaluated with ROUGE metrics [Lin, 2004b]. The average4 recall scores for ROUGE-2 and ROUGE-SU4 are reported in Table 5.1, along with the performance of the top-performing system and the official baselines. Our SPP algorithm performed worse than most systems participating in the task that year, performing better5 than only the 'first x words' baseline and 3 other systems.

4 Averaged over all the 45 topics of the DUC 2007 dataset.
5 Better in a statistical sense, based on 95% confidence intervals of the two systems' ROUGE-2 evaluations.

system                     ROUGE-2   ROUGE-SU4
'first x words' baseline   0.06039   0.10507
'generic' baseline         0.09382   0.14641
SPP algorithm              0.06913   0.12492
system 15 (top system)     0.12448   0.17711

Table 5.1 ROUGE-2 and ROUGE-SU4 recall scores for two baselines, the SPP algorithm and a top-performing system at the Query-Focused Multi-Document Summarization task, DUC 2007.

5.3.2 Update Summarization Task

The update summarization task is to produce short (~100-word) multi-document update summaries of newswire articles under the assumption that the user has already read a set of earlier articles. The initial document set is called cluster A and the next set of articles cluster B. For cluster A, a query-focused multi-document summary is expected; the purpose of each 'update summary' (the summary of cluster B) is to inform the reader of new information about a particular topic. Summaries from the above algorithm for the Query-Focused Update Summarization task were evaluated with ROUGE metrics. The algorithm performed surprisingly well at this task compared to QF-MDS: the ROUGE scores suggest that it is well above the median for cluster A and among the top 5 systems for cluster B.
It must be noted that consistent performance across clusters (both A and B) shows the robustness of the 'SPP algorithm' at the update summarization task. It is also evident that such an algorithm is computationally simple and lightweight.

These surprisingly high ROUGE scores prompted us to evaluate the summaries with the Pyramid evaluation [Nenkova et al., 2007], which provides a more semantic approach to content evaluation based on SCUs, as discussed in Section 5.2.3. The average6 modified pyramid scores of cluster A and cluster B summaries are shown in Table 5.2, along with the average ROUGE-2 and ROUGE-SU4 recall scores. The pyramid evaluation7 suggests that this algorithm performs better than all other automated systems at TAC 2008. Table 5.3 shows the average performance (across clusters) of the 'first x words' baseline, the SPP algorithm and two top-performing systems (System IDs 43 and 11): system 43 was adjudged the best system based on ROUGE metrics, and system 11 was the top performer based on pyramid evaluations at TAC 2008.
6 Averaged over all the 48 topics of the TAC 2008 dataset.
7 Pyramid annotations were done by a volunteer who also volunteered for annotations during DUC 2007.
ROUGE-2 ROUGE-SU4 pyramid
cluster A 0.08987 0.1213 0.3432
cluster B 0.09319 0.1283 0.3576
Table 5.2 Cluster-wise ROUGE-2 and ROUGE-SU4 recall scores and modified pyramid scores for the SPP algorithm at the Update Summarization task.
system ROUGE-2 ROUGE-SU4 pyramid
‘first x words’ baseline 0.05896 0.09327 0.166
SPP algorithm 0.09153 0.1245 0.3504
System 43 (top in ROUGE) 0.10395 0.13646 0.289
System 11 (top in pyramid) 0.08858 0.12484 0.336
Table 5.3 Average ROUGE-2 and ROUGE-SU4 recall scores and modified pyramid scores for the baseline, the SPP algorithm and two top-performing systems at TAC 2008.
5.3.3 Discussion
It is interesting that an algorithm that performs very poorly at QF-MDS does very well at the Update Summarization task. A possible explanation lies in summary length. For a 250-word summary in the QF-MDS task, human summaries might provide a descriptive answer to the query that includes information nuggets accompanied by background information. Indeed, it has been reported that humans appreciate receiving more information than just the answer to the query, whenever possible [Lin et al., 2003, Bosma, 2005].
In the Update Summarization task, by contrast, the summary length is only 100 words. At such a short length humans must trade off answer sentences against supporting sentences, and usually answers are preferred. Since our method identifies sentences known to contribute towards the needed answers, it performs better at the shorter version of the task.
Another possible explanation is that as a shorter summary is required, the task of choosing the most important information becomes more difficult and no approach works well consistently. It has also often been noted that this baseline is indeed quite strong for this genre, owing to the journalistic convention of putting the most important part of an article in the initial paragraphs.
5.4 Baselines in Summarization Tasks
Over the years, as summarization research moved from generic single-document summarization, to generic multi-document summarization, to focused multi-document summarization, two major baselines persisted throughout the evaluations:
1. First N words of the document (or of the most recent document).
2. First sentence of each document, in chronological order, until the length requirement is reached.
The first baseline has been in place ever since the first evaluation of generic single-document summarization at DUC 2001. For multi-document summarization, the first N words of the (chronologically) most recent document were chosen as baseline 1. In the recent summarization evaluations at the Text Analysis Conference (TAC 2008), where update summarization was evaluated, baseline 1 still persists. This baseline performs quite poorly in content evaluations on all manual and automatic metrics. However, since it does not disturb the original flow and ordering of a document, these summaries are linguistically the best; indeed, baseline 1 outperforms all automated systems in linguistic quality evaluations.
The second baseline was used occasionally for multi-document summarization from 2001 to 2004, for both generic and focused multi-document summarization. In 2001, only one system significantly outperformed baseline 2 [Nenkova, 2005]. In the 2003 QF-MDS task, again only one system outperformed baseline 2, while in 2004, at the same task, no system significantly outperformed it. Over the years, this baseline has thus remained largely untouched by systems in content evaluations, although the linguistic quality of such a summary is compromised.
Currently, for the Update Summarization task at TAC 2008, NIST's baseline is baseline 1 (the 'first x words' baseline), and all systems (except one) perform better than it in all forms of content evaluation. Since the task is to generate 100-word (short) summaries, past experience leaves little doubt that baseline 2 would perform well.
It is interesting to observe that baseline 2 is a close approximation of the 'SPP algorithm' described in this chapter. We draw two main distinctions between them. First, baseline 2 picks only the first sentence of each document, while the SPP algorithm can pick other sentences in the order described by the position policy. Second, baseline 2 places no restriction on redundancy; due to journalistic conventions, the entire summary might thus consist of the same 'information nuggets', wasting the minimal real estate available (~100 words). Our SPP algorithm, on the other hand, uses a simple unigram-overlap measure to identify redundant information in sentence pairs, avoiding redundant nuggets in the final summary.
5.5 Discussion and Conclusions
Baselines 1 and 2, together, could act as a balancing mechanism for comparing both linguistic quality and responsive content in a summary. The availability of a stronger, content-responsive baseline would enable steady progress in the field: linguistically motivated systems would compare themselves against baseline 1, while content-motivated systems would compare against the stronger baseline 2 and improve upon it.
In the years to come, using baseline 1 will not help us understand whether there has been significant improvement in the field, because almost every simple algorithm beats its performance. A better baseline, like the one based on the position hypothesis, would raise the bar for systems participating in coming years and make it easier to track the progress of the field.

In this chapter, we derived a method to identify a 'sub-optimal position policy' from pyramid annotation data that were previously unavailable, distinguishing small and large documents in obtaining the policy. We described the Sub-optimal Sentence Position Policy (SPP), implemented it as an algorithm, and showed that a position policy formed this way is a good representative of the genre and thus performs well above median performance. We further described the baselines used in summarization evaluation and discussed the need to bring back baseline 2 (or the 'SPP algorithm') as an official baseline for the update summarization task.
Ultimately, as Lin and Hovy [Lin and Hovy, 1997] suggest, the position method can only take us a certain distance: it has a limited power of resolution (the sentence) and a limited method of identification (the position in a text). This is why we intend to use it as a baseline. As it stands, the algorithm generates a generic summary; it does not consider the topic or query to generate a query-focused summary. In future we plan to extend the SPP algorithm with a basic method for bringing in relevance.
Chapter 6
A Language Modeling Extension for
Update Summarization
In the context of DUC, there has been a proliferation of publications on Automated Text Summarization1, vying for the state of the art each year. Throughout the literature, some models have been proven to work in general, and some have been shown to work under constrained settings. Early text summarization research started with a "Single Document Generic Summary" task. Over time, the community grew and began working on interesting problems such as "Multi-Document Generic Summary generation", "Query-Focused Multi-Document Summary generation", etc.
At DUC, every 2-3 years the major problem being addressed is altered or tweaked to create a new task of interest to the community, moving the goal forward towards better summarization approaches in a realistic task setting. At the end of 2007, the community was dealing with a similar scenario while developing the "Query-Focused Multi-Document Update Summarization" task.

1 http://www-nlpir.nist.gov/projects/duc/pubs.html
6.1 Update Summarization
The key to "update summarization" is a real-world setting in which a user needs to keep track of a hot topic continuously, at random intervals of time. There is a lot of activity on the hot topic and many documents are generated within a short span. The user cannot cope with this proliferation of information and hence requires a summarization engine that generates a very targeted, informative summary. The user now has access to the information in the form of a summary, and if he needs to know more he can consult the source documents. After using the summarization engine for a while, say he takes a break (it's Christmas!) and later returns for a summary of the recent activity on the topic. A normal summarizer would simply generate a summary of the recent documents and present it to him. The idea of update summarization, however, is to filter out information that has already appeared in previous articles, whether or not it was presented to the user as part of a previous summary. In effect, it is like adding another redundancy-checking module to suppress repeated information.
The updates on the topic need to filter out redundant information while preserving the informativeness of the content. The task of update summarization thus has two components: a normal query-focused multi-document summarization for the first cluster (of documents) on the topic, and an update summary generation procedure that also produces query-focused summaries, under the assumption that the user has already gone through the previous document cluster(s). In the current work, we approach the problem under a sentence-extractive summarization paradigm, using an existing language modeling framework. Here, we see the "update summary generation" task as a language modeling smoothing problem.
6.2 Language Modeling Approach to IR and Summarization
A statistical language model, or more simply a language model, is a probabilistic mechanism for generating text. In Information Retrieval, the use of language modeling rose soon after [Ponte and Croft, 1998] described a "language modeling approach to IR"; [Song and Croft, 1999] then furthered research into the application of language modeling to IR.

Language modeling based approaches have been prominent in the text summarization literature since the early work of Luhn [Luhn, 1958]. [Jagarlamudi, 2006] showed how a relevance-based language modeling paradigm can be applied to automated text summarization, specifically for the "query-focused multi-document summarization" task. More recently, [Nenkova et al., 2006] showed reasonable success in multi-document text summarization using 'just' unigram language models. Language modeling approaches in IR and in text summarization are dealt with more deeply in Section 3.1. Here we use the Probabilistic Hyperspace Analogue to Language (PHAL) described in [Jagarlamudi, 2006] as the language modeling mechanism.
6.2.1 Probabilistic Hyperspace Analogue to Language (PHAL)
Following the model of [J et al., 2005], a Hyperspace Analogue to Language model captures dependencies of a word w on other words based on their occurrence in the context of w within a window of size K, in a sufficiently large corpus. We use the PHAL model, and use the relevance-based language modeling approach of [J et al., 2005] for sentence scoring. PHAL, the probabilistic HAL, is a natural extension of HAL spaces, as term co-occurrence counts can be used to define conditional probabilities. PHAL(w′|w) can be interpreted as "given a word w, the probability of observing another word w′ with w in a window of size K":
PHAL(w′|w) = c × HAL(w′|w) / (n(w) × K)
The sentence scoring mechanism is built on this model. Assuming word independence, the relevance of a sentence S can be expressed as:
P(S|R) = ∏_{wi∈S} P(wi|R) ≈ ∏_{wi∈S} P(wi|Q)

P(S|R) = ∏_{wi∈S} [P(wi)/P(Q)] ∏_{qj} PHAL(qj|wi) ≈ ∏_{wi∈S} P(wi) ∏_{qj} PHAL(qj|wi)   (6.1)
All sentences in the corpus are scored with the above equation and the best (highest-scoring) sentences are picked to be part of the final summary.
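As an illustration, the scoring in Eq. (6.1) can be sketched as follows. This is a minimal sketch, not the thesis's actual implementation: the constant c is taken as 1, the window size and the smoothing floor `eps` are illustrative choices, and whitespace tokenization is assumed.

```python
import math
from collections import Counter, defaultdict

def build_phal(tokens, K=5):
    """Directed co-occurrence counts of w' in a window of K words after w,
    normalized as PHAL(w'|w) ~ HAL(w'|w) / (n(w) * K), taking c = 1."""
    hal = defaultdict(Counter)
    n = Counter(tokens)
    for i, w in enumerate(tokens):
        for wp in tokens[i + 1:i + 1 + K]:
            hal[w][wp] += 1
    def phal(wp, w):
        return hal[w][wp] / (n[w] * K) if n[w] else 0.0
    return phal, n

def score_sentence(sentence, query_terms, phal, n, total, eps=1e-6):
    """Log-space version of Eq. (6.1): sum_w [log P(w) + sum_q log PHAL(q|w)],
    with a small floor eps so unseen pairs do not zero out the product."""
    score = 0.0
    for w in sentence:
        score += math.log(n[w] / total + eps)
        for q in query_terms:
            score += math.log(phal(q, w) + eps)
    return score
```

Sentences would then be ranked by this score and the top ones extracted, as described above.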
6.3 Language Modeling Extension
While generating update summaries, whether in single-document or multi-document summarization, the idea is to present information that the reader has not already seen. Looked at the other way, it can be considered suppression of information that the user has already seen. In the context of language modeling, if we construct models that represent the document (or document collection), then the corresponding models for the new stream of data should be modified based on the information that has already been seen and is known to be of importance. It is interesting to note that query relevance does not change; however, a considerable topical shift might occur in the new stream. To accommodate this topical shift and to avoid redundancy in the update summary, we build a background-aware language model that penalizes thematic features that occur more frequently in the background than in the new stream.
Since our approach to language modeling, PHAL, is "bigram" based, so is our extension. But it is imperative to understand that we are providing a general framework
which can be applied to any "language modeling mechanism". To provide this extension, we need to devise two things:
• Signature Terms
• Context Adjustment (or smoothing)
After devising strategies for both signature terms and context adjustment, which we describe in the following sections, the following approach to summary generation is taken.
6.3.1 Approach
Our approach is motivated by the fact that the necessary update may be seen as giving more weight to important sentences. Since importance is now also dependent on the previous utterances on the topic, we should either increase the weight of important terms in the novel cluster or decrease the weights of words that are more important in the previous clusters. In either approach, we first need to identify signature terms [Lin and Hovy, 2000] of the respective cluster with respect to previous clusters. Note that summarization for the first cluster is a simple query-focused summarization problem and follows directly from the approach described in [J et al., 2005]; hence we do not consider it here.
The primary algorithm consists of the following steps:
1. For the first cluster, generate summary using the actual Summarization algorithm,
say S(x).
2. For each of the next clusters
(a) Compute Language model of previous clusters, say LM(A).
(b) Compute Language model of current cluster, say LM(B).
(c) Generate ‘Signature Terms’ of previous cluster and current cluster, say TA and
TB respectively.
(d) Perform ‘Context Adjustment’ for model LM(B), using TA and/or TB.
(e) Generate summary based on the adjusted (or corrected) Language Model.
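The steps above can be sketched as a driver loop. All four callables (`summarize`, `build_lm`, `signature_terms`, `adjust`) are hypothetical stand-ins for S(x), LM(·), the signature-term extractor, and the context adjustment; they are not names from the actual system.

```python
def update_summaries(clusters, summarize, build_lm, signature_terms, adjust):
    """Driver for the primary algorithm of Section 6.3.1. The first cluster
    is summarized directly; later clusters get a context-adjusted model."""
    summaries = [summarize(clusters[0], model=build_lm(clusters[0]))]
    seen = list(clusters[0])
    for cluster in clusters[1:]:
        lm_a = build_lm(seen)                  # LM(A): previous clusters
        lm_b = build_lm(cluster)               # LM(B): current cluster
        t_a = signature_terms(seen, cluster)   # signature terms of A
        t_b = signature_terms(cluster, seen)   # signature terms of B
        adjusted = adjust(lm_b, lm_a, t_a, t_b)  # context adjustment
        summaries.append(summarize(cluster, model=adjusted))
        seen.extend(cluster)
    return summaries
```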
6.3.2 Signature Terms
Topic signatures as defined by Lin and Hovy [Lin and Hovy, 2000] are a set of related terms that describe a topic. Lin and Hovy collect a set of terms that are typically highly correlated with a target concept from a pre-classified corpus such as the TREC collections. We approximate the same using the dataset at hand. For each topic, if we consider cluster A as the irrelevant set (R̄) and cluster B as the relevant set (R), we obtain terms that are relatively more important for cluster B than for cluster A. Our hypothesis is that a term is significant if it occurs relatively more frequently in cluster B than in cluster A. In this work we report on the utility of this method of term selection to aid context adjustment for update summarization.
The document set is pre-classified into two sets, cluster A and cluster B. We assume the following two hypotheses:

Hypothesis 1 (H1): P(B|ti) = p = P(B|t̄i)   (6.2)
Hypothesis 2 (H2): P(B|ti) = p1 ≠ p2 = P(B|t̄i)   (6.3)
where H1 implies that the relevancy of a document is independent of ti, while H2 implies that the presence of ti indicates strong relevancy, assuming p1 ≫ p2. Consider the following 2-by-2 contingency table, in which O11 is the frequency of term ti occurring in cluster B, O12 is the frequency of ti occurring in cluster A, O21 is the frequency of terms other than ti (t̄i) occurring in cluster B, and O22 is the frequency of terms other than ti occurring in cluster A:
        R      R̄
ti      O11    O12
t̄i      O21    O22
Assuming a binomial distribution of terms as relevant or irrelevant,

b(k; n, x) = (n choose k) x^k (1 − x)^(n−k)
then the likelihood for H1 is:

L(H1) = b(O11; O11 + O12, p) · b(O21; O21 + O22, p)

and for H2 is:

L(H2) = b(O11; O11 + O12, p1) · b(O21; O21 + O22, p2)
The log λ value is then computed as follows:

log λ = log [L(H1) / L(H2)]
      = log [ b(O11; O11 + O12, p) b(O21; O21 + O22, p) ] − log [ b(O11; O11 + O12, p1) b(O21; O21 + O22, p2) ]
      = ((O11 + O21) log p + (O12 + O22) log(1 − p)) − (O11 log p1 + O12 log(1 − p1) + O21 log p2 + O22 log(1 − p2))   (6.4)
This ratio λ is Dunning's likelihood ratio [Dunning, 1993]. Dunning suggests that λ is more appropriate than χ2 for hypothesis testing on sparse data, and that the quantity −2 log λ is asymptotically χ2 distributed. Hence we can use a χ2 distribution table to look up the −2 log λ value at a specific confidence level.
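A minimal sketch of the computation in Eq. (6.4). Estimating p, p1 and p2 from the contingency table by maximum likelihood is our assumption; the binomial coefficients cancel in the ratio L(H1)/L(H2) and are therefore omitted.

```python
import math

def neg2_log_lambda(o11, o12, o21, o22, eps=1e-12):
    """-2 log(lambda) for one term's contingency table, per Eq. (6.4)."""
    p = (o11 + o21) / (o11 + o12 + o21 + o22)  # pooled estimate under H1
    p1 = o11 / (o11 + o12)                     # estimate under H2 (row ti)
    p2 = o21 / (o21 + o22)                     # estimate under H2 (row not-ti)

    def ll(k, n, x):
        x = min(max(x, eps), 1 - eps)          # guard log(0) at the extremes
        return k * math.log(x) + (n - k) * math.log(1 - x)

    log_lam = (ll(o11, o11 + o12, p) + ll(o21, o21 + o22, p)) \
            - (ll(o11, o11 + o12, p1) + ll(o21, o21 + o22, p2))
    return -2.0 * log_lam
```

Terms whose −2 log λ exceeds the χ2 cutoff at the chosen confidence level (e.g. 10.83 at 99.9%, 1 degree of freedom) would be kept as signature terms.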
6.3.3 Context Adjustment
Let Tnew and Told be the sets of signature terms extracted from cluster B and cluster A, respectively. Signature terms can be used to adjust the language model as follows:
∀wi ∈ Tnew, ∀wj : PHAL(wj|wi, B) = PHAL(wj|wi, B) + PHAL(wj|wi, A)   (6.5)

∀wi ∈ Told, ∀wj : PHAL(wj|wi, B) = PHAL(wj|wi, B) − PHAL(wj|wi, A)   (6.6)
Update summarization has a major difference from normal query-focused multi-document summarization: it requires us to provide information that is novel. This can be seen from two viewpoints: improving novelty and reducing redundancy. The update based on Eq. (6.5) boosts co-occurrences of novel terms in cluster B, while the update based on Eq. (6.6) penalizes co-occurrences of signature terms of previous clusters, thereby reducing redundancy with respect to earlier clusters.
6.4 Evaluation
The major evaluation criterion for this experiment was to study the effect of signature-term-based context adjustment for language modeling approaches on the update summarization task. We measured the impact of context adjustment using automated evaluation measures based on ROUGE [Lin, 2004b]. Table 6.1 illustrates the impact of the above-mentioned context adjustments on the PHAL feature. The results are only indicative of the performance of the context adjustment applied to update clusters. Clearly, there is a visible advantage in applying such a framework to language-modeling-based update summarization.
6.5 Summary
In this chapter we described a generic approach to the update summarization problem based on the language modeling paradigm. We argued that an update summary is a 'stream' (of text) based summary, and that a solution to the problem must also lie in understanding how the current state of the stream varies from previous states. We observed that a simple context adjustment based on 'stream dynamics' helps generate better summaries compared with the base algorithm.

                    ROUGE-2   ROUGE-SU4
PHAL                0.08190   0.12208
PHAL + Adjustment   0.08369   0.12395

Table 6.1: Impact of Context Adjustment on PHAL
Though the improvement in the scores is small, the potential of the approach cannot be denied. In future work, this approach needs to be developed into a formal strategy with a stronger mathematical basis. Several issues, such as the "semantics of the 'adjusted model'", "weighting the bias", etc., have not been addressed in this work and are the concern of detailed further work. It is also possible to apply multiple, more appropriate approaches to signature term extraction, which we set aside by choosing a simple approach based on [Lin and Hovy, 2000]. This signature extraction approach may not be appropriate here, in particular because of the nature of the data: we observed that for some topics there were very few signature terms (<5). We believe this happens because the method is being applied outside its intended setting. It was built for dissimilar topic clusters, and blindly using it on same-topic, time-varying clusters leaves us with just a few distinguishing terms between the two clusters, which would paralyze our otherwise sound approach. The similarity between the two clusters (previous and current) is more prominent than their dissimilarity, and hence other approaches may be better able to distinguish these clusters and provide stronger, more 'significant' signature terms.
Chapter 7
Alternative (Automated) Summarization
Evaluations
Evaluation is a crucial component in the area of automatic summarization; it is used both to rank multiple participant systems in shared tasks, such as the summarization track at TAC 2008 and 2009 and its DUC predecessors, and to guide developers whose goal is to improve their summarization systems. Summarization evaluation, as has been the case with other language understanding technologies, can foster the creation of reusable resources and infrastructure; it creates an environment for comparison and replication of results; and it introduces an element of competition to produce better results [Hirschman and Mani, 2001]. However, manual evaluation of the large number of documents necessary for a relatively unbiased view is often infeasible, especially since multiple evaluations are needed over time to track incremental improvements in systems. Therefore, there is an urgent need for reliable automatic metrics that can perform evaluation in a fast and consistent manner.
7.1 Introduction
Summarization evaluation techniques currently used in the literature and at the focused
evaluations have been described in detail in Section 3.2. As discussed earlier, summarization evaluation of informativeness falls into two major categories: intrinsic and extrinsic evaluations. Intrinsic evaluations measure the quality of the created summary directly.
Since the measurement is direct and does not involve some other process or task at which the summary should perform, we need to compare the summary with some reference; usually, multiple human reference summaries are obtained for this purpose. Examples of 'intrinsic informativeness evaluations' include ROUGE [Lin, 2004b], Basic Elements [Hovy et al., 2006, Tratz and Hovy, 2008] and Pyramid evaluations [Nenkova et al., 2007]. Among these, all the methods are fully automated except the Pyramid method, which involves human effort in the creation of "SCU Pyramids". Current techniques for evaluating summaries assume that multiple human reference summaries exist, which are themselves difficult and time-consuming to obtain. Recently, [Louis and Nenkova, 2009] have tried to evaluate summaries with no reference summaries, using an information-theoretic framework to compute the information lost in producing the summary at hand.
Summarization evaluation, like Machine Translation (MT) evaluation (or the evaluation of any other NLP system), can be broadly classified into two categories [Jones and Galliers, 1996]. The first, intrinsic evaluation, tests the summarization system in itself. The second, extrinsic evaluation, tests the summarization system based on how it affects the completion of some other task. In the past, intrinsic evaluations have mainly assessed the informativeness and coherence of summaries, while extrinsic evaluations have been used to test the impact of summarization on tasks like reading comprehension, relevance assessment, etc.
7.2 Current Summarization Evaluations
In the Text Analysis Conference (TAC) series and its predecessor, the Document Understanding Conference (DUC) series, the evaluation of summarization quality was conducted using both manual and automated metrics. Manual assessment, performed by human judges, centers around two main aspects of summarization quality: informativeness/content and readability/fluency. Since manual evaluation is still the undisputed gold standard, both at TAC and DUC there was a phenomenal effort to manually evaluate as much data as possible.
Content Evaluations  The content or informativeness of a summary has been evaluated using various manual metrics. Earlier, NIST assessors rated each summary on a 5-point scale from "very poor" to "very good". Since 2006, NIST has used the Pyramid framework to measure content responsiveness. In the pyramid method, explained in Section 3.2, assessors first extract all possible "information nuggets", or Summary Content Units (SCUs), from human-produced model summaries on a given topic. Each SCU has a weight based on the number of model summaries in which its information appears. The final score of a peer summary is based on the recall of nuggets in the peer.
All forms of manual assessment are time-consuming, expensive and not repeatable, whether scoring summaries on a Likert scale or evaluating peers against "nugget pyramids" as in the pyramid method. Such assessment does not help system developers, who would ideally like a fast, reliable and, most importantly, automated evaluation metric that can be used to track incremental improvements in their systems. So despite the strong manual evaluation criteria for informativeness, time-tested automated methods, viz. ROUGE and Basic Elements (BE), have been regularly employed and tested for their correlation with manual evaluation metrics like the 'modified pyramid score', 'content responsiveness' and 'overall responsiveness' of a summary. Each of these metrics has been described in Section 3.2. The creation and testing of automatic evaluation metrics is therefore an important research avenue; the goal is to create automated metrics that correlate highly with the manual ones.
Readability/Fluency Evaluations  The readability or fluency of a summary is evaluated based on a set of linguistic quality questions that manual assessors answer for each summary. The linguistic quality markers are: Grammaticality, Non-Redundancy, Referential Clarity, Focus, and Structure and Coherence.

Manual Readability Evaluations  Readability assessment is primarily a manual method, following Likert-scale rating by human assessors as for content responsiveness. Each of the above linguistic quality markers is rated individually for every summary, and the average linguistic quality scores of the various peers are compared against each other. An ANOVA (Analysis of Variance) is performed on the linguistic quality markers to show which sets of peers fall in a statistically similar range.
Automated Readability Evaluations  Though there has not been much formal work on automatic evaluation of the readability/fluency of summaries, [Pitler and Nenkova, 2008] describe a first study of various lexical, syntactic and discourse features to produce a highly predictive model of human readers' judgments of text readability. Their experiments were largely based on the Wall Street Journal (WSJ) corpus, since they were interested in the impact of discourse on readability.
7.3 Automated Content Evaluations
Based on the arguments set out above, automated evaluation of content and form is necessary for tracking developers' incremental improvements, and a focused task on the creation of automated metrics for content and form would help in the process. This was precisely the point addressed by the TAC AESOP (Automatically Evaluating Summaries of Peers) task. In TAC 2009, the AESOP task involved only "Automated Evaluation of Content and Responsiveness", and this chapter addresses the same.

In the first edition of the AESOP task at TAC 2009, the purpose of the task was to promote
research and development of systems that evaluate the quality of content in summaries. The output of each automated metric is compared against two manual metrics: the (modified) pyramid score, which measures summary content, and overall responsiveness, which measures a combination of content and linguistic quality.
Experimental Configuration  In the TAC 2009 task, for each topic there are 4 reference summaries and 55 peer summaries. The task is to generate, for each peer summary, a score representing (in the semantics of the metric) the goodness of the summary content, measured either against or without the use of model summaries. A snapshot of the output produced by a metric is shown in Figure 7.1.
7.4 Generative Modeling of Reference Summaries
In Section 4.5, we observed that a formal model capturing the amount of query-bias in a summary, as distributed in the source collection, correlates highly with ROUGE scores. Based on these observations, we hypothesized that a similar modeling approach can be taken to evaluate a peer summary. We utilized the generative models discussed in Section 4.5 and devised multiple alternative methods of identifying what makes a good summary.

In Section 4.5, we describe two models based on the 'generative modeling framework', a binomial model and a multinomial model, which we used to show that automated systems are query-biased so as to perform better on ROUGE-like surface metrics. Our approach here is to use the same generative models to evaluate summaries. In the following sections, we describe how various features extracted from reference summaries can be used to model how strongly peer summaries imitate reference summaries.
We use generative modeling to model the distribution of signature terms in the source and the "likelihood of a summary being biased towards these signature terms". As before, we use two generative models, the binomial and the multinomial, both described below.
Figure 7.1 Sample output generated by an evaluation metric
7.4.1 Binomial Model
Suppose there are k words that we consider signature terms, as identified by any of the methods described in Section 7.5. The sentences in the input document collection are represented as a binomial distribution over the types of sentences. Let Ci ∈ {C0, C1} denote the classes of sentences without and with those signature terms, respectively. With each sentence s ∈ Ci in the input collection, we associate a probability p(Ci) of it being emitted into a summary.
The likelihood of a summary is then:

L[summary; p(Ci)] = [N! / (n0! n1!)] p(C0)^n0 p(C1)^n1   (7.1)
where N is the number of sentences in the summary, n0 + n1 = N, and n0 and n1 are the cardinalities of C0 and C1 in the summary.
7.4.2 Multinomial Model
Previously, we described the binomial model, in which we classified each sentence into two classes according to whether or not it is biased towards a signature term. If, however, we quantify the amount of signature-term bias in a sentence, we associate each sentence with one of k + 1 possible classes, leading to a multinomial distribution. Let Ci ∈ {C0, C1, C2, . . . , Ck} denote the levels of signature-term bias, with Ci the set of sentences each having i signature terms.

The number of sentences in each class varies greatly: C0 bags a high percentage of the sentences, and the remaining sentences are distributed among {C1, C2, . . . , Ck}. Since the distribution is highly skewed to the left, distinguishing systems based on log-likelihood scores under this model is easier and perhaps more accurate.
The likelihood of a summary is then:

L[summary; p(Ci)] = [N! / (n0! n1! · · · nk!)] p(C0)^n0 p(C1)^n1 · · · p(Ck)^nk   (7.2)
where N is the number of sentences in the peer summary, n0 + n1 + · · · + nk = N, and n0, n1, . . . , nk are the cardinalities of C0, C1, . . . , Ck in the summary.
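Eq. (7.2) can be sketched in log space; the binomial case of Eq. (7.1) is simply the two-class instance. Computing log-factorials via `lgamma` avoids overflow for realistic N; the floor `eps` on the class probabilities is an illustrative guard, not part of the model.

```python
import math
from math import lgamma

def log_likelihood(counts, probs, eps=1e-12):
    """Multinomial log-likelihood of a peer summary, per Eq. (7.2):
    counts[i] = n_i, the number of summary sentences with i signature terms;
    probs[i]  = p(C_i), the emission probability estimated from the input."""
    n_total = sum(counts)
    ll = lgamma(n_total + 1)                  # log N!
    for n_i, p_i in zip(counts, probs):
        ll -= lgamma(n_i + 1)                 # - log n_i!
        ll += n_i * math.log(max(p_i, eps))   # + n_i * log p(C_i)
    return ll
```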
7.5 Signature Terms
The likelihood of certain characteristics under the binomial or multinomial model shows how well those characteristics of the input have been captured in a summary. For our approach, we need certain keywords from the reference summaries that are considered very important for the topic/query combination. We choose multiple alternative methods to identify such signature terms, listed here:
1. Query terms
2. Model consistency
3. Part-Of-Speech (POS)
7.5.1 Query Terms
If we consider query terms as the characteristics that discriminate important sentences from unimportant ones, we obtain the likelihood of a summary emitting a query-biased sentence. We showed earlier, in Section 4.5, that such a likelihood has very high system-level correlation with ROUGE scores. Since ROUGE correlates highly with manual evaluations ('pyramid evaluation' or 'overall responsiveness'), a naïve assumption is that likelihood modeling of query-bias would correlate well with manual evaluations. This assumption led us to use this method as a baseline for our experiments; our baselines for this work are explained in Section 7.6.
7.5.2 Model Consistency
The hypothesis behind this method is that a term is important if it is part of a reference summary. We obtain all the terms that are commonly agreed upon by the reference summaries; the idea is that the more reference summaries agree on a term, the more important it is. This is based on the assumption that word-level importance adds up towards sentence inclusion. Since 4 reference summaries are available for each topic, we can use reference agreement in two ways:

• total agreement
• partial agreement

Total agreement  In the case of total agreement, only the words that occur in all reference summaries are considered important. This case leads to a single run, which we call 'total-agreement'.

Partial agreement  In the case of partial agreement, words that occur in at least k reference summaries are considered important. Since there are 4 reference summaries per topic, a term is considered a 'signature term' if it occurs in k of those 4 reference summaries. There were a total of 3 runs in this case: 'partial-agreement-1', 'partial-agreement-2' and 'partial-agreement-3'.
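The agreement runs can be sketched as follows; simple whitespace tokenization and lowercasing are illustrative simplifications, not the preprocessing actually used.

```python
from collections import Counter

def agreement_terms(reference_summaries, k):
    """Terms occurring in at least k of the reference summaries.
    With 4 references, k = 4 gives 'total-agreement' and
    k = 1..3 give the 'partial-agreement-k' runs."""
    presence = Counter()
    for ref in reference_summaries:
        for term in set(ref.lower().split()):  # count each ref at most once
            presence[term] += 1
    return {t for t, c in presence.items() if c >= k}
```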
7.5.3 POS Features
We hypothesized that certain types of words, based on the parts of speech they belong to, could be more informative than other words, and that by modeling their occurrence in peer summaries we are defining the informativeness of the peers with respect to the models.

Part-of-speech tagger  Traditional grammar classifies words into eight parts of speech: the verb, the noun, the adjective, the pronoun, the adverb, the preposition, the conjunction and the interjection. Each part of speech explains not what a word is, but how the word is used; in fact, the same word can be a noun in one sentence and a verb or adjective in another. We have used the Penn Treebank tag-set [Marcus et al., 1993] for our purposes, and the Stanford POS tagger [Toutanova and Manning, 2000, Toutanova et al., 2003] for automated tagging in these experiments.
Tag Subset Selection – feature selection  Based on an analysis of how each POS tag performs at the task, we selectively combine the set of features. We used the following POS-tag features: NN, NNP, NNPS, VB, VBN, VBD, CD, SYMB, and their combinations. We experimented with several combinations of these features and zeroed in on a final list of combinations that form the runs described in this work. The final list of runs comprises some of the individual POS-tag features and some combinations:
• NN
• NNP
• NNPS
• NOUN – A combination of NN, NNP and NNPS features.
• VB
• VBN
• VBD
• VERB – A combination of VB, VBN and VBD features.
• CD
• SYMB
• MISC – A combination of CD and SYMB features.
• ALL – A combination of NOUN, VERB and MISC features.
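The run definitions above can be sketched as tag-set lookups, assuming tagging has already been done by any Penn-Treebank-style tagger. Mapping the thesis's SYMB label to the Penn tag SYM is our assumption.

```python
# Tag sets for each run; composite runs are unions of the individual ones.
TAG_RUNS = {
    "NN": {"NN"}, "NNP": {"NNP"}, "NNPS": {"NNPS"},
    "NOUN": {"NN", "NNP", "NNPS"},
    "VB": {"VB"}, "VBN": {"VBN"}, "VBD": {"VBD"},
    "VERB": {"VB", "VBN", "VBD"},
    "CD": {"CD"}, "SYMB": {"SYM"},
    "MISC": {"CD", "SYM"},
}
TAG_RUNS["ALL"] = TAG_RUNS["NOUN"] | TAG_RUNS["VERB"] | TAG_RUNS["MISC"]

def pos_signature_terms(tagged_tokens, run):
    """Keep tokens whose tag falls in the selected run's tag set.
    tagged_tokens is a list of (token, penn_tag) pairs."""
    wanted = TAG_RUNS[run]
    return {tok.lower() for tok, tag in tagged_tokens if tag in wanted}
```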
7.6 Experiments and Evaluations
Our experimental setup is defined primarily by how signature terms are identified. We detailed several methods of identifying signature terms in Section 7.5; for each method, we have one or more runs, as described earlier.
Baselines  Apart from the set of runs described in Section 7.5, we use the following two baselines.

• Binomial modeling of query terms. This approach uses the binomial model described in Section 7.4 to obtain the likelihood that a system generates summaries comprising sentences that contain query terms.

• Multinomial modeling of query terms. This baseline uses the multinomial model described in Section 7.4 to obtain the multinomial likelihood that a system generates summaries comprising sentences that contain query terms. This model distinguishes sentences containing a single query term from sentences containing two query terms, and so on.
Datasets  The experiments reported here were performed on the TAC 2009 update summarization datasets, which have 44 topics, each with 55 system summaries and 4 human reference summaries. Since in our methods there is no clear way to distinguish the evaluation of cluster A's summary from cluster B's – we do not evaluate the 'update' aspect of a summary – we effectively have 88 topics to evaluate on.
Evaluations The success of these new summarization evaluation metrics is judged by how well they correlate with manual evaluations. In Section 3.2 we described how pyramid evaluations and content responsiveness evaluations are performed. Despite the complexity involved, this task reduces to a simpler problem, that of information ordering: we have a reference ordering of systems, and each metric produces its own ordering of those systems. Comparing one ordering of information with another is a fairly well understood task, and we use correlations between the manual metrics and the metrics proposed in this work to show how well our metrics imitate human evaluations in producing a similar ordering of systems. We use Pearson's correlation coefficient over the system-level average scores produced by our metrics and by the manual methods.
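The correlation computation itself is standard; a minimal sketch (the function name and example scores are ours, not from the thesis):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient between two lists of scores,
    e.g. system-level averages from an automated metric and from a
    manual evaluation such as the pyramid method."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linearly related score lists correlate at 1.0.
assert abs(pearson([1, 2, 3], [2, 4, 6]) - 1.0) < 1e-9
```

A correlation near 1 means the automated metric orders the systems almost exactly as the manual evaluation does.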
7.7 Results
Our target in these focused experiments was to create alternatives to the content evaluation metrics (the pyramid method and overall responsiveness), which are 'too expensive', 'non-replicable', or both. The important question to ask is: which of these two metrics are we trying to imitate? It is unlikely that a single automated evaluation measure could correctly reflect both readability and content responsiveness, since they represent form and content, separate qualities of a summary that require different measures. We chose to imitate content, since having better content in a summary is more important than having a readable summary.
In Tables 7.1 and 7.2 we present system-level Pearson's correlations between the manual Pyramid scores and the scores produced by our metrics, as well as by the time-tested automated evaluation metrics ROUGE-SU4 and Basic Elements (BE). The tables also include correlations with the manual Overall Responsiveness measure, which reflects both content and form. We will observe that the correlations are much higher with respect to pyramids than with overall responsiveness; this is because our approach tries to capture how well the content of model summaries is reflected in system summaries.
7.7.1 Discussion
We use two separate settings for displaying results: an AllPeers case and a NoModels case. The AllPeers case consists of the scores returned by the metric for all the summarizers (automated and human), while in the NoModels case only automated summarizers are scored using the evaluation metrics. This setup helps distinguish two kinds of metrics:
• Metrics that are able to differentiate human summarizers from automated ones.
• Metrics that are able to rank the automated summarizers in the desired order.
Results have shown that no single metric is good at distinguishing everything; however, they also show that certain types of keywords have been instrumental in providing the key distinguishing power to the metric. (We have excluded NNPS and SYMB from this analysis since they did not have enough samples in the test set to yield consistent results.) For example, VERB and NOUN features have been the key contributors to the ALL run. As an interesting side note, we observe that having a high number of 'significant' signature-terms seems to be better than a low number of 'strong' signature-terms, as seen from the experiments on total-agreement and partial-agreement. The most important result of our approach is that our method correlated very highly with "overall responsiveness", which again is a very good sign for an evaluation metric.
7.8 Chapter Summary
In this chapter, we argued for the need for alternative 'automated' summarization evaluation systems for both content and readability. In the context of the TAC AESOP (Automatically Evaluating Summaries Of Peers) task, we described the problems with content evaluation metrics and how a good metric must behave. We modeled the problem as an information ordering problem; our approach (and indeed others) should now be able to rank systems (and possibly human summarizers) in the same order as human evaluation would have produced. We showed how a well-known generative model can be used to create automated evaluation systems comparable to the state-of-the-art. Our method is based on a multinomial distribution of key-terms (or signature terms) in document collections, and on how these terms are captured in peers.
We used two types of signature-terms to model the evaluation metrics. The first is based on the POS tags of important terms in a model summary; the second is based on how much information the reference summaries share among themselves. Our results show
that verbs and nouns are key contributors to our best run, which was dependent on various individual features. Another important observation was that all the metrics were consistent
in that they produced similar results for both cluster A and cluster B in the context of update
summaries. The most startling result is that, in comparison with the automated evaluation metrics currently in use (ROUGE, Basic Elements), our approach has been very good at capturing "overall responsiveness" in addition to pyramid-based manual scores.
RUN                            Pyramid               Responsiveness
                            AllPeers  NoModels     AllPeers  NoModels
High Baselines
  ROUGE-SU4                  0.734     0.921        0.617     0.767
  Basic Elements (BE)        0.586     0.857        0.456     0.692
Baselines
  Binom(query)               0.217     0.528        0.163     0.509
  Multinom(query)            0.117     0.523        0.626     0.514
Experimental Runs: POS based
  NN                         0.909     0.867        0.853     0.766
  NNP                        0.666     0.504        0.661     0.463
  NOUN                       0.923     0.882        0.870     0.779
  VB                         0.913     0.820        0.877     0.705
  VBN                        0.931     0.817        0.929     0.683
  VBD                        0.944     0.859        0.927     0.698
  VERB                       0.972     0.902        0.952     0.733
  CD                         0.762     0.601        0.757     0.561
  MISC                       0.762     0.601        0.757     0.561
  ALL                        0.969     0.913        0.934     0.802
Experimental Runs: Model Consistency/Agreement
  total-agreement            0.727     0.768        0.659     0.682
  partial-agreement-3        0.867     0.856        0.813     0.757
  partial-agreement-2        0.936     0.893        0.886     0.791
  partial-agreement-1        0.966     0.895        0.930     0.768
Table 7.1 Cluster A Results
RUN                            Pyramid               Responsiveness
                            AllPeers  NoModels     AllPeers  NoModels
High Baselines
  ROUGE-SU4                  0.586     0.940        0.564     0.729
  Basic Elements (BE)        0.629     0.924        0.447     0.694
Baselines
  Binom(query)               0.210     0.364        0.178     0.372
  Multinom(query)           -0.004     0.361       -0.020     0.446
Experimental Runs: POS based
  NN                         0.908     0.845        0.877     0.788
  NNP                        0.646     0.453        0.631     0.380
  NOUN                       0.909     0.848        0.878     0.783
  VB                         0.872     0.871        0.875     0.742
  VBN                        0.934     0.873        0.944     0.720
  VBD                        0.922     0.909        0.914     0.718
  VERB                       0.949     0.951        0.942     0.784
  CD                         0.807     0.599        0.800     0.497
  MISC                       0.807     0.599        0.800     0.497
  ALL                        0.957     0.921        0.931     0.793
Experimental Runs: Model Consistency/Agreement
  total-agreement            0.811     0.738        0.808     0.762
  partial-agreement-3        0.901     0.839        0.882     0.806
  partial-agreement-2        0.949     0.898        0.924     0.817
  partial-agreement-1        0.960     0.903        0.936     0.763
Table 7.2 Cluster B Results
Chapter 8
Conclusions
In this thesis, we have systematically studied four aspects of text summarization systems. We examined the role of query-bias in query-focused multi-document summarization systems. Then, we built upon the position hypothesis to describe a baseline algorithm for 'update summarization' and other short-summary tasks. Next, we showed that a simple signature-term based context adjustment allows standard language modeling approaches to be applied to the update summarization task. Finally, we described a generative-modeling based automated intrinsic evaluation framework that models the distribution of signature-terms across summaries based on their distribution in the source collections. Section 8.1 describes the contributions of this thesis and Section 8.2 follows with a detailed discussion of foreseeable future work.
8.1 Contributions of this Thesis
Automated text summarization deals with condensing a source text into a representation that retains its most informative and (in the case of query-focused summaries) relevant pieces of information. In this thesis, we researched automated text summarization from four angles:
1. Impact of query-bias on summarization
2. Simple and strong baselines for update summarization
3. Language modeling extension to update summarization
4. An automated intrinsic content evaluation measure
Our experiments in Chapter 4 clearly show that most automated systems are biased towards query-terms in trying to generate summaries that are closer to human summaries. Our key contributions in quantifying the impact of query-bias are:
• We have shown, based on various experiments, that query-bias is probably directly responsible for generating better summaries from otherwise poor algorithms.
• We further confirm, based on the likelihood of a system emitting non-query-biased sentences, that there is a strong (negative) correlation between a system's likelihood score and its ROUGE score, which suggests that systems improve their performance on ROUGE metrics by being biased towards the query terms. Humans, on the other hand, do not rely on query-bias. Our concern in this work was to empirically show that the notion of query-focus is apparently missing from these algorithms, and that future summarization algorithms must incorporate it by design.
• Looking at these algorithms through the sentence-classifier view of summarization (see Figure 4.4), we observe that most systems ignore sentences that do not contain query-terms. Discarding sentences based on such naive surface features (essentially 'term-bias') means that much of the informative content among the non-biased sentences is lost. Hence, there is a need to examine the behavior of each algorithm along these lines before such issues arise at a later stage.
• Our results also underscore the differences between human- and machine-generated summaries. When asked to produce query-focused summaries, humans do not rely to the same extent on the repetition of query terms.
In Chapter 5 we derived a simple baseline algorithm to identify key sentences for inclusion in a summary based on a 'sub-optimal position policy'. The key contributions of this work are:
• The use of a new data source built on the pyramid annotation data. This is important since this resource, despite being available for 3-4 years now, has rarely been put to use.
• We distinguish small and large documents when deriving the position policy, showing that small documents tend to be more informative and focused than large documents.
• We described the Sub-optimal Sentence Position Policy (SPP) based on pyramid annotation data and implemented the SPP as an algorithm, showing that a position policy formed this way is a good representative of the genre and thus performs well above median performance.
• We further described the baselines used in summarization evaluation and discussed the need to bring back baseline 2 (the 'SPP algorithm') as an official baseline for the update summarization task.
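The idea behind a position-policy baseline can be sketched as follows. The round-robin selection over leading sentences and the `position_baseline` helper are an illustrative simplification, not the exact SPP derived from the pyramid annotations:

```python
def position_baseline(documents, budget):
    """Sketch of a sentence-position baseline: take the sentence at
    position 1 of each document, then position 2, and so on, until
    the word budget is exhausted. `documents` is a list of documents,
    each a list of sentence strings."""
    summary, words, pos = [], 0, 0
    while words < budget:
        added = False
        for doc in documents:
            if pos < len(doc):
                sent = doc[pos]
                if words + len(sent.split()) > budget:
                    return summary  # next sentence would overflow the budget
                summary.append(sent)
                words += len(sent.split())
                added = True
        if not added:
            break  # all documents exhausted
        pos += 1
    return summary
```

The design choice mirrors the position hypothesis: in news-like genres, earlier sentences carry disproportionately more summary-worthy content, so a pure position policy is a surprisingly strong baseline.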
We described a generic approach to the update summarization problem based on the language modeling paradigm. In Chapter 6 we argued that an update summary is a 'stream' (of text) based summary, and that a solution to the problem must lie in understanding how the current state of the stream varies from its previous states. Though the improvement in scores is small, the potential for improvement of the approach is high, and it provides a very promising framework for update summarization using language modeling. Our key contributions to the solution of this problem were:
• We showcased a solution to the update summarization problem based on language modeling approaches, by visualizing a context adjustment over the language model built on the corpus.
• We applied PHAL-based language modeling to sentence scoring and used context adjustment strategies to boost novel signature-terms and decrease the impact of stale terms.
• We observed that a simple context adjustment based on 'stream dynamics' helps generate better summaries compared to the base language-modeling approach.
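The context-adjustment idea can be sketched over a plain unigram model. This is not the PHAL model itself; the function, the `boost`/`damp` weights, and the smoothing floor are illustrative assumptions:

```python
from math import log

def adjusted_sentence_score(sentence, lm_probs, novel, stale,
                            boost=2.0, damp=0.5):
    """Sketch of context adjustment over a unigram language model:
    probabilities of novel signature terms are boosted and those of
    stale terms damped before scoring a sentence. `lm_probs` maps
    terms to corpus probabilities; unseen terms get a small floor."""
    score = 0.0
    terms = sentence.split()
    for term in terms:
        p = lm_probs.get(term, 1e-6)
        if term in novel:
            p *= boost    # reward terms new to the current stream state
        elif term in stale:
            p *= damp     # penalise terms already covered earlier
        score += log(p)
    return score / len(terms)  # length-normalised log-probability
```

Under this sketch, a sentence carrying novel signature-terms outscores the same sentence without the adjustment, which is exactly the behavior the chapter attributes to 'stream dynamics'.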
We argued for the need for alternative 'automated' summarization evaluation systems for both content and readability. We described the problems with content evaluation metrics and the characteristics a good automated evaluation metric must have. In Chapter 7 we modeled the problem as an information ordering problem; our approach (and indeed others) should now be able to rank systems (and possibly human summarizers) in the same order as human evaluation would have produced. The following are our key contributions towards building a summarization evaluation system:
• We showed how a well-known generative model can be applied to create automated evaluation systems comparable to the state-of-the-art.
• Our method is based on a multinomial distribution of key-terms (or signature terms) in document collections, and on how these terms are captured in peers.
• We used two types of signature-terms to model the evaluation metrics. The first is based on the POS tags of important terms in a model summary; the second is based on how much information the reference summaries share among themselves.
• Our results show that verbs and nouns are key contributors to our best run which was
dependent on various individual features.
• Another important observation was that all the metrics were consistent in that they
produced similar results for both cluster A and cluster B in the context of update
summaries.
• The most startling observation is that, in comparison with ROUGE and Basic Elements, our approach has been very good at capturing "overall responsiveness" in addition to pyramid-based manual scores.
8.2 Future Work
In this thesis we showed how query-bias affects most summarization algorithms. In hindsight, a further algorithmic study of automated summarization algorithms and approaches should help us understand where and how the bias towards query-terms is introduced. Further theoretical analysis of where in the modeling or scoring approaches such a bias is induced would help alleviate the problem and help design future approaches with foresight towards problems of query-bias.
A simple discourse-based feature, the position of a sentence, performs relatively strongly at update summarization. Against the background of this work, it is imperative to realize that position is just one possible discourse feature: while position can be highly relevant in corpora such as news, it might turn out to be insignificant in other corpora such as books. It would be interesting to see how we can capture other, possibly shallow, discourse features based on the connectedness (versus disconnectedness) of text in a document. Such features would not be of interest in single-topic documents such as news stories, but they provide key information about topical shift in conversational texts or books.
We described extensions to a language-modeling based framework for update summarization. In the future, the development of this approach into a formal strategy with a stronger mathematical basis needs to be addressed. Several issues, such as the "semantics of the 'adjusted model'", "weighting the bias", etc., have not been addressed in this work and are the concern of detailed further work. It is also possible to apply multiple, more appropriate approaches to signature-term extraction, which we ignored by choosing a simple approach based on [Lin and Hovy, 2000]. The signature extraction approach we used may not be appropriate, in particular, because of the nature of the data. We have observed that for some topics there were very few signature terms (<5), and we believe this happens because the method is applied outside its intended setting. This approach to signature extraction was built for dissimilar topic clusters; blindly using it on same-topic, time-varying clusters leaves us with just a few distinguishing terms between the two clusters, which would paralyze our otherwise sound approach. In the case of update summarization data, the similarity between the two clusters (previous and current) is more prominent than their dissimilarity, and hence other approaches may better distinguish these clusters and provide stronger, more 'significant' signature terms.
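Topic signatures in the style of [Lin and Hovy, 2000] are selected with Dunning's log-likelihood ratio, comparing a term's rate in the topic cluster against its rate in a background corpus. A minimal sketch of that statistic (the function and variable names are ours):

```python
from math import log

def llr(k1, n1, k2, n2):
    """Dunning's log-likelihood ratio for a term occurring k1 times in
    n1 tokens of the topic cluster and k2 times in n2 tokens of the
    background corpus. Large values mark candidate signature terms."""
    def ll(k, n, p):
        # binomial log-likelihood; the 0/1 boundary cases contribute 0
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return k * log(p) + (n - k) * log(1 - p)
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)          # pooled rate under the null
    return 2 * (ll(k1, n1, p1) + ll(k2, n2, p2)
                - ll(k1, n1, p) - ll(k2, n2, p))
```

A term whose cluster rate matches its background rate scores near zero, which illustrates the problem noted above: when the 'previous' and 'current' clusters are nearly identical, almost every term scores near zero and few signature terms survive.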
A novel framework for automated content evaluation was described in our work. Our framework relies on two sub-problems: finding 'signature-terms', and the generative modeling aspects. In this work we have shown that such a model is promising, using simple methods for signature-term identification once we fix on a multinomial model. It would be interesting to find other methods of signature-term identification that could bring the performance of the method closer to human levels of judgment. The evaluation paradigm described in this work assumes that human references are available; another interesting line of work would be how to evaluate system summaries when no reference summaries are available. Looking further ahead, there is a pressing need to distinguish "relevant content extraction" from "relevant summaries", the difference being the form of summaries: the way they are presented coherently. It would be very relevant to define algorithms that correlate strongly with linguistic quality metrics.
8.3 Publications
1. Query-Focused Summaries or Query-Biased Summaries? In the proceedings of the joint conference of the 47th Annual Meeting of the Association for Computational Linguistics (ACL) and the 4th International Joint Conference on Natural Language Processing (IJCNLP), ACL-IJCNLP 2009.
2. Sentence Position revisited: A robust light-weight Update Summarization ‘base-
line’ Algorithm. In the proceedings of the HLT-NAACL workshop on cross lan-
guage information access (CLIAWS3), 2009.
3. IIIT Hyderabad at TAC 2008. In the working notes of Text Analysis Conference
(TAC) at the joint meeting of the annual conferences of TAC and TREC, 2008.
4. On {Alternative} Automated Content Evaluation Measures. In the working notes
of Text Analysis Conference (TAC) at the joint meeting of the annual conferences of
TAC and TREC, 2009.
8.4 Unpublished Manuscripts
1. GEMS: Generative Modeling for Evaluation of Summaries. In submission.
Bibliography
[Allan et al., 2001] Allan, J., Gupta, R., and Khandelwal, V. (2001). Topic models for summarizing novelty. In Proceedings of the Workshop on Language Modeling and Information Retrieval, pages 66–71.
[Amini and Usunier, 2007] Amini, M. R. and Usunier, N. (2007). A contextual query expansion approach by term clustering for robust text summarization. In the proceedings of the Document Understanding Conference.
[Ani et al., 2005] Ani, A. H., Nenkova, A., Passonneau, R., and Rambow, O. (2005). Automation of summary evaluation by the pyramid method. In Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP), page 226.
[Barzilay, 1997] Barzilay, R. (1997). Lexical chains for summarization. Master's thesis.
[Barzilay and Elhadad, 1997] Barzilay, R. and Elhadad, M. (1997). Using lexical chains for text summarization. In Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization, pages 10–17.
[Barzilay and Lapata, 2008] Barzilay, R. and Lapata, M. (2008). Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1):1–34.
[Barzilay and Lee, 2004] Barzilay, R. and Lee, L. (2004). Catching the drift: Probabilistic content models, with applications to generation and summarization. In Dumais, S., Marcu, D., and Roukos, S., editors, HLT-NAACL 2004: Main Proceedings, pages 113–120, Boston, Massachusetts, USA. Association for Computational Linguistics.
[Baxendale, 1958] Baxendale, P. B. (1958). Machine-made index for technical literature – an experiment. IBM Journal of Research and Development, 2(Non-topical Issue).
[Berger and Mittal, 2000] Berger, A. and Mittal, V. O. (2000). Query-relevant summarization using FAQs. In ACL '00: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 294–301, Morristown, NJ, USA. Association for Computational Linguistics.
[Bosma, 2005] Bosma, W. (2005). Extending answers using discourse structures. In Saggion, H. and Minel, J. L., editors, RANLP Workshop on Crossing Barriers in Text Summarization Research, pages 2–9. Incoma Ltd.
[Bysani et al., 2009] Bysani, P., Bharat, V., and Varma, V. (2009). Modeling novelty and feature combination using support vector regression for update summarization. In 7th International Conference on Natural Language Processing. NLP Association of India.
[Chae and Nenkova, 2009] Chae, J. and Nenkova, A. (2009). Predicting the fluency of text with shallow structural features: Case studies of machine translation and human-written text. In EACL, pages 139–147. The Association for Computational Linguistics.
[Conroy et al., 2004] Conroy, J. M., Schlesinger, J. D., Goldstein, J., and O'Leary, D. P. (2004). Left-brain/right-brain multi-document summarization. In the proceedings of the Document Understanding Conference (DUC) 2004.
[Copeck et al., 2006] Copeck, T., Inkpen, D., Kazantseva, A., Kennedy, A., Kipp, D., Nastase, V., and Szpakowicz, S. (2006). Leveraging DUC. In proceedings of DUC 2006.
[Cormen et al., 1990] Cormen, T. H., Leiserson, C. E., and Rivest, R. L. (1990). Introduction to Algorithms. The MIT Press and McGraw-Hill.
[Cremmins, 1982] Cremmins, E. T. (1982). The Art of Abstracting. ISI Press, Philadelphia.
[Dang, 2005] Dang, H. T. (2005). Overview of DUC 2005. In proceedings of the Document Understanding Conference.
[Daume III and Marcu, 2004] Daume III, H. and Marcu, D. (2004). A phrase-based HMM approach to document/abstract alignment. In Lin, D. and Wu, D., editors, Proceedings of EMNLP 2004, pages 119–126, Barcelona, Spain. Association for Computational Linguistics.
[Dolan, 1980] Dolan, D. (1980). Locating main idea in history textbooks. Journal of Reading, pages 135–140.
[Dunning, 1993] Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.
[Edmundson, 1969] Edmundson, H. P. (1969). New methods in automatic extracting. Journal of the ACM, 16(2):264–285.
[Endres-Niggemeyer, 1998] Endres-Niggemeyer, B. (1998). Summarizing Information. Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[Feng, 2009] Feng, L. (2009). Automatic readability assessment for people with intellectual disabilities. SIGACCESS Accessibility and Computing, (93):84–91.
[Feng et al., 2009] Feng, L., Elhadad, N., and Huenerfauth, M. (2009). Cognitively motivated features for readability assessment. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 229–237, Athens, Greece. Association for Computational Linguistics.
[Gupta et al., 2007] Gupta, S., Nenkova, A., and Jurafsky, D. (2007). Measuring importance and query relevance in topic-focused multi-document summarization. In ACL Companion Volume, 2007.
[Halliday and Hasan, 1976] Halliday, M. and Hasan, R. (1976). Cohesion in English. Longman publishers.
[Harman and Over, 2002] Harman, D. and Over, P. (2002). The DUC summarization evaluations. In Proceedings of the Second International Conference on Human Language Technology Research, pages 44–51, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
[Harman and Over, 2004] Harman, D. and Over, P. (2004). The effects of human variation in DUC summarization evaluation. In Moens, M.-F. and Szpakowicz, S., editors, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 10–17, Barcelona, Spain. Association for Computational Linguistics.
[Hiemstra, 1998] Hiemstra, D. (1998). A linguistically motivated probabilistic model of information retrieval. In ECDL '98: Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries, pages 569–584, London, UK. Springer-Verlag.
[Hiemstra, 2009] Hiemstra, D. (2009). Information retrieval models.
[Hirschman and Mani, 2001] Hirschman, L. and Mani, I. (2001). Evaluation.
[Hovy et al., 2006] Hovy, E., Lin, C.-Y., Zhou, L., and Fukumoto, J. (2006). Automated summarization evaluation with basic elements. In Proceedings of the Fifth Conference on Language Resources and Evaluation (LREC).
[J et al., 2005] J, J., Pingali, P., and Varma, V. (2005). A relevance-based language modeling approach to DUC 2005.
[Jagarlamudi, 2006] Jagarlamudi, J. (2006). Query-based multi-document summarization using language. Master's thesis, IIIT Hyderabad, India.
[Jing, 2001] Jing, H. (2001). Cut-and-Paste Text Summarization. PhD thesis.
[Jing et al., 1998] Jing, H., Barzilay, R., McKeown, K., and Elhadad, M. (1998). Summarization evaluation methods: Experiments and analysis. In AAAI Symposium on Intelligent Summarization, pages 60–68.
[Jones, 1972] Jones, K. S. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11–21.
[Jones, 1993] Jones, K. S. (1993). What might be in a summary? In Information Retrieval, pages 9–26.
[Jones, 1998] Jones, K. S. (1998). Automatic summarising: Factors and directions. In Advances in Automatic Text Summarization, pages 1–12. MIT Press.
[Jones and Galliers, 1996] Jones, K. S. and Galliers, J. R. (1996). Evaluating Natural Language Processing Systems: An Analysis and Review. Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[Kastner and Monz, 2009] Kastner, I. and Monz, C. (2009). Automatic single-document key fact extraction from newswire articles. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 415–423, Athens, Greece. Association for Computational Linguistics.
[Katragadda et al., 2009] Katragadda, R., Pingali, P., and Varma, V. (2009). Sentence position revisited: A robust light-weight update summarization 'baseline' algorithm. In Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies (CLIAWS3), pages 46–52, Boulder, Colorado. Association for Computational Linguistics.
[Kieras, 1985] Kieras, D. E. (1985). Thematic process in the comprehension of technical prose.
[Kumar, 2009] Kumar, C. (2009). Information loss based framework for document summarization. Master's thesis, Hyderabad, India.
[Kupiec et al., 1995] Kupiec, J., Pedersen, J., and Chen, F. (1995). A trainable document summarizer. In the proceedings of ACM SIGIR '95, pages 68–73. ACM.
[Lapata, 2003] Lapata, M. (2003). Probabilistic text structuring: Experiments with sentence ordering. In proceedings of the annual meeting of the Association for Computational Linguistics, pages 545–552. The Association of Computational Linguistics.
[Lapata and Barzilay, 2005] Lapata, M. and Barzilay, R. (2005). Automatic evaluation of text coherence: Models and representations. In Kaelbling, L. P. and Saffiotti, A., editors, IJCAI, pages 1085–1090. Professional Book Center.
[Lawrie, 2003] Lawrie, D. J. (2003). Language models for hierarchical summarization. PhD thesis. Director: Croft, W. Bruce.
[Li et al., 2007] Li, J., Sun, L., Kit, C., and Webster, J. (2007). A query-focused multi-document summarizer based on lexical chains. In DUC '07: Document Understanding Conference, 2007.
[Lin and Hovy, 2000] Lin, C. and Hovy, E. (2000). The automated acquisition of topic signatures for text summarization.
[Lin, 2004a] Lin, C.-Y. (2004a). Looking for a few good metrics: Automatic summarization evaluation – how many samples are enough? In the proceedings of NTCIR Workshop 4. ACL.
[Lin, 2004b] Lin, C.-Y. (2004b). ROUGE: A package for automatic evaluation of summaries. In the proceedings of the ACL Workshop on Text Summarization Branches Out. ACL.
[Lin and Hovy, 1997] Lin, C.-Y. and Hovy, E. (1997). Identifying topics by position. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 283–290. ACL.
[Lin and Hovy, 2003a] Lin, C.-Y. and Hovy, E. (2003a). Automatic evaluation of summaries using n-gram co-occurrence statistics. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 71–78, Morristown, NJ, USA. Association for Computational Linguistics.
[Lin and Hovy, 2003b] Lin, C.-Y. and Hovy, E. (2003b). The potential and limitations of automatic sentence extraction for summarization. In Proceedings of the HLT-NAACL 03 Text Summarization Workshop, pages 73–80, Morristown, NJ, USA. Association for Computational Linguistics.
[Lin et al., 2003] Lin, J., Quan, D., Sinha, V., Bakshi, K., Huynh, D., Katz, B., and Karger, D. R. (2003). The role of context in question answering systems. In the proceedings of CHI '04. ACM.
[Louis and Nenkova, 2009] Louis, A. and Nenkova, A. (2009). Automatically evaluating content selection in summarization without human models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 306–314, Singapore. Association for Computational Linguistics.
[Luhn, 1958] Luhn, H. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2):159–165, April 1958.
[Maheedhar Kolla, 2007] Kolla, M., Vechtomova, O., and Clarke, C. L. A. (2007). Comparison of models based on summaries or documents towards extraction of update summaries. In DUC '07: Document Understanding Conference, 2007.
[Mani, 2001] Mani, I. (2001). Summarization Evaluation: An Overview. Pergamon Press, Inc., Tarrytown, NY, USA.
[Marcu, 1997] Marcu, D. (1997). From discourse structure to text summaries. pages 82–88.
[Marcu, 1999a] Marcu, D. (1999a). The automatic construction of large-scale corpora for summarization research. In University of California, Berkeley, pages 137–144.
[Marcu, 1999b] Marcu, D. (1999b). Discourse trees are good indicators of importance in text. In Advances in Automatic Text Summarization, pages 123–136. The MIT Press.
[Marcu, 2000] Marcu, D. (2000). The Theory and Practice of Discourse Parsing and Sum-marization. MIT Press, Cambridge, MA, USA.
[Marcus et al., 1993] Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993).Building a large annotated corpus of english: the penn treebank. Comput. Linguist.,19(2):313–330.
[McCallum, 1996] McCallum, A. K. (1996). Bow: A toolkit for statistical languagemodeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/ mccal-lum/bow.
[Miller et al., 1999] Miller, D. R. H., Leek, T., and Schwartz, R. M. (1999). A hiddenmarkov model information retrieval system. In SIGIR ’99: Proceedings of the 22nd an-nual international ACM SIGIR conference on Research and development in informationretrieval, pages 214–221, New York, NY, USA. ACM.
[Molina, 1995] Molina, M. P. (1995). Documentary abstracting: toward a methodologicalmodel. J. Am. Soc. Inf. Sci., 46(3):225–234.
[Morris and Hirst, 1991] Morris, J. and Hirst, G. (1991). Lexical cohesion computed bythesaural relations as an indicator of the structure of text. Comput. Linguist., 17(1):21–48.
[Mutton et al., 2007] Mutton, A., Dras, M., Wan, S., and Dale, R. (2007). Gleu: Automaticevaluation of sentence-level fluency. In ACL. The Association for Computer Linguistics.
[Nenkova, 2005] Nenkova, A. (2005). Automatic text summarization of newswire:Lessons learned from the document understanding conference. In Veloso, M. M. andKambhampati, S., editors, AAAI, pages 1436–1441. AAAI Press / The MIT Press.
[Nenkova et al., 2007] Nenkova, A., Passonneau, R., and McKeown, K. (2007). The pyra-mid method: Incorporating human content selection variation in summarization evalua-tion. In ACM Trans. Speech Lang. Process., volume 4, New York, NY, USA. ACM.
[Nenkova et al., 2006] Nenkova, A., Vanderwende, L., and McKeown, K. (2006). A com-positional context sensitive multi-document summarizer: exploring the factors that in-fluence summarization. In SIGIR ’06: Proceedings of the 29th annual internationalACM SIGIR conference on Research and development in information retrieval, pages573–580, New York, NY, USA. ACM.
[Paijmans, 1994] Paijmans, J. J. (1994). Relative weights of words in documents. In L.G.M. Noordman and W.A.M. de Vroomen, editors, Conference proceedings of STINFON, pages 195–208.
[Papineni et al., 2002] Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In ACL ’02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA. Association for Computational Linguistics.
[Pitler and Nenkova, 2008] Pitler, E. and Nenkova, A. (2008). Revisiting readability: Aunified framework for predicting text quality. In EMNLP, pages 186–195. ACL.
[Ponte and Croft, 1998] Ponte, J. and Croft, W. B. (1998). A language modeling approach to information retrieval. In SIGIR ’98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 275–281, New York, NY, USA. ACM.
[Ponte, 1998] Ponte, J. M. (1998). A language modeling approach to information retrieval. PhD thesis, University of Massachusetts, Amherst, MA, USA.
[Radev et al., 2004] Radev, D., Allison, T., Blair-Goldensohn, S., Blitzer, J., Çelebi, A., Dimitrov, S., Drabek, E., Hakim, A., Lam, W., Liu, D., Otterbacher, J., Qi, H., Saggion, H., Teufel, S., Winkel, A., and Zhang, Z. (2004). MEAD - a platform for multidocument multilingual text summarization. In LREC 2004.
[Rath et al., 1961] Rath, G., Resnick, A., and Savage, R. (1961). The formation of abstracts by the selection of sentences: Part 1: Sentence selection by man and machines. In Journal of American Documentation, pages 139–208.
[Salton et al., 1997] Salton, G., Singhal, A., Mitra, M., and Buckley, C. (1997). Automatictext structuring and summarization. Inf. Process. Manage., 33(2):193–207.
[Schiffman, 2007] Schiffman, B. (2007). Summarization for Q&A at Columbia University for DUC 2007. In DUC’07: Document Understanding Conference, 2007.
[Schilder and Kondadadi, 2008] Schilder, F. and Kondadadi, R. (2008). FastSum: fast and accurate query-based multi-document summarization. In HLT ’08: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, pages 205–208, Morristown, NJ, USA. Association for Computational Linguistics.
[Shen et al., 2007] Shen, D., Sun, J.-T., Li, H., Yang, Q., and Chen, Z. (2007). Document summarization using conditional random fields. In Proceedings of IJCAI ’07, pages 2862–2867. IJCAI.
[Silber and McCoy, 2000] Silber, H. G. and McCoy, K. F. (2000). Efficient text summarization using lexical chains. In IUI ’00: Proceedings of the 5th international conference on Intelligent user interfaces, pages 252–255, New York, NY, USA. ACM.
[Singer and Dolan, 1980] Singer, H. and Dolan, D. (1980). Reading and Learning fromText. Little Brown, Boston, Massachusetts.
[Song and Croft, 1999] Song, F. and Croft, W. B. (1999). A general language model for information retrieval. In Proceedings of the Eighth International Conference on Information and Knowledge Management. ACM.
[Sparck Jones, 2007] Sparck Jones, K. (2007). Automatic summarising: The state of theart. Information Processing and Management, 43(6):1449–1481.
[tau Yih et al., 2007] tau Yih, W., Goodman, J., Vanderwende, L., and Suzuki, H. (2007).Multi-document summarization by maximizing informative content-words. In Veloso,M. M., editor, IJCAI, pages 1776–1782.
[Toutanova et al., 2007] Toutanova, K., Brockett, C., Gamon, M., Jagarlamudi, J., Suzuki, H., and Vanderwende, L. (2007). The Pythy summarization system: Microsoft Research at DUC 2007. In Proceedings of the Document Understanding Conference.
[Toutanova et al., 2003] Toutanova, K., Klein, D., Manning, C. D., and Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In NAACL ’03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 173–180, Morristown, NJ, USA. Association for Computational Linguistics.
[Toutanova and Manning, 2000] Toutanova, K. and Manning, C. D. (2000). Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora, pages 63–70, Morristown, NJ, USA. Association for Computational Linguistics.
[Tratz and Hovy, 2008] Tratz, S. and Hovy, E. (2008). Summarization evaluation using transformed basic elements. In Proceedings of the Text Analysis Conference.
[van Halteren and Teufel, 2003] van Halteren, H. and Teufel, S. (2003). Examining theconsensus between human summaries: initial experiments with factoid analysis. InHLT-NAACL 03 Text summarization workshop, pages 57–64, Morristown, NJ, USA.Association for Computational Linguistics.
[Wan et al., 2005] Wan, S., Dale, R., and Dras, M. (2005). Searching for grammaticality: Propagating dependencies in the Viterbi algorithm. In Proceedings of the Tenth European Workshop on Natural Language Generation (ENLG-05). Association for Computational Linguistics.
[Zajic et al., 2002] Zajic, D., Dorr, B., and Schwartz, R. (2002). Automatic headline generation for newspaper stories. In Proceedings of the ACL Workshop on Automatic Summarization/Document Understanding Conference (DUC), pages 78–85. Association for Computational Linguistics.