AUTOMATED TEXT SUMMARIZATION:
ON BASELINES, QUERY-BIAS AND EVALUATIONS!
By
Rahul Katragadda
200607007
rahul [email protected]
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Master of Science (by Research) in
Computer Science & Engineering
Search and Information Extraction Lab
Language Technologies Research Center
International Institute of Information Technology
Hyderabad, India
December 2009
Copyright © 2009 Rahul Katragadda
All Rights Reserved
Dedicated to all those people, who made me believe I am better than I was.
INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY
Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Automated Text Summarization: On Baselines, Query-Bias and Evaluations!” by Rahul Katragadda (200607007), submitted in partial fulfillment for the award of the degree of Master of Science (by Research) in Computer Science & Engineering, has been carried out under my supervision and has not been submitted elsewhere for a degree.
Date Advisor:
Dr. Vasudeva Varma
Associate Professor
IIIT Hyderabad
Acknowledgements
I am grateful to my advisor, Dr Vasudeva Varma, for his advice and for his efforts in managing the Search and Information Extraction Lab of the Language Technologies Research Center, where I have had the pleasure to work for the whole duration of my MS by Research studies. I have also been fortunate to get timely advice and quick feedback in many of my initial meetings with Dr Prasad Pingali, who as my mentor taught me what it means to do research. As the administrative assistant for SIEL, Mr Mahender Kumar has saved me a great amount of paperwork; Mr Babji tirelessly took care of our system requirements; and Mr Bhupal motivated me through the years.
I warmly acknowledge all the people that have had — and would definitely continue to have — a significant impact on the way I work. Among these are Prof Karen Sparck Jones, Prof Kathy McKeown and Dr Ani Nenkova, among others. I would also like to take this opportunity to thank Dr Charles L A Clarke for those numerous interactions that brought clarity to the subject matter of this thesis. Thanks, Charlie; with access to your expertise and your belief in my work, I could move on from a rejection at ICON-08 to an acceptance at ACL-09. I owe my deepest gratitude to Prof Petri Myllymaki and Dr Wray Buntine for their continuous guidance on and off research during my internship at the CoSCo group, University of Helsinki. Thanks to the members of the CoSCo group, who showed me that research can, and should, be fun.
You are known by the company you keep and I have been fortunate to have
some great colleagues at SIEL, especially my lunch group comprising Sowmya,
Suman, Praneeth, Swathi and Santosh, who provided me comfort at all times and
chipped in with their critical review when I badly needed one. I would like to
make a special mention of Suman Kumar Chalawadi for those countless midnight discussions on query-focus, query-bias and summarization evaluations, when he
had no clue of any word I uttered and yet kept staring at the board and listening to
my words. I must acknowledge Mahesh Mohan, Sai Sathyanarayan, Samar Hussain
and Romanch Agarwal for those countless, endless discussions on all aspects of
life. I am also hugely indebted to my batchmates Ravitej and Sai Mahesh, and my
colleagues Sowmya and Swathi for constantly keeping track of my progress.
Three and a half years is a long time, and I have been blessed with some of the
best men (and women) of IIIT’s sports fraternity who helped me through my ups
and downs. I would like to thank Mr Kamalakar and the Physical Education Center
for providing me all the sports facilities that complemented my academic life. I
thank Abhilash, Ashwath and Satish for some of the most memorable moments —
those medals lying in my room and beyond — I would cherish all my life. Also
important was Mr Sitaram and family at the coffee shop who provided us with
continuous source of ‘double’ TEA. I am delighted to acknowledge the staff of
the G4S securities, especially Mehdi Hasan, Raghu, Laxman and others who have
eased our lives and provided a secure campus. I also have high regards to all the
members of house-keeping staff who took care of room no 348, Old Boys Hostel
(OBH) where I stayed for 3 full years.
Many anonymous reviewers of my papers have shaped my thinking and writing habits thus far, and the manuscript of this thesis was reviewed by Prof Kamal Karlapalem and Dr Kishore S Prahallad, who put great effort into reviewing this work; I thank them all for their time and efforts. I also thank Praveen Bysani and Vijay Bharat for their patience and understanding of my way of research, especially Praveen, for whom I had to grow to reach his expectations of a mentor.
I am extremely grateful to my parents Rajyasree and Ravi Kumar, to my sister
Roshna, to my cousin Namratha and my friend Geetha for their continuous support
and discrete (and continuous) inquiries about the progress of my studies.
And to all those who kept on asking “When will you graduate?”, this is it. :)
Synopsis
With the advent of the Internet/WWW and the proliferation of textual news, rich media, social networking, et cetera, there has been an unprecedented surge in content on the web. With this huge amount of information available on the world wide web, there is a pressing need for ‘Information Access’ systems that would help users with an information need by providing the relevant information in a concise, pertinent format. There are various modes of Information Access, including Information Retrieval, Text Mining, Machine Translation, Text Categorization, Text Summarization, et cetera. In this thesis, we study certain aspects of “Text Summarization” as a technology for ‘Information Access’.
The majority of recent work in the area of Text Summarization addresses sub-problems like Ranked Sentence Ordering, Generic Single Document Summarization, Generic Multi-Document Summarization, Query-Focused Multi-Document Summarization and Query-Focused Update Summarization. In this thesis we focus on
two tasks, specifically “Query-Focused Multi-Document Summarization (QFMDS)”
and “Query-Focused Update Summarization (QFUS)”. The QFMDS task has generated a lot of community interest, for the obvious reason of being close to a real-world application of open-domain question answering. Given a set of N relevant
documents on a general topic and a specific query (or information need) within the
topic, the task is to generate a relevant and fluent 250 word summary. The focus
of this thesis lies around four issues dealing with query-focused multi-document
summarization. They are:
1. Impact of query-bias on summarization
2. Simple and strong baselines for update summarization
3. Language modeling extension to update summarization
4. An automated intrinsic content evaluation measure
In this thesis we identify two dependent yet different terms ‘query-bias’ and
‘query-focus’ with respect to the QFMDS task and show that most of the auto-
mated summarization systems are trying to be query-biased rather than being query-
focused. In the context of this problem, we show evidence from multiple sources
to display the inherent bias introduced by the automated systems. First, we theoretically explain how a naïve classifier-based summarizer can be enhanced greatly
by biasing the algorithm to use just the query-biased sentences. Second, on an
‘information nugget’ based evaluation data, we show that most of the participat-
ing systems were query-biased. Third, we further build formal generative models,
namely binomial and multinomial models, to model the likelihood of a system be-
ing query-biased. Such a modeling revealed a high positive correlation between
‘a system being query-biased’ and ‘automated evaluation score’. Our results also
underscore the difference between human and automated summaries. We show that
when asked to produce query-focused summaries, humans do not rely to the same extent on the repetition of query terms.
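The query-bias notion discussed above can be made concrete with a toy measurement: the fraction of summary sentences that contain at least one query term. The sketch below is a minimal illustration under assumed sentence-splitting and tokenization rules, not the exact procedure used in this thesis.

```python
import re

def query_biased_fraction(summary, query_terms):
    """Fraction of summary sentences sharing at least one term with the query.

    A sentence that repeats a query term is treated as 'query-biased'.
    """
    # Naive split on sentence-final punctuation (a simplifying assumption).
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', summary.strip()) if s]
    terms = {t.lower() for t in query_terms}
    biased = sum(1 for s in sentences
                 if terms & set(re.findall(r"[a-z']+", s.lower())))
    return biased / len(sentences) if sentences else 0.0

summary = ("The court ruled on the appeal. Observers praised the decision. "
           "The appeal had been pending for years.")
print(query_biased_fraction(summary, ["appeal", "ruling"]))  # 2 of 3 sentences
```

A heavily query-biased system would score high on such a fraction, while human-written query-focused summaries would be expected to score lower.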
The QFUS task is a natural extension of the QFMDS task. In the update summarization task, a series of news stories (a news stream) on a particular topic is tracked over a period of time and made available for summarization. Two or more sets of
documents are available: an initial set, followed by multiple updated sets of news
stories. The problem is to generate query-focused multi-document summaries for
the initial set and then produce update summaries for the updated sets assuming the
user has already read the preceding sets of documents. This is a relatively new task
and our work in this context is twofold. First, we define a sentence-position-based baseline summarizer, which is genre dependent. In the context of the new task, we argue that the current baseline in the update summarization task is a poor performer in content evaluations and hence cannot help in tracking the progress of the field. A stronger baseline, such as one based on “sentence position”, would be robust and able to help track that progress. Second, we describe a generic extension
to language modeling based approaches to tailor them towards the update summa-
rization task. We show that a simple Context Adjustment (CA) based on ‘stream dynamics’ can help generate better summaries when compared to the base language modeling approach.
In Text Summarization, as in other Information Access methodologies,
evaluation is a crucial component of the system development process. In language
technologies research, Automated Evaluation is often viewed as supplementary to
the task itself since knowing how to evaluate would lead to knowing how to perform
the task. In the case of summarization, knowing how best to evaluate summaries
would help in knowing how best to summarize. Usually, manual evaluations are
used to evaluate summaries and compare performance of different systems against
each other. However, these manual evaluations are time consuming and difficult to repeat, and hence infeasible at scale. Keeping this view of text summarization and evaluation, we propose an evaluation framework based on the generative modeling paradigm.
We describe a ‘multinomial model’ built on the distribution of key-terms in docu-
ment collections and how they are captured in peers. We show that such a simple
paradigm can be used to create evaluation measures that are comparable to the
state-of-the-art evaluations used at DUC and TAC. In particular, we observed that
in comparison with other metrics, our approach has been very good at capturing “overall responsiveness”, apart from pyramid-based manual scores.
Contents
Table of Contents x
List of Tables xiii
List of Figures xiv
1 Introduction 1
1.1 Introduction to Text Summarization . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 Human Abstraction or Professional Summarizing . . . . . . . . . . 2
1.2 Different Categories of Summarization . . . . . . . . . . . . . . . . . . . . 4
1.3 Approaches to Automated Text Summarization . . . . . . . . . . . . . . . 6
1.4 Summarization Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4.1 Evaluation of Content and Readability . . . . . . . . . . . . . . . . 8
1.5 Focused Summarization Evaluation Workshops . . . . . . . . . . . . . . . 10
1.5.1 pre-DUC era . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5.2 Document Understanding Conferences (DUC) . . . . . . . . . . . 11
1.5.3 Text Analysis Conferences (TAC) . . . . . . . . . . . . . . . . . . 12
1.6 Objective and Scope of the Research . . . . . . . . . . . . . . . . . . . . . 12
1.7 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2 Problems Addressed 16
2.1 Understanding query-bias in summarization systems . . . . . . . . . . . . 16
2.2 Sentence Position Baseline for Update Summarization . . . . . . . . . . . 19
2.3 Language Modeling approaches for Update Summarization . . . . . . . . . 21
2.4 Alternative Automated Summarization Evaluations . . . . . . . . . . . . . 22

3 Related Work 24
3.1 Approaches to Text Summarization . . . . . . . . . . . . . . . . . . . . . . 24
3.1.1 Heuristic Approaches to Text Summarization . . . . . . . . . . . . 24
3.1.2 Language Modeling Approaches . . . . . . . . . . . . . . . . . . . 26
3.1.3 Linguistic structure or discourse based summarization . . . . . . . 27
3.1.4 Machine Learning Approaches . . . . . . . . . . . . . . . . . . . . 28
3.2 Summarization Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Evaluation of Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.1 ROUGE [Lin, 2004b] . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.3.2 Pyramid Evaluation [Nenkova et al., 2007] . . . . . . . . . . . . . 34
3.3.3 Other content evaluation measures . . . . . . . . . . . . . . . . . . 37
3.4 Evaluation of Readability . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.1 Manual Evaluation of Readability . . . . . . . . . . . . . . . . . . 38
3.4.2 Automated Evaluation of Readability . . . . . . . . . . . . . . . . 39
4 Impact of Query-Bias on Text Summarization 41
4.1 Introduction to Query-Bias vs Query-Focus . . . . . . . . . . . . . . . . . 43
4.1.1 Query-biased content in human summaries . . . . . . . . . . . . . 44
4.2 Theoretical Justification on query-bias affecting summarization performance 45
4.2.1 Equi-probable summarization setting . . . . . . . . . . . . . . . . 47
4.2.2 Query-biased Equi-probable summarization setting . . . . . . . . . 47
4.2.3 Performance of Equi-probable summarization . . . . . . . . . . . . 48
4.2.4 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Performance of participating systems from DUC 2007 . . . . . . . . . . . 50
4.4 Query-Bias in Summary Content Units (SCUs) . . . . . . . . . . . . . . . 52
4.5 Formalizing Query-Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5.1 The Binomial Model . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5.2 The Multinomial Model . . . . . . . . . . . . . . . . . . . . . . . 57
4.5.3 Correlation of ROUGE metrics with likelihood of query-bias . . . . 58
4.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.7 Conclusive Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.8 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5 Baselines for Update Summarization 63
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1.1 Sentence Position . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.1.2 Introduction to the Position Hypothesis . . . . . . . . . . . . . . . 68
5.2 Sub-Optimal Sentence Position Policy (SPP) . . . . . . . . . . . . . . . . . 68
5.2.1 Sentence Position Yield and Optimal Position Policy (OPP) . . . . 68
5.2.2 Documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2.3 Pyramid Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3 A Summarization Algorithm based on SPP . . . . . . . . . . . . . . . . . . 72
5.3.1 Query-Focused Multi-Document Summarization . . . . . . . . . . 73
5.3.2 Update Summarization Task . . . . . . . . . . . . . . . . . . . . . 73
5.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.4 Baselines in Summarization Tasks . . . . . . . . . . . . . . . . . . . . . . 76
5.5 Discussion and Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . 77
6 A Language Modeling Extension for Update Summarization 79
6.1 Update Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.2 Language Modeling approach to IR and Summarization . . . . . . . . . . . 81
6.2.1 Probabilistic Hyperspace Analogue to Language (PHAL) . . . . . . 81
6.3 Language Modeling Extension . . . . . . . . . . . . . . . . . . . . . . . . 82
6.3.1 Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.3.2 Signature Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.3.3 Context Adjustment . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

7 Alternative (Automated) Summarization Evaluations 88
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.2 Current Summarization Evaluations . . . . . . . . . . . . . . . . . . . . . 89
7.3 Automated Content Evaluations . . . . . . . . . . . . . . . . . . . . . . . 91
7.4 Generative Modeling of Reference Summaries . . . . . . . . . . . . . . . . 92
7.4.1 Binomial Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.4.2 Multinomial Model . . . . . . . . . . . . . . . . . . . . . . . . . . 94
7.5 Signature Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.5.1 Query Terms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
7.5.2 Model Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.5.3 POS Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
7.6 Experiments and Evaluations . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.7.1 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
7.8 Chapter Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

8 Conclusions 104
8.1 Contributions of this Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 104
8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
8.3 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.4 Unpublished Manuscripts . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

Bibliography 111
List of Tables
4.1 Percentage of query-biased content in document collections . . . . . . . . 45
4.2 Percentage of query-biased content in model summaries . . . . . . . . . . 45
4.3 Ratio of Query-bias densities . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4 ROUGE scores with confidence intervals for the equi-probable summarizer . 49
4.5 Systems that performed well in content overlap metrics . . . . . . . . . . . 51
4.6 Systems that performed well based on Linguistic Quality Evaluations . . . 51
4.7 Systems that did not perform based on content overlap metrics . . . . . . . 52
4.8 Statistics on Query-Biased Sentences . . . . . . . . . . . . . . . . . . . . . 54
4.9 Rank based on ROUGE-2, Avg likelihood of emitting query-biased sentence, ROUGE-2 scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.10 Rank, Avg likelihood of emitting a query-biased sentence, ROUGE-2 scores 58
4.11 Correlation of ROUGE measures with log-likelihood scores for automated systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.12 Correlation of ROUGE measures with log-likelihood scores for humans . . 59

5.1 Performance of SPP algorithm at QFMDS task . . . . . . . . . . . . . . . 74
5.2 Cluster-wise Performance of SPP algorithm in Update Summarization task 75
5.3 Performance of SPP algorithm based on various metrics . . . . . . . . . . . 75

6.1 Impact of Context Adjustment on PHAL . . . . . . . . . . . . . . . . . . . 87

7.1 Cluster A Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7.2 Cluster B Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
List of Figures
4.1 The process and structure of Pyramid Annotations and source mappings . . 53
4.2 SCU annotation of a source document . . . . . . . . . . . . . . . . . . . . 53
4.3 Distribution of relevant and irrelevant sentences in query-biased corpus . . 55
4.4 Impact of query-bias in the classification view of Text Summarization Process 60

5.1 An example of Highlights of a news story . . . . . . . . . . . . . . . . . . 66
5.2 Impact of location in identifying highlights . . . . . . . . . . . . . . . . . 67
5.3 A sample mapping of SCU annotation to source document sentences. An excerpt from mapping of topic D0701A of DUC 2007 QF-MDS task . . . . 70
5.4 Sentence Position Yield for small documents . . . . . . . . . . . . . . . . 71
5.5 Sentence Position Yield for large documents . . . . . . . . . . . . . . . . . 72

7.1 Sample output generated by an evaluation metric . . . . . . . . . . . . . . 93
Chapter 1
Introduction
With the advent of the internet and the proliferation of news and social networking related activity, online content in the form of the general web, newswire, social networking content, etc. seems to have no bounds. For every information need an online user might have, the world wide web (WWW) provides too much information. In this context it is often desirable to have a summary of the related documents to get a brief overview of the document collection. There are several scenarios:
1. A user enters the information need and the search engine provides the search results
along with a summary snippet for each document.
2. A user enters the information need and the search engine provides the search results
along with a summary of all the documents.
3. A user enters the information need and the search engine provides the search results
along with a summary of all the documents specifically marked by the user as being
relevant.
4. A user enters the information need and the search engine provides the search results
in clusters, with related results in the same cluster. Associated with each of these
clusters is a summary of all the documents within the cluster. And so on.
CHAPTER 1. INTRODUCTION
In the following sections we introduce the notion of automated text summarization,
briefly discuss the various approaches taken to solve the problem, and introduce summa-
rization evaluation; all of which would be further discussed in deeper details in Chapter 3.
1.1 Introduction to Text Summarization
“Text Summarization” research began in bits and pieces, in various forms, ever since [Luhn, 1958] described it as an informal idea. It was later formally developed by [Edmundson, 1969] among others, while [Jones, 1993, Jones, 1998, Sparck Jones, 2007] gave much-needed clarity to the direction in which summarization research was to flow. Psychological studies on how humans perform the task of text summarization helped the understanding of the process involved [Kieras, 1985]. Since the beginning of the TIPSTER-SUMMAC program, followed by DUC and now TAC, and equally supported by other workshops on summarization, a lot of effort has gone into creating focused problem domains and solving the partial problems that have the global goal of automated text summarization.
1.1.1 Human Abstraction or Professional Summarizing
The field of automatic summarization is fortunate in that there are still human experts who
carry out summarization as part of their professional life. These are professional abstrac-
tors, who are skilled in the art of constructing summaries. Their employers are usually
abstracting services and information publishers, in particular, producers of bibliographic
databases.
We could gain valuable insights for automated summarization by studying these expert abstractors and the way they carry out their summarization activities. The insights gained would be valuable wherever we put ourselves on the continuum from partially to fully automated summarization.
The stages of abstracting Different scholars have come up with different decompositions of the abstracting process. Cremmins [Cremmins, 1982] decomposed the process of abstracting into four ‘approximate stages’: “Focusing on the basic features”, “Identifying the information”, “Extracting, organizing, and reducing the relevant information” and “Refining the relevant information”. Pinto Molina [Molina, 1995] also suggested four stages: interpretation, selection, reinterpretation and synthesis. Here, interpretation involves reading and understanding the document; selection refers to the selection of pertinent information given the users’ needs; reinterpretation is the process of interpreting the pertinent information to ascertain facts; and synthesis refers to the process of generating the output abstract from the pertinent information.
The most detailed work on studying human abstractors comes from Endres-Niggemeyer
[Endres-Niggemeyer, 1998], who carried out an empirical study of the verbal protocols and
behavior of six abstractors. Based on her findings, she described the human summarization process as a three-stage process:
• Document Exploration. An initial exploration of the source documents to identify their genre and style, such that an appropriate scheme can be selected based on the abstractor’s prior knowledge of document types and their information structure.
• Relevance Assessment. Important and relevant information is aggregated to construct
what she calls a theme, which is a structured mental representation of what the doc-
ument is about.
• Summary Production. Theme and scheme are used to produce a final summary after applying certain cutting and pasting operations. Professional abstractors do not usually invent anything; they follow the original author as closely as possible and reintegrate the most important points of a document by drawing from a pool of standard sentence patterns accumulated over years of experience in writing abstracts.
Expert practice in automated summarization systems It is interesting to note that
many of these aspects of expert practice are used in automated summarization systems,
although most of these systems are confined to extracts rather than abstracts. However,
we do not find the entire set of strategies in a single summarizer. Some summarizers focus on leveraging specific shallow features, such as cue-phrase occurrence, location features, etc., as in [Luhn, 1958, Edmundson, 1969]; others use a discourse-level representation, as in [Marcu, 1999b]. Still others focus on cut-and-paste operations to construct summaries and to edit and revise them, as in [Jing, 2001]. It is easy to see a mapping between these different methods as part of an overall strategy as devised by Endres-Niggemeyer. For instance, what features to use and which information to use usually depends on the genre and style of the document(s), which represents the Document Exploration stage. The use of either shallow features or stronger language modeling (or any relevance ranking measure) would fall into the Relevance Assessment stage. And finally, any cut-and-paste-like operations would fall into Summary Production.
1.2 Different Categories of Summarization
The goal of Summarization has been “to extract informative content from an information source and present the most important content (possibly with added context) to the user in a condensed form and in a manner sensitive to the user’s or application’s need”. Summarization has been categorized based on various attributes; the following are a few simple examples, categorized based on the media being summarized:
• Text Summarization. Summarization of textual media in the form of digital text is referred to as “Text Summarization”. There have been over five decades of research in this area, thanks to a number of reasonably complex tasks designed at focused workshops such as the TIPSTER Text Summarization Evaluation Conference (TIPSTER-SUMMAC), the Document Understanding Conferences (DUC), etc. This thesis revolves around the “Text Summarization” problem, and the abundant literature in this
area is explained elsewhere in this thesis.
• Video/Multimedia Summarization. Text is just one form of communication medium. There is a huge amount of information available in the form of speech, images and videos that could be leveraged to generate meaningful summaries. If the source input constitutes any one of these multimedia components, that leads us to create mono-medium summaries, e.g. speech summaries or video summaries. On the other hand, if multiple types of “content formats” together form the corpora, then it leads towards generating multimedia summaries. Usually, videos do not stand by themselves and are augmented by speech, text (in transcripts), tags and so on. Since a video summarization system has these other attributes it can make use of to generate summaries, it is also called multimedia summarization.
• Speech Summarization. In recent years the amount of multimedia data available has increased rapidly, especially due to the increase of broadcasting channels and the availability of cheap and efficient mass storage. In this era of information explosion, there is a great need for systems that can distill this huge amount of data automatically with less complexity and time. Speech summarization is very useful in a wide variety of applications: in the case of broadcast news, it serves the purpose of summarizing the important contents of a show; meeting summarization helps individuals not present at a meeting to know the key issues discussed and the important decisions taken in it; and summarization of voice mail or voice messages saves an individual the time of listening to all messages.
• Opinion Summarization. “What do people think about ?”: this is perhaps the question that best describes the scope of the problem. In a world in which millions of people write their opinions about any issue in blogs, news sites, review sites or social media, the distillation of knowledge from this huge amount of unstructured information is a challenging task. Sentiment Analysis and Opinion Mining are two areas related to Natural Language Processing and Text Mining that deal with the identification of opinions and attitudes in natural language texts. Recently, Opinion Mining has received huge interest in the information systems and language technologies communities. Recent efforts in the opinion mining community have seen the emergence of focused evaluations such as the International Conference on Weblogs and Social Media (ICWSM), the Text Analysis Conference (TAC) opinion-summarization/opinion-QA task, the Workshop on Mining User Generated Content and the Workshop on Opinion Mining and Sentiment Analysis (WOMSA), among others.
1.3 Approaches to Automated Text Summarization
With the advent of strong evaluation forums such as TIPSTER-SUMMAC, DUC and now TAC, summarization research has seen a huge gain in the number of different approaches being taken, from simple heuristics-based approaches [Luhn, 1958, Katragadda et al., 2009] to language modeling based approaches [J et al., 2005, Maheedhar Kolla, 2007] to linguistic-structure-based ones, such as those based on lexical chains and discourse connectives [Li et al., 2007, Marcu, 2000]. A more detailed description of the various methods employed for automated text summarization is given in Section 3.1, but here we briefly touch on some of the general approaches available.
Heuristic Approaches to Text Summarization From the simple term-frequency-based approaches of Luhn’s early work [Luhn, 1958] to date, the simplest of heuristics often surpass the more complex and convincing theories of “importance in a discourse”: for example, Inverse Document Frequency [Jones, 1972], Document Frequency [Bysani et al., 2009], etc. Characteristics of the genre are also usually strong indicators of important text. This has been shown in [Edmundson, 1969, Lin and Hovy, 1997], and even in today’s context such an approach generates a strong baseline, as we will see later in this thesis. Some of these approaches are described in Chapters 3 and 5.
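As a rough sketch of such a frequency heuristic, the following ranks sentences by the summed frequency of their content terms, in the spirit of Luhn's term-frequency scoring (the stopword list and the scoring function are simplifying assumptions, not Luhn's exact method):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was", "for"}

def top_sentences(document, k=1):
    """Rank sentences by the summed document frequency of their content terms."""
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', document.strip()) if s]
    freq = Counter(w for w in re.findall(r"[a-z']+", document.lower())
                   if w not in STOPWORDS)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower())
                   if w not in STOPWORDS)

    return sorted(sentences, key=score, reverse=True)[:k]

doc = ("Summarization is hard. Summarization research grew steadily. "
       "Cats sleep all day.")
print(top_sentences(doc))  # the sentence richest in frequent content terms
```

Despite its simplicity, this kind of scorer remains a competitive starting point against far more elaborate systems.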
Language Modeling Approaches Language modeling based approaches have a
long history in information retrieval research and are a notable success story there. Language
modeling is one type (among many) of probabilistic modeling technique [Hiemstra, 2009].
Language models were applied to information retrieval by a number of researchers in the late
1990s [Ponte, 1998, Hiemstra, 1998, Miller et al., 1999]. They originate from the probabilistic
models of language generation developed for automatic speech recognition systems.
For information retrieval, a language model is built for each document. Under this
approach, the language model of this thesis would assign an exceptionally high probability
to the word "summarization", indicating that this thesis would be a good candidate for re-
trieval if the query contains this word. That is, in language modeling we seek the probability
of a query given a document.
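The query-likelihood idea can be sketched as follows. This is an illustrative toy with add-one (Laplace) smoothing, not the model of any particular system discussed in this thesis; real retrieval models typically use more refined smoothing such as Dirichlet or Jelinek-Mercer.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc_tokens, vocab_size, alpha=1.0):
    """log P(query | document) under a unigram document language
    model with add-alpha (Laplace) smoothing."""
    tf = Counter(doc_tokens)
    n = len(doc_tokens)
    return sum(
        math.log((tf[w] + alpha) / (n + alpha * vocab_size))
        for w in query
    )

docs = {
    "d1": "summarization systems compress text text".split(),
    "d2": "retrieval systems rank documents".split(),
}
vocab = {w for toks in docs.values() for w in toks}
query = "summarization text".split()
# Rank documents by how likely their language model is to generate the query.
ranked = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], len(vocab)),
    reverse=True,
)
```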
In the case of text summarization, a considerable body of literature builds on language
modeling. In particular, [Lawrie, 2003] used language models to identify content-bearing
words, using relative entropy (KL divergence) to compare the language model of a topic set
against that of a generic English corpus. Earlier, [Berger and Mittal, 2000] generated
summaries by creating a language model of the document, selecting terms that should oc-
cur in a summary, and then combining the terms using a trigram language model to gen-
erate readable summaries. More recently, substantial work on update summarization has been
reported by [Schiffman, 2007, Maheedhar Kolla, 2007, Bysani et al., 2009]. A relevance based
language modeling approach to text summarization was applied in [Jagarlamudi, 2006]
and is described in Chapters 3 and 6.
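Lawrie's idea of comparing a topic language model against a background model can be illustrated with a small sketch. The per-word KL contributions below are a simplified stand-in for her actual formulation: words whose topic-model probability far exceeds their background probability are treated as content-bearing.

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram language model over a fixed vocabulary."""
    tf = Counter(tokens)
    n = len(tokens)
    return {w: (tf[w] + alpha) / (n + alpha * len(vocab)) for w in vocab}

def kl_contributions(topic_lm, background_lm):
    """Per-word contribution to KL(topic || background); words with
    large positive contributions are treated as content-bearing."""
    return {
        w: p * math.log(p / background_lm[w])
        for w, p in topic_lm.items()
    }

topic = "flood relief flood damage relief workers".split()
background = "the economy the market flood report".split()
vocab = set(topic) | set(background)
contrib = kl_contributions(
    unigram_lm(topic, vocab), unigram_lm(background, vocab)
)
top_word = max(contrib, key=contrib.get)
```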
Linguistic structure or discourse based summarization There have been elaborate
theories of discourse structure based on centering theory and Rhetorical Structure Theory
(RST), and there have been attempts to capitalize on them to build summarization systems.
In particular, Marcu's experiments [Marcu, 1999b] indicated that discourse trees are good
indicators of importance in text. [Silber and McCoy, 2000, Barzilay and Elhadad, 1997,
Li et al., 2007] used lexical chains to combine surface features with discourse-like features
to generate readable summaries.
Machine Learning Approaches Recent advances in machine learning have been
adapted to the summarization problem through the years, and locational features have been
consistently used to identify the salience of a sentence. Representative work on 'learning'
sentence extraction includes training a binary classifier [Kupiec et al., 1995], train-
ing a Markov model [Conroy et al., 2004], training a CRF [Shen et al., 2007], and learning
a pairwise ranking of sentences [Toutanova et al., 2007].
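A minimal sketch in the spirit of such feature-based sentence extraction is given below. The features and the hand-set weights are purely illustrative; in [Kupiec et al., 1995] the corresponding weights are learned from labelled document/extract pairs.

```python
def sentence_features(sentence, position, doc_len,
                      cue_phrases=("in summary", "in conclusion")):
    """Features in the spirit of [Kupiec et al., 1995]: locational,
    length and cue-phrase indicators for one sentence."""
    words = sentence.lower().split()
    return {
        "lead": 1.0 if position == 0 else 0.0,            # first sentence
        "position": 1.0 - position / max(doc_len - 1, 1),  # earlier is better
        "long": 1.0 if len(words) > 5 else 0.0,            # length cut-off
        "cue": 1.0 if any(c in sentence.lower() for c in cue_phrases) else 0.0,
    }

def score(features, weights):
    return sum(weights[k] * v for k, v in features.items())

# Hand-set, purely illustrative weights; a real system learns these.
weights = {"lead": 2.0, "position": 1.0, "long": 0.5, "cue": 1.5}
doc = [
    "Storms battered the coast overnight causing floods",
    "Power was lost in several districts",
    "In conclusion officials promised a full review of defences",
]
scores = [
    score(sentence_features(s, i, len(doc)), weights)
    for i, s in enumerate(doc)
]
```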
1.4 Summarization Evaluation
Evaluation offers many advantages to automatic summarization. As has been the case with
other language understanding technologies, it can foster the creation of reusable resources
and infrastructure; it creates an environment for comparison and replication of results; and
it introduces an element of competition to produce better results [Hirschman and Mani, 2001].
Summarization Evaluation, like Machine Translation (MT) evaluation (or any other
Natural Language Processing systems’ evaluation), can be broadly classified into two cate-
gories [Jones and Galliers, 1996]. The first, an intrinsic evaluation, tests the summarization
system in itself. The second, an extrinsic evaluation, tests the summarization system based
on how it affects the completion of some other task. In the past, intrinsic evaluations have
mainly assessed the informativeness and coherence of summaries, while extrinsic
evaluations have been used to test the impact of summarization on tasks like reading com-
prehension, relevance assessment, etc.
1.4.1 Evaluation of Content and Readability
Summarization evaluation is categorized by the two main characteristics of a summary: its
content and its form. Evaluations of content are called 'Content Evaluations' and evaluations
of form are called 'Readability/Fluency Evaluations'. There are various manual evaluation
metrics for Content Evaluation, including Likert-scale rating of summaries, Pyramid
evaluation, and Content Responsiveness. For Readability/Fluency Evaluation there
is only the Likert-scale rating of summaries by human experts, which stands as the reference
judgment. Another human-evaluated metric, Overall Responsiveness, combines the
evaluation of both content and form. All of these are described in greater detail in
Section 3.2.
Automated Content Evaluation Since the manual methods of evaluation are time-
consuming, difficult to perform and require human expertise, and are hence expensive and
non-repeatable, there is a pressing need to develop automated counterparts that can
act as surrogates for manual evaluation while being repeatable and inexpensive.
Since the mid-1990s there has been a great deal of work on automated content evaluation, and
useful evaluation tools such as ROUGE [Lin, 2004b] and Basic Elements [Hovy et al., 2006]
have been developed and successfully tested over the years. Manual and automated content
evaluation measures are discussed in depth in Section 3.2. Soon after the DUC era,
as the TAC programme began, a task on automated evaluation called "Automated Evalu-
ation of Summaries of Peers (AESOP)" was introduced at TAC 2009. The AESOP task is
briefly described in Section 3.2, and a detailed description is provided in Chapter 7,
where we discuss our approach to automated summarization evaluation.
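To give a concrete feel for such reference-based metrics, here is a simplified bigram-recall computation in the style of ROUGE-N. It is a sketch, not the official ROUGE implementation, which adds stemming, stopword options and multi-reference jackknifing.

```python
from collections import Counter

def rouge_n(peer, references, n=2):
    """N-gram recall of a peer summary against reference summaries,
    with clipped counts, in the style of ROUGE-N [Lin, 2004b]."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    peer_grams = ngrams(peer)
    matched = total = 0
    for ref in references:
        ref_grams = ngrams(ref)
        total += sum(ref_grams.values())
        # Clip each reference n-gram's credit by its count in the peer.
        matched += sum(min(c, peer_grams[g]) for g, c in ref_grams.items())
    return matched / total if total else 0.0

refs = [
    "the dam burst after heavy rain",
    "heavy rain caused the dam to burst",
]
peer = "heavy rain made the dam burst"
score = rouge_n(peer, refs, n=2)  # 5 of the 11 reference bigrams matched
```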
Automated Readability Evaluation For the Readability/Fluency aspects of sum-
mary evaluation there has been little dedicated research on summaries. Discourse-
level constraints on adjacent sentences, indicative of coherence and good text flow, have been
investigated relatively well [Lapata, 2003, Lapata and Barzilay, 2005, Barzilay and Lapata, 2008].
In many applications, as with "Overall Responsiveness" for text summaries, fluency is as-
sessed in combination with other qualities. In machine translation, approaches
such as BLEU [Papineni et al., 2002] use n-gram overlap with a reference to judge the "overall
goodness" of a translation. Related work in Natural Language Generation (NLG)
[Wan et al., 2005, Mutton et al., 2007] directly set a goal of sentence-level fluency, regard-
less of content. [Chae and Nenkova, 2009] performed a systematic study
of how syntactic features can distinguish machine-generated translations from
human translations. Another related work, [Pitler and Nenkova, 2008], investigated the im-
pact of certain linguistic surface features, syntactic features, entity-coherence features and
discourse features on the readability of the Wall Street Journal (WSJ) corpus.
Automated readability assessment is described further in Section 3.2.
1.5 Focused Summarization Evaluation Workshops
1.5.1 pre-DUC era
Before truly large-scale summarization evaluation took place in the form of the Document
Understanding Conferences (DUC), there was no major forum or community where com-
mon evaluation benchmarks could be set and focused tasks could be accomplished.
TIPSTER-SUMMAC In May 1998, the U.S. government completed the TIPSTER Text
Summarization Evaluation (SUMMAC), the first large-scale, developer-independent
evaluation of automatic text summarization systems. Two main extrinsic evaluation tasks
were defined, based on activities typically carried out by information analysts in the U.S.
Government. In the ad-hoc task, the focus was on indicative summaries tailored to a
particular topic. In the categorization task, the evaluation sought to find out
whether a generic summary could effectively present enough information to allow an an-
alyst to quickly and correctly categorize a document. The final, question-answering task
involved an intrinsic evaluation where a topic-related summary for a document was eval-
uated in terms of its "informativeness", namely, the degree to which it contained answers
found in the source document to a set of topic-related questions. SUMMAC established
definitively, in a large-scale evaluation, that automatic text summarization is very effective
in relevance assessment tasks. Summaries at relatively low compression rates (17% for ad-
hoc, 10% for categorization) allowed for relevance assessment almost as accurate as with
full-text (5% degradation in F-score for ad-hoc and 14% degradation for categorization,
both degradations not being statistically significant), while reducing decision-making time
by 40% (categorization) and 50% (ad-hoc).
1.5.2 Document Understanding Conferences (DUC)
There was much interest and activity in the late 1990s aimed at building powerful multi-
purpose information systems. The governmental agencies involved included DARPA, ARDA
and NIST. Their programmes, for example DARPA's TIDES (Translingual Information
Detection Extraction and Summarization) programme, ARDA's Advanced Question & An-
swering Program and NIST's TREC (Text Retrieval Conferences) programme, covered a range
of subprogrammes, each focusing on different tasks requiring their own evaluation designs.
Within TIDES, and among other researchers interested in document understanding, a
group formed that has been focusing on summarization and the evaluation of summa-
rization systems. Part of the initial evaluation for TIDES called for a workshop, held in
the fall of 2000, to explore different ways of summarizing a common set of documents. Ad-
ditionally, a road-mapping effort was started in March 2000 to lay plans for a long-term
evaluation effort in summarization [Harman and Over, 2002].
Out of the initial workshop and the road-mapping effort grew a continuing eval-
uation in the area of text summarization called the Document Understanding Conferences
(DUC, http://duc.nist.gov/). Sponsored by the Advanced Research and Development
Activity (ARDA), the conference series was run by the National Institute of Standards and
Technology (NIST) to further progress in summarization and enable researchers to participate
in large-scale experiments. DUC ran from 2001 to 2007, at which point TAC took over.
1.5.3 Text Analysis Conferences (TAC)
There has been a growing recognition of the importance of community-wide evaluations for
research in information technologies. The Text Analysis Conference (TAC,
http://www.nist.gov/tac/) is a series of workshops that provides the infrastructure for
large-scale evaluation of Natural Language Processing technology. TAC began where DUC
ended; it continues the work of the DUC and TREC-QA communities, while also bringing in
the Recognizing Textual Entailment (RTE) community, for the collaborative betterment of
the evaluation of NLP technologies.
In recent years, at the Document Understanding Conferences, text summarization re-
search evolved through task-focused evaluations ranging from 'generic single-document
summarization' to 'query-focused multi-document summarization (QFMDS)'. The QFMDS
task models the real-world complex question answering task wherein, given a topic and a
set of 25 relevant documents, the task is to synthesize a fluent, well-organized 250-word
summary of the documents that answers the question(s) in the topic statement. Recent fo-
cus in the community has been on the query-focused update summarization task at DUC
and TAC. The update task is to produce short (~100-word) multi-document update
summaries of newswire articles under the assumption that the user has already read a set of
earlier articles. The purpose of each update summary is to inform the reader of new
information about a particular topic.
1.6 Objective and Scope of the Research
In this work we characterize the role of two well-known 'linguistic features', sentence
position and query-bias, in the task of automated text summarization. We show that
top-performing summarization systems have induced 'query-bias', picking only those
sentences that contain at least one query term, while trying to improve their performance
on query-focused multi-document
summarization tasks. Later we use another feature, 'sentence position', to show that
state-of-the-art systems are unable to perform any better than a simple baseline system.
We also propose changing the baselines used for the current short-summary task
of update summarization. Later we describe an extension to text summarization systems
based on the language modeling framework, adapting them to the update summarization task.
We show that it is possible to apply context adjustment techniques based on signature
terms to improve the performance of a PHAL based language modeling system. Finally, we
use the generative modeling paradigm to create alternative evaluation metrics comparable
to the state-of-the-art automated evaluation technologies for text summarization, namely
ROUGE and Basic Elements.
1.7 Organization of the Thesis
The main focus of this thesis is spread across several pertinent issues related to query-
focused multi-document summarization. In this thesis we aim to identify the role of
query-bias and sentence position in query-focused multi-document summarization,
provide a language modeling based extension for update summarization, and finally pro-
pose a framework for automated evaluation of summaries. The rest of the thesis is orga-
nized as follows:
Chapter 2 introduces our research goals within the purview of this thesis, elaborating on the
context in which the importance of the problem is seen. Each section in that chapter de-
scribes one of the four problems we address in this thesis and how we approach a
meaningful solution.
Chapter 3 describes the relevant literature in the context of this thesis. In this chapter
we first discuss some of the related approaches to text summarization algorithms, mainly in
the context of query-focused multi-document summarization. We categorize the related
approaches under the following four heads: Heuristic approaches, Language Modeling
approaches, Linguistic Structure or Discourse based approaches and Machine Learning
approaches. Later in the chapter, we describe procedures for summarization evaluation.
We describe in detail some of the manual and automated evaluation methods for evaluation
of content of a summary. At the end of the chapter, we describe the approaches taken to
manual evaluation of readability/fluency of summaries and the efforts taken to automate
the process.
Chapter 4 describes a problem in the context of query-focused multi-document summa-
rization research. We characterize the reliance of various automated summarization algo-
rithms on query-term occurrence to define the relevance of a sentence towards a query. In a
theoretical setting we show that classifier-like summarization algorithms can improve their
performance by biasing towards query terms. Then, in multiple practical settings, we show
that automated systems are indeed biased towards sentences containing query terms.
Chapter 5 explores the position hypothesis based on the relevant literature, discussing
work both for and against the hypothesis. We then describe a sub-optimal position pol-
icy derived from an interesting new dataset that facilitates identification of the relevance of a
subset of sentences. We apply the position policy thus derived to build an algorithm that gen-
erates summaries, and show that such an algorithm performs better at generating short
summaries. Later in the chapter we describe the baselines used in summarization tasks and
argue that a position based baseline is indeed a better baseline for the update summarization
task.
Chapter 6 describes a simple framework for adapting language modeling based approaches
to text summarization to the update summarization task. First we introduce the update summa-
rization problem in the context of a news stream. We describe the language modeling
approaches taken in IR and summarization and follow up with the description of proba-
bilistic hyperspace analogue to language (PHAL). In the next sections we describe how we
extend PHAL using context adjustment based on signature terms.
Chapter 7 explores the area of summarization evaluation in general and the need for
consistent, robust automated evaluation systems. In the sections that follow we describe the
current approaches to text summarization evaluation and the need for various automated
evaluation systems. Later we describe a generative modeling based formalism to evaluate
summaries based on their sentence-level likelihood of including signature terms. Finally,
we discuss two approaches to generating signature terms, to validate the applicability of the
framework.
Finally, Chapter 8 concludes this thesis by summarizing the work done and expanding
upon the contributions of this thesis. This chapter also provides a detailed account of
foreseeable future work. At the end of the chapter we list the related publications that
have emanated from this work.
Chapter 2
Problems Addressed
The goal of this thesis is to identify a few important problems pertinent to automated text
summarization and provide solutions to them. Following are the issues that we deal with
in this thesis:
1. We examine the role of query-bias in query-focused multi-document summarization.
2. We examine the role of sentence position in identifying genre specific relevant con-
tent.
3. We examine how a language modeling based summarization system can be adapted
to handle the update summarization problem.
4. We build multiple alternative automated evaluation systems based on a generative
modeling paradigm by modeling the occurrence of signature terms.
This research focuses on these four major problems, which are discussed in detail below:
2.1 Understanding query-bias in summarization systems
From 2005 until 2007, a query-focused multi-document summarization task was con-
ducted as part of the annual Document Understanding Conference. This task models a real-
world complex question answering scenario, where systems need to synthesize, from a set
of 25 documents, a brief (250-word), well-organized, fluent answer to an information need.
Query-focused summarization is a topic of ongoing importance within the summarization
and question answering communities. Most of the work in this area has been conducted
under the guise of “query-focused multi-document summarization”, “descriptive question
answering”, or even “complex question answering”.
One of the issues studied since the inception of automatic summarization is that of hu-
man agreement: different people choose different content for their summaries [Rath et al., 1961,
van Halteren and Teufel, 2003, Nenkova et al., 2007]. Even the same person may not be
able to produce the same summary at a later time [Rath et al., 1961]. Humans vary in the
material they choose to include in a summary and in how they express the content. Their judg-
ments of summary quality vary from one person to another and across time for a single
person [Harman and Over, 2004]. Later, it was assumed [Dang, 2005] that having a ques-
tion/query to provide focus would improve agreement between any two human-authored
model summaries, as well as between a model summary and an automated summary. This
agreement in content is at the heart of the approaches taken for the automated summa-
rization evaluation techniques [Lin, 2004b, Nenkova et al., 2007]. It has been noted that
these reference based content evaluations would be more robust if multiple gold-standard
summaries were used [Lin, 2004a, van Halteren and Teufel, 2003], and there have been ex-
clusive studies on how many references are required to obtain stable evaluation results
[Lin, 2004a, Nenkova et al., 2007].
In trying to further understand the process of text summarization, the steering commit-
tee of DUC decided to constrain the summarization process based on two major parameters
that could produce summaries with widely different content: query-focus and granularity.
Having a question/query to focus the summary was intended to improve the agreement
between the model summaries. Additionally, for DUC 2005, the NIST assessor who de-
veloped each topic also specified the desired granularity (level of generalization) of the
summary. Granularity was a way to express one type of user preference; one user might
17
CHAPTER 2. PROBLEMS ADDRESSED
want a “general” background or overview summary, while another user might want “spe-
cific” details that would allow him to answer questions about specific events or situations.
This parameter, “granularity”, was withdrawn from further DUCs since NIST assessors
found that the size of the summary plays a much bigger role in determining what informa-
tion to include, than a granularity specification. Almost all NIST assessors tried to write
their summaries according to the granularity requested, but some “specific” summaries
ended up being very general given the large amount of information and small space al-
lowance. Despite this, all the NIST assessors appreciated the theory behind the granularity
specification.
There has been a plethora of research on query-focused multi-document summarization
from the summarization and question answering communities. New approaches to the
problem crop up every year under the competitive aegis of DUC, and most automated sum-
marization systems are optimized on ROUGE or some other evaluation metric. The task
of query-focused multi-document summarization seeks to improve agreement in content
among human-authored model summaries. Query-focus also aids automated summariz-
ers in directing the summary at specific topics, which may result in better agreement with
these model summaries. However, while query focus correlates with performance, we show
that high-performing automatic systems produce summaries with disproportionately higher
query-term occurrence than do human summarizers. Experimental evidence suggests that
automatic systems heavily rely on query-term occurrence and repetition to achieve good
performance.
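The query-term occurrence measurements behind this claim can be sketched as a simple density statistic: the fraction of summary sentences containing at least one query term. The tokenization below is deliberately naive and purely illustrative.

```python
def query_biased_density(sentences, query_terms):
    """Fraction of sentences containing at least one query term."""
    q = {t.lower() for t in query_terms}
    biased = sum(
        1 for s in sentences
        # Naive tokenization: lowercase and strip trailing punctuation.
        if q & {w.lower().strip(".,") for w in s.split()}
    )
    return biased / len(sentences) if sentences else 0.0

query = ["election", "results"]
summary = [
    "Election results were delayed.",
    "Observers questioned the results.",
    "Turnout was high.",
]
density = query_biased_density(summary, query)  # 2 of 3 sentences are biased
```

Comparing this density for source documents, human model summaries and system summaries is what exposes the disproportionate bias of automatic systems.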
In Chapter 4, based on a corpus study, we show that query-biased sentence density differs
between document collections and human summaries. We further the argument by
analyzing text summarization as a naïve classification problem, showing theoretically that
a simple sentence-picking algorithm can perform better when picking sentences from
query-biased sentences only. In Section 4.4 we use Summary Content Units (SCUs) to
show that most of the participating systems are influenced by query-bias. Later, in Sec-
tion 4.5, we build formal generative models capturing the relationship between the presence
of (or amount of) query-bias and ROUGE-2 scores. We also speculate that while the bino-
mial model captures the overall influence of query-bias and shows that most humans have a
similar strategy informed by query-bias, the multinomial model clearly distinguishes systems
that are heavily biased from those that are not, using the granularity of query-bias. In the
end we mathematically explain how certain top-performing systems have induced query-
bias while trying to improve their performance on query-focused multi-document
summarization tasks.
2.2 Sentence Position Baseline for Update Summarization
The position hypothesis states that the importance of a sentence can be based on its ordinal po-
sition. For instance, Baxendale [Baxendale, 1958] found that in 85% of paragraphs the
first sentence was a 'topic sentence'. A study of expository prose showed [Dolan, 1980]
that only 13% of professional writers start with topic sentences, and [Singer and Dolan, 1980]
maintain that the main idea of a text might appear anywhere in the paragraph or not be
stated at all. Arriving at a negative conclusion, [Paijmans, 1994] found that words with
higher informative content do not cluster in first or last sentences. In contrast, in psycho-
logical studies, Kieras confirmed [Kieras, 1985] the importance of the position of a mention
within a text.
The position of a sentence in a document, or of a word in a sentence, can be an indicator
of the importance of that sentence or word in certain genres. Such features are called locational
features, and the sentence position feature deals with the presence of key sentences at specific
locations in the text. Sentence position has been well studied in summarization research
since its inception, as early as Edmundson's work [Edmundson, 1969], and has had a great in-
fluence on our understanding of genre-specific characteristics. Later studies have shown
[Lin and Hovy, 1997] that for genres such as news, a position based feature
performs very well and can capture major thematic words in a discourse.
Throughout the literature as summarization research followed trends from generic single-
document summarization, to generic multi-document summarization, to focused multi-
document summarization, two major baselines persisted throughout the evalu-
ations. These two baselines are:
1. First N words of the document (or of the most recent document).
2. First sentence from each document, in chronological order, until the length requirement is
reached.
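The two baselines can be sketched as follows, assuming plain-text documents supplied in chronological order; splitting sentences on periods is an illustrative simplification.

```python
def baseline_first_n_words(documents, n=100):
    """Baseline 1: first N words of the most recent document."""
    return documents[-1].split()[:n]

def baseline_lead_sentences(documents, n=100):
    """Baseline 2: first sentence of each document, in chronological
    order, until the length budget is reached."""
    summary = []
    for doc in documents:  # assumed to be in chronological order
        summary.extend(doc.split(".")[0].split())
        if len(summary) >= n:
            break
    return summary[:n]

docs = [
    "Old news first sentence. More detail.",
    "Newer first sentence. Even more detail.",
]
lead = baseline_lead_sentences(docs, n=8)
```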
The first baseline performs poorly in content evaluations on all manual and automatic
metrics; however, since it doesn't disturb the original flow and ordering of a document,
these summaries are linguistically the best. The second baseline has not been beaten
by automated systems on content evaluation, but the linguistic aspects of summary
quality are compromised in such a summary, which is usually very poor on readability.
In Chapter 5, we describe a sentence position based summarizer built on a
sentence position policy created from the evaluation testbed of recent summarization tasks
at the Document Understanding Conferences (DUC). We show that the summarizer thus built is
able to outperform most comparable systems. Based on an interesting corpus explained in
Section 4.4, we derive a Sub-optimal Sentence Position Policy (SPP), inspired in most parts
by the work on the "Optimal Position Policy" of Lin and Hovy [Lin and Hovy, 1997].
We further use the SPP to generate multi-document summaries for the QF-MDS task and the
update summarization task. Our experiments also show that such a method performs
better at producing short summaries (up to 100 words) than longer ones. Based on certain
studies in the literature, we speculate that the success of SPP in the update summarization
task is due to the nature of the text collections and genre. Later, in Section 5.4, we argue that a
position based baseline works better for all short-summary tasks and hence should be
used as a baseline for the update summarization task. Such a stronger
content based baseline would allow evaluation of progress on the problem over the
years.
2.3 Language Modeling approaches for Update Summarization
The key to "update summarization" is a real-world setting where a user needs to keep track
of a hot topic continuously, at random intervals of time. There is a lot of activity on the hot
topic, and many documents are generated within a short span. The user cannot deal with
this proliferation of information and hence requires a summarization engine that generates
a very targeted, informative summary. The user then has access to the information in
the form of a summary, and if he needs to know more he can consult the source
documents. After using the summarization engine for a while, say he takes a break (it's
Christmas!) and then comes back for a summary of the recent activity on the topic.
A normal summarizer would just generate a summary of the recent documents and present
it to him. The idea of update summarization, however, is to filter out information that has
already appeared in previous articles, whether or not it was presented to the user as part
of a previous summary. In effect, it adds another layer of redundancy checking to avoid
repeated information.
The updates on the topic need to filter out redundant information while preserving the
informativeness of the content. The task of update summarization thus has two components:
a normal query-focused multi-document summarization for the first cluster (of documents)
on the topic, and an update summary generation procedure which also produces
query-focused summaries, under the assumption that the user has already gone through the previ-
ous document cluster(s). In the current work, we approach the problem under a sentence-
extractive summarization paradigm, using an existing language modeling framework. Here,
we see the "update summary generation" task as a language model smoothing problem.

Language modeling based approaches have been used frequently for text summarization.
Recently, relevance based language modeling approaches have been applied
[Jagarlamudi, 2006] to the query-focused multi-document summarization
task. In Chapter 6, we describe a novel framework for update summarization based on
the language modeling paradigm. We choose to employ context adjustment techniques on
language models of the novel clusters, biased by the previous clusters. We show that the
resulting adjusted language models perform better than the plain language model in the case of
pHAL, a probabilistic version of the Hyperspace Analogue to Language.
2.4 Alternative Automated Summarization Evaluations
Evaluation is a critical component in the area of automatic summarization; it is used both
to rank participant systems in shared tasks, such as the summarization tracks
at TAC 2008 and 2009 and their DUC predecessors, and by developers whose goal is to im-
prove their summarization systems. Summarization evaluation can foster the creation of
reusable resources and infrastructure; it creates an environment for comparison and repli-
cation of results; and it introduces an element of competition to produce better results
[Hirschman and Mani, 2001]. However, manual evaluation of the large number of documents
necessary for a relatively unbiased view is often infeasible, especially since multiple eval-
uations are needed over time to track incremental improvement in systems. Therefore, there
is an urgent need for reliable automatic metrics that can perform evaluation in a fast and
consistent manner.
Summarization evaluation can be broadly classified into two categories: intrinsic and
extrinsic. Intrinsic evaluations measure the quality of the created automated summary
directly, and require some reference against which to judge summarization quality.
Intrinsic evaluations have taken two major forms: manual, in
which one or more people evaluate the system-produced summary, and automatic, in which
the summary is evaluated without a human in the loop. The content or informativeness
of a summary has been evaluated with various manual metrics. Earlier, NIST assessors
rated each summary on a 5-point scale from "very poor" to "very good". Since 2006,
NIST has used the Pyramid framework to measure content re-
sponsiveness. In the pyramid method, as explained in Section 3.2, assessors first extract all
possible “information nuggets” or Summary Content Units (SCUs) from human-produced
model summaries on a given topic. Each SCU has a weight associated with it based on the
number of model summaries in which this information appears. The final score of a peer
summary is based on the recall of nuggets in the peer.
All forms of manual assessment are time-consuming, expensive and not repeatable,
whether scoring summaries on a Likert scale or evaluating peers against "nugget pyramids"
as in the pyramid method. Such assessment does not help system developers, who
would ideally like a fast, reliable and, most importantly, automated evaluation metric
that can be used to track incremental improvements in their systems. So despite
the strong manual evaluation criteria for informativeness, time-tested automated methods
such as ROUGE and Basic Elements (BE) have been regularly tested for their correlation
with manual evaluation metrics like 'modified pyramid score', 'content responsiveness'
and 'overall responsiveness' of a summary. Each of these metrics is described in
Section 3.2. The creation and testing of automatic evaluation metrics is therefore
an important research avenue; the goal is to create automated evaluation metrics that
correlate very highly with these manual metrics.
In Chapter 7, we motivate the need for alternative summarization evaluation
systems for both content and readability. In the context of the TAC AESOP (Automatically
Evaluating Summaries Of Peers) task, we describe the problem with content evaluation
metrics and how a good metric must behave. We describe how a well-known generative
model can be used to create automated evaluation systems comparable to the state-of-the-
art. Our method is based on a multinomial distribution of key terms (or signature
terms) in document collections, and the likelihood that they are captured in peers.
Chapter 3
Related Work
In this chapter, we survey related work in the area of text summarization. In what fol-
lows, we mainly concentrate on four major categories of approaches to text summarization,
namely Heuristic Approaches, Language Modeling Approaches, Linguistic Structure or
Discourse based Approaches and Machine Learning Approaches. We also describe summa-
rization evaluation methodologies, from manual evaluation to automated evaluation tech-
niques for content and form.
3.1 Approaches to Text Summarization
Heuristics and linguistic cues were tried out early in Information Retrieval
research, while the late 90s saw a surge of language modeling for Information Retrieval.
Almost all of these techniques have since been tried for text summarization. The
relevant related approaches are described below.
3.1.1 Heuristic Approaches to Text Summarization
Heuristics from linguistic analysis have been of great help in automated text summariza-
tion. Analysis of various linguistic phenomena in source text has led to interesting
experimental conclusions for text summarization. Term frequency has played a crucial
role, from its application to summarization as a heuristic in [Luhn, 1958] to the more formal,
mathematical treatment in [Nenkova et al., 2006], which showed that a simple un-
igram language model can be used to generate state-of-the-art summaries.
Inverse Document Frequency (idf)
Inverse Document Frequency [Jones, 1972] has been used repeatedly in IR and text sum-
marization since its introduction as a heuristic for Information Retrieval. The inverse doc-
ument frequency is a measure of the general importance of a term, obtained by dividing
the number of all documents by the number of documents containing the term, and then
taking the logarithm of that quotient. This is now a standard feature in more sophisticated
summarization systems [Radev et al., 2004, Daume III and Marcu, 2004, Shen et al., 2007,
Toutanova et al., 2007].
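As an illustration, the idf of a term over a small collection can be computed as follows. This is a minimal sketch: the example documents and the zero fallback for unseen terms are assumptions for illustration, not details from the cited work.

```python
import math

def idf(term, documents):
    """log(N / df): total documents over documents containing the term."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0

# Hypothetical three-document collection, each document a set of tokens.
docs = [{"police", "killed", "the", "gunman"},
        {"the", "gunman", "escaped"},
        {"a", "quiet", "day"}]
```

Here "the" occurs in two of the three documents and so receives a lower idf than "police", which occurs in only one.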
Document Frequency (df)
Document Frequency is another relatively simple feature that performs very strongly in
multi-document summarization and can be seen as an inverse of Inverse Docu-
ment Frequency (IDF). While conventional wisdom advises the use of IDF, [Schilder and Kondadadi, 2008]
used document frequency to some advantage in query-focused multi-document summarization
tasks. More recently, [Bysani et al., 2009] used Document Frequency to illustrate
the power of such a feature for the update summarization task at TAC 2008 and 2009.
Though the usage of IDF is supported by conventional wisdom, this author favors Doc-
ument Frequency, which seems more applicable here for the following reasons. IDF was
designed for settings where term specificity had to be addressed statistically over a huge
corpus of unrelated documents, as in Information Retrieval. In text summarization, however,
retrieval has already been performed and a set of related documents is available a priori.
It seems intuitive that, within a set of related documents, a term should be redundant across
documents as well as within each document in order to qualify as a topic-
specific term.
Sentence Position (sp)
Sentence position has been extensively studied since its introduction to summarization by
[Edmundson, 1969]. Earlier, [Baxendale, 1958] gave a straightforward definition of po-
sition-based importance: "title plus first and last sentence of paragraph are important".
Later, [Lin and Hovy, 1997] empirically characterized position as a genre-depen-
dent feature and derived a position policy: an ordering of sentence positions by priority of importance.
Further research on the sentence position feature is described in detail in Chapter 5.
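A position policy in this sense can be sketched as a ranked list of preferred sentence positions. The reciprocal-rank scoring and the example policy below are illustrative assumptions, not the scheme of [Lin and Hovy, 1997]:

```python
def position_score(sent_index, policy):
    """Score a sentence by the rank of its position in a genre-specific
    position policy (an ordered list of preferred positions)."""
    if sent_index in policy:
        return 1.0 / (policy.index(sent_index) + 1)
    return 0.0

# Hypothetical newswire-style policy: lead sentences matter most.
newswire_policy = [0, 1, 2]
```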
3.1.2 Language Modeling Approaches
Language modeling based approaches have been prominent in the text summarization litera-
ture since the early work of Luhn [Luhn, 1958], who used term frequencies as a
heuristic to indicate the importance of terms for summaries. Though this early work was
heuristic, it paved the way for more sophisticated, mathematically driven language mod-
els. [Allan et al., 2001] describe a language modeling based approach to temporal
summaries, in which language models capture novelty and usefulness.
[Jagarlamudi, 2006] has shown how the relevance based language modeling paradigm can
be applied to automated text summarization, specifically for the query-focused multi-
document summarization task. Recently, [Nenkova et al., 2006] showed reasonable
success in multi-document text summarization using 'just' unigram language models. In
Chapter 6 we use the Probabilistic Hyperspace Analogue to Language (PHAL) described in
[Jagarlamudi, 2006] as the language modeling mechanism and extend it to the update
summarization task.
Lawrie [Lawrie, 2003] defines summarization in terms of "probabilistic language mod-
els" and uses this definition to explore techniques for automatically generating topic
hierarchies. She used language models to define two concepts for each word, 'topicality' and 'predictive-
ness', which capture the topic-orientedness of a word and the existence of sub-topic
hierarchies under it.
Chandan Kumar [Kumar, 2009] describes an "information loss" based framework for
'generic multi-document summarization' in the language modeling paradigm. He
treats summarization as a decision-making process: each sentence is scored by comparing
two language models (a document model and a world model) via relative entropy, which
enables the generation of informative summaries. The framework is based on minimizing
the Bayesian risk of losing an informative sentence.
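The relative-entropy comparison at the core of such frameworks can be sketched as a KL divergence between two smoothed unigram models. The additive smoothing and the toy models below are assumptions for illustration, not the exact formulation of [Kumar, 2009]:

```python
import math
from collections import Counter

def kl_divergence(p_counts, q_counts, eps=1e-3):
    """KL(P || Q) between two unigram models given as term-count Counters,
    with additive smoothing over the union vocabulary."""
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + eps * len(vocab)
    q_total = sum(q_counts.values()) + eps * len(vocab)
    kl = 0.0
    for w in vocab:
        p = (p_counts[w] + eps) / p_total
        q = (q_counts[w] + eps) / q_total
        kl += p * math.log(p / q)
    return kl

# Hypothetical document and world models.
doc_model = Counter("airbus deal airbus order".split())
world_model = Counter("stocks fell on news of the deal".split())
```

The divergence is zero when the two models agree and grows as they diverge, which is the property exploited when ranking sentences by how document-specific their language is.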
In the context of update summarization, a lot of work has taken shape using language
modeling. First, [Maheedhar Kolla, 2007] described a cluster based language model that
uses background modeling to efficiently generate update summaries; two meth-
ods of background modeling were used, based on "documents in the previous cluster" and on the
"summary of the previous cluster". Second, in related work, [Bysani et al., 2009] coined the
term "Novelty Factor", based on the ratio of the distribution of a word in the current
cluster to that in previous clusters. This can be seen as an application of language modeling
at the document level.
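The document-level intuition behind such a factor can be sketched as follows. The exact normalization used by [Bysani et al., 2009] may differ; treat this as an illustrative ratio only, with hypothetical clusters:

```python
def novelty_factor(term, current_cluster, previous_cluster):
    """Share of a term's document frequency that comes from the current
    cluster; high values flag terms that are new to the update."""
    df_cur = sum(1 for doc in current_cluster if term in doc)
    df_prev = sum(1 for doc in previous_cluster if term in doc)
    total = df_cur + df_prev
    return df_cur / total if total else 0.0

# Hypothetical clusters of tokenized documents.
earlier = [{"quake", "struck", "city"}, {"quake", "damage"}]
update = [{"rescue", "teams", "arrive"}, {"rescue", "quake"}]
```

A term like "rescue", absent from the earlier cluster, gets the maximum score of 1.0, while a term carried over from the earlier cluster scores lower.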
3.1.3 Linguistic structure or discourse based summarization
Lexical Chains
The notion of cohesion, introduced in [Halliday and Hasan, 1976], captures part of the in-
tuition behind lexical chains. Cohesion is a device for "sticking together" different parts of a text,
achieved through the use of semantically related terms, co-reference, ellipsis and conjunc-
tions. Among the various cohesion-building devices, 'lexical cohesion' is the most easily
identifiable and most frequent type, and it can be a very important source for the 'flow' of
informative content.
There is a close connection between discourse structure and cohesion. Related words
tend to co-occur within a discourse unit of the text, so cohesion is one of the surface
indicators of discourse structure, and 'lexical chains' can be used to identify it. The first
computational model for lexical chains was presented in [Morris and Hirst, 1991], which
used Roget's Thesaurus as its knowledge base. All later models used some knowledge
source or other, such as WordNet or dictionaries. [Barzilay and Elhadad, 1997] observed
certain limitations in earlier approaches to lexical chains. A major issue was
"greedy disambiguation", which results from greedy sense selection. [Barzilay, 1997]
addressed this issue with a less greedy algorithm that constructs all possible in-
terpretations of the source text using lexical chains, selects the interpretations
with the strongest cohesion, and then uses these "strong chains" to generate a summary of
the original document. They also demonstrated the usefulness of lexical chains as a
source representation for automated text summarization. Later, [Silber and McCoy, 2000]
presented an O(n) algorithm, linear in the number of nouns present in the source document.
They proposed an alternative to the scoring algorithm presented in [Barzilay, 1997]
and showed that their algorithm is more efficient while still producing summaries of
similar quality.
Discourse
Daniel Marcu led focused research on the applicability of discourse structure to text
summarization [Marcu, 2000]. He showed that discourse trees are good indicators of tex-
tual importance [Marcu, 1999b] and devised a discourse parsing algorithm to identify dis-
course connectives that signal importance. He also showed that incorporating various
heuristics into a discourse-based summarization framework improves its performance.
3.1.4 Machine Learning Approaches
Recent advances in machine learning have been adapted to the summarization problem through
the years, using various features to identify the salience of a sentence. Representative work
in 'learning' sentence extraction includes training a binary classifier [Kupiec et al., 1995],
training a Markov model [Conroy et al., 2004], training a CRF [Shen et al., 2007], and
learning pairwise rankings of sentences [Toutanova et al., 2007].
[Kupiec et al., 1995] use learning to combine various shallow heuristics/features
(cue phrases, location, sentence length, word frequency and title), using a corpus of re-
search papers with manually produced abstracts. [Conroy et al., 2004] applied Hidden
Markov Model training for sentence scoring; their HMM was trained on the
NIST DUC 03 task 5 novelty data. Though HMMs were successfully applied to text
summarization, they cannot fully exploit linguistic features, since they must assume
independence among features for tractability. Unsupervised approaches, on the other hand,
rely on heuristics that are difficult to generalize; hence [Shen et al., 2007] applied a Con-
ditional Random Field to take full advantage of features that may be dependent on each
other.
3.2 Summarization Evaluation
Summarization Evaluation, like Machine Translation (MT) evaluation (or any other NLP
systems’ evaluation), can be broadly classified into two categories [Jones and Galliers, 1996].
The first, an intrinsic evaluation, tests the summarization system in itself. The second, an
extrinsic evaluation, tests the summarization system based on how it affects the completion
of some other task. In the past, intrinsic evaluations have mainly assessed the informativeness
and coherence of summaries, while extrinsic evaluations have been used to test
the impact of summarization on tasks like reading comprehension, relevance assessment,
etc.
Intrinsic Evaluations Intrinsic evaluations are those in which the quality of the automatically
created summary is measured directly. They require some standard or model
against which to judge summarization quality, and this standard is usually operational-
ized by utilizing an existing abstract/text dataset or by having humans create model sum-
maries [Jing et al., 1998]. Intrinsic evaluations have taken two major forms: manual, in
which one or more people evaluate the system-produced summary, and automatic, in which
the summary is evaluated without a human in the loop. Both types involve human
judgments of some sort, and with them their inherent variability.
Extrinsic Evaluation Extrinsic evaluations measure indirectly how well a
summary performs, by measuring performance on a task putatively dependent on the quality
of the summary. They require the selection of an appropriate task that could
use summarization, and measure the effect of using automatic summaries instead of the original
text. Critical issues here are the selection of a sensible real task and of metrics that will be
sensitive to differences in summary quality.
Assessment of Evaluations Overall, the literature on text summarization shows, along
with some definite progress in summarization technology, that automated sum-
mary evaluation is more complex than it originally appeared to be. A simple dichotomy
between intrinsic and extrinsic evaluations is too crude, and by comparison with other Nat-
ural Language Information Processing (NLIP) tasks, evaluation at the intrinsic end of the
range of possibilities is of limited value. The forms of gold-standard quasi-evaluation that
have been thoroughly useful for other tasks like speech transcription or machine transla-
tion, and to some lesser extent for information extraction or question answering, are
less indicative of potential value for summaries than in those cases. At the same time,
even with such apparently fine-grained forms of summarization evaluation as
nugget comparisons, it is difficult, given the often complex systems involved, to attribute particular
performance effects to particular system features or to discriminate among the sys-
tems. All this makes the task of evaluation in context extremely problematic. Such a Catch-22
situation is displayed in [Lin and Hovy, 2003a, Lin and Hovy, 2003b]: they
attribute poor system performance (for extractive summarizing) to disagreement among
human gold standards, concluding that humans ought to agree more; but attempting to specify summarizing
requirements so as to achieve this may be as misconceived as it is impossible. Similar
issues arise with Marcu's development of test corpora from existing source-summary data
[Marcu, 1999a].
3.3 Evaluation of Content
Content evaluation refers to quantifying the informativeness of a summary. In-
formativeness assessment aims at the summary's information content. As a summary of a
source becomes shorter, less information from the source can be preserved in
the summary. Therefore, one measure of informativeness is how much informa-
tion from the source is preserved in the summary; another is how much informa-
tion from a reference summary is covered by the system summary. In the
following subsections we describe the major content evaluation metrics, ROUGE and the
Pyramid method, and give a brief overview of other metrics available as alternatives.
3.3.1 ROUGE [Lin, 2004b]
ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. It includes mea-
sures that automatically determine the quality of a summary by comparing it to other (ideal)
summaries created by humans. The measures count the number of overlapping units, such
as n-grams, word sequences, and word pairs, between the computer-generated summary to
be evaluated and the human summaries. The ROUGE package [Lin, 2004b] contains the
following four measures: ROUGE-N, ROUGE-L, ROUGE-W and ROUGE-S. A short
description of each follows:
ROUGE-N Formally, ROUGE-N is an n-gram recall between a candidate summary and
a set of reference summaries. ROUGE-N is computed as follows:
ROUGE-N = \frac{\sum_{S \in \{ReferenceSummaries\}} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in \{ReferenceSummaries\}} \sum_{gram_n \in S} Count(gram_n)}    (3.1)
where n stands for the length of the n-gram, gram_n is the n-gram itself, and Count_match(gram_n)
is the maximum number of n-grams co-occurring in the candidate summary and the set of ref-
erence summaries. ROUGE-N is a recall-oriented measure
because the denominator of the equation is the total number of n-grams occur-
ring on the reference summary side.
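Equation 3.1 can be sketched as follows. This minimal version omits the stemming, stopword removal and jackknifing options of the released ROUGE package:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of the n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, references, n):
    """N-gram recall: clipped matches over total reference n-grams (Eq. 3.1)."""
    cand = ngram_counts(candidate, n)
    match = total = 0
    for ref in references:
        ref_grams = ngram_counts(ref, n)
        total += sum(ref_grams.values())
        match += sum(min(count, cand[g]) for g, count in ref_grams.items())
    return match / total if total else 0.0
```

For instance, `rouge_n("police kill the gunman".split(), [["police", "killed", "the", "gunman"]], 1)` yields 0.75, since three of the four reference unigrams are matched.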
ROUGE-L A sequence Z = [z_1, z_2, \ldots, z_k] is a subsequence of another sequence
X = [x_1, x_2, \ldots, x_m] if there exists a strictly increasing sequence [i_1, i_2, \ldots, i_k]
of indices of X such that for all j = 1, 2, \ldots, k we have x_{i_j} = z_j [Cormen et al., 1990].
Given two sequences X and Y, the longest common subsequence (LCS) of X and Y is a
common subsequence of maximum length.
In applying LCS to summarization evaluation, a summary sentence is treated as a
sequence of words. The intuition is that the longer the LCS of two summary sentences
is, the more similar the two summaries are. An LCS-based recall measure estimates the
similarity between two summaries X of length m and Y of length n, where X is a
reference summary sentence and Y is a candidate summary sentence, as follows:
R_{lcs} = \frac{LCS(X, Y)}{m}    (3.2)
One advantage of using LCS is that it does not require consecutive matches, but in-
sequence matches that reflect sentence-level word order, as n-grams do. Another advan-
tage is that it automatically includes the longest in-sequence common n-grams, so no
pre-defined n-gram length is necessary. By awarding credit only to in-sequence unigram
matches, ROUGE-L also captures sentence-level structure in a natural way. However, LCS
has one main disadvantage: it counts only the main in-sequence words, so the
alternative LCSes and shorter sequences are not reflected in the final score.
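The LCS recall of Equation 3.2 can be sketched with the classic dynamic-programming LCS. This is a minimal version; ROUGE-L as released also reports precision and an F-measure:

```python
def lcs_length(x, y):
    """Length of the longest common subsequence, via the standard
    O(m*n) dynamic-programming table."""
    table = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            if xi == yj:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l_recall(reference, candidate):
    """R_lcs = LCS(X, Y) / m, with m the reference length (Eq. 3.2)."""
    return lcs_length(reference, candidate) / len(reference)
```

With reference "police killed the gunman" and candidate "police kill the gunman", the LCS is [police, the, gunman], giving a recall of 0.75.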
ROUGE-W LCS has many nice properties [Lin, 2004b]; unfortunately, basic LCS
does not differentiate LCSes with different spatial relations
within their embedding sequences. For example, given a reference sequence X and two
candidate sequences Y1 and Y2 as follows:
X : [A, B, C, D, E, F, G]
Y1 : [A, B, C, D, H, I, K]
Y2 : [A, H, B, K, C, I, D]
Y1 and Y2 have the same ROUGE-L score. However, in this case, Y1 should be a
better choice than Y2 because Y1 has consecutive matches. To improve the basic LCS
method, we can simply remember the length of consecutive matches encountered so far in
the regular two-dimensional dynamic programming table used to compute LCS. [Lin, 2004b] calls this
weighted LCS (WLCS).
ROUGE-S A skip-bigram is any pair of words in their sentence order, allowing for ar-
bitrary gaps. Skip-bigram co-occurrence statistics measure the overlap of skip-bigrams
between a candidate translation and a set of reference translations. Consider the following
example sentences:
S1. police killed the gunman
S2. police kill the gunman
S3. the gunman kill police
S4. the gunman police killed
Each sentence has C(4, 2) = 6 skip-bigrams. For example, S1 has the following skip-
bigrams: ("police killed", "police the", "police gunman", "killed the", "killed gunman",
"the gunman").
S2 has three skip-bigram matches with S1 ("police the", "police gunman", "the
gunman"). S3 has one skip-bigram match with S1 ("the gunman"), and S4 has two skip-
bigram matches with S1 ("police killed", "the gunman"). Given translations X of length
m and Y of length n, where X is the reference translation and Y is a candidate transla-
tion, skip-bigram recall and precision are computed as follows:
R_{skip2} = \frac{SKIP2(X, Y)}{C(m, 2)}    (3.3)

P_{skip2} = \frac{SKIP2(X, Y)}{C(n, 2)}    (3.4)
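Equations 3.3 and 3.4 can be sketched directly. This version treats skip-bigrams as a set, which is adequate for sentences without repeated word pairs:

```python
from itertools import combinations
from math import comb

def skip_bigrams(tokens):
    """All in-order word pairs with arbitrary gaps."""
    return set(combinations(tokens, 2))

def rouge_s(reference, candidate):
    """Skip-bigram recall and precision (Eqs. 3.3 and 3.4)."""
    matches = len(skip_bigrams(reference) & skip_bigrams(candidate))
    recall = matches / comb(len(reference), 2)
    precision = matches / comb(len(candidate), 2)
    return recall, precision
```

For S1 and S2 above this gives a recall and precision of 3/6 = 0.5 each.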
ROUGE-SU One potential problem with ROUGE-S is that it gives no credit to
a candidate sentence that has no word pair co-occurring with its
references. For example, the following sentence has a ROUGE-S score of zero:
S5. gunman the killed police
S5 is the exact inverse of S1 and there is no skip-bigram match between them. However, we
would like to differentiate sentences like S5 from sentences that have no single-
word co-occurrence with S1. To achieve this, a simple extension of ROUGE-S called ROUGE-
SUn is employed, where n is the maximum skip distance for a bigram. ROUGE-SU includes all the
skip-bigrams counted by ROUGE-S plus all unigrams, and hence removes the above problem.
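ROUGE-SU's fix can be sketched by adding unigrams to the pool of matching units. This version has no skip-distance cap; ROUGE-SUn would additionally discard pairs with a gap larger than n:

```python
from itertools import combinations

def su_units(tokens):
    """Skip-bigrams (in-order pairs, any gap) plus unigrams."""
    return set(combinations(tokens, 2)) | {(w,) for w in tokens}

def rouge_su_recall(reference, candidate):
    """Fraction of reference SU units matched by the candidate."""
    matching = len(su_units(reference) & su_units(candidate))
    return matching / len(su_units(reference))
```

S5 ("gunman the killed police") now scores 4/10 against S1 rather than zero, since its four unigrams all match even though no skip-bigram does.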
3.3.2 Pyramid Evaluation [Nenkova et al., 2007]
The most common way to evaluate the informativeness of automated summaries is to com-
pare them with human-authored reference summaries. For decades, the task of automatic
summarization was cast as a sentence selection problem: systems were developed
to identify the most important sentences in the input to form a summary. It was
then appropriate to generate human reference summaries by asking people to produce sum-
maries by picking representative sentences. Systems were evaluated using metrics such as
precision and recall [Salton et al., 1997], which measured the extent to which automated
summarizers selected sentences that might be selected by human summarizers. Over
time, several undesirable effects associated with this approach came to light:
1. Human Variation. Content selection is not a deterministic process [Salton et al., 1997,
Marcu, 1997, Mani, 2001]. Different people choose different sentences to include in
a summary, and even the same person can select different sentences at different times
[Rath et al., 1961]. Such observations led to concerns about the use of a sin-
gle reference summary and suggest that multiple human references would provide a
better ground for comparison.
2. Analysis Granularity. The issue of human variation aside, even comparing the
degree of sentence co-selection is not always justified. Even if a system does not
choose exactly the same sentence as the human summarizer, the sentence it
picks might have considerable overlap in content with one or more sentences from
the reference summaries. Thus, partial matches below the level of a sentence need to be
accounted for.
3. Semantic Equivalence. Another issue related to granularity is semantic equivalence.
Especially in newswire summarization (and more so in the multi-document case),
different input sentences can convey the same meaning even if they are worded differently. Humans
would pick only one of the multiple equivalent alternatives to include in the summary,
and a system would be unfairly penalized if it selected one of the other, equally appropriate,
options.
4. Extracts or Abstracts. When humans are asked to write a summary of a text, they
do not normally pick sentences from the input and concatenate them to form the
summary. Instead, they pick the important information and use their own words to
synthesize an informative, readable summary. Thus, the exact matching of sentences
required by precision and recall measures is not at all
feasible. As the field grows and continuously moves towards more non-extractive
summarizers, we clearly need to move to methods that can handle semantic equiv-
alence at varying levels of granularity.
The pyramid method provides a unified way of addressing the issues outlined
above. Its key assumption is the need for multiple human-authored
reference summaries which, taken together, yield a gold standard for system output.
SCUs. SCUs are semantically motivated subsentential units; they are variable in length
but no bigger than a sentential clause. SCUs emerge from the annotation of a collection
of human summaries for the same input. They are identified by noting information that
is repeated across summaries, whether the repetition is as small as a modifier of a noun
or as large as a clause. Sentences corresponding to information that appears in only one
summary are broken down into clauses, each of which is one SCU in the pyramid. A weight
is associated with each SCU, indicating the number of summaries in which it appeared.
Pyramids. Unlike many gold standards, a pyramid represents the opinions of multiple
human summary writers, each of whom has written a model summary for the input set
of documents. A key feature of a pyramid is that it quantitatively represents agreement
among the human summaries: SCUs that appear in more of the human summaries are
weighted more highly, allowing important content (that appears in
many human summaries) to be differentiated from less important content. Such weighting is necessary in sum-
marization evaluation, given that different people choose somewhat different information
when asked to write a summary for the same set of documents. More details on SCUs, the
procedure for identifying them, and the method for scoring a new summary against a pyramid
are given in [Nenkova et al., 2007].
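Once SCUs have been manually annotated, scoring a peer is mechanical. The sketch below computes the original pyramid score, i.e. the observed SCU weight over the maximum weight achievable with the same number of SCUs; the SCU names and weights are hypothetical, and variants such as the modified pyramid score normalize differently:

```python
def pyramid_score(peer_scus, scu_weights):
    """Observed SCU weight divided by the weight of an optimal summary
    containing the same number of SCUs."""
    observed = sum(scu_weights[scu] for scu in peer_scus)
    ideal = sum(sorted(scu_weights.values(), reverse=True)[:len(peer_scus)])
    return observed / ideal if ideal else 0.0

# Hypothetical pyramid built from four model summaries.
weights = {"scu_a": 4, "scu_b": 3, "scu_c": 1, "scu_d": 1}
```

A peer expressing scu_a and scu_c scores (4 + 1) / (4 + 3), about 0.71, since an optimal two-SCU summary would have expressed scu_a and scu_b.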
3.3.3 Other content evaluation measures
Over the last decade there has been tremendous interest in summarization research, and
systems have cropped up from every corner of the world. Thanks to focused evaluations (like
DUC), the summarization community created a clear roadmap for the future of automated
summarization. Evaluation has always been a priority for the community, and from time to
time it has been revived.
The following evaluation techniques, listed in order of appearance, never attained the
prominence of ROUGE or the Pyramid method:
• SEE
• Relative Utility
• Basic Elements [Hovy et al., 2006]
• N-gram graphs
Recently, [Tratz and Hovy, 2008] extended Basic Elements recall based evaluation by
applying a set of transformations to the basic elements. Some research [Ani et al., 2005]
has been devoted to automating the pyramid method. Recent tasks at the focused
evaluations at TAC have targeted the aspect of "Automatically Evaluating Summaries Of
Peers"1. Research on automated summarization evaluation without human mod-
els [Louis and Nenkova, 2009] has also been pursued and is of considerable interest to the community.
3.4 Evaluation of Readability
Readability evaluation refers to quantifying the form of a summary. Read-
ability/fluency assessment aims at the summary's surface form. If a summary is picked
1http://nist.gov/tac/2009/Summarization/index.html
verbatim from a document, there is less chance of distortion, and hence the readability
of the source can be preserved in the summary.
3.4.1 Manual Evaluation of Readability
At the Document Understanding Conferences (DUC) and now at the Text Analysis Confer-
ence (TAC), the readability of summaries is assessed using five linguistic quality questions,
which measure qualities of the summary that do not involve comparison with a refer-
ence summary or topic of focus. The linguistic qualities measured are Grammaticality,
Non-redundancy, Referential clarity, Focus, and Structure and coherence.
Grammaticality A summary should have no datelines, system-internal formatting, capi-
talization errors or obviously ungrammatical sentences (e.g., fragments, missing compo-
nents) that make the text difficult to read.
Non-redundancy There should be no unnecessary repetition in the summary. Unneces-
sary repetition might take the form of whole sentences that are repeated, repeated facts,
or the repeated use of a noun or noun phrase (e.g., "Prasad Pingali") when a pronoun ("he")
would suffice.
Referential clarity It should be easy to identify who or what the pronouns and noun
phrases in the summary refer to. If a person or other entity is mentioned, it should
be clear what their role in the story is. A reference is unclear if an entity is
mentioned but its identity or relation to the story remains unclear.
Focus The summary should have a focus; sentences should only contain information that
is related to the rest of the summary.
Structure and Coherence The summary should be well-structured and well-organized.
The summary should not just be a heap of related information, but should build from sen-
tence to sentence into a coherent body of information about a topic.
3.4.2 Automated Evaluation of Readability
For the readability/fluency aspect of automated summary evaluation there has not been
much dedicated research on summaries. Discourse-level constraints on adjacent
sentences, which are indicative of coherence and good text flow, have been investigated
fairly well [Lapata, 2003, Barzilay and Lapata, 2008]. In many applications, as in
the "overall responsiveness" of text summaries, fluency is assessed in combination with other
qualities. In machine translation, approaches such as BLEU [Papineni et al., 2002] use n-gram overlap
with a reference to judge the "overall goodness" of a translation. With
BLEU, overlaps of higher-order n-grams are meant to capture fluency considerations, while
all the n-gram overlaps together contribute to the translation's "content goodness". By
contrast, some related work in NLG [Wan et al., 2005, Mutton et al., 2007] directly targets
sentence-level fluency regardless of content. [Wan et al., 2005] build upon
the premise that syntactic information from a parser can capture sentence fluency more
robustly than language models, giving a more direct indication of the degree of ungram-
maticality. The idea is extended in [Mutton et al., 2007], where four parsers are used and
artificially generated sentences with varying levels of fluency are evaluated with impressive
success.
Recently, [Chae and Nenkova, 2009] performed a systematic study of how syntactic
features can distinguish machine-generated translations from human translations;
they were also able to distinguish 'well formed' translations from 'low fluency' trans-
lations. In other related work, [Pitler and Nenkova, 2008] investigated the impact of
certain linguistic surface features, syntactic features, entity coherence features and dis-
course features on the readability of the Wall Street Journal (WSJ) corpus. Their investi-
gations revealed that while surface features like the average number of words per sentence
and the average number of characters per word are not good predictors, there are syntac-
tic, semantic, and discourse features that do correlate highly with readability. Further,
[Feng et al., 2009, Feng, 2009] developed a tool for automatically rating the readability of
texts for adult users with intellectual disabilities.
There has been no known work on characterizing the readability/fluency of a
summary directly. Applying the above methods, in particular the syntactic and semantic
features of [Chae and Nenkova, 2009, Pitler and Nenkova, 2008], to create an automated
metric for evaluating summaries would be an interesting area of research.
Chapter 4
Impact of Query-Bias on Text
Summarization
In the context of the Document Understanding Conferences, the task of Query-Focused
Multi-Document Summarization seeks to improve agreement in content among human-
generated model summaries. This agreement is essential to the content evaluation process
as described earlier in Chapter 3. Query-focus was also assumed to aid the automated sum-
marizers in directing the summary at specific topics, which would result in better agreement
of automated summaries with the model summaries. However, while query focus correlates
with performance, we show that high-performing automatic systems produce summaries
with disproportionately higher query term density than human summarizers do. Experimental
evidence suggests that automatic systems rely heavily on query term occurrence and
repetition to achieve good performance.
Human Agreement One of the issues studied since the inception of automatic summa-
rization is that of human agreement: different people choose different content for their sum-
maries [Rath et al., 1961, van Halteren and Teufel, 2003, Nenkova et al., 2007]. Even the
same person may not produce the same summary at a later time [Rath et al., 1961]. Humans
vary in what material they choose to include in a summary and how they express the content,
and their judgments of summary quality vary from one person to another and across
time for a single person [Harman and Over, 2004]. Later, it was assumed [Dang, 2005]
that having a question/query to provide focus would improve agreement between any two
human-generated model summaries, as well as between a model summary and an auto-
mated summary. This agreement in content is at the heart of the approaches taken for the
automated summarization evaluation techniques [Lin, 2004b, Nenkova et al., 2007]. It has
been noted that these reference-based content evaluations would be more robust if multiple
gold-standard summaries were used [Lin, 2004a, van Halteren and Teufel, 2003], and
there have been dedicated studies on how many references are required to obtain stable
evaluation results [Lin, 2004a, Nenkova et al., 2007].
Query-focus and granularity In trying to further understand the process of text summa-
rization, the steering committee of DUC decided to constrain the summarization process
based on two major parameters that could produce summaries with widely different con-
tent: query-focus and granularity. Having a question/query to focus the summary was
intended to improve the agreement between the model summaries. Additionally, for DUC
2005, the NIST assessor who developed each topic also specified the desired granularity
(level of generalization) of the summary. Granularity was a way to express one type of
user preference; one user might want a “general” background or overview summary, while
another user might want “specific” details that would allow him to answer questions about
specific events or situations. This parameter, “granularity”, was withdrawn from later
DUCs since NIST assessors found that the size of the summary plays a much bigger role
than a granularity specification in determining what information to include. Almost all NIST
assessors tried to write their summaries according to the granularity requested, but some
“specific” summaries ended up being very general given the large amount of information
and small space allowance. Despite this, all the NIST assessors appreciated the theory
behind the granularity specification.
Research in query-focused summarization From 2005 until 2007, a query-focused
multi-document summarization task was conducted as part of the annual Document Understanding
Conference. This task models a real-world complex question answering scenario,
where systems need to synthesize, from a set of 25 documents, a brief (250-word),
well-organized, fluent answer to an information need. Query-focused summarization is a topic
of ongoing importance within the summarization and question answering communities.
Most of the work in this area has been conducted under the guise of “query-focused multi-
document summarization”, “descriptive question answering”, or even “complex question
answering”. A recent addition at the Text Analysis Conference (TAC), the Update
Summarization task, is a natural extension of “query-focused multi-document summarization”,
as discussed in Section 1.5.
4.1 Introduction to Query-Bias vs Query-Focus
The term ‘query-bias’, with respect to a sentence, is precisely defined to mean that the
sentence has at least one query term within it. The term ‘query-focus’ is less precisely
defined, but is related to the cognitive task of focusing a summary on the query, which we
assume humans do naturally. In other words, the human generated model summaries (and
other human authored summaries) are assumed to be query-focused.
Query-bias in sentences could be seen as a trivial way of trying to find sentence rele-
vance to the query [Gupta et al., 2007]. We follow the same intuition, and define query-
biased sentences as those sentences that have at least one content word in common with the
query. Content words are all those words that are not stop words; the stop words for our
purposes have been obtained from the Rainbow Classification Toolkit [McCallum, 1996]. In
order to study how query-bias influences human/system content selection choices, we used
the 45 test sets of the query-focused multi-document summarization task from DUC 2007.
For each set, the input for summarization was available, along with four model summaries
for the input, and the summaries produced by all the automatic summarizers that participated
in DUC 2007. Each input set consisted of 25 documents and the summaries were
250 words long.
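The query-bias test described above can be sketched in a few lines of Python. This is an illustrative sketch, not the thesis implementation; in particular, the tiny STOPWORDS set stands in for the full stop word list from the Rainbow Classification Toolkit.

```python
# Sketch of the query-bias test: a sentence is query-biased iff it shares
# at least one content word with the query. STOPWORDS is a tiny
# illustrative stand-in for the Rainbow toolkit's stop word list.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "is", "are", "was",
             "what", "describe"}

def content_words(text):
    """Lowercased, punctuation-stripped tokens of `text` that are not stop words."""
    return {w.strip(".,;:?!").lower() for w in text.split()} - STOPWORDS

def is_query_biased(sentence, query):
    """True iff the sentence has at least one content word in common with the query."""
    return bool(content_words(sentence) & content_words(query))
```

Averaging this predicate over all sentences of a document collection or of a set of model summaries yields the densities P′ and Q′ reported in Tables 4.1 and 4.2.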
4.1.1 Query-biased content in human summaries
While performing corpus studies on the DUC dataset¹, we computed the amount of
query-bias in the document collections and in the gold-standard human summaries (model
summaries). We observed that the amount of query-bias in model summaries was considerably
higher than that in the document collections. Based on this observation, we
began investigations addressing the question: “Does query-bias affect sentence selection in
Multi-Document Summarization?”
In our initial investigations we observed that in the DUC 2007 dataset, around 35% of
the sentences in the source document set were query-biased, as seen in Table 4.1². Similarly,
nearly 58% of the sentences in model summaries were query-biased, as seen in
Table 4.2. After checking for consistency across the DUC datasets of three years (the DUC
2005, 2006 and 2007 Query-Focused Multi-Document Summarization tasks), we confirmed
the relation between the query-biased content in document collections and model summaries.
Tables 4.1 and 4.2 summarize the findings³. As seen in Table 4.3, R, the ratio of P′
(from Table 4.1) to Q′ (from Table 4.2), is small, indicating that model summaries are
denser than document collections in terms of query-bias.
¹The multi-document summarization corpus for DUC 2005, 2006 and 2007.
²All the counts of sentences in Tables 4.1, 4.2, 4.5, 4.6 and 4.7 are averages over all topics.
³The 2005 data had a variable number of documents per topic and also included the granularity aspect in the summarization process; hence, the data is somewhat perturbed, but the overall patterns are the same.
Year’s Dataset Sentences(D) Biased Sentences(Dq) P′ = Dq/D (in %)
2005 935.2 192.06 20.54
2006 703.72 236.14 33.56
2007 548.96 195.76 35.66
Table 4.1 Percentage of query-biased content in document collections
Year’s Dataset Sentences(M ) Biased Sentences(Mq) Q′ = Mq/M (in %)
2005 81.08 42.16 52.00
2006 58.78 30.7 52.23
2007 52.91 30.67 57.96
Table 4.2 Percentage of query-biased content in model summaries
Dataset Ratio, R = P ′/Q′
2005 0.3950
2006 0.6425
2007 0.6152
Table 4.3 Ratio of Query-bias densities
4.2 Theoretical Justification of Query-Bias Affecting Summarization Performance
Text summarization research, in recent years, has mostly been approached as a sentence
extraction problem, where key sentences from the input are extracted and concatenated, in a
meaningful order, to form a summary. Such a formulation is easily captured by a classifier-
like approach, and most of the current and popular approaches to text summarization are
based on classifying a sentence as relevant or irrelevant. There is ample research that has
used a large number of simple features to train a classifier to predict relevance of a sen-
tence; more on such classifier-based approaches are described briefly in Section 3.1. In this
section, we show how any simple classifier-like algorithm’s performance can be improved
by biasing the summarizer towards query-biased sentences.
It has been shown earlier [Gupta et al., 2007] that term frequency or term-likelihood
based summarizers perform better when they use only the query-biased set of sentences for
ranking. Our primary hypothesis is that most (if not all) of the systems that perform well
in the summarization task rely very heavily, knowingly or unknowingly, on query-bias.
Later in this chapter our results also show that when humans summarize, most of them
follow a similar strategy informed by query-bias.
The “query-focused multi-document summarization task” was defined based on
the observation that ‘query-focus’ provides consensus among the model summaries, and
hence reasonable clarity on what an automated text summarizer must produce. It is
important to note that the automated systems try to imitate the same phenomenon when they
bias towards query terms while using shallow approaches to the problem. However,
there have been no studies on how ‘query-bias’ affects human/automatic summarization,
unlike frequency-based measures, which have been well studied since Luhn’s early work
[Luhn, 1958] and were shown to be of importance by [Nenkova et al., 2006]. This
led us to investigate whether there is some relationship between query-focused summarization
and unigram query-bias density.
In the following sections we first describe an Equi-probable Automatic Summarization
algorithm (Section 4.2.1), then we show how the algorithm performs theoretically, when
constrained on query-bias (Section 4.2.2) and finally discuss its practical value in analyzing
the problem at hand. The central idea in showcasing such a summarizer is to show that a
query-bias based summarizer can perform better than a summarizer without query-bias.
That is, theoretically, it is possible to build summarization algorithms whose performance
improves by biasing on query terms.
4.2.1 Equi-probable summarization setting
In this section we describe an ‘equi-probable’ sentence selection algorithm for automatic
text summarization. Since all sentences are equally likely to be part of the summary, such
an algorithm selects the sentences to be included in the summary at random. In such a
scenario, the probability of picking any particular sentence to be included in the summary
is 1/|D|, where |D| is the total number of sentences in the input.
In this setting, which we call the standard setting, the summarization problem is seen
as a sentence classification problem. Given a sentence s, the classifier should be able to
classify it as relevant for inclusion in the summary or not. Let D be the set of sentences in
the document collection, M the set of sentences in the model summaries, and S(x) the
summarization function. Then, the goal of the summarization function S(x) is to generate
a set of sentences that belong to M.
That is,
∀s ∈ S (D) : s ∈ M
i.e., S (D) ⊂ M
Let there be |D| sentences in D, and |M| sentences in M. Assuming that redundancy
is allowed while generating a summary, and that each sentence is equally likely to be
picked, the probability of picking k sentences that contribute towards the model summaries is

P(S(D) ⊂ M) = (|M| / |D|)^k    (4.1)
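Under the redundancy assumption of Equation 4.1, the equi-probable summarizer's chance of drawing k sentences that all belong to M can be checked with a small Monte Carlo sketch. This is illustrative code, not the thesis implementation, and the set sizes passed in are arbitrary.

```python
import random

def equiprobable_hit_prob(D_size, M_size, k, trials=200_000, seed=42):
    """Monte Carlo estimate of P(S(D) subset of M) for the equi-probable
    summarizer: k uniform picks with replacement (redundancy allowed,
    as in Eq. 4.1) must all land inside the model-summary set M."""
    rng = random.Random(seed)
    hits = sum(
        all(rng.randrange(D_size) < M_size for _ in range(k))
        for _ in range(trials)
    )
    return hits / trials

def closed_form(D_size, M_size, k):
    """Eq. 4.1: (|M| / |D|) ** k."""
    return (M_size / D_size) ** k
```

With the DUC 2007 averages (|D| ≈ 549, |M| ≈ 53) and k = 2, both the estimate and the closed form fall below one percent, which is why unguided random extraction performs so poorly.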
4.2.2 Query-biased Equi-probable summarization setting
Assuming we constrain the summarization problem based on query-bias, to create the
constrained setting, the following transformations take place. Let the sets Dq ⊂ D and
Mq ⊂ M be defined as the query-biased components of D and M respectively. Then, the goal
of the summarization function is to generate a set of sentences that belong to Mq. That is,

∀s ∈ S(Dq) : s ∈ Mq

i.e., S(Dq) ⊂ Mq
Also, since model summaries are created based on the explicit guidelines⁴ provided,
it is safe to assume that sentences generated by human assessors are not restructured or
regenerated in a way that deviates too far from the source sentence(s). That is, M ⊂ D and
Mq ⊂ Dq.
Now, if we were to pick the k sentences for the summary from Dq, then the probability of
picking those sentences from Mq is

P(S(Dq) ⊂ Mq) = ( ((|Mq|/|M|) · |M|) / ((|Dq|/|D|) · |D|) )^k    (4.2)

⟹ P(S(Dq) ⊂ Mq) = { (|Mq|/|M|) / (|Dq|/|D|) }^k · { |M|/|D| }^k    (4.3)
(4.3)
The second term in Equation 4.3 is equivalent to Equation 4.1, and hence the probability
of picking the right sentences depends on the first term, that is, λ = { (|Mq|/|M|) / (|Dq|/|D|) }^k.
Here, λ is the k-th power of the ratio of unigram query-bias densities in model summaries
and document collections. If this ratio is greater than 1, then λ > 1 and the probability of
picking the right information (which is the classification accuracy) increases. The ratio is
the exact inverse of the term R that has been empirically calculated in Table 4.3. Based on
those results, we observe that λ > 1. Therefore, the probability of finding the right
information increases as we work on the biased subset of the data. Note that this improved
probability is for the case of the equi-probable sentence selection criterion.
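Using the DUC 2007 average sentence counts from Tables 4.1 and 4.2, the per-pick base of λ can be computed directly. This is a quick numerical check, not part of the thesis code.

```python
def bias_density_ratio(Mq, M, Dq, D):
    """Per-pick base of lambda: (|Mq|/|M|) / (|Dq|/|D|). A value above 1
    means restricting the equi-probable summarizer to query-biased
    sentences raises its chance of emitting model-summary sentences."""
    return (Mq / M) / (Dq / D)

# Average sentence counts for DUC 2007 (Tables 4.1 and 4.2):
ratio = bias_density_ratio(Mq=30.67, M=52.91, Dq=195.76, D=548.96)
# ratio is about 1.63, i.e. the inverse of R = 0.6152 from Table 4.3
```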
4.2.3 Performance of Equi-probable summarization
Theoretically, the summarizer we just described (Section 4.2.1) should perform better when
constrained on query-bias. To measure its practical value we implemented the above
summarizer. We disallowed duplicate sentences in the summary, since redundancy is of no use
in an actual algorithm and was allowed in the previous discussion only to ease understanding.
For evaluation we computed the ROUGE suite of evaluation metrics [Lin, 2004b], and we
report the ROUGE-2 and ROUGE-SU4 metrics, which were the official automatic evaluation
metrics for DUC 2007.

⁴Guidelines for human summary writers are available at http://duc.nist.gov/duc2005/assessor.summarization.instructions.pdf
The ROUGE scores of the equi-probable summarizer in both the standard setting and the
constrained setting are shown in Table 4.4. The scores provided are averages over 1000
runs of each setting, so that the randomness of the algorithm does not significantly affect
the results. There is a significant improvement in both metrics, with a 9.3% improvement
in ROUGE-2 and a 4.3% improvement in ROUGE-SU4 scores. The results of such a naive
summarizer are statistically better than at least 8 and 10 automated systems of DUC 2007,
based on the ROUGE-2 and ROUGE-SU4 metrics respectively.
Setting ROUGE-2 ROUGE-SU4
standard 0.06970 (95%-conf.int. 0.06530 - 0.07364) 0.12692 (95%-conf.int. 0.12279 - 0.13082)
constrained 0.07621 (95%-conf.int. 0.07150 - 0.08101) 0.13230 (95%-conf.int. 0.12779 - 0.13665)
Table 4.4 ROUGE scores with confidence intervals for the equi-probable summarizer.
4.2.4 Observations
We have shown in Section 4.2.3 that an equi-probable summarizer improves its performance
under the constraint of query-bias. Given that the classification and ranking algorithms
currently employed for the summarization task perform considerably better than
equi-probable selection (all but 8-10 systems outperform this algorithm), this bias will
certainly help in obtaining summaries that are closer to the model summaries.
For a better ranking algorithm, the probability of selecting an important sentence is
greater than under equi-probable selection, as the algorithm is better informed about
important sentences. This also means that many of the systems that already perform
reasonably well in sentence selection may not be able to utilize query-bias to further improve
performance. In fact, those systems whose algorithms already rely on query-bias and are
already performing well may not improve their performance by introducing a bias towards
query terms. On the contrary, they could actually worsen their sentence selection, as they
could be losing out on important sentences from (D − Dq) if the system’s precision in
picking unbiased sentences is higher than its precision in picking biased sentences.
4.3 Performance of participating systems from DUC 2007
In this section, we analyze the systems that participated in the DUC 2007 Query-Focused
Multi-Document Summarization Task. We segregated the systems into a few categories
based on their performance in various evaluation metrics. Since we are interested
in the effect of query-bias on content selection, we ended up with the following categories:

1. Systems that performed very well in ROUGE/BE (content overlap metrics), seen in Table 4.5.⁵

2. Systems that performed very well in Linguistic Quality, seen in Table 4.6.

3. Systems that performed poorly in content overlap metrics, seen in Table 4.7.
Discussion Table 4.5 shows the 5 top performing systems and their respective percentages
of query-bias. It is notable that all the systems in this category had a very high percentage
of query-biased sentences.⁶ On the contrary, Table 4.7 shows the exact opposite: systems

⁵Tables 4.5, 4.6, 4.7, 4.9 and 4.10 are ordered by rank of the systems within their respective categories.
⁶System 29 has a comparatively lower P′ since, in their approach, many sentence simplification operations were performed and the source sentences were distorted.
System ID Sentences(S) Biased Sentences(F) P′ = F/S (in %)
15 9.88 8.48 85.83
24 11.6 9.26 79.91
29 19.04 13.46 70.69
4 7.84 7.33 93.49
13 10.11 9.64 95.35
Table 4.5 Systems that performed well in content overlap metrics
System ID Sentences(S) Biased Sentences(F) P′ = F/S (in %)
23 7.44 6.88 92.47
4 7.84 7.33 93.49
14 8.7333 8 91.60
5 7.82 7.24 92.32
17 10.07 9.33 92.65
Table 4.6 Systems that performed well based on Linguistic Quality Evaluations.
that do not perform well on content metrics have very low amounts of query-bias.
This shows that there are observable patterns relating query-bias and informative
content. But there were exceptions too. Some systems (System IDs 5 and 17, for
example) had a good amount of query-bias but did not perform well on content evaluations.
However, these were systems that performed well (apart from the baseline, System ID 1)
on linguistic quality evaluations. This is a major insight, since some systems might
be biasing towards query terms precisely so that they generate coherent summaries.⁷
Indeed, as we see in Table 4.6, all the top performing systems in linguistic quality
evaluations were biased towards query terms. This analysis shows that working at the level

⁷See Focus, Structure and Coherence in Section 3.4.1.
System ID Sentences(S) Biased Sentences(F) P′ = F/S (in %)
16 9.8 4.67 47.65
27 13.33 4.53 33.98
6 8.87 5.11 57.61
10 14.36 8.4 58.50
11 10.29 6.44 62.59
12 9.84 5.62 57.11
Table 4.7 Systems that did not perform well based on content overlap metrics
of top-performing (and worst-performing) systems does not fully resolve the intuition with
which the experiments were conducted; a much deeper analysis based on theoretically
sound methods is required to assess whether all systems are indeed query-biased.
4.4 Query-Bias in Summary Content Units (SCUs)
Summary content units, referred to as SCUs hereafter, are semantically motivated subsentential
units that are variable in length but no bigger than a sentential clause. SCUs are
constructed from annotation of a collection of human summaries on a given document
collection. SCUs are identified by noting information that is repeated across these human
summaries. The repeated information can be as small as a modifier of a noun phrase or as
large as a clause. The evaluation method based on overlapping SCUs in human and
automatic summaries is called the pyramid method [Nenkova et al., 2007].
The University of Ottawa has organized the pyramid annotation data such that for some
of the sentences in the original document collection, a list of corresponding content units
is known [Copeck et al., 2006]. In Figure 4.1 we visualize the structure of pyramid
annotations and the source sentence mapping done at Ottawa. On the left of Figure 4.1
we can see how models and peers interact to form a pyramid structure, as explained by
[Nenkova et al., 2007]. There is a ‘one-to-many’ mapping between an SCU of a pyramid
and sentences in any of the peers; a fact could be represented in different sentences,
at possibly different granularities. On the right side of Figure 4.1, we can see how
each sentence in a peer can be mapped to a sentence in the source document collection by
a ‘one-to-many’ relationship. Such a one-to-many relationship exists only under the
assumption that all the peers are sentence-extractive summaries. Hence, overall, it
is possible to obtain a ‘one-to-many’ mapping between an SCU and one or more source
sentences.
Figure 4.1 The process and structure of Pyramid Annotations and source mappings
Figure 4.2 SCU annotation of a source document.
Dataset Total Relevant Biased relevant Irrelevant Biased irrelevant % bias in relevant % bias in irrelevant
DUC 2005 24831 1480 1127 1912 1063 76.15 55.60
DUC 2006 14747 1047 902 1407 908 86.15 71.64
DUC 2007 12832 924 782 975 674 84.63 69.12

Table 4.8 Statistical information on counts of query-biased sentences.

A sample of such an SCU mapping for a document from topic D0701A of the DUC
2007 QF-MDS corpus is shown in Figure 4.2. Three sentences are seen in the figure, among
which two have been annotated with system IDs and SCU weights wherever applicable.
The first sentence has not been picked by any of the summarizers participating in Pyramid
Evaluations, hence it is unknown if the sentence would have contributed to any SCU. The
second sentence was picked by 8 summarizers and that sentence contributed to an SCU of
weight 3, hence it is a relevant sentence. The third sentence in the example was picked by
one summarizer, however, it did not contribute to any SCU, so it is an irrelevant sentence.
This example shows all three types of sentences available in the corpus: unknown
samples, relevant samples and irrelevant samples.
We extracted the relevant and irrelevant samples in the source documents from these
annotations, i.e., sentences of the second and third types shown in Figure 4.2. Figure 4.3
shows the distribution of query-bias in the source document collections. Among a total of
≈ 13000 sentences available in the source collections, around 2000 sentences (14.8%) were
annotated as either relevant or irrelevant. When we analyzed the relevant set, we found
that 84.63% of its sentences were query-biased. On the irrelevant set, we found that
69.12% of the sentences were query-biased. That is, on average, 76.67% of the sentences
picked by any automated summarizer are query-biased. All the above numbers are based
on the DUC 2007 dataset, shown in boldface in Table 4.8.
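The percentages in the DUC 2007 row of Table 4.8 follow directly from the sentence counts; a simple arithmetic check, not thesis code:

```python
def pct_biased(biased, total):
    """Percentage of query-biased sentences in a sample."""
    return 100.0 * biased / total

# DUC 2007 row of Table 4.8:
rel = pct_biased(782, 924)                   # bias among relevant sentences, about 84.6%
irr = pct_biased(674, 975)                   # bias among irrelevant sentences, about 69.1%
overall = pct_biased(782 + 674, 924 + 975)   # pooled over all annotated sentences, about 76.7%
```

The pooled figure is the 76.67% quoted above: it weights the relevant and irrelevant sets by their sizes rather than averaging the two percentages.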
Figure 4.3 Distribution of relevant and irrelevant sentences in query-biased corpus

The above experiment shows that whether the systems pick relevant sentences or not,
they are more likely to pick sentences that are query-biased. This could mean that these
systems are biasing towards query terms in order to closely follow relevant content, as
we explained in Section 4.2. However, there is one caveat: the annotated sentences come
only from the summaries of systems that participated in the pyramid evaluations. Since
only 13 of the 32 participating systems were evaluated using pyramid evaluations, the
dataset is limited. Despite this, it is very clear that at least those systems that participated
in pyramid evaluations have been biased towards query terms, or at least have been better
at correctly identifying important sentences among the query-biased sentences than among
the query-unbiased sentences.
4.5 Formalizing Query-Bias
Our search for a formal method to capture the relation between occurrence of query-biased
sentences in the input and in summaries resulted in building binomial and multinomial
model distributions. The distributions estimated were then used to obtain the likelihood of
a query-biased sentence being emitted into a summary by each system.
For the DUC 2007 data, there were 45 summaries from each of the 32 systems (labeled
1-32), among which 2 were baselines (labeled 1 and 2), and 18 summaries from each of the
10 human summarizers (labeled A-J). We computed the log-likelihood, log(L[summary; p(Ci)]),
of all human and machine summaries from the DUC 2007 query-focused multi-document
summarization task, under both of the distributions described below (see Sections 4.5.1 and 4.5.2).
4.5.1 The Binomial Model
We represent the set of sentences as a binomial distribution over the two types of sentences.
Let Ci ∈ {C0, C1} denote the classes of sentences without and with query-bias, respectively.
For each sentence s ∈ Ci in the input collection, we associate a probability p(Ci) of it being
emitted into a summary. Query-biased sentences are assigned lower emission probabilities,
because the occurrence of query-biased sentences in the input is less likely: on average each
topic has 549 sentences, among which 196 contain a query term, which means only 35.6% of
the sentences in the input were query-biased. Hence, the likelihood function here denotes the
likelihood of a summary containing non-query-biased sentences, and summaries of humans
and systems must have low likelihood under this model if they rely on query-bias.
The likelihood of a summary then is:

L[summary; p(Ci)] = (N! / (n0! n1!)) · p(C0)^n0 · p(C1)^n1    (4.4)
Where N is the number of sentences in the summary, and n0 + n1 = N; n0 and n1 are
the cardinalities of C0 and C1 in the summary. Table 4.9 shows various systems with their
ranks based on ROUGE-2 and the average log-likelihood scores. The ROUGE [Lin, 2004b]
suite comprises n-gram overlap metrics that have been shown to correlate highly
with human evaluations of content responsiveness. ROUGE-2 and ROUGE-SU4 have been
the official ROUGE metrics for evaluating the query-focused multi-document summarization
task since DUC 2005.
ID rank LL ROUGE-2 ID rank LL ROUGE-2 ID rank LL ROUGE-2
1 31 -1.9842 0.06039 J -3.9465 0.13904 24 4 -5.8451 0.11793
C -2.1387 0.15055 E -3.9485 0.13850 9 12 -5.9049 0.10370
16 32 -2.2906 0.03813 10 28 -4.0723 0.07908 14 14 -5.9860 0.10277
27 30 -2.4012 0.06238 21 22 -4.2460 0.08989 5 23 -6.0464 0.08784
6 29 -2.5536 0.07135 G -4.3143 0.13390 4 3 -6.2347 0.11887
12 25 -2.9415 0.08505 25 27 -4.4542 0.08039 20 6 -6.3923 0.10879
I -3.0196 0.13621 B -4.4655 0.13992 29 2 -6.4076 0.12028
11 24 -3.0495 0.08678 19 26 -4.6785 0.08453 3 9 -7.1720 0.10660
28 16 -3.1932 0.09858 26 21 -4.7658 0.08989 8 11 -7.4125 0.10408
2 18 -3.2058 0.09382 23 7 -5.3418 0.10810 17 15 -7.4458 0.10212
D -3.2357 0.17528 30 10 -5.4039 0.10614 13 5 -7.7504 0.11172
H -3.4494 0.13001 7 8 -5.6291 0.10795 32 17 -8.0117 0.09750
A -3.6481 0.13254 18 19 -5.6397 0.09170 22 13 -8.9843 0.10329
F -3.8316 0.13395 15 1 -5.7938 0.12448 31 20 -9.0806 0.09126
Table 4.9 Rank, averaged log-likelihood score based on the binomial model, and true ROUGE-2 score for the summaries of various systems in the DUC’07 query-focused multi-document summarization task.
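The log-likelihood in Equation 4.4 is straightforward to compute; below is a hedged sketch (not the thesis code) that uses the log-gamma function for the factorials to stay numerically stable for longer summaries:

```python
import math

def binomial_loglik(n0, n1, p0):
    """Log of Eq. 4.4 for a summary with n0 non-query-biased and n1
    query-biased sentences; p0 is the emission probability of a
    non-query-biased sentence and p1 = 1 - p0."""
    N = n0 + n1
    log_coeff = math.lgamma(N + 1) - math.lgamma(n0 + 1) - math.lgamma(n1 + 1)
    return log_coeff + n0 * math.log(p0) + n1 * math.log(1 - p0)
```

With p0 = 0.644 (the input-side proportion of non-query-biased sentences), a summary that leans heavily on query-biased sentences scores a much lower log-likelihood than one mirroring the input distribution, which is exactly the signal the systems in Table 4.9 are ordered by.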
4.5.2 The Multinomial Model
In the previous section (Section 4.5.1), we described the binomial model, where we
classified each sentence as being query-biased or not. If we instead quantify the
amount of query-bias in a sentence, we associate each sentence with one of k+1 possible
classes, leading to a multinomial distribution. Let Ci ∈ {C0, C1, C2, . . . , Ck} denote the
levels of query-bias, where Ci is the set of sentences having exactly i query terms.

The number of sentences per class varies greatly, with C0 bagging a high percentage of
the sentences (64.4%) and {C1, C2, . . . , Ck} distributing the remaining 35.6% among
themselves. Since the distribution is highly skewed, distinguishing systems based on
log-likelihood scores under this model is easier and perhaps more accurate. As before,
summaries of humans and systems must have low likelihood under this model if they rely
on query-bias.
The likelihood of a summary then is:

L[summary; p(Ci)] = (N! / (n0! n1! · · · nk!)) · p(C0)^n0 · p(C1)^n1 · · · p(Ck)^nk    (4.5)
Where N is the number of sentences in the summary, and n0 + n1 + · · · + nk = N; n0,
n1,· · · ,nk are respectively the cardinalities of C0, C1, · · · ,Ck, in the summary. Table 4.10
shows various systems with their ranks based on ROUGE-2 and the average log-likelihood
scores.
ID rank LL ROUGE-2 ID rank LL ROUGE-2 ID rank LL ROUGE-2
1 31 -4.6770 0.06039 10 28 -8.5004 0.07908 5 23 -14.3259 0.08784
16 32 -4.7390 0.03813 G -9.5593 0.13390 9 12 -14.4732 0.10370
6 29 -5.4809 0.07135 E -9.6831 0.13850 22 13 -14.8557 0.10329
27 30 -5.5110 0.06238 26 21 -9.7163 0.08989 4 3 -14.9307 0.11887
I -6.7662 0.13621 J -9.8386 0.13904 18 19 -15.0114 0.09170
12 25 -6.8631 0.08505 19 26 -10.3226 0.08453 14 14 -15.4863 0.10277
2 18 -6.9363 0.09382 B -10.4152 0.13992 20 6 -15.8697 0.10879
C -7.2497 0.15055 25 27 -10.7693 0.08039 32 17 -15.9318 0.09750
H -7.6657 0.13001 29 2 -12.7595 0.12028 7 8 -15.9927 0.10795
11 24 -7.8048 0.08678 21 22 -13.1686 0.08989 17 15 -17.3737 0.10212
A -7.8690 0.13254 24 4 -13.2842 0.11793 8 11 -17.4454 0.10408
D -8.0266 0.17528 30 10 -13.3632 0.10614 31 20 -17.5615 0.09126
28 16 -8.0307 0.09858 23 7 -13.7781 0.10810 3 9 -19.0495 0.10660
F -8.2633 0.13395 15 1 -14.2832 0.12448 13 5 -19.3089 0.11172
Table 4.10 Rank, averaged log-likelihood score based on the multinomial model, and true ROUGE-2 score for the summaries of various systems in the DUC’07 query-focused multi-document summarization task.
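Equation 4.5 generalizes the binomial computation directly; again an illustrative sketch rather than the thesis code:

```python
import math

def multinomial_loglik(counts, probs):
    """Log of Eq. 4.5: counts[i] is n_i, the number of summary sentences
    in class C_i, and probs[i] is p(C_i); probs should sum to 1."""
    ll = math.lgamma(sum(counts) + 1)  # log N!
    for n_i, p_i in zip(counts, probs):
        ll -= math.lgamma(n_i + 1)     # log n_i!
        if n_i:
            ll += n_i * math.log(p_i)
    return ll
```

With two classes this reduces to the binomial case of Equation 4.4, and a summary concentrated in a single certain class has log-likelihood zero.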
4.5.3 Correlation of ROUGE metrics with likelihood of query-bias
Tables 4.9 and 4.10 display the log-likelihood scores of various systems in descending
order, along with their respective ROUGE-2 scores. We computed the Pearson correlation
coefficient (ρ) between ROUGE-2 and log-likelihood, and between ROUGE-SU4 and
log-likelihood. This was computed separately for systems (IDs 1-32), giving r1, and for
humans (IDs A-J), giving r2, and for both distributions.
For the binomial model, r1 = -0.66 and r2 = 0.39 were obtained. This indicates a strong
negative correlation between the likelihood of occurrence of non-query-biased sentences and
the ROUGE-2 score, that is, a strong positive correlation between the likelihood of
occurrence of query-biased sentences and the ROUGE-2 score. For human summarizers, in
contrast, there is a weak negative correlation between the likelihood of occurrence of
query-biased sentences and the ROUGE-2 score. The same correlation analysis applies to the
ROUGE-SU4 scores: r1 = -0.66 and r2 = 0.38.
A similar analysis with the multinomial model is reported in Tables 4.11 and 4.12, which show the correlations between ROUGE and log-likelihood scores for systems8 and humans9.
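As a concrete sketch, the correlation computation above can be reproduced with a plain Pearson r. The score lists below are illustrative placeholders, not the actual per-system values from the tables.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Illustrative placeholder scores for five systems (not the thesis data):
log_likelihood = [-4.74, -6.86, -8.03, -14.28, -19.31]
rouge_2 = [0.038, 0.085, 0.099, 0.124, 0.112]

print(round(pearson(log_likelihood, rouge_2), 2))
```

In practice one would also compute a p-value for each r against the number of systems (N = 32) or humans (N = 10), as reported in the footnotes to Tables 4.11 and 4.12.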
ρ ROUGE-2 ROUGE-SU4
binomial -0.66 -0.66
multinomial -0.73 -0.73
Table 4.11 Correlation of ROUGE measures with log-likelihood scores for automated systems
ρ ROUGE-2 ROUGE-SU4
binomial 0.39 0.38
multinomial 0.15 0.09
Table 4.12 Correlation of ROUGE measures with log-likelihood scores for humans
8 All the results in Table 4.11 are statistically significant (p < 0.00004, N = 32).
9 None of the results in Table 4.12 are statistically significant (p > 0.265, N = 10).
4.6 Discussion
Based on our observations in this chapter, it is clear that most automated systems bias themselves towards query terms in trying to generate summaries that are closer to human summaries. We have shown through various experiments that query-bias is likely directly responsible for otherwise poor algorithms producing better summaries. Viewing these algorithms through the classification view of summarization (see Figure 4.4), we observe that most systems ignore sentences that do not contain query terms. Discarding sentences based on such a naive surface feature (mere term-bias) means that much of the informative content in the non-biased sentences is lost. Hence, the behavior of each algorithm needs to be examined along these lines before such issues surface at a later stage.
Figure 4.4 Impact of query-bias in the classification view of the Text Summarization Process
Figure 4.4 illustrates how the various summarization functions S1(D), S2(D), · · · are query-biased in the current context of query-focused multi-document summarization.
Topic sentences are color-shaded as query-biased and unbiased sentences, and the same mapping is shown for each summarizer's output.
4.7 Conclusive Remarks
1. Automated systems are query-biased in their approach to query-focused multi-document summarization.
2. The binomial model clearly captures the relation between query-bias and ROUGE scores. It also shows how human models are collectively similar in their strategy from the query-bias point of view.
3. The multinomial model reinforces what the binomial model shows while revealing finer differences in the way systems are query-biased. In particular, it lets us differentiate systems that are biased towards sentences with multiple query terms from those biased towards sentences containing a single query term.
4.8 Chapter Summary
Our results underscore the differences between human and machine-generated summaries. Based on a Summary Content Unit (SCU) level analysis of query-bias, we argue that most systems are able to find important sentences only among query-biased sentences. More importantly, we show that, on average, 76.67% of the sentences picked by any automated summarizer are query-biased. When asked to produce query-focused summaries, humans do not rely to the same extent on the repetition of query terms.
We further confirm, based on the likelihood of emitting a non-query-biased sentence, that there is a strong negative correlation between systems' likelihood scores and ROUGE scores, which suggests that systems try to improve performance on ROUGE metrics by being biased towards the query terms. Humans, on the other hand, do not rely on query-bias, though we do not have statistically significant evidence for this. We have also speculated that the multinomial model better captures the variance across systems, since it distinguishes among query-biased sentences by quantifying the amount of query-bias.
From our point of view, most extractive summarization algorithms are formalized on top of a bag-of-words query model; the innovation in individual approaches has been in the actual algorithm built on top of that query model. We speculate that the real difference between human and automated summarizers could lie in the way the query (or relevance) is represented. Traditional query models from the IR literature have been used in summarization research thus far, and though some previous work [Amini and Usunier, 2007] tries to address this issue using contextual query expansion, new models for representing the query are perhaps the only way to induce topic-focus in the summary. IR-like query models, designed to handle 'short keyword queries', are perhaps not capable of handling the 'elaborate queries' found in summarization. Since the notion of query-focus is apparently missing from current algorithms, future summarization algorithms must incorporate it by design.
Chapter 5
Baselines for Update Summarization:
The Sentence Position Hypothesis
In this chapter, we describe a summarizer built on a sentence position policy derived from the evaluation testbed of recent summarization tasks at the Document Understanding Conferences (DUC). We show that the resulting summarizer outperforms most systems participating in task-focused summarization evaluations at the Text Analysis Conference (TAC) 2008. Our experiments also show that such a method performs better at producing short summaries (up to 100 words) than longer summaries. Further, we discuss the baselines traditionally used in summarization evaluation and suggest reviving an old baseline to suit the current summarization task at TAC: the Update Summarization task.
5.1 Introduction
Document summarization has received a lot of attention since the early work of Luhn [Luhn, 1958], in which statistical information derived from word frequency and distribution was used to compute a relative measure of significance, first for individual words and then for sentences. Later, Edmundson [Edmundson, 1969] introduced four clues for identifying significant words (topics) in a text. Among them, title and location are related to position methods, while the other two are the presence of cue words and of high-frequency content words. Edmundson assigned positive weights to sentences according to their ordinal position in the text, giving more weight to the first sentence of the first paragraph and the last sentence of the last paragraph.
5.1.1 Sentence Position
The position of a sentence in a document, or of a word in a sentence, gives good clues to the importance of that sentence or word. Such features are called locational features, and the sentence position feature deals with the presence of key sentences at specific locations in the text. Sentence position has been well studied in summarization research since its inception, beginning with Edmundson's work [Edmundson, 1969]. Earlier, Baxendale [Baxendale, 1958] defined a position method in a very straightforward way as the title plus the first and last sentence of a paragraph. But since the paradigmatic discourse structure differs significantly across subject domains and text genres, a position method should be more specific. Dolan [Dolan, 1980] stated that a study of topic sentences in expository prose showed that only 13% of paragraphs of contemporary professional writers began with topic sentences. Singer and Dolan [Singer and Dolan, 1980] maintain that the main idea of a paragraph can appear anywhere in the paragraph, or not be stated at all. Arriving at a completely negative conclusion, Paijmans [Paijmans, 1994] conducted experiments on the relation between word position and significance, and found that "words with high information content according to tf·idf-based weighting schemes do not cluster in the first and last sentences of paragraphs". In contrast, Kieras, in psychological studies [Kieras, 1985], confirmed the importance of the position of a mention within a text.
[Edmundson, 1969] The four basic methods employed by Edmundson to extract sentences are Cue, Key, Title and Location. Among these, title and location are the closest to position-based methods. The title method is based on the hypothesis that an author conceives the title as circumscribing the subject matter of the document; likewise, when the author partitions the body of the document into major sections, he summarizes each by choosing an appropriate heading. In this work Edmundson showed that the hypothesis that words of the title and headings are positively relevant was accepted at the 99 percent level of significance. For the location method, the hypothesis is that:
1. sentences occurring under certain headings are positively relevant; and
2. topic sentences tend to occur very early or very late in a document and its paragraphs.
As we shall see, this finding is replicated (in part) in our results.
[Baxendale, 1958] Baxendale’s investigation was based on a sample of 200 paragraphs
to determine where the important words are most likely to be found. He concluded that in
85% of the paragraphs, the first sentence was a topic sentence and in 7% of the paragraphs,
the final one.
[Lin and Hovy, 1997] Lin and Hovy [Lin and Hovy, 1997] describe a method for automated training and evaluation of an Optimal Position Policy, a method of locating likely positions of topic-bearing sentences based on genre-specific regularities of discourse structure. They provide an empirical validation of the position hypothesis as laid down by Edmundson [Edmundson, 1969]. Most of our work follows the structure of the experiments done by Lin and Hovy, whose details are discussed at length in the next few sections.
[Kastner and Monz, 2009] In the marginally related problem of key fact extraction, Kastner and Monz [Kastner and Monz, 2009] show similar results on the position hypothesis. They use the position feature to automatically identify news highlights, a feature of CNN's web-based news service that is currently produced manually.

The usefulness of 'key fact extraction' is shown by its use on CNN's website1, which
1http://www.cnn.com/
since 2006 has preceded most of its news stories with a list of story highlights; see2 Figure 5.1. The advantage of news highlights over full-text summaries is that they are much 'easier on the eye' and better suited to quick skimming. Until [Kastner and Monz, 2009] was published, CNN.com was the only provider of such a service. [Kastner and Monz, 2009] studied how far this process could be automated. As part of their study they identified how the position of a sentence in the text can be crucial to its inclusion in the highlights. Intuitively, facts of greater importance will be placed at the beginning of the text, and this is supported by their data, as shown in Figure 5.2. This work is very recent and was published in parallel with the work reported in [Katragadda et al., 2009].

Figure 5.1 An example of Highlights of a news story
Figure 5.2 Impact of location in identifying highlights

2 Figures 5.1 and 5.2 have been taken with permission from the original authors of the paper.

Machine Learning based approaches Position information has been used quite frequently in single-document summarization. Indeed, a simple baseline system that takes the first 'l' sentences as the summary outperforms most summarization systems at DUC 2004 [Barzilay and Lee, 2004]. [Zajic et al., 2002] also use position in scoring candidate summary sentences. In multi-document summarization, various systems have used position as a feature in scoring candidate sentences. Recent advances in machine learning have been adapted to the summarization problem over the years, and locational features have consistently been used to identify the salience of a sentence. Representative work on 'learning' sentence extraction includes training a binary classifier [Kupiec et al., 1995], training a Markov model [Conroy et al., 2004], training a CRF [Shen et al., 2007], and learning pairwise rankings of sentences [Toutanova et al., 2007]. An interesting use of the position feature appears in [tau Yih et al., 2007], where word position is exploited to score a sentence based on the average position of a word, using both generative and discriminative scoring functions for the words.
5.1.2 Introduction to the Position Hypothesis
The position hypothesis states that the importance of a sentence in a text is related to the ordinal position of the sentence in the text. As described earlier, each genre may have a different position of stress; for example, in technical literature the initial and final paragraphs are important. There have been negative conclusions in the literature: [Paijmans, 1994] states that words with high information content do not cluster in the first and last sentences of a paragraph, while [Singer and Dolan, 1980] argue that the main idea of a paragraph can appear anywhere in the text.

The purposes of this study are to clarify these contradictions in the literature, to test the intuitions and results mentioned above, and to verify the hypothesis that the importance of a sentence in a text is indeed related to its ordinal position. We also argue that such a genre-based summarizer would be a good baseline for short-summary tasks such as the Update Summarization task.
5.2 Sub-Optimal Sentence Position Policy (SPP)
Given a large text collection and a way to approximate relevance for a reasonably large subset of sentences, we can identify significant positional attributes for the genre of the collection. Our experiments are based on the work described in [Lin and Hovy, 1997], whose experiments on the Ziff-Davis corpus gave great insight into the selective power of the position method.
5.2.1 Sentence Position Yield and Optimal Position Policy (OPP)
Lin and Hovy [Lin and Hovy, 1997] provide an empirical validation of the position hypothesis. They describe a method of deriving an Optimal Position Policy for a collection of texts within a genre, as long as a small set of topic keywords is defined for each text. They define the sentence yield (strength of relevance) of a sentence based on the mention of topic keywords in the sentence.
The positional yield is defined as the average sentence yield at that position across documents: the yield of each sentence position is computed by counting the number of different keywords contained in the sentence at that position in each document, and averaging over all documents. An Optimal Position Policy (OPP) is then derived from the decreasing values of positional yield.
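A minimal sketch of this derivation, under the assumption that each document arrives as a list of tokenized sentences with an accompanying set of topic keywords (the function name and data shapes are our own, not from [Lin and Hovy, 1997]):

```python
from collections import defaultdict

def optimal_position_policy(documents, keywords_per_doc):
    """Order sentence positions by decreasing positional yield.

    documents:        one list of sentences per document; each sentence
                      is a list of word tokens
    keywords_per_doc: one set of topic keywords per document
    """
    totals = defaultdict(float)  # position -> summed sentence yield
    counts = defaultdict(int)    # position -> number of documents covering it
    for doc, keywords in zip(documents, keywords_per_doc):
        for pos, sentence in enumerate(doc, start=1):
            # Sentence yield: number of distinct topic keywords mentioned.
            totals[pos] += len(keywords & set(sentence))
            counts[pos] += 1
    # Positional yield: average sentence yield at each position.
    yields = {pos: totals[pos] / counts[pos] for pos in totals}
    return sorted(yields, key=yields.get, reverse=True)
```

The returned list of positions, read left to right, is the policy: earlier entries are the positions most likely to bear topic content.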
Their experiments were grounded in the assumption that the abstract is an ideal representation of the central topic(s) of a text. For their evaluations, they used the abstract to check whether the sentences found by their Optimal Position Policy are indeed a good selection, using precision and recall measures to establish those findings.
At our disposal we had data from pyramid evaluations that maps sentences to content units in the gold-standard summaries. These annotations have the useful property that each sentence can derive a relevance score for itself.
5.2.2 Documents
There is a wide variety of document types across genres. In our newswire collection we identified two primary types of document: small and large. The distinction is made based on the total number of sentences in the document: all documents with more sentences than a threshold are considered large. We experimented with thresholds varying from 10 to 35 sentences and found that the documents' distribution into the two categories was acceptable when thresholded at 20 sentences. This decision is also supported by the observation that the last sentences of a document are more important than those in the middle [Baxendale, 1958].

Sentence Position Yield (SPY) is obtained separately for both types of document. For a small document, sentence positions take values 1 through 20. For a large document, we compute SPY for positions 1 through 20, label the last 15 sentences 136 through 150, and label 'any other sentence' 100. It can be seen in Figure 5.5 that sentences that come from neither the leading nor the trailing part of large documents do not contribute much content to the summaries.
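A sketch of this labeling scheme follows; the function name and the handling of documents short enough for the two windows to overlap are our own assumptions.

```python
def position_label(index, doc_len, threshold=20):
    """Map a 1-based sentence index to its position label.

    Small documents (doc_len <= threshold) keep their ordinal positions.
    Large documents keep positions 1 through 20, label their last 15
    sentences 136 through 150, and lump every other sentence under 100.
    """
    if doc_len <= threshold or index <= 20:
        return index
    tail_start = doc_len - 15 + 1
    if index >= tail_start:
        # Final sentence -> 150, second-to-last -> 149, and so on.
        return 150 - (doc_len - index)
    return 100
```

For example, in a 40-sentence (large) document, sentence 5 keeps label 5, sentence 25 is lumped under label 100, and sentence 40 receives label 150.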
5.2.3 Pyramid Data
Summary content units, referred to as SCUs hereafter, are semantically motivated, sub-sentential units that are variable in length but no bigger than a sentential clause. SCUs emerge from the annotation of a collection of human summaries for the same input. They are identified by noting information that is repeated across summaries, whether the repetition is as small as a modifier of a noun phrase or as large as a clause. The weight an SCU obtains is directly proportional to the number of reference summaries that support that piece of information. The evaluation method based on overlapping SCUs in human and automatic summaries is described in the Pyramid method [Nenkova et al., 2007].
The University of Ottawa has organized the pyramid annotation data such that for
some of the sentences in the original document collection (those that were picked by sys-
tems participating in pyramid evaluation), a list of corresponding content units is known
[Copeck et al., 2006]. We described this data in detail in Section 4.4.
A sample of the SCU mapping is reproduced in Figure 5.3. We used this data to identify the locations in a document from which most sentences were picked, and which of those locations were most responsive to the query in terms of content.
Figure 5.3 A sample mapping of SCU annotations to source document sentences. An excerpt from the mapping of topic D0701A of the DUC 2007 QF-MDS task.
For each SCU, a weight is given in the pyramid annotations. A sentential score can thus be defined as the sum of the weights of all the contributing SCUs of the sentence; for an unknown sample and a negative sample, the sentential score is 0. For example, the score of the second sentence in Figure 5.3 is 3, contributed by a single SCU, while that of the first and third sentences is 0.
For each sentence position, the sentential score averaged over all documents is what we call the Sentence Position Yield. SPY for small and large documents is shown in Figures 5.4 and 5.5. Based on these values, a simple position policy was framed as shown below. A position policy is an ordered set whose elements appear in decreasing order of importance; within a subset, each sub-element is equally important and treated alike.
Figure 5.4 Sentence Position Yield for small documents.
{s1, S1, {s2, S2, s3}, {S3, s4, s5, s6, s7, s8, s20}, {S4, s9}, . . . }
In the above position policy, sentence positions in small documents and large documents are represented by si and Sj respectively.
The position policy described above provides an ordering of sentence positions based on very accurate 'relevance' annotations on sentences. However, a large subset of sentences is not annotated with either a positive or a negative relevance judgment; hence, the policy is derived from a high-precision, low-recall corpus3 for sentence relevance. If all sentences were annotated with such judgments, the policy could have been different. For this reason we call the policy derived above a Sub-optimal Position Policy (SPP).
5.3 A Summarization Algorithm based on SPP
The goal of creating a position policy was to test its effectiveness as a summarization algorithm. The simple heuristic above is easily turned into an algorithm by assigning a score to each distinct set in the policy: all s1 positions get the highest weight, followed by the next-best weight for all S1, and so on.

As can be observed, the summary might end up consisting only of the first sentence of each document, which is tolerable as long as the summary contains no redundant information. Hence we also use a simple unigram-match redundancy measure that rejects a sentence if it matches any of the already selected sentences in at least 40% of its content words. We also disallow sentences longer than 25 content words.

3 DUC 2005 and 2006 data has been used for learning the SPP. In the further experiments in Section 5.3, DUC 2007 and TAC 2008 data have been used as test data.
We applied the above algorithm to generate multi-document summaries for two tasks: the Query-Focused Multi-Document Summarization (QF-MDS) task of DUC 2007 and the Query-Focused Update Summarization task of TAC 2008.
5.3.1 Query-Focused Multi-Document Summarization
The query-focused multi-document summarization task at DUC models a real-world complex question answering task: given a topic and a set of 25 relevant documents, the task is to synthesize a fluent, well-organized 250-word summary of the documents that answers the question(s) in the topic statement/narration.
The summaries from the above algorithm for QF-MDS were evaluated with ROUGE metrics [Lin, 2004b]. The average4 recall scores for ROUGE-2 and ROUGE-SU4 are reported in Table 5.1, along with the performance of the top-performing system and the official baselines. Our SPP algorithm performed worse than most systems participating in the task that year, performing better5 than only the 'first x words' baseline and 3 other systems.

4 Averaged over all the 45 topics of the DUC 2007 dataset.
5 Better in a statistical sense, based on 95% confidence intervals of the two systems' ROUGE-2 evaluations.

system                     ROUGE-2   ROUGE-SU4
'first x words' baseline   0.06039   0.10507
'generic' baseline         0.09382   0.14641
SPP algorithm              0.06913   0.12492
system 15 (top system)     0.12448   0.17711

Table 5.1 ROUGE-2 and ROUGE-SU4 recall scores for two baselines, the SPP algorithm and a top-performing system at the Query-Focused Multi-Document Summarization task, DUC 2007.

5.3.2 Update Summarization Task

The update summarization task is to produce short (~100-word) multi-document update summaries of newswire articles under the assumption that the user has already read a set of earlier articles. The initial document set is called cluster A and the next set of articles cluster B. For cluster A, a query-focused multi-document summary is expected; the purpose of each 'update summary' (the summary of cluster B) is to inform the reader of new information about a particular topic. Summaries from the above algorithm for the Query-Focused Update Summarization task were evaluated with ROUGE metrics. The algorithm performed surprisingly well at this task compared to QF-MDS: the ROUGE scores suggest that it is well above the median for cluster A and among the top 5 systems for cluster B.
It must be noted that consistent performance across clusters (both A and B) shows the robustness of the 'SPP algorithm' at the update summarization task. It is also evident that such an algorithm is computationally simple and lightweight.

These surprisingly high ROUGE scores prompted us to evaluate the summaries with the Pyramid evaluation [Nenkova et al., 2007], which provides a more semantic approach to content evaluation based on SCUs, as discussed in Section 5.2.3. The average6 modified pyramid scores of cluster A and cluster B summaries are shown in Table 5.2, along with the average ROUGE-2 and ROUGE-SU4 recall scores. The pyramid evaluation7 suggests that this algorithm performs better than all other automated systems at TAC 2008. Table 5.3 shows the average performance (across clusters) of the 'first x words' baseline, the SPP algorithm and two top-performing systems (System IDs 43 and 11): system 43 was adjudged the best system based on ROUGE metrics, and system 11 was the top performer based on pyramid evaluations at TAC 2008.
6 Averaged over all the 48 topics of the TAC 2008 dataset.
7 Pyramid annotations were done by a volunteer who also volunteered for annotations during DUC 2007.
ROUGE-2 ROUGE-SU4 pyramid
cluster A 0.08987 0.1213 0.3432
cluster B 0.09319 0.1283 0.3576
Table 5.2 Cluster-wise ROUGE-2 and ROUGE-SU4 recall scores and modified pyramid scores for the SPP algorithm at the Update Summarization task.
system ROUGE-2 ROUGE-SU4 pyramid
‘first x words’ baseline 0.05896 0.09327 0.166
SPP algorithm 0.09153 0.1245 0.3504
System 43 (top in ROUGE) 0.10395 0.13646 0.289
System 11 (top in pyramid) 0.08858 0.12484 0.336
Table 5.3 Average ROUGE-2 and ROUGE-SU4 recall scores and modified pyramid scores for the baseline, the SPP algorithm and two top-performing systems at TAC 2008.
5.3.3 Discussion
It is interesting that an algorithm that performs very poorly at QF-MDS does very well at the Update Summarization task. A possible explanation lies in summary length. For a 250-word summary in the QF-MDS task, human summaries might provide a descriptive answer to the query that includes information nuggets accompanied by background information. Indeed, it has been reported that humans appreciate receiving more information than just the answer to the query, whenever possible [Lin et al., 2003, Bosma, 2005].
In the Update Summarization task, by contrast, the summary length is only 100 words. At such a short length humans must trade off answer sentences against supporting sentences, and usually answers are preferred. Since our method identifies sentences known to contribute towards the needed answers, it performs better at the shorter version of the task.
Another possible explanation is that as a shorter summary is required, the task of choosing the most important information becomes more difficult and no approach works well consistently. It has also often been noted that this baseline is indeed quite strong for this genre, owing to the journalistic convention of putting the most important part of an article in the initial paragraphs.
5.4 Baselines in Summarization Tasks
Over the years, as summarization research moved from generic single-document summarization, to generic multi-document summarization, to focused multi-document summarization, two major baselines persisted throughout the evaluations:
1. First N words of the document (or of the most recent document).
2. First sentence of each document, in chronological order, until the length requirement is reached.
The first baseline has been in place ever since the first evaluation of generic single-document summarization at DUC 2001. For multi-document summarization, the first N words of the (chronologically) most recent document were chosen as baseline 1. In the recent summarization evaluations at the Text Analysis Conference (TAC 2008), where update summarization was evaluated, baseline 1 still persists. This baseline performs quite poorly in content evaluations on all manual and automatic metrics. However, since it does not disturb the original flow and ordering of a document, these summaries are linguistically the best; indeed, baseline 1 outperforms all automated systems in linguistic quality evaluations.
The second baseline was used occasionally for multi-document summarization from 2001 to 2004, for both generic and focused multi-document summarization. In 2001, only one system significantly outperformed baseline 2 [Nenkova, 2005]. In the 2003 QF-MDS task, again only one system outperformed baseline 2, while in 2004, at the same task, no system significantly outperformed it. Over the years, this baseline has thus remained largely untouched by systems in content evaluations, although the linguistic quality of such a summary is compromised.
Currently, for the Update Summarization task at TAC 2008, NIST's baseline is baseline 1 (the 'first x words' baseline), and all systems (except one) perform better than it in all forms of content evaluation. Since the task is to generate 100-word (short) summaries, past experience leaves little doubt that baseline 2 would perform well.
It is interesting to observe that baseline 2 is a close approximation of the 'SPP algorithm' described in this chapter. We draw two main distinctions between them. First, baseline 2 picks only the first sentence of each document, while the SPP algorithm can pick other sentences in the order described by the position policy. Second, baseline 2 places no restriction on redundancy; due to journalistic conventions, the entire summary might thus consist of the same 'information nuggets', wasting the minimal real estate available (~100 words). Our SPP algorithm, on the other hand, uses a simple unigram-overlap measure to identify redundant information in sentence pairs, avoiding redundant nuggets in the final summary.
5.5 Discussion and Conclusions
Baselines 1 and 2, together, could act as a balancing mechanism for comparing both linguistic quality and responsive content in a summary. The availability of a stronger, content-responsive baseline would enable steady progress in the field: linguistically motivated systems would compare themselves against baseline 1, while content-motivated systems would compare against the stronger baseline 2 and improve upon it.
In the years to come, using baseline 1 will not help us understand whether there has been significant improvement in the field, because almost every simple algorithm beats its performance. A better baseline, like the one based on the position hypothesis, would raise the bar for systems participating in coming years and make it easier to track the progress of the field.

In this chapter, we derived a method to identify a 'sub-optimal position policy' from pyramid annotation data that were previously unavailable, distinguishing small and large documents in obtaining the policy. We described the Sub-optimal Sentence Position Policy (SPP), implemented it as an algorithm, and showed that a position policy formed this way is a good representative of the genre and thus performs well above median performance. We further described the baselines used in summarization evaluation and discussed the need to bring back baseline 2 (or the 'SPP algorithm') as an official baseline for the update summarization task.
Ultimately, as Lin and Hovy [Lin and Hovy, 1997] suggest, the position method can only take us a certain distance: it has a limited power of resolution (the sentence) and a limited method of identification (the position in a text). This is why we intend to use it as a baseline. As it stands, the algorithm generates a generic summary; it does not consider the topic or query to generate a query-focused summary. In future we plan to extend the SPP algorithm with a basic method for bringing in relevance.
Chapter 6
A Language Modeling Extension for
Update Summarization
In the context of DUC, there has been a proliferation of publications on Automated Text Summarization1, vying for the state of the art each year. Throughout the literature, some models have been proven to work in general, and some have been shown to work under constrained settings. Early text summarization research started with a "Single Document Generic Summary" task. Over time, the community grew and began working on interesting problems such as "Multi-Document Generic Summary generation", "Query-Focused Multi-Document Summary generation", etc.
At DUC, every 2-3 years the major problem being addressed is altered or tweaked to create a new task of interest to the community, moving the goal forward towards better summarization approaches in a realistic task setting. At the end of 2007, the community was dealing with a similar scenario while developing the "Query-Focused Multi-Document Update Summarization" task.

1 http://www-nlpir.nist.gov/projects/duc/pubs.html
6.1 Update Summarization
The key to "update summarization" is a real-world setting in which a user needs to keep track of a hot topic continuously, at random intervals of time. There is a lot of activity on the hot topic and many documents are generated within a short span. The user cannot cope with this proliferation of information and hence requires a summarization engine that generates a very targeted, informative summary. The user now has access to the information in the form of a summary, and if he needs to know more he can consult the source documents. After using the summarization engine for a while, say he takes a break (it's Christmas!) and later returns for a summary of the recent activity on the topic. A normal summarizer would simply generate a summary of the recent documents and present it to him. The idea of update summarization, however, is to filter out information that has already appeared in previous articles, whether or not it was presented to the user as part of a previous summary. In effect, it is like adding another redundancy-checking module to suppress repeated information.
The updates on the topic need to filter out redundant information while preserving the informativeness of the content. The task of update summarization thus has two components: a normal query-focused multi-document summarization for the first cluster (of documents) on the topic, and an update summary generation procedure that also produces query-focused summaries, under the assumption that the user has already gone through the previous document cluster(s). In the current work, we approach the problem under a sentence-extractive summarization paradigm, using an existing language modeling framework. Here, we see the "update summary generation" task as a language modeling smoothing problem.
6.2 Language Modeling Approach to IR and Summarization
A statistical language model, or more simply a language model, is a probabilistic mechanism for generating text. In Information Retrieval, the use of language modeling rose soon after [Ponte and Croft, 1998] described a "language modeling approach to IR"; [Song and Croft, 1999] then furthered research into the application of language modeling to IR.

Language modeling based approaches have been prominent in the text summarization literature since the early work of Luhn [Luhn, 1958]. [Jagarlamudi, 2006] showed how a relevance-based language modeling paradigm can be applied to automated text summarization, specifically for the "query-focused multi-document summarization" task. More recently, [Nenkova et al., 2006] showed reasonable success in multi-document text summarization using 'just' unigram language models. Language modeling approaches in IR and in text summarization are dealt with more deeply in Section 3.1. Here we use the Probabilistic Hyperspace Analogue to Language (PHAL) described in [Jagarlamudi, 2006] as the language modeling mechanism.
6.2.1 Probabilistic Hyperspace Analogue to Language (PHAL)
Following the model of [J et al., 2005], a Hyperspace Analogue to Language model captures dependencies of a word w on other words based on their occurrence in the context of w within a window of size K, in a sufficiently large corpus. We use the PHAL model, and use the relevance-based language modeling approach of [J et al., 2005] for sentence scoring. PHAL, the probabilistic HAL, is a natural extension of HAL spaces, as term co-occurrence counts can be used to define conditional probabilities. PHAL(w′|w) can be interpreted as "given a word w, the probability of observing another word w′ with w in a window of size K":
PHAL(w′|w) = c × HAL(w′|w) / (n(w) × K)
The sentence scoring mechanism is built on this model. Assuming word independence, the relevance of a sentence S can be expressed as:
P(S|R) = ∏_{wi∈S} P(wi|R) ≈ ∏_{wi∈S} P(wi|Q)

P(S|R) = ∏_{wi∈S} [P(wi)/P(Q)] ∏_{qj} PHAL(qj|wi) ≈ ∏_{wi∈S} P(wi) ∏_{qj} PHAL(qj|wi)   (6.1)
All sentences in the corpus are scored with the above equation and the best (highest-scoring) sentences are picked to be part of the final summary.
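As an illustration, the scoring in Eq. (6.1) can be sketched as follows. This is a minimal sketch, not the thesis's actual implementation: the constant c is taken as 1, the window size and the smoothing floor `eps` are illustrative choices, and whitespace tokenization is assumed.

```python
import math
from collections import Counter, defaultdict

def build_phal(tokens, K=5):
    """Directed co-occurrence counts of w' in a window of K words after w,
    normalized as PHAL(w'|w) ~ HAL(w'|w) / (n(w) * K), taking c = 1."""
    hal = defaultdict(Counter)
    n = Counter(tokens)
    for i, w in enumerate(tokens):
        for wp in tokens[i + 1:i + 1 + K]:
            hal[w][wp] += 1
    def phal(wp, w):
        return hal[w][wp] / (n[w] * K) if n[w] else 0.0
    return phal, n

def score_sentence(sentence, query_terms, phal, n, total, eps=1e-6):
    """Log-space version of Eq. (6.1): sum_w [log P(w) + sum_q log PHAL(q|w)],
    with a small floor eps so unseen pairs do not zero out the product."""
    score = 0.0
    for w in sentence:
        score += math.log(n[w] / total + eps)
        for q in query_terms:
            score += math.log(phal(q, w) + eps)
    return score
```

Sentences would then be ranked by this score and the top ones extracted, as described above.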
6.3 Language Modeling Extension
While generating update summaries, whether in single-document or multi-document summarization, the idea is to present information that the reader has not already seen. Looked at the other way, it can be considered suppression of information that the user has already seen. In the context of language modeling, if we construct models that represent the document (or document collection), then the corresponding models for the new stream of data should be modified based on the information that has already been seen and is known to be of importance. It is interesting to note that query relevance does not change; however, a considerable topical shift might occur in the new stream. To accommodate this topical shift and to avoid redundancy in the update summary, we build a background-aware language model that penalizes thematic features that occur more frequently in the background than in the new stream.
Since our approach to language modeling, PHAL, is "bigram" based, so is our extension. But it is imperative to understand that we are providing a general framework
which can be applied to any "language modeling mechanism". To provide this extension, we need to devise two things:
• Signature Terms
• Context Adjustment (or smoothing)
After devising strategies for both signature terms and context adjustment, which we describe in the following sections, the following approach to summary generation is taken.
6.3.1 Approach
Our approach is motivated by the fact that the necessary update may be seen as giving more weight to important sentences. Since importance is now also dependent on the previous utterances on the topic, we should either increase the weight of important terms in the novel cluster or decrease the weights of words that are more important in the previous clusters. In either approach, we first need to identify signature terms [Lin and Hovy, 2000] of the respective cluster with respect to previous clusters. Note that summarization for the first cluster is a simple query-focused summarization problem and follows directly from the approach described in [J et al., 2005]; hence we do not consider it here.
The primary algorithm consists of the following steps:
1. For the first cluster, generate summary using the actual Summarization algorithm,
say S(x).
2. For each of the next clusters
(a) Compute Language model of previous clusters, say LM(A).
(b) Compute Language model of current cluster, say LM(B).
(c) Generate ‘Signature Terms’ of previous cluster and current cluster, say TA and
TB respectively.
(d) Perform ‘Context Adjustment’ for model LM(B), using TA and/or TB.
(e) Generate summary based on the adjusted (or corrected) Language Model.
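The steps above can be sketched as a driver loop. All four callables (`summarize`, `build_lm`, `signature_terms`, `adjust`) are hypothetical stand-ins for S(x), LM(·), the signature-term extractor, and the context adjustment; they are not names from the actual system.

```python
def update_summaries(clusters, summarize, build_lm, signature_terms, adjust):
    """Driver for the primary algorithm of Section 6.3.1. The first cluster
    is summarized directly; later clusters get a context-adjusted model."""
    summaries = [summarize(clusters[0], model=build_lm(clusters[0]))]
    seen = list(clusters[0])
    for cluster in clusters[1:]:
        lm_a = build_lm(seen)                  # LM(A): previous clusters
        lm_b = build_lm(cluster)               # LM(B): current cluster
        t_a = signature_terms(seen, cluster)   # signature terms of A
        t_b = signature_terms(cluster, seen)   # signature terms of B
        adjusted = adjust(lm_b, lm_a, t_a, t_b)  # context adjustment
        summaries.append(summarize(cluster, model=adjusted))
        seen.extend(cluster)
    return summaries
```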
6.3.2 Signature Terms
Topic signatures as defined by Lin and Hovy [Lin and Hovy, 2000] are a set of related terms that describe a topic. Lin and Hovy collect a set of terms that are typically highly correlated with a target concept from a pre-classified corpus such as the TREC collections. We approximate the same using the dataset at hand. For each topic, if we consider cluster A as the irrelevant set (R̄) and cluster B as the relevant set (R), we obtain terms that are relatively more important for cluster B than for cluster A. Our hypothesis is that a term is significant if it occurs relatively more frequently in cluster B than in cluster A. In this work we report on the utility of this method of term selection to aid context adjustment for update summarization.
The document set is pre-classified into two sets, cluster A and cluster B. We assume the following two hypotheses:

Hypothesis 1 (H1): P(B|ti) = p = P(B|t̄i)   (6.2)
Hypothesis 2 (H2): P(B|ti) = p1 ≠ p2 = P(B|t̄i)   (6.3)
where H1 implies that the relevancy of a document is independent of ti, while H2 implies that the presence of ti indicates strong relevancy, assuming p1 ≫ p2. Consider the following 2-by-2 contingency table, in which O11 is the frequency of term ti occurring in cluster B, O12 is the frequency of ti occurring in cluster A, O21 is the frequency of terms other than ti (t̄i) occurring in cluster B, and O22 is the frequency of terms other than ti occurring in cluster A:
        R      R̄
ti      O11    O12
t̄i      O21    O22
Assuming a binomial distribution of terms as relevant or irrelevant,

b(k; n, x) = (n choose k) x^k (1 − x)^(n−k)
then the likelihood for H1 is:

L(H1) = b(O11; O11 + O12, p) · b(O21; O21 + O22, p)

and for H2 is:

L(H2) = b(O11; O11 + O12, p1) · b(O21; O21 + O22, p2)
The log λ value is then computed as follows:

log λ = log [L(H1) / L(H2)]
      = log [ b(O11; O11 + O12, p) b(O21; O21 + O22, p) ] − log [ b(O11; O11 + O12, p1) b(O21; O21 + O22, p2) ]
      = ((O11 + O21) log p + (O12 + O22) log(1 − p)) − (O11 log p1 + O12 log(1 − p1) + O21 log p2 + O22 log(1 − p2))   (6.4)
This ratio λ is Dunning's likelihood ratio [Dunning, 1993]. Dunning suggests that λ is more appropriate than χ2 for hypothesis testing on sparse data, and that the quantity −2 log λ is asymptotically χ2 distributed. Hence we can use a χ2 distribution table to look up the −2 log λ value at a specific confidence level.
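A minimal sketch of the computation in Eq. (6.4). Estimating p, p1 and p2 from the contingency table by maximum likelihood is our assumption; the binomial coefficients cancel in the ratio L(H1)/L(H2) and are therefore omitted.

```python
import math

def neg2_log_lambda(o11, o12, o21, o22, eps=1e-12):
    """-2 log(lambda) for one term's contingency table, per Eq. (6.4)."""
    p = (o11 + o21) / (o11 + o12 + o21 + o22)  # pooled estimate under H1
    p1 = o11 / (o11 + o12)                     # estimate under H2 (row ti)
    p2 = o21 / (o21 + o22)                     # estimate under H2 (row not-ti)

    def ll(k, n, x):
        x = min(max(x, eps), 1 - eps)          # guard log(0) at the extremes
        return k * math.log(x) + (n - k) * math.log(1 - x)

    log_lam = (ll(o11, o11 + o12, p) + ll(o21, o21 + o22, p)) \
            - (ll(o11, o11 + o12, p1) + ll(o21, o21 + o22, p2))
    return -2.0 * log_lam
```

Terms whose −2 log λ exceeds the χ2 cutoff at the chosen confidence level (e.g. 10.83 at 99.9%, 1 degree of freedom) would be kept as signature terms.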
6.3.3 Context Adjustment
Let Tnew and Told be the sets of signature terms extracted from cluster B and cluster A, respectively. Signature terms can be used to adjust the language model as follows:
∀wi ∈ Tnew, ∀wj : PHAL(wj|wi, B) = PHAL(wj|wi, B) + PHAL(wj|wi, A)   (6.5)

∀wi ∈ Told, ∀wj : PHAL(wj|wi, B) = PHAL(wj|wi, B) − PHAL(wj|wi, A)   (6.6)
Update summarization has a major difference from normal query-focused multi-document summarization: it requires us to provide information that is novel. This can be seen from two viewpoints: improving novelty and reducing redundancy. The update based on Eq. (6.5) boosts co-occurrences of novel terms in cluster B, while the update based on Eq. (6.6) penalizes co-occurrences of signature terms of previous clusters, thereby reducing redundancy with respect to earlier clusters.
6.4 Evaluation
The major evaluation criterion for this experiment was to study the effect of signature-term-based context adjustment for language modeling approaches on the update summarization task. We measured the impact of context adjustment using automated evaluation measures based on ROUGE [Lin, 2004b]. Table 6.1 illustrates the impact of the above-mentioned context adjustments on the PHAL feature. The results are only indicative of the performance of the context adjustment applied to update clusters. Clearly, there is a visible advantage in applying such a framework to language-modeling-based update summarization.
6.5 Summary
In this chapter we described a generic approach to the update summarization problem based on the language modeling paradigm. We argued that an update summary is a 'stream' (of text) based summary, and that a solution to the problem must also lie in understanding how the current state of the stream varies from previous states. We observed that a simple context adjustment based on 'stream dynamics' helps generate better summaries compared with the base algorithm.

                    ROUGE-2   ROUGE-SU4
PHAL                0.08190   0.12208
PHAL + Adjustment   0.08369   0.12395

Table 6.1: Impact of Context Adjustment on PHAL
Though the improvement in the scores is small, the potential of the approach cannot be denied. In future work, this approach needs to be developed into a formal strategy with a stronger mathematical basis. Several issues, such as the "semantics of the 'adjusted model'", "weighting the bias", etc., have not been addressed in this work and are the concern of detailed further work. It is also possible to apply multiple, more appropriate approaches to signature term extraction, which we set aside by choosing a simple approach based on [Lin and Hovy, 2000]. This signature extraction approach may not be appropriate here, in particular because of the nature of the data: we observed that for some topics there were very few signature terms (<5). We believe this happens because the method is being applied outside its intended setting. It was built for dissimilar topic clusters, and blindly using it on same-topic, time-varying clusters leaves us with just a few distinguishing terms between the two clusters, which would paralyze our otherwise sound approach. The similarity between the two clusters (previous and current) is more prominent than their dissimilarity, and hence other approaches may be better able to distinguish these clusters and provide stronger, more 'significant' signature terms.
Chapter 7
Alternative (Automated) Summarization
Evaluations
Evaluation is a crucial component in the area of automatic summarization; it is used both to rank multiple participant systems in shared tasks, such as the summarization track at TAC 2008 and 2009 and its DUC predecessors, and to guide developers whose goal is to improve their summarization systems. Summarization evaluation, as has been the case with other language understanding technologies, can foster the creation of reusable resources and infrastructure; it creates an environment for comparison and replication of results; and it introduces an element of competition to produce better results [Hirschman and Mani, 2001]. However, manual evaluation of the large number of documents necessary for a relatively unbiased view is often infeasible, especially since multiple evaluations are needed over time to track incremental improvements in systems. Therefore, there is an urgent need for reliable automatic metrics that can perform evaluation in a fast and consistent manner.
7.1 Introduction
Summarization evaluation techniques currently used in the literature and at the focused
evaluations have been described in detail in Section 3.2. As discussed earlier, summarization evaluation of informativeness falls into two major categories: intrinsic and extrinsic evaluations. Intrinsic evaluations measure the quality of the created summary directly.
Since the measurement is direct and does not involve some other process or task at which the summary should perform, we need to compare the summary with some reference; usually, multiple human reference summaries are obtained for this purpose. Examples of 'intrinsic informativeness evaluations' include ROUGE [Lin, 2004b], Basic Elements [Hovy et al., 2006, Tratz and Hovy, 2008] and Pyramid evaluations [Nenkova et al., 2007]. Among these, all the methods are fully automated except the Pyramid method, which involves human effort in the creation of "SCU Pyramids". Current techniques for evaluating summaries assume that multiple human reference summaries exist, which are themselves difficult and time-consuming to obtain. Recently, [Louis and Nenkova, 2009] have tried to evaluate summaries with no reference summaries, using an information-theoretic framework to compute the information lost in producing the summary at hand.
Summarization evaluation, like Machine Translation (MT) evaluation (or the evaluation of any other NLP system), can be broadly classified into two categories [Jones and Galliers, 1996]. The first, intrinsic evaluation, tests the summarization system in itself. The second, extrinsic evaluation, tests the summarization system based on how it affects the completion of some other task. In the past, intrinsic evaluations have mainly assessed the informativeness and coherence of summaries, while extrinsic evaluations have been used to test the impact of summarization on tasks like reading comprehension, relevance assessment, etc.
7.2 Current Summarization Evaluations
In the Text Analysis Conference (TAC) series and its predecessor, the Document Understanding Conference (DUC) series, the evaluation of summarization quality was conducted using both manual and automated metrics. Manual assessment, performed by human judges, centers around two main aspects of summarization quality: informativeness/content and readability/fluency. Since manual evaluation is still the undisputed gold standard, both at TAC and DUC there was a phenomenal effort to manually evaluate as much data as possible.
Content Evaluations  The content or informativeness of a summary has been evaluated using various manual metrics. Earlier, NIST assessors rated each summary on a 5-point scale from "very poor" to "very good". Since 2006, NIST has used the Pyramid framework to measure content responsiveness. In the pyramid method, explained in Section 3.2, assessors first extract all possible "information nuggets", or Summary Content Units (SCUs), from human-produced model summaries on a given topic. Each SCU has a weight based on the number of model summaries in which its information appears. The final score of a peer summary is based on the recall of nuggets in the peer.
All forms of manual assessment are time-consuming, expensive and not repeatable, whether scoring summaries on a Likert scale or evaluating peers against "nugget pyramids" as in the pyramid method. Such assessment does not help system developers, who would ideally like a fast, reliable and, most importantly, automated evaluation metric that can be used to track incremental improvements in their systems. So despite the strong manual evaluation criteria for informativeness, time-tested automated methods, viz. ROUGE and Basic Elements (BE), have been regularly employed and tested for their correlation with manual evaluation metrics like the 'modified pyramid score', 'content responsiveness' and 'overall responsiveness' of a summary. Each of these metrics has been described in Section 3.2. The creation and testing of automatic evaluation metrics is therefore an important research avenue; the goal is to create automated metrics that correlate highly with the manual ones.
Readability/Fluency Evaluations  The readability or fluency of a summary is evaluated based on a set of linguistic quality questions that manual assessors answer for each summary. The linguistic quality markers are: Grammaticality, Non-Redundancy, Referential Clarity, Focus, and Structure and Coherence.

Manual Readability Evaluations  Readability assessment is primarily a manual method, following Likert-scale rating by human assessors as for content responsiveness. Each of the above linguistic quality markers is rated individually for every summary, and the average linguistic quality scores of the various peers are compared against each other. An ANOVA (Analysis of Variance) is performed on the linguistic quality markers to show which sets of peers fall in a statistically similar range.
Automated Readability Evaluations  Though there has not been much formal work on automatic evaluation of the readability/fluency of summaries, [Pitler and Nenkova, 2008] describe a first study of various lexical, syntactic and discourse features to produce a highly predictive model of human readers' judgments of text readability. Their experiments were largely based on the Wall Street Journal (WSJ) corpus, since they were interested in the impact of discourse on readability.
7.3 Automated Content Evaluations
Based on the arguments set out above, automated evaluation of content and form is necessary for tracking developers' incremental improvements, and a focused task on the creation of automated metrics for content and form would help in the process. This was precisely the point addressed by the TAC AESOP (Automatically Evaluating Summaries of Peers) task. In TAC 2009, the AESOP task involved only "Automated Evaluation of Content and Responsiveness", and this chapter addresses the same.

In the first edition of the AESOP task at TAC 2009, the purpose of the task was to promote
research and development of systems that evaluate the quality of content in summaries. The output of each automated metric is compared against two manual metrics: the (modified) pyramid score, which measures summary content, and overall responsiveness, which measures a combination of content and linguistic quality.
Experimental Configuration  In the TAC 2009 task, for each topic there are 4 reference summaries and 55 peer summaries. The task is to generate, for each peer summary, a score representing (in the semantics of the metric) the goodness of the summary content, measured either against or without the use of model summaries. A snapshot of the output produced by a metric is shown in Figure 7.1.
7.4 Generative Modeling of Reference Summaries
In Section 4.5, we observed that a formal model capturing the amount of query-bias in a summary, as distributed in the source collection, correlates highly with ROUGE scores. Based on these observations, we hypothesized that a similar modeling approach can be taken to evaluate a peer summary. We utilized the generative models discussed in Section 4.5 and devised multiple alternative methods of identifying what makes a good summary.

In Section 4.5, we describe two models based on the 'generative modeling framework', a binomial model and a multinomial model, which we used to show that automated systems are query-biased so as to perform better on ROUGE-like surface metrics. Our approach here is to use the same generative models to evaluate summaries. In the following sections, we describe how various features extracted from reference summaries can be used to model how strongly peer summaries imitate reference summaries.
We use generative modeling to model the distribution of signature terms in the source and the "likelihood of a summary being biased towards these signature terms". As before, we use two generative models, the binomial and the multinomial, both described below.
Figure 7.1 Sample output generated by an evaluation metric
7.4.1 Binomial Model
Suppose there are k words that we consider signature terms, as identified by any of the methods described in Section 7.5. The sentences in the input document collection are represented as a binomial distribution over the types of sentences. Let Ci ∈ {C0, C1} denote the classes of sentences without and with those signature terms, respectively. With each sentence s ∈ Ci in the input collection, we associate a probability p(Ci) of it being emitted into a summary.
The likelihood of a summary is then:

L[summary; p(Ci)] = [N! / (n0! n1!)] p(C0)^n0 p(C1)^n1   (7.1)
where N is the number of sentences in the summary, n0 + n1 = N, and n0 and n1 are the cardinalities of C0 and C1 in the summary.
7.4.2 Multinomial Model
Previously, we described the binomial model, in which we classified each sentence into two classes according to whether or not it is biased towards a signature term. If, however, we quantify the amount of signature-term bias in a sentence, we associate each sentence with one of k + 1 possible classes, leading to a multinomial distribution. Let Ci ∈ {C0, C1, C2, . . . , Ck} denote the levels of signature-term bias, with Ci the set of sentences each having i signature terms.

The number of sentences in each class varies greatly: C0 bags a high percentage of the sentences, and the remaining sentences are distributed among {C1, C2, . . . , Ck}. Since the distribution is highly skewed to the left, distinguishing systems based on log-likelihood scores under this model is easier and perhaps more accurate.
The likelihood of a summary is then:

L[summary; p(Ci)] = [N! / (n0! n1! · · · nk!)] p(C0)^n0 p(C1)^n1 · · · p(Ck)^nk   (7.2)
where N is the number of sentences in the peer summary, n0 + n1 + · · · + nk = N, and n0, n1, . . . , nk are the cardinalities of C0, C1, . . . , Ck in the summary.
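Eq. (7.2) can be sketched in log space; the binomial case of Eq. (7.1) is simply the two-class instance. Computing log-factorials via `lgamma` avoids overflow for realistic N; the floor `eps` on the class probabilities is an illustrative guard, not part of the model.

```python
import math
from math import lgamma

def log_likelihood(counts, probs, eps=1e-12):
    """Multinomial log-likelihood of a peer summary, per Eq. (7.2):
    counts[i] = n_i, the number of summary sentences with i signature terms;
    probs[i]  = p(C_i), the emission probability estimated from the input."""
    n_total = sum(counts)
    ll = lgamma(n_total + 1)                  # log N!
    for n_i, p_i in zip(counts, probs):
        ll -= lgamma(n_i + 1)                 # - log n_i!
        ll += n_i * math.log(max(p_i, eps))   # + n_i * log p(C_i)
    return ll
```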
7.5 Signature Terms
The likelihood of certain characteristics under the binomial or multinomial model shows how well those characteristics of the input have been captured in a summary. For our approach, we need certain keywords from the reference summaries that are considered very important for the topic/query combination. We choose multiple alternative methods to identify such signature terms, listed here:
1. Query terms
2. Model consistency
3. Part-Of-Speech (POS)
7.5.1 Query Terms
If we consider query terms as the characteristics that discriminate important sentences from unimportant ones, we obtain the likelihood of a summary emitting a query-biased sentence. We showed earlier, in Section 4.5, that such a likelihood has very high system-level correlation with ROUGE scores. Since ROUGE correlates highly with manual evaluations ('pyramid evaluation' or 'overall responsiveness'), a naïve assumption is that likelihood modeling of query-bias would correlate well with manual evaluations. This assumption led us to use this method as a baseline for our experiments; our baselines for this work are explained in Section 7.6.
7.5.2 Model Consistency
The hypothesis behind this method is that a term is important if it is part of a reference summary. We obtain all the terms that are commonly agreed upon by the reference summaries; the idea is that the more reference summaries agree on a term, the more important it is. This is based on the assumption that word-level importance adds up towards sentence inclusion. Since 4 reference summaries are available for each topic, we can use reference agreement in two ways:

• total agreement
• partial agreement

Total agreement  In the case of total agreement, only the words that occur in all reference summaries are considered important. This case leads to a single run, which we call 'total-agreement'.

Partial agreement  In the case of partial agreement, words that occur in at least k reference summaries are considered important. Since there are 4 reference summaries per topic, a term is considered a 'signature term' if it occurs in k of those 4 reference summaries. There were a total of 3 runs in this case: 'partial-agreement-1', 'partial-agreement-2' and 'partial-agreement-3'.
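The agreement runs can be sketched as follows; simple whitespace tokenization and lowercasing are illustrative simplifications, not the preprocessing actually used.

```python
from collections import Counter

def agreement_terms(reference_summaries, k):
    """Terms occurring in at least k of the reference summaries.
    With 4 references, k = 4 gives 'total-agreement' and
    k = 1..3 give the 'partial-agreement-k' runs."""
    presence = Counter()
    for ref in reference_summaries:
        for term in set(ref.lower().split()):  # count each ref at most once
            presence[term] += 1
    return {t for t, c in presence.items() if c >= k}
```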
7.5.3 POS Features
We hypothesized that certain types of words, based on the parts of speech they belong to, could be more informative than other words, and that by modeling their occurrence in peer summaries we are defining the informativeness of the peers with respect to the models.

Part-of-speech tagger  Traditional grammar classifies words into eight parts of speech: the verb, the noun, the adjective, the pronoun, the adverb, the preposition, the conjunction and the interjection. Each part of speech explains not what a word is, but how the word is used; in fact, the same word can be a noun in one sentence and a verb or adjective in another. We have used the Penn Treebank tag-set [Marcus et al., 1993] for our purposes, and the Stanford POS tagger [Toutanova and Manning, 2000, Toutanova et al., 2003] for automated tagging in these experiments.
Tag Subset Selection – feature selection  Based on an analysis of how each POS tag performs at the task, we selectively combine the set of features. We used the following POS-tag features: NN, NNP, NNPS, VB, VBN, VBD, CD, SYMB, and their combinations. We experimented with several combinations of these features and zeroed in on a final list of combinations that form the runs described in this work. The final list of runs comprises some of the individual POS-tag features and some combinations:
• NN
• NNP
• NNPS
• NOUN – A combination of NN, NNP and NNPS features.
• VB
• VBN
• VBD
• VERB – A combination of VB, VBN and VBD features.
• CD
• SYMB
• MISC – A combination of CD and SYMB features.
• ALL – A combination of NOUN, VERB and MISC features.
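The run definitions above can be sketched as tag-set lookups, assuming tagging has already been done by any Penn-Treebank-style tagger. Mapping the thesis's SYMB label to the Penn tag SYM is our assumption.

```python
# Tag sets for each run; composite runs are unions of the individual ones.
TAG_RUNS = {
    "NN": {"NN"}, "NNP": {"NNP"}, "NNPS": {"NNPS"},
    "NOUN": {"NN", "NNP", "NNPS"},
    "VB": {"VB"}, "VBN": {"VBN"}, "VBD": {"VBD"},
    "VERB": {"VB", "VBN", "VBD"},
    "CD": {"CD"}, "SYMB": {"SYM"},
    "MISC": {"CD", "SYM"},
}
TAG_RUNS["ALL"] = TAG_RUNS["NOUN"] | TAG_RUNS["VERB"] | TAG_RUNS["MISC"]

def pos_signature_terms(tagged_tokens, run):
    """Keep tokens whose tag falls in the selected run's tag set.
    tagged_tokens is a list of (token, penn_tag) pairs."""
    wanted = TAG_RUNS[run]
    return {tok.lower() for tok, tag in tagged_tokens if tag in wanted}
```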
7.6 Experiments and Evaluations
Our experimental setup is defined primarily by how signature terms are identified. We detailed several methods of identifying signature terms in Section 7.5; for each method, we have one or more runs, as described earlier.
Baselines  Apart from the set of runs described in Section 7.5, we use the following two baselines.

• Binomial modeling of query terms. This approach uses the binomial model described in Section 7.4 to obtain the likelihood that a system generates summaries comprising sentences that contain query terms.

• Multinomial modeling of query terms. This baseline uses the multinomial model described in Section 7.4 to obtain the multinomial likelihood that a system generates summaries comprising sentences that contain query terms. This model distinguishes sentences containing a single query term from sentences containing two query terms, and so on.
Datasets  The experiments reported here were performed on the TAC 2009 update summarization datasets, which have 44 topics, each with 55 system summaries and 4 human reference summaries. Since in our methods there is no clear way to distinguish the evaluation of cluster A's summary from cluster B's – we do not evaluate the 'update' aspect of a summary – we effectively have 88 topics to evaluate on.
Evaluations The success of these new summarization evaluation metrics is judged by how well they correlate with manual evaluations. In Section 3.2 we described how pyramid evaluations and content responsiveness evaluations are performed. Despite the complexity involved, this task reduces to a simpler problem, that of information ordering: we have a reference ordering of systems, and each metric produces its own ordering of those systems. Comparing one ordering of information with another is a fairly well understood task, and we use correlations between the manual metrics and the metrics proposed in this work to show how well our metrics imitate human evaluations in producing a similar ordering of systems. We use Pearson's correlation coefficient over the system-level average scores produced by our metrics and by the manual methods.
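The correlation computation itself is standard; a minimal sketch (the function name and example scores are ours, not from the thesis):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient between two lists of scores,
    e.g. system-level averages from an automated metric and from a
    manual evaluation such as the pyramid method."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linearly related score lists correlate at 1.0.
assert abs(pearson([1, 2, 3], [2, 4, 6]) - 1.0) < 1e-9
```

A correlation near 1 means the automated metric orders the systems almost exactly as the manual evaluation does.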
7.7 Results
Our target in these focused experiments was to create alternatives to the content evaluation metrics (the pyramid method and overall responsiveness), which are 'too expensive', 'non-replicable', or both. The important question to ask is: which of these two metrics are we trying to imitate? It is unlikely that a single automated evaluation measure could correctly reflect both readability and content responsiveness, since they represent form and content, separate qualities of a summary that require different measures. We chose to imitate content, since having better content in a summary is more important than having a readable summary.
In Tables 7.1 and 7.2 we present system-level Pearson's correlations between the manual Pyramid scores and the scores produced by our metrics, as well as by the time-tested automated evaluation metrics ROUGE-SU4 and Basic Elements (BE). The tables also include correlations with the manual Overall Responsiveness measure, which reflects both content and form. We will observe that the correlations are much higher with respect to pyramids than with overall responsiveness; this is because our approach tries to capture how well the content of model summaries is reflected in system summaries.
7.7.1 Discussion
We use two separate settings for displaying results: an AllPeers case and a NoModels case. The AllPeers case consists of the scores returned by the metric for all the summarizers (automated and human), while in the NoModels case only automated summarizers are scored using the evaluation metrics. This setup helps distinguish two kinds of metrics:
• Metrics that are able to differentiate human summarizers from automated ones.
• Metrics that are able to rank the automated summarizers in the desired order.
Results have shown that no single metric is good at distinguishing everything; however, they also show that certain types of keywords have been instrumental in providing the key distinguishing power to the metric. (We have excluded NNPS and SYMB from this analysis since they did not have enough samples in the test set to yield consistent results.) For example, VERB and NOUN features have been the key contributors to the ALL run. As an interesting side note, we observe that having a high number of 'significant' signature-terms seems to be better than a low number of 'strong' signature-terms, as seen from the experiments on total-agreement and partial-agreement. The most important result of our approach is that our method correlated very highly with "overall responsiveness", which again is a very good sign for an evaluation metric.
7.8 Chapter Summary
In this chapter, we argued for the need for alternative 'automated' summarization evaluation systems for both content and readability. In the context of the TAC AESOP (Automatically Evaluating Summaries Of Peers) task, we described the problems with content evaluation metrics and how a good metric must behave. We modeled the problem as an information ordering problem; our approach (and indeed others) should now be able to rank systems (and possibly human summarizers) in the same order as human evaluation would have produced. We showed how a well-known generative model can be used to create automated evaluation systems comparable to the state-of-the-art. Our method is based on a multinomial distribution of key-terms (or signature terms) in document collections, and on how these terms are captured in peers.
We used two types of signature-terms to model the evaluation metrics. The first is based on the POS tags of important terms in a model summary; the second is based on how much information the reference summaries share among themselves. Our results show
that verbs and nouns are key contributors to our best run, which was dependent on various individual features. Another important observation was that all the metrics were consistent
in that they produced similar results for both cluster A and cluster B in the context of update
summaries. The most startling result is that, in comparison with the automated evaluation metrics currently in use (ROUGE, Basic Elements), our approach has been very good at capturing "overall responsiveness" in addition to pyramid-based manual scores.
RUN                            Pyramid               Responsiveness
                            AllPeers  NoModels     AllPeers  NoModels
High Baselines
  ROUGE-SU4                  0.734     0.921        0.617     0.767
  Basic Elements (BE)        0.586     0.857        0.456     0.692
Baselines
  Binom(query)               0.217     0.528        0.163     0.509
  Multinom(query)            0.117     0.523        0.626     0.514
Experimental Runs: POS based
  NN                         0.909     0.867        0.853     0.766
  NNP                        0.666     0.504        0.661     0.463
  NOUN                       0.923     0.882        0.870     0.779
  VB                         0.913     0.820        0.877     0.705
  VBN                        0.931     0.817        0.929     0.683
  VBD                        0.944     0.859        0.927     0.698
  VERB                       0.972     0.902        0.952     0.733
  CD                         0.762     0.601        0.757     0.561
  MISC                       0.762     0.601        0.757     0.561
  ALL                        0.969     0.913        0.934     0.802
Experimental Runs: Model Consistency/Agreement
  total-agreement            0.727     0.768        0.659     0.682
  partial-agreement-3        0.867     0.856        0.813     0.757
  partial-agreement-2        0.936     0.893        0.886     0.791
  partial-agreement-1        0.966     0.895        0.930     0.768
Table 7.1 Cluster A Results
RUN                            Pyramid               Responsiveness
                            AllPeers  NoModels     AllPeers  NoModels
High Baselines
  ROUGE-SU4                  0.586     0.940        0.564     0.729
  Basic Elements (BE)        0.629     0.924        0.447     0.694
Baselines
  Binom(query)               0.210     0.364        0.178     0.372
  Multinom(query)           -0.004     0.361       -0.020     0.446
Experimental Runs: POS based
  NN                         0.908     0.845        0.877     0.788
  NNP                        0.646     0.453        0.631     0.380
  NOUN                       0.909     0.848        0.878     0.783
  VB                         0.872     0.871        0.875     0.742
  VBN                        0.934     0.873        0.944     0.720
  VBD                        0.922     0.909        0.914     0.718
  VERB                       0.949     0.951        0.942     0.784
  CD                         0.807     0.599        0.800     0.497
  MISC                       0.807     0.599        0.800     0.497
  ALL                        0.957     0.921        0.931     0.793
Experimental Runs: Model Consistency/Agreement
  total-agreement            0.811     0.738        0.808     0.762
  partial-agreement-3        0.901     0.839        0.882     0.806
  partial-agreement-2        0.949     0.898        0.924     0.817
  partial-agreement-1        0.960     0.903        0.936     0.763
Table 7.2 Cluster B Results
Chapter 8
Conclusions
In this thesis, we have systematically studied four aspects of text summarization systems. We examined the role of query-bias in query-focused multi-document summarization systems. Then, we built upon the position hypothesis to describe a baseline algorithm for 'update summarization' and other short-summary tasks. Next, we showed that a simple signature-term based context adjustment allows standard language modeling approaches to be applied to the update summarization task. Finally, we described a generative-modeling based automated intrinsic evaluation framework that models the distribution of signature-terms across summaries based on their distribution in the source collections. Section 8.1 describes the contributions of this thesis and Section 8.2 follows with a detailed discussion of foreseeable future work.
8.1 Contributions of this Thesis
Automated text summarization deals with condensing a source text into a representation that retains its most informative and (in the case of query-focused summaries) relevant pieces of information. In this thesis, we researched automated text summarization from four angles:
1. Impact of query-bias on summarization
2. Simple and strong baselines for update summarization
3. Language modeling extension to update summarization
4. An automated intrinsic content evaluation measure
Our experiments in Chapter 4 clearly show that most automated systems are biased towards query-terms in trying to generate summaries that are closer to human summaries. Our key contributions in quantifying the impact of query-bias are:
• We have shown, based on various experiments, that query-bias is probably directly responsible for generating better summaries from otherwise poor algorithms.
• We further confirm, based on the likelihood of a system emitting non-query-biased sentences, that there is a strong (negative) correlation between a system's likelihood score and its ROUGE score, which suggests that systems improve their performance on ROUGE metrics by being biased towards the query terms. Humans, on the other hand, do not rely on query-bias. Our concern in this work was to empirically show that the notion of query-focus is apparently missing from these algorithms, and that future summarization algorithms must incorporate it by design.
• Looking at these algorithms through the sentence-classifier view of summarization (see Figure 4.4), we observe that most systems ignore sentences that do not contain query-terms. Discarding sentences based on such naive surface features (essentially 'term-bias') means that much of the informative content among the non-biased sentences is lost. Hence, there is a need to examine the behavior of each algorithm along these lines before such issues arise at a later stage.
• Our results also underscore the differences between human- and machine-generated summaries. When asked to produce query-focused summaries, humans do not rely to the same extent on the repetition of query terms.
In Chapter 5 we derived a simple baseline algorithm to identify key sentences for inclusion in a summary based on a 'sub-optimal position policy'. The key contributions of this work are:
• The use of a new data source built on the pyramid annotation data. This is important since this resource, despite being available for 3-4 years now, has rarely been put to use.
• We distinguish small and large documents when deriving the position policy, showing that small documents tend to be more informative and focused than large documents.
• We described the Sub-optimal Sentence Position Policy (SPP) based on pyramid annotation data and implemented the SPP as an algorithm, showing that a position policy formed this way is a good representative of the genre and thus performs well above median performance.
• We further described the baselines used in summarization evaluation and discussed the need to bring back baseline 2 (the 'SPP algorithm') as an official baseline for the update summarization task.
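The idea behind a position-policy baseline can be sketched as follows. The round-robin selection over leading sentences and the `position_baseline` helper are an illustrative simplification, not the exact SPP derived from the pyramid annotations:

```python
def position_baseline(documents, budget):
    """Sketch of a sentence-position baseline: take the sentence at
    position 1 of each document, then position 2, and so on, until
    the word budget is exhausted. `documents` is a list of documents,
    each a list of sentence strings."""
    summary, words, pos = [], 0, 0
    while words < budget:
        added = False
        for doc in documents:
            if pos < len(doc):
                sent = doc[pos]
                if words + len(sent.split()) > budget:
                    return summary  # next sentence would overflow the budget
                summary.append(sent)
                words += len(sent.split())
                added = True
        if not added:
            break  # all documents exhausted
        pos += 1
    return summary
```

The design choice mirrors the position hypothesis: in news-like genres, earlier sentences carry disproportionately more summary-worthy content, so a pure position policy is a surprisingly strong baseline.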
We described a generic approach to the update summarization problem based on the language modeling paradigm. In Chapter 6 we argued that an update summary is a 'stream' (of text) based summary, and that a solution to the problem must lie in understanding how the current state of the stream varies from its previous states. Though the improvement in scores is small, the potential for improvement of the approach is high, and it provides a very promising framework for update summarization using language modeling. Our key contributions to the solution of this problem were:
• We showcased a solution to the update summarization problem based on language modeling approaches, by visualizing a context adjustment over the language model built on the corpus.
• We applied PHAL-based language modeling to sentence scoring and used context adjustment strategies to boost novel signature-terms and decrease the impact of stale terms.
• We observed that a simple context adjustment based on 'stream dynamics' helps generate better summaries compared to the base language-modeling approach.
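The context-adjustment idea can be sketched over a plain unigram model. This is not the PHAL model itself; the function, the `boost`/`damp` weights, and the smoothing floor are illustrative assumptions:

```python
from math import log

def adjusted_sentence_score(sentence, lm_probs, novel, stale,
                            boost=2.0, damp=0.5):
    """Sketch of context adjustment over a unigram language model:
    probabilities of novel signature terms are boosted and those of
    stale terms damped before scoring a sentence. `lm_probs` maps
    terms to corpus probabilities; unseen terms get a small floor."""
    score = 0.0
    terms = sentence.split()
    for term in terms:
        p = lm_probs.get(term, 1e-6)
        if term in novel:
            p *= boost    # reward terms new to the current stream state
        elif term in stale:
            p *= damp     # penalise terms already covered earlier
        score += log(p)
    return score / len(terms)  # length-normalised log-probability
```

Under this sketch, a sentence carrying novel signature-terms outscores the same sentence without the adjustment, which is exactly the behavior the chapter attributes to 'stream dynamics'.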
We argued for the need for alternative 'automated' summarization evaluation systems for both content and readability. We described the problems with content evaluation metrics and the characteristics a good automated evaluation metric must have. In Chapter 7 we modeled the problem as an information ordering problem; our approach (and indeed others) should now be able to rank systems (and possibly human summarizers) in the same order as human evaluation would have produced. The following are our key contributions towards building a summarization evaluation system:
• We showed how a well-known generative model can be applied to create automated evaluation systems comparable to the state-of-the-art.
• Our method is based on a multinomial distribution of key-terms (or signature terms) in document collections, and on how these terms are captured in peers.
• We used two types of signature-terms to model the evaluation metrics. The first is based on the POS tags of important terms in a model summary; the second is based on how much information the reference summaries share among themselves.
• Our results show that verbs and nouns are key contributors to our best run which was
dependent on various individual features.
• Another important observation was that all the metrics were consistent in that they
produced similar results for both cluster A and cluster B in the context of update
summaries.
• The most startling observation is that, in comparison with ROUGE and Basic Elements, our approach has been very good at capturing "overall responsiveness" in addition to pyramid-based manual scores.
8.2 Future Work
In this thesis we showed how query-bias affects most summarization algorithms. In hindsight, a further algorithmic study of automated summarization algorithms and approaches should help us understand where and how the bias towards query-terms is introduced. Further theoretical analysis of where in the modeling or scoring approaches such a bias is induced would help alleviate the problem and help design future approaches with foresight towards problems of query-bias.
A simple discourse-based feature, the position of a sentence, performs relatively strongly at update summarization. Against the background of this work, it is imperative to realize that position is just one possible discourse feature: while position can be highly relevant in corpora such as news, it might turn out to be insignificant in other corpora such as books. It would be interesting to see how we can capture other, possibly shallow, discourse features based on the connectedness (versus disconnectedness) of text in a document. Such features would not be of interest in single-topic documents such as news stories, but they provide key information about topical shift in conversational texts or books.
We described extensions to a language-modeling based framework for update summarization. In the future, the development of this approach into a formal strategy with a stronger mathematical basis needs to be addressed. Several issues, such as the "semantics of the 'adjusted model'", "weighting the bias", etc., have not been addressed in this work and are the concern of detailed further work. It is also possible to apply multiple, more appropriate approaches to signature-term extraction, which we ignored by choosing a simple approach based on [Lin and Hovy, 2000]. The signature extraction approach we used may not be appropriate, in particular, because of the nature of the data. We have observed that for some topics there were very few signature terms (<5), and we believe this happens because the method is applied outside its intended setting. This approach to signature extraction was built for dissimilar topic clusters; blindly using it on same-topic, time-varying clusters leaves us with just a few distinguishing terms between the two clusters, which would paralyze our otherwise sound approach. In the case of update summarization data, the similarity between the two clusters (previous and current) is more prominent than their dissimilarity, and hence other approaches may better distinguish these clusters and provide stronger, more 'significant' signature terms.
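Topic signatures in the style of [Lin and Hovy, 2000] are selected with Dunning's log-likelihood ratio, comparing a term's rate in the topic cluster against its rate in a background corpus. A minimal sketch of that statistic (the function and variable names are ours):

```python
from math import log

def llr(k1, n1, k2, n2):
    """Dunning's log-likelihood ratio for a term occurring k1 times in
    n1 tokens of the topic cluster and k2 times in n2 tokens of the
    background corpus. Large values mark candidate signature terms."""
    def ll(k, n, p):
        # binomial log-likelihood; the 0/1 boundary cases contribute 0
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return k * log(p) + (n - k) * log(1 - p)
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)          # pooled rate under the null
    return 2 * (ll(k1, n1, p1) + ll(k2, n2, p2)
                - ll(k1, n1, p) - ll(k2, n2, p))
```

A term whose cluster rate matches its background rate scores near zero, which illustrates the problem noted above: when the 'previous' and 'current' clusters are nearly identical, almost every term scores near zero and few signature terms survive.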
A novel framework for automated content evaluation was described in our work. Our framework relies on two sub-problems: finding 'signature-terms', and the generative modeling aspects. In this work we have shown that such a model is promising, using simple methods for signature-term identification once we fix on a multinomial model. It would be interesting to find other methods of signature-term identification that could bring the performance of the method closer to human levels of judgment. The evaluation paradigm described in this work assumes that human references are available; another interesting line of work would be how to evaluate system summaries when no reference summaries are available. Looking further ahead, there is a pressing need to distinguish "relevant content extraction" from "relevant summaries", the difference being the form of summaries: the way they are presented coherently. It would be very relevant to define algorithms that correlate strongly with linguistic quality metrics.
8.3 Publications
1. Query-Focused Summaries or Query-Biased Summaries? In the proceedings of the joint conference of the 47th Annual Meeting of the Association for Computational Linguistics (ACL) and the 4th International Joint Conference on Natural Language Processing (IJCNLP), ACL-IJCNLP 2009.
2. Sentence Position revisited: A robust light-weight Update Summarization ‘base-
line’ Algorithm. In the proceedings of the HLT-NAACL workshop on cross lan-
guage information access (CLIAWS3), 2009.
3. IIIT Hyderabad at TAC 2008. In the working notes of Text Analysis Conference
(TAC) at the joint meeting of the annual conferences of TAC and TREC, 2008.
4. On {Alternative} Automated Content Evaluation Measures. In the working notes
of Text Analysis Conference (TAC) at the joint meeting of the annual conferences of
TAC and TREC, 2009.
8.4 Unpublished Manuscripts
1. GEMS: Generative Modeling for Evaluation of Summaries. In submission.
Bibliography
[Allan et al., 2001] Allan, J., Gupta, R., and Khandelwal, V. (2001). Topic models for summarizing novelty. In Proceedings of the Workshop on Language Modeling and Information Retrieval, pages 66–71.
[Amini and Usunier, 2007] Amini, M. R. and Usunier, N. (2007). A contextual query expansion approach by term clustering for robust text summarization. In the proceedings of the Document Understanding Conference.
[Ani et al., 2005] Ani, A. H., Nenkova, A., Passonneau, R., and Rambow, O. (2005). Automation of summary evaluation by the pyramid method. In Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP), page 226.
[Barzilay, 1997] Barzilay, R. (1997). Lexical chains for summarization. Master's thesis.
[Barzilay and Elhadad, 1997] Barzilay, R. and Elhadad, M. (1997). Using lexical chains for text summarization. In Proceedings of the ACL Workshop on Intelligent Scalable Text Summarization, pages 10–17.
[Barzilay and Lapata, 2008] Barzilay, R. and Lapata, M. (2008). Modeling local coherence: An entity-based approach. Computational Linguistics, 34(1):1–34.
[Barzilay and Lee, 2004] Barzilay, R. and Lee, L. (2004). Catching the drift: Probabilistic content models, with applications to generation and summarization. In Dumais, S., Marcu, D., and Roukos, S., editors, HLT-NAACL 2004: Main Proceedings, pages 113–120, Boston, Massachusetts, USA. Association for Computational Linguistics.
[Baxendale, 1958] Baxendale, P. B. (1958). Machine-made index for technical literature – an experiment. IBM Journal of Research and Development, 2(Non-topical Issue).
[Berger and Mittal, 2000] Berger, A. and Mittal, V. O. (2000). Query-relevant summarization using FAQs. In ACL '00: Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 294–301, Morristown, NJ, USA. Association for Computational Linguistics.
[Bosma, 2005] Bosma, W. (2005). Extending answers using discourse structures. In Saggion, H. and Minel, J. L., editors, RANLP Workshop on Crossing Barriers in Text Summarization Research, pages 2–9. Incoma Ltd.
[Bysani et al., 2009] Bysani, P., Bharat, V., and Varma, V. (2009). Modeling novelty and feature combination using support vector regression for update summarization. In 7th International Conference on Natural Language Processing. NLP Association of India.
[Chae and Nenkova, 2009] Chae, J. and Nenkova, A. (2009). Predicting the fluency of text with shallow structural features: Case studies of machine translation and human-written text. In EACL, pages 139–147. The Association for Computational Linguistics.
[Conroy et al., 2004] Conroy, J. M., Schlesinger, J. D., Goldstein, J., and O'Leary, D. P. (2004). Left-brain/right-brain multi-document summarization. In the proceedings of the Document Understanding Conference (DUC) 2004.
[Copeck et al., 2006] Copeck, T., Inkpen, D., Kazantseva, A., Kennedy, A., Kipp, D., Nastase, V., and Szpakowicz, S. (2006). Leveraging DUC. In proceedings of DUC 2006.
[Cormen et al., 1990] Cormen, T. H., Leiserson, C. E., and Rivest, R. L. (1990). Introduction to Algorithms. The MIT Press and McGraw-Hill.
[Cremmins, 1982] Cremmins, E. T. (1982). The Art of Abstracting. ISI Press, Philadelphia.
[Dang, 2005] Dang, H. T. (2005). Overview of DUC 2005. In proceedings of the Document Understanding Conference.
[Daume III and Marcu, 2004] Daume III, H. and Marcu, D. (2004). A phrase-based HMM approach to document/abstract alignment. In Lin, D. and Wu, D., editors, Proceedings of EMNLP 2004, pages 119–126, Barcelona, Spain. Association for Computational Linguistics.
[Dolan, 1980] Dolan, D. (1980). Locating main idea in history textbooks. Journal of Reading, pages 135–140.
[Dunning, 1993] Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74.
[Edmundson, 1969] Edmundson, H. P. (1969). New methods in automatic extracting. Journal of the ACM, 16(2):264–285.
[Endres-Niggemeyer, 1998] Endres-Niggemeyer, B. (1998). Summarizing Information. Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[Feng, 2009] Feng, L. (2009). Automatic readability assessment for people with intellectual disabilities. SIGACCESS Accessibility and Computing, (93):84–91.
[Feng et al., 2009] Feng, L., Elhadad, N., and Huenerfauth, M. (2009). Cognitively motivated features for readability assessment. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 229–237, Athens, Greece. Association for Computational Linguistics.
[Gupta et al., 2007] Gupta, S., Nenkova, A., and Jurafsky, D. (2007). Measuring importance and query relevance in topic-focused multi-document summarization. In ACL Companion Volume, 2007.
[Halliday and Hasan, 1976] Halliday, M. and Hasan, R. (1976). Cohesion in English. Longman publishers.
[Harman and Over, 2002] Harman, D. and Over, P. (2002). The DUC summarization evaluations. In Proceedings of the Second International Conference on Human Language Technology Research, pages 44–51, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
[Harman and Over, 2004] Harman, D. and Over, P. (2004). The effects of human variation in DUC summarization evaluation. In Moens, M.-F. and Szpakowicz, S., editors, Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 10–17, Barcelona, Spain. Association for Computational Linguistics.
[Hiemstra, 1998] Hiemstra, D. (1998). A linguistically motivated probabilistic model of information retrieval. In ECDL '98: Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries, pages 569–584, London, UK. Springer-Verlag.
[Hiemstra, 2009] Hiemstra, D. (2009). Information retrieval models.
[Hirschman and Mani, 2001] Hirschman, L. and Mani, I. (2001). Evaluation.
[Hovy et al., 2006] Hovy, E., Lin, C.-Y., Zhou, L., and Fukumoto, J. (2006). Automated summarization evaluation with basic elements. In Proceedings of the Fifth Conference on Language Resources and Evaluation (LREC).
[J et al., 2005] J, J., Pingali, P., and Varma, V. (2005). A relevance-based language modeling approach to DUC 2005.
[Jagarlamudi, 2006] Jagarlamudi, J. (2006). Query-based multi-document summarization using language. Master's thesis, IIIT Hyderabad, India.
[Jing, 2001] Jing, H. (2001). Cut-and-Paste Text Summarization. PhD thesis.
[Jing et al., 1998] Jing, H., Barzilay, R., McKeown, K., and Elhadad, M. (1998). Summarization evaluation methods: Experiments and analysis. In AAAI Symposium on Intelligent Summarization, pages 60–68.
[Jones, 1972] Jones, K. S. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11–21.
[Jones, 1993] Jones, K. S. (1993). What might be in a summary? In Information Retrieval, pages 9–26.
[Jones, 1998] Jones, K. S. (1998). Automatic summarising: Factors and directions. In Advances in Automatic Text Summarization, pages 1–12. MIT Press.
[Jones and Galliers, 1996] Jones, K. S. and Galliers, J. R. (1996). Evaluating Natural Language Processing Systems: An Analysis and Review. Springer-Verlag New York, Inc., Secaucus, NJ, USA.
[Kastner and Monz, 2009] Kastner, I. and Monz, C. (2009). Automatic single-document key fact extraction from newswire articles. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), pages 415–423, Athens, Greece. Association for Computational Linguistics.
[Katragadda et al., 2009] Katragadda, R., Pingali, P., and Varma, V. (2009). Sentence position revisited: A robust light-weight update summarization 'baseline' algorithm. In Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies (CLIAWS3), pages 46–52, Boulder, Colorado. Association for Computational Linguistics.
[Kieras, 1985] Kieras, D. E. (1985). Thematic process in the comprehension of technical prose.
[Kumar, 2009] Kumar, C. (2009). Information loss based framework for document summarization. Master's thesis, Hyderabad, India.
[Kupiec et al., 1995] Kupiec, J., Pedersen, J., and Chen, F. (1995). A trainable document summarizer. In the proceedings of ACM SIGIR '95, pages 68–73. ACM.
[Lapata, 2003] Lapata, M. (2003). Probabilistic text structuring: Experiments with sentence ordering. In proceedings of the annual meeting of the Association for Computational Linguistics, pages 545–552. The Association of Computational Linguistics.
[Lapata and Barzilay, 2005] Lapata, M. and Barzilay, R. (2005). Automatic evaluation of text coherence: Models and representations. In Kaelbling, L. P. and Saffiotti, A., editors, IJCAI, pages 1085–1090. Professional Book Center.
[Lawrie, 2003] Lawrie, D. J. (2003). Language models for hierarchical summarization. PhD thesis. Director: Croft, W. Bruce.
[Li et al., 2007] Li, J., Sun, L., Kit, C., and Webster, J. (2007). A query-focused multi-document summarizer based on lexical chains. In DUC '07: Document Understanding Conference, 2007.
[Lin and Hovy, 2000] Lin, C. and Hovy, E. (2000). The automated acquisition of topic signatures for text summarization.
[Lin, 2004a] Lin, C.-Y. (2004a). Looking for a few good metrics: Automatic summarization evaluation – how many samples are enough? In the proceedings of NTCIR Workshop 4. ACL.
[Lin, 2004b] Lin, C.-Y. (2004b). ROUGE: A package for automatic evaluation of summaries. In the proceedings of the ACL Workshop on Text Summarization Branches Out. ACL.
[Lin and Hovy, 1997] Lin, C.-Y. and Hovy, E. (1997). Identifying topics by position. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 283–290. ACL.
[Lin and Hovy, 2003a] Lin, C.-Y. and Hovy, E. (2003a). Automatic evaluation of summaries using n-gram co-occurrence statistics. In NAACL '03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 71–78, Morristown, NJ, USA. Association for Computational Linguistics.
[Lin and Hovy, 2003b] Lin, C.-Y. and Hovy, E. (2003b). The potential and limitations of automatic sentence extraction for summarization. In Proceedings of the HLT-NAACL 03 Text Summarization Workshop, pages 73–80, Morristown, NJ, USA. Association for Computational Linguistics.
[Lin et al., 2003] Lin, J., Quan, D., Sinha, V., Bakshi, K., Huynh, D., Katz, B., and Karger, D. R. (2003). The role of context in question answering systems. In the proceedings of CHI '04. ACM.
[Louis and Nenkova, 2009] Louis, A. and Nenkova, A. (2009). Automatically evaluating content selection in summarization without human models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 306–314, Singapore. Association for Computational Linguistics.
[Luhn, 1958] Luhn, H. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2):159–165, April 1958.
[Maheedhar Kolla, 2007] Kolla, M., Vechtomova, O., and Clarke, C. L. A. (2007). Comparison of models based on summaries or documents towards extraction of update summaries. In DUC '07: Document Understanding Conference, 2007.
[Mani, 2001] Mani, I. (2001). Summarization Evaluation: An Overview. Pergamon Press, Inc., Tarrytown, NY, USA.
[Marcu, 1997] Marcu, D. (1997). From discourse structure to text summaries. pages 82–88.
[Marcu, 1999a] Marcu, D. (1999a). The automatic construction of large-scale corpora for summarization research. In University of California, Berkeley, pages 137–144.
[Marcu, 1999b] Marcu, D. (1999b). Discourse trees are good indicators of importance in text. In Advances in Automatic Text Summarization, pages 123–136. The MIT Press.
[Marcu, 2000] Marcu, D. (2000). The Theory and Practice of Discourse Parsing and Sum-marization. MIT Press, Cambridge, MA, USA.
[Marcus et al., 1993] Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993).Building a large annotated corpus of english: the penn treebank. Comput. Linguist.,19(2):313–330.
[McCallum, 1996] McCallum, A. K. (1996). Bow: A toolkit for statistical languagemodeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/ mccal-lum/bow.
[Miller et al., 1999] Miller, D. R. H., Leek, T., and Schwartz, R. M. (1999). A hiddenmarkov model information retrieval system. In SIGIR ’99: Proceedings of the 22nd an-nual international ACM SIGIR conference on Research and development in informationretrieval, pages 214–221, New York, NY, USA. ACM.
[Molina, 1995] Molina, M. P. (1995). Documentary abstracting: toward a methodologicalmodel. J. Am. Soc. Inf. Sci., 46(3):225–234.
[Morris and Hirst, 1991] Morris, J. and Hirst, G. (1991). Lexical cohesion computed bythesaural relations as an indicator of the structure of text. Comput. Linguist., 17(1):21–48.
[Mutton et al., 2007] Mutton, A., Dras, M., Wan, S., and Dale, R. (2007). Gleu: Automaticevaluation of sentence-level fluency. In ACL. The Association for Computer Linguistics.
[Nenkova, 2005] Nenkova, A. (2005). Automatic text summarization of newswire:Lessons learned from the document understanding conference. In Veloso, M. M. andKambhampati, S., editors, AAAI, pages 1436–1441. AAAI Press / The MIT Press.
[Nenkova et al., 2007] Nenkova, A., Passonneau, R., and McKeown, K. (2007). The pyra-mid method: Incorporating human content selection variation in summarization evalua-tion. In ACM Trans. Speech Lang. Process., volume 4, New York, NY, USA. ACM.
[Nenkova et al., 2006] Nenkova, A., Vanderwende, L., and McKeown, K. (2006). A com-positional context sensitive multi-document summarizer: exploring the factors that in-fluence summarization. In SIGIR ’06: Proceedings of the 29th annual internationalACM SIGIR conference on Research and development in information retrieval, pages573–580, New York, NY, USA. ACM.
[Paijmans, 1994] Paijmans, J. J. (1994). Relative weights of words in documents. In L.G.M. Noordman and W.A.M. de Vroomen, editors, Conference proceedings of STINFON, pages 195–208.
[Papineni et al., 2002] Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In ACL ’02: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311–318, Morristown, NJ, USA. Association for Computational Linguistics.
[Pitler and Nenkova, 2008] Pitler, E. and Nenkova, A. (2008). Revisiting readability: Aunified framework for predicting text quality. In EMNLP, pages 186–195. ACL.
[Ponte and Croft, 1998] Ponte, J. and Croft, W. B. (1998). A language modeling approach to information retrieval. In SIGIR ’98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 275–281, New York, NY, USA. ACM.
[Ponte, 1998] Ponte, J. M. (1998). A language modeling approach to information retrieval. PhD thesis, University of Massachusetts, Amherst, MA, USA.
[Radev et al., 2004] Radev, D., Allison, T., Blair-Goldensohn, S., Blitzer, J., Çelebi, A., Dimitrov, S., Drabek, E., Hakim, A., Lam, W., Liu, D., Otterbacher, J., Qi, H., Saggion, H., Teufel, S., Winkel, A., and Zhang, Z. (2004). MEAD - a platform for multidocument multilingual text summarization. In LREC 2004.
[Rath et al., 1961] Rath, G., Resnick, A., and Savage, R. (1961). The formation of abstracts by the selection of sentences: Part 1: Sentence selection by man and machines. In Journal of American Documentation, pages 139–208.
[Salton et al., 1997] Salton, G., Singhal, A., Mitra, M., and Buckley, C. (1997). Automatictext structuring and summarization. Inf. Process. Manage., 33(2):193–207.
[Schiffman, 2007] Schiffman, B. (2007). Summarization for Q&A at Columbia University for DUC 2007. In DUC’07: Document Understanding Conference, 2007.
[Schilder and Kondadadi, 2008] Schilder, F. and Kondadadi, R. (2008). FastSum: fast and accurate query-based multi-document summarization. In HLT ’08: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, pages 205–208, Morristown, NJ, USA. Association for Computational Linguistics.
[Shen et al., 2007] Shen, D., Sun, J.-T., Li, H., Yang, Q., and Chen, Z. (2007). Document summarization using conditional random fields. In Proceedings of IJCAI ’07, pages 2862–2867. IJCAI.
[Silber and McCoy, 2000] Silber, H. G. and McCoy, K. F. (2000). Efficient text summarization using lexical chains. In IUI ’00: Proceedings of the 5th international conference on Intelligent user interfaces, pages 252–255, New York, NY, USA. ACM.
[Singer and Dolan, 1980] Singer, H. and Dolan, D. (1980). Reading and Learning fromText. Little Brown, Boston, Massachusetts.
[Song and Croft, 1999] Song, F. and Croft, W. B. (1999). A general language model for information retrieval. In Proceedings of the Eighth International Conference on Information and Knowledge Management. ACM.
[Sparck Jones, 2007] Sparck Jones, K. (2007). Automatic summarising: The state of theart. Information Processing and Management, 43(6):1449–1481.
[tau Yih et al., 2007] tau Yih, W., Goodman, J., Vanderwende, L., and Suzuki, H. (2007).Multi-document summarization by maximizing informative content-words. In Veloso,M. M., editor, IJCAI, pages 1776–1782.
[Toutanova et al., 2007] Toutanova, K., Brockett, C., Gamon, M., Jagarlamudi, J., Suzuki, H., and Vanderwende, L. (2007). The Pythy summarization system: Microsoft Research at DUC 2007. In Proceedings of the Document Understanding Conference.
[Toutanova et al., 2003] Toutanova, K., Klein, D., Manning, C. D., and Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In NAACL ’03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 173–180, Morristown, NJ, USA. Association for Computational Linguistics.
[Toutanova and Manning, 2000] Toutanova, K. and Manning, C. D. (2000). Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 Joint SIGDAT conference on Empirical methods in natural language processing and very large corpora, pages 63–70, Morristown, NJ, USA. Association for Computational Linguistics.
[Tratz and Hovy, 2008] Tratz, S. and Hovy, E. (2008). Summarization evaluation using transformed basic elements. In Proceedings of the Text Analysis Conference.
[van Halteren and Teufel, 2003] van Halteren, H. and Teufel, S. (2003). Examining theconsensus between human summaries: initial experiments with factoid analysis. InHLT-NAACL 03 Text summarization workshop, pages 57–64, Morristown, NJ, USA.Association for Computational Linguistics.
[Wan et al., 2005] Wan, S., Dale, R., and Dras, M. (2005). Searching for grammaticality: Propagating dependencies in the Viterbi algorithm. In Proceedings of the Tenth European Workshop on Natural Language Generation (ENLG-05). Association for Computational Linguistics.
[Zajic et al., 2002] Zajic, D., Dorr, B., and Schwartz, R. (2002). Automatic headline generation for newspaper stories. In Proceedings of the ACL Workshop on Automatic Summarization/Document Understanding Conference (DUC), pages 78–85. Association for Computational Linguistics.