
Document Summarization Based on Word Associations

Oskar Gross
Department of Computer Science and HIIT
University of Helsinki, Finland
[email protected]

Antoine Doucet
GREYC, CNRS UMR 6072
University of Normandy, Unicaen, France
[email protected]

Hannu Toivonen
Department of Computer Science and HIIT
University of Helsinki, Finland
[email protected]

ABSTRACT

In the age of big data, automatic methods for creating summaries of documents become increasingly important. In this paper we propose a novel, unsupervised method for (multi-)document summarization. In an unsupervised and language-independent fashion, this approach relies on the strength of word associations in the set of documents to be summarized. The summaries are generated by picking sentences which cover the most specific word associations of the document(s). We measure the performance on the DUC 2007 dataset. Our experiments indicate that the proposed method is the best-performing unsupervised summarization method in the state of the art that makes no use of human-curated knowledge bases.

Categories and Subject Descriptors

I.2.7 [Natural Language Processing]: Language models—abstracting methods, summarization

Keywords

Multi-Document Summarization; Word Associations

1. INTRODUCTION

We propose a novel method for document summarization, Association Mixture Text Summarization, which aims to abstract a news story into a shorter text. Like most other methods, Association Mixture Text Summarization works in a sentence-based manner, selecting a set of sentences from the document to be summarized to constitute its summary. The sentences are chosen so that they collectively cover as much of the relevant information in the original document as possible. The main difficulties are to define what is relevant and to measure how well sets of sentences cover relevant information. Our method has three central characteristics:

(1) Relevance is based on the relative associations between words, helping to grasp the most salient information in a news story. Much of the core content of news stories is in the links they establish, e.g., between people, acts, events, and places. We argue that associations at subtler levels can also be important, even ones between adjectives or adverbs and nouns or verbs used in the news. Recognition of associations is based on statistical analysis of word co-occurrences within sentences. We believe that such associations reflect the key ideas of news and are useful for selecting sentences.

(2) Novel associations in a document are recognized by contrasting them against a background corpus. News stories are supposed to tell something new, and a key problem in summarization is to identify what is new in a given document. We treat this as a novelty detection task by contrasting the document to a background corpus to see which associations are emphasized more in the document.

(3) Natural language processing is trivial, making the method language-independent. All processed documents are split into sentences and tokens (words) based on punctuation and whitespace; numbers are removed, and the remaining tokens are used as they are, without any further processing.
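To make this concrete, the following is a minimal sketch of such preprocessing (in Python; our own illustration, as the exact punctuation rules are an assumption and only the general principle is fixed by the text above):

```python
import re

def preprocess(text):
    """Split text into sentences and tokens using only punctuation and
    whitespace; drop numbers; keep the remaining tokens as they are."""
    sentences = re.split(r"[.!?]+", text)
    result = []
    for s in sentences:
        tokens = [t for t in re.split(r"[\s,;:()\"']+", s)
                  if t and not t.isdigit()]   # "numbers are removed"
        if tokens:
            result.append(tokens)
    return result
```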

In this paper we focus on the sentence selection subtask of document summarization. We do not address the issue of arranging or processing the sentences for improved readability. We evaluate the method in English using public benchmarks, and leave experiments with other languages for future work. In the experiments, our proposed method outperforms all unsupervised summarization methods that do not use semantic resources such as WordNet.

This paper is organised as follows. We next briefly review related work. We then present the Association Mixture Text Summarization method in Section 3. The performance of the method is evaluated in Section 4, while Section 5 concludes this article with a discussion.

2. RELATED WORK

Document summarization is a well-studied area. There are two types of summarization methods: methods which select existing sentences and methods which generate sentences. Both types can be either supervised or unsupervised, i.e., either learning from examples of existing summaries or not. We focus on the unsupervised domain, of which we give a very brief overview. Nenkova and McKeown [10] provide an exhaustive review of the topic.

Some methods use Latent Semantic Analysis (LSA) [2] as their basis (e.g., [4]). The state of the art in purely unsupervised summarization is represented by the DSDR method of He et al. [5]. This approach generates a summary by using sentences that best “reconstruct” the original document. This work has been extended by Zhang et al. [11], who combined document reconstruction and topic decomposition.

A number of unsupervised methods take advantage of additional linguistic resources. In particular, the Two-Tiered Topic model by Celikyilmaz [1] uses WordNet [9] and the DUC-provided user query for selecting the summary sentences. The Document Understanding Conference¹ (DUC) provides most evaluation procedures and collections in the summarization field. We provide further details in Section 4.

¹ http://duc.nist.gov/

3. METHOD

The Association Mixture Text Summarization method proposed below takes as its input a document $D$ to be summarized and a background corpus $B$ consisting of a set of documents representing the norm or the current state of information.

As a special case, the background corpus can be empty. Additionally, by extension, instead of a single document, a set of documents can be summarized by simply giving their concatenation as the input document $D$, as will be done in the experimental section.

The method has two parts: (1) computation of document-specific word associations, and (2) selection of sentences with strong word associations. These two steps are described in the following subsections.

3.1 Finding Document-Specific Associations

We consider two relevance criteria for associations in the given document $D$.

First, an association between two words is more relevant if they are statistically mutually dependent, i.e., if they co-occur in $D$ more frequently than they would by chance. This, of course, is a classic idea.

Second, and more interestingly, the association is characteristic for document $D$ if the two words co-occur in $D$ more frequently than in the background corpus $B$.

The second criterion is in principle more useful since it uses additional data to assess the association, but it is of little value if the background corpus is small or if the words or the word pair does not occur in the corpus. Our method therefore uses a mixture model of the two criteria above.

Notation. We first define the notation for various counts of words and word pairs in document $D$ and in background $B$. Let $t_i$ and $t_j$ be words. We use $n_{ij}$ to denote the number of sentences in document $D$ that contain both words $t_i$ and $t_j$, $n_{i\neg j}$ the number of sentences containing word $t_i$ but not $t_j$, $n_{\neg ij}$ the number of sentences containing $t_j$ but not $t_i$, and $n_{\neg i\neg j}$ the number of sentences containing neither $t_i$ nor $t_j$. We use $n_{i\cdot} = n_{ij} + n_{i\neg j}$ to denote the total number of sentences containing word $t_i$, and respectively for $n_{\cdot j}$. Let $n = |D|$ denote the total number of sentences in document $D$. Finally, let $m_{ij}$, $m_{i\neg j}$, $m_{\neg ij}$, $m_{\neg i\neg j}$, $m_{i\cdot}$, $m_{\cdot j}$ and $m$ be the respective counts in the background corpus $B$.
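To make the notation concrete, here is a sketch (ours) of how these counts can be collected when each sentence is represented as a set of tokens; the remaining contingency cells follow by subtraction, e.g. $n_{i\neg j} = n_{i\cdot} - n_{ij}$:

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences):
    """Collect n = |D|, the per-word counts n_i. (sentences containing
    t_i), and the per-pair counts n_ij (sentences containing both t_i
    and t_j) from a list of tokenized sentences."""
    n = len(sentences)
    n_i = Counter()
    n_ij = Counter()
    for s in sentences:
        words = set(s)        # each word counted at most once per sentence
        n_i.update(words)
        n_ij.update(frozenset(p) for p in combinations(words, 2))
    return n, n_i, n_ij
```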

Statistical Model. Consider the association between words $t_i$ and $t_j$. We use multinomial distributions to model the probabilities of observing different combinations of existence/non-existence of words $t_i$ and $t_j$ in a sentence. The four respective model parameters are $p_{ij}$, $p_{i\neg j}$, $p_{\neg ij}$ and $p_{\neg i\neg j}$, affecting the likelihood of the observed counts $n_{ij}$, $n_{i\neg j}$, $n_{\neg ij}$ and $n_{\neg i\neg j}$. Three such models are given next, and the fit of the data to these models is later used to assign a weight to the association between $t_i$ and $t_j$. The third model is the Association Mixture model, while the first two are simpler models that will be used as the components of the mixture.

For convenience, below we define the models using parameters $p_{i\cdot}$ (the probability of observing word $t_i$), $p_{\cdot j}$ (the probability of observing word $t_j$), and $p_{ij}$ (the probability of observing both $t_i$ and $t_j$). These give more natural definitions for the models. The multinomial model parameters can then easily be obtained as

$$p_{i\neg j} = p_{i\cdot} - p_{ij}; \quad p_{\neg ij} = p_{\cdot j} - p_{ij}; \quad p_{\neg i\neg j} = 1 - p_{ij} - p_{i\neg j} - p_{\neg ij}.$$
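As a small illustrative helper (a sketch; the function name is ours), this conversion is direct:

```python
def multinomial_cells(p_i, p_j, p_ij):
    """Expand the convenience parameters (p_i., p_.j, p_ij) into the
    four multinomial cell probabilities, as defined above."""
    p_i_not_j = p_i - p_ij                  # t_i present, t_j absent
    p_not_i_j = p_j - p_ij                  # t_j present, t_i absent
    p_neither = 1.0 - p_ij - p_i_not_j - p_not_i_j
    return p_ij, p_i_not_j, p_not_i_j, p_neither
```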

The independence model (component) $p^{D\text{-}ind}$ considers observed frequencies of words $t_i$ and $t_j$ only in document $D$ and assumes that they are statistically independent:

$$p^{D\text{-}ind}_{i\cdot} = n_{i\cdot}/n; \quad p^{D\text{-}ind}_{\cdot j} = n_{\cdot j}/n; \quad p^{D\text{-}ind}_{ij} = n_{i\cdot} \cdot n_{\cdot j}/n^2.$$

If the data fits this model badly, i.e., essentially if $n_{ij}$ deviates a lot from $n_{i\cdot} \cdot n_{\cdot j}/n$, then the words are likely to be statistically dependent.

The background model (component) $p^{B}$ estimates all three parameters from the respective relative frequencies in the background corpus $B$:

$$p^{B}_{i\cdot} = m_{i\cdot}/m; \quad p^{B}_{\cdot j} = m_{\cdot j}/m; \quad p^{B}_{ij} = m_{ij}/m.$$

If the data fits this model badly, then the word pair occurs in the document differently from the background. This signals that the association is novel.

The association mixture model $p^{B+D\text{-}ind}$ averages the two components above, weighted by their sample sizes $n$ and $m$: $p^{B+D\text{-}ind} = (n \cdot p^{D\text{-}ind} + m \cdot p^{B})/(n+m)$. This gives

$$p^{B+D\text{-}ind}_{i\cdot} = (n_{i\cdot} + m_{i\cdot})/(n + m),$$
$$p^{B+D\text{-}ind}_{\cdot j} = (n_{\cdot j} + m_{\cdot j})/(n + m),$$
$$p^{B+D\text{-}ind}_{ij} = (n_{i\cdot} \cdot n_{\cdot j}/n + m_{ij})/(n + m).$$

In other words, the mixture model combines information from document $D$ itself and from the background $B$. Their relative weights adapt to their relative sizes, giving more emphasis to the statistically more reliable source of information.
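A sketch of the mixture parameters for one word pair, following the formulas above (function and argument names are ours; ni, nj are $n_{i\cdot}$, $n_{\cdot j}$, and mi, mj, mij the background counterparts):

```python
def mixture_parameters(ni, nj, n, mi, mj, mij, m):
    """Parameters (p_i., p_.j, p_ij) of the association mixture model,
    combining the document's independence estimate with the background
    frequencies, weighted by the sample sizes n and m."""
    p_i = (ni + mi) / (n + m)
    p_j = (nj + mj) / (n + m)
    p_ij = (ni * nj / n + mij) / (n + m)   # document part: independence estimate
    return p_i, p_j, p_ij
```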

Association Weights. The weight of the association between two words is based on a log-likelihood ratio test [3]. The test compares two models for each word pair: (1) a null model, in our case the mixture model, and (2) a maximum likelihood alternative model. If the likelihood of the alternative model is much higher, then the null model is less likely to be true. In other words, the mixture model is an expression of expectations, and we are actually interested in finding exceptions to them.

The maximum likelihood model $p^{D}$ is obtained by simply assigning the model parameters directly from the observed relative frequencies: $p^{D}_{i\cdot} = n_{i\cdot}/n$; $p^{D}_{\cdot j} = n_{\cdot j}/n$; $p^{D}_{ij} = n_{ij}/n$.

Let $L(p^{D})$ be the likelihood of the maximum likelihood model given the counts $n_{ij}$, $n_{i\neg j}$, $n_{\neg ij}$, $n_{\neg i\neg j}$ in document $D$, and let $L(p^{B+D\text{-}ind})$ be the likelihood of the mixture model given the same counts. We define the weight $w(t_i, t_j)$ of the association between $t_i$ and $t_j$ as the value of the respective log-likelihood ratio test:

$$w(t_i, t_j) = -2 \log \frac{L(p^{B+D\text{-}ind})}{L(p^{D})}.$$

Multinomial coefficients in the likelihoods cancel out, and after simplification we have

$$w(t_i, t_j) = 2 \sum_{a \in \{ij,\ i\neg j,\ \neg ij,\ \neg i\neg j\}} n_a \left( \log p^{D}_a - \log p^{B+D\text{-}ind}_a \right).$$

The log-likelihood ratio test gives lower weights for word pairs that better match the mixture model and higher weights for those associations that are unexpected with respect to the mixture model. In text summarization, we are interested in word pairs that have a higher relative frequency in the document $D$ than in the background $B$, and that have a high log-likelihood ratio.
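Putting the pieces together, here is a sketch of the weight computation, reusing the multinomial_cells and mixture_parameters helpers sketched above (the zero-cell handling is our assumption, using the usual $0 \cdot \log 0 = 0$ convention):

```python
import math

def association_weight(nij, ni, nj, n, mij, mi, mj, m):
    """w(t_i, t_j): log-likelihood ratio of the maximum likelihood
    model p^D against the association mixture null model."""
    # observed cells (n_ij, n_i-j, n_-ij, n_-i-j) in document D
    counts = (nij, ni - nij, nj - nij, n - ni - nj + nij)
    # maximum likelihood model: relative frequencies in D
    p_d = multinomial_cells(ni / n, nj / n, nij / n)
    # null model: the association mixture
    p_mix = multinomial_cells(*mixture_parameters(ni, nj, n, mi, mj, mij, m))
    w = 0.0
    for n_a, pd_a, pm_a in zip(counts, p_d, p_mix):
        if n_a > 0 and pm_a > 0:           # 0 * log 0 taken as 0
            w += n_a * (math.log(pd_a) - math.log(pm_a))
    return 2.0 * w
```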

3.2 Sentence Selection

The other subtask is to select from document $D$ sentences that contain strong word associations. In the sentence selection phase, our goal is to preserve as many as possible of the stronger associations, and thereby as much as possible of the core contents of the original document $D$.

Given a fixed target size of the summary (e.g., 250 words) and the association weights, we aim to pick sentences such that the sum of the log-likelihood ratios of word pairs in the summary is maximized. To avoid selecting sentences with too similar content, each pair is taken into account only once.

Formally, let document $D$ be a set of sentences and let each sentence be a set of words. We call any subset $S = \{s'_1, \ldots, s'_s\} \subset D$ of sentences a summary of $D$. We define the total weight of associations in summary $S$ as

$$w(S) = \sum_{\substack{\{t_i, t_j\} \text{ s.t. } t_i \neq t_j \\ \wedge\ \exists s \in S:\ \{t_i, t_j\} \subset s}} w(t_i, t_j),$$

i.e., as a sum over the set of word pairs in any sentence of the summary. Every pair is only counted once.

In the sentence selection step we aim to find a summary $S^* \subset D$ with maximal total weight, i.e.,

$$S^* = \operatorname*{arg\,max}_{S \subset D,\ \|S\| \le L} w(S),$$

where $\|S\|$ is the number of words in summary $S$. In our experiments below, the upper limit is set to $L = 250$ words.

This problem is similar to the weighted set cover problem [6]: use sentences of the document to cover as many of the associations as possible. Due to the limited length of the summary, a natural “cost” of a sentence is the number of words in it. Given the computational complexity of the task, we resort to a greedy algorithm [6] to find a summary $S$ that approximates the optimum $S^*$.

For the sake of simplicity, in the experiments below we add sentences to the summary $S$ until the maximum size is reached ($\|S\| \ge L$) and then simply truncate the summary to $L$ words.
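A sketch of this greedy selection (ours; choosing sentences by the ratio of newly covered association weight to sentence length is the standard greedy criterion for weighted set cover, which the text above implies but does not spell out):

```python
from itertools import combinations

def word_pairs(sentence):
    """All unordered word pairs co-occurring in one tokenized sentence."""
    return {frozenset(p) for p in combinations(set(sentence), 2)}

def greedy_summary(sentences, weight, max_words=250):
    """Greedily add the sentence with the highest uncovered association
    weight per word until the length limit is reached, then truncate.
    `sentences` is a list of token lists; `weight` maps frozenset word
    pairs to w(t_i, t_j)."""
    covered, chosen, length = set(), [], 0
    remaining = set(range(len(sentences)))
    while remaining and length < max_words:
        def gain(k):
            new = word_pairs(sentences[k]) - covered
            # benefit/cost ratio; cost of a sentence = its word count
            return sum(weight.get(p, 0.0) for p in new) / max(len(sentences[k]), 1)
        best = max(remaining, key=gain)
        remaining.discard(best)
        chosen.append(best)
        covered |= word_pairs(sentences[best])
        length += len(sentences[best])
    # keep document order and truncate the summary to max_words words
    return [t for k in sorted(chosen) for t in sentences[k]][:max_words]
```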

4. EXPERIMENTS

In this section, we describe experiments carried out to evaluate the proposed Association Mixture Text Summarization method. We aim to address the following questions: (1) How does the method perform in comparison to state-of-the-art unsupervised summarization methods? (2) What are the contributions of the components $p^{B}$ and $p^{D\text{-}ind}$ to the method? (3) What is the effect of the size of the background corpus $B$ on the quality of the summaries?

4.1 Experimental Setup

For experiments and comparisons we use the DUC 2007 dataset, consisting of 45 topics. Each topic of 25 documents from the AQUAINT corpus of English news is to be summarized into a collective abstract of at most 250 words.

The evaluation measure is the well-known ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [8]. We use the model summaries of the DUC datasets and their associated tools to compute the ROUGE measures. According to Lin and Hovy [7], the ROUGE-1 score has the best correspondence with human judgements. It is therefore the main focus of our evaluation. We experimented with several background corpora: the Brown corpus, the Gutenberg corpus, the Reuters RCV-1 corpus, as well as combinations.

Data Preprocessing. We remove all markup tags from the documents and leave only the headline and textual content of the news story. We then split the content into sentences with the DUC 2003 sentence segmentation tool and keep all words of length at least two.

Comparative Evaluation. We compare the Association Mixture Text Summarization method against results given in the literature for state-of-the-art unsupervised summarization methods: Document Summarization Based on Data Reconstruction, linear and non-linear (DSDR-lin, DSDR-non) [5], Topic DSDR (TDSDR) [11], Two-Tiered Topic Model (TTM) and Enriched TTM (ETTM) [1]. The last two use WordNet and the topic description as additional resources. We also include two baseline methods provided with the DUC: NIST BL and CLASSY04. The latter is actually a supervised method.

4.2 Results

Association Mixture Model and Its Two Components: In terms of F-measure for ROUGE-1, Figure 1 illustrates the performance of the overall model and of the independence and background components as functions of the size of the background corpus $B$.

The performance improves from 0.380 to 0.422 as the size of the background $B$ grows from 10 to 10,000 sentences. This illustrates how a larger background corpus is a simple but effective way to provide auxiliary information to the summarization process. In our experiments, 1,000–3,000 sentences were already sufficient as a background corpus. The improvement after this was very limited.

Next, consider the performance of the two components of the model individually. The independence component obviously does not depend on the background corpus $B$ and is hence represented by a horizontal line in the figure.

The background component, in turn, shows a longer period of improvement than the Association Mixture model and converges later than the 1,000–3,000 sentence range.

Overall, the Association Mixture Text Summarization method seems to successfully combine the two components into a model that clearly dominates both of them. Contrary to our expectations, there is a clear margin over the background component for large background corpus sizes, even though the relative weight of the independence component is very small there.

Comparison to Other Methods. A comparison to the state of the art in unsupervised summarization methods shows that the Association Mixture model is very competitive (Table 1). ROUGE-1 results are additionally shown as thin, unlabeled horizontal lines in Figure 1.