
2008 IEEE Spoken Language Technology Workshop (SLT), Goa, India, 2008.12.15-2008.12.19


USING PRIOR KNOWLEDGE TO ASSESS RELEVANCE IN SPEECH SUMMARIZATION

Ricardo Ribeiro¹ and David Martins de Matos²

¹ ISCTE/IST/L2F - INESC ID Lisboa
² IST/L2F - INESC ID Lisboa

Rua Alves Redol, 9, 1000-029 Lisboa, Portugal

ABSTRACT

We explore the use of topic-based, automatically acquired prior knowledge in speech summarization, assessing its influence across several term weighting schemes. All information is combined using latent semantic analysis as a core procedure to compute the relevance of the sentence-like units of the given input source. Evaluation is performed using the self-information measure, which tries to capture the informativeness of the summary in relation to the summarized input source. The similarity of the output summaries of the several approaches is also analyzed.

Index Terms— Speech processing, Text processing, Natural languages, Information retrieval, Singular value decomposition

1. INTRODUCTION

Automatic summarization aims to present to the user, in a concise and comprehensible manner, the most relevant content of one or more information sources [1]. Several difficulties arise when addressing this problem, but one of the most important is how to assess the relevant content. While in text summarization up-to-date systems make use of complex information, such as syntactic [2], semantic [3], and discourse [4, 5] information, either to assess relevance or to reduce the length of the output, speech summarization systems rely on lexical, acoustic/prosodic, structural, and discourse features to extract the most relevant sentence-like units from the input (although, concerning discourse features, only simple approaches are used, such as the given/new information score presented by Maskey and Hirschberg [6] or the detection of words like decide or conclude described by Murray et al. [7]). In fact, spoken language summarization is often considered more difficult than text summarization [8, 9, 10]: speech recognition errors, disfluencies, and problems in the accurate identification of sentence boundaries affect the input source to be summarized. However, although current work strives to explore features beyond the spoken document transcriptions, such as structural, acoustic/prosodic, and spoken language features [6, 11], shallow text summarization approaches like Latent Semantic Analysis (LSA) [12] and Maximal Marginal Relevance (MMR) [13] achieve comparable performance [11].

In the current work, we explore the use of prior knowledge to assess relevance in the context of broadcast news summarization. SSNT [14] is a system for the selective dissemination of multimedia contents, based on an automatic speech recognition (ASR) module that generates the transcriptions later used by the topic segmentation, topic indexing, and title&summarization modules. Our proposal consists of using news stories previously indexed by topic (by the topic indexing module) to identify the most relevant passages of upcoming news stories. This approach models relevance within topics, and its evolution through time, by continuously updating the corresponding set of background news stories (which can be seen as topic-directed semantic spaces), trying to approximate the way humans perform summarization tasks [15, 16, 17]. This is in accordance with Endres-Niggemeyer [15, 16, 17], who identified the following aspects as the most important ones when observing the summarization process from a human perspective: knowledge-based text analysis; combined online processing; and task orientation and selective interpretation.

We address these issues using a statistical approach: the topic-directed semantic spaces establish the knowledge base used by the summarization process; term weighting strategies are used to combine global and local weights (which may also include information such as the recognition confidence score, to address the specificities of spoken language summarization); and, finally, the use of LSA [18] provides the selective interpretation step, where terms (keywords), as described by Endres-Niggemeyer, are the most obvious cues for attracting attention.

This document is organized as follows: the next section briefly describes the broadcast news processing system; section 3 presents our approach to speech summarization, detailing how prior knowledge is included in the summarization process; section 4 presents the obtained results; before concluding, we address related work.

2. SELECTIVE DISSEMINATION OF MULTIMEDIA CONTENTS

SSNT [14] is a system for selective dissemination of multimedia contents, working primarily with Portuguese broadcast news services. The system is based on an ASR module that generates the transcriptions used by the topic segmentation, topic indexing, and title&summarization modules. Preceding the speech recognition module, an audio preprocessing module classifies the audio according to several criteria: speech/non-speech, speaker segmentation and clustering, gender, and background conditions. The ASR module, with an average word error rate of 24% [14], greatly influences the performance of the subsequent modules. The topic segmentation module [19], reported as based on clustering, groups transcribed segments into stories. The algorithm relies on a heuristic derived from the structure of the news services: each story starts with a segment spoken by the anchor. This module achieved an F-measure of 68% [14]. The main problem identified by the authors was boundary deletion, a problem which impacts the summarization task. Topic indexing [19] is based on a hierarchically organized thematic thesaurus provided by the broadcasting company. The hierarchy has 22 thematic areas on the first level, for which the module achieved a correctness of 91.4% [20, 14]. Batista et al. [21] inserted a module for recovering punctuation marks, based on maximum entropy models, after the ASR module. The punctuation marks addressed were the “full stop” and “comma”, which provide the sentence units necessary for the title&summarization module. This module had an F-measure of 56% and a Slot Error Rate of 0.74.

978-1-4244-3472-5/08/$25.00 ©2008 IEEE. SLT 2008

3. SPEECH SUMMARIZATION USING PRIOR KNOWLEDGE TO ASSESS RELEVANCE

Endres-Niggemeyer [16] observes that both generic and specific knowledge is used to understand information and assess its relevance. Human summarization is, thus, a knowledge-based task.

3.1. Selecting Background Knowledge

The first step of our summarization process consists of selecting background knowledge to improve the detection of the most relevant passages of the news stories to be summarized. This is done by identifying and selecting the news stories from the previous n days that match the topics of the news story to be summarized. This procedure relies on the results produced by the topic indexing module, but a clustering approach could also be followed.
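As an illustration, the selection step can be sketched as follows. The story representation (a date plus the set of topic labels assigned by the topic indexing module) and the function name are hypothetical, not part of the SSNT system:

```python
from datetime import date, timedelta

def select_background(stories, target, n_days):
    """Select stories from the previous n_days whose topic labels
    overlap the topics of the target story. Each story is assumed
    to be a dict with 'date' (datetime.date) and 'topics' (a set of
    labels produced by the topic indexing module)."""
    window_start = target["date"] - timedelta(days=n_days)
    return [
        s for s in stories
        if window_start <= s["date"] < target["date"]
        and s["topics"] & target["topics"]  # at least one shared topic
    ]

# Hypothetical usage: two indexed stories, one sharing a topic with
# the story to be summarized.
background = select_background(
    stories=[
        {"date": date(2008, 6, 5), "topics": {"politics", "economy"}},
        {"date": date(2008, 6, 7), "topics": {"sports"}},
    ],
    target={"date": date(2008, 6, 9), "topics": {"economy"}},
    n_days=8,
)
```

A clustering-based alternative, as noted above, would simply replace the topic-overlap test with cluster membership.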

3.2. Using LSA to Combine the Available Information

The summarization process we implemented is characterized by the use of LSA as a core procedure to compute the relevance of the extracts (sentence-like units) of the given input source.

3.2.1. LSA

LSA is based on the singular value decomposition (SVD) of the m × n term-by-sentence frequency matrix M: U is an m × n matrix of left singular vectors; Σ is the n × n diagonal matrix of singular values; and V is the n × n matrix of right singular vectors (only possible if m ≥ n): M = UΣV^T.

The idea behind the method is that the decomposition captures the underlying topics of the document (identifying the most salient ones) by means of term co-occurrence (the latent semantic analysis), and identifies the best representative sentence-like units of each topic. Summary creation can then be done by picking the best representatives of the most relevant topics, according to a defined strategy.

Our implementation follows the original ideas of Gong and Liu [12], and those of Murray, Renals, and Carletta [9] for addressing dimensionality problems.
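A minimal sketch of this selection strategy, assuming a plain term-frequency matrix and the one-sentence-per-topic picking rule of Gong and Liu; the function name and the toy matrix are illustrative only:

```python
import numpy as np

def lsa_rank(term_sentence, k=None):
    """Rank sentences following Gong & Liu: after the SVD
    M = U S V^T, topic r's best sentence is the column with the
    largest |V[:, r]| entry; pick one sentence per topic, from the
    most to the least salient topic (largest to smallest singular
    value), skipping sentences already picked."""
    U, s, Vt = np.linalg.svd(term_sentence, full_matrices=False)
    k = k if k is not None else len(s)
    picked = []
    for r in range(k):                        # topics in order of salience
        for j in np.argsort(-np.abs(Vt[r])):  # best remaining sentence
            if j not in picked:
                picked.append(int(j))
                break
    return picked

# Toy 4-term x 3-sentence frequency matrix (illustration only):
# sentence 1 carries the heaviest topic, sentence 0 the next one.
M = np.array([[2.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 3.0, 0.0],
              [0.0, 0.0, 1.0]])
order = lsa_rank(M, k=2)
```

The absolute value guards against the sign ambiguity of singular vectors; the `k` cutoff is where a dimensionality-reduction choice in the spirit of [9] would plug in.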

3.2.2. Including the Background Knowledge

Prior to the application of the SVD, it is necessary to create the terms-by-sentences matrix that represents the input source. It is in this step that the previously selected news stories are taken into consideration: instead of using only the news story to be summarized to create the terms-by-sentences matrix, both the selected stories and the news story are used. The matrix is defined as

$$
\begin{bmatrix}
a_{1d^1_1} & \cdots & a_{1d^1_s} & \cdots & a_{1d^D_1} & \cdots & a_{1d^D_s} & a_{1n_1} & \cdots & a_{1n_s} \\
\vdots & & & & & & & & & \vdots \\
a_{Td^1_1} & \cdots & a_{Td^1_s} & \cdots & a_{Td^D_1} & \cdots & a_{Td^D_s} & a_{Tn_1} & \cdots & a_{Tn_s}
\end{bmatrix} \tag{1}
$$

where $a_{id^k_j}$ represents the weight of term $t_i$, $1 \le i \le T$ ($T$ is the number of terms), in sentence $s_{d^k_j}$, $1 \le k \le D$ ($D$ is the number of selected documents), with $d^k_1 \le d^k_j \le d^k_s$, of document $d_k$; and $a_{in_l}$, $1 \le l \le s$, are the elements associated with the news story to be summarized. The elements of the matrix are defined as $a_{ij} = L_{ij} \times G_{ij}$, where $L_{ij}$ is a local weight and $G_{ij}$ is a global weight. To form the resulting summary, only sentences corresponding to the last columns of the matrix (which correspond to the sentences of the news story to be summarized) are selected.
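The construction above can be sketched as follows; the helper names, the stand-in ranking function, and the toy weighting are assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def build_matrix(background_sents, story_sents, vocab, weight):
    """Terms-by-sentences matrix in the spirit of eq. 1: the
    background sentences occupy the first columns and the target
    story's sentences the last ones; the entry a_ij = L_ij x G_ij
    is delegated to the `weight` callable."""
    cols = background_sents + story_sents
    M = np.array([[weight(t, s) for s in cols] for t in vocab])
    return M, len(story_sents)

def summarize(M, n_story, rank_fn, n_select):
    """Rank all columns, then keep only sentences from the last
    n_story columns, i.e., from the news story to be summarized."""
    total = M.shape[1]
    ranked = rank_fn(M)
    story_cols = [int(j) for j in ranked if j >= total - n_story]
    return story_cols[:n_select]

# Toy run with raw term counts as the weight (no global weight).
vocab = ["a", "b"]
M, n_story = build_matrix(
    background_sents=["a a", "b"],   # sentences of the background stories
    story_sents=["a b", "b b"],      # sentences of the story to summarize
    vocab=vocab,
    weight=lambda t, s: float(s.split().count(t)),
)
# Stand-in ranking by column weight mass, descending; an LSA-based
# ranking such as Gong & Liu's could be plugged in instead.
picked = summarize(
    M, n_story, lambda M: np.argsort(-M.sum(axis=0), kind="stable"), 1)
```

The key point mirrored here is that the background columns shape the decomposition (and any global weights), but only target-story columns are eligible for the summary.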

3.2.3. Term Weighting Schemes

In order to assess the relevance of the prior knowledge, we explored different weighting schemes, each of them both with and without prior knowledge. The following local weights Lij were used:

• frequency of term i in sentence j;

• summation of the recognition confidence scores of all occurrences of term i in sentence j;

• summation of the smoothed recognition confidence scores of all occurrences of term i in sentence j (smoothing uses the trigrams t_{k−2}t_{k−1}t_k, t_{k−1}t_kt_{k+1}, and t_kt_{k+1}t_{k+2}, where t_k is an occurrence of term i in a sentence).
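A sketch of these three local weights, assuming the input sentence is a list of (token, confidence) pairs; the trigram smoothing shown here (averaging the confidences of the tokens covered by the three trigrams around each occurrence) is one plausible reading of the description above, not necessarily the paper's exact scheme:

```python
def local_weights(sentence, term):
    """Return the three local weights L_ij for `term` in `sentence`:
    raw frequency, summed recognition confidence, and summed
    trigram-smoothed confidence (here: mean confidence of the window
    t_{k-2}..t_{k+2} spanned by the three trigrams -- an assumed
    interpretation of the smoothing)."""
    toks = [t for t, _ in sentence]
    conf = [c for _, c in sentence]
    freq = toks.count(term)
    conf_sum = sum(c for t, c in sentence if t == term)
    smooth_sum = 0.0
    for k, t in enumerate(toks):
        if t != term:
            continue
        window = conf[max(0, k - 2): k + 3]  # tokens covered by the trigrams
        smooth_sum += sum(window) / len(window)
    return freq, conf_sum, smooth_sum

# Hypothetical ASR output with per-token confidences.
freq, conf_sum, smooth = local_weights(
    [("the", 0.9), ("market", 0.5), ("fell", 0.8)], "market")
```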

Global weights are entropy-based (more specifically, they are based on the self-information measure), considering the probability of a term ti characterizing a sentence s:

$$
P(t_i) = \frac{\sum_j C(t_i, s_j)}{\sum_i \sum_j C(t_i, s_j)} \tag{2}
$$

where

$$
C(t_i, s_j) =
\begin{cases}
1 & t_i \in s_j \\
0 & t_i \notin s_j
\end{cases} \tag{3}
$$

The global weights Gij were defined as follows:

• 1 (not using global weight);

• −log(P(ti));

• as before, but calculating P(ti) using CR(ti, sj) (eq. 4) instead of C(ti, sj); used only when the recognition confidence scores are used in the local weight;

• as before, but using the smoothed recognition confidence scores as described for the local weights; used only when the smoothed recognition confidence scores are used in the local weight.

$$
C_R(t_i, s_j) =
\begin{cases}
\dfrac{\sum \text{recog. conf. scores of occurrences of } t_i \text{ in } s_j}{\text{frequency of } t_i \text{ in } s_j} & t_i \in s_j \\
0 & t_i \notin s_j
\end{cases} \tag{4}
$$
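The global weights of eqs. 2-4 can be sketched as follows; the function name and the data layout (token lists, with optional parallel confidence lists) are illustrative:

```python
import math

def global_weight(term, sentences, confidences=None):
    """Self-information global weight G_i = -log P(t_i), with P(t_i)
    as in eq. 2 and C(t_i, s_j) as in eq. 3; when per-token
    recognition confidences are supplied, C is replaced by
    C_R(t_i, s_j) of eq. 4 (sum of the term's occurrence confidences
    divided by its frequency, i.e., their mean)."""
    def C(t, j):
        if confidences is None:
            return 1.0 if t in sentences[j] else 0.0            # eq. 3
        occ = [c for tok, c in zip(sentences[j], confidences[j]) if tok == t]
        return sum(occ) / len(occ) if occ else 0.0              # eq. 4

    vocab = {t for s in sentences for t in s}
    total = sum(C(t, j) for t in vocab for j in range(len(sentences)))
    p = sum(C(term, j) for j in range(len(sentences))) / total
    return -math.log(p)

# Binary C: "b" accounts for 1 of the 3 term-sentence incidences,
# so P("b") = 1/3 and G = -log(1/3) = log 3.
g = global_weight("b", [["a", "b"], ["a"]])
```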

4. EXPERIMENTAL SETUP

Our goal was to understand whether the inclusion of prior knowledge increases the relevant content of the resulting summaries.

4.1. Data

In this experiment, we summarized the news stories of 2 episodes, 2008/06/09 and 2008/06/29, of a Portuguese news program (see the top part of table 1). As source of prior knowledge, we used, for the 2008/06/09 show, the episodes from 2008/06/01 to 2008/06/08 and, for the 2008/06/29 show, the episodes from 2008/06/20 to 2008/06/28.


Page 3: [IEEE 2008 IEEE Spoken Language Technology Workshop (SLT) - Goa, India (2008.12.15-2008.12.19)] 2008 IEEE Spoken Language Technology Workshop - Using prior knowledge to assess relevance

                               2008/06/09   2008/06/29
# of stories                           28           24
# of sentences                        443          476
# of transcribed tokens              7733         6632
# of different topics                  45           47
Average # of topics per story        3.54         4.92

# of stories                          205          268
# of sentences                       4127         4553
# of transcribed tokens             62025        71470
# of different topics                 158          195
Average # of topics per story        3.42         4.10

Table 1. Properties of the news shows to be summarized (top) and of the news shows used as prior knowledge (bottom).

4.2. Evaluation

As previously mentioned, our goal is to understand if the content of the summaries produced using prior knowledge is more informative than that obtained using baseline LSA methods. To do so, we use self-information (I(ti) = −log(P(ti))) as the measure of the informativeness of a summary: this meets our objectives and facilitates an automatic assessment of the improvements. The probabilities of the terms ti are estimated using their distribution in the document to be summarized.
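A sketch of this informativeness measure, assuming tokenized input and no term normalization (the function name is ours):

```python
import math
from collections import Counter

def informativeness(summary_terms, document_terms):
    """Self-information score of a summary: sum of I(t) = -log P(t)
    over the distinct terms it contains, with P estimated from the
    term distribution of the document to be summarized."""
    counts = Counter(document_terms)
    total = sum(counts.values())
    return sum(-math.log(counts[t] / total)
               for t in set(summary_terms) if t in counts)

# Rarer document terms contribute more self-information:
# P("a") = 1/2 and P("b") = 1/4, so the score is log 2 + log 4.
score = informativeness(["a", "b"], ["a", "a", "b", "c"])
```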

In this experiment, we generated 3-sentence summaries (which corresponds to a compression rate of about 5% to 10%) of each news story (similar to what is done, for example, on the CNN website).

4.3. Results

Fig. 1 shows the overlap between the output summaries, and fig. 2 shows the global performance of all approaches.

The use of global weights increases the differences between using and not using prior knowledge. This is due to the fact that the global weight of each term is computed using not only the news story to be summarized but also the prior knowledge: this also influences the elements of the terms-by-sentences matrix related to the news story, whereas, when not using global weights, the prior knowledge does not affect those elements. That is shown in the top graph of fig. 1, where all summaries have more overlap. It is also possible to observe in fig. 1 that all methods (A-F) produced summaries fairly different (about 30% overlap) from those obtained using the first 3 sentences of the news story (G), which is a common baseline for this task.
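The pairwise overlap underlying fig. 1 can be computed, for extractive summaries represented as sets of selected sentence indices, as below; normalizing by the larger summary is our assumption, since the paper does not specify the exact normalization:

```python
def overlap(summary_a, summary_b):
    """Fraction of shared sentences between two extractive summaries,
    each given as a set of selected sentence indices."""
    if not summary_a and not summary_b:
        return 1.0  # two empty summaries are trivially identical
    return len(summary_a & summary_b) / max(len(summary_a), len(summary_b))

# Two 3-sentence summaries sharing 2 sentences overlap by 2/3.
ov = overlap({0, 1, 2}, {1, 2, 5})
```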

Fig. 2 shows the number of times each method produced the most informative summary. In general, the methods using global weights achieved the worst results. This can be explained in part by the noise present both in the transcriptions and in the topic indexing phase, since, when not using global weights, the use of prior knowledge has a considerably better performance.

5. RELATED WORK

Furui [10] identifies three main approaches to speech summarization: sentence extraction-based methods, sentence compaction-based methods, and combinations of both. Sentence extractive methods comprise, essentially, methods like LSA [9, 22], MMR [23, 9, 22], and feature-based methods [9, 6, 7, 22]. Sentence compaction methods are based on word removal from the transcription, with recognition confidence scores playing a major role [24]. A combination of these two types of methods was developed by Kikuchi et al. [25], where summarization is performed in two steps: first, sentence extraction is done through feature combination; second, compaction is done by scoring the words in each sentence, after which a dynamic programming technique is applied to select the words that will remain in the sentences to be included in the summary.

Closer to our work is the way topic-adapted summarization is explored by Chatain et al. [26]: one of the components of the sentence scoring function is an n-gram linguistic model that is computed from given data. However, of the two experiments performed, one using talks and the other using broadcast news, only the one using talks used a topic-adapted linguistic model, and the data used for the adaptation consisted of the papers in the conference proceedings of the talk to be summarized. CollabSum [27] also explores document proximity in single-document text summarization: by means of a clustering algorithm, documents related to the document to be summarized are grouped in a cluster and used to build a graph that reflects the relationships between all sentences in the cluster; informativeness is then computed for each sentence, and redundancy is removed to generate the summary.

Fig. 1. Summary overlap: with (B,C,D,E)/without (A,F) RC (C,D: smoothed), with (D,E,F)/without (A,B,C) PK, with (bottom)/without (top) GW; G: first sentences. RC: recognition confidence; PK: prior knowledge; GW: global weight.

6. FUTURE WORK

Future work should address the influence of the modules that precede summarization on its performance, and assess the correlation of human preferences in summarization with the informativeness measure used in the current work.

7. REFERENCES

[1] I. Mani, Automatic Summarization, John Benjamins Publishing Company, 2001.


[Figure 2: bar chart; y-axis 0–35; x-axis: methods A–L; legend: plain, PK, plain + RC, PK + RC.]

Fig. 2. Comparative results using all variations of the evaluation measure. A: plain LSA; B: plain LSA + GW; C: plain LSA + GW + RC; D: plain LSA + GW + RC smoothed; E: plain LSA + RC; F: plain LSA + RC smoothed; G through L: the same, but with PK.

[2] L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova, “Beyond SumBasic: Task-focused summarization and lexical expansion,” Information Processing and Management, vol. 43, pp. 1606–1618, 2007.

[3] R. I. Tucker and K. Sparck Jones, “Between shallow and deep: an experiment in automatic summarising,” Tech. Rep. 632, University of Cambridge Computer Laboratory, 2005.

[4] H. Daume III and D. Marcu, “A Noisy-Channel Model for Document Compression,” in Proceedings of the 40th Annual Meeting of the ACL. 2002, pp. 449–456, ACL.

[5] S. Harabagiu and F. Lacatusu, “Topic Themes for Multi-Document Summarization,” in SIGIR 2005: Proc. of the 28th Annual International ACM SIGIR Conf. on Research and Development in Information Retrieval. 2005, pp. 202–209, ACM.

[6] S. Maskey and J. Hirschberg, “Comparing Lexical, Acoustic/Prosodic, Structural and Discourse Features for Speech Summarization,” in Proceedings of the 9th EUROSPEECH - INTERSPEECH 2005. 2005, pp. 621–624, ISCA.

[7] G. Murray, S. Renals, J. Carletta, and J. Moore, “Incorporating Speaker and Discourse Features into Speech Summarization,” in Proceedings of the HLT/NAACL. 2006, pp. 367–374, ACL.

[8] K. R. McKeown, J. Hirschberg, M. Galley, and S. Maskey, “From Text to Speech Summarization,” in 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. 2005, vol. V, pp. 997–1000, IEEE.

[9] G. Murray, S. Renals, and J. Carletta, “Extractive Summarization of Meeting Records,” in Proc. of the 9th EUROSPEECH - INTERSPEECH 2005. 2005, pp. 593–596, ISCA.

[10] S. Furui, “Recent Advances in Automatic Speech Summarization,” in Proc. of the 8th Conference on Recherche d'Information Assistée par Ordinateur. 2007, Centre des Hautes Études Internationales d'Informatique Documentaire.

[11] G. Penn and Xiaodan Zhu, “A critical reassessment of evaluation baselines for speech summarization,” in Proceedings of ACL-08: HLT. 2008, pp. 470–478, ACL.

[12] Y. Gong and X. Liu, “Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis,” in SIGIR 2001: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2001, pp. 19–25, ACM.

[13] J. Carbonell and J. Goldstein, “The Use of MMR, Diversity-Based Reranking for Reordering Documents and Producing Summaries,” in SIGIR 1998: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 1998, pp. 335–336, ACM.

[14] R. Amaral, H. Meinedo, D. Caseiro, I. Trancoso, and J. P. Neto, “A Prototype System for Selective Dissemination of Broadcast News in European Portuguese,” EURASIP Journal on Advances in Signal Processing, vol. 2007, 2007.

[15] B. Endres-Niggemeyer, Summarizing Information, Springer, 1998.

[16] B. Endres-Niggemeyer, “Human-style WWW summarization,” Tech. Rep., University for Applied Sciences, Department of Information and Communication, 2000.

[17] B. Endres-Niggemeyer, “SimSum: an empirically founded simulation of summarizing,” Information Processing and Management, vol. 36, no. 4, pp. 659–682, 2000.

[18] T. K. Landauer, P. W. Foltz, and D. Laham, “An Introduction to Latent Semantic Analysis,” Discourse Processes, vol. 25, 1998.

[19] R. Amaral and I. Trancoso, “Improving the Topic Indexation and Segmentation Modules of a Media Watch System,” in Proc. of INTERSPEECH 2004 - ICSLP, 2004, pp. 1609–1612.

[20] R. Amaral, H. Meinedo, D. Caseiro, I. Trancoso, and J. P. Neto, “Automatic vs. Manual Topic Segmentation and Indexation in Broadcast News,” in Proc. of the IV Jornadas en Tecnología del Habla, 2006.

[21] F. Batista, D. Caseiro, N. J. Mamede, and I. Trancoso, “Recovering Punctuation Marks for Automatic Speech Recognition,” in Proc. of INTERSPEECH 2007. 2007, pp. 2153–2156, ISCA.

[22] R. Ribeiro and D. M. de Matos, “Extractive Summarization of Broadcast News: Comparing Strategies for European Portuguese,” in Text, Speech and Dialogue – 10th International Conference. Proceedings. 2007, vol. 4629 of Lecture Notes in Computer Science (Subseries LNAI), pp. 115–122, Springer.

[23] K. Zechner and A. Waibel, “Minimizing Word Error Rate in Textual Summaries of Spoken Language,” in Proceedings of the 1st Conference of the North American Chapter of the ACL. 2000, pp. 186–193, Morgan Kaufmann.

[24] T. Hori, C. Hori, and Y. Minami, “Speech Summarization using Weighted Finite-State Transducers,” in Proceedings of the 8th EUROSPEECH - INTERSPEECH 2003. 2003, pp. 2817–2820, ISCA.

[25] T. Kikuchi, S. Furui, and C. Hori, “Two-stage Automatic Speech Summarization by Sentence Extraction and Compaction,” in Proceedings of the ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR-2003). 2003, pp. 207–210, ISCA.

[26] P. Chatain, E. W. D. Whittaker, J. A. Mrozinski, and S. Furui, “Topic and Stylistic Adaptation for Speech Summarisation,” in Proc. of ICASSP 2006. 2006, vol. I, pp. 977–980, IEEE.

[27] X. Wan, J. Yang, and J. Xiao, “CollabSum: Exploiting Multiple Document Clustering for Collaborative Single Document Summarizations,” in SIGIR 2007: Proc. of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2007, pp. 143–150, ACM.
