
Summarization and Information Extraction in Speech

Ph.D. Thesis Proposal

Sameer R. Maskey
[email protected]

Department of Computer Science
Columbia University
New York, NY 10027

May 1, 2006


Contents

1 Introduction
  1.1 Problem Description
  1.2 Research Questions and Contributions
  1.3 Related Work
  1.4 Structure of the proposal

2 Corpus and Annotation
  2.1 Information Extraction Corpus (IEC)
  2.2 Extractive Speech Summarization Corpus (ESSC)
  2.3 Disfluency Corpus (DSC)
  2.4 Variable Length Summarization Corpus (VLSC)

3 Feature Engineering
  3.1 Lexical Features
  3.2 Prosodic/Acoustic Features
  3.3 Structural Features
  3.4 Discourse Features

4 Meta-Information Extraction from Spoken Documents
  4.1 Soundbites and Soundbite-Speakers
  4.2 Headlines
  4.3 Speaker Roles
  4.4 Interviews
  4.5 Disfluency Detection
    4.5.1 Translation Model
    4.5.2 Phrase Level Translation
    4.5.3 Weighted Finite State Transducer Implementation

5 Speech Summarization
  5.1 SIEZ
  5.2 Dynamic Bayesian Network based Graphical Models
    5.2.1 Hidden Markov Models
    5.2.2 Maximum Entropy Models
    5.2.3 Conditional Random Fields

6 Summarizing Speech without Text: What we Say vs. How we Say?
  6.1 Exploiting the structure
  6.2 What's in the Prosody?
  6.3 Summarizers
    6.3.1 Structure Only
    6.3.2 Acoustics/Prosody Only
  6.4 What we Say vs. How we Say?

7 Variable Length Extractive Summarization
  7.1 Variable Length Extraction
  7.2 Framework for Variable Length Extractive Summarization

8 User-Focused Multi-document Speech Summarization
  8.1 Multi-document Challenges for Speech
  8.2 Approach for Multi-document Speech Summarization
  8.3 Combining Text and Speech

9 Conclusion and Timeline
  9.1 Timeline


List of Figures

4.1 F-measure with 10 fold cross-validation
5.1 L state position-sensitive HMM
5.2 CRF Structure for Soundbite Detection
6.1 Evaluation using ROUGE metrics
6.2 F-measure with 10 fold cross-validation


List of Tables

4.1 Soundbite Detection Results
4.2 The Size of the Translation Lattice and LM
4.3 Results on Held-out Test Set
6.1 Best Features for Predicting Summary Sentences


Abstract

In this proposal we address the problem of summarizing and extracting information from speech documents. The overall approach and experiments can be divided into the following sections: i) We explore and present methods for meta-information extraction from broadcast news speech documents. We identify headlines, interviews, soundbites, speaker roles, soundbite speakers, commercials, and sports and weather reports. We show that such meta-information is useful for summarizing broadcast news documents. ii) We propose a novel summarization model based on variable length segment extraction, where the segments are combinations of words, phrases, sentences or turns. We build statistical models, focusing mainly on variants of graphical models, that extract these varying degrees of segments in one optimal framework. iii) We experiment on identifying the correlation between the various properties of speech documents, especially prosodic and lexical significance. Our hypothesis that prosodic/acoustic significance correlates with lexical importance points towards the possibility of text-independent summarization and information extraction for spoken documents. iv) We present novel methods to address the unique challenges in combining text and speech for query-focused information extraction and summarization of a mixed corpus with multiple newswire text documents, speech documents in English and machine-translated speech documents. v) We combine all of the above mentioned tasks into one common platform, the Columbia Information Extraction and Summarization System for Speech (CIEZ), that users can use to view, sort, assemble, summarize and extract information from spoken data.

Chapter 1

Introduction

1.1 Problem Description

Speech is the most convenient medium for human communication. Although the ease of using speech as a communication channel is advantageous for us, we face unique challenges when we try to process speech for various language processing tasks. We speak faster than we write. We generate speech with noise and disfluency, and we invest more time to obtain information from a speech document due to its temporal nature.

In this proposal, we propose a unified solution that can extract information from speech documents, process it as per user requirements and summarize the contents. We assume that an Automatic Speech Recognition (ASR) engine is available for our purpose.

One of the problems in using ASR for speech processing tasks is the errorful ASR transcripts, which contain word errors and disfluency. Detecting disfluency and word errors, and taking account of the confidence scores of word predictions, are essential for information extraction modules in speech.

Unlike text, where periods and spaces are important cues for sentence and paragraph segmentation, spoken documents do not have evident cues for speech segmentation. The performance of automatic segmentation of speech into words, phrases, sentences, turns, stories or topics can limit or vary the types of methods we use for information extraction and summarization.

In this work, we mainly focus on Broadcast News (BN). BN speech documents have a defined set of document structures depending on the type of news and the news channel. Exploiting such structure of the show has been shown to be useful for summarization purposes [?, ?]. Automatically detecting various structural properties of a broadcast news document is challenging, as different news channels have different styles and change the show structure accordingly.

It is not evident that generative summarization is better than extractive summarization for speech, because the originality of speech (the speaker's voice) will be lost in generation. Again, for extraction-based summarization it is not evident what kind of segments (words, phrases, sentences), or combination of them, to extract. Extracting a combination of segments for an optimal summary in one unified framework with current statistical models is not a trivial task, as the model not only needs to identify significant segments but also has to identify the correct granularity of each segment.

It may be the case that users want a summary that has a mixture of information from speech and text documents. A user-focused summary that searches over a corpus with both text and speech raises some unique problems. Many clustering algorithms use lexically driven features to find redundancy in content. It is not evident how to combine acoustic significance with such clustering algorithms to compute the content redundancy. Unlike multi-document text summarization, where information fusion is mainly based on concepts carried by words and phrases, multi-document speech summarization would include information fusion at the acoustic/prosodic feature level.

The problem of correlating the significance of the content with acoustic features is important especially for languages for which ASR is not available or has word error rates that are unusably high. In such cases, we may want to extract information from speech without using any lexical information. It is still not known which acoustic/prosodic properties of speech signify the importance of the content in a given segment of speech.

We address all the problems described above in this thesis.

1.2 Research Questions and Contributions

In order to address the problems described in Section 1.1 we would have to address the following research questions.

1. Can we use text summarization techniques for summarizing speech documents? What are the possible problems with such an approach, and why may such techniques not be sufficient?

2. How can we enrich Automatic Speech Recognition (ASR) transcripts for general speech summarization? How should we handle the recognition errors and noise in the ASR transcripts?

3. Can we summarize speech without using text at all? What extra information is available in speech (broadcast news) that may lead to better speech summarization models?

4. Can we exploit the temporal nature of speech by using time-dependent statistical models for better modeling of broadcast news and building a better summarization system?

5. How do we address the problems that arise in building a multi-document speech summarization system?

6. How should we combine text and speech documents together to generate one summary that answers user queries?


We think the algorithms, theories and experiments developed to answer the research questions stated above, and thereby the problems described in Section 1.1, will provide the following novel contributions:

1. A speech summarization system that can summarize broadcast news.

2. Experiments in acoustic significance that may show the possibility of summarizing speech without any ASR transcript.

3. Graphical-model-based algorithms showing how to exploit the structure of broadcast news.

4. Algorithms and modules that can identify meta-information such as soundbites, speaker roles, interviews and soundbite speakers from broadcast news, and our meta-information-based speech summarizer.

5. A phrase-based disfluency detection algorithm in a finite state transducer framework.

6. A clustering algorithm to cluster ASR transcripts with newswire text sentences.

7. An algorithm for query-focused summarization that searches and summarizes a corpus with multiple speech and text documents.

1.3 Related Work

The previous work on summarization and information extraction (IE) in speech has focused mainly on three different domains: broadcast news [?], meetings [?] and voicemail/phone conversations [?]. Most speech summarization systems are based on extraction of words, phrases or sentences. [?] extract sentences of a voicemail, while [?] extracts sentences from broadcast news and [?, ?] extract words. Each domain has unique challenges and advantages.

Broadcast news from different channels follow a very similar structure in the presentation of the content. Such structure can be exploited for information extraction and summarization of such spoken documents. [?] have shown that broadcast news can be summarized with structural information, and [?, ?] have shown it to be useful for information extraction tasks. IE and summarization of meetings, voicemail and phone conversations are relatively harder than BN because of spontaneity and worse ASR transcripts. The performance of meeting summarizers is worse [?, ?]. Zechner shows the importance of removing disfluencies in summarizing such unrestricted domains. Even though disfluencies occur less frequently in BN, we remove or mark them in our system. Our disfluency removal model is similar to [?, ?].

Even though we do not perform sentence segmentation of speech ourselves, many of our models are dependent on the accuracy of the segmentation. We use the sentence segmentation method proposed by [?]. Their sentence segmentation method, based on just the prosodic features, is especially useful for us, as we show that extraction of sentences for a summary can be performed without using ASR transcripts.

There has been a significant amount of work on text summarization. Many of these text summarization systems are based on extraction, while some also rewrite and paraphrase their output or combine multiple inputs and outputs. On the other hand, there is less available work on speech summarization. This may be partly because many view speech summarization as text summarization on ASR transcripts. We argue that speech summarization needs a significantly different approach, which we discuss in Section ??, though there are many similar problems that have to be addressed in similar ways.

Since our model of summarization is heavily based on statistical models, the methods proposed in [?, ?, ?] for text summarization are similar to our models. [?, ?] perform information fusion for text summarization. Since we deal with a corpus of text and speech with multiple documents on the same topic, we address some of the same problems. One of the key differences is that, in combining speech documents, we may want to preserve the originality of speakers' voices in the summary.

1.4 Structure of the proposal

The remainder of the proposal is structured as follows. We explain our annotation of the data and the corpora we have developed in Chapter 2, and the work done on feature extraction at the lexical, discourse, prosodic and structural level for speech in Chapter 3. In Chapter 4 we present our various ways of extracting meta-data from speech and ASR transcripts, and in Chapter 5 we propose dynamic and non-dynamic models for speech summarization. In the following chapters we present our findings on the problems and ways of combining multiple speech documents. We conclude the proposal with a timeline in Chapter 9.

Chapter 2

Corpus and Annotation

We used several annotated corpora available from the Linguistic Data Consortium (LDC): the Penn Treebank III corpus, TDT2, TDT4 and TDT5. Even though for many of our tasks the annotation available in these corpora was sufficient, we required additional annotation for the summarization and information extraction tasks.

2.1 Information Extraction Corpus (IEC)

We hired two annotators to annotate a subset of the manual transcripts of the TDT2 corpus for soundbites, soundbite-speakers, interviews, headlines and speaker roles in broadcast news, to build the IE Corpus.

We provided labeling software, dLabel v2.5, along with a labeling manual to maintain consistency in the annotation. The labeling manual followed ACE standards for annotating named entities. The dLabel software was written in Java and is freely available at [?]. dLabel was integrated with Java Web Start so that annotators can install the labeling software on any machine without having to download the full version of Java. A snapshot of the labeling software and its details are given in Appendix ??.

The annotated corpus consisted of 96 CNN headline news shows. The annotation was done on manual transcripts of the shows. Each show was half an hour long, thus providing us with 48 hours of annotated broadcast news.

One of the proposed tasks is to annotate soundbites, soundbite-speakers and speaker roles in our TDT2 corpus.

2.2 Extractive Speech Summarization Corpus (ESSC)

After we annotated the data needed for the information extraction tasks, we hired another annotator to build extractive summaries of the same subset of TDT4 data. We again provided a labeling manual to the annotator for extracting sentences to generate a summary. We provided a web-based interface for generating the summaries, such that the selected sentences were automatically stored in a MySQL database. A snapshot of the interface is in Appendix ??.

The ESSC corpus comprises 216 stories from 20 CNN shows of the TDT-2 corpus. This includes 10 hours of audio data. We used manual transcripts, Dragon ASR transcripts and the audio files of each show for training and testing. We used manual story segmentation information for the segmentation of the shows.

In order to test the system with the use of ASR transcripts, we force-aligned the manual transcripts with the ASR transcripts and used the ASR transcripts with the aligned annotation from the manual transcripts. This allowed us to test the degree of degradation when ASR transcripts are used for summarization.

2.3 Disfluency Corpus (DSC)

We tested our disfluency detection algorithm on the Switchboard portion of the Penn Treebank III corpus. We split our data into a training set of 1,221,502 words and a held-out test set of 199,520 words. We performed all of our experiments on manual transcripts. We do not assume that we have sentence boundaries or interruption point information. But we do assume that we have turn boundaries, so that the decoder can handle sizable chunks of speech without severe memory problems. Since our model is not a feature-based model, we do not use turn boundary information to extract any feature that might provide an unfair advantage to the model.

For training, we use the annotations for repeats, repairs and filled pauses that are provided with the corpus. We convert the Switchboard annotation to produce a parallel text corpus of clean and noisy speech transcripts, where all the disfluent words in a noisy transcript align with disfluent token IDs. These disfluent token IDs signify the type of disfluency for the corresponding word in the noisy transcript, as shown in the example in Section 4.5.1. Let us define C as the annotated clean corpus and N as the corresponding noisy parallel corpus.

2.4 Variable Length Summarization Corpus (VLSC)

In order to run one of our proposed experiments on variable length extractive summarization for speech, we need training data of summaries and their corresponding stories, where the summaries were generated by extracting variable length segments from the stories. We require annotators to build summaries for spoken documents by extracting words, phrases or sentences. We propose to provide a labeling manual for building these summaries.

Chapter 3

Feature Engineering

In this chapter we describe the features we extract for all of our tasks. Even though the features to be extracted differ depending on the type of task we address, a significant number of features overlap across most of the systems. We propose a framework to extract all such features to create a superset of features that are relevant and are usually extracted for many natural language processing tasks. We extract four types of features: lexical features, acoustic/prosodic features, structural features and discourse features. We do not use all of the features in each set for all our tasks. We use a subset of the features depending on the task. The proposed framework is extensible, such that any new feature extraction code can be easily integrated and any new data format can be easily converted to fit the data format of our system.

3.1 Lexical Features

The lexical features we extract include count features such as word counts, part-of-speech counts, named entity counts, cue phrase counts and TF*IDF. The named entity features we extract are person names (NEI), organization names (NEII), place names (NEIII), and the total number of named entities. We also extract cue phrases. The other lexical features we extract are unigram, bigram and trigram probabilities.
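A minimal sketch of how such per-sentence lexical count features might be assembled is given below. The tokenized sentences, named-entity tag set and cue-phrase list are hypothetical inputs, and the TF*IDF term is a simplified illustration rather than the exact formulation used in our system.

    # Sketch: per-sentence lexical count features (hypothetical inputs and tag names).
    import math
    from collections import Counter

    CUE_PHRASES = {"in summary", "the bottom line"}   # hypothetical cue-phrase list

    def lexical_features(sentences, ne_labels, doc_freq, num_docs):
        """sentences: list of token lists; ne_labels: parallel list of NE tags per token;
        doc_freq: document frequency of each word; num_docs: corpus size."""
        feats = []
        for tokens, tags in zip(sentences, ne_labels):
            ne_counts = Counter(tags)
            tfidf = sum(tokens.count(w) * math.log(num_docs / (1 + doc_freq.get(w, 0)))
                        for w in set(tokens))
            text = " ".join(tokens).lower()
            feats.append({
                "num_words": len(tokens),
                "num_person": ne_counts.get("PERSON", 0),   # NEI
                "num_org": ne_counts.get("ORG", 0),         # NEII
                "num_place": ne_counts.get("LOC", 0),       # NEIII
                "num_ne_total": sum(c for t, c in ne_counts.items() if t != "O"),
                "num_cue_phrases": sum(text.count(p) for p in CUE_PHRASES),
                "tfidf_sum": tfidf,
            })
        return feats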

Some of these features, like named entities, have previously been tested in other summarization systems [?, ?]. One of our findings is the importance of named entity features. Unlike text news, in broadcasts, multiple stories are presented in one broadcast, with each story containing its own distinctive named entities. While these named entities may not be repeated frequently over the broadcast, they are important clues to the selection of summary segments within a story. For example, a sentence containing many named entities in the introduction of a story by a news anchor often represents an overview of the story to be presented and, thus, is often included in a summary.

Our feature selection algorithm selects the total number of NEs and the number of words in the sentence as particularly useful features for predicting sentences to be included in a summary. For our current purposes, we have assumed that we can obtain accurate named entity labels from systems such as BBN's IdentiFinder [?].

3.2 Prosodic/Acoustic Features

The intuition behind using prosodic/acoustic features is based on well-established research in speech prosody [?] showing that humans use intonational variation (expanded pitch range, phrasing or intonational prominence) to mark the importance of particular items in their speech. In Broadcast News, we note that a change in pitch, amplitude or speaking rate may signal differences in the relative importance of the speech segments produced by anchors and reporters, the professional speakers in our corpus. There is also considerable evidence that topic shift is marked by changes in pitch, intensity, speaking rate and duration of pauses [?, ?], and new topics or stories in broadcast news are often introduced with content-laden sentences which, in turn, are often included in story summaries.

Prosodic/acoustic features have been examined in research on speech summarization [?] and information extraction tasks [?]. Our acoustic feature set includes features mentioned in [?, ?] as well as new acoustic features. It includes speaking rate (the ratio of voiced to total frames); F0 minimum, maximum and mean; F0 range and slope; minimum, maximum and mean RMS energy (minDB, maxDB, meanDB); RMS slope (slopeDB); and sentence duration (timeLen = endtime - starttime). We extracted these features by automatically aligning the annotated manual transcripts or the ASR transcripts with the audio source. We then used Praat [?] to extract the features from the audio and experimented with both normalized and raw versions of each. Normalized features were produced by dividing each feature by the average of the feature values for each speaker, where speaker identity was determined from the Dragon speaker segmentation of the TDT-2 corpus. Normalized acoustic features performed better than raw values.
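A minimal sketch of the per-speaker normalization described above, assuming the raw feature values and speaker labels have already been extracted (e.g., with Praat and the Dragon speaker segmentation); the field names are hypothetical.

    # Sketch: normalize each acoustic feature by the speaker's average value for it.
    from collections import defaultdict

    def normalize_by_speaker(rows, feature_names):
        """rows: list of dicts, each with a 'speaker' id and raw acoustic feature values."""
        totals = defaultdict(lambda: defaultdict(float))
        counts = defaultdict(int)
        for r in rows:
            counts[r["speaker"]] += 1
            for f in feature_names:
                totals[r["speaker"]][f] += r[f]
        normalized = []
        for r in rows:
            spk, out = r["speaker"], dict(r)
            for f in feature_names:
                mean = totals[spk][f] / counts[spk]
                out[f] = r[f] / mean if mean else 0.0   # divide by the speaker average
            normalized.append(out)
        return normalized

    # Example usage (hypothetical values):
    # rows = [{"speaker": "spk1", "f0_mean": 180.0, "meanDB": 62.0}, ...]
    # normalize_by_speaker(rows, ["f0_mean", "meanDB"])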

Our duration feature, 'sentence duration', represents the length in seconds of the sentence. Our motivation for including this feature is twofold: very short segments are not likely to contain important information; on the other hand, very long segments may not be useful to include in a summary, simply for concerns about producing over-long summaries. This length feature can accommodate both types of information. We obtain sentence duration by subtracting the start time from the end time of each sentence.

Our feature selection algorithm finds that timeLen, minDB and maxDB are particularly discriminatory, while pitch features are, curiously, among the least useful of the acoustic features.

3.3 Structural Features

Broadcast News programs exhibit similar structure, particularly broadcasts of the same show from the same news channel. Each usually begins with an anchor or anchors reporting the headlines, followed by the actual presentation of those stories by the anchor, reporters, and sometimes interviewees. Programs are usually concluded in the same conventional manner. We call the features which rely upon aspects of this patterning and upon the overall structure of the broadcast structural features [?], comparable to [?]'s style features. We have previously shown that structural features are useful predictors of extractive summaries of Broadcast News [?].

The structural features we investigated for our study include normalized sentence position in turn, speaker type, next-speaker type, previous-speaker type, speaker change, turn position in the show and sentence position in the show. Only reporters' turns are marked as such in the TDT-2 corpus, so our speaker type feature is binary, 'reporter or not'. This unfortunately conflates anchor turns with those of interviewees and soundbite speakers.

3.4 Discourse Features

Some summarization systems [?] have included discourse features, such as [?]'s discourse trees, which model the rhetorical structure of a text to identify important segments for extraction. We have explored a different discourse feature by computing a measure of 'givenness' in our stories. Following [?], we identify 'discourse given' information as information which has previously been evoked in a discourse, either by explicit reference or by indirect (in our case, stem) similarity to other mentioned items. Our intuition is that given information is less likely to be included in a summary, since it represents redundant information. Our given/new feature represents a very simple implementation of this intuition and proves to be a useful predictor of whether a sentence will be included in a summary. This feature is a score that ranges between -1 and 1, with a sentence containing only new information receiving a score of 1, and a sentence containing only 'given' information receiving -1. We calculate this score for each sentence by the following equation:

    S(i) = \frac{n_i}{d} - \frac{s_i}{t - d}    (3.1)

Here, n_i is the number of 'new' noun stems in sentence i; d is the total number of unique noun stems in the story; s_i is the number of noun stems in sentence i that have already been seen in the story; and t is the total number of noun stems in the story.

The intuition behind this feature is that, if a sentence contains more new noun stems, it is likely that more 'new information' is included in the sentence. The term n_i/d in Equation 3.1 takes account of this 'newness'. On the other hand, a very long sentence may have many new nouns but still include other references to items that have already been mentioned. In such cases, we would want to reduce the given/new score by the 'givenness' in the sentence; this givenness reduction is taken into account by s_i/(t - d). As we will show in Section ??, this simple measure improves our summarization F-measure. We have also experimented with variations on this score but found Equation 3.1 to yield the best performance.
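A minimal sketch of computing the given/new score of Equation 3.1 for the sentences of one story; the noun-stem extraction (POS tagging plus stemming) is assumed to have been done elsewhere, so the function takes pre-extracted noun stems per sentence.

    # Sketch: given/new score S(i) = n_i/d - s_i/(t - d) for each sentence of a story.
    def given_new_scores(noun_stems_per_sentence):
        """noun_stems_per_sentence: list of lists of noun stems, one list per sentence."""
        all_stems = [s for sent in noun_stems_per_sentence for s in sent]
        t = len(all_stems)                    # total noun stems in the story
        d = len(set(all_stems))               # unique noun stems in the story
        seen = set()
        scores = []
        for sent in noun_stems_per_sentence:
            n_i = len({s for s in sent if s not in seen})   # new noun stems in the sentence
            s_i = sum(1 for s in sent if s in seen)         # noun stems already seen
            new_term = n_i / d if d else 0.0
            given_term = s_i / (t - d) if t != d else 0.0
            scores.append(new_term - given_term)
            seen.update(sent)
        return scores

    # Example: the second sentence repeats "senate", so its score is lower.
    # given_new_scores([["election", "senate"], ["senate", "vote"]])  # -> [0.67, -0.67]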

Chapter 4

Meta-Information Extraction from Spoken Documents

Spoken documents, especially broadcast news, have meta-information that may be useful for information extraction and summarization tasks. We can define detecting headlines, soundbites, speaker roles, interviews, soundbite-speakers and interviewees as meta-information detection for broadcast news. Some of this meta-information, namely speaker roles and soundbites, has already been shown to improve extractive summarization. Soundbites can be useful in question answering tasks, as described in Section ??. We plan to use the extracted meta-information for generating summaries or for information retrieval purposes. We can either use meta-information to help extract the relevant segments of the document or use it to fill a template that defines a user-preferred summary.

4.1 Soundbites and Soundbite-Speakers

Soundbites are segments of speech played in broadcast news that are not part of the conversation among the speakers in the news. Depending on the type of news, anchors or reporters play segments of speech by some speaker pertaining to the news being reported. We term such speakers soundbite-speakers and the segments of speech spoken by them soundbites. Detecting soundbites can be useful for summarization of broadcast news. We detect soundbites using lexical, acoustic and structural features in a CRF framework, and obtain an accuracy of 77.2%. To our knowledge, no research has yet been done on soundbite detection in BN.

To classify segments as soundbite segments or not, we built CRF models using the Mallet tool [?]. This tool allows us to build CRF models with varying degrees of Markov order. In order to test the effect of previous context, we built CRF models with a Markov order of 0, 1 and 2 and compared them to MEMM models. Figure 4.1 compares the performance of the different models.

Figure 4.1: F-measure with 10 fold cross-validation

We compare our results with a baseline based on chance. We do not use a "majority class baseline" because it would result in a baseline with a recall and an F-measure of 0. The CRF 10-fold cross-validation results are significantly higher than the baseline. The F-measure is 38.56% higher than the baseline, and the best performing iteration has an F-measure 64.7% higher than the baseline. Similarly, recall and precision for the CRF model are 45.38% and 34.2% higher than the baseline, respectively.

The first model on the left in Figure 4.1 is a CRF model with a Markov order of 1, with the observation conditioned both on the parent state and the previous parent. The second model is a maximum entropy model where the observation is conditioned on the parent state only and the current state is dependent on the previous state. The best model, shown on the right of the figure, is a 1-order CRF model with the current state depending only on the previous state. Intuitively, we would assume that for a task such as soundbite detection, a higher order model would do better. However, our experiments showed that a 2-order model overfits the data and degrades overall performance. In a 10-fold cross-validation experiment our 1-order model performed 10.75% better in accuracy and 9.01% better in F-measure than a 2-order model.

Model Type   Precision   Recall   F-Meas   Acc
CRF          0.522       0.624    0.566    0.674
MEMM         0.431       0.545    0.478    0.602
Baseline     0.18        0.17     0.18     0.465

Table 4.1: Soundbite Detection Results

We also built MEMM models for soundbite detection to compare with our CRF models. CRFs are similar to MEMMs, except that MEMMs suffer from a label bias problem due to normalization over local features rather than over the entire sequence. The results presented in Table 4.1 show that the MEMM model does slightly worse than the CRF model. For the same Markov order and similar conditioning of features over states, the CRF model does better than the MEMM by 7.28% on accuracy, 8.76% on F-measure, 7.88% on recall and 9.06% on precision.

Planned Experiment for Soundbite-Speakers - We will detect soundbite-speakers as a classification problem over the named entities. We will use IdentiFinder to detect all the names in the speech transcript. Using the features listed in Chapter ?? and additional lexical features especially designed for detecting soundbite-speakers, such as cue phrases after the soundbite-speakers, we will train a supervised model for detecting soundbite-speakers.

4.2 Headlines

We define headlines as the "sentences spoken by anchors in the beginning of the news that describe all the major news to be reported in the talk." These headlines are usually two to five sentences long, are always spoken by the anchors of the show, and occur within the first five turns of the show. Our task is not to generate headlines, as in the classic headline generation task of [?], but to segment the relevant part of the document into headlines. These headlines can also be used as cues to browse the contents of the spoken document.

We have built Bayesian Network models for detecting headlines. Although the results cannot be compared to the one-line headline generation task, which is a type of summarization task, the model performs with X% accuracy. The model for headline detection was built by comparing various statistical models, such as Bayesian Networks, Decision Trees and a Rule-Based system, by plotting threshold curves. The model with the highest Area Under the Curve was chosen for detecting the headlines.
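A minimal sketch of that model-selection step: comparing candidate classifiers by area under the ROC curve on held-out data. scikit-learn is used purely for illustration, the feature matrix is a hypothetical stand-in for the headline features, and Gaussian naive Bayes stands in for the Bayesian Network model.

    # Sketch: pick the headline-detection model with the highest AUC on held-out data.
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier

    def select_by_auc(X, y):
        """X: feature matrix (one row per sentence); y: 1 if the sentence is a headline."""
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
        candidates = {
            "naive_bayes": GaussianNB(),                     # stand-in for the Bayesian Net
            "decision_tree": DecisionTreeClassifier(max_depth=5),
        }
        auc = {}
        for name, model in candidates.items():
            model.fit(X_tr, y_tr)
            auc[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        best = max(auc, key=auc.get)
        return best, auc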

Planned Experiment for Headlines - Since headlines are fairly consistent in which part of the broadcast they appear, we expect headline detection to be a fairly easy task, as shown by our early experiments. We will enhance our current version of the headline detection algorithm so that features that are heavily dependent on the type of news channel we are using are removed and more scalable features are used.

4.3 Speaker Roles

The primary speakers of a broadcast news show are "Anchors". They present the news to the audience. The "Reporters" help anchors by reporting information away from the broadcasting station. The reporters and anchors have their own responsibilities that make a difference in the role they play in the broadcast news. Such differences have been termed speaker roles. We classify three types of speaker roles: Anchors, Reporters and Others.

We built a Bayesian Network model for reporter detection. For reporter detection, the Bayesian Net model could classify reporter vs. non-reporter segments with an accuracy of 72%, an F-measure of 0.665, precision of 0.719 and recall of 0.618. Acoustic features, structural features and cue phrases are useful in speaker role detection.

Planned Experiment for Speaker Roles - Currently we only have annotation for reporters in the TDT-4 dataset. After the completion of the annotation of the Information Extraction Corpus (IEC), we will experiment on detecting other roles besides reporters. We will use the same modeling technique we have described above.


4.4 Interviews

Interviews are the sections in broadcast news that consist of an anchor and at least one other speaker who is not a reporter. In the context of headline shows, interviews are usually meant to present and explain opinions of experts in a particular field. Interviews can differ a lot depending on the news channel. In VOA, interviews usually mean a person being interviewed in detail about his/her life or a topic relevant to him/her. On the other hand, interviews in CNN headline news are short and are about viewpoints on a certain event or topic. In this thesis, we primarily focus on interviews in headline shows such as CNN.

Planned Experiment for Interviews - Detecting interviews is among the hardest tasks mentioned in this chapter. Detecting interviews would include finding the boundaries of interviews based on turn segmentation. We plan to use the speaker role detection model heavily for detecting interviews, because we see a consistent pattern in the turn-taking of anchors and interviewees. Exploiting this pattern, we will build a dynamic statistical model to find the begin and end boundaries at the turn level.

4.5 Disfluency Detection

Disfluency is common in speech. Detecting disfluency in speech can be useful for the readability of speech transcripts as well as for further processing by natural language models such as summarization, machine translation or parsing. We view the disfluency removal problem as a process that transforms the "noisy" disfluent transcript into a "clean" one [?, ?]. Such a transformation can be described using statistical machine translation models. In particular, Zhou et al. [?] have formulated fast phrase-based statistical translation using FSTs for speech-to-speech (S2S) translation. We are motivated to apply a framework similar to [?] to address disfluency detection.

4.5.1 Translation Model

Based on a source-channel model approach to statistical machine translation [?], translating a foreign token sequence n_1^J to a target token sequence c_1^I can be viewed as a stochastic process of maximizing the joint probability p(n, c), as stated in Equation 4.1:

    c = \arg\max_{c_1^I} \Pr(n_1^J, c_1^I)    (4.1)

The joint probability can be obtained by summing over all the hidden variables, which can be approximated by maximization. For machine translation purposes these random variables take account of alignment, permutation, and fertility models.

For our purpose we view disfluency detection as translation from a noisy token sequence n_1^J := n_1, n_2, ..., n_J to a clean token sequence c_1^I := c_1, c_2, ..., c_I. Since the removal of disfluency will entail removal of words from n_1^J, we still require alignment and fertility models, as I < J.

We simplify the training of our translation model by retokenizing the c_1^I sequence. Instead of a clean speech transcript without any disfluent words, we append a tag that signifies the type of disfluency for each disfluent word in n_1^J. This retokenization produces c_1^I with the same number of words as n_1^J, such that I = J. The retokenization of our previous example of a repair in Table ?? produces the following parallel text.

• Noisy Data: I want to buy three glasses no five cups of tea

• Clean Data: I want to buy REPAIR0 REPAIR1 FP0 five cups of tea

These modifications to the standard machine translation model simplify our model in the following ways: i) we do not require a fertility model, since the number of words in clean and disfluent speech are equal and words in the noisy speech transcript can neither go to null nor generate more than one word; ii) with disfluent words retokenized (I = J) we have a perfect alignment between the noisy and clean transcripts in the parallel corpora, removing the need for an alignment model.
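A minimal sketch of this retokenization, turning a disfluency-annotated utterance into the noisy/clean parallel pair shown above. The input format (word, disfluency-tag pairs) and the numbering of the tags are assumptions read off the example.

    # Sketch: build one noisy/clean parallel pair by replacing each disfluent word
    # with a tag encoding its disfluency type and position within the disfluent span.
    def retokenize(annotated_tokens):
        """annotated_tokens: list of (word, tag) pairs, tag in {None, 'REPEAT', 'REPAIR', 'FP'}."""
        noisy, clean = [], []
        span_pos, prev_tag = 0, None
        for word, tag in annotated_tokens:
            noisy.append(word)
            if tag is None:
                clean.append(word)
                prev_tag = None
            else:
                span_pos = span_pos + 1 if tag == prev_tag else 0
                clean.append(f"{tag}{span_pos}")    # e.g. REPAIR0, REPAIR1, FP0
                prev_tag = tag
        return " ".join(noisy), " ".join(clean)

    # The example from the text:
    # retokenize([("I", None), ("want", None), ("to", None), ("buy", None),
    #             ("three", "REPAIR"), ("glasses", "REPAIR"), ("no", "FP"),
    #             ("five", None), ("cups", None), ("of", None), ("tea", None)])
    # -> ("I want to buy three glasses no five cups of tea",
    #     "I want to buy REPAIR0 REPAIR1 FP0 five cups of tea")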

4.5.2 Phrase Level Translation

Repeats and repairs are difficult to detect because the reparandum of these disfluencies can be more than one word. In our example in Table ?? the reparandum is "three glasses", a two-word phrase. In order to detect repairs and repeats at the phrase level, we build a phrase-level translation model. We denote the phrase segmentation by introducing a hidden variable p_1^K into Equation 4.2, summing over the joint probability. In addition, we can approximate the sum over the hidden variables using a maximum operator.

    c = \arg\max_{c_1^J} \sum_{p_1^K} \Pr(p_1^K, n_1^J, c_1^J)    (4.2)

      \approx \arg\max_{c_1^J} \max_{p_1^K} \Pr(p_1^K, n_1^J, c_1^J)    (4.3)

4.5.3 Weighted Finite State Transducer Implementation

We can implement Equation 4.2 using weighted finite state transducers. Using the chain rule, we can easily decompose the joint probability into a chain of conditional probabilities as follows, in a similar way to [?, ?]:

    \Pr(p_1^K, n_1^J, c_1^J) = P(c_1^J) \cdot    (4.4)
                               P(p_1^K \mid c_1^J) \cdot    (4.5)
                               P(n_1^J \mid p_1^K, c_1^J)    (4.6)

We can compute the conditional probabilities of Equations 4.4, 4.5 and 4.6 by using the parallel corpus and the phrase dictionary. Furthermore, we can build a WFST for each probability distribution modeling the input and output: L, N and P, where L is the language model, N is the translation model and P is the phrase segmentation model, respectively.

The arc probabilities for the translation model N are computed from the relative frequencies of the collected phrase pairs.

    P(c \mid n) = \frac{N(c, n)}{N(c)}    (4.7)

where N(c, n) is the number of times a clean phrase c is translated by a noisy phrase n. The above equation overestimates the probabilities of rare phrases. In order to take account of such overestimation, we smooth our translation probability by performing delta smoothing: we add a small numerical quantity δ to the numerator of Equation 4.7 and add δ·|V| to the denominator, where |V| is the size of the translation vocabulary for a given phrase.
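A minimal sketch of the smoothed relative-frequency estimation described above, applying the delta smoothing to Equation 4.7; the collection of phrase pairs is assumed to have been done during phrase extraction, and the value of δ is arbitrary.

    # Sketch: delta-smoothed translation-model arc probabilities,
    # P(c|n) = (N(c, n) + delta) / (N(c) + delta * |V|).
    from collections import defaultdict

    def translation_probs(phrase_pairs, delta=0.5):
        """phrase_pairs: (clean_phrase, noisy_phrase) pairs collected from the parallel corpus."""
        pair_counts = defaultdict(int)
        clean_counts = defaultdict(int)
        vocab = defaultdict(set)              # noisy phrases observed with each clean phrase
        for c, n in phrase_pairs:
            pair_counts[(c, n)] += 1
            clean_counts[c] += 1
            vocab[c].add(n)
        probs = {}
        for (c, n), count in pair_counts.items():
            v = len(vocab[c])                 # translation vocabulary size for this phrase
            probs[(c, n)] = (count + delta) / (clean_counts[c] + delta * v)
        return probs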

The language model plays an important role in a source-channel model like ours. Our language model L is a standard trigram language model, with the n-gram probabilities computed from the clean corpus that has disfluent words tagged as REPEAT, REPAIR and FP (filled pauses). In other words, we use the annotated side of the parallel corpus as the language model training data. We built a back-off 3-gram language model and encoded it as a weighted acceptor, as described in [?], to be employed by our translation decoder.

After building all three types of WFSTs, we can perform a cascaded composition of these finite state transducers to obtain one translation lattice that translates a sequence of noisy words into clean phrases.

    T = P \circ N \circ L    (4.8)
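A minimal sketch of the cascaded composition of Equation 4.8, assuming the three machines have already been built as weighted transducers; the pynini/OpenFst toolkit is used here only for illustration and is not necessarily the toolkit used in the original work.

    # Sketch: cascade the phrase segmentation (P), translation (N) and language
    # model (L) transducers into one translation lattice T = P o N o L.
    import pynini

    def build_translation_lattice(P, N, L):
        """P, N, L: pynini.Fst objects built elsewhere from the parallel corpus and LM."""
        # Compose pairwise, then optimize to keep the composed machine manageable.
        return pynini.compose(pynini.compose(P, N), L).optimize()

    # Decoding a noisy utterance encoded as an acceptor `noisy` would then look like:
    #   best = pynini.shortestpath(pynini.compose(noisy, T))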

We built the WFSTs as described in Section 4.5.3 and composed them to produce a noisy-to-clean speech translation lattice. We built the language model using the IBM language model toolkit. The size of the final translation lattice is listed in Table 4.2.

Type          # of states   # of transitions
Translation   1,635,398     3,426,379
LM            234,118       1,073,678

Table 4.2: The Size of the Translation Lattice and LM

When we tested our model on our held-out test set, we obtained the results listed in Table 4.3. We evaluated our method using the standard precision, recall and F-measure. The scores were computed at the word level. The training and test data are heavily skewed, with very few positive examples of disfluency. In our test set of 199,520 words only 6.93% of the words were disfluent, so F-measure is a reasonable metric for the system evaluation.

           Disfluency     Precision   Recall   F-measure
w/o LM     REPEAT         0.695       0.809    0.747
           REPAIR         0.479       0.256    0.334
           FILLED PAUSE   0.953       0.998    0.975
with LM    REPEAT         0.743       0.860    0.797
           REPAIR         0.509       0.331    0.401
           FILLED PAUSE   0.955       0.998    0.976

Table 4.3: Results on Held-out Test Set

Chapter 5

Speech Summarization

In this chapter we describe the Speech Information Extraction and Summarization (SIEZ) system, which can be used to summarize a single spoken document. We propose a Dynamic Bayesian Network (DBN) based graphical model that can take account of time dependencies in broadcast news when predicting the segments to be included in the summary.

5.1 SIEZ

The Speech Information Extraction and Summarization system at Columbia was developed over the course of two and a half years. SIEZ can take any broadcast news document and produce a two-level summary of the news. The first-level summary consists of high-level information about the headlines of the broadcast. The second level consists of a summary of each individual story in the broadcast, as a single broadcast consists of many sub-stories. The headline detection of Section 4.2 is used to first detect headlines, which are then extracted for the top-level summary. The headlines are linked with the summaries of the individual stories of the broadcast. The interface of SIEZ is shown in Appendix ??.

We explored many statistical techniques while building SIEZ. Most of these models cannot model time-dependent features. Such models tend to be simpler than dynamic models that can model time series information. We explored five techniques that have shown promise on other natural language and speech processing tasks: i) Decision Trees, ii) Bayesian Nets, iii) the rule-based Ripper system, iv) Support Vector Machines and v) Perceptrons.

We built these five models for building summaries of broadcast news. If we generalize the results, we notice that Bayesian Nets are one of the most stable models, doing better than the other models for most of the tasks. Decision trees can handle very noisy data, so they show some promise when there are many missing values. Support Vector Machines do not seem to scale well enough for most of our tasks, and neither do the perceptron models. We compare all of these techniques by plotting threshold curves and computing the area under the curve (AUC). The results are shown in Table ??.

Our system uses four types of features from the spoken document, as described in Chapter ??. The system's performance is reported in Table ??. The information retrieval scores were measured at the sentence level, where sentence segmentation was manually marked.

Precision   Recall   F-Measure   ROUGE
0.489       0.613    0.489       0.75

5.2 Dynamic Bayesian Network based Graphical Models

Broadcast news has a lot of time-dependent information. The current speaker role is dependent on the previous speaker role: usually, a reporter or interviewee speaks after an anchor. There is a dependency between the stories that are reported in the broadcast news. Also, broadcast news follows a particular format of reporting different types of news in the different segments of the broadcast. Intuitively, we should be able to detect information and summarize spoken documents better if we can take account of and model such time-dependent information. We explore graphical models, which have been shown to be very robust for such purposes. We explore three types of graphical models, which can be collectively called a type of Dynamic Bayesian Network (DBN).

5.2.1 Hidden Markov Models

While HMMs are used in many language processing tasks, they have not been employed frequently in summarization. A significant exception is the work of Conroy and O'Leary (2001), which employs an HMM model with pivoted QR decomposition for text summarization. However, the structure of their model is constrained by identifying a fixed number of 'lead' sentences to be extracted for a summary. In the work we present below, we introduce a new HMM approach to extractive summarization which addresses some of the deficiencies of the work done to date.

We define our HMM by the following parameters. Ω = {1, ..., N} is the state space, representing a set of states, where N is the total number of states in the model. O = {o_{1k}, o_{2k}, o_{3k}, ..., o_{Mk}} is the set of observation vectors, where each vector is of size k. A = {a_{ij}} is the transition probability matrix, where a_{ij} is the probability of transition from state i to state j. b_j(o_{jk}) is the observation probability density function, estimated by \sum_{k=1}^{M} c_{jk} N(o_{jk}, \mu_{jk}, \Sigma_{jk}), where o_{jk} denotes the feature vector, N(o_{jk}, \mu_{jk}, \Sigma_{jk}) denotes a single Gaussian density function with mean \mu_{jk} and covariance matrix \Sigma_{jk} for state j, M is the number of mixture components, and c_{jk} is the weight of the kth mixture component. Π = {π_i} is the initial state probability distribution. For convenience, we define the parameters of our HMM by a set λ that represents A, B and Π. We can use the parameter set λ to evaluate P(O|λ), i.e. to measure the maximum likelihood performance of the output observables O. In order to evaluate P(O|λ), however, we first need to compute the probabilities in the matrices in the parameter set λ.

The Markov assumption that state durations have a geometric distribution, defined by the probability of self-transitions, makes it difficult to model durations in an HMM. If we introduce an explicit duration probability to replace self-transition probabilities, the Markov assumption no longer holds. Yet HMMs have been extended by defining state duration distributions, called Hidden Semi-Markov Models (HSMMs), that have been used successfully [?]. Similar to [?]'s use of HSMMs, we want to model the position of a sentence in the source document explicitly. But instead of building an HSMM, we model this positional information by building our position-sensitive HMM in the following way:

We first discretize the position feature into L bins, where the number of sentences in each bin is proportional to the length of the document. We build 2 states for each bin, where the second state models the probability of the sentence being included in the document's summary and the other models the exclusion probability. Hence, for L bins we have 2L states. For the lth bin, where 2l and 2l - 1 are the corresponding states, we remove all transitions from these states to other states except those to 2(l + 1) and 2(l + 1) - 1. This converts our ergodic HMM into an almost left-to-right HMM, though state l can go back to state l - 1. This models sentence position in that decisions at the lth state can be arrived at only after decisions at the (l - 1)th state have been made. For example, if we discretize sentence position in a document into 10 bins, such that 10% of the sentences in the document fall into each bin, then states 13 and 14, corresponding to the seventh bin (i.e. all positions between 0.6 and 0.7 of the text), can be reached only from states 11, 12, 13 and 14.

The topology of our HMM is shown in Figure 5.1.

Figure 5.1: L state position-sensitive HMM
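A minimal sketch of the allowed-transition structure implied by the description and the worked example: states are numbered 1..2L, the two states of bin l are 2l-1 and 2l, and a state may transition only to states of its own bin or of the next bin. The within-bin transitions are an assumption read off the example (states 13 and 14 reachable from 11, 12, 13 and 14).

    # Sketch: boolean allowed-transition mask for the 2L-state position-sensitive HMM.
    import numpy as np

    def transition_mask(num_bins):
        """Entry [i, j] is True if state i+1 may transition to state j+1 (1-indexed states)."""
        n = 2 * num_bins
        allowed = np.zeros((n, n), dtype=bool)
        for l in range(1, num_bins + 1):
            sources = [2 * l - 1, 2 * l]                            # the two states of bin l
            targets = sources + ([2 * l + 1, 2 * l + 2] if l < num_bins else [])
            for i in sources:
                for j in targets:
                    allowed[i - 1, j - 1] = True
        return allowed

    # With 10 bins, states 13 and 14 (bin 7) are reachable only from states
    # 11, 12, 13 and 14, matching the example in the text.
    mask = transition_mask(10)
    assert {i + 1 for i in np.nonzero(mask[:, 12])[0]} == {11, 12, 13, 14}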


5.2.2 Maximum Entropy Models

5.2.3 Conditional Random Fields

Markov models, such as Hidden Markov Models (HMMs) and maximum entropy Markov models (MEMMs), have been used successfully for modeling such data for the extraction of speaker roles in BN. However, for many Natural Language Processing tasks, modeling a given joint distribution is difficult when rich local features with complex dependencies are used in classification. Here, we employ a Conditional Random Field (CRF) model.

CRF models have been used successfully in various Natural Language Processing tasks, including named entity detection [?] and Chinese word segmentation [?]. CRFs are undirected graphical models proposed by Lafferty et al. [?] that directly model the conditional distribution p(s|o), where s represents classes and o represents features. Such models have been shown to be effective in taking account of local dependencies while decoding the optimal output classes in a globally optimal framework, since dependencies do not need to be represented explicitly. For the special case of a CRF in which we join the output class nodes in a linear chain, the CRF corresponds to a Finite State Machine (FSM) with a first-order Markov assumption. Such CRFs represent a globally normalized extension to MEMM models without the label-bias problem.

We define our CRF with the following parameterization. Let o = ⟨o_1, o_2, ..., o_T⟩ be the observation sequence of turns in each broadcast show. Let s = ⟨s_1, s_2, ..., s_T⟩ be the sequence of states. The values on these T output nodes are limited to 0 or 1, with 0 signifying 'not a soundbite' and 1 signifying 'a soundbite'. The conditional probability of a state sequence s given the input sequence of turns is defined as

    p_\Lambda(s \mid o) = \frac{1}{Z_o} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(s_{t-1}, s_t, o, t) \right)

where Z_o is a normalization factor over all state sequences and f_k(s_{t-1}, s_t, o, t) is an arbitrary feature function; λ_k is a weight for each feature function. The normalization factor Z_o is obtained by summing over the scores of all possible state sequences:

    Z_o = \sum_{s \in S^T} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(s_{t-1}, s_t, o, t) \right)

This can be computed efficiently in our case using dynamic programming, since our CRF is a linear chain of states.
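A minimal sketch of that dynamic-programming computation for a binary linear-chain CRF: given precomputed per-position scores Σ_k λ_k f_k(s', s, o, t), the forward recursion below returns log Z_o, and the same scores give the log-probability of any particular soundbite labeling. The dense score array is a simplification of the feature-function formulation.

    # Sketch: log Z_o and log p(s|o) for a linear-chain CRF, given
    # scores[t, s_prev, s_cur] = sum_k lambda_k * f_k(s_prev, s_cur, o, t).
    import numpy as np
    from scipy.special import logsumexp

    def log_partition(scores):
        """scores: array of shape (T, S, S); position 0 uses a dummy start state in row 0."""
        T = scores.shape[0]
        alpha = scores[0, 0, :]                          # scores out of the start state
        for t in range(1, T):
            # alpha[j] = logsumexp_i( alpha[i] + scores[t, i, j] )
            alpha = logsumexp(alpha[:, None] + scores[t], axis=0)
        return logsumexp(alpha)

    def log_prob(states, scores):
        """Log-probability of one labeling, e.g. states = [0, 1, 0] (0 = not a soundbite)."""
        total = scores[0, 0, states[0]]
        for t in range(1, len(states)):
            total += scores[t, states[t - 1], states[t]]
        return total - log_partition(scores)

    # Toy example with T = 3 turns and 2 states:
    scores = np.random.randn(3, 2, 2)
    print(np.exp(log_prob([0, 1, 0], scores)))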


Figure 5.2: CRF Structure for Soundbite Detection

Chapter 6

Summarizing Speech without Text: What we Say vs. How we Say?

Most of the current speech summarization systems [?, ?, ?] use Automatic Speech Recognition (ASR) transcripts for summarizing speech. Such methods assume that manual, closed-captioned or ASR transcripts are available for the spoken document. We argue that such an assumption is very limiting for porting the system to other languages. There are ?? languages in the world and only ?? have major ASR systems. Transcribing speech manually is expensive in time and cost. In order to be able to effectively build a general-purpose speech summarizer, we want to experiment with and explore the possibility of summarizing speech without any such text. In order to summarize a spoken document in the absence of transcripts, we have two parameters that may help to identify important information: the structure of the document and the prosody of the speakers in the document.

6.1 Exploiting the structure

We explore using structural knowledge for broadcast news. Our motivation in using structure in the broadcast news domain to identify segments important to include in a summary follows [?]'s intuition that, in domains like Broadcast News, the material to be summarized exhibits fairly regular patterns from one speech document to another. For example, news broadcasts generally open with a news anchor's introduction of the major news stories to be presented in the broadcast, followed by the actual presentation of those stories by the anchor, reporters, and possibly interviewees. Broadcasts are usually concluded in a fairly conventionalized manner, depending upon the conventions of the particular news program. [?] took advantage of the fact that there is a reliable correspondence between these structural aspects of news broadcasts and the type of the speaker in different segments, i.e., anchor, reporter, or interviewee. Their goal was to provide an overall outline of the broadcast by identifying such speaker types, so that the program as a whole could be browsed effectively. They found that lexical as well as structural characteristics of news transcripts (both hand transcriptions and speech recognition output) provided useful predictors for classifying speaker type.

We call the features which rely upon aspects of this patterning and upon the overall structure of the broadcast structural features [?], comparable to [?]'s style features. We have previously shown that structural features are useful predictors of extractive summaries of Broadcast News [?].

It has been shown by [?, ?] that speaker/turn segmentation can be done with [?] accuracy without using transcripts, though models that are trained on both text and speech usually perform better in speaker segmentation, as shown in [?, ?]. We have also shown that speaker roles can be identified with x% accuracy [?]. Hence, it is reasonable to assume that all of the features listed above can be automatically extracted from a broadcast news spoken document.

6.2 What’s in the Prosody?

We use a subset of the prosodic features we have listed in Chapter ??. In particular, we use speaking rate (the ratio of voiced to total frames); F0 minimum, maximum, and mean; F0 range and slope; minimum, maximum, and mean RMS energy (minDB, maxDB, meanDB); RMS slope (slopeDB); and sentence duration (timeLen = endtime - starttime).

6.3 Summarizers

6.3.1 Structure Only

We built statistical models using just the structural features mentioned in Section ??. We built four very different kinds of models: Decision Trees, Bayes Nets, the rule-based Ripper system and Support Vector Machines. All four of these methods differ considerably in the way statistical significance is used for training the model.

We performed 10-fold cross-validation over the dataset described in Section ??. We obtained the performance reported in Table 6.1. We used both the standard information retrieval metrics Precision, Recall and F-Measure, and the summarization metric ROUGE that has been used in the DUC evaluations. F-Measure is a strict evaluation metric for our purpose, as it unfairly penalizes a similar sentence with the same meaning, since the matches are on the exact sentences included in the summary.

The results in Table 6.1 show that the structural features alone do not perform well enough to do better than a baseline. The baseline is extracting the top N% of the sentences. It has been shown [?] that such a baseline is hard to beat, as many documents list the important information in the beginning. For our purpose this is an even harder baseline to beat, as the individual stories are shorter than a normal text document, with 15 sentences on average.

6.3.2 Acoustics/Prosody Only

We built the similar models as described in Section ??. Acoustic/Prosody basedsummarizer had clear advantages over structure only summarizer. The recall washigher 23% while the F-measure was higher by 13.9% eventhough precision was lower

6.4. WHAT WE SAY VS. HOW WE SAY? 23

Structure Only (S) Acoustis Only (A) S + APrecision 0.550 0.443 0.481Recall 0.235 0.495 0.505

F-Measure 0.329 0.468 0.495

Table 6.1: Best Features for Predicting Summary Sentences

by 10.7%. Such results signify that the way people say the words is better indicatorof the importance of the content of the speech than when and who said the givensentences.

The feature selection algorithm select the duration of the sentences as the bestpredictor from acoustic/prosodic features while the normalized position was the beststructural features. If we think about how humans summarize documents both ofthese features are important cues for humans as well. Since, structural featuresintroduce lower noise in the prediction with higher precision but acoustic/prosodicfeatures produce better recall we combined them to produce a summarizer that isbased both on prosodic and structural features. The combined model was the bestwith the highest F-Measure of 0.495 with 2.7% higher F-measure than acoustic-onlysummarizer.

The results mentioned above assume an exact match of a predicted summary sentence to a labeled summary sentence. For summarization purposes, this measure is generally considered too strict, since a sentence classified incorrectly as a summary sentence may be very close in semantic content to another sentence which was included in the gold standard summary. Another metric standardly used in summary evaluation, which takes this near-synonymy into account, is ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [?]. ROUGE measures the overlap in units between automatic and manual summaries. The units measured can be n-grams, word sequences, or word pairs. For ROUGE-N, ROUGE-L, ROUGE-S, and ROUGE-SU, N indicates the size of the n-grams computed, L the longest common subsequence, and S and SU stand for skip-bigram co-occurrence statistics with and without unigram counting. ROUGE-N is computed using the following equation:

ROUGE-N = \frac{\sum_{S \in Ref.Sum} \sum_{gram_n \in S} Count_{match}(gram_n)}{\sum_{S \in Ref.Sum} \sum_{gram_n \in S} Count(gram_n)}
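The sketch below computes ROUGE-N directly from this definition. The tokenized reference and system summaries are assumed inputs, and it omits the stemming and stop-word options of the official ROUGE toolkit, so it is a minimal illustration rather than the evaluation script we use.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(reference_sents, system_sents, n=1):
    """ROUGE-N: clipped n-gram recall of the system summary against the reference.

    reference_sents, system_sents: lists of token lists (one list per sentence).
    """
    sys_counts = Counter(g for sent in system_sents for g in ngrams(sent, n))
    match, total = 0, 0
    for sent in reference_sents:
        ref_counts = Counter(ngrams(sent, n))
        total += sum(ref_counts.values())                     # denominator
        # clipped matches, as in the numerator of the equation above
        match += sum(min(c, sys_counts[g]) for g, c in ref_counts.items())
    return match / total if total else 0.0

# e.g. rouge_n([["the", "mayor", "resigned"]], [["mayor", "resigned", "today"]], n=1)
```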

Figure 6.1 presents results of evaluating our feature-sets using the ROUGE metric, with N = 1-4 and all of the variants described above.

6.4 What we Say vs. How we Say?

The task of a summarizer is to extract "important information." The concept of important information is carried by the words we speak or write; hence, we need to define "importance" in lexical terms. In order to determine whether an acoustic/prosody-only summarizer can extract segments that are significant in their meaning, we perform an experiment correlating the predictions based on lexical features with the predictions of the prosody-only summarizer. The F-Measure and ROUGE results for the resulting models, built from combinations of acoustic/prosodic, lexical, discourse, and structural features, are shown in Figure 6.1.

Figure 6.1: Evaluation using ROUGE metrics

The results are preliminary but they signify that "the importance of what is said correlates with how it is said." Intuitively, one might imagine that speakers change their amplitude and pitch when they believe their utterances are particularly important, to convey that importance to the hearer. If this is true, we would expect the sentences that our lexical features include in a summary to be the same as those predicted for inclusion by our acoustic/prosodic features. We computed the correlation coefficient between the predictions of these two feature-sets. The correlation of 0.74 supports our hypothesis.
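A minimal sketch of this check, assuming the two summarizers' per-sentence prediction scores are available as aligned arrays (the file names below are placeholders):

```python
import numpy as np

# per-sentence summary scores (or 0/1 inclusion decisions) from the two models;
# the two arrays are assumed to be aligned sentence-by-sentence
lexical_pred = np.load("lexical_predictions.npy")    # placeholder path
prosodic_pred = np.load("prosodic_predictions.npy")  # placeholder path

r = np.corrcoef(lexical_pred, prosodic_pred)[0, 1]   # Pearson correlation
print("correlation between lexical and prosodic predictions: %.2f" % r)
```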

Figure 6.2: F-measure with 10-fold cross-validation

Our findings also suggest it may be possible to do effective speech summarization without the use of transcription at all, whether manual (as employed here) or from speech recognition. Two of our feature-sets, acoustic/prosodic and structural, are independent of lexical transcription, except for sentence-level and speaker segmentation and classification, which have been shown to be automatically extractable using only acoustic/prosodic information [?, ?]. The accuracy of our combined acoustic/prosodic and structural features (F = 0.50) compares favorably to that of our combined feature-sets (F = 0.54). So, even if transcription is unavailable, it seems possible to summarize broadcast news effectively; however, combining all of the information available in the broadcast news remains the best approach.

Chapter 7

Variable Length Extractive Summarization

There has been a significant amount of work on generative summarization for text documents. Unlike text documents, speech documents pose unique challenges for generative summarization techniques. The foremost problem is the loss of the originality of the document: for a speech document, generating a summary in speech would require the use of a speech synthesizer, thus losing the original voices of the speakers in the spoken document. Hence, we propose a variable length summarization technique that generates a speech summary by extracting variable-length segments from the spoken document and concatenating them to produce a coherent summary.

7.1 Variable Length Extraction

We propose an approach to generate a variable length extractive summary for speech documents. The segments that the model extracts to generate a summary can be words, phrases, or sentences. The system has to decide both which segments to extract and the size of the segments to extract.

Variable length extractive summarization is a challenging problem because we need one coherent framework that decides the size and the content of the segments as a single optimization problem. Even though we can separate these problems and frame them as a two-step process, the best solution would probably be a one-step solution. But even a two-step solution has many challenges. First, we would need to decide how to segment sentences into phrases. We could use a phrase segmentation algorithm of the kind used in machine translation, but for summarization a segmentation based on content, such as sentence content units, would probably make more sense. Since we are handling speech segments, intonational phrases are also a possible choice.



7.2 Framework for Variable Length Extractive Summarization

We propose a three-step filtering process for variable length extractive summarization. We first segment the ASR transcripts into words, phrases, and sentences. We would use SRI's word and sentence segmentation, and we will use the intonational phrase segmenter made available by ??. The segmentation produces a three-tier information representation of our spoken documents such that the set of words (W) is a subset of the phrases (P), and P is a subset of the sentences (S).

The first step in performing variable length summarization would be scoring all the words in the transcripts for inclusion in the summary. The higher the score, the more relevant the word is for inclusion in the summary. In the second step we score all the phrases. The phrase score will be a combination of two scores: the sum of the word-level scores and the summary score for the phrase itself. The two scores can be combined with some linear combination of weights.

The third step would be scoring sentences. The sentence scores would take account of the word-level scores, the phrase-level scores, and the summary scores for the sentences. We would generate word, phrase, and sentence scores by training a statistical model. The training data consists of summaries of stories that have been generated by annotators by extracting words, phrases, or sentences from the stories.
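A sketch of how the three levels of scores might be combined is given below. The per-segment model scores, the containment structure (which word and phrase indices belong to a sentence), and the interpolation weights are all illustrative assumptions rather than the trained model itself.

```python
def phrase_score(word_ids, word_scores, model_phrase_score, alpha=0.5):
    """Step 2: interpolate a phrase's own summary score with the sum of its word scores."""
    return alpha * model_phrase_score + (1 - alpha) * sum(word_scores[i] for i in word_ids)

def sentence_score(word_ids, phrase_ids, word_scores, phrase_scores,
                   model_sentence_score, weights=(0.25, 0.25, 0.5)):
    """Step 3: combine word-level, phrase-level, and sentence-level summary scores."""
    w_word, w_phrase, w_sent = weights            # assumed interpolation weights
    return (w_word * sum(word_scores[i] for i in word_ids)
            + w_phrase * sum(phrase_scores[j] for j in phrase_ids)
            + w_sent * model_sentence_score)
```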

One of the most challenging parts of the proposed method is how to combine these three different types of scores to produce a coherent summary. Although we can normalize the scores so that a relative score indicates which are the most important parts of the story, it is still not an easy problem to combine words, phrases, and sentences into a lexically coherent sentence. Although text generation or paraphrasing techniques could be a possible solution, we still do not want to lose the originality of the speech by adding new words or phrases that have not been spoken by any of the speakers in the spoken document.

Chapter 8

User-Focused Multi-document Speech Summarization

So far we have mainly focused on single-document speech summarization. Users, however, may want to read or listen to summaries while looking for certain information. Such an information retrieval task requires searching and summarizing many speech documents. We propose a solution, based on the Global Autonomous Language Exploitation (GALE) task, that searches a corpus of speech documents for a given user query using a search algorithm developed at the University of Massachusetts, and summarizes the documents containing the relevant information.

8.1 Multi-document Challenges for Speech

Multi-document summarization has been addressed by some of the summarization systems developed for the Document Understanding Conference (DUC) evaluations. In newswire text, in many instances there is verbatim repetition of text in news provided by different news agencies. In spoken documents, since the news is prepared by anchors and spoken, the sentences anchors speak on different news channels are never the same, which adds more noise to sentence similarity metrics.

Information fusion for multi-document speech summarization has further challenges beyond the content and the style of the news. The channels and the environments in which the news is recorded depend on the type of news being reported, which affects the overall acoustic features extracted from these spoken documents. The variation in acoustic quality has to be normalized by speaker and by the differences in the recording environment. Such normalization requires automatic identification of speakers and of the recording environment. We do not address speaker recognition or diarization in this thesis.



8.2 Approach for Multi-document Speech Summarization

We propose a system that combines multiple spoken documents to produce a user-focused summary. The user interacts with the system by entering a query of the format shown in ??. The query is sent to a search engine that has indexed the speech documents. The search engine, developed at the University of Massachusetts, returns a set of documents, their scores, and a markup of the relevant passages. We need to combine these documents to generate a coherent short answer for the user. Since we do not want to focus on the information retrieval part of the system, we assume that our task is mainly to combine the returned spoken documents by extracting the important and relevant information pertaining to the user's task.

We propose a general multi-document speech summarization method for this purpose that is not highly dependent on the particular question type. We first extract lexical features for all the passages our information retrieval engine has returned. Using sentence similarity metrics on the words, we cluster all the sentences into a main topic and subtopics. We then extract sentences from each subtopic relevant to our main topic until we reach the length limit of the summary. To extract sentences from a subtopic we use a technique similar to maximal marginal relevance, such that the sentence added in each iteration maximizes the overall summary score by providing new information while avoiding redundant information. Also, since we measure the similarity of each sentence to the centroid of the topic, we can be sure that the main focus of the topic is not lost among irrelevant information. We can assume that the centroid of the topic will be relevant to the user query, because our information retrieval engine returns documents maximizing the retrieval scores for the given query. Hence, the most redundant but relevant information will be very close to the centroid of the topic.
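The selection step can be sketched as a greedy, maximal-marginal-relevance style loop. The sentence vectors, cosine similarity, lambda weight, and length budget below are illustrative assumptions; they stand in for whatever similarity metric and summary scores the full system would use.

```python
import numpy as np

def mmr_select(sent_vecs, centroid, lengths, max_len, lam=0.7):
    """Greedy MMR-style selection: reward similarity to the topic centroid,
    penalize similarity to sentences already in the summary."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    selected, total = [], 0
    candidates = list(range(len(sent_vecs)))
    while candidates and total < max_len:
        def mmr(i):
            redundancy = max((cos(sent_vecs[i], sent_vecs[j]) for j in selected),
                             default=0.0)
            return lam * cos(sent_vecs[i], centroid) - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        if total + lengths[best] > max_len:          # respect the length budget
            break
        selected.append(best)
        total += lengths[best]
        candidates.remove(best)
    return selected
```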

8.3 Combining Text and Speech

In order to effectively answer user queries on a certain event or topic we need to search and summarize not only spoken documents but also newswire documents. Newswire documents tend to contain more detailed information than broadcast news documents, because each broadcast news document consists of many stories. We want to combine the two types of documents in one summarization framework. We need to address the following issues for such a task: How do we combine noisy speech sentences with text sentences? How do we use acoustic features in a framework where half of the sentences have no acoustic representation? What segments do we choose, given that sentence segmentation of the speech document may be noisy? How do we cluster text and speech sentences together, since speech sentences need to take account of acoustic similarity as well?

Chapter 9

Conclusion and Timeline

In conclusion, we propose algorithms to build a system, the Columbia Information Extraction and Summarization System (CIEZ), that can extract information from and summarize multiple speech documents relevant to a user. Our best system uses modified graphical model algorithms to perform these tasks. The summarization system is a server-based approach that can be used as a standalone program or accessed on the web. We show the possibility of speech summarization without using any speech transcripts, on the basis of only the acoustic information of the news. We propose to conduct experiments on variable length and user-focused summarization by extending the current techniques of sentence-based single-document summarization systems. The information extraction module of the system is also used for extracting meta-information from the document, such as soundbites, soundbite-speakers, headlines, and disfluencies, which may help in presenting, extracting, and summarizing the broadcast news.

9.1 Timeline

The proposed timeline for completing the thesis is as follows:

Milestone                                         Weeks   Deadline
VLSC corpus                                       4       July 1, 2006
IEC corpus                                        6       July 15, 2006
Soundbite Detection                               N/A     Completed
Soundbite-Speaker Detection                       3       June 15, 2006
Speaker-Role Detection                            4       July 15, 2006
MultiDocument Speech Summarization Experiment     8       Sept 15, 2006
Variable Length Speech Summarization              13      Dec 15, 2006
Break                                             4       Jan 15, 2007
Summarization without Speech                      4       Feb 15, 2007
CIEZ Interface                                    4       March 15, 2007
Thesis Writing                                    8       May 15, 2007
Defense                                           N/A     May 31, 2007


Bibliography
