

2492 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 20, NO. 9, NOVEMBER 2012

Automatic Parliamentary Meeting Minute Generation Using Rhetorical Structure Modeling

Justin Jian Zhang and Pascale Fung, Senior Member, IEEE

Abstract—In this paper, we propose a one-step rhetorical structure parsing, chunking and extractive summarization approach to automatically generate meeting minutes from parliamentary speech using acoustic and lexical features. We investigate how to use lexical features extracted from imperfect ASR transcriptions, together with acoustic features extracted from the speech itself, to form extractive summaries with the structure of meeting minutes. Each business item in the minute is modeled as a rhetorical chunk which consists of smaller rhetorical units. Principal Component Analysis (PCA) graphs of both acoustic and lexical features in meeting speech show clear self-clustering of speech utterances according to the underlying rhetorical state—for example, acoustic and lexical feature vectors from the question and answer or motion of a parliamentary speech are grouped together. We then propose a Conditional Random Fields (CRF)-based approach to perform both rhetorical structure modeling and extractive summarization in one step, by chunking, parsing and extraction of salient utterances. Extracted salient utterances are grouped under the labels of each rhetorical state, emulating meeting minutes to yield summaries that are more easily understandable by humans. We compare this approach to different machine learning methods. We show that our proposed CRF-based one-step minute generation system obtains the best summarization performance both in terms of ROUGE-L F-measure at 74.5% and by human evaluation, at 77.5% on average.

Index Terms—Extractive speech summarization, rhetorical structure modeling, meeting minutes generation.

I. INTRODUCTION

AUDIO documents from meetings, political debates, and parliamentary speech are more useful if they can be automatically transcribed and summarized in a structured manner, much like meeting Hansards and minutes [1]–[8]. These Hansards and minutes allow for more targeted search, better query-driven retrieval, and improved aggregated search.

Manuscript received August 24, 2011; revised April 10, 2012, June 14, 2012; accepted July 26, 2012. Date of publication August 27, 2012; date of current version October 09, 2012. This work was supported in part by the Hong Kong Innovation and Technology Fund (No. ITS/189/09), the Joint Project of the National Science Foundation of China (NSFC)-Guangdong Province 2009 (No. U0935003), NSFC 2011 (No. 61170216), and the International Center for Advanced Communication Technologies (InterACT at HKUST) (VPRDO 09/110.EG01). The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Renato De Mori.
J. J. Zhang is with the Engineering Technology Institute, Dongguan University of Technology, Dongguan, China (e-mail: [email protected]).
P. Fung is with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASL.2012.2215592

However, manual transcription and minute generation by stenographers and professional secretaries are time-consuming and costly. To address this challenge, we propose a method for producing complete parliamentary meeting minutes automatically from transcriptions produced by an automatic speech recognition (ASR) system.

Structure is essential to the understanding of any document.

Hansards are manually edited transcriptions containing discernible structural cues in the form of titles, subtitles, paragraphs, punctuation, fonts, speaker names, and speaker turns that help with the interpretation of the underlying semantic information, which in turn is easily accessible to human readers and search engines alike. These cues also exist in the corresponding minutes [1], [4]. Minutes are usually generated manually and consist of a sequential list of business items and a summary for each of these items. Since meetings tend to occur regularly and are repeated over long periods of time, this structure is often formalized. Recognizing that Hansards are structured transcriptions and minutes are structured summaries, we propose to transcribe and parse parliamentary speech automatically and further summarize the speech into minutes using machine learning approaches.

Our previous work [9], [10] and that of other researchers [11]–[13] have shown that a certain "rhetorical structure" exists in spoken documents, and that efficient modeling of this structure helps the summarization of presentation speech. Other researchers have also used the hierarchical topic structure of a document for extracting lists of descriptive keywords or improving summarization performance [7].

For meeting understanding tasks, state-of-the-art systems include multiple steps of segmentation, topic identification, dialog act modeling, and then extractive summarization [7], [14], [15], [6]. In this paper, we suggest combining all these steps into one to simultaneously chunk and parse meeting speech, and extract salient sentences to form minutes. We consider the meeting minute generation process as a sequence labeling process where the structure in the speech is modeled by a rhetorical syntax tree. The one-step system is carried out by a single supervised Conditional Random Field (CRF) classifier [16] or Hidden Markov Support Vector Machine (HMSVM) classifier [17]. In HMSVM, each rhetorical label of the current sentence can depend directly on feature vectors of past and future sentences. CRFs provide a probabilistic framework for calculating the probability of label sequences globally conditioned on sentence feature vector sequences. For comparison, we also implement two-step systems: the first step chunks and parses meeting speech; the second step extracts salient sentences to form minutes. The first step is carried out by one CRF classifier or RSHMMs, and the second step is

1558-7916/$31.00 © 2012 IEEE


carried out by a series of CRF or SVM classifiers, one classifier for summarizing each type of business item. There are two main contributions in this paper that differ from our published work at ICASSP 2011 [18]. First, we compare Rhetorical State Hidden Markov Model (RSHMM) [9] and Hidden Markov Support Vector Machine (HMSVM) based systems [17] with a CRF-based system for extracting meeting minutes. Second, we evaluate our Hansard document structure parser with more data and explain the evaluation results in more detail.

This paper is organized as follows: In Section II, we describe the acoustic/phonetic and lexical features of parliamentary speech. Section III explains how to extract the rhetorical structure of parliamentary speech from the Hong Kong Legislative Council. We then detail the rhetorical structure modeling and minute generation from parliamentary speech in Section IV. Section V describes how to apply our proposed methods for meeting minute generation. The experimental setup and results are detailed in Section VI. Section VII presents related work. We then conclude and discuss future work at the end of the paper.

II. FEATURE VECTORS OF PARLIAMENTARY SPEECH

We first use Jin and Schultz's method [19] for speaker segmentation and clustering in a pre-processing step. We then detect all speech utterance boundaries based on the length of silence and on speaker turns; each speech segment is less than 15 seconds long. We then represent each input speech utterance by a feature vector of acoustic/phonetic and lexical features.

A. Acoustic/Phonetic Features

Maskey and Hirschberg [20] suggested that acoustic features are useful for extracting salient sentences when summarizing broadcast news. Furui et al. [21] also used acoustic features such as F0 and energy features for speech summarization of spontaneous speech. Acoustic/phonetic features in speech summarization systems are usually extracted from audio data. Researchers commonly use acoustic/phonetic variation—changes in pitch, intensity, and speaking rate—and the duration of pauses for tagging the important content of speeches [22]. We have previously [23] investigated these features for their efficiency in predicting summary sentences of lecture presentation speech and broadcast news speech. We [9], [24] further found that these acoustic/phonetic features can be used for extracting rhetorical structure from lecture presentation speech.

Schuller et al. introduced the openSMILE Toolkit for extracting acoustic features [25], [26]. We extract 384 acoustic/phonetic features (AF) with this tool [25], [26] to represent each utterance of parliamentary speech. There are 16 low-level descriptors (LLD) and their delta coefficients: zero-crossing rate (ZCR) from the time signal, root mean square (RMS) frame energy, pitch frequency (normalized to 500 Hz), harmonics-to-noise ratio (HNR) by autocorrelation function, and mel-frequency cepstral coefficients (MFCC) 1–12. The 12 functionals listed in Table I are applied to each LLD and its delta.
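The 384-dimensional figure follows directly from the configuration just described: 16 LLDs, each paired with a first-order delta coefficient, and 12 functionals applied to every resulting contour. A quick arithmetic check:

```python
# Dimensionality check for the openSMILE feature set described above.
llds = 16          # ZCR, RMS energy, pitch, HNR, MFCC 1-12
with_deltas = 2    # each LLD plus its first-order delta
functionals = 12   # the 12 functionals of Table I

dim = llds * with_deltas * functionals
assert dim == 384  # matches the 384 acoustic/phonetic features per utterance
print(dim)
```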

TABLE I
ACOUSTIC/PHONETIC FEATURES: LOW-LEVEL DESCRIPTORS AND FUNCTIONALS

B. Lexical Features

Similar to text summarization, lexical information can help us predict summary sentences [21], [6], [23], [27], [28]. These lexical features can also help us extract rhetorical structure from presentation speech [9], [24]. We extract four sets of lexical features from parliamentary speech transcriptions:
• Sentence Length and Term Frequency-Inverse Document Frequency (TFIDF) Features: the number of words in the previous/current/next segment, and the TFIDF of each word [9].

• N-gram Features: we first produce the top-N unigram/bigram/trigram lists from the training corpus, and then count the number of unigrams/bigrams/trigrams appearing in the current segment. In our experiment, we set .

• Part-of-Speech (POS) Features: we first annotate the POS tag labels of the current segment using the Stanford POS tagger1, obtaining one POS tag label sequence per segment. Next, we count the number of each type of POS tag label in the current segment; each type of POS tag label is one dimension of the POS feature. Finally, we get one 33-dimensional POS feature for each segment.

• Syntax Features: we first obtain the syntactic labels of the current segment with the Stanford parser2, obtaining one syntax parse label sequence per segment. Next, we count the number of each type of syntax parse label in the current segment; each type of syntax parse label is one dimension of the Syntax Parse feature. Finally, we get one 24-dimensional Syntax Parse feature for each segment.
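The n-gram count features described above can be sketched as follows; the helper names and the toy top-N list are hypothetical, and in the paper the top-N lists come from the training corpus:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_count_feature(segment_tokens, top_ngrams, n):
    """Count occurrences in the segment of n-grams that appear in the
    top-N list built from the training corpus (one feature per order n)."""
    seg = Counter(ngrams(segment_tokens, n))
    return sum(cnt for g, cnt in seg.items() if g in top_ngrams)

# Toy example: a two-entry top-N unigram list.
top_unigrams = {("council",), ("motion",)}
tokens = "the council will now vote on the motion".split()
print(ngram_count_feature(tokens, top_unigrams, 1))  # 2
```

The POS and syntax features follow the same pattern, counting tagger or parser labels per segment instead of corpus n-grams.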

III. THE RHETORICAL STRUCTURE OF PARLIAMENTARY SPEECH

Parliamentary speech is usually transcribed manually into the Hansard format. Minutes are also generated manually, consisting of a sequential list of business items and a summary for each of these items. Using these Hansards and minutes, we can easily achieve more targeted search, better query-driven retrieval, and improved aggregated search. In this section, we investigate how to extract the rhetorical structure of meeting speech.

Similar to formal parliamentary speech in other countries, Hong Kong Legislative Council meeting speeches are planned. The Council meetings follow the Rules of Procedure of the Legislative Council of the Hong Kong Special Administrative Region3. There are six types of meeting sessions in the Council

1http://nlp.stanford.edu/software/tagger.shtml
2http://nlp.stanford.edu/software/lex-parser.shtml
3http://www.legco.gov.hk/general/english/procedur/content/rop.htm


Fig. 1. An example of meeting minute/rhetorical syntax tree.

meetings, of which the most common type is the Ordinary Session. Meetings are manually transcribed by stenographers into structured Hansard documents and then edited into minutes. A meeting minute, which is a faithful representation of the document-level rhetorical structure of the meeting speech as shown in Fig. 1, contains several typical business items—we call them "rhetorical chunks." Each business item/rhetorical chunk contains many coherent rhetorical units.

We first represent each speech utterance by acoustic and/or lexical features. We then align each speech utterance to the minutes and label it with the structural marks projected from the minutes: "Addresses and Statements sentence," "Bills sentence," "Motions sentence," "Members' Motions sentence," and "Questions sentence." Figs. 7–9 show example sentences from five different rhetorical chunks.

To investigate and illustrate this rhetorical structure as represented by the acoustic and lexical features of speech, we use a principal component analysis (PCA) projection of all acoustic/phonetic and lexical features from the Council meeting speech. The PCA transformation is given by (1):

Y = XV    (1)

where X is the original feature vector matrix; Y is the PCA result matrix; and X = UΣV^T is the singular value decomposition (SVD) of X, so that the columns of V are the principal directions.

items can be seen to cluster into distinct rhetorical chunks.When the councilors talk about different kinds of businessitems, it seems that the actual sounds such as prosodic be-haviors are different. The plots in Figs. 2, 3 and 4 show thePCA project from the parliamentary meeting speech from 43speakers in one day. These speakers used the same type ofmicrophone. The content of each business item varies fromspeaker to speaker yet. We show that feature vectors frombusiness items of the same rhetorical chunk are closely related,despite content difference. From Fig. 2, we observe that fiveseparated clusters with some overlap composed by purely

acoustic/phonetic feature vectors of the “Addresses and State-ments” chunk, “Bills” chunk, “Motions” chunk, “Members’Motions” chunk, and “Questions” chunk. This shows thatacoustic/phonetic features may help extract rhetorical structureof the parliamentary speech to some extent. Similar findingsare shown in Fig. 3 with lexical features.When we combine acoustic/phonetic features with lexical

features to represent each speech utterance, we get an evenmore clustering result as shown in Fig. 4. This shows that amore accurate underlying rhetorical structure of the parlia-mentary speeches can be obtained by using acoustic/phoneticfeatures combined with lexical features.
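The projection used for these cluster plots can be sketched with an SVD-based PCA; this is a generic NumPy implementation, not the authors' code:

```python
import numpy as np

def pca_project(X, k=2):
    """Project utterance feature vectors onto the first k principal
    components via SVD of the centered data matrix."""
    Xc = X - X.mean(axis=0)                 # center each feature dimension
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                    # PCA scores: Y = X V_k

# Toy data: 5 utterances, 4-dimensional feature vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
Y = pca_project(X, k=2)
print(Y.shape)  # (5, 2)
```

For the figures in the paper, X would hold the 384 acoustic/phonetic dimensions, the lexical dimensions, or their concatenation, with k=2 for plotting.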

IV. RHETORICAL STRUCTURE MODELING AND SUMMARIZATION

We consider the process of chunking, parsing, and summarization of speech as a sequence labeling problem. In this section, we propose CRF-based and other machine learning methods to model the rhetorical structure of parliamentary meeting speech and to generate meeting minutes from it.

A. Conditional Random Field

Given the feature vector sequence X of the transcribed utterances, we aim to find the optimal label sequence Y that maximizes P(Y|X), where each label y_i indicates the chunk label and rhetorical unit label, as well as whether the sentence is a summary sentence to be included in the final meeting minute. Considering that Conditional Random Fields (CRF) [16] provide a probabilistic framework for calculating the probability of Y globally conditioned on X, we build a CRF classifier (classifier A) for chunking and parsing the input utterances into a structural form, and multiple CRF classifiers (classifiers B) to extract summary sentences from each chunk.

Conditional Random Fields (CRF) [16] are a type of discriminative undirected probabilistic graphical model, whose topology is shown in the bottom part of Fig. 5. They are most often used for labeling or parsing sequential data, such as


Fig. 2. Parliamentary speech: Using Acoustic Features Only.

natural language text or biological sequences [16]. Specifically, CRFs find applications in shallow parsing [29], named entity recognition [30] and gene finding, among other tasks, as an alternative to the related Hidden Markov Models (HMM). CRFs have many advantages for our sequential labeling task compared with HMMs, as shown in Table II. We implement our CRF classifiers using the CRF Library [31].

We propose two ways of using CRFs. In Model I, we propose to chunk, parse, and summarize the speech simultaneously. We build a joint label set in which each label encodes the chunk label, the rhetorical unit label, and the summary label. In Model II, we propose a two-step approach: chunking and parsing of speech in the first step (classifier A), and summarization of speech in the second step (classifiers B). We build one CRF classifier with the chunk/unit label set for the first step and several CRF classifiers with the summary label set for the second step. Each CRF classifier in the second step is used to summarize one type of chunk. We obtain the label set of the classifiers according to the rhetorical structure of the spoken document. For example, for parliamentary speech, the label set includes "Addresses and Statements," "Bills," etc.

In the training process of the CRF classifier, we need to estimate the parameters in (2):

p(Y|X) = (1/Z(X)) exp( Σ_i Σ_j λ_j t_j(y_{i−1}, y_i, X, i) + Σ_i Σ_k μ_k s_k(y_i, X, i) )    (2)

where Z(X) is the normalization constant that makes the probability of all label sequences sum to one; t_j is an arbitrary feature function over the entire feature vector sequence and the labels at positions i−1 and i; and s_k is a feature function of the label at position i and the feature vector sequence. λ_j and μ_k are weights learned for the feature functions, reflecting the confidence of the feature functions. The feature functions describe any aspect of a transition from y_{i−1} to y_i, as well as y_i itself and the global characteristics of X.

Let Λ = {λ_j, μ_k} be the set of weights in a CRF model. Λ is usually estimated by a maximum likelihood procedure, that is, by maximizing the conditional log-likelihood of the labeled sequences in the training data D = {(X^(m), Y^(m))}, which is defined as (3):

L(Λ) = Σ_m log p(Y^(m) | X^(m))    (3)

To avoid over-fitting, we add a Gaussian prior over the parameters:

L(Λ) = Σ_m log p(Y^(m) | X^(m)) − Σ_j λ_j²/(2σ_λ²) − Σ_k μ_k²/(2σ_μ²)    (4)

where σ_λ² and σ_μ² are the variances of the Gaussian priors.

TABLE II
COMPARISON BETWEEN CRF AND HMM


Fig. 3. Parliamentary speech: Using Linguistic Features Only.

Considering that a quasi-Newton method such as L-BFGS has been found to converge significantly faster [29], in this paper we use L-BFGS to optimize Λ.

Given a CRF model with parameters Λ, the most probable labeling sequence can be produced as Y* = argmax_Y p(Y|X). The marginal probability of labels at each position in the sequence can be computed by a dynamic programming inference procedure similar to the forward-backward procedure for HMMs [16].
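The dynamic-programming decoding of the most probable label sequence can be sketched as a generic max-product (Viterbi) recursion over per-position label scores and transition scores; the score matrices here are toy values, not the paper's learned potentials:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Most probable label sequence for a linear-chain model.
    emissions: (T, L) per-position label scores; transitions: (L, L)."""
    T, L = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions + emissions[t]  # (L, L)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):       # trace back pointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# 3 utterances, 2 labels; strong self-transitions keep labels coherent.
em = np.array([[2.0, 0.0], [1.5, 0.1], [0.0, 2.0]])
tr = np.array([[0.5, -0.5], [-0.5, 0.5]])
print(viterbi(em, tr))  # [0, 0, 1]
```

Replacing max/argmax with log-sum-exp in the same recursion yields the forward pass used for the marginal probabilities.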

B. Baseline I: Two-Step Rhetorical State HMM

1) Extracting Rhetorical Structure by RSHMMs: As illustrated in Fig. 1, the rhetorical structures of parliamentary speech are in fact hierarchical. In our previous work [32], we proposed building Rhetorical State Hidden Markov Models (RSHMMs) with state transitions that represent several kinds of rhetorical relations to better model this rhetorical structure.

We use RSHMMs to model the underlying rhetorical structure of the transcribed document D, which consists of a sequence of N recognized sentences from the ASR output: D = {s_1, s_2, ..., s_N}. Fig. 6 shows the concatenation of RSHMMs to represent a spoken document.

Each RSHMM state j contains a probability distribution b_j(x_i) for the input feature vector x_i obtained from the acoustic and lexical features of the sentence s_i. In Fig. 6, b_j depends on its rhetorical chunk model; different kinds of rhetorical chunk models contain different distribution functions.

We use a mixture of multivariate Gaussian distributions as the probability distribution, as in formula (5):

b_j(x_i) = Σ_{m=1}^{M} c_{jm} N(x_i; μ_{jm}, Σ_{jm})    (5)

where M is the number of mixture components in the state, c_{jm} is the weight of the m-th component, and N(x; μ_{jm}, Σ_{jm}) is a multivariate Gaussian with mean vector μ_{jm} and covariance matrix Σ_{jm} for the acoustic and lexical features, as in formula (6):

N(x; μ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} exp( −(1/2)(x − μ)^T Σ^{−1} (x − μ) )    (6)

Assuming that presentation speeches or parliamentary speeches consistently follow a rhetorical structure containing K sections/chunks, K HMMs (i.e., H_1, ..., H_K) are built to represent the respective sections/chunks. Each HMM is represented by three states, roughly corresponding to the beginning, the middle, and the ending part of a rhetorical "paragraph." Each of the states contains several Gaussian components. We trained each of the HMMs by performing Viterbi initialization, followed by Baum-Welch re-estimation using the forward-backward algorithm.

We then place the trained HMMs into a sequential network structure (H_1, ..., H_K). We finally use the Viterbi algorithm to find the best rhetorical unit sequence for the document D with N sentences represented by X = (x_1, ..., x_N). This is equal to finding the best state sequence Q* = (q_1*, ..., q_N*) in formulas (7) and (8):

Q* = argmax_Q P(Q | X)    (7)

= argmax_Q π_{q_1} b_{q_1}(x_1) Π_{i=2}^{N} a_{q_{i−1} q_i} b_{q_i}(x_i)    (8)

where π is the initial state distribution and a_{q_{i−1} q_i} are the state transition probabilities.
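A minimal NumPy sketch of the per-state emission model of (5) and (6), restricted to diagonal covariances for brevity (the excerpt does not state the covariance structure, so this simplification is an assumption):

```python
import numpy as np

def gmm_logpdf(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance
    Gaussian mixture: log sum_m c_m * N(x; mu_m, diag(var_m))."""
    logps = []
    for w, mu, var in zip(weights, means, variances):
        d = x.size
        quad = np.sum((x - mu) ** 2 / var)     # Mahalanobis term
        logdet = np.sum(np.log(var))           # log|Sigma| for diagonal Sigma
        logps.append(np.log(w) - 0.5 * (d * np.log(2 * np.pi) + logdet + quad))
    return np.logaddexp.reduce(logps)          # stable log-sum over components

x = np.array([0.0, 1.0])
w = [0.6, 0.4]
mu = [np.array([0.0, 1.0]), np.array([2.0, -1.0])]
var = [np.array([1.0, 1.0]), np.array([1.0, 1.0])]
print(gmm_logpdf(x, w, mu, var))
```

In a full RSHMM, one such mixture would serve as b_j(x_i) for each state j, with the Viterbi recursion of (7)-(8) run over the concatenated chunk models.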


Fig. 4. Parliamentary speech: Using Both Acoustic and Linguistic Features.

Fig. 5. (upper) The topology of Hidden Markov Model; (bottom) The topology of Conditional Random Field.

Fig. 6. Spoken document representation with RSHMMs.

Finally, we annotate each sentence s_i of the given document with the rhetorical unit label that approximately maximizes P(R | X):

r_i = f(q_i*)    (9)

where f(·) is a mapping function from HMM states to rhetorical units, and we have a total of R rhetorical units in a single document. X is the feature vector sequence representing the sentence sequence s_1, ..., s_N.

2) Rhetorical State HMMs With Segmental Summarization: This step in our algorithm assigns each sentence to its place in a particular rhetorical unit. Next, we want to find the sentences to be classified as summary sentences by using the salient sentence classification function, as shown in [32].

Based on the probabilistic framework, the extractive summarization task is carried out by estimating P(y_i = summary | x_i) for each sentence s_i. We propose a novel probabilistic framework—RSHMM-enhanced SVM—for the summarization process [9]. We approximate P(y_i = summary | x_i) in the following expression:

P(y_i = summary | x_i) ≈ g(x_i, r_i)    (10)

where g is the salient sentence classification function and r_i can be obtained by (9). We then predict whether sentence s_i is a


Fig. 7. Questions samples and their translations.

summary sentence or not by using a probability threshold. We set the probability threshold θ_r to be the compression ratio of rhetorical unit r:

θ_r = compression_ratio(r)    (11)

We model g by an SVM classifier with a Radial Basis Function (RBF) kernel [33], as described in (12). One SVM classifier is trained for each rhetorical unit of the RSHMM network. All the HMMs in our experiments are trained with HTK [34]. All the thresholds and kernel parameters can be obtained by tuning for the best performance on the development set.

K(x_i, x_j) = exp(−γ ||x_i − x_j||²)    (12)
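The decision function of an RBF-kernel SVM like the one in (12) can be sketched as follows; the support vectors, dual coefficients, and gamma are illustrative stand-ins for a trained model, not values from the paper:

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2), the RBF kernel of (12)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_score(x, support_vecs, dual_coefs, bias, gamma=0.5):
    """Decision value of a trained RBF-SVM:
    sum_i alpha_i * y_i * K(sv_i, x) + b."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vecs, dual_coefs)) + bias

# Toy "trained" model with one positive and one negative support vector.
svs = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
coefs = [1.0, -1.0]
score = svm_score(np.array([0.1, 0.0]), svs, coefs, bias=0.0)
print(score > 0)  # True: the test point lies near the positive support vector
```

In the paper's setup, a calibrated probability from such a per-unit classifier would be compared against the unit's compression ratio, as in (11), to decide summary membership.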

C. Baseline II: Two-Step Hidden Markov SVM

In our previous work [17], we showed how to build a Hidden Markov Support Vector Machine (HMSVM) [35] for learning rhetorical structure and extracting summaries from speeches and transcriptions.

The upper part of Fig. 5 shows the topology of the Hidden Markov Model. A state y_i represents a rhetorical chunk or unit; an observation x_i represents the utterance feature vector. Compared to the HMM, which only considers the direct dependence between each state and its corresponding observation, the HMSVM accounts for overlapping features: each state can depend directly on past or future observations. The HMSVM can thus effectively handle the dependency between neighboring observations.

1) Joint Feature Functions: Given the general problem of learning a function f: X → Y based on a training sample of input-output pairs (x_1, y_1), ..., (x_n, y_n), we consider that the function maps a given speech or transcription x to a rhetorical unit sequence y. We intend to find an approach for learning a discriminant function F: X × Y → R over input/output pairs, from which we produce a prediction by maximizing F over the output variable y for a given input x. Equation (13) gives the general form of our hypotheses, where w denotes a parameter vector. We assume F to be linear in some combined feature representation Φ(x, y) of inputs and outputs, as in (14):

f(x; w) = argmax_{y ∈ Y} F(x, y; w)    (13)

F(x, y; w) = ⟨w, Φ(x, y)⟩    (14)

2) Hidden Markov SVM for Extracting Structural Extractive Summary: For a transcribed document D, we build the HMSVM for choosing one rhetorical unit label from the unit label set to label all the sentences in D, by using the optimal function f(X) = argmax_Y F(X, Y; w), where X = (x_1, ..., x_N) is the recognized sentence vector sequence and x_i is obtained from the acoustic and lexical features of the sentence s_i.

For a training example, we generalize the notion of a separation margin by defining its margin with respect to the discriminant function F, as in (15) [35], where the ξ_i are slack variables that implement a soft margin. The linear constraints in (15) are equivalent to the following set of nonlinear constraints: F(x_i, y_i; w) − F(x_i, y; w) ≥ 1 − ξ_i for all y ≠ y_i. The solution of (15) can then be written as (16), where α_i(y) is the



Fig. 8. Motions, Members’ motions samples and their translations.

Lagrange multiplier of the constraint involving example i and labeling y:

min_{w, ξ} (1/2)‖w‖² + C Σ_i ξ_i  s.t.  ⟨w, Φ(x_i, y_i) − Φ(x_i, y)⟩ ≥ 1 − ξ_i, ∀i, ∀y ≠ y_i;  ξ_i ≥ 0   (15)

w = Σ_i Σ_{y ≠ y_i} α_i(y) [Φ(x_i, y_i) − Φ(x_i, y)]   (16)

We use a Viterbi-like algorithm for decoding the optimal label sequence. F(x, y; w) can be written as a sum over the length of the sequence and decomposed as (17), where T is the length of the sequence y and each label y_t is drawn from the rhetorical unit label set. Φ is composed of mapping functions that depend only on the labels at positions t and t − 1 as well as on x, so the score at position t depends only on x and the labels at positions t and t − 1 (Markov property):

F(x, y; w) = Σ_{t=1}^{T} ⟨w, φ(x, y_{t−1}, y_t, t)⟩   (17)

After labeling all the sentences as rhetorical units, we concatenate the sentences that are annotated as Summary together as the extracted summaries.
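The Viterbi-like decoding enabled by the Markov decomposition can be sketched as follows; the local `score(t, prev, cur)` function is a hypothetical stand-in for the learned position-wise scores.

```python
def viterbi(T, labels, score):
    """Decode the best label sequence when the sequence score
    decomposes into local terms score(t, prev, cur) that depend only
    on the position and the labels at t-1 and t (Markov property)."""
    # score(0, None, lab) plays the role of the initial-state score.
    delta = {lab: score(0, None, lab) for lab in labels}
    backptr = []
    for t in range(1, T):
        new_delta, bp = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: delta[p] + score(t, p, cur))
            new_delta[cur] = delta[prev] + score(t, prev, cur)
            bp[cur] = prev
        delta = new_delta
        backptr.append(bp)
    best = max(labels, key=lambda lab: delta[lab])
    path = [best]
    for bp in reversed(backptr):      # follow back-pointers
        path.append(bp[path[-1]])
    return path[::-1]
```

For instance, with a score that rewards starting in label "A" and changing label at every step, `viterbi(3, ["A", "B"], score)` returns `["A", "B", "A"]`.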

V. AUTOMATIC MINUTE GENERATION

We build automatic minute generation systems for the ordinary sessions of meetings of the Legislative Council of the Hong Kong Special Administrative Region. For Ordinary Sessions, there are 16 types of business items according to the Rules. We use the first five major types directly as labels of our rhetorical chunks, and group the rest into "Others" for our experiments. These six chunk labels are:
• Questions;
• Bills;
• Motions;
• Members' Motions;
• Address and Statements;
• Others.
We further observe that there are three types of rhetorical units under these rhetorical chunks:
• Introductory speech: made by the President or Secretary of the Council;
• Individual speech: made by one Legislative Council Member or Official;
• Others.
Each chunk boundary is explicitly represented by an Introductory type of speech by the President or the Secretary, delineating a change in business item, as can be seen in Fig. 1.

A. Model I: One Step Meeting Minute Generation System

We propose a one step system for simultaneously chunking, parsing, and extracting salient sentences. We build a single CRF classifier with a set of 36 types of labels of the form chunk–unit–summary/non-summary, where the first field represents the business item/rhetorical chunk, the second field represents the rhetorical unit, and the third field represents whether the sentence is a summary (S) or non-summary (N) sentence, giving 6 × 3 × 2 = 36 labels. For example, one such label represents that the sentence is a summary



Fig. 9. Bills, Addresses and Statements samples and their translations.

sentence in the "Question" rhetorical chunk and "speeches by President or Secretary" rhetorical unit.

In the training process of the CRF classifier, we need to estimate the CRF parameter vector in (2). Given this parameter vector, the most probable labeling sequence can be produced by maximizing the conditional probability of the label sequence given the input. The marginal probability of labels at each position in the sequence can be computed by a dynamic programming inference procedure similar to the forward-backward procedure for HMMs [16]. We then calculate the marginal probability of each transcription segment being a summary segment, given the whole segment sequence, by summing the marginal probabilities of the summary labels at that position. Finally, we group the salient segments into the meeting minute according to the marginal probability values.
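The 36-way label set and the summary-marginal computation can be sketched as follows; the short chunk and unit names are hypothetical stand-ins, since the paper does not spell out its exact label strings.

```python
from itertools import product

# Hypothetical short names for the six chunks and three units.
CHUNKS = ["Questions", "Bills", "Motions", "MembersMotions",
          "AddressStatements", "Others"]
UNITS = ["Intro", "Individual", "Other"]
# 6 chunks x 3 units x {S, N} -> 36 labels such as "Questions-Intro-S".
LABELS = ["%s-%s-%s" % (c, u, s) for c, u, s in product(CHUNKS, UNITS, "SN")]

def summary_marginal(pos_marginals):
    """Given per-label marginals p(y_t = label | x) at one position
    (e.g., from forward-backward inference), sum over the 18 labels
    ending in '-S' to get p(segment t is a summary segment | x)."""
    return sum(p for lab, p in pos_marginals.items() if lab.endswith("-S"))
```

Under a uniform marginal over all 36 labels, `summary_marginal` returns 0.5, since exactly half of the labels carry the summary flag.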

B. Model II: Two Step Meeting Minute Generation System

We also propose a two step system for comparison, where the input speech is first chunked and parsed into rhetorical units and then separate classifiers are used on each unit to extract summary sentences. This two step process is similar to other systems that separate meeting understanding and summarization into different steps [7]:
• (Step 1) Train a CRF classifier for chunking and parsing the input utterances into structural form. The label set of this classifier has 18 types of labels, one for each rhetorical chunk–unit pair (6 chunks × 3 units).
• (Step 2) Train eighteen CRF classifiers for extracting salient sentences, one classifier for each rhetorical chunk–unit label. Each of these classifiers has two types of labels: summary and non-summary.
In the training process of the Step 1 classifier and the Step 2 classifiers, we need to estimate their respective CRF parameters. Given these parameters, the most probable rhetorical label sequence is first produced by the Step 1 classifier. Next, the optimal summary label sequence for the segment sequence within each rhetorical chunk–unit is produced by the corresponding Step 2 classifier. We then calculate the marginal probability of each segment being a salient segment given the whole segment sequence, as in the one step system. Finally, we combine the salient segments within each rhetorical unit into one minute.
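The two step control flow above can be sketched as follows; `parse` and the per-label `summarizers` are hypothetical stand-ins for the trained Step 1 and Step 2 classifiers.

```python
def two_step_minute(segments, parse, summarizers):
    """Two step pipeline sketch: `parse` maps the segment sequence to
    chunk-unit labels (Step 1); `summarizers` maps each chunk-unit
    label to a binary summary classifier (Step 2). Salient segments
    are grouped per rhetorical chunk-unit to form the minute."""
    rhetorical = parse(segments)                   # Step 1: chunk + parse
    minute = {}
    for seg, lab in zip(segments, rhetorical):
        if summarizers[lab](seg):                  # Step 2: salient or not
            minute.setdefault(lab, []).append(seg)
    return minute
```

With a toy parser that assigns every segment one label and a toy classifier that keeps segments containing the word "keep", `two_step_minute(["keep a", "drop b"], ...)` yields a minute containing only `"keep a"`.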

C. Baseline I: Two Step RSHMMs With Segmental Summarization

We apply Rhetorical State HMMs (RSHMMs) with segmental summarization for extracting minutes of parliamentary speech:
• (Step 1) Train six RSHMMs for parsing the rhetorical structure of parliamentary speech (six types of rhetorical chunks: "Questions," "Bills," "Motions," "Members' Motions," "Address and Statements," "Others").
• (Step 2) Train six SVM binary classifiers for extracting summaries with rhetorical structure, one classifier for each rhetorical chunk label.



D. Baseline II: Two Step Hidden Markov SVM

For a transcribed document D, we train an HMSVM for annotating one rhetorical label from the rhetorical label set for every sentence in D, using the optimal function f(x; w), where x is the recognized sentence feature vector sequence and y is a unit label sequence.

The rhetorical unit label set of parliamentary speech contains 36 types of labels of the form chunk–unit–summary/non-summary, where the first field represents the business item/rhetorical chunk, the second field the rhetorical unit, and the third field summary versus non-summary. For example, one such label represents that the sentence is a summary sentence in the "Question" rhetorical chunk and "speeches by President or Secretary" rhetorical unit.

Referring to the corresponding meeting minutes, we annotate a reference label sequence for each document for training the HMSVM. We use a Viterbi-like algorithm for decoding the optimal label sequence.

VI. EXPERIMENTAL SETUP AND RESULTS

A. Corpus

We collected 333 meeting audio files, their Hansard transcriptions, and the meeting minutes from the Hong Kong Legislative Council from 2002 to 2009. For our experiments, we use all 71 Ordinary Session meetings from 2008 and 2009, including audio files, Hansards, and minutes. These 71 meetings contain 674 business item rhetorical chunks and 5390 rhetorical units. Each unit has on average 36 sentences. These meetings contain 73 speakers. The reference chunk and unit labels are directly extracted from the titles, subtitles, and HTML tags of the Hansards. For extractive summarization, we need to train a series of binary classifiers on the summary/non-summary sentences in the Hansards. We assign a summary label to Hansard sentences with high similarity to their corresponding minute sentences, since the latter are often short paraphrases of the former. This corpus has a word list of 35 k vocabulary size, comprising 28,650 Chinese words and 6,790 English words.

We randomly select 11 meetings as test data from the 71 Ordinary Session meetings from 2008 and 2009, and use the remaining 60 meetings as training data. The Out-of-Vocabulary (OOV) rate of our test set is 1%.

B. Automatic Meeting Speech Transcription System

The acoustic model of our ASR system comprises 73 Hidden Markov Models (HMMs) representing 70 Cantonese phonemes as well as silence, short pause, and noise. During acoustic model training, tied-state cross-word triphones are constructed by decision tree clustering. Since the speech of the council members is mainly in Cantonese, mixed with some English and Mandarin, English-to-Cantonese and Mandarin-to-Cantonese phone mappings are applied to the dictionaries so that English words and Mandarin Chinese words can be trained. The language model is a bigram model interpolated between the manual transcriptions mentioned above and the Hansards from the 333 meetings between 2002 and 2009. Our ASR system yields 72.0% word accuracy.
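A minimal sketch of such bigram interpolation, assuming maximum-likelihood bigram estimates from each corpus and an interpolation weight `lam`; the actual weight and smoothing scheme are not specified in the text.

```python
def interpolated_bigram(counts_a, counts_b, lam=0.5):
    """Interpolate two bigram models estimated from two corpora (here:
    manual transcriptions and Hansards). counts_* map (w1, w2) -> count;
    `lam` is an assumed interpolation weight."""
    def mle(counts):
        # Unigram history totals for the MLE denominator.
        uni = {}
        for (w1, _), c in counts.items():
            uni[w1] = uni.get(w1, 0) + c
        return lambda w1, w2: counts.get((w1, w2), 0) / uni.get(w1, 1)
    pa, pb = mle(counts_a), mle(counts_b)
    return lambda w1, w2: lam * pa(w1, w2) + (1 - lam) * pb(w1, w2)
```

For example, if corpus A gives P(b|a) = 0.5 and corpus B gives P(b|a) = 0.75, the interpolated model with `lam=0.5` gives 0.625.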

C. Experimental Setup

We perform two held-out experiments to evaluate the CRF-based one step minute generation system and the two step system. We also evaluate the other two step systems for comparison: the Rhetorical State HMM with segmental summarization system [32] and the HMSVM system [24]. ROUGE-L F-measure [36] is used to measure the summarization performance of the minute generation systems.

We show human evaluation results for automatically produced summaries, based on the structures of the Hansard transcriptions. This means that humans only need to carry out the summarization part but not the structural parsing part. Four human subjects (all undergraduate students in Computer Science) participated in the human evaluation. The human subjects perform sentence-level relevance assessment. First, the four human subjects annotate four versions of reference summaries. Since the automatic summarization output is at around a 15% compression rate, the humans are told to follow this compression rate when they produce summaries. Second, we compare our automatically annotated summary to each human generated summary. We then use the kappa coefficient K to measure the inter-agreement between human subjects 1, 2, 3, and 4 and with the automatic summarizer.

K = (P(a) − P(e)) / (1 − P(e))   (18)

where P(a) is the relative annotation agreement among annotators, and P(e) is the hypothetical probability of random agreement, estimated from the annotated data by the probability of each annotator labeling randomly. P(e) is thus the level of agreement that would be reached by random annotation using the same distribution of categories as the actual annotators used.

The held-out experiment results on Hansards are shown in Table III. The additional held-out experiment results on ASR outputs are shown in Table IV. We also build a Maximal Marginal Relevance (MMR) based extractive summarization system [37] as a baseline. The experiment results are shown in Tables III and IV.
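The kappa computation for a pair of annotators can be sketched directly from the definitions of P(a) and P(e) above:

```python
def kappa(labels_a, labels_b):
    """Kappa coefficient for two annotators, K = (Pa - Pe) / (1 - Pe),
    with Pe estimated from each annotator's own label distribution.
    Assumes the two label lists are aligned and Pe < 1."""
    n = len(labels_a)
    # P(a): observed relative agreement.
    pa = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # P(e): chance agreement from the annotators' category distributions.
    cats = set(labels_a) | set(labels_b)
    pe = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats)
    return (pa - pe) / (1 - pe)
```

For example, annotators labeling `['S','S','N','N']` and `['S','N','N','N']` agree on 3 of 4 items (P(a) = 0.75) with P(e) = 0.5, giving K = 0.5.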

D. Experimental Results

From Tables III and IV, we can see that the one step minute generation system achieves 74.5% ROUGE-L F-measure, outperforming the two step system by 5.97% relative and the RSHMM system by 7.19% relative. Although the overall summarization performance on ASR transcriptions is worse than that on Hansards, the performance is still encouraging. The best summarization performance on ASR transcriptions is 72.7% ROUGE-L F-measure, produced by the one step minute generation system. We also find that the CRF based two step system outperforms the RSHMM system and the HMSVM system, probably because CRF is better than HMM at labeling or parsing sequential data. We further find that our proposed systems, which extract summaries with rhetorical structure, outperform the MMR baseline, which extracts summaries without structure. These findings suggest that our minute generation systems produce extracted summaries with



TABLE IIISUMMARIZATION PERFORMANCE OF HELD-OUT EXPERIMENTS ON HANSARDS USING ROUGE-L F-MEASURE AND HUMAN EVALUATION

TABLE IVSUMMARIZATION PERFORMANCE OF HELD-OUT EXPERIMENTS ON ASR OUTPUTS USING ROUGE-L F-MEASURE AND HUMAN EVALUATION

TABLE V EVALUATING INTER-ANNOTATOR AGREEMENT OF AUTOMATIC REFERENCE SUMMARY ANNOTATION BASED ON MANUAL MEETING MINUTES AND HUMAN SUBJECT SUMMARY ANNOTATIONS BY THE KAPPA COEFFICIENT K

rhetorical structure, which also improves summarization performance by exploiting the rhetorical sequence information of the sentences within the same rhetorical chunk.

Tables III and IV also show that the human evaluation results are similar to those of the automatic evaluation metric, ROUGE-L F-measure. According to Landis and Koch's scale [38], the inter-agreement between Human Subjects 1, 2, 3, and 4 and the automatic reference summary annotator based on manual meeting minutes is "almost perfect," at above 80%, as shown in Table V. These findings suggest that ROUGE is well correlated with human judgments for parliamentary meeting speech summarization.

Furthermore, we carry out a thorough investigation of the relative contribution of different features and find that N-gram features contribute the most, at 68% ROUGE-L F-measure. Acoustic features give a similar contribution, at 66.7% ROUGE-L F-measure. It is possible that since there are several

TABLE VI EVALUATING OUR HANSARD STRUCTURE PARSER USING PRECISION, RECALL, AND F-MEASURE ON ASR OUTPUTS

speakers and several topics in each parliamentary speech, we cannot recognize the rhetorical chunk of each utterance using pure acoustic features or pure lexical features alone. However, if we combine acoustic features with lexical features, the underlying rhetorical structure can be found more easily.

Since the two step minute generation systems use the rhetorical chunk and unit labels produced by the Hansard structure parser, we also evaluate the Hansard structure parser in the held-out experiment on ASR outputs using precision, recall, and F-measure [39]. As shown in Table VI, the CRF-based Hansard structure parser achieves an encouraging 83.5% unweighted average F-measure and 83.4% weighted average F-measure.

We perform statistical significance tests according to Koehn's method [40]. We repeat the process of all the experiments 20 times to form 20 comparisons. In other words, we obtain the previous findings with 100% statistical significance.
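Koehn's method is based on paired bootstrap resampling; a minimal sketch follows, where the per-item scores and the number of trials are illustrative rather than the paper's exact 20-run protocol.

```python
import random

def paired_bootstrap(scores_a, scores_b, trials=1000, seed=0):
    """Paired bootstrap resampling: resample test items with
    replacement and count how often system A's total score beats
    system B's on the same resample. A fraction near 1.0 indicates a
    statistically significant improvement of A over B."""
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / trials
```

When system A scores strictly higher than system B on every test item, every resample favors A and the function returns 1.0, i.e., the improvement holds in 100% of the comparisons.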



VII. RELATED WORK

Many researchers treat the meeting summarization task as a salient sentence ranking/selection process [41], [42], [8], [43], [15], [44]–[46]. However, the extracted summaries are hard to understand without structure information.

In recent years, more and more researchers are identifying the hierarchical topic structure or rhetorical structure in a document for extracting lists of descriptive keywords or improving summarization performance [47], [7], [48], [49], [10], [50]–[52]. [47] showed that the concepts of rhetorical analysis and nuclearity can be used effectively for text summarization. [7] and [48] proposed a system for automatic processing of tasks involving multi-party meetings, such as speech recognition, dialog act segmentation, topic identification and segmentation, and meeting summarization. Their meeting summarization component can be improved based on the outputs of dialog act tagging, topic segmentation, and topic detection. [49] presented a stochastic HMM framework with modified K-means and segmental K-means algorithms for extractive text summarization. [10] further presented a stochastic Hidden Markov Story Model for multilingual and multi-document summarization and proposed that monolingual documents recounting the same story (i.e., on the same topic) share a unique story flow (one story, one flow), and that such a flow can be modeled by HMMs. [52] describes a novel Bayesian approach to unsupervised lexical cohesion driven topic segmentation. [51] proposed a structured discriminative model for table-of-contents generation on written text that accounts for a wide range of phrase-based and collocation features.

Our method differs from [12] in that we assume the relevance or saliency and function of certain text pieces can be determined by analyzing the full hierarchical structure of the text. Instead of annotating the training data with rhetorical labels manually, we propose using manual meeting minutes as references. Similarly, [53], [54] investigate the correlation between PowerPoint slides and extractive summaries. Our learning method is based on classifiers, while [47] uses a rule-based method for parsing rhetorical structure.

VIII. CONCLUSION AND DISCUSSION

In this paper, we have proposed a novel approach for generating minutes from parliamentary meetings using structural summarization. We suggest that rhetorical structure modeling is a critical step in the understanding and extractive summarization of spoken documents. Previous work has shown that explicit rhetorical structure markers, such as paragraph delimiters, titles and subtitles, sentence boundaries, fonts, and styles, are essential in helping the reader understand text documents. We suggest that a good extractive summary should be clearly structured like meeting minutes, with explicit structural markers. To this end, we have proposed to use a single Conditional Random Field classifier for chunking, parsing, and extracting summaries from meeting transcriptions in one step. We found that the one step minute generation system outperforms the two step systems, which parse rhetorical structure in the first step and extract salient sentences in the second step. Through a thorough investigation of the relative contribution of different features, we found that acoustic features make a similar contribution, at 66.7% ROUGE-L F-measure, to N-gram features, at 68% ROUGE-L F-measure.

Future work will focus on: (1) applying our proposed rhetorical structure parser and minute generator to more interactive meeting speech beyond parliamentary speech. We represent different dialog acts as rhetorical chunks and speech acts as rhetorical units. A CRF classifier can then be trained to perform meeting understanding (i.e., chunking and parsing) and summarization in one single step; (2) using the rhetorical chunks/units found to improve speech recognition accuracy, as in [14].

ACKNOWLEDGMENT

The authors thank Chan Ho Yin for his work on providing automatic transcriptions of the parliamentary speech.

REFERENCES

[1] R. Kaptein, M. Marx, and J. Kamps, "Who said what to whom?: Capturing the structure of debates," in Proc. 32nd ACM SIGIR Conf. Research Develop. Inf. Retrieval, 2009, pp. 831–832.

[2] Y. Akita, M. Mimura, and T. Kawahara, "Automatic transcription system for meetings of the Japanese national congress," in Proc. Interspeech '09, 2009, pp. 84–87.

[3] B. Ramabhadran, O. Siohan, and A. Sethy, "The IBM 2007 speech transcription system for European parliamentary speeches," in Proc. IEEE Autom. Speech Recognit. Understand. Workshop, 2007, pp. 472–477.

[4] M. Marx, "Long, often quite boring, notes of meetings," in Proc. WSDM '09 Workshop on Exploiting Semantic Annotations in Information Retrieval, 2009, pp. 46–53.

[5] M. Marx, "Advanced information access to parliamentary debates," J. Digital Inf., vol. 10, no. 6, 2009.

[6] G. Murray, S. Renals, and J. Carletta, "Extractive summarization of meeting recordings," in Proc. 9th Eur. Conf. Speech Commun. Technol., 2005.

[7] G. Tur et al., "The CALO meeting speech recognition and understanding system," in Proc. IEEE Spoken Lang. Technol. Workshop (SLT '08), 2008, pp. 69–72.

[8] S. Xie and Y. Liu, "Using corpus and knowledge-based similarity measure in maximum marginal relevance for meeting summarization," in Proc. ICASSP '08, 2008, pp. 4985–4988.

[9] P. Fung, H. Y. Chan, and J. J. Zhang, "Rhetorical-state hidden Markov models for extractive speech summarization," in Proc. ICASSP '08, 2008, pp. 4957–4960.

[10] P. Fung and G. Ngai, "One story, one flow: Hidden Markov Story Models for multilingual multidocument summarization," ACM Trans. Speech Lang. Process. (TSLP), vol. 3, no. 2, pp. 1–16, 2006.

[11] G. Murray, M. Taboada, and S. Renals, "Prosodic correlates of rhetorical relations," in Proc. Analyzing Conversations in Text and Speech (ACTS) Workshop at HLT-NAACL, 2006, pp. 1–7.

[12] S. Teufel and M. Moens, "Summarizing scientific articles: Experiments with relevance and rhetorical status," Comput. Linguist., vol. 28, no. 4, pp. 409–445, 2002.

[13] H. Lin, J. Bilmes, and S. Xie, "Graph-based submodular selection for extractive summarization," in Proc. IEEE ASRU '09, 2009, pp. 381–386.

[14] A. Stolcke, K. Ries, N. Coccaro, E. Shriberg, R. Bates, D. Jurafsky, P. Taylor, R. Martin, C. V. Ess-Dykema, and M. Meteer, "Dialogue act modeling for automatic tagging and recognition of conversational speech," Comput. Linguist., vol. 26, no. 3, pp. 339–373, 2000.

[15] S. Xie, D. Hakkani-Tur, B. Favre, and Y. Liu, "Integrating prosodic features in extractive meeting summarization," in Proc. IEEE Workshop Autom. Speech Recognit. Understand., 2009, pp. 387–391.

[16] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proc. Int. Conf. Mach. Learn., 2001, pp. 282–289.

[17] J. J. Zhang and P. Fung, "Learning deep rhetorical structure for extractive speech summarization," in Proc. ICASSP '10, 2010, pp. 5302–5305.

[18] J. J. Zhang, P. Fung, and H. Y. Chan, "Automatic minute generation for parliamentary speech using conditional random fields," in Proc. ICASSP '11, 2011, pp. 5536–5539.


[19] Q. Jin and T. Schultz, "Speaker segmentation and clustering in meetings," in Proc. 8th Int. Conf. Spoken Lang. Process., 2004.

[20] S. Maskey and J. Hirschberg, "Summarizing speech without text using hidden Markov models," in Proc. NAACL, 2006.

[21] S. Furui, T. Kikuchi, Y. Shinnaka, and C. Hori, "Speech-to-text and speech-to-speech summarization of spontaneous speech," IEEE Trans. Speech Audio Process., vol. 12, no. 4, pp. 401–408, Jul. 2004.

[22] J. Hirschberg, "Communication and prosody: Functional aspects of prosody," Speech Commun., vol. 36, no. 1, pp. 31–43, 2002.

[23] J. J. Zhang, H. Y. Chan, P. Fung, and L. Cao, "A comparative study on speech summarization of broadcast news and lecture speech," in Proc. Interspeech '07 (Eurospeech), 2007, pp. 2781–2784.

[24] J. J. Zhang and P. Fung, "A rhetorical syntax-driven model for speech summarization," in Proc. 23rd Int. Conf. Comput. Linguist. (Coling '10), 2010, pp. 1299–1307.

[25] B. Schuller, S. Steidl, and A. Batliner, "The INTERSPEECH 2009 emotion challenge," in Proc. Interspeech, 2009, pp. 312–315.

[26] B. Schuller, A. Batliner, S. Steidl, and D. Seppi, "Recognising realistic emotions and affect in speech: State of the art and lessons learnt from the first challenge," Speech Commun., 2011.

[27] K. Koumpis and S. Renals, "Automatic summarization of voicemail messages using lexical and prosodic features," ACM Trans. Speech Lang. Process., vol. 2, no. 1, pp. 1–24, 2005.

[28] S. Maskey and J. Hirschberg, "Comparing lexical, acoustic/prosodic, structural and discourse features for speech summarization," in Proc. Interspeech '05 (Eurospeech), 2005.

[29] F. Sha and F. Pereira, "Shallow parsing with conditional random fields," in Proc. Conf. North Amer. Chap. Assoc. Comput. Linguist. Human Lang. Technol., vol. 1, 2003, pp. 134–141.

[30] B. Settles, "Biomedical named entity recognition using conditional random fields and rich feature sets," in Proc. Int. Joint Workshop Nat. Lang. Process. Biomed. and Its Applicat., 2004, pp. 104–107.

[31] L. P. Morency, A. Quattoni, C. M. Christoudias, and S. Wang, "Hidden-state conditional random field library," User Guide, 2007.

[32] J. J. Zhang, R. H. Y. Chan, and P. Fung, "Extractive speech summarization using shallow rhetorical structure modeling," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 6, pp. 1147–1157, Aug. 2010.

[33] C. C. Chang and C. J. Lin, LIBSVM: A Library for Support Vector Machines, 2001.

[34] S. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book (for HTK Version 3.0). Cambridge, U.K.: Cambridge Univ., 2000.

[35] Y. Altun, I. Tsochantaridis, and T. Hofmann, "Hidden Markov support vector machines," in Proc. Int. Conf. Mach. Learn., 2003, vol. 20.

[36] C. Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Proc. Workshop Text Summarization Branches Out (WAS 2004), 2004, pp. 25–26.

[37] X. Zhu and G. Penn, "Summarization of spontaneous conversations," in Proc. 9th Int. Conf. Spoken Lang. Process., 2006.

[38] J. R. Landis and G. G. Koch, "The measurement of observer agreement for categorical data," Biometrics, pp. 159–174, 1977.

[39] C. J. Van Rijsbergen, Information Retrieval. London, U.K.: Butterworths, 1979.

[40] P. Koehn, "Statistical significance tests for machine translation evaluation," in Proc. EMNLP, 2004, vol. 4.

[41] M. Galley, "A skip-chain conditional random field for ranking meeting utterances by importance," in Proc. Conf. Empirical Meth. Nat. Lang. Process., 2006, pp. 364–372.

[42] D. Gillick, K. Riedhammer, B. Favre, and D. Hakkani-Tur, "A global optimization framework for meeting summarization," in Proc. IEEE ICASSP '09, 2009, pp. 4769–4772.

[43] D. Gillick, B. Favre, D. Hakkani-Tur, B. Bohnet, Y. Liu, and S. Xie, "The ICSI/UTD summarization system at TAC 2009," in Proc. Text Anal. Conf. Workshop, Gaithersburg, MD, 2009.

[44] X. Zhu and G. Penn, "Evaluation of sentence selection for speech summarization," in Proc. Workshop on Crossing Barriers in Text Summarization, RANLP-2005, 2005.

[45] G. Murray, S. Renals, J. Carletta, and J. Moore, "Incorporating speaker and discourse features into speech summarization," in Proc. Main Conf. Human Lang. Technol. Conf. North Amer. Chap. Assoc. Comput. Linguist., 2006, pp. 367–374.

[46] M. Hirohata, Y. Shinnaka, K. Iwano, and S. Furui, "Sentence extraction-based presentation summarization techniques and evaluation metrics," in Proc. ICASSP, 2005, pp. 1065–1068.

[47] D. Marcu, "From discourse structures to text summaries," in Proc. ACL Workshop Intell. Scalable Text Summariz., 1997, pp. 82–88.

[48] A. Janin et al., "The ICSI meeting project: Resources and research," in Proc. ICASSP NIST Meeting Recognition Workshop, 2004.

[49] P. Fung, G. Ngai, and P. Cheung, "Combining optimal clustering and hidden Markov models for extractive summarization," in Proc. ACL Workshop Multilingual Summariz., 2003, pp. 29–36.

[50] R. Barzilay and L. Lee, "Catching the drift: Probabilistic content models, with applications to generation and summarization," in Proc. HLT-NAACL, 2004, pp. 113–120.

[51] S. R. K. Branavan, P. Deshpande, and R. Barzilay, "Generating a table-of-contents," in Proc. Annu. Meeting Assoc. Comput. Linguist., 2007, vol. 45, p. 544.

[52] J. Eisenstein and R. Barzilay, "Bayesian unsupervised topic segmentation," in Proc. Conf. Empir. Meth. Natural Lang. Process., 2008, pp. 334–343.

[53] L. He, E. Sanocki, A. Gupta, and J. Grudin, "Comparing presentation summaries: Slides vs. reading vs. listening," in Proc. SIGCHI Conf. Human Factors in Comput. Syst., New York, 2000, pp. 177–184.

[54] L. He, E. Sanocki, A. Gupta, and J. Grudin, "Auto-summarization of audio-video presentations," in Proc. 7th ACM Int. Conf. Multimedia (Part 1), 1999, p. 498.

Justin Jian Zhang received the Ph.D. degree from the Hong Kong University of Science and Technology in 2011. He was a member of the Human Language Technology Center at HKUST. He received the M.Eng. degree from the School of Software at Tsinghua University in 2006 and the B.Eng. degree from the School of Electrical Engineering & Automation at Tianjin University in 2003. He was a member of the Data Mining Group in the Institute of Information System & Engineering of Tsinghua University from 2003 to 2006. He is currently an Assistant Researcher at the Engineering Technology Institute of Dongguan University of Technology. His interests include natural language processing and speech understanding & summarization.

Pascale Fung (SM'09) received her Masters and Ph.D. degrees in computer science from Columbia University in 1993 and 1997, respectively. She also studied at Ecole Centrale Paris and Kyoto University, and was formerly a researcher at AT&T Bell Labs, BBN Systems & Technologies, and LIMSI/CNRS in France. She is an Associate Professor in the Department of Electronic & Computer Engineering at The Hong Kong University of Science and Technology (HKUST). She co-founded the Human Language Technology Center, and is the Director of InterACT at HKUST. Her research interests include speech summarization, speech translation, acoustic modeling, and multilinguality in both speech and language processing. She is an Associate Editor of the IEEE Signal Processing Letters, the ACM Transactions on Speech and Language Processing, and the new Transactions of the Association for Computational Linguistics. She is a Senior Member of the IEEE, a member of the IEEE Signal Processing Society Speech and Language Technology Committee, and a board member of SIGDAT of the Association for Computational Linguistics. She served as Area Chair for the International Conference on Acoustics, Speech and Signal Processing in 2010, 2011, and 2012.