
SPEAKER DIARIZATION OF MEETING DATA USING THE INFORMATION BOTTLENECK FRAMEWORK

Author(s) Name(s)

Author Affiliation(s)

ABSTRACT

TBD

Index Terms— TBD

1. INTRODUCTION

Speaker diarization is the task of deciding who spoke when in an audio stream and is an essential step for several applications, such as speaker adaptation in large-vocabulary ASR systems and speaker-based indexing and/or retrieval. It involves determining the number of speakers and identifying the speech segments corresponding to each speaker. The number of speakers is not known a priori and must be estimated from the data in an unsupervised manner. This is achieved using a model selection criterion for inferring the number of clusters (speakers).

Conventional diarization systems are based on an ergodic HMM in which each state represents a speaker. Emission probabilities are Gaussian Mixture Models (GMM). The initial audio stream is segmented into several regions using either speaker change detection methods or uniform segmentation. The diarization algorithm is then based on bottom-up agglomerative clustering of those initial segments ??. Segments are merged according to some measure until a stopping criterion is met. Given that the final number of clusters is unknown and must be estimated from the data, the stopping criterion is generally related to the complexity of the estimated model. The use of the Bayesian Information Criterion ?? as model complexity criterion has been proposed in [?], and a modified version of BIC that keeps the model complexity constant has been proposed in [?].

BIC is obtained by penalizing the ratio of the likelihoods of the individual clusters to the likelihood of the merged cluster; thus, at each step, the algorithm has to estimate the likelihood of the individual clusters and of all possible mergings. This can be a computationally demanding task and assumes that enough data are available for estimating a parametric model of each segment.

Those approaches yield state-of-the-art results [?] in several diarization evaluations. In this paper, we aim at investigating an alternative based on the agglomerative Information Bottleneck (aIB) method proposed in [?]. The aIB method is a clustering method based on an information-theoretic criterion that trades off the distortion between different clusterings against the mutual information preserved with respect to a set of relevance variables (see ?? for details). The main advantage w.r.t. conventional HMM/GMM agglomerative clustering is that there is no explicit representation of a speaker model, only its representation in a space of variables relevant to its classification.

Thanks to XYZ agency for funding.

We investigate in this paper the application of this agglomerative clustering method to a speaker diarization task. Given that there is no explicit parametric speaker model, BIC cannot be directly applied in this case. However, other model selection criteria based on information theory are considered here. In particular, we will consider a stopping criterion based on the Minimum Description Length principle ??.

Experiments are run on the NIST RT06 "Meeting Recognition Diarization" task based on data from Multiple Distant Microphones (MDM) ??. The remainder of the paper is organized as follows.

fill in later . . .

2. INFORMATION BOTTLENECK METHOD

The problem of clustering can be formulated as grouping together the elements of a set X based on the similarity of their distributions p(y|x) w.r.t. the elements of another set Y of variables. Let us denote with X a dataset which we would like to cluster into a representation X̃ such that we have minimal loss of information related to another relevance variable Y. In ?? this problem is formulated as finding a stochastic map p(x̃|x) that minimizes the distortion between X and X̃, i.e. I(X, X̃), while preserving the mutual information on Y, i.e. I(X̃, Y).

Thus the objective function can be formulated as:

I(X, X̃) − β I(X̃, Y)    (1)

where β is the trade-off between the amount of information I(X̃, Y) to be preserved and the amount of information that is lost in the clustering, I(X, X̃).

As shown in the original paper [ref], this minimization leads to the following equations:

p(x̃|x) = p(x̃) / Z(β, x) · exp(−β D_KL[p(y|x) || p(y|x̃)])    (2)

p(y|x̃) = Σ_x p(y|x) p(x̃|x) p(x) / p(x̃)    (3)

p(x̃) = Σ_x p(x̃|x) p(x)    (4)

where Z(β, x) is a normalization function and D_KL[p(y|x) || p(y|x̃)] is the Kullback-Leibler divergence, which emerges from the solution of the principle. These equations can be solved iteratively. The objective function (1) defines a concave curve similar to a rate-distortion function.
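For concreteness, the following minimal NumPy sketch (not part of the original system; the array names and the toy setup are ours) performs one pass of the self-consistent updates (2)-(4), starting from p(x), p(y|x) and a current soft assignment p(x̃|x):

```python
import numpy as np

def ib_update(p_x, p_y_given_x, q_t_given_x, beta, eps=1e-12):
    """One pass of the self-consistent IB equations (2)-(4).

    p_x          : (n,)   prior over the input elements x
    p_y_given_x  : (n, m) conditionals p(y|x), rows sum to 1
    q_t_given_x  : (n, k) current soft assignment p(x~|x), rows sum to 1
    beta         : trade-off parameter of objective (1)
    """
    # Eq. (4): cluster priors  p(x~) = sum_x p(x~|x) p(x)
    p_t = q_t_given_x.T @ p_x                                    # (k,)
    # Eq. (3): cluster conditionals  p(y|x~) = sum_x p(y|x) p(x~|x) p(x) / p(x~)
    p_y_given_t = (q_t_given_x * p_x[:, None]).T @ p_y_given_x   # (k, m)
    p_y_given_t /= p_t[:, None] + eps
    # Eq. (2): p(x~|x) proportional to p(x~) exp(-beta * D_KL[p(y|x) || p(y|x~)])
    kl = np.sum(p_y_given_x[:, None, :] *
                (np.log(p_y_given_x + eps)[:, None, :] -
                 np.log(p_y_given_t + eps)[None, :, :]), axis=2)  # (n, k)
    new_q = (p_t[None, :] + eps) * np.exp(-beta * kl)
    new_q /= new_q.sum(axis=1, keepdims=True)                    # the 1/Z(beta, x) term
    return new_q, p_t, p_y_given_t

# Toy usage: 5 inputs, 3 relevance values, 2 clusters, iterated to convergence.
rng = np.random.default_rng(0)
p_x = np.full(5, 0.2)
p_y_given_x = rng.dirichlet(np.ones(3), size=5)
q = rng.dirichlet(np.ones(2), size=5)
for _ in range(100):
    q, p_t, p_y_t = ib_update(p_x, p_y_given_x, q, beta=5.0)
```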

The limit β → ∞ of equations (2-4) induces a hard partition of the input space, i.e. the probabilistic map p(x̃|x) takes values of 0 and 1 only, and reduces the equation set (2-4) to

... (5)

The agglomerative Information Bottleneck (aIB) ?? focuses on generating hard partitions of the data X using a greedy approach such that the objective function (1) is minimized. It starts with the trivial clustering into |X| clusters, i.e. each data point is considered a cluster, and iteratively merges clusters such that the decrease of the objective function (1) is minimal.

After each merging, I(X, X̃) and I(X̃, Y) decrease, and the difference can be analytically estimated. For instance, if two clusters x̃_i and x̃_j are merged together, the decrease in I(X̃, Y) is given by

δI_y = (p(x̃_i) + p(x̃_j)) · JS(p(Y|x̃_i), p(Y|x̃_j))    (6)

where JS denotes the Jensen-Shannon divergence. In other words, the optimal merge is given by the pair of clusters that has the smallest distance in the space of the relevance variable Y. Given the distributions p(X) and p(Y|X), this can be obtained in a straightforward way using expression (6).
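The sketch below evaluates the merge cost of equation (6) for every candidate pair and picks the cheapest merge; it assumes the Jensen-Shannon divergence is taken with weights proportional to the cluster priors, as in the standard aIB formulation, and all function names are illustrative:

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """Discrete Kullback-Leibler divergence D_KL(p || q)."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def merge_cost(p_i, p_j, py_i, py_j):
    """Decrease of I(X~, Y) caused by merging clusters i and j, eq. (6)."""
    p_m = p_i + p_j
    pi_i, pi_j = p_i / p_m, p_j / p_m
    py_m = pi_i * py_i + pi_j * py_j             # distribution of the merged cluster
    js = pi_i * kl_div(py_i, py_m) + pi_j * kl_div(py_j, py_m)  # prior-weighted JS
    return p_m * js

def best_merge(p_t, p_y_given_t):
    """Greedy aIB step: return the pair of clusters whose merge costs least."""
    k = len(p_t)
    pairs = {(i, j): merge_cost(p_t[i], p_t[j], p_y_given_t[i], p_y_given_t[j])
             for i in range(k) for j in range(i + 1, k)}
    return min(pairs, key=pairs.get)
```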

Contrary to agglomerative clustering based on GMMs and the BIC criterion, this clustering simply considers the divergence between different data points instead of explicitly building a GMM for each cluster.

Details about the implementation of the aIB algorithm can be found in ?? and will not be further discussed here. The objective function (1) defines a curve along which a sequence of merges is defined, but it does not give any further information on the optimal number of clusters. This will be discussed in the next section.

3. MODEL SELECTION

The model selection problem consists in finding the best model that represents a given data set. In the case of parametric models, BIC ?? or modified BIC [?] are very common choices.

In the case of aIB there is no parametric model that represents the data, and the BIC criterion cannot be applied. Several alternative solutions have been considered in the literature. For instance, the normalized mutual information I(X̃, Y)/I(X, Y) gives useful information on the clustering quality. It decreases faster when very dissimilar clusters (most likely different speakers) are merged. Thus we investigate here simple thresholding of I(X̃, Y)/I(X, Y) as a possible estimator of the number of clusters.

Because of the information-theoretic basis of the information bottleneck method, it is straightforward to apply the minimum description length (MDL) principle [ref]. The MDL principle states that the optimal model is the one that encodes the data and the model with minimum code length [ref]. The MDL criterion is given by

F_MDL = L(H) + L(D|H)    (7)

where L(H) is the code length needed to encode the hypothesis and L(D|H) is the code length required to encode the data given the hypothesis. In the case of parametric models, MDL reduces to BIC, where L(D|H) corresponds to the likelihood of the data and L(H) corresponds to the penalty term.

Let N be the number of input samples and C the number of clusters. The number of bits required to encode these samples with a fixed-length code is N log(N/C). The clustering itself can be coded with N[H(Y|X̃) + H(X̃)] bits. Since H(Y|X̃) can be written as H(Y) − I(X̃, Y), the model selection criterion is given by

F_MDL = N [H(Y) − I(X̃, Y) + H(X̃)] + N log(N/C)    (8)

Expression (8) provides the criterion according to which the number of clusters (i.e. speakers) can be selected.
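The following sketch evaluates expression (8) for one point on the merge tree from the clustering statistics; entropies are computed with the natural logarithm, which only rescales the criterion and does not change its minimum, and the helper names are ours:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Entropy of a discrete distribution (natural logarithm)."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + eps)))

def mdl_criterion(n_samples, p_t, p_y_given_t, h_y):
    """Two-part MDL criterion of eq. (8) for one clustering on the merge tree.

    n_samples   : N, the number of input samples being encoded
    p_t         : cluster priors p(x~), length C
    p_y_given_t : conditionals p(y|x~), shape (C, |Y|)
    h_y         : H(Y), entropy of the relevance variable (constant over merges)
    """
    c = len(p_t)
    h_t = entropy(p_t)                                           # H(X~)
    h_y_given_t = sum(p_t[t] * entropy(p_y_given_t[t]) for t in range(c))
    i_ty = h_y - h_y_given_t                                     # I(X~, Y)
    return n_samples * (h_y - i_ty + h_t) + n_samples * np.log(n_samples / c)

# The selected model is the clustering that minimizes mdl_criterion over the tree.
```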

4. AIB FOR SPEAKER DIARIZATION

In this work, we investigate the use of aIB clustering for a speaker diarization task. We describe in the following how the variables X, Y and the conditional distribution p(Y|X) are defined.

The speech segments of a given audio file are uniformly chopped into small chunks of fixed length to ensure that each segment contains only one speaker. This segment space is the initial cluster space X. Agglomerative 'bottom-up' clustering is performed in this space, merging segments which have the most similar distributions p(Y|X), where Y is a set of relevance variables.

The variables Y are defined as the set of Gaussian components of a GMM universal background model (UBM), and p(y|x) is the probability of a component given a data segment. In other words, every segment is represented through its relevance to the components of the UBM.

aIB clustering tries to find a grouping X̃ of the initial segments X such that their representation in terms of the Gaussian components Y is as close as possible to the original one.


Given a UBM and an initial segmentation, all quantities needed for aIB clustering, p(X) and p(Y|X), can be easily obtained.

5. SYSTEM DESCRIPTION

The core of our system is the agglomerative information bottleneck clustering algorithm. Beamforming and denoising are performed on the multi-channel audio stream to generate a single audio stream with improved SNR, which is used for the extraction of mel-frequency cepstral coefficients. The components of a sparse Gaussian mixture model estimated from the speech-only frames are used as the relevance variables. Two criteria, thresholding of the normalized mutual information and a two-part minimum description length, are explored for model selection. Subsequently, the output cluster boundaries are fine-tuned using a GMM-HMM system. The block diagram of the system is shown in Figure 1.

Fig. 1. Diarization system

5.1. Signal Processing

The signal processing block consists of Wiener filter denoising of the individual channels followed by a beamforming algorithm (delay-and-sum) [ref]. We use the BeamformIt tool [ref] for this purpose. We extract mel-frequency cepstral coefficients of order 19 from the resulting audio stream for further processing.

5.2. Feature Extraction

As mentioned before, the posterior probabilities of the components of a background Gaussian mixture model are used as features [ref]. Since no additional data were available for estimating this mixture model, we estimated it as follows.

A Gaussian mixture with a shared diagonal covariance matrix is adopted as the background model. From a reference speech/non-speech segmentation [ref], we extract all the speech segments. Speech segments longer than a maximum duration are uniformly chopped to generate a sequence of segments. From each of these segments we estimate the mean of one Gaussian component. The weight of each component is estimated as the ratio of the duration of that segment to the total duration. The shared covariance matrix is estimated from all the speech frames. Thus, the background Gaussian mixture model has as many components as the number of chopped segments of the speech file and is given by

f(x_t) = Σ_{j=1}^{N} w_j N(x_t; μ_j, C)    (9)

where

N(x_t; μ_j, C) = exp(−(1/2) (x_t − μ_j)^T C^{-1} (x_t − μ_j)) / sqrt((2π)^d |C|)    (10)

μ_j and w_j are the mean and weight of the j-th component, and C is the shared diagonal covariance matrix. N is the number of components, which is the same as the number of chopped segments.
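A sketch of this estimation procedure is given below; it assumes the speech frames are already available as a feature matrix together with the chopped-segment boundaries, and the function name is illustrative:

```python
import numpy as np

def estimate_background_gmm(speech_frames, segment_bounds):
    """Estimate the shared-covariance background GMM of eq. (9).

    speech_frames  : (T, d) MFCC frames labelled as speech
    segment_bounds : list of (start, end) frame indices of the chopped segments
    Returns the component means (N, d), weights (N,) and the shared diagonal
    covariance (d,); there is one component per chopped segment.
    """
    total = float(sum(end - start for start, end in segment_bounds))
    means, weights = [], []
    for start, end in segment_bounds:
        means.append(speech_frames[start:end].mean(axis=0))   # one mean per segment
        weights.append((end - start) / total)                 # weight = duration ratio
    shared_cov = speech_frames.var(axis=0)                    # from all speech frames
    return np.array(means), np.array(weights), shared_cov
```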

Once we have estimated the background model, we estimate the posterior probabilities as

F_i^post(x_t) = w_i N(x_t; μ_i, C) / Σ_{j=1}^{N} w_j N(x_t; μ_j, C),   i = 1, ..., N    (11)

The resulting features are optionally subjected to a "hard max" transformation, where the maximum value in the feature vector is set to one and all others are set to zero. Intuitively, this can be seen as a deterministic assignment of the input frame to a Gaussian component, in contrast with keeping the probabilistic assignment as such.
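The sketch below computes the frame-level posteriors of equation (11) and the optional hard-max transformation; it exploits the fact that the Gaussian normalization constant is shared by all components and therefore cancels in the posterior. Variable names are ours:

```python
import numpy as np

def component_posteriors(frames, means, weights, shared_cov, hard_max=False):
    """Frame-level posteriors of the background-GMM components, eq. (11).

    frames : (T, d), means : (N, d), weights : (N,), shared_cov : (d,)
    """
    t = frames.shape[0]
    diff = frames[:, None, :] - means[None, :, :]               # (T, N, d)
    # Shared covariance: the Gaussian normalization constant is identical for
    # every component and cancels in the ratio below, so it is omitted.
    log_lik = -0.5 * np.sum(diff * diff / shared_cov, axis=2)   # (T, N)
    log_post = np.log(weights)[None, :] + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)             # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    if hard_max:                                                # deterministic assignment
        hard = np.zeros_like(post)
        hard[np.arange(t), post.argmax(axis=1)] = 1.0
        return hard
    return post
```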

5.3. Speaker Clustering

Since it is computationally expensive to start the clustering from individual frames (each frame forming its own initial cluster), we decided to initialize it at the level of the chopped segments, i.e. each chopped segment is considered as a single observation (the input variable X in the aIB algorithm). The features extracted in the previous step (with or without the hard max operation) are averaged over the segment to compute the estimate of the distribution p(y|x). This is used as the input for the aIB clustering. The segment-level features were assigned prior probabilities based on the number of frames in each of them. However, the segment lengths being more or less equal, we observed that assigning uniform prior probabilities works equally well. The information bottleneck algorithm merges the clusters that result in the minimum decrease of mutual information. The merged cluster is assigned the weighted average of their distributions, and its weight becomes equal to the sum of the weights of the individual clusters. The value of the information bottleneck parameter β is tuned to an optimum value using the development data. The agglomeration is performed until all points are merged into one cluster. This process builds the tree of all possible partitions.
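The sketch below illustrates the two bookkeeping steps described above, namely the construction of the segment-level distributions p(y|x) and the update applied when two clusters are merged; the names are illustrative, and the selection of the pair to merge would use the cost of equation (6):

```python
import numpy as np

def segment_distributions(frame_posteriors, segment_bounds):
    """Average frame posteriors over each chopped segment to obtain p(y|x);
    the segment priors p(x) are set proportional to the segment length."""
    p_y_given_x, p_x = [], []
    for start, end in segment_bounds:
        p_y_given_x.append(frame_posteriors[start:end].mean(axis=0))
        p_x.append(float(end - start))
    p_x = np.array(p_x)
    p_x /= p_x.sum()
    return np.array(p_y_given_x), p_x

def merge_clusters(p_x, p_y_given_x, i, j):
    """Replace clusters i and j by their merge: the new distribution is the
    prior-weighted average and the new prior is the sum of the two priors."""
    w = p_x[i] + p_x[j]
    merged = (p_x[i] * p_y_given_x[i] + p_x[j] * p_y_given_x[j]) / w
    keep = [k for k in range(len(p_x)) if k not in (i, j)]
    return np.append(p_x[keep], w), np.vstack([p_y_given_x[keep], merged])
```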

5.4. Viterbi Realignment

The process of clustering the chopped segments has the consequence of forcing the speaker boundaries to align with the boundaries of those segments. To alleviate this problem we use an ergodic GMM-HMM to re-segment the audio stream after clustering.

The ergodic GMM-HMM has as many states as the number of speakers detected. The components of each state GMM are the same as the components of the background model; the mixture weights of each state are the corresponding component posteriors of that cluster.
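The paper does not detail the realignment implementation; the sketch below shows one way such a decoding could look under the stated assumptions (ergodic topology, UBM components shared across states, cluster posteriors as state-specific mixture weights). The self-loop probability, the uniform initial distribution and all names are our own choices:

```python
import numpy as np

def viterbi_realign(features, means, shared_cov, cluster_weights, self_loop=0.99):
    """Re-segment frames with an ergodic HMM: one state per detected speaker,
    UBM components shared across states, cluster posteriors as mixture weights.

    features        : (T, d) MFCC frames
    means           : (N, d) background-model component means
    shared_cov      : (d,)   shared diagonal covariance
    cluster_weights : (S, N) p(Y|x~) of every detected speaker cluster
    """
    t_frames, d = features.shape
    s = cluster_weights.shape[0]
    eps = 1e-12
    # Per-component log-likelihoods; the shared normalization constant is the
    # same for every component and state, so it cannot change the decoded path.
    diff = features[:, None, :] - means[None, :, :]                  # (T, N, d)
    log_comp = -0.5 * np.sum(diff * diff / shared_cov, axis=2)       # (T, N)
    # State emission log-likelihoods via a log-sum-exp over the components.
    scores = log_comp[:, None, :] + np.log(cluster_weights + eps)[None, :, :]
    m = scores.max(axis=2, keepdims=True)
    log_emit = (m + np.log(np.exp(scores - m).sum(axis=2, keepdims=True)))[:, :, 0]
    # Ergodic transition matrix with a self-loop as a crude duration control.
    if s > 1:
        log_trans = np.full((s, s), np.log((1 - self_loop) / (s - 1)))
    else:
        log_trans = np.zeros((1, 1))
    np.fill_diagonal(log_trans, np.log(self_loop))
    # Standard log-domain Viterbi decoding with a uniform initial distribution.
    delta = np.empty((t_frames, s))
    back = np.zeros((t_frames, s), dtype=int)
    delta[0] = log_emit[0] - np.log(s)
    for t in range(1, t_frames):
        step = delta[t - 1][:, None] + log_trans
        back[t] = step.argmax(axis=0)
        delta[t] = step.max(axis=0) + log_emit[t]
    path = np.empty(t_frames, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(t_frames - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path   # speaker label per frame
```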

We performed multiple iterations of segmentation and re-estimation of the HMM. This resulted in slightly better per-formance.

6. EXPERIMENTS AND RESULTS

6.1. Baseline

The system described in [ref] is chosen as the baseline system. The results of the baseline system on the RT06 evaluation data are listed in Table 1. The table lists missed speech (Miss), false alarm (FA), speech/non-speech error (spnsp), speaker error and diarization error rate (DER) for all the meetings in the dataset.

Table 1. Results of the baseline system

File                 Miss    FA  spnsp  spkr err    DER
CMU 20050912-0900   11.60  0.20  11.80      6.10  17.83
CMU 20050914-0900   10.30  0.00  10.30      4.90  15.26
EDI 20050216-1051    4.90  0.20   5.10     41.00  46.02
EDI 20050218-0900    4.30  0.10   4.40     19.40  23.79
NIST 20051024-0930   7.00  0.20   7.20      4.70  12.00
NIST 20051102-1323   6.10  0.10   6.20     17.60  23.73
TNO 20041103-1130    3.80  0.10   3.90     27.60  31.48
VT 20050623-1400     5.20  0.20   5.40     19.00  24.38
VT 20051027-1400     3.50  0.30   3.80     38.10  41.94
ALL                  6.50  0.10   6.60     18.90  25.54

6.2. Oracle Experiments

To study the performance of the aIB clustering independently of the model selection or Viterbi realignment, we conducted oracle experiments. The optimal partition (the one with minimum Diarization Error Rate) was picked manually from the clustering output. The results of the oracle system are presented in Table 2.

The resulting clusters were used to initialize a GMM-HMM as described in Section 5.4. Multiple iterations of segmentation and re-estimation were performed. This resulted in a decrease of approximately 3% in the overall diarization error. The detailed results are presented in Table 3.

Table 2. Results of the oracle system (without Viterbi realignment)

File                 Miss    FA  spnsp  spkr err    DER
CMU 20050912-0900   11.90  0.20  12.10      9.80  21.89
CMU 20050914-0900   10.60  0.00  10.60     15.60  26.22
EDI 20050216-1051    5.20  0.10   5.30     43.80  49.17
EDI 20050218-0900    4.70  0.10   4.80     37.10  41.87
NIST 20051024-0930   7.40  0.20   7.60     13.60  21.12
NIST 20051102-1323   6.50  0.10   6.60     14.80  21.30
TNO 20041103-1130    4.20  0.10   4.30     31.20  35.45
VT 20050623-1400     5.50  0.20   5.70     12.00  17.72
VT 20051027-1400     3.90  0.30   4.20     24.90  29.03
ALL                  6.80  0.10   6.90     22.10  29.08

Table 3. Results of the oracle system (with Viterbi realignment)

File                 Miss    FA  spnsp  spkr err    DER
CMU 20050912-0900   11.70  0.20  11.90      8.20  20.06
CMU 20050914-0900   10.40  0.00  10.40     13.00  23.46
EDI 20050216-1051    5.10  0.10   5.20     42.80  48.03
EDI 20050218-0900    4.50  0.10   4.60     32.70  37.30
NIST 20051024-0930   7.20  0.20   7.40     10.60  17.94
NIST 20051102-1323   6.30  0.10   6.40     11.70  18.02
TNO 20041103-1130    4.00  0.10   4.10     24.20  28.24
VT 20050623-1400     5.30  0.20   5.50      6.70  12.24
VT 20051027-1400     3.70  0.30   4.00     25.10  29.14
ALL                  6.70  0.10   6.80     19.00  25.81

6.3. Model Selection

Model selection was performed either by thresholding the normalized mutual information or by selecting the clustering with minimum description length.

Figure 2 depicts the variation of the normalized mutual information with the number of clusters. It can be observed that this quantity increases more rapidly in the beginning and then flattens out. A threshold tuned on the development data is used to determine the number of clusters. The corresponding results are listed in Table 4.

Fig. 2. I(X̃, Y)/I(X, Y) vs. number of clusters

Table 4. Results of the system with model selection using normalized mutual information (with Viterbi realignment)

File                 Miss    FA  spnsp  spkr err    DER
CMU 20050912-0900   11.90  0.20  12.10     10.80  22.82
CMU 20050914-0900   10.60  0.00  10.60     16.10  26.72
EDI 20050216-1051    5.20  0.10   5.30     44.10  49.39
EDI 20050218-0900    4.70  0.10   4.80     38.10  42.83
NIST 20051024-0930   7.40  0.20   7.60     14.00  21.53
NIST 20051102-1323   6.50  0.10   6.60     14.80  21.30
TNO 20041103-1130    4.20  0.10   4.30     31.20  35.45
VT 20050623-1400     5.50  0.20   5.70     12.10  17.81
VT 20051027-1400     3.90  0.30   4.20     47.80  51.95
ALL                  6.80  0.10   6.90     24.60  31.58

We also experimented with the MDL criterion of equation (8). Figure 3 illustrates the variation of the code length of the hypothesis L(H), of the data L(D|H) and of the total MDL length F_MDL.

7. ACKNOWLEDGEMENTS

TBD

Fig. 3. MDL model selection