tdoa_interspeech_valente

Integration of TDOA Features in Information Bottleneck Framework for Fast Speaker Diarization

Deepu Vijayasenan, Fabio Valente, Herve Bourlard

{deepu.vijayasenan, fabio.valente, herve.bourlard}@idiap.ch

Idiap Research Institute

Interspeech 2008

Introduction and Motivation

• Speaker diarization determines who spoke when in an audio stream.

• Meeting data is recorded with Multiple Distant Microphones (MDM).

• With MDM data, the Time Delay of Arrival (TDOA) of the signal at the different microphones can be used as information complementary to acoustic features (e.g. MFCC).

Introduction and Motivation

• Several recent efforts aim at faster-than-real-time diarization with low computational complexity.

• We previously introduced a non-parametric clustering system based on the Information Bottleneck principle [Tishby98].

• It achieves state-of-the-art results at very limited computational cost.

• How can we integrate other feature sets (e.g. TDOA features) into this framework?

Outline

• Information Bottleneck Principle

• Agglomerative and Sequential Optimization

• Diarization system

• Integration of TDOA

• Experiments and Results

• Conclusion

Information Bottleneck principle

• Let X be a set of elements to cluster into a set of clusters C.

• Let Y be a set of variables of interest associated with X.

• Assume that the conditional distribution p(y|x) is available ∀x ∈ X and ∀y ∈ Y.

• The IB principle states that the clustering C should preserve as much information as possible about Y, i.e. keep I(C, Y) high, while compressing X as much as possible, i.e. keeping I(X, C) low.

• This leads to the following objective function, to be maximized:

−β I(X, C) + I(C, Y)
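To make the objective concrete, the sketch below evaluates it for a hard clustering of a small joint distribution p(x, y). It is only an illustration under stated assumptions, not the authors' implementation: the function names, the list-of-rows representation, the hard assignment list `assign`, and the use of natural logarithms are all choices made for this example.

```python
import math

def mutual_information(joint):
    """I(A;B) in nats for a joint probability table given as a list of rows."""
    pa = [sum(row) for row in joint]          # marginal over rows
    pb = [sum(col) for col in zip(*joint)]    # marginal over columns
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:                         # 0 * log(0) is treated as 0
                total += p * math.log(p / (pa[i] * pb[j]))
    return total

def ib_objective(p_xy, assign, n_clusters, beta):
    """Evaluate -beta * I(X,C) + I(C,Y) for a hard clustering (illustrative).

    p_xy   : |X| x |Y| joint distribution p(x, y) as a list of rows
    assign : assign[i] is the cluster index of element x_i (an assumption
             of this sketch: every x belongs to exactly one cluster)
    """
    n_y = len(p_xy[0])
    p_xc = [[0.0] * n_clusters for _ in p_xy]
    p_cy = [[0.0] * n_y for _ in range(n_clusters)]
    for i, (row, c) in enumerate(zip(p_xy, assign)):
        p_xc[i][c] = sum(row)                 # p(x, c) = p(x) for the assigned c
        for j, p in enumerate(row):
            p_cy[c][j] += p                   # p(c, y) = sum of p(x, y) over x in c
    return -beta * mutual_information(p_xc) + mutual_information(p_cy)
```

With β = 0 the score reduces to I(C, Y); merging elements that have identical p(y|x) then loses nothing, while putting everything in one cluster drives both terms to zero.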

Information Bottleneck principle II

• The loss of mutual information δI_y caused by merging x_i and x_j is given by the Jensen-Shannon divergence between p(Y|x_i) and p(Y|x_j), weighted by the probability mass of the merged pair:

δI_y = (p(x_i) + p(x_j)) · JS(p(Y|x_i), p(Y|x_j))

• The JS divergence is a weighted sum of two KL divergences and is thus easy to compute for discrete probabilities:

JS(p(Y|x_i), p(Y|x_j)) = π_i D_KL[p(Y|x_i) || q(Y)] + π_j D_KL[p(Y|x_j) || q(Y)]   (1)

with q(Y) = π_i p(Y|x_i) + π_j p(Y|x_j)

• The distance (and thus the merging) is defined in terms of the variables Y relevant to the problem at hand.
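The merge cost and Eq. (1) translate almost directly into code. The sketch below is a minimal illustration, not the authors' code. It assumes the usual agglomerative IB choice of weights, π_i = p(x_i) / (p(x_i) + p(x_j)) and likewise for π_j, which the slide does not spell out; the function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """D_KL[p || q] in nats for two discrete distributions of equal length."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def merge_cost(p_xi, p_xj, py_given_xi, py_given_xj):
    """Loss of mutual information incurred by merging x_i and x_j.

    Assumption of this sketch: pi_i and pi_j are the priors p(x_i), p(x_j)
    normalized to sum to one.
    """
    total = p_xi + p_xj
    pi_i, pi_j = p_xi / total, p_xj / total
    # q(Y): distribution of the merged element
    q = [pi_i * a + pi_j * b for a, b in zip(py_given_xi, py_given_xj)]
    js = pi_i * kl_divergence(py_given_xi, q) + pi_j * kl_divergence(py_given_xj, q)
    return total * js
```

Two elements with identical p(Y|x) cost nothing to merge. Two equally likely elements with disjoint support cost (p(x_i) + p(x_j)) · log 2, the largest value the JS term can take with equal weights.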

IB optimization

• The objective function can be optimized in an agglomerative or a sequential fashion.

• Agglomerative IB [Slonim99]:

- Greedy approach.

- Starts with the trivial clustering of |X| clusters.

- At each step, merges the pair of clusters that produces the minimum loss in the objective function.

- Merging stops when a stopping criterion is met.

• Sequential IB [Slonim02]:

- Works on a partition with a fixed number of clusters.

- Randomly draws an element out of the partition and tries to assign it to another cluster.

- Typically used to refine an agglomerative solution.
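The agglomerative procedure can be sketched as a greedy loop over pairwise merge costs. This is a simplified illustration rather than the system described in these slides. It assumes the merge cost is only the loss in I(C, Y) (the β-weighted compression term is ignored), it stops at a fixed number of clusters instead of using a stopping criterion, and it recomputes all pairwise costs at every step for clarity rather than speed. All names are hypothetical.

```python
import math

def _kl(p, q):
    """D_KL[p || q] in nats."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def _merge(w_i, w_j, p_i, p_j):
    """Return (cost, merged distribution) for two clusters with masses w_i, w_j."""
    total = w_i + w_j
    q = [(w_i * a + w_j * b) / total for a, b in zip(p_i, p_j)]
    # (w_i + w_j) * JS, with the weights pi_i = w_i / total folded in
    return w_i * _kl(p_i, q) + w_j * _kl(p_j, q), q

def agglomerative_ib(p_x, p_y_given_x, n_clusters):
    """Greedy bottom-up clustering; returns sorted lists of member indices."""
    # Trivial clustering: one cluster per element (members, mass, p(Y|c))
    clusters = [([i], w, list(p)) for i, (w, p) in enumerate(zip(p_x, p_y_given_x))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                cost, q = _merge(clusters[a][1], clusters[b][1],
                                 clusters[a][2], clusters[b][2])
                if best is None or cost < best[0]:
                    best = (cost, a, b, q)
        _, a, b, q = best
        merged = (clusters[a][0] + clusters[b][0], clusters[a][1] + clusters[b][1], q)
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return sorted(sorted(members) for members, _, _ in clusters)
```

For four equally likely elements whose distributions fall into two obvious groups, `agglomerative_ib([0.25] * 4, [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]], 2)` groups elements 0 and 1 together and elements 2 and 3 together. A sequential refinement pass would then start from this partition.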

IB based diarization

IB based diarization II

• Conventional systems use the HMM/GMM framework together with the Bayesian Information Criterion (BIC). Most of the computation time is spent estimating the BIC between different clusters.

• The IB based system performs the clustering in the space of discrete probabilities p(Y|X) using the JS divergence. The resulting system is much faster.

System       Miss (%)   FA (%)   Spkr err (%)   DER (%)   RT factor
HMM/GMM      6.5        0.1      17.0           23.6      3.5
aIB          6.5        0.1      17.1           23.7      0.22
aIB + sIB    6.5        0.1      16.6           23.2      0.24

Table 1: RT06 eval results

TDOA features

• The Time Delay of Arrival (TDOA) between different channels can be used as an additional feature for diarization [Lathoud03, Anguera06].

• TDOA is computed using the Generalized Cross Correlation with Phase Transform (GCC-PHAT) algorithm.

• In the HMM/GMM framework, the acoustic vectors (MFCC) are augmented with TDOA features.

• In the IB framework, the combination can happen at the posterior distribution level:

p(y|x) = p(y|x, Mmfc) P(Mmfc) + p(y|x, Mdel) P(Mdel)

i.e. the cardinalities |Y| and |X| stay unchanged, and so does the complexity of the clustering.
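The posterior-level combination amounts to a weighted average of two posterior tables of the same shape. The sketch below is illustrative only. It assumes that the two stream weights sum to one, so a single `w_mfcc` parameter suffices, and that both streams provide posteriors over the same set Y for the same elements X. The function and parameter names are hypothetical.

```python
def combine_posteriors(p_mfcc, p_tdoa, w_mfcc):
    """Combine per-stream posteriors p(y|x, M) into a single p(y|x).

    p_mfcc, p_tdoa : lists of rows, one row per element x, one column per y
    w_mfcc         : weight of the MFCC stream; the TDOA stream gets
                     1 - w_mfcc (assumption: the two weights sum to one)
    """
    if not 0.0 <= w_mfcc <= 1.0:
        raise ValueError("w_mfcc must lie in [0, 1]")
    w_tdoa = 1.0 - w_mfcc
    return [
        [w_mfcc * a + w_tdoa * b for a, b in zip(row_m, row_t)]
        for row_m, row_t in zip(p_mfcc, p_tdoa)
    ]
```

Because the output has the same number of rows and columns as each input, any clustering routine that consumes p(y|x) can run on it unchanged, which is why the cost of the clustering step does not grow with the number of streams.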

TDOA features II

RT06 results

System       Miss (%)   FA (%)   Spkr err (%)   DER (%)   RT factor
HMM/GMM      6.5        0.1      9.3            15.9      3.63
aIB          6.5        0.1      11.4           18.0      0.34
aIB + sIB    6.5        0.1      9.7            16.3      0.41

• TDOA features reduce the speaker error by roughly 6 to 8% absolute in all systems (e.g. from 16.6% to 9.7% for aIB + sIB).

• The aIB + sIB system is 0.4% absolute worse than HMM/GMM but much faster (RT 0.41 vs. 3.63).

• Improvements are verified on 7 of the 9 meetings.
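The figures in the two RT06 tables can be cross-checked with a few lines of arithmetic. The snippet below copies the values from the tables. It assumes, as the slide order suggests, that the first table reports MFCC-only systems and the second reports MFCC+TDOA systems; the variable names are illustrative.

```python
# (Miss, FA, Spkr err, DER) in %, copied from the two RT06 tables.
# Assumption: first table = MFCC only, second table = MFCC + TDOA.
mfcc_only = {
    "HMM/GMM": (6.5, 0.1, 17.0, 23.6),
    "aIB": (6.5, 0.1, 17.1, 23.7),
    "aIB + sIB": (6.5, 0.1, 16.6, 23.2),
}
with_tdoa = {
    "HMM/GMM": (6.5, 0.1, 9.3, 15.9),
    "aIB": (6.5, 0.1, 11.4, 18.0),
    "aIB + sIB": (6.5, 0.1, 9.7, 16.3),
}

def der_is_consistent(row):
    """DER should equal Miss + FA + Spkr err, up to rounding noise."""
    miss, fa, spkr_err, der = row
    return abs(miss + fa + spkr_err - der) < 1e-9

all_consistent = all(
    der_is_consistent(row)
    for table in (mfcc_only, with_tdoa)
    for row in table.values()
)

# Absolute reduction in speaker error when TDOA features are added.
reductions = {
    system: round(mfcc_only[system][2] - with_tdoa[system][2], 1)
    for system in mfcc_only
}
print(all_consistent, reductions)
```

Every row satisfies DER = Miss + FA + Spkr err, and the absolute speaker error reductions come out as 7.7 (HMM/GMM), 5.7 (aIB) and 6.9 (aIB + sIB).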

Weights tuning

• The probabilities of the two streams, P(Mmfc) and P(Mdel), are heuristically tuned on a development data set.

[Figure: speaker error (y-axis, 17.5 to 20.5) as a function of the MFCC weight (x-axis, 0.2 to 1).]

Conclusions

• We introduced a diarization system based on the IB principle.

• Its results are similar to those of a state-of-the-art HMM/GMM system, but it is considerably faster.

• We described a combination of two feature streams (MFCC and TDOA) that leaves the cost of the clustering unchanged.

• Future work will extend the same combination to many other feature streams.

Thank You

IB based diarization

• X is defined as the set of uniformly segmented speech chunks obtained from the audio stream.

• Y is defined as the set of components of a background GMM estimated from the whole meeting data.

• The conditional distribution p(Y|X) can be obtained simply by Bayes' rule.

• An initial solution is obtained using agglomerative clustering and then refined using sequential clustering.

• The stopping criterion is based on the Minimum Description Length (MDL).

• The data are then re-aligned using a conventional HMM/GMM to refine the boundaries.
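The Bayes' rule step can be sketched as follows for a background GMM with diagonal covariances. This is a hedged illustration, not the authors' implementation. The diagonal-covariance form, the averaging of frame-level posteriors over the frames of a chunk, and all function names are assumptions made for this example; the slides only state that p(Y|X) follows from Bayes' rule.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (a - m) ** 2 / v
        for a, m, v in zip(x, mean, var)
    )

def frame_posteriors(x, weights, means, variances):
    """Bayes' rule: p(y|x) is proportional to w_y * N(x; mean_y, var_y)."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)                      # subtract the max for numerical stability
    unnorm = [math.exp(l - top) for l in logs]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def chunk_posteriors(frames, weights, means, variances):
    """p(y|chunk), taken here as the mean of the frame posteriors (assumption)."""
    acc = [0.0] * len(weights)
    for x in frames:
        for k, p in enumerate(frame_posteriors(x, weights, means, variances)):
            acc[k] += p / len(frames)
    return acc
```

Each chunk then contributes one row of the p(Y|X) table, with one column per GMM component, which is the input the clustering operates on.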

RT06 results

[Figure: per-meeting speaker error (y-axis, 0 to 50) for MFCC vs. MFCC+TDOA features on the RT06 meetings CMU 20050912-0900, CMU 20050914-0900, EDI 20050216-1051, EDI 20050218-0900, NIST 20051024-0930, NIST 20051102-1323, TNO 20041103-1130, VT 20050623-1400, VT 20051027-1400, and ALL.]
