Integration of TDOA Features in Information Bottleneck Framework for Fast Speaker Diarization Deepu Vijayasenan, Fabio Valente, Hervé Bourlard {deepu.vijayasenan,fabio.valente, herve.bourlard}@idiap.ch Idiap Research Institute Interspeech 2008


http://fvalente.zxq.net/presentations/tdoa_interspeech_valente.pdf


Page 1: tdoa_interspeech_valente

Integration of TDOA Features in Information Bottleneck Framework for Fast Speaker Diarization

Deepu Vijayasenan, Fabio Valente, Hervé Bourlard

{deepu.vijayasenan,fabio.valente, herve.bourlard}@idiap.ch

Idiap Research Institute

Interspeech 2008

Page 2: tdoa_interspeech_valente

Introduction and Motivation

• Speaker diarization determines "who spoke when" in an audio stream.

• For meeting data, the recording is done with Multiple Distant Microphones (MDM).

• With MDM data, the Time Delay of Arrival (TDOA) of the signal at the different microphones can be used as information complementary to the acoustic features (e.g. MFCC).

Page 3: tdoa_interspeech_valente

Introduction and Motivation

• Several recent efforts aim at faster-than-real-time diarization with low computational complexity.

• We previously introduced a non-parametric clustering system based on the Information Bottleneck principle [Tishby98].

• It achieves state-of-the-art results with very limited computational complexity.

• How can we integrate other feature sets (e.g. TDOA features) into this framework?

Page 4: tdoa_interspeech_valente

Outline

• Information Bottleneck Principle

• Agglomerative and Sequential Optimization

• Diarization system

• Integration of TDOA

• Experiments and Results

• Conclusion

Page 5: tdoa_interspeech_valente

Information Bottleneck principle

• Let X be a set of elements to cluster into a set of clusters C.

• Let Y be a set of variables of interest associated with X.

• Assume that the conditional distribution p(y|x) is available ∀x ∈ X and ∀y ∈ Y.

• The IB principle states that the clustering C should preserve as much information as possible about Y (maximize I(C, Y)) while compressing X as much as possible (minimize I(X, C)).

• This leads to the following objective function, to be maximized, in which β weights the compression term:

− β I(X, C) + I(C, Y)
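As a rough illustration, the objective can be evaluated for a hard partition of X. This is a minimal sketch, not the authors' implementation: the function names (`mutual_information`, `ib_objective`), the list-based data layout, and the use of natural logarithms are all assumptions made for the example.

```python
import math

def mutual_information(joint):
    """Mutual information (in nats) of a joint distribution given as a list of rows."""
    row_marg = [sum(row) for row in joint]
    col_marg = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0.0:  # zero cells contribute nothing
                mi += p * math.log(p / (row_marg[i] * col_marg[j]))
    return mi

def ib_objective(p_x, p_y_given_x, assignment, n_clusters, beta):
    """Evaluate -beta * I(X,C) + I(C,Y) for a hard assignment of each x to a cluster.

    p_x[i] is p(x_i), p_y_given_x[i] is the list p(y|x_i), and
    assignment[i] is the (hypothetical) cluster index of x_i.
    """
    n_y = len(p_y_given_x[0])
    # Joint p(x, c): a hard assignment puts all the mass of x on one cluster.
    joint_xc = [[p_x[i] if assignment[i] == c else 0.0 for c in range(n_clusters)]
                for i in range(len(p_x))]
    # Joint p(c, y) = sum over x in c of p(x) p(y|x).
    joint_cy = [[0.0] * n_y for _ in range(n_clusters)]
    for i, c in enumerate(assignment):
        for y in range(n_y):
            joint_cy[c][y] += p_x[i] * p_y_given_x[i][y]
    return mutual_information(joint_cy) - beta * mutual_information(joint_xc)
```

With four equiprobable elements whose p(y|x) is concentrated on one of two values of Y, grouping the elements that agree on Y scores higher than mixing them, because I(C, Y) is larger while I(X, C) is the same.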

Page 6: tdoa_interspeech_valente

Information Bottleneck principle II

• The loss of mutual information δI_y caused by merging x_i and x_j is given by the Jensen-Shannon (JS) divergence between p(Y|x_i) and p(Y|x_j), weighted by the total probability of the pair:

δI_y = (p(x_i) + p(x_j)) · JS(p(Y|x_i), p(Y|x_j))

• The JS divergence is a weighted sum of two KL divergences and is therefore easy to compute for discrete distributions:

JS(p(Y|x_i), p(Y|x_j)) = π_i D_KL[p(Y|x_i) || q(Y)] + π_j D_KL[p(Y|x_j) || q(Y)]   (1)

with q(Y) = π_i p(Y|x_i) + π_j p(Y|x_j), π_i = p(x_i) / (p(x_i) + p(x_j)) and π_j = p(x_j) / (p(x_i) + p(x_j)).

• The distance (and thus the merging) is defined in terms of the variables Y relevant to the problem at hand.
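The merge cost can be computed directly from the discrete distributions. The sketch below is illustrative only; the function names (`kl_divergence`, `merge_cost`) are made up for the example, and the weights π_i, π_j are assumed to be the priors of the two elements normalized by their sum.

```python
import math

def kl_divergence(p, q):
    """D_KL[p || q] in nats for discrete distributions given as lists."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def merge_cost(p_i, p_j, py_i, py_j):
    """Loss of mutual information caused by merging two elements.

    p_i, p_j are the priors p(x_i), p(x_j) (assumed strictly positive);
    py_i, py_j are the distributions p(Y|x_i), p(Y|x_j).
    """
    total = p_i + p_j
    pi_i, pi_j = p_i / total, p_j / total           # normalized weights
    q = [pi_i * a + pi_j * b for a, b in zip(py_i, py_j)]
    js = pi_i * kl_divergence(py_i, q) + pi_j * kl_divergence(py_j, q)
    return total * js
```

Two elements with identical p(Y|x) cost nothing to merge, while two elements with non-overlapping distributions give the largest possible JS divergence (log 2 for equal weights).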

Page 7: tdoa_interspeech_valente

IB optimization

• The objective function can be optimized in an agglomerative or a sequential fashion.

• Agglomerative IB [Slonim99]:

- Greedy approach.

- Starts with the trivial clustering of |X| clusters (one element per cluster).

- At each step, merges the two clusters whose merging produces the minimum loss in the objective function.

- Merging stops when a stopping criterion is met.

• Sequential IB [Slonim02]:

- Works on a partition with a fixed number of clusters.

- Randomly draws an element out of the partition and tries to assign it to another cluster.

- Typically used for refining the agglomerative solution.
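The agglomerative procedure can be sketched as a greedy loop. This is a simplified illustration under stated assumptions, not the algorithm of [Slonim99] in full: the merge cost is only the loss of information about Y (the compression term weighted by β is ignored), a fixed target number of clusters stands in for the stopping criterion, and all names are hypothetical.

```python
import math

def _kl(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def _merge_cost(w_a, w_b, py_a, py_b):
    """(w_a + w_b) * JS divergence between the two cluster distributions."""
    total = w_a + w_b
    pa, pb = w_a / total, w_b / total
    q = [pa * u + pb * v for u, v in zip(py_a, py_b)]
    return total * (pa * _kl(py_a, q) + pb * _kl(py_b, q))

def agglomerative_ib(p_x, p_y_given_x, n_clusters):
    """Greedy bottom-up clustering; returns a list of clusters (lists of indices)."""
    # Trivial initial clustering: each element is its own cluster.
    clusters = [([i], p_x[i], list(p_y_given_x[i])) for i in range(len(p_x))]
    while len(clusters) > n_clusters:
        best = None
        # Exhaustive search for the cheapest pair to merge.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                cost = _merge_cost(clusters[a][1], clusters[b][1],
                                   clusters[a][2], clusters[b][2])
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        members_a, w_a, py_a = clusters[a]
        members_b, w_b, py_b = clusters[b]
        w = w_a + w_b
        # The merged cluster's p(Y|c) is the prior-weighted average.
        merged = [(w_a * u + w_b * v) / w for u, v in zip(py_a, py_b)]
        clusters[a] = (members_a + members_b, w, merged)
        del clusters[b]
    return [sorted(members) for members, _, _ in clusters]
```

A sequential refinement would then repeatedly take one element out of its cluster and reassign it to the cluster with the lowest merge cost, keeping the number of clusters fixed.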

Page 8: tdoa_interspeech_valente

IB based diarization

Page 9: tdoa_interspeech_valente

IB based diarization II

• Conventional systems use the HMM/GMM framework together with the Bayesian Information Criterion (BIC). Most of the computation time is spent estimating the BIC between different clusters.

• The IB based system performs the clustering in the space of discrete probabilities p(Y|X) using the JS divergence. The resulting system is much faster.

System      Miss   FA    Spkr err   DER    RT
HMM/GMM     6.5    0.1   17.0       23.6   3.5
aIB         6.5    0.1   17.1       23.7   0.22
aIB + sIB   6.5    0.1   16.6       23.2   0.24

Table 1: RT06 eval results (error rates in %; RT = real-time factor)

Page 10: tdoa_interspeech_valente

TDOA features

• The Time Delay of Arrival (TDOA) between different channels can be used as an additional feature for diarization [Lathoud03, Anguera06].

• TDOA is computed using the Generalized Cross Correlation with Phase Transform (GCC-PHAT) algorithm.

• In HMM/GMM systems, the acoustic vectors (MFCC) are augmented with the TDOA features.

• In the IB framework, the combination can happen at the level of the posterior distributions:

p(y|x) = p(y|x, M_mfc) P(M_mfc) + p(y|x, M_del) P(M_del)

i.e. the cardinalities |Y| and |X| stay unchanged, and so does the complexity of the clustering.
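The posterior-level combination amounts to a weighted average of two matrices of the same shape. A minimal sketch, assuming that the two stream probabilities sum to one (so only the MFCC weight is passed) and that both streams share the same set Y; the function name `combine_posteriors` is hypothetical.

```python
def combine_posteriors(p_mfcc, p_tdoa, w_mfcc):
    """Combine p(y|x, M_mfc) and p(y|x, M_del) into a single p(y|x).

    p_mfcc and p_tdoa are lists of rows, one row p(Y|x) per element x.
    w_mfcc plays the role of P(M_mfc); P(M_del) is assumed to be 1 - w_mfcc.
    """
    if not 0.0 <= w_mfcc <= 1.0:
        raise ValueError("w_mfcc must lie in [0, 1]")
    w_tdoa = 1.0 - w_mfcc
    return [[w_mfcc * a + w_tdoa * b for a, b in zip(row_m, row_t)]
            for row_m, row_t in zip(p_mfcc, p_tdoa)]
```

The output has exactly the same shape as each input, which is why the clustering that follows costs the same as in the single-stream case.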

Page 11: tdoa_interspeech_valente

TDOA features II
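The GCC-PHAT delay estimate mentioned on the previous slide can be illustrated on two short signals. This is a toy sketch under several assumptions: it uses a naive O(N²) DFT to stay dependency-free, returns integer-sample delays without interpolation, and all names are hypothetical. It is not the feature extraction used in the experiments.

```python
import cmath

def _dft(x, inverse=False):
    """Naive discrete Fourier transform (forward or inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def gcc_phat_delay(sig, ref):
    """Delay of sig relative to ref, in samples, estimated with GCC-PHAT."""
    n = 2 * max(len(sig), len(ref))            # zero-pad to limit wrap-around
    a = list(sig) + [0.0] * (n - len(sig))
    b = list(ref) + [0.0] * (n - len(ref))
    spec_a, spec_b = _dft(a), _dft(b)
    cross = [u * v.conjugate() for u, v in zip(spec_a, spec_b)]
    # PHAT weighting: discard the magnitude, keep only the phase.
    weighted = [c / (abs(c) + 1e-12) for c in cross]
    cc = [v.real for v in _dft(weighted, inverse=True)]
    peak = max(range(n), key=lambda k: cc[k])
    # Indices in the upper half correspond to negative lags.
    return peak if peak < n // 2 else peak - n
```

Dividing the returned lag by the sampling rate gives the delay in seconds; one such value per microphone pair would form the TDOA feature vector.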

Page 12: tdoa_interspeech_valente

RT06 results

System      Miss   FA    Spkr err   DER    RT
HMM/GMM     6.5    0.1   9.3        15.9   3.63
aIB         6.5    0.1   11.4       18.0   0.34
aIB + sIB   6.5    0.1   9.7        16.3   0.41

• TDOA features reduce the speaker error by roughly 7% absolute in all systems (compare Table 1).

• The aIB + sIB system is 0.4% absolute worse in DER than HMM/GMM (16.3 vs. 15.9) but much faster (0.41 vs. 3.63 RT).

• Improvements are verified on 7 of the 9 meetings.
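The figures in the two result tables can be cross-checked with a few lines of arithmetic. The sketch assumes that Table 1 reports the MFCC-only systems and that DER is the sum of missed speech, false alarm and speaker error, which is consistent with every row of both tables; the variable and function names are made up for the example.

```python
# Values copied from the two RT06 tables (all in %).
MISS, FA = 6.5, 0.1
SPKR_ERR = {
    # system: (speaker error in Table 1, speaker error with TDOA)
    "HMM/GMM": (17.0, 9.3),
    "aIB": (17.1, 11.4),
    "aIB + sIB": (16.6, 9.7),
}

def der(miss, fa, spkr_err):
    """Diarization error rate as the sum of its three components."""
    return round(miss + fa + spkr_err, 1)

def absolute_reduction(before, after):
    """Absolute reduction in speaker error, in percentage points."""
    return round(before - after, 1)

for system, (before, after) in SPKR_ERR.items():
    print(system, der(MISS, FA, after), absolute_reduction(before, after))
```

The per-system reductions differ somewhat, so the 7% figure is best read as an approximate average over the three systems.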

Page 13: tdoa_interspeech_valente

Weights tuning

• The stream probabilities P(M_mfc) and P(M_del) are tuned heuristically on a development data set.

[Figure: speaker error (y-axis, 17.5 to 20.5) as a function of the MFCC weight (x-axis, 0.2 to 1).]
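One simple way to carry out such tuning is a sweep over candidate weights, keeping the one with the lowest development-set error. The sketch below only illustrates the idea: `tune_mfcc_weight` and `toy_dev_error` are hypothetical names, and the error curve is a synthetic stand-in, not data from the experiments.

```python
def tune_mfcc_weight(dev_error, weights):
    """Return (best weight, its error) for a callable dev_error(weight)."""
    best_w = min(weights, key=dev_error)
    return best_w, dev_error(best_w)

def toy_dev_error(w):
    """Synthetic error curve for illustration only (NOT measured results)."""
    return abs(w - 0.6)

# Candidate MFCC weights 0.2, 0.3, ..., 1.0; the TDOA weight is assumed to be 1 - w.
candidate_weights = [round(0.2 + 0.1 * k, 1) for k in range(9)]
best_weight, best_error = tune_mfcc_weight(toy_dev_error, candidate_weights)
```

In a real setup, `dev_error` would run the full diarization on the development meetings with the given weight and return the resulting speaker error.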

Page 14: tdoa_interspeech_valente

Conclusions

• We introduced a diarization system based on the IB principle.

• Results are similar to the state-of-the-art HMM/GMM system, but the system is considerably faster.

• We described a combination of two feature streams (MFCC and TDOA) that leaves the cost of the clustering unchanged.

• Future work will extend the same combination scheme to other feature streams.

Page 15: tdoa_interspeech_valente

Thank You

Page 16: tdoa_interspeech_valente

IB based diarization

• X is defined as the set of uniformly segmented speech chunks obtained from the audio stream.

• Y is defined as the set of components of a background GMM estimated from the whole meeting data.

• The conditional distribution p(Y|X) can be obtained simply by Bayes' rule.

• An initial solution is obtained using agglomerative clustering and then refined using sequential clustering.

• The stopping criterion is based on the Minimum Description Length (MDL).

• The data are then re-aligned using a conventional HMM/GMM to refine the boundaries.
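The Bayes' rule step can be sketched for a GMM with scalar features. This is a simplified illustration, not the system's code: real features are multi-dimensional, and averaging the frame-level posteriors over a chunk to obtain p(Y|x) is an assumption made here. All function names are hypothetical.

```python
import math

def gaussian_pdf(value, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (value - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def component_posteriors(frame, weights, means, variances):
    """p(y | frame) over the GMM components y, by Bayes' rule."""
    likelihoods = [w * gaussian_pdf(frame, m, v)
                   for w, m, v in zip(weights, means, variances)]
    total = sum(likelihoods)
    return [l / total for l in likelihoods]

def segment_posteriors(frames, weights, means, variances):
    """p(Y | x) for a chunk x, taken as the average of its frame posteriors."""
    per_frame = [component_posteriors(f, weights, means, variances) for f in frames]
    return [sum(column) / len(per_frame) for column in zip(*per_frame)]
```

Each chunk is thus represented by a discrete distribution over the GMM components, which is the input expected by the JS-divergence-based clustering.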

Page 17: tdoa_interspeech_valente

RT06 results

[Figure: speaker error (y-axis, 0 to 50) per meeting for MFCC and MFCC+TDOA. Meetings: CMU 20050912-0900, CMU 20050914-0900, EDI 20050216-1051, EDI 20050218-0900, NIST 20051024-0930, NIST 20051102-1323, TNO 20041103-1130, VT 20050623-1400, VT 20051027-1400, and ALL.]