
http://fvalente.zxq.net/presentations/pres_icassp2012.pdf


Speaker Diarization of Meetings based on Large TDOA vectors

Deepu Vijayasenan (1)

Fabio Valente (2)

(1) Universität des Saarlandes, (2) Idiap Research Institute

30 March 2012

D.Vijayasenan and F.Valente (UdS, Idiap) Diarization based on large TDOA vectors 30 March 2012 1 / 17

Introduction

TDOA features represent location information of the speakers

Features are estimated with respect to a reference channel

This is suboptimal, since TDOAs result from the different placement of the speakers with respect to the microphones

One alternative is to use the TDOA values across each pair of microphones


All TDOA pairs

Average TDOA values for speakers 6 and 8 in the IDI 20090129-1000 meeting (NIST reference notation)

[Figure: two panels plotting the average TDOA value against the TDOA index, one with indices 1 to 7 and one with indices 1 to 28, each with curves for Speaker 6 and Speaker 8]


All TDOA pairs

However, the TDOAs of all pairs are usually not used directly as features because of the high feature dimension

TDOAs across all pairs of microphones were employed [1] in determining initial clusters

The problem of large feature dimension is often addressed by reducing the feature dimension or by selecting the most prominent features

[1] Koh E.C.W. et al., “Speaker diarization using direction of arrival estimate and acoustic feature information: The I2R-NTU submission for the NIST RT 2007 evaluation”, in Lecture Notes in Computer Science

Objective

Use the large all-pair TDOA vectors directly as features, in combination with spectral features

The increased dimensionality has to be taken care of

Two diarization systems are studied

HMM/GMM based speaker diarization

Information Bottleneck based system


HMM/GMM system

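Each speaker $c_k$ is modeled by a minimum-duration HMM state with GMM emission probability $b_{c_k}(s_t)$:

$$\log b_{c_k}(s_t) = \log \sum_r w^r_{c_k} \, \mathcal{N}(s_t, \mu^r_{c_k}, \Sigma^r_{c_k})$$

Each feature stream is modeled with its own GMM, and the stream log-likelihoods are combined with weights $W_{mfcc}$ and $W_{tdoa}$:

$$\log L_{c_k}(s_t) = W_{mfcc} \log\left[ b^{mfcc}_{c_k}(s^{mfcc}_t) \right] + W_{tdoa} \log\left[ b^{tdoa}_{c_k}(s^{tdoa}_t) \right]$$

Agglomerative clustering using a modified BIC criterion to determine the number of clusters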
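The weighted log-likelihood combination on this slide can be illustrated with a minimal NumPy sketch. It assumes diagonal covariance matrices and represents each GMM as a hypothetical `(weights, means, variances)` tuple; the function names are illustrative and not taken from the authors' system.

```python
import numpy as np

def gmm_log_likelihood(s, weights, means, variances):
    """log b(s) = log sum_r w_r N(s; mu_r, Sigma_r), assuming diagonal Sigma_r."""
    s = np.asarray(s, dtype=float)
    diff = s - means                                   # shape (R, D)
    log_gauss = -0.5 * (np.log(2.0 * np.pi * variances).sum(axis=1)
                        + (diff ** 2 / variances).sum(axis=1))
    log_terms = np.log(weights) + log_gauss            # one term per component
    m = log_terms.max()                                # log-sum-exp for stability
    return float(m + np.log(np.exp(log_terms - m).sum()))

def combined_log_likelihood(s_mfcc, s_tdoa, gmm_mfcc, gmm_tdoa, w_mfcc, w_tdoa):
    """Weighted sum of the per-stream GMM log-likelihoods for one speaker model."""
    return (w_mfcc * gmm_log_likelihood(s_mfcc, *gmm_mfcc)
            + w_tdoa * gmm_log_likelihood(s_tdoa, *gmm_tdoa))
```

Because the TDOA log-likelihood is a sum over all vector dimensions, its magnitude grows with the TDOA dimension, which is one plausible reason the TDOA weight needs to shrink when all pairs are used.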


Information Bottleneck (IB) Principle

Distributional clustering based on maximizing mutual information w.r.t. a set of relevance variables

Given a set of input variables $X$ and relevance variables $Y$ that contain important information about the problem, the IB principle seeks a clustering $C$ that maximizes:

$$F = I(Y, C) - \frac{1}{\beta} I(C, X)$$

Optimized w.r.t. the stochastic mapping $P(C|X)$

Performed using sequential or agglomerative optimization
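As a rough illustration of the objective, the sketch below evaluates F for a given mapping p(c|x) on discrete distributions. It assumes the usual IB Markov chain, in which the cluster depends on Y only through X, so that p(c, y) = sum_x p(x) p(c|x) p(y|x); the function and argument names are illustrative.

```python
import numpy as np

def mutual_information(p_joint):
    """Mutual information (in nats) of a discrete joint distribution."""
    p_joint = np.asarray(p_joint, dtype=float)
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0                       # skip zero cells (0 log 0 = 0)
    return float((p_joint[mask] * np.log(p_joint[mask] / (p_a @ p_b)[mask])).sum())

def ib_objective(p_x, p_y_given_x, p_c_given_x, beta):
    """F = I(Y, C) - (1 / beta) * I(C, X) for a stochastic mapping p(c|x)."""
    p_x = np.asarray(p_x, dtype=float)
    p_y_given_x = np.asarray(p_y_given_x, dtype=float)   # shape (N, |Y|)
    p_c_given_x = np.asarray(p_c_given_x, dtype=float)   # shape (N, K)
    p_xc = p_x[:, None] * p_c_given_x                    # joint p(x, c)
    p_cy = p_xc.T @ p_y_given_x                          # joint p(c, y)
    return mutual_information(p_cy) - mutual_information(p_xc) / beta
```

A large beta favors preserving information about Y, while a small beta favors compression of X.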


IB Speaker diarization

Speech segments as input variables $X$

Components of a background GMM as relevance variables $Y$

Agglomerative clustering is used to optimize the IB objective function

System initialized with uniform linear segmentation

At each step, the two clusters whose merge results in the minimum loss of the IB function are merged (JS divergence in terms of $p(y|x)$)

Number of speakers determined based on a threshold on normalized mutual information

Feature stream combination is based on a weighted sum of the distributions $p(y|x_i)$:

$$p(y|x) = W_{mfcc}\, p(y|x^{mfcc}, M_{mfcc}) + W_{tdoa}\, p(y|x^{tdoa}, M_{tdoa})$$
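The stream combination and the merge step can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes both streams share the same number of relevance variables, that the weights sum to one, and that the merge cost keeps only the relevance term (weighted JS divergence between p(y|c_i) and p(y|c_j)), which corresponds to a large beta.

```python
import numpy as np

def combine_streams(p_y_mfcc, p_y_tdoa, w_mfcc, w_tdoa):
    """p(y|x) as a weighted sum of the per-stream posteriors (rows are segments)."""
    return w_mfcc * np.asarray(p_y_mfcc, dtype=float) + w_tdoa * np.asarray(p_y_tdoa, dtype=float)

def kl_divergence(p, q):
    """KL(p || q) in nats; q must be positive wherever p is positive."""
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

def merge_cost(p_ci, p_cj, p_y_ci, p_y_cj):
    """Loss of I(Y, C) when clusters i and j are merged (weighted JS divergence)."""
    total = p_ci + p_cj
    pi_i, pi_j = p_ci / total, p_cj / total
    p_y_merged = pi_i * p_y_ci + pi_j * p_y_cj
    js = pi_i * kl_divergence(p_y_ci, p_y_merged) + pi_j * kl_divergence(p_y_cj, p_y_merged)
    return total * js

def best_merge(p_c, p_y_given_c):
    """Return the pair of cluster indices with the smallest merge cost."""
    p_c = np.asarray(p_c, dtype=float)
    p_y_given_c = np.asarray(p_y_given_c, dtype=float)
    pairs = [(i, j) for i in range(len(p_c)) for j in range(i + 1, len(p_c))]
    costs = [merge_cost(p_c[i], p_c[j], p_y_given_c[i], p_y_given_c[j]) for i, j in pairs]
    return pairs[int(np.argmin(costs))]
```

Since each p(y|x_i) is already a normalized distribution, the combined p(y|x) stays normalized whatever the feature dimension of the individual streams.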


HMM/GMM vs. IB diarization

                     HMM/GMM                       IB

clustering           agglomerative                 agglomerative

merge criterion      BIC                           JS divergence

number of spkr       BIC                           normalized MI

multiple features    log-likelihood combination    combination of rel. var. distributions


Experiments

Evaluate the performance of “all pair TDOA features” in the context of the two diarization systems

Performed on a dataset of 24 meetings across 6 meeting rooms

TDOA values corresponding to all delay pairs are computed

Delay-and-sum beamforming using a reference channel to compute MFCC features

Combination weights are estimated on a reference data set consisting of 10 meetings


TDOA Features

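Estimated using GCC-PHAT

$$G_{PHAT}(i, j) = \frac{X_i(f)\, X_j^*(f)}{|X_i(f)|\,|X_j(f)|}$$

$$d_{PHAT}(i, j) = \arg\max_d R_{PHAT}(d)$$

where $R_{PHAT}(d)$ is the inverse Fourier transform of $G_{PHAT}(i, j)$

Ref channel TDOA: dimension for an M-channel recording is $M - 1$

All pair TDOA: dimension for an M-channel recording is $\frac{1}{2}M(M - 1)$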
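A minimal NumPy sketch of GCC-PHAT delay estimation over all microphone pairs is given below. It processes whole signals in one block and returns integer sample delays; a real front end would typically work frame by frame, and the names `gcc_phat_delay`, `all_pair_tdoa` and `max_delay` are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def gcc_phat_delay(x_i, x_j, max_delay):
    """Delay (in samples) of x_i relative to x_j, searched in [-max_delay, max_delay]."""
    n = len(x_i) + len(x_j)                       # zero-pad to avoid circular wrap-around
    X_i = np.fft.rfft(x_i, n=n)
    X_j = np.fft.rfft(x_j, n=n)
    cross = X_i * np.conj(X_j)
    g_phat = cross / np.maximum(np.abs(cross), 1e-12)   # keep the phase only
    r_phat = np.fft.irfft(g_phat, n=n)
    # Reorder so that the lags run from -max_delay to +max_delay (max_delay >= 1).
    r_phat = np.concatenate((r_phat[-max_delay:], r_phat[:max_delay + 1]))
    return int(np.argmax(r_phat)) - max_delay

def all_pair_tdoa(channels, max_delay):
    """TDOA vector of dimension M(M-1)/2 for an M-channel recording."""
    pairs = combinations(range(len(channels)), 2)
    return np.array([gcc_phat_delay(channels[i], channels[j], max_delay) for i, j in pairs])
```

For an 8-channel recording this yields a 28-dimensional vector, compared with 7 values when a single reference channel is used.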


Weights Estimation

[Figure: speaker error on the development data as a function of the TDOA weight (logarithmic axis from 10^-4 to 10^0), with curves for the HMM/GMM and IB systems]

                     aIB          HMM/GMM          #TDOA vec.

Ref. Channel TDOA    (0.7,0.3)    (0.9,0.1)        M-1

All Pairs TDOA       (0.8,0.2)    (0.999,0.001)    M(M-1)/2

HMM/GMM weights are optimized on a logarithmic scale

The IB weights do not change considerably
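A search over weights on a logarithmic scale could look like the sketch below. It assumes that the weight pairs are ordered (W_mfcc, W_tdoa) and sum to one, and it takes a hypothetical callable `dev_error(w_mfcc, w_tdoa)` that runs diarization on the development meetings and returns the speaker error; none of these names come from the original work.

```python
import numpy as np

def search_tdoa_weight(dev_error, low=1e-4, high=1.0, num=13):
    """Pick the TDOA weight from a log-spaced grid that minimizes the dev error."""
    candidates = np.logspace(np.log10(low), np.log10(high), num=num)
    # Assumption: the two stream weights sum to one.
    errors = [dev_error(1.0 - w, w) for w in candidates]
    best = int(np.argmin(errors))
    w_tdoa = float(candidates[best])
    return (1.0 - w_tdoa, w_tdoa), errors[best]
```

A logarithmic grid covers weights such as 0.001 and 0.1 with the same resolution, which a linear grid between 0 and 1 would not.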


Speaker Error

                     aIB           HMM/GMM

Ref. Channel TDOA    12.3          14.3

All Pairs TDOA       8.2 (+33%)    10.8 (+32%)

Both systems benefit from combining all pair TDOAs with MFCC features


Speaker Error

HMM/GMM performance degrades when the number of microphones is small

The IB system appears more robust to changes in the feature dimension

[Figure: two panels, HMM/GMM and Information Bottleneck, plotting the speaker error for each meeting ID (1 to 24, plus ALL), each comparing All Pairs and Reference Channel TDOA features]


Conclusion

Proposed to use “all pair TDOAs” directly in two speaker diarization systems

In the case of the HMM/GMM system, the feature weights change considerably compared to using reference channel TDOAs

Log-likelihood combination

Benefits from the additional delays when the weights are optimized on a logarithmic scale

Performance degrades with a low number of microphones

The weighting is only marginally affected in the case of the IB system

Combination in a normalized relevance variable space

Improves consistently across meetings

Whenever weighting issues are properly handled, “all pair TDOAs” reduce the speaker error by ≈ 30% compared to TDOA values with respect to the reference channel alone


Thank You


ID   Meet.                #Mic    ID   Meet.                 #Mic
1    CMU 20050912-0900    2       13   NIST 20051024-0930    8
2    CMU 20050914-0900    2       14   NIST 20051102-1323    8
3    CMU 20061115-1030    3       15   NIST 20051104-1515    7
4    CMU 20061115-1530    3       16   NIST 20060216-1347    7
5    EDI 20050216-1051    16      17   NIST 20080201-1405    7
6    EDI 20050218-0900    16      18   NIST 20080227-1501    7
7    EDI 20061113-1500    16      19   NIST 20080307-0955    7
8    EDI 20061114-1500    16      20   TNO 20041103-1130     10
9    EDI 20071128-1000    16      21   VT 20050408-1500      4
10   EDI 20071128-1500    16      22   VT 20050425-1000      7
11   IDI 20090128-1600    16      23   VT 20050623-1400      4
12   IDI 20090129-1000    16      24   VT 20051027-1400      4
