http://fvalente.zxq.net/presentations/pres_icassp2012.pdf
Speaker Diarization of Meetings based on Large TDOA vectors
Deepu Vijayasenan¹
Fabio Valente²
¹Universität des Saarlandes, ²Idiap Research Institute
30 March 2012
D.Vijayasenan and F.Valente (UdS, Idiap) Diarization based on large TDOA vectors 30 March 2012 1 / 17
Introduction
TDOA features represent location information of the speakers
Features are estimated with respect to a reference channel
This is suboptimal, since TDOAs result from the different placement of speakers with respect to the microphones
One alternative is to use TDOA values across each pair of microphones
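To make the difference between the two feature sets concrete, here is a minimal sketch that enumerates the microphone pairs in each case. The function names and the choice of channel 0 as the reference are illustrative assumptions, not part of the original slides.

```python
from itertools import combinations


def reference_pairs(num_mics, ref=0):
    """Pairs (ref, m) used for reference-channel TDOAs (ref=0 is an assumption)."""
    return [(ref, m) for m in range(num_mics) if m != ref]


def all_pairs(num_mics):
    """Every unordered microphone pair (i, j) with i < j."""
    return list(combinations(range(num_mics), 2))


if __name__ == "__main__":
    # Number of TDOA values per frame for a few array sizes.
    for num_mics in (4, 8, 16):
        print(num_mics, len(reference_pairs(num_mics)), len(all_pairs(num_mics)))
```

With 8 microphones this gives 7 reference-channel delays and 28 all-pair delays, so the feature dimension grows quadratically rather than linearly with the number of channels.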
All TDOA pairs
Average TDOA values for speakers 6 and 8 in the IDI 20090129-1000 meeting (NIST reference notation)
[Figure: two panels plotting the average TDOA value against the TDOA index for Speaker 6 and Speaker 8; one panel spans TDOA indices 1 to 7, the other 1 to 28]
All TDOA pairs
However, the full set of TDOA pairs is not used directly as features because of the high feature dimension
TDOAs across all pairs of microphones have been employed [1] for determining initial clusters
The problem of large feature dimension is often addressed by reducing the feature dimension or by selecting the most prominent features
[1] Koh E.C.W. et al., “Speaker diarization using direction of arrival estimate and acoustic feature information: The I2R-NTU submission for the NIST RT 2007 evaluation,” in Lecture Notes in Computer Science
Objective
Use the large TDOA vectors directly as features, in combination with spectral features
The increased dimensionality has to be taken care of
Two diarization systems are studied
  HMM/GMM based speaker diarization
  Information Bottleneck based system
HMM/GMM system
Each speaker $c_k$ is modeled by a minimum-duration HMM state with GMM emission probability $b_{c_k}(s_t)$,
$$\log b_{c_k}(s_t) = \log \sum_r w^r_{c_k}\, \mathcal{N}(s_t, \mu^r_{c_k}, \Sigma^r_{c_k})$$
Each feature stream is modeled with an individual GMM, and the log-likelihoods are combined as
$$\log L_{c_k}(s_t) = W_{mfcc} \log\left[ b^{mfcc}_{c_k}(s^{mfcc}_t) \right] + W_{tdoa} \log\left[ b^{tdoa}_{c_k}(s^{tdoa}_t) \right]$$
Agglomerative clustering using a modified BIC criterion to determine the number of clusters
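The weighted log-likelihood combination can be sketched as follows. This is an illustration only: it assumes diagonal covariance matrices, and the function names and the `(weights, means, variances)` tuple layout are hypothetical choices, not taken from the slides.

```python
import numpy as np


def gmm_log_likelihood(s, weights, means, variances):
    """log b(s) = log sum_r w_r N(s; mu_r, Sigma_r), diagonal Sigma_r assumed."""
    s = np.asarray(s, dtype=float)
    means = np.asarray(means, dtype=float)          # shape (R, D)
    variances = np.asarray(variances, dtype=float)  # shape (R, D)
    # Log-density of each Gaussian component, shape (R,).
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_exp = -0.5 * np.sum((s - means) ** 2 / variances, axis=1)
    log_terms = np.log(weights) + log_norm + log_exp
    # Log-sum-exp over components for numerical stability.
    peak = np.max(log_terms)
    return float(peak + np.log(np.sum(np.exp(log_terms - peak))))


def combined_log_likelihood(s_mfcc, s_tdoa, gmm_mfcc, gmm_tdoa, w_mfcc, w_tdoa):
    """log L = W_mfcc * log b_mfcc(s_mfcc) + W_tdoa * log b_tdoa(s_tdoa)."""
    return (w_mfcc * gmm_log_likelihood(s_mfcc, *gmm_mfcc)
            + w_tdoa * gmm_log_likelihood(s_tdoa, *gmm_tdoa))
```

Because the per-stream log-likelihood is a sum over feature dimensions, its magnitude grows with the dimension of the stream, which is one plausible reading of why the TDOA weight matters so much once the TDOA vector becomes large.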
Information Bottleneck (IB) Principle
Distributional clustering based on maximizing mutual information w.r.t. a set of relevance variables
Given a set of input variables $X$ and relevance variables $Y$ that contain important information about the problem, the IB principle seeks to maximize:
$$F = I(Y, C) - \frac{1}{\beta} I(C, X)$$
Optimized w.r.t. the stochastic mapping $P(C|X)$
Performed using sequential or agglomerative optimization
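As a small worked illustration of the objective, the sketch below evaluates $F$ for discrete distributions. It assumes the usual IB Markov condition (the cluster $C$ depends on $Y$ only through $X$), so that $p(c, y) = \sum_x p(x)\, p(c|x)\, p(y|x)$; the function names are hypothetical.

```python
import numpy as np


def mutual_information(joint):
    """I(A;B) in nats from a joint probability table p(a, b)."""
    joint = np.asarray(joint, dtype=float)
    p_a = joint.sum(axis=1, keepdims=True)
    p_b = joint.sum(axis=0, keepdims=True)
    product = p_a @ p_b
    mask = joint > 0  # zero-probability cells contribute nothing
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))


def ib_objective(p_x, p_y_given_x, p_c_given_x, beta):
    """F = I(Y;C) - (1/beta) * I(C;X) for a given mapping p(c|x)."""
    p_x = np.asarray(p_x, dtype=float)
    p_y_given_x = np.asarray(p_y_given_x, dtype=float)  # shape (|X|, |Y|)
    p_c_given_x = np.asarray(p_c_given_x, dtype=float)  # shape (|X|, |C|)
    joint_xc = p_x[:, None] * p_c_given_x                # p(x, c)
    joint_cy = joint_xc.T @ p_y_given_x                  # p(c, y)
    return mutual_information(joint_cy) - mutual_information(joint_xc) / beta
```

For two equally likely segments with disjoint relevance distributions, keeping them in separate clusters gives $F = \log 2 - \frac{1}{\beta}\log 2$, while merging them into one cluster gives $F = 0$.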
IB Speaker diarization
Speech segments as input variables $X$
Components of a background GMM as relevance variables $Y$
Agglomerative clustering for optimizing the IB objective function
  System initialized with uniform linear segmentation
  At each step, the two clusters whose merge results in the minimum loss of the IB function are merged (JS divergence in terms of $p(y|x)$)
  Number of speakers determined based on a threshold on normalized mutual information
Feature stream combination based on a weighted sum of the distributions $p(y|x_i)$
$$p(y|x) = W_{mfcc}\, p(y|x^{mfcc}, M_{mfcc}) + W_{tdoa}\, p(y|x^{tdoa}, M_{tdoa})$$
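The stream combination and the merge criterion can be sketched together. This is a simplified illustration: it assumes the two weights sum to one (so the combination remains a distribution), and the merge cost only accounts for the loss in $I(Y, C)$, using the weighted Jensen-Shannon divergence known from agglomerative IB and ignoring the $\frac{1}{\beta}$ compression term. All function names are hypothetical.

```python
import numpy as np


def combine_posteriors(p_y_mfcc, p_y_tdoa, w_mfcc, w_tdoa):
    """p(y|x) = W_mfcc * p(y|x_mfcc) + W_tdoa * p(y|x_tdoa)."""
    return w_mfcc * np.asarray(p_y_mfcc, dtype=float) \
        + w_tdoa * np.asarray(p_y_tdoa, dtype=float)


def kl_divergence(p, q):
    """KL(p || q) in nats; q must be positive wherever p is."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))


def merge_cost(p_ci, p_cj, p_y_ci, p_y_cj):
    """Loss in I(Y;C) when clusters i and j are merged."""
    p_y_ci = np.asarray(p_y_ci, dtype=float)
    p_y_cj = np.asarray(p_y_cj, dtype=float)
    total = p_ci + p_cj
    pi_i, pi_j = p_ci / total, p_cj / total
    merged = pi_i * p_y_ci + pi_j * p_y_cj  # p(y | merged cluster)
    js = pi_i * kl_divergence(p_y_ci, merged) + pi_j * kl_divergence(p_y_cj, merged)
    return total * js
```

Since the combination happens on normalized distributions over the relevance variables, each stream contributes values in the same range regardless of its feature dimension.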
HMM/GMM vs. IB diarization
                     HMM/GMM                       IB
clustering           agglomerative                 agglomerative
merge criterion      BIC                           JS divergence
number of speakers   BIC                           normalized MI
multiple features    log-likelihood combination    combination of relevance variable distributions
Experiments
Evaluate the performance of “all pair TDOA features” in the context of the two diarization systems
Performed on a dataset of 24 meetings across 6 meeting rooms
TDOA values corresponding to all delay pairs are computed
Delay-and-sum beamforming using a reference channel is applied to compute the MFCC features
Combination weights are estimated on a reference dataset consisting of 10 meetings
TDOA Features
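Estimated using GCC-PHAT
$$G_{PHAT}(i, j) = \frac{X_i(f)\, X_j^{*}(f)}{|X_i(f)|\,|X_j(f)|}$$
$$d_{PHAT}(i, j) = \arg\max_d R_{PHAT}(d)$$
where $R_{PHAT}(d)$ is the inverse Fourier transform of $G_{PHAT}(i, j)$
Ref. channel TDOA: the dimension for an M-channel recording is $M - 1$
All-pair TDOA: the dimension for an M-channel recording is $\frac{1}{2} M(M - 1)$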
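A minimal single-frame GCC-PHAT estimator might look like the sketch below. It is an illustration under stated assumptions: delays are returned in samples, the search is restricted to a hypothetical `max_delay` window, and a small floor guards against division by zero; none of these details come from the slides.

```python
import numpy as np


def gcc_phat_delay(x_i, x_j, max_delay):
    """Delay of x_i relative to x_j, in samples, estimated with GCC-PHAT."""
    n = len(x_i) + len(x_j)  # zero-pad so the correlation is not circular
    X_i = np.fft.rfft(x_i, n=n)
    X_j = np.fft.rfft(x_j, n=n)
    cross = X_i * np.conj(X_j)
    # PHAT weighting: keep only the phase, since |X_i||X_j| = |X_i X_j*|.
    G = cross / np.maximum(np.abs(cross), 1e-12)
    r = np.fft.irfft(G, n=n)
    # Reorder so that index 0 corresponds to lag -max_delay.
    r = np.concatenate((r[-max_delay:], r[:max_delay + 1]))
    return int(np.argmax(r)) - max_delay


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.standard_normal(1024)
    delayed = np.concatenate((np.zeros(3), source[:-3]))  # source shifted by 3 samples
    print(gcc_phat_delay(delayed, source, max_delay=10))
```

Applying such an estimator to every pair of channels in each frame yields the $\frac{1}{2} M(M - 1)$-dimensional all-pair TDOA vector, whereas fixing one of the two channels yields the $(M - 1)$-dimensional reference-channel vector.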
Weights Estimation
[Figure: speaker error on the development data as a function of the TDOA weight (logarithmic axis from 10^-4 to 10^0), for the HMM/GMM and IB systems]
                     aIB           HMM/GMM          TDOA vector dimension
Ref. Channel TDOA    (0.7, 0.3)    (0.9, 0.1)       M − 1
All Pairs TDOA       (0.8, 0.2)    (0.999, 0.001)   M(M − 1)/2
HMM/GMM weights are optimized on a logarithmic scale
The IB weights do not change considerably
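A search over the TDOA weight on a logarithmic grid could be sketched as follows. Everything here is illustrative: `error_fn` stands in for running a diarization system on the development meetings and returning its speaker error, the grid bounds mirror the 10^-4 to 10^0 range of the plot, and `toy_error` is a made-up curve used only to exercise the code. The MFCC weight is assumed to be one minus the TDOA weight.

```python
import numpy as np


def best_tdoa_weight(error_fn, low_exp=-4, high_exp=0, num=9):
    """Return (weight, error) minimizing error_fn over a log-spaced grid."""
    candidates = np.logspace(low_exp, high_exp, num=num)
    errors = [error_fn(w) for w in candidates]
    best = int(np.argmin(errors))
    return float(candidates[best]), float(errors[best])


def toy_error(w_tdoa):
    """Hypothetical error curve with its minimum at w_tdoa = 1e-3."""
    return (np.log10(w_tdoa) + 3.0) ** 2 + 10.0


if __name__ == "__main__":
    w_tdoa, err = best_tdoa_weight(toy_error)
    print("W_tdoa =", w_tdoa, "W_mfcc =", 1.0 - w_tdoa, "error =", err)
```

A linear grid with a step of 0.1 would never reach a value such as 0.001, which is why a logarithmic grid is the natural choice when the optimal weight can be orders of magnitude below one.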
Speaker Error
                     aIB           HMM/GMM
Ref. Channel TDOA    12.3          14.3
All Pairs TDOA       8.2 (+33%)    10.8 (+32%)
Both systems benefit from combining all-pair TDOAs with MFCC features
Speaker Error
HMM/GMM performance degrades when the number of microphones is small
The IB system appears more robust to changes in the feature dimension
[Figure: speaker error per meeting (meeting IDs 1 to 24, plus ALL) for All Pairs and Reference Channel TDOA features; one panel for the HMM/GMM system and one for the Information Bottleneck system]
Conclusion
Proposed to use “all pairs TDOAs” directly in two speaker diarization systems
In the case of the HMM/GMM system, the feature weights change considerably compared to using reference channel TDOAs
  Log-likelihood combination
  Benefits from additional delays when the weights are optimized on a logarithmic scale
  Performance degrades with a low number of microphones
The weighting is only marginally affected in the case of the IB system
  Combination in a normalized relevance variable space
  Improves consistently across meetings
Whenever weighting issues are properly handled, “all pair TDOAs” reduce the speaker error by ≈ 30% compared to TDOA values with respect to a reference channel alone
Thank You
ID   Meet.                 #Mic    ID   Meet.                 #Mic
1    CMU 20050912-0900     2       13   NIST 20051024-0930    8
2    CMU 20050914-0900     2       14   NIST 20051102-1323    8
3    CMU 20061115-1030     3       15   NIST 20051104-1515    7
4    CMU 20061115-1530     3       16   NIST 20060216-1347    7
5    EDI 20050216-1051     16      17   NIST 20080201-1405    7
6    EDI 20050218-0900     16      18   NIST 20080227-1501    7
7    EDI 20061113-1500     16      19   NIST 20080307-0955    7
8    EDI 20061114-1500     16      20   TNO 20041103-1130     10
9    EDI 20071128-1000     16      21   VT 20050408-1500      4
10   EDI 20071128-1500     16      22   VT 20050425-1000      7
11   IDI 20090128-1600     16      23   VT 20050623-1400      4
12   IDI 20090129-1000     16      24   VT 20051027-1400      4