Source: http://fvalente.zxq.net/presentations/tdoa_interspeech_valente.pdf
Integration of TDOA Features in Information Bottleneck Framework for Fast Speaker Diarization
Deepu Vijayasenan, Fabio Valente, Hervé Bourlard
{deepu.vijayasenan,fabio.valente,herve.bourlard}@idiap.ch
Idiap Research Institute
Interspeech 2008
Introduction and Motivation
• Speaker Diarization determines who spoke when in an audio stream.
• For meeting data, the recording is done with Multiple Distant Microphones (MDM).
• With MDM data, the Time Delay of Arrival (TDOA) of the signal at the different microphones can be used as information complementary to acoustic features (e.g. MFCC).
Introduction and Motivation
• There have been several recent efforts to achieve faster-than-real-time diarization with low computational complexity.
• We previously introduced a non-parametric clustering system based on the Information Bottleneck principle [Tishby98].
• It achieves state-of-the-art results with very limited computational complexity.
• How can we integrate other feature sets (e.g. TDOA features) into this framework?
Outline
• Information Bottleneck Principle
• Agglomerative and Sequential Optimization
• Diarization system
• Integration of TDOA
• Experiments and Results
• Conclusion
Information Bottleneck principle
• Let X be a set of elements to cluster into a set of C clusters.
• Let Y be a set of variables of interest associated with X.
• Assume that ∀x ∈ X and ∀y ∈ Y the conditional distribution p(y|x) is available.
• The IB principle states that the clustering C should preserve as much mutual information as possible between C and Y while compressing X (keeping I(X, C) low).
• This means maximizing the following objective function:

F = I(C, Y) − (1/β) I(X, C)

where β is the trade-off parameter between the two terms.
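As a rough numerical sketch (my own illustration, not from the slides), the two mutual-information terms can be evaluated directly from joint distributions given as 2-D arrays; `ib_objective` follows the common formulation F = I(C,Y) − (1/β) I(X,C):

```python
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in nats from a joint distribution given as a 2-D array p(a, b)."""
    p = np.asarray(p_joint, dtype=float)
    # outer product of the two marginals, same shape as p
    marg = p.sum(axis=1, keepdims=True) @ p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / marg[mask])).sum())

def ib_objective(p_xc, p_cy, beta):
    """IB objective F = I(C,Y) - (1/beta) * I(X,C); larger beta favors
    preserving information about Y over compressing X."""
    return mutual_information(p_cy) - mutual_information(p_xc) / beta
```

For an independent joint the MI term vanishes; for a perfectly correlated binary joint it equals ln 2.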
Information Bottleneck principle II
• The loss of mutual information δIy produced by merging xi and xj is given by the Jensen–Shannon (JS) divergence between p(Y|xi) and p(Y|xj):

δIy = (p(xi) + p(xj)) · JS(p(Y|xi), p(Y|xj))

• The JS divergence is the sum of two KL divergences and is thus easy to compute for discrete probabilities:

JS(p(Y|xi), p(Y|xj)) = πi DKL[p(Y|xi) || q(Y)] + πj DKL[p(Y|xj) || q(Y)]   (1)

with q(Y) = πi p(Y|xi) + πj p(Y|xj), πi = p(xi)/(p(xi) + p(xj)) and πj = p(xj)/(p(xi) + p(xj)).
• The distance (and thus the merging) is defined according to the variables Y relevant to the considered problem.
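A minimal sketch of the merge cost above (variable names are my own):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as 1-D arrays."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

def merge_cost(p_xi, p_xj, py_xi, py_xj):
    """delta I_y = (p(xi) + p(xj)) * JS(p(Y|xi), p(Y|xj)),
    with pi_i, pi_j the normalized cluster priors."""
    pi_i = p_xi / (p_xi + p_xj)
    pi_j = p_xj / (p_xi + p_xj)
    q = pi_i * np.asarray(py_xi, float) + pi_j * np.asarray(py_xj, float)
    return (p_xi + p_xj) * (pi_i * kl(py_xi, q) + pi_j * kl(py_xj, q))
```

Merging two elements with identical p(Y|x) costs nothing; the cost grows with how distinguishable the two conditional distributions are.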
IB optimization
• The objective function can be optimized in an agglomerative or a sequential fashion.
• Agglomerative IB [Slonim99]:
- Greedy approach.
- Starts with the trivial clustering of |X| clusters.
- Merges the pair of clusters that produces the minimum loss in the objective function.
- Merging stops when a stopping criterion is met.
• Sequential IB [Slonim02]:
- Works on a partition with a fixed number of clusters.
- Randomly draws an element out of the partition and tries to reassign it to another cluster.
- Typically used for refining the agglomerative solution.
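The greedy agglomerative loop can be sketched as follows (a toy illustration, not the Idiap implementation; after a merge, the new cluster's conditional distribution is the prior-weighted average of the merged ones):

```python
import numpy as np

def _js_cost(w_a, w_b, p_a, p_b):
    """Merge cost (p(a)+p(b)) * JS(p(Y|a), p(Y|b)); zero when the two match."""
    pi_a, pi_b = w_a / (w_a + w_b), w_b / (w_a + w_b)
    q = pi_a * p_a + pi_b * p_b
    def kl(p):
        m = p > 0
        return float((p[m] * np.log(p[m] / q[m])).sum())
    return (w_a + w_b) * (pi_a * kl(p_a) + pi_b * kl(p_b))

def agglomerative_ib(p_x, p_y_given_x, n_clusters):
    """Greedy aIB: start with one cluster per element, repeatedly merge the
    pair with the smallest loss of I(C;Y) until n_clusters remain."""
    clusters = [([i], w, np.asarray(p, float))
                for i, (w, p) in enumerate(zip(p_x, p_y_given_x))]
    while len(clusters) > n_clusters:
        # exhaustive search for the cheapest pair to merge
        a, b = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: _js_cost(clusters[ab[0]][1], clusters[ab[1]][1],
                                           clusters[ab[0]][2], clusters[ab[1]][2]))
        ids_a, w_a, p_a = clusters[a]
        ids_b, w_b, p_b = clusters.pop(b)
        clusters[a] = (ids_a + ids_b, w_a + w_b,
                       (w_a * p_a + w_b * p_b) / (w_a + w_b))
    return [ids for ids, _, _ in clusters]
```

On four elements whose p(Y|x) form two well-separated pairs, the loop recovers the two pairs.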
IB based diarization
IB based diarization II
• Conventional systems use the HMM/GMM framework together with the Bayesian Information Criterion (BIC). Most of the computational time is spent estimating the BIC between different clusters.
• The IB based system performs the clustering in the space of discrete probabilities p(Y|X) using the JS divergence. The resulting system is much faster.

System      Miss  FA   Spkr err  DER   RT
HMM/GMM     6.5   0.1  17.0      23.6  3.5
aIB         6.5   0.1  17.1      23.7  0.22
aIB + sIB   6.5   0.1  16.6      23.2  0.24

Table 1: RT06 eval results
TDOA features
• The Time Delay of Arrival (TDOA) between different channels can be used as an additional feature for diarization [Lathoud03, Anguera06].
• TDOA is computed using the Generalized Cross Correlation Phase Transform (GCC-PHAT) algorithm.
• In the HMM/GMM system, the acoustic vectors (MFCC) are augmented with TDOA.
• In the IB framework, the combination can happen at the posterior distribution level:

p(y|x) = p(y|x, Mmfc) P(Mmfc) + p(y|x, Mdel) P(Mdel)

i.e. the cardinalities |Y| and |X| stay unchanged, and thus so does the complexity of the clustering.
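The posterior-level combination is a simple convex mixture of the two streams' posteriors (a sketch; the default weight below is a hypothetical placeholder, since in the system the stream weights are tuned on development data):

```python
import numpy as np

def combine_posteriors(p_y_mfcc, p_y_tdoa, w_mfcc=0.7):
    """p(y|x) = P(Mmfc) * p(y|x, Mmfc) + P(Mdel) * p(y|x, Mdel).
    Both inputs are (chunks x components) arrays of posteriors; the output
    keeps the same shape, so |X|, |Y| and the clustering cost are unchanged."""
    p_y_mfcc = np.asarray(p_y_mfcc, float)
    p_y_tdoa = np.asarray(p_y_tdoa, float)
    return w_mfcc * p_y_mfcc + (1.0 - w_mfcc) * p_y_tdoa
```

Because each input row sums to one, every combined row is still a valid distribution over Y.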
TDOA features II
RT06 results

System      Miss  FA   Spkr err  DER   RT
HMM/GMM     6.5   0.1  9.3       15.9  3.63
aIB         6.5   0.1  11.4      18.0  0.34
aIB + sIB   6.5   0.1  9.7       16.3  0.41

• TDOA features reduce the Speaker Error by about 7% (absolute) in all the systems.
• The sIB based system is 0.4% worse than HMM/GMM but much faster.
• The improvements are verified on 7 of the 9 meetings.
Weights tuning
• The stream probabilities P(Mmfc) and P(Mdel) are heuristically tuned on a development data set.
[Figure: Speaker Error (%) as a function of the MFCC stream weight (0.2 to 1) on the development set; the error varies roughly between 17.5 and 20.5.]
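The tuning itself can be as simple as a grid search over the MFCC weight; `diarization_error` below is a hypothetical stand-in for running the full system on one meeting and scoring the speaker error:

```python
def tune_mfcc_weight(dev_meetings, diarization_error):
    """Pick the MFCC stream weight minimizing the total development-set
    error over a grid of 0.2 ... 0.9 (the range shown in the plot)."""
    grid = [w / 10.0 for w in range(2, 10)]
    return min(grid, key=lambda w: sum(diarization_error(m, w)
                                       for m in dev_meetings))
```

The TDOA stream weight is then simply 1 minus the chosen MFCC weight.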
Conclusions
• We introduced a diarization system based on the IB principle.
• Results are similar to state-of-the-art HMM/GMM but considerably faster.
• We described a combination of two feature streams (MFCC and TDOA) that leaves the cost of the clustering unchanged.
• Future plans will consider the same combination to include many other feature streams.
Thank You
IB based diarization
• X is defined as the set of uniformly segmented speech chunks obtained from the audio stream.
• Y is defined as the components of a background GMM estimated on the whole meeting data.
• The conditional distribution p(Y|X) can be obtained simply by Bayes' rule.
• An initial solution is obtained using agglomerative clustering, then refined using sequential clustering.
• The stopping criterion is based on the Minimum Description Length (MDL).
• The data are then re-aligned using a conventional HMM/GMM to refine the boundaries.
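The Bayes'-rule step can be sketched for a diagonal-covariance background GMM (parameter names are my own; a real system would additionally average the frame posteriors within each uniform chunk to get one p(Y|x) per chunk):

```python
import numpy as np

def gmm_posteriors(frames, weights, means, variances):
    """p(Y|x) by Bayes' rule: the responsibility of each GMM component for
    each frame; every row sums to one."""
    frames = np.atleast_2d(np.asarray(frames, float))
    # per-component diagonal-Gaussian log-likelihoods, shape (frames, components)
    log_lik = np.stack(
        [-0.5 * (np.log(2 * np.pi * np.asarray(v, float))
                 + (frames - np.asarray(m, float)) ** 2
                 / np.asarray(v, float)).sum(axis=1)
         for m, v in zip(means, variances)], axis=1)
    log_post = np.log(np.asarray(weights, float)) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)
```

A frame lying on one component's mean and far from the others gets a posterior concentrated on that component.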
RT06 results
[Figure: per-meeting Speaker Error (%) for MFCC vs. MFCC+TDOA on the nine RT06 meetings (CMU 20050912-0900, CMU 20050914-0900, EDI 20050216-1051, EDI 20050218-0900, NIST 20051024-0930, NIST 20051102-1323, TNO 20041103-1130, VT 20050623-1400, VT 20051027-1400) and on ALL.]