Non-Linear I-Vector Extraction for Speaker
Recognition
Oren Barkan1,2, Hagai Aronowitz1
1 IBM Haifa Research Lab, Israel 2 School of Computer Science, Tel-Aviv University, Israel
{orenba, hagaia}@il.ibm.com
Overview
• I-vectors are widely used for speaker verification, language ID
and speaker diarization
• I-vectors provide a linear mapping from high-dimensional
GMM supervectors to relatively low-dimensional vectors,
named i-vectors
• GMM supervectors lie on a low-dimensional manifold [2].
• In this work we propose an alternative, nonlinear method for
mapping between GMM supervectors and a low-dimensional
space. The method relies on the Diffusion Maps framework
• The proposed method improves over i-vectors by 4-19%,
and by 13-44% when fused with the i-vector baseline
I-Vectors
Mapping of audio sessions (or segments) into a low, L-dimensional space:

   s = m + Tw

where
- s is a session-dependent supervector of stacked GMM means
- m is the UBM supervector
- T is a low-rank matrix of bases spanning a subspace covering most of the
variability in the supervector space
- w is an L-dimensional vector having a standard normal distribution

I-vector extraction
Given a trained i-vector model (m, T) and an audio session,
estimate the i-vector w
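The linear model above can be sketched numerically. A toy example with made-up dimensions (real systems use supervectors of dimension ~10^4-10^5 and L = 400); the least-squares estimate is only a stand-in for true i-vector extraction, which computes the posterior mean of w from Baum-Welch statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 100, 10                       # toy dimensions (real systems: D ~ 10^4-10^5, L = 400)
m = rng.normal(size=D)               # UBM supervector
T = rng.normal(size=(D, L))          # low-rank total-variability matrix
w_true = rng.normal(size=L)          # latent vector, standard normal prior

s = m + T @ w_true                   # session supervector under s = m + Tw

# Simplified point estimate of w by least squares (a real extractor uses
# the posterior mean of w given zeroth/first-order Baum-Welch statistics).
w_hat, *_ = np.linalg.lstsq(T, s - m, rcond=None)
```

Because the toy supervector is noiseless and T has full column rank, least squares recovers w exactly here.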
Diffusion Maps
• Diffusion Maps (DM) is a machine learning technique for
nonlinear dimensionality reduction
• DM focuses on discovering the underlying manifold that
the data has been sampled from
• In DM, an affinity matrix is built which is used to generate
a diffusion process
• As the diffusion process progresses, it integrates local
geometry to reveal geometric structures of the data at
different scales
• A diffusion map embeds the high dimensional data in a
lower-dimensional space D, such that the Euclidean
distance between points in D approximates the diffusion
distance in the original feature space
DM Training (1)
Input: Training set of GMM supervectors {x_i}_{i=1}^n ⊂ G ⊆ R^d

1. Build a graph affinity matrix A using the kernel:

   a(x_i, x_j) = exp(-c^2(x_i, x_j) / ε)

   where c(·, ·) is a metric over G
   - Cosine distance in this work

2. Convert A to a Markov matrix P according to:

   P_{i,j} = a(x_i, x_j) / Σ_{k=1}^n a(x_i, x_k)
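Steps 1-2 can be sketched as follows; the kernel width `eps` is an assumed hyperparameter (its value and tuning are not specified on the slides), while the cosine distance follows the slide:

```python
import numpy as np

def markov_matrix(X, eps=1.0):
    """Build the affinity matrix A with a Gaussian kernel on the cosine
    distance, then row-normalize it into a Markov matrix P (steps 1-2).
    eps is an assumed kernel-width hyperparameter."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    c = 1.0 - Xn @ Xn.T                       # cosine distance c(x_i, x_j)
    A = np.exp(-c**2 / eps)                   # affinity a(x_i, x_j)
    return A / A.sum(axis=1, keepdims=True)   # P_ij = a_ij / sum_k a_ik

# Toy usage: 50 random "supervectors" of dimension 20.
P = markov_matrix(np.random.default_rng(1).normal(size=(50, 20)))
```

Each row of P sums to one, i.e. row i is a transition distribution over the training points.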
DM Training (2)
3. Pick a diffusion time parameter t.
4. Compute the top-L eigenvalues {λ_l} and eigenvectors {φ_l}
of P (using SVD).
5. Define a mapping M_t : {x_i}_{i=1}^n → D such that

   M_t(x_i) = (λ_1^t φ_1(i), ..., λ_L^t φ_L(i))^T

where φ_k(i) indicates the i-th element of the k-th
eigenvector of P.

Output:
M_t(x_i) is the embedding of x_i in the L-dimensional
space D.
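Steps 3-5 in code. A sketch: a dense eigendecomposition stands in for the SVD mentioned on the slide, and eigenpairs are ranked by eigenvalue magnitude:

```python
import numpy as np

def diffusion_embed(P, L, t=1):
    """Embed each training point via the top-L eigenpairs of P (steps 3-5):
    row i of the result is M_t(x_i) = (lam_1^t phi_1(i), ..., lam_L^t phi_L(i))."""
    vals, vecs = np.linalg.eig(P)             # right eigenpairs of the Markov matrix
    order = np.argsort(-np.abs(vals))[:L]     # top-L by eigenvalue magnitude
    lam, phi = vals[order].real, vecs[:, order].real
    return (lam ** t) * phi

# Toy usage: a symmetric affinity matrix row-normalized into P.
rng = np.random.default_rng(2)
A = np.exp(-rng.random((30, 30)))
A = (A + A.T) / 2
P = A / A.sum(axis=1, keepdims=True)
Y = diffusion_embed(P, L=3, t=2)              # 30 points embedded in 3 dimensions
```

Note that the leading eigenpair of a Markov matrix is trivial (λ = 1 with a constant eigenvector), so the first embedding coordinate is constant; in practice it carries no information and is often discarded.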
Diffusion Distance
• Each entry P_{i,j} is the probability of transition from point x_i
to point x_j in a single time step. In the same way, each
entry P^t_{i,j} is the probability of transition from point x_i to
point x_j in t time steps.
• A diffusion distance after t steps is defined as follows:

   Q_t^2(x_i, x_j) = Σ_{k=1}^n (P^t_{i,k} - P^t_{j,k})^2

• It has been shown that for L = n-1:

   ||M_t(x_i) - M_t(x_j)||^2 = Q_t^2(x_i, x_j)

Hence, a diffusion distance in GMM space is equivalent to
the Euclidean distance in the embedded space.
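The equivalence can be checked numerically. A sketch with two assumptions flagged: the affinities come from a toy symmetric Gaussian kernel (not the cosine kernel of the slides), and the diffusion distance is computed in its degree-weighted form, which is standard in the Diffusion Maps literature and under which the equality is exact when all eigenpairs are kept:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 5))
A = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))   # symmetric toy affinities
d = A.sum(axis=1)                                    # vertex degrees
P = A / d[:, None]                                   # Markov matrix

# Eigendecomposition of P via the symmetric matrix D^{-1/2} A D^{-1/2}.
lam, U = np.linalg.eigh(A / np.sqrt(np.outer(d, d)))
lam, U = lam[::-1], U[:, ::-1]                       # descending eigenvalues
Phi = U / np.sqrt(d)[:, None]                        # right eigenvectors of P

t = 2
M = (lam ** t) * Phi                                 # full embedding: row i is M_t(x_i)
Pt = np.linalg.matrix_power(P, t)                    # t-step transition probabilities

# Degree-weighted diffusion distance between points 0 and 1 vs. the
# squared Euclidean distance between their embeddings.
Q2 = (((Pt[0] - Pt[1]) ** 2) / d).sum()
E2 = ((M[0] - M[1]) ** 2).sum()
```

The two quantities coincide: comparing transition profiles P^t_{i,·} after t steps is the same as comparing Euclidean coordinates in the embedded space.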
Toy Example
D-vector Extraction
• The mapping M_t is defined only for the domain {x_i}_{i=1}^n ⊂ G ⊆ R^d.
• Given a new test point x_{n+1} ∉ {x_i}_{i=1}^n, M_t has to be extended
to M̄_t : G → D
• This is done using the following Nyström extension:

   φ_k(n+1) = (1/λ_k) Σ_{j=1}^n P_{n+1,j} φ_k(j),   1 ≤ k ≤ L
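A sketch of the extension, reusing the cosine-distance kernel from the training step; `eps` is an assumed hyperparameter and must match the one used to build P:

```python
import numpy as np

def nystrom_extend(x_new, X, Phi, lam, eps=1.0):
    """Nystrom extension: embed an out-of-sample point x_new given the
    training supervectors X and the eigenpairs (Phi, lam) of their Markov
    matrix. Returns (phi_1(n+1), ..., phi_L(n+1))."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    xn = x_new / np.linalg.norm(x_new)
    a = np.exp(-(1.0 - Xn @ xn) ** 2 / eps)   # affinities a(x_{n+1}, x_j)
    p = a / a.sum()                           # transition row P_{n+1, j}
    return (p @ Phi) / lam                    # (1/lam_k) * sum_j P_{n+1,j} phi_k(j)
```

As a sanity check, the trivial eigenpair (λ = 1, constant eigenvector) extends to the same constant for any new point, since the transition row sums to one.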
Speaker Verification Pipeline
• 12 MFCC + 12 delta + 12 delta-delta
• Voice activity detection
• Feature warping
• GMM order is 1024
• 400-dimensional i-vectors, 400-dimensional d-vectors
• Length normalization
• Gaussian PLDA: w_i = w̄ + h_i + b_i
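The length-normalization step of the pipeline, for instance, is just a projection of each i-/d-vector onto the unit sphere before PLDA modeling (a minimal sketch):

```python
import numpy as np

def length_normalize(W):
    """Length normalization: scale each row (an i-vector or d-vector)
    to unit Euclidean norm before Gaussian PLDA modeling."""
    return W / np.linalg.norm(W, axis=1, keepdims=True)

# Toy usage: 8 random 400-dimensional vectors.
W = length_normalize(np.random.default_rng(5).normal(size=(8, 400)))
```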
Datasets
• Gender-independent UBM was trained on ~13K sessions
from Switchboard-II, NIST04 and NIST06 SRE
• I-vector and d-vector systems were trained on ~17K
female and ~11K male telephone sessions from NIST04,
NIST06 and NIST08 SRE
• Experiments were conducted on the three telephone-only
core conditions of NIST10 SRE (5, 6 and 8)
Results – NIST 2010 EER (%)
System           Condition 5   Condition 6   Condition 8
Males
  i-vector PLDA      2.5           4.9           1.0
  d-vector PLDA      2.3           2.9           1.6
  Fused system       1.7           2.2           0.8
Females
  i-vector PLDA      2.7           6.0           2.2
  d-vector PLDA      2.3           4.4           2.2
  Fused system       2.0           3.3           1.7
Results – NIST 2010 Old-min-DCF
System           Condition 5   Condition 6   Condition 8
Males
  i-vector PLDA     0.138         0.231         0.073
  d-vector PLDA     0.131         0.192         0.045
  Fused system      0.103         0.128         0.033
Females
  i-vector PLDA     0.132         0.244         0.087
  d-vector PLDA     0.127         0.224         0.065
  Fused system      0.096         0.199         0.056
Results – NIST 2010 New-min-DCF
System           Condition 5   Condition 6   Condition 8
Males
  i-vector PLDA     0.507         0.769         0.109
  d-vector PLDA     0.307         0.696         0.192
  Fused system      0.279         0.617         0.150
Females
  i-vector PLDA     0.431         0.758         0.242
  d-vector PLDA     0.322         0.814         0.238
  Fused system      0.291         0.781         0.179
Results – NIST 2010 Relative Improvement (%)
Measure          d-vector PLDA system   Fused system
Males
  EER                    19                  44
  Old min-DCF            17                  40
  New min-DCF            14                  24
Females
  EER                    18                  36
  Old min-DCF            10                  24
  New min-DCF             4                  13
Conclusions
• This work proposes manifold learning for nonlinear i-vector
extraction – the d-vector extraction
• The d-vector extraction algorithm is based on the Diffusion
Maps framework
• In experimental results, the d-vector system obtained an
error reduction of 4-19%*
• A simple fusion of the i-vector and d-vector systems resulted in
an error reduction of 13-44%*
*Compared to the baseline i-vector system (with the same pipeline), depending
on the gender and error measure.