SNR-Aware PLDA Modeling for Robust Speaker Verification
Department of Electronic and Information Engineering, The Hong Kong Polytechnic University
Sun Yat-sen University - Carnegie Mellon University International Joint Research Institute, Shunde, Guangdong (SYSU-CMU Joint Research Institute)
28 Dec. 2015
Man-Wai MAK [email protected]
http://www.eie.polyu.edu.hk/~mwmak
http://www.eie.polyu.edu.hk/~mwmak/papers/SYSU-CMU-2015.pdf
Contents
1. I-Vector/PLDA for Speaker Verification
2. SNR-Aware PLDA Modeling
   - SNR-Invariant PLDA
   - Mixture of PLDA
3. Experiments on SRE12
4. Conclusions
I-Vectors for Speaker Verification
• State-of-the-art method for speaker verification
• Factor analysis model:
  $\vec{\mu}_s = \vec{\mu} + \mathbf{T}\mathbf{x}_s$
  where $\vec{\mu}$ is the UBM supervector, $\mathbf{T}$ is the low-rank total variability matrix (61440×500), and $\mathbf{x}_s$ is the speaker-dependent i-vector.
• Instead of using the high-dimensional supervector $\vec{\mu}_s$ to represent speaker s, we use the low-dimensional (typically 500) i-vector $\mathbf{x}_s$.
• T is estimated by an EM algorithm using the utterances of many speakers. T represents the subspace in which the i-vectors vary.
• Given T, estimate $\mathbf{x}_s$ for each target speaker and $\mathbf{x}_t$ for each test utterance.
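I-Vectors for Speaker Verification
• Given an utterance, we align its acoustic vectors against a UBM to obtain the sufficient statistics.
• The i-vector of the utterance is the posterior mean of the latent factor of the factor analysis model. For utterance i:
  $\langle \mathbf{x}_i \mid \mathcal{O}_i \rangle = \mathbf{L}_i^{-1} \mathbf{T}^{T} (\boldsymbol{\Sigma}^{(b)})^{-1} \tilde{\mathbf{f}}_i$
  $\mathbf{L}_i^{-1} = \mathrm{cov}(\mathbf{x}_i, \mathbf{x}_i \mid \mathcal{O}_i) = \left(\mathbf{I} + \mathbf{T}^{T} (\boldsymbol{\Sigma}^{(b)})^{-1} \mathbf{N}_i \mathbf{T}\right)^{-1}$
[Figure: alignment of the acoustic vectors with the UBM.]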
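I-Vectors for Speaker Verification
• Align the acoustic vectors $\mathbf{o}_t$ with the UBM to obtain the statistics:
  $\mathbf{N}_i = \mathrm{diag}\left(n_{i,1}\mathbf{I},\; n_{i,2}\mathbf{I},\; \ldots,\; n_{i,M}\mathbf{I}\right)$
  $\tilde{\mathbf{f}}_i = \left[\tilde{\mathbf{f}}_{i,1}^{T} \; \cdots \; \tilde{\mathbf{f}}_{i,M}^{T}\right]^{T}$
• These enter the i-vector posterior:
  $\langle \mathbf{x}_i \mid \mathcal{O}_i \rangle = \mathbf{L}_i^{-1} \mathbf{T}^{T} (\boldsymbol{\Sigma}^{(b)})^{-1} \tilde{\mathbf{f}}_i$
  $\mathbf{L}_i^{-1} = \mathrm{cov}(\mathbf{x}_i, \mathbf{x}_i \mid \mathcal{O}_i) = \left(\mathbf{I} + \mathbf{T}^{T} (\boldsymbol{\Sigma}^{(b)})^{-1} \mathbf{N}_i \mathbf{T}\right)^{-1}$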
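The posterior-mean formula above can be sketched in NumPy as follows. This is only an illustration: it assumes a diagonal UBM covariance $\boldsymbol{\Sigma}^{(b)}$, and the function and argument names (`extract_ivector`, `sigma_diag`, `n`, `f_tilde`) are hypothetical, not taken from the slides.

```python
import numpy as np

def extract_ivector(T, sigma_diag, n, f_tilde):
    """Posterior mean and covariance of the latent factor x_i.

    T          : (M*F, R) total variability matrix
    sigma_diag : (M*F,)   diagonal of Sigma^(b) (diagonal covariance assumed)
    n          : (M,)     zero-order statistics n_{i,c}, one per mixture
    f_tilde    : (M*F,)   centered first-order statistics
    """
    M = n.shape[0]
    F = T.shape[0] // M                      # acoustic feature dimension
    n_expanded = np.repeat(n, F)             # diagonal of N_i
    Tt_Sinv = T.T / sigma_diag               # T^T Sigma^{-1}
    L = np.eye(T.shape[1]) + (Tt_Sinv * n_expanded) @ T
    cov = np.linalg.inv(L)                   # L_i^{-1}
    mean = cov @ (Tt_Sinv @ f_tilde)         # the i-vector
    return mean, cov
```

With the dimensions quoted in the slides (1024 mixtures, 60-dimensional features, 500 factors), `T` would be 61440×500.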
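I-Vectors for Speaker Verification
[Figure: system block diagram. The UBM and the training data are used to train the total variability matrix T. The i-vector extractor turns the utterance from target speaker s and the test utterance t into i-vectors $\mathbf{x}_s$ and $\mathbf{x}_t$. LDA+WCCN projects them to $\mathbf{W}^{T}\mathbf{x}_s$ and $\mathbf{W}^{T}\mathbf{x}_t$. The scoring method feeds a decision maker that accepts if the score is ≥ θ and rejects if it is < θ.]
• Given an utterance from speaker s and a total variability matrix T, we estimate the speaker's i-vector $\mathbf{x}_s$
• Because T defines the combined space describing both speaker variability and channel variability, we use LDA+WCCN to remove channel variability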
I-Vectors for Speaker Verification
[Figure: two scatter plots of i-vectors, before LDA ($\mathbf{x}$) and after LDA ($\mathbf{W}^{T}\mathbf{x}$). Each point represents an utterance. Each marker type represents a speaker.]
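I-Vectors Scoring
• Given the i-vector of the target speaker and the i-vector of a test utterance, we compute the cosine-distance score:
  $S_{\mathrm{CD}}(\mathbf{x}_s, \mathbf{x}_t) = \dfrac{\langle \mathbf{W}^{T}\mathbf{x}_s, \mathbf{W}^{T}\mathbf{x}_t \rangle}{\|\mathbf{W}^{T}\mathbf{x}_s\| \, \|\mathbf{W}^{T}\mathbf{x}_t\|}$, with $S_{\mathrm{CD}}(\mathbf{x}_s, \mathbf{x}_t) \in [-1, 1]$
• If the score is larger than a threshold θ, then we accept the speaker; otherwise we reject the speaker.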
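A minimal sketch of cosine-distance scoring with a threshold decision is given below. The names `cosine_score` and `verify` are hypothetical, and `W` stands for whatever LDA+WCCN projection has been trained.

```python
import numpy as np

def cosine_score(x_s, x_t, W):
    """Cosine of the angle between the projected i-vectors W^T x_s and W^T x_t."""
    a = W.T @ x_s
    b = W.T @ x_t
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(x_s, x_t, W, theta):
    """Accept (True) when the score reaches the threshold theta."""
    return cosine_score(x_s, x_t, W) >= theta
```

Because only the angle matters, scaling either i-vector by a positive constant leaves the score unchanged.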
Probabilistic LDA for SV
• PLDA is based on a generative model that uses pre-processed i-vectors as input
• It aims to model the speaker and channel variability in the i-vector space
• The method assumes that there is a speaker subspace V within the i-vector space
• The i-vector $\mathbf{x}_s$ is written as:
  $\mathbf{x}_s = \mathbf{m} + \mathbf{V}\mathbf{z}_s + \boldsymbol{\varepsilon}_s$
  where $\mathbf{x}_s$ is the i-vector extracted from the utterance of speaker s, $\mathbf{m}$ is the global mean of all i-vectors, $\mathbf{V}$ defines the speaker subspace, $\mathbf{z}_s$ is the speaker factor, and $\boldsymbol{\varepsilon}_s$ is residual noise with covariance $\boldsymbol{\Sigma}$.
Probabilistic LDA for SV
• Similarly, the i-vector $\mathbf{x}_t$ from a test utterance is written as:
  $\mathbf{x}_t = \mathbf{m} + \mathbf{V}\mathbf{z}_t + \boldsymbol{\varepsilon}_t$
• Intuitively, you may think of $\mathbf{z}_s$ and $\mathbf{z}_t$ as projections onto the speaker subspace defined by the eigenvectors in V.
• But unlike PCA, given an i-vector $\mathbf{x}_t$, there are infinitely many possible $\mathbf{z}_t$. So we need to consider the joint density of $\mathbf{x}_t$ and $\mathbf{z}_t$ when computing the likelihood of $\mathbf{x}_t$.
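PLDA Scoring
H0 (same speaker): $\mathbf{x}_s = \mathbf{m} + \mathbf{V}\mathbf{z} + \boldsymbol{\varepsilon}_s$ and $\mathbf{x}_t = \mathbf{m} + \mathbf{V}\mathbf{z} + \boldsymbol{\varepsilon}_t$
against
H1 (different speakers): $\mathbf{x}_s = \mathbf{m} + \mathbf{V}\mathbf{z}_s + \boldsymbol{\varepsilon}_s$ and $\mathbf{x}_t = \mathbf{m} + \mathbf{V}\mathbf{z}_t + \boldsymbol{\varepsilon}_t$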
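The two hypotheses can be compared with a log-likelihood ratio once the speaker factors are integrated out. The sketch below assumes standard-normal speaker factors and Gaussian residuals, so that under H0 the pair $(\mathbf{x}_s, \mathbf{x}_t)$ is jointly Gaussian with cross-covariance $\mathbf{V}\mathbf{V}^{T}$. The helper names `log_gauss` and `plda_llr` are hypothetical.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Log density of a multivariate Gaussian."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + d @ np.linalg.solve(cov, d))

def plda_llr(x_s, x_t, m, V, Sigma):
    """log p(x_s, x_t | H0) - log p(x_s, x_t | H1) for Gaussian PLDA."""
    S_ac = V @ V.T                 # covariance shared through the common factor z
    S_tot = S_ac + Sigma           # marginal covariance of a single i-vector
    joint_mean = np.concatenate([m, m])
    joint_cov = np.block([[S_tot, S_ac], [S_ac, S_tot]])
    same = log_gauss(np.concatenate([x_s, x_t]), joint_mean, joint_cov)
    diff = log_gauss(x_s, m, S_tot) + log_gauss(x_t, m, S_tot)
    return same - diff
```

A positive score favours H0 (same speaker); the score is symmetric in its two i-vector arguments.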
Conventional Noise Robust PLDA
• In conventional multi-condition training, we pool i-vectors from various background noise levels to train m, V and Σ.
[Figure: i-vectors with 2 SNR ranges are pooled and passed to the EM algorithm, which outputs {m, V, Σ}.]
Conventional Noise Robust PLDA
• Conventional i-vector/PLDA systems use a channel space (with covariance Σ) to handle all SNR conditions.
[Figure: enrollment utterances enter i-vector/PLDA scoring with parameters {m, V, Σ}, producing PLDA scores.]
SNR Invariant PLDA
• We argue that the variation caused by SNR variability can be modeled by an SNR subspace, and that utterances falling within a narrow SNR range should share the same SNR factor (Li & Mak, Interspeech15; Li & Mak, T-ASLP 15)
[Figure: an SNR subspace in which Group 1, Group 2 and Group 3 are associated with SNR Factor 1, SNR Factor 2 and SNR Factor 3.]
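SNR Invariant PLDA
• Method of modeling SNR information
[Figure: i-vectors of clean, 15 dB and 6 dB utterances in the i-vector space, with SNR factors $\mathbf{w}_{\mathrm{cln}}$, $\mathbf{w}_{15\mathrm{dB}}$ and $\mathbf{w}_{6\mathrm{dB}}$ in the SNR subspace.]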
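SNR-invariant PLDA
• PLDA:
  $\mathbf{x}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \boldsymbol{\varepsilon}_{ij}$
• By adding an SNR factor to the conventional PLDA, we have SNR-invariant PLDA:
  $\mathbf{x}_{ij}^{k} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \mathbf{U}\mathbf{w}_k + \boldsymbol{\varepsilon}_{ij}^{k}$
  where U denotes the SNR subspace, $\mathbf{w}_k$ is an SNR factor, and $\mathbf{h}_i$ is the speaker (identity) factor for speaker i.
• Note that it is not the same as PLDA with channel subspace R:
  $\mathbf{x}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \mathbf{R}\mathbf{r}_{ij} + \boldsymbol{\varepsilon}_{ij}$
i: Speaker index; j: Session index; k: SNR index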
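SNR-invariant PLDA
• We separate i-vectors into different groups according to the SNR of their utterances
  $\mathbf{x}_{ij}^{k} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \mathbf{U}\mathbf{w}_k + \boldsymbol{\varepsilon}_{ij}^{k}$
[Figure: the SNR groups are passed to the EM algorithm, which outputs {m, V, U, Σ}.]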
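The slides do not say how the group boundaries are chosen. As one possible illustration, the sketch below assigns each utterance an SNR index k by splitting the measured SNRs at equal-population quantiles. Both the function name `group_by_snr` and the quantile rule are assumptions.

```python
import numpy as np

def group_by_snr(snr_db, num_groups):
    """Return an SNR group index in {0, ..., num_groups - 1} for each utterance.

    Boundaries are the interior quantiles of the SNR values, so the groups
    hold roughly equal numbers of utterances (an assumed grouping rule).
    """
    snr_db = np.asarray(snr_db, dtype=float)
    edges = np.quantile(snr_db, np.linspace(0, 1, num_groups + 1)[1:-1])
    return np.digitize(snr_db, edges)
```

For example, `group_by_snr([0, 5, 10, 15, 20, 25], 3)` places the two lowest SNRs in group 0, the middle two in group 1 and the two highest in group 2.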
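Compared with Conventional PLDA
Conventional PLDA:
  $\mathbf{x}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \boldsymbol{\varepsilon}_{ij}$
SNR-Invariant PLDA:
  $\mathbf{x}_{ij}^{k} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \mathbf{U}\mathbf{w}_k + \boldsymbol{\varepsilon}_{ij}^{k}$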
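PLDA vs SNR-invariant PLDA
Generative Model
  PLDA:               $\mathbf{x}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \boldsymbol{\varepsilon}_{ij}$
  SNR-invariant PLDA: $\mathbf{x}_{ij}^{k} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \mathbf{U}\mathbf{w}_k + \boldsymbol{\varepsilon}_{ij}^{k}$
Marginal density
  PLDA:               $p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \mathbf{m}, \mathbf{V}\mathbf{V}^{T} + \boldsymbol{\Sigma})$
  SNR-invariant PLDA: $p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \mathbf{m}, \mathbf{V}\mathbf{V}^{T} + \mathbf{U}\mathbf{U}^{T} + \boldsymbol{\Sigma})$
Parameters
  PLDA:               $\boldsymbol{\theta} = \{\mathbf{m}, \mathbf{V}, \boldsymbol{\Sigma}\}$
  SNR-invariant PLDA: $\boldsymbol{\theta} = \{\mathbf{m}, \mathbf{V}, \mathbf{U}, \boldsymbol{\Sigma}\}$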
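The marginal covariance $\mathbf{V}\mathbf{V}^{T} + \mathbf{U}\mathbf{U}^{T} + \boldsymbol{\Sigma}$ can be checked numerically by sampling from the generative model. The sketch below assumes standard-normal speaker and SNR factors and draws fresh factors for every sample, so it reproduces only the marginal of a single i-vector, not the sharing of $\mathbf{h}_i$ within a speaker or $\mathbf{w}_k$ within an SNR group. All function names are hypothetical.

```python
import numpy as np

def sample_snr_invariant_plda(m, V, U, Sigma, n, rng):
    """Draw n i-vectors x = m + V h + U w + eps with independent factors per draw."""
    D = m.shape[0]
    h = rng.standard_normal((n, V.shape[1]))            # speaker factors
    w = rng.standard_normal((n, U.shape[1]))            # SNR factors
    eps = rng.multivariate_normal(np.zeros(D), Sigma, size=n)
    return m + h @ V.T + w @ U.T + eps

def marginal_cov(V, U, Sigma):
    """Covariance of p(x) for SNR-invariant PLDA."""
    return V @ V.T + U @ U.T + Sigma
```

Comparing `np.cov(samples, rowvar=False)` with `marginal_cov(V, U, Sigma)` for a large sample gives a quick sanity check of the formula.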
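PLDA vs SNR-invariant PLDA
E-Step (PLDA):
  $\langle \mathbf{h}_i \mid \mathcal{X} \rangle = \mathbf{L}_i^{-1} \mathbf{V}^{T} \boldsymbol{\Sigma}^{-1} \sum_{j=1}^{H_i} (\mathbf{x}_{ij} - \mathbf{m})$
  $\langle \mathbf{h}_i \mathbf{h}_i^{T} \mid \mathcal{X} \rangle = \mathbf{L}_i^{-1} + \langle \mathbf{h}_i \mid \mathcal{X} \rangle \langle \mathbf{h}_i \mid \mathcal{X} \rangle^{T}$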
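A sketch of the PLDA E-step for one speaker follows. The slide does not define $\mathbf{L}_i$; the code assumes the usual form $\mathbf{L}_i = \mathbf{I} + H_i \mathbf{V}^{T} \boldsymbol{\Sigma}^{-1} \mathbf{V}$, with $H_i$ taken to be the number of sessions of speaker i. The name `plda_e_step` is hypothetical.

```python
import numpy as np

def plda_e_step(X_i, m, V, Sigma):
    """Posterior moments of the speaker factor h_i.

    X_i : (H_i, D) i-vectors of one speaker
    Returns <h_i | X> of shape (Q,) and <h_i h_i^T | X> of shape (Q, Q).
    """
    H_i = X_i.shape[0]
    Vt_Sinv = np.linalg.solve(Sigma, V).T              # V^T Sigma^{-1} (Sigma symmetric)
    L_i = np.eye(V.shape[1]) + H_i * (Vt_Sinv @ V)     # assumed form of L_i
    L_inv = np.linalg.inv(L_i)
    h_mean = L_inv @ Vt_Sinv @ (X_i - m).sum(axis=0)
    h_second = L_inv + np.outer(h_mean, h_mean)
    return h_mean, h_second
```

The posterior precision grows with the number of sessions, so speakers with more utterances get tighter estimates of their factor.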
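PLDA versus SNR-invariant PLDA
M-Step (PLDA):
  $\mathbf{V} = \left[\sum_{ij} (\mathbf{x}_{ij} - \mathbf{m}) \langle \mathbf{h}_i \mid \mathcal{X} \rangle^{T}\right] \left[\sum_{ij} \langle \mathbf{h}_i \mathbf{h}_i^{T} \mid \mathcal{X} \rangle\right]^{-1}$
  $\boldsymbol{\Sigma} = \dfrac{\sum_{ij} \left[(\mathbf{x}_{ij} - \mathbf{m})(\mathbf{x}_{ij} - \mathbf{m})^{T} - \mathbf{V} \langle \mathbf{h}_i \mid \mathcal{X} \rangle (\mathbf{x}_{ij} - \mathbf{m})^{T}\right]}{\sum_i H_i}$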
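The M-step updates for V and Σ can be sketched as below, given per-speaker posterior moments from an E-step. The function name `plda_m_step` and its argument layout are hypothetical, and the global mean m is treated as fixed.

```python
import numpy as np

def plda_m_step(X_by_speaker, m, h_means, h_seconds):
    """Update V and Sigma from posterior moments.

    X_by_speaker : list of (H_i, D) arrays, one per speaker
    h_means      : list of <h_i | X>, each of shape (Q,)
    h_seconds    : list of <h_i h_i^T | X>, each of shape (Q, Q)
    """
    D, Q = m.shape[0], h_means[0].shape[0]
    A = np.zeros((D, Q))      # sum_ij (x_ij - m) <h_i | X>^T
    B = np.zeros((Q, Q))      # sum_ij <h_i h_i^T | X>
    S = np.zeros((D, D))      # sum_ij (x_ij - m)(x_ij - m)^T
    n_total = 0               # sum_i H_i
    for X_i, h, hh in zip(X_by_speaker, h_means, h_seconds):
        Xc = X_i - m
        A += np.outer(Xc.sum(axis=0), h)
        B += X_i.shape[0] * hh
        S += Xc.T @ Xc
        n_total += X_i.shape[0]
    V_new = A @ np.linalg.inv(B)
    Sigma_new = (S - V_new @ A.T) / n_total
    return V_new, Sigma_new
```

Alternating this update with the E-step until the likelihood stops improving gives one EM training loop for conventional PLDA.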
SNR-‐invariant PLDA Score
Mixture of PLDA (mPLDA)
• Conventional i-vector/PLDA systems use a single PLDA model to handle all SNR conditions.
[Figure: enrollment i-vectors enter a single PLDA model with parameters {m, V, Σ}, producing PLDA scores.]
Mixture of PLDA (mPLDA)
• We argue that a PLDA model should focus on a small range of SNR.
[Figure: PLDA Model 1, PLDA Model 2 and PLDA Model 3, each producing its own PLDA score.]
Mixture of PLDA (mPLDA)
• The full spectrum of SNRs is handled by a mixture of PLDA in which the posteriors of the indicator variables depend on the utterance's SNR (Mak, Interspeech14; Mak et al., T-ASLP 16)
[Figure: an SNR estimator feeds an SNR posterior estimator, which weights PLDA Model 1, PLDA Model 2 and PLDA Model 3 to produce the PLDA score.]
M.W. Mak, X.M. Pang and J.T. Chien, "Mixture of PLDA for Noise Robust I-Vector Speaker Verification", IEEE/ACM Trans. on Audio Speech and Language Processing, vol. 24, no. 1, pp. 130-142, Jan. 2016.
Motivation of mPLDA
• The idea of mPLDA is based on two hypotheses:
  1. Different levels of background noise cause the i-vectors to fall in different regions of the i-vector space.
  2. SNR variability negatively affects PLDA speaker recognition accuracy, but its effect can be mitigated by explicitly modelling the SNR-dependent speaker subspaces through a mixture of PLDA.
Motivation of mPLDA
• To verify these two hypotheses, we corrupted 7,156 clean telephone utterances from 763 speakers with babble noise at 6 dB and 15 dB using the FaNT tool
• This results in 3 sets of i-vectors: clean, 15 dB, and 6 dB
• Then, a GMM is constructed as shown below.
[Figure: clean speech is passed to i-vector extraction directly and after FaNT corruption at 6 dB and at 15 dB. The mean and covariance of each of the three i-vector sets are computed and combined into a three-component GMM with equal mixture weights of 1/3.]
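The GMM construction described above can be sketched as follows: one Gaussian per i-vector set, equal weights, and component posteriors for any i-vector. The function names are hypothetical, and full covariance matrices are assumed since the slide only says "mean & cov".

```python
import numpy as np

def build_equal_weight_gmm(ivector_sets):
    """One Gaussian per i-vector set (e.g. clean, 15 dB, 6 dB), equal weights."""
    K = len(ivector_sets)
    weights = np.full(K, 1.0 / K)
    means = [X.mean(axis=0) for X in ivector_sets]
    covs = [np.cov(X, rowvar=False) for X in ivector_sets]
    return weights, means, covs

def component_posteriors(x, weights, means, covs):
    """Posterior probability of each component given i-vector x."""
    log_p = []
    for w, mu, S in zip(weights, means, covs):
        d = x - mu
        _, logdet = np.linalg.slogdet(S)
        # The (2*pi)^(D/2) factor is common to all components and cancels.
        log_p.append(np.log(w) - 0.5 * (logdet + d @ np.linalg.solve(S, d)))
    log_p = np.array(log_p)
    p = np.exp(log_p - log_p.max())      # subtract the max for numerical stability
    return p / p.sum()
```

Stacking these posteriors for all i-vectors gives the membership matrix needed by cluster-separability measures.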
Motivation of mPLDA
• We used partition coefficients (PC) and partition entropy coefficients (PE) to quantify the cluster separability of the three groups of i-vectors.
• PC → 1 and PE → 0 mean that the clusters are well separated
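The slides do not give the formulas for PC and PE. The sketch below assumes the commonly used definitions, $\mathrm{PC} = \frac{1}{N}\sum_n \sum_k u_{nk}^2$ and $\mathrm{PE} = -\frac{1}{N}\sum_n \sum_k u_{nk} \log u_{nk}$, where $u_{nk}$ is the membership (here, a GMM posterior) of i-vector n in cluster k. The function names are hypothetical.

```python
import numpy as np

def partition_coefficient(U):
    """PC of an (N, K) membership matrix whose rows sum to 1."""
    U = np.asarray(U, dtype=float)
    return float(np.mean(np.sum(U ** 2, axis=1)))

def partition_entropy(U, eps=1e-12):
    """PE of an (N, K) membership matrix; eps guards against log(0)."""
    U = np.asarray(U, dtype=float)
    return float(-np.mean(np.sum(U * np.log(U + eps), axis=1)))
```

Hard (one-hot) memberships give PC = 1 and PE = 0, while uniform memberships over K clusters give PC = 1/K and PE = log K.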
Motivation of mPLDA
• To verify the 2nd hypothesis, we perform speaker identification (SID) experiments under SNR-match and SNR-mismatch conditions.
• There are 9 combinations of PLDA models and SNR groups, of which three are matched in training and test conditions and six are mismatched.
• The SID accuracy gradually decreases when the SNR of the training data progressively deviates from that of the test data.
mPLDA: Model Parameters
• Model Parameters: one set for modeling the SNR of utterances and one set for modeling SNR-dependent i-vectors
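Graphical Model of mPLDA
[Figure: graphical model of mPLDA, with one part for modeling the SNR of utterances and one part for modeling SNR-dependent i-vectors.]
$\ell_{ij}$: SNR of the j-th utterance from the i-th speaker
$\mathbf{x}_{ij}$: i-vector of the j-th utterance from the i-th speaker
$\mathbf{V} = \{\mathbf{V}_k\}_{k=1}^{K}$
$\boldsymbol{\pi} = \{\pi_k\}_{k=1}^{K}$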
Graphical Model: PLDA vs. mPLDA
[Figure: graphical models of PLDA and mPLDA side by side.]
$\ell_{ij}$: SNR of the j-th utterance from the i-th speaker
Generative Model for mPLDA
where the posterior prob. of SNR is
[Figure: posterior of SNR plotted against SNR in dB.]
PLDA vs. mPLDA
Generative Model: PLDA vs. Mixture of PLDA
EM: PLDA vs. mPLDA
Auxiliary Function
PLDA:
Mixture of PLDA:
[Equation annotations: latent indicator variables; SNR of training utterances; speaker indexes; session indexes; no. of mixtures; latent speaker factors.]
EM: PLDA vs. mPLDA
E-Step: PLDA vs. Mixture of PLDA
EM: PLDA vs. mPLDA
M-Step
Likelihood-Ratio Scores of mPLDA
• Same-speaker likelihood: computed from the i-vectors of the target and test speakers and the SNR of the target and test utterances
• Different-speaker likelihood:
• Verification Score = Same-speaker likelihood / Different-speaker likelihood
For full derivation, see http://bioinfo.eie.polyu.edu.hk/mPLDA/SuppMaterials.pdf
Complexity Analysis
Dimension of i-vectors
Types of mPLDA
• The mixture of PLDA models can be of two types:
  1. SNR-independent mPLDA (SI-mPLDA)
  2. SNR-dependent mPLDA (SD-mPLDA)
• SNR-independent mPLDA is the supervised version of Hinton's mixture of factor analyzers, where the supervision comes from the speaker labels
• Equivalent to clustering in i-vector space with the subspaces $\mathbf{V}_k$ of clusters determined by PLDA
• No guidance from SNR information.
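SI-mPLDA vs. SD-mPLDA
• SNR-independent mPLDA: mixture weights are independent of the SNR of utterances.
  $p(\mathbf{x}) = \sum_{k=1}^{K} \rho_k \, \mathcal{N}(\mathbf{x} \mid \mathbf{m}_k, \mathbf{V}_k \mathbf{V}_k^{T} + \boldsymbol{\Sigma}_k)$
• SNR-dependent mPLDA: uses the posterior prob. of SNR obtained from a 1-D GMM.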
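The posterior probability of SNR from a 1-D GMM can be sketched as below. The parameter names `pi`, `mu` and `sigma` (mixture weights, means and standard deviations of the SNR components, in dB) are assumptions for illustration, as is the function name.

```python
import numpy as np

def snr_posteriors(snr_db, pi, mu, sigma):
    """Posterior of each mixture component given utterance SNR(s) in dB.

    snr_db        : scalar or (N,) SNR values
    pi, mu, sigma : (K,) weights, means and standard deviations of the 1-D GMM
    Returns an (N, K) array whose rows sum to 1.
    """
    snr = np.atleast_1d(np.asarray(snr_db, dtype=float))[:, None]
    pi, mu, sigma = (np.asarray(a, dtype=float) for a in (pi, mu, sigma))
    log_lik = -0.5 * ((snr - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    joint = pi * np.exp(log_lik)
    return joint / joint.sum(axis=1, keepdims=True)
```

An utterance whose SNR lies close to one component mean is assigned almost entirely to that component, which is what makes the alignment SNR-dependent.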
Cluster Alignment in mPLDA
[Figure: cluster alignment for SNR-independent mPLDA and SNR-dependent mPLDA.]
In SD-mPLDA, i-vectors that are aligned to the same mixture component have similar SNR
SNR-dependent vs. SNR-independent
[Figure: performance on CC4 of NIST12 (male) for PLDA, SNR-independent mPLDA and SNR-dependent mPLDA.]
Data and Features
• Evaluation dataset: Common evaluation conditions 1 and 4 of the NIST SRE 2012 core set.
• Parameterization: 19 MFCCs together with energy plus their 1st and 2nd derivatives → 60-dim
• UBM: gender-dependent, 1024 mixtures
• Total Variability Matrix: gender-dependent, 500 total factors
• I-Vector Preprocessing:
  - Whitening by WCCN then length normalization
  - For SI-PLDA, followed by NFA (500-dim → 200-dim) + WCCN
  - For mPLDA, followed by LDA (500-dim → 200-dim) + WCCN
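The first preprocessing step (WCCN whitening followed by length normalization) can be sketched as below. The code assumes the common WCCN recipe in which the projection B satisfies $\mathbf{B}\mathbf{B}^{T} = \mathbf{W}^{-1}$, with W the average within-speaker covariance. The function names are hypothetical, and the NFA/LDA stages are not covered.

```python
import numpy as np

def wccn_matrix(X, labels):
    """Return B with B B^T = W^{-1}, W being the mean within-speaker covariance.

    X      : (N, D) i-vectors
    labels : (N,)   speaker labels
    """
    labels = np.asarray(labels)
    D = X.shape[1]
    W = np.zeros((D, D))
    speakers = np.unique(labels)
    for s in speakers:
        Xs = X[labels == s]
        Xc = Xs - Xs.mean(axis=0)
        W += Xc.T @ Xc / Xs.shape[0]
    W /= len(speakers)
    return np.linalg.cholesky(np.linalg.inv(W))

def preprocess(X, B):
    """Project each i-vector with B^T, then scale it to unit length."""
    Y = X @ B                                   # row n is (B^T x_n)^T
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)
```

After the projection, the within-speaker covariance is the identity matrix; length normalization then places every i-vector on the unit sphere.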
Distribution of SNR in SRE12
[Figure: SNR distribution.] Each SNR region is handled by a specific set of SNR factors
Finding SNR Groups
[Figure: SNR groups of the training utterances.]
SNR Distributions
• SNR distribution of training and test utterances in CC4
[Figure: two panels, test utterances and training utterances.]
Performance on SRE12
Method               K   Q    Male EER(%)  Male minDCF  Female EER(%)  Female minDCF
PLDA                 -   -    5.42         0.371        7.53           0.531
SD-mPLDA             -   -    5.28         0.415        7.70           0.539
SNR-Invariant PLDA   3   40   5.42         0.382        6.93           0.528
SNR-Invariant PLDA   5   40   5.28         0.381        6.89           0.522
SNR-Invariant PLDA   6   40   5.29         0.388        6.90           0.536
SNR-Invariant PLDA   8   30   5.56         0.384        7.05           0.545
K: No. of SNR groups; Q: No. of SNR factors (dim of $\mathbf{w}_k$)
CC1
Performance on SRE12
Method               K   Q    Male EER(%)  Male minDCF  Female EER(%)  Female minDCF
PLDA                 -   -    2.40         0.332        2.19           0.335
SNR-dependent mPLDA  -   -    2.47         0.283        2.07           0.328
SNR-Invariant PLDA   3   40   1.96         0.277        1.74           0.290
SNR-Invariant PLDA   6   40   1.99         0.278        1.72           0.290
K: No. of SNR groups; Q: No. of SNR factors (dim of $\mathbf{w}_k$)
CC2
Performance on SRE12
Method               K   Q    Male EER(%)  Male minDCF  Female EER(%)  Female minDCF
PLDA                 -   -    3.13         0.312        2.82           0.341
SD-mPLDA             -   -    2.88         0.329        2.71           0.332
SNR-Invariant PLDA   3   40   2.72         0.289        2.36           0.314
SNR-Invariant PLDA   5   40   2.67         0.291        2.38           0.322
SNR-Invariant PLDA   6   40   2.63         0.287        2.43           0.319
SNR-Invariant PLDA   8   30   2.70         0.292        2.29           0.313
K: No. of SNR groups; Q: No. of SNR factors (dim of $\mathbf{w}_k$)
CC4
Performance on SRE12
Method               K   Q    Male EER(%)  Male minDCF  Female EER(%)  Female minDCF
PLDA                 -   -    2.86         0.286        2.47           0.343
SNR-dependent mPLDA  -   -    2.86         0.295        2.59           0.332
SNR-Invariant PLDA   3   40   2.47         0.273        2.07           0.294
SNR-Invariant PLDA   6   40   2.48         0.275        2.04           0.294
K: No. of SNR groups; Q: No. of SNR factors (dim of $\mathbf{w}_k$)
CC5
Performance on SRE12
[Figure: CC4, female. Comparison of conventional PLDA and SNR-invariant PLDA.]
Conclusions
• We show that while i-vectors of different SNR fall in different regions of the i-vector space, they vary within a single cluster in an SNR subspace.
• Therefore, it is possible to model the SNR variability by adding an SNR loading matrix and SNR factors to the conventional PLDA model.
• We also show that i-vectors derived from utterances of different SNR live in different speaker subspaces.
• Therefore, it is possible to model SNR variability by a mixture of SNR-dependent PLDA.
Bibliography
1. M.W. Mak, X.M. Pang and J.T. Chien, "Mixture of PLDA for Noise Robust I-Vector Speaker Verification", IEEE/ACM Trans. on Audio Speech and Language Processing, vol. 24, no. 1, pp. 130-142, Jan. 2016.
2. Na Li and M.W. Mak, "SNR-Invariant PLDA Modeling in Nonparametric Subspace for Robust Speaker Verification", IEEE/ACM Trans. on Audio Speech and Language Processing, vol. 23, no. 10, pp. 1648-1659, Oct. 2015.
3. W. Rao and M.W. Mak, "Boosting the Performance of I-Vector Based Speaker Verification via Utterance Partitioning", IEEE Trans. on Audio, Speech and Language Processing, vol. 21, no. 5, pp. 1012-1022, May 2013.
4. N. Li and M.W. Mak, "SNR-Invariant PLDA with Multiple Speaker Subspaces", ICASSP'16, March 2016.
5. X.M. Pang and M.W. Mak, "Noise Robust Speaker Verification via the Fusion of SNR-Independent and SNR-Dependent PLDA", International Journal of Speech Technology, Oct. 2015.
6. M.W. Mak, "Fast Scoring for Mixture of PLDA in I-Vector/PLDA Speaker Verification", Proc. APSIPA'15, pp. 587-593, Dec. 2015, Hong Kong.
7. M.W. Mak and H.B. Yu, "A Study of Voice Activity Detection Techniques for NIST Speaker Recognition Evaluations", Computer Speech & Language, vol. 28, no. 1, pp. 295-313, Jan. 2014.
8. N. Li and M.W. Mak, "SNR-Invariant PLDA Modeling for Robust Speaker Verification", Interspeech'15, Sept. 2015, Dresden, Germany, pp. 2317-2321.
9. P. Kenny, "Bayesian speaker verification with heavy-tailed priors", in Proc. of Odyssey: Speaker and Language Recognition Workshop, Brno, Czech Republic, June 2010.
10. N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification", IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788-798, May 2011.
Acknowledgment
Xiaomin Pang, Zhili Tan, Shibiao Wan, Wei Rao, Na Li