
SNR-Aware PLDA Modeling for Robust Speaker Verification

Department of Electronic and Information Engineering, The Hong Kong Polytechnic University

Sun Yat-sen University - Carnegie Mellon University International Joint Research Institute, Shunde, Guangdong (SYSU-CMU Joint Research Institute)

28 Dec. 2015

Man-Wai MAK enmwmak@polyu.edu.hk

http://www.eie.polyu.edu.hk/~mwmak

http://www.eie.polyu.edu.hk/~mwmak/papers/SYSU-CMU-2015.pdf

Contents

1.  I-Vector/PLDA for Speaker Verification
2.  SNR-Aware PLDA Modeling
    –  SNR-Invariant PLDA
    –  Mixture of PLDA
3.  Experiments on SRE12
4.  Conclusions

I-Vectors for Speaker Verification

•  State-of-the-art method for speaker verification
•  Factor analysis model:

$$\vec{\mu}_s = \vec{\mu} + \mathbf{T}\mathbf{x}_s$$

   where $\vec{\mu}$ is the UBM supervector, $\mathbf{T}$ is the low-rank total variability matrix (61440×500), and $\mathbf{x}_s$ is the speaker-dependent i-vector.

•  Instead of using the high-dimensional supervector $\vec{\mu}_s$ to represent speaker s, we use the low-dimensional (typically 500) i-vector $\mathbf{x}_s$.
•  $\mathbf{T}$ is estimated by an EM algorithm using the utterances of many speakers. $\mathbf{T}$ represents the subspace in which the i-vectors vary.
•  Given $\mathbf{T}$, estimate $\mathbf{x}_s$ for each target speaker and $\mathbf{x}_t$ for each test utterance.

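I-Vectors for Speaker Verification

•  Given an utterance, we align its acoustic vectors against a UBM to obtain the sufficient statistics $\mathbf{N}_i$ and $\tilde{\mathbf{f}}_i$.
•  The i-vector of utterance i is the posterior mean of the latent factor of the factor analysis model:

$$\langle \mathbf{x}_i \mid \mathcal{O}_i \rangle = \mathbf{L}_i^{-1}\mathbf{T}^{\mathsf T}\left(\boldsymbol{\Sigma}^{(b)}\right)^{-1}\tilde{\mathbf{f}}_i$$

$$\mathbf{L}_i^{-1} = \operatorname{cov}(\mathbf{x}_i,\mathbf{x}_i \mid \mathcal{O}_i) = \left(\mathbf{I} + \mathbf{T}^{\mathsf T}\left(\boldsymbol{\Sigma}^{(b)}\right)^{-1}\mathbf{N}_i\mathbf{T}\right)^{-1}$$

[Figure: acoustic vectors → alignment with the UBM → sufficient statistics]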

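I-Vectors for Speaker Verification

Aligning the acoustic vectors $\mathbf{o}_t$ with the UBM (M mixture components) gives the statistics

$$\mathbf{N}_i = \begin{bmatrix} n_{i,1}\mathbf{I} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & n_{i,2}\mathbf{I} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & n_{i,M}\mathbf{I} \end{bmatrix}, \qquad \tilde{\mathbf{f}}_i = \begin{bmatrix} \tilde{\mathbf{f}}_{i,1} \\ \vdots \\ \tilde{\mathbf{f}}_{i,M} \end{bmatrix}$$

from which the i-vector and its posterior covariance follow:

$$\langle \mathbf{x}_i \mid \mathcal{O}_i \rangle = \mathbf{L}_i^{-1}\mathbf{T}^{\mathsf T}\left(\boldsymbol{\Sigma}^{(b)}\right)^{-1}\tilde{\mathbf{f}}_i$$

$$\mathbf{L}_i^{-1} = \operatorname{cov}(\mathbf{x}_i,\mathbf{x}_i \mid \mathcal{O}_i) = \left(\mathbf{I} + \mathbf{T}^{\mathsf T}\left(\boldsymbol{\Sigma}^{(b)}\right)^{-1}\mathbf{N}_i\mathbf{T}\right)^{-1}$$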
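The posterior-mean formula can be sketched in NumPy as follows. This is only an illustration, assuming a diagonal UBM covariance and statistics stacked mixture by mixture; the function name `extract_ivector` and its argument names are hypothetical, not taken from the slides.

```python
import numpy as np

def extract_ivector(T, sigma_diag, n, f_tilde):
    """Posterior mean and covariance of the latent factor for one utterance.

    T          : (M*F, R) total variability matrix
    sigma_diag : (M*F,)   diagonal of Sigma^(b) (assumed diagonal here)
    n          : (M,)     zeroth-order statistics n_{i,c}
    f_tilde    : (M*F,)   first-order statistics, stacked over mixtures
    """
    M = n.shape[0]
    F = T.shape[0] // M                      # acoustic feature dimension
    n_expanded = np.repeat(n, F)             # diagonal of N_i
    Tt_Sinv = T.T / sigma_diag               # T^T (Sigma^(b))^{-1}
    L = np.eye(T.shape[1]) + (Tt_Sinv * n_expanded) @ T
    cov = np.linalg.inv(L)                   # L_i^{-1}
    mean = cov @ (Tt_Sinv @ f_tilde)         # the i-vector
    return mean, cov

# Toy usage with made-up sizes (M = 4 mixtures, F = 3 features, R = 2 factors)
rng = np.random.default_rng(0)
ivec, post_cov = extract_ivector(rng.normal(size=(12, 2)), np.ones(12),
                                 np.array([5.0, 3.0, 0.0, 2.0]),
                                 rng.normal(size=12))
```

Because $\mathbf{N}_i$ is block diagonal with scaled identity blocks, it is stored here as a vector and applied by broadcasting instead of forming the full matrix.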

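I-Vectors for Speaker Verification

[Figure: system block diagram. Training data → UBM → training of the total variability matrix T → i-vector extractor. The utterance from target speaker s and the test utterance t give i-vectors $\mathbf{x}_s$ and $\mathbf{x}_t$, which LDA+WCCN projects to $\mathbf{W}^{\mathsf T}\mathbf{x}_s$ and $\mathbf{W}^{\mathsf T}\mathbf{x}_t$. A scoring method and a decision maker follow: accept if the score ≥ θ, reject if the score < θ.]

•  Given an utterance from speaker s and a total variability matrix $\mathbf{T}$, we estimate his/her i-vector $\mathbf{x}_s$.
•  Because $\mathbf{T}$ defines the combined space describing both speaker variability and channel variability, we use LDA+WCCN to remove channel variability.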

I-Vectors for Speaker Verification

[Figure: scatter plots of i-vectors before LDA ($\mathbf{x}$) and after LDA ($\mathbf{W}^{\mathsf T}\mathbf{x}$). Each point represents an utterance. Each marker type represents a speaker.]

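I-Vectors Scoring

•  Given the i-vector of the target speaker and the i-vector of a test utterance, we compute the cosine-distance score:

$$S_{\mathrm{CD}}(\mathbf{x}_s,\mathbf{x}_t) = \frac{\langle \mathbf{W}^{\mathsf T}\mathbf{x}_s,\, \mathbf{W}^{\mathsf T}\mathbf{x}_t \rangle}{\|\mathbf{W}^{\mathsf T}\mathbf{x}_s\|\,\|\mathbf{W}^{\mathsf T}\mathbf{x}_t\|}, \qquad S_{\mathrm{CD}}(\mathbf{x}_s,\mathbf{x}_t) \in [-1, 1]$$

•  If the score is larger than a threshold θ, we accept the claimed speaker; otherwise we reject the claim.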
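A minimal sketch of cosine-distance scoring, assuming the LDA+WCCN projection is available as a matrix `W` (the function name and the toy values are illustrative only):

```python
import numpy as np

def cosine_score(x_s, x_t, W):
    """Cosine of the angle between the projected i-vectors W^T x_s and W^T x_t."""
    a = W.T @ x_s
    b = W.T @ x_t
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def accept(score, theta):
    """Accept the claimed speaker when the score exceeds the threshold."""
    return score > theta

# Toy usage: identity projection and two made-up 3-dimensional i-vectors
W = np.eye(3)
score = cosine_score(np.array([1.0, 2.0, 0.5]), np.array([0.9, 2.1, 0.4]), W)
decision = accept(score, theta=0.5)
```

The score depends only on the angle between the projected vectors, so scaling either i-vector leaves it unchanged.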

Probabilistic LDA for SV

•  PLDA is based on a generative model that uses pre-processed i-vectors as input.
•  It aims to model the speaker and channel variability in the i-vector space.
•  The method assumes that there is a speaker subspace $\mathbf{V}$ within the i-vector space.
•  The i-vector $\mathbf{x}_s$ is written as:

$$\mathbf{x}_s = \mathbf{m} + \mathbf{V}\mathbf{z}_s + \boldsymbol{\epsilon}_s$$

   where $\mathbf{x}_s$ is the i-vector extracted from the utterance of speaker s, $\mathbf{m}$ is the global mean of all i-vectors, $\mathbf{V}$ defines the speaker subspace, $\mathbf{z}_s$ is the speaker factor, and $\boldsymbol{\epsilon}_s$ is residual noise with covariance $\boldsymbol{\Sigma}$.

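Probabilistic LDA for SV

•  Similarly, the i-vector $\mathbf{x}_t$ from a test utterance is written as:

$$\mathbf{x}_t = \mathbf{m} + \mathbf{V}\mathbf{z}_t + \boldsymbol{\epsilon}_t$$

•  Intuitively, you may think of $\mathbf{z}_s$ and $\mathbf{z}_t$ as the projections of the i-vectors onto the speaker subspace defined by the eigenvectors in $\mathbf{V}$.
•  But unlike PCA, $\mathbf{z}_t$ is not uniquely determined by $\mathbf{x}_t$: for a given i-vector there are infinitely many possible values of $\mathbf{z}_t$. So we need to consider the joint density of $\mathbf{x}_t$ and $\mathbf{z}_t$ when computing the likelihood of $\mathbf{x}_t$.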

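PLDA Scoring

H0 (same speaker): the two i-vectors share one latent factor $\mathbf{z}$

$$\mathbf{x}_s = \mathbf{m} + \mathbf{V}\mathbf{z} + \boldsymbol{\epsilon}_s, \qquad \mathbf{x}_t = \mathbf{m} + \mathbf{V}\mathbf{z} + \boldsymbol{\epsilon}_t$$

against

H1 (different speakers): each i-vector has its own latent factor

$$\mathbf{x}_s = \mathbf{m} + \mathbf{V}\mathbf{z}_s + \boldsymbol{\epsilon}_s, \qquad \mathbf{x}_t = \mathbf{m} + \mathbf{V}\mathbf{z}_t + \boldsymbol{\epsilon}_t$$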
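The test of H0 against H1 can be sketched as a log-likelihood ratio between two Gaussians over the stacked pair of i-vectors. The sketch assumes standard-normal priors on the latent factors (not stated on the slide), under which a shared factor couples the two i-vectors through $\mathbf{V}\mathbf{V}^{\mathsf T}$; the function names are hypothetical.

```python
import numpy as np

def log_gauss(x, mean, cov):
    """Log density of a multivariate Gaussian N(x | mean, cov)."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = d @ np.linalg.solve(cov, d)
    return -0.5 * (quad + logdet + x.size * np.log(2.0 * np.pi))

def plda_llr(x_s, x_t, m, V, Sigma):
    """log p(x_s, x_t | H0) - log p(x_s, x_t | H1)."""
    B = V @ V.T                      # covariance contributed by the speaker factor
    A = B + Sigma                    # total covariance of a single i-vector
    Z = np.zeros_like(B)
    x = np.concatenate([x_s, x_t])
    mu = np.concatenate([m, m])
    cov_h0 = np.block([[A, B], [B, A]])   # shared z correlates x_s and x_t
    cov_h1 = np.block([[A, Z], [Z, A]])   # independent factors: no cross term
    return float(log_gauss(x, mu, cov_h0) - log_gauss(x, mu, cov_h1))

# Toy usage with a one-dimensional speaker subspace in a 2-D i-vector space
m = np.zeros(2)
V = np.array([[1.0], [1.0]])
Sigma = np.eye(2)
llr_same_direction = plda_llr(np.array([2.0, 2.0]), np.array([2.0, 2.0]), m, V, Sigma)
llr_opposite = plda_llr(np.array([2.0, 2.0]), np.array([-2.0, -2.0]), m, V, Sigma)
```

A positive value favours the same-speaker hypothesis and a negative value favours the different-speaker hypothesis.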

Conventional Noise Robust PLDA

•  In conventional multi-condition training, we pool i-vectors from various background noise levels to train $\mathbf{m}$, $\mathbf{V}$ and $\boldsymbol{\Sigma}$.

[Figure: i-vectors with 2 SNR ranges → EM algorithm → $\{\mathbf{m},\mathbf{V},\boldsymbol{\Sigma}\}$]

Conventional Noise Robust PLDA

•  Conventional i-vector/PLDA systems use one channel space (with covariance $\boldsymbol{\Sigma}$) to handle all SNR conditions.

[Figure: enrollment utterances → i-vector/PLDA scoring with $\{\mathbf{m},\mathbf{V},\boldsymbol{\Sigma}\}$ → PLDA scores]


SNR-Invariant PLDA

•  We argue that the variation caused by SNR variability can be modeled by an SNR subspace, and that utterances falling within a narrow SNR range should share the same SNR factor (Li & Mak, Interspeech15; Li & Mak, T-ASLP 15).

[Figure: SNR subspace containing three groups of i-vectors (Group 1, Group 2, Group 3) and three SNR factors (SNR Factor 1, SNR Factor 2, SNR Factor 3)]

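SNR-Invariant PLDA

•  Method of modeling SNR information

[Figure: i-vectors of 6 dB, 15 dB and clean utterances in the i-vector space, with the SNR factors $\mathbf{w}_{6\mathrm{dB}}$, $\mathbf{w}_{15\mathrm{dB}}$ and $\mathbf{w}_{\mathrm{cln}}$ in the SNR subspace]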

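SNR-Invariant PLDA

•  PLDA:

$$\mathbf{x}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \boldsymbol{\epsilon}_{ij}$$

•  By adding an SNR factor to the conventional PLDA, we have SNR-invariant PLDA:

$$\mathbf{x}_{ij}^{k} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \mathbf{U}\mathbf{w}_k + \boldsymbol{\epsilon}_{ij}^{k}$$

   where $\mathbf{U}$ denotes the SNR subspace, $\mathbf{w}_k$ is an SNR factor, and $\mathbf{h}_i$ is the speaker (identity) factor for speaker i. Here i is the speaker index, j is the session index, and k is the SNR index.

•  Note that this is not the same as PLDA with a channel subspace $\mathbf{R}$:

$$\mathbf{x}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \mathbf{R}\mathbf{r}_{ij} + \boldsymbol{\epsilon}_{ij}$$

   The channel factor $\mathbf{r}_{ij}$ is specific to each session, whereas the SNR factor $\mathbf{w}_k$ is shared by all utterances in SNR group k.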

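SNR-Invariant PLDA

•  We separate i-vectors into different groups according to the SNR of their utterances.

$$\mathbf{x}_{ij}^{k} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \mathbf{U}\mathbf{w}_k + \boldsymbol{\epsilon}_{ij}^{k}$$

[Figure: SNR-grouped i-vectors → EM algorithm → $\{\mathbf{m},\mathbf{V},\mathbf{U},\boldsymbol{\Sigma}\}$]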

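Compared with Conventional PLDA

Conventional PLDA:

$$\mathbf{x}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \boldsymbol{\epsilon}_{ij}$$

SNR-invariant PLDA:

$$\mathbf{x}_{ij}^{k} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \mathbf{U}\mathbf{w}_k + \boldsymbol{\epsilon}_{ij}^{k}$$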

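PLDA vs SNR-Invariant PLDA

Generative Model

PLDA:

$$\mathbf{x}_{ij} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \boldsymbol{\epsilon}_{ij}$$

$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \mathbf{m}, \mathbf{V}\mathbf{V}^{\mathsf T} + \boldsymbol{\Sigma})$$

$$\boldsymbol{\theta} = \{\mathbf{m}, \mathbf{V}, \boldsymbol{\Sigma}\}$$

SNR-invariant PLDA:

$$\mathbf{x}_{ij}^{k} = \mathbf{m} + \mathbf{V}\mathbf{h}_i + \mathbf{U}\mathbf{w}_k + \boldsymbol{\epsilon}_{ij}^{k}$$

$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \mathbf{m}, \mathbf{V}\mathbf{V}^{\mathsf T} + \mathbf{U}\mathbf{U}^{\mathsf T} + \boldsymbol{\Sigma})$$

$$\boldsymbol{\theta} = \{\mathbf{m}, \mathbf{V}, \mathbf{U}, \boldsymbol{\Sigma}\}$$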
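The marginal covariance of the SNR-invariant model can be checked in two lines. The derivation below assumes the usual PLDA priors, which the slide does not spell out: $\mathbf{h}_i \sim \mathcal{N}(\mathbf{0},\mathbf{I})$, $\mathbf{w}_k \sim \mathcal{N}(\mathbf{0},\mathbf{I})$ and $\boldsymbol{\epsilon}_{ij}^{k} \sim \mathcal{N}(\mathbf{0},\boldsymbol{\Sigma})$, all mutually independent.

```latex
\begin{aligned}
\mathbb{E}\big[\mathbf{x}_{ij}^{k}\big]
  &= \mathbf{m} + \mathbf{V}\,\mathbb{E}[\mathbf{h}_i] + \mathbf{U}\,\mathbb{E}[\mathbf{w}_k]
     + \mathbb{E}\big[\boldsymbol{\epsilon}_{ij}^{k}\big] = \mathbf{m} \\
\operatorname{cov}\big(\mathbf{x}_{ij}^{k}\big)
  &= \mathbf{V}\operatorname{cov}(\mathbf{h}_i)\mathbf{V}^{\mathsf T}
     + \mathbf{U}\operatorname{cov}(\mathbf{w}_k)\mathbf{U}^{\mathsf T}
     + \operatorname{cov}\big(\boldsymbol{\epsilon}_{ij}^{k}\big)
   = \mathbf{V}\mathbf{V}^{\mathsf T} + \mathbf{U}\mathbf{U}^{\mathsf T} + \boldsymbol{\Sigma}
\end{aligned}
```

Setting $\mathbf{U} = \mathbf{0}$ recovers the conventional PLDA covariance $\mathbf{V}\mathbf{V}^{\mathsf T} + \boldsymbol{\Sigma}$.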

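PLDA vs SNR-Invariant PLDA: E-Step

PLDA:

$$\langle \mathbf{h}_i \mid \mathcal{X} \rangle = \mathbf{L}_i^{-1}\mathbf{V}^{\mathsf T}\boldsymbol{\Sigma}^{-1}\sum_{j=1}^{H_i}(\mathbf{x}_{ij} - \mathbf{m})$$

$$\langle \mathbf{h}_i\mathbf{h}_i^{\mathsf T} \mid \mathcal{X} \rangle = \mathbf{L}_i^{-1} + \langle \mathbf{h}_i \mid \mathcal{X} \rangle\langle \mathbf{h}_i \mid \mathcal{X} \rangle^{\mathsf T}$$

where $H_i$ is the number of sessions of speaker i.

SNR-invariant PLDA: [equations not recovered from the slide]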
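The PLDA E-step can be sketched for one speaker as below. The slide does not write out $\mathbf{L}_i$; the sketch assumes the standard PLDA posterior precision $\mathbf{L}_i = \mathbf{I} + H_i\mathbf{V}^{\mathsf T}\boldsymbol{\Sigma}^{-1}\mathbf{V}$, and the function name `plda_e_step` is hypothetical.

```python
import numpy as np

def plda_e_step(X_i, m, V, Sigma):
    """Posterior moments of the speaker factor h_i given one speaker's i-vectors.

    X_i : (H_i, D) i-vectors of speaker i
    Assumes L_i = I + H_i V^T Sigma^{-1} V (standard PLDA, not shown on the slide).
    """
    H_i = X_i.shape[0]
    Vt_Sinv = np.linalg.solve(Sigma, V).T            # V^T Sigma^{-1}, Sigma symmetric
    L_i = np.eye(V.shape[1]) + H_i * (Vt_Sinv @ V)
    L_inv = np.linalg.inv(L_i)
    h_mean = L_inv @ Vt_Sinv @ (X_i - m).sum(axis=0)  # <h_i | X>
    h_second = L_inv + np.outer(h_mean, h_mean)       # <h_i h_i^T | X>
    return h_mean, h_second

# Toy usage: two sessions of one speaker, 1-D i-vectors and a 1-D speaker factor
h_mean, h_second = plda_e_step(np.array([[1.0], [3.0]]), np.zeros(1),
                               np.array([[2.0]]), np.array([[1.0]]))
```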

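PLDA versus SNR-Invariant PLDA: M-Step

PLDA:

$$\mathbf{V} = \left[\sum_{ij}(\mathbf{x}_{ij} - \mathbf{m})\langle \mathbf{h}_i \mid \mathcal{X} \rangle^{\mathsf T}\right]\left[\sum_{ij}\langle \mathbf{h}_i\mathbf{h}_i^{\mathsf T} \mid \mathcal{X} \rangle\right]^{-1}$$

$$\boldsymbol{\Sigma} = \frac{\sum_{ij}\left[(\mathbf{x}_{ij} - \mathbf{m})(\mathbf{x}_{ij} - \mathbf{m})^{\mathsf T} - \mathbf{V}\langle \mathbf{h}_i \mid \mathcal{X} \rangle(\mathbf{x}_{ij} - \mathbf{m})^{\mathsf T}\right]}{\sum_i H_i}$$

SNR-invariant PLDA: [equations not recovered from the slide]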
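The two PLDA M-step updates can be sketched as follows, taking the posterior moments from the E-step as given. The function name and the list-of-arrays layout are illustrative assumptions.

```python
import numpy as np

def plda_m_step(X_list, m, h_means, h_seconds):
    """Update V and Sigma from the posterior moments of the speaker factors.

    X_list    : list of (H_i, D) arrays, one per speaker
    h_means   : list of (Q,) arrays   <h_i | X>
    h_seconds : list of (Q, Q) arrays <h_i h_i^T | X>
    """
    D = X_list[0].shape[1]
    Q = h_means[0].shape[0]
    A = np.zeros((D, Q))      # sum_ij (x_ij - m) <h_i>^T
    B = np.zeros((Q, Q))      # sum_ij <h_i h_i^T>
    S = np.zeros((D, D))      # sum_ij (x_ij - m)(x_ij - m)^T
    n_total = 0               # sum_i H_i
    for X_i, h, hh in zip(X_list, h_means, h_seconds):
        Xc = X_i - m
        A += np.outer(Xc.sum(axis=0), h)
        B += X_i.shape[0] * hh
        S += Xc.T @ Xc
        n_total += X_i.shape[0]
    V = A @ np.linalg.inv(B)
    Sigma = (S - V @ A.T) / n_total
    return V, Sigma

# Toy usage: one speaker with two 1-D i-vectors and given posterior moments
V_new, Sigma_new = plda_m_step([np.array([[1.0], [3.0]])], np.zeros(1),
                               [np.array([8.0 / 9.0])],
                               [np.array([[73.0 / 81.0]])])
```

Since the new $\mathbf{V}$ equals $\mathbf{A}\mathbf{B}^{-1}$, the subtracted term $\mathbf{V}\mathbf{A}^{\mathsf T} = \mathbf{A}\mathbf{B}^{-1}\mathbf{A}^{\mathsf T}$ is symmetric, so the updated $\boldsymbol{\Sigma}$ is symmetric as well.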

SNR-Invariant PLDA Score

[Scoring equations not recovered from the slide]


Mixture of PLDA (mPLDA)

•  Conventional i-vector/PLDA systems use a single PLDA model to handle all SNR conditions.

[Figure: enrollment i-vectors → PLDA model $\{\mathbf{m},\mathbf{V},\boldsymbol{\Sigma}\}$ → PLDA scores]

Mixture of PLDA (mPLDA)

•  We argue that a PLDA model should focus on a small range of SNR.

[Figure: PLDA Model 1, PLDA Model 2 and PLDA Model 3, each producing its own PLDA score]

Mixture of PLDA (mPLDA)

•  The full spectrum of SNRs is handled by a mixture of PLDA in which the posteriors of the indicator variables depend on the utterance's SNR (Mak, Interspeech14; Mak et al., T-ASLP 16).

[Figure: SNR estimator → SNR posterior estimator; PLDA Model 1, PLDA Model 2 and PLDA Model 3 → PLDA score]

M.W. Mak, X.M. Pang and J.T. Chien, "Mixture of PLDA for Noise Robust I-Vector Speaker Verification", IEEE/ACM Trans. on Audio Speech and Language Processing, vol. 24, no. 1, pp. 130-142, Jan. 2016.

Motivation of mPLDA

•  The idea of mPLDA is based on two hypotheses:
   1.  Different levels of background noise cause the i-vectors to fall in different regions of the i-vector space.
   2.  SNR variability negatively affects PLDA speaker recognition accuracy, but its effect can be mitigated by explicitly modelling the SNR-dependent speaker subspaces through a mixture of PLDA.

Motivation of mPLDA

•  To verify these two hypotheses, we corrupted 7,156 clean telephone utterances from 763 speakers with babble noise at 6 dB and 15 dB using the FaNT tool.
•  This results in 3 sets of i-vectors: clean, 15 dB, and 6 dB.
•  Then, a GMM is constructed from the three sets.

[Figure: the clean speech and its 6 dB and 15 dB versions produced by FaNT each go through i-vector extraction and mean and covariance computation; the three (mean, covariance) pairs form a three-component GMM with equal weights of 1/3]
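The GMM construction described above can be sketched as follows: one full-covariance Gaussian per SNR group, equal weights, and component posteriors for every i-vector. The synthetic 2-D groups stand in for real i-vectors, and all names are illustrative assumptions.

```python
import numpy as np

def build_gmm(groups):
    """Equal-weight GMM with one full-covariance Gaussian per group of i-vectors."""
    K = len(groups)
    weights = np.full(K, 1.0 / K)
    means = [g.mean(axis=0) for g in groups]
    covs = [np.cov(g, rowvar=False) for g in groups]
    return weights, means, covs

def gmm_posteriors(X, weights, means, covs):
    """Posterior probability of each component for every row of X."""
    log_p = []
    for w, mu, C in zip(weights, means, covs):
        d = X - mu
        _, logdet = np.linalg.slogdet(C)
        quad = np.einsum("nd,nd->n", d, np.linalg.solve(C, d.T).T)
        log_p.append(np.log(w) - 0.5 * (quad + logdet + X.shape[1] * np.log(2.0 * np.pi)))
    log_p = np.stack(log_p, axis=1)
    log_p -= log_p.max(axis=1, keepdims=True)   # stabilise before exponentiating
    post = np.exp(log_p)
    return post / post.sum(axis=1, keepdims=True)

# Synthetic stand-ins for the clean, 15 dB and 6 dB i-vector sets
rng = np.random.default_rng(0)
groups = [rng.normal(loc=c, size=(50, 2)) for c in (0.0, 10.0, 20.0)]
weights, means, covs = build_gmm(groups)
post = gmm_posteriors(np.vstack(groups), weights, means, covs)
```

The posterior matrix `post` (one row per i-vector, one column per SNR group) is the kind of soft membership that cluster-separability measures operate on.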

Motivation of mPLDA

•  We used partition coefficients (PC) and partition entropy coefficients (PE) to quantify the cluster separability of the three groups of i-vectors.
•  PC → 1 and PE → 0 mean that the clusters are well separated.
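A short sketch of the two separability measures, assuming the standard fuzzy-clustering definitions (Bezdek's partition coefficient and partition entropy) applied to an N×K matrix of soft memberships; the slide does not give the formulas, so this choice is an assumption.

```python
import numpy as np

def partition_coefficient(U):
    """PC = (1/N) * sum_n sum_k u_nk^2 for an (N, K) membership matrix U."""
    return float(np.mean(np.sum(U ** 2, axis=1)))

def partition_entropy(U, eps=1e-12):
    """PE = -(1/N) * sum_n sum_k u_nk * log(u_nk); eps guards against log(0)."""
    return float(-np.mean(np.sum(U * np.log(U + eps), axis=1)))

# Hard (perfectly separated) and uniform (completely overlapping) memberships
U_hard = np.eye(3)
U_flat = np.full((4, 3), 1.0 / 3.0)
pc_hard, pe_hard = partition_coefficient(U_hard), partition_entropy(U_hard)
pc_flat, pe_flat = partition_coefficient(U_flat), partition_entropy(U_flat)
```

With hard memberships PC reaches 1 and PE reaches 0, which matches the well-separated case on the slide; uniform memberships give the other extreme, PC = 1/K and PE = log K.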

Motivation of mPLDA

•  To verify the 2nd hypothesis, we performed speaker identification (SID) experiments under SNR-matched and SNR-mismatched conditions.
•  There are 9 combinations of PLDA models and SNR groups, of which three are matched in training and test conditions and six are mismatched.
•  The SID accuracy gradually decreases as the SNR of the training data progressively deviates from that of the test data.

mPLDA: Model Parameters

•  Model parameters: [equation not recovered from the slide; one group of parameters models the SNR of utterances and the other models the SNR-dependent i-vectors]

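Graphical Model of mPLDA

[Figure: graphical model of mPLDA, with one part modeling the SNR of utterances and the other modeling the SNR-dependent i-vectors]

$\ell_{ij}$: SNR of the j-th utterance from the i-th speaker
$\mathbf{x}_{ij}$: i-vector of the j-th utterance from the i-th speaker
$\mathbf{V} = \{\mathbf{V}_k\}_{k=1}^{K}$, $\boldsymbol{\pi} = \{\pi_k\}_{k=1}^{K}$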

Graphical Model: PLDA vs. mPLDA

[Figure: graphical models of PLDA and mPLDA]

$\ell_{ij}$: SNR of the j-th utterance from the i-th speaker

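Generative Model for mPLDA

[Equation not recovered from the slide: generative model of mPLDA]

where the posterior probability of the SNR is

[Equation not recovered from the slide]

[Figure: posterior of SNR plotted against SNR in dB]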
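Since the exact expression is not recoverable here, the following is only a sketch of how SNR posteriors could be computed from a 1-D GMM over utterance SNRs, which is how a later slide says they are obtained. The parameter names `pi`, `mu` and `sigma` and the toy values are assumptions, not the slide's notation.

```python
import numpy as np

def snr_posteriors(snr_db, pi, mu, sigma):
    """Posterior of each mixture component given utterance SNRs (1-D GMM).

    snr_db        : (N,) SNR values in dB
    pi, mu, sigma : (K,) mixture weights, means and standard deviations
    """
    snr_db = np.asarray(snr_db, dtype=float)[:, None]
    pi, mu, sigma = (np.asarray(a, dtype=float) for a in (pi, mu, sigma))
    log_lik = -0.5 * ((snr_db - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
    log_post = np.log(pi) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)   # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

# Toy mixture centred at 6 dB, 15 dB and 30 dB
gamma = snr_posteriors([6.0, 15.0, 30.0, 10.5],
                       pi=[1 / 3, 1 / 3, 1 / 3],
                       mu=[6.0, 15.0, 30.0],
                       sigma=[3.0, 3.0, 3.0])
```

Each row of `gamma` sums to one; an utterance whose SNR lies midway between two components (10.5 dB here) is shared equally between them.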

PLDA vs. mPLDA

[Table: generative models of PLDA and mixture of PLDA; equations not recovered from the slide]

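EM: PLDA vs. mPLDA, Auxiliary Function

[Equations not recovered from the slide: auxiliary functions of PLDA and of mixture of PLDA. The annotations identify the latent indicator variables, the latent speaker factors, the SNR of training utterances, the speaker indexes, the session indexes, and the number of mixtures.]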

EM: PLDA vs. mPLDA, E-Step

[Equations not recovered from the slide: E-step of PLDA and of mixture of PLDA]

EM: PLDA vs. mPLDA, M-Step

[Equations not recovered from the slide]

Likelihood-Ratio Scores of mPLDA

•  Same-speaker likelihood: [equation not recovered from the slide; its arguments are the i-vectors of the target and test speakers and the SNRs of the target and test utterances]

Likelihood-Ratio Scores of mPLDA

•  Different-speaker likelihood: [equation not recovered from the slide]
•  Verification score = (same-speaker likelihood) / (different-speaker likelihood)

For the full derivation, see http://bioinfo.eie.polyu.edu.hk/mPLDA/SuppMaterials.pdf

Complexity Analysis

[Table not recovered from the slide; one annotation marks the dimension of i-vectors]

Types of mPLDA

•  The mixture of PLDA models can be of two types:
   1.  SNR-independent mPLDA (SI-mPLDA)
   2.  SNR-dependent mPLDA (SD-mPLDA)

Types of mPLDA

•  SNR-independent mPLDA is the supervised version of Hinton's mixture of factor analyzers, where the supervision comes from the speaker labels.
•  It is equivalent to clustering in the i-vector space, with the subspaces $\mathbf{V}_k$ of the clusters determined by PLDA.
•  There is no guidance from SNR information.

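SI-mPLDA vs. SD-mPLDA

•  SNR-independent mPLDA: the mixture weights $\rho_k$ are independent of the SNR of utterances.

$$p(\mathbf{x}) = \sum_{k=1}^{K} \rho_k\, \mathcal{N}(\mathbf{x} \mid \mathbf{m}_k, \mathbf{V}_k\mathbf{V}_k^{\mathsf T} + \boldsymbol{\Sigma}_k)$$

•  SNR-dependent mPLDA: [equation not recovered from the slide] the posterior probabilities of the SNR are obtained from a 1-D GMM.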
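The SNR-independent mixture density can be evaluated with a few lines of NumPy. This is a sketch under the assumption that each component has its own mean $\mathbf{m}_k$; the function name and the toy parameters are illustrative.

```python
import numpy as np

def si_mplda_density(x, rho, means, Vs, Sigmas):
    """p(x) = sum_k rho_k N(x | m_k, V_k V_k^T + Sigma_k)."""
    total = 0.0
    for r, m_k, V_k, S_k in zip(rho, means, Vs, Sigmas):
        C = V_k @ V_k.T + S_k                       # component covariance
        d = x - m_k
        _, logdet = np.linalg.slogdet(C)
        quad = d @ np.linalg.solve(C, d)
        log_n = -0.5 * (quad + logdet + x.size * np.log(2.0 * np.pi))
        total += r * np.exp(log_n)
    return float(total)

# Toy usage: two components in a 2-D i-vector space
rho = [0.4, 0.6]
means = [np.zeros(2), np.array([3.0, 3.0])]
Vs = [np.array([[1.0], [0.0]]), np.array([[0.0], [1.0]])]
Sigmas = [np.eye(2), np.eye(2)]
p = si_mplda_density(np.array([0.5, 0.5]), rho, means, Vs, Sigmas)
```

In the SNR-dependent variant, the fixed weights `rho` would be replaced by weights that depend on the utterance's SNR.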

Cluster Alignment in mPLDA

[Figure: cluster alignment under SNR-independent mPLDA and SNR-dependent mPLDA]

In SD-mPLDA, i-vectors that are aligned to the same mixture component have similar SNR.

SNR-dependent vs. SNR-independent

[Figure: performance on CC4 of NIST12 (male) for PLDA, SNR-independent mPLDA and SNR-dependent mPLDA]


Data and Features

•  Evaluation dataset: common evaluation conditions 1 and 4 of the NIST SRE 2012 core set.
•  Parameterization: 19 MFCCs together with energy plus their 1st and 2nd derivatives → 60-dim
•  UBM: gender-dependent, 1024 mixtures
•  Total variability matrix: gender-dependent, 500 total factors
•  I-vector preprocessing:
   –  Whitening by WCCN, then length normalization
   –  For SI-PLDA, followed by NFA (500-dim → 200-dim) + WCCN
   –  For mPLDA, followed by LDA (500-dim → 200-dim) + WCCN
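The first preprocessing step (WCCN whitening followed by length normalization) can be sketched as below. This assumes the common WCCN formulation in which the projection $\mathbf{B}$ satisfies $\mathbf{B}\mathbf{B}^{\mathsf T} = \mathbf{S}_w^{-1}$, with $\mathbf{S}_w$ the average within-speaker covariance; the function names and the synthetic data are illustrative.

```python
import numpy as np

def wccn_matrix(X, labels):
    """Return B with B B^T = S_w^{-1}, S_w being the within-speaker covariance."""
    speakers = np.unique(labels)
    D = X.shape[1]
    S_w = np.zeros((D, D))
    for s in speakers:
        Xs = X[labels == s]
        d = Xs - Xs.mean(axis=0)
        S_w += d.T @ d / Xs.shape[0]
    S_w /= speakers.size
    return np.linalg.cholesky(np.linalg.inv(S_w))

def length_normalize(X):
    """Scale every i-vector (row of X) to unit Euclidean norm."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

# Synthetic stand-in: 4 speakers, 20 utterances each, 3-dimensional "i-vectors"
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 20)
X = rng.normal(size=(80, 3)) * np.array([1.0, 2.0, 3.0]) + labels[:, None] * 5.0
B = wccn_matrix(X, labels)
X_white = X @ B                      # row-vector form of B^T x
X_norm = length_normalize(X_white)
```

After the projection the within-speaker covariance of `X_white` is the identity, and after length normalization every row of `X_norm` lies on the unit sphere.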

Distribution of SNR in SRE12

[Figure: distribution of SNR in SRE12]

Each SNR region is handled by a specific set of SNR factors.

Finding SNR Groups

[Figure: SNR distribution of training utterances]

SNR Distributions

•  SNR distribution of training and test utterances in CC4

[Figure: SNR distributions of test utterances and training utterances]

Performance on SRE12 (CC1)

Method               K   Q    Male               Female
                              EER(%)   minDCF    EER(%)   minDCF
PLDA                 -   -    5.42     0.371     7.53     0.531
SD-mPLDA             -   -    5.28     0.415     7.70     0.539
SNR-Invariant PLDA   3   40   5.42     0.382     6.93     0.528
                     5   40   5.28     0.381     6.89     0.522
                     6   40   5.29     0.388     6.90     0.536
                     8   30   5.56     0.384     7.05     0.545

K: no. of SNR groups; Q: no. of SNR factors (dim. of $\mathbf{w}_k$)

Performance on SRE12 (CC2)

Method               K   Q    Male               Female
                              EER(%)   minDCF    EER(%)   minDCF
PLDA                 -   -    2.40     0.332     2.19     0.335
SD-mPLDA             -   -    2.47     0.283     2.07     0.328
SNR-Invariant PLDA   3   40   1.96     0.277     1.74     0.290
                     6   40   1.99     0.278     1.72     0.290

K: no. of SNR groups; Q: no. of SNR factors (dim. of $\mathbf{w}_k$)

Performance on SRE12 (CC4)

Method               K   Q    Male               Female
                              EER(%)   minDCF    EER(%)   minDCF
PLDA                 -   -    3.13     0.312     2.82     0.341
SD-mPLDA             -   -    2.88     0.329     2.71     0.332
SNR-Invariant PLDA   3   40   2.72     0.289     2.36     0.314
                     5   40   2.67     0.291     2.38     0.322
                     6   40   2.63     0.287     2.43     0.319
                     8   30   2.70     0.292     2.29     0.313

K: no. of SNR groups; Q: no. of SNR factors (dim. of $\mathbf{w}_k$)

Performance on SRE12 (CC5)

Method               K   Q    Male               Female
                              EER(%)   minDCF    EER(%)   minDCF
PLDA                 -   -    2.86     0.286     2.47     0.343
SD-mPLDA             -   -    2.86     0.295     2.59     0.332
SNR-Invariant PLDA   3   40   2.47     0.273     2.07     0.294
                     6   40   2.48     0.275     2.04     0.294

K: no. of SNR groups; Q: no. of SNR factors (dim. of $\mathbf{w}_k$)
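For readers unfamiliar with the EER column, the sketch below shows one simple way to compute an equal error rate from target and non-target scores by sweeping the threshold over the observed scores. It is an illustration only, not the scoring tool used for these tables, and the function name is hypothetical.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Error rate at the threshold where false acceptance and false rejection are closest."""
    tar = np.asarray(target_scores, dtype=float)
    non = np.asarray(nontarget_scores, dtype=float)
    best_gap, eer = np.inf, 1.0
    for thr in np.unique(np.concatenate([tar, non])):
        far = np.mean(non >= thr)    # false acceptance rate at this threshold
        frr = np.mean(tar < thr)     # false rejection rate at this threshold
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), 0.5 * (far + frr)
    return float(eer)

# Toy usage: perfectly separated scores, then interleaved scores
eer_separated = equal_error_rate([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
eer_interleaved = equal_error_rate([1.0, 3.0], [0.0, 2.0])
```

Multiplying the returned value by 100 gives the percentage form used in the tables.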

Performance on SRE12

[Figure: CC4, female; conventional PLDA vs. SNR-invariant PLDA]

Conclusions

•  We show that while i-vectors of different SNRs fall in different regions of the i-vector space, they vary within a single cluster in an SNR subspace.
•  Therefore, it is possible to model the SNR variability by adding an SNR loading matrix and SNR factors to the conventional PLDA model.
•  We also show that i-vectors derived from utterances of different SNRs live in different speaker subspaces.
•  Therefore, it is possible to model SNR variability by a mixture of SNR-dependent PLDA models.

Bibliography

1.  M.W. Mak, X.M. Pang and J.T. Chien, "Mixture of PLDA for Noise Robust I-Vector Speaker Verification", IEEE/ACM Trans. on Audio Speech and Language Processing, vol. 24, no. 1, pp. 130-142, Jan. 2016.
2.  N. Li and M.W. Mak, "SNR-Invariant PLDA Modeling in Nonparametric Subspace for Robust Speaker Verification", IEEE/ACM Trans. on Audio Speech and Language Processing, vol. 23, no. 10, pp. 1648-1659, Oct. 2015.
3.  W. Rao and M.W. Mak, "Boosting the Performance of I-Vector Based Speaker Verification via Utterance Partitioning", IEEE Trans. on Audio, Speech and Language Processing, vol. 21, no. 5, pp. 1012-1022, May 2013.
4.  N. Li and M.W. Mak, "SNR-Invariant PLDA with Multiple Speaker Subspaces", ICASSP'16, March 2016.
5.  X.M. Pang and M.W. Mak, "Noise Robust Speaker Verification via the Fusion of SNR-Independent and SNR-Dependent PLDA", International Journal of Speech Technology, Oct. 2015.
6.  M.W. Mak, "Fast Scoring for Mixture of PLDA in I-Vector/PLDA Speaker Verification", Proc. APSIPA'15, pp. 587-593, Dec. 2015, Hong Kong.
7.  M.W. Mak and H.B. Yu, "A Study of Voice Activity Detection Techniques for NIST Speaker Recognition Evaluations", Computer Speech & Language, vol. 28, no. 1, pp. 295-313, Jan. 2014.
8.  N. Li and M.W. Mak, "SNR-Invariant PLDA Modeling for Robust Speaker Verification", Interspeech'15, pp. 2317-2321, Sept. 2015, Dresden, Germany.
9.  P. Kenny, "Bayesian speaker verification with heavy-tailed priors", in Proc. of Odyssey: Speaker and Language Recognition Workshop, Brno, Czech Republic, June 2010.
10. N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification", IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788-798, May 2011.

Acknowledgment

Xiaomin Pang, Zhili Tan, Shibiao Wan, Wei Rao, Na Li
