Non-Linear I-Vector Extraction for Speaker Recognition

Oren Barkan (1,2), Hagai Aronowitz (1)

(1) IBM Haifa Research Lab, Israel
(2) School of Computer Science, Tel-Aviv University, Israel

{orenba, hagaia}@il.ibm.com

Overview

• I-vectors are widely used for speaker verification, language identification, and speaker diarization

• I-vector extraction provides a linear mapping from high-dimensional GMM supervectors to relatively low-dimensional vectors, named i-vectors

• GMM supervectors lie on a low-dimensional manifold [2]

• In this work we propose an alternative, nonlinear method for mapping GMM supervectors to a low-dimensional space. The method relies on the Diffusion Maps framework

• The proposed method reduces error relative to the i-vector baseline by 4-19%, and by 13-44% when the two systems are fused

I-Vectors

Mapping of audio sessions (or segments) into a low, L-dimensional space:

$$s = m + T w$$

where

- s is a session-dependent supervector of stacked GMM means

- m is the UBM supervector

- T is a low-rank matrix of bases spanning a subspace that covers most of the variability in the supervector space

- w is an L-dimensional vector with a standard normal distribution

I-vector extraction

Given a trained i-vector model (m, T) and an audio session, estimate the i-vector w
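The slides do not spell out how w is estimated. One common choice, assumed here and not taken from the slides, is the posterior mean of w given the session's zero-order and centered first-order Baum-Welch statistics under a diagonal-covariance UBM. The sketch below illustrates that estimate; the names `extract_ivector`, `sigma`, `N` and `F` are hypothetical.

```python
import numpy as np

def extract_ivector(T, sigma, N, F):
    """Posterior-mean i-vector estimate (illustrative sketch, assumed estimator).

    T     : (C*D, L) total-variability matrix
    sigma : (C*D,)   diagonal UBM covariances, stacked per component
    N     : (C,)     zero-order (occupancy) statistics per component
    F     : (C*D,)   first-order statistics, centered on the UBM means m
    """
    n_rows, L = T.shape
    D = n_rows // N.shape[0]          # feature dimension per GMM component
    N_exp = np.repeat(N, D)           # occupancy repeated for each feature dim
    T_w = T / sigma[:, None]          # Sigma^{-1} T
    # Posterior precision: I + T^T Sigma^{-1} N T
    precision = np.eye(L) + T.T @ (N_exp[:, None] * T_w)
    # Posterior mean: precision^{-1} T^T Sigma^{-1} F
    return np.linalg.solve(precision, T_w.T @ F)
```

With very little speech (small N) the estimate shrinks toward the zero prior mean; with a lot of speech it approaches the least-squares solution of s = m + Tw.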

Diffusion Maps

• Diffusion Maps (DM) is a machine learning technique for nonlinear dimensionality reduction

• DM focuses on discovering the underlying manifold that the data has been sampled from

• In DM, an affinity matrix is built which is used to generate a diffusion process

• As the diffusion process progresses, it integrates local geometry to reveal geometric structures of the data at different scales

• A diffusion map embeds the high-dimensional data in a lower-dimensional space D, such that the Euclidean distance between points in D approximates the diffusion distance in the original feature space

DM Training (1)

Input: Training set of GMM supervectors $\{x_i\}_{i=1}^{n} \subset G_d$

1. Build a graph affinity matrix A using the kernel:

$$a(x_i, x_j) = \exp\left(-\frac{c(x_i, x_j)^2}{\varepsilon}\right)$$

where $c(\cdot,\cdot)$ is a metric over $G_d$ (cosine distance in this work) and $\varepsilon$ is the kernel scale parameter

2. Convert A to a Markov matrix P according to:

$$P_{i,j} = \frac{a(x_i, x_j)}{\sum_{k=1}^{n} a(x_i, x_k)}$$
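Steps 1 and 2 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the kernel is taken to be a Gaussian of the squared distance with a scale parameter `eps`, cosine distance is taken to be one minus cosine similarity, and the function names are hypothetical.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise cosine distance (assumed here to be 1 - cosine similarity)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - Xn @ Xn.T

def markov_matrix(X, eps):
    """Return the affinity matrix A and the row-stochastic matrix P.

    X   : (n, d) training supervectors, one per row
    eps : kernel scale (assumed parameterization of the Gaussian kernel)
    """
    C = cosine_distance_matrix(X)
    A = np.exp(-(C ** 2) / eps)               # step 1: graph affinity matrix
    P = A / A.sum(axis=1, keepdims=True)      # step 2: normalize each row to sum to 1
    return A, P
```

Each row of P sums to one, so P can be read as the transition matrix of a random walk over the training supervectors.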

DM Training (2)

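3. Pick a diffusion time parameter t.

4. Compute the top-L eigenvalues $(\lambda_i)_{i=0}^{L}$ and eigenvectors $(\psi_i)_{i=0}^{L}$ of P (using SVD). The trivial pair ($\lambda_0 = 1$, constant $\psi_0$) is not used in the mapping.

5. Define a mapping $M_t : \{x_i\}_{i=1}^{n} \to D$ such that

$$M_t(x_i) = \left[\lambda_1^t \psi_{1i}, \ldots, \lambda_L^t \psi_{Li}\right]^T$$

where $\psi_{ki}$ indicates the i-th element of the k-th eigenvector of P.

Output: $M_t(x_i)$ is the embedding of $x_i$ in the L-dimensional space D.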
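Steps 3 to 5 can be sketched as below. Note the substitution: instead of the SVD mentioned on the slide, this sketch eigendecomposes the symmetric matrix D^{-1/2} A D^{-1/2} (D being the diagonal matrix of row sums of A), which has the same eigenvalues as P, and then rescales its eigenvectors into right eigenvectors of P. It assumes a symmetric affinity matrix A and an integer t; the function name and the eigenvector scaling are illustrative choices.

```python
import numpy as np

def diffusion_map(A, L, t):
    """Embed the n training points in L dimensions (illustrative sketch).

    A : (n, n) symmetric affinity matrix
    L : embedding dimension
    t : diffusion time (integer)
    Returns (embedding, lam, psi) with shapes (n, L), (L,), (n, L).
    """
    d = A.sum(axis=1)                          # row sums, i.e. diagonal of D
    S = A / np.sqrt(np.outer(d, d))            # D^{-1/2} A D^{-1/2}, symmetric
    eigvals, eigvecs = np.linalg.eigh(S)       # eigenvalues in ascending order
    order = np.argsort(-eigvals)               # reorder to descending
    keep = order[1:L + 1]                      # skip the trivial eigenvalue 1
    lam = eigvals[keep]
    psi = eigvecs[:, keep] / np.sqrt(d)[:, None]   # right eigenvectors of P
    embedding = psi * lam ** t                 # row i is M_t(x_i)
    return embedding, lam, psi
```

Row i of `embedding` holds the L coordinates of training point x_i; larger t damps the coordinates associated with smaller eigenvalues.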

Diffusion Distance

• Each entry $P_{i,j}$ is the probability of transition from point $x_i$ to point $x_j$ in a single time step. In the same way, each entry $P^t_{i,j}$ is the probability of transition from point $x_i$ to point $x_j$ in t time steps.

• A diffusion distance after t steps is defined as follows:

$$Q_t(x_i, x_j) = \sum_{k=1}^{n} \left(P^t_{i,k} - P^t_{j,k}\right)^2$$

• It has been shown that for L=m-1:

$$\left\| M_t(x_i) - M_t(x_j) \right\|_2^2 = Q_t(x_i, x_j)$$

Hence, a diffusion distance in GMM space is equivalent to the Euclidean distance in the embedded space.
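The definition of the diffusion distance can be evaluated directly from the t-step transition matrix. The sketch below only computes the quantity as defined on this slide for all pairs of points; the function name is hypothetical, and the dense (n, n, n) intermediate array makes it suitable only for small n.

```python
import numpy as np

def diffusion_distance_sq(P, t):
    """Q_t(x_i, x_j) for all pairs, from the Markov matrix P and integer t."""
    Pt = np.linalg.matrix_power(P, t)          # t-step transition probabilities
    diff = Pt[:, None, :] - Pt[None, :, :]     # row i minus row j, for all i, j
    return (diff ** 2).sum(axis=-1)            # sum over k of squared differences

# Two-state example: the rows of P differ by 0.4 in each column.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
Q1 = diffusion_distance_sq(P, 1)               # Q1[0, 1] = 0.4**2 + 0.4**2 = 0.32
```

As t grows, the rows of the t-step matrix approach the stationary distribution, so all pairwise distances shrink toward zero.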

Toy Example

D-vector Extraction

• The mapping $M_t$ is defined only for the domain $\{x_i\}_{i=1}^{n} \subset G_d$.

• Given a new test point $x_{n+1} \in G_d$, $M_t$ has to be extended to $\{x_i\}_{i=1}^{n} \cup \{x_{n+1}\}$.

• This is done using the following Nyström extension:

$$\psi_{k(n+1)} = \frac{1}{\lambda_k} \sum_{j=1}^{n} P_{n+1,j}\, \psi_{kj}, \qquad 1 \le k \le L$$
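A minimal sketch of d-vector extraction for one test point follows. It assumes that the transition probabilities from the new point are obtained by normalizing its affinities to the n training points in the same way as the rows of P, and that the extended eigenvector entries are then scaled by the eigenvalues raised to the power t, as in the training-time mapping. The function and argument names are hypothetical.

```python
import numpy as np

def nystrom_dvector(a_new, psi, lam, t):
    """Extend the diffusion map to a new point (illustrative sketch).

    a_new : (n,)   affinities between the new point and the n training points
    psi   : (n, L) right eigenvectors of P, one eigenvector per column
    lam   : (L,)   corresponding (nonzero) eigenvalues
    t     : diffusion time (integer)
    """
    p_new = a_new / a_new.sum()         # transition probabilities from the new point
    psi_new = (p_new @ psi) / lam       # Nystrom-extended eigenvector entries
    return (lam ** t) * psi_new         # embedding of the new point
```

A useful sanity check is that passing the affinities of a training point reproduces that point's training-time embedding, because the eigenvectors satisfy P psi_k = lambda_k psi_k.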

Speaker Verification Pipeline

• 12 MFCC + 12 delta + 12 delta-delta

• Voice activity detection

• Feature warping

• GMM order is 1024

• 400-dimensional i-vectors, 400-dimensional d-vectors

• Length normalization

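• Gaussian PLDA (an additive model of each vector $w_i$ in terms of a mean $\bar{w}$, a latent factor $h_i$ and a residual $b_i$)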
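Of the pipeline steps above, length normalization is the simplest to illustrate: each i-vector or d-vector is scaled to unit Euclidean norm before PLDA modeling. The sketch below assumes plain L2 scaling with no preceding centering or whitening, which the slide does not specify; the function name is hypothetical.

```python
import numpy as np

def length_normalize(W):
    """Scale each row of W (one 400-dimensional vector per row) to unit L2 norm."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / norms
```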

Datasets

• A gender-independent UBM was trained on ~13K sessions from Switchboard-II, NIST04 and NIST06 SRE

• I-vector and d-vector systems were trained on ~17K female and ~11K male telephone sessions from NIST04, NIST06 and NIST08 SRE

• Experiments were conducted on the three telephone-only core conditions of NIST10 SRE (5, 6 and 8)

Results – NIST 2010 EER (%)

System          Condition 5   Condition 6   Condition 8

Males
i-vector PLDA   2.5           4.9           1.0
d-vector PLDA   2.3           2.9           1.6
Fused system    1.7           2.2           0.8

Females
i-vector PLDA   2.7           6.0           2.2
d-vector PLDA   2.3           4.4           2.2
Fused system    2.0           3.3           1.7

Results – NIST 2010 Old-min-DCF


System          Condition 5   Condition 6   Condition 8

Males
i-vector PLDA   0.138         0.231         0.073
d-vector PLDA   0.131         0.192         0.045
Fused system    0.103         0.128         0.033

Females
i-vector PLDA   0.132         0.244         0.087
d-vector PLDA   0.127         0.224         0.065
Fused system    0.096         0.199         0.056

Results – NIST 2010 New-min-DCF


System          Condition 5   Condition 6   Condition 8

Males
i-vector PLDA   0.507         0.769         0.109
d-vector PLDA   0.307         0.696         0.192
Fused system    0.279         0.617         0.150

Females
i-vector PLDA   0.431         0.758         0.242
d-vector PLDA   0.322         0.814         0.238
Fused system    0.291         0.781         0.179

Results – NIST 2010 Relative Improvement (%)

Measure       d-vector PLDA system   Fused system

Males
EER           19                     44
Old min-DCF   17                     40
New min-DCF   14                     24

Females
EER           18                     36
Old min-DCF   10                     24
New min-DCF   4                      13
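The slides do not state how the relative improvements are derived from the per-condition results. The sketch below assumes they are the relative reduction of each measure averaged over conditions 5, 6 and 8, with the i-vector PLDA system as the baseline; under that assumption the rounded values agree with the table above. The per-condition numbers are copied from the results slides, and the variable and function names are illustrative.

```python
# Per-condition scores (conditions 5, 6, 8) copied from the results slides.
results = {
    ("Males", "EER"): {"i": [2.5, 4.9, 1.0], "d": [2.3, 2.9, 1.6], "fused": [1.7, 2.2, 0.8]},
    ("Males", "Old min-DCF"): {"i": [0.138, 0.231, 0.073], "d": [0.131, 0.192, 0.045], "fused": [0.103, 0.128, 0.033]},
    ("Males", "New min-DCF"): {"i": [0.507, 0.769, 0.109], "d": [0.307, 0.696, 0.192], "fused": [0.279, 0.617, 0.150]},
    ("Females", "EER"): {"i": [2.7, 6.0, 2.2], "d": [2.3, 4.4, 2.2], "fused": [2.0, 3.3, 1.7]},
    ("Females", "Old min-DCF"): {"i": [0.132, 0.244, 0.087], "d": [0.127, 0.224, 0.065], "fused": [0.096, 0.199, 0.056]},
    ("Females", "New min-DCF"): {"i": [0.431, 0.758, 0.242], "d": [0.322, 0.814, 0.238], "fused": [0.291, 0.781, 0.179]},
}

def relative_improvement(baseline, system):
    """Relative reduction (%) of the condition-averaged measure (assumed definition)."""
    return 100.0 * (1.0 - sum(system) / sum(baseline))

for (gender, measure), s in results.items():
    d_gain = relative_improvement(s["i"], s["d"])
    f_gain = relative_improvement(s["i"], s["fused"])
    print(f"{gender:8s} {measure:12s} d-vector: {d_gain:5.1f}%  fused: {f_gain:5.1f}%")
```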

Conclusions

• This work proposes manifold learning for nonlinear i-vector extraction, termed d-vector extraction

• The d-vector extraction algorithm is based on the Diffusion Maps framework

• In the experiments, the d-vector system obtained an error reduction of 4-19%*

• Simple fusion of the i-vector and d-vector systems resulted in an error reduction of 13-44%*

*Compared to the baseline i-vector system (with the same pipeline), depending on the gender and error measure.
