
Page 1:

International Conference on Intelligent and Advanced Systems 2007

Chee-Ming Ting, Sh-Hussain Salleh, Tian-Swee Tan, A. K. Ariff

Jain-De Lee

Page 2:

INTRODUCTION

GMM SPEAKER IDENTIFICATION SYSTEM

EXPERIMENTAL EVALUATION

CONCLUSION

Page 3:

Speaker recognition is generally divided into two tasks:

◦ Speaker Verification (SV)

◦ Speaker Identification (SI)

Speaker model:

◦ Text-dependent (TD)

◦ Text-independent (TI)

Page 4:

Many approaches have been proposed for TI speaker recognition:

◦ VQ-based method

◦ Hidden Markov Models

◦ Gaussian Mixture Model

VQ-based method

Page 5:

Hidden Markov Models

◦ State Probability

◦ Transition Probability

Classify acoustic events corresponding to HMM states to characterize each speaker in the TI task

TI performance is unaffected by discarding transition probabilities in HMM models

Page 6:

Gaussian Mixture Model

◦ Corresponds to a single state continuous ergodic HMM

◦ Equivalent to discarding the transition probabilities in the HMM models

The use of GMM for speaker identity modeling

◦ The Gaussian components represent some general speaker-dependent spectral shapes

◦ The capability of Gaussian mixtures to model arbitrary densities

Page 7:

The GMM speaker identification system consists of the following elements

◦ Speech processing

◦ Gaussian mixture model

◦ Parameter estimation

◦ Identification

Page 8:

Mel-scale frequency cepstral coefficient (MFCC) extraction is used in the front-end processing

Mel-scale cepstral feature analysis (front-end block diagram): input speech signal → pre-emphasis → framing → Hamming window → FFT → triangular band-pass (Mel) filter bank → logarithm → DCT
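As an illustration only (not the authors' implementation), a minimal numpy/scipy sketch of this front-end might look as follows; the sampling rate, frame length, hop size, number of filters, and number of cepstral coefficients are assumed values, since the exact analysis parameters are not given on this slide.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, fs=16000, frame_len=400, hop=160, n_fft=512, n_filt=26, n_ceps=16):
    """Minimal MFCC front-end following the block diagram above (assumed parameters)."""
    # Pre-emphasis
    emph = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing and Hamming window
    n_frames = 1 + (len(emph) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emph[idx] * np.hamming(frame_len)
    # FFT -> power spectrum
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular Mel-scale band-pass filter bank
    mel = np.linspace(0, 2595 * np.log10(1 + (fs / 2) / 700), n_filt + 2)
    hz = 700 * (10 ** (mel / 2595) - 1)
    bins = np.floor((n_fft + 1) * hz / fs).astype(int)
    fbank = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(1, n_filt + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Logarithm and DCT yield the cepstral coefficients
    log_energy = np.log(power @ fbank.T + 1e-10)
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]
```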

Page 9:

The Gaussian mixture density is a weighted linear combination of M unimodal Gaussian component densities:

p(x \mid \lambda) = \sum_{i=1}^{M} w_i\, b_i(x)

where x is a D-dimensional feature vector, b_i(x), i = 1, \ldots, M, are the component densities, and w_i, i = 1, \ldots, M, are the mixture weights.

The mixture weights satisfy the constraint

\sum_{i=1}^{M} w_i = 1

Page 10:

Each component density is a D-variate Gaussian function of the form

b_i(x) = \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i) \right\}

where \mu_i is the mean vector and \Sigma_i is the covariance matrix.

The Gaussian mixture density model is denoted as

\lambda = \{ w_i, \mu_i, \Sigma_i \}, \quad i = 1, \ldots, M
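For illustration, a minimal sketch of evaluating the per-frame log-likelihood log p(x_t | λ) under this model, assuming diagonal covariance matrices (the formula above covers the general full-covariance case); all names here are hypothetical.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """Per-frame log p(x_t | lambda) for lambda = {w_i, mu_i, Sigma_i} with diagonal Sigma_i.

    X: (T, D) feature vectors; weights: (M,); means: (M, D); variances: (M, D).
    """
    T, D = X.shape
    log_wb = np.empty((T, len(weights)))
    for i, (w, mu, var) in enumerate(zip(weights, means, variances)):
        diff = X - mu
        # log of the D-variate Gaussian density b_i(x)
        log_b = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(var))
                        + np.sum(diff ** 2 / var, axis=1))
        log_wb[:, i] = np.log(w) + log_b
    # log-sum-exp over the M mixture components
    m = log_wb.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_wb - m).sum(axis=1, keepdims=True))).ravel()
```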

Page 11:

Conventional GMM training process

Input training vectors → LBG algorithm (initialization) → EM algorithm → convergence check → end (if not converged, repeat the EM step)

Page 12:

LBG algorithm (flowchart): input training vectors → overall average (single centroid) → split → clustering → cluster averages → calculate distortion D → if (D − D′)/D < δ, check the codebook size m: split again if m < M, otherwise end; if not yet below δ, set D′ = D and repeat the clustering step
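For reference, a minimal binary-splitting LBG sketch along the lines of this flowchart; the perturbation factor and distortion threshold are assumed values, and the target codebook size M is taken to be a power of two.

```python
import numpy as np

def lbg(X, M, eps=0.01, delta=1e-3):
    """Binary-splitting LBG codebook design on training vectors X of shape (T, D)."""
    centroids = X.mean(axis=0, keepdims=True)                 # overall average
    while len(centroids) < M:
        # split: perturb every centroid into two copies
        centroids = np.vstack([centroids * (1 + eps), centroids * (1 - eps)])
        prev_dist = np.inf
        while True:
            # clustering: assign each vector to its nearest centroid
            d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d.argmin(axis=1)
            # cluster averages become the new centroids
            for k in range(len(centroids)):
                if np.any(labels == k):
                    centroids[k] = X[labels == k].mean(axis=0)
            # calculate distortion and stop refining when its relative change is small
            dist = ((X - centroids[labels]) ** 2).sum(axis=1).mean()
            if (prev_dist - dist) / max(dist, 1e-12) < delta:
                break
            prev_dist = dist
    return centroids
```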

Page 13:

Speaker model training estimates the GMM parameters via maximum likelihood (ML) estimation

Expectation-maximization (EM) algorithm

For T training vectors X = {x_1, \ldots, x_T}, the GMM likelihood is

p(X \mid \lambda) = \prod_{t=1}^{T} p(x_t \mid \lambda)

The EM re-estimation formulas for the mixture weights, means, and variances are

\bar{w}_i = \frac{1}{T} \sum_{t=1}^{T} p(i \mid x_t, \lambda)

\bar{\mu}_i = \frac{\sum_{t=1}^{T} p(i \mid x_t, \lambda)\, x_t}{\sum_{t=1}^{T} p(i \mid x_t, \lambda)}

\bar{\sigma}_i^2 = \frac{\sum_{t=1}^{T} p(i \mid x_t, \lambda)\, x_t^2}{\sum_{t=1}^{T} p(i \mid x_t, \lambda)} - \bar{\mu}_i^2
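A minimal numpy sketch of one EM re-estimation step implementing these formulas with diagonal covariances (variances); this is an illustration with hypothetical names, not the authors' code.

```python
import numpy as np

def em_step(X, weights, means, variances):
    """One EM re-estimation of {w_i, mu_i, sigma_i^2} from training vectors X of shape (T, D)."""
    T, D = X.shape
    M = len(weights)
    # E-step: a posteriori probabilities p(i | x_t, lambda)
    log_wb = np.empty((T, M))
    for i in range(M):
        diff = X - means[i]
        log_b = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances[i]))
                        + np.sum(diff ** 2 / variances[i], axis=1))
        log_wb[:, i] = np.log(weights[i]) + log_b
    log_wb -= log_wb.max(axis=1, keepdims=True)
    post = np.exp(log_wb)
    post /= post.sum(axis=1, keepdims=True)                    # p(i | x_t, lambda)
    # M-step: re-estimate weights, means, and variances
    n_i = post.sum(axis=0) + 1e-12                             # soft counts
    new_w = n_i / T
    new_mu = (post.T @ X) / n_i[:, None]
    new_var = (post.T @ (X ** 2)) / n_i[:, None] - new_mu ** 2
    return new_w, new_mu, np.maximum(new_var, 1e-6)            # variance floor
```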

Page 14:

This paper proposes a training algorithm that consists of two steps

Page 15:

Cluster the training vectors to the mixture component with the highest likelihood

Re-estimate parameters of each component

C(x) = \arg\max_{1 \le i \le M} b_i(x)

w_i = (number of vectors classified in cluster i) / (total number of training vectors)

\mu_i = sample mean of the vectors classified in cluster i

\Sigma_i = sample covariance matrix of the vectors classified in cluster i
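A sketch of one pass of this two-step procedure, assuming diagonal covariances and hypothetical names; the formulation above applies equally with full covariance matrices.

```python
import numpy as np

def highest_likelihood_step(X, weights, means, variances):
    """Hard-assign each vector to the component with highest b_i(x), then re-estimate.

    X: (T, D); weights: (M,); means: (M, D); variances: (M, D), all numpy arrays.
    """
    T, D = X.shape
    M = len(weights)
    log_b = np.empty((T, M))
    for i in range(M):
        diff = X - means[i]
        log_b[:, i] = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances[i]))
                              + np.sum(diff ** 2 / variances[i], axis=1))
    labels = log_b.argmax(axis=1)                  # C(x) = argmax_i b_i(x)
    for i in range(M):
        cluster = X[labels == i]
        if len(cluster) == 0:
            continue                               # keep old parameters for an empty cluster
        weights[i] = len(cluster) / T              # fraction of vectors in cluster i
        means[i] = cluster.mean(axis=0)            # sample mean of cluster i
        variances[i] = cluster.var(axis=0) + 1e-6  # sample (diagonal) covariance of cluster i
    return weights / weights.sum(), means, variances
```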

Page 16:

The feature sequence is classified to the speaker whose model likelihood is the highest:

\hat{S} = \arg\max_{1 \le k \le S} p(X \mid \lambda_k)

The above can be formulated in logarithmic terms as

\hat{S} = \arg\max_{1 \le k \le S} \sum_{t=1}^{T} \log p(x_t \mid \lambda_k)
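Given a per-frame scorer such as the gmm_log_likelihood sketch above, the identification decision reduces to an argmax over the summed frame log-likelihoods (illustrative only; all names are hypothetical):

```python
import numpy as np

def identify_speaker(X, speaker_models, score_frames):
    """Return the index of the speaker model with the highest total log-likelihood.

    speaker_models: list of (weights, means, variances) tuples, one per enrolled speaker;
    score_frames: a function returning per-frame log p(x_t | lambda), e.g. gmm_log_likelihood.
    """
    totals = [np.sum(score_frames(X, w, mu, var)) for (w, mu, var) in speaker_models]
    return int(np.argmax(totals))   # S_hat = argmax_k sum_t log p(x_t | lambda_k)
```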

Page 17:

Database and Experiment Conditions

◦ 7 male and 3 female speakers

◦ 40 sentence utterances with different text

◦ The average sentence duration is approximately 3.5 s

Performance Comparison between EM and Highest Mixture Likelihood Clustering Training

◦ 16 Gaussian mixture components

◦ 16-dimensional MFCCs

◦ 20 utterances are used for training

Page 18:

Convergence condition: |p(X \mid \lambda^{(k+1)}) - p(X \mid \lambda^{(k)})| \le 0.03

Page 19:

The comparison between EM and highest likelihood clustering training on identification rate

◦ 10 sentences were used for training

◦ 25 sentences were used for testing

◦ 4 Gaussian components

◦ 8 iterations

Page 20:

Effect of Different Number of Gaussian Mixture Components and Amount of Training Data

◦ The MFCC feature dimension is fixed to 12

◦ 25 sentences are used for testing

Page 21:

Effect of Feature Set on Performance for Different Number of Gaussian Mixture Components

◦ Combinations with first- and second-order difference coefficients were tested

◦ 10 sentences are used for training

◦ 30 sentences are used for testing

Page 22:

The proposed highest mixture likelihood clustering training performs comparably to conventional EM training but requires less computation time

First-order difference coefficients are sufficient to capture the transitional information with reasonable dimensional complexity

A 12-dimensional, 16-component GMM trained with 5 sentences achieved a 98.4% identification rate