
SPEAKER IDENTIFICATION USING GAUSSIAN MIXTURE MODELS BASED ON MULTI-SPACE PROBABILITY DISTRIBUTION

Chiyomi Miyajima†, Yosuke Hattori†,∗, Keiichi Tokuda†, Takashi Masuko‡, Takao Kobayashi‡, and Tadashi Kitamura†

† Nagoya Institute of Technology, Nagoya 466-8555, Japan
‡ Tokyo Institute of Technology, Yokohama 226-8502, Japan

∗ Present address: DENSO Corporation, Kariya 448-8661, Japan
†{chiyomi, tokuda, kitamura}@ics.nitech.ac.jp, ‡{masuko, tkobayas}@ip.titech.ac.jp, ∗ [email protected]

ABSTRACT

This paper presents a new approach to modeling speech spectra and pitch for text-independent speaker identification using Gaussian mixture models based on multi-space probability distribution (MSD-GMM). The MSD-GMM allows us to model continuous pitch values for voiced frames and discrete symbols representing unvoiced frames in a unified framework. Spectral and pitch features are jointly modeled by a two-stream MSD-GMM. We derive maximum likelihood (ML) estimation formulae for the MSD-GMM parameters, and the MSD-GMM speaker models are evaluated for text-independent speaker identification tasks. Experimental results show that the MSD-GMM can efficiently model spectral and pitch features of each speaker and outperforms conventional speaker models.

1. INTRODUCTION

Gaussian mixture models (GMMs) have been successfully applied to speaker modeling in text-independent speaker identification [1]. Such identification systems mainly use spectral features represented by cepstral coefficients as speaker features. Pitch features, as well as spectral features, contain much speaker-specific information [2, 3]. However, most speaker recognition studies in recent years have focused on spectral features alone. The main reasons for this are that i) pitch features alone do not give sufficient recognition performance, and ii) pitch values are undefined in unvoiced segments, which complicates speaker modeling and feature integration.

Several works have reported that speaker recognition accuracy can be improved by the use of pitch features in addition to spectral features [4, 5, 6, 7]. There are essentially two approaches to integrating spectral and pitch information: i) two separate models are used for spectral and pitch features and their scores are combined [4, 5], or ii) two separate models for voiced and unvoiced parts are trained and their scores are combined [6, 7]. In [7], two separate GMMs are used, where the input observations are concatenations of cepstral coefficients and log F0 for voiced frames, and cepstral coefficients alone for unvoiced frames. Since the probability distribution of the conventional GMM is defined on a single vector space, these two kinds of vectors require their respective models.

A part of this work was supported by Research Fellowships from the Japan Society for the Promotion of Science for Young Scientists, No. 199808177.

In this paper, a new speaker modeling technique using a GMM based on multi-space probability distribution (MSD) [8] is introduced. The MSD-GMM allows us to model feature vectors with variable dimensionality, including zero-dimensional vectors, i.e., discrete symbols. Consequently, continuous pitch values for voiced frames and discrete symbols representing "unvoiced" can be modeled by an MSD-GMM in a unified framework, and spectral and pitch features are jointly modeled by a multi-stream MSD-GMM, i.e., each speaker is modeled by a single statistical model. We derive maximum likelihood (ML) estimation formulae and evaluate the MSD-GMM speaker models for text-independent speaker identification tasks in comparison with conventional GMM speaker models.

The rest of the paper is organized as follows. Section 2 introduces the speaker modeling technique based on the MSD-GMM. Section 3 presents the ML estimation procedure for the MSD-GMM parameters. Section 4 reports experimental results, and Section 5 gives conclusions and future work.

2. MULTI-STREAM MSD-GMM

2.1. Likelihood Calculation

Let us assume that a given observation $\mathbf{o}_t$ at time $t$ consists of $S$ information sources (streams). The $s$-th stream $\mathbf{o}_{ts}$ has a set of space indices $X_{ts}$ and a random vector with variable dimensionality $\mathbf{x}_{ts}$, that is,
\[
\mathbf{o}_t = (\mathbf{o}_{t1}, \mathbf{o}_{t2}, \ldots, \mathbf{o}_{tS}), \tag{1}
\]
\[
\mathbf{o}_{ts} = (X_{ts}, \mathbf{x}_{ts}). \tag{2}
\]
Note here that $X_{ts}$ is a subset of all possible space indices $\{1, 2, \ldots, G_s\}$, and all the spaces represented by the indices in $X_{ts}$ have the same dimensionality as $\mathbf{x}_{ts}$.

We define the output probability distribution of an $S$-stream MSD-GMM $\lambda$ for $\mathbf{o}_t$ as
\[
b(\mathbf{o}_t \mid \lambda) = \sum_{m=1}^{M} c_m \prod_{s=1}^{S} p_{ms}(\mathbf{o}_{ts}), \tag{3}
\]
where $c_m$ is the mixture weight for the $m$-th mixture component. The observation probability of $\mathbf{o}_{ts}$ for mixture $m$ is given by the multi-space probability distribution (MSD) [8]:
\[
p_{ms}(\mathbf{o}_{ts}) = \sum_{g \in X_{ts}} w_{msg}\, \mathcal{N}^{D_{sg}}_{msg}(\mathbf{x}_{ts}), \tag{4}
\]


[Figure 1 depicts the $m$-th mixture component of a three-stream MSD-GMM for an observation $\mathbf{o}_t = (\mathbf{o}_{t1}, \mathbf{o}_{t2}, \mathbf{o}_{t3})$ at time $t$: stream 1 ($G_1 = 4$, five-dimensional spaces) with $\mathbf{o}_{t1} = (\{2, 3\}, \mathbf{x}_{t1})$ and $p_{m1}(\mathbf{o}_{t1}) = w_{m12}\mathcal{N}^{5}_{m12}(\mathbf{x}_{t1}) + w_{m13}\mathcal{N}^{5}_{m13}(\mathbf{x}_{t1})$; stream 2 ($G_2 = 1$, seven-dimensional) with $\mathbf{o}_{t2} = (\{1\}, \mathbf{x}_{t2})$ and $p_{m2}(\mathbf{o}_{t2}) = \mathcal{N}^{7}_{m21}(\mathbf{x}_{t2})$; stream 3 ($G_3 = 2$) with the zero-dimensional second space selected, $\mathbf{o}_{t3} = (\{2\}, \mathbf{x}_{t3})$ and $p_{m3}(\mathbf{o}_{t3}) = w_{m32}$.]

Fig. 1. An example of the m-th mixture component of a three-stream MSD-GMM.

where $w_{msg}$ is the weight for the $g$-th vector space of the $s$-th stream and $\mathcal{N}^{D_{sg}}_{msg}(\cdot)$ is the $D_{sg}$-variate Gaussian function with mean vector $\mu_{msg}$ and covariance matrix $\Sigma_{msg}$ (for the case $D_{sg} > 0$). For simplicity of notation, we define $\mathcal{N}^{0}_{msg}(\cdot) \equiv 1$ (for the case $D_{sg} = 0$). Note here that the multi-space probability distribution (MSD) is equivalent to a continuous probability distribution when $D_{sg} \equiv n > 0$, and to a discrete probability distribution when $D_{sg} \equiv 0$. The MSD-GMM can thus be regarded as a generalized GMM, which includes the traditional GMM as a special case when $S = 1$, $G_1 = 1$, and $D_{11} > 0$.
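To make that reduction concrete, here is the special case written out (a worked step added here, following directly from (3) and (4)): with $S = 1$, $G_1 = 1$, and $D_{11} > 0$, every frame triggers the single space, $X_{t1} = \{1\}$, and the space-weight constraint $\sum_g w_{m1g} = 1$ forces $w_{m11} = 1$, so
\[
b(\mathbf{o}_t \mid \lambda)
  = \sum_{m=1}^{M} c_m\, w_{m11}\, \mathcal{N}^{D_{11}}_{m11}(\mathbf{x}_{t1})
  = \sum_{m=1}^{M} c_m\, \mathcal{N}^{D_{11}}_{m11}(\mathbf{x}_{t1}),
\]
which is exactly the density of a conventional GMM.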

For an observation sequence $O = (\mathbf{o}_1, \mathbf{o}_2, \ldots, \mathbf{o}_T)$, the likelihood of MSD-GMM $\lambda$ is given by
\[
P(O \mid \lambda) = \prod_{t=1}^{T} b(\mathbf{o}_t \mid \lambda). \tag{5}
\]

Figure 1 illustrates an example of the $m$-th mixture component of a three-stream MSD-GMM ($S = 3$). The sample space of the first stream consists of four spaces ($G_1 = 4$), among which the second and third spaces are triggered by the space indices, and $p_{m1}(\mathbf{o}_{t1})$ becomes the sum of the two weighted Gaussians. The second stream has only one space ($G_2 = 1$) and always outputs its Gaussian as $p_{m2}(\mathbf{o}_{t2})$. The third stream consists of two spaces ($G_3 = 2$), where a zero-dimensional space is selected, and outputs its space weight $w_{m32}$ (a discrete probability) as $p_{m3}(\mathbf{o}_{t3})$.
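To make the computation of (3)–(5) concrete, the following Python sketch (our illustration, not the authors' code; the dictionary-based parameter layout and all names are assumptions) evaluates the frame and sequence likelihoods for a one-component toy model with the stream structure of Fig. 1. A stream observation is a pair (index set X_ts, vector x_ts), with None standing in for a zero-dimensional vector.

import numpy as np
from scipy.stats import multivariate_normal

def stream_prob(o_ts, w, mu, cov):
    """p_ms(o_ts) of Eq. (4): sum over triggered spaces g in X_ts of
    w_msg * N^{D_sg}_msg(x_ts); a zero-dimensional space (mu[g] is None)
    contributes its weight alone, since N^0(.) is defined to be 1."""
    X_ts, x_ts = o_ts
    p = 0.0
    for g in X_ts:
        if mu[g] is None:                       # D_sg = 0: discrete symbol
            p += w[g]
        else:                                   # D_sg > 0: Gaussian space
            p += w[g] * multivariate_normal.pdf(x_ts, mean=mu[g], cov=cov[g])
    return p

def frame_likelihood(o_t, c, w, mu, cov):
    """b(o_t | lambda) of Eq. (3): mixture of per-stream MSD products."""
    return sum(c[m] * np.prod([stream_prob(o_t[s], w[m][s], mu[m][s], cov[m][s])
                               for s in range(len(o_t))])
               for m in range(len(c)))

def sequence_likelihood(O, c, w, mu, cov):
    """P(O | lambda) of Eq. (5): product of frame likelihoods."""
    return np.prod([frame_likelihood(o_t, c, w, mu, cov) for o_t in O])

# Toy model (M = 1, S = 3): stream 1 has four 5-dim spaces, stream 2 one
# 7-dim space, stream 3 a 1-dim space plus a 0-dim space.
rng = np.random.default_rng(0)
c = [1.0]
w = [[{1: .25, 2: .25, 3: .25, 4: .25}, {1: 1.0}, {1: .6, 2: .4}]]
mu = [[{g: rng.normal(size=5) for g in (1, 2, 3, 4)},
       {1: rng.normal(size=7)},
       {1: rng.normal(size=1), 2: None}]]
cov = [[{g: np.eye(5) for g in (1, 2, 3, 4)},
        {1: np.eye(7)},
        {1: np.eye(1), 2: None}]]
# As in Fig. 1: spaces {2, 3} triggered in stream 1, the 0-dim space in stream 3.
o_t = (({2, 3}, rng.normal(size=5)), ({1}, rng.normal(size=7)), ({2}, None))
print(frame_likelihood(o_t, c, w, mu, cov))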

2.2. Speaker Modeling Based on MSD-GMM

Figure 2 shows an example of spectral and pitch sequences of the Japanese word "/ashi/" spoken by a Japanese male speaker. Generally, spectral features are represented by multi-dimensional vectors of cepstral coefficients with continuous values. On the other hand, pitch features are represented by one-dimensional continuous values of log fundamental frequency (log F0) in voiced frames and by discrete symbols representing "unvoiced" in unvoiced frames, because pitch values are defined only in voiced segments.

[Figure 2 plots the spectrogram (0–5 kHz) and the pitch contour (50–200 Hz) of the utterance over time, with voiced (V) and unvoiced (UV) regions marked.]

Fig. 2. An example of spectral and pitch sequences of a word "/ashi/" spoken by a male speaker.

[Figure 3 depicts the $m$-th mixture component of the two-stream model: the spectrum stream ($G_1 = 1$, $w_{m11} = 1$) outputs $\mathcal{N}^{D}_{m11}(\mathbf{x}_{t1})$ for the $D$-dimensional spectral vector $\mathbf{o}_{t1} = (\{1\}, \mathbf{x}_{t1})$; the pitch stream ($G_2 = 2$) outputs $w_{m21}\mathcal{N}^{1}_{m21}(\mathbf{x}_{t2})$ for a voiced (one-dimensional) frame, or the space weight $w_{m22}$ for an unvoiced (zero-dimensional) frame.]

Fig. 3. The m-th mixture component in a two-stream MSD-GMM based on spectra and pitch.

As shown in Fig. 3, each speaker can be modeled by a two-stream MSD-GMM ($S = 2$); the first stream is for the spectral feature and the second stream is for the pitch feature. The spectral stream has a single $D$-dimensional space ($G_1 = 1$), and the pitch stream has two spaces ($G_2 = 2$): a one-dimensional space and a zero-dimensional space for voiced and unvoiced parts, respectively.
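As a concrete illustration of this two-stream encoding (a minimal sketch under our own conventions, not the authors' code: space 1 = voiced, space 2 = unvoiced, and the function name is hypothetical), each frame maps to the pair of stream observations of Eq. (2):

import numpy as np

V, UV = 1, 2   # space indices of the pitch stream (G2 = 2)

def make_observation(cepstrum, f0):
    """Build the two-stream observation o_t = (o_t1, o_t2) of Sec. 2.2.
    Stream 1 (spectrum) always triggers its single D-dim space; stream 2
    (pitch) triggers the 1-dim voiced space with log F0, or the 0-dim
    unvoiced space with no continuous value."""
    o_t1 = ({1}, np.asarray(cepstrum))          # spectrum: X_t1 = {1}
    if f0 is not None and f0 > 0:               # voiced frame
        o_t2 = ({V}, np.array([np.log(f0)]))    # x_t2 = log F0
    else:                                       # unvoiced frame
        o_t2 = ({UV}, None)                     # zero-dimensional
    return (o_t1, o_t2)

# e.g. a voiced frame with F0 = 120 Hz and a 12-dim cepstral vector:
obs = make_observation(np.zeros(12), 120.0)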

3. ML-ESTIMATION FOR MSD-GMM

In a similar way to the ML estimation procedure in [8], $P(O \mid \lambda)$ is increased by iterating the maximization of an auxiliary function $Q(\lambda', \lambda)$ over $\lambda$ to improve the current parameters $\lambda'$, based on the expectation-maximization (EM) algorithm.

3.1. Definition of Q-Function

The log-likelihood of $\lambda$ for an observation sequence $O$, a sequence of mixture components $\mathbf{i}$, and a sequence of space indices $\mathbf{l}$ can be written as
\[
\log P(O, \mathbf{i}, \mathbf{l} \mid \lambda)
= \sum_{t=1}^{T} \log c_{i_t}
+ \sum_{t=1}^{T} \sum_{s=1}^{S} \log w_{i_t s\, l_{ts}}
+ \sum_{t=1}^{T} \sum_{s=1}^{S} \log \mathcal{N}^{D_{s\, l_{ts}}}_{i_t s\, l_{ts}}(\mathbf{x}_{ts}), \tag{6}
\]
where
\[
\mathbf{i} = (i_1, i_2, \ldots, i_T), \tag{7}
\]
\[
\mathbf{l} = (\mathbf{l}_1, \mathbf{l}_2, \ldots, \mathbf{l}_T), \tag{8}
\]
\[
\mathbf{l}_t = (l_{t1}, l_{t2}, \ldots, l_{tS}). \tag{9}
\]


Hence the Q-function is defined as
\begin{align}
Q(\lambda', \lambda) &= \sum_{\text{all } \mathbf{i}, \mathbf{l}} P(\mathbf{i}, \mathbf{l} \mid O, \lambda') \log P(O, \mathbf{i}, \mathbf{l} \mid \lambda) \nonumber \\
&= \sum_{m=1}^{M} \sum_{t=1}^{T} P(i_t = m \mid O, \lambda') \log c_m \nonumber \\
&\quad + \sum_{m=1}^{M} \sum_{s=1}^{S} \sum_{g=1}^{G_s} \sum_{t \in T(O,s,g)} P(i_t = m, l_{ts} = g \mid O, \lambda') \log w_{msg} \nonumber \\
&\quad + \sum_{m=1}^{M} \sum_{s=1}^{S} \sum_{g=1}^{G_s} \sum_{t \in T(O,s,g)} P(i_t = m, l_{ts} = g \mid O, \lambda') \log \mathcal{N}^{D_{sg}}_{msg}(\mathbf{x}_{ts}), \tag{10}
\end{align}
where the second equality follows by substituting (6) and marginalizing $P(\mathbf{i}, \mathbf{l} \mid O, \lambda')$ term by term, and
\[
T(O, s, g) = \{\, t \mid g \in X_{ts} \,\}. \tag{11}
\]

3.2. Maximization of Q-Function

The first two terms of (10) have the form $\sum_{i=1}^{N} u_i \log y_i$, which attains a global maximum at the single point
\[
y_i = \frac{u_i}{\sum_{j=1}^{N} u_j}, \quad i = 1, 2, \ldots, N, \tag{12}
\]
under the constraints $\sum_{i=1}^{N} y_i = 1$ and $y_i \geq 0$. The maximization of the first term of (10) thus leads to the re-estimate of $c_m$:
\[
c_m = \frac{\displaystyle\sum_{t=1}^{T} P(i_t = m \mid O, \lambda')}{\displaystyle\sum_{m=1}^{M} \sum_{t=1}^{T} P(i_t = m \mid O, \lambda')}
= \frac{1}{T} \sum_{t=1}^{T} P(i_t = m \mid O, \lambda')
= \frac{1}{T} \sum_{t=1}^{T} \gamma'_t(m), \tag{13}
\]

where $\gamma_t(m)$ is the posterior probability of being in the $m$-th mixture component at time $t$, that is,
\[
\gamma_t(m) = P(i_t = m \mid O, \lambda) = \frac{c_m \prod_{s=1}^{S} p_{ms}(\mathbf{o}_{ts})}{b(\mathbf{o}_t \mid \lambda)}. \tag{14}
\]

Similarly, the second term is maximized as
\[
w_{msg} = \frac{\displaystyle\sum_{t \in T(O,s,g)} \xi'_{ts}(m, g)}{\displaystyle\sum_{l=1}^{G_s} \sum_{t \in T(O,s,l)} \xi'_{ts}(m, l)}, \tag{15}
\]
where $\xi_{ts}(m, g)$ is the posterior probability of being in the $g$-th space of stream $s$ in the $m$-th mixture component at time $t$:
\begin{align}
\xi_{ts}(m, g) &= P(i_t = m, l_{ts} = g \mid O, \lambda) \nonumber \\
&= P(i_t = m \mid O, \lambda)\, P(l_{ts} = g \mid i_t = m, O, \lambda) \nonumber \\
&= \gamma_t(m)\, \frac{w_{msg}\, \mathcal{N}^{D_{sg}}_{msg}(\mathbf{x}_{ts})}{p_{ms}(\mathbf{o}_{ts})}. \tag{16}
\end{align}

The third term is maximized by solving the following equations:
\[
\frac{\partial}{\partial \mu_{msg}} \sum_{t \in T(O,s,g)} P(i_t = m, l_{ts} = g \mid O, \lambda') \log \mathcal{N}^{D_{sg}}_{msg}(\mathbf{x}_{ts}) = 0, \tag{17}
\]
\[
\frac{\partial}{\partial \Sigma^{-1}_{msg}} \sum_{t \in T(O,s,g)} P(i_t = m, l_{ts} = g \mid O, \lambda') \log \mathcal{N}^{D_{sg}}_{msg}(\mathbf{x}_{ts}) = 0, \tag{18}
\]
resulting in
\[
\mu_{msg} = \frac{\displaystyle\sum_{t \in T(O,s,g)} \xi'_{ts}(m, g)\, \mathbf{x}_{ts}}{\displaystyle\sum_{t \in T(O,s,g)} \xi'_{ts}(m, g)}, \tag{19}
\]
\[
\Sigma_{msg} = \frac{\displaystyle\sum_{t \in T(O,s,g)} \xi'_{ts}(m, g)\, (\mathbf{x}_{ts} - \mu_{msg})(\mathbf{x}_{ts} - \mu_{msg})^{\top}}{\displaystyle\sum_{t \in T(O,s,g)} \xi'_{ts}(m, g)}. \tag{20}
\]
Note that, since (19) and (20) solve (17) and (18), their normalizers are the occupancy of the $g$-th space itself, $\sum_{t \in T(O,s,g)} \xi'_{ts}(m, g)$, rather than the all-space sum that normalizes the weights in (15).

The re-estimation is repeated iteratively using $\lambda$ in place of $\lambda'$, and the final result is an ML estimate of the MSD-GMM parameters.
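Putting (13)–(16), (19), and (20) together, one EM iteration for the two-stream speaker model of Sec. 2.2 might look as follows (a minimal sketch under our own conventions, not the authors' implementation: diagonal covariances, cepstra in a (T, D) array, log F0 in a length-T array with np.nan marking unvoiced frames). Since each frame here triggers exactly one pitch space, $\xi_{ts}(m, g)$ reduces to $\gamma_t(m)$ on the triggered space.

import numpy as np

def diag_gauss(x, mu, var):
    """Diagonal-covariance Gaussian density N(x; mu, diag(var))."""
    d = x - mu
    return float(np.exp(-0.5 * np.sum(d * d / var + np.log(2 * np.pi * var))))

def em_step(ceps, logf0, c, mu_s, var_s, w_v, mu_p, var_p):
    """One EM iteration for a two-stream MSD-GMM (spectrum + pitch),
    implementing Eqs. (13)-(16), (19), (20) for this special case."""
    T, M = len(ceps), len(c)
    voiced = ~np.isnan(logf0)
    gamma = np.zeros((T, M))
    for t in range(T):
        for m in range(M):
            p1 = diag_gauss(ceps[t], mu_s[m], var_s[m])    # spectrum stream
            if voiced[t]:                                  # pitch stream, Eq. (4)
                p2 = w_v[m] * diag_gauss(logf0[t], mu_p[m], var_p[m])
            else:
                p2 = 1.0 - w_v[m]                          # 0-dim space weight
            gamma[t, m] = c[m] * p1 * p2
    gamma /= gamma.sum(axis=1, keepdims=True)              # Eq. (14)
    occ = gamma.sum(axis=0)                                # sum_t gamma'_t(m)
    occ_v = gamma[voiced].sum(axis=0)                      # voiced-space occupancy
    new_c = occ / T                                        # Eq. (13)
    new_w_v = occ_v / occ                                  # Eq. (15)
    new_mu_s = gamma.T @ ceps / occ[:, None]               # Eq. (19)
    new_var_s = np.maximum(                                # Eq. (20), diagonal;
        gamma.T @ ceps**2 / occ[:, None] - new_mu_s**2,    # small floor added for
        1e-6)                                              # numerical safety
    fv = logf0[voiced]
    new_mu_p = gamma[voiced].T @ fv / occ_v
    new_var_p = np.maximum(gamma[voiced].T @ fv**2 / occ_v - new_mu_p**2, 1e-6)
    return new_c, new_mu_s, new_var_s, new_w_v, new_mu_p, new_var_p

Each such iteration does not decrease $P(O \mid \lambda)$; iterating to convergence per enrolled speaker yields the speaker models, and identification then selects the speaker whose model maximizes (5).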

4. EXPERIMENTAL EVALUATION

4.1. Experimental Conditions

First, a speaker identification experiment was carried out using the ATR Japanese speech database. We used word data spoken by 80 speakers (40 males and 40 females). A phonetically balanced set of 216 words was used for training each speaker model, and 520 common words per speaker were used for testing. The number of tests was 41,600 in total.

Second, to evaluate the robustness of the MSD-GMM speaker model against inter-session variability, we also conducted a speaker identification experiment using the NTT database. The database consists of sentence data uttered by 35 Japanese speakers (22 males and 13 females) in five sessions over ten months (Aug., Sept., Dec. 1990, Mar., June 1991). In each session, 15 sentences were recorded for each speaker. Ten sentences are common to all speakers and all sessions (A-set), and five sentences are different for each speaker and each session (B-set). The duration of each sentence is approximately four seconds. We used 15 sentences (A-set + B-set from the first session) per speaker for training, and 20 sentences (B-set from the other four sessions) per speaker for testing. The number of tests was 700 in total.

[Figure 4 is a bar chart of speaker identification error rate (%) with 95% confidence intervals versus the number of mixture components (32 and 64; 24+8 and 48+16 for V+UV-GMMs), comparing the GMM, S+P-GMMs, V+UV-GMMs, and MSD-GMM systems on the ATR database (left) and the NTT database (right).]

Fig. 4. Comparison of MSD-GMM speaker models with conventional GMM and two-separate GMM speaker models.

The speech data were down-sampled to 10 kHz, windowed at a 10-ms frame rate with a 25.6-ms Blackman window, and parameterized into 13 mel-cepstral coefficients using a mel-cepstral estimation technique [9]. The 12 static parameters excluding the zeroth coefficient were used as the spectral feature. Fundamental frequencies (F0) were estimated at a 10-ms frame rate using the RAPT method [10] with a 7.5-ms correlation window, and log F0 for voiced frames and discrete symbols for unvoiced frames were used as the pitch feature. Speakers were modeled by GMMs or multi-stream MSD-GMMs with diagonal covariances.
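A rough modern stand-in for this front end (an illustration only: the paper uses the mel-cepstral analysis of [9] and RAPT [10], whereas this sketch substitutes librosa's MFCC and pYIN, so the features differ in detail; the file name, F0 search range, and other parameters are assumptions) could be:

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=10000)   # down-sample to 10 kHz
hop = sr // 100                                   # 10-ms frame rate
ceps = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            hop_length=hop)[1:].T # drop c0: (frames, 12)
f0, voiced, _ = librosa.pyin(y, fmin=50, fmax=400, sr=sr, hop_length=hop)
logf0 = np.where(voiced, np.log(f0), np.nan)      # log F0; nan marks unvoiced

These arrays match the (ceps, logf0) layout assumed in the EM sketch of Section 3.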

4.2. Experimental Results

The MSD-GMM speaker identification system was compared with three kinds of conventional systems. Figure 4 shows speaker identification error rates with 95% confidence intervals (CIs) when using 32- and 64-component speaker models. The left and right halves of the figure correspond to the results for the ATR and NTT databases, respectively. In the figure, "GMM" denotes a conventional GMM speaker model using the spectral feature alone. "S+P-GMMs" represents a speaker model consisting of two GMMs for spectra and pitch. "V+UV-GMMs" is a speaker model consisting of two GMMs for voiced (V) and unvoiced (UV) parts [7], with the optimum numbers of mixture components for the V-GMM and the UV-GMM, i.e., 24 (V) + 8 (UV) or 48 (V) + 16 (UV), and a linear combination parameter α = 0.5 (α is the weight for the likelihood of the UV-GMM). "MSD-GMM" denotes the proposed model based on the multi-stream MSD-GMM.

As shown in the figure, the additional use of pitch information significantly improved the system performance, and the three systems using both spectral and pitch features gave much better performance than the conventional GMM system using the spectral feature alone. Among the three, the MSD-GMM system gave the best results, achieving 16% and 18% error reductions (for the ATR database) and 38% and 51% error reductions (for the NTT database) over the GMM system when using 32- and 64-mixture models, respectively. It is also noted that the MSD-GMM system requires no combination parameter (such as α) that has to be chosen or tuned heuristically.

5. CONCLUSION

This paper has introduced a new technique for modeling speakers based on the MSD-GMM for text-independent speaker identification. The MSD-GMM can model continuous pitch values of voiced frames and discrete symbols representing "unvoiced" in a unified framework, and spectral and pitch features can be jointly modeled by a multi-stream MSD-GMM. We derived the ML estimation formulae for the MSD-GMM parameters and evaluated the MSD-GMM speaker models on text-independent speaker identification tasks. The experimental results demonstrated the high utility of the MSD-GMM speaker model and also its robustness against inter-session variability.

The introduction of stream weights into the multi-stream MSD-GMM and the application of this framework to speaker verification systems will be subjects for future work.

6. REFERENCES

[1] D.A. Reynolds and R.C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Trans. Speech and Audio Process., 3 (1), 72–83, Jan. 1995.

[2] B.S. Atal, "Automatic speaker recognition based on pitch contours," J. Acoust. Soc. Amer., 52 (6), 1687–1697, 1972.

[3] S. Furui, "Research on individuality features in speech waves and automatic speaker recognition techniques," Speech Communication, 5 (2), 183–197, 1986.

[4] M.J. Carey, E.S. Parris, H. Lloyd-Thomas, and S. Bennett, "Robust prosodic features for speaker identification," Proc. ICSLP'96, vol. 3, pp. 1800–1804, Oct. 1996.

[5] M.K. Sonmez, L. Heck, M. Weintraub, and E. Shriberg, "A lognormal tied mixture model of pitch for prosody-based speaker recognition," Proc. EUROSPEECH'97, vol. 3, pp. 1391–1394, Sept. 1997.

[6] T. Matsui and S. Furui, "Text-independent speaker recognition using vocal tract and pitch information," Proc. ICSLP'90, vol. 1, pp. 137–140, Nov. 1990.

[7] K.P. Markov and S. Nakagawa, "Integrating pitch and LPC-residual information with LPC-cepstrum for text-independent speaker recognition," J. Acoust. Soc. Jpn. (E), 20 (4), 281–291, July 1999.

[8] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, "Hidden Markov models based on multi-space probability distribution for pitch pattern modeling," Proc. ICASSP'99, vol. 1, pp. 229–232, May 1999.

[9] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, "An adaptive algorithm for mel-cepstral analysis of speech," Proc. ICASSP'92, vol. 1, pp. 137–140, Mar. 1992.

[10] D. Talkin, "A robust algorithm for pitch tracking," in Speech Coding and Synthesis, W.B. Kleijn and K.K. Paliwal, eds., Elsevier Science, pp. 495–518, 1995.