
Speaker Recognition

Sharat S. Chikkerur
Center for Unified Biometrics and Sensors

http://www.cubs.buffalo.edu

Speech Fundamentals

Characterizing speech:
• Content (Speech recognition)
• Signal representation (Vocoding)
  • Waveform
  • Parametric (Excitation, Vocal Tract)
• Signal analysis (Gender determination, Speaker recognition)

Terminologies

Phonemes:
• Basic discrete units of speech.
• English has around 42 phonemes.
• Language specific.

Types of speech:
• Voiced speech
• Unvoiced speech (Fricatives)
• Plosives

Formants

Speech production

Speech production mechanism and speech production model (source-filter model):

Voiced path: Impulse Train Generator (Pitch, gain Av) → Glottal Pulse Model G(z)
Unvoiced path: Noise Source (gain AN)
Either source → Vocal Tract Model V(z) → Radiation Model R(z) → speech

The vocal tract is modeled as an acoustic tube about 17 cm long.
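As a rough illustration of this model (not from the original slides), the sketch below synthesizes a voiced sound in Python by driving an all-pole vocal-tract filter with a glottal impulse train; the pitch, formant frequencies, and bandwidths are illustrative values, and G(z) is omitted for brevity.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                       # sampling rate (Hz)
f0 = 120                        # pitch of the impulse train (Hz)

# Impulse train generator (voiced excitation with gain Av)
Av = 1.0
excitation = np.zeros(fs)       # one second of excitation
excitation[::fs // f0] = Av

# Vocal tract model V(z): all-pole filter with two illustrative
# resonances (formants) at 700 Hz and 1200 Hz
a = np.array([1.0])
for freq, bw in [(700, 130), (1200, 70)]:
    r = np.exp(-np.pi * bw / fs)           # pole radius from bandwidth
    theta = 2 * np.pi * freq / fs          # pole angle from formant frequency
    a = np.convolve(a, [1.0, -2 * r * np.cos(theta), r * r])

# Radiation model R(z) approximated by a first difference (numerator),
# vocal tract V(z) as the denominator
speech = lfilter([1.0, -1.0], a, excitation)
```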

Nature of speech

[Figure: speech waveform (amplitude vs. time) and spectrogram (frequency 0–4000 Hz vs. time)]

Vocal Tract Modeling

[Figure: signal spectrum and smoothed signal spectrum]

• The smoothed spectrum indicates the locations of each speaker's formants.
• The smoothed spectrum is obtained from the cepstral coefficients.
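A minimal numpy sketch of this smoothing (not from the original slides): the log spectrum is taken to the cepstral domain, high-quefrency coefficients are zeroed (low-time liftering), and the result is transformed back; the FFT size and lifter cutoff are illustrative choices.

```python
import numpy as np

def smoothed_spectrum(frame, n_fft=512, cutoff=30):
    """Smooth the log spectrum by keeping only low-quefrency cepstral bins."""
    log_mag = np.log(np.abs(np.fft.rfft(frame, n_fft)) + 1e-10)
    cep = np.fft.irfft(log_mag, n_fft)       # real cepstrum of the frame
    cep[cutoff:-cutoff] = 0.0                # low-time liftering
    return np.fft.rfft(cep, n_fft).real      # smoothed log spectrum
```

Peaks of the returned envelope sit at the formant locations.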

Parametric Representations: Formants

Formant frequencies:
• Characterize the frequency response of the vocal tract
• Used in the characterization of vowels
• Can be used to determine the gender of the speaker

[Figure: example waveforms and spectra showing formant peaks (frequency axis 0–4000 Hz)]
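Continuing the sketch above (an illustration, not the slides' method), formant frequencies can be read off as peaks of the smoothed log spectrum; scipy's find_peaks does the peak picking.

```python
import numpy as np
from scipy.signal import find_peaks

def estimate_formants(frame, fs, n_fft=512):
    """Estimate formant frequencies (Hz) as peaks of the smoothed log spectrum."""
    envelope = smoothed_spectrum(frame, n_fft)   # from the earlier sketch
    peaks, _ = find_peaks(envelope)              # local maxima of the envelope
    return peaks * fs / n_fft                    # bin indices -> Hz
```

The first few peaks (F1, F2, F3) are the ones used for vowel characterization and gender determination.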

Parametric Representations: LPC

$s[n] = \sum_{k=1}^{p} a_k\, s[n-k] + G\,u[n]$

• Linear predictive coefficients
• Used in vocoding
• Spectral estimation

[Figure: speech waveforms and LPC spectral envelopes for increasing model orders (e.g., 5, 20, 40, 200)]
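A sketch of LPC analysis by the autocorrelation method with the Levinson-Durbin recursion, in plain numpy; the model order p is the caller's choice (e.g., 10 for 8 kHz speech, an assumption).

```python
import numpy as np

def lpc(frame, p):
    """Return the prediction-error filter A(z) = 1 + a_1 z^-1 + ... + a_p z^-p.
    The slide's model coefficients a_k are the negated entries a[1:]."""
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]  # r[0], r[1], ...
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coefficient
        a[1:i] += k * a[i - 1:0:-1]                        # update previous coeffs
        a[i] = k
        err *= 1.0 - k * k                                 # prediction error update
    return a
```

The smooth LPC spectrum is then G / |A(e^{jω})|, e.g. via scipy.signal.freqz([1.0], a); larger p tracks the signal spectrum more closely, as in the figure above.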

Parametric Representations: Cepstrum

The speech model is a cascade of convolutions: the excitation u[n] (an impulse train P[n] with pitch period and gain Av for voiced speech, or noise with gain AN for unvoiced speech) is passed through G(z), V(z), and R(z).

A homomorphic system D[] → L[] → D⁻¹[] turns this convolution into addition:

x1[n]*x2[n] → D[] → x1'[n]+x2'[n] → L[] → y1'[n]+y2'[n] → D⁻¹[] → y1[n]*y2[n]

The characteristic system D[] is realized as DFT → LOG → IDFT:

x1[n]*x2[n] → DFT → X1(z)X2(z) → LOG → log(X1(z)) + log(X2(z)) → IDFT → x1'[n]+x2'[n]
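A numpy sketch of the DFT → LOG → IDFT chain above; using the log magnitude gives the real cepstrum.

```python
import numpy as np

def real_cepstrum(x, n_fft=1024):
    """D[]: DFT, L[]: log magnitude, D^-1[]: inverse DFT."""
    spectrum = np.fft.fft(x, n_fft)                # D[]
    log_mag = np.log(np.abs(spectrum) + 1e-10)     # L[] (magnitude only)
    return np.fft.ifft(log_mag).real               # D^-1[]
```

With n_fft at least the length of the convolution, real_cepstrum(np.convolve(x1, x2)) matches real_cepstrum(x1) + real_cepstrum(x2) (up to the small numerical guard term), which is exactly the convolution-to-addition property the diagram shows: low quefrencies capture the vocal tract, and a peak at the pitch period captures the excitation.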

Speaker Recognition

Definition

• Speaker recognition is the method of recognizing a person from his or her voice.
• It is one of the forms of biometric identification.
• It depends on speaker-dependent characteristics.

Speaker Recognition

• Speaker Identification
  • Text Dependent
  • Text Independent
• Speaker Verification
  • Text Dependent
  • Text Independent
• Speaker Detection

Speech Applications

• Transmission
• Speech Synthesis
• Speech Enhancement
• Aids to the Handicapped
• Speech Recognition
• Speaker Verification

Generic Speaker Recognition System

Enrollment: speech signal → Preprocessing → Feature Extraction → Speaker Model
Verification: speech signal → Preprocessing → Feature Extraction → Pattern Matching → Score

Preprocessing (speech signal → analysis frames):
• A/D Conversion
• End point detection
• Pre-emphasis filter
• Segmentation

Feature Extraction (analysis frames → feature vectors):
• LAR
• Cepstrum
• LPCC
• MFCC

Pattern Matching:
• Stochastic Models: GMM, HMM
• Template Models: DTW, Distance Measures
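As an illustrative sketch only (names, structure, and threshold are assumptions, not from the slides), the enrollment/verification flow can be wired as follows, with the stage functions supplied by the sketches on the following slides:

```python
class SpeakerVerifier:
    """Template-based verifier: enrollment stores a reference template,
    verification scores a new utterance against it."""

    def __init__(self, preprocess_fn, feature_fn, distance_fn, threshold=0.25):
        self.preprocess = preprocess_fn    # e.g. silence removal + pre-emphasis
        self.features = feature_fn         # e.g. cepstral feature extraction
        self.distance = distance_fn        # e.g. DTW distance
        self.threshold = threshold         # illustrative accept threshold
        self.templates = {}                # speaker id -> reference template

    def enroll(self, speaker_id, signal):
        self.templates[speaker_id] = self.features(self.preprocess(signal))

    def verify(self, speaker_id, signal):
        score = self.distance(self.features(self.preprocess(signal)),
                              self.templates[speaker_id])
        return score < self.threshold      # accept if close to the template
```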

Choice of features

• Differentiating factors between speakers include vocal tract shape and behavioral traits.
• Features should have high inter-speaker variation and low intra-speaker variation.

Our Approach

• Preprocessing: Silence Removal
• Feature Extraction: Cepstrum Coefficients, then Cepstral Normalization (long-time average)
• Speaker Model: Polynomial Function Expansion (stored as a Reference Template)
• Matching: Dynamic Time Warping, then Distance Computation against the Reference Template

Silence Removal

Frames whose short-time energy falls below a threshold derived from the long-term average energy are treated as silence and removed:

$E_{avg} = \frac{1}{N} \sum_{k=1}^{N} x^2[k]$

$E[n] = \sum_{k} \left( x[k]\, w[n-k] \right)^2$

where $w[n]$ is the analysis window.
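A sketch of this step, assuming non-overlapping frames and a threshold set as a fraction of the average frame energy (the 0.1 factor and frame length are illustrative):

```python
import numpy as np

def remove_silence(x, frame_len=256, factor=0.1):
    """Drop frames whose short-time energy is below factor * average frame energy."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sum(frames ** 2, axis=1)    # E[n] for each frame
    e_avg = energy.mean()                   # long-term average frame energy
    return frames[energy > factor * e_avg].reshape(-1)
```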


Pre-emphasis (Preprocessing)

$H(z) = 1 - a\,z^{-1}, \quad a = 0.95$
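This filter as a one-line scipy sketch:

```python
from scipy.signal import lfilter

def pre_emphasis(x, a=0.95):
    """y[n] = x[n] - a*x[n-1]: boosts high frequencies before analysis."""
    return lfilter([1.0, -a], [1.0], x)
```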

Segmentation (Preprocessing)

Short-time analysis:
• The speech signal is segmented into overlapping 'analysis frames'.
• The speech signal is assumed to be stationary within each frame.

$Q_n[k] = x[k]\, w[n-k]$

$w[n] = 0.54 - 0.46 \cos\left( \frac{2\pi n}{N-1} \right)$ (Hamming window)

where $N$ is the length of the analysis frame and $n$ indexes the $n$-th analysis frame.

[Figure: overlapping analysis frames Q31, Q32, Q33, Q34]
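A sketch of segmentation into overlapping Hamming-windowed frames (frame length and 50% overlap are illustrative choices):

```python
import numpy as np

def analysis_frames(x, frame_len=256, hop=128):
    """Split x into overlapping frames Q_n and apply a Hamming window to each."""
    n = np.arange(frame_len)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))  # Hamming window
    starts = range(0, len(x) - frame_len + 1, hop)
    return np.array([x[s:s + frame_len] * w for s in starts])
```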

Feature Representation

[Figure: speech signal and spectrum of two users uttering 'ONE']

Speaker Model

Each utterance is represented by a sequence of feature vectors:

F1 = [a1 … a10, b1 … b10]
F2 = [a1 … a10, b1 … b10]
…
FN = [a1 … a10, b1 … b10]

The b coefficients come from the polynomial function expansion: a first-order polynomial fit of each cepstral trajectory $c_j$ over a window of neighboring frames,

$b_j = \frac{\sum_{k=-K}^{K} k\, c_j[n+k]}{\sum_{k=-K}^{K} k^2}$
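Since the original formula is garbled in the source, the sketch below implements the standard reading given above: least-squares slopes of each cepstral trajectory over 2K+1 frames (K = 2 is an assumed window half-width).

```python
import numpy as np

def polynomial_coefficients(cepstra, K=2):
    """First-order polynomial fit (slope) of each cepstral trajectory
    over a window of 2K+1 frames; cepstra has shape (frames, coeffs)."""
    k = np.arange(-K, K + 1, dtype=float)
    padded = np.pad(cepstra, ((K, K), (0, 0)), mode='edge')
    return np.array([
        k @ padded[t:t + 2 * K + 1] / (k @ k)   # slope b_j at frame t
        for t in range(cepstra.shape[0])
    ])
```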

Dynamic Time Warping

Given two feature-vector sequences

$X(n) = \{x_1(n), x_2(n), \ldots, x_K(n)\}, \quad n = 1, \ldots, N$

$Y(m) = \{y_1(m), y_2(m), \ldots, y_K(m)\}, \quad m = 1, \ldots, M$

the local distance between frames is

$D\big(X(n), Y(m)\big) = \sum_{i=1}^{K} \big( x_i(n) - y_i(m) \big)^2$

and the total distance is minimized over warping functions $w(n)$:

$D_T = \min_{w} \sum_{n=1}^{N} D\big( X(n), Y(w(n)) \big)$

• The DTW warping path in the N-by-M matrix is the path with minimum average cumulative cost.
• A global constraint region restricts where the path is allowed to go.
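A sketch of the DTW recursion in numpy (squared Euclidean local distance, unit step weights; the global path constraint shown on the slide is omitted for brevity):

```python
import numpy as np

def dtw_distance(X, Y):
    """Minimum cumulative distance over all warping paths between
    feature sequences X (N x K) and Y (M x K)."""
    N, M = len(X), len(Y)
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            cost = np.sum((X[n - 1] - Y[m - 1]) ** 2)    # local distance
            D[n, m] = cost + min(D[n - 1, m],            # insertion
                                 D[n, m - 1],            # deletion
                                 D[n - 1, m - 1])        # match
    return D[N, M] / (N + M)   # normalize by path length, cf. the results slide
```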

Results

Distance matrix (rows/columns: utterances a0, a1, r0, r1, s0, s1):

        a0       a1       r0       r1       s0       s1
a0      0        0.1226   0.3664   0.3297   0.4009   0.4685
a1      0.1226   0        0.5887   0.3258   0.4086   0.4894
r0      0.3664   0.5887   0        0.0989   0.3299   0.4243
r1      0.3297   0.3258   0.0989   0        0.367    0.4287
s0      0.4009   0.4086   0.3299   0.367    0        0.1401
s1      0.4685   0.4894   0.4243   0.4287   0.1401   0

• Distances are normalized with respect to the length of the speech signal.
• Intra-speaker distances are smaller than inter-speaker distances.
• The distance matrix is symmetric.

Matlab Implementation

THANK YOU