Anoosha Habib, MSc Project, LUMS.
Text-Independent
Speaker Recognition
Project Presentation
Anoosha Habib
M.Sc. Project
Computer Engineering,
Lahore University of Management Sciences
Sequence of Presentation
• Introduction
• Speaker Identification System
• Speech Parameterization Methods
• Speaker Classification Methods
• Results and Performance Evaluation
• Future Work
Introduction
• Speech is regarded as one of the most powerful forms of communication because of its rich dimensions: a speech signal carries the spoken text (words) as well as the gender, attitude, emotion, health condition, and identity of the speaker.
• There are three types of recognition systems:
  • Speech Recognition Systems,
  • Language Recognition Systems, and
  • Speaker Recognition Systems

[Figure: a speech signal such as "I am fine" feeds three recognizers — Speech Recognition outputs the words ("I am fine"), Language Recognition outputs the language name ("English"), and Speaker Recognition outputs the speaker name ("Ali").]
Speech Processing Taxonomy

[Figure: Recognition branches into Speech Recognition, Language Recognition, and Speaker Recognition. Speaker Recognition splits into Speaker Identification and Speaker Verification; each can be text-dependent or text-independent, and can operate on a closed set or an open set of speakers.]
Classification of Speaker Identification Systems

• Speaker Identification is the process of finding the identity of an unknown speaker by comparing his/her voice with the voices of the registered speakers in the database.
• Speaker Recognition is the process of recognizing a person on the basis of the individual information included in the speech signal.
• Text-dependent SIS
  • Speakers are only allowed to say specific sentences or words that are known to the system.
• Text-independent SIS
  • The system can process freely spoken speech, which is either a user-selected phrase or conversational speech.
• Compared with text-dependent systems, text-independent systems are more flexible, but also more complicated.
Outline of SIS
• Speaker identification (SI) is the process of finding the identity of an unknown speaker by comparing his/her voice with the voices of registered speakers in the database. The unknown voice is assumed to come from a predefined set of known speakers, so identification is a one-to-many comparison.
• During its operation, a speaker identification system receives two inputs:
  • an identity claim (PIN, user's name, keyword, etc.) made by the speaking person, and
  • a certain amount of speech, representing his/her or someone else's voice.
Phases of SIS
• The 1st phase of an SIS is the Enrollment Session, also known as the Training Phase.
• During the training phase, the SIS generates a speaker model based on each speaker's characteristics.

[Figure: training phase — speech from Speaker 1, Speaker 2, and Speaker 3 passes through front-end processing to produce feature vectors; speaker modeling turns these into speaker models, which are stored in the speaker database.]
Phases of SIS
• The testing phase of the system involves making a claim on the identity of an unknown speaker by using both the trained models and the characteristics of the given speech.

[Figure: testing phase — voice from an unknown speaker passes through front-end processing to produce feature vectors; these are scored against Speaker Model 1 through Speaker Model M from the speaker database, and the decision assigns the Speaker ID with the maximum score.]
Components of Speaker Identification System

• There are three main components of an SI system:
  • Front-end Processing
  • Speaker Modeling
  • Pattern Matching
Front-End Processing
• Front-end processing generally consists of three sub-processes:
  • Preprocessing
    • Removal of noise/silence from speech
    • Frame Blocking
    • Windowing
  • Feature Extraction
  • Channel Compensation

Preprocessing

• Removal of noise/silence from speech
Frame Blocking

• The speech signal is a slowly time-varying signal (it is called quasi-stationary): when the signal is examined over a short period of time (5-100 ms), it is fairly stationary.
• Speech signals are therefore analyzed in short time segments (short-time spectral analysis), typically 20-30 ms frames that overlap each other by 30-50%. The overlap ensures that no information is lost due to the windowing.
• The duration of each frame is 23 ms for sampling frequencies of 11025 Hz and 22050 Hz, and a new frame contains the last 11.5 ms of the previous frame's data. For a sampling frequency of 8000 Hz, the duration of each frame is 16 ms and a new frame contains the last 8 ms of the previous frame's data.
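The frame-blocking arithmetic above can be sketched in NumPy; `frame_signal` and its defaults are illustrative, not code from the project:

```python
import numpy as np

def frame_signal(signal, fs, frame_ms=23):
    """Split a 1-D speech signal into 50%-overlapping frames.

    The defaults mirror the slide: 23 ms frames, where each new frame
    keeps roughly the last 11.5 ms of the previous frame's data.
    """
    frame_len = fs * frame_ms // 1000        # samples per frame (253 at 11025 Hz)
    hop = frame_len // 2                     # advance by half a frame
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop: i * hop + frame_len]
                     for i in range(n_frames)])

# Example: one second of audio at 11025 Hz
fs = 11025
x = np.random.randn(fs)
frames = frame_signal(x, fs)
print(frames.shape)                          # (number_of_frames, frame_len)
```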
Windowing

• After the signal has been framed, each frame is multiplied by a window function w(n) of length N, where N is the length of the frame. Typically the Hamming window is used. The windowing is done to avoid problems due to truncation of the signal.
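A minimal NumPy sketch of the Hamming window and the windowing step (variable names are illustrative):

```python
import numpy as np

# Hamming window of length N (the frame length):
#   w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)),  n = 0 .. N-1
N = 253                                  # e.g. a 23 ms frame at 11025 Hz
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

# Windowing: multiply each frame element-wise by w before the FFT;
# the taper at the frame edges reduces spectral leakage from truncation
frame = np.random.randn(N)               # stand-in for one speech frame
windowed = frame * w
```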
Speech Parameterization
Methods
Speech Parameterization Methods

• Speech can be parameterized in two broad categories:
  • Spectrum-based speech features
  • Phonetic and prosodic features
• Spectrum-based speech features: the following features have dominated the speech and speaker recognition areas:
  • Real Cepstral Coefficients (RCC), introduced by Oppenheim (1969)
  • Linear Prediction Coefficients (LPC), proposed by Atal and Hanauer (1971)
  • Linear Predictive Cepstral Coefficients (LPCC), derived by Atal (1974)
  • Mel-Frequency Cepstral Coefficients (MFCC) (Davis and Mermelstein, 1980)
  • Perceptual Linear Prediction (PLP) coefficients (Hermansky, 1990)
  • Adaptive Component Weighting (ACW) cepstral coefficients (Assaleh and Mammone, 1994a, 1994b)
  • Wavelet-based features
• Prosodic speech features provide useful information about the speaking style of a person. The fundamental frequency and frame energy are well-known prosodic features that are widely used in speaker recognition applications. However, prosodic features are susceptible to mimicry.
Feature Extraction

[Figure: MFCC extraction pipeline — continuous speech passes through a preemphasis filter, frame blocking, and windowing; an FFT gives the spectrum of each frame, a mel filter bank gives the mel-weighted spectrum, and log compression followed by a DCT yields the mel cepstral coefficients.]

[Figure: LPC extraction pipeline — continuous speech passes through frame blocking and windowing; autocorrelation of the windowed frames followed by linear prediction analysis yields the linear prediction coefficients.]
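The MFCC pipeline can be sketched roughly as follows. The function names (`mfcc_frame`, `hz_to_mel`, `mel_to_hz`) are illustrative, and the triangular filter-bank construction is a simplified textbook version, not the project's exact implementation; the 29 filters and 12 coefficients follow the numbers used later in the evaluation.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs, n_filters=29, n_ceps=12):
    """MFCCs of one pre-emphasized, windowed frame (simplified sketch)."""
    # FFT -> magnitude spectrum of the frame
    spectrum = np.abs(np.fft.rfft(frame))
    n_bins = len(spectrum)

    # Triangular mel filter bank between 0 Hz and fs/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bin_pts = np.floor((n_bins - 1) * mel_to_hz(mel_pts) / (fs / 2.0)).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)

    # Mel-weighted spectrum -> log compression -> DCT-II (dropping c0)
    log_e = np.log(fbank @ spectrum + 1e-10)
    m = np.arange(n_filters)
    return np.array([np.sum(log_e * np.cos(np.pi * k * (m + 0.5) / n_filters))
                     for k in range(1, n_ceps + 1)])

# One random 23 ms frame at 11025 Hz (253 samples)
rng = np.random.default_rng(0)
ceps = mfcc_frame(rng.standard_normal(253), 11025)
```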
Pattern Matching
and Classification
Pattern Matching and Classification

• The classifiers used for speaker identification can be grouped into two major types: template-based and stochastic-model-based classifiers.
• Template-based classifiers are considered the simplest classifiers.
  • The most common template-based classifiers are based on Dynamic Time Warping (useful for text-dependent speaker recognition) and Vector Quantization (useful for text-independent speaker recognition).
• Stochastic models provide more flexibility and better results.
  • The stochastic-model-based classifiers use the Gaussian Mixture Model (useful for text-independent speaker recognition), the Hidden Markov Model (useful for text-dependent speaker recognition), and Neural Networks to model a speaker's acoustic space.
Speaker Modeling
• Vector Quantization
  • It is not possible to use all the feature vectors of a given speaker occurring in the training data to form the speaker's model, because there are too many feature vectors per speaker.
  • A method of reducing/compressing the number of training vectors is therefore required.
  • A VQ codebook, consisting of a small number of highly representative vectors that efficiently capture the speaker-specific characteristics, can be used.
VQ Codebook Generation by the LBG Algorithm
LBG Algorithm
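The split-and-refine structure of the LBG algorithm can be sketched as follows (an illustrative NumPy version, not the project's code):

```python
import numpy as np

def lbg(vectors, codebook_size, eps=0.01, tol=1e-4):
    """LBG codebook design (codebook_size should be a power of two):
    start from the global centroid, then repeatedly split every code
    vector into a perturbed pair and refine the doubled codebook with
    k-means-style nearest-neighbour / centroid-update iterations."""
    codebook = vectors.mean(axis=0, keepdims=True)
    while len(codebook) < codebook_size:
        # Splitting step: each code vector becomes a (1+eps), (1-eps) pair
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        prev = np.inf
        while True:
            # Assign every training vector to its nearest code vector
            d = np.linalg.norm(vectors[:, None] - codebook[None], axis=2)
            nearest = d.argmin(axis=1)
            dist = d.min(axis=1).mean()
            # Move each code vector to the centroid of its cell
            for i in range(len(codebook)):
                cell = vectors[nearest == i]
                if len(cell):
                    codebook[i] = cell.mean(axis=0)
            if prev - dist < tol:          # converged for this codebook size
                break
            prev = dist
    return codebook

# Example: a codebook of 8 code vectors from 200 random 12-D features
rng = np.random.default_rng(0)
features = rng.standard_normal((200, 12))
codebook = lbg(features, 8)
```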
Speaker Matching
• During the training phase, one codebook with N code vectors is computed for each of the M speakers.
• During the testing phase, the feature vectors representing the test utterance are encoded in terms of their nearest code vectors from the codebook of each of the M speakers, by measuring the distortion.
• Once these distortions are computed, the speaker identification system assigns the test utterance to the speaker whose VQ codebook yields the least distortion.
• The distortion measure is the minimum Euclidean distance: each feature vector in the sequence X is compared with all the codebooks, and the codebook with the minimum average distance is chosen as the best match.
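The matching rule above — encode the test vectors against each speaker's codebook and pick the minimum average Euclidean distortion — can be sketched as follows (function names and the toy codebooks are illustrative):

```python
import numpy as np

def avg_distortion(test_vectors, codebook):
    # Distance from every test vector to every code vector,
    # then average the per-vector minimum (the VQ distortion)
    d = np.linalg.norm(test_vectors[:, None] - codebook[None], axis=2)
    return d.min(axis=1).mean()

def identify(test_vectors, codebooks):
    # Pick the speaker whose codebook gives the least average distortion
    scores = [avg_distortion(test_vectors, cb) for cb in codebooks]
    return int(np.argmin(scores))

# Toy example with two synthetic speaker codebooks
cb_a = np.zeros((4, 2))          # "speaker 0" clusters near the origin
cb_b = np.full((4, 2), 10.0)     # "speaker 1" clusters far away
test_v = np.full((5, 2), 0.1)    # test vectors close to speaker 0
winner = identify(test_v, [cb_a, cb_b])
```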
Weighting Method
• The weighting method takes the correlation between the known speakers in the database into account. The idea is that larger weights should be assigned to code vectors that have higher discriminating power.
• If code vectors from several codebooks lie very close together in feature space, it is not obvious which of them a given unknown vector belongs to. On the other hand, if a code vector is far from the vectors of the other codebooks, it is much clearer which codebook the given unknown vector belongs to.
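One plausible way to realize this idea (an assumption for illustration, not necessarily the exact weighting formula used in the project) is to weight each code vector by its distance to the nearest code vector of any other speaker:

```python
import numpy as np

def discrimination_weights(codebooks, s):
    # NOTE: illustrative formulation, not the project's exact scheme.
    # Collect every code vector belonging to the OTHER speakers
    others = np.vstack([cb for i, cb in enumerate(codebooks) if i != s])
    # Distance from each of speaker s's code vectors to its nearest
    # "foreign" code vector: far-away vectors discriminate better
    d = np.linalg.norm(codebooks[s][:, None] - others[None], axis=2)
    w = d.min(axis=1)
    return w / w.sum()               # normalize the weights to sum to 1

# Toy example: speaker 0's second code vector is far from everyone else,
# so it receives the larger weight
codebooks = [np.array([[0.0, 0.0], [10.0, 10.0]]),
             np.array([[0.0, 1.0]]),
             np.array([[1.0, 0.0]])]
w = discrimination_weights(codebooks, 0)
```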
Results and Performance
Evaluation
Distortion

• Distortion of Speaker 1 with the other speakers, computed by the MFCC-based classifier. Speaker 1's own codebook gives the minimum distortion (1.5486), so the test utterance is correctly identified.

Speaker  Distortion   Speaker  Distortion   Speaker  Distortion   Speaker  Distortion
   1      1.5486        12      2.0171        23      1.9560        34      1.8465
   2      1.9222        13      2.0289        24      1.8488        35      2.1536
   3      2.0962        14      2.0857        25      1.9014        36      2.2493
   4      2.2161        15      1.9501        26      2.0615        37      2.0788
   5      2.0978        16      1.9651        27      1.7879        38      1.7624
   6      2.3774        17      2.0492        28      1.8473        39      1.8619
   7      2.1923        18      2.1089        29      1.9538        40      2.0992
   8      2.3084        19      2.0431        30      2.0406        41      1.7827
   9      2.2460        20      1.9681        31      2.1471        42      1.8035
  10      1.8637        21      1.7445        32      2.1844        43      1.8390
  11      1.9543        22      1.9843        33      1.7974
Performance Evaluation
Identification performance as a function of the number of mel filter banks, using 12 MFCCs and a codebook size of 8.
MFCC vs LPC
• The test uses 12 MFCCs with 29 filters, and 12 LPCCs from a 12th-order LP analysis. The test run varies the size of the codebook (i.e. the number of code words assigned to each speaker).
• MFCC and LPCC achieve perfect identification using 16 and 4 code words per speaker, respectively.
[Figure: identification performance for varying codebook sizes — left panel: 12 MFCCs; right panel: 12 LPCCs.]
Performance of Coefficients under Noise

• Noise clearly influences the performance of the system, making classification almost useless at high noise levels.
• LPCCs are the most resistant at low noise levels, while MFCCs perform slightly better at larger noise levels, using a codebook size of 16.
• Increasing the codebook size from 8 to 16 shows a definite improvement, especially at the higher noise levels.
[Figure: performance of coefficients under noise with a codebook size of 8 — (a) 12 LPCCs, (b) 12 MFCCs.]

[Figure: performance of coefficients under noise with a codebook size of 16 — (a) 12 LPCCs, (b) 12 MFCCs.]
[Figure: distortion measure for a test speech sample as a function of codebook size — left panel: MFCC; right panel: LPC.]

• The difference in distortion measure between the correct speaker and the runner-up stays almost the same when varying the codebook size.
Conclusion
Future Work
Questions & Answers