Anoosha Habib, MSc Project, LUMS.
Text-Independent
Speaker Recognition
Project Presentation
Anoosha Habib
M.Sc. Project
Computer Engineering,
Lahore University of Management Sciences
Sequence of Presentation
• Introduction
• Speaker Identification System
• Speech Parameterization Methods
• Speaker Classification Methods
• Results and Performance Evaluation
• Future Work
Introduction
• Speech is regarded as one of the most powerful forms of communication because of its rich dimensions: a speech signal carries the spoken text (words) as well as the gender, attitude, emotion, health condition, and identity of the speaker.
• There are three types of recognition systems:
  • Speech Recognition Systems,
  • Language Recognition Systems, and
  • Speaker Recognition Systems

[Figure: a speech signal such as "I am fine" feeds three recognizers — Speech Recognition outputs the words ("I am fine"), Language Recognition outputs the language name ("English"), and Speaker Recognition outputs the speaker name ("Ali").]
Speech Processing Taxonomy

[Figure: Recognition branches into Speech Recognition, Language Recognition, and Speaker Recognition. Speaker Recognition splits into Speaker Identification and Speaker Verification; each can be text-dependent or text-independent, and can operate on a closed set or an open set of speakers.]
Classification of Speaker Identification Systems

• Speaker Identification is the process of finding the identity of an unknown speaker by comparing his/her voice with the voices of the registered speakers in the database.
• Speaker Recognition is the process of recognizing a person on the basis of the individual information included in the speech signal.
• Text-dependent SIS
  • Speakers are only allowed to say specific sentences or words that are known to the system.
• Text-independent SIS
  • The system can process freely spoken speech, which is either a user-selected phrase or conversational speech.
• Compared with text-dependent systems, text-independent systems are more flexible, but also more complicated.
Outline of SIS
• Speaker identification (SI) is the process of finding the identity of an unknown speaker by comparing his/her voice with the voices of registered speakers in the database. The unknown voice is assumed to come from a predefined set of known speakers, so identification is a one-to-many comparison.
• During its operation, a speaker identification system receives two inputs:
  • an identity claim (PIN, user's name, keyword, etc.) made by the speaking person, and
  • a certain amount of speech, representing his/her or someone else's voice.
Phases of SIS
• The 1st phase of an SIS is the Enrollment Session, also known as the Training Phase.
• During the training phase, the SIS generates a speaker model based on each speaker's characteristics.

[Figure: training phase — speech from Speaker 1, Speaker 2, and Speaker 3 passes through front-end processing to produce feature vectors; speaker modeling turns these into speaker models, which are stored in the speaker database.]
Phases of SIS
• The testing phase of the system involves making a claim on the identity of an unknown speaker by using both the trained models and the characteristics of the given speech.

[Figure: testing phase — voice from an unknown speaker passes through front-end processing to produce feature vectors; these are scored against Speaker Model 1 through Speaker Model M from the speaker database, and the decision assigns the Speaker ID with the maximum score.]
Components of Speaker Identification System

• There are three main components of an SI system:
  • Front-end Processing
  • Speaker Modeling
  • Pattern Matching
Front-End Processing
• Front-end processing generally consists of three sub-processes:
  • Preprocessing
    • Removal of noise/silence from speech
    • Frame Blocking
    • Windowing
  • Feature Extraction
  • Channel Compensation

Preprocessing

• Removal of noise/silence from speech
Frame Blocking

• The speech signal is a slowly time-varying signal (it is called quasi-stationary): when the signal is examined over a short period of time (5-100 ms), it is fairly stationary.
• Speech signals are therefore analyzed in short time segments (short-time spectral analysis), typically 20-30 ms frames that overlap each other by 30-50%. The overlap ensures that no information is lost due to the windowing.
• The duration of each frame is 23 ms for sampling frequencies of 11025 Hz and 22050 Hz, and a new frame contains the last 11.5 ms of the previous frame's data. For a sampling frequency of 8000 Hz, the duration of each frame is 16 ms and a new frame contains the last 8 ms of the previous frame's data.
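The frame-blocking arithmetic above can be sketched in NumPy; `frame_signal` and its defaults are illustrative, not code from the project:

```python
import numpy as np

def frame_signal(signal, fs, frame_ms=23):
    """Split a 1-D speech signal into 50%-overlapping frames.

    The defaults mirror the slide: 23 ms frames, where each new frame
    keeps roughly the last 11.5 ms of the previous frame's data.
    """
    frame_len = fs * frame_ms // 1000        # samples per frame (253 at 11025 Hz)
    hop = frame_len // 2                     # advance by half a frame
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop: i * hop + frame_len]
                     for i in range(n_frames)])

# Example: one second of audio at 11025 Hz
fs = 11025
x = np.random.randn(fs)
frames = frame_signal(x, fs)
print(frames.shape)                          # (number_of_frames, frame_len)
```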
Windowing

• After the signal has been framed, each frame is multiplied by a window function w(n) of length N, where N is the length of the frame. Typically the Hamming window is used. The windowing is done to avoid problems due to truncation of the signal.
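A minimal NumPy sketch of the Hamming window and the windowing step (variable names are illustrative):

```python
import numpy as np

# Hamming window of length N (the frame length):
#   w(n) = 0.54 - 0.46 * cos(2*pi*n / (N - 1)),  n = 0 .. N-1
N = 253                                  # e.g. a 23 ms frame at 11025 Hz
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

# Windowing: multiply each frame element-wise by w before the FFT;
# the taper at the frame edges reduces spectral leakage from truncation
frame = np.random.randn(N)               # stand-in for one speech frame
windowed = frame * w
```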
Speech Parameterization
Methods
Speech Parameterization Methods

• Speech can be parameterized in two broad categories:
  • Spectrum-based speech features
  • Phonetic and prosodic features
• Spectrum-based speech features: the following features have dominated the speech and speaker recognition areas:
  • Real Cepstral Coefficients (RCC), introduced by Oppenheim (1969)
  • Linear Prediction Coefficients (LPC), proposed by Atal and Hanauer (1971)
  • Linear Predictive Cepstral Coefficients (LPCC), derived by Atal (1974)
  • Mel-Frequency Cepstral Coefficients (MFCC) (Davis and Mermelstein, 1980)
  • Perceptual Linear Prediction (PLP) coefficients (Hermansky, 1990)
  • Adaptive Component Weighting (ACW) cepstral coefficients (Assaleh and Mammone, 1994a, 1994b)
  • Wavelet-based features
• Prosodic speech features provide useful information about the speaking style of a person. The fundamental frequency and frame energy are well-known prosodic features that are widely used in speaker recognition applications. However, prosodic features are susceptible to mimicry.
Feature Extraction

[Figure: MFCC extraction pipeline — continuous speech passes through a preemphasis filter, frame blocking, and windowing; an FFT gives the spectrum of each frame, a mel filter bank gives the mel-weighted spectrum, and log compression followed by a DCT yields the mel cepstral coefficients.]

[Figure: LPC extraction pipeline — continuous speech passes through frame blocking and windowing; autocorrelation of the windowed frames followed by linear prediction analysis yields the linear prediction coefficients.]
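The MFCC pipeline can be sketched roughly as follows. The function names (`mfcc_frame`, `hz_to_mel`, `mel_to_hz`) are illustrative, and the triangular filter-bank construction is a simplified textbook version, not the project's exact implementation; the 29 filters and 12 coefficients follow the numbers used later in the evaluation.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, fs, n_filters=29, n_ceps=12):
    """MFCCs of one pre-emphasized, windowed frame (simplified sketch)."""
    # FFT -> magnitude spectrum of the frame
    spectrum = np.abs(np.fft.rfft(frame))
    n_bins = len(spectrum)

    # Triangular mel filter bank between 0 Hz and fs/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bin_pts = np.floor((n_bins - 1) * mel_to_hz(mel_pts) / (fs / 2.0)).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)

    # Mel-weighted spectrum -> log compression -> DCT-II (dropping c0)
    log_e = np.log(fbank @ spectrum + 1e-10)
    m = np.arange(n_filters)
    return np.array([np.sum(log_e * np.cos(np.pi * k * (m + 0.5) / n_filters))
                     for k in range(1, n_ceps + 1)])

# One random 23 ms frame at 11025 Hz (253 samples)
rng = np.random.default_rng(0)
ceps = mfcc_frame(rng.standard_normal(253), 11025)
```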
Pattern Matching
and Classification
Pattern Matching and Classification

• The classifiers used for speaker identification can be grouped into two major types: template-based and stochastic-model-based classifiers.
• Template-based classifiers are considered the simplest classifiers.
  • The most common template-based classifiers are based on Dynamic Time Warping (useful for text-dependent speaker recognition) and Vector Quantization (useful for text-independent speaker recognition).
• Stochastic models provide more flexibility and better results.
  • The stochastic-model-based classifiers use the Gaussian Mixture Model (useful for text-independent speaker recognition), the Hidden Markov Model (useful for text-dependent speaker recognition), and Neural Networks to model a speaker's acoustic space.
Speaker Modeling
• Vector Quantization
  • It is not possible to use all the feature vectors of a given speaker occurring in the training data to form the speaker's model, because there are too many feature vectors per speaker.
  • A method of reducing/compressing the number of training vectors is therefore required.
  • A VQ codebook, consisting of a small number of highly representative vectors that efficiently capture the speaker-specific characteristics, can be used.
VQ Codebook Generation by the LBG Algorithm
LBG Algorithm
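The split-and-refine structure of the LBG algorithm can be sketched as follows (an illustrative NumPy version, not the project's code):

```python
import numpy as np

def lbg(vectors, codebook_size, eps=0.01, tol=1e-4):
    """LBG codebook design (codebook_size should be a power of two):
    start from the global centroid, then repeatedly split every code
    vector into a perturbed pair and refine the doubled codebook with
    k-means-style nearest-neighbour / centroid-update iterations."""
    codebook = vectors.mean(axis=0, keepdims=True)
    while len(codebook) < codebook_size:
        # Splitting step: each code vector becomes a (1+eps), (1-eps) pair
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        prev = np.inf
        while True:
            # Assign every training vector to its nearest code vector
            d = np.linalg.norm(vectors[:, None] - codebook[None], axis=2)
            nearest = d.argmin(axis=1)
            dist = d.min(axis=1).mean()
            # Move each code vector to the centroid of its cell
            for i in range(len(codebook)):
                cell = vectors[nearest == i]
                if len(cell):
                    codebook[i] = cell.mean(axis=0)
            if prev - dist < tol:          # converged for this codebook size
                break
            prev = dist
    return codebook

# Example: a codebook of 8 code vectors from 200 random 12-D features
rng = np.random.default_rng(0)
features = rng.standard_normal((200, 12))
codebook = lbg(features, 8)
```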
Speaker Matching
• During the training phase, one codebook with N code vectors is computed for each of the M speakers.
• During the testing phase, the feature vectors representing the test utterance are encoded in terms of their nearest code vectors from the codebook of each of the M speakers, by measuring the distortion.
• Once these distortions are computed, the speaker identification system assigns the test utterance to the speaker whose VQ codebook yields the least distortion.
• The distortion measure is the minimum Euclidean distance: each feature vector in the sequence X is compared with all the codebooks, and the codebook with the minimum average distance is chosen as the best match.
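The matching rule above — encode the test vectors against each speaker's codebook and pick the minimum average Euclidean distortion — can be sketched as follows (function names and the toy codebooks are illustrative):

```python
import numpy as np

def avg_distortion(test_vectors, codebook):
    # Distance from every test vector to every code vector,
    # then average the per-vector minimum (the VQ distortion)
    d = np.linalg.norm(test_vectors[:, None] - codebook[None], axis=2)
    return d.min(axis=1).mean()

def identify(test_vectors, codebooks):
    # Pick the speaker whose codebook gives the least average distortion
    scores = [avg_distortion(test_vectors, cb) for cb in codebooks]
    return int(np.argmin(scores))

# Toy example with two synthetic speaker codebooks
cb_a = np.zeros((4, 2))          # "speaker 0" clusters near the origin
cb_b = np.full((4, 2), 10.0)     # "speaker 1" clusters far away
test_v = np.full((5, 2), 0.1)    # test vectors close to speaker 0
winner = identify(test_v, [cb_a, cb_b])
```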
Weighting Method
• The weighting method takes the correlation between the known speakers in the database into account. The idea is that larger weights should be assigned to code vectors that have higher discriminating power.
• If code vectors from several codebooks lie very close together in feature space, it is not obvious which of them a given unknown vector belongs to. On the other hand, if a code vector is far from the vectors of the other codebooks, it is much clearer which codebook the given unknown vector belongs to.
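One plausible way to realize this idea (an assumption for illustration, not necessarily the exact weighting formula used in the project) is to weight each code vector by its distance to the nearest code vector of any other speaker:

```python
import numpy as np

def discrimination_weights(codebooks, s):
    # NOTE: illustrative formulation, not the project's exact scheme.
    # Collect every code vector belonging to the OTHER speakers
    others = np.vstack([cb for i, cb in enumerate(codebooks) if i != s])
    # Distance from each of speaker s's code vectors to its nearest
    # "foreign" code vector: far-away vectors discriminate better
    d = np.linalg.norm(codebooks[s][:, None] - others[None], axis=2)
    w = d.min(axis=1)
    return w / w.sum()               # normalize the weights to sum to 1

# Toy example: speaker 0's second code vector is far from everyone else,
# so it receives the larger weight
codebooks = [np.array([[0.0, 0.0], [10.0, 10.0]]),
             np.array([[0.0, 1.0]]),
             np.array([[1.0, 0.0]])]
w = discrimination_weights(codebooks, 0)
```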
Results and Performance
Evaluation
Distortion

• Distortion of Speaker 1 with the other speakers, computed by the MFCC-based classifier. Speaker 1's own codebook gives the minimum distortion (1.5486), so the test utterance is correctly identified.

Speaker  Distortion   Speaker  Distortion   Speaker  Distortion   Speaker  Distortion
   1      1.5486        12      2.0171        23      1.9560        34      1.8465
   2      1.9222        13      2.0289        24      1.8488        35      2.1536
   3      2.0962        14      2.0857        25      1.9014        36      2.2493
   4      2.2161        15      1.9501        26      2.0615        37      2.0788
   5      2.0978        16      1.9651        27      1.7879        38      1.7624
   6      2.3774        17      2.0492        28      1.8473        39      1.8619
   7      2.1923        18      2.1089        29      1.9538        40      2.0992
   8      2.3084        19      2.0431        30      2.0406        41      1.7827
   9      2.2460        20      1.9681        31      2.1471        42      1.8035
  10      1.8637        21      1.7445        32      2.1844        43      1.8390
  11      1.9543        22      1.9843        33      1.7974
Performance Evaluation
Identification performance as a function of the number of mel filter banks, using 12 MFCCs and a codebook size of 8.
MFCC vs LPC
• The test uses 12 MFCCs with 29 filters, and 12 LPCCs from a 12th-order LP analysis. The test run varies the size of the codebook (i.e. the number of code words assigned to each speaker).
• MFCC and LPCC achieve perfect identification using 16 and 4 code words per speaker, respectively.
[Figure: identification performance for varying codebook sizes — left panel: 12 MFCCs; right panel: 12 LPCCs.]
Performance of Coefficients under Noise

• Noise clearly influences the performance of the system, making classification almost useless at high noise levels.
• LPCCs are the most resistant at low noise levels, while MFCCs perform slightly better at larger noise levels, using a codebook size of 16.
• Increasing the codebook size from 8 to 16 shows a definite improvement, especially at the higher noise levels.
[Figure: performance of coefficients under noise with a codebook size of 8 — (a) 12 LPCCs, (b) 12 MFCCs.]

[Figure: performance of coefficients under noise with a codebook size of 16 — (a) 12 LPCCs, (b) 12 MFCCs.]
[Figure: distortion measure for a test speech sample as a function of codebook size — left panel: MFCC; right panel: LPC.]

• The difference in distortion measure between the correct speaker and the runner-up stays almost the same when varying the codebook size.
Conclusion
Future Work
Questions & Answers