Accent Modeling
An Overview
02/09/07 iCONS Group Presentation 2
Prologue
Our initial effort: enhancement of speaker recognition through score-level fusion of Arithmetic Harmonic Sphericity (AHS) and Hidden Markov Model (HMM) techniques.
Performance improvements of 22% and 6% in true acceptance rate (at 5% false acceptance rate) on the YOHO and USF multi-modal biometric datasets, respectively.
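The slides do not say how the AHS and HMM scores were fused or how the operating point was chosen. The sketch below assumes min-max normalization followed by a weighted sum, and reads the true acceptance rate at a fixed false acceptance rate from the impostor score distribution. `fuse_scores`, `tar_at_far`, and `weight` are hypothetical names, not the original system's code.

```python
import math


def min_max_normalize(scores):
    """Map raw scores to [0, 1]; assumes higher means 'more likely the claimed speaker'."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.5] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_scores(ahs_scores, hmm_scores, weight=0.5):
    """Weighted-sum fusion of min-max normalized AHS and HMM scores for the same trials.

    `weight` is a hypothetical tuning parameter. A distance-type score should be negated
    before calling this so that both inputs point in the same direction.
    """
    a = min_max_normalize(ahs_scores)
    h = min_max_normalize(hmm_scores)
    return [weight * x + (1.0 - weight) * y for x, y in zip(a, h)]


def tar_at_far(genuine, impostor, far=0.05):
    """True acceptance rate at a threshold that accepts at most `far` of impostor trials."""
    ranked = sorted(impostor, reverse=True)
    k = math.floor(far * len(ranked))  # number of impostor trials we may accept
    threshold = ranked[k] if k < len(ranked) else -math.inf
    # Accept only scores strictly above the threshold, so at most k impostors pass.
    return sum(1 for g in genuine if g > threshold) / len(genuine)
```

Normalizing over all trials together (genuine and impostor) keeps the two matchers on a common scale without separating the classes before the threshold is chosen.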
Prologue…contd
[Figure: Enhanced Recognition at Various FARs (YOHO data). True Acceptance Rate (%) of AHS, HMM, and HF at 3% and 5% False Acceptance Rate.]
[Figure: Enhanced Recognition at Various FARs (USF data). True Acceptance Rate (%) of AHS, HMM, and HF at 3% and 5% False Acceptance Rate.]
Prologue – what next
Further improvement of the recognition rate by exploiting speaker accent.
Speaker accent will play a critical role in the evaluation of biometric systems, since their users will be international.
Incorporating an accent model into the speaker recognition/verification system will be a key focus of our study.
Accent
What is accent? The cumulative auditory effect of those features of pronunciation which identify where a person is from, regionally and socially.
Difference between accent and dialect: accent is the negative (or rather colorful) influence of a speaker's first language (L1) on a second language, while dialects of a given language are differences in the speaking style of that language (all of which belong to L1) due to geographical and ethnic differences.
Accent
Factors affecting the level of accent:
- Age at which the speaker learns the second language.
- Nationality of the speaker's language instructor.
- Grammatical and phonological differences between the primary and secondary languages.
- Amount of interaction the speaker has with native speakers of the language.
Applications of Accent Modeling
Accent knowledge can be used to select alternative pronunciations or to bias a language model for speech recognition.
Accent can be useful for profiling speakers to route calls in a call centre.
Accent information can also support document retrieval systems and speaker recognition systems.
Examples of Accent
- Native American English
- Indian
- Chinese
- British
- Japanese
- Russian
- Arabic
- Greek
World’s Major Languages
Accent Classification System
[Figure: block diagram. Training: for each accent, Speech Data (Training) → Extract Accent Features → Reference Accent Model (1 to N). Testing: Speech Data (Testing) → Extract Accent Features → Classification, which scores the test features against Reference Accent Models 1 to N.]
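A minimal sketch of the block diagram, assuming each reference accent model is a single diagonal-covariance Gaussian. This is a deliberate simplification: the systems reviewed here use GMMs, HMMs, neural networks, or trajectory models. Training fits one model per accent; classification picks the accent whose model gives the highest average log-likelihood on the test features. All function names are hypothetical.

```python
import math


def train_reference_model(frames):
    """Fit a diagonal-covariance Gaussian to a list of D-dimensional feature frames."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[j] for f in frames) / n for j in range(dim)]
    # Variance floor avoids division by zero for constant dimensions.
    var = [max(sum((f[j] - mean[j]) ** 2 for f in frames) / n, 1e-6) for j in range(dim)]
    return mean, var


def avg_log_likelihood(frames, model):
    """Average per-frame log-likelihood of test frames under one reference model."""
    mean, var = model
    total = 0.0
    for f in frames:
        for x, m, v in zip(f, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)


def classify(test_frames, models):
    """Score the test features against every reference model and return the best label."""
    scores = {label: avg_log_likelihood(test_frames, m) for label, m in models.items()}
    return max(scores, key=scores.get), scores
```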
Accent: Research Work
M. V. Chan, et al., "Classification of speech accents with neural networks," IEEE World Congress on Computational Intelligence, vol. 7, pp. 4483-4486, 27 Jun-2 Jul 1994.
L. M. Arslan, "Foreign Accent Classification in American English," Ph.D. Dissertation, Duke University, 1996.
C. Teixeira, I. Trancoso, and A. Serralheiro, "Accent identification," in Proc. International Conference on Spoken Language Processing, vol. 3, pp. 1784-1787, 1996.
P. Fung and W. K. Liu, "Fast Accent Identification and Accented Speech Recognition," in Proc. ICASSP'99, vol. 1, pp. 221-224, 1999.
T. Chen, et al., "Automatic accent identification using Gaussian mixture models," ASRU '01, pp. 343-346, 9-13 Dec. 2001.
P. Angkititrakul and J. H. L. Hansen, "Stochastic Trajectory Model Analysis for Accent Classification," International Conference on Spoken Language Processing, vol. 1, pp. 493-496, Sept. 2002.
X. Lin and S. Simske, "Phoneme-less hierarchical accent classification," Signals, Systems and Computers, vol. 2, pp. 1801-1804, 7-10 Nov. 2004.
Research Work…contd
F. Farahani, et al., "Speaker identification using supra-segmental pitch pattern dynamics," in Proc. ICASSP'04, vol. 1, pp. I-89-92, 17-21 May 2004.
M. M. Tanabian, et al., "Automatic speaker recognition with formant trajectory tracking using CART and neural networks," Canadian Conference on Electrical and Computer Engineering, pp. 1225-1228, 1-4 May 2005.
S. Gray and J. H. L. Hansen, "An integrated approach to the detection and classification of accents/dialects for a spoken document retrieval system," ASRU '05, pp. 35-40, 27 Nov-1 Dec. 2005.
P. Angkititrakul and J. H. L. Hansen, "Advances in Phone-based Modeling for Automatic Accent Classification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, pp. 634-646, March 2006.
K. Bartkova and D. Jouvet, "Using Multilingual Units for Improved Modeling of Pronunciation Variants," in Proc. ICASSP'06, vol. 5, pp. V-1037-V-1040, 14-19 May 2006.
A. Ikeno and J. H. L. Hansen, "Perceptual Recognition Cues in Native English Accent Variation: 'Listener Accent, Perceived Accent, and Comprehension'," in Proc. ICASSP'06, vol. 1, pp. I-401-I-404, 14-19 May 2006.
Accent Classification Tree
Speech Dataset → Accent Features → Modeling → Classification/Decision
Accent Features: Pitch, Energy, Formants, Formant Trajectories, MFCCs, Delta MFCCs
Modeling: Stochastic Trajectory Models, Artificial Neural Networks, Gaussian Mixture Models, Hidden Markov Models
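Delta MFCCs (and the delta features on later slides) add the local slope of each coefficient over time. A common choice, assumed here because the slides do not give the formula, is the regression estimate d_t = sum_{n=1..W} n (c_{t+n} - c_{t-n}) / (2 sum_{n=1..W} n^2) with window W = 2 and edge frames repeated. The helper names are hypothetical.

```python
def delta(features, width=2):
    """Regression-based delta coefficients for a list of D-dimensional frames.

    Frames beyond either end are padded by repeating the first/last frame.
    """
    T = len(features)
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(T):
        d = []
        for j in range(len(features[0])):
            acc = 0.0
            for n in range(1, width + 1):
                nxt = features[min(t + n, T - 1)][j]
                prv = features[max(t - n, 0)][j]
                acc += n * (nxt - prv)
            d.append(acc / denom)
        out.append(d)
    return out


def append_deltas(features, width=2):
    """Concatenate each static frame with its delta frame (D -> 2D dimensions)."""
    return [list(f) + d for f, d in zip(features, delta(features, width))]
```

On a linear ramp the interior deltas equal the ramp's slope, which is a quick sanity check on the normalization term.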
Foreign Accent Classification in American English - Dataset
The dataset consists of neutral American English and German, Spanish, Chinese, Turkish, French, Italian, Hindi, Romanian, Japanese, Persian, and Greek accents.
All speech was sampled at 8 kHz.
In total, 43 speakers used microphone input and 68 speakers used telephone input, all in a quiet office environment.
Formant Frequency Analysis
Formants are the frequencies around which most of the acoustic energy passing from source to output is concentrated, when the vocal tract is modeled as an acoustic tube.
The second and third formants (F2 and F3) are particularly useful for accent classification.
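To make the acoustic tube model concrete: an idealized uniform lossless tube, closed at the glottis and open at the lips, resonates at F_n = (2n - 1) c / (4L). With L = 17.5 cm and c = 350 m/s this gives 500, 1500, and 2500 Hz. Real vocal tracts are not uniform; tongue and lip positions reshape the tube and shift the formants, which is why formant positions carry articulation (and hence accent) information. The default length and speed of sound below are assumed textbook values, and the function name is hypothetical.

```python
def tube_formants(length_m=0.175, speed_of_sound=350.0, count=3):
    """Resonances (Hz) of a uniform lossless tube closed at one end and open at the other.

    F_n = (2n - 1) * c / (4 * L); an idealization of the vocal tract, not a formant tracker.
    """
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m) for n in range(1, count + 1)]
```

A shorter tube (for example `tube_formants(length_m=0.15)`) raises every resonance proportionally.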
Mel Scale vs. Accent Scale
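The figure comparing the two scales is not recoverable, and the accent scale itself is not specified in this text, so only the mel-scale baseline is sketched here. It uses the common formula mel(f) = 2595 log10(1 + f/700) and places filter centers at equal mel spacing. The helper names are hypothetical. The 4 kHz upper limit in the test assumes the 8 kHz sampling rate of the dataset described earlier.

```python
import math


def hz_to_mel(f_hz):
    """Common mel-scale formula: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filter_centers(num_filters, f_min, f_max):
    """Filter center frequencies (Hz) equally spaced on the mel scale, endpoints excluded."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (num_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(num_filters)]
```

Because the centers crowd together at low frequencies, the mel filterbank spends most of its resolution below about 1 kHz. An accent-oriented scale would instead place resolution where accent differences are largest.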
Accent Classifier
The features consisted of 8-dimensional ASCCs (accent-sensitive cepstral coefficients) and energy, along with their delta features.
The IW-FS, CS-FS, and CS-PS classifiers achieved classification accuracies of 74.5%, 61.3%, and 68.3%, respectively.
With a test word count of 7-8 words, the accent classification accuracy among 4 accents is 93%.
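Accuracy rises with the number of test words because each word adds evidence. One plausible way to combine words, assumed here rather than taken from the source, is to sum per-word log-likelihoods for each accent (treating words as independent) and pick the largest total. The function name and the scores in the test are hypothetical.

```python
def combine_word_scores(word_scores):
    """word_scores: one dict per test word, mapping accent -> log-likelihood for that word.

    Sums the log-likelihoods across words (independence assumption) and returns the accent
    with the highest total.
    """
    totals = {}
    for scores in word_scores:
        for accent, ll in scores.items():
            totals[accent] = totals.get(accent, 0.0) + ll
    return max(totals, key=totals.get)
```

A single ambiguous word can favor the wrong accent, while the accumulated total over several words is more likely to recover the right one.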
Computer vs. Humans
Conclusions about specific features
Word-final stop release time is longer for foreign-accented speakers.
The slope of the intonation contour for isolated words is more negative for Chinese speakers, and more positive for German speakers, than for native speakers.
Voice onset time for unvoiced stops is not a significant contributor for the accents considered in this study.
The second and third formant positions differ between native and non-native speakers.
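The intonation finding refers to the slope of the pitch (F0) contour. A minimal way to measure that slope, assumed here since the slides do not give the estimator, is an ordinary least-squares line fit over voiced frames. Frames with F0 <= 0 are treated as unvoiced and skipped. The function name is hypothetical.

```python
def contour_slope(times, f0):
    """Least-squares slope (Hz per second) of an F0 contour over voiced frames (f0 > 0)."""
    pts = [(t, f) for t, f in zip(times, f0) if f > 0]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two voiced frames")
    mean_t = sum(t for t, _ in pts) / n
    mean_f = sum(f for _, f in pts) / n
    num = sum((t - mean_t) * (f - mean_f) for t, f in pts)
    den = sum((t - mean_t) ** 2 for t, _ in pts)
    return num / den
```

A falling contour yields a negative slope and a rising one a positive slope. Comparing such slopes across speaker groups is the kind of measurement behind the Chinese/German observation above.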
Accent Classification/Detection using ANN
Demographic data (the speaker's age, the percentage of the day in which English is used for communication, and the number of years English has been spoken) were used as features, along with speech features (average pitch frequency and the averages of the first three formant frequencies), as inputs to the neural network.
A dataset of 10 native and 12 non-native speakers was used.
The F2 and F3 distributions of the native and non-native groups show high dissimilarity.
Three neural network classification techniques, namely competitive learning, counter-propagation, and back-propagation, were compared.
Back-propagation gave a detection rate of 100% on the training data and 90.9% on the testing data.
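The slides do not give the network architecture or training settings. As a hedged illustration of back-propagation, one of the three techniques compared, the sketch below trains a one-hidden-layer sigmoid network on the seven inputs listed above: age, share of the day using English, years of English, average pitch, and average F1 to F3. The hidden-layer size, learning rate, epoch count, and the toy rows in the test are all assumptions, not values from the study.

```python
import math
import random


def standardize(rows):
    """Z-score each column so that age, years, pitch (Hz) and formants (Hz) share one scale."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_mlp(X, y, hidden=4, lr=0.5, epochs=1000, seed=0):
    """Train a one-hidden-layer sigmoid network with per-sample back-propagation.

    X: standardized feature rows; y: 1 for non-native (accented), 0 for native.
    Returns a function mapping a feature row to an output in (0, 1).
    """
    rng = random.Random(seed)
    d = len(X[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0

    def forward(x):
        h = [sigmoid(sum(w * xi for w, xi in zip(W1[j], x)) + b1[j]) for j in range(hidden)]
        return h, sigmoid(sum(w * hj for w, hj in zip(W2, h)) + b2)

    for _ in range(epochs):
        for x, t in zip(X, y):
            h, o = forward(x)
            # Error terms for the squared-error loss 0.5 * (o - t)^2.
            delta_o = (o - t) * o * (1.0 - o)
            delta_h = [delta_o * W2[j] * h[j] * (1.0 - h[j]) for j in range(hidden)]
            # Gradient-descent updates (delta_h was computed with the old W2).
            for j in range(hidden):
                W2[j] -= lr * delta_o * h[j]
                b1[j] -= lr * delta_h[j]
                for i in range(d):
                    W1[j][i] -= lr * delta_h[j] * x[i]
            b2 -= lr * delta_o

    return lambda x: forward(x)[1]
```

Standardizing first matters here: raw formant values in the thousands of Hz would otherwise saturate the sigmoids and swamp inputs such as age.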
Phoneme-less Hierarchical Accent Classification
WSJCAM0 and TIDIGITS were used to train the British and American accent models, respectively.
IViE and Voicemail were used to test the British and American accents, respectively.
13-dimensional MFCCs were used as features, and a 64-component Gaussian Mixture Model (GMM) was used for modeling.
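Scoring MFCC frames against per-accent GMMs can be sketched as follows, assuming diagonal covariances (the slide does not state the covariance type) and parameters already estimated, typically by EM, which is not shown. The same code works for 13-dimensional frames and 64 components; the test uses tiny hypothetical models. The function names are hypothetical.

```python
import math


def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of `frames` under a diagonal-covariance GMM."""
    total = 0.0
    for x in frames:
        comps = []
        for w, mu, var in zip(weights, means, variances):
            ll = math.log(w)
            for xi, m, v in zip(x, mu, var):
                ll += -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            comps.append(ll)
        # log sum_m w_m N(x; mu_m, var_m), computed with log-sum-exp for stability.
        peak = max(comps)
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total / len(frames)


def best_accent(frames, accent_gmms):
    """accent_gmms maps an accent label to (weights, means, variances)."""
    return max(accent_gmms, key=lambda a: gmm_log_likelihood(frames, *accent_gmms[a]))
```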
Results show an average relative error rate reduction of 7.1% compared to direct accent classification.
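A relative reduction is measured against the baseline error, not in percentage points. For example, a hypothetical baseline error of 30% falling to 27.87% is a 7.1% relative reduction but only 2.13 points in absolute terms. The numbers and the function name are illustrative only.

```python
def relative_error_reduction(baseline_error, new_error):
    """Relative reduction (%) of new_error with respect to baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error
```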
Accent Classification Application
Advances in Phone-Based Modeling
Conventional HMMs assume that the feature sequence is produced by a piecewise stationary process.
HMMs also assume that adjacent frames are acoustically uncorrelated (conditionally independent given the state).
In addition, the state-dependent duration distributions they imply decrease exponentially with duration.
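The exponential duration behavior follows from the self-transition. If a state loops back to itself with probability a, the probability of staying exactly d frames is a^(d-1) (1 - a). This is a geometric (discrete exponential) distribution whose most likely duration is always a single frame, with mean 1 / (1 - a). The sketch tabulates it; `self_loop` is a hypothetical parameter name.

```python
def hmm_duration_pmf(self_loop, max_frames):
    """P(a state is occupied for exactly d frames), d = 1..max_frames.

    Implied by a constant self-transition probability `self_loop`: a^(d-1) * (1 - a).
    """
    return [self_loop ** (d - 1) * (1.0 - self_loop) for d in range(1, max_frames + 1)]


def expected_duration(self_loop):
    """Mean of the geometric duration distribution, in frames."""
    return 1.0 / (1.0 - self_loop)
```

With a = 0.8 the mean stay is 5 frames, yet a 1-frame stay remains the single most probable duration, which poorly matches real phone durations.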
Why Phone-Based Modeling?
Capturing the temporal variation of the acoustic signal is an important aspect of speech recognition.
A better framework for modeling the evolution of the spectral dynamics of speech.
Flexibility and power due to whole-segment classification, in contrast to frame-by-frame classification.
Trajectories of the phoneme sequence /aa/-/r/ from the word ‘Target’
Stochastic Trajectory Model
An STM represents the acoustic observations of a phoneme as clusters of trajectories in a parametric space.
Let X be a sequence of N points:
$$X = (x_0, x_1, \ldots, x_{N-1})$$
where each point $x_i$ is a D-dimensional vector. X is obtained by resampling a sequence of d frames along the linear time scale.
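The resampling step maps a variable-length segment of d frames to a fixed N points. The interpolation rule is not given here, so the sketch assumes linear interpolation between neighboring frames at N equally spaced positions on the original time axis. The function name is hypothetical.

```python
def resample_trajectory(frames, n_points):
    """Resample d frames (each a D-dim list) to n_points equally spaced in linear time.

    Uses linear interpolation between neighboring frames (an assumed rule).
    """
    d = len(frames)
    if d == 1 or n_points == 1:
        # Degenerate cases: repeat the first frame.
        return [list(frames[0]) for _ in range(n_points)]
    out = []
    for i in range(n_points):
        pos = i * (d - 1) / (n_points - 1)  # position on the original frame axis
        lo = int(pos)
        hi = min(lo + 1, d - 1)
        frac = pos - lo
        out.append([(1.0 - frac) * a + frac * b for a, b in zip(frames[lo], frames[hi])])
    return out
```

After resampling, segments of different durations can be compared point by point against the same N-point trajectory models.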
Stochastic Trajectory Model
The resampled N-frame vector sequence X is considered to be the underlying trajectory of the original sequence of d frames. The pdf of a segment X given a duration d and the segment symbol s is:
$$p(X \mid d, s) = \sum_{t_k \in T_s} p(X \mid t_k, d, s)\, pr(t_k \mid s)$$
where $T_s$ is the set of all trajectory components associated with $s$, and $pr(t_k \mid s)$ is the probability of observing trajectory $t_k$ given that the segment is $s$, with the constraint that
$$\sum_{t_k \in T_s} pr(t_k \mid s) = 1,$$
and $p(X \mid t_k, d, s)$ is the pdf of the vector sequence X given component trajectory $t_k$, duration $d$, and symbol $s$.
Stochastic Trajectory Model
The distribution assigned to each of the sample points on a trajectory is characterized by a multivariate Gaussian distribution with mean vector $m^s_{k,i}$ and covariance matrix $\Sigma^s_{k,i}$. Assuming frame-independent trajectories, the pdf is modeled as
$$p(X \mid t_k, d, s) = \prod_{i=0}^{N-1} \mathrm{Gaussian}(x_i; m^s_{k,i}, \Sigma^s_{k,i})$$
The training algorithm performs maximum likelihood estimation of the Gaussian distribution parameters.
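Putting the two STM equations together, the segment score is a prior-weighted mixture over trajectory components, each of which is a product of per-point Gaussians. The sketch works in the log domain and assumes diagonal covariances, a simplification of the full $\Sigma^s_{k,i}$. Parameters are assumed to be already trained. The function names are hypothetical; classifying a segment would pick the symbol s with the highest score.

```python
import math


def log_gaussian_diag(x, mean, var):
    """log N(x; mean, diag(var)) for one D-dimensional sample point."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def stm_log_likelihood(X, trajectories):
    """log p(X | d, s) for a resampled N-point segment X under one segment model s.

    trajectories: list of (pr_tk, means, variances), one per component t_k in T_s, where
    means[i] and variances[i] are the D-dim parameters of sample point i.
    """
    terms = []
    for pr_tk, means, variances in trajectories:
        # log p(X | t_k, d, s): sum of per-point log-densities (frame independence).
        log_px = sum(log_gaussian_diag(x, m, v) for x, m, v in zip(X, means, variances))
        terms.append(math.log(pr_tk) + log_px)
    # log of sum_k pr(t_k | s) * p(X | t_k, d, s), via log-sum-exp for stability.
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```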
Accent Classification System
Performance – Male and Female Chinese vs American-English
Further Investigation
Further study of accent classification and detection.
Study of accent from a linguistic point of view.
Experimentation and formulation of accent modeling and classification.
Combination of accent information with my previous work to enhance speaker recognition.
Questions
Thank You