Center for Speech and Language Technologies, Tsinghua University
A Time-Varying Voiceprint Database
Workshop on Phonetics and Speech Technology
19 Nov 2010, Tata Institute of Fundamental Research, Mumbai, India
Thomas Fang Zheng, and Linlin Wang
Outline
Necessity of database creation
Criteria of database design
Observations and preliminary experiments
Summary
Challenges in Application of Voiceprint Recognition
- NAB (National Australia Bank)
  - In 2009, applied VPR to telephone banking in collaboration with Telstra and VeCommerce (Nuance’s Verifier engine).
- ABN Amro (Algemene Bank Nederland N.V.)
  - In 2006, used VoiceVault for verification in collaboration with VoiceVault Co.
  - VPR in combination with ASR of registered secret questions.
- China Construction Bank
  - In 2010, telephone banking.
  - First application in China.
Challenges in Application of Voiceprint Recognition (cont’d)
ASR/VPR engines support Call Center Integration (CCI), human operators (agent seats), and back-end management systems.
[Figure: system architecture diagram. Labeled components: speech recognition engine API and engine core; voiceprint recognition engine API and engine core; CCI (IVR system) nodes, each paired with the speech or voiceprint recognition engine API; database read/write (voiceprint); human agent seats with speech/voiceprint recognition engine API/tools.]
Challenges in Application of Voiceprint Recognition (cont’d)
Voiceprint Recognition
- Noise
- Short-utterance
- Cross-channel
- Emotion
- Time-varying
- Multi-speaker
About voiceprint
Pioneer researchers coined the word “voiceprint” to stand for the identifiable uniqueness in each voice [Kersta 1962].
At the same time, they raised several questions, such as: Does the voice of an adult change over time? If so, how? [Kersta 1962]
The Time-Varying Challenge
Open Questions [Furui 1997]
How do we deal with long-term variability in people’s voices?
How should we update the speaker models to cope with the gradual changes in people’s voices?
Is there any systematic long-term variation?
How do we adequately retrain the models?
...
The Time-Varying Challenge (cont’d)
A need for caution [Bonastre et al. 2003]
…at the present time, there is no specific process that enables one to uniquely characterize a person’s voice or to identify with absolute certainty an individual from his or her voice…
Voice changes over time, either in the short-term (at different times of the day), the medium term (times of the year), or in the long-term (with age).
Long-term non-contemporary samples represent a challenge.
The Time-Varying Challenge (cont’d)
Performance degradation observed by Soong et al.
- Database: period of 2 months, 100 talkers (50M + 50F), digit vocabulary, 5 different recording sessions.
- The 1st and 2nd sessions were used for VQ codebook training, and the other three sessions for testing.

Identification Error (%)
             3rd session   4th session   5th session
1 digit      17            25            28
2 digits     5             10            14
4 digits     2             4             8
10 digits    0             1             4

This result clearly indicates the need to update the VQ codebook from time to time. [Soong et al. 1985]
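
The VQ-codebook idea behind these numbers can be illustrated roughly as follows. This is only a minimal sketch, assuming plain k-means for codebook training and squared Euclidean distortion for matching; the function names and the toy 2-D frames are hypothetical, and it does not reproduce the setup of [Soong et al. 1985].

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_codebook(frames, size, iterations=10):
    """Plain k-means codebook; assumes size <= len(frames)."""
    step = len(frames) // size
    codebook = [list(frames[i * step]) for i in range(size)]  # evenly spaced seeds
    for _ in range(iterations):
        clusters = [[] for _ in codebook]
        for f in frames:
            nearest = min(range(size), key=lambda j: sq_dist(f, codebook[j]))
            clusters[nearest].append(f)
        for j, members in enumerate(clusters):
            if members:  # keep the old code vector if its cluster is empty
                codebook[j] = [sum(col) / len(members) for col in zip(*members)]
    return codebook

def identify(frames, codebooks):
    """Return the speaker whose codebook gives the lowest average distortion."""
    def avg_distortion(codebook):
        return sum(min(sq_dist(f, c) for c in codebook) for f in frames) / len(frames)
    return min(codebooks, key=lambda name: avg_distortion(codebooks[name]))

# Toy enrollment data (hypothetical 2-D "features" for two speakers).
codebooks = {
    "A": train_codebook([(0.0, 0.1), (0.1, 0.0), (1.0, 1.1), (1.1, 1.0)], size=2),
    "B": train_codebook([(5.0, 5.1), (5.1, 5.0), (6.0, 6.1), (6.1, 6.0)], size=2),
}
guess = identify([(0.2, 0.1), (0.9, 1.0)], codebooks)
```

In this picture, updating a codebook from time to time would amount to calling `train_codebook` again with frames from more recent sessions.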
The Time-Varying Challenge (cont’d)
Performance degradation observed by Kato and Shimizu
- A significant loss in accuracy (4~5% in EER) between two sessions separated by 2 months. [Kato & Shimizu 2003]
- Aging was considered to be the cause [Hebert 2008].
The Time-Varying Challenge (cont’d)
Performance degradation observed at CSLT, THU
- 2 recording sessions with an interval of 10 months: July 2008 (Dataset A) and May 2009 (Dataset B)
- 13 speakers and 2 speaking styles (reading and spontaneous)

EER (%)
               Dataset A for Model Training        Dataset B for Model Training
               Verify on A      Verify on B        Verify on A      Verify on B
Reading        2.47             9.32               9.43             2.08
Spontaneous    9.39             14.73              14.53            6.13
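
For readers less familiar with the metric in this table, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below shows one simple way to estimate it from verification scores, assuming higher scores mean "more likely the same speaker"; the function name and the toy scores are hypothetical and unrelated to the CSLT data.

```python
def equal_error_rate(genuine, impostor):
    """Return (eer, threshold) for higher-is-more-similar scores."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        # False rejection: genuine trials scoring below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        # False acceptance: impostor trials scoring at or above the threshold.
        far = sum(s >= t for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]

# Toy scores (hypothetical).
genuine_scores = [0.9, 0.8, 0.75, 0.6, 0.4]
impostor_scores = [0.7, 0.5, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
```

With these toy scores, the two error rates meet at a threshold of 0.6, where one genuine trial in five is rejected and one impostor trial in five is accepted, giving an EER of 20%.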
The Time-Varying Issue in Voiceprint Recognition
A generally acknowledged phenomenon.
- It is widely assumed that speaker models should be updated from time to time to maintain representativeness.
- Effective but not practical.
Few researchers have pinned down the exact reasons behind this phenomenon.
- A proper longitudinal voiceprint database is essential for this study.
- The MARP corpus is the only one published so far. [Lawson et al. 2009]
The MARP Corpus
Description
- Multi-Session Audio Research Project by University of Texas at Dallas
- This corpus includes 21 sessions of free-flowing conversations over a 3-year period for each speaker.
- Each conversation was approximately 10 minutes in length with an isolated partner who remained constant throughout the 3 years, and speakers were given suggested conversation topics.
The MARP Corpus (cont’d)
Initial findings
- The following study of the aging effect largely focused on 32 speakers and 672 sessions from June 2005 to March 2008. [Lawson et al. 2009]
- While the impact on speaker recognition accuracy between any two sessions was considerable, the long-term trend was statistically quite small.
- Reasonable or not? Which contributes more: context or time?
- The detrimental impact was clearly not a function of aging or of the voice changing within this timeframe.
The MARP Corpus (cont’d)
Is it more suitable for research?
- In free-flowing conversations, speech content is not fixed, and a speaker’s emotion, speaking style, or engagement can easily be influenced by his/her partner.
- Perhaps the aging effect is weaker than these more evident effects and is therefore masked by them.
Other Relevant Databases
NTT-VR speaker recognition data corpus
35 speakers, with 22 males and 13 females
Recorded in 5 sessions with a span of 10 months (1990-1991)
Normal, slow, and fast speaking speed
Limits: only 5 sessions
Demand for a new database
Creating a longitudinal voiceprint database that focuses specifically on the time-varying effect in speaker recognition is imperative for both research and practical applications.
Factors other than time should be kept as constant as possible.
- Prompt texts, recording equipment, software, conditions, environment, and so on
Different lengths of time intervals can be adopted to analyze the gradual impact.
Database Design
A general design principle
- The time-varying effect is the only focus, and other factors should be kept as constant as possible throughout all recording sessions.
Two concrete design principles
- Fixed prompt texts, so that content variation does not disturb the research
- Gradient time intervals, so that time intervals of different lengths can be studied flexibly
Speakers recruited
Database Design (cont’d)
Fixed prompt texts
- Speakers were asked to read fixed prompt texts instead of holding free-style conversations.
  - Free-style speech is closer to the actual situation, yet can be studied later.
- Prompt texts were designed to remain unchanged throughout all recording sessions.
  - To avoid, or at least reduce, the impact of speech content on speaker recognition accuracy.
  - In the form of sentences and isolated words.
Database Design (cont’d)
Gradient time intervals
- Initial sessions can be separated by shorter time intervals, and later sessions by longer and longer ones.
  - Five different time intervals are used: one week, one month, two months, four months, and half a year, as illustrated in the figure below (16 sessions).
[Figure: the 16 recording sessions marked along a time axis covering 2010 to 2012.]
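
The gradient-interval idea can be sketched as a small schedule generator. The slide gives the five interval lengths and the total of 16 sessions, but not how many times each interval is used, so the split in `INTERVAL_PLAN`, the day counts used for "one month" and longer, and the start date are all assumptions made purely for illustration.

```python
from datetime import date, timedelta

# Hypothetical split of the 15 gaps between 16 sessions; the slide gives
# the five interval lengths but not how often each one is used.
INTERVAL_PLAN = [
    (7, 4),    # one week
    (30, 4),   # roughly one month
    (60, 3),   # roughly two months
    (120, 2),  # roughly four months
    (180, 2),  # roughly half a year
]

def session_dates(first_session, plan=INTERVAL_PLAN):
    """Return session dates whose gaps grow from short to long."""
    dates = [first_session]
    for gap_days, repeats in plan:
        for _ in range(repeats):
            dates.append(dates[-1] + timedelta(days=gap_days))
    return dates

schedule = session_dates(date(2010, 3, 1))  # the start date is an assumption
gaps = [(b - a).days for a, b in zip(schedule, schedule[1:])]
```

This sketch does not shift sessions away from vacation periods; a real schedule would need that extra adjustment.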
Database Design (cont’d)
Speakers
- 60 first-year students at the beginning, 30M + 30F.
  - The time intervals were designed so that recording days do not fall in the summer or winter vacations.
- Speakers were born between 1989 and 1993, with a majority in 1990.
- Speakers come from various departments, including computer science, biology, English, humanities, and journalism.
- All of them speak standard Chinese well.
Current Database
Recording progress
- 9 recording sessions have been completed so far.
Database evaluation
- Based on speech from the first 7 recording sessions
Objective
To characterize a speaker using speaker-specific features that remain largely unchanged over time
Then where to start?
- Acoustics: frequency, spectrum, formant, pitch, …
- Statistics: mean, variance, …
A straightforward method
Model update method
- 35 speakers, recorded March 2004 to May 2005 [Shan et al. 2005]

            Baseline    Using model update method
Accuracy    69.02%      74.19%
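
The slide does not say how the model was updated in [Shan et al. 2005]. Purely as an illustration of the general idea, the sketch below blends a stored speaker mean vector with the mean of a newly recorded session; the function name, the weight `alpha`, and the toy vectors are all assumptions.

```python
def update_model(model_mean, session_frames, alpha=0.2):
    """Blend the stored mean vector with the mean of a new session.

    alpha is a hypothetical update weight: 0 keeps the old model,
    1 replaces it with the new session's mean.
    """
    n = len(session_frames)
    session_mean = [sum(col) / n for col in zip(*session_frames)]
    return [(1 - alpha) * m + alpha * s for m, s in zip(model_mean, session_mean)]

# Toy example: a 2-D model nudged halfway toward a new session.
model = [1.0, 2.0]
model = update_model(model, [[2.0, 2.0], [4.0, 4.0]], alpha=0.5)
```

A small `alpha` makes the model follow gradual voice changes slowly, while a large one lets a single atypical session dominate.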
Prof. Dang’s work [Dang 2010]
Observations about frequencies
- Based on the NTT-VR corpus
- Speaker relevancy measurement using Fisher’s F-ratio [Wolf 1971]
- Speaker-relevant frequencies are almost invariant across the 5 speech sessions
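
As a rough illustration of the measurement named above, the F-ratio of a single feature can be computed as the variance of the per-speaker means divided by the average within-speaker variance, so that a high value marks a feature that separates speakers well. The sketch below uses that common definition; the function name and the toy band energies are hypothetical, and the exact formulation in [Dang 2010] may differ.

```python
from statistics import mean, pvariance

def f_ratio(values_by_speaker):
    """Between-speaker variance over average within-speaker variance."""
    speaker_means = [mean(v) for v in values_by_speaker]
    between = pvariance(speaker_means)
    within = mean([pvariance(v) for v in values_by_speaker])
    return between / within

# Toy energies for three speakers (one inner list per speaker).
# band_a separates the speakers; band_b varies mostly within each speaker.
band_a = [[1.0, 1.2, 0.8], [3.0, 3.2, 2.8], [5.0, 5.2, 4.8]]
band_b = [[1.0, 3.0, 5.0], [1.2, 3.2, 5.2], [0.8, 2.8, 4.8]]
ratio_a = f_ratio(band_a)
ratio_b = f_ratio(band_b)
```

Computing such a ratio per frequency band, session by session, is one way to check whether the speaker-relevant bands stay in the same place over time.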
Prof. Dang’s work (cont’d)
Algorithm design
- Enhance the information around speaker-relevant frequency regions
  - Design a non-uniform frequency warping algorithm that emphasizes the speaker-relevant frequency regions, instead of the traditional Mel frequency warping
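
One possible way to realize such an emphasis, shown here only as a sketch and not as the algorithm of [Dang 2010], is to place filter center frequencies so that their density follows a per-band relevance weight (for example an F-ratio). The function name, band edges, and weights below are hypothetical.

```python
def nonuniform_centers(band_edges_hz, weights, n_filters):
    """Place filter centers densely where `weights` is large.

    band_edges_hz holds len(weights) + 1 ascending edges; weights must be positive.
    """
    # Cumulative weight at each band edge, normalised to [0, 1].
    total = sum(weights)
    cdf = [0.0]
    for w in weights:
        cdf.append(cdf[-1] + w / total)
    centers = []
    for k in range(1, n_filters + 1):
        target = k / (n_filters + 1)
        # Find the band whose cumulative range contains the target.
        i = 0
        while i < len(weights) - 1 and cdf[i + 1] < target:
            i += 1
        # Interpolate linearly inside that band.
        frac = (target - cdf[i]) / (cdf[i + 1] - cdf[i])
        low, high = band_edges_hz[i], band_edges_hz[i + 1]
        centers.append(low + frac * (high - low))
    return centers

edges = [0, 1000, 2000, 3000, 4000]
weights = [1.0, 1.0, 4.0, 2.0]  # hypothetical relevance per band
centers = nonuniform_centers(edges, weights, n_filters=7)
```

With these toy weights, most of the seven centers fall in or at the edges of the 2000 to 3000 Hz band, which carries the largest weight, whereas a Mel scale would place them by a fixed perceptual curve regardless of speaker relevance.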
Prof. Dang’s work (cont’d)
Results
- MFCC: Mel frequency warping
- UFCC: Uniform frequency processing
- NUFCC: Proposed non-uniform frequency processing
Are there any other useful Features?
Does pitch or formant have anything to do with time?
- “Pengyou” (“friend” in Chinese) from 3 different sessions with an interval of one week (drawn with Praat).
- No evident difference can be seen in the figure. However, from a statistical point of view, there may be some valuable information in the mean, variance, or higher-order parameters.
[Figure: pitch (red) and formant (blue) contours of “Pengyou” from the 3 sessions.]
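
The statistical view suggested here could start from per-session summary statistics of a pitch contour. The sketch below computes the mean, variance, and skewness of the voiced frames; it assumes the contour is a list of F0 values in Hz with 0 marking unvoiced frames, and both the function name and the toy contours are hypothetical.

```python
from statistics import mean, pvariance

def contour_stats(f0_hz):
    """Mean, variance, and skewness of the voiced frames of a pitch contour."""
    voiced = [f for f in f0_hz if f > 0]  # 0 marks unvoiced frames (assumed)
    m = mean(voiced)
    var = pvariance(voiced, m)
    if var > 0:
        skew = mean([(f - m) ** 3 for f in voiced]) / var ** 1.5
    else:
        skew = 0.0
    return {"mean": m, "variance": var, "skewness": skew}

# Toy contours of the same word from three sessions (hypothetical values).
sessions = {
    "session 1": [0, 100, 110, 120, 0],
    "session 2": [0, 102, 111, 123, 0],
    "session 3": [0, 98, 109, 119, 0],
}
stats = {name: contour_stats(f0) for name, f0 in sessions.items()}
```

Comparing such statistics across sessions would show whether any of them drifts over time even when the plotted contours look alike.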
Summary
A longitudinal and specialized time-varying voiceprint database is the basis of research on the widely acknowledged time-varying issue.
Finding stable speaker-specific acoustic features is the core of all speaker recognition research, and time-varying research is no exception.
References
J. Bonastre, F. Bimbot, L. Boe, et al., “Person Authentication by Voice: A Need for Caution”, Proc. of Eurospeech 2003, pp. 33-36, Geneva, 2003.
J. Dang, “Extraction and Application of Speaker’s Individuals”, A lecture of 2010 workshop at Tsinghua University, Beijing, 2010.
S. Furui, “Recent Advances in Speaker Recognition”, Pattern Recognition Letters, Vol. 18, Iss. 9, pp. 859-872, September 1997.
M. Hebert, “Text-Dependent Speaker Recognition”, Springer Handbook of Speech Processing, Springer-Verlag: Berlin, 2008.
T. Kato and T. Shimizu, “Improved Speaker Verification over the Cellular Phone Network Using Phoneme-Balanced and Digit-Sequence Preserving Connected Digit Patterns”, Proc. of ICASSP 2003, Hong Kong, 2003.
L. G. Kersta, “Voiceprint Recognition”, Nature, No. 4861, pp. 1253-1257, December 1962.
A. D. Lawson, A. R. Stauffer, E. J. Cupples, et al., “The Multi-Session Audio Research Project (MARP) Corpus: Goals, Design and Initial Findings”, Proc. of Interspeech 2009, pp. 1811-1814, Brighton, 2009.
A. D. Lawson, A. R. Stauffer, E. J. Cupples, et al., “Long Term Examination of Intra-Session and Inter-Session Speaker Variability”, Proc. of Interspeech 2009, pp. 2899-2902, Brighton, 2009.
Z. Shan, Y. Yang, and C. Wu, “A Voiceprint Access Control System”, Proc. of NCMMSC 2005, Beijing, 2005 (in Chinese).
F. Soong, A. E. Rosenberg, L. R. Rabiner, et al., “A Vector Quantization Approach to Speaker Recognition”, Proc. of ICASSP 1985, Vol. 10, pp. 387-390, Florida, 1985.
J. J. Wolf, “Efficient Acoustic Parameters for Speaker Recognition”, JASA, Vol. 51, No. 6, pp. 2044-2055, 1971.
Thanks!