Center for Speech and Language Technologies, Tsinghua University A Time-Varying Voiceprint Database Workshop on Phonetics and Speech Technology 19 Nov

Center for Speech and Language Technologies, Tsinghua University

A Time-Varying Voiceprint Database

Workshop on Phonetics and Speech Technology

19 Nov 2010, Tata Institute of Fundamental Research, Mumbai, India

Thomas Fang Zheng and Linlin Wang

2

Outline

Necessity of database creation

Criteria of database design

Observations and preliminary experiments

Summary

3


Challenges in Application of Voiceprint Recognition

- NAB (National Australia Bank)
  - In 2009, applied VPR to telephone banking in collaboration with Telstra and VeCommerce (Nuance’s Verifier engine).
- ABN Amro (Algemene Bank Nederland N.V.)
  - In 2006, used VoiceVault for verification in collaboration with VoiceVault Co.
  - VPR in combination with ASR of registered secret questions.
- China Construction Bank
  - In 2010, telephone banking; the first application in China.

Slide 4

Challenges in Application of Voiceprint Recognition (cont’d)

ASR/VPR engines need to support Call Center Integration (CCI), human operators, and background management systems.

[Figure: system architecture diagram. Labels, translated from Chinese: speech recognition engine API interface; speech recognition engine kernel; voiceprint recognition engine API interface; voiceprint recognition engine kernel; CCI (IVR system); database read/write (voiceprint); human agent seat; speech/voiceprint recognition engine API/tools.]

Slide 5

Challenges in Application of Voiceprint Recognition (cont’d)

Voiceprint Recognition

Noise

Short-utterance

Cross-channel

Emotion

Time-varying

Multi-speaker

7

About voiceprint

Pioneer researchers created the word Voiceprint to stand for the identifiable uniqueness in each voice [Kersta 1962].

At the same time, they raised several questions, such as: Does the voice of an adult change over time? If so, how? [Kersta 1962]

8

The Time-Varying Challenge

Open Questions [Furui 1997]

How do we deal with long-term variability in people’s voice?

How should we update the speaker models to cope with the gradual changes in people’s voices?

Is there any systematic long-term variation?

How do we adequately retrain the models?

...

9

The Time-Varying Challenge (cont’d)

A need for caution [Bonastre et al. 2003]

…at the present time, there is no specific process that enables one to uniquely characterize a person’s voice or to identify with absolute certainty an individual from his or her voice…

Voice changes over time, either in the short-term (at different times of the day), the medium term (times of the year), or in the long-term (with age).

Long-term non-contemporary samples represent a challenge.

10


Performance degradation observed by Soong et al.

Database: period of 2 months, 100 talkers (50M + 50F), digit vocabulary, 5 different recording sessions.

This result clearly indicates the need to update the VQ codebook from time to time. [Soong et al. 1985]

Identification error (%), with the 1st and 2nd sessions used for VQ codebook training and the other three sessions for testing:

Test utterance    3rd session    4th session    5th session
1 digit           17             25             28
2 digits          5              10             14
4 digits          2              4              8
10 digits         0              1              4

11


Performance degradation observed by Kato and Shimizu

A significant loss in accuracy (4~5% in EER) between two sessions separated by 2 months. [Kato & Shimizu 2003]

Aging was considered to be the cause [Hebert 2008].

12


Performance degradation observed at CSLT, THU

2 recording sessions with an interval of 10 months: July 2008 (Dataset A) and May 2009 (Dataset B)

13 speakers and 2 speaking styles (reading and spontaneous)

EER (%)        Dataset A for model training      Dataset B for model training
               Verify on A    Verify on B        Verify on A    Verify on B
Reading        2.47           9.32               9.43           2.08
Spontaneous    9.39           14.73              14.53          6.13
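For readers unfamiliar with the metric, the sketch below shows one common way to estimate an equal error rate (EER) from verification scores. It is not the evaluation code used at CSLT. The function name `equal_error_rate` and the toy scores are illustrative assumptions, and higher scores are assumed to mean "same speaker".

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER from two lists of scores (higher = more likely same speaker).

    Sweeps every observed score as a threshold and returns (eer, threshold)
    at the point where false acceptance and false rejection rates are closest.
    """
    thresholds = sorted(set(genuine) | set(impostor))
    best = None
    for t in thresholds:
        # False rejection: genuine trials scoring below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        # False acceptance: impostor trials scoring at or above the threshold.
        far = sum(s >= t for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


# Toy scores, invented purely for illustration.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

In a cross-session setup like the one in the table, the speaker models would come from one dataset and both score lists from trials on the other.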

13

The Time-Varying Issue in Voiceprint Recognition

A generally acknowledged phenomenon.
- It is widely assumed that speaker models should be updated from time to time to maintain representativeness.
- This is effective but not practical.

Few researchers have pinned down the exact reasons behind this phenomenon.
- A proper longitudinal voiceprint database is essential for this study.
- The MARP corpus has been the only one published so far. [Lawson et al. 2009]

14

The MARP Corpus

Description

Multi-Session Audio Research Project by University of Texas at Dallas

This corpus includes 21 sessions of free-flowing conversations over a 3-year period of time for each speaker.

Each conversation was approximately 10 minutes in length with an isolated partner who remained constant throughout the 3 years, and speakers were given suggested conversation topics.

15

The MARP Corpus (cont’d)

Initial findings

The study of the aging effect largely focused on 32 speakers and 672 sessions from June 2005 to March 2008. [Lawson et al. 2009]

While the impact on speaker recognition accuracy between any two sessions was considerable, the long-term trend was statistically quite small. Is this reasonable? Which contributes more: context or time?

The detrimental impact was clearly not a function of aging or of the voice changing within this timeframe.

16

The MARP Corpus (cont’d)

Is it more suitable for research?

In free-flowing conversations, speech contents are not fixed, and a speaker’s emotion, speaking style, or engagement can be easily influenced by his/her partner.

Perhaps the aging effect is weaker than these more evident factors and is therefore masked by them.

17

Other Relevant Databases

NTT-VR speaker recognition data corpus

35 speakers, with 22 males and 13 females

Recorded in 5 sessions with a span of 10 months (1990-1991)

Normal, slow, and fast speaking speed

Limits: only 5 sessions

18

Demand for a new database

Creating a longitudinal voiceprint database that focuses specifically on the time-varying effect in speaker recognition is imperative for both research and practical applications.

Factors other than time should be kept as constant as possible: prompt texts, recording equipment, software, conditions, environment, and so on.

Different lengths of time intervals can be adopted to analyze gradual impact.

19


20

Database Design

A general design principle
- The time-varying effect is the only focus, and other factors should be kept as constant as possible throughout all recording sessions.

Two concrete design principles
- Fixed prompt texts, so that content variation does not disturb the research
- Gradient time intervals, so that different lengths of time intervals can be studied flexibly

Speakers recruited

21

Database Design (cont’d)

Fixed prompt texts

Speakers were asked to read fixed prompt texts aloud instead of holding free-style conversations.
- Free-style speech is closer to the actual situation, yet it can be studied later.

Prompt texts were designed to remain unchanged throughout all recording sessions.
- This avoids, or at least reduces, the impact of speech contents on speaker recognition accuracy.
- The prompts take the form of sentences and isolated words.

22


Gradient time intervals

Initial sessions are separated by shorter time intervals, while later sessions are separated by longer and longer intervals.
- Five different time intervals are used: one week, one month, two months, four months, and half a year, as illustrated in the figure below (16 sessions).

[Figure: recording sessions marked along a time axis spanning 2010, 2011, and 2012]
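The gradient schedule can be sketched in a few lines. The slides give only the five interval lengths and the total of 16 sessions, so the start date, the number of times each interval is used, and the day counts for the month-based intervals are all hypothetical values chosen for illustration.

```python
from datetime import date, timedelta

# Interval lengths in days; the month-based values are rough approximations (assumption).
INTERVAL_DAYS = {
    "one week": 7,
    "one month": 30,
    "two months": 60,
    "four months": 120,
    "half a year": 182,
}


def session_dates(start, gaps):
    """Return the session dates: the start date plus one further date per gap."""
    dates = [start]
    for name in gaps:
        dates.append(dates[-1] + timedelta(days=INTERVAL_DAYS[name]))
    return dates


# Hypothetical split of the 15 gaps between 16 sessions; not taken from the slides.
gaps = (["one week"] * 4 + ["one month"] * 4 + ["two months"] * 3
        + ["four months"] * 2 + ["half a year"] * 2)

# Hypothetical start date.
dates = session_dates(date(2010, 3, 1), gaps)
for number, day in enumerate(dates, start=1):
    print(f"session {number:2d}: {day.isoformat()}")
```

A real schedule would also need to shift any date that lands in a vacation period, which this sketch does not attempt.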

23


Speakers

60 first-year students at the beginning, 30M + 30F.
- The time intervals were designed so that no recording day falls within the summer or winter vacations.

Speakers were born between 1989 and 1993, with the majority born in 1990.

Speakers come from various departments, including computer science, biology, English, humanities, and journalism.

All of them speak standard Chinese well.

24


25

Current Database

Recording progress
- 9 recording sessions have been completed so far.

Database evaluation
- Based on speech from the first 7 recording sessions

26

Objective

To characterize a speaker using speaker-specific features that remain largely unchanged over time

Then where to start?
- Acoustics: frequency, spectrum, formant, pitch, …
- Statistics: mean, variance, …

A straightforward method

Model update method: 35 speakers, recorded March 2004 to May 2005 [Shan et al. 2005]

            Baseline    Using model update method
Accuracy    69.02%      74.19%

28

Prof. Dang’s work [Dang 2010]

Observations about frequencies
- Based on the NTT-VR corpus
- Speaker relevancy measured using Fisher’s F-Ratio [Wolf 1971]

Speaker-relevant frequencies are almost invariant across the 5 speech sessions.
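As a reminder of what the measurement does, the sketch below computes an F-ratio for a single feature in its usual textbook form: the variance of the per-speaker means divided by the average within-speaker variance. It is not Prof. Dang's implementation. The function name `f_ratio`, the input layout, and the toy band energies are assumptions made for illustration.

```python
from statistics import mean, pvariance


def f_ratio(values_by_speaker):
    """F-ratio of one feature, e.g. the energy in one frequency band.

    values_by_speaker maps a speaker ID to a list of feature values.
    A larger ratio means the feature separates speakers better relative
    to how much it varies within each speaker.
    """
    speaker_means = [mean(v) for v in values_by_speaker.values()]
    between = pvariance(speaker_means)
    within = mean(pvariance(v) for v in values_by_speaker.values())
    return between / within


# Toy band energies for two speakers, invented for illustration.
band_energy = {"spk1": [1.0, 3.0], "spk2": [5.0, 7.0]}
print(f_ratio(band_energy))
```

Computing this ratio for every frequency band, session by session, gives the kind of per-band speaker relevance profile that can be compared across sessions.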

29

Prof. Dang’s work (cont’d)

Algorithm design

Enhance the information around speaker-relevant frequency regions.
- Design a non-uniform frequency warping algorithm that emphasizes the speaker-relevant frequency regions, instead of the traditional Mel frequency warping method.

30

Prof. Dang’s work (cont’d)

Results

MFCC: Mel frequency warping
UFCC: Uniform frequency processing
NUFCC: Proposed non-uniform frequency processing

31

Are there any other useful Features?

Do pitch or formants have anything to do with time?

“Pengyou” (“friend” in Chinese) from 3 different sessions with an interval of one week (drawn by Praat).

No evident difference can be seen in the figure. Statistically, however, there may be valuable information in the mean, variance, or higher-order parameters.

Pitch: red; formant: blue
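The statistical view suggested above could start from something like the sketch below, which summarizes the pitch (F0) values of one session by mean, variance, skewness, and excess kurtosis. The function name `contour_stats` and the sample values are hypothetical, and extracting F0 values for voiced frames (for example with Praat) is assumed to have happened beforehand.

```python
from statistics import mean, pvariance


def contour_stats(f0_hz):
    """Summary statistics for a list of voiced-frame F0 values in Hz.

    Assumes at least two distinct values, since a constant contour has
    zero variance and the standardized moments are then undefined.
    """
    m = mean(f0_hz)
    var = pvariance(f0_hz, m)
    sd = var ** 0.5
    skew = mean(((x - m) / sd) ** 3 for x in f0_hz)
    kurt = mean(((x - m) / sd) ** 4 for x in f0_hz) - 3.0
    return {"mean": m, "variance": var, "skewness": skew, "kurtosis": kurt}


# Toy F0 values for one utterance, invented for illustration.
session_f0 = [100.0, 110.0, 120.0]
print(contour_stats(session_f0))
```

Comparing these summaries for the same prompt across sessions is one simple way to look for drift that is not visible in the plotted contours.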

32


33

Summary

A longitudinal, specialized time-varying voiceprint database is the basis for research on the widely acknowledged time-varying issue.

Finding stable speaker-specific acoustic features is the core of all speaker recognition research, and time-varying research is no exception.

34

References

J. Bonastre, F. Bimbot, L. Boe, et al., “Person Authentication by Voice: A Need for Caution”, Proc. of Eurospeech 2003, pp. 33-36, Geneva, 2003.

J. Dang, “Extraction and Application of Speaker’s Individuals”, A lecture of 2010 workshop at Tsinghua University, Beijing, 2010.

S. Furui, “Recent Advances in Speaker Recognition”, Pattern Recognition Letters, Vol. 18, Iss. 9, pp. 859-872, September 1997.

M. Hebert, “Text-Dependent Speaker Recognition”, Springer Handbook of Speech Processing, Springer-Verlag: Berlin, 2008.

T. Kato and T. Shimizu, “Improved Speaker Verification over the Cellular Phone Network Using Phoneme-Balanced and Digit-Sequence Preserving Connected Digit Patterns”, Proc. of ICASSP 2003, Hong Kong, 2003.

L. G. Kersta, “Voiceprint Recognition”, Nature, No. 4861, pp. 1253-1257, December 1962.

A. D. Lawson, A. R. Stauffer, E. J. Cupples, et al., “The Multi-Session Audio Research Project (MARP) Corpus: Goals, Design and Initial Findings”, Proc. of Interspeech 2009, pp. 1811-1814, Brighton, 2009.

A. D. Lawson, A. R. Stauffer, E. J. Cupples, et al., “Long Term Examination of Intra-Session and Inter-Session Speaker Variability”, Proc. of Interspeech 2009, pp. 2899-2902, Brighton, 2009.

Z. Shan, Y. Yang, and C. Wu, “A Voiceprint Access Control System”, Proc. of NCMMSC 2005, Beijing, 2005 (in Chinese).

F. Soong, A. E. Rosenberg, L. R. Rabiner, et al., “A Vector Quantization Approach to Speaker Recognition”, Proc. of ICASSP 1985, Vol. 10, pp. 387-390, Florida, 1985.

J. J. Wolf, “Efficient Acoustic Parameters for Speaker Recognition”, JASA, Vol. 51, No. 6, pp. 2044-2055, 1971.

35

Thanks!
