Center for Speech and Language Technologies, Tsinghua University A Time-Varying Voiceprint Database Workshop on Phonetics and Speech Technology 19 Nov 2010, Tata Institute of Fundamental Research, Mumbai, India Thomas Fang Zheng, and Linlin Wang


Center for Speech and Language Technologies, Tsinghua University

A Time-Varying Voiceprint Database

Workshop on Phonetics and Speech Technology

19 Nov 2010, Tata Institute of Fundamental Research, Mumbai, India

Thomas Fang Zheng, and Linlin Wang


Outline

Necessity of database creation

Criteria of database design

Observations and preliminary experiments

Summary


Challenges in Application of Voiceprint Recognition

- NAB (National Australia Bank): in 2009, applied VPR to telephone banking in collaboration with Telstra and VeCommerce (Nuance’s Verifier engine).

- ABN Amro (Algemene Bank Nederland N.V.): in 2006, used VoiceVault for verification in collaboration with VoiceVault Co.; VPR in combination with ASR of registered secret questions.

- China Construction Bank: in 2010, telephone banking; first application in China.


Challenges in Application of Voiceprint Recognition (cont’d)

ASR/VPR engines to support Call Center Integration (CCI), artificial operators and background management systems.

[Figure: architecture diagram. Labeled components: CCI (IVR system) front ends; speech recognition engine API interface and speech recognition engine kernel; voiceprint recognition engine API interface and voiceprint recognition engine kernel; database read/write (voiceprint); human agent seats with speech/voiceprint recognition engine API/tools.]


Challenges in Application of Voiceprint Recognition (cont’d)

Voiceprint Recognition

Noise

Short-utterance

Cross-channel

Emotion

Time-varying

Multi-speaker


About voiceprint

Pioneer researchers created the word Voiceprint to stand for the identifiable uniqueness in each voice [Kersta 1962].

At the same time, they raised several questions, such as:

Does the voice of an adult change over time? If so, how? [Kersta 1962]


The Time-Varying Challenge

Open Questions [Furui 1997]

How do we deal with long-term variability in people’s voice?

How should we update the speaker models to cope with the gradual changes in people’s voices?

Is there any systematic long-term variation?

How do we adequately retrain the models?

...


The Time-Varying Challenge (cont’d)

A need for caution [Bonastre et al. 2003]

…at the present time, there is no specific process that enables one to uniquely characterize a person’s voice or to identify with absolute certainty an individual from his or her voice…

Voice changes over time, either in the short-term (at different times of the day), the medium term (times of the year), or in the long-term (with age).

Long-term non-contemporary samples represent a challenge.


The Time-Varying Challenge (cont’d)

Performance degradation observed by Soong et al.

Database: period of 2 months, 100 talkers (50M + 50F), digit vocabulary, 5 different recording sessions.

This result clearly indicates the need to update the VQ codebook from time to time. [Soong et al. 1985]

Identification Error (%): 1st and 2nd sessions used for VQ codebook training, the other three sessions for testing

             3rd session   4th session   5th session
1 digit          17            25            28
2 digits          5            10            14
4 digits          2             4             8
10 digits         0             1             4
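The identification scheme behind these numbers can be sketched roughly as follows: each speaker gets a VQ codebook trained on enrollment feature vectors, and a test utterance is assigned to the speaker whose codebook yields the lowest average quantization distortion. This is only an illustration under stated assumptions, not the setup of [Soong et al. 1985]: the features are synthetic 2-D points, the codebook is trained with plain k-means, and the names `train_codebook`, `avg_distortion` and `identify` are hypothetical.

```python
import numpy as np

def train_codebook(features, size=4, iters=20, seed=0):
    """Train a VQ codebook with plain k-means (an assumed training method)."""
    rng = np.random.default_rng(seed)
    # Start from randomly chosen training vectors.
    codebook = features[rng.choice(len(features), size, replace=False)].copy()
    for _ in range(iters):
        # Assign each vector to its nearest codeword.
        dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        for k in range(size):
            members = features[nearest == k]
            if len(members) > 0:  # keep the old codeword if a cell is empty
                codebook[k] = members.mean(axis=0)
    return codebook

def avg_distortion(features, codebook):
    """Average distance from each vector to its nearest codeword."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).mean()

def identify(features, codebooks):
    """Return the speaker whose codebook gives the lowest average distortion."""
    return min(codebooks, key=lambda spk: avg_distortion(features, codebooks[spk]))

# Synthetic demo: two hypothetical speakers with well-separated feature clouds.
rng = np.random.default_rng(1)
centers = {"spk_a": np.array([0.0, 0.0]), "spk_b": np.array([5.0, 5.0])}
codebooks = {spk: train_codebook(c + rng.normal(size=(200, 2)))
             for spk, c in centers.items()}
test_utterance = centers["spk_b"] + rng.normal(size=(50, 2))
predicted = identify(test_utterance, codebooks)
print(predicted)
```

In such a scheme, the codebook is fixed at enrollment time, so any drift in a speaker's features between sessions raises the distortion against that speaker's own codebook.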


The Time-Varying Challenge (cont’d)

Performance degradation observed by Kato and Shimizu

A significant loss in accuracy (4-5% in EER) between two sessions separated by 2 months. [Kato & Shimizu 2003]

Aging was considered to be the cause [Hebert 2008].


The Time-Varying Challenge (cont’d)

Performance degradation observed at CSLT, THU

2 recording sessions with an interval of 10 months: July 2008 (Dataset A) and May 2009 (Dataset B)

13 speakers and 2 speaking styles (reading and spontaneous)

EER (%)

              Trained on Dataset A          Trained on Dataset B
              Verify on A   Verify on B     Verify on A   Verify on B
Reading           2.47          9.32            9.43          2.08
Spontaneous       9.39         14.73           14.53          6.13
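For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. A minimal sketch of how it can be approximated from verification scores is given below. The scores are made up for illustration, the function name `equal_error_rate` is hypothetical, and higher scores are assumed to mean "more likely the claimed speaker"; the slides do not say how the EER values above were computed.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumes a higher score means "more likely the claimed speaker".
    """
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    best_gap, eer = np.inf, None
    for t in np.sort(np.concatenate([genuine, impostor])):
        far = np.mean(impostor >= t)  # false acceptance rate at threshold t
        frr = np.mean(genuine < t)    # false rejection rate at threshold t
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Made-up scores: one genuine trial scores below one impostor trial.
genuine_scores = [0.9, 0.8, 0.7, 0.35]
impostor_scores = [0.1, 0.2, 0.3, 0.4]
eer = equal_error_rate(genuine_scores, impostor_scores)
print(f"EER = {100 * eer:.1f}%")
```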


The Time-Varying Issue in Voiceprint Recognition

A generally acknowledged phenomenon.

It is widely assumed that speaker models should be updated from time to time to maintain representativeness; this is effective but not practical.

Few researchers have worked out exactly what causes this phenomenon.

No doubt a proper longitudinal voiceprint database is essential for this study.

The MARP corpus has been the only one published so far. [Lawson et al. 2009]


The MARP Corpus

Description

Multi-session Audio Research Project by University of Texas at Dallas

This corpus includes 21 sessions of free-flowing conversations over a 3-year period of time for each speaker.

Each conversation was approximately 10 minutes in length with an isolated partner who remained constant throughout the 3 years, and speakers were given suggested conversation topics.


The MARP Corpus (cont’d)

Initial findings

The following study of the aging effect largely focused on 32 speakers and 672 sessions from June 2005 to March 2008. [Lawson et al. 2009]

While the impact on speaker recognition accuracy between any two sessions was considerable, the long-term trend was statistically quite small. Reasonable or not? Which causes this more: context or time?

The detrimental impact was clearly not a function of aging or of the voice changing within this timeframe.


The MARP Corpus (cont’d)

Is it more suitable for research?

In free-flowing conversations, speech contents are not fixed and a speaker’s emotion, speaking style, or engagement can be easily influenced by his/her partner.

Perhaps the aging effect is weaker than these more evident factors and is therefore masked by them.


Other Relevant Databases

NTT-VR speaker recognition data corpus

35 speakers, with 22 males and 13 females

Recorded in 5 sessions with a span of 10 months (1990-1991)

Normal, slow, and fast speaking speed

Limits: only 5 sessions


Demand for a new database

Creation of a longitudinal voiceprint database which specially focuses on the time-varying effect in speaker recognition is imperative for both research and practical applications.

Factors other than time should be kept as constant as possible: prompt texts, recording equipment, software, conditions, environment, and so on.

Different lengths of time intervals can be adopted to analyze gradual impact.


Outline

Necessity of database creation

Criteria of database design

Observations and preliminary experiments

Summary


Database Design

A general design principle

The time-varying effect is the only focus, and other factors should be kept as constant as possible throughout all recording sessions.

Two concrete design principles

Fixed prompt texts, to make sure content variation does not disturb the research

Gradient time intervals, to allow flexible research on different lengths of time interval

Speakers recruited


Database Design (cont’d)

Fixed prompt texts

Speakers were requested to read fixed prompt texts rather than hold free-style conversations.

Free-style speech is closer to the actual situation, yet can be studied later.

Prompt texts were designed to remain unchanged throughout all recording sessions.

To avoid or at least reduce the impact of speech content on speaker recognition accuracy.

In the form of sentences and isolated words.


Database Design (cont’d)

Gradient time intervals

Initial sessions can be of shorter time intervals, while following sessions are of longer and longer time intervals.

Five different time intervals are used: one week, one month, two months, four months and half a year, as illustrated in the figure below (16 sessions).

[Figure: the recording sessions marked along a time axis covering 2010, 2011 and 2012.]
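A gradient schedule of this kind can be generated mechanically. The sketch below is an illustration only: the slide gives the five interval lengths and the total of 16 sessions, but not how many gaps of each length were used or the date of the first session, so the split in `hypothetical_gaps`, the start date, the day counts in `INTERVAL_DAYS` and the function name `build_schedule` are all assumptions. Avoiding vacation periods is not modeled.

```python
from datetime import date, timedelta

# Approximate lengths in days for the five interval types named on the slide.
INTERVAL_DAYS = {"week": 7, "month": 30, "two_months": 60,
                 "four_months": 120, "half_year": 180}

def build_schedule(first_session, gap_counts):
    """Return session dates given (interval_name, repeat_count) pairs in order."""
    sessions = [first_session]
    for name, count in gap_counts:
        for _ in range(count):
            sessions.append(sessions[-1] + timedelta(days=INTERVAL_DAYS[name]))
    return sessions

# HYPOTHETICAL split of the 15 gaps between 16 sessions; the source gives only
# the five interval lengths and the total number of sessions.
hypothetical_gaps = [("week", 4), ("month", 4), ("two_months", 3),
                     ("four_months", 2), ("half_year", 2)]
schedule = build_schedule(date(2010, 3, 1), hypothetical_gaps)  # start date assumed
for i, day in enumerate(schedule, start=1):
    print(f"session {i:2d}: {day.isoformat()}")
```

Because the gaps never shrink, any pair of sessions can be chosen to study a particular elapsed time, from one week up to the full span of the schedule.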


Database Design (cont’d)

Speakers

60 first-year students at the beginning, 30M + 30F.

The time intervals are designed so that no recording day falls within the summer or winter vacation.

Speakers born between 1989 and 1993, with a majority in 1990.

Speakers from various departments, including computer science, biology, English, humanities, and journalism.

All of them speak standard Chinese well.


Outline

Necessity of database creation

Criteria of database design

Observations and preliminary experiments

Summary


Current Database

Recording progress

9 recording sessions have been completed so far.

Database evaluation

Based on speech from the first 7 recording sessions


Objective

To characterize a speaker using speaker-specific features that remain largely unchanged over time

Then where to start?

Acoustics: frequency, spectrum, formant, pitch, …

Statistics: mean, variance, …


A straightforward method

Model update method

35 speakers, recorded March 2004 to May 2005 [Shan et al. 2005]

             Baseline    Using model update method
Accuracy     69.02%      74.19%


Prof. Dang’s work [Dang 2010]

Observations about frequencies

Based on the NTT-VR corpus

Speaker relevancy measurement using Fisher’s F-Ratio [Wolf 1971]

Speaker relevant frequencies are almost invariant for the 5 speech sessions
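As a reminder of what the measure captures, the F-ratio of a feature is the variance of the speaker means (between-speaker variance) divided by the average within-speaker variance, so a large value marks a feature that separates speakers well. The sketch below computes it per frequency band on synthetic data. The input layout (one row of band energies per frame), the function name `f_ratio` and the demo values are assumptions made for illustration and are not taken from [Dang 2010].

```python
import numpy as np

def f_ratio(band_energies, speaker_ids):
    """Fisher's F-ratio per frequency band.

    band_energies: array of shape (n_frames, n_bands).
    speaker_ids:   array of shape (n_frames,), one label per frame.
    Returns the between-speaker variance of the band means divided by the
    average within-speaker variance, one value per band.
    """
    band_energies = np.asarray(band_energies, dtype=float)
    speaker_ids = np.asarray(speaker_ids)
    speakers = np.unique(speaker_ids)
    means = np.stack([band_energies[speaker_ids == s].mean(axis=0) for s in speakers])
    within = np.stack([band_energies[speaker_ids == s].var(axis=0) for s in speakers])
    return means.var(axis=0) / within.mean(axis=0)

# Synthetic demo: band 0 differs between speakers, band 1 does not.
rng = np.random.default_rng(0)
frames, labels = [], []
for speaker, offset in enumerate([0.0, 3.0, 6.0]):
    x = rng.normal(size=(100, 2))
    x[:, 0] += offset  # speaker-dependent shift in band 0 only
    frames.append(x)
    labels += [speaker] * 100
ratios = f_ratio(np.vstack(frames), labels)
print(ratios)
```

Computing the ratio separately for each recording session and comparing the resulting curves is one way to check whether the speaker-relevant bands stay in place over time.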


Prof. Dang’s work (cont’d)

Algorithm design

Enhancing the information around speaker-relevant frequency regions

To design a non-uniform frequency warping algorithm which emphasizes the speaker-relevant frequency regions, instead of the traditional Mel frequency warping method
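The slide does not spell out the warping algorithm, so the following is only one plausible construction, not the method of [Dang 2010]: filter center frequencies are placed by inverting the cumulative sum of a relevance curve, so that regions with high relevance receive more densely spaced centers. The relevance curve, its peak at 3000 Hz and the function name `nonuniform_centers` are all hypothetical.

```python
import numpy as np

def nonuniform_centers(freqs, relevance, n_centers):
    """Place band centers so that each covers an equal share of total relevance.

    freqs:     increasing frequency grid in Hz.
    relevance: strictly positive relevance value at each grid frequency.
    """
    relevance = np.asarray(relevance, dtype=float)
    cdf = np.cumsum(relevance)
    cdf = cdf / cdf[-1]  # normalized cumulative relevance, increasing to 1
    targets = (np.arange(n_centers) + 0.5) / n_centers
    # Invert the cumulative curve: the frequency at which each target share is reached.
    return np.interp(targets, cdf, freqs)

# Hypothetical relevance curve with a single peak around 3000 Hz.
freqs = np.linspace(0.0, 8000.0, 81)
relevance = 1.0 + 4.0 * np.exp(-((freqs - 3000.0) / 500.0) ** 2)
centers = nonuniform_centers(freqs, relevance, n_centers=20)
in_peak = np.sum((centers >= 2000) & (centers <= 4000))
off_peak = np.sum((centers >= 5000) & (centers <= 7000))
print(in_peak, off_peak)
```

With a flat relevance curve the same function returns evenly spaced centers, which corresponds to uniform frequency processing.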


Prof. Dang’s work (cont’d)

Results

MFCC: Mel frequency warping

UFCC: Uniform frequency processing

NUFCC: Proposed non-uniform frequency processing


Are there any other useful Features?

Does pitch or formant have anything to do with time?

“Pengyou” (friend in Chinese) from 3 different sessions with an interval of one week (drawn by Praat).

There is no evident difference seen from the figure. However, from the statistical aspect, there may be some valuable information concerning the mean, variance, or higher-order parameters.

[Figure legend: pitch in red, formants in blue.]
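The statistical comparison suggested above could start from something as simple as per-session summary statistics of the pitch contour. The sketch below computes the mean, variance, skewness and excess kurtosis over voiced frames. The contour values are made up, the function name `contour_stats` is hypothetical, and treating frames with a value of 0 as unvoiced is an assumed convention rather than a property of the recordings.

```python
import numpy as np

def contour_stats(f0):
    """Mean, variance, skewness and excess kurtosis of the voiced frames.

    Frames with f0 <= 0 are treated as unvoiced and skipped (assumed convention).
    """
    f0 = np.asarray(f0, dtype=float)
    voiced = f0[f0 > 0]
    mean = voiced.mean()
    var = voiced.var()
    z = (voiced - mean) / np.sqrt(var)  # standardized values
    return {"mean": mean, "variance": var,
            "skewness": np.mean(z ** 3), "kurtosis": np.mean(z ** 4) - 3.0}

# Made-up pitch contours (Hz) for the same word in three sessions.
sessions = {
    "session_1": [0, 210, 220, 235, 228, 0],
    "session_2": [0, 205, 218, 240, 230, 0],
    "session_3": [0, 212, 224, 233, 226, 0],
}
stats = {name: contour_stats(f0) for name, f0 in sessions.items()}
for name, s in stats.items():
    print(name, {k: round(float(v), 2) for k, v in s.items()})
```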


Outline

Necessity of database creation

Criteria of database design

Observations and preliminary experiments

Summary


Summary

A longitudinal and specialized time-varying voiceprint database is the basis of research on the widely acknowledged time-varying issue.

Finding stable speaker-specific acoustic features is the core of all speaker recognition research, and time-varying research is no exception.


References

J. Bonastre, F. Bimbot, L. Boe, et al., “Person Authentication by Voice: A Need for Caution”, Proc. of Eurospeech 2003, pp. 33-36, Geneva, 2003.

J. Dang, “Extraction and Application of Speaker’s Individuals”, A lecture of 2010 workshop at Tsinghua University, Beijing, 2010.

S. Furui, “Recent Advances in Speaker Recognition”, Pattern Recognition Letters, Vol. 18, Iss. 9, pp. 859-872, September 1997.

M. Hebert, “Text-Dependent Speaker Recognition”, Springer Handbook of Speech Processing, Springer-Verlag: Berlin, 2008.

T. Kato and T. Shimizu, “Improved Speaker Verification over the Cellular Phone Network Using Phoneme-Balanced and Digit-Sequence Preserving Connected Digit Patterns”, Proc. of ICASSP 2003, Hong Kong, 2003.

L. G. Kersta, “Voiceprint Recognition”, Nature, No. 4861, pp. 1253-1257, December 1962.

A. D. Lawson, A. R. Stauffer, E. J. Cupples, et al., “The Multi-Session Audio Research Project (MARP) Corpus: Goals, Design and Initial Findings”, Proc. of Interspeech 2009, pp. 1811-1814, Brighton, 2009.

A. D. Lawson, A. R. Stauffer, E. J. Cupples, et al., “Long Term Examination of Intra-Session and Inter-Session Speaker Variability”, Proc. of Interspeech 2009, pp. 2899-2902, Brighton, 2009.

Z. Shan, Y. Yang, and C. Wu, “A Voiceprint Access Control System”, Proc. of NCMMSC 2005, Beijing, 2005 (in Chinese).

F. Soong, A. E. Rosenberg, L. R. Rabiner, et al., “A Vector Quantization Approach to Speaker Recognition”, Proc. of ICASSP 1985, Vol. 10, pp. 387-390, Florida, 1985.

J. J. Wolf, “Efficient Acoustic Parameters for Speaker Recognition”, JASA, Vol. 51, No. 6, pp. 2044-2055, 1971.


Thanks!