Speech course of 9 March 2005. Instructors: Dr. Dijana Petrovska-Delacrétaz and Gérard Chollet. Speaker Recognition

1

Speech course of 9 March 2005

Instructors: Dr. Dijana Petrovska-Delacrétaz and Gérard Chollet

Speaker Recognition

1. Introduction, history, application domains

2. Cues to identity in speech

3. Speaker verification
  - Decision theory
  - Text-dependent / text-independent

4. Voice imposture

5. Audio-visual identity verification

6. Evaluations

7. Conclusions

2

Why should a computer recognize who is speaking?

• Protection of individual property (habitation, bank account, personal data, messages, mobile phone, PDA, ...)

• Limited access (secured areas, databases)

• Personalization (respond only to its master’s voice)

• Locate a particular person in an audio-visual document (information retrieval)

• Who is speaking in a meeting?

• Is a suspect the criminal? (forensic applications)

3

Tasks in Automatic Speaker Recognition

• Speaker verification (voice biometrics): Are you really who you claim to be?

• Identification (speaker ID): Is this speech segment coming from a known speaker? How large is the set of speakers (population of the world)?

• Speaker detection, segmentation, indexing, retrieval, tracking: looking for recordings of a particular speaker

• Combining speech and speaker recognition: adaptation to a new speaker, speaker typology, personalization in dialogue systems

4

Applications

• Access control: physical facilities, computer networks, websites

• Transaction authentication: telephone banking, e-commerce

• Speech data management: voice messaging, search engines

• Law enforcement: forensics, home incarceration

5

Voice Biometric

• Advantages
  - Often the only modality available over the telephone
  - Low cost (microphone, A/D converter), ubiquity
  - Possible integration on a smart (SIM) card
  - Natural bimodal fusion: speaking face

• Disadvantages
  - Lack of discretion
  - Possibility of imitation and electronic imposture
  - Lack of robustness to noise, distortion, …
  - Temporal drift

6

Speaker Identity in Speech

• Differences in
  - Vocal tract shapes and muscular control
  - Fundamental frequency (typical values): 100 Hz (male), 200 Hz (female), 300 Hz (child)
  - Glottal waveform
  - Phonotactics
  - Lexical usage

• The difference between the voices of twins is a limiting case

• Voices can also be imitated or disguised

7

Speaker Identity

[Figure: spectral envelope of /i:/ (amplitude A versus frequency f) for Speaker A and Speaker B]

• Segmental factors (~30 ms)
  - Glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness)
  - Vocal tract: characterized by its transfer function and represented by MFCCs (Mel-frequency cepstral coefficients)

• Suprasegmental factors
  - Speaking speed (timing and rhythm of speech units)
  - Intonation patterns
  - Dialect, accent, pronunciation habits

8

What are the sources of difficulty?

• Intra-speaker variability of the speech signal (due to stress, pathologies, environmental conditions,…)

• Recording conditions (filtering, noise,…)

• Channel mismatch between enrolment and testing

• Temporal drift

• Intentional imposture

• Voice disguise

9

Acoustic features

• Short term spectral analysis
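A minimal sketch of short-term spectral analysis: the signal is cut into overlapping frames, each frame is multiplied by a window, and a magnitude spectrum is computed per frame. The 30 ms frame length follows the segmental time scale quoted earlier in these slides; the 10 ms hop, the Hamming window, and the function names are assumptions made for illustration. The naive DFT keeps the example dependency-free; a real front-end would use an FFT and go on to compute cepstral coefficients.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0 .. n/2, computed with a naive O(n^2) transform."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * m / n) for m, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def short_term_spectra(signal, sample_rate, frame_ms=30.0, hop_ms=10.0):
    """Cut the signal into overlapping windowed frames; return one spectrum per frame."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    window = hamming(frame_len)
    spectra = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = [s * w for s, w in zip(signal[start:start + frame_len], window)]
        spectra.append(magnitude_spectrum(frame))
    return spectra


# Synthetic input: 0.1 s of a 200 Hz tone sampled at 8 kHz (illustrative values).
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 200.0 * n / SAMPLE_RATE) for n in range(800)]
spectra = short_term_spectra(tone, SAMPLE_RATE)
```

With these values each frame holds 240 samples, so the bin spacing is 8000 / 240 ≈ 33.3 Hz and the 200 Hz tone falls on bin 6.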

10

Intra- and Inter-speaker variability

11

Speaker Verification

Typology of approaches (EAGLES Handbook)

• Text dependent
  - Public password
  - Private password
  - Customized password
  - Text prompted

• Text independent

• Incremental enrolment

• Evaluation

12

History of Speaker Recognition

13

Current approaches

14

Dynamic Time Warping (DTW)

$D(X, Y) = \sum_{\text{best path}} d^2(x_i, y_j)$

[Figure: the test utterance “Bonjour” (test speaker, Y) is aligned along the best path with the reference “Bonjour” of speaker X, for each reference speaker 1 … n]

DODDINGTON 1974, ROSENBERG 1976, FURUI 1981, etc.
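A minimal sketch of DTW matching, assuming each utterance is a list of feature vectors and that the local distance is the squared Euclidean distance. The three allowed steps (horizontal, vertical, diagonal) and the function names are illustrative choices, not taken from the cited systems.

```python
import math


def sq_dist(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def dtw_distance(X, Y):
    """Accumulated squared distance along the best warping path between X and Y."""
    n, m = len(X), len(Y)
    # acc[i][j] = cost of the best path aligning X[:i] with Y[:j]
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = sq_dist(X[i - 1], Y[j - 1]) + best_prev
    return acc[n][m]


# Toy one-dimensional "utterances" (illustrative data).
reference = [[0.0], [1.0], [2.0]]
slower_copy = [[0.0], [0.0], [1.0], [2.0], [2.0]]
shifted = [[1.0], [2.0], [3.0]]
same = dtw_distance(reference, slower_copy)
different = dtw_distance(reference, shifted)
```

The time-stretched copy matches the reference with zero accumulated distance, which is the point of warping the time axis. In a text-dependent system the claimed speaker's reference template would play the role of `reference`.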

15

Vector Quantization (VQ)

$D(X, Y) = \sum_{\text{best quant.}} d^2(C_i^X, y_j)$

[Figure: the test vectors are quantized with the codebook (dictionary) of speaker X, for each of the speaker codebooks 1, 2, … n]

SOONG, ROSENBERG 1987
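A minimal sketch of VQ-based scoring, assuming each speaker is represented by a small codebook of centroid vectors and that each test vector is charged the squared distance to its nearest codeword. Codebook training (for example with k-means) is not shown; the codebooks and names below are toy assumptions.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def vq_distortion(codebook, Y):
    """Sum over test vectors of the squared distance to the nearest codeword."""
    return sum(min(sq_dist(c, y) for c in codebook) for y in Y)


def identify(codebooks, Y):
    """Return the name of the speaker whose codebook quantizes Y best."""
    return min(codebooks, key=lambda name: vq_distortion(codebooks[name], Y))


# Toy two-dimensional codebooks (illustrative data).
codebooks = {
    "speaker_1": [[0.0, 0.0], [1.0, 1.0]],
    "speaker_2": [[5.0, 5.0], [6.0, 6.0]],
}
test_vectors = [[0.1, 0.0], [0.9, 1.2]]
distortion_1 = vq_distortion(codebooks["speaker_1"], test_vectors)
best = identify(codebooks, test_vectors)
```

Unlike DTW, the score ignores the order of the test vectors, which is why this approach suits text-independent recognition.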

16

Hidden Markov Models (HMM)

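$D(X, Y) = \sum_{\text{best path}} \log P(y_j \mid S_i^X)$

[Figure: the test utterance is aligned along the best path with the states of the “Bonjour” HMM of speaker X, for each speaker up to n]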

ROSENBERG 1990, TSENG 1992
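A minimal sketch of a best-path (Viterbi) log score, assuming one-dimensional Gaussian emissions per state and log-domain initial and transition probabilities. Unlike the sum of emission terms written on the slide, this sketch also adds the transition log-probabilities along the path, which is the usual Viterbi formulation. All parameter values and names are illustrative assumptions.

```python
import math

NEG_INF = -math.inf


def log_gauss(y, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)


def viterbi_log_score(Y, log_init, log_trans, means, variances):
    """Log-probability of the best state path for the observation sequence Y."""
    n_states = len(means)
    delta = [log_init[s] + log_gauss(Y[0], means[s], variances[s])
             for s in range(n_states)]
    for y in Y[1:]:
        delta = [
            max(delta[p] + log_trans[p][s] for p in range(n_states))
            + log_gauss(y, means[s], variances[s])
            for s in range(n_states)
        ]
    return max(delta)


# Two-state left-to-right topology shared by both toy speaker models.
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
observations = [0.1, -0.2, 4.9, 5.3]
score_a = viterbi_log_score(observations, log_init, log_trans, [0.0, 5.0], [1.0, 1.0])
score_b = viterbi_log_score(observations, log_init, log_trans, [10.0, 15.0], [1.0, 1.0])
```

The observations sit near the state means of the first model, so `score_a` is higher than `score_b`.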

17

Ergodic HMM

$D(X, Y) = \sum_{\text{best path}} \log P(y_j \mid S_i^X)$

[Figure: the test utterance is scored against the ergodic HMM of speaker X, for each of the speaker HMMs 1, 2, … n]

PORITZ 1982, SAVIC 1990

18

Gaussian Mixture Models (GMM)

REYNOLDS 1995

19

HMM structure depends on the application

20

Some issues in Text-dependent Speaker Verification Systems :

The CAVE and PICASSO projects

• Sequences of digits
  - Speaker-independent HMM of each digit
  - Adaptation of these HMMs to the client voice (during enrolment and incremental enrolment)
  - EER of less than 1 % can be achieved

• Customized password
  - The client chooses his password using some feedback from the system

• Deliberate imposture

21

Gaussian Mixture Model

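• Parametric representation of the probability distribution of observations; in its standard form, a weighted sum of M Gaussian densities: $p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i)$, with $\sum_{i=1}^{M} w_i = 1$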
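A minimal sketch of evaluating a Gaussian mixture density, that is, a weighted sum of Gaussian components. Diagonal covariance matrices are assumed here, a common simplification that the slide does not specify. The log-sum-exp step avoids numerical underflow; all names and parameter values are illustrative.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x) under the mixture, computed with log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def average_log_likelihood(X, weights, means, variances):
    """Mean per-frame log-likelihood of a sequence of feature vectors."""
    return sum(gmm_log_likelihood(x, weights, means, variances) for x in X) / len(X)


# Two-component toy mixture in one dimension (illustrative values).
weights = [0.5, 0.5]
means = [[0.0], [4.0]]
variances = [[1.0], [1.0]]
near = gmm_log_likelihood([0.2], weights, means, variances)
far = gmm_log_likelihood([10.0], weights, means, variances)
```

A point close to one of the component means (`near`) gets a much higher log-likelihood than a point far from both (`far`).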

22

Gaussian Mixture Models

8 Gaussians per mixture

23

GMM speaker modeling

[Diagram: front-end → GMM modeling → world GMM model; front-end → GMM model adaptation → target GMM model]

24

Baseline GMM method

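[Diagram: test speech → front-end → hypothesized target GMM model and world GMM model; the two likelihoods are combined into the LLR score]

$\text{LLR score} = \log \left[ \frac{P(x \mid \lambda_{\text{target}})}{P(x \mid \lambda_{\text{world}})} \right]$

where $\lambda_{\text{target}}$ and $\lambda_{\text{world}}$ denote the hypothesized target model and the world model.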
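A minimal sketch of the log-likelihood-ratio (LLR) score, assuming the per-frame log-likelihoods under the target and world models have already been computed. Averaging over frames, so that the score does not grow with utterance length, is an assumption of this sketch, as are the function names and the default threshold.

```python
def llr_score(target_loglik, world_loglik):
    """Average per-frame log-likelihood ratio between target and world models."""
    if not target_loglik or len(target_loglik) != len(world_loglik):
        raise ValueError("need two non-empty lists of equal length")
    diffs = [t - w for t, w in zip(target_loglik, world_loglik)]
    return sum(diffs) / len(diffs)


def verify(target_loglik, world_loglik, threshold=0.0):
    """Accept the claimed identity when the LLR score exceeds the threshold."""
    return llr_score(target_loglik, world_loglik) > threshold


# Per-frame log-likelihoods for a two-frame test utterance (illustrative values).
score = llr_score([-1.0, -2.0], [-2.0, -4.0])
accepted = verify([-1.0, -2.0], [-2.0, -4.0])
rejected = verify([-1.0, -2.0], [-2.0, -4.0], threshold=2.0)
```

Here the target model explains both frames better than the world model, so the score is positive; whether that is enough depends on the threshold chosen for the application.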

25

Decision theory for identity verification

• Two types of errors:
  - False rejection (a client is rejected)
  - False acceptance (an impostor is accepted)

• Decision theory: given an observation O and a claimed identity
  - H0 hypothesis: it comes from an impostor
  - H1 hypothesis: it comes from our client

• H1 is chosen if and only if P(H1|O) > P(H0|O), which can be rewritten (using Bayes' law, P(Hk|O) = P(O|Hk) P(Hk) / P(O)) as

$\frac{P(O \mid H_1)}{P(O \mid H_0)} > \frac{P(H_0)}{P(H_1)}$
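A minimal sketch of this decision rule in the log domain: the likelihood ratio between the client and impostor hypotheses is compared with a threshold equal to the ratio of the priors. The function names and numeric values are illustrative assumptions; error costs are ignored here.

```python
import math


def bayes_log_threshold(prior_client):
    """Log of P(H0) / P(H1), where H1 is the client hypothesis."""
    return math.log((1.0 - prior_client) / prior_client)


def accept(log_lik_client, log_lik_impostor, prior_client):
    """Choose H1 (client) when the log-likelihood ratio exceeds the threshold."""
    log_ratio = log_lik_client - log_lik_impostor
    return log_ratio > bayes_log_threshold(prior_client)


# Same evidence, two different priors (illustrative values).
with_equal_priors = accept(-10.0, -12.0, prior_client=0.5)
with_rare_clients = accept(-10.0, -12.0, prior_client=0.01)
```

With equal priors the threshold is 0 and a log ratio of 2 is enough to accept; when genuine clients are assumed to be rare (1 %), the threshold rises to log 99 ≈ 4.6 and the same evidence leads to rejection.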

26

Signal detection theory

27

Decision

28

Distribution of scores

29

Detection Error Tradeoff (DET) Curve

30

Evaluation

• Decision cost (FA, FR, priors, costs,…)

• Receiver Operating Characteristic Curve

• Reference systems (open software)

• Evaluations (algorithms, field trials, ergonomics, …)
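A minimal sketch of the quantities behind these evaluation criteria: false acceptance (FA) and false rejection (FR) rates at a threshold, an approximate equal error rate (EER), and a decision cost that weights the two error rates by priors and costs. The scores, costs, and priors are toy assumptions, and the EER search only tries the observed scores as thresholds.

```python
def error_rates(client_scores, impostor_scores, threshold):
    """Return (FA rate, FR rate) when scores above the threshold are accepted."""
    fr = sum(s <= threshold for s in client_scores) / len(client_scores)
    fa = sum(s > threshold for s in impostor_scores) / len(impostor_scores)
    return fa, fr


def equal_error_rate(client_scores, impostor_scores):
    """Approximate EER: observed threshold where FA and FR are closest."""
    best = None
    for th in sorted(set(client_scores) | set(impostor_scores)):
        fa, fr = error_rates(client_scores, impostor_scores, th)
        gap = abs(fa - fr)
        if best is None or gap < best[0]:
            best = (gap, (fa + fr) / 2, th)
    return best[1], best[2]


def detection_cost(fa, fr, prior_client, cost_fa=1.0, cost_fr=1.0):
    """Expected cost of errors given the client prior and the two error costs."""
    return cost_fr * fr * prior_client + cost_fa * fa * (1.0 - prior_client)


# Toy verification scores (illustrative data).
clients = [2.0, 3.0, 1.5, 0.5]
impostors = [-1.0, 0.0, 1.0, -2.0]
fa, fr = error_rates(clients, impostors, threshold=0.75)
eer, eer_threshold = equal_error_rate(clients, impostors)
cost = detection_cost(fa, fr, prior_client=0.01, cost_fa=1.0, cost_fr=10.0)
```

Sweeping the threshold and plotting FR against FA gives the ROC curve; plotting the same pairs on normal-deviate axes gives the DET curve.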

31

NIST Speaker Verification Evaluations

• A reference standard to compare algorithms and stimulate new developments

• Distribution (via LDC) of development and test databases with:
  - Increasing difficulty (from land line to mobile)
  - Several hundred speakers (2 min of training data per client)
  - Several thousand test accesses (5 to 50 s per access)

• Participation of 15-20 labs every year (MIT, IBM, Nuance, Queensland Univ, ELISA consortium, …)

• Annual workshop, special issues in journals, …

32

National Institute of Standards & Technology (NIST) Speaker Verification Evaluations

• Annual evaluation since 1995

• Common paradigm for comparing technologies

33

Speaker Verification (text independent)

• The ELISA consortium: ENST, LIA, IRISA, ...
  http://www.lia.univ-avignon.fr/equipes/RAL/elisa/index_en.html

• BECARS: Balamand-ENST CEDRE Automatic Recognition of Speakers

• NIST evaluations
  http://www.nist.gov/speech/tests/spk/index.htm

34

NIST evaluations : Results

ENST 2003

35

Evaluations: NIST 2004

36

Combining Speech Recognition and Speaker Verification.

• Speaker independent phone HMMs

• Selection of segments or segment classes which are speaker specific

• Preliminary evaluations are performed on the NIST extended data set (one hour of training data per speaker)

37

ALISP: Automatic Language Independent Speech Processing

Data-driven speech segmentation

38

Searching in client and world speech dictionaries for speaker verification purposes

39

Fusion

40

Fusion results

41

Voice Transformations and Forgery (occasional, dedicated)

• Isolated individuals with few resources or “professional impostors” with a dedicated budget can threaten the security of speaker recognition systems

• Voice transformation technologies (e.g. segmental synthesis using an inventory of client speech data) are nowadays available

• Speaker recognition research should explicitly address this forgery issue and define appropriate countermeasures

Prevention by predicting many different forgery scenarios

42

Voice Forgery using ALISP

[Diagram: impostor speech → transformation → speech resembling the client; the impostor and the client may say the same words or not]

Transformation: a modification of a source speaker's speech to imitate a target speaker

43

Conversion system: ALISP encoder

[Block diagram. Input: speech. Processing blocks: MFCC analysis (MFCC + delta), HMM recognition (HMM models), HNM analysis (harmonic envelope, noise envelope), choice of the best representative unit (database of HNM representatives). Outputs: symbol index, representative index, DTW path, prosody (energy + pitch).]

44

Conversion system: ALISP Decoder

[Block diagram. Inputs: symbol index, representative index, DTW path, prosody (pitch, energy, timing). Processing blocks: concatenation of HNM parameters for each representative, HNM synthesis. Output: speech signal.]

45

Preliminary results: DET curves

• FA (false acceptance rate) before forgery: 16 ± 2.0 % (1700 files)

• FA after forgery: 26 ± 2.0 % (1700 files)
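The ± 2.0 % margins quoted above are consistent with a normal-approximation 95 % confidence interval for a proportion measured on 1700 trials. That reading is an assumption: the slide does not say how the margins were computed, nor that each file counts as one independent trial. A minimal sketch of the arithmetic:

```python
import math


def binomial_margin(rate, n_trials, z=1.96):
    """Half-width of the normal-approximation confidence interval for a rate."""
    return z * math.sqrt(rate * (1.0 - rate) / n_trials)


# Rates and trial count taken from the slide; z = 1.96 (95 %) is an assumption.
margin_before = binomial_margin(0.16, 1700)
margin_after = binomial_margin(0.26, 1700)
intervals_overlap = (0.16 + margin_before) >= (0.26 - margin_after)
```

Both margins come out close to 0.02, and the two intervals do not overlap, which supports reading the rise in false acceptances after forgery as a real effect rather than sampling noise.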

46

Preliminary results

True distributions

47

Multimodal Identity Verification

• M2VTS (face and speech)
  - Front view and profile
  - Pseudo-3D with coherent light

• BIOMET (face, speech, fingerprint, signature, hand shape)
  - Data collection
  - Reuse of the M2VTS and DAVID databases
  - Experiments on the fusion of modalities

48

Speaking Faces : Motivations

• In many situations a video sequence is acquired

• Fusion of face and speech increases robustness

• Forgery is more difficult

49

Talking Face Recognition (hybrid verification)

50

Lip features

• Tracking lip movements

51

A talking face model

• Using Hidden Markov Models (HMMs)

Acoustic parameters

Visual parameters

52

Imposture Model

53

Cloning

54

Conclusions, Perspectives

• Deliberate imposture is a challenge for speech-only systems

• Verification of identity based on features extracted from talking faces should be developed

• Common databases and evaluation protocols are necessary

• Free access to reference systems will facilitate future developments

55

BioSecure Residential Workshop

• Aug. 1st - 26th, 2005 at ENST, Paris

• Reference systems for speech, face, talking face, fingerprint, iris, hand, signature, …

• Comparative evaluations on large databases (BIOMET, BANCA, FVC, …)

• Fusion of modalities

http://www.biosecure.info
