Speaker Recognition
G. CHOLLET, G. GRAVIER, J. KHARROUBI, D. PETROVSKA-DELACRETAZ
(chollet, kharroub, petrovsk)@tsi.enst.fr, ggravier@infres.enst.fr
ENST/CNRS-LTCI, 46 rue Barrault, 75634 PARIS cedex 13
http://www.tsi.enst.fr/~chollet



Our affiliations

ENST: Ecole Nationale Supérieure des Télécommunications, http://www.enst.fr
CNRS: Centre National de la Recherche Scientifique, http://www.cnrs.fr
LTCI: Laboratoire de Traitement et Communication de l’Information, http://www.enst.fr/ura/ura.html


What is ENST?

Ecole Nationale Supérieure des Télécommunications

• classed among the ‘Grandes Ecoles d'Ingénieurs’.
• 250 state certified engineers each year.
• part of ‘Groupement des Ecoles de Télécommunications’


Modalities for Identity Verification

[Figure: a speaker (“Bla-bla”), a PIN code, and a SECURED SPACE]


Modalities for Identity Verification

• A device you own (key, smart card, …)
• A code you remember (password, …)
  - Could be lost or stolen
• Physiological characteristics: face, iris, fingerprint, hand shape, …
  - Need special equipment
• Behavioral characteristics: speech, signature, keystroke, …

Speech is the preferred modality over the telephone (but a ‘voice print’ is much more variable than a fingerprint).


Outline

• Where is the information about the speaker identity in the speech signal?
• How well could humans recognize a speaker?
• Applications of Speaker Recognition
• Prior knowledge on what the speaker said
• Combining Speech Recognition and Speaker Verification
• Some research activities at ENST:
  - Speaker verification:
    The CAVE-PICASSO projects (text dependent)
    The ELISA consortium, NIST evaluations (text independent)
    The EUREKA !2340 MAJORDOME project
  - Multimodal Identity Verification: The M2VTS and BIOMET projects
• Perspectives


Speaker Identity in Speech

• Differences in vocal tract shapes and muscular control
• Fundamental frequency (typical values): 100 Hz (Male), 200 Hz (Female), 300 Hz (Child)
• Glottal waveform
• Phonotactics
• Lexical usage

The difference between the voices of twins is a limiting case.

Voices can also be imitated or disguised.


Speaker Identity

[Figure: spectral envelope of /i:/ (amplitude A versus frequency f) for Speaker A and Speaker B]

segmental factors (~30 ms)
• glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness)
• vocal tract: formant frequencies and bandwidths

suprasegmental factors
• speaking speed (timing and rhythm of speech units)
• intonation patterns
• dialect, accent, pronunciation habits


Inter-speaker Variability

[Figure: the utterance “We were away a year ago.”]


Intra-speaker Variability

[Figure: the utterance “We were away a year ago.”]


Vocal Apparatus


Speech production


Glottal Waveform Modeling

[Figure: amplitude A versus time t; original residual in blue, synthetic residual in red]

• Fitting a glottal pulse model to the excitation waveform allows perceptually relevant modifications to voice quality


Applications of Speaker Recognition

• Identification from an open set (unrealistic)
• Identification from a closed set (who is speaking in a videoconference?)
• Verification of claimed identity (risk of deliberate imposture)

The human performance in speaker recognition is far from being perfect (highly dependent on familiarity with the subject)


Speaker Verification

Typology of approaches (EAGLES Handbook)
• Text dependent
  - Public password
  - Private password
  - Customized password
  - Text prompted
• Text independent
• Incremental enrolment
• Evaluation


What are the sources of difficulty ?

• Intra-speaker variability of the speech signal (due to stress, pathologies, environmental conditions, …)
• Recording conditions (filtering, noise, …)
• Temporal drift
• Intentional imposture
• Voice disguise


Text-dependent Speaker Verification

Uses Automatic Speech Recognition techniques (DTW, HMM, …)

Client model adaptation from speaker independent HMM (‘World’ model)

Synchronous alignment of client and world models for the computation of a score.


Dynamic Time Warping (DTW)
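
As a minimal sketch of the idea, DTW can be written as a dynamic program that finds the cheapest monotonic alignment between two utterances. The representation of an utterance as a list of feature vectors, the Euclidean local distance, and the function name `dtw_distance` are assumptions made for illustration; the slides do not specify them.

```python
import math


def dtw_distance(a, b):
    """Accumulated DTW distance between two sequences of feature vectors.

    a, b: lists of equal-dimension vectors (lists of floats).
    The local distance is Euclidean (an assumption for this sketch).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            # Allowed moves: advance in a, advance in b, or advance in both.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# The second sequence repeats one frame; warping absorbs the repetition.
template = [[0.0], [1.0], [2.0]]
trial = [[0.0], [1.0], [1.0], [2.0]]
print(dtw_distance(template, trial))
```

In a text-dependent setting, the accumulated distance between a stored password template and an access trial could serve as a (dis)similarity score.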


HMM structure depends on the application


Signal detection theory


Score normalisation

World model

Cohort normalisation

Discriminant techniques
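
The first two normalisations can be sketched on log-likelihoods as follows. This is an illustration under stated assumptions: the inputs are log-likelihoods of the same utterance under the client, world and cohort models, and the cohort term uses the mean of the cohort log-likelihoods, which is only one possible choice (the slide does not say which statistic is used). The function names are hypothetical.

```python
def world_normalised(client_ll, world_ll):
    """Client log-likelihood normalised by the world-model log-likelihood."""
    return client_ll - world_ll


def cohort_normalised(client_ll, cohort_lls):
    """Client log-likelihood normalised by a cohort of other speaker models.

    Uses the mean cohort log-likelihood (an assumption for this sketch).
    """
    return client_ll - sum(cohort_lls) / len(cohort_lls)


print(world_normalised(-40.0, -45.0))
print(cohort_normalised(-40.0, [-44.0, -46.0]))
```

Both functions return a score that is compared with a decision threshold; normalising makes that threshold less sensitive to utterance-level variation in the raw likelihood.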


Detection Error Tradeoff (DET) Curve
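
A DET curve plots the false rejection rate against the false acceptance rate as the decision threshold varies. The sketch below computes those operating points from two lists of scores and an approximate equal error rate (EER). The function names, the convention "accept when score >= threshold", and the toy scores are assumptions made for illustration.

```python
def error_rates(client_scores, impostor_scores, threshold):
    """Return (false_acceptance_rate, false_rejection_rate) at a threshold.

    A trial is accepted when its score is >= threshold.
    """
    fa = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    fr = sum(s < threshold for s in client_scores) / len(client_scores)
    return fa, fr


def det_points(client_scores, impostor_scores):
    """Operating points obtained by sweeping the threshold over all scores."""
    thresholds = sorted(set(client_scores) | set(impostor_scores))
    return [error_rates(client_scores, impostor_scores, t) for t in thresholds]


def approx_eer(client_scores, impostor_scores):
    """Average of FA and FR at the point where they are closest."""
    fa, fr = min(
        det_points(client_scores, impostor_scores),
        key=lambda point: abs(point[0] - point[1]),
    )
    return (fa + fr) / 2


clients = [2.0, 3.0, 4.0, 5.0]    # toy scores of true-client trials
impostors = [0.0, 1.0, 2.5, 3.5]  # toy scores of impostor trials
print(approx_eer(clients, impostors))
```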


CAVE – PICASSO
http://www.picasso.ptt-telecom.nl/project/


Incremental enrolment of customised password

The client chooses his password using some feedback from the system.

The system attempts a phonetic transcription of the password.

Incremental enrolment is achieved on further repetitions of that password.

Speaker independent phone HMMs are adapted with the client enrolment data.

Synchronous alignment likelihood ratio scoring is performed on access trials.


Deliberate imposture

• The impostor has some recordings of the target client's voice. He can record the same sentences and align these speech signals with the recordings of the client.
• A transformation (Multiple Linear Regression) is computed from these aligned data.
• The impostor has heard the target client's password. He records that password and applies the transformation to this recording.
• The PICASSO reference system, with less than 1 % EER (equal error rate), is defeated by this procedure (more than 30 % EER).


Speaker Verification (text independent)

• The ELISA consortium: ENST, LIA, IRISA, ...
  http://www.lia.univ-avignon.fr/equipes/RAL/elisa/index_en.html
• NIST evaluations
  http://www.nist.gov/speech/tests/spk/index.htm
• Ergodic HMM
• Gaussian Mixture Model


Gaussian Mixture Model

Parametric representation of the probability distribution of observations:
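
For reference, the usual form of a Gaussian mixture density is given below. The symbols ($\lambda$ for the model, $M$ for the number of Gaussians, $w_i$, $\mu_i$, $\Sigma_i$ for the weight, mean and covariance of component $i$, $d$ for the feature dimension) are standard notation assumed here, not taken from the slide.

```latex
p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x ; \mu_i, \Sigma_i),
\qquad \sum_{i=1}^{M} w_i = 1, \quad w_i \ge 0,

\mathcal{N}(x ; \mu_i, \Sigma_i) =
\frac{1}{(2\pi)^{d/2} \, |\Sigma_i|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (x - \mu_i)^{\top} \Sigma_i^{-1} (x - \mu_i) \right).
```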


Gaussian Mixture Models

8 Gaussians per mixture


National Institute of Standards & Technology (NIST)

Speaker Verification Evaluations

• Annual evaluation since 1995
• Common paradigm for comparing technologies


GMM speaker modeling

[Block diagram]
Front-end → GMM modeling → World GMM model
Front-end → GMM model adaptation → Target GMM model


Baseline GMM method

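[Block diagram: Test speech → Front-end → hypothesised target GMM model and world GMM model → LLR score]

$\mathrm{LLR}(x) = \log \left[ P(x \mid \lambda) / P(x \mid \bar{\lambda}) \right]$, where $\lambda$ denotes the hypothesised target GMM and $\bar{\lambda}$ the world GMM.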
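
The log-likelihood ratio (LLR) score of the baseline GMM method can be sketched as below. This is an illustration, not the system described in the slides: it assumes diagonal-covariance GMMs stored as lists of (weight, mean, variance) tuples, and it averages the per-frame LLR over the utterance, which is one common length normalisation. All names are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, gmm):
    """Log p(x | gmm), with gmm a list of (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gauss_diag(x, mu, var) for w, mu, var in gmm]
    top = max(terms)
    # Log-sum-exp over the components, for numerical stability.
    return top + math.log(sum(math.exp(t - top) for t in terms))


def llr_score(frames, target_gmm, world_gmm):
    """Average per-frame log-likelihood ratio: target model versus world model."""
    total = sum(
        gmm_log_likelihood(x, target_gmm) - gmm_log_likelihood(x, world_gmm)
        for x in frames
    )
    return total / len(frames)


# Toy one-dimensional, single-Gaussian models (illustrative values only).
target = [(1.0, [0.0], [1.0])]
world = [(1.0, [2.0], [1.0])]
print(llr_score([[0.0]], target, world))  # positive: frame is closer to the target
```

The claimed identity is accepted when the score exceeds a threshold chosen on development data.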


Support Vector Machines and Speaker Verification

A hybrid GMM-SVM system is proposed.

An SVM scoring model is trained on development data to classify true-target speaker accesses and impostor accesses, using a new feature representation based on GMMs.

[Diagram: modeling by GMM, scoring by SVM]


SVM principles

[Figure: X in the input space is mapped to the feature space; separating hyperplanes H, with the optimal hyperplane Ho; output Class(X)]


Results


Combining Speech Recognition and Speaker Verification.

• Speaker independent phone HMMs
• Selection of segments or segment classes which are speaker specific
• Preliminary evaluations are performed on the NIST extended data set (one hour of training data per speaker)


Selection of nasals in words in -ing

being, everything, getting, anything, thing, something, things, going


«MAJORDOME»

Unified Messaging System

Eureka Project no. 2340

D. Bahu-Leyser, G. Chollet, K. Hallouli, J. Kharroubi, L. Likforman, D. Mostefa, D. Petrovska, M. Sigelle, P. Vaillant

EDF, Vecsys, KTH, Mensatec, UPC, Airtel, Software602


Majordome’s Functionalities

• Speaker verification

• Dialogue

• Routing

• Updating the agenda

• Automatic summary

[Figure labels: Voice, Fax, E-mail]


Voice technology in Majordome

Server side background tasks:
• Continuous speech recognition applied to voice messages upon reception
• Detection of sender’s name and subject

User interaction:
• Speaker identification and verification
• Speech recognition (receiving user commands through voice interaction)
• Text-to-speech synthesis (reading text summaries, E-mails or faxes)


BIOMET

[Figure: a speaker (“Bla-bla”), a PIN code, and a SECURED SPACE]


BIOMET

An extension of the M2VTS and DAVID projects to include such modalities as signature, fingerprint and hand shape.

Initial support (two years) is provided by GET (Groupement des Ecoles de Télécommunications)

Emphasis will be on fusion of scores obtained from two or more modalities.
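
One simple way to fuse scores from several modalities is a weighted average, sketched below. The slides do not state which fusion rule BIOMET uses, so the rule itself, the function name `fuse_scores`, and the assumption that the scores have already been normalised to a comparable range are all illustrative choices.

```python
def fuse_scores(scores, weights):
    """Weighted average of per-modality verification scores.

    scores:  one score per modality, assumed already normalised to a
             comparable range (an assumption for this sketch).
    weights: one non-negative weight per modality, not all zero.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, scores)) / total_weight


# Toy example: a speech score and a signature score, with speech trusted more.
print(fuse_scores([0.0, 4.0], [3.0, 1.0]))
```

The weights would typically be tuned on development data, for example to minimise the equal error rate of the fused score.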


Conclusions and Perspectives

Evaluation trials (as conducted by NIST) help improve technology.

A strategy combining speech recognition and segmental scoring seems to be a promising approach for speaker verification.

Whenever possible, text independent speaker verification should be confirmed by text dependent verification.

Whenever possible, fusion of multiple experts (preferably multimodal) should be performed.