Speaker Recognition
G. CHOLLET, G. GRAVIER, J. KHARROUBI, D. PETROVSKA-DELACRETAZ
(chollet, kharroub, petrovsk)@tsi.enst.fr [email protected]
ENST/CNRS-LTCI
46 rue Barrault
75634 PARIS cedex 13
http://www.tsi.enst.fr/~chollet
Our affiliations
ENST: Ecole Nationale Supérieure des Télécommunications, http://www.enst.fr
CNRS: Centre National de la Recherche Scientifique, http://www.cnrs.fr
LTCI: Laboratoire de Traitement et Communication de l’Information, http://www.enst.fr/ura/ura.html
What is ENST?
Ecole Nationale Supérieure des Télécommunications
• classed among the ‘Grandes Ecoles d'Ingénieurs’.
• 250 state-certified engineers each year.
• part of the ‘Groupement des Ecoles de Télécommunications’
Modalities for Identity Verification
[Figure: access to a secured space, by PIN or by speech ("Bla-bla")]
Modalities for Identity Verification
• A device you own (key, smart card, …)
• A code you remember (password, …)
    • Could be lost or stolen
• Physiological characteristics: face, iris, fingerprint, hand shape, …
    • Need special equipment
• Behavioral characteristics: speech, signature, keystroke, …
• Speech is the preferred modality over the telephone (but a ‘voice print’ is much more variable than a fingerprint)
Outline
• Where is the information about the speaker identity in the speech signal?
• How well can humans recognize a speaker?
• Applications of Speaker Recognition
• Prior knowledge on what the speaker said
• Combining Speech Recognition and Speaker Verification
• Some research activities at ENST:
    • Speaker verification:
        • The CAVE-PICASSO projects (text dependent)
        • The ELISA consortium, NIST evaluations (text independent)
        • The EUREKA !2340 MAJORDOME project
    • Multimodal Identity Verification:
        • The M2VTS and BIOMET projects
• Perspectives
Speaker Identity in Speech
• Differences in vocal tract shapes and muscular control
• Fundamental frequency (typical values)
    • 100 Hz (Male), 200 Hz (Female), 300 Hz (Child)
• Glottal waveform
• Phonotactics
• Lexical usage
• The difference between the voices of twins is a limit case
• Voices can also be imitated or disguised
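To make the fundamental-frequency item concrete, here is a minimal sketch of pitch estimation by autocorrelation. It is not taken from the slides: the function names, the 8 kHz sampling rate, the 40 ms frame and the synthetic three-harmonic test signal are all assumptions chosen for illustration.

```python
import numpy as np

def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Estimate the fundamental frequency (Hz) of a voiced frame by autocorrelation."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    # Autocorrelation for non-negative lags only.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Search only the lags that correspond to plausible pitch values (assumed range).
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), len(acf) - 1)
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return sample_rate / best_lag

def synthetic_voiced_frame(f0, sample_rate=8000, duration=0.04):
    """Hypothetical test signal: a fundamental plus two weaker harmonics."""
    t = np.arange(int(round(duration * sample_rate))) / sample_rate
    return sum(np.sin(2 * np.pi * k * f0 * t) / k for k in (1, 2, 3))

for label, f0 in [("male", 100.0), ("female", 200.0), ("child", 300.0)]:
    estimate = estimate_f0(synthetic_voiced_frame(f0), 8000)
    print(label, round(estimate, 1))
```

On real speech the estimate varies from frame to frame and is only meaningful on voiced frames, which is one reason fundamental frequency alone is a weak cue to identity.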
[Figure: spectral envelope of /i:/ for Speaker A and Speaker B; amplitude (A) versus frequency (f)]
Speaker Identity
• segmental factors (~30 ms)
    • glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness)
    • vocal tract: formant frequencies and bandwidths
• suprasegmental factors
    • speaking speed (timing and rhythm of speech units)
    • intonation patterns
    • dialect, accent, pronunciation habits
Inter-speaker Variability
[Figure: "We were away a year ago."]
Intra-speaker Variability
[Figure: "We were away a year ago."]
Vocal Apparatus
Speech production
Glottal Waveform Modeling
[Figure: amplitude (A) versus time (t); original residual in blue, synthetic residual in red]
• Fitting a glottal pulse model to the excitation waveform allows perceptually relevant modifications to voice quality
Applications of Speaker Recognition
• Identification from an open set (unrealistic)
• Identification from a closed set (who is speaking in a videoconference?)
• Verification of claimed identity (risk of deliberate imposture)
• Human performance in speaker recognition is far from perfect (highly dependent on familiarity with the subject)
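The three tasks listed above differ only in the decision taken on the scores. The sketch below illustrates that difference; the function names, speaker names and score values are hypothetical, and a higher score is assumed to mean a better match.

```python
def identify_closed_set(scores):
    """Closed set: the speaker is assumed to be one of the enrolled ones, so pick the best match."""
    return max(scores, key=scores.get)

def verify(score, threshold):
    """Verification: accept or reject a single claimed identity."""
    return score >= threshold

def identify_open_set(scores, threshold):
    """Open set: the best match is kept only if it also passes a verification test."""
    best = identify_closed_set(scores)
    return best if verify(scores[best], threshold) else None

scores = {"alice": 1.2, "bob": -0.3}   # hypothetical match scores for one utterance
print(identify_closed_set(scores))
print(verify(scores["bob"], 0.0))
print(identify_open_set(scores, 2.0))
```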
Speaker Verification
• Typology of approaches (EAGLES Handbook)
    • Text dependent
        • Public password
        • Private password
        • Customized password
        • Text prompted
    • Text independent
• Incremental enrolment
• Evaluation
What are the sources of difficulty ?
• Intra-speaker variability of the speech signal (due to stress, pathologies, environmental conditions, …)
• Recording conditions (filtering, noise, …)
• Temporal drift
• Intentional imposture
• Voice disguise
Text-dependent Speaker Verification
Uses Automatic Speech Recognition techniques (DTW, HMM, …)
Client model adaptation from speaker independent HMM (‘World’ model)
Synchronous alignment of client and world models for the computation of a score.
Dynamic Time Warping (DTW)
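As an illustration of the alignment idea, here is a minimal DTW sketch. It assumes Euclidean local distances and the simplest step pattern (horizontal, vertical, diagonal), with no slope constraints or path-length normalisation; the slides do not say which variant was used.

```python
import numpy as np

def dtw_distance(a, b):
    """Cumulative DTW distance between two sequences of feature vectors (or scalars)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)
    # Local Euclidean distance between every pair of frames.
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend the cheapest of the three allowed predecessor paths.
            acc[i, j] = local[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]

# The same "word" spoken at two tempos aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

In a text-dependent system the test utterance would be compared against a stored template of the password, and the cumulative distance would serve as the (inverse) match score.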
HMM structure depends on the application
Signal detection theory
Score normalisation
World model
Cohort normalisation
Discriminant techniques
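A minimal sketch of the two normalisations named above, working on log-likelihoods. The function names and numbers are hypothetical, and using the mean of the cohort scores is an assumption: other variants (for example the best cohort score) are also in use.

```python
import numpy as np

def world_normalised(client_ll, world_ll):
    """Log-likelihood ratio between the claimed client model and the world model."""
    return client_ll - world_ll

def cohort_normalised(client_ll, cohort_lls):
    """Client log-likelihood relative to the average of a cohort of competing speaker models."""
    return client_ll - float(np.mean(cohort_lls))

print(world_normalised(-40.0, -45.0))
print(cohort_normalised(-40.0, [-44.0, -46.0, -48.0]))
```

Either way, the aim is a score whose decision threshold depends less on the utterance and the recording conditions than the raw client likelihood does.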
Detection Error Tradeoff (DET) Curve
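The points of a DET curve, and the equal error rate (EER) quoted later, can be computed from two lists of trial scores. The sketch below assumes that higher scores favour acceptance and sweeps the threshold over the observed scores only; the function names and scores are made up. A real DET plot would also warp both axes to a normal-deviate scale, which is omitted here.

```python
import numpy as np

def det_points(target_scores, impostor_scores):
    """Miss and false-alarm probabilities for each candidate threshold."""
    target_scores = np.asarray(target_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    # Miss: a true client scores below the threshold.
    p_miss = np.array([np.mean(target_scores < th) for th in thresholds])
    # False alarm: an impostor scores at or above the threshold.
    p_fa = np.array([np.mean(impostor_scores >= th) for th in thresholds])
    return thresholds, p_miss, p_fa

def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: the operating point where miss and false-alarm rates are closest."""
    _, p_miss, p_fa = det_points(target_scores, impostor_scores)
    idx = int(np.argmin(np.abs(p_miss - p_fa)))
    return (p_miss[idx] + p_fa[idx]) / 2.0

print(equal_error_rate([2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 1.5, 2.5]))
```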
CAVE – PICASSO
http://www.picasso.ptt-telecom.nl/project/
Incremental enrolment of customised password
The client chooses his password using some feedback from the system.
The system attempts a phonetic transcription of the password.
Incremental enrolment is achieved on further repetitions of that password.
Speaker independent phone HMMs are adapted with the client enrolment data.
Synchronous alignment likelihood ratio scoring is performed on access trials.
Deliberate imposture
The impostor has some recordings of the target client voice. He can record the same sentences and align these speech signals with the recordings of the client.
A transformation (Multiple Linear Regression) is computed from these aligned data.
The impostor has heard the target client password. He records that password and applies the transformation to this recording.
The PICASSO reference system, with less than 1 % EER, is defeated by this procedure (more than 30 % EER).
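The regression step can be illustrated on synthetic data. The sketch below fits an affine least-squares map between two sets of time-aligned feature frames; the function names, the feature dimension and the data are invented for the example, and the features and alignment procedure of the actual experiment are not specified in the slides.

```python
import numpy as np

def fit_linear_transform(source_frames, target_frames):
    """Least-squares affine map from source frames to time-aligned target frames."""
    # Append a constant column so the map includes an offset term.
    X = np.hstack([source_frames, np.ones((len(source_frames), 1))])
    W, *_ = np.linalg.lstsq(X, target_frames, rcond=None)
    return W

def apply_linear_transform(W, frames):
    X = np.hstack([frames, np.ones((len(frames), 1))])
    return X @ W

# Synthetic stand-in for aligned frames: the "target" is an exact affine image of the "source".
rng = np.random.default_rng(0)
source = rng.normal(size=(200, 3))
true_A = np.array([[1.0, 0.2, 0.0], [0.0, 0.9, 0.1], [0.3, 0.0, 1.1]])
true_b = np.array([0.5, -0.2, 0.1])
target = source @ true_A + true_b

W = fit_linear_transform(source, target)
print(np.allclose(apply_linear_transform(W, source), target))
```

The point for system design is that a transformation this simple was enough to raise the EER from under 1 % to over 30 %, so a verification system cannot rely on spectral features alone when recordings of the client are available to an attacker.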
Speaker Verification (text independent)
• The ELISA consortium
    • ENST, LIA, IRISA, ...
    • http://www.lia.univ-avignon.fr/equipes/RAL/elisa/index_en.html
• NIST evaluations
    • http://www.nist.gov/speech/tests/spk/index.htm
Ergodic HMM Gaussian Mixture Model
Gaussian Mixture Model
Parametric representation of the probability distribution of observations:
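In the usual notation (assumed here, since the slide's own symbols are not available), a mixture of M Gaussians with weights w_i, mean vectors μ_i and covariance matrices Σ_i models an observation x as:

$$p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, \mathcal{N}(x;\, \mu_i, \Sigma_i), \qquad \sum_{i=1}^{M} w_i = 1, \quad w_i \ge 0$$

where λ = {w_i, μ_i, Σ_i} denotes the full parameter set of one model (a speaker model or the world model).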
Gaussian Mixture Models
8 Gaussians per mixture
National Institute of Standards & Technology (NIST)
Speaker Verification Evaluations
• Annual evaluation since 1995
• Common paradigm for comparing technologies
GMM speaker modeling
[Diagram: Front-end → GMM modeling → World GMM model]
[Diagram: Front-end → GMM model adaptation → Target GMM model]
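The diagram does not say which adaptation rule derives the target model from the world model. A common choice, assumed here, is MAP adaptation of the component means only, with diagonal covariances and a relevance factor (16 in this sketch, also an assumption). All names and the toy data are hypothetical.

```python
import numpy as np

def log_gaussians(x, means, variances):
    """Log density of each frame under each diagonal-covariance Gaussian, shape (T, M)."""
    diff = x[:, None, :] - means[None, :, :]
    return -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                   + np.sum(np.log(2 * np.pi * variances), axis=1))

def map_adapt_means(x, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a world GMM towards the enrolment frames x."""
    log_post = np.log(weights) + log_gaussians(x, means, variances)
    log_post -= np.logaddexp.reduce(log_post, axis=1, keepdims=True)
    post = np.exp(log_post)                      # component posteriors, shape (T, M)
    counts = post.sum(axis=0)                    # soft frame count per component
    data_means = (post.T @ x) / np.maximum(counts, 1e-10)[:, None]
    alpha = (counts / (counts + relevance))[:, None]
    # Components that saw a lot of client data move towards it; the others stay near the world model.
    return alpha * data_means + (1.0 - alpha) * means

# Toy one-dimensional world model with two components, and client frames all equal to 3.0.
world_weights = np.array([0.5, 0.5])
world_means = np.array([[-2.0], [2.0]])
world_variances = np.ones((2, 1))
client_frames = np.full((100, 1), 3.0)

adapted = map_adapt_means(client_frames, world_weights, world_means, world_variances)
print(adapted.ravel())
```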
Baseline GMM method
[Diagram: Test speech → Front-end → hypothesized target GMM model and world GMM model → LLR score]

$$\text{LLR score} = \log \frac{P(x \mid \lambda_{\text{target}})}{P(x \mid \lambda_{\text{world}})}$$
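The log-likelihood ratio score of the baseline method can be sketched as follows, assuming diagonal-covariance GMMs and a score averaged over frames (both assumptions; the slide only shows the ratio). The models and frames are toy values, and each model is passed as a hypothetical (weights, means, variances) tuple.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Average per-frame log-likelihood of frames x under a diagonal-covariance GMM."""
    diff = x[:, None, :] - means[None, :, :]
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    # Sum over components in the log domain for numerical stability.
    log_frame = np.logaddexp.reduce(np.log(weights) + log_comp, axis=1)
    return float(np.mean(log_frame))

def llr_score(x, target, world):
    """log P(x | target model) - log P(x | world model), averaged over frames."""
    return gmm_log_likelihood(x, *target) - gmm_log_likelihood(x, *world)

weights = np.array([0.5, 0.5])
variances = np.ones((2, 1))
world = (weights, np.array([[-2.0], [2.0]]), variances)
target = (weights, np.array([[-2.0], [3.0]]), variances)

print(llr_score(np.full((50, 1), 3.0), target, world) > 0)   # frames close to the target model
print(llr_score(np.full((50, 1), 1.5), target, world) < 0)   # frames closer to the world model
```

The claimed identity is accepted when the score exceeds a threshold set on development data.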
Support Vector Machines and Speaker Verification
A hybrid GMM-SVM system is proposed.
An SVM scoring model is trained on development data to separate true-target accesses from impostor accesses, using a new feature representation based on GMMs.
Modeling: GMM
Scoring: SVM
SVM principles
[Figure: mapping Φ from the input space (X) to the feature space (Φ(X)); separating hyperplanes H, with the optimal hyperplane Ho; output Class(X)]
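In standard notation (an assumption; the slide gives only the picture), the classifier and the optimal hyperplane for linearly separable training data (X_i, y_i), with labels y_i ∈ {−1, +1}, can be written as:

$$\mathrm{Class}(X) = \operatorname{sign}\big(w \cdot \Phi(X) + b\big)$$

$$\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i \big(w \cdot \Phi(X_i) + b\big) \ge 1 \ \text{ for all } i$$

Among all separating hyperplanes H, the solution Ho is the one with the largest margin, since the margin equals 2/‖w‖.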
Results
Combining Speech Recognition and Speaker Verification
• Speaker independent phone HMMs
• Selection of segments or segment classes which are speaker specific
• Preliminary evaluations are performed on the NIST extended data set (one hour of training data per speaker)
Selection of nasals in words in -ing
being, everything, getting, anything, thing, something, things, going
«MAJORDOME»
Unified Messaging System
Eureka Project no. 2340
EDF, Vecsys
D. Bahu-Leyser, G. Chollet, K. Hallouli, J. Kharroubi, L. Likforman, D. Mostefa, D. Petrovska, M. Sigelle, P. Vaillant
KTH, Mensatec, UPC, Airtel, Software602
Majordome’s Functionalities
• Speaker verification
• Dialogue
• Routing
• Updating the agenda
• Automatic summary
Voice
Fax
Voice technology in Majordome
• Server side background tasks: continuous speech recognition applied to voice messages upon reception
    • Detection of sender’s name and subject
• User interaction:
    • Speaker identification and verification
    • Speech recognition (receiving user commands through voice interaction)
    • Text-to-speech synthesis (reading text summaries, e-mails or faxes)
BIOMET
[Figure: access to a secured space, by PIN or by speech ("Bla-bla")]
BIOMET
An extension of the M2VTS and DAVID projects to include such modalities as signature, finger print, hand shape.
Initial support (two years) is provided by GET (Groupement des Ecoles de Télécommunications)
Emphasis will be on fusion of scores obtained from two or more modalities.
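One simple form of score fusion is a weighted sum of normalised scores. The sketch below is only an illustration of that idea: the fusion rule, the weights, the modality names and the numbers are all assumptions, not the method used in BIOMET.

```python
import numpy as np

def z_normalise(scores, dev_mean, dev_std):
    """Bring one modality's scores to a common scale using development-set statistics."""
    return (np.asarray(scores, dtype=float) - dev_mean) / dev_std

def fuse(scores_by_modality, weights):
    """Weighted-sum fusion: one row of normalised scores per modality, one column per trial."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return weights @ np.vstack(scores_by_modality)

speech = z_normalise([12.0, 8.0], dev_mean=10.0, dev_std=2.0)      # hypothetical speech scores
signature = z_normalise([0.9, 0.6], dev_mean=0.6, dev_std=0.1)     # hypothetical signature scores
print(fuse([speech, signature], weights=[1.0, 1.0]))
```

Normalising first matters because the raw scores of different modalities live on unrelated scales; the weights would normally be tuned on development data.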
Conclusions and Perspectives
Evaluation trials (as conducted by NIST) help improve technology.
A strategy combining speech recognition and segmental scoring seems to be a promising approach for speaker verification.
Whenever possible, text independent speaker verification should be confirmed by text dependent verification.
Whenever possible, fusion of multiple experts (preferably multimodal) should be performed.