1
Speech course, 9 March 2005. Instructors: Dr. Dijana Petrovska-Delacrétaz and Gérard Chollet
Speaker Recognition
1. Introduction, history, application domains
2. Cues to identity in speech
3. Speaker verification
   3.1. Decision theory
   3.2. Text-dependent / text-independent
4. Voice imposture
5. Audio-visual identity verification
6. Evaluations
7. Conclusions
2
Why should a computer recognize who is speaking ?
• Protection of individual property (habitation, bank account, personal data, messages, mobile phone, PDA,...)
• Limited access (secured areas, data bases)
• Personalization (only respond to its master’s voice)
• Locate a particular person in an audio-visual document (information retrieval)
• Who is speaking in a meeting ?
• Is a suspect the criminal ? (forensic applications)
3
Tasks in Automatic Speaker Recognition
• Speaker verification (voice biometrics): Are you really who you claim to be?
• Identification (speaker ID): Is this speech segment coming from a known speaker? How large is the set of speakers (population of the world)?
• Speaker detection, segmentation, indexing, retrieval, tracking: looking for recordings of a particular speaker
• Combining speech and speaker recognition: adaptation to a new speaker, speaker typology, personalization in dialogue systems
4
Applications
• Access control: physical facilities, computer networks, websites
• Transaction authentication: telephone banking, e-commerce
• Speech data management: voice messaging, search engines
• Law enforcement: forensics, home incarceration
5
Voice Biometrics
• Advantages
  Often the only modality available over the telephone
  Low cost (microphone, A/D converter), ubiquity
  Possible integration on a smart (SIM) card
  Natural bimodal fusion: speaking face
• Disadvantages
  Lack of discretion
  Possibility of imitation and electronic imposture
  Lack of robustness to noise, distortion, …
  Temporal drift
6
Speaker Identity in Speech
• Differences in
  Vocal tract shapes and muscular control
  Fundamental frequency (typical values): 100 Hz (male), 200 Hz (female), 300 Hz (child)
  Glottal waveform
  Phonotactics
  Lexical usage
• The difference between the voices of twins is a limit case
• Voices can also be imitated or disguised
7
Speaker Identity
[Figure: spectral envelope of /i:/ for Speaker A and Speaker B; axes: frequency f, amplitude A]
• Segmental factors (~30 ms)
  Glottal excitation: fundamental frequency, amplitude, voice quality (e.g., breathiness)
  Vocal tract: characterized by its transfer function and represented by MFCCs (Mel-Frequency Cepstral Coefficients)
• Suprasegmental factors
  Speaking speed (timing and rhythm of speech units)
  Intonation patterns
  Dialect, accent, pronunciation habits
8
What are the sources of difficulty ?
• Intra-speaker variability of the speech signal (due to stress, pathologies, environmental conditions,…)
• Recording conditions (filtering, noise,…)
• Channel mismatch between enrolment and testing
• Temporal drift
• Intentional imposture
• Voice disguise
9
Acoustic features
• Short term spectral analysis
10
Intra- and Inter-speaker variability
11
Speaker Verification
Typology of approaches (EAGLES Handbook)
• Text dependent
  Public password
  Private password
  Customized password
  Text prompted
• Text independent
• Incremental enrolment
• Evaluation
12
History of Speaker Recognition
13
Current approaches
14
Dynamic Time Warping (DTW)
$d(X, Y) = \sum_{\text{best path}} d^2(x_i, y_j)$
[Figure: the test utterance “Bonjour” of speaker Y is aligned along the best path against the reference utterance “Bonjour” of each speaker X = 1, 2, …, n]
DODDINGTON 1974, ROSENBERG 1976, FURUI 1981, etc.
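The DTW comparison can be sketched in a few lines of Python. This is only an illustrative sketch: the squared Euclidean local distance, the unconstrained three-way step pattern, and the names `sq_dist`, `dtw_distance` and `closest_speaker` are assumptions, not the exact setup of the systems cited on this slide.

```python
import math


def sq_dist(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def dtw_distance(X, Y):
    """Accumulated squared distance along the best alignment path of X and Y.

    X and Y are sequences of feature vectors (e.g. one vector per frame).
    """
    n, m = len(X), len(Y)
    # D[i][j] = cost of the best path aligning X[:i] with Y[:j]
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sq_dist(X[i - 1], Y[j - 1])
            # Allowed moves: advance in X, advance in Y, or advance in both
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


def closest_speaker(references, Y):
    """Return the speaker whose reference utterance is closest to Y."""
    return min(references, key=lambda name: dtw_distance(references[name], Y))
```

For verification rather than identification, the distance to the claimed speaker's reference would instead be compared with a threshold.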
15
Vector Quantization (VQ)
$d(X, Y) = \sum_{\text{best quant.}} d^2(C_i^X, y_j)$
[Figure: each frame of the test utterance “Bonjour” of speaker Y is quantized with the codebook (dictionary) of each speaker X = 1, 2, …, n]
SOONG, ROSENBERG 1987
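A minimal sketch of the VQ score, assuming a codebook per speaker has already been trained (for instance by k-means, which is not shown) and that the local distance is squared Euclidean. The names `sq_dist` and `vq_distortion` are hypothetical.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def vq_distortion(codebook, Y):
    """Total quantization distortion of test frames Y against one speaker's codebook.

    Each frame is mapped to its nearest codeword ("best quantization"),
    and the squared distances are summed over all frames.
    """
    return sum(min(sq_dist(c, y) for c in codebook) for y in Y)
```

The speaker whose codebook yields the lowest distortion is the best match. Dividing by `len(Y)` would give a per-frame average that is easier to compare across utterances of different lengths.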
16
Hidden Markov Models (HMM)
$d(X, Y) = \sum_{\text{best path}} \log P(y_j \mid S_i^X)$
[Figure: the test utterance “Bonjour” of speaker Y is aligned along the best state path of the “Bonjour” HMM of each speaker X = 1, 2, …, n]
ROSENBERG 1990, TSENG 1992
17
Ergodic HMM
$d(X, Y) = \sum_{\text{best path}} \log P(y_j \mid S_i^X)$
[Figure: the test utterance “Bonjour” of speaker Y is scored along the best state path of the ergodic HMM of each speaker X = 1, 2, …, n]
PORITZ 1982, SAVIC 1990
18
Gaussian Mixture Models (GMM)
REYNOLDS 1995
19
HMM structure depends on the application
20
Some issues in Text-dependent Speaker Verification Systems :
The CAVE and PICASSO projects
• Sequences of digits
  Speaker-independent HMM of each digit
  Adaptation of these HMMs to the client voice (during enrolment and incremental enrolment)
  EER of less than 1 % can be achieved
• Customized password
  The client chooses his password using some feedback from the system
• Deliberate imposture
21
Gaussian Mixture Model
• Parametric representation of the probability distribution of observations:
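A standard way to write such a density is given below. The notation is an assumption and not necessarily the one used in the lecture: $M$ components, weights $w_k$, means $\mu_k$, covariances $\Sigma_k$, and feature dimension $D$.

```latex
p(x \mid \lambda) = \sum_{k=1}^{M} w_k \, \mathcal{N}(x; \mu_k, \Sigma_k),
\qquad \sum_{k=1}^{M} w_k = 1, \quad w_k \ge 0,

\mathcal{N}(x; \mu_k, \Sigma_k) =
\frac{1}{(2\pi)^{D/2} \, |\Sigma_k|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (x - \mu_k)^{\top} \Sigma_k^{-1} (x - \mu_k) \right)
```

The speaker model is then the parameter set $\lambda = \{ w_k, \mu_k, \Sigma_k \}_{k=1}^{M}$.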
22
Gaussian Mixture Models
8 Gaussians per mixture
23
GMM speaker modeling
[Diagram: world data → front-end → GMM modeling → world GMM model; target speaker data → front-end → GMM model adaptation (from the world GMM model) → target GMM model]
24
Baseline GMM method
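[Diagram: test speech → front-end → scoring against the hypothesized target GMM model and the world GMM model → LLR score]
$\mathrm{LLR} = \log\left[ \frac{P(x \mid \lambda)}{P(x \mid \bar{\lambda})} \right]$
with $\lambda$ the hypothesized target GMM model and $\bar{\lambda}$ the world GMM model.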
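The log-likelihood ratio score can be sketched as follows. Several choices here are assumptions for illustration: diagonal covariances, a GMM stored as a list of `(weight, mean, variance)` tuples, and averaging the score over frames. The function names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (a - m) ** 2 / v)
        for a, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, gmm):
    """log p(x | gmm), where gmm is a list of (weight, mean, var) tuples."""
    logs = [math.log(w) + log_gaussian_diag(x, mu, var) for w, mu, var in gmm]
    top = max(logs)
    # log-sum-exp, shifted by the largest term for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in logs))


def llr_score(frames, target_gmm, world_gmm):
    """Frame-averaged log-likelihood ratio between target and world models."""
    total = sum(
        gmm_log_likelihood(x, target_gmm) - gmm_log_likelihood(x, world_gmm)
        for x in frames
    )
    return total / len(frames)
```

A positive score means the test frames are better explained by the target model than by the world model. The decision itself still requires a threshold.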
25
Decision theory for identity verification
• Two types of errors:
  False rejection (a client is rejected)
  False acceptance (an impostor is accepted)
• Decision theory: given an observation O and a claimed identity
  H0 hypothesis: it comes from an impostor
  H1 hypothesis: it comes from our client
• H1 is chosen if and only if P(H1|O) > P(H0|O), which can be rewritten (using Bayes' law) as
$\frac{P(O \mid H_1)}{P(O \mid H_0)} > \frac{P(H_0)}{P(H_1)}$
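In the log domain this rule becomes a comparison between a log-likelihood ratio and a threshold set by the priors. The sketch below assumes equal error costs; the function name `accept_claim` and the parameter `prior_client` (standing for P(H1)) are hypothetical.

```python
import math


def accept_claim(llr, prior_client=0.5):
    """Bayes decision for a claimed identity.

    llr is log[P(O|H1) / P(O|H0)], with H1 = client and H0 = impostor.
    The claim is accepted when the likelihood ratio exceeds P(H0) / P(H1).
    """
    threshold = math.log((1.0 - prior_client) / prior_client)
    return llr > threshold
```

With equal priors the threshold is 0. When clients are rare (a small `prior_client`), the threshold rises and stronger evidence is needed to accept a claim.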
26
Signal detection theory
27
Decision
28
Distribution of scores
29
Detection Error Tradeoff (DET) Curve
30
Evaluation
• Decision cost (FA, FR, priors, costs,…)
• Receiver Operating Characteristic Curve
• Reference systems (open software)
• Evaluations (algorithms, field trials, ergonomics,…)
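The error measures behind these evaluation items can be sketched as below. The threshold sweep over observed scores and the simple cost formula are illustrative assumptions rather than an official evaluation protocol, and all function names are hypothetical.

```python
def error_rates(client_scores, impostor_scores, threshold):
    """Return (false acceptance rate, false rejection rate) at a threshold."""
    fr = sum(s <= threshold for s in client_scores) / len(client_scores)
    fa = sum(s > threshold for s in impostor_scores) / len(impostor_scores)
    return fa, fr


def equal_error_rate(client_scores, impostor_scores):
    """Approximate the EER by sweeping the threshold over all observed scores.

    Returns (eer, threshold) at the point where FA and FR are closest.
    """
    best = None
    for t in sorted(set(client_scores) | set(impostor_scores)):
        fa, fr = error_rates(client_scores, impostor_scores, t)
        gap = abs(fa - fr)
        if best is None or gap < best[0]:
            best = (gap, (fa + fr) / 2, t)
    return best[1], best[2]


def detection_cost(fa, fr, p_client, cost_fa=1.0, cost_fr=1.0):
    """Expected decision cost from error rates, client prior and error costs."""
    return cost_fr * fr * p_client + cost_fa * fa * (1.0 - p_client)
```

Plotting the (FA, FR) pairs obtained over the whole sweep gives the ROC curve, or the DET curve when both axes are drawn on a normal-deviate scale.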
31
NIST Speaker Verification Evaluations
• A reference standard to compare algorithms and stimulate new developments
• Distribution (via LDC) of development and test databases with:
  Increasing difficulty (from land line to mobile)
  Several hundred speakers (2 min of training data per client)
  Several thousand test accesses (5 to 50 s per access)
• Participation of 15-20 labs every year (MIT, IBM, Nuance, Queensland Univ, ELISA consortium, …)
• Annual workshop, special issues in journals, …
32
National Institute of Standards & Technology (NIST) Speaker Verification Evaluations
• Annual evaluation since 1995
• Common paradigm for comparing technologies
33
Speaker Verification (text independent)
• The ELISA consortium: ENST, LIA, IRISA, ...
  http://www.lia.univ-avignon.fr/equipes/RAL/elisa/index_en.html
• BECARS: Balamand-ENST CEDRE Automatic Recognition of Speakers
• NIST evaluations
  http://www.nist.gov/speech/tests/spk/index.htm
34
NIST evaluations : Results
ENST 2003
35
Evaluations: NIST 2004
36
Combining Speech Recognition and Speaker Verification.
• Speaker independent phone HMMs
• Selection of segments or segment classes which are speaker specific
• Preliminary evaluations are performed on the NIST extended data set (one hour of training data per speaker)
37
ALISP: Automatic Language Independent Speech Processing
Data-driven speech segmentation
38
Searching in client and world speech dictionaries for speaker verification purposes
39
Fusion
40
Fusion results
41
Voice Transformations and Forgery (occasional, dedicated)
• Isolated individuals with few resources or “professional impostors” with a dedicated budget can threaten the security of speaker recognition systems
• Voice transformation technologies (e.g. segmental synthesis using an inventory of client speech data) are nowadays available
• Speaker recognition research should explicitly address this forgery issue and define appropriate countermeasures
Prevention by predicting many different forgery scenarios
42
Voice Forgery using ALISP
[Diagram: impostor speech (the same words or not) → transformation → client-like speech (the same words or not)]
Voice transformation: a modification of a source speaker's speech to imitate a target speaker
43
Conversion system: ALISP encoder
[Diagram: ALISP encoder. Input: speech. Blocks: MFCC analysis (MFCC + delta), HMM recognition (using HMM models), HNM analysis (harmonic envelope, noise envelope), choice of the best representative unit (from a database of HNM representatives). Outputs: symbol index, representative index, DTW path, prosody (energy + pitch).]
44
Conversion system: ALISP Decoder
[Diagram: ALISP decoder. Inputs: symbol index, representative index, DTW path, prosody (pitch, energy, timing). Blocks: concatenation of HNM parameters for each representative, HNM synthesis. Output: speech signal.]
45
Preliminary results: DET curves
• False acceptance (FA) before forgery: 16 ± 2.0 % (1700 files)
• False acceptance (FA) after forgery: 26 ± 2.0 % (1700 files)
46
Preliminary results
True distributions
47
Multimodal Identity Verification
• M2VTS (face and speech)
  Front view and profile
  Pseudo-3D with coherent light
• BIOMET (face, speech, fingerprint, signature, hand shape)
  Data collection
  Reuse of the M2VTS and DAVID databases
  Experiments on the fusion of modalities
48
Speaking Faces : Motivations
• In many situations a video sequence is acquired
• Fusion of face and speech increases robustness
• Forgery is more difficult
49
Talking Face Recognition (hybrid verification)
50
Lip features
• Tracking lip movements
51
A talking face model
• Using Hidden Markov Models (HMMs)
Acoustic parameters
Visual parameters
52
Imposture Model
53
Cloning
54
Conclusions, Perspectives
• Deliberate imposture is a challenge for speech only systems
• Verification of identity based on features extracted from talking faces should be developed
• Common databases and evaluation protocols are necessary
• Free access to reference systems will facilitate future developments
55
BioSecure Residential Workshop
• Aug. 1st - 26th, 2005 at ENST, Paris
• Reference systems for speech, face, talking face, fingerprint, iris, hand, signature, …
• Comparative evaluations on large databases (BIOMET, BANCA, FVC, …)
• Fusion of modalities
http://www.biosecure.info