http://diuf.unifr.ch/diva
Dijana Petrovska-Delacrétaz (1), Asmaa el Hannani (1), Gérard Chollet (2)
1: DIVA Group, University of Fribourg
2: GET-ENST, CNRS-LTCI, Paris
3-4 December 2003, Biometrics Tutorials, Uni. Fribourg
ALISP based improvement of GMM’s for Text-independent Speaker Verification
Biometrics, 3-4 Dec. 2003, Fribourg
Overview
1. Why segmental speaker verification systems?
2. Speech segmentation problems
3. Proposed segmental system based on DTW distance measure
4. Experimental setup
5. Results
6. Conclusions and perspectives
1 Why segmental speaker verification systems?
Current reference speaker verification systems are based on Gaussian Mixture Models (each speech frame is treated independently)
Speech is composed of different sounds
Phonemes have different discriminant characteristics for speaker verification (see Eatock et al. ’94, J. Olsen ’97, Petrovska et al. ’98, 2000…)
Nasals and vowels convey more speaker characteristics than other speech classes
We would like to exploit this fact
We need an automatic speech segmentation tool!
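A minimal sketch of what "each speech frame is treated independently" means for GMM scoring, assuming a diagonal-covariance GMM. The function names and the toy model parameters are illustrative assumptions, not the system described in these slides. The point it shows: the utterance score is an average of per-frame log-likelihoods, so the order of the frames (and therefore any segmental structure) has no effect on it.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_frame_loglik(x, weights, means, variances):
    """Log-likelihood of one frame under the GMM (log-sum-exp over components)."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))

def gmm_avg_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood; each frame is scored on its own."""
    total = sum(gmm_frame_loglik(x, weights, means, variances) for x in frames)
    return total / len(frames)

# Hypothetical toy model: 2 components, 2-dimensional features.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
frames = [[0.1, -0.2], [2.9, 3.1], [0.5, 0.4]]

score = gmm_avg_loglik(frames, weights, means, variances)
# Reversing the frame order leaves the score unchanged (up to rounding).
shuffled = gmm_avg_loglik(frames[::-1], weights, means, variances)
```

A segmental system, by contrast, scores whole segments, so the temporal structure inside each segment can contribute to the decision.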
1.1 Advantages and disadvantages of speech segmentation
Problems:
  Need for a speech segmentation tool
  Speaker modeling per speech class => more data needed
  More complicated systems
Advantages:
  Possibility to use it in combination with dialogue-based systems, for which speech segmentation is already done
  Possibility to introduce text-prompted speaker verification, designed to include a maximum number of speaker-specific units
2 Speech Segmentation
Large Vocabulary Continuous Speech Recognition (LVCSR) system
  good results for a small set of languages
  needs a huge amount of annotated speech data
  language (and task) dependent
  we do not have such a system for American English
2.1 ALISP Speech Segmentation
Data-driven speech segmentation
  not yet usable for speech recognition purposes
  no annotated databases needed
  language and task independent
  we could use it to segment the speech data for a text-independent speaker verification task
We will use the data-driven speech segmentation method ALISP (Automatic Language Independent Speech Processing)
2.2 ALISP principles
3 Proposed speaker verification system: ALISP segments and DTW
3.1 Segmentation problem
Segmentation of the speech data with N ALISP HMM models
N = 64 speech classes
Need for (untranscribed) speech data to train the 64 ALISP HMM models
With so many speech classes we should change the speaker modeling method: there is not enough data for GMM adaptation
=> Use of Dynamic Time Warping (DTW)
3.2 DTW distance measure for speaker verification
Dynamic Time Warping (DTW) was already used for speaker verification, in a text-dependent mode (Rosenberg ’76, Rabiner Schafer ’76, Furui ’81, Pandit and Kittler ’98…)
The DTW distance measure between two speech segments conveys speaker-specific characteristics
Originality: DTW is used in text-independent mode
We first segment the speech data into ALISP classes
We then measure the “distance” between speaker and non-speaker segments
Speaker-specific information is extracted from the ALISP-based speech segments => Client Dictionary
Non-speakers (world speakers): ALISP-based speech segments => World Dictionary
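The DTW distance between two speech segments can be sketched as follows. This is a textbook formulation under stated assumptions: segments are non-empty lists of feature vectors, the local cost is Euclidean, the three standard step directions are allowed, and the total cost is divided by the sum of the two lengths. The slides do not specify the local distance, the path constraints or the normalisation actually used, so those choices are illustrative only.

```python
import math

def euclidean(a, b):
    """Local cost between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seg_a, seg_b):
    """Length-normalised DTW distance between two sequences of feature vectors."""
    n, m = len(seg_a), len(seg_b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i frames of seg_a
    # with the first j frames of seg_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seg_a[i - 1], seg_b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # seg_a advances, seg_b frame repeated
                cost[i][j - 1],      # seg_b advances, seg_a frame repeated
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m] / (n + m)

# A time-stretched copy of a segment aligns at zero cost.
original = [[0.0], [1.0], [2.0]]
stretched = [[0.0], [1.0], [1.0], [2.0]]
same_shape = dtw_distance(original, stretched)
different = dtw_distance([[0.0]], [[3.0]])
```

Because the warping absorbs differences in duration, two realisations of the same ALISP class can be compared even when they differ in length.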
3.3 Searching in the client and world speech dictionaries for speaker verification purposes
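One plausible way to turn a dictionary search into a verification score is sketched below. It is an assumption-laden illustration, not the procedure shown on the original slide: each test segment is compared only with dictionary entries of the same ALISP class, the nearest entry is found in each dictionary, and the score is the average of (world distance minus client distance). The dictionary layout, the function names, the scoring rule and the 1-D toy features are all hypothetical.

```python
def dtw(a, b):
    """Compact DTW on 1-D sequences (absolute difference as local cost)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf]
        for j, y in enumerate(b, start=1):
            cur.append(abs(x - y) + min(prev[j], prev[j - 1], cur[j - 1]))
        prev = cur
    return prev[-1] / (len(a) + len(b))

def nearest(segment, entries):
    """Distance from a segment to its closest dictionary entry."""
    return min(dtw(segment, entry) for entry in entries)

def verification_score(test_segments, client_dict, world_dict):
    """Mean of (world distance - client distance) over the test segments.

    Positive values mean the test speech lies closer to the client
    dictionary than to the world dictionary.
    """
    diffs = []
    for label, seg in test_segments:
        # Skip classes that are missing from either dictionary.
        if label in client_dict and label in world_dict:
            diffs.append(
                nearest(seg, world_dict[label]) - nearest(seg, client_dict[label])
            )
    if not diffs:
        raise ValueError("no test segment has a class present in both dictionaries")
    return sum(diffs) / len(diffs)

# Hypothetical dictionaries keyed by ALISP class label.
client = {"Ha": [[1.0, 2.0, 3.0]], "Hf": [[5.0, 5.0]]}
world = {"Ha": [[4.0, 4.0, 4.0]], "Hf": [[0.0, 0.0]]}
test = [("Ha", [1.0, 2.0, 2.0, 3.0]), ("Hf", [5.0, 5.0, 5.0])]

score = verification_score(test, client, world)
```

Under this convention the claimed identity would be accepted when the score exceeds a threshold tuned on development data.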
4 Evaluation of the proposed system: experimental setup
Development data: one subset from NIST 2002 cellular data (American English)
  world speakers (60 female + 59 male): used to train the ALISP speech segmenter and to model the non-speakers (world speakers)
Evaluated on another subset from NIST 2002 (111 + 79 male speakers)
4.1 Speech segmentation example
Two other occurrences of the English phone: ay
The corresponding ALISP sequences: HX - Hf and (HM) - Hf - Ha
Previous slide: (Hf) - Ha or (HM) - HZ - Ha
4.2 Results: GMM, ALISP-DTW systems and their fusion
4.3 Results: EER comparison
System                                    EER %
ALISP-DTW                                 22.7
GMM                                       17.4
Linear fusion (no score normalization)    18.9
LR fusion (no score normalization)        13
LR fusion (normalized scores)             12.6
Linear fusion (normalized scores)         12.2
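For readers unfamiliar with the metric, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below approximates it with a simple threshold sweep over the observed scores. It assumes that higher scores mean "more likely the client"; the function name and the toy scores are illustrative, and this is not the scoring tool used to produce the figures above.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER by sweeping the threshold over all observed scores.

    A trial is accepted when its score is >= the threshold. Returns the
    mean of the two error rates at the threshold where they are closest.
    """
    best_gap, eer = None, None
    for thr in sorted(set(target_scores) | set(impostor_scores)):
        # False rejection: client trials scored below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance: impostor trials scored at or above the threshold.
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

# Hypothetical trial scores.
targets = [0.9, 0.8, 0.7, 0.4]
impostors = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(targets, impostors)
```

With these toy scores one client trial falls below one impostor trial, so both error rates meet at one error in four.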
4.4 Importance of fusion (33% improvement)
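A minimal sketch of linear score fusion with score normalisation, to illustrate why normalisation can matter when two systems produce scores on different scales. The assumptions are mine: each system's scores are z-normalised with the mean and standard deviation of its own development scores, both systems use the convention "higher means more client-like", and the fusion weight of 0.5 is arbitrary. The slides do not state which normalisation or weights were used.

```python
import statistics

def z_normalise(scores, dev_scores):
    """Shift and scale scores using statistics estimated on development data.

    Assumes dev_scores is not constant (non-zero standard deviation).
    """
    mu = statistics.mean(dev_scores)
    sigma = statistics.pstdev(dev_scores)
    return [(s - mu) / sigma for s in scores]

def linear_fusion(scores_a, scores_b, weight_a=0.5):
    """Weighted sum of two systems' scores for the same list of trials."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(scores_a, scores_b)]

# Hypothetical scores: the two systems live on very different scales.
gmm_dev = [-2.0, 0.0, 2.0]
dtw_dev = [10.0, 20.0, 30.0]
gmm_test = [2.0, -2.0]
dtw_test = [30.0, 10.0]

gmm_z = z_normalise(gmm_test, gmm_dev)
dtw_z = z_normalise(dtw_test, dtw_dev)
fused = linear_fusion(gmm_z, dtw_z)

# Without normalisation the larger-scale system dominates the sum.
fused_raw = linear_fusion(gmm_test, dtw_test)
```

After normalisation both systems contribute comparably to the fused score, whereas in `fused_raw` the values are driven almost entirely by the system with the larger numeric range.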
4.5 Using only GMM scores on segments => segmental GMM system
5 Conclusions
State-of-the-art NIST 2002 results for EER: best 8% to worst 28%
Fusion of a classical system with a segmental system: big improvements
Why: the higher-level information present in the segmental system usefully complements the short-term frequency information present in the GMM system