http://diuf.unifr.ch/diva

Dijana Petrovska-Delacrétaz 1, Asmaa el Hannani 1, Gérard Chollet 2

1: DIVA Group, University of Fribourg
2: GET-ENST, CNRS-LTCI, Paris

3-4 December 2003, Biometrics Tutorials, Uni. Fribourg

ALISP-based improvement of GMMs for Text-independent Speaker Verification


Overview

1. Why segmental speaker verification systems?

2. Speech segmentation problems

3. Proposed segmental system based on DTW distance measure

4. Experimental setup

5. Results

6. Conclusions and perspectives


1 Why segmental speaker verification systems?

Current reference speaker verification systems are based on Gaussian Mixture Models (each speech frame is treated independently)

Speech is composed of different sounds

Phonemes have different discriminant characteristics for speaker verification (see Eatock et al. '94, J. Olsen '97, Petrovska et al. '98, 2000…)

Nasals and vowels convey more speaker characteristics than other speech classes

We would like to exploit this fact

We need an automatic speech segmentation tool!


1.1 Advantages and disadvantages of the speech segmentation

Problems:

Need for a speech segmentation tool

Speaker modeling per speech class => more data needed

More complicated systems

Advantages:

Possibility to use it in combination with dialogue-based systems, for which a speech segmentation is already done

Possibility to introduce text-prompted speaker verification, designed to include a maximum number of speaker-specific units


2 Speech Segmentation

Large Vocabulary Continuous Speech Recognition (LVCSR) System

good results for a small set of languages

needs a huge amount of annotated speech data

language (and task) dependent

we do not have such a system for American English


2.1 ALISP Speech Segmentation

Data-driven speech segmentation

not yet usable for speech recognition purposes

no annotated databases needed

language and task independent

we could use it to segment the speech data for a text-independent speaker verification task

We will use the data-driven speech segmentation method ALISP (Automatic Language Independent Speech Processing)


2.2 ALISP principles


3 Proposed speaker verification system: ALISP segments and DTW

3.1 Segmentation problem

Segmentation of the speech data with N ALISP HMM models

N = 64 speech classes

Need for (not transcribed) speech data to train the 64 ALISP HMM models

With so many speech classes, the speaker modeling method should change, since there is not enough data for GMM adaptation => use of Dynamic Time Warping (DTW)


3.2 DTW distance measure for speaker verification

Dynamic Time Warping (DTW) was already used for speaker verification, in a text-dependent mode (Rosenberg '76, Rabiner and Schafer '76, Furui '81, Pandit and Kittler '98…)

The DTW distance measure between two speech segments conveys speaker-specific characteristics

Originality: DTW is used in text-independent mode

We first segment the speech data into ALISP classes

Then we measure the "distance" between speaker and non-speaker segments

Speaker-specific information is extracted from the ALISP-based speech segments => Client Dictionary

Non-speakers (world speakers): ALISP-based speech segments => World Dictionary
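
As an illustration of the DTW distance between two speech segments, here is a minimal Python sketch. It is not the system from these slides: the Euclidean frame distance, the three allowed path steps, the normalization by the combined segment length, and the function name `dtw_distance` are all assumptions chosen for the example. Each segment is assumed to be an array of feature vectors, one row per frame.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two feature sequences of shape (frames, dims).

    Assumptions for this sketch: Euclidean local cost, steps limited to
    (i-1, j), (i, j-1), (i-1, j-1), and normalization by n + m.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # Local cost: Euclidean distance between every pair of frames.
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # acc[i, j] = cost of the best warping path aligning a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j],      # advance in a only
                acc[i, j - 1],      # advance in b only
                acc[i - 1, j - 1],  # advance in both
            )
    # Divide by the combined length so that segments of different
    # durations give comparable values.
    return acc[n, m] / (n + m)

# Two segments with the same content at different speeds align at zero cost.
slow = [[0.0], [1.0], [1.0], [2.0]]
fast = [[0.0], [1.0], [2.0]]
print(dtw_distance(slow, fast))
```

The warping is what makes the measure usable on segments of unequal duration, which is the usual situation when two occurrences of the same ALISP class are compared.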


3.3 Searching in the client and world speech dictionaries for speaker verification purposes
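
One possible way to search the client and world dictionaries is sketched below in Python. The slides do not give the scoring rule, so everything here is an assumption made for illustration: each test segment is compared only with dictionary segments of the same ALISP class, the nearest segment in each dictionary is kept, and the score is the average of (world distance minus client distance). The names `verification_score`, `nearest_distance`, and `mean_vector_distance` are hypothetical; the last one is only a simple stand-in for the DTW distance so that the example stays short.

```python
import numpy as np

def nearest_distance(segment, candidates, distance):
    """Smallest distance between `segment` and any candidate segment."""
    return min(distance(segment, c) for c in candidates)

def verification_score(test_segments, client_dict, world_dict, distance):
    """Average of (world distance - client distance) over the test segments.

    test_segments: list of (alisp_label, features) pairs.
    client_dict, world_dict: dict mapping an ALISP label to a list of segments.
    A positive score means the test speech lies closer to the client
    dictionary than to the world dictionary (assumed scoring rule).
    """
    diffs = []
    for label, feats in test_segments:
        # Skip classes missing from either dictionary: nothing to compare.
        if not client_dict.get(label) or not world_dict.get(label):
            continue
        d_client = nearest_distance(feats, client_dict[label], distance)
        d_world = nearest_distance(feats, world_dict[label], distance)
        diffs.append(d_world - d_client)
    return float(np.mean(diffs)) if diffs else 0.0

def mean_vector_distance(a, b):
    """Stand-in distance (assumption): Euclidean distance between mean frames."""
    return float(np.linalg.norm(np.mean(a, axis=0) - np.mean(b, axis=0)))

# Toy dictionaries keyed by ALISP labels; the feature values are made up.
client = {"Hf": [np.zeros((3, 2))], "Ha": [np.ones((4, 2))]}
world = {"Hf": [np.full((3, 2), 5.0)], "Ha": [np.full((2, 2), 4.0)]}
test = [("Hf", np.zeros((2, 2))), ("Ha", np.ones((3, 2)))]
print(verification_score(test, client, world, mean_vector_distance))
```

Passing the distance as a parameter keeps the search logic separate from the choice of segment distance, so a DTW distance can be plugged in without changing the search.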


4 Evaluation of the proposed system: experimental setup

Development data: one subset from NIST 2002 cellular data (American English)

World speakers (60 female + 59 male): used to train the ALISP speech segmenter and to model the non-speakers (world speakers)

Evaluated on another subset from NIST 2002 (111 + 79 male speakers)


4.1 Speech segmentation example

Two other occurrences of the English phone ay;

the corresponding ALISP sequences: HX - Hf and (HM) - Hf - Ha; previous slide: (Hf) - Ha or (HM) - HZ - Ha


4.2 Results: GMM, ALISP-DTW systems and their fusion


4.3 Results: EER comparison

System                                  EER %
ALISP-DTW                               22.7
GMM                                     17.4
Linear fusion (no score normalization)  18.9
LR fusion (no score normalization)      13
LR fusion (normalized scores)           12.6
Linear fusion (normalized scores)       12.2
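
To make the quantities in this comparison concrete, the following Python sketch shows one way to compute an equal error rate (EER) and to fuse two systems linearly after score normalization. The slides do not say which normalization or fusion weights were used, so the z-normalization against a reference score set, the equal default weight, and all function names are assumptions for illustration.

```python
import numpy as np

def equal_error_rate(client_scores, impostor_scores):
    """Approximate EER: try every observed score as a threshold and return the
    point where false acceptance and false rejection rates are closest."""
    client = np.asarray(client_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.concatenate([client, impostor])):
        far = np.mean(impostor >= t)  # impostor trials wrongly accepted
        frr = np.mean(client < t)     # client trials wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return float(eer)

def z_normalize(scores, reference):
    """Shift and scale scores with the mean and standard deviation of a
    reference score set (assumed here to come from development data)."""
    reference = np.asarray(reference, dtype=float)
    return (np.asarray(scores, dtype=float) - reference.mean()) / reference.std()

def linear_fusion(scores_a, scores_b, weight=0.5):
    """Weighted sum of two systems' scores for the same trials."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    return weight * a + (1.0 - weight) * b

# Made-up scores: higher means "more likely the claimed speaker".
print(equal_error_rate([2.0, 3.0, 4.0], [0.0, 1.0, 2.5]))
```

Normalizing before a linear fusion puts the two systems' scores on a comparable scale, so that neither system dominates the weighted sum merely because its scores span a wider range.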


4.4 Importance of fusion (33% improvement)


4.5 Using only GMM scores on the segments => segmental GMM system


5 Conclusions

State-of-the-art NIST 2002 results for EER: best 8% to worst 28%

Fusion of a classical system with a segmental system: big improvements

Why: the higher-level information present in the segmental system usefully complements the short-term frequency information present in the GMM system
