ETRW Modelling Pronunciation variation for ASR

ESCA Tutorial & Research Workshop: Modelling pronunciation variation for ASR

INTRODUCING MULTIPLE PRONUNCIATIONS IN SPANISH SPEECH RECOGNITION SYSTEMS

Javier Ferreiros, Javier Macías-Guarasa, José M. Pardo (GTH UPM), Luis Villarrubia (Telefónica I+D)

Presentation Contents

Introduction
The strategy applied
CSR
– Task
– System Architecture
– Results
ISR
– Task
– System Architecture
– Results
Conclusions and Future Work

Introduction (I)

Pronunciation variation: common source of recognition errors

Rule-based strategy to incorporate pronunciation alternatives for Spanish

Phonetic rules for actual speaking habits and context dependencies (not dialectal rules) have been explored

Alternate pronunciations can be found even within the same speaker

Introduction (II)

The lexicon should consider these different possibilities even within the same dialect

It is important to study the impact of the rules on the lexicon

Nearly 20% error rate reduction for the continuous speech task

No significant change for the isolated word hypothesis generator

The strategy applied (I)

Grapheme-to-allophone transcriber for continuous speech and multiple pronunciations

It deals with coarticulation and assimilation effects at word boundaries in continuous speech

Rules are accurate enough for Spanish due to easy transformation from grapheme to allophone

Rules are selected according to expert linguistic knowledge for Castilian Spanish speaking style

The strategy applied (II)

Examples of variations considered:

– DIFFERENT HABITS: examen: /e k s a m e n/
  [e k s á m e~ n] [e s á m e~ n] [e s á m e~ n]

– CONTEXT DEPENDENT: bote: /b o t e/
  un bote: [ú m b ó t e]
  el bote: [e l ó t e]
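
The two kinds of rules above can be sketched in Python. This is a minimal illustration, not the authors' transcriber: the rule table `HABIT_RULES`, the set `BILABIALS` and both function names are assumptions made for this example, and stress and nasalization marks are omitted.

```python
# Hypothetical optional rewrite rules: (pattern, list of allowed realizations).
# Only the /k s/ -> [k s] or [s] habit from the "examen" example is encoded.
HABIT_RULES = [
    (("k", "s"), [("k", "s"), ("s",)]),
]

# Assumed set of word-initial bilabials that trigger /n/ -> [m] assimilation.
BILABIALS = {"b", "p", "m"}


def habit_variants(phones, rules=HABIT_RULES):
    """Return every pronunciation obtained by applying the optional rules."""
    phones = tuple(phones)
    if not phones:
        return [()]
    for pattern, alternatives in rules:
        n = len(pattern)
        if phones[:n] == pattern:
            # Branch on each allowed realization of the matched pattern.
            return [alt + tail
                    for alt in alternatives
                    for tail in habit_variants(phones[n:], rules)]
    # No rule matches here: keep the phone and continue with the rest.
    return [phones[:1] + tail for tail in habit_variants(phones[1:], rules)]


def join_words(left, right):
    """Concatenate two words, assimilating a final /n/ before a bilabial."""
    left, right = list(left), list(right)
    if left and right and left[-1] == "n" and right[0] in BILABIALS:
        left[-1] = "m"
    return left + right


print(habit_variants("e k s a m e n".split()))
print(join_words(["u", "n"], ["b", "o", "t", "e"]))
```

The first call yields one lexicon entry per variant of the word in isolation; the second shows a context-dependent change that only arises when two words are concatenated.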

The strategy applied (III)

We have empirically searched for the minimum number of rules that produces significant improvements to limit the increase in lexicon size (i.e. Perplexity)

For the isolated word hypothesis generator case, further reduction in the number of rules has been necessary in order not to worsen the recognition rates

CSR Task

Domain: Navy Resources Management in Spanish
Speaker Dependent Task
Training: 600 sentences, 4 speakers
Test: 100 sentences, the same 4 speakers
Base dictionary size: 979 words
Extended dictionary size: 1211 words (+23.7%)

CSR System Architecture

One pass algorithm without any grammar

In the lexicon some words have several entries, each with an alternative allophone sequence

(10 MFCC + Energy), delta and delta2 parameter sets in 3 different codebooks with 256 centroids each

Discrete and semicontinuous HMM models for basic allophones (47) and triphones (350)
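
As an illustration of the three-codebook front end, the sketch below quantizes one frame made of three parameter streams, each against its own 256-centroid codebook. It is only a sketch under stated assumptions: the squared Euclidean distance, the function names, and the random placeholder codebooks are not from the slides (real codebooks would be trained on speech data).

```python
import random

NUM_CENTROIDS = 256   # from the slides: 256 centroids per codebook
STREAM_DIM = 11       # 10 MFCC + energy per stream


def nearest_centroid(vector, codebook):
    """Index of the centroid closest to vector (squared Euclidean, assumed)."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize_frame(streams, codebooks):
    """Map each stream (static, delta, delta2) to a symbol of its own codebook."""
    return tuple(nearest_centroid(s, cb) for s, cb in zip(streams, codebooks))


# Placeholder codebooks filled with random values, for demonstration only.
rng = random.Random(0)
codebooks = [
    [[rng.gauss(0.0, 1.0) for _ in range(STREAM_DIM)]
     for _ in range(NUM_CENTROIDS)]
    for _ in range(3)
]

# A frame built from known centroids quantizes back to their indices.
frame = (codebooks[0][5], codebooks[1][17], codebooks[2][200])
print(quantize_frame(frame, codebooks))
```

A discrete HMM would then consume the three symbol indices per frame, one per codebook.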

CSR Results

[Figure: results for the dd, ddcn, sc and sccn models with the Normal and Multiple lexicons (axis range 65 to 85)]

[Figure: % Error Reduction for the dd, ddcn, sc and sccn models (axis range 10 to 20)]

ISR Task

Domain: Proper Names, telephone environment
Hypothesis / Verification scheme
Tested on the Hypothesis Generator so far
Training: 5800 words, 3000 speakers
Test: 2500 words, 2250 speakers
Base dictionary size: 1175 words
Extended dictionary size: 1266 words (+7.7%) with the same rules as in the CSR task, and 1193 words (+1.5%) excluding some rules
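
The lexicon growth figures quoted for both tasks can be checked with a few lines of Python. The function name `relative_growth` is an assumption made for this example; the dictionary sizes are the ones given on the slides.

```python
def relative_growth(base_size, extended_size):
    """Percentage increase in lexicon entries after adding pronunciations."""
    return 100.0 * (extended_size - base_size) / base_size


# (task, base dictionary size, extended dictionary size)
lexicons = [
    ("CSR, all rules", 979, 1211),
    ("ISR, same rules as CSR", 1175, 1266),
    ("ISR, some rules excluded", 1175, 1193),
]

for task, base, extended in lexicons:
    print(f"{task}: +{relative_growth(base, extended):.1f}%")
```

This prints +23.7%, +7.7% and +1.5%, matching the slides. The same rule set expands the proper-name lexicon far less than the CSR lexicon.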

ISR Hypothesis Generator (I)

8 MFCC+Energy, 8 delta MFCC+delta Energy in 2 codebooks of 256 centroids each

The Phonetic String Build-Up (PSBU) module very quickly generates a string of alphabet units (53 allophone-like units)

Lexical Access: DP algorithm to match the phonetic string against the dictionary where multiple pronunciations may be included
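
The lexical access step can be sketched as a dynamic-programming alignment between the phonetic string and every pronunciation in the dictionary. This is a minimal sketch, not the system's implementation: it assumes unit insertion, deletion and substitution costs (the real system uses its own alignment costs), and the toy dictionary and function names are hypothetical.

```python
def edit_distance(observed, reference):
    """DP alignment cost between two phone sequences (unit costs assumed)."""
    prev = list(range(len(reference) + 1))
    for i, obs in enumerate(observed, 1):
        cur = [i]
        for j, ref in enumerate(reference, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (obs != ref)))  # match / substitution
        prev = cur
    return prev[-1]


def lexical_access(phone_string, dictionary, n_best=12):
    """Rank words by their best-matching pronunciation; return the n best."""
    scored = []
    for word, pronunciations in dictionary.items():
        cost = min(edit_distance(phone_string, p) for p in pronunciations)
        scored.append((cost, word))
    scored.sort()
    return [word for _, word in scored[:n_best]]


# Toy dictionary: a word may list several alternative pronunciations.
dictionary = {
    "examen": ["e k s a m e n".split(), "e s a m e n".split()],
    "bote": ["b o t e".split()],
}

print(lexical_access("e s a m e n".split(), dictionary))
```

With the alternative entry present, the reduced form matches "examen" at zero cost; without it, the same input would incur one deletion. Each added entry is also one more candidate that may compete with other words.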

ISR Hypothesis Generator (II)

[Figure: block diagram of the Hypothesis Generator. Blocks: Preprocessing & VQ processes, Phonetic String Build-Up, Lexical Access. Labels: Speech, VQ books, HMMs, Durations, Phonetic string, Alignment costs, Dictionary, Indexes, List of Candidate Words]

ISR Results for the 12 best hypotheses

[Figure: results for the dd and sc models with the 1175-word, 1266-word and 1193-word dictionaries (axis range 70 to 84)]

Conclusions and Future Work (I)

The selection of the appropriate model for each context is important when two words are concatenated for CSR: Rules for different entries depending on context. For ISR these rules are not useful.

The acoustic model may not have enough resolution to take advantage of the alternatives proposed by the rules: these rules should work better in the verifier for ISR.

Conclusions and Future Work (II)

It is important to study the real impact of the rules on the lexicon. For example: Dialectal rules should reduce recognition error rates in a similar way both for CSR and ISR.

We want to test these kinds of rules, plus dialectal variability rules, in the verifier stage of the ISR system.