ETRW Modelling Pronunciation variation for ASR
ESCA Tutorial & Research Workshop

INTRODUCING MULTIPLE PRONUNCIATIONS IN SPANISH SPEECH RECOGNITION SYSTEMS

Javier Ferreiros, Javier Macías-Guarasa, José M. Pardo (GTH UPM), Luis Villarrubia (Telefónica I+D)


Presentation Contents

Introduction
The strategy applied
CSR: Task, System Architecture, Results
ISR: Task, System Architecture, Results
Conclusions and Future Work


Introduction (I)

Pronunciation variation: common source of recognition errors

Rule-based strategy to incorporate pronunciation alternatives for Spanish

Phonetic rules for actual speaking habits and context dependencies (not dialectal variation) have been explored

Alternate pronunciations can be found even within the same speaker


Introduction (II)

The lexicon should consider these different possibilities even within the same dialect

It is important to study the impact of the rules on the lexicon

Nearly 20% error rate reduction for the continuous speech task

No significant change for the isolated word hypothesis generator


The strategy applied (I)

Grapheme-to-allophone transcriber for continuous speech and multiple pronunciations

It handles coarticulation and assimilation effects at word boundaries in continuous speech

Rules are accurate enough for Spanish because the grapheme-to-allophone transformation is straightforward

Rules are selected according to expert linguistic knowledge of the Castilian Spanish speaking style
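As a minimal sketch of how optional rules can turn one phonemic transcription into several lexicon entries, the Python below applies each rewrite rule at any subset of its matches. The rule table is hypothetical: its single rule mimics the /k s/ → [s] speaking habit shown in the examples of variations considered, stress and nasalization marks are ignored, and the function names are illustrative. The authors' actual transcriber and rule set are not reproduced here.

```python
from typing import List, Sequence, Set, Tuple

Phones = Tuple[str, ...]

# Hypothetical optional rules (pattern -> replacement), not the authors' rule set.
# The single rule mimics the /k s/ -> [s] speaking habit.
OPTIONAL_RULES: List[Tuple[Phones, Phones]] = [
    (("k", "s"), ("s",)),
]


def apply_optionally(phones: Phones, pattern: Phones, repl: Phones) -> Set[Phones]:
    """All strings obtained by rewriting `pattern` as `repl` at any subset of its matches."""
    if not phones:
        return {()}
    results: Set[Phones] = set()
    # Option 1: keep the first phone unchanged and recurse on the rest.
    for rest in apply_optionally(phones[1:], pattern, repl):
        results.add((phones[0],) + rest)
    # Option 2: if a match starts here, rewrite it and recurse after the match.
    if phones[:len(pattern)] == pattern:
        for rest in apply_optionally(phones[len(pattern):], pattern, repl):
            results.add(repl + rest)
    return results


def expand(phones: Sequence[str], rules=OPTIONAL_RULES) -> List[Phones]:
    """Return every pronunciation variant produced by the optional rules, sorted."""
    variants: Set[Phones] = {tuple(phones)}
    for pattern, repl in rules:
        variants = {out for v in variants for out in apply_optionally(v, pattern, repl)}
    return sorted(variants)


# Build a small multiple-pronunciation lexicon: one entry per variant.
lexicon = {"examen": expand("e k s a m e n".split())}
print(lexicon)
```

Each rule at most doubles the variants per match, which is why the number of rules has to be kept small to limit lexicon growth.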


The strategy applied (II)

Examples of variations considered:

– DIFFERENT HABITS: examen: /e k s a m e n/
  [e k s á m e~ n] [e s á m e~ n] [e s á m e~ n]

– CONTEXT DEPENDENT: bote: /b o t e/
  un bote: [ú m b ó t e]
  el bote: [e l ó t e]
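The context-dependent case differs from the speaking-habit case because it can only be resolved when two words are concatenated. The Python below is a minimal illustration, assuming plain phoneme lists without stress marks; it implements only the nasal place assimilation visible in "un bote" ([ú m b ó t e]) and none of the other cross-word effects handled by the transcriber. The function name and phone symbols are hypothetical.

```python
from typing import List, Optional, Sequence

BILABIALS = {"p", "b", "m"}


def join_words(words: Sequence[Sequence[str]]) -> List[str]:
    """Concatenate word transcriptions, applying one cross-word rule:
    a word-final /n/ is realized as [m] before a word-initial bilabial."""
    out: List[str] = []
    for i, word in enumerate(words):
        phones = list(word)
        nxt: Optional[Sequence[str]] = words[i + 1] if i + 1 < len(words) else None
        if phones and nxt and phones[-1] == "n" and nxt[0] in BILABIALS:
            phones[-1] = "m"  # place assimilation to the following bilabial
        out.extend(phones)
    return out


print(join_words([["u", "n"], ["b", "o", "t", "e"]]))
```

In a one-pass decoder, one way to obtain the same effect is to keep both the [n]-final and the [m]-final entries of a word and let the following word's context select between them.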


The strategy applied (III)

We empirically searched for the minimum number of rules that produces significant improvements, so as to limit the increase in lexicon size (and thus in perplexity)

For the isolated word hypothesis generator, the number of rules had to be reduced further to avoid worsening recognition rates


CSR Task

Domain: Navy resources management, in Spanish

Speaker-dependent task

Training: 600 sentences, 4 speakers

Test: 100 sentences, the same 4 speakers

Base dictionary size: 979 words

Extended dictionary size: 1211 words (+23.7%)
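The +23.7% figure is the relative increase in dictionary entries once alternative pronunciations are added. A quick check in Python (the helper name is hypothetical):

```python
def relative_growth(base_size: int, extended_size: int) -> float:
    """Percentage increase in dictionary entries after adding pronunciation variants."""
    return 100.0 * (extended_size - base_size) / base_size


# CSR figures from this slide: 979 base words, 1211 entries with variants.
print(f"CSR lexicon growth: {relative_growth(979, 1211):.1f}%")
print(f"Average entries per word: {1211 / 979:.2f}")
```

The same helper reproduces the percentages quoted for the ISR dictionaries.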


CSR System Architecture

One-pass algorithm without any grammar

In the lexicon, some words have several entries, each with an alternative allophone sequence

Static (10 MFCC + energy), delta and delta2 parameter sets, quantized in 3 different codebooks with 256 centroids each

Discrete and semicontinuous HMMs for basic allophones (47) and triphones (350)
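To illustrate the discrete front end, each frame's three parameter sets can be quantized independently against their own codebooks, and a discrete HMM state then scores the three resulting symbols. The Python sketch below assumes squared Euclidean distance for quantization and statistically independent codebooks (emission probability = product of per-codebook probabilities), a common choice for multi-codebook discrete HMMs that the slides do not confirm. The toy codebooks have 2 centroids instead of 256, all names are hypothetical, and the semicontinuous variant is not shown.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def nearest_centroid(x: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the centroid closest to x under squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda k: sum((xi - ci) ** 2 for xi, ci in zip(x, codebook[k])))


def quantize_frame(streams: Sequence[Vector],
                   codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize each parameter set of one frame (static, delta, delta2) with its own codebook."""
    return [nearest_centroid(x, cb) for x, cb in zip(streams, codebooks)]


def log_emission(symbols: Sequence[int], state_probs: Sequence[Sequence[float]]) -> float:
    """Discrete-HMM state log-likelihood, assuming independent codebooks:
    log b(o) = sum over codebooks c of log P_c(symbol_c)."""
    return sum(math.log(probs[s]) for s, probs in zip(symbols, state_probs))


# Toy example: 3 codebooks with 2 centroids each (the real system uses 256).
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0], [5.0]], [[-1.0], [1.0]]]
frame = [[0.9, 0.8], [4.0], [-0.5]]
symbols = quantize_frame(frame, codebooks)
state = [[0.25, 0.75], [0.5, 0.5], [0.8, 0.2]]
print(symbols, log_emission(symbols, state))
```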


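CSR Results

[Figure: left panel, results (vertical axis 65 to 85) for the dd, ddcn, sc and sccn systems with the Normal and Multiple pronunciation lexicons; right panel, % Error Reduction (vertical axis 10 to 20) for the same four systems.]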
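The error-reduction panel compares the Multiple lexicon against the Normal one. The slides do not state the formula, so the Python below assumes the usual relative definition, (baseline error − new error) / baseline error; the function name is hypothetical, and the example error rates are invented for illustration, not read from the chart.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Relative error reduction (%) of a new system with respect to a baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline error rate must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical error rates (not taken from the figure): 25% -> 20%.
print(relative_error_reduction(25.0, 20.0))
```

Under this definition, a drop from 25% to 20% error is a 5-point absolute but a 20% relative reduction.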


ISR Task

Domain: proper names, telephone environment

Hypothesis/verification scheme; only the hypothesis generator has been tested so far

Training: 5800 words, 3000 speakers

Test: 2500 words, 2250 speakers

Base dictionary size: 1175 words

Extended dictionary size: 1266 words (+7.7%) with the same rules as in the CSR task, and 1193 words (+1.5%) excluding some rules


ISR Hypothesis Generator (I)

8 MFCC + energy and 8 delta MFCC + delta energy, quantized in 2 codebooks of 256 centroids each

Phonetic String Build-Up (PSBU) generates a string of alphabet units (53 allophone-like units) very quickly

Lexical access: a DP algorithm matches the phonetic string against the dictionary, which may include multiple pronunciations per word
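The lexical access step can be sketched as a dynamic-programming alignment of the recognized phonetic string against every pronunciation of every word, with each word scored by its best-matching variant. The Python below is a minimal sketch under stated assumptions: it uses uniform insertion, deletion and substitution costs, whereas the real system relies on its own alignment costs, and it scans the whole dictionary instead of using any index. Names and the toy lexicon are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def align_cost(hyp: Sequence[str], ref: Sequence[str],
               sub: float = 1.0, ins: float = 1.0, dele: float = 1.0) -> float:
    """Minimum alignment cost (weighted edit distance) between the recognized
    phonetic string `hyp` and one dictionary pronunciation `ref`."""
    # Row for the empty reference prefix: every hyp phone is an insertion.
    prev = [j * ins for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [i * dele] + [0.0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            match = 0.0 if ref[i - 1] == hyp[j - 1] else sub
            cur[j] = min(prev[j - 1] + match,  # match or substitution
                         prev[j] + dele,        # reference phone missing from hyp
                         cur[j - 1] + ins)      # extra phone in hyp
        prev = cur
    return prev[-1]


def lexical_access(hyp: Sequence[str],
                   lexicon: Dict[str, List[Sequence[str]]],
                   n_best: int) -> List[Tuple[str, float]]:
    """Score every word by its best-matching pronunciation and keep the n_best."""
    scored = sorted((min(align_cost(hyp, v) for v in variants), word)
                    for word, variants in lexicon.items())
    return [(word, cost) for cost, word in scored[:n_best]]


# Hypothetical toy lexicon with two pronunciations for "examen".
lexicon = {
    "examen": ["e k s a m e n".split(), "e s a m e n".split()],
    "bote": ["b o t e".split()],
}
print(lexical_access("e s a m e n".split(), lexicon, n_best=2))
```

Taking the minimum over variants is what lets an extra dictionary entry help: a reduced pronunciation in the phonetic string matches its own variant at zero cost instead of paying a deletion.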


ISR Hypothesis Generator (II)

[Figure: Hypothesis Generator block diagram. Speech → Preprocessing & VQ processes → Phonetic String Build-Up → phonetic string → Lexical Access → list of candidate words. Other labelled inputs: HMMs, VQ books, durations, alignment costs, dictionary, indexes.]


ISR Results for the 12 Best Hypotheses

[Figure: results for the 12 best hypotheses (vertical axis 70 to 84) with the 1175-word, 1266-word and 1193-word dictionaries, for the dd and sc systems.]
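The slides do not define the plotted measure. Assuming it is the usual top-N inclusion rate of a hypothesis generator (the percentage of test words whose correct entry appears among the 12 candidates passed to the verifier), it can be computed as in this Python sketch; the function name and test data are hypothetical.

```python
from typing import Sequence


def top_n_inclusion_rate(candidate_lists: Sequence[Sequence[str]],
                         references: Sequence[str], n: int = 12) -> float:
    """Percentage of test items whose correct word is among the first n candidates."""
    if len(candidate_lists) != len(references) or not references:
        raise ValueError("need one candidate list per reference and at least one reference")
    hits = sum(ref in list(cands)[:n] for cands, ref in zip(candidate_lists, references))
    return 100.0 * hits / len(references)


# Hypothetical candidate lists for two test words.
print(top_n_inclusion_rate([["Ana", "Juan"], ["Pedro", "Luis"]], ["Juan", "Maria"], n=2))
```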


Conclusions and Future Work (I)

For CSR, selecting the appropriate model for each context is important when two words are concatenated: the rules produce different entries depending on the context. For ISR, these context rules are not useful, since words are spoken in isolation.

The acoustic models may not have enough resolution to take advantage of the alternatives proposed by the rules; these rules should work better in the verifier of the ISR system.


Conclusions and Future Work (II)

It is important to study the real impact of the rules on the lexicon. For example, dialectal rules should reduce recognition error rates similarly for both CSR and ISR.

We plan to test these kinds of rules, together with dialectal variability rules, in the verifier stage of the ISR system.
