ETRW Modelling Pronunciation variation for ASR
ESCA Tutorial & Research Workshop: Modelling pronunciation variation for ASR
INTRODUCING MULTIPLE PRONUNCIATIONS IN SPANISH SPEECH RECOGNITION SYSTEMS
Javier Ferreiros, Javier Macías-Guarasa, José M. Pardo (GTH UPM), Luis Villarrubia (Telefónica I+D)
Presentation Contents
Introduction
The strategy applied
CSR: Task, System Architecture, Results
ISR: Task, System Architecture, Results
Conclusions and Future Work
Introduction (I)
Pronunciation variation: common source of recognition errors
Rule-based strategy to incorporate pronunciation alternatives for Spanish
Phonetic rules for actual speaking habits and context dependencies (not dialectal ones) have been explored
Alternate pronunciations can be found even within the same speaker
Introduction (II)
The lexicon should consider these different possibilities even within the same dialect
It is important to study the impact of the rules on the lexicon
Nearly 20% error rate reduction for the continuous speech task
No significant change for the isolated word hypothesis generator
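The slides do not state how the error rate reduction is computed. Assuming the usual relative definition, with $E_{\text{base}}$ the error rate with the base lexicon and $E_{\text{multi}}$ the error rate with the multiple-pronunciation lexicon (both symbols are introduced here for illustration only), it would read:

```latex
\[
  \text{error reduction (\%)} \;=\; 100 \times \frac{E_{\text{base}} - E_{\text{multi}}}{E_{\text{base}}}
\]
```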
The strategy applied (I)
Grapheme-to-allophone transcriber for continuous speech and multiple pronunciations
It deals with coarticulation and assimilation effects at word boundaries for continuous speech
Rules are accurate enough for Spanish because the transformation from grapheme to allophone is straightforward
Rules are selected according to expert linguistic knowledge of the Castilian Spanish speaking style
The strategy applied (II)
Examples of variations considered:
– DIFFERENT HABITS: exámen: /e k s a m e n/
  [e k s á m e~ n] [e s á m e~ n] [e s á m e~ n]
– CONTEXT DEPENDENT: bote: /b o t e/
  un bote: [ú m b ó t e]
  el bote: [e l ó t e]
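The two kinds of variation above can be sketched as follows. This is a minimal illustration, not the authors' transcriber: the rule table, the function names and the plain phone symbols (no stress or nasalization marks) are all assumptions made for the example. It covers only the "ks" reduction seen in the first example and the /n/ to [m] assimilation seen in "un bote".

```python
from itertools import product

# Hypothetical optional rule, modelled on the slide's first example:
# the sequence "k s" may be kept or reduced to "s".
OPTIONAL_RULES = {
    ("k", "s"): [("k", "s"), ("s",)],
}

def expand(phones):
    """Return every pronunciation obtained by optionally applying the rules."""
    slots = []
    i = 0
    while i < len(phones):
        for pattern, alternatives in OPTIONAL_RULES.items():
            if tuple(phones[i:i + len(pattern)]) == pattern:
                slots.append(alternatives)      # several choices at this position
                i += len(pattern)
                break
        else:
            slots.append([(phones[i],)])        # no rule applies: keep the phone
            i += 1
    # Combine one choice per slot and flatten it back into a phone list.
    return [[p for part in choice for p in part] for choice in product(*slots)]

def join_words(left, right):
    """Hypothetical boundary rule: word-final /n/ becomes [m] before /b/."""
    if left and right and left[-1] == "n" and right[0] == "b":
        return left[:-1] + ["m"] + right
    return left + right

variants = expand("e k s a m e n".split())
joined = join_words(["u", "n"], "b o t e".split())
```

Here `variants` holds the full and the reduced form of the word, each of which would become a separate lexicon entry, while `joined` shows a variant that only arises in the context of the preceding word.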
The strategy applied (III)
We have empirically searched for the minimum number of rules that produces significant improvements, in order to limit the increase in lexicon size (i.e. perplexity)
For the isolated word hypothesis generator, the number of rules had to be reduced further so as not to worsen the recognition rates
CSR Task
Domain: Navy Resources Management in Spanish
Speaker-dependent task
Training: 600 sentences, 4 speakers
Test: 100 sentences, the same 4 speakers
Base dictionary size: 979 words
Extended dictionary size: 1211 words (+23.7%)
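The lexicon growth quoted for the extended dictionary follows directly from the two sizes. A small check, where the function name is chosen here for illustration:

```python
def growth_percent(base_size, extended_size):
    """Relative increase of the lexicon size, in percent."""
    return 100.0 * (extended_size - base_size) / base_size

# Sizes taken from the CSR task description.
csr_growth = growth_percent(979, 1211)
print(f"CSR lexicon growth: {csr_growth:.1f}%")  # prints "CSR lexicon growth: 23.7%"
```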
CSR System Architecture
One-pass algorithm without any grammar
In the lexicon some words have several entries, each with an alternative allophone sequence
(10 MFCC + Energy), delta and delta2 parameter sets in 3 different codebooks with 256 centroids each
Discrete and semicontinuous HMM models for basic allophones (47) and triphones (350)
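For the discrete models, each parameter set is mapped to a codebook symbol before reaching the HMMs. The sketch below shows nearest-centroid vector quantization with one codebook per parameter stream. It assumes a squared Euclidean distance and uses tiny made-up codebooks (3 centroids in 2 dimensions instead of 256 centroids over the real parameter vectors); the slides do not specify the distance measure, and all names are illustrative.

```python
def nearest_centroid(vector, codebook):
    """Index of the centroid closest to `vector` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))

def quantize_frame(streams, codebooks):
    """Map one frame, split into parameter streams, to one symbol per codebook."""
    return tuple(nearest_centroid(v, cb) for v, cb in zip(streams, codebooks))

# Toy codebooks, one per stream; the values are invented for the example.
toy_codebooks = [
    [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)],    # static parameters
    [(0.0, 1.0), (1.0, 0.0), (2.0, 2.0)],    # delta parameters
    [(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0)],   # delta2 parameters
]
frame = [(0.9, 1.2), (0.1, 0.8), (0.7, 0.1)]
symbols = quantize_frame(frame, toy_codebooks)  # one codebook index per stream
```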
CSR Results
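[Figure: bar chart comparing the Normal and Multiple lexicons for the dd, ddcn, sc and sccn configurations; vertical axis from 65 to 85]
[Figure: % Error Reduction for the dd, ddcn, sc and sccn configurations; vertical axis from 10 to 20]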
ISR Task
Domain: Proper Names, telephone environment
Hypothesis / Verification scheme
Tested on the Hypothesis Generator so far
Training: 5800 words, 3000 speakers
Test: 2500 words, 2250 speakers
Base dictionary size: 1175 words
Extended dictionary size: 1266 words (+7.7%) with the same rules as in the CSR task, and 1193 words (+1.5%) excluding some rules
ISR Hypothesis Generator (I)
8 MFCC+Energy, 8 delta MFCC+delta Energy in 2 codebooks of 256 centroids each
PSBU (Phonetic String Build-Up) very quickly generates a string of units from an alphabet of 53 allophone-like units
Lexical Access: DP algorithm to match the phonetic string against the dictionary where multiple pronunciations may be included
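The lexical access step can be illustrated with a plain edit-distance alignment, taking for each word the best cost over all of its pronunciations. This is only a sketch under stated assumptions: the real system presumably uses its own alignment costs, whereas uniform costs of 1.0 are assumed here, and the toy lexicon and function names are invented for the example.

```python
def alignment_cost(observed, reference, sub=1.0, ins=1.0, dele=1.0):
    """Edit-distance style DP cost between two unit sequences."""
    prev = [j * dele for j in range(len(reference) + 1)]
    for i, o in enumerate(observed, start=1):
        curr = [i * ins]
        for j, r in enumerate(reference, start=1):
            curr.append(min(
                prev[j - 1] + (0.0 if o == r else sub),  # match or substitution
                prev[j] + ins,                            # extra observed unit
                curr[j - 1] + dele,                       # missing reference unit
            ))
        prev = curr
    return prev[-1]

def lexical_access(observed, lexicon, n_best=12):
    """Rank words by the lowest cost among their pronunciations."""
    scored = []
    for word, pronunciations in lexicon.items():
        best = min(alignment_cost(observed, p) for p in pronunciations)
        scored.append((best, word))
    scored.sort()
    return [word for _, word in scored[:n_best]]

# Toy lexicon: one word with two pronunciations, one word with a single entry.
toy_lexicon = {
    "examen": [["e", "k", "s", "a", "m", "e", "n"], ["e", "s", "a", "m", "e", "n"]],
    "bote": [["b", "o", "t", "e"]],
}
candidates = lexical_access("e s a m e n".split(), toy_lexicon)
```

With the reduced pronunciation in the lexicon, the observed string matches "examen" at zero cost; with only the full form it would still match, but at the cost of one missing unit.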
ISR Hypothesis Generator (II)
[Block diagram of the Hypothesis Generator: Speech → Preprocessing & VQ processes → Phonetic String Build-Up → phonetic string → Lexical Access → list of candidate words. Other labelled elements: VQ books, HMMs, Durations, Alignment costs, Dictionary, Indexes]
ISR Results for the 12 best hypotheses
[Figure: bar chart for the dd and sc configurations with the 1175-word, 1266-word and 1193-word dictionaries; vertical axis from 70 to 84]
Conclusions and Future Work (I)
Selecting the appropriate model for each context is important when two words are concatenated in CSR: the rules produce different entries depending on context. For ISR these rules are not useful.
The acoustic model may not have enough resolution to take advantage of the alternatives proposed by the rules: these rules should work better in the verifier for ISR.
Conclusions and Future Work (II)
It is important to study the real impact of the rules on the lexicon. For example, dialectal rules should reduce recognition error rates in a similar way for both CSR and ISR.
We want to test this kind of rule, plus dialectal variability rules, on the verifier stage of the ISR system.