Towards speaker and environmental robustness in ASR: the HIWIRE project
A. Potamianos1, G. Bouselmi2, D. Dimitriadis3, D. Fohr2, R. Gemello4, I. Illina2, F. Mana4, P. Maragos3,
M. Matassoni5, V. Pitsikalis3, J. Ramírez6, E. Sanchez-Soto1, J. Segura6, and P. Svaizer5
1 Dept. of E.C.E., Tech. Univ. of Crete, Chania, Greece 2 Speech Group, LORIA, Nancy, France
3 School of E.C.E., Natl. Tech. Univ. of Athens, Athens, Greece 4 Loquendo, via Valdellatorre, 4-10149, Torino, Italy
5 ITC-irst, via Sommarive 18 - Povo (TN), Italy 6 Dept. of Signal Theory, Univ. of Granada, Spain
Outline
• Introduction: the HIWIRE project
• Goals and objectives
• Research areas:
  • Environmental robustness
  • Speaker robustness
• Experimental results
• Ongoing work
HIWIRE project
• http://www.hiwire.org
• Goals: environment- and speaker-robust ASR
• Showcase: fixed cockpit platform, PDA platform
• Industrial partners: Thales Avionics, Loquendo
• Research partners: LORIA, TUC, NTUA, UGR, ITC-IRST, Thales research
• FP6 project: 6/2004 to 5/2007
Research areas
• Environmental robustness
  • Multi-microphone ASR
  • Robust feature extraction
  • Feature fusion and audio-visual ASR
  • Feature equalization
  • Voice-activity detection
  • Speech enhancement
• Speaker robustness
  • Model transformation
  • Acoustic modeling for non-native speech
Multi-microphone ASR: Outline
• Beamforming and Adaptive Noise Cancellation
• Environmental Acoustics Estimation
Beamforming: D&S
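The availability of multi-channel signals makes it possible to selectively capture the desired source:

$$\tilde{s}(t) = \frac{1}{M} \sum_{i=1}^{M} s_i(t + \tau_i)$$

where $M$ is the number of microphones and $\tau_i$ is the delay used to time-align channel $i$.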
Issues:
• estimation of reliable TDOAs;
Method:
• CSP analysis over multiple frames
Advantages:
• robustness
• reduced computational power
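The delay-and-sum idea above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: it assumes integer-sample delays, estimates each TDOA from a single CSP (GCC-PHAT) function computed over the whole signal rather than over multiple frames, and uses a circular shift for alignment. The names `csp_tdoa`, `delay_and_sum` and `max_lag` are hypothetical.

```python
import numpy as np

def csp_tdoa(ref, sig, max_lag):
    """Delay of `sig` relative to `ref`, in samples, from the CSP (GCC-PHAT) peak."""
    n = len(ref) + len(sig)                      # zero-pad to avoid circular overlap
    cross = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    cross /= np.abs(cross) + 1e-12               # phase transform: discard magnitude
    cc = np.fft.irfft(cross, n)
    # Reorder so that index 0 corresponds to lag -max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(cc)) - max_lag

def delay_and_sum(channels, max_lag=64):
    """Align every channel to channel 0 and average the aligned signals."""
    ref = channels[0]
    out = np.zeros(len(ref))
    delays = []
    for ch in channels:
        d = csp_tdoa(ref, ch, max_lag)
        delays.append(d)
        out += np.roll(ch, -d)                   # assumption: circular shift, edges wrap
    return out / len(channels), delays
```

With M channels carrying uncorrelated noise of equal power, averaging the aligned signals reduces the noise power by roughly a factor of M, which is consistent with the gains reported as more MarkIII channels are used.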
D&S with MarkIII
Test set:
• set N1_SNR0 of MC-TIDIGITS (cockpit noise), MarkIII channels
• clean models, trained on original TIDIGITS
Results (WERR [%]):
C_1 38.5
C_32 50.8
DS_C8 79.9
DS_C16 83.0
DS_C32 85.3
DS_C64 85.4
Robust Features for ASR
• Modulation Features: up to +62% (Aurora 3)
  • AM-FM Modulations
  • Teager Energy Cepstrum
• Fractal Features: up to +36% (Aurora 2)
  • Dynamical Denoising
  • Correlation Dimension
  • Multiscale Fractal Dimension
• Hybrid-Merged Features: up to +61% (Aurora 2)
Speech Modulation Features
Filterbank Design
• Short-Term AM-FM Modulation Features
  • Short-Term Mean Inst. Amplitude (IA-Mean)
  • Short-Term Mean Inst. Frequency (IF-Mean)
  • Frequency Modulation Percentages (FMP)
• Short-Term Energy Modulation Features
  • Average Teager Energy, Cepstrum Coef. (TECC)
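A minimal sketch of how such features could be computed for one band-passed signal. The slides do not say which demodulation algorithm is used; DESA-2 is shown here as one common energy-separation choice (an assumption), and the filterbank, FMP and cepstral steps are omitted. The function names are hypothetical.

```python
import numpy as np

def teager(x):
    """Teager-Kaiser energy: psi[x](n) = x(n)^2 - x(n-1) * x(n+1), n = 1..N-2."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def desa2(x, eps=1e-12):
    """DESA-2 estimates of instantaneous amplitude and frequency (rad/sample)."""
    x = np.asarray(x, dtype=float)
    y = x[2:] - x[:-2]                       # y(n) = x(n+1) - x(n-1)
    psi_x = np.maximum(teager(x)[1:-1], eps) # trimmed to the same time span as psi_y
    psi_y = np.maximum(teager(y), eps)
    freq = 0.5 * np.arccos(np.clip(1.0 - psi_y / (2.0 * psi_x), -1.0, 1.0))
    amp = 2.0 * psi_x / np.sqrt(psi_y)
    return amp, freq

def frame_means(values, frame_len, hop):
    """Short-term means over sliding frames (e.g. IA-Mean or IF-Mean of one band)."""
    values = np.asarray(values, dtype=float)
    starts = range(0, len(values) - frame_len + 1, hop)
    return np.array([values[s:s + frame_len].mean() for s in starts])
```

For a pure sinusoid A cos(Ωn + φ) with Ω below π/2, the two estimates return A and Ω, which is a convenient sanity check before applying the operators to band-passed speech.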
Modulation Acoustic Features
[Block diagram: speech s(t) → Multiband Filtering → band signals s_1(t) … s_n(t) → Nonlinear Processing (energies E_1(t) … E_n(t)) and Demodulation (A_1(t), F_1(t) … A_n(t), F_n(t)) → Regularization + Statistical Processing (with V.A.D.) → Robust Feature Transformation/Selection]
Energy Features:
Teager Energy Cepstrum Coeff. TECC
AM-FM Modulation Features:
Mean Inst. Ampl. IA-Mean
Mean Inst. Freq. IF-Mean
Freq. Mod. Percent. FMP
TIMIT-based Speech Databases
TIMIT Database:
• Training Set: 3696 sentences, ~35 phonemes/utterance
• Testing Set: 1344 utterances, 46680 phonemes
• Sampling Frequency: 16 kHz
Feature Vectors: MFCC+C0+AM-FM+1st+2nd Time Derivatives
Stream Weights: (1) for MFCC and (2) for AM-FM
3-state left-right HMMs, 16 mixtures
All-pair, Unweighted grammar
Performance Criterion: Phone Accuracy Rates (%)
Back-end System: HTK v3.2.0
Results: TIMIT+Noise
[Bar chart: phone accuracy (%), axis range 10 to 60, for MFCC*, TEner. CC, MFCC*+IA-Mean, MFCC*+IF-Mean and MFCC*+FMP]
Up to +106%
Aurora 3 - Spanish
Connected-Digits, Sampling Frequency 8 kHz
Training Set:
• WM (Well-Matched): 3392 utterances (quiet 532, low 1668 and high noise 1192)
• MM (Medium-Mismatch): 1607 utterances (quiet 396 and low noise 1211)
• HM (High-Mismatch): 1696 utterances (quiet 266, low 834 and high noise 596)
Testing Set:
• WM: 1522 utterances (quiet 260, low 754 and high noise 508), 8056 digits
• MM: 850 utterances (quiet 0, low 0 and high noise 850), 4543 digits
• HM: 631 utterances (quiet 0, low 377 and high noise 254), 3325 digits
2 Back-end ASR Systems (HTK and BLasr)
Feature Vectors: MFCC+AM-FM (or Auditory+AM-FM), TECC
All-Pair, Unweighted Grammar (or Word-Pair Grammar)
Performance Criterion: Word (digit) Accuracy Rates
Results: Aurora 3
[Bar chart: word accuracy (%), axis range 40 to 100, over WM, MM, HM and Average, for WI007, MFCC+log(E)+D+DD+CMS, TECC+log(E)+CMS, MFCC+IA-Mean, MFCC+IF-Mean and MFCC+FMP]
Up to +62%
Fractal Features
[Block diagram: speech signal → N-d noisy embedding → local SVD → N-d cleaned (filtered) embedding → Filtered Dynamics - Correlation Dimension (FDCD); speech signal → geometrical filtering → Multiscale Fractal Dimension (MFD)]
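As a rough illustration of the embedding and correlation-dimension steps named above, the sketch below builds a time-delay embedding and estimates the correlation dimension from the slope of the correlation sum (the standard Grassberger-Procaccia estimate). It is an assumption-laden toy: the local SVD filtering and the multiscale fractal dimension are not reproduced, and `dim`, `lag` and `radii` are hypothetical parameters.

```python
import numpy as np

def delay_embed(x, dim, lag):
    """Time-delay embedding: row n is (x[n], x[n+lag], ..., x[n+(dim-1)*lag])."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * lag
    return np.stack([x[i * lag:i * lag + n] for i in range(dim)], axis=1)

def correlation_sum(points, r):
    """Fraction of distinct point pairs that lie closer than r."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(points), k=1)
    return np.mean(dist[upper] < r)

def correlation_dimension(points, radii):
    """Slope of log C(r) versus log r; radii must be large enough that C(r) > 0."""
    c = np.array([correlation_sum(points, r) for r in radii])
    slope, _ = np.polyfit(np.log(radii), np.log(c), 1)
    return slope
```

A sinusoid embedded in three dimensions traces a closed curve, so its estimated dimension should come out close to 1; noisy or chaotic signals give larger values.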
Databases: Aurora 2
Task: Speaker Independent Recognition of Digit Sequences
TI-Digits at 8 kHz
Training (8440 utterances per scenario, 55M/55F):
• Clean (8 kHz, G712)
• Multi-Condition (8 kHz, G712)
  • 4 noises (artificial): subway, babble, car, exhibition
  • 5 SNRs: 5, 10, 15, 20 dB, clean
Testing, artificially added noise:
• 7 SNRs: -5, 0, 5, 10, 15, 20 dB, clean
• A: noises as in multi-condition training, G712 (28028 utterances)
• B: restaurant, street, airport, train station, G712 (28028 utterances)
• C: subway, street (MIRS) (14014 utterances)
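Adding noise at a target SNR can be sketched as follows. This is a simplified illustration, not the Aurora 2 procedure: it omits the G712 filtering and defines SNR as a plain ratio of mean powers over the whole utterance (both assumptions). `mix_at_snr` is a hypothetical helper.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db`, then add."""
    speech = np.asarray(speech, dtype=float)
    # Repeat or truncate the noise so that it covers the utterance.
    noise = np.resize(np.asarray(noise, dtype=float), len(speech))
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Calling it once per SNR value in the list above (for example -5, 0, 5, 10, 15 and 20 dB) yields one noisy copy of each utterance per condition.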
Results: Aurora 2
[Bar chart: accuracy, axis range 30 to 100, versus SNR (20 dB, 10 dB, 5 dB), for Baseline, +FMP, +FDCD and +FMP+FDCD]
Up to +61%
Feature Fusion
• Merge synchronous feature streams
• Investigate both supervised and unsupervised algorithms
Feature Fusion: multi-stream
Compute “optimal” exponent weights for each stream s
[HMM Gaussian mixture formulation; similar expressions for MM, naïve Bayes, Euclidean/Mahalanobis classifier]
Optimality in the sense of minimizing “total classification error”
Multi-Stream Classification
Two class problem w1, w2
Feature vector x is broken up into two independent streams x1 and x2
Stream weights s1 and s2 are used to “equalize” the “probabilities”
Multi-Stream Classification
Bayes classification decision
• Non-unity weights increase Bayes error, but estimation/modeling error may decrease
• Stream weights can decrease total error
• “Optimal” weights minimize estimation error variance $\sigma_z^2$
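A minimal sketch of the weighted two-stream decision described above, assuming one-dimensional Gaussian class-conditional densities per stream and weights applied as exponents on the stream likelihoods (so they multiply the log-likelihoods). The model layout and the names `log_gauss` and `classify` are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def classify(x1, x2, models, s1=1.0, s2=1.0):
    """Pick the class with the highest weighted score.

    models maps class name -> ((mean1, var1), (mean2, var2), prior).
    score = log prior + s1 * log p(x1 | class) + s2 * log p(x2 | class)
    """
    scores = {}
    for name, (stream1, stream2, prior) in models.items():
        scores[name] = (math.log(prior)
                        + s1 * log_gauss(x1, *stream1)
                        + s2 * log_gauss(x2, *stream2))
    return max(scores, key=scores.get)

models = {
    "w1": ((0.0, 1.0), (0.0, 1.0), 0.5),
    "w2": ((2.0, 1.0), (2.0, 1.0), 0.5),
}
# Stream 1 favours w1, stream 2 favours w2; the weights decide the outcome.
print(classify(0.5, 1.8, models))                  # unit weights
print(classify(0.5, 1.8, models, s1=0.8, s2=0.2))  # stream 1 emphasised
```

With unit weights this is the ordinary Bayes decision under the independence assumption; changing s1 and s2 shifts the decision towards the stream that is trusted more.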
Optimal Stream Weights
Equal error rate in single-stream classifiers
optimal stream weights are inversely proportional to the total stream estimation error variance
Optimal Stream Weights
Equal estimation error variance in each stream
optimal weights are approximately inversely proportional to the single stream classification error
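Both rules have the same form: each weight is inversely proportional to a per-stream cost, either the estimation error variance or the single-stream classification error. A small sketch, assuming the weights are normalised to sum to one (a constraint the slides do not state); the numbers are illustrative, not measured.

```python
def inverse_proportional_weights(costs):
    """Weights proportional to 1/cost, normalised to sum to one (assumed constraint)."""
    inverse = [1.0 / c for c in costs]
    total = sum(inverse)
    return [v / total for v in inverse]

# Equal single-stream error rates, variance twice as large in the second stream.
print(inverse_proportional_weights([1.0, 2.0]))

# Equal variances, single-stream error rates of 10% and 30%.
print(inverse_proportional_weights([0.1, 0.3]))
```

In the first case the lower-variance stream receives two thirds of the total weight; in the second the more accurate stream receives three quarters.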
Experimental Results
Subset of CUAVE database used:
• 36 speakers (30 training, 6 testing), 5 sequences of 10 digits per speaker
• Training set: 1500 digits (30x5x10)
• Test set: 300 digits (6x5x10)
Features:
• Audio: 39 features (MFCC_D_A)
• Visual: 105 features (ROIDCT_D_A)
Multi-stream HMM models, Middle Integration:
• 8-state, left-to-right HMM whole-digit models
• Single Gaussian mixture
• AV-HMM uses separate audio and video feature streams
Optimal Stream Weights Results
Assume: $\sigma_V^2 / \sigma_A^2 = 2$, SNR-independent
Correlation: 0.96
Parametric non-linear equalization
• Parametric histogram equalization
• Smoother estimates
• Bi-modal transformation (speech vs. non-speech)
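For orientation, the sketch below shows plain, non-parametric histogram equalization of one feature dimension to a standard normal reference. It is not the parametric, bi-modal method named above, which fits separate speech and non-speech models; it only illustrates the CDF-matching idea that such methods refine. The Gaussian reference and the function name are assumptions.

```python
import numpy as np
from statistics import NormalDist

def histogram_equalize(feature):
    """Map values to a standard normal reference through their empirical CDF."""
    feature = np.asarray(feature, dtype=float)
    n = len(feature)
    ranks = np.argsort(np.argsort(feature))   # rank 0..n-1 of every sample
    cdf = (ranks + 0.5) / n                   # strictly inside (0, 1)
    inverse = NormalDist().inv_cdf
    return np.array([inverse(p) for p in cdf])
```

The mapping is monotonic, so the ordering of the samples is preserved while their distribution is forced towards the reference. With few frames the empirical CDF is noisy, which is the motivation for the smoother parametric estimates listed above.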
Voice Activity Detection
• Bi-spectrum based VAD
• Support vector machine based VAD
• Combination of VAD with speech enhancement
Speech Enhancement
Modified Wiener filtering with filter depending on global SNR
Modified Ephraim-Malah enhancement:
based on the E-M spectral attenuation rule
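As background for the first item, the textbook Wiener gain per frequency bin is G = SNR / (1 + SNR). The sketch below applies it with a simple power-subtraction SNR estimate and a gain floor. It does not reproduce the modified filter or its dependence on global SNR, which the slides do not specify; `floor` is a hypothetical parameter.

```python
import numpy as np

def wiener_gain(noisy_power, noise_power, floor=0.1):
    """Per-bin Wiener gain from noisy-speech and noise power spectra."""
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    # A priori SNR approximated by power subtraction (assumption).
    snr = np.maximum(noisy_power - noise_power, 0.0) / np.maximum(noise_power, 1e-12)
    gain = snr / (1.0 + snr)
    return np.maximum(gain, floor)            # the floor limits musical noise
```

The enhanced spectrum is obtained by multiplying each bin of the noisy short-time spectrum by the corresponding gain before resynthesis.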
Non Native Speech Recognition
Build non-native models by combining English and native models
Use phone confusion between English phones and native acoustic models to add alternate model paths
Extract confusion matrix automatically by running phone recognition using native model
Phone pronunciation depends on word grapheme: English phone [grapheme] -> French phone
Example for English phone /t/
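[Figure: phone recognition with French models on English /t/ returns confusions such as /t/ and /k/; the remaining phone symbols are not recoverable. Extracted rules: between Start and End, the 3-state English model runs in parallel with 3-state French models, with path weights (1-)*0.443, (1-)*0.286 and (1-)*0.271.]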
Graphemic constraints
Example: APPROACH /ah p r ow ch/ → APPROACH (A, ah) (PP, p) (R, r) (OA, ow) (CH, ch)
• Alignment between graphemes and phones for each word of lexicon
• Lexicon modification: add graphemes for each word
• Confusion rules extraction: (grapheme, English phone) → list of non-native phones
• Example: (A, ah) → a
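Rule extraction of this kind can be sketched as counting, for every key, which native phone the recognizer returned, then keeping the frequent ones. The key may be an English phone alone or a (grapheme, English phone) pair. This is an illustration only: the pruning threshold `min_prob` and the sample data are hypothetical, and the alignment step is assumed to have been done already.

```python
from collections import Counter, defaultdict

def extract_confusion_rules(aligned_pairs, min_prob=0.15):
    """Build {key: {native_phone: probability}} from aligned (key, native_phone) pairs."""
    counts = defaultdict(Counter)
    for key, native_phone in aligned_pairs:
        counts[key][native_phone] += 1
    rules = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        kept = {p: c / total for p, c in counter.items() if c / total >= min_prob}
        if kept:
            rules[key] = kept
    return rules

# Hypothetical alignment output for the key (A, ah); the counts are made up.
pairs = [(("A", "ah"), "a")] * 8 + [(("A", "ah"), "o")] + [(("A", "ah"), "e")]
print(extract_confusion_rules(pairs))
```

Each surviving entry corresponds to one alternate model path, and its probability can serve as the weight of that path.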
Experiments: HIWIRE Database

Grammar              Used approach          French        Italian       Spanish
                                            WER    SER    WER    SER    WER    SER
Command and control  baseline                6     12.8   10.5   19.6    7.0   14.9
                     confusion               4.6   10.2    6.9   14.1    5.1   11.8
                     +graphemes confusion    4.9   11.3    8.2   15.9    6.2   13.6
Word loop            baseline               35.7   47.9   43.5   52.0   39.9   53.5
                     confusion              27.3   42.1   31.3   46.2   31.3   44.5
                     +graphemes confusion   26.2   41.9   30.5   45.5   31.3   46.5
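To compare approaches across languages, the WER figures reported for the HIWIRE database can be turned into relative reductions. The helper below is a convenience sketch; relative reduction is a derived quantity that the slides themselves do not report. The inputs are the baseline and confusion WER values for the command and control grammar.

```python
def relative_reduction(baseline, improved):
    """Relative error reduction in percent: 100 * (baseline - improved) / baseline."""
    return 100.0 * (baseline - improved) / baseline

# (baseline WER, confusion WER) for the command and control grammar.
command_and_control = {
    "French": (6.0, 4.6),
    "Italian": (10.5, 6.9),
    "Spanish": (7.0, 5.1),
}
for language, (baseline, confusion) in command_and_control.items():
    print(f"{language}: {relative_reduction(baseline, confusion):.1f}% relative WER reduction")
```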
Ongoing Work
• Front-end combination and integration of algorithms
• Fixed-platform demonstration
  • non-native speech demo
• PDA-platform demonstration
• Ongoing research