Towards speaker and environmental robustness in ASR: the HIWIRE project

A. Potamianos¹, G. Bouselmi², D. Dimitriadis³, D. Fohr², R. Gemello⁴, I. Illina², F. Mana⁴, P. Maragos³, M. Matassoni⁵, V. Pitsikalis³, J. Ramírez⁶, E. Sanchez-Soto¹, J. Segura⁶, and P. Svaizer⁵

¹ Dept. of E.C.E., Tech. Univ. of Crete, Chania, Greece
² Speech Group, LORIA, Nancy, France
³ School of E.C.E., Natl. Tech. Univ. of Athens, Athens, Greece
⁴ Loquendo, via Valdellatorre, 4-10149, Torino, Italy
⁵ ITC-irst, via Sommarive 18 - Povo (TN), Italy
⁶ Dept. of Signal Theory, Univ. of Granada, Spain

Outline

• Introduction: the HIWIRE project
• Goals and objectives
• Research areas:
  • Environmental robustness
  • Speaker robustness
• Experimental results
• Ongoing work

HIWIRE project

• http://www.hiwire.org
• Goals: environment- and speaker-robust ASR
• Showcase: fixed cockpit platform, PDA platform
• Industrial partners: Thales Avionics, Loquendo
• Research partners: LORIA, TUC, NTUA, UGR, ITC-IRST, Thales research
• FP6 project: 6/2004 to 5/2007

Research areas

• Environmental robustness
  • Multi-microphone ASR
  • Robust feature extraction
  • Feature fusion and audio-visual ASR
  • Feature equalization
  • Voice-activity detection
  • Speech enhancement
• Speaker robustness
  • Model transformation
  • Acoustic modeling for non-native speech

Multi-microphone ASR: Outline

• Beamforming and Adaptive Noise Cancellation
• Environmental Acoustics Estimation

Beamforming: D&S

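The availability of multi-channel signals allows selective capture of the desired source:

$$\tilde{s}(t) = \frac{1}{M} \sum_{i=1}^{M} s_i(t - \tau_i)$$

where $s_i(t)$ is the signal of microphone $i$, $M$ is the number of microphones, and $\tau_i$ is the steering delay that time-aligns channel $i$ with the desired source.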

Issues:

• estimation of reliable TDOAs;

Method:

• CSP analysis over multiple frames

Advantages:

• robustness

• reduced computational power
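A minimal sketch of the two steps named above: TDOA estimation from the crosspower-spectrum phase (CSP) accumulated over several frames, followed by delay-and-sum. It assumes integer-sample delays and short frames (a naive DFT is used to keep the example dependency-free); the function names are illustrative and not taken from the project.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def csp_function(ref, sig):
    """CSP (GCC-PHAT) function: inverse DFT of the phase of the cross-spectrum."""
    phase = []
    for r, s in zip(dft(ref), dft(sig)):
        c = s * r.conjugate()
        phase.append(c / abs(c) if abs(c) > 1e-12 else 0j)  # keep phase only
    return [v.real for v in dft(phase, inverse=True)]

def estimate_tdoa(frame_pairs, max_lag):
    """Delay (in samples) of `sig` relative to `ref`, from the peak of the
    CSP function summed over all (ref_frame, sig_frame) pairs."""
    n = len(frame_pairs[0][0])
    total = [0.0] * n
    for ref, sig in frame_pairs:
        for i, v in enumerate(csp_function(ref, sig)):
            total[i] += v
    # Lags are circular: index n - k holds lag -k.
    return max(range(-max_lag, max_lag + 1), key=lambda lag: total[lag % n])

def delay_and_sum(channels, delays):
    """Average the channels after advancing channel i by delays[i] samples."""
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for x, d in zip(channels, delays):
            if 0 <= t + d < n:
                acc += x[t + d]
        out.append(acc / len(channels))
    return out
```

Summing the CSP functions of several frames before picking the peak is one simple way to make the delay estimate less sensitive to a single noisy frame.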

D&S with MarkIII

Test set:

• set N1_SNR0 of MC-TIDIGITS (cockpit noise), MarkIII channels

• clean models, trained on original TIDIGITS

Results (WERR [%]):

Configuration   WERR [%]
C_1             38.5
C_32            50.8
DS_C8           79.9
DS_C16          83.0
DS_C32          85.3
DS_C64          85.4

Robust Features for ASR

• Modulation Features
  • AM-FM Modulations
  • Teager Energy Cepstrum
• Fractal Features
  • Dynamical Denoising
  • Correlation Dimension
  • Multiscale Fractal Dimension
• Hybrid-Merged Features

up to +62% (Aurora 3)



Speech Modulation Features

• Filterbank Design
• Short-Term AM-FM Modulation Features
  • Short-Term Mean Inst. Amplitude (IA-Mean)
  • Short-Term Mean Inst. Frequency (IF-Mean)
  • Frequency Modulation Percentages (FMP)
• Short-Term Energy Modulation Features
  • Average Teager Energy, Cepstrum Coef. (TECC)

Modulation Acoustic Features

[Block diagram: speech s(t) → Multiband Filtering → band signals s_1(t) … s_n(t) → Nonlinear Processing (energies E_1(t) … E_n(t)) and Demodulation with Regularization (A_1(t), F_1(t) … A_n(t), F_n(t)) → Statistical Processing, with V.A.D. → Robust Feature Transformation/Selection]

Energy Features:
• Teager Energy Cepstrum Coeff. (TECC)

AM-FM Modulation Features:
• Mean Inst. Ampl. (IA-Mean)
• Mean Inst. Freq. (IF-Mean)
• Freq. Mod. Percent. (FMP)
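As a rough illustration of how such per-band quantities can be computed, the sketch below applies the discrete Teager-Kaiser energy operator and the DESA-2 energy separation algorithm to one band-passed frame and averages the results. The choice of DESA-2 is an assumption (the slides do not say which demodulation algorithm is used), and the filterbank, regularization and cepstral steps are omitted.

```python
import math

def teager_energy(x):
    """Discrete Teager-Kaiser operator: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

def desa2(x):
    """DESA-2 estimates of instantaneous amplitude and frequency (rad/sample)."""
    psi_x = teager_energy(x)                                  # aligned with x[1:-1]
    y = [x[n + 1] - x[n - 1] for n in range(1, len(x) - 1)]   # symmetric difference
    psi_y = teager_energy(y)                                  # aligned with x[2:-2]
    amps, freqs = [], []
    for px, py in zip(psi_x[1:-1], psi_y):
        if px <= 0 or py <= 0:
            continue  # skip samples where the energy estimates are unusable
        cos_2w = min(max(1.0 - py / (2.0 * px), -1.0), 1.0)
        freqs.append(0.5 * math.acos(cos_2w))
        amps.append(2.0 * px / math.sqrt(py))
    return amps, freqs

def short_term_features(frame):
    """Frame-level means of Teager energy, inst. amplitude and inst. frequency."""
    psi = teager_energy(frame)
    amps, freqs = desa2(frame)
    if not amps:
        raise ValueError("frame has no usable samples for demodulation")
    return {
        "teager_mean": sum(psi) / len(psi),
        "ia_mean": sum(amps) / len(amps),
        "if_mean": sum(freqs) / len(freqs),
    }
```

For a pure tone A·cos(Ωn + φ) the operator returns A²·sin²(Ω) at every sample, and DESA-2 recovers A and Ω exactly, which makes a convenient sanity check.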

TIMIT-based Speech Databases

• TIMIT Database:
  • Training Set: 3696 sentences, ~35 phonemes/utterance
  • Testing Set: 1344 utterances, 46680 phonemes
  • Sampling Frequency: 16 kHz
• Feature Vectors: MFCC+C0+AM-FM+1st+2nd Time Derivatives
• Stream Weights: (1) for MFCC and (2) for AM-FM
• 3-state left-right HMMs, 16 mixtures
• All-pair, Unweighted grammar
• Performance Criterion: Phone Accuracy Rates (%)
• Back-end System: HTK v3.2.0

Results: TIMIT+Noise

[Figure: bar chart of accuracy, axis range 10 to 60, comparing MFCC*, TEner. CC, MFCC*+IA-Mean, MFCC*+IF-Mean and MFCC*+FMP]

Up to +106%

Aurora 3 - Spanish

Connected-Digits, Sampling Frequency 8 kHz

Training Set:
• WM (Well-Matched): 3392 utterances (quiet 532, low 1668 and high noise 1192)
• MM (Medium-Mismatch): 1607 utterances (quiet 396 and low noise 1211)
• HM (High-Mismatch): 1696 utterances (quiet 266, low 834 and high noise 596)

Testing Set:
• WM: 1522 utterances (quiet 260, low 754 and high noise 508), 8056 digits
• MM: 850 utterances (quiet 0, low 0 and high noise 850), 4543 digits
• HM: 631 utterances (quiet 0, low 377 and high noise 254), 3325 digits

2 Back-end ASR Systems (HTK and BLasr)

Feature Vectors: MFCC+AM-FM (or Auditory+AM-FM), TECC

All-Pair, Unweighted Grammar (or Word-Pair Grammar)

Performance Criterion: Word (digit) Accuracy Rates

Results: Aurora 3

[Figure: bar chart of word accuracy (%), axis range 40 to 100, on WM, MM, HM and their average, comparing WI007, MFCC+log(E)+D+DD+CMS, TECC+log(E)+CMS, MFCC+IA-Mean, MFCC+IF-Mean and MFCC+FMP]

Up to +62%

Fractal Features

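[Block diagram, FDCD branch: speech signal → N-d Embedding (noisy embedding) → Local SVD → N-d Cleaned Embedding (filtered embedding) → Filtered Dynamics - Correlation Dimension (FDCD)]

[Block diagram, MFD branch: speech signal → Geometrical Filtering → Multiscale Fractal Dimension (MFD)]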
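To make the correlation-dimension idea concrete, here is a small sketch of a time-delay embedding followed by a Grassberger-Procaccia style correlation sum. It covers only the plain correlation dimension on an unfiltered embedding; the local-SVD filtering stage is not reproduced, and the two-radius slope estimate is a simplification chosen for brevity.

```python
import math

def delay_embed(signal, dim, delay):
    """Time-delay embedding: point t is (x[t], x[t+delay], ..., x[t+(dim-1)*delay])."""
    n_points = len(signal) - (dim - 1) * delay
    return [tuple(signal[t + k * delay] for k in range(dim)) for t in range(n_points)]

def correlation_sum(points, radius):
    """Fraction of distinct point pairs that lie closer than `radius`."""
    n = len(points)
    close = 0
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < radius:
                close += 1
    return 2.0 * close / (n * (n - 1))

def correlation_dimension(points, r_small, r_large):
    """Slope of log C(r) against log r, estimated from two radii."""
    c_small = correlation_sum(points, r_small)
    c_large = correlation_sum(points, r_large)
    return math.log(c_large / c_small) / math.log(r_large / r_small)
```

For example, `correlation_dimension(delay_embed(list(range(200)), 2, 1), 7.5, 30.0)` comes out close to 1, as expected for points lying on a straight line. In practice the slope would be fitted over a range of radii rather than two.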

Databases: Aurora 2

Task: Speaker Independent Recognition of Digit Sequences

• TI-Digits at 8 kHz
• Training (8440 utterances per scenario, 55M/55F):
  • Clean (8 kHz, G712)
  • Multi-Condition (8 kHz, G712)
    • 4 noises (artificial): subway, babble, car, exhibition
    • 5 SNRs: 5, 10, 15, 20 dB, clean
• Testing, artificially added noise, 7 SNRs: [-5, 0, 5, 10, 15, 20 dB, clean]
  • A: noises as in multi-condition training, G712 (28028 utterances)
  • B: restaurant, street, airport, train station, G712 (28028 utterances)
  • C: subway, street (MIRS) (14014 utterances)

Results: Aurora 2

[Figure: bar chart of accuracy, axis range 30 to 100, versus SNR (20 dB, 10 dB, 5 dB), comparing Baseline, +FMP, +FDCD and +FMP+FDCD]

Up to +61%

Feature Fusion

• Merge synchronous feature streams
• Investigate both supervised and unsupervised algorithms

Feature Fusion: multi-stream

• Compute “optimal” exponent weights for each stream s [HMM Gaussian mixture formulation; similar expressions for MM, naïve Bayes, Euclidean/Mahalanobis classifier]
• Optimality in the sense of minimizing “total classification error”
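For reference, a commonly used form of the multi-stream HMM state observation likelihood (the form documented for HTK, and presumably what the bracketed note refers to) raises each stream's Gaussian mixture likelihood to its stream weight. The symbols $b_j$, $c_{jim}$, $M_i$ and $S$ are notation assumed here, not taken from the slides:

```latex
b_j(\mathbf{x}) = \prod_{i=1}^{S} \left[ \sum_{m=1}^{M_i} c_{jim}\,
  \mathcal{N}\!\left(\mathbf{x}_i;\ \boldsymbol{\mu}_{jim}, \boldsymbol{\Sigma}_{jim}\right) \right]^{s_i}
```

Here $j$ indexes the HMM state, $i$ the stream, $m$ the mixture component, and $s_i$ is the exponent weight of stream $i$. With all $s_i = 1$ this reduces to the ordinary product of independent stream likelihoods.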

Multi-Stream Classification

• Two-class problem w1, w2
• Feature vector x is broken up into two independent streams x1 and x2
• Stream weights s1 and s2 are used to “equalize” the “probabilities”

Multi-Stream Classification

• Bayes classification decision
• Non-unity weights increase Bayes error but estimation/modeling error may decrease
• Stream weights can decrease total error
• “Optimal” weights minimize estimation error variance $\sigma_z^2$
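A toy sketch of the weighted two-class decision, assuming one scalar feature per stream and a single Gaussian per class and stream (both simplifications made here for illustration). The class with the larger weighted log score, log P(w) + Σ s_i · log p(x_i | w), is selected.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def weighted_log_score(streams, models, weights, log_prior=0.0):
    """log P(w) + sum_i s_i * log p(x_i | w) for one class w."""
    return log_prior + sum(
        s * gaussian_logpdf(x, mean, var)
        for x, (mean, var), s in zip(streams, models, weights)
    )

def classify(streams, class_models, weights):
    """Return the class label with the highest weighted log score."""
    scores = {
        label: weighted_log_score(streams, models, weights)
        for label, models in class_models.items()
    }
    return max(scores, key=scores.get)

# Hypothetical models: (mean, variance) per stream for each class.
class_models = {"w1": [(0.0, 1.0), (0.0, 1.0)], "w2": [(2.0, 1.0), (2.0, 1.0)]}
x = (0.2, 1.9)  # stream 1 favours w1, stream 2 favours w2

print(classify(x, class_models, weights=(1.0, 1.0)))
print(classify(x, class_models, weights=(1.8, 0.2)))
```

With unity weights the second stream narrowly wins; shifting weight toward the first stream flips the decision, which is the "equalizing" effect of the exponents.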

Optimal Stream Weights

If the single-stream classifiers have equal error rates, the optimal stream weights are inversely proportional to the total stream estimation error variance.

Optimal Stream Weights

If the estimation error variance is equal in each stream, the optimal weights are approximately inversely proportional to the single-stream classification error.
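Both rules have the same shape: each stream's weight is proportional to the reciprocal of a per-stream cost (estimation error variance in the first case, classification error in the second). A small helper, assuming the weights are normalized to sum to one (a convention chosen here, not stated on the slides):

```python
def inverse_proportional_weights(costs):
    """Weights proportional to 1/cost, normalized to sum to 1."""
    inverse = [1.0 / c for c in costs]
    total = sum(inverse)
    return [v / total for v in inverse]

# Two streams where the second has twice the error variance of the first:
# the first stream receives twice the weight of the second.
audio_weight, video_weight = inverse_proportional_weights([1.0, 2.0])
```

Under this normalization a 1:2 variance ratio gives weights of 2/3 and 1/3.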

Experimental Results

• Subset of CUAVE database used:
  • 36 speakers (30 training, 6 testing), 5 sequences of 10 digits per speaker
  • Training set: 1500 digits (30x5x10)
  • Test set: 300 digits (6x5x10)
• Features:
  • Audio: 39 features (MFCC_D_A)
  • Visual: 105 features (ROIDCT_D_A)
• Multi-stream HMM models, middle integration:
  • 8-state, left-to-right, whole-digit HMMs
  • Single Gaussian mixture
  • AV-HMM uses separate audio and video feature streams

Optimal Stream Weights Results

Assume: $\sigma_V^2 / \sigma_A^2 = 2$, SNR-independent

Correlation: 0.96

Parametric non-linear equalization

• Parametric histogram equalization
• Smoother estimates
• Bi-modal transformation (speech vs. non-speech)
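For orientation, the sketch below shows plain rank-based (non-parametric) histogram equalization of one feature dimension to a standard-normal reference. It is a baseline swapped in for illustration: the parametric, bi-modal variant listed above is not reproduced here, and the Gaussian reference is an assumption.

```python
from statistics import NormalDist

def histogram_equalize(values):
    """Map each value to the standard-normal quantile of its empirical CDF rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    reference = NormalDist()  # zero mean, unit variance
    out = [0.0] * n
    for rank, i in enumerate(order):
        # Mid-rank CDF estimate, strictly inside (0, 1).
        out[i] = reference.inv_cdf((rank + 0.5) / n)
    return out
```

The mapping is monotonic, so the ordering of the feature values within the utterance is preserved while their distribution is forced toward the reference.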

Voice Activity Detection

• Bi-spectrum based VAD
• Support vector machine based VAD
• Combination of VAD with speech enhancement

Speech Enhancement

• Modified Wiener filtering with filter depending on global SNR
• Modified Ephraim-Malah enhancement: based on the E-M spectral attenuation rule
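As a point of reference, the basic (unmodified) Wiener gain for one spectral frame can be sketched as follows. The a priori SNR estimate and the `snr_floor` parameter are assumptions made for this example; the project's modification that ties the filter to the global SNR is not shown.

```python
def wiener_gains(noisy_power, noise_power, snr_floor=0.01):
    """Per-bin Wiener gain G = xi / (1 + xi).

    xi is a crude a priori SNR estimate, max(noisy / noise - 1, snr_floor);
    the floor keeps the gain from collapsing to zero in noise-only bins.
    """
    gains = []
    for y, n in zip(noisy_power, noise_power):
        xi = max(y / n - 1.0, snr_floor)
        gains.append(xi / (1.0 + xi))
    return gains
```

The enhanced magnitude spectrum is then the noisy magnitude multiplied bin by bin with these gains.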

Non Native Speech Recognition

• Build non-native models by combining English and native models
• Use phone confusion between English phones and native acoustic models to add alternate model paths
• Extract the confusion matrix automatically by running phone recognition using the native models
• Phone pronunciation depends on the word's graphemes: English phone [grapheme] -> French phone

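Example for English phone /t/

[Figure: confusion rules extracted for English /t/ (French phones shown include /t/ and /k/) and the resulting model topology: between Start and End, the English model runs in parallel with alternate paths through 3-state French models, weighted (1-)*0.443, (1-)*0.286 and (1-)*0.271]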

Graphemic constraints

• Example: APPROACH /ah p r ow ch/ → APPROACH (A, ah) (PP, p) (R, r) (OA, ow) (CH, ch)
• Alignment between graphemes and phones for each word of the lexicon
• Lexicon modification: add graphemes for each word
• Confusion rules extraction: (grapheme, English phone) → list of non-native phones
  • Example: (A, ah) → a
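The rule extraction can be pictured as counting. Given an alignment between each (grapheme, English phone) unit and the native phone recognized in its place, rules are the most frequent native phones per unit. The input format, the `min_prob` pruning threshold and the sample data are hypothetical; dropping the grapheme from the key gives the plain phone-confusion variant.

```python
from collections import Counter, defaultdict

def extract_confusion_rules(aligned, min_prob=0.1):
    """Build rules from ((grapheme, english_phone), native_phone) pairs.

    Returns {(grapheme, english_phone): [(native_phone, probability), ...]},
    sorted by decreasing probability, keeping entries with probability >= min_prob.
    """
    counts = defaultdict(Counter)
    for key, native_phone in aligned:
        counts[key][native_phone] += 1
    rules = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        rules[key] = [
            (phone, count / total)
            for phone, count in counter.most_common()
            if count / total >= min_prob
        ]
    return rules

# Hypothetical alignment output: (A, ah) recognized 8 times as "a", twice as "o".
aligned = [(("A", "ah"), "a")] * 8 + [(("A", "ah"), "o")] * 2
rules = extract_confusion_rules(aligned)
```

The resulting probabilities can serve as weights on the alternate native-model paths added to each English phone model.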

Experiments: HIWIRE Database

Grammar              Used approach         French        Italian       Spanish
                                           WER   SER     WER   SER     WER   SER
Command and control  baseline              6     12.8    10.5  19.6    7.0   14.9
Command and control  confusion             4.6   10.2    6.9   14.1    5.1   11.8
Command and control  +graphemes confusion  4.9   11.3    8.2   15.9    6.2   13.6
Word loop            baseline              35.7  47.9    43.5  52.0    39.9  53.5
Word loop            confusion             27.3  42.1    31.3  46.2    31.3  44.5
Word loop            +graphemes confusion  26.2  41.9    30.5  45.5    31.3  46.5

Ongoing Work

• Front-end combination and integration of algorithms
• Fixed-platform demonstration
  • Non-native speech demo
• PDA-platform demonstration
• Ongoing research