
Environmentally robust ASR front end for DNN-based acoustic models



Page 1
Page 2

• Do not compare results across different tables! – Configurations may differ

• Most results shown here can be found in:

Takuya Yoshioka and Mark J. F. Gales, “Environmentally robust ASR front-end for deep neural network acoustic models,” Computer Speech and Language, vol. 31, no. 1, pp. 65–86, May 2015

Page 3

1. Motivation
2. Corpus
   • AMI meeting corpus
3. Baseline systems
   • SI and SAT set-ups
4. Assessment of environmental robustness of DNN acoustic models
5. Front-end techniques
6. Combined effects

Page 4
Page 5
Page 6

Little investigation has been done in this area

Page 7

• Multi-party interaction
  – 4 participants in each meeting
• Multi-channel recordings
  – Distant microphones (only the first channel used)
  – Head-set & lapel microphones
• 2 recording set-ups
  – 70 h of scenario-based meetings
  – 30 h of real meetings

Page 8

• Different rooms
• Multiple sources of distortion
  – Reverberation
  – Additive noise
  – Overlapping speech
• Moving speakers
• Many non-native speakers

Page 9

• SI: speaker independent
  – For online transcription
  – DNN-HMM hybrid
• SAT: speaker adaptive training
  – For offline transcription
  – MLP tandem

Page 10
Page 11

• Manual segmentations used
• Overlapping segments ignored

Page 12
Page 13

State output distributions modelled with

– GMM, or

– DNN

$p(\mathbf{X} \mid \mathbf{q}) = P(q_0) \prod_{t=1}^{T} P(q_t \mid q_{t-1})\, p(\mathbf{x}_t \mid q_t)$

GMM: $p(\mathbf{x}_t \mid j) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(\mathbf{x}_t; \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm})$

DNN (hybrid): $p(\mathbf{x}_t \mid j) = \dfrac{p(j \mid \mathbf{x}_t)}{P(j)}\, p(\mathbf{x}_t)$
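The hybrid posterior-to-likelihood conversion amounts to dividing the DNN state posterior by the state prior (the $p(\mathbf{x}_t)$ term is constant over states and can be dropped during decoding). A minimal numpy sketch, with illustrative function and variable names:

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, log_priors):
    """Convert DNN state posteriors p(j|x_t) into scaled likelihoods
    p(x_t|j) proportional to p(j|x_t) / P(j), in the log domain.
    The state-independent p(x_t) term is dropped."""
    return log_posteriors - log_priors

# toy example: 2 frames, 3 states
log_post = np.log(np.array([[0.7, 0.2, 0.1],
                            [0.1, 0.6, 0.3]]))
log_prior = np.log(np.array([0.5, 0.3, 0.2]))  # state priors from alignment counts
ll = scaled_log_likelihoods(log_post, log_prior)
```

Working in the log domain avoids underflow when many frame scores are accumulated during decoding.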

Page 14

• Discriminative pre-training
• Cross-entropy fine-tuning

Page 15

• Trained on a Tesla K20 GPU
• cuBLAS 5.5 used
• Mini-batch size: 800 frames
• Learning rate: “newbob” scheduling
• 10% held-out data for cross-validation (CV)
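The “newbob” schedule is not spelled out on the slide; a minimal sketch of its commonly described form (keep the rate fixed until the held-out CV gain per epoch drops below a threshold, then halve it every epoch and stop once the gain while ramping falls below a second threshold). The class name and threshold values are illustrative assumptions:

```python
class NewbobScheduler:
    """Sketch of 'newbob' learning-rate scheduling (assumed form).
    Thresholds are in CV accuracy points per epoch and are illustrative."""

    def __init__(self, lr=0.08, ramp_threshold=0.5, stop_threshold=0.1):
        self.lr = lr
        self.ramp_threshold = ramp_threshold
        self.stop_threshold = stop_threshold
        self.ramping = False
        self.prev_cv = None

    def step(self, cv_accuracy):
        """Call once per epoch with held-out CV accuracy.
        Returns (learning_rate, stop_training)."""
        if self.prev_cv is None:          # first epoch: nothing to compare
            self.prev_cv = cv_accuracy
            return self.lr, False
        gain = cv_accuracy - self.prev_cv
        self.prev_cv = cv_accuracy
        if self.ramping:
            if gain < self.stop_threshold:
                return self.lr, True      # improvement too small: stop
            self.lr *= 0.5                # keep halving while ramping
        elif gain < self.ramp_threshold:
            self.ramping = True           # start ramping down
            self.lr *= 0.5
        return self.lr, False
```

The schedule is driven entirely by the held-out set, which is why the 10% CV split above is needed.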

Page 16

System           Parameterisation   %WER (Dev / Eval / Avg)
MPE GMM-HMM      HLDA               54.7 / 55.6 / 55.2
DNN-HMM hybrid   FBANK              43.5 / 42.6 / 43.1
This work                           40.0 / 39.3 / 39.7

Page 17
Page 18

Data set   Parameterisation   %WER (Dev / Eval / Avg)
SDM        FBANK              43.5 / 42.6 / 43.1
IHM        FBANK              28.2 / 24.6 / 26.4

• 39.2% of the errors are caused by acoustic distortion
• DNN-HMMs are not so robust

Page 19
Page 20

• Discriminative pre-training
• Cross-entropy fine-tuning

Page 21
Page 22

Alignment   DNN input   %WER (Dev / Eval / Avg)
SDM         IHM         30.6 / 27.0 / 28.8
IHM         SDM         41.8 / 40.8 / 41.3
IHM         SDM         41.7 / 40.6 / 41.2   (using a 648-2,000^5-4,000 DNN)

DNN training is more sensitive to noise than state alignment is

Page 23
Page 24

Speech enhancement

Feature transformation
Multi-stream features

Page 25

Speech enhancement

Feature transformation
Multi-stream features

Page 26

Previous work:
– Beamforming yields gains
– No investigation of single-microphone algorithms

Page 27

• Based on linear, (almost) time-invariant filters
• Applied to complex-valued STFT coefficients

• The filters are adjusted automatically from the observations
  – WPE for 1-channel dereverberation (NTT’s work)
  – BeamformIt for denoising (ICSI’s work)

• 8 microphones used, dedicated to meetings

• Unlikely to produce irregular transitions

$y_{t,f} = \sum_{k=0}^{T-1} g_{k,f}^{\mathsf{H}}\, x_{t-k,f}$
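A single-channel sketch of applying such a per-frequency-bin FIR filter to STFT coefficients. The filter taps `G` are assumed given here; WPE and BeamformIt estimate them from the observations themselves, and function/variable names are illustrative:

```python
import numpy as np

def filter_stft(X, G):
    """Apply a per-frequency-bin FIR filter to complex STFT coefficients:
    y[t, f] = sum_k conj(g[k, f]) * x[t - k, f].
    X: (T, F) complex STFT of one channel; G: (K, F) complex filter taps."""
    T, F = X.shape
    K = G.shape[0]
    Y = np.zeros_like(X)
    for k in range(K):
        # tap k multiplies the signal delayed by k frames
        Y[k:, :] += np.conj(G[k]) * X[:T - k, :]
    return Y
```

Because the filter is linear and (almost) time-invariant per bin, the output contains no irregular frame-to-frame transitions of the kind a frame-by-frame spectral modifier can introduce.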

Page 28

Alignment   Dev: SDM / +Derev / BFIt (8 mics)   Eval: SDM / +Derev / BFIt (8 mics)
MPE         43.8 / 41.8 / 38.6                  43.0 / 41.3 / 36.6
Hybrid      43.5 / 41.7 / 38.8                  43.3 / 41.4 / 36.7

• Dereverberation helps even with a single microphone
• Multi-microphone beamforming works well

Page 29

DNN size    Context frames   Dev: SDM / +Derev   Eval: SDM / +Derev
1,000 x 5   9                43.8 / 41.8         43.0 / 41.3
1,500 x 5   9                43.5 / 42.0         42.6 / 41.1
1,500 x 5   13               42.8 / 41.8         42.9 / 41.2
1,500 x 5   19               43.0 / 41.7         42.9 / 41.2
2,000 x 5   9                43.8 / 41.3         42.9 / 40.4

4.7% relative gain from 1-channel dereverberation

Page 30

Speech enhancement

Feature transformation
Multi-stream features

Page 31

No positive results reported previously

Page 32

• Applied to magnitude spectra
• Cross terms (often) ignored

• Frame-by-frame modification
  – Harmful for DNNs?

• Noise estimated using long-term statistics
  – IMCRA (used here), minimum statistics, etc.

• Deltas computed from un-enhanced speech
  – Essential for obtaining gains

$|y_{t,f}|^2 = |x_{t,f}|^2 + |n_{t,f}|^2$
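A minimal sketch of spectral enhancement under this additive model, using a simple subtract-and-floor rule. The noise estimate would come from a long-term estimator such as IMCRA or minimum statistics; here it is simply passed in, and the floor value is an illustrative assumption:

```python
import numpy as np

def spectral_subtraction(power, noise_power, floor=0.01):
    """Estimate clean power spectra under |y|^2 = |x|^2 + |n|^2:
    |x|^2 ~ max(|y|^2 - |n|^2, floor * |y|^2).
    Flooring prevents negative power estimates; the frame-by-frame
    modification this implies is exactly what may hurt DNN inputs."""
    enhanced = power - noise_power
    return np.maximum(enhanced, floor * power)
```

Note that the deltas appended to the DNN input would still be computed from the un-enhanced spectra, per the bullet above.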

Page 33
Page 34

• Applied to FBANK features
• The following mismatch function is used:

$y_t = x_t + h + \log\left(1 + \exp(n_t - x_t - h)\right)$

• Frame-by-frame modification
• Noise model estimated with EM
• Deltas computed from un-enhanced speech
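Assuming the standard log-filter-bank form of the mismatch function (clean speech $x_t$, additive noise $n_t$, channel $h$, observation $y_t$), it can be evaluated directly; FE-VTS linearises this function to infer the clean features:

```python
import numpy as np

def mismatch(x, n, h):
    """Mismatch function in the log filter-bank (FBANK) domain:
    y = x + h + log(1 + exp(n - x - h)).
    log1p is used for numerical stability when n << x + h."""
    return x + h + np.log1p(np.exp(n - x - h))
```

When the noise is far below the speech level, the log term vanishes and $y_t \approx x_t + h$; when noise dominates, $y_t \approx n_t$.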

Page 35

Enhancement target     %WER (Dev / Eval / Avg)
Spectrum / Feature
N / N                  42.0 / 41.1 / 41.6
Y / N                  41.3 / 40.9 / 41.1
N / Y                  41.4 / 40.5 / 41.0
Y / Y                  42.0 / 41.0 / 41.5

• Small but consistent gains
• Different methods should not be cascaded

Page 36

Enhancement target      %WER (Dev / Eval / Avg)
Spectrum / Feature
N / N                   42.0 / 41.1 / 41.6
Y / N                   41.3 / 40.9 / 41.1
N / Y                   41.4 / 40.5 / 41.0
Y / Y                   42.0 / 41.0 / 41.5
Y / Y (multi-stream)    41.4 / 40.4 / 40.9

Page 37

Speech enhancement

Feature transformation
Multi-stream features

Page 38

• Frame level
  – fMPE, RDT, FE-CMLLR
  – Seems to be subsumed by the DNN

• Speaker (or environment) level
  – Global CMLLR, LIN, fDLR, VTLN
  – Multiple decoding passes required → SAT

• Utterance level
  – Single-pass decoding → SI

Page 39

• Seems robust against supervision errors
• STC transform used to deal with correlations:

$y_t^{(s)} = A^{(s)} L\, x_t + b^{(s)}$
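A sketch of applying the speaker transform, with the global STC transform $L$ decorrelating the features before the per-speaker affine transform $(A^{(s)}, b^{(s)})$ is applied. Shapes and names are illustrative:

```python
import numpy as np

def cmllr_transform(X, A, b, L=None):
    """Apply a speaker-specific CMLLR feature transform
    y_t = A (L x_t) + b, row-wise over a (T, D) feature matrix X.
    A: (D, D) transform; b: (D,) bias; L: optional (D, D) STC transform."""
    if L is not None:
        X = X @ L.T  # decorrelate with the global STC transform first
    return X @ A.T + b
```

A block-diagonal $A^{(s)}$ (separate blocks for statics and deltas) has fewer parameters to estimate per speaker, which matches the result that “block diagonal” outperforms “full” on this data.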

Page 40
Page 41
Page 42

Form of speaker transform   %WER (Dev / Eval / Avg)
None (SI)                   42.6 / 40.2 / 41.4
Full                        37.4 / 37.4 / 37.4
Block diagonal              37.3 / 36.6 / 37.0

• ~10% relative gains obtained
• “Block diagonal” outperforms “full”

Page 43

Form of speaker transform   %WER (Dev / Eval / Avg)
None (SI)                   42.6 / 40.2 / 41.4
Full                        37.4 / 37.4 / 37.4
Block diagonal              37.3 / 36.6 / 37.0

On the IHM data set:
None (SI)                   27.8 / 24.2 / 26.0
Full                        23.8 / 21.6 / 22.7

Page 44

$y_t = A^{(c(u))} L\, x_t + b^{(c(u))}$, where $c(u)$ is the cluster assigned to utterance $u$.

Clustering performed using:
– utterance-specific iVectors
– K-means (a GMM yielded similar performance figures)
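A minimal K-means sketch of assigning utterance iVectors to transform clusters, i.e. computing $c(u)$ for each utterance. Initialisation is deterministic from the first points, purely for illustration:

```python
import numpy as np

def kmeans_clusters(ivectors, n_clusters, n_iter=20):
    """Cluster utterance-level iVectors with plain K-means.
    ivectors: (U, D) array, one row per utterance.
    Returns the cluster index c(u) for each utterance u."""
    centres = ivectors[:n_clusters].copy()  # simple deterministic init
    assign = np.zeros(len(ivectors), dtype=int)
    for _ in range(n_iter):
        # assign each iVector to its nearest centre (Euclidean)
        d = np.linalg.norm(ivectors[:, None, :] - centres[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        # update centres, keeping the old centre if a cluster empties
        for c in range(n_clusters):
            if np.any(assign == c):
                centres[c] = ivectors[assign == c].mean(axis=0)
    return assign
```

Each cluster then gets one quantised CMLLR transform, so an unseen utterance can be decoded in a single pass by looking up the transform of its nearest cluster.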

Page 45

$m^{(u)} = m^{(0)} + T\, w^{(u)}$

Subspace representation of the deviation from the UBM

[Figure: means m^{(0)}, m^{(1)}, m^{(2)}, m^{(3)} lying in the variability subspace]

Page 46
Page 47
Page 48
Page 49

#Clusters    %WER (Dev / Eval / Avg)
No QCMLLR    41.9 / 40.9 / 41.4
64           41.0 / 40.4 / 40.7
32           41.0 / 40.0 / 40.5
16           41.5 / 40.5 / 41.0

On the IHM data set:
No QCMLLR    27.8 / 24.2 / 26.0
32           26.9 / 23.5 / 25.2

Page 50

• Using 32 clusters yielded the best performance
• Similar gains on both SDM and IHM

Page 51

Speech enhancement

Feature transformation
Multi-stream features

Page 52

• Originally proposed by Aachen for shallow MLP tandem configurations

• Exploits the DNN’s insensitivity to increases in input dimensionality

• (Hopefully) complements features masked by noise

• Allows multiple enhancement results to be combined

Page 53

• Four types of auxiliary features investigated:
  – MFCC (Δ/Δ2)
  – PLP
  – Gammatone cepstra
    • Different frequency warping; STFT not used
  – Intra-frame delta
    • Emphasises spectral peaks/dips
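Feature-stream combination itself is just frame-wise concatenation of the FBANK baseline with the auxiliary streams; a minimal sketch (dimensions match the table that follows, e.g. 72-dim FBANK+Δ+Δ2 plus 13-dim MFCC gives 85 features; the function name is illustrative):

```python
import numpy as np

def multi_stream(fbank, *aux_streams):
    """Concatenate the FBANK baseline with auxiliary feature streams
    (e.g. MFCC, PLP, Gammatone cepstra) frame by frame.
    Each stream: (T, D_i) with the same number of frames T."""
    return np.concatenate((fbank,) + aux_streams, axis=1)

# toy example: 3 frames of 72-dim FBANK + 13-dim MFCC -> 85-dim input
feats = multi_stream(np.zeros((3, 72)), np.zeros((3, 13)))
```

The DNN's tolerance of high-dimensional inputs is what makes this cheap: no decorrelation or dimensionality reduction is applied before training.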

Page 54

Feature set                  #features   %WER (Dev / Eval / Avg)
FBANK+Δ+Δ2 (baseline)        72          41.9 / 40.9 / 41.4
+PLP                         85          40.7 / 40.3 / 40.5
+Gammatone                   88          40.8 / 40.0 / 40.4
+MFCC                        85          41.1 / 39.7 / 40.4
+MFCC+Δ+Δ2                   111         40.6 / 40.2 / 40.4
+intra-frame Δ+Δ2            120         40.9 / 39.8 / 40.4
+MFCC+intra-frame Δ+Δ2       133         40.4 / 39.8 / 40.1

Page 55

• Speech enhancement
  – Linear filtering
  – Spectral/feature enhancement

• Feature transformation
  – Quantised CMLLR
  – (Global CMLLR for SAT)

• Multi-stream features

Page 56

[Figure: baseline configuration]

Page 57

Front-end                   %WER (Dev / Eval / Avg)
FBANK baseline              43.1 / 42.4 / 42.8
+WPE                        41.8 / 40.7 / 41.3
+MFCC+intra-frame Δ+Δ2      40.5 / 40.1 / 40.3
+IMCRA+FE-VTS               40.0 / 39.3 / 39.7
+QCMLLR                     40.9 / 39.5 / 40.2

• Effects are additive except for QCMLLR
• QCMLLR may work if applied to the entire feature set

Page 58


Page 59

System                      Parameterisation   %WER (Dev / Eval / Avg)
SAT GMM-HMM (MPE trained)   HLDA               48.8 / 50.2 / 49.5
SAT tandem (MPE trained)    FBANK              40.7 / 40.9 / 40.8
SI hybrid                   FBANK              43.5 / 42.6 / 43.1

• Outperforms the SAT GMM-HMM
• Outperforms the SI hybrid

Page 60

[Figure: baseline configuration]

Page 61

Front-end        %WER (Dev / Eval / Avg)
FBANK baseline   40.1 / 41.3 / 40.7
+WPE             38.9 / 39.3 / 39.1
+MFCC            38.5 / 38.5 / 38.5
+IMCRA+FE-VTS    38.4 / 38.7 / 38.6
+CMLLR           36.6 / 36.7 / 36.7
+CMLLR           36.9 / 37.0 / 37.0
+CMLLR           38.4 / 38.6 / 38.5

• Effects of WPE and CMLLR are additive
• Using auxiliary features yields small gains over CMLLR features
• Denoising is subsumed by CMLLR (as expected)

Page 62

• Front-end processing approaches yield gains over state-of-the-art DNN-based AMs
  – Linear filtering (WPE, BeamformIt)
  – Spectral/feature enhancement (IMCRA, FE-VTS)
  – Feature transformation (QCMLLR, CMLLR)
  – Multi-stream features

• It is possible to combine different classes of approaches