Spatial Diffuseness Features for DNN-Based Speech Recognition in Noisy and Reverberant Environments · 2015-05-15

Spatial Diffuseness Features for DNN-Based Speech Recognition in Noisy and Reverberant Environments

Andreas Schwarz, Christian Huemmer, Roland Maas, Walter Kellermann

Lehrstuhl für Multimediakommunikation und Signalverarbeitung
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany

ICASSP 2015


Deep Neural Networks for Acoustic Modeling

Trend: explicit feature processing → implicit learning
- MFCCs → simple filterbank features [Mohamed et al. 2013]
- Filterbanks → raw time-domain signals [Jaitly, Hinton 2011]
- Denoising → noise-aware training [Seltzer et al. 2013]

What about spatial information (microphone arrays)?
- Stacked feature vectors from multiple channels [Swietojanski et al. 2013]
  - Phase information is not exploited
- Raw multi-channel waveforms [Hoshen et al. 2015]
  - Hard to generalize for arbitrary acoustic scenarios
- Spatial diffuseness features
  - Represent spatial information independently of source position and microphone array

[Figure: mh acoustics Eigenmike]


Outline

- Signal Model
- Coherence-based Dereverberation in the STFT Domain
- Extraction of Spatial Diffuseness Features


Signal Model

- Desired signal is fully coherent (only delayed between microphones)
- Noise and reverberation are diffuse and uncorrelated with the desired signal
- Coherence of the mixed sound field can be modeled as:

[Equation]

→ Coherent-to-diffuse ratio (CDR) can be estimated from the complex spatial coherence of the mixture
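
One way to write the model referred to on this slide is sketched below. It assumes that Γ_x, Γ_s and Γ_n denote the complex spatial coherence of the mixture, the desired signal and the diffuse component (symbol names chosen here for illustration, not taken from the slide), with k the frame index and f the frequency, as in the D(k,f) notation used later in the talk. The second form is the algebraic inversion of the first.

```latex
\Gamma_x(k,f) = \frac{\mathrm{CDR}(k,f)\,\Gamma_s(f) + \Gamma_n(f)}{\mathrm{CDR}(k,f) + 1}
\quad\Longleftrightarrow\quad
\mathrm{CDR}(k,f) = \frac{\Gamma_n(f) - \Gamma_x(k,f)}{\Gamma_x(k,f) - \Gamma_s(f)}
```

Under this form, a fully coherent desired signal has |Γ_s(f)| = 1, the mixture coherence approaches Γ_s as the CDR grows, and it equals Γ_n for CDR = 0.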


Coherence-based STFT-Domain Dereverberation

1. Estimate short-time spatial coherence (quasi-instantaneous)
2. Estimate coherent-to-diffuse ratio (CDR)
3. Perform spectral subtraction to suppress diffuse components

[Schwarz/Kellermann, “Coherent-to-Diffuse Power Ratio Estimation for Dereverberation”, IEEE/ACM Transactions on Audio, Speech and Language Processing, 2015]

- Only instantaneous signal properties are exploited
- No knowledge or estimation of source DOA required
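
The three processing steps of this slide can be sketched for a single frequency bin as follows. This is a minimal illustration under stated assumptions, not the estimator of the cited paper: the CDR is obtained by directly inverting a coherence mixing model with an assumed known signal coherence `gamma_s` (default 1, i.e. zero time difference of arrival), whereas the slide states that the actual method needs no DOA knowledge. The function names, the forgetting factor `lam`, the oversubtraction factor `mu` and the gain floor `g_min` are all hypothetical choices.

```python
import math

def short_time_coherence(x1, x2, lam=0.68):
    """Step 1: complex coherence of two microphone STFT sequences (one bin),
    estimated by recursive averaging with forgetting factor lam (assumed value)."""
    p11 = p22 = 1e-12   # auto-PSD estimates; tiny start value avoids division by zero
    p12 = 0j            # cross-PSD estimate
    coherence = []
    for a, b in zip(x1, x2):
        p11 = lam * p11 + (1 - lam) * abs(a) ** 2
        p22 = lam * p22 + (1 - lam) * abs(b) ** 2
        p12 = lam * p12 + (1 - lam) * a * b.conjugate()
        coherence.append(p12 / math.sqrt(p11 * p22))
    return coherence

def diffuse_coherence(freq_hz, spacing_m, c=343.0):
    """Coherence of an ideal spherically diffuse field: sin(x)/x, x = 2*pi*f*d/c."""
    x = 2 * math.pi * freq_hz * spacing_m / c
    return 1.0 if x == 0 else math.sin(x) / x

def cdr_from_coherence(gamma_x, gamma_n, gamma_s=1.0 + 0j):
    """Step 2 (simplified): invert the mixing model for a KNOWN signal coherence.
    Assumption: gamma_s is given; this is not the DOA-independent estimator."""
    denom = gamma_x - gamma_s
    if abs(denom) < 1e-9:
        return float("inf")          # mixture looks fully coherent
    return max(0.0, ((gamma_n - gamma_x) / denom).real)

def suppression_gain(cdr, mu=1.0, g_min=0.1):
    """Step 3: power spectral subtraction of the diffuse part, with a gain floor."""
    return max(g_min, math.sqrt(max(0.0, 1.0 - mu / (cdr + 1.0))))
```

Applying `suppression_gain` to each time-frequency bin of one microphone's STFT would give the enhanced spectrum. Identical signals at both microphones yield a coherence of about 1 and therefore a gain of 1, while a mixture coherence equal to the diffuse coherence yields a CDR of 0 and the gain floor.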


Evaluation

Word Error Rate for REVERB challenge evaluation set

[Figure: two bar charts of WER [%] for logmelspec and enh. logmelspec features, SimData and RealData]

WER [%]                         logmelspec   enh. logmelspec
Clean speech-trained DNN
  SimData                       44.4         30.3
  RealData                      85.7         69.3
Multi-condition-trained DNN
  SimData                       9.5          9.4
  RealData                      28.8         28.8

⇨ Improvement for clean-trained DNN
⇨ Disappears with multi-condition training

Multi-condition training neutralizes the effect of dereverberation


Spatial Feature Extraction

Instead of STFT-domain enhancement, extract spatial features

- meldiffuseness:
  - 0 for purely directional sound, 1 for purely diffuse sound
  - computed from coherent-to-diffuse ratio: D(k,f) = 1/(CDR(k,f)+1)
- Naive approach: magnitude squared coherence (melmsc)
  - Depends not only on diffuse noise content, but also on microphone spacing, DOA
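
The mapping D(k,f) = 1/(CDR(k,f)+1) and a possible reduction to mel bands can be sketched as below. The slide only gives the formula for D. The triangular mel filterbank, its normalization so that each band is a weighted average, and the names `mel_weights` and `mel_diffuseness` are assumptions made for illustration.

```python
import math

def diffuseness(cdr):
    """D = 1 / (CDR + 1): 0 for purely directional, 1 for purely diffuse sound."""
    return 1.0 / (cdr + 1.0)

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_weights(n_bands, n_bins, fs):
    """Triangular mel filters over n_bins STFT bins spanning 0..fs/2.
    Each filter is normalized to sum to 1 (assumed design choice)."""
    top = hz_to_mel(fs / 2)
    edges_hz = [mel_to_hz(top * i / (n_bands + 1)) for i in range(n_bands + 2)]
    bin_hz = [fs / 2 * b / (n_bins - 1) for b in range(n_bins)]
    filters = []
    for j in range(1, n_bands + 1):
        lo, mid, hi = edges_hz[j - 1], edges_hz[j], edges_hz[j + 1]
        w = [max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
             for f in bin_hz]
        total = sum(w) or 1.0        # guard against an empty filter
        filters.append([v / total for v in w])
    return filters

def mel_diffuseness(cdr_frame, filters):
    """Weighted average of per-bin diffuseness within each mel band (one frame)."""
    d = [diffuseness(c) for c in cdr_frame]
    return [sum(w * v for w, v in zip(filt, d)) for filt in filters]
```

Because every filter is a convex combination of per-bin values, the resulting features stay within the range 0 to 1, matching the interpretation given on the slide.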


Visualization of Features

[Figure: panels showing logmelspec, enhanced logmelspec, and meldiffuseness features]


Evaluation Setup

REVERB challenge “two microphone” task [Kinoshita et al. 2013]
- noisy and reverberant signals created from WSJCAM0 corpus
- varying direction of arrival
- 2 microphones, 8 cm spacing

DNN-based Speech Recognizer
- Kaldi toolkit
- hybrid DNN-HMM acoustic model
- “maxout” network (4 hidden layers, 2000 inputs, 400 outputs per layer)
- ±5 frame splicing
- training on multi-condition noisy and reverberant data (17.5 hours)

Feature vectors (overall dimension: 72)
- noisy logmelspec features: logmelspec Δ ΔΔ
- enhanced logmelspec features: enh. logmel Δ ΔΔ
- augmented with melmsc: logmelspec Δ melmsc
- augmented with meldiffuseness: logmelspec Δ meldiffuseness
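
With 72-dimensional feature vectors and ±5 frame splicing, each DNN input covers 11 frames, i.e. 11 × 72 = 792 values. The sketch below illustrates such splicing. Repeating the first and last frame at the utterance edges is an assumption, and the function name `splice` is hypothetical.

```python
def splice(frames, context=5):
    """Concatenate each frame with its +/- context neighbours.
    Edge frames are repeated where neighbours are missing (assumed behaviour)."""
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for k in range(t - context, t + context + 1):
            window.extend(frames[min(max(k, 0), n - 1)])  # clamp index to valid range
        spliced.append(window)
    return spliced

# 20 dummy frames of dimension 72; frame t is filled with the value t
frames = [[float(t)] * 72 for t in range(20)]
inputs = splice(frames)
print(len(inputs), len(inputs[0]))
```

Since 72 = 3 × 24, each of the three streams in a feature vector (for example logmelspec, Δ, and meldiffuseness) would have 24 components; this band count is inferred from the overall dimension and is not stated on the slide.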


Evaluation Results

SimData: measured impulse responses, additive noise
RealData: real recordings in noisy environment

[Figure: bar chart of WER [%] per feature type, SimData and RealData]

WER [%]     logmelspec   enh. logmelspec   logmelspec + melmsc   logmelspec + meldiffuseness
SimData     9.5          9.4               9.0                   8.5
RealData    28.8         28.8              27.7                  27.0

6% to 11% relative WER reduction by using spatial features
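
The quoted relative reductions can be checked from the WER values of this slide. The short calculation below assumes that the reference is the noisy logmelspec system and that the compared system is logmelspec + meldiffuseness; the helper name is hypothetical.

```python
def relative_wer_reduction(baseline, improved):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline - improved) / baseline

sim = relative_wer_reduction(9.5, 8.5)     # SimData: logmelspec vs. + meldiffuseness
real = relative_wer_reduction(28.8, 27.0)  # RealData: logmelspec vs. + meldiffuseness
print(f"SimData: {sim:.2f} %, RealData: {real:.2f} %")
```

This gives roughly 10.5 % for SimData and 6.25 % for RealData, in line with the "6% to 11%" range stated on the slide.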


Summary

Motivation
- STFT-domain dereverberation has little effect on WER
- Idea: exploit spatial information in the DNN

Spatial Diffuseness Features
- Can be extracted instantaneously
- “Blind”, no knowledge or estimation of the source DOA required
- Device-independent features
- 6% to 11% relative WER reduction for REVERB challenge 2-channel task
- MATLAB code available (see paper)

Can we use a similar approach to deal with directional interferers?

Thank you for your attention!


Results (Details)

WER [%]
                                         Evaluation Set                                                    Development Set
                                         SimData                                         RealData          SimData  RealData
                                         Room 1       Room 2       Room 3                Room 1
Recognizer  Feature                      near  far    near  far    near  far    Avg      near  far   Avg   Avg      Avg
GMM-HMM     MFCC-LDA-MLLT-fMLLR          6.6   7.5    9.4   16.6   11.1  20.7   12.0     31.2  30.2  30.7  12.1     31.6
DNN-HMM     logmelspec+∆+∆∆              5.7   6.7    7.7   13.9   8.7   14.6   9.5      28.5  29.1  28.8  9.7      24.9
DNN-HMM     enhanced logmelspec+∆+∆∆     6.6   7.1    7.7   12.2   8.3   14.6   9.4      28.5  29.1  28.8  9.1      25.3
DNN-HMM     logmelspec+∆+melmsc          6.2   6.3    7.0   12.3   8.2   13.9   9.0      27.3  28.0  27.7  8.7      24.7
DNN-HMM     logmelspec+∆+meldiffuseness  5.9   6.1    6.9   11.0   8.2   12.9   8.5      27.8  26.3  27.0  7.9      24.2
