Speech enhancement for distant talking speech recognition

24 Feb 2014

Takuya Yoshioka, NTT CS Labs, Cambridge University

Thanks to: T. Nakatani, K. Kinoshita, M. Delcroix (NTT); M. Gales, X. Chen (Cambridge)

Speech Enhancement for ASR

• Effectiveness measured by WER
– use of a sensible ASR system essential

• Huge computational resources available

• Offline processing allowed

• The AM can also do part of the job

Typical ASR System

[Block diagram: Signal → Speech Enh → Front-End → Recog Engine → Sentence; the Recog Engine draws on the AM, the LM and the Pron Dict]

Different Approaches for Different Situations

• 1ch vs. Mch (M > 1)

• background noise;
• reverberant noise; or
• interfering talkers

1ch Dereverberation (Offline)

• Reverberation usually modelled with FIR

$$x[t] = \sum_{\tau=0}^{T} h[\tau]\, s[t-\tau]$$

• Given (x[t])_{t=1,…,N}, recover (s[t])_{t=1,…,N}
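The FIR model above can be tried out numerically. The following is a minimal sketch, assuming NumPy; the exponentially decaying random filter `h` is a hypothetical toy room response and the white-noise `s` is only a stand-in for clean speech, neither comes from the slides.

```python
import numpy as np

def simulate_reverb(s, h):
    """Return x[t] = sum_{tau=0}^{T} h[tau] * s[t - tau], same length as s."""
    # np.convolve computes the full convolution; keeping the first len(s)
    # samples gives an observation of the same length N as the source.
    return np.convolve(s, h)[: len(s)]

rng = np.random.default_rng(0)
T = 200                                            # filter order (assumed)
h = rng.standard_normal(T + 1) * np.exp(-np.arange(T + 1) / 40.0)
h[0] = 1.0                                         # direct path
s = rng.standard_normal(1000)                      # stand-in for clean speech
x = simulate_reverb(s, h)                          # reverberant observation
```

Dereverberation is the inverse problem: only `x` is observed, and both `h` and `s` are unknown.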

Approaches

• Time domain
– subspace, Trinicon, Long-term LP
– accurate
– can account for phase distortion

• Power spectral domain
– WF, NMF
– robust against speaker movement

• Feature domain
– front-end VTS, direct CMLLR
– can leverage the AM

[Block diagram: x[t] → Analysis → x_k(t) → Dereverb (one per sub-band) → s_k(t) → Synthesis → s[t]]

Assume in each sub-band:

$$x_k(t) = \sum_{\tau=0}^{T} h_k^*(\tau)\, s_k(t-\tau)$$
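The analysis and synthesis stages can be sketched as follows, assuming a short-time Fourier transform is used as the filterbank (the slides do not say which filterbank is used). The frame length, hop size and square-root Hann window are illustrative choices, not values from the talk.

```python
import numpy as np

def analysis(x, n_fft=256, hop=128):
    """Split x[t] into sub-band signals; result[t, k] corresponds to x_k(t)."""
    win = np.sqrt(np.hanning(n_fft + 1)[:-1])          # periodic sqrt-Hann window
    x = np.concatenate([np.zeros(n_fft), x, np.zeros(n_fft)])  # pad both edges
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def synthesis(X, length, n_fft=256, hop=128):
    """Overlap-add the sub-band signals back into a time-domain signal."""
    win = np.sqrt(np.hanning(n_fft + 1)[:-1])
    frames = np.fft.irfft(X, n=n_fft, axis=1) * win
    out = np.zeros(n_fft + hop * (len(frames) - 1))
    for i, frame in enumerate(frames):
        out[i * hop : i * hop + n_fft] += frame
    return out[n_fft : n_fft + length]                 # strip the edge padding

x_demo = np.random.default_rng(0).standard_normal(1000)
X_demo = analysis(x_demo)          # shape (frames, 129): one column per sub-band k
x_back = synthesis(X_demo, len(x_demo))
```

With 50% overlap the squared window sums to one, so analysis followed by synthesis returns the input; a dereverberation stage would modify each column of `X_demo` in between.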

Inverse Filtering (in Each Sub-band)

$$s_k(t) = \sum_{\tau=0}^{U} g_k^*(\tau)\, x_k(t-\tau)$$

Long-Term Linear Prediction

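$$x_k(t) = \sum_{\tau=\Delta}^{U} a_k^*(\tau)\, x_k(t-\tau) + e_k(t)$$

where the residual e_k(t) is identified with s_k(t), so that

$$s_k(t) = x_k(t) - \sum_{\tau=\Delta}^{U} a_k^*(\tau)\, x_k(t-\tau)$$

we don’t minimise e_k(t)!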

Why LP?

LP vs. FIR

LP:

$$x_k(t) = \sum_{\tau=\Delta}^{U} a_k^*(\tau)\, x_k(t-\tau) + s_k(t)$$

FIR:

$$x_k(t) = \sum_{\tau=0}^{T} h_k^*(\tau)\, s_k(t-\tau)$$

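$$s_k(t) \sim \mathcal{N}(0, \lambda_{k,t})$$

+

$$x_k(t) = \sum_{\tau=\Delta}^{U} a_k^*(\tau)\, x_k(t-\tau) + s_k(t)$$

give

$$x_k(t) \mid (x_k(t'))_{t'=1,\dots,t-1} \sim \mathcal{N}\Big(\sum_{\tau=\Delta}^{U} a_k^*(\tau)\, x_k(t-\tau),\ \lambda_{k,t}\Big)$$

$$\log p\big((x_k(t))_{t=1,\dots,N}\big) = \sum_{t=1}^{N} \log f_{\mathrm{Normal}}\Big(x_k(t);\ \sum_{\tau=\Delta}^{U} a_k^*(\tau)\, x_k(t-\tau),\ \lambda_{k,t}\Big)$$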
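To make the likelihood concrete, here is a small sketch that evaluates it for one sub-band. It assumes NumPy, a prediction delay of at least one frame, and a circularly symmetric complex Gaussian density for the sub-band samples (the slides only write "Normal"). The function names and the four-sample signal are illustrative.

```python
import numpy as np

def lp_residual(x, a, delta):
    """s(t) = x(t) - sum_tau conj(a(tau)) x(t - tau); a[i] stands for a(delta + i)."""
    if delta < 1:
        raise ValueError("delta must be at least 1")
    s = x.astype(complex)                      # astype returns a copy
    for i, coeff in enumerate(a):
        tau = delta + i
        s[tau:] -= np.conj(coeff) * x[:-tau]   # frames before the start count as zero
    return s

def log_likelihood(x, a, lam, delta):
    """Sum over frames of the log complex-Gaussian density of the residual."""
    s = lp_residual(x, a, delta)
    return float(np.sum(-np.log(np.pi * lam) - np.abs(s) ** 2 / lam))

x_toy = np.array([1 + 1j, 2, -1j, 0.5])        # four frames of one sub-band
lam_toy = np.ones(4)                           # unit speech variance in every frame
ll_no_prediction = log_likelihood(x_toy, np.zeros(2), lam_toy, delta=1)
```

Both the coefficients `a` and the variances `lam` are unknown in practice, which is why they have to be estimated jointly.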

Interleaved Estimation of:
- LP coeff A = (a_k(t))_{t=∆,...,U} + speech variance Λ = (λ_{k,t})_{t=1,...,T}
- clean speech samples

[Flowchart: Initialise A, then repeat until convergent: Calculate s_k(t) → Estimate speech vars Λ → Estimate LP coeffs A]
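The interleaved estimation can be sketched for a single sub-band as below. This is an illustration under stated assumptions, not the author's implementation: the variance estimate is the instantaneous power |s_k(t)|² with a floor (`rel_floor` is an assumed safeguard), the LP coefficients come from a weighted least-squares solve, and the vector `g` stores the conjugated coefficients. The synthetic test signal is generated from the LP model itself, with hypothetical coefficients `c`.

```python
import numpy as np

def delayed_matrix(x, delta, U):
    """Row t holds x[t - tau] for tau = delta..U (zeros before the signal starts)."""
    N = len(x)
    X = np.zeros((N, U - delta + 1), dtype=complex)
    for i, tau in enumerate(range(delta, U + 1)):
        X[tau:, i] = x[: N - tau]
    return X

def long_term_lp(x, delta, U, max_iter=100, tol=1e-4, rel_floor=1e-3):
    """Return (dereverberated signal, g), where g[i] plays the role of conj(a(delta + i))."""
    X = delayed_matrix(x, delta, U)
    g = np.zeros(X.shape[1], dtype=complex)            # Initialise A
    floor = rel_floor * np.mean(np.abs(x) ** 2)        # assumed variance floor
    for _ in range(max_iter):
        s = x - X @ g                                  # Calculate s_k(t)
        lam = np.maximum(np.abs(s) ** 2, floor)        # Estimate speech vars
        R = X.conj().T @ (X / lam[:, None])            # weighted correlation matrix
        r = X.conj().T @ (x / lam)                     # weighted correlation vector
        g_new = np.linalg.solve(R, r)                  # Estimate LP coeffs
        converged = np.max(np.abs(g_new - g)) < tol    # Convergent?
        g = g_new
        if converged:
            break
    return x - X @ g, g

# Synthetic check: generate x from the LP model with known coefficients.
rng = np.random.default_rng(0)
N, delta, U = 20000, 2, 4
c = np.array([0.4, -0.2, 0.1], dtype=complex)          # hypothetical true coefficients
lam_true = np.exp(np.sin(np.arange(N) / 50.0))         # slowly varying speech variance
s = np.sqrt(lam_true / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
x = np.zeros(N, dtype=complex)
for t in range(N):
    past = sum(c[i] * x[t - tau] for i, tau in enumerate(range(delta, U + 1)) if t >= tau)
    x[t] = s[t] + past

s_hat, g_hat = long_term_lp(x, delta, U)
```

On this synthetic signal `g_hat` should end up close to `c`. Real recordings do not follow the model exactly, and each sub-band is processed independently.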

Eval on REVERB Challenge Data Sets

System                                   %WER
DNN AM + RNN LM + AM adapt               20.0
Dereverb + DNN AM + RNN LM + AM adapt    16.5

• prompts from 5K WSJ

• trained on multi-condition data

• tested on real recordings from dev set

• small amount of background noise

Eval on AMI Corpus (Meeting Transcription)

System                          %WER (Dev)   %WER (Eval)
DNN AM + 3gram LM               43.5         42.6
Dereverb + DNN AM + 3gram LM    42.0         41.1

• 4 participants in each meeting

• table-top microphone used

• single-speaker segments used

• severe reverberation and background noise
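The gains in the two 1ch evaluations are easier to compare as relative WER reductions. The short calculation below uses only the numbers quoted on the REVERB and AMI slides; the function name is illustrative.

```python
def relative_wer_reduction(baseline, system):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline - system) / baseline

results = {
    "REVERB (real recordings, dev)": (20.0, 16.5),
    "AMI dev": (43.5, 42.0),
    "AMI eval": (42.6, 41.1),
}
for name, (base, dereverb) in results.items():
    print(f"{name}: {relative_wer_reduction(base, dereverb):.1f}% relative")
# REVERB (real recordings, dev): 17.5% relative
# AMI dev: 3.4% relative
# AMI eval: 3.5% relative
```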

1ch Algorithm Summary

• very robust against modelling errors
• keys in development
– modelling the reverberation with LP
– using a reasonable clean speech pdf

Multi-Channel Extension

[Block diagram: Dereverb → BF → To recogniser]

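• LP → MIMO LP

$$\mathbf{x}_k(t) = \sum_{\tau=\Delta}^{U} \mathbf{A}_k^*(\tau)\, \mathbf{x}_k(t-\tau) + \mathbf{e}_k(t)$$

where the residual e_k(t) is identified with h s_k(t)

• single speech model → vector speech model

$$s_k(t) \sim \mathcal{N}(0, \lambda_{k,t}) \;\Leftrightarrow\; \mathbf{h}\, s_k(t) \sim \mathcal{N}(\mathbf{0}, \lambda_{k,t}\, \mathbf{h}\mathbf{h}^*) \approx \mathcal{N}(\mathbf{0}, \lambda_{k,t}\, \mathbf{I})$$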

Interleaved Estimation of:
- LP matrix A = (A_k(t))_{t=∆,...,U} + speech variance Λ = (λ_{k,t})_{t=1,...,T}
- clean speech samples

[Flowchart: Initialise A, then repeat until convergent: Calculate s_k(t) → Estimate speech vars Λ → Estimate LP matrices A]

Eval on REVERB Challenge Data Sets

#Mics   System                                    %WER
1       Baseline (DNN AM + RNN LM + AM adapt)     20.0
1       Dereverb + Baseline                       16.5
2       Dereverb + Baseline                       14.8
2       Dereverb + MVDR + Baseline                13.6
8       Dereverb + Baseline                       14.0
8       Dereverb + MVDR + Baseline                11.3
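The MVDR stage in the table can be illustrated with the textbook weight formula for one sub-band, w = R⁻¹d / (dᴴR⁻¹d). The slides do not describe how the beamformer was configured, so the steering vector `d` and the noise covariance `R_noise` below are hypothetical inputs assumed to be known.

```python
import numpy as np

def mvdr_weights(R_noise, d):
    """MVDR weights for one sub-band: w = R^{-1} d / (d^H R^{-1} d)."""
    Rinv_d = np.linalg.solve(R_noise, d)
    return Rinv_d / (d.conj() @ Rinv_d)

def apply_beamformer(w, X):
    """X has shape (frames, mics); returns y(t) = w^H x(t) for every frame."""
    return X @ w.conj()

rng = np.random.default_rng(0)
M = 8                                              # number of microphones
d = np.exp(-1j * np.pi * 0.3 * np.arange(M))       # hypothetical steering vector
noise = (rng.standard_normal((500, M)) + 1j * rng.standard_normal((500, M))) / np.sqrt(2)
R_noise = noise.T @ noise.conj() / len(noise)      # sample covariance E[x x^H]

w = mvdr_weights(R_noise, d)
w_ds = d / M                                       # delay-and-sum weights for comparison
p_mvdr = np.mean(np.abs(apply_beamformer(w, noise)) ** 2)
p_ds = np.mean(np.abs(apply_beamformer(w_ds, noise)) ** 2)
```

The weights pass the steered direction undistorted (wᴴd = 1) and leave no more residual noise power than delay-and-sum on the same data.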

Long-Term LP Summary

• very robust against modelling errors
• can cover both 1ch and Mch set-ups
• keys in development
– modelling the reverberation with LP
– using a reasonable clean speech pdf

Extensions Explored

• dereverberation+BSS

• adaptive long-term LP

• NMF-based dereverberation
– works in the power spectrum domain

• FE-VTS dereverberation
– works in the feature domain

Dereverberation+BSS

[Block diagram: Dereverb → BSS]

[Bar chart: SIR (dB), axis from 0 to 16, at T60 = 0.3 s and T60 = 0.5 s, for three systems: dereverberation+separation, separation, and w/o separation]

Conclusion

• Dereverberation based on long-term LP
– represents reverberation with LP
– consistent framework covering both 1ch and Mch set-ups
– provides gains over well-optimised DNN AMs in realistic conditions
– extensions to several directions described
