24 Feb 2014
Takuya Yoshioka
NTT CS Labs, Cambridge University
Thanks to: T. Nakatani, K. Kinoshita, M. Delcroix (NTT); M. Gales, X. Chen (Cambridge)
Speech Enhancement for ASR
• Effectiveness measured by WER
  – use of a sensible ASR system is essential
• Huge computational resources available
• Offline processing allowed
• The AM can also do part of the job
Typical ASR System
[Diagram: Signal → Speech Enh → Front-End → Recog Engine → Sentence; the Recog Engine draws on the AM, LM and Pron Dict]
Different Approaches for Different Situations
• 1ch vs. Mch (M > 1)
• background noise;
• reverberant noise; or
• interfering talkers
1ch Dereverberation (Offline)
• Reverberation usually modelled with FIR
$$x[t] = \sum_{\tau=0}^{T} h[\tau]\, s[t-\tau]$$
• Given $(x[t])_{t=1,\dots,N}$, recover $(s[t])_{t=1,\dots,N}$
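The FIR model above can be tried on toy data. The following is a minimal sketch, assuming real-valued signals and zero samples before the start of the recording; the function name `reverberate` and the toy values of `s` and `h` are illustrative and do not come from the slides.

```python
import numpy as np

def reverberate(s, h):
    """Return x[t] = sum_{tau=0}^{T} h[tau] * s[t - tau] for t = 0..len(s)-1."""
    s = np.asarray(s, dtype=float)
    h = np.asarray(h, dtype=float)
    # The full convolution has len(s) + len(h) - 1 samples; keep the first
    # len(s) so that x and s stay aligned sample by sample.
    return np.convolve(s, h)[: len(s)]

# Hypothetical toy data: a unit impulse and a 3-tap decaying room response.
s = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
h = np.array([1.0, 0.5, 0.25])
x = reverberate(s, h)
```

Dereverberation is the inverse problem: only `x` is observed, and both `h` and `s` are unknown.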
Approaches
• Time domain
  – subspace, Trinicon, long-term LP
  – accurate
  – can account for phase distortion
• Power spectral domain
  – WF, NMF
  – robust against speaker movement
• Feature domain
  – front-end VTS, direct CMLLR
  – can leverage the AM
[Diagram: x[t] → Analysis → x_k(t) → Dereverb (one per sub-band) → s_k(t) → Synthesis → s[t]]
Assume in each sub-band:
$$x_k(t) = \sum_{\tau=0}^{T} h_k^*(\tau)\, s_k(t-\tau)$$
Inverse Filtering (in Each Sub-band)
$$s_k(t) = \sum_{\tau=0}^{U} g_k^*(\tau)\, x_k(t-\tau)$$
Long-Term Linear Prediction
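$$x_k(t) = \sum_{\tau=\Delta}^{U} a_k^*(\tau)\, x_k(t-\tau) + e_k(t)$$
The residual $e_k(t)$ is taken as the clean speech $s_k(t)$:
$$s_k(t) = x_k(t) - \sum_{\tau=\Delta}^{U} a_k^*(\tau)\, x_k(t-\tau)$$
We don't minimise $e_k(t)$!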
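Once the prediction coefficients are known, the clean-speech estimate is the observed sample minus the delayed prediction. The sketch below applies that subtraction in one sub-band. It assumes the coefficients are already given (how to estimate them is a separate step) and that samples before the start of the signal are zero; the function name `subtract_late_reverb` and its argument layout are illustrative.

```python
import numpy as np

def subtract_late_reverb(x, a, delay):
    """Compute s(t) = x(t) - sum_{tau=delay}^{U} conj(a(tau)) * x(t - tau).

    x     : sub-band samples x_k(t), real or complex
    a     : coefficients for tau = delay, ..., U (so len(a) == U - delay + 1)
    delay : the prediction delay Delta
    """
    x = np.asarray(x, dtype=complex)
    n = len(x)
    s = x.copy()
    for i, coeff in enumerate(a):
        tau = delay + i
        if tau >= n:
            break  # no sample is old enough to be predicted with this lag
        # Samples before t = 0 are assumed to be zero (assumption of this sketch).
        s[tau:] -= np.conj(coeff) * x[: n - tau]
    return s

# Hypothetical toy signal: an impulse followed by echoes every 2 samples.
x_toy = [1.0, 0.0, 0.5, 0.0, 0.25, 0.0]
s_toy = subtract_late_reverb(x_toy, a=[0.5], delay=2)
```

With the matching coefficient the echoes cancel and only the initial impulse remains in `s_toy`.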
Why LP?
LP vs. FIR
LP:
$$x_k(t) = \sum_{\tau=\Delta}^{U} a_k^*(\tau)\, x_k(t-\tau) + s_k(t)$$
FIR:
$$x_k(t) = \sum_{\tau=0}^{T} h_k^*(\tau)\, s_k(t-\tau)$$
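$$y_k(t) \mid (y_k(t'))_{t'=1,\dots,t-1} \sim \mathcal{N}\Big(\sum_{\tau=\Delta}^{U} a_k^*(\tau)\, y_k(t-\tau),\; \lambda_{k,t}\Big)$$
$$\log p\big((y_k(t))_{t=1,\dots,N}\big) = \sum_{t=1}^{N} \log f_{\text{Normal}}\Big(y_k(t);\; \sum_{\tau=\Delta}^{U} a_k^*(\tau)\, y_k(t-\tau),\; \lambda_{k,t}\Big)$$
These follow from combining the speech model with the LP model:
$$s_k(t) \sim \mathcal{N}(0, \lambda_{k,t})$$
$$x_k(t) = \sum_{\tau=\Delta}^{U} a_k^*(\tau)\, x_k(t-\tau) + s_k(t)$$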
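Expanding the Gaussian log-likelihood shows what each unknown has to optimise. This is a sketch under stated assumptions: $r_k(t)$ and the constant $c$ are notation introduced here, with $c = 1$ for a circular complex Gaussian and $c = 1/2$ for a real one; the slides only say "Normal".

```latex
\begin{align*}
r_k(t) &= y_k(t) - \sum_{\tau=\Delta}^{U} a_k^*(\tau)\, y_k(t-\tau) \\
\log p\big((y_k(t))_{t=1,\dots,N}\big)
  &= -c \sum_{t=1}^{N} \left[ \log \lambda_{k,t} + \frac{|r_k(t)|^2}{\lambda_{k,t}} \right] + \text{const} \\
\text{fixed } \Lambda:\quad \hat{a}_k
  &= \arg\min_{a_k} \sum_{t=1}^{N} \frac{|r_k(t)|^2}{\lambda_{k,t}} \\
\text{fixed } a_k:\quad \hat{\lambda}_{k,t} &= |r_k(t)|^2
\end{align*}
```

Because each residual is divided by its own variance, the coefficient update is a weighted least-squares problem rather than a plain minimisation of the prediction error.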
Interleaved Estimation of:
- LP coeffs $A = (a_k(t))_{t=\Delta,\dots,U}$ + speech variances $\Lambda = (\lambda_{k,t})_{t=1,\dots,T}$
- clean speech samples
[Flowchart boxes: Initialise A; Calculate s_k(t); Estimate speech vars Λ; Estimate LP coeffs A; Convergent?]
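The interleaved loop can be sketched for a single sub-band as follows. Several choices here are assumptions and are not stated on the slides: the order of the steps within the loop, the variance estimate $|s_k(t)|^2$ with a small floor `eps`, the weighted least-squares update for the coefficients, and a fixed iteration count in place of the convergence test. The function name `long_term_lp_dereverb` is illustrative.

```python
import numpy as np

def long_term_lp_dereverb(x, delay, order, n_iter=3, eps=1e-8):
    """Interleaved estimation in one sub-band (sketch, see assumptions above).

    x     : complex sub-band samples x_k(t)
    delay : prediction delay Delta
    order : largest lag U
    Returns (clean speech estimate, conj(a_k(tau)) for tau = delay..order).
    """
    x = np.asarray(x, dtype=complex)
    n = len(x)
    # Column j holds x delayed by tau = delay + j, with zeros before t = 0.
    X = np.zeros((n, order - delay + 1), dtype=complex)
    for j, tau in enumerate(range(delay, order + 1)):
        if tau < n:
            X[tau:, j] = x[: n - tau]

    a_conj = np.zeros(X.shape[1], dtype=complex)       # initialise A
    for _ in range(n_iter):                            # fixed count, no convergence test
        s = x - X @ a_conj                             # calculate s_k(t)
        lam = np.maximum(np.abs(s) ** 2, eps)          # estimate speech variances
        w = 1.0 / np.sqrt(lam)
        # Estimate LP coeffs: minimise sum_t |x(t) - X[t] @ a_conj|^2 / lam[t].
        a_conj, *_ = np.linalg.lstsq(X * w[:, None], x * w, rcond=None)
    return x - X @ a_conj, a_conj

# Hypothetical toy signal: an impulse followed by echoes every 2 samples.
x_toy = np.array([1.0, 0.0, 0.5, 0.0, 0.25, 0.0, 0.125, 0.0])
s_hat, a_hat = long_term_lp_dereverb(x_toy, delay=2, order=2)
```

On this toy input the echoes are perfectly predictable from the sample two steps earlier, so the loop recovers the coefficient 0.5 and leaves only the initial impulse in `s_hat`.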
Eval on REVERB Challenge Data Sets
System                                   %WER
DNN AM + RNN LM + AM adapt               20.0
Dereverb + DNN AM + RNN LM + AM adapt    16.5
• prompts from 5K WSJ
• trained on multi-condition data
• tested on real recordings from dev set
• small amount of background noise
Eval on AMI Corpus (Meeting Transcription)
System                          %WER (Dev)   %WER (Eval)
DNN AM + 3gram LM               43.5         42.6
Dereverb + DNN AM + 3gram LM    42.0         41.1
• 4 participants in each meeting
• table-top microphone used
• single-speaker segments used
• severe reverberation and background noise
1ch Algorithm Summary
• very robust against modelling errors
• keys in development
  – modelling the reverberation with LP
  – using a reasonable clean speech pdf
Multi-Channel Extension
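[Diagram: Dereverb → BF → To recogniser]
• LP → MIMO LP
$$\mathbf{x}_k(t) = \sum_{\tau=\Delta}^{U} \mathbf{A}_k^*(\tau)\, \mathbf{x}_k(t-\tau) + \mathbf{e}_k(t), \qquad \mathbf{e}_k(t) = \mathbf{h}\, s_k(t)$$
• single speech model → vector speech model
$$s_k(t) \sim \mathcal{N}(0, \lambda_{k,t}) \;\Leftrightarrow\; \mathbf{h}\, s_k(t) \sim \mathcal{N}(0, \lambda_{k,t}\, \mathbf{h}\mathbf{h}^*) \approx \mathcal{N}(0, \lambda_{k,t}\, \mathbf{I})$$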
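In the multi-channel case the scalar coefficients become matrices acting on a vector of microphone signals. The sketch below computes the MIMO LP residual for given prediction matrices. It assumes that the star on a matrix denotes the conjugate transpose and that samples before the start of the signal are zero; the function name `mimo_lp_residual` and the array layout are illustrative.

```python
import numpy as np

def mimo_lp_residual(x, A, delay):
    """Compute e(t) = x(t) - sum_{tau=delay}^{U} A(tau)^H x(t - tau).

    x     : array of shape (N, M), one row per frame, one column per microphone
    A     : array of shape (L, M, M); A[i] is the matrix for tau = delay + i
    delay : the prediction delay Delta
    """
    x = np.asarray(x, dtype=complex)
    A = np.asarray(A, dtype=complex)
    n = x.shape[0]
    e = x.copy()
    for i in range(A.shape[0]):
        tau = delay + i
        if tau >= n:
            break
        # Row form of A^H x(t - tau): (A^H v)^T equals v^T conj(A).
        e[tau:] -= x[: n - tau] @ np.conj(A[i])
    return e

# Hypothetical 2-microphone toy: channel 2 hears channel 1 one frame late, at half gain.
x_toy = np.array([[1.0, 0.0], [0.0, 0.5], [0.0, 0.0], [0.0, 0.0]])
A_toy = np.array([[[0.0, 0.5], [0.0, 0.0]]])
e_toy = mimo_lp_residual(x_toy, A_toy, delay=1)
```

The residual keeps one value per microphone, so a beamformer can still be applied to it afterwards.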
Interleaved Estimation of:
- LP matrices $A = (\mathbf{A}_k(t))_{t=\Delta,\dots,U}$ + speech variances $\Lambda = (\lambda_{k,t})_{t=1,\dots,T}$
- clean speech samples
[Flowchart boxes: Initialise A; Calculate s_k(t); Estimate speech vars Λ; Estimate LP matrices A; Convergent?]
Eval on REVERB Challenge Data Sets
#Mics   System                                    %WER
1       Baseline (DNN AM + RNN LM + AM adapt)     20.0
1       Dereverb + Baseline                       16.5
2       Dereverb + Baseline                       14.8
2       Dereverb + MVDR + Baseline                13.6
8       Dereverb + Baseline                       14.0
8       Dereverb + MVDR + Baseline                11.3
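The absolute WERs in the table can be turned into relative reductions. The sketch below does this arithmetic, assuming the 1-microphone baseline (20.0%) is the reference for every row, since it is the only baseline listed; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(baseline_wer, wer):
    """Relative WER reduction in percent with respect to the baseline."""
    return 100.0 * (baseline_wer - wer) / baseline_wer

baseline = 20.0  # 1 mic, DNN AM + RNN LM + AM adapt

# (number of microphones, system) -> %WER, copied from the table above.
results = {
    (1, "Dereverb"): 16.5,
    (2, "Dereverb"): 14.8,
    (2, "Dereverb + MVDR"): 13.6,
    (8, "Dereverb"): 14.0,
    (8, "Dereverb + MVDR"): 11.3,
}

reductions = {
    key: round(relative_reduction(baseline, wer), 1)
    for key, wer in results.items()
}
```

Under that assumption, single-channel dereverberation gives a 17.5% relative reduction, and 8-microphone dereverberation with MVDR gives 43.5%.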
Long-Term LP Summary
• very robust against modelling errors
• can cover both 1ch and Mch set-ups
• keys in development
  – modelling the reverberation with LP
  – using a reasonable clean speech pdf
Extensions Explored
• dereverberation+BSS
• adaptive long-term LP
• NMF-based dereverberation
  – works in the power spectrum domain
• FE-VTS dereverberation
  – works in the feature domain
Dereverberation+BSS
[Diagram: Dereverb → BSS]
[Figure: bar chart of SIR (dB), axis from 0 to 16, at T60 = 0.3 s and T60 = 0.5 s, comparing dereverberation+separation, separation, and w/o separation]
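The figure reports the signal-to-interference ratio (SIR) in dB. As a reminder of what that number means, here is a minimal sketch that computes it from a separated output, assuming the target and interference components of that output are available separately (as in a simulation); the function name `sir_db` is illustrative and this is not necessarily the evaluation procedure behind the figure.

```python
import math

def sir_db(target, interference):
    """SIR in dB: 10 * log10(target energy / interference energy)."""
    target_energy = sum(abs(v) ** 2 for v in target)
    interference_energy = sum(abs(v) ** 2 for v in interference)
    return 10.0 * math.log10(target_energy / interference_energy)

# Hypothetical toy components: the interference has one tenth of the target amplitude.
toy_sir = sir_db([1.0, 1.0], [0.1, 0.1])
```

A tenfold amplitude ratio is a hundredfold energy ratio, so `toy_sir` comes out at about 20 dB.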
Conclusion
• Dereverberation based on long-term LP
  – represents reverberation with LP
  – consistent framework covering both 1ch and Mch set-ups
  – provides gains over well-optimised DNN AMs in realistic conditions
  – extensions in several directions described