Topic
• Why Speech Recognizers Make Errors? A Robustness View (ICSLP 2004)
• Weighting Observation Vectors for Robust Speech Recognition in Noisy Environments (ICSLP 2004)
Why Speech Recognizers Make Errors? A Robustness View
Hong Kook Kim and Mazin Rahim
Gwangju Institute of Science and Technology (GIST), Korea
AT&T Labs-Research, USA
ICSLP 2004
Reporter: Shih-Hsiang
Introduction
• Various kinds of robustness problems
– attributed to background noise, coarticulation effects, channel distortion, accents and dialects
• Several novel algorithms have been proposed
– Minimize the acoustic mismatch between the training model and the testing environment
– Enhancement, normalization or adaptation
– Feature domain / model domain
• This paper tries to create a diagnostic tool that provides better insight into "why recognizers make errors"
Stationary Quantity of Noise: Stationary Signal-to-Noise Ratio
• Investigate the effect of environment noise on the ASR performance
• First step: detect and measure background noise
– Voice activity detection (VAD)
• according to the context of speech
– Energy detection
• using a histogram and a threshold
– Forced alignment
• preserving the state segmentation, or forced alignment with the recognized transcription
• Alternatively, train a binary classifier or a Gaussian mixture model to separate speech from silence
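The energy-detection step above can be sketched in a few lines of Python. This is a hypothetical illustration only: the slide's histogram-based threshold is simplified here to a fixed margin above the minimum frame energy, and the function name and margin value are assumptions.

```python
import numpy as np

def energy_vad(log_energy, margin_db=10.0):
    """Toy energy-based speech/silence decision (illustrative sketch).

    Frames whose log energy exceeds the minimum frame energy by more
    than `margin_db` are labeled speech (1); the rest are silence (0).
    """
    log_energy = np.asarray(log_energy, dtype=float)
    threshold = log_energy.min() + margin_db
    return (log_energy > threshold).astype(int)

# e.g. energy_vad([0, 1, 25, 30, 2]) -> array([0, 0, 1, 1, 0])
```

A real system would pick the threshold from the valley between the two modes of the frame-energy histogram rather than a fixed margin.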
Stationary Quantity of Noise: Stationary Signal-to-Noise Ratio (cont.)
[Block diagram: the speech waveform and its transcription feed three parallel speech/noise decision paths — voice activity detection, energy clustering, and HMM forced alignment (using the acoustic model and a generated dictionary for speech/silence state decoding) — each followed by SNR computation, yielding SNR_V, SNR_E and SNR_F respectively.]
Stationary Quantity of Noise: Stationary Signal-to-Noise Ratio (cont.)
• Second Step: Measure SNR
I(n): indicator for the n-th analysis frame (I(n) = 1 in speech intervals, I(n) = 0 in silent intervals)
E(n): log energy of the n-th frame
E(n) = 10 log10 Σ_i s_i^2(n)

P_S = [1 / Σ_{n=1..L} I(n)] · Σ_{n=1..L} I(n) E(n)          (average log energy over speech frames)

P_N = [1 / Σ_{n=1..L} (1 − I(n))] · Σ_{n=1..L} (1 − I(n)) E(n)   (average log energy over silent frames)

SNR = P_S − P_N
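Given the frame indicator I(n) and the log energies E(n), the stationary SNR measurement reduces to a few NumPy lines. A minimal sketch; the function name and array interface are assumptions, not the paper's code.

```python
import numpy as np

def stationary_snr(E, I):
    """Stationary SNR from per-frame log energies E(n) and a
    speech/silence indicator I(n) (1 = speech, 0 = silence)."""
    E = np.asarray(E, dtype=float)
    I = np.asarray(I, dtype=int)
    P_S = E[I == 1].mean()   # average log energy over speech frames
    P_N = E[I == 0].mean()   # average log energy over silent frames
    return P_S - P_N         # SNR in dB, since E(n) is already in log domain

# e.g. stationary_snr([30, 30, 10, 10], [1, 1, 0, 0]) -> 20.0
```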
Stationary Quantity of Noise: Stationary Signal-to-Noise Ratio (cont.)
"I need to inquire about a bill that was sent." (laughing from 6 to 8 seconds)
• VAD: 30.67 dB
• Energy clustering: 36.29 dB
• Forced alignment: 23.52 dB
*The forced-alignment approach turns out to be more robust for speech/non-speech detection
Time-Varying Quantity of Noise: Nonstationary SNR
• Stationary SNR measurement does not reflect the local characteristics of environmental noise
• Using standard deviation of noise power normalized by the average signal power
NSNR = sqrt{ [1 / Σ_{n=1..L} (1 − I(n))] · Σ_{n=1..L} (1 − I(n)) [E(n) − (P_S − SNR)]^2 }

(Since P_S − SNR = P_N, NSNR is the standard deviation of the noise log energy about its mean, tied to the average signal power through the SNR term.)

*Smaller variations in the noise characteristics among different frames result in a lower NSNR measurement
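The NSNR measurement — the spread of the noise log energy about P_S − SNR — can be sketched as below. The function name and the convention of passing in the already-computed stationary SNR are assumptions.

```python
import numpy as np

def nonstationary_snr(E, I, snr):
    """NSNR: standard deviation of the noise log energy about
    P_S - SNR (which equals the average noise log energy P_N)."""
    E = np.asarray(E, dtype=float)
    I = np.asarray(I, dtype=int)
    noise = E[I == 0]                 # log energies of silent frames
    P_S = E[I == 1].mean()            # average speech log energy
    # deviation of each noise frame from the mean noise level
    return float(np.sqrt(np.mean((noise - (P_S - snr)) ** 2)))

# e.g. with E = [30, 30, 8, 12], I = [1, 1, 0, 0], snr = 20.0
# the noise frames deviate by ±2 dB from their 10 dB mean, so NSNR = 2.0
```

A perfectly stationary noise floor gives NSNR = 0; bursty noise (laughter, door slams) inflates it even when the stationary SNR looks benign.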
Effect of Stationary and Nonstationary SNRs on ASR Performance
• Corpus
– telephone speech collected over 20 different datasets
– 5,171 utterances (54,658 words) for testing
– trigram language model
Effect of Stationary and Nonstationary SNRs on ASR Performance (cont.)
Effect of Stationary and Nonstationary SNRs on ASR Performance (cont.)
• Estimated word accuracy (linear regression model)

âsr_{i,j} = α · SSNR_{i,j} + β · NSNR_{i,j} + γ

• Estimation error

e = (1/N) Σ_{i,j} (asr_{i,j} − âsr_{i,j})^2
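The regression diagnostic amounts to a least-squares fit of word accuracy against the two SNR measures. A sketch using made-up illustrative numbers — the arrays below are NOT the paper's data:

```python
import numpy as np

# Hypothetical per-dataset measurements: stationary SNR, nonstationary
# SNR, and the observed word accuracy (illustrative values only).
ssnr = np.array([25.0, 20.0, 15.0, 10.0, 5.0])
nsnr = np.array([2.0, 3.0, 4.0, 6.0, 8.0])
acc  = np.array([90.0, 85.0, 78.0, 65.0, 50.0])

# Design matrix [SSNR, NSNR, 1]; least-squares fit of the linear model
# acc_hat = alpha * SSNR + beta * NSNR + gamma.
A = np.column_stack([ssnr, nsnr, np.ones_like(ssnr)])
coef, *_ = np.linalg.lstsq(A, acc, rcond=None)

pred = A @ coef
err = float(np.mean((acc - pred) ** 2))   # estimation error e
```

A small `err` relative to the variance of `acc` indicates that stationary and nonstationary SNR together explain most of the accuracy differences across datasets.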
Weighting Observation Vectors for Robust Speech Recognition in Noisy Environments
Zhenyu Xiong, Thomas Fang Zheng, and Wenhu Wu
Center for Speech Technology, Tsinghua University, China
ICSLP 2004
Reporter: Shih-Hsiang
Front-end Module
Quantile based speech/non-speech Detection
• Based on order statistics (OS) filters to obtain an estimate of the local SNR of the speech signal
• Two OS filters are applied to the log energy of the signal
– Median filter: tracks the background noise level (B)
– 0.9 quantile (Q(0.9)): tracks the signal level
– Their difference is called the quantile-based estimate of the instantaneous SNR (QSNR) of the signal
Quantile based speech/non-speech Detection (cont.)
• Let E_{t−L}, …, E_{t+L} be the log energy values of the 2L+1 frames around the frame t to be analyzed
• Let E_(r), where r = 1, …, 2L+1, be the corresponding values sorted in ascending order
• Then E_(L+1) is the output of the median filter
• For the other filter, the quantile is obtained by linear interpolation between neighboring order statistics

Q(p) = (1 − f_k) E_(k) + f_k E_(k+1),   where p(2L+1) = k + f_k (k an integer, 0 ≤ f_k < 1)

• The speech/non-speech detection is made by comparing the estimated SNR with a threshold
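The two OS filters over a 2L+1 window can be sketched as follows. One simplifying assumption: the quantile here interpolates at fractional index q·2L into the sorted window, which differs slightly from the slide's p(2L+1) convention.

```python
import numpy as np

def os_filter_snr(log_energy, t, L=5, q=0.9):
    """Quantile-based instantaneous SNR (QSNR) at frame t.

    Sorts the 2L+1 log energies around frame t; the median tracks
    the background noise level B and the q-quantile tracks the
    signal level.  QSNR is their difference.
    """
    window = np.sort(np.asarray(log_energy[t - L:t + L + 1], dtype=float))
    B = window[L]                      # median: E_(L+1) in 1-based indexing
    # linear interpolation between order statistics for the q-quantile
    pos = q * (2 * L)                  # fractional index into sorted window
    k, f = int(pos), pos - int(pos)
    Q = (1 - f) * window[k] + f * window[min(k + 1, 2 * L)]
    return Q - B                       # QSNR = signal level - noise level

# e.g. os_filter_snr(list(range(11)), t=5, L=5) -> 4.0
```

Comparing the returned QSNR against a threshold gives the frame's speech/non-speech label.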
Quantile based speech/non-speech Detection (cont.)
Noise estimation
• Let S(ω,t) be the power spectrum of the input signal at frequency ω in the t-th frame
• Let N(ω,t) be the power spectrum of the estimated noise at frequency ω in the t-th frame

N(ω,t) = (1 − λ) N(ω,t−1) + λ S(ω,t)   for non-speech frames
N(ω,t) = N(ω,t−1)                      for speech frames

λ = 0.05 is the forgetting factor
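The recursive noise update is a one-liner per frame. A sketch; argument names and the per-call interface are assumptions.

```python
import numpy as np

def update_noise(N_prev, S, is_speech, lam=0.05):
    """Recursive noise-spectrum update with forgetting factor lam.

    Non-speech frames blend the current power spectrum S into the
    running noise estimate; speech frames leave it unchanged.
    """
    if is_speech:
        return N_prev
    return (1.0 - lam) * N_prev + lam * S

# e.g. update_noise(np.array([1.0]), np.array([3.0]), False)
#   -> array([1.1]), since 0.95*1.0 + 0.05*3.0 = 1.1
```

The small λ makes the estimate track slow drifts in the noise floor while ignoring brief fluctuations.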
Spectral subtraction
• A traditional non-linear spectral subtraction algorithm in the power-spectrum domain is used for noise reduction

Ŝ(ω,t) = max{ S(ω,t) − α N(ω,t), β S(ω,t) }

α = 1.1: the over-subtraction factor
β = 0.1: the spectral floor
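The subtraction rule itself, applied bin-wise over the power spectrum, can be sketched as:

```python
import numpy as np

def spectral_subtract(S, N, alpha=1.1, beta=0.1):
    """Non-linear spectral subtraction in the power-spectrum domain.

    alpha is the over-subtraction factor; beta sets the spectral
    floor so no bin drops below beta * S.
    """
    return np.maximum(S - alpha * N, beta * S)

# e.g. spectral_subtract(np.array([10.0]), np.array([2.0])) -> array([7.8])
#      spectral_subtract(np.array([1.0]),  np.array([5.0])) -> array([0.1])
```

The floor prevents negative power values and reduces musical-noise artifacts when the noise estimate overshoots.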
Frame SNR estimation
• Based on the results of noise estimation and spectral subtraction
• Indicates the degree to which the current speech frame is uncorrupted by noise

SNR(t) = 10 log10 [ Σ_ω Ŝ(ω,t) / Σ_ω N(ω,t) ]
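Combining the cleaned spectrum with the noise estimate gives the frame SNR. A sketch: the garbled slide leaves it ambiguous whether the ratio is per-bin or summed over frequency; the summed form is assumed here.

```python
import numpy as np

def frame_snr(S_hat, N):
    """Frame-level SNR in dB from the cleaned power spectrum S_hat
    and the noise estimate N, summing power over frequency bins."""
    return 10.0 * np.log10(np.sum(S_hat) / np.sum(N))

# e.g. frame_snr(np.array([10.0, 10.0]), np.array([1.0, 1.0])) -> 10.0
```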
Weighting Algorithm
• In conventional HMM-based speech recognition, the likelihood of an observation sequence X is

P(X|λ) = Π_{j=1..T} a_{q_{j−1},q_j} b_{q_j}(x_j)

• To emphasize the observations for slightly corrupted speech, each output probability is exponentially weighted

P(X|λ) = Π_{j=1..T} a_{q_{j−1},q_j} [b_{q_j}(x_j)]^{δ r_j}

r_j is an observation weighting factor emphasizing reliable frames; δ is a factor used to adjust the degree of emphasis
Weighting factor (a sigmoid of the frame SNR):

r_j = 1 / (1 + exp(−(SNR(x_j) − 10) / 2))
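The weighting scheme can be sketched in the log domain, where the exponent δ·r_j becomes a per-frame scaling of log b_j(x_j). The sigmoid center (10 dB) and slope (2) are read off the garbled slide and should be treated as assumptions, as should the function names.

```python
import numpy as np

def observation_weight(snr_db, center=10.0, slope=2.0):
    """Sigmoid weight r_j for a frame with SNR snr_db (in dB):
    near 1 for clean frames, near 0 for heavily corrupted ones."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(snr_db, dtype=float) - center) / slope))

def weighted_loglik(log_b, snr_db, delta=1.0):
    """Accumulated weighted log output probability: each frame's
    log b_j(x_j) is scaled by delta * r_j before summing."""
    r = observation_weight(snr_db)
    return float(np.sum(delta * r * np.asarray(log_b, dtype=float)))

# e.g. observation_weight(10.0) -> 0.5 (a frame exactly at the center)
#      weighted_loglik([-2.0], [10.0]) -> -1.0
```

Down-weighting low-SNR frames lets clean frames dominate the Viterbi score, which is the mechanism behind the accuracy gains reported below.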
Experimental Result
• Clean speech
– isolated-word database recorded by 20 speakers
– each speaker spoke 100 Chinese names 4 times
– the dataset contains 7,893 isolated-word utterances
• Four different kinds of noise
– babble noise, factory noise, pink noise, white noise
• Recognition system
– di-IFs corpus
– 3 states and a mixture of 8 Gaussian pdfs per state
– acoustic model employs 42-dimensional features
Experimental Result (cont.)