Topic
• Why Speech Recognizers Make Errors? A Robustness View (ICSLP 2004)
• Weighting Observation Vectors for Robust Speech Recognition in Noisy Environments (ICSLP 2004)
Why Speech Recognizers Make Errors? A Robustness View
Hong Kook Kim and Mazin Rahim
Gwangju Institute of Science and Technology (GIST), Korea
AT&T Labs-Research, USA
ICSLP 2004
Reporter: Shih-Hsiang
Introduction
• Various kinds of robustness problems
– attributed to background noise, coarticulation effects, channel distortion, accents and dialects
• Several novel algorithms have been proposed
– Minimize the acoustic mismatch between the training model and the testing environment
– Enhancement, normalization or adaptation
– Feature domain / model domain
• This paper tries to create a diagnostic tool that provides better insight into "why recognizers make errors"
Stationary Quantity of Noise: Stationary Signal-to-Noise Ratio
• Investigate the effect of environment noise on the ASR performance
• First step: detect and measure background noise
– Voice activity detection (VAD)
• according to the context of speech
– Energy detection
• using a histogram and a threshold
– Forced alignment
• preserving the state segmentation, or forced alignment with the recognized transcription
• Alternatively, train a binary classifier or a Gaussian mixture model to separate speech from silence
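The energy-detection step above can be sketched in a few lines of Python. This is a hypothetical illustration only: the slide's histogram-based threshold is simplified here to a fixed margin above the minimum frame energy, and the function name and margin value are assumptions.

```python
import numpy as np

def energy_vad(log_energy, margin_db=10.0):
    """Toy energy-based speech/silence decision (illustrative sketch).

    Frames whose log energy exceeds the minimum frame energy by more
    than `margin_db` are labeled speech (1); the rest are silence (0).
    """
    log_energy = np.asarray(log_energy, dtype=float)
    threshold = log_energy.min() + margin_db
    return (log_energy > threshold).astype(int)

# e.g. energy_vad([0, 1, 25, 30, 2]) -> array([0, 0, 1, 1, 0])
```

A real system would pick the threshold from the valley between the two modes of the frame-energy histogram rather than a fixed margin.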
Stationary Quantity of Noise: Stationary Signal-to-Noise Ratio (cont.)
[Block diagram: the speech waveform and its transcription feed three parallel speech/noise decision paths — voice activity detection, energy clustering, and HMM forced alignment (using the acoustic model and a generated dictionary for speech/silence state decoding) — each followed by SNR computation, yielding SNR_V, SNR_E and SNR_F respectively.]
Stationary Quantity of Noise: Stationary Signal-to-Noise Ratio (cont.)
• Second Step: Measure SNR
I(n): indicator for the n-th analysis frame (I(n) = 1 in speech intervals, I(n) = 0 in silent intervals)
E(n): log energy of the n-th frame
E(n) = 10 log10 Σ_i s_i^2(n)

P_S = [1 / Σ_{n=1..L} I(n)] · Σ_{n=1..L} I(n) E(n)          (average log energy over speech frames)

P_N = [1 / Σ_{n=1..L} (1 − I(n))] · Σ_{n=1..L} (1 − I(n)) E(n)   (average log energy over silent frames)

SNR = P_S − P_N
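Given the frame indicator I(n) and the log energies E(n), the stationary SNR measurement reduces to a few NumPy lines. A minimal sketch; the function name and array interface are assumptions, not the paper's code.

```python
import numpy as np

def stationary_snr(E, I):
    """Stationary SNR from per-frame log energies E(n) and a
    speech/silence indicator I(n) (1 = speech, 0 = silence)."""
    E = np.asarray(E, dtype=float)
    I = np.asarray(I, dtype=int)
    P_S = E[I == 1].mean()   # average log energy over speech frames
    P_N = E[I == 0].mean()   # average log energy over silent frames
    return P_S - P_N         # SNR in dB, since E(n) is already in log domain

# e.g. stationary_snr([30, 30, 10, 10], [1, 1, 0, 0]) -> 20.0
```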
Stationary Quantity of Noise: Stationary Signal-to-Noise Ratio (cont.)
"I need to inquire about a bill that was sent." (laughing from 6 to 8 seconds)
• VAD: 30.67 dB
• Energy clustering: 36.29 dB
• Forced alignment: 23.52 dB
*The forced-alignment approach turns out to be more robust for speech/non-speech detection
Time-Varying Quantity of Noise: Nonstationary SNR
• Stationary SNR measurement does not reflect the local characteristics of environmental noise
• Using standard deviation of noise power normalized by the average signal power
NSNR = sqrt{ [1 / Σ_{n=1..L} (1 − I(n))] · Σ_{n=1..L} (1 − I(n)) [E(n) − (P_S − SNR)]^2 }

(Since P_S − SNR = P_N, NSNR is the standard deviation of the noise log energy about its mean, tied to the average signal power through the SNR term.)

*Smaller variations in the noise characteristics among different frames result in a lower NSNR measurement
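The NSNR measurement — the spread of the noise log energy about P_S − SNR — can be sketched as below. The function name and the convention of passing in the already-computed stationary SNR are assumptions.

```python
import numpy as np

def nonstationary_snr(E, I, snr):
    """NSNR: standard deviation of the noise log energy about
    P_S - SNR (which equals the average noise log energy P_N)."""
    E = np.asarray(E, dtype=float)
    I = np.asarray(I, dtype=int)
    noise = E[I == 0]                 # log energies of silent frames
    P_S = E[I == 1].mean()            # average speech log energy
    # deviation of each noise frame from the mean noise level
    return float(np.sqrt(np.mean((noise - (P_S - snr)) ** 2)))

# e.g. with E = [30, 30, 8, 12], I = [1, 1, 0, 0], snr = 20.0
# the noise frames deviate by ±2 dB from their 10 dB mean, so NSNR = 2.0
```

A perfectly stationary noise floor gives NSNR = 0; bursty noise (laughter, door slams) inflates it even when the stationary SNR looks benign.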
Effect of Stationary and Nonstationary SNRs on ASR Performance
• Corpus
– telephone speech collected over 20 different datasets
– 5,171 utterances (54,658 words) for testing
– trigram language model
Effect of Stationary and Nonstationary SNRs on ASR Performance (cont.)
Effect of Stationary and Nonstationary SNRs on ASR Performance (cont.)
• Estimated word accuracy (linear regression model)

âsr_{i,j} = α · SSNR_{i,j} + β · NSNR_{i,j} + γ

• Estimation error

e = (1/N) Σ_{i,j} (asr_{i,j} − âsr_{i,j})^2
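The regression diagnostic amounts to a least-squares fit of word accuracy against the two SNR measures. A sketch using made-up illustrative numbers — the arrays below are NOT the paper's data:

```python
import numpy as np

# Hypothetical per-dataset measurements: stationary SNR, nonstationary
# SNR, and the observed word accuracy (illustrative values only).
ssnr = np.array([25.0, 20.0, 15.0, 10.0, 5.0])
nsnr = np.array([2.0, 3.0, 4.0, 6.0, 8.0])
acc  = np.array([90.0, 85.0, 78.0, 65.0, 50.0])

# Design matrix [SSNR, NSNR, 1]; least-squares fit of the linear model
# acc_hat = alpha * SSNR + beta * NSNR + gamma.
A = np.column_stack([ssnr, nsnr, np.ones_like(ssnr)])
coef, *_ = np.linalg.lstsq(A, acc, rcond=None)

pred = A @ coef
err = float(np.mean((acc - pred) ** 2))   # estimation error e
```

A small `err` relative to the variance of `acc` indicates that stationary and nonstationary SNR together explain most of the accuracy differences across datasets.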
Weighting Observation Vectors for Robust Speech Recognition in Noisy Environments
Zhenyu Xiong, Thomas Fang Zheng, and Wenhu Wu
Center for Speech Technology, Tsinghua University, China
ICSLP 2004
Reporter: Shih-Hsiang
Front-end Module
Quantile based speech/non-speech Detection
• Based on order statistics (OS) filters to obtain an estimate of the local SNR of the speech signal
• Two OS filters are applied to the log energy of the signal
– Median filter: tracks the background noise level (B)
– 0.9 quantile (Q(0.9)): tracks the signal level
– Their difference is called the quantile-based estimate of the instantaneous SNR (QSNR) of the signal
Quantile based speech/non-speech Detection (cont.)
• Let E_{t−L}, …, E_{t+L} be the log energy values of the 2L+1 frames around the frame t to be analyzed
• Let E_(r), where r = 1, …, 2L+1, be the corresponding values sorted in ascending order
• Then E_(L+1) is the output of the median filter
• For the other filter, the quantile is obtained by linear interpolation between neighboring order statistics

Q(p) = (1 − f_k) E_(k) + f_k E_(k+1),   where p(2L+1) = k + f_k (k an integer, 0 ≤ f_k < 1)

• The speech/non-speech detection is made by comparing the estimated SNR with a threshold
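The two OS filters over a 2L+1 window can be sketched as follows. One simplifying assumption: the quantile here interpolates at fractional index q·2L into the sorted window, which differs slightly from the slide's p(2L+1) convention.

```python
import numpy as np

def os_filter_snr(log_energy, t, L=5, q=0.9):
    """Quantile-based instantaneous SNR (QSNR) at frame t.

    Sorts the 2L+1 log energies around frame t; the median tracks
    the background noise level B and the q-quantile tracks the
    signal level.  QSNR is their difference.
    """
    window = np.sort(np.asarray(log_energy[t - L:t + L + 1], dtype=float))
    B = window[L]                      # median: E_(L+1) in 1-based indexing
    # linear interpolation between order statistics for the q-quantile
    pos = q * (2 * L)                  # fractional index into sorted window
    k, f = int(pos), pos - int(pos)
    Q = (1 - f) * window[k] + f * window[min(k + 1, 2 * L)]
    return Q - B                       # QSNR = signal level - noise level

# e.g. os_filter_snr(list(range(11)), t=5, L=5) -> 4.0
```

Comparing the returned QSNR against a threshold gives the frame's speech/non-speech label.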
Quantile based speech/non-speech Detection (cont.)
Noise estimation
• Let S(ω,t) be the power spectrum of the input signal at frequency ω in the t-th frame
• Let N(ω,t) be the power spectrum of the estimated noise at frequency ω in the t-th frame

N(ω,t) = (1 − λ) N(ω,t−1) + λ S(ω,t)   for non-speech frames
N(ω,t) = N(ω,t−1)                      for speech frames

λ = 0.05 is the forgetting factor
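The recursive noise update is a one-liner per frame. A sketch; argument names and the per-call interface are assumptions.

```python
import numpy as np

def update_noise(N_prev, S, is_speech, lam=0.05):
    """Recursive noise-spectrum update with forgetting factor lam.

    Non-speech frames blend the current power spectrum S into the
    running noise estimate; speech frames leave it unchanged.
    """
    if is_speech:
        return N_prev
    return (1.0 - lam) * N_prev + lam * S

# e.g. update_noise(np.array([1.0]), np.array([3.0]), False)
#   -> array([1.1]), since 0.95*1.0 + 0.05*3.0 = 1.1
```

The small λ makes the estimate track slow drifts in the noise floor while ignoring brief fluctuations.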
Spectral subtraction
• A traditional non-linear spectral subtraction algorithm in the power-spectrum domain is used for noise reduction

Ŝ(ω,t) = max{ S(ω,t) − α N(ω,t), β S(ω,t) }

α = 1.1: the over-subtraction factor
β = 0.1: the spectral floor
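The subtraction rule itself, applied bin-wise over the power spectrum, can be sketched as:

```python
import numpy as np

def spectral_subtract(S, N, alpha=1.1, beta=0.1):
    """Non-linear spectral subtraction in the power-spectrum domain.

    alpha is the over-subtraction factor; beta sets the spectral
    floor so no bin drops below beta * S.
    """
    return np.maximum(S - alpha * N, beta * S)

# e.g. spectral_subtract(np.array([10.0]), np.array([2.0])) -> array([7.8])
#      spectral_subtract(np.array([1.0]),  np.array([5.0])) -> array([0.1])
```

The floor prevents negative power values and reduces musical-noise artifacts when the noise estimate overshoots.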
Frame SNR estimation
• Based on the results of noise estimation and spectral subtraction
• Indicates the degree to which the current speech frame is uncorrupted by noise

SNR(t) = 10 log10 [ Σ_ω Ŝ(ω,t) / Σ_ω N(ω,t) ]
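Combining the cleaned spectrum with the noise estimate gives the frame SNR. A sketch: the garbled slide leaves it ambiguous whether the ratio is per-bin or summed over frequency; the summed form is assumed here.

```python
import numpy as np

def frame_snr(S_hat, N):
    """Frame-level SNR in dB from the cleaned power spectrum S_hat
    and the noise estimate N, summing power over frequency bins."""
    return 10.0 * np.log10(np.sum(S_hat) / np.sum(N))

# e.g. frame_snr(np.array([10.0, 10.0]), np.array([1.0, 1.0])) -> 10.0
```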
Weighting Algorithm
• In conventional HMM-based speech recognition, the likelihood of an observation sequence X is

P(X|λ) = Π_{j=1..T} a_{q_{j−1},q_j} b_{q_j}(x_j)

• To emphasize the observations for slightly corrupted speech, each output probability is exponentially weighted

P(X|λ) = Π_{j=1..T} a_{q_{j−1},q_j} [b_{q_j}(x_j)]^{δ r_j}

r_j is an observation weighting factor emphasizing reliable frames; δ is a factor used to adjust the degree of emphasis
Weighting factor (a sigmoid of the frame SNR):

r_j = 1 / (1 + exp(−(SNR(x_j) − 10) / 2))
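The weighting scheme can be sketched in the log domain, where the exponent δ·r_j becomes a per-frame scaling of log b_j(x_j). The sigmoid center (10 dB) and slope (2) are read off the garbled slide and should be treated as assumptions, as should the function names.

```python
import numpy as np

def observation_weight(snr_db, center=10.0, slope=2.0):
    """Sigmoid weight r_j for a frame with SNR snr_db (in dB):
    near 1 for clean frames, near 0 for heavily corrupted ones."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(snr_db, dtype=float) - center) / slope))

def weighted_loglik(log_b, snr_db, delta=1.0):
    """Accumulated weighted log output probability: each frame's
    log b_j(x_j) is scaled by delta * r_j before summing."""
    r = observation_weight(snr_db)
    return float(np.sum(delta * r * np.asarray(log_b, dtype=float)))

# e.g. observation_weight(10.0) -> 0.5 (a frame exactly at the center)
#      weighted_loglik([-2.0], [10.0]) -> -1.0
```

Down-weighting low-SNR frames lets clean frames dominate the Viterbi score, which is the mechanism behind the accuracy gains reported below.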
Experimental Result
• Clean speech
– isolated-word database recorded by 20 speakers
– each speaker spoke 100 Chinese names 4 times
– the dataset contains 7,893 isolated-word utterances
• Four different kinds of noise
– babble noise, factory noise, pink noise, white noise
• Recognition system
– di-IFs corpus
– 3 states and a mixture of 8 Gaussian pdfs per state
– acoustic model employs 42-dimensional features
Experimental Result (cont.)