
Page 1: Topic

Topic

• Why Speech Recognizers Make Errors? A Robustness View (ICSLP 2004)

• Weighting Observation Vectors for Robust Speech Recognition in Noisy Environments (ICSLP 2004)

Page 2: Topic

Why Speech Recognizers Make Errors? A Robustness View

Hong Kook Kim and Mazin Rahim

Gwangju Institute of Science and Technology (GIST), Korea

AT&T Labs-Research, USA

ICSLP 2004

Reporter: Shih-Hsiang

Page 3: Topic

Introduction

• Various kinds of robustness problems
– attributed to background noise, coarticulation effects, channel distortion, accents and dialects

• Several novel algorithms have been proposed
– minimize the acoustic mismatch between the training models and the testing environment
– enhancement, normalization, or adaptation
– feature domain / model domain

• This paper tries to create a diagnostic tool that provides better insight into “why recognizers make errors”

Page 4: Topic

Stationary Quantity of Noise: Stationary Signal-to-Noise Ratio

• Investigate the effect of environmental noise on ASR performance

• First step: detect and measure the background noise
– Voice activity detection (VAD): based on the context of the speech
– Energy detection: using a histogram and a threshold (a minimal sketch follows this list)
– Forced alignment: using the state segmentation from forced alignment against the recognized transcription

• Alternatively, train a binary classifier or a Gaussian mixture model to separate speech from silence
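As a concrete illustration of the energy-detection option above, here is a minimal sketch (not from the paper): it computes per-frame log energies and places a speech/silence threshold between the low-energy and high-energy regions of the energy histogram. The frame length, hop size, and percentile-based threshold are illustrative assumptions.

```python
import numpy as np

def frame_log_energy(signal, frame_len=200, hop=80):
    """Log energy (dB) of each analysis frame of a 1-D signal."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.empty(n_frames)
    for n in range(n_frames):
        frame = signal[n * hop:n * hop + frame_len]
        energies[n] = 10.0 * np.log10(np.sum(frame ** 2) + 1e-12)
    return energies

def energy_based_speech_mask(log_energies):
    """Crude energy-detection decision: place a threshold between the
    low-energy (noise) and high-energy (speech) regions of the histogram."""
    lo = np.percentile(log_energies, 10)   # assumed noise floor
    hi = np.percentile(log_energies, 90)   # assumed speech level
    threshold = 0.5 * (lo + hi)
    return (np.asarray(log_energies) > threshold).astype(int)  # I(n): 1 = speech
```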

Page 5: Topic

Stationary Quantity of Noise: Stationary Signal-to-Noise Ratio (cont.)

[Block diagram: the speech waveform is analyzed by three parallel paths (voice activity detection, energy clustering, and HMM forced alignment, the last of which uses the transcription, dictionary generation, the acoustic model, and speech/silence state decoding). Each path produces a speech/noise decision that feeds the SNR computation, yielding SNR_V, SNR_E, and SNR_F, respectively.]

Page 6: Topic

Stationary Quantity of Noise: Stationary Signal-to-Noise Ratio (cont.)

• Second Step: Measure SNR

I(n): speech/silence indicator for the n-th analysis frame (I(n) = 1 for a speech interval, I(n) = 0 for a silent interval)

E(n): log energy of the n-th frame,
E(n) = 10 \log_{10} \sum_{i=1}^{N} s_n^2(i)

Average speech power (over frames with I(n) = 1):
SP = \frac{1}{\sum_{n=1}^{L} I(n)} \sum_{n=1}^{L} I(n)\, E(n)

Average noise power (over frames with I(n) = 0):
NP = \frac{1}{\sum_{n=1}^{L} (1 - I(n))} \sum_{n=1}^{L} (1 - I(n))\, E(n)

SNR = SP - NP
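A minimal sketch of the stationary SNR computation above, assuming the frame log energies E(n) and the speech/silence mask I(n) come from an energy detector, a VAD, or forced alignment (e.g., the helpers sketched earlier):

```python
import numpy as np

def stationary_snr(log_energies, speech_mask):
    """SNR = SP - NP: mean frame log energy over speech frames (I(n) = 1)
    minus mean frame log energy over silent frames (I(n) = 0)."""
    e = np.asarray(log_energies, dtype=float)
    speech = np.asarray(speech_mask, dtype=bool)
    sp = e[speech].mean()        # average speech power SP (dB)
    np_level = e[~speech].mean() # average noise power NP (dB)
    return sp - np_level
```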

Page 7: Topic

Stationary Quantity of Noise: Stationary Signal-to-Noise Ratio (cont.)

Example utterance: “I need to inquire about a bill that was sent.” (laughing from 6 to 8 seconds)

VAD: 30.67 dB
Energy clustering: 36.29 dB
Forced alignment: 23.52 dB

*The forced-alignment approach is more robust for speech/non-speech detection

Page 8: Topic

Time-Varying Quantity of Noise: Nonstationary SNR

• Stationary SNR measurement does not reflect the local characteristics of environmental noise

• Using standard deviation of noise power normalized by the average signal power

NSNR = \left[ \frac{1}{\sum_{n=1}^{L} (1 - I(n))} \sum_{n=1}^{L} (1 - I(n)) \left( 10^{(E(n) - SP)/20} - 10^{-SNR/20} \right)^2 \right]^{1/2}

*Smaller variations in the noise characteristics among frames result in a lower NSNR measurement
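A sketch of the nonstationary SNR measure, following the reconstruction above (the exact normalization used in the paper may differ); it takes the frame log energies, the speech/silence mask, and the previously computed SP and SNR:

```python
import numpy as np

def nonstationary_snr(log_energies, speech_mask, sp, snr):
    """NSNR: RMS deviation of the noise-frame level (relative to the average
    signal level) around its mean, per the reconstruction above."""
    e = np.asarray(log_energies, dtype=float)
    noise = ~np.asarray(speech_mask, dtype=bool)
    dev = 10.0 ** ((e[noise] - sp) / 20.0) - 10.0 ** (-snr / 20.0)
    return np.sqrt(np.mean(dev ** 2))
```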

Page 9: Topic

Effect of Stationary and Nonstationary SNRs on ASR Performance

• Corpus
– telephone speech collected (over 20 different datasets)
– 5,171 utterances (54,658 words) for testing
– trigram language model

Page 10: Topic

Effect of Stationary and Nonstationary SNRs on ASR Performance (cont.)

Page 11: Topic

Effect of Stationary and Nonstationary SNRs on ASR Performance (cont.)

Page 12: Topic

Effect of Stationary and Nonstationary SNRs on ASR Performance (cont.)

Page 13: Topic

Effect of Stationary and Nonstationary SNRs on ASR Performance (cont.)

• Estimated word accuracy (linear regression model):

\widehat{asr}_{i,j} = \alpha \cdot SSNR_{i,j} + \beta \cdot NSNR_{i,j} + \gamma

• Estimation error:

e = \frac{1}{N} \sum_{i,j} e_{i,j}^2, \qquad e_{i,j} = asr_{i,j} - \widehat{asr}_{i,j}
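A minimal sketch of fitting such a regression with ordinary least squares, assuming arrays of measured word accuracy, SSNR, and NSNR per test condition (hypothetical inputs, not the paper's data):

```python
import numpy as np

def fit_accuracy_model(ssnr, nsnr, accuracy):
    """Least-squares fit of accuracy ~ alpha*SSNR + beta*NSNR + gamma.
    Returns the coefficients and the mean squared estimation error."""
    ssnr = np.asarray(ssnr, dtype=float)
    nsnr = np.asarray(nsnr, dtype=float)
    acc = np.asarray(accuracy, dtype=float)
    X = np.column_stack([ssnr, nsnr, np.ones_like(ssnr)])
    coeffs, *_ = np.linalg.lstsq(X, acc, rcond=None)
    mse = np.mean((acc - X @ coeffs) ** 2)
    return coeffs, mse
```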

    

Page 14: Topic

Effect of Stationary and Nonstationary SNRs on ASR Performance (cont.)

Page 15: Topic

Weighting Observation Vectors for Robust Speech Recognition in Noisy Environments

Zhenyu Xiong, Thomas Fang Zheng, and Wenhu Wu

Center for Speech Technology, Tsinghua University, China
ICSLP 2004

Reporter: Shih-Hsiang

Page 16: Topic

Front-end Module

Page 17: Topic

Quantile-based Speech/Non-speech Detection

• Based on order statistics (OS) filters to obtain an estimate of the local SNR of the speech signal

• Two OS filters are applied to the log energy of the signal
– Median filter: tracks the background noise level (B)
– 0.9 quantile (Q(0.9)): tracks the signal level
– The difference is called the quantile-based estimate of the instantaneous SNR (QSNR) of the signal

Page 18: Topic

Quantile-based Speech/Non-speech Detection (cont.)

• Let E_{t-L}, \ldots, E_{t+L} be the log energy values of the 2L+1 frames around the frame t to be analyzed

• Let E_{(r)}, where r = 1, \ldots, 2L+1, be the corresponding values sorted in ascending order

• Then E_{(L+1)} is the output of the median filter

• For the other filter,

Q(p) = (1 - f)\, E_{(k)} + f\, E_{(k+1)}, \qquad r = p\,(2L + 1), \quad k = \lfloor r \rfloor, \quad f = r - k

• The speech/non-speech detection is made by comparing the estimated SNR with a threshold
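A minimal sketch of the two OS filters and the QSNR-based decision, assuming the frame log energies as input; the window half-length L and the decision threshold are illustrative values, not the paper's settings:

```python
import numpy as np

def qsnr_speech_mask(log_energies, L=20, threshold=6.0):
    """Quantile-based speech/non-speech decision: for each frame, a median
    filter tracks the background noise level B and a 0.9-quantile filter
    tracks the signal level over the 2L+1 surrounding log energies; their
    difference (QSNR) is compared with a threshold."""
    e = np.asarray(log_energies, dtype=float)
    mask = np.zeros(len(e), dtype=int)
    for t in range(len(e)):
        window = e[max(0, t - L):t + L + 1]
        noise_level = np.median(window)           # output of the median filter
        signal_level = np.quantile(window, 0.9)   # output of the Q(0.9) filter
        qsnr = signal_level - noise_level
        mask[t] = int(qsnr > threshold)
    return mask
```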

Page 19: Topic

Quantile-based Speech/Non-speech Detection (cont.)

Page 20: Topic

Noise estimation

• Let S(ω, t) be the power spectrum at frequency ω in the t-th frame of the input signal

• Let N(ω, t) be the power spectrum of the estimated noise at frequency ω in the t-th frame

N(\omega, t) = (1 - \lambda)\, N(\omega, t-1) + \lambda\, S(\omega, t)   for non-speech frames

N(\omega, t) = N(\omega, t-1)   for speech frames

\lambda = 0.05: forgetting factor
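A minimal sketch of this recursive noise update, assuming per-frame power spectra and the speech/non-speech decision from the detector above:

```python
import numpy as np

def update_noise_estimate(noise_prev, power_spectrum, is_speech, lam=0.05):
    """Recursive noise spectrum update: smooth toward the current power
    spectrum during non-speech frames; hold the estimate during speech."""
    noise_prev = np.asarray(noise_prev, dtype=float)
    if is_speech:
        return noise_prev
    return (1.0 - lam) * noise_prev + lam * np.asarray(power_spectrum, dtype=float)
```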

Page 21: Topic

Spectral subtraction

• A traditional non-linear spectral subtraction algorithm in the power spectrum domain is used for noise reduction

\hat{S}(\omega, t) = \max\{\, S(\omega, t) - \alpha\, N(\omega, t),\ \beta\, S(\omega, t) \,\}

\alpha = 1.1: the over-subtraction factor
\beta = 0.1: the spectral floor
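A short sketch of this subtraction rule, applied per frame to the power spectrum and the current noise estimate:

```python
import numpy as np

def spectral_subtraction(power_spectrum, noise_estimate, alpha=1.1, beta=0.1):
    """Non-linear spectral subtraction in the power spectrum domain, with
    over-subtraction factor alpha and spectral floor beta."""
    s = np.asarray(power_spectrum, dtype=float)
    n = np.asarray(noise_estimate, dtype=float)
    return np.maximum(s - alpha * n, beta * s)
```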

Page 22: Topic

Frame SNR estimation

• Based on the result of noise estimation and spectral subtraction

• Indicates the degree to which the current speech frame is uncorrupted by noise

SNR(t) = 10 \log_{10} \frac{\sum_{\omega} \hat{S}(\omega, t)}{\sum_{\omega} N(\omega, t)}
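A minimal sketch of the frame SNR from the enhanced spectrum and the noise estimate; summing over frequency to get one value per frame is an assumption of this sketch:

```python
import numpy as np

def frame_snr(enhanced_spectrum, noise_estimate, eps=1e-12):
    """Frame SNR in dB from the enhanced power spectrum and the noise
    estimate, aggregated over frequency."""
    s_hat = float(np.sum(enhanced_spectrum))
    n = float(np.sum(noise_estimate))
    return 10.0 * np.log10((s_hat + eps) / (n + eps))
```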

Page 23: Topic

Weighting Algorithm

• In a conventional HMM-based speech recognizer, the likelihood along a state sequence is

P(X \mid \lambda) = \prod_{j=1}^{T} a_{q_{j-1} q_j}\, b_{q_j}(x_j)

• Emphasize the observations from slightly corrupted speech frames by weighting each emission probability:

P(X \mid \lambda) = \prod_{j=1}^{T} a_{q_{j-1} q_j}\, \left[ b_{q_j}(x_j) \right]^{r_j}

r_j is an observation weighting factor and \delta is a factor used to adjust the degree of emphasis:

r_j = \frac{2}{1 + \exp\!\left( -\frac{SNR(x_j) - 10}{\delta} \right)}
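A minimal sketch of the weighted likelihood accumulation, assuming per-frame transition and emission log-probabilities along a fixed state path and the frame SNRs from the front end. The weight function follows the reconstruction above; its constants (midpoint 10 dB, range (0, 2), delta value) are assumptions:

```python
import numpy as np

def observation_weight(frame_snr_db, delta=2.0, midpoint=10.0):
    """Sigmoid-shaped weight r_j in (0, 2): near 0 for heavily corrupted
    frames, near 2 for clean frames, exactly 1 at the midpoint SNR."""
    return 2.0 / (1.0 + np.exp(-(frame_snr_db - midpoint) / delta))

def weighted_log_likelihood(log_trans, log_emit, frame_snrs):
    """Accumulate log P(X|lambda) along a fixed state path, raising each
    emission probability b(x_j) to the power r_j, i.e. scaling its log."""
    total = 0.0
    for lt, le, snr in zip(log_trans, log_emit, frame_snrs):
        total += lt + observation_weight(snr) * le
    return total
```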

Page 24: Topic

Weighting factor

Page 25: Topic

Experimental Result

• Clean speech
– isolated-word database collected from 20 speakers
– each speaker spoke 100 Chinese names 4 times
– the dataset contains 7,893 isolated-word utterances

• Four different kinds of noise
– babble noise, factory noise, pink noise, white noise

• Recognition system
– di-IFs corpus
– 3 states and a mixture of 8 Gaussian pdfs per state
– acoustic model employs 42-dimensional features

Page 26: Topic

Experimental Result (cont.)

Page 27: Topic

Experimental Result (cont.)