
Page 1: Survey of Robust Techniques

Survey of Robust Techniques

2005/5/26

Presented by Chen-Wei Liu

Page 2: Survey of Robust Techniques

Conferences

• Log-Energy Dynamic Range Normalization for Robust Speech Recognition
  – Weizhong Zhu and Douglas O’Shaughnessy
  – INRS-EMT, University of Quebec, Canada
  – ICASSP 2005

• Static and Dynamic Spectral Features: Their Noise Robustness and Optimal Weights for ASR
  – Chen Yang and Tan Lee, The Chinese University of Hong Kong
  – Frank K. Soong, ATR, Kyoto, Japan
  – ICASSP 2005

Page 3: Survey of Robust Techniques

Introduction

• Methods of robust speech recognition can be classified into two approaches:
  – Front-end processing for speech feature extraction
  – Back-end processing for HMM decoding

• Compensation for noise:
  – The front-end processing method suppresses the noise and extracts more robust parameters
  – The back-end processing method compensates for noise by adapting the parameters inside the HMM system

• This paper focuses on the first approach

Page 4: Survey of Robust Techniques

Introduction

• Compared with cepstral coefficients, the log-energy feature has quite different characteristics:
  – Log of the summation of the energy of all samples in one frame (logE)
  – Summation of the log filter bank outputs (c0)

• This paper tries to find a more effective way, named log-energy dynamic range normalization (ERN), to remove the effects of additive noise:
  – By minimizing the mismatch between training and testing data

Page 5: Survey of Robust Techniques

Energy Dynamic Range Normalization

• Observations:
  – Elevated minimum value
  – Valleys are buried by additive noise energy, while peaks are not affected as much

Page 6: Survey of Robust Techniques

Energy Dynamic Range Normalization

• The larger difference on valleys leads to a mismatch between the clean and noisy speech

• To minimize the mismatch, this paper proposes an algorithm that scales the log-energy feature sequence of clean speech
  – It lifts valleys while keeping peaks unchanged

• Log-energy dynamic range is defined as follows

$$\mathrm{D.R.(dB)} = 10 \cdot \frac{\max_{i=1\ldots n}\bigl(\log(\mathrm{Energy}_i)\bigr)}{\min_{i=1\ldots n}\bigl(\log(\mathrm{Energy}_i)\bigr)}$$

Page 7: Survey of Robust Techniques

Energy Dynamic Range Normalization

• In the presence of noise, $\min_{i=1\ldots n}\bigl(\log(\mathrm{Energy}_i)\bigr)$ is affected by additive noise, while $\max_{i=1\ldots n}\bigl(\log(\mathrm{Energy}_i)\bigr)$ is not affected as much

• Let the target minimum value be $T_{Min}$ and the target energy dynamic range be $X$; then the above equation becomes

$$X\,\mathrm{(dB)} = 10 \cdot \frac{\max_{i=1\ldots n}\bigl(\log(\mathrm{Energy}_i)\bigr)}{T_{Min}}$$

• In this way, the target minimum value can be set based on a given target dynamic range

Page 8: Survey of Robust Techniques

Energy Dynamic Range Normalization

• The steps of the proposed log-energy feature dynamic range normalization algorithm are:
  – 1st: find $Max = \max_{i=1\ldots n}\bigl(\log(\mathrm{Energy}_i)\bigr)$ and $Min = \min_{i=1\ldots n}\bigl(\log(\mathrm{Energy}_i)\bigr)$
  – 2nd: calculate the target minimum $T_{Min} = 10 \cdot Max / X$
  – 3rd: if $Min < T_{Min}$, go to the 4th step; otherwise leave the sequence unchanged
  – 4th: for $i = 1\ldots n$,

$$\log(\mathrm{Energy}_i) \leftarrow \log(\mathrm{Energy}_i) + \frac{(T_{Min} - Min)\cdot\bigl(Max - \log(\mathrm{Energy}_i)\bigr)}{Max - Min}$$
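As a minimal NumPy sketch of these steps (the function name and the default target dynamic range X are assumptions for illustration, not values from the paper):

```python
import numpy as np

def ern_linear(log_energy, target_dr_db=30.0):
    """Sketch of linear log-energy dynamic range normalization (ERN).

    Lifts the valleys of a log-energy sequence toward a target minimum
    T_Min while leaving the maximum unchanged. The default target
    dynamic range X is an assumed value.
    """
    e = np.asarray(log_energy, dtype=float)
    e_max, e_min = e.max(), e.min()        # 1st: find Max and Min
    t_min = 10.0 * e_max / target_dr_db    # 2nd: target minimum from X
    if e_min >= t_min:                     # 3rd: scale only when the
        return e                           #      observed min is too low
    # 4th: lift each value; the added term vanishes at the maximum
    return e + (t_min - e_min) * (e_max - e) / (e_max - e_min)
```

At the sequence maximum the added term is zero, so peaks are preserved exactly, which matches the behavior noted on the next slide.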

Page 9: Survey of Robust Techniques

Energy Dynamic Range Normalization

• The scaling effect decreases as the log-energy value itself increases, and the maximum of the sequence is unchanged

Page 10: Survey of Robust Techniques

Experimental Results: Linear Scaling

• The proposed method was evaluated on the Aurora 2.0 database

Page 11: Survey of Robust Techniques

Experimental Results: Non-linear Scaling

• Using the non-linear scaling equation as follows:

$$\log(\mathrm{Energy}_i) \leftarrow \log(\mathrm{Energy}_i) + \frac{(T_{Min} - Min)\cdot\bigl(\log(Max) - \log(\log(\mathrm{Energy}_i))\bigr)}{\log(Max) - \log(Min)}$$
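Continuing the sketch above (so `np` is the same NumPy import), the non-linear variant might look like this; it additionally assumes the log-energy values are positive so the inner logarithms are defined:

```python
def ern_nonlinear(log_energy, target_dr_db=30.0):
    """Non-linear ERN sketch: the lift decays with the log-distance
    from the sequence maximum rather than the linear distance.

    Assumes all log-energy values are positive; the default target
    dynamic range X is again an assumed value.
    """
    e = np.asarray(log_energy, dtype=float)
    e_max, e_min = e.max(), e.min()
    t_min = 10.0 * e_max / target_dr_db
    if e_min >= t_min:
        return e
    return e + (t_min - e_min) * (np.log(e_max) - np.log(e)) / (
        np.log(e_max) - np.log(e_min))
```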

Page 12: Survey of Robust Techniques

Experimental Results: Comparison of Linear and Non-linear Scaling

• Performance comparisons at different SNR levels are shown as follows

Page 13: Survey of Robust Techniques

Experimental Results: Combination with Other Techniques

Page 14: Survey of Robust Techniques

Conclusions

• When systems were trained on a clean speech training set, the proposed technique achieved an overall relative performance improvement of about 30.83%

• Like CMS, the proposed method does not require any prior knowledge of the noise type or level

• Reducing mismatch in log-energy leads to a large recognition improvement

Page 15: Survey of Robust Techniques

Conferences

• Log-Energy Dynamic Range Normalization for Robust Speech Recognition
  – Weizhong Zhu and Douglas O’Shaughnessy
  – INRS-EMT, University of Quebec, Canada
  – ICASSP 2005

• Static and Dynamic Spectral Features: Their Noise Robustness and Optimal Weights for ASR
  – Chen Yang and Tan Lee, The Chinese University of Hong Kong
  – Frank K. Soong, ATR, Kyoto, Japan
  – ICASSP 2005

Page 16: Survey of Robust Techniques

Introduction

• Dynamic cepstral features complement static features by characterizing the time-varying rate of the speech trajectory

• It has been shown that such a representation (static + dynamic) yields higher speech and speaker recognition performance than static cepstra alone

• This paper tries to quantify the robustness of static and dynamic features under different types of noise and variable SNRs

Page 17: Survey of Robust Techniques

Noise Robustness Analysis: Recognition with Only Static or Dynamic Features

Page 18: Survey of Robust Techniques

Noise Robustness Analysis: Static and Dynamic Cepstral Distances between Clean and Noisy Speech

• For a given sequence of noisy speech observations $Y = (y_1, y_2, \ldots, y_T)'$, the output likelihood is represented as follows, using a single Gaussian per state $j$ for simplicity:

$$b_j(y_t) = \frac{1}{(2\pi)^{v/2}\,|\Sigma_j|^{1/2}} \exp\Bigl\{-\frac{1}{2}(y_t - \mu_j)'\,\Sigma_j^{-1}\,(y_t - \mu_j)\Bigr\}$$

• The mismatch between clean and noisy conditions lies mainly in the exponent term, which, with $x_t$ denoting the clean observation, can be re-written as:

$$(y_t - \mu_j)'\,\Sigma_j^{-1}\,(y_t - \mu_j) = (y_t - x_t + x_t - \mu_j)'\,\Sigma_j^{-1}\,(y_t - x_t + x_t - \mu_j)$$

$$= (y_t - x_t)'\,\Sigma_j^{-1}\,(y_t - x_t) + 2\,(y_t - x_t)'\,\Sigma_j^{-1}\,(x_t - \mu_j) + (x_t - \mu_j)'\,\Sigma_j^{-1}\,(x_t - \mu_j)$$

  – The expected value of the middle (cross) term is zero
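This re-writing is plain algebra on the quadratic form; a quick numerical check with random vectors and a random positive-definite matrix standing in for $\Sigma_j^{-1}$ confirms it:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
y, x, mu = rng.normal(size=(3, dim))     # noisy obs., clean obs., model mean
a = rng.normal(size=(dim, dim))
inv_sigma = a @ a.T + np.eye(dim)        # positive-definite Sigma^-1

lhs = (y - mu) @ inv_sigma @ (y - mu)
rhs = ((y - x) @ inv_sigma @ (y - x)
       + 2 * (y - x) @ inv_sigma @ (x - mu)
       + (x - mu) @ inv_sigma @ (x - mu))
assert np.isclose(lhs, rhs)              # the expansion above holds
```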

Page 19: Survey of Robust Techniques

Noise Robustness Analysis: Static and Dynamic Cepstral Distances between Clean and Noisy Speech

• Since the expected value of the second term is zero, the difference in likelihood between noisy and clean speech is determined by the first term, measured by defining a cepstral distance as follows:

$$CD = E_t\bigl[(y_t - x_t)'\,\Sigma_x^{-1}\,(y_t - x_t)\bigr]$$

  – where $\Sigma_x$ is used to approximate the diagonal covariance $\Sigma_j$ in the clean speech model
  – $E_t[\cdot]$ denotes the time average over the whole utterance

Page 20: Survey of Robust Techniques

Noise Robustness Analysis: Static and Dynamic Cepstral Distances between Clean and Noisy Speech

• The weighted distances between clean and noisy speech for the dynamic and static features, respectively, are:

$$CD(d) = E_t\bigl[(y_t^d - x_t^d)'\,(\Sigma_x^d)^{-1}\,(y_t^d - x_t^d)\bigr]$$

$$CD(s) = E_t\bigl[(y_t^s - x_t^s)'\,(\Sigma_x^s)^{-1}\,(y_t^s - x_t^s)\bigr]$$

  – where the superscripts d and s denote the dynamic and the static features
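A minimal NumPy sketch of this distance computation; the array shapes and names are assumptions for illustration, not the paper's notation:

```python
import numpy as np

def cepstral_distance(noisy, clean, inv_var):
    """Time-averaged weighted cepstral distance CD between clean and
    noisy features.

    noisy, clean : (T, D) arrays of per-frame feature vectors
                   (either static or dynamic cepstra)
    inv_var      : (D,) inverse of the diagonal covariance taken from
                   the clean speech model
    """
    diff = np.asarray(noisy) - np.asarray(clean)
    return float(np.mean(np.sum(diff * diff * inv_var, axis=1)))

# Hypothetical usage: one CD value per feature stream, as in the
# scatter plots (points with cd_d < cd_s fall below the diagonal).
# cd_s = cepstral_distance(y_static, x_static, inv_var_static)
# cd_d = cepstral_distance(y_delta,  x_delta,  inv_var_delta)
```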

Page 21: Survey of Robust Techniques

Noise Robustness Analysis: Static and Dynamic Cepstral Distances between Clean and Noisy Speech

• The following depicts the scatter diagrams of dynamic distance (between clean and noisy dynamic cepstra) vs. its static counterpart:

Page 22: Survey of Robust Techniques

Noise Robustness Analysis: Static and Dynamic Cepstral Distances between Clean and Noisy Speech

• Two observations can be made from the figure:
  – Both distances are larger for increasingly mismatched conditions at lower SNRs
  – The majority of points fall below the diagonal line; in other words, the dynamic cepstral distance between noisy and clean features is smaller than its static counterpart

Page 23: Survey of Robust Techniques

Exponential Weighting in Decoding: Exponential Weightings

• Based on the findings in the previous figure:
  – It would make sense to weight the log likelihoods of the static and dynamic features differently in decoding, to exploit their uneven noise robustness

• The output likelihood of an observation can be split into two separate corresponding terms, d and s, as:

$$b_j(o_t) = \sum_{k=1}^{K} c_{jk} \exp\Bigl\{\log\bigl[N(o_t^d;\, \mu_{jk}^d, \Sigma_{jk}^d)\bigr] + \log\bigl[N(o_t^s;\, \mu_{jk}^s, \Sigma_{jk}^s)\bigr]\Bigr\}$$

• The acoustic likelihood components can then be computed with different exponential weightings $\gamma^d$ and $\gamma^s$ as:

$$b_j(o_t) = \sum_{k=1}^{K} c_{jk} \exp\Bigl\{\gamma^d \log\bigl[N(o_t^d;\, \mu_{jk}^d, \Sigma_{jk}^d)\bigr] + \gamma^s \log\bigl[N(o_t^s;\, \mu_{jk}^s, \Sigma_{jk}^s)\bigr]\Bigr\}$$
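A sketch of this weighted per-state mixture likelihood using SciPy's Gaussian density; the function and argument names are assumptions, and the surrounding HMM decoder is not reproduced:

```python
import numpy as np
from scipy.stats import multivariate_normal

def weighted_log_likelihood(o_d, o_s, mix_weights, params_d, params_s,
                            gamma_d, gamma_s):
    """log b_j(o_t) with separate exponents on the dynamic (d) and
    static (s) feature streams.

    params_d, params_s : lists of (mean, cov) pairs, one per mixture k
    mix_weights        : mixture coefficients c_jk
    """
    total = 0.0
    for c_k, (mu_d, cov_d), (mu_s, cov_s) in zip(mix_weights,
                                                 params_d, params_s):
        ll = (gamma_d * multivariate_normal.logpdf(o_d, mu_d, cov_d)
              + gamma_s * multivariate_normal.logpdf(o_s, mu_s, cov_s))
        total += c_k * np.exp(ll)
    return np.log(total)
```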

Page 24: Survey of Robust Techniques

Exponential Weighting in Decoding: Recognition with Bracketed Weightings

• Testing by bracketing the two weights in steps of 0.1, under the constraint that they sum to one; see the sketch below
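A sketch of this bracketed search; `run_recognition` is a hypothetical evaluation function standing in for a full recognition pass over the test set:

```python
import numpy as np

# Candidate (gamma_d, gamma_s) pairs in steps of 0.1 with unity sum.
candidates = [(round(g, 1), round(1.0 - g, 1))
              for g in np.arange(0.0, 1.01, 0.1)]

# Hypothetical: keep the pair giving the highest word accuracy.
# best = max(candidates,
#            key=lambda gs: run_recognition(gamma_d=gs[0], gamma_s=gs[1]))
```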

Page 25: Survey of Robust Techniques

Exponential Weighting in Decoding: Discriminative Weight Training (Weight Optimization)

• The log likelihood difference (lld) between the recognized and the correct states is chosen as the objective function for optimization
  – For the u-th speech utterance of T observations, $O_u = (o_1^u, o_2^u, \ldots, o_T^u)$

• The lld is as follows:

$$lld(O_u) = g^r(O_u) - g^l(O_u)$$

• The cost averaged over the whole training set of U utterances is:

$$LLD = \frac{1}{U}\sum_{u=1}^{U} lld(O_u)$$

Page 26: Survey of Robust Techniques

Exponential Weighting in Decoding: Discriminative Weight Training (Weight Optimization)

• This cost is minimized by iteratively adjusting the dynamic weight $\gamma^d$ and the static weight $\gamma^s$ via steepest descent:

$$\gamma^d(n+1) = \gamma^d(n) - \epsilon\,\frac{\partial LLD}{\partial \gamma^d}, \qquad \gamma^s(n+1) = \gamma^s(n) - \epsilon\,\frac{\partial LLD}{\partial \gamma^s}$$

$$\frac{\partial\,lld(O_u)}{\partial \gamma} = \frac{\partial g^r(O_u)}{\partial \gamma} - \frac{\partial g^l(O_u)}{\partial \gamma}$$

$$\frac{\partial g(O_u)}{\partial \gamma} = \sum_{t=1}^{T} \frac{\partial \log b_j(o_t^u)}{\partial \gamma}$$
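A minimal sketch of one such update, assuming the partial derivatives of LLD with respect to each weight have already been accumulated over the U training utterances; the step size is an assumed hyper-parameter:

```python
def descent_step(gamma_d, gamma_s, grad_d, grad_s, eps=0.01):
    """One steepest-descent update of the two exponential weights.

    grad_d, grad_s : dLLD/dgamma_d and dLLD/dgamma_s, each the average
                     over utterances of dg^r/dgamma - dg^l/dgamma
    """
    return gamma_d - eps * grad_d, gamma_s - eps * grad_s
```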

Page 27: Survey of Robust Techniques

Experimental Results: Evaluation on the Aurora 2.0 Database

• Overall, a 36.6% relative WER reduction is obtained:

Page 28: Survey of Robust Techniques

Experimental Results: Evaluation on the CUDIGIT Database

• The relative WER improvement is 41.9%, averaged over all noise conditions

Page 29: Survey of Robust Techniques

Conclusions

• The dynamic features were found to be more resilient to additive noise interference than their static counterparts

• Optimal exponential weights exploiting the unequal robustness of the two feature types were used, and better performance was obtained