

Comparison of some noise-compensation methods for speech recognition in adverse environments

B.P. Milner S.V. Vaseghi

Indexing terms: Cepstral-time matrices, Noise adaptation, Noise compensation, Spectral subtraction, Speech recognition, Wiener filters

Abstract: A comparative study is presented of three noise-compensation schemes, namely spectral subtraction, Wiener filters, and noise adaptation, for hidden-Markov-model-based speech recognition in adverse environments. The noise-compensation methods are evaluated on a spoken-digit database, in the presence of car noise and helicopter noise at different signal-to-noise ratios. Experimental results demonstrate that the noise-compensation methods achieve a substantial improvement in recognition accuracy across a wide range of signal-to-noise ratios. At a signal-to-noise ratio of −6 dB the recognition accuracy is improved from 11% to 83%. The use of cepstral-time matrices as an improved speech representation is also considered, and their combination with the noise-compensation methods is shown. Experiments show that the cepstral-time matrix is a more robust feature than a vector of identical size, composed of a combination of cepstral and differential cepstral features.

1 Introduction

Speech-recognition systems operating in adverse conditions, such as a vehicle or factory, have to deal with a variety of ambient-noise and channel distortions. Currently, most speech-recognition systems are based on hidden Markov models (HMMs) [1], and this paper presents a comparative study of some noise-compensation methods for HMMs operating in noisy environments. The major signal-processing stages in an HMM-based speech-recognition system are the acoustic-feature extraction, acoustic segmentation and model-likelihood calculation. Noise affects each of these stages of the recognition process, and the result is a rapid deterioration in the recognition accuracy as the signal-to-noise ratio decreases. Speech-recognition systems achieve best performance when the models are trained and operated in matched environments. For most applications this is impractical, as the operating environment varies with time and place, and so some form of noise compensation must be employed. The research work in noisy-speech recognition may be classified into three broad categories [2]: filtering of the noisy speech signal prior to classification [3-8]; adaptation of the speech models to include the effects of noise [9-15]; and the use of features which are more robust to noise [16-19]. In this paper the performance of three of the most successful noise-compensation schemes, namely spectral subtraction, Wiener filters and noise adaptation, is considered. The use of cepstral-time speech features, as a robust speech representation, for noisy-speech recognition is also investigated.

© IEE, 1994. Paper 1303K (E5), first received 15th April 1992 and in revised form 18th February 1994. The authors are with the School of Information Systems, University of East Anglia, Norwich, United Kingdom.

Speech recognisers operating in noisy conditions must also deal with the changes in the speaking habits of people subjected to noise. In noise, people speak louder, and there are increases in the duration, pitch and higher-frequency energy content of speech [20]. The noise-induced stress (also known as the Lombard effect) can be as harmful to recognition as the noise itself; however, in this work the focus is on the effects of additive noise. This is partly because the Lombard effect may be taken into account, during the training of models, by including training examples from speakers subjected to noise through headphones.

2 Hidden Markov models

A hidden Markov model (HMM) [1] is a finite-state statistical model, particularly useful for the statistical characterisation of nonstationary signals such as speech. In HMM theory, the nonstationary character of a process is modelled by a chain of N stationary states, with each state having a different set of statistical characteristics. An N-state hidden Markov model is defined by the parameter set λ = {π_i, a_ij, b_i(x); i, j = 1, ..., N}, where π_i is the initial-state probability, a_ij is the state-transition probability, and b_i(x) is the state-observation probability-density function (PDF), usually modelled by a mixture of Gaussian densities. The main constituents of an HMM are the state-transition probabilities and the state-observation PDFs. The state-transition probabilities model the variations in speech-segment duration and articulation rates. The state-observation PDFs model the variations in spectral content of the speech segments associated with each state. A particularly useful variant of HMMs is the left-right HMM, so called because state transitions can only be made from a left state to a right state, that is a_ij = 0 for i > j. The left-right HMM is particularly useful for modelling random functions of time, using the left-to-right progression through the model.
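As a concrete illustration of these definitions, a minimal left-right transition structure might be set up as follows; the state count and probability values are hypothetical, not taken from the paper:

```python
import numpy as np

# Hypothetical 3-state left-right HMM transition structure.
# The left-right constraint a_ij = 0 for i > j makes the matrix
# upper-triangular, so progress through the model is strictly left to right.
N = 3
A = np.array([[0.6, 0.4, 0.0],     # a_ij: state-transition probabilities
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])     # pi_i: always start in the leftmost state

# Sanity checks: upper-triangular, and each row is a probability distribution.
assert np.allclose(np.triu(A), A)
assert np.allclose(A.sum(axis=1), 1.0)
```

Self-loops (the diagonal entries) model variable segment duration: the longer a segment, the more times the model stays in the same state.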

The authors acknowledge the support of the Science & Engineering Research Council and British Telecommunications Research Laboratories.

IEE Proc.-Vis. Image Signal Process., Vol. 141, No. 5, October 1994


HMM theory, the probability that an unknown observa- tion sequence x = [xl, x z , _ . ,, xz] is an acoustic realis- ation of the word E. IS obtained by summing the probabilities of the observation sequence over all state sequences 4 as

Viterbi algorithm HMM h

- - c n4,b41(Xl)a4141bql(XZ) ' ' . a 4 T - , q T b q T ( X T )

o l l q l . ... 4,

(1)

The probability that an observation vector x_k belongs to state i, b_i(x_k), is commonly modelled by a mixture of M multivariate Gaussian PDFs

b_i(x_k) = Σ_{m=1}^{M} P_im N(x_k; μ_im, Σ_im)   (2)

where P_im is the prior probability of mixture m of state i, and N(x_k; μ_im, Σ_im) is a Gaussian PDF with a mean vector μ_im and a covariance matrix Σ_im. The state Gaussian PDF is the main function through which the signal and noise influence the likelihood calculations. HMMs are described in detail in Reference 1. In general, noise affects both the state-transition probability and the state-observation probability. The effects of noise on the observation probabilities are considered to be more detrimental to performance than the effects of noise on the transition probabilities. In this paper only the effects of noise on the state-observation probabilities are considered.
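The mixture-Gaussian observation PDF of eqn. 2 can be sketched as follows; this is a diagonal-covariance simplification with illustrative numbers, not values from the paper:

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """Multivariate Gaussian density N(x; mu, cov), diagonal-covariance case."""
    var = np.diag(cov)
    norm = np.prod(1.0 / np.sqrt(2.0 * np.pi * var))
    return norm * np.exp(-0.5 * np.sum((x - mu) ** 2 / var))

def state_observation_pdf(x, priors, means, covs):
    """b_i(x) = sum_m P_im * N(x; mu_im, Sigma_im), as in eqn. 2."""
    return sum(p * gaussian_pdf(x, mu, cov)
               for p, mu, cov in zip(priors, means, covs))

# Two-mixture example in two dimensions (illustrative numbers only).
priors = [0.5, 0.5]
means = [np.zeros(2), np.ones(2)]
covs = [np.eye(2), np.eye(2)]
b = state_observation_pdf(np.zeros(2), priors, means, covs)
```

In a full recogniser this density is evaluated for every state at every frame, and the resulting likelihoods feed the sum (or Viterbi maximisation) of eqn. 1.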


3 Wiener filters with HMMs

This Section describes how the Wiener filter can be used in conjunction with HMMs to improve the recognition performance in noisy conditions. Two implementation methods are described: state-dependent Wiener filters, and Wiener-based model adaptation. The noisy-signal model is given by

y(m) = x(m) + n(m)   (3)

where x(m), n(m) and y(m) are the clean speech, the noise and the noisy signal, respectively.

The coefficients of the Wiener filter are obtained by minimising the mean-squared distance between the filter output and the original noise-free signal [21]. For a stationary signal, observed in additive noise, the Wiener filter in the time and frequency domains is given by

w = [R_XX + R_NN]^(−1) r_XX   ⟷   W(f) = P_X(f) / (P_X(f) + P_N(f))   (4)

where R_XX, r_XX and P_X(f) denote the autocorrelation matrix, the autocorrelation vector and the power spectrum, respectively, and the operator ⟷ denotes the Fourier-transform relation. For additive random noise, the Wiener filter W(f) acts as an attenuator which attenuates the frequency components of the noisy signal in proportion to the local signal-to-noise ratio. The Wiener filter only makes use of the mean of the power spectrum and ignores the variance. The application of the Wiener filter requires a knowledge of the signal and noise power spectra. For quasistationary noise, the noise power spectrum may be estimated and updated from speech-inactive periods. The speech power spectrum is not


usually available but may be obtained from the mean cepstral vectors contained in each state of the HMM.
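Once the two power spectra are available, the frequency-domain form of eqn. 4 reduces to a simple per-frequency gain; a minimal sketch, assuming Px and Pn have already been estimated:

```python
import numpy as np

def wiener_gain(Px, Pn):
    """Frequency-domain Wiener filter W(f) = Px(f) / (Px(f) + Pn(f)), eqn. 4.
    Px: speech power spectrum (e.g. reconstructed from HMM state means).
    Pn: noise power spectrum (e.g. estimated from speech-inactive periods)."""
    return Px / (Px + Pn)

# The gain approaches 1 where the local SNR is high and 0 where it is low.
Px = np.array([10.0, 1.0, 0.1])   # illustrative speech power per band
Pn = np.array([1.0, 1.0, 1.0])    # flat illustrative noise power
W = wiener_gain(Px, Pn)
```

The monotonic dependence of the gain on the local SNR is exactly the attenuator behaviour described above.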

3.1 State-dependent Wiener filtering
In this case the noisy speech is filtered using Wiener filters constructed from the mean vectors contained in each state of the HMM, together with a noise estimate. This is described below, and illustrated in Fig. 1:

Step 1: Pass the noisy speech to each HMM to obtain an initial state sequence.

Step 2: From the state sequence produce a series of state-dependent Wiener filters.

Step 3: For each model use the state-dependent Wiener filters to filter the noisy speech.

Step 4: Pass the filtered speech back through its respective model to obtain a probability score, and calculate the best match in the normal way.
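The four steps above can be sketched as a loop over the word models; decode, filters and score below are hypothetical stand-ins for the Viterbi decoder, the state-dependent Wiener filters and the likelihood computation of a real recogniser:

```python
import numpy as np

def recognise(noisy, models):
    """Toy sketch of state-dependent Wiener filtering for recognition."""
    scores = {}
    for name, model in models.items():
        states = model["decode"](noisy)                          # Step 1
        gains = np.array([model["filters"][s] for s in states])  # Step 2
        filtered = gains * noisy                                 # Step 3
        scores[name] = model["score"](filtered)                  # Step 4
    return max(scores, key=scores.get)

# Two dummy single-state word models: 'one' expects a loud signal and
# barely attenuates; 'two' expects silence and attenuates heavily.
models = {
    "one": {"decode": lambda y: [0] * len(y), "filters": [1.0],
            "score": lambda x: -np.sum((x - 1.0) ** 2)},
    "two": {"decode": lambda y: [0] * len(y), "filters": [0.2],
            "score": lambda x: -np.sum((x - 0.0) ** 2)},
}
best = recognise(np.ones(4), models)
```

The matched model's filter leaves the signal largely intact, so its score is highest; the mismatched filter cannot improve its own score, which is the hypothesis behind the method.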

Fig. 1  Illustration of HMMs with state-dependent Wiener filters for noisy-speech recognition

To implement the Wiener filter in eqn. 4, the clean-signal spectral means are obtained from the HMM cepstral means using an inverse DCT and an exponential operation to convert from the log domain to the linear spectral domain. This is described in more detail in References 4, 5, 22 and 23.
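This cepstral-to-spectral conversion can be sketched as follows; the DCT convention and band count here are assumptions, and a real implementation must match the transform used at feature-extraction time:

```python
import numpy as np

def cepstral_mean_to_power_spectrum(c, n_bands):
    """Convert an HMM state's mean cepstral vector back to linear
    (filter-bank) power-spectral means: inverse DCT to the log domain,
    then an exponential. Sketch with an explicit cosine basis."""
    n = np.arange(len(c))
    k = np.arange(n_bands)
    # Inverse-DCT basis (normalisation constants omitted for brevity;
    # the convention must match the one used during feature extraction).
    basis = np.cos(np.pi * np.outer(n, k + 0.5) / n_bands)
    log_spectrum = c @ basis          # cepstral -> log-spectral means
    return np.exp(log_spectrum)       # log -> linear spectral domain

# Illustrative 2-coefficient cepstral mean mapped onto 8 frequency bands.
P = cepstral_mean_to_power_spectrum(np.array([1.0, 0.5]), n_bands=8)
```

The resulting spectral means supply the P_X(f) term of eqn. 4 for each state.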

The hypothesis behind the state-based Wiener-filtering method is that, when the noisy speech is filtered by the Wiener filter associated with the correct HMM, a 'matched' filter is produced, and hence the effect of the noise is reduced. However, when the filtering is performed by a Wiener filter based on an incorrect HMM, no noise reduction occurs. This increases the likelihood that the speech is recognised correctly.

A significant drawback of this method is that it relies on the accuracy of the initial state sequence from which the state-dependent Wiener filters are allocated.

3.2 State-integrated Wiener filtering
This Section presents an alternative implementation of the Wiener filter, where it is shown that, in the context of an HMM, Wiener filtering of noisy speech is equivalent to the adaptation of the mean cepstral vectors of the states of an HMM [6].

In the spectral domain, Wiener filtering is a multiplicative operation, and the filtered signal is given by

X̂(f) = W(f) Y(f)   (5)

The cepstral coefficients are obtained from the discrete cosine transform (DCT) of the logarithm of the power-spectral filter bank. Owing to the logarithmic operation, filtering in the cepstral domain becomes an additive process:

ĉ(n) = c_Y(n) + c_W(n)   (6)

where ĉ(n), c_Y(n) and c_W(n) denote the cepstra of the filtered speech, the noisy speech and the Wiener filter, respectively. From eqn. 5, the cepstrum of the Wiener filter c_W(n) may be expressed as

c_W(n) = DCT[log {P_X(f)}] − DCT[log {P_X(f) + P_N(f)}] = c_X(n) − c_(X+N)(n)   (7)

where c_X(n) is the cepstrum of the mean speech spectrum contained in the HMM, and c_(X+N)(n) is the cepstrum of the sum of the mean spectra of the signal and noise. The filtered signal, from eqn. 6, may be rewritten as

ĉ(n) = c_Y(n) + c_X(n) − c_(X+N)(n)   (8)

Applying the filtered signal to the Gaussian scoring function of the HMM-state observation PDF gives

N(ĉ(n); c_X(n), Σ) = N(c_Y(n); c_(X+N)(n), Σ)   (9)

Thus it can be deduced that Wiener filtering is equivalent to replacing the mean vector c_X(n) of each mixture with that of the noisy signal, c_(X+N)(n). An advantage of this implementation technique over state-based Wiener filtering is that it does not rely on the accuracy of the Viterbi maximum-likelihood state sequence extracted from the noisy speech. Experimentally, this method has resulted in significantly better recognition performance.
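The mean-replacement view can be illustrated with a toy adaptation routine; the dct/idct arguments are placeholders for the analysis transforms, and the one-point identity "transform" is used purely to make the arithmetic checkable:

```python
import numpy as np

def adapt_cepstral_mean(c_x, noise_power, dct, idct):
    """Replace a state's clean cepstral mean c_X(n) with the noisy-signal
    mean c_(X+N)(n): map to the linear spectral domain, add the noise
    power spectrum, and map back.  dct/idct stand in for the transforms
    used at feature-extraction time."""
    P_x = np.exp(idct(c_x))                # cepstral -> log -> linear spectrum
    return dct(np.log(P_x + noise_power))  # linear -> log -> cepstral

# One-point identity "transform": the update reduces to
# c_(X+N) = log(exp(c_X) + P_N), which is easy to check by hand.
ident = lambda v: v
c_adapted = adapt_cepstral_mean(np.array([0.0]), np.array([1.0]), ident, ident)
```

The noisy speech itself is left untouched; only the model means move, which is why no state sequence from the noisy signal is needed.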

4 Spectral subtraction

This Section discusses the use of spectral subtraction to improve the performance of HMMs. Also, the relationship between spectral subtraction and Wiener filters is examined.

The main attractions of spectral subtraction are the relatively low computational and implementation complexity, and the fact that spectral subtraction requires only a knowledge of the noise spectrum. The general idea behind spectral subtraction is that an estimate of the original signal spectrum is obtained by subtracting an estimate of the noise-power (or magnitude) spectrum from the noisy signal. The equation describing spectral subtraction may be expressed as

|X̂(f)|^b = |Y(f)|^b − α{SNR(f)} |N̄(f)|^b   (10)

where |X̂(f)|^b is an estimate of the original signal spectrum |X(f)|^b, and |N̄(f)|^b is the time-averaged estimate of the noise spectrum. For magnitude spectral subtraction (MSS) the exponent b = 1, and for power spectral subtraction (PSS) b = 2. The parameter α{SNR(f)} controls the amount of noise subtracted from the noisy signal, and can be made signal-to-noise-ratio dependent. For full noise subtraction α{SNR(f)} = 1, and for oversubtraction α{SNR(f)} > 1. The time-averaged noise spectrum is obtained from the periods when the signal is absent and only noise is present, as

|N̄(f)|^b = (1/M) Σ_{i=1}^{M} |N_i(f)|^b   (11)

and it is assumed that the noise spectrum remains stationary between the update periods. In eqn. 11, |N_i(f)| is


the spectrum of the ith noise block, and it is assumed that there are M frames in a noise-only period, where M is variable. Alternatively, the time-averaged noise spectrum can be obtained as the output of a first-order digital lowpass filter as

|N̄_t(f)|^b = ρ |N̄_{t−1}(f)|^b + (1 − ρ) |N_t(f)|^b   (12)

which has the effect of producing a slowly time-varying average. Typically the lowpass-filter coefficient ρ is between 0.7 and 0.95.
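Eqn. 12 (with b = 2) amounts to a one-line recursive update, applied only during speech-inactive frames; a minimal sketch:

```python
import numpy as np

def update_noise_power(P_prev, P_frame, rho=0.9):
    """Eqn. 12 with b = 2: first-order recursive average of the noise
    power spectrum, P_t(f) = rho*P_{t-1}(f) + (1-rho)*P_frame(f).
    rho is typically between 0.7 and 0.95; updates run only on frames
    classified as speech-inactive."""
    return rho * P_prev + (1.0 - rho) * P_frame

# Feeding a constant noise power drives the estimate towards it.
est = np.zeros(4)
for _ in range(200):
    est = update_noise_power(est, np.ones(4))
```

A larger rho gives a smoother but slower-adapting estimate, which is the trade-off behind the quoted 0.7 to 0.95 range.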

Owing to the variations of the noise spectrum, spectral subtraction may produce negative estimates of the power or the magnitude spectrum. This outcome is more probable at frequencies with a low signal-to-noise ratio. To avoid negative magnitude estimates, the spectral-subtraction output is processed using a rectifying mapping function T(·), under which an estimate that falls below a fraction β of the noisy-signal spectrum is replaced by a function of the noisy signal, fn{|Y(f)|}. The parameter β may be related to the local signal-to-noise ratio. For example, if the restored estimate of the clean speech is less than 0.01 of the noisy signal (−20 dB), it can be set to fn{|Y(f)|}. In its simplest form fn{|Y(f)|} = noise floor, where the noise floor is a positive constant; however, a better choice is fn{|Y(f)|} = β|Y(f)|.

The main problem in spectral subtraction is the processing distortion introduced as a result of the variation of the noise spectrum in time. This means that the success of spectral subtraction depends on the ability of the algorithm to reduce the noise variations and hence remove the processing distortion.
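Putting eqn. 10 and the rectification step together, power spectral subtraction with a simple noise floor might be sketched as follows; the flooring rule shown is one common choice of T(·), not necessarily the paper's exact form:

```python
import numpy as np

def spectral_subtract(Y_pow, N_pow, alpha=1.0, beta=0.01):
    """Power spectral subtraction (eqn. 10 with b = 2) with a simple
    rectifier: estimates below a fraction beta of the noisy power are
    replaced by beta*|Y(f)|^2, avoiding negative spectral estimates."""
    X_est = Y_pow - alpha * N_pow
    return np.maximum(X_est, beta * Y_pow)

Y = np.array([4.0, 1.0, 0.5])   # noisy power per band (illustrative)
N = np.array([1.0, 1.0, 1.0])   # time-averaged noise power estimate
X = spectral_subtract(Y, N)
```

In the high-SNR band the noise is subtracted cleanly; in the low-SNR bands the floor takes over, which is where the residual processing distortion concentrates.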

4.1 Relation to Wiener filters
The spectral-subtraction equation may be rewritten in the form of the product of the noisy-signal spectrum and the frequency response of a spectral-subtraction filter:

|X̂(f)|² = |Y(f)|² − |N̄(f)|² = H(f) |Y(f)|²   (14)

where α{SNR(f)}, from eqn. 10, is assumed to be 1. The frequency response of the spectral-subtraction filter H(f) is given by

H(f) = 1 − |N̄(f)|² / |Y(f)|²   (15)

The spectral-subtraction filter H(f) has zero phase and a magnitude response in the range 0 ≤ H(f) ≤ 1. Like the Wiener filter, the spectral-subtraction filter acts as a signal-to-noise-ratio-dependent attenuator. The attenuation at each frequency increases with decreasing SNR, and decreases with increasing SNR.

The optimal least-mean-squared-error filter for noise removal is the Wiener filter. The implementation of a Wiener filter requires the power spectra (or the correlation functions) of the signal and the noise (eqn. 4). However, spectral subtraction can be used as a substitute for the Wiener filter when the signal spectrum is not available. The equation describing the frequency response of the Wiener filter, for additive noise, is

W(f) = P_X(f) / (P_X(f) + P_N(f))   (16)



From eqns. 15 and 16, it is evident that the main difference between the Wiener filter and the spectral-subtraction filter is that the former uses the ensemble-average spectra of the signal and the noise, whereas the latter uses the instantaneous spectrum of the noisy signal and the time-averaged spectrum of the noise. Assuming that the signal and noise are wide-sense stationary and ergodic, the instantaneous noisy-signal spectrum |Y(f)|² in spectral subtraction (eqn. 15) may be replaced with a time-averaged spectrum |Ȳ(f)|², to give

H(f) = 1 − |N̄(f)|² / |Ȳ(f)|²   (17)

Fig. 2 presents a comparison of W(f) and H(f). For ergodic processes, as the time-averaged spectrum approaches the 'true', ensemble-averaged spectrum, the spectral-subtraction filter approaches the Wiener filter. In practice, many signals, such as speech and music, are highly nonstationary, and therefore only a limited amount of time averaging is beneficial.

Fig. 2  Comparative illustration of filter sequences for a realisation of the digit 3
a State-dependent Wiener-filter sequence
b Spectral-subtraction-filter sequence

4.2 Nonlinear spectral subtraction
There are many variants of spectral subtraction, which differ mainly in the method used for the estimation of the noise spectrum, and in the degree of averaging imposed on the estimated signal. Nonlinear spectral subtraction makes use of information on the local signal-to-noise ratio, and also exploits the observation that, at low signal-to-noise ratios, oversubtraction [8, 24] can lead to improved performance. The nonlinear spectral-subtraction filter can be written as

H(f) = 1 − |N(f)|²_NL / |Ȳ(f)|²   (18)

where |N(f)|²_NL is a nonlinear estimate of the noise spectrum. Lockwood and Boudy [8] suggest the following function as a nonlinear estimator of the noise spectrum:

|N(f)|²_NL = Φ(max_{M frames} {|N(f)|²}, SNR(f), |N̄(f)|²)   (19)

The nonlinear estimate of the noise spectrum is a function of the maximum value of the noise spectrum over M frames, the signal-to-noise ratio and the linear noise estimate. One form for the nonlinear function Φ(·) is

Φ(·) = max_{M frames} {|N(f)|²} / (1 + γ SNR(f))   (20)

where γ is a design parameter. From eqn. 20 it can be seen that, as the signal-to-noise ratio decreases, the output of the nonlinear estimator Φ(·) approaches max_{M frames} {|N(f)|²}, and as the signal-to-noise ratio increases it approaches zero. The noise estimate is, however, forced to be an overestimate by limiting it from below by the time-averaged estimate:

|N(f)|²_NL = max {Φ(·), |N̄(f)|²}   (21)

The maximum attenuation of the spectral-subtraction filter is limited to H(f) > β, where usually β ≥ 0.01.

5 Noise-adaptive speech models

A problem with conventional filtering methods, such as spectral subtraction, is that crucial speech information may be removed during the filtering process. For noisy-speech recognition, an alternative to filtering the noisy speech is to adapt the parameters of the speech models to include the statistics of noise, and leave the noisy signal unmodified, in an attempt to obtain the models which would have been obtained under matched training and testing conditions.

Noise affects both the state-observation and state-transition probabilities of HMMs. The effect of noise on the state-observation probabilities is considered more significant than its effect on the transition probabilities, although the noise-induced Lombard effect also corrupts the transition probabilities. The noise adaptation presented here deals only with the direct effects of noise, and only considers the adaptation of the state-observation probabilities to noise.

Adaptation of the model statistics depends on the choice of feature for speech representation. For linear speech features, such as power-spectral or correlation features, and additive noise, the statistics of the noisy speech are given as the sum of the statistics of the speech and noise. In Reference 13, Roe introduced a method for the



noise adaptation of correlation-based speech-feature codebooks. For cepstral speech features, the nonlinear logarithmic transformation from the spectral domain to the cepstral domain affects the adaptation process. Nadas et al. [10] introduced noise-adaptive speech models for noisy-speech recognition using models trained on clean log-power spectral speech features. In their models it is assumed that, at any given time, each speech spectral band is dominated either by the signal energy or by the noise energy. Varga [12] introduced a noise-decomposition method in which an HMM of clean speech and an HMM of noise are used to modify the state-observation probability, to incorporate noise in the likelihood calculation. Gales and Young [14] proposed a model-combination method in which a clean-speech HMM and a noise HMM are combined to produce a model of the noisy signal.

Fig. 3 outlines the stages involved in the model-adaptation process described in Reference 14, where the state-observation parameters of the cepstral-domain models are converted into the log-spectral domain, using an inverse DCT. The log-spectral-domain parameters are then mapped to the linear spectral domain. Noise adaptation then takes place on the means and variances of the spectral model, and the combined noise-and-speech model is converted back to the cepstral domain.

Fig. 3  Block diagram of the adaptation system

Thus, in the linear spectral domain, the adaptation of the means and variances to noise is given as

μ_{X+N}(i) = μ_X(i) + μ_N(i)   (22)

σ²_{X+N}(i, j) = σ²_X(i, j) + σ²_N(i, j)   (23)

where {μ_X(i), σ²_X(i, j)}, {μ_N(i), σ²_N(i, j)} and {μ_{X+N}(i), σ²_{X+N}(i, j)} are the means and variances of the spectra of the clean signal, the noise and the noisy signal, respectively. The index i designates the ith frequency band, and σ²(i, j) is the covariance of the ith and jth frequency bands.

Assuming that the cepstral features have a Gaussian distribution, the log-spectral features, obtained from an inverse DCT, also have a Gaussian distribution. Hence, owing to the exponential mapping from the cepstral to the spectral domain, it follows that the linear power-spectral variables have a log-normal distribution. The mapping functions for translating the mean and variance of a normal distribution to a log-normal distribution are given as

μ_Y(i) = exp {μ_{lY}(i) + σ²_{lY}(i, i)/2}   (24)

σ²_Y(i, j) = μ_Y(i) μ_Y(j) {exp (σ²_{lY}(i, j)) − 1}   (25)

where μ_{lY}(i) is the mean of the log spectra of the ith frequency band, σ²_{lY}(i, j) is the covariance of the ith and jth frequency bands, and the subscript l denotes a log-domain variable. The mappings for the transform of the means and variances from log-normal to normal are

μ_{lY}(i) = log {μ_Y(i)} − (1/2) log {1 + σ²_Y(i, i)/μ²_Y(i)}   (26)

σ²_{lY}(i, j) = log {1 + σ²_Y(i, j)/(μ_Y(i) μ_Y(j))}   (27)

Following the transform to the log-spectral domain, a DCT is used to transform the noise-adapted log-domain means and variances to the cepstral domain.
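The normal/log-normal mappings of eqns. 24 to 27 (restricted here to the diagonal terms, for clarity) can be implemented directly; a sketch:

```python
import numpy as np

def lognormal_from_normal(mu_l, var_l):
    """Eqns. 24-25, diagonal case: map a log-domain Gaussian to the
    mean/variance of the corresponding linear-domain log-normal."""
    mu = np.exp(mu_l + var_l / 2.0)
    var = mu ** 2 * (np.exp(var_l) - 1.0)
    return mu, var

def normal_from_lognormal(mu, var):
    """Eqns. 26-27, diagonal case: the inverse mapping."""
    var_l = np.log(1.0 + var / mu ** 2)
    mu_l = np.log(mu) - var_l / 2.0
    return mu_l, var_l

# The two mappings are mutual inverses, so speech and noise statistics can
# be added in the linear domain (eqns. 22-23) and mapped back to the
# log domain without loss.
mu_l, var_l = np.array([0.2]), np.array([0.5])
mu, var = lognormal_from_normal(mu_l, var_l)
mu_l2, var_l2 = normal_from_lognormal(mu, var)
```

A full model-combination implementation applies these mappings to every state and mixture, adds the noise statistics in the linear domain, and finishes with the DCT back to the cepstral domain.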

6 Improved speech features for noisy conditions

The human auditory-perception system is relatively noise-robust when compared with the rapid deterioration in the performance of HMMs operating in noisy conditions. The difference in performance is perhaps partly due to a more efficient use, by the human perception system, of the relatively large amount of correlation, or redundancy, which exists in the acoustic realisation of speech. Thus an important area of research in speech recognition is the development of features and models that are robust to noise. This work may be divided into two categories.

In one category, the speech features or speech models are modified to make more efficient use of the correlation between successive speech segments. In HMM theory it is assumed that, within each state, speech features are independent and identically distributed (IID). The IID assumption contributes to a rapid deterioration in the performance of HMMs in noisy conditions. In particular, the performance of the Viterbi algorithm, commonly used in the recognition phase for the calculation of the maximum-likelihood state sequence, deteriorates in noise [23]. Speech-feature vectors are correlated in time, and an effective method which includes this correlation structure can improve recognition performance. Examples of such work are the short-time modified-coherence features [17] and cepstral-time speech features [18, 19].

In the second category, the effort is directed to the development of nonlinear speech features which in some way model the human auditory-processing system. For example, Ghitza [16] used an ensemble-interval histogram to model the auditory-nerve firings in the cochlea of the ear, which has been shown to produce a more noise-robust spectrum than the conventional FFT-based spectrum.

The remainder of this Section considers the use of a cepstral-time feature matrix for speech representation. This is mainly because

(a) cepstral-time features are a simple extension of the commonly used cepstral vector

(b) cepstral-time features are consistently more robust than an identically sized feature set composed of cepstral and differential cepstral features

(c) the statistics of the signal and noise variations, along the time axis, can be used for a more effective implementation of the noise-compensation methods.


6.1 Cepstral-time features
Cepstral-time features are formed as follows. Speech is segmented into overlapping blocks of K samples, and each block is transformed to K spectral samples. Along the frequency axis, the spectral samples are grouped into N overlapping, mel-spaced, triangular frequency bands, and the frequency bins within each band are averaged to form power-spectral features. A triangular window, of length L samples, is then run along the time axis, and the spectral values within the span of the triangle are averaged to form spectral-time features. The overall effect is that each feature is obtained by averaging the samples under a frequency-time pyramid. The power-spectral variables are converted to logarithmic variables, denoted by X_l(f, t), and then grouped into a sequence of M × N



spectral-time matrices. Each spectral-time matrix is transformed via a two-dimensional DCT to a cepstral-time matrix c(n, m). Following the DCT operation, the lower N' x M' submatrix is selected for speech representation. This submatrix represents the spectral-time envelope of speech, and contains the set of coefficients most useful for speech recognition. Fig. 4 shows that a cepstral-time-matrix feature performs better than a feature vector of identical size composed of cepstral, differential cepstral and differential-differential cepstral features.

Fig. 4 Comparison of performance of an HMM with matrix and vector features: recognition accuracy (%) against signal-to-noise ratio (dB)
-0- 42-dimensional feature vector composed of 14 cepstral, 14 delta-cepstral and 14 delta-delta cepstral features
-@- 14 x 3 cepstral-time matrix
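As an illustrative sketch (not the authors' implementation), the cepstral-time-matrix construction described above can be written as a separable two-dimensional DCT. The 25-band filterbank and the 14 x 3 truncation follow the experimental setup of Section 7; the orthonormal DCT convention is an assumption.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.cos(np.pi * k * (2 * i + 1) / (2 * n)) * np.sqrt(2.0 / n)
    d[0, :] /= np.sqrt(2.0)
    return d

def cepstral_time_matrix(log_mel_block, n_keep=14, m_keep=3):
    """Separable 2-D DCT of an (N bands x M frames) log mel-spectral block,
    truncated by omitting the zeroth row and column (the 14 x 3 set)."""
    n_bands, m_frames = log_mel_block.shape
    # DCT along frequency (bands -> cepstra), then along time
    c = dct_matrix(n_bands) @ log_mel_block @ dct_matrix(m_frames).T
    return c[1:n_keep + 1, 1:m_keep + 1]

# toy example: 25 mel bands over a block of 8 frames
rng = np.random.default_rng(0)
ctm = cepstral_time_matrix(rng.standard_normal((25, 8)))
print(ctm.shape)  # (14, 3)
```

Retaining the zeroth row and column instead would give the 15 x 4 configuration also used in Section 7.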

6.2 Noise compensation for cepstral-time matrices

The Wiener filter, spectral subtraction and noise-adaptation techniques for noise compensation can be extended for use with the cepstral-time matrix as follows.

6.2.1 Spectral subtraction: An effective system for the extension of spectral subtraction to a two-dimensional matrix is illustrated in Fig. 5. In this method M spectral vectors are grouped to form a spectral-time matrix. This matrix is then transformed into a 'spectral-spectral' matrix by converting the time dimension to a frequency dimension, using a DCT. Two-dimensional spectral subtraction is then applied as follows

Fig. 5 Illustration of two-dimensional spectral-spectral subtraction

The spectral estimate is then converted to a two-dimensional cepstral-time matrix by taking the two-dimensional DCT of the log of the spectral estimate.
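The subtraction equation itself is not reproduced above; the sketch below assumes a plain magnitude subtraction in the spectral-'spectral' domain with a small spectral floor. The function name, the floor value and the exact subtraction rule are illustrative assumptions, not the authors' formulation.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.cos(np.pi * k * (2 * i + 1) / (2 * n)) * np.sqrt(2.0 / n)
    d[0, :] /= np.sqrt(2.0)
    return d

def spectral_spectral_subtract(noisy_st, noise_st, floor=0.01):
    """noisy_st, noise_st : (F, M) spectral-time matrices (F frequency
    bins, M frames).  The time axis is DCT-transformed to a second
    'frequency' axis, a floored magnitude subtraction is applied there,
    and the result is transformed back to the spectral-time domain."""
    d = dct_matrix(noisy_st.shape[1])
    y = noisy_st @ d.T                    # spectral-'spectral' matrix
    v = noise_st @ d.T                    # noise estimate, same domain
    mag = np.maximum(np.abs(y) - np.abs(v), floor * np.abs(y))
    x = np.sign(y) * mag                  # subtract magnitudes, keep signs
    return x @ d                          # inverse DCT (d is orthonormal)
```

An SNR-dependent oversubtraction factor, as used for the one-dimensional case in Section 7.2, could be applied to the noise term in the same way.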

6.2.2 Wiener filters: A Wiener filter for spectral-time features can be defined as

The state-based Wiener filters can then be implemented in essentially the same way as described in Section 3.


An alternative form of implementation for Wiener filters, based on spectral-time features, is to convert the time dimension of each spectral-time matrix to frequency, and use the modified Wiener filter
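The defining equations are not reproduced above. Assuming the classical Wiener form W(f, t) = P_x(f, t)/(P_x(f, t) + P_n(f, t)), applied element-wise over a power spectral-time matrix, a minimal sketch is:

```python
import numpy as np

def wiener_gain(clean_psd, noise_psd):
    """Classical Wiener gain W = P_x / (P_x + P_n), applied element-wise
    over a power spectral-time matrix."""
    return clean_psd / (clean_psd + noise_psd)

# example: a state's power spectral-time mean and a noise estimate
px = np.array([[4.0, 1.0], [9.0, 0.25]])
pn = np.ones_like(px)
w = wiener_gain(px, pn)
print(w)  # element-wise gains in (0, 1): 0.8, 0.5, 0.9, 0.2
```

In the alternative implementation mentioned above, the same ratio would be formed after the time axis of each matrix has been converted to a frequency axis.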

6.2.3 Noise-adaptive HMMs: The state-observation statistics of an HMM, using cepstral-time features, consist of a mean cepstral-time matrix and a fourth-order covariance tensor. The covariance tensor can be simplified to a 'diagonal' tensor if it is assumed that the elements are uncorrelated. The equations for the adaptation of cepstral-time-based HMMs to noise are similar to those described in Section 5, except that the transformations from the cepstral domain to the log-spectral domain, and vice versa, require two-dimensional DCTs for the means, and four-dimensional DCTs for the covariance.

7 Experimental results

This Section presents experimental results obtained for the noise-compensation techniques described in the earlier Sections, using a subset of the NOISEX-92 database. The subset used consists of a male talker saying ten repetitions of the ten digits, zero to nine, to form the training set, and a further ten repetitions of the ten digits to form the test set. Of the various noises available with NOISEX, the Lynx helicopter and Volvo car were selected as speech contaminants.

In all experiments, the digits were modelled using an eight-state, single-mode-per-mixture, continuous-density HMM, with a diagonal covariance matrix. The model structure was left-to-right with no skip states.

To generate features, the speech was Hamming-windowed every 16 ms with a window width of 32 ms. From this the output from a 25-channel mel-scaled filterbank was obtained. This was then converted to either one-dimensional MFCCs using a one-dimensional DCT, or grouped together with seven other filterbank outputs and a two-dimensional DCT applied to form two-dimensional cepstral matrices. The one-dimensional cepstral vectors were truncated to either 15 [c(0)-c(14)] or 14 [c(1)-c(14)] cepstral coefficients. The two-dimensional cepstral matrices were truncated to either 15 x 4, or 14 x 3, depending on whether the zeroth row and column were omitted.
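The analysis framing just described can be sketched as follows. The 16 kHz sample rate is an assumption (it is not stated in this Section), giving a 256-sample shift and a 512-sample window.

```python
import numpy as np

def hamming_frames(x, sample_rate=16000, shift_ms=16, width_ms=32):
    """Segment a signal into Hamming-windowed frames: one frame every
    16 ms, each 32 ms wide, as in the experimental setup."""
    shift = sample_rate * shift_ms // 1000   # 256 samples at 16 kHz
    width = sample_rate * width_ms // 1000   # 512 samples at 16 kHz
    window = np.hamming(width)
    n_frames = 1 + (len(x) - width) // shift
    return np.stack([x[i * shift:i * shift + width] * window
                     for i in range(n_frames)])

frames = hamming_frames(np.ones(16000))      # one second of signal
print(frames.shape)  # (61, 512)
```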

Figs. 6-9 show the experimental results obtained for HMMs using one-dimensional cepstral features, and Figs. 10-13 show the results for HMMs using two-dimensional cepstral-time features. Each graph presents a comparison of the three noise-compensation methods, and also includes the matched and unmatched performance, for the particular feature type and contaminating noise.

Fig. 6 Experimental results: one-dimensional cepstra, c(0)-c(14), Lynx helicopter noise
Fig. 7 Experimental results: one-dimensional cepstra, c(0)-c(14), car noise
Fig. 8 Experimental results: one-dimensional cepstra, c(1)-c(14), Lynx helicopter noise
Fig. 9 Experimental results: one-dimensional cepstra, c(1)-c(14), car noise
(Each figure plots recognition accuracy, %, against signal-to-noise ratio, dB, for no noise compensation, spectral subtraction, Wiener filtering, model adaptation and matched conditions.)


7.1 Matched and unmatched conditions

Unmatched training and testing conditions generally produce the worst performance, owing to the uncompensated mismatch which exists between models trained on clean speech and those tested on noisy speech. With matched conditions, the models are trained and tested with speech contaminated under similar noise conditions, which should indicate the best performance the system can achieve. The Figures indicate that, under matched conditions, the recognition accuracy does not suffer greatly, and remains steady down to SNRs as low as 0 dB. The deterioration in performance is relatively rapid below 0 dB, but remains well above the performance of unmatched conditions. Matched and unmatched conditions can therefore be used to indicate the upper and lower bounds on the performance of the recognition system.

Fig. 10 Experimental results: two-dimensional cepstra, 15 x 4, Lynx helicopter noise
Fig. 11 Experimental results: two-dimensional cepstra, 15 x 4, car noise
Fig. 12 Experimental results: two-dimensional cepstra, 14 x 3, Lynx helicopter noise
Fig. 13 Experimental results: two-dimensional cepstra, 14 x 3, car noise
(Each figure plots recognition accuracy, %, against signal-to-noise ratio, dB, for no noise compensation, spectral subtraction, Wiener filtering, model adaptation and matched conditions.)

7.2 Spectral subtraction

The nonlinear spectral-subtraction method of Section 4 was implemented. The noise estimate was obtained from the speech-inactive periods preceding the utterance, with the maximum attenuation of the spectral filter H(f) limited to -20 dB. The performance of spectral subtraction improves with the use of an SNR-dependent subtraction, and an oversubtraction of up to three times the noise estimate. To reduce further the processing distortions due to the noise variance, the spectral-filter sequence was itself lowpass filtered before application to the noisy speech. The improvement resulting from spectral subtraction is significant, but remains well below that obtained by Wiener filters and model adaptation.
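An SNR-dependent subtraction filter of this kind can be sketched as follows. The 3x oversubtraction limit and the -20 dB attenuation floor follow the description above; the linear SNR-to-oversubtraction ramp is an assumption, as the exact mapping is not given.

```python
import numpy as np

def nss_filter(noisy_psd, noise_psd, max_oversub=3.0, atten_floor_db=-20.0):
    """SNR-dependent (nonlinear) spectral-subtraction filter sketch:
    subtract up to three times the noise estimate at low SNR, and limit
    the maximum attenuation of the amplitude filter to -20 dB."""
    snr_db = 10.0 * np.log10(np.maximum(noisy_psd / noise_psd, 1e-12))
    # assumed ramp: 3x oversubtraction at/below 0 dB, 1x at/above 20 dB
    alpha = np.clip(max_oversub - (max_oversub - 1.0) * snr_db / 20.0,
                    1.0, max_oversub)
    # amplitude-domain filter from power-spectral subtraction
    h = np.sqrt(np.maximum(1.0 - alpha * noise_psd / noisy_psd, 0.0))
    h_min = 10.0 ** (atten_floor_db / 20.0)   # -20 dB -> gain floor of 0.1
    return np.maximum(h, h_min)
```

The resulting gain sequence would then be lowpass filtered over time, as described above, before being applied to the noisy speech.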

7.3 Wiener filtering

The Wiener filtering was implemented by replacing the clean-model cepstral means with the Wiener-filter-adapted means, as described in Section 3.2. This was done by transforming the model means from the cepstral domain to the power-spectral domain, where the noise means can be added, and then transforming back to the cepstral domain. The experiments on car noise and helicopter noise show that Wiener-based adaptation works remarkably well, performs much better than spectral subtraction and produces results which are comparable with those obtained using model adaptation. For example, at an SNR of 0 dB the recognition accuracy of HMMs employing Wiener filters is well above 90% for both car and helicopter noise.

7.4 Model adaptation

The clean-digit models were adapted by converting the state-observation statistics (means and variances) from the cepstral domain to the power-spectral domain, adding the noise-model statistics and converting back to the cepstral domain. The noise model was a one-state, single-mode-per-mixture HMM. State-transition probabilities were left unchanged. The Figures show that the


improvement in recognition accuracy that results from adaptation is substantial, and the performance of noise-adaptive HMMs approaches the results obtained in matched conditions.
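The mean-adaptation step common to this method and to the Wiener adaptation of Section 7.3 can be sketched as follows. This covers the means only; the variance adaptation is omitted, and the orthonormal DCT convention is an assumption.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    d = np.cos(np.pi * k * (2 * i + 1) / (2 * n)) * np.sqrt(2.0 / n)
    d[0, :] /= np.sqrt(2.0)
    return d

def adapt_cepstral_mean(clean_cep, noise_cep):
    """Cepstral -> log-spectral (inverse DCT) -> power spectrum,
    add the noise power, then log -> cepstral (forward DCT)."""
    d = dct_matrix(len(clean_cep))
    px = np.exp(d.T @ clean_cep)    # clean state power-spectral mean
    pn = np.exp(d.T @ noise_cep)    # noise power-spectral mean
    return d @ np.log(px + pn)
```

With a negligible noise power the adapted mean reduces to the clean mean, and with equal powers only the zeroth cepstral coefficient shifts (by a constant log 2 offset), as expected of a log-domain combination.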

7.5 Comparison of cepstral feature vectors and matrices

The use of cepstral-time feature matrices consistently improves the recognition accuracy relative to cepstral feature vectors. For example, for uncompensated car noise the recognition accuracy for cepstral vectors is 26%, which compares with 66% for cepstral-time feature matrices.

The graphs also show that noise-compensated cepstral-time matrices generally outperform noise-compensated cepstral vectors. However, with spectral subtraction, the improvement in accuracy depends on the subset of the cepstral-time matrix which is used for speech representation, and best results are obtained when the zeroth row and column of the cepstral-time matrix are retained.

8 Conclusion

The experimental results in this paper provide a comparative evaluation of the use of noise-compensation methods for HMMs using cepstral vectors, and HMMs using cepstral-time matrices. The results show that, of the three noise-compensation schemes considered in this paper, the noise-adaptive HMMs produce the best results; this is followed closely by state-based Wiener adaptation. The spectral-subtraction method improves the recognition accuracy relative to the uncompensated case, although the performance of spectral subtraction falls short of that offered by the noise-adaptive and Wiener-filtering methods. The results also demonstrate that cepstral-time features provide a relatively robust alternative to cepstral vectors.

9 References

1 RABINER, L.R.: 'A tutorial on hidden Markov models and selected applications in speech recognition', Proc. IEEE, 1989, pp. 257-286

2 JUANG, B.H.: 'Speech recognition in adverse environments', Comput. Speech Lang., 1991, 5, pp. 275-294

3 BOLL, S.F.: 'Suppression of acoustic noise in speech using spectral subtraction', IEEE Trans., 1979, ASSP-27, pp. 113-120

4 BERSTEIN, A.D., and SHALLOM, I.D.: 'An hypothesized Wiener filtering approach to noisy speech recognition'. Proceedings of the IEEE international conference on Acoustics, Speech, Signal Processing, 1991, pp. 913-916

5 EPHRAIM, Y., MALAH, D., and JUANG, B.-H.: 'On the application of hidden Markov models for enhancing noisy speech', IEEE Trans., 1989, ASSP-37, pp. 1846-1856

6 VASEGHI, S.V., and MILNER, B.P.: 'Noise adaptive hidden Markov models based on Wiener filters'. Proceedings of Eurospeech, 1993, pp. 1023-1026

7 LIM, J.S., and OPPENHEIM, A.V.: 'All-pole modelling of degraded speech', IEEE Trans., 1978, ASSP-26, pp. 197-210

8 LOCKWOOD, P., and BOUDY, J.: 'Experiments with a non-linear spectral subtractor (NSS), hidden Markov models and the projection, for robust speech recognition in cars'. Proceedings of Eurospeech, 1991, pp. 79-82

9 BRIDLE, J.S., PONTING, K.M., BROWN, M.D., and BORRET, A.W.: 'A noise compensating spectrum distance measure applied to automatic speech recognition', Proc. Inst. Acoust., 1984, 6, (4), pp. 307-314

10 NADAS, A., NAHAMOO, D., and PICHENY, M.A.: 'Speech recognition using noise-adaptive prototypes', IEEE Trans., 1989, ASSP-37, pp. 1495-1503

11 VARGA, A., MOORE, R., BRIDLE, J., PONTING, K., and RUSSELL, M.: 'Noise compensation algorithms for use with hidden Markov model based speech recognition'. Proceedings of the IEEE international conference on Acoustics, Speech, Signal Processing, 1988, pp. 481-484

12 VARGA, A., and MOORE, R.K.: 'Hidden Markov model decomposition of speech and noise'. Proceedings of the IEEE international conference on Acoustics, Speech, Signal Processing, 1990, pp. 845-848

13 ROE, D.B.: 'Speech recognition with a noise-adapting codebook'. Proceedings of the IEEE international conference on Acoustics, Speech, Signal Processing, 1987, pp. 1139-1142

14 GALES, M.J.F., and YOUNG, S.J.: 'An improved approach to the hidden Markov model decomposition'. Proceedings of the IEEE international conference on Acoustics, Speech, Signal Processing, 1992, pp. 129-734

15 GALES, M.J.F., and YOUNG, S.J.: 'HMM recognition in noise using parallel model combination'. Proceedings of Eurospeech, 1993, pp. 837-840

16 GHITZA, O.: 'Auditory nerve representation as a front-end for speech recognition in a noisy environment', Comput. Speech Lang., 1986, 1, pp. 109-130

17 MANSOUR, D., and JUANG, B.H.: 'The short-time modified coherence representation and noisy speech recognition', IEEE Trans., 1989, ASSP-37, pp. 795-804

18 ARIKI, Y., MIZUTA, S., NAGATA, M., and SAKAI, T.: 'Spoken-word recognition using dynamic features analysed by two-dimensional cepstrum', IEE Proc. I, 1989, 136, (2), pp. 133-140

19 VASEGHI, S.V., CONNER, P.N., and MILNER, B.P.: 'Speech modelling using cepstral-time feature matrices in hidden Markov models', IEE Proc. I, 1993, 140, (5), pp. 317-320

20 PISONI, D.B.: 'Some acoustic-phonetic correlates of speech produced in noise'. Proceedings of the IEEE international conference on Acoustics, Speech, Signal Processing, 1985, pp. 1581-1584

21 WIENER, N.: 'Extrapolation, interpolation, and smoothing of stationary time series, with engineering applications' (MIT Press, 1949)

22 BEATTIE, V.L., and YOUNG, S.J.: 'Noisy speech recognition using hidden Markov model state based filtering'. Proceedings of the IEEE international conference on Acoustics, Speech, Signal Processing, 1991, pp. 917-920

23 VASEGHI, S.V., and MILNER, B.P.: 'Noisy speech recognition based on HMMs, Wiener filters and re-evaluation of most likely candidates'. Proceedings of the IEEE international conference on Acoustics, Speech, Signal Processing, 1993, vol. II, pp. 103-106

24 NOLAZCO FLORES, J.A., and YOUNG, S.J.: 'Adapting a HMM-based recogniser for noisy speech enhanced by spectral subtraction'. Proceedings of Eurospeech, 1993, pp. 829-832
