

Bayesian channel equalisation and robust features for speech recognition

B.P. Milner, S.V. Vaseghi

Indexing terms: Bayesian channel equalisation, Speech recognition, Microphone

Abstract: The use of a speech recognition system over telephone channels, or with different microphones, requires channel equalisation. In speech recognition, the speech model provides a bank of statistical information that can be used in the channel identification and equalisation process. The authors consider HMM-based channel equalisation, and present results demonstrating that substantial improvement can be obtained through the equalisation process. An alternative method, for speech recognition, is to use a feature set which is more robust to channel distortion. Channel distortions result in an amplitude tilt of the speech cepstrum, and therefore differential cepstral features provide a measure of immunity to channel distortions. In particular, the cepstral-time feature matrix, in addition to providing a framework for representing speech dynamics, can be made robust to channel distortions. The authors present results demonstrating that a major advantage of cepstral-time matrices is their channel-insensitive character.

1 Introduction

In this paper we consider the problem of recognition of speech distorted in transmission through a communication channel or by a microphone. Although in general the channel distortion may be nonstationary and nonlinear, for most practical systems it is sufficient to approximate the channel with a slowly time-varying filter. The channel distortion is modelled as an unknown convolutional operation. From the talker to the recognition system, three sources of convolutional distortion can be identified: namely the acoustic environment in which the microphone is placed, the microphone, and the communication channel. For speech recognition systems to operate successfully with a broad range of microphones and communication channels it is necessary to compensate for the channel distortion.

© IEE, 1996. IEE Proceedings online no. 19960577. Paper first received 18th October 1995 and in revised form 30th April 1996.

B.P. Milner was with the University of East Anglia, Norwich, UK and is now with BT Laboratories, Martlesham Heath, Ipswich IP5 7RE, UK

S.V. Vaseghi is with the School of Electrical Engineering and Computer Science, Queen's University of Belfast BT9 5AH, UK

In the time domain, the channel output y(m), which is the input to the speech recogniser, is modelled as

y(m) = (x(m) + n1(m)) * h(m) + n2(m)    (1)

where x(m) is the clean speech, h(m) is the channel response, n1(m) is the acoustic noise, and n2(m) is the channel noise. Eqn. 1 can be expressed in the frequency domain as

Y(f) = (X(f) + N1(f))H(f) + N2(f)    (2)

In this paper we are mainly concerned with the channel distortion; it is assumed that the channel distortion is dominant, and the channel noise and the acoustic noise can be ignored.

An ideal channel equaliser, ĥ^inv(f), has a frequency response equal to the inverse of the channel, and is able to recover the original speech signal from the distorted signal. An ideal equaliser, H^inv(f), is defined as

H(f)H^inv(f) = 1  ←F→  Σ_k h^inv_k h_(i−k) = δ(i)    (3)

where F denotes the Fourier transform relation and δ(i) is the Kronecker delta function. Real communication channels may not be invertible, hence H^inv(f) may not be well defined for all frequencies. Such a situation occurs in channels with bands of frequencies in which the signals are heavily attenuated and immersed in noise. A second form of non-invertible channel occurs when the channel distortion is non-minimum phase. However, speech recognition systems use features derived from the power spectrum, and therefore the channel magnitude response and not the channel phase is of interest.

Channel equalisation is relatively simple when the channel response is known and invertible. For most applications, the channel response is unknown and the only information is the distorted signal. A possible method for channel identification is to use an equaliser training phase. This involves synchronous transmission of a known signal through the channel, from which the receiver is able to estimate the channel distortion. The main drawbacks of this approach are the increased time and power required to perform the training. An alternative method is to use a 'blind' equalisation tech- nique to remove the channel distortion. Blind equalisa- tion requires some statistics of the clean speech to identify and equalise the distortion.

In speech recognition systems where the speech features are based on log-spectra, such as mel-frequency cepstral coefficients (MFCCs), the convolutional distortion becomes an additive distortion as

log Y(f) = log X(f) + log H(f)    (4)

IEE Proc.-Vis. Image Signal Process., Vol. 143, No. 4, August 1996


In the cepstral domain, following a discrete cosine transform (DCT) of the log-spectral variables, the distortion is also additive:

cY(m) = cX(m) + cH(m)    (5)

From eqn. 5, the effect of a time-invariant channel distortion on the mth cepstral coefficient is the addition of a constant cH(m). Thus, for channel-robust speech recognition, the tilt in the cepstrum produced by the channel distortion cH(m) needs to be compensated. There are two broad approaches to robust speech recognition. One approach attempts to use an equaliser to suppress cH(m). The second approach uses speech features which are inherently robust to channel distortion.
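The additivity in eqns 4 and 5 can be checked numerically. The sketch below (the frame, the channel taps and the FFT size are arbitrary illustrative assumptions, not values from the paper) convolves a random signal with a short filter and verifies that the log magnitude spectra, and hence the real cepstra, add:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.standard_normal(256)          # stand-in for a clean speech frame
h = np.array([1.0, 0.5, 0.25])        # short channel impulse response

y = np.convolve(x, h)                 # channel output (noise-free, as assumed)

n_fft = 512                           # large enough that circular conv = linear conv
X = np.fft.rfft(x, n_fft)
H = np.fft.rfft(h, n_fft)
Y = np.fft.rfft(y, n_fft)

# Eqn 4: log|Y(f)| = log|X(f)| + log|H(f)|
assert np.allclose(np.log(np.abs(Y)), np.log(np.abs(X)) + np.log(np.abs(H)), atol=1e-6)

# Eqn 5: the same additivity holds for the real cepstrum.
cep = lambda S: np.fft.irfft(np.log(np.abs(S)))
assert np.allclose(cep(Y), cep(X) + cep(H), atol=1e-6)
```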

1.1 Recent approaches to channel robust speech recognition
One of the earliest approaches to equalisation was proposed in [1]. The method was applied to dereverberation of early acoustic recordings. In this application, the response of the recording system is unknown and only the distorted signal is available. To equalise the distortion on an acoustic recording, a modern undistorted recording of the same signal material was obtained. The average cepstra of the distorted recording and the modern recording were computed and subtracted to form an estimate of the channel distortion. Using this estimate of the channel, an equaliser was formed for the restoration of the original signal.

A model-based approach to blind deconvolution is the factorisation of the distorted signal model into a channel model and a channel input model. The model factorisation needs some prior information that can be used for the correct classification of the factors, as either that of the channel or the channel input. Spencer and Rayner [2] used all-pole linear predictive models to model the channel input signal and the channel itself. For restoration of acoustic recordings, the channel was assumed to be stationary whereas the channel input is time-varying.

In digital telecommunication, blind deconvolution is used in decision-directed equalisers for removal of intersymbol interference and other channel distortions [3-10]. In Section 3.2.3 we consider a decision-directed equaliser incorporating a hidden Markov model. Blind deconvolution has also been applied to speech recognition applications. The effect of channel distortion on cepstral coefficients is the addition of a constant, or a slowly time-varying, tilt in the cepstrum. Therefore channel distortions can be removed by subtraction of the channel cepstrum, or by highpass filtering of the time variations of the cepstral coefficients. Mokbel et al. [11] developed a method for online adaptation of a speech recognition system to variations in telephone line conditions. They estimate the channel distortion as the long-term averaged cepstral vector in speech-inactive parts. The estimate of the speech cepstrum is obtained through subtraction of the cepstral estimate of the channel from the cepstrum of the distorted speech. A similar approach is also described in [12].

Hermansky and Morgan [13] proposed the use of bandpass filters on the time variations of the log spectral speech bands. They used fourth-order bandpass filters with a sharp spectral zero positioned at the zero (DC) frequency to remove the constant and slowly time-varying components in each frequency band, including the channel distortion. Hanson and Applebaum [14] compared the effect of bandpass filtering of log spectral bands with highpass filtering. They found that both highpass and bandpass filtering offer similar channel robustness. In addition, they used the differential parameters of speech vectors, which are effectively bandpass variants of the spectral features. Increased channel robustness was obtained by augmenting standard speech features with the higher-order differential parameters.

2 Bayesian blind deconvolution and equalisation

The Bayesian inference method provides a framework for inclusion of a statistical model of the channel input and the channel. The Bayesian risk for a channel estimate ĥ is defined as

R(ĥ|y) = ∫_H ∫_X C(ĥ, h) fX,H|Y(x, h|y) dx dh    (6)

where C(ĥ, h) is the cost of estimating the channel h as ĥ, averaged over all values of h, fX,H|Y(x, h|y) is the posterior density, fY|H(y|h) is the observation likelihood, and fH(h) is the prior PDF of the channel. The Bayesian estimate is obtained by minimisation of the risk function R(ĥ|y).

In the following it is assumed that the convolutional channel distortion is transformed into an additive distortion through transformation of the channel output into cepstral variables. Ignoring the channel noise, the relation between the cepstra of the channel input and output signals is

y(m) = x(m) + h    (7)

where the cepstral vectors x(m), y(m) and h are the channel input, the channel output and the channel, respectively.

2.1 Conditional mean channel estimation
A commonly used cost function in the Bayesian risk of eqn. 6 is the mean square error C(ĥ, h) = |ĥ − h|², which results in the conditional mean (CM) estimate defined as

ĥ^CM = ∫_H h fH|Y(h|y) dh    (8)

The channel input signal estimate is then obtained as

x̂(m) = y(m) − ĥ^CM    (9)

2.2 Maximum likelihood channel estimation
The ML channel estimate is equivalent to the case when the cost function and the channel PDF are uniform [15]. Assuming that the channel input signal has a Gaussian distribution with a mean vector μx and a covariance matrix Σxx, the likelihood of a sequence of N P-dimensional channel output vectors {y(m)}, given a channel vector h, is

fY|H(y(0), ..., y(N−1)|h) = Π(m=0..N−1) fX(y(m) − h)
  = Π(m=0..N−1) [1/((2π)^(P/2) |Σxx|^(1/2))] exp(−(1/2)(y(m) − h − μx)^T Σxx^(−1) (y(m) − h − μx))    (10)
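For this single-Gaussian input model, setting the derivative of the log-likelihood of eqn 10 with respect to h to zero gives the closed form ĥ^ML = ȳ − μx. A minimal numerical sketch (the dimensions, the clean-speech statistics and the channel vector are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
P, N = 12, 500                        # cepstral dimension and frame count

mu_x = rng.standard_normal(P)         # clean-speech mean vector (assumed known)
h_true = 0.5 * rng.standard_normal(P) # hypothetical channel cepstrum

x = mu_x + 0.3 * rng.standard_normal((N, P))  # simulated clean cepstral vectors
y = x + h_true                                # eqn 7: y(m) = x(m) + h

# Maximising eqn 10 over h gives the closed form h_ML = mean(y) - mu_x.
h_ml = y.mean(axis=0) - mu_x
assert np.linalg.norm(h_ml - h_true) < 0.1
```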


The likelihood of the observation sequence, given the channel and the input word model, can be expressed as

fY|M,H(Y|Mi, h) = fX|M(Y − h|Mi)    (21)

For a given input model Mi and state sequence s = {s(0), s(1), ..., s(N−1)}, the PDF of a sequence of N independent observation vectors Y = {y(0), y(1), ..., y(N−1)} is

fY|M,S,H(Y|Mi, s, h) = Π(m=0..N−1) fX|M(y(m) − h|Mi, s(m))
  = Π(m=0..N−1) [1/((2π)^(P/2) |Σxx,s(m)|^(1/2))] exp(−(1/2)(y(m) − h − μx,s(m))^T Σxx,s(m)^(−1) (y(m) − h − μx,s(m)))    (22)

Taking the derivative of the log-likelihood of eqn. 22 with respect to the channel vector h yields a maximum likelihood estimate of h as

ĥ^ML(Y, s) = (Σ(m=0..N−1) Σxx,s(m)^(−1))^(−1) Σ(m=0..N−1) Σxx,s(m)^(−1) (y(m) − μx,s(m))    (23)

Note that, when all state observation covariance matrices are identical, the channel estimate becomes

ĥ^ML(Y, s) = (1/N) Σ(m=0..N−1) (y(m) − μx,s(m))    (24)

The ML estimate of eqn. 24 is based on the ML state sequence s of Mi. In the following Section we consider the conditional mean estimate over all state sequences of a model.
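Eqn 23 is an inverse-covariance weighted average of the deviations y(m) − μx,s(m). With diagonal state covariances it reduces to per-dimension inverse-variance weighting, as in this sketch (the per-frame state statistics are simulated, standing in for an actual HMM state alignment):

```python
import numpy as np

rng = np.random.default_rng(2)
P, N = 4, 600                               # illustrative dimensions

# Hypothetical per-frame state statistics, as if given by an alignment s(0..N-1).
mu = rng.standard_normal((N, P))            # mu_{x,s(m)}
var = rng.uniform(0.5, 2.0, (N, P))         # diagonal of Sigma_{xx,s(m)}

h_true = np.array([0.4, -0.2, 0.1, 0.3])    # hypothetical channel cepstrum
y = mu + np.sqrt(var) * rng.standard_normal((N, P)) + h_true

# Eqn 23 with diagonal covariances: an inverse-variance weighted average
# of (y(m) - mu_{x,s(m)}), computed independently per dimension.
w = 1.0 / var
h_ml = (w * (y - mu)).sum(axis=0) / w.sum(axis=0)
assert np.linalg.norm(h_ml - h_true) < 0.2
```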

3.1 MAP channel estimate based on HMMs
The conditional PDF of the channel averaged over all HMMs can be expressed as

fH|Y(h|Y) = Σ(i=1..V) Σ_s fH|Y,S,M(h|Y, s, Mi) PS|M(s|Mi) PM(Mi)    (25)

where PM(Mi) is the prior PMF of the input words. Given a sequence of N P-dimensional observation vectors Y = [y(0), ..., y(N−1)], the posterior PDF of the channel h along a state sequence s of an HMM Mi is defined as

fH|Y,S,M(h|Y, s, Mi) ∝ fY|M,S,H(Y|Mi, s, h) fH(h)    (26)

where it is assumed that each state of the HMM has a Gaussian distribution with a mean vector μx,s(m) and a covariance matrix Σxx,s(m), and that the channel h is also Gaussian distributed with a mean vector μh and a covariance matrix Σhh. The MAP estimate along state sequence s in eqn. 26 can be obtained as

ĥ^MAP(Y, s, Mi) = (Σ(m=0..N−1) Σxx,s(m)^(−1) + Σhh^(−1))^(−1) (Σ(m=0..N−1) Σxx,s(m)^(−1) (y(m) − μx,s(m)) + Σhh^(−1) μh)    (27)

The MAP estimate of the channel over all state sequences of all HMMs can be obtained as

ĥ(Y) = Σ(i=1..V) Σ_s ĥ^MAP(Y, s, Mi) PS|M(s|Mi) PM(Mi)    (28)

where PS|M(s|Mi) is the PMF of a state sequence of length N given the word Mi, and the summation is taken over all state sequences of each model Mi.

3.2 Implementation methods for HMM-based deconvolution
In this Section we consider three implementation methods for HMM-based channel equalisation.

3.2.1 Use of the statistical averages over all HMMs: A simple approach to blind equalisation, similar to that proposed in [1], is to use the average of the mean vectors and the covariance matrices, taken over all the states of all the HMMs, as

μx = (1/(V Ns)) Σ(i=1..V) Σ(j=1..Ns) μx,ij,   Σxx = (1/(V Ns)) Σ(i=1..V) Σ(j=1..Ns) Σxx,ij    (29)

where μx,ij and Σxx,ij are the mean and the covariance of the jth state of the ith HMM, and V and Ns indicate the number of models and the number of states, respectively. The maximum likelihood estimate of the channel ĥ^ML is defined as

ĥ^ML = ȳ − μx    (30)

where ȳ is the time-averaged channel output. The estimate of the channel input is

x̂_t = y_t − ĥ^ML
    = (x_t + h) − (ȳ − μx)
    = x_t − x̄ + μx    (31)

Note that the estimate of the channel input is independent of the channel itself. Using the averages over all states and models, the MAP channel estimate becomes

ĥ^MAP = (N Σxx^(−1) + Σhh^(−1))^(−1) (Σ(m=0..N−1) Σxx^(−1) (y(m) − μx) + Σhh^(−1) μh)    (32)
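The channel-independence noted after eqn 31 is easy to verify: equalising with ĥ^ML = ȳ − μx maps the observations to x_t − x̄ + μx whatever the channel was. A sketch with made-up statistics:

```python
import numpy as np

rng = np.random.default_rng(3)
P, N = 12, 300                               # illustrative dimensions

mu_x = rng.standard_normal(P)                # global mean over all states and models
x = mu_x + 0.3 * rng.standard_normal((N, P)) # simulated clean cepstral vectors

def equalise(y):
    h_ml = y.mean(axis=0) - mu_x             # eqn 30: h_ML = y_bar - mu_x
    return y - h_ml                          # equalised channel input

x_hat_1 = equalise(x + rng.standard_normal(P))   # distorted by one channel
x_hat_2 = equalise(x + rng.standard_normal(P))   # distorted by a different channel

# Eqn 31: the equalised output is x_t - x_bar + mu_x, whatever the channel.
assert np.allclose(x_hat_1, x_hat_2)
assert np.allclose(x_hat_1, x - x.mean(axis=0) + mu_x)
```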

3.2.2 Hypothesised-input HMM: For each candi- date HMM in the input vocabulary, a channel estimate is obtained and is used to equalise the channel output


prior to computation of a likelihood score for the model. Thus a channel estimate ĥw is based on the hypothesis that the input word is w. It is expected that a good channel estimate is obtained from the correctly hypothesised HMM, and a poorer estimate from an incorrectly hypothesised HMM. The logic in using the hypothesised-input method is that, after channel equalisation, the restored signal produced by the correctly hypothesised HMM should gain a better score than the estimates from incorrectly hypothesised HMMs (see Fig. 2).

Fig. 2 Hypothesised channel estimation procedure: a channel estimate is formed for each model Mi and applied to y(m) as an equalisation filter ĥ^inv(m)

The hypothesised-input HMM algorithm is as follows:

For i = 1 to number of words V {
  Step 1: Using HMM Mi, produce an estimate of the channel ĥi.
  Step 2: Using the channel estimate ĥi, estimate the channel input as x̂(ĥi) = y − ĥi.
  Step 3: Compute a probability score for model Mi, given the estimate x̂(ĥi).
}
Step 4: Select the most probable word.
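The loop above can be sketched as follows. Here each 'HMM' is reduced to a fixed per-frame mean trajectory with unit variances, so Step 1 uses the simplified estimate of eqn 24 and Step 3 is a plain Gaussian log-likelihood; this is an illustrative stand-in, not the paper's full HMM implementation:

```python
import numpy as np

rng = np.random.default_rng(4)
P, N, V = 8, 100, 5                      # illustrative dimensions and vocabulary

# Hypothetical per-frame mean trajectories, a crude stand-in for the state
# means that an HMM alignment would provide for each word.
word_means = rng.standard_normal((V, N, P))

true_word = 2
x = word_means[true_word] + 0.2 * rng.standard_normal((N, P))
y = x + rng.standard_normal(P)           # unknown additive channel in the cepstra

scores = []
for i in range(V):
    h_i = (y - word_means[i]).mean(axis=0)               # Step 1: eqn 24 estimate
    x_hat = y - h_i                                      # Step 2: equalised input
    scores.append(-np.sum((x_hat - word_means[i]) ** 2)) # Step 3: score under M_i

best = int(np.argmax(scores))                            # Step 4: most probable word
assert best == true_word
```

Note that with a single Gaussian per word the equalised residual would be identical for every hypothesis; the per-frame (state-dependent) means are what let the correctly hypothesised model win.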

3.2.3 Decision-directed equalisation: Blind adaptive channel equalisers are composed of two distinct sections: an adaptive linear equaliser followed by a nonlinear estimator which improves the equaliser output. The output of the nonlinear estimator is the final estimate of the channel input, and is used as the desired signal to direct the equaliser adaptation. The use of the estimator output as the desired signal assumes that the equaliser removes a large part of the channel distortion, thereby enabling the estimator to produce an accurate estimate of the channel input. A method of ensuring that the equaliser 'locks into', and cancels, a large part of the channel distortion is to use a start-up equaliser training period during which a known signal is transmitted.

Fig. 3 Decision-directed equaliser: equaliser output z(m) = x(m) + v(m)

Fig. 3 illustrates a blind equaliser incorporating an adaptive linear filter followed by a hidden Markov model classifier/estimator. The HMM classifies the output of the filter as one of a number of likely signals and provides an enhanced output which is also used for adaptation of the linear filter. The output of the equaliser z(m) is expressed as the sum of the input to the channel x(m) and a so-called convolutional noise term v(m):

z(m) = x(m) + v(m)    (33)

The HMM may incorporate state-based Wiener filters for suppression of the convolutional noise v(m). Assuming that the LMS adaptation method is employed, the adaptation of the equaliser coefficient vector is governed by the following recursive equation:

ĥ^inv(m) = ĥ^inv(m − 1) + μ e(m) y(m)    (34)

where ĥ^inv(m) is an estimate of the optimal inverse channel filter, μ is the adaptation step size, and the error signal is defined as

e(m) = x̂^HMM(m) − z(m)    (35)

where x̂^HMM(m) is the output of the HMM-based estimator, which is used as the desired signal to direct the adaptation.

4 Cepstral-time matrices as channel robust features

This Section describes the use of the cepstral-time matrix as a speech feature robust to channel distortion. In the cepstral domain, for a stationary channel h, the distorted channel output y_t is given by

y_t = x_t + h    (36)

where x_t is the channel input. Taking the first difference of eqn. 36, the channel distortion disappears:

Δy_t = Δx_t    (37)

This observation is the basis for techniques that use bandpass or highpass filtering of the cepstral vectors, in time, to remove the channel distortion. An alternative method for removing channel distortion is to use appropriate cepstral-time matrix features as follows. As shown in the next Section, the zeroth column of a cepstral-time matrix contains the time-averaged value of the cepstral coefficients. The effect of channel distortion is concentrated in this column, and the exclusion of this column provides a channel robust feature set.

4.1 Cepstral-time matrix features
A cepstral-time matrix c_t(n, m) may be obtained from a 2-D DCT of a log spectral-time matrix X_t(f, k). A group of M log filter bank vectors X_t(f) are stacked together to form a spectral-time matrix X_t(f, k), where t indicates the matrix time index, f the frequency channel and k the time vector in the matrix. The spectral-time matrix is then transformed into a cepstral-time matrix c_t(n, m) using a 2-D DCT. Since the 2-D DCT can be divided into two 1-D DCTs, an alternative implementation of the cepstral-time matrix is to apply a 1-D DCT along the time axis of a matrix consisting of M conventional cepstral vectors. Various methods of deriving a cepstral-time matrix are described in [16].

In transformation from a spectral-time to a cepstral-time matrix, via a 2-D DCT, the frequency axis f of the spectral-time matrix is converted to quefrency n, and the time axis k is converted to frequency m. The lower index coefficients along the axis n represent the spectral envelope, whereas the higher coefficients represent the pitch and the excitation. Along the axis m, the lower coefficients represent the long-time variation of the cepstral coefficients, and the higher coefficients the short-time variation. Fig. 4 illustrates these regions.


Fig. 4 Regions of the cepstral-time matrix: along quefrency n, the lower coefficients represent the spectral envelope and the higher coefficients the pitch and excitation; along frequency m, the lower coefficients represent long-time variation and the higher coefficients short-time variation; the zeroth column is where channel distortion is concentrated

The cepstral-time matrix contains information regarding the transitional dynamics of the speech, as well as the instantaneous values. The conventional method for including the speech transitional dynamics is to use the differential parameters. The first and second derivatives are

∂c_t(n) = Σ(k=−N..N) k c_(t+k)(n)    (38)

∂∂c_t(n) = ∂c_(t+1)(n) − ∂c_(t−1)(n)    (39)

A cepstral-time matrix can be obtained by applying a 1-D DCT to a matrix containing M cepstral vectors. The zeroth, the first, and the second columns of the cepstral-time matrix, c(n, 0), c(n, 1) and c(n, 2), are given by

c(n, m) = Σ(k=0..M−1) c_k(n) cos((2k + 1)mπ/(2M)),   m = 0, 1, 2    (40)

where c_k(n) is the nth coefficient of the kth cepstral vector in the matrix. It can be seen that the first order differential and the column c(n, 1) of the cepstral-time matrix are both produced by a weighted summation of the static cepstral vectors. Substituting eqn. 38 into eqn. 39 shows that the second order differential is generated in a similar manner to the c(n, 2) column of the cepstral-time matrix. Figs. 5 and 6 illustrate the similarity of the basis functions for the differential cepstra and the first and second columns of the cepstral-time matrix.
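The claim that a constant cepstral offset survives only in the zeroth (time-average) component can be checked directly with the 1-D DCT implementation. The sketch below uses an unnormalised DCT-II basis as in eqn 40 (the sizes are illustrative; in the code the m index runs along rows rather than columns):

```python
import numpy as np

rng = np.random.default_rng(6)
n_cep, M = 15, 8                      # cepstral order and matrix width (illustrative)

def cepstral_time_matrix(C):
    # 1-D DCT-II (unnormalised, as in eqn 40) along the time axis of the
    # M x n_cep matrix of stacked cepstral vectors C; rows are indexed by m.
    k = np.arange(M)
    m = np.arange(M)
    basis = np.cos((2 * k[None, :] + 1) * m[:, None] * np.pi / (2 * M))
    return basis @ C

C = rng.standard_normal((M, n_cep))   # M conventional cepstral vectors x_t
h = rng.standard_normal(n_cep)        # stationary channel cepstrum (eqn 36)

ct_clean = cepstral_time_matrix(C)
ct_dist = cepstral_time_matrix(C + h) # y_t = x_t + h for every frame

# The offset h survives only in the m = 0 (time-average) component: for
# m > 0 the DCT basis sums to zero over a constant input.
assert np.allclose(ct_clean[1:], ct_dist[1:])
assert not np.allclose(ct_clean[0], ct_dist[0])
```

Discarding the m = 0 component therefore yields the channel robust feature set described in the text.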

Fig. 5 Differential functions (weights over t for ∂c_t(n))

Fig. 6 DCT basis functions (weights over k for c_k(n))

The zeroth column contains the time-averaged value of the cepstral coefficients. The effect of channel distortion is concentrated in this column, and the exclusion of this column provides a channel robust feature set.

5 Experimental results

To evaluate the success of the various blind deconvolution techniques presented in this paper, experiments were performed using speech which was convolutively distorted. To generate the distorted speech, the test speech was filtered using a variety of finite impulse response filters to simulate channel distortions. Six channel distortion filters were chosen for experimentation. The frequency responses of the filters are shown in Figs. 7-12. The filters were designed specifically to attenuate parts of the frequency spectrum where much of the speech information is found (up to 3 kHz). The filters a to e are all invertible, whereas filter f is noninvertible. Filter f is designed to simulate a bandpass telephone channel, with sharp band limiting between 300 Hz and 3000 Hz.

Fig. 7 Synthesised channel distortion: zero at 0 Hz

Fig. 8 Synthesised channel distortion: zeros at 0, 3000 and 8000 Hz

IEE Proc.-Vis. Image Signal Process., Vol. 143, No. 4, August I996

Page 7: Bayesian channel equalisation and robust features for speech recognition

The HMMs used for speech recognition were trained on undistorted clean speech. The speech features used in these experiments were 15-dimensional mel-frequency cepstral coefficients (MFCCs), including the coefficient c(0). The speech database used was the NOISEX database of noisy, speaker-dependent, isolated spoken English digits [17].

Fig. 9 Synthesised channel distortion: zeros at 0, 1500 and 8000 Hz

Fig. 10 Synthesised channel distortion: zeros at 0, 700 and 8000 Hz

Fig. 11 Synthesised channel distortion: zeros at 0, 700, 3000 and 8000 Hz

Fig. 12 Synthesised channel distortion: telephone line channel, 128 taps

5.1 Channel equalisation
Table 1 shows experimental results for the channel distortions a to f and two blind deconvolution methods. The first is based on maximum likelihood estimation using the statistical averages over all HMMs (Section 3.2.1), and is named ML. The second uses the hypothesised maximum likelihood channel estimation technique of Section 3.2.2, and is named ML-HMM. For comparison, NCC shows the case where no channel compensation has been applied.

Table 1: % recognition accuracy of blind deconvolution

           Flat   a     b     c     d     e     f
  NCC      100    95    50    10    10    10    25
  ML       100    94    94    94    94    93    89
  ML-HMM   100    100   100   100   100   100   100

The channel distortions c to e severely attenuate the speech information parts of the spectrum, and consequently cause a severe degradation in recognition performance, as shown by NCC. The maximum likelihood channel estimate ML, using an average of all HMM means, produces a good improvement for all channel distortions. However, the maximum likelihood estimate based on the hypothesised HMM approach gives 100% accuracy for all the channel distortions tested. The results of the maximum likelihood estimates in Table 1 are reasonably constant across the channel distortions, as predicted by eqn. 31.

To examine the accuracy of the channel distortion estimates ĥ, the log filter bank representation of the channel distortion was computed and compared with the estimates obtained by the two blind deconvolution techniques of Table 1, ML and ML-HMM. Figs. 13 and 14 show the log filter bank representation of two channel distortion filters, namely d and e. Also shown are the two estimates: the maximum likelihood estimate averaged over all HMMs, ML, and the maximum likelihood estimate computed using the 'correct' HMM, ML-HMM. The figures show that both blind deconvolution techniques produce good estimates of the channel distortion. The ML-HMM approach produces a slightly better estimate of the channel than ML.

Fig. 13 Actual and estimated channel response for channel 1: actual channel distortion; channel estimate using all models, ML; channel estimate from 'correct' HMM, ML-HMM

Fig. 14 Actual and estimated channel response for channel 2: actual channel distortion; channel estimate using all models, ML; channel estimate from 'correct' HMM, ML-HMM

5.2 Channel robust features
To test the channel robustness of the cepstral-time matrix, the channel distortions shown in Figs. 7-12 were used to distort speech, while the HMMs were trained on distortion-free speech. For speech recognition, only a small part of the cepstral-time matrix is useful, thus the matrix can be truncated down to N′ × M′. The variable N represents the number of filter bank channels of the spectral vectors; this is related to the sampling frequency of the speech. For the NOISEX database the speech is sampled at 16 kHz, which results in an N = 25 channel mel-scaled filter bank. M represents the number of time vectors which are grouped together to form the spectral-time matrix. A series of experiments found that a spectral-time matrix width of M = 8 worked best, with an overlap of M − 1 vectors. Experiments revealed that the best cepstral-time matrix truncation was to N′ = 15 and M′ = 4. The spectral-spectral matrices are generated in a similar manner, except that a DCT is applied along the time axis of the spectral-time matrix, prior to the logarithm operation and 2-D DCT.

Table 2: % Recognition accuracy of cepstral-time matrices

                       Flat    a     b     c     d     e     f
15 dim cep. vector      100   95    50    10    10    10    25
15 x 4 C-T matrix       100   91    74    45    10    12    58
14 x 3 C-T matrix       100  100   100   100   100   100    90

Table 2 shows the recognition performance for the six channel distortions a to f and for no channel distortion (flat). The results show the performance of the 15 x 4 untruncated cepstral-time matrix and the 14 x 3 truncated cepstral-time matrix; the performance of 15-dimensional cepstral vectors is also shown. The speech database in this experiment is Noisex.

Table 2 shows that the 14 x 3 cepstral-time matrix remains unaffected by the invertible channel distortions a to e, in comparison with the untruncated 15 x 4 cepstral-time matrix, which suffers degradation as a result of the channel. However, the 14 x 3 cepstral-time matrix is not completely robust to the noninvertible channel distortion f, although it does outperform the 15 x 4 cepstral-time matrix.

Figs. 15 and 16 show plots of two 15 x 4 cepstral-time matrices. In Fig. 15 the cepstral-time matrix has been produced from a distortion-free speech signal, and in Fig. 16 the same piece of speech has been corrupted by channel e. It can be seen that the first column of the cepstral-time matrix has been contaminated by the channel, but the inner matrix remains unaffected by the channel distortion.
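Why the contamination is confined to the first column can be checked numerically: a stationary convolutional channel adds the same log-spectral offset to every frame, so after the DCT along the time axis the offset lands entirely in the zeroth (time-DC) column. The following is a small numpy demonstration with illustrative random values, not the paper's data.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; rows are the DCT basis vectors."""
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * k * (m + 0.5) / n)
    C[0] *= np.sqrt(0.5)
    return C

rng = np.random.default_rng(0)
clean = rng.random((15, 4))        # log filterbank block: 15 channels x 4 frames
channel = rng.random((15, 1))      # stationary channel: same log offset in every frame
distorted = clean + channel        # convolution is additive in the log-spectral domain

C_n, C_m = dct_matrix(15), dct_matrix(4)
ct_clean = C_n @ clean @ C_m.T
ct_dist = C_n @ distorted @ C_m.T
diff = ct_dist - ct_clean          # channel contribution to the cepstral-time matrix
```

Here `diff[:, 1:]` is zero to machine precision while `diff[:, 0]` carries the whole channel contribution, which is why discarding the first column yields channel-robust features.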

Fig. 15: 15 x 4 cepstral-time matrix of undistorted speech signal

Fig. 16: 15 x 4 cepstral-time matrix of distorted speech signal

6 Conclusions

In this paper, two approaches for the recognition of channel-distorted speech have been investigated. The first approach is based on blind deconvolution given the channel output signal and HMMs of the channel input signal. Various maximum likelihood channel estimates have been proposed, which offer significant improvement in recognition performance. In particular, a hypothesised-input deconvolution method, based on using each HMM to provide a different channel estimate, improves recognition accuracy very significantly. The second part of this paper investigated the use of features which are robust to channel distortion. It has been shown that the cepstral-time matrix, with the first column omitted, is robust to channel distortion, and experimental results show that cepstral-time matrix features achieve very good performance. The cepstral-time matrix has the additional advantage that no computation of the channel, and its subsequent removal, is required, as the matrices themselves are channel robust.

7 Acknowledgments

The authors would like to acknowledge the support of Mr Simon Ringland of BT Laboratories and also the useful comments of the reviewers of this paper.


8 References

1 STOCKHAM, T.G., CANNON, T.M., and INGEBRETSEN, R.B.: 'Blind deconvolution through digital signal processing', Proc. IEEE, 1975, 63, (4), pp. 678-692

2 SPENCER, P.S.: 'Separation of stationary and time-varying systems and its applications to the restoration of gramophone recordings'. PhD thesis, Cambridge University Engineering Department, Cambridge, UK, 1990

3 LUCKY, R.W.: 'Techniques for adaptive equalisation of digital communication systems', Bell Syst. Tech. J., 1965, 45, pp. 255-286

4 BENVENISTE, A., GOURSAT, M., and RUGET, G.: 'Robust identification of a nonminimum phase system: blind adjustment of linear equaliser in data communications', IEEE Trans., 1980, AC-25, pp. 385-399

5 QURESHI, S.U.H.: 'Adaptive equalisation', Proc. IEEE, 1985, 73, (9), pp. 1349-1387

6 BELLINI, S.: 'Bussgang techniques for blind equalisation'. IEEE Globecom conference records, 1986, pp. 1634-1640

7 BELLINI, S., and ROCCA, F.: 'Near optimal blind deconvolution'. IEEE international conference on Acoustics, speech, and signal processing, 1988, pp. 2236-2239

8 MENDEL, J.M.: 'Tutorial on higher order statistics (spectra) in signal processing and system theory: theoretical results and some applications', Proc. IEEE, 1991, 79, pp. 278-305

9 NIKIAS, C.L., and CHIANG, H.H.: 'Higher-order spectrum estimation via non-causal autoregressive modeling and deconvolution', IEEE Trans., 1991, ASSP-36, pp. 1911-1913

10 NOWLAN, S.J., and HINTON, G.E.: 'A soft decision-directed algorithm for blind equalization', IEEE Trans., 1993, 41, (2), pp. 275-279

11 MOKBEL, C., MONNE, J., and JOUVET, D.: 'On-line adaptation of a speech recogniser to variations in telephone line conditions'. Proceedings of the 3rd European conference on Speech communication and technology, EuroSpeech-93, 1993, Vol. 2, pp. 1247-1250

12 WITTMANN, M., SCHMIDBAUER, O., and AKTAS, A.: 'Online channel compensation for robust speech recognition'. EuroSpeech-93, 1993, pp. 1251-1254

13 HERMANSKY, H., and MORGAN, N.: 'Towards handling the acoustic environment in spoken language processing'. International conference on Spoken language processing, 1992, pp. 85-88

14 HANSON, B.A., and APPLEBAUM, T.H.: 'Subband or cepstral domain filtering for recognition of Lombard and channel-distorted speech'. IEEE international conference on Acoustics, speech and signal processing, 1993, Vol. 2, pp. 79-82

15 VASEGHI, S.V.: 'Advanced signal processing and digital noise reduction' (John Wiley, 1996)

16 MILNER, B.P., and VASEGHI, S.V.: 'Comparison of noise compensation methods for speech recognition', IEE Proc. Vis. Image Signal Process., 1994, 141, (5), pp. 280-288

17 VARGA, A.P., STEENEKEN, H.J.M., TOMLINSON, M., and JONES, D.: 'The Noisex-92 study on the effect of additive noise on automatic speech recognition'. Technical report, DRA Speech Research Unit