

IEEE SIGNAL PROCESSING LETTERS, VOL. 20, NO. 11, NOVEMBER 2013

Noise-Adaptive LDA: A New Approach for Speech Recognition Under Observation Uncertainty

Dorothea Kolossa, Senior Member, IEEE, Steffen Zeiler, Rahim Saeidi, Member, IEEE, and Ramón Fernandez Astudillo, Member, IEEE

Abstract—Automatic speech recognition (ASR) performance suffers severely from non-stationary noise, precluding widespread use of ASR in natural environments. Recently, so-termed uncertainty-of-observation techniques have helped to recover good performance. These consider the clean speech features as a hidden variable, of which the observable features are only an imperfect estimate. An estimated error variance of the features is therefore used to further guide recognition. Based on the same idea, we introduce a new strategy: Reducing the speech feature dimensionality for optimal discriminance under observation uncertainty can yield significantly improved recognition performance, and is derived easily via Fisher's criterion of discriminant analysis.

Index Terms—ASR, LDA, noise adaptive, observation uncertainty.

I. INTRODUCTION

In many recent works on speech recognition, the system robustness is increased by considering the clean speech feature vectors not as known, but as hidden variables, which can only be estimated up to a given error variance or uncertainty. A set of so-called uncertainty-of-observation techniques has been derived to classify speech data under this observation uncertainty, which makes it possible to put more emphasis on clean, and less on corrupted, signal components [1].

Also, it is customary in ASR systems to apply a linear discriminant analysis (LDA) transform to the features to enhance class separability and reduce the data dimensionality [2]. In this way, the input data are mapped to a subspace in which the within-class variation is minimized and the between-class variability is maximized [3]. It has also been shown that LDA is beneficial in mismatched noisy conditions [4] and for mismatched channels [5].

Though extended versions of LDA like kernel LDA [6] have been proposed to increase the class separability in the transform domain, adapting the LDA transform online to noisy observations has not been considered in the literature. Rather, it is shown in [4] that after training the LDA transform on a specific noise type with a reference SNR, the performance of continuous speech recognition declines rapidly if the test data are corrupted with a different noise type or SNR. Motivated by this observation, we introduce an on-line method for adapting the LDA transform by employing the idea of observation uncertainties. To investigate the effect of the proposed technique, we use oracle uncertainties, so as to avoid the effect of uncertainty estimation errors in the analysis.

After an overview of conventional LDA, we will introduce the proposed approach of noise-adaptive LDA in Section II. The experimental setup and the results are presented in Sections III and IV. In Section V, strengths and weaknesses of the approach and its further potential for development are discussed, and final conclusions are drawn.

Manuscript received April 09, 2013; revised July 15, 2013; accepted August 09, 2013. Date of publication August 15, 2013; date of current version August 23, 2013. This work was supported in part by the Portuguese Foundation for Science and Technology under Grant SFRH/BPD/68428/2010 and Project PEst-OE/EEI/LA0021/2013, by the Ministry of Economic Affairs and Energy of the State of North Rhine-Westphalia under Grant IV.5-43-02/2-005-WFBO-009, and by the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant 238803. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Saeed Gazor.

D. Kolossa and S. Zeiler are with the Institute of Communication Acoustics, Ruhr-Universität Bochum, Germany (e-mail: [email protected]; [email protected]).

R. Saeidi is with the Radboud University Nijmegen, Netherlands (e-mail: [email protected]).

R. F. Astudillo is with the Spoken Language Systems Lab, INESC-ID, Lisbon, Portugal (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/LSP.2013.2278556

II. LINEAR DISCRIMINANT ANALYSIS

After a brief review of standard LDA, this section will introduce a model-dependent method for obtaining the LDA projection matrix. Based on this simple idea, we will show how to estimate and apply a noise-adaptive LDA (NALDA) transform on-line.

A. Classical LDA

Linear discriminant analysis [7] finds the most discriminative directions in a dataset for pattern classification. More specifically, LDA calculates a projection matrix $\mathbf{V}^T$, which, when applied to the $d$-dimensional features $\mathbf{x}$, maximizes the class separability. Given the covariance matrix

$$\boldsymbol{\Sigma}_j = \frac{1}{N_j}\sum_{\mathbf{x}_n \in j}\left(\mathbf{x}_n - \boldsymbol{\mu}_j\right)\left(\mathbf{x}_n - \boldsymbol{\mu}_j\right)^T$$

of the class $j$ with mean $\boldsymbol{\mu}_j$, the between-class covariance matrix

$$\boldsymbol{\Sigma}_b = \frac{1}{N}\sum_j N_j \left(\boldsymbol{\mu}_j - \boldsymbol{\mu}\right)\left(\boldsymbol{\mu}_j - \boldsymbol{\mu}\right)^T$$

with $\boldsymbol{\mu}$ as the global mean, and the within-class covariance matrix

$$\boldsymbol{\Sigma}_w = \frac{1}{N}\sum_j N_j \boldsymbol{\Sigma}_j, \qquad (1)$$

where $N$ is the number of features, and $N_j$ the number of features from class $j$, LDA finds the optimal projection matrix by maximizing the quotient of the determinants for the projected between- and within-class covariances

$$\mathbf{V}_{\mathrm{opt}} = \arg\max_{\mathbf{V}}\frac{\left|\mathbf{V}^T \boldsymbol{\Sigma}_b \mathbf{V}\right|}{\left|\mathbf{V}^T \boldsymbol{\Sigma}_w \mathbf{V}\right|}.$$


Fig. 1. Comparison of PCA vs. LDA on the left and LDA vs. NALDA for uncertain data on the right. In the left panel, the LDA projection gives much higher class separability and lower class overlap than the PCA projection. In the right panel, small ellipses instead of data points indicate the uncertainty in the data. For the LDA projection, class separability is degraded. By integrating the uncertainty into the calculations, the new noise-adaptive LDA can recover most of the class separability and finds a data projection with less class overlap.

This is achieved by computing the largest generalized eigenvectors of $\boldsymbol{\Sigma}_b$ and $\boldsymbol{\Sigma}_w$,

$$\boldsymbol{\Sigma}_b \mathbf{v}_i = \lambda_i \boldsymbol{\Sigma}_w \mathbf{v}_i,$$

which can be determined as the non-trivial solutions of the equality $\left|\boldsymbol{\Sigma}_b - \lambda \boldsymbol{\Sigma}_w\right| = 0$.

The larger the eigenvalue $\lambda_i$ of an eigenvector $\mathbf{v}_i$, the greater is its ratio of between- to within-class covariance. Therefore, by using only the transposed eigenvectors corresponding to the $D$ largest eigenvalues as the rows of the projection matrix $\mathbf{V}^T$, the data dimension is reduced, while the most discriminative directions are retained.

In contrast to this class-dependent projection, principal component analysis (PCA) uses that projection matrix which maximizes the determinant of the projected total covariance matrix, $\left|\mathbf{V}^T\boldsymbol{\Sigma}_t\mathbf{V}\right|$, with the total covariance

$$\boldsymbol{\Sigma}_t = \frac{1}{N}\sum_{n=1}^{N}\left(\mathbf{x}_n - \boldsymbol{\mu}\right)\left(\mathbf{x}_n - \boldsymbol{\mu}\right)^T,$$

subject to the constraint $\mathbf{V}^T\mathbf{V} = \mathbf{I}$. The projection matrix $\mathbf{V}^T_{\mathrm{PCA}}$ can again be found by solving a generalized eigenvalue problem, $\boldsymbol{\Sigma}_t \mathbf{v}_i = \lambda_i \mathbf{I}\,\mathbf{v}_i$, with $\boldsymbol{\Sigma}_t$ in place of $\boldsymbol{\Sigma}_b$ and the identity matrix $\mathbf{I}$ in place of $\boldsymbol{\Sigma}_w$.

The left-hand side of Fig. 1 illustrates the difference of PCA and LDA outcomes, showing the first eigenvector for 2D features and the resulting distribution of the projected 1D features. As can be seen, since PCA maximizes the variance regardless of the class labels, the projected distributions of both classes are less separable in the presented case, whereas an LDA projection leads to easily separable class distributions.

B. Hidden-Markov-Model-Based LDA

In (1), the within-class covariance matrix $\boldsymbol{\Sigma}_w$ is computed using class identity information. In the context of automatic speech recognition, this means that phonetically labeled data is needed. Since such labels are not typically available, in the following, we consider each state $q_j$ in the ASR hidden Markov models (HMMs) as an implicit class $j$. $\boldsymbol{\Sigma}_w$ is calculated as a weighted sum of the covariances $\boldsymbol{\Sigma}_j$ of the single-mixture, full-covariance Gaussian state output distributions

$$\boldsymbol{\Sigma}_w = \frac{1}{\sum_j \gamma_j}\sum_j \gamma_j \boldsymbol{\Sigma}_j. \qquad (2)$$

Similarly, the between-class covariance is estimated by

$$\boldsymbol{\Sigma}_b = \frac{1}{\sum_j \gamma_j}\sum_j \gamma_j \left(\boldsymbol{\mu}_j - \boldsymbol{\mu}\right)\left(\boldsymbol{\mu}_j - \boldsymbol{\mu}\right)^T \qquad (3)$$

with $\gamma_j$ as the state occupancy count in the final HMM re-estimation, $\boldsymbol{\mu}_j$ as the mean vector of state $j$, and $\boldsymbol{\mu}$ as the occupancy-weighted overall mean.

This process of hidden-Markov-model-based LDA is different from the standard one given in [2], in that it only utilizes the parameters of the HMM and does not need access to the original training data. This makes LDA a very fast procedure and allows its framewise application, which will be described below.
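As a rough sketch of this model-based computation (variable names and shapes are hypothetical, assuming stacked single-mixture state means, full covariances, and occupancy counts), (2) and (3) can be evaluated directly from the HMM parameters:

```python
import numpy as np

def hmm_lda_stats(state_means, state_covs, occupancies):
    """Sketch of (2) and (3): within- and between-class covariances computed
    from HMM state parameters alone.
    state_means: (J, d), state_covs: (J, d, d), occupancies: (J,)."""
    w = np.asarray(occupancies, dtype=float)
    w /= w.sum()                                      # normalize occupancy counts
    mu = w @ state_means                              # occupancy-weighted global mean
    Sw = np.einsum('j,jkl->kl', w, state_covs)        # eq. (2)
    diffs = state_means - mu
    Sb = np.einsum('j,jk,jl->kl', w, diffs, diffs)    # eq. (3)
    return Sw, Sb
```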

C. Noise-Adaptive LDA

In noisy or reverberant environments, the speech signal is often preprocessed before applying ASR. Despite this, the resulting features $\tilde{\mathbf{x}}(t)$ at time frame $t$ still contain residual distortions. Assuming that preprocessing eliminates any bias in the signal, a possible mitigation strategy is to model the residual error $\mathbf{e}(t)$ as class-independent, additive and zero-mean with time-varying covariance $\boldsymbol{\Sigma}_e(t)$, so

$$\tilde{\mathbf{x}}(t) = \mathbf{x}_j(t) + \mathbf{e}(t)$$

with $\mathbf{e}(t) \sim \mathcal{N}\!\left(\mathbf{0}, \boldsymbol{\Sigma}_e(t)\right)$ and $j$ as the class index.

Under these conditions we can find an LDA matrix that is optimal for the preprocessed data at each time frame $t$. For this purpose, dropping the frame index $t$ for brevity, we determine the within-class covariance matrices of the noisy data via

$$\tilde{\boldsymbol{\Sigma}}_j = \mathrm{E}\!\left[\left(\tilde{\mathbf{x}} - \boldsymbol{\mu}_j\right)\left(\tilde{\mathbf{x}} - \boldsymbol{\mu}_j\right)^T\right]$$

with $\boldsymbol{\mu}_j$ as the mean vector of clean data in class $j$. Therefore,

$$\tilde{\boldsymbol{\Sigma}}_j = \mathrm{E}\!\left[\left(\mathbf{x}_j + \mathbf{e} - \boldsymbol{\mu}_j\right)\left(\mathbf{x}_j + \mathbf{e} - \boldsymbol{\mu}_j\right)^T\right] = \boldsymbol{\Sigma}_j + \boldsymbol{\Sigma}_e = \boldsymbol{\Sigma}_w + \boldsymbol{\Sigma}_e(t).$$

In the last step, we have used the assumption of equal within-class covariances $\boldsymbol{\Sigma}_j = \boldsymbol{\Sigma}_w$ already implicit in the application of homoscedastic LDA, and we have re-inserted the time dependence of the noise statistics. Thus, for computing an LDA that optimally separates the preprocessed data, we can still use class-independent covariance matrices $\boldsymbol{\Sigma}_w + \boldsymbol{\Sigma}_e(t)$.

Using the new covariance matrices, we can obtain the noise-optimal LDA projection matrix $\mathbf{V}_u^T(t)$ in the usual way, so its $i$-th row will be the $i$-th largest transposed generalized eigenvector $\mathbf{v}_i^T(t)$ fulfilling

$$\boldsymbol{\Sigma}_b \mathbf{v}_i(t) = \lambda_i(t)\left(\boldsymbol{\Sigma}_w + \boldsymbol{\Sigma}_e(t)\right)\mathbf{v}_i(t).$$

Here, one can use the between-class covariance $\boldsymbol{\Sigma}_b$ given by (3), as this is unaffected by zero-mean, class-independent noise. We choose the $D$ largest eigenvectors to form

$$\mathbf{V}_u^T(t) = \left[\mathbf{v}_1(t), \ldots, \mathbf{v}_D(t)\right]^T. \qquad (4)$$

The right-hand side of Fig. 1 shows the effect of this transform on the uncertain data, which is indicated by ellipses corresponding to $\boldsymbol{\Sigma}_e(t)$ for one specific $t$. The data histogram is plotted after projection onto the first eigenvector of standard and uncertain LDA, $\mathbf{v}_1$ and $\mathbf{v}_{u,1}(t)$, respectively. It can be seen that a smaller class overlap results when using the uncertainty in calculating the projection matrix.
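A per-frame sketch of this adaptation, reusing the model-based statistics from above (again with hypothetical names): the only change with respect to standard LDA is that the frame's uncertainty is added to the within-class covariance before the generalized eigenproblem is solved.

```python
import numpy as np
from scipy.linalg import eigh

def nalda_projection(Sb, Sw, Sigma_e_t, D):
    """Sketch of eq. (4): frame-dependent projection V_u^T(t) obtained from
    Sigma_b, Sigma_w and the current uncertainty Sigma_e(t)."""
    # Sigma_b v = lambda (Sigma_w + Sigma_e(t)) v
    eigvals, eigvecs = eigh(Sb, Sw + Sigma_e_t)
    order = np.argsort(eigvals)[::-1][:D]
    return eigvecs[:, order].T                  # rows = D leading eigenvectors
```

With Sigma_e_t set to zero, this reduces to the model-based LDA of Section II-B, so the projection only deviates from standard LDA where the observations are actually uncertain.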

D. Usage

To use the noise-adaptive LDA for recognition, we also need corresponding models: Once the input feature vectors have been mapped to the most discriminative subspace by

$$\mathbf{y}(t) = \mathbf{V}_u^T(t)\,\tilde{\mathbf{x}}(t), \qquad (5)$$

it becomes necessary to consider the consequential change in the HMM output distributions. This poses no significant problem, though, as the LDA matrix $\mathbf{V}_u^T(t)$ can be used to transform a pre-trained standard Gaussian into a new Gaussian suitable for the linearly transformed data. For this purpose, we train a full-covariance single-mixture Gaussian HMM system on multicondition data. Then, during decoding, at each frame it is only necessary to transform the mean $\boldsymbol{\mu}_j$ for each state by

$$\hat{\boldsymbol{\mu}}_j(t) = \mathbf{V}_u^T(t)\,\boldsymbol{\mu}_j. \qquad (6)$$

Likewise, each covariance matrix is updated by

$$\hat{\boldsymbol{\Sigma}}_j(t) = \mathbf{V}_u^T(t)\,\boldsymbol{\Sigma}_j\,\mathbf{V}_u(t). \qquad (7)$$

The newly suggested method thus transforms both the features and the Gaussian model parameters, applying the time-dependent LDA matrix $\mathbf{V}_u^T(t)$ according to (5)–(7). In contrast, the uncertainty-of-observation technique of uncertainty decoding [8], which we are comparing to in the evaluation, leaves the feature vector $\tilde{\mathbf{x}}(t)$ and the model means $\boldsymbol{\mu}_j$ unchanged, and only transforms the covariance matrices, using $\hat{\boldsymbol{\Sigma}}_j(t) = \boldsymbol{\Sigma}_j + \boldsymbol{\Sigma}_e(t)$.
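The following sketch illustrates this decoding-time transformation for a single frame, assuming the projection matrix from (4) and stacked state parameters are available; the vectorized einsum formulation is merely one possible implementation.

```python
import numpy as np

def nalda_transform(V_t, x_t, state_means, state_covs):
    """Sketch of (5)-(7): project the frame feature and all state Gaussians
    with the frame-dependent matrix V_t = V_u^T(t).
    V_t: (D, d), x_t: (d,), state_means: (J, d), state_covs: (J, d, d)."""
    y_t = V_t @ x_t                                              # eq. (5)
    means_t = state_means @ V_t.T                                # eq. (6), all J states at once
    covs_t = np.einsum('ak,jkl,bl->jab', V_t, state_covs, V_t)   # eq. (7)
    return y_t, means_t, covs_t
```

Uncertainty decoding, in contrast, would leave x_t and state_means untouched and merely add Sigma_e(t) to each state covariance.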

III. EXPERIMENTAL SETUP

A. Database

Experiments were carried out using the CHiME multi-channel robust ASR task [9]. This task simulates human-machine interaction in home environments using a small set of commands. The training data is binaural and reverberant. The test data is also reverberant and corrupted with common household noises at SNRs of −6 dB, −3 dB, 0 dB, 3 dB, 6 dB and 9 dB. The noise sources have different directions of arrival, while the speaker is situated in front of the microphones.

B. Pre-Processing and Feature Extraction

As the multi-channel pre-processing stage, a delay-and-sum beamformer with a Wiener filter was used. The noise was estimated from a fixed blocking matrix, nulling the broadside direction [10]. An uncertainty-propagation-based MMSE-MFCC estimator [11] was used for the feature extraction. The applied principle of uncertainty propagation does not only result in an MMSE estimate of the features, $\hat{\mathbf{x}}(t)$, but can also yield an estimate of the observation uncertainties $\boldsymbol{\Sigma}_e(t)$. This will be valuable in future work, where estimated uncertainties will be needed rather than the oracle uncertainties described in the following section.

Cepstral mean subtraction was applied as a final pre-processing stage. The final feature vectors were 39-dimensional, composed of MFCCs and their first and second derivatives, and were computed using exactly the same method and parameters as in [11].

C. Recognition System

JASPER, a token-passing based HMM recognizer introduced in [12], was used for all experiments. JASPER has been optimized for recognition using uncertainty-of-observation techniques, and has been successfully evaluated in the context of the CHiME challenge, where it also outperformed an HTK system using MLLR [10] in a direct comparison on the dataset that is considered in the following. We therefore find JASPER to be an appropriate system for an initial evaluation of the noise-adaptive LDA approach.

For the initial evaluation, oracle uncertainty values were used for all uncertainty-of-observation techniques. This makes it possible to analyze the performance of the algorithms without uncertainty estimation errors. The oracle uncertainty can be computed in the feature domain via

$$\sigma_{e,k}^2(t) = \alpha\left(\hat{x}_k(t) - x_k(t)\right)^2$$

for each feature dimension $k$, with $t$ as the frame index and $x_k(t)$ the clean feature [13]. The values of $\alpha$ are usually tuned specifically for uncertainty decoding, and separately for every noise condition.

Using NALDA, the best results were obtained on the considered data set for $\alpha = 1$. This means that there is no need for tuning the uncertainty estimate, and the best uncertainty for decoding with NALDA is actually the theoretically correct value.
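A small sketch of this oracle computation, assuming aligned matrices of enhanced and clean features are at hand (names are hypothetical; the scaled squared difference follows the form cited from [13]):

```python
import numpy as np

def oracle_uncertainty(x_enhanced, x_clean, alpha=1.0):
    """Per-frame, per-dimension oracle error variances: alpha times the squared
    difference between enhanced and clean features.
    x_enhanced, x_clean: (T, d) arrays; the result can serve as the diagonal
    of Sigma_e(t) for every frame t."""
    return alpha * (x_enhanced - x_clean) ** 2
```

For NALDA, alpha = 1.0 was used throughout, so no condition-specific tuning of the oracle variance was necessary.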

IV. RESULTS

A. Recognition Performance

Table I shows the word recognition results¹ for the entire development set². The methods are:

• ML Diag: a regular maximum likelihood decoder that uses only the diagonal entries of the full-covariance models.

• ML Diag + LDA: same as above, but dimensionality reduction to $D$ dimensions with standard LDA is applied, cf. Section II-B.

• ML Full: the standard maximum likelihood decoder is applied, using the full covariance models.

• UD: uncertainty decoding using the diagonal entries of the full-covariance models is performed according to [8].

• NALDA: noise-adaptive LDA is used as in Section II-C, again using $\alpha = 1$. Transformed models and data are evaluated using a diagonal-covariance maximum likelihood decoder.

As can be seen, using the optimal (oracle) uncertainties, the suggested method outperforms standard decoding – with and without standard LDA – and uncertainty decoding by a wide margin. In contrast to the typical experience of diminishing returns for increasing SNRs, noise-adaptive LDA is even able to notably improve performance for the highest speech quality in the data set, giving a word error rate reduction of 33.8% relative to maximum likelihood decoding (ML Diag), and 27.7% relative to uncertainty decoding, at 9 dB SNR.

B. Computational Effort

The computational demands are clearly increased for the suggested approach, when comparing with standard LDA: At each time frame, it becomes necessary to do one generalized eigenvalue decomposition for finding the matrix $\mathbf{V}_u^T(t)$ in (4), whereas in standard LDA, the projection matrix is found only once, during training.

¹These were computed over the entire sentence, not only over the keywords, unlike for the CHiME challenge.

²It is not possible for oracle experiments to use the test set, because clean data is needed, which is not available for the CHiME test set.


TABLE I
WORD RECOGNITION ACCURACY FOR A RANGE OF APPROACHES, COMPUTED ON THE PASCAL-CHIME DEVELOPMENT SET, USING ORACLE UNCERTAINTY VALUES FOR UD AND NALDA

TABLE II
AVERAGE REAL-TIME FACTOR OF ALL CONSIDERED DECODING RULES

Also, in contrast to standard LDA, where only the feature vectors are transformed, so that only one additional matrix multiplication with $\mathbf{V}^T$ is needed per time frame, for NALDA, each state output distribution needs to be transformed using (6) and (7). This means that, relative to standard LDA, additional matrix multiplications with the transform matrix $\mathbf{V}_u^T(t)$ become necessary for each of the $M$ Gaussians in the system.

Overall, the additional effort for NALDA, compared to standard LDA, is of the order of $T\left(M\,c_M + c_E\right)$, where $T$ is the number of time frames, $c_M$ is the cost of a matrix multiplication and $c_E$ the cost of a generalized eigenvalue decomposition.

The resulting average real-time factor (RTF), i.e., the ratio of decoding time to signal duration, achieved on a standard PC³ running Windows 7 Enterprise (64 bit) and Matlab 7.9.0, is shown in Table II.

As expected, the new method is computationally significantly more expensive than standard decoding. The current real-time factor of noise-adaptive LDA is 5.5, which means that our straightforward implementation is not yet real-time capable.

However, experience with parallelizing ASR systems has shown that speedups by factors of 10 or more are achievable by exploiting the computational capabilities of graphics processing units [14], which has also been demonstrated for the JASPER system in [15]. We therefore believe that the suggested technique is easily real-time capable, once GPU capabilities are exploited – even more so, since the required additional operations are mostly vector and matrix multiplications, which are very amenable to parallelization owing to their regular structure. Therefore, real-time capability is definitely realistic on current-generation PCs with this new uncertainty-of-observation technique.

³Equipped with a Core i7 CPU clocked at 2.94 GHz, and with 4 GB of RAM.

V. DISCUSSION AND CONCLUSION

A new method for coping with uncertain observations in speech recognition has been derived and implemented. In its philosophy, it differs from many of the current ideas: In contrast to uncertainty decoding, it does not attempt to compute a better estimate of the clean-speech likelihood, but rather, it regards the noisy features from a discriminative standpoint, with the aim of using only those dimensions of the data which are still informative of the class membership. This idea directly leads to the approach of using noisy data only after a discriminative, noise-dependent, and hence time-dependent projection, which has shown great potential for improved accuracy in the above experiments.

Future work will focus on the accurate estimation of observation uncertainties for use with the presented method and on extending the approach to Gaussian mixture models, which is straightforward but has so far been neglected due to computational effort. Because of these computational constraints, and since the suggested algorithm is amenable to highly efficient implementation on many-core platforms such as GPUs due to its regular structure of operations, a further short-term goal lies in achieving real-time performance by maximizing algorithmic efficiency in a parallel-computing framework.

REFERENCES

[1] D. Kolossa and R. Haeb-Umbach, Robust Speech Recognition of Uncertain or Missing Data: Theory and Applications. Berlin, Germany: Springer, 2011.
[2] R. Haeb-Umbach and H. Ney, "Linear discriminant analysis for improved large vocabulary continuous speech recognition," in Proc. ICASSP, 1992, vol. 1, pp. 13–16.
[3] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer, 2006.
[4] O. Siohan, "On the robustness of linear discriminant analysis as a preprocessing step for noisy speech recognition," in Proc. ICASSP, 1995, vol. 1, pp. 125–128.
[5] T. Eisele, R. Haeb-Umbach, and D. Langmann, "A comparative study of linear feature transformation techniques for automatic speech recognition," in Proc. ICSLP, 1996, vol. 1, pp. 252–255.
[6] A. Kocsor and L. Toth, "Kernel-based feature extraction with a speech technology application," IEEE Trans. Signal Process., vol. 52, no. 8, pp. 2250–2263, 2004.
[7] R. Fisher, "The use of multiple measurements in taxonomic problems," Ann. Eugenics, vol. 7, pp. 179–188, 1936.
[8] L. Deng, J. Droppo, and A. Acero, "Dynamic compensation of HMM variances using the feature enhancement uncertainty computed from a parametric model of speech distortion," IEEE Trans. Speech Audio Process., vol. 13, no. 3, pp. 412–421, May 2005.
[9] J. P. Barker, E. Vincent, N. Ma, H. Christensen, and P. D. Green, "The PASCAL CHiME speech separation and recognition challenge," Comput. Speech Lang., vol. 27, 2013.
[10] R. F. Astudillo, D. Kolossa, A. Abad, S. Zeiler, R. Saeidi, P. Mowlaee, J. P. da Silva Neto, and R. Martin, "Integration of beamforming and uncertainty-of-observation techniques for robust ASR in multi-source environments," Comput. Speech Lang., vol. 27, 2013.
[11] R. F. Astudillo and R. Orglmeister, "A MMSE estimator in mel-cepstral domain for robust large vocabulary automatic speech recognition using uncertainty propagation," in Proc. Interspeech, 2010, pp. 713–716.
[12] D. Kolossa, S. Zeiler, A. Vorwerk, and R. Orglmeister, "Audiovisual speech recognition with missing or unreliable data," in Proc. AVSP, 2009.
[13] L. Deng, J. Droppo, and A. Acero, "Exploiting variances in robust feature extraction based on a parametric model of speech distortion," in Proc. ICSLP, 2002, pp. 2449–2452.
[14] K. You, J. Chong, Y. Yi, E. Gonina, C. J. Hughes, Y.-K. Chen, W. Sung, and K. Keutzer, "Parallel scalability in speech recognition," IEEE Signal Process. Mag., vol. 26, no. 6, pp. 124–135, Nov. 2009.
[15] D. Kolossa, J. Chong, S. Zeiler, and K. Keutzer, "Efficient manycore CHMM speech recognition for audiovisual and multistream data," in Proc. Interspeech, Makuhari, Japan, Sep. 2010, pp. 2698–2701.