Feature Combination Using Multiple Spectral Cues for Robust Speech Recognition in Mobile Communications

Djamel Addou (1), Sid-Ahmed Selouani (2), Malika Boudraa (1), Bachir Boudraa (1)

(1) Speech & Signal Processing Lab., USTHB University of Science & Technology, Algiers, Algeria
(2) LARIHS Lab., Université de Moncton, Shippagan campus, New Brunswick, Canada

Abstract

This paper investigates a new front-end processing approach that aims at improving the performance of speech recognition in noisy mobile environments. This approach combines features based on conventional Mel-frequency cepstral coefficients (MFCCs) and line spectral frequencies (LSFs) to constitute robust multivariate feature vectors. The proposed front-end constitutes an alternative to the DSR-XAFE (XAFE: eXtended Audio Front-End) available in GSM mobile communications. Our results showed that, for highly noisy speech, the paradigm that combines LSFs with MFCCs leads to a significant improvement in recognition accuracy on the Aurora 2 task.

Key Words: Distributed Speech Recognition, GSM, Line Spectral Frequencies, Noisy mobile communications.

1. Introduction

Speech transmitted over mobile channels can significantly degrade the performance of speech recognizers compared to the unmodified signal [1]. This is due to low bit-rate speech coding as well as channel transmission errors. A solution to this problem is the Distributed Speech Recognition (DSR) concept initiated by the European Telecommunications Standards Institute (ETSI) through the AURORA project [1]. The speech recognition process is distributed between the terminal and the network. As shown in Figure 1, the process of extracting features from speech signals, also called the front-end process, is implemented on the terminal, and the extracted features are transmitted over a data channel to a remote back-end recognizer where the remaining parts of the recognition process take place. In this way, the transmission channel does not affect the recognition system performance.

The AURORA project provided a standardization of the speech recognition front-end. In the context of worldwide standardization, a consortium was created to constitute the 3rd Generation Partnership Project (3GPP). This consortium recommended the use of the eXtended Audio Front-End (XAFE) as the coder-decoder (codec) for vocal commands. The ETSI codec is mainly based on Mel-frequency cepstral coefficients (MFCCs). In this paper, we investigate the performance of a new codec that could constitute an alternative to the present ETSI DSR-XAFE codec in severely degraded mobile environments. It is based on a multi-stream paradigm using a multivariate acoustic analysis combining line spectral frequencies (LSFs) and MFCCs. The proposed system is compatible with the 3GPP and 3GPP2 standards, for European (GSM) and North American (CDMA) mobile systems respectively.

The outline of this paper is as follows. In Section 2, we give an overview of robust techniques used in front-end processing. Section 3 describes the LSF extraction procedure. In Section 4, we describe the statistical framework of the multi-stream paradigm. Then, in Section 5, we describe the database and the parameters of our experiments, and evaluate our proposed approach for DSR. Finally, in Section 6, we conclude and discuss our results.

Figure 1. Simple block diagram of a DSR system: the terminal front-end extracts features from the speech signal and sends them over a GSM/GPRS/CDMA data network to a back-end server that performs speech recognition (and, optionally, speech reconstruction).

2. Robust front-end processing

Extraction of reliable parameters remains one of the most important issues in automatic speech recognition (ASR). This parameterization process serves to maintain the relevant part of the information within a speech signal while eliminating the irrelevant part for the ASR process.

A wide range of possibilities exists for parametrically representing the speech signal. The cepstrum is one popular choice, but it is not the only one. When the speech spectrum is modeled by an all-pole spectrum, many other parametric representations are possible, such as the set of p coefficients α_i obtained using Linear Predictive Coding (LPC) analysis and the set of line spectral frequencies (LSFs). The latter possesses properties similar to those of the formant frequencies and bandwidths, based upon the LPC inverse filter. Another important transformation of the predictor coefficients is the set of partial correlation (reflection) coefficients.

In previous papers [2-4], we introduced a multi-stream paradigm for ASR in which we merge different sources of information about the speech signal that could be lost when using only the MFCCs to recognize uttered speech. Our experiments in [2] showed that the use of some auditory-based features and formant cues via a multi-stream paradigm leads to an improvement in recognition performance. This proved that the MFCCs lose some information relevant to the recognition process, despite the popularity of such coefficients in current ASR systems. In those experiments, we used a 3-stream feature vector: the first stream consists of the classical MFCCs and their first derivatives, the second stream consists of acoustic cues derived from hearing phenomena studies, and the third stream contains the magnitudes of the main resonances of the spectrum of the speech signal. The above-mentioned work was extended in [3] by using the formant frequencies instead of their magnitudes within the same multi-stream paradigm. In those experiments, recognition is performed using a 3-stream feature vector in which the formant frequencies of the speech signal, obtained through an LPC analysis, form the third stream, combined with the auditory-based acoustic distinctive features and the MFCCs. The results obtained in [3] showed that the use of the formant frequencies for ASR in a multi-stream paradigm improves ASR performance. Then, in [4], we extended this work to evaluate the robustness of the above-mentioned features using a multi-stream paradigm for ASR in noisy car environments. The results showed that the use of such features renders the recognition process more robust in noisy car environments.

In the present work, we investigate the potential of a multi-stream front-end using LSFs and MFCCs to improve the robustness of a DSR system. In DSR systems, the feature extraction process takes place on a mobile handset with limited processing power, and only a certain amount of bandwidth is available to each user for sending data. Among the features mentioned above, formant-like features and LSFs are the most suitable for this application, because they can be extracted as part of the MFCC extraction process, which saves considerable computation. However, due to their inability to provide information about all parts of speech, such as silence and weak fricatives, formants have not been widely adopted. In [5], it was shown that the shortcomings of the formant representation can be compensated to some extent by combining formants with features that carry signal level and general spectrum information, such as cepstral features. In the approach presented in this paper, we propose to include the LSF parameters besides the MFCCs.

3. Line spectral frequency cues

Line spectral frequencies (LSFs) were introduced by Itakura [6]. They have been proven to possess a number of advantageous properties, such as sequential ordering, a bounded range, and easy stability verification [6, 7]. In addition, the frequency-domain representation of LSFs makes it easier to incorporate properties of the human perception system. The LSFs were extracted by converting the LPC parameters to LSFs according to ITU-T Recommendation G.723.1 [8]. In LPC analysis, the mean squared error between the actual speech samples and the linearly predicted ones is minimized over a finite interval in order to provide a unique set of predictor coefficients. The transfer function of the LPC filter is given by:

$$H(z) = \frac{G}{1 + \sum_{k=1}^{P} a_k z^{-k}} \qquad (1)$$

where G is the gain, P the prediction order, and a_k the LPC filter coefficients. The poles of this transfer function include the poles of the vocal tract as well as those of the voice source. Solving for the roots of the denominator of the transfer function gives both the formant frequencies and the poles corresponding to the voice source. From H(z), two transfer functions, P_{p+1}(z) and Q_{p+1}(z), respectively called the sum and difference polynomials, can be derived. The sum polynomial is given by:

$$P_{p+1}(z) = A_p(z) + z^{-(p+1)} A_p(z^{-1}) \qquad (2)$$


and the difference polynomial is given by:

$$Q_{p+1}(z) = A_p(z) - z^{-(p+1)} A_p(z^{-1}) \qquad (3)$$

where A_p(z) is the denominator of H(z). For even values of p, these polynomials contain trivial zeros at z = -1 (for P_{p+1}) and at z = +1 (for Q_{p+1}). These roots can be removed in order to derive the following quantities:

$$\hat{P}(z) = \frac{P_{p+1}(z)}{1+z^{-1}} = \alpha_0 + \alpha_1 z^{-1} + \cdots + \alpha_p z^{-p} \qquad (4)$$

and,

$$\hat{Q}(z) = \frac{Q_{p+1}(z)}{1-z^{-1}} = \beta_0 + \beta_1 z^{-1} + \cdots + \beta_p z^{-p} \qquad (5)$$

The LSFs are the roots of P̂(z) and Q̂(z), and they alternate with each other on the unit circle. Note that P_{p+1}(z) is a symmetric polynomial and Q_{p+1}(z) is an antisymmetric polynomial. The polynomials P̂(z) and Q̂(z), derived from P_{p+1}(z) and Q_{p+1}(z), are both symmetric. Therefore, for even values of p we have the following property:

$$\alpha_i = \alpha_{p-i}, \qquad 0 \le i \le p/2 \qquad (6)$$

hence (4) and (5) can be written as follows:

$$\hat{P}(z) = z^{-p/2}\left[\alpha_0\left(z^{p/2}+z^{-p/2}\right) + \alpha_1\left(z^{p/2-1}+z^{-(p/2-1)}\right) + \cdots + \alpha_{p/2}\right] \qquad (7)$$

and,

$$\hat{Q}(z) = z^{-p/2}\left[\beta_0\left(z^{p/2}+z^{-p/2}\right) + \beta_1\left(z^{p/2-1}+z^{-(p/2-1)}\right) + \cdots + \beta_{p/2}\right] \qquad (8)$$

By putting z = e^{jω}, so that z + z^{-1} = 2cos(ω), we obtain the equations to be solved in order to find the LSFs according to the real-root method of ITU-T Recommendation G.723.1:

$$e^{j\omega p/2}\,\hat{P}(e^{j\omega}) = 2\left[\alpha_0\cos\left(\tfrac{p}{2}\,\omega\right) + \alpha_1\cos\left(\tfrac{p-2}{2}\,\omega\right) + \cdots + \tfrac{1}{2}\,\alpha_{p/2}\right] \qquad (9)$$

and,

$$e^{j\omega p/2}\,\hat{Q}(e^{j\omega}) = 2\left[\beta_0\cos\left(\tfrac{p}{2}\,\omega\right) + \beta_1\cos\left(\tfrac{p-2}{2}\,\omega\right) + \cdots + \tfrac{1}{2}\,\beta_{p/2}\right] \qquad (10)$$

Input speech is divided into frames, and each frame is further subdivided into 4 subframes. The LPC analysis is then performed on a subframe basis. The p LPC coefficients are transformed into the p corresponding LSFs. This transformation is done for the last subframe; the LSFs of the other three subframes are calculated by linear interpolation between the LSFs of the current and previous frames. For the root search, the unit circle is divided into 512 equal intervals, each of length π/256. The roots (LSFs) of the P̂(z) and Q̂(z) polynomials are searched along the unit circle from 0 to π. A linear interpolation is performed on intervals where a sign change is observed in order to find the zeros of the polynomials. According to [8], if a sign change appears between intervals l and l - 1, a first-order interpolation is carried out as follows:

$$\hat{l} = l - \frac{|P(z)_l|}{|P(z)_l| + |P(z)_{l-1}|} \qquad (11)$$

where l̂ is the interpolated solution index and |P(z)_l| is the absolute magnitude of the sum polynomial evaluated at interval l (and similarly for l - 1). Since the LSFs interlace in the region from 0 to π, only one zero of P̂(z) is evaluated at each step. The search for the next solution is then performed by evaluating the difference polynomial Q̂(z), starting from the current solution.
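To make the conversion concrete, the following NumPy sketch follows eqs. (2)-(11): it builds the sum and difference polynomials, deflates the trivial roots, evaluates the cosine forms on a grid of width π/256, and locates the alternating zeros by sign change and first-order interpolation. This is a simplified illustration, not the fixed-point G.723.1 reference implementation; the function name and the coarse restart of the search at the grid point following each root are our own choices.

```python
import numpy as np

def lpc_to_lsf(a, grid_points=256):
    """Convert LPC coefficients [a_1, ..., a_p] of
    A(z) = 1 + a_1 z^-1 + ... + a_p z^-p into p LSFs in (0, pi).
    p is assumed even, as in the derivation above."""
    a = np.asarray(a, dtype=float)
    p = len(a)
    assert p % 2 == 0, "even prediction order assumed"
    step = np.pi / grid_points                     # pi/256 for the default grid

    # Eqs. (2)-(3): coefficients of the sum and difference polynomials.
    a_ext = np.concatenate(([1.0], a, [0.0]))      # a_0 = 1, pad a_{p+1} = 0
    P = a_ext + a_ext[::-1]                        # symmetric
    Q = a_ext - a_ext[::-1]                        # antisymmetric

    # Eqs. (4)-(5): deflate the trivial roots at z = -1 and z = +1.
    alpha = np.zeros(p + 1)
    beta = np.zeros(p + 1)
    alpha[0], beta[0] = P[0], Q[0]
    for i in range(1, p + 1):
        alpha[i] = P[i] - alpha[i - 1]             # division by (1 + z^-1)
        beta[i] = Q[i] + beta[i - 1]               # division by (1 - z^-1)

    def cos_form(c, w):
        # Eqs. (9)-(10): 2 [ c_0 cos(m w) + ... + c_{m-1} cos(w) + c_m / 2 ]
        m = (len(c) - 1) // 2
        return 2.0 * (sum(c[i] * np.cos((m - i) * w) for i in range(m))
                      + 0.5 * c[m])

    lsf, use_beta = [], False                      # the lowest root belongs to P-hat
    prev = cos_form(alpha, 0.0)
    for l in range(1, grid_points + 1):
        cur = cos_form(beta if use_beta else alpha, l * step)
        if prev * cur < 0.0:                       # sign change in interval l
            # Eq. (11): first-order interpolation of the zero crossing.
            l_hat = l - abs(cur) / (abs(cur) + abs(prev))
            lsf.append(l_hat * step)
            use_beta = not use_beta                # the roots alternate P-hat / Q-hat
            cur = cos_form(beta if use_beta else alpha, l * step)
        prev = cur
        if len(lsf) == p:
            break
    return np.array(lsf)
```

For a 12-pole filter this yields the 12 LSFs used later in the proposed front-end. A stable filter is assumed, since stability is what guarantees that the roots interlace on the unit circle.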

LSFs are considered to be representative of the underlying phonetic knowledge of speech and are expected to be relatively robust in the particular case of ASR in noisy or band-limited environments. Two main reasons motivated our choice of the LSFs for noisy mobile communications. The first is that the spectral regions characterized by the LSFs may stay above the noise level even at very low signal-to-noise ratios (SNRs), while the lower-energy regions tend to be masked by the noise energy. The second is that LSFs are widely used in conventional coding schemes, which avoids introducing new parameters that might require important and costly modifications to current devices and codecs.


4. Multi-stream statistical framework

In this approach, multiple acoustic feature streams obtained from different sources are concatenated to form a multi-stream feature set, which is then used to train multi-stream HMMs. Consider S information sources that provide time-synchronous observation vectors O_{st}, where s = 1, ..., S indicates the information source and t the time index. The dimensionality of the observation vectors can vary from one source to another. Each observation vector time sequence provides information about a sequence of hidden states j. In a multi-stream system, instead of generating S state sequences from S observation sequences, only one state sequence is generated. This is done by introducing a new output distribution function for the states. The output distribution of state j is defined as:

$$b_j(O_t) = \prod_{s=1}^{S}\left[\,b_{js}(O_{st})\,\right]^{\gamma_{js}} \qquad (12)$$

It can be seen that the output distributions of the multiple observation vectors are merged to form a single output distribution for state j. The exponent γ_{js} specifies the contribution of each stream to the overall distribution by scaling its output distribution. The value of γ_{js} is normally assumed to satisfy the constraints:

$$0 \le \gamma_{js} \le 1 \quad\text{and}\quad \sum_{s=1}^{S}\gamma_{js} = 1 \qquad (13)$$

In HMMs, Gaussian mixture models are used to represent the output distribution of states. The probability of vector O_t at time instance t in state j can be determined from the following formula:

$$b_j(O_t) = \prod_{s=1}^{S}\left[\sum_{m=1}^{M} c_{jsm}\, N(O_{st};\, \mu_{jsm}, \Sigma_{jsm})\right]^{\gamma_{js}} \qquad (14)$$

where M is the number of mixture components and c_{jsm} is the m-th mixture weight of state j for source s. N denotes a multivariate Gaussian with mean vector µ_{jsm} and covariance matrix Σ_{jsm}:

$$N(o_{st};\, \mu_{jsm}, \Sigma_{jsm}) = \frac{1}{(2\pi)^{n/2}\,|\Sigma_{jsm}|^{1/2}} \exp\left(-\tfrac{1}{2}\,(o_{st}-\mu_{jsm})'\,\Sigma_{jsm}^{-1}\,(o_{st}-\mu_{jsm})\right) \qquad (15)$$

The choice of the exponents plays an important role: the performance of the system is significantly affected by the values of γ. Recently, exponent training has received attention, and the search for an efficient method is still ongoing. Most of the exponent training techniques in the literature have been developed in the logarithmic domain [9]. By taking the log of the distribution function, the exponents appear as scale factors of the log terms:

$$\log b_j^{\gamma}(O_t) = \gamma \log b_j(O_t) \qquad (16)$$

The multi-stream HMMs presented in this work have three streams, and therefore S equals three. Obtaining an estimate of the exponent parameters is a difficult task: due to the very nature of the signal, there is no predictable underlying structure we can exploit, and we expect γ_s to be a function of the SNR. In our experiments, all states are assumed to have three streams. The first two streams are assigned to the MFCCs and their first derivatives, and the third stream is dedicated to the LSF features. Equal weights are assigned to the LSFs relative to the MFCCs with respect to eq. (14). It should be noted that, in order to avoid complexity, the stream exponents are shared across all states of all models.

5. Experiments & results

5.1. AURORA Database

In our experiments, the AURORA database was used. It is a noisy speech database released by the Evaluations and Language resources Distribution Agency (ELDA) for the purpose of performance evaluation of DSR systems under noisy conditions. The source speech for this database is TIDigits, downsampled from 20 kHz to 8 kHz, and consists of a connected-digits task spoken by American English talkers. The AURORA training set, selected from the training part of TIDigits, includes 8440 utterances from 55 male and 55 female adult speakers, filtered with the G.712 (GSM standard) characteristic [10]. Three test sets (A, B and C) from 55 male and 55 female adults, collected from the testing part of TIDigits, form the AURORA testing set. Each set includes subsets with 1001 utterances. One noise signal is artificially added to every subset at SNRs ranging from 20 dB to -5 dB in decreasing steps of 5 dB. The experiments presented in this paper are based on test set A and two subsets with two different noises, namely babble and car. Both the noises and the speech signals are filtered with the G.712 characteristic. In total, this set consists of 2 × 7 × 1001 = 14014 utterances.
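The noise-mixing step can be illustrated with a small sketch. This is a generic illustration under the usual average-power definition of SNR; the actual AURORA tooling applies the G.712 filtering and its own SNR measurement.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` and add it to `speech` at the target SNR (in dB)."""
    noise = noise[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```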

5.2. Training & Recognition Parameters

In the AURORA project, whole-word HMMs were used to model the digits. Each word model consists of 16 states with three Gaussian mixtures per state.


Two silence models were also considered. One has a relatively long duration, modeling the pauses before and after the utterances, with three states and six Gaussian mixtures per state. The other is a single-state HMM tied to the middle state of the first silence model, representing the short pauses between words. In DSR-XAFE, 14 coefficients, including the log-energy coefficient and the 13 cepstral coefficients, are extracted from 25 ms frames with a 10 ms frame shift. However, the first cepstral coefficient and the log-energy coefficient provide similar information and, depending on the application, using one of them is sufficient.

5.3. Tests & Results

In our experiments, the baseline system is defined over 39-dimensional observation vectors that consist of 12 cepstral coefficients and the log-energy coefficient, plus the corresponding delta and acceleration vectors; it is denoted MFCC-E-D-A. The front-end presented in the ETSI DSR-XAFE standard was used throughout our experiments to extract the 12 cepstral coefficients (without the zeroth coefficient) and the logarithmic frame energy. Training and recognition were carried out with the HMM-based toolkit HTK [11]. In some cases, the HTK toolkit automatically divides the feature vector into multiple equally-weighted streams, in a way similar to the multi-stream paradigm. This separation is motivated by the lack of correlation between features: for example, it is known that the correlation between static and dynamic coefficients is small, so putting them in different streams, as if they were produced by independent sources, results in better statistical models. The feature vector used in the baseline experiment is automatically split into three streams by HTK in the following order: the first stream consists of the static coefficients plus the energy coefficient, the second stream includes the delta coefficients plus the delta energy, and the third stream contains the acceleration coefficients plus the acceleration energy.

In our proposed method, 12 cepstral coefficients and the logarithmic frame energy were extracted, and then a 12-pole LPC filter and the ITU search algorithm described in Section 3 were used to extract the LSFs. The MFCCs and their first derivatives were then combined with the LSFs to generate a multidimensional feature set; the resulting vector is referred to as MFCC-E-D-LSF. The multi-stream paradigm, through which the features are assigned to multiple streams with equal weights, was used to merge the features into the HMMs. In order to be consistent with the baseline, the MFCCs and their derivatives were put into the first and second streams respectively, and the third stream was reserved for the LSFs.

In order to evaluate the impact of the number of included LSFs, we carried out additional experiments where the third stream is composed of 10 or 12 LSF features. The resulting systems are denoted MFCC12-E-D-LSF10 and MFCC12-E-D-LSF12; their dimensions are 36 and 38 respectively. We also carried out experiments using only LSFs in three streams: the first stream is composed of 10 or 12 LSFs, the second stream contains their first derivatives, and the third stream is composed of their second derivatives. These vectors are denoted LSF10-D-A and LSF12-D-A when 10 and 12 LSFs are used respectively; their dimensions are 30 and 36.
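As an illustration of how such a multi-stream vector could be assembled, here is a sketch with stand-in data; HTK computes the deltas internally, and the helper names are our own.

```python
import numpy as np

def deltas(feat, win=2):
    """HTK-style delta regression over a +/- `win` frame window."""
    T = len(feat)
    denom = 2.0 * sum(k * k for k in range(1, win + 1))
    padded = np.pad(feat, ((win, win), (0, 0)), mode="edge")
    return sum(k * (padded[win + k:win + k + T] - padded[win - k:win - k + T])
               for k in range(1, win + 1)) / denom

# Stand-in per-frame features: 12 MFCCs + log-energy, and 10 LSFs.
mfcc_e = np.random.randn(100, 13)
lsf10 = np.random.randn(100, 10)

# MFCC12-E-D-LSF10: three equally weighted streams of 13 + 13 + 10 = 36 dims.
streams = [mfcc_e, deltas(mfcc_e), lsf10]
features = np.hstack(streams)
assert features.shape[1] == 36
```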

Table 1 indicates that, for babble noise, when the SNR decreases below 10 dB, the use of the LSF front-end with a 30-dimensional feature vector leads to a significant improvement in word recognition accuracy. This improvement can reach 6% relative to the word recognition accuracy obtained for the MFCC-based 39-dimensional feature vector (MFCC-E-D-A). When the SNR decreases below 5 dB, the multi-stream front-end performs better with fewer parameters. It should be noted that under high-SNR conditions, the 39-dimensional system performs better. These results suggest that it could be interesting to use the two front-ends concomitantly: the LSF front-end under severely degraded noise conditions, and the current DSR-XAFE under relatively noise-free conditions. In this case, an estimate of the SNR is required in order to switch from one front-end to the other.
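A sketch of this switching idea, assuming a hypothetical SNR estimator; this is our own illustration and not part of the evaluated system:

```python
def choose_front_end(estimated_snr_db, threshold_db=10.0):
    """Select a front-end from an estimated SNR (hypothetical helper).

    Below the threshold the LSF front-end is preferred; otherwise the
    standard DSR-XAFE (MFCC) front-end is kept.
    """
    return "LSF10-D-A" if estimated_snr_db < threshold_db else "MFCC-E-D-A"
```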


Table 1. Percentage word accuracy of multi-stream-based DSR systems trained with clean speech and tested on set A (babble noise) of the AURORA database.

System (dimension)         20 dB   15 dB   10 dB    5 dB    0 dB   -5 dB
MFCC-E-D-A (39)            90.15   73.76   49.43   26.81    9.28    1.57
LSF10-D-A (30)             73.94   62.58   49.15   29.99   13.97    7.65
LSF12-D-A (36)             70.59   60.55   46.52   28.17   12.61    6.95
MFCC12-E-D-LSF10 (36)      85.76   68.68   44.17   23.55   11.76    7.41
MFCC12-E-D-LSF12 (38)      86.46   68.71   44.89   23.76   12.09    7.80

6. Conclusion

We have proposed a new front-end for the ETSI DSR-XAFE codec. The results obtained from the experiments carried out on the Aurora 2 task showed that combining cepstral coefficients with the LSF features of a speech signal using the multi-stream paradigm leads to a recognition improvement for noisy speech. We also noticed that using only LSFs improved DSR performance compared to the recognition performance of the DSR-XAFE, which uses cepstral features alone; this improvement is most noticeable at very low SNRs. On the other hand, although extracting the new features may add some complexity to the front-end process, the use of a 30-dimensional feature vector instead of a 39-dimensional one saves time and computational power in the back-end process. We are currently continuing the effort towards the optimization of the stream weights with respect to the noise source, speaker gender and phonetic content of the speech.

7. References

[1] ETSI, "Speech processing, transmission and quality aspects (STQ); Distributed speech recognition; Front-end feature extraction algorithm; Compression algorithms", Technical report ETSI ES 201 108, 2003.

[2] H. Tolba, S.-A. Selouani and D. O'Shaughnessy, "Comparative experiments to evaluate the use of auditory-based acoustic distinctive features and formant cues for automatic speech recognition using a multi-stream paradigm", Proc. ICSLP'02, September 2002.

[3] H. Tolba, S.-A. Selouani and D. O'Shaughnessy, "Auditory-based acoustic distinctive features and spectral cues for automatic speech recognition using a multi-stream paradigm", Proc. IEEE ICASSP, pp. 837-840, Orlando, USA, 2002.

[4] S.-A. Selouani, H. Tolba and D. O'Shaughnessy, "Auditory-based acoustic distinctive features and spectral cues for robust automatic speech recognition in low-SNR car environments", Proc. Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp. 91-94, Edmonton, 2003.

[5] P. Garner and W. Holmes, "On the robust incorporation of formant features into Hidden Markov Models for automatic speech recognition", Proc. IEEE ICASSP, pp. 1-4, 1998.

[6] F. Itakura, "Line spectrum representation of linear predictive coefficients of speech signals", Journal of the Acoustical Society of America, vol. 57, no. 1, p. S35, 1975.

[7] F. Soong and B. Juang, "Line Spectrum Pair (LSP) and speech data compression", Proc. IEEE ICASSP, pp. 1-4, San Diego, 1984.

[8] ITU-T Recommendation G.723.1, "Dual rate speech coder for multimedia communications transmitting at 5.3 and 6.3 kbit/s", 1996.

[9] R. Rose and P. Momayyez, "Integration of multiple feature sets for reducing ambiguity in automatic speech recognition", Proc. IEEE ICASSP, pp. 325-328, 2007.

[10] ITU-T Recommendation G.712, "Transmission performance characteristics of pulse code modulation channels", Nov. 1996.

[11] S.J. Young et al., "HTK Version 3.4: Reference Manual and User Manual", Cambridge University Engineering Department Speech Group, 2006.
