DECORRELATION OF INPUT SIGNALS FOR STEREOPHONIC ACOUSTIC ECHO CANCELLATION USING THE CLASS OF PERCEPTUAL EQUIVALENCE

 Anis Ben Aicha and Sofia Ben Jebara

École Supérieure des Communications de Tunis, Research Unit TECHTRA

Route de Raoued 3.5 Km, Cité El Ghazala, Ariana, 2083, TUNISIA
anis ben [email protected], [email protected]

ABSTRACT

In communication systems using stereophonic signals, the high correlation of the input signals dramatically degrades the behavior of acoustic echo cancelers (AECs), in particular by increasing the misalignment between the estimated impulse responses and the true impulse responses of the receiving room. In this paper, we focus on decorrelating the input signals without any modification of their auditory quality. Using perceptual properties, we show that it is possible to find a set of signals which are perceptually equivalent to the input signals, in spite of their different spectral and temporal shapes. Hence, the 'class of perceptual equivalence' (CPE) is defined. It is an interval built in the frequency domain and limited by two bounds: the upper bound of perceptual equivalence (UBPE) and the lower bound of perceptual equivalence (LBPE). These two bounds are used as perceptually transparent nonlinear transformations of the input signals in order to decorrelate them. We show experimentally that the improvements yielded by this method are higher than those of classical nonlinear transformations.

1. INTRODUCTION

In a variety of speech communication systems, such as teleconferencing systems, acoustic echo cancelers are necessary to remove the undesirable echo that results from coupling between loudspeakers and microphones. Early systems considered the single-channel case. Novel applications such as hands-free communication, home entertainment and virtual reality contribute to the development of multi-channel signal processing techniques providing users with enhanced sound quality.

In this paper, we specifically consider the stereophonic case. Fig. 1 represents the conventional acoustic echo cancellation (AEC) scheme. The signals transmitted from the remote room, x1(n) and x2(n), are played back in the local room by two loudspeakers and picked up by two microphones. If nothing is done, the picked-up signal is retransmitted to the remote room, producing the undesired acoustic echo.

Conventional AECs seek to estimate the echo using adaptive finite impulse response filters (Ĥ1, Ĥ2) to model the acoustic impulse responses of the two echo paths (H1, H2) between the loudspeakers and one of the two microphones. Similar paths couple to the other microphone but, for simplicity of the scheme, they are not shown.

In contrast to the mono-channel case, the problem of stereophonic echo cancellation is much more difficult to solve. The reasons why the problem is difficult are well explained in [1] [2] and are mainly due to the strong correlation of the transmitted signals x1 and x2. Indeed, the two signals x1 and x2 are obtained from the common source s(n) by filtering it with the impulse responses of the remote room, G1 and G2.

Figure 1: Basic stereophonic acoustic echo canceler.

To improve AEC’s behaviors, many methods are proposed inliterature to decorrelate the transmitted signals x1 and x2. Wecan classify them into two categories according to the usageor not of the human auditory properties.

The first category groups together decorrelation techniques without perceptual considerations. In [2], independent low-level random noise is added to each channel in order to reduce the correlation between them. In [4], a channel-dependent signal, obtained by a nonlinear transformation of the channel itself, is added.

In the second category, human perceptual properties are taken into account: the signal added to each input signal should be quasi-white and lie under the input signal's masking threshold, rendering it inaudible. This technique turns out to be more efficient than non-perceptual decorrelating methods [5].

Instead of adding another signal, we propose in this paper to perceptually modify the input signals in order to decorrelate them without perceptible degradation. This ensures the perceptual transparency of the proposed transformation and improves the behavior of stereophonic AECs.

In previous works [6] [7], we demonstrated that it is possible, for each signal, to construct a class of perceptual equivalence (CPE). It is defined as an interval in which signals have the same auditory properties as the original one. This class is built in the spectral domain with perceptual rules and is limited by two boundaries: the lower boundary of perceptual equivalence (LBPE) and the upper boundary of perceptual equivalence (UBPE). This concept introduces a great degree of freedom in choosing signals exciting the adaptive filters which have the same perceptual quality as the original ones and are characterized by low cross-correlation. Hence, this paper aims at finding and justifying such signals.

The paper is organized as follows. Section 2 recalls the fundamental problem of stereophonic acoustic echo cancellation and gives an overview of nonlinear methods to decorrelate input signals. In Section 3, we construct the CPE using auditory properties of the human ear. In Section 4, we detail the proposed technique. In Section 5, we compare the proposed technique with two techniques proposed in [3] and [5]; the results show the improvement of the proposed AEC over these techniques. Then we conclude the paper.

2. UNIFYING SCHEME FOR INPUT SIGNALS DECORRELATION

It has been shown for stereo AEC that the strong correlation between the input signals leads the adaptive filters to converge to a solution that does not correctly model the transfer functions between the loudspeakers and the microphones. In fact, the cross-correlation matrix of the input signals plays a significant role in the convergence of the adaptive filters to the real impulse responses of the receiving room [2]. The better conditioned the cross-correlation matrix is, the better the behavior of the AEC [4]. If the cross-correlation matrix is ill conditioned, which is the case with highly correlated signals, the adaptive filters may converge to a non-optimal solution.

To make the cross-correlation matrix well conditioned, permitting the adaptive filters to converge to the real impulse responses of the local room with small bias, we obviously need to decorrelate the input signals.
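As a rough numerical illustration of this conditioning argument (not part of the original experiments), the sketch below estimates the condition number of the covariance matrix of the stacked stereo input vectors. The filter length L, the frame subsampling and the sample-covariance estimator are illustrative assumptions.

```python
import numpy as np

def stereo_covariance_condition(x1, x2, L=64):
    """Condition number of the 2L x 2L sample covariance of the stacked
    stereo input vectors [x1(n)..x1(n-L+1), x2(n)..x2(n-L+1)]^T."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    N = min(len(x1), len(x2))
    R = np.zeros((2 * L, 2 * L))
    count = 0
    for n in range(L, N, 4):  # subsample instants to keep the sketch fast
        u = np.concatenate((x1[n - L:n][::-1], x2[n - L:n][::-1]))
        R += np.outer(u, u)
        count += 1
    return np.linalg.cond(R / count)
```

A strongly correlated pair (two filtered versions of one source) yields a much larger condition number than the same pair after a decorrelating transformation.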

Fig. 2 represents a unifying scheme to decorrelate the input signals. x1(n) and x2(n) are nonlinearly transformed into other signals x̃1(n) and x̃2(n). The purpose of this transformation is to make the new signals x̃1(n) and x̃2(n) as uncorrelated as possible, under the constraint of no modification of the perceptual quality. In the following, we recall some known methods from the literature, classified according to their usage of perceptual properties.

Figure 2: Unifying scheme to decorrelate stereophonic input signals by nonlinear transformations.

2.1 Non perceptual techniques (NPT)

The first idea to partially decorrelate the input signals is based on the addition of low-level independent random noise to each channel, in order to reduce the coherence between x̃1(n) and x̃2(n) compared to the coherence between x1(n) and x2(n) [2]. Another technique consists of modulating each input signal with an independent random noise [9]. However, it has been shown that even if the level of the added noise is very low, the quality of the speech is significantly degraded [4].

Without perceptual considerations, and in order to minimize audible degradation, it is preferable to add something resembling the original signal. The main idea proposed in [4] is to add to each input signal x1(n) and x2(n) a nonlinear transformation of the signal itself:

$$\tilde{x}_i(n) = x_i(n) + \alpha\, F[x_i(n)], \qquad (1)$$

where F is the chosen nonlinear function and α is a parameter which controls the amount of the added signal.
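As a concrete, hedged example of Eq. (1), the sketch below uses the half-wave rectifier, one of the nonlinearities studied in [4]; the value of α and the choice of opposite-sign rectifiers on the two channels are illustrative assumptions of this sketch, not a prescription from the paper.

```python
import numpy as np

def half_wave(x):
    """Half-wave rectifier: keeps positive samples, zeroes negative ones."""
    return 0.5 * (x + np.abs(x))

def nlt_additive(x1, x2, alpha=0.5):
    """Eq. (1) applied to both channels. Using the positive rectifier on one
    channel and the negative one on the other makes the added components
    as different as possible between the two channels."""
    x1_t = x1 + alpha * half_wave(x1)             # F[x1]: positive half-wave
    x2_t = x2 + alpha * 0.5 * (x2 - np.abs(x2))   # F[x2]: negative half-wave
    return x1_t, x2_t
```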

2.2 Perceptual techniques (PT)

The basic idea is to take advantage of human auditory properties, namely simultaneous masking, which occurs in the frequency domain. In [5], the proposed method is based on the addition of a random noise spectrally shaped so that it is masked by the presence of the input signal. To achieve complete masking of the added noise, the well-known noise masking threshold is computed from each input signal. It expresses the maximum level that added noise can reach without being audible. The masking threshold then serves to shape the added noise.
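A minimal sketch of this perceptual addition is given below. The helper masking_threshold is assumed (e.g. a Johnston-type psychoacoustic model [11]) and is expected to return the per-bin masking threshold, in power, on the rFFT grid; frame windowing, scaling and overlap-add details are deliberately glossed over.

```python
import numpy as np

def add_masked_noise(frame, masking_threshold, nfft=512):
    """Add to one frame a noise whose power spectrum follows the frame's
    masking threshold, so that it should remain inaudible."""
    rng = np.random.default_rng()
    mt = masking_threshold(frame, nfft)               # per-bin threshold (assumed helper)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=mt.shape)
    noise = np.fft.irfft(np.sqrt(mt) * np.exp(1j * phases), nfft)[: len(frame)]
    return frame + noise
```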

2.3 Motivation

The previous techniques add a data-dependent or a data-independent signal to partially decorrelate the stereophonic input signals. A simple and intuitive idea arises: is it possible to eliminate some parts of the input signals instead of adding another signal? Of course, this elimination must operate as a nonlinear operator and must not introduce any audible degradation of the input signals. Once again, we must consider human auditory properties, but in a different manner.

The basic idea is inspired by our previous work dealing with speech denoising techniques [6] [7], where we proposed a class of perceptual equivalence (CPE). Each signal having a spectrum belonging to this class is heard as the original one. We expect that if we consider the class bounds, using the upper bound as the transformed spectrum of the first input signal x1(n) and the lower bound as the transformed spectrum of the second input signal x2(n), we can reduce the correlation between them.

First of all, before detailing the idea, we describe the class of perceptual equivalence, its bounds and its usefulness in the next section.

3. CLASS OF PERCEPTUAL EQUIVALENCE

We aim to find an interval such that all signals belonging to it have the same auditory properties as the considered signal. For this purpose, auditory properties of the human ear are considered. More precisely, the masking concept is used: a masked signal is made inaudible by a masker if the masked signal's magnitude is below the perceptual masking threshold MT [8].

Using both the signal spectrum and its masking threshold, we look for decision rules to decide on the audibility of a modified spectrum obtained by adding, subtracting or modifying some frequency components of the considered signal. These rules permit us to construct the class of perceptual equivalence.

3.1 Upper Bound of Perceptual Equivalence UBPE

The masking threshold is a curve computed in the short-time frequency domain from the power spectrum of the considered signal. It represents, for each frequency component, the maximum level of added noise that remains inaudible. Hence, an additive noise with a power spectrum under MT will be inaudible.

However, such a noise modifies the power spectrum shape of the considered signal. Obviously, we can add many 'inaudible' noises, obtaining several power spectrum shapes with the same auditory quality as the original signal.

We seek to determine the interval formed by the power spectrum of the original speech, as a lower limit, and another curve, as an upper limit, such that any modified signal (i.e., original signal plus inaudible noise) with a power spectrum between the two limits does not impair the perceptual quality of the original signal.

Since the masking threshold MT represents the maximum power of an additive inaudible noise, the upper limit that we look for is the curve that results from adding the power spectrum of the original signal to its own masking threshold.

The resulting spectrum is called the upper bound of perceptual equivalence (UBPE) and is defined as follows [6] [7]:

$$\mathrm{UBPE}(m,f) = \Gamma_s(m,f) + \mathrm{MT}(m,f), \qquad (2)$$

where m (resp. f) denotes the frame index (resp. frequency index) and $\Gamma_s(m,f)$ is the clean speech power spectrum.
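A minimal sketch of Eq. (2) for one frame follows; masking_threshold is again an assumed helper returning MT(m, f) on the same frequency grid as the frame's power spectrum.

```python
import numpy as np

def ubpe(frame, masking_threshold, nfft=512):
    """Eq. (2): UBPE(m, f) = Gamma_s(m, f) + MT(m, f) for one frame m."""
    gamma_s = np.abs(np.fft.rfft(frame, nfft)) ** 2   # power spectrum of the frame
    mt = masking_threshold(gamma_s)                   # masking threshold on the same bins (assumed helper)
    return gamma_s + mt
```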

3.2 Lower Bound of Perceptual Equivalence LBPE

By duality, some attenuations of frequency components can be heard as speech distortion of the input signal. Thus, by analogy with the UBPE, we propose to compute a second curve which expresses the lower bound under which any attenuation of frequency components is heard as a distortion. We call it the lower bound of perceptual equivalence (LBPE). To compute the LBPE, we use the audible spectrum introduced by Tsoukalas et al. for audio signal enhancement [10]. In that work, the audible spectrum is obtained by taking the maximum between the clean speech spectrum and the masking threshold.

When speech components are under MT, they are not heard, and we can replace them by a chosen threshold σ(m,f). The proposed LBPE is defined as follows [6], [7]:

$$\mathrm{LBPE}(m,f) = \begin{cases} \Gamma_s(m,f) & \text{if } \Gamma_s(m,f) \geq \mathrm{MT}(m,f), \\ \sigma(m,f) & \text{otherwise.} \end{cases} \qquad (3)$$

The choice of σ(m,f) obeys only one condition: it must be under the masking threshold, σ(m,f) < MT(m,f). We choose it equal to the absolute threshold of hearing, defined as the minimum power a signal must have to be audible in absolute silence [8].
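The sketch below implements Eq. (3) with σ(m, f) set to the usual analytic approximation of the absolute threshold of hearing used in audio coding [8]; treating that dB SPL curve as directly comparable to the frame's power spectrum is a calibration assumption of this sketch.

```python
import numpy as np

def threshold_in_quiet_db(f_hz):
    """Absolute threshold of hearing in quiet (dB SPL), standard approximation [8]."""
    f = np.maximum(np.asarray(f_hz, dtype=float), 20.0) / 1000.0
    return 3.64 * f ** -0.8 - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2) + 1e-3 * f ** 4

def lbpe(gamma_s, mt, freqs_hz):
    """Eq. (3): keep audible components, replace inaudible ones by sigma(m, f),
    chosen here as the absolute threshold of hearing converted to a power scale."""
    sigma = 10.0 ** (threshold_in_quiet_db(freqs_hz) / 10.0)
    return np.where(gamma_s >= mt, gamma_s, sigma)
```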

3.3 Usefulness of UBPE and LBPE

Using the UBPE and LBPE, we can define three regions characterizing the perceptual quality of a modified input signal: modified frequency components between the UBPE and the LBPE are perceptually equivalent to the original input signal components; frequency components above the UBPE contain audible noise; and frequency components under the LBPE are characterized by speech distortion. Hence, the UBPE and LBPE constitute the limits of admissible transparent modification of the input signals: every signal having a spectral shape included between the UBPE and the LBPE is perceptually equivalent to the original one. The set of these signals forms the class of perceptual equivalence (CPE).
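For clarity, this three-region rule can be written as a small helper; the string labels are purely illustrative.

```python
import numpy as np

def classify_bins(modified_spectrum, ubpe_curve, lbpe_curve):
    """Label each bin of a modified spectrum: inside the CPE ('equivalent'),
    above the UBPE ('audible noise'), or below the LBPE ('distortion')."""
    labels = np.full(modified_spectrum.shape, 'equivalent', dtype=object)
    labels[modified_spectrum > ubpe_curve] = 'audible noise'
    labels[modified_spectrum < lbpe_curve] = 'distortion'
    return labels
```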

Figure 3: An illustration of UBPE and LBPE (in dB) of a speech frame.

As an illustration, Fig. 3 shows an example of the power spectrum of a speech frame together with its UBPE (upper curve, bold line) and LBPE (bottom curve, dashed line). The original speech power spectrum lies, for all frequency indices, between the two curves UBPE and LBPE.

4. PROPOSED TECHNIQUE TO DECORRELATE STEREOPHONIC SIGNALS

4.1 Proposed technique principle

The CPE permits multiple choices of signals with equivalent perceptual quality. For our application, we seek to reduce the correlation between the input signals. In other words, our purpose is to change the temporal forms of the input signals as much as possible so that they become less correlated, without any audible degradation. Transposing the problem to the frequency domain, we can intuitively choose the UBPE as the target spectrum for the first input signal x1(n) and the LBPE for the second input signal x2(n); indeed, these two limits lead to the most different signals.

The proposed nonlinear transformations (NLT) are written as follows:

$$\begin{aligned} \mathrm{NLT}_1(x_1) &= \mathrm{UBPE}_{x_1} \ \text{computed from } x_1(n), \\ \mathrm{NLT}_2(x_2) &= \mathrm{LBPE}_{x_2} \ \text{computed from } x_2(n). \end{aligned} \qquad (4)$$
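One possible way to apply Eq. (4) frame by frame is sketched below: the magnitude spectrum of each frame is forced to the square root of the UBPE (channel 1) or the LBPE (channel 2). The resynthesis details (Hann window, original phase, overlap-add) are assumptions of this sketch, since that part of the procedure is not in the text reproduced here.

```python
import numpy as np

def impose_spectrum(frame, target_power_spectrum, nfft=512):
    """Replace the magnitude spectrum of a windowed frame by the square root
    of a target power spectrum (length nfft//2 + 1), keeping the original phase."""
    X = np.fft.rfft(frame * np.hanning(len(frame)), nfft)
    phase = np.angle(X)
    return np.fft.irfft(np.sqrt(target_power_spectrum) * np.exp(1j * phase), nfft)[: len(frame)]

# Per Eq. (4): channel-1 frames use target = UBPE of that frame, channel-2
# frames use target = LBPE of that frame; the modified frames are then
# overlap-added to produce the decorrelated loudspeaker signals.
```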

4.2 Reduction of coherence function

To mathematically justify our choice, let us recall the expression of the coherence function, which is computed in the spectral domain as follows:

$$C(f) = \frac{\gamma_{x_1 x_2}(f)}{\sqrt{\gamma_{x_1 x_1}(f)\,\gamma_{x_2 x_2}(f)}}, \qquad (5)$$

where $\gamma_{x_i x_i}(f)$ denotes the power spectral density of $x_i$ (for i = 1 and i = 2), and $\gamma_{x_1 x_2}(f)$ denotes the cross-spectral density of the signals $x_1$ and $x_2$.

The coherence function measures the similarity in the frequency domain between two signals. If C(f) is near 1, it
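A Welch-based estimate of Eq. (5) can be obtained as sketched below; scipy.signal.coherence returns the magnitude-squared coherence, hence the square root. The sampling rate and segment length are illustrative values, not taken from the paper.

```python
import numpy as np
from scipy.signal import coherence

def coherence_curve(a, b, fs=8000, nperseg=512):
    """Welch estimate of Eq. (5): square root of the magnitude-squared coherence."""
    f, c2 = coherence(a, b, fs=fs, nperseg=nperseg)
    return f, np.sqrt(c2)

# Comparing np.mean(coherence_curve(x1, x2)[1]) with the value obtained on the
# transformed pair shows how much a given nonlinear transformation reduces the
# inter-channel coherence.
```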

Figure 5: Evolution of the MSE for the three considered methods, obtained with 2-NLMS.

Figure 6: Misalignment of the three considered methods, obtained with 2-NLMS.

Figure 7: Evolution of the ERLE for the three considered methods, obtained with 2-NLMS.

[2] M. M. Sondhi and D. R. Morgan, "Stereophonic acoustic echo cancellation - an overview of the fundamental problem," IEEE Signal Processing Letters, vol. 2, no. 8, pp. 148-151, 1995.

[3] J. Benesty, D. R. Morgan and M. M. Sondhi, "A better understanding and an improved solution to the specific problems of stereophonic acoustic echo cancellation," IEEE Transactions on Speech and Audio Processing, vol. 6, no. 8, pp. 156-165, 1998.

[4] D. R. Morgan, J. L. Hall and J. Benesty, "Investigation of several types of nonlinearities for use in stereo acoustic echo cancellation," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 6, pp. 686-696, 2001.

[5] A. Gilloire and V. Turbin, "Using auditory properties to improve the behaviour of stereophonic acoustic echo cancellation," in Proc. IEEE ICASSP, pp. 3681-3684, 1998.

[6] A. Ben Aicha and S. Ben Jebara, "Caractérisation perceptuelle de la dégradation apportée par les techniques de débruitage de la parole," in Proc. Traitement et Analyse de l'Information: Méthodes et Applications (TAIMA), Tunisia, 2007.

[7] A. Ben Aicha and S. Ben Jebara, "Quantitative perceptual separation of two kinds of degradation in speech denoising applications," in Proc. Workshop on Non-Linear Speech Processing (NOLISP), France, 2007.

[8] T. Painter and A. Spanias, "Perceptual coding of digital audio," Proc. IEEE, vol. 88, pp. 451-513, 2000.

[9] S. Shimauchi and S. Makino, "Stereo projection echo canceler with true echo path estimation," in Proc. IEEE ICASSP, pp. 3059-3062, 1995.

[10] D. E. Tsoukalas, J. N. Mourjopoulos and G. K. Kokkinakis, "Speech enhancement based on audible noise suppression," IEEE Transactions on Speech and Audio Processing, vol. 5, no. 6, pp. 497-517, 1997.

[11] J. D. Johnston, "Transform coding of audio signals using perceptual noise criteria," IEEE Journal on Selected Areas in Communications, vol. 6, pp. 314-323, 1988.