

IEEE SIGNAL PROCESSING LETTERS, VOL. 19, NO. 2, FEBRUARY 2012 99

Cepstrum Prefiltering for Binaural Source Localization in Reverberant Environments

Raffaele Parisi, Senior Member, IEEE, Flavia Camoes, Michele Scarpiniti, Member, IEEE, and Aurelio Uncini, Member, IEEE

Abstract: Binaural sound source localization can be performed by imitation of the fundamental mechanisms of the human auditory system, which is based on the integrated effects of ear, pinnae, head and torso. In particular, two physical cues can be exploited, i.e. the Interaural Time Difference (ITD) and the Interaural Level Difference (ILD). It is known that joint use of ITD and ILD provides good source azimuth estimations [1].

In many practical situations binaural localization has to be performed in closed environments, where the presence of reverberation degrades the performance of available position estimators. In this paper a possible solution to this difficult problem is introduced. The proposed solution is based on proper use of cepstral prefiltering prior to source localization by ITD and ILD. It is shown that cepstrum can help in reducing the effects of reverberation, thus yielding better location estimates.

Index Terms: Binaural sound localization, cepstral filtering, reverberation.

I. INTRODUCTION

Human beings are able to localize sound sources with great accuracy and under different environmental conditions [2]. This fact has suggested imitating the mechanisms of the human auditory system in order to realize effective artificial binaural source localization systems. The field of possible applications is vast and includes, for instance, the design of hearing aids, interactive robotics and augmented reality audio [1].

Recently a significant number of models of the human auditory system have been proposed [3]. Among the possible approaches, those based on estimation of the Interaural Level Difference (ILD) and the Interaural Time Difference (ITD) are often referred to [2], [4]. ILD is proportional to the difference in the sound levels reaching the left and right ear, while ITD is the measure of the time difference of arrival between the signals at the two ears. These cues can separately give information on the position of the source with respect to the listener for a specified frequency range.

More specifically, variations of the ILD values due to the shadowing effects originated by the head and the torso can give information on the source position. This effect occurs especially for frequencies above 1.5 kHz, where the size of the head is large with respect to the wavelength of the signal [3]. On the other hand, ITD directly depends on the direction of arrival of the source signal through a geometrical relationship based on a spherical model of the head. This relationship is valid only for frequencies approximately below 1.5 kHz, assuming that unambiguous decoding of periodic signals is assured [3].

As a matter of fact, ITD yields position estimates with smaller standard deviation than ILD, but suffers from an azimuth ambiguity caused by the a priori unknown phase unwrapping factor $p$. This fact suggested combining both types of cues to build a novel binaural localization method [1].

Unfortunately, in closed environments the presence of even moderate reverberation can cause gross localization errors. In these cases reverberation typically induces self-masking and overlap masking of phonemes, thus making reference to early reflections impractical [3]. This requires proper preprocessing of signals [5]. Cepstrum analysis can be successfully employed to perform this task.

Manuscript received October 24, 2011; revised December 08, 2011; accepted December 08, 2011. Date of publication December 19, 2011; date of current version January 09, 2012. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Mads Graseboll Christensen.

The authors are with the Department of Information, Electronics and Telecommunications (DIET), University of Rome La Sapienza, 00184 Roma, Italy (e-mail: [email protected]).

Digital Object Identifier 10.1109/LSP.2011.2180376

II. MODEL DESCRIPTION

Signals received by the left and right ears in a reverberant environment can be modelled in the discrete-time domain as

$$x_L(n) = h_L(n) * s(n) + n_L(n) \tag{1}$$

$$x_R(n) = h_R(n) * s(n) + n_R(n) \tag{2}$$

where $*$ denotes convolution, $h_i(n)$, $i = L, R$, is the impulse response between the source and the $i$-th ear (the binaural room impulse response, BRIR), $s(n)$ is the sound signal emitted by the source and $n_i(n)$ is the corresponding uncorrelated noise term. The impulse response $h_i(n)$ takes into account two independent effects. The first effect depends on the acoustics of the room (i.e. reverberation [6]). The second effect is the directional filtering of the head, which weighs the arriving sound components according to their direction of arrival. The term $n_i(n)$ represents the contribution of additive noise, which is usually modelled as an uncorrelated, zero-mean, stationary Gaussian random process.
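As an illustration of the signal model in (1) and (2), the following Python sketch builds the two ear signals by convolving a source signal with a pair of impulse responses and adding zero-mean Gaussian noise. The function name `simulate_ear_signals`, its parameters and the use of NumPy are assumptions made for this example only; in practice the impulse responses would be measured or simulated BRIRs.

```python
import numpy as np

def simulate_ear_signals(s, h_left, h_right, noise_std=0.0, seed=0):
    """Return (x_left, x_right) following x_i(n) = h_i(n) * s(n) + n_i(n).

    s        : source signal (1-D array)
    h_left   : impulse response from the source to the left ear
    h_right  : impulse response from the source to the right ear
    noise_std: standard deviation of the additive Gaussian noise (hypothetical knob)
    """
    rng = np.random.default_rng(seed)
    # Linear convolution of the source with each binaural impulse response.
    x_left = np.convolve(h_left, s)
    x_right = np.convolve(h_right, s)
    # Independent zero-mean Gaussian noise at each ear.
    x_left = x_left + noise_std * rng.standard_normal(x_left.shape)
    x_right = x_right + noise_std * rng.standard_normal(x_right.shape)
    return x_left, x_right

# Toy usage: the right ear receives a delayed, attenuated copy of the source.
x_l, x_r = simulate_ear_signals(np.array([1.0, 2.0, 3.0]),
                                np.array([1.0, 0.0]),
                                np.array([0.0, 0.5]))
```

With `noise_std=0.0` the outputs are exactly the two convolutions, which makes the sketch easy to check before noise is added.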

III. AZIMUTH ESTIMATION VIA ILD AND ITD

Binaural sound localization can be performed by following the approach described in [1], which is briefly recalled in this section. It should be remarked that the interest is in azimuth estimation only. Elevation estimation is not considered.

1070-9908/$26.00 © 2011 IEEE


ILD and ITD for the generic $l$-th time frame of the acquired signals can be defined as

$$\mathrm{ILD}(l,\omega) = 20 \log_{10} \left| \frac{X_R(l,\omega)}{X_L(l,\omega)} \right| \tag{3}$$

$$\mathrm{ITD}_p(l,\omega) = \frac{1}{\omega} \left( \angle \frac{X_R(l,\omega)}{X_L(l,\omega)} + 2\pi p \right) \tag{4}$$

where $\omega$ is frequency, $X_R(l,\omega)$ and $X_L(l,\omega)$ are the Short Time Fourier Transforms (STFTs) of the right and left ear signals and $p$ is the phase unwrapping factor, which is unknown [7].

The azimuth of the source can be obtained by comparing the estimated ILD and ITD with a reference set built from Head Related Transfer Functions (HRTFs) [8]. In this case, (3) and (4) are written as

$$\mathrm{ILD}(\theta,\omega) = 20 \log_{10} \left| \frac{\mathrm{HRTF}_R(\theta,\omega)}{\mathrm{HRTF}_L(\theta,\omega)} \right| \tag{5}$$

$$\mathrm{ITD}_p(\theta,\omega) = \frac{1}{\omega} \left( \angle \frac{\mathrm{HRTF}_R(\theta,\omega)}{\mathrm{HRTF}_L(\theta,\omega)} + 2\pi p \right) \tag{6}$$

where $\mathrm{HRTF}_R(\theta,\omega)$ and $\mathrm{HRTF}_L(\theta,\omega)$ are the HRTFs of the right and left ears respectively and $\theta$ is the azimuth angle. In particular, smoothing across azimuth is performed on the ILD lookup set in order to model the limits of human interaural level difference perception [1]. More specifically, a Gaussian filter with a constant can be employed, as indicated in the CIPIC database [8].

Localization can be performed using the ILD and ITD sets of (5) and (6) as a reference for the lookup algorithm. The ILD-only azimuth of the source can be estimated from the absolute value of the difference between the ILD lookup set and the ILD computed from the real signals arriving at the ears. The minimum difference across available frequencies corresponds to the estimated azimuth of the source.

ITD-only azimuth localization requires an analogous procedure, with the addition of the phase unwrapping module. In particular, for each STFT time frame the difference between the ITD lookup set and the ITD experimental data is computed across azimuth for each possible value of the unwrapping factor $p$. The correct $p$ is then selected by minimizing the difference between the ITD-only and ILD-only estimates. This $p$-estimation procedure is repeated for each available time frame, and a time average across frames is performed. The final azimuth estimates are those displaying a minimum in the difference function that is consistent across frequencies.

The described procedure can be preceded by application of cepstral prefiltering. This approach is briefly recalled in the next section.
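The lookup procedure of this section can be sketched in Python for a single STFT frame. This is a minimal illustration under stated assumptions, not the implementation of [1]: the function names, the right-over-left ratio, the $20\log_{10}$ scaling, the candidate set `p_values` and the array layout of the lookup tables (azimuth by frequency bin) are all choices made here. Time averaging across frames and the azimuth smoothing of the ILD set are omitted.

```python
import numpy as np

def interaural_cues(X_left, X_right, omega, p=0):
    """ILD (dB) and ITD (s) per frequency bin for one STFT frame.

    omega is the angular frequency of each bin in rad/s (assumed > 0),
    p is the integer phase unwrapping factor.
    """
    ratio = X_right / X_left
    ild = 20.0 * np.log10(np.abs(ratio))
    itd = (np.angle(ratio) + 2.0 * np.pi * p) / omega
    return ild, itd

def lookup_azimuth(measured, table, azimuths):
    """Per bin, pick the azimuth whose reference value is closest.

    table has shape (n_azimuths, n_bins); measured has shape (n_bins,).
    """
    idx = np.argmin(np.abs(table - measured[np.newaxis, :]), axis=0)
    return azimuths[idx]

def joint_azimuth(X_left, X_right, omega, ild_table, itd_table, azimuths,
                  p_values=(-1, 0, 1)):
    """ITD-based azimuth per bin, with p chosen to agree best with the ILD estimate."""
    ild, _ = interaural_cues(X_left, X_right, omega)
    az_ild = lookup_azimuth(ild, ild_table, azimuths)
    best_az = az_ild.astype(float)
    best_err = np.full(omega.shape, np.inf)
    for p in p_values:
        _, itd = interaural_cues(X_left, X_right, omega, p)
        az_itd = lookup_azimuth(itd, itd_table, azimuths)
        err = np.abs(az_itd - az_ild)
        better = err < best_err          # keep the p that best matches the ILD estimate
        best_az[better] = az_itd[better]
        best_err[better] = err[better]
    return best_az
```

In this sketch the ILD estimate is only used to disambiguate the unwrapping factor, while the returned azimuth comes from the ITD lookup, which reflects the lower variance attributed to ITD in the text.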

IV. CEPSTRAL PREFILTERING

Cepstral prefiltering was shown to be effective in reducing the effects of reverberation on received signals [9], [10]. The complex cepstrum of the signal $x_i(n)$ arriving at the generic ear is

$$\hat{x}_i(k) = \mathcal{F}^{-1}\left\{ \log\left[ X_i(\omega) \right] \right\} \tag{7}$$

In this formula $X_i(\omega)$ is the Fourier transform of $x_i(n)$, $\log[\cdot]$ is the complex logarithm operator, $\mathcal{F}^{-1}\{\cdot\}$ is the inverse Fourier transform and $k$ is quefrency.

Convolution of two signals in the time domain corresponds to an addition in the quefrency domain, so that application of the cepstral transformation to (1) and (2) leads to

$$\hat{x}_i(k) = \hat{h}_i(k) + \hat{s}(k) + \hat{n}_i(k) \tag{8}$$

where $\hat{h}_i(k)$ and $\hat{s}(k)$ are the cepstra of the impulse response and the source signal respectively. The term $\hat{n}_i(k)$ represents the cepstrum of the additive noise term and is given by

$$\hat{n}_i(k) = \mathcal{F}^{-1}\left\{ \log\left[ 1 + \frac{N_i(\omega)}{H_i(\omega) S(\omega)} \right] \right\} \tag{9}$$

where $N_i(\omega)$, $H_i(\omega)$ and $S(\omega)$ are the Fourier transforms of $n_i(n)$, $h_i(n)$ and $s(n)$ respectively. In most practical applications the background noise can be assumed to be low enough that $n_i(n)$ and its cepstrum $\hat{n}_i(k)$ can be neglected [10].

In the cepstral domain the global system impulse response can be written as the sum of a minimum phase component (MPC) $\hat{h}_{i,\min}(k)$ and an all pass component (APC) $\hat{h}_{i,\mathrm{ap}}(k)$ [7]. Equation (8) becomes

$$\hat{x}_i(k) = \hat{h}_{i,\min}(k) + \hat{h}_{i,\mathrm{ap}}(k) + \hat{s}(k) \tag{10}$$

The assumption justifying the use of cepstral prefiltering is that the MPC of the source signal cepstrum varies from frame to frame and is zero-mean, while the MPC of the room impulse response is slowly varying and can be estimated by averaging over time [10]. Generally only a few frames are sufficient for convergence [9]. The final estimate of the MPC of the channel is then subtracted from the received signal cepstrum $\hat{x}_i(k)$, which after filtering is transformed back to the time domain.

In particular, computation of the cepstral transform of each frame is preceded by application of an exponential window $w(n) = \alpha^n$ on received data, where $0 < \alpha < 1$ and $n = 0, \ldots, N-1$, $N$ being the frame size. The objective is to move poles and zeros of $X_i(z)$ towards the interior of the unit circle, thus increasing the weight of the MPC with respect to the APC [10].
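Two properties used in this section can be checked numerically with a short Python sketch. It is only an illustration: it uses the real cepstrum (log magnitude only) rather than the complex cepstrum, so that no phase unwrapping is needed, and the helper names, the toy impulse response and the value of the window constant are assumptions made here.

```python
import numpy as np

def real_cepstrum(x, n_fft):
    """Inverse FFT of the log magnitude spectrum (real cepstrum)."""
    spectrum = np.fft.fft(x, n_fft)
    return np.fft.ifft(np.log(np.abs(spectrum))).real

def apply_exponential_window(frame, alpha):
    """Multiply sample n of the frame by alpha**n."""
    return frame * alpha ** np.arange(len(frame))

# 1) Convolution in time becomes addition in the quefrency domain.
rng = np.random.default_rng(0)
s = rng.standard_normal(64)                       # stand-in for a source frame
h = np.array([1.0, 0.0, 0.0, 0.6, 0.0, -0.3])     # toy impulse response
x = np.convolve(h, s)                             # length 69, fits in 128 FFT points
n_fft = 128
additive = np.allclose(real_cepstrum(x, n_fft),
                       real_cepstrum(h, n_fft) + real_cepstrum(s, n_fft))

# 2) The exponential window pulls zeros towards the inside of the unit circle.
frame = np.array([1.0, 2.0])                      # X(z) = 1 + 2 z^-1, zero at z = -2
zeros_before = np.roots(frame)
zeros_after = np.roots(apply_exponential_window(frame, alpha=0.4))
```

The FFT length is chosen at least as large as the length of the convolved signal, so that the product of the two spectra corresponds to the linear convolution and the additivity holds without circular wrap-around.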

V. STEPS OF THE ALGORITHM

In summary, the azimuth estimation process is organized in the following steps.

1) Apply the exponential window $w(n)$ to each frame of the two signals $x_L(n)$ and $x_R(n)$.

2) Compute the MPC of the received signal at each frame.

3) Average the MPC through successive signal frames to get an estimate of the MPC of the channel, $\hat{h}_{i,\min}(k)$.

4) Subtract the estimate of $\hat{h}_{i,\min}(k)$ from the cepstrum $\hat{x}_i(k)$ of each signal frame.

5) Transform back to the time domain and apply the inverse of the exponential window.

6) Apply the azimuth estimation process based on joint ILD-ITD estimation.
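Steps 1) to 5) can be sketched in Python for the frames of one ear as follows. This is a simplified illustration under several assumptions, not the authors' implementation: the function names, the FFT length of twice the frame size and the default window constant are chosen here; the MPC is obtained by folding the real cepstrum, a standard homomorphic-processing identity; and the cepstral subtraction of step 4 is carried out equivalently in the frequency domain, by dividing each frame spectrum by the exponential of the Fourier transform of the averaged MPC, which avoids phase unwrapping.

```python
import numpy as np

def min_phase_cepstrum(frame, n_fft):
    """Cepstrum of the minimum phase component, by folding the real cepstrum.

    n_fft is assumed to be even.
    """
    log_mag = np.log(np.abs(np.fft.fft(frame, n_fft)) + 1e-12)  # guard against log(0)
    c = np.fft.ifft(log_mag).real
    fold = np.zeros(n_fft)
    fold[0] = 1.0
    fold[1:n_fft // 2] = 2.0      # double the positive quefrencies
    fold[n_fft // 2] = 1.0
    return c * fold               # negative quefrencies are zeroed

def cepstral_prefilter(frames, alpha=0.98):
    """frames: array of shape (n_frames, N) holding the frames of one ear signal."""
    n_frames, N = frames.shape
    n_fft = 2 * N
    w = alpha ** np.arange(N)
    windowed = frames * w                                     # step 1
    mpc = np.array([min_phase_cepstrum(f, n_fft) for f in windowed])  # step 2
    h_min_hat = mpc.mean(axis=0)                              # step 3
    # Step 4: subtracting h_min_hat from the cepstrum is equivalent to
    # dividing the spectrum by exp(FFT(h_min_hat)).
    H_min = np.exp(np.fft.fft(h_min_hat))
    X = np.fft.fft(windowed, n_fft, axis=1)
    filtered = np.fft.ifft(X / H_min, axis=1).real[:, :N]     # step 5: back to time
    return filtered / w                                       # step 5: undo the window
```

The same function would be applied separately to the left and right ear frames before step 6). Truncating the filtered frames to the original length is a shortcut of this sketch; overlap handling between frames is not addressed.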


Fig. 1. Localization results for a female voice in a central azimuth position for , , and . For each figure, results are shown with no prefiltering on the left side and with cepstral prefiltering on the right side.

VI. EXPERIMENTAL RESULTS

Experiments were carried out by simulating a room of size with the image method [11]. Different reverberation times were considered, up to 700 ms.¹ The head was put in the position . A female voice was used as the source signal, positioned at different azimuth angles with respect to the head, while keeping the elevation angle at zero degrees. The distance was set to one meter, as done in the CIPIC database [8]. The reference KEMAR head impulse response was used, which is subject 21 in the CIPIC HRIR database [8]. Sampling frequency was kHz. Cepstral prefiltering was performed on 12 ms time frames, using an exponential window as indicated in [10].

¹ The reverberation time is defined as the time needed for the energy to decay by 60 dB with respect to its initial value [6].
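To make the framing concrete, the sketch below splits a signal into consecutive 12 ms frames. The helper name `split_into_frames`, the choice of non-overlapping frames and the 44.1 kHz rate in the usage line are assumptions made for illustration only.

```python
import numpy as np

def split_into_frames(x, fs_hz, frame_ms=12.0):
    """Split x into consecutive, non-overlapping frames of frame_ms milliseconds.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frame_len = int(round(fs_hz * frame_ms / 1000.0))
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

# Usage with an assumed sampling rate: one second of signal at 44.1 kHz.
frames = split_into_frames(np.zeros(44100), fs_hz=44100.0)
```

Each row of the returned array is one frame, which is the layout expected by frame-wise processing such as windowing and cepstral averaging.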

Fig. 2. Histograms of localization results at azimuth at different reverberation times (from to ), (a) without and (b) with cepstral prefiltering.

Fig. 3. Histograms of localization results at azimuth at different reverberation times (from to ), (a) without and (b) with cepstral prefiltering.

Fig. 1 shows the results obtained without and with cepstral prefiltering respectively, at different reverberation times and for a source in a central position (azimuth $0^\circ$). In the figure, each column shows the detected azimuth as a function of frequency, from top to bottom by ILD only, ITD only and joint use of ILD and ITD. The darkest regions indicate the most likely angles. Improvements obtained with cepstral prefiltering can be seen, especially for higher reverberation times.

Figs. 2 and 3 illustrate the results obtained by the joint method for two sources positioned at and . The histograms of the estimated azimuth angles are shown for reverberation times up to . It is clear that localization performance is worse in any case for more lateral sources. The figures confirm the effectiveness of cepstral prefiltering in the presence of reverberation. It should be noted that for lateral positions of the source the efficacy of cepstral prefiltering is seriously hampered when the reverberation time is higher than 500 ms.

VII. CONCLUSION

Binaural source localization tasks can be performed by using ILD and ITD in a joint way. The presence of reverberation, which is typical in closed environments, can limit the quality of this solution.

In this work a method to limit the adverse effect of reverberation on binaural signals is described. The proposed approach is based on cepstral filtering of acquired signals prior to azimuth estimation. It was shown that the performance of joint ITD-ILD azimuth estimators can be improved, also for lateral positions of the source.

REFERENCES

[1] M. Raspaud, H. Viste, and G. Evangelista, “Binaural source localization by joint estimation of ILD and ITD,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 1, pp. 68–77, 2010.

[2] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization. Cambridge, MA: MIT Press, 1996.

[3] D. L. Wang and G. J. Brown, Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. Piscataway, NJ: IEEE Press/Wiley Interscience, 2006.

[4] C. Faller and J. Merimaa, “Source localization in complex listening situations: Selection of binaural cues based on interaural coherence,” J. Acoust. Soc. Amer., vol. 116, no. 5, pp. 3075–3089, 2004.

[5] C. Zannini, R. Parisi, and A. Uncini, “Binaural sound source localization in the presence of reverberation,” in Proc. 17th Int. Conf. Digital Signal Processing (DSP2011), Jul. 2011.

[6] H. Kuttruff, Room Acoustics. London, U.K.: Spon, 1999.

[7] A. Oppenheim and R. Schafer, Discrete-Time Signal Processing. Upper Saddle River, NJ: Prentice-Hall, 1989.

[8] V. R. Algazi, R. O. Duda, D. M. Thompson, and C. Avendano, “The CIPIC HRTF database,” in Proc. 2001 IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA’01), 2001.

[9] R. Parisi, R. Gazzetta, and E. Di Claudio, “Prefiltering approaches for time delay estimation in reverberant environments,” in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP’02), 2002, vol. 3, pp. 2997–3000.

[10] A. Stéphenne and B. Champagne, “A new cepstral prefiltering technique for estimating time delay under reverberant conditions,” Signal Process., vol. 59, no. 3, pp. 253–266, 1997.

[11] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” J. Acoust. Soc. Amer., vol. 65, pp. 943–950, Apr. 1979.