

Successive relative transfer function identification using blind oblique projection

Dani Cherkassky, Student Member, IEEE, and Sharon Gannot, Senior Member, IEEE

Abstract—Distortionless speech extraction in a reverberant environment can be achieved by applying a beamforming algorithm, provided that the relative transfer functions (RTFs) of the sources and the covariance matrix of the noise are known. In this paper, the challenge of RTF identification in a multi-speaker scenario is addressed. We propose a successive RTF identification (SRI) technique, based on the sole assumption that sources do not become simultaneously active. That is, we address the challenge of estimating the RTF of a specific speech source while assuming that the RTFs of all other active sources in the environment were previously estimated in an earlier stage. The RTF of interest is identified by applying the blind oblique projection (BOP)-SRI technique. When a new speech source is identified, the BOP algorithm is applied. BOP results in a null steering toward the RTF of interest, by means of applying an oblique projection to the microphone measurements. We prove that by artificially increasing the rank of the range of the projection matrix, the RTF of interest can be identified. An experimental study is carried out to evaluate the performance of the BOP-SRI algorithm in various signal-to-noise ratio (SNR) and signal-to-interference ratio (SIR) conditions and to demonstrate its effectiveness in speech extraction tasks.

Index Terms—Relative transfer function, system identification, oblique projection.

I. INTRODUCTION

SPEECH enhancement and separation are fundamental challenges in audio signal processing. In many speech processing applications, such as hands-free telephony, human-machine interfaces and hearing aids, the received signal is a mixture of the desired speech, one or more interfering sources, such as competing speakers, and background noise. The presence of the interfering sources causes a signal degradation which can render the speech unintelligible and can severely degrade the performance of speech recognition systems. Hundreds of multichannel speech enhancement techniques have been proposed in the literature over the last decades. Well-known techniques include non-negative matrix factorization [1], [2], blind source separation [3], [4], and beamforming [5], [6]. Recently, deep neural networks have also been suggested to address the speech extraction challenge [7], [8]. A comprehensive literature review and comparison of the aforementioned techniques is given in [9]. The current work is focused on the speech extraction/separation challenge in a reverberant environment, by an application of a beamformer.

The relative transfer function (RTF) is an important component of multi-microphone speech processing systems, particularly in reverberant environments [10]–[15]. An RTF describes the coupling between the signals received at the microphones as a response to a single source. One of the most common applications of RTFs is speech extraction in a noisy and reverberant environment. For instance, the constraint set of the linearly constrained minimum variance (LCMV) beamformer can be expressed in terms of the sources' RTFs [16]. A formulation of the constraint set in terms of RTFs allows the LCMV beamformer to reject the interfering speech without distorting the desired speech components.

D. Cherkassky and S. Gannot are with the Faculty of Engineering, Bar-Ilan University, Ramat-Gan, Israel (e-mail: [email protected]; [email protected]).

The RTF identification challenge in a noisy environment with a single active speech source has been well studied. Shalvi and Weinstein [17] proposed identifying the coupling between speech components received at two microphones by using the nonstationarity of the desired speech signal received at the sensors, assuming stationary additive noise and a time-invariant RTF. The observed signal is divided into subintervals. The speech signal is regarded as stationary in each subinterval and nonstationary between subintervals. Accordingly, an overdetermined set of equations for two unknown variables, the RTF and the cross power spectral density (PSD) of the noise signals, can be formulated by computing the PSD of the sensor signals in each subinterval. The estimates of these two variables are derived by applying the weighted least squares approach. Cohen [18] proposed an RTF identification method that utilizes the speech presence probability (SPP) in the time-frequency domain to identify the time-frequency instances that contain the speech signal. By using the SPP, it is possible to cluster the subintervals into two groups, one consisting of noise-only subintervals and the second of subintervals in which speech is present. The first group is utilized for estimating the noise cross PSD, while the second group is utilized to derive an RTF estimator. More recently, two RTF identification methods have been the topic of intensive study: the covariance subtraction (CS) method [19], [20] and the covariance whitening (CW) method [16], [21]. Both methods assume that information on the activity pattern of the speech sources of interest is available, and utilize the signals' PSD matrices, obtained during noise-only time segments and during speech-plus-noise time segments, to estimate the RTF. A comparative survey of the CS and CW methods for RTF estimation was presented by Markovich-Golan and Gannot [22].

RTF estimation in a multiple and concurrent speaker scenario was recently considered in the literature. Markovich-Golan et al. [23] proposed tracking the desired and interference speakers' subspaces in non-static scenarios with concurrently active multiple speakers in a reverberant environment. It was proven by Hadad et al. [24] that knowledge of a basis that spans the subspace of the desired sources and a basis that spans the subspace of the interfering sources suffices for implementing the LCMV beamforming algorithm. The aforementioned desired and interfering sources' subspaces can be estimated in a scenario where all the desired sources and all the interfering sources are simultaneously active. However, signal segments in which both the desired and the interfering sources are concurrently active cannot be used for estimating the subspaces. Hassani et al. [25] proposed a method for estimating the desired and the interfering sources' subspaces by exploiting signal segments having concurrent activity of the desired and the interfering sources. It was assumed that an initial estimate of the desired and interfering sources' subspaces is available; the individual subspace estimates were then projected onto the joint signal subspace of all the desired and interfering sources. The procedure exploits signal segments with concurrent activity of the desired and the interfering sources and results in an improved estimate of the individual subspaces as compared with the initially available estimates. Deleforge et al. [26] proposed a generalization of the RTF definition to several sources. The generalized RTFs are defined through multichannel, multi-frame spectrograms of the received noise-free signal. Markovich-Golan et al. [27] suggested using the Triple-N ICA for convolutive mixtures (TRINICON) [28], a blind source separation (BSS) framework, for estimating RTFs in a multi-speaker scenario. The proposed algorithm assumes the availability of an initial, direct-path based estimate of the target RTFs [29].

In this paper, we consider a multi-source scenario. We propose a successive RTF identification (SRI) technique based on the sole assumption that sources do not become simultaneously active. Namely, we address the challenge of estimating the RTF of a specific speech source while assuming that the RTFs of all other active sources in the environment were previously estimated. The proposed SRI algorithm is founded on an oblique (nonorthogonal) projection operator [30]. Oblique projection is used to project measurements onto a low-rank subspace along a direction that is oblique to the subspace. Unlike orthogonal projection, oblique projection provides the flexibility to design a nonorthogonal null space and range. In this contribution, we introduce the blind oblique projection (BOP) for RTF estimation. The range of the BOP is set to include all the previously estimated RTFs, while the null space of the BOP is designed blindly to include the RTF of interest. Specifically, we propose resolving the null space of the BOP by minimizing the norm of the projected measurements, subject to keeping the range of the BOP fixed. The above-described RTF estimation procedure is referred to in the following as BOP-SRI. In order to demonstrate that the BOP-SRI provides a valid RTF estimator, we prove that, if the dimension of the range of the BOP is equal to the number of microphones minus one and assuming a sufficiently high signal-to-noise ratio (SNR), the resulting null space of the BOP will be set to a vector parallel to the RTF of interest. It should be noted that in the case where the number of active sources in the environment is lower than the number of microphones, which is the case in typical speech enhancement applications, the dimension of the range can be artificially increased to facilitate the BOP-SRI implementation. At this point, it is worth stressing that the proposed technique was evaluated in controlled environments, both simulated and in an actual acoustic lab. Applying the proposed technique to more complex scenarios with arbitrary activity patterns of the speakers may require more sophisticated speaker counting and RTF association algorithms.

The rest of the paper is organized as follows. Section II is dedicated to the introduction of the considered signal model and the properties of the oblique projection operator. In Section III, we derive the BOP-SRI algorithm, and in Section IV we present some practical considerations that should be addressed when utilizing BOP-SRI in a typical speech extraction application. An experimental study, evaluating the performance of the proposed BOP-SRI algorithm, is presented in Section V. Finally, we conclude the work with a brief discussion in Section VI.

II. PRELIMINARIES

This section is split into two parts. In the first part, we present the considered signal and measurement model and, in the second part, the oblique projection operator. Oblique projection is a well-established mathematical tool. However, it has received relatively little attention in the multi-microphone speech processing literature. Accordingly, since the oblique projection operator is at the core of the proposed BOP-SRI algorithm, hereafter we present the main properties of the operator.

A. Signal model

Consider an array consisting of $M$ microphones capturing a time-varying acoustical scene. Each of the involved signals propagates through the acoustic environment before being picked up by the array. In the short-time Fourier transform (STFT) domain, the $n$th speech source is denoted by $s_n(\ell,k)$, the acoustic transfer function (ATF) relating the $n$th source and the $m$th microphone by $g_{m,n}(k)$, and the stationary noise at the $m$th microphone by $v_m(\ell,k)$, where $\ell$ is the frame index and $k$ is the frequency index. The received signals in the STFT domain can be formulated in a vector representation:

$$\mathbf{z}(\ell,k) = \sum_{n=1}^{N} I_n(\ell)\,\mathbf{h}_n(k)\,x_n(\ell,k) + \mathbf{v}(\ell,k), \qquad (1)$$

where $N$ is the number of sources of interest, $x_n(\ell,k) = g_{1,n}(k)\,s_n(\ell,k)$, $I_n(\ell) \in \{0,1\}$ indicates the activity of $s_n(\ell,k)$, and $\mathbf{h}_n(k)$ is the RTF vector of the $n$th source, defined as

$$\mathbf{h}_n(k) = \left[1,\ \frac{g_{2,n}(k)}{g_{1,n}(k)},\ \cdots,\ \frac{g_{M,n}(k)}{g_{1,n}(k)}\right]^T. \qquad (2)$$

Considering the sources' activity pattern, we assume that the speech sources become active successively. Accordingly, the activity indicator function of the $n$th source is defined by

$$I_n(\ell) = \begin{cases} 0, & \text{if } \ell \le \ell_n \\ 1, & \text{if } \ell_n < \ell \le \ell_{n+1} \\ A_n, & \text{otherwise,} \end{cases} \qquad (3)$$


where $A_n \in \{0,1\}$. The noise $\mathbf{v}(\ell,k)$ is assumed active throughout the measurement period. The considered activity pattern may be practical, for example, in a noisy conference call scenario. In such a scenario, typically, the speech sources do not become simultaneously active but do remain active for a sufficient amount of time before they become inactive again. Accordingly, the proposed $I_n(\ell)$ dictates a unique activation time $\ell_n$ of the $n$th speech source. Upon activation, the $n$th source remains active at least for the time frames $\ell_n < \ell \le \ell_{n+1}$, while for time frames $\ell > \ell_{n+1}$ the $n$th source can be either active or inactive, as suggested by the definition of $A_n$. However, we do assume that simultaneous activation and deactivation of two independent sources never occur. The probability of simultaneous activation and deactivation of two independent sources was investigated in [31], where it was suggested that the probability of such an event is zero. In practice, the activity pattern of the sources is, of course, unknown and should be estimated from the measurements. Source activity function estimation is addressed in Section IV. It should be stressed that the BOP-SRI algorithm, proposed in the sequel, exploits time frames where multiple speech sources are simultaneously active in order to estimate the target RTF.
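As a concrete illustration, the following minimal Python sketch generates measurements according to the model (1)–(3) for a single frequency bin. All numerical values (array size, activation frames, the random surrogate RTFs) are illustrative assumptions, not parameters taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, L = 4, 2, 200                 # microphones, sources, time frames
ell = [50, 120]                     # activation frame ell_n of each source

# Random surrogate RTFs h_n(k) for one bin k; h_n[0] = 1 by definition (2).
H = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
H = H / H[0]

def indicator(n, l, A_n=1):
    """Activity indicator I_n(l) of (3): inactive, then active, then A_n.
    For the last source no later activation exists, so it simply stays on."""
    if l <= ell[n]:
        return 0
    if n + 1 < N and l <= ell[n + 1]:
        return 1
    return A_n

z = np.empty((M, L), dtype=complex)
for l in range(L):
    x = rng.standard_normal(N) + 1j * rng.standard_normal(N)   # x_n(l, k)
    v = 0.1 * (rng.standard_normal(M) + 1j * rng.standard_normal(M))
    z[:, l] = sum(indicator(n, l) * H[:, n] * x[n] for n in range(N)) + v
```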

B. Oblique projection operator

In signal processing applications, oblique projections are used to project measurements onto a low-rank subspace along a direction that is oblique to the subspace. The SRI algorithm presented in Section III is based on the oblique projection operator, and therefore we review the main properties of this operator in this section.

Consider an $M$-dimensional measurement vector $\mathbf{z} = \mathbf{s} + \mathbf{n}$, where the signal $\mathbf{s} = \mathbf{H}\mathbf{x}$ lies in an $(N-1)$-dimensional subspace of $\mathbb{C}^M$, which we denote by $\langle \mathbf{H} \rangle$. The subspace $\langle \mathbf{H} \rangle$ is the range of the transformation $\mathbf{H}$ and is spanned by the columns of the matrix $\mathbf{H}$. These columns comprise a basis for the subspace, and the elements of $\mathbf{x} = [x_1, x_2, \cdots, x_{N-1}]^T$ are the coordinates of $\mathbf{s}$ with respect to this basis. Similarly, $\mathbf{n} = \mathbf{h}x_N$ lies in a one-dimensional subspace of $\mathbb{C}^M$. This subspace, spanned by the vector $\mathbf{h}$, is denoted by $\langle \mathbf{h} \rangle$.

An oblique projection $\mathbf{E}_{\mathbf{Hh}}$, of which the range is $\langle \mathbf{H} \rangle$ and the null space comprises $\langle \mathbf{h} \rangle$, is defined by [32]:

$$\mathbf{E}_{\mathbf{Hh}} = \mathbf{H}\left(\mathbf{H}^H \mathbf{P}^\perp_{\mathbf{h}} \mathbf{H}\right)^{-1}\mathbf{H}^H \mathbf{P}^\perp_{\mathbf{h}}, \qquad (4)$$

where $\mathbf{P}^\perp_{\mathbf{h}} = \mathbf{I} - \mathbf{h}\left(\mathbf{h}^H\mathbf{h}\right)^{-1}\mathbf{h}^H$ is the orthogonal projection matrix onto the subspace orthogonal to $\langle \mathbf{h} \rangle$ and $\mathbf{I}$ is the identity matrix. Equivalently, $\mathbf{E}_{\mathbf{hH}}$ is an oblique projection with range $\langle \mathbf{h} \rangle$ and with a null space comprising $\langle \mathbf{H} \rangle$. It is straightforward to verify that $\mathbf{E}_{\mathbf{Hh}}$ is idempotent, with range $\langle \mathbf{H} \rangle$ and a null space that includes $\langle \mathbf{h} \rangle$:

$$\mathbf{E}_{\mathbf{Hh}}\mathbf{H} = \mathbf{H}\left(\mathbf{H}^H \mathbf{P}^\perp_{\mathbf{h}} \mathbf{H}\right)^{-1}\mathbf{H}^H \mathbf{P}^\perp_{\mathbf{h}} \mathbf{H} = \mathbf{H}, \qquad (5a)$$

$$\mathbf{E}_{\mathbf{Hh}}\mathbf{h} = \mathbf{H}\left(\mathbf{H}^H \mathbf{P}^\perp_{\mathbf{h}} \mathbf{H}\right)^{-1}\mathbf{H}^H \mathbf{P}^\perp_{\mathbf{h}} \mathbf{h} = \mathbf{0}. \qquad (5b)$$

To complete the null space, let us define a matrix $\mathbf{A}$, the columns of which span the $(M-N)$-dimensional subspace perpendicular to $\langle \mathbf{H}\,\mathbf{h} \rangle$. By definition, $\mathbf{P}^\perp_{\mathbf{h}}\mathbf{A} = \mathbf{A}$ and $\mathbf{H}^H\mathbf{A} = \mathbf{0}$; accordingly,

$$\mathbf{E}_{\mathbf{Hh}}\mathbf{A} = \mathbf{H}\left(\mathbf{H}^H \mathbf{P}^\perp_{\mathbf{h}} \mathbf{H}\right)^{-1}\mathbf{H}^H \mathbf{P}^\perp_{\mathbf{h}} \mathbf{A} = \mathbf{H}\left(\mathbf{H}^H \mathbf{P}^\perp_{\mathbf{h}} \mathbf{H}\right)^{-1}\mathbf{H}^H \mathbf{A} = \mathbf{0}. \qquad (6)$$

Fig. 1: Geometrical interpretation of the oblique projection in the Euclidean space.

In Fig. 1, we show the geometrical interpretation of theoblique projection operator in the Euclidean space. As shown,〈A〉 is the subspace orthogonal to both 〈H〉 and 〈h〉 andEHh is a projection operator with a range equal to 〈H〉 anda null space equal to 〈h A〉. In a special case, where theunification of 〈H〉 and 〈h〉 spans the entire Euclidean space,i.e., 〈A〉 is an empty subspace, the null space of EHh is equalto 〈h〉. It is noteworthy to mention that the ability of theoblique projection operator to project onto 〈H〉 while nullingthe subspace 〈h〉 that is nonorthogonal to 〈H〉 comes with aprice. Applying EHh to a random vector v may result in anincrease in ||EHhv||2 as compared to ||v||2 [32]. The level ofthe increase in the norm depends on the principal angle [33]between subspaces 〈H〉 and 〈h〉: the smaller the principalangle, the higher is the expected amplification [34]. A similarphenomenon of noise increase by an LCMV beamformer ina scenario where the desired and interfering speakers arespatially close was demonstrated in [35].
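The following numerical sketch, written under arbitrary dimension assumptions, constructs $\mathbf{E}_{\mathbf{Hh}}$ of (4) for randomly drawn $\mathbf{H}$ and $\mathbf{h}$ and verifies the properties (5a), (5b) and (6).

```python
import numpy as np

rng = np.random.default_rng(1)
M, r = 6, 3                                    # ambient dimension, rank of <H>
H = rng.standard_normal((M, r)) + 1j * rng.standard_normal((M, r))
h = rng.standard_normal((M, 1)) + 1j * rng.standard_normal((M, 1))

I = np.eye(M)
P_h = I - (h @ h.conj().T) / (h.conj().T @ h)  # orthogonal projection, cf. (4)
E = H @ np.linalg.solve(H.conj().T @ P_h @ H, H.conj().T @ P_h)   # E_{Hh} of (4)

assert np.allclose(E @ H, H)                   # (5a): <H> is in the range
assert np.allclose(E @ h, 0)                   # (5b): <h> is in the null space
assert np.allclose(E @ E, E)                   # E is idempotent (a projection)

# (6): vectors orthogonal to both <H> and <h> are nulled as well.
Q, _ = np.linalg.qr(np.hstack([H, h]), mode="complete")
A = Q[:, r + 1:]                               # basis of the orthogonal complement
assert np.allclose(E @ A, 0)
```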

Let us quantify the noise amplification phenomenon by the following simulative study. We simulated a microphone array with 8 microphones placed in a $6 \times 4 \times 2.5$ m room with $T_{60} \approx 500$ ms. A top view of the room is depicted in Fig. 2. Two speech sources, $s_1$ and $s_2$, with RTFs denoted by $\mathbf{H}$ and $\mathbf{h}$, respectively, are located on an arc with a radius of 1.5 m, centered on the microphone array center. A noise source $v$ is also present in the environment. In Fig. 3, we present the responses of $\mathbf{E}_{\mathbf{Hh}}$ toward the considered RTFs, as well as toward the noise source, as a function of $\alpha$, the angle between the locations of $s_1$ and $s_2$. The simulative results demonstrate that applying $\mathbf{E}_{\mathbf{Hh}}$ will amplify the random noise in the environment when the principal angle between $\mathbf{H}$ and $\mathbf{h}$ is small. In the simulated scenario, noise amplification is observed for $\alpha < 3°$.

Fig. 2: Simulative setup.

Fig. 3: $\mathbf{E}_{\mathbf{Hh}}$ gains toward the sources of interest.

In the following section, we utilize the oblique projection operator for deriving the proposed BOP-SRI algorithm.

III. SUCCESSIVE RELATIVE TRANSFER FUNCTION IDENTIFICATION ALGORITHM

When multiple speech sources are concurrently active, the RTF identification techniques that assume a single speech source in a noisy environment are not valid. In the following, we propose the BOP-SRI algorithm for the identification of $\mathbf{h}_{n_0}(k)$, under the assumption that the RTFs $\mathbf{h}_n(k),\ n < n_0$, of all already active sources in the environment were previously identified. A similar challenge was considered in [36], where a single-microphone speech enhancement technique was utilized for identifying $\mathbf{h}_{n_0}(k)$. Hereafter, we address the SRI challenge by applying the BOP technique.

The idea behind the BOP algorithm is to set the range of the projection to the subspace spanned by the previously identified RTFs $\mathbf{h}_n(k),\ n < n_0$, followed by the optimization of the null space such that the power of the projected measurements is minimal. In the following, we demonstrate that minimal power is achieved by nulling out the signal from $\mathbf{h}_{n_0}(k)$. We also prove that, when $\langle \mathbf{A} \rangle$ is an empty subspace, nulling out the signal from $\mathbf{h}_{n_0}(k)$ is equivalent to the identification of $\mathbf{h}_{n_0}(k)$.

Assuming the past speech signals remain active, i.e., $A_n = 1,\ n = 1, \ldots, n_0 - 1$, the received signal (1) in frames $\ell_{n_0} < \ell < \ell_{n_0+1}$ can be formulated in matrix notation:

$$\mathbf{z}(\ell,k) = \mathbf{H}(k)\mathbf{x}(\ell,k) + \mathbf{h}_{n_0}(k)x_{n_0}(\ell,k) + \mathbf{v}(\ell,k), \qquad (7)$$

where $\mathbf{x}(\ell,k) = [x_1(\ell,k), x_2(\ell,k), \cdots, x_{n_0-1}(\ell,k)]^T$, $\mathbf{H}(k) = [\mathbf{h}_1(k), \mathbf{h}_2(k), \cdots, \mathbf{h}_{n_0-1}(k)]$, and $\mathbf{v}(\ell,k)$ is the noise. For simplicity, the frequency index is omitted hereinafter.

Let us show that $\mathbf{E}_{\mathbf{Hh}_{n_0}}$ minimizes the power of the projected measurements, under a plausible assumption of a sufficiently large SNR. Applying $\mathbf{E}_{\mathbf{Hh}_{n_0}}$ to the received signal results in

$$\mathbf{y}(\ell; \mathbf{h}_{n_0}) = \mathbf{E}_{\mathbf{Hh}_{n_0}}\mathbf{z}(\ell) \approx \mathbf{H}\mathbf{x}(\ell) + \mathbf{E}_{\mathbf{Hh}_{n_0}}\mathbf{v}(\ell). \qquad (8)$$

Since the sources $s_n,\ 0 < n \le n_0$, are mutually uncorrelated and all are uncorrelated with the noise $\mathbf{v}$, the powers of $\mathbf{z}(\ell)$ and $\mathbf{y}(\ell; \mathbf{h}_{n_0})$ are given by

$$E\{\mathbf{z}^H(\ell)\mathbf{z}(\ell)\} = E\{\mathbf{x}^H(\ell)\mathbf{H}^H\mathbf{H}\mathbf{x}(\ell)\} + E\{x^*_{n_0}(\ell)\,\mathbf{h}^H_{n_0}\mathbf{h}_{n_0}\,x_{n_0}(\ell)\} + E\{\mathbf{v}^H(\ell)\mathbf{v}(\ell)\}, \qquad (9a)$$

$$E\{\mathbf{y}^H(\ell; \mathbf{h}_{n_0})\mathbf{y}(\ell; \mathbf{h}_{n_0})\} = E\{\mathbf{x}^H(\ell)\mathbf{H}^H\mathbf{H}\mathbf{x}(\ell)\} + E\{\mathbf{v}^H(\ell)\mathbf{E}^H_{\mathbf{Hh}_{n_0}}\mathbf{E}_{\mathbf{Hh}_{n_0}}\mathbf{v}(\ell)\}, \qquad (9b)$$

respectively. Considering the possible increase in the noise power, $\Delta = E\{\mathbf{v}^H(\ell)\mathbf{E}^H_{\mathbf{Hh}_{n_0}}\mathbf{E}_{\mathbf{Hh}_{n_0}}\mathbf{v}(\ell)\} - E\{\mathbf{v}^H(\ell)\mathbf{v}(\ell)\}$, and under a plausible assumption of a sufficiently large SNR, i.e., $E\{x^*_{n_0}(\ell)\,\mathbf{h}^H_{n_0}\mathbf{h}_{n_0}\,x_{n_0}(\ell)\} > \Delta$, we deduce that the power of $\mathbf{z}(\ell)$ is higher than that of $\mathbf{y}(\ell; \mathbf{h}_{n_0})$.

Of course, $\mathbf{H}$ and $\mathbf{h}_{n_0}$ are unavailable. However, assuming that an estimator $\hat{\mathbf{H}}$ is available, we can utilize the above observation to formulate an optimization problem seeking an oblique projection $\mathbf{E}_{\hat{\mathbf{H}}\boldsymbol{\theta}}$ that minimizes the power of the projected measurements:

$$\hat{\boldsymbol{\theta}} = \arg\min_{\boldsymbol{\theta}}\ E\{\mathbf{z}^H(\ell)\mathbf{E}^H_{\hat{\mathbf{H}}\boldsymbol{\theta}}\mathbf{E}_{\hat{\mathbf{H}}\boldsymbol{\theta}}\mathbf{z}(\ell)\}, \qquad (10a)$$

$$\mathbf{E}_{\hat{\mathbf{H}}\boldsymbol{\theta}} = \hat{\mathbf{H}}\left(\hat{\mathbf{H}}^H\mathbf{P}^\perp_{\boldsymbol{\theta}}\hat{\mathbf{H}}\right)^{-1}\hat{\mathbf{H}}^H\mathbf{P}^\perp_{\boldsymbol{\theta}}, \qquad (10b)$$

$$\mathbf{P}^\perp_{\boldsymbol{\theta}} = \mathbf{I} - \boldsymbol{\theta}\left(\boldsymbol{\theta}^H\boldsymbol{\theta}\right)^{-1}\boldsymbol{\theta}^H. \qquad (10c)$$

As previously shown, applying $\mathbf{E}_{\hat{\mathbf{H}}\boldsymbol{\theta}}$ to the array measurements $\mathbf{z}(\ell)$ results in a null steering toward $\mathbf{h}_{n_0}$. However, it is straightforward to validate that, in general, $\boldsymbol{\theta}$ can be any vector from the subspace $\langle \mathbf{h}_{n_0}\,\mathbf{A} \rangle$, where $\langle \mathbf{A} \rangle$ is the subspace orthogonal to both $\langle \hat{\mathbf{H}} \rangle$ and $\langle \mathbf{h}_{n_0} \rangle$. In order to identify $\mathbf{h}_{n_0}$ by the application of (10a), $\langle \mathbf{A} \rangle$ should be an empty subspace. However, in most practical cases, the rank of $\hat{\mathbf{H}}$ is smaller than $M-1$, i.e., the number of active speakers in the room is smaller than the number of microphones minus one; hence $\langle \mathbf{A} \rangle$ is usually nonempty. Accordingly, in the case where the rank of $\hat{\mathbf{H}}$ is lower than $M-1$, it should be artificially increased prior to the application of (10a). Hence, when required, we substitute $\hat{\mathbf{H}}$ in (10b) by $\hat{\mathbf{H}}_c$. The rank of $\hat{\mathbf{H}}_c$ is set to $M-1$ by concatenating the previously estimated RTFs in $\hat{\mathbf{H}}$ with randomly generated independent vectors $\{\mathbf{r}_i\}_{i=1}^{M-n_0}$:

$$\hat{\mathbf{H}}_c = \left[\hat{\mathbf{h}}_1, \cdots, \hat{\mathbf{h}}_{n_0-1}, \mathbf{r}_1, \cdots, \mathbf{r}_{M-n_0}\right]. \qquad (11)$$

Let us substitute $\hat{\mathbf{H}}_c$ into (10b) and calculate the range and the null space of $\mathbf{E}_{\hat{\mathbf{H}}_c\boldsymbol{\theta}}$, using (5a) and (5b):

$$\mathbf{E}_{\hat{\mathbf{H}}_c\mathbf{h}}\hat{\mathbf{H}}_c = \hat{\mathbf{H}}_c\left(\hat{\mathbf{H}}_c^H\mathbf{P}^\perp_{\mathbf{h}}\hat{\mathbf{H}}_c\right)^{-1}\hat{\mathbf{H}}_c^H\mathbf{P}^\perp_{\mathbf{h}}\hat{\mathbf{H}}_c = \hat{\mathbf{H}}_c, \qquad (12a)$$

$$\mathbf{E}_{\hat{\mathbf{H}}_c\mathbf{h}}\mathbf{h} = \hat{\mathbf{H}}_c\left(\hat{\mathbf{H}}_c^H\mathbf{P}^\perp_{\mathbf{h}}\hat{\mathbf{H}}_c\right)^{-1}\hat{\mathbf{H}}_c^H\mathbf{P}^\perp_{\mathbf{h}}\mathbf{h} = \mathbf{0}. \qquad (12b)$$

Thus, by replacing $\hat{\mathbf{H}}$ in (10b) with $\hat{\mathbf{H}}_c$, we manipulate only the range of the resulting oblique projection, while the null space is left unmodified. Explicitly, an application of $\mathbf{E}_{\hat{\mathbf{H}}_c\boldsymbol{\theta}}$ to the received signal results in a distortionless response toward all columns of $\hat{\mathbf{H}}_c$. However, since the columns $[\mathbf{r}_1, \cdots, \mathbf{r}_{M-n_0}]$ are randomly generated vectors, we expect no signals in the environment to impinge on the microphones with an array manifold equal to $[\mathbf{r}_1, \cdots, \mathbf{r}_{M-n_0}]$. It should be stressed that in case one of the vectors $[\mathbf{r}_1, \cdots, \mathbf{r}_{M-n_0}]$ is randomized such that it is parallel to $\mathbf{h}_{n_0}$, the proposed algorithm will fail. However, in practice, the probability of randomly drawing a vector that is parallel to a specific RTF is very low; thus we neglect such an event. To summarize, by replacing $\hat{\mathbf{H}}$ in (10b) with $\hat{\mathbf{H}}_c$, the range of the projection is modified to include RTFs that do not contribute any energy to the received signal. Accordingly, all the above derivations and conclusions hold.

Solving (10a) with $\hat{\mathbf{H}}_c$ instead of $\hat{\mathbf{H}}$ results in $\hat{\boldsymbol{\theta}}$ being parallel to $\mathbf{h}_{n_0}$, i.e., $\hat{\boldsymbol{\theta}} \approx \delta\mathbf{h}_{n_0}$, where $\delta$ is an arbitrary gain. By definition, the first entry of the RTF $\mathbf{h}_{n_0}$ is equal to 1; hence, the estimator $\hat{\boldsymbol{\theta}}$ can be normalized to obtain an estimate of the RTF:

$$\hat{\mathbf{h}}_{n_0} = \frac{\hat{\boldsymbol{\theta}}}{\hat{\theta}_1}, \qquad (13)$$

where $\hat{\theta}_1$ is the first element of $\hat{\boldsymbol{\theta}}$.

Finding a closed-form expression for the $\hat{\boldsymbol{\theta}}$ that solves (10a) is a cumbersome task. However, we are able to derive an analytical expression for the first-order derivative of the target function $J = E\{\mathbf{y}^H(\ell;\boldsymbol{\theta})\mathbf{y}(\ell;\boldsymbol{\theta})\}$, $\mathbf{y}(\ell;\boldsymbol{\theta}) = \mathbf{E}_{\hat{\mathbf{H}}\boldsymbol{\theta}}\mathbf{z}(\ell)$, with respect to (w.r.t.) the $m$th entry of $\boldsymbol{\theta}$, $\theta_m$. Hence, we can optimize $J$ by applying a gradient descent search method, which results in the following iterative rule:

$$\theta^i_m = \theta^{i-1}_m + \mu\frac{\partial J(\boldsymbol{\theta})}{\partial\theta_m}, \quad m = 1, \cdots, M, \qquad (14)$$

with $\mu$ being the step size and $i$ the iteration index. The gradient term is obtained by calculating the derivative of $J(\boldsymbol{\theta})$ w.r.t. $\theta_m$ and is given by (see Appendix A):

$$\frac{\partial J(\boldsymbol{\theta})}{\partial\theta_m} = \mathrm{Tr}\left(\boldsymbol{\Psi}_m\mathbf{R} + \bar{\boldsymbol{\Psi}}_m\mathbf{R}^H\right), \qquad (15)$$

where $\mathbf{R} = E\{\mathbf{z}(\ell)\mathbf{y}^H(\ell;\boldsymbol{\theta})\}$ is the cross-correlation matrix between the measurements and the projected signal $\mathbf{y}(\ell;\boldsymbol{\theta})$.

The rest of the terms are defined as

$$\boldsymbol{\Psi}_m = \hat{\mathbf{H}}\boldsymbol{\Gamma}\boldsymbol{\Omega}_m\left(\hat{\mathbf{H}}\boldsymbol{\Gamma}\mathbf{P}^\perp_{\boldsymbol{\theta}} - \mathbf{I}\right), \qquad (16a)$$

$$\bar{\boldsymbol{\Psi}}_m = \left(\mathbf{P}^\perp_{\boldsymbol{\theta}}\boldsymbol{\Gamma}^H\hat{\mathbf{H}}^H - \mathbf{I}\right)\boldsymbol{\Omega}_m\boldsymbol{\Gamma}^H\hat{\mathbf{H}}^H, \qquad (16b)$$

$$\boldsymbol{\Gamma} = \left(\hat{\mathbf{H}}^H\mathbf{P}^\perp_{\boldsymbol{\theta}}\hat{\mathbf{H}}\right)^{-1}\hat{\mathbf{H}}^H, \qquad (16c)$$

$$\boldsymbol{\Omega}_m = \left(\boldsymbol{\theta}^\dagger\right)^H\mathbf{i}_m^T - \theta_m\left(\boldsymbol{\theta}^\dagger\right)^H\boldsymbol{\theta}^\dagger, \qquad (16d)$$

where $(\cdot)^\dagger$ is the pseudo-inverse operator and $\mathbf{i}_m$ is a vector whose $m$th element is equal to 1 and whose remaining elements are zeros.

Algorithm 1: BOP-SRI with active speaker counting

Initialization:
A. Utilize frames $0 < \ell \le \ell_1$ to compute $\hat{\boldsymbol{\Phi}}_{vv}$ using (17).
B. Set $\Delta_{\mathrm{EVTh}}$ (defined in Section IV-B).
C. Set $T_h$ (defined in Section IV-A).

For each frame $\mathbf{z}(\ell)$:
1. Count the number of active sources (Section IV-B):
   i. $\mathbf{z}_w(\ell) = \hat{\boldsymbol{\Phi}}_{vv,L}^{-1}\mathbf{z}(\ell)$.
   ii. $\hat{\boldsymbol{\Phi}}_{z_wz_w}(\ell) = \gamma\hat{\boldsymbol{\Phi}}_{z_wz_w}(\ell-1) + \mathbf{z}_w(\ell)\mathbf{z}_w^H(\ell)$, $\gamma < 1$.
   iii. Compute the eigenvalue decomposition of $\hat{\boldsymbol{\Phi}}_{z_wz_w}(\ell)$.
   iv. $AS(\ell)$ = number of eigenvalues larger than $\mathrm{EVTh}(\ell)$.
2. If $AS(\ell) = AS(\ell-1)$:
   i. Go back to 1.
3. If $AS(\ell) < AS(\ell-1)$:
   i. For $i = 1, \ldots, AS(\ell-1)$, search for the $x_i(\ell,k)$ with the least energy level and update $\hat{\mathbf{H}}$ accordingly.
   ii. Go back to 1.
4. If $AS(\ell) > AS(\ell-1)$:
   i. For $m = 1, \ldots, M$, set $\theta^0_m$ to a small random number.
   ii. Apply (14) for $m = 1, \ldots, M$, until convergence.
   iii. Monitor local minima (Section IV-A).
   iv. Apply (13).
   v. Output $\hat{\mathbf{h}}_{n_0}$ and go back to 1.
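The following sketch illustrates the BOP search on a synthetic single-bin scene. For brevity it replaces the closed-form gradient (15)-(16) with a numerical Wirtinger-style gradient and uses a normalized, decaying step; the scene, the surrogate RTFs, and the step-size schedule are all assumptions made for illustration. As discussed in Section IV-A, an unlucky initialization may require a restart.

```python
import numpy as np

rng = np.random.default_rng(2)
M, T = 4, 2000
h1 = np.array([1.0, 0.5 + 0.2j, -0.3j, 0.8])        # previously estimated RTF
h2 = np.array([1.0, -0.6j, 0.4 + 0.4j, -0.2])       # RTF to be identified
R = rng.standard_normal((M, M - 2)) + 1j * rng.standard_normal((M, M - 2))
Hc = np.column_stack([h1, R])                        # rank-(M-1) range, cf. (11)

x = rng.standard_normal((2, T)) + 1j * rng.standard_normal((2, T))
v = 0.03 * (rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T)))
z = np.outer(h1, x[0]) + np.outer(h2, x[1]) + v      # both speakers active, cf. (7)
Rzz = z @ z.conj().T / T                             # sample covariance of z

def J(theta):
    """Power of the projected measurements, the cost in (10a)."""
    P = np.eye(M) - np.outer(theta, theta.conj()) / (theta.conj() @ theta)
    E = Hc @ np.linalg.solve(Hc.conj().T @ P @ Hc, Hc.conj().T @ P)
    return np.real(np.trace(E @ Rzz @ E.conj().T))

theta = rng.standard_normal(M) + 1j * rng.standard_normal(M)  # random init (Sec. IV-A)
theta /= np.linalg.norm(theta)                       # J depends only on the direction
eps = 1e-6
for i in range(3000):
    g = np.empty(M, dtype=complex)                   # numerical Wirtinger gradient
    for m in range(M):
        d = np.zeros(M, dtype=complex); d[m] = eps
        g[m] = (J(theta + d) - J(theta - d)) / (2 * eps) \
             + 1j * (J(theta + 1j * d) - J(theta - 1j * d)) / (2 * eps)
    theta -= 5e-2 * 0.999 ** i * g / (np.linalg.norm(g) + 1e-12)  # decaying step
    theta /= np.linalg.norm(theta)

h2_hat = theta / theta[0]                            # normalization (13)
align = abs(h2.conj() @ theta) / (np.linalg.norm(h2) * np.linalg.norm(theta))
print(f"alignment of theta with h2: {align:.3f}")    # close to 1 on success
```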

IV. PRACTICAL CONSIDERATIONS

The proposed BOP-SRI method is presented in Algorithm 1. However, several practical aspects related to the algorithm's implementation should be considered.

A. Initialization

Since, in the general case, the cost function $J(\boldsymbol{\theta})$ is not unimodal, the iterative method (14) could become trapped in a local minimum, depending on the initial conditions. It is therefore important to initialize the search algorithm in close proximity to the global minimum. We are, however, not familiar with an easy method for computing a good starting point for the iterative optimization problem at hand. Accordingly, in the scope of this work, we adopted the following procedure. We initialized $\theta^0_m,\ m = 1, \ldots, M$, with randomly generated complex-valued numbers. During the convergence, we monitored the value of the cost function $J(\boldsymbol{\theta})$. When the difference between the value of the cost function at the final iteration and its initial value was within a predefined threshold $T_h$, we discarded that optimization cycle and re-initialized $\theta^0_m$. This initialization procedure proved efficient in terms of the resulting RTF estimation quality, as presented in Section V. However, in terms of the resulting computational load, the proposed procedure may be suboptimal. A better initialization method is, however, beyond the scope of the current contribution.

B. Activity indicator function

The proposed method assumes that the activity indicator function $I_n(\ell)$ of the sources of interest is available to the algorithm. The BOP-SRI procedure utilizes $I_{n_0}(\ell)$ to address the challenge induced by the birth of a speaker. In a practical scenario, $I_{n_0}(\ell)$ should be deduced from the measurements $\mathbf{z}(\ell,k)$. In addition, an RTF death mechanism is also required; refer to [23] for an equivalent discussion in dynamic scenarios. Source counting methods [37] may be useful for detecting the number of active sources in a specific time period. Since a simultaneous birth and death of two independent speakers seldom occurs, the BOP-SRI process is triggered when an increase in the number of active sources occurs. An RTF death mechanism may be triggered when a decrease in the number of active speakers occurs. For example, the $i$th RTF may be considered obsolete if the power of $x_i(\ell,k)$ is below a threshold for a predetermined period of time. That being said, in practical situations, where speakers may arbitrarily start and stop speaking, an RTF association mechanism [38] is likely to be required. However, such a mechanism is beyond the scope of the current work.

In the scope of this work, we employed an active source counting method based on the generalized eigenvalue decomposition (EVD) of the microphone signals' PSD matrix [16]. Namely, the PSD matrix of the stationary noise $\boldsymbol{\Phi}_{vv}$ is estimated during speech-absent periods by a sample covariance estimator:

$$\hat{\boldsymbol{\Phi}}_{vv} = \frac{1}{\ell_1}\sum_{\ell=1}^{\ell_1}\mathbf{z}(\ell)\mathbf{z}^H(\ell). \qquad (17)$$

Then, the microphone signals are whitened using $\mathbf{z}_w(\ell) = \hat{\boldsymbol{\Phi}}_{vv,L}^{-1}\mathbf{z}(\ell)$, where $\hat{\boldsymbol{\Phi}}_{vv,L}$ is the lower triangular matrix obtained by the Cholesky decomposition of the stationary noise PSD matrix estimate. Let $\boldsymbol{\Phi}_{z_wz_w}$ be the PSD matrix of the whitened measurements; using the EVD, we have $\boldsymbol{\Phi}_{z_wz_w} = \mathbf{E}\boldsymbol{\Lambda}\mathbf{E}^{-1}$, where $\mathbf{E}$ is a square matrix whose columns are the eigenvectors of $\boldsymbol{\Phi}_{z_wz_w}$ and $\boldsymbol{\Lambda}$ is a diagonal matrix whose diagonal elements are the corresponding eigenvalues of $\boldsymbol{\Phi}_{z_wz_w}$. The signal $\mathbf{z}_w(\ell)$ consists of components contributed by the active speakers in the environment and white noise. Hence, assuming the number of microphones is larger than the number of speakers, the larger eigenvalues can be attributed to the coherent signals (speech), while the smaller ones can be attributed to the spatially white signals. The number of active speech sources is then inferred by counting the number of elements in $\boldsymbol{\Lambda}$ that are above a certain threshold. In a practical scenario, considering the whitening and modeling errors, the threshold is set to $\mathrm{EVTh}(\ell) = \lambda_{\min}(\ell) + \Delta_{\mathrm{EVTh}}$, where $\lambda_{\min}(\ell)$ is the smallest eigenvalue and $\Delta_{\mathrm{EVTh}}$ is a predefined constant. The applicability of the aforementioned active source counting method is demonstrated in Section V-D.
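A minimal sketch of this counting procedure is given below. The noise-shaping matrix, the source powers, and the 20 dB value of $\Delta_{\mathrm{EVTh}}$ are synthetic choices for this toy scene (the experiment in Section V-D uses 60 dB).

```python
import numpy as np

rng = np.random.default_rng(3)
M, T_noise, T_obs = 6, 500, 500
B = np.tril(rng.standard_normal((M, M))) + M * np.eye(M)  # noise-shaping matrix

def noise(T):
    """Spatially correlated stationary noise, T frames of one frequency bin."""
    return B @ (rng.standard_normal((M, T)) + 1j * rng.standard_normal((M, T)))

# (17): sample covariance over the noise-only frames 0 < l <= l_1.
vn = noise(T_noise)
Phi_vv = vn @ vn.conj().T / T_noise
Phi_L = np.linalg.cholesky(Phi_vv)                  # Phi_vv = Phi_L Phi_L^H

# Observation frames: two active speakers with arbitrary RTFs, plus the noise.
h = rng.standard_normal((M, 2)) + 1j * rng.standard_normal((M, 2))
s = 100 * (rng.standard_normal((2, T_obs)) + 1j * rng.standard_normal((2, T_obs)))
z = h @ s + noise(T_obs)

zw = np.linalg.solve(Phi_L, z)                      # whitening by the Cholesky factor
Phi_zw = zw @ zw.conj().T / T_obs
lam_db = 10 * np.log10(np.linalg.eigvalsh(Phi_zw))  # eigenvalues, ascending, in dB
EVTh = lam_db.min() + 20                            # lambda_min + Delta_EVTh
print("active speakers:", int(np.sum(lam_db > EVTh)))   # expected output: 2
```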

V. EXPERIMENTAL STUDY

We turn now to the evaluation of the performance of the proposed BOP-SRI algorithm. The evaluation was split into two experiments. In the first experiment, we aimed at evaluating the BOP-SRI performance in various SNR and signal-to-interference ratio (SIR) scenarios. The second experiment was dedicated to exploring the speech separation performance of an LCMV beamformer computed based on the RTFs estimated by the BOP-SRI algorithm.

A. Setup and definitions

The proposed BOP-SRI algorithm was tested using a multichannel impulse response database (MIRDB)¹ measured in the speech and acoustics laboratory of the Faculty of Engineering at Bar-Ilan University [39]. The laboratory is a $6 \times 6 \times 2.4$ m room with variable reverberation times. The database consists of impulse responses relating eight microphones, arranged to form a linear array, and a loudspeaker placed at various angles from $-90°$ to $90°$ at distances of 1 and 2 m from the array. The processing was executed in the frequency domain; the STFT analysis window length was set to 4096 samples, with 75% overlap between successive frames, while the sampling frequency of the system was set to 16 kHz. The performance of the BOP-SRI algorithm was quantified by the blocking ability factor (BAF):

$$\mathrm{BAF}_n \triangleq \frac{1}{M-1}\sum_{m=2}^{M}\frac{\sigma^2_{m,n}}{\sigma^2_{m,v}}\cdot\frac{E\left\{\left[v_m(t) - \hat{h}_{m,n}(t) * v_1(t)\right]^2\right\}}{E\left\{\left[x_{m,n}(t) - \hat{h}_{m,n}(t) * x_{1,n}(t)\right]^2\right\}},$$

where $x_{m,n}(t)$ is the speech generated by $s_n(t)$ and measured by the $m$th microphone, $v_m(t)$ is the noise at the $m$th microphone, $\sigma^2_{m,n}$ is the power of $x_{m,n}(t)$, $\sigma^2_{m,v}$ is the power of $v_m(t)$, $\hat{h}_{m,n}(t)$ is the estimated (time-domain) RTF relating the first and the $m$th microphone as a response to $s_n(t)$, and $E\{[\cdot]^2\}$ denotes the power of $[\cdot]$. The blocking ability factor $\mathrm{BAF}_n$ measures the ratio between the ability to block the $n$th speech source and the inherent ability to block a random noise. The BAF has a major effect on the amount of distortion introduced by the transfer-function generalized sidelobe canceller (TF-GSC) due to desired speech leakage [10].

¹http://www.eng.biu.ac.il/gannot/downloads/
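A possible implementation of the BAF is sketched below for time-domain signals. The argument conventions (per-source speech images, noise signals, and estimated relative impulse responses given as arrays) are assumptions made for the sketch, not an interface from the paper.

```python
import numpy as np

def baf(x, v, h_hat):
    """Blocking ability factor BAF_n for one source n.
    x: (M, T) time-domain speech images of source n at the microphones;
    v: (M, T) noise at the microphones;
    h_hat: list of M estimated relative impulse responses (h_hat[0] is a unit impulse)."""
    M, T = x.shape
    total = 0.0
    for m in range(1, M):
        e_x = x[m] - np.convolve(h_hat[m], x[0])[:T]      # desired-speech leakage
        e_v = v[m] - np.convolve(h_hat[m], v[0])[:T]      # residual after noise blocking
        total += (np.mean(x[m] ** 2) / np.mean(v[m] ** 2)) \
               * np.mean(e_v ** 2) / np.mean(e_x ** 2)
    return total / (M - 1)
```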

B. BOP-SRI performance vs. SNR and SIR

We turn now to the evaluation of the BOP-SRI algorithm's performance in various SNR and SIR conditions. We utilized the MIRDB to set up a uniform linear array comprising $M = 3$ microphones with 3 cm inter-microphone spacing, capturing a mixture of three acoustic sources positioned at a distance of 1 m from the array, while $T_{60}$ was set to 160 ms. Specifically, two speech sources $s_1(t)$ and $s_2(t)$ impinged on the array with angles of arrival (AOAs) equal to $0°$ and $30°$, respectively, and a stationary fan noise $v(t)$ impinged on the array with an AOA equal to $300°$. The powers of the sources are denoted by $\sigma^2_1$, $\sigma^2_2$ and $\sigma^2_v$, respectively. The activity pattern of the sources was set such that $s_1(t)$ was activated 10 s after the start of the measurement, the time difference between the activations of $s_1(t)$ and $s_2(t)$ was 10 s, and following the activation of $s_2(t)$ both sources remained active for an additional 10 s.

Fig. 4: Blocking ability factor of $\mathbf{h}_2(k)$. $\mathbf{h}_2(k)$ was estimated using BOP-SRI and time instances where $I_1(\ell) = I_2(\ell) = 1$. To implement the BOP-SRI, we utilized $\hat{\mathbf{h}}_1(k)$, which was estimated by the CW method during time instances where $I_1(\ell) = 1,\ I_2(\ell) = 0$.

Our main goal in this experiment was to estimate the RTF of the second source, $\mathbf{h}_2(k)$, in various $\mathrm{SNR} = \sigma^2_2/\sigma^2_v$ and $\mathrm{SIR} = \sigma^2_2/\sigma^2_1$ conditions, while utilizing only the time frames where both speech sources are active, $\ell_2 < \ell$. In order to accomplish this task, we computed the PSD matrix of the noise, $\boldsymbol{\Phi}_{vv}(k)$, by utilizing the noise-only time frames $0 < \ell \le \ell_1$, followed by the application of the CW RTF estimator [16] to the signals received during the frames in which the noise and the first speaker were active, namely $\ell_1 < \ell \le \ell_2$. The CW resulted in an estimate of the first speaker's RTF, $\hat{\mathbf{h}}_1(k)$. We then artificially increased the rank of the matrix $\hat{\mathbf{H}}(k)$ by appending an arbitrary vector $\mathbf{r}$, namely, $\hat{\mathbf{H}}_c(k) = [\hat{\mathbf{h}}_1(k)\ \mathbf{r}]$. The BOP-SRI algorithm was applied by implementing (14) and (15) with $\hat{\mathbf{H}}_c(k)$ (for each frequency bin $k$) until convergence. The threshold $T_h$ was set to 10 dB and the step size $\mu$ was set to 0.1.
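For reference, a common per-bin formulation of the CW RTF estimator of [16] is sketched below: whiten the noisy-speech PSD with the Cholesky factor of the noise PSD, take the principal eigenvector, de-whiten, and normalize the first element. The function signature and the symmetrization step are illustrative assumptions, not an interface from the paper.

```python
import numpy as np

def cw_rtf(Phi_zz, Phi_vv):
    """Covariance whitening RTF estimate for one frequency bin.
    Phi_zz: PSD of noisy single-speaker frames; Phi_vv: noise-only PSD."""
    L = np.linalg.cholesky(Phi_vv)                        # Phi_vv = L L^H
    Phi_w = np.linalg.solve(L, np.linalg.solve(L, Phi_zz).conj().T)
    Phi_w = (Phi_w + Phi_w.conj().T) / 2                  # enforce Hermitian symmetry
    lam, U = np.linalg.eigh(Phi_w)                        # ascending eigenvalues
    h = L @ U[:, -1]                                      # de-whiten the principal vector
    return h / h[0]                                       # first RTF entry is 1, cf. (2)
```

Applied per frequency bin, this routine would produce the estimate that seeds $\hat{\mathbf{H}}_c(k)$ in (11).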

The resulting BAF of $\hat{\mathbf{h}}_2(k)$ for various SNR and SIR values is depicted in Fig. 4. As can be readily seen, both the SNR and the SIR influence the BOP-SRI estimation accuracy. It is also seen that increasing the SNR above 20 dB for each SIR results in a minor improvement in the estimation accuracy, whereas reducing the SNR below 20 dB greatly affects the performance. Considering the SIR influence, it seems that reducing the SIR from 0 dB to -10 dB has a greater effect on the estimation performance than an SIR reduction from 10 dB to 0 dB.

In Fig. 5, the blocking ability of $\hat{\mathbf{h}}_2(k)$ is presented as a function of the frequency for various SIR values, while the SNR is set to 20 dB. As expected, the resulting blocking ability is very nonuniform across the frequency range. For example, the blocking is relatively poor at the low frequencies for all the considered SIRs. The nonuniform blocking can be attributed to the spectral characteristics of the speech signals.

Fig. 5: Blocking ability frequency response of $\hat{\mathbf{h}}_2(k)$. $\hat{\mathbf{h}}_2(k)$ was estimated using BOP-SRI and time instances where $I_1(\ell) = I_2(\ell) = 1$. To implement the BOP-SRI, we utilized $\hat{\mathbf{h}}_1(k)$, which was estimated by the CW method during time instances where $I_1(\ell) = 1,\ I_2(\ell) = 0$.

In Fig. 6, we exemplify the iterative minimization process (14) by depicting the evolution of the components of the cost function

$$J_1 = E\{\mathbf{x}_1^H(\ell)\,\mathbf{E}^H_{\hat{\mathbf{H}}\boldsymbol{\theta}}\mathbf{E}_{\hat{\mathbf{H}}\boldsymbol{\theta}}\,\mathbf{x}_1(\ell)\},$$
$$J_2 = E\{\mathbf{x}_2^H(\ell)\,\mathbf{E}^H_{\hat{\mathbf{H}}\boldsymbol{\theta}}\mathbf{E}_{\hat{\mathbf{H}}\boldsymbol{\theta}}\,\mathbf{x}_2(\ell)\},$$
$$J_v = E\{\mathbf{v}^H(\ell)\,\mathbf{E}^H_{\hat{\mathbf{H}}\boldsymbol{\theta}}\mathbf{E}_{\hat{\mathbf{H}}\boldsymbol{\theta}}\,\mathbf{v}(\ell)\},$$

for a single frequency, $f = 1500$ Hz. As can be seen, the optimization is manifested through the minimization of $J_2$ until convergence is reached after $\approx 2300$ iterations. It is also seen that, as expected, $J_1$ remains constant throughout the minimization process, since $\hat{\mathbf{h}}_1(k)$ is in the range of $\mathbf{E}_{\hat{\mathbf{H}}_c\boldsymbol{\theta}}$. Fig. 6 also demonstrates the noise enhancement phenomenon of $\mathbf{E}_{\hat{\mathbf{H}}_c\boldsymbol{\theta}}$; this is manifested through the increase in $J_v$ during the iterative process. This stresses again the importance of the SNR to the proposed BOP-SRI algorithm. In this specific example, BOP-SRI is aimed at minimizing the cost function $J = J_1 + J_2 + J_v$. Accordingly, any decrease in $J_2$ may be masked by an increase in $J_v$; to prevent this masking effect, the level of the noise $\mathbf{v}$ should be sufficiently low as compared with the level of the signal $\mathbf{x}_2$.

Fig. 6: Responses of the oblique projection $\mathbf{E}_{\hat{\mathbf{H}}\boldsymbol{\theta}}$ toward $\mathbf{x}_1(\ell)$, $\mathbf{x}_2(\ell)$ and $\mathbf{v}(\ell)$.

C. Speech extraction

In this section, we demonstrate the effectiveness of the proposed BOP-SRI algorithm in a beamforming application. Our objective was to estimate the RTFs in a multi-speaker scenario and then apply an LCMV beamformer for extracting the individual speakers from the measured mixture.

We utilized the MIRDB to implement a uniform linear array comprising $M = 8$ microphones, capturing a mixture of four acoustic sources, while $T_{60}$ was selected to be 160 ms. Specifically, three equipower speech sources $s_1(t)$, $s_2(t)$, $s_3(t)$ were generated; $s_1(t)$ and $s_2(t)$ were positioned at a distance of 1 m from the array, with AOAs equal to $0°$ and $45°$, respectively. $s_3(t)$ was positioned at a distance of 2 m from the array, with an AOA equal to $315°$. A stationary noise source $v(t)$ was positioned at a distance of 2 m with an AOA equal to $90°$; the SNR was set to 20 dB, and the noise was active throughout the experiment.

In the first part of the experiment, all the speech sources were inactive; accordingly, the signals measured during this part of the experiment were utilized for estimating the noise PSD matrix. In the second part, the speech sources were activated in a non-overlapping manner, each for a period of 10 s. The signals measured during this part of the experiment were utilized for estimating the RTFs of $s_1(t)$, $s_2(t)$, $s_3(t)$. The RTFs were estimated by applying the well-established CW RTF estimator [16], which is a valid RTF estimator in a noisy environment with a single active speech source. These RTFs are referred to in the following as $\hat{\mathbf{h}}^{\mathrm{EVD}}_1$, $\hat{\mathbf{h}}^{\mathrm{EVD}}_2$, $\hat{\mathbf{h}}^{\mathrm{EVD}}_3$, respectively. In the third part, both $s_1(t)$ and $s_2(t)$ were active during the first 10 s, while during the following 10 s all the speech sources were concurrently active. The signals measured during the first 10 s, together with $\hat{\mathbf{h}}^{\mathrm{EVD}}_1$, were utilized for estimating the RTF of $s_2(t)$ by applying the proposed BOP-SRI algorithm. This RTF is referred to in the following as $\hat{\mathbf{h}}^{\mathrm{BOP}}_2$. The signals measured during the next 10 s, as well as $\hat{\mathbf{h}}^{\mathrm{EVD}}_1$ and $\hat{\mathbf{h}}^{\mathrm{BOP}}_2$, were utilized for estimating the RTF of $s_3(t)$ by applying the proposed BOP-SRI algorithm. This RTF is referred to in the following as $\hat{\mathbf{h}}^{\mathrm{BOP}}_3$.

The above-mentioned estimators facilitated the implementation of an LCMV beamformer aimed at extracting the desired speech source from the measurements. The beamformers are referred to in the following as $\mathbf{w}^{\mathrm{est}}_n$, where $\mathrm{est} \in \{\mathrm{EVD}, \mathrm{BOP}, \mathrm{DOA}\}$ and $n \in \{1, 2, 3\}$. The constraint set of an LCMV beamformer marked with a superscript EVD was formulated by utilizing $[\hat{\mathbf{h}}^{\mathrm{EVD}}_1, \hat{\mathbf{h}}^{\mathrm{EVD}}_2, \hat{\mathbf{h}}^{\mathrm{EVD}}_3]$, while a superscript BOP indicates that the constraints of the beamformer were formulated by utilizing $[\hat{\mathbf{h}}^{\mathrm{EVD}}_1, \hat{\mathbf{h}}^{\mathrm{BOP}}_2, \hat{\mathbf{h}}^{\mathrm{BOP}}_3]$. A superscript DOA means that the constraints of the beamformer were formulated using directional array manifolds steered toward the known DOAs of the sources. A subscript $n$ in $\mathbf{w}^{\mathrm{est}}_n$ indicates a beamformer with a distortionless response to $\mathbf{x}_n$ and zero response to the other two speech sources in the room, where $\mathbf{x}_n$ is the image of $s_n$ as measured by the microphones.

TABLE I: Linearly constrained minimum variance beamformer gains toward the sources of interest.

             x1 gain [dB]   x2 gain [dB]   x3 gain [dB]
w_1^EVD          0.22          -22.22         -17.95
w_1^BOP          0.11          -14.78         -11.05
w_1^DOA         -0.8            -2.77          -1.39
w_2^EVD        -22.48            0.15         -17.78
w_2^BOP        -24.36           -1.03         -10.65
w_2^DOA         -5.5            -2.13          -0.1
w_3^EVD        -23.77          -24.04           0.25
w_3^BOP        -26.28          -16.94           0
w_3^DOA         -4.45           -2.64          -1.47
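For completeness, the sketch below shows one standard way to form such LCMV weights from an RTF constraint matrix and a noise PSD matrix; the closed form is the textbook LCMV solution, and the function and argument names are illustrative.

```python
import numpy as np

def lcmv_weights(C, Phi_vv, n):
    """LCMV weights with a distortionless response toward the n-th RTF column
    of C and zero response toward the other columns, cf. the constraint sets above.
    C: (M, N) matrix whose columns are RTFs; Phi_vv: (M, M) noise PSD matrix."""
    g = np.zeros(C.shape[1], dtype=complex)
    g[n] = 1.0                                        # unit (distortionless) response
    PiC = np.linalg.solve(Phi_vv, C)                  # Phi_vv^{-1} C
    # w = Phi_vv^{-1} C (C^H Phi_vv^{-1} C)^{-1} g, the textbook LCMV solution;
    # by construction w^H C = g^H, i.e., unit gain toward source n, nulls elsewhere.
    return PiC @ np.linalg.solve(C.conj().T @ PiC, g)
```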

It should be stressed again that, although in the following we compare the performance of the proposed BOP-SRI algorithm to that of the well-established CW algorithm, the two algorithms address different challenges. The BOP-SRI addresses RTF estimation in a noisy multi-speaker scenario, whereas the CW method addresses RTF estimation in a noisy single-speaker scenario. In a multi-speaker scenario, the CW method results in the identification of orthogonal vectors that span the subspace of all the active speakers in the environment. These orthogonal vectors are different from the individual RTFs and, to the best of our knowledge, there is no established method available for inferring the individual RTFs from the orthogonal vectors. Since the CW addresses an easier task, it is used in the following as a bound on the BOP-SRI performance. In order to demonstrate the advantage of using BOP-SRI in a multi-speaker scenario, we also present speech separation performance results obtained by applying a directional beamformer.

The resulting gains of the LCMV beamformers toward the sources of interest are summarized in Table I. The presented scalar gains are the averaged results over all frequency bands. It can be easily verified that the performance of the $\mathbf{w}^{\mathrm{BOP}}_n$, $n = 1, 2, 3$, beamformers is inferior to that of the $\mathbf{w}^{\mathrm{EVD}}_n$, $n = 1, 2, 3$, beamformers. However, the performance of the $\mathbf{w}^{\mathrm{BOP}}_n$ beamformers is still reasonably high, and significantly higher than that of the directional beamformers calculated using the known DOAs. It should be noted that the gain of $\mathbf{w}^{\mathrm{BOP}}_1$ toward $\mathbf{x}_1$ differs from the respective gain of $\mathbf{w}^{\mathrm{EVD}}_1$. While the $\mathbf{w}^{\mathrm{EVD}}_1$ beamformer is calculated using $\hat{\mathbf{h}}^{\mathrm{EVD}}_1$, $\hat{\mathbf{h}}^{\mathrm{EVD}}_2$, $\hat{\mathbf{h}}^{\mathrm{EVD}}_3$, the $\mathbf{w}^{\mathrm{BOP}}_1$ beamformer is calculated using $\hat{\mathbf{h}}^{\mathrm{EVD}}_1$, $\hat{\mathbf{h}}^{\mathrm{BOP}}_2$, $\hat{\mathbf{h}}^{\mathrm{BOP}}_3$. This leads to the difference in the beamformers' gains. At this point, we would like to stress again that the EVD beamformers are used in this comparison as an unrealistic bound, as the estimation of these beamformers requires an oracle scenario where the sources are active in a non-overlapping manner.

In Fig. 7, Fig. 8, and Fig. 9, we present the frequency responses of the considered beamformers $\mathbf{w}^{\mathrm{est}}_n$, $\mathrm{est} \in \{\mathrm{EVD}, \mathrm{BOP}\}$, $n \in \{1, 2, 3\}$, toward the speech sources of interest, $\mathbf{x}_n$, $n = 1, 2, 3$. The figures suggest that BOP-SRI resulted in a better estimation of the RTF of $\mathbf{x}_2$ than of the RTF of $\mathbf{x}_3$. This can be attributed to the fact that $s_3$ is farther away from the microphone array than $s_2$, which results in a worse SNR. Additionally, the SIR was worse during the $\hat{\mathbf{h}}^{\mathrm{BOP}}_3$ estimation than during the $\hat{\mathbf{h}}^{\mathrm{BOP}}_2$ estimation, since during the $\hat{\mathbf{h}}^{\mathrm{BOP}}_3$ estimation both $s_1$ and $s_2$ were active, while during the estimation of $\hat{\mathbf{h}}^{\mathrm{BOP}}_2$ only $s_1$ was active.

Fig. 7: Frequency responses of the $\mathbf{w}^{\mathrm{est}}_1$, $\mathrm{est} \in \{\mathrm{EVD}, \mathrm{BOP}\}$, beamformers toward the sources of interest.

Fig. 8: Frequency responses of the $\mathbf{w}^{\mathrm{est}}_2$, $\mathrm{est} \in \{\mathrm{EVD}, \mathrm{BOP}\}$, beamformers toward the sources of interest.

Fig. 9: Frequency responses of the $\mathbf{w}^{\mathrm{est}}_3$, $\mathrm{est} \in \{\mathrm{EVD}, \mathrm{BOP}\}$, beamformers toward the sources of interest.

D. Speech enhancement in a babble noise environment

In this section, we demonstrate the effectiveness of the proposed BOP-SRI algorithm in the presence of babble noise, instead of the coherent noise sources that were considered in the previous experiments. The performance of the BOP-SRI algorithm is manifested by the application of an LCMV beamformer that extracts the speaker whose RTF is estimated by the BOP-SRI.

We again utilized the MIRDB to implement a uniform linear array comprising $M = 8$ microphones. In this experiment, the reverberation time in the room was higher than in the previous experiments, $T_{60} = 360$ ms. Two equipower speech sources $s_1(t)$, $s_2(t)$ were positioned at a distance of 1 m from the array, with AOAs equal to $0°$ and $45°$, respectively. The microphone signals were further corrupted by babble noise $v(t)$, which was played through four loudspeakers positioned in the room and facing the walls, with the SNR set to 8 dB. Similarly to the previous experiment, the speech sources became successively active, while the noise was active throughout the experiment. However, unlike in the previous experiment, the activity pattern of the sources was unknown to the BOP-SRI algorithm, i.e., it was inferred by counting the number of dominant eigenvalues of the whitened measurements' PSD matrix $\boldsymbol{\Phi}_{z_wz_w}$.

For reference, similarly to the previous experiment, the $\mathbf{w}^{\mathrm{BOP}}_n$, $n = 1, 2$, performance is compared with the $\mathbf{w}^{\mathrm{EVD}}_n$, $n = 1, 2$, performance. For estimating the RTFs for the $\mathbf{w}^{\mathrm{EVD}}_n$ beamformers, each of the sources $s_1(t)$, $s_2(t)$ was recorded separately in the presence of the babble noise.

The implementation of the active speaker counting method proposed in Section IV-B is presented first. During the first 10 s, both $s_1(t)$ and $s_2(t)$ were inactive. This period was used to estimate the PSD matrix of the stationary noise, $\boldsymbol{\Phi}_{vv}$. Upon estimating $\hat{\boldsymbol{\Phi}}_{vv}$, the signal received at each time frame was whitened and the eigenvalue decomposition of $\boldsymbol{\Phi}_{z_wz_w}$ was computed. As an example, we present the eigenvalues of $\boldsymbol{\Phi}_{z_wz_w}$ during a time frame with a single active speaker in Fig. 10; a single dominant eigenvalue is clearly visible. It should also be noted that the other eigenvalues deviate from a 0 dB magnitude, which should be attributed to whitening and modeling errors. Equivalently, in Fig. 11 we present the eigenvalues of $\boldsymbol{\Phi}_{z_wz_w}$ during a time frame with two active speakers; two dominant eigenvalues are readily identified. The entire process of active speaker counting is depicted in Fig. 12, where the power level of each of the eigenvalues of $\boldsymbol{\Phi}_{z_wz_w}$ is presented as a function of time, along with the threshold $\mathrm{EVTh}(\ell)$ with $\Delta_{\mathrm{EVTh}} = 60$ dB. The number of active speakers in each time frame is inferred by counting the number of eigenvalues with a magnitude higher than $\mathrm{EVTh}(\ell)$. Finally, we present the inferred speaker activity pattern in Fig. 13.

Fig. 10: Eigenvalues of a single-active-speaker segment as a function of the frequency. A single dominant eigenvalue is readily identified.

Fig. 11: Eigenvalues of a two-active-speakers segment as a function of frequency. Two dominant eigenvalues are readily identified.

Upon inferring the speaker activity pattern from the measurements, we carried out the speech enhancement experiment in a manner similar to the previous experiment, with a single difference: the inferred speaker activity function, rather than the oracle function used in the previous experiment, was used to trigger the BOP-SRI algorithm. The resulting gains of the LCMV beamformers toward the sources of interest are presented in Table II. The scalar gains are the result of frequency averaging.
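The gain figures of Table II can be computed per frequency bin and then averaged. A minimal sketch, assuming the per-bin gain of beamformer $\mathbf{w}(k)$ toward source $n$ is $20\log_{10}|\mathbf{w}^H(k)\,\mathbf{h}_n(k)|$ with $\mathbf{h}_n(k)$ the source's RTF vector; this definition and the names are our assumptions for the example:

```python
import numpy as np

def mean_gain_db(W, H_n):
    """Frequency-averaged beamformer gain toward source n.

    W   : (K, M) beamformer weights per frequency bin
    H_n : (K, M) RTF vector of source n per frequency bin
    """
    # Per-bin gain magnitude |w(k)^H h_n(k)| (assumed definition).
    per_bin = np.abs(np.einsum('km,km->k', W.conj(), H_n))
    # Average the dB values over frequency, guarding against log of zero.
    return float(np.mean(20.0 * np.log10(np.maximum(per_bin, 1e-12))))
```

Under the distortionless constraint toward the desired source, this figure should be close to 0 dB, consistent with the first column of Table II.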

Fig. 12: Averaged eigenvalue power levels as a function of time. The bold dashed line represents $EV_{\mathrm{TH}}(\ell)$. The number of active speakers is inferred by counting the eigenvalues whose energy level is higher than $EV_{\mathrm{TH}}(\ell)$ during each time segment.

Fig. 13: Speaker activity pattern as estimated by the proposed active-speaker counting method.

Similarly to the previous experiment, the $\mathbf{w}_n^{\mathrm{BOP}}$, $n = 1, 2$, beamformers perform worse than the $\mathbf{w}_n^{\mathrm{EVD}}$, $n = 1, 2$, beamformers. However, the $\mathbf{w}_n^{\mathrm{BOP}}$ beamformers still perform reasonably well, although $\mathbf{h}_2^{\mathrm{BOP}}$ was estimated in a multi-speaker scenario in the presence of babble noise.

In Fig. 14 and Fig. 15, the frequency responses of the considered beamformers $\mathbf{w}_n^{\mathrm{est}}$, $\mathrm{est} \in \{\mathrm{EVD},\mathrm{BOP}\}$, $n \in \{1, 2\}$, toward the speech sources of interest, $x_n$, $n = 1, 2$, are presented. In addition to the conclusions already drawn, these figures suggest that the proposed BOP-SRI algorithm is applicable for the successive estimation of RTFs in the presence of babble noise and without a priori knowledge of the sources' activity pattern, which was assumed known in the previous experiment.

To examine the robustness of the proposed method to reverberation, we repeated this exact experiment with a single difference: this time, T60 = 610 mSec.


TABLE II: Linearly constrained minimum variance beamformers' gains toward the sources of interest, T60 = 360 mSec.

              x1 gain [dB]    x2 gain [dB]
  w_1^EVD         0.44           -15.4
  w_1^BOP         0.46           -10.5
  w_2^EVD       -14.54            0.370
  w_2^BOP       -16.66           -1.70

Fig. 14: Frequency responses of the $\mathbf{w}_1^{\mathrm{est}}$, $\mathrm{est} \in \{\mathrm{EVD},\mathrm{BOP}\}$, beamformers toward the sources of interest.

TABLE III: Linearly constrained minimum variance beamformers' gains toward the sources of interest, T60 = 610 mSec.

              x1 gain [dB]    x2 gain [dB]
  w_1^EVD         0.14           -14.6
  w_1^BOP         0.23           -11.0
  w_2^EVD       -15.13            0.43
  w_2^BOP       -17.83            1.5

The resulting gains of the LCMV beamformers toward the sources of interest are presented in Table III; the scalar gains are the result of frequency averaging. The gains in this scenario are similar to those obtained in the T60 = 360 mSec scenario, suggesting that the proposed BOP-SRI method remains applicable in a highly reverberant environment.

VI. SUMMARY

In this contribution, the challenge of RTF identification in a multi-speaker scenario was considered. We introduced the SRI approach, which is based on the sole assumption that sources do not become simultaneously active. In particular, we addressed the challenge of estimating the RTF of a specific speech source while assuming that the RTFs of all the other active sources in the environment were previously estimated.

Fig. 15: Frequency responses of the $\mathbf{w}_2^{\mathrm{est}}$, $\mathrm{est} \in \{\mathrm{EVD},\mathrm{BOP}\}$, beamformers toward the sources of interest.

The RTF of interest was identified by applying the BOP-SRI technique. Upon the identification of a new speech source in the environment, the BOP algorithm is applied. Applying BOP yields an oblique projection matrix that, once applied to the microphone measurements, results in a null steering toward the RTF of interest. We proved that, by artificially inflating the range of the projection matrix, the RTF of interest can be inferred. We established an experimental setup based on the MIRDB, which facilitated a performance evaluation of the proposed BOP-SRI algorithm under various SNR, SIR, and reverberation conditions. The applicability of the RTF estimated by the BOP-SRI in a multi-speaker environment was tested in a speech extraction task. We compared the performance of two sets of LCMV beamformers: the first set was calculated by utilizing the RTFs estimated in a single-speaker environment by applying the CW method, while the second set was calculated by utilizing the RTFs estimated by BOP-SRI in a multi-speaker environment. Unsurprisingly, the first set of beamformers results in a better speech extraction performance than the second set. We stress again that the beamformers estimated using the CW method serve in this study as an unrealistic bound, as their estimation requires an oracle scenario in which the sources are active in a non-overlapping manner. The proposed BOP-SRI method, in contrast, provides the flexibility to identify an RTF in a more realistic multi-speaker environment, while still attaining reasonable performance in the considered experiments. The applicability of the proposed BOP-SRI algorithm was also verified in the presence of babble noise and without a priori knowledge of the sources' activity pattern. It is worth stressing that the proposed technique was only evaluated in controlled environments, both simulated and in an actual acoustic lab. Applying the proposed technique to more complex scenarios, with arbitrary speaker activity patterns, may require more sophisticated speaker counting and RTF association algorithms.


APPENDIX A
GRADIENT DERIVATION

The gradient of the target function $J(\boldsymbol{\theta})$ w.r.t. the $m$th element of $\boldsymbol{\theta}$, $\theta_m = r_m + jc_m$, can be computed by applying the chain rule for complex-valued functions, resulting in

$$
\begin{aligned}
\frac{\partial}{\partial \theta_m} E\!\left\{\mathbf{y}^H(\ell;\boldsymbol{\theta})\,\mathbf{y}(\ell;\boldsymbol{\theta})\right\}
&= E\!\left\{\mathbf{y}^H(\ell;\boldsymbol{\theta})\,\frac{\partial \mathbf{y}(\ell;\boldsymbol{\theta})}{\partial r_m}\right\}
+ E\!\left\{\left(\frac{\partial \mathbf{y}(\ell;\boldsymbol{\theta})}{\partial r_m}\right)^{\!H}\mathbf{y}(\ell;\boldsymbol{\theta})\right\} \\
&\quad + jE\!\left\{\mathbf{y}^H(\ell;\boldsymbol{\theta})\,\frac{\partial \mathbf{y}(\ell;\boldsymbol{\theta})}{\partial c_m}\right\}
+ jE\!\left\{\left(\frac{\partial \mathbf{y}(\ell;\boldsymbol{\theta})}{\partial c_m}\right)^{\!H}\mathbf{y}(\ell;\boldsymbol{\theta})\right\}.
\end{aligned} \tag{18}
$$

Let us write the explicit form of the projected vector $\mathbf{y}(\ell;\boldsymbol{\theta})$ by utilizing (10b) and (10c):

$$
\begin{aligned}
\mathbf{y}(\ell;\boldsymbol{\theta}) &= \mathbf{E}_{\boldsymbol{\theta}}^{H}\mathbf{z}(\ell) \\
&= \mathbf{H}\left(\mathbf{H}^{H}\left(\mathbf{I} - \boldsymbol{\theta}\left(\boldsymbol{\theta}^{H}\boldsymbol{\theta}\right)^{-1}\boldsymbol{\theta}^{H}\right)\mathbf{H}\right)^{-1}
\mathbf{H}^{H}\left(\mathbf{I} - \boldsymbol{\theta}\left(\boldsymbol{\theta}^{H}\boldsymbol{\theta}\right)^{-1}\boldsymbol{\theta}^{H}\right)\mathbf{z}(\ell).
\end{aligned} \tag{19}
$$
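Expression (19) is straightforward to evaluate numerically. The following sketch (our own helper, for a single frequency bin; all names are ours) writes the factors out explicitly and is illustrative rather than the paper's implementation:

```python
import numpy as np

def projected_vector(H, theta, z):
    """Evaluate y = E_theta^H z of (19) for one frequency bin.

    H     : (M, K) matrix of previously estimated RTFs
    theta : (M,)   candidate RTF parameter vector
    z     : (M,)   microphone measurement vector
    """
    M = H.shape[0]
    th = theta.reshape(-1, 1)
    # P_theta^perp = I - theta (theta^H theta)^{-1} theta^H
    P_perp = np.eye(M) - (th @ th.conj().T) / (th.conj().T @ th)
    # Inverse of H^H P_perp H (assumed invertible)
    inv_core = np.linalg.inv(H.conj().T @ P_perp @ H)
    return H @ inv_core @ H.conj().T @ P_perp @ z
```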

The derivative of $\mathbf{y}(\ell;\boldsymbol{\theta})$ w.r.t. $\theta_m$ is computed straightforwardly; after collecting like terms, it results in the expression

$$
\frac{\partial}{\partial \theta_m}\,\mathbf{y}(\ell;\boldsymbol{\theta})
= \mathbf{H}\boldsymbol{\Gamma}\,\frac{\partial}{\partial \theta_m}\!\left[\boldsymbol{\theta}\left(\boldsymbol{\theta}^{H}\boldsymbol{\theta}\right)^{-1}\boldsymbol{\theta}^{H}\right]
\left(\mathbf{H}\boldsymbol{\Gamma}\mathbf{P}_{\boldsymbol{\theta}}^{\perp} - \mathbf{I}\right)\mathbf{z}(\ell), \tag{20}
$$

where $\mathbf{H}$ is the matrix of the previously estimated RTFs, $\mathbf{P}_{\boldsymbol{\theta}}^{\perp}$ is defined in (10c), and $\boldsymbol{\Gamma}$ is defined in (16c). The explicit expressions for the derivatives in (20) w.r.t. the real and imaginary parts of $\theta_m$ are given by

$$
\frac{\partial}{\partial r_m}\!\left[\boldsymbol{\theta}\left(\boldsymbol{\theta}^{H}\boldsymbol{\theta}\right)^{-1}\boldsymbol{\theta}^{H}\right]
= \mathbf{i}_m\left(\boldsymbol{\theta}^{H}\boldsymbol{\theta}\right)^{-1}\boldsymbol{\theta}^{H}
- 2r_m\,\boldsymbol{\theta}\left(\boldsymbol{\theta}^{H}\boldsymbol{\theta}\right)^{-1}\left(\boldsymbol{\theta}^{H}\boldsymbol{\theta}\right)^{-1}\boldsymbol{\theta}^{H}
+ \boldsymbol{\theta}\left(\boldsymbol{\theta}^{H}\boldsymbol{\theta}\right)^{-1}\mathbf{i}_m^{T}, \tag{21a}
$$

$$
\frac{\partial}{\partial c_m}\!\left[\boldsymbol{\theta}\left(\boldsymbol{\theta}^{H}\boldsymbol{\theta}\right)^{-1}\boldsymbol{\theta}^{H}\right]
= j\mathbf{i}_m\left(\boldsymbol{\theta}^{H}\boldsymbol{\theta}\right)^{-1}\boldsymbol{\theta}^{H}
- 2c_m\,\boldsymbol{\theta}\left(\boldsymbol{\theta}^{H}\boldsymbol{\theta}\right)^{-1}\left(\boldsymbol{\theta}^{H}\boldsymbol{\theta}\right)^{-1}\boldsymbol{\theta}^{H}
- j\boldsymbol{\theta}\left(\boldsymbol{\theta}^{H}\boldsymbol{\theta}\right)^{-1}\mathbf{i}_m^{T}. \tag{21b}
$$
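Equations (20), (21a), and (21b) can be transcribed directly into code. The sketch below assumes $\boldsymbol{\Gamma} = (\mathbf{H}^{H}\mathbf{P}_{\boldsymbol{\theta}}^{\perp}\mathbf{H})^{-1}\mathbf{H}^{H}$, the reading of (16c) under which (20) is dimensionally consistent with (19); this reading, and all names, are our assumptions:

```python
import numpy as np

def dy_dtheta_m(H, theta, z, m):
    """Transcription of (20) with (21a)-(21b): dy/dtheta_m = dy/dr_m + j dy/dc_m."""
    M = H.shape[0]
    th = theta.reshape(-1, 1)
    nrm = th.conj().T @ th                       # theta^H theta, a (1, 1) array
    P_perp = np.eye(M) - (th @ th.conj().T) / nrm
    Gamma = np.linalg.inv(H.conj().T @ P_perp @ H) @ H.conj().T  # assumed (16c)
    i_m = np.zeros((M, 1)); i_m[m] = 1.0
    r_m, c_m = theta[m].real, theta[m].imag

    # (21a): derivative of theta (theta^H theta)^{-1} theta^H w.r.t. r_m
    dP_dr = (i_m @ th.conj().T) / nrm \
            - 2 * r_m * (th @ th.conj().T) / nrm**2 \
            + (th @ i_m.T) / nrm
    # (21b): derivative w.r.t. c_m
    dP_dc = 1j * (i_m @ th.conj().T) / nrm \
            - 2 * c_m * (th @ th.conj().T) / nrm**2 \
            - 1j * (th @ i_m.T) / nrm

    core = H @ Gamma @ P_perp - np.eye(M)        # (H Gamma P_perp - I) of (20)
    dy_dr = H @ Gamma @ dP_dr @ core @ z
    dy_dc = H @ Gamma @ dP_dc @ core @ z
    return dy_dr + 1j * dy_dc
```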

To complete the gradient derivation, we substitute (20), (21a), and (21b) into (18) and simplify the expression using straightforward algebra, which results in (15).
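Since the simplification relies on straightforward but error-prone algebra, a central-difference check of (20) against (19) is cheap insurance. A minimal sketch reusing the `projected_vector` and `dy_dtheta_m` helpers from the sketches above (again, our own validation code, not part of the derivation):

```python
import numpy as np

def check_dy(H, theta, z, m, eps=1e-6):
    """Finite-difference validation of dy_dtheta_m against direct evaluation of (19).

    Returns the maximum absolute deviation, which should be on the order of eps^2.
    """
    e = np.zeros_like(theta); e[m] = eps
    # Central differences along the real and imaginary axes of theta_m.
    dy_dr = (projected_vector(H, theta + e, z)
             - projected_vector(H, theta - e, z)) / (2 * eps)
    dy_dc = (projected_vector(H, theta + 1j * e, z)
             - projected_vector(H, theta - 1j * e, z)) / (2 * eps)
    # Compare against the analytic expression, using the convention of (18):
    # d/dtheta_m = d/dr_m + j d/dc_m.
    return np.max(np.abs(dy_dr + 1j * dy_dc - dy_dtheta_m(H, theta, z, m)))
```

Agreement between the two evaluations for random $\mathbf{H}$, $\boldsymbol{\theta}$, and $\mathbf{z}$ gives confidence in the algebra leading to (15).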
