


Grouping Separated Frequency Components by Estimating Propagation Model Parameters in Frequency-Domain Blind Source Separation

Hiroshi Sawada, Senior Member, IEEE, Shoko Araki, Member, IEEE, Ryo Mukai, Senior Member, IEEE, Shoji Makino, Fellow, IEEE

Abstract— This paper proposes a new formulation and optimization procedure for grouping frequency components in frequency-domain blind source separation (BSS). We adopt two separation techniques, independent component analysis (ICA) and time-frequency (T-F) masking, for the frequency-domain BSS. With ICA, grouping the frequency components corresponds to aligning the permutation ambiguity of the ICA solution in each frequency bin. With T-F masking, grouping the frequency components corresponds to classifying sensor observations in the time-frequency domain for individual sources. The grouping procedure is based on estimating anechoic propagation model parameters by analyzing ICA results or sensor observations. More specifically, the time delays of arrival and attenuations from a source to all sensors are estimated for each source. The focus of this paper includes the applicability of the proposed procedure for a situation with wide sensor spacing where spatial aliasing may occur. Experimental results show that the proposed procedure effectively separates two or three sources with several sensor configurations in a real room, as long as the room reverberation is moderately low.

Index Terms— Blind source separation, convolutive mixture, frequency domain, independent component analysis, permutation problem, sparseness, time-frequency masking, time delay estimation, generalized cross correlation

I. INTRODUCTION

The technique for estimating individual source components from their mixtures at multiple sensors is known as blind source separation (BSS) [3]–[6]. With acoustic applications of BSS, such as solving a cocktail party problem, signals are generally mixed in a convolutive manner with reverberations. Let s1, . . . , sN be source signals and x1, . . . , xM be sensor observations. The convolutive mixture model is formulated as

$x_j(t) = \sum_{k=1}^{N} \sum_{l} h_{jk}(l)\, s_k(t-l), \quad j = 1, \ldots, M, \qquad (1)$

where t represents time and hjk(l) represents the impulse response from source k to sensor j. In a practical room situation, impulse responses hjk(l) can have thousands of taps even with an 8 kHz sampling rate. This makes the convolutive BSS problem very difficult compared with the BSS of simple instantaneous mixtures.

Earlier versions of this work were presented in [1] and [2] as conference papers. The authors are with NTT Communication Science Laboratories, NTT Corporation, 2-4 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0237, Japan (e-mail: [email protected]; [email protected]; [email protected]; [email protected], phone: +81-774-93-5272, fax: +81-774-93-5158). EDICS: AUD-SSEN, AUD-LMAP

An efficient and practical approach for such convolutive mixtures is frequency-domain BSS [7]–[25], where we apply a short-time Fourier transform (STFT) to the sensor observations xj(t). In the frequency domain, the convolutive mixture (1) can be approximated as an instantaneous mixture at each frequency:

$x_j(f, t) = \sum_{k=1}^{N} h_{jk}(f)\, s_k(f, t), \quad j = 1, \ldots, M, \qquad (2)$

where f represents frequency, hjk(f) is the frequency response from source k to sensor j, and sk(f, t) is the time-frequency representation of a source signal sk.

Independent component analysis (ICA) [3]–[6] is a major statistical tool for BSS. With the frequency-domain approach, ICA is employed in each frequency bin with the instantaneous mixture model (2). This makes the convergence of ICA stable and fast. However, the permutation ambiguity of the ICA solution in each frequency bin should be aligned so that the frequency components of the same source are grouped together. This is known as the permutation problem of frequency-domain BSS. Various methods have been proposed to solve this problem. Early work [7], [8] considered the smoothness of the frequency response of separation filters. For non-stationary sources such as speech, it is effective to exploit the mutual dependence of separated signals across frequencies either with simple second order correlation [9]–[12] or with higher order statistics [17], [18].

Spatial information of sources is also useful for the permutation problem, such as the direction-of-arrival of a source [12]–[14] or the ratio of the distances from a source to two sensors [15]. Our recent work [16] generalizes these methods so that the two types of geometrical information (direction and distance) are treated in a single scheme, and also the BSS system does not need to know the sensor array geometry. When we are concerned with the directions of sources, we generally prefer the sensor spacing to be no larger than half the minimum wavelength of interest to avoid the effect of spatial aliasing [26]. We typically use 4 cm sensor spacing for an 8 kHz sampling rate. However, there are cases where widely spaced sensors are used to achieve better separation for low frequencies. Or, if we increase the sampling rate, for example up to 16 kHz, to obtain better speech recognition accuracy for separated signals, spatial aliasing occurs even with 4 cm spacing. If spatial aliasing occurs at high frequencies, the ICA


solutions in these frequencies imply multiple possibilities for a source direction. Such a problem is troublesome for frequency-domain BSS, as previously pointed out [14], [27].

There is another method for frequency-domain BSS, which is based on time-frequency (T-F) masking [19]–[23]. It does not employ ICA to separate mixtures, but relies on the sparseness of source signals exhibited in time-frequency representations. The method groups sensor observations together for each source based on spatial information extracted from them. In [22], we applied a technique similar to that used with ICA [16] to classify sensor observations for T-F masking separation. From this experience, we consider the two methods, ICA-based separation and T-F masking separation, to be very similar in terms of exploiting the spatial information of sources.

Based upon the above review of previous work and related methods, this paper proposes a new formulation and optimization procedure for grouping frequency components in the context of frequency-domain BSS. Grouping frequency components corresponds to solving the permutation problem in ICA-based separation, and to classifying sensor observations in T-F masking separation. In the formulation, we use relative time delays and attenuations from sources to sensors as parameters to be estimated. The idea of parameterizing time delays and attenuations has already been proposed in previous studies [20], [21], [24], where only simple two-sensor cases were considered without the possibility of spatial aliasing. The novelty of this paper compared with these previous studies and our recent work [16], [22] can be summarized as follows:

1) Two methods of ICA-based separation and T-F masking separation are considered uniformly in terms of grouping frequency components.

2) The problem of spatial aliasing is solved by the proposed procedure, not only for ICA-based separation but also for T-F masking separation, thanks to 1).

3) It is shown that the time delay parameters in the formulation are estimated with a function similar to the Generalized Cross Correlation PHAse Transform (GCC-PHAT) function [23], [28]–[30].

And the proposed procedure inherits the attractive properties of our recently proposed approaches [16], [22]:

4) The procedure can be applied to any number of sensors, and is not limited to two sensors.

5) The complete sensor array geometry does not have to be known; only the maximum distance between sensors is needed. If the complete geometry were known, the location (direction and/or distance from the sensors) of each source could be estimated [31], [32].

This paper is organized as follows. The next section provides an overview of frequency-domain BSS. It includes both the ICA-based method and the T-F masking method. Section III presents an anechoic propagation model with the time delays and attenuations from a source to sensors, and also cost functions for grouping frequency components. Section IV proposes a procedure for optimizing the cost function for permutation alignment in ICA-based separation. Section V shows a similar optimization procedure for classifying sensor

[Fig. 1 block diagram: (a) Separation with ICA: STFT → ICA → Permutation (grouping of basis vectors) → ISTFT. (b) Separation with T-F masking: STFT → T-F masking (grouping of observation vectors) → ISTFT.]

Fig. 1. System structure of frequency-domain BSS. We consider two methods for separating the mixtures, (a) ICA and (b) T-F masking. For both methods, grouping frequency components, basis vectors or observation vectors, is the key technique discussed in this paper.

observations in T-F masking separation, together with the relationship with the GCC-PHAT function. Experimental results for various setups are summarized in Sec. VI. Section VII concludes this paper.

II. FREQUENCY-DOMAIN BSS

This section presents an overview of frequency-domain BSS. Figure 1 shows the system structure. First, the sensor observations (1) sampled at frequency fs are converted into frequency-domain time-series signals (2) by a short-time Fourier transform (STFT) of frame size L:

$x_j(f, t) \leftarrow \sum_{q=-L/2}^{L/2-1} x_j(t+q)\, \mathrm{win}(q)\, e^{-\imath 2\pi f q}, \qquad (3)$

for all discrete frequencies $f \in \{0, \frac{1}{L}f_s, \ldots, \frac{L-1}{L}f_s\}$, and for time t, which is now down-sampled with the distance of the frame shift. We denote the imaginary unit as $\imath = \sqrt{-1}$ in this paper. We typically use a window win(q) that tapers smoothly to zero at each end, such as a Hanning window $\mathrm{win}(q) = \frac{1}{2}\left(1 + \cos\frac{2\pi q}{L}\right)$.
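As a concrete illustration of the analysis step (3), the framing, windowing, and per-frame FFT can be sketched in NumPy as follows. This is a minimal sketch, not the paper's implementation: the function name, the hop size of L/4, and the convention of indexing frames from their left edge are our assumptions.

```python
import numpy as np

def stft(x, L=512, shift=128):
    """Minimal STFT sketch of Eq. (3): Hanning window of frame size L,
    frames advanced by `shift` samples (the frame shift)."""
    # Hanning window win(q) = (1 + cos(2*pi*q/L)) / 2 for q in [-L/2, L/2)
    q = np.arange(-L // 2, L // 2)
    win = 0.5 * (1.0 + np.cos(2.0 * np.pi * q / L))
    n_frames = (len(x) - L) // shift + 1
    # rows: frequency bins f = 0, fs/L, ..., (L-1)/L * fs; columns: frames
    X = np.empty((L, n_frames), dtype=complex)
    for m in range(n_frames):
        frame = x[m * shift : m * shift + L] * win
        X[:, m] = np.fft.fft(frame)
    return X
```

For a real-valued input, the resulting bins satisfy the complex-conjugate relationship between bin n and bin L−n, which is why the processing can later be restricted to the lower half of the spectrum.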

Let us rewrite (2) in a vector notation:

$\mathbf{x}(f, t) = \sum_{k=1}^{N} \mathbf{h}_k(f)\, s_k(f, t), \qquad (4)$

where $\mathbf{h}_k = [h_{1k}, \ldots, h_{Mk}]^T$ is the vector of frequency responses from source sk to all sensors, and $\mathbf{x} = [x_1, \ldots, x_M]^T$ is called an observation vector in this paper. We consider two methods for separating the mixtures, as shown in Fig. 1. They are described in the following two subsections. In either case, we can limit the set of frequencies F where the operation is performed by

$F = \left\{0, \frac{1}{L}f_s, \ldots, \frac{1}{2}f_s\right\} \qquad (5)$

due to the relationship of the complex conjugate:

$x_j\!\left(\frac{n}{L}f_s,\, t\right) = x_j^*\!\left(\frac{L-n}{L}f_s,\, t\right), \quad n = 1, \ldots, \frac{L}{2}-1. \qquad (6)$


A. Independent Component Analysis (ICA)

The first method employs complex-valued instantaneous ICA in each frequency bin f ∈ F:

y(f, t) = W(f)x(f, t), (7)

where y = [y1, . . . , yN]T is the vector of separated frequency components and W is an N×M separation matrix. There are many ICA algorithms known in the literature [3]–[6]. We do not describe these ICA algorithms in detail. More important here is to explain how to estimate the mixing situation, such as (4), from the ICA solution. We calculate a matrix A whose columns are basis vectors ai,

A = [a1, · · · ,aN ], ai = [a1i, . . . , aMi]T , (8)

in order to represent the vector x by a linear combination of the basis vectors:

$\mathbf{x}(f, t) = A(f)\,\mathbf{y}(f, t) = \sum_{i=1}^{N} \mathbf{a}_i(f)\, y_i(f, t). \qquad (9)$

If W has an inverse, the matrix A is given simply by the inverse A = W−1. Otherwise it is calculated as a least-mean-square estimator [33]

$A = E\{\mathbf{x}\mathbf{y}^H\}\left(E\{\mathbf{y}\mathbf{y}^H\}\right)^{-1},$

which minimizes E{||x − Ay||2}. The above procedure is effective only when there are enough sensors (N ≤ M). Under-determined ICA (N > M) is still difficult to solve, and we do not usually follow the above procedure, but directly estimate basis vectors ai(f), as shown in e.g. [25].
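With sample averages in place of expectations, the least-mean-square estimator above can be sketched as follows; the function name and the sample-as-columns data layout are our choices.

```python
import numpy as np

def basis_matrix(X, Y):
    """Sketch of A = E{x y^H} (E{y y^H})^-1 with sample averages.
    X: (M x T) observations, Y: (N x T) ICA outputs, one frequency bin."""
    T = X.shape[1]
    Rxy = X @ Y.conj().T / T          # sample estimate of E{x y^H}
    Ryy = Y @ Y.conj().T / T          # sample estimate of E{y y^H}
    return Rxy @ np.linalg.inv(Ryy)   # minimizes the mean of ||x - A y||^2
```

When N = M and W is invertible, this reduces exactly to A = W−1, consistent with the text.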

In any case, if ICA works well, we expect the separated components y1(f, t), . . . , yN(f, t) to be close to the original source components s1(f, t), . . . , sN(f, t) up to permutation and scaling ambiguity. Based on this, we see that a basis vector ai(f) in (9) is close to hk(f) in (4), again up to permutation and scaling ambiguity. The use of different subscripts, i and k, indicates the permutation ambiguity. They should be related by a permutation Πf : {1, . . . , N} → {1, . . . , N} for each frequency bin f as

i = Πf (k) (10)

so that the separated components yi originating from the same source sk are grouped together. Section IV presents a procedure for deciding a permutation Πf for each frequency. After permutations have been calculated, separated frequency components and basis vectors are updated by

$y_k(f, t) \leftarrow y_{\Pi_f(k)}(f, t), \quad \mathbf{a}_k(f) \leftarrow \mathbf{a}_{\Pi_f(k)}(f), \quad \forall\, k, f, t. \qquad (11)$

Next, the scaling ambiguity of the ICA solution is aligned. The exact recovery of the scaling corresponds to blind dereverberation [34], [35], which is a challenging task especially for colored sources such as speech. A much easier way has been proposed in [10], [11], [36], which involves adjusting to the observation xJ(f, t) of a selected reference sensor J ∈ {1, . . . , M}:

yk(f, t)← aJk(f)yk(f, t), ∀ k, f, t. (12)

We see in (9) that aJk(f)yk(f, t) is the part of xJ(f, t) that originates from source sk.

Finally, time-domain output signals yk(t) are calculated by applying an inverse STFT (ISTFT) to the separated frequency components yk(f, t).

B. Time-Frequency (T-F) Masking

The second method considered in this paper is based on T-F masking, in which we assume the sparseness of source signals, i.e., at most one source makes a large contribution to each time-frequency observation x(f, t). Based on this assumption, the mixture model (4) can simply be approximated as

x(f, t) = hk(f)sk(f, t), k ∈ {1, . . . , N} (13)

where the index k of the dominant source depends on each time-frequency slot (f, t).

The method classifies observation vectors x(f, t) of all time-frequency slots (f, t) into N classes so that the k-th class consists of mixtures where the k-th source is dominant. The notation

C(f, t) = k (14)

is used to represent the situation that an observation vector x(f, t) belongs to the k-th class. Section V provides a procedure for classifying observation vectors x. Once the classification is completed, time-domain separated signals yk(t) are calculated by applying an inverse STFT (ISTFT) to the following classified frequency components:

$y_k(f, t) = \begin{cases} x_J(f, t) & \text{if } C(f, t) = k, \\ 0 & \text{otherwise.} \end{cases} \qquad (15)$
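Given class labels C(f, t), the masking step (15) is a simple gather; the sketch below assumes zero-based class labels and a function name of our own choosing.

```python
import numpy as np

def mask_outputs(XJ, C, N):
    """Sketch of Eq. (15): copy the reference-sensor observation xJ(f,t)
    into output k wherever C(f,t) = k, and leave zeros elsewhere.
    XJ: (F x T) complex spectrogram of the reference sensor J,
    C:  (F x T) integer class labels in {0, ..., N-1}."""
    Y = np.zeros((N,) + XJ.shape, dtype=complex)
    for k in range(N):
        Y[k][C == k] = XJ[C == k]   # binary T-F mask for source k
    return Y
```

Because every slot is assigned to exactly one class, the N outputs sum back to the reference observation, which reflects the disjointness assumption behind T-F masking.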

C. Relationship between ICA based and T-F Masking Methods

As mentioned in the Introduction, this paper handles the cases of ICA and T-F masking uniformly in terms of grouping frequency components. Let us discuss the relationship between the two [1]. If the approximation (13) in T-F masking is satisfied, the linear combination form (9) obtained by ICA is reduced to

x(f, t) = ai(f)yi(f, t), i ∈ {1, . . . , N} (16)

where i depends on each time-frequency slot (f, t). Thus, the spatial information expressed in an observation vector x(f, t) with the approximation (13) is the same as that of the basis vector ai(f) up to scaling ambiguity, with yi(f, t) being dominant in the time-frequency slot. Therefore, we can use similar techniques for extracting spatial information from observation vectors x and basis vectors ai.

III. PROPAGATION MODEL AND COST FUNCTIONS

A. Problem Statement

The problem of grouping frequency components considered in this paper is stated as follows:

Classify all basis vectors ai(f), ∀ i, f or all observation vectors x(f, t), ∀ f, t into N groups so that each


[Fig. 2: a source and an array of sensors, with the direct path annotated by the time delay and the attenuation.]

Fig. 2. Anechoic propagation model with the time delay τjk and the attenuation λjk from source k to sensor j. The time delay τjk depends on the distance djk from source k to sensor j, and is normalized with the distance dJk of a selected reference sensor J ∈ {1, . . . , M}. The attenuation λjk has no explicit dependence on the distance, and is normalized so that the squared sum over all the sensors is 1.

group consists of frequency components originating from the same source.

Solving this problem corresponds to deciding permutations Πf in ICA-based separation, and to obtaining classification information C(f, t) in T-F masking separation, respectively.

As discussed in the previous section, from (4) and (9), basis vectors a1(f), . . . , aN(f) obtained by ICA are close to h1(f), . . . , hN(f) up to permutation and scaling ambiguity. Also from (13), an observation vector x(f, t) is a scaled version of hk(f), with k being specific to the time-frequency slot (f, t). Therefore, we see that modeling the vector hk(f) of frequency responses is an important issue as regards solving the grouping problem.

B. Propagation Model with Time Delays and Attenuations

We model the propagation from a source to a sensor with a time delay and an attenuation (Fig. 2), i.e., with an anechoic model. This model considers only direct paths from sources to sensors, even though in reality signals are mixed in a multi-path manner (1) with reverberations. Such an anechoic assumption has been used in many previous studies exploiting spatial information of sources, some of which are enumerated in the Introduction. As shown by the experimental results in Sec. VI, modeling only direct paths is still effective for a real room situation as long as the room reverberation is moderately low. With this model, we approximate the frequency response hjk(f) in (2) with

cjk(f) = λjk · exp(−ı 2πfτjk), (17)

where τjk and λjk > 0 are the time delay and attenuation from source k to sensor j, respectively. In the vector form, hk(f) in (4) is approximated with

$\mathbf{c}_k(f) = \left[\lambda_{1k}\, e^{-\imath 2\pi f \tau_{1k}},\; \ldots,\; \lambda_{Mk}\, e^{-\imath 2\pi f \tau_{Mk}}\right]^T. \qquad (18)$
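The model vector (18) is a direct transcription of the delay/attenuation parameters; as a sketch (function name and argument layout are ours):

```python
import numpy as np

def model_vector(f, tau, lam):
    """Anechoic model vector c_k(f) of Eq. (18) for one source k.
    f:   frequency in Hz,
    tau: [tau_1k, ..., tau_Mk] relative time delays in seconds,
    lam: [lambda_1k, ..., lambda_Mk] attenuations (unit-norm per Eq. (20))."""
    tau = np.asarray(tau, dtype=float)
    lam = np.asarray(lam, dtype=float)
    return lam * np.exp(-1j * 2.0 * np.pi * f * tau)
```

With the normalizations of the next paragraph (τJk = 0 at the reference sensor, unit-norm attenuations), the returned vector has unit norm and zero argument at the reference element.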

Since we cannot distinguish the phase (or amplitude) of sk(f, t) and hjk(f) of the mixture (2) in a blind scenario, the two types of parameters τjk and λjk can be considered to be relative. Thus, without loss of generality, we normalize them by

$\tau_{jk} = (d_{jk} - d_{Jk})/v, \qquad (19)$

$\sum_{j=1}^{M} \lambda_{jk}^2 = 1, \qquad (20)$

where djk is the distance from source k to sensor j (Fig. 2), and v is the propagation velocity of the signal. Normalization (19) makes τJk = 0 and arg(cJk) = 0, i.e., the relative time delay is zero at a selected reference sensor J ∈ {1, . . . , M}. Normalization (20) makes the model vector ck have unit norm ||ck|| = 1.

If we do not want to treat reference sensor J as a special case, we normalize the time delay in a more general way:

τjk = (djk − dpair(j)k)/v, (21)

where pair(j) ≠ j is the sensor that is paired with sensor j. We can arbitrarily specify the pair(·) function. An example is a simple pairing with the next sensor:

$\mathrm{pair}(j) = \begin{cases} 1 & \text{if } j = M, \\ j+1 & \text{otherwise.} \end{cases} \qquad (22)$

In either case, the normalized time delay τjk can now be considered as the time difference of arrival (TDOA) [30], [31] of source sk between sensor j and sensor J or pair(j).

C. Phase & Amplitude Normalization

As mentioned in Sec. III-A, basis vectors ai and observation vectors x have scaling (phase and amplitude) ambiguity. To align the ambiguity, we apply the same kind of normalization as discussed in the previous subsection, and then obtain phase/amplitude normalized vectors ai and x.

As regards phase ambiguity, if we follow (19), we apply

ai ← ai · exp[−ı arg(aJi)] , or (23)

x← x · exp[−ı arg(xJ )] (24)

leading to arg(aJi) = 0 or arg(xJ) = 0. If we prefer (21), we apply

aji ← aji · exp[−ı arg(apair(j)i)] , or (25)

xj ← xj · exp[−ı arg(xpair(j))], (26)

for j = 1, . . . , M to construct ai = [a1i, . . . , aMi]T or x = [x1, . . . , xM]T. Next, the amplitude ambiguity is aligned based on (20) by

ai ← ai / ||ai|| , or (27)

x← x / ||x|| (28)

leading to ||ai|| = 1 or ||x|| = 1.
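Taking the reference-sensor variant (23)/(24) with (27)/(28), the normalization applied identically to a basis vector or an observation vector can be sketched as (function name and zero-based reference index are ours):

```python
import numpy as np

def normalize(v, J=0):
    """Phase and amplitude normalization, Eqs. (23)/(24) and (27)/(28):
    rotate so element J (the reference sensor) has zero argument,
    then scale to unit norm."""
    v = np.asarray(v, dtype=complex)
    v = v * np.exp(-1j * np.angle(v[J]))   # arg(v_J) -> 0
    return v / np.linalg.norm(v)           # ||v|| -> 1
```

After this step, a normalized vector is directly comparable with the unit-norm, zero-reference-phase model vector ck(f) of (18)-(20).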

D. Cost Functions

Given that the phase and amplitude are normalized according to the above procedures, the task of grouping frequency components can be formulated as minimizing a cost function.

With ICA-based separation, the task is to determine a permutation Πf for each frequency f ∈ F that relates the subscripts i and k with (10), and to estimate parameters τjk, λjk in the model (18), so that the cost function is minimized:

$D^a(\{\tau_{jk}\}, \{\lambda_{jk}\}, \{\Pi_f\}) = \sum_{k=1}^{N} \sum_{f \in F} \left\|\mathbf{a}_i(f) - \mathbf{c}_k(f)\right\|^2 \Big|_{i = \Pi_f(k)} \qquad (29)$


Fig. 3. Arguments of a21 and a22 before permutation alignment.

where {τjk} denotes the set {τ11, . . . , τMN} of time delay parameters, and similarly for {λjk} and {Πf}.

With T-F masking separation, the task is to determine the classification C(f, t) defined in (14) for each time-frequency slot, and to estimate parameters τjk, λjk in the model (18), so that the cost function is minimized:

$D^x(\{\tau_{jk}\}, \{\lambda_{jk}\}, C) = \sum_{k=1}^{N} \sum_{C(f,t)=k} \left\|\mathbf{x}(f, t) - \mathbf{c}_k(f)\right\|^2, \qquad (30)$

where the right-hand summation is across all the time-frequency slots (f, t) that belong to the k-th class.

The cost function Da or Dx can become zero only if 1) the real mixing situation follows the assumed anechoic model (17) perfectly, and 2) the ICA is perfectly solved, or the sparseness assumption (13) is satisfied in the T-F masking case. However, in real applications, none of these conditions is perfectly satisfied. Thus, these cost functions end up with a positive value, which corresponds to the variance in the mixing situation modeling. Yet minimizing them provides a solution to the grouping problem stated in Sec. III-A.

E. Simple Example

To make the discussion here intuitively understandable, let us show a simple example performed with setup A. We have three setups (A, B and C) shown in Fig. 9, and their common experimental configurations are summarized in Table I. Setup A was a simple M = N = 2 case, but the sensor spacing was 20 cm, which induced spatial aliasing for a 16 kHz sampling rate. The example here is with ICA-based separation, and Fig. 3 shows the arguments of a21 and a22 after the normalization (23), where we set J = 1 as the reference sensor. The arguments of a1i are not shown because they are all zero. The time delays τ21 and τ22 can be estimated from these data, as we see the two lines with different slopes corresponding to τ21 and τ22. However, the following two factors complicate the time delay estimation. The first is that different symbols ('•' and '+') constitute each of the two lines, because of the permutation ambiguity of the ICA solutions. The second is the circular jumps of the lines at high frequencies, which are due to phase wrapping caused by spatial aliasing. We will explain how to group such frequency components in the next section.

IV. PERMUTATION ALIGNMENT FOR ICA RESULTS

This section presents a procedure for minimizing the cost function Da in (29), and for obtaining a permutation Πf for each frequency. Figure 4 shows the flow of the procedure. We adopt an approach that first considers only the frequency range where spatial aliasing does not occur, and then considers the whole range F.

A. For Frequencies without Spatial Aliasing

Let us first consider the lower frequency range

FL = {f : −π < 2πfτjk < π, ∀ j, k} ∩ F (31)

where we can guarantee that spatial aliasing does not occur. Let dmax be the maximum distance between the reference sensor J and any other sensor if we take (19), or between sensor pairs j and pair(j) if we take (21). Then the relative time delay is bounded by

maxj,k|τjk| ≤ dmax/v (32)

and therefore FL can be defined as

$F_L = \left\{f : 0 < f < \frac{v}{2\, d_{\max}}\right\} \cap F. \qquad (33)$

For the frequency range FL, appropriate permutations Πf can be obtained by minimizing another cost function

$D^a(\{\tau_{jk}\}, \{\lambda_{jk}\}, \{\Pi_f\}) = \sum_{k=1}^{N} \sum_{f \in F_L} \left\|\mathbf{a}_i(f) - \mathbf{c}_k\right\|^2 \Big|_{i = \Pi_f(k)} \qquad (34)$

as proposed in our previous work [16]. The cost function (34) is different from (29) in that ai(f) and ck are frequency-normalized versions of the basis vectors and the model vector. They are obtained by a procedure that divides their elements' argument by a scalar proportional to the frequency:

$\mathbf{a}_i(f) = [a_{1i}(f), \ldots, a_{Mi}(f)]^T, \quad a_{ji}(f) \leftarrow |a_{ji}(f)|\, \exp\!\left(\imath \beta\, \frac{\arg[a_{ji}(f)]}{f}\right) \qquad (35)$

and

$\mathbf{c}_k = [c_{1k}, \ldots, c_{Mk}]^T = \left[\lambda_{1k}\, e^{-\imath 2\pi\beta\tau_{1k}},\; \ldots,\; \lambda_{Mk}\, e^{-\imath 2\pi\beta\tau_{Mk}}\right]^T, \qquad (36)$

where β is a constant scalar (its role will be discussed afterwards). Since the original model (17) has a linear phase, the above procedure removes the frequency dependency, so that the resultant model vector ck does not depend on frequency.

The advantage of introducing the frequency-normalized cost function (34) is that it can be minimized efficiently by the following clustering algorithm, similar to the k-means algorithm [37]. The algorithm iterates the following two updates until convergence:

$\Pi_f \leftarrow \underset{\Pi}{\arg\min} \sum_{k=1}^{N} \left\|\mathbf{a}_{\Pi(k)}(f) - \mathbf{c}_k\right\|^2, \quad \forall f \in F_L, \qquad (37)$

$\mathbf{c}_k \leftarrow \frac{1}{|F_L|} \sum_{f \in F_L} \mathbf{a}_i(f)\Big|_{i=\Pi_f(k)}, \quad \mathbf{c}_k \leftarrow \mathbf{c}_k / \|\mathbf{c}_k\|, \quad \forall k, \qquad (38)$

where |FL| is the number of elements (cardinality) of the set. The first update (37) optimizes the permutation Πf for


[Fig. 4 block diagram: inputs are the basis vectors and the maximum distance between sensors; the blocks include phase & amplitude normalization, frequency normalization over the frequency range without aliasing, permutation optimization, cluster centroid calculation, parameter extraction, model parameter re-estimation, and permutation optimization; the output is the permutations.]

Fig. 4. Flow of the permutation alignment procedure presented in Sec. IV, which corresponds to the grouping part of (a) separation with ICA in Fig. 1.

Fig. 5. Arguments of a21 and a22 after permutations are aligned only for the frequency range FL = {f : 0 < f < 850 Hz} ∩ F.

each frequency with the current model ck. The second update (38) calculates the most probable model ck with the current permutations.
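One possible implementation of the iterated updates (37) and (38) is sketched below, assuming the frequency-normalized basis vectors are stacked into one array; the exhaustive search over permutations, the centroid initialization from the first bin, and the fixed iteration count are our simplifications, suitable only for small N.

```python
import numpy as np
from itertools import permutations

def align_permutations(A_bar, n_iter=20):
    """Clustering sketch of updates (37)-(38).
    A_bar: (F x M x N) array; A_bar[f][:, i] is the frequency-normalized
    basis vector of output i at the f-th bin of F_L (unit norm).
    Returns per-bin permutations (perms[f][k] = Pi_f(k)) and centroids."""
    F, M, N = A_bar.shape
    C = A_bar[0].copy()                      # initialize centroids from one bin
    perms = np.zeros((F, N), dtype=int)
    for _ in range(n_iter):
        for f in range(F):                   # update (37): best permutation per bin
            perms[f] = min(
                permutations(range(N)),
                key=lambda p: sum(np.linalg.norm(A_bar[f][:, p[k]] - C[:, k]) ** 2
                                  for k in range(N)))
        for k in range(N):                   # update (38): centroid, then renormalize
            C[:, k] = np.mean([A_bar[f][:, perms[f][k]] for f in range(F)], axis=0)
            C[:, k] /= np.linalg.norm(C[:, k])
    return perms, C
```

The exhaustive minimum over N! permutations mirrors the argmin of (37) literally; for larger N a greedy or Hungarian-style assignment would be the usual substitute.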

The constant scalar β in (35) and (36) affects how much the phase part is emphasized compared to the amplitude part in the frequency-normalized vectors ai(f) and ck. In general microphone setups, time delays provide more reliable information than attenuations for distinguishing frequency components that originate from different source signals. Thus, it is advantageous to emphasize the phase part by using as large a β value as possible. However, too large a β value may cause phase wrapping. We use β = v/(4 dmax) as an appropriate value. The reason for using this value is discussed in [16].

Figure 5 shows the arguments of a21 and a22 calculated by operation (35) in the setup A experiment. For the frequency range FL, the clustering algorithm of iterating (37) and (38) was performed to decide the permutations Πf, and the subscripts were updated by (11). We see two clusters whose centroids are the two lines represented by arg(c21) and arg(c22). For frequencies higher than 850 Hz, we see that operation (35) did not work effectively because of the effect of spatial aliasing. We need another algorithm to minimize the cost function (29) for such higher frequencies.

B. For Frequencies where Spatial Aliasing may Occur

This subsection presents a procedure for deciding permutations Πf for frequencies where spatial aliasing may occur. Thus far, the frequency-normalized model ck has been calculated by (38), and it contains the model parameters τjk, λjk as shown in (36). They can be extracted from the elements of ck

Fig. 6. Arguments of a21 and a22 after permutation alignment using modelparameters estimated with low frequency range FL data. Because τ21 andτ22 are not precisely estimated, there are some permutation errors at highfrequencies.

as

τjk = −arg(cjk)2πβ

, λjk = |cjk|, ∀ j, k. (39)

A simple way of deciding permutations for higher frequencies is to use these extracted parameters for the vector form ck(f) in (18) and calculate a permutation Πf based on the original cost function (29) with

Πf ← argmin_Π Σ_{k=1}^{N} ||a_Π(k)(f) − ck(f)||^2 , ∀ f ∈ F. (40)

However, τjk and λjk estimated only with frequencies in FL may not be very accurate. Figure 6 shows arg(a21) and arg(a22) after the permutations had been calculated by (40) using the model parameters extracted by (39). We see some estimation error for τ21 and τ22, as the data (shown with marks '•' and '+') are not lined up along the model lines (shown as dashed lines) at high frequencies.

A better way is to re-estimate parameters τjk and λjk by minimizing the original cost function Da in (29), where the frequency range is not limited to FL. In our earlier work [2], we used a gradient descent approach to refine these parameters, where we needed to carefully select a step size parameter that guaranteed a stable convergence. In this paper, we adopt the following direct approach instead. With a simple mathematical manipulation (see Appendix VIII-A), the cost function Da becomes

Σ_{k=1}^{N} Σ_{f∈F} Σ_{j=1}^{M} { 1/M + λjk^2 − 2 λjk Re[aji(f) e^(ı2πfτjk)] |_{i=Πf(k)} } (41)


Fig. 7. Arguments of a21 and a22 after permutation alignment using model parameters re-estimated with data from the whole frequency range F. Now τ21 and τ22 are precisely estimated, and permutations are aligned correctly.

where Re[·] takes the real part of a complex number. Thus, the optimum time delay τjk for minimizing the cost function with the current permutations Πf is given by

τjk ← argmax_τ Σ_{f∈F} Re[aji(f) e^(ı2πfτ)] |_{i=Πf(k)} , ∀ j, k. (42)

And the optimum attenuation λjk with the current permutations Πf and the delay parameter τjk is given by

λjk ← (1/|F|) Σ_{f∈F} Re[aji(f) e^(ı2πfτjk)] |_{i=Πf(k)} , ∀ j, k. (43)

This is because the gradient of (41) with respect to λjk is

∂Da/∂λjk = 2 Σ_{f∈F} { λjk − Re[aji(f) e^(ı2πfτjk)] |_{i=Πf(k)} }

and setting the gradient to zero gives equation (43). We can iteratively update Πf by (40) and τjk, λjk by (42)-(43) to obtain better estimations of the model parameters and consequently better permutations. Note that the iteration of (40) and (42)-(43) has the same structure as that of (37) and (38). Figure 7 shows arg(a21) and arg(a22) after Πf and τjk, λjk were refined by (40) and (42)-(43). We see that τ21 and τ22 were precisely estimated and the permutations were aligned correctly even for high frequencies.
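The two closed-form updates can be realized with a simple grid search over candidate delays. Below is a minimal sketch for a single sensor/source pair (j, k); the function name and grid resolution are our own choices, and the input a is assumed to already hold the permuted, normalized basis-vector elements aji(f) with i = Πf(k):

```python
import numpy as np

def estimate_tau_lambda(a, freqs, tau_grid):
    """Direct updates in the spirit of (42)-(43) for one (j, k) pair.

    a[n]     : normalized basis-vector element at frequency freqs[n] (Hz)
    tau_grid : candidate time delays in seconds
    """
    # (42)-style update: score every candidate delay over all frequencies
    phase = np.exp(2j * np.pi * np.outer(tau_grid, freqs))
    scores = (phase * a[None, :]).real.sum(axis=1)
    tau = tau_grid[np.argmax(scores)]
    # (43)-style update: mean real part after delay compensation
    lam = (a * np.exp(2j * np.pi * freqs * tau)).real.mean()
    return tau, lam
```

On anechoic data a(f) = λ e^(−ı2πfτ0), the score peaks at τ = τ0 and the recovered attenuation equals λ.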

V. CLASSIFICATION OF OBSERVATIONS FOR T-F MASKING

This section presents a procedure for minimizing the cost function Dx in (30), and for obtaining a classification C(f, t) of observation vectors x(f, t) for the T-F masking separation described in Sec. II-B.

A. Procedure

The structure of the procedure is shown in Fig. 8. It is almost the same as that of the permutation alignment (Fig. 4) presented in the last section. The modification made for T-F masking separation involves replacing the basis vectors ai (and their normalized versions) with the observation vectors x (and their normalized versions), the permutations Πf with the classification C, and "Permutation optimization" with "Classification optimization."

Let us assume here that the observation vectors x have been converted by the phase and amplitude normalization presented in Sec. III-C. For the frequency range FL where spatial aliasing does not occur, frequency normalization [22] is applied to the elements of x(f, t):

xj(f, t) ← |xj(f, t)| exp( ıβ arg[xj(f, t)] / f ) , ∀ j, f, t. (44)

With the frequency normalization, the cost function (30) is converted into

Dx({τjk}, {λjk}, C) = Σ_{k=1}^{N} Σ_{C(f,t)=k} ||x(f, t) − ck||^2 , (45)

where x = [x1, . . . , xM]^T, and the right-hand summation with C(f, t) = k is limited to the frequency range FL given by (33). The cost function Dx can be minimized efficiently by iterating the following two updates until convergence:

C(f, t) ← argmin_k ||x(f, t) − ck||^2 , ∀ f, t, (46)

ck ← (1/Nk) Σ_{C(f,t)=k} x(f, t) , ck ← ck/||ck|| , ∀ k, (47)

where Nk is the number of time-frequency slots (f, t) that satisfy C(f, t) = k.
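These two updates form a k-means-style iteration on the unit sphere. A minimal sketch follows; the farthest-point initialization and the empty-cluster guard are our own additions, not taken from the paper:

```python
import numpy as np

def classify_tf_points(X, N, n_iter=20):
    """Minimize a cost of the form (45) by iterating (46)- and (47)-style updates.

    X: complex array (P, M) of frequency-normalized, unit-norm
       observation vectors collected from the low frequency range.
    Returns labels (P,) and unit-norm centroids (N, M).
    """
    # deterministic farthest-point initialization of the centroids
    C = [X[0]]
    for _ in range(1, N):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in C], axis=0)
        C.append(X[d.argmax()])
    C = np.array(C)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = dist.argmin(axis=1)            # (46)-style assignment
        for k in range(N):                      # (47)-style centroid update
            members = X[labels == k]
            if len(members):                    # keep centroid if cluster empty
                ck = members.mean(axis=0)
                C[k] = ck / np.linalg.norm(ck)
    return labels, C
```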

For higher frequencies where spatial aliasing may occur, model parameters τjk and λjk are first extracted from ck as shown in (39), and then substituted into the vector form ck(f) in (18). Then, the classification of the observation vectors can be decided by

C(f, t) ← argmin_k ||x(f, t) − ck(f)||^2 , ∀ f, t. (48)

As with (42)-(43) for permutation alignment in the previous section, the parameters are better estimated according to the original cost function Dx in (30) by

τjk ← argmax_τ Σ_{C(f,t)=k} Re[xj(f, t) e^(ı2πfτ)] , ∀ j, k, (49)

λjk ← (1/Nk) Σ_{C(f,t)=k} Re[xj(f, t) e^(ı2πfτjk)] , ∀ j, k, (50)

where the summation with C(f, t) = k is not limited to FL but covers the whole range F. We can iteratively update C(f, t) by (48) and τjk, λjk by (49)-(50) to obtain better estimations of the model parameters and consequently a better classification.

B. Relationship to GCC-PHAT

This subsection discusses the relationship between (49) and the GCC-PHAT function [23], [28], [29]. Let us assume that only the first source s1 is active in an STFT frame centered at time t. The TDOA τ[j,J](t) of the source between sensors j and J can be estimated with the GCC-PHAT function as

τ[j,J](t) = argmax_τ Σ_f [ xj(f, t) xJ*(f, t) / |xj(f, t) xJ*(f, t)| ] e^(ı2πfτ) (51)

where the summation is over all discrete frequencies. If the same assumption holds for T-F masking separation, all the observation vectors at time frame t are classified into


(Fig. 8 flow, from the figure labels: observation vectors → phase & amplitude normalization → frequency normalization for the frequency range without aliasing → classification optimization with cluster centroid calculation → parameter extraction → model parameter re-estimation interleaved with classification optimization → classification. The maximum distance between sensors is an input to the normalization.)

Fig. 8. Flow of the classification procedure presented in Sec. V, which corresponds to the grouping part of (b) separation with T-F masking in Fig. 1.

the first one, i.e., C(f, t) = 1, ∀ f. Then, the delay parameter estimation by (49) using only this time frame reduces to

τj1 ← argmax_τ Σ_{f∈F} Re[xj(f, t) e^(ı2πfτ)] , ∀ j, (52)

where the normalized xj(f, t) can be expressed as

xj(f, t) = xj(f, t) xJ*(f, t) / ( ||x(f, t)|| · |xJ*(f, t)| )

if we follow the phase and amplitude normalization (24) and (28). The time delay τj1 can be considered as the TDOA of source s1 between sensors j and J.

We see that (51) and (52) are very similar. The summation in (51) and that in (52) have the same effect because of the conjugate relationship (6). Thus, the only difference is in the denominator part, ||x(f, t)|| or |xj(f, t)|, but this difference has very little effect on the argmax operation if we can approximate ||x(f, t)|| ≈ α · |xj(f, t)| with the same constant α for all frequencies. In [23], T-F masking separation and time delay estimation with GCC-PHAT were discussed, but there was no mathematical statement relating the two.

Based on this observation, we recognize that the iterative updates with (48) and (49) perform time delay estimation with the GCC-PHAT function by selecting frequency components of the source. The estimations τjk are improved by a better classification C(f, t) of the frequency components, and conversely the classification C(f, t) is also improved by better time delay estimations τjk.
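The GCC-PHAT estimate (51) itself can be sketched compactly. The grid-search formulation and the names below are illustrative, not a reference implementation:

```python
import numpy as np

def gcc_phat_tdoa(xj, xJ, freqs, tau_grid):
    """TDOA by a GCC-PHAT criterion like (51) from one frame's STFT coefficients.

    xj, xJ: complex coefficients of sensors j and J at the
    frequencies in freqs (Hz); tau_grid: candidate delays (s).
    """
    cross = xj * np.conj(xJ)
    cross = cross / np.abs(cross)               # PHAT weighting: keep phase only
    phase = np.exp(2j * np.pi * np.outer(tau_grid, freqs))
    scores = (phase * cross[None, :]).real.sum(axis=1)
    return tau_grid[scores.argmax()]
```

For a pure delay between the two sensors, the normalized cross-spectrum is e^(−ı2πfτ0) and the score peaks at τ = τ0, regardless of the source spectrum.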

VI. EXPERIMENTS

A. Experimental setups and evaluation measure

To verify the effectiveness of the proposed formulation and procedure, we conducted experiments with the three setups A, B and C shown in Fig. 9. They differ in the number of sources and sensors, and in the sensor spacing. The configurations common to all setups are summarized in Table I. We tested the BSS system mainly with a low reverberation time (130 ms) so that the system could exploit the spatial information of the sources accurately when grouping frequency components, but we also tested the system in more reverberant conditions to observe how the separation performance degrades as the reverberation time increases (reported in Sec. VI-E).

TABLE I
COMMON EXPERIMENTAL CONFIGURATIONS

Room size: 4.45 × 3.55 × 2.5 m
Reverberation time: RT60 = 130 ms (130-450 ms for setup A)
Sampling rate: 16 kHz
STFT frame size: 2048 points (128 ms)
STFT frame shift: 512 points (32 ms)
Source signals: speeches of 3 s
Propagation velocity: v = 340 m/s

The separation performance was evaluated in terms of signal-to-interference ratio (SIR) improvement. The improvement was calculated by OutputSIRi − InputSIRi for each output i, and we took the average over all outputs i = 1, . . . , N. These two types of SIRs are defined by

InputSIRi = 10 log10 ( Σt |Σl hJi(l) si(t−l)|^2 / Σt |Σ_{k≠i} Σl hJk(l) sk(t−l)|^2 ) (dB),

OutputSIRi = 10 log10 ( Σt |yii(t)|^2 / Σt |Σ_{k≠i} yik(t)|^2 ) (dB),

where J ∈ {1, . . . , M} is the index of a selected reference sensor, and yik(t) is the component of sk that appears at output yi(t), i.e., yi(t) = Σ_{k=1}^{N} yik(t).
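These definitions translate directly into code. In the minimal sketch below the names are our own: img[k] stands for source k's image Σl hJk(l)sk(t−l) at the reference sensor, and ycomp[k] for the component yik of source k at output yi:

```python
import numpy as np

def sir_improvement(img, ycomp, i):
    """OutputSIR_i - InputSIR_i in dB, following the definitions above."""
    others_in = sum(img[k] for k in range(len(img)) if k != i)
    input_sir = 10 * np.log10(np.sum(np.abs(img[i]) ** 2)
                              / np.sum(np.abs(others_in) ** 2))
    others_out = sum(ycomp[k] for k in range(len(ycomp)) if k != i)
    output_sir = 10 * np.log10(np.sum(np.abs(ycomp[i]) ** 2)
                               / np.sum(np.abs(others_out) ** 2))
    return output_sir - input_sir
```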

B. Main experiments

Figure 10 summarizes the experimental results with a reverberation time of 130 ms. We performed experiments with eight combinations of 3-second speeches, for pairs consisting of each method (ICA or T-F masking) and setup (A, B or C). As regards phase normalization, a reference sensor was selected (19) for setups A and B, and pairing with the next sensor (21) was employed in setup C. To observe the effect of the multi-stage procedures presented in Secs. IV and V, we measured the SIR improvements at three different stages and for two special options:

Stage I Grouping frequency components only at the low frequency range FL where spatial aliasing does not occur, by (37) and (38) for permutations Πf, or by (46) and (47) for classification C(f, t). At the remaining frequencies, the permutations or classification were random.


(Fig. 9 content: floor plans of the 4.45 m × 3.55 m room with microphone and loudspeaker positions. From the figure labels: setups A and B place microphones and loudspeakers at a height of 135 cm with a source distance of 120 cm, using a 20 cm microphone spacing in setup A and a 30 cm spacing in setup B; setup C uses four closely spaced microphones (spacings of 1.7-4 cm) with a designated reference sensor and source distances of 0.8-1 m at heights of 1-1.35 m.)

Fig. 9. Three experimental setups. Setup A: two sources and two sensors with large spacing. Setup B: three sources and three sensors with large spacing. Setup C: three sources and four sensors with small spacing. All the microphones were omni-directional.

Stage II After Stage I, grouping frequency components at the remaining high frequencies by (40) or (48) with the model parameters τjk, λjk extracted by (39), which were not so accurate because they were estimated only with the data from the low frequency range FL.

Stage III After Stage II, re-estimating the model parameters τjk, λjk by (42)-(43) with ai, or by (49)-(50) with x. This re-estimation was interleaved with grouping frequency components at the high frequencies by (40) or (48).

Only III Only the core part of Stage III was applied: grouping frequency components by interleaving (40) and (42)-(43) for permutations Πf, or (48) and (49)-(50) for classification C(f, t), starting from random initial permutations or classification.

Optimal Optimal permutations Πf or classification C(f, t) were calculated using the information on the source signals. This is not a practical solution, but it enables us to see the upper limit of the separation performance.

SIR improvements became better as the stage proceeded from I to III. This is noticeable in setups A and B, where the sensor spacing was large and the frequency range FL without spatial aliasing was very small. On the other hand, in setup C, the difference was not so large because the sensor spacing was small and the range FL occupied more than half the whole range F.

Even if only Stage III was employed with random initial permutations or classification, the results were sometimes good. In some cases, however, especially for setup B with T-F masking, the results were not good. These results show that the classification problem for T-F masking has a much larger possible solution space than the permutation problem for ICA, and it is easy to get stuck in a local minimum of the cost function Dx. Therefore, the multi-stage procedure has the advantage that it is not likely to become stuck in local minima.

Table II shows the total computational time for the BSS procedure, and also those of the ICA and Grouping sub-components depicted in Fig. 1. They are for 3-second source signals, and are averaged over the eight different source combinations. The BSS program was coded in Matlab and run on an AMD 2.4 GHz Athlon 64 processor. The computational time of the Grouping procedure was not very large, and was smaller than that of ICA. Table II also shows the average number of iterations to converge for the Grouping procedure, (40) and (42)-(43) with ICA, or (48) and (49)-(50) with T-F masking. The T-F masking grouping procedure requires more iterations than that of ICA because of the larger solution space, but it converges within a reasonable number of iterations.

TABLE II
COMPUTATIONAL TIME

                        Total    ICA     Grouping (#iterations)
Setup A, ICA            4.87 s   4.07 s  0.48 s (4.9)
Setup B, ICA            8.05 s   6.85 s  0.80 s (6.4)
Setup C, ICA            7.71 s   6.81 s  0.42 s (4.2)
Setup A, T-F masking    1.64 s   -       1.44 s (9.4)
Setup B, T-F masking    2.68 s   -       2.37 s (11.5)
Setup C, T-F masking    4.18 s   -       3.83 s (8.1)

C. Comparison with null beamforming

Let us compare the separation capability of the proposed methods (ICA and T-F masking) with that of null beamforming, which is a conventional source separation method that similarly exploits the spatial information of sources. In null beamforming, filter coefficients are designed by assuming the anechoic propagation model (17). In this sense, all three methods rely on the delay τjk and attenuation λjk parameters.

We designed the null beamformer in the frequency domain. The separation matrix W(f) in each frequency bin was given by the inverse (or Moore-Penrose pseudo-inverse if N < M) of the assumed mixing matrix

[ c11(f) . . . c1N(f) ]
[  ...    . . .   ... ]
[ cM1(f) . . . cMN(f) ]

where cjk(f) is the propagation model defined in (17). The delay τjk and attenuation λjk parameters were accurately estimated in the experiment, from the individual source contributions at the microphones for each source.
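In code, the design of W(f) from the model parameters is one line per frequency bin. A sketch under our own sign convention cjk(f) = λjk e^(−ı2πfτjk); the paper's exact model (17) is not reproduced in this section:

```python
import numpy as np

def null_beamformer_matrix(tau, lam, f):
    """Separation matrix W(f) for one frequency bin of a null beamformer.

    tau[j, k], lam[j, k]: delay (s) and attenuation from source k to
    sensor j. Returns the (pseudo-)inverse of the assumed M x N
    anechoic mixing matrix, so that W(f) @ C(f) is (close to) identity.
    """
    C = lam * np.exp(-2j * np.pi * f * tau)   # assumed mixing matrix
    return np.linalg.pinv(C)                  # exact inverse when M == N
```

Because the filter coefficients are built directly from the model, any mismatch between the model and the real room propagates straight into the separation, which matches the observation in Table III.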



Fig. 10. SIR improvements at different stages. The first and second rows correspond to ICA-based separation and T-F masking separation, respectively. The first, second, and third columns correspond to setups A, B, and C, respectively. Each dotted line shows an individual case, and a solid line with squares shows the average of the eight individual cases.

TABLE III
SIR IMPROVEMENTS (dB) WITH DIFFERENT SEPARATION METHODS

                  Anechoic  Setup A  Setup B  Setup C
Null beamforming   37.29     8.14     7.93     6.94
ICA                27.53    16.67    16.85    16.44
T-F masking        17.92    14.10    14.27    14.90

Table III reports SIR improvements with these methods for four different setups. An anechoic setup was added to the existing three setups (A, B, and C) to contrast the characteristics of the three methods. In the anechoic setup, the positions of the loudspeakers and microphones were the same as those of setup A.

We observe the following from the table. Null beamforming performs the best in the anechoic setup, but worse than the other two methods in the three real-room setups. With null beamforming, the propagation model parameters are used for designing the filter coefficients in the separation system. Thus, even a small discrepancy between the propagation model and a real room situation directly affects the separation. With ICA or T-F masking, on the other hand, the propagation model is used only for grouping separated frequency components. The discrepancy between the propagation model and a real room situation is reflected in the cost function Da or Dx as discussed in Sec. III-D. Therefore, these methods are robust to such a discrepancy if it is not very severe.

D. Comparison of ICA and T-F masking

In terms of grouping frequency components, the ICA-based and T-F masking methods have a lot in common, as discussed above. However, they are of course different in terms of the whole BSS procedure. Here we compare these two methods.

With ICA, separated frequency components are generated by the ICA formula (7). The separation matrix W(f) is designed for each frequency so that it adapts to the mixing situation (anechoic or real reverberant). This is why ICA performs well in all the setups in Table III and also in Fig. 10.

In contrast, with T-F masking, separated frequency components are simply frequency-domain sensor observations calculated by an STFT (3). How well these components are separated depends on how well the sparseness assumption (13) holds for the original source signals. In general, a speech signal follows the sparseness assumption to a certain degree, but less accurately than an anechoic situation follows the propagation model (17). This is why the SIR improvement of T-F masking for the anechoic setup saturated compared with the other two methods in Table III. It should also be noted that violation of the sparseness assumption leads to an undesirable musical noise effect.

In summary, if the number of sensors is sufficient for the number of sources, as shown in Table III, the ICA-based method performs better than the T-F masking method. However, the T-F masking approach retains a separation capability in the under-determined case, where the number of sensors is insufficient.

E. Experiments in more reverberant conditions

We also performed experiments in more reverberant conditions. The reverberation time was controlled by changing the area of cushioned wall in the room. We considered five



Fig. 11. SIR improvements with ICA-based BSS for setup A for various reverberation times (RT60 = 130, 200, 270, 320, 380, and 450 ms) and two different distances (60 and 120 cm) from the sources to the microphones. Each square shows the average SIR improvement of the eight different combinations of speech sources.

additional different reverberation times for setup A, namely 200, 270, 320, 380, and 450 ms. We also considered another distance of 60 cm from the sources to the microphones. As regards the experiments reported here, let us focus on ICA-based separation for simplicity.

Figure 11 shows SIR improvements at Stage III and also with optimal permutations. Reverberation affects the ICA solutions as well as the permutation alignment. Even with optimal permutations, the ICA separation performance degrades as the reverberation time increases. The difference between the "Optimal" and "Stage III" SIR improvements indicates the performance degradation caused by permutation misalignment. In the shorter distance case (60 cm), the degree of degradation was uniformly small for the various reverberation times. This is because the contribution of the direct path from a source to a microphone is dominant compared with those of the reverberations, and thus the situation is well approximated by the anechoic propagation model. However, with the original distance (120 cm), the degradation became large as the reverberation time became long. These results show the applicability and limitations of the proposed method for permutation alignment in more reverberant conditions as a case study.

Figure 12 shows the arguments of a21 and a22 after the permutations were aligned at Stage III, in an experiment with a reverberation time of 380 ms and a distance of 120 cm. Compared with Fig. 7 (where the reverberation time was 130 ms), we see that the basis vector elements were widely scattered around the estimated anechoic model due to the long reverberation time, and thus permutation misalignments occurred more frequently. However, the model parameters were reasonably estimated, capturing the center of the scattered samples to minimize the cost function (29).

VII. CONCLUSION

We proposed a procedure for grouping frequency components, which are basis vectors ai(f) in ICA-based separation, or observation vectors x(f, t) in T-F masking separation. The grouping result is expressed in permutations Πf for ICA-based separation, or in classification information C(f, t) for

Fig. 12. Arguments of a21 and a22 after permutations were aligned at Stage III. The room reverberation time was 380 ms and the distance from the sources to the microphones was 120 cm, which made the situation very different from the assumed anechoic model. Consequently, the samples of the arguments were widely scattered around the estimated model parameters. However, the model parameters were reasonably estimated, so the source directions can be approximately estimated together with the information about the microphone array geometry.

T-F masking separation. The grouping is decided based on the estimated parameters of the time delays τjk and attenuations λjk from the sources to the sensors. The proposed procedure interleaves the grouping of frequency components and the estimation of the parameters, with the aim of achieving better results for both. We adopt a multi-stage approach to attain a fast and robust convergence to a good solution. Experimental results show the validity of the procedure, especially when spatial aliasing occurs due to wide sensor spacing or a high sampling rate. The applicability and limitations of the proposed method under reverberant conditions are also demonstrated experimentally.

The primary objective of this work was blind source separation of acoustic sources. However, with the proposed scheme, the time delays and attenuations from the sources to the sensors are also estimated, with a function similar to that of GCC-PHAT. If we have information on the sensor array geometry, we can also estimate the locations of multiple sources. This point should also be interesting to researchers working in the field of source localization.

VIII. APPENDIX

A. Calculating and simplifying the cost functions

The squared distance ||ai − ck||^2 that appears in (29) can be transformed into

(ai − ck)^H (ai − ck) = ai^H ai + ck^H ck − ai^H ck − ck^H ai ,

where

ai^H ai = ||ai||^2 = 1 , ck^H ck = Σ_{j=1}^{M} λjk^2 = 1

from the assumptions, and

−ai^H ck − ck^H ai = −2 Re(ck^H ai) .

Thus, the minimization of the squared distance ||ai − ck||^2 is equivalent to the maximization of the real part of the inner product ck^H ai, whose calculation is less demanding in terms of computational complexity. We follow this idea in calculating the argmin operators in (37), (40), (46) and (48).
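The equivalence used here (for unit-norm vectors, ||a − c||^2 = 2 − 2 Re(c^H a), so minimizing the distance selects the same candidate as maximizing the real inner product) can be checked numerically with random complex vectors:

```python
import numpy as np

rng = np.random.default_rng(4)

def unit(v):
    return v / np.linalg.norm(v)

# random unit-norm complex vectors, as assumed in the appendix
a = unit(rng.standard_normal(4) + 1j * rng.standard_normal(4))
cs = [unit(rng.standard_normal(4) + 1j * rng.standard_normal(4))
      for _ in range(5)]

# the two selection criteria pick the same candidate centroid
by_dist = min(range(5), key=lambda k: np.linalg.norm(a - cs[k]) ** 2)
by_ip = max(range(5), key=lambda k: (np.conj(cs[k]) @ a).real)
assert by_dist == by_ip
```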


The mathematical manipulations conducted to obtain (41) were the above equations and

Re[ck^H(f) ai(f)] = Σ_{j=1}^{M} λjk Re[aji(f) e^(ı2πfτjk)] .

REFERENCES

[1] H. Sawada, S. Araki, R. Mukai, and S. Makino, "On calculating the inverse of separation matrix in frequency-domain blind source separation," in Independent Component Analysis and Blind Signal Separation, ser. LNCS, vol. 3889. Springer, 2006, pp. 691-699.

[2] ——, "Solving the permutation problem of frequency-domain BSS when spatial aliasing occurs with wide sensor spacing," in Proc. ICASSP 2006, vol. V, May 2006, pp. 77-80.

[3] T. W. Lee, Independent Component Analysis: Theory and Applications. Kluwer Academic Publishers, 1998.

[4] S. Haykin, Ed., Unsupervised Adaptive Filtering (Volume I: Blind Source Separation). John Wiley & Sons, 2000.

[5] A. Hyvarinen, J. Karhunen, and E. Oja, Independent Component Analysis. John Wiley & Sons, 2001.

[6] A. Cichocki and S. Amari, Adaptive Blind Signal and Image Processing. John Wiley & Sons, 2002.

[7] P. Smaragdis, "Blind separation of convolved mixtures in the frequency domain," Neurocomputing, vol. 22, pp. 21-34, 1998.

[8] L. Parra and C. Spence, "Convolutive blind separation of non-stationary sources," IEEE Trans. Speech Audio Processing, vol. 8, no. 3, pp. 320-327, May 2000.

[9] J. Anemuller and B. Kollmeier, "Amplitude modulation decorrelation for convolutive blind source separation," in Proc. ICA 2000, June 2000, pp. 215-220.

[10] S. Ikeda and N. Murata, "A method of ICA in time-frequency domain," in Proc. International Workshop on Independent Component Analysis and Blind Signal Separation (ICA '99), Jan. 1999, pp. 365-371.

[11] N. Murata, S. Ikeda, and A. Ziehe, "An approach to blind source separation based on temporal structure of speech signals," Neurocomputing, vol. 41, no. 1-4, pp. 1-24, Oct. 2001.

[12] H. Sawada, R. Mukai, S. Araki, and S. Makino, "A robust and precise method for solving the permutation problem of frequency-domain blind source separation," IEEE Trans. Speech Audio Processing, vol. 12, no. 5, pp. 530-538, Sept. 2004.

[13] H. Saruwatari, S. Kurita, K. Takeda, F. Itakura, T. Nishikawa, and K. Shikano, "Blind source separation combining independent component analysis and beamforming," EURASIP Journal on Applied Signal Processing, vol. 2003, no. 11, pp. 1135-1146, Nov. 2003.

[14] M. Z. Ikram and D. R. Morgan, "Permutation inconsistency in blind speech separation: Investigation and solutions," IEEE Trans. Speech Audio Processing, vol. 13, no. 1, pp. 1-13, Jan. 2005.

[15] R. Mukai, H. Sawada, S. Araki, and S. Makino, "Near-field frequency domain blind source separation for convolutive mixtures," in Proc. ICASSP 2004, vol. IV, 2004, pp. 49-52.

[16] H. Sawada, S. Araki, R. Mukai, and S. Makino, "Blind extraction of dominant target sources using ICA and time-frequency masking," IEEE Trans. Audio, Speech and Language Processing, pp. 2165-2173, Nov. 2006.

[17] A. Hiroe, "Solution of permutation problem in frequency domain ICA using multivariate probability density functions," in Proc. ICA 2006 (LNCS 3889). Springer, Mar. 2006, pp. 601-608.

[18] T. Kim, H. T. Attias, S.-Y. Lee, and T.-W. Lee, "Blind source separation exploiting higher-order frequency dependencies," IEEE Trans. Audio, Speech and Language Processing, pp. 70-79, Jan. 2007.

[19] M. Aoki, M. Okamoto, S. Aoki, H. Matsui, T. Sakurai, and Y. Kaneda, "Sound source segregation based on estimating incident angle of each frequency component of input signals acquired by multiple microphones," Acoustical Science and Technology, vol. 22, no. 2, pp. 149-157, 2001.

[20] S. Rickard, R. Balan, and J. Rosca, "Real-time time-frequency based blind source separation," in Proc. ICA 2001, Dec. 2001, pp. 651-656.

[21] O. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. Signal Processing, vol. 52, no. 7, pp. 1830-1847, July 2004.

[22] S. Araki, H. Sawada, R. Mukai, and S. Makino, "A novel blind source separation method with observation vector clustering," in Proc. 2005 International Workshop on Acoustic Echo and Noise Control (IWAENC 2005), Sept. 2005, pp. 117-120.

[23] M. Swartling, N. Grbic, and I. Claesson, "Direction of arrival estimation for multiple speakers using time-frequency orthogonal signal separation," in Proc. ICASSP 2006, vol. IV, May 2006, pp. 833-836.

[24] P. Bofill, "Underdetermined blind separation of delayed sound sources in the frequency domain," Neurocomputing, vol. 55, pp. 627-641, 2003.

[25] S. Winter, W. Kellermann, H. Sawada, and S. Makino, "MAP based underdetermined blind source separation of convolutive mixtures by hierarchical clustering and L1-norm minimization," EURASIP Journal on Advances in Signal Processing, vol. 2007, pp. 1-12, Article ID 24717, 2007.

[26] D. H. Johnson and D. E. Dudgeon, Array Signal Processing: Concepts and Techniques. Prentice-Hall, 1993.

[27] W. Kellermann, H. Buchner, and R. Aichner, "Separating convolutive mixtures with TRINICON," in Proc. ICASSP 2006, vol. V, May 2006, pp. 961-964.

[28] C. H. Knapp and G. C. Carter, "The generalized correlation method for estimation of time delay," IEEE Trans. Acoustics, Speech and Signal Processing, vol. 24, no. 4, pp. 320-327, Aug. 1976.

[29] M. Omologo and P. Svaizer, "Use of the crosspower-spectrum phase in acoustic event location," IEEE Trans. Speech Audio Processing, vol. 5, no. 3, pp. 288-292, May 1997.

[30] J. Chen, Y. Huang, and J. Benesty, "Time delay estimation," in Audio Signal Processing, Y. Huang and J. Benesty, Eds. Kluwer Academic Publishers, 2004, pp. 197-227.

[31] M. Brandstein, J. Adcock, and H. Silverman, "A closed-form location estimator for use with room environment microphone arrays," IEEE Trans. Speech Audio Processing, vol. 5, no. 1, pp. 45-50, Jan. 1997.

[32] Y. Huang, J. Benesty, and G. Elko, "Source localization," in Audio Signal Processing, Y. Huang and J. Benesty, Eds. Kluwer Academic Publishers, 2004, pp. 229-253.

[33] T. Kailath, A. H. Sayed, and B. Hassibi, Linear Estimation. Prentice Hall, 2000.

[34] T. Nakatani, K. Kinoshita, and M. Miyoshi, "Harmonicity-based blind dereverberation for single-channel speech signals," IEEE Trans. Audio, Speech and Language Processing, vol. 15, no. 1, pp. 80-95, Jan. 2007.

[35] M. Delcroix, T. Hikichi, and M. Miyoshi, "Precise dereverberation using multi-channel linear prediction," IEEE Trans. Audio, Speech and Language Processing, vol. 15, no. 2, pp. 430-440, Feb. 2007.

[36] K. Matsuoka and S. Nakashima, "Minimal distortion principle for blind source separation," in Proc. ICA 2001, Dec. 2001, pp. 722-727.

[37] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. Wiley Interscience, 2000.


Hiroshi Sawada (M'02-SM'04) received the B.E., M.E. and Ph.D. degrees in information science from Kyoto University, Kyoto, Japan, in 1991, 1993 and 2001, respectively.

He joined NTT in 1993. He is now a senior research scientist at the NTT Communication Science Laboratories. From 1993 to 2000, he was engaged in research on the computer-aided design of digital systems, logic synthesis, and computer architecture. In 2000, he stayed at the Computation Structures Group of MIT for six months. From 2002 to 2005, he taught a class on computer architecture at Doshisha University, Kyoto.

Since 2000, he has been engaged in research on signal processing, microphone arrays, and blind source separation (BSS). More specifically, he is working on frequency-domain BSS for acoustic convolutive mixtures using independent component analysis (ICA). He is an associate editor of the IEEE Transactions on Audio, Speech & Language Processing, and a member of the Audio and Electroacoustics Technical Committee of the IEEE SP Society. He was a tutorial speaker at ICASSP 2007. He serves as the publications chair of WASPAA 2007 in Mohonk, and served as an organizing committee member for ICA 2003 in Nara and the communications chair for IWAENC 2003 in Kyoto.

He is the author or co-author of three book chapters, more than 20 journal articles, and more than 80 conference papers. He received the 9th TELECOM System Technology Award for Students from the Telecommunications Advancement Foundation in 1994, and the Best Paper Award of the IEEE Circuits and Systems Society in 2000. Dr. Sawada is a senior member of the IEEE and a member of the IEICE and the ASJ.



Shoko Araki (M’01) received the B.E. and M.E. degrees from the University of Tokyo, Japan, in 1998 and 2000, respectively, and the Ph.D. degree from Hokkaido University, Japan, in 2007. In 2000, she joined NTT Communication Science Laboratories, Kyoto. Her research interests include array signal processing, blind source separation applied to speech signals, and auditory scene analysis. She was a member of the Organizing Committee of ICA 2003, the Finance Chair of IWAENC 2003, and the Registration Chair of WASPAA 2007. She received the 19th Awaya Prize from the Acoustical Society of Japan (ASJ) in 2001, the Best Paper Award of the IWAENC in 2003, the TELECOM System Technology Award from the Telecommunications Advancement Foundation in 2004, and the Academic Encouraging Prize from the Institute of Electronics, Information and Communication Engineers (IEICE) in 2005. She is a member of the IEEE, the IEICE, and the ASJ.


Ryo Mukai (A’95–M’01–SM’04) received the B.S. and M.S. degrees in information science from the University of Tokyo, Japan, in 1990 and 1992, respectively. He joined NTT in 1992. From 1992 to 2000, he was engaged in research and development of processor architecture for network service systems and distributed network systems. Since 2000, he has been with NTT Communication Science Laboratories, where he is engaged in research on blind source separation. His current research interests include digital signal processing and its applications. He is a member of the ACM, the Acoustical Society of Japan (ASJ), the Institute of Electronics, Information and Communication Engineers (IEICE), and the Information Processing Society of Japan (IPSJ). He is also a member of the Technical Committee on Blind Signal Processing of the IEEE Circuits and Systems Society, and was a member of the Organizing Committee of ICA 2003 in Nara and the Publications Chair of IWAENC 2003 in Kyoto. He received the Sato Paper Award of the ASJ in 2005 and the Paper Award of the IEICE in 2005.


Shoji Makino (A’89–M’90–SM’99–F’04) received the B.E., M.E., and Ph.D. degrees from Tohoku University, Japan, in 1979, 1981, and 1993, respectively.

He joined NTT in 1981. He is now an Executive Manager at the NTT Communication Science Laboratories. He is also a Guest Professor at Hokkaido University. His research interests include adaptive filtering technologies, the realization of acoustic echo cancellation, and blind source separation of convolutive mixtures of speech.

He received the ICA Unsupervised Learning Pioneer Award in 2006, the Paper Award of the IEICE in 2005 and 2002, the Paper Award of the ASJ in 2005 and 2002, the TELECOM System Technology Award of the TAF in 2004, the Best Paper Award of the IWAENC in 2003, the Achievement Award of the IEICE in 1997, and the Outstanding Technological Development Award of the ASJ in 1995. He is the author or co-author of more than 200 articles in journals and conference proceedings and is responsible for more than 150 patents. He was a tutorial speaker at ICASSP 2007 and a panelist at HSCMA 2005.

He is a member of both the Awards Board and the Conference Board of the IEEE SP Society. He is an Associate Editor of the IEEE Transactions on Speech and Audio Processing and an Associate Editor of the EURASIP Journal on Applied Signal Processing. He is a Guest Editor of the Special Issue of the IEEE Transactions on Audio, Speech and Language Processing and a Guest Editor of the Special Issue of the IEEE Transactions on Computers. He is a member of the Technical Committee on Audio and Electroacoustics of the IEEE SP Society and the Chair-Elect of the Technical Committee on Blind Signal Processing of the IEEE Circuits and Systems Society. He is the Chair of the Technical Committee on Engineering Acoustics of the IEICE and the ASJ. He is a member of the International IWAENC Standing Committee and a member of the International ICA Steering Committee.

He is the General Chair of WASPAA 2007 in Mohonk, and was the General Chair of IWAENC 2003 in Kyoto and the Organizing Chair of ICA 2003 in Nara.

He is an IEEE Fellow, a council member of the ASJ, and a member of the EURASIP and the IEICE.