
IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 44, NO. 11, NOVEMBER 1996

Blind Separation of Instantaneous Mixture of Sources via an Independent Component Analysis

Dinh Tuan Pham, Member, IEEE

Abstract—In this paper, we introduce a procedure for separating a multivariate distribution into nearly independent components based on minimizing a criterion defined in terms of the Kullback-Leibler distance. By replacing the unknown density with a kernel estimate, we derive useful forms of this criterion when only a sample from that distribution is available. We also compute the gradient and Hessian of our criteria for use in an iterative minimization. Setting this gradient to zero yields a set of separating functions similar to the ones considered in the source separation problem, except that here, these functions are adapted to the observed data. Finally, some simulations are given, illustrating the good performance of the method.

I. INTRODUCTION

THE problem of separation of sources has received much attention recently in the signal processing literature [1], [3]-[6], [9], [10], [14]-[16], [20] due to many possible important applications (e.g., speech analysis, telecommunication, etc.). In its simplest form, one observes K sequences X_1(t), …, X_K(t) recorded from K sensors, each observation X_i(t) being a linear combination of K′ independent sources S_1(t), …, S_{K′}(t). Thus, X(t) = AS(t), where X(t) and S(t) denote the vectors with components X_1(t), …, X_K(t) and S_1(t), …, S_{K′}(t), respectively, and A is some K × K′ matrix. The source separation problem is to recover the sources from the observations. It is often called blind because one does not assume any a priori knowledge of the structure of the sources except that they are mutually independent. In this context, it is hopeless to separate more than K sources using only K observation channels. Therefore, we shall restrict ourselves to the case where K′ ≤ K. In practice, there will usually be a small amount of perturbing noise, and since there is no possibility of distinguishing noise from sources, one is actually in the case of more sources than sensors! However, it may be argued that if the noise is weak with respect to the sources, its effect may be ignored in a first approximation: One can still expect to recover the sources with a small amount of contamination noise, and possibly some almost pure sequences of noise (or mixtures of them) if there are more sensors than sources. Thus, we may restrict ourselves to the case K′ = K, the missing sources being made up of contamination noise.

Manuscript received July 6, 1994; revised May 22, 1996. The associate editor coordinating the review of this paper and approving it for publication was Monique P. Fargues.

The author is with the Laboratory of Modeling and Computation, IMAG-C.N.R.S., B.P. 53X, 38041 Grenoble Cedex, France.

Publisher Item Identifier S 1053-587X(96)08242-6.

The possibility of noise-corrupted sources raises the issue of robustness. A statistical procedure is called robust if it still works reasonably well when the model assumptions from which it is designed are more or less violated. In this respect, it is of interest to consider the independent component analysis (ICA), which was introduced by Comon [7], [8]. This problem is closely related to the separation of sources but is actually much more general. In the ICA, there is no model; one is simply given a probability distribution P_X on ℝ^K of some random vector X, and the goal is to find a K × K matrix B such that the components of BX are as independent (in some sense) as possible. This is the theoretical (probabilistic) ICA problem. In practice, the distribution P_X is unknown and can only be estimated via an observed sequence of stationary ergodic random vectors X(t) with the same distribution P_X. Clearly, if X(t) = AS(t) with the components of S(t) being independent, then B = A^{-1} is a solution to the ICA problem. It also has been shown [8] that if no more than one component of S(t) is Gaussian, then this is the unique solution up to premultiplication with a permutation and a diagonal matrix. Thus, the solution of the ICA problem can serve to recover the sources in the source separation problem. Conversely, one can apply a source separation procedure to find a solution to the ICA. However, these two problems are distinct in their underlying assumptions and their goals. In the source separation problem, one is primarily concerned with performance: One would like to recover the sources that supposedly exist as best as one can. In the ICA problem, performance is less an issue because there may be no source at all. Of course, one may measure performance in terms of how nearly independent the reconstructed sources can be, but "near independence" is a vague concept, and there is no universally accepted measure of such "nearness." Of more importance is the robustness: The analysis must yield nearly independent components for a wide range of distributions. In this respect, it can be used for source separation in less than ideal situations. For example, one may have more sources than sensors. One can expect that an ICA would still yield the dominant sources with some contamination from the weak sources. By contrast, a source separation procedure designed under the ideal model can fail if its assumptions are not met.
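As a concrete illustration of the model just described (an addition for this transcription, not part of the original paper), the following Python sketch mixes two independent sources through a matrix A and verifies that B = A^{-1} recovers them; the sample size, source laws, and mixing matrix are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200

# Two independent sources: one uniform, one bimodal "mixed Gaussian".
S = np.vstack([rng.uniform(-1.5, 1.5, T),
               rng.normal(0.0, 0.8, T) + rng.choice([-2.0, 2.0], T)])

A = np.array([[1.0, 0.5],
              [0.5, 1.0]])        # instantaneous mixing matrix
X = A @ S                          # observations X(t) = A S(t)

B = np.linalg.inv(A)               # a separating matrix (known here)
S_hat = B @ X
print(np.allclose(S_hat, S))       # True: sources recovered exactly
```

In the blind setting, of course, A is unknown and B must be found from X alone, which is what the criterion developed below is for.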

The goal in this paper is to provide a method for source separation with not only good performance but also some robustness. The robustness here is understood in a wider sense than what is discussed above. In fact, some source separation procedures suffer from the ambiguity problem [13], [17]: The estimating equation can have more than one solution,



and as a result, the procedure can yield spurious solutions having nothing to do with the true sources. A robust method should not yield such solutions. This can be achieved by considering an ICA. In this paper, we shall introduce such an analysis based on the use of the Kullback-Leibler distance. Working in the case where the underlying distribution admits a smooth density, we are able to derive a criterion that is simple to manipulate. Our procedure is quite similar to the one obtained earlier in Pham and Garat [17] from a maximum likelihood approach, but unlike that procedure, in which the separating functions are given a priori and somewhat arbitrarily, here, they are estimated from the available data. In this way, we are able to adapt the separating functions to ensure optimal performance in the source separation model.

For ease of reading, proofs of results and some lemmas are relegated to an appendix. In addition, the superscript T and Tr will denote the transpose and the trace.

II. ICA BASED ON KULLBACK-LEIBLER DISTANCE

A. Contrast Function

Recall that in ICA one considers a probability distribution P_X on ℝ^K of some random vector X and tries to find a K × K matrix B such that the components of BX are as independent as possible. To this end, we must adopt a measure of closeness to independence. One possibility is to compare the probability distribution P_{BX} of BX with the distribution it would have if its components were independent, i.e., the product P̃_{BX} = P_{(BX)_1} × ⋯ × P_{(BX)_K}, P_{(BX)_i} denoting the distribution of the ith component of BX. To compare these distributions, we shall adopt the Kullback-Leibler distance, as it is widely used and simple to manipulate. The Kullback-Leibler distance between two probability distributions P and Q (which is also called the relative entropy of Q with respect to P) is defined as

\[
K(P, Q) = -\int \ln\frac{dQ}{dP}\, dP.
\]

This definition assumes that Q is absolutely continuous with respect to P so that dQ/dP is well defined. If this is not so, we put K(P, Q) = ∞. The interesting property of K(P, Q) is that it is nonnegative and vanishes if and only if P = Q. This comes from the relations
\[
K(P, Q) = \int \left[ \frac{dQ}{dP} - 1 - \ln \frac{dQ}{dP} \right] dP
\]
and x − 1 ≥ ln x for all x ≥ 0, with equality if and only if x = 1.
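These two properties can be checked numerically on discrete distributions (a small illustration added here, not in the paper):

```python
import numpy as np

def kl(p, q):
    # K(P, Q) = -sum p * ln(q / p); infinite when Q is not absolutely
    # continuous with respect to P on the support of P.
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.any((q == 0) & (p > 0)):
        return np.inf
    m = p > 0
    return -np.sum(p[m] * np.log(q[m] / p[m]))

p = np.array([0.2, 0.5, 0.3])
print(kl(p, [0.3, 0.4, 0.3]))   # strictly positive
print(kl(p, p))                 # exactly 0 when P = Q
```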

Our ICA consists of finding B, which minimizes K(P_{BX}, P̃_{BX}). The function C(B) = K(P_{BX}, P̃_{BX}) is a contrast as introduced in Comon [8]. It is very general in that it needs no restriction on B nor any assumption on the distribution P_X. Other contrasts, such as the sum of squared fourth-order cumulants (see [8]), require that B be restricted to the set of orthogonal matrices and that the distribution P_X admit fourth-order cumulants. However, there is a difficulty with C(B), as there is no simple way to estimate it from the data (when P_X is not known). To surmount this difficulty, we shall limit ourselves to the case where P_X

admits a density with respect to the Lebesgue measure. This is not a restrictive assumption since, in the source separation problem, most sources are continuous, and even if they are discrete, they are usually contaminated with some noise so that the observations have a density. (Note that if the sources are discrete and noise free, the support of P_X would be countable and determine the mixing matrix uniquely when enough points of it have been observed.) In the case where the distribution P_X admits a density, the contrast C(B) takes a simpler form, involving the entropy, which is given below. This form, in fact, is the same as the multivariate dependence measure considered in Joe [12].

Assume that X admits a density f_X. It is well known that BX also admits a density given by
\[
f_{BX}(s) = \frac{f_X(B^{-1}s)}{|\det B|}. \tag{2.1}
\]

The marginal density f_{(BX)_i} of (BX)_i is obtained by integrating f_{BX}(s_1, …, s_K) with respect to all variables except s_i. Here and in the sequel, the subscript i in an expression such as (BX)_i denotes the ith component of the considered vector. Thus

\[
C(B) = \int_{\mathbb{R}^K} \ln\!\left[\frac{f_{BX}(s_1,\ldots,s_K)}{\prod_{i=1}^{K} f_{(BX)_i}(s_i)}\right] f_{BX}(s_1,\ldots,s_K)\, ds_1 \cdots ds_K.
\]
After expanding the logarithm, the last term is none other than the sum of the entropies of the marginal densities of P_{BX}. The first term equals ∫_{ℝ^K}[ln f_X(x)] f_X(x) dx − ln|det B| according to (2.1) and Lemma A.1 in the Appendix. Hence,
\[
C(B) = \int_{\mathbb{R}^K} [\ln f_X(x)]\, f_X(x)\, dx - \ln|\det B| - \sum_{i=1}^{K} \int_{\mathbb{R}} [\ln f_{(BX)_i}(s)]\, f_{(BX)_i}(s)\, ds \tag{2.3}
\]

in which the first term can be ignored since it does not depend on B.
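Since the first term of (2.3) does not depend on B, the contrast can be evaluated up to that constant from −ln|det B| plus the sum of the marginal entropies. A rough sketch using histogram entropy estimates (the estimator and bin count are choices made for this illustration, not prescribed by the paper):

```python
import numpy as np

def marginal_entropy(y, bins=50):
    # Histogram estimate of the differential entropy -int f ln f of one
    # component: -sum p ln(p / width).
    counts, edges = np.histogram(y, bins=bins)
    p, w = counts / counts.sum(), np.diff(edges)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz] / w[nz]))

def contrast(B, X):
    # C(B) of (2.3) up to the B-independent constant.
    Y = B @ X
    return -np.log(abs(np.linalg.det(B))) + sum(marginal_entropy(y) for y in Y)
```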

Notes: The Kullback-Leibler distance is not symmetric. Had we defined C(B) as K(P̃_{BX}, P_{BX}) instead of K(P_{BX}, P̃_{BX}), we would not get the nice form (2.3) of C. The contrast C is invariant with respect to permutation and change of scale: C(B) is unchanged when B is premultiplied with a permutation matrix and/or with a diagonal matrix. This is easily seen from (2.3). The result also holds (but is less apparent) in the general case where the distribution P_X may not admit a density. Thus, as


expected, our ICA yields only B up to a permutation and a scaling.

B. Discussions: Link with Projection Pursuit Density Approximation

There is a strong similarity between our ICA and the projection pursuit density approximation (PPDA) of Huber [11]. Huber proposed to approximate a density f_X by a product of the form

\[
f^{(k)}(x) = f^{(0)}(x) \prod_{j=1}^{k} h_j(b_j^T x) \tag{2.4}
\]

where the b_j are given vectors, and the h_j are positive functions on ℝ. Here, k may be less than K, and the role of f^{(0)} is precisely to make f^{(k)} integrable in this case. In the case k = K, which is of interest to us, one could take f^{(0)} = 1. Huber also uses the Kullback-Leibler distance to measure the closeness of f^{(k)}(x) to f_X. He showed that for fixed b_j, this distance is minimized when h_j is proportional to the density of b_j^T X. Thus, keeping the b_j fixed and denoting by B the matrix with columns b_1, …, b_K, the best approximation to f_X of the form (2.4) with k = K and f^{(0)} = 1 is given by |det B| ∏_{j=1}^K f_{(BX)_j}[(Bx)_j] (note that the factor det B is missing in Huber's paper). The last function is precisely the density for which the components of BX are independent and have the same marginal densities as f_{BX}. Therefore, searching for the best approximation to f_X of the form (2.4) with respect to both the b_j and the h_j, j = 1, …, K, would lead to our ICA.

However, Huber's PPDA is primarily concerned with approximations to a multivariate density, whereas we are concerned with finding the most independent components. Further, Huber does not consider minimizing C(B). He proposes instead a stepwise PPDA. Letting g_X be some given approximation to f_X, one tries to improve it by multiplying it with a factor of the form h(b^T x). This yields a new approximation, and by repeating this procedure, one obtains an approximating sequence g_X^{(k)} that can be shown to converge to f_X as k → ∞. In Huber's PPDA approach, the index k will be chosen to achieve the desired accuracy, but only for k = K and g_X^{(k)} of the form (2.4) with f^{(0)} = 1 can the random variables b_j^T X be interpreted as independent components. Even then, they are not the same as ours since we minimize globally with respect to B. This global minimization could be more costly, but with the help of Newton's algorithm based on the relative gradient and Hessian (see below), it is still reasonable.


C. Gradient of the Contrast

The gradient of the contrast is useful for its minimization and provides a set of equations for the stationary points, called the equilibrium equations. Of course, solving these equations may not lead to a minimum (even a local one). However, in practice, one constructs the solution through an iterative search in which each step moves the parameter point in a direction parallel to the gradient, or at least making an angle of less than 90° with it, so that the contrast decreases for a small enough step size. In this way, one can ensure at least convergence to a local minimum.

Instead of the gradient, we find it more convenient to work with the associated differential. This is a linear form in the (infinitesimal) increment dB that approximates C(B + dB) − C(B) with an error of higher order than dB as dB → 0. By writing this form as Tr[Ċ(B)^T dB], the matrix Ċ(B) is the desired gradient. In fact, we find it even more convenient to compute the relative gradient, which is defined as the matrix C′(B) for which the linear form Tr[C′(B)^T ε] approximates C(B + εB) − C(B) with an error of higher order than ε as ε → 0. The relative gradient has been introduced and discussed in [2]. Clearly, it is related to the ordinary gradient through
\[
C'(B) = \dot C(B)\, B^T.
\]

Proposition 2.1 below provides two expressions for C′. For this purpose, we introduce the following notation: For a vector-valued function g on ℝ^K integrable with respect to f_X, we put
\[
E_{g,(BX)_i}(s) = f_{(BX)_i}(s)\, E[g(X) \mid (BX)_i = s] \tag{2.5}
\]
i.e., f_{(BX)_i}(s) times the conditional expectation of g(X) given (BX)_i = s.

Proposition 2.1: Let ψ_{(BX)_i} be the derivative of −ln f_{(BX)_i} (which is usually called the score function of f_{(BX)_i}). Then
\[
C'(B)_{ij} = \int_{\mathbb{R}} \psi_{(BX)_i}(s)\, \big[B\, E_{X,(BX)_i}(s)\big]_j\, ds - \delta_{ij} = E\{\psi_{(BX)_i}[(BX)_i]\,(BX)_j\} - \delta_{ij}. \tag{2.6}
\]

It should be noted that the diagonal of C′(B) vanishes identically since
\[
E\{\psi_{(BX)_i}[(BX)_i]\,(BX)_i\} = \int_{\mathbb{R}} \psi_{(BX)_i}(s)\, s\, f_{(BX)_i}(s)\, ds = 1
\]
by Lemma A.1 and integration by parts. This is expected since C(B) does not change when one premultiplies B with a diagonal matrix, so that Tr[C′(B)^T ε] must vanish whenever ε is diagonal. The above proposition thus provides K(K − 1) equilibrium equations for determining, up to scaling, the matrix B minimizing C(B):
\[
E\{\psi_{(BX)_i}[(BX)_i]\,(BX)_j\} = 0, \qquad 1 \le i \ne j \le K. \tag{2.7}
\]
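A Monte Carlo check of these equilibrium equations (an illustration added here): when the components of BX are independent, zero mean, and symmetric, E{g[(BX)_i](BX)_j} vanishes for i ≠ j for any odd function g, so a fixed nonlinearity such as tanh can stand in for the data-adapted score in this quick test:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100_000
S = np.vstack([rng.uniform(-np.sqrt(3), np.sqrt(3), T),   # unit variance
               rng.laplace(0.0, 1.0 / np.sqrt(2), T)])    # unit variance
A = np.array([[1.0, 0.5], [0.5, 1.0]])
X = A @ S

def equilibrium(B):
    Y = B @ X
    return (np.tanh(Y) @ Y.T) / T    # matrix of E{tanh(Y_i) Y_j}

print(np.round(equilibrium(np.linalg.inv(A)), 2))  # off-diagonal ~ 0
print(np.round(equilibrium(np.eye(2)), 2))         # off-diagonal far from 0
```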


D. Hessian of the Contrast

The Hessian of the contrast is useful to check whether a solution of the equilibrium equations is really a local minimum. Its evaluation is also needed if one wants to implement a Newton-Raphson-type search procedure.

Instead of computing element by element the Hessian matrix of C, which is awkward because the argument of C is itself a matrix, we shall compute the second differential of C. This is an approximation to Tr[Ċ(B + dB̃)^T dB] − Tr[Ċ(B)^T dB], linear in dB̃ (and, of course, in dB as well), with an error of higher order than dB̃ as dB̃ → 0 (dB fixed) or, equivalently, an approximation to C(B + dB + dB̃) − C(B + dB̃) − C(B + dB) + C(B), bilinear in dB and dB̃, with an error of higher order than ||dB|| ||dB̃|| as dB and dB̃ tend to 0. However, we find it even more convenient to work with relative increments εB, ηB instead of absolute increments dB, dB̃. Thus, we look for an approximation to C(B + ηB + εB) − C(B + ηB) − C(B + εB) + C(B), bilinear in ε and η, such that the error is of higher order than ||ε|| ||η|| as ε and η tend to 0. The matrix associated with the above bilinear form is what will be called the relative Hessian. Note that this is not the same as the relative gradient of C′(B), which requires considering the approximation to Tr[C′(B + ηB)^T ε] − Tr[C′(B)^T ε]. Since C′(B) = Ċ(B)B^T, Tr[C′(B + ηB)^T ε] = Tr[Ċ(B + ηB)^T εB] + Tr[Ċ(B + ηB)^T εηB], the relative gradient of C′(B) would differ from the relative Hessian by an extra term corresponding to the nonsymmetric quadratic form Tr[C′(B)^T εη].

From the definition of the gradient and relative gradient, the relative Hessian may be obtained by approximating Tr[Ċ(B + ηB)^T εB] − Tr[Ċ(B)^T εB] by a linear form in η (it is already linear in ε). The result is given below.

Proposition 2.2: With the notation (2.5), the quadratic form corresponding to the relative Hessian of C is given by (2.8), where ′ denotes the derivative, and ε_{i·} and η_{i·} denote the ith rows of ε and η.

The above formula is not very useful because of its complexity. However, when the vector BX has independent components, it simplifies considerably.

Corollary 2.3: If BX has independent components, the quadratic form corresponding to the Hessian of C is
\[
\sum_{i \ne j} \Big( \varepsilon_{ij}\,\eta_{ji} + \mathrm{var}\{(BX)_j\} \int_{\mathbb{R}} \psi'_{(BX)_i}(s)\, f_{(BX)_i}(s)\, ds\; \varepsilon_{ij}\,\eta_{ij} \Big) \tag{2.9}
\]
where ε_{ij} and η_{ij} denote the elements of ε and η, and var{·} denotes the variance.

This result shows that the Hessian matrix is block diagonal with 2 x 2 diagonal blocks corresponding to each pair of indices of the matrix argument of C.
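This block structure makes a Newton-type step cheap: each pair (ε_ij, ε_ji) is obtained from a 2 × 2 linear system. The sketch below assumes the reconstructed form (2.9) above, so the coefficients a[i, j] = var{(BX)_j} E{ψ′_i[(BX)_i]} are tied to that reconstruction:

```python
import numpy as np

def newton_direction(G, a):
    # Solve the decoupled 2x2 systems of the (reconstructed) Hessian:
    # G is the relative gradient (zero diagonal) and a[i, j] the (2.9)
    # coefficient var{Y_j} * E{psi_i'(Y_i)}.  Returns the relative step.
    K = G.shape[0]
    eps = np.zeros_like(G)
    for i in range(K):
        for j in range(i + 1, K):
            H = np.array([[a[i, j], 1.0],
                          [1.0, a[j, i]]])
            eps[i, j], eps[j, i] = np.linalg.solve(H, [G[i, j], G[j, i]])
    return -eps   # update B <- (I + eps) B
```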

III. STATISTICAL IMPLEMENTATION OF THE ICA

In the previous section, we have proposed a theoretical (probabilistic) ICA in which the density f_X of the distribution is given. This is not the case in practice, where only a sample X(1), …, X(T) from this distribution is available. The natural way to overcome this difficulty is to replace the unknown density by an estimate. In this section, we shall discuss how this should be done and show the similarity of this procedure with the source separation procedure using the separating functions obtained through the maximum likelihood approach. We also compute the gradient and Hessian of the criterion, which will be useful for implementing the procedure.

A. Contrast Function

We choose to estimate the density by the kernel method (see, e.g., [19]). Apart from the fact that it is a well-studied and simple method, it has the advantage that the estimated density can easily be guaranteed to be a true density in the sense that it is nonnegative and integrates to 1. This is important since our computations in Section II are based on the assumption that we have a density. The kernel density estimate is defined by [19]
\[
\hat f_X(x) = \frac{1}{T h_T^K} \sum_{t=1}^{T} \kappa\!\left(\frac{x - X(t)}{h_T}\right)
\]
where κ is the kernel function, and h_T is a bandwidth parameter depending on T. To guarantee that f̂_X is a density, it suffices to take κ a density itself. One can then interpret f̂_X as the density of X* + h_T N, where X* is a random vector taking the values X(1), …, X(T) each with probability 1/T, and N is a random vector independent of X* having density κ. The (statistical) ICA can now be performed by minimizing
\[
\hat C(B) = -\ln|\det B| - \sum_{i=1}^{K} \int_{\mathbb{R}} [\ln \hat f_{(BX)_i}(s)]\, \hat f_{(BX)_i}(s)\, ds \tag{3.1}
\]
f̂_{(BX)_i} being defined in the same way as f_{(BX)_i} with f̂_X in place of f_X. Note that we have dropped a constant term in (2.3), which is of no interest.

The above function is quite similar to the log-likelihood function derived in Pham and Garat [17]. Starting from hypothetical densities f_i of the sources and the assumption that they are temporally independent, Pham and Garat obtained the log-likelihood function for the data X(1), …, X(T) in the source separation model X(t) = AS(t) as

\[
\sum_{i=1}^{K} \sum_{t=1}^{T} \ln f_i[(A^{-1}X)_i(t)] - T \ln|\det A|.
\]

This is similar to −T Ĉ(A^{-1}), except for the fact that the densities f_i are given a priori and not estimated and that the integration with respect to f̂_{(BX)_i} is replaced by a time averaging.

The formula (3.1) involves the marginal densities of f̂_{BX}, which are generally difficult to compute since they require the integration of the density function f̂_X(B^{-1}x)/|det B|. A simple way to avoid this integration is to take κ to be a Gaussian density


because then its marginal densities are also Gaussian. Specifically, we take κ to be a Gaussian density with zero mean and covariance matrix Σ̂. Then, the ith component of BN is a Gaussian variable with zero mean and variance the ith diagonal element σ̂_i²(B) of BΣ̂B^T. Hence
\[
\hat f_{(BX)_i}(s) = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{h_T \hat\sigma_i(B)}\, k\!\left[\frac{s - (BX)_i(t)}{h_T \hat\sigma_i(B)}\right] \tag{3.2}
\]
where k(s) = exp(−s²/2)/√(2π) is the standard Gaussian density. As for Σ̂, the natural choice is the sample covariance matrix of the data X(1), …, X(T). This choice ensures the coherence between the estimation procedure and the linear transformation. By this, we mean that estimating f_X and then using (2.1) to deduce the estimate for f_{(BX)_i} should lead to the same result as estimating f_{(BX)_i} directly from the "data" (BX)_i(1), …, (BX)_i(T).
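Equation (3.2) is straightforward to transcribe; a minimal sketch for one component (the bandwidth value below is an arbitrary placeholder):

```python
import numpy as np

def marginal_kde(y, s, h):
    # Density estimate of one component y = (BX)_i(1..T) at the points
    # s, per (3.2): Gaussian kernels of width h * std(y).
    scale = h * y.std()
    u = (s[:, None] - y[None, :]) / scale
    k = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return k.mean(axis=1) / scale

y = np.random.default_rng(2).normal(size=200)
s = np.linspace(-3.0, 3.0, 101)
f_hat = marginal_kde(y, s, h=0.5)
```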

We do not, however, want to restrict ourselves to a Gaussian kernel. One may prefer a kernel with compact support. To avoid integrating the kernel κ, we observe that Ĉ(B) actually involves only the marginal densities of BX, which can be estimated directly from the sequences (BX)_i(t). Specifically, we define Ĉ(B) by (3.1) with f̂_{(BX)_i} given by (3.2) with an arbitrary density k and σ̂_i(B) as above. (More generally, one may take as σ̂_i(B) any measure of the scale of (BX)_i(1), …, (BX)_i(T).) Since σ̂_i(B) is unchanged if the (BX)_i(t) are shifted by a constant and is multiplied by a factor if the (BX)_i(t) are multiplied by the same factor, the contrast Ĉ is invariant with respect to the above operations. However, the coherence with respect to linear transformations is now lost: The densities (3.2) for different B and i may not come from a common joint density f̂_X. Thus, one cannot use the results in Section II to obtain the gradient and Hessian of Ĉ. Nevertheless, these results are mainly based on Lemma A.2, and we can prove an analogous Lemma A.3 and thus obtain similar formulae as before.


B. Gradient and Hessian of the Contrast

To write down the gradient and Hessian of the contrast (Propositions 3.1 and 3.2), we need to define the analogs of (2.5) for g(X) = X or XX^T:
\[
\hat E_{X,(BX)_i}(s) = \frac{1}{T} \sum_{t=1}^{T} X(t)\, k_B(s, t) \tag{3.3}
\]
\[
\hat E_{XX^T,(BX)_i}(s) = \frac{1}{T} \sum_{t=1}^{T} X(t)\, X^T(t)\, k_B(s, t) \tag{3.4}
\]
where k_B(s, t) = k{[s − (BX)_i(t)]/[h_T σ̂_i(B)]}/[h_T σ̂_i(B)], as well as
\[
\bar k(s) = -\int_{-\infty}^{s} u\, k(u)\, du. \tag{3.5}
\]

(Note that k̄ = k if the latter is the standard Gaussian density.)

Proposition 3.1: Assume that k is continuously differentiable, and let φ̂_{(BX)_i} be the derivative of −ln f̂_{(BX)_i}. The relative gradient of Ĉ is
\[
\hat C'(B)_{ij} = \int_{\mathbb{R}} \hat\phi_{(BX)_i}(s)\, \big[B\, \hat E_{X,(BX)_i}(s)\big]_j\, ds - \delta_{ij}. \tag{3.6}
\]

Using the definition (3.3), the element (i, j) of Ĉ′(B) may be computed by (3.7), where the function φ̃_{(BX)_i} is given by (3.8) and the coefficient J_{(BX)_i} by (3.9), f̃ being defined as f̂ with k̄ in place of k. The function φ̃_{(BX)_i} may be viewed as a smoothed version of φ̂_{(BX)_i}. The extra term [J_{(BX)_i}/σ̂_i²(B)] (1/T) Σ_{t=1}^T (BX)_i(t)(BX)_j(t) in (3.7) arises from the smoothing of the density through the kernel k with bandwidth h_T. Since the diagonal of Ĉ′(B), like that of C′(B), vanishes identically (the proof is the same), the right-hand side of (3.7) equals 1 for i = j, yielding another formula, (3.10), for the relative gradient


since Σ̂ = (1/T) Σ_{t=1}^T X(t) X^T(t). If the data are not centered, one should center them by subtracting their arithmetic mean before computing (3.10). Centering does not change (3.7) because (1/T) Σ_{t=1}^T φ̃_{(BX)_i}[(BX)_i(t)] = ∫_ℝ φ̂_{(BX)_i}(s) f̂_{(BX)_i}(s) ds = 0.

Equating (3.7) or (3.10) to zero yields K(K − 1) equilibrium equations for estimating the matrix B up to scaling. Note the similarity between these equations and the estimating equations proposed in Pham and Garat [17], which differ only by the fact that the "separating" functions φ̃_{(BX)_i}(s) + [J_{(BX)_i}/σ̂_i²(B)]s in (3.8) are replaced by given functions not depending on the data. As can be seen from (3.8) and (3.9), φ̃_{(BX)_i}(s) + [J_{(BX)_i}/σ̂_i²(B)]s converges to ψ_{(BX)_i}(s) as T and h_T converge to infinity and 0 with sufficiently slow rates. The above limit is the score function of the density f_{(BX)_i}, which is known to be the optimal choice for separating functions [17].

Note that in this ICA approach, one could (and should) ensure, in an iterative search for the solution to the equilibrium equations, that the contrast is decreased at each iteration step. This may avoid landing at a spurious solution.
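One way to enforce this in code is a step-halving safeguard around the relative update B ← (I + με)B; in the sketch below, contrast and direction are placeholders assumed to be supplied by the caller (e.g., (3.1) and a gradient or Newton direction):

```python
import numpy as np

def descend(B, X, contrast, direction, max_iter=20):
    # Accept a relative step only if it decreases the contrast.
    K = B.shape[0]
    c = contrast(B, X)
    for _ in range(max_iter):
        eps, mu = direction(B, X), 1.0
        while mu > 1e-4:
            B_try = (np.eye(K) + mu * eps) @ B
            c_try = contrast(B_try, X)
            if c_try < c:          # the contrast decreased: accept
                B, c = B_try, c_try
                break
            mu /= 2.0              # otherwise halve the step
        else:
            return B               # no decreasing step found
    return B
```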

For the Hessian of the contrast, we have a result similar to Proposition 2.2.

Proposition 3.2: Assume that k is twice continuously differentiable and that ∫_{−∞}^{∞} u k(u) du = 0. Then, the quadratic form corresponding to the Hessian of Ĉ is given by (3.11).


C. Discussion

In order that the kernel density estimate be consistent, h_T should go to zero as T → ∞ with a sufficiently slow rate (in fact, Th_T/log T → ∞ suffices; see [19, p. 72]). The optimal rate for h_T is also known to be T^{−1/5} (see [19, p. 40]). However, this can be used only as a guide since this rate is optimal in the context of estimation of the density and not of the entropy. In addition, the rate should be slower here since we also need the consistency of the first and second derivatives of f̂_{(BX)_i} as well. Because of this slow rate, h_T would be almost constant for a wide range of values of T covering most sample sizes encountered in practice. Thus, it is of interest to consider the case where h_T does not converge to zero but to a small number h. We show here that this does not greatly affect our procedure.
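For orientation (an aside added here, not from the paper): Silverman's reference rule [19] gives a bandwidth of the T^{−1/5} order mentioned above, and a slower decay is obtained simply by shrinking the exponent:

```python
def bandwidth(T, rate=1/5):
    # h_T proportional to T**(-rate), for data rescaled to unit scale;
    # rate < 1/5 decays more slowly, as suggested for entropy estimation.
    return 1.06 * T ** (-rate)

print(bandwidth(200), bandwidth(200, rate=1/8))
```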

To simplify the analysis, consider first the case where k is the Gaussian density. Then, f̂_{(BX)_i} converges to f*_{(BX)_i}, which is the ith marginal density of BX + hBN, where N is a Gaussian vector independent of X having mean zero and the same covariance matrix Σ as X. Thus, we are actually trying to perform an ICA of X + hN instead of X. More precisely, our statistical ICA amounts to simply modifying the original measure of closeness to "independence." Instead of taking K(P_{BX}, P̃_{BX}) for this measure,

we take K(P_{BX+hBN}, P̃_{BX+hBN}), where P_{BX+hBN} is the distribution of BX + hBN, and P̃_{BX+hBN} is the product of its marginal distributions. Since making the components of BX independent also makes the covariance matrix of BX diagonal and, hence, the components of B(X + hN) independent, and vice versa, the "new" measure is still positive and vanishes if and only if the components of BX are independent. Note, however, that this measure puts more weight on the second-order statistics of the data since the "closeness to independence" of the components of N is determined by its covariance matrix, which is the same as that of X by definition. In the extreme case where h is very large, we are trying first to make the matrix BΣB^T diagonal and only then, using the remaining degrees of freedom associated with rotation, trying to make the components of BX as independent as we can.

When k is not a Gaussian density, f̂_{(BX)_i} still converges to some f*_{(BX)_i}, but one cannot interpret this limit as a marginal density. Nevertheless, the extra term (1/T) Σ_{t=1}^T [J_{(BX)_i}/σ̂_i²(B)] (BX)_i(t)(BX)_j(t) in (3.10), together with the relation (3.9), suggests that using a higher h_T would put more weight on the second-order statistics of the data. Although we have been unable to prove (in the case h_T → h > 0 and for a general k) that the matrix B making the components of BX independent is the global minimum of the limiting contrast, we can at least prove the following proposition.

Proposition 3.3: Let B be such that the components of BX are independent. Then, it is a local minimum of lim_{T→∞} Ĉ if lim_{T→∞} h_T = h is small enough.

In practice, this property seems to hold for quite large h: We have not yet encountered a case where it fails.

D. The Discretized Contrast

A practical problem in implementing the above procedure is the computation of the integrals involved in the contrast and its derivatives. Numerical integration can be quite costly. Further, it ultimately reduces to computing a discrete sum. This suggests replacing the contrast by an approximation involving only sums. This approach differs from using numerical integration by the fact that the discretization scheme is the same for all integrals involved. Apart from simplifying the computation, it avoids the problem that the computed derivatives of the contrast are not exactly equal to the derivatives of the computed contrast. Further, as can be seen later, one can be satisfied with a crude discretization, which reduces the computational cost.

We discretize the real line into an equispaced grid of points. Since the grid spacing can be larger if the functions to be integrated are smoother, and since the smoothness of f̂_{(BX)_i} is controlled by the bandwidth h_T σ̂_i(B), we will take as the grid spacing h_T σ̂_i(B)δ, where δ is a given small number. This leads to
\[
\hat C(B) \approx -\ln|\det B| - \sum_{i=1}^{K} \sum_{m=-\infty}^{\infty} \hat p_{(BX)_i}(m)\, \ln \frac{\hat p_{(BX)_i}(m)}{h_T \hat\sigma_i(B)\delta}.
\]


Hence, by dropping a constant term, we obtain the discretized contrast
\[
\hat C_\delta(B) = -\ln|\det B| + \sum_{i=1}^{K} \Big[ \ln \hat\sigma_i(B) + \sum_{m=-\infty}^{\infty} \hat p_{(BX)_i}(m)\, \ln \hat p_{(BX)_i}(m) \Big] \tag{3.12}
\]
where
\[
\hat p_{(BX)_i}(m) = \int_{(m-\frac{1}{2}) h_T \hat\sigma_i(B)\delta}^{(m+\frac{1}{2}) h_T \hat\sigma_i(B)\delta} \hat f_{(BX)_i}(s)\, ds \tag{3.13}
\]
which represents the probability that (BX)_i falls into the interval [(m − ½)h_T σ̂_i(B)δ, (m + ½)h_T σ̂_i(B)δ].
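A sketch of the discretized contrast, following the reconstruction (3.12)-(3.13) given above (so its exact form is hedged accordingly); the kernel smoothing in (3.13) is replaced by crude binning of the sample itself:

```python
import numpy as np

def discretized_contrast(B, X, hT=0.5, delta=0.2):
    # -ln|det B| + sum_i [ln sigma_i + sum_m p ln p], with p_m
    # approximated by binning (BX)_i on a grid of spacing hT*sigma_i*delta.
    Y = B @ X
    total = -np.log(abs(np.linalg.det(B)))
    for y in Y:
        y = y - y.mean()                    # grid origin at the mean
        sigma = y.std()
        w = hT * sigma * delta              # grid spacing
        m = np.round(y / w).astype(int)     # bin index of each point
        p = np.bincount(m - m.min()) / y.size
        p = p[p > 0]
        total += np.log(sigma) + np.sum(p * np.log(p))
    return total
```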

The contrast (3.12) is invariant with respect to scaling: It is not changed when B is premultiplied by a diagonal matrix, but it is not invariant with respect to the origin of coordinates. It may change if (BX)_i(1), …, (BX)_i(T) are shifted by a constant that is not a multiple of h_T σ̂_i(B)δ, although the change would be small. To make the definition of Ĉ_δ unique, we shall assume that the origin of the grid is the arithmetic mean of the data, or, equivalently, that the data have been centered.

To compute the relative gradient and Hessian of Ĉ_δ, one cannot simply discretize the formulae in the previous sections because they make use of integration by parts. However, it is not difficult to compute the relative gradient and Hessian of the p̂_{(BX)_i}(m). This is done in Lemma A.4. From it, one obtains the following proposition.

Proposition 3.4: The relative gradient of Ĉ_δ is
\[
\hat C'_\delta(B) = \sum_{m=-\infty}^{\infty} \big[\ln \hat p_{(BX)_i}(m) + 1\big]\, \dot{\hat p}_{(BX)_i}(m)\, B^T - I \tag{3.14}
\]
(the ith row of the sum being formed from the terms with subscript i), where ṗ̂_{(BX)_i}(m) is given by (3.15). In addition, the quadratic form associated with the relative Hessian of Ĉ_δ can be written in terms of the ṗ̂_{(BX)_i}(m) and of second-order coefficients p̈̂_{(BX)_i}(m) given by (3.16).

Of course, if the grid spacing h_T σ̂_i(B)δ tends to zero, then the discretized contrast tends to the original one. However, h_T may tend not to zero but to a small number h, as considered in Section III-C. Then, the grid spacing will tend to hσ_i(B)δ (if δ is kept fixed). This does not greatly affect the procedure, by a result analogous to Proposition 3.3.

Proposition 3.5: Let B be such that the components of BX are independent. Then, it is a local minimum of lim_{T→∞} Ĉ_δ if lim_{T→∞} h_T = h is small enough.

Although the above proposition does not require that δ be small, an examination of its proof reveals that it depends on the fact that both h and hδ are small. Thus, one should not choose δ too large because then h must be much smaller in order that the result hold. Nevertheless, in our experience, δ need not be very small, which results in a relatively small number of grid points (50-100, say) in the data range. The computation of the p̂_{(BX)_i}(m) and ṗ̂_{(BX)_i}(m) can be done quickly through the use of binning and the FFT [19]. Hence, our algorithm based on the discretized contrast is not as costly as it may seem.
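The binning-plus-FFT evaluation alluded to here amounts to convolving bin counts with a sampled kernel; a minimal sketch of the idea (not the paper's exact implementation):

```python
import numpy as np

def binned_kde(y, n_bins=128, h=0.3):
    # Gaussian KDE on a regular grid by FFT convolution of bin counts
    # with the kernel (cf. [19, p. 62]); grid points stand in for bin
    # centers, which is adequate for a sketch.
    lo, hi = y.min() - 3 * h, y.max() + 3 * h
    grid = np.linspace(lo, hi, n_bins)
    counts, _ = np.histogram(y, bins=n_bins, range=(lo, hi))
    dx = grid[1] - grid[0]
    u = (np.arange(n_bins) - n_bins // 2) * dx
    kernel = np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))
    f = np.fft.irfft(np.fft.rfft(counts) *
                     np.fft.rfft(np.fft.ifftshift(kernel)), n_bins) / y.size
    return grid, f
```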

IV. SOME EXAMPLES OF SIMULATION

We have made many simulations to test our procedure, some of which are reported below. We consider the separation of sources situation with two sources and two sensors. The data X(t) = [X_1(t) X_2(t)]^T, t = 1, …, T = 200, are generated by mixing two independent sources S_1(t), S_2(t) through the matrix

This mixing matrix is taken as an example; our procedure is clearly invariant with respect to linear (well-conditioned) transformations, and we could take any matrix instead. Indeed, the five simulated cases use the following source distributions:


Case    Source 1                        Source 2
1       N(0, 1)                         U(-1.5, 1.5)
2       N(0, 1)                         ½N(-1.5, 0.64) + ½N(1.5, 0.64)
3       U(-0.5, 0.5)                    U(-0.5, 0.5)
4       U(-1, 1)                        ½N(-2, 0.64) + ½N(2, 0.64)
5       ½N(-2, 0.64) + ½N(2, 0.64)      ½N(-1.5, 0.64) + ½N(1.5, 0.64)

where U(−a, a) denotes the uniform distribution over [−a, a], N(m, σ²) the Gaussian distribution with mean m and variance σ², and ½N(−m, σ²) + ½N(m, σ²) the "mixed Gaussian" distribution with density being the average of the Gaussian densities with means −m and m and variance σ². The last distribution is (slightly) bimodal.
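For reproduction purposes, the three families of source laws are easy to sample (a helper added for this transcription):

```python
import numpy as np

rng = np.random.default_rng(3)

def uniform(a, size):                  # U(-a, a)
    return rng.uniform(-a, a, size)

def gaussian(m, var, size):            # N(m, var)
    return rng.normal(m, np.sqrt(var), size)

def mixed_gaussian(m, var, size):      # (1/2)N(-m, var) + (1/2)N(m, var)
    signs = rng.choice([-1.0, 1.0], size)
    return rng.normal(signs * m, np.sqrt(var), size)

s2 = mixed_gaussian(2.0, 0.64, 200)    # e.g., the bimodal source of case 4
```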

We shall view the transformation X ↦ BX as a change of coordinates. The "estimated" independent components (i.e., the reconstructed sources in a source separation problem) are thus viewed as the coordinates of the observation vector in a new coordinate system, whose axes can be seen to be parallel to the columns of B^{-1}. These axes are represented as solid lines in Figs. 1-5. The data points X(1), …, X(200) are represented on the same figures as +. Their coordinates with respect to these axes thus represent the reconstructed sources. Note that the ICA axes are not orthogonal, in contrast with the principal component analysis (PCA) axes (plotted in dashed lines for comparison), which are.

The cases 1-5 are ordered in terms of difficulty. It is known that mixtures of uniform distributions can be easily separated, while mixtures of nearly Gaussian distributions are much more difficult (see, e.g., [17]). This is confirmed in our simulations (remember that the theoretical ICA yields B = A^{-1}, and hence, the theoretical ICA axes must have slopes 2 and 1/2, respectively). We also observe that when a non-Gaussian distribution is mixed with a Gaussian distribution, the non-Gaussian component can usually be adequately extracted, but the Gaussian component can be significantly contaminated with the non-Gaussian component.


Fig. 2. ICA of a Gaussian and a "mixed" Gaussian distribution.
Fig. 3. ICA of two uniform distributions.
Fig. 4. ICA of a uniform and a "mixed" Gaussian distribution.
Fig. 5. ICA of two "mixed" Gaussian distributions.

Finally, the separation of bimodal (or multimodal) distributions seems to be the easiest. In fact, our procedure works very well with such distributions since the separating functions adapt themselves to this particular form of the density.

The performance of a source separation procedure may be best measured by the contamination coefficients c_ij = (BA)_ij σ_j/[(BA)_ii σ_i], σ_j denoting the standard deviation of the jth source. This represents the contribution of the jth source to the reconstructed ith source when all sources are normalized to have unit variance. Table I reports the root mean square contamination coefficients computed from 100 Monte Carlo runs under the five cases above. For completeness, the corresponding results for the PCA are also given.

TABLE I
ROOT MEAN SQUARE CONTAMINATION COEFFICIENTS OVER 100 MONTE CARLO RUNS (UNPRIMED: ICA; PRIMED: PCA)

Case    c12       c21       c'12      c'21
1       0.1000    0.0568    0.4013    1.2040
2       0.0327    0.0783    0.2904    1.3474
3       0.0389    0.0443    0.9748    0.9748
4       0.0319    0.0492    0.2400    1.4850
5       0.0295    0.0426    0.6625    1.0636
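In matrix form, the coefficients are easy to compute from B, A, and the source standard deviations σ_j (a small helper consistent with the definition above):

```python
import numpy as np

def contamination(B, A, sigma):
    # c_ij = (BA)_ij * sigma_j / [(BA)_ii * sigma_i]: contribution of
    # source j to reconstructed source i under unit-variance scaling.
    G = (B @ A) * np.asarray(sigma)[None, :]   # scale column j by sigma_j
    return G / np.diag(G)[:, None]             # divide row i by (BA)_ii*sigma_i
```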

V. CONCLUSION

We have introduced a new method for separation of sources. Its appealing feature is that it can adapt itself to different forms of the density of the sources. Although more extensive simulations will be needed to assess its performance, our



preliminary results are very encouraging, and they indicate that the method is robust and does not suffer from the ambiguity problem mentioned in the Introduction. Further, the algorithm converges very quickly; four to eight iterations usually suffice. From the computational point of view, the cost of our procedure is much lower than it may seem, through the use of the discretized contrast. The density estimation can be done through binning and the FFT, as explained in [19, p. 62] (see also [18]), which is quite fast.

APPENDIX

LEMMAS AND PROOFS OF RESULTS

Lemma A.1: For any real measurable function g on ℝ^K,
\[
\int_{\mathbb{R}^K} g(s)\, f_{BX}(s)\, ds = \int_{\mathbb{R}^K} g(Bx)\, f_X(x)\, dx; \tag{A.1}
\]
in particular, if h is a real measurable function on ℝ,
\[
\int_{\mathbb{R}} h(s)\, f_{(BX)_i}(s)\, ds = \int_{\mathbb{R}^K} h[(Bx)_i]\, f_X(x)\, dx. \tag{A.2}
\]

Proof: Equation (A.1) can be proved through a change of integration variable using (2.1), but a much easier method is to note that the left-hand side of this equality is nothing other than the expectation of g(Y) with Y = BX having the density f_{BX}; hence, it is also the expectation of g(BX) with X having the density f_X, i.e., the integral on the right-hand side. Equation (A.2) is a consequence of (A.1), taking g(s_1, …, s_K) = h(s_i). ∎

Lemma A.2: Assume that f_X is continuously differentiable, and let g be a continuously differentiable real function on ℝ^K satisfying suitable integrability conditions. Then, with the notation (2.5), the difference E_{g,(BX+εBX)_i}(s) − E_{g,(BX)_i}(s) admits a first-order expansion in ε whose leading term is the derivative (denoted by ′) of a linear combination, with coefficients taken from the ith row ε_{i·} of ε, of quantities of the form (2.5), and whose error term o(ε) tends to zero faster than ε as ε → 0.

Proof: Without loss of generality, we may assume that i = 1; the proof for general i is the same after a permutation of the subscripts. We have, from (2.1), an expression of E_{g,(BX)_1}(s_1) as an integral over s_2, …, s_K. Put u = (u_1 ⋯ u_K)^T = (I + ε)^{-1} s, and for each fixed s_1, make the change of integration variables (s_2, …, s_K) → (u_2, …, u_K). Since s_1 is fixed, u_1 must change with u_2, …, u_K. Indeed, from u_1 + Σ_{j=2}^K ε_{1j} u_j = s_1, where ε_{ij} denotes the general element of ε, one has u_1 = (s_1 − Σ_{j=2}^K ε_{1j} u_j)/(1 + ε_{11}). The correspondence between s_2, …, s_K and u_2, …, u_K is thus
\[
s_i = u_i + \varepsilon_{i1}\, \frac{s_1 - \sum_{j=2}^{K} \varepsilon_{1j} u_j}{1 + \varepsilon_{11}} + \sum_{j=2}^{K} \varepsilon_{ij} u_j, \qquad i = 2, \ldots, K.
\]
The Jacobian of this transformation is approximately ∏_{i=2}^{K}(1 + ε_{ii}) ≈ det(I + ε)/(1 + ε_{11}), the error being o(ε).


Therefore, after this change of variables and a first-order expansion, the integrand can be written as g_B(s̄_1, u_2, …, u_K) f_{BX}(s̄_1, u_2, …, u_K) times the above Jacobian, with s̄_1 denoting a number between s_1 and (s_1 − Σ_{j=2}^K ε_{1j} u_j)/(1 + ε_{11}). Dividing both sides of the above relation by 1 + ε_{11} and integrating with respect to u_2, …, u_K, one gets, using the assumptions of the Lemma, the announced expansion up to an o(ε) error. The result follows by interchanging the order of differentiation and integration and noting that the expression inside the curly brackets {·} is precisely the derivative of Σ_j ε_{1j} s_j g_B(s_1, …, s_K) f_{BX}(s_1, …, s_K) with respect to s_1. ∎

Proof of Proposition 2.1: Applying Lemma A.2 with g = 1, we get
\[
f_{(BX+\varepsilon BX)_i}(s) - f_{(BX)_i}(s) = -\{(\varepsilon B)_{i\cdot}\, E_{X,(BX)_i}(s)\}' + o(\varepsilon).
\]
It then follows from (2.3) that C(B + εB) − C(B) equals, up to an o(ε) error, a linear form in ε given by a sum over i of integrals. The integral of the first term in the resulting right-hand side vanishes, whereas that of the second can be evaluated by integration by parts. This yields the first formula for C′(B) of the Proposition. The second formula then follows from (A.2). ∎

Proof of Proposition 2.2: By (2.6), we have
\[
\mathrm{Tr}[\dot C(B+\eta B)^T \varepsilon B] - \mathrm{Tr}[\dot C(B)^T \varepsilon B]
= -\mathrm{Tr}[(I+\eta)^{-1}\varepsilon] + \mathrm{Tr}\,\varepsilon
+ \sum_{i=1}^{K} \int \big[\psi_{(BX+\eta BX)_i}(s)\, E_{(\varepsilon BX)_i,(BX+\eta BX)_i}(s) - \psi_{(BX)_i}(s)\, E_{(\varepsilon BX)_i,(BX)_i}(s)\big]\, ds.
\]
Expanding each term by Lemma A.2, this difference is approximately Tr(ηε) minus a sum of integrals involving the ψ_{(BX)_i} and derivatives of quantities of the form (2.5), with an error o(η). By integration by parts, the last expression can be seen to be the same as the quadratic form given in the proposition. ∎

Proof of Corollary 2.3: Using the independence of the (BX)_i,
\[
E_{(BX)_j(BX)_k,(BX)_i}(s_i) =
\begin{cases}
E_X[(BX)_j (BX)_k]\, f_{(BX)_i}(s_i), & i \notin \{j, k\} \\
E_X[(BX)_k]\, s_i\, f_{(BX)_i}(s_i), & i = j \ne k \\
E_X[(BX)_j]\, s_i\, f_{(BX)_i}(s_i), & i = k \ne j \\
s_i^2\, f_{(BX)_i}(s_i), & i = j = k
\end{cases}
\]
where E_X denotes the expectation computed from the density f_X of X. Further, by integration by parts,
\[
\int \psi'_{(BX)_i}(s)\, s\, f_{(BX)_i}(s)\, ds = \int \psi_{(BX)_i}(s)\,[\psi_{(BX)_i}(s)\, s - 1]\, f_{(BX)_i}(s)\, ds.
\]
From these results, the quadratic form in Proposition 2.2 reduces to that of the Corollary. ∎

Lemma A.3: Suppose that k is twice continuously differentiable. Then, with the notations (3.2)-(3.5) and that of Lemma A.2,


the differences f̂_{(BX+εBX)_i}(s) − f̂_{(BX)_i}(s) and Ê_{X,(BX+εBX)_i}(s) − Ê_{X,(BX)_i}(s) admit first-order expansions in ε analogous to those of Lemma A.2, the expansion of the latter involving Ê_{XX^T,(BX)_i}(s).

Proof: Since σ̂_i²(B) is the ith diagonal element of BΣ̂B^T, σ̂_i²(B + εB) = σ̂_i²(B) + 2(εB)_{i·}(Σ̂B^T)_{·i} + o(ε), and hence, σ̂_i(B + εB) − σ̂_i(B) = (εB)_{i·}(Σ̂B^T)_{·i}/σ̂_i(B) + o(ε). Thus, putting k_B(s, t) = k{[s − (BX)_i(t)]/[h_T σ̂_i(B)]}/[h_T σ̂_i(B)] for short and denoting by ′ the derivative with respect to s, one obtains a first-order expansion of k_{B+εB}(s, t) − k_B(s, t). From (3.2), one gets the first result of the Lemma. Using the above results again and after some rearrangement, one gets the second result of the Lemma. ∎

Proofs of Propositions 3.1 and 3.2: They are similar to those of Propositions 2.1 and 2.2, using Lemma A.3 in place of Lemma A.2 and noting that k̄(s) → 0 as s → ∞ by assumption. ∎

Proof of Proposition 3.3: As T → ∞, f̂_{(BX)_i}(s) → f*_{(BX)_i}(s) = ∫ f_{(BX)_i}[s − u h σ_i(B)] k(u) du, and φ̂_{(BX)_i} → φ*_{(BX)_i}, which is the derivative of −ln f*_{(BX)_i}. If B is such that the components of BX are independent, B Ê_{X,(BX)_i}(s), as defined in (3.3), converges to the vector with components E_X[(BX)_j] f*_{(BX)_i}(s) for j ≠ i and s f*_{(BX)_i}(s) for j = i. Therefore, noting that ∫_ℝ φ*_{(BX)_i}(s) f*_{(BX)_i}(s) ds = 0, Ĉ′(B) → 0 for such B. To prove that this B is a local minimum, we must show that the relative Hessian at this point is positive definite. However, from its definition (3.4), B Ê_{XX^T,(BX)_i}(s) B^T can be seen to converge to the matrix with element (j, k) given by
\[
\begin{cases}
E_X[(BX)_j]\, E_X[(BX)_k]\, f^*_{(BX)_i}(s), & i \ne j \ne k \text{ (all distinct)} \\
E_X[(BX)_j^2]\, f^*_{(BX)_i}(s) + h^2 \sigma_j^2(B)\, \bar f^*_{(BX)_i}(s), & i \ne j = k \\
E_X[(BX)_k]\, s\, f^*_{(BX)_i}(s), & i = j \ne k \\
E_X[(BX)_j]\, s\, f^*_{(BX)_i}(s), & i = k \ne j \\
s^2 f^*_{(BX)_i}(s) + h^2 \sigma_i^2(B)\, \bar f^*_{(BX)_i}(s), & i = j = k
\end{cases}
\]

f̄*_{(BX)_i} being defined as f*_{(BX)_i} with k̄ in place of k. From the above results and by the same computations as in the proof of Corollary 2.3, it can be seen that the quadratic form in Proposition 3.2 converges to
\[
\sum_{i \ne j} \Big\{ \varepsilon_{ij}\,\eta_{ji} + \sigma_j^2(B) \int_{\mathbb{R}} \phi^{*\prime}_{(BX)_i}(s)\, \big[f^*_{(BX)_i}(s) + h^2 \bar f^*_{(BX)_i}(s)\big]\, ds\; \varepsilon_{ij}\,\eta_{ij} \Big\}. \tag{A.3}
\]
This quadratic form is positive definite if

\[
\sigma_i^2(B) \int_{\mathbb{R}} \phi^{*\prime}_{(BX)_i}(s)\, \big[f^*_{(BX)_i}(s) + h^2 \bar f^*_{(BX)_i}(s)\big]\, ds \ge 1 \tag{A.4}
\]
with equality for one subscript i at most. However, the above left-hand side converges as h → 0 to
\[
\sigma_i^2(B) \int_{\mathbb{R}} \psi'_{(BX)_i}(s)\, f_{(BX)_i}(s)\, ds = \sigma_i^2(B) \int_{\mathbb{R}} \psi^2_{(BX)_i}(s)\, f_{(BX)_i}(s)\, ds
\]
by integration by parts. Since ∫ ψ_{(BX)_i}(s) s f_{(BX)_i}(s) ds = 1, the last right-hand side is the inverse of the squared correlation between ψ_{(BX)_i}[(BX)_i] and (BX)_i, and it is greater than 1 unless f_{(BX)_i} is Gaussian. We shall show below that in this case, (A.4) is an equality. Hence, (A.4) holds for h small enough, with equality for at most one subscript i, yielding the desired result.

We now show that (A.4) is an equality if f_{(BX)_i} is a Gaussian density. By integration by parts, its left-hand side can be rewritten, since then f′_{(BX)_i}(u) = −f_{(BX)_i}(u)(u − m)/σ_i²(B), m being the mean of f_{(BX)_i}. Using (3.5), the resulting right-hand side reduces to ∫ φ*_{(BX)_i}(s)(s − m) f̄*_{(BX)_i}(s) ds, which equals 1. ∎

Lemma A.4: As ε → 0,
\[
\hat p_{(BX+\varepsilon BX)_i}(m) - \hat p_{(BX)_i}(m) = \dot{\hat p}_{(BX)_i}(m)\, (\varepsilon B)_{i\cdot}^T + o(\varepsilon)
\]
\[
\dot{\hat p}_{(BX+\varepsilon BX)_i}(m) - \dot{\hat p}_{(BX)_i}(m) = (\varepsilon B)_{i\cdot}\, \ddot{\hat p}_{(BX)_i}(m) + o(\varepsilon)
\]
where o(ε) denotes a term tending to zero faster than ε as ε → 0, and ṗ̂_{(BX)_i}(m) and p̈̂_{(BX)_i}(m) are given by (3.15) and (3.16).


The proof of this Lemma is similar to that of Lemma A.3.

Proof of Proposition 3.5: If B is such that the components of BX are independent and h_T → h, ṗ̂_{(BX)_i}(m) converges to 0, and hence, the same holds for the relative gradient of Ĉ_δ at B. It remains to show that the relative Hessian of Ĉ_δ at B converges to a positive definite matrix. The quadratic form associated with this Hessian is given in Proposition 3.4, and since (BX)_1(t), …, (BX)_K(t) are centered and pairwise independent, B ṗ̂_{(BX)_i}(m) → 0 and B p̈̂_{(BX)_i}(m) B^T converges to a matrix with all elements being zero except the (j, j), j ≠ i, elements. Hence, the considered quadratic form converges to a limit expressed in terms of f*_{(BX)_i}(s) = ∫ k(u) f_{(BX)_i}[s + hσ_i(B)u] du (the limit of f̂_{(BX)_i}(s)), of p*_{(BX)_i}(m) (the corresponding limit of p̂_{(BX)_i}(m)), and of the limit of ṗ̂_{(BX)_i}. The sum in the last right-hand side is the Riemann sum for an integral that equals 1 minus h²σ_i²(B) times an integral involving φ*_{(BX)_i} and f̄*_{(BX)_i}, by integration by parts. Thus, approximating the Riemann sums by the corresponding integrals, the limiting quadratic form associated with the relative Hessian of Ĉ_δ at B is the same as that of Ĉ, which was given in (A.3). Since the approximation becomes exact as hδ → 0, the result follows. ∎

REFERENCES

[1] B. Laheld and J. F. Cardoso, "Adaptive source separation with uniform performance," in Proc. EUSIPCO, Edinburgh, 1994.
[2] B. Laheld, "Séparation auto-adaptative de sources. Implantations et performances," Thèse de doctorat, ENST, 1994.
[3] J. F. Cardoso, "Source separation using higher order moments," in Proc. ICASSP, 1989, pp. 2109-2112.
[4] ——, "Iterative technique for blind source separation using only fourth order cumulants," in Proc. EUSIPCO 92, Signal Processing VI: Theories and Applications, J. Vandewalle, R. Boite, M. Moonen, and A. Oosterlinck, Eds., 1992, pp. 739-742.
[5] J. F. Cardoso and A. Souloumiac, "An efficient technique for blind separation of complex sources," in Proc. IEEE SP Workshop Higher-Order Statist., Lake Tahoe, 1993, pp. 275-279.
[6] P. Comon, "Blind separation of sources: Problems statement," Signal Processing, vol. 24, pp. 11-20, 1991.
[7] ——, "Independent component analysis," in Proc. Int. Workshop Higher Order Statist., Chamrousse, France, 1991, pp. 111-112.
[8] ——, "Independent component analysis, a new concept?," Signal Processing, vol. 36, no. 3, pp. 287-314, 1994.
[9] P. Duvaut, "Principes des méthodes de séparation de sources fondées sur les moments d'ordre supérieur," Traitement du Signal, vol. 7, no. 5 (numéro spécial non linéaire non gaussien), pp. 407-418, 1990.
[10] M. Gaeta and J. L. Lacoume, "Source separation without a priori knowledge: The maximum likelihood approach," in Proc. EUSIPCO 90, Signal Processing V, L. Torres, E. Masgrau, and M. A. Lagunas, Eds., 1990, pp. 621-624.
[11] P. J. Huber, "Projection pursuit," Ann. Statist., vol. 13, no. 2, pp. 435-475, 1985.
[12] H. Joe, "Relative entropy measure of multivariate dependence," J. Amer. Statist. Assoc., vol. 84, no. 405, pp. 157-164, 1989.
[13] C. Jutten and J. Herault, "Blind separation of sources, Part I: An adaptive algorithm based on neuromimetic architecture," Signal Processing, vol. 24, pp. 1-10, 1991.
[14] J. L. Lacoume, "Sources identification: A solution based on the cumulants," in Proc. IEEE Workshop Spectrum Anal. Modeling, Minneapolis, MN, Aug. 1988.
[15] P. Loubaton and N. Delfosse, "Séparation adaptative de sources indépendantes par une approche de déflation," in Proc. XIV Colloque sur le Traitement du Signal et des Images (GRETSI), Juan-les-Pins, Sept. 1993, pp. 325-328.
[16] E. Moreau and O. Macchi, "New self-adaptive algorithms for source separation based on contrast functions," in Proc. IEEE SP Workshop Higher Order Statist., Lake Tahoe, 1993, pp. 215-219.
[17] D. T. Pham, P. Garat, and C. Jutten, "Separation of a mixture of independent sources through a maximum likelihood approach," in Proc. EUSIPCO 92, Signal Processing VI: Theories and Applications, J. Vandewalle, R. Boite, M. Moonen, and A. Oosterlinck, Eds., 1992, pp. 771-774.
[18] D. T. Pham, "On the discretization error in the computation of the empirical characteristic function," J. Statist. Comput. Simulation, vol. 53, pp. 129-141, 1995.
[19] B. W. Silverman, Density Estimation for Statistics and Data Analysis. London/New York: Chapman and Hall, 1986.
[20] L. Tong, Y. Inouye, and R. Liu, "Waveform-preserving blind estimation of multiple independent sources," IEEE Trans. Signal Processing, vol. 41, no. 7, pp. 2461-2470, 1993.

Dinh Tuan Pham (M'88) was born in Hanoi, Vietnam, on February 10, 1945. He graduated from the School of Applied Mathematics and Computer Science (ENSIMAG) of the Polytechnic Institute of Grenoble in 1968. He received the Ph.D. degree in statistics in 1975 from the University of Grenoble.

He was a Postdoctoral Fellow at the University of California at Berkeley in the Department of Statistics from 1977 to 1978 and a Visiting Professor at Indiana University, Bloomington, in the Department of Mathematics from 1979 to 1980. He is currently Director of Research at the French Centre National de la Recherche Scientifique. His research includes time-series analysis, signal modeling, array processing, and biomedical signal processing.