
Overcomplete Independent Component Analysis via SDP

Anastasia Podosinnikova (MIT), Amelia Perry (MIT), Alexander S. Wein (Courant Institute, NYU)

Francis Bach (INRIA, ENS), Alexandre d'Aspremont (CNRS, ENS), David Sontag (MIT)

Abstract

We present a novel algorithm for overcomplete independent component analysis (ICA), where the number of latent sources k exceeds the dimension p of observed variables. Previous algorithms either suffer from high computational complexity or make strong assumptions about the form of the mixing matrix. Our algorithm does not make any sparsity assumption yet enjoys favorable computational and theoretical properties. Our algorithm consists of two main steps: (a) estimation of the Hessians of the cumulant generating function (as opposed to the fourth and higher order cumulants used by most algorithms) and (b) a novel semi-definite programming (SDP) relaxation for recovering a mixing component. We show that this relaxation can be efficiently solved with a projected accelerated gradient descent method, which makes the whole algorithm computationally practical. Moreover, we conjecture that the proposed program recovers a mixing component at the rate $k < p^2/4$ and prove that a mixing component can be recovered with high probability when $k < (2 - \varepsilon)p \log p$ when the original components are sampled uniformly at random on the hypersphere. Experiments are provided on synthetic data and the CIFAR-10 dataset of real images.

Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by the author(s).

1 Introduction

Independent component analysis (ICA) models a p-dimensional observation x as a linear combination of k latent mutually independent sources:

$$x = D\alpha, \qquad (1)$$

where $\alpha := (\alpha_1, \ldots, \alpha_k)^\top$ and $D \in \mathbb{R}^{p \times k}$. The linear transformation D is called the mixing matrix and is closely related to the dictionary matrix from dictionary learning (see, e.g., Chen and Donoho, 1994; Chen et al., 1998). Given a sample $X := \{x^{(1)}, \ldots, x^{(n)}\}$ of n observations, one is often interested in estimating the latent mixing matrix D and the respective latent representations, $\alpha^{(1)}, \ldots, \alpha^{(n)}$, also known as sources, of every observation.
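To make the model concrete, the following minimal numpy sketch samples synthetic data from (1); the dimensions, the uniform source distribution, and all variable names are illustrative choices of ours, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, n = 10, 25, 100_000          # observed dimension, number of sources, sample size

# Hypothetical mixing matrix with unit-norm columns (cf. Assumption 2.1 below).
D = rng.standard_normal((p, k))
D /= np.linalg.norm(D, axis=0)

# Mutually independent, non-Gaussian sources, e.g. uniform on [-0.5, 0.5].
alpha = rng.uniform(-0.5, 0.5, size=(k, n))

# Observations x = D alpha, one column per sample (model (1)); note k > p here.
X = D @ alpha
```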

A classical motivating example for ICA is the cocktail party problem, where one is interested in separating individual speakers' voices from noisy recordings. Here, each recording is an observation and each speaker is an independent source. In general, ICA is a simple single-layered neural network and is widely used as an unsupervised learning method in the machine learning and signal processing communities (see, e.g., Hyvärinen et al., 2001; Comon and Jutten, 2010).

There are three conceptually different settings of the ICA problem: (a) complete, or determined, where the dimension of observations coincides with the number of sources, i.e., p = k; (b) undercomplete, or overdetermined, with fewer sources than the dimension, i.e., k < p; and (c) overcomplete, or underdetermined, with more sources than the dimension, i.e., k > p. While the first two cases are well studied, the last one is more difficult and we address it here.

In the complete setting, where k = p, ICA is usually solved via pre-whitening of the data so that the whitened observations, z := Wx, are uncorrelated and all have unit variance, i.e., $\mathrm{cov}(z) = W \mathrm{cov}(x) W^\top = I$,


where W denotes the whitening matrix. Substituting $x = D\alpha$, we get $(WD)(WD)^\top = I$, which implies that the matrix Q := WD is orthogonal and therefore the problem of finding the mixing matrix D boils down to finding the "correct" orthogonal matrix Q. Numerous "correctness" criteria, such as maximizing non-Gaussianity of the sources, were proposed, and respective algorithms for complete ICA are well known (see, e.g., Hyvärinen et al., 2001; Comon and Jutten, 2010). The most widely known complete ICA algorithms are possibly the FastICA algorithm by Hyvärinen (1999) and the JADE algorithm by Cardoso and Souloumiac (1993). This naturally extends to the undercomplete setting, where one looks for a matrix with orthonormal columns instead. However, although nothing prevents us from whitening data in the overcomplete setting, the orthogonalization trick cannot be extended to this setting, where k > p, since the mixing matrix D has more columns than rows and therefore cannot have full column rank.
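As a quick illustration of the pre-whitening step in the (under-)complete case, here is a minimal eigendecomposition-based sketch; the function name and interface are ours, not the paper's.

```python
import numpy as np

def whiten(X):
    """Return whitened data Z = W X with cov(Z) ~ I and the whitening matrix W.

    X has shape (p, n), one observation per column. This is the standard
    symmetric (ZCA-style) whitening; any W with W cov(x) W^T = I would do.
    """
    Xc = X - X.mean(axis=1, keepdims=True)            # center the observations
    cov = Xc @ Xc.T / Xc.shape[1]                     # empirical covariance, p x p
    eigvals, eigvecs = np.linalg.eigh(cov)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals)) @ eigvecs.T
    return W @ Xc, W
```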

Improvements in feature learning are among the advantages of overcomplete representations: it has been shown by Coates et al. (2011) that dense and overcomplete features can significantly improve the performance of classification algorithms. However, the advantages of overcomplete representations go far beyond this task (see, e.g., Bengio et al., 2013).

Originally, the idea of overcomplete representations was developed in the context of dictionary learning, where an overcomplete dictionary, formed by Fourier, wavelet, Gabor or other filters, is given and one is only interested in estimating the latent representations α. Different approaches were proposed for this problem, including the method of frames (Daubechies, 1988) and basis pursuit (Chen and Donoho, 1994; Chen et al., 1998). Later, in sparse coding, the idea of estimating a dictionary matrix directly from data was introduced (Olshausen and Field, 1996, 1997) and was shortly followed by the first overcomplete ICA algorithm (Lewicki and Sejnowski, 2000).¹ Further overcomplete ICA research continued in several fairly different directions, based on either (a) various sparsity assumptions (see, e.g., Teh et al., 2003), (b) prior assumptions about the sources as by Lewicki and Sejnowski (2000), or (c) a more general dense overcomplete setting (see, e.g., Hyvärinen, 2005; Comon and Rajih, 2006; De Lathauwer et al., 2007; Goyal et al., 2014; Bhaskara et al., 2014a,b; Anandkumar et al., 2015; Ma et al., 2016). Since we focus on this more general dense setting, we do not review or compare to the literature in the other settings.

¹ Recall the close relation between ICA and sparse coding: indeed, the maximum likelihood estimation of ICA with the Laplace prior on the sources (latent representations α) is equivalent to the standard sparse coding formulation with the $\ell_1$-penalty.

In particular, we focus on the following problem: Estimate the mixing matrix D given an observed sample $X := \{x^{(1)}, \ldots, x^{(n)}\}$ of n observations. We aim at constructing an algorithm that would bridge the gap between algorithms with theoretical guarantees and ones with practical computational properties. Notably, our algorithm does not depend on any probabilistic assumptions on the sources, except for the standard independence and non-Gaussianity, and the uniqueness of the ICA representation (up to permutation and scaling) is the result of the independence of the sources rather than sparsity. Here we only focus on the estimation of the latent mixing matrix and leave the learning of the latent representation for future research (note that one can use, e.g., the dictionary learning approaches mentioned earlier).

Different approaches have been proposed to address this problem. Some attempt to relax the hard orthogonality constraint in the whitening procedure with more heuristic quasi-orthogonalization approaches (see, e.g., Le et al., 2011; Arora et al., 2012). Other approaches try to specifically address the structure of the model in the overcomplete setting (see, e.g., Hyvärinen, 2005; Comon and Rajih, 2006; De Lathauwer et al., 2007; Goyal et al., 2014; Bhaskara et al., 2014a,b; Anandkumar et al., 2015; Ma et al., 2016) by considering higher-order cumulants or derivatives of the cumulant generating function. The algorithm that we propose is closest to the latter type of approach.

We make two conceptual contributions: (a) we show how to use second-order statistics instead of the fourth and higher-order cumulants, which improves the sample complexity, and (b) we introduce a novel semi-definite programming-based approach, with a convex relaxation that can be solved efficiently, for estimating the columns of D.

Algorithm 1 OverICA

1: Input: Observations $X := \{x_1, \ldots, x_n\}$ and latent dimension k. Parameters: the regularization parameter µ and the number s of generalized covariances, s > k.
2: STEP I. Estimation of the subspace W: Sample vectors $t_1, \ldots, t_s$. Estimate matrices $H_j := C_x(t_j)$ for all $j \in [s]$.
3: STEP II. Estimation of the atoms: Given $G^{(i)}$ for every deflation step $i = 1, 2, \ldots, k$: solve the relaxation (12) with $G^{(i)}$ (or solve the program (9) with $G^{(i)}$) and estimate the i-th mixing component $d_i$ from $B^*$.
4: Output: Mixing matrix $D = (d_1, d_2, \ldots, d_k)$.


Overall, this leads to a computationally efficient overcomplete ICA algorithm that also has theoretical guarantees. Conceptually, our work is similar to the fourth-order-only blind identification (FOOBI) algorithm (De Lathauwer et al., 2007), which we found to work well in practice. However, FOOBI suffers from high computational and memory complexities, its theoretical guarantee requires all kurtoses of the sources to be positive, and it makes the strong assumption that certain fourth-order tensors are linearly independent. Our approach resolves these drawbacks. We describe our algorithm in Section 2 and experimental results in Section 3.

2 Overcomplete ICA via SDP

2.1 Algorithm overview

We focus on estimating the latent mixing matrix $D \in \mathbb{R}^{p \times k}$ of the ICA model (1) in the overcomplete setting where k > p. We first motivate our algorithm in the population (infinite sample) setting and later address the finite sample case.

In the following, the i-th column of the mixing matrix D is denoted as $d_i$ and called the i-th mixing component. The rank-1 matrices $d_1 d_1^\top, \ldots, d_k d_k^\top$ are referred to as atoms.²

Our algorithm, referred to as OverICA, consists of two major steps: (a) construction of the subspace W spanned by the atoms, i.e.,

$$W := \mathrm{Span}\{d_1 d_1^\top, \ldots, d_k d_k^\top\}, \qquad (2)$$

and (b) estimation of the individual atoms $d_i d_i^\top$, $i \in [k]$, given any basis of this subspace.³ We summarize this high-level idea⁴ in Algorithm 1. Note that although the definition of the subspace W in (2) is based on the latent atoms, in practice this subspace is estimated from the known observations x (see Section 2.3). However, we do use this explicit representation in our theoretical analysis.

In general, there are different ways to implement these two steps. For instance, some algorithms implement the first step based on the fourth or higher order cumulants (see, e.g., De Lathauwer et al., 2007; Goyal et al., 2014). In contrast, we estimate the subspace W from the Hessian of the cumulant generating function, which has better computational and sample complexities (see Section 2.3). Our algorithm also works (without any adjustment) with other implementations

² We slightly abuse the standard, closely related dictionary learning terminology, where the term atom is used for the individual columns $d_i$ (see, e.g., Chen et al., 1998).

³ The mixing component is then the largest eigenvector.

⁴ The deflation part is more involved (see Section 2.4.3).

of the first step, including the fourth-order cumulant based one, but other algorithms cannot take advantage of our efficient first step due to the differences in the second step.

In the second step, we propose a novel semi-definite program (SDP) for the estimation of an individual atom given the subspace W (Section 2.4.1). We also provide a convex relaxation of this program which admits an efficient implementation and introduces regularization against noise, which is handy in practice when the subspace W can only be estimated approximately (Section 2.4.2). Finally, we provide a deflation procedure that allows us to estimate all the atoms (Section 2.4.3). Before proceeding, a few assumptions are in order.

2.2 Assumptions

Due to the inherent permutation and scaling unidentifiability of the ICA problem, it is standard practice to assume, without loss of generality, that:

Assumption 2.1. Every mixing component has unit norm, i.e., $\|d_i\|_2 = 1$ for all $i \in [k]$.

This assumption immediately implies that all atoms have unit Frobenius norm, i.e., $\|d_i d_i^\top\|_F = \|d_i\|_2^2 = 1$ for all $i \in [k]$.

Since instead of recovering mixing components $d_i$, as in the (under-)complete setting, we recover atoms $d_i d_i^\top$, the following assumption is necessary for the identifiability of our algorithm:

Assumption 2.2. The matrices (atoms) $d_1 d_1^\top, d_2 d_2^\top, \ldots, d_k d_k^\top$ are linearly independent.

This in particular implies that the number of sources k cannot exceed m := p(p+1)/2, which is the latent dimension of the set of all symmetric matrices $\mathcal{S}^p$. We also assume, without loss of generality, that the observations are centered, i.e., $\mathbb{E}(x) = \mathbb{E}(\alpha) = 0$.

2.3 Step I: Subspace Estimation

In this section, we describe a construction of an orthonormal basis of the subspace W. For that, we first construct matrices $H_1, \ldots, H_s \in \mathbb{R}^{p \times p}$, for some s, which span the subspace W. These matrices are obtained from the Hessian of the cumulant generating function as described below.

Generalized Covariance Matrices. Introduced for complete ICA by Yeredor (2000), a generalized covariance matrix is the Hessian of the cumulant generating function evaluated at a non-zero vector.

Recall that the cumulant generating function (cgf) of an $\mathbb{R}^p$-valued random variable x is defined as

$$\phi_x(t) := \log \mathbb{E}(e^{t^\top x}), \qquad (3)$$


for any $t \in \mathbb{R}^p$. It is well known that the cumulants of x can be computed as the coefficients of the Taylor series expansion of the cgf evaluated at zero (see, e.g., Comon and Jutten, 2010, Chapter 5). In particular, the second order cumulant, which coincides with the covariance matrix, is the Hessian evaluated at zero, i.e., $\mathrm{cov}(x) = \nabla^2 \phi_x(0)$.

The generalized covariance matrix is a straightforward extension where the Hessian of the cgf is evaluated at a non-zero vector t:

$$C_x(t) := \nabla^2 \phi_x(t) = \frac{\mathbb{E}(x x^\top e^{t^\top x})}{\mathbb{E}(e^{t^\top x})} - E_x(t) E_x(t)^\top, \qquad (4)$$

where we introduced

$$E_x(t) := \nabla \phi_x(t) = \frac{\mathbb{E}(x e^{t^\top x})}{\mathbb{E}(e^{t^\top x})}. \qquad (5)$$
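For intuition, the following numpy sketch estimates (4)-(5) empirically by replacing expectations with sample averages; the max-shift of the exponent for numerical stability and the function name are our own choices, not details taken from the paper.

```python
import numpy as np

def generalized_covariance(X, t):
    """Empirical generalized covariance C_x(t), following (4)-(5).

    X: (p, n) array of centered observations; t: (p,) processing point.
    Expectations are replaced by weighted sample averages with weights
    proportional to exp(t^T x), shifted by the maximum for stability.
    """
    w = t @ X                                   # t^T x for every sample
    w = np.exp(w - w.max())                     # unnormalized weights e^{t^T x}
    w /= w.sum()                                # normalized weights
    Ex = X @ w                                  # empirical E_x(t), eq. (5)
    Cx = (X * w) @ X.T - np.outer(Ex, Ex)       # empirical C_x(t), eq. (4)
    return Cx
```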

Generalized Covariance Matrices of ICA. In the case of the ICA model, substituting (1) into the expressions (5) and (4), we obtain

$$E_x(t) = \frac{D\,\mathbb{E}(\alpha e^{\alpha^\top y})}{\mathbb{E}(e^{\alpha^\top y})} = D E_\alpha(y), \qquad C_x(t) = D C_\alpha(y) D^\top, \qquad (6)$$

where we introduced $y := D^\top t$ and the generalized covariance $C_\alpha(y) := \nabla^2 \phi_\alpha(y)$ of the sources:

$$C_\alpha(y) = \frac{\mathbb{E}(\alpha \alpha^\top e^{\alpha^\top y})}{\mathbb{E}(e^{\alpha^\top y})} - E_\alpha(y) E_\alpha(y)^\top, \qquad (7)$$

where $E_\alpha(y) := \nabla \phi_\alpha(y) = \mathbb{E}(\alpha e^{y^\top \alpha}) / \mathbb{E}(e^{y^\top \alpha})$.

Importantly, the generalized covariance $C_\alpha(y)$ of the sources, due to the independence, is a diagonal matrix (see, e.g., Podosinnikova et al., 2016). Therefore, the ICA generalized covariance $C_x(t)$ is:

$$C_x(t) = \sum_{i=1}^{k} \omega_i(t)\, d_i d_i^\top, \qquad (8)$$

where $\omega_i(t) := [C_\alpha(D^\top t)]_{ii}$ is the generalized variance of the i-th source $\alpha_i$. This implies that ICA generalized covariances belong to the subspace W.

Construction of the Subspace. Since ICA generalized covariance matrices belong to the subspace W, the span of any number of such matrices is either a subset of W or equal to W. Choosing a sufficiently large number s > k of generalized covariance matrices, we can ensure the equality. Therefore, given a sufficiently large number s of vectors $t_1, \ldots, t_s$, we construct matrices $H_j := C_x(t_j)$ for all $j \in [s]$. Note that in practice it is more convenient to work with vectorizations of these matrices and then a consequent matricization of the obtained result (see Appendices A.1 and B.2). Given the matrices $H_j$, for $j \in [s]$, an orthonormal basis can be straightforwardly extracted via the singular value decomposition. In practice, we set s as a multiple of k and sample the vectors $t_j$ from the Gaussian distribution.
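A minimal sketch of this SVD step, under the assumption that the matrices H_j have already been estimated (e.g., with the generalized_covariance sketch above); the function name and the hard rank-k truncation are illustrative.

```python
import numpy as np

def subspace_basis(Hs, k):
    """Orthonormal basis of vec(W) from s >= k generalized covariances.

    Hs: list of s symmetric (p, p) matrices H_j = C_x(t_j). The basis is read
    off from the top-k left singular vectors of their vectorizations; in the
    population setting the stacked matrix has rank exactly k.
    """
    M = np.stack([H.reshape(-1) for H in Hs], axis=1)   # (p^2, s) matrix
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k]                                     # columns span vec(W)
```

A basis $\{F_j\}$ of the null space $\mathcal{N}(W)$ used in the relaxation (12) below can then be taken as an orthonormal basis of the orthogonal complement of this column space (restricted to symmetric matrices), in line with footnote 6.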

Note that one can also construct a basis of the subspace W from the column space of a flattening of the fourth-order cumulant of the ICA model (1). In particular, this flattening is a matrix $C \in \mathbb{R}^{p^2 \times p^2}$ such that $C = (D \odot D)\,\mathrm{Diag}(\kappa)\,(D \odot D)^\top$, where $\odot$ stands for the Khatri-Rao product and the i-th element of the vector $\kappa \in \mathbb{R}^k$ is the kurtosis of the i-th source $\alpha_i$. Importantly, the matricization of the i-th column $a_i$ of the matrix $A := D \odot D$ is exactly the i-th atom, i.e., $\mathrm{mat}(a_i) = d_i d_i^\top$. Therefore, one can construct the desired basis from the column space of the matrix A (see Appendix B.2 for more details). This also intuitively explains the need for Assumption 2.2, which basically ensures that A has full column rank (as opposed to D). In general, this approach is common in the overcomplete literature (see, e.g., De Lathauwer et al., 2007; Bhaskara et al., 2014a; Anandkumar et al., 2015; Ma et al., 2016) and can be used as the first step of our algorithm. However, the generalized covariance-based construction has better computational (see Section 3.3) and sample complexities.
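The identity $\mathrm{mat}(a_i) = d_i d_i^\top$ is easy to check numerically; a small self-contained sketch (with an arbitrary random D of our own choosing) is below.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 5, 8
D = rng.standard_normal((p, k))
D /= np.linalg.norm(D, axis=0)

# A := D ⊙ D, the column-wise Kronecker (Khatri-Rao) product, of shape (p^2, k).
A = np.einsum('ik,jk->ijk', D, D).reshape(p * p, k)

# Matricizing the i-th column of A recovers the i-th atom d_i d_i^T.
assert np.allclose(A[:, 0].reshape(p, p), np.outer(D[:, 0], D[:, 0]))
```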

2.4 Step II: Estimation of the Atoms

We now discuss the recovery of one atom $d_i d_i^\top$, for some $i \in [k]$, given a basis of the subspace W (Section 2.4.1). We then provide a deflation procedure to recover all atoms $d_i d_i^\top$ (Section 2.4.3).

2.4.1 The Semi-Definite Program

Given matrices $H_1, H_2, \ldots, H_s$ which span the subspace W defined in (2), we formulate the following semi-definite program (SDP):

$$\begin{aligned} B^*_{\mathrm{sdp}} := \operatorname*{argmax}_{B \in \mathcal{S}^p} \quad & \langle G, B \rangle \\ \text{s.t.} \quad & B \in \mathrm{Span}\{H_1, H_2, \ldots, H_s\}, \\ & \mathrm{Tr}(B) = 1, \\ & B \succeq 0. \end{aligned} \qquad (9)$$
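For small instances, (9) can be handed directly to an off-the-shelf SDP solver; below is a sketch using cvxpy (our choice of tooling here; the paper's population experiments use CVX with the SeDuMi solver instead). The estimated mixing component is then the top eigenvector of the returned B* (footnote 3).

```python
import cvxpy as cp
import numpy as np

def solve_sdp(G, H):
    """Sketch of program (9): maximize <G, B> s.t. B in Span{H_j}, Tr(B) = 1, B >= 0.

    G: (p, p) array; H: list of s symmetric (p, p) arrays spanning W.
    """
    p = G.shape[0]
    B = cp.Variable((p, p), symmetric=True)
    c = cp.Variable(len(H))                       # coordinates of B in Span{H_j}
    constraints = [
        B == sum(c[j] * H[j] for j in range(len(H))),
        cp.trace(B) == 1,
        B >> 0,
    ]
    cp.Problem(cp.Maximize(cp.trace(G @ B)), constraints).solve()
    return B.value
```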

We expect that the optimal solution (if it exists and is unique) $B^*_{\mathrm{sdp}}$ coincides with one of the atoms $d_i d_i^\top$ for some $i \in [k]$. This is not always the case, but we conjecture, based on the experimental evidence, that one of the atoms is recovered with high probability when $k \le p^2/4$ (see Figure 1), and we prove a weaker result (Theorem 2.1). The matrix $G \in \mathbb{R}^{p \times p}$ determines which of the atoms $d_i d_i^\top$ is the optimizer; its choice is discussed when we construct a deflation procedure (Section 2.4.3; see also Appendix C.1.4).

Intuition. Since the generalized covariances $H_1, \ldots, H_s$ span the subspace W, the constraint set of (9) is:

$$K := \{B \in W : \mathrm{Tr}(B) = 1,\ B \succeq 0\}. \qquad (10)$$

It is not difficult to show (see Appendix C.2.2) that under Assumption 2.2 the atoms $d_i d_i^\top$ are extreme points of this set K:

Lemma 2.4.1. Let the atoms $d_1 d_1^\top, d_2 d_2^\top, \ldots, d_k d_k^\top$ be linearly independent. Then they are extreme points of the set K defined in (10).

If the program (9) has a unique solution, the optimizer $B^*_{\mathrm{sdp}}$ must be an extreme point due to the compactness of the convex set K. If the set (10) does not have other extreme points except for the atoms $d_i d_i^\top$, $i \in [k]$, then the optimizer is guaranteed to be one of the atoms. This might not be the case if the set K contains extreme points different from the atoms. This might explain why the phase transition (at the rate $k \le p^2/4$) happens and could potentially be related to the phenomenon of polyhedrality of spectrahedra⁵ (Bhardwaj et al., 2015).

Before diving into the analysis of this SDP, let us present its convex relaxation, which enjoys certain desirable properties.

2.4.2 The Convex Relaxation

Let us rewrite (9) in an equivalent form. The constraint $B \in W := \mathrm{Span}\{d_1 d_1^\top, \ldots, d_k d_k^\top\}$ is equivalent to the fact that B is orthogonal to any matrix from the orthogonal complement (null space) of W. Let the matrices $\{F_1, F_2, \ldots, F_{m-k}\}$, where $m := p(p+1)/2$, form a basis of the null space $\mathcal{N}(W)$.⁶ Then the program (9) takes the equivalent formulation:

$$\begin{aligned} B^*_{\mathrm{sdp}} := \operatorname*{argmax}_{B \in \mathcal{S}^p} \quad & \langle G, B \rangle \\ \text{s.t.} \quad & \langle B, F_j \rangle = 0, \text{ for all } j \in [m-k], \\ & \mathrm{Tr}(B) = 1, \\ & B \succeq 0. \end{aligned} \qquad (11)$$

In the presence of (e.g., finite sample) noise, the subspace W can only be estimated approximately (in the first step). Therefore, rather than keeping the hard first constraint, we introduce the relaxation

$$\begin{aligned} B^* := \operatorname*{argmax}_{B \in \mathcal{S}^p} \quad & \langle G, B \rangle - \frac{\mu}{2} \sum_{j \in [m-k]} \langle B, F_j \rangle^2 \\ \text{s.t.} \quad & \mathrm{Tr}(B) = 1, \quad B \succeq 0, \end{aligned} \qquad (12)$$

where µ > 0 is a regularization parameter which helps to adjust to an expected level of noise. Importantly, the relaxation (12) can be solved efficiently, e.g., via the fast iterative shrinkage-thresholding algorithm (FISTA; Beck and Teboulle, 2009) and the majorization-maximization principle (see, e.g., Hunter and Lange, 2004). See Appendix C.1 for details.

⁵ A spectrahedron is a set formed by an intersection of the positive semi-definite cone with linear constraints, e.g., the set K. Importantly, all polyhedra are spectrahedra, but not all spectrahedra are polyhedra.

⁶ Note that a basis of $\mathcal{N}(W)$ can be easily computed given the matrices $H_1, \ldots, H_s$.
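To make the projected-gradient idea concrete, here is a plain (non-accelerated) projected gradient ascent sketch for (12): the gradient of the objective is $G - \mu \sum_j \langle B, F_j\rangle F_j$, and the projection onto $\{\mathrm{Tr}(B) = 1, B \succeq 0\}$ amounts to projecting the eigenvalues onto the probability simplex (Duchi et al., 2008). The step-size heuristic, iteration count, and function names are our own illustrative choices; the paper's actual FISTA/MM implementation is described in its Appendix C.1.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def project_spectrahedron(B):
    """Projection onto {B symmetric, Tr(B) = 1, B >= 0} via its eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh((B + B.T) / 2)
    return (eigvecs * project_simplex(eigvals)) @ eigvecs.T

def solve_relaxation(G, F, mu, n_iter=500, step=None):
    """Simplified projected gradient ascent on the relaxation (12).

    F: list of null-space basis matrices F_j; mu: regularization parameter.
    """
    p = G.shape[0]
    if step is None:
        step = 1.0 / (mu * len(F) + 1.0)          # heuristic step size (assumption)
    B = np.eye(p) / p                             # feasible starting point
    for _ in range(n_iter):
        grad = G - mu * sum(np.sum(B * Fj) * Fj for Fj in F)   # gradient of (12)
        B = project_spectrahedron(B + step * grad)
    return B
```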

2.4.3 Deflation

The semi-definite program (9), or its relaxation (12), is designed to estimate only one atom $d_i d_i^\top$. To estimate all the other atoms we need a deflation procedure. In general, there is no easy and straightforward way to perform deflation in the overcomplete setting, but we discuss possible approaches below.

Clustering. Since the matrix G determines which atom is found, it is natural to repeatedly resample this matrix a multiple of k times and then cluster the obtained atoms into k clusters. This approach generally works well except in the cases where either (a) some of the atoms, say $d_i d_i^\top$ and $d_j d_j^\top$, are relatively close (e.g., in terms of the angle in the space of all symmetric matrices) to each other, or (b) one or several atoms were not properly estimated. In the former case, one could increase the number of times G is resampled and the program is solved, but that might require a very high number of repetitions. The latter issue is more difficult to fix, since a single wrong atom could significantly perturb the overall outcome.

Adaptive Deflation. Alternatively, one could adapt the constraint set iteratively to exclude from the search all the atoms found so far. For that, one can update the constraint set so that the subspace W is replaced with the subspace spanned by all the atoms except for the ones which were already found. The most natural way to implement this is to add the found atoms to a basis of the null space of W, which is straightforward to implement with the relaxation (12). As with other deflation approaches, a poor estimate of an atom obtained in an earlier deflation step can propagate its error, leading to an overall poor result.

Semi-Adaptive Deflation. We found that combining the two deflation approaches above leads to the best results in practice. In particular, we first perform clustering and keep only good clusters (those with low variance over the cluster), and then continue with the adaptive deflation approach. We use this semi-adaptive deflation approach in all the experiments presented in Section 3 (a sketch of the clustering part is given below).
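A sketch of the clustering variant only (the adaptive and semi-adaptive refinements are omitted), assuming a routine solve_one_atom(G) such as the solve_relaxation sketch above and scikit-learn's KMeans; all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_deflation(solve_one_atom, k, n_runs, p, rng):
    """Resample G, recover one atom per run, and cluster the atoms into k groups.

    Clustering the vectorized atoms d d^T (rather than the vectors d) makes the
    procedure invariant to the sign ambiguity of the mixing components.
    """
    atoms = []
    for _ in range(n_runs):
        G = rng.standard_normal((p, p))
        B = solve_one_atom(G)
        d = np.linalg.eigh(B)[1][:, -1]          # top eigenvector of B* (footnote 3)
        atoms.append(np.outer(d, d).reshape(-1))
    atoms = np.array(atoms)
    labels = KMeans(n_clusters=k, n_init=10).fit(atoms).labels_
    D_hat = []
    for c in range(k):
        mean_atom = atoms[labels == c].mean(axis=0).reshape(p, p)
        D_hat.append(np.linalg.eigh(mean_atom)[1][:, -1])
    return np.array(D_hat).T                     # columns are the estimated d_i
```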



Figure 1: Phase transition of the program (13).

2.4.4 Identifiability

In general, there are two types of identifiability of probabilistic models: (a) statistical and (b) algebraic. Statistical identifiability addresses whether the parameters of the model can be identified for given distributions. In particular, it is well known that the ICA model is not identifiable if more than one of the sources is Gaussian (Comon, 1994), and issues also arise when the sources are close to Gaussian (Sokol et al., 2014). These results also extend to the overcomplete case that we consider. However, we do not address these questions here and assume that the models we work with are statistically identifiable. Instead, we are interested in whether our approach is algebraically identifiable, i.e., whether our algorithm correctly recovers the parameters of the model. In particular, we address the following question: When is the solution $B^*_{\mathrm{sdp}}$ of the program (9) one of the atoms $d_i d_i^\top$, $i \in [k]$?

We address this question in theory and in practice and focus on the population (infinite number of samples) case, where we assume that an exact estimate of the subspace W is given and, therefore, one can use the representation $W := \mathrm{Span}\{d_1 d_1^\top, \ldots, d_k d_k^\top\}$ without loss of generality. Therefore, for the purposes of the theoretical analysis we assume that the atoms $d_i d_i^\top$ are known and consider the following program instead:

$$\begin{aligned} B^*_{\mathrm{sdp}} := \operatorname*{argmax}_{B \in \mathcal{S}^p} \quad & \langle G, B \rangle \\ \text{s.t.} \quad & B \in \mathrm{Span}\{d_1 d_1^\top, d_2 d_2^\top, \ldots, d_k d_k^\top\}, \\ & \mathrm{Tr}(B) = 1, \\ & B \succeq 0. \end{aligned} \qquad (13)$$

Phase Transition. In Figure 1, we present the phase transition plot for the program (13) obtained by solving the program multiple times for different settings. In particular, for every pair (p, k) we solve the program $n_{\mathrm{rep}} := 50$ times and assign to the respective point the value equal to the fraction of successful solutions (where the optimizer was one of the atoms).

Given a fixed pair (p, k), every instance of the program (13) is constructed as follows. We first sample a mixing matrix $D \in \mathbb{R}^{p \times k}$ so that every mixing component is from the standard normal distribution as described in Appendix D.1, and we sample a matrix $G \in \mathbb{R}^{p \times p}$ from the standard normal distribution. We then construct the constraint set of the program (13) by setting every matrix $H_i = d_i d_i^\top$ for all $i \in [k]$, where s = k. We solve every instance of this problem with the CVX toolbox (Grant et al., 2006) using the SeDuMi solver (Sturm, 1999).

We consider observed dimensions p from 10 to 50 in steps of 5 and vary the number of atoms k from 10 to 1000 in steps of 10. The resulting phase transition plots are presented in Figure 1. The blue line on this plot corresponds to the curve k = p(p + 1)/2, which is the largest possible latent dimension of the set of all symmetric matrices $\mathcal{S}^p$. The red line corresponds to the curve $k = p^2/4$. Since above the red line we observe 100% successful recovery (black), we conjecture that the phase transition happens around $k = p^2/4$.

Theoretical Results. Interestingly, an equivalent conjecture, $k < p^2/4$, was made for the ellipsoid fitting problem (Saunderson et al., 2012, 2013), and the question remains open to the best of our knowledge.⁷ In fact, we show a close relation between a successful solution (recovery of an atom) of our program (13) and the ellipsoid fitting problem. In particular, a successful solution of our problem implies that the feasibility of its Lagrange dual program is equivalent to the ellipsoid fitting problem (see Appendix C.2.3). Moreover, using this connection, we prove the following:

Theorem 2.1. Let ε > 0. Consider a regime with p tending to infinity, and with k varying according to the bound $k < (2 - \varepsilon)p \log p$. As above, let the $d_i$ be random unit vectors and let $G = uu^\top$ for a random unit vector u. Then with high probability⁸, the matrix $d_i d_i^\top$ for which $d_i^\top G d_i$ is largest is the unique maximizer of the program (13).

3 Experiments

It is difficult to objectively evaluate unsupervised learning algorithms on real data in the absence of ground truth parameters. Therefore, we first perform a comparison on synthetic data. All our experiments can be reproduced with the publicly available code: https://github.com/anastasia-podosinnikova/oica.

⁷ In Appendix C.2.1, we recall the formulation of the ellipsoid fitting problem and slightly improve the results of Saunderson et al. (2012, 2013).

⁸ Throughout, "with high probability" indicates probability tending to 1 as p → ∞.



Figure 2: A proof of concept in the asymptotic regime. See explanation in Section 3.1.


Figure 3: Comparison in the finite sample regime. See explanation in Section 3.2.

3.1 Synthetic Data: Population Case

As a proof of concept, this simple experiment (Figure 2) imitates the infinite sample case. Given a ground truth mixing matrix D, we construct a basis of the subspace W directly from the matrix $A := D \odot D$ (see Appendix B.2). This leads to a noiseless estimate of the subspace. We then evaluate the performance of the second step of our OverICA algorithm and compare it with the second step of FOOBI. We fix the observed dimension p = 10 and vary the latent dimension from k = 5 to k = 60 in steps of 5. For every pair (p, k), we repeat the experiment $n_{\mathrm{rep}} = 10$ times and display the minimum, median, and maximum values. Each time we sample the mixing matrix D with mixing components from the standard normal distribution (see Appendix D.1.1). Note that we tried different sampling methods and distributions of the mixing components, but did not observe any significant difference in the overall result. See Appendix D.1.2 for further details on this sampling procedure.

The error metrics (formally defined in Appendix D.2) are: (a) the f-error, which is essentially the relative Frobenius norm of the mixing matrices with properly permuted mixing components (lower is better); (b) the a-error, which measures the angle deviations of the estimated mixing components vs the ground truth (lower is better); and (c) the "perfect" recovery rates, which show for every $i \in [k]$ the fraction of perfectly estimated components. We say that a mixing component is "perfectly" recovered if the cosine of the angle between this component $d_i$ and its ground truth equivalent $\hat{d}_{\pi(i)}$ is at least 0.99, i.e., $\cos(d_i, \hat{d}_{\pi(i)}) \ge 0.99$. Note that the respective angle is approximately 8°. Then the black-and-white perfect recovery plots (in Figure 2) show whether the $i \le k$ (on the y-axis) components were perfectly recovered (black) for the given latent dimension k (x-axis). These black vertical bars cannot exceed the red line i = k, but the closer they approach this line, the better. The vertical green lines correspond to $k = p = 10$, $k = p^2/4 = 25$, $k = p(p-1)/2$, and $k = p(p+1)/2$. Importantly, we see that OverICA works better than or comparably to FOOBI in the regime $k < p^2/4$. The performance of OverICA starts to deteriorate near the regime $k \approx p^2/4$ and beyond, which is in accord with our theoretical results in Section 2.4.4. Note that, to see whether the algorithms work better than random, we display the errors of a randomly sampled mixing matrix (RAND; see Appendix D.1.1).
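The exact error definitions are in Appendix D.2 of the paper; as a rough illustration of the matching idea behind them, the following sketch pairs estimated and ground-truth components with the Hungarian algorithm (Kuhn, 1955; scipy's linear_sum_assignment) on the sign-invariant |cos| similarity. The function name and interface are ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_components(D_true, D_hat, threshold=0.99):
    """Match columns of D_hat to columns of D_true up to sign and permutation.

    Returns the |cos| of the angle for each matched pair and the fraction of
    "perfectly" recovered components (|cos| >= threshold, roughly 8 degrees).
    """
    Dn = D_true / np.linalg.norm(D_true, axis=0)
    Dh = D_hat / np.linalg.norm(D_hat, axis=0)
    cos = np.abs(Dn.T @ Dh)                      # |cos| removes the sign ambiguity
    rows, cols = linear_sum_assignment(-cos)     # maximize the total similarity
    matched = cos[rows, cols]
    return matched, float(np.mean(matched >= threshold))
```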

3.2 Synthetic Data: Finite Sample Case

With these synthetic data we evaluate the performance of overcomplete ICA algorithms in the presence of finite sample noise but in the absence of model misspecification. In particular, we sample synthetic data in the observed dimension p = 15 from the ICA model with k = 30 uniformly distributed (on [−0.5, 0.5]) sources for different sample sizes n, taking values from n = 1,000 to n = 10,000 in steps of 1,000 (two leftmost plots in the top line of Figure 3) and values from n = 10,000 to n = 210,000 in steps of 10,000 (two rightmost plots in the top line of Figure 3; see also Figure 6 in the Appendix for a log-linear scale).


Table 1: Computational complexities (n is the sample size, p is the observed dimension, k is the latent dimension, s is the number of generalized covariances, usually s = O(k)).

Procedure    | Memory          | Time
GenCov       | O(p²s)          | O(snp²)
CUM          | O(p⁴)           | O(np⁴ + k²p²)
FOOBI        | O(p⁴k² + k⁴)    | O(np⁴ + k²p⁴ + k⁶)
OverICA      | O(sp²)          | O(nsp²)
OverICA(Q)   | O(p⁴)           | O(np⁴ + k²p²)
Fourier PCA  | O(p⁴)           | O(np⁴)

Note that the choice of dimensions p = 15 and k = 30 corresponds to the regime $k < p^2/4 \approx 56$ of our guarantees. We repeat the experiment $n_{\mathrm{rep}} := 10$ times for every n, where we each time resample the (ground truth) mixing matrix (with the sampling procedure described in Appendix D.1.1). See further explanation in Appendix D.1.3.

We compare the Fourier PCA algorithm (Goyal et al., 2014), the FOOBI algorithm (De Lathauwer et al., 2007), OverICA from Algorithm 1, and a version of the OverICA algorithm where the first step is replaced with the construction based on the fourth-order cumulant, a.k.a. quadricovariance (OverICA(Q); see Appendix B.2). Note that we cannot compare with the reconstruction ICA algorithm by Le et al. (2011) because it estimates the de-mixing (instead of the mixing) matrix.⁹ Similarly to Section 3.1, we measure the Frobenius error (f-error), the angle error (a-error), and the perfect recovery for the angle of 8°. We observe that the generalized covariance-based OverICA algorithm performs slightly better, which we believe is due to its lower sample complexity. Fourier PCA, on the contrary, performs with larger error, which is probably due to its higher sample complexity and the larger noise resulting from estimation using fourth-order generalized cumulants.

3.3 Computational Complexities

In Table 1, we summarize the time and space complexities of the considered overcomplete ICA algorithms and two sub-procedures they use: generalized covariances (GenCov; used by OverICA) from Section 2.3 and the fourth-order cumulant (CUM; used by OverICA(Q) and FOOBI; see Appendix B.2); see also Appendix D.3. Importantly, we can see that our OverICA algorithm has a significantly lower complexity. In Appendix D.3, we present runtime comparisons of these algorithms.

⁹ In the complete invertible case, the de-mixing matrix would be the inverse of the mixing matrix. In the overcomplete regime, one cannot simply obtain the mixing matrix from the de-mixing matrix.

Figure 4: Mixing components obtained from 7-by-7 patches, i.e., p = 49, of the CIFAR-10 dataset (k = 150, i.e., overcomplete). ICA does not preserve non-negativity and the signs of ICA mixing components can be arbitrarily flipped due to the scaling unidentifiability; here black and white correspond to the extreme positive and extreme negative values. The colorbar limits of every image are the same and the signs are aligned to have positive scalar product with the first component.

3.4 Real Data: CIFAR-10 Patches

Finally, we estimate the overcomplete mixing matrix of data formed of patches of the CIFAR-10 dataset (see, e.g., Krizhevsky et al., 2014). In particular, we transform the images into greyscale and then form 7-by-7 patches centered at every interior point (at least 3 pixels from the boundary) of every image from training batch 1 of the CIFAR-10 dataset. This results in 6,760,000 patches, each of dimension p = 49. We perform the estimation of the mixing matrix for k = 150 latent mixing components. The resulting atoms are presented in Figure 4. Note that since ICA is scale (and therefore sign) invariant, the sign of every component can be arbitrarily flipped. We present the obtained components on a scale where black and white correspond to the extreme positive or negative values, and we observe that these peaks are concentrated in rather localized areas (which is a desirable property of latent components). Note that the runtime of this whole procedure was around 2 hours on a laptop. Due to their high time and space complexities (see Section 3.3), we cannot perform a similar estimation with either the FOOBI or the Fourier PCA algorithm.
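A sketch of this patch-extraction preprocessing (loading CIFAR-10 itself is omitted); the luminance weights used for the greyscale conversion and the function name are standard choices of ours, not specified in the paper.

```python
import numpy as np

def extract_patches(images, patch=7, margin=3):
    """Extract all 7x7 greyscale patches centered at interior pixels.

    images: (n, 32, 32, 3) array of CIFAR-10 images. Returns a (49, N) matrix
    with one flattened patch per column; with 10,000 images and a 3-pixel
    margin this yields 10,000 * 26 * 26 = 6,760,000 patches.
    """
    grey = images.astype(np.float64) @ np.array([0.299, 0.587, 0.114])
    half = patch // 2
    patches = []
    for img in grey:
        for i in range(margin, 32 - margin):
            for j in range(margin, 32 - margin):
                patches.append(img[i - half:i + half + 1, j - half:j + half + 1].ravel())
    return np.array(patches).T
```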

4 Conclusion

We presented a novel ICA algorithm for the estimation of the latent overcomplete mixing matrix. Our algorithm also works in the (under-)complete setting, enjoys lower computational complexity, and comes with theoretical guarantees, which is also confirmed by experiments.


Acknowledgements

A. Podosinnikova was partially supported by DARPA grant #W911NF-16-1-0551. A. Podosinnikova and D. Sontag were partially supported by NSF CAREER award #1350965. This work was supported in part by NSF CAREER Award CCF-1453261 and a grant from the MIT NEC Corporation. Part of this work was done while A. S. Wein was at the Massachusetts Institute of Technology. A. S. Wein received Government support under and awarded by DoD, Air Force Office of Scientific Research, National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a. A. S. Wein is also supported by NSF grant DMS-1712730 and by the Simons Collaboration on Algorithms and Geometry.

References

A. Anandkumar, R. Ge, and M. Janzamin. Learning overcomplete latent variable models through tensor methods. In Proceedings of the Conference on Learning Theory (COLT), 2015.

S. Arora, R. Ge, A. Moitra, and S. Sachdeva. Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and autoencoders. In Advances in Neural Information Processing Systems (NIPS), 2012.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

A. Bhardwaj, P. Rostalski, and R. Sanyal. Deciding polyhedrality of spectrahedra. SIAM Journal on Optimization, 25(3):1873–1884, 2015.

A. Bhaskara, M. Charikar, A. Moitra, and A. Vijayaraghavan. Smoothed analysis of tensor decompositions. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC), 2014a.

A. Bhaskara, M. Charikar, and A. Vijayaraghavan. Uniqueness of tensor decompositions with applications to polynomial identifiability. In Proceedings of the Conference on Learning Theory (COLT), 2014b.

A. Bovier. Extreme Values of Random Processes. Lecture Notes, Technische Universität Berlin, 2005.

A. Bunse-Gerstner, R. Byers, and V. Mehrmann. Numerical methods for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Applications, 14(4):927–949, 1993.

J.-F. Cardoso and A. Souloumiac. Blind beamforming for non-Gaussian signals. In IEE Proceedings F - Radar and Signal Processing. IEEE, 1993.

J.-F. Cardoso and A. Souloumiac. Jacobi angles for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Applications, 17(1):161–164, 1996.

S.S. Chen and D.L. Donoho. Basis Pursuit. Technical report, Stanford University, 1994.

S.S. Chen, D.L. Donoho, and M.A. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.

A. Coates, H. Lee, and A.Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

P. Comon. Independent component analysis, A new concept? Signal Processing, 36(3):287–314, 1994.

P. Comon and C. Jutten. Handbook of Blind Source Separation: Independent Component Analysis and Applications. Academic Press, 2010.

P. Comon and M. Rajih. Blind identification of under-determined mixtures based on the characteristic function. Signal Processing, 86(9):2271–2281, 2006.

I. Daubechies. Time-frequency localization operators: A geometric phase space approach. IEEE Transactions on Information Theory, 34(4):604–612, 1988.

L. De Lathauwer, J. Castaing, and J.-F. Cardoso. Fourth-order cumulant-based blind identification of underdetermined mixtures. IEEE Transactions on Signal Processing, 55(6):2965–2973, 2007.

J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the $\ell_1$-ball for learning in high dimensions. In Proceedings of the International Conference on Machine Learning (ICML), 2008.

N. Goyal, S. Vempala, and Y. Xiao. Fourier PCA and robust tensor decomposition. In Proceedings of the Annual ACM Symposium on Theory of Computing (STOC), 2014.

M. Grant, S. Boyd, and Y. Ye. Disciplined convex programming. In Global Optimization: from Theory to Implementation, Nonconvex Optimization and Its Applications. Springer, 2006.

D.R. Hunter and K. Lange. A tutorial on MM algorithms. The American Statistician, 58(1):30–37, 2004.

A. Hyvärinen. Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3):626–634, 1999.

A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research (JMLR), 6:695–708, 2005.


A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. Wiley, 2001.

A. Krizhevsky, V. Nair, and G. Hinton. The CIFAR-10 dataset. University of Toronto, 2014. URL http://www.cs.toronto.edu/kriz/cifar.html.

H.W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.

Q.L. Le, A. Karpenko, J. Ngiam, and A.Y. Ng. ICA with reconstruction cost for efficient overcomplete feature learning. In Advances in Neural Information Processing Systems (NIPS), 2011.

M.S. Lewicki and T.J. Sejnowski. Learning overcomplete representations. Neural Computation, 12(2):337–365, 2000.

T. Ma, J. Shi, and D. Steurer. Polynomial-time tensor decompositions with sum-of-squares. In Annual Symposium on Foundations of Computer Science (FOCS), 2016.

B.A. Olshausen and D.J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.

B.A. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.

A. Podosinnikova, F. Bach, and S. Lacoste-Julien. Beyond CCA: Moment matching for multi-view models. In Proceedings of the International Conference on Machine Learning (ICML), 2016.

J. Saunderson. Subspace identification via convex optimization. PhD thesis, Massachusetts Institute of Technology, 2011.

J. Saunderson, V. Chandrasekaran, P.A. Parrilo, and A.S. Willsky. Diagonal and low-rank matrix decompositions, correlation matrices, and ellipsoid fitting. Technical report, arXiv:1204.1220v1, 2012.

J. Saunderson, P.A. Parrilo, and A.S. Willsky. Diagonal and low-rank decompositions and fitting ellipsoids to random points. In Proceedings of the IEEE Conference on Decision and Control (CDC), 2013.

A. Sokol, M.H. Maathuis, and B. Falkeborg. Quantifying identifiability in independent component analysis. Electronic Journal of Statistics, 8:1438–1459, 2014.

J.F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11(12):625–633, 1999.

Y.W. Teh, M. Welling, S. Osindero, and G.E. Hinton. Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research (JMLR), 4:1235–1260, 2003.

A. Yeredor. Blind source separation via the second characteristic function. Signal Processing, 80(5):897–902, 2000.