
COMPLETE DICTIONARY RECOVERY OVER THE SPHERE

JU SUN∗, QING QU∗, AND JOHN WRIGHT∗

Abstract. In this work, we consider the problem of recovering a complete dictionary $A_0$ from $Y = A_0 X_0$ with $Y \in \mathbb{R}^{n\times p}$. We give the first efficient algorithm that provably recovers $A_0$ when $X_0$ has $O(n)$ nonzeros per column. This algorithm is based on nonconvex optimization. Our proofs give a geometric characterization of the high-dimensional objective landscape, which shows that with high probability there are no spurious local minima when the coefficients $X_0$ assume the i.i.d. Bernoulli-Gaussian model.

Key words. Dictionary learning, Recovery guarantee, Geometric analysis, Nonconvex optimization, Riemannian trust-region method

1. Introduction. Dictionary learning (DL) is the problem of finding a sparse representation for a given family of signals. Its applications span classical image processing, visual recognition, compressive signal acquisition, and, recently, deep architectures for signal classification [12, 23].

Despite many empirical successes, relatively little is known about the theoretical properties of DL algorithms. Typical formulations for DL are badly nonconvex, almost inevitably involving bilinear terms in the objective or constraint and restricting the target dictionary to some nonconvex admissible set. Moreover, the intrinsic symmetry (see, e.g., [13, 14]) associated with DL appears to resist any convexification trick. In fact, even proving that the target solution is a local minimum requires nontrivial analysis [13, 14, 29, 30]. Obtaining global solutions with efficient algorithms is a standing challenge.

Suppose that the data matrix $Y$ is generated as $Y = A_0 X_0$, where $A_0 \in \mathbb{R}^{n\times m}$ and $X_0 \in \mathbb{R}^{m\times p}$. Existing results pertain only to highly sparse $X_0$. For example, [31] showed that a sequence of certain linear programming relaxations can recover a complete ($m = n$) dictionary $A_0$ from $Y$ when $X_0$ is a sparse random matrix with $O(\sqrt{n})$ nonzeros per column. [3, 4] and [7] have subsequently given efficient algorithms for the overcomplete setting ($m \ge n$), based on a combination of clustering and local refinement. These algorithms again succeed when $X_0$ has $\widetilde{O}(\sqrt{n})$¹ nonzeros per column. Recent works, including [6, 8], guarantee recovery with (almost) linear sparsity, but run in super-polynomial (quasipolynomial) time. Giving efficient algorithms that provably succeed beyond this "$\sqrt{n}$ barrier" is an open problem.

In this work, we consider the problem of recovering a complete dictionary $A_0$ from $Y = A_0 X_0$. We give the first polynomial-time algorithm that provably recovers $A_0$ when $X_0$ has $O(n)$ nonzeros per column. This algorithm is based on nonconvex optimization. Our proofs give a geometric characterization of the high-dimensional objective landscape, which shows that with high probability (w.h.p.) there are no spurious local minima under our probability model for $X_0$. In particular, the geometric structure allows us to design a trust-region algorithm over the sphere that provably converges to a global minimizer from an arbitrary initialization.

2. A Peek into High-dimensional Geometry. Since $Y = A_0 X_0$ and $A_0$ is complete (square and nonsingular), $\mathrm{row}(Y) = \mathrm{row}(X_0)$.² So rows of $X_0$ are sparse vectors in the known subspace $\mathrm{row}(Y)$.

∗ Department of Electrical Engineering, Columbia University, New York, NY 10027. Email: {js4038, qq2105, jw2966}@columbia.edu

¹ The $\widetilde{O}$ suppresses some logarithm factors.
² $\mathrm{row}(\cdot)$ denotes the row space.


[Figure 2.1 here: surface plot of the expected objective over the sphere (left) and its height map over the equatorial plane (center/right), with the annotated regions "strongly convex," "nonzero gradient," and "negative curvature."]

Fig. 2.1. Why is dictionary learning over $\mathbb{S}^{n-1}$ tractable? Assume the target dictionary $A_0$ is orthogonal. Left: large-sample objective function $\mathbb{E}_{X_0}[f(q)]$. The only local minima are the columns of $A_0$ and their negatives. Center: the same function, visualized as a height above the plane $a_1^{\perp}$ ($a_1$ is the first column of $A_0$). Around the optimum, the function exhibits a small region of positive curvature, a region of large gradient, and finally a region in which the direction away from $a_1$ is a direction of negative curvature (right).

Following [31], we use this fact to first recover the rows of $X_0$, and subsequently recover $A_0$ by solving a system of linear equations. Under suitable probability models on $X_0$, the rows of $X_0$ are the $n$ sparsest vectors (directions) in $\mathrm{row}(Y)$ [31]. Thus one might try to recover them by solving³

$$\min_q\ \|q^* Y\|_0 \quad \text{subject to} \quad q \neq 0. \qquad (2.1)$$

The objective is discontinuous, and the domain is an open set. Known convex relaxations [11, 31] provably break down beyond the aforementioned $\sqrt{n}$ barrier.

Instead, we work with a nonconvex alternative:⁴

$$\min\ f(q) \doteq \frac{1}{p}\sum_{i=1}^{p} h_\mu(q^* y_i), \quad \text{subject to} \quad \|q\|_2 = 1, \qquad (2.2)$$

where $y_i$ is the $i$-th column of $Y$. Here $h_\mu(\cdot)$ is chosen to be a convex smooth approximation to $\|\cdot\|_1$,

$$h_\mu(z) = \mu \log\left(\frac{e^{z/\mu} + e^{-z/\mu}}{2}\right), \qquad (2.3)$$

which is infinitely differentiable; $\mu$ controls the smoothing level. The spherical constraint is nonconvex.
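For concreteness, the following is a minimal numerical sketch (ours, not the authors' code; all function names are our own) of the smoothed objective (2.2)-(2.3), together with its Euclidean gradient and Hessian, which follow from $h_\mu'(z) = \tanh(z/\mu)$ and $h_\mu''(z) = \mu^{-1}\left(1 - \tanh^2(z/\mu)\right)$.

```python
import numpy as np

def h_mu(z, mu):
    """Smooth surrogate (2.3): mu * log((exp(z/mu) + exp(-z/mu)) / 2),
    a smooth approximation to |z| (written stably via logaddexp)."""
    return mu * (np.logaddexp(z / mu, -z / mu) - np.log(2.0))

def f_obj(q, Y, mu):
    """Objective (2.2): average of h_mu(q^T y_i) over the columns y_i of Y."""
    return np.mean(h_mu(Y.T @ q, mu))

def euclidean_grad(q, Y, mu):
    """Euclidean gradient: (1/p) * sum_i tanh(q^T y_i / mu) * y_i."""
    p = Y.shape[1]
    return Y @ np.tanh(Y.T @ q / mu) / p

def euclidean_hess(q, Y, mu):
    """Euclidean Hessian: (1/(p*mu)) * sum_i (1 - tanh^2(q^T y_i / mu)) * y_i y_i^T."""
    p = Y.shape[1]
    w = 1.0 - np.tanh(Y.T @ q / mu) ** 2
    return (Y * w) @ Y.T / (p * mu)
```

These Euclidean derivatives are the ingredients for the Riemannian quantities used in Section 3.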

Despite this nonconvexity, simple descent algorithms for (2.2) exhibit very striking behavior: on many practical numerical examples, they appear to produce global solutions. To attempt to shed some light on this phenomenon, we analyze their behavior when $X_0$ follows the Bernoulli-Gaussian (BG) model: $[X_0]_{ij} = \Omega_{ij} V_{ij}$, with $\Omega_{ij} \sim \mathrm{Ber}(\theta)$ and $V_{ij} \sim \mathcal{N}(0, 1)$. For the moment, suppose that $A_0$ is orthogonal. Fig. 2.1 plots the landscape of $\mathbb{E}_{X_0}[f(q)]$ over $\mathbb{S}^2$. Remarkably, $\mathbb{E}_{X_0}[f(q)]$ has no spurious local minima. Every local minimum $q$ produces a row of $X_0$: $q^* Y = \alpha e_i^* X_0$. Moreover, the geometry implies that at any nonoptimal point, there is always at least one direction of descent.
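As an illustration of this probability model (a sketch with arbitrary sizes, not the authors' experimental setup), a problem instance can be generated as follows.

```python
import numpy as np

def bernoulli_gaussian(m, p, theta, rng):
    """Coefficients [X0]_ij = Omega_ij * V_ij, with Omega_ij ~ Ber(theta), V_ij ~ N(0, 1)."""
    omega = rng.random((m, p)) < theta
    return omega * rng.standard_normal((m, p))

rng = np.random.default_rng(0)
n, p, theta = 50, 5000, 0.3                         # illustrative sizes only
X0 = bernoulli_gaussian(n, p, theta, rng)
A0, _ = np.linalg.qr(rng.standard_normal((n, n)))   # a random orthogonal dictionary
Y = A0 @ X0                                         # observed data
```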

Probabilistic arguments show that this structure persists in high dimensions ($n \ge 3$), with high probability,⁵ even when the number of observations $p$ is large yet finite.

³ The notation ∗ denotes matrix transposition.
⁴ A similar formulation has been proposed in [33] in the context of blind source separation; see also [28].
⁵ Here the probability is with respect to the randomness of $X_0$.


Theorem 2.1 makes this precise, in the special case $A_0 = I$. The result characterizes the properties of the reparameterization $g(w) = f(q(w))$ obtained by projecting $\mathbb{S}^{n-1}$ onto the equatorial plane $e_n^{\perp}$; see Fig. 2.1 (center).

Theorem 2.1. For any $\theta \in (1/n, 1/2)$ and $\mu < O(\theta n^{-1}, n^{-5/4})$, when $p \ge C n^3 \log n / (\mu^2 \theta^2)$ the following hold simultaneously w.h.p.:

$$\nabla^2 g(w) \succeq \frac{1}{\mu}\, c_\star \theta\, I \qquad \forall\, w \ \text{s.t.}\ \|w\| \le \frac{\mu}{4\sqrt{2}}, \qquad (2.4)$$

$$\frac{w^* \nabla g(w)}{\|w\|} \ge c_\star \theta \qquad \forall\, w \ \text{s.t.}\ \frac{\mu}{4\sqrt{2}} < \|w\| \le \frac{1}{20\sqrt{5}}, \qquad (2.5)$$

$$\frac{w^* \nabla^2 g(w)\, w}{\|w\|^2} \le -c_\star \theta \qquad \forall\, w \ \text{s.t.}\ \frac{1}{20\sqrt{5}} < \|w\| \le \sqrt{\frac{2n-1}{2n}}, \qquad (2.6)$$

for some constant $c_\star > 0$, and $g(w)$ has a unique minimizer $w_\star$ over $\left\{w : \|w\| \le \sqrt{\frac{2n-1}{2n}}\right\}$, and $w_\star$ satisfies

$$\|w_\star - 0\| \le O\left(\frac{\mu}{\theta}\sqrt{\frac{n \log p}{p}}\right). \qquad (2.7)$$

In words, one sees the strongly convex, nonzero gradient, and negative curvature regions successively when moving away from each target solution, and (in the finite-$p$ case) the local (also global) minimizers of $f(q)$ are next to the target solutions in their respective symmetric sections. Here $\theta$ controls the fraction of nonzeros in $X_0$ in our probability model. Where previous results required $\theta = O(1/\sqrt{n})$, our algorithm (described below) succeeds even when $\theta = 1/2 - \varepsilon$. The geometric characterization in Theorem 2.1 can be extended to general orthogonal $A_0$ by a simple rotation, and to general complete $A_0 \in \mathbb{R}^{n\times n}$ by preconditioning, in conjunction with a perturbation argument.
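To see why a simple rotation suffices when $A_0$ is orthogonal (a one-line observation we add for completeness), note that the objective depends on $q$ only through the correlations $q^* y_i$:

$$q^* Y = q^* A_0 X_0 = (A_0^* q)^* X_0, \qquad \text{so} \qquad f(q;\, A_0 X_0) = f(A_0^* q;\, X_0).$$

Since $q \mapsto A_0^* q$ maps $\mathbb{S}^{n-1}$ isometrically onto itself, the landscape for a general orthogonal $A_0$ is a rotated copy of the landscape for $A_0 = I$.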

3. Riemannian Trust Region Algorithm. Although the problem has no spurious local minima, it does have many saddle points (Fig. 2.1). Intuitively, second-order information is needed to help escape the saddle points. We describe a Riemannian trust-region method (TRM) [1, 2] over the sphere for this purpose.

Typical Euclidean TRM proceeds by successively forming a second-order approximation to the objective function at the current iterate, and seeking the next iterate by minimizing the approximation within a small radius (the "trust region") [10, 27]. To generalize the idea to smooth manifolds, one natural choice is to form the approximation over the tangent spaces [1, 2]. For our spherical manifold, the tangent space at an iterate $q_k \in \mathbb{S}^{n-1}$ is $T_{q_k}\mathbb{S}^{n-1} \doteq \{v : v^* q_k = 0\}$, and we work with the quadratic approximation $\widehat{f} : T_{q_k}\mathbb{S}^{n-1} \to \mathbb{R}$ defined as

$$\widehat{f}(q_k, \delta) \doteq f(q_k) + \langle \nabla f(q_k), \delta \rangle + \frac{1}{2}\, \delta^* \left(\nabla^2 f(q_k) - \langle \nabla f(q_k), q_k \rangle I\right) \delta. \qquad (3.1)$$

To interpret this approximation, let $P_{T_{q_k}\mathbb{S}^{n-1}} \doteq I - q_k q_k^*$ be the orthoprojector onto $T_{q_k}\mathbb{S}^{n-1}$ and rewrite (3.1) in the equivalent form

$$\widehat{f}(q_k, \delta) \doteq f(q_k) + \left\langle P_{T_{q_k}\mathbb{S}^{n-1}} \nabla f(q_k),\, \delta \right\rangle + \frac{1}{2}\, \delta^* P_{T_{q_k}\mathbb{S}^{n-1}} \left(\nabla^2 f(q_k) - \langle \nabla f(q_k), q_k\rangle I\right) P_{T_{q_k}\mathbb{S}^{n-1}}\, \delta. \qquad (3.2)$$


The two terms

$$\mathrm{grad} f(q_k) \doteq P_{T_{q_k}\mathbb{S}^{n-1}} \nabla f(q_k), \qquad (3.3)$$

$$\mathrm{Hess} f(q_k) \doteq P_{T_{q_k}\mathbb{S}^{n-1}} \left(\nabla^2 f(q_k) - \langle \nabla f(q_k), q_k\rangle I\right) P_{T_{q_k}\mathbb{S}^{n-1}} \qquad (3.4)$$

are the Riemannian gradient and Riemannian Hessian of $f$ w.r.t. $\mathbb{S}^{n-1}$, respectively [1, 2], turning (3.1) into the familiar form of a quadratic approximation.
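A direct numerical transcription of (3.3) and (3.4) follows (again our own sketch; it reuses the hypothetical `euclidean_grad` and `euclidean_hess` helpers from the earlier snippet).

```python
import numpy as np

def riemannian_grad_hess(q, Y, mu):
    """Riemannian gradient (3.3) and Hessian (3.4) of f over the sphere S^{n-1}."""
    n = q.shape[0]
    P = np.eye(n) - np.outer(q, q)              # orthoprojector onto T_q S^{n-1}
    g = euclidean_grad(q, Y, mu)                # from the earlier sketch
    H = euclidean_hess(q, Y, mu)
    grad = P @ g                                # (3.3)
    hess = P @ (H - (g @ q) * np.eye(n)) @ P    # (3.4), using <grad f(q), q> = g @ q
    return grad, hess
```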

The Riemannian trust-region subproblem is

$$\min_{\delta \in T_{q_k}\mathbb{S}^{n-1},\ \|\delta\| \le \Delta}\ \widehat{f}(q_k, \delta), \qquad (3.5)$$

where $\Delta > 0$ is the familiar trust-region parameter. Taking any orthonormal basis $U_{q_k}$ for $T_{q_k}\mathbb{S}^{n-1}$, we can transform (3.5) into a classical Euclidean trust-region subproblem:

$$\min_{\|\xi\| \le \Delta}\ \widehat{f}(q_k, U_{q_k}\xi), \qquad (3.6)$$

for which very efficient numerical algorithms exist [17, 24]. Once we obtain the minimizer $\xi_\star$, we set $\delta_\star = U_{q_k}\xi_\star$, which solves (3.5). To find the next iterate $q_{k+1}$ that lives on $\mathbb{S}^{n-1}$, we use the natural exponential map:

$$q_{k+1} \doteq \exp_{q_k}(\delta_\star) = q_k \cos\|\delta_\star\| + \frac{\delta_\star}{\|\delta_\star\|}\sin\|\delta_\star\|. \qquad (3.7)$$
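To make one iteration concrete, here is a schematic sketch (ours, not the authors' implementation): it builds an orthonormal basis $U_{q_k}$ for the tangent space, solves the small subproblem (3.6) by a simple eigenvalue-and-bisection scheme standing in for the efficient solvers of [17, 24], and retracts to the sphere via the exponential map (3.7). It reuses `riemannian_grad_hess` from the previous sketch.

```python
import numpy as np

def solve_tr_subproblem(g, B, Delta, tol=1e-10):
    """Solve min_{||xi|| <= Delta} g^T xi + 0.5 * xi^T B xi by bisection on the
    Lagrange multiplier lambda; the degenerate 'hard case' is not handled here."""
    lam_min = np.linalg.eigvalsh(B)[0]
    lo = max(0.0, -lam_min) + 1e-12

    def step(lam):
        return np.linalg.solve(B + lam * np.eye(B.shape[0]), -g)

    if lam_min > 0 and np.linalg.norm(step(0.0)) <= Delta:
        return step(0.0)                        # unconstrained Newton step is feasible
    hi = lo + 1.0
    while np.linalg.norm(step(hi)) > Delta:     # bracket lambda from above
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(step(mid)) > Delta else (lo, mid)
    return step(hi)

def trm_step(q, Y, mu, Delta):
    """One Riemannian trust-region step: subproblem (3.6), then retraction (3.7)."""
    n = q.shape[0]
    grad, hess = riemannian_grad_hess(q, Y, mu)                   # previous sketch
    U = np.linalg.svd(np.eye(n) - np.outer(q, q))[0][:, :n - 1]   # basis of T_q S^{n-1}
    xi = solve_tr_subproblem(U.T @ grad, U.T @ hess @ U, Delta)
    delta = U @ xi
    t = np.linalg.norm(delta)
    if t < 1e-16:
        return q
    return q * np.cos(t) + (delta / t) * np.sin(t)                # exponential map (3.7)
```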

Using the geometric characterization in Theorem 2.1, we prove that when the parameter $\Delta$ is sufficiently small, (1) the trust-region step induces at least a fixed amount of decrease in the objective value in the negative curvature and nonzero gradient regions; (2) the trust-region iterate sequence eventually moves to and stays in the strongly convex region, and converges to the global minimizer at an asymptotically quadratic rate. In particular, the geometry implies that from any initialization, the iterate sequence converges to a close approximation of the target solution in a polynomial number of steps.
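As a purely illustrative usage example (our sketch, built from the hypothetical helpers above, with arbitrary parameter values rather than the constants demanded by Theorem 3.1 below), a complete descent loop looks as follows.

```python
import numpy as np

# Assumes Y and trm_step from the sketches above; mu and Delta below are illustrative.
rng = np.random.default_rng(1)
q = rng.standard_normal(Y.shape[0])
q /= np.linalg.norm(q)                    # arbitrary initialization on the sphere

mu, Delta = 0.01, 0.05
for _ in range(500):
    q_next = trm_step(q, Y, mu, Delta)
    if np.linalg.norm(q_next - q) < 1e-8:
        break
    q = q_next
# If the run succeeds, q approximates a signed column of A0 (orthogonal case),
# and q^T Y approximates a scaled row of X0.
```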

Theorem 3.1. Suppose that $\theta \in \left(\frac{1}{n}, \frac{1}{2}\right)$ and $\mu < O\left(\frac{c_\star \theta}{n \log(np)}, \frac{\theta}{n}, \frac{1}{n^{5/4}}\right)$, where $c_\star$ is the same as in Theorem 2.1, and that $p \ge C n^3 \log n / (\mu^2 \theta^2)$ for some large positive constant $C$. Then w.h.p. the Riemannian TRM as described above with step size

$$\Delta \le O\left(\frac{c_\star \theta \mu^2}{n^{5/2} \log^{5/2}(np)},\ \frac{c_\star^3 \theta^3 \mu}{n^{7/2} \log^{7/2}(np)}\right) \qquad (3.8)$$

returns a $q$ such that $\|q - q_\star\| \le \varepsilon$ in at most

$$O\left(\max\left\{\frac{n^6 \log^3(np)}{c_\star^3 \theta^3 \mu^4},\ \frac{n \log(np)}{c_\star^2 \theta^2 \Delta^2}\right\}\left(f(q_0) - f(q_\star)\right) + \log\log\frac{c_\star \theta \mu}{\varepsilon n^{3/2} \log^{3/2}(np)}\right)$$

steps.

Using this algorithm, together with rounding and deflation techniques to obtain all $n$ rows of $X_0$, we obtain a polynomial-time algorithm for complete DL that works in the linear sparsity regime.


4. Discussion. Our result should be compared to previous analyses, which either demanded much more stringent (sublinear) sparsity assumptions [3, 4, 7, 31], or did not provide efficient algorithms [6, 8] (super-polynomial running time).

The particular geometry of this problem does not demand any clever initialization, in contrast with most recent approaches to analyzing nonconvex recovery of structured signals [3–7, 9, 15, 16, 18–22, 25, 26, 28, 32]. The geometric approach taken here may apply to these problems as well. Finally, for dictionary learning, the geometry appears to be stable to small noise, allowing almost plug-and-play stability analysis.

Acknowledgement. JS thanks the Wei Family Private Foundation for their generous support. This work was partially supported by grants ONR N00014-13-1-0492, NSF 1343282, and funding from the Moore and Sloan Foundations.

References.

[1] P.-A. Absil, C. G. Baker, and K. A. Gallivan, Trust-region methods on Riemannian manifolds, Foundations of Computational Mathematics, 7 (2007), pp. 303–330.

[2] Pierre-Antoine Absil, Robert Mahony, and Rodolphe Sepulchre, Optimization Algorithms on Matrix Manifolds, Princeton University Press, 2009.

[3] Alekh Agarwal, Animashree Anandkumar, Prateek Jain, Praneeth Netrapalli, and Rashish Tandon, Learning sparsely used overcomplete dictionaries via alternating minimization, arXiv preprint arXiv:1310.7991, (2013).

[4] Alekh Agarwal, Animashree Anandkumar, and Praneeth Netrapalli, Exact recovery of sparsely used overcomplete dictionaries, arXiv preprint arXiv:1309.1952, (2013).

[5] Animashree Anandkumar, Rong Ge, and Majid Janzamin, Guaranteed non-orthogonal tensor decomposition via alternating rank-1 updates, arXiv preprint arXiv:1402.5180, (2014).

[6] Sanjeev Arora, Aditya Bhaskara, Rong Ge, and Tengyu Ma, More algorithms for provable dictionary learning, arXiv preprint arXiv:1401.0579, (2014).

[7] Sanjeev Arora, Rong Ge, and Ankur Moitra, New algorithms for learning incoherent and overcomplete dictionaries, arXiv preprint arXiv:1308.6273, (2013).

[8] Boaz Barak, Jonathan A. Kelner, and David Steurer, Dictionary learning and tensor decomposition via the sum-of-squares method, arXiv preprint arXiv:1407.1543, (2014).

[9] Emmanuel Candès, Xiaodong Li, and Mahdi Soltanolkotabi, Phase retrieval via Wirtinger flow: Theory and algorithms, arXiv preprint arXiv:1407.1065, (2014).

[10] Andrew R. Conn, Nicholas I. M. Gould, and Philippe L. Toint, Trust-region Methods, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2000.

[11] Laurent Demanet and Paul Hand, Scaling law for recovering the sparsest element in a subspace, Information and Inference, 3 (2014), pp. 295–309.

[12] Michael Elad, Sparse and redundant representations: from theory to applications in signal and image processing, Springer, 2010.

[13] Quan Geng and John Wright, On the local correctness of ℓ1-minimization for dictionary learning, Submitted to IEEE Transactions on Information Theory, (2011). Preprint: http://www.columbia.edu/~jw2966.

[14] Rémi Gribonval and Karin Schnass, Dictionary identification - sparse matrix-factorization via ℓ1-minimization, IEEE Transactions on Information Theory, 56 (2010), pp. 3523–3539.

[15] Moritz Hardt, Understanding alternating minimization for matrix completion, in Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on, IEEE, 2014, pp. 651–660.

[16] Moritz Hardt and Mary Wootters, Fast matrix completion without the condition number, in Proceedings of The 27th Conference on Learning Theory, 2014, pp. 638–678.


[17] Elad Hazan and Tomer Koren, A linear-time algorithm for trust region problems, arXiv preprint arXiv:1401.6757, (2014).

[18] Prateek Jain and Praneeth Netrapalli, Fast exact matrix completion with finite samples, arXiv preprint arXiv:1411.1087, (2014).

[19] Prateek Jain, Praneeth Netrapalli, and Sujay Sanghavi, Low-rank matrix completion using alternating minimization, in Proceedings of the Forty-fifth Annual ACM Symposium on Theory of Computing, ACM, 2013, pp. 665–674.

[20] Prateek Jain and Sewoong Oh, Provable tensor factorization with missing data, in Advances in Neural Information Processing Systems, 2014, pp. 1431–1439.

[21] Raghunandan H. Keshavan, Andrea Montanari, and Sewoong Oh, Matrix completion from a few entries, IEEE Transactions on Information Theory, 56 (2010), pp. 2980–2998.

[22] Kiryung Lee, Yihong Wu, and Yoram Bresler, Near optimal compressed sensing of sparse rank-one matrices via sparse power factorization, arXiv preprint arXiv:1312.0525, (2013).

[23] Julien Mairal, Francis Bach, and Jean Ponce, Sparse modeling for image and vision processing, Foundations and Trends in Computer Graphics and Vision, 8 (2014).

[24] J. J. Moré and D. C. Sorensen, Computing a trust region step, SIAM J. Scientific and Statistical Computing, 4 (1983), pp. 553–572.

[25] Praneeth Netrapalli, Prateek Jain, and Sujay Sanghavi, Phase retrieval using alternating minimization, in Advances in Neural Information Processing Systems, 2013, pp. 2796–2804.

[26] Praneeth Netrapalli, U. N. Niranjan, Sujay Sanghavi, Animashree Anandkumar, and Prateek Jain, Non-convex robust PCA, in Advances in Neural Information Processing Systems, 2014, pp. 1107–1115.

[27] Jorge Nocedal and Stephen Wright, Numerical Optimization, Springer, 2006.

[28] Qing Qu, Ju Sun, and John Wright, Finding a sparse vector in a subspace: Linear sparsity using alternating directions, in Advances in Neural Information Processing Systems, 2014, pp. 3401–3409.

[29] Karin Schnass, Local identification of overcomplete dictionaries, arXiv preprint arXiv:1401.6354, (2014).

[30] Karin Schnass, On the identifiability of overcomplete dictionaries via the minimisation principle underlying K-SVD, Applied and Computational Harmonic Analysis, 37 (2014), pp. 464–491.

[31] Daniel A. Spielman, Huan Wang, and John Wright, Exact recovery of sparsely-used dictionaries, in Proceedings of the 25th Annual Conference on Learning Theory, 2012.

[32] Xinyang Yi, Constantine Caramanis, and Sujay Sanghavi, Alternating minimization for mixed linear regression, arXiv preprint arXiv:1310.3745, (2013).

[33] Michael Zibulevsky and Barak Pearlmutter, Blind source separation by sparse decomposition in a signal dictionary, Neural Computation, 13 (2001), pp. 863–882.