
Recursive Importance Sketching for Rank Constrained Least Squares: Algorithms and High-order Convergence

Yuetian Luo1, Wen Huang2, Xudong Li3, and Anru R. Zhang1,4

March 16, 2021

Abstract

In this paper, we propose a new Recursive Importance Sketching algorithm for Rank constrained least squares Optimization (RISRO). As its name suggests, the algorithm is based on a new sketching framework, recursive importance sketching. Several existing algorithms in the literature can be reinterpreted under the new sketching framework, and RISRO offers clear advantages over them. RISRO is easy to implement and computationally efficient, where the core procedure in each iteration is only solving a dimension reduced least squares problem. Different from numerous existing algorithms with locally geometric convergence rates, we establish local quadratic-linear and quadratic rates of convergence for RISRO under some mild conditions. In addition, we discover a deep connection of RISRO to Riemannian manifold optimization on fixed rank matrices. The effectiveness of RISRO is demonstrated in two applications in machine learning and statistics: low-rank matrix trace regression and phase retrieval. Simulation studies demonstrate the superior numerical performance of RISRO.

Keywords: Rank constrained least squares, Sketching, Quadratic convergence, Riemannian manifold optimization, Low-rank matrix recovery, Non-convex optimization

1 Introduction

The focus of this paper is on the rank constrained least squares:

$$\min_{X\in\mathbb{R}^{p_1\times p_2}} f(X) := \frac{1}{2}\|y - \mathcal{A}(X)\|^2_2, \quad \text{subject to } \mathrm{rank}(X) = r. \qquad (1)$$

Here, $y\in\mathbb{R}^n$ is the given data and $\mathcal{A}: \mathbb{R}^{p_1\times p_2}\to\mathbb{R}^n$ is a known linear map that can be explicitly represented as

$$\mathcal{A}(X) = \left[\langle A_1, X\rangle, \ldots, \langle A_n, X\rangle\right]^\top, \quad \langle A_i, X\rangle = \sum_{1\le j\le p_1,\ 1\le k\le p_2}(A_i)_{[j,k]}X_{[j,k]}, \qquad (2)$$

with given measurement matrices $A_i\in\mathbb{R}^{p_1\times p_2}$, $i = 1,\ldots,n$.

1 Department of Statistics, University of Wisconsin-Madison ([email protected], [email protected]). Y. Luo would like to thank RAship from the Institute for Foundations of Data Science at UW-Madison.
2 School of Mathematical Sciences, Xiamen University ([email protected]).
3 School of Data Science and Shanghai Center for Mathematical Sciences, Fudan University ([email protected]).
4 Department of Biostatistics and Bioinformatics, Duke University.

The expected rank is assumed to be known in Problem (1) since in some applications, such as phase retrieval and blind deconvolution, the expected rank is known to be one. If the expected rank is unknown, it is typical to optimize over the set of fixed rank matrices using the formulation of (1) and dynamically update the rank; see, e.g., Vandereycken and Vandewalle (2010); Zhou et al. (2016).
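For concreteness, the following is a minimal NumPy rendering of the linear map in (2) and the objective in (1). Storing the measurement matrices densely in an array `A_tensors` of shape (n, p1, p2) is an illustrative assumption; structured maps (e.g., matrix completion) would not be handled this way in practice.

```python
import numpy as np

def A_map(A_tensors, X):
    """Apply the linear map in (2): A(X) = [<A_1, X>, ..., <A_n, X>]^T."""
    return np.tensordot(A_tensors, X, axes=([1, 2], [0, 1]))

def f(A_tensors, y, X):
    """Objective in (1): 0.5 * ||y - A(X)||_2^2."""
    return 0.5 * np.sum((y - A_map(A_tensors, X)) ** 2)
```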

The rank constrained least squares problem (1) is motivated by the widely studied low-rank matrix recovery problem, where the goal is to recover a low-rank matrix $X^*$ from the observation $y = \mathcal{A}(X^*) + \epsilon$ ($\epsilon$ is the noise). This problem is of fundamental importance in a variety of fields such as optimization, machine learning, signal processing, scientific computation, and statistics. With different realizations of $\mathcal{A}$, (1) covers many applications, such as matrix trace regression (Candès and Plan, 2011; Davenport and Romberg, 2016), matrix completion (Candès and Tao, 2010; Keshavan et al., 2009; Koltchinskii et al., 2011; Miao et al., 2016), phase retrieval (Candès et al., 2013; Shechtman et al., 2015), blind deconvolution (Ahmed et al., 2013), and matrix recovery via rank-one projections (Cai and Zhang, 2015; Chen et al., 2015). To overcome the non-convexity and NP-hardness of directly solving (1) (Recht et al., 2010), various computationally feasible schemes have been developed in the past decade. In particular, the convex relaxation has been a central topic of interest (Recht et al., 2010; Candès and Plan, 2011):

$$\min_{X\in\mathbb{R}^{p_1\times p_2}} \frac{1}{2}\|y - \mathcal{A}(X)\|^2_2 + \lambda\|X\|_*, \qquad (3)$$

where $\|X\|_* = \sum_{i=1}^{\min(p_1,p_2)}\sigma_i(X)$ is the nuclear norm of $X$ and $\lambda > 0$ is a tuning parameter. Nevertheless, the convex relaxation technique has one well-documented limitation: the parameter space after relaxation is usually much larger than that of the target problem. Also, algorithms for solving the convex program often require the singular value decomposition as the stepping stone and can be prohibitively time-consuming for large-scale instances.
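To make the relaxation (3) concrete, here is a minimal sketch of a proximal gradient (singular value thresholding) loop for (3). The operator handles `A_op`/`A_adj`, the step size, and the iteration count are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

def svt(Z, tau):
    """Singular value thresholding: proximal operator of tau * (nuclear norm)."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def nnm_proximal_gradient(A_op, A_adj, y, p1, p2, lam, step, n_iter=500):
    """Proximal gradient on (3): X <- SVT(X - step * A*(A(X) - y), step * lam)."""
    X = np.zeros((p1, p2))
    for _ in range(n_iter):
        grad = A_adj(A_op(X) - y)          # gradient of the smooth least squares part
        X = svt(X - step * grad, step * lam)
    return X
```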

In addition, non-convex optimization renders another important class of algorithms for solving (1), which directly enforce the rank-$r$ constraint on the iterates. Since each iterate lies in a low-dimensional space, the computational cost of the non-convex approach can be much smaller than that of the convex regularized approach. In the last couple of years, there has been a flurry of research on non-convex methods for solving (1) (Chen and Wainwright, 2015; Hardt, 2014; Jain et al., 2013; Sun and Luo, 2015; Tran-Dinh and Zhang, 2016; Tu et al., 2016; Wen et al., 2012; Zhao et al., 2015; Zheng and Lafferty, 2015), and many of these algorithms, such as gradient descent and alternating minimization, are shown to have nice convergence results under proper assumptions (Hardt, 2014; Jain et al., 2013; Sun and Luo, 2015; Tu et al., 2016; Zhao et al., 2015). We refer readers to Section 1.2 for a more detailed review of recent work on convex and non-convex approaches to solving (1).

In the existing literature, many algorithms for solving (1) either require careful tuning of hyper-parameters or have a convergence rate no faster than linear. Thus, we raise the following question:

Can we develop an easy-to-compute and efficient algorithm (hopefully with per-iteration computational complexity comparable to first-order methods) with provable high-order convergence guarantees (possibly converging to a stationary point due to the non-convexity) for solving (1)?

In this paper, we give an affirmative answer to this question by making contributions to the rank constrained optimization problem (1) as outlined next.

1.1 Our Contributions

We introduce an easy-to-implement and computationally efficient algorithm, Recursive Importance Sketching for Rank constrained least squares Optimization (RISRO), for solving (1) in this paper. The proposed algorithm is tuning free and has the same per-iteration computational complexity as Alternating Minimization (Jain et al., 2013), as well as complexity comparable to many popular first-order methods such as iterative hard thresholding (Jain et al., 2010) and gradient descent


(Tu et al., 2016) when $r \ll p_1, p_2, n$. We then illustrate the key idea of RISRO under a general framework of recursive importance sketching. This framework also provides a platform to compare RISRO with several existing algorithms for rank constrained least squares.

Assuming $\mathcal{A}$ satisfies the restricted isometry property (RIP), we prove that RISRO enjoys local quadratic-linear convergence in general and quadratic convergence under some extra conditions. Figure 1 provides a numerical example of the performance of RISRO in noiseless low-rank matrix trace regression (left panel) and phase retrieval (right panel). In both problems, RISRO converges to the underlying parameter quadratically and reaches a highly accurate solution within five iterations. We will illustrate later that RISRO has the same per-iteration complexity as other first-order methods when $r$ is small, while it converges quadratically with provable guarantees. To the best of our knowledge, we are among the first to achieve this for the general rank constrained least squares problem.

[Figure 1: RISRO achieves a quadratic rate of convergence (spectral initialization is used in each setting and more details about the simulation setup are given in Section 7). (a) Noiseless low-rank matrix trace regression: the y-axis is $\|X^t - X^*\|_F/\|X^*\|_F$ versus the iteration number, for $n/(pr)\in\{4,5,6,7,8\}$. Here, $y_i = \langle A_i, X^*\rangle$ for $1\le i\le n$, $X^*\in\mathbb{R}^{p\times p}$ with $p = 100$, $\sigma_1(X^*) = \cdots = \sigma_3(X^*) = 3$, $\sigma_k(X^*) = 0$ for $4\le k\le 100$, and each $A_i$ has independent and identically distributed (i.i.d.) standard Gaussian entries. (b) Phase retrieval: the y-axis is $\|x^t(x^t)^\top - x^*(x^*)^\top\|_F/\|x^*(x^*)^\top\|_F$ versus the iteration number, for $n/p\in\{4,5,6,7,8\}$. Here, $y_i = \langle a_i a_i^\top, x^*x^{*\top}\rangle$ for $1\le i\le n$, $x^*\in\mathbb{R}^p$ with $p = 1200$, and $a_i\overset{i.i.d.}{\sim} N(0, I_p)$.]

In addition, we discover a deep connection between RISRO and optimization algorithms on Riemannian manifolds. The least squares step in RISRO implicitly solves a Fisher scoring or Riemannian Gauss-Newton equation in the Riemannian optimization of low-rank matrices, and the updating rule in RISRO can be seen as a retraction map. With this connection, our theory for RISRO also improves the existing convergence results on the Riemannian Gauss-Newton method for the rank constrained least squares problem.

Next, we further apply RISRO to two important problems arising from machine learning and statistics: low-rank matrix trace regression and phase retrieval. In low-rank matrix trace regression, we prove that RISRO achieves the minimax optimal estimation error rate under the Gaussian ensemble design with only a double-logarithmic number of iterations. In phase retrieval, where $\mathcal{A}$ does not satisfy the RIP condition, we can still establish the local convergence of RISRO given a proper initialization.

Finally, we conduct simulation studies to support our theoretical results and compare RISRO with many existing algorithms. The simulation studies show that RISRO not only offers faster and more robust convergence but also a smaller sample size requirement for low-rank matrix recovery, compared to existing approaches.


1.2 Related Literature

This work is related to a range of literature on low-rank matrix recovery, convex/non-convex optimization, and sketching arising from a number of communities, including optimization, machine learning, statistics, and applied mathematics. We make an attempt to review the related literature without claiming the survey is exhaustive.

One class of the most popular approaches to solving (1) is nuclear norm minimization (NNM) (3). Many algorithms have been proposed to solve NNM, such as proximal gradient descent (Toh and Yun, 2010), fixed-point continuation (FPC) (Goldfarb and Ma, 2011), and proximal point methods (Jiang et al., 2014). It has been shown that the solution of NNM has desirable properties under proper model assumptions (Cai and Zhang, 2013, 2014, 2015; Candès and Plan, 2011; Recht et al., 2010). In addition to NNM, max norm minimization is another widely considered convex relaxation for rank constrained optimization (Lee et al., 2010; Cai and Zhou, 2013). However, it is usually computationally intensive to solve these convex programs, and this motivates a line of work on non-convex approaches. Since Burer and Monteiro (2003), one of the most popular non-convex methods for solving (1) is to first factor the low-rank matrix $X$ into $RL^\top$ with two factor matrices $R\in\mathbb{R}^{p_1\times r}$, $L\in\mathbb{R}^{p_2\times r}$, and then run either gradient descent or alternating minimization on $R$ and $L$ (Candès et al., 2015; Li et al., 2019b; Ma et al., 2019; Park et al., 2018; Sanghavi et al., 2017; Sun and Luo, 2015; Tu et al., 2016; Wang et al., 2017c; Zhao et al., 2015; Zheng and Lafferty, 2015; Tong et al., 2020). Other methods, such as singular value projection or iterative hard thresholding (Goldfarb and Ma, 2011; Jain et al., 2010; Tanner and Wei, 2013), Grassmann manifold optimization (Boumal and Absil, 2011; Keshavan et al., 2009), and Riemannian manifold optimization (Huang and Hand, 2018; Meyer et al., 2011; Mishra et al., 2014; Vandereycken, 2013; Wei et al., 2016), have also been proposed and studied. We refer readers to the recent survey paper Chi et al. (2019) for a comprehensive overview of the existing literature on convex and non-convex approaches to solving (1). There are a few recent attempts at connecting the geometric structures of different approaches (Ha et al., 2020; Li et al., 2019a), and the landscape of problem (1) has also been studied in various settings (Bhojanapalli et al., 2016; Ge et al., 2017; Uschmajew and Vandereycken, 2018; Zhang et al., 2019; Zhu et al., 2018).

Our work is also related to the idea of sketching in numerical linear algebra. Performing sketching to speed up computation via dimension reduction has been explored extensively in recent years (Mahoney, 2011; Woodruff, 2014). Sketching methods have been applied to a number of problems including, but not limited to, matrix approximation (Song et al., 2017; Zheng et al., 2012; Drineas et al., 2012), linear regression (Clarkson and Woodruff, 2017; Dobriban and Liu, 2019; Pilanci and Wainwright, 2016; Raskutti and Mahoney, 2016), and ridge regression (Wang et al., 2017b). In most of the sketching literature, the sketching matrices are randomly constructed (Mahoney, 2011; Woodruff, 2014). Randomized sketching matrices are easy to generate and require little storage for sparse sketching. However, randomized sketching can be suboptimal in statistical settings (Raskutti and Mahoney, 2016). To overcome this, Zhang et al. (2020) introduced the idea of importance sketching in the context of low-rank tensor regression. In contrast to randomized sketching, importance sketching matrices are constructed deterministically with the supervision of the data and are shown to be capable of achieving better statistical efficiency. In this paper, we propose a more powerful recursive importance sketching algorithm in which the sketching matrices are recursively refined. We then provide a comprehensive convergence analysis of the proposed algorithm and demonstrate its advantages over other algorithms for the rank constrained least squares problem.


1.3 Organization of the Paper

The rest of this article is organized as follows. After a brief introduction of notation in Section 1.4, we present our main algorithm, RISRO, with an interpretation from the recursive importance sketching perspective in Section 2. The theoretical results for RISRO are given in Section 3. In Section 4, we present another interpretation of RISRO from the perspective of Riemannian manifold optimization. The computational complexity of RISRO and its applications to low-rank matrix trace regression and phase retrieval are discussed in Sections 5 and 6, respectively. Numerical studies of RISRO and comparisons with existing algorithms in the literature are presented in Section 7. Conclusions and future work are given in Section 8.

1.4 Notation

The following notation will be used throughout this article. Uppercase and lowercase letters (e.g., $A, B, a, b$), lowercase boldface letters (e.g., $\mathbf{u}, \mathbf{v}$), and uppercase boldface letters (e.g., $\mathbf{U}, \mathbf{V}$) are used to denote scalars, vectors, and matrices, respectively. For any two series of numbers, say $\{a_n\}$ and $\{b_n\}$, denote $a = O(b)$ if there exists a uniform constant $C > 0$ such that $a_n \le C b_n$ for all $n$. For any $a, b\in\mathbb{R}$, let $a\wedge b := \min\{a,b\}$ and $a\vee b := \max\{a,b\}$. For any matrix $X\in\mathbb{R}^{p_1\times p_2}$ with singular value decomposition $\sum_{i=1}^{p_1\wedge p_2}\sigma_i(X)u_i v_i^\top$, where $\sigma_1(X)\ge\sigma_2(X)\ge\cdots\ge\sigma_{p_1\wedge p_2}(X)$, let $X_{\max(r)} = \sum_{i=1}^r\sigma_i(X)u_i v_i^\top$ be the best rank-$r$ approximation of $X$, and denote $\|X\|_F = \sqrt{\sum_i\sigma_i^2(X)}$ and $\|X\| = \sigma_1(X)$ as the Frobenius norm and spectral norm, respectively. Let $\mathrm{QR}(X)$ be the Q part of the QR decomposition of $X$. $\mathrm{vec}(X)\in\mathbb{R}^{p_1 p_2}$ represents the vectorization of $X$ by its columns. In addition, $I_r$ is the $r$-by-$r$ identity matrix. Let $\mathbb{O}_{p,r} = \{U: U^\top U = I_r\}$ be the set of all $p$-by-$r$ matrices with orthonormal columns. For any $U\in\mathbb{O}_{p,r}$, $P_U = UU^\top$ represents the orthogonal projector onto the column space of $U$; we also denote $U_\perp\in\mathbb{O}_{p,p-r}$ as an orthonormal complement of $U$. We use bracket subscripts to denote sub-matrices. For example, $X_{[i_1,i_2]}$ is the entry of $X$ in the $i_1$-th row and $i_2$-th column; $X_{[(r+1):p_1,:]}$ contains the $(r+1)$-th to the $p_1$-th rows of $X$. For any matrix $X$, we use $X^\dagger$ to denote its Moore-Penrose inverse. For matrices $U\in\mathbb{R}^{p_1\times p_2}$ and $V\in\mathbb{R}^{m_1\times m_2}$, let

$$U\otimes V = \begin{bmatrix} U_{[1,1]}\cdot V & \cdots & U_{[1,p_2]}\cdot V \\ \vdots & & \vdots \\ U_{[p_1,1]}\cdot V & \cdots & U_{[p_1,p_2]}\cdot V \end{bmatrix}\in\mathbb{R}^{(p_1 m_1)\times(p_2 m_2)}$$

be their Kronecker product. Finally, for any given linear operator $\mathcal{L}$, we use $\mathcal{L}^*$ to denote its adjoint and $\mathrm{Ran}(\mathcal{L})$ to denote its range space.

2 Recursive Importance Sketching for Rank Constrained Least Squares

In this section, we discuss the procedure and interpretations of RISRO, and then compare it with existing algorithms from a sketching perspective.

2.1 RISRO Procedure and Recursive Importance Sketching

The detailed procedure of RISRO is given in Algorithm 1. RISRO includes three steps in each iteration. Specifically, in the $t$-th iteration, we first sketch each $A_i$ onto the subspace spanned by $[U^t\otimes V^t,\ U^t_\perp\otimes V^t,\ U^t\otimes V^t_\perp]$, which yields the sketched importance covariates $U^{t\top}A_i V^t$, $U^{t\top}_\perp A_i V^t$, $U^{t\top}A_i V^t_\perp$ in (4). See the left panel of Figure 2 for an illustration of the sketching scheme of RISRO. Second, we solve a dimension reduced least squares problem (5), where the number of parameters is reduced to $(p_1+p_2-r)r$ while the sample size remains $n$. Third, we update the sketching matrices $U^{t+1}, V^{t+1}$ and $X^{t+1}$ in Steps 6 and 7. Note that by construction, $U^{t+1}, V^{t+1}$ capture both the column and row spans of $X^{t+1}$. In particular, if $B^{t+1}$ is invertible, then $U^{t+1}, V^{t+1}$ are exactly orthonormal bases of the column and row spans of $X^{t+1}$, respectively.

[Figure 2: Illustration of RISRO (this work), Alter Mini (Hardt, 2014; Jain et al., 2013), and R2RILS (Bauch and Nadler, 2020) from a sketching perspective.]

Algorithm 1 Recursive Importance Sketching for Rank Constrained Least Squares (RISRO)

1: Input: $\mathcal{A}(\cdot): \mathbb{R}^{p_1\times p_2}\to\mathbb{R}^n$, $y\in\mathbb{R}^n$, rank $r$, initialization $X^0$ which admits the singular value decomposition $U^0\Sigma^0 V^{0\top}$, where $U^0\in\mathbb{O}_{p_1,r}$, $V^0\in\mathbb{O}_{p_2,r}$, $\Sigma^0\in\mathbb{R}^{r\times r}$.
2: for $t = 0, 1, \ldots$ do
3: Perform importance sketching on $\mathcal{A}$ and construct the covariate maps $\mathcal{A}_B: \mathbb{R}^{r\times r}\to\mathbb{R}^n$, $\mathcal{A}_{D_1}: \mathbb{R}^{(p_1-r)\times r}\to\mathbb{R}^n$, and $\mathcal{A}_{D_2}: \mathbb{R}^{r\times(p_2-r)}\to\mathbb{R}^n$, where for $1\le i\le n$,

$$(A_B)_i = U^{t\top}A_i V^t, \quad (A_{D_1})_i = U^{t\top}_\perp A_i V^t, \quad (A_{D_2})_i = U^{t\top}A_i V^t_\perp. \qquad (4)$$

Here, $(A_B)_i$ satisfies $[\mathcal{A}_B(\cdot)]_i = \langle\cdot, (A_B)_i\rangle$, and similarly for $(A_{D_1})_i$ and $(A_{D_2})_i$.
4: Solve the unconstrained least squares problem

$$(B^{t+1}, D_1^{t+1}, D_2^{t+1}) = \arg\min_{B\in\mathbb{R}^{r\times r},\ D_i\in\mathbb{R}^{(p_i-r)\times r},\ i=1,2} \left\|y - \mathcal{A}_B(B) - \mathcal{A}_{D_1}(D_1) - \mathcal{A}_{D_2}(D_2^\top)\right\|^2_2. \qquad (5)$$

5: Compute $X^{t+1}_U = U^t B^{t+1} + U^t_\perp D_1^{t+1}$ and $X^{t+1}_V = V^t B^{(t+1)\top} + V^t_\perp D_2^{t+1}$.
6: Perform QR orthogonalization: $U^{t+1} = \mathrm{QR}(X^{t+1}_U)$, $V^{t+1} = \mathrm{QR}(X^{t+1}_V)$.
7: Update $X^{t+1} = X^{t+1}_U (B^{t+1})^\dagger X^{(t+1)\top}_V$.
8: end for

We give a high-level explanation of RISRO through a decomposition of $y_i$. Suppose $y_i = \langle A_i, \bar X\rangle + \bar\epsilon_i$, where $\bar X$ is a rank-$r$ target matrix with singular value decomposition $\bar U\bar\Sigma\bar V^\top$, with $\bar U\in\mathbb{O}_{p_1,r}$, $\bar\Sigma\in\mathbb{R}^{r\times r}$, and $\bar V\in\mathbb{O}_{p_2,r}$. Then

$$\begin{aligned} y_i = {} & \langle U^{t\top}A_i V^t, U^{t\top}\bar X V^t\rangle + \langle U^{t\top}_\perp A_i V^t, U^{t\top}_\perp\bar X V^t\rangle + \langle U^{t\top}A_i V^t_\perp, U^{t\top}\bar X V^t_\perp\rangle + \langle U^{t\top}_\perp A_i V^t_\perp, U^{t\top}_\perp\bar X V^t_\perp\rangle + \bar\epsilon_i \\ := {} & \langle U^{t\top}A_i V^t, U^{t\top}\bar X V^t\rangle + \langle U^{t\top}_\perp A_i V^t, U^{t\top}_\perp\bar X V^t\rangle + \langle U^{t\top}A_i V^t_\perp, U^{t\top}\bar X V^t_\perp\rangle + \epsilon^t_i. \end{aligned} \qquad (6)$$

Here, $\epsilon^t := \mathcal{A}(P_{U^t_\perp}\bar X P_{V^t_\perp}) + \bar\epsilon\in\mathbb{R}^n$ can be seen as the residual of the new regression model (6), and $U^{t\top}A_i V^t$, $U^{t\top}_\perp A_i V^t$, $U^{t\top}A_i V^t_\perp$ are exactly the importance covariates constructed in (4). Let

$$\tilde B^t := U^{t\top}\bar X V^t, \quad \tilde D_1^t := U^{t\top}_\perp\bar X V^t, \quad \tilde D_2^{t\top} := U^{t\top}\bar X V^t_\perp. \qquad (7)$$

If $\epsilon^t = 0$, then $(\tilde B^t, \tilde D_1^t, \tilde D_2^t)$ is a solution of the least squares problem (5). Hence, we could set $B^{t+1} = \tilde B^t$, $D_1^{t+1} = \tilde D_1^t$, $D_2^{t+1} = \tilde D_2^t$, and thus $X^{t+1}_U = \bar X V^t$, $X^{t+1}_V = \bar X^\top U^t$. Furthermore, if $B^{t+1}$ is invertible, then it holds that

$$X^{t+1}_U (B^{t+1})^{-1} X^{(t+1)\top}_V = \bar X V^t\left(U^{t\top}\bar X V^t\right)^{-1}\left(\bar X^\top U^t\right)^\top = \bar X, \qquad (8)$$

which means $\bar X$ can be exactly recovered by one iteration of RISRO.

In general, $\epsilon^t\ne 0$. When the column spans of $U^t, V^t$ well approximate those of $\bar U, \bar V$, i.e., the column and row subspaces on which the target parameter $\bar X$ lies, we expect $U^{t\top}_\perp\bar X V^t_\perp$ and $\epsilon^t_i = \langle U^{t\top}_\perp A_i V^t_\perp, U^{t\top}_\perp\bar X V^t_\perp\rangle + \bar\epsilon_i$ to have small amplitude; then $B^{t+1}, D_1^{t+1}, D_2^{t+1}$, the outcome of the least squares problem (5), can well approximate $\tilde B^t, \tilde D_1^t, \tilde D_2^t$. In Lemma 1, we give a precise characterization of this approximation. Before that, let us introduce convenient notation so that (5) can be written in a more compact way.

Define the linear operator $\mathcal{L}_t$ as

$$\mathcal{L}_t: W = \begin{bmatrix} W_0\in\mathbb{R}^{r\times r} & W_2\in\mathbb{R}^{r\times(p_2-r)} \\ W_1\in\mathbb{R}^{(p_1-r)\times r} & 0_{(p_1-r)\times(p_2-r)} \end{bmatrix} \;\longmapsto\; [U^t\ U^t_\perp]\begin{bmatrix} W_0 & W_2 \\ W_1 & 0 \end{bmatrix}[V^t\ V^t_\perp]^\top, \qquad (9)$$

and it is easy to compute its adjoint $\mathcal{L}^*_t: M\in\mathbb{R}^{p_1\times p_2}\mapsto\begin{bmatrix} U^{t\top}MV^t & U^{t\top}MV^t_\perp \\ U^{t\top}_\perp MV^t & 0 \end{bmatrix}$. Then, the least squares problem in (5) can be written as

$$(B^{t+1}, D_1^{t+1}, D_2^{t+1}) = \arg\min_{B\in\mathbb{R}^{r\times r},\ D_i\in\mathbb{R}^{(p_i-r)\times r},\ i=1,2} \left\|y - \mathcal{A}\mathcal{L}_t\left(\begin{bmatrix} B & D_2^\top \\ D_1 & 0 \end{bmatrix}\right)\right\|^2_2. \qquad (10)$$

Lemma 1 (Iteration Error Analysis for RISRO) Let $\bar X$ be any given target matrix. Recall the definition of $\epsilon^t = \bar\epsilon + \mathcal{A}(P_{U^t_\perp}\bar X P_{V^t_\perp})$ from (6). If the operator $\mathcal{L}^*_t\mathcal{A}^*\mathcal{A}\mathcal{L}_t$ is invertible over $\mathrm{Ran}(\mathcal{L}^*_t)$, then $B^{t+1}, D_1^{t+1}, D_2^{t+1}$ in (5) satisfy

$$\begin{bmatrix} B^{t+1}-\tilde B^t & D_2^{(t+1)\top}-\tilde D_2^{t\top} \\ D_1^{t+1}-\tilde D_1^t & 0 \end{bmatrix} = \left(\mathcal{L}^*_t\mathcal{A}^*\mathcal{A}\mathcal{L}_t\right)^{-1}\mathcal{L}^*_t\mathcal{A}^*\epsilon^t, \qquad (11)$$

and

$$\|B^{t+1}-\tilde B^t\|^2_F + \sum_{k=1}^2\|D_k^{t+1}-\tilde D_k^t\|^2_F = \left\|\left(\mathcal{L}^*_t\mathcal{A}^*\mathcal{A}\mathcal{L}_t\right)^{-1}\mathcal{L}^*_t\mathcal{A}^*\epsilon^t\right\|^2_F. \qquad (12)$$

In view of Lemma 1, the approximation errors of $B^{t+1}, D_1^{t+1}, D_2^{t+1}$ to $\tilde B^t, \tilde D_1^t, \tilde D_2^t$ are driven by the least squares residual $\|(\mathcal{L}^*_t\mathcal{A}^*\mathcal{A}\mathcal{L}_t)^{-1}\mathcal{L}^*_t\mathcal{A}^*\epsilon^t\|^2_F$. This fact plays a key role in the proof of the high-order convergence theory of RISRO.

Remark 1 (Comparison with Randomized Sketching) The importance sketching in RISRO is significantly different from the randomized sketching in the literature (see the surveys Mahoney (2011); Woodruff (2014) and the references therein). While randomized sketching matrices are often randomly generated and reduce the sample size ($n$), the importance sketching matrices are deterministically constructed under the supervision of $y$ and reduce the dimension of the parameter space ($p_1 p_2$). See (Zhang et al., 2020, Sections 1.3 and 2) for more comparisons of randomized and importance sketching.


2.2 Comparison with More Algorithms in View of Sketching

In addition to RISRO, several classic algorithms for rank constrained least squares can be interpreted from a recursive importance sketching perspective. Through the lens of sketching, RISRO exhibits advantages over these existing algorithms.

We first focus on Alternating Minimization (Alter Mini), proposed and studied in Hardt (2014); Jain et al. (2013); Zhao et al. (2015). Suppose $U^t$ contains the left singular vectors of $X^t$, the outcome of the $t$-th iteration. Alter Mini solves the following least squares problems to update $U$ and $V$:

$$\begin{aligned} \check V^{t+1} &= \arg\min_{V\in\mathbb{R}^{p_2\times r}}\sum_{i=1}^n\left(y_i - \langle A_i, U^t V^\top\rangle\right)^2 = \arg\min_{V\in\mathbb{R}^{p_2\times r}}\sum_{i=1}^n\left(y_i - \langle U^{t\top}A_i, V^\top\rangle\right)^2, \\ \check U^{t+1} &= \arg\min_{U\in\mathbb{R}^{p_1\times r}}\sum_{i=1}^n\left(y_i - \langle A_i, U(V^{t+1})^\top\rangle\right)^2 = \arg\min_{U\in\mathbb{R}^{p_1\times r}}\sum_{i=1}^n\left(y_i - \langle A_i V^{t+1}, U\rangle\right)^2, \\ V^{t+1} &= \mathrm{QR}(\check V^{t+1}), \quad U^{t+1} = \mathrm{QR}(\check U^{t+1}). \end{aligned} \qquad (13)$$

Then, Alter Mini essentially solves least squares problems with sketched covariates $U^{t\top}A_i$ and $A_i V^{t+1}$ to update $\check V^{t+1}$ and $\check U^{t+1}$ alternately and iteratively. The numbers of parameters of the least squares in (13) are $rp_2$ and $rp_1$, as opposed to $p_1 p_2$, the number of parameters in the original least squares problem. See the upper right panel of Figure 2 for an illustration of the sketching scheme in Alter Mini. Consider the following decomposition of $y_i$:

$$y_i = \langle A_i, P_{U^t}\bar X\rangle + \langle A_i, P_{U^t_\perp}\bar X\rangle + \bar\epsilon_i = \langle U^{t\top}A_i, U^{t\top}\bar X\rangle + \langle A_i, P_{U^t_\perp}\bar X\rangle + \bar\epsilon_i := \langle U^{t\top}A_i, U^{t\top}\bar X\rangle + \check\epsilon^t_i, \qquad (14)$$

where $\check\epsilon^t := \mathcal{A}(P_{U^t_\perp}\bar X) + \bar\epsilon\in\mathbb{R}^n$. Define $\check A^t\in\mathbb{R}^{n\times p_2 r}$ with $\check A^t_{[i,:]} = \mathrm{vec}(U^{t\top}A_i)$. Similarly to how Lemma 1 is proved, we can show $\|\check V^{(t+1)\top} - U^{t\top}\bar X\|^2_F = \|(\check A^{t\top}\check A^t)^{-1}\check A^{t\top}\check\epsilon^t\|^2_2$, which implies that the approximation error of $V^{t+1} = \mathrm{QR}(\check V^{t+1})$ (i.e., the outcome of one Alter Mini iteration) to $\bar V$ (i.e., the true row span of the target matrix $\bar X$) is driven by $\check\epsilon^t = \mathcal{A}(P_{U^t_\perp}\bar X) + \bar\epsilon$, i.e., the residual of the least squares problem (14). Recall that for RISRO, Lemma 1 shows the approximation error of $V^{t+1}$ is driven by $\epsilon^t = \mathcal{A}(P_{U^t_\perp}\bar X P_{V^t_\perp}) + \bar\epsilon$. Since $\|P_{U^t_\perp}\bar X P_{V^t_\perp}\|_F \le \|P_{U^t_\perp}\bar X\|_F$, the per-iteration approximation error of RISRO can be smaller than that of Alter Mini. This difference between RISRO and Alter Mini is due to the following fact: in Alter Mini, the sketching captures the importance covariates corresponding to only the row (or column) span of $X^t$ when updating $V^{t+1}$ (or $U^{t+1}$), while the importance sketching of RISRO in (4) captures the importance covariates from both the row span and column span of $X^t$. As a consequence, Alter Mini iterations yield first-order convergence, while RISRO iterations render high-order convergence, as will be established in Section 3.
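For contrast with RISRO, here is a minimal NumPy sketch of a single Alter Mini sweep in the spirit of (13); the dense storage of the measurement matrices in `A_tensors` and the function name are illustrative assumptions.

```python
import numpy as np

def alter_mini_step(A_tensors, y, U):
    """One Alter Mini sweep: update V given U, then U given the new V, as in (13)."""
    n, p1, p2 = A_tensors.shape
    r = U.shape[1]
    # Update V: regress y on the sketched covariates U^T A_i (each of size r x p2).
    covV = np.stack([U.T @ A for A in A_tensors]).reshape(n, -1)        # n x (r p2)
    V_hat = np.linalg.lstsq(covV, y, rcond=None)[0].reshape(r, p2).T    # p2 x r
    V = np.linalg.qr(V_hat)[0]
    # Update U: regress y on the sketched covariates A_i V (each of size p1 x r).
    covU = np.stack([A @ V for A in A_tensors]).reshape(n, -1)          # n x (p1 r)
    U_hat = np.linalg.lstsq(covU, y, rcond=None)[0].reshape(p1, r)
    U = np.linalg.qr(U_hat)[0]
    return U, V
```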

Remark 2 Recently, Kümmerle and Sigl (2018) proposed a harmonic mean iterative reweighted least squares (HM-IRLS) method for low-rank matrix recovery via solving $\min_{X\in\mathbb{R}^{p_1\times p_2}}\|X\|_q^q$, subject to $y = \mathcal{A}(X)$, where $\|X\|_q = \left(\sum_i\sigma_i^q(X)\right)^{1/q}$ is the Schatten-$q$ norm of the matrix $X$. Compared to the original iterative reweighted least squares (IRLS) (Fornasier et al., 2011; Mohan and Fazel, 2012), which only uses the column span of $X^t$ in constructing the reweighting matrix, HM-IRLS uses both the column and row spans of $X^t$ in constructing the reweighting matrix and results in better performance. This comparison of HM-IRLS versus IRLS shares the same spirit as RISRO versus Alter Mini: the importance sketching of RISRO captures the information of both the column and row spans of $X^t$ and achieves a better performance.


Another example is the rank-$2r$ iterative least squares (R2RILS) method proposed in Bauch and Nadler (2020) for solving ill-conditioned matrix completion problems. In particular, at the $t$-th iteration, Step 1 of R2RILS solves the following least squares problem

$$\min_{M\in\mathbb{R}^{p_1\times r},\ N\in\mathbb{R}^{p_2\times r}}\sum_{(i,j)\in\Omega}\left\{\left(U^t N^\top + MV^{t\top} - X\right)_{[i,j]}\right\}^2, \qquad (15)$$

where $\Omega$ is the set of index pairs of the observed entries. In the matrix completion setting, it turns out the following equivalence holds (proof given in the Appendix):

$$\arg\min_{M\in\mathbb{R}^{p_1\times r},\ N\in\mathbb{R}^{p_2\times r}}\sum_{(i,j)\in\Omega}\left\{\left(U^t N^\top + MV^{t\top} - X\right)_{[i,j]}\right\}^2 = \arg\min_{M\in\mathbb{R}^{p_1\times r},\ N\in\mathbb{R}^{p_2\times r}}\sum_{(i,j)\in\Omega}\left(\langle U^{t\top}A_{ij}, N^\top\rangle + \langle M, A_{ij}V^t\rangle - X_{[i,j]}\right)^2, \qquad (16)$$

where $A_{ij}\in\mathbb{R}^{p_1\times p_2}$ is the special covariate in matrix completion satisfying $(A_{ij})_{[k,l]} = 1$ if $(i,j) = (k,l)$ and $(A_{ij})_{[k,l]} = 0$ otherwise. This equivalence reveals that the least squares step (15) in R2RILS can be seen as an implicit sketched least squares problem, similar to (5) and (13), with covariates $U^{t\top}A_{ij}$ and $A_{ij}V^t$ for $(i,j)\in\Omega$.

We give a pictorial illustration of the sketching interpretation of R2RILS in the bottom right part of Figure 2. Different from the sketching in RISRO, R2RILS incorporates the core sketch $U^{t\top}A_{ij}V^t$ twice, which results in rank deficiency in the least squares problem (15) and brings difficulties in both implementation and theoretical analysis. RISRO overcomes this issue by performing a better designed sketching and covers more general low-rank matrix recovery settings than R2RILS. With the new sketching scheme, we are able to give a new and solid theory for RISRO with high-order convergence.

3 Theoretical Analysis

In this section, we provide convergence analysis for the proposed algorithm. For technical convenience, we assume $\mathcal{A}$ satisfies the Restricted Isometry Property (RIP) (Candès, 2008). The RIP condition, first introduced in compressed sensing, has been widely used as one of the most standard assumptions in the low-rank matrix recovery literature (Cai and Zhang, 2013, 2014; Candès and Plan, 2011; Chen and Wainwright, 2015; Jain et al., 2010; Recht et al., 2010; Tu et al., 2016; Zhao et al., 2015). It also plays a critical role in analyzing the landscape of the rank constrained optimization problem (1) (Bhojanapalli et al., 2016; Ge et al., 2017; Uschmajew and Vandereycken, 2018; Zhang et al., 2019; Zhu et al., 2018). Moreover, it is practically useful, as the condition has been shown to be satisfied with the desired sample size in random designs (Candès and Plan, 2011; Recht et al., 2010).

Definition 1 (Restricted Isometry Property (RIP)) Let $\mathcal{A}: \mathbb{R}^{p_1\times p_2}\to\mathbb{R}^n$ be a linear map. For every integer $r$ with $1\le r\le\min(p_1,p_2)$, define the $r$-restricted isometry constant to be the smallest number $R_r$ such that $(1-R_r)\|Z\|^2_F \le \|\mathcal{A}(Z)\|^2_2 \le (1+R_r)\|Z\|^2_F$ holds for all $Z$ of rank at most $r$. $\mathcal{A}$ is said to satisfy the $r$-restricted isometry property ($r$-RIP) if $0\le R_r < 1$.

Note that, by definition, $R_r\le R_{r'}$ for $r\le r'$. By assuming RIP for $\mathcal{A}$, we can show that the linear operator $\mathcal{L}^*_t\mathcal{A}^*\mathcal{A}\mathcal{L}_t$ mentioned in Lemma 1 is always invertible over $\mathrm{Ran}(\mathcal{L}^*_t)$ (i.e., the least squares (5) has a unique solution). In fact, we can give explicit lower and upper bounds for the spectrum of this operator.


Lemma 2 (Bounds for the Spectrum of $\mathcal{L}^*_t\mathcal{A}^*\mathcal{A}\mathcal{L}_t$) Recall the definition of $\mathcal{L}_t$ in (9). It holds that

$$\|\mathcal{L}_t(M)\|_F = \|M\|_F, \quad \forall M\in\mathrm{Ran}(\mathcal{L}^*_t). \qquad (17)$$

Suppose the linear map $\mathcal{A}$ satisfies the $2r$-RIP. Then, for any matrix $M\in\mathrm{Ran}(\mathcal{L}^*_t)$,

$$(1-R_{2r})\|M\|_F \le \|\mathcal{L}^*_t\mathcal{A}^*\mathcal{A}\mathcal{L}_t(M)\|_F \le (1+R_{2r})\|M\|_F.$$

Remark 3 (Bounds for the Spectrum of $(\mathcal{L}^*_t\mathcal{A}^*\mathcal{A}\mathcal{L}_t)^{-1}$) By the relationship between the spectrum of an operator and that of its inverse, Lemma 2 also implies that the spectrum of $(\mathcal{L}^*_t\mathcal{A}^*\mathcal{A}\mathcal{L}_t)^{-1}$ is lower and upper bounded by $\frac{1}{1+R_{2r}}$ and $\frac{1}{1-R_{2r}}$, respectively.

In the following Proposition 1, we bound the iteration approximation error given in Lemma 1.

Proposition 1 (Upper Bound for the Iteration Approximation Error) Let $\bar X$ be a given target rank-$r$ matrix and $\bar\epsilon = y - \mathcal{A}(\bar X)$. Suppose that $\mathcal{A}$ satisfies the $3r$-RIP. Then at the $t$-th iteration of RISRO, the approximation error (12) has the following upper bound:

$$\left\|\left(\mathcal{L}^*_t\mathcal{A}^*\mathcal{A}\mathcal{L}_t\right)^{-1}\mathcal{L}^*_t\mathcal{A}^*\epsilon^t\right\|^2_F \le \frac{R^2_{3r}\|X^t-\bar X\|^2\|X^t-\bar X\|^2_F}{(1-R_{2r})^2\sigma^2_r(\bar X)} + \frac{\|\mathcal{L}^*_t\mathcal{A}^*(\bar\epsilon)\|^2_F}{(1-R_{2r})^2} + \|\mathcal{L}^*_t\mathcal{A}^*(\bar\epsilon)\|_F\,\frac{2R_{3r}\|X^t-\bar X\|\|X^t-\bar X\|_F}{\sigma_r(\bar X)(1-R_{2r})^2}. \qquad (18)$$

Note that Proposition 1 is rather general in the sense that it applies to any $\bar X$ of rank $r$, and we will pick different choices of $\bar X$ depending on our purposes. For example, in studying the convergence of RISRO, e.g., the upcoming Theorem 1, we treat $\bar X$ as a stationary point, and in the setting of estimating the model parameter in matrix trace regression, we take $\bar X$ to be the ground truth (see Theorem 3).

Now, we are ready to establish the deterministic convergence theory for RISRO. For problem (1), we use the following definition of stationary points: a rank-$r$ matrix $\bar X$ is said to be a stationary point of (1) if $\nabla f(\bar X)^\top\bar U = 0$ and $\nabla f(\bar X)\bar V = 0$, where $\nabla f(\bar X) = \mathcal{A}^*(\mathcal{A}(\bar X) - y)$ and $\bar U, \bar V$ are the left and right singular vectors of $\bar X$. See also Ha et al. (2020). In Theorem 1, we show that given any target stationary point $\bar X$ and a proper initialization, RISRO has a local quadratic-linear convergence rate in general and a quadratic convergence rate if $y = \mathcal{A}(\bar X)$.

Theorem 1 (Local Quadratic-Linear and Quadratic Convergence of RISRO) Let $\bar X$ be a stationary point of problem (1) and $\bar\epsilon = y - \mathcal{A}(\bar X)$. Suppose that $\mathcal{A}$ satisfies the $3r$-RIP, the initialization $X^0$ satisfies

$$\|X^0 - \bar X\|_F \le \left(\frac{1}{4}\wedge\frac{1-R_{2r}}{4\sqrt{5}R_{3r}}\right)\sigma_r(\bar X), \qquad (19)$$

and $\|\mathcal{A}^*(\bar\epsilon)\|_F \le \frac{1-R_{2r}}{4\sqrt{5}}\sigma_r(\bar X)$. Then the sequence $\{X^t\}$ generated by RISRO (Algorithm 1) converges Q-linearly to $\bar X$: $\|X^{t+1}-\bar X\|_F \le \frac{3}{4}\|X^t-\bar X\|_F$ for all $t\ge 0$. More precisely, it holds for all $t\ge 0$ that

$$\|X^{t+1}-\bar X\|^2_F \le \frac{5\|X^t-\bar X\|^2}{(1-R_{2r})^2\sigma^2_r(\bar X)}\cdot\left(R^2_{3r}\|X^t-\bar X\|^2_F + 4R_{3r}\|\mathcal{A}^*(\bar\epsilon)\|_F\|X^t-\bar X\|_F + 4\|\mathcal{A}^*(\bar\epsilon)\|^2_F\right). \qquad (20)$$

In particular, if $\bar\epsilon = 0$, then $\{X^t\}$ converges quadratically to $\bar X$:

$$\|X^{t+1}-\bar X\|_F \le \frac{\sqrt{5}R_{3r}}{(1-R_{2r})\sigma_r(\bar X)}\|X^t-\bar X\|^2_F, \quad \forall t\ge 0.$$


Remark 4 (Quadratic-Linear and Quadratic Convergence of RISRO) We call the convergence in (20) quadratic-linear since the sequence $\{X^t\}$ generated by RISRO exhibits a phase transition from quadratic to linear convergence: when $\|X^t-\bar X\|_F \gg \|\mathcal{A}^*(\bar\epsilon)\|_F$, the algorithm has a quadratic convergence rate; when $X^t$ becomes close to $\bar X$ such that $\|X^t-\bar X\|_F \le c\|\mathcal{A}^*(\bar\epsilon)\|_F$ for some $c > 0$, the convergence rate becomes linear. Moreover, as $\bar\epsilon$ becomes smaller, the stage of quadratic convergence becomes longer (see Section 7.1 for a numerical illustration of this convergence pattern). In the extreme case $\bar\epsilon = 0$, Theorem 1 covers the widely studied matrix sensing problem under the RIP framework (Chen and Wainwright, 2015; Jain et al., 2010; Park et al., 2018; Recht et al., 2010; Tu et al., 2016; Zhao et al., 2015; Zheng and Lafferty, 2015). It shows that as long as the initialization error is within a constant factor of $\sigma_r(\bar X)$, RISRO enjoys quadratic convergence to the target matrix $\bar X$. To the best of our knowledge, we are among the first to give quadratic-linear algorithmic convergence guarantees for general rank constrained least squares and quadratic convergence for matrix sensing. Recently, Charisopoulos et al. (2019) formulated (1) as a non-convex composite optimization problem based on the $X = RL^\top$ factorization and showed that the prox-linear algorithm (Burke, 1985; Lewis and Wright, 2016) achieves local quadratic convergence when $\bar\epsilon = 0$. Note that in each iteration therein, a carefully tuned convex programming problem needs to be solved exactly. In contrast, the proposed RISRO is tuning free, only solves a dimension-reduced least squares problem in each step, and can be as cheap as many first-order methods. See Section 5 for a detailed discussion of the computational complexity of RISRO.

It is noteworthy that a quadratic-linear convergence rate also appears in the recent Newton Sketch algorithm (Pilanci and Wainwright, 2017). However, Pilanci and Wainwright (2017) studied a convex problem using randomized sketching, which is significantly different from our setting, i.e., recursive importance sketching and non-convex matrix optimization.

Remark 5 (Initialization and Global Convergence of RISRO) The convergence theory in Theorem 1 requires a good initialization. Practically, the spectral method often provides a sufficiently good initialization that meets the requirement in (19) in many statistical applications. In Sections 6 and 7, we illustrate this point through two applications: matrix trace regression and phase retrieval.

Moreover, our main finding in the next section, i.e., that RISRO can be interpreted as a Riemannian manifold optimization method, implies that standard globalization strategies in manifold optimization, such as line search or trust region schemes (Absil et al., 2009; Nocedal and Wright, 2006), can be used to guarantee the global convergence of RISRO.

Remark 6 (Small Residual Condition in Theorem 1) In addition to the initialization condition, the small residual condition $\|\mathcal{A}^*(\bar\epsilon)\|_F \le \frac{1-R_{2r}}{4\sqrt{5}}\sigma_r(\bar X)$ is also needed in Theorem 1. This condition essentially means that the signal strength at the point $\bar X$ needs to dominate the noise. If $\bar\epsilon = y - \mathcal{A}(\bar X) = 0$, then the small residual condition holds automatically.

4 A Riemannian Manifold Optimization Interpretation of RISRO

We gave an interpretation of RISRO via recursive importance sketching in Section 2 and developed the convergence results in Section 3. The superior performance of RISRO yields the following question:

Is there a connection of RISRO to any class of optimization algorithms in the literature?

In this section, we give an affirmative answer to this question. We show RISRO can be viewed as a Riemannian optimization algorithm on the manifold $\mathcal{M}_r := \{X\in\mathbb{R}^{p_1\times p_2}\mid\mathrm{rank}(X) = r\}$. We find that the sketched least squares (5) in RISRO actually solves the Fisher scoring or Riemannian Gauss-Newton equation, and that Step 7 in RISRO performs a type of retraction under the framework of Riemannian optimization.

Riemannian optimization concerns optimizing a real-valued function $f$ defined on a Riemannian manifold $\mathcal{M}$. One commonly encountered type of manifold is a submanifold of $\mathbb{R}^n$. In this case, a manifold can be viewed as a smooth subset of $\mathbb{R}^n$. When a smoothly varying inner product is further defined on the subset, the subset together with the inner product is called a Riemannian manifold. We refer to Absil et al. (2009) for the rigorous definition of Riemannian manifolds. Optimization on a Riemannian manifold often relies on the notions of Riemannian gradient and Riemannian Hessian, which are used for finding a search direction, and the notion of retraction, which is defined for the motion of iterates on the manifold. The rest of this section describes the required Riemannian optimization tools and the connection of RISRO to Riemannian optimization.

It has been shown in (Lee, 2013, Example 8.14) that the set $\mathcal{M}_r$ is a smooth submanifold of $\mathbb{R}^{p_1\times p_2}$, and the tangent space is also given therein. The result is stated in Proposition 2 for completeness.

Proposition 2 (Lee, 2013, Example 8.14) $\mathcal{M}_r = \{X\in\mathbb{R}^{p_1\times p_2}: \mathrm{rank}(X) = r\}$ is a smooth submanifold of dimension $(p_1+p_2-r)r$. Its tangent space $T_X\mathcal{M}_r$ at $X\in\mathcal{M}_r$ with the SVD $X = U\Sigma V^\top$ ($U\in\mathbb{O}_{p_1,r}$ and $V\in\mathbb{O}_{p_2,r}$) is given by

$$T_X\mathcal{M}_r = \left\{[U\ U_\perp]\begin{bmatrix}\mathbb{R}^{r\times r} & \mathbb{R}^{r\times(p_2-r)} \\ \mathbb{R}^{(p_1-r)\times r} & 0_{(p_1-r)\times(p_2-r)}\end{bmatrix}[V\ V_\perp]^\top\right\}. \qquad (21)$$

The Riemannian metric of $\mathcal{M}_r$ that we use throughout this paper is the Euclidean inner product, i.e., $\langle U, V\rangle = \mathrm{trace}(U^\top V)$.

In the Euclidean setting, the update formula of an iterative algorithm is $X^t + \alpha\eta^t$, where $\alpha$ is the stepsize and $\eta^t$ is a descent direction. However, in the framework of Riemannian optimization, $X^t + \alpha\eta^t$ is generally neither well-defined nor lying on the manifold. To overcome this difficulty, the notion of retraction is used; see, e.g., Absil et al. (2009). For the manifold $\mathcal{M}_r$, a retraction $R$ is a smooth map from $T\mathcal{M}_r$ to $\mathcal{M}_r$ satisfying (i) $R(X, 0) = X$ and (ii) $\frac{d}{dt}R(X, t\eta)\big|_{t=0} = \eta$ for all $X\in\mathcal{M}_r$ and $\eta\in T_X\mathcal{M}_r$, where $T\mathcal{M}_r = \{(X, T_X\mathcal{M}_r): X\in\mathcal{M}_r\}$ is the tangent bundle of $\mathcal{M}_r$. The two conditions guarantee that $R(X, t\eta)$ stays on $\mathcal{M}_r$ and that $R(X, t\eta)$ is a first-order approximation of $X + t\eta$ at $t = 0$.

Next, we show that Step 7 in RISRO performs the orthographic retraction on the manifold of fixed-rank matrices given in Absil and Malick (2012). Suppose at iteration $t+1$, $B^{t+1}$ is invertible (this is true under the RIP framework; see Step 2 in the proof of Theorem 1). We can show by some algebraic calculations that the update $X^{t+1}$ in Step 7 can be rewritten as

$$X^{t+1} = X^{t+1}_U\left(B^{t+1}\right)^{-1}X^{(t+1)\top}_V = [U^t\ U^t_\perp]\begin{bmatrix} B^{t+1} & D_2^{(t+1)\top} \\ D_1^{t+1} & D_1^{t+1}(B^{t+1})^{-1}D_2^{(t+1)\top}\end{bmatrix}[V^t\ V^t_\perp]^\top. \qquad (22)$$

Let $\eta^t\in T_{X^t}\mathcal{M}_r$ be the update direction such that $X^t + \eta^t$ has the following representation:

$$X^t + \eta^t = [U^t\ U^t_\perp]\begin{bmatrix} B^{t+1} & D_2^{(t+1)\top} \\ D_1^{t+1} & 0\end{bmatrix}[V^t\ V^t_\perp]^\top. \qquad (23)$$

Comparing (22) and (23), we can view the update from $X^t + \eta^t$ to $X^{t+1}$ as simply completing the $0$ block in $\begin{bmatrix} B^{t+1} & D_2^{(t+1)\top} \\ D_1^{t+1} & 0\end{bmatrix}$ by $D_1^{t+1}(B^{t+1})^{-1}D_2^{(t+1)\top}$. This operation maps the tangent vector on $T_{X^t}\mathcal{M}_r$ back to the manifold $\mathcal{M}_r$, and it coincides with the orthographic retraction

$$R(X^t, \eta^t) = [U^t\ U^t_\perp]\begin{bmatrix} B^{t+1} & D_2^{(t+1)\top} \\ D_1^{t+1} & D_1^{t+1}(B^{t+1})^{-1}D_2^{(t+1)\top}\end{bmatrix}[V^t\ V^t_\perp]^\top \qquad (24)$$

on the set of fixed-rank matrices (Absil and Malick, 2012). Therefore, $X^{t+1} = R(X^t, \eta^t)$.
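The identity (22) behind this retraction view can be verified numerically. The following is a small self-contained sanity check with randomly drawn frames and blocks; all names, sizes, and the random construction are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p1, p2, r = 6, 5, 2

# Random frames and blocks standing in for U^t, V^t, B^{t+1}, D_1^{t+1}, D_2^{t+1}.
U = np.linalg.qr(rng.standard_normal((p1, r)))[0]
V = np.linalg.qr(rng.standard_normal((p2, r)))[0]
U_perp = np.linalg.qr(U, mode='complete')[0][:, r:]
V_perp = np.linalg.qr(V, mode='complete')[0][:, r:]
B = rng.standard_normal((r, r)) + 3 * np.eye(r)        # keep B invertible
D1 = rng.standard_normal((p1 - r, r))
D2 = rng.standard_normal((p2 - r, r))

# Steps 5 and 7 of Algorithm 1.
XU = U @ B + U_perp @ D1
XV = V @ B.T + V_perp @ D2
X_next = XU @ np.linalg.inv(B) @ XV.T

# Block form (22): the zero block of the tangent representation is filled by D1 B^{-1} D2^T.
middle = np.block([[B, D2.T], [D1, D1 @ np.linalg.inv(B) @ D2.T]])
X_block = np.hstack([U, U_perp]) @ middle @ np.hstack([V, V_perp]).T

print(np.allclose(X_next, X_block))   # True: Step 7 implements the orthographic retraction (24)
```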

Remark 7 Although the orthographic retraction defined in Absil and Malick (2012) requires that $U^t$ and $V^t$ are the left and right singular vectors of $X^t$, one can verify that even if $U^t$ and $V^t$ are not exactly the left and right singular vectors but satisfy $U^t = \tilde U^t O$ and $V^t = \tilde V^t Q$, the mapping (24) is still equivalent to the orthographic retraction in Absil and Malick (2012). Here, $O, Q\in\mathbb{O}_{r,r}$, and $\tilde U^t$ and $\tilde V^t$ are the left and right singular vectors of $X^t$.

The Riemannian gradient of a smooth function $f: \mathcal{M}_r\to\mathbb{R}$ at $X\in\mathcal{M}_r$ is defined as the unique tangent vector $\mathrm{grad}\, f(X)\in T_X\mathcal{M}_r$ such that $\langle\mathrm{grad}\, f(X), Z\rangle = \mathrm{D} f(X)[Z]$ for all $Z\in T_X\mathcal{M}_r$, where $\mathrm{D} f(X)[Z]$ denotes the directional derivative of $f$ at the point $X$ along the direction $Z$. Since $\mathcal{M}_r$ is an embedded submanifold of $\mathbb{R}^{p_1\times p_2}$ and the Euclidean metric is used, from (Absil et al., 2009, (3.37)) we know that in our problem

$$\mathrm{grad}\, f(X) = P_{T_X}\left(\mathcal{A}^*(\mathcal{A}(X) - y)\right), \qquad (25)$$

where $P_{T_X}$ is the orthogonal projector onto the tangent space at $X$, defined as

$$P_{T_X}(Z) = P_U Z P_V + P_{U_\perp}Z P_V + P_U Z P_{V_\perp}, \quad \forall Z\in\mathbb{R}^{p_1\times p_2}, \qquad (26)$$

with $U\in\mathbb{O}_{p_1,r}$, $V\in\mathbb{O}_{p_2,r}$ the left and right singular vectors of $X$.
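As a small illustration, the tangent-space projector (26) can be written in a few lines of NumPy; the simplified algebraic form in the comment is equivalent to the three-term expression in (26).

```python
import numpy as np

def proj_tangent(Z, U, V):
    """Project Z onto the tangent space T_X M_r in (26), where U, V hold the
    left/right singular vectors of the rank-r point X."""
    PU, PV = U @ U.T, V @ V.T
    # P_U Z P_V + P_{U_perp} Z P_V + P_U Z P_{V_perp}  ==  Z P_V + P_U Z - P_U Z P_V
    return Z @ PV + PU @ Z - PU @ Z @ PV
```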

Next, we introduce the Riemannian Hessian. The Riemannian Hessian of $f$ at $X\in\mathcal{M}_r$ is the linear map $\mathrm{Hess}\, f(X)$ from $T_X\mathcal{M}_r$ onto itself defined by $\mathrm{Hess}\, f(X)[Z] = \bar\nabla_Z\,\mathrm{grad}\, f(X)$ for all $Z\in T_X\mathcal{M}_r$, where $\bar\nabla$ is the Riemannian connection on $\mathcal{M}_r$ (Absil et al., 2009, Section 5.3). Lemma 3 gives an explicit formula for the Riemannian Hessian in our problem.

Lemma 3 (Riemannian Hessian) Consider $f(X)$ in (1). If $X\in\mathcal{M}_r$ has singular value decomposition $U\Sigma V^\top$ and $Z\in T_X\mathcal{M}_r$ has the representation

$$Z = [U\ U_\perp]\begin{bmatrix} Z_B & Z_{D_2}^\top \\ Z_{D_1} & 0\end{bmatrix}[V\ V_\perp]^\top,$$

then the Hessian operator in this setting satisfies

$$\mathrm{Hess}\, f(X)[Z] = P_{T_X}\left(\mathcal{A}^*(\mathcal{A}(Z))\right) + P_{U_\perp}\mathcal{A}^*(\mathcal{A}(X)-y)V_p\Sigma^{-1}V^\top P_V + P_U U\Sigma^{-1}U_p^\top\mathcal{A}^*(\mathcal{A}(X)-y)P_{V_\perp}, \qquad (27)$$

where $U_p = U_\perp Z_{D_1}$ and $V_p = V_\perp Z_{D_2}$.

Next, we show that the update direction $\eta^t$, implicitly encoded in (23), is the Riemannian Gauss-Newton direction in the manifold optimization over $\mathcal{M}_r$. Similarly to the classic Newton's method, at the $t$-th iteration the Riemannian Newton method aims to find the Riemannian Newton direction $\eta^t_{\mathrm{Newton}}$ in $T_{X^t}\mathcal{M}_r$ that solves the following Newton equation:

$$-\mathrm{grad}\, f(X^t) = \mathrm{Hess}\, f(X^t)[\eta^t_{\mathrm{Newton}}]. \qquad (28)$$

If the residual $(y-\mathcal{A}(X^t))$ is small, the last two terms in $\mathrm{Hess}\, f(X^t)[\eta]$ of (27) are expected to be small, which means we can approximately solve for the Riemannian Newton direction via

$$-\mathrm{grad}\, f(X^t) = P_{T_{X^t}}\left(\mathcal{A}^*(\mathcal{A}(\eta))\right), \quad \eta\in T_{X^t}\mathcal{M}_r. \qquad (29)$$

In fact, Equation (29) has an interpretation from the Fisher scoring algorithm. Consider the statistical setting $y = \mathcal{A}(X) + \epsilon$, where $X$ is a fixed low-rank matrix and $\epsilon_i\overset{i.i.d.}{\sim} N(0,\sigma^2)$. Then for any $\eta$,

$$\left\{\mathbb{E}\left(\mathrm{Hess}\, f(X)[\eta]\right)\right\}\Big|_{X=X^t} = P_{T_{X^t}}\left(\mathcal{A}^*(\mathcal{A}(\eta))\right),$$

where on the left hand side the expression is evaluated at $X^t$ after taking the expectation. In the literature, the Fisher scoring algorithm computes the update direction by solving the modified Newton equation that replaces the Hessian with its expected value (Lange, 2010), i.e.,

$$\left\{\mathbb{E}\left(\mathrm{Hess}\, f(X)[\eta]\right)\right\}\Big|_{X=X^t} = -\mathrm{grad}\, f(X^t), \quad \eta\in T_{X^t}\mathcal{M}_r,$$

which exactly becomes (29) in our setting. Meanwhile, it is not difficult to show that the Fisher scoring algorithm here is equivalent to the Riemannian Gauss-Newton method for solving nonlinear least squares (Lange, 2010, Section 14.6); (Absil et al., 2009, Section 8.4). Thus, the $\eta$ that solves equation (29) is also the Riemannian Gauss-Newton direction.

It turns out that the update direction $\eta^t$ in (23) of RISRO solves the Fisher scoring or Riemannian Gauss-Newton equation (29):

Theorem 2 Let $\{X^t\}$ be the sequence generated by RISRO under the same assumptions as in Theorem 1. Then, for all $t\ge 0$, the implicitly encoded update direction $\eta^t$ in (23) solves the Riemannian Gauss-Newton equation (29).

Theorem 2, together with the retraction interpretation in (24), establishes the connection between RISRO and Riemannian manifold optimization. Following this connection, we further show in Proposition 3 that each $\eta^t$ is always a descent direction. Combined with the globalization schemes discussed in Remark 5, this fact is useful for boosting the local convergence of RISRO to global convergence.

Proposition 3 For all $t\ge 0$, the update direction $\eta^t\in T_{X^t}\mathcal{M}_r$ in (23) satisfies $\langle\mathrm{grad}\, f(X^t), \eta^t\rangle < 0$, i.e., $\eta^t$ is a descent direction. If $\mathcal{A}$ satisfies the $2r$-RIP, then the direction sequence $\{\eta^t\}$ is gradient related.

Remark 8 The convergence of Riemannian Gauss-Newton was studied in the recent work Breiding and Vannieuwenhoven (2018). Our results are significantly different from, and offer improvements over, Breiding and Vannieuwenhoven (2018) in the following ways. First, Breiding and Vannieuwenhoven (2018) considered a more general Riemannian Gauss-Newton setting, but their convergence results are established around a local minimizer, which is a much stronger and less practical requirement than the stationary point assumption we need. Second, the initialization condition and convergence rate in Breiding and Vannieuwenhoven (2018) can be suboptimal in our rank constrained least squares setting. Third, our proof technique is quite different from that of Breiding and Vannieuwenhoven (2018): their proof is based on manifold optimization, while ours is motivated by the insight of recursive importance sketching introduced in Section 2, and in particular the approximation error of the sketched least squares established in Lemma 1 plays a key role. Fourth, our recursive importance sketching framework provides new sketching interpretations for several classical algorithms for rank constrained least squares. In Section 6 we also apply RISRO in popular statistical models and establish statistical convergence results; it is not immediately clear how to utilize the results in Breiding and Vannieuwenhoven (2018) in these statistical settings.


Table 1: Computational complexity per iteration and convergence rate for Alternating Minimization (Alter Mini) (Jain et al., 2013), singular value projection (SVP) (Jain et al., 2010), gradient descent (GD) (Tu et al., 2016), and RISRO.

                              Alter Mini               SVP       GD        RISRO (this work)
Complexity per iteration      O(np^2 r^2 + (pr)^3)     O(np^2)   O(np^2)   O(np^2 r^2 + (pr)^3)
Convergence rate              Linear                   Linear    Linear    Quadratic-(linear)

5 Computational Complexity of RISRO

In this section, we discuss the computational complexity of RISRO. Suppose $p_1 = p_2 = p$; the computational complexity of RISRO per iteration is $O(np^2r^2 + (pr)^3)$ in the general setting. A comparison of the computational complexity of RISRO and other common algorithms is provided in Table 1. Here the main cost of RISRO and Alter Mini comes from solving the least squares, while the main cost of singular value projection (SVP) (Jain et al., 2010) and gradient descent (Tu et al., 2016) comes from computing the gradient. From Table 1, we can see that RISRO has the same per-iteration complexity as Alter Mini and comparable complexity to SVP and GD when $n\ge pr$ and $r$ is much smaller than $n$ and $p$. On the other hand, RISRO and Alter Mini are tuning free, while a proper step size is crucial for SVP and GD to achieve fast convergence. Finally, RISRO enjoys high-order convergence as shown in Section 3, whereas the convergence rates of all the other algorithms are limited to linear.

The main computational bottleneck of RISRO is solving the least squares, which can be alleviated by using iterative linear system solvers, such as the (preconditioned) conjugate gradient method, when the linear operator $\mathcal{A}$ has special structure. Such special structure occurs, for example, in the matrix completion problem ($\mathcal{A}$ is sparse) (Vandereycken, 2013), phase retrieval for X-ray crystallography imaging ($\mathcal{A}$ involves fast Fourier transforms) (Huang et al., 2017b), and blind deconvolution for image deblurring ($\mathcal{A}$ involves fast Fourier transforms and Haar wavelet transforms) (Huang and Hand, 2018).

To utilize these structures, we introduce an intrinsic representation of tangent vectors in $\mathcal{M}_r$: if $U, V$ are the left and right singular vectors of a rank-$r$ matrix $X$, an orthonormal basis of $T_X\mathcal{M}_r$ is

$$\left\{[U\ U_\perp]\begin{bmatrix} e_i e_j^\top & 0_{r\times(p-r)} \\ 0_{(p-r)\times r} & 0_{(p-r)\times(p-r)}\end{bmatrix}[V\ V_\perp]^\top,\ i=1,\ldots,r,\ j=1,\ldots,r\right\} \cup \left\{[U\ U_\perp]\begin{bmatrix} 0_{r\times r} & e_i\tilde e_j^\top \\ 0_{(p-r)\times r} & 0_{(p-r)\times(p-r)}\end{bmatrix}[V\ V_\perp]^\top,\ i=1,\ldots,r,\ j=1,\ldots,p-r\right\} \cup \left\{[U\ U_\perp]\begin{bmatrix} 0_{r\times r} & 0_{r\times(p-r)} \\ \tilde e_i e_j^\top & 0_{(p-r)\times(p-r)}\end{bmatrix}[V\ V_\perp]^\top,\ i=1,\ldots,p-r,\ j=1,\ldots,r\right\},$$

where $e_i$ and $\tilde e_i$ denote the $i$-th canonical basis vectors of $\mathbb{R}^r$ and $\mathbb{R}^{p-r}$, respectively. It follows that any tangent vector in $T_X\mathcal{M}_r$ can be uniquely represented by a coefficient vector in $\mathbb{R}^{(2p-r)r}$ via the basis above. This representation is called the intrinsic representation (Huang et al., 2017a). Computing the intrinsic representation of a Riemannian gradient can be computationally efficient. For example, the complexity of computing the Riemannian gradient in matrix completion is $O(nr+pr^2)$, and its intrinsic representation can be computed with an additional $O(pr^2)$ operations (Vandereycken, 2013). The complexities of computing the intrinsic representations of the Riemannian gradients in phase retrieval and blind deconvolution are both $O(n\log(n)r + pr^2)$ (Huang et al., 2017b; Huang and Hand, 2018).

By Theorem 2, the least squares problem (5) of RISRO is equivalent to solving for $\eta\in T_{X^t}\mathcal{M}_r$ such that $P_{T_{X^t}}\mathcal{A}^*(\mathcal{A}(\eta)) = -\mathrm{grad}\, f(X^t)$. Reformulating this equation in the intrinsic representation yields

$$-\mathrm{grad}\, f(X^t) = P_{T_{X^t}}\mathcal{A}^*(\mathcal{A}(\eta)) \;\Longrightarrow\; -u = B^*_X\left(\mathcal{A}^*(\mathcal{A}(B_X v))\right), \qquad (30)$$

where $u, v$ are the intrinsic representations of $\mathrm{grad}\, f(X^t)$ and $\eta$, the mapping $B_X: \mathbb{R}^{(2p-r)r}\to T_X\mathcal{M}_r\subset\mathbb{R}^{p\times p}$ converts an intrinsic representation to the corresponding tangent vector, and $B^*_X: \mathbb{R}^{p\times p}\to\mathbb{R}^{(2p-r)r}$ is the adjoint operator of $B_X$. The computational complexity of using the conjugate gradient method to solve (30) is determined by the complexity of evaluating the operator $B^*_X\circ(\mathcal{A}^*\mathcal{A})\circ B_X$ on a given vector. With the intrinsic representation, it can be shown that this evaluation costs $O(nr+pr^2)$ in matrix completion and $O(n\log(n)r + pr^2)$ in phase retrieval and blind deconvolution. Thus, when solving (30) via the conjugate gradient method, the complexity is $O(k(nr+pr^2))$ in matrix completion and $O(k(n\log(n)r + pr^2))$ in phase retrieval and blind deconvolution, where $k$ is the number of conjugate gradient iterations and is provably at most $(2p-r)r$. Hence, for special applications such as matrix completion, phase retrieval, and blind deconvolution, by using the conjugate gradient method with the intrinsic representation, the per-iteration complexity of RISRO can be greatly reduced. This point will be further exploited in our future research.

6 Recursive Importance Sketching under Statistical Models

In this section, we study the applications of RISRO in machine learning and statistics. We specifically investigate low-rank matrix trace regression and phase retrieval, while our key ideas can be applied to more problems.

6.1 Low-Rank Matrix Trace Regression

Consider the low-rank matrix trace regression model:

$$y_i = \langle A_i, X^*\rangle + \epsilon_i, \quad \text{for } 1\le i\le n, \qquad (31)$$

where $X^*\in\mathbb{R}^{p_1\times p_2}$ is the true model parameter and $\mathrm{rank}(X^*) = r$. The goal is to estimate $X^*$ from $\{y_i, A_i\}_{i=1}^n$. Due to the noise $\epsilon_i$, $X^*$ usually cannot be exactly recovered.

The following Theorem 3 shows that RISRO converges quadratic-linearly to $X^*$ up to some statistical error, given a proper initialization. Moreover, under the Gaussian ensemble design, RISRO with spectral initialization achieves the minimax optimal estimation error rate.

Theorem 3 (RISRO in Matrix Trace Regression) Consider the low-rank matrix trace regression problem (31). Suppose that $\mathcal{A}$ satisfies the $3r$-RIP, the initialization of RISRO satisfies

$$\|X^0 - X^*\|_F \le \left(\frac{1}{4}\wedge\frac{1-R_{2r}}{2\sqrt{5}R_{3r}}\right)\sigma_r(X^*), \qquad (32)$$

and

$$\sigma_r(X^*) \ge \left(16\sqrt{5}\vee\frac{40\sqrt{2}R_{3r}}{1-R_{2r}}\right)\frac{\|(\mathcal{A}^*(\epsilon))_{\max(r)}\|_F}{1-R_{2r}}. \qquad (33)$$

Then the iterates of RISRO converge as follows:

$$\|X^{t+1}-X^*\|^2_F \le \frac{10R^2_{3r}\|X^t-X^*\|^4_F}{(1-R_{2r})^2\sigma^2_r(X^*)} + \frac{20\|(\mathcal{A}^*(\epsilon))_{\max(r)}\|^2_F}{(1-R_{2r})^2}, \quad \forall t\ge 0. \qquad (34)$$

Moreover, assume $(A_i)_{[j,k]}\overset{i.i.d.}{\sim} N(0, 1/n)$ and $\epsilon_i\overset{i.i.d.}{\sim} N(0, \sigma^2/n)$. Then there exist universal constants $C_1, C_2, C', c > 0$ such that as long as $n\ge C_1(p_1+p_2)r\left(\frac{\sigma^2}{\sigma^2_r(X^*)}\vee r\kappa^2\right)$ (here $\kappa = \frac{\sigma_1(X^*)}{\sigma_r(X^*)}$ is the condition number of $X^*$) and $t_{\max}\ge C_2\log\log\left(\frac{\sigma^2_r(X^*)n}{r(p_1+p_2)\sigma^2}\right)\vee 1$, the output of RISRO with spectral initialization $X^0 = (\mathcal{A}^*(y))_{\max(r)}$ satisfies $\|X^{t_{\max}}-X^*\|^2_F \le c\,\frac{r(p_1+p_2)}{n}\sigma^2$ with probability at least $1 - \exp(-C'(p_1+p_2))$.

Remark 9 (Quadratic Convergence, Statistical Error, and Robustness) Note that there are two terms in the upper bound (34) of $\|X^{t+1}-X^*\|^2_F$. The first term corresponds to the optimization error, which decreases quadratically over iterations $t$; the second term, $O(\|(\mathcal{A}^*(\epsilon))_{\max(r)}\|^2_F)$, is the essential statistical error that is independent of the optimization algorithm.

In addition, (34) shows that the error contraction factor is independent of the condition number $\kappa$, which demonstrates the robustness of RISRO to ill-conditioning of the underlying low-rank matrix. We will further demonstrate this point by simulation studies in Section 7.2.

Remark 10 (Optimal Statistical Error) Under the Gaussian ensemble design, RISRO with spectral initialization achieves the estimation error rate $c\,r(p_1+p_2)\sigma^2/n$ after a double-logarithmic number of iterations when $n\ge C_1(p_1+p_2)r\left(\frac{\sigma^2}{\sigma^2_r(X^*)}\vee r\kappa^2\right)$. Compared with the lower bound on the estimation error,

$$\min_{\hat X}\max_{\mathrm{rank}(X^*)\le r}\mathbb{E}\|\hat X - X^*\|^2_F \ge c_1\frac{r(p_1+p_2)\sigma^2}{n}$$

for some $c_1 > 0$ in Candès and Plan (2011), RISRO achieves the minimax optimal estimation error. To the best of our knowledge, RISRO is the first provable algorithm that achieves the minimax rate-optimal estimation error with only a double-logarithmic number of iterations.

    6.2 Phase Retrieval

    In this section, we consider RISRO for solving the following quadratic equation system

    yi “ |xai,x˚y|2 for 1 ď i ď n, (35)

    where y P Rn and covariates taiuni“1 P Rp (or Cp) are known whereas x˚ P Rp (or Cp) are unknown.The goal is to recover x˚ based on tyi,aiuni“1. One important application is known as phaseretrieval arising from physical science due to the nature of optical sensors (Fienup, 1982). In theliterature, various approaches have been proposed for phase retrieval with provable guarantees,such as convex relaxation (Candès et al., 2013; Huang et al., 2017b; Waldspurger et al., 2015) andnon-convex approaches (Candès et al., 2015; Chen and Candès, 2017; Gao and Xu, 2017; Ma et al.,2019; Netrapalli et al., 2013; Sanghavi et al., 2017; Wang et al., 2017a; Duchi and Ruan, 2019).

    For ease of exposition, we focus on the real-value model, i.e., x˚ P Rn and ai P Rn, while a simpletrick in Sanghavi et al. (2017) can recast the problem (35) in the complex model into a rank-2 realvalue matrix recovery problem, then our approach still applies. In the real-valued setting, we canrewrite model (35) into a low-rank matrix recovery model

\[
y = \mathcal{A}(X^*) \quad \text{with} \quad X^* = x^* x^{*\top} \ \text{ and } \ [\mathcal{A}(X^*)]_i = \langle a_i a_i^\top,\, x^* x^{*\top} \rangle. \qquad (36)
\]
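As a quick sanity check of the lifting in (36), the following toy NumPy snippet (random data of our own choosing, unrelated to the experiments in Section 7) verifies that the quadratic measurements in (35) coincide with linear measurements of the rank-one matrix $x^* x^{*\top}$:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 20, 50
x_star = rng.standard_normal(p)
a = rng.standard_normal((n, p))              # rows are the design vectors a_i

y_quadratic = (a @ x_star) ** 2                    # y_i = <a_i, x*>^2 as in (35)
X_star = np.outer(x_star, x_star)                  # lifted rank-1 matrix x* x*^T
y_lifted = np.einsum('ij,ik,jk->i', a, a, X_star)  # <a_i a_i^T, X*> as in (36)
assert np.allclose(y_quadratic, y_lifted)
```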


There are two challenges in phase retrieval compared to the low-rank matrix trace regression considered previously. First, due to the symmetry of the sensing matrices $a_i a_i^\top$ and of $x^* x^{*\top}$ in phase retrieval, the importance covariates $A_{D_1}$ and $A_{D_2}$ in (4) are exactly the same, so an adaptation of Algorithm 1 is needed. Second, in phase retrieval the map $\mathcal{A}$ in general no longer satisfies a proper RIP condition (Cai and Zhang, 2015; Candès et al., 2013), so new theory is needed. To this end, we introduce a modified RISRO for phase retrieval in Algorithm 2. In particular, in Step 4 of Algorithm 2 we multiply the importance covariates $A_2$ by an extra factor of 2 to account for the duplicated importance covariates caused by symmetry.
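The factor 2 in Step 4 can be read off from a direct expansion; the short calculation below is our sketch of the reasoning, with the update parametrized as $X = u_t b\, u_t^\top + u_{t\perp} d\, u_t^\top + u_t d^\top u_{t\perp}^\top$ exactly as in Step 5 of Algorithm 2:
\[
\big\langle a_i a_i^\top,\; u_t b\, u_t^\top + u_{t\perp} d\, u_t^\top + u_t d^\top u_{t\perp}^\top \big\rangle
 = b\,(a_i^\top u_t)^2 + 2\,(a_i^\top u_t)\,\big(u_{t\perp}^\top a_i\big)^\top d
 = (A_1)_i\, b + 2\,(A_2)_{[i,:]}\, d,
\]
so the two symmetric cross terms contribute identical covariates, which is exactly what the extra factor 2 on $A_2$ accounts for.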

Algorithm 2 RISRO for Phase Retrieval

1: Input: design vectors $\{a_i\}_{i=1}^n \subseteq \mathbb{R}^p$, $y \in \mathbb{R}^n$, initialization $X_0$ that admits the eigenvalue decomposition $\sigma_1^0\, u_0 u_0^\top$
2: for $t = 0, 1, \ldots$ do
3:   Perform importance sketching on $a_i$ and construct the covariates $A_1 \in \mathbb{R}^n$, $A_2 \in \mathbb{R}^{n \times (p-1)}$, where for $1 \le i \le n$, $(A_1)_i = (a_i^\top u_t)^2$ and $(A_2)_{[i,:]} = u_{t\perp}^\top a_i a_i^\top u_t$.
4:   Solve the unconstrained least squares problem $(b_{t+1}, d_{t+1}) = \arg\min_{b \in \mathbb{R},\, d \in \mathbb{R}^{p-1}} \|y - A_1 b - 2 A_2 d\|_2^2$.
5:   Compute the eigenvalue decomposition of $[u_t \ u_{t\perp}] \begin{bmatrix} b_{t+1} & d_{t+1}^\top \\ d_{t+1} & 0 \end{bmatrix} [u_t \ u_{t\perp}]^\top$, and denote it as $[v_1 \ v_2] \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} [v_1 \ v_2]^\top$ with $\lambda_1 \ge \lambda_2$.
6:   Update $u_{t+1} = v_1$ and $X_{t+1} = \lambda_1 u_{t+1} u_{t+1}^\top$.
7: end for
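For concreteness, a compact NumPy rendering of Algorithm 2 could look as follows; this is a sketch under our own naming with a fixed iteration budget, not the authors' released implementation:

```python
import numpy as np

def risro_phase_retrieval(a, y, u0, sigma0, n_iter=20):
    """Sketch of Algorithm 2: a is (n, p) with rows a_i, X_0 = sigma0 * u0 u0^T."""
    u, lam = u0 / np.linalg.norm(u0), sigma0
    for _ in range(n_iter):
        # Step 3: importance sketching of the design vectors.
        Q, _ = np.linalg.qr(u.reshape(-1, 1), mode='complete')
        U_perp = Q[:, 1:]                        # orthonormal complement of u
        au = a @ u                               # entries a_i^T u, shape (n,)
        A1 = au ** 2                             # (A_1)_i = (a_i^T u)^2
        A2 = au[:, None] * (a @ U_perp)          # (A_2)_[i,:] = u_perp^T a_i a_i^T u
        # Step 4: dimension-reduced least squares with the factor 2 on A2.
        coef, *_ = np.linalg.lstsq(np.column_stack([A1, 2 * A2]), y, rcond=None)
        b, d = coef[0], coef[1:]
        # Step 5: eigen-decomposition of [u U_perp][[b, d^T],[d, 0]][u U_perp]^T,
        # which equals the rank-2 symmetric matrix b u u^T + w u^T + u w^T.
        w = U_perp @ d
        M = b * np.outer(u, u) + np.outer(w, u) + np.outer(u, w)
        evals, evecs = np.linalg.eigh(M)
        lam, u = evals[-1], evecs[:, -1]         # lambda_1 >= lambda_2 and v_1
        # Step 6: X_{t+1} = lambda_1 u_{t+1} u_{t+1}^T is kept implicitly via (lam, u).
    return lam * np.outer(u, u)
```

Since the matrix formed in Step 5 has rank at most two, a production implementation could obtain its top eigenpair from a $2 \times 2$ problem on $\operatorname{span}\{u_t, u_{t\perp} d_{t+1}\}$ instead of a full eigen-decomposition; the sketch above favors clarity over efficiency.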

Next, we show that under the Gaussian ensemble design, given a sample size of order $p \log p$ and a proper initialization, the sequence $\{X_t\}$ generated by Algorithm 2 converges quadratically to $X^*$.

Theorem 4 (Local Quadratic Convergence of RISRO for Phase Retrieval) In the phase retrieval problem (35), assume that $\{a_i\}_{i=1}^n$ are independently generated from $N(0, I_p)$. Then for any $\delta_1, \delta_2 \in (0,1)$, there exist $c, C(\delta_1), C' > 0$ such that when $p \ge c \log n$, $n \ge C(\delta_1)\, p \log p$, and $\|X_0 - X^*\|_F \le \frac{1-\delta_1}{C'(1+\delta_2)\, p} \|X^*\|_F$, with probability at least $1 - C_1 \exp(-C_2(\delta_1, \delta_2)\, n) - C_3 n^{-p}$, the sequence $\{X_t\}$ generated by Algorithm 2 satisfies
\[
\|X_{t+1} - X^*\|_F \le \frac{C'(1+\delta_2)\, p}{(1-\delta_1)\, \|X^*\|_F}\, \|X_t - X^*\|_F^2, \quad \forall\, t \ge 0 \qquad (37)
\]
for some $C_1, C_2(\delta_1, \delta_2), C_3 > 0$.

Remark 11 (Initialization Condition) In Theorem 4, we assume $\|X_0 - X^*\|_F \le O(\|X^*\|_F / p)$ to show the quadratic convergence of RISRO. This initialization assumption is used to handle some technical difficulties arising from the lack of a proper restricted isometry property in phase retrieval. Although it is difficult to prove in theory that the spectral initialization satisfies this assumption, we find by simulation that the spectral initialization is sufficiently good to guarantee quadratic convergence in the subsequent updates. We leave the convergence theory under a weaker initialization assumption as future work.

    7 Numerical Studies

In this section, we conduct simulation studies to investigate the numerical performance of RISRO. We specifically consider two settings:


    • Matrix trace regression. Let $p = p_1 = p_2$ and $y_i = \langle X^*, A_i \rangle + \epsilon_i$, where the $A_i$'s are constructed with independent standard normal entries and $\epsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. We set $X^* = U^* \Sigma^* V^{*\top}$, where $U^*, V^* \in \mathbb{O}_{p,r}$ are randomly generated and $\Sigma^* = \operatorname{diag}(\lambda_1, \ldots, \lambda_r)$. Also, we set $\lambda_1 = 3$ and $\lambda_i = \lambda_1 \kappa^{i/r}$ for $i = 2, \ldots, r$, so the condition number of $X^*$ is $\kappa$. We initialize $X_0$ via $(\mathcal{A}^*(y))_{\max(r)}$. (A code sketch of this data-generating process is given after this list.)

    • Phase retrieval. Let $y_i = \langle a_i, x^* \rangle^2$, where $x^* \in \mathbb{R}^p$ is a randomly generated unit vector and $a_i \overset{\text{i.i.d.}}{\sim} N(0, I_p)$. We initialize $X_0$ via the truncated spectral initialization (Chen and Candès, 2017).
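The matrix trace regression data described in the first bullet can be generated, for example, as follows; this is a sketch of our reading of the setup, with variable names of our own choosing:

```python
import numpy as np

def make_trace_regression(p, r, n, kappa, sigma, rng):
    """Simulated data: X* = U* Sigma* V*^T with condition number kappa, Gaussian A_i, noisy y."""
    U, _ = np.linalg.qr(rng.standard_normal((p, r)))     # U* in O_{p,r}
    V, _ = np.linalg.qr(rng.standard_normal((p, r)))     # V* in O_{p,r}
    lam = 3.0 * np.ones(r)                               # lambda_1 = 3
    lam[1:] = 3.0 * kappa ** (np.arange(2, r + 1) / r)   # lambda_i = lambda_1 kappa^{i/r}, i >= 2
    X_star = (U * lam) @ V.T                             # U* diag(lambda) V*^T
    A = rng.standard_normal((n, p, p))                   # A_i with independent N(0,1) entries
    y = np.einsum('ijk,jk->i', A, X_star) + sigma * rng.standard_normal(n)
    return A, y, X_star

# example usage with the parameters of Section 7.1
rng = np.random.default_rng(0)
A, y, X_star = make_trace_regression(p=100, r=3, n=1500, kappa=1.0, sigma=0.0, rng=rng)
```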

Throughout the simulation studies, we consider errors in two metrics: (1) $\|X_t - X_{t_{\max}}\|_F / \|X_{t_{\max}}\|_F$, which measures the convergence error; and (2) $\|X_t - X^*\|_F / \|X^*\|_F$, the relative root mean-squared error (relative RMSE), which measures the estimation error for $X^*$. The algorithm is terminated when it reaches the maximum number of iterations $t_{\max} = 300$ or when the corresponding error metric falls below $10^{-12}$. Unless otherwise noted, the reported results are averages over 50 simulations, run on a computer with an Intel Xeon E5-2680 2.5GHz CPU.
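In code, both error metrics reduce to the same relative Frobenius distance; a minimal helper (our naming, not part of any released package) that also encodes the stopping rule might look as follows:

```python
import numpy as np

def rel_error(X, X_ref):
    """Relative Frobenius error, used for ||Xt - Xtmax||_F/||Xtmax||_F and ||Xt - X*||_F/||X*||_F."""
    return np.linalg.norm(X - X_ref) / np.linalg.norm(X_ref)

def should_stop(err, t, t_max=300, tol=1e-12):
    """Stopping rule used throughout Section 7: iteration cap or monitored error below 1e-12."""
    return t >= t_max or err < tol
```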

    7.1 Properties of RISRO

We first study the convergence rate of RISRO. Specifically, we set $p = 100$, $r = 3$, $n \in \{1200, 1500, 1800, 2100, 2400\}$, $\kappa = 1$, $\sigma = 0$ for low-rank matrix trace regression, and $p = 1200$, $n \in \{4800, 6000, 7200, 8400, 9600\}$ for phase retrieval. The convergence performance of RISRO (Algorithm 1 in low-rank matrix trace regression and Algorithm 2 in phase retrieval) is plotted in Figure 1. We can see that RISRO with the (truncated) spectral initialization converges quadratically to the true parameter $X^*$ in both problems, which is in line with the theory developed in the previous sections. Although our theory on phase retrieval in Theorem 4 is based on a stronger initialization assumption, the truncated spectral initialization achieves excellent empirical performance.

In another setting, we examine the quadratic-linear convergence of RISRO under noise. Consider the matrix trace regression problem with $\sigma = 10^\alpha$, $\alpha \in \{0, -1, -2, -3, -5, -14\}$, $n = 1500$, and $p, r, \kappa$ the same as in the previous setting. The simulation results in Figure 3 show that the gradient norm $\|\operatorname{grad} f(X_t)\|$ of the iterates converges to zero, which demonstrates the convergence of the algorithm. Meanwhile, since the observations are noisy, RISRO exhibits the quadratic-linear convergence discussed in Remark 4: when $\alpha = 0$, i.e., $\sigma = 1$, RISRO converges quadratically in the first 2-3 steps and then reduces to linear convergence afterwards; as $\sigma$ gets smaller, RISRO enjoys a longer stretch of quadratic convergence, which matches our theoretical prediction in Remark 4.

[Figure 3 appears here: two panels versus iteration number; the left panel plots $\|X_t - X_{t_{\max}}\|_F / \|X_{t_{\max}}\|_F$ and the right panel plots the gradient norm, with one curve per $\alpha \in \{-14, -5, -3, -2, -1, 0\}$.]

Figure 3: Convergence plot of RISRO in matrix trace regression. $p = 100$, $r = 3$, $n = 1500$, $\kappa = 1$, $\sigma = 10^\alpha$ with varying $\alpha$.


Finally, we study the performance of RISRO in a large-scale setting of the matrix trace regression. Fix $n = 7000$, $r = 3$, $\kappa = 1$, $\sigma = 0$ and let the dimension $p$ grow from 100 to 500. In the largest case, the space cost of storing $\mathcal{A}$ reaches $7000 \cdot 500 \cdot 500 \cdot 8\,\mathrm{B} \approx 13.04\,\mathrm{GB}$. Figure 4 shows the relative RMSE of the output of RISRO and the runtime versus the dimension. We can clearly see that the relative RMSE of the output is stable and that the runtime scales reasonably well as the dimension $p$ grows.
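The quoted storage cost is simply the size of a dense $n \times p \times p$ array of double-precision entries; a one-line check (our arithmetic, in binary gigabytes):

```python
n, p = 7000, 500
print(n * p * p * 8 / 2**30)   # 8-byte doubles -> approx. 13.04 GB
```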

[Figure 4 appears here: two panels versus the dimension $p$; the left panel plots the relative RMSE and the right panel plots the runtime in seconds.]

Figure 4: Relative RMSE and runtime of RISRO in matrix trace regression. $p \in [100, 500]$, $r = 3$, $n = 7000$, $\kappa = 1$, $\sigma = 0$.

    7.2 Comparison of RISRO with Other Algorithms in Literature

In this subsection, we further compare RISRO with existing algorithms in the literature. In matrix trace regression, we compare our algorithm with singular value projection (SVP) (Goldfarb and Ma, 2011; Jain et al., 2010), Alternating Minimization (Alter Mini) (Jain et al., 2013; Zhao et al., 2015), gradient descent (GD) (Park et al., 2018; Tu et al., 2016; Zheng and Lafferty, 2015), and convex nuclear norm minimization (NNM) (3) (Toh and Yun, 2010). We consider the setting with $p = 100$, $r = 3$, $n = 1500$, $\kappa \in \{1, 50, 500\}$, and $\sigma = 0$ (noiseless case) or $\sigma = 10^{-6}$ (noisy case). Following Zheng and Lafferty (2015), in the implementation of GD and SVP we evaluate three choices of step size, $\{5 \times 10^{-3}, 10^{-3}, 5 \times 10^{-4}\}$, and choose the best one. In phase retrieval, we compare Algorithm 2 with Wirtinger Flow (WF) (Candès et al., 2015) and Truncated Wirtinger Flow (TWF) (Chen and Candès, 2017) with $p = 1200$, $n = 6000$. We use the code of the accelerated proximal gradient method for NNM and of WF and TWF from the corresponding authors' websites and implement the other algorithms ourselves. The stopping criteria of all procedures are the same as those of RISRO described in the previous simulation settings.

We compare the performance of the various procedures on noiseless matrix trace regression in Figure 5. For all choices of $\kappa$, RISRO converges quadratically to $X^*$ within 7 iterations to high accuracy, while the other baseline algorithms converge much more slowly, at a linear rate. When $\kappa$ (the condition number of $X^*$) increases from 1 to 50 and 500, so that the problem becomes more ill-conditioned, RISRO, Alter Mini, and SVP perform robustly, while GD converges more slowly. In Theorem 3, we have shown that the quadratic convergence rate of RISRO is robust to the condition number (see Remark 9). As expected, the non-convex optimization methods converge much faster than the convex relaxation method. Moreover, to achieve a relative RMSE of $10^{-10}$, RISRO takes only about $1/5$ of the runtime of the other algorithms when $\kappa = 1$, and this factor is even smaller in the ill-conditioned cases $\kappa = 50$ and $500$.

The comparison of RISRO, WF, and TWF in phase retrieval is plotted in Figure 6. We can also see that RISRO can recover the underlying true signal with high accuracy in much less time than the other baseline methods.


[Figure 5 appears here: for each of the panels (a) $\kappa = 1$, (b) $\kappa = 50$, and (c) $\kappa = 500$, the relative RMSE is plotted versus iteration number and versus runtime in seconds for RISRO, Alter Mini, GD, SVP, and NNM.]

Figure 5: Relative RMSE of RISRO, singular value projection (SVP), Alternating Minimization (Alter Mini), gradient descent (GD), and Nuclear Norm Minimization (NNM) in low-rank matrix trace regression. Here, $p = 100$, $r = 3$, $n = 1500$, $\sigma = 0$, $\kappa \in \{1, 50, 500\}$.


[Figure 6 appears here: the relative RMSE is plotted versus iteration number and versus runtime in seconds for RISRO, TWF, and WF.]

Figure 6: Relative RMSE of RISRO, Wirtinger Flow (WF), and Truncated Wirtinger Flow (TWF) in phase retrieval. Here, $p = 1200$, $n = 6000$.

Next, we compare the performance of RISRO with the other algorithms in the noisy setting, $\sigma = 10^{-6}$, of the low-rank matrix trace regression. We can see from the results in Figure 7 that, due to the noise, the estimation error first decreases and then stabilizes once it reaches a certain level. Meanwhile, RISRO converges at a much faster quadratic-linear rate than all the other algorithms before reaching that stable level.

[Figure 7 appears here: the relative RMSE is plotted versus iteration number and versus runtime in seconds for RISRO, Alter Mini, GD, SVP, and NNM in the noisy setting.]

Figure 7: Relative RMSE of RISRO, singular value projection (SVP), Alternating Minimization (Alter Mini), gradient descent (GD), and Nuclear Norm Minimization (NNM) in low-rank matrix trace regression. Here, $p = 100$, $r = 3$, $n = 1500$, $\kappa = 5$, $\sigma = 10^{-6}$.

Finally, we study the sample size required to guarantee successful recovery by RISRO and the other algorithms. We set $p = 100$, $r = 3$, $\kappa = 5$, $n \in [600, 1500]$ in the noiseless matrix trace regression and $p = 1200$, $n \in [2400, 6000]$ in phase retrieval. We say an algorithm achieves successful recovery if the relative RMSE is less than $10^{-2}$ when the algorithm terminates. The simulation results in Figure 8 show that RISRO requires the smallest sample size to achieve successful recovery in both matrix trace regression and phase retrieval; Alter Mini performs similarly to RISRO; and both RISRO and Alter Mini require a smaller sample size than the remaining algorithms for successful recovery.
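For reference, the reported successful recovery rate is simply the fraction of replications whose terminal relative RMSE clears the $10^{-2}$ threshold; a minimal sketch of how such a rate could be computed (names are ours) is:

```python
import numpy as np

def recovery_rate(rel_rmse_at_termination, tol=1e-2):
    """Fraction of replications whose relative RMSE at termination is below the success threshold."""
    errs = np.asarray(rel_rmse_at_termination)
    return np.mean(errs < tol)

# e.g. recovery_rate([3e-15, 8e-3, 0.4, 1e-6]) -> 0.75
```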

    8 Conclusion and Discussion

In this paper, we propose a new algorithm, RISRO, for solving rank constrained least squares. RISRO is based on a novel algorithmic framework, recursive importance sketching, which also


[Figure 8 appears here: the successful recovery rate is plotted versus $n/(pr)$ for matrix trace regression in panel (a) ($p = 100$, $r = 3$, $\sigma = 0$, $\kappa = 5$; RISRO, Alter Mini, GD, SVP, NNM) and versus $n/p$ for phase retrieval in panel (b) ($p = 1200$; RISRO, TWF, WF).]

Figure 8: Successful recovery rate comparison.

provides new sketching interpretations for several existing algorithms for rank constrained least squares. RISRO is easy to implement and computationally efficient. Under some reasonable assumptions, local quadratic-linear and quadratic convergence are established for RISRO. Simulation studies demonstrate the superior performance of RISRO.

There are many interesting extensions of the results in this paper to be explored in the future. First, our current convergence theory for RISRO relies on the RIP assumption, which may not hold in many scenarios, such as phase retrieval and matrix completion. In this paper, we give some theoretical guarantees for RISRO in phase retrieval under a strong initialization assumption, as discussed in Remark 11. However, such an initialization requirement may be unnecessary, and spectral initialization appears good enough to guarantee quadratic convergence, as we observe in the simulation studies. Also, in matrix completion, the analysis becomes more delicate, as an additional incoherence condition on $X^*$ is needed to guarantee recovery. To improve and establish theoretical guarantees for RISRO in phase retrieval and matrix completion, we believe some extra properties, such as "implicit regularization" (Ma et al., 2019), need to be incorporated into the analysis of RISRO, which is an interesting direction for future work. Also, this paper focuses on the squared error loss in (1), while other loss functions may be of interest in different settings, such as the $\ell_1$ loss in robust low-rank matrix recovery (Charisopoulos et al., 2019; Li et al., 2020a,b), which is worth exploring.

    References

Absil, P.-A., Mahony, R., and Sepulchre, R. (2009). Optimization algorithms on matrix manifolds. Princeton University Press.

Absil, P.-A. and Malick, J. (2012). Projection-like retractions on matrix manifolds. SIAM Journal on Optimization, 22(1):135–158.

Ahmed, A., Recht, B., and Romberg, J. (2013). Blind deconvolution using convex programming. IEEE Transactions on Information Theory, 60(3):1711–1732.

Bauch, J. and Nadler, B. (2020). Rank 2r iterative least squares: efficient recovery of ill-conditioned low rank matrices from few entries. To appear, SIAM Journal on Mathematics of Data Science.

Bhojanapalli, S., Neyshabur, B., and Srebro, N. (2016). Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881.


Boumal, N. and Absil, P.-A. (2011). RTRMC: A Riemannian trust-region method for low-rank matrix completion. In Advances in Neural Information Processing Systems, pages 406–414.

Breiding, P. and Vannieuwenhoven, N. (2018). Convergence analysis of Riemannian Gauss–Newton methods and its connection with the geometric condition number. Applied Mathematics Letters, 78:42–50.

Burer, S. and Monteiro, R. D. (2003). A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357.

Burke, J. V. (1985). Descent methods for composite nondifferentiable optimization problems. Mathematical Programming, 33(3):260–279.

Cai, T. T. and Zhang, A. (2013). Sharp RIP bound for sparse signal and low-rank matrix recovery. Applied and Computational Harmonic Analysis, 35(1):74–93.

Cai, T. T. and Zhang, A. (2014). Sparse representation of a polytope and recovery of sparse signals and low-rank matrices. IEEE Transactions on Information Theory, 60(1):122–132.

Cai, T. T. and Zhang, A. (2015). ROP: Matrix recovery via rank-one projections. The Annals of Statistics, 43(1):102–138.

Cai, T. T. and Zhang, A. (2018). Rate-optimal perturbation bounds for singular subspaces with applications to high-dimensional statistics. The Annals of Statistics, 46(1):60–89.

Cai, T. T. and Zhou, W.-X. (2013). A max-norm constrained minimization approach to 1-bit matrix completion. The Journal of Machine Learning Research, 14(1):3619–3647.

Candès, E. J. (2008). The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9-10):589–592.

Candès, E. J., Li, X., and Soltanolkotabi, M. (2015). Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007.

Candès, E. J. and Plan, Y. (2011). Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4):2342–2359.

Candès, E. J., Strohmer, T., and Voroninski, V. (2013). PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241–1274.

Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080.

Charisopoulos, V., Chen, Y., Davis, D., Díaz, M., Ding, L., and Drusvyatskiy, D. (2019). Low-rank matrix recovery with composite optimization: good conditioning and rapid convergence. arXiv preprint arXiv:1904.10020.

Chen, Y. and Candès, E. J. (2017). Solving random quadratic systems of equations is nearly as easy as solving linear systems. Communications on Pure and Applied Mathematics, 70(5):822–883.


Chen, Y., Chi, Y., and Goldsmith, A. J. (2015). Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Transactions on Information Theory, 61(7):4034–4059.

Chen, Y. and Wainwright, M. J. (2015). Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. arXiv preprint arXiv:1509.03025.

Chi, Y., Lu, Y. M., and Chen, Y. (2019). Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 67(20):5239–5269.

Clarkson, K. L. and Woodruff, D. P. (2017). Low-rank approximation and regression in input sparsity time. Journal of the ACM (JACM), 63(6):54.

Davenport, M. A. and Romberg, J. (2016). An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4):608–622.

Dobriban, E. and Liu, S. (2019). Asymptotics for sketching in least squares regression. In Advances in Neural Information Processing Systems, pages 3675–3685.

Drineas, P., Magdon-Ismail, M., Mahoney, M. W., and Woodruff, D. P. (2012). Fast approximation of matrix coherence and statistical leverage. Journal of Machine Learning Research, 13(Dec):3475–3506.

Duchi, J. C. and Ruan, F. (2019). Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval. Information and Inference: A Journal of the IMA, 8(3):471–529.

Fienup, J. R. (1982). Phase retrieval algorithms: a comparison. Applied Optics, 21(15):2758–2769.

Fornasier, M., Rauhut, H., and Ward, R. (2011). Low-rank matrix recovery via iteratively reweighted least squares minimization. SIAM Journal on Optimization, 21(4):1614–1640.

Gao, B. and Xu, Z. (2017). Phaseless recovery using the Gauss–Newton method. IEEE Transactions on Signal Processing, 65(22):5885–5896.

Ge, R., Jin, C., and Zheng, Y. (2017). No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1233–1242. JMLR.org.

Goldfarb, D. and Ma, S. (2011). Convergence of fixed-point continuation algorithms for matrix rank minimization. Foundations of Computational Mathematics, 11(2):183–210.

Ha, W., Liu, H., and Barber, R. F. (2020). An equivalence between critical points for rank constraints versus low-rank factorizations. SIAM Journal on Optimization, 30(4):2927–2955.

Hardt, M. (2014). Understanding alternating minimization for matrix completion. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 651–660. IEEE.

Huang, W., Absil, P.-A., and Gallivan, K. A. (2017a). Intrinsic representation of tangent vectors and vector transports on matrix manifolds. Numerische Mathematik, 136(2):523–543.

Huang, W., Gallivan, K. A., and Zhang, X. (2017b). Solving PhaseLift by low-rank Riemannian optimization methods for complex semidefinite constraints. SIAM Journal on Scientific Computing, 39(5):B840–B859.


Huang, W. and Hand, P. (2018). Blind deconvolution by a steepest descent algorithm on a quotient manifold. SIAM Journal on Imaging Sciences, 11(4):2757–2785.

Jain, P., Meka, R., and Dhillon, I. S. (2010). Guaranteed rank minimization via singular value projection. In Advances in Neural Information Processing Systems, pages 937–945.

Jain, P., Netrapalli, P., and Sanghavi, S. (2013). Low-rank matrix completion using alternating minimization. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing, pages 665–674. ACM.

Jiang, K., Sun, D., and Toh, K.-C. (2014). A partial proximal point algorithm for nuclear norm regularized matrix least squares problems. Mathematical Programming Computation, 6(3):281–325.

Keshavan, R. H., Oh, S., and Montanari, A. (2009). Matrix completion from a few entries. In 2009 IEEE International Symposium on Information Theory, pages 324–328. IEEE.

Koltchinskii, V., Lounici, K., Tsybakov, A. B., et al. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329.

Kümmerle, C. and Sigl, J. (2018). Harmonic mean iteratively reweighted least squares for low-rank matrix recovery. The Journal of Machine Learning Research, 19(1):1815–1863.

Lange, K. (2010). Numerical analysis for statisticians. Springer Science & Business Media.

Lee, J. D., Recht, B., Srebro, N., Tropp, J., and Salakhutdinov, R. R. (2010). Practical large-scale optimization for max-norm regularization. In Advances in Neural Information Processing Systems, pages 1297–1305.

Lee, J. M. (2013). Smooth manifolds. In Introduction to Smooth Manifolds, pages 1–31. Springer.

Lewis, A. S. and Wright, S. J. (2016). A proximal method for composite minimization. Mathematical Programming, 158(1-2):501–546.

Li, Q., Zhu, Z., and Tang, G. (2019a). The non-convex geometry of low-rank matrix optimization. Information and Inference: A Journal of the IMA, 8(1):51–96.

Li, X., Ling, S., Strohmer, T., and Wei, K. (2019b). Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Applied and Computational Harmonic Analysis, 47(3):893–934.

Li, X., Zhu, Z., Man-Cho So, A., and Vidal, R. (2020a). Nonconvex robust low-rank matrix recovery. SIAM Journal on Optimization, 30(1):660–686.

Li, Y., Chi, Y., Zhang, H., and Liang, Y. (2020b). Non-convex low-rank matrix recovery with arbitrary outliers via median-truncated gradient descent. Information and Inference: A Journal of the IMA, 9(2):289–325.

Luo, Y. and Zhang, A. R. (2020). A Schatten-q matrix perturbation theory via perturbation projection error bound. arXiv preprint arXiv:2008.01312.

Ma, C., Wang, K., Chi, Y., and Chen, Y. (2019). Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution. Foundations of Computational Mathematics, pages 1–182.


Mahoney, M. W. (2011). Randomized algorithms for matrices and data. Foundations and Trends® in Machine Learning, 3(2):123–224.

Meyer, G., Bonnabel, S., and Sepulchre, R. (2011). Linear regression under fixed-rank constraints: a Riemannian approach. In Proceedings of the 28th International Conference on Machine Learning.

Miao, W., Pan, S., and Sun, D. (2016). A rank-corrected procedure for matrix completion with fixed basis coefficients. Mathematical Programming, 159(1):289–338.

Mishra, B., Meyer, G., Bonnabel, S., and Sepulchre, R. (2014). Fixed-rank matrix factorizations and Riemannian low-rank optimization. Computational Statistics, 29(3-4):591–621.

Mohan, K. and Fazel, M. (2012). Iterative reweighted algorithms for matrix rank minimization. The Journal of Machine Learning Research, 13(1):3441–3473.

Netrapalli, P., Jain, P., and Sanghavi, S. (2013). Phase retrieval using alternating minimization. In Advances in Neural Information Processing Systems, pages 2796–2804.

Nocedal, J. and Wright, S. (2006). Numerical optimization. Springer Science & Business Media.

Park, D., Kyrillidis, A., Caramanis, C., and Sanghavi, S. (2018). Finding low-rank solutions via nonconvex matrix factorization, efficiently and provably. SIAM Journal on Imaging Sciences, 11(4):2165–2204.

Pilanci, M. and Wainwright, M. J. (2016). Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares. The Journal of Machine Learning Research, 17(1):1842–1879.

Pilanci, M. and Wainwright, M. J. (2017). Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245.

Raskutti, G. and Mahoney, M. W. (2016). A statistical perspective on randomized sketching for ordinary least-squares. The Journal of Machine Learning Research, 17(1):7508–7538.

Recht, B., Fazel, M., and Parrilo, P. A. (2010). Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501.

Sanghavi, S., Ward, R., and White, C. D. (2017). The local convexity of solving systems of quadratic equations. Results in Mathematics, 71(3-4):569–608.

Shechtman, Y., Eldar, Y. C., Cohen, O., Chapman, H. N., Miao, J., and Segev, M. (2015). Phase retrieval with application to optical imaging: a contemporary overview. IEEE Signal Processing Magazine, 32(3):87–109.

Song, Z., Woodruff, D. P., and Zhong, P. (2017). Low rank approximation with entrywise l1-norm error. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 688–701. ACM.

Sun, J., Qu, Q., and Wright, J. (2018). A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131–1198.

Sun, R. and Luo, Z.-Q. (2015). Guaranteed matrix completion via nonconvex factorization. In Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on, pages 270–289. IEEE.


Tanner, J. and Wei, K. (2013). Normalized iterative hard thresholding for matrix completion. SIAM Journal on Scientific Computing, 35(5):S104–S125.

Toh, K.-C. and Yun, S. (2010). An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems. Pacific Journal of Optimization, 6(615-640):15.

Tong, T., Ma, C., and Chi, Y. (2020). Low-rank matrix recovery with scaled subgradient methods: Fast and robust convergence without the condition number. arXiv preprint arXiv:2010.13364.

Tran-Dinh, Q. and Zhang, Z. (2016). Extended Gauss-Newton and Gauss-Newton-ADMM algorithms for low-rank matrix optimization. arXiv preprint arXiv:1606.03358.

Tu, S., Boczar, R., Simchowitz, M., Soltanolkotabi, M., and Recht, B. (2016). Low-rank solutions of linear matrix equations via Procrustes flow. In International Conference on Machine Learning, pages 964–973.

Uschmajew, A. and Vandereycken, B. (2018). On critical points of quadratic low-rank matrix optimization problems. IMA Journal of Numerical Analysis.

Vandereycken, B. (2013). Low-rank matrix completion by Riemannian optimization. SIAM Journal on Optimization, 23(2):1214–1236.

Vandereycken, B. and Vandewalle, S. (2010). A Riemannian optimization approach for computing low-rank solutions of Lyapunov equations. SIAM Journal on Matrix Analysis and Applications, 31(5):2553–2579.

Vershynin, R. (2010). Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027.

Waldspurger, I., d'Aspremont, A., and Mallat, S. (2015). Phase recovery, maxcut and complex semidefinite programming. Mathematical Programming, 149(1-2):47–81.

Wang, G., Giannakis, G. B., and Eldar, Y. C. (2017a). Solving systems of random quadratic equations via truncated amplitude flow. IEEE Transactions on Information Theory, 64(2):773–794.

Wang, J., Lee, J. D., Mahdavi, M., Kolar, M., Srebro, N., et al. (2017b). Sketching meets random projection in the dual: A provable recovery algorithm for big and high-dimensional data. Electronic Journal of Statistics, 11(2):4896–4944.

Wang, L., Zhang, X., and Gu, Q. (2017c). A unified computational and statistical framework for nonconvex low-rank matrix estimation. In Artificial Intelligence and Statistics, pages 981–990.

Wei, K., Cai, J.-F., Chan, T. F., and Leung, S. (2016). Guarantees of Riemannian optimization for low rank matrix recovery. SIAM Journal on Matrix Analysis and Applications, 37(3):1198–1222.

Wen, Z., Yin, W., and Zhang, Y. (2012). Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. Mathematical Programming Computation, 4(4):333–361.

Woodruff, D. P. (2014). Sketching as a tool for numerical linear algebra. Foundations and Trends® in Theoretical Computer Science, 10(1–2):1–157.


Zhang, A. R., Luo, Y., Raskutti, G., and Yuan, M. (2020). ISLET: Fast and optimal low-rank tensor regression via