A First Order Free Lunch for SQRT-Lasso∗
Xingguo Li, Jarvis Haupt, Raman Arora, Han Liu, Mingyi Hong, and Tuo Zhao
Abstract
Many statistical machine learning techniques sacrifice convenient computational structures
to gain estimation robustness and modeling flexibility. In this paper, we study this fundamental
tradeoff through a SQRT-Lasso problem for sparse linear regression and sparse precision matrix
estimation in high dimensions. We explain how novel optimization techniques help address these
computational challenges. Particularly, we propose a pathwise iterative smoothing shrinkage
thresholding algorithm for solving the SQRT-Lasso optimization problem. We further provide a
novel model-based perspective for analyzing the smoothing optimization framework, which allows
us to establish a nearly linear convergence (R-linear convergence) guarantee for our proposed
algorithm. This implies that solving the SQRT-Lasso optimization is almost as easy as solving
the Lasso optimization. Moreover, we show that our proposed algorithm can also be applied
to sparse precision matrix estimation, and enjoys good computational properties. Numerical
experiments are provided to support our theory.
1 Introduction
Given a design matrix $X \in \mathbb{R}^{n \times d}$ and a response vector $y \in \mathbb{R}^n$, we consider a linear model
$y = X\theta^* + \epsilon$, where $\theta^* \in \mathbb{R}^d$ is an unknown coefficient vector, and $\epsilon \in \mathbb{R}^n$ is a random noise vector
with i.i.d. sub-Gaussian entries, $\mathbb{E}[\epsilon_i] = 0$ and $\mathbb{E}[\epsilon_i^2] = \sigma^2$ for all $i = 1, \ldots, n$. We are interested in
estimating $\theta^*$ in high dimensions where $n/d \to 0$. A popular assumption in high dimensions is that
only a small subset of variables are relevant in modeling, i.e., many entries of θ∗ are zero. To get
such a sparse estimator, Tibshirani (1996) proposed Lasso, which solves
$$\overline{\theta} = \operatorname*{argmin}_{\theta} \frac{1}{n}\|y - X\theta\|_2^2 + \lambda_{\mathrm{Lasso}}\|\theta\|_1, \qquad (1.1)$$
where λLasso is the regularization parameter and encourages the solution sparsity. The statistical
properties of Lasso have been established in Zhang and Huang (2008); Zhang (2009); Bickel et al.
∗Xingguo Li and Jarvis Haupt are affiliated with the Department of Electrical and Computer Engineering at University
of Minnesota, Minneapolis, MN, 55455, USA; Raman Arora and Tuo Zhao are affiliated with the Department of Computer
Science at Johns Hopkins University, Baltimore, MD, 21210, USA; Han Liu is affiliated with the Department of Operations
Research and Financial Engineering at Princeton University, Princeton, NJ 08544, USA; Mingyi Hong is affiliated with the
Department of Industrial and Manufacturing Systems Engineering at Iowa State University.
arXiv:1605.07950v1 [cs.LG] 25 May 2016
(2009); Negahban et al. (2012). In particular, given $\lambda_{\mathrm{Lasso}} \asymp \sigma\sqrt{\log d/n}$, the Lasso estimator in (1.1)
attains the minimax optimal rates of convergence in parameter estimation$^1$,
$$\|\overline{\theta} - \theta^*\|_2 = O_P\left(\sigma\sqrt{s^* \log d/n}\right), \qquad (1.2)$$
where $s^*$ denotes the number of nonzero entries in $\theta^*$ (Ye and Zhang, 2010; Raskutti et al., 2011).
Despite these favorable properties, the Lasso approach has a significant drawback: the selected
regularization parameter $\lambda_{\mathrm{Lasso}}$ scales linearly with the unknown quantity $\sigma$. Therefore,
we need to carefully tune λLasso over a wide range of potential values in order to get a good
finite-sample performance. To overcome this drawback, Belloni et al. (2011) proposed SQRT-Lasso,
which solves
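$$\overline{\theta} = \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \frac{1}{\sqrt{n}}\|y - X\theta\|_2 + \lambda_{\mathrm{SQRT}}\|\theta\|_1. \qquad (1.3)$$
They further show that SQRT-Lasso requires no prior knowledge of $\sigma$. Choosing $\lambda_{\mathrm{SQRT}} \asymp \sqrt{\log d/n}$,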
the SQRT-Lasso estimator in (1.3) attains the same optimal statistical rate of convergence in
parameter estimation as (1.2). This means that the regularization selection for SQRT-Lasso does
not scale with σ. We can easily specify a smaller range of potential values for tuning λSQRT than
Lasso.
Besides estimating θ∗, SQRT-Lasso can also estimate σ, which further makes it applicable to
sparse precision matrix estimation; this is not the case with Lasso. Specifically, given n observations
i.i.d. sampled from a d-variate normal distribution with mean 0 and a sparse precision matrix Θ∗,
Liu and Wang (2012) proposed an estimator based on SQRT-Lasso (See more details in §4), and
showed that it attains the minimax optimal statistical rate of convergence in parameter estimation
$$\|\hat{\Theta} - \Theta^*\|_2 = O_P\left(\|\Theta^*\|_2 \cdot s^*\sqrt{\log d/n}\right),$$
where $\|\Theta^*\|_2$ denotes the spectral norm of $\Theta^*$ (i.e., the largest singular value of $\Theta^*$), and $s^*$ denotes
the maximum number of nonzero entries in each column of $\Theta^*$ (i.e., $\max_{\ell} \sum_j 1(\Theta^*_{j\ell} \neq 0) \le s^*$).
Though SQRT-Lasso simplifies the tuning effort and achieves the optimal statistical properties
for both sparse linear regression and sparse precision matrix estimation in high dimensions, the
optimization problem in (1.3) is computationally more challenging than (1.1) for Lasso, because the
$\ell_2$ loss in SQRT-Lasso does not have the same nice computational structures as the least square loss
in Lasso. For example, the $\ell_2$ loss can be nondifferentiable, and does not have a Lipschitz continuous
gradient. Belloni et al. (2011) converted (1.3) to a second order cone optimization problem, and
further solved it by an interior point method; Li et al. (2015) then solved (1.3) by an ADMM
algorithm. Neither of them, however, can scale to large problems. In contrast, Xiao and Zhang
(2013) proposed an efficient pathwise iterative shrinkage thresholding algorithm (PISTA) for solving
(1.1), which attains a linear convergence to the unique sparse global optimum with high probability.
To address this computational challenge, we propose a pathwise iterative smoothing shrinkage
thresholding algorithm (PIS2TA) to solve (1.3). Specifically, we first apply the conjugate dual
$^1$ The notation $O_P(\cdot)$ is defined in the Notations paragraph at the end of Section 1.
smoothing approach to the nonsmooth $\ell_2$ loss (Nesterov, 2005; Beck and Teboulle, 2012), and obtain
a smooth surrogate denoted by $\|y - X\theta\|_\mu$, where $\mu > 0$ is a smoothing parameter (see more details
in §2). We then apply PISTA to solve the partially smoothed optimization problem as follows:
$$\overline{\theta} = \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \frac{1}{\sqrt{n}}\|y - X\theta\|_\mu + \lambda_{\mathrm{SQRT}}\|\theta\|_1. \qquad (1.4)$$
Existing computational theory guarantees that our proposed PIS2TA algorithm attains a sublinear
convergence to the global optimum in terms of the objective value (Nesterov, 2005). However, our
numerical experiments show that PIS2TA achieves far better empirical computational performance
(better than sublinear convergence) for solving SQRT-Lasso, and is significantly more efficient than
other competing algorithms, and nearly as efficient as Xiao and Zhang (2013) for solving Lasso.
This is because the existing computational analyses of the conjugate dual smoothing approach
do not take certain specific modeling structures into consideration. For example: (I) The $\ell_2$ loss
is only nonsmooth when all residuals are equal to zero (significantly overfitted). But this is very
unlikely to happen because we are solving (1.3) with a sufficiently large regularization; (II) Although
the smoothed $\ell_2$ loss is not strongly convex, if we restrict the solution to a sparse domain, the
smoothed $\ell_2$ loss can behave like strongly convex functions over a neighborhood of $\theta^*$.
Motivated by these observations, we establish a new computational theory for PIS2TA, which
exploits the above modeling structures. Particularly, we show that PIS2TA achieves a nearly linear
convergence (R-linear convergence) to the unique sparse global optimum for solving (1.3) with
high probability, and also gives us a well fitted model. There are two implications: (I) We can
solve the SQRT-Lasso optimization as nearly efficiently as solving the Lasso optimization; (II) We
pay almost no price in optimization accuracy when using the smoothing approach for solving the
SQRT-Lasso optimization, because (1.4) and (1.3) share the same unique sparse global optimum
with high probability.
As an extension of our theory for the SQRT-Lasso optimization, we further analyze the com-
putational properties of our proposed PIS2TA algorithm for sparse precision matrix estimation in
high dimensions. We show that PIS2TA also achieves an R-linear convergence to the unique sparse
global optimum with high probability. We provide numerical experiments on simulated and real
data to support our theory. All proofs of our analysis are deferred to the supplementary material.
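Notations: Given a vector $v = (v_1, \ldots, v_d)^\top \in \mathbb{R}^d$, we define vector norms: $\|v\|_1 = \sum_j |v_j|$,
$\|v\|_2^2 = \sum_j v_j^2$, and $\|v\|_\infty = \max_j |v_j|$. We denote the number of nonzero entries in $v$ as
$\|v\|_0 = \sum_j 1(v_j \neq 0)$. We denote $v_{\setminus j} = (v_1, \ldots, v_{j-1}, v_{j+1}, \ldots, v_d)^\top \in \mathbb{R}^{d-1}$ as the subvector of $v$ with the
$j$-th entry removed. Let $\mathcal{A} \subseteq \{1, \ldots, d\}$ be an index set. We use $\overline{\mathcal{A}}$ to denote the complementary
set to $\mathcal{A}$, i.e., $\overline{\mathcal{A}} = \{j \mid j \in \{1, \ldots, d\}, j \notin \mathcal{A}\}$. We use $v_{\mathcal{A}}$ to denote a subvector of $v$ by extracting
all entries of $v$ with indices in $\mathcal{A}$. Given a matrix $A \in \mathbb{R}^{d \times d}$, we use $A_{*j} = (A_{1j}, \ldots, A_{dj})^\top$ to
denote the $j$-th column of $A$, and $A_{k*} = (A_{k1}, \ldots, A_{kd})^\top$ to denote the $k$-th row of $A$. Let $\Lambda_{\max}(A)$
and $\Lambda_{\min}(A)$ be the largest and smallest eigenvalues of $A$. We define $\|A\|_F^2 = \sum_j \|A_{*j}\|_2^2$ and
$\|A\|_2 = \sqrt{\Lambda_{\max}(A^\top A)}$. We denote $A_{\setminus i \setminus j}$ as the submatrix of $A$ with the $i$-th row and the $j$-th
column removed. We denote $A_{i \setminus j}$ as the $i$-th row of $A$ with its $j$-th entry removed. Let $\mathcal{A} \subseteq \{1, \ldots, d\}$
be an index set. We use $A_{\mathcal{A}\mathcal{A}}$ to denote a submatrix of $A$ by extracting all entries of $A$ with
both row and column indices in $\mathcal{A}$. We denote $A \succ 0$ if $A$ is a positive-definite matrix. Given
two real sequences $\{A_n\}$, $\{a_n\}$, $A_n = O(a_n)$ (or $A_n = \Omega(a_n)$) if and only if $\exists M \in \mathbb{R}^+$ and $N \in \mathbb{N}$
such that $|A_n| \le M|a_n|$ (or $|A_n| \ge M|a_n|$) for all $n \ge N$. $A_n \asymp a_n$ if $A_n = O(a_n)$ and $A_n = \Omega(a_n)$
simultaneously. $A_n = O_P(a_n)$ if $\forall \delta \in (0, 1)$, $\exists M \in \mathbb{R}^+$ and $N_\delta \in \mathbb{N}$ such that $P[|A_n| > M|a_n|] < \delta$
for all $n \ge N_\delta$. $A_n = o(a_n)$ if $\forall \delta > 0$, $\exists N_\delta \in \mathbb{N}$ such that $|A_n| \le \delta|a_n|$ for all $n \ge N_\delta$, i.e.,
$\lim_{n \to \infty} A_n/a_n = 0$. Given a vector $x \in \mathbb{R}^d$ and a real value $\lambda > 0$, we denote the soft thresholding
operator $S_\lambda(x) = [\mathrm{sign}(x_j) \max\{|x_j| - \lambda, 0\}]_{j=1}^d$.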
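The soft thresholding operator defined above can be illustrated with a minimal Python sketch. The function name `soft_threshold` and the use of plain lists are choices made here for illustration only; they are not part of the paper.

```python
def soft_threshold(x, lam):
    """Entrywise S_lam(x): sign(x_j) * max(|x_j| - lam, 0)."""
    out = []
    for xj in x:
        if xj > lam:
            out.append(xj - lam)      # shrink positive entries toward zero
        elif xj < -lam:
            out.append(xj + lam)      # shrink negative entries toward zero
        else:
            out.append(0.0)           # entries within [-lam, lam] are zeroed
    return out


print(soft_threshold([3.0, -0.5, 1.25, -2.0], 1.0))  # [2.0, 0.0, 0.25, -1.0]
```

Entries whose magnitude does not exceed the threshold are set exactly to zero, which is what produces sparse iterates in the algorithm of Section 2.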
2 Algorithm
Our proposed algorithm consists of three components: (I) Conjugate Dual Smoothing, (II) Iterative
Shrinkage Thresholding Algorithm (ISTA), and (III) Pathwise Optimization.
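(I) The Conjugate Dual Smoothing approach is adopted to obtain a smooth surrogate of the $\ell_2$
loss (Nesterov, 2005; Beck and Teboulle, 2012). We denote the smoothed $\ell_2$ loss function as
$$\|y - X\theta\|_\mu = \max_{\|z\|_2 \le 1} z^\top(y - X\theta) - \frac{\mu}{2}\|z\|_2^2. \qquad (2.1)$$
The optimization problem in (2.1) admits a closed form solution:
$$L_\mu(\theta) = \frac{1}{\sqrt{n}}\|y - X\theta\|_\mu = \begin{cases} \frac{1}{2\mu\sqrt{n}}\|y - X\theta\|_2^2, & \text{if } \|y - X\theta\|_2 < \mu, \\ \frac{1}{\sqrt{n}}\left(\|y - X\theta\|_2 - \frac{\mu}{2}\right), & \text{otherwise.} \end{cases}$$
We present several two-dimensional examples of the smoothed $\ell_2$ norm using different $\mu$'s in Figure
1. A larger $\mu$ introduces a larger approximation error, but makes the approximation smoother. We
then consider the following partially smoothed optimization problem,
$$\overline{\theta} = \operatorname*{argmin}_{\theta \in \mathbb{R}^d} F_{\mu,\lambda}(\theta), \quad \text{where } F_{\mu,\lambda}(\theta) = L_\mu(\theta) + \lambda\|\theta\|_1. \qquad (2.2)$$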
(a) µ = 0 (b) µ = 0.1 (c) µ = 0.5 (d) µ = 1
Figure 1: Examples of ‖x‖2 (µ = 0) and ‖x‖µ with µ = 0.1, 0.5, and 1 respectively for x ∈ R2.
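As a small numerical illustration of the closed form of (2.1), the sketch below evaluates the smoothed norm and its gradient for a residual vector. The function names are hypothetical, and the gradient formula r / max(||r||_2, mu) is obtained here by differentiating the two branches of the closed form; it is not quoted from the paper.

```python
import math


def smoothed_l2(r, mu):
    """Return ||r||_mu, the conjugate dual smoothing of ||r||_2 from (2.1)."""
    norm = math.sqrt(sum(ri * ri for ri in r))
    if norm < mu:
        return norm * norm / (2.0 * mu)   # quadratic branch near the origin
    return norm - mu / 2.0                # shifted l2 norm elsewhere


def smoothed_l2_grad(r, mu):
    """Gradient of ||r||_mu with respect to r (requires mu > 0)."""
    norm = math.sqrt(sum(ri * ri for ri in r))
    scale = max(norm, mu)                 # r/mu inside the ball, r/||r||_2 outside
    return [ri / scale for ri in r]


r = [3.0, 4.0]                            # ||r||_2 = 5
print(smoothed_l2(r, 1.0))                # 4.5  (outside the smoothed region)
print(smoothed_l2(r, 10.0))               # 1.25 (inside the smoothed region)
```

The two branches agree at ||r||_2 = mu (both give mu/2), so the surrogate is continuous, and the gap to the true norm never exceeds mu/2.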
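(II) The ISTA Algorithm is applied to solve (2.2) (Nesterov, 2013). Particularly, given $\theta^{(t)}$ at the
$t$-th iteration, we consider the quadratic approximation of $F_{\mu,\lambda}(\theta)$ at $\theta = \theta^{(t)}$,
$$Q_{\mu,\lambda}(\theta, \theta^{(t)}) = L_\mu(\theta^{(t)}) + \nabla L_\mu(\theta^{(t)})^\top(\theta - \theta^{(t)}) + \frac{L^{(t)}}{2}\|\theta - \theta^{(t)}\|_2^2 + \lambda\|\theta\|_1, \qquad (2.3)$$
where $L^{(t)}$ is a step size parameter determined by the backtracking line search. We then take
$$\theta^{(t+1)} = \operatorname*{argmin}_{\theta} Q_{\mu,\lambda}(\theta, \theta^{(t)}) = S_{\lambda/L^{(t)}}\left(\theta^{(t)} - \nabla L_\mu(\theta^{(t)})/L^{(t)}\right).$$
For simplicity, we denote $\theta^{(t+1)} = T_{L^{(t+1)},\lambda}(\theta^{(t)})$. Given a pre-specified precision $\varepsilon$, we terminate
the iterations when the approximate KKT condition holds:
$$\omega_\lambda(\theta^{(t)}) = \min_{g \in \partial\|\theta^{(t)}\|_1} \|\nabla L_\mu(\theta^{(t)}) + \lambda g\|_\infty \le \varepsilon.$$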
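The update and the stopping criterion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the design matrix is a list of rows, and the gradient expression is derived here from the closed form of the smoothed loss rather than taken from the paper.

```python
import math


def grad_smoothed_loss(X, y, theta, mu):
    """Gradient of L_mu(theta) = ||y - X theta||_mu / sqrt(n)."""
    n, d = len(X), len(theta)
    r = [y[i] - sum(X[i][j] * theta[j] for j in range(d)) for i in range(n)]
    scale = max(math.sqrt(sum(ri * ri for ri in r)), mu) * math.sqrt(n)
    return [-sum(X[i][j] * r[i] for i in range(n)) / scale for j in range(d)]


def ista_step(theta, grad, L, lam):
    """One update: S_{lam/L}(theta - grad / L)."""
    t = lam / L
    out = []
    for tj, gj in zip(theta, grad):
        u = tj - gj / L
        out.append(u - t if u > t else u + t if u < -t else 0.0)
    return out


def kkt_gap(theta, grad, lam):
    """omega_lam(theta): smallest sup-norm of grad + lam * g over subgradients g."""
    gap = 0.0
    for tj, gj in zip(theta, grad):
        if tj > 0:
            v = abs(gj + lam)              # g_j = +1 is forced
        elif tj < 0:
            v = abs(gj - lam)              # g_j = -1 is forced
        else:
            v = max(abs(gj) - lam, 0.0)    # best g_j in [-1, 1]
        gap = max(gap, v)
    return gap


X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 4.0]
g = grad_smoothed_loss(X, y, [0.0, 0.0], mu=0.1)
print(kkt_gap([0.0, 0.0], g, lam=1.0))   # 0.0, so theta = 0 is optimal for lam = 1
print(ista_step([0.0, 0.0], g, L=1.0, lam=0.5))
```

In this toy case the zero vector already satisfies the KKT condition for lam = 1, while for lam = 0.5 only the second coordinate becomes active after one step.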
(III) The Pathwise Optimization is essentially a multistage optimization scheme for boosting
computational performance. We solve (2.2) using a geometrically decreasing sequence of regularization
parameters $\lambda_1 > \ldots > \lambda_N$, where $\lambda_N = \lambda_{\mathrm{SQRT}}$. This yields a sequence of output solutions
$\hat{\theta}_{[1]}, \ldots, \hat{\theta}_{[N]}$ from sparse to dense.
Particularly, at the $K$-th optimization stage, we choose $\hat{\theta}_{[K-1]}$ (the output solution of the
$(K-1)$-th stage) as the initial solution, i.e., $\theta^{(0)}_{[K]} = \hat{\theta}_{[K-1]}$, and solve (2.2) with $\lambda = \lambda_K$ using the
ISTA algorithm. This is also referred to as the warm start initialization in the existing literature. We
summarize our approach in Algorithm 1.
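The geometric regularization sequence and the warm start can be sketched as follows. The names `regularization_path`, `solve_path`, and the callback `solve_stage` are hypothetical placeholders introduced for this illustration; any single-stage solver for (2.2) could be plugged in.

```python
def regularization_path(lam0, lam_N, N):
    """Return [lam_1, ..., lam_N] with lam_K = eta * lam_{K-1}."""
    eta = (lam_N / lam0) ** (1.0 / N)   # decay ratio, below 1 when lam_N < lam0
    path, lam = [], lam0
    for _ in range(N):
        lam *= eta
        path.append(lam)
    return path


def solve_path(lam0, lam_N, N, solve_stage, theta0):
    """Run N stages, warm-starting each stage at the previous output."""
    theta, outputs = theta0, []
    for lam in regularization_path(lam0, lam_N, N):
        theta = solve_stage(theta, lam)   # hypothetical single-stage solver
        outputs.append(theta)
    return outputs


print(regularization_path(1.0, 0.25, 2))  # approximately [0.5, 0.25]
```

Because each stage starts from the previous solution, the solver only has to track a small change in the regularization parameter at every stage.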
3 Computational and Statistical Analysis
We first define the locally restricted strong convexity and smoothness.
Definition 3.1. Given a constant $r \in \mathbb{R}^+$, let $B_r = \{\theta \in \mathbb{R}^d : \|\theta - \theta^*\|_2^2 \le r\}$. For any $v, w \in B_r$
satisfying $\|v - w\|_0 \le s$, $L_\mu$ is locally restricted strongly convex (LRSC) and smooth (LRSS) on
$B_r$ at sparsity level $s$ if there exist universal constants $\rho_s^-, \rho_s^+ \in (0, \infty)$ such that
$$\frac{\rho_s^-}{2}\|v - w\|_2^2 \le L_\mu(v) - L_\mu(w) - \nabla L_\mu(w)^\top(v - w) \le \frac{\rho_s^+}{2}\|v - w\|_2^2. \qquad (3.1)$$
We define the locally restricted condition number at sparsity level $s$ as $\kappa_s = \rho_s^+/\rho_s^-$.
The LRSC and LRSS properties are locally constrained variants of restricted strong convexity
and smoothness (Agarwal et al., 2010; Xiao and Zhang, 2013) with respect to a neighborhood of the
true model parameter θ∗, which are keys to establishing the strong convergence guarantees of our
proposed algorithm in high dimensions.
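To see why a local notion is natural here, the following side calculation (worked out for illustration from the closed form of $L_\mu$, not taken from the paper's proofs) gives the Hessian of the smoothed loss outside the smoothed region, with $r = y - X\theta$:

```latex
% Assumes \|r\|_2 > \mu, so that L_\mu(\theta) = (\|r\|_2 - \mu/2)/\sqrt{n}.
\nabla L_\mu(\theta) = -\frac{X^\top r}{\sqrt{n}\,\|r\|_2},
\qquad
\nabla^2 L_\mu(\theta) = \frac{1}{\sqrt{n}\,\|r\|_2}\,
  X^\top\!\left(I - \frac{r r^\top}{\|r\|_2^2}\right) X .
```

The curvature therefore scales like $1/\|r\|_2$, which depends on where $\theta$ lies. Near $\theta^*$ the residual norm is roughly $\sqrt{n}\sigma$, so the curvature is of order $X^\top X/(n\sigma)$ up to the projection factor, which matches the $1/\sigma$ scaling of the constants stated later in Lemma 3.6.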
Next, we introduce two key assumptions for establishing our computational theory.
Assumption 3.2. The sequence of the regularization parameters satisfies λN ≥ 6‖∇Lµ(θ∗)‖∞.
Assumption 3.2 requires that λN is sufficiently large such that the irrelevant variables can be
eliminated (Bickel et al., 2009; Negahban et al., 2012).
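Assumption 3.3. $L_\mu$ satisfies LRSC and LRSS properties on $B_r$, where $r \ge s^*\left(8\lambda_{N_1}/\rho^-_{s^*+s}\right)^2$ for
some $N_1 < N$, $N_1 \in \mathbb{Z}^+$, and $\lambda_{N_1} > \lambda_N$. Specifically, (3.1) holds with $\rho^+_{s^*+2s}, \rho^-_{s^*+2s} \in (0, \infty)$, where
$s = C_1 s^* > (196\kappa^2_{s^*+2s} + 144\kappa_{s^*+2s})s^*$, $C_1 \in \mathbb{R}^+$ is a constant and $\kappa_{s^*+2s} = \rho^+_{s^*+2s}/\rho^-_{s^*+2s}$.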
Assumption 3.3 guarantees that Lµ satisfies LRSC and LRSS properties as long as the estimation
error satisfies ‖θ − θ∗‖22 ≤ r and the number of irrelevant coordinates of solutions is bounded by s.
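Algorithm 1: Pathwise Iterative Smoothing Shrinkage Thresholding Algorithm (PIS2TA) for
solving the SQRT-Lasso optimization (1.4). $\hat{\theta}_{[K]}$ denotes the output solution corresponding
to $\lambda_K$; $\theta^{(t)}_{[K]}$ denotes the solution at the $t$-th iteration of the $K$-th optimization stage; $\varepsilon_K$ is a
pre-specified precision for the $K$-th optimization stage. The line search procedure is described
in Algorithm 2.
Input: $y$, $X$, $N$, $\lambda_N$, $\varepsilon_N$, $L_{\max} > 0$
Initialize: $\hat{\theta}_{[0]} \leftarrow 0$, $\lambda_0 \leftarrow \|\nabla L_\mu(0)\|_\infty$, $\eta \leftarrow (\lambda_N/\lambda_0)^{1/N}$
For: $K = 1, \ldots, N$
    $t \leftarrow 0$, $\lambda_K \leftarrow \eta\lambda_{K-1}$, $\theta^{(0)}_{[K]} \leftarrow \hat{\theta}_{[K-1]}$, $L^{(0)}_{[K]} \leftarrow L_{\max}$
    Repeat:
        $t \leftarrow t + 1$
        $L^{(t)}_{[K]} \leftarrow \min\{2\tilde{L}^{(t)}_{[K]}, L_{\max}\}$, where $\tilde{L}^{(t)}_{[K]} \leftarrow \mathrm{LineSearch}(\lambda_K, \theta^{(t-1)}_{[K]}, L^{(t-1)}_{[K]})$
        $\theta^{(t)}_{[K]} \leftarrow T_{L^{(t)}_{[K]},\lambda_K}(\theta^{(t-1)}_{[K]})$
    Until: $\omega_{\lambda_K}(\theta^{(t)}_{[K]}) \le \varepsilon_K$
    $\hat{\theta}_{[K]} \leftarrow \theta^{(t)}_{[K]}$
End For
Return: $\hat{\theta}_{[N]}$

Algorithm 2: Line search of PIS2TA for SQRT-Lasso.
Input: $\lambda_K$, $\theta^{(t-1)}_{[K]}$, $L^{(t-1)}_{[K]}$
Initialize: $\tilde{L}^{(t)}_{[K]} = L^{(t-1)}_{[K]}$
Repeat:
    $\theta^{(t)}_{[K]} = T_{\tilde{L}^{(t)}_{[K]},\lambda_K}(\theta^{(t-1)}_{[K]})$
    If $F_{\mu,\lambda_K}(\theta^{(t)}_{[K]}) < Q_{\mu,\lambda_K}(\theta^{(t)}_{[K]}, \theta^{(t-1)}_{[K]})$
        $\tilde{L}^{(t)}_{[K]} = \tilde{L}^{(t)}_{[K]}/2$
    End If
Until: $F_{\mu,\lambda_K}(\theta^{(t)}_{[K]}) \ge Q_{\mu,\lambda_K}(\theta^{(t)}_{[K]}, \theta^{(t-1)}_{[K]})$
Return: $\tilde{L}^{(t)}_{[K]}$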
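Algorithms 1 and 2 can be put together in a compact pure-Python sketch. It is meant as a readable illustration under stated assumptions, not as the authors' implementation (which is written in C): all function names are hypothetical, `X` is a list of rows, the early-stage precision follows the choice lam_K / 4, the floor `L_min` and the cap `max_iter` are safeguards added here, and `L_max` is supplied by the caller (for example, the squared Frobenius norm of X divided by mu * sqrt(n), which upper-bounds the Lipschitz constant of the gradient).

```python
import math


def residual(X, y, theta):
    return [yi - sum(a * b for a, b in zip(row, theta)) for row, yi in zip(X, y)]


def loss(X, y, theta, mu):
    """L_mu(theta) = ||y - X theta||_mu / sqrt(n)."""
    nrm = math.sqrt(sum(ri * ri for ri in residual(X, y, theta)))
    val = nrm * nrm / (2 * mu) if nrm < mu else nrm - mu / 2
    return val / math.sqrt(len(y))


def grad(X, y, theta, mu):
    r = residual(X, y, theta)
    scale = max(math.sqrt(sum(ri * ri for ri in r)), mu) * math.sqrt(len(y))
    return [-sum(row[j] * ri for row, ri in zip(X, r)) / scale
            for j in range(len(theta))]


def prox_step(theta, g, L, lam):
    """T_{L,lam}(theta) = S_{lam/L}(theta - g / L)."""
    t, out = lam / L, []
    for tj, gj in zip(theta, g):
        u = tj - gj / L
        out.append(u - t if u > t else u + t if u < -t else 0.0)
    return out


def kkt_gap(theta, g, lam):
    gap = 0.0
    for tj, gj in zip(theta, g):
        v = abs(gj + lam) if tj > 0 else abs(gj - lam) if tj < 0 \
            else max(abs(gj) - lam, 0.0)
        gap = max(gap, v)
    return gap


def line_search(X, y, theta, L, lam, mu, L_min=1e-8):
    """Algorithm 2: halve L while F < Q; the l1 terms of F and Q cancel."""
    f0, g = loss(X, y, theta, mu), grad(X, y, theta, mu)
    while L > L_min:
        new = prox_step(theta, g, L, lam)
        diff = [a - b for a, b in zip(new, theta)]
        q = (f0 + sum(gj * dj for gj, dj in zip(g, diff))
             + 0.5 * L * sum(dj * dj for dj in diff))
        if loss(X, y, new, mu) >= q:      # quadratic model no longer majorizes
            break
        L /= 2
    return L


def pis2ta(X, y, lam_N, N, mu, eps_N, L_max, max_iter=2000):
    """Algorithm 1: pathwise stages with warm starts."""
    theta = [0.0] * len(X[0])
    lam = max(abs(gj) for gj in grad(X, y, theta, mu))      # lambda_0
    eta = (lam_N / lam) ** (1.0 / N)
    for K in range(1, N + 1):
        lam *= eta
        eps = eps_N if K == N else lam / 4    # loose precision on early stages
        L = L_max
        for _ in range(max_iter):
            L = min(2 * line_search(X, y, theta, L, lam, mu), L_max)
            theta = prox_step(theta, grad(X, y, theta, mu), L, lam)
            if kkt_gap(theta, grad(X, y, theta, mu), lam) <= eps:
                break
    return theta


X = [[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]]
y = [2.1, 1.8, 1.9, 2.2]
theta_hat = pis2ta(X, y, lam_N=0.2, N=5, mu=0.01, eps_N=1e-6, L_max=400.0)
print(theta_hat)
```

For this toy design the problem reduces to one dimension: the second coordinate stays at zero, and the first coordinate has the closed-form minimizer 2 - 0.2 * sqrt(0.025 / 0.96), which is a convenient sanity check for the sketch.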
3.1 Computational Theory
Our analysis consists of two phases depending on the estimation error and sparsity of the solution
$\theta$ along the path. Specifically, we denote $B_r^{s^*+s} = B_r \cap \{\theta \in \mathbb{R}^d : \|\theta - \theta^*\|_0 \le s^* + s\}$. Let
$N_1 \in \{1, \ldots, N\}$ be a cut-off between Phase I and Phase II. We can show that Phase I corresponds
to the first $N_1$ stages of pathwise optimization, in which we cannot guarantee $\theta \in B_r^{s^*+s}$. Thus we
only establish a sublinear convergence for Phase I. But Phase I is still computationally efficient,
since we can choose reasonably large $\varepsilon_K$ for $K = 1, \ldots, N_1$ to facilitate early stopping; Phase II
corresponds to the subsequent $(N - N_1)$ stages, in which we guarantee $\theta \in B_r^{s^*+s}$. Thus LRSC and
LRSS hold, and a linear convergence can be established accordingly.
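Theorem 3.4. Suppose Assumptions 3.2 and 3.3 hold. Let $\hat{\theta}_{[K]}$ be the output solution that satisfies
$\omega_{\lambda_K}(\hat{\theta}_{[K]}) \le \varepsilon_K$ of the $K$-th stage for all $K = 1, \ldots, N$. We denote $S^* = \{j \mid \theta^*_j \neq 0\}$,
$\overline{S}^* = \{j \mid \theta^*_j = 0\}$ and $s^* = |S^*|$. Recall that $\eta$ is the decaying ratio of the geometrically decreasing
regularization sequence. Given $\mu \le \frac{\sqrt{n}\sigma}{4}$, $\lambda_0 = \|\nabla L_\mu(0)\|_\infty$, and $\eta = (\lambda_N/\lambda_0)^{1/N} \in (5/6, 1)$, there
exists $N_1 \in \{1, 2, \ldots, N\}$ such that the following results hold:
Phase I: Let $R = \max_{K \le N_1} \sup_\theta \{\|\theta - \overline{\theta}_{[K]}\|_2 : F_{\mu,\lambda_K}(\theta) \le F_{\mu,\lambda_K}(\theta^{(0)}_{[K]})\}$. At the $K$-th stage,
where $K = 1, \ldots, N_1$, we need at most $T_K = O\left(\frac{\|X\|_2^2 R}{\varepsilon_K \mu\sqrt{n}}\right)$ iterations to guarantee
$$\omega_{\lambda_K}(\theta^{(t)}_{[K]}) \le \varepsilon_K \quad \text{and} \quad F_{\mu,\lambda_K}(\hat{\theta}_{[K]}) - F_{\mu,\lambda_K}(\overline{\theta}_{[K]}) = O\left(\frac{\|X\|_2^2 R^2}{T_K \mu\sqrt{n}}\right),$$
where $\overline{\theta}_{[K]}$ is a global optimum to (1.4). Moreover, we have $\hat{\theta}_{[N_1]} \in B_r$ and $\|[\hat{\theta}_{[N_1]}]_{\overline{S}^*}\|_0 \le s^* + s$.
Phase II: Let $\alpha = 1 - \frac{1}{8\kappa_{s^*+2s}}$. At the $K$-th stage, where $K = N_1 + 1, \ldots, N$, we have
sparse solutions throughout all iterations, i.e., $\|[\theta^{(t)}_{[K]}]_{\overline{S}^*}\|_0 \le s^* + s$. Moreover, we need at most
$T_K = O\left(\kappa_{s^*+2s} \log\left(\kappa^3_{s^*+2s}\, s^* \lambda_K^2/\varepsilon_K^2\right)\right)$ iterations to guarantee $\omega_{\lambda_K}(\theta^{(t)}_{[K]}) \le \varepsilon_K$,
$$F_{\mu,\lambda_K}(\hat{\theta}_{[K]}) - F_{\mu,\lambda_K}(\overline{\theta}_{[K]}) = O\left(\alpha^{T_K} \varepsilon_K \lambda_K s^*\right), \quad \text{and} \quad \|\hat{\theta}_{[K]} - \overline{\theta}_{[K]}\|_2^2 = O\left(\alpha^{T_K} \varepsilon_K \lambda_K s^*\right),$$
where $\overline{\theta}_{[K]}$ is the unique sparse global optimum to (1.4) with $\lambda_K$ satisfying $\|[\overline{\theta}_{[K]}]_{\overline{S}^*}\|_0 \le s^* + s$.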
Theorem 3.4 guarantees that PIS2TA achieves an R-linear convergence to the unique sparse
global optimum to (1.4) in terms of both objective value and solution parameter, which is nearly as
efficient as Xiao and Zhang (2013) for solving Lasso with much less tuning effort (since $\lambda_N$ is
independent of $\sigma$). This further explains why PIS2TA is much more efficient than other competing
algorithms for solving SQRT-Lasso such as ADMM and SOCP + interior point method.
A geometric interpretation of Theorem 3.4 is provided in Figure 2. The first $N - 1$ stages serve
as intermediate processes to facilitate fast convergence to $\hat{\theta}_{[N]}$, which do not require high precision
solutions. Thus we choose $\varepsilon_K = \lambda_K/4 \gg \varepsilon_N$ for $K = 1, \ldots, N - 1$ such that Phase I is efficient, as
shown in Figure 3, and only require high precision for the last stage, e.g., $\varepsilon_N = 10^{-5}$ (because only the last
regularization parameter is of interest to us). The total number of iterations for computing the entire
solution path is at most
$$O\left(\frac{N_1 \|X\|_2^2 R}{\lambda_{N_1} \mu\sqrt{n}} + \kappa_{s^*+2s}(N - N_1) \log\left(\kappa_{s^*+2s}\, s^* \lambda_N/\varepsilon_N\right)\right).$$
Moreover, given properly chosen $\mu$ and $\lambda_N$, we guarantee that none of the linear convergence region,
the true model parameter $\theta^*$ and the output solution $\hat{\theta}_{[N]}$ falls into the smoothed region. This implies that
(1.3) and (1.4) share the same global optimum in Phase II. A formal claim is presented in §3.2.
Figure 2: A geometric interpretation of two phases of convergence. Phase I (yellow region): Sublinear
convergence, and Phase II (orange region): Linear convergence. The linear convergence region, the
true parameter $\theta^*$ and output solution $\hat{\theta}_{[N]}$ do not fall into the smoothed region.
[Diagram labels: the initial solution $\hat{\theta}_{[0]} = 0$; the output solutions $\hat{\theta}_{[1]}, \hat{\theta}_{[2]}, \ldots, \hat{\theta}_{[N_1]}$ for $\lambda_1, \lambda_2, \ldots, \lambda_{N_1}$ (Phase I) and $\hat{\theta}_{[N_1+1]}, \ldots, \hat{\theta}_{[N]}$ for $\lambda_{N_1+1}, \ldots, \lambda_N$ (Phase II); the region of sublinear convergence; the region of linear convergence, a neighborhood $B_r^{s^*+s}$ of $\theta^*$; the smoothed region (overfitted model).]
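Figure 3: Plots of the objective gaps $F_{\mu,\lambda_K}(\theta^{(t)}_{[K]}) - F_{\mu,\lambda_K}(\overline{\theta}_{[K]})$ (vertical axis, log scale from $10^{-14}$ to $10^{-4}$)
against $t$, the number of iterations in each stage (horizontal axis), for all iterations $t$ of each path following stage,
with curves for stages $1, \ldots, N_1$ and for stage $N$ (only a few stages are demonstrated for clarity).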
3.2 Statistical Theory
To analyze the statistical properties of our estimator obtained via PIS2TA, we assume that the
design matrix X satisfies the restricted eigenvalue condition as follows.
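Assumption 3.5. The design matrix $X$ satisfies the Restricted Eigenvalue (RE) condition, i.e.,
there exist constants $\psi_{\min}, \psi_{\max}, \varphi_{\min}, \varphi_{\max} \in (0, \infty)$, which do not scale with $(s^*, n, d)$, such that
$$\psi_{\min}\|v\|_2^2 - \varphi_{\min}\frac{\log d}{n}\|v\|_1^2 \le \frac{\|Xv\|_2^2}{n} \le \psi_{\max}\|v\|_2^2 + \varphi_{\max}\frac{\log d}{n}\|v\|_1^2. \qquad (3.2)$$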
A wide family of examples satisfy the RE condition, such as the correlated sub-Gaussian random
design (Rudelson and Zhou, 2013), which has been extensively studied for sparse recovery (Candes
and Tao, 2005; Bickel et al., 2009; Raskutti et al., 2010).
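We next verify Assumptions 3.2 and 3.3 based on the RE condition in the following lemma.
Lemma 3.6. Suppose Assumption 3.5 holds. Given $\mu \le \frac{\sqrt{n}\sigma}{4}$ and $\lambda_N = 24\sqrt{\frac{\log d}{n}}$, we have $\lambda_N \ge
6\|\nabla L_\mu(\theta^*)\|_\infty$ with high probability. Moreover, for large enough $n$, $L_\mu(\theta)$ satisfies LRSC and
LRSS properties on $B_r$ with high probability, where $r = \frac{\sigma^2}{8\psi_{\max}}$. Specifically, (3.1) holds with
$\rho^+_{s^*+2s} \le \frac{8\psi_{\max}}{\sigma}$ and $\rho^-_{s^*+2s} \ge \frac{\psi_{\min}}{8\sigma}$, where $s = C_2 s^* > (196\kappa^2_{s^*+2s} + 144\kappa_{s^*+2s})s^*$, $C_2 \in \mathbb{R}^+$ is a
generic constant and $\kappa_{s^*+2s} \le 64\psi_{\max}/\psi_{\min}$.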
Lemma 3.6 guarantees that Assumption 3.2 holds given properly chosen µ and λN , and Assump-
tion 3.3 holds given the design X satisfying RE condition, both with high probability. Therefore, by
Theorem 3.4, PIS2TA achieves an R-linear convergence to the unique sparse global optimum. In the
next theorem, we characterize the statistical rate of convergence of PIS2TA.
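Theorem 3.7. Suppose Assumption 3.5 holds. Let the output solution $\hat{\theta}_{[N]}$ satisfy $\omega_{\lambda_N}(\hat{\theta}_{[N]}) \le \varepsilon_N$
for a small enough $\varepsilon_N$. Given $\mu \le \frac{\sqrt{n}\sigma}{4}$, $\lambda_N = 24\sqrt{\frac{\log d}{n}}$ and a large enough $n$, we have:
$$\|\hat{\theta}_{[N]} - \theta^*\|_2 = O_P\left(\sigma\sqrt{s^* \log d/n}\right) \quad \text{and} \quad \|\hat{\theta}_{[N]} - \theta^*\|_1 = O_P\left(\sigma s^*\sqrt{\log d/n}\right).$$
Moreover, let $\hat{\sigma} = \frac{\|y - X\hat{\theta}_{[N]}\|_2}{\sqrt{n}}$ be the estimate of $\sigma$. Then we have $|\hat{\sigma} - \sigma| = O_P(\sigma s^* \log d/n)$.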
Theorem 3.7 guarantees that the output solution θ[N ] obtained by PIS2TA achieves the minimax
optimal rate of convergence in parameter estimation (Ye and Zhang, 2010; Raskutti et al., 2011).
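The noise level estimate in Theorem 3.7 is simply the root mean squared residual of the fitted model. A minimal sketch follows; the function name and the list-of-rows layout are assumptions made for this illustration.

```python
import math


def estimate_sigma(X, y, theta_hat):
    """sigma_hat = ||y - X theta_hat||_2 / sqrt(n)."""
    rss = 0.0
    for row, yi in zip(X, y):
        ri = yi - sum(a * b for a, b in zip(row, theta_hat))
        rss += ri * ri
    return math.sqrt(rss / len(y))


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
y = [1.5, 2.5, 2.5, -1.5]
print(estimate_sigma(X, y, [1.0, 2.0]))   # 0.5
```

Here every residual has magnitude 0.5, so the estimate is 0.5.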
The next proposition shows that (1.3) and (1.4) share the same global optimum, which corresponds
to a well fitted model. This implies that neither the unique sparse global optimum θ[N ] nor the
linear convergence region falls into the smooth region, as shown in Figure 2.
Proposition 3.8. Under the same assumptions as Theorem 3.7, for all λK ’s, where K = N1 +
1, . . . , N , (1.3) and (1.4) share the same unique sparse global optimum with high probability.
4 Extension to Sparse Precision Matrix Estimation
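We consider the TIGER approach proposed in Liu and Wang (2012) for estimating sparse precision
matrix. To be clear, we emphasize that we use $v_1, v_2, \ldots$ (without $[\cdot]$ for subscript) to index vectors.
Let $X = [x_1^\top, \ldots, x_n^\top]^\top \in \mathbb{R}^{n \times d}$ be $n$ observed data points from a $d$-variate Gaussian distribution
$N_d(0, \Sigma)$. Our goal is to estimate the sparse precision matrix $\Theta = \Sigma^{-1}$. Let $Z = X\hat{\Gamma}^{-1/2} =
[z_1^\top, \ldots, z_n^\top]^\top \in \mathbb{R}^{n \times d}$ be the standardized data matrix, where $\hat{\Gamma} = \mathrm{diag}(\hat{\Sigma}_{11}, \ldots, \hat{\Sigma}_{dd})$ is a diagonal
matrix and $\hat{\Sigma} = \frac{1}{n}X^\top X$. Then, we can write $z_i = Z_{*,\setminus i}\theta^*_i + \hat{\Gamma}_{ii}^{-1/2}\epsilon_i$ for all $i = 1, \ldots, d$, where
$\theta^*_i = \hat{\Gamma}_{ii}^{-1/2}\hat{\Gamma}_{\setminus i,\setminus i}^{1/2}(\Sigma_{\setminus i,\setminus i})^{-1}\Sigma_{\setminus i,i}$ and $\epsilon_i \sim N_n(0, \sigma_i^2 I_n)$ with $\sigma_i^2 = \Sigma_{ii} - \Sigma_{\setminus i,i}^\top(\Sigma_{\setminus i,\setminus i})^{-1}\Sigma_{\setminus i,i}$. We
denote $\tau_i^2 = \sigma_i^2\hat{\Gamma}_{ii}^{-1}$, and solve:
$$\hat{\theta}_i = \operatorname*{argmin}_{\theta_i \in \mathbb{R}^{d-1}} L_{\mu,i}(\theta_i) + \lambda\|\theta_i\|_1 \quad \text{and} \quad \hat{\tau}_i = \|z_i - Z_{*,\setminus i}\hat{\theta}_i\|_\mu, \qquad (4.1)$$
for all $i = 1, \ldots, d$, where $L_{\mu,i}(\theta_i) = \frac{1}{\sqrt{n}}\|z_i - Z_{*,\setminus i}\theta_i\|_\mu$. Then the $i$-th column of precision matrix
$\Theta$ is estimated by: $\hat{\Theta}_{ii} = \hat{\tau}_i^{-2}\hat{\Gamma}_{ii}^{-1}$, and $\hat{\Theta}_{\setminus i,i} = -\hat{\tau}_i^{-2}\hat{\Gamma}_{ii}^{-1/2}\hat{\Gamma}_{\setminus i,\setminus i}^{-1/2}\hat{\theta}_i$.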
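The column-wise assembly of the precision matrix estimate can be sketched as below. The function `precision_column` and its argument layout are hypothetical: it takes the regression coefficients for column i (length d - 1, ordered by the remaining indices), the scale estimate for that column, and the diagonal of the scaling matrix, all assumed to have been computed beforehand.

```python
import math


def precision_column(i, theta_i, tau_i, gamma_diag):
    """Assemble column i of the precision matrix estimate.

    Diagonal entry:      1 / (tau_i^2 * Gamma_ii)
    Off-diagonal entry j: -theta_i[k] / (tau_i^2 * sqrt(Gamma_ii) * sqrt(Gamma_jj))
    """
    d = len(gamma_diag)
    others = [j for j in range(d) if j != i]   # indices of the d - 1 regressors
    col = [0.0] * d
    col[i] = 1.0 / (tau_i ** 2 * gamma_diag[i])
    for k, j in enumerate(others):
        denom = tau_i ** 2 * math.sqrt(gamma_diag[i]) * math.sqrt(gamma_diag[j])
        col[j] = -theta_i[k] / denom
    return col


col = precision_column(1, [0.2, 0.0], 0.5, [4.0, 1.0, 9.0])
print(col)   # diagonal entry 4.0, first off-diagonal entry about -0.4
```

Zero regression coefficients map to zero off-diagonal entries, so the sparsity of each column is inherited directly from the sparse regression solution.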
Here we solve (4.1) by PIS2TA. We then introduce a few mild technical assumptions:
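Assumption 4.1. Suppose the true covariance matrix $\Sigma^*$ and precision matrix $\Theta^*$ satisfy: (A1)
$\Theta^* \in \mathcal{M}(\kappa_\Theta, s^*) = \left\{\Theta \in \mathbb{R}^{d \times d} : \Theta \succ 0,\ \Lambda_{\max}(\Theta)/\Lambda_{\min}(\Theta) \le \kappa_\Theta,\ \max_i \sum_j 1(\Theta_{ij} \neq 0) \le s^*\right\}$; (A2)
$(s^*)^2 \log d = o(n)$; (A3) $\limsup_{n \to \infty} \max_i (\Sigma^*_{ii})^2 \log d/n < 1/4$.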
We first verify the assumptions required by our computational theory by the following lemma.
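Lemma 4.2. Suppose Assumption 4.1 holds. Given $\mu \le \frac{1}{5}\sqrt{\frac{n}{\kappa_\Theta}}$ and $\lambda_N = 6\sqrt{\frac{5 \log d}{n}}$, we have
$\lambda_N \ge 6 \max_{i \in \{1, \ldots, d\}} \|\nabla L_{\mu,i}(\theta^*_i)\|_\infty$ with high probability. Moreover, $L_{\mu,i}(\theta_i)$ satisfies LRSC and
LRSS properties on $B_{r_i}$ for all $i = 1, \ldots, d$ with high probability, where $r_i = \frac{\sigma_i^2}{12\kappa_\Theta}$. Specifically,
for all $i = 1, \ldots, d$, (3.1) holds with $\rho^+_{s^*+2s} \le \frac{12\kappa_\Theta}{\sigma_i}$ and $\rho^-_{s^*+2s} \ge \frac{1}{12\kappa_\Theta\sigma_i}$, where $s = C_3 s^* >
(196\kappa^2_{s^*+2s} + 144\kappa_{s^*+2s})s^*$ for a generic constant $C_3$, and $\kappa_{s^*+2s} \le 144\kappa_\Theta^2$.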
Lemma 4.2 guarantees that Assumptions 3.2 and 3.3 hold with high probability given properly
chosen $\mu$ and $\lambda_N$. Thus, by Theorem 3.4, PIS2TA achieves an R-linear convergence to the unique
sparse global optimum of (4.1) with high probability for all columns of $\Theta$. The next theorem
characterizes the statistical rate of convergence of the obtained precision matrix estimator using
PIS2TA.
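Theorem 4.3. Suppose Assumption 4.1 holds. Let $\hat{\Theta}_{[N]}$ be the output solution for the regularization
parameter $\lambda_N$. Given $\mu \le \frac{1}{5}\sqrt{\frac{n}{\kappa_\Theta}}$ and $\lambda_N = 6\sqrt{\frac{5 \log d}{n}}$, we have
$$\|\hat{\Theta}_{[N]} - \Theta^*\|_2 = O_P\left(s^*\|\Theta^*\|_2\sqrt{\log d/n}\right).$$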
Theorem 4.3 implies that our obtained precision matrix estimator attains the minimax optimal
rate of convergence in parameter estimation. Moreover, we guarantee that neither the linear
convergence region nor output solution Θ[N ] falls into the smooth region with high probability.
5 Numerical Experiments
We investigate the computational performance of the proposed PIS2TA algorithm through numerical
experiments over both simulated and real data examples. All simulations are implemented in C with
double precision using a PC with an Intel 3.3GHz Core i5 CPU and 16GB memory.
For simulated data, we generate a training dataset of 200 samples, where each row of the design
matrix $X_{i,*}$, $i = 1, \ldots, 200$, is drawn independently from a 2000-dimensional normal distribution $N(0, \Sigma)$
where $\Sigma_{jj} = 1$ and $\Sigma_{jk} = 0.5$ for all $k \neq j$. We set $s^* = 3$ with $\theta^*_1 = 3$, $\theta^*_2 = -2$, and $\theta^*_4 = 1.5$, and
$\theta^*_j = 0$ for all $j \neq 1, 2, 4$. A validation set of 200 samples for the regularization parameter selection
and a testing set of 10,000 samples are also generated to evaluate the prediction accuracy.
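The simulated design can be reproduced with the standard library alone. The sketch below is an assumption-laden illustration, not the authors' C code: the function name, the seed handling, and the construction of the equicorrelated rows (each entry is sqrt(0.5) times the sum of a shared and a private standard normal, which gives unit variance and covariance 0.5) are choices made here, and Gaussian noise is used as one instance of sub-Gaussian noise.

```python
import math
import random


def simulate(n, d, sigma, seed=0):
    """Draw (X, y, theta_star) with Sigma_jj = 1 and Sigma_jk = 0.5 for j != k."""
    if d < 4:
        raise ValueError("need d >= 4 to place the three nonzero coefficients")
    rng = random.Random(seed)
    theta_star = [0.0] * d
    theta_star[0], theta_star[1], theta_star[3] = 3.0, -2.0, 1.5  # theta*_1, _2, _4
    a = math.sqrt(0.5)
    X, y = [], []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)                       # common factor
        row = [a * (shared + rng.gauss(0.0, 1.0)) for _ in range(d)]
        signal = 3.0 * row[0] - 2.0 * row[1] + 1.5 * row[3]
        X.append(row)
        y.append(signal + sigma * rng.gauss(0.0, 1.0))     # Gaussian noise
    return X, y, theta_star


X_train, y_train, theta_star = simulate(n=200, d=2000, sigma=1.0)
print(len(X_train), len(X_train[0]))   # 200 2000
```

Calling the same function with different seeds and sample sizes gives the validation and testing sets.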
We set $\sigma = 0.5, 1, 2, 4$ respectively to illustrate the tuning insensitivity. The regularization
parameter of both Lasso and SQRT-Lasso is chosen over a geometrically decreasing sequence
$\{\lambda_K\}_{K=0}^{50}$ with $\lambda_{50} = \sigma\sqrt{\log d/n}/2$ for Lasso and $\lambda_{50} = \sqrt{\log d/n}/2$ for SQRT-Lasso. The optimal
regularization parameter is determined by $\lambda_{\mathrm{opt}} = \lambda_{\hat{N}}$, where $\hat{N} = \operatorname*{argmin}_{K \in \{0, \ldots, 50\}} \|\tilde{y} - \tilde{X}\hat{\theta}_{[K]}\|_2^2$,
$\hat{\theta}_{[K]}$ denotes the obtained estimate using the regularization parameter $\lambda_K$, and $\tilde{y}$ and $\tilde{X}$ denote the
response vector and design matrix of the validation set. For both Lasso and SQRT-Lasso, we set
the stopping precision $\varepsilon_K = 10^{-5}$ for all $K = 1, \ldots, 30$ and $\varepsilon_K = 10^{-5}$ for all $K = 31, \ldots, 50$. For
SQRT-Lasso, we set the smoothing parameter $\mu = 10^{-4}$.
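The validation-based choice of the regularization parameter can be sketched as follows. The function name and the inputs (a list of regularization values and the matching list of estimates along the path) are assumptions made for this illustration. For reference, with n = 200 and d = 2000 the smallest SQRT-Lasso parameter on the path is sqrt(log d / n) / 2, which is roughly 0.097.

```python
def select_lambda(lambdas, thetas, X_val, y_val):
    """Return (K, lambda_K) minimizing the squared validation error."""
    best_K, best_err = None, float("inf")
    for K, theta in enumerate(thetas):
        err = sum((yi - sum(a * b for a, b in zip(row, theta))) ** 2
                  for row, yi in zip(X_val, y_val))
        if err < best_err:
            best_K, best_err = K, err
    return best_K, lambdas[best_K]


X_val = [[1.0, 0.0], [0.0, 1.0]]
y_val = [1.0, 0.0]
thetas = [[0.0, 0.0], [0.9, 0.0], [1.0, 0.5]]   # estimates along a toy path
print(select_lambda([0.4, 0.2, 0.1], thetas, X_val, y_val))   # (1, 0.2)
```

In the toy path the middle estimate has the smallest validation error, so its regularization value is selected.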
First of all, we compare PIS2TA with ADMM proposed in Li et al. (2015)$^2$. The backtracking line
search described in Algorithm 2 is adopted to accelerate both algorithms. We conduct 500 simulations
for all $\sigma$'s. The results are presented in Table 1. The PIS2TA and ADMM algorithms attain similar
objective values, but PIS2TA is about 20 times faster than ADMM. Both algorithms also achieve
similar estimation errors. Throughout all 500 simulations, we have $\|y - X\hat{\theta}_{[N]}\|_2 \gg \sqrt{n}\mu \approx 0.0015$.
This implies that all obtained optimal estimators are outside the smoothed region of the optimization
problem, i.e., the smoothing approach does not hurt the solution accuracy.
Next, we compare the computational and statistical performance between Lasso (solved by
PISTA; Xiao and Zhang, 2013) and SQRT-Lasso (solved by PIS2TA). The results averaged over
500 simulations are summarized in Table 1. In terms of statistical performance, Lasso and
SQRT-Lasso attain similar estimation and prediction errors. In terms of computational performance, the
PIS2TA algorithm for solving SQRT-Lasso is as efficient as PISTA for Lasso, which matches our
computational analysis.
Moreover, we also examine the optimal regularization parameters for Lasso and SQRT-Lasso. We
visualize the distribution of all 500 selected λopt’s using the kernel density estimator. Particularly,
we adopt the Gaussian kernel, and the kernel bandwidth is selected based on the 10-fold cross
validation. Figure 4 illustrates the estimated density functions. The horizontal axis corresponds
to the rescaled regularization parameter $\lambda_{\mathrm{opt}}/\sqrt{\log d/n}$. We see that the optimal regularization
parameters of Lasso significantly vary with different σ. In contrast, the optimal regularization
parameters of SQRT-Lasso are more concentrated. This is consistent with the claimed tuning
insensitivity.
Finally, we compare PIS2TA with ADMM over real data sets for precision matrix estimation.
Particularly, we use four real world biology data sets preprocessed by Li and Toh (2010): Estrogen
($d = 692$), Arabidopsis ($d = 834$), Leukemia ($d = 1{,}225$), Hereditary ($d = 1{,}869$). We set three
different values for $\lambda_N$ such that the obtained estimators achieve different levels of sparse recovery.
We set $N = 50$, and $\varepsilon_K = 10^{-4}$ for all $K$'s. The timing performance is summarized in Table 2. As
can be seen, PIS2TA is 5 to 20 times faster than ADMM on all four data sets$^3$.
$^2$ We do not have any results for the algorithm proposed in Belloni et al. (2011), because it failed to finish 500
simulations in 12 hours. The implementation was based on SDPT3.
$^3$ We do not have any results for the algorithm proposed in Belloni et al. (2011), because it failed to finish the
experiments on all four data sets in 12 hours. The implementation was also based on SDPT3.
Table 1: Quantitative comparison between Lasso (PISTA) and SQRT-Lasso (PIS2TA and ADMM)
over 500 simulations. The estimation error is defined as $\|\hat{\theta} - \theta^*\|_2$. The prediction error is defined as
$\|\tilde{y} - \tilde{X}\hat{\theta}_{[N]}\|_2/\sqrt{n}$. The residual is defined as $\|y - X\hat{\theta}_{[N]}\|_2$ for PIS2TA only. PIS2TA attains nearly
the same estimation and prediction errors as ADMM, but is significantly faster than ADMM over
all different settings. Besides, we also find that all obtained estimators are outside the smoothed
region throughout all 500 simulations.

Variance of Noise | Est. Err. Lasso | Est. Err. PIS2TA | Pred. Err. Lasso | Pred. Err. PIS2TA | Time (Second) Lasso | Time (Second) PIS2TA | Time (Second) ADMM | Residual PIS2TA
σ = 0.5 | 0.2761 | 0.2760 | 0.5403 | 0.5399 | 0.8817 | 0.9526 | 17.260 | 5.2505
        | (0.0651) | (0.0537) | (0.0172) | (0.0143) | (0.2824) | (0.2646) | (3.1723) | (0.9591)
σ = 1 | 0.5271 | 0.5319 | 1.0722 | 1.0757 | 0.9146 | 1.0170 | 19.762 | 10.5209
      | (0.1174) | (0.1065) | (0.0303) | (0.0280) | (0.3185) | (0.2959) | (4.4387) | (1.7760)
σ = 2 | 1.0962 | 1.1065 | 2.1551 | 2.1492 | 1.1772 | 1.1263 | 24.406 | 20.886
      | (0.2252) | (0.2141) | (0.0595) | (0.0589) | (0.4582) | (0.4128) | (4.2786) | (3.6506)
σ = 4 | 2.1275 | 2.1356 | 4.2928 | 4.2963 | 1.2913 | 1.2544 | 26.101 | 45.7623
      | (0.4247) | (0.4033) | (0.1112) | (0.1079) | (0.4855) | (0.5074) | (5.7725) | (5.6333)
Table 2: Timing comparison between PIS2TA and ADMM on biology data under different levels of
sparsity recovery. PIS2TA is significantly faster than ADMM over all settings and data sets.

             | Estrogen PIS2TA | Estrogen ADMM | Arabidopsis PIS2TA | Arabidopsis ADMM | Leukemia PIS2TA | Leukemia ADMM | Hereditary PIS2TA | Hereditary ADMM
Sparsity 1%  | 16.562 | 175.98 | 18.404 | 373.83 | 30.609 | 431.45 | 43.161 | 498.32
Sparsity 3%  | 70.622 | 338.96 | 81.557 | 707.52 | 86.406 | 812.69 | 141.65 | 895.09
Sparsity 10% | 188.03 | 703.24 | 226.97 | 1378.1 | 257.23 | 1653.1 | 413.85 | 1921.6
(a) Lasso: Distribution of $\lambda_{\mathrm{opt}}/\sqrt{\log d/n}$ (b) SQRT-Lasso: Distribution of $\lambda_{\mathrm{opt}}/\sqrt{\log d/n}$
Figure 4: Estimated distributions of $\lambda_{\mathrm{opt}}/\sqrt{\log d/n}$ over different values of $\sigma$ ($\sigma = 0.5, 1, 2, 4$) for Lasso and PIS2TA.
References
Agarwal, A., Negahban, S. and Wainwright, M. J. (2010). Fast global convergence rates of
gradient methods for high-dimensional statistical recovery. In Advances in Neural Information
Processing Systems.
Beck, A. and Teboulle, M. (2012). Smoothing and first order methods: A unified framework.
SIAM Journal on Optimization 22 557–580.
Belloni, A., Chernozhukov, V. and Wang, L. (2011). Square-root lasso: pivotal recovery of
sparse signals via conic programming. Biometrika 98 791–806.
Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and
dantzig selector. The Annals of Statistics 37 1705–1732.
Buhlmann, P. and van de Geer, S. (2011). Statistics for high-dimensional data: methods, theory
and applications. Springer.
Candes, E. J. and Tao, T. (2005). Decoding by linear programming. IEEE Transactions on
Information Theory 51 4203–4215.
Johnstone, I. M. (2001). Chi-square oracle inequalities. Lecture Notes-Monograph Series 399–418.
Li, L. and Toh, K.-C. (2010). An inexact interior point method for $\ell_1$-regularized sparse covariance
selection. Mathematical Programming Computation 2 291–315.
Li, X., Zhao, T., Yuan, X. and Liu, H. (2015). The flare package for high dimensional linear
regression and precision matrix estimation in R. The Journal of Machine Learning Research 16
553–557.
Liu, H. and Wang, L. (2012). Tiger: A tuning-insensitive approach for optimally estimating
Gaussian graphical models. Tech. rep., Massachusetts Institute of Technology.
Negahban, S. N., Ravikumar, P., Wainwright, M. J. and Yu, B. (2012). A unified framework
for high-dimensional analysis of m-estimators with decomposable regularizers. Statistical Science
27 538–557.
Nesterov, Y. (2004). Introductory lectures on convex optimization: A basic course, vol. 87.
Springer.
Nesterov, Y. (2005). Smooth minimization of non-smooth functions. Mathematical Programming
103 127–152.
Nesterov, Y. (2013). Gradient methods for minimizing composite functions. Mathematical
Programming 140 125–161.
Raskutti, G., Wainwright, M. J. and Yu, B. (2010). Restricted eigenvalue properties for
correlated Gaussian designs. The Journal of Machine Learning Research 11 2241–2259.
Raskutti, G., Wainwright, M. J. and Yu, B. (2011). Minimax rates of estimation for
high-dimensional linear regression over ℓq-balls. IEEE Transactions on Information Theory 57
6976–6994.
Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random measurements.
IEEE Transactions on Information Theory 59 3434–3447.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society, Series B 58 267–288.
Wainwright, M. J. (2015). High-dimensional statistics: A non-asymptotic viewpoint. In preparation.
University of California, Berkeley.
Xiao, L. and Zhang, T. (2013). A proximal-gradient homotopy method for the sparse least-squares
problem. SIAM Journal on Optimization 23 1062–1091.
Ye, F. and Zhang, C.-H. (2010). Rate minimaxity of the lasso and dantzig selector for the ℓq loss
in ℓr balls. The Journal of Machine Learning Research 11 3519–3540.
Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional
linear regression. The Annals of Statistics 36 1567–1594.
Zhang, T. (2009). Some sharp performance bounds for least squares regression with ℓ1 regularization.
The Annals of Statistics 37 2109–2144.
A Intermediate Results of Theorem 3.4 and Theorem 3.7
We introduce some important implications of the proposed assumptions. Recall that $S^* = \{j : \theta^*_j \neq 0\}$
is the index set of non-zero entries of $\theta^*$ with $s^* = |S^*|$, and $\bar{S}^* = \{j : \theta^*_j = 0\}$ is its
complement. From Lemma 3.6, Assumption 3.3 implies RSC and RSS with parameters $\rho^-_{s^*+2s}$
and $\rho^+_{s^*+2s}$, respectively. By Nesterov (2004), the following conditions are equivalent to RSC and
RSS: for any $v, w \in \mathbb{R}^d$ satisfying $\|v - w\|_0 \le s^* + 2s$,
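$$\rho^-_{s^*+2s}\|v - w\|_2^2 \le (v - w)^\top \nabla\mathcal{L}_\mu(w) \le \rho^+_{s^*+2s}\|v - w\|_2^2, \tag{A.1}$$
$$\frac{1}{\rho^+_{s^*+2s}}\|\nabla\mathcal{L}_\mu(v) - \nabla\mathcal{L}_\mu(w)\|_2^2 \le (v - w)^\top \nabla\mathcal{L}_\mu(w) \le \frac{1}{\rho^-_{s^*+2s}}\|\nabla\mathcal{L}_\mu(v) - \nabla\mathcal{L}_\mu(w)\|_2^2. \tag{A.2}$$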
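From the convexity of the $\ell_1$ norm, we have
$$\|v\|_1 - \|w\|_1 \ge (v - w)^\top g, \tag{A.3}$$
where $g \in \partial\|w\|_1$. Combining (A.1) and (A.3), we have for any $v, w \in \mathbb{R}^d$ satisfying $\|v - w\|_0 \le s^* + 2s$,
$$\mathcal{F}_{\mu,\lambda}(v) - \mathcal{F}_{\mu,\lambda}(w) - (v - w)^\top \nabla\mathcal{F}_{\mu,\lambda}(w) \ge \rho^-_{s^*+2s}\|v - w\|_2^2. \tag{A.4}$$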
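Remark A.1. For any t and K, the line search satisfies
$$\tilde{L}^{(t)}_{[K]} \le L^{(t)}_{[K]} \le L_{\max}, \quad L_\mu \le \tilde{L}^{(t)}_{[K]} \le L^{(t)}_{[K]} \le 2L_\mu \quad \text{and} \quad \rho^+_{s^*+2s} \le \tilde{L}^{(t)}_{[K]} \le L^{(t)}_{[K]} \le 2\rho^+_{s^*+2s}, \tag{A.5}$$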
where $L_\mu = \min\{L : \|\nabla\mathcal{L}_\mu(v) - \nabla\mathcal{L}_\mu(w)\|_2 \le L\|v - w\|_2, \ \forall v, w \in \mathbb{R}^d\}$.
We first show that when $\theta$ is sparse and the approximate KKT condition is satisfied, both the
estimation error (in $\ell_2$ norm) and the objective error, w.r.t. the true model parameter, are bounded.
This shows that the initial value $\theta^{(0)}_{[K]}$ of the K-th path following stage has desirable statistical
properties if we initialize $\theta^{(0)}_{[K]} = \hat{\theta}_{[K-1]}$. This is formalized in Lemma A.2, and its proof is deferred
to Appendix N.1.
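Lemma A.2. Suppose Assumption 3.2 and Assumption 3.3 hold, and $\lambda \ge \lambda_N$. If $\theta$ satisfies
$\|\theta_{\bar{S}^*}\|_0 \le s$ and the approximate KKT condition
$$\min_{g \in \partial\|\theta\|_1} \|\nabla\mathcal{L}_\mu(\theta) + \lambda g\|_\infty \le \lambda/2, \tag{A.6}$$
then we have
$$\|(\theta - \theta^*)_{\bar{S}^*}\|_1 \le 5\|(\theta - \theta^*)_{S^*}\|_1, \tag{A.7}$$
$$\|\theta - \theta^*\|_2 \le \frac{2\lambda\sqrt{s^*}}{\rho^-_{s^*+2s}}, \tag{A.8}$$
$$\|\theta - \theta^*\|_1 \le \frac{12\lambda s^*}{\rho^-_{s^*+2s}}, \tag{A.9}$$
$$\mathcal{F}_{\mu,\lambda}(\theta) - \mathcal{F}_{\mu,\lambda}(\theta^*) \le \frac{6\lambda^2 s^*}{\rho^-_{s^*+2s}}. \tag{A.10}$$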
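The approximate KKT condition in Lemma A.2 has a closed form that is convenient to evaluate numerically. The sketch below assumes the gradient of the smooth loss at θ is already available; the function name `kkt_residual` and the toy numbers are illustrative and not taken from the paper. On the support of θ the subgradient is forced to sign(θ_j), while off the support it can cancel up to λ of the gradient.

```python
import numpy as np

def kkt_residual(grad, theta, lam):
    """Compute min over g in the subdifferential of ||theta||_1 of ||grad + lam * g||_inf."""
    grad = np.asarray(grad, dtype=float)
    theta = np.asarray(theta, dtype=float)
    nonzero = theta != 0
    res = np.empty_like(grad)
    # On the support, g_j must equal sign(theta_j).
    res[nonzero] = np.abs(grad[nonzero] + lam * np.sign(theta[nonzero]))
    # Off the support, g_j ranges over [-1, 1], so up to lam of |grad_j| is absorbed.
    res[~nonzero] = np.maximum(np.abs(grad[~nonzero]) - lam, 0.0)
    return float(res.max())

# Toy check of the condition "residual <= lam / 2" with made-up numbers.
lam = 1.0
omega = kkt_residual([1.5, -0.9], [0.0, 1.0], lam)
satisfies_approx_kkt = omega <= lam / 2
```

Here the first coordinate is off the support and contributes 1.5 - 1.0, and the second is on the support and contributes |-0.9 + 1.0|, so the residual is determined by the first coordinate.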
Next, we show that if θ is sparse and the objective error is bounded, then the estimation error
is also bounded. This characterizes that within the K-th path following stage, good statistical
performance is preserved after each proximal-gradient update. This is formalized in Lemma A.3,
and its proof is deferred to Appendix N.2.
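Lemma A.3. Suppose Assumption 3.2 and Assumption 3.3 hold, and $\lambda \ge \lambda_N$. If $\theta$ satisfies
$\|\theta_{\bar{S}^*}\|_0 \le s$ and the objective satisfies
$$\mathcal{F}_{\mu,\lambda}(\theta) - \mathcal{F}_{\mu,\lambda}(\theta^*) \le \frac{6\lambda^2 s^*}{\rho^-_{s^*+2s}},$$
then we have
$$\|\theta - \theta^*\|_2 \le \frac{4\lambda\sqrt{3s^*}}{\rho^-_{s^*+2s}}, \tag{A.11}$$
$$\|\theta - \theta^*\|_1 \le \frac{24\lambda s^*}{\rho^-_{s^*+2s}}. \tag{A.12}$$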
We then show that if $\theta$ is sparse and the objective error is bounded, then each proximal-gradient
update keeps the solution sparse. This demonstrates that within the K-th path following
stage, each iterate $\theta^{(t)}_{[K]}$ remains sparse and has good statistical performance. This is
formalized in Lemma A.4, and its proof is deferred to Appendix N.3.
Lemma A.4. Suppose Assumption 3.2 and Assumption 3.3 hold, and $\lambda \ge \lambda_N$. If $\theta$ satisfies
$\|\theta_{\bar{S}^*}\|_0 \le s$, $L$ satisfies $L < 2\rho^+_{s^*+2s}$, and the objective satisfies
$$\mathcal{F}_{\mu,\lambda}(\theta) - \mathcal{F}_{\mu,\lambda}(\theta^*) \le \frac{6\lambda^2 s^*}{\rho^-_{s^*+2s}},$$
then we have
$$\|(\mathcal{T}_{L,\lambda}(\theta))_{\bar{S}^*}\|_0 \le s. \tag{A.13}$$
Moreover, we show that if θ satisfies the approximate KKT condition, then the objective has a
bounded error w.r.t. the regularization parameter λ. This characterizes the geometric decrease of
the objective error when we choose a geometrically decreasing sequence of regularization parameters.
This is formalized in Lemma A.5, and its proof is deferred to Appendix N.4.
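Lemma A.5. Suppose Assumption 3.2 and Assumption 3.3 hold, and $\lambda \ge \lambda_N$. If $\theta$ satisfies
$$\omega_\lambda(\theta) \le \lambda/2,$$
then for any $\tilde{\lambda} \in [\lambda_N, \lambda]$, letting $\bar{\theta} = \arg\min_\theta \mathcal{F}_{\mu,\tilde{\lambda}}(\theta)$, we have
$$\mathcal{F}_{\mu,\tilde{\lambda}}(\theta) - \mathcal{F}_{\mu,\tilde{\lambda}}(\bar{\theta}) \le \frac{12(\lambda + \tilde{\lambda})\left(\omega_\lambda(\theta) + \lambda - \tilde{\lambda}\right)s^*}{\rho^-_{s^*+2s}}.$$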
Furthermore, we show that each path following stage has a local linear convergence rate if the
initial value θ(0) is sparse and satisfies the approximate KKT condition with adequate precision.
Besides, the estimation after each proximal gradient update is also sparse. This is the key result
in demonstrating the overall geometric convergence rate of the algorithm. This is formalized in
Lemma A.6, and its proof is deferred to Appendix N.5.
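Lemma A.6. Suppose Assumption 3.3 holds, and the initialization $\theta^{(0)}$ for every stage with any $\lambda$
in Algorithm 1 satisfies
$$\|\theta^{(0)}\|_0 \le s.$$
Then for any t = 1, 2, . . ., we have $\|\theta^{(t)}\|_0 \le s$ and
$$\mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \mathcal{F}_{\mu,\lambda}(\bar{\theta}) \le \left(1 - \frac{1}{8\kappa_{s^*+2s}}\right)^t \left(\mathcal{F}_{\mu,\lambda}(\theta^{(0)}) - \mathcal{F}_{\mu,\lambda}(\bar{\theta})\right),$$
where $\bar{\theta} = \arg\min_\theta \mathcal{F}_{\mu,\lambda}(\theta)$.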
In addition, we provide the sublinear convergence rate when RSC does not hold, based on a
refined analysis of the convergence rate for a convex (not strongly convex) objective via the proximal
gradient method with line search (Nesterov, 2013). Specifically, we provide a sublinear rate of
convergence without the need to classify the distance of the initial objective to the optimal objective.
This characterizes the convergence behavior when $\|X(\theta - \theta^*)\|_2$ is large. We formalize this in
Lemma A.7, and provide the proof in Appendix N.6.
Lemma A.7 (Refined result of Theorem 4 in Nesterov (2013)). Given the initialization $\theta^{(0)}$, let $R$ be
a constant such that
$$\|\theta - \bar{\theta}\|_2 \le R$$
for any $\theta \in \mathbb{R}^d$ that satisfies $\mathcal{F}_{\mu,\lambda}(\theta) \le \mathcal{F}_{\mu,\lambda}(\theta^{(0)})$. Then for any t = 1, 2, . . ., we have
$$\mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \mathcal{F}_{\mu,\lambda}(\bar{\theta}) \le \frac{4\|X\|_2^2 R^2}{(t + 2)\mu\sqrt{n}}, \tag{A.14}$$
where $\bar{\theta} = \arg\min_\theta \mathcal{F}_{\mu,\lambda}(\theta)$.
Finally, we introduce two results characterizing the proximal gradient mapping operation,
adapted from Nesterov (2013) and Xiao and Zhang (2013) without proof. The first lemma describes
sufficient descent of the objective by proximal gradient method.
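Lemma A.8 (Adapted from Theorem 2 in Nesterov (2013)). For any $L > 0$,
$$\mathcal{Q}_{\mu,\lambda}(\mathcal{T}_{L,\lambda}(\theta), \theta) \le \mathcal{F}_{\mu,\lambda}(\theta) - \frac{L}{2}\|\mathcal{T}_{L,\lambda}(\theta) - \theta\|_2^2.$$
Besides, if $\mathcal{L}_\mu(\theta)$ is convex, we have
$$\mathcal{Q}_{\mu,\lambda}(\mathcal{T}_{L,\lambda}(\theta), \theta) \le \min_x \mathcal{F}_{\mu,\lambda}(x) + \frac{L}{2}\|x - \theta\|_2^2. \tag{A.15}$$
Further, we have for any $L \ge L_\mu$,
$$\mathcal{F}_{\mu,\lambda}(\mathcal{T}_{L,\lambda}(\theta)) \le \mathcal{Q}_{\mu,\lambda}(\mathcal{T}_{L,\lambda}(\theta), \theta) \le \mathcal{F}_{\mu,\lambda}(\theta) - \frac{L}{2}\|\mathcal{T}_{L,\lambda}(\theta) - \theta\|_2^2. \tag{A.16}$$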
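The next lemma provides an upper bound of the optimal residue ω(·).
Lemma A.9 (Adapted from Lemma 2 in Xiao and Zhang (2013)). For any $L > 0$, if $L_\mu$ is the
Lipschitz constant of $\nabla\mathcal{L}_\mu$, then
$$\omega_\lambda(\mathcal{T}_{L,\lambda}(\theta)) \le (L + S_L(\theta))\|\mathcal{T}_{L,\lambda}(\theta) - \theta\|_2 \le (L + L_\mu)\|\mathcal{T}_{L,\lambda}(\theta) - \theta\|_2,$$
where $S_L(\theta) = \frac{\|\nabla\mathcal{L}_\mu(\mathcal{T}_{L,\lambda}(\theta)) - \nabla\mathcal{L}_\mu(\theta)\|_2}{\|\mathcal{T}_{L,\lambda}(\theta) - \theta\|_2}$ is a local Lipschitz constant, which satisfies $S_L(\theta) \le L_\mu$.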
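To make the proximal gradient mapping in Lemmas A.8 and A.9 concrete, the sketch below assumes the usual composite model Q(x, θ) = L_µ(θ) + ∇L_µ(θ)ᵀ(x − θ) + (L/2)‖x − θ‖² + λ‖x‖₁, whose minimizer over x is a soft-thresholding step. This definition is not restated in this appendix, so it should be read as an assumption; the least-squares loss is only a stand-in used to exercise the mapping, and all function names are illustrative.

```python
import numpy as np

def soft_threshold(v, tau):
    """Entry-wise soft-thresholding: shrink each entry towards zero by tau."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad_map(theta, grad, L, lam):
    """Minimizer over x of the quadratic model plus lam * ||x||_1 (assumed form of T_{L,lam})."""
    return soft_threshold(theta - grad / L, lam / L)

def quad_model(x, theta, loss, grad, L, lam):
    """Q(x, theta) = loss + grad^T (x - theta) + (L/2) ||x - theta||^2 + lam ||x||_1."""
    d = x - theta
    return loss + grad @ d + 0.5 * L * (d @ d) + lam * np.abs(x).sum()

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
theta = rng.standard_normal(5)
lam, L = 0.1, 3.0

# Stand-in smooth loss (least squares); the paper's loss is the smoothed SQRT-Lasso loss.
r = X @ theta - y
loss = 0.5 * (r @ r) / len(y)
grad = X.T @ r / len(y)

T = prox_grad_map(theta, grad, L, lam)
F_theta = loss + lam * np.abs(theta).sum()
Q_T = quad_model(T, theta, loss, grad, L, lam)

# First inequality of Lemma A.8: Q(T, theta) <= F(theta) - (L/2) ||T - theta||^2.
gap = F_theta - 0.5 * L * np.sum((T - theta) ** 2) - Q_T
```

The quantity `gap` is non-negative for any L > 0, because the model is L-strongly convex in x and coincides with the objective at x = θ.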
B Proof of Theorem 3.4
We first demonstrate the sublinear rate for the initial stages, when the estimation error ‖θ − θ∗‖2 is
large due to a large regularization parameter λK. The proof is provided in Appendix F.
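Theorem B.1. For any K = 1, . . . , N and $\lambda_K > 0$, let $\bar{\theta}_{[K]} = \arg\min_\theta \mathcal{F}_{\mu,\lambda_K}(\theta)$ be the optimal
solution of the K-th stage with regularization parameter $\lambda_K$. For any $\theta \in \mathbb{R}^d$ that satisfies
$\mathcal{F}_{\mu,\lambda_K}(\theta) \le \mathcal{F}_{\mu,\lambda_K}(\theta^{(0)}_{[K]})$, let $R \ge \|\theta - \bar{\theta}_{[K]}\|_2$. If the initial value $\theta^{(0)}_{[K]}$ satisfies
$\omega_{\lambda_K}(\theta^{(0)}_{[K]}) \le \lambda_K/2$, then within the K-th stage, for any t = 1, 2, . . ., we have
$$\mathcal{F}_{\mu,\lambda}(\theta^{(t)}_{[K]}) - \mathcal{F}_{\mu,\lambda}(\bar{\theta}_{[K]}) \le \frac{4\|X\|_2^2 R^2}{(t + 2)\mu\sqrt{n}}. \tag{B.1}$$
To achieve the approximate KKT condition $\omega_{\lambda_K}(\theta^{(t)}_{[K]}) \le \varepsilon_K$, the number of proximal gradient steps
is no more than
$$\frac{18\|X\|_2^3 R}{\varepsilon_K \mu^{3/2} n^{3/4}\sqrt{L_\mu}} - 3. \tag{B.2}$$
Note that $L_\mu$ is of order $\|X\|_2^2/(\mu\sqrt{n})$. Then (B.2) can be simplified as $O\left(\frac{\|X\|_2^2 R}{\varepsilon_K \mu\sqrt{n}}\right)$.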
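Next, we demonstrate the linear rate when the estimator satisfies $\theta \in \mathcal{B}_r^{s^*+s}$. The proof is
provided in Appendix G.
Theorem B.2. Suppose Assumption 3.2 and Assumption 3.3 hold, and $\lambda_K > 0$ for any $K = N_1, \ldots, N$.
Let $\bar{\theta}_{[K]} = \arg\min_\theta \mathcal{F}_{\mu,\lambda_K}(\theta)$ be the optimal solution of the K-th stage with regularization
parameter $\lambda_K$. If the initial value $\theta^{(0)}_{[K]}$ satisfies $\omega_{\lambda_K}(\theta^{(0)}_{[K]}) \le \lambda_K/2$ with $\|(\theta^{(0)}_{[K]})_{\bar{S}^*}\|_0 \le s$, then within
the K-th stage, for any t = 1, 2, . . ., we have
$$\|(\theta^{(t)}_{[K]})_{\bar{S}^*}\|_0 \le s, \qquad \|\theta^{(t)}_{[K]} - \bar{\theta}_{[K]}\|_2^2 \le \left(1 - \frac{1}{8\kappa_{s^*+2s}}\right)^t \frac{24\lambda_K s^* \omega_{\lambda_K}(\theta^{(t)}_{[K]})}{(\rho^-_{s^*+2s})^2}$$
and
$$\mathcal{F}_{\mu,\lambda_K}(\theta^{(t)}_{[K]}) - \mathcal{F}_{\mu,\lambda_K}(\bar{\theta}_{[K]}) \le \left(1 - \frac{1}{8\kappa_{s^*+2s}}\right)^t \frac{24\lambda_K s^* \omega_{\lambda_K}(\theta^{(t)}_{[K]})}{\rho^-_{s^*+2s}}. \tag{B.3}$$
(1) For $K = N_1, \ldots, N - 1$, to achieve the approximate KKT condition $\omega_{\lambda_K}(\theta^{(t)}_{[K]}) \le \lambda_K/4$, the
number of proximal gradient steps is no more than
$$\frac{\log\left(1536\,(1 + \kappa_{s^*+2s})^2 s^* \kappa_{s^*+2s}\right)}{\log\left(8\kappa_{s^*+2s}/(8\kappa_{s^*+2s} - 1)\right)}. \tag{B.4}$$
(2) For $K = N$, to achieve the approximate KKT condition $\omega_{\lambda_N}(\theta^{(t)}_{[N]}) \le \varepsilon_N$, the number of
proximal gradient steps is no more than
$$\frac{\log\left(96\,(1 + \kappa_{s^*+2s})^2 \lambda_N^2 s^* \kappa_{s^*+2s}/\varepsilon_N^2\right)}{\log\left(8\kappa_{s^*+2s}/(8\kappa_{s^*+2s} - 1)\right)}. \tag{B.5}$$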
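From basic inequalities, since $\kappa_{s^*+2s} \ge 1$, we have
$$\log\left(\frac{8\kappa_{s^*+2s}}{8\kappa_{s^*+2s} - 1}\right) \ge \log\left(1 + \frac{1}{8\kappa_{s^*+2s} - 1}\right) \ge \frac{1}{8\kappa_{s^*+2s}}.$$
Then (B.4) and (B.5) can be simplified as $O\left(\kappa_{s^*+2s}\log(\kappa^3_{s^*+2s} s^*)\right)$ and $O\left(\kappa_{s^*+2s}\log(\kappa^3_{s^*+2s}\lambda_N^2 s^*/\varepsilon_N^2)\right)$,
respectively.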
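The simplification above can be checked numerically. The sketch below evaluates the bound (B.4) as printed and the looser bound obtained by replacing the denominator with 1/(8κ); the function names and the sample values of κ and s∗ are illustrative choices, not values from the paper.

```python
import math

def stage_iteration_bound(kappa, s_star):
    """Right-hand side of (B.4) for a condition number kappa >= 1."""
    num = math.log(1536 * (1 + kappa) ** 2 * s_star * kappa)
    den = math.log(8 * kappa / (8 * kappa - 1))
    return num / den

def simplified_bound(kappa, s_star):
    """Looser bound from log(8k / (8k - 1)) >= 1 / (8k)."""
    return 8 * kappa * math.log(1536 * (1 + kappa) ** 2 * s_star * kappa)

for kappa in (1, 2, 10, 100):
    for s_star in (1, 5, 50):
        exact = stage_iteration_bound(kappa, s_star)
        loose = simplified_bound(kappa, s_star)
        print(f"kappa={kappa:>3} s*={s_star:>2}  (B.4)={exact:9.1f}  simplified={loose:9.1f}")
```

The simplified bound grows like κ log(κ³ s∗), which is the order stated in the text.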
As can be seen from Theorem B.2, when the initial value $\theta^{(0)}_{[K]}$ satisfies the approximate KKT
condition $\omega_{\lambda_K}(\theta^{(0)}_{[K]}) \le \lambda_K/2$ with $\theta^{(0)}_{[K]} \in \mathcal{B}_r^{s^*+s}$, we can guarantee the geometric convergence
rate of the objective value towards the minimal objective. Next, we show that if the
solution $\hat{\theta}_{[K-1]}$ from the (K − 1)-th path following stage satisfies the approximate KKT condition
and the regularization parameter $\lambda_K$ in the K-th path following stage is chosen properly, then $\hat{\theta}_{[K-1]}$
satisfies the approximate KKT condition for $\lambda_K$ with a slightly larger bound. This shows that
good computational properties are preserved by using the warm start $\theta^{(0)}_{[K]} = \hat{\theta}_{[K-1]}$ and a geometric
sequence of regularization parameters $\lambda_K$. We formalize this notion in Lemma B.3, and its proof is
deferred to Appendix H.
Lemma B.3. Let $\hat{\theta}_{[K-1]}$ be the approximate solution of the (K − 1)-th path following stage, which
satisfies the approximate KKT condition $\omega_{\lambda_{K-1}}(\hat{\theta}_{[K-1]}) \le \lambda_{K-1}/4$. Then we have
$$\omega_{\lambda_K}(\hat{\theta}_{[K-1]}) \le \lambda_K/2,$$
where $\lambda_K = \eta\lambda_{K-1}$ with $\eta \in (5/6, 1)$.
Combining Theorem B.2 and Lemma B.3, we achieve global convergence in terms
of the objective value using the path following proximal gradient method. The bounds on the number of
iterations $T_K$ in Phase 2 follow directly from (B.4) and (B.5) of Theorem B.2.
We obtain the objective gap of the N-th stage via an analogous argument. Specifically, in the
N-th (final) path following stage, when the number of iterations of the proximal gradient method is large enough
such that $\omega_{\lambda_N}(\theta^{(t)}_{[N]}) \le \varepsilon_N$ holds, we obtain the result from Lemma A.5 with both regularization parameters set to $\lambda_N$.
Finally, we need to show that there exists some $N_1 \in \{1, \ldots, N\}$ such that $\hat{\theta}_{[N_1]} \in \mathcal{B}_r^{s^*+s}$. We
demonstrate this result in Lemma B.4 and provide its proof in Appendix I.
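Lemma B.4. Suppose Assumption 3.2 and Assumption 3.3 hold, and the approximate KKT condition
satisfies $\omega_\lambda(\theta) \le \lambda/4$. If $\mu \le \sqrt{n}\sigma/4$ and $\frac{\rho^-_{s^*+s}}{8}\sqrt{\frac{r}{s^*}} > \lambda > \lambda_N$, then we have
$$\|\theta - \theta^*\|_2^2 \le r \quad \text{and} \quad \|\theta_{\bar{S}^*}\|_0 \le s.$$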
Lemma B.4 guarantees that there exists some $N_1 < N$ such that, for some $\lambda_{N_1} > \lambda_N$, the
approximate solution $\hat{\theta}_{[N_1]}$ satisfies the approximate KKT condition and $\hat{\theta}_{[N_1]} \in \mathcal{B}_r^{s^*+s}$. We then
enter Phase 2 of strong linear convergence by Theorem B.2, which finishes the proof.
We further provide a bound on the total number of proximal gradient steps for each $\lambda_K$ in the
following lemma for interested readers. The proof is provided in Appendix J.
Lemma B.5. For each $\lambda_K$, K = 1, . . . , N, if we restart the line search with a large enough $L_{\max}$, then
the total number of proximal gradient steps is no more than $2(T_K + 1) + \max\{\log_2 L_{\max} - \log_2 \rho^+_{s^*+2s}, 0\}$.
C Proof of Lemma 3.6
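Part 1. We first show that Assumption 3.2 holds. By $y = X\theta^* + \varepsilon$ and (C.4), we have
$$\nabla\mathcal{L}_\mu(\theta^*) = \frac{X^\top(X\theta^* - y)}{\max\{\sqrt{n}\mu, \sqrt{n}\|y - X\theta^*\|_2\}} = -\frac{X^\top\varepsilon}{\max\{\sqrt{n}\mu, \sqrt{n}\|\varepsilon\|_2\}}. \tag{C.1}$$
Since $\varepsilon$ has i.i.d. sub-Gaussian entries with $\mathbb{E}[\varepsilon_i] = 0$ and $\mathbb{E}[\varepsilon_i^2] = \sigma^2$ for all i = 1, . . . , n, we
have from Wainwright (2015) that
$$\mathbb{P}\left[\|\varepsilon\|_2^2 \le \frac{1}{4}n\sigma^2\right] \le \exp\left(-\frac{n}{32}\right). \tag{C.2}$$
By Negahban et al. (2012), we have the following result.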
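Lemma C.1. Assume $X$ satisfies $\|x_j\|_2 \le \sqrt{n}$ for all j = 1, . . . , d and $\varepsilon$ has i.i.d. zero-mean
sub-Gaussian entries with $\mathbb{E}[\varepsilon_i^2] = \sigma^2$ for all i = 1, . . . , n. Then we have
$$\mathbb{P}\left[\frac{1}{n}\|X^\top\varepsilon\|_\infty \ge 2\sigma\sqrt{\frac{\log d}{n}}\right] \le 2d^{-1}.$$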
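Combining (C.1), (C.2) and Lemma C.1, we have with probability at least $1 - 2d^{-1} - \exp\left(-\frac{n}{32}\right)$,
$$\|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty \le \frac{4\sqrt{\log d/n}}{\max\{\sqrt{2}\mu/(\sqrt{n}\sigma), 1\}}.$$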
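Part 2. Next, we show that Assumption 3.3 holds. We divide the proof into two steps.
Step 1. Suppose $X$ satisfies the RE condition, i.e.,
$$\psi_{\min}\|v\|_2^2 - \varphi_{\min}\frac{\log d}{n}\|v\|_1^2 \le \frac{\|Xv\|_2^2}{n} \le \psi_{\max}\|v\|_2^2 + \varphi_{\max}\frac{\log d}{n}\|v\|_1^2.$$
Denote $\bar{s} = s^* + 2s$. Since $\|v\|_0 \le \bar{s}$ implies $\|v\|_1^2 \le \bar{s}\|v\|_2^2$, we have
$$\left(\psi_{\min} - \varphi_{\min}\frac{\bar{s}\log d}{n}\right)\|v\|_2^2 \le \frac{\|Xv\|_2^2}{n} \le \left(\psi_{\max} + \varphi_{\max}\frac{\bar{s}\log d}{n}\right)\|v\|_2^2.$$
Then there exists a universal constant $c_1$ such that if $n \ge c_1 s^* \log d$, we have
$$\frac{1}{2}\psi_{\min}\|v\|_2^2 \le \frac{\|Xv\|_2^2}{n} \le 2\psi_{\max}\|v\|_2^2. \tag{C.3}$$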
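Step 1 relies only on the Cauchy-Schwarz fact that a vector with at most k non-zero entries satisfies ‖v‖₁² ≤ k‖v‖₂². A minimal numerical sketch of this fact follows; the helper name `l1_l2_gap` and the random test vector are illustrative and not from the paper.

```python
import numpy as np

def l1_l2_gap(v):
    """Return k * ||v||_2^2 - ||v||_1^2, where k is the number of non-zero entries of v."""
    v = np.asarray(v, dtype=float)
    k = np.count_nonzero(v)
    return k * np.sum(v ** 2) - np.sum(np.abs(v)) ** 2

rng = np.random.default_rng(0)
v = np.zeros(50)
support = rng.choice(50, size=7, replace=False)
v[support] = rng.standard_normal(7)

gap = l1_l2_gap(v)              # non-negative by Cauchy-Schwarz
tight = l1_l2_gap(np.ones(4))   # equal magnitudes attain equality
```

The inequality is tight exactly when all non-zero entries share the same magnitude, which is why the sparsity level enters the eigenvalue bounds linearly.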
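Step 2. Conditioning on (C.3), we show that $\mathcal{L}_\mu$ satisfies LRSC and LRSS with high probability.
The gradient of $\mathcal{L}_\mu(\theta)$ is
$$\nabla\mathcal{L}_\mu(\theta) = \frac{1}{\sqrt{n}}\left(\left(\frac{\partial\|y - X\theta\|_\mu}{\partial(y - X\theta)}\right)^\top\left(\frac{\partial(y - X\theta)}{\partial\theta}\right)^\top\right)^\top = \frac{X^\top(X\theta - y)}{\max\{\sqrt{n}\mu, \sqrt{n}\|y - X\theta\|_2\}}. \tag{C.4}$$
The Hessian of $\mathcal{L}_\mu(\theta)$ is
$$\nabla^2\mathcal{L}_\mu(\theta) = \frac{1}{n}\frac{\partial(-X^\top z)}{\partial\theta} = \begin{cases} \dfrac{X^\top X}{\sqrt{n}\mu}, & \text{if } \|y - X\theta\|_2 < \mu, \\ \dfrac{1}{\sqrt{n}\|y - X\theta\|_2} X^\top\left(I - \dfrac{(y - X\theta)(y - X\theta)^\top}{\|y - X\theta\|_2^2}\right)X, & \text{otherwise.} \end{cases} \tag{C.5}$$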
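The closed-form gradient (C.4) can be sanity-checked against finite differences. The sketch below assumes the smoothed norm is ‖r‖₂²/(2µ) when ‖r‖₂ < µ and ‖r‖₂ − µ/2 otherwise, with the loss equal to this quantity divided by √n; this definition is our reading of (C.4) and (C.5), not a formula quoted from this appendix, and the data are synthetic.

```python
import numpy as np

def smoothed_norm(r, mu):
    """Assumed smoothing of ||r||_2: quadratic inside the ball of radius mu, shifted norm outside."""
    nr = np.linalg.norm(r)
    return nr ** 2 / (2 * mu) if nr < mu else nr - mu / 2

def loss(theta, X, y, mu):
    return smoothed_norm(y - X @ theta, mu) / np.sqrt(len(y))

def grad(theta, X, y, mu):
    """Closed form (C.4): X^T (X theta - y) / max(sqrt(n) mu, sqrt(n) ||y - X theta||_2)."""
    r = y - X @ theta
    n = len(y)
    return X.T @ (-r) / max(np.sqrt(n) * mu, np.sqrt(n) * np.linalg.norm(r))

def numerical_grad(theta, X, y, mu, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = h
        g[j] = (loss(theta + e, X, y, mu) - loss(theta - e, X, y, mu)) / (2 * h)
    return g

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 4))
theta_true = np.array([1.0, 0.0, -2.0, 0.0])
y = X @ theta_true + 0.5 * rng.standard_normal(30)
theta = rng.standard_normal(4)

# mu = 0.1 puts the residual outside the smoothed region, mu = 1e3 puts it inside.
err_outside = np.max(np.abs(grad(theta, X, y, 0.1) - numerical_grad(theta, X, y, 0.1)))
err_inside = np.max(np.abs(grad(theta, X, y, 1e3) - numerical_grad(theta, X, y, 1e3)))
```

Both regimes are exercised because the proof later argues that the relevant iterates stay outside the smoothed region, where the loss coincides with the SQRT-Lasso loss up to a constant.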
For notational convenience, we define ∆ = v−w for any v,w ∈ B∗s . Also denote the residual of
the first order Taylor expansion as
δLµ(w + ∆,w) = Lµ(w + ∆)− Lµ(w)−∇Lµ(w)>∆.
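Using the first order Taylor expansion of $\mathcal{L}_\mu(\theta)$ at $w$ and the Hessian of $\mathcal{L}_\mu(\theta)$ in (C.5), we have
from the mean value theorem that there exists some $\alpha \in [0, 1]$ such that
$$\delta\mathcal{L}_\mu(w + \Delta, w) = \begin{cases} \dfrac{\Delta^\top X^\top X\Delta}{\sqrt{n}\mu}, & \text{if } \|\xi\|_2 < \mu, \\ \dfrac{1}{\sqrt{n}\|\xi\|_2}\Delta^\top X^\top\left(I - \dfrac{\xi\xi^\top}{\|\xi\|_2^2}\right)X\Delta, & \text{otherwise,} \end{cases}$$
where $\xi = y - X(w + \alpha\Delta)$. For notational simplicity, denote $z = X(v - \theta^*)$ and $\tilde{z} = X(w - \theta^*)$,
which can be considered as two fixed vectors in $\mathbb{R}^n$. Without loss of generality, assume $\|z\|_2 \le \|\tilde{z}\|_2$.
Then we have
$$\|z\|_2^2 \le \|\tilde{z}\|_2^2 \le 2\psi_{\max}n\|w - \theta^*\|_2^2 \le \frac{n\sigma^2}{4}.$$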
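Further, we have
$$\xi = y - X(w + \alpha\Delta) = \varepsilon - X(w + \alpha\Delta - \theta^*) = \varepsilon - \alpha z - (1 - \alpha)\tilde{z}, \quad \text{and} \quad X\Delta = z - \tilde{z}.$$
We have from Wainwright (2015) that
$$\mathbb{P}\left[\|\varepsilon\|_2^2 \le n\sigma^2(1 - \delta)\right] \le \exp\left(-\frac{n\delta^2}{16}\right). \tag{C.6}$$
Then by taking $\delta = 1/3$ in (C.6), we have with probability $1 - \exp\left(-\frac{n}{144}\right)$,
$$\|\xi\|_2 \ge \|\varepsilon\|_2 - \alpha\|z\|_2 - (1 - \alpha)\|\tilde{z}\|_2 \ge \|\varepsilon\|_2 - \|\tilde{z}\|_2 \ge \frac{4}{5}\sqrt{n}\sigma - \frac{1}{2}\sqrt{n}\sigma \ge \frac{1}{4}\sqrt{n}\sigma. \tag{C.7}$$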
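We first discuss the RSS property. From (C.7), we have $\|\xi\|_2 \ge \mu$, so the second case of the expansion applies and
$$\delta\mathcal{L}_\mu(w + \Delta, w) = \frac{1}{\sqrt{n}\|\xi\|_2}\Delta^\top X^\top\left(I - \frac{\xi\xi^\top}{\|\xi\|_2^2}\right)X\Delta = \frac{1}{\sqrt{n}\|\xi\|_2}\left(\|X\Delta\|_2^2 - \frac{(\xi^\top X\Delta)^2}{\|\xi\|_2^2}\right) \le \frac{\|X\Delta\|_2^2}{\sqrt{n}\|\xi\|_2} \le \frac{8\psi_{\max}}{\sigma}\|\Delta\|_2^2,$$
where the last step uses $\|X\Delta\|_2^2 \le 2\psi_{\max}n\|\Delta\|_2^2$ from (C.3) and $\|\xi\|_2 \ge \sqrt{n}\sigma/4$ from (C.7).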
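Next, we verify the RSC property. From (C.7), we have $\|\xi\|_2 \ge \mu$. We want to show that with
high probability, for any constant $a \in (0, 1)$,
$$\left|\frac{\xi^\top}{\|\xi\|_2}X\Delta\right| \le \sqrt{1 - a}\,\|X\Delta\|_2. \tag{C.8}$$
Consequently, we have
$$\Delta^\top X^\top\left(I - \frac{\xi\xi^\top}{\|\xi\|_2^2}\right)X\Delta = \|X\Delta\|_2^2 - \left(\frac{\xi^\top}{\|\xi\|_2}X\Delta\right)^2 \ge a\|X\Delta\|_2^2.$$
This further implies
$$\delta\mathcal{L}_\mu(w + \Delta, w) = \frac{1}{\sqrt{n}\|\xi\|_2}\Delta^\top X^\top\left(I - \frac{\xi\xi^\top}{\|\xi\|_2^2}\right)X\Delta \ge \frac{a\psi_{\min}}{2\|\xi\|_2/\sqrt{n}}\|\Delta\|_2^2. \tag{C.9}$$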
Since ‖z‖2 ≤ ‖z‖2, then for any real constant a ∈ (0, 1),
P[∣∣∣∣ξ>
‖ξ‖2X∆
∣∣∣∣ ≤√
1− a‖X∆‖2]
= P[∣∣∣∣
(ε− αz− (1− α)z)>
‖ε− αz− (1− α)z‖2(z− z)
∣∣∣∣ ≤√
1− a‖z− z‖2]
(i)
≥ P[∣∣∣∣
(ε− z)>(z− z)
‖ε− z‖2
∣∣∣∣ ≤√
1− a‖z− z‖2]
= P[(ε>(z− z)− z>(z− z)
)2≤ (1− a)‖ε− z‖22‖z− z‖22
]
(ii)= P
[∣∣∣∣∣
(ε>(z− z)
‖z− z‖2
)2
+ ‖z‖22 − 2ε>z
∣∣∣∣∣ ≤ (1− a)(‖ε‖22 + ‖z‖22 − 2ε>z)
], (C.10)
where (ii) is from dividing both sides by ‖v‖22, and (i) is from a geometric inspection and the
randomness of ε, i.e., for any α ∈ [0, 1] and ‖z‖2 ≤ ‖z‖2,
∣∣∣∣−z>
‖ − z‖2(z− z)
∣∣∣∣ ≤∣∣∣∣
(−αz− (1− α)z)>
‖ − αz− (1− α)z‖2(z− z)
∣∣∣∣ .
The random vector ε with i.i.d. entries does not affect the inequality above. Let‘s first discuss one
side of the probability in (C.10), i.e.,
P
[(ε>(z− z)
‖z− z‖2
)2
+ ‖z‖22 − 2ε>z ≤ (1− a)(‖ε‖22 + ‖z‖22 − 2ε>z)
]
= P
[(1− a)‖ε‖22 ≥
(ε>(z− z)
‖z− z‖2
)2
+ a(‖z‖22 − 2ε>z)
]. (C.11)
Since ε has i.i.d. sub-Gaussian entries with E[εi] = 0 and E[ε2i ] = σ2 for all i = 1, . . . , n, thenε>(z−z)‖z−z‖2 and ε>z are also zero-mean sub-Gaussians with variances σ2 and σ2‖z‖22 respectively. We
have from Wainwright (2015) that
P[‖ε‖22 ≤ nσ2(1− δ)
]≤ exp
(−nδ
2
16
), (C.12)
P
[(ε>(z− z)
‖z− z‖2
)2
≥ nσ2δ2
]≤ exp
(−nδ
2
2
), (C.13)
P[ε>z ≤ −nσ2δ
]≤ exp
(−n
2σ2δ2
2‖z‖22
). (C.14)
Combining (C.12) – (C.14) with ‖z‖22 ≤ nσ2/4, we have from union bound that with probability
at least 1− exp(− n
144
)− exp
(− n
128
)− exp
(− n
128
)≥ 1− 3 exp
(− n
144
),
‖ε‖22 ≥2
3nσ2,
(ε>(z− z)
‖z− z‖2
)2
≤ 1
64nσ2, − ε>z ≤ 1
16nσ2.
This implies for a ≤ 3/5, we have
ξ>
‖ξ‖2X∆ ≤
√1− a‖X∆‖2.
For the other side of the probability in (C.10), we have
P
[(ε>(z− z)
‖z− z‖2
)2
+ ‖z‖22 − 2ε>z ≥ −(1− a)(‖ε‖22 + ‖z‖22 − 2ε>z)
]
= P
[(1− a)‖ε‖22 ≥ −
(ε>(z− z)
‖z− z‖2
)2
− (2− a)(‖z‖22 − 2ε>z)
]
≥ P
[(1− a)‖ε‖22 ≥
(ε>(z− z)
‖z− z‖2
)2
+ a(‖z‖22 − 2ε>z)
]. (C.15)
Combining (C.10), (C.11) and (C.15), we have that (C.8) holds with high probability, i.e., for any
r > 0,
$$\mathbb{P}\left[\left|\frac{\xi^\top}{\|\xi\|_2}X\Delta\right| \le \sqrt{1 - a}\,\|X\Delta\|_2\right] \ge 1 - 6\exp\left(-\frac{n}{144}\right).$$
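Now we bound $\|\xi\|_2$ to obtain the desired result. From Wainwright (2015), we have
$$\mathbb{P}\left[\|\varepsilon\|_2^2 \ge n\sigma^2(1 + \delta)\right] \le \exp\left(-\frac{n\delta^2}{18}\right) = \exp\left(-\frac{n}{72}\right), \tag{C.16}$$
where we take $\delta = 1/2$. From $\xi = \varepsilon - \alpha z - (1 - \alpha)\tilde{z}$, we have
$$\|\xi\|_2 \le \|\varepsilon\|_2 + \alpha\|z\|_2 + (1 - \alpha)\|\tilde{z}\|_2 \overset{(i)}{\le} \|\varepsilon\|_2 + \|\tilde{z}\|_2 \overset{(ii)}{\le} \sqrt{\frac{3n}{2}}\sigma + \frac{1}{2}\sqrt{n}\sigma, \tag{C.17}$$
where (i) is from $\|z\|_2 \le \|\tilde{z}\|_2$ and (ii) is from (C.16) and $\|\tilde{z}\|_2^2 \le n\sigma^2/4$. Then by the union bound,
setting $a = 1/2$, with probability at least $1 - 7\exp\left(-\frac{n}{144}\right)$, we have
$$\delta\mathcal{L}_\mu(w + \Delta, w) \ge \frac{\psi_{\min}}{8\sigma}\|\Delta\|_2^2.$$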
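Moreover, we also have $r = \frac{\sigma^2}{8\psi_{\max}} > s^*\left(64\sigma\lambda_{N_1}/\psi_{\min}\right)^2 \ge s^*\left(8\lambda_{N_1}/\rho^-_{s^*+s}\right)^2$ for large enough
$n \ge c_1 s^*\log d$, where $\lambda_{N_1} \ge 2\lambda_N = 48\sqrt{\log d/n}$. The choice of the constant "2" in $\lambda_{N_1} \ge 2\lambda_N$ is
somewhat arbitrary; it can be any fixed constant larger than $1/\eta$ such that the existence of $\lambda_{N_1}$
is guaranteed.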
D Proof of Theorem 3.7
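Part 1. We first show that the estimation errors are as claimed. Since $\hat{\theta}_{[K]}$ is the approximate solution
of the K-th path following stage, it satisfies $\omega_{\lambda_K}(\hat{\theta}_{[K]}) \le \lambda_K/4 \le \lambda_{K+1}/2$ for t ∈ [N1 + 1, T − 1], then
we have from Lemma B.3 that
$$\omega_{\lambda_{K+1}}(\theta^{(0)}_{[K+1]}) \le \lambda_{K+1}/2.$$
By Theorem B.2, we have for any t = 1, 2, . . .,
$$\|(\theta^{(t)}_{[K+1]})_{\bar{S}^*}\|_0 \le s.$$
Applying Lemma A.2 recursively, we have
$$\|\hat{\theta}_{[N]} - \theta^*\|_2 \le \frac{2\lambda_N\sqrt{s^*}}{\rho^-_{s^*+2s}} \quad \text{and} \quad \|\hat{\theta}_{[N]} - \theta^*\|_1 \le \frac{12\lambda_N s^*}{\rho^-_{s^*+2s}}.$$
Applying Lemma 3.6 with $\lambda_N = 24\sqrt{\log d/n}$ and $\rho^-_{s^*+2s} = \frac{\psi_{\min}}{8\sigma}$, then by the union bound, with
probability at least $1 - 8\exp\left(-\frac{n}{144}\right) - 2d^{-1}$, we have
$$\|\hat{\theta}_{[N]} - \theta^*\|_2 \le \frac{384\sigma\sqrt{s^*\log d/n}}{\psi_{\min}},$$
$$\|\hat{\theta}_{[N]} - \theta^*\|_1 \le \frac{2304\sigma s^*\sqrt{\log d/n}}{\psi_{\min}}.$$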
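Part 2. Next, we demonstrate the result on the estimation of the variance. Let $\bar{\theta}_{[N]} = \arg\min_\theta \mathcal{F}_{\mu,\lambda_N}(\theta)$
be the optimal solution of the N-th stage. Applying the argument in Part 1 recursively, we have
$$\|\bar{\theta}_{[N]} - \theta^*\|_1 \le \frac{2304\sigma s^*\sqrt{\log d/n}}{\psi_{\min}}. \tag{D.1}$$
Denote $c_1, c_2, \ldots$ as positive universal constants. Then we have
$$\begin{aligned} \mathcal{L}_\mu(\bar{\theta}_{[N]}) - \mathcal{L}_\mu(\theta^*) &\le \lambda_N\left(\|\theta^*\|_1 - \|\bar{\theta}_{[N]}\|_1\right) \le \lambda_N\left(\|\theta^*_{S^*}\|_1 - \|(\bar{\theta}_{[N]})_{S^*}\|_1 - \|(\bar{\theta}_{[N]})_{\bar{S}^*}\|_1\right) \\ &\le \lambda_N\|(\bar{\theta}_{[N]} - \theta^*)_{S^*}\|_1 \le \lambda_N\|\bar{\theta}_{[N]} - \theta^*\|_1 \overset{(i)}{\le} c_1\frac{\sigma s^*\log d}{n}, \end{aligned} \tag{D.2}$$
where (i) is from the value of $\lambda_N$ and the $\ell_1$ error bound in (D.1).
On the other hand, from the convexity of $\mathcal{L}_\mu(\theta)$, we have
$$\begin{aligned} \mathcal{L}_\mu(\bar{\theta}_{[N]}) - \mathcal{L}_\mu(\theta^*) &\ge (\bar{\theta}_{[N]} - \theta^*)^\top\nabla\mathcal{L}_\mu(\theta^*) \ge -\|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty\|\bar{\theta}_{[N]} - \theta^*\|_1 \\ &\overset{(i)}{\ge} -c_2\lambda_N\|\bar{\theta}_{[N]} - \theta^*\|_1 \overset{(ii)}{\ge} -c_3\frac{\sigma s^*\log d}{n}, \end{aligned} \tag{D.3}$$
where (i) is from Assumption 3.2 and (ii) is from the value of $\lambda_N$ and the $\ell_1$ error bound in (D.1).
For our choice of $\mu$ and $n$, we have $\mathcal{L}_\mu(\theta) = \frac{1}{\sqrt{n}}\|y - X\theta\|_2 - \frac{\mu}{2}$ by Proposition 3.8, then
$$\mathcal{L}_\mu(\bar{\theta}_{[N]}) - \mathcal{L}_\mu(\theta^*) = \frac{\|y - X\bar{\theta}_{[N]}\|_2}{\sqrt{n}} - \frac{\|\varepsilon\|_2}{\sqrt{n}}. \tag{D.4}$$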
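From Wainwright (2015), we have for any $\delta > 0$,
$$\mathbb{P}\left[\left|\frac{\|\varepsilon\|_2^2}{n} - \sigma^2\right| \ge \sigma^2\delta\right] \le 2\exp\left(-\frac{n\delta^2}{18}\right). \tag{D.5}$$
Combining (D.2), (D.3), (D.4) and (D.5) with $\delta^2 = c_3\frac{s^*\log d}{n}$, we have with high probability,
$$\left|\frac{\|y - X\bar{\theta}_{[N]}\|_2}{\sqrt{n}} - \sigma\right| = O\left(\frac{\sigma s^*\log d}{n}\right). \tag{D.6}$$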
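From Part 1, for $n \ge c_4 s^*\log d$, we have with high probability,
$$\|\hat{\theta}_{[N]} - \theta^*\|_2 \le \frac{384\sigma\sqrt{s^*\log d/n}}{\psi_{\min}} \le \frac{\sigma}{2\sqrt{2\psi_{\max}}},$$
then $\hat{\theta}_{[N]} \in \mathcal{B}_r^{s^*+s}$ and $\|\hat{\theta}_{[N]} - \bar{\theta}_{[N]}\|_0 \le s^* + 2s$. Then from the analysis of Theorem B.2, we have
$$\omega_{\lambda_K}(\theta^{(t+1)}_{[K]}) \le (1 + \kappa_{s^*+2s})\sqrt{4\rho^+_{s^*+2s}\left(\mathcal{F}_{\mu,\lambda_K}(\theta^{(t)}_{[K]}) - \mathcal{F}_{\mu,\lambda_K}(\bar{\theta}_{[K]})\right)} \le \varepsilon_N.$$
This implies
$$\mathcal{F}_{\mu,\lambda_K}(\theta^{(t)}_{[K]}) - \mathcal{F}_{\mu,\lambda_K}(\bar{\theta}_{[K]}) \le \frac{\varepsilon_N^2}{4\rho^+_{s^*+2s}(1 + \kappa_{s^*+2s})^2}. \tag{D.7}$$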
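On the other hand, from the LRSC property of $\mathcal{L}_\mu$, the convexity of the $\ell_1$ norm and the optimality of $\bar{\theta}_{[K]}$, we
have
$$\mathcal{F}_{\mu,\lambda_K}(\theta^{(t)}_{[K]}) - \mathcal{F}_{\mu,\lambda_K}(\bar{\theta}_{[K]}) \ge \rho^-_{s^*+2s}\|\hat{\theta}_{[N]} - \bar{\theta}_{[N]}\|_2^2. \tag{D.8}$$
Combining (D.7), (D.8) and Assumption 3.5, we have
$$\frac{\|X(\hat{\theta}_{[N]} - \bar{\theta}_{[N]})\|_2}{\sqrt{n}} \le \sqrt{\frac{8\rho^+_{s^*+2s}}{\sigma}}\,\|\hat{\theta}_{[N]} - \bar{\theta}_{[N]}\|_2 \le \sqrt{\frac{2}{\sigma\rho^-_{s^*+2s}}}\cdot\frac{\varepsilon_N}{1 + \kappa_{s^*+2s}} \le \frac{4\varepsilon_N}{(1 + \kappa_{s^*+2s})\sqrt{\psi_{\min}}}. \tag{D.9}$$
Combining (D.6) and (D.9), we have
$$\left|\frac{\|y - X\hat{\theta}_{[N]}\|_2}{\sqrt{n}}\right| \le \left|\frac{\|y - X\bar{\theta}_{[N]}\|_2}{\sqrt{n}}\right| + \frac{\|X(\hat{\theta}_{[N]} - \bar{\theta}_{[N]})\|_2}{\sqrt{n}} \le \left|\frac{\|y - X\bar{\theta}_{[N]}\|_2}{\sqrt{n}}\right| + \frac{4\varepsilon_N}{(1 + \kappa_{s^*+2s})\sqrt{\psi_{\min}}}.$$
If $\varepsilon_N \le c_5\frac{\sigma s^*\log d}{n}$ for some constant $c_5$, then we have the desired result.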
E Proof of Proposition 3.8
Let θ[K] and θ[K] be the unique global optima of (1.3) and (1.4) respectively for all K = N1+1, . . . , N .
Then, we show θ[K] = θ[K] under the proposed conditions. Apply the argument of the proof of
Theorem 3.7 recursively, we have with probability at least 1− 8 exp(− n
144
)− 2d−1,
‖θ[K] − θ∗‖2 ≤384σ
√s∗ log d/n
ψmin.
By SE condition of X in Assumption 3.3, this implies
‖X(θ[K] − θ∗)‖2 ≤384σ
√2ψmaxs∗ log d
ψmin. (E.1)
On the other hand, we have
‖y −Xθ[K]‖2 = ‖X(θ[K] − θ∗) + ε‖2 ≥ ‖ε‖2 − ‖X(θ[K] − θ∗)‖2. (E.2)
Since ε has i.i.d. sub-Gaussian entries with E[εi] = 0 and E[ε2i ] = σ2 for all i = 1, . . . , n, we have
from Wainwright (2015) that
P[‖ε‖22 ≤
2
3nσ2
]≤ exp
(− n
144
), (E.3)
Combining (E.1) and (E.2), (E.3) and the condition on n ≥ c4s∗ log d for some constant c4, we have
with probability at least 1− 9 exp(− n
144
)− 2d−1,
‖y −Xθ[K]‖2 ≥√nσ
(√2
3− 384σ
√2ψmaxs∗ log d/n
ψmin
)>
√nσ
4≥ µ.
This implies Fµ,λ(θ) = Fµ(θ) + µ2 , thus argminθ Fµ,λ(θ) = argminθ Fµ(θ), i.e., θT = θ[K]. Besides
this also implies ‖y −Xθ∗‖2 >√nσ4 ≥ µ, i.e., θ∗ is not in the smoothed region.
Applying the same argument again to θ[K], we have that for large enough n, with high probability,
‖y −Xθ[K]‖2 >√nσ
4≥ µ,
and r = σ2
8ψmax> s∗ (64σλN1/ψmin)2 ≥ s∗
(8λN1/ρ
−s∗+s
)2is guaranteed, where λN1 ≥ 2λN =
48√
log d/n. By Lemma B.4, this implies the existence of the linear convergence region, which does
not fall into the smoothed region. Besides, θ[K] is not in the smoothed region.
F Proof of Theorem B.1
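The sublinear rate of convergence (B.1) follows directly from Lemma A.7. In terms of the optimal
residual, we have
$$\begin{aligned} \omega^2_{\lambda_K}(\theta^{(t+1)}_{[K]}) &\overset{(i)}{\le} \left(L^{(t)}_{[K]} + S_{L^{(t)}_{[K]}}(\theta^{(t)}_{[K]})\right)^2\|\theta^{(t+1)}_{[K]} - \theta^{(t)}_{[K]}\|_2^2 \\ &\overset{(ii)}{\le} \left(L^{(t)}_{[K]} + \frac{\|X\|_2^2}{\sqrt{n}\mu}\right)^2\|\theta^{(t+1)}_{[K]} - \theta^{(t)}_{[K]}\|_2^2 \\ &\overset{(iii)}{\le} \frac{2\left(L^{(t)}_{[K]} + \|X\|_2^2/(\sqrt{n}\mu)\right)^2}{k - m + 1}\cdot\frac{\sum_{i=m}^{k}\left(\mathcal{F}_{\mu,\lambda_K}(\theta^{(i)}_{[K]}) - \mathcal{F}_{\mu,\lambda_K}(\theta^{(i+1)}_{[K]})\right)}{\min_{i\in[m,\ldots,k]}L^{(i)}_{[K]}} \\ &\overset{(iv)}{\le} \frac{18\|X\|_2^4}{(k - m + 1)n\mu^2}\cdot\frac{\mathcal{F}_{\mu,\lambda_K}(\theta^{(m)}_{[K]}) - \mathcal{F}_{\mu,\lambda_K}(\theta^{(t+1)}_{[K]})}{L_\mu} \\ &\le \frac{18\|X\|_2^4}{(k - m + 1)n\mu^2}\cdot\frac{\mathcal{F}_{\mu,\lambda_K}(\theta^{(m)}_{[K]}) - \mathcal{F}_{\mu,\lambda_K}(\bar{\theta}_{[K]})}{L_\mu} \\ &\overset{(v)}{\le} \frac{72R^2\|X\|_2^6}{L_\mu(k - m + 1)(m + 2)n^{3/2}\mu^3} \overset{(vi)}{\le} \frac{288R^2\|X\|_2^6}{L_\mu(k + 3)^2 n^{3/2}\mu^3}, \end{aligned} \tag{F.1}$$
where (i) is from Lemma A.9, (ii) is from $S_{L^{(t)}_{[K]}}(\theta^{(t)}_{[K]}) \le L_\mu \le \|X\|_2^2/(\sqrt{n}\mu)$ in Lemma A.9 and
Lemma A.7, (iii) is from (A.16) in Lemma A.8, (iv) is from $L_\mu \le L^{(t)}_{[K]} \le 2L_\mu \le 2\|X\|_2^2/(\sqrt{n}\mu)$ in
Remark A.1 and Lemma 3.6, (v) is from Lemma A.7 and (vi) is obtained by choosing $m = \lfloor k/2\rfloor$.
To achieve the approximate KKT condition $\omega_{\lambda_K}(\theta^{(t)}_{[K]}) \le \varepsilon_K$, we require the R.H.S. of (F.1) to be
no greater than $\varepsilon_K^2$, then we have the desired result (B.2).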
G Proof of Theorem B.2
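Note that the RSS property implies that the line search terminates when $L^{(t)}_{[K]}$ satisfies
$$\rho^+_{s^*+2s} \le L^{(t)}_{[K]} \le 2\rho^+_{s^*+2s}. \tag{G.1}$$
Since the initialization $\theta^{(0)}_{[K]}$ satisfies $\omega_{\lambda_K}(\theta^{(0)}_{[K]}) \le \lambda_K/2$ with $\|(\theta^{(0)}_{[K]})_{\bar{S}^*}\|_0 \le s$, by Lemma A.2
the objective satisfies
$$\mathcal{F}_{\mu,\lambda_K}(\theta^{(0)}_{[K]}) - \mathcal{F}_{\mu,\lambda_K}(\theta^*) \le \frac{6\lambda_K^2 s^*}{\rho^-_{s^*+2s}}.$$
Then by Lemma A.4, we have
$$\|(\theta^{(1)}_{[K]})_{\bar{S}^*}\|_0 \le s.$$
By monotone decrease of $\mathcal{F}_{\mu,\lambda_K}(\theta^{(t)}_{[K]})$ from (A.16) in Lemma A.8 and recursively applying
Lemma A.4, $\|(\theta^{(t)}_{[K]})_{\bar{S}^*}\|_0 \le s$ holds in (B.3) for any t = 1, 2, . . ..
For the objective error, we have
$$\mathcal{F}_{\mu,\lambda_K}(\theta^{(t)}_{[K]}) - \mathcal{F}_{\mu,\lambda_K}(\bar{\theta}_{[K]}) \overset{(i)}{\le} \left(1 - \frac{1}{8\kappa_{s^*+2s}}\right)^t\left(\mathcal{F}_{\mu,\lambda_K}(\theta^{(0)}_{[K]}) - \mathcal{F}_{\mu,\lambda_K}(\bar{\theta}_{[K]})\right) \overset{(ii)}{\le} \left(1 - \frac{1}{8\kappa_{s^*+2s}}\right)^t\frac{24\lambda_K s^*\omega_{\lambda_K}(\theta^{(t)}_{[K]})}{\rho^-_{s^*+2s}}, \tag{G.2}$$
where (i) is from Lemma A.6, and (ii) is from Lemma A.5 with both regularization parameters equal to $\lambda_K$ and $\omega_{\lambda_K}(\theta^{(t+1)}_{[K]}) \le
\lambda_K/2 \le \lambda_K$, which results in (B.3).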
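Combining (G.2) and (A.4), we have
$$\begin{aligned} \|\theta^{(t)}_{[K]} - \bar{\theta}_{[K]}\|_2^2 &\le \frac{1}{\rho^-_{s^*+2s}}\left(\mathcal{F}_{\mu,\lambda_K}(\theta^{(t)}_{[K]}) - \mathcal{F}_{\mu,\lambda_K}(\bar{\theta}_{[K]}) - (\theta^{(t)}_{[K]} - \bar{\theta}_{[K]})^\top\nabla\mathcal{F}_{\mu,\lambda_K}(\bar{\theta}_{[K]})\right) \\ &= \frac{1}{\rho^-_{s^*+2s}}\left(\mathcal{F}_{\mu,\lambda_K}(\theta^{(t)}_{[K]}) - \mathcal{F}_{\mu,\lambda_K}(\bar{\theta}_{[K]})\right) \le \left(1 - \frac{1}{8\kappa_{s^*+2s}}\right)^t\frac{24\lambda_K s^*\omega_{\lambda_K}(\theta^{(t)}_{[K]})}{(\rho^-_{s^*+2s})^2}. \end{aligned}$$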
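For the optimal residue $\omega_{\lambda_K}(\theta^{(t+1)}_{[K]})$ of the (t + 1)-th iteration of the K-th path following stage, we have
$$\begin{aligned} \omega_{\lambda_K}(\theta^{(t+1)}_{[K]}) &\overset{(i)}{\le} \left(L^{(t)}_{[K]} + S_{L^{(t)}_{[K]}}(\theta^{(t)}_{[K]})\right)\|\theta^{(t+1)}_{[K]} - \theta^{(t)}_{[K]}\|_2 \\ &\overset{(ii)}{\le} \left(L^{(t)}_{[K]} + \rho^+_{s^*+2s}\right)\|\theta^{(t+1)}_{[K]} - \theta^{(t)}_{[K]}\|_2 \\ &\overset{(iii)}{\le} L^{(t)}_{[K]}\left(1 + \frac{\rho^+_{s^*+2s}}{\rho^-_{s^*+2s}}\right)\|\theta^{(t+1)}_{[K]} - \theta^{(t)}_{[K]}\|_2 \\ &\overset{(iv)}{\le} L^{(t)}_{[K]}\left(1 + \frac{\rho^+_{s^*+2s}}{\rho^-_{s^*+2s}}\right)\sqrt{\frac{2\left(\mathcal{F}_{\mu,\lambda_K}(\theta^{(t)}_{[K]}) - \mathcal{F}_{\mu,\lambda_K}(\theta^{(t+1)}_{[K]})\right)}{L^{(t)}_{[K]}}} \\ &\overset{(v)}{\le} (1 + \kappa_{s^*+2s})\sqrt{4\rho^+_{s^*+2s}\left(\mathcal{F}_{\mu,\lambda_K}(\theta^{(t)}_{[K]}) - \mathcal{F}_{\mu,\lambda_K}(\bar{\theta}_{[K]})\right)} \\ &\overset{(vi)}{\le} (1 + \kappa_{s^*+2s})\sqrt{96\lambda_K^2 s^*\kappa_{s^*+2s}\left(1 - \frac{1}{8\kappa_{s^*+2s}}\right)^t}, \end{aligned} \tag{G.3}$$
where (i) is from Lemma A.9, (ii) is from $S_{L^{(t)}_{[K]}}(\theta^{(t)}_{[K]}) \le \rho^+_{s^*+2s}$, (iii) is from $\rho^-_{s^*+2s} \le L^{(t)}_{[K]}$ in (G.1),
(iv) is from (A.16) in Lemma A.8, (v) is from $L^{(t)}_{[K]} \le 2\rho^+_{s^*+2s}$ in (G.1) and monotone decrease of
$\mathcal{F}_{\mu,\lambda_K}(\theta^{(t)}_{[K]})$ from (A.16) in Lemma A.8, and (vi) is from (G.2) and $\kappa_{s^*+2s} = \frac{\rho^+_{s^*+2s}}{\rho^-_{s^*+2s}}$.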
For the K-th path following stage, K = 1, . . . , N − 1, to have $\omega_{\lambda_K}(\theta^{(t+1)}_{[K]}) \le \lambda_K/4$, we set the R.H.S.
of (G.3) to be no greater than $\lambda_K/4$, which holds once the number of iterations reaches the
bound in (B.4). For the last (N-th) path following stage, we need $\omega_{\lambda_N}(\hat{\theta}_{[N]}) \le \varepsilon_N \le \lambda_N/4$.
We set the R.H.S. of (G.3) to be no greater than $\varepsilon_N$, which holds once the number of
iterations reaches the bound in (B.5).
H Proof of Lemma B.3
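Since $\omega_{\lambda_{K-1}}(\hat{\theta}_{[K-1]}) \le \lambda_{K-1}/4$, there exists some subgradient $g \in \partial\|\hat{\theta}_{[K-1]}\|_1$ such that
$$\|\nabla\mathcal{L}_\mu(\hat{\theta}_{[K-1]}) + \lambda_{K-1}g\|_\infty \le \lambda_{K-1}/4. \tag{H.1}$$
By the definition of $\omega_{\lambda_K}(\cdot)$, we have
$$\begin{aligned} \omega_{\lambda_K}(\hat{\theta}_{[K-1]}) &\le \|\nabla\mathcal{L}_\mu(\hat{\theta}_{[K-1]}) + \lambda_K g\|_\infty = \|\nabla\mathcal{L}_\mu(\hat{\theta}_{[K-1]}) + \lambda_{K-1}g + (\lambda_K - \lambda_{K-1})g\|_\infty \\ &\le \|\nabla\mathcal{L}_\mu(\hat{\theta}_{[K-1]}) + \lambda_{K-1}g\|_\infty + |\lambda_K - \lambda_{K-1}|\cdot\|g\|_\infty \\ &\overset{(i)}{\le} \lambda_{K-1}/4 + (1 - \eta)\lambda_{K-1} \overset{(ii)}{\le} \lambda_K/2, \end{aligned}$$
where (i) is from (H.1) and the choice of $\lambda_K$, and (ii) is from the condition on $\eta$.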
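For readers who want the arithmetic behind step (ii) spelled out, the following short derivation shows where the threshold 5/6 on η comes from. It uses only $\lambda_K = \eta\lambda_{K-1}$ from Lemma B.3.

```latex
\begin{align*}
\frac{\lambda_{K-1}}{4} + (1-\eta)\lambda_{K-1}
  = \Big(\frac{5}{4} - \eta\Big)\lambda_{K-1}
  \le \frac{\eta}{2}\,\lambda_{K-1}
  = \frac{\lambda_K}{2}
\quad\Longleftrightarrow\quad
\frac{5}{4} \le \frac{3}{2}\,\eta
\quad\Longleftrightarrow\quad
\eta \ge \frac{5}{6}.
\end{align*}
```

Any decay factor in (5/6, 1) therefore keeps the warm start inside the λ_K/2 tolerance required by Theorem B.2.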
I Proof of Lemma B.4
Part 1. We first show ‖θ − θ∗‖22 ≤ r by contradiction. Suppose ‖θ − θ∗‖2 >√r. Let α ∈ [0, 1]
such that θ = (1− α)θ + αθ∗ and
‖θ − θ∗‖2 =√r. (I.1)
Let g = argming∈∂‖θ‖1 ‖∇Lµ(θ) + λg‖∞ and ∆ = θ − θ∗, then we have
Fµ,λ(θ∗)(i)
≥ Fµ,λ(θ)− (∇Lµ(θ) + λg)>∆ ≥ Fµ,λ(θ)− ‖∇Lµ(θ) + λg‖∞‖∆‖1(ii)
≥ Fµ,λ(θ)− λ
4‖∆‖1, (I.2)
where (i) is from the convexity of Fµ,λ(θ) and (ii) is from the approximate KKT condition.
Denote ∆ = θ − θ∗. Combining (I.2) and (I.1), we have
Fµ,λ(θ)(i)
≤ (1− α)Fµ,λ(θ) + αFµ,λ(θ∗) ≤ (1− α)Fµ,λ(θ∗) +(1− α)λ
4‖∆‖1 + αFµ,λ(θ∗)
≤ Fµ,λ(θ∗) +λ
4‖(1− α)(θ − θ∗)‖1 = Fµ,λ(θ∗) +
λ
4‖(1− α)θ + αθ∗ − θ∗)‖1
= Fµ,λ(θ∗) +λ
4‖θ − θ∗‖1 = Fµ,λ(θ∗) +
λ
4‖∆‖1.
where (i) is from the convexity of Fµ,λ(θ). This indicates
Lµ(θ)− Lµ(θ∗) ≤ λ(‖θ∗‖1 − ‖θ‖1 +1
4‖∆‖1)
= λ(‖θ∗S∗‖1 − ‖θS∗‖1 − ‖θS∗‖1 +1
4‖∆S∗‖1 +
1
4‖∆S∗‖1)
≤ λ(‖θ∗S∗ − θS∗‖1 − ‖θS∗ − θ∗S∗‖1 +1
4‖∆S∗‖1 +
1
4‖∆S∗‖1)
=5λ
4‖∆S∗‖1 −
3λ
4‖∆S∗‖1. (I.3)
On the other hand, we have
Lµ(θ)− Lµ(θ∗)(i)
≥ ∇Lµ(θ∗)∆ ≥ −‖∇Lµ(θ∗)‖∞‖∆‖1(ii)
≥ −λ6‖∆‖1
= −λ6‖∆S∗‖1 −
λ
6‖∆S∗‖1, (I.4)
where (i) is from the convexity of Lµ(θ), (ii) is from Assumption 3.2. Combining (I.3) and (I.4), we
have
‖∆S∗‖1 ≤5
2‖∆S∗‖1. (I.5)
Next, we consider the following sequence of sets:
S0 =
j ∈ S
∗:∑
m∈S∗1(θm ≥ θj) ≤ s
and
Si =
j ∈ S∗\
⋃
k<i
Sk :∑
m∈S∗\⋃k<i Sk
1(θm ≥ θj) ≤ s
for all i = 1, 2, . . . .
We introduce a result from Bühlmann and van de Geer (2011) with its proof provided therein.
Lemma I.1 (Adapted from Lemma 6.9 in Bühlmann and van de Geer (2011) by setting q = 2).
Let $v = [v_1, v_2, \ldots]^\top$ with $v_1 \ge v_2 \ge \ldots \ge 0$. For any $s \in \{1, 2, \ldots\}$, we have
$$\left(\sum_{j \ge s+1} v_j^2\right)^{1/2} \le \sum_{k=1}^{\infty}\left(\sum_{j=ks+1}^{(k+1)s} v_j^2\right)^{1/2} \le \frac{\|v\|_1}{\sqrt{s}}.$$
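The two inequalities of Lemma I.1 are easy to check numerically on a finite vector. The sketch below sorts the magnitudes in non-increasing order, splits everything after the first s entries into consecutive blocks of length s, and returns the three quantities in the chain; the function name `block_bounds` and the random input are illustrative.

```python
import numpy as np

def block_bounds(v, s):
    """Return (tail l2 norm, sum of block l2 norms, ||v||_1 / sqrt(s)) for sorted |v|."""
    v = np.sort(np.abs(np.asarray(v, dtype=float)))[::-1]   # non-increasing order
    tail = np.linalg.norm(v[s:])                             # entries s+1, s+2, ...
    # Blocks k = 1, 2, ... cover entries ks+1, ..., (k+1)s in 1-based indexing.
    blocks = sum(np.linalg.norm(v[k:k + s]) for k in range(s, len(v), s))
    return tail, blocks, v.sum() / np.sqrt(s)

rng = np.random.default_rng(2)
v = rng.standard_normal(40)
tail, blocks, l1_bound = block_bounds(v, 5)
```

The second inequality holds because every entry of a block is at most the average magnitude of the preceding block, which is the mechanism the proof of Lemma B.4 uses to control the error outside the leading coordinates.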
Denote A = S∗ ∪ S0. Then we have
∑
i≥1
‖∆Si‖2(i)
≤ 1√s‖∆S∗‖1
(ii)
≤ 5
2
√s∗
s‖∆S∗‖2 ≤
5
2
√s∗
s‖∆A‖2, (I.6)
where (i) is from Lemma I.1 with s = s and (ii) is from (I.5). Let θ = (1 − β)θ + βθ∗ for any
β ∈ [0, 1]. Then we have
‖θ − θ∗‖2 = (1− β)‖θ − θ∗‖2 ≤√r,
which implies Lµ(θ) satisfies RSC/RSS for θ restricted on a sparse set by Assumption 3.3. Then we
have
|∆>A∇A,ALµ(θ)∆A| ≤∑
i≥1
|∆>Si∇Si,ALµ(θ)∆A| ≤ ρ+s∗+s‖∆A‖2
∑
i≥1
‖∆Si‖2
(i)
≤ 5
2
√s∗
sρ+s∗+s‖∆A‖22, (I.7)
where (i) is from (I.6). On the other hand, we have from RSC
∆>A∇A,ALµ(θ)∆A ≥ ρ−s∗+s‖∆A‖22. (I.8)
Then we have w.h.p.
∆∇Lµ(θ)∆ = ∆>A∇A,ALµ(θ)∆A + 2∆>A∇A,ALµ(θ)∆A + ∆>A∇A,ALµ(θ)∆A
≥ ∆>A∇A,ALµ(θ)∆A − 2|∆>A∇A,ALµ(θ)∆A|(i)
≥(ρ−s∗+s − 5
√s∗
sρ+s∗+s
)‖∆A‖22
(ii)
≥ 9
14ρ−s∗+s‖∆A‖22,
where (i) is from (I.7) and (I.8), (ii) is from Assumption 3.3. This implies
Lµ(θ)− Lµ(θ∗) = ∇Lµ(θ∗)>∆ +1
2∆∇Lµ(θ)∆ ≥ ∇Lµ(θ∗)>∆ +
9
28ρ−s∗+s‖∆A‖22
(i)
≥ 9
28ρ−s∗+s‖∆A‖22 −
λ
6‖∆S∗‖1 −
λ
6‖∆S∗‖1, (I.9)
where (i) is from Assumption 3.2. Combining (I.3) and (I.9), we have
ρ−s∗+s‖∆S∗‖22 ≤ ρ−s∗+s‖∆A‖22 ≤8
3λ‖∆S∗‖1 ≤
8
3λ√s∗‖∆S∗‖2 ≤
8
3λ√s∗‖∆A‖2.
This implies
‖∆S∗‖2 ≤ ‖∆A‖2 ≤8λ√s∗
3ρ−s∗+sand ‖∆S∗‖1 ≤
8λs∗
3ρ−s∗+s. (I.10)
Then we have
‖∆A‖2 ≤∑
i≥1
‖∆Si‖2(i)
≤ 1√s‖∆S∗‖1
(ii)
≤ 5
2
√1
s∗‖∆S∗‖1
(iii)
≤ 20λ√s∗
3ρ−s∗+s, (I.11)
where (i) is from Lemma I.1 with s = s, (ii) is from (I.5) and s ≥ s∗ and (iii) is from (I.10).
Combining (I.10) and (I.11), we have
‖∆‖2 =
√‖∆A‖22 + ‖∆A‖22 ≤
8λ√s∗
ρ−s∗+s<√r.
This conflicts with (I.1), which indicates that ‖θ − θ∗‖2 ≤√r.
Part 2. We next demonstrate the sparsity of θ. From λ > λN ≥ 6‖∇Lµ(θ∗)‖∞, then we have∣∣∣∣i ∈ S∗ : |∇iLµ(θ∗)| ≥ λ
6
∣∣∣∣ = 0. (I.12)
Denote S1 =i ∈ S∗ : |∇iLµ(θ)−∇iLµ(θ∗)| ≥ 2λ
3
and s1 = |S1|. Then there exists some b ∈ Rd
such that ‖b‖∞ = 1, ‖b‖0 ≤ s1 and b>(∇Lµ(θ) − ∇Lµ(θ∗)) ≥ 2λs13 . Then by the mean value
theorem, we have for some θ = (1− α)θ + αθ∗ with α ∈ [0, 1], ∇Lµ(θ)−∇Lµ(θ∗) = ∇2Lµ(θ)∆,
where ∆ = θ − θ∗. Then we have
2λs1
3≤ b>∇2Lµ(θ)∆
(i)
≤√b>∇2Lµ(θ)b
√∆>∇2Lµ(θ)∆
(ii)
≤√s1ρ
+s1
√∆>(∇Lµ(θ)−∇Lµ(θ∗)), (I.13)
where (i) is from the generalized Cauchy-Schwarz inequality, (ii) is from the definition of RSS and
the fact that ‖b‖2 ≤√s1‖b‖∞ =
√s1. Let g achieve ming∈∂‖θ‖1 Fµ,λ(θ). Further, we have
∆>(∇Lµ(θ)−∇Lµ(θ∗)) ≤ ‖∆‖1‖∇Lµ(θ)−∇Lµ(θ∗)‖∞≤ ‖∆‖1(‖∇Lµ(θ∗)‖∞ + ‖∇Lµ(θ)‖∞)
≤ ‖∆‖1(‖∇Lµ(θ∗)‖∞ + ‖∇Lµ(θ) + λg‖∞ + λ‖g‖∞)
(i)
≤ 28λs∗
3ρ−s∗+s(λ
6+λ
4+ λ) ≤ 14λ2s∗
ρ−s∗+s, (I.14)
where (i) is from combining (I.5) and (I.10), condition on λ, approximate KKT condition and
‖g‖∞ ≤ 1. Combining (I.13) and (I.14), we have 2√s1
3 ≤√
14ρ+s1s∗
ρ−s∗+s
, which further implies
s1 ≤32ρ+
s1s∗
ρ−s∗+s≤ 32κs∗+2ss
∗ ≤ s. (I.15)
For any v ∈ Rd that satisfies ‖v‖0 ≤ 1, we have
S2 =
i ∈ S∗ :
∣∣∣∣∇iLµ(θ) +λ
4vi
∣∣∣∣ ≥5λ
6
⊆i ∈ S∗ : |∇iLµ(θ∗)| ≥ λ
6
⋃S1.
Then we have |S2| ≤ |S1| ≤ s. Since for any i ∈ S∗ and∣∣∇iLµ(θ) + λ
4vi∣∣ < 5λ
6 , we can find gi that
satisfies |gi| ≤ 1 such that ∇iLµ(θ) + λ4vi + λgi = 0 which implies θi = 0, then we have
∣∣∣∣i ∈ S∗ :
∣∣∣∣∇iLµ(θ) +λ
4vi
∣∣∣∣ <5λ
6
∣∣∣∣ = 0.
Therefore, we have ‖θS∗‖0 ≤ |S2| ≤ s.
J Proof of Lemma B.5
Let $n^{(t)}$ be the number of proximal gradient steps in the t-th iteration of PIS2TA. Then we have
$$L^{(t+1)} \le 2L^{(t)}\left(\frac{1}{2}\right)^{n^{(t)}-1}.$$
This indicates
$$n^{(t)} \le 2 + \log_2\frac{L^{(t)}}{L^{(t+1)}}.$$
Then we have
$$\sum_{t=0}^{T_K} n^{(t)} \le 2(T_K+1) + \log_2\frac{L^{(0)}}{L^{(T_K+1)}}.$$
We obtain the desired result by $L^{(0)} = L_{\max}$ and $L^{(T_K+1)} \ge \rho^+_{s^*+2s}$.
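The telescoping step in this proof can be checked numerically. The following is a minimal sketch, not part of the original analysis: the sequence `L_seq` is a hypothetical list of values $L^{(0)}, \dots, L^{(T_K+1)}$ chosen only for illustration. It sums the per-iteration bound $2 + \log_2(L^{(t)}/L^{(t+1)})$ and compares the total with the closed form $2(T_K+1) + \log_2(L^{(0)}/L^{(T_K+1)})$.

```python
import math

def summed_step_bound(L_seq):
    """Sum the per-iteration bounds 2 + log2(L[t] / L[t+1]) for t = 0..T."""
    return sum(2 + math.log2(L_seq[t] / L_seq[t + 1])
               for t in range(len(L_seq) - 1))

def telescoped_step_bound(L_seq):
    """Closed form 2(T + 1) + log2(L[0] / L[T+1]) of the same sum."""
    T = len(L_seq) - 2
    return 2 * (T + 1) + math.log2(L_seq[0] / L_seq[-1])

# Hypothetical values of L^(0), ..., L^(T+1); not taken from the paper.
L_seq = [64.0, 16.0, 32.0, 8.0, 4.0]
total = summed_step_bound(L_seq)
closed = telescoped_step_bound(L_seq)
```

The intermediate values cancel in the sum, so only the first and last entries of the sequence affect the total.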
K Intermediate Results of Theorem 4.3
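We start with some preliminaries. For any $\mathcal{S} \subset \{1,\dots,d\}$ with $|\mathcal{S}| \le s^*$, we denote the cones
$$\mathcal{C}^\nu_{\mathcal{S}} = \left\{ x \in \mathbb{R}^d : \|x_{\bar{\mathcal{S}}}\|_1 \le \nu\|x_{\mathcal{S}}\|_1 \right\} \quad\text{and}\quad \mathcal{C}^\nu_{s^*} = \bigcup_{\mathcal{S}\subset\{1,\dots,d\},\,|\mathcal{S}|\le s^*} \mathcal{C}^\nu_{\mathcal{S}},$$
where $\bar{\mathcal{S}}$ denotes the complement of $\mathcal{S}$.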
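As a small illustration of the cone condition, the sketch below tests whether a vector lies in the cone for a given index set, assuming the cone is defined by $\|x_{\bar{\mathcal{S}}}\|_1 \le \nu\|x_{\mathcal{S}}\|_1$. The function name `in_cone` and the 0-based indexing are choices made here for illustration, not notation from the paper.

```python
def in_cone(x, S, nu):
    """Return True if the l1 mass of x off S is at most nu times its l1 mass on S.

    x  : list of floats
    S  : iterable of 0-based indices (the candidate support)
    nu : nonnegative cone parameter
    """
    S = set(S)
    on_support = sum(abs(x[j]) for j in S)
    off_support = sum(abs(v) for j, v in enumerate(x) if j not in S)
    return off_support <= nu * on_support

# Example vectors (hypothetical): most of the mass sits on index 0 in the first one.
concentrated = in_cone([3.0, 0.5, -0.5], S=[0], nu=1.0)
spread_out = in_cone([1.0, 2.0, 0.0], S=[0], nu=1.0)
```

Membership in the union over all supports of size at most $s^*$ would require checking every such index set, or equivalently the set of the $s^*$ largest entries in magnitude.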
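Besides, since $\Theta^* = \Sigma^{*-1} \in \mathcal{M}(\kappa, s^*)$, we have
$$\frac{\Lambda_{\max}(\Theta^*)}{\Lambda_{\min}(\Theta^*)} = \frac{\Lambda_{\max}(\Sigma^{*-1})}{\Lambda_{\min}(\Sigma^{*-1})} = \frac{\Lambda_{\max}(\Sigma^*)}{\Lambda_{\min}(\Sigma^*)} \le \kappa.$$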
We first introduce some important results on characterizing the data matrix X. These are adapted from intermediate lemmas in Liu and Wang (2012), to which we refer interested readers for detailed proofs.
The first lemma provides the bounds of entry-wise difference between sample and population
correlation matrices.
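Lemma K.1. Let $\hat{R}$ and $R$ be the sample and population correlation matrices, respectively. Then for the event
$$\mathcal{E}_1 = \left\{ \|\hat{R} - R\|_\infty \le 18\sqrt{\frac{\log d}{n}} \right\},$$
we have $P[\mathcal{E}_1] \ge 1 - d^{-1}$.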
The second lemma provides the bounds of normalized model noise ε.
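Lemma K.2. Let $\varepsilon_i \in \mathbb{R}^n$ follow $\varepsilon_i \sim N_n(0, \sigma_i^2 I_n)$. Then for the event
$$\mathcal{E}_2 = \left\{ \max_{i\in\{1,\dots,d\}} \frac{\|\varepsilon_i\|_2^2}{n\sigma_i^2} \le 1.4 \ \text{ and } \ \max_{i\in\{1,\dots,d\}} \left|\frac{\|\varepsilon_i\|_2^2}{n\sigma_i^2} - 1\right| \le 3.5\sqrt{\frac{\log d}{n}} \right\},$$
we have $P[\mathcal{E}_2] \ge 1 - d^{-1} - d\exp(-100/n)$.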
The third lemma provides the bounds of sample standard deviation of the marginal univariate
Gaussian random variables.
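Lemma K.3. Let $\hat\Sigma$ be the sample covariance matrix. Suppose Assumption (A3) holds, then for the event
$$\mathcal{E}_3 = \left\{ \frac{1}{2}\Lambda_{\min}(\Sigma) \le \min_{i\in\{1,\dots,d\}} \hat\Sigma_{ii} \le \max_{i\in\{1,\dots,d\}} \hat\Sigma_{ii} \le \frac{3}{2}\Lambda_{\max}(\Sigma) \right\},$$
we have $P[\mathcal{E}_3] \ge 1 - d^{-1} - d\exp(-100/n)$.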
Next, we verify Assumption 3.2 in the following lemma, and provide its proof in Appendix N.7.
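Lemma K.4. Denote $\mathcal{L}_{\mu,i}(\theta_i^*) = \|z_i - Z_{*,\setminus i}\theta_i^*\|_\mu/\sqrt{n}$. Let
$$\lambda_N = \frac{6\sqrt{5\log d/n}}{\max\left\{ \sqrt{5/3}\,\mu\min_i \Gamma_{ii}^{1/2}/(\sqrt{n}\sigma_i),\ 1 \right\}},$$
then for the event
$$\mathcal{E}_4 = \left\{ \lambda_N \ge 6\max_{i\in\{1,\dots,d\}} \|\nabla\mathcal{L}_{\mu,i}(\theta_i^*)\|_\infty \right\},$$
we have
$$P[\mathcal{E}_4] \ge 1 - d\exp\left(-\frac{n}{25}\right) - \frac{d^{0.6}}{\sqrt{0.6\pi a\log d}}.$$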
Then we provide the bound of the restricted eigenvalue of the sample correlation matrix.
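Lemma K.5. Suppose $\mathcal{E}_3$ and Assumption 4.1 (A2) hold, then for the event
$$\mathcal{E}_5 = \left\{ \inf_{\theta\in\mathcal{C}^\nu_{s^*}} \frac{\sqrt{s^*\,\theta^\top R\theta}}{\|R\|_1} \ge \frac{1}{5(1+\nu)\sqrt{\kappa}} \right\},$$
there exist constants $c_1$ and $c_2$ such that $P[\mathcal{E}_5 \mid \mathcal{E}_3] \ge 1 - c_1\exp(-c_2 n)$.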
Further, we provide the prediction error bound for the approximate solution. It follows directly
from Theorem 3.7.
Lemma K.6. Suppose E5 holds, then for event
E6 =
max
i∈1,...,d
‖Z?,\i(θi − θ∗i )‖τi
≤ c3
√s∗ log d
,
there exist constants $c_1$, $c_2$ and $c_3$ such that $P[\mathcal{E}_6 \mid \mathcal{E}_5] \ge 1 - c_1\exp(-c_2 n) - c_3 d^{-1}$.
L Proof of Lemma 4.2
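Using the result in Liu and Wang (2012) (Lemma 12) and Agarwal et al. (2010) (Proposition 1), we have with probability at least $1 - c_1\exp(-c_2 n)$ for some universal constants $c_1$ and $c_2$, for any $v \in \mathbb{R}^d$,
$$\frac{\Lambda_{\min}(\Sigma)}{2\Lambda_{\max}(\Sigma)}\|v\|_2^2 - \frac{9\Lambda_{\max}(\Sigma)}{\Lambda_{\min}(\Sigma)}\cdot\frac{\log d}{n}\|v\|_1^2 \le \frac{\|Zv\|_2^2}{n} \le \frac{2\Lambda_{\max}(\Sigma)}{\Lambda_{\min}(\Sigma)}\|v\|_2^2 + \frac{9\Lambda_{\max}(\Sigma)}{\Lambda_{\min}(\Sigma)}\cdot\frac{\log d}{n}\|v\|_1^2.$$
Applying the same analysis as for Theorem 3.4, we have for some constant c, $\|v\|_1^2 \le c\|v_{\mathcal{S}^*}\|_1^2 \le cs^*\|v_{\mathcal{S}^*}\|_2^2 \le cs^*\|v\|_2^2$, and we have
$$\left(\frac{\Lambda_{\min}(\Sigma)}{2\Lambda_{\max}(\Sigma)} - \frac{9\Lambda_{\max}(\Sigma)}{\Lambda_{\min}(\Sigma)}\cdot\frac{s^*\log d}{n}\right)\|v\|_2^2 \le \frac{\|Zv\|_2^2}{n} \le \left(\frac{2\Lambda_{\max}(\Sigma)}{\Lambda_{\min}(\Sigma)} + \frac{9\Lambda_{\max}(\Sigma)}{\Lambda_{\min}(\Sigma)}\cdot\frac{s^*\log d}{n}\right)\|v\|_2^2.$$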
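Since $n \ge \frac{54\Lambda_{\max}^2(\Sigma)s^*\log d}{\Lambda_{\min}^2(\Sigma)}$, we have
$$\frac{\Lambda_{\min}(\Sigma)}{3\Lambda_{\max}(\Sigma)}\|v\|_2^2 \le \frac{\|Zv\|_2^2}{n} \le \frac{3\Lambda_{\max}(\Sigma)}{\Lambda_{\min}(\Sigma)}\|v\|_2^2.$$
From $\Lambda_{\max}(\Sigma) = 1/\Lambda_{\min}(\Theta)$ and $\Lambda_{\min}(\Sigma) = 1/\Lambda_{\max}(\Theta)$, we have
$$\frac{1}{3\kappa_\Theta}\|v\|_2^2 \le \frac{\|Zv\|_2^2}{n} \le 3\kappa_\Theta\|v\|_2^2.$$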
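To satisfy the condition for the computational theory, we require $\mu \le \sqrt{n\,\mathrm{Var}(\Gamma_{ii}^{-1/2}\varepsilon_i)}/4$ for any $i \in \{1,\dots,d\}$. From $\sigma_i = \Theta_{ii}^{-1/2}$ and $\max_{i\in\{1,\dots,d\}}\hat\Sigma_{ii} \le \frac{3}{2}\Lambda_{\max}(\Sigma)$ with high probability in Lemma K.3, we have
$$\sqrt{\mathrm{Var}(\Gamma_{ii}^{-1/2}\varepsilon_i)} = \Gamma_{ii}^{-1/2}\sigma_i = \frac{1}{\sqrt{\Theta_{ii}\hat\Sigma_{ii}}} \ge \frac{1}{\sqrt{\frac{3}{2}\Theta_{ii}\Lambda_{\max}(\Sigma)}}.$$
Since $\Sigma = \Theta^{-1}$, for any $i \in \{1,\dots,d\}$, µ needs to satisfy
$$\mu \le \frac{\sqrt{n}}{4\sqrt{\frac{3}{2}\max_i\Theta_{ii}\Lambda_{\max}(\Sigma)}} \le \frac{1}{5}\sqrt{\frac{n}{\Lambda_{\max}(\Theta)\Lambda_{\max}(\Sigma)}} = \frac{1}{5}\sqrt{\frac{n\Lambda_{\min}(\Theta)}{\Lambda_{\max}(\Theta)}} = \frac{1}{5}\sqrt{\frac{n}{\kappa_\Theta}}.$$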
By Lemma K.4, the condition on λN guarantees that Assumption 3.2 holds for each i = 1, . . . , d.
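Besides, when n is large enough, it can be guaranteed that there exists $N_1 < N$, $N_1 \in \mathbb{Z}^+$, such that
$$\frac{\rho^-_{s^*+s}}{8}\sqrt{\frac{r}{s^*}} > \lambda_{N_1} \ge 2\lambda_N \ge 12\|\nabla\mathcal{L}_{\mu,i}(\theta_i)\|_\infty.$$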
M Proof of Theorem 4.3
The analysis here follows directly from our analysis in the linear model and the analysis in Liu and Wang (2012). Let $\mathcal{E} = \cap_{i=1}^{6}\mathcal{E}_i$. Combining Lemma K.4 and our choice of µ, we have $\lambda_N = 6\sqrt{5\log d/n}$.
We first show that the estimation error of the diagonal elements is bounded.
Lemma M.1 (Adapted from Lemma 14 in Liu and Wang (2012)). Suppose Assumption 4.1 and the event $\mathcal{E}$ hold, then we have
$$\max_{i\in\{1,\dots,d\}} |\Theta_{ii} - \Theta^*_{ii}| \le c_4\|\Theta^*\|_2\frac{\log d}{n}.$$
Besides, the $\ell_1$ norm error of the estimated off-diagonal elements of each column is bounded.
Lemma M.2 (Adapted from Lemma 15 in Liu and Wang (2012)). Suppose Assumption 4.1 and the event $\mathcal{E}$ hold, then we have
$$\max_{i\in\{1,\dots,d\}} \|\Theta_{\setminus i,i} - \Theta^*_{\setminus i,i}\|_1 \le c_5\left(s^*\|\Theta^*\|_2 + \|\Theta^*\|_1\right)\frac{\log d}{n}.$$
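Combining Lemma M.1 and Lemma M.2, we have
$$\begin{aligned}
\|\Theta - \Theta^*\|_1 = \max_{i\in\{1,\dots,d\}} \|\Theta_{*,i} - \Theta^*_{*,i}\|_1 &\le \max_{i\in\{1,\dots,d\}} \left( |\Theta_{ii} - \Theta^*_{ii}| + \|\Theta_{\setminus i,i} - \Theta^*_{\setminus i,i}\|_1 \right) \\
&\le c_6\left(s^*\|\Theta^*\|_2 + \|\Theta^*\|_1\right)\frac{\log d}{n} \overset{(i)}{\le} c_7\, s^*\|\Theta^*\|_2\frac{\log d}{n},
\end{aligned}$$
where (i) is from $\|\Theta^*\|_1 \le s^*\|\Theta^*\|_2$. Then we finish the proof from $\|\Theta - \Theta^*\|_2 \le \|\Theta - \Theta^*\|_1$.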
N Proofs of Intermediate Lemmas in Appendix A and Appendix K
N.1 Proof of Lemma A.2
We first bound the estimation error. From Assumption 3.3, we have the RSC property, which
indicates
Lµ(θ) ≥ Lµ(θ∗) + (θ − θ∗)>∇Lµ(θ∗) + (ρ−s∗+2s/2)‖θ − θ∗‖22, (N.1)
Lµ(θ∗) ≥ Lµ(θ) + (θ∗ − θ)>∇Lµ(θ) + (ρ−s∗+2s/2)‖θ − θ∗‖22, (N.2)
Adding (N.2) and (N.1), we have
(θ − θ∗)>∇Lµ(θ) ≥ (θ − θ∗)>∇Lµ(θ∗) + ρ−s∗+2s‖θ − θ∗‖22. (N.3)
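Let g ∈ ∂‖θ‖1 be the subgradient that achieves the approximate KKT condition of the L.H.S of (A.6), then we have
$$(\theta - \theta^*)^\top(\nabla\mathcal{L}_\mu(\theta) + \lambda g) \le \|\theta - \theta^*\|_1\|\nabla\mathcal{L}_\mu(\theta) + \lambda g\|_\infty \le \frac{1}{2}\lambda\|\theta - \theta^*\|_1. \tag{N.4}$$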
On the other hand, we have from (N.3)
(θ − θ∗)> (∇Lµ(θ) + λg)≥(θ − θ∗)>∇Lµ(θ∗) + ρ−s∗+2s‖θ − θ∗‖22 + λg>(θ − θ∗), (N.5)
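Since $\|\theta - \theta^*\|_1 = \|(\theta - \theta^*)_{\mathcal{S}^*}\|_1 + \|(\theta - \theta^*)_{\bar{\mathcal{S}}^*}\|_1$, where $\bar{\mathcal{S}}^*$ denotes the complement of $\mathcal{S}^*$, we have
$$(\theta - \theta^*)^\top\nabla\mathcal{L}_\mu(\theta^*) \ge -\|(\theta - \theta^*)_{\mathcal{S}^*}\|_1\|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty - \|(\theta - \theta^*)_{\bar{\mathcal{S}}^*}\|_1\|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty. \tag{N.6}$$
Besides, we have
$$\begin{aligned}
(\theta - \theta^*)^\top g &= g_{\mathcal{S}^*}^\top(\theta - \theta^*)_{\mathcal{S}^*} + g_{\bar{\mathcal{S}}^*}^\top(\theta - \theta^*)_{\bar{\mathcal{S}}^*} \overset{(i)}{\ge} -\|g_{\mathcal{S}^*}\|_\infty\|(\theta - \theta^*)_{\mathcal{S}^*}\|_1 + g_{\bar{\mathcal{S}}^*}^\top\theta_{\bar{\mathcal{S}}^*} \\
&\overset{(ii)}{\ge} -\|(\theta - \theta^*)_{\mathcal{S}^*}\|_1 + \|\theta_{\bar{\mathcal{S}}^*}\|_1 \overset{(iii)}{=} -\|(\theta - \theta^*)_{\mathcal{S}^*}\|_1 + \|(\theta - \theta^*)_{\bar{\mathcal{S}}^*}\|_1,
\end{aligned} \tag{N.7}$$
where (i) and (iii) are from $\theta^*_{\bar{\mathcal{S}}^*} = 0$, and (ii) is from $\|g_{\mathcal{S}^*}\|_\infty \le 1$ and $g \in \partial\|\theta\|_1$.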
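Combining (N.4), (N.5), (N.6) and (N.7), we have
$$\begin{aligned}
\frac{1}{2}\lambda\|\theta - \theta^*\|_1 &= \frac{1}{2}\lambda\|(\theta - \theta^*)_{\mathcal{S}^*}\|_1 + \frac{1}{2}\lambda\|(\theta - \theta^*)_{\bar{\mathcal{S}}^*}\|_1 \\
&\ge \rho^-_{s^*+2s}\|\theta - \theta^*\|_2^2 - (\lambda + \|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty)\|(\theta - \theta^*)_{\mathcal{S}^*}\|_1 + (\lambda - \|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty)\|(\theta - \theta^*)_{\bar{\mathcal{S}}^*}\|_1.
\end{aligned}$$
This implies
$$\rho^-_{s^*+2s}\|\theta - \theta^*\|_2^2 + \left(\frac{1}{2}\lambda - \|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty\right)\|(\theta - \theta^*)_{\bar{\mathcal{S}}^*}\|_1 \le \left(\frac{3}{2}\lambda + \|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty\right)\|(\theta - \theta^*)_{\mathcal{S}^*}\|_1, \tag{N.8}$$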
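which results in (A.7) from $\rho^-_{s^*+2s} > 0$ and Assumption 3.2 as
$$\|(\theta - \theta^*)_{\bar{\mathcal{S}}^*}\|_1 \le \frac{\frac{3}{2}\lambda + \|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty}{\frac{1}{2}\lambda - \|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty}\|(\theta - \theta^*)_{\mathcal{S}^*}\|_1.$$
Combining $\frac{1}{2}\lambda - \|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty \ge 0$, $\frac{3}{2}\lambda + \|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty \le 2\lambda$ and (N.8), we have the estimation error bounds in (A.8) and (A.9) as
$$\rho^-_{s^*+2s}\|\theta - \theta^*\|_2^2 \le 2\lambda\|(\theta - \theta^*)_{\mathcal{S}^*}\|_1 \le 2\lambda\sqrt{s^*}\|\theta - \theta^*\|_2,$$
$$\|\theta - \theta^*\|_1 \le 6\|(\theta - \theta^*)_{\mathcal{S}^*}\|_1 \le 6\sqrt{s^*}\|\theta - \theta^*\|_2.$$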
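Next, we bound the objective error in (A.10). We have
$$\begin{aligned}
\mathcal{F}_{\mu,\lambda}(\theta) - \mathcal{F}_{\mu,\lambda}(\theta^*) &\overset{(i)}{\le} -(\nabla\mathcal{L}_\mu(\theta) + \lambda g)^\top(\theta^* - \theta) \le \|\nabla\mathcal{L}_\mu(\theta) + \lambda g\|_\infty\|\theta^* - \theta\|_1 \\
&\le \frac{1}{2}\lambda\|\theta^* - \theta\|_1 = \frac{1}{2}\lambda\left(\|(\theta^* - \theta)_{\mathcal{S}^*}\|_1 + \|(\theta^* - \theta)_{\bar{\mathcal{S}}^*}\|_1\right) \\
&\overset{(ii)}{\le} 3\lambda\|(\theta^* - \theta)_{\mathcal{S}^*}\|_1 \le 3\lambda\sqrt{s^*}\|(\theta^* - \theta)_{\mathcal{S}^*}\|_2 \overset{(iii)}{\le} \frac{6\lambda^2 s^*}{\rho^-_{s^*+2s}},
\end{aligned}$$
where (i) is from the convexity of $\mathcal{F}_{\mu,\lambda}(\theta)$ with $\nabla\mathcal{L}_\mu(\theta) + \lambda g$ as its subgradient, (ii) is from (A.7), and (iii) is from (A.8).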
N.2 Proof of Lemma A.3
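The assumption $\mathcal{F}_{\mu,\lambda}(\theta) - \mathcal{F}_{\mu,\lambda}(\theta^*) \le 6\lambda^2 s^*/\rho^-_{s^*+2s}$ implies
$$\mathcal{L}_\mu(\theta) - \mathcal{L}_\mu(\theta^*) + \lambda(\|\theta\|_1 - \|\theta^*\|_1) \le \frac{6\lambda^2 s^*}{\rho^-_{s^*+2s}}. \tag{N.9}$$
We have from the RSC property that
$$\mathcal{L}_\mu(\theta) \ge \mathcal{L}_\mu(\theta^*) + (\theta - \theta^*)^\top\nabla\mathcal{L}_\mu(\theta^*) + \frac{\rho^-_{s^*+2s}}{2}\|\theta - \theta^*\|_2^2. \tag{N.10}$$
Then from (N.9) and (N.10), we have
$$\frac{\rho^-_{s^*+2s}}{2}\|\theta - \theta^*\|_2^2 \le \frac{6\lambda^2 s^*}{\rho^-_{s^*+2s}} - (\theta - \theta^*)^\top\nabla\mathcal{L}_\mu(\theta^*) + \lambda(\|\theta^*\|_1 - \|\theta\|_1). \tag{N.11}$$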
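Besides, with $\bar{\mathcal{S}}^*$ denoting the complement of $\mathcal{S}^*$, we have
$$(\theta - \theta^*)^\top\nabla\mathcal{L}_\mu(\theta^*) \ge -\|(\theta - \theta^*)_{\mathcal{S}^*}\|_1\|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty - \|(\theta - \theta^*)_{\bar{\mathcal{S}}^*}\|_1\|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty, \tag{N.12}$$
and
$$\|\theta^*\|_1 - \|\theta\|_1 = \|\theta^*_{\mathcal{S}^*}\|_1 - \|\theta_{\mathcal{S}^*}\|_1 - \|(\theta - \theta^*)_{\bar{\mathcal{S}}^*}\|_1 \le \|(\theta - \theta^*)_{\mathcal{S}^*}\|_1 - \|(\theta - \theta^*)_{\bar{\mathcal{S}}^*}\|_1. \tag{N.13}$$
Combining (N.11), (N.12) and (N.13), we have
$$\frac{\rho^-_{s^*+2s}}{2}\|\theta - \theta^*\|_2^2 \le \frac{6\lambda^2 s^*}{\rho^-_{s^*+2s}} + (\|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty + \lambda)\|(\theta - \theta^*)_{\mathcal{S}^*}\|_1 + (\|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty - \lambda)\|(\theta - \theta^*)_{\bar{\mathcal{S}}^*}\|_1. \tag{N.14}$$
We discuss two cases as follows: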
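Case 1. We first assume $\|\theta - \theta^*\|_1 \le \frac{12\lambda s^*}{\rho^-_{s^*+2s}}$. Then (N.14) implies
$$\begin{aligned}
\frac{\rho^-_{s^*+2s}}{2}\|\theta - \theta^*\|_2^2 &\overset{(i)}{\le} \frac{6\lambda^2 s^*}{\rho^-_{s^*+2s}} + (\|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty + \lambda)\|(\theta - \theta^*)_{\mathcal{S}^*}\|_1 \overset{(ii)}{\le} \frac{6\lambda^2 s^*}{\rho^-_{s^*+2s}} + \frac{3}{2}\lambda\|(\theta - \theta^*)_{\mathcal{S}^*}\|_1 \\
&\le \frac{6\lambda^2 s^*}{\rho^-_{s^*+2s}} + \frac{18\lambda^2 s^*}{\rho^-_{s^*+2s}} = \frac{24\lambda^2 s^*}{\rho^-_{s^*+2s}},
\end{aligned}$$
where (i) is from $\|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty - \lambda \le 0$ and (ii) is from $\|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty + \lambda \le \frac{3}{2}\lambda$. This indicates
$$\|\theta - \theta^*\|_2 \le \frac{4\sqrt{3s^*}\lambda}{\rho^-_{s^*+2s}}. \tag{N.15}$$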
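Case 2. Next, we assume $\|\theta - \theta^*\|_1 > \frac{12\lambda s^*}{\rho^-_{s^*+2s}}$. Then (N.14) implies
$$\begin{aligned}
\frac{\rho^-_{s^*+2s}}{2}\|\theta - \theta^*\|_2^2 &\le (\|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty + \lambda)\|(\theta - \theta^*)_{\mathcal{S}^*}\|_1 + (\|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty - \lambda)\|(\theta - \theta^*)_{\bar{\mathcal{S}}^*}\|_1 + \frac{1}{2}\lambda\|\theta - \theta^*\|_1 \\
&= \left(\|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty + \frac{3}{2}\lambda\right)\|(\theta - \theta^*)_{\mathcal{S}^*}\|_1 + \left(\|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty - \frac{1}{2}\lambda\right)\|(\theta - \theta^*)_{\bar{\mathcal{S}}^*}\|_1 \\
&\overset{(i)}{\le} 2\lambda\|(\theta - \theta^*)_{\mathcal{S}^*}\|_1 \le 2\sqrt{s^*}\lambda\|(\theta - \theta^*)_{\mathcal{S}^*}\|_2,
\end{aligned} \tag{N.16}$$
where (i) is from $\|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty + \frac{3}{2}\lambda \le 2\lambda$ and $\|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty - \frac{1}{2}\lambda \le 0$. This indicates
$$\|\theta - \theta^*\|_2 \le \frac{4\sqrt{s^*}\lambda}{\rho^-_{s^*+2s}}. \tag{N.17}$$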
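Besides, we have
$$\|\theta - \theta^*\|_1 \overset{(i)}{\le} 6\|(\theta - \theta^*)_{\mathcal{S}^*}\|_1 \le 6\sqrt{s^*}\|(\theta - \theta^*)_{\mathcal{S}^*}\|_2 \le \frac{24\lambda s^*}{\rho^-_{s^*+2s}}, \tag{N.18}$$
where (i) is from $\|\nabla\mathcal{L}_\mu(\theta^*)\|_\infty + \frac{3}{2}\lambda \le 2\lambda$ and (N.16).
Combining (N.15) and (N.17), we have the desired result (A.11). Combining the assumption in Case 1 and (N.18), we have the desired result (A.12).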
N.3 Proof of Lemma A.4
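Recall that the proximal-gradient update can be computed by the soft-thresholding operation, i.e., for all i = 1, . . . , d,
$$(\mathcal{T}_{L,\lambda}(\theta))_i = \mathrm{sign}(\tilde\theta_i)\max\{|\tilde\theta_i| - \lambda/L,\ 0\}, \tag{N.19}$$
where $\tilde\theta = \theta - \nabla\mathcal{L}_\mu(\theta)/L$. To bound $\|(\mathcal{T}_{L,\lambda}(\theta))_{\bar{\mathcal{S}}^*}\|_0$, where $\bar{\mathcal{S}}^*$ denotes the complement of $\mathcal{S}^*$, we consider
$$\tilde\theta = \theta - \frac{1}{L}\nabla\mathcal{L}_\mu(\theta) = \theta - \frac{1}{L}\nabla\mathcal{L}_\mu(\theta^*) + \frac{1}{L}\left(\nabla\mathcal{L}_\mu(\theta^*) - \nabla\mathcal{L}_\mu(\theta)\right). \tag{N.20}$$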
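The update in (N.19) can be sketched in a few lines of code. This is an illustrative implementation only: the function names `soft_threshold` and `prox_grad_step` and the example numbers are assumptions introduced here, and the gradient is passed in as a plain list rather than computed from the smoothed loss.

```python
import math

def soft_threshold(v, tau):
    """Entrywise sign(v_i) * max(|v_i| - tau, 0)."""
    out = []
    for vi in v:
        mag = max(abs(vi) - tau, 0.0)
        out.append(math.copysign(mag, vi) if mag > 0 else 0.0)
    return out

def prox_grad_step(theta, grad, L, lam):
    """One update: gradient step of size 1/L, then soft-thresholding at lam/L."""
    shifted = [t - g / L for t, g in zip(theta, grad)]
    return soft_threshold(shifted, lam / L)

theta = [1.0, -0.2, 0.0]
grad = [0.5, -0.1, 0.3]   # hypothetical gradient values, for illustration only
new_theta = prox_grad_step(theta, grad, L=2.0, lam=1.0)
```

Entries whose shifted magnitude stays below λ/L are set exactly to zero, which is the mechanism the sparsity argument in this proof counts.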
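We then consider the following three events:
$$\mathcal{A}_1 = \left\{ i \in \bar{\mathcal{S}}^* : |\theta_i| \ge \lambda/(3L) \right\}, \tag{N.21}$$
$$\mathcal{A}_2 = \left\{ i \in \bar{\mathcal{S}}^* : |(\nabla\mathcal{L}_\mu(\theta^*)/L)_i| > \lambda/(6L) \right\}, \tag{N.22}$$
$$\mathcal{A}_3 = \left\{ i \in \bar{\mathcal{S}}^* : \left|(\nabla\mathcal{L}_\mu(\theta^*)/L - \nabla\mathcal{L}_\mu(\theta)/L)_i\right| \ge \lambda/(2L) \right\}. \tag{N.23}$$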
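Event $\mathcal{A}_1$. Note that for any $i \in \bar{\mathcal{S}}^*$, $|\theta_i| = |\theta_i - \theta^*_i|$, then we have
$$|\mathcal{A}_1| \le \sum_{i\in\bar{\mathcal{S}}^*} \frac{3L}{\lambda}|\theta_i - \theta^*_i|\cdot\mathbb{1}\left(|\theta_i - \theta^*_i| \ge \lambda/(3L)\right) \le \frac{3L}{\lambda}\sum_{i\in\bar{\mathcal{S}}^*}|\theta_i - \theta^*_i| \le \frac{3L}{\lambda}\|\theta - \theta^*\|_1 \overset{(i)}{\le} \frac{72Ls^*}{\rho^-_{s^*+2s}}, \tag{N.24}$$
where (i) is from (A.12) in Lemma A.3.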
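Event $\mathcal{A}_2$. By Assumption 3.2 and λ ≥ λN, we have
$$0 \le |\mathcal{A}_2| \le \sum_{i\in\bar{\mathcal{S}}^*} \frac{6L}{\lambda}|(\nabla\mathcal{L}_\mu(\theta^*)/L)_i|\cdot\mathbb{1}\left(|(\nabla\mathcal{L}_\mu(\theta^*)/L)_i| > \lambda/(6L)\right) = \sum_{i\in\bar{\mathcal{S}}^*} \frac{6L}{\lambda}|(\nabla\mathcal{L}_\mu(\theta^*)/L)_i|\cdot 0 = 0, \tag{N.25}$$
which indicates that $|\mathcal{A}_2| = 0$.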
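Event $\mathcal{A}_3$. Consider the event $\mathcal{A} = \{ i : |(\nabla\mathcal{L}_\mu(\theta^*) - \nabla\mathcal{L}_\mu(\theta))_i| \ge \lambda/2 \}$, which satisfies $\mathcal{A}_3 \subseteq \mathcal{A}$. We will provide an upper bound of $|\mathcal{A}|$, which is also an upper bound of $|\mathcal{A}_3|$. Let $v \in \mathbb{R}^d$ be chosen such that $v_i = \mathrm{sign}\{(\nabla\mathcal{L}_\mu(\theta^*)/L - \nabla\mathcal{L}_\mu(\theta)/L)_i\}$ for any $i \in \mathcal{A}$, and $v_i = 0$ for any $i \notin \mathcal{A}$. Then we have
$$v^\top(\nabla\mathcal{L}_\mu(\theta^*) - \nabla\mathcal{L}_\mu(\theta)) = \sum_{i\in\mathcal{A}} v_i(\nabla\mathcal{L}_\mu(\theta^*) - \nabla\mathcal{L}_\mu(\theta))_i = \sum_{i\in\mathcal{A}} \left|(\nabla\mathcal{L}_\mu(\theta^*) - \nabla\mathcal{L}_\mu(\theta))_i\right| \ge \lambda|\mathcal{A}|/2. \tag{N.26}$$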
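On the other hand, we have
$$v^\top(\nabla\mathcal{L}_\mu(\theta^*) - \nabla\mathcal{L}_\mu(\theta)) \le \|v\|_2\|\nabla\mathcal{L}_\mu(\theta^*) - \nabla\mathcal{L}_\mu(\theta)\|_2 \overset{(i)}{\le} \sqrt{|\mathcal{A}|}\cdot\|\nabla\mathcal{L}_\mu(\theta^*) - \nabla\mathcal{L}_\mu(\theta)\|_2 \overset{(ii)}{\le} \rho^+_{s^*+2s}\sqrt{|\mathcal{A}|}\cdot\|\theta - \theta^*\|_2, \tag{N.27}$$
where (i) is from $\|v\|_2 \le \sqrt{|\mathcal{A}|}\max_i |v_i| \le \sqrt{|\mathcal{A}|}$, and (ii) is from (A.1) and (A.2).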
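Combining (N.26) and (N.27), we have
$$\lambda|\mathcal{A}| \le 2\rho^+_{s^*+2s}\sqrt{|\mathcal{A}|}\cdot\|\theta - \theta^*\|_2 \overset{(i)}{\le} 8\lambda\kappa_{s^*+2s}\sqrt{3s^*|\mathcal{A}|},$$
where (i) is from (A.11) in Lemma A.3 and the definition $\kappa_{s^*+2s} = \rho^+_{s^*+2s}/\rho^-_{s^*+2s}$. Considering $\mathcal{A}_3 \subseteq \mathcal{A}$, this implies
$$|\mathcal{A}_3| \le |\mathcal{A}| \le 196\kappa^2_{s^*+2s}s^*. \tag{N.28}$$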
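Now combining Events $\mathcal{A}_1$, $\mathcal{A}_2$, $\mathcal{A}_3$ and the assumption $L \le 2\rho^+_{s^*+2s}$, we close the proof as
$$\|(\mathcal{T}_{L,\lambda}(\theta))_{\bar{\mathcal{S}}^*}\|_0 \le |\mathcal{A}_1| + |\mathcal{A}_2| + |\mathcal{A}_3| \le \frac{72Ls^*}{\rho^-_{s^*+2s}} + 196\kappa^2_{s^*+2s}s^* \le \left(144\kappa_{s^*+2s} + 196\kappa^2_{s^*+2s}\right)s^* \le s.$$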
N.4 Proof of Lemma A.5
Let g = argming∈∂‖θ‖1 Lµ + λ‖θ‖1, then ωλ = ‖∇Lµ + λg‖∞. By the optimality of θ and convexity
of Fµ,λ
, we have
Fµ,λ
(θ)−Fµ,λ
(θ) ≤(∇Lµ + λg
)>(θ − θ) ≤ ‖∇Lµ + λg‖∞‖θ − θ‖1
≤(ωλ(θ) + λ− λ
)‖θ − θ‖1. (N.29)
Besides, we have
‖θ − θ‖1 ≤ ‖θ − θ∗‖1 + ‖θ − θ∗‖1(i)
≤ 6(‖(θ − θ∗)S∗‖1 + ‖(θ − θ∗)S∗‖1
)
≤ 6√s∗(‖(θ − θ∗)S∗‖2 + ‖(θ − θ∗)S∗‖2
) (ii)
≤ 12(λ+ λ)s∗
ρ−s∗+2s
. (N.30)
where (i) and (ii) are from (A.7) and (A.8) in Lemma A.2 respectively. Combining (N.29) and
(N.30), we have desired result.
N.5 Proof of Lemma A.6
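Our analysis has two steps. In the first step, we show that $\{\theta^{(t)}\}_{t=0}^\infty$ converges to the unique limit point $\bar\theta$, the minimizer of $\mathcal{F}_{\mu,\lambda}$. In the second step, we show that the proximal gradient method has a linear convergence rate.
Step 1. Note that $\theta^{(t+1)} = \mathcal{T}_{L_\mu,\lambda}(\theta^{(t)})$. Since $\mathcal{F}_{\mu,\lambda}(\theta)$ is convex in θ (but not strongly convex), the sub-level set $\{\theta : \mathcal{F}_{\mu,\lambda}(\theta) \le \mathcal{F}_{\mu,\lambda}(\theta^{(0)})\}$ is bounded. By the monotone decrease of $\mathcal{F}_{\mu,\lambda}(\theta^{(t)})$ from (A.16) in Lemma A.8, $\{\theta^{(t)}\}_{t=0}^\infty$ is also bounded. By the Bolzano-Weierstrass theorem, it has a convergent subsequence, and we will show that $\bar\theta$ is the unique accumulation point.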
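Since $\mathcal{F}_{\mu,\lambda}(\theta)$ is bounded below,
$$\lim_{t\to\infty}\|\theta^{(t+1)} - \theta^{(t)}\|_2 \le \frac{2}{L^{(t)}_\mu}\cdot\lim_{t\to\infty}\left[\mathcal{F}_{\mu,\lambda}(\theta^{(t+1)}) - \mathcal{F}_{\mu,\lambda}(\theta^{(t)})\right] = 0.$$
By Lemma A.9, we have
$$\lim_{t\to\infty}\omega_\lambda(\theta^{(t)}) = 0.$$
This implies that $\lim_{t\to\infty}\theta^{(t)}$ satisfies the KKT condition, hence is an optimal solution.
Let $\bar\theta$ be an accumulation point. Since $\bar\theta = \mathrm{argmin}_\theta\,\mathcal{F}_{\mu,\lambda}(\theta)$, there exists some $g \in \partial\|\bar\theta\|_1$ such that
$$\nabla\mathcal{F}_{\mu,\lambda}(\bar\theta) = \nabla\mathcal{L}_\mu(\bar\theta) + \lambda g = 0. \tag{N.31}$$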
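By Lemma A.4, every proximal update is sparse, hence $\|\bar\theta_{\bar{\mathcal{S}}^*}\|_0 \le s$, where $\bar{\mathcal{S}}^*$ denotes the complement of $\mathcal{S}^*$. By the RSC property in (3.1), if $\|\theta_{\bar{\mathcal{S}}^*}\|_0 \le s$, i.e., $\|(\theta - \bar\theta)_{\bar{\mathcal{S}}^*}\|_0 \le s$, then we have
$$\mathcal{L}_\mu(\theta) - \mathcal{L}_\mu(\bar\theta) \ge (\theta - \bar\theta)^\top\nabla\mathcal{L}_\mu(\bar\theta) + \frac{\rho^-_{s^*+2s}}{2}\|\theta - \bar\theta\|_2^2. \tag{N.32}$$
From the convexity of $\|\theta\|_1$ and $g \in \partial\|\bar\theta\|_1$, we have
$$\|\theta\|_1 - \|\bar\theta\|_1 \ge (\theta - \bar\theta)^\top g. \tag{N.33}$$
Combining (N.32) and (N.33), we have for any $\|\theta_{\bar{\mathcal{S}}^*}\|_0 \le s$,
$$\begin{aligned}
\mathcal{F}_{\mu,\lambda}(\theta) - \mathcal{F}_{\mu,\lambda}(\bar\theta) &= \mathcal{L}_\mu(\theta) + \lambda\|\theta\|_1 - \left(\mathcal{L}_\mu(\bar\theta) + \lambda\|\bar\theta\|_1\right) \\
&\ge (\theta - \bar\theta)^\top\left(\nabla\mathcal{L}_\mu(\bar\theta) + \lambda g\right) + \frac{\rho^-_{s^*+2s}}{2}\|\theta - \bar\theta\|_2^2 \overset{(i)}{=} \frac{\rho^-_{s^*+2s}}{2}\|\theta - \bar\theta\|_2^2 \ge 0,
\end{aligned} \tag{N.34}$$
where (i) is from (N.31). Therefore, $\bar\theta$ is the unique accumulation point, i.e., $\lim_{t\to\infty}\theta^{(t)} = \bar\theta$.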
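Step 2. The objective $\mathcal{F}_{\mu,\lambda}(\theta^{(t+1)})$ satisfies
$$\mathcal{F}_{\mu,\lambda}(\theta^{(t+1)}) \overset{(i)}{\le} \mathcal{Q}_{\mu,\lambda}(\theta^{(t+1)}, \theta^{(t)}) \overset{(ii)}{=} \min_\theta\ \mathcal{L}_\mu(\theta^{(t)}) + \nabla\mathcal{L}_\mu(\theta^{(t)})^\top(\theta - \theta^{(t)}) + \frac{L^{(t)}_\lambda}{2}\|\theta - \theta^{(t)}\|_2^2 + \lambda\|\theta\|_1, \tag{N.35}$$
where (i) is from (A.16) in Lemma A.8, and (ii) is from the definition of $\mathcal{Q}_{\mu,\lambda}$ in (2.3). To further bound the R.H.S. of (N.35), we consider the line segment between the minimizer $\bar\theta$ and $\theta^{(t)}$,
$$\mathcal{S}(\bar\theta, \theta^{(t)}) = \left\{ \theta : \theta = \alpha\bar\theta + (1-\alpha)\theta^{(t)},\ \alpha \in [0,1] \right\}.$$
Then we restrict the minimization to the line segment $\mathcal{S}(\bar\theta, \theta^{(t)})$,
$$\mathcal{F}_{\mu,\lambda}(\theta^{(t+1)}) \le \min_{\theta\in\mathcal{S}(\bar\theta,\theta^{(t)})}\ \mathcal{L}_\mu(\theta^{(t)}) + \nabla\mathcal{L}_\mu(\theta^{(t)})^\top(\theta - \theta^{(t)}) + \frac{L^{(t)}_\lambda}{2}\|\theta - \theta^{(t)}\|_2^2 + \lambda\|\theta\|_1. \tag{N.36}$$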
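Since $\|\bar\theta_{\bar{\mathcal{S}}^*}\|_0 \le s$ and $\|\theta^{(t)}_{\bar{\mathcal{S}}^*}\|_0 \le s$, then for any $\theta \in \mathcal{S}(\bar\theta, \theta^{(t)})$, we have $\|\theta_{\bar{\mathcal{S}}^*}\|_0 \le s$ and $\|(\theta - \theta^{(t)})_{\bar{\mathcal{S}}^*}\|_0 \le 2s$. By the RSC property, we have
$$\begin{aligned}
\mathcal{L}_\mu(\theta) &\ge \mathcal{L}_\mu(\theta^{(t)}) + \nabla\mathcal{L}_\mu(\theta^{(t)})^\top(\theta - \theta^{(t)}) + \frac{\rho^-_{s^*+2s}}{2}\|\theta - \theta^{(t)}\|_2^2 \\
&\ge \mathcal{L}_\mu(\theta^{(t)}) + \nabla\mathcal{L}_\mu(\theta^{(t)})^\top(\theta - \theta^{(t)}).
\end{aligned} \tag{N.37}$$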
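Combining (N.36) and (N.37), we have
$$\begin{aligned}
\mathcal{F}_{\mu,\lambda}(\theta^{(t+1)}) &\le \min_{\theta\in\mathcal{S}(\bar\theta,\theta^{(t)})}\ \mathcal{L}_\mu(\theta) + \frac{L^{(t)}_\lambda}{2}\|\theta - \theta^{(t)}\|_2^2 + \lambda\|\theta\|_1 \\
&= \min_{\theta\in\mathcal{S}(\bar\theta,\theta^{(t)})}\ \mathcal{F}_{\mu,\lambda}(\theta) + \frac{L^{(t)}_\lambda}{2}\|\theta - \theta^{(t)}\|_2^2 \\
&= \min_{\alpha\in[0,1]}\ \mathcal{F}_{\mu,\lambda}(\alpha\bar\theta + (1-\alpha)\theta^{(t)}) + \frac{\alpha^2 L^{(t)}_\lambda}{2}\|\bar\theta - \theta^{(t)}\|_2^2 \\
&\overset{(i)}{\le} \min_{\alpha\in[0,1]}\ \alpha\mathcal{F}_{\mu,\lambda}(\bar\theta) + (1-\alpha)\mathcal{F}_{\mu,\lambda}(\theta^{(t)}) + \frac{\alpha^2 L^{(t)}_\lambda}{2}\|\bar\theta - \theta^{(t)}\|_2^2 \\
&\overset{(ii)}{\le} \min_{\alpha\in[0,1]}\ \mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \alpha\left(\mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta)\right) + \frac{\alpha^2 L^{(t)}_\lambda}{\rho^-_{s^*+2s}}\left(\mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta)\right) \\
&= \min_{\alpha\in[0,1]}\ \mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \alpha\left(1 - \frac{\alpha L^{(t)}_\lambda}{\rho^-_{s^*+2s}}\right)\left(\mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta)\right),
\end{aligned} \tag{N.38}$$
where (i) is from the convexity of $\mathcal{F}_{\mu,\lambda}$ and (ii) is from (N.34).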
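Minimizing the R.H.S. of (N.38) w.r.t. α, the optimal value $\alpha = \frac{\rho^-_{s^*+2s}}{2L^{(t)}_\lambda}$ results in
$$\mathcal{F}_{\mu,\lambda}(\theta^{(t+1)}) \le \mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \frac{\rho^-_{s^*+2s}}{4L^{(t)}_\lambda}\left(\mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta)\right). \tag{N.39}$$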
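Subtracting $\mathcal{F}_{\mu,\lambda}(\bar\theta)$ from both sides of (N.39), we have
$$\mathcal{F}_{\mu,\lambda}(\theta^{(t+1)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta) \le \left(1 - \frac{\rho^-_{s^*+2s}}{4L^{(t)}_\lambda}\right)\left(\mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta)\right) \overset{(i)}{\le} \left(1 - \frac{\rho^-_{s^*+2s}}{8\rho^+_{s^*+2s}}\right)\left(\mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta)\right), \tag{N.40}$$
where (i) is from Remark A.1. Applying (N.40) recursively, we have the desired result.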
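The choice of α and the resulting contraction factor can be checked numerically. The sketch below is illustrative only: the constants `rho_minus` and `rho_plus` are hypothetical stand-ins for the RSC and RSS constants, and L is set to its worst case of twice `rho_plus`. It maximizes the coefficient α(1 − αL/ρ⁻) from (N.38) over a grid and compares it with the closed form ρ⁻/(4L).

```python
def decrease_coefficient(alpha, L, rho_minus):
    """alpha * (1 - alpha * L / rho_minus), the coefficient of the gap in (N.38)."""
    return alpha * (1.0 - alpha * L / rho_minus)

rho_minus, rho_plus = 0.5, 4.0        # hypothetical constants, not from the paper
L = 2.0 * rho_plus                     # worst case allowed by L <= 2 * rho_plus
alpha_star = rho_minus / (2.0 * L)     # claimed maximizer
best = decrease_coefficient(alpha_star, L, rho_minus)

# Grid search over alpha in [0, 1] as an independent check of the maximizer.
grid_best = max(decrease_coefficient(k / 10000.0, L, rho_minus)
                for k in range(10001))

contraction = 1.0 - best               # per-iteration factor in (N.40)
gap, iters = 1.0, 0
while gap > 1e-6:                      # iterations needed to shrink a unit gap to 1e-6
    gap *= contraction
    iters += 1
```

Because the contraction factor is a constant strictly below one, the number of iterations grows only logarithmically in the target accuracy.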
N.6 Proof of Lemma A.7
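We first show an upper bound of $L_\mu$. Recall from the analysis in Appendix C of Lemma 3.6, there exists some α ∈ [0, 1] such that
$$\nabla^2\mathcal{L}_\mu = \begin{cases} \dfrac{X^\top X}{\sqrt{n}\mu}, & \text{if } \|\xi\|_2 < \mu, \\[2ex] \dfrac{1}{\sqrt{n}\|\xi\|_2}X^\top\left(I - \dfrac{\xi\xi^\top}{\|\xi\|_2^2}\right)X, & \text{otherwise}, \end{cases}$$
where ξ = y − X(w + α∆). We discuss two cases depending on ‖ξ‖2 < µ and ‖ξ‖2 ≥ µ.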
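Case 1. For ‖ξ‖2 < µ, we have from (??) that
$$L_\mu \le \|\nabla^2\mathcal{L}_\mu\|_2 = \frac{\|X\|_2^2}{\sqrt{n}\mu}.$$
Case 2. For ‖ξ‖2 ≥ µ, we have
$$L_\mu \le \|\nabla^2\mathcal{L}_\mu\|_2 = \frac{1}{\sqrt{n}\|\xi\|_2}\left\|X^\top\left(I - \frac{\xi\xi^\top}{\|\xi\|_2^2}\right)X\right\|_2 \le \frac{\|X\|_2^2}{\sqrt{n}\|\xi\|_2} \le \frac{\|X\|_2^2}{\sqrt{n}\mu}.$$
Combining the two cases, we have
$$L_\mu \le \frac{\|X\|_2^2}{\sqrt{n}\mu}. \tag{N.41}$$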
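Applying the analogous argument in Step 1 of the proof of Lemma A.6, we have that $\{\theta^{(t)}\}_{t=0}^\infty$ converges to the unique limit point $\bar\theta$, the minimizer of $\mathcal{F}_{\mu,\lambda}$. By the monotonicity of $\mathcal{F}_{\mu,\lambda}(\theta^{(t)})$ from (A.16) in Lemma A.8 and the convexity of $\mathcal{F}_{\mu,\lambda}(\theta)$, we have $\|\theta^{(t)} - \bar\theta\|_2 \le R$ for all t = 1, 2, . . .. Then we have
$$\begin{aligned}
\mathcal{F}_{\mu,\lambda}(\theta^{(t+1)}) &\overset{(i)}{\le} \mathcal{Q}_{\mu,\lambda}(\theta^{(t+1)}, \theta^{(t)}) \overset{(ii)}{\le} \min_\theta\ \mathcal{F}_{\mu,\lambda}(\theta) + \frac{L^{(t)}}{2}\|\theta - \theta^{(t)}\|_2^2 \\
&\le \min_{\theta = \alpha\bar\theta + (1-\alpha)\theta^{(t)},\ \alpha\in[0,1]}\ \mathcal{F}_{\mu,\lambda}(\theta) + \frac{L^{(t)}}{2}\|\theta - \theta^{(t)}\|_2^2 \\
&= \min_{\alpha\in[0,1]}\ \mathcal{F}_{\mu,\lambda}(\alpha\bar\theta + (1-\alpha)\theta^{(t)}) + \frac{L^{(t)}\alpha^2}{2}\|\theta^{(t)} - \bar\theta\|_2^2 \\
&\overset{(iii)}{\le} \min_{\alpha\in[0,1]}\ \mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \alpha\left(\mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta)\right) + \frac{2\|X\|_2^2R^2\alpha^2}{2\sqrt{n}\mu},
\end{aligned} \tag{N.42}$$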
where (i) and (ii) are from (A.16) and (A.15) in Lemma A.8 respectively, (iii) is from the convexity
of Fµ,λ(θ), ‖θ(t) − θ‖2 ≤ R for all t = 1, 2, . . . and L(t) ≤ 2Lµ ≤ 2‖X‖22/(√nµ) in Remark A.1 and
Lemma 3.6. We discuss in two cases to provide an upper bound of R.H.S. (N.42).
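Case 1: Suppose $\mathcal{F}_{\mu,\lambda}(\theta^{(0)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta) \le 2\|X\|_2^2R^2/(\sqrt{n}\mu)$. Minimizing the R.H.S. of (N.42) w.r.t. α, the optimal value is
$$\alpha = \frac{\mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta)}{2\|X\|_2^2R^2/(\sqrt{n}\mu)} \le \frac{\mathcal{F}_{\mu,\lambda}(\theta^{(0)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta)}{2\|X\|_2^2R^2/(\sqrt{n}\mu)} \le 1.$$
Then we have
$$\mathcal{F}_{\mu,\lambda}(\theta^{(t+1)}) \le \mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \frac{\left(\mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta)\right)^2}{4\|X\|_2^2R^2/(\sqrt{n}\mu)}.$$
Equivalently, we have
$$\mathcal{F}_{\mu,\lambda}(\theta^{(t+1)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta) \le \mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta) - \frac{\left(\mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta)\right)^2}{4\|X\|_2^2R^2/(\sqrt{n}\mu)}.$$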
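Denote $f_t = 1/\left(\mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta)\right)$, the reciprocal of the objective gap, so that the inequalities below are consistent with the preceding display. Then we have
$$\frac{1}{f_{t+1}} \le \frac{1}{f_t} - \frac{1}{4f_t^2\|X\|_2^2R^2/(\sqrt{n}\mu)},$$
which, after multiplying both sides by $f_tf_{t+1}$, results in
$$f_{t+1} \ge f_t + \frac{f_{t+1}}{4f_t\|X\|_2^2R^2/(\sqrt{n}\mu)} \overset{(i)}{\ge} f_t + \frac{1}{4\|X\|_2^2R^2/(\sqrt{n}\mu)}, \tag{N.43}$$
where (i) is from the monotonicity of $\mathcal{F}_{\mu,\lambda}(\theta^{(t)})$ in (A.16) of Lemma A.8, which gives $f_{t+1} \ge f_t$. Applying (N.43) recursively, we have
$$f_t \ge f_0 + \frac{t}{4\|X\|_2^2R^2/(\sqrt{n}\mu)} \overset{(i)}{\ge} \frac{t+2}{4\|X\|_2^2R^2/(\sqrt{n}\mu)},$$
where (i) is from $\mathcal{F}_{\mu,\lambda}(\theta^{(0)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta) \le 2\|X\|_2^2R^2/(\sqrt{n}\mu)$. Then we have the desired result (A.14).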
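Case 2: Suppose $\mathcal{F}_{\mu,\lambda}(\theta^{(0)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta) > 2\|X\|_2^2R^2/(\sqrt{n}\mu)$. Minimizing the R.H.S. of (N.42) w.r.t. α, the optimal value is α = 1 and
$$\mathcal{F}_{\mu,\lambda}(\theta^{(1)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta) \le \frac{2\|X\|_2^2R^2}{2\sqrt{n}\mu}.$$
We claim that for all t = 1, 2, . . .,
$$\mathcal{F}_{\mu,\lambda}(\theta^{(t)}) - \mathcal{F}_{\mu,\lambda}(\bar\theta) = c_t\cdot 2\|X\|_2^2R^2/(\sqrt{n}\mu),$$
where $\frac{1}{t+2} \le c_t \le \frac{2}{t+2}$. We prove the claim by induction.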
This obviously holds when t = 1. Assume $\frac{1}{t+2} \le c_t \le \frac{2}{t+2}$ holds when t = T. For t = T + 1, minimizing the R.H.S. of (N.42) w.r.t. α, the optimal value is $\alpha = c_T \le 1$ and the convergence rate for the (T + 1)-th iteration is $c_{T+1} = c_T - c_T^2/2$. Since $c_{T+1}$ is an increasing function of $c_T$ for $c_T \in [0, 1/2]$ and $c_t$ is monotone decreasing, i.e., $c_t \le c_1$ for all t > 1, we verify the claim since
$$c_{T+1} \le \frac{2}{T+2} - \frac{1}{2}\left(\frac{2}{T+2}\right)^2 \le \frac{2}{T+3}, \quad\text{and}\quad c_{T+1} \ge \frac{1}{T+2} - \frac{1}{2}\left(\frac{1}{T+2}\right)^2 \ge \frac{1}{T+3}.$$
Combining the two cases, we have the desired result (A.14).
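The induction on the rate sequence can be sanity-checked by iterating the recursion directly. The sketch below is an illustration under stated assumptions: it treats $c_{T+1} = c_T - c_T^2/2$ as an exact equality and starts from $c_1 = 1/2$, the value suggested by the bound on the first iterate in Case 2. The helper name `rate_sequence` is introduced here and is not from the paper.

```python
def rate_sequence(c1, T):
    """Iterate c_{t+1} = c_t - c_t**2 / 2 from c_1; return [c_1, ..., c_T]."""
    cs = [c1]
    for _ in range(T - 1):
        cs.append(cs[-1] - cs[-1] ** 2 / 2.0)
    return cs

# c_1 = 1/2 is an assumed starting value for illustration.
cs = rate_sequence(0.5, 1000)

# Check the claimed sandwich 1/(t+2) <= c_t <= 2/(t+2) for t = 1, ..., 1000.
within_bounds = all(1.0 / (t + 2) <= c <= 2.0 / (t + 2)
                    for t, c in enumerate(cs, start=1))
```

The sequence decays like a constant multiple of 1/t, which is the sublinear rate stated in (A.14).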
N.7 Proof of Lemma K.4
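Recall that the model is
$$z_i = Z_{*,\setminus i}\theta^*_i + \Gamma_{ii}^{-1/2}\varepsilon_i,$$
where $z_i = \Gamma_{ii}^{-1/2}x_i$, $Z_{*,\setminus i} = X_{*,\setminus i}\Gamma_{\setminus i,\setminus i}^{-1/2}$ and $\varepsilon_i \sim N_n(0, \sigma_i^2I_n)$. Then we have
$$\nabla\mathcal{L}_{\mu,i}(\theta^*_i) = \frac{Z_{*,\setminus i}^\top(Z_{*,\setminus i}\theta^*_i - z_i)}{\max\{\sqrt{n}\mu,\ \sqrt{n}\|z_i - Z_{*,\setminus i}\theta^*_i\|_2\}} = -\frac{Z_{*,\setminus i}^\top\varepsilon_i}{\max\{\sqrt{n}\mu\Gamma_{ii}^{1/2},\ \sqrt{n}\|\varepsilon_i\|_2\}}.$$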
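The first form of the gradient above can be written out as code and compared against finite differences. The sketch below is illustrative and rests on an assumption: the smoothed norm is taken to be the Nesterov-type smoothing of the ℓ2 norm (quadratic inside the ball of radius µ, linear outside), which is not restated in this appendix but is consistent with the gradient formula. All function names are introduced here for illustration.

```python
import math

def residual(Z, theta, z):
    """Return z - Z theta for a row-major matrix Z (list of rows)."""
    return [zi - sum(a * t for a, t in zip(row, theta)) for row, zi in zip(Z, z)]

def smoothed_norm(r, mu):
    """Assumed smoothing: ||r||^2 / (2 mu) if ||r|| < mu, else ||r|| - mu / 2."""
    nrm = math.sqrt(sum(ri * ri for ri in r))
    return nrm * nrm / (2.0 * mu) if nrm < mu else nrm - mu / 2.0

def loss(Z, theta, z, mu):
    """Smoothed loss ||z - Z theta||_mu / sqrt(n)."""
    return smoothed_norm(residual(Z, theta, z), mu) / math.sqrt(len(z))

def grad(Z, theta, z, mu):
    """Z^T (Z theta - z) / max(sqrt(n) mu, sqrt(n) ||z - Z theta||_2)."""
    r = residual(Z, theta, z)
    n = len(z)
    denom = math.sqrt(n) * max(mu, math.sqrt(sum(ri * ri for ri in r)))
    return [-sum(Z[k][j] * r[k] for k in range(n)) / denom
            for j in range(len(theta))]
```

Only the denominator changes between the two regimes, so the gradient is continuous across the boundary where the residual norm equals µ.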
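Since $\|\varepsilon_i\|_2^2/\sigma_i^2 \sim \chi^2_n$, we have from Johnstone (2001) that for any δ ∈ [0, 1/2),
$$P\left[\max_{i\in\{1,\dots,d\}} \frac{\|\varepsilon_i\|_2^2}{n\sigma_i^2} \le 1 - \delta\right] \le d\exp\left(-\frac{n\delta^2}{4}\right). \tag{N.44}$$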
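Besides, $Z_{*,\setminus i}^\top\varepsilon_i \sim N(0, n\sigma_i^2)$. Then we have from Liu and Wang (2012) that for any δ ∈ [0, 1/2) and c > 2,
$$P\left[\max_{i\in\{1,\dots,d\}} \|Z_{*,\setminus i}^\top\varepsilon_i\|_\infty > \sigma_i\sqrt{2cn\log d\,(1-\delta)}\right] \le \frac{d^{2-c(1-\delta)}}{\sqrt{\pi a\log d\,(1-\delta)}}. \tag{N.45}$$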
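Combining (N.44) and (N.45), we have with probability at least $1 - d\exp\left(-\frac{n\delta^2}{4}\right) - \frac{d^{2-c(1-\delta)}}{\sqrt{\pi a\log d\,(1-\delta)}}$,
$$\max_{i\in\{1,\dots,d\}} \frac{\|Z_{*,\setminus i}^\top\varepsilon_i\|_\infty}{\max\{\sqrt{n}\mu\Gamma_{ii}^{1/2},\ \sqrt{n}\|\varepsilon_i\|_2\}} \le \frac{\sqrt{2c\log d\,(1-\delta)/n}}{\max\left\{\min_{i\in\{1,\dots,d\}}\mu\Gamma_{ii}^{1/2}/(\sqrt{n}\sigma_i),\ \sqrt{1-\delta}\right\}}.$$
Take δ = 2/5 and c = 7/3, then we have the desired result.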