Stochastic Second-Order Optimization Methods
Part I: Convex

Fred Roosta
School of Mathematics and Physics, University of Queensland
Disclaimer

To preserve brevity and flow, many cited results are simplified, slightly modified, or generalized; please see the cited work for the details.

Unfortunately, due to time constraints, many relevant and interesting works are not mentioned in this presentation.
Problem Statement

Problem
min_{x ∈ X ⊆ R^d} F(x) = E_w[f(x; w)] = ∫_Ω f(x; w(ω)) dP(w(ω))   (the risk, where f(x; w) is, e.g., a loss/penalty composed with a prediction function)

- F: (non-)convex / (non-)smooth
- High-dimensional: d ≫ 1
- With the empirical (counting) measure over {w_i}_{i=1}^n ⊂ R^p, i.e., ∫_A dP(w) = (1/n) Σ_{i=1}^n 1_{w_i ∈ A}, ∀A ⊆ Ω, we get
  F(x) = (1/n) Σ_{i=1}^n f(x; w_i)   (finite-sum / empirical risk)
- "Big data": n ≫ 1
Humongous Data / High Dimension
Classical algorithms =⇒ High per-iteration costs
Humongous Data / High Dimension

Modern variants:
1. Low per-iteration costs
2. Maintain original convergence rate
First Order Methods

Use only gradient information.

E.g.: Gradient Descent
x(k+1) = x(k) − α(k) ∇F(x(k))

- Smooth Convex F ⇒ Sublinear, O(1/k)
- Smooth Strongly Convex F ⇒ Linear, O(ρ^k), ρ < 1
- However, iteration cost is high!
First Order Methods

Stochastic variants, e.g., (mini-batch) SGD:

For some small s, with {w_i}_{i=1}^s chosen at random,
x(k+1) = x(k) − (α(k)/s) Σ_{i=1}^s ∇f(x(k); w_i)

Cheap per-iteration costs!
However, slower to converge:
- Smooth Convex F ⇒ O(1/√k)
- Smooth Strongly Convex F ⇒ O(1/k)
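As a concrete illustration of the update above, here is a minimal mini-batch SGD sketch in Python/NumPy. The interface grad_fi, the fixed step size, and sampling without replacement are illustrative assumptions, not prescriptions from the slides.

import numpy as np

def minibatch_sgd(grad_fi, x0, n, s=32, alpha=0.1, iters=1000, seed=0):
    """grad_fi(x, i) returns the gradient of the i-th component f(x; w_i) at x."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(iters):
        batch = rng.choice(n, size=s, replace=False)   # sample s indices uniformly
        g = sum(grad_fi(x, i) for i in batch) / s      # mini-batch gradient estimate
        x -= alpha * g                                 # SGD step
    return x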
Modifications...

- Achieve the convergence rate of the full GD
- Preserve the per-iteration cost of SGD
- E.g.: SAG, SDCA, SVRG, Prox-SVRG, Acc-Prox-SVRG, Acc-Prox-SDCA, S2GD, mS2GD, MISO, SAGA, AMSVRG, ...
Second Order Methods

Use both gradient and Hessian information.

E.g.: Classical Newton's method
x(k+1) = x(k) − α(k) [∇²F(x(k))]^{-1} ∇F(x(k))   (a non-uniformly scaled gradient step)

- Smooth Convex F ⇒ Locally Superlinear
- Smooth Strongly Convex F ⇒ Locally Quadratic and Globally Linear
- However, per-iteration cost is much higher!
Outline

Part I: Convex
- Smooth: Newton-CG
- Non-Smooth: Proximal Newton, Semi-smooth Newton

Part II: Non-Convex
- Line-Search Based Methods: L-BFGS, Gauss-Newton, Natural Gradient
- Trust-Region Based Methods: Trust-Region, Cubic Regularization

Part III: Discussion and Examples
Finite Sum / Empirical Risk Minimization

FSM/ERM
min_{x ∈ X ⊆ R^d} F(x) = (1/n) Σ_{i=1}^n f(x; w_i) = (1/n) Σ_{i=1}^n f_i(x)

- f_i: Convex and Smooth
- F: Strongly Convex ⟹ Unique minimizer x*
- n ≫ 1 and/or d ≫ 1
Iterative Scheme

y(k) = argmin_{x ∈ X} { (x − x(k))^T g(k) + (1/2)(x − x(k))^T H(k) (x − x(k)) }
x(k+1) = x(k) + α_k (y(k) − x(k))
Iterative Scheme

y(k) = argmin_{x ∈ X} { (x − x(k))^T g(k) + (1/2)(x − x(k))^T H(k) (x − x(k)) },   x(k+1) = x(k) + α_k (y(k) − x(k))

- Newton: g(k) = ∇F(x(k)) and H(k) = ∇²F(x(k))
- Gradient Descent: g(k) = ∇F(x(k)) and H(k) = I
- Frank-Wolfe: g(k) = ∇F(x(k)) and H(k) = 0
- (mini-batch) SGD: S_g ⊂ {1, 2, ..., n} ⟹ g(k) = (1/|S_g|) Σ_{j ∈ S_g} ∇f_j(x(k)), and H(k) = I
- Sub-Sampled Newton (Hessian Sub-Sampling): g(k) = ∇F(x(k)), and S_H ⊂ {1, 2, ..., n} ⟹ H(k) = (1/|S_H|) Σ_{j ∈ S_H} ∇²f_j(x(k))
- Sub-Sampled Newton (Gradient and Hessian Sub-Sampling): S_g ⊂ {1, 2, ..., n} ⟹ g(k) = (1/|S_g|) Σ_{j ∈ S_g} ∇f_j(x(k)), and S_H ⊂ {1, 2, ..., n} ⟹ H(k) = (1/|S_H|) Σ_{j ∈ S_H} ∇²f_j(x(k))
Sub-sampled Newton's Method

Let X = R^d, i.e., unconstrained optimization.

Iterative Scheme
p(k) ≈ argmin_{p ∈ R^d} { p^T g(k) + (1/2) p^T H(k) p },   x(k+1) = x(k) + α(k) p(k)

g(k) = ∇F(x(k)),   H(k) = (1/|S_H|) Σ_{j ∈ S_H} ∇²f_j(x(k))   (Hessian Sub-Sampling)
Sub-sampled Newton’s Method
Global convergence, i.e., starting from any initial point
Sub-sampled Newton's Method

Iterative Scheme
- Descent Direction: H(k) p(k) ≈ −g(k)
- Step Size: α_k = argmax_{α ≤ 1} α   s.t.   F(x(k) + α p(k)) ≤ F(x(k)) + α β p(k)^T ∇F(x(k))
- Update: x(k+1) = x(k) + α(k) p(k)
Sub-sampled Newton's Method

Algorithm: Newton's Method with Hessian Sub-Sampling
1: Input: x(0)
2: for k = 0, 1, 2, ... until termination do
3:   - g(k) = ∇F(x(k))
4:   - Choose S_H(k) ⊆ {1, 2, ..., n}
5:   - H(k) = (1/|S_H(k)|) Σ_{j ∈ S_H(k)} ∇²f_j(x(k))
6:   - Solve H(k) p(k) ≈ −g(k)
7:   - Find α(k) that passes the Armijo line search
8:   - Update x(k+1) = x(k) + α(k) p(k)
9: end for
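The following Python/NumPy sketch puts the algorithm above together, assuming access to per-sample Hessians and the full gradient. The sample size, the CG tolerance theta, and the Armijo constant beta are illustrative choices, not values from the slides.

import numpy as np

def cg(H, g, theta=0.1, max_iter=100):
    """Approximately solve H p = -g so that ||H p + g|| <= theta * ||g||."""
    p = np.zeros_like(g)
    r = -g.copy()                       # residual of H p = -g at p = 0
    d = r.copy()
    for _ in range(max_iter):
        if np.linalg.norm(H @ p + g) <= theta * np.linalg.norm(g):
            break
        Hd = H @ d
        alpha = (r @ r) / (d @ Hd)
        p += alpha * d
        r_new = r - alpha * Hd
        d = r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return p

def subsampled_newton(F, grad_F, hess_fi, x0, n, sample_size, beta=1e-4, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(iters):
        g = grad_F(x)
        S = rng.choice(n, size=sample_size, replace=False)     # Hessian sub-sample
        H = sum(hess_fi(x, j) for j in S) / sample_size        # H(k)
        p = cg(H, g)                                           # H p ~= -g
        alpha = 1.0
        while F(x + alpha * p) > F(x) + alpha * beta * (p @ g):  # Armijo backtracking
            alpha *= 0.5
        x = x + alpha * p
    return x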
Sub-sampled Newton's Method

Theorem ([Byrd et al., 2011])
If each f_i is strongly convex and twice continuously differentiable with Lipschitz continuous gradient, then
lim_{k→∞} ∇F(x(k)) = 0.
Sub-sampled Newton's Method

Theorem ([Bollapragada et al., 2016])
If each f_i is strongly convex and twice continuously differentiable with Lipschitz continuous gradient, then
E[F(x(k)) − F(x*)] ≤ ρ^k (F(x(0)) − F(x*)),
where ρ ≤ 1 − 1/κ², with κ being the "condition number" of the problem.
Sub-sampled Newton's Method

What if F is strongly convex, but each f_i is only weakly convex? E.g.,
F(x) = (1/n) Σ_{i=1}^n ℓ_i(a_i^T x)

- a_i ∈ R^d and Range({a_i}_{i=1}^n) = R^d
- ℓ_i: R → R and ℓ_i'' ≥ γ > 0
- Each ∇²f_i(x) = ℓ_i''(a_i^T x) a_i a_i^T is rank one!
- But ∇²F(x) ⪰ γ λ_min((1/n) Σ_{i=1}^n a_i a_i^T) I ≻ 0
Sub-Sampling Hessian

H = (1/|S|) Σ_{j ∈ S} ∇²f_j(x)

Lemma ([Roosta and Mahoney, 2016a])
Suppose ∇²F(x) ⪰ γ I and 0 ⪯ ∇²f_i(x) ⪯ L I. Given any 0 < ε < 1, 0 < δ < 1, if the Hessian is uniformly sub-sampled with
|S| ≥ 2κ log(d/δ) / ε²,
then
Pr(H ⪰ (1 − ε) γ I) ≥ 1 − δ,
where κ = L/γ.
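A small helper implementing the sample-size bound in the lemma above (Python/NumPy); the example numbers are illustrative assumptions.

import numpy as np

def hessian_sample_size(kappa, d, eps=0.1, delta=0.01):
    """|S| >= 2 * kappa * log(d/delta) / eps^2, with kappa = L/gamma."""
    return int(np.ceil(2.0 * kappa * np.log(d / delta) / eps**2))

# Example: kappa = 100, d = 10**4, eps = 0.1, delta = 0.01 gives |S| of roughly
# 2.8e5 samples, independent of n.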
Sub-Sampling Hessian

Inexact Update
Hp ≈ −g ⟹ ‖Hp + g‖ ≤ θ ‖g‖, θ < 1, using CG.

Why CG and not other solvers (at least theoretically)?
- H is SPD
- p(t) = argmin_{p ∈ K_t} { p^T g + (1/2) p^T H p } ⟹ ⟨p(t), g⟩ ≤ −(1/2) ⟨p(t), H p(t)⟩
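A quick numerical check (Python/NumPy) of the property above: every CG iterate p(t) for the quadratic model satisfies ⟨p, g⟩ ≤ −(1/2)⟨p, Hp⟩. The random SPD test problem is an assumption for illustration only.

import numpy as np

rng = np.random.default_rng(0)
d = 50
A = rng.standard_normal((d, d))
H = A @ A.T + d * np.eye(d)                   # SPD Hessian surrogate
g = rng.standard_normal(d)

p, r, s = np.zeros(d), -g.copy(), -g.copy()   # CG applied to H p = -g
for t in range(d):
    Hs = H @ s
    alpha = (r @ r) / (s @ Hs)
    p = p + alpha * s
    r_new = r - alpha * Hs
    s = r_new + ((r_new @ r_new) / (r @ r)) * s
    r = r_new
    assert p @ g <= -0.5 * (p @ (H @ p)) + 1e-6   # descent property of CG iterates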
Sub-Sampling Hessian

Theorem ([Roosta and Mahoney, 2016a])
If θ ≤ √((1 − ε)/κ), then, w.h.p.,
F(x(k+1)) − F(x*) ≤ ρ (F(x(k)) − F(x*)),
where ρ = 1 − (1 − ε)/κ².
Sub-sampled Newton's Method

Local convergence, i.e., in a neighborhood of x*, and with α(k) = 1:
‖x(k+1) − x*‖ ≤ ξ_1 ‖x(k) − x*‖ + ξ_2 ‖x(k) − x*‖²
(Linear-Quadratic Error Recursion)
Sub-sampled Newton's Method

[Erdogdu and Montanari, 2015]
- [U_{r+1}, Λ_{r+1}] = TSVD(H, r + 1)   (truncated SVD)
- H^{-1} = λ_{r+1}^{-1} I + U_r (Λ_r^{-1} − (1/λ_{r+1}) I) U_r^T
- x(k+1) = argmin_{x ∈ X} ‖x − (x(k) − α(k) H^{-1} ∇F(x(k)))‖

Theorem ([Erdogdu and Montanari, 2015])
‖x(k+1) − x*‖ ≤ ξ_1(k) ‖x(k) − x*‖ + ξ_2(k) ‖x(k) − x*‖²,
where
ξ_1(k) = 1 − λ_min(H(k))/λ_{r+1}(H(k)) + (L_g/λ_{r+1}(H(k))) √(log d / |S_H(k)|),   and   ξ_2(k) = L_H / (2 λ_{r+1}(H(k))).

Note: For constrained optimization, the method is based on two-metric gradient projection ⟹ starting from an arbitrary point, the algorithm might not recognize, i.e., fail to stop at, a stationary point.
Sub-sampling Hessian

H = (1/|S|) Σ_{j ∈ S} ∇²f_j(x)

Lemma ([Roosta and Mahoney, 2016b])
Given any 0 < ε < 1, 0 < δ < 1, if the Hessian is uniformly sub-sampled with
|S| ≥ 2κ² log(d/δ) / ε²,
then
Pr((1 − ε) ∇²F(x) ⪯ H(x) ⪯ (1 + ε) ∇²F(x)) ≥ 1 − δ.
Sub-sampled Newton's Method [Roosta and Mahoney, 2016b]

Theorem ([Roosta and Mahoney, 2016b])
With high probability,
‖x(k+1) − x*‖ ≤ ξ_1 ‖x(k) − x*‖ + ξ_2 ‖x(k) − x*‖²,
where
ξ_1 = ε/(1 − ε) + (√κ/(1 − ε)) θ,   and   ξ_2 = L_H / (2(1 − ε) γ).

If θ = ε/√κ, then ξ_1 is problem-independent! ⇒ It can be made arbitrarily small!
Sub-sampled Newton's Method

Theorem ([Roosta and Mahoney, 2016b])
Consider any 0 < ρ_0 < ρ < 1 and ε ≤ ρ_0/(1 + ρ_0). If ‖x(0) − x*‖ ≤ (ρ − ρ_0)/ξ_2 and
θ ≤ ρ_0 √((1 − ε)/κ),
we get locally Q-linear convergence
‖x(k) − x*‖ ≤ ρ ‖x(k−1) − x*‖,   k = 1, ..., k_0,
with probability (1 − δ)^{k_0}.

- Problem-independent local convergence rate
- By increasing Hessian accuracy, a super-linear rate is possible
Putting it all together

Theorem ([Roosta and Mahoney, 2016b])
Under certain assumptions, starting at any x(0), we have:
- linear convergence,
- after a certain number of iterations, "problem-independent" linear convergence,
- after a certain number of iterations, the step size α(k) = 1 passes the Armijo rule for all subsequent iterations.
"Any optimization algorithm for which the unit step length works has some wisdom. It is too much of a fluke if the unit step length [accidentally] works."

Prof. Jorge Nocedal
IPAM Summer School, 2012
Sub-sampled Newton's Method

Theorem ([Bollapragada et al., 2016])
Suppose
‖E_i[(∇²f_i(x) − ∇²F(x))²]‖ ≤ σ².
Then,
E‖x(k+1) − x*‖ ≤ ξ_1 ‖x(k) − x*‖ + ξ_2 ‖x(k) − x*‖²,
where
ξ_1 = σ/(γ √|S_H(k)|) + κ θ,   and   ξ_2 = L_H/(2γ).
Exploiting the structure...

Example:
F(x) = (1/n) Σ_{i=1}^n ℓ_i(a_i^T x) ⟹ ∇²F(x) = A^T D_x A,
where
A ≜ [a_1^T; ...; a_n^T] ∈ R^{n×d},   D_x ≜ (1/n) diag(ℓ_1''(a_1^T x), ..., ℓ_n''(a_n^T x)) ∈ R^{n×n}.
Sketching [Pilanci and Wainwright, 2017]

x(k+1) = argmin_{x ∈ X} { (x − x(k))^T ∇F(x(k)) + (1/2)(x − x(k))^T ∇²F(x(k)) (x − x(k)) }

With ∇²F(x(k)) = B(k)^T B(k), B(k) ∈ R^{n×d}:
x(k+1) = argmin_{x ∈ X} { (x − x(k))^T ∇F(x(k)) + (1/2) ‖B(k) (x − x(k))‖² }   (B(k) is n×d)

With ∇²F(x(k)) ≈ B(k)^T S(k)^T S(k) B(k), S(k) ∈ R^{s×n}, d ≤ s ≪ n:
x(k+1) = argmin_{x ∈ X} { (x − x(k))^T ∇F(x(k)) + (1/2) ‖S(k) B(k) (x − x(k))‖² }   (S(k) B(k) is s×d)
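A minimal sketch of this idea (Python/NumPy) with a Gaussian (sub-Gaussian) sketching matrix; the sketch size s and the unconstrained solve via least squares are illustrative assumptions.

import numpy as np

def sketched_newton_step(B, grad, s, rng):
    """Return p minimizing p^T grad + (1/2) ||S B p||^2 for a Gaussian S in R^{s x n}."""
    n = B.shape[0]
    S = rng.standard_normal((s, n)) / np.sqrt(s)   # E[S^T S] = I
    SB = S @ B                                     # s x d sketched square root of the Hessian
    # Solve (SB^T SB) p = -grad; lstsq guards against rank deficiency.
    p = np.linalg.lstsq(SB.T @ SB, -grad, rcond=None)[0]
    return p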
Sketching [Pilanci and Wainwright, 2017]

- Sub-Gaussian sketches
  - well-known concentration properties
  - involve dense/unstructured matrix operations
- Randomized orthonormal systems, e.g., Hadamard or Fourier
  - sub-optimal sample sizes
  - fast matrix multiplication
- Random row sampling
  - Uniform
  - Non-uniform (more on this later)
- Sparse JL sketches
Non-uniform Sampling [Xu et al., 2016]

- Non-uniformity among ∇²f_i ⟹ O(n) uniform samples!!!
- Find |S_H| independent of n
- Immune to non-uniformity
- Non-uniform sampling schemes based on:
  1. Row norms
  2. Leverage scores
LiSSA [Agarwal et al., 2017]

Key idea: use a Taylor expansion (Neumann series) to construct an estimator of the Hessian inverse.

- ‖A‖ ≤ 1, A ≻ 0 ⟹ A^{-1} = Σ_{i=0}^∞ (I − A)^i
- A_j^{-1} ≜ Σ_{i=0}^j (I − A)^i = I + (I − A) A_{j−1}^{-1}
- lim_{j→∞} A_j^{-1} = A^{-1}
- Uniformly sub-sampled Hessian + truncated Neumann series
- Three nested loops involving Hessian-vector products (HVPs), i.e., ∇²f_{i,j}(x(k)) v_{i,j}(k)
- Leverage fast multiplication by the Hessian of GLMs
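A simplified sketch (Python/NumPy) of the recursion above applied to a vector: v_j = g + (I − H_j) v_{j−1} converges to H^{-1} g when 0 ≺ H ⪯ I. The single-sample inner loop and the scaling step are illustrative simplifications of LiSSA's three nested loops.

import numpy as np

def neumann_inverse_hvp(hess_sample, g, depth=100, scale=1.0):
    """Estimate H^{-1} g via a truncated Neumann series.
    hess_sample() returns one sub-sampled Hessian; scale should ensure
    ||scale * H|| <= 1, and the result is multiplied back by scale."""
    v = g.copy()
    for _ in range(depth):
        H_j = scale * hess_sample()        # independently sampled at each step
        v = g + v - H_j @ v                # v_j = g + (I - H_j) v_{j-1}
    return scale * v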
name          complexity per iteration                      reference
Newton-CG     O(nnz(A) √κ)                                  Folklore
SSN-LS        O(nnz(A) log n + p² κ^{3/2})                  [Xu et al., 2016]
SSN-RNS       O(nnz(A) + sr(A) p κ^{5/2})                   [Xu et al., 2016]
SRHT          O(np (log n)⁴ + p² (log n)⁴ κ^{3/2})          [Pilanci et al., 2016]
SSN-Uniform   O(nnz(A) + p κ̂ κ^{3/2})                       [Roosta et al., 2016]
LiSSA         O(nnz(A) + p κ̂ κ̃²)                            [Agarwal et al., 2017]

κ = max_x λ_max(∇²F(x)) / λ_min(∇²F(x))
κ̂ = max_x max_i λ_max(∇²f_i(x)) / λ_min(∇²F(x))
κ̃ = max_x max_i λ_max(∇²f_i(x)) / min_i λ_min(∇²f_i(x))
⟹ κ ≤ κ̂ ≤ κ̃
Sub-sampled Newton's Method

Iterative Scheme
p(k) ≈ argmin_{p ∈ R^d} { p^T g(k) + (1/2) p^T H(k) p },   x(k+1) = x(k) + α(k) p(k)

g(k) = (1/|S_g|) Σ_{j ∈ S_g} ∇f_j(x(k)),   H(k) = (1/|S_H|) Σ_{j ∈ S_H} ∇²f_j(x(k))   (Gradient and Hessian Sub-Sampling)
Sub-sampled Newton's Method

Algorithm: Newton's Method with Hessian and Gradient Sub-Sampling
1: Input: x(0)
2: for k = 0, 1, 2, ... until termination do
3:   - S_g(k) ⊆ {1, 2, ..., n} ⟹ g(k) = (1/|S_g(k)|) Σ_{j ∈ S_g(k)} ∇f_j(x(k))
4:   if ‖g(k)‖ ≤ σ ε_g then
5:     - STOP
6:   end if
7:   - S_H(k) ⊆ {1, 2, ..., n} ⟹ H(k) = (1/|S_H(k)|) Σ_{j ∈ S_H(k)} ∇²f_j(x(k))
8:   - Solve H(k) p(k) ≈ −g(k)
9:   - Find α(k) that passes the Armijo line search
10:  - Update x(k+1) = x(k) + α(k) p(k)
11: end for
Sub-sampled Newton's Method

Global convergence, i.e., starting from any initial point
[Byrd et al., 2012, Roosta and Mahoney, 2016a, Bollapragada et al., 2016]
Sub-sampled Newton's Method

We can write ∇F(x) = AB, where
A := [∇f_1(x)  ∇f_2(x)  ···  ∇f_n(x)] ∈ R^{d×n}   and   B := (1/n, 1/n, ..., 1/n)^T ∈ R^n.

Lemma ([Roosta and Mahoney, 2016a])
Let ‖∇f_i(x)‖ ≤ G(x) < ∞. For any 0 < ε, δ < 1, if
|S| ≥ G(x)² (1 + √(8 ln(1/δ)))² / ε²,
then
Pr(‖∇F(x) − g(x)‖ ≤ ε) ≥ 1 − δ.
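A small helper (Python/NumPy) matching the lemma above: compute the gradient sample size and form the uniformly sub-sampled gradient. The gradient-norm bound G is assumed to be supplied by the user.

import numpy as np

def gradient_sample_size(G, eps, delta):
    """|S| >= G^2 * (1 + sqrt(8 ln(1/delta)))^2 / eps^2."""
    return int(np.ceil((G * (1.0 + np.sqrt(8.0 * np.log(1.0 / delta))) / eps) ** 2))

def subsampled_gradient(grad_fi, x, n, size, rng):
    S = rng.choice(n, size=size, replace=False)
    return sum(grad_fi(x, i) for i in S) / size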
Sub-sampled Newton's Method

Theorem ([Roosta and Mahoney, 2016a])
If
θ ≤ √((1 − ε_H)/κ),
then, w.h.p.,
F(x(k+1)) − F(x*) ≤ ρ (F(x(k)) − F(x*)),
where ρ = 1 − (1 − ε_H)/κ², and upon "STOP", we have ‖∇F(x(k))‖ < (1 + σ) ε_g.
Sub-sampled Newton's Method

Local convergence, i.e., in a neighborhood of x*, and with α(k) = 1:
‖x(k+1) − x*‖ ≤ ξ_0 + ξ_1 ‖x(k) − x*‖ + ξ_2 ‖x(k) − x*‖²
(Error Recursion)
[Roosta and Mahoney, 2016b, Bollapragada et al., 2016]
Sub-sampled Newton's Method

Theorem ([Roosta and Mahoney, 2016b])
With high probability,
‖x(k+1) − x*‖ ≤ ξ_0 + ξ_1 ‖x(k) − x*‖ + ξ_2 ‖x(k) − x*‖²,
where
ξ_0 = ε_g / ((1 − ε_H) γ),
ξ_1 = ε_H/(1 − ε_H) + (√κ/(1 − ε_H)) θ,
ξ_2 = L_H / (2(1 − ε_H) γ).
Sub-sampled Newton's Method

Theorem ([Roosta and Mahoney, 2016b])
Consider any 0 < ρ_0 + ρ_1 < ρ < 1. If ‖x(0) − x*‖ ≤ c(ρ_0, ρ_1, ρ), ε_g(k) = ρ^k ε_g, and
θ ≤ ρ_0 √((1 − ε_H)/κ),
we get locally R-linear convergence
‖x(k) − x*‖ ≤ c ρ^k,
with probability (1 − δ)^{2k}.

- Problem-independent local convergence rate
Putting it all together

Theorem ([Roosta and Mahoney, 2016b])
Under certain assumptions, starting at any x(0), we have:
- linear convergence,
- after a certain number of iterations, "problem-independent" linear convergence,
- after a certain number of iterations, the step size α(k) = 1 passes the Armijo rule for all subsequent iterations,
- upon "STOP" at iteration k, we have ‖∇F(x(k))‖ < (1 + σ) √(ρ^k) ε_g.
Outline

Part I: Convex
- Smooth: Newton-CG
- Non-Smooth: Proximal Newton, Semi-smooth Newton

Part II: Non-Convex
- Line-Search Based Methods: L-BFGS, Gauss-Newton, Natural Gradient
- Trust-Region Based Methods: Trust-Region, Cubic Regularization

Part III: Discussion and Examples
Newton-type Methods for Non-Smooth Problems

Convex Composite Problem
min_{x ∈ X ⊆ R^d} F(x) + R(x)

- F: convex and smooth
- R: convex and non-smooth
Non-Smooth Newton-type [Yu et al., 2010]

Let X = R^d, i.e., unconstrained optimization.

Iterative Scheme (Non-Quadratic Sub-Problem)
p(k) ≈ argmin_{p ∈ R^d} { sup_{g(k) ∈ ∂(F+R)(x(k))} p^T g(k) + (1/2) p^T H(k) p },   with H(k), e.g., a quasi-Newton matrix,
x(k+1) = x(k) + α(k) p(k)
Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Iterative Scheme
p(k) ≈ argmin_{p ∈ R^d} { p^T ∇F(x(k)) + (1/2) p^T H(k) p + R(x(k) + p) },   H(k) ≈ ∇²F(x(k)),
x(k+1) = x(k) + α(k) p(k)

Notable examples of proximal Newton methods:
- glmnet ([Friedman et al., 2010]): ℓ1-regularized GLMs, sub-problems are solved using coordinate descent
- QUIC ([Hsieh et al., 2014]): Graphical Lasso problem with factorization tricks, sub-problems are solved using coordinate descent
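For concreteness, below is a minimal sketch (Python/NumPy) of one proximal Newton step for R(x) = λ‖x‖₁, where the sub-problem is solved approximately with proximal gradient (ISTA) on the quadratic model rather than the coordinate descent used by glmnet/QUIC; the function names and the fixed inner iteration count are illustrative assumptions.

import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def prox_newton_step_l1(x, grad, H, lam, inner_iters=100):
    """Approximately solve min_p p^T grad + 0.5 p^T H p + lam * ||x + p||_1."""
    L = np.linalg.norm(H, 2)                 # Lipschitz constant of the model's smooth part
    p = np.zeros_like(x)
    for _ in range(inner_iters):
        model_grad = grad + H @ p            # gradient of the smooth part of the model
        z = x + p - model_grad / L
        p = soft_threshold(z, lam / L) - x   # proximal gradient update on p
    return p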
Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Inexactness of the sub-problem solver
p(k) ≈ argmin_{p ∈ R^d} { p^T ∇F(x(k)) + (1/2) p^T H(k) p + R(x(k) + p) }

p* is the minimizer of the above sub-problem iff p* = Prox_λ(p*), where
Prox_λ(k)(p) ≜ argmin_{q ∈ R^d} { ⟨∇F(x(k)) + H(k) p, q⟩ + R(x(k) + q) + (λ/2) ‖q − p‖² }.

Inexactness:
‖p − Prox_λ(k)(p)‖ ≤ θ ‖Prox_λ(k)(0)‖,
together with some sufficient decrease condition on the sub-problem.
Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

Step-size: x(k+1) = x(k) + α(k) p(k)

Line-Search
(F + R)(x(k) + α p(k)) − (F + R)(x(k)) ≤ β (ℓ_k(x(k) + α p(k)) − ℓ_k(x(k))),
where ℓ_k(x) = F(x(k)) + (x − x(k))^T ∇F(x(k)) + R(x).
Proximal Newton-type [Lee et al., 2014, Byrd et al., 2016]

- Global convergence: x(k) converges to an optimal solution starting at any x(0)
- Local convergence: for x(0) close enough to x*
  - H(k) = ∇²F(x(k))
    - Exact solve: quadratic convergence
    - Approximate solve with decaying forcing term: superlinear convergence
    - Approximate solve with fixed forcing term: linear convergence
  - H(k) satisfying the Dennis-Moré condition ⟹ superlinear convergence
Sub-Sampled Proximal Newton-type [Liu et al., 2017]

FSM/ERM
min_{x ∈ X ⊆ R^d} F(x) + R(x) = (1/n) Σ_{i=1}^n f_i(a_i^T x) + R(x)

- f_i: Strongly Convex, Smooth, and Self-Concordant
- R: Convex and Non-Smooth
- n ≫ 1 and/or d ≫ 1

Dennis-Moré condition:
|p(k)^T (H(k) − ∇²F(x(k))) p(k)| ≤ η_k p(k)^T ∇²F(x(k)) p(k)

- Leverage score sampling to ensure the Dennis-Moré condition
- Inexact sub-problem solver
Finite Sum / Empirical Risk Minimization

Self-Concordant
|v^T (∇³f(x)[v]) v| ≤ M (v^T ∇²f(x) v)^{3/2},   ∀ x, v ∈ R^d

M = 2 is called standard self-concordant.

Theorem ([Zhang and Lin, 2015])
Suppose there exist γ > 0 and η ∈ [0, 1) such that |f_i'''(t)| ≤ γ (f_i''(t))^{1−η}. Then
F(x) = (1/n) Σ_{i=1}^n f_i(a_i^T x) + (λ/2) ‖x‖²
is self-concordant with
M = γ max_i ‖a_i‖^{1+2η} / λ^{η+1/2}.
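As a worked check (not from the slides): for the logistic loss f_i(t) = log(1 + e^{−t}) one has f_i''(t) = σ(t)(1 − σ(t)) and f_i'''(t) = f_i''(t)(1 − 2σ(t)), so |f_i'''| ≤ f_i'' and the condition holds with γ = 1, η = 0; the theorem then gives M = max_i ‖a_i‖ / √λ for ℓ2-regularized logistic regression.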
Proximal Newton-type Methods

- General treatment: [Fukushima and Mine, 1981, Lee et al., 2014, Byrd et al., 2016, Becker and Fadili, 2012, Schmidt et al., 2012, Schmidt et al., 2009, Shi and Liu, 2015]
- Tailored to specific problems: [Friedman et al., 2010, Hsieh et al., 2014, Yuan et al., 2012, Oztoprak et al., 2012]
- Self-Concordant: [Li et al., 2017, Kyrillidis et al., 2014, Tran-Dinh et al., 2013]
Outline

Part I: Convex
- Smooth: Newton-CG
- Non-Smooth: Proximal Newton, Semi-smooth Newton

Part II: Non-Convex
- Line-Search Based Methods: L-BFGS, Gauss-Newton, Natural Gradient
- Trust-Region Based Methods: Trust-Region, Cubic Regularization

Part III: Discussion and Examples
Semi-Smooth Newton-type Methods

Convex Composite Problem
min_{x ∈ X ⊆ R^d} F(x) + R(x)

- F: (non-)convex and (non-)smooth
- R: convex and non-smooth
Semi-Smooth Newton-type Methods

Recall:

Proximal Newton-type Methods
p(k) ≈ argmin_{p ∈ R^d} { p^T ∇F(x(k)) + (1/2) p^T H(k) p + R(x(k) + p) }
     ≡ prox_R^{H(k)}(x(k) − H(k)^{-1} ∇F(x(k))) − x(k)

prox_R^{H(k)}(x) ≜ argmin_{y ∈ R^d} { R(y) + (1/2) ‖y − x‖²_{H(k)} },   where ‖y − x‖²_H = ⟨y − x, H(y − x)⟩
Semi-Smooth Newton-type Methods

Recall Newton for smooth root-finding problems:
r(x) = 0 ⟹ J(x(k)) p(k) = −r(x(k)) ⟹ x(k+1) = x(k) + p(k)

Key Idea
For solving non-linear, non-smooth systems of equations, replace the Jacobian in the Newton iteration by an element of Clarke's generalized Jacobian, which is nonempty for, e.g., locally Lipschitz continuous maps, even where the true Jacobian does not exist.

Example:
r(x) = prox_R^B(x − B^{-1} ∇F(x)) − x
Semi-Smooth Newton-type Methods

Clarke's generalized Jacobian
Let r: R^d → R^p be locally Lipschitz continuous at x (by Rademacher's theorem, r is then almost everywhere differentiable). Let D_r be the set of points where r is differentiable. The Bouligand subdifferential of r at x is defined as
∂_B r(x) ≜ { lim_{k→∞} r'(x_k) | x_k ∈ D_r, x_k → x }.
Clarke's generalized Jacobian is ∂r(x) = conv(∂_B r(x)).
Semi-Smooth Newton-type Methods

Semi-Smooth Map
r: R^d → R^p is semi-smooth at x if
- r is a locally Lipschitz continuous function,
- r is directionally differentiable at x, and
- ‖r(x + v) − r(x) − Jv‖ = o(‖v‖) as v → 0, for J ∈ ∂r(x + v).

Example: piecewise continuously differentiable functions.
A vector-valued function is (strongly) semi-smooth if and only if each of its component functions is (strongly) semi-smooth.
Semi-Smooth Newton-type Methods

- First-order stationarity is written as a fixed-point-type non-linear system of equations, e.g.,
  r_B(x) ≜ x − prox_R^B(x − B^{-1} ∇F(x)) = 0,   B ≻ 0

- A (stochastic) semi-smooth Newton method is used to (approximately) solve this system of nonlinear equations, e.g., as in [Milzarek et al., 2018]:
  r_B(x) ≜ x − prox_R^B(x − B^{-1} g(x))
  J(k) p(k) = −r_B(x(k)),   J(k) ∈ 𝒥_B(x(k))
  𝒥_B(k) ≜ { J ∈ R^{d×d} | J = (I − A) + A B^{-1} H(k),  A ∈ ∂prox_R^B(u(x)) },   u(x) ≜ x − B^{-1} g(x)
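A minimal sketch (Python/NumPy) of one such semi-smooth Newton step for R(x) = λ‖x‖₁ with B = I, so that the prox is soft-thresholding and an element A of its generalized Jacobian is a diagonal 0/1 matrix; the damping of the linear solve is an illustrative assumption for robustness.

import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def semismooth_newton_step_l1(x, grad, H, lam):
    """One semi-smooth Newton step on r(x) = x - prox_{lam||.||_1}(x - grad) = 0, with B = I."""
    u = x - grad                                  # u(x) = x - B^{-1} g(x) with B = I
    r = x - soft_threshold(u, lam)                # residual r_B(x)
    a = (np.abs(u) > lam).astype(float)           # diagonal of A, an element of the prox's generalized Jacobian
    J = np.diag(1.0 - a) + a[:, None] * H         # J = (I - A) + A B^{-1} H with B = I
    p = np.linalg.solve(J + 1e-8 * np.eye(len(x)), -r)   # damped solve for robustness
    return x + p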
Semi-Smooth Newton-type Methods

- Semi-smoothness of the proximal mapping does not hold in general.
- In certain settings, these vector-valued functions are monotone and can be semi-smooth due to the properties of the proximal mappings.
- In such cases, their generalized Jacobian matrices are positive semi-definite due to monotonicity.

Lemma
For a monotone and Lipschitz continuous mapping r: R^d → R^d, and any x ∈ R^d, each element of ∂_B r(x) is positive semi-definite.
Semi-Smooth Newton-type Methods

- General treatment: [Patrinos et al., 2014, Milzarek and Ulbrich, 2014, Xiao et al., 2016, Patrinos and Bemporad, 2013]
- Tailored to a specific problem: [Yuan et al., 2018]
- Stochastic: [Milzarek et al., 2018]
THANK YOU!
References

- Agarwal, N., Bullins, B., and Hazan, E. (2017). Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187.
- Becker, S. and Fadili, J. (2012). A quasi-Newton proximal splitting method. In Advances in Neural Information Processing Systems, pages 2618–2626.
- Bollapragada, R., Byrd, R. H., and Nocedal, J. (2016). Exact and inexact subsampled Newton methods for optimization. IMA Journal of Numerical Analysis.
- Byrd, R. H., Chin, G. M., Neveitt, W., and Nocedal, J. (2011). On the use of stochastic Hessian information in optimization methods for machine learning. SIAM Journal on Optimization, 21(3):977–995.
- Byrd, R. H., Chin, G. M., Nocedal, J., and Wu, Y. (2012). Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127–155.
- Byrd, R. H., Nocedal, J., and Oztoprak, F. (2016). An inexact successive quadratic approximation method for L-1 regularized optimization. Mathematical Programming, 157(2):375–396.
- Erdogdu, M. A. and Montanari, A. (2015). Convergence rates of sub-sampled Newton methods. In Proceedings of the 28th International Conference on Neural Information Processing Systems, Volume 2, pages 3052–3060. MIT Press.
- Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1.
- Fukushima, M. and Mine, H. (1981). A generalized proximal point algorithm for certain non-convex minimization problems. International Journal of Systems Science, 12(8):989–1000.
- Hsieh, C.-J., Sustik, M. A., Dhillon, I. S., and Ravikumar, P. (2014). QUIC: quadratic approximation for sparse inverse covariance estimation. The Journal of Machine Learning Research, 15(1):2911–2947.
- Kyrillidis, A., Mahabadi, R. K., Tran-Dinh, Q., and Cevher, V. (2014). Scalable sparse covariance estimation via self-concordance. In AAAI, pages 1946–1952.
- Lee, J. D., Sun, Y., and Saunders, M. A. (2014). Proximal Newton-type methods for minimizing composite functions. SIAM Journal on Optimization, 24(3):1420–1443.
- Li, J., Andersen, M. S., and Vandenberghe, L. (2017). Inexact proximal Newton methods for self-concordant functions. Mathematical Methods of Operations Research, 85(1):19–41.
- Liu, X., Hsieh, C.-J., Lee, J. D., and Sun, Y. (2017). An inexact subsampled proximal Newton-type method for large-scale machine learning. arXiv preprint arXiv:1708.08552.
- Milzarek, A. and Ulbrich, M. (2014). A semismooth Newton method with multidimensional filter globalization for l1-optimization. SIAM Journal on Optimization, 24(1):298–333.
- Milzarek, A., Xiao, X., Cen, S., Wen, Z., and Ulbrich, M. (2018). A stochastic semismooth Newton method for nonsmooth nonconvex optimization. arXiv preprint arXiv:1803.03466.
- Oztoprak, F., Nocedal, J., Rennie, S., and Olsen, P. A. (2012). Newton-like methods for sparse inverse covariance estimation. In Advances in Neural Information Processing Systems, pages 755–763.
- Patrinos, P. and Bemporad, A. (2013). Proximal Newton methods for convex composite optimization. In Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on, pages 2358–2363. IEEE.
- Patrinos, P., Stella, L., and Bemporad, A. (2014). Forward-backward truncated Newton methods for convex composite optimization. arXiv preprint arXiv:1402.6655.
- Pilanci, M. and Wainwright, M. J. (2017). Newton sketch: A near linear-time optimization algorithm with linear-quadratic convergence. SIAM Journal on Optimization, 27(1):205–245.
- Roosta, F. and Mahoney, M. W. (2016a). Sub-sampled Newton methods I: globally convergent algorithms. arXiv preprint arXiv:1601.04737.
- Roosta, F. and Mahoney, M. W. (2016b). Sub-sampled Newton methods II: Local convergence rates. arXiv preprint arXiv:1601.04738.
- Schmidt, M., Berg, E., Friedlander, M., and Murphy, K. (2009). Optimizing costly functions with simple constraints: A limited-memory projected quasi-Newton algorithm. In Artificial Intelligence and Statistics, pages 456–463.
- Schmidt, M., Kim, D., and Sra, S. (2012). Projected Newton-type methods in machine learning. Optimization for Machine Learning, (1).
- Shi, Z. and Liu, R. (2015). Large scale optimization with proximal stochastic Newton-type gradient descent. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 691–704. Springer.
- Tran-Dinh, Q., Kyrillidis, A., and Cevher, V. (2013). Composite self-concordant minimization. arXiv preprint arXiv:1308.2867.
- Xiao, X., Li, Y., Wen, Z., and Zhang, L. (2016). A regularized semi-smooth Newton method with projection steps for composite convex programs. Journal of Scientific Computing, pages 1–26.
- Xu, P., Yang, J., Roosta, F., Ré, C., and Mahoney, M. W. (2016). Sub-sampled Newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems (NIPS), pages 3000–3008.
- Yu, J., Vishwanathan, S., Günter, S., and Schraudolph, N. N. (2010). A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. Journal of Machine Learning Research, 11(Mar):1145–1200.
- Yuan, G.-X., Ho, C.-H., and Lin, C.-J. (2012). An improved GLMNET for l1-regularized logistic regression. Journal of Machine Learning Research, 13(Jun):1999–2030.
- Yuan, Y., Sun, D., and Toh, K.-C. (2018). An efficient semismooth Newton based algorithm for convex clustering. arXiv preprint arXiv:1802.07091.
- Zhang, Y. and Lin, X. (2015). DiSCO: Distributed optimization for self-concordant empirical loss. In International Conference on Machine Learning, pages 362–370.