
Page 1: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

A Generic Quasi-Newton Algorithm for Faster

Gradient-Based Optimization

Hongzhou Lin1, Julien Mairal1, Zaid Harchaoui2

1Inria, Grenoble 2University of Washington

Journées franco-chiliennes d'optimisation, Toulouse, 2017

Julien Mairal QuickeNing 1/28

Page 2: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

An alternate title:

Acceleration by Smoothing

Julien Mairal QuickeNing 2/28

Page 3: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Collaborators

Hongzhou Lin, Zaid Harchaoui, Dima Drusvyatskiy, Courtney Paquette

Publications and pre-prints

H. Lin, J. Mairal and Z. Harchaoui. A Generic Quasi-Newton Algorithm for Faster Gradient-Based Optimization. arXiv:1610.00960. 2017

C. Paquette, H. Lin, D. Drusvyatskiy, J. Mairal, Z. Harchaoui. Catalyst Acceleration for Gradient-Based Non-Convex Optimization. arXiv:1703.10993. 2017

H. Lin, J. Mairal and Z. Harchaoui. A Universal Catalyst for First-Order Optimization. Adv. NIPS 2015.

Julien Mairal QuickeNing 3/28

Page 4: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x ∈ R^d}  { f(x) ≜ (1/n) ∑_{i=1}^n f_i(x) + ψ(x) },

where each f_i is L-smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                        Composite   Finite sum   Exploit "curvature"
    First-order methods     ✔
    Quasi-Newton

[Nesterov, 2013, Wright et al., 2009, Beck and Teboulle, 2009, Chambolle and Pock, 2011, Combettes and Wajs, 2005], ...

Julien Mairal QuickeNing 4/28

Page 5: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x ∈ R^d}  { f(x) ≜ (1/n) ∑_{i=1}^n f_i(x) + ψ(x) },

where each f_i is L-smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                        Composite   Finite sum   Exploit "curvature"
    First-order methods     ✔           ✔
    Quasi-Newton

[Schmidt et al., 2017, Xiao and Zhang, 2014, Defazio et al., 2014a,b, Shalev-Shwartz and Zhang, 2012, Mairal, 2015, Zhang and Xiao, 2015]

Julien Mairal QuickeNing 4/28

Page 6: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x ∈ R^d}  { f(x) ≜ (1/n) ∑_{i=1}^n f_i(x) + ψ(x) },

where each f_i is L-smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                        Composite   Finite sum   Exploit "curvature"
    First-order methods     ✔           ✔
    Quasi-Newton

[Schmidt et al., 2017, Xiao and Zhang, 2014, Defazio et al., 2014a,b, Shalev-Shwartz and Zhang, 2012, Mairal, 2015, Zhang and Xiao, 2015]

Julien Mairal QuickeNing 4/28

Expected number of gradients ∇f_i to compute to guarantee f(x_k) − f⋆ ≤ ε, when the objective f is µ-strongly convex:

accelerated proximal gradient: O( n √(L_f/µ) log(1/ε) );

incremental gradient methods: O( (n + L/µ) log(1/ε) ).

Page 7: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x ∈ R^d}  { f(x) ≜ (1/n) ∑_{i=1}^n f_i(x) + ψ(x) },

where each f_i is L-smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                        Composite   Finite sum   Exploit "curvature"
    First-order methods     ✔           ✔
    Quasi-Newton

[Schmidt et al., 2017, Xiao and Zhang, 2014, Defazio et al., 2014a,b, Shalev-Shwartz and Zhang, 2012, Mairal, 2015, Zhang and Xiao, 2015]

Julien Mairal QuickeNing 4/28

Expected number of gradients ∇f_i to compute to guarantee f(x_k) − f⋆ ≤ ε, when the objective f is µ-strongly convex:

accelerated proximal gradient: O( n √(L_f/µ) log(1/ε) );

incremental gradient methods: O( (n + √(n L/µ)) log(1/ε) ).

Page 8: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x ∈ R^d}  { f(x) ≜ (1/n) ∑_{i=1}^n f_i(x) + ψ(x) },

where each f_i is L-smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                        Composite   Finite sum   Exploit "curvature"
    First-order methods     ✔           ✔             ✗
    Quasi-Newton

Julien Mairal QuickeNing 4/28

Page 9: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x ∈ R^d}  { f(x) ≜ (1/n) ∑_{i=1}^n f_i(x) + ψ(x) },

where each f_i is L-smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                        Composite   Finite sum   Exploit "curvature"
    First-order methods     ✔           ✔             ✗
    Quasi-Newton                                      ✔

Julien Mairal QuickeNing 4/28

Page 10: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x ∈ R^d}  { f(x) ≜ (1/n) ∑_{i=1}^n f_i(x) + ψ(x) },

where each f_i is L-smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                        Composite   Finite sum   Exploit "curvature"
    First-order methods     ✔           ✔             ✗
    Quasi-Newton            —                         ✔

[Byrd et al., 2015, Lee et al., 2012, Scheinberg and Tang, 2016, Yu et al., 2008, Ghadimi et al., 2015, Stella et al., 2016], ...

Julien Mairal QuickeNing 4/28

Page 11: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x ∈ R^d}  { f(x) ≜ (1/n) ∑_{i=1}^n f_i(x) + ψ(x) },

where each f_i is L-smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                        Composite   Finite sum   Exploit "curvature"
    First-order methods     ✔           ✔             ✗
    Quasi-Newton            —           ✗             ✔

[Byrd et al., 2016, Gower et al., 2016]

Julien Mairal QuickeNing 4/28

Page 12: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x ∈ R^d}  { f(x) ≜ (1/n) ∑_{i=1}^n f_i(x) + ψ(x) },

where each f_i is L-smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                        Composite   Finite sum   Exploit "curvature"
    First-order methods     ✔           ✔             ✗
    Quasi-Newton            —           ✗             ✔

[Byrd et al., 2016, Gower et al., 2016]

Julien Mairal QuickeNing 4/28

Our goal is to

accelerate first-order methods with Quasi-Newton heuristics;

design algorithms that can adapt to composite and finite-sum structures and that can also exploit curvature information.

Page 13: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

QuickeNing: main idea (an old one)

Idea: Smooth the function and then apply Quasi-Newton.

The strategy appears in early work about variable metric bundle methods [Chen and Fukushima, 1999, Fukushima and Qi, 1996, Mifflin, 1996, Fuentes, Malick, and Lemarechal, 2012, Burke and Qian, 2000], ...

Julien Mairal QuickeNing 5/28

Page 14: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

QuickeNing: main idea (an old one)

Idea: Smooth the function and then apply Quasi-Newton.

The strategy appears in early work about variable metric bundle methods [Chen and Fukushima, 1999, Fukushima and Qi, 1996, Mifflin, 1996, Fuentes, Malick, and Lemarechal, 2012, Burke and Qian, 2000], ...

The Moreau-Yosida envelope

Given f : R^d → R a convex function, the Moreau-Yosida envelope of f is the function F : R^d → R defined as

    F(x) = min_{w ∈ R^d} { f(w) + (κ/2) ‖w − x‖² }.

The proximal operator p(x) is the unique minimizer of the problem.

Julien Mairal QuickeNing 5/28
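For intuition, here is a small numerical sketch (not from the slides): it evaluates F(x) and the proximal point p(x) for the nonsmooth choice f = ‖·‖₁, whose proximal operator has the closed-form soft-thresholding expression; the function names are illustrative.

```python
import numpy as np

def prox_l1(x, kappa):
    """p(x) = argmin_w ||w||_1 + (kappa/2)||w - x||^2, i.e. soft-thresholding at 1/kappa."""
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / kappa, 0.0)

def moreau_envelope_l1(x, kappa):
    """F(x) = min_w ||w||_1 + (kappa/2)||w - x||^2, together with the minimizer p(x)."""
    p = prox_l1(x, kappa)
    F = np.sum(np.abs(p)) + 0.5 * kappa * np.sum((p - x) ** 2)
    return F, p

x = np.array([1.5, -0.2, 0.8])
kappa = 2.0
F, p = moreau_envelope_l1(x, kappa)
grad_F = kappa * (x - p)   # gradient formula from the next slide: grad F(x) = kappa (x - p(x))
print(F, p, grad_F)
```

One can check numerically on such examples that F ≤ f and that ∇F is κ-Lipschitz, as stated on the next slide.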

Page 15: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

The Moreau-Yosida regularization

    F(x) = min_{w ∈ R^d} { f(w) + (κ/2) ‖w − x‖² }.

Basic properties [see Lemarechal and Sagastizabal, 1997]

Minimizing f and F is equivalent in the sense that

    min_{x ∈ R^d} F(x) = min_{x ∈ R^d} f(x),

and the solution sets of the two problems coincide.

F is continuously differentiable even when f is not, and

    ∇F(x) = κ(x − p(x)).

In addition, ∇F is Lipschitz continuous with parameter L_F = κ.

If f is µ-strongly convex, then F is also strongly convex with parameter µ_F = µκ/(µ + κ).

Julien Mairal QuickeNing 6/28

Page 16: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

The Moreau-Yosida regularization

    F(x) = min_{w ∈ R^d} { f(w) + (κ/2) ‖w − x‖² }.

Basic properties [see Lemarechal and Sagastizabal, 1997]

Minimizing f and F is equivalent in the sense that

    min_{x ∈ R^d} F(x) = min_{x ∈ R^d} f(x),

and the solution sets of the two problems coincide.

F is continuously differentiable even when f is not, and

    ∇F(x) = κ(x − p(x)).

In addition, ∇F is Lipschitz continuous with parameter L_F = κ.

If f is µ-strongly convex, then F is also strongly convex with parameter µ_F = µκ/(µ + κ).

Julien Mairal QuickeNing 6/28

F enjoys nice properties: smoothness, (strong) convexity, and we can control its condition number 1 + κ/µ.

Page 17: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

A fresh look at Catalyst [Lin et al., 2015]

Julien Mairal QuickeNing 7/28

Page 18: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

A fresh look at the proximal point algorithm

A naive approach consists of minimizing the smoothed objective F instead of f with a method designed for smooth optimization.

Consider indeed

    x_{k+1} = x_k − (1/κ) ∇F(x_k).

By rewriting the gradient ∇F(x_k) as κ(x_k − p(x_k)), we obtain

    x_{k+1} = p(x_k) = argmin_{w ∈ R^d} { f(w) + (κ/2) ‖w − x_k‖² }.

This is exactly the proximal point algorithm [Martinet, 1970, Rockafellar, 1976].

Julien Mairal QuickeNing 8/28
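To make this concrete, here is a self-contained sketch (not from the slides) of the exact proximal point iteration x_{k+1} = p(x_k), where each sub-problem is handed to an off-the-shelf smooth solver; the toy quadratic and the function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def proximal_point(f, grad_f, x0, kappa=1.0, n_iters=20):
    """Proximal point algorithm: x_{k+1} = argmin_w f(w) + (kappa/2)||w - x_k||^2."""
    x = x0.copy()
    for _ in range(n_iters):
        h = lambda w, xk=x: f(w) + 0.5 * kappa * np.sum((w - xk) ** 2)
        grad_h = lambda w, xk=x: grad_f(w) + kappa * (w - xk)
        x = minimize(h, x, jac=grad_h, method="L-BFGS-B").x  # inner solver as a stand-in
    return x

# Toy strongly convex quadratic f(x) = 0.5 x^T A x - b^T x (minimizer A^{-1} b).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b
print(proximal_point(f, grad_f, np.zeros(2)), np.linalg.solve(A, b))
```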

Page 19: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

A fresh look at the accelerated proximal point algorithm

Consider now

    x_{k+1} = y_k − (1/κ) ∇F(y_k)   and   y_{k+1} = x_{k+1} + β_{k+1}(x_{k+1} − x_k),

where β_{k+1} is a Nesterov-like extrapolation parameter. We may now rewrite the update using the value of ∇F, which gives

    x_{k+1} = p(y_k)   and   y_{k+1} = x_{k+1} + β_{k+1}(x_{k+1} − x_k).

This is the accelerated proximal point algorithm of Guler [1992].

Julien Mairal QuickeNing 9/28

Page 20: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

A fresh look at the accelerated proximal point algorithm

Consider now

    x_{k+1} = y_k − (1/κ) ∇F(y_k)   and   y_{k+1} = x_{k+1} + β_{k+1}(x_{k+1} − x_k),

where β_{k+1} is a Nesterov-like extrapolation parameter. We may now rewrite the update using the value of ∇F, which gives

    x_{k+1} = p(y_k)   and   y_{k+1} = x_{k+1} + β_{k+1}(x_{k+1} − x_k).

This is the accelerated proximal point algorithm of Guler [1992].

Remarks

F may be better conditioned than f when 1 + κ/µ ≤ L/µ;

Computing p(yk) has a cost!

Julien Mairal QuickeNing 9/28

Page 21: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

A fresh look at Catalyst [Lin, Mairal, and Harchaoui, 2015]

Catalyst is a particular accelerated proximal point algorithm with inexact gradients [Guler, 1992]:

    x_{k+1} ≈ p(y_k)   and   y_{k+1} = x_{k+1} + β_{k+1}(x_{k+1} − x_k).

The quantity x_{k+1} is obtained by using an optimization method M for approximately solving

    x_{k+1} ≈ argmin_{w ∈ R^d} { f(w) + (κ/2) ‖w − y_k‖² }.

Catalyst provides Nesterov's acceleration to M with...

restart strategies for solving the sub-problems;

global complexity analysis resulting in theoretical acceleration;

optimal balancing between outer and inner computations.

see also [Frostig et al., 2015]

Julien Mairal QuickeNing 10/28
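The outer loop can be sketched in a few lines of Python (not from the slides): a constant momentum β for a µ-strongly convex objective is used as a simplifying assumption, whereas the actual Catalyst schedule and its stopping rules for the sub-problems are richer; all names are illustrative.

```python
import numpy as np

def accelerated_inexact_prox_point(inexact_prox, x0, kappa, mu, n_iters=50):
    """Catalyst-style outer loop: x_{k+1} ~= p(y_k), then Nesterov-like extrapolation.

    inexact_prox(y) should return an approximate minimizer of
    w -> f(w) + (kappa/2)||w - y||^2, computed by some method M.
    """
    q = mu / (mu + kappa)
    beta = (1.0 - np.sqrt(q)) / (1.0 + np.sqrt(q))   # constant momentum (strongly convex case)
    x_prev, y = x0.copy(), x0.copy()
    for _ in range(n_iters):
        x = inexact_prox(y)                  # x_{k+1} ~= p(y_k)
        y = x + beta * (x - x_prev)          # y_{k+1} = x_{k+1} + beta (x_{k+1} - x_k)
        x_prev = x
    return x_prev

# Example: f(x) = 0.5 x^T A x - b^T x, with M = a few gradient steps on the sub-problem.
A, b = np.diag([10.0, 1.0]), np.array([1.0, 2.0])
mu, L, kappa = 1.0, 10.0, 1.0
grad_f = lambda x: A @ x - b

def inexact_prox(y, n_inner=20):
    w = y.copy()                             # warm start w0 = y
    for _ in range(n_inner):
        w -= (1.0 / (L + kappa)) * (grad_f(w) + kappa * (w - y))
    return w

print(accelerated_inexact_prox_point(inexact_prox, np.zeros(2), kappa, mu))
```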

Page 22: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Limited-Memory BFGS (L-BFGS)

Pros

one of the largest practical successes of smooth optimization.

Julien Mairal QuickeNing 11/28

Page 23: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Limited-Memory BFGS (L-BFGS)

Pros

one of the largest practical successes of smooth optimization.

Cons

worst-case convergence rates for strongly convex functions are linear, but much worse than those of the gradient descent method.

proximal variants typically require solving, many times,

    min_{x ∈ R^d}  (1/2) (x − z)ᵀ B_k (x − z) + ψ(x).

no guarantee of approximating the Hessian.

Julien Mairal QuickeNing 11/28

Page 24: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

QuickeNing

Main recipe

L-BFGS applied to the smoothed objective F with inexact gradients [see Friedlander and Schmidt, 2012].

inexact gradients are obtained by solving sub-problems using a first-order optimization method M;

ideally, M is able to adapt to the problem structure (finite sum, composite regularization).

replace L-BFGS steps by proximal point steps if no sufficient decrease is estimated ⇒ no line search on F;

Julien Mairal QuickeNing 12/28

Page 25: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Obtaining inexact gradients

Algorithm Procedure ApproxGradient

input: current point x in R^d; smoothing parameter κ > 0.

1: Compute the approximate mapping using an optimization method M:

       z ≈ argmin_{w ∈ R^d} { h(w) ≜ f(w) + (κ/2) ‖w − x‖² },

2: Estimate the gradient ∇F(x):

       g = κ(x − z).

output: approximate gradient estimate g, objective value F_a ≜ h(z), proximal mapping z.

Julien Mairal QuickeNing 13/28
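A minimal sketch of this procedure (not from the slides), with M instantiated as plain gradient descent on the sub-problem; the inner solver, step size and names are illustrative assumptions.

```python
import numpy as np

def approx_gradient(x, kappa, f, grad_f, L, n_inner=50):
    """Sketch of the ApproxGradient procedure above, with M = plain gradient descent.

    Approximately solves h(w) = f(w) + (kappa/2)||w - x||^2 and returns
    (g, F_a, z) with z ~= p(x), g = kappa (x - z) ~= grad F(x), F_a = h(z) ~= F(x).
    """
    w = x.copy()                      # warm start w0 = x (smooth case, see the restart slide)
    step = 1.0 / (L + kappa)          # h is (L + kappa)-smooth
    for _ in range(n_inner):
        w -= step * (grad_f(w) + kappa * (w - x))
    z = w
    g = kappa * (x - z)
    F_a = f(z) + 0.5 * kappa * np.sum((z - x) ** 2)
    return g, F_a, z
```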

Page 26: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Algorithm QuickeNing

input: x_0 in R^d; number of iterations K; κ > 0; minimization algorithm M.
1: Initialization: (g_0, F_0, z_0) = ApproxGradient(x_0, M); B_0 = κI.
2: for k = 0, ..., K − 1 do
3:   Perform the Quasi-Newton step
         x_test = x_k − B_k^{-1} g_k
         (g_test, F_test, z_test) = ApproxGradient(x_test, M).
4:   if F_test ≤ F_k − (1/(2κ)) ‖g_k‖², then
5:     (x_{k+1}, g_{k+1}, F_{k+1}, z_{k+1}) = (x_test, g_test, F_test, z_test).
6:   else
7:     Update the current iterate with the last proximal mapping:
           x_{k+1} = z_k = x_k − (1/κ) g_k
           (g_{k+1}, F_{k+1}, z_{k+1}) = ApproxGradient(x_{k+1}, M).
8:   end if
9:   Update B_{k+1} = L-BFGS(B_k, x_{k+1} − x_k, g_{k+1} − g_k).
10: end for
output: last proximal mapping z_K (solution).

Julien Mairal QuickeNing 14/28

Page 27: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Algorithm QuickeNing

input: x_0 in R^d; number of iterations K; κ > 0; minimization algorithm M.
1: Initialization: (g_0, F_0, z_0) = ApproxGradient(x_0, M); B_0 = κI.
2: for k = 0, ..., K − 1 do
3:   Perform the Quasi-Newton step
         x_test = x_k − B_k^{-1} g_k
         (g_test, F_test, z_test) = ApproxGradient(x_test, M).
4:   if F_test ≤ F_k − (1/(2κ)) ‖g_k‖², then
5:     (x_{k+1}, g_{k+1}, F_{k+1}, z_{k+1}) = (x_test, g_test, F_test, z_test).
6:   else
7:     Update the current iterate with the last proximal mapping:
           x_{k+1} = z_k = x_k − (1/κ) g_k
           (g_{k+1}, F_{k+1}, z_{k+1}) = ApproxGradient(x_{k+1}, M).
8:   end if
9:   Update B_{k+1} = L-BFGS(B_k, x_{k+1} − x_k, g_{k+1} − g_k).
10: end for
output: last proximal mapping z_K (solution).

Julien Mairal QuickeNing 14/28

The main characters:

the sequence (xk)k≥0 that minimizes F ;

the sequence (z_k)_{k≥0} produced by M that minimizes f;

the gradient approximations gk ≈ ∇F (xk);

the function value approximations Fk ≈ F (xk);

an L-BFGS update with inexact gradients;

an approximate sufficient descent condition.
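A compact sketch of this outer loop (not from the slides): to keep the code short, a dense BFGS inverse-Hessian update stands in for the limited-memory L-BFGS update, and the curvature-skipping safeguard is our own addition; approx_gradient is the procedure sketched earlier.

```python
import numpy as np

def quickening(approx_gradient, x0, kappa, K=30):
    """Sketch of the outer loop above (dense BFGS instead of L-BFGS, for brevity).

    approx_gradient(x) must return (g, F_a, z) as in the ApproxGradient sketch.
    """
    d = x0.size
    H = (1.0 / kappa) * np.eye(d)                      # inverse of B_0 = kappa * I
    x = x0.copy()
    g, F, z = approx_gradient(x)
    for _ in range(K):
        x_test = x - H @ g                             # quasi-Newton step
        g_test, F_test, z_test = approx_gradient(x_test)
        if F_test <= F - (0.5 / kappa) * (g @ g):      # approximate sufficient decrease
            x_new, g_new, F_new, z_new = x_test, g_test, F_test, z_test
        else:                                          # fall back to a proximal point step
            x_new = z                                  # x_{k+1} = z_k = x_k - (1/kappa) g_k
            g_new, F_new, z_new = approx_gradient(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 1e-12:                              # skip the update if curvature fails
            rho = 1.0 / (s @ y)
            V = np.eye(d) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)     # BFGS update of the inverse matrix
        x, g, F, z = x_new, g_new, F_new, z_new
    return z                                           # last proximal mapping
```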

Page 28: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Requirements on M and restarts

Method M

Say a sub-problem consists of minimizing h; we want M to produce a sequence of iterates (w_t)_{t≥0} with linear convergence rate

    h(w_t) − h⋆ ≤ C_M (1 − τ_M)^t (h(w_0) − h⋆).

Restarts

When f is smooth, we initialize w_0 = x when solving

    min_{w ∈ R^d} { f(w) + (κ/2) ‖w − x‖² }.

When f = f_0 + ψ is composite, we use the initialization

    w_0 = argmin_{w ∈ R^d} { f_0(x) + ⟨∇f_0(x), w − x⟩ + ((L + κ)/2) ‖w − x‖² + ψ(w) }.

Julien Mairal QuickeNing 15/28
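For the composite case, the initialization above is exactly one proximal-gradient step with step size 1/(L + κ). Here is a sketch (not from the slides) specialized to ψ = λ‖·‖₁, where the proximal operator is soft-thresholding; the names are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def composite_warm_start(x, grad_f0, L, kappa, lam):
    """w0 = argmin_w f0(x) + <grad f0(x), w - x> + ((L + kappa)/2)||w - x||^2 + lam ||w||_1,
    i.e. one proximal-gradient step from x with step size 1/(L + kappa)."""
    step = 1.0 / (L + kappa)
    return soft_threshold(x - step * grad_f0(x), step * lam)
```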

Page 29: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

When do we stop the method M?

Three strategies to balance outer and inner computations

(a) use a pre-defined sequence (ε_k)_{k≥0} and stop the optimization method M when the approximate proximal mapping is ε_k-accurate.

(b) define an adaptive stopping criterion that depends on quantities that are available at iteration k.

(c) use a pre-defined budget T_M of iterations of the method M for solving each sub-problem.

Julien Mairal QuickeNing 16/28

Page 30: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

When do we stop the method M?

Three strategies to balance outer and inner computations

(a) use a pre-defined sequence (ε_k)_{k≥0} and stop the optimization method M when the approximate proximal mapping is ε_k-accurate.

(b) define an adaptive stopping criterion that depends on quantities that are available at iteration k.

(c) use a pre-defined budget T_M of iterations of the method M for solving each sub-problem.

Remarks

(a) is the least practical strategy.

(b) is simpler to use and conservative (compatible with theory).

(c) requires T_M to be large enough in theory. The aggressive strategy T_M = n for an incremental method is extremely simple to use and effective in practice.

Julien Mairal QuickeNing 16/28

Page 31: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

When do we stop the method M?

Three strategies for µ-strongly convex objectives f

(a) use a pre-defined sequence (ε_k)_{k≥0} and stop the optimization method M when the approximate proximal mapping is ε_k-accurate:

    ε_k = (1/2) C (1 − ρ)^{k+1}   with C ≥ f(x_0) − f⋆ and ρ = µ / (4(µ + κ)).

(b) For minimizing h(w) = f(w) + (κ/2)‖w − x‖², stop when

    h(w_t) − h⋆ ≤ (κ/36) ‖w_t − x‖².

(c) use a pre-defined budget T_M of iterations of the method M for solving each sub-problem with

    T_M = (1/τ_M) log( 19 C_M (L + κ)/κ ).   (be more aggressive in practice)

Julien Mairal QuickeNing 17/28
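As a small illustration of criterion (b) (not from the slides): since h⋆ is unknown, some computable upper bound on h(w_t) − h⋆ is needed; in this sketch we use the strong-convexity bound h(w) − h⋆ ≤ ‖∇h(w)‖² / (2(µ + κ)) as a stand-in (a duality gap would be another option), and M is plain gradient descent. All names and choices are illustrative assumptions.

```python
import numpy as np

def inner_solve_criterion_b(x, kappa, mu, L, grad_f, max_inner=1000):
    """Gradient descent on h(w) = f(w) + (kappa/2)||w - x||^2, stopped with criterion (b):
    stop when an upper bound on h(w_t) - h* falls below (kappa/36) ||w_t - x||^2.
    h is (mu + kappa)-strongly convex, hence h(w) - h* <= ||grad h(w)||^2 / (2 (mu + kappa)).
    """
    w = x.copy()
    step = 1.0 / (L + kappa)
    for _ in range(max_inner):
        grad_h = grad_f(w) + kappa * (w - x)
        gap_bound = (grad_h @ grad_h) / (2.0 * (mu + kappa))
        if gap_bound <= (kappa / 36.0) * ((w - x) @ (w - x)):
            break
        w -= step * grad_h
    return w
```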

Page 32: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Remarks and worst-case global complexity

Composite objectives and sparsity

Consider a composite problem with a sparse solution (e.g., ψ = ℓ1). The method produces two sequences (x_k)_{k≥0} and (z_k)_{k≥0}:

F(x_k) → F⋆: (x_k) minimizes the smoothed objective ⇒ no sparsity;

f(z_k) → f⋆: (z_k) minimizes the true objective ⇒ the iterates may be sparse if M handles composite optimization problems;

Global complexity

The number of iterations of M to guarantee f(z_k) − f⋆ ≤ ε is at most

O( (µ + κ)/(τ_M µ) log(1/ε) ) for µ-strongly convex problems.

O( κR²/(τ_M ε) ) for convex problems.

Julien Mairal QuickeNing 18/28

Page 33: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Global Complexity and choice of κ

Example for gradient descent

With the right step-size, we have τ_M = (µ + κ)/(L + κ) and the complexity for µ > 0 becomes

    O( (L + κ)/µ · log(1/ε) ).

Example for SVRG for minimizing the sum of n functions

τ_M = min(1/n, (µ + κ)/(L + κ)) and the complexity for µ > 0 is

    O( max( n(µ + κ)/µ, (L + κ)/µ ) log(1/ε) ).

Julien Mairal QuickeNing 19/28

Page 34: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Global Complexity and choice of κ

Example for gradient descent

With the right step-size, we have τ_M = (µ + κ)/(L + κ) and the complexity for µ > 0 becomes

    O( (L + κ)/µ · log(1/ε) ).

Example for SVRG for minimizing the sum of n functions

τ_M = min(1/n, (µ + κ)/(L + κ)) and the complexity for µ > 0 is

    O( max( n(µ + κ)/µ, (L + κ)/µ ) log(1/ε) ).

QuickeNing does not provide any theoretical acceleration, but it does not significantly degrade the worst-case performance of M (unlike L-BFGS vs. gradient descent).

Julien Mairal QuickeNing 19/28

Page 35: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Global Complexity and choice of κ

Example for gradient descent

With the right step-size, we have τ_M = (µ + κ)/(L + κ) and the complexity for µ > 0 becomes

    O( (L + κ)/µ · log(1/ε) ).

Example for SVRG for minimizing the sum of n functions

τ_M = min(1/n, (µ + κ)/(L + κ)) and the complexity for µ > 0 is

    O( max( n(µ + κ)/µ, (L + κ)/µ ) log(1/ε) ).

Julien Mairal QuickeNing 19/28

Then, how to choose κ?
(i) assume that L-BFGS steps do as well as Nesterov;
(ii) choose κ as in Catalyst.

Page 36: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Experiments: formulations

ℓ2-regularized Logistic Regression:

    min_{x ∈ R^d}  (1/n) ∑_{i=1}^n log( 1 + exp(−b_i a_iᵀ x) ) + (µ/2) ‖x‖²,

ℓ1-regularized Linear Regression (LASSO):

    min_{x ∈ R^d}  (1/(2n)) ∑_{i=1}^n (b_i − a_iᵀ x)² + λ ‖x‖₁,

ℓ1-ℓ2²-regularized Linear Regression (Elastic-Net):

    min_{x ∈ R^d}  (1/(2n)) ∑_{i=1}^n (b_i − a_iᵀ x)² + λ ‖x‖₁ + (µ/2) ‖x‖²,

Julien Mairal QuickeNing 20/28
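For reference, a numpy sketch (not from the slides) of these three objectives and the gradients of their smooth parts; the variable names (A for the data matrix, b for the labels/targets) are illustrative.

```python
import numpy as np

def logistic_l2(x, A, b, mu):
    """(1/n) sum_i log(1 + exp(-b_i a_i^T x)) + (mu/2)||x||^2 and its gradient."""
    z = b * (A @ x)
    obj = np.mean(np.logaddexp(0.0, -z)) + 0.5 * mu * (x @ x)
    grad = -(A.T @ (b / (1.0 + np.exp(z)))) / A.shape[0] + mu * x
    return obj, grad

def lasso(x, A, b, lam):
    """(1/(2n)) sum_i (b_i - a_i^T x)^2 + lam ||x||_1 (returns the smooth-part gradient)."""
    r = A @ x - b
    obj = 0.5 * np.mean(r ** 2) + lam * np.sum(np.abs(x))
    return obj, (A.T @ r) / A.shape[0]

def elastic_net(x, A, b, lam, mu):
    """(1/(2n)) sum_i (b_i - a_i^T x)^2 + lam ||x||_1 + (mu/2)||x||^2."""
    obj, grad_smooth = lasso(x, A, b, lam)
    return obj + 0.5 * mu * (x @ x), grad_smooth + mu * x
```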

Page 37: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Experiments: Datasets

We consider four standard machine learning datasets with different characteristics in terms of size and dimension:

    name   covtype   alpha     real-sim   rcv1
    n      581 012   250 000   72 309     781 265
    d      54        500       20 958     47 152

we simulate the ill-conditioned regime µ = 1/(100n);

λ for the Lasso leads to about 10% non-zero coefficients.

Julien Mairal QuickeNing 21/28

Page 38: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Experiments: QuickeNing-SVRG

We consider the methods

SVRG: the Prox-SVRG algorithm of Xiao and Zhang [2014].

Catalyst-SVRG: Catalyst applied to SVRG;

L-BFGS (for smooth objectives): Mark Schmidt’s implementation.

QuickeNing-SVRG1: QuickeNing with aggressive strategy (c): one pass over the data in the inner loop.

QuickeNing-SVRG2: strategy (b), compatible with theory.

We produce 12 figures (3 formulations, 4 datasets).

Julien Mairal QuickeNing 22/28

Page 39: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Experiments: QuickeNing-SVRG (log scale)

[Figure: relative function value (log scale) vs. number of gradient evaluations on covtype and rcv1, for logistic regression (µ = 1/(100n)) and lasso (λ = 10/n), comparing SVRG, Catalyst-SVRG, QuickeNing-SVRG1, QuickeNing-SVRG2, and L-BFGS (QuickeNing-SVRG1* on the lasso panels).]

QuickeNing-SVRG1 ≥ SVRG, QuickeNing-SVRG2;

QuickeNing-SVRG2 ≥ SVRG;

QuickeNing-SVRG1 ≥ Catalyst-SVRG in 10/12 cases.

Julien Mairal QuickeNing 23/28

Page 40: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Experiments: QuickeNing-ISTA

We consider the methods

ISTA: the proximal gradient descent method with line search.

FISTA: the accelerated ISTA of Beck and Teboulle [2009].

L-BFGS (for smooth objectives): Mark Schmidt’s implementation.

QuickeNing-ISTA1: QuickeNing with aggressive strategy (c): one pass over the data in the inner loop.

QuickeNing-ISTA2: strategy (b), compatible with theory.

Julien Mairal QuickeNing 24/28

Page 41: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Experiments: QuickeNing-ISTA (log scale)

[Figure: relative function value (log scale) vs. number of gradient evaluations on covtype and rcv1, for logistic regression (µ = 1/(100n)) and lasso (λ = 10/n), comparing ISTA, FISTA, QuickeNing-ISTA1, QuickeNing-ISTA2, and L-BFGS (QuickeNing-ISTA1* on the lasso panels).]

L-BFGS (for smooth f ) is slightly better than QuickeNing-ISTA1;

QuickeNing-ISTA ≥ or ≫ FISTA in 11/12 cases.

QuickeNing-ISTA1 ≥ QuickeNing-ISTA2.

Julien Mairal QuickeNing 25/28

Page 42: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Experiments: Influence of κ

[Figure: relative function value (log scale) vs. number of gradient evaluations on covtype and rcv1, for logistic regression (µ = 1/(100n)) and lasso (λ = 10/n), comparing QuickeNing-SVRG with κ ∈ {0.001, 0.01, 0.1, 1, 10, 100, 1000} · κ0.]

κ0 is the parameter (same as in Catalyst) used in all experiments;

QuickeNing slows down when using κ > κ0;

here, for SVRG, QuickeNing is robust to small values of κ!

Julien Mairal QuickeNing 26/28

Page 43: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Experiments: Influence of l

[Figure: relative function value (log scale) vs. number of gradient evaluations on covtype and rcv1, for logistic regression (µ = 1/(100n)) and lasso (λ = 10/n), comparing QuickeNing-SVRG1 with L-BFGS memory l ∈ {1, 2, 5, 10, 20, 100}.]

l = 100 in all previous experiments;

l = 5 seems to be a reasonable choice in many cases, especially for sparse problems.

Julien Mairal QuickeNing 27/28

Page 44: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Conclusions and perspectives

A simple generic Quasi-Newton method for composite functions, with simple sub-problems, and complexity guarantees.

We also have a variant for dual approaches.

Does not close the gap between theory and practice for L-BFGS.

Perspectives

QuickeNing-BCD, QuickeNing-SAG, -SAGA, -SDCA, ...

Other types of smoothing? ⇒ Links with recent Quasi-Newton methods applied to other envelopes [Stella et al., 2016].

Julien Mairal QuickeNing 28/28

Page 45: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Outer-loop convergence analysis

Lemma: approximate descent property

    F(x_{k+1}) ≤ f(z_k) ≤ F(x_k) − (1/(4κ)) ‖∇F(x_k)‖₂² + 2 ε_k.

Then, ε_k should be smaller than (1/(4κ)) ‖∇F(x_k)‖₂², and indeed

Julien Mairal QuickeNing 29/28

Page 46: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Outer-loop convergence analysis

Lemma: approximate descent property

    F(x_{k+1}) ≤ f(z_k) ≤ F(x_k) − (1/(4κ)) ‖∇F(x_k)‖₂² + 2 ε_k.

Then, ε_k should be smaller than (1/(4κ)) ‖∇F(x_k)‖₂², and indeed

Proposition: convergence with impractical ε_k and µ > 0

If ε_k ≤ (1/(16κ)) ‖∇F(x_k)‖₂², define ρ = µ / (4(µ + κ)); then

    F(x_{k+1}) − F⋆ ≤ f(z_k) − f⋆ ≤ (1 − ρ)^{k+1} (f(x_0) − f⋆).

Unfortunately, ‖∇F (xk)‖ is unknown.

Julien Mairal QuickeNing 29/28

Page 47: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Outer-loop convergence analysis

Lemma: approximate descent property

    F(x_{k+1}) ≤ f(z_k) ≤ F(x_k) − (1/(4κ)) ‖∇F(x_k)‖₂² + 2 ε_k.

Then, ε_k should be smaller than (1/(4κ)) ‖∇F(x_k)‖₂², and indeed

Proposition: convergence with impractical ε_k and µ > 0

If ε_k ≤ (1/(16κ)) ‖∇F(x_k)‖₂², define ρ = µ / (4(µ + κ)); then

    F(x_{k+1}) − F⋆ ≤ f(z_k) − f⋆ ≤ (1 − ρ)^{k+1} (f(x_0) − f⋆).

Unfortunately, ‖∇F (xk)‖ is unknown.

Lemma: convergence with adaptive ε_k and µ > 0

If ε_k ≤ (1/(36κ)) ‖g_k‖², then ε_k ≤ (1/(16κ)) ‖∇F(x_k)‖².

This is strategy (b). gk is known and easy to compute.

Julien Mairal QuickeNing 29/28

Page 48: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Inner-loop complexity analysis

Restart for L-smooth functions

For minimizing h, initialize the method M with w_0 = x. Then,

    h(w_0) − h⋆ ≤ ((L + κ)/(2κ²)) ‖∇F(x)‖².   (1)

Proof.

We have the optimality condition ∇f(w⋆) + κ(w⋆ − x) = 0, i.e., ∇f(w⋆) = κ(x − w⋆). As a result,

    h(w_0) − h⋆ = f(x) − ( f(w⋆) + (κ/2) ‖w⋆ − x‖² )
                ≤ f(w⋆) + ⟨∇f(w⋆), x − w⋆⟩ + (L/2) ‖x − w⋆‖² − ( f(w⋆) + (κ/2) ‖w⋆ − x‖² )
                = ((L + κ)/2) ‖w⋆ − x‖²   (using ∇f(w⋆) = κ(x − w⋆))
                = ((L + κ)/(2κ²)) ‖∇F(x)‖².

Julien Mairal QuickeNing 30/28

Page 49: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Quasi-Newton and L-BFGS
Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Quasi-Newton methods work with the parameter and gradient differences between successive iterations:

    s_k ≜ x_{k+1} − x_k,   y_k ≜ ∇f(x_{k+1}) − ∇f(x_k).

Julien Mairal QuickeNing 31/28

Page 50: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Quasi-Newton and L-BFGS
Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Quasi-Newton methods work with the parameter and gradient differences between successive iterations:

    s_k ≜ x_{k+1} − x_k,   y_k ≜ ∇f(x_{k+1}) − ∇f(x_k).

They start with an initial approximation B_0 ≜ σI, and choose B_{k+1} to interpolate the gradient difference:

    B_{k+1} s_k = y_k.

Julien Mairal QuickeNing 31/28

Page 51: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Quasi-Newton and L-BFGS
Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Quasi-Newton methods work with the parameter and gradient differences between successive iterations:

    s_k ≜ x_{k+1} − x_k,   y_k ≜ ∇f(x_{k+1}) − ∇f(x_k).

They start with an initial approximation B_0 ≜ σI, and choose B_{k+1} to interpolate the gradient difference:

    B_{k+1} s_k = y_k.

Since B_{k+1} is not unique, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method chooses the symmetric matrix whose difference with B_k is minimal:

    B_{k+1} = B_k − (B_k s_k s_kᵀ B_k)/(s_kᵀ B_k s_k) + (y_k y_kᵀ)/(y_kᵀ s_k).

Julien Mairal QuickeNing 31/28
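The update above transcribes directly into numpy; this sketch (not from the slides) adds the usual curvature safeguard s_kᵀ y_k > 0 as an assumption, since the slide discusses positive-definiteness only on the next page.

```python
import numpy as np

def bfgs_update(B, s, y, eps=1e-12):
    """B_{k+1} = B_k - (B_k s s^T B_k) / (s^T B_k s) + (y y^T) / (y^T s)."""
    if s @ y <= eps:          # standard safeguard: skip when the curvature condition fails
        return B
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
```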

Page 52: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Quasi-Newton and L-BFGS
Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Update skipping/damping or a sophisticated line search (Wolfe conditions) can keep B_{k+1} positive-definite.

Julien Mairal QuickeNing 32/28

Page 53: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Quasi-Newton and L-BFGS
Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Update skipping/damping or a sophisticated line search (Wolfe conditions) can keep B_{k+1} positive-definite.

They perform updates of the form

    x_{k+1} ← x_k − η_k B_k^{-1} ∇f(x_k).

Julien Mairal QuickeNing 32/28

Page 54: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Quasi-Newton and L-BFGS
Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Update skipping/damping or a sophisticated line search (Wolfe conditions) can keep B_{k+1} positive-definite.

They perform updates of the form

    x_{k+1} ← x_k − η_k B_k^{-1} ∇f(x_k).

The BFGS method has a superlinear convergence rate.

Julien Mairal QuickeNing 32/28

Page 55: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Quasi-Newton and L-BFGS
Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Update skipping/damping or a sophisticated line search (Wolfe conditions) can keep B_{k+1} positive-definite.

They perform updates of the form

    x_{k+1} ← x_k − η_k B_k^{-1} ∇f(x_k).

The BFGS method has a superlinear convergence rate.

But, it still uses a dense p × p matrix Bk .

Julien Mairal QuickeNing 32/28

Page 56: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Quasi-Newton and L-BFGS
Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Update skipping/damping or a sophisticated line search (Wolfe conditions) can keep B_{k+1} positive-definite.

They perform updates of the form

    x_{k+1} ← x_k − η_k B_k^{-1} ∇f(x_k).

The BFGS method has a superlinear convergence rate.

But it still uses a dense p × p matrix B_k.

Instead of storing B_k, the limited-memory BFGS (L-BFGS) method stores the previous l differences s_k and y_k.

Julien Mairal QuickeNing 32/28

Page 57: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Quasi-Newton and L-BFGS
Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Update skipping/damping or a sophisticated line search (Wolfe conditions) can keep B_{k+1} positive-definite.

They perform updates of the form

    x_{k+1} ← x_k − η_k B_k^{-1} ∇f(x_k).

The BFGS method has a superlinear convergence rate.

But it still uses a dense p × p matrix B_k.

Instead of storing B_k, the limited-memory BFGS (L-BFGS) method stores the previous l differences s_k and y_k.

We can solve a linear system involving these updates when B_0 is diagonal in O(dl) [Nocedal, 1980].

Julien Mairal QuickeNing 32/28
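The O(dl) computation mentioned above is the classical two-loop recursion of Nocedal [1980]; a sketch (not from the slides), with the memory stored as a plain list of (s, y) pairs and a scalar initial matrix H_0 = γI as assumptions:

```python
import numpy as np

def lbfgs_two_loop(grad, memory, gamma=1.0):
    """Compute H_k @ grad, where H_k is the L-BFGS approximation of the inverse Hessian,
    in O(d*l) operations from the l stored pairs (s_i, y_i).

    memory: list of the most recent (s, y) pairs, oldest first.
    gamma:  scaling of the initial matrix H_0 = gamma * I (a common choice is s^T y / y^T y).
    """
    q = grad.copy()
    alphas = []
    for s, y in reversed(memory):              # first loop: most recent pair first
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append(alpha)
    r = gamma * q                              # apply the initial matrix H_0
    for (s, y), alpha in zip(memory, reversed(alphas)):   # second loop: oldest pair first
        rho = 1.0 / (y @ s)
        beta = rho * (y @ r)
        r += s * (alpha - beta)
    return r                                   # r ~= H_k @ grad (quasi-Newton direction)
```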

Page 58: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

References I

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

J. V. Burke and Maijian Qian. On the superlinear convergence of the variable metric proximal point algorithm using Broyden and BFGS matrix secant updating. Mathematical Programming, 88(1):157–181, 2000.

R. H. Byrd, J. Nocedal, and F. Oztoprak. An inexact successive quadratic approximation method for L-1 regularized optimization. Mathematical Programming, 157(2):375–396, 2015.

R. H. Byrd, S. L. Hansen, Jorge Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.

Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.

Julien Mairal QuickeNing 33/28

Page 59: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

References II

Xiaojun Chen and Masao Fukushima. Proximal quasi-Newton methods for nondifferentiable convex optimization. Mathematical Programming, 85(2):313–334, 1999.

Patrick L. Combettes and Valerie R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168–1200, 2005.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), 2014a.

Aaron Defazio, Justin Domke, and Tiberio S. Caetano. Finito: A faster, permutable incremental gradient method for big data problems. In Proceedings of the International Conferences on Machine Learning (ICML), 2014b.

Michael P. Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.

Julien Mairal QuickeNing 34/28

Page 60: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

References III

Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the International Conferences on Machine Learning (ICML), 2015.

Marc Fuentes, Jerome Malick, and Claude Lemarechal. Descentwise inexact proximal algorithms for smooth optimization. Computational Optimization and Applications, 53(3):755–769, 2012.

Masao Fukushima and Liqun Qi. A globally and superlinearly convergent algorithm for nonsmooth convex minimization. SIAM Journal on Optimization, 6(4):1106–1120, 1996.

S. Ghadimi, G. Lan, and H. Zhang. Generalized uniformly optimal methods for nonlinear programming. arXiv:1508.07384, 2015.

R. M. Gower, D. Goldfarb, and P. Richtarik. Stochastic block BFGS: Squeezing more curvature out of data. In Proceedings of the International Conferences on Machine Learning (ICML), 2016.

O. Guler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992.

Julien Mairal QuickeNing 35/28

Page 61: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

References IV

Jason Lee, Yuekai Sun, and Michael Saunders. Proximal Newton-type methods for convex optimization. In Advances in Neural Information Processing Systems (NIPS), 2012.

Claude Lemarechal and Claudia Sagastizabal. Practical aspects of the Moreau-Yosida regularization: Theoretical preliminaries. SIAM Journal on Optimization, 7(2):367–385, 1997.

Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (NIPS), 2015.

J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.

Bernard Martinet. Breve communication. Regularisation d'inequations variationnelles par approximations successives. 4(3):154–158, 1970.

Robert Mifflin. A quasi-second-order proximal bundle algorithm. Mathematical Programming, 73(1):51–72, 1996.

Julien Mairal QuickeNing 36/28

Page 62: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

References V

Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.

Jorge Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782, 1980.

R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.

Katya Scheinberg and Xiaocheng Tang. Practical inexact proximal quasi-Newton method with global complexity analysis. Mathematical Programming, 160(1):495–529, 2016.

M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 160(1):83–112, 2017.

S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717, 2012.

Lorenzo Stella, Andreas Themelis, and Panagiotis Patrinos. Forward-backward quasi-Newton methods for nonsmooth optimization problems. arXiv preprint arXiv:1604.08096, 2016.

Julien Mairal QuickeNing 37/28

Page 63: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

References VI

S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.

L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

Jin Yu, S. V. N. Vishwanathan, Simon Gunter, and Nicol N. Schraudolph. A quasi-Newton approach to non-smooth convex optimization. In Proceedings of the International Conferences on Machine Learning (ICML), 2008.

Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of the International Conferences on Machine Learning (ICML), 2015.

Julien Mairal QuickeNing 38/28