
Page 1: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

A Generic Quasi-Newton Algorithm for Faster

Gradient-Based Optimization

Hongzhou Lin1, Julien Mairal1, Zaid Harchaoui2

1Inria, Grenoble 2University of Washington

Journées franco-chiliennes d'optimisation, Toulouse, 2017

Julien Mairal QuickeNing 1/28

Page 2: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

An alternate title:

Acceleration by Smoothing

Julien Mairal QuickeNing 2/28

Page 3: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Collaborators

Hongzhou Lin, Zaid Harchaoui, Dima Drusvyatskiy, Courtney Paquette

Publications and pre-prints

H. Lin, J. Mairal and Z. Harchaoui. A Generic Quasi-Newton Algorithm for Faster Gradient-Based Optimization. arXiv:1610.00960. 2017

C. Paquette, H. Lin, D. Drusvyatskiy, J. Mairal, Z. Harchaoui. Catalyst Acceleration for Gradient-Based Non-Convex Optimization. arXiv:1703.10993. 2017

H. Lin, J. Mairal and Z. Harchaoui. A Universal Catalyst for First-Order Optimization. Adv. NIPS 2015.

Julien Mairal QuickeNing 3/28

Page 4: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x ∈ R^d}  { f(x) ≜ (1/n) ∑_{i=1}^n f_i(x) + ψ(x) },

where each f_i is L-smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                        Composite   Finite sum   Exploit "curvature"
    First-order methods     ✔
    Quasi-Newton

[Nesterov, 2013, Wright et al., 2009, Beck and Teboulle, 2009, Chambolle and Pock, 2011, Combettes and Wajs, 2005], ...

Julien Mairal QuickeNing 4/28

Page 5: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x ∈ R^d}  { f(x) ≜ (1/n) ∑_{i=1}^n f_i(x) + ψ(x) },

where each f_i is L-smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                        Composite   Finite sum   Exploit "curvature"
    First-order methods     ✔           ✔
    Quasi-Newton

[Schmidt et al., 2017, Xiao and Zhang, 2014, Defazio et al., 2014a,b, Shalev-Shwartz and Zhang, 2012, Mairal, 2015, Zhang and Xiao, 2015]

Julien Mairal QuickeNing 4/28

Page 6: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x ∈ R^d}  { f(x) ≜ (1/n) ∑_{i=1}^n f_i(x) + ψ(x) },

where each f_i is L-smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                        Composite   Finite sum   Exploit "curvature"
    First-order methods     ✔           ✔
    Quasi-Newton

[Schmidt et al., 2017, Xiao and Zhang, 2014, Defazio et al., 2014a,b, Shalev-Shwartz and Zhang, 2012, Mairal, 2015, Zhang and Xiao, 2015]

Julien Mairal QuickeNing 4/28

Expected number of gradients ∇f_i to compute to guarantee f(x_k) − f⋆ ≤ ε, when the objective f is µ-strongly convex:

accelerated proximal gradient: O( n √(L_f/µ) log(1/ε) );

incremental gradient methods: O( (n + L/µ) log(1/ε) ).

Page 7: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x ∈ R^d}  { f(x) ≜ (1/n) ∑_{i=1}^n f_i(x) + ψ(x) },

where each f_i is L-smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                        Composite   Finite sum   Exploit "curvature"
    First-order methods     ✔           ✔
    Quasi-Newton

[Schmidt et al., 2017, Xiao and Zhang, 2014, Defazio et al., 2014a,b, Shalev-Shwartz and Zhang, 2012, Mairal, 2015, Zhang and Xiao, 2015]

Julien Mairal QuickeNing 4/28

Expected number of gradients ∇f_i to compute to guarantee f(x_k) − f⋆ ≤ ε, when the objective f is µ-strongly convex:

accelerated proximal gradient: O( n √(L_f/µ) log(1/ε) );

incremental gradient methods: O( (n + √(n L/µ)) log(1/ε) ).

Page 8: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x ∈ R^d}  { f(x) ≜ (1/n) ∑_{i=1}^n f_i(x) + ψ(x) },

where each f_i is L-smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                        Composite   Finite sum   Exploit "curvature"
    First-order methods     ✔           ✔             ✗
    Quasi-Newton

Julien Mairal QuickeNing 4/28

Page 9: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x ∈ R^d}  { f(x) ≜ (1/n) ∑_{i=1}^n f_i(x) + ψ(x) },

where each f_i is L-smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                        Composite   Finite sum   Exploit "curvature"
    First-order methods     ✔           ✔             ✗
    Quasi-Newton                                      ✔

Julien Mairal QuickeNing 4/28

Page 10: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x ∈ R^d}  { f(x) ≜ (1/n) ∑_{i=1}^n f_i(x) + ψ(x) },

where each f_i is L-smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                        Composite   Finite sum   Exploit "curvature"
    First-order methods     ✔           ✔             ✗
    Quasi-Newton            —                         ✔

[Byrd et al., 2015, Lee et al., 2012, Scheinberg and Tang, 2016, Yu et al., 2008, Ghadimi et al., 2015, Stella et al., 2016], ...

Julien Mairal QuickeNing 4/28

Page 11: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x ∈ R^d}  { f(x) ≜ (1/n) ∑_{i=1}^n f_i(x) + ψ(x) },

where each f_i is L-smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                        Composite   Finite sum   Exploit "curvature"
    First-order methods     ✔           ✔             ✗
    Quasi-Newton            —           ✗             ✔

[Byrd et al., 2016, Gower et al., 2016]

Julien Mairal QuickeNing 4/28

Page 12: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Focus of this work

Minimizing large finite sums

Consider the minimization of a large sum of convex functions

    min_{x ∈ R^d}  { f(x) ≜ (1/n) ∑_{i=1}^n f_i(x) + ψ(x) },

where each f_i is L-smooth and convex and ψ is a convex regularization penalty, not necessarily differentiable.

Motivation

                        Composite   Finite sum   Exploit "curvature"
    First-order methods     ✔           ✔             ✗
    Quasi-Newton            —           ✗             ✔

[Byrd et al., 2016, Gower et al., 2016]

Julien Mairal QuickeNing 4/28

Our goal is to

accelerate first-order methods with Quasi-Newton heuristics;

design algorithms that can adapt to composite and finite-sum structures and that can also exploit curvature information.

Page 13: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

QuickeNing: main idea (an old one)

Idea: Smooth the function and then apply Quasi-Newton.

The strategy appears in early work about variable metric bundle methods [Chen and Fukushima, 1999, Fukushima and Qi, 1996, Mifflin, 1996, Fuentes, Malick, and Lemarechal, 2012, Burke and Qian, 2000], ...

Julien Mairal QuickeNing 5/28

Page 14: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

QuickeNing: main idea (an old one)

Idea: Smooth the function and then apply Quasi-Newton.

The strategy appears in early work about variable metric bundle methods [Chen and Fukushima, 1999, Fukushima and Qi, 1996, Mifflin, 1996, Fuentes, Malick, and Lemarechal, 2012, Burke and Qian, 2000], ...

The Moreau-Yosida envelope

Given f : R^d → R a convex function, the Moreau-Yosida envelope of f is the function F : R^d → R defined as

    F(x) = min_{w ∈ R^d} { f(w) + (κ/2) ‖w − x‖² }.

The proximal operator p(x) is the unique minimizer of the problem.

Julien Mairal QuickeNing 5/28
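For intuition, here is a small numerical sketch (not from the slides): it evaluates F(x) and the proximal point p(x) for the nonsmooth choice f = ‖·‖₁, whose proximal operator has the closed-form soft-thresholding expression; the function names are illustrative.

```python
import numpy as np

def prox_l1(x, kappa):
    """p(x) = argmin_w ||w||_1 + (kappa/2)||w - x||^2, i.e. soft-thresholding at 1/kappa."""
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / kappa, 0.0)

def moreau_envelope_l1(x, kappa):
    """F(x) = min_w ||w||_1 + (kappa/2)||w - x||^2, together with the minimizer p(x)."""
    p = prox_l1(x, kappa)
    F = np.sum(np.abs(p)) + 0.5 * kappa * np.sum((p - x) ** 2)
    return F, p

x = np.array([1.5, -0.2, 0.8])
kappa = 2.0
F, p = moreau_envelope_l1(x, kappa)
grad_F = kappa * (x - p)   # gradient formula from the next slide: grad F(x) = kappa (x - p(x))
print(F, p, grad_F)
```

One can check numerically on such examples that F ≤ f and that ∇F is κ-Lipschitz, as stated on the next slide.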

Page 15: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

The Moreau-Yosida regularization

    F(x) = min_{w ∈ R^d} { f(w) + (κ/2) ‖w − x‖² }.

Basic properties [see Lemarechal and Sagastizabal, 1997]

Minimizing f and F is equivalent in the sense that

    min_{x ∈ R^d} F(x) = min_{x ∈ R^d} f(x),

and the solution sets of the two problems coincide.

F is continuously differentiable even when f is not, and

    ∇F(x) = κ(x − p(x)).

In addition, ∇F is Lipschitz continuous with parameter L_F = κ.

If f is µ-strongly convex, then F is also strongly convex with parameter µ_F = µκ/(µ + κ).

Julien Mairal QuickeNing 6/28

Page 16: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

The Moreau-Yosida regularization

    F(x) = min_{w ∈ R^d} { f(w) + (κ/2) ‖w − x‖² }.

Basic properties [see Lemarechal and Sagastizabal, 1997]

Minimizing f and F is equivalent in the sense that

    min_{x ∈ R^d} F(x) = min_{x ∈ R^d} f(x),

and the solution sets of the two problems coincide.

F is continuously differentiable even when f is not, and

    ∇F(x) = κ(x − p(x)).

In addition, ∇F is Lipschitz continuous with parameter L_F = κ.

If f is µ-strongly convex, then F is also strongly convex with parameter µ_F = µκ/(µ + κ).

Julien Mairal QuickeNing 6/28

F enjoys nice properties: smoothness, (strong) convexity, and we can control its condition number 1 + κ/µ.

Page 17: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

A fresh look at Catalyst [Lin et al., 2015]

Julien Mairal QuickeNing 7/28

Page 18: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

A fresh look at the proximal point algorithm

A naive approach consists of minimizing the smoothed objective F instead of f with a method designed for smooth optimization.

Consider indeed

    x_{k+1} = x_k − (1/κ) ∇F(x_k).

By rewriting the gradient ∇F(x_k) as κ(x_k − p(x_k)), we obtain

    x_{k+1} = p(x_k) = argmin_{w ∈ R^d} { f(w) + (κ/2) ‖w − x_k‖² }.

This is exactly the proximal point algorithm [Martinet, 1970, Rockafellar, 1976].

Julien Mairal QuickeNing 8/28
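To make this concrete, here is a self-contained sketch (not from the slides) of the exact proximal point iteration x_{k+1} = p(x_k), where each sub-problem is handed to an off-the-shelf smooth solver; the toy quadratic and the function names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def proximal_point(f, grad_f, x0, kappa=1.0, n_iters=20):
    """Proximal point algorithm: x_{k+1} = argmin_w f(w) + (kappa/2)||w - x_k||^2."""
    x = x0.copy()
    for _ in range(n_iters):
        h = lambda w, xk=x: f(w) + 0.5 * kappa * np.sum((w - xk) ** 2)
        grad_h = lambda w, xk=x: grad_f(w) + kappa * (w - xk)
        x = minimize(h, x, jac=grad_h, method="L-BFGS-B").x  # inner solver as a stand-in
    return x

# Toy strongly convex quadratic f(x) = 0.5 x^T A x - b^T x (minimizer A^{-1} b).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad_f = lambda x: A @ x - b
print(proximal_point(f, grad_f, np.zeros(2)), np.linalg.solve(A, b))
```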

Page 19: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

A fresh look at the accelerated proximal point algorithm

Consider now

    x_{k+1} = y_k − (1/κ) ∇F(y_k)   and   y_{k+1} = x_{k+1} + β_{k+1}(x_{k+1} − x_k),

where β_{k+1} is a Nesterov-like extrapolation parameter. We may now rewrite the update using the value of ∇F, which gives

    x_{k+1} = p(y_k)   and   y_{k+1} = x_{k+1} + β_{k+1}(x_{k+1} − x_k).

This is the accelerated proximal point algorithm of Guler [1992].

Julien Mairal QuickeNing 9/28

Page 20: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

A fresh look at the accelerated proximal point algorithm

Consider now

    x_{k+1} = y_k − (1/κ) ∇F(y_k)   and   y_{k+1} = x_{k+1} + β_{k+1}(x_{k+1} − x_k),

where β_{k+1} is a Nesterov-like extrapolation parameter. We may now rewrite the update using the value of ∇F, which gives

    x_{k+1} = p(y_k)   and   y_{k+1} = x_{k+1} + β_{k+1}(x_{k+1} − x_k).

This is the accelerated proximal point algorithm of Guler [1992].

Remarks

F may be better conditioned than f when 1 + κ/µ ≤ L/µ;

Computing p(yk) has a cost!

Julien Mairal QuickeNing 9/28

Page 21: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

A fresh look at Catalyst [Lin, Mairal, and Harchaoui, 2015]

Catalyst is a particular accelerated proximal point algorithm with inexact gradients [Guler, 1992]:

    x_{k+1} ≈ p(y_k)   and   y_{k+1} = x_{k+1} + β_{k+1}(x_{k+1} − x_k).

The quantity x_{k+1} is obtained by using an optimization method M for approximately solving

    x_{k+1} ≈ argmin_{w ∈ R^d} { f(w) + (κ/2) ‖w − y_k‖² }.

Catalyst provides Nesterov's acceleration to M with...

restart strategies for solving the sub-problems;

global complexity analysis resulting in theoretical acceleration;

optimal balancing between outer and inner computations.

see also [Frostig et al., 2015]

Julien Mairal QuickeNing 10/28
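The outer loop can be sketched in a few lines of Python (not from the slides): a constant momentum β for a µ-strongly convex objective is used as a simplifying assumption, whereas the actual Catalyst schedule and its stopping rules for the sub-problems are richer; all names are illustrative.

```python
import numpy as np

def accelerated_inexact_prox_point(inexact_prox, x0, kappa, mu, n_iters=50):
    """Catalyst-style outer loop: x_{k+1} ~= p(y_k), then Nesterov-like extrapolation.

    inexact_prox(y) should return an approximate minimizer of
    w -> f(w) + (kappa/2)||w - y||^2, computed by some method M.
    """
    q = mu / (mu + kappa)
    beta = (1.0 - np.sqrt(q)) / (1.0 + np.sqrt(q))   # constant momentum (strongly convex case)
    x_prev, y = x0.copy(), x0.copy()
    for _ in range(n_iters):
        x = inexact_prox(y)                  # x_{k+1} ~= p(y_k)
        y = x + beta * (x - x_prev)          # y_{k+1} = x_{k+1} + beta (x_{k+1} - x_k)
        x_prev = x
    return x_prev

# Example: f(x) = 0.5 x^T A x - b^T x, with M = a few gradient steps on the sub-problem.
A, b = np.diag([10.0, 1.0]), np.array([1.0, 2.0])
mu, L, kappa = 1.0, 10.0, 1.0
grad_f = lambda x: A @ x - b

def inexact_prox(y, n_inner=20):
    w = y.copy()                             # warm start w0 = y
    for _ in range(n_inner):
        w -= (1.0 / (L + kappa)) * (grad_f(w) + kappa * (w - y))
    return w

print(accelerated_inexact_prox_point(inexact_prox, np.zeros(2), kappa, mu))
```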

Page 22: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Limited-Memory BFGS (L-BFGS)

Pros

one of the largest practical successes of smooth optimization.

Julien Mairal QuickeNing 11/28

Page 23: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Limited-Memory BFGS (L-BFGS)

Pros

one of the largest practical successes of smooth optimization.

Cons

worst-case convergence rates for strongly convex functions are linear, but much worse than those of the gradient descent method.

proximal variants typically require solving, many times,

    min_{x ∈ R^d}  (1/2) (x − z)ᵀ B_k (x − z) + ψ(x).

no guarantee of approximating the Hessian.

Julien Mairal QuickeNing 11/28

Page 24: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

QuickeNing

Main recipe

L-BFGS applied to the smoothed objective F with inexact gradients [see Friedlander and Schmidt, 2012].

inexact gradients are obtained by solving sub-problems using a first-order optimization method M;

ideally, M is able to adapt to the problem structure (finite sum, composite regularization).

replace L-BFGS steps by proximal point steps if no sufficient decrease is estimated ⇒ no line search on F;

Julien Mairal QuickeNing 12/28

Page 25: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Obtaining inexact gradients

Algorithm Procedure ApproxGradient

input: current point x in R^d; smoothing parameter κ > 0.

1: Compute the approximate mapping using an optimization method M:

       z ≈ argmin_{w ∈ R^d} { h(w) ≜ f(w) + (κ/2) ‖w − x‖² },

2: Estimate the gradient ∇F(x):

       g = κ(x − z).

output: approximate gradient estimate g, objective value F_a ≜ h(z), proximal mapping z.

Julien Mairal QuickeNing 13/28
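A minimal sketch of this procedure (not from the slides), with M instantiated as plain gradient descent on the sub-problem; the inner solver, step size and names are illustrative assumptions.

```python
import numpy as np

def approx_gradient(x, kappa, f, grad_f, L, n_inner=50):
    """Sketch of the ApproxGradient procedure above, with M = plain gradient descent.

    Approximately solves h(w) = f(w) + (kappa/2)||w - x||^2 and returns
    (g, F_a, z) with z ~= p(x), g = kappa (x - z) ~= grad F(x), F_a = h(z) ~= F(x).
    """
    w = x.copy()                      # warm start w0 = x (smooth case, see the restart slide)
    step = 1.0 / (L + kappa)          # h is (L + kappa)-smooth
    for _ in range(n_inner):
        w -= step * (grad_f(w) + kappa * (w - x))
    z = w
    g = kappa * (x - z)
    F_a = f(z) + 0.5 * kappa * np.sum((z - x) ** 2)
    return g, F_a, z
```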

Page 26: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Algorithm QuickeNing

input: x_0 in R^d; number of iterations K; κ > 0; minimization algorithm M.
1: Initialization: (g_0, F_0, z_0) = ApproxGradient(x_0, M); B_0 = κI.
2: for k = 0, ..., K − 1 do
3:   Perform the Quasi-Newton step
         x_test = x_k − B_k^{-1} g_k
         (g_test, F_test, z_test) = ApproxGradient(x_test, M).
4:   if F_test ≤ F_k − (1/(2κ)) ‖g_k‖², then
5:     (x_{k+1}, g_{k+1}, F_{k+1}, z_{k+1}) = (x_test, g_test, F_test, z_test).
6:   else
7:     Update the current iterate with the last proximal mapping:
           x_{k+1} = z_k = x_k − (1/κ) g_k
           (g_{k+1}, F_{k+1}, z_{k+1}) = ApproxGradient(x_{k+1}, M).
8:   end if
9:   Update B_{k+1} = L-BFGS(B_k, x_{k+1} − x_k, g_{k+1} − g_k).
10: end for
output: last proximal mapping z_K (solution).

Julien Mairal QuickeNing 14/28

Page 27: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Algorithm QuickeNing

input: x_0 in R^d; number of iterations K; κ > 0; minimization algorithm M.
1: Initialization: (g_0, F_0, z_0) = ApproxGradient(x_0, M); B_0 = κI.
2: for k = 0, ..., K − 1 do
3:   Perform the Quasi-Newton step
         x_test = x_k − B_k^{-1} g_k
         (g_test, F_test, z_test) = ApproxGradient(x_test, M).
4:   if F_test ≤ F_k − (1/(2κ)) ‖g_k‖², then
5:     (x_{k+1}, g_{k+1}, F_{k+1}, z_{k+1}) = (x_test, g_test, F_test, z_test).
6:   else
7:     Update the current iterate with the last proximal mapping:
           x_{k+1} = z_k = x_k − (1/κ) g_k
           (g_{k+1}, F_{k+1}, z_{k+1}) = ApproxGradient(x_{k+1}, M).
8:   end if
9:   Update B_{k+1} = L-BFGS(B_k, x_{k+1} − x_k, g_{k+1} − g_k).
10: end for
output: last proximal mapping z_K (solution).

Julien Mairal QuickeNing 14/28

The main characters:

the sequence (xk)k≥0 that minimizes F ;

the sequence (z_k)_{k≥0} produced by M that minimizes f;

the gradient approximations gk ≈ ∇F (xk);

the function value approximations Fk ≈ F (xk);

an L-BFGS update with inexact gradients;

an approximate sufficient descent condition.
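A compact sketch of this outer loop (not from the slides): to keep the code short, a dense BFGS inverse-Hessian update stands in for the limited-memory L-BFGS update, and the curvature-skipping safeguard is our own addition; approx_gradient is the procedure sketched earlier.

```python
import numpy as np

def quickening(approx_gradient, x0, kappa, K=30):
    """Sketch of the outer loop above (dense BFGS instead of L-BFGS, for brevity).

    approx_gradient(x) must return (g, F_a, z) as in the ApproxGradient sketch.
    """
    d = x0.size
    H = (1.0 / kappa) * np.eye(d)                      # inverse of B_0 = kappa * I
    x = x0.copy()
    g, F, z = approx_gradient(x)
    for _ in range(K):
        x_test = x - H @ g                             # quasi-Newton step
        g_test, F_test, z_test = approx_gradient(x_test)
        if F_test <= F - (0.5 / kappa) * (g @ g):      # approximate sufficient decrease
            x_new, g_new, F_new, z_new = x_test, g_test, F_test, z_test
        else:                                          # fall back to a proximal point step
            x_new = z                                  # x_{k+1} = z_k = x_k - (1/kappa) g_k
            g_new, F_new, z_new = approx_gradient(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 1e-12:                              # skip the update if curvature fails
            rho = 1.0 / (s @ y)
            V = np.eye(d) - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)     # BFGS update of the inverse matrix
        x, g, F, z = x_new, g_new, F_new, z_new
    return z                                           # last proximal mapping
```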

Page 28: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Requirements on M and restarts

Method M

Say a sub-problem consists of minimizing h; we want M to produce a sequence of iterates (w_t)_{t≥0} with linear convergence rate

    h(w_t) − h⋆ ≤ C_M (1 − τ_M)^t (h(w_0) − h⋆).

Restarts

When f is smooth, we initialize w_0 = x when solving

    min_{w ∈ R^d} { f(w) + (κ/2) ‖w − x‖² }.

When f = f_0 + ψ is composite, we use the initialization

    w_0 = argmin_{w ∈ R^d} { f_0(x) + ⟨∇f_0(x), w − x⟩ + ((L + κ)/2) ‖w − x‖² + ψ(w) }.

Julien Mairal QuickeNing 15/28
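For the composite case, the initialization above is exactly one proximal-gradient step with step size 1/(L + κ). Here is a sketch (not from the slides) specialized to ψ = λ‖·‖₁, where the proximal operator is soft-thresholding; the names are illustrative.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def composite_warm_start(x, grad_f0, L, kappa, lam):
    """w0 = argmin_w f0(x) + <grad f0(x), w - x> + ((L + kappa)/2)||w - x||^2 + lam ||w||_1,
    i.e. one proximal-gradient step from x with step size 1/(L + kappa)."""
    step = 1.0 / (L + kappa)
    return soft_threshold(x - step * grad_f0(x), step * lam)
```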

Page 29: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

When do we stop the method M?

Three strategies to balance outer and inner computations

(a) use a pre-defined sequence (ε_k)_{k≥0} and stop the optimization method M when the approximate proximal mapping is ε_k-accurate.

(b) define an adaptive stopping criterion that depends on quantities that are available at iteration k.

(c) use a pre-defined budget T_M of iterations of the method M for solving each sub-problem.

Julien Mairal QuickeNing 16/28

Page 30: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

When do we stop the method M?

Three strategies to balance outer and inner computations

(a) use a pre-defined sequence (ε_k)_{k≥0} and stop the optimization method M when the approximate proximal mapping is ε_k-accurate.

(b) define an adaptive stopping criterion that depends on quantities that are available at iteration k.

(c) use a pre-defined budget T_M of iterations of the method M for solving each sub-problem.

Remarks

(a) is the least practical strategy.

(b) is simpler to use and conservative (compatible with theory).

(c) requires T_M to be large enough in theory. The aggressive strategy T_M = n for an incremental method is extremely simple to use and effective in practice.

Julien Mairal QuickeNing 16/28

Page 31: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

When do we stop the method M?

Three strategies for µ-strongly convex objectives f

(a) use a pre-defined sequence (ε_k)_{k≥0} and stop the optimization method M when the approximate proximal mapping is ε_k-accurate:

    ε_k = (1/2) C (1 − ρ)^{k+1}   with C ≥ f(x_0) − f⋆ and ρ = µ / (4(µ + κ)).

(b) For minimizing h(w) = f(w) + (κ/2)‖w − x‖², stop when

    h(w_t) − h⋆ ≤ (κ/36) ‖w_t − x‖².

(c) use a pre-defined budget T_M of iterations of the method M for solving each sub-problem with

    T_M = (1/τ_M) log( 19 C_M (L + κ)/κ ).   (be more aggressive in practice)

Julien Mairal QuickeNing 17/28
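As a small illustration of criterion (b) (not from the slides): since h⋆ is unknown, some computable upper bound on h(w_t) − h⋆ is needed; in this sketch we use the strong-convexity bound h(w) − h⋆ ≤ ‖∇h(w)‖² / (2(µ + κ)) as a stand-in (a duality gap would be another option), and M is plain gradient descent. All names and choices are illustrative assumptions.

```python
import numpy as np

def inner_solve_criterion_b(x, kappa, mu, L, grad_f, max_inner=1000):
    """Gradient descent on h(w) = f(w) + (kappa/2)||w - x||^2, stopped with criterion (b):
    stop when an upper bound on h(w_t) - h* falls below (kappa/36) ||w_t - x||^2.
    h is (mu + kappa)-strongly convex, hence h(w) - h* <= ||grad h(w)||^2 / (2 (mu + kappa)).
    """
    w = x.copy()
    step = 1.0 / (L + kappa)
    for _ in range(max_inner):
        grad_h = grad_f(w) + kappa * (w - x)
        gap_bound = (grad_h @ grad_h) / (2.0 * (mu + kappa))
        if gap_bound <= (kappa / 36.0) * ((w - x) @ (w - x)):
            break
        w -= step * grad_h
    return w
```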

Page 32: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Remarks and worst-case global complexity

Composite objectives and sparsity

Consider a composite problem with a sparse solution (e.g., ψ = ℓ1). The method produces two sequences (x_k)_{k≥0} and (z_k)_{k≥0}:

F(x_k) → F⋆: (x_k) minimizes the smoothed objective ⇒ no sparsity;

f(z_k) → f⋆: (z_k) minimizes the true objective ⇒ the iterates may be sparse if M handles composite optimization problems;

Global complexity

The number of iterations of M to guarantee f(z_k) − f⋆ ≤ ε is at most

O( (µ + κ)/(τ_M µ) log(1/ε) ) for µ-strongly convex problems.

O( κR²/(τ_M ε) ) for convex problems.

Julien Mairal QuickeNing 18/28

Page 33: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Global Complexity and choice of κ

Example for gradient descent

With the right step-size, we have τ_M = (µ + κ)/(L + κ) and the complexity for µ > 0 becomes

    O( (L + κ)/µ · log(1/ε) ).

Example for SVRG for minimizing the sum of n functions

τ_M = min(1/n, (µ + κ)/(L + κ)) and the complexity for µ > 0 is

    O( max( n(µ + κ)/µ, (L + κ)/µ ) log(1/ε) ).

Julien Mairal QuickeNing 19/28

Page 34: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Global Complexity and choice of κ

Example for gradient descent

With the right step-size, we have τ_M = (µ + κ)/(L + κ) and the complexity for µ > 0 becomes

    O( (L + κ)/µ · log(1/ε) ).

Example for SVRG for minimizing the sum of n functions

τ_M = min(1/n, (µ + κ)/(L + κ)) and the complexity for µ > 0 is

    O( max( n(µ + κ)/µ, (L + κ)/µ ) log(1/ε) ).

QuickeNing does not provide any theoretical acceleration, but it does not significantly degrade the worst-case performance of M (unlike L-BFGS vs. gradient descent).

Julien Mairal QuickeNing 19/28

Page 35: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Global Complexity and choice of κ

Example for gradient descent

With the right step-size, we have τ_M = (µ + κ)/(L + κ) and the complexity for µ > 0 becomes

    O( (L + κ)/µ · log(1/ε) ).

Example for SVRG for minimizing the sum of n functions

τ_M = min(1/n, (µ + κ)/(L + κ)) and the complexity for µ > 0 is

    O( max( n(µ + κ)/µ, (L + κ)/µ ) log(1/ε) ).

Julien Mairal QuickeNing 19/28

Then, how to choose κ?
(i) assume that L-BFGS steps do as well as Nesterov;
(ii) choose κ as in Catalyst.

Page 36: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Experiments: formulations

ℓ2-regularized Logistic Regression:

    min_{x ∈ R^d}  (1/n) ∑_{i=1}^n log( 1 + exp(−b_i a_iᵀ x) ) + (µ/2) ‖x‖²,

ℓ1-regularized Linear Regression (LASSO):

    min_{x ∈ R^d}  (1/(2n)) ∑_{i=1}^n (b_i − a_iᵀ x)² + λ ‖x‖₁,

ℓ1-ℓ2²-regularized Linear Regression (Elastic-Net):

    min_{x ∈ R^d}  (1/(2n)) ∑_{i=1}^n (b_i − a_iᵀ x)² + λ ‖x‖₁ + (µ/2) ‖x‖²,

Julien Mairal QuickeNing 20/28
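For reference, a numpy sketch (not from the slides) of these three objectives and the gradients of their smooth parts; the variable names (A for the data matrix, b for the labels/targets) are illustrative.

```python
import numpy as np

def logistic_l2(x, A, b, mu):
    """(1/n) sum_i log(1 + exp(-b_i a_i^T x)) + (mu/2)||x||^2 and its gradient."""
    z = b * (A @ x)
    obj = np.mean(np.logaddexp(0.0, -z)) + 0.5 * mu * (x @ x)
    grad = -(A.T @ (b / (1.0 + np.exp(z)))) / A.shape[0] + mu * x
    return obj, grad

def lasso(x, A, b, lam):
    """(1/(2n)) sum_i (b_i - a_i^T x)^2 + lam ||x||_1 (returns the smooth-part gradient)."""
    r = A @ x - b
    obj = 0.5 * np.mean(r ** 2) + lam * np.sum(np.abs(x))
    return obj, (A.T @ r) / A.shape[0]

def elastic_net(x, A, b, lam, mu):
    """(1/(2n)) sum_i (b_i - a_i^T x)^2 + lam ||x||_1 + (mu/2)||x||^2."""
    obj, grad_smooth = lasso(x, A, b, lam)
    return obj + 0.5 * mu * (x @ x), grad_smooth + mu * x
```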

Page 37: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Experiments: Datasets

We consider four standard machine learning datasets with different characteristics in terms of size and dimension:

    name   covtype   alpha     real-sim   rcv1
    n      581 012   250 000   72 309     781 265
    d      54        500       20 958     47 152

we simulate the ill-conditioned regime µ = 1/(100n);

λ for the Lasso leads to about 10% non-zero coefficients.

Julien Mairal QuickeNing 21/28

Page 38: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Experiments: QuickeNing-SVRG

We consider the methods

SVRG: the Prox-SVRG algorithm of Xiao and Zhang [2014].

Catalyst-SVRG: Catalyst applied to SVRG;

L-BFGS (for smooth objectives): Mark Schmidt’s implementation.

QuickeNing-SVRG1: QuickeNing with aggressive strategy (c): one pass over the data in the inner loop.

QuickeNing-SVRG2: strategy (b), compatible with theory.

We produce 12 figures (3 formulations, 4 datasets).

Julien Mairal QuickeNing 22/28

Page 39: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Experiments: QuickeNing-SVRG (log scale)

[Figure: relative function value (log scale) vs. number of gradient evaluations on covtype and rcv1, for logistic regression (µ = 1/(100n)) and lasso (λ = 10/n), comparing SVRG, Catalyst-SVRG, QuickeNing-SVRG1, QuickeNing-SVRG2, and L-BFGS (QuickeNing-SVRG1* on the lasso panels).]

QuickeNing-SVRG1 ≥ SVRG, QuickeNing-SVRG2;

QuickeNing-SVRG2 ≥ SVRG;

QuickeNing-SVRG1 ≥ Catalyst-SVRG in 10/12 cases.

Julien Mairal QuickeNing 23/28

Page 40: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Experiments: QuickeNing-ISTA

We consider the methods

ISTA: the proximal gradient descent method with line search.

FISTA: the accelerated ISTA of Beck and Teboulle [2009].

L-BFGS (for smooth objectives): Mark Schmidt’s implementation.

QuickeNing-ISTA1: QuickeNing with aggressive strategy (c): one pass over the data in the inner loop.

QuickeNing-ISTA2: strategy (b), compatible with theory.

Julien Mairal QuickeNing 24/28

Page 41: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Experiments: QuickeNing-ISTA (log scale)

[Figure: relative function value (log scale) vs. number of gradient evaluations on covtype and rcv1, for logistic regression (µ = 1/(100n)) and lasso (λ = 10/n), comparing ISTA, FISTA, QuickeNing-ISTA1, QuickeNing-ISTA2, and L-BFGS (QuickeNing-ISTA1* on the lasso panels).]

L-BFGS (for smooth f ) is slightly better than QuickeNing-ISTA1;

QuickeNing-ISTA ≥ or ≫ FISTA in 11/12 cases.

QuickeNing-ISTA1 ≥ QuickeNing-ISTA2.

Julien Mairal QuickeNing 25/28

Page 42: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Experiments: Influence of κ

[Figure: relative function value (log scale) vs. number of gradient evaluations on covtype and rcv1, for logistic regression (µ = 1/(100n)) and lasso (λ = 10/n), comparing QuickeNing-SVRG with κ ∈ {0.001, 0.01, 0.1, 1, 10, 100, 1000} · κ0.]

κ0 is the parameter (same as in Catalyst) used in all experiments;

QuickeNing slows down when using κ > κ0;

here, for SVRG, QuickeNing is robust to small values of κ!

Julien Mairal QuickeNing 26/28

Page 43: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Experiments: Influence of l

[Figure: relative function value (log scale) vs. number of gradient evaluations on covtype and rcv1, for logistic regression (µ = 1/(100n)) and lasso (λ = 10/n), comparing QuickeNing-SVRG1 with L-BFGS memory l ∈ {1, 2, 5, 10, 20, 100}.]

l = 100 in all previous experiments;

l = 5 seems to be a reasonable choice in many cases, especially for sparse problems.

Julien Mairal QuickeNing 27/28

Page 44: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Conclusions and perspectives

A simple generic Quasi-Newton method for composite functions, with simple sub-problems, and complexity guarantees.

We also have a variant for dual approaches.

Does not close the gap between theory and practice for L-BFGS.

Perspectives

QuickeNing-BCD, QuickeNing-SAG, -SAGA, -SDCA, ...

Other types of smoothing? ⇒ Links with recent Quasi-Newton methods applied to other envelopes [Stella et al., 2016].

Julien Mairal QuickeNing 28/28

Page 45: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Outer-loop convergence analysis

Lemma: approximate descent property

    F(x_{k+1}) ≤ f(z_k) ≤ F(x_k) − (1/(4κ)) ‖∇F(x_k)‖₂² + 2 ε_k.

Then, ε_k should be smaller than (1/(4κ)) ‖∇F(x_k)‖₂², and indeed

Julien Mairal QuickeNing 29/28

Page 46: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Outer-loop convergence analysis

Lemma: approximate descent property

    F(x_{k+1}) ≤ f(z_k) ≤ F(x_k) − (1/(4κ)) ‖∇F(x_k)‖₂² + 2 ε_k.

Then, ε_k should be smaller than (1/(4κ)) ‖∇F(x_k)‖₂², and indeed

Proposition: convergence with impractical ε_k and µ > 0

If ε_k ≤ (1/(16κ)) ‖∇F(x_k)‖₂², define ρ = µ / (4(µ + κ)); then

    F(x_{k+1}) − F⋆ ≤ f(z_k) − f⋆ ≤ (1 − ρ)^{k+1} (f(x_0) − f⋆).

Unfortunately, ‖∇F (xk)‖ is unknown.

Julien Mairal QuickeNing 29/28

Page 47: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Outer-loop convergence analysis

Lemma: approximate descent property

    F(x_{k+1}) ≤ f(z_k) ≤ F(x_k) − (1/(4κ)) ‖∇F(x_k)‖₂² + 2 ε_k.

Then, ε_k should be smaller than (1/(4κ)) ‖∇F(x_k)‖₂², and indeed

Proposition: convergence with impractical ε_k and µ > 0

If ε_k ≤ (1/(16κ)) ‖∇F(x_k)‖₂², define ρ = µ / (4(µ + κ)); then

    F(x_{k+1}) − F⋆ ≤ f(z_k) − f⋆ ≤ (1 − ρ)^{k+1} (f(x_0) − f⋆).

Unfortunately, ‖∇F (xk)‖ is unknown.

Lemma: convergence with adaptive ε_k and µ > 0

If ε_k ≤ (1/(36κ)) ‖g_k‖², then ε_k ≤ (1/(16κ)) ‖∇F(x_k)‖².

This is strategy (b). gk is known and easy to compute.

Julien Mairal QuickeNing 29/28

Page 48: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Inner-loop complexity analysis

Restart for L-smooth functions

For minimizing h, initialize the method M with w_0 = x. Then,

    h(w_0) − h⋆ ≤ ((L + κ)/(2κ²)) ‖∇F(x)‖².   (1)

Proof.

We have the optimality condition ∇f(w⋆) + κ(w⋆ − x) = 0, i.e., ∇f(w⋆) = κ(x − w⋆). As a result,

    h(w_0) − h⋆ = f(x) − ( f(w⋆) + (κ/2) ‖w⋆ − x‖² )
                ≤ f(w⋆) + ⟨∇f(w⋆), x − w⋆⟩ + (L/2) ‖x − w⋆‖² − ( f(w⋆) + (κ/2) ‖w⋆ − x‖² )
                = ((L + κ)/2) ‖w⋆ − x‖²   (using ∇f(w⋆) = κ(x − w⋆))
                = ((L + κ)/(2κ²)) ‖∇F(x)‖².

Julien Mairal QuickeNing 30/28

Page 49: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Quasi-Newton and L-BFGS
Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Quasi-Newton methods work with the parameter and gradient differences between successive iterations:

    s_k ≜ x_{k+1} − x_k,   y_k ≜ ∇f(x_{k+1}) − ∇f(x_k).

Julien Mairal QuickeNing 31/28

Page 50: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Quasi-Newton and L-BFGS
Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Quasi-Newton methods work with the parameter and gradient differences between successive iterations:

    s_k ≜ x_{k+1} − x_k,   y_k ≜ ∇f(x_{k+1}) − ∇f(x_k).

They start with an initial approximation B_0 ≜ σI, and choose B_{k+1} to interpolate the gradient difference:

    B_{k+1} s_k = y_k.

Julien Mairal QuickeNing 31/28

Page 51: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Quasi-Newton and L-BFGS
Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Quasi-Newton methods work with the parameter and gradient differences between successive iterations:

    s_k ≜ x_{k+1} − x_k,   y_k ≜ ∇f(x_{k+1}) − ∇f(x_k).

They start with an initial approximation B_0 ≜ σI, and choose B_{k+1} to interpolate the gradient difference:

    B_{k+1} s_k = y_k.

Since B_{k+1} is not unique, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method chooses the symmetric matrix whose difference with B_k is minimal:

    B_{k+1} = B_k − (B_k s_k s_kᵀ B_k)/(s_kᵀ B_k s_k) + (y_k y_kᵀ)/(y_kᵀ s_k).

Julien Mairal QuickeNing 31/28
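The update above transcribes directly into numpy; this sketch (not from the slides) adds the usual curvature safeguard s_kᵀ y_k > 0 as an assumption, since the slide discusses positive-definiteness only on the next page.

```python
import numpy as np

def bfgs_update(B, s, y, eps=1e-12):
    """B_{k+1} = B_k - (B_k s s^T B_k) / (s^T B_k s) + (y y^T) / (y^T s)."""
    if s @ y <= eps:          # standard safeguard: skip when the curvature condition fails
        return B
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)
```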

Page 52: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Quasi-Newton and L-BFGS
Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Update skipping/damping or a sophisticated line search (Wolfe conditions) can keep B_{k+1} positive-definite.

Julien Mairal QuickeNing 32/28

Page 53: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Quasi-Newton and L-BFGS
Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Update skipping/damping or a sophisticated line search (Wolfe conditions) can keep B_{k+1} positive-definite.

They perform updates of the form

    x_{k+1} ← x_k − η_k B_k^{-1} ∇f(x_k).

Julien Mairal QuickeNing 32/28

Page 54: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Quasi-Newton and L-BFGS
Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Update skipping/damping or a sophisticated line search (Wolfe conditions) can keep B_{k+1} positive-definite.

They perform updates of the form

    x_{k+1} ← x_k − η_k B_k^{-1} ∇f(x_k).

The BFGS method has a superlinear convergence rate.

Julien Mairal QuickeNing 32/28

Page 55: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Quasi-Newton and L-BFGS
Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Update skipping/damping or a sophisticated line search (Wolfe conditions) can keep B_{k+1} positive-definite.

They perform updates of the form

    x_{k+1} ← x_k − η_k B_k^{-1} ∇f(x_k).

The BFGS method has a superlinear convergence rate.

But, it still uses a dense p × p matrix Bk .

Julien Mairal QuickeNing 32/28

Page 56: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Quasi-Newton and L-BFGS
Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Update skipping/damping or a sophisticated line search (Wolfe conditions) can keep B_{k+1} positive-definite.

They perform updates of the form

    x_{k+1} ← x_k − η_k B_k^{-1} ∇f(x_k).

The BFGS method has a superlinear convergence rate.

But it still uses a dense p × p matrix B_k.

Instead of storing B_k, the limited-memory BFGS (L-BFGS) method stores the previous l differences s_k and y_k.

Julien Mairal QuickeNing 32/28

Page 57: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

Quasi-Newton and L-BFGS
Presentation borrowed from Mark Schmidt, NIPS OPT 2010

Update skipping/damping or a sophisticated line search (Wolfe conditions) can keep B_{k+1} positive-definite.

They perform updates of the form

    x_{k+1} ← x_k − η_k B_k^{-1} ∇f(x_k).

The BFGS method has a superlinear convergence rate.

But it still uses a dense p × p matrix B_k.

Instead of storing B_k, the limited-memory BFGS (L-BFGS) method stores the previous l differences s_k and y_k.

We can solve a linear system involving these updates when B_0 is diagonal in O(dl) [Nocedal, 1980].

Julien Mairal QuickeNing 32/28
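The O(dl) computation mentioned above is the classical two-loop recursion of Nocedal [1980]; a sketch (not from the slides), with the memory stored as a plain list of (s, y) pairs and a scalar initial matrix H_0 = γI as assumptions:

```python
import numpy as np

def lbfgs_two_loop(grad, memory, gamma=1.0):
    """Compute H_k @ grad, where H_k is the L-BFGS approximation of the inverse Hessian,
    in O(d*l) operations from the l stored pairs (s_i, y_i).

    memory: list of the most recent (s, y) pairs, oldest first.
    gamma:  scaling of the initial matrix H_0 = gamma * I (a common choice is s^T y / y^T y).
    """
    q = grad.copy()
    alphas = []
    for s, y in reversed(memory):              # first loop: most recent pair first
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append(alpha)
    r = gamma * q                              # apply the initial matrix H_0
    for (s, y), alpha in zip(memory, reversed(alphas)):   # second loop: oldest pair first
        rho = 1.0 / (y @ s)
        beta = rho * (y @ r)
        r += s * (alpha - beta)
    return r                                   # r ~= H_k @ grad (quasi-Newton direction)
```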

Page 58: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

References I

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

J. V. Burke and Maijian Qian. On the superlinear convergence of the variable metric proximal point algorithm using Broyden and BFGS matrix secant updating. Mathematical Programming, 88(1):157–181, 2000.

R. H. Byrd, J. Nocedal, and F. Oztoprak. An inexact successive quadratic approximation method for L-1 regularized optimization. Mathematical Programming, 157(2):375–396, 2015.

R. H. Byrd, S. L. Hansen, Jorge Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.

Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.

Julien Mairal QuickeNing 33/28

Page 59: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

References II

Xiaojun Chen and Masao Fukushima. Proximal quasi-Newton methods for nondifferentiable convex optimization. Mathematical Programming, 85(2):313–334, 1999.

Patrick L. Combettes and Valerie R. Wajs. Signal recovery by proximal forward-backward splitting. Multiscale Modeling & Simulation, 4(4):1168–1200, 2005.

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), 2014a.

Aaron Defazio, Justin Domke, and Tiberio S. Caetano. Finito: A faster, permutable incremental gradient method for big data problems. In Proceedings of the International Conferences on Machine Learning (ICML), 2014b.

Michael P. Friedlander and Mark Schmidt. Hybrid deterministic-stochastic methods for data fitting. SIAM Journal on Scientific Computing, 34(3):A1380–A1405, 2012.

Julien Mairal QuickeNing 34/28

Page 60: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

References III

Roy Frostig, Rong Ge, Sham M. Kakade, and Aaron Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In Proceedings of the International Conferences on Machine Learning (ICML), 2015.

Marc Fuentes, Jerome Malick, and Claude Lemarechal. Descentwise inexact proximal algorithms for smooth optimization. Computational Optimization and Applications, 53(3):755–769, 2012.

Masao Fukushima and Liqun Qi. A globally and superlinearly convergent algorithm for nonsmooth convex minimization. SIAM Journal on Optimization, 6(4):1106–1120, 1996.

S. Ghadimi, G. Lan, and H. Zhang. Generalized uniformly optimal methods for nonlinear programming. arXiv:1508.07384, 2015.

R. M. Gower, D. Goldfarb, and P. Richtarik. Stochastic block BFGS: Squeezing more curvature out of data. In Proceedings of the International Conferences on Machine Learning (ICML), 2016.

O. Guler. New proximal point algorithms for convex minimization. SIAM Journal on Optimization, 2(4):649–664, 1992.

Julien Mairal QuickeNing 35/28

Page 61: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

References IV

Jason Lee, Yuekai Sun, and Michael Saunders. Proximal Newton-type methods for convex optimization. In Advances in Neural Information Processing Systems (NIPS), 2012.

Claude Lemarechal and Claudia Sagastizabal. Practical aspects of the Moreau-Yosida regularization: Theoretical preliminaries. SIAM Journal on Optimization, 7(2):367–385, 1997.

Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (NIPS), 2015.

J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.

Bernard Martinet. Breve communication. Regularisation d'inequations variationnelles par approximations successives. 4(3):154–158, 1970.

Robert Mifflin. A quasi-second-order proximal bundle algorithm. Mathematical Programming, 73(1):51–72, 1996.

Julien Mairal QuickeNing 36/28

Page 62: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

References V

Y. Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.

Jorge Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782, 1980.

R. Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877–898, 1976.

Katya Scheinberg and Xiaocheng Tang. Practical inexact proximal quasi-Newton method with global complexity analysis. Mathematical Programming, 160(1):495–529, 2016.

M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 160(1):83–112, 2017.

S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv:1211.2717, 2012.

Lorenzo Stella, Andreas Themelis, and Panagiotis Patrinos. Forward-backward quasi-Newton methods for nonsmooth optimization problems. arXiv preprint arXiv:1604.08096, 2016.

Julien Mairal QuickeNing 37/28

Page 63: A Generic Quasi-Newton Algorithm for Faster Gradient-Based

References VI

S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479–2493, 2009.

L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.

Jin Yu, S. V. N. Vishwanathan, Simon Gunter, and Nicol N. Schraudolph. A quasi-Newton approach to non-smooth convex optimization. In Proceedings of the International Conferences on Machine Learning (ICML), 2008.

Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of the International Conferences on Machine Learning (ICML), 2015.

Julien Mairal QuickeNing 38/28