
NIPS paper review 2014: A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method



Page 1

A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014

speaker: kv

MCLab, CITI, Academia Sinica

May 14, 2015


Page 2

Overview

1 Introduction
  - Smooth Unconstrained Optimization
  - Accelerated Scheme
  - Ordinary Differential Equation

2 Derivation of the ODE
  - Simple Properties

3 Equivalence between the ODE and Nesterov’s scheme
  - Analogous Convergence Rate
  - Quadratic f and Bessel function
  - Equivalence between the ODE and Nesterov’s Scheme

4 A family of generalized Nesterov’s schemes
  - Continuous Optimization
  - Composite Optimization

5 New Restart Scheme

6 Accelerating to linear convergence by restarting
  - Numerical examples

7 Discussion


Page 3

Introduction


Page 4

Smooth Unconstrained Optimization

We wish to minimize a smooth convex function

minimize f(x)

where f : ℝⁿ → ℝ has a Lipschitz continuous gradient:

‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂

µ-strong convexity: f is µ-strongly convex if f(x) − µ‖x‖²/2 is convex.

In this paper, F_L denotes the class of convex functions f with L-Lipschitz continuous gradients defined on ℝⁿ; S_µ denotes the class of µ-strongly convex functions f on ℝⁿ. We set S_{µ,L} = F_L ∩ S_µ.


Page 5

Introduction: Accelerated Scheme

Nesterov’s Accelerated Gradient Scheme

x_k = y_{k−1} − s∇f(y_{k−1})    (1)

y_k = x_k + ((k − 1)/(k + 2)) (x_k − x_{k−1})    (2)

For fixed step size s = 1/L, where L is the Lipschitz constant of ∇f, this scheme exhibits the convergence rate

f(x_k) − f* ≤ O(L‖x_0 − x*‖²/k²)

This improvement relies on the introduction of the momentum term x_k − x_{k−1} and the particularly tuned coefficient (k − 1)/(k + 2) = 1 − 3/(k + 2).
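As a concrete reference, here is a minimal Python sketch of scheme (1)-(2); the least-squares test problem, step size, and iteration count are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def nesterov(grad_f, x0, s, iters):
    """Nesterov's accelerated gradient, scheme (1)-(2), with y_0 = x_0."""
    x_prev, x = x0.copy(), x0.copy()
    for k in range(1, iters + 1):
        y = x + (k - 1) / (k + 2) * (x - x_prev)  # momentum step (2)
        x_prev, x = x, y - s * grad_f(y)          # gradient step (1)
    return x

# Example: f(x) = 0.5 * ||A x - b||^2 with s = 1/L, L = largest eigenvalue of A^T A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A.T @ A).max()
x_hat = nesterov(lambda x: A.T @ (A @ x - b), np.zeros(2), 1.0 / L, 500)
```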


Page 6

Accelerated Scheme: Oscillation Problem

In general, Nesterov’s scheme is not monotone in the objective function value, due to the introduction of momentum.


Page 7

Introduction: Second order ODE

We derive a second-order ordinary differential equation (ODE), which is the exact limit of Nesterov’s scheme as the step size tends to zero:

Ẍ + (3/t)Ẋ + ∇f(X) = 0    (3)

for t > 0, with initial conditions X(0) = x_0, Ẋ(0) = 0; here x_0 is the starting point in Nesterov’s scheme, Ẋ denotes velocity, and Ẍ acceleration.

Small t: the large damping coefficient 3/t makes the ODE an over-damped system.

Large t: as t increases, the system behaves like an under-damped system, oscillating with an amplitude that decreases to zero.

The time parameter in this ODE is related to the iteration count and step size by t ≈ k√s.
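As a small numerical illustration of this correspondence (the quadratic test function, tolerances, and horizon below are assumptions, not from the slides), one can integrate (3) with scipy and read off Nesterov’s k-th iterate near time t = k√s:

```python
import numpy as np
from scipy.integrate import solve_ivp

def grad_f(x):
    # gradient of the example quadratic f = 0.02*x1^2 + 0.005*x2^2
    return np.array([0.04 * x[0], 0.01 * x[1]])

def rhs(t, z):
    # first-order form of ODE (3): z = (X, X'), z' = (X', -(3/t) X' - grad f(X))
    x, v = z[:2], z[2:]
    return np.concatenate([v, -3.0 / t * v - grad_f(x)])

x0 = np.array([1.0, 1.0])
t0 = 1e-6  # start just after t = 0 to avoid the 3/t singularity; X(0) = x0, X'(0) = 0
sol = solve_ivp(rhs, (t0, 100.0), np.concatenate([x0, np.zeros(2)]),
                t_eval=np.linspace(t0, 100.0, 1001), rtol=1e-8)
f_vals = 0.02 * sol.y[0]**2 + 0.005 * sol.y[1]**2  # f(X(t)) along the trajectory
```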


Page 8

An Example: Trajectories

Minimize f = 0.02 x_1² + 0.005 x_2²


Page 9

An Example: Zoomed Trajectories

Minimize f = 0.02 x_1² + 0.005 x_2²


Page 10

An Example: Errors f − f*

Minimize f = 0.02 x_1² + 0.005 x_2²


Page 11

Derivation of the ODE


Page 12

Derivation of the ODE

Assume f ∈ F_L for L > 0. Combining the two equations of scheme (1)-(2) and rescaling gives

(x_{k+1} − x_k)/√s = ((k − 1)/(k + 2)) (x_k − x_{k−1})/√s − √s ∇f(y_k)    (4)

Introduce the ansatz x_k ≈ X(k√s) for a smooth curve X(t) defined for t > 0. With these approximations, we get the Taylor expansions:

(x_{k+1} − x_k)/√s = Ẋ(t) + (1/2)Ẍ(t)√s + o(√s)

(x_k − x_{k−1})/√s = Ẋ(t) − (1/2)Ẍ(t)√s + o(√s)

√s ∇f(y_k) = √s ∇f(X(t)) + o(√s)

where in the third equality we use y_k − X(t) = o(1).


Page 13

Derivation of the ODE

The formula (4) can be rewritten as

Ẋ(t) + (1/2)Ẍ(t)√s + o(√s) = (1 − 3√s/t){Ẋ(t) − (1/2)Ẍ(t)√s + o(√s)} − √s ∇f(X(t)) + o(√s)

By comparing the coefficients of √s, we obtain

Ẍ + (3/t)Ẋ + ∇f(X) = 0

Theorem (Well-posed ODE: existence and uniqueness)

For any f ∈ F_∞ := ∪_{L>0} F_L and any x_0 ∈ ℝⁿ, the ODE (3) with initial conditions X(0) = x_0, Ẋ(0) = 0 has a unique global solution X.


Page 14

Simple Properties

Invariance

The ODE is invariant under time change and under transformation.

Initial asymptotic

Assume X is sufficiently smooth, so that lim_{t→0} Ẍ exists. The Mean Value Theorem guarantees the existence of ζ ∈ (0, t) such that

Ẋ(t)/t = (Ẋ(t) − Ẋ(0))/t = Ẍ(ζ)

Hence the ODE reduces to Ẍ(t) + 3Ẍ(ζ) + ∇f(X(t)) = 0. Taking t → 0, we have

X(t) = x_0 − ∇f(x_0) t²/8 + o(t²)

This is consistent with the empirical observation that Nesterov’s scheme moves slowly in the beginning.
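A quick symbolic sanity check of this expansion (the sympy snippet is my own illustration): with the gradient frozen at its initial value g = ∇f(x_0), the curve x_0 − g t²/8 satisfies (3) exactly.

```python
import sympy as sp

t, g, x0 = sp.symbols('t g x0')
X = x0 - g * t**2 / 8                                      # candidate small-t expansion
residual = sp.diff(X, t, 2) + (3 / t) * sp.diff(X, t) + g  # ODE (3) with grad f frozen at g
print(sp.simplify(residual))                               # prints 0
```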


Page 15

Connections and Interpretations


Page 16

Analogous Convergence Rate

Now we exhibit an approximate equivalence between the ODE and Nesterov’s scheme in terms of convergence rate.

Theorem (3.1, discrete Nesterov scheme)

For any f ∈ F_L, the sequence {x_k} in (1)-(2) with step size s ≤ 1/L obeys

f(x_k) − f* ≤ 2‖x_0 − x*‖² / (s(k + 1)²)

The following result indicates that the trajectory of the ODE (3) closely resembles the sequence {x_k} in terms of convergence rate.

Theorem (3.2, continuous ODE)

For any f ∈ F_∞, let X(t) be the unique global solution to (3) with initial conditions X(0) = x_0, Ẋ(0) = 0. Then for any t > 0,

f(X(t)) − f* ≤ 2‖x_0 − x*‖² / t²

Page 17

Proof of Theorem 3.2

Consider the energy functional

ε(t) := t²(f(X(t)) − f*) + 2‖X + (t/2)Ẋ − x*‖²

whose time derivative is

ε̇(t) = 2t(f(X) − f*) + t²⟨∇f, Ẋ⟩ + 4⟨X + (t/2)Ẋ − x*, (3/2)Ẋ + (t/2)Ẍ⟩

Substituting (3/2)Ẋ + (t/2)Ẍ with −(t/2)∇f(X), as dictated by the ODE (3), the term t²⟨∇f, Ẋ⟩ cancels and

ε̇(t) = 2t(f(X) − f*) + 4⟨X − x*, −(t/2)∇f(X)⟩ = 2t(f(X) − f*) − 2t⟨X − x*, ∇f(X)⟩ ≤ 0

where the last inequality follows from the convexity of f. Hence, by the monotonicity of ε and the non-negativity of 2‖X + (t/2)Ẋ − x*‖², the gap obeys

f(X(t)) − f* ≤ ε(t)/t² ≤ ε(0)/t² = 2‖x_0 − x*‖²/t²


Page 18

Quadratic f and Bessel function

For quadratic f, we have

f(x) = (1/2)⟨x, Ax⟩ + ⟨b, x⟩

where A ∈ ℝⁿˣⁿ. A simple translation can absorb the linear term ⟨b, x⟩. We assume A is positive semi-definite, admitting the spectral decomposition A = QᵀΛQ; replacing x with Qx, we may assume f = (1/2)⟨x, Λx⟩. The ODE then decouples coordinate-wise into

Ẍ_i + (3/t)Ẋ_i + λ_i X_i = 0,   i = 1, ..., n


Page 19

Quadratic f and Bessel function

Introduce Y_i(u) = u X_i(u/√λ_i), which satisfies

u² Y_i″ + u Y_i′ + (u² − 1) Y_i = 0

This is Bessel’s differential equation of order 1. Solving it and converting back to X_i, we obtain

X_i(t) = (2 x_{0,i} / (t√λ_i)) J_1(t√λ_i)

For large t, the Bessel function has the asymptotic form

J_1(t) = √(2/(πt)) (cos(t − 3π/4) + O(1/t))


Page 20

Quadratic f and Bessel function: Example

Minimize f = 0.02 x_1² + 0.005 x_2², with λ_{1,2} = 0.02, 0.005. Then

f(X) − f* = f(X) = Σ_{i=1}^{n} (2 x_{0,i}²/t²) J_1(t√λ_i)²

Denote the two major periods by T_1, T_2. We get T_1 = π/√λ_1 = 22.214 and T_2 = π/√λ_2 = 44.429.
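The closed-form error above is easy to evaluate directly; a sketch (the starting point x_0 is an assumption for illustration):

```python
import numpy as np
from scipy.special import j1  # Bessel function of the first kind, order 1

lam = np.array([0.02, 0.005])
x0 = np.array([1.0, 1.0])
t = np.linspace(1e-3, 300.0, 3000)
err = sum(2 * xi**2 / t**2 * j1(t * np.sqrt(li))**2 for xi, li in zip(x0, lam))
print(np.pi / np.sqrt(lam))  # the two major periods, approx. [22.214, 44.429]
```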


Page 21

Equivalence between the ODE and Nesterov’s scheme

We study the stable step size for numerically solving the ODE. The finite-difference approximation of (3) by the forward Euler method is

(X(t + Δt) − 2X(t) + X(t − Δt))/Δt² + (3/t)(X(t) − X(t − Δt))/Δt + ∇f(X(t)) = 0

which is equivalent to

X(t + Δt) = (2 − 3Δt/t) X(t) − Δt² ∇f(X(t)) − (1 − 3Δt/t) X(t − Δt)

Assuming that f is sufficiently smooth, for a small perturbation the characteristic equation of this finite-difference scheme is approximately (identifying k = t/Δt)

det(λ² − (2 − Δt² ∇²f − 3Δt/t) λ + 1 − 3Δt/t) = 0    (5)

For numerical stability, all roots of (5) should lie within the unit circle.
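A minimal sketch of this forward-Euler recursion for a one-dimensional quadratic (the curvature λ and step Δt are illustrative assumptions); pushing Δt up makes the iterates blow up, which is exactly the stability constraint above:

```python
import numpy as np

lam, dt, n = 1.0, 0.1, 2000      # curvature of f(x) = lam*x^2/2, step Delta t, step count
t = (np.arange(n) + 1.0) * dt    # start at t = dt to avoid dividing by zero
X = np.zeros(n)
X[0] = X[1] = 1.0                # X(0) = x0 with (approximately) zero initial velocity
for i in range(1, n - 1):
    X[i + 1] = ((2 - 3 * dt / t[i]) * X[i]
                - dt**2 * lam * X[i]            # Delta t^2 * grad f(X(t))
                - (1 - 3 * dt / t[i]) * X[i - 1])
```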

Page 22

A family of generalized Nesterov’s schemes

To exploit the power of the ODE, we study (3) with the constant 3 in the coefficient of Ẋ/t replaced by a general constant r:

Ẍ + (r/t)Ẋ + ∇f(X) = 0,   X(0) = x_0, Ẋ(0) = 0    (6)

By an argument similar to Theorem 2.1, this ODE is guaranteed to admit a unique global solution for any f ∈ F_∞.
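The natural discrete analog studied in the paper replaces the momentum coefficient (k − 1)/(k + 2) by (k − 1)/(k + r − 1), recovering Nesterov’s scheme at r = 3; a minimal sketch (problem data left to the caller):

```python
import numpy as np

def generalized_nesterov(grad_f, x0, s, r, iters):
    """Scheme (1)-(2) with momentum coefficient (k-1)/(k+r-1); r = 3 is Nesterov."""
    x_prev, x = x0.copy(), x0.copy()
    for k in range(1, iters + 1):
        y = x + (k - 1) / (k + r - 1) * (x - x_prev)  # generalized momentum step
        x_prev, x = x, y - s * grad_f(y)              # gradient step
    return x
```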


Page 23

Generalized Nesterov’s Scheme: Continuous Optimization

Theorem (4.1)

Suppose r > 3 and let X be the unique solution to (6) for some f ∈ F_∞. Then X(t) obeys

f(X(t)) − f* ≤ (r − 1)²‖x_0 − x*‖² / (2t²)

and

∫_0^∞ t (f(X(t)) − f*) dt ≤ (r − 1)²‖x_0 − x*‖² / (2(r − 3))

Theorem (4.2)

For any f ∈ S_{µ,L}(ℝⁿ), the unique solution X to (6) with r ≥ 9/2 obeys

f(X(t)) − f* ≤ C r^{5/2} ‖x_0 − x*‖² / (t³√µ)

for any t > 0 and a universal constant C > 1/2.

Page 24

Generalized Nesterov’s Scheme: Continuous Optimization

For example, the solution to (6) with f(x) = ‖x‖²/2 is

X(t) = 2^{(r−1)/2} Γ((r + 1)/2) J_{(r−1)/2}(t) / t^{(r−1)/2}

where J_{(r−1)/2}(·) is the Bessel function of the first kind of order (r − 1)/2. For large t, this Bessel function obeys J_{(r−1)/2}(t) = √(2/(πt)) (cos(t − rπ/4) + O(1/t)). Hence

f(X(t)) − f* ≤ ‖x_0 − x*‖²/t^r
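This closed form is easy to probe numerically (a sketch; scipy’s jv is the Bessel function of general order, and the value of r here is an arbitrary choice):

```python
import numpy as np
from scipy.special import jv, gamma

r = 5.0
nu = (r - 1) / 2
t = np.linspace(1e-3, 100.0, 1000)
X = 2**nu * gamma((r + 1) / 2) * jv(nu, t) / t**nu  # solution of (6) for f = ||x||^2/2
err = X**2 / 2                                      # f(X(t)) - f*, decaying like t^(-r)
```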


Page 25

Generalized Nesterov’s Scheme: Composite Optimization

(This part is skipped in the talk.)

min_{x∈ℝⁿ} f(x) = g(x) + h(x)

where g ∈ F_L for some L > 0 and h is convex on ℝⁿ, possibly taking the extended value ∞. Define the proximal subgradient

G_s(x) := (x − argmin_z {‖z − (x − s∇g(x))‖²/(2s) + h(z)}) / s
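For intuition, a sketch of G_s in the lasso-type case h(z) = λ‖z‖₁, where the argmin is soft-thresholding (g, λ, and s below are assumptions for illustration):

```python
import numpy as np

def G_s(x, s, grad_g, lam):
    """Proximal subgradient of f = g + lam*||.||_1 at x with step s."""
    z = x - s * grad_g(x)                                     # forward step on g
    prox = np.sign(z) * np.maximum(np.abs(z) - s * lam, 0.0)  # prox of h: soft-threshold
    return (x - prox) / s
```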


Page 26

New Restart Scheme


Page 27

Restart Scheme: Previous Works

Restart: erase the memory of previous iterations and reset the momentum back to zero.

Function scheme: we restart whenever

f(x_k) > f(x_{k−1})

Gradient scheme: we restart whenever

∇f(y_{k−1})ᵀ(x_k − x_{k−1}) > 0

Refer to: Brendan O’Donoghue and Emmanuel Candès, Adaptive Restart for Accelerated Gradient Schemes, 2012.
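A minimal sketch of the gradient scheme inside a Nesterov loop (the function scheme would compare f(x_k) with f(x_{k−1}) instead); resetting the counter k zeroes the momentum coefficient:

```python
import numpy as np

def nag_gradient_restart(grad_f, x0, s, iters):
    x_prev, x, k = x0.copy(), x0.copy(), 1
    for _ in range(iters):
        y = x + (k - 1) / (k + 2) * (x - x_prev)
        x_new = y - s * grad_f(y)
        k = 1 if grad_f(y) @ (x_new - x) > 0 else k + 1  # gradient restart test
        x_prev, x = x, x_new
    return x
```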


Page 28

Restart Scheme: Previous Works


Page 29

Restart Scheme: Previous Works


Page 30

New Restart Scheme: Speed Restart

This work provides a new restarting strategy, called the speed restarting scheme. The underlying motivation is to maintain a relatively high velocity Ẋ along the trajectory.

Definition 5.1

For the ODE (3) with X(0) = x_0, Ẋ(0) = 0, let

T = T(f, x_0) = sup{t > 0 : ∀u ∈ (0, t), d‖Ẋ(u)‖²/du > 0}

be the speed restarting time.

In words, T is the first time the velocity ‖Ẋ‖ decreases. Indeed, f(X(t)) is a decreasing function before time T: for t < T,

df(X(t))/dt = ⟨∇f(X), Ẋ⟩ = −(3/t)‖Ẋ‖² − (1/2) d‖Ẋ‖²/dt < 0


Page 31

Accelerating to linear convergence by restarting

The speed restarted ODE is thus

Ẍ(t) + (3/t_sr)Ẋ(t) + ∇f(X(t)) = 0    (7)

where t_sr is reset to zero whenever ⟨Ẋ, Ẍ⟩ = 0. We have the following observations:

- X^sr(t) is continuous for t ≥ 0, with X^sr(0) = x_0;

- X^sr(t) satisfies (3) for 0 < t < T_1 := T(x_0; f);

- recursively define T_{i+1} = T(X^sr(Σ_{j≤i} T_j); f) for i ≥ 1, and X(t) = X^sr(Σ_{j≤i} T_j + t) satisfies (3) on each such interval.


Page 32

Accelerating to linear convergence by restarting

Lemma 5.2

There is a universal constant C > 0 such that

f(X(T)) − f(x*) ≤ (1 − Cµ/L)(f(x_0) − f*)

This guarantees that each restart reduces the error by a constant factor.

Lemma 5.3

There is a universal constant C such that

T ≤ 4 exp(CL/µ) / (5√L)

This gives an upper bound for T; it confirms that restarts occur sufficiently often.


Page 33

Accelerating to linear convergence by restarting

Applying Lemmas 5.2 and 5.3, we have

Theorem (5.1)

There exist positive constants c_1 and c_2, which depend only on the condition number L/µ, such that for any f ∈ S_{µ,L},

f(X^sr(t)) − f(x*) ≤ (c_1 L‖x_0 − x*‖²/2) exp(−c_2 t√L)

where c_1 = exp(Cµ/L) and c_2 = (5Cµ/(4L)) exp(−Cµ/L). The theorem guarantees linear convergence of the solution to (7). This is a new result in the literature.


Page 34

Numerical examples: algorithm of speed restarting

Below we present a discrete analog of the restarted scheme.
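The algorithm box from the original slide is not reproduced here; in its place, a minimal sketch of the discrete analog, restarting whenever the step length ‖x_k − x_{k−1}‖ (the discrete speed) stops increasing:

```python
import numpy as np

def nag_speed_restart(grad_f, x0, s, iters):
    x_prev, x, k = x0.copy(), x0.copy(), 1
    speed_prev = 0.0
    for _ in range(iters):
        y = x + (k - 1) / (k + 2) * (x - x_prev)
        x_new = y - s * grad_f(y)
        speed = np.linalg.norm(x_new - x)
        k = 1 if speed < speed_prev else k + 1  # restart when the speed drops
        speed_prev = speed
        x_prev, x = x, x_new
    return x
```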


Page 35

Quadratic


Page 36

Log-sum-exp


Page 37

Matrix completion


Page 38

Lasso in ℓ1-constrained form with large sparse design


Page 39

References

W. Su, S. Boyd, and E. J. Candès (2014). A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights. NIPS 2014.


Page 40

thanks!
