
NIPS paper review 2014: A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method



Page 1

A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights

Weijie Su, Stephen Boyd, Emmanuel J. Candes, NIPS Conference 2014

speaker: kv

MCLab, CITI, Academia Sinica

May 14, 2015


Page 2

Overview

1 Introduction
  - Smooth Unconstrained Optimization
  - Accelerated Scheme
  - Ordinary Differential Equation

2 Derivation of the ODE
  - Simple Properties

3 Equivalence between the ODE and Nesterov’s scheme
  - Analogous Convergence Rate
  - Quadratic f and Bessel function
  - Equivalence between the ODE and Nesterov’s Scheme

4 A family of generalized Nesterov’s schemes
  - Continuous Optimization
  - Composite Optimization

5 New Restart Scheme

6 Accelerating to linear convergence by restarting
  - Numerical examples

7 Discussion


Page 3

Introduction


Page 4

Smooth Unconstrained Optimization

We wish to minimize a smooth convex function

minimize f(x)

where f : ℝⁿ → ℝ has a Lipschitz continuous gradient:

‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂

µ-strong convexity: f is µ-strongly convex if f(x) − µ‖x‖²/2 is convex.

In this paper, F_L denotes the class of convex functions f with L-Lipschitz continuous gradients defined on ℝⁿ; S_µ denotes the class of µ-strongly convex functions f on ℝⁿ. We set S_{µ,L} = F_L ∩ S_µ.


Page 5

Introduction: Accelerated Scheme

Nesterov’s Accelerated Gradient Scheme

x_k = y_{k−1} − s∇f(y_{k−1})    (1)

y_k = x_k + ((k − 1)/(k + 2)) (x_k − x_{k−1})    (2)

For fixed step size s = 1/L, where L is the Lipschitz constant of ∇f, this scheme exhibits the convergence rate

f(x_k) − f* ≤ O(L‖x_0 − x*‖²/k²)

This improvement relies on the introduction of the momentum term x_k − x_{k−1} and the particularly tuned coefficient (k − 1)/(k + 2) = 1 − 3/(k + 2).
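As a concrete reference, here is a minimal Python sketch of scheme (1)-(2); the least-squares test problem, step size, and iteration count are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def nesterov(grad_f, x0, s, iters):
    """Nesterov's accelerated gradient, scheme (1)-(2), with y_0 = x_0."""
    x_prev, x = x0.copy(), x0.copy()
    for k in range(1, iters + 1):
        y = x + (k - 1) / (k + 2) * (x - x_prev)  # momentum step (2)
        x_prev, x = x, y - s * grad_f(y)          # gradient step (1)
    return x

# Example: f(x) = 0.5 * ||A x - b||^2 with s = 1/L, L = largest eigenvalue of A^T A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
L = np.linalg.eigvalsh(A.T @ A).max()
x_hat = nesterov(lambda x: A.T @ (A @ x - b), np.zeros(2), 1.0 / L, 500)
```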


Page 6

Accelerated Scheme: Oscillation Problem

In general, Nesterov’s scheme is not monotone in the objective function value, due to the introduction of momentum.


Page 7

Introduction: Second order ODE

We derive a second-order ordinary differential equation (ODE), which is the exact limit of Nesterov’s scheme as the step size tends to zero:

Ẍ + (3/t)Ẋ + ∇f(X) = 0    (3)

for t > 0, with initial conditions X(0) = x_0, Ẋ(0) = 0; here x_0 is the starting point in Nesterov’s scheme, Ẋ denotes velocity, and Ẍ acceleration.

Small t: the large damping coefficient 3/t makes the ODE an over-damped system.

Large t: as t increases, the system behaves like an under-damped system, oscillating with an amplitude that decreases to zero.

The time parameter in this ODE is related to the iteration count and step size by t ≈ k√s.
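As a small numerical illustration of this correspondence (the quadratic test function, tolerances, and horizon below are assumptions, not from the slides), one can integrate (3) with scipy and read off Nesterov’s k-th iterate near time t = k√s:

```python
import numpy as np
from scipy.integrate import solve_ivp

def grad_f(x):
    # gradient of the example quadratic f = 0.02*x1^2 + 0.005*x2^2
    return np.array([0.04 * x[0], 0.01 * x[1]])

def rhs(t, z):
    # first-order form of ODE (3): z = (X, X'), z' = (X', -(3/t) X' - grad f(X))
    x, v = z[:2], z[2:]
    return np.concatenate([v, -3.0 / t * v - grad_f(x)])

x0 = np.array([1.0, 1.0])
t0 = 1e-6  # start just after t = 0 to avoid the 3/t singularity; X(0) = x0, X'(0) = 0
sol = solve_ivp(rhs, (t0, 100.0), np.concatenate([x0, np.zeros(2)]),
                t_eval=np.linspace(t0, 100.0, 1001), rtol=1e-8)
f_vals = 0.02 * sol.y[0]**2 + 0.005 * sol.y[1]**2  # f(X(t)) along the trajectory
```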


Page 8

An Example: Trajectories

Minimize f = 0.02 x_1² + 0.005 x_2²


Page 9

An Example: Zoomed Trajectories

Minimize f = 0.02 x_1² + 0.005 x_2²


Page 10

An Example: Errors f − f*

Minimize f = 0.02 x_1² + 0.005 x_2²


Page 11

Derivation of the ODE


Page 12

Derivation of the ODE

Assume f ∈ F_L for L > 0. Combining the two equations of scheme (1)-(2) and rescaling gives

(x_{k+1} − x_k)/√s = ((k − 1)/(k + 2)) (x_k − x_{k−1})/√s − √s ∇f(y_k)    (4)

Introduce the ansatz x_k ≈ X(k√s) for a smooth curve X(t) defined for t > 0. With these approximations, we get the Taylor expansions:

(x_{k+1} − x_k)/√s = Ẋ(t) + (1/2)Ẍ(t)√s + o(√s)

(x_k − x_{k−1})/√s = Ẋ(t) − (1/2)Ẍ(t)√s + o(√s)

√s ∇f(y_k) = √s ∇f(X(t)) + o(√s)

where in the third equality we use y_k − X(t) = o(1).


Page 13

Derivation of the ODE

The formula (4) can be rewritten as

Ẋ(t) + (1/2)Ẍ(t)√s + o(√s) = (1 − 3√s/t){Ẋ(t) − (1/2)Ẍ(t)√s + o(√s)} − √s ∇f(X(t)) + o(√s)

By comparing the coefficients of √s, we obtain

Ẍ + (3/t)Ẋ + ∇f(X) = 0

Theorem (Well-posed ODE: existence and uniqueness)

For any f ∈ F_∞ := ∪_{L>0} F_L and any x_0 ∈ ℝⁿ, the ODE (3) with initial conditions X(0) = x_0, Ẋ(0) = 0 has a unique global solution X.


Page 14

Simple Properties

Invariance

The ODE is invariant under time change and under transformation.

Initial asymptotic

Assume X is sufficiently smooth, so that lim_{t→0} Ẍ exists. The Mean Value Theorem guarantees the existence of ζ ∈ (0, t) such that

Ẋ(t)/t = (Ẋ(t) − Ẋ(0))/t = Ẍ(ζ)

Hence the ODE reduces to Ẍ(t) + 3Ẍ(ζ) + ∇f(X(t)) = 0. Taking t → 0, we have

X(t) = x_0 − ∇f(x_0) t²/8 + o(t²)

This is consistent with the empirical observation that Nesterov’s scheme moves slowly in the beginning.
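A quick symbolic sanity check of this expansion (the sympy snippet is my own illustration): with the gradient frozen at its initial value g = ∇f(x_0), the curve x_0 − g t²/8 satisfies (3) exactly.

```python
import sympy as sp

t, g, x0 = sp.symbols('t g x0')
X = x0 - g * t**2 / 8                                      # candidate small-t expansion
residual = sp.diff(X, t, 2) + (3 / t) * sp.diff(X, t) + g  # ODE (3) with grad f frozen at g
print(sp.simplify(residual))                               # prints 0
```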


Page 15

Connections and Interpretations


Page 16

Analogous Convergence Rate

Now we exhibit an approximate equivalence between the ODE and Nesterov’s scheme in terms of convergence rate.

Theorem (3.1, discrete Nesterov scheme)

For any f ∈ F_L, the sequence {x_k} in (1)-(2) with step size s ≤ 1/L obeys

f(x_k) − f* ≤ 2‖x_0 − x*‖² / (s(k + 1)²)

The following result indicates that the trajectory of the ODE (3) closely resembles the sequence {x_k} in terms of convergence rate.

Theorem (3.2, continuous ODE)

For any f ∈ F_∞, let X(t) be the unique global solution to (3) with initial conditions X(0) = x_0, Ẋ(0) = 0. Then for any t > 0,

f(X(t)) − f* ≤ 2‖x_0 − x*‖² / t²

Page 17

Proof of Theorem 3.2

Consider the energy functional

ε(t) := t²(f(X(t)) − f*) + 2‖X + (t/2)Ẋ − x*‖²

whose time derivative is

ε̇(t) = 2t(f(X) − f*) + t²⟨∇f, Ẋ⟩ + 4⟨X + (t/2)Ẋ − x*, (3/2)Ẋ + (t/2)Ẍ⟩

Substituting (3/2)Ẋ + (t/2)Ẍ with −(t/2)∇f(X), as dictated by the ODE (3), the term t²⟨∇f, Ẋ⟩ cancels and

ε̇(t) = 2t(f(X) − f*) + 4⟨X − x*, −(t/2)∇f(X)⟩ = 2t(f(X) − f*) − 2t⟨X − x*, ∇f(X)⟩ ≤ 0

where the last inequality follows from the convexity of f. Hence, by the monotonicity of ε and the non-negativity of 2‖X + (t/2)Ẋ − x*‖², the gap obeys

f(X(t)) − f* ≤ ε(t)/t² ≤ ε(0)/t² = 2‖x_0 − x*‖²/t²


Page 18

Quadratic f and Bessel function

For quadratic f, we have

f(x) = (1/2)⟨x, Ax⟩ + ⟨b, x⟩

where A ∈ ℝⁿˣⁿ. A simple translation can absorb the linear term ⟨b, x⟩. We assume A is positive semi-definite, admitting the spectral decomposition A = QᵀΛQ; replacing x with Qx, we may assume f = (1/2)⟨x, Λx⟩. The ODE then decouples coordinate-wise into

Ẍ_i + (3/t)Ẋ_i + λ_i X_i = 0,   i = 1, ..., n


Page 19

Quadratic f and Bessel function

Introduce Y_i(u) = u X_i(u/√λ_i), which satisfies

u² Y_i″ + u Y_i′ + (u² − 1) Y_i = 0

This is Bessel’s differential equation of order 1. Solving it and converting back to X_i, we obtain

X_i(t) = (2 x_{0,i} / (t√λ_i)) J_1(t√λ_i)

For large t, the Bessel function has the asymptotic form

J_1(t) = √(2/(πt)) (cos(t − 3π/4) + O(1/t))


Page 20

Quadratic f and Bessel function: Example

Minimize f = 0.02 x_1² + 0.005 x_2², with λ_{1,2} = 0.02, 0.005. Then

f(X) − f* = f(X) = Σ_{i=1}^{n} (2 x_{0,i}²/t²) J_1(t√λ_i)²

Denote the two major periods by T_1, T_2. We get T_1 = π/√λ_1 = 22.214 and T_2 = π/√λ_2 = 44.429.
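The closed-form error above is easy to evaluate directly; a sketch (the starting point x_0 is an assumption for illustration):

```python
import numpy as np
from scipy.special import j1  # Bessel function of the first kind, order 1

lam = np.array([0.02, 0.005])
x0 = np.array([1.0, 1.0])
t = np.linspace(1e-3, 300.0, 3000)
err = sum(2 * xi**2 / t**2 * j1(t * np.sqrt(li))**2 for xi, li in zip(x0, lam))
print(np.pi / np.sqrt(lam))  # the two major periods, approx. [22.214, 44.429]
```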


Page 21

Equivalence between the ODE and Nesterov’s scheme

We study the stable step size for numerically solving the ODE. The finite-difference approximation of (3) by the forward Euler method is

(X(t + Δt) − 2X(t) + X(t − Δt))/Δt² + (3/t)(X(t) − X(t − Δt))/Δt + ∇f(X(t)) = 0

which is equivalent to

X(t + Δt) = (2 − 3Δt/t) X(t) − Δt² ∇f(X(t)) − (1 − 3Δt/t) X(t − Δt)

Assuming that f is sufficiently smooth, for a small perturbation the characteristic equation of this finite-difference scheme is approximately (identifying k = t/Δt)

det(λ² − (2 − Δt² ∇²f − 3Δt/t) λ + 1 − 3Δt/t) = 0    (5)

For numerical stability, all roots of (5) should lie within the unit circle.
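A minimal sketch of this forward-Euler recursion for a one-dimensional quadratic (the curvature λ and step Δt are illustrative assumptions); pushing Δt up makes the iterates blow up, which is exactly the stability constraint above:

```python
import numpy as np

lam, dt, n = 1.0, 0.1, 2000      # curvature of f(x) = lam*x^2/2, step Delta t, step count
t = (np.arange(n) + 1.0) * dt    # start at t = dt to avoid dividing by zero
X = np.zeros(n)
X[0] = X[1] = 1.0                # X(0) = x0 with (approximately) zero initial velocity
for i in range(1, n - 1):
    X[i + 1] = ((2 - 3 * dt / t[i]) * X[i]
                - dt**2 * lam * X[i]            # Delta t^2 * grad f(X(t))
                - (1 - 3 * dt / t[i]) * X[i - 1])
```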

Page 22

A family of generalized Nesterov’s schemes

To exploit the power of the ODE, we study (3) with the constant 3 in the coefficient of Ẋ/t replaced by a general constant r:

Ẍ + (r/t)Ẋ + ∇f(X) = 0,   X(0) = x_0, Ẋ(0) = 0    (6)

By an argument similar to Theorem 2.1, this ODE is guaranteed to admit a unique global solution for any f ∈ F_∞.
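The natural discrete analog studied in the paper replaces the momentum coefficient (k − 1)/(k + 2) by (k − 1)/(k + r − 1), recovering Nesterov’s scheme at r = 3; a minimal sketch (problem data left to the caller):

```python
import numpy as np

def generalized_nesterov(grad_f, x0, s, r, iters):
    """Scheme (1)-(2) with momentum coefficient (k-1)/(k+r-1); r = 3 is Nesterov."""
    x_prev, x = x0.copy(), x0.copy()
    for k in range(1, iters + 1):
        y = x + (k - 1) / (k + r - 1) * (x - x_prev)  # generalized momentum step
        x_prev, x = x, y - s * grad_f(y)              # gradient step
    return x
```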


Page 23

Generalized Nesterov’s Scheme: Continuous Optimization

Theorem (4.1)

Suppose r > 3 and let X be the unique solution to (6) for some f ∈ F_∞. Then X(t) obeys

f(X(t)) − f* ≤ (r − 1)²‖x_0 − x*‖² / (2t²)

and

∫_0^∞ t (f(X(t)) − f*) dt ≤ (r − 1)²‖x_0 − x*‖² / (2(r − 3))

Theorem (4.2)

For any f ∈ S_{µ,L}(ℝⁿ), the unique solution X to (6) with r ≥ 9/2 obeys

f(X(t)) − f* ≤ C r^{5/2} ‖x_0 − x*‖² / (t³√µ)

for any t > 0 and a universal constant C > 1/2.

Page 24

Generalized Nesterov’s Scheme: Continuous Optimization

For example, the solution to (6) with f(x) = ‖x‖²/2 is

X(t) = 2^{(r−1)/2} Γ((r + 1)/2) J_{(r−1)/2}(t) / t^{(r−1)/2}

where J_{(r−1)/2}(·) is the Bessel function of the first kind of order (r − 1)/2. For large t, this Bessel function obeys J_{(r−1)/2}(t) = √(2/(πt)) (cos(t − rπ/4) + O(1/t)). Hence

f(X(t)) − f* ≤ ‖x_0 − x*‖²/t^r
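This closed form is easy to probe numerically (a sketch; scipy’s jv is the Bessel function of general order, and the value of r here is an arbitrary choice):

```python
import numpy as np
from scipy.special import jv, gamma

r = 5.0
nu = (r - 1) / 2
t = np.linspace(1e-3, 100.0, 1000)
X = 2**nu * gamma((r + 1) / 2) * jv(nu, t) / t**nu  # solution of (6) for f = ||x||^2/2
err = X**2 / 2                                      # f(X(t)) - f*, decaying like t^(-r)
```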


Page 25

Generalized Nesterov’s Scheme: Composite Optimization

(This part is skipped in the talk.)

min_{x∈ℝⁿ} f(x) = g(x) + h(x)

where g ∈ F_L for some L > 0 and h is convex on ℝⁿ, possibly taking the extended value ∞. Define the proximal subgradient

G_s(x) := (x − argmin_z {‖z − (x − s∇g(x))‖²/(2s) + h(z)}) / s
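For intuition, a sketch of G_s in the lasso-type case h(z) = λ‖z‖₁, where the argmin is soft-thresholding (g, λ, and s below are assumptions for illustration):

```python
import numpy as np

def G_s(x, s, grad_g, lam):
    """Proximal subgradient of f = g + lam*||.||_1 at x with step s."""
    z = x - s * grad_g(x)                                     # forward step on g
    prox = np.sign(z) * np.maximum(np.abs(z) - s * lam, 0.0)  # prox of h: soft-threshold
    return (x - prox) / s
```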


Page 26

New Restart Scheme


Page 27

Restart Scheme: Previous Works

Restart: erase the memory of previous iterations and reset the momentum back to zero.

Function scheme: we restart whenever

f(x_k) > f(x_{k−1})

Gradient scheme: we restart whenever

∇f(y_{k−1})ᵀ(x_k − x_{k−1}) > 0

Refer to: Brendan O’Donoghue and Emmanuel Candès, Adaptive Restart for Accelerated Gradient Schemes, 2012.
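A minimal sketch of the gradient scheme inside a Nesterov loop (the function scheme would compare f(x_k) with f(x_{k−1}) instead); resetting the counter k zeroes the momentum coefficient:

```python
import numpy as np

def nag_gradient_restart(grad_f, x0, s, iters):
    x_prev, x, k = x0.copy(), x0.copy(), 1
    for _ in range(iters):
        y = x + (k - 1) / (k + 2) * (x - x_prev)
        x_new = y - s * grad_f(y)
        k = 1 if grad_f(y) @ (x_new - x) > 0 else k + 1  # gradient restart test
        x_prev, x = x, x_new
    return x
```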


Page 28

Restart Scheme: Previous Works


Page 29

Restart Scheme: Previous Works


Page 30

New Restart Scheme: Speed Restart

This work provides a new restarting strategy, called the speed restarting scheme. The underlying motivation is to maintain a relatively high velocity Ẋ along the trajectory.

Definition 5.1

For the ODE (3) with X(0) = x_0, Ẋ(0) = 0, let

T = T(f, x_0) = sup{t > 0 : ∀u ∈ (0, t), d‖Ẋ(u)‖²/du > 0}

be the speed restarting time.

In words, T is the first time the velocity ‖Ẋ‖ decreases. Indeed, f(X(t)) is a decreasing function before time T: for t < T,

df(X(t))/dt = ⟨∇f(X), Ẋ⟩ = −(3/t)‖Ẋ‖² − (1/2) d‖Ẋ‖²/dt < 0


Page 31

Accelerating to linear convergence by restarting

The speed restarted ODE is thus

Ẍ(t) + (3/t_sr)Ẋ(t) + ∇f(X(t)) = 0    (7)

where t_sr is reset to zero whenever ⟨Ẋ, Ẍ⟩ = 0. We have the following observations:

- X^sr(t) is continuous for t ≥ 0, with X^sr(0) = x_0;

- X^sr(t) satisfies (3) for 0 < t < T_1 := T(x_0; f);

- recursively define T_{i+1} = T(X^sr(Σ_{j≤i} T_j); f) for i ≥ 1, and X(t) = X^sr(Σ_{j≤i} T_j + t) satisfies (3) on each such interval.


Page 32

Accelerating to linear convergence by restarting

Lemma 5.2

There is a universal constant C > 0 such that

f(X(T)) − f(x*) ≤ (1 − Cµ/L)(f(x_0) − f*)

This guarantees that each restart reduces the error by a constant factor.

Lemma 5.3

There is a universal constant C such that

T ≤ 4 exp(CL/µ) / (5√L)

This gives an upper bound for T; it confirms that restarts occur sufficiently often.


Page 33

Accelerating to linear convergence by restarting

Applying Lemmas 5.2 and 5.3, we have

Theorem (5.1)

There exist positive constants c_1 and c_2, which depend only on the condition number L/µ, such that for any f ∈ S_{µ,L},

f(X^sr(t)) − f(x*) ≤ (c_1 L‖x_0 − x*‖²/2) exp(−c_2 t√L)

where c_1 = exp(Cµ/L) and c_2 = (5Cµ/(4L)) exp(−Cµ/L). The theorem guarantees linear convergence of the solution to (7). This is a new result in the literature.


Page 34

Numerical examples: algorithm of speed restarting

Below we present a discrete analog of the restarted scheme.
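The algorithm box from the original slide is not reproduced here; in its place, a minimal sketch of the discrete analog, restarting whenever the step length ‖x_k − x_{k−1}‖ (the discrete speed) stops increasing:

```python
import numpy as np

def nag_speed_restart(grad_f, x0, s, iters):
    x_prev, x, k = x0.copy(), x0.copy(), 1
    speed_prev = 0.0
    for _ in range(iters):
        y = x + (k - 1) / (k + 2) * (x - x_prev)
        x_new = y - s * grad_f(y)
        speed = np.linalg.norm(x_new - x)
        k = 1 if speed < speed_prev else k + 1  # restart when the speed drops
        speed_prev = speed
        x_prev, x = x, x_new
    return x
```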


Page 35

Quadratic


Page 36

Log-sum-exp


Page 37

Matrix completion


Page 38

Lasso in ℓ1-constrained form with large sparse design


Page 39

References

W. Su, S. Boyd, and E. J. Candès (2014). A Differential Equation for Modeling Nesterov’s Accelerated Gradient Method: Theory and Insights. NIPS 2014.


Page 40

thanks!
