
Dissipativity Theory for Nesterov’s Accelerated Method
Bin Hu† and Laurent Lessard†‡

† Wisconsin Institute for Discovery at the University of Wisconsin–Madison
‡ Department of Electrical and Computer Engineering, University of Wisconsin–Madison


Abstract

Dissipativity theory is a framework for analyzing dynamical systems based on the notion of energy dissipation in physical or mechanical systems. Since first-order optimization algorithms can be viewed as (discrete-time) dynamical systems, dissipativity theory can be leveraged to understand their asymptotic convergence properties from an energy dissipation viewpoint. This work focuses on applying dissipativity theory to unify the linear and sublinear analyses for Nesterov’s accelerated method, though the approach is very general and can be applied to other algorithms as well.

Summary of the main idea

• Energy is a scalar quantity that we can associate with a dynamical system at a given instant in time. For example, the kinetic and potential energy of a projectile. Total energy always dissipates (cannot increase).

• Algorithms are dynamical systems. Once we define the appropriate notion of energy, the conservation law (dissipation inequality) directly leads to a convergence result. This principle is very general and can be used, for example, to prove linear, 1/k, or 1/k² rate bounds.

Algorithms as dynamical systems

A (discrete-time) linear dynamical system consists of an internal state ξ_k, an input w_k, and an output y_k. It satisfies the state-space equations:

ξ_{k+1} = Aξ_k + Bw_k,   y_k = Cξ_k

[Block diagram: the system maps the input sequence {w_0, w_1, . . . } to the output sequence {y_0, y_1, . . . }.]

First-order optimization algorithms can typically be represented in this way, where the “driving force” satisfies the feedback law w_k = ∇f(y_k). Two examples follow; a numerical sketch appears after the list.

• Gradient descent: x_{k+1} = x_k − α∇f(x_k), so we can set:

ξ_k = x_k − x⋆,   A = I,   B = −αI,   C = I

• Nesterov’s accelerated method: x_{k+1} = y_k − α∇f(y_k) and y_k = x_k + β(x_k − x_{k−1}). So we can set:

ξ_k = [ x_k − x⋆     ]    A = [ (1+β)I   −βI ]    B = [ −αI ]    C = [ (1+β)I   −βI ]
      [ x_{k−1} − x⋆ ]        [    I      0  ]        [  0  ]
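To make this concrete, here is a minimal numerical sketch (not from the poster) that builds these matrices and runs the feedback interconnection w_k = ∇f(y_k). It assumes a quadratic objective f(x) = ½xᵀQx, so that x⋆ = 0 and ∇f(x) = Qx, with the standard parameter choices α = 1/L and β = (√L − √m)/(√L + √m):

```python
import numpy as np

# Sketch: Nesterov's method as the feedback interconnection of a linear system
# xi_{k+1} = A xi_k + B w_k, y_k = C xi_k, with w_k = grad f(y_k).
# Assumes f(x) = 0.5 x^T Q x, so x_star = 0 and grad f(x) = Q x.

n = 2
m, L = 1.0, 10.0
Q = np.diag([m, L])                    # strong convexity m, smoothness L
alpha = 1.0 / L
beta = (np.sqrt(L) - np.sqrt(m)) / (np.sqrt(L) + np.sqrt(m))

I, Z = np.eye(n), np.zeros((n, n))
A = np.block([[(1 + beta) * I, -beta * I], [I, Z]])
B = np.vstack([-alpha * I, Z])
C = np.hstack([(1 + beta) * I, -beta * I])

xi = np.ones(2 * n)                    # state [x_0 - x_star; x_{-1} - x_star]
for k in range(50):
    y = C @ xi                         # y_k - x_star
    w = Q @ y                          # w_k = grad f(y_k) for this quadratic
    xi = A @ xi + B @ w                # state update

print("||x_50 - x_star|| =", np.linalg.norm(xi[:n]))   # ~ 1e-8: converged
```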

Dissipativity theory

In 1972, J.C. Willems pioneered “dissipativity theory”, a physics-inspired framework to analyze dynamical systems. Its main ingredients are:

• A storage function V : Rⁿ → R₊. We interpret V(ξ) as the energy stored in the dynamical system when it is in state ξ.

• A supply rate S : Rⁿ × Rᵐ → R. We interpret S(ξ, w) as the energy supplied to the system when it is in state ξ and the driving force is w.

• A dissipation inequality is a relation that holds for all trajectories of the system and can be interpreted as an energy conservation law, e.g.:

V(ξ_{k+1}) − ρ²V(ξ_k) ≤ S(ξ_k, w_k)        (1)

Eq. (1) states that at least a fraction (1 − ρ²) of the internal energy dissipates at every iteration k. Carefully selecting V and S such that (1) always holds allows us to prove stability properties, e.g. convergence rates.
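Unrolling (1) over N iterations makes the energy interpretation explicit. The bound below is a standard consequence of (1), written out in LaTeX for clarity:

```latex
% Apply (1) at k = N-1, N-2, ..., 0 and substitute each bound into the next:
V(\xi_N) \;\le\; \rho^{2N} V(\xi_0) \;+\; \sum_{k=0}^{N-1} \rho^{2(N-1-k)}\, S(\xi_k, w_k).
% If no energy is ever supplied (S(\xi_k, w_k) <= 0 for all k), this reduces to
% V(\xi_N) <= \rho^{2N} V(\xi_0): the stored energy decays geometrically.
```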

Supply rate for gradient descent

The supply rate S characterizes known properties of the function f. For example, if f is L-smooth and m-strongly convex, we know that

[ x − x⋆ ]ᵀ [   2mL     −(m+L) ] [ x − x⋆ ]
[ ∇f(x)  ]  [ −(m+L)       2   ] [ ∇f(x)  ]   ≤  0        (2)
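Expanding the quadratic form in (2) and rearranging recovers a standard co-coercivity inequality for m-strongly convex, L-smooth functions (this derivation is not on the poster, but follows directly):

```latex
% Expand (2): 2mL ||x - x*||^2 - 2(m+L) grad f(x)^T (x - x*) + 2 ||grad f(x)||^2 <= 0,
% then divide by 2(m+L):
\nabla f(x)^{\mathsf T}(x - x^\star)
  \;\ge\; \frac{mL}{m+L}\,\lVert x - x^\star\rVert^2
  \;+\; \frac{1}{m+L}\,\lVert \nabla f(x)\rVert^2,
% i.e., co-coercivity applied at the pair (x, x*), using grad f(x*) = 0.
```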

Gradient descent: Setting ξ = x − x⋆ and w = ∇f(x), we can rewrite the above inequality as S(ξ, w) ≤ 0 (no energy is supplied). If we can find a storage function V such that (1) holds, then we have:

V(ξ_{k+1}) − ρ²V(ξ_k) ≤ S(ξ_k, w_k) ≤ 0

In other words, V(ξ_{k+1}) ≤ ρ²V(ξ_k), i.e. V(ξ) is a Lyapunov function that proves linear convergence.
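As a quick sanity check, here is a minimal sketch (assuming a diagonal quadratic objective; not taken from the poster) with the storage function V(ξ) = ‖ξ‖² and the classical step size α = 2/(m + L), for which the contraction factor is ρ = (L − m)/(L + m):

```python
import numpy as np

# Sketch: the dissipation inequality for gradient descent in action.
# Assumes f(x) = 0.5 x^T Q x with eigenvalues of Q in [m, L], so x_star = 0
# and V(xi) = ||x - x_star||^2 serves as the storage function.

m, L = 1.0, 10.0
Q = np.diag([m, 4.0, L])               # m-strongly convex, L-smooth quadratic
alpha = 2.0 / (m + L)                  # classical step size
rho = (L - m) / (L + m)                # contraction factor per iteration

x = np.array([1.0, -1.0, 1.0])
for k in range(20):
    x_next = x - alpha * (Q @ x)       # gradient step
    # V(xi_{k+1}) <= rho^2 V(xi_k), i.e. ||x_{k+1}|| <= rho ||x_k||:
    assert np.linalg.norm(x_next) <= rho * np.linalg.norm(x) + 1e-12
    x = x_next

print("linear convergence at rate rho =", rho)
```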

Constructing dissipation inequalities using SDPs

If we assume the storage function V and supply rate S are quadratic, i.e.

V(ξ) = ξᵀPξ   and   S(ξ, w) = [ ξ ]ᵀ X [ ξ ]
                              [ w ]     [ w ]

Then if we can find a matrix P ⪰ 0 such that:

[ AᵀPA − ρ²P    AᵀPB ]
[ BᵀPA          BᵀPB ]  −  X  ⪯  0,        (3)

then the dissipation inequality (1) holds. To see why, multiply (3) on the left by [ξ_kᵀ  w_kᵀ] and on the right by its transpose: since ξ_{k+1} = Aξ_k + Bw_k, the first term becomes V(ξ_{k+1}) − ρ²V(ξ_k) and the second becomes S(ξ_k, w_k). Finding such a P is a semidefinite program (SDP).
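As an illustration (not from the poster), the sketch below checks the LMI (3) for gradient descent in the scalar case, where A = 1 and B = −α. Two simplifications keep it solver-free: by homogeneity, P can be normalized to 1, and the supply-rate matrix from (2) enters through a nonnegative multiplier λ (any nonnegative scaling of (2) is still a valid supply rate), which is scanned over a grid; a real implementation would instead hand (3) to an SDP solver.

```python
import numpy as np

# Sketch: certify a linear rate for gradient descent via the LMI (3), scalar case.
# With P normalized to 1 (valid by homogeneity in (P, lam)), feasibility of (3)
# reduces to finding lam >= 0 such that  base - lam * X0  is negative semidefinite.

m, L = 1.0, 10.0
alpha = 1.0 / L
A, B = 1.0, -alpha
X0 = np.array([[2 * m * L, -(m + L)], [-(m + L), 2.0]])   # supply-rate matrix from (2)

def feasible(rho, lams=np.linspace(0.0, 1.0, 2001), tol=1e-9):
    base = np.array([[A * A - rho**2, A * B], [A * B, B * B]])   # LMI (3) with P = 1
    return any(np.linalg.eigvalsh(base - lam * X0).max() <= tol for lam in lams)

# Bisect on rho to find the smallest certifiable rate.
lo, hi = 0.0, 1.0
for _ in range(40):
    mid = 0.5 * (lo + hi)
    lo, hi = (lo, mid) if feasible(mid) else (mid, hi)

print("certified rate:", hi)   # ~ 0.9, matching the known rate 1 - m/L for alpha = 1/L
```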

Proving linear rates for Nesterov’s method

Enforcing S ≤ 0 for Nesterov’s method yields conservative results because, unlike gradient descent, Nesterov’s method is not a descent method. Instead, we can write inequalities similar to (2):

[ ξ_k ]ᵀ X₁ [ ξ_k ]  ≤  f(x_k) − f(x_{k+1}),        [ ξ_k ]ᵀ X₂ [ ξ_k ]  ≤  f(x⋆) − f(x_{k+1})
[ w_k ]     [ w_k ]                                  [ w_k ]     [ w_k ]

that can be combined to yield a quadratic supply rate satisfying:

S(ξ_k, w_k) ≤ ρ²(f(x_k) − f(x⋆)) − (f(x_{k+1}) − f(x⋆))

Seeking a quadratic storage function can be done as in the gradient case, by solving the same SDP (3) but with X = ρ²X₁ + (1 − ρ²)X₂. This time, the dissipation inequality yields the relation:

V(ξ_{k+1}) + f(x_{k+1}) − f(x⋆) ≤ ρ²(V(ξ_k) + f(x_k) − f(x⋆))

In other words, V(ξ) + f(x) − f(x⋆) is a Lyapunov function that proves linear convergence. The physical interpretation is that there is hidden energy stored in f. While the total energy is dissipated, the energy stored in the system state V(ξ) need not be monotonically decreasing.
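For the standard parameters α = 1/L and β = (√L − √m)/(√L + √m), the certified rate is ρ² = 1 − √(m/L). Here is a minimal empirical sketch (assuming a quadratic objective; not taken from the poster) comparing the observed decay of f(x_k) − f(x⋆) against that rate:

```python
import numpy as np

# Sketch: observe the linear rate of Nesterov's method on a quadratic
# f(x) = 0.5 x^T Q x (so x_star = 0, f_star = 0) and compare with the
# certified rate rho^2 = 1 - sqrt(m/L).

m, L = 1.0, 100.0
Q = np.diag(np.linspace(m, L, 5))
f = lambda x: 0.5 * x @ Q @ x
alpha = 1.0 / L
beta = (np.sqrt(L) - np.sqrt(m)) / (np.sqrt(L) + np.sqrt(m))

K = 200
x0 = np.ones(5)
x_prev, x = x0, x0
for k in range(K):
    y = x + beta * (x - x_prev)            # momentum step
    x_prev, x = x, y - alpha * (Q @ y)     # gradient step at y

observed = (f(x) / f(x0)) ** (1.0 / K)     # geometric-mean decay per iteration
print("observed rate: ", observed)         # at or below the certified rate
print("certified rate:", 1 - np.sqrt(m / L))
```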

Also in the paper...

• Proving a 1/k rate for gradient descent and a 1/k² rate for Nesterov’s method with weakly-convex objective functions (and iteration-dependent parameters α_k and β_k). The dissipation inequality and its associated SDP are very similar to those from the strongly convex case. (A quick empirical sketch of the 1/k² behavior follows this list.)

• Dissipativity theory for continuous-time systems, applied to the continuous-time limit of Nesterov’s method.
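The sketch below illustrates the 1/k² behavior on a convex but not strongly convex quadratic. It uses the common schedule α = 1/L, β_k = (k − 1)/(k + 2), which is one standard choice and not necessarily the parameter sequence analyzed in the paper:

```python
import numpy as np

# Sketch: empirical 1/k^2 behavior of Nesterov's method on a merely convex
# quadratic (Q is singular, so f is not strongly convex; f_star = 0).
# Schedule beta_k = (k - 1)/(k + 2) is a common choice, possibly not the paper's.

L = 10.0
Q = np.diag([0.0, 0.1, L])
f = lambda x: 0.5 * x @ Q @ x
alpha = 1.0 / L

x_prev = x = np.ones(3)
for k in range(1, 1001):
    beta = (k - 1.0) / (k + 2.0)
    y = x + beta * (x - x_prev)
    x_prev, x = x, y - alpha * (Q @ y)
    if k % 250 == 0:
        # the printed quantity stays bounded, consistent with a 1/k^2 rate
        print(f"k = {k:4d}    k^2 * (f(x_k) - f_star) = {k**2 * f(x):8.4f}")
```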

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 1656951. Both authors also acknowledge support from the Wisconsin Institute for Discovery, the College of Engineering, and the Department of Electrical and Computer Engineering at the University of Wisconsin–Madison.

Bin Hu ([email protected]) Laurent Lessard ([email protected]), http://www.laurentlessard.com