Continuous-time dynamics for Stochastic Optimization Bay Area Optimization Meeting Walid Krichene Google Research [email protected]

Page 1: Continuous-time dynamics Bay Area Optimization Meeting · Continuous-time dynamics / Discrete algorithms. Unify and simplify the analysis Physical intuition => heuristics Streamlined

Continuous-time dynamics for Stochastic Optimization
Bay Area Optimization Meeting

Walid Krichene
Google Research
[email protected]

Page 2

Goal of this Talk

● Set of ideas and techniques: Continuous-time dynamics / Discrete algorithms.

○ Unify and simplify the analysis

○ Physical intuition => heuristics

○ Streamlined design of new algorithms

● Stochastic optimization

Joint work with Peter Bartlett and Alex Bayen

Page 3

Proprietary + Confidential

Outline

Accelerated Gradient Flow

Averaging Interpretation

Stochastic optimization

Sampling, non-convex optimization

Page 4

Dynamics for smooth convex minimization

Smooth convex minimization: minimize $f(x)$ subject to $x \in \mathcal{X}$

$f$ is convex, $\nabla f$ is Lipschitz, $\mathcal{X}$ is closed convex.

Page 5

Beginnings of optimization

Gradient Descent (GD)

Gradient Flow

Augustin L. Cauchy (1789 - 1857)

[Timeline: 1820 GD / Grad flow]
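Gradient descent can be read as a forward-Euler discretization of gradient flow: with step size s, the iterate x_k approximates the flow at time t = k s. A minimal sketch, assuming a hypothetical one-dimensional quadratic f(x) = a x^2 / 2 (not taken from the slides), for which the flow has the closed form X(t) = x0 exp(-a t):

```python
import math

def gradient_descent(grad, x0, step, n_steps):
    """Forward-Euler discretization of gradient flow: x_{k+1} = x_k - step * grad(x_k)."""
    x = x0
    for _ in range(n_steps):
        x = x - step * grad(x)
    return x

# Hypothetical test objective f(x) = 0.5 * a * x^2, chosen for its closed-form flow.
a = 1.0
x0 = 1.0
step, n_steps = 0.01, 100

x_gd = gradient_descent(lambda x: a * x, x0, step, n_steps)
x_flow = x0 * math.exp(-a * step * n_steps)  # gradient flow evaluated at t = n_steps * step

print(x_gd, x_flow)
```

Shrinking the step while keeping the total time n_steps * step fixed brings the two values closer together.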

Page 6

Accelerated Methods

1982: Nemirovski and Yudin show lower bound (first-order methods, smooth convex)

1983: Nesterov’s method achieves the lower bound

Pictured: Arkadi Nemirovski, Yurii Nesterov

[Timeline: 1820 GD / Grad flow, 1983 Nesterov]

Page 7

Nesterov ODE

2014: Continuous-time limit of Nesterov’s method [1]

Proved convergence rate

Nonlinear oscillator; damping, friction

[Timeline: 1820 GD / Grad flow, 1983 Nesterov, 2014 Nesterov ODE]

[1] W. Su, S. Boyd and E. Candes. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. NIPS 2014.

[2] A. Cabot, H. Engler, and S. Gadat. On the long time behavior of second order differential equations with asymptotically small dissipation. Trans. Amer. Math. Soc. 2009.
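For reference, the ODE and rate from [1] can be written as below. The notation is an assumption taken from [1] rather than from the slide text: $X(t)$ is the trajectory, $x_0$ the starting point, $x^\star$ a minimizer and $f^\star$ the minimum value.

```latex
\ddot{X}(t) + \frac{3}{t}\,\dot{X}(t) + \nabla f(X(t)) = 0,
\qquad X(0) = x_0,\quad \dot{X}(0) = 0,
\qquad
f(X(t)) - f^\star \;\le\; \frac{2\,\|x_0 - x^\star\|^2}{t^2}.
```

The $3/t$ coefficient plays the role of friction: damping is strong early on and vanishes as $t$ grows.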

Page 8

Nesterov ODE

Nesterov’s ODE Vs. Gradient flow
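The comparison can be reproduced numerically. The sketch below integrates gradient flow and the second-order ODE X'' + (3/t) X' + grad f(X) = 0 with a fixed-step RK4 scheme. The ill-conditioned quadratic, the start time t0 = 0.1 (used to avoid the singularity of 3/t at zero) and the step size are all assumptions made for illustration, not values from the talk.

```python
def f(x):
    # Hypothetical ill-conditioned quadratic test objective.
    return 0.5 * (x[0] ** 2 + 0.01 * x[1] ** 2)

def grad_f(x):
    return [x[0], 0.01 * x[1]]

def rk4_step(rhs, t, y, h):
    """One classical Runge-Kutta step for y' = rhs(t, y)."""
    k1 = rhs(t, y)
    k2 = rhs(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = rhs(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = rhs(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (a + 2 * b + 2 * c + d)
            for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]

def gradient_flow_rhs(t, y):
    # First-order system: X' = -grad f(X).
    g = grad_f(y)
    return [-g[0], -g[1]]

def nesterov_ode_rhs(t, y):
    # State y = (X, V) with X' = V and V' = -(3/t) V - grad f(X).
    x, v = y[:2], y[2:]
    g = grad_f(x)
    return [v[0], v[1], -3.0 / t * v[0] - g[0], -3.0 / t * v[1] - g[1]]

def integrate(rhs, y0, t0, t1, h):
    t, y = t0, list(y0)
    for _ in range(int(round((t1 - t0) / h))):
        y = rk4_step(rhs, t, y, h)
        t += h
    return y

x0 = [1.0, 1.0]
t0, t1, h = 0.1, 50.0, 0.01

f_flow = f(integrate(gradient_flow_rhs, x0, t0, t1, h))
f_nesterov = f(integrate(nesterov_ode_rhs, x0 + [0.0, 0.0], t0, t1, h)[:2])

print(f_flow, f_nesterov)
```

On this example the second-order dynamics reach a lower objective value at the final time, at the cost of oscillating along the way.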

Page 9

Nesterov ODE - heuristics

Restarting heuristics: Restart the ODE when

- Trajectory points in “bad direction” [3]

- Trajectory is decelerating [1]

Reduces kinetic energy without changing potential energy.

[1] Su, Boyd, Candes. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. NIPS 2014.

[3] O'Donoghue, Candes. Adaptive Restart for Accelerated Gradient Schemes. 2012.
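In discrete time, the "bad direction" test of [3] amounts to checking whether the last move has a positive inner product with the gradient, and dropping the momentum when it does. A minimal sketch, assuming a hypothetical quadratic objective, step size 1/L and the momentum coefficient (k - 1)/(k + 2); the function names are illustrative only:

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def accelerated_gradient(grad, x0, step, n_iters, use_restart):
    """Nesterov-style accelerated gradient, optionally with gradient-based restart."""
    x = list(x0)
    y = list(x0)
    k = 0
    for _ in range(n_iters):
        g = grad(y)
        x_new = [yi - step * gi for yi, gi in zip(y, g)]
        move = [xn - xi for xn, xi in zip(x_new, x)]
        if use_restart and dot(g, move) > 0:
            k = 1   # restart: the move goes uphill, so forget the accumulated momentum
        else:
            k += 1
        beta = (k - 1) / (k + 2)  # equals 0 right after a restart
        y = [xn + beta * m for xn, m in zip(x_new, move)]
        x = x_new
    return x

def f(x):
    # Hypothetical test objective with condition number 100.
    return 0.5 * (x[0] ** 2 + 0.01 * x[1] ** 2)

def grad_f(x):
    return [x[0], 0.01 * x[1]]

x0 = [1.0, 1.0]
f_plain = f(accelerated_gradient(grad_f, x0, step=1.0, n_iters=300, use_restart=False))
f_restart = f(accelerated_gradient(grad_f, x0, step=1.0, n_iters=300, use_restart=True))

print(f_plain, f_restart)
```

Resetting k removes the velocity (kinetic energy) while leaving the current point, and hence the objective value, unchanged.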

Page 10

Nesterov ODE - heuristics

Restarting heuristics

Page 11

Outline

Accelerated Gradient Flow

Averaging Interpretation

Stochastic optimization

Sampling, non-convex optimization

Page 12

Constrained Optimization | Mirror Descent

1979: Nemirovski and Yudin: Mirror Descent

Lipschitz mirror map

- Maps the dual space to the feasible set

Illustration of Mirror Descent
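As one concrete instance, the sketch below runs mirror descent on the probability simplex with the entropic mirror map (softmax), which sends any dual vector to a feasible point. The choice of mirror map, the linear objective and the step size are assumptions made for illustration.

```python
import math

def softmax(z):
    """Entropic mirror map: sends a dual vector z to a point of the probability simplex."""
    m = max(z)  # subtract the max for numerical stability
    w = [math.exp(zi - m) for zi in z]
    total = sum(w)
    return [wi / total for wi in w]

def mirror_descent(grad, z0, step, n_iters):
    """Gradient steps are accumulated in the dual variable z; the primal point is softmax(z)."""
    z = list(z0)
    x = softmax(z)
    for _ in range(n_iters):
        g = grad(x)
        z = [zi - step * gi for zi, gi in zip(z, g)]
        x = softmax(z)
    return x

# Hypothetical linear objective f(x) = <c, x>, minimized at the vertex with the smallest cost.
c = [0.3, 0.1, 0.5]
x = mirror_descent(lambda x: c, [0.0, 0.0, 0.0], step=0.5, n_iters=200)

print(x)
```

The iterates stay feasible by construction, so no projection step is needed.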

Page 13

Accelerated Mirror Descent

Accelerated Mirror Descent (AMD): Averaging in primal space [4]

Special case: Nesterov’s ODE [1]

Generalization to Hilbert spaces and more [5]

Lagrangian dynamics perspective [6]

Extensions: Frank-Wolfe etc. [7]

[1] W. Su, S. Boyd and E. Candes. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. NIPS 2014.

[4] W. Krichene, A. Bayen and P. Bartlett. Accelerated Mirror Descent in Continuous and Discrete Time. NIPS 2015.

[5] H. Attouch, Z. Chbani, J. Peypouquet and P. Redont. Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Math. Programming 2016.

[6] A. Wibisono, A. C. Wilson and M. I. Jordan. A Variational Perspective on Accelerated Methods in Optimization. PNAS 2016.

[7] J. Diakonikolas and L. Orecchia. The Approximate Gap Technique: A Unified Approach to Optimal First-Order Methods. 2017.

Illustration of Accelerated Mirror Descent
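The AMD dynamics of [4] can be written as follows. The notation is an assumption taken from that paper: $Z$ is the dual variable, $\nabla\psi^*$ the mirror map and $r \ge 2$ a parameter.

```latex
\dot{Z}(t) = -\frac{t}{r}\,\nabla f(X(t)), \qquad
\dot{X}(t) = \frac{r}{t}\,\bigl(\nabla\psi^*(Z(t)) - X(t)\bigr), \qquad
X(0) = x_0 = \nabla\psi^*(z_0),\quad Z(0) = z_0 .

% Averaging form of the second equation (multiply by t^r and integrate):
X(t) = \frac{\int_0^t \tau^{r-1}\,\nabla\psi^*(Z(\tau))\,d\tau}{\int_0^t \tau^{r-1}\,d\tau}.
```

So the primal trajectory is a weighted average of the mirrored dual trajectory. Taking $\nabla\psi^*$ to be the identity gives $\ddot{X} + \frac{r+1}{t}\dot{X} + \nabla f(X) = 0$, which is the ODE of [1] when $r = 2$.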

Page 14

Accelerated Mirror Descent

Lyapunov function (decreases along trajectory)

Convergence rate (annotated terms: convergence rate, geometry)

Aleksandr M. Lyapunov (1857 - 1918)
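A sketch of the Lyapunov argument for the AMD dynamics of [4], $\dot{Z} = -\frac{t}{r}\nabla f(X)$, $\dot{X} = \frac{r}{t}(\nabla\psi^*(Z) - X)$. The notation is assumed from that paper: $D_{\psi^*}$ is the Bregman divergence of $\psi^*$ and $z^\star$ satisfies $\nabla\psi^*(z^\star) = x^\star$.

```latex
V(X, Z, t) = \frac{t^2}{r}\bigl(f(X) - f^\star\bigr) + r\,D_{\psi^*}(Z, z^\star)

\frac{d}{dt}V
  = \frac{2t}{r}\bigl(f(X) - f^\star\bigr)
    + \frac{t^2}{r}\langle \nabla f(X), \dot{X}\rangle
    + r\,\langle \nabla\psi^*(Z) - x^\star, \dot{Z}\rangle
  = \frac{2t}{r}\bigl(f(X) - f^\star\bigr) - t\,\langle \nabla f(X), X - x^\star\rangle
  \le -\,t\,\frac{r-2}{r}\bigl(f(X) - f^\star\bigr) \le 0
  \quad (r \ge 2)

f(X(t)) - f^\star \le \frac{r}{t^2}\,V(0) = \frac{r^2\,D_{\psi^*}(z_0, z^\star)}{t^2}
```

The second equality substitutes $\nabla\psi^*(Z) = X + \frac{t}{r}\dot{X}$, and the inequality uses convexity of $f$. In the bound, the $1/t^2$ factor carries the rate and the Bregman term carries the geometry.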

Page 15

Outline

Accelerated Gradient Flow

Averaging Interpretation

Stochastic optimization

Sampling, non-convex optimization

Page 16

Stochastic Optimization

SGD (Robbins and Monro): under suitable conditions on the gradient noise and the step sizes, can show convergence

(Overdamped) Langevin equation

- Discrete space: Simulated Annealing [9]

- Continuous space: [10]

[Timeline: 1820 GD / Grad flow, 1983 Nesterov, 2014 Nesterov ODE; 1951 SGD, 1940 Langevin]

[9] S. Kirkpatrick, C. D. Gelatt and M. P. Vecchi. Optimization by Simulated Annealing. Science, 1983.

[10] T. S. Chiang, C. R. Hwang, S. J. Sheu. Diffusion for Global Optimization in Rn. SIAM J. Control and Optimization, 1987.
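A minimal SGD sketch with decreasing step sizes s_k = 1/(k+1), whose sum diverges while the sum of squares converges. The objective f(x) = x^2 / 2, the unit-variance Gaussian gradient noise and the seed are assumptions made for illustration.

```python
import random

def sgd(noisy_grad, x0, n_iters):
    """SGD with step sizes 1/(k+1): the steps sum to infinity, their squares do not."""
    x = x0
    for k in range(n_iters):
        step = 1.0 / (k + 1)
        x = x - step * noisy_grad(x)
    return x

rng = random.Random(0)  # fixed seed so the run is reproducible

def noisy_grad(x):
    # Unbiased estimate of the gradient of f(x) = 0.5 * x^2 (hypothetical noise model).
    return x + rng.gauss(0.0, 1.0)

x_final = sgd(noisy_grad, x0=5.0, n_iters=10000)
print(x_final)
```

With this particular schedule and objective, the final iterate equals minus the average of the noise samples, so it concentrates around the minimizer at 0 as the number of iterations grows.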

Page 17

Acceleration in Stochastic Dynamics

2012: discrete Stochastic AMD [11]

2017: continuous Stochastic AMD [12]

- Brownian motion

- Volatility matrix

[Timeline: 1820 GD / Grad flow, 1983 Nesterov, 2014 Nesterov ODE; 1951 SGD, 1940 Langevin, 2012 SAMD, 2017 SAMD ODE]

[11] G. Lan. An optimal method for stochastic composite optimization. J. Math. Programming, 2012.

[12] W. Krichene and P. Bartlett. Acceleration and Averaging in Stochastic Descent Dynamics. NIPS 2017.

Page 18

Acceleration in Stochastic Dynamics

Accelerated dynamics may fail to converge in the presence of noise, without proper rectification

Page 19

Convergence bounds

AMD

SAMD

SAMD dynamics

Page 20

Convergence bounds

Rectified weights

Given , choose

Resulting rate

Page 21

Convergence bounds - Proof

Lyapunov function

By Itô’s lemma, the differential of the Lyapunov function contains a volatility term and an Itô correction term

SAMD dynamics

- Volatility , and can be bounded.

- Itô correction
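For reference, the standard form of Itô's lemma behind this step, for a generic diffusion $dX = b\,dt + \sigma\,dB$ and a smooth function $V(x, t)$. The symbols $b$ and $\sigma$ are generic placeholders, not the specific SAMD drift and volatility.

```latex
dV(X(t), t) =
  \Bigl[\,\partial_t V + \langle \nabla_x V,\, b \rangle
        + \underbrace{\tfrac{1}{2}\,\mathrm{tr}\bigl(\sigma\sigma^{\top}\nabla_x^2 V\bigr)}_{\text{It\^o correction}}
  \Bigr]\,dt
  \;+\; \underbrace{\langle \nabla_x V,\ \sigma\,dB(t) \rangle}_{\text{volatility}}
```

The volatility term is a martingale increment with zero mean, while the Itô correction is a drift that has no counterpart in the deterministic chain rule and has to be bounded separately.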

Page 22

On rescaling time

Continuous-time, deterministic

becomes

Continuous-time, stochastic

becomes

quadratic covariation
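One standard way to make the two rescalings explicit, for an increasing time change $\tau(t)$ and $Y(t) = X(\tau(t))$. The symbols $F$, $\sigma$, $\tau$ and $\tilde{B}$ are generic and assumed here.

```latex
% Deterministic: the vector field is scaled by \tau'(t).
\dot{X} = F(X, t)
\quad\Longrightarrow\quad
\dot{Y}(t) = \tau'(t)\,F\bigl(Y(t), \tau(t)\bigr)

% Stochastic: the drift scales by \tau'(t), the noise only by \sqrt{\tau'(t)}.
dX = F(X, t)\,dt + \sigma(X, t)\,dB(t)
\quad\Longrightarrow\quad
dY(t) = \tau'(t)\,F\bigl(Y(t), \tau(t)\bigr)\,dt
        + \sqrt{\tau'(t)}\;\sigma\bigl(Y(t), \tau(t)\bigr)\,d\tilde{B}(t)
```

Here $\tilde{B}$ is another Brownian motion. The square root appears because the quadratic variation of $B(\tau(t))$ is $\tau(t)$, so speeding up time amplifies the drift more than the noise.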

Page 23

Outline

Accelerated Gradient Flow

Averaging Interpretation

Stochastic optimization

Sampling, non-convex optimization

Page 24

Dynamics for sampling

Problem setup

Sample from a target distribution

● The potential is known, smooth, and strongly convex; its gradient is known

Dynamics

● Overdamped Langevin diffusion [13]

● Underdamped Langevin diffusion [14]

Improves the convergence rate (in 2-Wasserstein distance) over the overdamped diffusion

[13] A. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. 2014.

[14] X. Cheng, N. S. Chatterji, P. L. Bartlett and M. I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. 2017.
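A minimal sketch of sampling with the Euler-Maruyama discretization of the overdamped Langevin diffusion dX = -grad f(X) dt + sqrt(2) dB (often called the unadjusted Langevin algorithm). The potential f(x) = x^2 / 2, whose target is the standard normal, and the step size, chain length and burn-in are assumptions made for illustration.

```python
import math
import random

def unadjusted_langevin(grad_f, x0, step, n_steps, rng):
    """Euler-Maruyama discretization of dX = -grad f(X) dt + sqrt(2) dB."""
    x = x0
    samples = []
    for _ in range(n_steps):
        x = x - step * grad_f(x) + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples

rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical potential f(x) = 0.5 * x^2, so the target density is proportional to exp(-x^2 / 2).
samples = unadjusted_langevin(lambda x: x, x0=0.0, step=0.1, n_steps=50000, rng=rng)

kept = samples[1000:]  # discard a short burn-in
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(mean, var)
```

The empirical variance lands slightly above 1: the discretization introduces a bias that shrinks with the step size.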

Page 25

Dynamics for sampling

Handling constraints

● Reflected Brownian motion with drift [15]

Tanaka drift supported on boundary

Drift to induce uniform dist.

● Geodesic walk for sampling from polytopes [16]

● Mirror descent dynamics

[15] S. Bubeck, R. Eldan and J. Lehec. Finite-Time Analysis of Projected Langevin Monte Carlo. NIPS 2015.

[16] Y. T. Lee, S. S. Vempala. Geodesic walks on polytopes. 2017.

Page 26

Dynamics for non-convex optimization

Dissipative functions [17]

● Gibbs distribution concentrates around minimizers of the function.

Variationally coherent functions [18]

● Includes pseudo-convex, star-convex.

● A.s. convergence (of discretization) by Asymptotic Pseudo-Trajectory property.

[17] M. Raginsky, A. Rakhlin and M. Telgarsky. Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis. 2017.

[18] Z. Zhou, P. Mertikopoulos, N. Bambos, S. Boyd and P. W. Glynn. Stochastic Mirror Descent in Variationally Coherent Optimization Problems. NIPS 2017.

Page 27

Nesterov ODE - non-convex optimization

Rosenbrock’s function
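For reference, a minimal definition of Rosenbrock's function and its gradient, assuming the common parameter choice a = 1, b = 100 (the slide does not state which parameters were used):

```python
def rosenbrock(x, y, a=1.0, b=100.0):
    """Rosenbrock's function; the global minimum is 0 at (a, a^2)."""
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rosenbrock_grad(x, y, a=1.0, b=100.0):
    """Gradient of Rosenbrock's function with respect to (x, y)."""
    dx = -2.0 * (a - x) - 4.0 * b * x * (y - x ** 2)
    dy = 2.0 * b * (y - x ** 2)
    return dx, dy

print(rosenbrock(1.0, 1.0), rosenbrock_grad(1.0, 1.0))
print(rosenbrock(-1.0, 1.0))
```

The function is non-convex: its second derivative in x, 2 - 4b(y - 3x^2), is negative at the point (0, 1), for example.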

Page 28

Summary

● Continuous-time dynamics

○ Unify and simplify analysis

○ Heuristics

○ Acceleration in vanishing noise regime

● Open questions

○ Discretization in non-convex problems

○ Multiplicative noise models

Page 29

Thank You!