Continuous-time Dynamics for Stochastic Optimization
Bay Area Optimization Meeting
Walid Krichene
Google
[email protected]
Goal of this Talk
● A set of ideas and techniques: continuous-time dynamics / discrete algorithms.
○ Unify and simplify the analysis
○ Physical intuition => heuristics
○ Streamlined design of new algorithms
● Stochastic optimization
Joint work with Peter Bartlett and Alex Bayen
Outline
Accelerated Gradient Flow
Averaging Interpretation
Stochastic optimization
Sampling, non-convex optimization
Dynamics for smooth convex minimization
Smooth convex minimization
$\min_{x \in X} f(x)$, where $f$ is convex, $\nabla f$ is Lipschitz, and $X$ is closed convex.
Beginnings of optimization
Gradient Descent
Augustin L. Cauchy (1789 - 1857)
[Timeline: 1820: GD / gradient flow]
Gradient Flow
$\dot{X}(t) = -\nabla f(X(t))$
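To make the correspondence concrete, here is a minimal sketch (not from the talk, assuming a simple quadratic objective for illustration): one explicit Euler step of the gradient flow with step size eta is exactly a gradient descent step.

```python
import numpy as np

def grad_f(x):
    # illustrative objective f(x) = 0.5 * ||x||^2, so grad f(x) = x
    return x

def gradient_flow_euler(x0, eta=0.1, steps=100):
    """Explicit Euler discretization of dX/dt = -grad f(X).

    One Euler step x <- x - eta * grad_f(x) coincides with a gradient
    descent step with learning rate eta.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad_f(x)
    return x

print(gradient_flow_euler(np.array([2.0, -1.0])))  # close to the minimizer 0
```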
Accelerated Methods
1982: Nemirovski and Yudin show a lower bound for first-order methods on smooth convex problems: no such method can beat $\Omega(1/k^2)$.
1983: Nesterov's method achieves the lower bound, with $f(x_k) - f^\star = O(1/k^2)$.
Arkadi Nemirovski, Yurii Nesterov
[Timeline: 1820: GD / gradient flow; 1983: Nesterov]
Nesterov ODE
2014: Continuous-time limit of Nesterov's method [1]
Proved convergence rate $f(X(t)) - f^\star = O(1/t^2)$ along the ODE trajectory.
[1] W. Su, S. Boyd and E. Candes. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. NIPS 2014.
[2] A. Cabot, H. Engler and S. Gadat. On the long time behavior of second order differential equations with asymptotically small dissipation. Trans. Amer. Math. Soc. 2009.
[Timeline: 1820: GD / gradient flow; 1983: Nesterov; 2014: Nesterov ODE]
Nesterov ODE
$\ddot{X}(t) + \frac{3}{t}\dot{X}(t) + \nabla f(X(t)) = 0$
A nonlinear oscillator; the $\frac{3}{t}\dot{X}$ term acts as damping (friction).
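A minimal numerical sketch of this ODE, assuming a quadratic objective and a naive explicit integrator (both illustrative assumptions); the only subtlety is starting at some t0 > 0, since the damping coefficient 3/t blows up at t = 0.

```python
import numpy as np

def grad_f(x):
    return x  # illustrative: f(x) = 0.5 * ||x||^2

def nesterov_ode(x0, t0=0.1, dt=1e-3, T=20.0):
    """Integrate X'' + (3/t) X' + grad f(X) = 0 with a simple explicit
    scheme, starting at t0 > 0 to avoid the singular damping at t = 0."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)                 # X'(t0) = 0
    t = t0
    while t < T:
        a = -(3.0 / t) * v - grad_f(x)   # acceleration from the ODE
        v = v + dt * a
        x = x + dt * v
        t += dt
    return x

print(nesterov_ode(np.array([2.0, -1.0])))
```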
Nesterov ODE
Nesterov's ODE vs. gradient flow [trajectory illustration]
Nesterov ODE - heuristics
Restarting heuristics: restart the ODE (reset the velocity and the time variable) when
- the trajectory points in a "bad direction", $\langle \nabla f(X), \dot{X} \rangle > 0$ [3]
- the trajectory is decelerating, $\frac{d}{dt}\|\dot{X}\|^2 < 0$ [1]
Restarting reduces kinetic energy without changing potential energy; a sketch of the gradient-restart variant follows below.
[1] W. Su, S. Boyd and E. Candes. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. NIPS 2014.
[3] B. O'Donoghue and E. Candes. Adaptive Restart for Accelerated Gradient Schemes. 2012.
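A hedged sketch of the gradient-restart heuristic of [3], applied to the standard discrete form of Nesterov's method; the step size and iteration count are illustrative assumptions.

```python
import numpy as np

def nesterov_with_restart(grad_f, x0, eta=0.1, steps=200):
    """Nesterov's method with gradient restart [3]: reset the momentum
    whenever <grad f(y), x_new - x> > 0, i.e. the step points in a
    'bad direction'."""
    x = np.asarray(x0, dtype=float)
    x_prev = x.copy()
    k = 1
    for _ in range(steps):
        y = x + (k - 1) / (k + 2) * (x - x_prev)  # momentum step
        g = grad_f(y)
        x_new = y - eta * g
        if g @ (x_new - x) > 0:   # restart condition from [3]
            k = 1                 # drop momentum (reset kinetic energy)
        else:
            k += 1
        x_prev, x = x, x_new
    return x

print(nesterov_with_restart(lambda x: x, np.array([2.0, -1.0])))
```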
Nesterov ODE - heuristics
Restarting heuristics [illustration]
Outline
Accelerated Gradient Flow
Averaging Interpretation
Stochastic optimization
Sampling, non-convex optimization
Constrained Optimization | Mirror Descent
1979: Nemirovski and Yudin: Mirror Descent
Lipschitz mirror map $\nabla\psi^*$:
- maps the dual space to the feasible set $X$
Illustration of Mirror Descent
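For concreteness, a sketch of mirror descent on the probability simplex with the entropic mirror map; this instance is the classical exponentiated-gradient method, and the objective and step size are illustrative assumptions.

```python
import numpy as np

def mirror_descent_simplex(grad_f, x0, eta=0.1, steps=200):
    """Mirror descent with the entropic mirror map: a gradient step in
    the dual space, mapped back to the simplex by the softmax
    (the mirror map grad psi*, from the dual space to X)."""
    x = np.asarray(x0, dtype=float)
    z = np.log(x)                      # dual variable
    for _ in range(steps):
        z = z - eta * grad_f(x)        # dual-space gradient step
        w = np.exp(z - z.max())        # softmax = mirror map back to X
        x = w / w.sum()
    return x

# Minimize a linear function <c, x> over the simplex: the mass
# concentrates on the smallest coordinate of c.
c = np.array([0.3, 0.1, 0.5])
print(mirror_descent_simplex(lambda x: c, np.ones(3) / 3))
```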
Accelerated Mirror Descent
Accelerated Mirror Descent (AMD): Averaging in primal space [4]
Special case: Nesterov's ODE [1]
Generalization to Hilbert spaces and more [5]
Lagrangian dynamics perspective [6]
Extensions: Frank-Wolfe, etc. [7]
[1] W. Su, S. Boyd and E. Candes. A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. NIPS 2014.
[4] W. Krichene, A. Bayen and P. Bartlett. Accelerated Mirror Descent in Continuous and Discrete Time. NIPS 2015.
[5] H. Attouch, Z. Chbani, J. Peypouquet and P. Redont. Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Math. Programming 2016.
[6] A. Wibisono, A. C. Wilson and M. I. Jordan. A Variational Perspective on Accelerated Methods in Optimization. PNAS 2016.
[7] J. Diakonikolas and L. Orecchia. The Approximate Gap Technique: A Unified Approach to Optimal First-Order Methods. 2017.
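A simplified sketch of the primal-averaging structure, not the exact scheme of [4]: a mirror (dual-driven) sequence z and a primal average x, coupled through a vanishing weight lambda_k = r/(r+k); the parameters r and s are illustrative assumptions.

```python
import numpy as np

def amd_simplex(grad_f, x0, r=3.0, s=0.05, steps=500):
    """Sketch of accelerated mirror descent on the simplex: an
    entropic mirror step on z, averaged with x in the primal space.
    A simplified version of the scheme in [4], for illustration only."""
    x = np.asarray(x0, dtype=float)
    z = x.copy()
    for k in range(1, steps + 1):
        lam = r / (r + k)
        y = lam * z + (1.0 - lam) * x      # query point
        g = grad_f(y)
        w = z * np.exp(-(s * k / r) * g)   # entropic mirror step on z
        z = w / w.sum()
        x = lam * z + (1.0 - lam) * x      # averaging in primal space
    return x

c = np.array([0.3, 0.1, 0.5])
print(amd_simplex(lambda x: c, np.ones(3) / 3))
```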
Accelerated Mirror Descent
Lyapunov function (decreases along the trajectory)
Convergence rate: $f(X(t)) - f^\star = O(1/t^2)$
Geometry: the constant in the rate depends on the geometry of the feasible set, through the Bregman divergence of the mirror map.
Aleksandr M. Lyapunov (1857 - 1918)
Illustration of Accelerated Mirror Descent
Outline
Accelerated Gradient Flow
Averaging Interpretation
Stochastic optimization
Sampling, non-convex optimization
Stochastic Optimization
SGD (Robbins and Monro, 1951): $x_{k+1} = x_k - \eta_k \hat g_k$, where $\mathbb{E}[\hat g_k] = \nabla f(x_k)$ and the step sizes satisfy $\sum_k \eta_k = \infty$ and $\sum_k \eta_k^2 < \infty$.
Under these conditions one can show convergence; a minimal sketch follows below.
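A minimal SGD sketch under these conditions; the Gaussian gradient noise and the schedule eta_k = a/k are illustrative choices, picked so the Robbins-Monro conditions hold.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(grad_f, x0, steps=5000, a=1.0, noise=1.0):
    """SGD with Robbins-Monro step sizes eta_k = a / k
    (sum eta_k = inf, sum eta_k^2 < inf). The stochastic gradient is
    the true gradient plus zero-mean Gaussian noise, an assumption
    made for this illustration."""
    x = np.asarray(x0, dtype=float)
    for k in range(1, steps + 1):
        g_hat = grad_f(x) + noise * rng.standard_normal(x.shape)
        x = x - (a / k) * g_hat
    return x

print(sgd(lambda x: x, np.array([2.0, -1.0])))  # drifts toward 0
```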
(Overdamped) Langevin equation: $dX(t) = -\nabla f(X(t))\,dt + \sqrt{2\beta^{-1}}\,dB(t)$
- Discrete space: Simulated Annealing [9]
- Continuous space: diffusion for global optimization [10]
[Timeline: 1820: GD / gradient flow; 1940: Langevin; 1951: SGD; 1983: Nesterov; 2014: Nesterov ODE]
[9] S. Kirkpatrick, C. D. Gelatt and M. P. Vecchi. Optimization by Simulated Annealing. Science, 1983.
[10] T. S. Chiang, C. R. Hwang and S. J. Sheu. Diffusion for Global Optimization in R^n. SIAM J. Control and Optimization, 1987.
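A sketch of the Euler-Maruyama discretization of this diffusion; the quadratic objective and the inverse temperature beta are illustrative assumptions (with a decreasing temperature schedule this becomes a continuous analogue of simulated annealing).

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin(grad_f, x0, beta=4.0, dt=1e-2, steps=10000):
    """Euler-Maruyama discretization of
    dX = -grad f(X) dt + sqrt(2/beta) dB. For large beta the iterates
    concentrate near minimizers of f (Gibbs distribution)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        noise = rng.standard_normal(x.shape)
        x = x - dt * grad_f(x) + np.sqrt(2.0 * dt / beta) * noise
    return x

print(langevin(lambda x: x, np.array([2.0, -1.0])))  # hovers near 0
```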
Acceleration in Stochastic Dynamics
[Timeline: 1820: GD / gradient flow; 1940: Langevin; 1951: SGD; 1983: Nesterov; 2012: SAMD; 2014: Nesterov ODE; 2017: SAMD ODE]
[11] G. Lan. An optimal method for stochastic composite optimization. Math. Programming, 2012.
[12] W. Krichene and P. Bartlett. Acceleration and Averaging in Stochastic Descent Dynamics. NIPS 2017.
Acceleration in Stochastic Dynamics
2012: discrete Stochastic AMD [11]
2017: continuous Stochastic AMD [12], driven by noise $\sigma(t)\,dB(t)$:
- $B(t)$: Brownian motion
- $\sigma(t)$: volatility matrix
Without proper rectification, accelerated dynamics may fail to converge in the presence of noise.
Convergence bounds
AMD: $O(1/t^2)$ in the deterministic case.
SAMD: the rate additionally depends on the volatility; acceleration is preserved when the noise vanishes fast enough.
SAMD dynamics
Rectified weights: given the decay rate of the volatility, choose the averaging weights to balance the deterministic rate against the accumulated noise.
Resulting rate: see [12].
Convergence bounds - Proof
Lyapunov function, as in the deterministic case.
By Itô's lemma, its differential picks up two additional terms:
- a volatility term
- an Itô correction term
SAMD dynamics:
- The volatility term can be bounded.
- The Itô correction term is controlled by the rectified weights.
On rescaling time
Continuous-time, deterministic: under a time change $s = \tau(t)$,
$\dot{X} = b(X)$ becomes $\dot{X} = \dot\tau(t)\, b(X)$.
Continuous-time, stochastic: the drift rescales by $\dot\tau$, but the noise rescales by $\sqrt{\dot\tau}$:
$dX = b\,dt + \sigma\,dB$ becomes $dX = \dot\tau\, b\,dt + \sqrt{\dot\tau}\,\sigma\,dB$,
because the quadratic covariation of Brownian motion grows linearly in time.
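A quick Monte Carlo check of the square-root scaling (the linear time change tau(t) = c*t is an illustrative assumption): rescaling the driving noise by sqrt(tau') reproduces the variance of the time-changed Brownian motion, while rescaling it by tau', as one would the drift, does not.

```python
import numpy as np

rng = np.random.default_rng(0)
dt, T, c = 1e-2, 1.0, 4.0          # time change tau(t) = c * t
n, trials = int(T / dt), 20000

dB = rng.standard_normal((trials, n)) * np.sqrt(dt)

# Correct rescaling: noise scales by sqrt(tau'); Var at time T is c*T.
right = (np.sqrt(c) * dB).sum(axis=1).var()
# Naive rescaling: noise scales by tau'; Var at time T is c^2 * T.
wrong = (c * dB).sum(axis=1).var()

print(right, wrong)  # ~4.0 (matches Var[B_{cT}] = cT = 4) vs ~16.0
```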
Outline
Accelerated Gradient Flow
Averaging Interpretation
Stochastic optimization
Sampling, non-convex optimization
Dynamics for sampling
Problem setup: sample from a target distribution $p^\star(x) \propto e^{-f(x)}$
● $f$ is known, smooth, strongly convex, and $\nabla f$ is known
[13] A. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. 2014.
[14] X. Cheng, N. S. Chatterji, P. L. Bartlett and M. I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. 2017.
Dynamics
● Overdamped Langevin diffusion [13]: $dX = -\nabla f(X)\,dt + \sqrt{2}\,dB$
● Underdamped Langevin diffusion [14]: $dX = V\,dt$, $\ dV = -\gamma V\,dt - \nabla f(X)\,dt + \sqrt{2\gamma}\,dB$
Improves convergence (in 2-Wasserstein) from $\widetilde{O}(d/\varepsilon^2)$ to $\widetilde{O}(\sqrt{d}/\varepsilon)$ steps.
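A naive Euler-Maruyama sketch of the underdamped diffusion; the analysis in [14] relies on a more careful integrator, and the Gaussian target and friction parameter gamma here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def underdamped_langevin(grad_f, x0, gamma=2.0, dt=1e-2, steps=10000):
    """Naive discretization of dX = V dt,
    dV = -gamma V dt - grad f(X) dt + sqrt(2 gamma) dB.
    For illustration only; not the integrator analyzed in [14]."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    samples = []
    for _ in range(steps):
        noise = rng.standard_normal(x.shape)
        v = v - dt * (gamma * v + grad_f(x)) + np.sqrt(2.0 * gamma * dt) * noise
        x = x + dt * v
        samples.append(x.copy())
    return np.array(samples)

# Sample from N(0, I): f(x) = 0.5 ||x||^2, so grad f(x) = x.
s = underdamped_langevin(lambda x: x, np.array([2.0, -1.0]))
print(s[5000:].mean(axis=0), s[5000:].var(axis=0))  # roughly 0 mean, unit variance
```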
Dynamics for sampling: handling constraints
● Reflected Brownian motion with drift [15]
- Tanaka reflection term: drift supported on the boundary
- Drift chosen to induce the uniform distribution
● Geodesic walk for sampling from polytopes [16]
● Mirror descent dynamics
[15] S. Bubeck, R. Eldan and J. Lehec. Finite-Time Analysis of Projected Langevin Monte Carlo. NIPS 2015.
[16] Y. T. Lee and S. S. Vempala. Geodesic walks on polytopes. 2017.
Dynamics for non-convex optimization
Dissipative functions [17]
● The Gibbs distribution concentrates around minimizers of the function.
Variationally coherent functions [18]
● Includes pseudo-convex and star-convex functions.
● Almost-sure convergence (of the discretization) via the Asymptotic Pseudo-Trajectory property.
[17] M. Raginsky, A. Rakhlin and M. Telgarsky. Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis. 2017.
[18] Z. Zhou, P. Mertikopoulos, N. Bambos, S. Boyd and P. W. Glynn. Stochastic Mirror Descent in Variationally Coherent Optimization Problems. NIPS 2017.
Nesterov ODE - non-convex optimization
[Illustration: trajectory on Rosenbrock's function]
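A sketch reproducing this kind of experiment: the Nesterov ODE integrated on Rosenbrock's function. The initial point, step size, and horizon are illustrative assumptions, and nothing guarantees convergence in the non-convex setting.

```python
import numpy as np

def rosenbrock_grad(z):
    """Gradient of f(x, y) = (1 - x)^2 + 100 (y - x^2)^2."""
    x, y = z
    return np.array([-2 * (1 - x) - 400 * x * (y - x * x),
                     200 * (y - x * x)])

def nesterov_ode_rosenbrock(z0, t0=0.1, dt=1e-4, T=20.0):
    """Integrate X'' + (3/t) X' + grad f(X) = 0 on Rosenbrock's
    (non-convex) function; illustration only, no guarantee."""
    z = np.asarray(z0, dtype=float)
    v = np.zeros_like(z)
    t = t0
    while t < T:
        v += dt * (-(3.0 / t) * v - rosenbrock_grad(z))
        z += dt * v
        t += dt
    return z

print(nesterov_ode_rosenbrock(np.array([-1.0, 1.0])))  # minimizer is (1, 1)
```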
Summary
● Continuous-time dynamics
○ Unify and simplify analysis
○ Heuristics
○ Acceleration in the vanishing-noise regime
● Open questions
○ Discretization in non-convex problems
○ Multiplicative noise models
Thank You!