Hamiltonian Monte Carlo
for Scalable Deep Learning
Isaac Robson
Department of Statistics and Operations Research,
University of North Carolina at Chapel Hill
isrobson@email.unc.edu
BIOS 740
May 4, 2018
Preface
Markov Chain Monte Carlo (MCMC) techniques are powerful algorithms for fitting probabilistic models
Variations such as Gibbs samplers work well for some high-dimensional situations, but have issues scaling to today's challenges and model architectures
Hamiltonian Monte Carlo (HMC) is a more proposal-efficient variant of MCMC and a promising catalyst for innovation in deep learning and probabilistic graphical models
Outline
Review Metropolis-Hastings
Introduction to Hamiltonian Monte Carlo (HMC)
Brief review of neural networks and fitting methods
Discussion of Stochastic Gradient HMC (T. Chen et al., 2014)
Introduction to Hamiltonian Monte Carlo
Review: Metropolis-Hastings 1/3
Metropolis et al., 1953, and Hastings, 1970
The original Metropolis et al. algorithm can be used to compute integrals with respect to a distribution, e.g. the normalization constant of a Bayesian posterior
$$I = \int f(x)\, P(x)\, dx = E_P[f]$$
Originally developed for statistical mechanics, more specifically calculating the potential of 2D rigid spheres (particles) in a square with "fast electronic computing machines"
Size N = 224 particles, time = 16 hours (on the prevailing machines)
The advantage is that it depends only on the ratio P(x')/P(x) of the probability distribution evaluated at two points, x' and x
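A minimal sketch of this use of sampling, assuming for illustration a standard normal P that we can draw from directly (in practice the draws would come from the Markov chain described below):

```python
# Minimal sketch: given samples x_1..x_N from P, the integral
# I = E_P[f] is estimated by the sample mean of f evaluated at the draws.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.standard_normal(100_000)   # stand-in for MCMC draws from P
f = lambda x: x**2                       # any integrand f(x)
I_hat = f(samples).mean()                # approximates E_P[f] (= 1 here)
print(I_hat)
```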
Metropolis et al., 1953
Review: Metropolis-Hastings 2/3
We can use this ratio to accept or reject a move from a point x to a randomly generated point x', with acceptance ratio P(x')/P(x)
This allows us to "sample" from the target distribution by accumulating a running random-walk (Markov chain) list of correlated samples under a symmetric proposal scheme, which we can then use for estimation
Hastings extended this to permit (but not require) an asymmetric proposal scheme, which speeds the process and improves mixing
$$A(x' \mid x) = \min\left[1, \frac{P(x')\, Q(x \mid x')}{P(x)\, Q(x' \mid x)}\right]$$
Regardless, we also accumulate a "burn-in" period of bad initial samples we have to ignore (this slows convergence, as do correlated samples)
Review: Metropolis-Hastings 3/3
We have to remember that Metropolis-Hastings has a few restrictions: a Markov chain won't converge to a target distribution P(x) unless it converges to a stationary distribution π(x) = P(x)
If π(x) is not unique, we can also get multiple answers! (This is bad)
So we require the equality P(x' | x) π(x) = P(x | x') π(x'), i.e. reversibility (detailed balance)
Additionally, the proposal is symmetric when Q(x | x') / Q(x' | x) = 1, e.g. a Gaussian proposal
These are called "random-walk" algorithms
When P(x') ≥ P(x), they move to the higher-density point with certainty; otherwise they move with acceptance ratio P(x')/P(x)
Note a proposal with higher variance typically yields a lower acceptance ratio
Finally, remember Gibbs sampling, useful for certain high-dimensional situations, is a special case of Metropolis-Hastings using proposals conditioned on values of other dimensions
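A minimal random-walk Metropolis sketch, with a symmetric Gaussian proposal (so the Hastings correction cancels) and a toy standard normal target; the function names, step size, and burn-in length are illustrative only:

```python
import numpy as np

def metropolis(log_p, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis for an unnormalized log-density log_p."""
    rng = np.random.default_rng(seed)
    x, chain = x0, []
    for _ in range(n_samples):
        x_prop = x + step * rng.standard_normal()   # symmetric proposal
        log_ratio = log_p(x_prop) - log_p(x)        # log P(x')/P(x)
        if np.log(rng.uniform()) < log_ratio:       # accept w.p. min(1, ratio)
            x = x_prop
        chain.append(x)                             # correlated samples
    return np.array(chain)

chain = metropolis(lambda x: -0.5 * x**2, x0=0.0, n_samples=5000)
print(chain[1000:].mean(), chain[1000:].std())      # discard burn-in
```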
Hamiltonian Monte Carlo 1/5
Duane et al., 1987, Neal, 2012, and Betancourt, 2017
Duane proposed Hybrid Monte Carlo to more efficiently compute integrals in lattice field theory
"Hybrid" was due to the fact that it infused Hamiltonian equations of motion to generate a candidate point x' instead of purely random proposals
As Neal describes, this allows us to "push" the candidate points farther out with momentum because the dynamics:
Are reversible (necessary for convergence to unique target distribution)
Preserve the Hamiltonian (so we can still use momentum)
Preserve volume (which makes acceptance probabilities solvable)
Hamiltonian Monte Carlo 2/5
A Hamiltonian is an energy function of the form
$$H(q, p) = U(q) + K(p)$$
where U is the potential energy of the position q and K is the kinetic energy of the momentum p
Hamilton's equations govern the change of this system over time:
$$\frac{dq_i}{dt} = \frac{\partial H}{\partial p_i} = [M^{-1}p]_i, \qquad \frac{dp_i}{dt} = -\frac{\partial H}{\partial q_i} = -\frac{\partial U}{\partial q_i}$$
The Jacobian $J_H$ of the resulting flow map has determinant 1, which is the volume-preservation property noted above
We can set $U(q) = -\log P(q) + C$ and $K(p) = p^{T} M^{-1} p / 2$, where C is a constant and M is a PSD "mass matrix" that determines our momentum (kinetic energy)
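A short sketch of these two energy terms, assuming an unnormalized log-density log_p and a diagonal mass matrix M = diag(m) (both placeholders for illustration):

```python
import numpy as np

def potential(q, log_p):
    """U(q) = -log P(q) (up to the constant C)."""
    return -log_p(q)

def kinetic(p, m):
    """K(p) = p^T M^{-1} p / 2 for a diagonal mass matrix M = diag(m)."""
    return 0.5 * np.sum(p**2 / m)

def hamiltonian(q, p, log_p, m):
    """H(q, p) = U(q) + K(p)."""
    return potential(q, log_p) + kinetic(p, m)
```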
Hamiltonian Monte Carlo 3/5
In a Bayesian setting, we set P to be the prior π(θ) times the likelihood given data D, L(θ | D), for our potential energy
$$U(\theta) = -\log\left[\pi(\theta)\, L(\theta \mid D)\right]$$
If we choose a Gaussian proposal (Metropolis), we set the kinetic energy to
$$K(p) = \sum_{i=1}^{d} \frac{p_i^2}{2 m_i}$$
We then generate (q', p') via Hamiltonian dynamics and use the difference in energy levels as our acceptance ratio in the MH algorithm
$$A(q', p' \mid q, p) = \min\left[1, \exp\left(-H(q', p') + H(q, p)\right)\right]$$
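A one-line sketch of this acceptance probability in energy form, with H_current and H_proposed standing for H(q, p) and H(q', p'):

```python
import numpy as np

def accept_probability(H_current, H_proposed):
    """A(q', p' | q, p) = min(1, exp(-H(q', p') + H(q, p)))."""
    return min(1.0, np.exp(H_current - H_proposed))
```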
Hamiltonian Monte Carlo 4/5
Converting the proposal and acceptance steps to this energy form is convoluted; however, we can now use Hamiltonian dynamics to walk farther without sacrificing acceptance ratio
The classic method of solving Hamilton's differential equations is Euler's method, which traverses a small distance ε > 0 for L steps:
$$p_i(t + \varepsilon) = p_i(t) - \varepsilon\, \frac{\partial U}{\partial q_i}[q(t)], \qquad q_i(t + \varepsilon) = q_i(t) + \varepsilon\, \frac{p_i(t)}{m_i}$$
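A sketch of a single Euler step under the same assumptions (grad_U is the gradient of the potential, m a diagonal mass; both are illustrative placeholders):

```python
def euler_step(q, p, grad_U, eps, m):
    """One Euler step: both updates use values from time t."""
    p_new = p - eps * grad_U(q)   # p(t + eps) uses the gradient at q(t)
    q_new = q + eps * p / m       # q(t + eps) uses the old momentum p(t)
    return q_new, p_new
```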
We can also employ the more efficient "leapfrog" technique to quickly propose a candidate:
$$p_i(t + \varepsilon/2) = p_i(t) - \frac{\varepsilon}{2}\, \frac{\partial U}{\partial q_i}[q(t)]$$
$$q_i(t + \varepsilon) = q_i(t) + \varepsilon\, \frac{p_i(t + \varepsilon/2)}{m_i}$$
$$p_i(t + \varepsilon) = p_i(t + \varepsilon/2) - \frac{\varepsilon}{2}\, \frac{\partial U}{\partial q_i}[q(t + \varepsilon)]$$
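A sketch of the leapfrog integrator under the same assumptions (grad_U and the diagonal mass m are illustrative placeholders; the example target is a standard normal):

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, L, m):
    """L leapfrog steps: half momentum step, full position step, half momentum step."""
    q, p = q.copy(), p.copy()
    for _ in range(L):
        p = p - 0.5 * eps * grad_U(q)   # half step in momentum
        q = q + eps * p / m             # full step in position
        p = p - 0.5 * eps * grad_U(q)   # second half step in momentum
    return q, p

# example: standard normal target, so U(q) = q^2 / 2 and grad_U(q) = q
q_new, p_new = leapfrog(np.array([1.0]), np.array([0.5]),
                        grad_U=lambda q: q, eps=0.1, L=20, m=np.array([1.0]))
```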
Hamiltonian Monte Carlo 5/5
The HMC algorithm adds two steps to MH:
Sample the momentum parameter p (typically from a symmetric Gaussian)
Compute L leapfrog steps of size ε to find a new (q', p')
Betancourt explains that we sample a momentum to easily change energy levels, then use Hamiltonian dynamics to traverse our q-space (state space)
We no longer have to wait for a random walk to slowly explore: we can find samples well-distributed across our posterior with high acceptance ratios (same energy levels)
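Putting the pieces together, a sketch of one HMC transition, assuming an identity mass matrix and illustrative log_p / grad_log_p callables for the unnormalized log-target and its gradient:

```python
import numpy as np

def hmc_step(q, log_p, grad_log_p, eps=0.1, L=20, rng=np.random.default_rng()):
    p = rng.standard_normal(q.shape)              # step 1: sample momentum
    H_current = -log_p(q) + 0.5 * p @ p           # H(q, p) = U(q) + K(p)
    q_new, p_new = q.copy(), p.copy()
    for _ in range(L):                            # step 2: L leapfrog steps
        p_new = p_new + 0.5 * eps * grad_log_p(q_new)   # grad U = -grad log P
        q_new = q_new + eps * p_new
        p_new = p_new + 0.5 * eps * grad_log_p(q_new)
    H_proposed = -log_p(q_new) + 0.5 * p_new @ p_new
    # step 3: MH accept/reject using the difference in energy levels
    if np.log(rng.uniform()) < H_current - H_proposed:
        return q_new
    return q

# example: 2D standard normal target
q = np.zeros(2)
for _ in range(1000):
    q = hmc_step(q, log_p=lambda q: -0.5 * q @ q, grad_log_p=lambda q: -q)
```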
However, as T. Chen et al. (2014) describe, we still have to compute the gradient of our potential at every step, which can be costly, especially in high dimensions
Graphic from Betancourt, 2017
Neural Networks
Neural Networks 1/4
[Artificial] neural networks (or nets) are popular connectionist models for "learning" a function approximation
A derivative of "Hebbian" learning, after Hebb's neuropsychology work in the 1940s
Popularized today thanks to parallelization and convex optimization
Universal function approximator (in theory)
Typically use function composition and the chain rule alongside vectorization to efficiently optimize a loss function by altering the "weights" (function) of each node
Requires immense computational power, especially when the functions being composed are probabilistic (such as in a Bayesian Neural Net (BNN))
Fitting neural nets is an active area of research, with contributions from the perspectives of both optimization and sampling
Feedforward Neural Net, Wikimedia Commons
Neural Networks 2/4
LeCun et al., 1998, Robbins and Monro, 1951
As LeCun et al. detail, backward propagation of errors (backprop) is a powerful method for neural net optimization (it does not use sampling); for a layer of weights w at time t, given error matrix E:
$$w_{t+1} = w_t - \eta\, \frac{\partial E}{\partial w}$$
Note this is merely gradient descent, which in recent years has been upgraded with many bells and whistles
One such whistle is stochastic gradient descent (SGD), an algorithm that evolved following the stochastic approximation methods introduced by Robbins and Monro, 1951 (GO TAR HEELS!)
Neural Networks 3/4
Calculating a gradient is costly, but as LeCun et al. detail, stochastic gradient descent is much faster, and comes in both online and smoother minibatch variations
The primary idea is to update using only the error at one point, E*:
$$w_{t+1} = w_t - \eta\, \frac{\partial E^*}{\partial w}$$
The error at one point is an estimate of the full error at the current weight vector w_t, hence the name stochastic gradient descent
The speedup works because information is shared across observations and because, with a decreasing "learning rate" η, SGD still converges; this also holds for minibatch variants of SGD, which compute gradients for a handful of points instead of a single one
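A minimal minibatch SGD sketch on a toy least-squares problem (the data, model, learning rate, and batch size are illustrative placeholders; only the update rule mirrors the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))                       # toy design matrix
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(1000)

w, eta, batch = np.zeros(5), 0.05, 32                    # weights, learning rate, minibatch size
for t in range(2000):
    idx = rng.integers(0, len(X), size=batch)            # draw a random minibatch
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch      # noisy estimate of dE/dw
    w = w - eta * grad                                   # w_{t+1} = w_t - eta * dE*/dw
print(w)                                                 # close to w_true
```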
Neural Networks 4/4
Rumelhart et al., 1986
A popular bell to complement SGD's whistle is the addition of a momentum term to the update step. We more or less smooth our update steps with an exponential decay factor α
$$w_{t+1} = w_t - \eta\, \frac{\partial E^*}{\partial w} + \alpha\, \Delta w_t, \qquad \Delta w_t = w_t - w_{t-1}$$
This may seem familiar if you recall the โmomentumโ term that exists in Hamiltonian Monte Carlo (cue imaginary dramatic sound effect)
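A sketch of the momentum update, with the previous step's change delta_prev playing the role of Δw (the learning rate and decay factor are illustrative defaults):

```python
def sgd_momentum_step(w, grad, delta_prev, eta=0.05, alpha=0.9):
    """One momentum-SGD step: the update is a decayed average of past updates."""
    delta = -eta * grad + alpha * delta_prev   # new update smooths old updates
    return w + delta, delta                    # returns w_{t+1} and the new Delta w
```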
Stochastic Gradient HMC (T. Chen et al., 2014)
Stochastic Gradient HMC 1/4
As mentioned before, backprop paired with the powerful stochastic gradient descent method and its extensions is the standard gradient-based way to fit neural networks
Unfortunately, many of these neural nets lack inferentiability (a principled way to quantify uncertainty)
One solution (other than proving P = NP or solving AGI) is to use Bayesian Neural Networks (BNNs), which are a class of probabilistic graphical models and can be fitted with sampling or similar methods
BNNs still perform many of the surreal feats that other neural nets accomplish
However, even with Gibbs samplers and HMC, sampling in high dimensions is quite slow to converge…
for now...
Stochastic Gradient HMC 2/4
Welling et al., 2011, T. Chen et al., 2014, and C. Chen et al., 2016
In HMC, instead of calculating the gradient of our potential energy U(θ) over the whole dataset D, what if we selected some minibatch D̃ ⊂ D to use for our estimate in the "leapfrogging" method?
$$\nabla \tilde{U} \approx \nabla U + N(0, \Sigma) \quad \text{for some noise covariance } \Sigma$$
$$p_i(t + \varepsilon/2) = p_i(t) - \frac{\varepsilon}{2}\, \frac{\partial \tilde{U}}{\partial q_i}[q(t)]$$
Unfortunately, this "naïve" stochastic gradient HMC (SGHMC) injects noise into Hamilton's equations, which drives the acceptance ratio in the MH algorithm down to inefficient levels
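A sketch of such a minibatch gradient estimate, where grad_neg_log_lik is an assumed helper returning the summed per-observation gradient of the negative log-likelihood over a minibatch (the gradient of the log-prior is omitted for brevity, and data is assumed to be a numpy array):

```python
import numpy as np

def stochastic_grad_U(theta, data, grad_neg_log_lik, batch=256,
                      rng=np.random.default_rng()):
    idx = rng.integers(0, len(data), size=batch)      # random minibatch indices
    # rescale the minibatch gradient to the full dataset size
    return len(data) / batch * grad_neg_log_lik(theta, data[idx])
```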
Stochastic Gradient HMC 3/4
T. Chen et al. suggest fixing naïve SGHMC by adding a friction term, borrowing once again from physics in the form of Langevin dynamics (in the spirit of Welling et al.'s stochastic gradient Langevin dynamics); in vectorized form, omitting leapfrog notation:
$$\nabla \tilde{U} = \nabla U + N(0, 2B), \qquad B = \frac{\varepsilon}{2}\Sigma$$
$$q_{t+1} = q_t + M^{-1} p_t$$
$$p_{t+1} = p_t - \varepsilon\, \nabla \tilde{U}(q_{t+1}) - B M^{-1} p_t$$
Note that B is a PSD function of q_{t+1}, but Chen et al. also show that certain constant choices of B converge (and are far more practical)
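A sketch of one such update, mirroring the form above with an identity mass matrix and a constant friction B assumed to roughly match the stochastic-gradient noise; stochastic_grad_U is any callable returning a noisy gradient of U at q (for instance, the minibatch estimate above with its data and helper bound in), and the position step absorbs the step size as written on the slide:

```python
def sghmc_step(q, p, stochastic_grad_U, eps=0.01, B=0.1):
    q = q + p                                     # q_{t+1} = q_t + M^{-1} p_t
    p = p - eps * stochastic_grad_U(q) - B * p    # noisy gradient force plus friction
    return q, p
```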
Welling et al. also lament that Bayesian methods have been "left behind" in recent machine learning advances due to "[MCMC] requiring computations over the whole dataset"
Stochastic Gradient HMC 4/4
The end result of SGHMC is an efficient sampling algorithm that also permits computing gradients on a minibatch in a Bayesian setting
T. Chen et al. then show that, in deterministic settings, SGHMC behaves analogously to SGD with momentum, as the momentum components are related
C. Chen et al. (BEAT DOOK!) further elaborate that many Bayesian MCMC sampling algorithms are analogs of stochastic optimization algorithms, which suggests that symbiotic discoveries and extensions across the two are possible, as presented in Stochastic AnNealing Thermostats with Adaptive momentum (Santa), which incorporates recent advances from both domains
Conclusions
HMC is a promising variant of MCMC sampling algorithms with applications in Bayesian models
SGHMC offers more scalability in deep learning and several other settings, with the added benefit of inferentiability in Bayesian neural nets.
Future work and collaborations between the sampling and optimization communities are promising
Bibliography (by date)
Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics 21, 1087.
Hastings, W. K. (1970). Monte Carlo Sampling Methods Using Markov Chains and Their Applications. Biometrika 57, 97–109.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323:533–536.
Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B 195, 216–222.
LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K.-R. (1998). Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer.
Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML), pp. 681–688.
Neal, R. M. (2012). MCMC using Hamiltonian dynamics. arXiv preprint arXiv:1206.1901.
Chen, T., Fox, E. B., and Guestrin, C. (2014). Stochastic gradient Hamiltonian Monte Carlo. arXiv preprint arXiv:1402.4102v2.
Chen, C., Carlson, D., Gan, Z., Li, C., and Carin, L. (2015). Bridging the gap between stochastic gradient MCMC and stochastic optimization. arXiv preprint arXiv:1512.07962.
Betancourt, M. (2017). A conceptual introduction to Hamiltonian Monte Carlo. arXiv preprint arXiv:1701.02434.