
Hamiltonian Monte Carlo

Sherman Ip and Jack Jewson

19th November 2015


Introduction/Motivation

Markov Chain Monte Carlo (MCMC) methods have become a huge part of statistical research and

there is an increasing necessity for these methods to be able to sample from vastly complex and

multidimensional distributions. One of the most widely used MCMC algorithms is the Metropolis

Hastings algorithm designed by Metropolis et al. [1953] and Hastings [1970]. This algorithm proposes

a new data point conditional on the previous data point, creating a Markov chain, and accepts or rejects

this new point with a probability which ensures the invariant distribution of the Markov chain is the

correct target. Using a Metropolis Hastings (MH) algorithm to successfully and efficiently sample from complex, multidimensional distributions relies on the ability to produce proposals that both explore the whole target distribution 'quickly' and are accepted with high probability. The speed

of exploration and the acceptance probability determine how long the algorithm will need to be run

for it to produce a satisfactory sample and also how close the sample produced is to an independent

sample.

This is by no means an easy task. An example which will be revisited a lot in this paper is the

Random Walk Metropolis Hastings (RWMH) which proposes a new point by means of a Gaussian

random walk with a preset variance. If this variance is too low the random walk will explore the space

very slowly requiring a large number of iterations to explore the target, creating a sample where data

points are largely dependent on data points many iterations ago. Alternatively, if the variance is too high then the jumps made by the random walk will be much larger but, by the nature of the MH acceptance probability, will often be rejected; this will result in many repeated values and a lot of computational time wasted.
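To make this trade-off concrete, the following is a minimal sketch of a RWMH sampler in Python; the function names and the standard Normal example target are illustrative assumptions rather than the implementation used for the experiments reported below.

    import numpy as np

    def rwmh(log_target, q0, n_iter, proposal_sd):
        """Random Walk Metropolis Hastings with a Gaussian random walk proposal."""
        q = np.atleast_1d(np.asarray(q0, dtype=float)).copy()
        samples = np.empty((n_iter, q.size))
        n_accept = 0
        for i in range(n_iter):
            q_star = q + proposal_sd * np.random.randn(q.size)  # Gaussian random walk proposal
            log_alpha = log_target(q_star) - log_target(q)      # symmetric proposal, so only the target ratio remains
            if np.log(np.random.rand()) < log_alpha:
                q = q_star
                n_accept += 1
            samples[i] = q
        return samples, n_accept / n_iter

    # A small proposal_sd explores slowly; a large one is frequently rejected.
    log_std_normal = lambda q: -0.5 * float(q @ q)
    samples, accept_rate = rwmh(log_std_normal, q0=5.0, n_iter=10000, proposal_sd=1.0)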

In order to attempt to solve this problem, Hamiltonian Monte Carlo (HMC) presented by Neal [2011]

is suggested. The HMC algorithm uses Hamiltonian dynamics to propose new data points and the

same MH acceptance rate to ensure the correct target is produced. Neal [2011] showed that using the

Hamiltonian dynamics, wide ranging proposals that explore the target quickly can be produced and are accepted with probability very close to one. The aim of this project was to implement HMC and investigate whether or not it solves this problem compared to RWMH.

Theory

Neal [2011] gave a detailed review of HMC and in this section a brief review will be presented. From

physics, Hamiltonian dynamics are a way of describing the motion of a body of mass in d-dimensional space. Neal [2011] gave the intuitive description of a puck moving around on an unlevel d-dimensional surface. Suppose the body of mass has d-dimensional momentum p and d-dimensional position q such that

p = (p_1, p_2, ..., p_d)^T, q = (q_1, q_2, ..., q_d)^T, (1)

then Hamiltonian dynamics formulate the equations of motion of this body using the following differential equations

dq_i/dt = ∂H/∂p_i (2)

dp_i/dt = −∂H/∂q_i (3)

for i = 1, 2, ···, d, where H is the Hamiltonian, or total energy, which depends on both the momentum

and position and t is time.

The goal here is to use Hamiltonian dynamics to produce proposals that can be used in a MH algorithm

in order to efficiently sample from a distribution with d dimensions and probability density function

f(.). This is done by introducing p as an auxiliary random variable and defining its distribution by

means of the kinetic energy

K(p) = (1/2) p^T M^{−1} p (4)

where M is the mass matrix and is usually defined as a multiple of the identity matrix. This corresponds to the negative logarithm of the p.d.f. of a multivariate Gaussian, plus some arbitrary constant, with mean 0 and covariance matrix M.

The potential energy is defined, in a similar way, to be the negative logarithm of the target. This is done to ensure that the canonical distributions¹ of our potential and kinetic energy are the desired target and a multivariate Gaussian respectively:

U(q) = − log(f(q)) . (5)

¹The canonical distribution P(E) of an energy E is proportional to e^{−E/(k_B T)}, where T is the temperature and k_B is Boltzmann's constant.

Putting this together, the Hamiltonian, the total energy in the system, is

H(p,q) = K(p) + U(q) . (6)

It can be shown that Hamiltonian dynamics are reversible, keep the Hamiltonian constant and preserve volume in (p,q). These properties are important as they ensure a Markov chain, using Hamiltonian dynamics and the usual MH acceptance probability, will admit the correct target as its invariant distribution. This motivates the use of Hamiltonian dynamics as a proposal distribution.

Hamiltonian dynamics evolve as a continuous process and therefore require a discretisation in order for

them to be implemented on a computer. The discretisation will therefore only provide proposals that

approximately obey Hamiltonian dynamics. It is important that the discretisation is also reversible and

volume preserving in order to maintain the validity of using approximate Hamiltonian dynamics as a proposal distribution. After trying various methods, Neal [2011] decides upon the leapfrog discretisation method as a way of providing a good approximation to the Hamiltonian dynamics that satisfies the necessary properties. The leapfrog method depends on two externally specified parameters, L and ε: L determines how many leapfrog steps the discretisation is run for and ε determines how far each leapfrog step jumps. εL can therefore be seen as the length of the leapfrog trajectory. A smaller ε will give a better approximation to the Hamiltonian dynamics and a larger L will displace the particle further at each iteration.
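For illustration, a minimal leapfrog step in Python could look as follows; this is a sketch assuming a mass matrix M = m·I, with grad_U denoting the gradient of the potential energy (the names are assumptions, not the authors' code).

    import numpy as np

    def leapfrog(q, p, grad_U, eps, L, m=1.0):
        """Apply L leapfrog steps of size eps and return the proposed (q, p)."""
        q, p = q.copy(), p.copy()
        p -= 0.5 * eps * grad_U(q)            # half step for the momentum
        for step in range(L):
            q += eps * p / m                  # full step for the position (M = m * I)
            if step != L - 1:
                p -= eps * grad_U(q)          # full step for the momentum, except after the last position update
        p -= 0.5 * eps * grad_U(q)            # final half step for the momentum
        return q, p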

Now that the validity and practical implications of using Hamiltonian dynamics as a MH proposal distribution have been established, Neal [2011] proposes a HMC algorithm.

The HMC algorithm is as follows:

Initialise the position q_0.

For i = 1, ..., N:

1) Draw p_i ∼ N_d(0, M).

2) Using the leapfrog method starting at q_i and p_i, with preset parameters ε and L, propose q* and p* from the Hamiltonian dynamics.

3) With probability

αHMC = min( 1, exp[−H(p*, q*)] / exp[−H(p_i, q_i)] ) (7)

set q_{i+1} = q*; otherwise set q_{i+1} = q_i.
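Putting the pieces together, a compact Python sketch of this algorithm is given below; it reuses the hypothetical leapfrog function sketched earlier and assumes M = m·I, so it is illustrative rather than the exact implementation used for the experiments.

    import numpy as np

    def hmc(U, grad_U, q0, n_iter, eps, L, m=1.0):
        """Hamiltonian Monte Carlo with kinetic energy K(p) = p.p / (2m)."""
        q = np.atleast_1d(np.asarray(q0, dtype=float)).copy()
        samples = np.empty((n_iter, q.size))
        n_accept = 0
        for i in range(n_iter):
            p = np.sqrt(m) * np.random.randn(q.size)              # 1) draw p ~ N_d(0, m I)
            q_star, p_star = leapfrog(q, p, grad_U, eps, L, m)     # 2) leapfrog proposal
            p_star = -p_star                                       # negate the momentum for reversibility
            h_current = U(q) + 0.5 * np.dot(p, p) / m
            h_proposed = U(q_star) + 0.5 * np.dot(p_star, p_star) / m
            if np.log(np.random.rand()) < h_current - h_proposed:  # 3) accept with prob min(1, exp(-dH))
                q = q_star
                n_accept += 1
            samples[i] = q
        return samples, n_accept / n_iter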

The HMC algorithm targets the canonical distribution of the Hamiltonian which, here, is the augmented target. Due to the properties of the exp function, our augmented target can be written as the product of the canonical distributions of the kinetic energy and the potential energy. The fact that the target can be written in such a way indicates that q and p are independent and therefore the marginal of q will be our target f(.).

αHMC corresponds to the usual MH acceptance probability. The ratio exp[−H(p*, q*)] / exp[−H(p_i, q_i)] corresponds to the ratio of our augmented target distribution, and p* is set as the negative of the momentum at q* in order to

ensure that the proposals are symmetric and therefore cancel. If it were possible to propose a position

exactly according to Hamiltonian dynamics then αHMC would always equal 1, as the Hamiltonian is

constant. However, as a discretisation is used the value of the Hamiltonian will change, but provided ε is small enough it should not change too much, leaving a high acceptance rate.

The right choice of ε and L will allow HMC to produce wide ranging proposals that will be accepted

with probability close to one.

Univariate Gaussian

In order to provide a simple comparison between RWMH and HMC, they were both implemented to sample from a univariate standard Gaussian distribution.

In order to compare HMC fairly with RWMH, L RWMH steps, where L is the number of leapfrog steps done in HMC, were run before doing a MH accept/reject step.

Both methods were run for 10000 iterations to target the standard univariate Normal. The RWMH parameters were set with a proposal variance of 1 and L = 25, and the HMC parameters were selected to be ε = 0.3, M = 1 and L = 25.
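With the hypothetical rwmh and hmc sketches above, this comparison could be set up roughly as follows; the thinning of the RWMH chain is one simple way of matching 25 random walk steps to the 25 leapfrog steps per HMC proposal, and is an assumption rather than a detail taken from the report.

    import numpy as np

    # Standard univariate Gaussian target: U(q) = q^2 / 2, so grad_U(q) = q.
    U = lambda q: 0.5 * float(q @ q)
    grad_U = lambda q: q

    hmc_samples, hmc_accept = hmc(U, grad_U, q0=np.array([5.0]),
                                  n_iter=10000, eps=0.3, L=25, m=1.0)
    rwmh_samples, rwmh_accept = rwmh(lambda q: -U(q), q0=np.array([5.0]),
                                     n_iter=10000 * 25, proposal_sd=1.0)
    rwmh_samples = rwmh_samples[24::25]   # record every 25th point so both chains use 25 steps per sample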

During the experiments, the autocorrelation and trace plots were observed. In order to provide further

insight into the performance of HMC compared with RWMH the Gelman and Rubin [1992] diagnostic

was also implemented to investigate how fast the algorithm chains burnt in. This was done by running

k independent chains of length 2n with different initial values. Let x_{ji} be the ith sample from the jth chain, x̄_j = Σ_{i=n+1}^{2n} x_{ji} / n and x̄ = Σ_{j=1}^{k} Σ_{i=n+1}^{2n} x_{ji} / (kn). Then the ratio

√( s_b² / s_w² ) (8)

where

s_b² = n Σ_{j=1}^{k} (x̄_j − x̄)² / (k − 1) (9)

and

s_w² = Σ_{j=1}^{k} Σ_{i=n+1}^{2n} (x_{ji} − x̄_j)² / (nk − k) (10)

can be used to compare the sample means of all k chains, similar to ANOVA. The ratio should be about 1 if all k chains have the same mean, so the chains could be considered burnt in if the ratio is about 1. An F test can be conducted, but it was found to be too strict for the purpose of this project. Instead, the diagnostic was repeated 10 times to obtain a sample of ratios. When targeting the univariate standard Normal distribution, initial values of {−20, −19, ..., 19, 20} were used for both RWMH and HMC.
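The ratio in equations (8)-(10) can be computed along the following lines; this is a minimal sketch in which chains is assumed to be a k x 2n array of scalar chains, and only the second half of each chain is used, as described above.

    import numpy as np

    def gelman_rubin_ratio(chains):
        """Compute sqrt(s_b^2 / s_w^2) from equations (8)-(10) for k chains of length 2n."""
        chains = np.asarray(chains, dtype=float)
        k, two_n = chains.shape
        n = two_n // 2
        second_half = chains[:, n:]                     # samples n+1, ..., 2n of each chain
        chain_means = second_half.mean(axis=1)          # x_bar_j
        grand_mean = chain_means.mean()                 # x_bar (all chains have the same length)
        s2_b = n * np.sum((chain_means - grand_mean) ** 2) / (k - 1)            # equation (9)
        s2_w = np.sum((second_half - chain_means[:, None]) ** 2) / (n * k - k)  # equation (10)
        return np.sqrt(s2_b / s2_w)                     # equation (8)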

The trace plots in Figure 1 show that both RWMH and HMC explore the support of the target, however HMC manages to do it a lot quicker. The autocorrelation plots support this, with RWMH taking 10 iterations to produce an independent sample whereas HMC only takes 4. The faster mixing and lower autocorrelation of the sample are a result of αHMC being very close to one whereas αRWMH = 0.244. As a result, this was evidence that HMC mixed well and the samples explored the state space well. This demonstrates HMC's ability to produce wide ranging proposals that are still accepted with large probability and thus explore the state space faster and produce a sample that is closer to being an independent sample.


Figure 1: Traceplot and autocorrelation plot of RWMH and HMC targeting a standard Gaussian distribution with initial value 5.

An alternative way to demonstrate the impact using HMC has on the mixing of the sample compared with RWMH is to examine the speed at which the chains burn in from a variety of starting values. Figure 2 plots the estimate of the mean of the target and shows that the HMC algorithm burnt in much faster than RWMH. The Gelman-Rubin diagnostic, as shown in Figure 3, concurred that HMC burnt in much faster, as the ratio of standard deviations was closer to 1 compared to RWMH. The shorter burn in period, even from dispersed starting values, illustrates the fact that HMC can produce wider ranging proposals, allowing the sample to move to areas of high density in the target more quickly than the RWMH can and avoiding wasted computing time.

[Figure 2 panels: (a) RWMH, (b) HMC; sampled value (Dimension 1) against iteration index.]

Figure 2: Traceplots of 41 MCMCs targeting a standard Gaussian distribution with initial values ranging from -20 to 20.


[Figure 3 panels: (a) RWMH, (b) HMC; ratio of standard deviations against chain length (2n), Dimension 1.]

Figure 3: Ratios of between and within chain standard deviations targeting the standard Gaussian distribution. The experiment was repeated 10 times; the means were plotted with error bars corresponding to the standard deviation.

Bivariate Gaussian

In order to test the effectiveness of HMC further, an experiment was run targeting a bivariate Gaussian distribution with mean 0 and correlation between the 2 variables of 0.98. This example is one where RWMH is known to perform badly, so HMC will be tested to see if it can improve upon it.

Both the HMC and RWMH algorithms were run targeting the bivariate Gaussian. In this case M was set to 2 times the identity matrix and all other parameters were left the same as when targeting the univariate case. Figures 4 and 5 show trace plots, scatter plots and autocorrelation of the samples produced by both algorithms.
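For this target the potential energy U(q) = −log f(q) and its gradient are available in closed form; a sketch of the setup, reusing the hypothetical hmc function above, is:

    import numpy as np

    rho = 0.98
    Sigma = np.array([[1.0, rho], [rho, 1.0]])
    Sigma_inv = np.linalg.inv(Sigma)

    U = lambda q: 0.5 * q @ Sigma_inv @ q       # negative log density up to an additive constant
    grad_U = lambda q: Sigma_inv @ q

    samples, accept_rate = hmc(U, grad_U, q0=np.array([5.0, 5.0]),
                               n_iter=10000, eps=0.3, L=25, m=2.0)  # M = 2 * identity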


Figure 4: Traceplot and autocorrelation plot of RWMH, with initial value (5,5), targeting a bivariate Normal distribution centered at the origin with unit variance and correlation coefficient 0.98.

Figure 5: Traceplot and autocorrelation plot of HMC, with initial value (5,5), targeting a bivariate Normal distribution centered at the origin with unit variance and correlation coefficient 0.98.

The RWMH sample trace plots show very slow mixing, with lots of horizontal straight lines indicating that the sample has the same data point for a considerable length of time. This suggests that


many of the proposed values were rejected, which is confirmed by the mean acceptance probability being αRWMH = 0.0148. Owing to this, very high autocorrelation is observed between points many iterations apart. This suggests that the RWMH algorithm would need to be run for a very long time to gain a reasonable approximation of the target. The RWMH algorithm proposes individual variables independently from a random walk; as the correlation in this target is so high, only proposals with similar X1 and X2 values will be accepted with high probability. It is quite unlikely to produce 2 similar values independently, especially in the tails of the target, and this is why the average acceptance probability is so low.

On the other hand, the trace plots from the HMC sample show much faster mixing, with the sample exploring the support of the target well. This produces low autocorrelation in the sample, taking only 4 iterations to produce an independent sample. αHMC = 0.734, which is actually quite low for a HMC algorithm and suggests the value of ε should be decreased. This further demonstrates HMC's ability to produce wide ranging proposals that explore the whole target and are accepted with high probability even when the target is slightly more complicated.

Again the Gelman-Rubin diagnostic was examined in order to illustrate the power of the HMC proposals. For the bivariate Gaussian distribution the initial values were set to be 20 points equally spaced on a circle of radius 5. It was observed that HMC burnt in immediately after a single step whereas RWMH struggled to find the distribution, as shown in Figure 6. The ratios of between and within chain standard deviations, as shown in Figures 7 and 8, also suggested that RWMH struggled to burn in after 50 steps, as the ratio was not near 1 even allowing for the error bars.

[Figure 6 panels: (a) RWMH, (b) HMC; x against y.]

Figure 6: Paths of 20 MCMCs after 50 steps, with initial values on a circle of radius 5, targeting a bivariate Normal distribution centered at the origin with unit variance and correlation coefficient 0.98.


[Figure 7 panels: Dimension 1 and Dimension 2; ratio of standard deviations against chain length (2n).]

Figure 7: Ratios of between and within chain standard deviations for RWMH targeting the bivariate Normal distribution. The experiment was repeated 10 times; the means were plotted with error bars corresponding to the standard deviation.

[Figure 8 panels: Dimension 1 and Dimension 2; ratio of standard deviations against chain length (2n).]

Figure 8: Ratios of between and within chain standard deviations for HMC targeting the bivariate Normal distribution. The experiment was repeated 10 times; the means were plotted with error bars corresponding to the standard deviation.

Multivariate Gaussian

Another advantage of HMC when compared to RWMH, discussed in Neal [2011], is the way in which the two algorithms perform as the dimension of the target increases. HMC's ability to produce wide ranging multidimensional proposals that are accepted with probability close to one allows it to deal with high correlation between 2 variates, as shown in the previous section, and also to deal with multidimensional targets, as will be demonstrated here.

The issue that RWMH has when it is required to sample from a multi-dimensional target is that it produces a proposal for each dimension independently by way of a random walk and these are all accepted or rejected together. It therefore only takes one dimension of the proposed point to be in an area of low probability with respect to the target for the whole proposal to be rejected. As the dimension of the proposal increases it becomes more likely that one of the dimensions will have low probability, and therefore the acceptance rate will drop and the RWMH will become inefficient.

To test whether this happens in practice, RWMH with proposal variance 1 and L = 25 and HMC with ε = 0.3, L = 25 and M = 1 were both run for 10000 iterations targeting an independent standard trivariate Normal. Plots comparing the performance of the two algorithms are shown in Figure 9.


Figure 9: Trace plots, histograms and autocorrelation plots comparing the RWMH algorithm to the HMC algorithm.

Here what was expected was observed. The trace plots for the RWMH showed that the chain explored the support of the target reasonably well but a bit slowly, the histograms showed that the sample produced is a reasonable approximation of the target, and it took about 20 iterations to produce an almost independent point. The reason for this is that αRWMH = 0.151, so there were fewer than 2000 distinct data points, and therefore this somewhat unsatisfactory behaviour was observed. Even in just 3 dimensions RWMH found it difficult to propose 3 coordinates that would be accepted together with high probability. αHMC = 0.986 and this explained why faster mixing, a better approximation of the target and only 4 iterations to an almost independent point were observed in the HMC sample. The Hamiltonian dynamics allowed HMC to propose a multidimensional point via the leapfrog which was accepted with probability close to 1. Therefore more data points covering the support of the target were produced and a better approximation was obtained. There was nothing complicated or difficult about sampling from 3 independent standard Normal distributions, but RWMH still performed poorly and it would take a very large number of samples to produce a good approximation of the target.

In order to observe how this extended to even larger dimensions, RWMH and HMC were run for 5000

iterations targeting standard multivariate Normal random variables with increasing dimensionality.

The same parameters as before were used except that the RWMH proposal variance was set equal to

0.15, in order to attempt to improve its acceptance rates. Figure 10 below demonstrated how the mean

acceptance rates of the HMC (top line) and the RWMH (bottom line) changed as the dimension of the

target increased.
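A sketch of this dimension sweep, using the hypothetical samplers above and an illustrative grid of dimensions, is:

    import numpy as np

    dims = [1, 2, 5, 10, 20, 50]             # illustrative grid of target dimensions
    hmc_rates, rwmh_rates = [], []
    for d in dims:
        U = lambda q: 0.5 * float(q @ q)      # independent standard Normal in d dimensions
        grad_U = lambda q: q
        _, acc_hmc = hmc(U, grad_U, q0=np.zeros(d), n_iter=5000, eps=0.3, L=25, m=1.0)
        _, acc_rwmh = rwmh(lambda q: -0.5 * float(q @ q), q0=np.zeros(d),
                           n_iter=5000, proposal_sd=np.sqrt(0.15))  # proposal variance 0.15
        hmc_rates.append(acc_hmc)
        rwmh_rates.append(acc_rwmh)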


Figure 10: Plots presenting how the mean acceptance rate of the HMC (top line) and the RWMH (bottom line) change as the dimension of the target increases.

This demonstrated that the mean acceptance rates of both the HMC and RWMH algorithms decreased as dimension increased. The mean acceptance rate of the HMC algorithm seemed to decrease linearly with dimension. The reason that the mean acceptance rates for HMC were not all one was that the discretisation, and the value of ε, caused a difference between the Hamiltonian at the current and proposed state. The discretisation produced a difference in each dimension and the total difference was just the sum of these differences, so the acceptance probability decreased linearly as dimension increased. It was less obvious at what rate the RWMH mean acceptance rate fell as the dimension increased, but it was at a much faster rate than HMC. This was further evidence of the difficulty RWMH algorithms have in proposing d independent components that are accepted together with high probability, and demonstrated that HMC scaled in a much more desirable way than RWMH.

Univariate Normal Mixtures

A common problem where standard MCMC methods often fail is when they try to sample from multi-

modal distributions, for example the mixture of two Normal distributions. The Markov chain structure

means that it is easy for the chain to become stuck in one of the modes and therefore not explore the

whole space. This will become a problem if the structure of the target is not known and what will

appear to be a well mixing sample may only be exploring half the target. Techniques such as simulated

annealing and simulated tempering have been used to try and solve this problem but here HMC was

tested. Neal [2011] remarks that "HMC is no less (or more) vulnerable to problems with isolated modes than other MCMC methods that make local changes to the state"; this statement was explored by comparing HMC and RWMH for two different univariate Normal mixture examples.

Example 1

The first univariate Normal mixture example that was investigated is π1(x) = (1/2) N(x; 0, 1^2) + (1/2) N(x; 5, 1^2). Figure 11 below shows its density.


Figure 11: Density of π1(x) = (1/2) N(x; 0, 1^2) + (1/2) N(x; 5, 1^2).

π1(x) has 'significant' positive density between the modes: the smallest density between the 2 modes occurs at x = 2.5, where the density is 0.018 compared with 0.20 at the modes x = 0 and x = 5.

A RWMH algorithm with proposal variance 1 was run for 10000 iterations, counting 25 steps as an iteration, and a HMC algorithm with L = 25, ε = 0.3 and M = I was run for 10000 iterations targeting π1(x). Figure 12 below shows trace plots, histograms and autocorrelation plots for both the RWMH and HMC runs.
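HMC only needs the potential energy U(q) = −log π1(q) and its gradient, which for this mixture follow by differentiating the log of the mixture density; a sketch (with function names that are assumptions) is:

    import numpy as np

    def normal_pdf(x, mu, sigma):
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

    mus, sigma = np.array([0.0, 5.0]), 1.0   # component means and common standard deviation

    def U(q):
        x = float(q[0])
        return -np.log(0.5 * normal_pdf(x, mus[0], sigma) + 0.5 * normal_pdf(x, mus[1], sigma))

    def grad_U(q):
        x = float(q[0])
        comps = 0.5 * normal_pdf(x, mus, sigma)       # unnormalised component responsibilities
        weights = comps / comps.sum()
        return np.array([np.sum(weights * (x - mus)) / sigma ** 2])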

Figure 12: Trace plots, histograms and autocorrelation plots comparing the RWMH algorithm to the HMC algorithm targeting π1(x).

From the trace plots it was observed that both the RWMH and the HMC chains appeared to explore the support of π1 quite quickly. The RWMH trace plots showed excellent mixing with very frequent jumps between the two modes of the data. This was because the distance between the two modes was 5, only 5 times the RWMH proposal variance, making it relatively easy for the RWMH to jump between modes. The ability of the RWMH chain to do this resulted in the expected time to an independent sample being relatively short at 10. The HMC trace plots seemed to generally show good mixing but it was observed that jumps between modes were less frequent. This was because, unlike the RWMH, HMC does not make jumps between points; it 'flows' between them. Thinking of the area of low density in between the modes as a hill (or high energy barrier), the particle, in HMC, has to be given enough energy to get over that hill, whereas RWMH can just jump over the area of low density and so avoids this problem. This was demonstrated further by the autocorrelation plots which show that HMC took over 40 iterations to produce an independent sample. However, this area of low density was not so low that HMC could never get over it (an example of this will be presented later), so the HMC algorithm still explored the


whole target. Finally, the histogram plots show how well the samples were approximating the target. It was observed that, despite slightly slower mixing, the HMC algorithm produced the sample closer to the target; this was likely to be because of the acceptance probabilities. αHMC = 0.994, meaning that the HMC sample contained almost 10000 distinct points. However, αRWMH = 0.396, so the RWMH sample had many fewer distinct points and is therefore a slightly worse approximation of the target. The RWMH algorithm would need to be run for longer to produce a closer approximation of the target.

Example 2

The second univariate Normal mixture that was examined is π2(x) = (1/2) N(x; 0, 0.3^2) + (1/2) N(x; 3, 0.3^2). Figure 13 below shows its density. This example had been chosen deliberately such that the modes were quite close together but such that there was almost zero density in between.

Figure 13: Density of π2(x) = (1/2) N(x; 0, 0.3^2) + (1/2) N(x; 3, 0.3^2).

The smallest density between the 2 modes of this distribution occurred at x = 1.5, where the density is 0.000005 compared with 0.66 at the modes x = 0 and x = 3. Once again the RWMH and HMC algorithms targeted π2(x) with the same input parameters as before.

Figure 14: Trace plots, histograms and autocorrelation plots comparing the RWMH algorithm to the HMC algorithm targeting π2(x).

As in Example 1, the trace plots of the RWMH sample appeared to be exploring the target π2 quite well, though more slowly than in Example 1. The chain managed to jump between modes reasonably frequently, but less often than in Example 1. This was because there was more density in the modes


and less between, so the RWMH needed a bigger jump, which happens less often, to move between modes. This was verified by the autocorrelation plot for the RWMH, where it took considerably longer to achieve an almost independent point. On the other hand, the trace plot of the HMC sample did not suggest that the chain had explored the whole target. Apart from one random fluctuation, the chain became stuck in the mode at x = 3 and did not explore the other mode at x = 0. This was because the density in between the modes was so low that it constituted such a large high energy barrier that there was never enough momentum to get over it. Different values for the parameters ε, L and M were tried but none of them produced mixing between the two modes. This was also backed up by the autocorrelation plots, as it took well over 40 iterations to produce an independent point. In this example the target was known and therefore this was easy to notice, but had the target not been known, and because HMC mixed so well with a high acceptance rate αHMC = 0.914, the histogram and trace plot would have made it appear as though the target were just N(3, 0.3^2). The histogram of the RWMH sample indicated that there is a bimodal target, but the algorithm still did not produce a very good approximation of the target. This was likely to be because αRWMH = 0.141 was so low that the sample produced contained fewer than 1500 distinct points. Running the sampler for longer should produce a better approximation, though this may not be computationally feasible.

In conclusion, just by looking at these two simple examples it appeared Neal [2011] was right that HMC is still susceptible to failures when sampling from multimodal distributions. In fact, provided the proposal variance of the RWMH allows it to jump between modes, the RWMH appeared more proficient at mixing between modes than the HMC algorithm did, simply because the RWMH can jump over regions of low density where the HMC cannot. However, the low acceptance probability produced by the RWMH when sampling from multimodal distributions may mean that the algorithm needs to be run for a lot longer in order to produce a suitable approximation of the target. If the modes are such that there is significant positive density in between them, then the high acceptance probability produced by the HMC algorithm creates a better approximation of the target.

Mixture of Bivariate Normals

In order to test how HMC compared with RWMH, one final experiment was done implementing both on a bivariate Gaussian mixture distribution; the exact mixture is given below. It was demonstrated previously that HMC coped with larger dimensions better but RWMH has a stronger ability to jump between multiple modes of a target.

N((1, 1)^T, 0.1 I) with probability 1/4,
N((−1, 1)^T, 0.1 I) with probability 1/4,
N((−1, −1)^T, 0.1 I) with probability 1/4,
N((1, −1)^T, 0.1 I) with probability 1/4, (11)

where I denotes the 2 × 2 identity matrix.
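A sketch of the corresponding potential energy and gradient, with all four components sharing the covariance 0.1·I, is given below (the names are assumptions for illustration):

    import numpy as np

    means = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])
    var = 0.1                                          # each component has covariance 0.1 * I

    def U(q):
        """Negative log density of the equally weighted four-component mixture."""
        sq_dists = np.sum((q - means) ** 2, axis=1)
        comps = np.exp(-0.5 * sq_dists / var) / (2.0 * np.pi * var)
        return -np.log(0.25 * np.sum(comps))

    def grad_U(q):
        sq_dists = np.sum((q - means) ** 2, axis=1)
        comps = np.exp(-0.5 * sq_dists / var)          # shared normalising constants cancel
        weights = comps / comps.sum()                  # component responsibilities at q
        return np.sum(weights[:, None] * (q - means), axis=0) / var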


The RWMH and HMC algorithms were run for 200 iterations with 5 initial values on a circle of radius 2, to give a brief idea of how their chains would behave. The parameters for the RWMH sampler were L = 25 and σ = 0.02, and the parameters for HMC were selected to be ε = 0.3, L = 25 and M equal to two times the identity matrix. It was observed from Figure 15 that HMC did not jump between mixture components as often as RWMH, suggesting RWMH better explores the state space. However, for RWMH the rejection rate was high and it did not mix well within a component compared to HMC. This is wholly unsurprising and just reiterates the results found when mixtures and bivariate targets were examined independently.

[Figure 15 panels: (a) RWMH, (b) HMC; traceplots of Dimension 1 and Dimension 2 against iteration index.]

Figure 15: Traceplots of 5 RWMH and HMC chains targeting a mixture of 4 bivariate Gaussian distributions with initial values on a circle of radius 2.

Further work and ideas

This paper has explained and implemented the HMC algorithm proposed in Neal [2011]. It has looked

at relatively straightforward examples of target distributions and compared the results of implementing

HMC with the results of implementing the RWMH.

Unfortunately the length of time given to complete this project was not long enough to try out any more complicated or innovative ways to use HMC. However, this was given some thought and some potential further work is as follows. Girolami and Calderhead [2011] suggested using a position-specific mass matrix, and it was hoped that something as simple as using the Hessian of the negative log of the target may lead to better exploration of the space in these simple Gaussian examples. Neal [2011]


suggested that it is possible to combine HMC with some other MCMC update. It was therefore felt that some kind of hybrid algorithm, for example doing HMC updates but with a RWMH update every ith iteration, may help to solve the problem caused by using HMC to sample from a multimodal target.


Bibliography

Andrew Gelman and Donald B. Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, pages 457–472, 1992.

Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.

W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.

Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2, 2011.
