
Lancaster University

Exact approximate Markov chain Monte Carlo

Author: Jack Baker

Supervisor: Paul Fearnhead

April 23, 2015


Exact approximate MCMC Jack Baker

Abstract

Traditional MCMC has proven to be an invaluable tool for Bayesian statistics. However the methods often struggle to mix well when there is a lot of missing data which cannot be marginalised out analytically. Such a phenomenon commonly occurs when an unobserved or latent process is introduced in order to simplify complex models. In this report we introduce pseudo-marginal MCMC, which provides an alternative to the Metropolis-Hastings algorithm when we cannot evaluate the posterior distribution, but can produce an unbiased estimate of it. We demonstrate the uses of pseudo-marginal MCMC when a model contains a lot of missing data. Hidden Markov models (HMMs), an important, flexible class of models which relies on an unobservable process, are then discussed, along with estimating the distribution of the unobserved process using sequential Monte Carlo. The estimation of parameters in HMMs using particle MCMC, which can be seen as a special case of pseudo-marginal MCMC, is then considered. This is demonstrated by analysing time series data on the number of annual earthquakes worldwide. Finally a number of related open areas are considered, including online parameter estimation of HMMs and exact approximate MCMC for big data.

1 Introduction

Interest in a process which can only be observed indirectly arises in a variety of applications. Examples include the analysis of biological sequences, speech recognition and time series analysis. Parameter estimation for problems with missing data or unobservable processes can be difficult when the unobserved process cannot be marginalised out analytically. Data augmentation used along with MCMC is generally recommended in these cases: for example, a Gibbs sampler which alternates between updating the unobservable process and the parameters. However these algorithms often perform poorly due to strong correlation between the hidden state and the parameters that need to be estimated. This report introduces a breed of MCMC algorithms which allow us to simulate from distributions that cannot be evaluated explicitly, but for which unbiased estimators can be produced. Remarkably, these algorithms can be shown to target the true distribution, not just the estimator used to approximate it. This is the motivation for the name exact approximate MCMC. These tools allow us to perform parameter estimation for a wide range of models which depend on unobservable or hidden processes, especially an important class known as hidden Markov models.

We begin this report by introducing pseudo-marginal MCMC in Section 2. Pseudo-marginal MCMC can be used in place of the standard Metropolis-Hastings sampler to simulate from a distribution when it cannot be evaluated explicitly, but an unbiased estimator of it can be produced. Pseudo-marginal MCMC is particularly useful for models of systems with missing data or latent processes, because the likelihood with the unobserved process marginalised out (the marginal likelihood) can often be estimated in an unbiased way using Monte Carlo methods. In Section 3, we first provide a formal definition of a hidden Markov model (HMM), an important, flexible class of models with an unobservable process. Obtaining the distribution of the unobservable process is analytically tractable only for a small subclass of HMMs. In the rest of the section we therefore focus on numerical methods, known as sequential Monte Carlo, that allow us to simulate from the unobservable process. Section 4 then focusses on particle MCMC, which enables us to perform parameter estimation for HMMs. The algorithm uses sequential Monte Carlo to build an unbiased estimate of the marginal likelihood of the process. This enables a pseudo-marginal style sampler to be built which targets the true joint distribution of the parameters and latent process. In Section 5 we demonstrate a particular particle MCMC algorithm by using it to analyse data on the annual count of earthquakes


worldwide (above a magnitude of 7 on the Richter scale). Finally, in Section 6 we provide a discussion of some of the key areas outlined by the report, including related open problems such as online parameter estimation of HMMs and exact approximate MCMC for big data.

2 Pseudo-marginal Markov chain Monte Carlo

The Metropolis-Hastings algorithm has become a familiar tool to the statistician for understanding complex probability distributions (for a detailed review of the Metropolis-Hastings algorithm, the reader is referred to Chib and Greenberg (1995)). However the algorithm assumes that the distribution you are trying to simulate from can be evaluated up to a proportionality constant. Situations where this is not possible arise quite frequently when dealing with missing data or latent variables. Pseudo-marginal or exact approximate MCMC provides a method to apply the Metropolis-Hastings algorithm when only an unbiased estimate of the distribution is available, while still targeting the desired distribution. This section aims to introduce pseudo-marginal MCMC to the reader, along with the key issues affecting its implementation.

2.1 Motivation: missing data

Suppose we have data y which depend upon an unknown parameter vector θ, as well as an unobserved or latent process x. Then we can write down the likelihood as follows:

p(y, x|θ) = p(x|θ)p(y|x, θ). (2.1)

Information about the latent process x is unavailable, so we wish to marginalise (2.1) to obtain the marginal likelihood

p(y|θ) = ∫ p(x|θ) p(y|x, θ) dx. (2.2)

We can then use (2.2) to find the posterior

p(θ|y) ∝ p(θ)p(y|θ).

However for many models the integral in (2.2) is intractable, which means the posterior distribution cannot be evaluated explicitly.

We might then try to use an MCMC algorithm which alternates between updating the parameters θ|y, x and the latent process x|θ, y. However an algorithm of this sort will often mix very slowly due to strong correlation between x and θ (Fearnhead, 2012). This is an important situation that the pseudo-marginal approach aims to address.

More generally, pseudo-marginal MCMC can be used in place of a standard Metropolis-Hastings (MH) sampler when we are trying to simulate from a distribution p(x) which cannot be evaluated explicitly, but which can be approximated using a Monte Carlo estimator. Remarkably, if this estimator is unbiased and non-negative, then pseudo-marginal MCMC will target the desired distribution p(x) (Andrieu and Roberts, 2009). This is the motivation for the term exact approximate MCMC.


2.2 The pseudo-marginal approach

Suppose we wish to simulate from the distribution p(x), but the distribution cannot be evaluated explicitly. A natural way to extend the MH algorithm to deal with this case is to substitute a Monte Carlo approximation p̂(x) into the acceptance probability. Two key ways of implementing this have been considered previously: O'Neill et al. (2000) introduced algorithms similar to Monte Carlo within Metropolis, outlined in Algorithm 1, and Beaumont (2003) introduced a method similar to Algorithm 2. The two algorithms can be written as follows:

Algorithm 1: Monte Carlo within Metropolis

Given, from the previous step, a current value x:
1. Propose a new value x′ from some distribution q(x′|x).
2. Calculate approximations to the desired distribution at both x and x′, p̂(x) and p̂(x′).
3. Accept x′ with probability

α = 1 ∧ [p̂(x′) q(x|x′)] / [p̂(x) q(x′|x)].

Algorithm 2: Pseudo-marginal MCMC

Given, from the previous step, a current value x and an approximation p̂(x) to the true distribution at x:
1. Propose a new value x′ from some distribution q(x′|x).
2. Calculate an approximation to the desired distribution at x′, p̂(x′).
3. Accept x′ with probability

α = 1 ∧ [p̂(x′) q(x|x′)] / [p̂(x) q(x′|x)].

Notice that the main difference between the two algorithms is that at each iteration Algorithm 1 produces a new approximation to p(x) at both the current state and the proposed state, while Algorithm 2 reuses the approximation at the current state. Algorithm 2 can therefore be seen as more efficient, since it calculates only one approximation each step rather than two; however, Algorithm 1 was found to mix better. The real breakthrough came when Andrieu and Roberts (2009) showed that if p̂(x) is unbiased and non-negative then Algorithm 2 targets the true distribution p(x).
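Algorithm 2 is straightforward to implement once an unbiased, non-negative estimator is available. The following Python sketch is illustrative only and not from the report: the N(0, 1) target, the multiplicative gamma-noise estimator and the symmetric random-walk proposal are all hypothetical choices. The key point is that the (log) estimate at the current state is stored and reused, exactly as in Algorithm 2.

```python
import numpy as np

rng = np.random.default_rng(1)

def pseudo_marginal_mh(x0, log_p_hat, q_sample, n_iters):
    """Algorithm 2: MH where the target density can only be estimated.

    log_p_hat(x) returns the log of a non-negative unbiased estimate of p(x);
    crucially, the estimate at the current state is stored and reused.
    """
    x, log_px = x0, log_p_hat(x0)            # estimate once, then recycle
    chain = [x]
    for _ in range(n_iters):
        x_new = q_sample(x)                  # symmetric random-walk proposal,
        log_px_new = log_p_hat(x_new)        # so the q terms cancel below
        if np.log(rng.uniform()) < log_px_new - log_px:
            x, log_px = x_new, log_px_new    # accept: keep the new estimate
        chain.append(x)
    return np.array(chain)

# Toy target: a noisy estimate of the N(0,1) density (up to a constant),
# built with multiplicative gamma noise so the estimate stays positive.
def log_p_hat(x):
    w = rng.gamma(shape=4.0, scale=0.25)     # E[W] = 1, so unbiased
    return np.log(w) - 0.5 * x**2            # log of w * exp(-x^2/2)

chain = pseudo_marginal_mh(0.0, log_p_hat,
                           lambda x: x + rng.normal(scale=1.0), 5000)
```

Despite every density evaluation being corrupted by noise, the chain targets the exact N(0, 1) distribution.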

The mixing of pseudo-marginal MCMC depends on two things: the mixing of the true-marginal MCMC algorithm, and the variance of the Monte Carlo estimator p̂(x) (Fearnhead, 2012). Clearly the variance of the Monte Carlo estimator depends on the number of random samples N used to generate it. This leads to a trade-off between computational expense and the performance of the algorithm. High variance in the estimate p̂(x) can lead to over-estimation at a particular state, which in turn leads to a long run of rejected proposals (this is sometimes referred to as stickiness). Sherlock et al. (2014) consider the optimal variance of the Monte Carlo estimate in the case of random walk pseudo-marginal MCMC. Under the assumption of Gaussian noise in the log of the Monte Carlo estimate, they show that the optimal variance of the log estimate is about 3.3, and the corresponding optimal acceptance rate is 7%. Doucet et al. (2012) consider the pseudo-marginal algorithm under more general proposals. While in this case they are unable to minimise computing time itself, they minimise upper bounds on it. Under the assumption of Gaussian noise in the approximation p̂(x), they recommend its variance should be approximately 1 when the MCMC is efficient, and 1.7 when it is not. In Section 2.3, we give a simple example to


demonstrate how the quality of the sample from a pseudo-marginal MCMC algorithm is affected by the variance of the Monte Carlo estimator.
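This guidance suggests tuning N so that the variance of the log estimate lands near the optimal values quoted above. A minimal sketch of such a pilot-tuning step, assuming the rough scaling Var(log p̂) ≈ c/N (reasonable when the estimator is an average of N independent draws) and a hypothetical estimator log_lik_hat(θ, N):

```python
import numpy as np

def tune_particles(log_lik_hat, theta, n0, target_var=1.5, reps=50):
    """Pick N so that Var(log p_hat(y|theta)) is near target_var, a value
    between the ~1 of Doucet et al. (2012) and the ~3.3 of Sherlock et al.
    (2014). Assumes the rough scaling Var(log p_hat) ~ c / N.

    log_lik_hat(theta, n) is a hypothetical unbiased log-likelihood
    estimator built from n Monte Carlo samples.
    """
    draws = [log_lik_hat(theta, n0) for _ in range(reps)]   # pilot runs at N = n0
    var0 = float(np.var(draws))                             # empirical variance
    return max(1, int(np.ceil(n0 * var0 / target_var)))     # rescale N to target
```

In practice one would rerun this at a few representative values of θ, since the estimator variance typically depends on θ.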

If we now consider the situation that was outlined in Section 2.1, we have

p(θ|y) ∝ p(θ) p(y|θ) = p(θ) ∫ p(x|θ) p(y|x, θ) dx,

where x is a latent process and finding the marginal likelihood for y is intractable. Provided we can simulate from some proposal distribution g(x|θ), we can build a Monte Carlo estimate of p(y|θ) by drawing an independent sample x_1, . . . , x_N from g(x|θ) and using, for example, importance sampling or Monte Carlo integration. Let us call this estimate p̂(y|θ). Using pseudo-marginal Metropolis-Hastings (MH), the acceptance probability becomes

α = 1 ∧ [p(θ′) p̂(y|θ′) q(θ|θ′)] / [p(θ) p̂(y|θ) q(θ′|θ)]

  = 1 ∧ { [p(θ′) p̂(y|θ′) ∏_{i=1}^N g(x′_i|θ′)] × [q(θ|θ′) ∏_{i=1}^N g(x_i|θ)] } / { [p(θ) p̂(y|θ) ∏_{i=1}^N g(x_i|θ)] × [q(θ′|θ) ∏_{i=1}^N g(x′_i|θ′)] }. (2.3)

Writing the acceptance probability in the form (2.3), we see that the pseudo-marginal MH sampler over the space Θ can be viewed as an exact MH sampler over the extended space Θ × χ, where χ denotes the space of the auxiliary variables x_1, . . . , x_N. The target of this exact MH sampler is (Wilkinson, 2010)

p(θ) p̂(y|θ) ∏_{i=1}^N g(x_i|θ). (2.4)

In order to show that the pseudo-marginal MH sampler targets the true posterior p(θ|y) ∝ p(θ)p(y|θ), we need to show that the target distribution (2.4) of the exact MH sampler marginalises down to p(θ|y). Notice that

∫ p(θ) p̂(y|θ) ∏_{i=1}^N g(x_i|θ) dx_{1:N} = p(θ) E(p̂(y|θ)).

However we have assumed that our Monte Carlo estimate is unbiased, so that E(p̂(y|θ)) = p(y|θ). We conclude that the target of the exact MH sampler marginalises down to the required distribution p(θ|y), and so the MCMC algorithm targets the desired posterior.

2.3 Example

In order to demonstrate the pseudo-marginal MCMC algorithm, we use it to simulate from a N(0, 1) distribution using three noisy, unbiased approximations with different variances. Suppose we have X ∼ N(0, 1). Let us define a noise term W ∼ Γ(1/σ², 1/σ²), where σ² is the variance of W and Γ(·, ·) denotes the gamma distribution (so that E(W) = 1). Then for a particular realisation x of X we can produce an unbiased approximation p̂(x) to p(x) by:

1. Simulate a realisation w of W.
2. Set p̂(x) = w p(x).

Notice that E(p̂(x)) = E(W p(x)) = p(x), since the noise and X are generated independently and E(W) = 1, so that p̂(x) is unbiased. We use multiplicative rather than additive noise in this case so that p̂(x) is always positive. While these approximations may seem rather


[Three QQ plots, Sample Quantiles against Theoretical Quantiles: (a) Variance: 5, (b) Variance: 1, (c) Variance: 0.2.]

Figure 1: QQ plots of the simulation from a pseudo-marginal MCMC with differing variance in the estimate p̂(x).

artificial, they allow us to easily compare the difference in sample quality as the variance of the approximation varies.

We run three independence-sampler pseudo-marginal MCMC algorithms for 1000 iterations, using a Cauchy proposal distribution with location parameter 0 and scale parameter 1. We change the variance of the noise term at each run to demonstrate how this affects the quality of the sample. QQ plots of the samples compared with a N(0, 1) distribution are given in Figure 1. The figure shows how the quality of the sample decreases as the variance of the noise term increases. The effective sample sizes are approximately 100, 180 and 510 for the samples corresponding to Figures 1a, 1b and 1c respectively.
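The experiment just described can be reproduced in a few lines. This Python sketch follows the set-up above (independence sampler, standard Cauchy proposal, multiplicative gamma noise); the seed and the acceptance-rate diagnostic are our own additions, not from the report.

```python
import numpy as np

rng = np.random.default_rng(42)

def noisy_normal_pdf(x, sigma2):
    """Unbiased estimate of the N(0,1) density: multiply by gamma noise W
    with E[W] = 1 and Var(W) = sigma2 (shape 1/sigma2, scale sigma2)."""
    w = rng.gamma(shape=1.0 / sigma2, scale=sigma2)
    return w * np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def independence_pm_sampler(sigma2, n_iters=1000):
    """Pseudo-marginal independence sampler with a Cauchy(0, 1) proposal."""
    x = 0.0
    p_hat = noisy_normal_pdf(x, sigma2)      # reused until the next acceptance
    chain = np.empty(n_iters)
    for i in range(n_iters):
        x_new = rng.standard_cauchy()        # independence proposal q(x') only
        p_hat_new = noisy_normal_pdf(x_new, sigma2)
        # acceptance ratio needs q(x)/q(x'); Cauchy density is 1/(pi(1+x^2))
        q_ratio = (1 + x_new**2) / (1 + x**2)
        if rng.uniform() < (p_hat_new / p_hat) * q_ratio:
            x, p_hat = x_new, p_hat_new
        chain[i] = x
    return chain

# Higher estimator variance -> more stickiness -> lower acceptance rate.
rates = {}
for sigma2 in (5.0, 1.0, 0.2):
    chain = independence_pm_sampler(sigma2)
    rates[sigma2] = float(np.mean(chain[1:] != chain[:-1]))
```

The acceptance rates fall as σ² grows, which is the stickiness described in Section 2.2 and visible in the tails of Figure 1a.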

3 Sequential Monte Carlo

Hidden Markov models (HMMs) are an important class of models with diverse applications, especially when observations are temporal. HMMs have been found to be particularly useful for temporal pattern recognition, for example in speech recognition, robotics and bioinformatics. We begin this section by giving a formal definition of a HMM. We detail the forward-backward recursions, which can be used to calculate the distribution of the hidden process given current and past observations. The recursion is not tractable in general, however, and this motivates the rest of the section, where we consider methods to simulate from the hidden state when the


forward-backward recursion is not tractable. These methods are known as sequential Monte Carlo methods, or particle filters.

3.1 Hidden Markov models

A hidden Markov model (HMM) is a bivariate process (Y_t, X_t) defined by (Fearnhead, 2012):

  • An unobserved or hidden state X_t that is a Markov chain.
  • An observed variable Y_t that is conditionally independent of X_s and Y_s given X_t, where s ≠ t.

[Figure 2: Diagram of a hidden Markov model, showing the hidden chain X_t and the observations Y_t.]

We can therefore write down the joint probability density as

p(x_{1:T}, y_{1:T}) = p(x_1) ∏_{t=2}^T p(x_t|x_{t-1}) ∏_{t=1}^T p(y_t|x_t),

where we define x_{1:T} = {x_1, . . . , x_T}. Often the HMM will additionally depend on some unknown parameters θ. Since for the remainder of this section our interest is mainly in the hidden state rather than parameter estimation, we omit any dependence on θ in our notation.
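The factorisation above translates directly into forward simulation: draw x_1, propagate through the transition density, then draw each observation conditionally on its state. As an illustration (our own choice, not from the report), consider a hypothetical linear-Gaussian HMM with an autoregressive hidden state:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_hmm(T, phi=0.9, tau=1.0, sigma=0.5):
    """Draw (x_{1:T}, y_{1:T}) from a toy linear-Gaussian HMM:
        x_1 ~ N(0, tau^2 / (1 - phi^2))        (stationary initial state)
        x_t | x_{t-1} ~ N(phi * x_{t-1}, tau^2)
        y_t | x_t     ~ N(x_t, sigma^2),
    mirroring the factorisation p(x_1) prod p(x_t|x_{t-1}) prod p(y_t|x_t)."""
    x = np.empty(T)
    x[0] = rng.normal(0.0, tau / np.sqrt(1 - phi**2))
    for t in range(1, T):
        x[t] = rng.normal(phi * x[t - 1], tau)   # Markov transition
    y = rng.normal(x, sigma)                     # conditionally independent obs
    return x, y

x, y = simulate_hmm(200)
```

We reuse this toy model below when sketching the particle filters of Sections 3.2-3.5.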

Suppose we want to find the distribution of the hidden process X_t conditional on the observed states y_{1:t}, known as the filtering distribution. Notice the difference between estimating X_t conditional on current and past observations y_{1:t} and conditional on all observations y_{1:T}; the latter is known as the smoothing problem. The filtering problem can be solved online, or sequentially, as observations arrive, whereas solutions to the smoothing problem must be computed offline. We restrict ourselves in this section to online methods for the filtering problem. For a review of the smoothing problem, readers are referred to Doucet and Johansen (2008).

We can derive recursions in order to find the filtering distribution and to calculate the likelihood p(y_{1:t}). These are often referred to as the forward-backward recursions. When the space of X_t is discrete these recursions can be solved analytically. If the space of X_t is continuous, however, the recursions can only be solved when the models for X_t and Y_t are linear and Gaussian. We focus on the case where the space of X_t is continuous. In this case the recursion can be initialised using

p(x_1|y_1) = p(x_1, y_1) / p(y_1) = p(x_1) p(y_1|x_1) / p(y_1) ∝ p(x_1) p(y_1|x_1).


Then for t ∈ {2, 3, . . .} we can similarly calculate the filtering distribution

p(x_t|y_{1:t}) = p(y_t|x_t) p(x_t|y_{1:t-1}) / p(y_t|y_{1:t-1})
             ∝ p(y_t|x_t) ∫ p(x_t|x_{t-1}) p(x_{t-1}|y_{1:t-1}) dx_{t-1}. (3.1)

Finally the likelihood contribution can be calculated as

p(y_t|y_{1:t-1}) = ∫ p(y_t|x_t) p(x_t|y_{1:t-1}) dx_t
               = ∫ ( p(y_t|x_t) ∫ p(x_t|x_{t-1}) p(x_{t-1}|y_{1:t-1}) dx_{t-1} ) dx_t. (3.2)

In the case that the models for X_t and Y_t are linear and Gaussian, the solution of (3.1) is the widely used Kalman filter. When this is not the case, Monte Carlo methods can be used to solve these recursions approximately. One such method is known as sequential importance sampling, which we outline in the next section.

3.2 Sequential importance sampling

The basic idea of sequential importance sampling (SIS) is to iteratively use importance sampling to approximately solve the recursions given in Section 3.1. If the reader is unfamiliar with importance sampling, we refer them to Pollock (2010), which gives a review of the concept. Suppose at time t − 1 we have a weighted particle approximation {x_{t-1}^(i), ω_{t-1}^(i)}_{i=1}^N to the filtering distribution p(x_{t-1}|y_{1:t-1}). Then we can approximate the filtering distribution (3.1) at time t using (Nemeth, 2014)

p̂(x_t|y_{1:t}) ∝ ∑_{i=1}^N ω_{t-1}^(i) p(y_t|x_t) p(x_t|x_{t-1}^(i)). (3.3)

Using this idea, by choosing an importance distribution q(·) of the form (Doucet et al., 2000)

q(x_{1:t}|y_{1:t}) = q(x_1|y_1) ∏_{τ=2}^t q(x_τ|x_{1:τ-1}, y_{1:τ})
                  = q(x_t|x_{1:t-1}, y_{1:t}) q(x_{1:t-1}|y_{1:t-1}), (3.4)

we can perform importance sampling recursively, so that at time t we sample x_t^(i) ∼ q(x_t|x_{1:t-1}^(i), y_{1:t}), where x_{1:t-1}^(i) is the path of the ith particle simulated at the previous time steps. This allows us, given the initial state of the process x_1, to produce a weighted particle approximation sequentially at each time step t. Using (3.1) and (3.4) we can obtain a recursive expression for the ith importance weight ω_t^(i):

ω_t^(i) ∝ [p(x_{1:t-1}^(i)|y_{1:t-1}) / q(x_{1:t-1}^(i)|y_{1:t-1})] × [p(y_t|x_t^(i)) p(x_t^(i)|x_{t-1}^(i)) / q(x_t^(i)|x_{1:t-1}^(i), y_{1:t})]

       ∝ ω_{t-1}^(i) p(y_t|x_t^(i)) p(x_t^(i)|x_{t-1}^(i)) / q(x_t^(i)|x_{1:t-1}^(i), y_{1:t}).


This enables us to write down the SIS filter algorithm as follows:

Algorithm 3: SIS filter

Initialise;
for i ∈ {1, . . . , N} do
    Sample x_1^(i) from the prior, x_1^(i) ∼ p(x_1).
end
Recursion;
for t ∈ {2, . . . , T} do
    for i ∈ {1, . . . , N} do
        1. Sample x_t^(i) ∼ q(x_t|x_{1:t-1}^(i), y_{1:t}).
        2. Evaluate the importance weights
           ω_t^(i) ∝ ω_{t-1}^(i) p(y_t|x_t^(i)) p(x_t^(i)|x_{t-1}^(i)) / q(x_t^(i)|x_{1:t-1}^(i), y_{1:t}).
    end
    Normalise the importance weights so that they sum to 1.
end

Using the SIS filter, we can obtain approximations to (3.1) and (3.2) as follows:

p̂(x_t|y_{1:t}) = ∑_{i=1}^N ω_t^(i) δ(x_t − x_t^(i)), (3.5)

p̂(y_t|y_{1:t-1}) = (1/N) ∑_{i=1}^N ω_t^(i), (3.6)

where δ is the Dirac delta function.
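Algorithm 3 can be sketched for the toy linear-Gaussian HMM introduced earlier. This illustration is ours, not from the report: the model, parameter values and the use of the transition density as proposal q are all assumed. The final effective sample size is returned as a diagnostic; its collapse anticipates the degeneracy discussed in Section 3.3.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical linear-Gaussian HMM:
# x_t | x_{t-1} ~ N(PHI x_{t-1}, TAU^2),  y_t | x_t ~ N(x_t, SIGMA^2).
PHI, TAU, SIGMA = 0.9, 1.0, 0.5

def sis_filter(y, N=500):
    """Algorithm 3 (SIS) with the transition density as proposal, so the
    incremental weight is p(y_t|x_t). No resampling is performed, so the
    weights degenerate as t grows."""
    T = len(y)
    x = rng.normal(0.0, TAU / np.sqrt(1 - PHI**2), size=N)  # x_1^(i) ~ p(x_1)
    logw = np.zeros(N)
    means = np.empty(T)
    for t in range(T):
        if t > 0:
            x = rng.normal(PHI * x, TAU)                    # propagate particles
        logw += -0.5 * ((y[t] - x) / SIGMA) ** 2            # multiply in p(y_t|x_t)
        w = np.exp(logw - logw.max())
        w /= w.sum()                                        # normalised weights
        means[t] = np.sum(w * x)                            # estimate of E(X_t|y_{1:t})
    return means, 1.0 / np.sum(w ** 2)                      # final effective sample size

# Simulate T = 100 observations from the model, then filter.
T = 100
x_true = np.empty(T)
x_true[0] = rng.normal(0.0, TAU / np.sqrt(1 - PHI**2))
for t in range(1, T):
    x_true[t] = PHI * x_true[t - 1] + rng.normal(0.0, TAU)
y = x_true + rng.normal(0.0, SIGMA, size=T)

means, ess = sis_filter(y)
```

After 100 time steps the effective sample size is a small fraction of N = 500, which is exactly the degeneracy problem addressed next.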

3.3 Particle degeneracy and the SIR filter

A serious problem with the SIS filter is that it can be shown to be degenerate (Doucet et al., 2000): the variance of the weights increases at every time step, so that eventually a single particle carries a normalised weight of one while the rest are zero. Clearly this is not very useful. One solution to the degeneracy problem is to introduce a resampling step into the SIS algorithm. Suppose we have a set of weighted particles {x_t^(i), ω_t^(i)}_{i=1}^N that approximate (3.1). A resampling step draws N samples from this weighted particle approximation; the old sample set is then replaced with the new one, with all the weights set to 1/N. The intuition behind this is that it eliminates trajectories with small normalised importance weights and focuses on trajectories with large weights. A standard way to do this is to resample x_t^(i) with probability equal to the normalised importance weight ω_t^(i). This is known as multinomial resampling.

While resampling decreases degeneracy, it also increases Monte Carlo variability. This variability can be reduced by using alternative resampling algorithms, such as the stratified sampling introduced by Carpenter et al. (1999). For a comparison of different resampling algorithms in sequential Monte Carlo the reader is referred to Douc and Cappe (2005).


The benefit gained by resampling depends on the variability of the current weights (Fearnhead, 2012). A common approach is therefore to monitor the effective sample size (N_eff) and to resample when it drops below a threshold (N_thresh). While N_eff cannot be calculated exactly, it can be estimated from the normalised weights using

N̂_eff = 1 / ∑_{i=1}^N (ω_t^(i))².

Sequential importance sampling methods which include resampling steps are commonly referred to as sequential importance resampling (SIR) algorithms. We can outline a common SIR procedure as follows:

Algorithm 4: SIR filter

Initialise;
for i ∈ {1, . . . , N} do
    Sample x_1^(i) from the prior, x_1^(i) ∼ p(x_1).
end
Recursion;
for t ∈ {2, . . . , T} do
    for i ∈ {1, . . . , N} do
        1. Sample x_t^(i) ∼ q(x_t|x_{1:t-1}^(i), y_{1:t}).
        2. Evaluate and normalise the importance weights
           ω_t^(i) ∝ ω_{t-1}^(i) p(y_t|x_t^(i)) p(x_t^(i)|x_{t-1}^(i)) / q(x_t^(i)|x_{1:t-1}^(i), y_{1:t}).
    end
    if N_eff < N_thresh then
        Draw N samples from {x_t^(i), ω_t^(i)}_{i=1}^N using some resampling algorithm.
        Set ω_t^(i) = 1/N and replace {x_t^(i), ω_t^(i)}_{i=1}^N with this sample.
    end
end

Note that the approximations (3.5) and (3.6) still hold only for certain resampling algorithms; multinomial and stratified resampling are acceptable.
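Algorithm 4 with ESS-triggered multinomial resampling can be sketched as follows. As before, the linear-Gaussian model and all parameter values are hypothetical illustrations, not taken from the report.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical linear-Gaussian HMM, as in the earlier sketches:
# x_t | x_{t-1} ~ N(PHI x_{t-1}, TAU^2),  y_t | x_t ~ N(x_t, SIGMA^2).
PHI, TAU, SIGMA = 0.9, 1.0, 0.5

def sir_filter(y, N=500, frac=0.5):
    """Algorithm 4: SIS plus multinomial resampling whenever the estimated
    effective sample size N_eff = 1 / sum(w_i^2) drops below frac * N."""
    T = len(y)
    x = rng.normal(0.0, TAU / np.sqrt(1 - PHI**2), size=N)
    logw = np.zeros(N)
    means = np.empty(T)
    for t in range(T):
        if t > 0:
            x = rng.normal(PHI * x, TAU)           # propagate via the transition
        logw += -0.5 * ((y[t] - x) / SIGMA) ** 2   # incremental weight p(y_t|x_t)
        w = np.exp(logw - logw.max())
        w /= w.sum()
        means[t] = np.sum(w * x)
        if 1.0 / np.sum(w ** 2) < frac * N:        # N_eff below the threshold
            idx = rng.choice(N, size=N, p=w)       # multinomial resampling
            x, logw = x[idx], np.zeros(N)          # equal weights 1/N afterwards
    return means

# Data from the model; the filtering means should track the hidden state.
T = 100
x_true = np.empty(T)
x_true[0] = rng.normal(0.0, TAU / np.sqrt(1 - PHI**2))
for t in range(1, T):
    x_true[t] = PHI * x_true[t - 1] + rng.normal(0.0, TAU)
y = x_true + rng.normal(0.0, SIGMA, size=T)

means = sir_filter(y)
rmse = float(np.sqrt(np.mean((means - x_true) ** 2)))
```

Unlike the plain SIS filter, the filtering-mean estimates remain accurate across the whole series because resampling keeps the particle set from collapsing.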

3.4 Choice of proposal

Clearly an important question with regard to SIS and SIR filters is the choice of proposal distribution q(·). A natural choice would be to set

q(x_t|x_{1:t-1}^(i), y_{1:t}) = p(x_t|x_{t-1}^(i)),

which reduces the importance weights to ω_t^(i) ∝ ω_{t-1}^(i) p(y_t|x_t^(i)). If we perform resampling at each step then this method is known as the bootstrap filter (Gordon et al., 1993). Note that if we are resampling at each step we have ω_{t-1}^(i) = 1/N for all i, which allows us to set

ω_t^(i) ∝ p(y_t|x_t^(i)).


Clearly in this case the importance weights take a particularly convenient form, meaning the algorithm is particularly easy to implement. That being said, this proposal ignores the fact that we know the observations, meaning that the method is often inefficient. The method is also especially sensitive to outliers (Doucet et al., 2000).

In order to limit degeneracy of the algorithm, we wish to select a proposal distribution that minimises the variance of the unnormalised importance weight ω_t^(i) conditional on x_{1:t-1}^(i) and y_{1:t}. The proposal distribution with this property is given by (Doucet et al., 2000)

q(x_t|x_{1:t-1}^(i), y_{1:t}) = p(x_t|x_{t-1}^(i), y_t). (3.7)

Using Bayes' theorem and the independence properties of a HMM, we obtain the identity

p(y_t|x_{t-1}^(i)) = p(y_t|x_t) p(x_t|x_{t-1}^(i)) / p(x_t|x_{t-1}^(i), y_t),

from which it follows that

p(x_t|x_{t-1}^(i), y_t) = p(y_t|x_t) p(x_t|x_{t-1}^(i)) / p(y_t|x_{t-1}^(i))
                      = p(y_t|x_t) p(x_t|x_{t-1}^(i)) / ∫ p(y_t|x_t) p(x_t|x_{t-1}^(i)) dx_t. (3.8)

The importance weights then reduce to ω_t^(i) ∝ ω_{t-1}^(i) p(y_t|x_{t-1}^(i)). Nothing comes for free, however: the use of this proposal requires the ability to sample from p(x_t|x_{t-1}^(i), y_t) and to evaluate

p(y_t|x_{t-1}^(i)) = ∫ p(y_t|x_t) p(x_t|x_{t-1}^(i)) dx_t

up to a constant of proportionality, which is not possible in general.
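One case where the optimal proposal is tractable is worth recording. For a hypothetical linear-Gaussian model with x_t|x_{t-1} ~ N(φ x_{t-1}, τ²) and y_t|x_t ~ N(x_t, σ²) (our own illustrative choice, not from the report), completing the square in (3.8) gives a Gaussian optimal proposal and a Gaussian predictive density p(y_t|x_{t-1}):

```python
import numpy as np

def optimal_proposal(x_prev, y_t, phi=0.9, tau=1.0, sigma=0.5):
    """For x_t|x_{t-1} ~ N(phi x_{t-1}, tau^2) and y_t|x_t ~ N(x_t, sigma^2),
    the optimal proposal (3.7) is Gaussian:
        p(x_t | x_{t-1}, y_t) = N(m, v),
        v = (1/tau^2 + 1/sigma^2)^(-1),
        m = v * (phi x_{t-1} / tau^2 + y_t / sigma^2),
    and the weight factor is p(y_t|x_{t-1}) = N(y_t; phi x_{t-1}, tau^2 + sigma^2).
    Returns (m, v, log p(y_t|x_{t-1}))."""
    v = 1.0 / (1.0 / tau**2 + 1.0 / sigma**2)
    m = v * (phi * x_prev / tau**2 + y_t / sigma**2)
    log_pred = (-0.5 * np.log(2 * np.pi * (tau**2 + sigma**2))
                - 0.5 * (y_t - phi * x_prev) ** 2 / (tau**2 + sigma**2))
    return m, v, log_pred

m, v, log_pred = optimal_proposal(x_prev=0.7, y_t=1.2)
```

Here both requirements of the optimal proposal are met exactly: we can sample from the Gaussian N(m, v) and evaluate p(y_t|x_{t-1}) in closed form, so the filter weights are ω_t^(i) ∝ ω_{t-1}^(i) p(y_t|x_{t-1}^(i)).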

3.5 Auxiliary particle filter

Ideally we would sample from the optimal proposal (3.7). However, as outlined in Section 3.4, this is often impossible. Pitt and Shephard (1999) introduced the auxiliary particle filter in order to approximate the optimal proposal when it cannot be used directly.

Notice that we can rewrite the particle approximation to the filtered density (3.3) as follows (Nemeth, 2014):

p̂(x_t|y_{1:t}) ∝ ∑_{i=1}^N ω_{t-1}^(i) p(y_t|x_t) p(x_t|x_{t-1}^(i))

             = ∑_{i=1}^N ω_{t-1}^(i) [ p(y_t|x_t) p(x_t|x_{t-1}^(i)) / ∫ p(y_t|x_t) p(x_t|x_{t-1}^(i)) dx_t ] ∫ p(y_t|x_t) p(x_t|x_{t-1}^(i)) dx_t.

Using (3.8) allows us to write (3.3) as

p̂(x_t|y_{1:t}) = ∑_{i=1}^N λ_t^(i) p(x_t|x_{t-1}^(i), y_t), (3.9)

where

λ_t^(i) ∝ ω_{t-1}^(i) ∫ p(y_t|x_t) p(x_t|x_{t-1}^(i)) dx_t. (3.10)


This can be seen as approximating the filtering distribution by a mixture of the densities p(x_t|x_{t-1}^(i), y_t). Introducing an auxiliary variable k representing the index of the mixture component, the joint distribution of x_t and k can be written

p(x_t, k|y_{1:t}) = λ_t^(k) p(x_t|x_{t-1}^(k), y_t). (3.11)

Often we cannot sample exactly from (3.11), since we are unable to sample from p(x_t|x_{t-1}^(k), y_t), or λ_t^(k) is intractable. However at time t, given a weighted particle approximation {x_{t-1}^(i), ω_{t-1}^(i)}_{i=1}^N from the previous time step, we can approximate the joint density (3.11) as follows. First we sample an index k_i ∈ {1, . . . , N} with probability β_t^(k_i), where each β_t^(k_i) is a probability that approximates λ_t^(k_i). Next we sample x_t^(i) from a suitably chosen proposal q(x_t|x_{t-1}^(k_i), y_t) which approximates p(x_t|x_{t-1}^(k_i), y_t). Each sample {x_t^(i), k_i} then has associated importance weight

ω_t^(i) ∝ λ_t^(k_i) p(x_t^(i)|x_{t-1}^(k_i), y_t) / [β_t^(k_i) q(x_t^(i)|x_{t-1}^(k_i), y_t)]
       = ω_{t-1}^(k_i) p(y_t|x_t^(i)) p(x_t^(i)|x_{t-1}^(k_i)) / [β_t^(k_i) q(x_t^(i)|x_{t-1}^(k_i), y_t)].

Dropping the indices k_i, we now have a particle approximation {x_t^(i), ω_t^(i)}_{i=1}^N to the filtering density approximation (3.9). A nice property of the auxiliary filter is that the resampling step is built into the filter itself. We can summarise the auxiliary particle filter as follows:

Algorithm 5: Auxiliary particle filter

Initialise;
for i ∈ {1, . . . , N} do
    Sample x_1^{(i)} from the prior p(x_1), set weights ω_1^{(i)} ∝ p(y_1 | x_1^{(i)}).
end
Recursion;
for t ∈ {2, . . . , T} do
    1. Calculate β_t^{(i)} using an approximation to (3.10).
    2. Sample indices {k_1, . . . , k_N} from {1, . . . , N} with probabilities β_t^{(i)}.
    3. Sample x_t^{(i)} ~ q(x_t | x_{t-1}^{(k_i)}, y_t).
    4. Evaluate and normalise the importance weights

        ω_t^{(i)} ∝ [ω_{t-1}^{(k_i)} p(y_t | x_t^{(i)}) p(x_t^{(i)} | x_{t-1}^{(k_i)})] / [β_t^{(k_i)} q(x_t^{(i)} | x_{t-1}^{(k_i)}, y_t)].
end

In the case that p(y_t | x_t) is a log-concave function, i.e.

p(s y_t + (1 − s) y'_t | x_t) ≥ p(y_t | x_t)^s p(y'_t | x_t)^{1−s}

for all y_t, y'_t in the domain of p(· | x_t) and for all s ∈ (0, 1), the proposal can be obtained by approximating the optimal density p(x_t | x_{t-1}^{(i)}, y_t) using a Taylor expansion. Details of this approximation can be found in Pitt and Shephard (1999). For more general models, where the optimal density is difficult to approximate, it is suggested to set q(x_t | x_{t-1}^{(i)}, y_t) = p(x_t | x_{t-1}^{(i)}).


Weights are then set to β_t^{(i)} ∝ ω_{t-1}^{(i)} p(y_t | μ_t^{(i)}), where μ_t^{(i)} is some summary measure of p(x_t | x_{t-1}^{(i)}) (Nemeth, 2014). The weights ω_t^{(i)} then reduce to

ω_t^{(i)} ∝ p(y_t | x_t^{(i)}) / p(y_t | μ_t^{(i)}).

While this procedure may seem somewhat similar to the SIR filter, since the weights β_t^{(i)} take into account the observation y_t, the sampled particles x_t^{(i)} should give a better representation of the true state X_t (Johansen and Doucet, 2008). Note that, provided a relevant resampling algorithm is used, such as multinomial or stratified, the approximations (3.5) and (3.6) still hold for auxiliary particle filters.
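The auxiliary particle filter with this fallback choice (proposal equal to the transition density and μ_t^{(i)} the transition mean) can be sketched in a few lines. The linear-Gaussian toy model below is an assumption made purely for illustration; it is not from the report.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy HMM, assumed for illustration only:
# x_t | x_{t-1} ~ N(phi x_{t-1}, sigma^2),  y_t | x_t ~ N(x_t, tau^2)
phi, sigma, tau = 0.9, 0.5, 1.0

def auxiliary_particle_filter(y, N=500):
    T = len(y)
    # Initialise: sample from the prior, weight by the first likelihood
    x = rng.normal(0.0, sigma / np.sqrt(1 - phi**2), N)
    w = norm.pdf(y[0], loc=x, scale=tau)
    w /= w.sum()
    for t in range(1, T):
        mu = phi * x                       # summary measure of p(x_t | x_{t-1})
        beta = w * norm.pdf(y[t], loc=mu, scale=tau)  # beta approximates lambda
        beta /= beta.sum()
        k = rng.choice(N, size=N, p=beta)  # resample indices with probability beta
        x = rng.normal(phi * x[k], sigma)  # propagate with q = transition density
        # For this choice the weights reduce to p(y_t | x_t) / p(y_t | mu_t)
        w = norm.pdf(y[t], loc=x, scale=tau) / norm.pdf(y[t], loc=mu[k], scale=tau)
        w /= w.sum()
    return x, w

y = rng.normal(size=50)  # dummy observations
x, w = auxiliary_particle_filter(y)
print(np.sum(w * x))     # weighted estimate of the final filtered mean
```

Because the resampling (step 2 of Algorithm 5) happens before propagation, no separate resampling pass is needed, which is the built-in resampling property noted above.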

3.6 Path particle filters

Often we are interested in simulating from the path of the process, rather than each hidden state separately. The SIR and auxiliary particle filters can be implemented in a very similar way when this is the case. The key difference is that, due to resampling, we have to keep track of the index i of the particle x_{t-1}^{(i)} that was used to generate each particle x_t^{(k)}, k ∈ {1, . . . , N}, at the next time period.

Suppose we wish to simulate from x_{1:t} given weighted particles {x_{1:t-1}^{(i)}, ω_{t-1}^{(i)}}_{i=1}^N. During the resampling stage, rather than sampling x_{1:t-1}^{(i)} directly we sample the index i. So at time t, we would sample the value A_{t-1}^{(i)} ∈ {1, . . . , N} using one of the resampling algorithms outlined in Section 3.3. We then simulate

x_t^{(i)} ~ q(x_t | x_{1:t-1}^{(A_{t-1}^{(i)})}, y_{1:t})

and set

x_{1:t}^{(i)} = (x_{1:t-1}^{(A_{t-1}^{(i)})}, x_t^{(i)}).

Finally, importance weights can be updated by

ω_t^{(i)} ∝ ω_{t-1}^{(i)} [p(y_t | x_t^{(i)}) p(x_t^{(i)} | x_{t-1}^{(A_{t-1}^{(i)})})] / q(x_t^{(i)} | x_{1:t-1}^{(A_{t-1}^{(i)})}, y_{1:t}).

In the case of the auxiliary particle filter outlined in Section 3.5, storing information about the path of a hidden state is very similar. The update of a path is simply given by

x_{1:t}^{(i)} = (x_{1:t-1}^{(k_i)}, x_t^{(i)}).

Therefore, rather than dispensing with the index k_i, at each time step t we simply set A_{t-1}^{(i)} := k_i.

Note that while we could resample the whole path x1:t−1 itself at each iteration, doing so isconsiderably more computationally intensive than sampling just the index.
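A minimal sketch of this bookkeeping, again on an assumed linear-Gaussian toy model, stores only particle values and ancestor indices A_{t-1}^{(i)}, then recovers a full path by tracing the indices backwards:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
phi, sigma, tau = 0.9, 0.5, 1.0  # toy linear-Gaussian model (an assumption)

def bootstrap_path_filter(y, N=300):
    T = len(y)
    particles = np.empty((T, N))             # all particle values
    ancestors = np.empty((T, N), dtype=int)  # ancestor indices A_{t-1}^{(i)}
    particles[0] = rng.normal(0.0, sigma / np.sqrt(1 - phi**2), N)
    w = norm.pdf(y[0], loc=particles[0], scale=tau)
    w /= w.sum()
    for t in range(1, T):
        # Resample the index, not the whole path: much cheaper than
        # copying x_{1:t-1} for every particle at every step
        a = rng.choice(N, size=N, p=w)
        ancestors[t] = a
        particles[t] = rng.normal(phi * particles[t - 1, a], sigma)
        w = norm.pdf(y[t], loc=particles[t], scale=tau)
        w /= w.sum()
    # Recover one full path by tracing the stored ancestor indices
    k = rng.choice(N, p=w)
    path = np.empty(T)
    for t in range(T - 1, -1, -1):
        path[t] = particles[t, k]
        if t > 0:
            k = ancestors[t, k]
    return path

y = rng.normal(size=40)  # dummy observations
path = bootstrap_path_filter(y)
print(path.shape)  # (40,)
```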


4 Particle Markov chain Monte Carlo

In Section 3 we assumed that any model parameters were known. This is rarely the case, however, and parameter estimation for HMMs is an important problem across a number of different applications. In this section we discuss particle MCMC, which enables us to simulate from both the hidden state of a HMM and any unknown parameters. Like the pseudo-marginal algorithm discussed in Section 2.2, particle MCMC is exact in that it targets the true underlying posterior distribution. In fact there are strong links between pseudo-marginal MCMC and particle MCMC: certain particle MCMC methods can be seen simply as a special case of pseudo-marginal MCMC. In the next section, a particular particle MCMC method is demonstrated by using it to analyse data on the annual number of earthquakes worldwide with a magnitude of 7 or more on the Richter scale.

4.1 Particle independent Metropolis-Hastings sampler

In this section we focus on estimating just the unknown parameters θ of a HMM, given observations y_{1:T}. This approach can then be extended to build MCMC samplers for the joint distribution of the hidden state x_{1:T} of a HMM and any unknown parameters θ. Note that in Section 3, dependence on θ was omitted for simplicity and notational brevity. However, all the results introduced still hold when dependence on θ is introduced.

Recall that in general the marginal likelihood of a HMM, p(y_{1:T} | θ), is intractable. This leads to a situation similar to Section 2.1, which we used as motivation for pseudo-marginal MCMC. It would be natural therefore to try to use a pseudo-marginal style sampler to obtain an MCMC algorithm which samples from the unknown parameter vector θ. In order to do so, however, we need to find an unbiased estimate p̂_θ(y_{1:T}) of the marginal likelihood. Let us consider the following estimator of the marginal likelihood, obtained using sequential Monte Carlo,

p̂_θ(y_{1:T}) = p̂_θ(y_1) ∏_{t=2}^T p̂_θ(y_t | y_{1:t-1}),

where p̂_θ(y_t | y_{1:t-1}) is defined as in (3.6). Interestingly, this estimate is unbiased for quite a large class of sequential Monte Carlo algorithms (Del Moral, 2004). This allows us to immediately apply the pseudo-marginal result given in Section 2.2 to write down a simple MH sampler that simulates from an unknown parameter vector θ given a HMM (Andrieu et al., 2010):

Algorithm 6: Particle independent Metropolis-Hastings

Given current state θ and estimate p̂_θ(y_{1:T}) of the marginal likelihood p(y_{1:T} | θ):
1. Propose a new value θ′ from some distribution q(θ′ | θ).
2. Run a particle filter to obtain an unbiased approximation p̂_{θ′}(y_{1:T}) to the new marginal likelihood.
3. Accept the proposal θ′ and p̂_{θ′}(y_{1:T}) with probability

1 ∧ [p(θ′) p̂_{θ′}(y_{1:T}) q(θ | θ′)] / [p(θ) p̂_θ(y_{1:T}) q(θ′ | θ)].
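The estimate p̂_θ(y_{1:T}) required by Algorithm 6 can be produced by a bootstrap filter: each factor p̂_θ(y_t | y_{1:t-1}) is the mean of the unnormalised weights. The sketch below assumes the same illustrative linear-Gaussian toy model used earlier and works on the log scale to avoid numerical underflow.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
tau = 1.0  # observation noise of the assumed toy model

def log_marginal_likelihood(y, theta, N=500):
    """Bootstrap filter estimate of log p_theta(y_{1:T}). Each factor
    p_theta(y_t | y_{1:t-1}) is estimated by the mean of the unnormalised
    weights, as in (3.6); the product is an unbiased estimate of the
    likelihood itself (Del Moral, 2004). Toy linear-Gaussian model assumed."""
    phi, sigma = theta
    x = rng.normal(0.0, sigma / np.sqrt(1 - phi**2), N)
    log_p = 0.0
    for t in range(len(y)):
        if t > 0:
            a = rng.choice(N, size=N, p=w / w.sum())  # multinomial resampling
            x = rng.normal(phi * x[a], sigma)         # propagate
        w = norm.pdf(y[t], loc=x, scale=tau)          # unnormalised weights
        log_p += np.log(w.mean())
    return log_p

y = rng.normal(size=30)  # dummy observations
ll = log_marginal_likelihood(y, (0.9, 0.5))
print(ll)
```

Note that while the likelihood estimate is unbiased, its logarithm is not; the sampler only requires unbiasedness of the likelihood itself.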

Using results from pseudo-marginal MCMC we can conclude that this sampler targets thetrue posterior distribution p(θ|y1:T ). A similar sampler can be built in order to sample from thesmoothing distribution p(x1:T |y1:T ). Algorithm 6 can be seen as a special case of a much more


powerful result which allows us to build a sampler for the joint distribution of (θ, x_{1:T}). We outline this result in the next section.

4.2 Particle marginal Metropolis-Hastings

Often when we are studying a HMM, we are not just interested in simulating from the hidden state, or in estimating unknown parameters, but we require both. This can be performed using sequential Monte Carlo methods; however, there are often problems with degeneracy. The particle marginal Metropolis-Hastings (PMMH) algorithm provides an alternative way to sample from the full joint posterior distribution p(θ, x_{1:T} | y_{1:T}) using MCMC.

The PMMH algorithm offers a number of advantages over alternative methods. Estimation of the hidden state X_t uses all the observations, rather than just those up to that time point, y_{1:t}. Therefore the PMMH sampler also solves the smoothing problem (Wilkinson, 2011), mentioned in Section 3.1, which is generally more computationally challenging than the filtering problem. A further advantage of using PMMH is that the sampler is guaranteed to converge to the true joint distribution p(θ, x_{1:T} | y_{1:T}). However, PMMH is computationally expensive, running a particle filter up to the final time period at each iteration. Therefore the sampler can only be used for offline analysis, rather than online. In cases where online parameter estimation in HMMs is required, parameter estimation within SMC is recommended, and readers are referred to Kantas et al. (2014).

Suppose we have a process y, which depends on an unobserved latent process x and an unknown parameter vector θ. Then we can write down the joint posterior p(θ, x | y) as

p(θ, x|y) ∝ p(θ)p(x|θ)p(y|x, θ). (4.1)

Often we cannot evaluate p(x | θ). However, for many important problems we can still simulate from it, even if we cannot evaluate it. In this case, we can build a "likelihood free" MH sampler which targets the posterior (4.1). Given a current state (θ, x), we propose a new parameter θ′ from some proposal q(θ′ | θ); we then simulate x′ from p(x′ | θ′). This new state (θ′, x′) is then accepted with probability

α = 1 ∧ [p(θ′) p(x′ | θ′) p(y | x′, θ′) q(θ | θ′) p(x | θ)] / [p(θ) p(x | θ) p(y | x, θ) q(θ′ | θ) p(x′ | θ′)] = 1 ∧ [p(θ′) p(y | x′, θ′) q(θ | θ′)] / [p(θ) p(y | x, θ) q(θ′ | θ)].

However, using p(x′ | θ′) as the proposal for x′ suffers from the drawback that, while x′ will be consistent with θ′, it will not necessarily be consistent with the data y. Therefore the scheme will generally work well only when y is of low dimensionality (Wilkinson, 2011). Ideally the proposal for x′ given θ′ would be p(x′ | y, θ′). The acceptance probability α would then be given by

α = 1 ∧ [p(θ′) p(x′ | θ′) p(y | x′, θ′) q(θ | θ′) p(x | y, θ)] / [p(θ) p(x | θ) p(y | x, θ) q(θ′ | θ) p(x′ | y, θ′)] = 1 ∧ [p(θ′) q(θ | θ′) p(y | θ′)] / [p(θ) q(θ′ | θ) p(y | θ)], (4.2)


where we have used that

p(x | y, θ) = p(y | x, θ) p(x | θ) / p(y | θ).

We can see therefore that the acceptance probability (4.2) has reduced to that of a standard MH sampler for p(y | θ) (Wilkinson, 2011).

For a HMM updated using this ideal joint scheme, we would therefore have the acceptance probability

α = 1 ∧ [p(θ′) q(θ | θ′) p(y_{1:T} | θ′)] / [p(θ) q(θ′ | θ) p(y_{1:T} | θ)]. (4.3)

However, p(y_{1:T} | θ) is rarely analytically available. The particle marginal Metropolis-Hastings algorithm therefore aims to approximate this ideal joint update by approximating the marginal likelihood p(y_{1:T} | θ) using a path particle filter, as described in Section 3.6. The full procedure for the particle marginal Metropolis-Hastings can therefore be summarised as follows (Andrieu et al., 2010):

Algorithm 7: Particle marginal Metropolis-Hastings

Given current state (θ, x_{1:T}) and estimate p̂_θ(y_{1:T}) of the marginal likelihood p(y_{1:T} | θ):
1. Propose a new value θ′ from some distribution q(θ′ | θ).
2. Run a path particle filter to obtain an unbiased approximation p̂_{θ′}(y_{1:T}) to the new marginal likelihood and to sample a path x′_{1:T} for the hidden process.
3. Accept the proposal θ′, x′_{1:T} and p̂_{θ′}(y_{1:T}) with probability

1 ∧ [p(θ′) p̂_{θ′}(y_{1:T}) q(θ | θ′)] / [p(θ) p̂_θ(y_{1:T}) q(θ′ | θ)].
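Algorithm 7 reduces to a short loop once a particle filter returning the log of an unbiased likelihood estimate is available. The sketch below assumes flat priors and a symmetric random walk proposal, so those terms cancel in the acceptance ratio; `log_lik_hat` is a hypothetical stand-in for a (path) particle filter, and storing the sampled path is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(4)

def pmmh(theta0, log_lik_hat, n_iters=1000, step=0.1):
    """Random walk PMMH with flat priors and a symmetric proposal.
    log_lik_hat(theta) should return the log of an unbiased particle
    filter estimate of p_theta(y_{1:T})."""
    theta = np.asarray(theta0, dtype=float)
    ll = log_lik_hat(theta)
    chain = [theta.copy()]
    for _ in range(n_iters):
        theta_prop = theta + step * rng.normal(size=theta.size)
        ll_prop = log_lik_hat(theta_prop)
        # Crucially, the current estimate ll is recycled, never recomputed:
        # this is what makes the sampler exact on the extended space
        if np.log(rng.uniform()) < ll_prop - ll:
            theta, ll = theta_prop, ll_prop
        chain.append(theta.copy())
    return np.array(chain)

# Noisy toy log-likelihood standing in for a particle filter estimate
toy = lambda th: -0.5 * np.sum((th - 1.0) ** 2) + 0.1 * rng.normal()
chain = pmmh([0.0, 0.0], toy, n_iters=200, step=0.5)
print(chain.shape)  # (201, 2)
```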

Once again, Andrieu et al. (2010) show that the PMMH algorithm targets the true posteriorp(θ, x1:T |y1:T ). Under mild conditions, the acceptance probability of the PMMH sampler can beseen to converge to (4.3) as the number of particles in the particle filter tends towards infinity.

Suppose we are using a simple bootstrap particle filter in order to estimate the marginal likelihood of the HMM, and to sample a path from the hidden state x_{1:T}. Then we can write down the joint density of the sampled path k as follows (Wilkinson, 2011)

g(x_{1:T}^{(k)}) = ω_T^{(A_T^{(k)})} ∏_{i=1}^N p(x_1^{(i)}) ∏_{t=1}^{T−1} ∏_{i=1}^N ω_t^{(A_t^{(i)})} p(x_{t+1}^{(i)} | x_t^{(A_t^{(i)})}),

where the notation is as defined in Section 3.6. Let us rewrite the acceptance probability to include all the random variables used by the particle filter to generate a random path and estimate the marginal likelihood, as follows

1 ∧ [p(θ′) p̂_{θ′}(y_{1:T}) q(θ | θ′) g(x_{1:T})] / [p(θ) p̂_θ(y_{1:T}) q(θ′ | θ) g(x′_{1:T})] · g(x′_{1:T}) / g(x_{1:T}).

Then we can see, as in the pseudo-marginal case, that particle marginal MCMC with a bootstrap particle filter defines an exact MH sampler over an extended space. The target for this exact MH sampler is given by

p(θ) p̂_θ(y_{1:T}) g(x_{1:T}).

It can be shown that this target marginalises down to the correct posterior.


Clearly an important decision when implementing PMMH is the number of particles to use in the particle filter. As in the pseudo-marginal MH sampler, the mixing of the PMMH algorithm depends considerably on the variance of the estimate p̂_θ(y_{1:T}). The PMMH sampler, like pseudo-marginal MCMC, can exhibit 'stickiness'. As PMMH can be seen as a special case of pseudo-marginal MCMC, the recommendations made for pseudo-marginal MCMC in Section 2.2 largely hold here.

Care has to be taken when choosing the particle filter used to provide the estimate p̂_θ(y_{1:T}), as it is important that the estimate is unbiased. More details of the assumptions made in the proofs of the properties of the PMMH algorithm can be found in Andrieu et al. (2010). In particular they note that the multinomial, stratified and residual resampling schemes can be used. In the case of the auxiliary filter, outlined in Section 3.5, Pitt et al. (2012) provide a likelihood estimate and prove its unbiasedness. They show that using the auxiliary filter with this estimate, rather than a standard bootstrap filter, can significantly improve the efficiency of the PMMH algorithm, due to a decreased variance in the estimate of the likelihood.

4.3 Particle Gibbs sampler

An alternative to using PMMH to sample from the joint posterior p(θ, x_{1:T} | y_{1:T}) is to use a Gibbs sampler that iterates between the following:

1. Sample θ ~ p(θ | x_{1:T}, y_{1:T}).
2. Sample x_{1:T} ~ p(x_{1:T} | θ, y_{1:T}).

The advantage of the Gibbs sampler over the PMMH scheme is that it avoids having to choose a proposal distribution for θ. However, while p(θ | x_{1:T}, y_{1:T}) is often possible to simulate from, we are rarely able to sample from p(x_{1:T} | y_{1:T}, θ). While it would be natural to try to sample x_{1:T} from an SMC approximation p̂_θ(x_{1:T} | y_{1:T}), doing so means that the Gibbs update scheme no longer targets the correct posterior distribution. By introducing a conditional SMC update, Andrieu et al. (2010) show how we can use particle filters to sample approximately from p(x_{1:T} | θ, y_{1:T}) while the Gibbs sampler still samples from the desired posterior p(θ, x_{1:T} | y_{1:T}). The idea of a conditional SMC update is that, rather than using an acceptance probability, as in the PMMH algorithm, to ensure the algorithm targets the correct distribution, we condition on the previously accepted path x*_{1:T}. The full procedure for the conditional SMC update is given in Algorithm 8.

As for the PMMH algorithm, Andrieu et al. (2010) show that a Gibbs sampler which includes this conditional SMC step targets the desired posterior p(θ, x_{1:T} | y_{1:T}). A key motivation for introducing the Gibbs sampler was the 'stickiness' of the PMMH algorithm. However, in the conditional SMC update we are conditioning on the previous path. Due to degeneracy problems in SMC methods, this can often lead to the current path being chosen multiple times, so that particle Gibbs exhibits 'stickiness' also (Wilkinson, 2011).

More recently, work has been done to improve the properties of the particle Gibbs method. Backward sampling, developed by Godsill et al. (2004) as a smoothing method for SMC, was first suggested as a method to improve particle Gibbs mixing by Whiteley et al. (2010). The method has been explored theoretically by Lindsten and Schon (2011) and Chopin and Singh (2013). Intuitively, backward sampling is used after a weighted particle approximation {x_t^{(i)}, ω_t^{(i)}}_{i=1}^N to the filtering distribution, at each time period t, has been obtained using a particle filter. The method then samples a trajectory x_{1:T} backwards in time, with each x_t^{(i)} being selected with probability equal to a slightly modified importance weight. The importance weights are modified so that the previously sampled future observation is taken into account.


More recently, Lindsten et al. (2014) suggested an improvement to the particle Gibbs algorithm using ancestor sampling. The method works similarly to backward sampling; however, rather than having the backward and forward sweeps as separate procedures, the method achieves the same effect using just a forward sweep.

As for the PMMH algorithm, when more sophisticated particle filters and resampling methods are used, care has to be taken that the conditional SMC update is implemented correctly. Again, for more information the reader is referred to the original paper by Andrieu et al. (2010).

Algorithm 8: Conditional SMC update

Given current path x*_{1:T} and value of θ:
Step 1:
    1. Set x_1^{(1)} = x*_1.
    2. Sample x_1^{(i)} ~ q_θ(x_1 | y_1) for i ∈ {2, . . . , N}.
    3. Evaluate and normalise the importance weights

        ω_1^{(i)} ∝ [p_θ(y_1 | x_1^{(i)}) p_θ(x_1^{(i)})] / q_θ(x_1^{(i)} | y_1).

    4. Store paths x_1^{(i)} for i ∈ {2, . . . , N}.
Step 2:
for t ∈ {2, . . . , T} do
    1. Set x_t^{(1)} = x*_t.
    2. Sample A_{t-1}^{(i)} from {1, . . . , N} with probabilities {ω_{t-1}^{(1)}, . . . , ω_{t-1}^{(N)}}.
    3. Sample x_t^{(i)} ~ q_θ(x_t | x_{1:t-1}^{(A_{t-1}^{(i)})}, y_{1:t}).
    4. Calculate and normalise importance weights

        ω_t^{(i)} ∝ ω_{t-1}^{(i)} [p_θ(y_t | x_t^{(i)}) p_θ(x_t^{(i)} | x_{t-1}^{(A_{t-1}^{(i)})})] / q_θ(x_t^{(i)} | x_{1:t-1}^{(A_{t-1}^{(i)})}, y_{1:t}).

    5. Store paths x_{1:t}^{(i)} = (x_{1:t-1}^{(A_{t-1}^{(i)})}, x_t^{(i)}).
end
Step 3:
Sample a new path x*_{1:T} from {x_{1:T}^{(1)}, . . . , x_{1:T}^{(N)}} with probabilities {ω_T^{(1)}, . . . , ω_T^{(N)}}.
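A compact sketch of the conditional SMC update for a bootstrap-style proposal (q_θ equal to the transition density, so the weights reduce to the likelihood term) on an assumed linear-Gaussian toy model:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
phi, sigma, tau = 0.9, 0.5, 1.0  # toy model parameters (an assumption)

def conditional_smc(y, x_star, N=200):
    """One conditional SMC update: particle 1 is pinned to the conditioned
    path x_star throughout; bootstrap proposal q = transition density."""
    T = len(y)
    X = np.empty((T, N))
    A = np.empty((T, N), dtype=int)
    X[0] = rng.normal(0.0, sigma / np.sqrt(1 - phi**2), N)
    X[0, 0] = x_star[0]                      # Step 1: pin the first particle
    w = norm.pdf(y[0], loc=X[0], scale=tau)
    w /= w.sum()
    for t in range(1, T):
        A[t] = rng.choice(N, size=N, p=w)
        A[t, 0] = 0                          # the pinned path is its own ancestor
        X[t] = rng.normal(phi * X[t - 1, A[t]], sigma)
        X[t, 0] = x_star[t]                  # Step 2: pin x_t^{(1)} = x*_t
        w = norm.pdf(y[t], loc=X[t], scale=tau)  # bootstrap weights
        w /= w.sum()
    # Step 3: sample a new path by tracing the ancestry of a sampled index
    k = rng.choice(N, p=w)
    new_path = np.empty(T)
    for t in range(T - 1, -1, -1):
        new_path[t] = X[t, k]
        if t > 0:
            k = A[t, k]
    return new_path

y = rng.normal(size=25)  # dummy observations
new_path = conditional_smc(y, np.zeros(25))
print(new_path.shape)  # (25,)
```

Within particle Gibbs, this update replaces the draw of x_{1:T} in step 2 of the Gibbs sweep above.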


5 Application: particle MCMC for inference in timeseries

Hidden Markov models have proven to be very useful in the analysis of complex timeseries (Zucchini and MacDonald, 2009). The particle MCMC methods described in Section 4 can therefore be used effectively for parameter estimation in timeseries that can be modelled using HMMs. In order to demonstrate this, we use the PMMH algorithm to estimate the parameters of a model of the number of annual, worldwide earthquakes with a magnitude of 7 or over on the Richter scale. The data has been previously analysed by Langrock (2011), though there the space of the hidden state was discretized and a maximum likelihood approach applied. A plot of the data is given in Figure 3.

Figure 3: Annual number of earthquakes worldwide with a magnitude of 7 or over on the Richter scale.

The standard statistical model for count data is the Poisson distribution; however, the sample mean of the counts is 19.36, while the sample variance is 51.57. This indicates that the observations are overdispersed. The series also exhibits significant positive autocorrelation. The combination of these factors motivates Langrock (2011) to model the earthquake counts as a Poisson random variable y_t, whose parameter λ_t is time dependent and assumed to follow an AR(1) process as follows

log(λ_t) − μ = φ(log(λ_{t−1}) − μ) + σ η_t,

where |φ| < 1, μ, σ > 0, and the η_t are i.i.d. N(0, 1). By writing x_t := log(λ_t) − μ and β := exp(μ), Langrock (2011) models the process as the following HMM

x_t | x_{t−1} ~ N(φ x_{t−1}, σ²)
y_t | x_t ~ Pois(β exp(x_t)).
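Simulating from this HMM is straightforward; the sketch below draws x_1 from the stationary distribution of the AR(1) process, consistent with the prior used in the analysis that follows.

```python
import numpy as np

rng = np.random.default_rng(6)

def simulate_earthquake_hmm(T, phi, sigma, beta):
    """Simulate from the HMM of Langrock (2011):
    x_t | x_{t-1} ~ N(phi x_{t-1}, sigma^2), y_t | x_t ~ Pois(beta exp(x_t)),
    with x_1 drawn from the stationary N(0, sigma^2 / (1 - phi^2))."""
    x = np.empty(T)
    x[0] = rng.normal(0.0, sigma / np.sqrt(1 - phi**2))
    for t in range(1, T):
        x[t] = rng.normal(phi * x[t - 1], sigma)
    y = rng.poisson(beta * np.exp(x))
    return x, y

# Parameter values close to the posterior means reported below
x, y = simulate_earthquake_hmm(100, phi=0.88, sigma=0.14, beta=18.9)
print(y[:5])
```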


We used a random walk PMMH algorithm to sample from (θ, x_{1:T}), where θ = (φ, σ, β)^T. For simplicity we used a bootstrap filter to estimate the marginal likelihood. The prior used for the initial state assumed that the process is stationary, so that

x_1 ~ N(0, σ² / (1 − φ²)).

Initially it was found that the distribution of the parameter β exhibited significant positive skew, which meant that the random walk was not able to explore the full state space effectively. We therefore introduced the parameter α := log β = μ and reparameterised the model as follows

x_t | x_{t−1} ~ N(φ x_{t−1}, σ²)
y_t | x_t ~ Pois(exp(α + x_t)).

We then set θ := (φ, σ, α); note that β > 0, so this transformation is valid. Uniform priors were applied over θ with the conditions |φ| < 1 and σ > 0. Code for the implementation of the algorithm can be found on my website: http://lancs.ac.uk/~bakerj1/#three.

We used 60 particles in the bootstrap particle filter. This ensured, as suggested by Sherlock et al. (2014), that the estimated variance of the log of the marginal likelihood estimate, log p̂(y_{1:T}), was close to 3.3. After initial tuning, the variance-covariance matrix of the random walk proposal was set to

(2.562² / 3) cov(θ),

where cov(θ) was estimated from an initial MCMC run after tuning (Sherlock et al., 2014). The average acceptance rate was 6.9%, close to the optimal 7%. Initial parameter estimates were set according to the mean estimates of an initial MCMC run where convergence had been established. These were quite close to the maximum likelihood estimates of Langrock (2011).

We ran 50000 MCMC iterations, discarding 10000 as burn-in. The resulting effective sample sizes were approximately 640, 683 and 442 for φ, σ and α respectively. We obtained mean parameter values of 0.88, 0.14 and 18.89 for φ, σ and β respectively. These are comparable to the maximum likelihood estimates of Langrock (2011): (φ, σ, β) = (0.89, 0.14, 17.81). Trace plots and kernel density plots after burn-in are given in Figure 4. They show that the parameter α exhibits considerable variability. This is probably a limiting factor for the algorithm. Better results might be achieved by incorporating posterior gradient information in the proposal, so that the algorithm explores the space of α more effectively. This idea has been developed by Dahlin et al. (2014), who apply their method to this particular dataset and achieve better results.

The mean trajectory is plotted against the observed counts in Figure 5. The plot shows a relatively good fit. However, it suggests that the model may overestimate the counts in years with a lower than average number of earthquakes.


Figure 4: Kernel density estimates and traceplots for each of the parameters simulated using PMMH. (a) Kernel density estimate for φ, based on the PMMH sample; (b) MCMC traceplot for φ; (c) kernel density estimate for σ, based on the PMMH sample; (d) MCMC traceplot for σ; (e) kernel density estimate for α, based on the PMMH sample; (f) MCMC traceplot for α.


Figure 5: Observed data against the fitted model, with an approximate 95% confidence interval. Data: annual number of earthquakes worldwide with a magnitude of 7 or over on the Richter scale.

6 Discussion and open areas

This report has introduced pseudo-marginal MCMC, which enables a Metropolis-Hastings style sampler to be built when we are trying to simulate from a distribution p(x) which cannot be evaluated, but for which an unbiased Monte Carlo approximation can be produced. In practice, the main application of pseudo-marginal MCMC is when there is considerable missing or unobserved data, so that the marginal likelihood is intractable. We discussed how, perhaps surprisingly, pseudo-marginal MCMC can be shown to target the desired distribution p(x). However, this exactness comes predominantly at the cost of mixing. The mixing of the algorithm is found to depend considerably on the variance of the Monte Carlo estimate, which can be reduced by increasing the number of samples used to generate it. In light of this, we reviewed a number of recent results which attempt to optimise the mixing of the algorithm while taking into account computational expense.

We motivated hidden Markov models (HMMs) as important, flexible models with numerous applications. The difficulty of estimating the distribution of the hidden state in a HMM analytically was discussed. This motivated us to consider sequential Monte Carlo methods, which estimate the hidden state sequentially using importance sampling. We discussed the degeneracy problem, a major pitfall for this class of algorithms, as well as suggestions to overcome the problem using resampling techniques and judicious proposal choice.

Finally, we introduced particle MCMC and discussed its use in performing parameter estimation in hidden Markov models. A key strength of the algorithm is that it is exact, in that it targets the true posterior distribution. Key issues associated with these algorithms were considered, such as the number of particles to choose for the particle filter and the use of more sophisticated particle filters. Recent work on improving the performance of the particle Gibbs algorithm (especially its tendency to exhibit 'stickiness') was reviewed. In order to demonstrate the PMMH


algorithm, arguably the most important of the three particle MCMC algorithms introduced by Andrieu et al. (2010), we applied it to inference for a timeseries model. In particular, we analysed worldwide, annual earthquake counts using a hidden Markov model. We discussed how providing posterior gradient information in the proposal of the parameters would be useful in this case and would allow the sampler to explore the state space more effectively.

6.1 Open areas: online parameter estimation in hidden Markov models

Particle MCMC provides a number of techniques to enable parameter estimation in HMMs. These methods have some major strengths: they target the exact posterior, so are able to estimate parameters with considerable accuracy, and they are easy to implement. However, as these algorithms are based on MCMC, they are computationally intensive. This means particle MCMC cannot really be used for online applications, where we are estimating the unknown parameters as the process evolves through time. For these cases, it is usually suggested that parameter estimation be performed within SMC. Estimating parameters within SMC proves to be challenging, however. The likelihood is intractable, meaning that in order to perform some sort of maximum likelihood procedure, its estimate p̂_θ(y_{1:t}) needs to have very low variance (Kantas et al., 2014). Similar difficulties apply when we are trying to update the posterior using a Bayesian framework.

For online parameter estimation in HMMs using maximum likelihood, Kantas et al. (2014) recommend using either an online gradient descent algorithm or an online EM algorithm. However, they do note that empirically these methods are slow to converge, limiting their usage to large datasets. Online parameter estimation using a Bayesian framework has yet to bear fruit, however, with all the methods considered by Kantas et al. (2014) suffering from the problem of degeneracy. Therefore developing Bayesian methods for online parameter estimation in hidden Markov models is an important open problem.

6.2 Open areas: exact approximate Markov chain Monte Carlo for big data

A very active area of current research in computational statistics is developing MCMC methods that scale more easily to very large datasets. The likelihood is often computationally expensive to calculate when the dataset is large, which makes MCMC prohibitive in these cases. A number of approaches have been proposed to try to solve this problem, including parallelisation of MCMC and variational approximations. The approach we focus on, however, due to its strong links with the rest of the report, is subsampling. This simply means that only some of the available data is used to calculate the likelihood. We also, in keeping with the theme of the rest of the report, focus on exact methods, in that the resulting MCMC targets the posterior that would be obtained if the full dataset were used.

Calculating the likelihood using only a subsample of the data can be seen as providing an estimate of the likelihood. It is therefore natural to ask whether we can apply pseudo-marginal MCMC in this case. Provided that the likelihood estimate is unbiased, we can use the pseudo-marginal result to construct an MH sampler that uses only a subsample of the data but targets the full posterior. This idea was first investigated by Korattikara et al. (2013), who used a simple random sampling design to construct an estimate of the likelihood. However, they quickly found that the variance in the estimate of the likelihood was far too high for pseudo-marginal MCMC to give good results. The chain would exhibit considerable stickiness, as we discussed in Section 2.2.
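The scale of the problem is easy to see numerically. A simple random subsample gives an unbiased estimate of the log-likelihood, not of the likelihood itself, and its standard deviation on the log scale is typically far larger than the order-one values pseudo-marginal MCMC needs to mix well. The Gaussian setup below is an illustrative assumption, not taken from Korattikara et al. (2013).

```python
import numpy as np

rng = np.random.default_rng(7)

# n i.i.d. Gaussian log-likelihood contributions (illustrative setup)
n, m = 100_000, 1_000
data = rng.normal(loc=0.5, size=n)
log_lik_terms = -0.5 * np.log(2 * np.pi) - 0.5 * (data - 0.5) ** 2

def subsample_log_lik():
    # Scaled-up subsample sum: unbiased for the LOG-likelihood, and
    # therefore biased for the likelihood itself after exponentiating
    idx = rng.choice(n, size=m, replace=False)
    return (n / m) * log_lik_terms[idx].sum()

ests = np.array([subsample_log_lik() for _ in range(200)])
print(ests.std())  # variability on the log scale
```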


Quiroz et al. (2014) tried a similar approach, but using ideas from survey sampling. In particular, they aimed to sample each data point with probability proportional to its likelihood contribution. This can be seen as a variance reduction technique for estimating the likelihood. A key problem that needs to be addressed with this technique, however, is that the likelihood itself cannot be used to evaluate the sampling probabilities. Quiroz et al. (2014) therefore suggest constructing a proxy to the likelihood, using for example Gaussian processes. A thorough investigation into calculating efficient likelihood proxies would therefore complement this work well. Alternatively, other survey sampling methods may reduce the variance of the likelihood estimate more efficiently.

Rather than appealing directly to pseudo-marginal MCMC, alternative approaches may prove more fruitful. Maclaurin and Adams (2014) introduce a Bernoulli latent variable for each data point with rather desirable properties. If this variable is 1, the likelihood contribution of the data point is evaluated; if it is 0, only a strictly positive lower bound on the likelihood contribution needs to be evaluated. The MCMC proceeds by alternately updating the unknown parameters and the latent variables. This algorithm can also be shown to target the full posterior. Clearly some work is needed to investigate constructing a lower bound on the likelihood for a broad class of models. Examining how the introduction of this latent variable affects mixing would also be important work.

References

Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B: Statistical Methodology, 72:269–342.

Andrieu, C. and Roberts, G. O. (2009). The pseudo-marginal approach for efficient Monte Carlo computations. Annals of Statistics, 37(2):697–725.

Beaumont, M. A. (2003). Estimation of population growth or decline in genetically monitored populations. Genetics, 164(3):1139–1160.

Carpenter, J., Clifford, P., and Fearnhead, P. (1999). Improved particle filter for nonlinear problems. IEE Proceedings - Radar, Sonar and Navigation, 146(1):2–7.

Chib, S. and Greenberg, E. (1995). Understanding the Metropolis–Hastings algorithm. The American Statistician, 49(4):327–335.

Chopin, N. and Singh, S. S. (2013). On the particle Gibbs sampler. arXiv preprint arXiv:1304.1887.

Dahlin, J., Lindsten, F., and Schon, T. B. (2014). Particle Metropolis–Hastings using gradient and Hessian information. Statistics and Computing, pages 1–12.

Del Moral, P. (2004). Feynman-Kac Formulae. Springer.

Douc, R. and Cappe, O. (2005). Comparison of resampling schemes for particle filtering. In Image and Signal Processing and Analysis, 2005. ISPA 2005. Proceedings of the 4th International Symposium on, pages 64–69. IEEE.

Doucet, A., Godsill, S., and Andrieu, C. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208.

Doucet, A. and Johansen, A. M. (2008). A tutorial on particle filtering and smoothing: Fifteen years later. Technical Report.

Doucet, A., Pitt, M., Deligiannidis, G., and Kohn, R. (2012). Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. arXiv preprint arXiv:1210.1871.

Fearnhead, P. (2012). Modern computational statistics: Alternatives to MCMC. http://www.maths.lancs.ac.uk/~fearnhea/GTP/GTP_Slides.pdf. RSS Graduate Training Programme.

Godsill, S. J., Doucet, A., and West, M. (2004). Monte Carlo smoothing for nonlinear time series. Journal of the American Statistical Association, 99(465).

Gordon, N. J., Salmond, D. J., and Smith, A. F. M. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F: Radar and Signal Processing, 140(2):107–113.

Johansen, A. M. and Doucet, A. (2008). A note on auxiliary particle filters. Statistics & Probability Letters, 78(12):1498–1504.

Kantas, N., Doucet, A., Singh, S. S., Maciejowski, J. M., and Chopin, N. (2014). On particle methods for parameter estimation in state-space models. arXiv preprint arXiv:1412.8695.

Korattikara, A., Chen, Y., and Welling, M. (2013). Austerity in MCMC land: Cutting the Metropolis–Hastings budget. arXiv preprint arXiv:1304.5299.

Langrock, R. (2011). Some applications of nonlinear and non-Gaussian state-space modelling by means of hidden Markov models. Journal of Applied Statistics, 38(12):2955–2970.

Lindsten, F., Jordan, M. I., and Schon, T. B. (2014). Particle Gibbs with ancestor sampling. The Journal of Machine Learning Research, 15(1):2145–2184.

Lindsten, F. and Schon, T. B. (2011). On the use of backward simulation in particle Markov chain Monte Carlo methods. arXiv preprint arXiv:1110.2873.

Maclaurin, D. and Adams, R. P. (2014). Firefly Monte Carlo: Exact MCMC with subsets of data. arXiv preprint arXiv:1403.5693.

Nemeth, C. (2014). Parameter Estimation for State Space Models using Sequential Monte Carlo Algorithms. PhD thesis, The University of Lancaster.

O'Neill, P. D., Balding, D. J., Becker, N. G., Eerola, M., and Mollison, D. (2000). Analyses of infectious disease data from household outbreaks by Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series C (Applied Statistics), 49(4):517–542.

Pitt, M. K., dos Santos Silva, R., Giordani, P., and Kohn, R. (2012). On some properties of Markov chain Monte Carlo simulation methods based on the particle filter. Journal of Econometrics, 171(2):134–151.

Pitt, M. K. and Shephard, N. (1999). Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94(446):590–599.

Pollock, M. (2010). Algorithms & Computationally Intensive Inference reading group: Introduction to particle filtering discussion notes.

Quiroz, M., Villani, M., and Kohn, R. (2014). Speeding up MCMC by efficient data subsampling. arXiv preprint arXiv:1404.4178.

Sherlock, C., Thiery, A. H., Roberts, G. O., and Rosenthal, J. S. (2014). On the efficiency of pseudo-marginal random walk Metropolis algorithms. The Annals of Statistics, 43(1):238–275.

Whiteley, N., Andrieu, C., and Doucet, A. (2010). Efficient Bayesian inference for switching state-space models using discrete particle Markov chain Monte Carlo methods. arXiv preprint arXiv:1011.2437.

Wilkinson, D. (2010). The pseudo-marginal approach to exact approximate MCMC algorithms. Darren Wilkinson's research blog.

Wilkinson, D. (2011). The particle marginal Metropolis–Hastings (PMMH) particle MCMC algorithm. Darren Wilkinson's research blog.

Zucchini, W. and MacDonald, I. L. (2009). Hidden Markov Models for Time Series: An Introduction Using R. CRC Press.
