
CHAPTER 16

MARKOV CHAIN MONTE CARLO

• Organization of chapter in ISSO
  – Background on MCMC
  – Metropolis-Hastings algorithm
  – Numerical example of Metropolis-Hastings
  – Gibbs sampling
  – Numerical example of Gibbs sampling
  – Optional in these slides: Non-Gaussian state estimation (not in ISSO)

Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall


16-2

Background

• Process generating random vector X
• Want to compute E[f(X)] for a function f(·)

• Standard method for approximating E[f(X)] is to generate many independent sample values of X and compute the sample mean of the f(X) values (sketched in the code at the end of this slide)

• Only useful in “trivial” cases where X can be generated directly

• Many practical problems have non-trivial distribution for X

– E.g., state in nonlinear/non-Gaussian state-space model
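A minimal sketch (not from ISSO) of this independent-sampling estimate, assuming X can be generated directly; the helper name `mc_estimate` and the N(0,1), f(x) = x² illustration are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_estimate(f, sample_x, n):
    """Plain Monte Carlo: average f over n independent draws of X."""
    draws = np.array([f(sample_x()) for _ in range(n)])
    return draws.mean()

# Illustration: X ~ N(0, 1) and f(x) = x**2, so E[f(X)] = 1.
est = mc_estimate(lambda x: x**2, rng.standard_normal, 10_000)
print(est)  # roughly 1.0
```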


16-3

Markov Chains

• Not necessary to generate independent X to estimate E[f(X)]

• Consider dependent sequence X0, X1, X2,…

• Generate Xk+1 according to “easy” conditional distribution for {Xk+1|Xk}

– {Xk} process is a Markov chain

– Dependence of Xk on any fixed number of early states disappears as k gets large

• Above implies distribution of Xk approaches a stationary form as k gets large

– Stationary form corresponds to target distribution (density) p(·) if the conditional distribution is chosen properly (illustrated in the toy chain sketched below)
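Toy illustration (not from ISSO), assuming a simple AR(1) chain whose conditional distribution {Xk+1 | Xk} is easy to sample and whose stationary distribution is N(0, 1) no matter where the chain starts.

```python
import numpy as np

rng = np.random.default_rng(1)
a = 0.8                      # dependence on the previous state
n = 50_000
x = np.empty(n + 1)
x[0] = 10.0                  # deliberately far from the stationary mean

# Each X_{k+1} is drawn from the "easy" conditional N(a*x_k, 1 - a**2),
# so {X_k} is a Markov chain with stationary distribution N(0, 1).
for k in range(n):
    x[k + 1] = a * x[k] + np.sqrt(1 - a**2) * rng.standard_normal()

print(x[1000:].mean(), x[1000:].var())  # near 0 and 1 once the chain settles
```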


16-4

Ergodic Averaging

• Let M denote the “burn-in” period for the Markov chain

• The ergodic average of the n – M post burn-in values of f(X), with Xk generated via a Markov chain, is given below

• Summands above are dependent via the Markov property for the {Xk}

• Above sum approaches E[f(X)] as n gets large by the ergodic theorem

$$\frac{1}{n-M}\sum_{k=M+1}^{n} f(\mathbf{X}_k)$$
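A minimal sketch of the ergodic average above, assuming the chain values X0, …, Xn are already stored in an array (e.g., the toy chain from the previous slide).

```python
import numpy as np

def ergodic_average(f, chain, M):
    """Average f over the post burn-in states X_{M+1}, ..., X_n.

    `chain` is assumed to hold X_0, ..., X_n, so the average uses
    the last n - M entries, matching the formula on this slide.
    """
    post = chain[M + 1:]
    return np.mean([f(x) for x in post], axis=0)

# e.g., ergodic_average(lambda x: x**2, x, M=1000) with the AR(1) chain above
```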


16-5

Metropolis-Hastings (M-H) Algorithm

• M-H algorithm is one of two most popular forms for MCMC (other is Gibbs sampling)

• M-H relies on proposal distribution and Metropolis criterion

• Let proposal distribution be q(·|·); used to generate candidate points W ~q(·|X=x)

• Candidate point W = w is accepted with probability given by Metropolis criterion:

• In practice, in going from Xk to Xk+1, x above is Xk and W becomes Xk+1 if W is accepted

$$\rho(\mathbf{x},\mathbf{w}) = \min\left\{\frac{p(\mathbf{w})\,q(\mathbf{x}\mid\mathbf{w})}{p(\mathbf{x})\,q(\mathbf{w}\mid\mathbf{x})},\ 1\right\}$$
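A direct transcription of the criterion as a hypothetical helper; p may be unnormalized (the normalizing constant cancels), and `q(a, b)` here denotes the proposal density of a given b.

```python
def metropolis_ratio(p, q, x, w):
    """Acceptance probability rho(x, w) = min{ p(w) q(x|w) / [p(x) q(w|x)], 1 }."""
    return min((p(w) * q(x, w)) / (p(x) * q(w, x)), 1.0)
```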


16-6

M-H Algorithm for Estimating E[f(X)]

Step 0 (initialization) Choose length of “burn-in” period M and initial state X0. Set k = 0.

Step 1 (candidate point) Generate a candidate point W according to the proposal distribution q(·|Xk).

Step 2 (accept/reject) Generate a point U from the U(0,1) distribution. Set Xk+1 = W if U ≤ ρ(Xk, W) (Metropolis criterion). Otherwise set Xk+1 = Xk.

Step 3 (iterate) Repeat Steps 1 and 2 until XM is available. Terminate “burn-in” process and proceed to Step 4 with Xk = XM.

Steps 4–6 (ergodic average) Repeat the process and compute the average of f(XM+1), …, f(Xn). This ergodic average is the estimate of E[f(X)].
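A sketch of the steps above for a generic target p and proposal q; the function and parameter names are illustrative, not from ISSO.

```python
import numpy as np

def metropolis_hastings(p, propose, q_pdf, x0, M, n, rng=None):
    """Sketch of Steps 0-3: run the chain, discard burn-in, return X_{M+1}..X_n.

    p       : target density (may be unnormalized)
    propose : propose(x) draws W ~ q(. | x)
    q_pdf   : q_pdf(a, b) evaluates the proposal density q(a | b)
    """
    rng = rng or np.random.default_rng()
    x = x0
    kept = []
    for k in range(n):
        w = propose(x)                               # Step 1: candidate point
        rho = min(p(w) * q_pdf(x, w) / (p(x) * q_pdf(w, x)), 1.0)
        if rng.uniform() <= rho:                     # Step 2: Metropolis criterion
            x = w
        if k + 1 > M:                                # Step 3: keep post burn-in states
            kept.append(x)
    return kept

def estimate_mean_f(f, samples):
    """Steps 4-6: ergodic average of f over the retained states."""
    return np.mean([f(x) for x in samples], axis=0)
```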


16-7

Example: Estimating E[f(X)] from a Bivariate Normal Distribution

(Example 16.1 from ISSO)

• Suppose X has the bivariate normal distribution shown at the bottom of this slide

• Use M-H to estimate sum of the two mean components (true value = 0): f(X) = [1,1]X

• Standard (unit length) uniform proposal distribution and burn-in period of M = 500

• Following plot shows three independent runs
  – Acceptance rate (Metropolis criterion) about 70%
  – Better performance possible with lower acceptance rate (requires “tuning”; not always feasible in practice)

$$\mathbf{X} \sim N\!\left(\begin{bmatrix}0\\0\end{bmatrix},\ \begin{bmatrix}1 & 0.9\\ 0.9 & 1\end{bmatrix}\right)$$
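A self-contained sketch of this example, assuming a random-walk proposal W = x + U with U uniform on a unit square centered at the origin (one plausible reading of the “standard unit-length uniform proposal”; ISSO's exact proposal may differ). The symmetric proposal makes q(x|w) = q(w|x), so the Metropolis criterion reduces to p(w)/p(x).

```python
import numpy as np

rng = np.random.default_rng(2)

# Unnormalized target: bivariate normal, mean 0, unit variances, correlation 0.9.
COV_INV = np.linalg.inv(np.array([[1.0, 0.9], [0.9, 1.0]]))

def p(x):
    return np.exp(-0.5 * x @ COV_INV @ x)

def propose(x):
    # Random-walk step, uniform on a unit square centered at the current point.
    return x + rng.uniform(-0.5, 0.5, size=2)

M, n = 500, 15_500
x = np.zeros(2)
f_vals = []
accepted = 0
for k in range(n):
    w = propose(x)
    # Symmetric proposal, so the acceptance ratio reduces to p(w)/p(x).
    if rng.uniform() <= min(p(w) / p(x), 1.0):
        x = w
        accepted += 1
    if k + 1 > M:
        f_vals.append(x.sum())        # f(X) = [1, 1] X

print("estimate of E[f(X)]:", np.mean(f_vals))   # true value is 0
print("acceptance rate:", accepted / n)
```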


16-8

Example (cont’d): M-H Algorithm with Uniform Proposal Distribution; Mean Zero Target

[Figure: three independent M-H runs; x-axis “Iterations (Post Burn-In)”, 0 to 15,000; y-axis from –3 to 3]


16-9

Gibbs Sampling

• Gibbs sampling is implementation of M-H on element-by-element basis

• Gibbs sampling uniquely designed for multivariate problems, i.e., dim(X) ≥ 2

• Gibbs sampling based on idea of “full conditional” distributions
  – i-th full conditional distribution is the conditional distribution for the i-th component of X conditioned on the most recent values of all other components of X (see the code sketch at the end of this slide)

• In contrast to M-H, Gibbs sampling updates components of X one-at-a-time
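Gibbs sampling in code, on a standard illustration (not ISSO's example): the mean-zero bivariate normal with correlation 0.9 from Example 16.1, whose full conditionals are the known univariate normals Xi | Xj = xj ~ N(0.9 xj, 1 − 0.9²).

```python
import numpy as np

rng = np.random.default_rng(3)
rho = 0.9
sd = np.sqrt(1 - rho**2)     # conditional std dev of either component

M, n = 500, 15_500
x1, x2 = 0.0, 0.0
kept = []
for k in range(n):
    # Update each component from its full conditional, using the most
    # recent value of the other component.
    x1 = rng.normal(rho * x2, sd)
    x2 = rng.normal(rho * x1, sd)
    if k + 1 > M:
        kept.append((x1, x2))

print(np.mean(kept, axis=0))  # both component means near 0
```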


16-10

Relationship of Gibbs Sampling to M-H

• Gibbs sampling is special case of M-H on element-by-element basis
• Gibbs sampling and M-H developed largely independently of each other
  – M-H introduced in Hastings (1970) as implementation of Metropolis sampling from statistical physics
  – Gibbs introduced in Tanner and Wong (1987) and Gelfand and Smith (1990), with special focus on Bayesian problems
• Gibbs sampling uses particular form of full conditionals as proposal distribution in M-H
  – Eliminates need to “tune” proposal distribution as in general M-H
  – Requires stronger assumptions to construct full conditionals
  – Acceptance rate for new points is 100%


16-11

Example: Truncated Exponential Distribution (Example 16.5 from ISSO)

• Consider two-variable problem where conditional random variables {X|Y} and {Y|X} have exponential distributions over a finite interval (length = 5)
  – Distributions for {X|Y} and {Y|X} are the two full conditionals for Gibbs sampling

• Suppose interested in marginal distribution for X
• Can determine exact marginal distribution for X
  – Useful for comparison purposes; not usually available in practice

• Plot shows Gibbs output relative to true density for X
  – Histogram based on terminal X value from 5000 independent replications
  – Burn-in period of M = 10; terminal value occurs 30 iterations past burn-in
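A sketch of this scheme assuming, as in the classical truncated-exponential illustration, full conditionals p(x|y) ∝ y e^(−yx) and p(y|x) ∝ x e^(−xy), each truncated to (0, 5); ISSO Example 16.5 may be parameterized differently. Inverse-CDF sampling handles the truncation.

```python
import numpy as np

rng = np.random.default_rng(4)
B = 5.0   # length of the truncation interval

def sample_trunc_exp(rate, rng):
    """Inverse-CDF draw from an Exponential(rate) truncated to (0, B)."""
    u = rng.uniform()
    return -np.log(1.0 - u * (1.0 - np.exp(-rate * B))) / rate

def terminal_x(n_iter=30, burn_in=10, rng=rng):
    """One replication: iterate the two full conditionals and return the final X."""
    x, y = 1.0, 1.0
    for _ in range(burn_in + n_iter):
        x = sample_trunc_exp(y, rng)   # draw from {X | Y = y}
        y = sample_trunc_exp(x, rng)   # draw from {Y | X = x}
    return x

samples = np.array([terminal_x() for _ in range(5000)])
# `samples` plays the role of the histogram data on the next slide.
```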


16-12

Example (cont’d): Histogram of Gibbs Sampling Output vs. Known Density


16-13

Optional (not in ISSO): Non-Gaussian State Estimation

• Consider state-space model with non-Gaussian noises (xk is state; zk is measurement)

• Represent p(xk|xk–1) and p(zk|xk) as Gaussian mixtures

– Gaussian mixtures can be used to approximate many non-Gaussian distributions (see the sketch at the end of this slide)

• Gibbs sampling used to estimate state based on Gaussian full conditionals

• Further information on pp. 43–44 of: Spall, J. C. (2003), “Estimation via Markov Chain Monte Carlo,” IEEE Control Systems Magazine, vol. 23, no. 2, pp. 34–45
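As a small aside (not from ISSO or the cited paper), a two-component Gaussian mixture density, just to make the mixture representation concrete; the weights, means, and standard deviations below are arbitrary.

```python
import numpy as np

def mixture_pdf(v, weights, means, sds):
    """Density of a univariate Gaussian mixture evaluated at v."""
    v = np.asarray(v, dtype=float)
    comps = [w * np.exp(-0.5 * ((v - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
             for w, m, s in zip(weights, means, sds)]
    return sum(comps)

# Arbitrary skewed mixture: a non-Gaussian shape built from two Gaussians.
pdf_vals = mixture_pdf(np.linspace(-4, 8, 5), [0.7, 0.3], [0.0, 3.0], [1.0, 2.0])
```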


16-14

Non-Gaussian State Estimation: Basic Idea

• Let θ represent the parameters in the Gaussian mixture
• Xn and Zn are the complete collections of all (n) states and measurements
• Gibbs sampling operates from the full conditionals:
  – {xk | xk–1, θ, Zn} — Gaussian distribution
  – {θ | Xn, Zn} — non-Gaussian distribution
• Above non-Gaussian distribution known for many cases
• Iterative sampling from above full conditionals produces samples from p(xk | Zn) for all k
  – Average the samples to get E(xk | Zn)


16-15

Concluding Remarks

• M-H and Gibbs sampling two notable examples of MCMC
  – Methods for “easy” generation of random samples and estimates
• M-H more general, but Gibbs especially useful in specific applications
• Not “magic”; still need relevant assumptions
• Widespread use in statistics, computer science, simulation, etc.
• Limited current use in control and signal processing

– But non-Gaussian/nonlinear state estimation one growing area


16-16

Exercise 16.3: Four Replications of M-H Algorithm with Mean Zero Target


16-17

Exercise 16.8: Histogram (2000 samples) and Known Density for X