STAT 8054 notes on Markov Chain Monte Carlo

Spring 2015

April 20, 2015

Contents

1 Introduction
2 Review of $\mathcal{X}$-valued random variables
3 Discrete time homogeneous Markov chains
4 Convergence of Markov chains used in MCMC
5 Algorithms
   5.1 Metropolis–Hastings
   5.2 Random walk Metropolis
       5.2.1 Example: approximately sample from $N(\mu, \sigma^2)$
       5.2.2 Example: approximately sample from a bivariate posterior
   5.3 Independence sampler
   5.4 Variable-at-a-time Metropolis–Hastings and the Gibbs sampler
       5.4.1 Example: approximately sample from a bivariate posterior
       5.4.2 Example: fit a linear random effects model
       5.4.3 Example: Bayesian ridge regression
6 Inference for $E\{h(X)\}$, where $X$ has the target distribution

1 Introduction

In ordinary Monte Carlo, we studied methods to generate a realization of $X_1, \ldots, X_n$, which are iid with some distribution of interest. Our goal is still to sample from this distribution, but we are willing to relax the requirement that $X_1, \ldots, X_n$ be iid: we will allow $X_1, \ldots, X_n$ to be dependent and to have different distributions, provided that the distribution of $X_k$ converges to the distribution of interest as $k$ increases.


2 Review of $\mathcal{X}$-valued random variables

The following review is based on Chapter 2 of Keener (2005). Suppose that $(\Omega, \mathcal{F}, P)$ is a probability space and $(\mathcal{X}, \mathcal{B})$ is a measurable space. An $\mathcal{X}$-valued random variable $X$ is a function $X : \Omega \to \mathcal{X}$ such that $\{\omega \in \Omega : X(\omega) \in B\} \in \mathcal{F}$ for every $B \in \mathcal{B}$. The probability measure $P_X : \mathcal{B} \to [0, 1]$ defined by
\[
P_X(A) = P(X \in A) \equiv P(\{\omega \in \Omega : X(\omega) \in A\}),
\]
is called the distribution of $X$. We indicate this by writing $X \sim P_X$.

Let $h : \mathcal{X} \to \mathbb{R}$ be a measurable function, i.e. $\{x \in \mathcal{X} : h(x) \in B\} \in \mathcal{B}$ for every $B \in \mathcal{B}_{\mathbb{R}}$. We define the integral of $h$ against the probability measure $P_X$ by
\[
\int h(x) \, P_X(dx) = E\{h(X)\}.
\]
In particular, when $h(x) = 1(x \in A)$ and $A \in \mathcal{B}$,
\[
P_X(A) = \int 1(x \in A) \, P_X(dx).
\]
If $\nu$ is a sigma-finite measure on $\mathcal{B}$ such that $P_X(A) = 0$ whenever $\nu(A) = 0$, then we say that $P_X$ is absolutely continuous with respect to $\nu$. In this case, there exists a non-negative measurable function $f$, called the density of $P_X$ with respect to $\nu$, such that
\[
P_X(A) = \int 1(x \in A) f(x) \, \nu(dx).
\]
Also, for $X \sim P_X$ we have that
\[
E\{h(X)\} = \int h(x) \, P_X(dx) = \int h(x) f(x) \, \nu(dx).
\]

Example 1. Suppose that $(\mathcal{X}, \mathcal{B}) = (\mathbb{R}^p, \mathcal{B}_{\mathbb{R}^p})$, and $X$ is an $\mathbb{R}^p$-valued random variable (typically called a random vector) with distribution $P_X$ with density $f$ with respect to Lebesgue measure $\nu$. Then
\[
E\{h(X)\} = \int h(x) \, P_X(dx) = \int h(x) f(x) \, \nu(dx) = \int \cdots \int h(x_1, \ldots, x_p) f(x_1, \ldots, x_p) \, dx_1 \cdots dx_p.
\]

Example 2. Suppose that $\mathcal{X}$ is countable, so our measurable space is $(\mathcal{X}, 2^{\mathcal{X}})$. Let $X$ be an $\mathcal{X}$-valued random variable with distribution $P_X$ with density $f$ (called a probability mass function) with respect to counting measure $\nu$. Then
\[
E\{h(X)\} = \int h(x) \, P_X(dx) = \sum_{x \in \mathcal{X}} h(x) f(x).
\]

3 Discrete time homogeneous Markov chains

This section is based on Atchade (2008) and Jones (2013).

Definition 1. A transition kernel $Q$ on a measurable space $(\mathcal{X}, \mathcal{B})$ is a function $Q : \mathcal{X} \times \mathcal{B} \to [0, 1]$ such that $Q(x, \cdot) : \mathcal{B} \to [0, 1]$ is a probability measure on $\mathcal{B}$ (a distribution) for all $x \in \mathcal{X}$, and $Q(\cdot, A) : \mathcal{X} \to [0, 1]$ is a measurable function for all $A \in \mathcal{B}$.

Example 3. Take $\mathcal{X} = \mathbb{R}^p$ and $\mathcal{B} = \mathcal{B}_{\mathbb{R}^p}$. Let $\nu$ be Lebesgue measure on $\mathbb{R}^p$ and suppose that $\Sigma \in S^p_+$. Define
\[
Q(x, A) = \int 1(y \in A) (2\pi)^{-p/2} \{\det(\Sigma^{-1})\}^{1/2} \exp\{-0.5 (y - x)'\Sigma^{-1}(y - x)\} \, \nu(dy),
\]
for all $x \in \mathbb{R}^p$ and $A \in \mathcal{B}$. Then $Q$ is a transition kernel: $Q(x, \cdot)$ is a probability measure on $\mathcal{B}$ (it is the $N_p(x, \Sigma)$ distribution) and $\{x \in \mathbb{R}^p : Q(x, A) \in B\} \in \mathcal{B}_{\mathbb{R}^p}$ for every $B \in \mathcal{B}_{\mathbb{R}}$.

Definition 2. A sequence of $\mathcal{X}$-valued random variables $X_0, X_1, \ldots$ is a (time-homogeneous) Markov chain with transition kernel $Q$ if
\[
(X_{n+1} \mid X_0 = x_0, \ldots, X_n = x_n) \sim (X_{n+1} \mid X_n = x_n) \sim Q(x_n, \cdot),
\]
for each non-negative integer $n$. Here $P(X_{n+1} \in A \mid X_n = x_n) = Q(x_n, A)$ for all $A \in \mathcal{B}$.

To generate a realization of a Markov chain where $X_0$ has initial distribution $P_{X_0}$ and the chain has transition kernel $Q$, we

1. generate a realization $x_0$ of $X_0 \sim P_{X_0}$,

2. generate a realization $x_1$ of $(X_1 \mid X_0 = x_0) \sim Q(x_0, \cdot)$,

3. generate a realization $x_2$ of $(X_2 \mid X_1 = x_1) \sim Q(x_1, \cdot)$,

4. ...

The sequence $x_0, x_1, \ldots$ is a realization of the Markov chain. Let $X_0 \sim P_{X_0}$; then the marginal distribution/probability measure for $X_1$ at $A \in \mathcal{B}$ is
\[
P(X_1 \in A) = E[E\{1(X_1 \in A) \mid X_0\}] = \int \left\{ \int 1(x_1 \in A) \, Q(x_0, dx_1) \right\} P_{X_0}(dx_0) = \int Q(x_0, A) \, P_{X_0}(dx_0) \equiv P_{X_0}Q(A).
\]
We write this as $X_1 \sim P_{X_0}Q$. We also have that $(X_2 \mid X_0 = x_0) \sim QQ(x_0, \cdot)$:
\[
P(X_2 \in A \mid X_0 = x_0) = E[E\{1(X_2 \in A) \mid X_1, X_0 = x_0\} \mid X_0 = x_0] = \int \left\{ \int 1(x_2 \in A) \, Q(x_1, dx_2) \right\} Q(x_0, dx_1) = \int Q(x_1, A) \, Q(x_0, dx_1) \equiv QQ(x_0, A).
\]
So $X_2 \sim P_{X_0}QQ = P_{X_0}Q^{(2)}$. In general, $X_n \sim P_{X_0}Q^{(n)}$ and $(X_n \mid X_0 = x) \sim Q^{(n)}(x, \cdot)$.

Suppose that $h : \mathcal{X} \to \mathbb{R}$ is measurable, i.e. $\{x \in \mathcal{X} : h(x) \in B\} \in \mathcal{B}$ for every Borel set $B \in \mathcal{B}_{\mathbb{R}}$. Then
\[
E\{h(X_{n+1}) \mid X_n = x_n\} = \int h(x_{n+1}) \, Q(x_n, dx_{n+1}) \equiv Qh(x_n),
\]
for each $x_n \in \mathcal{X}$.

If $X_0 \sim P_X$ implies that $X_1 \sim P_X$, then $P_X$ is an invariant distribution for the transition kernel $Q$, i.e.
\[
P_X(A) = \int Q(x, A) \, P_X(dx),
\]
for all $A \in \mathcal{B}$. We write this as $P_X = P_X Q$. If the initial distribution is $P_X$, then $X_k \sim P_X$ for all $k$. We say that $Q$ is reversible with respect to $P_X$ if
\[
\iint h(x, y) \, Q(x, dy) \, P_X(dx) = \iint h(x, y) \, Q(y, dx) \, P_X(dy), \qquad (1)
\]
for any $h$ such that both sides of (1) exist. If the chain is reversible, then the distribution of $(X_i, X_{i+1}, \ldots, X_{i+k})$ is the same as the distribution of $(X_{i+k}, \ldots, X_i)$ for all $i$ and $k$. Reversibility implies invariance:
\[
P_X Q(A) = \int Q(x, A) \, P_X(dx) = \iint 1(y \in A) \, Q(x, dy) \, P_X(dx) = \iint 1(y \in A) \, Q(y, dx) \, P_X(dy) = \int 1(y \in A) \, P_X(dy) \int Q(y, dx) = P_X(A).
\]
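On a countable state space, reversibility (detailed balance) reads $f(x)Q(x, y) = f(y)Q(y, x)$ and invariance reads $\sum_x f(x)Q(x, y) = f(y)$, so the implication above can be checked numerically. A minimal sketch on a three-state chain (the symmetric edge weights below are arbitrary illustrative numbers, not from the notes):

```python
# Detailed balance pi[i]*Q[i][j] == pi[j]*Q[j][i] implies pi Q = pi.
# A reversible chain is built from symmetric edge weights w[i][j]:
# Q[i][j] = w[i][j] / row_sum(i), and pi[i] proportional to row_sum(i).

w = [[0.0, 2.0, 1.0],
     [2.0, 0.0, 3.0],
     [1.0, 3.0, 0.0]]  # symmetric weights (illustrative)

row = [sum(r) for r in w]
total = sum(row)
Q = [[w[i][j] / row[i] for j in range(3)] for i in range(3)]
pi = [row[i] / total for i in range(3)]

# check detailed balance: pi(x) Q(x, y) = pi(y) Q(y, x)
for i in range(3):
    for j in range(3):
        assert abs(pi[i] * Q[i][j] - pi[j] * Q[j][i]) < 1e-12

# check invariance: (pi Q)(y) = pi(y) for every state y
piQ = [sum(pi[i] * Q[i][j] for i in range(3)) for j in range(3)]
assert all(abs(piQ[j] - pi[j]) < 1e-12 for j in range(3))
```

The weighted-graph construction is a standard way to produce a chain satisfying detailed balance; any symmetric nonnegative weight matrix works.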

4 Convergence of Markov chains used in MCMC

Let $(\mathcal{X}, \mathcal{B})$ be a measurable space and let $X_0, X_1, \ldots$ be an $\mathcal{X}$-valued Markov chain with transition kernel $Q$ and invariant distribution $P_X$.

1. The chain is aperiodic if there do not exist a $d \ge 2$ and disjoint subsets $\mathcal{X}_1, \ldots, \mathcal{X}_d$ of $\mathcal{X}$ such that $Q(x, \mathcal{X}_1) = 1$ for all $x \in \mathcal{X}_d$ and $Q(x, \mathcal{X}_i) = 1$ for all $x \in \mathcal{X}_{i-1}$ ($i = 2, \ldots, d$).

2. The chain is $P_X$-irreducible if for any $A \in \mathcal{B}$ with $P_X(A) > 0$, we have $P(t_A < \infty \mid X_0 = x) > 0$ for all $x \in \mathcal{X}$, where $t_A = \min(n > 0 : X_n \in A)$.

3. The chain is Harris if for any $A \in \mathcal{B}$ with $P_X(A) > 0$, we have $P(t_A < \infty \mid X_0 = x) = 1$ for all $x \in \mathcal{X}$, where $t_A = \min(n > 0 : X_n \in A)$.

Define the total variation distance between measures $P_X$ and $P_Y$ by
\[
\|P_X - P_Y\|_{TV} = \sup_{A \in \mathcal{B}} |P_X(A) - P_Y(A)|.
\]
The norm is $\|P_Z\|_{TV} = \sup_{A \in \mathcal{B}} P_Z(A) - \inf_{A \in \mathcal{B}} P_Z(A)$.

Theorem 1. If the $\mathcal{X}$-valued Markov chain $X_0, X_1, \ldots$ with invariant distribution $P_X$ and transition kernel $Q$ is $P_X$-irreducible, then $P_X$ is the unique invariant distribution. Furthermore, if the chain is aperiodic, then there exists a set $N$ with $P_X(N) = 0$ such that for all $x \notin N$,
\[
\|Q^{(n)}(x, \cdot) - P_X(\cdot)\|_{TV} \to 0,
\]
as $n \to \infty$. This holds for all $x \in \mathcal{X}$ if the chain is Harris. This result is taken from "Convergence of Markov chains from all starting points with applications to Metropolis–Hastings algorithms" by R. L. Tweedie (1999).

5 Algorithms

5.1 Metropolis–Hastings

Let $(\mathcal{X}, \mathcal{B})$ be a measurable space. Our goal is to approximately sample from the distribution $P_X$. We will create an $\mathcal{X}$-valued Markov chain $X_0, X_1, \ldots$ with transition kernel $Q$ so that the target distribution $P_X$ is invariant for $Q$. Let $P_X$ have density $f$ with respect to the sigma-finite measure $\nu$: $P_X(dx) = f(x)\nu(dx)$. Let $G$ be a proposal kernel where $G(x, dy) = g(x, y)\nu(dy)$, so $g(x, \cdot)$ is the density for $G(x, \cdot)$ with respect to $\nu$. Define
\[
\alpha(x, y) = \min\left\{1, \frac{f(y) g(y, x)}{f(x) g(x, y)}\right\}. \qquad (2)
\]
The denominator of the Hastings ratio in (2) is always positive provided that we started the chain at a point $x \in \mathcal{X}$ such that $f(x) > 0$. Of course, $g(x, y) > 0$ because $y$ is generated from the distribution with density $g(x, \cdot)$. In (2), we only need to know unnormalized densities $\tilde{f}(x) = K_f f(x)$ and $\tilde{g}(x, y) = K_g g(x, y)$, because these constants cancel in the Hastings ratio.

Algorithm 1. Pick or generate $x_0 \in \mathcal{X}$ with $f(x_0) > 0$ and set $n = 0$.

1. Given $X_n = x_n$, generate $Z \sim G(x_n, \cdot)$ and independently generate $U \sim \mathrm{Unif}(0, 1)$.

2. If $U \le \alpha(x_n, Z)$ then set $X_{n+1} = Z$. Otherwise set $X_{n+1} = X_n$.

3. Replace $n$ by $n + 1$ and go to step 1.

We can express the transition kernel $Q$ as
\[
Q(x, A) = \int 1(y \in A) \alpha(x, y) g(x, y) \, \nu(dy) + \left\{1 - \int \alpha(x, y) g(x, y) \, \nu(dy)\right\} I(x, A) = \int 1(y \in A) \alpha(x, y) g(x, y) \, \nu(dy) + r(x) I(x, A), \qquad (3)
\]
where $I$ is the identity kernel and $r(x) = 1 - \int \alpha(x, y) g(x, y) \, \nu(dy)$ is the marginal probability that the chain remains at $x$. We typically define $Q^{(0)}(x, A) \equiv I(x, A) = 1(x \in A)$. The measure $I(x, \cdot)$ does not have a density with respect to $\nu$. To derive (3), use the decomposition
\[
P(X_{n+1} \in A \mid X_n = x) = P(X_{n+1} \in A, U \le \alpha(x, Z) \mid X_n = x) + P(X_{n+1} \in A, U > \alpha(x, Z) \mid X_n = x) = T_1 + T_2.
\]
Then
\[
T_1 = E[E\{1(Z \in A, U \le \alpha(x, Z)) \mid Z, X_n = x\} \mid X_n = x] = E\left[\int_0^1 1(Z \in A) 1(u \le \alpha(x, Z)) \, du \;\Big|\; X_n = x\right] = E[1(Z \in A)\alpha(x, Z) \mid X_n = x] = \int 1(z \in A) \alpha(x, z) g(x, z) \, \nu(dz),
\]
and
\[
T_2 = E[E\{1(x \in A, U > \alpha(x, Z)) \mid Z, X_n = x\} \mid X_n = x] = E\left[\int_0^1 1(x \in A) 1(u > \alpha(x, Z)) \, du \;\Big|\; X_n = x\right] = E[1(x \in A)\{1 - \alpha(x, Z)\} \mid X_n = x] = 1(x \in A) - 1(x \in A)\int \alpha(x, z) g(x, z) \, \nu(dz).
\]

Given the transition kernel expression in (3), we will show the chain is reversible with respect to $P_X$, i.e.
\[
\iint h(x, y) \, Q(x, dy) \, P_X(dx) = \iint h(x, y) \, Q(y, dx) \, P_X(dy), \qquad (4)
\]
for all $h$ such that both sides of (4) exist. Recall that $r(x) = 1 - \int \alpha(x, z) g(x, z) \, \nu(dz)$. Starting with the left-hand side of (4),
\[
\begin{aligned}
\iint h(x, y) \, Q(x, dy) \, P_X(dx) &= \iint h(x, y) \alpha(x, y) g(x, y) \, \nu(dy) f(x) \, \nu(dx) + \iint h(x, y) \, I(x, dy) \, r(x) f(x) \, \nu(dx) \\
&= \iint h(x, y) \min\{g(x, y) f(x), f(y) g(y, x)\} \, \nu(dy) \, \nu(dx) + \int h(x, x) r(x) f(x) \, \nu(dx) \\
&= \iint h(x, y) \min\{g(y, x) f(y), f(x) g(x, y)\} \, \nu(dx) \, \nu(dy) + \int h(y, y) r(y) f(y) \, \nu(dy) \\
&= \iint h(x, y) \alpha(y, x) g(y, x) \, \nu(dx) f(y) \, \nu(dy) + \int h(y, y) r(y) f(y) \, \nu(dy) \\
&= \iint h(x, y) \, Q(y, dx) \, P_X(dy).
\end{aligned}
\]
So we have shown that the chain is reversible with respect to $P_X$, which implies that $P_X$ is the invariant distribution. We also have that if $P_X$ is positive on $\mathcal{X}$ and there exist $\epsilon, R > 0$ such that
\[
\inf_{y \in B(x, R)} \min\{g(x, y), g(y, x)\} > \epsilon,
\]
for all $x \in \mathcal{X}$, then $Q$ is $P_X$-irreducible and aperiodic, so $P_X$ is the unique invariant distribution and we have convergence in the TV norm.
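Algorithm 1 can be sketched generically in code. In this sketch the function names (`metropolis_hastings`, `f_tilde`, `propose`, `g`) are mine, not from the notes, and the usage example targets an Exp(1) density with a uniform random walk proposal purely for illustration:

```python
import math
import random

def metropolis_hastings(f_tilde, propose, g, x0, n_iter, rng):
    """Sketch of Algorithm 1 (Metropolis-Hastings).

    f_tilde : unnormalized target density (constants cancel in the ratio)
    propose : draws Z ~ G(x, .) given the current state x
    g       : proposal density g(x, y)
    """
    chain = [x0]
    x = x0
    for _ in range(n_iter):
        z = propose(x, rng)
        # Hastings ratio f(z) g(z, x) / {f(x) g(x, z)}
        ratio = (f_tilde(z) * g(z, x)) / (f_tilde(x) * g(x, z))
        if rng.random() < min(1.0, ratio):
            x = z            # accept the proposal
        chain.append(x)      # on rejection the chain remains at x
    return chain

# usage sketch: target Exp(1) on (0, inf), proposal Unif(x - 1, x + 1)
rng = random.Random(1)
f = lambda x: math.exp(-x) if x > 0 else 0.0
prop = lambda x, r: x + r.uniform(-1.0, 1.0)
gdens = lambda x, y: 0.5 if abs(y - x) <= 1.0 else 0.0
chain = metropolis_hastings(f, prop, gdens, 1.0, 20000, rng)
```

Since the starting point has $f(x_0) > 0$ and proposals with $f = 0$ are never accepted, the chain stays in the support of the target.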

5.2 Random walk Metropolis

If the density of our proposal kernel $g(x, \cdot)$ is of the form $g(x, y) = g^*(y - x)$, where $g^*$ is a density, then we call the resulting MH algorithm a random walk Metropolis algorithm. In this case we can write the Hastings ratio in (2) as
\[
\frac{f(y) g(y, x)}{f(x) g(x, y)} = \frac{f(y) g^*(x - y)}{f(x) g^*(y - x)}.
\]
If $g^*$ is symmetric, i.e. $g^*(-u) = g^*(u)$, then the Hastings ratio simplifies:
\[
\frac{f(y) g(y, x)}{f(x) g(x, y)} = \frac{f(y) g^*(y - x)}{f(x) g^*(y - x)} = \frac{f(y)}{f(x)}.
\]
For example, $\mathcal{X} = \mathbb{R}^p$ with $G(x, \cdot) = N_p(x, \Sigma)$, or $\mathcal{X} = \mathbb{R}$ with $G(x, \cdot) = \mathrm{Unif}(x - b, x + b)$.

5.2.1 Example: approximately sample from $N(\mu, \sigma^2)$

We give a simple univariate illustration of the random walk Metropolis algorithm to approximately sample from $N(\mu, \sigma^2)$. We will use the symmetric kernel $G(x, \cdot) = \mathrm{Unif}(x - b, x + b)$. Then $g(x, y) \propto 1\{y \in (x - b, x + b)\} = 1\{y - x \in (-b, b)\}$ and $f(x) \propto \exp\{-(x - \mu)^2/(2\sigma^2)\}$. Then
\[
\alpha(x, y) = \min\left\{1, \frac{f(y)}{f(x)}\right\} = \min\left(1, \exp\left[\frac{1}{2\sigma^2}\left\{(x - \mu)^2 - (y - \mu)^2\right\}\right]\right).
\]
We pick $X_0 \in \mathbb{R}$, set $n = 0$, and perform the following steps:

1. Given $X_n = x_n$, generate $Z \sim \mathrm{Unif}(x_n - b, x_n + b)$ and independently generate $U \sim \mathrm{Unif}(0, 1)$.

2. If $U \le \alpha(x_n, Z)$ then set $X_{n+1} = Z$. Otherwise set $X_{n+1} = X_n$.

3. Replace $n$ by $n + 1$ and go to step 1.
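The steps above can be sketched in a few lines of Python; the choices of $\mu$, $\sigma$, $b$, and the chain length below are arbitrary illustrative values:

```python
import math
import random

def rw_metropolis_normal(mu, sigma, b, n_iter, seed=0):
    """Random walk Metropolis targeting N(mu, sigma^2)
    with a Unif(x - b, x + b) proposal."""
    rng = random.Random(seed)
    x = mu  # any starting point with f(x) > 0 works
    chain = []
    for _ in range(n_iter):
        z = x + rng.uniform(-b, b)
        # log acceptance probability: min{0, log f(z) - log f(x)}
        log_alpha = ((x - mu) ** 2 - (z - mu) ** 2) / (2.0 * sigma ** 2)
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = z
        chain.append(x)
    return chain

chain = rw_metropolis_normal(mu=2.0, sigma=1.0, b=2.5, n_iter=50000)
mean_est = sum(chain) / len(chain)
var_est = sum((c - mean_est) ** 2 for c in chain) / len(chain)
```

Working on the log scale avoids overflow in `exp` for extreme proposals; the same trick is used in the bivariate example below.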

5.2.2 Example: approximately sample from a bivariate posterior

Suppose we measured heights $x_1, \ldots, x_n$. Assume that $x = (x_1, \ldots, x_n)'$ is a realization of $X$, where
\[
\begin{aligned}
(X \mid M = \mu, V = v) &\sim N_n(\mu 1_n, v I_n), \\
(M \mid V = v) &\sim N(\mu_M, v_M), \\
V &\sim \mathrm{InvGam}(\alpha, \beta),
\end{aligned}
\]
where $v_M, \mu_M, \alpha$, and $\beta$ are user-specified. The density for $\mathrm{InvGam}(\alpha, \beta)$ evaluated at $v$ is proportional to $v^{-(\alpha+1)} e^{-\beta/v}$. The posterior density at $(\mu, v)$ is
\[
\begin{aligned}
f(\mu, v \mid x) &\propto v^{-(\alpha+1)} e^{-\beta/v} \, v^{-n/2} \exp\left\{-\frac{1}{2v}\sum_{i=1}^n (x_i - \mu)^2\right\} \exp\left\{-\frac{1}{2v_M}(\mu - \mu_M)^2\right\} \\
&= v^{-(\alpha+1+n/2)} \exp\left\{-\frac{\beta}{v} - \frac{1}{2v}\sum_{i=1}^n (x_i - \mu)^2 - \frac{1}{2v_M}(\mu - \mu_M)^2\right\}.
\end{aligned}
\]
For illustration, we will use a random walk Metropolis algorithm with a $N_2\{(m_n, v_n)', D\}$ trial distribution to approximately sample from $(M, V \mid X = x)$. This is an inefficient approach. The covariance matrix will be diagonal: $D = \mathrm{diag}(s_1^2, s_2^2)$. Also,
\[
\log \alpha\{(\mu_x, v_x), (\mu_y, v_y)\} = \min\{0, \log f(\mu_y, v_y \mid x) - \log f(\mu_x, v_x \mid x)\},
\]
where
\[
\log f(\mu_y, v_y \mid x) = -(\alpha + 1 + n/2)\log v_y - \frac{1}{2v_y}\sum_{i=1}^n (x_i - \mu_y)^2 - \frac{1}{2v_M}(\mu_y - \mu_M)^2 - \beta/v_y,
\]
provided that $v_y > 0$; otherwise we define $\log f(\mu_y, v_y \mid x) = -\infty$. To summarize the algorithm, we pick $(\mu_0, v_0) \in \mathbb{R}^2$ and $D = \mathrm{diag}(s_1^2, s_2^2)$, set $n = 0$, and perform the following steps:

1. Generate $Z \sim N_2\{(\mu_n, v_n)', D\}$ and independently generate $U \sim \mathrm{Unif}(0, 1)$.

2. If $\log U \le \log \alpha\{(\mu_n, v_n), Z\}$ then set $(\mu_{n+1}, v_{n+1})' = Z$. Otherwise set $(\mu_{n+1}, v_{n+1})' = (\mu_n, v_n)'$.

3. Replace $n$ by $n + 1$ and go to step 1.
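A minimal sketch of this bivariate random walk Metropolis sampler follows; the simulated heights, prior parameters, and proposal standard deviations are invented for illustration (a diagonal $D$ corresponds to independent normal increments in each coordinate):

```python
import math
import random

def log_post(mu, v, data, mu_M, v_M, a, b):
    """log f(mu, v | x) up to an additive constant; -inf when v <= 0."""
    if v <= 0:
        return -math.inf
    n = len(data)
    ss = sum((xi - mu) ** 2 for xi in data)
    return (-(a + 1 + n / 2.0) * math.log(v) - ss / (2.0 * v)
            - (mu - mu_M) ** 2 / (2.0 * v_M) - b / v)

rng = random.Random(7)
data = [rng.gauss(170.0, 7.0) for _ in range(100)]  # simulated heights
mu_M, v_M, a, b = 170.0, 100.0, 2.0, 2.0            # illustrative priors
s1, s2 = 1.0, 10.0                                  # proposal std devs

mu, v = 170.0, 50.0
mus, vs = [], []
for _ in range(20000):
    mu_prop = mu + s1 * rng.gauss(0.0, 1.0)
    v_prop = v + s2 * rng.gauss(0.0, 1.0)
    log_a = (log_post(mu_prop, v_prop, data, mu_M, v_M, a, b)
             - log_post(mu, v, data, mu_M, v_M, a, b))
    if rng.random() < math.exp(min(0.0, log_a)):
        mu, v = mu_prop, v_prop   # accept; a proposal with v <= 0 has
    mus.append(mu)                # log density -inf and is never accepted
    vs.append(v)
```

Note that negative proposed variances are handled automatically: their log posterior is $-\infty$, so the acceptance probability is zero.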

5.3 Independence sampler

If the density of our proposal kernel $g(x, \cdot)$ is of the form $g(x, y) = g^*(y)$, where $g^*$ is a density, then we call the resulting MH algorithm an independence sampler. In this case we can write the ratio in (2) as
\[
\frac{f(y) g(y, x)}{f(x) g(x, y)} = \frac{f(y) g^*(x)}{f(x) g^*(y)}.
\]
We need to select $g^*$ carefully to ensure that the Markov chain converges to the target distribution.
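As a small sketch, take the target $f(x) \propto x(1 - x)$ on $(0, 1)$ (a Beta(2, 2) density) and $g^* = \mathrm{Unif}(0, 1)$; then $g^*$ cancels and the ratio is $f(y)/f(x)$. The target and chain length here are my illustrative choices, not from the notes:

```python
import random

def indep_sampler_beta22(n_iter, seed=3):
    """Independence sampler targeting f(x) proportional to x(1 - x)
    on (0, 1), with Unif(0, 1) proposals, so g* is constant and the
    Hastings ratio reduces to f(y)/f(x)."""
    rng = random.Random(seed)
    f = lambda x: x * (1.0 - x)
    x = 0.5
    chain = []
    for _ in range(n_iter):
        z = rng.random()          # Z ~ g*, independent of the current state
        if rng.random() < min(1.0, f(z) / f(x)):
            x = z
        chain.append(x)
    return chain

chain = indep_sampler_beta22(40000)
mean_est = sum(chain) / len(chain)   # Beta(2, 2) has mean 1/2
```

Here the proposal's support covers the target's support, which is essential; an independence sampler whose proposal misses part of the support never visits that region.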

5.4 Variable-at-a-time Metropolis–Hastings and the Gibbs sampler

Suppose that there are intermediate $\mathcal{X}$-valued random variables in the Markov chain:
\[
\ldots, X_n, X_{n,b}, X_{n+1}, X_{n+1,b}, X_{n+2}, \ldots
\]
From $X_n$ to $X_{n,b}$ we use kernel $Q_1$ and from $X_{n,b}$ to $X_{n+1}$ we use kernel $Q_2$. The kernel $Q$ for the combination/composition of $Q_1$ and $Q_2$ is
\[
\begin{aligned}
Q(x_n, A) &= P(X_{n+1} \in A \mid X_n = x_n) \\
&= E[E\{1(X_{n+1} \in A) \mid X_{n,b}, X_n = x_n\} \mid X_n = x_n] \\
&= \int \left\{ \int 1(x_{n+1} \in A) \, Q_2(x_{n,b}, dx_{n+1}) \right\} Q_1(x_n, dx_{n,b}) \\
&= \int Q_2(x_{n,b}, A) \, Q_1(x_n, dx_{n,b}) \\
&\equiv Q_1 Q_2(x_n, A).
\end{aligned}
\]
Some algorithms for MCMC involve creating a transition kernel $Q$ from sub-transition kernels $Q_1, \ldots, Q_p$. So long as each $Q_j$ has invariant distribution $P_X$, i.e. $P_X Q_j = P_X$, then $P_X Q_1 \cdots Q_p = P_X Q = P_X$, because $(P_X Q_1) Q_2 = P_X Q_2 = P_X$, so $((P_X Q_1) Q_2) Q_3 = P_X Q_3 = P_X$, and so on.

Suppose that $X_n = (X_{n1}, \ldots, X_{np})$. In variable-at-a-time Metropolis–Hastings, we update $X_{nj}$ with the proposal density $g_j(x, \cdot)$ for $j = 1, \ldots, p$. Define
\[
\alpha_j(x, y) = \min\left\{1, \frac{f(x^{(y,j)}) g_j(x^{(y,j)}, x_j)}{f(x) g_j(x, y)}\right\}, \qquad (5)
\]
where $x^{(y,j)} = (x_1, \ldots, x_{j-1}, y, x_{j+1}, \ldots, x_p)$. This denominator is always positive if we start the chain at $x$ with $f(x) > 0$.

Algorithm 2 (Variable-at-a-time Metropolis–Hastings). Pick or generate $X_0 \in \mathcal{X}$ with $f(X_0) > 0$ and set $n = 0$.

1. Set $Y = (Y_1, \ldots, Y_p) = X_n = (X_{n1}, \ldots, X_{np})$.

2. Generate $Z \sim G_1(Y, \cdot)$ and independently generate $U \sim \mathrm{Unif}(0, 1)$. If $U \le \alpha_1(Y, Z)$ then set $Y_1 = Z$; otherwise set $Y_1 = X_{n1}$.

3. Generate $Z \sim G_2(Y, \cdot)$ and independently generate $U \sim \mathrm{Unif}(0, 1)$. If $U \le \alpha_2(Y, Z)$ then set $Y_2 = Z$; otherwise set $Y_2 = X_{n2}$.

4. ...

5. Generate $Z \sim G_p(Y, \cdot)$ and independently generate $U \sim \mathrm{Unif}(0, 1)$. If $U \le \alpha_p(Y, Z)$ then set $Y_p = Z$; otherwise set $Y_p = X_{np}$.

6. Set $X_{n+1} = Y$, replace $n$ by $n + 1$, and go to step 1.

The Gibbs sampler is a special case of Algorithm 2. Let $x_{-j}$ be the vector $x$ with its $j$th entry removed. Take $g_j(x, y) = f_j(y \mid x_{-j})$, where $f_j$ is the density for the conditional distribution of the $j$th variable in $X \sim f$ given the others. In this case $\alpha_j(x, y) = 1$, because the ratio in (5) is
\[
\frac{f(x^{(y,j)}) g_j(x^{(y,j)}, x_j)}{f(x) g_j(x, y)} = \frac{f(x^{(y,j)}) f_j(x_j \mid x_{-j})}{f(x) f_j(y \mid x_{-j})} = \frac{f(x^{(y,j)}) K_j f(x)}{f(x) K_j f(x^{(y,j)})} = 1.
\]
So the Gibbs sampler always accepts. Let $Y = (Y_1, \ldots, Y_p)$ be a random vector with the target distribution (with density $f$). The Gibbs sampler (with a fixed scan) simplifies to the following:

Algorithm 3 (Gibbs sampler). Pick or generate $x_0 = (x_{01}, \ldots, x_{0p}) \in \mathcal{X}$ with $f(x_0) > 0$ and set $n = 0$.

1. Generate a realization $x_{n+1,1}$ of $(Y_1 \mid Y_2 = x_{n2}, \ldots, Y_p = x_{np})$.

2. Generate a realization $x_{n+1,2}$ of $(Y_2 \mid Y_1 = x_{n+1,1}, Y_3 = x_{n3}, \ldots, Y_p = x_{np})$.

3. ...

4. Generate a realization $x_{n+1,p}$ of $(Y_p \mid Y_1 = x_{n+1,1}, \ldots, Y_{p-1} = x_{n+1,p-1})$.

5. Replace $n$ by $n + 1$ and go to step 1.

It is important that the steps in Algorithm 3 are performed sequentially and not in parallel. We can modify the order of this sequence.

5.4.1 Example: approximately sample from a bivariate posterior

Let $x = (x_1, \ldots, x_n)$ be measurements of some numerical characteristic on $n$ subjects. Suppose that $x$ is a realization of $X$, where
\[
\begin{aligned}
(X \mid M = \mu, V = v) &\sim N_n(\mu 1_n, v I_n), \\
(M \mid V = v) &\sim N(\mu_M, v_M), \\
V &\sim \mathrm{InvGam}(\alpha, \beta),
\end{aligned}
\]
where $\mu_M, v_M, \alpha, \beta$ are user-specified prior parameters. The density for $\mathrm{InvGam}(\alpha, \beta)$ evaluated at $v$ is proportional to $v^{-(\alpha+1)} e^{-\beta/v}$. The joint density is defined by
\[
f(\mu, v, x) \propto v^{-(\alpha+1)} e^{-\beta/v} \, v^{-n/2} \exp\left\{-\frac{1}{2v}\sum_{i=1}^n (x_i - \mu)^2\right\} \exp\left\{-\frac{1}{2v_M}(\mu - \mu_M)^2\right\}.
\]
Then
\[
f(v \mid \mu, x) \propto v^{-n/2} e^{-\beta/v} \exp\left\{-\frac{1}{2v}\sum_{i=1}^n (x_i - \mu)^2\right\} v^{-(\alpha+1)} = v^{-(\alpha+1+n/2)} \exp\left[-\left\{\beta + \frac{1}{2}\sum_{i=1}^n (x_i - \mu)^2\right\}\Big/ v\right].
\]
Thus
\[
(V \mid M = \mu, X = x) \sim \mathrm{InvGam}\left(\alpha + \frac{n}{2},\ \beta + \frac{1}{2}\sum_{i=1}^n (x_i - \mu)^2\right).
\]
We also have that
\[
\begin{aligned}
f(\mu \mid v, x) &\propto \exp\left\{-\frac{1}{2v}\sum_{i=1}^n (x_i - \mu)^2\right\} \exp\left\{-\frac{1}{2v_M}(\mu - \mu_M)^2\right\} \\
&\propto \exp\left(-0.5 v^{-1}\sum_{i=1}^n x_i^2 + \mu v^{-1}\sum_{i=1}^n x_i - 0.5 v^{-1} n \mu^2 - 0.5 v_M^{-1}\mu^2 + \mu v_M^{-1}\mu_M\right) \\
&\propto \exp\left\{-0.5\mu^2 (n v^{-1} + v_M^{-1}) + \mu(v^{-1} n \bar{x} + v_M^{-1}\mu_M)\right\}.
\end{aligned}
\]
The density for $N(m, w)$ is proportional in $y$ to $\exp(-0.5 y^2 w^{-1} + y m w^{-1})$, so
\[
(M \mid V = v, X = x) \sim N\left(\frac{n\bar{x} + v v_M^{-1}\mu_M}{n + v v_M^{-1}},\ \frac{1}{n v^{-1} + v_M^{-1}}\right).
\]
To summarize, we pick $(v_0, m_0)$ and set $n = 0$.

1. Generate $v_{n+1}$ from $(V \mid M = m_n, X = x)$.

2. Generate $m_{n+1}$ from $(M \mid V = v_{n+1}, X = x)$.

3. Replace $n$ by $n + 1$ and go to step 1.
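A minimal sketch of this two-block Gibbs sampler in Python; the data are simulated and the prior parameters are arbitrary illustrative values. The only non-obvious step is drawing from an inverse gamma: if $W \sim \mathrm{Gamma}(\text{shape} = a, \text{rate} = b)$ then $1/W \sim \mathrm{InvGam}(a, b)$, and `random.gammavariate` is parameterized by shape and scale $= 1/\text{rate}$:

```python
import random

def gibbs_normal_invgamma(data, mu_M, v_M, a, b, n_iter, seed=11):
    """Gibbs sampler for (M, V | X = x): normal data with a N(mu_M, v_M)
    prior on the mean M and an InvGam(a, b) prior on the variance V."""
    rng = random.Random(seed)
    n = len(data)
    xbar = sum(data) / n
    m, v = xbar, 1.0   # starting values
    ms, vs = [], []
    for _ in range(n_iter):
        # V | M = m, X = x  ~  InvGam(a + n/2, b + 0.5 * sum (x_i - m)^2)
        shape = a + n / 2.0
        rate = b + 0.5 * sum((xi - m) ** 2 for xi in data)
        v = 1.0 / rng.gammavariate(shape, 1.0 / rate)
        # M | V = v, X = x  ~  N(post_mean, post_var)
        post_var = 1.0 / (n / v + 1.0 / v_M)
        post_mean = post_var * (n * xbar / v + mu_M / v_M)
        m = rng.gauss(post_mean, post_var ** 0.5)
        ms.append(m)
        vs.append(v)
    return ms, vs

rng = random.Random(1)
data = [rng.gauss(5.0, 2.0) for _ in range(200)]   # simulated data
ms, vs = gibbs_normal_invgamma(data, mu_M=0.0, v_M=100.0, a=2.0, b=2.0,
                               n_iter=5000)
```

Because both full conditionals are standard distributions, every draw is accepted, in contrast with the random walk sampler of Section 5.2.2 for the same model.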

5.4.2 Example: fit a linear random effects model

Let $y_{it}$ be the observed response for the $i$th subject at time $t$. Suppose that $y_{it}$ is a realization of $Y_{it}$, where
\[
\begin{aligned}
(Y_{it} \mid M = m, A = a, V_E = v_E, V_A = v_A) &\sim_{\text{ind}} N(m + a_i, v_E), \quad i = 1, \ldots, n, \ t = 1, \ldots, T_i, \\
(M \mid A = a, V_E = v_E, V_A = v_A) &\sim N(0, v_M), \\
(A \mid V_E = v_E, V_A = v_A) &\sim N_n(0, v_A I_n), \\
(V_E \mid V_A = v_A) &\sim \mathrm{InvGam}(\alpha_E, \beta_E), \\
V_A &\sim \mathrm{InvGam}(\alpha_A, \beta_A).
\end{aligned}
\]
Let $y$ be a realization of $Y = (Y_{11}, \ldots, Y_{nT_n})$. Recall that the density for $\mathrm{InvGam}(\alpha, \beta)$ evaluated at $v$ is proportional to $v^{-(\alpha+1)} e^{-\beta/v}$. The joint density is defined by
\[
\begin{aligned}
f(y, a, m, v_E, v_A) \propto{}& v_A^{-(\alpha_A+1)} e^{-\beta_A/v_A} \, v_E^{-(\alpha_E+1)} e^{-\beta_E/v_E} \\
&\times v_A^{-n/2}\exp\left(-\frac{1}{2v_A}\sum_{i=1}^n a_i^2\right) v_M^{-1/2}\exp\left(-\frac{1}{2v_M}m^2\right) \\
&\times v_E^{-\sum_{i=1}^n T_i/2}\exp\left\{-\frac{1}{2v_E}\sum_{i=1}^n\sum_{t=1}^{T_i}(y_{it} - m - a_i)^2\right\}.
\end{aligned}
\]
So we have that
\[
\begin{aligned}
f(v_A \mid a, m, v_E, y) &\propto v_A^{-(\alpha_A+1+n/2)}\exp\left\{-\frac{1}{v_A}\left(\beta_A + \frac{1}{2}\sum_{i=1}^n a_i^2\right)\right\}, \\
f(v_E \mid a, m, v_A, y) &\propto v_E^{-(\alpha_E+1+\frac{1}{2}\sum_{i=1}^n T_i)}\exp\left[-\frac{1}{v_E}\left\{\beta_E + \frac{1}{2}\sum_{i=1}^n\sum_{t=1}^{T_i}(y_{it} - m - a_i)^2\right\}\right].
\end{aligned}
\]
Thus
\[
\begin{aligned}
(V_A \mid A = a, M = m, V_E = v_E, Y = y) &\sim \mathrm{InvGam}\left(\alpha_A + \frac{n}{2},\ \beta_A + \frac{1}{2}\sum_{i=1}^n a_i^2\right), \\
(V_E \mid A = a, M = m, V_A = v_A, Y = y) &\sim \mathrm{InvGam}\left(\alpha_E + \frac{1}{2}\sum_{i=1}^n T_i,\ \beta_E + \frac{1}{2}\sum_{i=1}^n\sum_{t=1}^{T_i}(y_{it} - m - a_i)^2\right).
\end{aligned}
\]
We also have that
\[
\begin{aligned}
f(m \mid a, v_A, v_E, y) &\propto \exp\left\{-\frac{1}{2v_E}\sum_{i=1}^n\sum_{t=1}^{T_i}(y_{it} - a_i - m)^2\right\}\exp\left(-\frac{1}{2v_M}m^2\right) \\
&\propto \exp\left[-\frac{1}{2v_E}\sum_{i=1}^n\sum_{t=1}^{T_i}\left\{(y_{it} - a_i)^2 - 2(y_{it} - a_i)m + m^2\right\} - \frac{1}{2v_M}m^2\right] \\
&\propto \exp\left\{-0.5m^2\left(\sum_{i=1}^n T_i/v_E + 1/v_M\right) + m\sum_{i=1}^n\sum_{t=1}^{T_i}(y_{it} - a_i)/v_E\right\}.
\end{aligned}
\]
This implies that
\[
(M \mid A = a, V_E = v_E, V_A = v_A, Y = y) \sim N\left(\frac{\sum_{i=1}^n\sum_{t=1}^{T_i}(y_{it} - a_i)}{\sum_{i=1}^n T_i + v_E/v_M},\ \frac{1}{\sum_{i=1}^n T_i/v_E + 1/v_M}\right).
\]
Let $a_{-i}$ be the vector $a$ with its $i$th entry removed. We also have that
\[
\begin{aligned}
f(a_i \mid a_{-i}, m, v_A, v_E, y) &\propto \exp\left\{-\frac{1}{2v_E}\sum_{t=1}^{T_i}(y_{it} - m - a_i)^2\right\}\exp\left(-\frac{1}{2v_A}a_i^2\right) \\
&\propto \exp\left[-\frac{1}{2v_E}\sum_{t=1}^{T_i}\left\{(y_{it} - m)^2 - 2a_i(y_{it} - m) + a_i^2\right\} - \frac{1}{2v_A}a_i^2\right] \\
&\propto \exp\left\{-0.5a_i^2(T_i/v_E + 1/v_A) + a_i\sum_{t=1}^{T_i}(y_{it} - m)/v_E\right\}.
\end{aligned}
\]
This implies that
\[
(A_i \mid M = m, A_{-i} = a_{-i}, V_E = v_E, V_A = v_A, Y = y) \sim N\left(\frac{\sum_{t=1}^{T_i}(y_{it} - m)}{T_i + v_E/v_A},\ \frac{1}{T_i/v_E + 1/v_A}\right),
\]
for $i = 1, \ldots, n$.

To summarize, we pick $(m_0, a_0, v_{A,0}, v_{E,0})$ and set $n = 0$.

1. Generate $m_{n+1}$ from $(M \mid A = a_n, V_E = v_{E,n}, V_A = v_{A,n}, Y = y)$.

2. Generate $a_{n+1,i}$ from $(A_i \mid M = m_{n+1}, V_E = v_{E,n}, V_A = v_{A,n}, Y = y)$ for $i = 1, \ldots, n$, in parallel.

3. Generate $v_{A,n+1}$ from $(V_A \mid A = a_{n+1}, M = m_{n+1}, V_E = v_{E,n}, Y = y)$.

4. Generate $v_{E,n+1}$ from $(V_E \mid A = a_{n+1}, M = m_{n+1}, V_A = v_{A,n+1}, Y = y)$.

5. Replace $n$ by $n + 1$ and go to step 1.
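These updates translate directly into code. Below is a sketch with simulated data (true $m = 3$, $v_A = 1$, $v_E = 0.25$, with $n = 30$ subjects and $T_i = 8$ observations each) and arbitrary illustrative prior parameters; inverse gamma draws again come from inverting gamma draws:

```python
import random

def gibbs_random_effects(y, v_M, aE, bE, aA, bA, n_iter, seed=5):
    """One-way random effects Gibbs sampler following the four updates
    above. y is a list of lists: y[i][t] is subject i's response at time t."""
    rng = random.Random(seed)
    n = len(y)
    T = [len(yi) for yi in y]
    inv_gamma = lambda shape, rate: 1.0 / rng.gammavariate(shape, 1.0 / rate)
    m, a, vA, vE = 0.0, [0.0] * n, 1.0, 1.0
    keep = {"m": [], "vA": [], "vE": []}
    for _ in range(n_iter):
        # M | rest: precision w, mean (sum of residuals / vE) / w
        w = sum(T) / vE + 1.0 / v_M
        m = rng.gauss(sum(yit - a[i] for i in range(n) for yit in y[i])
                      / vE / w, w ** -0.5)
        # A_i | rest, independently for i = 1, ..., n
        for i in range(n):
            wi = T[i] / vE + 1.0 / vA
            a[i] = rng.gauss(sum(yit - m for yit in y[i]) / vE / wi,
                             wi ** -0.5)
        # V_A | rest, then V_E | rest
        vA = inv_gamma(aA + n / 2.0, bA + 0.5 * sum(ai ** 2 for ai in a))
        sse = sum((yit - m - a[i]) ** 2 for i in range(n) for yit in y[i])
        vE = inv_gamma(aE + sum(T) / 2.0, bE + 0.5 * sse)
        keep["m"].append(m); keep["vA"].append(vA); keep["vE"].append(vE)
    return keep

rng = random.Random(2)
a_true = [rng.gauss(0.0, 1.0) for _ in range(30)]
y = [[3.0 + ai + rng.gauss(0.0, 0.5) for _ in range(8)] for ai in a_true]
out = gibbs_random_effects(y, v_M=100.0, aE=2.0, bE=2.0, aA=2.0, bA=2.0,
                           n_iter=2000)
```

The $A_i$ loop may run in any order (or in parallel) because the $A_i$ are conditionally independent given the other blocks, exactly as step 2 of the summary states.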


5.4.3 Example: Bayesian ridge regression

Let $y_i$ be the measured response for the $i$th case and let $x_i = (1, x_{i2}, \ldots, x_{ip})' \in \mathbb{R}^p$ be the values of the predictors for the $i$th case. Define $X \in \mathbb{R}^{n \times p}$ to have $i$th row $x_i'$. Assume that $(y_1, \ldots, y_n)'$ is a realization of $Y$, where
\[
\begin{aligned}
(Y \mid B = \beta, V = v, L = \lambda) &\sim N_n(X\beta, vI_n), \\
(B \mid V = v, L = \lambda) &\sim N_p\left(\tilde{\beta}, \frac{v}{\lambda} I_p\right), \\
(V \mid L = \lambda) &\sim \mathrm{InvGam}(a_V, b_V), \\
L &\sim \mathrm{Gamma}(a_L, b_L).
\end{aligned}
\]
The $\mathrm{Gamma}(a, b)$ distribution has density at $\lambda$ proportional to $\lambda^{a-1} e^{-b\lambda}$. Then
\[
\begin{aligned}
f(\beta, v, \lambda \mid y) \propto{}& v^{-(a_V+1)} e^{-b_V/v} \lambda^{a_L-1} e^{-b_L\lambda} \lambda^{p/2} v^{-p/2} \exp\left\{-\frac{\lambda}{2v}(\beta - \tilde{\beta})'(\beta - \tilde{\beta})\right\} \\
&\times v^{-n/2} \exp\left\{-\frac{1}{2v}(y - X\beta)'(y - X\beta)\right\} \\
={}& \lambda^{a_L-1+p/2} v^{-(a_V+1+p/2+n/2)} \exp\left\{-\frac{1}{2v}(y - X\beta)'(y - X\beta) - \frac{\lambda}{2v}(\beta - \tilde{\beta})'(\beta - \tilde{\beta}) - b_L\lambda - \frac{b_V}{v}\right\}.
\end{aligned}
\]
We will use the fact that $\phi(x, \mu, \Sigma) \propto_x \exp(-0.5 x'\Sigma^{-1}x + x'\Sigma^{-1}\mu)$. We have that
\[
\begin{aligned}
f(\beta \mid v, \lambda, y) &\propto \exp\left\{-\frac{1}{2v}(y - X\beta)'(y - X\beta) - \frac{\lambda}{2v}(\beta - \tilde{\beta})'(\beta - \tilde{\beta})\right\} \\
&\propto \exp\left\{-\frac{1}{2}\beta'\left(\frac{1}{v}X'X + \frac{\lambda}{v}I_p\right)\beta + \beta'\left(\frac{1}{v}X'y + \frac{\lambda}{v}\tilde{\beta}\right)\right\},
\end{aligned}
\]
which implies that
\[
(B \mid V = v, L = \lambda, Y = y) \sim N\left((X'X + \lambda I_p)^{-1}(X'y + \lambda\tilde{\beta}),\ v(X'X + \lambda I_p)^{-1}\right).
\]
Let $\mu_B = (X'X + \lambda I_p)^{-1}(X'y + \lambda\tilde{\beta})$. Then
\[
\begin{aligned}
f(\beta \mid v, \lambda, y) &\propto_{\beta, v} v^{-p/2} \exp\left\{-\frac{1}{2v}(\beta - \mu_B)'(X'X + \lambda I_p)(\beta - \mu_B)\right\} \\
&\propto_{\beta, v} v^{-p/2} \exp\left[-\frac{1}{2v}\left\{\beta'(X'X + \lambda I_p)\beta - 2\beta'(X'X + \lambda I_p)\mu_B + \mu_B'(X'X + \lambda I_p)\mu_B\right\}\right] \\
&\propto_{\beta, v} v^{-p/2} \exp\left[-\frac{1}{2v}\left\{\beta'(X'X + \lambda I_p)\beta - 2\beta'(X'y + \lambda\tilde{\beta}) + \mu_B'(X'X + \lambda I_p)\mu_B\right\}\right].
\end{aligned}
\]
We can write
\[
\begin{aligned}
f(\beta, v \mid \lambda, y) \propto{}& v^{-(a_V+1+n/2)} v^{-p/2} \exp\left[-\frac{1}{2v}\left\{y'y - 2\beta'X'y + \beta'X'X\beta + \lambda\beta'\beta - 2\lambda\beta'\tilde{\beta} + \lambda\tilde{\beta}'\tilde{\beta} + 2b_V\right\}\right] \\
={}& f(\beta \mid v, \lambda, y)\, v^{-(a_V+1+n/2)} \exp\left[-\frac{1}{2v}\left\{y'y - \mu_B'(X'X + \lambda I_p)\mu_B + \lambda\tilde{\beta}'\tilde{\beta} + 2b_V\right\}\right] \\
\propto{}& f(\beta \mid v, \lambda, y) f(v \mid \lambda, y),
\end{aligned}
\]
which implies that
\[
(V \mid L = \lambda, Y = y) \sim \mathrm{InvGam}\left(a_V + \frac{n}{2},\ b_V + \frac{1}{2}\left\{y'y - \mu_B'(X'X + \lambda I_p)\mu_B + \lambda\tilde{\beta}'\tilde{\beta}\right\}\right).
\]
We have that
\[
f(\lambda \mid \beta, v, y) \propto \lambda^{a_L-1+p/2} \exp\left\{-\lambda\left(\frac{1}{2v}\|\beta - \tilde{\beta}\|^2 + b_L\right)\right\},
\]
which implies that
\[
(L \mid B = \beta, V = v, Y = y) \sim \mathrm{Gamma}\left(a_L + \frac{p}{2},\ \frac{1}{2v}\|\beta - \tilde{\beta}\|^2 + b_L\right).
\]
To summarize, we pick $(\beta_0, v_0, \lambda_0)$ and set $n = 0$.

1. Generate $(\beta_{n+1}, v_{n+1})$ from $(B, V \mid L = \lambda_n, Y = y)$ with the following two steps:

   (a) Generate $v_{n+1}$ from $(V \mid L = \lambda_n, Y = y)$.

   (b) Generate $\beta_{n+1}$ from $(B \mid V = v_{n+1}, L = \lambda_n, Y = y)$.

2. Generate $\lambda_{n+1}$ from $(L \mid B = \beta_{n+1}, V = v_{n+1}, Y = y)$.

3. Replace $n$ by $n + 1$ and go to step 1.
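For intuition, here is a sketch of this sampler specialized to $p = 1$ (a single coefficient and no intercept), so that $X'X$, $\mu_B$, and the posterior variances are all scalars and no matrix algebra is needed; the simulated data and prior parameters are illustrative choices of mine:

```python
import random

def gibbs_ridge_p1(x, y, beta_tilde, aV, bV, aL, bL, n_iter, seed=9):
    """Bayesian ridge Gibbs sampler for p = 1 (one coefficient, no
    intercept): draw V | L, then B | V, L, then L | B, V."""
    rng = random.Random(seed)
    n = len(y)
    xtx = sum(xi * xi for xi in x)                  # scalar X'X
    xty = sum(xi * yi for xi, yi in zip(x, y))      # scalar X'y
    yty = sum(yi * yi for yi in y)                  # scalar y'y
    inv_gamma = lambda shape, rate: 1.0 / rng.gammavariate(shape, 1.0 / rate)
    lam = 1.0
    betas, vs, lams = [], [], []
    for _ in range(n_iter):
        mu_B = (xty + lam * beta_tilde) / (xtx + lam)
        # V | L = lam, Y = y
        rate = bV + 0.5 * (yty - mu_B ** 2 * (xtx + lam)
                           + lam * beta_tilde ** 2)
        v = inv_gamma(aV + n / 2.0, rate)
        # B | V = v, L = lam, Y = y
        beta = rng.gauss(mu_B, (v / (xtx + lam)) ** 0.5)
        # L | B = beta, V = v, Y = y  ~  Gamma(aL + 1/2, rate below)
        lam = rng.gammavariate(aL + 0.5,
                               1.0 / (0.5 * (beta - beta_tilde) ** 2 / v + bL))
        betas.append(beta); vs.append(v); lams.append(lam)
    return betas, vs, lams

rng = random.Random(4)
x = [rng.gauss(0.0, 1.0) for _ in range(150)]
y = [2.0 * xi + rng.gauss(0.0, 0.5) for xi in x]
betas, vs, lams = gibbs_ridge_p1(x, y, beta_tilde=0.0, aV=2.0, bV=1.0,
                                 aL=2.0, bL=1.0, n_iter=3000)
```

The rate of the inverse gamma draw is positive because $y'y - \mu_B^2(x'x + \lambda) + \lambda\tilde{\beta}^2$ equals the minimized penalized residual sum of squares, which is nonnegative. For $p > 1$ the same structure holds but drawing $B$ requires a multivariate normal with covariance $v(X'X + \lambda I_p)^{-1}$, e.g. via a Cholesky factor.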

6 Inference for $E\{h(X)\}$, where $X$ has the target distribution

This section is based on Atchade (2008) and Jones (2013). Suppose our goal is to estimate $E\{h(X)\}$, where $X \sim P_X$ (the target distribution with density $f$). Using the Markov chain $X_0, X_1, \ldots$ with invariant distribution $P_X$ (from our sampling algorithm), we estimate $E\{h(X)\}$ with
\[
\widehat{E}\{h(X)\} = \frac{1}{n}\sum_{i=1}^n h(X_i).
\]
If the chain is $P_X$-irreducible and aperiodic, then $\widehat{E}\{h(X)\} \to E\{h(X)\}$ almost surely as $n \to \infty$. A Markov chain with transition kernel $Q$ is geometrically ergodic if there exist a $\rho \in (0, 1)$ and a function $M : \mathcal{X} \to [0, \infty)$ such that
\[
\|Q^{(n)}(x, \cdot) - P_X\|_{TV} \le M(x)\rho^n,
\]
for all $x \in \mathcal{X}$ and all non-negative integers $n$. Here are two selected results (Jones et al., 2006; Jones, 2013):

• If the Harris chain is geometrically ergodic and $E\{|h(X)|^{2+\delta}\} < \infty$ for some $\delta > 0$, then
\[
\sqrt{n}\left[\widehat{E}\{h(X)\} - E\{h(X)\}\right] \overset{\mathcal{L}}{\to} N(0, v_h).
\]

• If the Harris chain is geometrically ergodic, reversible, and $E\{h^2(X)\} < \infty$, then
\[
\sqrt{n}\left[\widehat{E}\{h(X)\} - E\{h(X)\}\right] \overset{\mathcal{L}}{\to} N(0, v_h).
\]

In general, $\mathrm{var}\left[\widehat{E}\{h(X)\}\right] = n^{-2}\sum_{i=1}^n\sum_{j=1}^n \mathrm{cov}\{h(X_i), h(X_j)\}$. Without giving details on how to simplify this formula, we will estimate the asymptotic variance $v_h$ using the batch means method. Suppose we have $b$ batches of size $k$. Let
\[
\bar{Y}_j = \frac{1}{k}\sum_{i=(j-1)k+1}^{jk} h(X_i), \quad j = 1, \ldots, b.
\]
The batch means estimator of $v_h$ is
\[
\hat{v}_h = \frac{k}{b-1}\sum_{j=1}^b\left[\bar{Y}_j - \widehat{E}\{h(X)\}\right]^2.
\]
This estimator is inconsistent in general, but it is consistent if we let $k$ and $b$ increase with $n$ and assume that other regularity conditions hold (Jones, 2013). Jones suggested that $k = \sqrt{n}$ and $b = n/k$ are sensible choices. Assuming that $\widehat{E}\{h(X)\}$ is approximately normal, a $100(1-\alpha)\%$ approximate confidence interval for $E\{h(X)\}$ is
\[
\widehat{E}\{h(X)\} \pm \texttt{qnorm}(1 - \alpha/2)\sqrt{\frac{\hat{v}_h}{n}}.
\]
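The batch means estimator and confidence interval can be sketched as follows; `batch_means_ci` is my name for the helper, `z` plays the role of `qnorm(1 - alpha/2)` (1.96 gives roughly 95%), and the AR(1)-style chain at the bottom is an illustrative stand-in for MCMC output with stationary mean 0:

```python
import math
import random

def batch_means_ci(h_values, z=1.96):
    """Batch means estimate of E{h(X)} and an approximate confidence
    interval, using k = floor(sqrt(n)) and b = floor(n/k) as the notes
    suggest. z is the normal quantile qnorm(1 - alpha/2)."""
    n = len(h_values)
    k = math.isqrt(n)          # batch size
    b = n // k                 # number of batches
    m = b * k                  # samples actually used
    est = sum(h_values[:m]) / m
    ybars = [sum(h_values[j * k:(j + 1) * k]) / k for j in range(b)]
    v_hat = k / (b - 1) * sum((yb - est) ** 2 for yb in ybars)
    half = z * math.sqrt(v_hat / m)
    return est, (est - half, est + half)

# usage on a correlated chain (AR(1) with coefficient 0.5, mean 0)
rng = random.Random(8)
x, chain = 0.0, []
for _ in range(10000):
    x = 0.5 * x + rng.gauss(0.0, 1.0)
    chain.append(x)
est, (lo, hi) = batch_means_ci(chain)
```

Batching is exactly what makes the interval honest here: a naive iid standard error would ignore the autocorrelation and be too narrow, while the batch means $\bar{Y}_j$ are approximately independent once $k$ is large.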

References

Atchade, Y. F. (2008). Course notes for Statistics 606. Lecture notes.

Jones, G. L. (2013). Course notes for STAT 8701: Computational statistical methods. Lecture notes.

Jones, G. L., Haran, M., Caffo, B. S., and Neath, R. (2006). Fixed-width output analysis for Markov chain Monte Carlo. Journal of the American Statistical Association, 101:1537–1547.

Keener, R. W. (2005). Statistical theory: A medley of core topics. Notes for a course in theoretical statistics.