STAT 8054 notes on Markov Chain Monte Carlo
Spring 2015
April 20, 2015
Contents

1 Introduction
2 Review of $\mathcal{X}$-valued random variables
3 Discrete time homogeneous Markov chains
4 Convergence of Markov chains used in MCMC
5 Algorithms
  5.1 Metropolis–Hastings
  5.2 Random walk Metropolis
    5.2.1 Example: approximately sample from $N(\mu, \sigma^2)$
    5.2.2 Example: approximately sample from a bivariate posterior
  5.3 Independence sampler
  5.4 Variable at a time Metropolis–Hastings and the Gibbs sampler
    5.4.1 Example: approximately sample from a bivariate posterior
    5.4.2 Example: fit a linear random effects model
    5.4.3 Example: Bayesian ridge regression
6 Inference for $E\{h(X)\}$, where $X$ has the target distribution
1 Introduction
In ordinary Monte Carlo, we studied methods to generate a realization of $X_1, \ldots, X_n$, which are iid with some distribution of interest. Our goal is still to sample from this distribution, but we are willing to relax the requirement that $X_1, \ldots, X_n$ are iid: we will allow $X_1, \ldots, X_n$ to be dependent and have different distributions, but the distribution of $X_k$ will converge to the distribution of interest as $k$ increases.
2 Review of $\mathcal{X}$-valued random variables

The following review is based on Chapter 2 of Keener (2005). Suppose that $(\Omega, \mathcal{F}, P)$ is a probability space and $(\mathcal{X}, \mathcal{B})$ is a measurable space. An $\mathcal{X}$-valued random variable $X$ is a function $X : \Omega \to \mathcal{X}$ such that $\{\omega \in \Omega : X(\omega) \in B\} \in \mathcal{F}$ for every $B \in \mathcal{B}$. The probability measure $P_X : \mathcal{B} \to [0, 1]$ defined by
$$P_X(A) = P(X \in A) \equiv P(\{\omega \in \Omega : X(\omega) \in A\})$$
is called the distribution of $X$. We indicate this by writing $X \sim P_X$.

Let $h : \mathcal{X} \to \mathbb{R}$ be a measurable function, i.e. $\{x \in \mathcal{X} : h(x) \in B\} \in \mathcal{B}$ for every $B \in \mathcal{B}_{\mathbb{R}}$. We define the integral of $h$ against the probability measure $P_X$ by
$$\int h(x)\, P_X(dx) = E\{h(X)\}.$$
In particular, when $h(x) = 1(x \in A)$ and $A \in \mathcal{B}$,
$$P_X(A) = \int 1(x \in A)\, P_X(dx).$$
If $\nu$ is a sigma-finite measure on $\mathcal{B}$ such that $P_X(A) = 0$ whenever $\nu(A) = 0$, then we say that $P_X$ is absolutely continuous with respect to $\nu$. In this case, there exists a non-negative measurable function $f$, called the density of $P_X$ with respect to $\nu$, such that
$$P_X(A) = \int 1(x \in A) f(x)\, \nu(dx).$$
Also, for $X \sim P_X$ we have that
$$E\{h(X)\} = \int h(x)\, P_X(dx) = \int h(x) f(x)\, \nu(dx).$$

Example 1. Suppose that $(\mathcal{X}, \mathcal{B}) = (\mathbb{R}^p, \mathcal{B}_{\mathbb{R}^p})$, and $X$ is an $\mathbb{R}^p$-valued random variable (typically called a random vector) with distribution $P_X$ with density $f$ with respect to Lebesgue measure $\nu$. Then
$$E\{h(X)\} = \int h(x)\, P_X(dx) = \int h(x) f(x)\, \nu(dx) = \int \cdots \int h(x_1, \ldots, x_p) f(x_1, \ldots, x_p)\, dx_1 \cdots dx_p.$$

Example 2. Suppose that $\mathcal{X}$ is countable, so our measurable space is $(\mathcal{X}, 2^{\mathcal{X}})$. Let $X$ be an $\mathcal{X}$-valued random variable with distribution $P_X$ with density $f$ (called a probability mass function) with respect to counting measure $\nu$. Then
$$E\{h(X)\} = \int h(x)\, P_X(dx) = \sum_{x \in \mathcal{X}} h(x) f(x).$$
3 Discrete time homogeneous Markov chains

This section is based on Atchade (2008) and Jones (2013).

Definition 1. A transition kernel $Q$ on a measurable space $(\mathcal{X}, \mathcal{B})$ is a function $Q : \mathcal{X} \times \mathcal{B} \to [0, 1]$ such that $Q(x, \cdot) : \mathcal{B} \to [0, 1]$ is a probability measure on $\mathcal{B}$ (a distribution) for all $x \in \mathcal{X}$, and $Q(\cdot, A) : \mathcal{X} \to [0, 1]$ is a measurable function for all $A \in \mathcal{B}$.

Example 3. Take $\mathcal{X} = \mathbb{R}^p$ and $\mathcal{B} = \mathcal{B}_{\mathbb{R}^p}$. Let $\nu$ be Lebesgue measure on $\mathbb{R}^p$ and suppose that $\Sigma \in S^p_+$. Define
$$Q(x, A) = \int 1(y \in A)(2\pi)^{-p/2}\{\det(\Sigma^{-1})\}^{1/2} \exp\{-0.5(y - x)'\Sigma^{-1}(y - x)\}\, \nu(dy),$$
for all $x \in \mathbb{R}^p$ and $A \in \mathcal{B}$. Then $Q$ is a transition kernel: $Q(x, \cdot)$ is a probability measure on $\mathcal{B}$ (it's the $N_p(x, \Sigma)$ distribution) and $\{x \in \mathbb{R}^p : Q(x, A) \in B\} \in \mathcal{B}_{\mathbb{R}^p}$ for every $B \in \mathcal{B}_{\mathbb{R}}$.

Definition 2. A sequence of $\mathcal{X}$-valued random variables $X_0, X_1, \ldots$ is a (time-homogeneous) Markov chain with transition kernel $Q$ if
$$(X_{n+1} \mid X_0 = x_0, \ldots, X_n = x_n) \sim (X_{n+1} \mid X_n = x_n) \sim Q(x_n, \cdot),$$
for each non-negative integer $n$. Here $P(X_{n+1} \in A \mid X_n = x_n) = Q(x_n, A)$ for all $A \in \mathcal{B}$.

To generate a realization of a Markov chain where $X_0$ has initial distribution $P_{X_0}$ and the chain has transition kernel $Q$, we

1. generate a realization $x_0$ of $X_0 \sim P_{X_0}$

2. generate a realization $x_1$ of $(X_1 \mid X_0 = x_0) \sim Q(x_0, \cdot)$

3. generate a realization $x_2$ of $(X_2 \mid X_1 = x_1) \sim Q(x_1, \cdot)$

4. ...

The sequence $x_0, x_1, \ldots$ is a realization of the Markov chain. Let $X_0 \sim P_{X_0}$; then the marginal distribution/probability measure for $X_1$ at $A \in \mathcal{B}$ is
$$P(X_1 \in A) = E[E\{1(X_1 \in A) \mid X_0\}] = \int \left\{ \int 1(x_1 \in A)\, Q(x_0, dx_1) \right\} P_{X_0}(dx_0) = \int Q(x_0, A)\, P_{X_0}(dx_0) \equiv P_{X_0}Q(A).$$
We write this as $X_1 \sim P_{X_0}Q$. We also have that $(X_2 \mid X_0 = x_0) \sim QQ(x_0, \cdot)$:
$$P(X_2 \in A \mid X_0 = x_0) = E[E\{1(X_2 \in A) \mid X_1, X_0 = x_0\} \mid X_0 = x_0] = \int \left\{ \int 1(x_2 \in A)\, Q(x_1, dx_2) \right\} Q(x_0, dx_1) = \int Q(x_1, A)\, Q(x_0, dx_1) \equiv QQ(x_0, A).$$
So $X_2 \sim P_{X_0}QQ = P_{X_0}Q^{(2)}$. In general, $X_n \sim P_{X_0}Q^{(n)}$ and $(X_n \mid X_0 = x) \sim Q^{(n)}(x, \cdot)$.

Suppose that $h : \mathcal{X} \to \mathbb{R}$ is measurable, i.e. $\{x \in \mathcal{X} : h(x) \in B\} \in \mathcal{B}$ for every Borel set $B \in \mathcal{B}_{\mathbb{R}}$. Then
$$E\{h(X_{n+1}) \mid X_n = x_n\} = \int h(x_{n+1})\, Q(x_n, dx_{n+1}) \equiv Qh(x_n),$$
for each $x_n \in \mathcal{X}$.

If $X_0 \sim P_X$ implies that $X_1 \sim P_X$, then $P_X$ is an invariant distribution for the transition kernel $Q$, i.e.
$$P_X(A) = \int Q(x, A)\, P_X(dx),$$
for all $A \in \mathcal{B}$. We write this as $P_X = P_X Q$. If the initial distribution is $P_X$, then $X_k \sim P_X$ for all $k$. We say that $Q$ is reversible with respect to $P_X$ if
$$\iint h(x, y)\, Q(x, dy)\, P_X(dx) = \iint h(x, y)\, Q(y, dx)\, P_X(dy), \qquad (1)$$
for any $h$ such that both sides of (1) exist. If the chain is reversible, then the distribution of $(X_i, X_{i+1}, \ldots, X_{i+k})$ is the same as the distribution of $(X_{i+k}, \ldots, X_i)$ for all $i$ and $k$. Reversibility implies invariance:
$$P_X Q(A) = \int Q(x, A)\, P_X(dx) = \iint 1(y \in A)\, Q(x, dy)\, P_X(dx) = \iint 1(y \in A)\, Q(y, dx)\, P_X(dy) = \int 1(y \in A)\, P_X(dy) \left( \int Q(y, dx) \right) = P_X(A).$$
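The chain-generation recipe in this section can be sketched in a few lines of Python. This sketch is illustrative and not part of the original notes: `sample_initial` and `sample_kernel` are hypothetical callables standing in for draws from $P_{X_0}$ and $Q(x, \cdot)$.

```python
import random

def simulate_chain(sample_initial, sample_kernel, n_steps, seed=None):
    """Generate a realization x_0, x_1, ..., x_n of a Markov chain.

    sample_initial(rng) draws x_0 ~ P_{X_0}; sample_kernel(x, rng)
    draws the next state from Q(x, .)."""
    rng = random.Random(seed)
    x = sample_initial(rng)
    path = [x]
    for _ in range(n_steps):
        x = sample_kernel(x, rng)  # x_{k+1} ~ Q(x_k, .)
        path.append(x)
    return path
```

For instance, `sample_kernel=lambda x, rng: rng.gauss(x, 1.0)` realizes the Gaussian kernel of Example 3 with $p = 1$ and $\Sigma = 1$.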
4 Convergence of Markov Chains used in MCMC
Let $(\mathcal{X}, \mathcal{B})$ be a measurable space and let $X_0, X_1, \ldots$ be an $\mathcal{X}$-valued Markov chain with transition kernel $Q$ and invariant distribution $P_X$.

1. The chain is aperiodic if there does not exist a $d \ge 2$ and disjoint subsets $\mathcal{X}_1, \ldots, \mathcal{X}_d$ of $\mathcal{X}$ such that $Q(x, \mathcal{X}_1) = 1$ for all $x \in \mathcal{X}_d$ and $Q(x, \mathcal{X}_i) = 1$ for all $x \in \mathcal{X}_{i-1}$ ($i = 2, \ldots, d$).

2. The chain is $P_X$-irreducible if for any $A \in \mathcal{B}$ with $P_X(A) > 0$, we have $P(t_A < \infty \mid X_0 = x) > 0$ for all $x \in \mathcal{X}$, where $t_A = \min(n > 0 : X_n \in A)$.

3. The chain is Harris if for any $A \in \mathcal{B}$ with $P_X(A) > 0$, we have $P(t_A < \infty \mid X_0 = x) = 1$ for all $x \in \mathcal{X}$, where $t_A = \min(n > 0 : X_n \in A)$.

Define the total variation distance between measures $P_X$ and $P_Y$ by
$$\|P_X - P_Y\|_{TV} = \sup_{A \in \mathcal{B}} |P_X(A) - P_Y(A)|.$$
The norm is $\|P_Z\|_{TV} = \sup_{A \in \mathcal{B}} P_Z(A) - \inf_{A \in \mathcal{B}} P_Z(A)$.

Theorem 1. If the $\mathcal{X}$-valued Markov chain $X_0, X_1, \ldots$ with invariant distribution $P_X$ and transition kernel $Q$ is $P_X$-irreducible, then $P_X$ is the unique invariant distribution. Furthermore, if the chain is aperiodic, then there exists a set $N$ with $P_X(N) = 0$ such that for all $x \notin N$,
$$\|Q^{(n)}(x, \cdot) - P_X(\cdot)\|_{TV} \to 0,$$
as $n \to \infty$. This holds for all $x \in \mathcal{X}$ if the chain is Harris. This result is taken from "Convergence of Markov chains from all starting points with applications to Metropolis–Hastings algorithms" by R. L. Tweedie in 1999.
5 Algorithms
5.1 Metropolis–Hastings
Let $(\mathcal{X}, \mathcal{B})$ be a measurable space. Our goal is to approximately sample from the distribution $P_X$. We will create an $\mathcal{X}$-valued Markov chain $X_0, X_1, \ldots$ with transition kernel $Q$ so that the target distribution $P_X$ is invariant for $Q$. Let $P_X$ have density $f$ with respect to the sigma-finite measure $\nu$: $P_X(dx) = f(x)\nu(dx)$. Let $G$ be a proposal kernel where $G(x, dy) = g(x, y)\nu(dy)$, so $g(x, \cdot)$ is the density for $G(x, \cdot)$ with respect to $\nu$. Define
$$\alpha(x, y) = \min\left\{1, \frac{f(y)g(y, x)}{f(x)g(x, y)}\right\}. \qquad (2)$$
The denominator of the Hastings ratio in (2) is always positive provided that we started the chain at a point $x \in \mathcal{X}$ such that $f(x) > 0$. Of course, $g(x, y) > 0$ because $y$ is generated from the distribution with density $g(x, \cdot)$. In (2), we only need to know unnormalized densities $\tilde{f}(x) = K_f f(x)$ and $\tilde{g}(x, y) = K_g g(x, y)$, because these constants cancel in the Hastings ratio.
Algorithm 1. Pick or generate $x_0 \in \mathcal{X}$ with $f(x_0) > 0$ and set $n = 0$.

1. Given $X_n = x_n$, generate $Z \sim G(x_n, \cdot)$ and independently generate $U \sim \mathrm{Unif}(0, 1)$.

2. If $U \le \alpha(x_n, Z)$, then set $X_{n+1} = Z$. Otherwise set $X_{n+1} = X_n$.

3. Replace $n$ by $n + 1$ and go to step 1.
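Algorithm 1 translates directly into code. The sketch below is illustrative and not part of the original notes: `f_unnorm`, `g_sample`, and `g_density` are hypothetical user-supplied callables for the unnormalized target density $\tilde{f}$, draws from $G(x, \cdot)$, and the proposal density $g(x, y)$.

```python
import math
import random

def metropolis_hastings(f_unnorm, g_sample, g_density, x0, n_steps, seed=None):
    """Algorithm 1 (Metropolis-Hastings). Requires f_unnorm(x0) > 0.

    f_unnorm(x) evaluates an unnormalized target density,
    g_sample(x, rng) draws Z ~ G(x, .), and g_density(x, y)
    evaluates the proposal density g(x, y)."""
    rng = random.Random(seed)
    x = x0
    chain = [x]
    for _ in range(n_steps):
        z = g_sample(x, rng)
        # Hastings ratio; the constants K_f and K_g cancel here.
        ratio = (f_unnorm(z) * g_density(z, x)) / (f_unnorm(x) * g_density(x, z))
        if rng.random() <= min(1.0, ratio):
            x = z            # accept the proposal
        chain.append(x)      # otherwise the chain stays at x
    return chain
```

As a usage example, an Exp(1) target with a uniform random-walk proposal is `metropolis_hastings(lambda x: math.exp(-x) if x > 0 else 0.0, lambda x, rng: x + rng.uniform(-1.0, 1.0), lambda x, y: 0.5 if abs(y - x) <= 1.0 else 0.0, 1.0, 500)`.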
We can express the transition kernel $Q$ by
$$Q(x, A) = \int 1(y \in A)\alpha(x, y)g(x, y)\nu(dy) + \left\{1 - \int \alpha(x, y)g(x, y)\nu(dy)\right\} I(x, A) = \int 1(y \in A)\alpha(x, y)g(x, y)\nu(dy) + r(x)I(x, A), \qquad (3)$$
where $I$ is the identity kernel and $r(x) = 1 - \int \alpha(x, y)g(x, y)\nu(dy)$ is the marginal probability that the chain remains at $x$. We typically define $Q^{(0)}(x, A) \equiv I(x, A) = 1(x \in A)$. The measure $I(x, \cdot)$ does not have a density with respect to $\nu$. To derive (3), do the following decomposition:
$$P(X_{n+1} \in A \mid X_n = x) = P(X_{n+1} \in A, U \le \alpha(x, Z) \mid X_n = x) + P(X_{n+1} \in A, U > \alpha(x, Z) \mid X_n = x) = T_1 + T_2.$$
Then
$$T_1 = E[E\{1(Z \in A, U \le \alpha(x, Z)) \mid Z, X_n = x\} \mid X_n = x] = E\left[\left.\int 1(Z \in A)1(u \le \alpha(x, Z))\, du\, \right|\, X_n = x\right] = E[1(Z \in A)\alpha(x, Z) \mid X_n = x] = \int 1(z \in A)\alpha(x, z)g(x, z)\nu(dz)$$
and
$$T_2 = E[E\{1(x \in A, U > \alpha(x, Z)) \mid Z, X_n = x\} \mid X_n = x] = E\left[\left.\int 1(x \in A)1(u > \alpha(x, Z))\, du\, \right|\, X_n = x\right] = E[1(x \in A)(1 - \alpha(x, Z)) \mid X_n = x] = 1(x \in A) - 1(x \in A)\int \alpha(x, z)g(x, z)\nu(dz).$$
Given the transition kernel expression in (3), we will show the chain is reversible with respect to $P_X$, i.e.
$$\iint h(x, y)\, Q(x, dy)\, P_X(dx) = \iint h(x, y)\, Q(y, dx)\, P_X(dy), \qquad (4)$$
for all $h$ such that both sides of (4) exist. Recall that $r(x) = 1 - \int \alpha(x, z)g(x, z)\nu(dz)$. Starting with the left hand side of (4),
$$\begin{aligned}
\iint h(x, y)\, Q(x, dy)\, P_X(dx) &= \iint h(x, y)\alpha(x, y)g(x, y)\nu(dy)f(x)\nu(dx) + \iint h(x, y)I(x, dy)r(x)f(x)\nu(dx) \\
&= \iint h(x, y)\min\{g(x, y)f(x), f(y)g(y, x)\}\nu(dy)\nu(dx) + \int h(x, x)r(x)f(x)\nu(dx) \\
&= \iint h(x, y)\min\{g(y, x)f(y), f(x)g(x, y)\}\nu(dx)\nu(dy) + \int h(y, y)r(y)f(y)\nu(dy) \\
&= \iint h(x, y)\alpha(y, x)g(y, x)\nu(dx)f(y)\nu(dy) + \int h(y, y)r(y)f(y)\nu(dy) \\
&= \iint h(x, y)\, Q(y, dx)\, P_X(dy).
\end{aligned}$$
So we have shown that the chain is reversible with respect to $P_X$, which implies that $P_X$ is the invariant distribution. We also have that if $P_X$ is positive on $\mathcal{X}$ and there exist $\epsilon, R > 0$ such that
$$\inf_{y \in B(x, R)} \min(g(x, y), g(y, x)) > \epsilon,$$
for all $x \in \mathcal{X}$, then $Q$ is $P_X$-irreducible and aperiodic, so $P_X$ is the unique invariant distribution and we have convergence in the TV norm.
5.2 Random walk Metropolis
If the density of our proposal kernel $g(x, \cdot)$ is of the form $g(x, y) = g^*(y - x)$, where $g^*$ is a density, then we call the resulting MH algorithm a random walk Metropolis algorithm. In this case we can write the Hastings ratio in (2) as
$$\frac{f(y)g(y, x)}{f(x)g(x, y)} = \frac{f(y)g^*(x - y)}{f(x)g^*(y - x)}.$$
If $g^*$ is symmetric, i.e. $g^*(-u) = g^*(u)$, then the Hastings ratio simplifies:
$$\frac{f(y)g(y, x)}{f(x)g(x, y)} = \frac{f(y)g^*(y - x)}{f(x)g^*(y - x)} = \frac{f(y)}{f(x)}.$$
For example, $\mathcal{X} = \mathbb{R}^p$ and $G(x, \cdot) = N_p(x, \Sigma)$; or $\mathcal{X} = \mathbb{R}$ and $G(x, \cdot) = \mathrm{Unif}(x - b, x + b)$.
5.2.1 Example: approximately sample from $N(\mu, \sigma^2)$

We give a simple univariate illustration of the random walk Metropolis algorithm to approximately sample from $N(\mu, \sigma^2)$. We will use the symmetric kernel $G(x, \cdot) = \mathrm{Unif}(x - b, x + b)$. Then $g(x, y) \propto 1\{y \in (x - b, x + b)\} = 1\{y - x \in (-b, b)\}$ and $f(x) \propto \exp\{-(x - \mu)^2/(2\sigma^2)\}$. Then
$$\alpha(x, y) = \min\left\{1, \frac{f(y)}{f(x)}\right\} = \min\left[1, \exp\left\{\frac{1}{2\sigma^2}\left((x - \mu)^2 - (y - \mu)^2\right)\right\}\right].$$
We pick $X_0 \in \mathbb{R}$, set $n = 0$, and perform the following steps:

1. Given $X_n = x_n$, generate $Z \sim \mathrm{Unif}(x_n - b, x_n + b)$ and independently generate $U \sim \mathrm{Unif}(0, 1)$.

2. If $U \le \alpha(x_n, Z)$, then set $X_{n+1} = Z$. Otherwise set $X_{n+1} = X_n$.

3. Replace $n$ by $n + 1$ and go to step 1.
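A minimal Python sketch of these three steps (illustrative, not part of the original notes), assuming the $\mathrm{Unif}(x - b, x + b)$ proposal above; the acceptance test is done on the log scale for numerical stability:

```python
import math
import random

def rw_metropolis_normal(mu, sigma2, b, n_steps, x0=0.0, seed=None):
    """Random walk Metropolis targeting N(mu, sigma2) with the
    Unif(x - b, x + b) proposal; alpha(x, y) = min{1, f(y)/f(x)}."""
    rng = random.Random(seed)
    x = x0
    chain = [x]
    for _ in range(n_steps):
        z = rng.uniform(x - b, x + b)
        # log of the simplified Hastings ratio f(z)/f(x)
        log_ratio = ((x - mu) ** 2 - (z - mu) ** 2) / (2.0 * sigma2)
        if rng.random() <= math.exp(min(0.0, log_ratio)):
            x = z            # accept; otherwise stay at x
        chain.append(x)
    return chain
```

For example, `rw_metropolis_normal(2.0, 1.0, 1.0, 5000)` produces a chain whose marginal distribution approaches $N(2, 1)$.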
5.2.2 Example: approximately sample from a bivariate posterior
Suppose we measured heights $x_1, \ldots, x_n$. Assume that $x = (x_1, \ldots, x_n)'$ is a realization of $X$, where
$$(X \mid M = \mu, V = v) \sim N_n(\mu 1_n, v I_n),$$
$$(M \mid V = v) \sim N(\mu_M, v_M),$$
$$V \sim \mathrm{InvGam}(\alpha, \beta),$$
where $v_M, \mu_M, \alpha$, and $\beta$ are user-specified. The density for $\mathrm{InvGam}(\alpha, \beta)$ evaluated at $v$ is proportional to $v^{-(\alpha+1)} e^{-\beta/v}$. The posterior density at $(\mu, v)$ is
$$f(\mu, v \mid x) \propto v^{-(\alpha+1)} e^{-\beta/v} v^{-n/2} \exp\left\{-\frac{1}{2v}\sum_{i=1}^n (x_i - \mu)^2\right\} \exp\left\{-\frac{1}{2v_M}(\mu - \mu_M)^2\right\} = v^{-(\alpha+1+n/2)} \exp\left\{-\frac{\beta}{v} - \frac{1}{2v}\sum_{i=1}^n (x_i - \mu)^2 - \frac{1}{2v_M}(\mu - \mu_M)^2\right\}.$$
For illustration, we will use a random walk Metropolis algorithm with a $N_2\{(\mu_n, v_n)', D\}$ trial distribution to approximately sample from $(M, V \mid X = x)$. This is an inefficient approach. The covariance matrix will be diagonal: $D = \mathrm{diag}(s_1^2, s_2^2)$. Also,
$$\log \alpha\{(\mu_x, v_x), (\mu_y, v_y)\} = \min\{0, \log f(\mu_y, v_y \mid x) - \log f(\mu_x, v_x \mid x)\},$$
where
$$\log f(\mu_y, v_y \mid x) = -(\alpha + 1 + n/2)\log v_y - \frac{1}{2v_y}\sum_{i=1}^n (x_i - \mu_y)^2 - \frac{1}{2v_M}(\mu_y - \mu_M)^2 - \beta/v_y,$$
provided that $v_y > 0$; otherwise we define $\log f(\mu_y, v_y \mid x) = -\infty$. To summarize the algorithm, we pick $(\mu_0, v_0) \in \mathbb{R}^2$ and $D = \mathrm{diag}(s_1^2, s_2^2)$; set $n = 0$; and perform the following steps:

1. Generate $Z \sim N_2\{(\mu_n, v_n)', D\}$ and independently generate $U \sim \mathrm{Unif}(0, 1)$.

2. If $\log U \le \log \alpha\{(\mu_n, v_n), Z\}$, then set $(\mu_{n+1}, v_{n+1})' = Z$. Otherwise set $(\mu_{n+1}, v_{n+1})' = (\mu_n, v_n)'$.

3. Replace $n$ by $n + 1$ and go to step 1.
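A Python sketch of this sampler (illustrative, not part of the original notes); since $D$ is diagonal, the $N_2\{(\mu_n, v_n)', D\}$ proposal is drawn as two independent Gaussian increments:

```python
import math
import random

def rw_metropolis_mv_posterior(x, mu_M, v_M, alpha, beta, s1, s2,
                               n_steps, mu0=0.0, v0=1.0, seed=None):
    """Random walk Metropolis for (M, V | X = x); log f(mu, v | x)
    is defined to be -inf when v <= 0, so such proposals are rejected."""
    rng = random.Random(seed)
    n = len(x)

    def log_post(mu, v):
        if v <= 0.0:
            return -math.inf
        ss = sum((xi - mu) ** 2 for xi in x)
        return (-(alpha + 1.0 + n / 2.0) * math.log(v)
                - ss / (2.0 * v)
                - (mu - mu_M) ** 2 / (2.0 * v_M)
                - beta / v)

    mu, v = mu0, v0
    lp = log_post(mu, v)
    chain = [(mu, v)]
    for _ in range(n_steps):
        mu_z = rng.gauss(mu, s1)   # proposal for M
        v_z = rng.gauss(v, s2)     # proposal for V
        lp_z = log_post(mu_z, v_z)
        # accept when U <= alpha, i.e. log U <= min(0, lp_z - lp)
        if rng.random() <= math.exp(min(0.0, lp_z - lp)):
            mu, v, lp = mu_z, v_z, lp_z
        chain.append((mu, v))
    return chain
```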
5.3 Independence sampler
If the density of our proposal kernel $g(x, \cdot)$ is of the form $g(x, y) = g^*(y)$, where $g^*$ is a density, then we call the resulting MH algorithm an independence sampler. In this case we can write the ratio in (2) as
$$\frac{f(y)g(y, x)}{f(x)g(x, y)} = \frac{f(y)g^*(x)}{f(x)g^*(y)}.$$
We need to select $g^*$ carefully to ensure that the Markov chain converges to the target distribution.
5.4 Variable at a time Metropolis–Hastings and the Gibbs sampler
Suppose that there are intermediate $\mathcal{X}$-valued random variables in the Markov chain:
$$\ldots, X_n, X_{n,b}, X_{n+1}, X_{n+1,b}, X_{n+2}, \ldots$$
From $X_n$ to $X_{n,b}$ we use kernel $Q_1$, and from $X_{n,b}$ to $X_{n+1}$ we use kernel $Q_2$. The kernel $Q$ for the combination/composition of $Q_1$ and $Q_2$ is
$$Q(x_n, A) = P(X_{n+1} \in A \mid X_n = x_n) = E[E\{1(X_{n+1} \in A) \mid X_{n,b}, X_n = x_n\} \mid X_n = x_n] = \int \left\{\int 1(x_{n+1} \in A)\, Q_2(x_{n,b}, dx_{n+1})\right\} Q_1(x_n, dx_{n,b}) = \int Q_2(x_{n,b}, A)\, Q_1(x_n, dx_{n,b}) \equiv Q_1 Q_2(x_n, A).$$
Some algorithms for MCMC involve creating a transition kernel $Q$ from sub-transition kernels $Q_1, \ldots, Q_p$. So long as each $Q_j$ has invariant distribution $P_X$, i.e. $P_X Q_j = P_X$, then $P_X Q_1 \cdots Q_p = P_X Q = P_X$, because $(P_X Q_1)Q_2 = P_X Q_2 = P_X$, so $((P_X Q_1)Q_2)Q_3 = P_X Q_3 = P_X$, and so on.
Suppose that $X_n = (X_{n1}, \ldots, X_{np})$. In variable at a time Metropolis–Hastings, we update $X_{nj}$ with the proposal density $g_j(x, \cdot)$ for $j = 1, \ldots, p$. Define
$$\alpha_j(x, y) = \min\left(1, \frac{f(x^{(y,j)})\, g_j(x^{(y,j)}, x_j)}{f(x)\, g_j(x, y)}\right), \qquad (5)$$
where $x^{(y,j)} = (x_1, \ldots, x_{j-1}, y, x_{j+1}, \ldots, x_p)$. This denominator is always positive if we start the chain at $x$ with $f(x) > 0$.

Algorithm 2 (Variable at a time Metropolis–Hastings). Pick or generate $X_0 \in \mathcal{X}$ with $f(X_0) > 0$ and set $n = 0$.

1. Set $Y = (Y_1, \ldots, Y_p) = X_n = (X_{n1}, \ldots, X_{np})$.

2. Generate $Z \sim G_1(Y, \cdot)$ and independently generate $U \sim \mathrm{Unif}(0, 1)$. If $U \le \alpha_1(Y, Z)$, then set $Y_1 = Z$. Otherwise set $Y_1 = X_{n1}$.

3. Generate $Z \sim G_2(Y, \cdot)$ and independently generate $U \sim \mathrm{Unif}(0, 1)$. If $U \le \alpha_2(Y, Z)$, then set $Y_2 = Z$. Otherwise set $Y_2 = X_{n2}$.

4. ...

5. Generate $Z \sim G_p(Y, \cdot)$ and independently generate $U \sim \mathrm{Unif}(0, 1)$. If $U \le \alpha_p(Y, Z)$, then set $Y_p = Z$. Otherwise set $Y_p = X_{np}$.

6. Set $X_{n+1} = Y$, replace $n$ by $n + 1$, and go to step 1.
The Gibbs sampler is a special case of Algorithm 2. Let $x_{-j}$ be the vector $x$ with its $j$th entry removed. Take $g_j(x, y) = f_j(y \mid x_{-j})$, where $f_j$ is the density for the conditional distribution of the $j$th variable in $X \sim f$ given the others. In this case $\alpha_j(x, y) = 1$ because the ratio in (5) is
$$\frac{f(x^{(y,j)})\, g_j(x^{(y,j)}, x_j)}{f(x)\, g_j(x, y)} = \frac{f(x^{(y,j)})\, f_j(x_j \mid x_{-j})}{f(x)\, f_j(y \mid x_{-j})} = \frac{f(x^{(y,j)})\, K_j f(x)}{f(x)\, K_j f(x^{(y,j)})} = 1.$$
So the Gibbs sampler always accepts. Let $Y = (Y_1, \ldots, Y_p)$ be a random vector with the target distribution (with density $f$). The Gibbs sampler (with a fixed scan) simplifies to the following:

Algorithm 3 (Gibbs sampler). Pick or generate $x_0 = (x_{01}, \ldots, x_{0p}) \in \mathcal{X}$ with $f(x_0) > 0$ and set $n = 0$.

1. Generate a realization $x_{n+1,1}$ of $(Y_1 \mid Y_2 = x_{n2}, \ldots, Y_p = x_{np})$.

2. Generate a realization $x_{n+1,2}$ of $(Y_2 \mid Y_1 = x_{n+1,1}, Y_3 = x_{n3}, \ldots, Y_p = x_{np})$.

3. ...

4. Generate a realization $x_{n+1,p}$ of $(Y_p \mid Y_1 = x_{n+1,1}, \ldots, Y_{p-1} = x_{n+1,p-1})$.

5. Replace $n$ by $n + 1$ and go to step 1.

It is important that the steps in Algorithm 3 are performed sequentially and not in parallel. We can modify the order of this sequence.
5.4.1 Example: approximately sample from a bivariate posterior
Let $x = (x_1, \ldots, x_n)$ be measurements of some numerical characteristic on $n$ subjects. Suppose that $x$ is a realization of $X$, where
$$(X \mid M = \mu, V = v) \sim N_n(\mu 1_n, v I_n),$$
$$(M \mid V = v) \sim N(\mu_M, v_M),$$
$$V \sim \mathrm{InvGam}(\alpha, \beta),$$
where $\mu_M, v_M, \alpha, \beta$ are user-specified prior parameters. The density for $\mathrm{InvGam}(\alpha, \beta)$ evaluated at $v$ is proportional to $v^{-(\alpha+1)} e^{-\beta/v}$. The joint density is defined by
$$f(\mu, v, x) \propto v^{-(\alpha+1)} e^{-\beta/v} v^{-n/2} \exp\left\{-\frac{1}{2v}\sum_{i=1}^n (x_i - \mu)^2\right\} \exp\left\{-\frac{1}{2v_M}(\mu - \mu_M)^2\right\}.$$
Then
$$f(v \mid \mu, x) \propto v^{-n/2} e^{-\beta/v} \exp\left\{-\frac{1}{2v}\sum_{i=1}^n (x_i - \mu)^2\right\} v^{-(\alpha+1)} = v^{-(\alpha+1+n/2)} \exp\left[-\left\{\beta + \frac{1}{2}\sum_{i=1}^n (x_i - \mu)^2\right\}\bigg/ v\right].$$
Thus
$$(V \mid M = \mu, X = x) \sim \mathrm{InvGam}\left(\alpha + \frac{n}{2},\; \beta + \frac{1}{2}\sum_{i=1}^n (x_i - \mu)^2\right).$$
We also have that
$$f(\mu \mid v, x) \propto \exp\left\{-\frac{1}{2v}\sum_{i=1}^n (x_i - \mu)^2\right\} \exp\left\{-\frac{1}{2v_M}(\mu - \mu_M)^2\right\} \propto \exp\left(-0.5 v^{-1}\sum_{i=1}^n x_i^2 + \mu v^{-1}\sum_{i=1}^n x_i - 0.5 v^{-1} n\mu^2 - 0.5 v_M^{-1}\mu^2 + \mu v_M^{-1}\mu_M\right) \propto \exp\left\{-0.5\mu^2(n v^{-1} + v_M^{-1}) + \mu\left(v^{-1} n \bar{x} + v_M^{-1}\mu_M\right)\right\}.$$
The density for $N(m, w)$ is proportional in $y$ to $\exp\{-0.5 y^2 w^{-1} + y m w^{-1}\}$, so
$$(M \mid V = v, X = x) \sim N\left(\frac{n\bar{x} + v v_M^{-1}\mu_M}{n + v v_M^{-1}},\; \frac{1}{n v^{-1} + v_M^{-1}}\right).$$
To summarize, we pick $(v_0, m_0)$ and set $n = 0$.

1. Generate $v_{n+1}$ from $(V \mid M = m_n, X = x)$.

2. Generate $m_{n+1}$ from $(M \mid V = v_{n+1}, X = x)$.

3. Replace $n$ by $n + 1$ and go to step 1.
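A Python sketch of this two-step Gibbs sampler (illustrative, not part of the original notes), using only the standard library; an $\mathrm{InvGam}(\alpha, \beta)$ draw is obtained as the reciprocal of a Gamma draw with shape $\alpha$ and rate $\beta$:

```python
import math
import random

def gibbs_normal_invgamma(x, mu_M, v_M, a, b, n_steps,
                          v0=1.0, m0=None, seed=None):
    """Gibbs sampler alternating the two full conditionals:
    (V | M = mu)  ~ InvGam(a + n/2, b + 0.5 * sum_i (x_i - mu)^2)
    (M | V = v)   ~ N((n*xbar + v*mu_M/v_M) / (n + v/v_M),
                      1 / (n/v + 1/v_M))."""
    rng = random.Random(seed)
    n = len(x)
    xbar = sum(x) / n
    v = v0
    mu = xbar if m0 is None else m0
    draws = []
    for _ in range(n_steps):
        # V | M: inverse gamma via 1 / Gamma(shape, scale = 1/rate)
        ss = sum((xi - mu) ** 2 for xi in x)
        v = 1.0 / rng.gammavariate(a + n / 2.0, 1.0 / (b + 0.5 * ss))
        # M | V: Gaussian full conditional
        mean = (n * xbar + v * mu_M / v_M) / (n + v / v_M)
        var = 1.0 / (n / v + 1.0 / v_M)
        mu = rng.gauss(mean, math.sqrt(var))
        draws.append((mu, v))
    return draws
```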
5.4.2 Example: fit a linear random effects model

Let $y_{it}$ be the observed response for the $i$th subject at time $t$. Suppose that $y_{it}$ is a realization of $Y_{it}$, where
$$(Y_{it} \mid M = m, A = a, V_E = v_E, V_A = v_A) \sim_{\mathrm{ind}} N(m + a_i, v_E), \quad i = 1, \ldots, n,\; t = 1, \ldots, T_i,$$
$$(M \mid A = a, V_E = v_E, V_A = v_A) \sim N(0, v_M),$$
$$(A \mid V_E = v_E, V_A = v_A) \sim N_n(0, v_A I_n),$$
$$(V_E \mid V_A = v_A) \sim \mathrm{InvGam}(\alpha_E, \beta_E),$$
$$V_A \sim \mathrm{InvGam}(\alpha_A, \beta_A).$$
Let $y$ be a realization of $Y = (Y_{11}, \ldots, Y_{nT_n})$. Recall that the density for $\mathrm{InvGam}(\alpha, \beta)$ evaluated at $v$ is proportional to $v^{-(\alpha+1)} e^{-\beta/v}$. The joint density is defined by
$$f(y, a, m, v_E, v_A) \propto v_A^{-(\alpha_A+1)} e^{-\beta_A/v_A}\, v_E^{-(\alpha_E+1)} e^{-\beta_E/v_E} \times v_A^{-n/2}\exp\left(-\frac{1}{2v_A}\sum_{i=1}^n a_i^2\right) v_M^{-1/2}\exp\left(-\frac{1}{2v_M}m^2\right) \times v_E^{-\sum_{i=1}^n T_i/2}\exp\left\{-\frac{1}{2v_E}\sum_{i=1}^n\sum_{t=1}^{T_i}(y_{it} - m - a_i)^2\right\}.$$
So we have that
$$f(v_A \mid a, m, v_E, y) \propto v_A^{-(\alpha_A+1+n/2)}\exp\left\{-\frac{1}{v_A}\left(\beta_A + \frac{1}{2}\sum_{i=1}^n a_i^2\right)\right\},$$
$$f(v_E \mid a, m, v_A, y) \propto v_E^{-(\alpha_E+1+\frac{1}{2}\sum_{i=1}^n T_i)}\exp\left[-\frac{1}{v_E}\left\{\beta_E + \frac{1}{2}\sum_{i=1}^n\sum_{t=1}^{T_i}(y_{it} - m - a_i)^2\right\}\right].$$
Thus
$$(V_A \mid A = a, M = m, V_E = v_E, Y = y) \sim \mathrm{InvGam}\left(\alpha_A + n/2,\; \beta_A + \frac{1}{2}\sum_{i=1}^n a_i^2\right),$$
$$(V_E \mid A = a, M = m, V_A = v_A, Y = y) \sim \mathrm{InvGam}\left(\alpha_E + \frac{1}{2}\sum_{i=1}^n T_i,\; \beta_E + \frac{1}{2}\sum_{i=1}^n\sum_{t=1}^{T_i}(y_{it} - m - a_i)^2\right).$$
We also have that
$$f(m \mid a, v_A, v_E, y) \propto \exp\left\{-\frac{1}{2v_E}\sum_{i=1}^n\sum_{t=1}^{T_i}(y_{it} - a_i - m)^2\right\}\exp\left(-\frac{1}{2v_M}m^2\right) \propto \exp\left[-\frac{1}{2v_E}\sum_{i=1}^n\sum_{t=1}^{T_i}\left\{(y_{it} - a_i)^2 - 2(y_{it} - a_i)m + m^2\right\} - \frac{1}{2v_M}m^2\right] \propto \exp\left\{-0.5 m^2\left(\sum_{i=1}^n T_i/v_E + 1/v_M\right) + m\sum_{i=1}^n\sum_{t=1}^{T_i}(y_{it} - a_i)/v_E\right\}.$$
This implies that
$$(M \mid A = a, V_E = v_E, V_A = v_A, Y = y) \sim N\left(\frac{\sum_{i=1}^n\sum_{t=1}^{T_i}(y_{it} - a_i)}{\sum_{i=1}^n T_i + v_E/v_M},\; \frac{1}{\sum_{i=1}^n T_i/v_E + 1/v_M}\right).$$
Let $a_{-i}$ be the vector $a$ with its $i$th entry removed. We also have that
$$f(a_i \mid a_{-i}, m, v_A, v_E, y) \propto \exp\left\{-\frac{1}{2v_E}\sum_{t=1}^{T_i}(y_{it} - m - a_i)^2\right\}\exp\left(-\frac{1}{2v_A}a_i^2\right) \propto \exp\left[-\frac{1}{2v_E}\sum_{t=1}^{T_i}\left\{(y_{it} - m)^2 - 2a_i(y_{it} - m) + a_i^2\right\} - \frac{1}{2v_A}a_i^2\right] \propto \exp\left\{-0.5 a_i^2(T_i/v_E + 1/v_A) + a_i\sum_{t=1}^{T_i}(y_{it} - m)/v_E\right\}.$$
This implies that
$$(A_i \mid M = m, A_{-i} = a_{-i}, V_E = v_E, V_A = v_A, Y = y) \sim N\left(\frac{\sum_{t=1}^{T_i}(y_{it} - m)}{T_i + v_E/v_A},\; \frac{1}{T_i/v_E + 1/v_A}\right),$$
for $i = 1, \ldots, n$.

To summarize, we pick $(m_0, a_0, v_{A,0}, v_{E,0})$ and set $n = 0$.

1. Generate $m_{n+1}$ from $(M \mid A = a_n, V_E = v_{E,n}, V_A = v_{A,n}, Y = y)$.

2. Generate $a_{n+1,i}$ from $(A_i \mid M = m_{n+1}, V_E = v_{E,n}, V_A = v_{A,n}, Y = y)$ for $i = 1, \ldots, n$ in parallel.

3. Generate $v_{A,n+1}$ from $(V_A \mid A = a_{n+1}, M = m_{n+1}, V_E = v_{E,n}, Y = y)$.

4. Generate $v_{E,n+1}$ from $(V_E \mid A = a_{n+1}, M = m_{n+1}, V_A = v_{A,n+1}, Y = y)$.

5. Replace $n$ by $n + 1$ and go to step 1.
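The four updates can be sketched in Python as follows (illustrative, not part of the original notes); `y` is a ragged list of lists with `y[i][t]` $= y_{it}$:

```python
import math
import random

def gibbs_random_effects(y, v_M, aE, bE, aA, bA, n_steps, seed=None):
    """One-way random effects Gibbs sampler cycling through the full
    conditionals derived above: m, then a_1..a_n, then v_A, then v_E."""
    rng = random.Random(seed)
    n = len(y)
    T = [len(yi) for yi in y]
    N = sum(T)
    m, a, vA, vE = 0.0, [0.0] * n, 1.0, 1.0
    draws = []
    for _ in range(n_steps):
        # m | rest: Gaussian with precision N/vE + 1/v_M
        resid = sum(y[i][t] - a[i] for i in range(n) for t in range(T[i]))
        var_m = 1.0 / (N / vE + 1.0 / v_M)
        m = rng.gauss(var_m * resid / vE, math.sqrt(var_m))
        # a_i | rest: conditionally independent given the rest
        for i in range(n):
            var_a = 1.0 / (T[i] / vE + 1.0 / vA)
            mean_a = var_a * sum(y[i][t] - m for t in range(T[i])) / vE
            a[i] = rng.gauss(mean_a, math.sqrt(var_a))
        # v_A and v_E | rest: inverse gamma via reciprocal gamma draws
        vA = 1.0 / rng.gammavariate(
            aA + n / 2.0, 1.0 / (bA + 0.5 * sum(ai * ai for ai in a)))
        sse = sum((y[i][t] - m - a[i]) ** 2
                  for i in range(n) for t in range(T[i]))
        vE = 1.0 / rng.gammavariate(aE + N / 2.0, 1.0 / (bE + 0.5 * sse))
        draws.append((m, list(a), vA, vE))
    return draws
```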
5.4.3 Example: Bayesian ridge regression
Let $y_i$ be the measured response for the $i$th case and let $x_i = (1, x_{i2}, \ldots, x_{ip})' \in \mathbb{R}^p$ be the values of the predictors for the $i$th case. Define $X \in \mathbb{R}^{n \times p}$ to have $i$th row $x_i'$. Assume that $(y_1, \ldots, y_n)'$ is a realization of $Y$, where
$$(Y \mid B = \beta, V = v, L = \lambda) \sim N_n(X\beta, v I_n),$$
$$(B \mid V = v, L = \lambda) \sim N_p\left(\tilde{\beta}, \frac{v}{\lambda} I_p\right),$$
$$(V \mid L = \lambda) \sim \mathrm{InvGam}(a_V, b_V),$$
$$L \sim \mathrm{Gamma}(a_L, b_L).$$
The $\mathrm{Gamma}(a, b)$ distribution has density at $\lambda$ proportional to $\lambda^{a-1} e^{-b\lambda}$. Then
$$f(\beta, v, \lambda \mid y) \propto v^{-(a_V+1)} e^{-b_V/v}\, \lambda^{a_L-1} e^{-b_L\lambda}\, \lambda^{p/2} v^{-p/2}\exp\left\{-\frac{\lambda}{2v}(\beta - \tilde{\beta})'(\beta - \tilde{\beta})\right\} \times v^{-n/2}\exp\left\{-\frac{1}{2v}(y - X\beta)'(y - X\beta)\right\} = \lambda^{a_L-1+p/2}\, v^{-(a_V+1+p/2+n/2)} \times \exp\left\{-\frac{1}{2v}(y - X\beta)'(y - X\beta) - \frac{\lambda}{2v}(\beta - \tilde{\beta})'(\beta - \tilde{\beta}) - b_L\lambda - \frac{b_V}{v}\right\}.$$
We will use the fact that $\phi(x; \mu, \Sigma) \propto_x \exp(-0.5\, x'\Sigma^{-1}x + x'\Sigma^{-1}\mu)$. We have that
$$f(\beta \mid v, \lambda, y) \propto \exp\left\{-\frac{1}{2v}(y - X\beta)'(y - X\beta) - \frac{\lambda}{2v}(\beta - \tilde{\beta})'(\beta - \tilde{\beta}) - b_L\lambda - \frac{b_V}{v}\right\} \propto \exp\left\{-\frac{1}{2}\beta'\left(\frac{1}{v}X'X + \frac{\lambda}{v}I_p\right)\beta + \beta'\left(\frac{1}{v}X'y + \frac{\lambda}{v}\tilde{\beta}\right)\right\},$$
which implies that
$$(B \mid V = v, L = \lambda, Y = y) \sim N_p\left((X'X + \lambda I_p)^{-1}(X'y + \lambda\tilde{\beta}),\; v(X'X + \lambda I_p)^{-1}\right).$$
Let $\mu_B = (X'X + \lambda I_p)^{-1}(X'y + \lambda\tilde{\beta})$. Then
$$f(\beta \mid v, \lambda, y) \propto_{\beta, v} v^{-p/2}\exp\left\{-\frac{1}{2v}(\beta - \mu_B)'(X'X + \lambda I_p)(\beta - \mu_B)\right\} \propto_{\beta, v} v^{-p/2}\exp\left[-\frac{1}{2v}\left\{\beta'(X'X + \lambda I_p)\beta - 2\beta'(X'X + \lambda I_p)\mu_B + \mu_B'(X'X + \lambda I_p)\mu_B\right\}\right] \propto_{\beta, v} v^{-p/2}\exp\left[-\frac{1}{2v}\left\{\beta'(X'X + \lambda I_p)\beta - 2\beta'(X'y + \lambda\tilde{\beta}) + \mu_B'(X'X + \lambda I_p)\mu_B\right\}\right].$$
We can write
$$f(\beta, v \mid \lambda, y) \propto v^{-(a_V+1+n/2)} v^{-p/2} \times \exp\left[-\frac{1}{2v}\left\{y'y - 2\beta'X'y + \beta'X'X\beta + \lambda\beta'\beta - 2\lambda\beta'\tilde{\beta} + \lambda\tilde{\beta}'\tilde{\beta} + 2b_V\right\}\right] = f(\beta \mid v, \lambda, y)\, v^{-(a_V+1+n/2)}\exp\left[-\frac{1}{2v}\left\{y'y - \mu_B'(X'X + \lambda I_p)\mu_B + \lambda\tilde{\beta}'\tilde{\beta} + 2b_V\right\}\right] \propto f(\beta \mid v, \lambda, y)\, f(v \mid \lambda, y),$$
which implies that
$$(V \mid L = \lambda, Y = y) \sim \mathrm{InvGam}\left(a_V + \frac{n}{2},\; b_V + \frac{1}{2}\left\{y'y - \mu_B'(X'X + \lambda I_p)\mu_B + \lambda\tilde{\beta}'\tilde{\beta}\right\}\right).$$
We have that
$$f(\lambda \mid \beta, v, y) \propto \lambda^{a_L-1+p/2}\exp\left\{-\lambda\left(\frac{1}{2v}\|\beta - \tilde{\beta}\|^2 + b_L\right)\right\},$$
which implies that
$$(L \mid B = \beta, V = v, Y = y) \sim \mathrm{Gamma}\left(a_L + \frac{p}{2},\; \frac{1}{2v}\|\beta - \tilde{\beta}\|^2 + b_L\right).$$
To summarize, we pick $(\beta_0, v_0, \lambda_0)$ and set $n = 0$.

1. Generate $(\beta_{n+1}, v_{n+1})$ from $(B, V \mid L = \lambda_n, Y = y)$ with the following two steps:

   (a) Generate $v_{n+1}$ from $(V \mid L = \lambda_n, Y = y)$.

   (b) Generate $\beta_{n+1}$ from $(B \mid V = v_{n+1}, L = \lambda_n, Y = y)$.

2. Generate $\lambda_{n+1}$ from $(L \mid B = \beta_{n+1}, V = v_{n+1}, Y = y)$.

3. Replace $n$ by $n + 1$ and go to step 1.
6 Inference for $E\{h(X)\}$, where $X$ has the target distribution

This section is based on Atchade (2008) and Jones (2013). Suppose our goal is to estimate $E\{h(X)\}$ where $X \sim P_X$ (the target distribution with density $f$). Using the Markov chain $X_0, X_1, \ldots$ with invariant distribution $P_X$ (from our sampling algorithm), we estimate $E\{h(X)\}$ with
$$\widehat{E}\{h(X)\} = \frac{1}{n}\sum_{i=1}^n h(X_i).$$
If the chain is $P_X$-irreducible and aperiodic, then $\widehat{E}\{h(X)\} \to E\{h(X)\}$ almost surely as $n \to \infty$. A Markov chain with transition kernel $Q$ is geometrically ergodic if there exists a $\rho \in (0, 1)$ and a function $M : \mathcal{X} \to [0, \infty)$ such that
$$\|Q^{(n)}(x, \cdot) - P_X\|_{TV} \le M(x)\rho^n,$$
for all $x \in \mathcal{X}$ and all non-negative integers $n$. Here are two selected results (Jones et al., 2006; Jones, 2013):

- If the Harris chain is geometrically ergodic and $E\{|h(X)|^{2+\delta}\} < \infty$ for some $\delta > 0$, then
$$\sqrt{n}\left[\widehat{E}\{h(X)\} - E\{h(X)\}\right] \xrightarrow{L} N(0, v_h).$$

- If the Harris chain is geometrically ergodic, reversible, and $E\{h^2(X)\} < \infty$, then
$$\sqrt{n}\left[\widehat{E}\{h(X)\} - E\{h(X)\}\right] \xrightarrow{L} N(0, v_h).$$

In general, $\mathrm{var}\left[\widehat{E}\{h(X)\}\right] = n^{-2}\sum_{i=1}^n\sum_{j=1}^n \mathrm{cov}\{h(X_i), h(X_j)\}$. Without giving details on how to simplify this formula, we will estimate the asymptotic variance $v_h$ using the batch means method. Suppose we have $b$ batches of size $k$. Let
$$Y_j = \frac{1}{k}\sum_{i=(j-1)k+1}^{jk} h(X_i), \quad j = 1, \ldots, b.$$
The batch means estimator of $v_h$ is
$$\hat{v}_h = \frac{k}{b-1}\sum_{j=1}^b\left[Y_j - \widehat{E}\{h(X)\}\right]^2.$$
This estimator is inconsistent in general, but is consistent if we let $k$ and $b$ increase with $n$ and assume that other regularity conditions hold (Jones, 2013). Jones suggested that $k = \sqrt{n}$ and $b = n/k$ are sensible choices. Assuming that $\widehat{E}\{h(X)\}$ is approximately normal, a $100(1 - \alpha)\%$ approximate confidence interval for $E\{h(X)\}$ is
$$\widehat{E}\{h(X)\} \pm \mathrm{qnorm}(1 - \alpha/2)\sqrt{\frac{\hat{v}_h}{n}}.$$
References
Atchade, Y. F. (2008). Course notes for Statistics 606. Lecture notes.
Jones, G. L. (2013). Course notes for STAT 8701: Computational statistical methods. Lecture notes.
Jones, G. L., Haran, M., Caffo, B. S., and Neath, R. (2006). Fixed-width output analysis for Markov chain Monte Carlo. Journal of the American Statistical Association, 101:1537–1547.
Keener, R. W. (2005). Statistical theory: A medley of core topics. Notes for a course in theoretical statistics.