
Chapter 3: Monte Carlo methods

Maarten Jansen

Overview

1. Aspects of Monte Carlo Methods

1.1 Monte Carlo integration and importance sampling

1.2 Random number generators

1.2.1 Quantile method

1.2.2 Rejection sampling

2. Markov Chain Monte Carlo Methods

2.1 Markov Chains

2.2 Models for multivariate RV

2.2.1 Markov Random Fields (MRF)

2.2.2 Gibbs Random Fields (GRF)

2.2.3 The Hammersley-Clifford Theorem

2.3 MCMC samplers for integration

2.3.1 Gibbs sampler

2.3.2 Metropolis-Hastings sampler

2.4 Simulated annealing - MCMC optimization

© Maarten Jansen, STAT-F-408 Comp. Stat., Chap. 3: Monte Carlo methods

1. Aspects of Monte Carlo Methods

Monte Carlo simulation or stochastic simulation

• tries to re-formulate a problem such that its solution is the unknown

parameter of an artificial random variable

• generates instances (an artificial sample) from that random variable

• applies statistical techniques to

– find (estimate) the parameter from the artificial sample

– evaluate the quality of the numerical outcome

• but it is essentially a method from numerical analysis

• Many of the applications of this numerical method come from statistical problems:

statistical problem → numerical solution → statistical technique


Two main categories of problems

• Integration

• Optimization


1.1 Monte Carlo integration and importance sampling

Suppose we want to evaluate $I = \int_a^b y(x)\,dx$

• Suppose X ∼ uniform[a, b], then $I = (b - a) \cdot E(y(X))$

• Generate $X_i$, with i = 1, . . . , n

• Estimate $\hat{I} = \frac{b-a}{n} \sum_{i=1}^{n} y(X_i)$
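The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `mc_integrate`, the integrand sin(x) on [0, π] and the sample size are assumptions chosen for the example. The standard error is estimated from the same artificial sample.

```python
import math
import random

def mc_integrate(y, a, b, n, rng):
    """Estimate the integral of y over [a, b] from n uniform draws.

    Returns the estimate and its estimated standard error.
    """
    values = [y(rng.uniform(a, b)) for _ in range(n)]
    mean = sum(values) / n
    # Sample variance of one observation y(X_i)
    var_one = sum((v - mean) ** 2 for v in values) / (n - 1)
    estimate = (b - a) * mean
    std_error = (b - a) * math.sqrt(var_one / n)
    return estimate, std_error

rng = random.Random(0)  # fixed seed, only to make the run reproducible
estimate, std_error = mc_integrate(math.sin, 0.0, math.pi, 100_000, rng)
print(estimate, std_error)  # the exact value of the integral is 2
```

The reported standard error decreases as n^(-1/2), whatever the integrand.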


Accuracy of the stochastic approximation

We use statistical measures to evaluate the approximation

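1. Bias: $E(\hat{I}) = \frac{b-a}{n} \sum_{i=1}^{n} E(y(X_i)) = (b-a) \cdot E(y(X)) = (b-a) \int_a^b y(x) \cdot \frac{1}{b-a}\,dx = I$

The estimator is unbiased

2. Variance: $\mathrm{var}(\hat{I}) = \frac{(b-a)^2}{n} \mathrm{var}(y(X)) = \frac{(b-a)^2}{n} \int_a^b \left( y(x) - \frac{I}{b-a} \right)^2 \cdot \frac{1}{b-a}\,dx$

(the expected value of one observation is $E(y(X)) = I/(b-a)$)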

Variance has two components:

– Order of magnitude:

∗ $\sigma_{\hat{I}} = O(n^{-1/2})$

∗ typical result for variance of sample mean

∗ Independent from dimension

– Variance of one observation

Two questions

• How does this compare to competitors?

• How can we improve? → not on order of magnitude


Competitors: numerical integration (= quadrature)

Numerical integration is based on the principle: approximate the integrand by

a function that is easy to integrate.

The approximation is based on a limited number of observations of the

integrand only, and it is constructed using interpolation or smoothing.

The error of numerical integration methods depends on several factors

• The smoothness of the integrand, in particular: is the integrand easy to

approximate (see figures below)

• The number of function evaluations or observations n

• The location xi in which integrand is observed or evaluated

• The dimension (curse of dimensionality)


Functions that are difficult to approximate

Functions with

1. infinite slope

2. singularities

3. heavy oscillations

These features require locally dense observations/function evaluations


A very brief overview of quadrature methods

• For given xi, y(xi), quadrature formulas are based on

– Approximation of the integrand by polynomials:

∗ Rectangular rule or Midpoint rule

∗ Trapezoid rule

∗ Simpson’s rule

– Breaking up the interval [a, b] into subintervals→ composite rules

• When the $x_i$ are free to choose, the order of approximation can be optimised by choosing the $x_i$ to be the zeros of orthogonal polynomials → Gauss Quadrature


Accuracy of quadrature methods

Assuming that the integrand is “sufficiently smooth”, the approximation $I_q$ of I in one dimension satisfies

$|I - I_q| \le C \cdot n^{-1}$,

and for many methods

$|I - I_q| \le C \cdot n^{-\alpha}$, with α > 1

Compare with random simulation: $\left[ E(\hat{I} - I)^2 \right]^{1/2} \sim n^{-1/2}$

(Note that the error measures are different)


Curse of dimensionality

Observation

If $n_1$ observations (function evaluations) are needed for given accuracy of a numerical integration technique in one dimension, then the same technique extended into d dimensions requires $n_1^d$ observations.

Reason

• – Accuracy of numerical integration is a deterministic thing: we must cover every area in

the region of integration to be sure that accuracy is met.

– Accuracy thus directly linked to interpoint-distance

– High dimensions means many dimensions in which two points can be distant from

each other.

– Many more observations needed for the same interpoint distance

• Quadrature is based on clever approximations of functions. It’s hard to be clever in high

dimensions: hard to find equally good approximations.

No curse of dimensionality for stochastic simulation


Applications in statistics

• Computation of expected values: $E(h(X)) = \int_{-\infty}^{\infty} f_X(x)\,h(x)\,dx$

• Computation of probabilities: $P(X \in A) = E(\chi_A(X)) = \int_A f_X(x)\,dx$

($\chi_A(X)$ is the characteristic or indicator function of A)

• Computation of quantiles: $Q_X(p) = F_X^{-1}(p)$ with $F_X(u) = \int_{-\infty}^{u} f_X(x)\,dx$

These problems appear in

• Bootstrapping and simulation

• Bayesian analysis: computation of posterior means, medians

• . . .


Non-uniform sampling

We have above the general expression $\mu = E(h(X)) = \int_{-\infty}^{\infty} f_X(x)\,h(x)\,dx$

which we can estimate by $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} h(X_i)$

So, if we have an integral $I = \int_a^b y(x)\,dx$

then we can define h(x) as $h(x) = \frac{y(x) \cdot \chi_{[a,b]}(x)}{f_X(x)}$ (if this ratio is bounded near zeros of $f_X(x)$)

and estimate $\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(X_i) = \frac{1}{n} \sum_{i=1}^{n} \frac{y(X_i) \cdot \chi_{[a,b]}(X_i)}{f_X(X_i)}$

where all $X_i$ are IID and have density $f_X(x)$.


Examples

• $f_X(x) = y(x)/x$, i.e., h(x) = x

(only possible if y(x)/x is positive with integral equal to 1)

Then $I = \mu_X = E(X) = \int_{-\infty}^{\infty} x \cdot f_X(x)\,dx$ and $\hat{I} = \frac{1}{n} \sum_{i=1}^{n} X_i$

• $h(x) = \chi_A(x)$, then $I = p = P(X \in A) = \int_A f_X(x)\,dx$ and

$\hat{I} = \frac{1}{n} \sum_{i=1}^{n} \chi_A(X_i) = \frac{\#\{i \mid X_i \in A\}}{n}$

• $f_X(x) = \frac{1}{b-a} \cdot \chi_{[a,b]}(x)$ and take h(x) such that $h(x) \cdot f_X(x) = y(x)$ (where we assume that y(x) is zero outside [a, b]; note that h(x) outside [a, b] is free to choose)

$I = \int_a^b y(x)\,dx$ and $\hat{I} = \frac{1}{n} \sum_{i=1}^{n} h(X_i) = \frac{b-a}{n} \sum_{i=1}^{n} y(X_i)$

From these examples, it is clear that there are many ways to estimate an integral. We formalise

this problem.


The importance function

If X has density function fX(x) and we want to estimate

$\mu = E(h(X)) = \int_{-\infty}^{\infty} h(x) \cdot f_X(x)\,dx$,

then we may estimate this from a sample $X_i$ as

$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} h(X_i)$

If it is easier to sample from $f_U(u)$ (for instance, uniform random variables are easy to generate), then we can write

$E(h(X)) = \int_{-\infty}^{\infty} h(u) \cdot f_X(u)\,du = \int_{-\infty}^{\infty} h(u) \cdot \frac{f_X(u)}{f_U(u)} \cdot f_U(u)\,du$

We call the new sampling distribution $f_U(u)$ the importance function

and we denote $w(u) = \frac{f_X(u)}{f_U(u)}$

As a result $\mu = E(h(X)) = \int_{-\infty}^{\infty} h(u) \cdot w(u) \cdot f_U(u)\,du$

We can now estimate $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} h(U_i) \cdot w(U_i)$

The question is now how to choose fU (u)

• It must be easy to generate samples from it

• The variance of the estimator must be as low as possible
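As a hedged illustration of the weighted estimator, the sketch below estimates the tail probability P(X > 4) for a standard normal X, with h the indicator of (4, ∞). The importance function, a normal density centred at 4, is an assumption made for this example only; the helper names `normal_pdf` and `importance_estimate` are hypothetical.

```python
import math
import random

def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of N(mean, sd^2)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def importance_estimate(h, f_x, f_u, draw_u, n):
    """Average h(U_i) * w(U_i) with weights w(u) = f_x(u) / f_u(u)."""
    total = 0.0
    for _ in range(n):
        u = draw_u()
        total += h(u) * f_x(u) / f_u(u)
    return total / n

rng = random.Random(1)  # fixed seed for reproducibility
shift = 4.0             # assumed threshold of the tail event
mu_hat = importance_estimate(
    h=lambda u: 1.0 if u > shift else 0.0,
    f_x=normal_pdf,                           # target density: N(0, 1)
    f_u=lambda u: normal_pdf(u, mean=shift),  # importance function: N(4, 1)
    draw_u=lambda: rng.gauss(shift, 1.0),
    n=50_000,
)
exact = 0.5 * math.erfc(shift / math.sqrt(2.0))  # closed form, for comparison
print(mu_hat, exact)
```

Sampling directly from N(0, 1) would leave almost no draws beyond 4, so nearly all terms of the plain average would be zero; the shifted importance function places about half of the draws in the region that matters.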


The variance of importance sampling

The variance equals $\mathrm{var}(\hat{\mu}) = \frac{1}{n} \cdot \mathrm{var}[h(U) \cdot w(U)]$

We can develop the variance of one term as

$\mathrm{var}[h(U) \cdot w(U)] = E([h(U) \cdot w(U)]^2) - (E[h(U) \cdot w(U)])^2$

$= E([h(U) \cdot w(U)]^2) - \mu^2$

$= E([|h(U)| \cdot w(U)]^2) - \mu^2$

$\ge (E[|h(U)| \cdot w(U)])^2 - \mu^2 = (E|h(X)|)^2 - \mu^2$

The lower bound is independent from $f_U(u)$. The inequality becomes an equality if for $V = |h(U)| \cdot w(U)$ it holds that $E(V^2) = (EV)^2$, or, $\mathrm{var}(V) = E(V^2) - (EV)^2 = 0$, thus if V is deterministic (with prob. 1).

So minimum variance is obtained if $|h(U)| \cdot w(U) = K$ for any random U, i.e., $|h(u)| \cdot w(u) = K$, ∀u ∈ R.

We have $|h(u)| \cdot w(u) = K \Leftrightarrow f_U(u) = \frac{|h(u)| \cdot f_X(u)}{K}$

Imposing $\int_{-\infty}^{\infty} f_U(u)\,du = 1$, we find minimum variance for

$f_U(u) = \frac{|h(u)| \cdot f_X(u)}{\int_{-\infty}^{\infty} |h(u)| \cdot f_X(u)\,du}$


Interpretation of this result

• The result is of little immediate use. Indeed, full application requires knowledge of $\int_{-\infty}^{\infty} |h(u)| \cdot f_X(u)\,du$

If h(u) ≥ 0, ∀u ∈ R, this is exactly the integral we are after. Otherwise, computing this integral is probably as difficult as the original question.

• var(µ) can be much lower than when estimating µ with samples from

fX(x).

• The basic idea is that fU (u) should behave not just as fX(x), but it

should also “follow” |h(u)|. Regions where h(u) is large in magnitude

should be sampled more.

• Pay special attention to tails of |h(u)| · fX(u)


Example with mixture of uniform sampling

• Mixture of L uniform random variables

• Uniform on (non-convex) subdomains $I_\ell$ defined by $I_\ell = \left\{ x \,\middle|\, |y(x)| \ge \frac{\ell}{L} \max |y(x)| \right\}$

• mixture probability mass function $p_\ell = |I_\ell| / \sum_{\ell=1}^{L} |I_\ell|$


Example from Bayesian statistics

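Suppose we observe $X_i \sim N(M, \sigma^2)$ with $\sigma^2$ known. We want to estimate the mean M, for which we impose a Cauchy prior model

$f_M(m) = \frac{1}{\pi (1 + (m - \mu)^2)}$,

where the hyperparameter µ may express prior knowledge of expected values (could be zero, e.g.)

The conditional sample density is

$f_{X|M}(x|m) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \cdot e^{-(x_i - m)^2 / 2\sigma^2}$

$= \frac{1}{(2\pi)^{n/2} \sigma^n} \cdot e^{-\sum_{i=1}^{n} (x_i - m)^2 / 2\sigma^2}$

$= \frac{1}{(2\pi)^{n/2} \sigma^n} \cdot e^{-(\bar{x} - m)^2 / (2\sigma^2/n)} \cdot e^{-\sum_{i=1}^{n} (x_i - \bar{x})^2 / 2\sigma^2}$

where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the sample mean.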

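Then the joint distribution is

$f_{M,X}(m, x) = f_M(m) \cdot f_{X|M}(x|m)$

$= \frac{1}{\pi (1 + (m - \mu)^2)} \cdot \frac{1}{(2\pi)^{n/2} \sigma^n} \cdot e^{-(\bar{x} - m)^2 / (2\sigma^2/n)} \cdot e^{-\sum_{i=1}^{n} (x_i - \bar{x})^2 / 2\sigma^2}$

So the marginal distribution of X becomes

$f_X(x) = \int_{-\infty}^{\infty} f_{M,X}(m, x)\,dm = \int_{-\infty}^{\infty} f_M(m) \cdot f_{X|M}(x|m)\,dm$

$= C(x) \cdot \int_{-\infty}^{\infty} \frac{1}{1 + (m - \mu)^2} \cdot e^{-(\bar{x} - m)^2 / (2\sigma^2/n)}\,dm$

with

$C(x) = \frac{1}{\pi} \cdot \frac{1}{(2\pi)^{n/2} \sigma^n} \cdot e^{-\sum_{i=1}^{n} (x_i - \bar{x})^2 / 2\sigma^2}$

(Note that the integral exists thanks to the rapid decay of the normal bell curve)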

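The posterior distribution of M, given the observation X, becomes

$f_{M|X}(m|x) = \frac{f_M(m) \cdot f_{X|M}(x|m)}{f_X(x)} = \frac{f_M(m) \cdot f_{X|M}(x|m)}{\int_{-\infty}^{\infty} f_M(m) \cdot f_{X|M}(x|m)\,dm}$

$= \frac{C(x) \cdot \frac{1}{1 + (m - \mu)^2} \cdot e^{-(\bar{x} - m)^2 / (2\sigma^2/n)}}{C(x) \cdot \int_{-\infty}^{\infty} \frac{1}{1 + (m - \mu)^2} \cdot e^{-(\bar{x} - m)^2 / (2\sigma^2/n)}\,dm}$

$= \frac{\frac{1}{1 + (m - \mu)^2} \cdot e^{-(\bar{x} - m)^2 / (2\sigma^2/n)}}{\int_{-\infty}^{\infty} \frac{1}{1 + (m - \mu)^2} \cdot e^{-(\bar{x} - m)^2 / (2\sigma^2/n)}\,dm}$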

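Possible values of interest are

the posterior mean

$E(M | X = x) = \frac{\int_{-\infty}^{\infty} \frac{m}{1 + (m - \mu)^2} \cdot e^{-(\bar{x} - m)^2 / (2\sigma^2/n)}\,dm}{\int_{-\infty}^{\infty} \frac{1}{1 + (m - \mu)^2} \cdot e^{-(\bar{x} - m)^2 / (2\sigma^2/n)}\,dm}$

and the posterior variance which is

$\mathrm{var}(M | X = x) = E(M^2 | X = x) - [E(M | X = x)]^2$ with

$E(M^2 | X = x) = \frac{\int_{-\infty}^{\infty} \frac{m^2}{1 + (m - \mu)^2} \cdot e^{-(\bar{x} - m)^2 / (2\sigma^2/n)}\,dm}{\int_{-\infty}^{\infty} \frac{1}{1 + (m - \mu)^2} \cdot e^{-(\bar{x} - m)^2 / (2\sigma^2/n)}\,dm}$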


Computation of the integrals

At least two possibilities:

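• Sample from normal density with expected value $\bar{x}$ and variance $\sigma^2/n$:

$X_n \sim N(\bar{x}, \sigma^2/n)$,

Then

$E(M^k | X = x) = \frac{E\left( \frac{X_n^k}{1 + (X_n - \mu)^2} \right)}{E\left( \frac{1}{1 + (X_n - \mu)^2} \right)}$

• Sample from Cauchy density with center (median) µ

$f_U(u) = 1 / [\pi (1 + (u - \mu)^2)]$

Then

$E(M^k | X = x) = \frac{E\left( U^k \cdot e^{-(U - \bar{x})^2 / (2\sigma^2/n)} \right)}{E\left( e^{-(U - \bar{x})^2 / (2\sigma^2/n)} \right)}$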


• Sample from another distribution “as close as possible” to the integrand.

In this case, the normal density is a much better choice than the Cauchy

• The tails of the integrand are lighter than the normal tail, so the heavy

tails of the Cauchy produce a lot of large samples whose values are not

representative for the integral

• The normal density has sample size (n) dependent variance, so that

samples get more concentrated for large n, which corresponds to the

true shape of the integrand

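As an illustration, we plot the estimates of the standard errors in estimating the following parameter

$I = \int_{-\infty}^{\infty} \frac{1}{\pi (1 + (u - \mu)^2)} \cdot \frac{1}{\sqrt{2\pi}\,\sigma/\sqrt{n}} \cdot e^{-(u - \bar{x})^2 / (2\sigma^2/n)}\,du$

$= E\left( \frac{1}{\pi (1 + (X_n - \mu)^2)} \right)$

$= E\left( \frac{1}{\sqrt{2\pi}\,\sigma/\sqrt{n}} \cdot e^{-(U - \bar{x})^2 / (2\sigma^2/n)} \right)$

with $X_n \sim N(\bar{x}, \sigma^2/n)$ and $U \sim \mathrm{Cauchy}(\mu, 1)$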


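Then we simulate $X_{n,i}$ and $U_i$ for $i = 1, \ldots, n_{MC}$ and define

$\hat{I}_1 = \frac{1}{n_{MC}} \sum_{i=1}^{n_{MC}} \frac{1}{\pi (1 + (X_{n,i} - \mu)^2)}$

$\hat{I}_2 = \frac{1}{n_{MC}} \sum_{i=1}^{n_{MC}} \frac{1}{\sqrt{2\pi}\,\sigma/\sqrt{n}} \cdot e^{-(U_i - \bar{x})^2 / (2\sigma^2/n)}$

We can easily estimate the variances

$\mathrm{var}(\hat{I}_1) = \frac{1}{n_{MC}} \mathrm{var}\left( \frac{1}{\pi (1 + (X_n - \mu)^2)} \right)$

$\mathrm{var}(\hat{I}_2) = \frac{1}{n_{MC}} \mathrm{var}\left( \frac{1}{\sqrt{2\pi}\,\sigma/\sqrt{n}} \cdot e^{-(U - \bar{x})^2 / (2\sigma^2/n)} \right)$

The estimates of the standard errors of a single observation (to be divided by $\sqrt{n_{MC}}$) are depicted below (together with the log of the estimates, to better show the behavior)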


[Figure, two panels: estimated st.dev. of one observation, and log(estimated st.dev.) of one observation, both plotted against n (from 0 to 200), for Cauchy samples and normal samples]
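The comparison between the two sampling schemes can be reproduced along the following lines. This is only a sketch: the values µ = 0, σ = 1 and a sample mean of 1 are assumptions for the illustration (the settings behind the original figure are not given), and the function name `sd_one_observation` is hypothetical.

```python
import math
import random
import statistics

def sd_one_observation(n, xbar=1.0, sigma=1.0, mu=0.0, n_mc=20_000, seed=0):
    """Return (mean, st.dev.) of the terms of I_1 and of I_2 for sample size n."""
    rng = random.Random(seed)
    s = sigma / math.sqrt(n)  # standard deviation of X_n
    # Terms of I_1: Cauchy density evaluated at draws from N(xbar, sigma^2/n)
    terms_normal = []
    for _ in range(n_mc):
        x = rng.gauss(xbar, s)
        terms_normal.append(1.0 / (math.pi * (1.0 + (x - mu) ** 2)))
    # Terms of I_2: normal density evaluated at draws from Cauchy(mu, 1),
    # generated by the quantile method
    terms_cauchy = []
    for _ in range(n_mc):
        u = mu + math.tan(math.pi * (rng.random() - 0.5))
        z = (u - xbar) / s
        terms_cauchy.append(math.exp(-0.5 * z * z) / (s * math.sqrt(2.0 * math.pi)))
    return (statistics.mean(terms_normal), statistics.stdev(terms_normal),
            statistics.mean(terms_cauchy), statistics.stdev(terms_cauchy))

i1, sd1, i2, sd2 = sd_one_observation(n=100)
print(i1, sd1)  # normal samples
print(i2, sd2)  # Cauchy samples
```

Both averages estimate the same integral I, but the terms built from the normal samples vary much less than the terms built from the Cauchy samples.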


Importance function must have sufficiently heavy tails

The previous examples have illustrated that a sampling density with heavier tails than the integrand is of little use.

The opposite is, however, much worse (so if no perfect match can be realized, a slightly too heavy tail is preferable)

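We have

$\mathrm{var}(\hat{\mu}) = \frac{1}{n} \cdot \mathrm{var}[h(U) \cdot w(U)] = \frac{1}{n} \cdot \left\{ E[(h(U) \cdot w(U))^2] - \mu^2 \right\}$

Herein

$E[(h(U) \cdot w(U))^2] = \int_{-\infty}^{\infty} [h(u)]^2 [w(u)]^2 f_U(u)\,du$

$= \int_{-\infty}^{\infty} [h(u)]^2 \frac{f_X(u)^2}{f_U(u)}\,du$

$= \int_{-\infty}^{\infty} \frac{h(u) f_X(u)}{f_U(u)} \cdot h(u) f_X(u)\,du$

If $h(u) f_X(u)$ has a heavier tail than $f_U(u)$, then the first factor tends to infinity for u → ∞. The integral may then be large or even infinite, depending on the tail of $h(u) f_X(u)$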


Conclusions about importance sampling

Importance sampling makes it possible to

• estimate expected values (integrals) with respect to a random variable X whose distribution does not allow easy simulation, by drawing from another random variable U which is easier to simulate, followed by proper re-weighting.

• optimize (to some extent) the choice of the sampling distribution used to estimate integrals.

We will later discuss rejection sampling, which also samples from an auxiliary distribution. The

outcome is then rejected or accepted with an appropriate probability such that, a posteriori,

(given the event of rejection or acceptance) the variable takes the aimed distribution. Unlike

importance sampling, the correction of rejection sampling thus proceeds at the level of the

random number generator itself (and not at the level of computing the integral). We therefore

discuss random number generators.


1.2 Random number generators

Importance sampling assumes that we can generate numbers from a given

distribution. How can we do that?


1.2.1 Quantile or inversion method

Theorem If U ∼ uniform[0, 1] and $Q_X(p)$ is the quantile function of X, then $Q_X(U)$ has the same distribution as X, i.e.:

$U \sim \mathrm{uniform}[0, 1] \Rightarrow Q_X(U) \stackrel{d}{=} X$

Proof uses monotonicity of $Q_X(p)$ or its inverse $F_X(x)$

$P(Q_X(U) \le x) = P(U \le Q_X^{-1}(x))$

$= F_U(Q_X^{-1}(x))$

$= Q_X^{-1}(x)$

$= F_X(x)$

Example 1: Let X ∼ exp(λ), then $F_X(x) = 1 - e^{-\lambda x}$, so $Q_X(p) = -\log(1 - p)/\lambda$, so if U ∼ uniform[0, 1], then take

$X = -\log(1 - U)/\lambda$,

or, because 1 − U is also uniform (by symmetry), we can take

$Y = -\log(U)/\lambda$
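A minimal sketch of the quantile method for the exponential distribution follows. The function name `sample_exponential` and the rate λ = 2 are assumptions for the illustration. The code uses 1 − U rather than U because Python's `random()` returns values in [0, 1), so 1 − U is never zero.

```python
import math
import random

def sample_exponential(lam, rng):
    """Draw from exp(lam) by applying the quantile function to a uniform draw."""
    # rng.random() lies in [0, 1), so 1 - U lies in (0, 1] and the log is defined
    return -math.log(1.0 - rng.random()) / lam

rng = random.Random(2)  # fixed seed for reproducibility
lam = 2.0
draws = [sample_exponential(lam, rng) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
print(sample_mean)  # the expected value of exp(lam) is 1/lam
```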


Quantile or inversion method(2)

Example 2: Let X ∼ Cauchy with median µ, i.e., $f_X(x) = \frac{1}{\pi [1 + (x - \mu)^2]}$

If µ ≠ 0, then X can be generated by adding µ to a Cauchy random variable with median 0.

So, we assume that µ = 0.

Then $F_X(x) = \frac{1}{2} + \frac{1}{\pi} \arctan(x)$, and $Q_X(U) = \tan[\pi (U - 1/2)]$

Note: if X ∼ Cauchy(µ = 0) then −X ∼ Cauchy(µ = 0) and 1/X ∼ Cauchy(µ = 0).

Indeed, for Y = 1/X, $f_Y(y) = f_X(x(y)) \left| \frac{dx(y)}{dy} \right| = \frac{1}{\pi} \cdot \frac{1}{1 + 1/y^2} \cdot \frac{1}{y^2} = \frac{1}{\pi} \cdot \frac{1}{1 + y^2}$

So, if X ∼ Cauchy(µ = 0) then Y = −1/X ∼ Cauchy(µ = 0).

And if $X = \tan[\pi (U - 1/2)]$ ∼ Cauchy(µ = 0), then $Y = -1/X = \tan(\pi U)$ ∼ Cauchy(µ = 0)
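The Cauchy quantile function translates directly into code. The sketch below is an illustration under assumed settings (median µ = 3, hypothetical function name `sample_cauchy`). Since a Cauchy variable has no expected value, the check uses the sample median and the fraction of draws within distance 1 of the median, which should be close to 1/2 because the quartiles are µ ± 1.

```python
import math
import random

def sample_cauchy(mu, rng):
    """Draw from Cauchy with median mu using Q_X(U) = tan(pi (U - 1/2))."""
    return mu + math.tan(math.pi * (rng.random() - 0.5))

rng = random.Random(3)  # fixed seed for reproducibility
mu = 3.0
draws = sorted(sample_cauchy(mu, rng) for _ in range(100_000))
sample_median = draws[len(draws) // 2]
# Fraction of draws between the theoretical quartiles mu - 1 and mu + 1
central_fraction = sum(1 for x in draws if abs(x - mu) <= 1.0) / len(draws)
print(sample_median, central_fraction)
```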


Example: Box-Muller transform for normal

Problem: normal CDF FZ(z) = Φ(z) has no closed formula, working with

quantile QZ(U) is not possible, unless software provides detailed tables of

QZ(p)

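Solution: we go for two independent normal RV: $(Z_1, Z_2) \sim N_2(0, I_2)$, then we know:

• $Z_1^2 + Z_2^2 \sim \chi^2(2) = \exp(1/2)$

Indeed

1. $P(Z^2 < x) = P(-\sqrt{x} < Z < \sqrt{x}) = \Phi(\sqrt{x}) - \Phi(-\sqrt{x})$

Hence $f_{Z^2}(x) = [\phi(\sqrt{x}) + \phi(-\sqrt{x})] / (2\sqrt{x}) = e^{-x/2} / \sqrt{2\pi x}$

which is: $Z^2 \sim \chi^2(1) = \Gamma(1/2, 1/2)$

2. Let $Y = Z_1^2 + Z_2^2$, then

$f_Y(y) = \int_0^y f_{Z_1^2}(z)\, f_{Z_2^2}(y - z)\,dz = \int_0^y \frac{e^{-z/2}}{\sqrt{2\pi z}} \cdot \frac{e^{-(y-z)/2}}{\sqrt{2\pi (y - z)}}\,dz$

$= \frac{e^{-y/2}}{2\pi} \int_0^y \frac{dz}{\sqrt{z (y - z)}} = \frac{e^{-y/2}}{2\pi} \int_0^1 \frac{dt}{\sqrt{t (1 - t)}}$

$= \frac{e^{-y/2}}{2\pi} B(1/2, 1/2) = \frac{e^{-y/2}}{2\pi} \cdot \frac{\Gamma(1/2)\,\Gamma(1/2)}{\Gamma(1/2 + 1/2)} = e^{-y/2}/2$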


• $\sqrt{Z_1^2 + Z_2^2} \sim$ Rayleigh

Indeed, let $Y = \sqrt{Z_1^2 + Z_2^2}$, then

$F_Y(y) = P(Y \le y) = P(Y^2 \le y^2) = 1 - e^{-y^2/2}$ because $F(x) = 1 - e^{-\lambda x}$ is the CDF of the exponential distribution (here with λ = 1/2)

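• $Z_1 / Z_2 \sim$ Cauchy(µ = 0)

Indeed, let $X = Z_1/Z_2$, so, $Z_1 = X Z_2$, then

$F_X(x) = P(X \le x) = \int_{-\infty}^{\infty} f_{Z_2}(z)\, P(X \le x | Z_2 = z)\,dz$

$= \int_{-\infty}^{0} f_{Z_2}(z)\, P(Z_1 \ge zx)\,dz + \int_0^{\infty} f_{Z_2}(z)\, P(Z_1 \le zx)\,dz$

$= 2 \int_0^{\infty} f_{Z_2}(z)\, P(Z_1 \le zx)\,dz$

and so,

$f_X(x) = 2 \int_0^{\infty} f_{Z_2}(z)\, f_{Z_1}(zx)\, z\,dz = \frac{2}{2\pi} \int_0^{\infty} z\, e^{-(1 + x^2) z^2 / 2}\,dz = \frac{1}{\pi} \cdot \frac{\int_0^{\infty} e^{-u}\,du}{1 + x^2} = \frac{1}{\pi (1 + x^2)}$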

We propose to generate a Cauchy RV $X_1$ and an exponential RV $X_2 \sim \exp(1/2)$, using $X_1 = \tan(2\pi U_1)$ and $X_2 = -2 \log(U_2)$

Note that $X_1 = \tan(2\pi U_1)$ is Cauchy with the same parameters as $X_1' = \tan(\pi U_1)$, since tan(πu) has period 1. We take $X_1 = \tan(2\pi U_1)$ instead of $X_1 = \tan(\pi U_1)$ for reasons explained below.

Then solve the system

$Z_2 / Z_1 = X_1 = \tan(2\pi U_1)$

$Z_1^2 + Z_2^2 = X_2 = -2 \log(U_2) = \log(1/U_2^2)$

(the ratio $Z_2/Z_1$ is Cauchy as well, since 1/X is Cauchy whenever X is)

So suppose $U_1$ and $U_2$ are 2 independent, uniform r.v. on [0, 1] and let

$Z_1 = \sqrt{\log(1/U_2^2)}\, \cos(2\pi U_1)$

$Z_2 = \sqrt{\log(1/U_2^2)}\, \sin(2\pi U_1)$.

Here $\sin(2\pi U_1)$ and $\cos(2\pi U_1)$ have the same distribution on [−1, 1]. This would not be the case for $\cos(\pi U_1) \in [-1, 1]$ and $\sin(\pi U_1) \in [0, 1]$. This is why we take $X_1 = \tan(2\pi U_1)$.

This is an $R^2 \to R^2$ transformation $Z = g(U) = \sqrt{\log(1/U_2^2)} \cdot \begin{pmatrix} \cos(2\pi U_1) \\ \sin(2\pi U_1) \end{pmatrix}$.


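The inverse $g^{-1}$:

$U_2 = e^{-\frac{1}{2}(Z_1^2 + Z_2^2)}$

$U_1 = \frac{1}{2\pi} \arctan\left( \frac{Z_2}{Z_1} \right)$.

Using $\frac{d}{dx} \arctan(x) = \frac{1}{1 + x^2}$,

$f_{Z_1,Z_2}(z_1, z_2) = f_{U_1,U_2}\left( \frac{1}{2\pi} \arctan\left( \frac{z_2}{z_1} \right),\ e^{-\frac{1}{2}(z_1^2 + z_2^2)} \right) |J|$

$= 1 \cdot \left| \det \begin{pmatrix} \frac{\partial u_1}{\partial z_1} & \frac{\partial u_1}{\partial z_2} \\ \frac{\partial u_2}{\partial z_1} & \frac{\partial u_2}{\partial z_2} \end{pmatrix} \right|$

$= \left| \det \begin{pmatrix} \frac{1}{2\pi} \cdot \frac{1}{1 + (z_2/z_1)^2} \cdot \frac{-z_2}{z_1^2} & \frac{1}{2\pi} \cdot \frac{1}{1 + (z_2/z_1)^2} \cdot \frac{1}{z_1} \\ e^{-\frac{1}{2}(z_1^2 + z_2^2)} \cdot (-z_1) & e^{-\frac{1}{2}(z_1^2 + z_2^2)} \cdot (-z_2) \end{pmatrix} \right|$

$= \frac{1}{2\pi} \cdot e^{-\frac{1}{2}(z_1^2 + z_2^2)} \cdot \frac{1}{1 + (z_2/z_1)^2} \cdot \left( \frac{z_2^2}{z_1^2} + 1 \right)$

$= \frac{1}{\sqrt{2\pi}} e^{-z_1^2/2} \cdot \frac{1}{\sqrt{2\pi}} e^{-z_2^2/2}$.

Hence $(Z_1, Z_2) \sim NID(0, 1)$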
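The transformation derived above can be sketched as follows. The function name `box_muller` and the sample size are assumptions for the illustration. The uniform draw under the logarithm is taken as 1 − U so that it is never exactly zero.

```python
import math
import random

def box_muller(rng):
    """Return two independent N(0, 1) draws from two uniform draws."""
    u1 = rng.random()
    u2 = 1.0 - rng.random()  # lies in (0, 1], so log(u2) is defined
    radius = math.sqrt(-2.0 * math.log(u2))  # equals sqrt(log(1 / u2^2))
    angle = 2.0 * math.pi * u1
    return radius * math.cos(angle), radius * math.sin(angle)

rng = random.Random(4)  # fixed seed for reproducibility
pairs = [box_muller(rng) for _ in range(50_000)]
n_pairs = len(pairs)
mean_1 = sum(z1 for z1, _ in pairs) / n_pairs
mean_2 = sum(z2 for _, z2 in pairs) / n_pairs
var_1 = sum((z1 - mean_1) ** 2 for z1, _ in pairs) / (n_pairs - 1)
var_2 = sum((z2 - mean_2) ** 2 for _, z2 in pairs) / (n_pairs - 1)
cov_12 = sum((z1 - mean_1) * (z2 - mean_2) for z1, z2 in pairs) / (n_pairs - 1)
print(mean_1, var_1, mean_2, var_2, cov_12)
```

The sample means and the sample covariance should be close to 0 and the sample variances close to 1.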


1.2.2 Rejection sampling

Suppose we want to generate random numbers with density f(x) and cumulative distribution F(x).

Theorem

Let $X \sim g_X$ and $\forall x \in R: f(x) \le M \cdot g_X(x)$ and let U ∼ uniform[0, 1], independent from X, then

$F(x) = P\left( X \le x \,\middle|\, U \le \frac{f(X)}{M \cdot g_X(X)} \right)$

Proof

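$P\left( X \in A \ \&\ U \le \frac{f(X)}{M \cdot g_X(X)} \right) = \int_A g_X(x) \cdot P\left( U \le \frac{f(X)}{M \cdot g_X(X)} \,\middle|\, X = x \right) dx$

$= \int_A g_X(x) \cdot P\left( U \le \frac{f(x)}{M \cdot g_X(x)} \right) dx$

$= \int_A g_X(x) \cdot \frac{f(x)}{M \cdot g_X(x)}\,dx$

$= \int_A \frac{f(x)}{M}\,dx = \frac{\int_A f(x)\,dx}{M}$

Hence, if A = ]−∞, x], then

$P\left( X \in A \,\middle|\, U \le \frac{f(X)}{M \cdot g_X(X)} \right) = \frac{P\left( X \in A \ \&\ U \le \frac{f(X)}{M \cdot g_X(X)} \right)}{P\left( U \le \frac{f(X)}{M \cdot g_X(X)} \right)}$

$= \frac{P\left( X \in A \ \&\ U \le \frac{f(X)}{M \cdot g_X(X)} \right)}{P\left( X \in R \ \&\ U \le \frac{f(X)}{M \cdot g_X(X)} \right)}$

$= \frac{\int_A f(x)\,dx / M}{1/M} = \int_A f(x)\,dx = F(x)$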


Algorithm

Situation and aim

We want a random number X with density function f(x). We have no expression for F(x) that we can invert. We can generate numbers according to a different law $g_X(x)$ and we know that $f(x) \le M \cdot g_X(x)$, for all values of x.

Pseudo-code

continue-search = TRUE

While continue-search

• Generate X ∼ gX

• Generate U ∼ uniform[0, 1]

• If $U \le f(X) / [M \cdot g_X(X)]$ then continue-search = FALSE

The output is X
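The pseudo-code can be turned into a short Python function. The sketch below is an illustration, not the author's implementation: the target density f(x) = 6x(1 − x) on [0, 1] (a Beta(2, 2) density), the uniform proposal and the bound M = 1.5 are assumptions chosen so that f(x) ≤ M · g(x) holds everywhere. The function name `rejection_sample` is hypothetical.

```python
import random

def rejection_sample(f, g_pdf, g_draw, m, rng):
    """Return one draw with density f, using proposals with density g_pdf.

    Requires f(x) <= m * g_pdf(x) for all x.
    """
    while True:
        x = g_draw()      # generate X from g
        u = rng.random()  # generate U, uniform on [0, 1)
        if u <= f(x) / (m * g_pdf(x)):
            return x      # accept: stop the search

rng = random.Random(5)  # fixed seed for reproducibility
draws = [
    rejection_sample(
        f=lambda x: 6.0 * x * (1.0 - x),  # target density on [0, 1]
        g_pdf=lambda x: 1.0,              # uniform proposal density on [0, 1]
        g_draw=rng.random,
        m=1.5,                            # maximum of f is 1.5, reached at x = 1/2
        rng=rng,
    )
    for _ in range(20_000)
]
sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / (len(draws) - 1)
print(sample_mean, sample_var)  # Beta(2, 2) has mean 1/2 and variance 1/20
```

On average a fraction 1/M of the proposals is accepted, which is why M should be as close to 1 as possible.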


How to choose gX(x)?

• X ∼ gX should be easy to generate

• gX(x) should be as close as possible to f(x), such that M can be close

to 1, and rejection probabilities are low. Otherwise, computational efforts

increase.

• Some combinations don’t work: for instance, one can never generate

Cauchy variables by rejection sampling applied to normal variables,

simply because there is no M satisfying the condition.


Example 1: generating Gamma-distributed r.v.

Let X ∼ Gamma(λ, α), i.e., $f_X(x) = \frac{x^{\alpha - 1} \lambda^{\alpha} e^{-\lambda x}}{\Gamma(\alpha)}$

If α is integer, then we can write $X = \sum_{i=1}^{\alpha} X_i$ with independent $X_i \sim \exp(\lambda)$

If α is not integer, denote δ = α − ⌊α⌋ and r = ⌊α⌋

Then we can decompose or generate X as

$X = (X_r + X_\delta)/\lambda$ with $X_r \sim \mathrm{Gamma}(1, r)$ and $X_\delta \sim \mathrm{Gamma}(1, \delta)$, both independent.

We can generate Xr as sum of exponentials, but for Xδ, the quantile method

does not work, so we need another direction.


Generating Gamma values with small α

The density function of X ∼ Gamma(1, δ) is $f_X(x) = \frac{x^{\delta - 1} e^{-x}}{\Gamma(\delta)}$

It is depicted below for δ = 0.23

[Figure: density of Gamma(1, δ) for δ = 0.23, for x between 0 and 6]

Not straightforward to bound by some M · gX(x)


A mixture distribution as upper bound

We will use a mixture distribution. Suppose

V ∼ uniform[0, 1]

$W = \chi_{[0,p]}(V)$ for some value p

$X_1 \sim g_1$ with $g_1(x) = \delta \cdot x^{\delta - 1}$ on [0, 1]

$X_2 - 1 \sim \exp(1)$, hence $g_2(x) = e^{-(x - 1)} = e \cdot e^{-x}$ on [1, ∞[

$X = W \cdot X_1 + (1 - W) \cdot X_2$

In other words, $X = X_1$ with probability p and $X = X_2$ with probability 1 − p.

Therefore, generate two uniform RV: U and V. If V < p, then $X = Q_{X_1}(U)$, otherwise $X = Q_{X_2}(U)$. In one formula:

$X = I(V < p)\, Q_{X_1}(U) + I(V \ge p)\, Q_{X_2}(U)$

where $Q_{X_1}(U) = U^{1/\delta}$ and $Q_{X_2}(U) = 1 - \log(1 - U)$, which follows from the quantile method for the exponential distribution (Section 1.2.1)


The mixture distribution and density

The cumulative distribution of X, denoted as $G_X(x)$, is then

$G_X(x) = P(W = 0) \cdot P(X \le x | W = 0) + P(W = 1) \cdot P(X \le x | W = 1)$

$= (1 - p) \cdot P(X_2 \le x | W = 0) + p \cdot P(X_1 \le x | W = 1)$

$= (1 - p) \cdot G_2(x) + p \cdot G_1(x)$

and from there $g_X(x) = p \cdot g_1(x) + (1 - p) \cdot g_2(x)$

In our case $g_X(x) = p \cdot \delta \cdot x^{\delta - 1} \cdot \chi_{[0,1]}(x) + (1 - p) \cdot e \cdot e^{-x} \cdot \chi_{[1,\infty[}(x)$


Optimizing the parameters in the function gX(x)

The value of p can be chosen to minimize the number of rejections, i.e., to minimize M .

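• We need that $M \cdot g_X(x) \ge f_X(x)$

• For x ∈ [0, 1], this means that

$M \cdot p \cdot \delta \cdot x^{\delta - 1} \ge \frac{e^{-x} \cdot x^{\delta - 1}}{\Gamma(\delta)} \Leftrightarrow M \ge \frac{e^{-x}}{p\, \delta\, \Gamma(\delta)}$

The maximum in the right hand side is reached if x = 0, hence $M \ge \frac{1}{p\, \delta\, \Gamma(\delta)}$

• For x ≥ 1, this becomes

$M \cdot (1 - p) \cdot e \cdot e^{-x} \ge \frac{e^{-x} \cdot x^{\delta - 1}}{\Gamma(\delta)} \Leftrightarrow M \ge \frac{x^{\delta - 1}}{(1 - p)\, e\, \Gamma(\delta)}$

The maximum in the right hand side is reached if x = 1, hence $M \ge \frac{1}{(1 - p)\, e\, \Gamma(\delta)}$

• The minimum M can be obtained if both lower bounds for M are equal, i.e., if

$p\, \delta\, \Gamma(\delta) = (1 - p)\, e\, \Gamma(\delta) \Leftrightarrow p = \frac{e}{e + \delta}$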


The resulting algorithm

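We have $M = \frac{1}{p\, \delta\, \Gamma(\delta)} = \frac{e + \delta}{e\, \delta\, \Gamma(\delta)}$

So, for x ∈ [0, 1], we find

$\frac{f_X(x)}{M \cdot g_X(x)} = \frac{e^{-x} \cdot x^{\delta - 1} / \Gamma(\delta)}{M \cdot p \cdot \delta \cdot x^{\delta - 1}} = e^{-x}$, which is smaller than 1,

and for x > 1, we find

$\frac{f_X(x)}{M \cdot g_X(x)} = \frac{e^{-x} \cdot x^{\delta - 1} / \Gamma(\delta)}{M \cdot (1 - p) \cdot e^{1 - x}} = x^{\delta - 1}$, which is smaller than 1 because δ − 1 is negative.

While search == true,

• Generate independent U, V, W ∼ unif([0, 1])

• if V < p = e/(e + δ),

then

– $X = W^{1/\delta}$

– If $U < e^{-X}$, then search ← false

else

– $X = 1 - \log(W)$ (so that X − 1 ∼ exp(1) and X ≥ 1)

– If $U < X^{\delta - 1}$, then search ← false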
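The algorithm can be sketched in Python as follows. The function name `sample_gamma_small_shape` is hypothetical and δ = 0.23 is taken from the figure above. In the second branch the sketch draws X = 1 − log(W), so that X − 1 is exponential and X ≥ 1, in agreement with the definition of g2.

```python
import math
import random

def sample_gamma_small_shape(delta, rng):
    """Draw from Gamma(1, delta), 0 < delta < 1, by rejection sampling."""
    p = math.e / (math.e + delta)  # optimal mixture probability
    while True:
        u = rng.random()
        v = rng.random()
        w = 1.0 - rng.random()     # lies in (0, 1], so log(w) is defined
        if v < p:
            x = w ** (1.0 / delta)       # proposal from g1 on [0, 1]
            if u < math.exp(-x):         # accept with probability e^(-x)
                return x
        else:
            x = 1.0 - math.log(w)        # proposal from g2 on [1, infinity)
            if u < x ** (delta - 1.0):   # accept with probability x^(delta-1)
                return x

rng = random.Random(6)  # fixed seed for reproducibility
delta = 0.23
draws = [sample_gamma_small_shape(delta, rng) for _ in range(100_000)]
sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / (len(draws) - 1)
print(sample_mean, sample_var)  # Gamma(1, delta) has mean delta and variance delta
```

A draw from Gamma(λ, α) with non-integer α then follows by adding a sum of ⌊α⌋ independent exp(1) draws and dividing by λ.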


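Example 2: computing the integrals of the Bayesian example

In Bayesian inference with a Cauchy prior and normal errors, we have to compute a ratio of the form

$r = \frac{\int_{-\infty}^{\infty} x^m \cdot \frac{1}{b\pi \cdot \left[ 1 + \left( \frac{x - a}{b} \right)^2 \right]} \cdot \frac{1}{\sqrt{2\pi}\,\sigma} \cdot e^{-(x - \mu)^2 / 2\sigma^2}\,dx}{\int_{-\infty}^{\infty} \frac{1}{b\pi \cdot \left[ 1 + \left( \frac{x - a}{b} \right)^2 \right]} \cdot \frac{1}{\sqrt{2\pi}\,\sigma} \cdot e^{-(x - \mu)^2 / 2\sigma^2}\,dx}$

Using rejection sampling, we will generate data X from the distribution with density

$f_X(x) = K \cdot \frac{1}{1 + \left( \frac{x - a}{b} \right)^2} \cdot g_X(x)$,

where $g_X(x)$ is the normal density function and K is the normalising constant.

It then holds that $r = E(X^m)$

We can draw observations from X even if we know $f_X(x)$ only up to a constant.

Indeed, let $f_X(x) = K \cdot f(x)$ with K unknown, then

$\frac{f_X(x)}{M \cdot g_X(x)} = \frac{K \cdot f(x)}{M \cdot g_X(x)}$

Herein f(x) and $g_X(x)$ are known. If $f(x)/g_X(x)$ is bounded by C, then take M ≥ KC.

In the example above

$\frac{f_X(x)}{M \cdot g_X(x)} = \frac{K \cdot g_X(x) \cdot \frac{1}{1 + \left( \frac{x - a}{b} \right)^2}}{M \cdot g_X(x)} = \frac{K}{M \left[ 1 + \left( \frac{x - a}{b} \right)^2 \right]}$

Take M = K, then the result is bounded by 1.

So generate $X \sim g_X$ and U ∼ uniform[0, 1], and accept X if $U \le \frac{1}{1 + \left( \frac{X - a}{b} \right)^2}$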
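A possible implementation of this example is sketched below. A normal proposal X is accepted when U does not exceed the Cauchy factor 1/(1 + ((X − a)/b)^2), and the ratio r is then estimated by the average of the accepted values raised to the power m. The function names and the parameter values (a = 0, b = 1, µ = 0, σ = 1) are assumptions for the illustration.

```python
import random

def sample_posterior(a, b, mu, sigma, rng):
    """Draw from the density proportional to
    1 / (1 + ((x - a) / b)^2) times the N(mu, sigma^2) density."""
    while True:
        x = rng.gauss(mu, sigma)  # proposal from the normal density g
        u = rng.random()
        if u <= 1.0 / (1.0 + ((x - a) / b) ** 2):
            return x              # accepted; the constant K is never needed

def posterior_moment(m, a, b, mu, sigma, n_draws, rng):
    """Estimate r = E(X^m) by the average of the accepted draws to the power m."""
    draws = [sample_posterior(a, b, mu, sigma, rng) for _ in range(n_draws)]
    return sum(x ** m for x in draws) / n_draws

rng = random.Random(7)  # fixed seed for reproducibility
first_moment = posterior_moment(1, 0.0, 1.0, 0.0, 1.0, 20_000, rng)
second_moment = posterior_moment(2, 0.0, 1.0, 0.0, 1.0, 20_000, rng)
print(first_moment, second_moment)
```

With a = µ the target density is symmetric around µ, so the first moment should be close to µ. The second moment should be below σ², since the Cauchy factor downweights the tails of the normal density.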


2. Markov Chain Monte Carlo Methods

• Monte Carlo Methods are based on independent sampling, law of large

numbers, central limit theorem

• Independent sampling may be difficult to realize, especially when we

sample from a large dimensional vector X

• Markov Chain Monte Carlo (MCMC) Methods simulate a sequence of

dependent observations


2.1 Markov Chains

Discrete time Markov Chain←→ continuous time Markov Chain

We consider discrete time MC

Discrete state space MC←→ general state space MC

A Discrete state space MC is a sequence of RV’s (Xn;n ∈ N) for which

Xn ∈ E. The state space E is countable and thus homomorphic with Z. (We

can take E = Z.) The sequence satisfies the Markov condition, i.e.,

$P(X_{n+1} = j | X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n) = P(X_{n+1} = j | X_n = i_n)$

Define $P^{(n)}_{ij} = P(X_{n+1} = j | X_n = i)$

The Markov Chain is stationary or homogeneous if $P^{(n)}_{ij}$ does not depend on n. We can write $P_{ij} = P(X_{n+1} = j | X_n = i)$

The matrix with elements Pij is called the transition matrix


Irreducibility

(We further assume stationary MC, unless otherwise stated)

n-step Transitions

If P is the transition matrix of a discrete space Markov process, then

$P(X_{m+n} = j | X_m = i) = (P^n)_{ij}$

Accessibility

A state j is accessible from a state i if ∃n ∈ N, such that $(P^n)_{ij} > 0$. We denote i → j

Two states are communicating if they are mutually accessible from each

other. We denote i↔ j

If all states communicate, the MC is said to be irreducible
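For a finite state space, these definitions can be checked numerically. The sketch below computes n-step transition probabilities as matrix powers and tests irreducibility by checking accessibility between all pairs of states. The 3-state transition matrix and the function names are assumptions for the illustration; for a chain with k states it suffices to inspect the powers 0, ..., k − 1.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def identity(size):
    return [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]

def mat_power(p, n):
    """Return P^n, whose (i, j) entry is P(X_{m+n} = j | X_m = i)."""
    result = identity(len(p))
    for _ in range(n):
        result = mat_mul(result, p)
    return result

def is_irreducible(p):
    """Check that every state is accessible from every other state."""
    size = len(p)
    reachable = [[i == j for j in range(size)] for i in range(size)]
    power = identity(size)
    for _ in range(size - 1):
        power = mat_mul(power, p)
        for i in range(size):
            for j in range(size):
                if power[i][j] > 0:
                    reachable[i][j] = True
    return all(all(row) for row in reachable)

# Hypothetical 3-state chain: moves only between neighbouring states
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
two_step = mat_power(P, 2)
print(two_step[0][2])     # probability of going from state 0 to state 2 in two steps
print(is_irreducible(P))
# State 1 is not accessible from the absorbing state 0 in this second chain
print(is_irreducible([[1.0, 0.0], [0.5, 0.5]]))
```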

Period

The smallest $d_i$ for which $(P^{d_i})_{ii} > 0$ is called the period of state i. It follows that $(P^n)_{ii} > 0 \Leftrightarrow n = k \cdot d_i$ with k ∈ N


If i ↔ j, then $d_i = d_j$

Proof

∃n ∈ N for which $(P^n)_{ij} > 0$ and ∃m ∈ N for which $(P^m)_{ji} > 0$

Now suppose that $(P^r)_{jj} > 0$, i.e., $r = k_j \cdot d_j$. Then $(P^{n+r+m})_{ii} \ge (P^n)_{ij} \cdot (P^r)_{jj} \cdot (P^m)_{ji} > 0$, hence we know that $n + r + m = k_i' \cdot d_i$.

We also have $(P^{2r})_{jj} \ge (P^r)_{jj} \cdot (P^r)_{jj} > 0$, hence $n + 2r + m = k_i'' \cdot d_i$, and so $r = (n + 2r + m) - (n + r + m) = (k_i'' - k_i') \cdot d_i = k_i \cdot d_i$.

So, any $r = k_j \cdot d_j$ can be written as $r = k_i \cdot d_i$

A similar argument leads to the conclusion that any $r = k_i \cdot d_i$ can be written as $r = k_j \cdot d_j$. This is only possible if $d_j = d_i$


Transient states

Denote $T_{ii}$ the first n > 0 so that $X_n = i$ given that $X_0 = i$

We know that $T_{ii} = k \cdot d_i$ and $P(T_{ii} = k \cdot d_i) > 0$.

Denote $V_{ii} = \sum_{k=1}^{\infty} I(X_{k \cdot d_i} = i) | \{X_0 = i\} = \sum_{n=1}^{\infty} I(X_n = i) | \{X_0 = i\}$

(with I(A) the indicator function of event A)

A state i is transient if $E(V_{ii}) < \infty$, that is, if an infinite number of steps in the Markov Chain leads at most to a finite number of visits to state i.

$E(V_{ii}) = \sum_{n=1}^{\infty} E(I(X_n = i) | X_0 = i) = \sum_{n=1}^{\infty} P(X_n = i | X_0 = i)$

So

$E(V_{ii}) < \infty \Leftrightarrow \sum_{n=1}^{\infty} P(X_n = i | X_0 = i) < \infty$

This is equivalent to $P(T_{ii} < \infty) < 1$


Indeed, suppose that $P(T_{ii} < \infty) = 1$, and denote $T^{(r)}_{ii}$ the number of steps until the rth occurrence of state i. Then, because of the Markov condition, $T^{(r)}_{ii} = \sum_{\ell=1}^{r} T_{ii,\ell}$ with $T_{ii,\ell}$ IID observations from $T_{ii}$.

$P(T^{(r)}_{ii} < \infty) = P\left( \bigcap_{\ell=1}^{r} (T_{ii,\ell} < \infty) \right) = \prod_{\ell=1}^{r} P(T_{ii,\ell} < \infty) = 1$ for any finite r. So $V_{ii} \ge r$, a.s. for any r ∈ N.

Transience implies that $\mu_{ii} = E(T_{ii}) = \infty$ but the opposite does not hold (see below).


Recurrent states

A state is called recurrent if it is not transient, i.e., if it is visited an infinite

number of times.

If the expected time until the first visit is infinite, i.e., if µii = E(Tii) =∞, then

the state is called null-recurrent, otherwise it is called positive or ergodic.

A null-recurrent state is visited an infinite number of times, but the relative number of visits tends to zero: $E(V_{ii}) = \sum_{n=1}^{\infty} (P^n)_{ii} = \infty$ and $\frac{1}{N} \sum_{n=1}^{N} (P^n)_{ii} \to 0$

A positive state has $\frac{1}{N} \sum_{n=1}^{N} (P^n)_{ii} \to \frac{1}{\mu_{ii}}$


Proof (as of yet incomplete)

We prove that in a positive state, it holds that $\frac{1}{N} \sum_{n=1}^{N} P(X_n = i | X_0 = i) \to \frac{1}{E(T_{ii})}$

• Law of total probability + Markov condition for n > 0:

$P(X_n = i | X_0 = i) = \sum_{k=1}^{n} P(X_n = i | X_k = i) \cdot P(T_{ii} = k)$

$= \sum_{k=1}^{n} P(X_{n-k} = i | X_0 = i) \cdot P(T_{ii} = k)$

Also, $P(X_n = i | X_0 = i) = 1$ for n = 0.

• If we define $t_k = P(T_{ii} = k)$, with $t_0 = 0$, and $p_n = P(X_n = i | X_0 = i)$, then we have $p_n = \sum_{k=1}^{n} p_{n-k} \cdot t_k = \sum_{k=0}^{n} p_{n-k} \cdot t_k$ for n > 0 and $p_0 = 1 \ne p_0 \cdot t_0 = 0$.

• Denoting $t = (t_k, k \in N)$ and $p = (p_n, n \in N)$, the sum above is the convolution of the sequences t and p: t ∗ p. Since the expression does not hold for n = 0, we have to correct with a Kronecker sequence $\delta_0 = (1, 0, 0, 0, \ldots)$. We get: $p = t * p + \delta_0$

• Denote $a(s) = \sum_{k=0}^{\infty} a_k s^k$, then the equation above becomes

$p(s) = p(s) \cdot t(s) + 1 \Leftrightarrow p(s) = \frac{1}{1 - t(s)}$

• Since $t(1) = \sum_{k=1}^{\infty} P(T_{ii} = k) = P(T_{ii} < \infty)$, a recurrent Markov process has a singularity for p(s) in s = 1. Further, $\lim_{s \to 1} (1 - s) \cdot p(s) = \lim_{s \to 1} \frac{1 - s}{1 - t(s)} = \frac{1}{t'(1)}$ and $t'(1) = \sum_{k=1}^{\infty} k \cdot P(T_{ii} = k) = E(T_{ii}) = \mu_{ii}$

• On the other hand,

$\lim_{s \to 1} (1 - s) \cdot p(s) = \lim_{u \to \infty} \frac{1}{u} \cdot p(1 - 1/u) = \lim_{n \to \infty} \frac{1}{n} \cdot \sum_{k=0}^{\infty} p_k (1 - 1/n)^k$

$= \lim_{n \to \infty} \frac{1}{n} \cdot \sum_{k=0}^{n} p_k + \lim_{n \to \infty} \frac{1}{n} \cdot \sum_{k=0}^{n} p_k \left[ (1 - 1/n)^k - 1 \right] + \lim_{n \to \infty} \frac{1}{n} \cdot \sum_{k=n+1}^{\infty} p_k (1 - 1/n)^k$

• ...


Equilibrium distribution

Theorem In an irreducible discrete time, discrete state space MC the states

are either all transient, all null-recurrent, or all positive (ergodic).

All finite state MC are positive

Denote $p_{n,i} = P(X_n = i)$, and row vector $p_n = (\ldots, p_{n,i}, \ldots)$, then

$p_{n+1} = p_n \cdot P$, i.e., $p_{n+1,i} = \sum_{\ell \in Z} P_{\ell i}\, p_{n,\ell}$

P · 1 = 1 (because transition probabilities sum to one)

λ = 1 is an eigenvalue

The left eigenvector is an invariant or stationary or equilibrium distribution

p · P = p
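For a small chain, an equilibrium distribution can be approximated by iterating the update of the row vector until it no longer changes. This sketch assumes an irreducible and aperiodic finite chain, for which the iteration converges; the 3-state matrix and the function names are hypothetical.

```python
def step(p_row, P):
    """One update p_{n+1} = p_n . P of the row vector of state probabilities."""
    size = len(P)
    return [sum(p_row[l] * P[l][i] for l in range(size)) for i in range(size)]

def equilibrium(P, iterations=200):
    """Approximate the solution of p . P = p by repeated updates."""
    size = len(P)
    p_row = [1.0 / size] * size  # arbitrary starting distribution
    for _ in range(iterations):
        p_row = step(p_row, P)
    return p_row

# Hypothetical 3-state chain: moves only between neighbouring states
P = [[0.5, 0.5, 0.0],
     [0.25, 0.5, 0.25],
     [0.0, 0.5, 0.5]]
p_eq = equilibrium(P)
print(p_eq)            # approximately (0.25, 0.5, 0.25)
print(step(p_eq, P))   # one more step leaves the distribution unchanged
```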


Reversed Markov Processes

If (Xn;n = 0, . . .) is a Markov Chain with transition matrix P and equilibrium

distribution p, then

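$P(X_n = j | X_{n+1} = i, X_{n+2} = i_2, \ldots, X_{n+m} = i_m) = P(X_n = j | X_{n+1} = i) = \frac{p_j}{p_i} \cdot P_{ji}$

Proof (Bayes, Chain rule for conditional probabilities and Markov Condition)

$P(X_n = j | X_{n+1} = i, X_{n+2} = i_2, \ldots, X_{n+m} = i_m)$

$= \frac{P(X_n = j) \cdot P(X_{n+1} = i, X_{n+2} = i_2, \ldots, X_{n+m} = i_m | X_n = j)}{P(X_{n+1} = i, X_{n+2} = i_2, \ldots, X_{n+m} = i_m)}$

$= \frac{P(X_n = j) \cdot P(X_{n+1} = i | X_n = j) \cdot P(X_{n+2} = i_2, \ldots, X_{n+m} = i_m | X_{n+1} = i)}{P(X_{n+1} = i) \cdot P(X_{n+2} = i_2, \ldots, X_{n+m} = i_m | X_{n+1} = i)}$

$= \frac{P(X_n = j) \cdot P(X_{n+1} = i | X_n = j)}{P(X_{n+1} = i)} = P(X_n = j | X_{n+1} = i)$

$= \frac{p_j \cdot P_{ji}}{p_i}$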


Reversible Markov Processes

If there exists a distribution $p_i = P(X = i)$ that satisfies $\frac{p_j \cdot P_{ji}}{p_i} = P_{ij}$ then the Markov chain is called Reversible

Remark Reversibility thus means not that the reversed Markov process exists (it always exists),

but that its transition probabilities for i→ j are the same as the forward probabilities for the same

transitions i→ j (so NOT for j → i)

The distribution is then the equilibrium distribution. Indeed, from summation of $p_j \cdot P_{ji} = p_i P_{ij}$ we obtain: $\sum_j p_j \cdot P_{ji} = p_i \sum_j P_{ij} = p_i$, which is, in matrix form, p · P = p, the invariant distribution equation.

The reverted process is of course the same.

The condition $\frac{p_j \cdot P_{ji}}{p_i} = P_{ij}$ is called the detailed balance equation (since it implies the “global” balance equation)
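Detailed balance is easy to verify numerically for a finite chain. The sketch below checks the condition for a hypothetical 3-state chain that moves only between neighbouring states (reversible) and for a deterministic cycle (not reversible, although the uniform distribution is invariant). All names and matrices are assumptions for the illustration.

```python
def satisfies_detailed_balance(p, P, tol=1e-12):
    """Check p_j * P_ji == p_i * P_ij for all pairs of states."""
    size = len(P)
    return all(abs(p[j] * P[j][i] - p[i] * P[i][j]) <= tol
               for i in range(size) for j in range(size))

def is_invariant(p, P, tol=1e-12):
    """Check the global balance equation p . P = p."""
    size = len(P)
    return all(abs(sum(p[l] * P[l][i] for l in range(size)) - p[i]) <= tol
               for i in range(size))

# Reversible example: moves only between neighbouring states
P_chain = [[0.5, 0.5, 0.0],
           [0.25, 0.5, 0.25],
           [0.0, 0.5, 0.5]]
p_chain = [0.25, 0.5, 0.25]

# Non-reversible example: deterministic cycle 0 -> 1 -> 2 -> 0
P_cycle = [[0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0],
           [1.0, 0.0, 0.0]]
p_cycle = [1.0 / 3.0] * 3

print(satisfies_detailed_balance(p_chain, P_chain), is_invariant(p_chain, P_chain))
print(satisfies_detailed_balance(p_cycle, P_cycle), is_invariant(p_cycle, P_cycle))
```

The cycle illustrates that global balance does not imply detailed balance.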


2.2 Models for multivariate random variables

1. In the next slides we consider vectors X with multivariate distributions.

We discuss two ways to define/fix any multivariate distribution

• Markov Random Field (MRF), which is special case of a graphical

model

• Gibbs Random Field (GRF)

2. The Markov property (dependence through adjacency) plays a role both on the level of the sampling process and on the level of the sampled multivariate random variable: Markov Chains for the sampling, Markov Random Fields for the sampled variable


2.2.1 Markov Random Field (MRF)

Given a multivariate random variable X, a graphical model can be used to

represent the intra-dependencies.

An undirected graph is an ordered pair of sets G = (V, E), where V = {1, . . . , p} is the set of vertices, sites or nodes, which are here indices into X. The set E contains the (undirected) edges in the graph, where an undirected edge is an unordered pair of vertices.

In a Markov Random Field, two vertices i and j are connected by an edge if and only if the corresponding components of X are conditionally dependent, i.e., dependent given all the other components’ values:

$P(X_i = x_i \mid \{X_1, \ldots, X_p\} \setminus \{X_i\}) \ne P(X_i = x_i \mid \{X_1, \ldots, X_p\} \setminus \{X_i, X_j\})$

The two sites are then called neighbours.

A neighbourhood of i is defined as $\partial i = \{ j \mid \{i, j\} \in E \}$

Formally, denoting by $2^V$ all subsets of V, we have

$\partial : V \to 2^V : i \mapsto \partial i = \{ j \mid \{i, j\} \in E \}$

Markov property: it holds that $P(X_i = x_i \mid X_{V \setminus \{i\}}) = P(X_i = x_i \mid X_{\partial i})$


Examples of MRFs

• In principle any multidimensional probability distribution can be seen as a MRF. In general, all components are conditionally dependent, so $\partial i = \{1, \ldots, p\} \setminus \{i\}$

• A (finite sample from a) Markov Chain is also a MRF. Indeed, (thanks to the notion of reversed Markov Processes) $\partial i = \{i - 1, i + 1\}$

– Forward Markov Chain: • → • → • → •

– Reversed MC: • ← • ← • ← •

– MRF representation: • − • − • − •

• A two-dimensional MRF:

• − • − • − •
|   |   |   |
• − • − • − •
|   |   |   |
• − • − • − •
|   |   |   |
• − • − • − •

– Dimension of random vector X is p = 16

– X has a 2D-geometric background

– Components of X can be represented with a 2D index: $X_s = X_{(i,j)}$


A short note on graphical models

Markov Random Fields are an example of graphical models

Graphical models are used to define or represent multivariate random

variables X

MRF are undirected graphs, edges define neighbourhoods ∂i

MC (Markov Chains) are an example of Bayesian networks: directed,

acyclic graphs: Edges define parents of nodes par(i)

The construction of the joint probability in a directed graph is immediate

$$f_X(x) = \prod_{i=1}^{p} f_{X_i|X_{\mathrm{par}(i)}}\left(x_i \mid x_{\mathrm{par}(i)}\right)$$

• when par(i) = Ø, then the conditional distribution should be interpreted

as the marginal distribution

• The construction is always possible because the graph is acyclic

• For MRF’s/undirected graphs, the joint pdf/pmf is not so straightforward,

we need the concept of Gibbs Random Fields (see slide 65)


Example of modelling by Bayesian networks

Let X = (X1, X2, X3), then

• the graph • ← • → • represents the situation where X1 and X3 are

dependent, but, given the value of X2 (i.e., conditionally), they are

independent. The dependence occurs through X2

• the graph • → • ← • represents the situation where X1 and X3 are

independent, but X2 depends on both. If X2 is observed, this gives

information on both X1 and X3, so X1 and X3 are conditionally

dependent. (By observation of X2, we learn about both X1 and X3)

These models are used, for instance, in studies of causality, and are popular

in several (other) domains of statistical learning


2.2.2 Gibbs Random Field (GRF)

Let X be a multivariate random variable of dimension p, and let E be a set of

edges defined on V = {1, . . . , p}.

Unlike in MRF, the edges in a GRF are not defined on the basis of a

conditional probability. They are used to define the global probability, as

follows:

A clique (or complete subset) is defined as

C ⊂ V is a clique ⇔ ∀i ∈ C : C ⊂ {i} ∪ ∂i

The set of cliques is denoted as $\mathcal{C}$: $\mathcal{C} = \{C \subset V \mid \forall i \in C : C \subset \{i\} \cup \partial i\}$

A probability distribution that can be decomposed into factors associated with

the cliques is called a Gibbs Random Field (GRF)

$$f_X(x) \text{ is a GRF} \;\Leftrightarrow\; f_X(x) = \prod_{C\in\mathcal{C}} f_C(x_C) = \frac{1}{Z}\exp\left(-\sum_{C\in\mathcal{C}} H_C(x_C)\right)$$

The functions HC(xC) are (up to an additive constant) the negative logarithms of fC(xC). They are called clique potentials. The normalizing constant Z is called a partition function.


Gibbs Random Field - further discussion

Use of GRF’s

• GRF’s can be used to define a joint probability on an undirected graph

• MRF’s represent local, conditional probabilities

• The Hammersley-Clifford theorem (slide 69) establishes the connection between GRF's and MRF's

Examples of GRF’s

• In principle any multidimensional probability distribution can be seen as

a GRF. In general, all components are conditionally dependent, and the

cliques are all subsets of V . All clique potentials are zero, except for

C = V , whose potential is HV (x) = − log(fX(x)).

• Ising model (see slide 67)


Example of GRF: Ising model

A two-dimensional lattice {(i, j)|0 ≤ i ≤ m, 0 ≤ j ≤ n} (see slide 62) can be equipped with a neighbourhood system by defining for each internal site

$$\partial(i,j) = \{(i-1,j),\ (i+1,j),\ (i,j-1),\ (i,j+1)\}$$

The cliques are then singletons and (horizontal and vertical) pairs of sites

$$\mathcal{C} = \big\{\{(i,j)\}\big\} \cup \big\{\{(i,j),(i+1,j)\}\big\} \cup \big\{\{(i,j),(i,j+1)\}\big\}$$

In the case where the observations are binary, say X(i,j) ∈ {−1, 1}, a popular

GRF model is the Ising model

HC(xC) = τ · xC,1 · xC,2 for the pairs and Hs(xs) = γ · xs for the singletons.

The pair’s potentials express the interaction between adjacent sites, while the

singleton potentials express a drift towards one of the two states.
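The clique potentials above can be summed into a total energy for a given configuration. The following is a minimal sketch, assuming the configuration is stored as a 2D NumPy array with entries −1/+1; the function names `ising_energy` and `unnormalised_probability` are illustrative and not part of the slides.

```python
import numpy as np

def ising_energy(x, tau, gamma):
    """Sum of all clique potentials H_C for a 2D configuration x with entries -1/+1."""
    x = np.asarray(x, dtype=float)
    vertical_pairs = np.sum(x[:-1, :] * x[1:, :])     # cliques {(i, j), (i+1, j)}
    horizontal_pairs = np.sum(x[:, :-1] * x[:, 1:])   # cliques {(i, j), (i, j+1)}
    singletons = np.sum(x)                            # cliques {(i, j)}
    return tau * (vertical_pairs + horizontal_pairs) + gamma * singletons

def unnormalised_probability(x, tau, gamma):
    """exp(-sum of clique potentials); the partition function Z is not computed."""
    return np.exp(-ising_energy(x, tau, gamma))

aligned = np.ones((2, 2))
checkerboard = np.array([[1, -1], [-1, 1]])
print(ising_energy(aligned, tau=1.0, gamma=0.5))
print(ising_energy(checkerboard, tau=1.0, gamma=0.5))
```

With the sign convention exp(−H) used here, a negative τ gives aligned neighbours a lower energy and hence a higher probability than a checkerboard pattern.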


2.2.3 The Hammersley-Clifford Theorem: conditions

MRFs are defined by conditional probabilities, based on a neighbourhood

system.

GRFs are defined by a joint probability, decomposed into clique potentials.

The Hammersley-Clifford Theorem states that under mild conditions, both

definitions are equivalent, i.e., a MRF is also a GRF and vice versa.

Two important conditions: existence of joint pdf + positivity

Existence of fX(x): See slide 77

Positivity condition

A probability distribution is said to satisfy the positivity condition if, for x = (x1, . . . , xi, . . . , xp), having fXi(xi) > 0 for all i = 1, . . . , p implies that fX(x) > 0.

An example of a distribution that violates this condition is the uniform distribution on the unit disk: fX(0.9, 0.8) = 0 (the point lies outside the disk, since 0.9² + 0.8² > 1), although fX1(0.9) > 0 and fX2(0.8) > 0


The Hammersley-Clifford Theorem

Theorem

If fX(x) exists and satisfies the positivity condition, then X is a MRF with

neighbourhood system ∂ if and only if it is a GRF whose cliques C follow from

the neighbourhood system ∂.


⇐: GRF → MRF

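Suppose X is a GRF with cliques $\mathcal{C}$ based on neighbourhood system ∂. Further denote I = {1, . . . , p}, and i∂i = {i} ∪ ∂i. Let $\mathcal{C}_i = \{C \in \mathcal{C} \mid i \in C\}$ be the cliques that contain site i.

Then

$$\begin{aligned}
P(X_i = x_i \mid X_{I\setminus\{i\}} = x_{I\setminus\{i\}})
&= \frac{P(X_i = x_i, X_{I\setminus\{i\}} = x_{I\setminus\{i\}})}{P(X_{I\setminus\{i\}} = x_{I\setminus\{i\}})} \\
&= \frac{P(X_i = x_i, X_{I\setminus\{i\}} = x_{I\setminus\{i\}})}{\sum_{y_i} P(X_i = y_i, X_{I\setminus\{i\}} = x_{I\setminus\{i\}})} \\
&= \frac{\prod_{C\in\mathcal{C}_i} f_C(x_i, x_{C\cap\partial i}) \cdot \prod_{C\in\mathcal{C}\setminus\mathcal{C}_i} f_C(x_C)}{\sum_{y_i} \prod_{C\in\mathcal{C}_i} f_C(y_i, x_{C\cap\partial i}) \cdot \prod_{C\in\mathcal{C}\setminus\mathcal{C}_i} f_C(x_C)} \\
&= \frac{\prod_{C\in\mathcal{C}_i} f_C(x_i, x_{C\cap\partial i})}{\sum_{y_i} \prod_{C\in\mathcal{C}_i} f_C(y_i, x_{C\cap\partial i})} \\
&= \frac{\prod_{C\in\mathcal{C}_i} f_C(x_i, x_{C\cap\partial i})}{\sum_{y_i} \prod_{C\in\mathcal{C}_i} f_C(y_i, x_{C\cap\partial i})} \cdot \frac{\sum_{y_{I\setminus i\partial i}} \prod_{C\in\mathcal{C}\setminus\mathcal{C}_i} f_C(x_{C\cap\partial i}, y_{C\setminus\partial i})}{\sum_{y_{I\setminus i\partial i}} \prod_{C\in\mathcal{C}\setminus\mathcal{C}_i} f_C(x_{C\cap\partial i}, y_{C\setminus\partial i})} \\
&= \frac{\sum_{y_{I\setminus i\partial i}} \prod_{C\in\mathcal{C}} f_C(x_{C\cap i\partial i}, y_{C\setminus i\partial i})}{\sum_{y_{I\setminus\partial i}} \prod_{C\in\mathcal{C}} f_C(x_{C\cap\partial i}, y_{C\setminus\partial i})} \\
&= \frac{P(X_{i\partial i} = x_{i\partial i})}{P(X_{\partial i} = x_{\partial i})} \\
&= P(X_i = x_i \mid X_{\partial i} = x_{\partial i})
\end{aligned}$$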


The construction of a GRF out of a MRF

For the other direction (from MRF to GRF) we need a few auxiliary definitions and results.

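Let $g : \mathbb{R}^p \to \mathbb{R} : x \mapsto g(x)$ be a function and let $o \in \mathbb{R}^p$ be a reference state for which g(o) > 0.

Then define for each A ⊂ I = {1, . . . , p} the function

$$G_A(x) = g(u(x)) \quad \text{where } u : \mathbb{R}^p \to \mathbb{R}^p,\ u_i = x_i \text{ if } i \in A \text{ and } u_i = o_i \text{ if } i \in I\setminus A$$

Further define

$$H_A(x) = \sum_{B\subseteq A} (-1)^{\#(A\setminus B)}\, G_B(x)$$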

Then we have the following results

• HØ(x) is a constant HØ(x) = g(o),∀x

• HA(x) does not depend on the components of x with index outside A

If xA = yA, then HA(x) = HA(y)

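• If one of the components of x with index in A takes the corresponding reference value, then HA(x) = 0. That is, for A ≠ Ø, if xi = oi for at least one i ∈ A, then HA(x) = 0

Proof. Define

$$\mathcal{B}_i = \{B \subset A \mid i \notin B\}, \quad \bar{B} = B \cup \{i\}, \quad \bar{\mathcal{B}}_i = \{\bar{B} = B \cup \{i\} \mid B \in \mathcal{B}_i\},$$

then $\mathcal{B}_i$ and $\bar{\mathcal{B}}_i$ split $2^A = \{B \subset A\}$ into two parts of equal size. For a pair $\{B, \bar{B} = B \cup \{i\}\}$, and for any x with xi = oi, we have that $G_B(x) = G_{\bar{B}}(x)$. Since $\#(A\setminus\bar{B}) = \#(A\setminus B) - 1$, the two terms of each pair have opposite signs, and so

$$H_A(x) = \sum_{B\in\mathcal{B}_i} \left[(-1)^{\#(A\setminus B)}\, G_B(x) + (-1)^{\#(A\setminus\bar{B})}\, G_{\bar{B}}(x)\right] = 0$$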

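• (Möbius Inversion)

$$g(x) = G_I(x) = \sum_{A\subseteq I} H_A(x)$$

Proof

$$\begin{aligned}
\sum_{A\subseteq I} H_A(x) &= \sum_{A\subseteq I} \sum_{B\subseteq A} (-1)^{\#(A\setminus B)}\, G_B(x) \\
&= \sum_{B\subseteq I} G_B(x) \sum_{A : B\subseteq A\subseteq I} (-1)^{\#(A\setminus B)}
\end{aligned}$$

(We have switched the order of summations and moved GB(x) forward)

Denote D = A\B, then B ⊆ A ⊆ I ⇔ Ø ⊆ D ⊆ I\B, and so we get

$$\sum_{A\subseteq I} H_A(x) = \sum_{B\subseteq I} G_B(x) \sum_{D\subseteq I\setminus B} (-1)^{\#D}$$

Unless B = I, the number of subsets D ⊆ I\B is even, and exactly half of those subsets have an even #D, and the other half have an odd #D. Hence all but one of the terms in the outer sum are zero, leading to

$$\sum_{A\subseteq I} H_A(x) = G_I(x) = g(x)$$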
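The construction of GA and HA and the Möbius inversion can be checked numerically on a small example. The sketch below assumes p = 3, an arbitrary test function `g` and the reference state o = (0, 0, 0); all names are illustrative choices, not part of the slides.

```python
import math
from itertools import combinations

def subsets(indices):
    """All subsets of an index collection, as frozensets."""
    indices = tuple(indices)
    for size in range(len(indices) + 1):
        for combo in combinations(indices, size):
            yield frozenset(combo)

def G(g, x, o, A):
    """G_A(x) = g(u) with u_i = x_i for i in A and u_i = o_i otherwise."""
    u = tuple(x[i] if i in A else o[i] for i in range(len(x)))
    return g(u)

def H(g, x, o, A):
    """H_A(x) = sum over B subset of A of (-1)^{#(A minus B)} G_B(x)."""
    A = frozenset(A)
    return sum((-1) ** len(A - B) * G(g, x, o, B) for B in subsets(A))

def g(x):
    # Arbitrary test function, chosen only for illustration.
    return x[0] * x[1] + math.sin(x[2]) + 2.0

x = (0.3, -1.2, 2.0)
o = (0.0, 0.0, 0.0)
inversion_sum = sum(H(g, x, o, A) for A in subsets(range(3)))
print(inversion_sum, g(x))

# A state whose first component equals the reference value
x_ref = (0.0, -1.2, 2.0)
print(H(g, x_ref, o, {0, 1}))
```

The first printed pair should agree up to rounding, and the last value should vanish because one component with index in A takes its reference value.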


Proof of Hammersley-Clifford ⇒ MRF → GRF

Theorem

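If g(x) = − log fX(x), where fX(x) is the joint probability distribution of a MRF on x with cliques $\mathcal{C}$, then in the construction above $H_A(x) = 0$ if $A \notin \mathcal{C}$.

Proof

Suppose that $A \notin \mathcal{C}$, then there must be two elements, say i, j ∈ A, so that $i \notin \partial j$ and vice versa.

For the given i, define as before

$$\mathcal{B}_i = \{B \subset A \mid i \notin B\}, \quad \bar{B} = B \cup \{i\}, \quad \bar{\mathcal{B}}_i = \{\bar{B} = B \cup \{i\} \mid B \in \mathcal{B}_i\}.$$

Then

$$\begin{aligned}
H_A(x) &= \sum_{B\in\mathcal{B}_i} \left[(-1)^{\#(A\setminus B)}\, G_B(x) + (-1)^{\#(A\setminus\bar{B})}\, G_{\bar{B}}(x)\right] \\
&= \sum_{B\in\mathcal{B}_i} (-1)^{\#(A\setminus B)} \left[G_B(x) - G_{\bar{B}}(x)\right]
\end{aligned}$$

Denoting $\bar{u} = (x_{\bar{B}}\, o_{I\setminus\bar{B}})$, we have

$$\begin{aligned}
G_{\bar{B}}(x) &= -\log f_X(\bar{u}) \\
&= -\log\left[f_{X_{I\setminus\{i\}}}(\bar{u}_{I\setminus\{i\}}) \cdot f_{X_i|X_{I\setminus\{i\}}}(x_i \mid \bar{u}_{I\setminus\{i\}})\right] \\
&= -\log f_{X_{I\setminus\{i\}}}(\bar{u}_{I\setminus\{i\}}) - \log f_{X_i|X_{\partial i}}(x_i \mid \bar{u}_{\partial i})
\end{aligned}$$

Denoting $u = (x_B\, o_{I\setminus B})$, we see that $u$ and $\bar{u}$ differ only in i, so $u_{I\setminus\{i\}} = \bar{u}_{I\setminus\{i\}}$, and so we can write

$$G_B(x) = -\log f_{X_{I\setminus\{i\}}}(\bar{u}_{I\setminus\{i\}}) - \log f_{X_i|X_{\partial i}}(o_i \mid \bar{u}_{\partial i})$$

The difference between both is then

$$G_{\bar{B}}(x) - G_B(x) = -\log f_{X_i|X_{\partial i}}(x_i \mid \bar{u}_{\partial i}) + \log f_{X_i|X_{\partial i}}(o_i \mid \bar{u}_{\partial i})$$

The common term that was annihilated depended on index j, but what remains does not, as $j \notin \partial i$. Hence none of the terms in HA(x) depends on the value of xj, so HA(x) = HA(y), where yℓ = xℓ for ℓ ≠ j and yj = oj. We have seen that for such an argument HA(y) = 0, from which the proof follows.

The proof assumes positivity because the cancellations that take place are implicitly based on ratios of probabilities (differences of log-probabilities), which are all assumed to be nonzero.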


Importance of Hammersley-Clifford in MCMC

The constructive proof of Hammersley-Clifford shows that the conditional probabilities in a Markov model allow us to construct the joint distribution as

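$$f_X(x) = \frac{1}{Z} \cdot \exp\left(-\sum_{C\in\mathcal{C}} H_C(x_C)\right)$$

where, for a chosen i ∈ C and a reference state o, the pairing of subsets used in the proof gives

$$H_C(x_C) = \sum_{B\subset C \mid i\in B} (-1)^{\#(C\setminus B)} \log\left(\frac{f_{X_i|X_{\partial i}}(o_i \mid u_{\partial i})}{f_{X_i|X_{\partial i}}(x_i \mid u_{\partial i})}\right)$$

where, for each term of the sum, uj = xj if j ∈ B and uj = oj if j ∉ B. The partition function Z follows from the choices of o and i ∈ C.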

Hammersley-Clifford states that conditional probabilities in a Markov model are sufficient to

define the joint probability of a random vector.

This is unlike marginal probabilities: they do not uniquely fix the joint

probability (as they contain no information about the dependence structure)


A construction without cliques

In some applications (such as the one we will need), the clique potentials are just an intermediate

result. It is possible to construct the joint distribution directly from the conditional distributions,

however, without proving that it factorizes into clique potential functions.

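Simplified theorem (no cliques)

$$\frac{f_X(x)}{f_X(o)} = \prod_{i=1}^{p} \frac{f_{X_i|X_{I\setminus\{i\}}}(x_i \mid o_{\{1,\ldots,i-1\}}\, x_{\{i+1,\ldots,p\}})}{f_{X_i|X_{I\setminus\{i\}}}(o_i \mid o_{\{1,\ldots,i-1\}}\, x_{\{i+1,\ldots,p\}})}$$

Or, otherwise stated

$$f_X(x) \propto \prod_{i=1}^{p} \frac{f_{X_i|X_{I\setminus\{i\}}}(x_i \mid o_{\{1,\ldots,i-1\}}\, x_{\{i+1,\ldots,p\}})}{f_{X_i|X_{I\setminus\{i\}}}(o_i \mid o_{\{1,\ldots,i-1\}}\, x_{\{i+1,\ldots,p\}})}$$

Proof

We start from the right-hand side

$$\begin{aligned}
\prod_{i=1}^{p} \frac{f_{X_i|X_{I\setminus\{i\}}}(x_i \mid o_{\{1,\ldots,i-1\}}\, x_{\{i+1,\ldots,p\}})}{f_{X_i|X_{I\setminus\{i\}}}(o_i \mid o_{\{1,\ldots,i-1\}}\, x_{\{i+1,\ldots,p\}})}
&= \prod_{i=1}^{p} \frac{f_X(o_{\{1,\ldots,i-1\}}\, x_{\{i,\ldots,p\}}) \,/\, f_{X_{I\setminus\{i\}}}(o_{\{1,\ldots,i-1\}}\, x_{\{i+1,\ldots,p\}})}{f_X(o_{\{1,\ldots,i\}}\, x_{\{i+1,\ldots,p\}}) \,/\, f_{X_{I\setminus\{i\}}}(o_{\{1,\ldots,i-1\}}\, x_{\{i+1,\ldots,p\}})} \\
&= \prod_{i=1}^{p} \frac{f_X(o_{\{1,\ldots,i-1\}}\, x_{\{i,\ldots,p\}})}{f_X(o_{\{1,\ldots,i\}}\, x_{\{i+1,\ldots,p\}})}
\end{aligned}$$

Every numerator in this product, except the first, cancels against the denominator of the previous factor. This leaves us with the first numerator, fX(x), and the last denominator, fX(o), which is exactly the expression on the left-hand side.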
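The telescoping product can be verified numerically. The following sketch assumes a small discrete example (p = 3 binary components with a randomly generated, strictly positive joint pmf `f`); the helper names `conditional` and `ratio_from_conditionals` are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
p = 3
f = rng.random((2,) * p) + 0.1   # strictly positive, so positivity holds
f /= f.sum()                     # joint pmf on {0, 1}^3

def conditional(i, value, state):
    """f_{X_i | X_{I minus i}}(value | state); entry i of state is ignored."""
    s = list(state)
    s[i] = value
    normaliser = sum(f[tuple(s[:i] + [v] + s[i + 1:])] for v in range(2))
    return f[tuple(s)] / normaliser

def ratio_from_conditionals(x, o):
    """Product over i of the ratios of full conditionals, as in the theorem."""
    ratio = 1.0
    for i in range(p):
        # first i-1 components at the reference state, last ones at x
        mixed = list(o[:i]) + [None] + list(x[i + 1:])
        ratio *= conditional(i, x[i], mixed) / conditional(i, o[i], mixed)
    return ratio

o = (0, 0, 0)
for x in itertools.product(range(2), repeat=p):
    print(x, ratio_from_conditionals(x, o), f[x] / f[o])
```

For every state the two printed numbers should coincide up to rounding, which illustrates that the full conditionals alone determine the joint distribution up to the constant fX(o).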


Note on the existence of a joint distribution

Note Hammersley-Clifford does not guarantee existence of the joint

distribution, but if it exists, it is well defined by the conditional probabilities.

Example Consider X1|X2 ∼ exp(λX2) and X2|X1 ∼ exp(λX1), then

according to the construction above, we find that

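$$\begin{aligned}
f_{(X_1,X_2)}(x_1,x_2) &\propto \frac{f_{X_1|X_2}(x_1|x_2)}{f_{X_1|X_2}(o_1|x_2)} \cdot \frac{f_{X_2|X_1}(x_2|o_1)}{f_{X_2|X_1}(o_2|o_1)} \\
&= \frac{\lambda x_2\, e^{-\lambda x_2 \cdot x_1}}{\lambda x_2\, e^{-\lambda x_2 \cdot o_1}} \cdot \frac{\lambda o_1\, e^{-\lambda o_1 \cdot x_2}}{\lambda o_1\, e^{-\lambda o_1 \cdot o_2}} \\
&\propto e^{-\lambda x_2 \cdot x_1}
\end{aligned}$$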

The function exp(−λx2x1) has no finite integral on [0,∞[×[0,∞[, and

therefore it cannot be normalized to be a (2D) density function.
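The divergence can be made explicit by carrying out the inner integral first. This is a short supplementary computation, assuming λ > 0:

```latex
\int_0^\infty \int_0^\infty e^{-\lambda x_1 x_2}\, dx_1\, dx_2
  = \int_0^\infty \frac{1}{\lambda x_2}\, dx_2
  = \frac{1}{\lambda} \Big[ \log x_2 \Big]_0^\infty
  = \infty
```

The integrand 1/(λ x2) is integrable neither near 0 nor near ∞, so no choice of normalizing constant turns exp(−λ x1 x2) into a density.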


From HC to Markov Chain Monte Carlo

• Sample from conditional distributions in MRF X (= any multivariate

random variable)

• Creates sequence of samples X1,X2,X3, . . . that are a Markov chain of

Markov Random Fields



2.3 MCMC samplers for integration

2.3.1 The Gibbs sampler

Suppose X is a p-dimensional random vector, and we can sample from the conditional densities

$$f_{X_i|X_{I\setminus\{i\}}}(x_i \mid x_{I\setminus\{i\}}) = f_{X_i|X_{\partial i}}(x_i \mid x_{\partial i})$$

Then we construct the following sampler

    Set initial values x0 = (x0,1, . . . , x0,p)
    for n = 1, 2, . . .
        for i = 1, . . . , p
            Draw $X_{n,i} \sim f_{X_i|X_{I\setminus\{i\}}}(x \mid x_{n;1,\ldots,i-1}\, x_{n-1;i+1,\ldots,p})$

The Gibbs-sampler consists of loops defined by conditional distributions.

Therefore, the sampler is based on the description of fX(x) as a Markov

random field. Moreover, the sequence can be seen as a Markov Chain.

So, the Gibbs sampler does NOT rely on the description of fX(x) as a Gibbs

random field. GRF will be at the basis of the Metropolis-Hastings sampler on

slide 90
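As an illustration of the two nested loops, here is a minimal sketch of a Gibbs sampler for a case that is not treated in the slides: a bivariate normal vector with zero means, unit variances and correlation `rho`, for which the full conditionals are X1 | X2 = x2 ∼ N(rho·x2, 1 − rho²) and symmetrically for X2. The function name and its parameters are illustrative assumptions.

```python
import numpy as np

def gibbs_bivariate_normal(n_iter, rho, x0=(0.0, 0.0), seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = np.random.default_rng(seed)
    cond_sd = np.sqrt(1.0 - rho ** 2)
    x = np.array(x0, dtype=float)
    samples = np.empty((n_iter, 2))
    for n in range(n_iter):                      # outer loop over iterations
        # inner loop over components, written out for p = 2
        x[0] = rng.normal(rho * x[1], cond_sd)   # uses the previous value of x[1]
        x[1] = rng.normal(rho * x[0], cond_sd)   # uses the updated value of x[0]
        samples[n] = x
    return samples

samples = gibbs_bivariate_normal(n_iter=20000, rho=0.8)
print(samples.mean(axis=0))
print(np.corrcoef(samples.T)[0, 1])
```

Each draw only needs the one-dimensional conditional distributions; the joint density is never evaluated.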


Invariant distribution

On slide 81, we prove:

The joint distribution fX(x) is invariant under the loops of a Gibbs-sampler

We consider the sequence of states after each outer loop (i.e., iterations over n), not the inner

loops (over the vector components).

We consider the case of a discrete state space.

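Lemma The transition probabilities over the outer loops satisfy

$$f_{X_{n+1}|X_n}(x|v) = \prod_{i=1}^{p} f_{X_i|X_{I\setminus\{i\}}}(x_i \mid x_{1,\ldots,i-1}\, v_{i+1,\ldots,p})$$

Proof (discrete case)

This follows from the chain rule

$$P\left(\bigcap_{i=1}^{p} A_i \,\middle|\, B\right) = \prod_{i=1}^{p} P\left(A_i \,\middle|\, \bigcap_{j=1}^{i-1} A_j \cap B\right)$$

where in our case

$A_i = \{X_{n+1;i} = x_i\}$ and $B = \{X_n = v\}$ □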


Invariant distribution: proof

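We now consider the case of a discrete state space, and suppose that $f_{X_n}(x) = f_X(x)$, then

$$\begin{aligned}
f_{X_{n+1}}(x) &= \sum_{v} f_{X_{n+1}|X_n}(x|v) \cdot f_{X_n}(v) \\
&= \sum_{v} \prod_{i=1}^{p} f_{X_i|X_{I\setminus\{i\}}}(x_i \mid x_{1,\ldots,i-1}\, v_{i+1,\ldots,p}) \cdot f_X(v) \\
&= \sum_{v_p} \cdots \sum_{v_1} f_X(v) \cdot f_{X_1|X_{I\setminus\{1\}}}(x_1 \mid v_{2,\ldots,p}) \cdot \prod_{i=2}^{p} f_{X_i|X_{I\setminus\{i\}}}(x_i \mid x_{1,\ldots,i-1}\, v_{i+1,\ldots,p}) \\
&= \sum_{v_p} \cdots \sum_{v_2} f_{X_{2,\ldots,p}}(v_{2,\ldots,p}) \cdot f_{X_1|X_{I\setminus\{1\}}}(x_1 \mid v_{2,\ldots,p}) \cdot \prod_{i=2}^{p} f_{X_i|X_{I\setminus\{i\}}}(x_i \mid x_{1,\ldots,i-1}\, v_{i+1,\ldots,p}) \\
&= \sum_{v_p} \cdots \sum_{v_2} f_X(x_1\, v_{2,\ldots,p}) \cdot \prod_{i=2}^{p} f_{X_i|X_{I\setminus\{i\}}}(x_i \mid x_{1,\ldots,i-1}\, v_{i+1,\ldots,p}) \\
&= \cdots = f_X(x)
\end{aligned}$$

In the expressions above, we used that $\sum_{v_1} f_X(v) = f_{X_{2,\ldots,p}}(v_{2,\ldots,p})$. (The notation $X_{2,\ldots,p}$ refers to the components of X, not to successive Markov Chain realisations like in $X_n$.)

We then used $f_{X_{2,\ldots,p}}(v_{2,\ldots,p}) \cdot f_{X_1|X_{I\setminus\{1\}}}(x_1 \mid v_{2,\ldots,p}) = f_X(x_1\, v_{2,\ldots,p})$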


Reversibility

The proof of the invariance property of fX(x) w.r.t. the Gibbs sampler

established a global balance equation, not a detailed balance equation. A

detailed balance equation is necessary for reversibility.

The Gibbs-sampler as a whole is not reversible, meaning

$$f_{X_{n-1}|X_n}(x_{n-1}|x_n) \neq f_{X_{n+1}|X_n}(x_{n-1}|x_n)$$

In words: given that we are in xn, the probability that we arrive in xn−1 in the next step differs from the probability that we came from xn−1 in the previous step.

Each substep (inner loop) on its own is reversible. That is, if we have

generated a new ith component xi, we could “undo” that step (“undo” in

probabilistic sense, that is). In order to undo the complete Gibbs iteration

step, the substeps have to be followed in reverse order.

One can prove that a reversible Gibbs sampler can be constructed by

randomizing the order of substeps.
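Both statements, invariance (global balance) and the lack of detailed balance for a full sweep, can be checked on a toy example. The sketch below assumes p = 2 binary components and a hand-picked joint pmf `f`; the names `cond`, `sweep_probability`, `P`, `pi` and `flow` are illustrative.

```python
import itertools
import numpy as np

# Hand-picked joint pmf on {0, 1}^2 (an assumption made for this illustration)
f = np.array([[0.4, 0.1],
              [0.2, 0.3]])
p = 2
states = list(itertools.product(range(2), repeat=p))

def cond(i, value, state):
    """Full conditional f(X_i = value | other components of state)."""
    s = list(state)
    s[i] = value
    normaliser = sum(f[tuple(s[:i] + [v] + s[i + 1:])] for v in range(2))
    return f[tuple(s)] / normaliser

def sweep_probability(x, v):
    """Transition probability of one full Gibbs sweep from v to x (the lemma)."""
    prob = 1.0
    for i in range(p):
        mixed = list(x[:i]) + [None] + list(v[i + 1:])
        prob *= cond(i, x[i], mixed)
    return prob

# P[a, b] = probability to move from states[a] to states[b] in one sweep
P = np.array([[sweep_probability(x, v) for x in states] for v in states])
pi = np.array([f[s] for s in states])
flow = pi[:, None] * P          # flow[a, b] = f(a) * P(a -> b)

print(np.allclose(pi @ P, pi))      # global balance
print(np.allclose(flow, flow.T))    # detailed balance
```

For this pmf the first check holds, while the probability flow from (0, 0) to (1, 1) is 0.08 and the reverse flow is 0.06, so the full sweep is not reversible.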


Convergence

Under a mild assumption (positivity of fX(x)), the Gibbs sampler creates a Markov chain for which

$$X_n \xrightarrow{\text{dist}} X \sim f_X$$

If the Gibbs sampler Markov chain is irreducible and recurrent, then for any integrable function h(x) we have

$$\frac{1}{M} \sum_{n=1}^{M} h(X_n) \xrightarrow{P} E[h(X)]$$


Foundations for MCMC

MCMC is used for sampling from multidimensional random variables. It has two aspects

• Sampling proceeds through conditional probabilities/densities

• The subsequent samples are dependent→ Markov Chain

We have to make sure that

• Conditionals define the correct joint distribution in a unique way:

Hammersley-Clifford

• The convergence of the Markov chain takes over the role of the law of large numbers

– The target joint distribution is invariant under the Gibbs sampler

Markov Chain

– The chain converges to the invariant distribution

– Although convergence is a limit property, all generated samples of a

Gibbs sampler can be used in estimating the expected value of

h(X).


An example from Bayesian statistics

A Hidden Markov Random Field (HMM - Hidden Markov Model)

Suppose that we have the following graphical model for observations Y

    Y   •   •   •   •   •   •
        |   |   |   |   |   |
    M   •   •   •   •   •   •
        |   |   |   |   |   |
    X   • − • − • − • − • − •

• We observe Y , where Yi and Yj are dependent, but conditioned on the

hidden or latent states Xi and Xj they are independent.

• The observation consists of two parts: the real signal (expression) M

and the noise Y −M . Goal: inference on fM |Y (m|y)

• The latent state is a binary label: Xi = +1 means that Mi is probably

large, Xi = −1 means that Mi is probably small.

• Large values of Mi are clustered


A formalisation of the graphical model

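Suppose $X \in \{-1,1\}^p \sim \text{Ising}(\tau,\gamma)$, that is

$$P(X = x) = \frac{1}{T} \cdot \exp\left[-\tau \sum_{i=2}^{p} x_i x_{i-1}\right] \cdot \exp\left[-\gamma \sum_{i=1}^{p} x_i\right]$$

with partition function

$$T = \sum_{x\in\{-1,1\}^p} \exp\left[-\tau \sum_{i=2}^{p} x_i x_{i-1}\right] \cdot \exp\left[-\gamma \sum_{i=1}^{p} x_i\right]$$

We observe Yi = Mi + Vi, with Vi independent normal observational errors with zero mean and common variance σ², and Mi a mixture:

$$M_i = \frac{1 - X_i}{2} \cdot R_i + \frac{1 + X_i}{2} \cdot S_i \quad \text{with } R_i \sim N(0,\kappa^2) \text{ and } S_i \sim N(0,K^2)$$

and all these are independent.

The hyperparameters γ, τ, K², κ², σ² are assumed to be known.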


Bayesian inference 1: posterior law of total probability

We want to know E(M |Y )

The posterior total probability is

fMi|Y (m|y) = fMi|Xi=−1;Y (m|y)·P (Xi = −1|Y = y)+fMi|Xi=1;Y (m|y)·P (Xi = 1|Y = y)

The only dependence between components of Y lies in the Hidden Gibbs/Ising random field, so

fMi|Xi=±1;Y (m|y) = fMi|Xi=±1;Yi(m|yi)

Filling in leads to

fMi|Y(m|y) = fMi|Xi=−1;Yi

(m|yi) · P (Xi = −1|Y = y) + fMi|Xi=1;Yi(m|yi) · P (Xi = 1|Y = y)

For Xi = −1, Yi = Ri + Vi ∼ N(0, σ2 + κ2).

Hence cov(Mi, Yi) = cov(Ri, Ri + Vi) = var(Ri) + 0 = κ2

And from properties of the multivariate normal distribution (See slides Chapter 1, page 20) we

know

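$$(M_i \mid Y_i = y, X_i = -1) = (R_i \mid Y_i = y) \sim N\left(\frac{\kappa^2}{\kappa^2 + \sigma^2} \cdot y,\ \frac{\kappa^2 \sigma^2}{\kappa^2 + \sigma^2}\right)$$

The same holds for Xi = 1, replacing κ² by K².

This leads to

$$E(M_i \mid Y = y) = y_i \cdot \left[\frac{\kappa^2}{\kappa^2 + \sigma^2} \cdot P(X_i = -1 \mid Y = y) + \frac{K^2}{K^2 + \sigma^2} \cdot P(X_i = 1 \mid Y = y)\right]$$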
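The posterior mean is thus a shrinkage of the observation yi, with a shrinkage factor that is a weighted average of the two state-specific factors. A minimal sketch, assuming the posterior label probability P(Xi = 1 | Y = y) is already available (for instance as a Monte Carlo estimate); the function name `posterior_mean` and its argument names are illustrative.

```python
def posterior_mean(y_i, prob_plus, kappa2, K2, sigma2):
    """E(M_i | Y = y) given prob_plus = P(X_i = 1 | Y = y)."""
    shrink_minus = kappa2 / (kappa2 + sigma2)   # factor when X_i = -1
    shrink_plus = K2 / (K2 + sigma2)            # factor when X_i = +1
    weight = shrink_minus * (1.0 - prob_plus) + shrink_plus * prob_plus
    return y_i * weight

# Hypothetical hyperparameter values, for illustration only
for prob in (0.0, 0.5, 1.0):
    print(prob, posterior_mean(2.0, prob, kappa2=1.0, K2=9.0, sigma2=1.0))
```

With K² larger than κ², observations that are likely to carry a large signal (label +1) are shrunk less than observations that are likely to be mostly noise.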


Bayesian inference 2: posterior label probabilities

We still need P (Xi = −1|Y = y). We compute the marginal posterior probabilities of Xi from

the joint posterior: P (X = x|Y = y)

A Gibbs sampler for this posterior probability would draw from

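$$\begin{aligned}
&P(X_i = x_{n,i} \mid X_{I\setminus\{i\}} = x_{n;1,\ldots,i-1}\, x_{n-1;i+1,\ldots,p};\ Y = y) \\
&\quad = P(X_i = x_{n,i} \mid X_{I\setminus\{i\}} = x_{n;1,\ldots,i-1}\, x_{n-1;i+1,\ldots,p};\ Y_i = y_i) \\
&\quad = P(X_i = x_{n,i} \mid X_{I\setminus\{i\}} = x_{n;1,\ldots,i-1}\, x_{n-1;i+1,\ldots,p}) \cdot \frac{f_{Y_i|X}(y_i \mid x_{n;1,\ldots,i-1}\, x_{n,i}\, x_{n-1;i+1,\ldots,p})}{f_{Y_i|X_{I\setminus\{i\}}}(y_i \mid x_{n;1,\ldots,i-1}\, x_{n-1;i+1,\ldots,p})} \\
&\quad = P(X_i = x_{n,i} \mid X_{I\setminus\{i\}} = x_{n;1,\ldots,i-1}\, x_{n-1;i+1,\ldots,p}) \cdot \frac{f_{Y_i|X_i}(y_i \mid x_{n,i})}{f_{Y_i|X_i}(y_i \mid 1) \cdot P(X_i = 1) + f_{Y_i|X_i}(y_i \mid -1) \cdot P(X_i = -1)}
\end{aligned}$$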

This expression has three components

1. Conditional probabilities

Yi|Xi = −1 ∼ N(0, κ2 + σ2) and Yi|Xi = 1 ∼ N(0,K2 + σ2)

2. Marginal probabilities

The prior marginal probabilities P (Xi = 1) and P (Xi = −1) have to be computed from

(Markov Chain) Monte Carlo sampling of the prior probability model.

3. The transition probabilities

We know (from the proof of Hammersley-Clifford)


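$$P(X_i = x_i \mid X_{I\setminus\{i\}} = x_{I\setminus\{i\}}) = P(X_i = x_i \mid X_{\partial i} = x_{\partial i}) = \frac{\exp\left(-\sum_{C \mid i\in C} H_C(x_C)\right)}{\sum_{y_i} \exp\left(-\sum_{C \mid i\in C} H_C(y_i\, x_{C\setminus\{i\}})\right)}$$

which is in our case

$$\begin{aligned}
P(X_i = x_i \mid X_{i-1} = x_{i-1}, X_{i+1} = x_{i+1})
&= \frac{\exp(-\tau(x_i x_{i-1} + x_i x_{i+1})) \cdot \exp(-\gamma x_i)}{\sum_{y_i\in\{-1,1\}} \exp(-\tau(y_i x_{i-1} + y_i x_{i+1})) \exp(-\gamma y_i)} \\
&= \frac{\exp(-\tau(x_i x_{i-1} + x_i x_{i+1})) \cdot \exp(-\gamma x_i)}{\exp(-\tau(x_{i-1} + x_{i+1})) \exp(-\gamma) + \exp(+\tau(x_{i-1} + x_{i+1})) \exp(\gamma)}
\end{aligned}$$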
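The conditional probability above is all that is needed to run a Gibbs sampler on the prior Ising chain. The sketch below makes one assumption beyond the slide: at the two end sites the missing neighbour is given the value 0, so that it contributes no pair potential. The function names are illustrative.

```python
import numpy as np

def ising_conditional(x_i, left, right, tau, gamma):
    """P(X_i = x_i | X_{i-1} = left, X_{i+1} = right) for the 1D Ising prior."""
    neighbour_sum = left + right

    def weight(value):
        return np.exp(-tau * value * neighbour_sum - gamma * value)

    return weight(x_i) / (weight(1) + weight(-1))

def gibbs_sweep(x, tau, gamma, rng):
    """One pass over all sites of the chain, updating x in place."""
    p = len(x)
    for i in range(p):
        left = x[i - 1] if i > 0 else 0        # assumption: no neighbour -> 0
        right = x[i + 1] if i < p - 1 else 0
        prob_plus = ising_conditional(1, left, right, tau, gamma)
        x[i] = 1 if rng.random() < prob_plus else -1
    return x

rng = np.random.default_rng(0)
x = np.ones(10, dtype=int)
for _ in range(100):
    x = gibbs_sweep(x, tau=-0.5, gamma=0.0, rng=rng)
print(x)
```

Averaging the indicator of xi = 1 over many sweeps would give a Monte Carlo estimate of the prior marginal probabilities P(Xi = 1) mentioned under item 2.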


2.3.2 Metropolis-Hastings sampler

Gibbs-sampler

1. Based on conditional probabilities in multidimensional random vector⇒Markov random field

2. If vector components are highly correlated, conditional sampling leads to

values that are close to old ones: slow move through range of possible

values, hence slow convergence

Metropolis-Hastings sampler

1. Local update of previous sample: Markov Chain of samples (like Gibbs

sampler)

2. Based on joint probabilities⇒ Gibbs random field

3. One or more dimensions (↔ Gibbs sampler is always for random

vectors)

4. Uses rejection sampling: new sample has to be accepted


A proposal/transition distribution

Given a state Xn, a possible new state Xn′ is generated from a distribution

q(x|Xn = xn)

In principle, this proposal distribution can be chosen freely, as long as it is easy to sample from and to evaluate. It typically describes local updates.

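The new state is accepted, i.e., Xn+1 = Xn′, if and only if

$$U \leq \frac{f_X(X_{n'}) \cdot q(X_n \mid X_{n'})}{f_X(X_n) \cdot q(X_{n'} \mid X_n)}$$

where U ∼ uniform[0, 1]. Otherwise the proposal is rejected and the chain stays in the current state, Xn+1 = Xn.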


The acceptance probability

Given a proposal Xn′ = xn′ the probability that it is accepted equals

$$\alpha(x_{n'}; x_n) = \min\left(1,\ \frac{f_X(x_{n'}) \cdot q(x_n \mid x_{n'})}{f_X(x_n) \cdot q(x_{n'} \mid x_n)}\right)$$

Remark If the distribution has the form $f_X(x) = \frac{1}{Z}\exp[-H(x)]$ then the

acceptance probability does not depend on Z. Often Z is very hard to find

(integration/summation over all possible configurations).
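A single Metropolis-Hastings update can be sketched as follows. The sketch works with logarithms of an unnormalised target, so Z never appears. The function `mh_step`, its callback arguments and the example target and proposal (a standard normal target with an N(1, 2²) proposal that ignores the current state) are assumptions made for illustration.

```python
import numpy as np

def mh_step(x, log_f, propose, log_q, rng):
    """One Metropolis-Hastings update.

    log_f(x): log of the target density, up to an additive constant.
    propose(x, rng): draws a proposal given the current state x.
    log_q(new, old): log q(new | old), up to an additive constant.
    """
    y = propose(x, rng)
    log_ratio = log_f(y) + log_q(x, y) - log_f(x) - log_q(y, x)
    alpha = np.exp(min(0.0, log_ratio))          # acceptance probability
    return y if rng.random() <= alpha else x

def log_f(x):
    return -0.5 * x ** 2                         # standard normal, Z omitted

def propose(x, rng):
    return rng.normal(1.0, 2.0)                  # does not depend on x

def log_q(new, old):
    return -0.5 * ((new - 1.0) / 2.0) ** 2

rng = np.random.default_rng(0)
chain = np.empty(20000)
x = 0.0
for n in range(chain.size):
    x = mh_step(x, log_f, propose, log_q, rng)
    chain[n] = x
print(chain.mean(), chain.var())
```

Because the proposal is not symmetric here, the ratio q(xn|xn′)/q(xn′|xn) is needed to correct for the bias of the proposal towards values around 1.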


Transition probabilities from one state to the next

The transition probability becomes (case of discrete states)

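• For $x_{n+1} \neq x_n$:

$$\begin{aligned}
P(X_{n+1} = x_{n+1} \mid X_n = x_n) &= P(X_{n'} = x_{n+1} \mid X_n = x_n) \cdot P(x_{n+1} \text{ accepted} \mid X_n = x_n, X_{n'} = x_{n+1}) \\
&= q(x_{n+1} \mid x_n) \cdot \alpha(x_{n+1}; x_n)
\end{aligned}$$

• The probability that the proposed state (whatever the proposal is) will be rejected, given that the current state is Xn = xn, equals

$$\begin{aligned}
r(x_n) := P(\text{rejected} \mid X_n = x_n) &= \sum_{x} P(X_{n'} = x \mid X_n = x_n) \cdot P(x \text{ rejected} \mid X_n = x_n, X_{n'} = x) \\
&= \sum_{x} q(x \mid x_n) \cdot (1 - \alpha(x; x_n)) \\
&= 1 - \sum_{x} q(x \mid x_n) \cdot \alpha(x; x_n) \\
&=: 1 - a(x_n)
\end{aligned}$$

• For $x_{n+1} = x_n$ we obtain

$$P(X_{n+1} = x_n \mid X_n = x_n) = q(x_n \mid x_n) \cdot \alpha(x_n; x_n) + (1 - a(x_n))$$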



Equilibrium distribution

The objective distribution fX(x) is an invariant distribution of a Metropolis-

Hastings sampler

Proof

Denote the transition probabilities Pxy = P (Xn+1 = y|Xn = x)

It holds that Pxy = q(y|x) · α(y;x) + δ(x,y) · (1− a(x))

where δ(x,y) is the Kronecker-delta.

We consider Pxy · fX(x) = q(y|x) ·α(y;x) · fX(x)+ δ(x,y) · (1−a(x)) · fX(x)

We have

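$$\begin{aligned}
\alpha(y;x) \cdot q(y|x) \cdot f_X(x) &= \min\left(1,\ \frac{f_X(y) \cdot q(x|y)}{f_X(x) \cdot q(y|x)}\right) \cdot q(y|x) \cdot f_X(x) \\
&= \min\big(q(y|x) \cdot f_X(x),\ f_X(y) \cdot q(x|y)\big) \\
&= \min\left(\frac{f_X(x) \cdot q(y|x)}{f_X(y) \cdot q(x|y)},\ 1\right) \cdot q(x|y) \cdot f_X(y) \\
&= \alpha(x;y) \cdot q(x|y) \cdot f_X(y)
\end{aligned}$$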

The Kronecker-delta term is only active if x = y, so formally, one can always

write

δ(x,y) · (1− a(x)) · fX(x) = δ(y,x) · (1− a(y)) · fX(y)


We may conclude that Pxy · fX(x) = Pyx · fX(y)

This is a detailed balance equation. Not only is the objective distribution

invariant under the Metropolis-Hastings sampler, but also

The Metropolis-Hastings sampler is reversible
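On a finite state space the transition matrix can be built explicitly and the detailed balance equation checked entry by entry. This is a minimal sketch under stated assumptions: four states, an unnormalised target `f_demo` and a randomly generated, strictly positive proposal matrix `q_demo`; all names are hypothetical.

```python
import numpy as np

def mh_transition_matrix(f, q):
    """P[x, y] = P(X_{n+1} = y | X_n = x) with q[x, y] = q(y | x).

    f may be unnormalised: only ratios f[y] / f[x] are used.
    """
    k = len(f)
    P = np.zeros((k, k))
    for x in range(k):
        for y in range(k):
            if y != x and q[x, y] > 0:
                alpha = min(1.0, f[y] * q[y, x] / (f[x] * q[x, y]))
                P[x, y] = q[x, y] * alpha
        # staying put: proposal equal to x, or any rejected proposal
        P[x, x] = 1.0 - P[x].sum()
    return P

rng = np.random.default_rng(0)
f_demo = np.array([1.0, 2.0, 3.0, 4.0])
q_demo = rng.random((4, 4)) + 0.1
q_demo /= q_demo.sum(axis=1, keepdims=True)     # rows are proposal pmfs

P_demo = mh_transition_matrix(f_demo, q_demo)
pi_demo = f_demo / f_demo.sum()
flow_demo = pi_demo[:, None] * P_demo           # f_X(x) * P_xy

print(np.allclose(flow_demo, flow_demo.T))      # detailed balance
print(np.allclose(pi_demo @ P_demo, pi_demo))   # invariance
```

The diagonal entry collects the two ways of staying in x: proposing x itself and rejecting another proposal, which matches q(x|x) · α(x;x) + (1 − a(x)) since α(x;x) = 1.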


A special case: the original Metropolis sampler

Suppose that the proposal distribution is symmetric in the sense that

q(x|y) = q(y|x)

This is realized by choosing Y = X + η where η has a zero mean symmetric distribution g(η), hence q(y|x) = g(y − x).

Then the acceptance probability for a proposal y given a current state x

becomes

$$\alpha(y;x) = \min\left(1,\ \frac{f_X(y) \cdot q(x|y)}{f_X(x) \cdot q(y|x)}\right) = \min\left(1,\ \frac{f_X(y)}{f_X(x)}\right)$$

This was the original procedure, proposed by Metropolis. It was later refined

by Hastings for arbitrary proposal distributions.
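A random-walk version of this special case can be sketched in a few lines. The sketch assumes a one-dimensional target given through its log-density up to a constant and a normal increment η with standard deviation `step`; the function name and the example target (a standard normal) are illustrative.

```python
import numpy as np

def random_walk_metropolis(log_target, x0, n_iter, step, seed=0):
    """Metropolis sampler with symmetric proposal Y = X + eta, eta ~ N(0, step^2)."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    samples = np.empty(n_iter)
    n_accepted = 0
    for n in range(n_iter):
        y = x + step * rng.normal()
        # symmetric proposal: the q-ratio cancels in the acceptance probability
        alpha = np.exp(min(0.0, log_target(y) - log_target(x)))
        if rng.random() <= alpha:
            x = y
            n_accepted += 1
        samples[n] = x
    return samples, n_accepted / n_iter

def log_target(x):
    return -0.5 * x ** 2        # standard normal, normalising constant omitted

samples, rate = random_walk_metropolis(log_target, 0.0, 50000, step=2.4)
_, rate_small = random_walk_metropolis(log_target, 0.0, 5000, step=0.1)
_, rate_large = random_walk_metropolis(log_target, 0.0, 5000, step=10.0)
print(samples.mean(), samples.var(), rate, rate_small, rate_large)
```

Small steps are accepted almost always but move the chain slowly, while large steps are rejected most of the time; both extremes lead to strongly correlated samples.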


An example of a local update

Consider q(y|x) = fXi|XI\{i}(yi|xI\{i}), if yI\{i} = xI\{i} (and otherwise q(y|x) = 0)

Configurations with yI\{i} 6= xI\{i} have probability (density) zero.

It then holds

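$$\begin{aligned}
f_X(x) \cdot q(y|x) &= f_X(x) \cdot f_{X_i|X_{I\setminus\{i\}}}(y_i \mid x_{I\setminus\{i\}}) \\
&= f_X(x) \cdot \frac{f_X(x_{I\setminus\{i\}}\, y_i)}{f_{X_{I\setminus\{i\}}}(x_{I\setminus\{i\}})}
\end{aligned}$$

and, keeping in mind that $y_{I\setminus\{i\}} = x_{I\setminus\{i\}}$,

$$\begin{aligned}
f_X(y) \cdot q(x|y) &= f_X(y) \cdot f_{X_i|X_{I\setminus\{i\}}}(x_i \mid y_{I\setminus\{i\}}) \\
&= f_X(y) \cdot \frac{f_X(y_{I\setminus\{i\}}\, x_i)}{f_{X_{I\setminus\{i\}}}(y_{I\setminus\{i\}})} \\
&= f_X(y_i\, x_{I\setminus\{i\}}) \cdot \frac{f_X(x)}{f_{X_{I\setminus\{i\}}}(x_{I\setminus\{i\}})} \\
&= f_X(x) \cdot q(y|x)
\end{aligned}$$

From this, it follows

1. The acceptance probability $\alpha(y;x) = \min\left(1,\ \dfrac{f_X(y) \cdot q(x|y)}{f_X(x) \cdot q(y|x)}\right) = 1$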

2. The process is reversible

In fact, this is one step of a Gibbs sampler.

1. In a general Metropolis-Hastings sampler, there is a free proposal, which is evaluated: the


evaluation uses the joint distribution

2. In the specific case of the Gibbs sampler, the proposal uses the conditional distribution and

there is no evaluation afterwards (so no joint distribution)


Convergence

We have seen that the objective distribution is invariant under a

Metropolis-Hastings sampler.

This is not enough for good convergence.

Indeed, the case of one step of a Gibbs sampler illustrates that updates may

be too local: in this case, only one component of the vector is subject to

possible change. That implies that many states x are unreachable. The

Markov Chain is then reducible.

Irreducibility is obtained if q(y|x) > 0 for all pairs (x,y).

As this condition is sometimes too restrictive for every sampler separately, one may consider a combination of different proposal distributions, e.g., a sequence of Metropolis-Hastings samplers that each update one component at a time (such as the Gibbs sampler).


Choice of the proposal distribution

The speed of convergence in a Metropolis-Hastings sampler depends on the correlation between

subsequent samples.

High correlation→ slow convergence

Subsequent samples should be as independent as possible. (Fully independent samples are

optimal, in the sense that the limiting distribution is reached instantaneously, but they are often

difficult to realize or difficult to sample from)

Inter-sample dependence is governed by two conflicting objectives

• Acceptance probability: low acceptance probability means high probability that two

subsequent samples are identical, hence, high correlation.

Acceptance probability is enhanced by proposal distributions with small variance, i.e., a

distribution that favours very local updates.

• Correlation between current state and proposed state. This source of correlation is

reduced by proposals that favour large updates.

