Yet another MCMC case-study

Håvard Rue
Department of Mathematical Sciences
NTNU, Norway

Sept 2008
Overview of talk
• Study a seemingly trivial hierarchical model:
  • latent temporal Gaussian, with
  • binary observations
• Develop a "standard" MCMC algorithm for inference:
  • auxiliary variables
  • (conjugate) single-site updates
• ... and study its properties empirically.
Overview of talk ...
• Demonstrate how to develop more sophisticated MCMC algorithms:
  • blocking
  • joint updates with blocking
  • independence sampler
• ... and demonstrate that MCMC is not even needed in this example.
• Discuss the consequences of this case-study.
The latent Gaussian model

Tokyo rainfall data
Stage 1: Binomial data

    y_i ∼ Binomial(2, p(x_i))   (Binomial(1, p(x_i)) for 29 February, which occurs only once in the two years of data)
Stage 2: Assume a smooth latent x,

    x ∼ RW2(κ),   logit(p_i) = x_i
Stage 3: Gamma(α, β) prior on κ
The RW2 model for regular locations
Use the second-order increments

    ∆²x_i  iid∼  N(0, κ⁻¹),   i = 1, ..., n − 2                 (1)

to define the joint density of x:

    π(x) ∝ κ^((n−2)/2) exp( −(κ/2) Σ_{i=1}^{n−2} (x_i − 2x_{i+1} + x_{i+2})² )
         = κ^((n−2)/2) exp( −(κ/2) xᵀRx )
R =
    [  1  -2   1                              ]
    [ -2   5  -4   1                          ]
    [  1  -4   6  -4   1                      ]
    [      1  -4   6  -4   1                  ]
    [          ·   ·   ·   ·   ·              ]
    [              1  -4   6  -4   1          ]
    [                  1  -4   6  -4   1      ]
    [                      1  -4   5  -2      ]
    [                          1  -2   1      ]
This is an IGMRF of second order: invariant to adding a line to x. Our problem is circulant (day 366 wraps around to day 1).
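To make the definition concrete: R is just DᵀD, where D maps x to its second-order increments. A minimal numpy sketch (my own illustration, not code from the talk; the circular option reflects the wrap-around of the calendar):

    import numpy as np

    def rw2_structure(n, circular=False):
        # D maps x to its second-order increments x_i - 2 x_{i+1} + x_{i+2}
        if circular:
            D = np.zeros((n, n))
            for i in range(n):
                D[i, i], D[i, (i + 1) % n], D[i, (i + 2) % n] = 1.0, -2.0, 1.0
        else:
            D = np.zeros((n - 2, n))
            for i in range(n - 2):
                D[i, i:i + 3] = [1.0, -2.0, 1.0]
        return D.T @ D          # the structure matrix R

    print(rw2_structure(8).astype(int))   # reproduces the band matrix above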
Model summary
    π(x | κ) π(κ) ∏_i π(y_i | x_i)

where

• x | κ is Gaussian (Markov) with dimension 366
• κ is Gamma
• y_i | x_i is Binomial with probability p(x_i)
MCMC – single-site

Construction of nice full conditionals
A popular approach is to introduce auxiliary variables w, so that
x | the rest
is Gaussian.
Normal-mixture representation
Theorem (Kelker, 1971)
If x has density f(x) symmetric around 0, then there exist independent random variables z and v, with z standard normal, such that x = z/v iff the derivatives of f(x) satisfy

    (−d/dy)^k f(√y) ≥ 0

for y > 0 and for k = 1, 2, ....
• Student-t
• Logistic and Laplace
The corresponding mixing distributions for the precision parameter λ that generate these distributions as scale mixtures of normals:
Distribution of x   Mixing distribution of λ
Student-t_ν         Γ(ν/2, ν/2)
Logistic            1/(2K)², where K is Kolmogorov–Smirnov distributed
Laplace             1/(2E), where E is exponentially distributed
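As a quick sanity check of the Laplace row (a minimal numpy sketch, my own illustration): drawing the precision as λ = 1/(2E) with E exponential and then x | λ ∼ N(0, 1/λ) should reproduce a standard Laplace distribution, which has variance 2 and mean absolute value 1.

    import numpy as np

    rng = np.random.default_rng(1)
    E = rng.exponential(1.0, size=1_000_000)   # E ~ Exp(1)
    lam = 1.0 / (2.0 * E)                      # mixing precision from the table
    x = rng.normal(0.0, 1.0 / np.sqrt(lam))    # x | lam ~ N(0, 1/lam)

    print(x.var(), np.abs(x).mean())           # close to 2.0 and 1.0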
Example: Binary regression
GMRF x and Bernoulli data

    y_i ∼ B(g⁻¹(x_i)),   g⁻¹(x) = Φ(x)   (probit link)

Equivalent representation using auxiliary variables w:

    ε_i  iid∼  N(0, 1)
    w_i = x_i + ε_i
    y_i = 1 if w_i > 0, and 0 otherwise,

for the probit link.
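Given x and y, the w_i are conditionally independent truncated normals, so the w-update of a Gibbs sampler is one vectorised draw. A minimal scipy sketch for the probit case (my own names, not code from the talk):

    import numpy as np
    from scipy.stats import truncnorm

    def sample_w(x, y, rng):
        # w_i | x_i, y_i ~ N(x_i, 1), truncated to (0, inf) if y_i = 1
        # and to (-inf, 0] if y_i = 0; bounds are standardised by loc/scale
        a = np.where(y == 1, -x, -np.inf)
        b = np.where(y == 1, np.inf, -x)
        return truncnorm.rvs(a, b, loc=x, scale=1.0, random_state=rng)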
Single-site Gibbs sampling
Auxiliary variables can be introduced for the logit link¹ to achieve this sampler:

• κ ∼ Γ(·, ·)
• for each i: x_i ∼ N(·, ·)
• for each i: w_i ∼ W(·)

It is fully automatic; no tuning!!! (The x_i and κ steps are sketched below.)

¹ Held & Holmes (2006)
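For concreteness, here is what the x_i and κ steps can look like. I sketch the probit variant (fixed unit noise variance on w); the Held & Holmes logit version additionally samples the mixing variances. All names are my own, with Q = κR and R the RW2 structure matrix:

    import numpy as np

    def single_site_sweep(x, w, kappa, R, alpha, beta, rng):
        n = len(x)
        Q = kappa * R                        # prior precision of x
        for i in range(n):
            # full conditional of x_i: precision Q_ii + 1 (the +1 from w_i = x_i + eps_i)
            prec = Q[i, i] + 1.0
            b = w[i] - (Q[i, :] @ x - Q[i, i] * x[i])   # w_i minus neighbour terms
            x[i] = rng.normal(b / prec, 1.0 / np.sqrt(prec))
        # conjugate Gamma update; rank of R is n - 1 for the circulant RW2 (assumption)
        rank = n - 1
        kappa = rng.gamma(alpha + 0.5 * rank, 1.0 / (beta + 0.5 * x @ R @ x))
        return x, kappa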
Results: hyper-parameter log(κ)
Results: latent node x10
Results: density for latent node x10
[Figure: kernel density estimates of x[10] from two independent runs (run 1: N = 747017, bandwidth 0.002606; run 2: N = 747391, bandwidth 0.002803); the two estimates differ visibly.]
Discussion
Single-site sampler with auxiliary variables:

• Even long runs show large variation
• "Long"-range dependence
• Very slow mixing

But:

• Easy to be "fooled" running shorter chains
• The variability can be underestimated.
What is causing the problem?
Two issues
1. Slow mixing within the latent field x
2. Slow mixing between the latent field x and θ.
Blocking is the “usual” approach to resolve such issues, if possible.
Note: blocking mainly helps within the block only.
MCMC – blocking
Strategies for blocking
Slow mixing due to the latent field x only:
• Block x
Slow mixing due to the interaction between the latent field x and θ:
• Block (x,θ).
In most cases: if you can do one, you can do both.
MCMC – blocking scheme I
• κ ∼ Γ(·, ·)
• x ∼ N(·, ·) (block update for the whole field; sketched below)
• w ∼ W(·) (conditionally independent)
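In the probit variant, the full conditional of the whole field is Gaussian with precision Q = κR + I and mean Q⁻¹w, so the block update is one factorise-and-solve. A minimal dense sketch (my own illustration; in practice one would exploit sparsity):

    import numpy as np
    from scipy.linalg import cholesky, cho_solve, solve_triangular

    def sample_x_block(w, kappa, R, rng):
        # x | kappa, w ~ N(Q^{-1} w, Q^{-1}) with Q = kappa*R + I
        n = len(w)
        L = cholesky(kappa * R + np.eye(n), lower=True)    # Q = L L^T
        mu = cho_solve((L, True), w)                       # mean Q^{-1} w
        z = solve_triangular(L.T, rng.standard_normal(n))  # z ~ N(0, Q^{-1})
        return mu + z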
Results
[Figure: density of log(kappa) (N = 101804, bandwidth 0.04667) and density of x[10] (N = 101804, bandwidth 0.003815) under scheme I.]
MCMC – blocking scheme II
• Sample
  • κ′ ∼ q(κ′; κ)
  • x′ | κ′, y ∼ N(·, ·)
  and then accept/reject (x′, κ′) jointly
• w ∼ W(·) (conditionally independent)
Remarks

• If the normalising constant for x | · is available, then this is an EASY FIX of scheme I (see the sketch below).
• Usually makes a huge improvement
• Automatic "reparameterisation"
• Doubles the computational cost
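The reason the normalising constant matters: since x′ is proposed exactly from π(x | κ′, w), the x-terms cancel in the acceptance ratio, leaving a ratio of marginals in κ alone, and that marginal needs |Q|^(1/2). A hedged sketch for the probit variant (my own names; Q = κR + I):

    import numpy as np
    from scipy.linalg import cholesky, cho_solve

    def log_marginal_kappa(kappa, w, R, alpha, beta):
        # log pi(kappa | w) up to a constant, x integrated out analytically
        n = len(w)
        L = cholesky(kappa * R + np.eye(n), lower=True)   # Q = L L^T
        mu = cho_solve((L, True), w)                      # mean of x | kappa, w
        logdet_Q = 2.0 * np.log(np.diag(L)).sum()
        rank = n - 1                                      # circulant RW2 (assumption)
        return ((alpha - 1.0) * np.log(kappa) - beta * kappa   # Gamma(alpha, beta) prior
                + 0.5 * rank * np.log(kappa)                   # from |kappa R|^(1/2), rank-deficient
                - 0.5 * logdet_Q
                + 0.5 * mu @ w)                                # = w^T Q^{-1} w / 2

With a symmetric proposal q, the joint move (κ′, x′) is then accepted with probability min{1, exp(log_marginal_kappa(κ′) − log_marginal_kappa(κ))}.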
Results
[Figure: ACF of log(kappa) up to lag 300 for scheme I (left) and scheme II (right).]
Schemes I & II are not that easy to implement in practice...

• The computational speed depends critically on how the Gaussian part is dealt with
• Statisticians do like high-dimensional Gaussians, as long as they can be dealt with using "Kalman-filter" algorithms
• "Kalman-filter" algorithms are not applicable here, as the prior is circulant
• General sparse-matrix numerical algebra is required
• Possible in R, but not as easy (yet?) as it could be.

Conclusion: Not that many would have implemented either blocking scheme.
MCMC – blocking without auxiliary variables

Removing the auxiliary variables
• The auxiliary variables make the full conditional for x Gaussian
• If we do not use them, the full conditional for x looks like

    π(x | ...) ∝ exp( −(1/2) xᵀQx + Σ_i log π(y_i | x_i) )
              ≈ exp( −(1/2) (x − µ)ᵀ (Q + diag(c)) (x − µ) )  =  π_G(x | ...)

• The Gaussian approximation is constructed by matching the
  • mode, and the
  • curvature at the mode (sketched below).
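The matching can be done with a few Newton steps: expand Σ_i log π(y_i | x_i) to second order around the current µ and solve. A minimal sketch for the Binomial/logit likelihood of this model (my own illustration, not code from the talk):

    import numpy as np

    def gaussian_approx(Q, y, m=2, n_iter=20):
        # match mode and curvature of pi(x|...) for y_i ~ Binomial(m, logit^{-1}(x_i));
        # returns (mu, c) such that pi_G = N(mu, (Q + diag(c))^{-1})
        mu = np.zeros(Q.shape[0])
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-mu))
            grad = y - m * p               # gradient of the log-likelihood at mu
            c = m * p * (1.0 - p)          # minus its second derivative
            # Newton step for the mode: (Q + diag(c)) mu_new = grad + c * mu
            mu = np.linalg.solve(Q + np.diag(c), grad + c * mu)
        return mu, c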
Improved one-block scheme
• κ′ ∼ q(·; κ)
• x′ ∼ π_G(x | κ′, y)
• Accept/reject (x′, κ′) jointly

Note: π_G(·) is indexed by κ′, hence we need to compute a new approximation for each value of κ′.
Results
[Figure: trace of log(kappa) over 8000 iterations and ACF of log(kappa) up to lag 300 for the one-block scheme without auxiliary variables.]
MCMC – independence sampler

Independence sampler
We can construct an independence sampler using π_G(·). The Laplace approximation for κ | y:

    π(κ | y) ∝ π(κ) π(x | κ) π(y | x) / π(x | κ, y)
             ≈ π(κ) π(x | κ) π(y | x) / π_G(x | κ, y) |_{x = mode(κ)}

Hence, we first

• evaluate the Laplace approximation at some "selected" points,
• build an interpolating log-spline,
• use this parametric model as π(κ | y) (sketched below).
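A sketch of that precomputation, reusing gaussian_approx from above (all names, the grid, and the prior parameters are my own placeholders; the Binomial(1, ·) day is ignored for brevity):

    import numpy as np
    from scipy.interpolate import CubicSpline

    def log_post_kappa(kappa, y, R, alpha, beta, m=2):
        # Laplace approximation to log pi(kappa | y), up to a constant:
        # log of pi(kappa) pi(x|kappa) pi(y|x) / pi_G(x|kappa, y), at x = mode(kappa)
        n = len(y)
        mu, c = gaussian_approx(kappa * R, y, m=m)
        p = 1.0 / (1.0 + np.exp(-mu))
        log_lik = np.sum(y * np.log(p) + (m - y) * np.log1p(-p))
        log_prior_x = 0.5 * (n - 1) * np.log(kappa) - 0.5 * kappa * (mu @ R @ mu)
        _, logdet = np.linalg.slogdet(kappa * R + np.diag(c))  # pi_G at its own mode
        return ((alpha - 1.0) * np.log(kappa) - beta * kappa
                + log_prior_x + log_lik - 0.5 * logdet)

    # evaluate at "selected" points on the log scale, then interpolate
    # (y and R as defined earlier; grid and prior parameters are placeholders)
    lk = np.linspace(np.log(10.0), np.log(1e4), 25)
    lp = np.array([log_post_kappa(np.exp(t), y, R, alpha=1.0, beta=1e-2) for t in lk])
    spline = CubicSpline(lk, lp - lp.max())       # unnormalised log pi(kappa | y)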
Independence sampler
• κ′ ∼ π(κ | y)
• x′ ∼ π_G(x | κ′, y)
• Accept/reject (κ′, x′) jointly

Note: Corr(x(t + k), x(t)) ≈ (1 − α)^|k|, where α is the acceptance rate. In this example, α = 0.83...
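For example, with α = 0.83 the lag-5 autocorrelation is already (1 − 0.83)⁵ ≈ 1.4 × 10⁻⁴, which is why the ACF below is essentially flat.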
Results
[Figure: trace of log(kappa) over 8000 iterations and ACF of log(kappa) up to lag 300 for the independence sampler.]
MCMC – Deterministic inference

Can we improve this sampler?
• Yes, if we are interested in the posterior marginals for κ and {x_i}.
• The marginals of the Gaussian proposal π_G(x | ...) are known analytically.
• Just use numerical integration!
Deterministic inference
Posterior marginal for κ:

• Compute π(κ | y)

Posterior marginal for x_i:

• Use numerical integration (sketched below)

    π(x_i | y) = ∫ π(x_i | y, κ) π(κ | y) dκ
               ≈ Σ_k N(x_i; µ(κ_k), σ²(κ_k)) × π(κ_k | y) × ∆_k
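A sketch of this final step (my own names: mus[k] and sig2s[k] hold the marginal means and variances of π_G at grid point κ_k, e.g. the diagonal of (κ_k R + diag(c))⁻¹, and log_w holds the interpolated log π(κ_k | y)):

    import numpy as np
    from scipy.stats import norm

    def marginal_xi(i, xs, kappas, log_w, mus, sig2s):
        # pi(x_i | y) on the grid xs, as a finite mixture of Gaussians
        w = np.exp(log_w - log_w.max()) * np.gradient(kappas)  # weights * Delta_k
        w /= w.sum()
        dens = np.zeros_like(xs)
        for k in range(len(kappas)):
            dens += w[k] * norm.pdf(xs, mus[k][i], np.sqrt(sig2s[k][i]))
        return dens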
Results: Mixture of Gaussians
[Figure: histogram of x with the mixture-of-Gaussians density overlaid.]
Results: Improved....
[Figure: histogram of x with the improved density approximation overlaid.]
Deterministic inference
• Obtain "exact results"; the error cannot be detected using MCMC
• Fast: about 1 second on my laptop
• This approach is much more generally applicable
• A lot of delicate coding
• In this example, it makes the model useful in practice!
Discussion
What can be learned from this exercise?
For a relatively simple model, we have implemented:

• single-site with auxiliary variables (looong time; hours)
• various forms of blocking (long time; many minutes)
• independence sampler (long time; many minutes)
• approximate inference (nearly instant; one second)
What can be learned from this exercise? ...
My guess: nearly all statisticians would implement the single-site scheme, possibly with auxiliary variables.

Which implies:

• Most probably, the results would not be correct.
• They "accept" the long running time.
• Trouble: such MCMC schemes are not useful for routine analysis of similar data.
What can be learned from this exercise? ...
• In many cases, the situation is much worse in practice; this was a very simple model.
• Single-site MCMC is still the default choice for the non-expert user.
• Hierarchical models are popular, but they are difficult for MCMC.

Perhaps the development of models is not in sync with the development of inference? We cannot just wait for more powerful computers...
MCMC tries to solve the full problem...
If the target is the posterior marginals, then perhaps MCMC is not needed?

• Consider a latent AR(1) model of length n, observed with non-Gaussian observations
• If we are interested in the posterior marginal for x_1, then only a subset of the data (and of the x's) is needed! No effect of n.
• For the hyper-parameters, large n makes life easier!