Yet another MCMC case-study

Håvard Rue
Department of Mathematical Sciences
NTNU, Norway

Sept 2008
Overview of talk
• Study a seemingly trivial hierarchical model:
  • latent temporal Gaussian, with
  • binary observations
• Develop a "standard" MCMC algorithm for inference:
  • auxiliary variables
  • (conjugate) single-site updates
• ... and study its properties empirically.
Overview of talk ...
• Demonstrate how to develop more sophisticated MCMC algorithms:
  • blocking
  • joint updates with blocking
  • independence sampler
• ... and demonstrate that MCMC is not even needed in this example.
• Discuss the consequences of this case-study.
The latent Gaussian model

Tokyo rainfall data
Stage 1: Binomial data

    y_i ∼ Binomial(2, p(x_i))   (Binomial(1, p(x_i)) for 29 February, which occurs only once in the two years of data)
Stage 2: Assume a smooth latent x,

    x ∼ RW2(κ),   logit(p_i) = x_i
Stage 3: Gamma(α, β) prior on κ
The RW2 model for regular locations
Use the second-order increments

    ∆²x_i  iid∼  N(0, κ⁻¹),   i = 1, ..., n − 2                 (1)

to define the joint density of x:

    π(x) ∝ κ^((n−2)/2) exp( −(κ/2) Σ_{i=1}^{n−2} (x_i − 2x_{i+1} + x_{i+2})² )
         = κ^((n−2)/2) exp( −(κ/2) xᵀRx )
R =
    [  1  -2   1                              ]
    [ -2   5  -4   1                          ]
    [  1  -4   6  -4   1                      ]
    [      1  -4   6  -4   1                  ]
    [          ·   ·   ·   ·   ·              ]
    [              1  -4   6  -4   1          ]
    [                  1  -4   6  -4   1      ]
    [                      1  -4   5  -2      ]
    [                          1  -2   1      ]
This is an IGMRF of second order: invariant to adding a line to x. Our problem is circulant (day 366 wraps around to day 1).
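To make the definition concrete: R is just DᵀD, where D maps x to its second-order increments. A minimal numpy sketch (my own illustration, not code from the talk; the circular option reflects the wrap-around of the calendar):

    import numpy as np

    def rw2_structure(n, circular=False):
        # D maps x to its second-order increments x_i - 2 x_{i+1} + x_{i+2}
        if circular:
            D = np.zeros((n, n))
            for i in range(n):
                D[i, i], D[i, (i + 1) % n], D[i, (i + 2) % n] = 1.0, -2.0, 1.0
        else:
            D = np.zeros((n - 2, n))
            for i in range(n - 2):
                D[i, i:i + 3] = [1.0, -2.0, 1.0]
        return D.T @ D          # the structure matrix R

    print(rw2_structure(8).astype(int))   # reproduces the band matrix above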
Model summary
    π(x | κ) π(κ) ∏_i π(y_i | x_i)

where

• x | κ is Gaussian (Markov) with dimension 366
• κ is Gamma
• y_i | x_i is Binomial with probability p(x_i)
MCMC – single-site

Construction of nice full conditionals
A popular approach is to introduce auxiliary variables w, so that
x | the rest
is Gaussian.
Normal-mixture representation
Theorem (Kelker, 1971)
If x has density f(x) symmetric around 0, then there exist independent random variables z and v, with z standard normal, such that x = z/v iff the derivatives of f(x) satisfy

    (−d/dy)^k f(√y) ≥ 0

for y > 0 and for k = 1, 2, ....
• Student-t
• Logistic and Laplace
The corresponding mixing distributions for the precision parameter λ that generate these distributions as scale mixtures of normals:
Distribution of x   Mixing distribution of λ
Student-t_ν         Γ(ν/2, ν/2)
Logistic            1/(2K)², where K is Kolmogorov–Smirnov distributed
Laplace             1/(2E), where E is exponentially distributed
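As a quick sanity check of the Laplace row (a minimal numpy sketch, my own illustration): drawing the precision as λ = 1/(2E) with E exponential and then x | λ ∼ N(0, 1/λ) should reproduce a standard Laplace distribution, which has variance 2 and mean absolute value 1.

    import numpy as np

    rng = np.random.default_rng(1)
    E = rng.exponential(1.0, size=1_000_000)   # E ~ Exp(1)
    lam = 1.0 / (2.0 * E)                      # mixing precision from the table
    x = rng.normal(0.0, 1.0 / np.sqrt(lam))    # x | lam ~ N(0, 1/lam)

    print(x.var(), np.abs(x).mean())           # close to 2.0 and 1.0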
Example: Binary regression
GMRF x and Bernoulli data

    y_i ∼ B(g⁻¹(x_i)),   g⁻¹(x) = Φ(x)   (probit link)

Equivalent representation using auxiliary variables w:

    ε_i  iid∼  N(0, 1)
    w_i = x_i + ε_i
    y_i = 1 if w_i > 0, and 0 otherwise,

for the probit link.
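Given x and y, the w_i are conditionally independent truncated normals, so the w-update of a Gibbs sampler is one vectorised draw. A minimal scipy sketch for the probit case (my own names, not code from the talk):

    import numpy as np
    from scipy.stats import truncnorm

    def sample_w(x, y, rng):
        # w_i | x_i, y_i ~ N(x_i, 1), truncated to (0, inf) if y_i = 1
        # and to (-inf, 0] if y_i = 0; bounds are standardised by loc/scale
        a = np.where(y == 1, -x, -np.inf)
        b = np.where(y == 1, np.inf, -x)
        return truncnorm.rvs(a, b, loc=x, scale=1.0, random_state=rng)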
Single-site Gibbs sampling
Auxiliary variables can be introduced for the logit link¹ to achieve this sampler:

• κ ∼ Γ(·, ·)
• for each i: x_i ∼ N(·, ·)
• for each i: w_i ∼ W(·)

It is fully automatic; no tuning!!! (The x_i and κ steps are sketched below.)

¹ Held & Holmes (2006)
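For concreteness, here is what the x_i and κ steps can look like. I sketch the probit variant (fixed unit noise variance on w); the Held & Holmes logit version additionally samples the mixing variances. All names are my own, with Q = κR and R the RW2 structure matrix:

    import numpy as np

    def single_site_sweep(x, w, kappa, R, alpha, beta, rng):
        n = len(x)
        Q = kappa * R                        # prior precision of x
        for i in range(n):
            # full conditional of x_i: precision Q_ii + 1 (the +1 from w_i = x_i + eps_i)
            prec = Q[i, i] + 1.0
            b = w[i] - (Q[i, :] @ x - Q[i, i] * x[i])   # w_i minus neighbour terms
            x[i] = rng.normal(b / prec, 1.0 / np.sqrt(prec))
        # conjugate Gamma update; rank of R is n - 1 for the circulant RW2 (assumption)
        rank = n - 1
        kappa = rng.gamma(alpha + 0.5 * rank, 1.0 / (beta + 0.5 * x @ R @ x))
        return x, kappa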
Results: hyper-parameter log(κ)
Results: latent node x10
Results: density for latent node x10
[Figure: kernel density estimates of x[10] from two independent runs (run 1: N = 747017, bandwidth 0.002606; run 2: N = 747391, bandwidth 0.002803); the two estimates differ visibly.]
Discussion
Single-site sampler with auxiliary variables:

• Even long runs show large variation
• "Long"-range dependence
• Very slow mixing

But:

• Easy to be "fooled" running shorter chains
• The variability can be underestimated.
What is causing the problem?
Two issues
1. Slow mixing within the latent field x
2. Slow mixing between the latent field x and θ.
Blocking is the “usual” approach to resolve such issues, if possible.
Note: blocking mainly helps within the block only.
MCMC – blocking
Strategies for blocking
Slow mixing due to the latent field x only:
• Block x
Slow mixing due to the interaction between the latent field x and θ:
• Block (x,θ).
In most cases: if you can do one, you can do both.
MCMC – blocking scheme I
• κ ∼ Γ(·, ·)
• x ∼ N(·, ·) (block update for the whole field; sketched below)
• w ∼ W(·) (conditionally independent)
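In the probit variant, the full conditional of the whole field is Gaussian with precision Q = κR + I and mean Q⁻¹w, so the block update is one factorise-and-solve. A minimal dense sketch (my own illustration; in practice one would exploit sparsity):

    import numpy as np
    from scipy.linalg import cholesky, cho_solve, solve_triangular

    def sample_x_block(w, kappa, R, rng):
        # x | kappa, w ~ N(Q^{-1} w, Q^{-1}) with Q = kappa*R + I
        n = len(w)
        L = cholesky(kappa * R + np.eye(n), lower=True)    # Q = L L^T
        mu = cho_solve((L, True), w)                       # mean Q^{-1} w
        z = solve_triangular(L.T, rng.standard_normal(n))  # z ~ N(0, Q^{-1})
        return mu + z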
Results
[Figure: density of log(kappa) (N = 101804, bandwidth 0.04667) and density of x[10] (N = 101804, bandwidth 0.003815) under scheme I.]
MCMC – blocking scheme II
• Sample
  • κ′ ∼ q(κ′; κ)
  • x′ | κ′, y ∼ N(·, ·)
  and then accept/reject (x′, κ′) jointly
• w ∼ W(·) (conditionally independent)
Remarks

• If the normalising constant for x | · is available, then this is an EASY FIX of scheme I (see the sketch below).
• Usually makes a huge improvement
• Automatic "reparameterisation"
• Doubles the computational cost
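The reason the normalising constant matters: since x′ is proposed exactly from π(x | κ′, w), the x-terms cancel in the acceptance ratio, leaving a ratio of marginals in κ alone, and that marginal needs |Q|^(1/2). A hedged sketch for the probit variant (my own names; Q = κR + I):

    import numpy as np
    from scipy.linalg import cholesky, cho_solve

    def log_marginal_kappa(kappa, w, R, alpha, beta):
        # log pi(kappa | w) up to a constant, x integrated out analytically
        n = len(w)
        L = cholesky(kappa * R + np.eye(n), lower=True)   # Q = L L^T
        mu = cho_solve((L, True), w)                      # mean of x | kappa, w
        logdet_Q = 2.0 * np.log(np.diag(L)).sum()
        rank = n - 1                                      # circulant RW2 (assumption)
        return ((alpha - 1.0) * np.log(kappa) - beta * kappa   # Gamma(alpha, beta) prior
                + 0.5 * rank * np.log(kappa)                   # from |kappa R|^(1/2), rank-deficient
                - 0.5 * logdet_Q
                + 0.5 * mu @ w)                                # = w^T Q^{-1} w / 2

With a symmetric proposal q, the joint move (κ′, x′) is then accepted with probability min{1, exp(log_marginal_kappa(κ′) − log_marginal_kappa(κ))}.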
Results
[Figure: ACF of log(kappa) up to lag 300 for scheme I (left) and scheme II (right).]
Schemes I & II are not that easy to implement in practice...

• The computational speed depends critically on how the Gaussian part is dealt with
• Statisticians do like high-dimensional Gaussians, as long as they can be dealt with using "Kalman-filter" algorithms
• "Kalman-filter" algorithms are not applicable here, as the prior is circulant
• General sparse-matrix numerical algebra is required
• Possible in R, but not as easy (yet?) as it could be.

Conclusion: Not that many would have implemented either blocking scheme.
MCMC – blocking without auxiliary variables

Removing the auxiliary variables
• The auxiliary variables make the full conditional for x Gaussian
• If we do not use them, the full conditional for x looks like

    π(x | ...) ∝ exp( −(1/2) xᵀQx + Σ_i log π(y_i | x_i) )
              ≈ exp( −(1/2) (x − µ)ᵀ (Q + diag(c)) (x − µ) )  =  π_G(x | ...)

• The Gaussian approximation is constructed by matching the
  • mode, and the
  • curvature at the mode (sketched below).
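The matching can be done with a few Newton steps: expand Σ_i log π(y_i | x_i) to second order around the current µ and solve. A minimal sketch for the Binomial/logit likelihood of this model (my own illustration, not code from the talk):

    import numpy as np

    def gaussian_approx(Q, y, m=2, n_iter=20):
        # match mode and curvature of pi(x|...) for y_i ~ Binomial(m, logit^{-1}(x_i));
        # returns (mu, c) such that pi_G = N(mu, (Q + diag(c))^{-1})
        mu = np.zeros(Q.shape[0])
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-mu))
            grad = y - m * p               # gradient of the log-likelihood at mu
            c = m * p * (1.0 - p)          # minus its second derivative
            # Newton step for the mode: (Q + diag(c)) mu_new = grad + c * mu
            mu = np.linalg.solve(Q + np.diag(c), grad + c * mu)
        return mu, c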
Improved one-block scheme
• κ′ ∼ q(·; κ)
• x′ ∼ π_G(x | κ′, y)
• Accept/reject (x′, κ′) jointly

Note: π_G(·) is indexed by κ′, hence we need to compute a new approximation for each value of κ′.
Results
[Figure: trace of log(kappa) over 8000 iterations and ACF of log(kappa) up to lag 300 for the one-block scheme without auxiliary variables.]
MCMC – independence sampler

Independence sampler
We can construct an independence sampler using π_G(·). The Laplace approximation for κ | y:

    π(κ | y) ∝ π(κ) π(x | κ) π(y | x) / π(x | κ, y)
             ≈ π(κ) π(x | κ) π(y | x) / π_G(x | κ, y) |_{x = mode(κ)}

Hence, we first

• evaluate the Laplace approximation at some "selected" points,
• build an interpolating log-spline,
• use this parametric model as π(κ | y) (sketched below).
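A sketch of that precomputation, reusing gaussian_approx from above (all names, the grid, and the prior parameters are my own placeholders; the Binomial(1, ·) day is ignored for brevity):

    import numpy as np
    from scipy.interpolate import CubicSpline

    def log_post_kappa(kappa, y, R, alpha, beta, m=2):
        # Laplace approximation to log pi(kappa | y), up to a constant:
        # log of pi(kappa) pi(x|kappa) pi(y|x) / pi_G(x|kappa, y), at x = mode(kappa)
        n = len(y)
        mu, c = gaussian_approx(kappa * R, y, m=m)
        p = 1.0 / (1.0 + np.exp(-mu))
        log_lik = np.sum(y * np.log(p) + (m - y) * np.log1p(-p))
        log_prior_x = 0.5 * (n - 1) * np.log(kappa) - 0.5 * kappa * (mu @ R @ mu)
        _, logdet = np.linalg.slogdet(kappa * R + np.diag(c))  # pi_G at its own mode
        return ((alpha - 1.0) * np.log(kappa) - beta * kappa
                + log_prior_x + log_lik - 0.5 * logdet)

    # evaluate at "selected" points on the log scale, then interpolate
    # (y and R as defined earlier; grid and prior parameters are placeholders)
    lk = np.linspace(np.log(10.0), np.log(1e4), 25)
    lp = np.array([log_post_kappa(np.exp(t), y, R, alpha=1.0, beta=1e-2) for t in lk])
    spline = CubicSpline(lk, lp - lp.max())       # unnormalised log pi(kappa | y)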
Independence sampler
• κ′ ∼ π(κ | y)
• x′ ∼ π_G(x | κ′, y)
• Accept/reject (κ′, x′) jointly

Note: Corr(x(t + k), x(t)) ≈ (1 − α)^|k|, where α is the acceptance rate. In this example, α = 0.83...
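For example, with α = 0.83 the lag-5 autocorrelation is already (1 − 0.83)⁵ ≈ 1.4 × 10⁻⁴, which is why the ACF below is essentially flat.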
Results
[Figure: trace of log(kappa) over 8000 iterations and ACF of log(kappa) up to lag 300 for the independence sampler.]
MCMC – Deterministic inference

Can we improve this sampler?
• Yes, if we are interested in the posterior marginals for κ and {x_i}.
• The marginals of the Gaussian proposal π_G(x | ...) are known analytically.
• Just use numerical integration!
Deterministic inference
Posterior marginal for κ:

• Compute π(κ | y)

Posterior marginal for x_i:

• Use numerical integration (sketched below)

    π(x_i | y) = ∫ π(x_i | y, κ) π(κ | y) dκ
               ≈ Σ_k N(x_i; µ(κ_k), σ²(κ_k)) × π(κ_k | y) × ∆_k
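A sketch of this final step (my own names: mus[k] and sig2s[k] hold the marginal means and variances of π_G at grid point κ_k, e.g. the diagonal of (κ_k R + diag(c))⁻¹, and log_w holds the interpolated log π(κ_k | y)):

    import numpy as np
    from scipy.stats import norm

    def marginal_xi(i, xs, kappas, log_w, mus, sig2s):
        # pi(x_i | y) on the grid xs, as a finite mixture of Gaussians
        w = np.exp(log_w - log_w.max()) * np.gradient(kappas)  # weights * Delta_k
        w /= w.sum()
        dens = np.zeros_like(xs)
        for k in range(len(kappas)):
            dens += w[k] * norm.pdf(xs, mus[k][i], np.sqrt(sig2s[k][i]))
        return dens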
Results: Mixture of Gaussians
[Figure: histogram of x with the mixture-of-Gaussians density overlaid.]
Results: Improved....
[Figure: histogram of x with the improved density approximation overlaid.]
Deterministic inference
• Obtain "exact results"; the error cannot be detected using MCMC
• Fast: about 1 second on my laptop
• This approach is much more generally applicable
• A lot of delicate coding
• In this example, it makes the model useful in practice!
Discussion
What can be learned from this exercise?
For a relatively simple model, we have implemented:

• single-site with auxiliary variables (looong time; hours)
• various forms of blocking (long time; many minutes)
• independence sampler (long time; many minutes)
• approximate inference (nearly instant; one second)
What can be learned from this exercise? ...
My guess: nearly all statisticians would implement the single-site scheme, possibly with auxiliary variables.

Which implies:

• Most probably, the results would not be correct.
• They "accept" the long running time.
• Trouble: such MCMC schemes are not useful for routine analysis of similar data.
What can be learned from this exercise? ...
• In many cases, the situation is much worse in practice; this was a very simple model.
• Single-site MCMC is still the default choice for the non-expert user.
• Hierarchical models are popular, but they are difficult for MCMC.

Perhaps the development of models is not in sync with the development of inference? We cannot just wait for more powerful computers...
MCMC tries to solve the full problem...
If the target is the posterior marginals, then perhaps MCMC is not needed?

• Consider a latent AR(1) model of length n, observed with non-Gaussian observations
• If we are interested in the posterior marginal for x_1, then only a subset of the data (and of the x's) is needed! No effect of n.
• For the hyper-parameters, large n makes life easier!