Parallel Adaptive Wang–Landau Algorithm
Pierre E. Jacob
CEREMADE, Université Paris-Dauphine, funded by AXA Research

GPU in Computational Statistics, January 25th, 2012

Joint work with Luke Bornn (UBC), Arnaud Doucet (Oxford), Pierre Del Moral (INRIA & Université de Bordeaux), Robin J. Ryder (Dauphine)
Outline
1. Wang–Landau algorithm
2. Improvements: Automatic Binning, Adaptive Proposals, Parallel Interacting Chains
3. Example: variable selection
4. Conclusion
Wang–Landau
Context
An unnormalized target density π on a state space X.

A kind of adaptive MCMC algorithm:

- it iteratively generates a sequence X_t;
- the stationary distribution is not π itself;
- at each iteration a different stationary distribution is targeted.
Wang–Landau
Partition the space
The state space X is cut into d bins:
$$\mathcal{X} = \bigcup_{i=1}^{d} \mathcal{X}_i \quad \text{and} \quad \forall i \neq j,\; \mathcal{X}_i \cap \mathcal{X}_j = \emptyset$$
Goal
- The generated sequence spends a desired proportion φ_i of time in each bin X_i;
- within each bin X_i, the sequence is asymptotically distributed according to the restriction of π to X_i.
Wang–Landau
Stationary distribution
Define the mass of π over X_i by:

$$\psi_i = \int_{\mathcal{X}_i} \pi(x)\,dx$$

The stationary distribution of the WL algorithm is:

$$\tilde\pi(x) \propto \pi(x) \times \frac{\phi_{J(x)}}{\psi_{J(x)}}$$

where J(x) is the index such that x ∈ X_{J(x)}.
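For a one-dimensional target the masses ψ_i can be approximated numerically. A minimal R sketch, where the bimodal toy target and the bin edges are illustrative choices, not taken from the slides:

## Bin masses psi_i of a toy target, computed by numerical integration (sketch).
pi_fun <- function(x) 0.5 * dnorm(x, -3) + 0.5 * dnorm(x, 5)  # toy bimodal target
edges  <- c(-Inf, -1, 3, Inf)                                  # three bins
psi    <- vapply(1:3, function(i)
  integrate(pi_fun, edges[i], edges[i + 1])$value, numeric(1))
psi   # mass of pi over each bin X_i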
Wang–Landau
Example with a bimodal, univariate target density: π and two π̃ corresponding to different partitions. Here φ_i = 1/d.
[Figure: log density against x, three panels: "Original Density, with partition lines", "Biased by X", "Biased by Log Density".]
Wang–Landau
Plugging estimates
In practice we cannot compute ψ_i analytically. Instead we plug in estimates θ_t(i) of ψ_i/φ_i at iteration t, and define the distribution π_{θ_t} by:

$$\pi_{\theta_t}(x) \propto \pi(x) \times \frac{1}{\theta_t(J(x))}$$
Metropolis–Hastings
The algorithm performs a Metropolis–Hastings step targeting π_{θ_t} at iteration t, generating a new point X_t, then updating θ_t, and so on.
Wang–Landau
Estimate of the bias
The update of the estimated bias θ_t(i) is done according to:

$$\theta_t(i) \leftarrow \theta_{t-1}(i)\left[1 + \gamma_t\left(\mathbb{1}_{\mathcal{X}_i}(X_t) - \phi_i\right)\right]$$

with d the number of bins and γ_t a decreasing sequence, or "step size", e.g. γ_t = 1/t.

If X_t ∈ X_i then θ_t(i) increases; otherwise θ_t(i) decreases.
Wang–Landau
The algorithm itself
1: For all i ∈ {1, …, d}, set θ_0(i) ← 1.
2: Choose a decreasing sequence {γ_t}, typically γ_t = 1/t.
3: Sample X_0 from an initial distribution π_0.
4: for t = 1 to T do
5:   Sample X_t from P_{t−1}(X_{t−1}, ·), an MH kernel with invariant distribution π_{θ_{t−1}}.
6:   Update the bias: θ_t(i) ← θ_{t−1}(i)[1 + γ_t(𝟙_{X_i}(X_t) − φ_i)].
7: end for
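To make the loop concrete, here is a minimal R sketch for a univariate bimodal target with a random-walk MH kernel and the deterministic schedule γ_t = 1/t. The target, bin edges and proposal scale are illustrative choices, not the PAWL package's implementation:

## Minimal Wang-Landau sketch (illustrative; not the PAWL package's code).
log_pi <- function(x) log(0.5 * dnorm(x, -3) + 0.5 * dnorm(x, 5))  # toy target
edges  <- c(-Inf, -1, 3, Inf)                   # d = 3 bins
d      <- length(edges) - 1
phi    <- rep(1 / d, d)                         # uniform desired proportions
bin_of <- function(x) findInterval(x, edges)    # J(x): bin containing x

T_iter    <- 10000
log_theta <- rep(0, d)        # theta_0(i) = 1
x         <- 0                # X_0
for (t in 1:T_iter) {
  gamma_t <- 1 / t            # deterministic schedule
  y <- x + rnorm(1, sd = 2)   # random-walk proposal
  ## MH acceptance for the biased target pi_theta(x) = pi(x) / theta(J(x))
  log_alpha <- (log_pi(y) - log_theta[bin_of(y)]) -
               (log_pi(x) - log_theta[bin_of(x)])
  if (log(runif(1)) < log_alpha) x <- y
  ## theta_t(i) <- theta_{t-1}(i) [1 + gamma_t (1(X_t in X_i) - phi_i)]
  indicator <- as.numeric(seq_len(d) == bin_of(x))
  log_theta <- log_theta + log1p(gamma_t * (indicator - phi))
}
exp(log_theta)   # estimates of psi_i / phi_i (up to a common constant)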
Wang–Landau
Result
In the end we get:
- a sequence X_t asymptotically following π̃,
- as well as estimates θ_t(i) of ψ_i/φ_i.
Wang–Landau
Usual improvement: Flat Histogram
Wait for the FH criterion to be met before decreasing γ_t:

$$(\text{FH}) \qquad \max_{i=1,\dots,d} \left| \frac{\nu_t(i)}{t} - \phi_i \right| < c$$

where ν_t(i) = Σ_{k=1}^{t} 𝟙_{X_i}(X_k) and c > 0.
WL with stochastic schedule
Let κ_t be the number of times FH has been reached by iteration t. Use γ_{κ_t} at iteration t instead of γ_t. Each time FH is reached, reset ν_t(i) to 0.
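A sketch of the FH check and the stochastic schedule in R; the threshold c = 0.1 and the counter handling are illustrative choices:

## Flat Histogram check (sketch). 'nu' counts bin visits since the last FH event,
## so sum(nu) is the number of iterations since the last reset.
flat_histogram <- function(nu, phi, c = 0.1) {
  max(abs(nu / sum(nu) - phi)) < c
}

## Inside the WL loop (kappa starts at 1 so that gamma = 1/kappa is defined):
##   nu[bin_of(x)] <- nu[bin_of(x)] + 1
##   if (flat_histogram(nu, phi)) {
##     kappa <- kappa + 1    # only now does the step size decrease
##     nu    <- rep(0, d)    # reset the visit counts
##   }
##   gamma_t <- 1 / kappa    # gamma_{kappa_t} instead of gamma_t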
Wang–Landau
Theoretical Understanding of WL with deterministic schedule
The schedule γ_t decreases at each iteration, hence θ_t converges, hence P_t(·, ·) converges … ≈ "diminishing adaptation".
Theoretical Understanding of WL with stochastic schedule
Flat Histogram is reached in finite time for any γ, φ, c if one uses the following update:

$$\log \theta_t(i) \leftarrow \log \theta_{t-1}(i) + \gamma\left(\mathbb{1}_{\mathcal{X}_i}(X_t) - \phi_i\right)$$

instead of

$$\theta_t(i) \leftarrow \theta_{t-1}(i)\left[1 + \gamma\left(\mathbb{1}_{\mathcal{X}_i}(X_t) - \phi_i\right)\right]$$
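The two updates agree to first order in γ, which a short R comparison makes visible; the values below are toy numbers, not from the slides:

## Comparing the two updates for one iteration (toy values).
phi       <- rep(1/3, 3)
indicator <- c(0, 1, 0)    # suppose X_t fell in bin 2
gamma     <- 0.05
theta     <- c(1, 1, 1)
theta * (1 + gamma * (indicator - phi))       # multiplicative update
exp(log(theta) + gamma * (indicator - phi))   # log-scale update
## Since log(1 + u) = u + O(u^2), the two coincide for small gamma,
## but only the log-scale update keeps theta positive for any gamma.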
Automatic Binning
Maintain some kind of uniformity within bins: if the sampled values within a bin look non-uniform, split the bin.
[Figure: histograms of the log density within a bin, (a) before the split and (b) after the split.]
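One plausible split rule, sketched in R; the actual criterion used by PAWL may differ, and finite cut points on the log-density scale are assumed:

## Split a bin when its occupants concentrate in one half of its range (sketch).
## edges: finite cut points; i: bin index; values: log-densities observed in bin i.
maybe_split <- function(edges, i, values, tol = 0.75) {
  mid  <- (edges[i] + edges[i + 1]) / 2
  left <- mean(values < mid)                 # fraction in the lower half
  if (max(left, 1 - left) > tol) {
    edges <- append(edges, mid, after = i)   # insert a new cut point
  }
  edges
}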
Adaptive proposals
Target a specific acceptance rate:

$$\sigma_{t+1} = \sigma_t + \rho_t\left(2 \cdot \mathbb{1}(A > 0.234) - 1\right)$$

Or use the empirical covariance of the already-generated chain:

$$\Sigma_t = \delta \times \mathrm{Cov}(X_1, \dots, X_t)$$
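Both adaptations are one-liners in R; the default δ = 2.38²/p below is a common choice from the adaptive-MCMC literature, an assumption rather than a value from the slides:

## Nudge the proposal scale toward the 0.234 acceptance-rate target (sketch).
adapt_sigma <- function(sigma, A, rho) {
  sigma + rho * (2 * (A > 0.234) - 1)   # grow when accepting too often, else shrink
}

## Scaled empirical covariance of the chain so far; 'chain' is a t x p matrix.
adapt_cov <- function(chain, delta = 2.38^2 / ncol(chain)) {
  delta * cov(chain)
}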
Parallel Interacting Chains
N chains (X_t^{(1)}, …, X_t^{(N)}) instead of one,

- targeting the same biased distribution π_{θ_t} at iteration t,
- sharing the same estimated bias θ_t at iteration t.

The update of the estimated bias becomes:

$$\log \theta_t(i) \leftarrow \log \theta_{t-1}(i) + \gamma_{\kappa_t}\left(\frac{1}{N} \sum_{j=1}^{N} \mathbb{1}_{\mathcal{X}_i}\left(X_t^{(j)}\right) - \phi_i\right)$$
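A sketch of this shared update in R, where xs holds the N current states and bin_of maps a state to its bin index (both names are assumptions carried over from the earlier sketches):

## Shared bias update across N interacting chains (sketch).
update_log_theta <- function(log_theta, xs, bin_of, phi, gamma) {
  d <- length(log_theta)
  occupancy <- tabulate(sapply(xs, bin_of), nbins = d) / length(xs)
  log_theta + gamma * (occupancy - phi)   # one update, shared by all N chains
}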
Parallel Interacting Chains
How “parallel” is PAWL?
The algorithm's additional cost, compared to N independent parallel MCMC chains, lies in:

- computing the proportions $\frac{1}{N}\sum_{j=1}^{N} \mathbb{1}_{\mathcal{X}_i}(X_t^{(j)})$,
- updating (θ_t(1), …, θ_t(d)).
Parallel Interacting Chains
Example: Normal distribution
[Figure: histogram of the binned coordinate.]
Parallel Interacting Chains
Reaching Flat Histogram
[Figure: number of FH events (#FH) against iterations, for N = 1, N = 10 and N = 100.]
Parallel Interacting Chains
Stabilization of the log penalties
Figure: log θ_t against t, for N = 1.
Parallel Interacting Chains
Stabilization of the log penalties
Figure: log θ_t against t, for N = 10.
Parallel Interacting Chains
Stabilization of the log penalties
Figure: log θ_t against t, for N = 100.
Parallel Interacting Chains
Multiple effects of parallel chains
$$\log \theta_t(i) \leftarrow \log \theta_{t-1}(i) + \gamma_{\kappa_t}\left(\frac{1}{N} \sum_{j=1}^{N} \mathbb{1}_{\mathcal{X}_i}\left(X_t^{(j)}\right) - \phi_i\right)$$

- FH is reached more often when N increases, hence γ_{κ_t} decreases more quickly;
- log θ_t tends to vary much less when N increases, even for a fixed value of γ.
Variable selection
Settings
Pollution data as in McDonald & Schwing (1973). For 60 metropolitan areas:

- 15 possible explanatory variables (including precipitation, population per household, …), denoted by X,
- the response variable Y is the age-adjusted mortality rate.

This leads to 2^15 = 32,768 possible models to explain the data.
Variable selection
Introduce
- γ ∈ {0,1}^p the "variable selector",
- q_γ the number of variables in model γ,
- g some large value (g-prior; see Zellner 1986, Marin & Robert 2007).

Posterior distribution

$$\pi(\gamma \mid y, X) \propto (g+1)^{-(q_\gamma+1)/2}\left[y^T y - \frac{g}{g+1}\, y^T X_\gamma\left(X_\gamma^T X_\gamma\right)^{-1} X_\gamma^T y\right]^{-n/2}$$
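This log posterior is cheap to evaluate for any model. A minimal R sketch, assuming γ is a logical vector of length p, X an n × p design matrix and y the response (the function name is ours):

## Log posterior of model gamma under the g-prior (sketch).
log_post <- function(gamma, X, y, g) {
  n   <- length(y)
  q   <- sum(gamma)
  Xg  <- X[, gamma, drop = FALSE]
  yty <- sum(y^2)
  ## Projection term y' Xg (Xg'Xg)^{-1} Xg' y, zero for the null model:
  proj <- if (q > 0) {
    Xty <- crossprod(Xg, y)   # Xg' y
    drop(crossprod(Xty, solve(crossprod(Xg), Xty)))
  } else 0
  -(q + 1) / 2 * log(g + 1) - n / 2 * log(yty - g / (g + 1) * proj)
}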
Variable selection
Most naive MH algorithm

The proposal is flipping a variable on/off at random, at each iteration.

Binning

Along values of log π(x), found with a preliminary exploration, in 20 bins.
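A sketch of how such a binning could be built in R from an exploratory run; the preliminary values below are placeholders, not the pollution-data posterior:

## 20 bins along log pi(x), from a preliminary exploration (sketch).
prelim_logpi <- rnorm(1000, mean = -50, sd = 10)  # placeholder exploratory values
rng   <- range(prelim_logpi)
edges <- seq(rng[1], rng[2], length.out = 21)     # 21 cut points = 20 bins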
Variable selection
Figure: log θ against iterations, for N = 1, N = 10 and N = 100. Each run took 2 minutes (± 5 seconds). Dotted lines show the real ψ.
Variable selection
Figure: model saturation q_γ/p (mean and 95% interval) along iterations, for N = 100; Wang–Landau versus Metropolis–Hastings at temperatures 1, 10 and 100.
Conclusion
Automatic binning, but…

We still have to define a range of plausible (or "interesting") values.

Parallel Chains

It seems reasonable to use more than N = 1 chain, with or without GPUs. No theoretical validation of this yet. What is the optimal N for a given computational effort?

Need for a stochastic schedule?

It seems that using a large N makes the use, and hence the choice, of γ_t irrelevant.
Would you like to know more?
Article: An Adaptive Interacting Wang–Landau Algorithm for Automatic Density Exploration, with L. Bornn, P. Del Moral and A. Doucet.

Article: The Wang–Landau algorithm reaches the Flat Histogram criterion in finite time, with R. Ryder.
Software: PAWL, available on CRAN:
install.packages("PAWL")
References:

F. Wang and D. Landau, Physical Review E, 64(5):056101.

Y. Atchadé and J. Liu, Statistica Sinica, 20:209–233.