Seminar at Stanford University, August 9, 2010
Vanilla Rao–Blackwellisation of Metropolis–Hastings algorithms

Christian P. Robert
Université Paris-Dauphine and CREST, Paris, France
Joint work with Randal Douc, Pierre Jacob and Murray Smith

August 9, 2010
Main themes

1. Rao–Blackwellisation of MCMC output.
2. Can be performed in any Metropolis–Hastings algorithm.
3. Asymptotically more efficient than the usual MCMC estimate, at a controlled additional computing cost.
4. Can take advantage of parallel capacities at a very basic level.
Outline

1. Metropolis–Hastings revisited
2. Rao–Blackwellisation
   - Formal importance sampling
   - Variance reduction
   - Asymptotic results
   - Illustrations
Metropolis–Hastings algorithm

1. We wish to approximate
   \[ I = \frac{\int h(x)\,\pi(x)\,dx}{\int \pi(x)\,dx} = \int h(x)\,\bar\pi(x)\,dx . \]
2. \(x \mapsto \pi(x)\) is known, but not \(\int \pi(x)\,dx\).
3. Approximate \(I\) with \(\delta = \frac{1}{N}\sum_{t=1}^{N} h(x^{(t)})\), where \((x^{(t)})\) is a Markov chain with limiting distribution \(\bar\pi\).
4. Convergence is obtained from the Law of Large Numbers or the CLT for Markov chains.
Metropolis–Hastings algorithm

Suppose that \(x^{(t)}\) is drawn.

1. Simulate \(y_t \sim q(\cdot\mid x^{(t)})\).
2. Set \(x^{(t+1)} = y_t\) with probability
   \[ \alpha(x^{(t)}, y_t) = \min\left\{ 1,\ \frac{\pi(y_t)}{\pi(x^{(t)})}\, \frac{q(x^{(t)}\mid y_t)}{q(y_t\mid x^{(t)})} \right\} ; \]
   otherwise, set \(x^{(t+1)} = x^{(t)}\).
3. \(\alpha\) is such that the detailed balance equation is satisfied:
   \[ \pi(x)\,q(y\mid x)\,\alpha(x, y) = \pi(y)\,q(x\mid y)\,\alpha(y, x) , \]
   so \(\pi\) is the stationary distribution of \((x^{(t)})\).

▸ The accepted candidates are simulated with a rejection algorithm.
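The two steps above can be sketched in a few lines. This is a minimal illustration, not code from the talk: the target (a standard normal, via its log-density) and the Gaussian random-walk proposal are illustrative choices; the proposal is symmetric, so the \(q\)-ratio in \(\alpha\) cancels.

```python
import math
import random

def metropolis_hastings(log_pi, n_iter, x0, scale=1.0, rng=None):
    """Random-walk Metropolis-Hastings sketch. The proposal is a
    symmetric Gaussian step, so alpha = min(1, pi(y)/pi(x))."""
    rng = rng or random.Random()
    chain = [x0]
    x = x0
    for _ in range(n_iter - 1):
        y = x + rng.gauss(0.0, scale)           # simulate y_t ~ q(.|x^(t))
        log_alpha = min(0.0, log_pi(y) - log_pi(x))
        if math.log(rng.random()) < log_alpha:  # accept with probability alpha
            x = y                               # x^(t+1) = y_t
        chain.append(x)                         # otherwise x^(t+1) = x^(t)
    return chain

# Illustrative target: pi = N(0, 1), so E[X] = 0.
chain = metropolis_hastings(lambda x: -0.5 * x * x, 20000, 0.0,
                            scale=2.0, rng=random.Random(42))
print(sum(chain) / len(chain))  # ergodic average of h(x) = x, near 0
```

The repeated states in `chain` (rejections) are exactly the multiplicities \(n_i\) exploited in the next slides.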
Some properties of the HM algorithm

1. An alternative representation of the estimator \(\delta\) is
   \[ \delta = \frac{1}{N}\sum_{t=1}^{N} h(x^{(t)}) = \frac{1}{N}\sum_{i=1}^{M_N} n_i\, h(z_i) , \]
   where
   - the \(z_i\)'s are the accepted \(y_j\)'s,
   - \(M_N\) is the number of accepted \(y_j\)'s up to time \(N\),
   - \(n_i\) is the number of times \(z_i\) appears in the sequence \((x^{(t)})_t\).
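This identity can be checked mechanically on any realised chain by compressing runs of repeated states into \((z_i, n_i)\) pairs; the short chain below is a made-up toy sequence, where repeats stand for rejections.

```python
from itertools import groupby

def compress(chain):
    """Collapse an MH chain into accepted values z_i and occupation
    counts n_i (how many times z_i repeats before the next acceptance)."""
    return [(z, len(list(g))) for z, g in groupby(chain)]

def rb_average(chain, h):
    """delta = (1/N) sum_t h(x^(t)) = (1/N) sum_i n_i h(z_i)."""
    pairs = compress(chain)
    N = sum(n for _, n in pairs)
    return sum(n * h(z) for z, n in pairs) / N

# Hypothetical chain with repeats standing for rejections:
chain = [0.0, 0.0, 1.5, 1.5, 1.5, -0.7, 2.0, 2.0]
h = lambda x: x
# Both representations give the same value:
assert abs(rb_average(chain, h) - sum(map(h, chain)) / len(chain)) < 1e-12
```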
The chain of accepted values moves according to
\[ \tilde q(\cdot\mid z_i) = \frac{\alpha(z_i,\cdot)\, q(\cdot\mid z_i)}{p(z_i)} \le \frac{q(\cdot\mid z_i)}{p(z_i)} , \]
where \(p(z_i) = \int \alpha(z_i, y)\, q(y\mid z_i)\, dy\). To simulate according to \(\tilde q(\cdot\mid z_i)\):

1. Propose a candidate \(y \sim q(\cdot\mid z_i)\).
2. Accept it with probability
   \[ \tilde q(y\mid z_i) \Big/ \frac{q(y\mid z_i)}{p(z_i)} = \alpha(z_i, y) ; \]
   otherwise, reject it and start again.

▸ This is exactly the transition of the HM algorithm.

The transition kernel \(\tilde q\) admits \(\tilde\pi\) as a stationary distribution:
\[ \tilde\pi(x)\,\tilde q(y\mid x)
= \underbrace{\frac{\pi(x)\,p(x)}{\int \pi(u)\,p(u)\,du}}_{\tilde\pi(x)}\;
  \underbrace{\frac{\alpha(x, y)\, q(y\mid x)}{p(x)}}_{\tilde q(y\mid x)}
= \frac{\pi(x)\,\alpha(x, y)\, q(y\mid x)}{\int \pi(u)\,p(u)\,du}
= \frac{\pi(y)\,\alpha(y, x)\, q(x\mid y)}{\int \pi(u)\,p(u)\,du}
= \tilde\pi(y)\,\tilde q(x\mid y) , \]
by the detailed balance equation.
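The rejection mechanism above can be sketched directly. The Gaussian random-walk proposal and normal target are illustrative choices (not from the slides); the returned trial count is the holding time of \(z\), geometric with success probability \(p(z)\).

```python
import math
import random

def next_accepted(z, log_pi, scale, rng):
    """Simulate from q~(.|z) by rejection: propose y ~ q(.|z), accept
    with probability alpha(z, y), otherwise start again. The number of
    trials until acceptance is geometric with parameter p(z)."""
    trials = 0
    while True:
        trials += 1
        y = z + rng.gauss(0.0, scale)             # y ~ q(.|z), symmetric
        log_alpha = min(0.0, log_pi(y) - log_pi(z))
        if math.log(rng.random()) < log_alpha:    # accept w.p. alpha(z, y)
            return y, trials

rng = random.Random(1)
z_next, n = next_accepted(0.0, lambda x: -0.5 * x * x, 2.0, rng)
print(z_next, n)  # the next accepted value and its geometric trial count
```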
Lemma

The sequence \((z_i, n_i)\) satisfies:

1. \((z_i, n_i)_i\) is a Markov chain;
2. \(z_{i+1}\) and \(n_i\) are independent given \(z_i\);
3. \(n_i\) is distributed as a geometric random variable with probability parameter
   \[ p(z_i) := \int \alpha(z_i, y)\, q(y\mid z_i)\, dy \,; \tag{1} \]
4. \((z_i)_i\) is a Markov chain with transition kernel \(\tilde Q(z, dy) = \tilde q(y\mid z)\,dy\) and stationary distribution \(\tilde\pi\) such that
   \[ \tilde q(\cdot\mid z) \propto \alpha(z, \cdot)\, q(\cdot\mid z) \quad\text{and}\quad \tilde\pi(\cdot) \propto \pi(\cdot)\, p(\cdot) . \]
Old bottle, new wine [or vice-versa]

[Diagram: the accepted values form the chain \(z_{i-1} \to z_i \to z_{i+1}\); each count \(n_i\) hangs off \(z_i\) and, given \(z_i\), is independent of the rest of the graph.]

\[ \delta = \frac{1}{N}\sum_{t=1}^{N} h(x^{(t)}) = \frac{1}{N}\sum_{i=1}^{M_N} n_i\, h(z_i) . \]
Rao–Blackwellisation
Importance sampling perspective

1. A natural idea:
   \[ \delta^* = \frac{1}{N}\sum_{i=1}^{M_N} \frac{h(z_i)}{p(z_i)} , \]
   or, in self-normalised form,
   \[ \delta^* \simeq \frac{\sum_{i=1}^{M_N} h(z_i)/p(z_i)}{\sum_{i=1}^{M_N} 1/p(z_i)}
      = \frac{\sum_{i=1}^{M_N} \frac{\pi(z_i)}{\tilde\pi(z_i)}\, h(z_i)}{\sum_{i=1}^{M_N} \frac{\pi(z_i)}{\tilde\pi(z_i)}} . \]
2. But \(p\) is not available in closed form.
3. The geometric \(n_i\) is the obvious replacement, used in the original Metropolis–Hastings estimate, since \(\mathbb{E}[n_i \mid z_i] = 1/p(z_i)\).
The crude estimate of \(1/p(z_i)\),
\[ n_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell\le j} \mathbb{I}\{u_\ell \ge \alpha(z_i, y_\ell)\} , \]
can be improved:

Lemma

If \((y_j)_j\) is an iid sequence with distribution \(q(y\mid z_i)\), the quantity
\[ \xi_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell\le j} \{1 - \alpha(z_i, y_\ell)\} \]
is an unbiased estimator of \(1/p(z_i)\) whose variance, conditional on \(z_i\), is lower than the conditional variance of \(n_i\), \(\{1 - p(z_i)\}/p^2(z_i)\).
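The unbiasedness claim lends itself to a quick Monte Carlo sanity check. This is an illustrative stand-in, not the talk's setting: the acceptance probabilities \(\alpha(z_i, y_\ell)\) are abstracted into iid Uniform(0.2, 1) draws, so \(p(z_i) = \mathbb{E}[\alpha] = 0.6\) is known in closed form, and the infinite sum is truncated once the running product is numerically negligible.

```python
import random

def xi_estimate(draw_alpha, rng, tol=1e-12):
    """xi_i = 1 + sum_j prod_{l<=j} (1 - alpha(z_i, y_l)), with the tail
    truncated once the running product drops below tol (a numerical
    cutoff, not part of the original estimator)."""
    total, prod = 1.0, 1.0
    while prod > tol:
        prod *= 1.0 - draw_alpha(rng)  # one more factor 1 - alpha(z_i, y_l)
        total += prod
    return total

# Stand-in for y ~ q(.|z_i) followed by alpha(z_i, y): alpha ~ U(0.2, 1),
# so p(z_i) = 0.6 and the estimator should average to 1/0.6.
rng = random.Random(7)
draw_alpha = lambda r: r.uniform(0.2, 1.0)
m = 20000
xi_mean = sum(xi_estimate(draw_alpha, rng) for _ in range(m)) / m
print(xi_mean)  # close to 1/p(z_i) = 1/0.6
```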
Rao–Blackwellised, for sure?

\[ \xi_i = 1 + \sum_{j=1}^{\infty} \prod_{\ell\le j} \{1 - \alpha(z_i, y_\ell)\} \]

1. The sum is infinite, but it terminates after finitely many terms with (at least) positive probability: since
   \[ \alpha(x^{(t)}, y_t) = \min\left\{ 1,\ \frac{\pi(y_t)}{\pi(x^{(t)})}\, \frac{q(x^{(t)}\mid y_t)}{q(y_t\mid x^{(t)})} \right\} \]
   can equal 1, a factor \(1 - \alpha(z_i, y_\ell)\) can vanish and kill all remaining products. For example: take a symmetric random walk as a proposal.
2. What if we wish to be sure that the sum is finite?
Variance improvement

Proposition

If \((y_j)_j\) is an iid sequence with distribution \(q(y\mid z_i)\) and \((u_j)_j\) is an iid uniform sequence, then for any \(k \ge 0\) the quantity
\[ \xi_i^k = 1 + \sum_{j=1}^{\infty}\ \prod_{1\le\ell\le k\wedge j} \{1 - \alpha(z_i, y_\ell)\} \prod_{k+1\le\ell\le j} \mathbb{I}\{u_\ell \ge \alpha(z_i, y_\ell)\} \]
is an unbiased estimator of \(1/p(z_i)\) with an almost surely finite number of terms. Moreover, for \(k \ge 1\),
\[ \mathbb{V}\left[ \xi_i^k \,\middle|\, z_i \right]
   = \frac{1 - p(z_i)}{p^2(z_i)}
   - \frac{1 - (1 - 2p(z_i) + r(z_i))^k}{2p(z_i) - r(z_i)}
     \left( \frac{2 - p(z_i)}{p^2(z_i)} \right) (p(z_i) - r(z_i)) , \]
where \(p(z_i) := \int \alpha(z_i, y)\, q(y\mid z_i)\, dy\) and \(r(z_i) := \int \alpha^2(z_i, y)\, q(y\mid z_i)\, dy\). Therefore,
\[ \mathbb{V}[\xi_i \mid z_i] \le \mathbb{V}[\xi_i^k \mid z_i] \le \mathbb{V}[\xi_i^0 \mid z_i] = \mathbb{V}[n_i \mid z_i] . \]
[Diagram: the accepted values form the chain \(z_{i-1} \to z_i \to z_{i+1}\), with \(\xi^k_{i-1}\) attached to \(z_{i-1}\) and \(\xi^k_i\) attached to \(z_i\); unlike the \(n_i\)'s, the \(\xi^k_i\)'s are not independent of the rest of the graph given the \(z_i\)'s.]

\[ \xi_i^k = 1 + \sum_{j=1}^{\infty}\ \prod_{1\le\ell\le k\wedge j} \{1 - \alpha(z_i, y_\ell)\} \prod_{k+1\le\ell\le j} \mathbb{I}\{u_\ell \ge \alpha(z_i, y_\ell)\} \]

\[ \delta_M^k = \frac{\sum_{i=1}^{M} \xi_i^k\, h(z_i)}{\sum_{i=1}^{M} \xi_i^k} . \]
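The weighted estimator \(\delta_M^k\) itself is a one-liner: a self-normalised average of the \(h(z_i)\) with the \(\xi_i^k\) as weights. The weights and accepted values below are made up purely for illustration.

```python
def delta_k(xis, zs, h):
    """delta_M^k = sum_i xi_i^k h(z_i) / sum_i xi_i^k: a self-normalised
    importance-sampling average with the xi_i^k as weights."""
    num = sum(xi * h(z) for xi, z in zip(xis, zs))
    den = sum(xis)
    return num / den

# Hypothetical weights xi_i^k and accepted values z_i:
xis = [2.0, 1.0, 3.0]
zs = [0.0, 1.0, -1.0]
print(delta_k(xis, zs, lambda x: x))  # (2*0 + 1*1 + 3*(-1)) / 6 = -1/3
```

Taking \(\xi_i^0 = n_i\) recovers the usual ergodic average, which is why the construction nests the original estimator.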
Asymptotic results

Let
\[ \delta_M^k = \frac{\sum_{i=1}^{M} \xi_i^k\, h(z_i)}{\sum_{i=1}^{M} \xi_i^k} . \]
For any positive function \(\varphi\), we denote \(C_\varphi = \{h;\ |h/\varphi|_\infty < \infty\}\).
Assume that there exists a positive function \(\varphi \ge 1\) such that
\[ \forall h \in C_\varphi, \quad \frac{\sum_{i=1}^{M} h(z_i)/p(z_i)}{\sum_{i=1}^{M} 1/p(z_i)} \xrightarrow{P} \pi(h) . \]

Theorem

Under the assumption that \(\pi(p) > 0\), the following convergence property holds:

i) If \(h\) is in \(C_\varphi\), then
\[ \delta_M^k \xrightarrow[M\to\infty]{P} \pi(h) \quad (\blacktriangleright\ \text{consistency}) . \]
Assume further that there exists a positive function \(\psi\) such that
\[ \forall h \in C_\psi, \quad \sqrt{M}\left( \frac{\sum_{i=1}^{M} h(z_i)/p(z_i)}{\sum_{i=1}^{M} 1/p(z_i)} - \pi(h) \right) \xrightarrow{L} \mathcal{N}(0, \Gamma(h)) . \]

Theorem

Under the assumption that \(\pi(p) > 0\), the following convergence property holds:

ii) If, in addition, \(h^2/p \in C_\varphi\) and \(h \in C_\psi\), then
\[ \sqrt{M}\left( \delta_M^k - \pi(h) \right) \xrightarrow[M\to\infty]{L} \mathcal{N}(0, V_k[h - \pi(h)]) \quad (\blacktriangleright\ \text{CLT}) , \]
where
\[ V_k(h) := \pi(p) \int \tilde\pi(dz)\, \mathbb{V}\left[ \xi_i^k \,\middle|\, z \right] h^2(z)\, p(z) + \Gamma(h) . \]
We need some additional assumptions. Assume a maximal inequality for the Markov chain \((z_i)_i\): there exists a measurable function \(\zeta\) such that, for any starting point \(x\),
\[ \forall h \in C_\zeta, \quad P_x\left( \sup_{0\le i\le N} \left| \sum_{j=0}^{i} \left[ h(z_j) - \tilde\pi(h) \right] \right| > \epsilon \right) \le \frac{N\, C\, h(x)}{\epsilon^2} . \]
Moreover, assume that there exists \(\phi \ge 1\) such that, for any starting point \(x\),
\[ \forall h \in C_\phi, \quad \tilde Q^n(x, h) \xrightarrow{P} \tilde\pi(h) = \pi(ph)/\pi(p) . \]

Theorem

Assume that \(h\) is such that \(h/p \in C_\zeta\) and \(\{C h/p,\ h^2/p^2\} \subset C_\phi\). Assume moreover that
\[ \sqrt{M}\left( \delta_M^0 - \pi(h) \right) \xrightarrow{L} \mathcal{N}(0, V_0[h - \pi(h)]) . \]
Then, for any starting point \(x\),
\[ \sqrt{M_N}\left( \frac{\sum_{t=1}^{N} h(x^{(t)})}{N} - \pi(h) \right) \xrightarrow[N\to\infty]{L} \mathcal{N}(0, V_0[h - \pi(h)]) , \]
where \(M_N\) is defined by
\[ \sum_{i=1}^{M_N} \xi_i^0 \le N < \sum_{i=1}^{M_N+1} \xi_i^0 . \]
Variance gain (1)

 h(x)      x      x^2    I_{X>0}   p(x)
 τ = .1    0.971  0.953  0.957     0.207
 τ = 2     0.965  0.942  0.875     0.861
 τ = 5     0.913  0.982  0.785     0.826
 τ = 7     0.899  0.982  0.768     0.820

Ratios of the empirical variances of δ^∞ and δ estimating E[h(X)]: 100 MCMC iterations over 10^3 replications of a random walk Gaussian proposal with scale τ.
Illustration (1)
Figure: Overlay of the variations of 250 iid realisations of the estimates δ (gold) and δ^∞ (grey) of E[X] = 0 for 1000 iterations, along with the 90% interquantile range for the estimates δ (brown) and δ^∞ (pink), in the setting of a random walk Gaussian proposal with scale τ = 10.
Extra computational effort

           median   mean   q.8   q.9   time
 τ = .25    0.0     8.85   4.9   13    4.2
 τ = .50    0.0     6.76   4     11    2.25
 τ = 1.0    0.25    6.15   4     10    2.5
 τ = 2.0    0.20    5.90   3.5   8.5   4.5

Additional computing effort: median and mean numbers of additional iterations, 80% and 90% quantiles of the additional iterations, and ratio of the average R computing times, obtained over 10^5 simulations.
Illustration (2)

Figure: Overlay of the variations of 500 iid realisations of the estimates δ (deep grey), δ^∞ (medium grey) and of the importance sampling version (light grey) of E[X] = 10 when X ∼ Exp(.1) for 100 iterations, along with the 90% interquantile ranges (same colour code), in the setting of an independent exponential proposal with scale µ = 0.02.