
[Lecture Notes in Statistics] Discretization and MCMC Convergence Assessment Volume 135 || Convergence Control of MCMC Algorithms


2 Convergence Control of MCMC Algorithms

Christian P. Robert Dominique Cellier

2.1 Introduction

There is an obvious difference between the theoretical guarantee that f is the stationary distribution of a Markov chain (x^(t)) and the practical requirement that (1.2) is close enough to (1.1). It is thus necessary to develop diagnostic tools towards the latter goal, namely convergence control.¹

While control is the topic of this book, we first present in this chapter some of the usual methods, before embarking upon the description of new control methods. The reader is referred to the survey papers of Brooks (1998), Brooks and Roberts (1998) and Cowles and Carlin (1996), as well as to Robert and Casella (1998) and Gelfand and Smith (1998) for details.

We first distinguish between single chain (§2.2) and parallel chain (§2.3) control methods because both their motivations and purposes differ. The former are indeed usually oriented towards a control of the convergence of (1.2) to (1.1) for arbitrary functions h or, in other words, towards an assessment of the mixing properties of the chain (x^(t)), that is of the speed of exploration of the support of f. Besides, they usually allow for an "on-line" processing of the simulation output. The latter methods require on the contrary a tailored implementation, with several runs of the algorithm and a preliminary selection of the initial distribution. Moreover, they are closer to genuine simulation control, in the sense of producing r.v.'s approximately distributed from f or even i.i.d. from f. Both approaches have drawbacks, too, since the single chain methods can never guarantee² that the whole

¹Thanks to Peter Green, we became aware of a subtle difference between contrôle (Fr.) and control (Eng.). We will thus use control in its French meaning of monitoring, evaluation or assessment.

²See, however, the Riemann control variate of Philippe (1997a), described in §3.3, which escapes this difficulty by evaluating the probability of the region already explored by the Markov chain, based on a single path of this Markov chain.

C. P. Robert (ed.), Discretization and MCMC Convergence Assessment. © Springer-Verlag New York, Inc. 1998


support of f has been explored,³ while parallel chain methods depend on the choice of the initial distribution and thus only get interesting for well-dispersed distributions. It is thus preferable, although more conservative, to advise a cumulated implementation of these methods.

The last section (§2.4) introduces the notion of coupling, which is instrumental to perfect simulation techniques (see §1.4), and is also of potential value for convergence control, although we do not take advantage of it in this book.

2.2 Convergence assessments for single chains

2.2.1 Graphical evaluations

While a simple monitoring of the chain (x^(t)) can only expose strong non-stationarities, it is more relevant to consider the cumulated sums (1.2), since they need to stabilize for convergence to be achieved. While this is only a necessary condition for convergence, since a stabilization of the average (1.2) may only correspond to the exploration of a single mode of f by the chain, improved monitoring involves several estimates of (1.1) based on the same chain, as in Robert (1995). Possible alternatives to the empirical average (1.2) are conditional expectations (Rao-Blackwellization), importance sampling and quadrature methods as in the Riemann approximation technique of Yakowitz et al. (1978) and Philippe (1997a), based on the order statistics x_(1) ≤ ··· ≤ x_(T) of x^(1), ..., x^(T),

$$ S_T^R = \sum_{t=1}^{T-1} \left[ x_{(t+1)} - x_{(t)} \right] h(x_{(t)})\, f(x_{(t)}), \qquad (2.1) $$

which is only relevant in dimension 1. Both alternatives are further discussed in Chapter 3, which also presents a control variate technique based on (2.1).
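As an illustration, the Riemann sum (2.1) is straightforward to code in dimension 1. The sketch below is ours, not from the book: the Markov chain is replaced by i.i.d. draws from a standard normal f for simplicity, and the test function h(x) = x², whose exact expectation under f is 1, is an arbitrary choice; the names `riemann_estimate`, `h` and `f` are ours.

```python
import numpy as np

def riemann_estimate(sample, h, f):
    """Riemann-sum estimator (2.1): sort the sample, weight h*f by the spacings."""
    x = np.sort(sample)
    return np.sum(np.diff(x) * h(x[:-1]) * f(x[:-1]))

rng = np.random.default_rng(0)
sample = rng.normal(size=20000)                 # stand-in for MCMC output, f = N(0, 1)
f = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
h = lambda x: x**2                              # exact value of E_f[h] is 1
est = riemann_estimate(sample, h, f)
print(est)
```

The low variability of this estimator, compared with the empirical average, comes from the deterministic spacing weights.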

Example 2.2.1 (Cauchy benchmark) Instead of π(θ) = 1, consider a proper normal prior π(θ) ∝ exp(−θ²/2σ²) with known σ. The corresponding Gibbs sampler is associated with three artificial r.v.'s η₁, η₂, η₃, such that

$$ \pi(\theta, \eta_1, \eta_2, \eta_3 \mid x_1, x_2, x_3) \propto e^{-\theta^2/2\sigma^2}\, e^{-(1+(\theta-x_1)^2)\eta_1/2}\, e^{-(1+(\theta-x_2)^2)\eta_2/2}\, e^{-(1+(\theta-x_3)^2)\eta_3/2}. $$

The conditional distributions are then

$$ \eta_i \mid \theta \sim \mathcal{E}xp\!\left( \frac{1+(\theta-x_i)^2}{2} \right) \quad (i = 1, 2, 3), \qquad \theta \mid \eta_1, \eta_2, \eta_3 \sim \mathcal{N}\!\left( \mu(\eta_1,\eta_2,\eta_3),\, \tau^2(\eta_1,\eta_2,\eta_3) \right). $$

³This is the "You've only seen where you've been" defect.


We denote by

$$ \mu(\eta_1, \eta_2, \eta_3) = \frac{\eta_1 x_1 + \eta_2 x_2 + \eta_3 x_3}{\eta_1 + \eta_2 + \eta_3 + \sigma^{-2}} \qquad \text{and} \qquad \tau^{-2}(\eta_1, \eta_2, \eta_3) = \eta_1 + \eta_2 + \eta_3 + \sigma^{-2} $$

the conditional mean and variance of the conditional distribution of θ. When h(θ) = exp(−θ/σ), the different approximations of 𝔼^π[h(θ)] are

$$ S_T^C = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[ h(\theta) \mid \eta^{(t)} \right] = \frac{1}{T} \sum_{t=1}^{T} \exp\left\{ \frac{\tau^2(\eta^{(t)})}{2\sigma^2} - \frac{\mu(\eta^{(t)})}{\sigma} \right\} $$

for the conditional expectation, $S_T^P = \sum_{t=1}^T w_t h(\theta^{(t)}) \big/ \sum_{t=1}^T w_t$ for importance sampling, and

$$ S_T^R = \frac{ \sum_{t=1}^{T-1} \left( \theta_{(t+1)} - \theta_{(t)} \right) e^{-\theta_{(t)}/\sigma - \theta_{(t)}^2/(2\sigma^2)} \prod_{i=1}^{3} \left[ 1 + (x_i - \theta_{(t)})^2 \right]^{-1} }{ \sum_{t=1}^{T-1} \left( \theta_{(t+1)} - \theta_{(t)} \right) e^{-\theta_{(t)}^2/(2\sigma^2)} \prod_{i=1}^{3} \left[ 1 + (x_i - \theta_{(t)})^2 \right]^{-1} } $$

for the Riemann approximation, where θ_(1) ≤ ··· ≤ θ_(T) are the order statistics associated with the θ^(t)'s.

Figure 2.1 describes the convergence of the four estimators as T increases. As often, S_T and S_T^C are quite similar from the start, and S_T^R is very stable and shows that convergence is rapidly achieved for this example. The importance sampling estimate S_T^P has not yet converged, which can be explained by the infinite variance of the weights w_t. ‖
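To make the example concrete, here is a sketch (ours) of this Gibbs sampler with three of the four estimators, the empirical average S_T, the Rao-Blackwellized S_T^C and the Riemann estimator S_T^R; the importance sampling version is omitted since its weights are not reproduced here, and all variable names and the run length are our choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([-8.0, 8.0, 17.0])
sigma2 = 50.0
sigma = np.sqrt(sigma2)
T = 20000

theta = 0.0
thetas = np.zeros(T)
rb = np.zeros(T)
for t in range(T):
    # eta_i | theta ~ Exp((1 + (theta - x_i)^2)/2), i.e. scale 2/(1+(theta-x_i)^2)
    eta = rng.exponential(2.0 / (1.0 + (theta - x) ** 2))
    tau2 = 1.0 / (eta.sum() + 1.0 / sigma2)          # conditional variance of theta
    mu = tau2 * (eta * x).sum()                      # conditional mean of theta
    theta = rng.normal(mu, np.sqrt(tau2))
    thetas[t] = theta
    rb[t] = np.exp(tau2 / (2 * sigma2) - mu / sigma)  # E[exp(-theta/sigma) | eta]

S_T = np.exp(-thetas / sigma).mean()                 # empirical average
S_C = rb.mean()                                      # Rao-Blackwellized estimate

ts = np.sort(thetas)                                 # order statistics for the Riemann sum
ftilde = np.exp(-ts ** 2 / (2 * sigma2)) / np.prod(1 + (x[None, :] - ts[:, None]) ** 2, axis=1)
d = np.diff(ts)
S_R = (d * np.exp(-ts[:-1] / sigma) * ftilde[:-1]).sum() / (d * ftilde[:-1]).sum()

print(S_T, S_C, S_R)
```

The ratio form of S_R is what makes it usable with f known only up to a normalizing constant.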

Yu and Mykland (1998) propose a graphical evaluation based on the cumulated sums (CUSUMs), which monitor the partial differences

$$ D_T^i = \sum_{t=1}^{i} \left[ h(x^{(t)}) - S_T \right], \qquad i = 1, \ldots, T, $$

to assess the mixing behavior of the chain and the correlation between the x^(t)'s: the more mixing the chain is, the closer to Brownian motion the graph of


FIGURE 2.1. Convergence of the four estimators S_T (full), S_T^C (dots), S_T^R (dashes) and S_T^P (long dashes) for σ² = 50 and (x₁, x₂, x₃) = (−8, 8, 17). The graphs for S_T and S_T^C are indistinguishable. The final values are 0.845, 0.844, 0.828 and 0.845 for S_T, S_T^C, S_T^P and S_T^R respectively. (Source: Robert, 1996c.)


the D_T^i's is. For slowly mixing chains, the graphs are on the contrary much smoother, with long excursions away from 0. These very empirical rules are, however, too tentative for the method of Yu and Mykland (1998) to appear as a strong convergence criterion.
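The contrast between the two regimes is easy to mimic on synthetic output; the sketch below is ours, with an AR(1) sequence standing in for a slowly mixing chain, and the excursion-length summary is only a rough quantification of the smoothness the graphical method relies on.

```python
import numpy as np

def cusum_path(h_vals):
    """CUSUM path D_T^i = sum_{t<=i} (h(x^(t)) - S_T) of Yu and Mykland (1998)."""
    return np.cumsum(h_vals - h_vals.mean())

def mean_excursion_length(d):
    """Average number of steps between sign changes of the path."""
    signs = np.sign(d)
    changes = np.count_nonzero(signs[1:] != signs[:-1]) + 1
    return len(d) / changes

rng = np.random.default_rng(0)
T = 5000
fast = rng.normal(size=T)              # i.i.d. output: CUSUM close to a Brownian bridge
slow = np.zeros(T)                     # AR(1) with rho = 0.99: slowly mixing output
for t in range(1, T):
    slow[t] = 0.99 * slow[t - 1] + rng.normal()

print(mean_excursion_length(cusum_path(fast)),
      mean_excursion_length(cusum_path(slow)))
```

The slowly mixing chain produces far longer excursions away from 0, which is the smoothness defect the authors describe.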

2.2.2 Binary approximation

Raftery and Lewis (1992a,b, 1996) have proposed a technique which pertains to our approach, namely to use some finite Markov chain theory to control the convergence of the chain of interest, by approximating the minimal time t₀ to reach convergence, the sample size T necessary to evaluate (1.1), as well as the "batch" size k, from the finite structure. (The batch size k gives the number of iterations ignored between two recordings of the Markov chain. This strategy is commonly used in dependent simulation to approximate independence. It is only validated for MCMC algorithms in the special case of interleaving, as shown by Liu, Wong and Kong, 1995.) The authors derive a two-state process from the Markov chain (x^(t)) as

$$ z^{(t)} = \mathbb{I}_{x^{(t)} \le \underline{x}}, $$

where x̱ is an arbitrary point in the support of f. Assuming (z^(t)) is a homogeneous Markov chain (although this is not the case in general), with transition matrix

$$ \mathbb{P} = \begin{pmatrix} 1-\alpha & \alpha \\ \beta & 1-\beta \end{pmatrix}, $$


the associated invariant distribution is

$$ P(z^{(\infty)} = 0) = \frac{\beta}{\alpha+\beta}, \qquad P(z^{(\infty)} = 1) = \frac{\alpha}{\alpha+\beta}. $$

The warm-up time can then be deduced from the condition

$$ \left| P(z^{(t_0)} = i \mid z^{(0)} = j) - P(z^{(\infty)} = i) \right| \le \varepsilon \qquad \text{for } i, j = 0, 1. $$

Raftery and Lewis (1992a) show that this condition is equivalent to

$$ |1 - \alpha - \beta|^{t_0} \le \frac{(\alpha+\beta)\,\varepsilon}{\alpha \vee \beta}, $$

i.e.

$$ t_0 \ge \log\left( \frac{(\alpha+\beta)\,\varepsilon}{\alpha \vee \beta} \right) \Big/ \log |1 - \alpha - \beta|. $$

For h(z) = z, the minimal sample size for the convergence of

$$ \delta_T = \frac{1}{T} \sum_{t=1}^{T} z^{(t)} $$

to α/(α+β) is derived from the normal approximation of δ_T, with variance

$$ \frac{1}{T}\, \frac{(2-\alpha-\beta)\,\alpha\beta}{(\alpha+\beta)^3}. $$

For instance,

$$ P\left( \left| \delta_T - \frac{\alpha}{\alpha+\beta} \right| \le q \right) \ge \varepsilon' $$

implies

$$ \Phi\left( \sqrt{T}\, \frac{(\alpha+\beta)^{3/2}\, q}{\sqrt{\alpha\beta(2-\alpha-\beta)}} \right) \ge \frac{\varepsilon'+1}{2}, $$

i.e.

$$ T \ge \frac{\alpha\beta(2-\alpha-\beta)}{q^2(\alpha+\beta)^3} \left[ \Phi^{-1}\left( \frac{\varepsilon'+1}{2} \right) \right]^2. $$
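The two-state computations above translate directly into code; the sketch below is ours and is a plain transcription of the displayed formulas for t₀ and T, evaluated at illustrative values of (α, β) and of the control parameters ε, ε′ and q.

```python
from math import log
from statistics import NormalDist

def binary_control(alpha, beta, eps=0.005, eps_prime=0.999, q=0.005):
    """Warm-up time t0 and sample size T from the two-state approximation.

    Assumes alpha + beta != 1 (otherwise the second eigenvalue is 0 and
    the chain forgets its start in one step)."""
    lam = abs(1.0 - alpha - beta)                 # second eigenvalue of the 2x2 matrix P
    t0 = log((alpha + beta) * eps / max(alpha, beta)) / log(lam)
    z = NormalDist().inv_cdf((eps_prime + 1.0) / 2.0)
    T = alpha * beta * (2.0 - alpha - beta) / (q ** 2 * (alpha + beta) ** 3) * z ** 2
    return t0, T

t0, T = binary_control(0.19, 0.34)
print(t0, T)
```

In practice (α, β) are themselves estimated from a preliminary run, which is the point discussed next in the text.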

Since (z^(t)) is not a Markov chain, even when x^(t) has a finite support (see Kemeny and Snell, 1960), Raftery and Lewis (1992a,b) determine a batch step k by testing whether (z^(kt)) is a Markov chain against the alternative hypothesis that (z^(kt)) is a second order Markov process. This derivation is rather weak from a theoretical point of view since the alternative hypothesis is restrictive and its rejection does not imply that the null hypothesis holds. Moreover, it is possible to construct a true binary Markov chain by subsampling at renewal times (see Chapter 4) rather than


at fixed times. From a practical point of view, it appears in the examples (see Raftery and Lewis, 1992a, Brooks, 1998, Brooks and Roberts, 1997, and Cowles and Carlin, 1996) that the values of k obtained through these tests are usually quite small and often equal to 1.

The previous analysis and the derivation of the quantities (t₀, T, k) depend solely on the parameters (α, β), which are usually unknown. The binary control technique of Raftery and Lewis (1992a,b) thus requires a preliminary run where (α, β) is "correctly" estimated. While this is easier than for the original chain, since the state-space is reduced to two points, another control technique is necessary to decide whether (α, β) are indeed well-estimated. An alternative is to iteratively estimate (α, β) until all parameters stabilize (Raftery and Lewis, 1996). Comparing with the independent setup, Raftery and Lewis (1992a) suggest to use first a sample size larger than

$$ T_{\min} \ge \left[ \Phi^{-1}\left( \frac{\varepsilon'+1}{2} \right) \right]^2 \frac{\alpha\beta}{(\alpha+\beta)^2\, q^2}. $$

A problem with this solution is that the Central Limit Theorem does not necessarily apply for T as small as this T_min, as shown in Chapter 5.

As mentioned in Brooks and Roberts (1997), the method does not perform well when x̱ is located in the tails of f. A sensitivity analysis must then be conducted to robustify the choice of x̱, and this may considerably increase the computing time in large dimension models. Nonetheless, the binary control method presents the major incentive of being almost totally automated and it does not require additional programming time, given the programs already available in StatLib.

TABLE 2.1. Parameters derived from the binary control for three possible parameterizations with control parameters ε = q = 0.005 and ε′ = 0.999 (5000 preliminary runs).

parameter    x̱      α      β      q₀     t₀     T
μ            0.13   0.19   0.34   0.36   6      42345
η            0.35   0.38   0.18   0.68   6      36689
ξ            0.02   0.29   0.40   0.42   4      30906

Example 2.2.2 (Multinomial benchmark) For a = (0.1, 0.14, 0.7, 0.9), b = (0.17, 0.24, 0.19, 0.20) and (x₁, x₂, x₃, x₄, x₅) = (4, 15, 12, 7, 4), Table 2.1 provides the values of α, β, q₀ = α/(α+β), t₀ and T for two-state variables derived from μ, η and ξ = μη, respectively. The preliminary run is of size 5000. The control parameters ε, ε′ and q are rather strict and the corresponding values of T are large, with reduced initial sample sizes. In this case, the effect of the parameterization (η vs. μ vs. ξ) is negligible. ‖


2.3 Convergence assessments based on parallel chains

2.3.1 Introduction

As mentioned in §2.2, the single chain methods have the drawback that the chain hardly brings information on the regions of the space it does not visit. Parallel chain methods try to overcome this defect by generating in parallel M chains (θ_m^(t)) (1 ≤ m ≤ M), aiming at eliminating the dependence on initial conditions, and the convergence control is most often based on the comparison of the estimations of different quantities for the M chains, although Chapter 5 develops an alternative. An obvious danger of this approach is that the slowest chain commands the speed of convergence. The various methods proposed in the literature are actually far from free of defects (see Geyer, 1992), since a preliminary (partial) knowledge of the distribution of interest is crucial to ensure good performances. Indeed, an initial distribution which omits one or several important modes of f does not improve much over a single starting value if the MCMC algorithm under study has a strong tendency to get trapped near the starting mode. Another major drawback is that the comparison between two chains at time T is delicate, given that they are not converging at the same speed. For complex settings like the Gibbs sampler on nonlinear models, where the MCMC algorithm may be quite slow, it is thus more efficient to use a single chain of size MT rather than M chains of size T, which will likely remain in a neighborhood of their starting point. (See Tierney, 1994, and Raftery and Lewis, 1996, for additional criticisms.) The debate parallel vs. single chain is, however, far from being closed and most methods proposed in this monograph will rely on parallel chains, at one stage or another.

2.3.2 Between-within variance criterion

Gelman and Rubin (1992) initiate their control strategy by constructing an initial distribution μ related to the modes of f, obtained for instance by numerical methods. Their suggestion is to use a mixture of Student's t distributions centered at the modes of f and with scale parameters derived from the second derivatives of f at these modes. Using these as initial distributions, one generates M chains (x_m^(t)) (1 ≤ m ≤ M). For every parameter of interest ξ = h(x), Gelman and Rubin's (1992) criterion is based on the difference between a weighted estimator of the variance for each chain and the variance of the estimators on the different chains.

More precisely, consider

$$ B_T = \frac{1}{M-1} \sum_{m=1}^{M} \left( \bar{\xi}_m - \bar{\xi} \right)^2, \qquad W_T = \frac{1}{M(T-1)} \sum_{m=1}^{M} \sum_{t=1}^{T} \left( \xi_m^{(t)} - \bar{\xi}_m \right)^2, $$


with

$$ \bar{\xi}_m = \frac{1}{T} \sum_{t=1}^{T} \xi_m^{(t)}, \qquad \bar{\xi} = \frac{1}{M} \sum_{m=1}^{M} \bar{\xi}_m, $$

and ξ_m^(t) = h(x_m^(t)). The quantities B_T and W_T are the between- and within-chain variances. A first estimator of the posterior variance of ξ is

$$ \hat{\sigma}_T^2 = \frac{T-1}{T}\, W_T + B_T. $$

Gelman and Rubin (1992) compare σ̂_T² and W_T, which are asymptotically equivalent, through a Student approximation; σ̂_T² overestimates the variance of the ξ_m^(t)'s because of the large dispersion of the initial distribution, while W_T underestimates this variance when the sequences (ξ_m^(t)) remain concentrated around their initial value. If we denote

$$ \hat{V}_T = \hat{\sigma}_T^2 + \frac{B_T}{M}, $$

with corresponding degrees of freedom

$$ \nu_T = \frac{2\, \hat{V}_T^2}{\widehat{\mathrm{var}}\big( \hat{V}_T \big)}, $$

the criterion of Gelman and Rubin (1992) is given by

$$ R_T = \frac{\hat{V}_T}{W_T}\, \frac{\nu_T}{\nu_T - 2}. $$

Normal approximations lead to an approximate F(M−1, ψ_T) distribution for T B_T / W_T, with ψ_T = 2 W_T² / var̂(W_T).


A test of 𝔼[R_T] = 1 can be derived from this approximation. This method is commonly used, because of its simplicity and of its connections with standard tools of linear regression. Nonetheless, it suffers from several drawbacks. First, the detailed construction of μ is costly and unreliable, given that it usually implies advanced maximization techniques but cannot guarantee an exhaustive list of the modes of f. Second, and maybe more importantly, the evaluation of convergence is based on normal approximations, which are only valid asymptotically, are difficult to validate and are certainly inappropriate in some MCMC settings. Third, the criterion is even more difficult to derive and to assess in multidimensional problems, as shown by Brooks and Gelman (1998).
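The between-within computation above can be sketched in a few lines; this is our own illustration, which reports the square root of the variance ratio without the degrees-of-freedom correction ν_T/(ν_T − 2), on two hypothetical sets of chains: one already stationary and one stuck around distinct modes.

```python
import numpy as np

def shrink_factor(xi):
    """Between-within diagnostic for an (M, T) array of h(x) values."""
    M, T = xi.shape
    chain_means = xi.mean(axis=1)
    B = chain_means.var(ddof=1)                  # between-chain variance B_T
    W = xi.var(axis=1, ddof=1).mean()            # within-chain variance W_T
    sigma2 = (T - 1) / T * W + B                 # pooled variance estimate
    return np.sqrt((sigma2 + B / M) / W)         # df correction omitted

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 2000))               # 4 chains at stationarity
stuck = rng.normal(loc=np.arange(4)[:, None], size=(4, 2000))  # 4 separated modes
print(shrink_factor(mixed), shrink_factor(stuck))
```

Values close to 1 are compatible with convergence; chains trapped in different modes inflate the between-chain term and push the statistic well above 1.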

2.3.3 Distance to the stationary distribution

The following methods are rather "off-key", given our overall purpose, in the sense that they seek to estimate some distance to the stationary distribution, rather than focusing on the chain(s) at hand. For instance, in the case of the Gibbs sampler, Roberts (1992) proposes an unbiased estimator of the distance ‖f_t − f‖, where f_t is the marginal density of the "symmetrized" chain x̃^(t), which is obtained from a reversible version of the Gibbs sampler, namely by running the steps 1., 2., ..., p. of [A2], then the steps p., p−1., ..., 1. This method associates to (x^(t)) a dual chain (x̃^(t)): given x^(t), x̃^(t) is simulated conditionally on x^(t) through steps 1., 2., ..., p. of [A2], then x^(t+1) is simulated conditionally on x̃^(t) through steps p., p−1., ..., 1. of [A2].⁴

Starting with a single initial value x^(0), Roberts' (1992) method is based on M parallel chains (x_l^(t)) (l = 1, ..., M), and on the following unbiased estimator of ‖f_t − f‖ + 1:

$$ J_t = \frac{1}{M(M-1)} \sum_{1 \le l \ne s \le M} \frac{ K_-\left( x_l^{(t)}, x_s^{(t)} \right) }{ f\left( x_s^{(t)} \right) }, $$

where K_− denotes the transition kernel associated with steps p., p−1., ..., 1. of [A2]. Since f is usually known up to a multiplicative constant, the limiting value of J_t is unknown (instead of being 1) and the control method is a simple graphical monitoring of the stabilization of J_t. Note that the normalizing constants of K_−(x, x′) have to be known since they depend on x, and this can be quite costly in terms of computing time.

This estimation method is theoretically well-grounded but, as mentioned above, it somehow misses the true purposes of convergence control, as the marginal distribution f_t is not of direct interest in the control of MCMC algorithms. The dependence on the same initial value is also a negative

⁴Note that these additional steps create a reversible chain.


feature of the method, because a slow mixing algorithm will give similar values of x_l^(t) for small t's and thus lead to the impression that convergence is reached.

Roberts (1994) extends this method to other MCMC algorithms. Brooks, Dellaportas and Roberts (1997) propose a similar approach which estimates an upper bound on the L¹ distance between f_t and f, based on the following relation:

$$ \| f - f_t \|_1 = \mathbb{E}^{f_t}\left[ \left| 1 - \frac{f(x)}{f_t(x)} \right| \right] $$

(see also Brooks and Roberts, 1998).

As in Roberts (1992), Liu, Liu and Rubin (1992) evaluate the difference between f and f_t. Their method is based on an unbiased estimator of the variance of f_t(θ)/f(θ), namely U − 1, where U is a ratio of transition kernel values computed from θ₁⁻, θ₂⁻ ~ f_{t−1} generated independently, with θ₁ ~ K(θ₁⁻, θ₁) and θ₂ ~ K(θ₂⁻, θ₂). Using M parallel chains, each iteration t provides M(M−1) values U^(i,j,t) (i ≠ j) which can be monitored graphically or with Gelman and Rubin's (1992) method. Note that the ratio U does not imply the computation of the normalizing constant of the kernel K.

The following section (§2.4) provides another approach to the estimation of the total variation distance which has not yet been exploited in MCMC settings.

2.4 Coupling techniques

Although the notion of coupling is not used as a control technique per se in this book, it appears at several places (Chapters 4 and 6). This probabilistic notion is indeed related to convergence control in the sense that it establishes independence from initial conditions and may as well accelerate convergence. Besides, it is related with the exact sampling techniques of §1.4. We thus give a short introduction on coupling at this stage, before presenting the coupling control method of Johnson (1996). Its potential for actual convergence control should not be neglected, even if we only marginally use coupling in this monograph.

2.4.1 Coupling theory

While the general purpose of coupling is to evaluate the distance between two distributions by creating a joint distribution whose marginals are the distributions of interest (see Lindvall, 1992), its relevance in our setting is


both to evaluate the (total variation) distance to the stationary distribution and to accelerate convergence to this distribution.

Definition 2.4.1 Two Markov chains (x₁^(t)) and (x₂^(t)) with initial distributions μ₁ and μ₂ are coupled if the joint distribution of (x₁^(t), x₂^(t)) preserves the marginal distributions of (x₁^(t)) and (x₂^(t)). A coupling time is thus a stopping time τ such that

$$ x_1^{(t)} = x_2^{(t)} \qquad \text{for } t \ge \tau. $$

Obviously, a coupling time is only useful if it is almost surely finite. (The coupling is then said to be successful.) Therefore, this notion seems to solely apply in specific settings like finite or discrete chains. However, if the transition kernel is such that there exists a recurrent atom α, i.e. an accessible set such that q(·|x) is constant for x ∈ α, both chains (x₁^(t)) and (x₂^(t)) can be made identical once they meet in α. (A weaker notion of coupling requires that the distributions of x₁^(t) and x₂^(t) are identical for t ≥ τ.) While atoms are rarely encountered in (continuous) MCMC setups, Chapter 4 shows that the existence of small sets is sufficient to achieve the same goal, namely that two independent chains meet with positive probability at a renewal time (see §4.2.1). More generally, we will see below that a maximal coupling algorithm provides a generic way to couple two Markov chains.

As shown by Lindvall (1992), the distribution of the coupling time is strongly related to the total variation distance between the distributions of (x₁^(t)) and (x₂^(t)), in the sense of the fundamental coupling inequality,

$$ \left\| P_{\mu_1}^n - P_{\mu_2}^n \right\|_{TV} \le 2\, P(\tau > n), \qquad (2.2) $$

where P_{μ_i}^n denotes the distribution of x_i^(n) (i = 1, 2). If μ₂ is the stationary distribution f, the inequality (2.2) becomes

$$ \left\| P_{\mu_1}^n - f \right\|_{TV} \le 2\, P(\tau > n). $$

An empirical study of the distribution of τ thus provides information on the convergence of P_{μ₁}^n to the stationary distribution.
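The inequality (2.2) is easy to check empirically on a two-state chain, for which the total variation distance is available in closed form. The sketch below is ours: a Doeblin coupling of a chain started at state 0 with a chain started from the stationary distribution, with arbitrary transition probabilities a and b.

```python
import numpy as np

# Two-state chain with P(0 -> 1) = a and P(1 -> 0) = b.  Chain 1 starts at 0,
# chain 2 starts from the stationary distribution; both evolve independently
# until they meet (Doeblin coupling), after which they stay together.
rng = np.random.default_rng(0)
a, b = 0.2, 0.4
pi1 = a / (a + b)                         # stationary probability of state 1

def coupling_time(rng):
    x1, x2 = 0, int(rng.random() < pi1)
    t = 0
    while x1 != x2:
        t += 1
        x1 = 1 - x1 if rng.random() < (a if x1 == 0 else b) else x1
        x2 = 1 - x2 if rng.random() < (a if x2 == 0 else b) else x2
    return t

taus = np.array([coupling_time(rng) for _ in range(20000)])
n = 3
lam = 1 - a - b                           # second eigenvalue of the transition matrix
tv = 2 * abs(lam) ** n * pi1              # exact TV distance (sum form) of P^n(0,.) to pi
bound = 2 * np.mean(taus > n)             # empirical right-hand side of (2.2)
print(tv, bound)
```

As (2.2) guarantees, the empirical bound dominates the exact distance; the gap reflects the fact that Doeblin's coupling is not maximal.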

Two particular types of couplings are Doeblin's coupling, where two independent chains are monitored till they meet (in an arbitrary state, in an atom, or in a small set with a certain renewal probability), and the deterministic coupling where x₂^(t) is a deterministic function of x₁^(t) (conditionally on x₂^(t−1)) through the use of the same underlying uniform variable.⁵ For

⁵This obviously excludes the use of standard simulation techniques like accept-reject methods (see Robert and Casella, 1998).


instance, if the transition kernel corresponds to an exponential distribution

$$ y \mid x \sim \mathcal{E}xp\left( \frac{x}{1+x} \right), $$

x₁^(t) can be constructed as

$$ x_1^{(t)} = -\frac{1 + x_1^{(t-1)}}{x_1^{(t-1)}}\, \log(u_t), $$

where u_t ~ U[0,1], and x₂^(t) is then

$$ x_2^{(t)} = -\frac{1 + x_2^{(t-1)}}{x_2^{(t-1)}}\, \log(u_t) = \frac{ \left( 1 + x_2^{(t-1)} \right) x_1^{(t-1)} }{ x_2^{(t-1)} \left( 1 + x_1^{(t-1)} \right) }\, x_1^{(t)}. $$
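A short simulation of this deterministic coupling (our sketch; the starting values are arbitrary) shows the two paths, driven by the same uniforms, contracting toward each other:

```python
import numpy as np

rng = np.random.default_rng(0)
x1, x2 = 0.5, 5.0                  # two different starting points
gap = []
for t in range(200):
    u = rng.random()               # the SAME uniform drives both chains
    x1 = -(1 + x1) / x1 * np.log(u)
    x2 = -(1 + x2) / x2 * np.log(u)
    gap.append(abs(x1 - x2))
print(gap[0], gap[-1])
```

The gap between the chains shrinks geometrically on average, so that after a couple of hundred steps the two paths are numerically indistinguishable.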

(Both chains are thus identical once they have met.) Another type of coupling has been developed for probabilistic reasons.

It is called maximal coupling because it leads to an equality⁶ in (2.2). The algorithm corresponding to the maximal coupling is as follows:

1. Generate x₁^(t) ~ q(x|x₁^(t−1)) and u_t ~ U[0,1].

2. If u_t q(x₁^(t)|x₁^(t−1)) ≤ q(x₁^(t)|x₂^(t−1)), take x₂^(t) = x₁^(t). [A6]

3. Else, generate x₂^(t) ~ q(x|x₂^(t−1)) and u_t′ ~ U[0,1] until u_t′ q(x₂^(t)|x₂^(t−1)) ≥ q(x₂^(t)|x₁^(t−1)).
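A minimal implementation of [A6] (ours, for a hypothetical random-walk kernel q(·|x) = N(x, 1)) checks both that the two marginals are preserved and that coupling occurs with probability ε = ∫ q(x|0) ∧ q(x|1) dx = 2Φ(−1/2) ≈ 0.617:

```python
import math
import numpy as np

def q_pdf(y, x):
    """Density of the kernel q(.|x) = N(x, 1)."""
    return math.exp(-0.5 * (y - x) ** 2) / math.sqrt(2 * math.pi)

def maximal_coupling_step(x1_prev, x2_prev, rng):
    """One step of [A6]."""
    x1 = rng.normal(x1_prev, 1.0)
    if rng.random() * q_pdf(x1, x1_prev) <= q_pdf(x1, x2_prev):
        return x1, x1                               # the chains couple
    while True:                                     # rejection sampler for the residual
        x2 = rng.normal(x2_prev, 1.0)
        if rng.random() * q_pdf(x2, x2_prev) >= q_pdf(x2, x1_prev):
            return x1, x2

rng = np.random.default_rng(0)
draws = np.array([maximal_coupling_step(0.0, 1.0, rng) for _ in range(50000)])
coupled = draws[:, 0] == draws[:, 1]
print(draws[:, 0].mean(), draws[:, 1].mean(), coupled.mean())
```

The sample means stay close to 0 and 1, confirming that x₂^(t) is still marginally drawn from q(·|x₂^(t−1)) even though it is built from x₁^(t) whenever possible.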

It is relevant for our purposes to establish that the algorithm [A6] is truly a coupling algorithm, namely that

$$ x_1^{(t)} \sim q\left( x \mid x_1^{(t-1)} \right) \qquad \text{and} \qquad x_2^{(t)} \sim q\left( x \mid x_2^{(t-1)} \right), $$

because the proof brings new light on some results of Chapter 4 (§4.2.3 and Lemma 4.2.2). In fact, while x₁^(t) ~ q(x|x₁^(t−1)) by construction, the density of x₂^(t) is indeed

$$ q\left( x \mid x_1^{(t-1)} \right) \wedge q\left( x \mid x_2^{(t-1)} \right) + (1-\varepsilon) \sum_{i=0}^{\infty} \varepsilon^i \left( 1 - \frac{ q\left( x \mid x_1^{(t-1)} \right) \wedge q\left( x \mid x_2^{(t-1)} \right) }{ q\left( x \mid x_2^{(t-1)} \right) } \right) q\left( x \mid x_2^{(t-1)} \right) = q\left( x \mid x_2^{(t-1)} \right), $$

⁶Note that this does not signify that coupling occurs faster or that it is optimal in any sense. The appeal of this particular coupling is to provide a more accurate connection between total variation distance and coupling time.


where ε is the normalizing constant of q(x|x₁^(t−1)) ∧ q(x|x₂^(t−1)), i.e.

$$ \varepsilon = \int q\left( x \mid x_1^{(t-1)} \right) \wedge q\left( x \mid x_2^{(t-1)} \right) dx. $$

As in the splitting process of §4.2.3, the density q(x|x₂^(t−1)) is thus split into two parts,

$$ q\left( x \mid x_2^{(t-1)} \right) = \varepsilon\, \frac{ q\left( x \mid x_1^{(t-1)} \right) \wedge q\left( x \mid x_2^{(t-1)} \right) }{\varepsilon} + (1-\varepsilon)\, \frac{ q\left( x \mid x_2^{(t-1)} \right) - q\left( x \mid x_1^{(t-1)} \right) \wedge q\left( x \mid x_2^{(t-1)} \right) }{1-\varepsilon}, $$

and the generation does not require ε to be known, which is quite appealing in simulation setups. Note however that both transitions must be known up to the same multiplicative constant. Moreover, in the event ε is close to 1, this scheme may require extremely long runs and alternatives based on q(x|x₂^(t−1)) − q(x|x₁^(t−1)) ∧ q(x|x₂^(t−1)) may be preferable, when both conditional distributions q(x|x₁^(t−1)) and q(x|x₂^(t−1)) are available.

conditional distributions q(Xlx~t-1») and q(Xlx~t-1») are available. From the point of view of MCMC algorithms, and in particular for the

Gibbs sampler, coupling can be implemented in many ways. First, for a given order on the 9;'S in [A2], the chain of the full vector Y1 can be coupled with another chain Y2 as in Examples 2.4.1-2.4.3 below. The basic coupling ratio is then

( (t)1 (t-l) (t-1») (t)1 (t) (t-1) (t-1») 91 Yll Y22 , ... , Y2p 92 Y12 Yll ,Y23 , ... , Y2p

( (t)1 (t-1) (t-1») (t)1 (t) (t-1) (t-1») 91 Yll Y12 , ... , Y1p 92 Y12 Yll' Y23 , ... , Y1

( (t)1 (t) (t») 9p Y1 p Yll'···' Y1(p-1)

x . . . (t) I (t) (t») 9p Y1 p Yll'···' Yl(P-l)

An alternative approach is to couple both chains at each stage j. (1 ≤ j ≤ p) of the Gibbs sampler in [A2]. The component y₂ⱼ^(t) is then equal to y₁ⱼ^(t) with probability

$$ \frac{ g_j\left( y_{1j}^{(t)} \mid y_{21}^{(t)}, \ldots, y_{2(j-1)}^{(t)}, y_{2(j+1)}^{(t-1)}, \ldots, y_{2p}^{(t-1)} \right) }{ g_j\left( y_{1j}^{(t)} \mid y_{11}^{(t)}, \ldots, y_{1(j-1)}^{(t)}, y_{1(j+1)}^{(t-1)}, \ldots, y_{1p}^{(t-1)} \right) } \wedge 1, $$

the previous components of y₂^(t) being independently coupled with those of y₁^(t).

Example 2.4.1 (Cauchy benchmark) The implementation of the maximal coupling algorithm implies running two parallel chains (η₁^(t), θ₁^(t)) and (η₂^(t), θ₂^(t)) such that the former is run from the conditional distributions as in Example 1.2.3 and the latter is generated from


1. Take (η₂^(t), θ₂^(t)) = (η₁^(t), θ₁^(t)) with probability

$$ \frac{ \pi\left( \eta_1^{(t)}, \theta_1^{(t)} \mid \eta_2^{(t-1)}, \theta_2^{(t-1)} \right) }{ \pi\left( \eta_1^{(t)}, \theta_1^{(t)} \mid \eta_1^{(t-1)}, \theta_1^{(t-1)} \right) } = \frac{ \pi\left( \eta_1^{(t)} \mid \theta_2^{(t-1)} \right) }{ \pi\left( \eta_1^{(t)} \mid \theta_1^{(t-1)} \right) } = \prod_{i=1}^{3} \frac{ 1 + \left( \theta_2^{(t-1)} - x_i \right)^2 }{ 1 + \left( \theta_1^{(t-1)} - x_i \right)^2 }\, \exp\left\{ -\frac{1}{2} \left( \theta_2^{(t-1)2} - \theta_1^{(t-1)2} \right) \sum_i \eta_{1i}^{(t)} + \left( \theta_2^{(t-1)} - \theta_1^{(t-1)} \right) \sum_i \eta_{1i}^{(t)} x_i \right\}. $$

2. Else, generate

$$ \left( \eta_2^{(t)}, \theta_2^{(t)} \right) \sim \pi\left( \eta_2^{(t)}, \theta_2^{(t)} \mid \eta_2^{(t-1)}, \theta_2^{(t-1)} \right), \qquad u \sim U_{[0,1]}, $$

until

$$ u \ge \prod_{i=1}^{3} \frac{ 1 + \left( \theta_1^{(t-1)} - x_i \right)^2 }{ 1 + \left( \theta_2^{(t-1)} - x_i \right)^2 }\, \exp\left\{ -\frac{1}{2} \left( \theta_1^{(t-1)2} - \theta_2^{(t-1)2} \right) \sum_i \eta_{2i}^{(t)} + \left( \theta_1^{(t-1)} - \theta_2^{(t-1)} \right) \sum_i \eta_{2i}^{(t)} x_i \right\}. $$

In order to study the speed of convergence to stationarity, we run the chain (η₁^(t), θ₁^(t)) with no modification at coupling times and the second chain (η₂^(t), θ₂^(t)) with restarts from the initial distribution U[x_(1), x_(3)] at every coupling time. We can then evaluate the coupling time as well as the total variation distance between U[x_(1), x_(3)] and the stationary distribution (under maximal coupling). When implemented with the same observations x_i as in Example 1.2.3, the algorithm leads to an average coupling time of 9.32 iterations (for 25,000 replications). Note that the distribution of the chain at the time of coupling is not the stationary distribution, as shown by Figure 2.2(a), which gives a sample of 50,000 values at coupling, along with the true stationary distribution. Although a given θ₁^(t) is approximately distributed from the stationary distribution, the θ₂^(t)'s at coupling times are not, because of the bias created by the stopping rule.

The same approach can be implemented for the reverse order Gibbs sampler, namely on the chains (θ₁^(t), η₁^(t)) and (θ₂^(t), η₂^(t)). The coupling steps are then

1. Take (θ₂^(t), η₂^(t)) = (θ₁^(t), η₁^(t))


with probability

$$ \frac{ \pi\left( \theta_1^{(t)} \mid \eta_2^{(t-1)} \right) }{ \pi\left( \theta_1^{(t)} \mid \eta_1^{(t-1)} \right) } = \frac{ \sigma\left( \eta_1^{(t-1)} \right) \exp\left\{ -\left( \theta_1^{(t)} - \mu\left( \eta_2^{(t-1)} \right) \right)^2 \big/ 2\sigma\left( \eta_2^{(t-1)} \right)^2 \right\} }{ \sigma\left( \eta_2^{(t-1)} \right) \exp\left\{ -\left( \theta_1^{(t)} - \mu\left( \eta_1^{(t-1)} \right) \right)^2 \big/ 2\sigma\left( \eta_1^{(t-1)} \right)^2 \right\} }, $$

where

$$ \sigma(\eta)^2 = \frac{1}{\eta_1 + \eta_2 + \eta_3 + \sigma^{-2}}. $$

2. Else, generate

$$ \left( \theta_2^{(t)}, \eta_2^{(t)} \right) \sim \pi\left( \theta_2^{(t)}, \eta_2^{(t)} \mid \eta_2^{(t-1)} \right), \qquad u \sim U_{[0,1]}, $$

until

$$ u \ge \frac{ \sigma\left( \eta_2^{(t-1)} \right) \exp\left\{ -\left( \theta_2^{(t)} - \mu\left( \eta_1^{(t-1)} \right) \right)^2 \big/ 2\sigma\left( \eta_1^{(t-1)} \right)^2 \right\} }{ \sigma\left( \eta_1^{(t-1)} \right) \exp\left\{ -\left( \theta_2^{(t)} - \mu\left( \eta_2^{(t-1)} \right) \right)^2 \big/ 2\sigma\left( \eta_2^{(t-1)} \right)^2 \right\} }. $$

In this case, the average coupling time, 9.66, is approximately the same, using the same uniform starting distribution for the θ₂^(0)'s. As shown by Figure 2.2(b), the distribution of the points at coupling is again different from the stationary distribution, although the difference is not as important as for the direct order.

FIGURE 2.2. Histogram of a sample of θ's of size 25,000 at coupling time, obtained without restarts of the first chain and with a uniform U[x(1), x(3)] initial distribution, against the stationary distribution, (a) for the (η, θ) order and (b) for the (θ, η) order.


As mentioned above, coupling can be implemented at each stage of the Gibbs sampler, with the same acceptance probabilities as above. The improvement brought by this "double" coupling is rather marginal, since the mean coupling time is then 8.39. This shows, however, that double coupling does not necessarily have a negative impact on coupling times. ||


PUMP BENCHMARK


Example 2.4.2 A second chain (λ₂^(t), β₂^(t)), with λᵢ = (λᵢ₁, …, λᵢ,₁₀), can be coupled to the original chain (λ₁^(t), β₁^(t)) as follows:

1. Take (λ₂^(t), β₂^(t)) = (λ₁^(t), β₁^(t))

with probability

π(β₁^(t) | λ₂^(t-1)) / π(β₁^(t) | λ₁^(t-1))

2. Else, generate

(λ₂^(t), β₂^(t)) ~ π(λ₂^(t), β₂^(t) | λ₂^(t-1)),   u ~ U[0,1],

until

u > ((δ + Σᵢ λ₁ᵢ^(t-1)) / (δ + Σᵢ λ₂ᵢ^(t-1)))^{γ+10α} exp{β₂^(t) (Σᵢ λ₂ᵢ^(t-1) − Σᵢ λ₁ᵢ^(t-1))}

For the dataset given in Table 1.2, and the uniform distribution on [1.5, 3.5] as initial distribution, the average coupling time for 10,000 replications is 1.61.

The reverse coupling is given by

1. Take (β₂^(t), λ₂^(t)) = (β₁^(t), λ₁^(t))

with probability

π(λ₁^(t) | β₂^(t-1)) / π(λ₁^(t) | β₁^(t-1))

2. Else, generate

(β₂^(t), λ₂^(t)) ~ π(β₂^(t), λ₂^(t) | β₂^(t-1)),   u ~ U[0,1],

until

u > ∏_{i=1}^{10} ((tᵢ + β₁^(t-1)) / (tᵢ + β₂^(t-1)))^{pᵢ+α} exp{λ₂ᵢ^(t) (β₂^(t-1) − β₁^(t-1))}

The implementation of this scheme leads to an average coupling time of 1.8, only slightly slower than the direct approach, "despite" the larger dimension of the parameter λ. Double-coupling reduces the average coupling time to 1.47. ||
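Since the acceptance ratios in this example are products of Gamma-density terms, they are conveniently evaluated on the log scale. A minimal sketch of the rejection threshold π(β₂^(t) | λ₁^(t-1)) / π(β₂^(t) | λ₂^(t-1)), where the hyperparameter values `GAMMA`, `ALPHA`, `DELTA` and the test vectors are placeholders rather than the benchmark's values:

```python
import math

# Hypothetical hyperparameters for beta | lambda ~ Ga(gamma + 10*alpha, delta + sum(lambda))
GAMMA, ALPHA, DELTA = 0.01, 1.8, 1.0

def log_coupling_ratio(beta2, lam1, lam2):
    """log of pi(beta2 | lam1) / pi(beta2 | lam2), the rejection threshold in step 2."""
    s1, s2 = sum(lam1), sum(lam2)
    return (GAMMA + 10 * ALPHA) * math.log((DELTA + s1) / (DELTA + s2)) + beta2 * (s2 - s1)

# illustrative failure-rate vectors for the 10 pumps
lam1 = [0.1] * 10
lam2 = [0.2] * 10
print(log_coupling_ratio(2.5, lam1, lam2))
```

Note that the ratio is antisymmetric in the two chains on the log scale, which gives a cheap sanity check of the implementation.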


Example 2.4.3 If we denote θ = (μ, η) and z₁ = (z₁₁, z₁₂, z₁₃, z₁₄), the chain (θ₁^(t), z₁^(t)) can be coupled with a second chain (θ₂^(t), z₂^(t)) as follows:

1. Take (θ₂^(t), z₂^(t)) = (θ₁^(t), z₁^(t)) with probability

π(θ₁^(t) | z₂^(t-1)) / π(θ₁^(t) | z₁^(t-1)) = [Γ(z_{μ2}^(t-1) + z_{η2}^(t-1) + x₅ + 1.5) / Γ(z_{μ1}^(t-1) + z_{η1}^(t-1) + x₅ + 1.5)] × [Γ(z_{μ1}^(t-1) + 0.5) Γ(z_{η1}^(t-1) + 0.5) / Γ(z_{μ2}^(t-1) + 0.5) Γ(z_{η2}^(t-1) + 0.5)] × (μ₁^(t))^{z_{μ2}^(t-1) − z_{μ1}^(t-1)} (η₁^(t))^{z_{η2}^(t-1) − z_{η1}^(t-1)}

where (i = 1, 2)

z_{μi}^(t-1) = z_{i1}^(t-1) + z_{i2}^(t-1),   z_{ηi}^(t-1) = z_{i3}^(t-1) + z_{i4}^(t-1).

2. Else, generate

(θ₂^(t), z₂^(t)) ~ π(θ, z | z₂^(t-1)),   u ~ U[0,1],

until

u > [Γ(z_{μ1}^(t-1) + z_{η1}^(t-1) + x₅ + 1.5) / Γ(z_{μ2}^(t-1) + z_{η2}^(t-1) + x₅ + 1.5)] × [Γ(z_{μ2}^(t-1) + 0.5) Γ(z_{η2}^(t-1) + 0.5) / Γ(z_{μ1}^(t-1) + 0.5) Γ(z_{η1}^(t-1) + 0.5)] × (μ₂^(t))^{z_{μ1}^(t-1) − z_{μ2}^(t-1)} (η₂^(t))^{z_{η1}^(t-1) − z_{η2}^(t-1)}

When implemented on the original dataset of Example 1.3.5, the average coupling time over 25,000 iterations with a Dirichlet D(1, 1, 1) distribution on θ₂^(0) is 2.67. Note that the implementation of the coupling method requires a tight control of the normalizing constants, which may be too computationally demanding in some cases.
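The control of normalizing constants mentioned above can be kept numerically stable by working on the log scale with `math.lgamma`. A sketch under the assumption of Dirichlet D(z_μ + 0.5, z_η + 0.5, x₅ + 0.5) conditionals, matching the Gamma terms of step 1; the numeric arguments are purely illustrative:

```python
import math

def log_coupling_ratio(theta1, z1, z2, x5):
    """log of pi(theta1 | z2) / pi(theta1 | z1) for a Dirichlet
    D(z_mu + 0.5, z_eta + 0.5, x5 + 0.5) conditional; lgamma keeps the
    Gamma-function normalizing constants under control."""
    mu1, eta1 = theta1
    zmu1, zeta1 = z1[0] + z1[1], z1[2] + z1[3]
    zmu2, zeta2 = z2[0] + z2[1], z2[2] + z2[3]
    return (math.lgamma(zmu2 + zeta2 + x5 + 1.5) - math.lgamma(zmu1 + zeta1 + x5 + 1.5)
            + math.lgamma(zmu1 + 0.5) + math.lgamma(zeta1 + 0.5)
            - math.lgamma(zmu2 + 0.5) - math.lgamma(zeta2 + 0.5)
            + (zmu2 - zmu1) * math.log(mu1) + (zeta2 - zeta1) * math.log(eta1))

# illustrative call: identical missing-data vectors give a ratio of 1 (log 0)
print(log_coupling_ratio((0.3, 0.4), [5, 2, 3, 1], [5, 2, 3, 1], 7))
```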

The reverse order coupling is

1. Take (z₂^(t), θ₂^(t)) = (z₁^(t), θ₁^(t)) with probability

π(z₁^(t) | θ₂^(t-1)) / π(z₁^(t) | θ₁^(t-1))

2. Else, generate

(z₂^(t), θ₂^(t)) ~ π(z, θ | θ₂^(t-1)),   u ~ U[0,1],



until

u > ∏_{i=1,2} (μ₁/μ₂)^{z_{2i}^(t)} ((aᵢμ₂ + bᵢ) / (aᵢμ₁ + bᵢ))^{xᵢ} × ∏_{i=3,4} (η₁/η₂)^{z_{2i}^(t)} ((aᵢη₂ + bᵢ) / (aᵢη₁ + bᵢ))^{xᵢ}

In this order, the average coupling time is quite similar, since it is equal to 2.71. To reproduce the comparison of Example 2.4.1, we integrated the posterior in η and μ to obtain the respective marginals in μ and η, using Mathematica for the case α₁ = ··· = α₃ = 1 (see also Robert, 1995). The comparison between the true posteriors and the histograms of the θ^(t)'s at coupling is more satisfactory than in Example 2.4.1, although there are still discrepancies around 0 and 1, as shown by Figure 2.3. A side consequence of this study is to show that the posterior marginals in μ and η are almost identical analytically. Double-coupling is much faster in this case, since the average coupling time is then 1.95. ||

FIGURE 2.3. Histogram of a sample of μ's (left) and η's (right) of size 25,000, at coupling time, obtained with a D(1, 1, 1) initial distribution, against the stationary distribution, (top) for the (θ, z) order and (middle) for the (z, θ) order. For comparison purposes, the histograms of the whole samples of μ's and η's are also plotted against the stationary distribution (bottom).

2.4.2 Coupling diagnoses


Although there are strong misgivings about the pertinence of a coupling strategy as a means of starting "in" the stationary distribution,7 coupling, and in particular optimal coupling, can be used to evaluate the warmup time to stationarity. As shown by Example 2.4.1, if one of the two chains is run without interruption, that is, with no restart at each coupling event, the average coupling time gives a convergent estimator of the mean number of iterations till stationarity, while the evaluation of the total variation

7Unless one uses backward coupling as in perfect simulation (see §1.4).


distance ‖μPⁿ − π‖_TV is more challenging, as seen in Johnson's (1996) attempts (see below). Note that different coupling strategies can be compared by this device, although independent coupling cannot work in continuous setups, since the probability of coupling is then 0.

An additional use of coupling is to compare different orderings in Gibbs setups. More precisely, given a decomposition (y₁, …, y_p) of y and the corresponding conditional distributions g₁, …, g_p of f, as in [A2], there are p! ways of implementing the Gibbs sampler, by selecting the order in which the components are generated. It is well known that this order is not innocuous, and that some strategies are superior to others. The superiority of random scan Gibbs samplers, where the successive components are chosen at random, either independently or in a multinomial fashion (which amounts to selecting a random permutation), has also been shown by Liu, Roberts and others. Since the practical consequences of these results are poorly known, the implementation of the corresponding sampling schemes has been rather moderate. Now, for a given ordering (or a given distribution for a random scan), the coupling time can be assessed by the above method, and this produces a practical comparison of various scans which should be useful in deciding which scheme to adopt. That important differences may occur has already been demonstrated in Examples 2.4.1-2.4.3.

Johnson (1996) suggests using coupling based on M parallel chains (θ_m^(t)) (1 ≤ m ≤ M) which are functions of the same sequence of uniform variables. (This is a particular case of the deterministic coupling mentioned above.) For the Gibbs sampler [A2], the coupling method can thus be written as

1. Generate M initial values θ_m^(0) (1 ≤ m ≤ M).

2. For 1 ≤ i ≤ p, 1 ≤ m ≤ M, generate θ_{m,i}^(t) from

θ_{m,i}^(t) = F_i^{-1}(u_{it} | θ_{m,1}^(t), …, θ_{m,i-1}^(t), θ_{m,i+1}^(t-1), …, θ_{m,p}^(t-1)),

where the uniform variable u_{it} is the same for all M chains.

3. Stop the iterations when θ_1^(T) = ··· = θ_M^(T).

In a continuous setup, the stopping rule 3. must be replaced by the approximation

max_{m,n} |θ_m^(T) − θ_n^(T)| < ε,

for 1 ≤ m, n ≤ M (or by the simultaneous visit to an atom, or yet by the simultaneous occurrence of a renewal event). In the algorithm [A7], F_j denotes the cdf of the conditional distribution g_j(θ_j | θ_1, …, θ_{j-1}, θ_{j+1}, …, θ_p).
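Johnson's common-uniform coupling is easy to sketch for a target where both conditional cdfs can be inverted, e.g. a bivariate normal Gibbs sampler; the correlation ρ, the number of chains M, the tolerance ε, and the overdispersed starting distribution below are illustrative assumptions of ours.

```python
import random
from statistics import NormalDist

random.seed(0)
rho = 0.9                    # correlation of an illustrative bivariate normal target
sd = (1 - rho ** 2) ** 0.5   # conditional standard deviation
M, eps = 5, 1e-6
Z = NormalDist().inv_cdf     # Phi^{-1}, used to invert the conditional cdfs

# M chains from overdispersed starting points, driven by COMMON uniforms
chains = [[random.uniform(-10, 10), random.uniform(-10, 10)] for _ in range(M)]

T = 0
while max(abs(c[0] - d[0]) + abs(c[1] - d[1]) for c in chains for d in chains) >= eps:
    T += 1
    u1, u2 = random.random(), random.random()  # the same uniforms for every chain
    for c in chains:
        c[0] = rho * c[1] + sd * Z(u1)  # F_1^{-1}(u1 | theta_2)
        c[1] = rho * c[0] + sd * Z(u2)  # F_2^{-1}(u2 | theta_1)
print(T)
```

Because every chain is pushed through the same inverse-cdf maps, the M paths contract toward one another and T estimates the coalescence time under the ε-stopping rule.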

A necessary condition for Johnson's (1996) method to apply is thus that the conditional distributions must be simulated by inversion of the cdf, which is a rare occurrence.8

As shown in Examples 2.4.1-2.4.3, this method induces in addition strong biases, besides being strongly dependent on the initial distribution.

8For instance, this excludes Metropolis-Hastings algorithms, a drawback noticed by Johnson (1996), as well as all accept-reject algorithms.