
Statistical Science, 2001, Vol. 16, No. 4, 351–367

Optimal Scaling for Various Metropolis–Hastings Algorithms

Gareth O. Roberts and Jeffrey S. Rosenthal

Abstract. We review and extend results related to optimal scaling of Metropolis–Hastings algorithms. We present various theoretical results for the high-dimensional limit. We also present simulation studies which confirm the theoretical results in finite-dimensional contexts.


    1. INTRODUCTION

Metropolis–Hastings algorithms are an important class of Markov chain Monte Carlo (MCMC) algorithms [see, e.g., Smith and Roberts (1993), Tierney (1994) and Gilks, Richardson and Spiegelhalter (1996)]. Given essentially any probability distribution (the target distribution), these algorithms provide a way to generate a Markov chain $X_0, X_1, \ldots$ having the target distribution as a stationary distribution.

Specifically, suppose that the target distribution has density $\pi$ (usually with respect to Lebesgue measure). Then, given $X_n$, a proposed value $Y_{n+1}$ is generated from some prespecified density $q(X_n, y)$ and is then accepted with probability $\alpha(X_n, Y_{n+1})$, given by
$$\alpha(x, y) = \begin{cases} \min\!\left( \dfrac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)},\; 1 \right), & \pi(x)\, q(x, y) > 0, \\ 1, & \pi(x)\, q(x, y) = 0. \end{cases} \qquad (1)$$

If the proposed value is accepted, we set $X_{n+1} = Y_{n+1}$; otherwise, we set $X_{n+1} = X_n$. The function $\alpha(x, y)$ above is chosen precisely to ensure that the Markov chain $X_0, X_1, \ldots$ is reversible with respect to the target density $\pi(y)$, so that the target density is stationary for the chain.

Gareth O. Roberts is Professor of Statistics, Department of Mathematics and Statistics, Fylde College, Lancaster University, Lancaster LA1 4YF, England (e-mail: [email protected]). Jeffrey S. Rosenthal is Professor, Department of Statistics, University of Toronto, Toronto, Ontario, Canada M5S 3G3 (e-mail: [email protected]).

In applying Metropolis–Hastings algorithms, it is necessary to choose the proposal density $q(x, y)$. Typically, $q$ is chosen from some family of distributions, for example, normal distributions centered at $x$. There is then a need to select the scaling of the proposal density (e.g., the variance of the normal distributions) in order to have some level of optimality in the performance of the algorithm.

An important special case of the Metropolis–Hastings method is the symmetric random-walk Metropolis (RWM) algorithm. In this case, we take the Markov chain described by $q$ alone to be a simple symmetric random walk, so that $q(x, y) = q(y - x)$ and the single-argument $q$ density (sometimes called the increment density) is a symmetric function about 0. In this case, the acceptance probability in (1) simplifies to

$$\alpha(x, y) = \min\!\left( 1,\; \frac{\pi(y)}{\pi(x)} \right).$$

    1.1 A First Example

A simple example illustrates the issues involved. Suppose the target $\pi(y)$ is the standard normal density. Suppose also that the proposal density $q(x, y)$ is taken to be the normal density $N(x, \sigma^2)$, where $\sigma$ is to be chosen. That is, if $X_n = x$, then we choose $Y_{n+1} \sim N(x, \sigma^2)$ and set $X_{n+1}$ to either $Y_{n+1}$ [with probability $\alpha(x, y)$] or $X_n$ [with probability $1 - \alpha(x, y)$], as above.

It is intuitively clear that we can make the algorithm arbitrarily poor by making $\sigma$ either very small or very large. For extremely small $\sigma$, the algorithm will propose small jumps. These will almost all be accepted (since $\alpha$ will be approximately 1, because of the continuity of $\pi$ and $q$). However, the size of the jumps is too small for the


Fig. 1. Standard normal target density with proposal distributions for a normal proposal random-walk Metropolis algorithm with three alternative proposal scalings. The alternative proposals are all centered about the current point x = 1 and are shown using dotted lines.

algorithm to explore the space rapidly, and the algorithm will therefore take a long time to converge to its stationary distribution. On the other hand, if $\sigma$ is taken to be extremely large, the algorithm will nearly always propose large jumps to regions where $\pi$ is extremely small. It will therefore reject most of its proposed moves and hence stay fixed for large numbers of iterations. It seems reasonable that there exist good values for $\sigma$, between these two extremes, where the algorithm performs optimally. This is illustrated by simulation in Figure 1.
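To make this example concrete, the following is a minimal sketch of the sampler in Python (assuming NumPy; the function name and default arguments are our own illustration, not from the paper). The returned acceptance fraction is simply the long-run proportion of accepted moves.

```python
import numpy as np

def rwm_normal(sigma, n_iter=100_000, x0=0.0, seed=0):
    """Random-walk Metropolis for a standard normal target.

    Proposals are Y ~ N(x, sigma^2), accepted with probability
    min(1, pi(y)/pi(x)) = min(1, exp((x^2 - y^2)/2)).
    """
    rng = np.random.default_rng(seed)
    chain = np.empty(n_iter)
    x, accepted = x0, 0
    for n in range(n_iter):
        y = x + sigma * rng.normal()
        if np.log(rng.uniform()) < (x * x - y * y) / 2.0:
            x, accepted = y, accepted + 1
        chain[n] = x
    return chain, accepted / n_iter
```

Running this with, say, sigma = 0.1, 2.4 and 25 reproduces the three types of behavior shown in Figure 2.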

Figure 2 shows trace plots giving examples of all three types of behavior: a situation where the proposal variance is too high, so the chain gets stuck in different regions of the space; a situation where the proposal variance is too low, so that the chain crawls to stationarity; and a case where the proposal variance is tuned appropriately (using Theorem 1 below) and the chain converges at a reasonable rate.

1.2 Efficiency of Markov Chains and Related Concepts

To compare different implementations of MCMC, we require some notion of efficiency of Markov chains to guide us. For an arbitrary square-integrable function $g$, we define its integrated autocorrelation time by
$$\tau_g = 1 + 2 \sum_{i=1}^{\infty} \mathrm{Corr}\bigl( g(X_0), g(X_i) \bigr),$$

where $X_0$ is assumed to be distributed according to $\pi$. If a central limit theorem for $X$ and $g$ exists, then the variance of the estimator $\sum_{i=1}^{n} g(X_i)/n$ for estimating $E_\pi g(X)$ is approximately $\mathrm{Var}_\pi\bigl(g(X)\bigr)\, \tau_g / n$. This suggests that the efficiency of Markov chains can be compared by comparing the reciprocals of their integrated autocorrelation times, that is,
$$e_g = \Bigl( \mathrm{Var}_\pi\bigl(g(X)\bigr)\, \tau_g \Bigr)^{-1} = \lim_{n \to \infty} \Bigl( n\, \mathrm{Var}\Bigl( \frac{1}{n} \sum_{i=1}^{n} g(X_i) \Bigr) \Bigr)^{-1}. \qquad (2)$$
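As an illustration, $\tau_g$ and $e_g$ can be estimated empirically from a single run; a minimal sketch follows (assuming NumPy; the truncation of the sum at a fixed maximum lag is our own crude choice, not specified by the paper).

```python
import numpy as np

def integrated_act(g_values, max_lag=200):
    """Estimate tau_g = 1 + 2 * sum_i Corr(g(X_0), g(X_i)),
    truncating the sum at max_lag."""
    x = np.asarray(g_values, dtype=float)
    x = x - x.mean()
    n, var = len(x), x.var()
    rho = [x[: n - k] @ x[k:] / ((n - k) * var)
           for k in range(1, max_lag + 1)]
    return 1.0 + 2.0 * float(np.sum(rho))

def efficiency(g_values, max_lag=200):
    """e_g = (Var(g(X)) * tau_g)^(-1), as in (2)."""
    x = np.asarray(g_values, dtype=float)
    return 1.0 / (x.var() * integrated_act(x, max_lag))
```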

However, this measure of efficiency is highly dependent on the function $g$ chosen. Thus, for two different Markov chains, different functions $g$ could order their efficiency differently. Where specific interest is in a particular function $g$, $e_g$ is a sensible criterion to be using, but where interest is in a whole collection of functionals of the target distribution (perhaps the cdf of a component of interest, for example), its use is more problematic.

We shall see later in Section 2.2 that, in the high-dimensional limit, at least for algorithms which behave like diffusion processes, all efficiency measures $e_g$ are virtually equivalent. In such cases, we can produce a unique limiting efficiency criterion. We shall use this criterion to define a function-independent notion of efficiency.

It will turn out that a quantity related to efficiency is the algorithm's acceptance rate, that is, the probability in stationarity that the chain's proposed move is accepted. Alternatively, by the ergodic theorem, the acceptance rate equals the long-term proportion of proposed moves accepted by the chain:
$$a = \int\!\!\int \alpha(x, y)\, \pi(x)\, q(x, y)\, dx\, dy = \lim_{n \to \infty} n^{-1}\, \#\{\text{accepted moves}\}, \qquad (3)$$
the latter identity indicating a sensible way to estimate $a$ without extra computational effort.

    1.3 Goal of This Paper

The goal of this paper is to review and further develop guidelines for choosing good values of $\sigma$ in situations such as those in Section 1.1, especially when the dimension of the space is large. The guidelines we describe are in terms of the asymptotic overall acceptance rate of the chain, which makes them easily implementable. Much, but not all, of what follows is a synthesis of Roberts, Gelman and Gilks (1997), Roberts (1998), Roberts and Rosenthal (1998) and Breyer and Roberts (2000). Extensions to consider heterogeneity in scale are also given.

Our results provide theoretical justification for a commonly used strategy for implementing the multivariate random-walk Metropolis algorithm, which dates back at least as far as Tierney (1994).


Fig. 2. Simple Metropolis algorithm with (a) too-large variance (left plots), (b) too-small variance (middle) and (c) appropriate variance (right). Trace plots (top) and autocorrelation plots (below) are shown for each case.

The strategy involves estimating the correlation structure of the target distribution, either empirically based on a pilot sample of MCMC output or perhaps numerically from curvature calculations on the target density itself, and using a proposal distribution for the random-walk algorithm which is a scalar multiple of the estimated correlation matrix. In the Gaussian case, if the correlation structure is accurately estimated, then scaling the proposal in each direction proportionally to the target scaling can be shown to optimize the algorithm's efficiency.

With knowledge of such optimal scaling properties, the applied user of Metropolis–Hastings algorithms can tune their proposal variances by running various versions of the algorithm, with variance at the appropriate level [e.g., $O(d^{-1})$ or $O(d^{-1/3})$, say], and making finer adjustments to get the approximate acceptance rate close to optimal.

Thus, optimal scaling results are of importance in the practical implementation of Metropolis–Hastings algorithms.
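For instance, a pilot-tuning loop of the kind just described might look as follows (a sketch; the multiplicative update rule is one simple choice among many and is not prescribed by the paper). Here run_chain(sigma, n_iter) is assumed to return a pair (chain, acceptance_rate), as in the sketch of Section 1.1.

```python
import numpy as np

def tune_scale(run_chain, target_accept=0.234, sigma0=1.0,
               rounds=10, n_pilot=2000):
    """Adjust sigma until the observed acceptance rate is close to
    the target (0.234 for RWM; 0.574 would be used for MALA)."""
    sigma = sigma0
    for _ in range(rounds):
        _, acc = run_chain(sigma, n_pilot)
        # accepting too often -> jumps too small -> increase sigma
        sigma *= np.exp(acc - target_accept)
    return sigma
```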

    1.4 Outline of This Paper

In Section 2, we shall review the results of Roberts, Gelman and Gilks (1997) for random-walk Metropolis (RWM) algorithms with target distribution consisting of approximately i.i.d. components. They showed that optimality is achieved if the variance of the proposal distribution is $O(d^{-1})$ and the overall acceptance rate (i.e., the long-run expected proportion of accepted moves) is close to 0.234. In Section 3, we review the results of Roberts (1998), who showed that the same acceptance rate is approximately optimal for a quite different class of discrete RWM algorithms.

    In Section 4, we consider the Metropolis-adjusted Langevin (MALA) algorithms, which


are Metropolis–Hastings algorithms whose proposal distributions make clever use of the gradient of the target density. We review results of Roberts and Rosenthal (1998), who proved that, for target distributions of the form (5), optimality is achieved if the variance of the proposal distribution is $O(d^{-1/3})$ and we have an overall acceptance rate close to 0.574. That is, MALA algorithms have a larger optimal proposal variance and a larger optimal acceptance rate. This demonstrates that MALA algorithms asymptotically mix considerably faster than do RWM algorithms.

In Section 5, we shall review the results of Breyer and Roberts (2000), who showed that the same acceptance rate of 0.234 is again optimal for the random-walk Metropolis algorithm on a class of Markov random-field models, but only when the local correlations are small enough to avoid phase-transition behavior.

In Section 6, we extend some of the above results to cases other than target distributions of the form (5). In particular, we consider the case where the target distribution consists of components which each have a different scaling $C_i$. Now, if the $C_i$ values are known, then it is best to scale the proposal distribution components to match the target scalings $C_i$; that way, the resulting chain corresponds precisely to the previous case where all components are identical. However, if the quantities $C_i$ are unknown and the proposal scaling is taken to be the same in each component, then we prove the following. Once again, the optimal acceptance rate is close to 0.234. However, in this case, the algorithm's asymptotic efficiency is multiplied by a factor $E(C_i)^2 / E(C_i^2)$ compared to what it could have been with different proposal distribution scaling in each component. Since we always have $E(C_i)^2 / E(C_i^2) < 1$ for nonconstant $C_i$, this shows that the RWM algorithm becomes less efficient when the components of the target distribution have significantly different scalings. For MALA algorithms, we show that this effect is even stronger, depending instead on the sixth moments of the target component scalings $C_i$. Simulations are presented which confirm our theoretical results.

In Section 7, we consider various examples related to our theorems. In some cases, the examples fit in well with our theorems, and we provide simulations to illustrate this. In other cases, the examples fall outside our theoretical results and exhibit different behavior, as we describe.

2. PROPOSAL VARIANCE OF CONTINUOUS i.i.d. RWM

Suppose we are given a density $\pi$ with respect to Lebesgue measure on $\mathbf{R}^d$ and a class of symmetric proposal increment densities for the RWM algorithm given by
$$q_\sigma(x) = \sigma^{-d}\, q(x / \sigma), \qquad (4)$$
where $q$ is some fixed density and where $\sigma > 0$ denotes some measure of dispersion (typically the standard deviation) of the distribution with density $q_\sigma$. Thus, from a random variable $Z$ with density $q$, we can produce one with density $q_\sigma$ as $Z_\sigma = \sigma Z$. As already noted, we can make the Metropolis–Hastings algorithm inefficient by taking $\sigma$ either very small or very large, and our goal is to determine optimal choices of $\sigma$.

For ease of analysis, we here let $\pi$ have the simple product form
$$\pi(x) = \prod_{i=1}^{d} f(x_i) \qquad (5)$$
and suppose that the proposal is of the form
$$q(x)\, dx \sim N(0,\, I_d\, \sigma_d^2), \qquad (6)$$
a normal distribution with mean 0 and variance $\sigma_d^2$ times the identity matrix. Our goal is to characterize the optimal values of $\sigma_d^2$ in a practically useful way.

We shall assume that there exists a positive constant $\ell$ such that
$$\sigma_d^2 = \ell^2 / d. \qquad (7)$$
Indeed, if the variance is larger than $O(d^{-1})$, then the acceptance rate of the algorithm converges to 0 too rapidly, whereas for smaller-order scalings, the jumps of the algorithm (which are almost all accepted) are too small. That is, it can be shown that taking $\sigma_d^2$ to be $O(d^{-1})$ is optimal. Our goal is then to optimize the choice of $\ell$ in (7) above.

For fixed $d$, the algorithm involves $d$ components, $X^1, \ldots, X^d$, say, which are constantly interacting with each other. Thus, there is no reason to think that any proper subset of the $d$ components should itself form a Markov chain. However, for target densities of the form (5), each component acts like an independent Markov chain as $d \to \infty$. Therefore, by considering any one component (say $X^1$), by independence this gives us information about the behavior of any finite collection of the components of interest. For simplicity, therefore, we shall write all our results in terms of convergence of single components.

Note that, when we consider dependent densities in Section 5, asymptotic independence of the constituent components is not achieved; therefore, to consider the limiting behavior, we need to look at a genuinely infinite-dimensional limit process. As a result of this, we refrain from a formal statement of that result.


2.1 RWM Algorithm as $d \to \infty$

For each component, since we are making steps of decreasing size according to (7) as $d \to \infty$, any individual component's movement will grind to a halt unless we somehow speed up time to allow steps of the algorithm to happen more quickly. We do this by stipulating that the algorithm is updated every $d^{-1}$ time units. As $d \to \infty$, therefore, the algorithm makes smaller jumps more and more frequently, and its limit needs to be described as a continuous-time stochastic process. Since its jumps get smaller and smaller, the limiting trajectory will resemble a continuous sample path, and, as we shall see, it will be a diffusion process.

Let the RWM chain on $\mathbf{R}^d$ be denoted $(X_n^1, \ldots, X_n^d)$ and consider the related one-dimensional process given by
$$Z_t^d = X^1_{[td]}, \qquad (8)$$
where $[\cdot]$ denotes the integer-part function. That is, $Z^d$ is a sped-up continuous-time version of the first coordinate of the original algorithm, parameterized to make jumps [of size $O(\sigma_d) = O(d^{-1/2})$, by (7)] every $d^{-1}$ time units. The process $Z^d$ is not itself Markovian, since it only considers the first component of a $d$-dimensional chain. However, in the limit as $d$ goes to $\infty$, the process $Z^d$ converges to a Markovian limit, as we now describe.

We require a number of smoothness conditions on the density $f$ in (5). The details of sufficient conditions appear in Roberts, Gelman and Gilks (1997), although these conditions can be weakened also. The essential requirements are that $\log f$ is continuously differentiable and that
$$I \equiv E_f\biggl[ \Bigl( \frac{f'(X)}{f(X)} \Bigr)^{\!2} \biggr] < \infty. \qquad (9)$$
We denote weak convergence by $\Rightarrow$, let $B_t$ be standard Brownian motion and write $\Phi(z) = (1/\sqrt{2\pi}) \int_{-\infty}^{z} e^{-s^2/2}\, ds$ for the cumulative distribution function of a standard normal distribution. The following result is taken from Roberts, Gelman and Gilks (1997).

Theorem 1. Consider the RWM $(X_n^1, \ldots, X_n^d)$ on $\mathbf{R}^d$. Define the process $Z_t^d$ by (8). Suppose that $\pi$ and $q$ are given by (5) and (6), respectively, with $\sigma_d^2$ given by (7). Under the above regularity conditions on $f$,
$$Z^d \Rightarrow Z, \qquad (10)$$
where $Z$ is a diffusion process which satisfies the stochastic differential equation
$$dZ_t = h(\ell)^{1/2}\, dB_t + h(\ell)\, \frac{(\log f(Z_t))'}{2}\, dt \qquad (11)$$
with
$$h(\ell) = 2 \ell^2\, \Phi\bigl( -\ell \sqrt{I} / 2 \bigr) = \ell^2 A(\ell, I) \qquad (12)$$
and $I$ given by (9). Here the acceptance rate is $a = A(\ell, I) = 2 \Phi( -\ell \sqrt{I} / 2 )$. The speed of the limiting diffusion, as a function of this acceptance rate, is proportional to
$$A(\ell, I)\, \Bigl( \Phi^{-1}\Bigl( \frac{A(\ell, I)}{2} \Bigr) \Bigr)^{\!2}. \qquad (13)$$
The scaling which gives rise to the optimal limiting speed of the diffusion, and hence the optimal asymptotic efficiency of the algorithm, is given by
$$\ell_{\mathrm{opt}} = 2.38\, I^{-1/2},$$
and the corresponding optimal acceptance rate is
$$A(\ell_{\mathrm{opt}}, I) = 0.234.$$
For any fixed function $g$, the optimal asymptotic efficiency $e_g$ as given in (2) is proportional to $1/d$.

This theorem thus says that, for RWM algorithms on target densities of the form $\prod_{i=1}^{d} f(x_i)$, the efficiency of the algorithm as a function of its asymptotic acceptance rate can be explicitly described and optimized. In particular, acceptance rates approximately equal to 0.234 lead to greatest efficiency of the algorithm. Since the process $Z^d$ had time scaled by a factor of $d$, the theorem also says that the convergence time of the algorithm grows with dimension like $O(d)$ [or, equivalently, the efficiency is $O(1/d)$]. Hence, if the computation time for completing each iteration of the algorithm grows with $d$ (which appears likely for most examples), then the overall complexity of the algorithm is $O(d^2)$.

The quantity $I$, which is the variance of the derivative of the log density of $f$, can be interpreted as a measure of roughness of the target distribution. It measures local variation in $f$ rather than any global scale property. For Gaussian $f$, $I$ is the reciprocal of the variance of the density $f$, so in this case (12) reduces to
$$h(\ell) = 2 \ell^2\, \Phi\Bigl( \frac{-\ell}{2 \sqrt{\mathrm{Var}\, f}} \Bigr).$$

The first graph in Figure 3 shows the function $h(\ell)$ of Theorem 1, as a function of $\ell$, for fixed $I$. Different $I$ values will distort the curve, causing its maximum to appear to the left or right for higher or lower values of $I$, respectively (recall $\ell_{\mathrm{opt}} = 2.38\, I^{-1/2}$). The second graph shows $h$ as a function of $A(\ell)$, specifically as the function $a \mapsto a\, \bigl( \Phi^{-1}(a/2) \bigr)^2 \cdot 4/I$. Note that the shape of this second curve is independent


Fig. 3. Efficiency of RWM as a function of $\ell$ (top) and of acceptance rate (bottom), in the infinite-dimensional limit. In this case, I = 1.

of $I$ and hence of $f$, apart from the global multiplicative factor of $I^{-1}$. Hence, the optimal acceptance rate is approximately $2 \Phi( -\ell_{\mathrm{opt}} \sqrt{I} / 2 ) = 0.234$. Since the constant $4/I$ plays no role in the optimization of efficiency, we shall omit it when talking about the algorithm's relative efficiency.
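The 0.234 figure is easy to verify numerically from this efficiency curve; a minimal sketch (assuming SciPy for the standard normal quantile function):

```python
import numpy as np
from scipy.stats import norm

# Relative efficiency of RWM as a function of the acceptance rate a,
# proportional to a * (Phi^{-1}(a/2))^2; the 4/I factor is omitted.
a = np.linspace(1e-4, 0.999, 200_000)
eff = a * norm.ppf(a / 2.0) ** 2
print(a[np.argmax(eff)])  # prints approximately 0.234
```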

Remark. It must be stressed that Theorem 1 is an asymptotic result, valid for large $d$. In fact, the appropriate scaling in Figure 2(c), which minimizes the first-order autocorrelation of the algorithm, achieves acceptance rate 0.44; the asymptotics do not directly apply since $d = 1$ there. However, even in five dimensions, the optimal acceptance rate is so close to 0.234 as to not matter at all in practice. This can be seen in Figure 4, which is computed using the criterion of minimizing the first-order autocorrelation of the identity function for the algorithm on standard Gaussian densities. These data were taken from a simulation study which was first reported in Gelman, Roberts and Gilks (1996).

One interpretation of the efficiency curve, the second plot in Figure 3, is as follows. The number of iterations needed to reach a prescribed accuracy in estimating any function is proportional to the inverse of this efficiency. As a result, an algorithm with, say, 50% of the optimal efficiency would need to be run for twice as long to obtain the same accuracy of results. On the other hand, the importance of the precise optimal acceptance rate should not be overstated. The relative efficiency curve given in Figure 3 is relatively flat around 0.234, and any

Fig. 4. Optimal acceptance rate as a function of dimension, using the minimum autocorrelation criterion, for the case of Gaussian target densities. This analysis comes from a simulation study on standard Gaussian target densities.

algorithm with acceptance rate between, say, 0.15 and 0.5 will be at least 80% efficient. It is therefore of little value to finely tune algorithms to the exact optimal values.

It is likely that the assumption of second-order differentiability could be relaxed to some extent, while leaving the asymptotics of Theorem 1 intact. On the other hand, the asymptotics of Theorem 1 may be completely altered if $f$ is allowed to actually be discontinuous. For example, let $f$ be the indicator function of $[0, 1]$. Then it is fairly straightforward to show that any scaling of the form (7) for any $\ell > 0$ will lead to acceptance rates converging to zero (in fact, exponentially quickly). In this case, optimal scaling will have to scale the jumps in each dimension by a term which converges to zero quicker than $d^{-1/2}$. Such questions are currently being investigated by Roberts and Yuen (2001); initial investigations suggest that the correct scaling in this case is $\sigma^2 = \ell^2 / d^2$, with an optimal acceptance rate of approximately 0.13.

Theorem 1 applies only to the special case of a density of the form $\prod_{i=1}^{d} f(x_i)$, that is, corresponding to i.i.d. components. Minor generalizations are possible; for example, it suffices to have a density of the form $\prod_{i=1}^{d} f_i(x_i)$ where $f_i \to f$ [see Roberts, Gelman and Gilks (1997)]. However, it is reasonable to ask what happens if the different components are not identically distributed or are not independent. Such issues are examined in subsequent sections.


2.2 Assessing Efficiency of Markov Chains

Now, as already discussed in Section 1.2, the efficiency $e_g$ of a Markov chain can depend on the function $g$ for which we are trying to obtain Monte Carlo estimates. Therefore, we cannot rely on integrated autocorrelation times for some particular function $g$ to provide unambiguous criteria by which to assess efficiency.

However, if we look at high-dimensional problems in situations where the algorithm can be shown to converge to a diffusion process as in Theorem 1, it does not matter which function $g$ we choose, and all functions lead to essentially the same notion of efficiency.

Indeed, suppose we let $g$ be a function of the first variable $X^1$ of the chain. Recall that, for estimation of $\pi(g) = \int g(x)\, \pi(x)\, dx$ for a function $g$ using (4), a natural measure of efficiency is the inverse autocorrelation time formula (2). Define also
$$e_g^{\infty} = \lim_{T \to \infty} \Bigl( T\, \mathrm{Var}\Bigl( \frac{1}{T} \int_0^T g(Z_t)\, dt \Bigr) \Bigr)^{-1} \qquad (14)$$
and the corresponding quantity for the standard Langevin diffusion,
$$e_g^{L} = \lim_{T \to \infty} \Bigl( T\, \mathrm{Var}\Bigl( \frac{1}{T} \int_0^T g(L_t)\, dt \Bigr) \Bigr)^{-1}. \qquad (15)$$

Now from (14) it is easy to see that $e_g^{\infty} = h(\ell)\, e_g^{L}$. Furthermore, from (2), for large $d$ we can write
$$e_g^d(\ell d^{-1/2}) = \lim_{T \to \infty} \Bigl( Td\, \mathrm{Var}\Bigl( \frac{1}{Td} \sum_{i=1}^{Td} g(X^1_i) \Bigr) \Bigr)^{-1} = \lim_{T \to \infty} \Bigl( Td\, \mathrm{Var}\Bigl( \frac{1}{Td} \sum_{i=1}^{Td} g(Z^d_{i/d}) \Bigr) \Bigr)^{-1} \approx \lim_{T \to \infty} \Bigl( Td\, \mathrm{Var}\Bigl( \frac{1}{T} \int_0^T g(Z_s)\, ds \Bigr) \Bigr)^{-1} \approx d^{-1} e_g^{\infty}, \qquad (16)$$
so that
$$\lim_{d \to \infty} d\, e_g^d(\ell d^{-1/2}) = e_g^{\infty} = h(\ell)\, e_g^{L}. \qquad (17)$$
Since the only term on the right-hand side which depends on our scaling parameter $\ell$ is $h(\ell)$, which is independent of the function of interest $g$, it follows that the corresponding optimization problem is independent of $g$ as well. That is why Theorem 1 is useful regardless of the function $g$ under investigation.

Note also that, for a diffusion process satisfying (11), the speed measure $h(\ell)$ can be understood in terms of simple autocorrelations of an arbitrary function $g$. Indeed, it is easily demonstrated that, for small $\delta > 0$, there is a positive constant $B_g$, which depends on the target density and the function of interest $g$, such that
$$\mathrm{Corr}\bigl( g(Z_\delta), g(Z_0) \bigr) \approx 1 - \delta\, B_g\, h(\ell). \qquad (18)$$
Hence, maximizing $h(\ell)$ is equivalent to minimizing $\mathrm{Corr}(g(Z_\delta), g(Z_0))$. For computational reasons, it is most convenient to just estimate single autocorrelations $\rho_K$ of order $K > 0$ for some chosen function(s) $g$. We shall also translate our results in terms proportional to the number of iterations needed to estimate $g$ to a desired accuracy, by defining
$$\text{convergence time} = \frac{-K}{\log \rho_K}.$$
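In practice this convergence time can be estimated directly from the empirical lag-$K$ autocorrelation of a run; a minimal sketch (assuming NumPy, and assuming $0 < \hat{\rho}_K < 1$):

```python
import numpy as np

def convergence_time(g_values, k=1):
    """Estimate -K / log(rho_K) from the empirical lag-k
    autocorrelation of g(X_n); assumes 0 < rho_k < 1."""
    x = np.asarray(g_values, dtype=float)
    x = x - x.mean()
    n = len(x)
    rho_k = (x[: n - k] @ x[k:]) / ((n - k) * x.var())
    return -k / np.log(rho_k)
```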

    3. RWM IN A DISCRETE STATE SPACE

Theorem 1 says that, in the limit as $d \to \infty$, certain RWM algorithms on continuous state space problems look like diffusion processes, and diffusion processes only make sense on a continuous state space. Hence, we might expect different asymptotic behavior for discrete state space problems. To investigate this, we consider the following discrete problem, which can be thought of as an analog of the continuous state space problem of Section 2.

Let $\pi : \{0, 1\}^d \to [0, 1]$ be the product measure of a collection of $d$ independent random variables each taking the value 0 with probability $p$, or 1 with probability $1 - p$. Thus, $\pi$ is a discrete distribution on the vertices of a $d$-dimensional hypercube given by
$$\pi(i_1, i_2, \ldots, i_d) = p^{\#\{j :\, i_j = 0\}}\, (1 - p)^{\#\{j :\, i_j = 1\}}. \qquad (19)$$

Assume without loss of generality that $p < 1/2$.

We shall consider the following random-scan version of the RWM algorithm, for which we can find an explicit solution. Suppose that, at each iteration, the algorithm picks $r$ sites uniformly at random, $S = \{j_1, j_2, \ldots, j_r\}$, say. The proposed move is then to change the value at each site in $S$ and to leave all sites in $\{1, \ldots, d\} \setminus S$ unaltered. From a current state $x$, then, the algorithm proposes a move to state $y$ with $y_j = x_j$ for $j \notin S$ and $y_j = 1 - x_j$ for $j \in S$. The acceptance probability is given by the usual $\alpha(x, y) = 1 \wedge \pi(y)/\pi(x)$.

In this context, the notion of scaling needs to be modified: the number of sites $r$ in $S$ takes the place of the variance in this situation. Clearly, the optimal choice of $r$ will depend heavily on $p$. For $p$ close to $1/2$, it will be possible to propose updates of large numbers of components without changing the value of $\pi$ drastically. These moves will therefore be accepted with reasonable probability, and so the


optimal value of $r$ is likely to be large. Conversely, for small values of $p$, if $r$ is large, most proposed moves are likely to be rejected.
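A direct implementation of this random-scan algorithm is straightforward; the following sketch (assuming NumPy, and using the convention of (19) that a site equals 0 with probability $p$) exploits the fact that flipping a 0 multiplies $\pi$ by $(1-p)/p$ and flipping a 1 by its reciprocal.

```python
import numpy as np

def discrete_rwm(d, p, r, n_iter=100_000, seed=0):
    """Random-scan RWM on {0,1}^d for the product measure (19):
    propose flipping r uniformly chosen sites, accept with
    probability 1 ^ pi(y)/pi(x). Returns the acceptance rate."""
    rng = np.random.default_rng(seed)
    x = (rng.uniform(size=d) > p).astype(int)  # stationary start: 0 w.p. p
    log_odds = np.log((1 - p) / p)             # effect of flipping a 0 to a 1
    accepted = 0
    for _ in range(n_iter):
        sites = rng.choice(d, size=r, replace=False)
        # (number of 0s flipped) - (number of 1s flipped), times log_odds
        delta = log_odds * (r - 2 * x[sites].sum())
        if np.log(rng.uniform()) < delta:
            x[sites] = 1 - x[sites]
            accepted += 1
    return accepted / n_iter
```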

We again investigate the optimality problem in the limiting case as $d \to \infty$. Fix $r$ (so that it does not depend on $d$) and let $d$ go to infinity along odd values. Let $S_t^d = X^1_{[td]}$, so $S^d$ is a continuous-time binary process (non-Markovian) making jumps at times which are integer multiples of $1/d$, analogous to the continuous model. The following result is taken from Roberts (1998).

Theorem 2. (i) Assume $X_0$ is distributed according to $\pi$. Then, as $d \to \infty$,
$$S^d \Rightarrow S,$$
where $S$ is a two-state continuous-time Markov chain with stationary distribution $(p, 1 - p)$. The $Q$-matrix describing the transition rates for $S$ has the form
$$Q = e(r) \begin{pmatrix} -(1 - p) & 1 - p \\ p & -p \end{pmatrix},$$
where $e(r)$ is available and given in Roberts (1998) as an explicit binomial expectation.

(ii) Let $a(p, r)$ denote the acceptance rate of the algorithm for fixed target and algorithm parameters $p$ and $r$, respectively. Let $p \to 1/2$ in such a way that $\lambda = (1/2 - p)^2\, r$ remains constant. Then, for $p$ close to $1/2$, we can write
$$e(r) \approx r\, a(p, r) \approx 2 r\, \Phi\Bigl( -2 \bigl( (1/2 - p)^2\, r \bigr)^{1/2} \Bigr) = \frac{2 \lambda}{(1/2 - p)^2}\, \Phi\bigl( -2 \sqrt{\lambda} \bigr).$$
If $p \approx 1/2$, then the optimal choice of $r$ is approximately that achieving acceptance rate $a = 0.234$, and the efficiency curve as a function of acceptance rate converges to that of continuous RWM as shown in Figure 3, bottom.

This theorem says that, in the discrete $\{0, 1\}^d$ model, if $p \approx 1/2$, then the optimal scaling properties are very similar to those of the continuous $\prod_{i=1}^{d} f(x_i)$ model of the previous section.

4. MALA ALGORITHM

As in the RWM case, we shall again consider the case where the target density takes the product form (5). We will consider the MALA proposal given by
$$Y_{n+1} \sim N\Bigl( X_n + \frac{\sigma_d^2}{2} \nabla \log \pi(X_n),\; \sigma_d^2\, I_d \Bigr). \qquad (20)$$

This proposal is chosen so as to mimic a Langevin diffusion for the density $\pi(y)$ and, therefore, provide a smarter choice of proposal. In particular, since the proposal (20) tends to move the chain in the direction of $\nabla \log \pi(X_n)$, it tends to move toward larger values of $\pi$, which tends to help the chain converge to $\pi$ faster. For more details, see, for example, Roberts and Tweedie (1996) and Roberts and Rosenthal (1998).

We again ask how the optimal value of $\sigma_d^2$ should depend on $d$ for large $d$ and investigate how these optimal scalings can be characterized in practically useful ways. As a first attempt to solve this problem, we might again set $\sigma_d^2 = \ell^2 / d$ as for the RWM case and define
$$Z_t^d = X^1_{[td]}.$$

    Z dt = X1td

Using this approach, it turns out that the overall acceptance rate converges to 1 and $Z^d \Rightarrow Z$ as $d \to \infty$, where, in this case,
$$dZ_t = \ell\, dB_t + \ell^2\, \frac{(\log f(Z_t))'}{2}\, dt. \qquad (21)$$

The speed of this limiting algorithm is $\ell^2$, which is unbounded, and therefore the limiting optimization problem is ill posed. The fact that the speed is unbounded, as we choose arbitrarily large $\ell$, suggests that larger variances for proposals should be adopted in this case.

The scaling $\sigma_d^2 = O(d^{-1/3})$ was suggested in the physics literature [Kennedy and Pendleton (1991)]. This turns out to be the correct scaling rate and defines a limiting optimal scaling. The following result is taken from Roberts and Rosenthal (1998).

Theorem 3. Consider a Metropolis–Hastings chain $X_0, X_1, \ldots$ for a target distribution having density $\pi$ and with MALA proposals as in (20). Suppose $\pi$ satisfies (5). Let $\sigma_d^2 = \ell^2 / d^{1/3}$ and set
$$Z_t^d = X^1_{[d^{1/3} t]},$$
where $X^1_n$ is the first component of $X_n$. Then, assuming various regularity conditions on the densities $f$ described in detail in Roberts and Rosenthal (1998), as $d \to \infty$, we have the following:

(i) $Z^d$ converges weakly to the continuous-time process $Z$ satisfying
$$dZ_t = g(\ell)^{1/2}\, dB_t + g(\ell)\, \frac{(\log f(Z_t))'}{2}\, dt,$$
where
$$g(\ell) = 2 \ell^2\, \Phi\bigl( -J \ell^3 \bigr) \qquad (22)$$


and $J$ is given by
$$J = \biggl( E\Bigl[ \frac{5 \bigl( (\log f(X))''' \bigr)^2 - 3 \bigl( (\log f(X))'' \bigr)^3}{48} \Bigr] \biggr)^{1/2}, \qquad (23)$$
where the expectation is with respect to $f$.

(ii) The acceptance rate $a$ of the algorithm is given by $2 \Phi( -J \ell^3 )$, and the scaling which gives optimal asymptotic efficiency is that having asymptotic acceptance rate equal to 0.574.

Thus, as for the RWM case, the optimality can be characterized in terms of the acceptance rate in a manner which is otherwise independent of properties of $f$. The MALA algorithm thus has a smaller convergence time [$O(d^{1/3})$ instead of $O(d)$] and a larger optimal asymptotic acceptance rate (0.574 instead of 0.234), as compared to RWM algorithms. [Balanced against this, of course, is the need to compute $\nabla \log \pi(X_n)$ at each step of the algorithm.] Note that, similar to the RWM case, there is no need to calculate or estimate $J$, since the optimality result is stated in terms of the asymptotic acceptance rate only.

The regularity conditions needed for this result rely again on smoothness conditions on the function $\log f$. In Roberts and Rosenthal (1998), the existence of eight continuous derivatives of $\log f$ is assumed, though it is clear that these conditions can be relaxed to some extent at least.
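A minimal sketch of the MALA chain with proposal (20) and the $O(d^{-1/3})$ scaling of Theorem 3 follows (assuming NumPy; log_pi and grad_log_pi are user-supplied functions, and the full Hastings correction is required because the proposal is not symmetric).

```python
import numpy as np

def mala(log_pi, grad_log_pi, d, ell, n_iter=50_000, seed=0):
    """MALA: Y = x + (s^2/2) grad log pi(x) + s Z with s^2 = ell^2/d^(1/3),
    accepted with the Metropolis-Hastings ratio. Returns acceptance rate."""
    rng = np.random.default_rng(seed)
    s2 = ell ** 2 / d ** (1.0 / 3.0)
    x = np.zeros(d)
    accepted = 0

    def log_q(frm, to):
        # log density (up to a constant) of the proposal from frm to to
        mean = frm + 0.5 * s2 * grad_log_pi(frm)
        return -np.sum((to - mean) ** 2) / (2.0 * s2)

    for _ in range(n_iter):
        y = x + 0.5 * s2 * grad_log_pi(x) + np.sqrt(s2) * rng.normal(size=d)
        log_alpha = log_pi(y) - log_pi(x) + log_q(y, x) - log_q(x, y)
        if np.log(rng.uniform()) < log_alpha:
            x, accepted = y, accepted + 1
    return accepted / n_iter
```

For a standard normal product target one can take log_pi = lambda x: -np.sum(x**2) / 2 and grad_log_pi = lambda x: -x, and tune ell until the acceptance rate is near 0.574.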

    5. A GIBBS RANDOM-FIELD MODEL

One can try to relax the independence assumptions of Section 2. The following result is an informal version of a statement taken from Breyer and Roberts (2000) in the context of finite-range Gibbs random-field target distributions, which gives a flavor of the problems encountered.

Suppose that $\nu(dx)$ is a continuous probability measure on $\mathbf{R}$. We define a Gibbs measure on $\mathbf{R}^{\mathbf{Z}^r}$ which has a density with respect to $\bigotimes_{k \in \mathbf{Z}^r} \nu(dx_k)$ given by
$$\exp\Bigl( -\sum_{k \in \mathbf{Z}^r} U_k(x) \Bigr), \qquad (24)$$
where $U_k$ is a finite-range potential (depending only on a finite number of neighboring terms of $k$), which is everywhere finite. [Formally, this framework only defines finite-dimensional conditional distributions; therefore, (24) needs to be interpreted in terms of conditional distributions.]

Example. We give a simple Gaussian example of a Markov random field. Suppose $\nu$ denotes the measure $\nu(dx) \propto \exp(-x^2/2)\, dx$ and define the neighborhood structure $\sim$ on $\mathbf{Z}^r$ by $l \sim m$ if and only if $\|l - m\| = 1$; that is, $l$ and $m$ differ in only one coordinate, and in that coordinate by only one. Define
$$U_k(x) = -\sum_{l \sim k} \frac{x_k x_l}{4r}.$$
The one-dimensional full conditionals are given by
$$x_k \mid x_{\sim k} \sim N(\bar{x}_k,\, 1),$$
where $\bar{x}_k$ denotes the mean of the $x$ values at neighboring sites.

Now we can consider RWM on $A_i$ with Gaussian proposals, with variance given by (6) with proposal variance $\sigma_i^2$. We are now ready to informally state the following result, which is taken from Breyer and Roberts (2000). We take $A_i$ to be a sequence of hyperrectangular grids increasing to $\mathbf{Z}^r$ as $i \to \infty$.

Theorem 4. Consider RWM with Gaussian proposals with variance $\sigma_i^2 = \ell^2 / |A_i|$. Call this chain $X^i$ and consider the sped-up chain
$$Z_t^i = X^i_{[t |A_i|]}, \qquad (25)$$
sped up by a factor $|A_i|$. Then, under certain technical conditions discussed below, $Z^i$ converges weakly to a limiting infinite-dimensional diffusion on $\mathbf{R}^{\mathbf{Z}^r}$. Moreover, the relative efficiency curve as a function of acceptance rate is again that given by Figure 3, and the corresponding optimal scaling problem for $\ell$ has a solution which can again be characterized as that which achieves acceptance rate 0.234.

This theorem thus says that, under appropriate technical conditions, RWM (with Gaussian proposal distributions) on Markov random-field target distributions again has the same optimality properties (including the 0.234 optimal acceptance rate) as RWM on the target distributions given by product densities $\prod_{i=1}^{d} f(x_i)$.

The most important technical condition assumed for Theorem 4 is that the random field has no phase transition; in fact, the field's correlations decrease exponentially quickly as a function of distance. [This avoids the multiplicity of distributions satisfying (24).] It should be emphasized that phase-transition behavior is indeed possible for these types of distributions, and the consequences for practical MCMC are very important. In order for the algorithm to mix in this case, the proposal needs to make considerably larger jumps, and the optimal acceptance rate will then necessarily converge to 0. In practice, this makes the algorithm prohibitively


slow (typically taking a time exponential in $d$ to mix adequately). Therefore, for all intents and purposes, problems with heavy dependence structure (sufficient to mimic phase transition) will not yield practically useful algorithms.

It is interesting to note that phase-transition behavior of this type does not happen for $r = 1$, where ergodicity ensures that (24) uniquely specifies the full distribution. In this case, for large $i$, algorithms are considerably more robust to dependence between components. This supports (for example) the prolific success of MCMC algorithms in hierarchical models [see, e.g., Smith and Roberts (1993)], where dependence is propagated through a one-dimensional hierarchical structure only. On the other hand, for a multidimensional Markov random field, if significant phase-transition-type behavior does occur, then the optimal scaling properties can be very different from the above.

6. EXTENSIONS TO MORE GENERAL TARGET DISTRIBUTIONS

Theorems 1 and 3 assume that the target density is of the special form $\pi(x) = \prod_{i=1}^{d} f(x_i)$, consisting of i.i.d. components. This assumption is obviously very restrictive. However, the essential result, that the optimal acceptance rate should be about 0.234 for random-walk Metropolis (RWM) algorithms and about 0.574 for Langevin (MALA) algorithms, appears to be considerably more robust. Thus, in this section, we consider somewhat more general target distributions and optimal scaling properties in a broader context. We have already demonstrated robustness of the 0.234 rule to different dependence structures in Section 5. Next we shall consider the notion of heterogeneity of scale between different components.

We assume that $\pi$ is of the form
$$\pi(x) = \prod_{i=1}^{d} C_i\, f(C_i x_i), \qquad (26)$$
where $f$ is again a fixed one-dimensional density (satisfying the same regularity conditions as for Theorem 1), but where now there is an arbitrary scaling factor $C_i > 0$ in each component. We again assume that the proposals are given by $q(y)\, dy \sim N(0, I_d \sigma_d^2)$, a normal distribution with variance $\sigma_d^2$ times the identity. To make a reasonable limiting theory as $d \to \infty$, we assume that the values $C_i$ are i.i.d. from some distribution having mean 1 and finite variance. Then we have the following.

Theorem 5. Consider RWM with target density of the form (26), where the $C_i$ are i.i.d. with $E(C_i) = 1$ and $E(C_i^2) \equiv b < \infty$, and with proposal distribution $N(0, I_d \sigma_d^2)$, where $\sigma_d^2 = \ell^2 / d$ for some $\ell > 0$. Let $W_t^d = C_1 X^1_{[td]}$. Then, as $d \to \infty$, $W_t^d$ converges to a limiting diffusion process $W_t$ satisfying
$$dW_t = \frac{1}{2} (\log f)'(W_t)\, C_1^2\, s^2\, dt + C_1\, s\, dB_t,$$
where $B_t$ is standard Brownian motion and where
$$s^2 = 2 \ell^2\, \Phi\bigl( -\ell\, b^{1/2} I^{1/2} / 2 \bigr) = \frac{1}{b}\, 2 \ell_b^2\, \Phi\bigl( -\ell_b\, I^{1/2} / 2 \bigr), \qquad \ell_b = b^{1/2} \ell, \qquad (27)$$
with $I = E_f[ ((\log f)'(X))^2 ]$. Hence, the efficiency of the algorithm (when considering functionals of the first coordinate only), as a function of acceptance rate, is identical to that of Theorem 1, except multiplied by the global factor of $C_1^2 / b$. In particular, the optimal acceptance rate is still equal to 0.234. For a fixed function $g$, the optimal asymptotic efficiency is proportional to $C_1^2 / (bd)$.

This result does not appear in any of the MCMC scaling literature, so we have sketched a proof, which appears in the Appendix.

This theorem shows that the optimal acceptance properties for RWM on the inhomogeneous target distribution (26) are identical to the case where the target satisfies (5). However, at least if $C_1 = 1$, the efficiency of the algorithm is slowed down by a factor $b$, where $b = E(C_i^2) / E(C_i)^2$. Of course, if the $C_i$ are constant, then there is no slow-down effect at all. However, if the $C_i$ are nonconstant, then we will have $b > 1$, and the algorithm will be slower than the algorithm for the corresponding homogeneous target distribution.

On the other hand, if the values of $C_i$ are known, then, instead of using the proposal distribution $q \sim N(0, \sigma_d^2 I_d)$, one could use a proposal whose scale in component $i$ is proportional to $C_i^{-1}$, matching the scale of that component of the target. In this case, the scalings $C_i$ would be the same for the target and the proposal, so they would cancel out, and the resulting algorithm would be precisely equivalent to running the original RWM algorithm on the corresponding target distribution satisfying (5), that is, to setting $C_i = 1$ for each $i$. By Theorem 5, this would be more efficient by a factor of $b$. We therefore have the following.

Theorem 6. Consider running a Metropolis algorithm on a target density of the form (26) to estimate some function of the first component only. Suppose we use either homogeneous proposals of the form $q \sim N(0, \sigma_d^2 I_d)$ or inhomogeneous proposals of the form $q \sim N(0, \tilde{\sigma}_d^2\, \mathrm{diag}(C_1^{-2}, \ldots, C_d^{-2}))$. Then, in either case, it is optimal to tune the algorithm to an


asymptotic acceptance rate of 0.234, and the asymptotic efficiency is proportional to $d^{-1}$. On the other hand, note that the optimal value of $\tilde{\sigma}_d^2$ itself is not equal to the optimal value of $\sigma_d^2$. However, the optimal inhomogeneous-proposal algorithm will have asymptotic relative efficiency which is larger than that of the optimal homogeneous-proposal algorithm by a factor of $b / C_1^2$, where $b = E(C_i^2) / E(C_i)^2$.

This theorem suggests that, if the target density is even approximately of the form (26), with significantly varying values of $C_i$, then it may be worthwhile to obtain a preliminary estimate of the $C_i$ and then use inhomogeneous proposals of the form $q \sim N(0, \tilde{\sigma}_d^2\, \mathrm{diag}(C_1^{-2}, \ldots, C_d^{-2}))$ when running the Metropolis algorithm.
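A sketch of such an inhomogeneous-proposal sampler (assuming NumPy, and assuming, as in our reading of Theorem 6 above, that the proposal standard deviation in component $i$ is proportional to $C_i^{-1}$, so that the scalings cancel):

```python
import numpy as np

def rwm_inhomogeneous(log_f, C, ell, n_iter=50_000, seed=0):
    """RWM for the target (26), pi(x) = prod_i C_i f(C_i x_i), with
    per-component proposal scale proportional to 1/C_i.
    C holds the (estimated) scaling factors C_i."""
    rng = np.random.default_rng(seed)
    C = np.asarray(C, dtype=float)
    d = len(C)
    sigma = ell / (np.sqrt(d) * C)   # component-wise proposal scales
    x = np.zeros(d)
    lp = log_f(C * x).sum()          # the log C_i terms cancel in ratios
    accepted = 0
    for _ in range(n_iter):
        y = x + sigma * rng.normal(size=d)
        lpy = log_f(C * y).sum()
        if np.log(rng.uniform()) < lpy - lp:
            x, lp, accepted = y, lpy, accepted + 1
    return accepted / n_iter
```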

Remark. On examining the proof of Theorem 5, we see that, in the infinite-dimensional limit, the rejection penalty for trying big moves depends only on the components $2, \ldots, d$. Hence, if we were merely interested in the first component, it would make sense to make bigger jumps in that component than in the others. In fact, best of all would be to just update the first component and ignore all the others. However, this argument strongly uses the independence between components. In practice, where dependence between components is present, it will be necessary for all components to converge rapidly to have confidence in the stationarity of any one component.

Remark. In fact, Theorem 5 extends readily to target densities whose first component has a different form of density to all other components, that is, specifically target densities of the form
$$\pi(x) = f_1(x_1) \prod_{i=2}^{d} C_i\, f(C_i x_i), \qquad (28)$$
provided that the function $f_1$ does not depend on $d$; see Section 7.

Finally, the arguments above can be mimicked for investigating the effect of heterogeneity of components of the target density on MALA. In this case, the following result holds (we omit the proof).

Theorem 7. Consider a Metropolis–Hastings chain $X_0, X_1, \ldots$ for a target distribution having density $\pi$, with MALA proposals as in (20). Suppose $\pi$ satisfies (26), let $\sigma_d^2 = \ell^2 / d^{1/3}$ and set
$$Z_t^d = X^1_{[d^{1/3} t]}.$$
Assume the same regularity conditions on the densities $f$ as in Roberts and Rosenthal (1998).

(i) $Z^d$ converges weakly to
$$dZ_t = h(\ell)^{1/2}\, dW_t + h(\ell)\, \frac{(\log f(Z_t))'}{2}\, dt$$
as $d \to \infty$, where
$$h(\ell) = 2 \ell^2\, \Phi\bigl( -k^{1/2} J \ell^3 \bigr), \qquad (29)$$
$J$ is given by (23), and with
$$k = E(C_i^6) / E(C_i)^6. \qquad (30)$$

(ii) The asymptotic acceptance rate of the algorithm is given by $2 \Phi( -k^{1/2} J \ell^3 )$, and the optimal scaling is that having asymptotic acceptance rate 0.574.

(iii) The efficiency of this algorithm is reduced by a factor of $k^{1/3}$, compared to the corresponding homogeneous algorithm with $C_i = 1$ for all $i$.

Therefore, Langevin algorithms are considerably more sensitive to scale changes than Metropolis algorithms: the cost of heterogeneity is governed by the $L^6$ norm of the $C_i$, rather than the $L^2$ norm as in the Metropolis case.

    7. EXAMPLES AND SIMULATIONS

In this section, we provide various examples and simulations to further illustrate the theorems presented above.

7.1 Inhomogeneous Multivariate Gaussian Examples

We performed a large simulation study to demonstrate examples of Theorems 5 and 6 working in practice. We considered three classes of distributions, as follows.

1. Independent and identically distributed normal components, $\pi(x) \propto \exp( -\sum_{i=1}^{d} x_i^2 / 2 )$; that is, if $f$ is a standard normal density in (26), then we take $C_i = 1$ for all $i$.

2. Normal components with the first having smaller scale, $\pi(x) \propto \exp( -2 x_1^2 - \sum_{i=2}^{d} x_i^2 / 2 )$; that is, we take $C_i = 1$ for all $i \neq 1$ and take $C_1 = 2$.

3. Normal components with $C_1 = 1$ but with the other $C_i$'s being chosen from the Exponential(1) distribution.

For each of these three cases, we performed runs in dimensions $5, 10, 15, \ldots, 50$, together with 100 and 200. For each dimension, we used spherically symmetric Gaussian proposals with various values of the proposal variance, and the algorithm was run for 100,000 iterations. We concentrated on analysis of the first component for simplicity. In each case, we estimated this component's integrated autocorrelation time (called convergence time for short in


what follows) using the empirical lag-$k$ autocorrelation $\hat{\rho}_k$ of the first component: $-k / \log \hat{\rho}_k$. This estimate becomes increasingly accurate as the diffusion limit is approached, as the arguments at the end of Section 2 indicate.

Figure 5 summarizes the results of our simulations for the first two distributions. We have plotted the convergence time of the algorithm divided by dimension (since by Theorem 5 this ought to be stable as $d \to \infty$) against the acceptance rate of the algorithm. The plotting symbol indicates the dimension in which the simulation was performed in each case.

According to Theorem 5, the optimal (i.e., minimal) convergence time is proportional to $b / C_1^2$. Therefore, we should expect the optimal convergence time of distribution 1 to be 4 times that of distribution 2. This is seen quite clearly in the minima achieved in each case: just below 1.5 in distribution 1 and around 0.38 for distribution 2. Note that, as expected, the optimal values are achieved at acceptance rates between 0.2 and 0.25, although it is difficult to be precise about optimal acceptance rates because the functions are inevitably rather flat around the optimal values and there is still some Monte Carlo error in our simulations.

Fig. 5. Convergence times (divided by dimension) for Metropolis algorithms as a function of their acceptance rates, for distribution 1 (left) and distribution 2 (right). The plotting symbol indicates the dimension of the simulation.

Figure 6 shows the results of our simulation for distribution 3. The plotting symbol used in each case refers to a particular collection of random $C_i$'s. Here the additional randomness of the $C_i$'s means that, even in 50 dimensions, the optimal scaling for the algorithm can vary quite appreciably, depending on the particular values of the $C_i$'s. However, the robustness of the optimal acceptance rate 0.234 remains intact despite this.

If $C_i \sim$ Exponential(1), then $b$ is easily computed to be 2, so, according to Theorem 5, we ought to lose a factor of 2 in efficiency between distribution 3 and distribution 1. Our results show very good agreement with Theorem 5. For instance, Figure 5 suggests that the minimum convergence time for distribution 1 is around $3d/2$, which is around 300 for $d = 200$. Figure 6(d) shows that the minimized convergence time in 200 dimensions is around 600 for distribution 3, a loss of efficiency by a factor of around 2.

7.2 A Multimodal Example

For example, consider a target density of the form
$$\pi = \frac{1}{2} N(M_d, I_d) + \frac{1}{2} N(-M_d, I_d),$$
where $M_d = (m_d, 0, 0, \ldots, 0)$. This distribution has half its mass near $(m_d, 0, \ldots, 0)$ and the other half near $(-m_d, 0, \ldots, 0)$.

Strictly speaking, this distribution is not of the form (26), since the first coordinate is not merely


Fig. 6. Convergence times of RWM in the heterogeneous environment of distribution 3 in dimensions 5, 30, 50 and 200. Here the plotting number indicates a particular random collection of $C_i$'s.

However, if m_d does not depend on d, then the asymptotic results of Theorem 5 still apply. Figure 7 shows such a simulation in various dimensions, with m_d = 3 throughout. This is demonstrated in Figure 7(a), where the convergence time is O(d), as expected.
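To make this concrete, here is a minimal sketch (our own illustration, not code from the paper) of RWM on this bimodal target. The proposal scale 2.38/√d, which targets the 0.234 acceptance rate for unit-scale components, and the mode-occupancy diagnostic are illustrative assumptions.

```python
import numpy as np

# RWM on the bimodal target
#   pi = 0.5 N(M_d, I_d) + 0.5 N(-M_d, I_d),  M_d = (m_d, 0, ..., 0),
# with Gaussian proposals N(x, sigma^2 I_d).
def log_pi(x, m_d=3.0):
    shift = np.zeros_like(x)
    shift[0] = m_d
    a = -0.5 * np.sum((x - shift)**2)
    b = -0.5 * np.sum((x + shift)**2)
    return np.logaddexp(a, b)  # the 1/2 mixture weights cancel in the MH ratio

def rwm(d=30, m_d=3.0, sigma=None, n_iter=50_000, seed=0):
    rng = np.random.default_rng(seed)
    sigma = sigma if sigma is not None else 2.38 / np.sqrt(d)  # ~0.234 acceptance
    x, lp = np.zeros(d), log_pi(np.zeros(d), m_d)
    accept = 0
    first_coord = np.empty(n_iter)
    for i in range(n_iter):
        y = x + sigma * rng.standard_normal(d)
        lpy = log_pi(y, m_d)
        if np.log(rng.uniform()) < lpy - lp:
            x, lp = y, lpy
            accept += 1
        first_coord[i] = x[0]
    return accept / n_iter, first_coord

rate, x0 = rwm()
print(f"acceptance rate {rate:.3f}; "
      f"fraction of time with x_1 > 0: {np.mean(x0 > 0):.3f}")
```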

On the other hand, suppose now that m_d = √d/2 [which is equivalent to having the two normal components centered at (0, 0, …, 0) and (1, 1, …, 1), respectively]. In this case, since m_d is growing with d, the asymptotic results of Theorem 5 do not apply. Indeed, in this case, for large d the chain will tend to spend long stretches of time in one of the two components of π before moving to the other. Hence, there is no limiting diffusion at all. Indeed, computing autocorrelations in this case can be quite deceptive, since the autocorrelations could be small even though the chain spends all its time in just one of the two components, and a Langevin diffusion limit is not a good description of the process. Figure 7(b) shows an estimate of convergence time from autocorrelations, scaled by d^2. In fact, the true convergence time is at least exponential in d in this example, and this can be proved by the use of capacitance ideas [see, e.g., Sinclair and Jerrum (1989)], although we do not attempt to show that in this paper.
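The deceptiveness of autocorrelation-based estimates can be seen numerically with the sketch below (again our own illustration, reusing the rwm function above): a crude integrated autocorrelation time of x_1 can look modest while a mode-occupancy check reveals that the chain never left its starting mode.

```python
import numpy as np

def iact(trace, max_lag=1000):
    """Crude integrated autocorrelation time of a scalar chain trace."""
    x = trace - trace.mean()
    var = np.dot(x, x) / len(x)
    tau = 1.0
    for k in range(1, max_lag):
        rho = np.dot(x[:-k], x[k:]) / (len(x) * var)
        if rho < 0.05:  # truncate once correlations become negligible
            break
        tau += 2 * rho
    return tau

d = 100
rate, x0 = rwm(d=d, m_d=np.sqrt(d) / 2, n_iter=200_000, seed=1)
print(f"IACT of x_1: {iact(x0):.1f} iterations")                # may look modest...
print(f"fraction of time with x_1 > 0: {np.mean(x0 > 0):.3f}")  # ...yet stuck in one mode
```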

7.3 Example: Multivariate Normal with Correlations

Suppose now that π is a d-dimensional multivariate normal distribution, but with nontrivial covariance matrix Σ. Since Σ must be symmetric (and hence self-adjoint), it follows that we can find an orthogonal rotation of the axes which transforms π into another multivariate normal, but this one having independent components, that is, a diagonal variance/covariance matrix of the form Σ̃ = diag(λ_1, …, λ_d), where the λ_i are the eigenvalues of Σ.

Now, if we knew Σ or could estimate it, then we could use proposals of the form q ∼ N(x, σ^2 Σ). This would be equivalent to starting with a target distribution satisfying (5) and running standard RWM. By Theorem 6, this would be better than using a fixed N(x, σ^2 I_d) proposal distribution. This provides theoretical justification for the commonly used strategy of running random walk Metropolis with increment distribution having covariance structure set to be proportional to the empirically estimated correlation structure of the target density; see, for example, Tierney (1994).
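A minimal sketch of this comparison follows, under our own choices of dimension, correlation and tuning (the step-size rule 2.38^2/d is the classical scaling for unit-scale targets; treating Σ as known stands in for estimating it empirically):

```python
import numpy as np

def rwm_gaussian(Sigma, prop_cov, n_iter=100_000, seed=0):
    """RWM for a zero-mean Gaussian target N(0, Sigma) with Gaussian increments."""
    d = Sigma.shape[0]
    rng = np.random.default_rng(seed)
    prec = np.linalg.inv(Sigma)
    L = np.linalg.cholesky(prop_cov)
    x = np.zeros(d)
    lp = 0.0  # -0.5 * x' prec x at x = 0
    accept = 0
    for _ in range(n_iter):
        y = x + L @ rng.standard_normal(d)
        lpy = -0.5 * y @ prec @ y
        if np.log(rng.uniform()) < lpy - lp:
            x, lp, accept = y, lpy, accept + 1
    return accept / n_iter

d, rho = 20, 0.9
Sigma = np.full((d, d), rho) + (1 - rho) * np.eye(d)
s2 = 2.38**2 / d  # classical step-size rule for unit-scale targets
print("isotropic proposal:", rwm_gaussian(Sigma, s2 * np.eye(d)))
print("matched proposal:  ", rwm_gaussian(Sigma, s2 * Sigma))  # pretend Sigma was estimated
```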

[Figure 7: two panels, (a) constant separation, (b) d^{1/2} separation; x-axis: acceptance rate.]

Fig. 7. (a) Convergence time as a function of acceptance rate for RWM on the bimodal example, with m_d = 3 throughout, and (b) estimated convergence time divided by d^2 as a function of acceptance rate in the case m_d = √d/2. Note that the autocorrelation-based estimate in (b) does not estimate the convergence time well. In dimensions 100 and 200, the algorithm in (b) fails to leave its starting mode at all, so that convergence has certainly not occurred, even though the plot seems to suggest convergence ought to occur within the 500,000 iterations for which the algorithm was run. The problem is that, since the separation distance is increasing with d, there is no longer a diffusion limit and Theorem 5 does not apply.


In any case, this suggests that the behavior of ordinary RWM on a multivariate normal distribution is governed by the eigenvalues of Σ.

In particular, suppose that Σ has 1's on the diagonal and ρ off the diagonal (corresponding to a collection of normal random variables with unit variance and pairwise correlation ρ). Then the eigenvalues of Σ are λ_1 = (d − 1)ρ + 1 (with multiplicity 1 and eigenvector x̄) and λ_2 = ⋯ = λ_d = 1 − ρ (with multiplicity d − 1 and eigenvectors of the form x_i − x̄ for any i). This corresponds to C_1 = 1/√λ_1 = 1/√((d − 1)ρ + 1) and C_2 = ⋯ = C_d = 1/√(1 − ρ).

Here f is the standard normal density, so that g(x) = −x. Hence, the limiting diffusion satisfies dW_t = −(1/2) C_1^2 s^2 W_t dt + C_1 s dB_t, which can be solved explicitly (as an Ornstein–Uhlenbeck process) to give

    W_t = e^{−C_1^2 s^2 t/2} W_0 + N(0, 1 − e^{−C_1^2 s^2 t}).
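The eigenstructure quoted above is easy to confirm numerically; a small sketch (our own, with arbitrary d and ρ):

```python
import numpy as np

# Eigenvalues of the exchangeable covariance matrix:
# 1's on the diagonal, rho off the diagonal.
d, rho = 10, 0.5
Sigma = np.full((d, d), rho) + (1 - rho) * np.eye(d)
eigvals = np.sort(np.linalg.eigvalsh(Sigma))
print(eigvals[-1], (d - 1) * rho + 1)  # largest eigenvalue: (d-1)*rho + 1
print(eigvals[0], 1 - rho)             # remaining d-1 eigenvalues: 1 - rho
```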

Since C_1 = 1/√((d − 1)ρ + 1), for large d we see that C_1^2 = O(d^{−1}) and is therefore very small, so the diffusion converges very slowly.

For this example, Theorem 5 does not strictly apply, although, since it describes the speed of the algorithm as being O(C_1^2/d), it suggests that, in this case, the algorithm is O(d^2). In fact, by speeding up time by d^2 instead of d, an analogous weak convergence result can be demonstrated to show that, in this case, the algorithm is indeed O(d^2).

On the other hand, the above analysis is for functions of the first eigenvector only, that is, for functions of x̄. Suppose instead we renumber the eigenvalues of Σ, so that the large eigenvalue is numbered λ_d (or λ_2) instead of λ_1. This corresponds to considering functions which are orthogonal to x̄, that is, which are functions of x_i − x̄. In this case, we obtain C_1^2 = 1/(1 − ρ), which is not small, so the diffusion converges much faster.

We conclude that, for RWM with normal proposals on the multivariate normal target distribution with Σ as above, the algorithm will converge quickly for functions orthogonal to x̄, but will converge much more slowly [O(d^2)] for functions of x̄ itself. This is illustrated in Figure 8.
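This contrast is easy to observe empirically; the sketch below (our own illustration, reusing the iact helper above) compares the integrated autocorrelation time of x̄ with that of the contrast x_1 − x̄ under plain isotropic RWM:

```python
import numpy as np

# Assumes the iact helper defined earlier. Dimension, correlation and
# run length are our own choices.
d, rho = 50, 0.9
Sigma = np.full((d, d), rho) + (1 - rho) * np.eye(d)
prec = np.linalg.inv(Sigma)
rng = np.random.default_rng(2)
s = 2.38 / np.sqrt(d)
x, lp = np.zeros(d), 0.0
xbar, contrast = np.empty(100_000), np.empty(100_000)
for i in range(100_000):
    y = x + s * rng.standard_normal(d)
    lpy = -0.5 * y @ prec @ y
    if np.log(rng.uniform()) < lpy - lp:
        x, lp = y, lpy
    xbar[i] = x.mean()
    contrast[i] = x[0] - x.mean()
print("IACT of x-bar:      ", iact(xbar))      # slow direction
print("IACT of x_1 - x-bar:", iact(contrast))  # fast direction
```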

7.4 Some Simulations with MALA

Figure 9 shows the O(d^{1/3}) performance of MALA on product form densities of the type in (5). These simulations were performed using distribution 1. Again notice the stability of these curves, even in relatively low-dimensional problems. The asymptotically optimal acceptance rate 0.574 performs excellently even in five dimensions.
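For reference, here is a minimal MALA sketch (our own, on the i.i.d. standard normal target, where ∇ log π(x) = −x). The proposal variance h ∝ d^{−1/3} reflects the optimal scaling; the constant ℓ ≈ 1.65 is our reading of the optimal value from this literature and should be treated as an assumption.

```python
import numpy as np

# MALA on the i.i.d. standard normal target: grad log pi(x) = -x,
# proposal Y = x + (h/2) grad log pi(x) + sqrt(h) Z.
def mala(d=20, h=None, n_iter=100_000, seed=0):
    rng = np.random.default_rng(seed)
    h = h if h is not None else 1.65**2 / d**(1/3)  # sigma^2 = l^2 d^(-1/3), l ~ 1.65 (assumed)
    def log_pi(x):
        return -0.5 * np.sum(x**2)
    def drift(x):
        return x - 0.5 * h * x  # i.e. x + (h/2) * grad log pi(x)
    x = np.zeros(d)
    accept = 0
    for _ in range(n_iter):
        y = drift(x) + np.sqrt(h) * rng.standard_normal(d)
        # log proposal densities q(x -> y) and q(y -> x), up to constants
        log_q_xy = -np.sum((y - drift(x))**2) / (2 * h)
        log_q_yx = -np.sum((x - drift(y))**2) / (2 * h)
        log_alpha = log_pi(y) - log_pi(x) + log_q_yx - log_q_xy
        if np.log(rng.uniform()) < log_alpha:
            x = y
            accept += 1
    return accept / n_iter

print(f"acceptance rate: {mala():.3f}")  # hand-tune h toward ~0.574
```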

[Figure 8: x-axis: acceptance rate; y-axis: convergence time / d^2.]

Fig. 8. Convergence time as a function of acceptance rate for RWM on the exchangeable normal example using x̄ as the function of interest. Note that the convergence time is scaled by d^2 in this case, demonstrating the O(d^2) behavior.

    8. DISCUSSION

In this paper, we have summarized and extended recent work on scaling of RWM and MALA algorithms. The remarkable thing about the acceptance rate criteria for scaling of algorithms is their robustness to different types of target distributions, to dependence in the target density and to inhomogeneity in scale between different components.

[Figure 9: x-axis: acceptance rate; y-axis: convergence time / d^{1/3}.]

Fig. 9. MALA (the Langevin algorithm), showing its optimality at around 0.574 and its convergence time being O(d^{1/3}).

However, as we show in our results and examples, the efficiency of the algorithm itself can still depend critically on properties of the target density.

How should these results be used in practice? The asymptotic efficiency curve given in Figure 3 (bottom) shows that there is very little to be gained


so that the maximization problem is expressed solely in terms of A, and the other constants (b, I, etc.) affect only the maximized value itself.

    ACKNOWLEDGMENT

Jeffrey S. Rosenthal was supported in part by NSERC of Canada.

    REFERENCES

Besag, J. E. (1994). Comment on "Representations of knowledge in complex systems" by U. Grenander and M. I. Miller. J. Roy. Statist. Soc. Ser. B 56 591–592.

Breyer, L. and Roberts, G. O. (2000). From Metropolis to diffusions: Gibbs states and optimal scaling. Stochastic Process. Appl. 90 181–206.

Gelman, A., Roberts, G. O. and Gilks, W. R. (1996). Efficient Metropolis jumping rules. Bayesian Statist. 5 599–608.

Gilks, W. R., Richardson, S. and Spiegelhalter, D. J., eds. (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall, London.

Jarner, S. F. and Roberts, G. O. (2001). Convergence of heavy tailed Metropolis algorithms. Available at www.statslab.cam.ac.uk/mcmc.

Kennedy, A. D. and Pendleton, B. (1991). Acceptances and autocorrelations in hybrid Monte Carlo. Nuclear Phys. B 20 118–121.

Roberts, G. O. (1998). Optimal Metropolis algorithms for product measures on the vertices of a hypercube. Stochastics Stochastic Rep. 62 275–283.

Roberts, G. O., Gelman, A. and Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab. 7 110–120.

Roberts, G. O. and Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions. J. Roy. Statist. Soc. Ser. B 60 255–268.

Roberts, G. O. and Tweedie, R. L. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2 341–363.

Roberts, G. O. and Yuen, W. K. (2001). Optimal scaling of Metropolis algorithms for discontinuous densities. Unpublished manuscript.

Sinclair, A. J. and Jerrum, M. R. (1989). Approximate counting, uniform generation, and rapidly mixing Markov chains. Inform. and Comput. 82 93–133.

Smith, A. F. M. and Roberts, G. O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods (with discussion). J. Roy. Statist. Soc. Ser. B 55 3–24.

Tierney, L. (1994). Markov chains for exploring posterior distributions (with discussion). Ann. Statist. 22 1701–1762.