
Statistical Science
2004, Vol. 19, No. 1, 58–80
DOI 10.1214/088342304000000116
© Institute of Mathematical Statistics, 2004

The Interplay of Bayesian and Frequentist Analysis

M. J. Bayarri and J. O. Berger

Abstract. Statistics has struggled for nearly a century over the issue of whether the Bayesian or frequentist paradigm is superior. This debate is far from over and, indeed, should continue, since there are fundamental philosophical and pedagogical issues at stake. At the methodological level, however, the debate has become considerably muted, with the recognition that each approach has a great deal to contribute to statistical practice and each is actually essential for full development of the other approach. In this article, we embark upon a rather idiosyncratic walk through some of these issues.

Key words and phrases: Admissibility, Bayesian model checking, conditional frequentist, confidence intervals, consistency, coverage, design, hierarchical models, nonparametric Bayes, objective Bayesian methods, p-values, reference priors, testing.

CONTENTS

1. Introduction
2. Inherently joint Bayesian–frequentist situations
   2.1. Design or preposterior analysis
   2.2. The meaning of frequentism
   2.3. Empirical Bayes, gamma minimax, restricted risk Bayes
3. Estimation and confidence intervals
   3.1. Computation with hierarchical, multilevel or mixed model analysis
   3.2. Assessment of accuracy of estimation
   3.3. Foundations, minimaxity and exchangeability
   3.4. Use of frequentist methodology in prior development
   3.5. Frequentist simplifications and asymptotic approximations
4. Testing, model selection and model checking
   4.1. Conditional frequentist testing
   4.2. Model selection
   4.3. p-Values for model checking
5. Areas of current disagreement
6. Conclusions
Acknowledgments
References

M. J. Bayarri is Professor, Department of Statistics, University of Valencia, Av. Dr. Moliner 50, 46100 Burjassot, Valencia, Spain (e-mail: susie.bayarri@uv.es). J. O. Berger is Director, Statistical and Applied Mathematical Sciences Institute, and Professor, Duke University, P.O. Box 90251, Durham, North Carolina 27708-0251, USA (e-mail: [email protected]).

    1. INTRODUCTION

Statisticians should readily use both Bayesian and frequentist ideas. In Section 2 we discuss situations in which simultaneous frequentist and Bayesian thinking is essentially required. For the most part, however, the situations we discuss are situations in which it is simply extremely useful for Bayesians to use frequentist methodology or for frequentists to use Bayesian methodology.

The most common scenarios of useful connections between frequentists and Bayesians arise when no external information (other than the data and model itself) is to be introduced into the analysis: on the Bayesian side, when objective prior distributions are used. Frequentists are usually not interested in subjective, informative priors, and Bayesians are less likely to be interested in frequentist evaluations when using subjective, highly informative priors.

We will, for the most part, avoid the question of whether the Bayesian or frequentist approach to statistics is philosophically correct. While this is a valid question, and research in this direction can be of fundamental importance, the focus here is simply on methodology. In a related vein, we avoid the question of what is pedagogically correct. If pressed, we would probably argue that Bayesian statistics (with emphasis on objective Bayesian methodology) should be the type of statistics that is taught to the masses, with frequentist statistics being taught primarily to advanced statisticians, but that is not an issue for this paper.

Several caveats are in order. First, we primarily focus on the Bayesian and frequentist approaches here; these are the most generally applicable and accepted statistical philosophies, and both have features that are compelling to most statisticians. Other statistical schools, such as the likelihood school (see, e.g., Reid, 2000), have many attractive features and vocal proponents, but have not been as extensively developed or utilized as the frequentist and Bayesian approaches.

A second caveat is that the selection of topics here is rather idiosyncratic, being primarily based on situations and examples in which we are currently interested. Other Bayesian–frequentist synthesis works (e.g., Pratt, 1965; Barnett, 1982; Rubin, 1984; and even Berger, 1985a) focus on a quite different set of situations. Furthermore, we almost completely ignore many of the most time-honored Bayesian–frequentist synthesis topics, such as empirical Bayes analysis. Hence, rather than being viewed as a comprehensive review, this paper should be thought of more as a personal view of current interesting issues in the Bayesian–frequentist synthesis.

2. INHERENTLY JOINT BAYESIAN–FREQUENTIST SITUATIONS

There are certain statistical scenarios in which a joint frequentist–Bayesian approach is arguably required. As illustrations of this, we first discuss the issue of design (in which the notion should not be controversial) and then discuss the basic meaning of frequentism, which arguably should be (but is not typically perceived as) a joint frequentist–Bayesian endeavor.

    2.1 Design or Preposterior Analysis

Frequentist design focuses on the planning of experiments; for instance, the issue of choosing an appropriate sample size. In Bayesian analysis this is often called preposterior analysis, because it is done before the data is collected (and, hence, before the posterior distribution is available).

EXAMPLE 2.1. Suppose X₁, ..., Xₙ are i.i.d. Poisson random variables with mean θ, and that it is desired to estimate θ under squared error loss (θ̂ − θ)², using the classical estimator θ̂ = X̄. This estimator has frequentist expected loss

E[(X̄ − θ)²] = θ/n.

A typical design problem would be to choose the sample size n so that the expected loss is less than some prespecified limit C. (An alternative formulation might be to minimize the expected loss plus nc, where c is the cost of an observation, but this would not significantly alter the discussion here.) This is clearly not possible for all θ; hence we must bring prior knowledge about θ into play.

A primitive recommendation that one often sees in such situations is to make a "best guess" θ₀ for θ, and then choose n so that θ₀/n ≤ C; that is, choose n ≥ θ₀/C. This is needlessly dogmatic, in that one rarely believes particularly strongly in a particular value θ₀.

A common primitive recommendation in the opposite direction is to choose an upper bound θ_U for θ, and then choose n so that θ_U/n ≤ C; that is, choose n ≥ θ_U/C. This is needlessly conservative, in that the resulting n will typically be much larger than needed.

The Bayesian approach to the design question is to elicit a subjective prior distribution π(θ) for θ, and then to choose n so that ∫ (θ/n) π(θ) dθ ≤ C; that is, choose n ≥ ∫ θ π(θ) dθ / C. This is a reasonable compromise between the above two extremes and will typically result in the most reasonable values of n.
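As a concrete illustration of the three rules, here is a minimal Python sketch (ours, not from the paper); the guess θ₀ = 4, the prior mean 5 (as would arise from, say, a hypothetical Gamma(5, 1) prior), the upper bound 20 and the limit C = 0.1 are all invented for illustration.

```python
# Sample-size rules for Example 2.1: the risk of the estimator is theta/n,
# and we want the (expected) risk to be at most C.
import math

def n_best_guess(theta0, C):
    # "Best guess" rule: smallest n with theta0 / n <= C.
    return math.ceil(theta0 / C)

def n_upper_bound(theta_U, C):
    # Conservative rule: smallest n with theta_U / n <= C.
    return math.ceil(theta_U / C)

def n_bayes(prior_mean, C):
    # Bayesian rule: smallest n with E[theta] / n <= C.
    return math.ceil(prior_mean / C)

print(n_best_guess(4, 0.1), n_bayes(5, 0.1), n_upper_bound(20, 0.1))
# -> 40 50 200: the Bayesian n lies between the two primitive extremes.
```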

Classical design texts often focus on the very special situations in which the design criterion is constant in the unknown model parameter θ, and hence fail to clarify the philosophical centrality of Bayesian issues in design. The basic fact is that, before experimentation, one knows neither the data nor θ, and so expectations over both (i.e., both frequentist and Bayesian expectations) are needed for design. See Chaloner and Verdinelli (1995) and Dawid and Sebastiani (1999).

A very common situation in which the design evaluation is not constant is classical testing, in which the sample size is often chosen to achieve a given power at a specified value of the parameter under the alternative hypothesis. Again, specifying a specific θ is very crude when viewed from a Bayesian perspective. Far more reasonable for a classical tester would be to specify a prior distribution for θ under the alternative, and consider the average power with respect to this distribution. (More controversial would be to consider an average Type I error.)

    2.2 The Meaning of Frequentism

There is a sense in which essentially everyone should ascribe to frequentism:

FREQUENTIST PRINCIPLE. In repeated practical use of a statistical procedure, the long-run average actual accuracy should not be less than (and ideally should equal) the long-run average reported accuracy.

This version of the frequentist principle is actually a joint frequentist–Bayesian principle. Suppose, for instance, that we decide it is relevant to statistical practice to repeatedly use a particular statistical model and procedure; for instance, a 95% classical confidence interval for a normal mean. This procedure will, in practice, be used on a series of different problems involving a series of different normal means with a corresponding series of data. Hence, in evaluating the procedure, we should simultaneously be averaging over the differing means and data.

This is in contrast to textbook statements of the frequentist principle, which tend to focus on fixing the value of, say, the normal mean, and imagining repeatedly drawing data from the given model and utilizing the confidence procedure repeatedly on this data. The word "imagining" is emphasized, because this is solely a thought experiment. What is done in practice is to use the confidence procedure on a series of different problems, not to use the confidence procedure for a series of repetitions of the same problem with different data (which would typically make no sense in practice).

Neyman himself repeatedly pointed out (see, e.g., Neyman, 1977) that the motivation for the frequentist principle is in its use on differing real problems, and not in imaginary repetitions for one problem with a fixed true parameter. Of course, the reason textbooks typically give the latter (philosophically misleading) version is the convenient mathematical fact that if, say, a confidence procedure has 95% frequentist coverage for each fixed parameter value, then it will necessarily also have 95% coverage when used repeatedly on a series of differing problems. Thus (as with design), whenever the frequentist evaluation is constant over the parameter space, one does not need to also do a Bayesian average over the parameter space; but, conceptually, it is the combined frequentist–Bayesian average that is practically relevant.

The impact of this "real" frequentist principle thus arises when the frequentist evaluation of a procedure is not constant over the parameter space. Here is an example.

EXAMPLE 2.2. Binomial confidence interval. Brown, Cai and DasGupta (2001, 2002) considered the problem of observing X ~ Binomial(n, θ) and determining a 95% confidence interval for the unknown success probability θ. We consider here the special case of n = 50, and two confidence procedures. The first is C^J(x), defined as the Jeffreys equal-tailed 95% confidence interval, given by

(2.1)   C^J(x) = (q_{0.025}(x), q_{0.975}(x)),

where q_α(x) is the α-quantile of the Beta(x + 0.5, 50.5 − x) distribution. The second confidence procedure we consider is the modified Jeffreys equal-tailed 95% confidence interval, given by

(2.2)   C^{J*}(x) = (q_{0.025}(x), q_{0.975}(x)),  if x ≠ 0 and x ≠ n;
        C^{J*}(x) = [0, q_{0.975}(x)),             if x = 0;
        C^{J*}(x) = (q_{0.025}(x), 1],             if x = n.

    For the moment, simply consider these as formulae forconfidence intervals; we later discuss their motivation.
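For concreteness, here is a minimal Python sketch of (2.1) and (2.2) using SciPy's Beta quantile function; the function names are ours, not the authors'.

```python
# Jeffreys equal-tailed interval (2.1) and its modification (2.2).
from scipy.stats import beta

def C_J(x, n=50, alpha=0.05):
    # Quantiles of the Beta(x + 0.5, n - x + 0.5) distribution.
    return (beta.ppf(alpha / 2, x + 0.5, n - x + 0.5),
            beta.ppf(1 - alpha / 2, x + 0.5, n - x + 0.5))

def C_J_star(x, n=50, alpha=0.05):
    # Extend the interval to the boundary when x = 0 or x = n.
    lo, hi = C_J(x, n, alpha)
    if x == 0:
        lo = 0.0
    if x == n:
        hi = 1.0
    return lo, hi

print(C_J(0))       # excludes theta = 0
print(C_J_star(0))  # includes theta = 0
```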

Brown, Cai and DasGupta (2001) provide the graph of the coverage probability of C^J given in Figure 1. Note that, while roughly close to the target 95%, the coverage probability varies considerably as a function of θ, going from a high of 1 at θ = 0 and θ = 1 to a low of 0.884 at θ = 0.049 and θ = 0.951. A textbook frequentist might then assert that this is only an 88.4% confidence procedure, since the coverage cannot be guaranteed to be higher than this limit. But would the practical frequentist agree with this?
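The coverage function graphed in Figure 1 is straightforward to compute exactly; the following Python sketch (ours) shows the computation.

```python
# Exact frequentist coverage of C^J at a given theta: sum the
# Binomial(n, theta) pmf over those x whose interval contains theta.
from scipy.stats import beta, binom

def C_J(x, n=50, alpha=0.05):
    return (beta.ppf(alpha / 2, x + 0.5, n - x + 0.5),
            beta.ppf(1 - alpha / 2, x + 0.5, n - x + 0.5))

def coverage(theta, n=50):
    total = 0.0
    for x in range(n + 1):
        lo, hi = C_J(x, n)
        if lo < theta < hi:
            total += binom.pmf(x, n, theta)
    return total

print(coverage(0.049))  # roughly 0.88, the low point cited above
print(coverage(0.5))    # closer to the nominal 0.95
```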

The practical frequentist evaluates how C^J would work for a sequence {θ₁, θ₂, ..., θ_m} of parameters (and corresponding data) encountered in a series of real problems. If m is large, the law of large numbers guarantees that the coverage that is actually experienced will be the average of the coverages obtained over the sequence of problems. Thus we should be considering averages of the coverage in Figure 1 over sequences of θⱼ.

[FIG. 1. Frequentist coverage of the C^J intervals, as a function of θ, when n = 50.]

One could, of course, choose the sequence of θⱼ to all be 0.049 and/or 0.951, but this is not very realistic. One might consider global averages with respect to sequences generated from prior distributions π(θ), but a practical frequentist presumably does not want to spend much time thinking about prior distributions. One plausible solution is to look at local average coverage, defined via a local smoothing of the binomial coverage function. A convenient computational kernel for this problem, when smoothing at a point θ is desired and when a smoothing kernel having standard deviation ε is desired (so that 2ε can roughly be thought of as the range over which the smoothing is performed), is the Beta(a(θ), a(1 − θ)) distribution k_{θ,ε}(ν), where

(2.3)   a(θ) = 1 − 2ε,                 if θ ≤ ε;
               θ[θ(1 − θ)/ε² − 1],     if ε < θ < 1 − ε;
               1 − 3ε + 2ε²,           if θ ≥ 1 − ε.

Writing the standard frequentist coverage as 1 − α(θ), this leads to the ε-local average coverage

1 − ᾱ_ε(θ) = ∫₀¹ [1 − α(ν)] k_{θ,ε}(ν) dν

           = Σ_{x=0}^{n} (n choose x) · [Γ(a(θ) + a(1 − θ)) Γ(a(θ) + x) Γ(a(1 − θ) + n − x)] / [Γ(a(θ)) Γ(a(1 − θ)) Γ(a(θ) + a(1 − θ) + n)] · ∫_{C^J(x)} Beta(ν | a(θ) + x, a(1 − θ) + n − x) dν,

the last equation following from the standard expression for the beta–binomial predictive distribution [and with Beta(ν | α₁, α₂) denoting the beta density with the given parameters].

[FIG. 2. Local average coverage of the C^J intervals, as a function of θ, when n = 50 and ε = 0.05.]

For the binomial example we are considering, this local average coverage is graphed in Figure 2, for ε = 0.05. (Such a value of ε could be interpreted as implying that one is "sure" that the sequence of practical problems, for which the binomial confidence interval will be used, has θⱼ varying by at least 0.05.) Note that this local average coverage is always close to 0.95, so that a practical frequentist would be quite pleased with the confidence interval.
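Under the reconstruction of (2.3) above, the ε-local average coverage can be computed directly from the beta–binomial predictive weights; the Python sketch below is ours, not from the paper.

```python
# eps-local average coverage of C^J, via the beta-binomial predictive.
import math
from scipy.stats import beta

def a_fun(t, eps=0.05):
    # The kernel parameter a(theta) of (2.3).
    if t <= eps:
        return 1 - 2 * eps
    if t >= 1 - eps:
        return 1 - 3 * eps + 2 * eps ** 2
    return t * (t * (1 - t) / eps ** 2 - 1)

def C_J(x, n=50, alpha=0.05):
    return (beta.ppf(alpha / 2, x + 0.5, n - x + 0.5),
            beta.ppf(1 - alpha / 2, x + 0.5, n - x + 0.5))

def local_average_coverage(theta, n=50, eps=0.05):
    a1, a2 = a_fun(theta, eps), a_fun(1 - theta, eps)
    total = 0.0
    for x in range(n + 1):
        # log of the beta-binomial predictive weight of x
        logw = (math.lgamma(n + 1) - math.lgamma(x + 1)
                - math.lgamma(n - x + 1)
                + math.lgamma(a1 + a2) + math.lgamma(a1 + x)
                + math.lgamma(a2 + n - x) - math.lgamma(a1)
                - math.lgamma(a2) - math.lgamma(a1 + a2 + n))
        lo, hi = C_J(x, n)
        # posterior probability of C^J(x) under Beta(a1 + x, a2 + n - x)
        prob = (beta.cdf(hi, a1 + x, a2 + n - x)
                - beta.cdf(lo, a1 + x, a2 + n - x))
        total += math.exp(logw) * prob
    return total

print(local_average_coverage(0.049))  # close to 0.95, as in Figure 2
```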

One could imagine a textbook frequentist arguing that, sometimes, a particular value, such as θ = 0.049, could be of special interest in repeated investigations, the value perhaps corresponding to some important physical theory that science will repeatedly investigate. In such a situation, however, it is arguably not appropriate to utilize confidence intervals; that there is a special value of θ of interest should be acknowledged via some type of testing procedure. Even if there were a distinguished value of θ and it was erroneously handled by finding a confidence interval, the practical frequentist has one more arrow in his or her quiver: it is not likely that a series of experiments investigating this particular physical theory would all choose the same sample size, so one should consider practical averaging over sample size. For instance, suppose sample sizes would vary between 40 and 60 for the binomial problem we have been considering. Then one could reasonably consider average coverage over these sample sizes, the result of which is given in Figure 3. While not always as close to 0.95 as was the local average coverage, it would still strike most people as reasonable to call C^J a 95% confidence interval when averaged over reasonable sample sizes.


[FIG. 3. Average coverage over n between 40 and 60 of the C^J intervals, as a function of θ.]

A similar idea concerning local averages of frequentist properties was employed by Woodroofe (1986), who called the concept "very weak expansions." Brown, Cai and DasGupta (2002), for the binomial problem, considered the average coverage defined as the smooth part of their asymptotic expansion of coverage, yielding a result similar to that in Figure 2. Rousseau (2000) took a different approach, considering slight adjustment of the Bayesian intervals through randomization to achieve the correct frequentist coverage.

So far the discussion has been in terms of the practical frequentist acknowledging the importance of considering averages over θ. We would also claim, however, that Bayesians should ascribe to the above version of the frequentist principle. If (say) a Bayesian were to repeatedly construct purported 90% credible intervals in his or her practical work, yet they only contained the unknowns about 70% of the time, something would be seriously wrong. A Bayesian might feel that the practical frequentist principle will automatically be satisfied if he or she does a good Bayesian job of separately analyzing each individual problem, and hence that it is not necessary to specifically worry about the principle; but that does not mean that the principle is invalid.

EXAMPLE 2.3. In this regard, let us return to the binomial example to discuss the origin of the confidence intervals C^J(x) and C^{J*}(x). The intervals C^J(x) arise as the Bayesian equal-tailed credible sets obtained from use of the Jeffreys prior (see Jeffreys, 1961) π(θ) ∝ θ^{−1/2}(1 − θ)^{−1/2} for θ. (In particular, the intervals are formed by the upper and lower α/2-quantiles of the resulting posterior distribution for θ.) This is the prior that is customary for an objective Bayesian to use for the binomial problem. (See Section 3.4.3 for further discussion of the Jeffreys prior.) Note that, because of the derivation of the credible set from the objective Bayesian perspective, there is strong reason to believe that, conditionally on the given situation and data, the accuracy assignment of 95% is reasonable. See Section 3.2.2 for discussion of conditional performance.

The frequentist coverage of the intervals C^J(x) is given in Figure 4. A pure frequentist might well be concerned with the raw coverage of this credible interval, because it goes to zero at θ = 0 and θ = 1. A moment's reflection reveals why this is the case: the equal-tailed Bayesian credible intervals purposely exclude values in the left and right tails of the posterior distribution and, hence, will always exclude θ = 0 and θ = 1. The modification of this interval employed in Brown, Cai and DasGupta (2001) is C^{J*}(x) in (2.2): for the observations x = 0 or x = n, one simply extends the Jeffreys equal-tailed credible intervals to include 0 or 1. Of course, from a conditional Bayesian perspective, these intervals then have posterior probability 0.975, so a Bayesian would no longer call them 95% credible intervals.

While the raw frequentist coverage of the Jeffreys equal-tailed credible intervals might seem unappealing, their 0.05-local average coverage is excellent, virtually the same as that in Figure 2 for the modified interval; indeed, the difference is not visually apparent, so we do not separately include a graph of this local average coverage. Hence the practical frequentist would be quite happy with use of C^J(x), even if it has low coverage right at the endpoints.

[FIG. 4. Coverage of the C^J intervals, as a function of θ, when n = 50.]


The issue of Bayesians achieving good pure frequentist coverage near a finite boundary of a parameter space is an interesting one; our guess is that this is often not possible. In the above example, for instance, whether a Bayesian includes, or excludes, θ = 0 or θ = 1 in a credible interval is rather arbitrary and will depend on, for example, a choice such as that between an equal-tailed or highest posterior density (HPD) interval. (The HPD intervals for x = 0 and x = n would include θ = 0 and θ = 1, respectively.) Furthermore, this choice will typically lead to either frequentist coverage of 0 or coverage of 1 at the endpoints, unless something unnatural to a Bayesian, such as randomization, were incorporated. Hence the recognition of the centrality to frequentist practice of some type of average coverage, rather than pointwise coverage, can be important in such problems to achieve simultaneously acceptable Bayesian and frequentist performance.

2.3 Empirical Bayes, Gamma Minimax, Restricted Risk Bayes

Several approaches to statistical analysis have been proposed which are inherently a mixture of Bayesian and frequentist analysis. These approaches have lengthy histories and extensive literatures, and so we can do little more here than simply give pointers to the areas.

Robbins (1955) introduced the empirical Bayes approach, in which one specifies a class Γ of prior distributions, but assumes that the prior is otherwise unknown. The data is then used to help determine the prior and/or to directly find the optimal Bayesian answer. Frequentist reasoning was intimately involved in Robbins' original formulation of empirical Bayes, and in significant implementations of the paradigm, such as Morris (1983) for hierarchical models. More recently, the name empirical Bayes is often used in association with approximate Bayesian analyses that do not specifically involve frequentist measures. (Simply using a maximum likelihood estimate of a hyperparameter does not make a technique frequentist.) For modern reviews of empirical Bayes analysis and earlier references, see Carlin and Louis (2000) and Robert (2001).

In the gamma minimax approach, one again has a class Γ of possible prior distributions, and one considers the frequentist Bayes risk (the expected loss over both the data and the unknown parameters) of the Bayes procedures for priors in the class. One then chooses the procedure that minimizes the maximum of this frequentist Bayes risk over the class. For examples and references, see Berger (1985a) and Vidakovic (2000).

In the restricted risk Bayes approach, one has a single prior distribution, but can only consider statistical procedures whose frequentist risk (expected loss) is constrained in some fashion. The idea is that one can utilize the prior information, but in a way that is guaranteed to be acceptable to the frequentist who wants to limit frequentist risk. (See Berger, 1985a, for discussion and earlier references.) This approach is actually not inherently Bayesian–frequentist, but is more what could be termed a hybrid approach, in the sense that it seeks some type of formal compromise between Bayesian and frequentist positions. There have been many other attempts at such compromises, but none has seemed to significantly affect statistical practice.

There are many other important areas in which joint frequentist–Bayesian evaluation is used. Some were even developed primarily from the Bayesian perspective, such as the prequential approach of Dawid (cf. Dawid and Vovk, 1999).

    3. ESTIMATION AND CONFIDENCE INTERVALS

In statistical estimation (including the development of confidence intervals), objective Bayesian and frequentist methods often give similar (or even identical) answers in standard parametric problems with continuous parameters. The standard normal linear model is the prototypical example: frequentist estimates and confidence intervals coincide exactly with the standard objective Bayesian estimates and credible intervals. Indeed, this occurs more generally in situations that exhibit an invariance structure, provided objective Bayesians use the right-Haar prior density; see Berger (1985a), Eaton (1989) and Robert (2001) for discussion and earlier references.

This dual frequentist–Bayesian interpretation of many textbook estimation procedures has a number of important implications, not the least of which is that much of standard textbook statistical methodology (and standard software) can alternatively be presented and described from the objective Bayesian perspective. In particular, one can teach much of elementary statistics from this alternative perspective, without changing the procedures that are taught.

In more complicated situations, it is still usually possible to achieve near-agreement between frequentist and Bayesian estimation procedures, although this may require careful utilization of the tools of both. A number of situations requiring such cross-utilization of tools are discussed in this section.


3.1 Computation with Hierarchical, Multilevel or Mixed Model Analysis

With the advent of Gibbs sampling and other Markov chain Monte Carlo (MCMC) methods of analysis (cf. Robert and Casella, 1999), it has become relatively standard to deal with models that go under any of the names listed in the section title via Bayesian methods. This popularity of the Bayesian methods is not necessarily because of their intrinsic virtues, but rather because Bayesian computation is now much easier than computation via more classical routes. See Hobert (2000) for an overview and other references.

On the other hand, any MCMC method relies fundamentally on frequentist reasoning to do the computation. An MCMC method generates a sequence of simulated values θ₁, θ₂, ..., θ_m of an unknown quantity θ, and then relies upon a law of large numbers or ergodic theorem (both frequentist) to assert that the average θ̄_m = (1/m) Σ_{i=1}^{m} θᵢ converges to the desired posterior expectation. Furthermore, diagnostics for MCMC convergence are almost universally based on frequentist tools. There is a purely Bayesian way of looking at such computation problems, which goes under the heading of Bayesian numerical analysis (cf. Diaconis, 1988a; O'Hagan, 1992), but in practice it is typically much simpler to utilize the frequentist reasoning.
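As a toy illustration (ours, with an invented standard normal "posterior"), the ergodic-average reasoning looks as follows in Python:

```python
# Toy random-walk Metropolis sampler; the estimate of the posterior mean is
# the frequentist ergodic average of the simulated values.
import numpy as np

def log_post(theta):
    # Hypothetical log posterior: standard normal, so E[theta] = 0.
    return -0.5 * theta ** 2

rng = np.random.default_rng(0)
theta, draws = 0.0, []
for _ in range(50_000):
    prop = theta + rng.normal(scale=1.0)
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    draws.append(theta)

print(np.mean(draws))  # ergodic average; close to 0
```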

In conclusion, for much of modern statistical analysis in hierarchical models, we already see an inseparable joining of Bayesian and frequentist methodology.

    3.2 Assessment of Accuracy of Estimation

Frequentist methodology for point estimation of unknown model parameters is relatively straightforward and successful. However, assessing the accuracy of the estimates is considerably more challenging, and is a problem for which frequentists should draw heavily on Bayesian methodology.

3.2.1 Finding good confidence intervals in the presence of nuisance parameters. Confidence intervals for a model parameter are a common way of indicating the accuracy of an estimate of the parameter. Finding good confidence intervals when there are nuisance parameters is very challenging within the frequentist paradigm, unless one utilizes objective Bayesian methodology, in which case the frequentist problem becomes relatively straightforward. Indeed, here is a rather general prescription for finding confidence intervals using objective Bayesian methods:

• Begin with a reasonable objective prior distribution. (See Section 3.4 for discussion of objective priors, and note that a reasonable objective prior may well depend on which parameter is the parameter of interest.)

• By simulation, obtain a (large) sample from the posterior distribution of the parameter of interest:

  Option 1. If a predetermined confidence interval C(X) is of interest, simply approximate the posterior probability of the interval by the fraction of the samples from the posterior distribution that fall in the interval.

  Option 2. If the confidence interval is not predetermined, find the α/2 upper and lower fractiles of the posterior sample; the interval between these fractiles approximates the 100(1 − α)% equal-tailed posterior credible interval for the parameter of interest. (Alternative forms for the confidence set can be considered, but the equal-tailed interval is fine for most applications.)

• Assert that the obtained interval is the frequentist confidence interval, having frequentist coverage given by the posterior probability of the interval.

There is a large body of theory, discussed in Section 3.4, as well as considerable practical experience, supporting the validity of constructing frequentist confidence intervals in this way. Here is one example from the practical experience side.

EXAMPLE 3.1. Medical diagnosis (Mossman and Berger, 2001). Within a population for which p₀ = Pr(Disease D), a diagnostic test results in either a Positive (+) or Negative (−) reading. Let p₁ = Pr(+ | patient has D) and p₂ = Pr(+ | patient does not have D). By Bayes theorem,

ψ = Pr(D | +) = p₀p₁ / [p₀p₁ + (1 − p₀)p₂].

In practice, the pᵢ are typically unknown, but for i = 0, 1, 2 there are available (independent) data xᵢ having Binomial(nᵢ, pᵢ) densities. It is desired to find a 100(1 − α)% confidence set for ψ that has good conditional and frequentist properties.

A simple objective Bayesian approach to this problem is to utilize the Jeffreys priors π(pᵢ) ∝ pᵢ^{−1/2}(1 − pᵢ)^{−1/2} for each of the pᵢ, and to compute the 100(1 − α)% equal-tailed posterior credible interval for ψ. A suitable implementation of the algorithm presented above is as follows:

• Draw random pᵢ from the Beta(xᵢ + 1/2, nᵢ − xᵢ + 1/2) posterior distributions, i = 0, 1, 2.

• Compute the associated ψ = p₀p₁ / [p₀p₁ + (1 − p₀)p₂] for each random triplet.

• Repeat this process 10,000 times.

• The α/2 and 1 − α/2 fractiles of these 10,000 generated ψ form the desired confidence interval. [In other words, simply order the 10,000 values of ψ, and let the confidence interval be the interval between the (10,000 · α/2)th and (10,000 · (1 − α/2))th values.]
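A minimal Python sketch of this simulation follows (our code; the data values in the example call are hypothetical):

```python
# Objective Bayes interval for psi = Pr(D | +), Example 3.1.
import numpy as np

def psi_interval(x, n, alpha=0.05, m=10_000, seed=0):
    rng = np.random.default_rng(seed)
    # Jeffreys Beta(x_i + 1/2, n_i - x_i + 1/2) posteriors for p0, p1, p2
    p0, p1, p2 = (rng.beta(xi + 0.5, ni - xi + 0.5, size=m)
                  for xi, ni in zip(x, n))
    psi = p0 * p1 / (p0 * p1 + (1 - p0) * p2)
    return np.quantile(psi, [alpha / 2, 1 - alpha / 2])

# Hypothetical data with n_i = 20, purely for illustration:
print(psi_interval(x=(5, 15, 5), n=(20, 20, 20)))
```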

The proposed objective Bayesian procedure is clearly simple to use, but is the resulting confidence interval a satisfactory frequentist interval? To provide perspective on this question, note that the above problem has also been studied in the frequentist literature, using standard log-odds and delta-method procedures to develop confidence intervals, as well as more sophisticated approaches such as the Gart–Nam (Gart and Nam, 1988) procedure. For a description of these classical methods, as applied to this problem of medical diagnosis, see Mossman and Berger (2001).

Table 1 gives an indication of the frequentist performance of the confidence intervals developed by these four methods. It is based on a simulation that repeatedly generates data from binomial distributions with sample sizes nᵢ = 20 and the indicated values of the parameters (p₀, p₁, p₂). For each generated triplet of data in the simulation, the 95% confidence interval is computed using the objective Bayesian algorithm or one of the three classical methods. It is then noted whether the computed interval contains the true ψ, or misses to the left or right. The entries in the table are the long-run proportions of misses to the left or right. Ideally, these proportions should be 0.025 and, at the least, their sum should be 0.05.

Clearly the objective Bayes interval has quite good frequentist performance, better than any of the classically derived confidence intervals. Furthermore, it can be seen that the objective Bayes intervals are, on average, smaller than the classically derived intervals. (See Mossman and Berger, 2001, for these and more extensive computations.) Finally, the objective Bayes confidence intervals were the simplest to derive and will automatically be conditionally appropriate (see Section 3.2.2), because of their Bayesian derivation.

The finding in the above example, that objective Bayesian analysis very easily provides small confidence sets with excellent frequentist coverage, has been repeatedly shown to happen. See Section 3.4 for additional discussion.

3.2.2 Obtaining good conditional measures of accuracy. Developing frequentist confidence intervals using the Bayesian approach automatically provides an additional significant benefit: the confidence statement will be conditionally appropriate. Here is a simple artificial example.

EXAMPLE 3.2. Two observations, X₁ and X₂, are to be taken, where

Xᵢ = θ + 1, with probability 1/2;  Xᵢ = θ − 1, with probability 1/2.

Consider the following confidence set for the unknown θ:

C(X₁, X₂) = the point {(X₁ + X₂)/2}, if X₁ ≠ X₂;  the point {X₁ − 1}, if X₁ = X₂.

The frequentist coverage of this confidence set can easily be shown to be P_θ(C(X₁, X₂) contains θ) = 0.75. This is not at all a sensible report once the data is at hand. To see this, observe that, if x₁ ≠ x₂, then we know for sure that their average is equal to θ, so that the confidence set is then actually 100% accurate. On the other hand, if x₁ = x₂, we do not know if θ is the data's common value plus 1 or their common value minus 1, and each of these possibilities is equally likely to have occurred.

To obtain sensible frequentist answers here, one must define the conditioning statistic S = |X₁ − X₂|, which can be thought of as measuring the "strength of evidence" in the data (S = 2 reflecting data with maximal evidential content and S = 0 being data of minimal evidential content). Then one defines frequentist coverage conditional on the strength of evidence S. For the example, an easy computation shows that this conditional confidence equals

P(C(X₁, X₂) contains θ | S = 2) = 1,
P(C(X₁, X₂) contains θ | S = 0) = 1/2,

for the two distinct cases, which are the intuitively correct answers.

TABLE 1
The probability that the nominal 95% interval misses the true ψ on the left and on the right, for the indicated parameter values and when n₀ = n₁ = n₂ = 20

(p₀, p₁, p₂)          O-Bayes           Log odds          Gart–Nam          Delta
(1/4, 3/4, 1/4)       0.0286, 0.0271    0.0153, 0.0155    0.0277, 0.0257    0.0268, 0.0245
(1/10, 9/10, 1/10)    0.0223, 0.0247    0.0017, 0.0003    0.0158, 0.0214    0.0083, 0.0041
(1/2, 9/10, 1/10)     0.0281, 0.0240    0.0004, 0.0440    0.0240, 0.0212    0.0125, 0.0191
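Returning to Example 3.2, a quick Monte Carlo check of these conditional coverages is easy to write down (our sketch; the true value θ = 4 is arbitrary):

```python
# Simulate Example 3.2 and tabulate coverage conditional on S = |X1 - X2|.
import numpy as np

rng = np.random.default_rng(0)
theta = 4.0  # arbitrary true value (integer-valued for exact comparison)
hits = {0: [], 2: []}
for _ in range(100_000):
    x1, x2 = theta + rng.choice([-1.0, 1.0], size=2)
    S = int(abs(x1 - x2))
    guess = (x1 + x2) / 2 if S == 2 else x1 - 1
    hits[S].append(guess == theta)

print({S: np.mean(h) for S, h in hits.items()})  # ~{0: 0.5, 2: 1.0}
print(np.mean(hits[0] + hits[2]))                # unconditional, ~0.75
```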

It is important to realize that conditional frequentist measures are fully frequentist and (to most people) clearly better than unconditional frequentist measures. They have the same unconditional property (e.g., in the above example one will report 100% confidence half the time and 50% confidence half the time, resulting in an average of 75% confidence, as must be the case for a frequentist measure), yet give much better indications of the accuracy for the type of data that one has actually encountered.

In the above example, finding the appropriate conditioning statistic was easy but, in more involved situations, it can be a challenging undertaking. Luckily, intervals developed via the Bayesian approach will automatically condition appropriately. For instance, in the above example, the objective Bayesian approach assigns the standard objective prior (for a location parameter) π(θ) = 1, from which it is easy to compute that the posterior probability assigned to the set C(X₁, X₂) is 1 or 0.5 as the observations differ or are the same. (This is essentially Option 1 of the algorithm described at the beginning of Section 3.2.1, although here the posterior probabilities can be computed analytically.)

General theory about conditional confidence can be found in Kiefer (1977); see also Robinson (1979), Berger (1985b), Berger and Wolpert (1988), Casella (1988) and Lehmann and Casella (1998). In Section 4.1, we will return to this dual theme that (i) it is crucial for frequentists to condition appropriately, and (ii) this is technically most easily accomplished by using Bayesian tools.

3.2.3 Accuracy assessment in hierarchical models. As mentioned earlier, the utilization of hierarchical (or random effects, or mixed, or multilevel) models has increasingly taken a Bayesian flavor in practice, in part driven by the computational advantages of Gibbs sampling and MCMC analysis. Another reason for this greatly increasing utilization of the Bayesian approach to such problems is that practitioners are finding the inferences that arise from the Bayesian approach to be considerably more realistic than those from competitors, such as various versions of maximum likelihood estimation (or empirical Bayes estimation) or (often worse) unbiased estimation.

One of the potentially severe problems with the maximum likelihood or empirical Bayes approach is that maximum likelihood estimates of variances in hierarchical models (or variance component models) can easily be zero, especially when there are numerous variances in the model being estimated. (Unbiased estimation will be even worse in such situations; if the mle is zero, the unbiased estimate will be negative.)

EXAMPLE 3.3. Suppose, for i = 1, ..., p, that Xᵢ ~ Normal(μᵢ, 1) and μᵢ ~ Normal(0, τ²), all random variables being independent. Then, marginally, Xᵢ ~ Normal(0, 1 + τ²), so that the likelihood function of τ² can be written

(3.1)   L(τ²) ∝ (1 + τ²)^{−p/2} exp( −S² / [2(1 + τ²)] ),

where S² = Σ Xᵢ². The mle for τ² is easily calculated to be τ̂² = max{0, S²/p − 1}. Thus, if S² < p, the mle would be τ̂² = 0 (and the unbiased estimate would be negative). While a value of S² < p is somewhat unusual here [if, e.g., p = 4 and τ² = 1, then Pr(S² < p) = 0.264], it is quite common in problems with numerous variance components to have at least one mle variance estimate equal to 0.

For p = 4 and S² = 4, the likelihood function in (3.1) is graphed in Figure 5. While L(τ²) is decreasing away from 0, it does not decrease particularly quickly, clearly indicating that there is considerable uncertainty as to the true value of τ² even though the mle is 0.

[FIG. 5. Likelihood function of τ² when p = 4 and S² = 4 is observed.]
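The likelihood (3.1) and its mle are easy to reproduce numerically; the following sketch is ours, not from the paper.

```python
# Marginal likelihood (3.1) of tau^2 and its mle, max{0, S^2/p - 1}.
import numpy as np

def likelihood(tau2, S2=4.0, p=4):
    return (1 + tau2) ** (-p / 2) * np.exp(-S2 / (2 * (1 + tau2)))

S2, p = 4.0, 4
mle = max(0.0, S2 / p - 1)
print(mle)                                # 0.0: the mle sits on the boundary
print(likelihood(2.0) / likelihood(0.0))  # ~0.42: slow decay away from 0
```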

Utilizing an mle of 0 as a variance estimate can be quite dangerous, because it will typically affect the ensuing analysis in an incorrectly aggressive fashion. In the above example, for instance, setting τ² to 0 is equivalent to stating that all the μᵢ are exactly equal to each other. This is clearly unreasonable in light of the fact that there is actually great uncertainty about τ², as reflected in Figure 5. Since the likelihood maximum occurs at the boundary of the parameter space, it is also very difficult to utilize likelihood or frequentist methods to attempt to incorporate uncertainty about τ² into the analysis.

None of these difficulties arises in the Bayesian approach, and the vague nature of the information in the data about such variances will be clearly reflected in the posterior distribution. For instance, if one were to use the constant prior density π(τ²) = 1 in the above example, the posterior density would be proportional to the likelihood in Figure 5, and the significant uncertainty about τ² would permeate the analysis.

    3.3 Foundations, Minimaxity and Exchangeability

There are numerous ties between frequentist and Bayesian analysis at the foundational level. The foundation of frequentist statistics typically focuses on the class of "optimal" procedures in a given situation, called a complete class of procedures. Through the work of Wald (1950) and others, it has long been known that a complete class of procedures is identical to the class of Bayes procedures or certain limits thereof. Furthermore, in proving frequentist optimality of a procedure, it is typically necessary to employ Bayesian tools. (See Berger, 1985a; Robert, 2001, for many examples and references.) Hence, at a fundamental level, the frequentist paradigm is intertwined with the Bayesian paradigm.

Interestingly, this fundamental duality has not had a pronounced effect on the Bayesian versus frequentist debate. In part, this is because many frequentists find the search for optimal frequentist procedures to be of limited practical utility (since such searches usually take place in rather limited settings, from the perspective of practice), and hence do not themselves pursue optimality and thereby come into contact with the Bayesian equivalence. Even among frequentists who are significantly concerned with optimality, it is typically perceived that the relationship with Bayesian analysis is a nice mathematical coincidence that can be used to eliminate inferior frequentist procedures, but that Bayesian ideas should not form the basis for choice among acceptable frequentist procedures. Still, the complete class theorems provide a powerful underlying link between frequentist and Bayesian statistics.

One of the most prominent frequentist principles for choosing a statistical procedure is that of minimaxity; see Brown (1994, 2000) and Strawderman (2000) for reviews of the important impact of this concept on statistics. Bayesian analysis again provides the most useful tool for deriving minimax procedures: one finds the "least favorable" prior distribution, and the minimax procedure is the resulting Bayes rule.

To many Bayesians, the most compelling foundation of statistics is that based on exchangeability, as developed in de Finetti (1970). From the assumption of exchangeability of an infinite sequence X₁, X₂, ... of observations (essentially the assumption that the distribution of the sequence remains the same under permutation of the coordinates), one can sometimes deduce the existence of a particular statistical model, with unknown parameters, and a prior distribution on the parameters. By considering an infinite series of observations, frequentist reasoning, or at least frequentist mathematics, is clearly involved. Reviews of more recent developments and other references can be found in Diaconis (1988b) and Lad (1996).

There are many other foundational arguments that begin with axioms of rational behavior and lead to the conclusion that some type of Bayesian behavior is implied. (See Bernardo and Smith, 1994, for review and references.) Many of these effectively involve simultaneous frequentist–Bayesian evaluations of outcomes, such as Rubin (1987), which is perhaps the weakest set of axioms that implies Bayesian behavior.

3.4 Use of Frequentist Methodology in Prior Development

In principle, a subjective Bayesian need not worry about frequentist ideas: if a prior distribution is elicited and accurately reflects prior beliefs, then Bayes theorem guarantees that any resulting inference will be optimal. The hitch is that it is not very common to have a prior distribution that accurately reflects all prior beliefs. Suppose, for instance, that the only unknown model parameter is a normal mean θ. Complete assessment of the prior distribution for θ involves an infinite number of judgments [e.g., specification of the probability of the interval (−∞, r) for any rational number r]. In practice, of course, only a few assessments are ever made, with the others being made conventionally (e.g., one might specify the first quartile and the median, but then choose a Cauchy density for the prior). Clearly one should worry about the effect of features of the prior that were not elicited.

Even more common in practice is to utilize a default or objective prior distribution, and Bayes theorem does not then provide any guarantee as to performance. It has thus proved to be very useful to evaluate partially elicited and objective priors by utilizing frequentist techniques to evaluate their properties in repeated use.

3.4.1 Information-based developments. A number of developments of prior distributions utilize information-based arguments that rely on frequentist measures. Consider the reference prior theory, for instance, initiated in Bernardo (1979) and refined in Berger and Bernardo (1992). The reference prior is defined to be that distribution which maximizes the asymptotic Kullback–Leibler divergence between the posterior distribution and the prior distribution, thus hopefully obtaining a prior that "minimizes information" in an appropriate sense. This divergence is calculated with respect to a joint frequentist–Bayesian computation since, as in design, it is being computed before any data has been obtained.

The reference prior approach has arguably been the most generally successful method of obtaining Bayes rules that have excellent frequentist performance (see Berger, Philippe and Robert, 1998, as but one example). There are, furthermore, many other features of reference priors that are influenced by frequentist matters. One such feature is that the reference prior typically depends not only on the model, but also on which parameter is the inferential focus. Without such dependence on the parameter of interest, optimal frequentist performance is typically not attainable by Bayesian methods.

A number of other information-based priors have also been derived. See Soofi (2000) for an overview and references.

3.4.2 Consistency. Perhaps the simplest frequentist estimation tool that a Bayesian can usefully employ is consistency: as the sample size grows to ∞, does the estimate being studied converge to the true value (in a suitable sense of convergence)? Bayes estimates are virtually always consistent if the parameter space is finite-dimensional (see Schervish, 1995, for a typical result and earlier references), but this need not be true if the parameter space is not finite-dimensional or in irregular cases (see Ghosh, Ghosal and Samanta, 1994). Here is an example of the former.

EXAMPLE 3.4. In numerous models in use today, the number of parameters increases with the amount of data. The classic example of this is the Neyman–Scott problem (Neyman and Scott, 1948), in which one observes

Xᵢⱼ ~ N(μᵢ, σ²),  i = 1, ..., n, j = 1, 2,

and is interested in estimating σ². Defining x̄ᵢ = (xᵢ₁ + xᵢ₂)/2, x̄ = (x̄₁, ..., x̄ₙ), S² = Σᵢ₌₁ⁿ (xᵢ₁ − xᵢ₂)² and μ = (μ₁, ..., μₙ), the likelihood function can be written

L(μ, σ) ∝ (1/σ^{2n}) exp( −(1/σ²) [ |x̄ − μ|² + S²/4 ] ).

Until relatively recently, the most commonly used objective prior was the Jeffreys-rule prior (Jeffreys, 1961), here given by π_J(μ, σ) = 1/σ^{n+1}. The resulting posterior distribution for σ is proportional to the likelihood times the prior which, after integrating out μ, is

π(σ | x) ∝ (1/σ^{2n+1}) exp( −S² / (4σ²) ).

One common Bayesian estimate of σ² is the posterior mean, which here is S²/[4(n − 1)]. This estimate is inconsistent, as can be seen by applying simple frequentist reasoning to the situation. Indeed, note that (Xᵢ₁ − Xᵢ₂)²/(2σ²) is a chi-squared random variable with one degree of freedom, and hence that S²/(2σ²) is chi-squared with n degrees of freedom. It follows by the law of large numbers that S²/(2n) → σ², so that the Bayes estimate converges to σ²/2, the wrong value. (Any other natural Bayesian estimate, such as the posterior median or posterior mode, can also be seen to be inconsistent.)
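A short simulation (ours) makes the inconsistency visible:

```python
# Neyman-Scott: the posterior mean S^2/(4(n-1)) under the Jeffreys-rule
# prior converges to sigma^2/2 rather than to sigma^2 (here sigma^2 = 1).
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
for n in (100, 10_000, 1_000_000):
    mu = rng.normal(size=n)                          # arbitrary means mu_i
    x = rng.normal(loc=mu[:, None], scale=sigma, size=(n, 2))
    S2 = np.sum((x[:, 0] - x[:, 1]) ** 2)
    print(n, S2 / (4 * (n - 1)))                     # approaches 0.5, not 1.0
```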

The problem in the above example is that the Jeffreys-rule prior is often inappropriate in multidimensional settings, yet it can be difficult or impossible to assess this problem within the Bayesian paradigm itself. Indeed, the inadequacy of the multidimensional Jeffreys-rule prior has led to a search for improved objective priors in multivariable settings. The reference prior approach, mentioned earlier, has been one successful solution. [For the Neyman–Scott problem, the reference prior is π_R(μ, σ) = 1/σ, which results in a consistent posterior mean and, indeed, yields inferences that are numerically equal to the classical inferences for σ².] Another approach to developing improved priors is discussed in Section 3.4.3.


3.4.3 Frequentist performance: coverage and admissibility. Consistency is a rather crude frequentist criterion, and more sophisticated frequentist evaluations of the performance of Bayesian procedures are often considered. For instance, one of the most common approaches to the evaluation of an objective prior distribution is to see if it yields posterior credible sets that have good frequentist coverage properties. We have already seen examples of this method of evaluation in Examples 2.2 and 3.1.

Evaluation by frequentist coverage has actually been given a formal theoretical definition and is called the frequentist-matching approach to developing objective priors. The idea is to look at one-sided Bayesian credible sets for the unknown quantity of interest, and then seek the prior distribution for which the credible sets have optimal frequentist coverage asymptotically. Welch and Peers (1963) developed the first extensive results in this direction, essentially showing that, for one-dimensional continuous parameters, the Jeffreys prior is frequentist-matching. There is an extensive literature devoted to finding frequentist-matching priors in multivariate contexts; see Efron (1993), Rousseau (2000), Ghosh and Kim (2001), Datta, Mukerjee, Ghosh and Sweeting (2000) and Fraser, Reid, Wong and Yi (2003) for some recent results and earlier references.

Other frequentist properties have also been used to help in the choice of an objective prior. For instance, if estimation is the goal, it has long been common to utilize the frequentist concept of admissibility to help in the selection of the prior. The idea behind admissibility is to define a loss function in estimation (e.g., squared error loss), and then see if a proposed estimator can be beaten in terms of frequentist expected loss (e.g., mean squared error). If so, the estimator is said to be inadmissible; if it cannot be beaten, it is admissible. For instance, in situations having what is known as a group invariance structure, it has long been known that the prior distribution defined by the right-Haar measure will typically yield Bayes estimates that are admissible from a frequentist perspective, while the seemingly more natural (to a Bayesian) left-Haar measure will typically fail to yield admissible estimators. Thus use of the right-Haar priors has become standard. See Berger (1985a) and Robert (2001) for general discussion and many examples of the use of admissibility.

Another situation in which admissibility has played an important role in prior development is in the choice of Bayesian priors in hierarchical modeling. In a sense, this topic was initiated in Stein (1956), which effectively showed that the usual constant prior for a multivariate normal mean would result in an inadmissible estimator under quadratic loss (in three or more dimensions). One of the first Bayesian works to address this issue was Hill (1974). To access the huge resulting literature on the role of admissibility in the choice of hierarchical priors, see Brown (1971), Berger and Robert (1990), Berger and Strawderman (1996), Robert (2001) and Tang (2001).

Here is an example where initial admissibility considerations led to significant Bayesian developments.

EXAMPLE 3.5. Consider estimation of a covariance matrix Σ, based on i.i.d. multivariate normal data (x₁, ..., xₙ), where each column vector xᵢ arises from the N_k(0, Σ) density. The sufficient statistic for Σ is S = Σᵢ₌₁ⁿ xᵢxᵢᵀ. Since Stein (1975), it has been understood that the commonly used estimates of Σ, which are various multiples of S (depending on the loss function considered), are seriously inadmissible. Hence there has been a great effort in the frequentist literature (see Yang and Berger, 1994, for references) to develop better estimators of Σ.

The interest in this from the Bayesian perspective is that by far the most commonly used subjective prior for a covariance matrix is the inverse Wishart prior (for subjectively specified a and b)

(3.2)   π(Σ) ∝ |Σ|^{−a/2} exp( −(1/2) tr[bΣ⁻¹] ).

A frequently used objective version of this prior is the Jeffreys-rule prior, given by choosing a = k + 1 and b = 0. When one notes that the Bayesian estimates arising from these priors are linear functions of S, which were deemed to be seriously inadequate by the frequentists, there is clear cause for concern about the routine use of these priors.

In this case, it is possible to also indicate the problem with these priors utilizing Bayesian reasoning. Indeed, write Σ = HᵗDH, where H is an orthogonal matrix and D is a diagonal matrix whose diagonal entries are the eigenvalues of the matrix, d₁ > d₂ > ⋯ > d_k. A change of variables yields

π(Σ) dΣ = |D|^{−a/2} exp( −(1/2) tr[bD⁻¹] ) · Πᵢ<ⱼ (dᵢ − dⱼ) · I_[d₁>⋯>d_k] dD dH,

where I_[d₁>⋯>d_k] denotes the indicator function on the given set [and the product Πᵢ<ⱼ (dᵢ − dⱼ) arises as the Jacobian of the transformation]. Since this product of differences vanishes as eigenvalues come together, these priors force apart the eigenvalues of the covariance matrix; the priors give near-zero density to close eigenvalues. This is contrary to typical prior beliefs. Indeed, often in modelling, one is debating between assuming an exchangeable covariance structure (and hence equal eigenvalues) or allowing a more general structure. When one is contemplating whether or not to assume equal eigenvalues, it is clearly inappropriate to use a prior distribution that gives essentially no weight to equal eigenvalues, and instead forces them apart.

As an alternative objective prior here, the reference prior was derived in Yang and Berger (1994) and is given by π(D, H) dD dH = |D|⁻¹ dD dH; this clearly eliminates the forcing apart of eigenvalues. Furthermore, it is shown in Yang and Berger (1994) that use of the reference prior often results in improvements in estimating Σ on the order of 50% over use of the Jeffreys prior.

Motivated, in part, by the significant inferiority of the standard inverse Wishart and Jeffreys-rule priors for Σ, a large Bayesian literature has developed in recent years that provides alternative prior distributions for a covariance matrix. See Tang (2001) and Daniels and Pourahmadi (2002) for examples and earlier references.

Note that we are not only focusing on objective priors here. Even proper priors that are commonly used by subjectivists can have hidden and highly undesirable features, such as the forcing apart of the eigenvalues for the inverse Wishart priors in the above example, and frequentist (and objective Bayesian) tools can expose these features and allow for development of better subjective priors.

3.4.4 Robust Bayesian analysis. Robust Bayesian analysis formally recognizes the impossibility of complete subjective specification of the model and prior distribution; as mentioned earlier, complete specification would involve an infinite number of assessments, even in the simplest situations. It follows that one should, ideally, work with a class Γ of prior distributions, with the class reflecting the uncertainty remaining after the (finite) elicitation efforts. (Γ could also reflect the differing judgments of various individuals involved in the decision process.)

While much of robust Bayesian analysis takes place in a purely Bayesian framework (e.g., determining the range of the posterior mean as the prior ranges over Γ), it also has strong connections with the empirical Bayes, gamma minimax and restricted risk Bayes approaches discussed in Section 2.3. See Berger (1985a, 1994), Delampady et al. (2001) and Ríos Insua and Ruggeri (2000) for discussion and references.

3.4.5 Nonparametric Bayesian analysis. In nonparametric statistical analysis, the unknown quantity in a statistical model is a function or a probability distribution. A Bayesian approach to such problems requires placing a prior distribution on this space of functions or space of probability distributions. Perhaps surprisingly, Bayesian analysis of such problems is computationally quite feasible and is seeing significant practical implementation; cf. Dey, Müller and Sinha (1998).

Function spaces and spaces of probability measures are enormous spaces, and subjective elicitation of a prior on these spaces is not really feasible. Thus, in practice, it is typical to use a convenient form for a nonparametric prior (typically chosen for computational reasons), with perhaps a small number of features of the prior being subjectively specified. Thus, much as in the case of the Neyman–Scott example, one worries that the unspecified features of the prior may overwhelm the data and result in inconsistency or poor frequentist performance. Furthermore, there is evidence (e.g., Freedman, 1999) that Bayesian credible sets and frequentist confidence sets need not agree in nonparametric problems, making it more difficult to judge performance.

There is a long-time literature on such issues, the earlier period going from Freedman (1963) through Diaconis and Freedman (1986). To access the more recent literature, see Barron (1999), Barron, Schervish and Wasserman (1999), Ghosal, Ghosh and van der Vaart (2000), Zhao (2000), Kim and Lee (2001), Belitser and Ghosal (2003) and Ghosh and Ramamoorthi (2003).

3.4.6 Impropriety and identifiability. One of the most crucial problems that Bayesians face in dealing with complex modeling situations is that of ensuring that the posterior distribution is proper; use of improper objective priors can result in improper posterior distributions. (Use of vague proper priors in such situations will formally result in proper posterior distributions, but these posteriors will essentially be meaningless if the limiting improper objective prior would have resulted in an improper posterior distribution.)

One of the major situations in which impropriety can arise is when there is a problem of parameter identifiability, as in the following example.

EXAMPLE 3.6. Suppose, for i = 1, . . . , p, that X_i ∼ Normal(θ_i, σ²) and θ_i ∼ Normal(0, τ²), all random variables being independent. Then, marginally, X_i ∼ Normal(0, σ² + τ²), and it is clear that we cannot separately estimate σ² and τ² (although we can estimate their sum); in classical language, σ² and τ² are not identifiable. Were a Bayesian to attempt to utilize an improper objective prior here, such as π(σ², τ²) = 1, the posterior distribution would be improper.

The point here is that frequentist insight and literature about identifiability can be useful to a Bayesian in determining whether there is a problem with posterior propriety. Thus, in the above example, upon recognizing the identifiability problem, the Bayesian will know not to use the improper objective prior and will attempt to elicit a true subjective proper prior for at least one of σ² or τ². (Of course, more data, such as replications at the first stage of the model, could also be sought.)
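
A short numerical check makes the flatness of the likelihood in Example 3.6 visible (a sketch of ours, assuming numpy and scipy; the chosen variances are arbitrary): since marginally X_i ∼ Normal(0, σ² + τ²), any (σ², τ²) pairs with the same sum fit the data identically.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    x = rng.normal(0.0, np.sqrt(1.0 + 2.0), size=50)  # true sigma2=1, tau2=2

    def marginal_loglik(sigma2, tau2):
        # The marginal likelihood depends on (sigma2, tau2) only through
        # their sum, so it is constant along ridges sigma2 + tau2 = const.
        return norm.logpdf(x, scale=np.sqrt(sigma2 + tau2)).sum()

    # Identical log-likelihood along the ridge sigma2 + tau2 = 3:
    for s2, t2 in [(0.5, 2.5), (1.5, 1.5), (2.9, 0.1)]:
        print(s2, t2, marginal_loglik(s2, t2))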

3.5 Frequentist Simplifications and Asymptotic Approximations

Situations can occur in which straightforward use of frequentist intuition directly yields sensible answers. In the Neyman–Scott problem, for instance, consideration of the paired differences, x_{i1} − x_{i2}, directly yielded a sensible answer. In contrast, a fairly sophisticated objective Bayesian analysis (use of the reference prior) was required for a satisfactory answer.

This is not to say that classical methodology is universally better in such situations. Indeed, Neyman and Scott created this example primarily to show that use of maximum likelihood methodology can be very inadequate; it essentially leads to the same bad answer in the example as the Bayesian analysis based on the Jeffreys-rule prior. This points out the dilemma facing Bayesians in use of frequentist simplifications: a frequentist answer might be simple, but a Bayesian might well feel uneasy in its utilization unless it were felt to approximate a Bayesian answer. (For instance, is the answer conditionally sound, as discussed in Section 3.2.2?) Of course, if only the frequentist answer is available, the issue is moot.

It would be highly useful to catalogue situations in which direct frequentist reasoning is arguably simpler than Bayesian methodology, but we do not attempt to do so. Discussion of this and examples can be found in Robins and Ritov (1997) and Robins and Wasserman (2000).

Outside of standard models (such as the normal linear model), it is unfortunately rather rare to be able to obtain exact frequentist answers for small or moderate sample sizes. Hence much of frequentist methodology relies on asymptotic approximations, based on assuming that the sample size is large.

Asymptotics can also be used to provide an approximation to Bayesian answers for large sample sizes; indeed, Bayesian and frequentist asymptotic answers are often (but not always) the same; see Schervish (1995) for an introduction to Bayesian asymptotics and Le Cam (1986) for a high-level discussion. One might conclude that this is thus another significant potential use of frequentist methodology by Bayesians. It is rather rare for Bayesians to directly use asymptotic answers, however, since Bayesians can typically directly compute exact small sample size answers, often with less effort than derivation of the asymptotic approximation would require.

Still, asymptotic techniques are useful to Bayesians, in a variety of approximations and theoretical developments. For instance, the popular Laplace approximation (cf. Schervish, 1995) and BIC (cf. Schwarz, 1978) are based on asymptotic arguments. Important Bayesian methodological developments, such as the definition of reference priors, also make considerable use of asymptotic theory, as was mentioned earlier.

4. TESTING, MODEL SELECTION AND MODEL CHECKING

Unlike estimation, frequentist reports and conclusions in testing (and model selection) are often in conflict with their Bayesian counterparts. For a long time it was believed that this was unavoidable, in that the two paradigms are essentially irreconcilable for testing. Berger, Brown and Wolpert (1994) showed, however, that this is not necessarily the case: the main difficulty with frequentist testing was an inappropriate lack of conditioning, which could, in a variety of situations, be fixed. This is the focus of the next section, after which we turn to more general issues involving the interaction of frequentist and Bayesian methodology in testing and model selection.

    4.1 Conditional Frequentist Testing

Unconditional Neyman–Pearson testing, in which one reports the same error probability regardless of the size of the test statistic (as long as it is in the rejection region), has long been viewed as problematical by most statisticians. To Fisher, this was the main inadequacy of Neyman–Pearson testing, and one of the chief motivations for his championing p-values in testing and model checking. Unfortunately (as Neyman would observe), p-values do not have a frequentist justification in the sense, say, of the frequentist principle in Section 2.2. For more extensive discussion of the perceived inadequacies of these two approaches to testing, see Berger (2003).


The solution proposed in Berger, Brown and Wolpert (1994) for testing, following earlier developments in Kiefer (1977), was to use the Neyman–Pearson approach of formally defining frequentist error probabilities of Type I and Type II, but to do so conditional on the observed value of a statistic measuring the strength of evidence in the data, as was done in Example 3.2. (Other proposed solutions to this problem have been considered in, e.g., Hwang et al., 1992.)

For illustration, suppose that we wish to test whether the data X arises from the simple (i.e., completely specified) hypotheses H0: f = f0 or H1: f = f1. The idea is to select a statistic S = S(X) which measures the strength of the evidence in X, for or against the hypotheses. Then, conditional error probabilities (CEPs) are computed as

    α(s) = P(Type I error | S = s) ≡ P0( reject H0 | S(X) = s ),
    β(s) = P(Type II error | S = s) ≡ P1( accept H0 | S(X) = s ),     (4.1)

where P0 and P1 refer to probability under H0 and H1, respectively.

The proposed conditioning statistic S and associated test utilize p-values to measure the strength of the evidence in the data. Specifically (see Wolpert, 1996; Sellke, Bayarri and Berger, 2001), we consider

    S = max{p0, p1},

where p0 is the p-value when testing H0 versus H1, and p1 is the p-value when testing H1 versus H0. [Note that the use of p-values in determining evidentiary equivalence is much weaker than their use as an absolute measure of significance; in particular, use of ψ(p_i), where ψ is any strictly increasing function, would determine the same conditioning.] The corresponding conditional frequentist test is then as follows:

    if p0 ≤ p1, reject H0 and report Type I CEP α(s);
    if p0 > p1, accept H0 and report Type II CEP β(s);     (4.2)

where the CEPs are given in (4.1).

To this point, there has been no connection with Bayesianism. Conditioning, as above, is completely allowed (and encouraged) within the frequentist paradigm. The Bayesian connection arises because Berger, Brown and Wolpert (1994) show that

    α(s) = B(x)/(1 + B(x))   and   β(s) = 1/(1 + B(x)),     (4.3)

where B(x) is the likelihood ratio (or Bayes factor), and these expressions are precisely the Bayesian posterior probabilities of H0 and H1, respectively, assuming the hypotheses have equal prior probabilities of 1/2. Therefore, a conditional frequentist can simply compute the objective Bayesian posterior probabilities of the hypotheses, and declare that they are the conditional frequentist error probabilities; there is no need to formally derive the conditioning statistic or perform the conditional frequentist computations. (There are some technical details concerning the definition of the rejection region, but these have almost no practical impact; see Berger, 2003, for further discussion.)
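
In code, the resulting test is only a few lines (a minimal sketch of (4.2) and (4.3); the inputs p0, p1 and B are assumed to have been computed already for the problem at hand):

    def conditional_frequentist_test(p0, p1, B):
        """Conditional frequentist test (4.2) with CEPs (4.3).

        p0, p1 : p-values for testing H0 versus H1 and H1 versus H0
        B      : likelihood ratio (Bayes factor) of H0 to H1 at the data
        """
        if p0 <= p1:
            # Reject H0; the Type I CEP equals the posterior probability of H0.
            return "reject H0", B / (1.0 + B)
        # Accept H0; the Type II CEP equals the posterior probability of H1.
        return "accept H0", 1.0 / (1.0 + B)

    # Hypothetical numbers, strong evidence against H0:
    print(conditional_frequentist_test(p0=0.01, p1=0.40, B=0.05))
    # -> ('reject H0', 0.0476...), a Type I CEP of about 0.05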

The value of having a merging of the frequentist and objective Bayesian answers in testing goes well beyond the technical convenience of computation; statistics as a whole is the big winner because of the unification that results. But we are trying to avoid philosophical issues here and so will simply focus on the methodological advantages that will accrue to frequentism.

Dass and Berger (2003) and Paulo (2002) extend this result to many classical testing scenarios; here is an example from the former.

EXAMPLE 4.1. McDonald, Vance and Gibbons (1995) studied car emission data X = (X1, . . . , Xn), testing whether the i.i.d. Xi follow the Weibull or lognormal distribution, given, respectively, by

    H0: fW(x; β, γ) = (γ/β) (x/β)^{γ−1} exp{−(x/β)^γ},

    H1: fL(x; μ, σ²) = (1 / (x √(2πσ²))) exp{−(ln x − μ)² / (2σ²)}.

There are several difficulties with classical analysis of this situation. First, there are no low-dimensional sufficient statistics, and no obvious test statistics; indeed, McDonald, Vance and Gibbons (1995) simply consider a variety of generic tests, such as the likelihood ratio test (MLR), which they eventually recommended as being the most powerful. Second, it is not clear which hypothesis to make the null hypothesis, and the classical conclusion can depend on this choice (although not significantly, in the sense of the choice allowing differing conclusions with low error probabilities). Finally, computation of unconditional error probabilities requires a rather expensive simulation and, once computed, one is stuck with error probabilities that do not vary with the data.

For comparison, the conditional frequentist test when n = 16 (one of the cases considered by McDonald, Vance and Gibbons, 1995) results in the following test:

    TC = { if B(x) ≤ 0.94, reject H0 and report Type I CEP α(x) = B(x)/(1 + B(x));
           if B(x) > 0.94, accept H0 and report Type II CEP β(x) = 1/(1 + B(x)) },

where

    B(x) = [2 Γ(n) n^{n/2} π^{(n−1)/2} / Γ((n−1)/2)]
           ∫_0^∞ y^{n−2} ( Σ_{i=1}^{n} exp{(z_i − z̄) y / s_z} )^{−n} dy,     (4.4)

with z_i = ln x_i, z̄ = (1/n) Σ_{i=1}^{n} z_i and s_z² = (1/n) Σ_{i=1}^{n} (z_i − z̄)².

In comparison with the situation for the unconditional frequentist test:

• There is a well-defined test statistic, B(x).

• If one switches the null hypothesis, the new Bayes factor is simply B(x)^{−1}, which will clearly lead to the same CEPs (i.e., the CEPs do not depend on which hypothesis is called the null hypothesis).

• Computation of the CEPs is almost trivial, requiring only a one-dimensional integration.

• Above all, the CEPs vary continuously with the data.

In elaboration of the last point, consider one of the testing situations considered by McDonald, Vance and Gibbons (1995), namely, testing for the distribution of carbon monoxide emission data, based on a sample of size n = 16. Data was collected at the four different mileage levels indicated in Table 2, with (b) and (a) indicating before or after scheduled vehicle maintenance. Note that the decisions for both the MLR and the conditional test would be to accept the lognormal model for the data. McDonald, Vance and Gibbons (1995) did not give the Type II error probability associated with acceptance (perhaps because it would depend on the unknown parameters for many of the test statistics they considered) but, even if Type II error had been provided, note that it would be constant. In contrast, the conditional test has CEPs (here, the conditional Type II errors) that vary fully with the data, usefully indicating the differing certainties in the acceptance decision for the considered mileages. Further analyses and comparisons can be found in Dass and Berger (2003).

TABLE 2
For CO data, the MLR test at level α = 0.05, and the conditional test of H0: Lognormal versus H1: Weibull

    Mileage        0       4,000   24,000 (b)   24,000 (a)
    MLR decision   A       A       A            A
    B(x)           2.436   9.009   6.211        2.439
    TC decision    A       A       A            A
    CEP            0.288   0.099   0.139        0.291

Derivation of the conditional frequentist test. The conditional frequentist analysis here depends on recognizing an important fact: both the Weibull and the lognormal distributions are location-scale distributions (the Weibull after suitable transformation). In this case, the objective Bayesian (and, hence, conditional frequentist) solution to the problem is to utilize the right-Haar prior density for the distributions [π(μ, σ) = 1/σ for the lognormal problem] and compute the resulting Bayes factor. (Unconditional frequentists could, of course, have recognized the invariance of the situation and used this as a test statistic, but they would have faced a much more difficult computational challenge.)

By invariance, the distribution of B(X) under either hypothesis does not depend on model parameters, so that the original testing problem can be reduced to testing two simple hypotheses, namely H0: B(X) has distribution FW versus H1: B(X) has distribution FL, where FW and FL are the distribution functions of B(X) under the Weibull and lognormal distributions, respectively, with an arbitrary choice of the parameters (e.g., β = γ = 1 for the Weibull, and μ = 0, σ = 1 for the lognormal). Recall that the CEPs happen to equal the objective Bayesian posterior probabilities of the hypotheses.

    4.2 Model Selection

Clyde and George (2004) give an excellent review of Bayesian model selection. Frequentists have not typically used Bayesian arguments in model selection, although that may be changing, in part due to the pronounced practical success that Bayesian model averaging has achieved. Bayesians often use frequentist arguments to develop approximate model selection methods (such as BIC), to evaluate performance of model selection methods and to develop default priors for model selection. There is a huge list of articles of this type, including many listed in Clyde and George (2004). Robert (2001), Berger and Pericchi (2001, 2004) and Berger, Ghosh and Mukhopadhyay (2003) also have general discussions and numerous other recent references. The story here is far from settled, in that there is no agreement on the immediate horizon as to even a reasonable method of model selection. It seems highly likely, however, that any such agreement will be based on a mixture of frequentist and Bayesian arguments.

    4.3 p-Values for Model Checking

Both classical statisticians and Bayesians routinely use p-values for model checking. We first consider their use by classical statisticians and show the value of Bayesian methodology in the computation of proper p-values; then we turn to Bayesian p-values and the importance of frequentist ideas in their evaluation.

4.3.1 Use of Bayesian methodology in computing classical p-values. Suppose that a statistical model H0: X ∼ f(x | θ) is being entertained, data xobs is observed and it is desired to check whether the model is adequate, in light of the data. Classical statisticians have long used p-values for this purpose. A strict frequentist would not do so (since p-values do not satisfy the frequentist principle), but most frequentists relax their morals a bit in this situation (i.e., when there is no alternative hypothesis which would allow construction of a Neyman–Pearson test).

The common approach to model checking is to choose a statistic T = t(X), where (without loss of generality) large values of T indicate less compatibility with the model. The p-value is then defined as

    p = Pr( t(X) ≥ t(xobs) | θ ).     (4.5)

When θ is known, this probability computation is with respect to f(x | θ). The crucial question is what to do when θ is unknown.

For future reference, we note a key property of the p-value when θ is known: considered as a random function of X in the continuous case, p(X) has a Uniform(0, 1) distribution under H0. The implication of this property is that p-values then have a common interpretation across statistical problems, making their general use feasible. Indeed, this property, or its asymptotic version for composite H0, has been used to characterize proper, well-behaved p-values (see Robins, van der Vaart and Ventura, 2000).

When θ is unknown, computation of a p-value requires some way of eliminating θ from (4.5). There are many non-Bayesian ways of doing so, some of which are reviewed in Bayarri and Berger (2000) and Robins, van der Vaart and Ventura (2000). Here we consider only the most common method, which is to replace θ in (4.5) by its mle, θ̂. The resulting p-value will be called the plug-in p-value (pplug). Henceforth using a superscript to denote the density with respect to which the p-value in (4.5) is computed, the plug-in p-value is thus defined as

    pplug = Pr^{f(· | θ̂)}( t(X) ≥ t(xobs) ).     (4.6)

Although very simple to use, there is a worrisome double use of the data in pplug, first to estimate θ and then to compute the tail area corresponding to t(xobs) in that distribution.
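
For concreteness, a plug-in p-value is just a parametric bootstrap tail area (a sketch with an illustrative model and statistic of our own choosing, assuming numpy): the mle is plugged in, and the distribution of T is simulated from the fitted model.

    import numpy as np

    def plug_in_p_value(x_obs, t_stat, nsim=10000, rng=None):
        """Plug-in p-value (4.6) for an Exponential(theta) null model;
        theta (the mean) is replaced by its mle, the sample mean."""
        rng = rng or np.random.default_rng()
        t_obs = t_stat(x_obs)
        sims = rng.exponential(x_obs.mean(), size=(nsim, len(x_obs)))
        return float((np.apply_along_axis(t_stat, 1, sims) >= t_obs).mean())

    # Illustrative statistic: the largest observation over the mean, a
    # rough check of the exponential right tail.
    t_stat = lambda x: x.max() / x.mean()
    x_obs = np.random.default_rng(3).exponential(1.0, size=20)
    print(plug_in_p_value(x_obs, t_stat))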

Bayarri and Berger (2000) proposed the following alternative way of eliminating θ, based on Bayesian methodology. Begin with an objective prior density π(θ); we recommend, as before, that this be a reference prior (when available), but even a constant prior density will usually work fine. Next, define the partial posterior density (Bayesian motivation will be given in the next section)

    π(θ | xobs \ tobs) ∝ f(xobs | tobs, θ) π(θ) ∝ f(xobs | θ) π(θ) / f(tobs | θ),     (4.7)

resulting in the partial posterior predictive density of T,

    m(t | xobs \ tobs) = ∫ f(t | θ) π(θ | xobs \ tobs) dθ.     (4.8)

Since this density is free of θ, it can be used in (4.5) to compute the partial posterior predictive p-value,

    pppp = Pr^{m(· | xobs \ tobs)}( T ≥ tobs ).     (4.9)

Note that pppp uses only the information in xobs that is not in tobs = t(xobs) to train the prior and to then eliminate θ. Intuitively this avoids double use of the data because the contribution of tobs to the posterior is removed before θ is eliminated by integration. (The notation xobs \ tobs was chosen to indicate this.)
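
One generic way to compute (4.9) is by importance reweighting (a Monte Carlo sketch of ours, assuming that draws from the full posterior are available and that the density f(t | θ) of the statistic can be evaluated): since π(θ | xobs \ tobs) ∝ π(θ | xobs)/f(tobs | θ), full-posterior draws are reweighted by 1/f(tobs | θ) before predicting T.

    import numpy as np

    def partial_posterior_predictive_p(post_draws, f_t_density, sample_t,
                                       t_obs, rng=None):
        """Monte Carlo sketch of the partial posterior predictive
        p-value (4.9).

        post_draws  : draws of theta from the full posterior pi(theta | x_obs)
        f_t_density : function (t, theta) -> f(t | theta)
        sample_t    : function (theta, rng) -> one draw of T from f(t | theta)
        t_obs       : observed value of the statistic
        """
        rng = rng or np.random.default_rng()
        # Importance weights proportional to 1 / f(t_obs | theta).
        w = np.array([1.0 / f_t_density(t_obs, th) for th in post_draws])
        w /= w.sum()
        t_new = np.array([sample_t(th, rng) for th in post_draws])
        # Weighted tail probability Pr(T >= t_obs) under (4.8).
        return float(w[t_new >= t_obs].sum())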

When T is not ancillary (or approximately so), the double use of the data can cause the plug-in p-value to fail dramatically, in the sense of being moderately large even when the null model is clearly wrong. Examples and earlier references are given in Bayarri and Berger (2000). Here is another interesting illustration, taken from Bayarri and Castellanos (2004).


A commonly used Bayesian alternative is the posterior predictive p-value of Rubin (1984) and Gelman, Carlin, Stern and Rubin (1995), given by

    ppost = Pr^{m(· | xobs)}( T ≥ tobs ),  where  m(t | xobs) = ∫ f(t | θ) π(θ | xobs) dθ.     (4.13)

Note that there is also double use of the data in ppost, first to convert the (possibly improper) prior into a proper distribution for determining the predictive distribution, and then for computing the tail area corresponding to t(xobs) in that distribution. The detrimental effect of this double use of the data was discussed in Bayarri and Berger (2000) and arises again in Example 4.2. Indeed, computation yields that the posterior predictive p-value for the given data is 0.409, which does not at all suggest a problem with the random effects normality assumption; yet, recall that one of the means was more than six standard deviations away from the others.

The point of this section is to observe that the frequentist property of uniformity of a p-value under the null hypothesis provides a useful discriminatory tool for judging the adequacy of Bayesian (and other) p-values. Note that if a p-value is uniform under the null hypothesis in the frequentist sense for any θ, then it has the strong Bayesian property of being marginally Uniform[0, 1] under any proper prior distribution. More important, if a proposed p-value is always either conservative or anticonservative in a frequentist sense (see Robins, van der Vaart and Ventura, 2000, for definitions), then it is likewise guaranteed to be conservative or anticonservative in a Bayesian sense, no matter what the prior. If the conservatism (or anticonservatism) is severe, then the p-value cannot correspond to any approximate true Bayesian p-value. In this regard, Robins, van der Vaart and Ventura (2000) show that ppost is often severely conservative (and, surprisingly, is worse in this regard than is pplug), while pppp is asymptotically Uniform[0, 1].

It is also of interest to numerically study the nonasymptotic null distribution of the three p-values considered in this section. We thus return to the random effects example.

EXAMPLE 4.3. Consider again the situation described in Example 4.2. We consider pplug(X), pppp(X) and ppost(X) as random variables and simulate their distributions under an instance of the null model. Specifically, we chose the random effects mean and variance to be 0 and 1, respectively, thus simulating Xij as in (4.12), but now generating all five θi from the N(0, 1) distribution (so that the null normal hierarchical model is correct). Figure 6 shows the resulting sampling distributions of the three p-values.

Note that the distribution of pppp(X) is quite close to uniform, even though only five means were involved. In contrast, the distributions of pplug(X) and ppost(X) are quite far from uniform, with the latter being the worst. Even for larger numbers of means (e.g., 25), pplug(X) and ppost(X) remained significantly nonuniform, indicating a serious and inappropriate conservatism.
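
This kind of null-distribution study is easy to set up for any proposed p-value (a generic uniformity-check sketch of ours, assuming numpy/scipy; the known-θ p-value shown is just a sanity check of the machinery, since it is exactly uniform):

    import numpy as np
    from scipy.stats import kstest, norm

    def null_distribution(p_value, simulate_null, nsim=1000, rng=None):
        """Sampling distribution of p_value(data) under a null simulator."""
        rng = rng or np.random.default_rng()
        return np.array([p_value(simulate_null(rng)) for _ in range(nsim)])

    # Sanity check: for X_1, ..., X_n iid N(0, 1) with the mean known, the
    # p-value based on T = sqrt(n) * Xbar is exactly Uniform(0, 1).
    n = 5
    p_value = lambda x: 1.0 - norm.cdf(np.sqrt(len(x)) * x.mean())
    ps = null_distribution(p_value, lambda rng: rng.normal(0.0, 1.0, size=n))
    print(kstest(ps, "uniform"))  # should show no evidence against uniformity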

Of course, Bayesians frequently criticize the direct use of p-values in testing a precise hypothesis, in that p-values then fail to correspond to natural Bayesian measures. There are, however, various calibrations of p-values that have been suggested (see Good, 1983; Sellke, Bayarri and Berger, 2001). But these are higher level considerations in that, if a purported Bayesian p-value fails to adequately criticize a null hypothesis, the higher level concerns are irrelevant.

FIG. 6. Null distribution of pplug(X) (left), ppost(X) (center) and pppp(X) (right) when the null normal hierarchical model is correct.


    5. AREAS OF CURRENT DISAGREEMENT

It is worth mentioning some aspects of inference in which it seems very difficult to reconcile the frequentist and Bayesian approaches. Of course, as discussed in Section 4, it was similarly believed to be difficult to reconcile frequentist and Bayesian testing until recently, so it may simply be a matter of time until reconciliation also occurs in these other areas.

Multiple comparisons. When doing multiple tests, such as occurs in the variable selection problem in regression, classical statistics performs some type of adjustment (e.g., the Bonferroni adjustment to the significance level) to account for the multiplicity of tests. In contrast, Bayesian analysis does not explicitly adjust for multiplicity of tests, the argument being that a correct adjustment is automatic within the Bayesian paradigm.

Unfortunately, the duality between frequentist and Bayesian methodology in testing, discussed in Section 4, has not been extended to the multiple hypothesis testing framework, so we are left with competing and quite distinct methodologies. Worse, there are multiple competing frequentist methodologies and multiple competing Bayesian methodologies, a situation that is professionally disconcerting, yet common when there is no frequentist–Bayesian consensus. (To access some of these methodologies, see http://www.ba.ttu.edu/isqs/westfall/mcp2002.htm.)
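
For reference, the Bonferroni adjustment mentioned above is the simplest of these frequentist corrections (a minimal sketch with hypothetical p-values): each of m tests is carried out at level α/m, so that the overall Type I error stays below α.

    alpha, m = 0.05, 20
    per_test_level = alpha / m            # 0.0025
    p_values = [0.001, 0.004, 0.030]      # hypothetical
    print([p <= per_test_level for p in p_values])   # [True, False, False]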

Sequential analysis. The stopping rule principle says that once the data have been obtained, the reasons for stopping experimentation should have no bearing on the evidence reported about unknown model parameters. This principle is automatically satisfied by Bayesian analysis, but is viewed as crazy by many frequentists. Indeed frequentist practice in clinical trials is to "spend α" for looks at the data; that is, if there are to be interim analyses during the clinical trial, with the option of stopping the trial early should the data look convincing, frequentists feel that it is then mandatory to adjust the allowed error probability (down) to account for the multiple analyses.

This issue is extensively discussed in Berger and Berry (1988), which has many earlier references. That it is a controversial and difficult issue is admirably expressed by Savage (1962): "I learned the stopping rule principle from Professor Barnard, in conversation in the summer of 1952. Frankly, I then thought it a scandal that anyone in the profession could advance an idea so patently wrong, even as today I can scarcely believe that people resist an idea so patently right."

An interesting recent development is the finding, in Berger, Boukai and Wang (1999), that optimal conditional frequentist testing also essentially obeys the stopping rule principle. This suggests the admittedly controversial speculation that optimal (conditional) frequentist procedures will eventually be found to essentially satisfy the stopping rule principle, and that we currently have a controversy only because suboptimal (unconditional) frequentist procedures are being used.

It is, however, also worth noting that in prior development Bayesians sometimes utilize the stopping rule to help define an objective prior, for example, the Jeffreys or reference prior (see Ye, 1993; Sun and Berger, 2003, for examples and motivation). Since the statistical inference will then depend on the stopping rule, objective Bayesian analysis can involve a (probably slight) violation of the stopping rule principle. (See Sweeting, 2001, for related discussion concerning the extent of violation of the likelihood principle by objective Bayesian methods.)

Finite population sampling. In this central area of statistics, classical inference is primarily based on frequentist averaging with respect to the sampling probabilities by which units of the population are selected for inclusion in the sample. In contrast, Bayesian analysis asserts that these sampling probabilities are irrelevant for inference, once the data is at hand. (See Ghosh, 1988, which contains excellent essays by Basu on this subject.) Hence we, again, have a fundamental philosophical and practical conflict.

There have been several arguments (e.g., Rubin, 1984; Robins and Ritov, 1997) to the effect that there are situations in which Bayesians do need to take into account the sampling probabilities, to save themselves from a too-difficult (and potentially nonrobust) prior development. Conversely, there has been a very significant growth in use of Bayesian methodology in finite population contexts, such as in the use of small area estimation methods (see, e.g., Rao, 2003). Small area estimation methods actually occur in the broader context of the model-based approach to finite population sampling (which mostly ignores the sampling probabilities), and this model-based approach also has frequentist and Bayesian versions (the differences, however, being much smaller than the differences arising from use, or not, of the sampling probabilities).

    6. CONCLUSIONS

It seems quite clear that both Bayesian and frequentist methodology are here to stay, and that we should not expect either to disappear in the future. This is not to say that all Bayesian or all frequentist methodology is fine and will survive. To the contrary, there are many areas of frequentist methodology that should be replaced by (existing) Bayesian methodology that provides superior answers, and the verdict is still out on those Bayesian methodologies that have been exposed as having potentially serious frequentist problems.

Philosophical unification of the Bayesian and frequentist positions is not likely, nor desirable, since each illuminates a different aspect of statistical inference. We can hope, however, that we will eventually have a general methodological unification, with both Bayesians and frequentists agreeing on a body of standard statistical procedures for general use.

    ACKNOWLEDGMENTS

Research supported by Spanish Ministry of Science and Technology Grant SAF2001-2931 and by NSF Grants DMS-01-03265 and DMS-01-12069. Part of the work was done while the first author was visiting the Statistical and Applied Mathematical Sciences Institute and ISDS, Duke University.

    REFERENCES

BARNETT, V. (1982). Comparative Statistical Inference, 2nd ed. Wiley, New York.

BARRON, A. (1999). Information-theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems. In Bayesian Statistics 6 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 27–52. Oxford Univ. Press.

BARRON, A., SCHERVISH, M. J. and WASSERMAN, L. (1999). The consistency of posterior distributions in nonparametric problems. Ann. Statist. 27 536–561.

BAYARRI, M. J. and BERGER, J. (2000). P-values for composite null models (with discussion). J. Amer. Statist. Assoc. 95 1127–1170.

BAYARRI, M. J. and CASTELLANOS, M. E. (2004). Bayesian checking of hierarchical models. Technical report, Univ. Valencia.

BELITSER, E. and GHOSAL, S. (2003). Adaptive Bayesian inference on the mean of an infinite-dimensional normal distribution. Ann. Statist. 31 536–559.

BERGER, J. (1985a). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer, New York.

BERGER, J. (1985b). The frequentist viewpoint and conditioning. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer (L. Le Cam and R. Olshen, eds.) 15–44. Wadsworth, Monterey, CA.

BERGER, J. (1994). An overview of robust Bayesian analysis (with discussion). Test 3 5–124.

BERGER, J. (2003). Could Fisher, Jeffreys and Neyman have agreed on testing (with discussion)? Statist. Sci. 18 1–32.

BERGER, J. and BERNARDO, J. (1992). On the development of reference priors. In Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 35–60. Oxford Univ. Press.

BERGER, J. and BERRY, D. (1988). The relevance of stopping rules in statistical inference (with discussion). In Statistical Decision Theory and Related Topics IV (S. Gupta and J. Berger, eds.) 1 29–72. Springer, New York.

BERGER, J., BOUKAI, B. and WANG, Y. (1999). Simultaneous Bayesian–frequentist sequential testing of nested hypotheses. Biometrika 86 79–92.