
    Stat 411 Lecture Notes 06

    Bayesian Analysis

Ryan Martin
www.math.uic.edu/~rgmartin

    Version: August 22, 2013

    1 Introduction

Up to now, our focus in Stat 411 has been on what's called frequentist statistics. That is, our main objective was the sampling distribution properties of our estimators, tests, etc. Why the name "frequentist"? Recall that the sampling distribution of, say, an estimator $\hat\theta = \hat\theta(X_1, \ldots, X_n)$ describes the distribution of $\hat\theta$ as the sample $X_1, \ldots, X_n$ varies according to its law. In particular, the sampling distribution describes the frequency of certain events concerning $\hat\theta$ in repeated sampling.

Here we want to shift gears and briefly discuss the other dominant school of thought in statistics, namely, the Bayesian school. The name comes from Reverend Thomas Bayes, who developed the Bayes theorem you learned in Stat 401. For students, it is maybe not clear why such a simple result in basic probability theory could have paved the way for the substantial amount of work on Bayesian analysis that's been done to date. Presentations of Bayes theorem in basic probability courses usually side-step the philosophical importance of a result like this: it's a fundamental tool that describes how uncertainties can be updated when new information becomes available. So one can think of Bayes theorem as a statement about how information is processed and beliefs are updated. In the statistics context, Bayes theorem is used to take prior or initial beliefs about the parameter of interest and, after data is observed, those beliefs are updated to reflect what has been learned. Expressed in this way, one should see that Bayes theorem is a bit more than just a simple manipulation of the symmetry in $P(A \cap B)$.

What sets the Bayesian framework apart from what we've previously seen is the way that uncertainty is defined and represented. The usual setup we've encountered is that observable data $X_1, \ldots, X_n$ is available from a distribution $f_\theta(x)$. The starting point is that $\theta$ is unknown and to be estimated/tested from the data. But we've not really said what it means that $\theta$ is "unknown." Do we really know nothing about it, or do we not know how to summarize what knowledge we have, or are we uneasy using this knowledge? It seems unrealistic that we actually know nothing about $\theta$.

These notes are meant to supplement in-class lectures. The author makes no guarantees that these notes are free of typos or other, more serious errors.

HMC refers to Hogg, McKean, and Craig, Introduction to Mathematical Statistics, 7th ed., 2012.


For example, if $\theta$ is the mean income in Cook County to be estimated, we know that $\theta$ is positive and less than \$1 billion; we'd also likely believe that $\theta \in$ (\$40K, \$60K) is more likely than $\theta \in$ (\$200K, \$220K). In what we've discussed so far in Stat 411, such information is not used. The jumping-off point for a Bayesian analysis is the following belief:

    The only way to describe uncertainties is with probability.

Such a concept permeates even our everyday lives. For example, suppose you and several friends have been invited to a party next Saturday. When your friend Kevin asks if you will attend, you might respond with something like "there's a 90% chance that I'll go." Although not stated on the scale of probabilities, this has such an interpretation. The same thing goes for weather reports, e.g., "there's a 30% chance of rain tomorrow." What's particularly interesting is the nature of these probabilities. We're used to using probabilities to describe the uncertain results of a random experiment (e.g., rolling dice). The two scenarios above (party and weather) are not really random; moreover, these are singular events and not something that can be repeated over and over. And yet probabilities, in particular subjective probabilities, are introduced and can be used. In the statistics problem, to that thing about which we are uncertain, we must assign probabilities to certain events, such as $\{\theta > 7\}$, $\{0.27 < \theta < 0.98\}$, etc. Once this probability assignment has been made to all such events, what we have is a probability distribution, what's called the prior distribution for $\theta$. This is effectively the same as assuming that the unknown parameter itself is a random variable with a specified distribution. It is a common misconception to say that Bayesian analysis assumes the parameter is a random variable. On the contrary, a Bayesian starts by assigning probabilities to all such things which are uncertain; that this happens to be equivalent to taking $\theta$ to be a random variable is just a (convenient or unfortunate?) consequence.

There are reasons to take a Bayesian approach other than the rather philosophical reasons mentioned above. In fact, there is a remarkable theorem of de Finetti which says that, if data are exchangeable, i.e., if permuting the data does not change their joint distribution, then there exists a likelihood and prior like those assumed in the Bayesian setting. Also, there is a very surprising result that says, roughly, that given any estimator, there exists a Bayes (or approximate Bayes) estimator that's as good or better in terms of mean-square error. In other words, one cannot do too badly by using Bayes estimators. Finally, there's some advantage to the Bayesian approach in the high-dimensional problems which are popular nowadays, because a suitable Bayes model will result in some automatic penalties on models with higher dimensionality. Outside the Bayesian context, one must actively introduce such penalties. Some additional details about these and other aspects of Bayesian analysis can be found in Section 4.

As you can probably tell already, Bayesian analysis is not so easy to come to grips with, at least at first. Here in Stat 411, we will not dwell on these philosophical matters. My focus in these notes is to give you a basic introduction to the ideas and terminology in the Bayesian setting. In particular, I hope that students will understand the basic steps in a Bayesian analysis: choosing a prior distribution, updating the prior to a posterior distribution using data and Bayes theorem, and summarizing this posterior. Some additional important points are mentioned in the last section.


    2 Mechanics of Bayesian analysis

    2.1 Ingredients

The Bayesian problem starts with an important additional ingredient compared to the frequentist problem we've considered so far. This additional ingredient is the prior distribution for the parameter. Here I will modify our familiar notation a bit. Now $\Theta$ will denote a random-variable version of the parameter $\theta$; we'll let the parameter space (our usual use of the notation $\Theta$) be implicit and determined by context.

The prior distribution for the parameter is a statement that $\Theta \sim \pi(\theta)$, where $\pi(\theta)$ is a distribution (i.e., a PDF/PMF) defined on the parameter space. The idea is that the distribution $\pi(\theta)$ encodes our uncertainty about the parameter. For example, if $\Theta$ is the mean income in Cook County, my belief that this value is between \$25K and \$35K is given by the probability calculation $\int_{25\mathrm{K}}^{35\mathrm{K}} \pi(\theta)\, d\theta$. Where this prior distribution comes from is an important question (see Section 3), but for now just take $\pi$ as given.

In addition to our prior beliefs about the parameter, we get to observe data just like in our previous settings. That is, given $\Theta = \theta$, we get $X_1, \ldots, X_n \overset{\text{iid}}{\sim} f_\theta(x)$. Note the emphasis here that $f_\theta(x)$ is the conditional distribution of $X_1$ given that the random parameter happens to equal the particular value $\theta$. From this sample, we can again define a likelihood function $L(\theta) = \prod_{i=1}^n f_\theta(X_i)$. I may add a subscript, $L_X(\theta)$, to remind us that the likelihood is a function of $\theta$ but it depends on the data $X = (X_1, \ldots, X_n)$.

To summarize, the Bayesian model specifies a joint distribution for $(\Theta, X_1, \ldots, X_n)$, with a "density"$^1$ $\pi(\theta) L(\theta)$. But this specification is done in two stages: first a marginal distribution for $\Theta$ and then a conditional distribution for $(X_1, \ldots, X_n)$ given $\Theta = \theta$. Such a model is sometimes called a hierarchical model, because it's done in stages, but I shall call it the Bayes model.

    2.2 Bayes theorem and the posterior distribution

The key feature of a Bayesian analysis is that the prior distribution for $\Theta$ is updated after seeing data to what is called the posterior distribution. Bayesian inference is based entirely on this posterior distribution; see Section 2.3. The key to this transformation is the simple Bayes formula from Stat 401: for a given probability space, consisting of a collection of events and a probability $P$, if $A$ and $B$ are events with positive $P$-probability, the conditional probability $P(A \mid B)$, defined as $P(A \cap B)/P(B)$, satisfies

$$P(A \mid B) = P(B \mid A)\, P(A) / P(B).$$

In our case, $A$ is some event concerning $\Theta$ and $B$ corresponds to our observable data. To make the connection more precise, consider a simple discrete problem involving only PMFs. Let $\Theta$ take two values, 0.25 and 0.75, with equal probability and, given $\Theta = \theta$, let $X \sim \mathsf{Ber}(\theta)$. The posterior probability that $\Theta = 0.25$, given $X = x$, can be obtained via Bayes formula:

$$P(\Theta = 0.25 \mid X = x) = \frac{f_{0.25}(x) \cdot 0.5}{f_{0.25}(x) \cdot 0.5 + f_{0.75}(x) \cdot 0.5}.$$

$^1$Quotes here indicate that $\pi(\theta)L(\theta)$ may not be a genuine PDF for $(\Theta, X_1, \ldots, X_n)$ because the data may be discrete and the parameter continuous, or vice-versa. But this is not really a problem.


Depending on which $x$ is used, this can be easily evaluated. In general, $P(\Theta = 0.25 \mid X = x)$ will be different from $P(\Theta = 0.25)$, so Bayes formula is, indeed, updating our prior beliefs about the possible values of $\Theta$.
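To make the arithmetic concrete, here is a minimal sketch (mine, not part of the original notes; the function names are made up) that evaluates this two-point posterior for both possible data values.

```python
def bernoulli_pmf(x, theta):
    """PMF of a Ber(theta) random variable at x in {0, 1}."""
    return theta**x * (1 - theta)**(1 - x)

def posterior_quarter(x):
    """P(Theta = 0.25 | X = x) when Theta is 0.25 or 0.75 with prior probability 0.5 each."""
    num = bernoulli_pmf(x, 0.25) * 0.5
    return num / (num + bernoulli_pmf(x, 0.75) * 0.5)

for x in (0, 1):
    print(x, posterior_quarter(x))
# x = 0 shifts belief toward Theta = 0.25 (posterior 0.75); x = 1 shifts it toward 0.75.
```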

This is a very simple version of the problem. Fortunately, there is a version of Bayes formula for the case where both data and parameter are continuous. The proof is more difficult than the simple Bayes formula from Stat 401, but the formula itself is very similar. Essentially, one just pretends that $\pi(\theta)$ and $L(\theta)$ are PMFs instead of PDFs and applies the familiar Bayes formula. In particular, the posterior distribution of $\Theta$, given $X = x$, has a PDF/PMF $\pi(\theta \mid x)$ given by

$$\pi(\theta \mid x) = \frac{L_x(\theta)\,\pi(\theta)}{\int L_x(\theta)\,\pi(\theta)\, d\theta},$$

where, in the denominator, integration is over the entire parameter space. If $\Theta$ is a discrete random variable, i.e., its prior is a PMF instead of a PDF, the formula looks the same but the denominator has a sum over all possible $\theta$ values instead of an integral.

Exercise 1. Suppose that, given $\Theta = \theta$, $X \sim \mathsf{Bin}(n, \theta)$. If $\Theta \sim \mathsf{Unif}(0, 1)$, find the posterior distribution $\pi(\theta \mid x)$.
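In practice the integral in the denominator often has no closed form. The sketch below (my own illustration with made-up data, not from the notes) approximates the posterior on a grid for a deliberately non-conjugate pairing: a $\mathsf{N}(\theta, 1)$ model with a standard Cauchy prior on $\theta$.

```python
import numpy as np

x = np.array([1.2, 0.7, 2.1, 1.5])            # hypothetical observed data
theta = np.linspace(-5, 5, 2001)              # grid covering the bulk of the parameter space

log_like = np.array([-0.5 * np.sum((x - t)**2) for t in theta])   # N(theta, 1) log-likelihood, up to a constant
log_prior = -np.log(np.pi * (1 + theta**2))                       # log of the standard Cauchy prior density

unnorm = np.exp(log_like + log_prior)         # L_x(theta) * pi(theta), unnormalized
post = unnorm / np.trapz(unnorm, theta)       # divide by a numerical version of the denominator integral

print(np.trapz(theta * post, theta))          # e.g., the posterior mean, computed from the grid
```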

    2.3 Bayesian inference

Bayesian analysis is really focused on the posterior distribution, which is assumed to describe all uncertainty about the parameter after data $X = x$ is observed. But it is often of interest to answer questions like those we've encountered previously in the course. For example, if we want to produce an estimator or do a hypothesis test, we can do such things from a Bayesian perspective as well.

    2.3.1 Point estimation

The most basic problem in statistics is one of producing an estimator of the unknown parameter $\theta$. Of course, in the Bayes context, the parameter is random, not fixed, so it's not immediately clear what we're trying to estimate. The way to understand this from a Bayesian point of view is that the goal is to report the "center" of the posterior distribution $\pi(\theta \mid x)$, which is some function of the observed $X = x$, as a statistic. This "center" is often the mean of the posterior distribution, but it doesn't have to be.

To properly describe how Bayes methods are derived, one should first introduce what is called a loss function $\ell(\delta(x), \theta)$, which measures the penalty one incurs by using a procedure generically written $\delta(x)$ when the true parameter value is $\theta$. Once this loss function is specified, the Bayes method is derived by choosing $\delta(x)$ to minimize the expectation $\int \ell(\delta(x), \theta)\, \pi(\theta \mid x)\, d\theta$, called the posterior Bayes risk. In the estimation problem, the loss function is usually $\ell(\delta(x), \theta) = (\theta - \delta(x))^2$, called squared-error loss, so the goal is to choose $\delta$ such that $\int (\theta - \delta(x))^2\, \pi(\theta \mid x)\, d\theta$ is minimized.


It is a relatively simple calculus exercise to show that the minimizer is the mean of the posterior distribution, i.e., the Bayes estimator of $\theta$ is

$$\tilde\theta(x) = E(\Theta \mid x) = \int \theta\, \pi(\theta \mid x)\, d\theta.$$

If the loss function is different from squared-error, then the Bayes estimator would be something different. For Stat 411, we will only consider squared-error loss.
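For completeness, here is one way the calculus step can be written out (my sketch of the standard argument). Writing $d = \delta(x)$ and expanding the square,

$$g(d) = \int (\theta - d)^2\, \pi(\theta \mid x)\, d\theta = \int \theta^2\, \pi(\theta \mid x)\, d\theta - 2 d\, E(\Theta \mid x) + d^2, \qquad g'(d) = -2\, E(\Theta \mid x) + 2 d.$$

Setting $g'(d) = 0$ gives $d = E(\Theta \mid x)$, and $g''(d) = 2 > 0$ confirms this is the minimizer.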

Exercise 2. Reconsider the binomial problem in Exercise 1. Find the posterior mean $E(\Theta \mid x)$ and calculate its mean-square error. Compare it to that of the MLE $\hat\theta = X/n$.

    2.3.2 Hypothesis testing

The hypothesis testing problem also has a loss-function type of description, but I'll not get into that here because it's a bit more difficult to explain in this context. I'll also use some notation that's a bit different from what we've used before, though it should be clear. Start with the hypotheses $H_0: \Theta \in U_0$ versus $H_1: \Theta \in U_1$, where $U_1 = U_0^c$. With a prior distribution for $\Theta$, we can calculate prior probabilities for $H_0$ and $H_1$, which are $\pi(U_0) = \int_{U_0} \pi(\theta)\, d\theta$ and $\pi(U_1) = 1 - \pi(U_0)$, respectively. Now we just want to update these probabilities in light of the observed data. In other words, we calculate the posterior probabilities of $H_0$ and $H_1$, which are readily available once we have the posterior distribution $\pi(\theta \mid x)$. These probabilities are

$$\pi(U_0 \mid x) = \int_{U_0} \pi(\theta \mid x)\, d\theta \quad \text{and} \quad \pi(U_1 \mid x) = 1 - \pi(U_0 \mid x).$$

Finally, to decide whether to reject $H_0$ or not, we just compare the relative magnitudes of $\pi(U_0 \mid x)$ and $\pi(U_1 \mid x)$ and choose the larger of the two. Because $U_1 = U_0^c$, it is easy to see that we reject $H_0$ if and only if $\pi(U_0 \mid x) < 1/2$. This is the Bayes test.

One remark to make here is that, in a certain sense, we can accomplish here, in the Bayesian setting, what we could not in the frequentist setting. That is, we now have measures of certainty that $H_0$ is true/false given data. The frequentist approach has nothing like this: size and power are only sampling distribution properties and have nothing to do with the observed data. The trade-off is that, in the Bayesian setting, one requires a prior distribution, and if the prior distribution is "wrong," then the Bayes test may give unsatisfactory answers. Also, note that the Bayes test has no $\alpha$ involved, so it makes no attempt to control the test's size at a particular level. Of course, the test has a size and power associated with it, but one must calculate these separately. For example, the size of the Bayes test would look like

$$\max_{\theta \in U_0} P_\theta\{\pi(U_0 \mid X) < 1/2\}.$$

If the posterior probability $\pi(U_0 \mid x)$ has a nice-looking form, then perhaps this calculation would not be too difficult. Alternatively, one could use Monte Carlo to evaluate this. But keep in mind that size and power are not really meaningful to a Bayesian, so there's no need to do these things unless it's of interest to compare the performance of a Bayes test with that of a frequentist test.
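As a rough illustration of the Monte Carlo route (my own sketch; the model, prior, sample size, and hypotheses are made-up choices, not from the notes), take $X \sim \mathsf{Bin}(n, \theta)$ with a $\mathsf{Unif}(0, 1)$ prior and $H_0: \theta \leq 0.5$. The posterior probability is computed numerically on a grid, and the rejection probability is simulated at the boundary point $\theta = 0.5$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 25
grid = np.linspace(0.0, 1.0, 1001)

def post_prob_H0(x):
    """pi(U0 | x) for U0 = {theta <= 0.5}, computed by normalizing prior * likelihood on a grid."""
    like = grid**x * (1 - grid)**(n - x)     # binomial likelihood, up to a constant
    post = like / np.trapz(like, grid)       # flat prior, so the posterior is proportional to the likelihood
    return np.trapz(post[grid <= 0.5], grid[grid <= 0.5])

# Estimate P_theta{ pi(U0 | X) < 1/2 } at theta = 0.5, where this one-sided
# rejection probability is largest (it increases in theta for this problem).
reps = 5000
rejections = sum(post_prob_H0(rng.binomial(n, 0.5)) < 0.5 for _ in range(reps))
print(rejections / reps)
```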


Notice that the explanation above implicitly requires that $\pi(U_0) > 0$. If $H_0: \Theta = \theta_0$ and $\Theta$ is continuous, then this prior probability will automatically be zero, so we need to adjust the methodology. The standard approach here is to use Bayes factors to perform the test. For the nice problems we consider in Stat 411, the necessary Bayes factor calculations are relatively simple, and the result resembles a likelihood ratio test. But the interpretation of Bayes factors is less straightforward than that of the posterior probabilities $\pi(U_0 \mid x)$ from above, so I'll not discuss this here. A formal course on Bayesian analysis would not ignore this issue.

    2.3.3 Credible intervals

Bayesians have an analog of the frequentist's confidence intervals, called credible intervals. There are a couple of ways to construct them, which I'll explain only briefly. Throughout, take $\alpha \in (0, 1)$ fixed.

Equal-tailed credible interval. Let $\Pi(\theta \mid x)$ denote the posterior CDF. An equal-tailed $100(1-\alpha)\%$ credible interval looks like

$$\{\theta : \alpha/2 \leq \Pi(\theta \mid x) \leq 1 - \alpha/2\}.$$

This is just the central $1 - \alpha$ region of the posterior distribution.

Highest posterior density credible interval. For a cutoff $c > 0$, define an interval$^2$

$$H(c) = \{\theta : \pi(\theta \mid x) \geq c\}.$$

Note that, as $c$ changes, the posterior probability of $H(c)$ changes; in particular, this posterior probability increases as $c$ decreases, and vice versa. If it varies continuously, then by the intermediate value theorem, there is a point $c = c_\alpha$ such that the posterior probability of $H(c_\alpha)$ equals $1 - \alpha$. The set $H(c_\alpha)$ is the $100(1-\alpha)\%$ highest posterior density credible interval.

Note that these credible intervals are not understood the same way as confidence intervals. In particular, they may not have coverage probability equal to $1 - \alpha$. In this case, $1 - \alpha$ represents the amount of probability the posterior assigns to the interval. But in many cases, Bayesian credible intervals do have reasonable frequentist coverage probabilities, though this is not guaranteed by the construction.
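To make the two constructions concrete, here is a small numerical sketch (mine, with an arbitrary $\mathsf{Beta}(5, 3)$ density standing in for the posterior $\pi(\theta \mid x)$). For a unimodal posterior, the highest posterior density set coincides with the shortest interval carrying $1 - \alpha$ posterior probability, which is what the search below exploits.

```python
import numpy as np
from scipy.stats import beta

a, b, alpha = 5, 3, 0.05          # hypothetical posterior Beta(5, 3) and credible level 95%

# Equal-tailed interval: cut off alpha/2 posterior probability in each tail.
equal_tailed = (beta.ppf(alpha / 2, a, b), beta.ppf(1 - alpha / 2, a, b))

# Highest posterior density interval: search over intervals with posterior
# probability 1 - alpha and keep the shortest one.
lower_tail = np.linspace(0, alpha, 2001)
lo = beta.ppf(lower_tail, a, b)
hi = beta.ppf(lower_tail + 1 - alpha, a, b)
best = np.argmin(hi - lo)
hpd = (lo[best], hi[best])

print(equal_tailed)
print(hpd)
```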

    3 Choice of prior

In the previous discussion, the prior distribution seemed to appear from thin air. Indeed, if a prior is provided, then the statistics problem really reduces to some simple algebra/calculus. In real life, however, one must select the prior. Here I discuss three ways in which one can come up with the prior.

$^2$This set may not be an interval if the posterior PDF $\pi(\theta \mid x)$ has multiple "bumps."


    3.1 Elicitation from experts

The most natural way to get a prior distribution is to go and ask experts on the problem at hand what they expect. For example, if $\theta \in (0, 1)$ represents the probability that some widget-manufacturing machine will produce a defective widget, then the statistician might go speak to the experts, the people who designed and built the machine, to get their opinions about reasonable values of $\theta$. These experts might be able to tell the statistician some helpful information about the shape or percentiles of the prior distribution, which can be used to make a specific choice of prior. However, this can be expensive and time-consuming, and may not result in reasonable prior information, so this is rarely done in practice.

    3.2 Convenient priors

Before we had high-powered computers readily available, Bayesian analysis was mostly restricted to the use of "convenient" priors. The priors which are convenient for a particular problem may not be so realistic. This is perhaps the biggest reason it took so long for Bayesian analysis to catch on in the statistics community. Although computation with realistic priors is possible these days, these convenient priors are still of interest and can be used as prior distributions for high-level hyperparameters in hierarchical models.

The most popular class of convenient priors are called conjugate priors. A class of priors is said to be conjugate for a given problem if, when combined with the likelihood in Bayes formula, the resulting posterior distribution is also a member of this class. This is convenient because calculations for the posterior distribution can often be done in closed form or with very easy numerical methods. Here are a few examples.

Example 1. Let $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \mathsf{Ber}(\theta)$ given $\Theta = \theta$. We shall consider a prior distribution $\Theta \sim \mathsf{Beta}(a, b)$, where $a$ and $b$ are some fixed positive numbers. Here the likelihood is

$$L(\theta) = c\, \theta^{\sum_{i=1}^n X_i} (1 - \theta)^{n - \sum_{i=1}^n X_i},$$

where $c$ is a constant depending on $n$ and the $X_i$'s only. Also, the prior density for $\Theta$ is

$$\pi(\theta) = c\, \theta^{a-1} (1 - \theta)^{b-1},$$

where $c$ is a (different) constant depending on $a$ and $b$ only. We can see some similarities in these two formulas. Indeed, if we multiply them together, we see that the posterior distribution $\pi(\theta \mid x)$ satisfies

$$\pi(\theta \mid x) = C\, \theta^{a + \sum_{i=1}^n x_i - 1} (1 - \theta)^{b + n - \sum_{i=1}^n x_i - 1},$$

where $C$ is some constant depending on $n$, the $x_i$'s, $a$ and $b$. This is clearly of the same form as the prior density $\pi(\theta)$, but with different parameters. That is, the posterior distribution for $\Theta$, given $X = x$, is also a beta distribution but with parameters $a^\star = a + \sum_{i=1}^n x_i$ and $b^\star = b + n - \sum_{i=1}^n x_i$. So, the Bayes estimate of $\theta$ can be found easily from the standard formula for the mean of a beta distribution, i.e.,

$$\tilde\theta(x) = E(\Theta \mid x) = \frac{a^\star(x)}{a^\star(x) + b^\star(x)} = \frac{a + n\bar{x}}{a + b + n}.$$


This estimator depends on both the prior parameters $(a, b)$ and the observable data. In fact, the posterior mean is just a weighted average of the prior mean $a/(a+b)$ and the data mean $\bar{x}$. This is a fairly general phenomenon.
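A quick numerical check of the conjugate update and the weighted-average form of the posterior mean (my own sketch; the prior parameters and data are made up):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 2.0                          # hypothetical Beta(a, b) prior
x = rng.binomial(1, 0.3, size=50)        # hypothetical Bernoulli sample
n, s = x.size, x.sum()

post_mean = (a + s) / (a + b + n)        # mean of the Beta(a + sum(x), b + n - sum(x)) posterior

# The same quantity as a weighted average of the prior mean and the sample mean.
w = n / (a + b + n)
weighted = (1 - w) * (a / (a + b)) + w * x.mean()

print(post_mean, weighted)               # the two expressions agree
```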

Exercise 3. Suppose the model is $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \mathsf{N}(\theta, 1)$ given $\Theta = \theta$. Show that the class of normal distributions for $\Theta$ is conjugate.

Exercise 4. Suppose the model is $X_1, \ldots, X_n \overset{\text{iid}}{\sim} \mathsf{Pois}(\theta)$ given $\Theta = \theta$. Show that the class of gamma distributions for $\Theta$ is conjugate.

Conjugate priors are nice to work with but, as mentioned above, may not be realistic representations of prior uncertainty. Some additional flexibility can be achieved, without too much sacrifice of convenience, by working with priors that are mixtures of conjugate priors.

3.3 Non-informative priors

Since a "real" prior is rarely available, there can be a number of reasonable priors to choose from and, generally, the posterior distributions for these different priors will have some differences. How can one justify a particular choice of prior? One popular strategy is to choose the prior that, in some sense, influences the posterior as little as possible, so that the data drives the Bayesian analysis more than the choice of prior. Such a prior is called non-informative$^3$: it gives data maximal freedom to select those parameter values which are most likely.

It is not even clear how to define non-informative priors. It is easy, however, to give an example of an informative prior: take $\Theta = 0$ with prior probability 1. In this case, the posterior also assigns probability 1 to the value 0, so the data can do nothing to identify a different parameter value. This is an extreme example, but the general idea is similar: a prior is informative if it imposes too much restriction on the posterior. Formal definitions of non-informative priors are complicated and somewhat technical,$^4$ so I won't go into these. I will try to give some intuition, and a good technique for choosing a non-informative prior in standard problems.

One might think that some kind of uniform prior is, in some sense, non-informative because all values are "equally likely." For a long time, this was the case: both Bayes and Laplace, the first Bayesians, used such a prior in their examples. But Fisher astutely criticized this approach. Here is my summary of Fisher's argument, say, for the success probability $\theta$ in a binomial experiment:

    Suppose we don't know where $\theta$ is likely to be, and for this reason we choose to take $\Theta \sim \mathsf{Unif}(0, 1)$. Then we also don't know anything about $\eta = \log\theta$ either, so, if we express the problem in terms of $\eta$, then we should also take a uniform prior for $\eta$. But there's an inconsistency in the logic, because we should also be able to apply standard results from probability theory, in particular transformation formulas, to get the prior for $\eta$ from the prior for $\theta$. But it's easy to see that a uniform prior for $\theta$ does not correspond to a uniform prior for $\eta$ obtained from the transformation theorem of PDFs. Hence, a logical inconsistency.

$^3$Such priors are also sometimes called "objective"; I think this is a potentially misleading name.

$^4$The formal definitions require some notion of an infinite sequence of experiments, and the prior is chosen to maximize the limiting distance between prior and posterior or, alternatively, by some sort of probability matching.


So, this suggests uniform is not a good definition of non-informative. However, with an adjustment to how we understand "uniformity," we can make such a definition. In fact, the priors we will discuss next are uniform in this different sense.$^5$
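To fill in the transformation step (my sketch, taking $\eta = \log\theta$ for concreteness): if $\Theta \sim \mathsf{Unif}(0, 1)$, the change-of-variables formula gives the density of $\eta = \log\Theta$ as

$$\pi_\eta(\eta) = \pi_\theta(e^{\eta}) \left| \frac{d\theta}{d\eta} \right| = 1 \cdot e^{\eta}, \qquad \eta < 0,$$

which is certainly not a uniform density on $(-\infty, 0)$, so a flat prior on one scale is informative on the other.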

A standard form of non-informative prior is Jeffreys prior. This has a close connection to some of the calculations we did when studying properties of maximum likelihood estimators. In particular, these priors depend critically on the Fisher information. Suppose that, given $\Theta = \theta$, data are modeled as $X_1, \ldots, X_n \overset{\text{iid}}{\sim} f_\theta(x)$. From the distribution $f_\theta(x)$, we can calculate the Fisher information:

$$I(\theta) = -E_\theta\Big[ \frac{\partial^2}{\partial\theta^2} \log f_\theta(X_1) \Big].$$

Then Jeffreys prior is defined as

$$\pi_J(\theta) = c\, I(\theta)^{1/2},$$

where $c > 0$ is a constant to make $\pi_J$ integrate to 1; sometimes there is no such constant $c$, so it can be chosen arbitrarily. I cannot explain the properties $\pi_J$ has here, but a very compelling case for Jeffreys prior in low-dimensional problems is given in Chapter 5 of Ghosh, Delampady, and Samanta, Introduction to Bayesian Analysis, Springer 2006.

Example 2. Suppose, given $\Theta = \theta$, data $X_1, \ldots, X_n$ are iid $\mathsf{Ber}(\theta)$. The Fisher information for this problem is $I(\theta) = \{\theta(1-\theta)\}^{-1}$, so Jeffreys prior is

$$\pi_J(\theta) = c\, \theta^{-1/2} (1 - \theta)^{-1/2}.$$

It is easy to check that this is a special case of the $\mathsf{Beta}(a, b)$ priors used in Example 1 above, with $a = b = 1/2$. Therefore, the posterior distribution of $\Theta$, given $X = x$, is simply $\Theta \mid x \sim \mathsf{Beta}(n\bar{x} + 1/2,\; n - n\bar{x} + 1/2)$, and, for example, the posterior mean is $E(\Theta \mid x) = (n\bar{x} + 1/2)/(n + 1)$, which is very close to the MLE $\bar{x}$. Notice, as we could anticipate from the previous discussion, the non-informative prior is not uniform!
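Numerically, the difference between the Jeffreys posterior mean and the MLE is of order $1/n$; a quick check (my own sketch with simulated data):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.binomial(1, 0.4, size=30)            # hypothetical Bernoulli sample
n, xbar = x.size, x.mean()

mle = xbar
jeffreys_mean = (n * xbar + 0.5) / (n + 1)   # posterior mean under the Beta(1/2, 1/2) prior

print(mle, jeffreys_mean, jeffreys_mean - mle)   # the difference equals (0.5 - xbar) / (n + 1)
```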

Exercise 5. Find the Jeffreys prior $\pi_J$ for $\theta$ when, given $\Theta = \theta$, data are iid $\mathsf{N}(\theta, 1)$. Calculate the corresponding posterior distribution.

Exercise 6. Find the Jeffreys prior $\pi_J$ for $\theta$ when, given $\Theta = \theta$, data are iid $\mathsf{Pois}(\theta)$. Calculate the corresponding posterior distribution.

There are some potential difficulties with Jeffreys prior that should be mentioned. First, it can happen that $I(\theta)^{1/2}$ does not integrate to a finite number, which means that Jeffreys prior may not be a proper prior. People generally do not concern themselves with this issue, provided that the posterior is proper. So, when using non-informative improper priors, one should always check that the corresponding posterior is proper before using it for inference. If it's not proper, then posterior means, for example, are meaningless. The second point is a more philosophical one: by using Jeffreys prior, one assumes that the prior for $\Theta$ depends on the model for the data. Ideally, the prior is based on our beliefs about $\theta$, which have no direct connection with the particular model we use for observable data. More formally, this leads to a violation of the likelihood principle.$^6$

$^5$One should use a different sort of geometry defined on the parameter space, not necessarily the usual Euclidean geometry. With this adjusted geometry, basically a Riemannian metric induced by the Fisher information, one can recover the Jeffreys prior as a sort of uniform distribution in this geometry.

    4 Other important points

In this last section, I shall give some brief comments on some important aspects of Bayesian analysis that we don't have time to discuss in detail in Stat 411. In a formal Bayesian analysis course, one would surely spend a bit of time on each of these items.

4.1 Hierarchical models

The particular models considered here are so simple that it really doesn't matter what kind of analysis one uses; the result will generally be the same. However, in more complicated problems, the advantages of a Bayesian analysis become more pronounced. A now-classical problem is the so-called many-normal-means problem. Suppose $X_1, \ldots, X_n$ are independent (not iid) with $X_i \sim \mathsf{N}(\theta_i, 1)$, $i = 1, \ldots, n$. The maximum likelihood (or least-squares) estimator of the mean vector $\theta = (\theta_1, \ldots, \theta_n)^\top$ is the observed data vector $X = (X_1, \ldots, X_n)^\top$. However, there is a famous result of C. Stein which says that this estimator is bad (for a certain kind of loss function) whenever $n \geq 3$; in particular, there is another estimator that is at least as good for all possible $\theta$ values. For this reason, one must make some kind of adjustment to the MLE, and the particular adjustments are often of the form of shrinking $X$ towards some fixed point in $\mathbb{R}^n$. This shrinkage is somewhat ad hoc, so it might be nice to have a more formal way to accomplish this.

In a Bayesian setting, this can be accomplished quite easily with a hierarchical model. Consider the following model:

$$X_i \mid (\theta_i, v) \sim \mathsf{N}(\theta_i, 1), \quad i = 1, \ldots, n \;\; \text{(independent)},$$
$$\theta_1, \ldots, \theta_n \mid v \overset{\text{iid}}{\sim} \mathsf{N}(0, v),$$
$$V \sim \pi(v).$$

The idea is that there are several layers of priors: one for the main parameters $\theta = (\theta_1, \ldots, \theta_n)^\top$ and one for the hyperparameter $v$. The key to the success of such a model is that one can initially marginalize away the high-dimensional parameter $\theta$. Then there is lots of data which contains considerable information for inference on $V$; that is, the posterior distribution $\pi(v \mid x)$ is highly informative.

$^6$The likelihood principle says, roughly, that inference should depend on the model only through the shape of the likelihood function. There is a famous result of Birnbaum that says it is hard to disagree with the likelihood principle. Formal Bayesian inference satisfies the likelihood principle, while almost all frequentist inference violates it. Since Jeffreys prior depends on the model, its use for Bayesian analysis is a violation of the likelihood principle.


The goal then is to calculate the posterior mean of $\theta_i$ as an estimate:

$$E(\theta_i \mid x_i) = \int E(\theta_i \mid x_i, v)\, \pi(v \mid x)\, dv.$$

The inner expectation has a closed-form expression, $v x_i / (1 + v)$, so it would be easy to approximate the full expectation by Monte Carlo. The point, however, is that the resulting estimator will have certain shrinkage features, since $v/(1+v) \in (0, 1)$, which occur automatically in the Bayesian setup; no ad hoc shrinkage need be imposed.

As an alternative to the full Bayes analysis described above, one could perform an empirical Bayes analysis and estimate $v$ from the data, say, by $\hat{v}$; typically, $\hat{v}$ is found via marginal maximum likelihood. Then one would estimate $\theta_i$ via

$$\hat\theta_i = E(\theta_i \mid x_i, \hat{v}) = \frac{\hat{v}\, x_i}{1 + \hat{v}}, \quad i = 1, \ldots, n.$$

Such empirical Bayes strategies are popular nowadays.
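Under the model above, each $X_i$ is marginally $\mathsf{N}(0, 1 + v)$ once $\theta_i$ is integrated out, so one simple marginal MLE is $\hat{v} = \max\{\overline{x^2} - 1, 0\}$. The sketch below (mine, with simulated data; not the notes' code) uses this to form the empirical Bayes shrinkage estimates.

```python
import numpy as np

rng = np.random.default_rng(4)
n, v_true = 200, 4.0
theta = rng.normal(0.0, np.sqrt(v_true), size=n)   # theta_i ~ N(0, v), the prior layer
x = rng.normal(theta, 1.0)                         # X_i ~ N(theta_i, 1)

v_hat = max(np.mean(x**2) - 1.0, 0.0)              # marginal MLE of v, since X_i | v ~ N(0, 1 + v)
theta_eb = (v_hat / (1.0 + v_hat)) * x             # empirical Bayes estimates, shrinking X toward 0

print(v_hat)
print(np.mean((theta_eb - theta)**2), np.mean((x - theta)**2))   # shrinkage typically lowers the error
```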

    4.2 Complete-class theorems

There is a collection of results, falling under the general umbrella of complete-class theorems, which gives some frequentist justification for Bayesian methods. Recall that in a frequentist estimation problem, one is looking for estimators which have small mean-square error. Complete-class theorems identify the collection of estimators which are admissible, i.e., not uniformly worse than another estimator. In other words, admissible estimators are the only estimators one should consider, if one cares about mean-square error. There is one such theorem which says that, roughly, Bayes estimators (and limits thereof) form a complete class. That is, one can essentially restrict frequentist attention to estimators obtained by taking a prior distribution for $\Theta$ and producing the posterior mean as an estimator. So, you see, Bayesian methods are not only of interest to Bayesians.

    4.3 Computation

We did not discuss this matter in these notes, but computation is an important part of Bayesian analysis, perhaps more important than for frequentist analysis. Typically, unless one uses a convenient conjugate prior, some kind of numerical simulation is needed to carry out a Bayesian analysis. There are many such techniques available now, but they all fall under the general class of Monte Carlo methods; the most popular are Gibbs samplers and Markov chain Monte Carlo (MCMC). One could devote an entire course to MCMC, so I'll not get into any details here. Instead, I'll just show why such methods are needed.

Suppose I want to estimate $\theta$. I have a Bayes model with a prior and, at least formally, I can write down what the posterior distribution $\pi(\theta \mid x)$ looks like. But to estimate $\theta$, I need to calculate the posterior mean $E(\Theta \mid x)$, which is just the integral $\int \theta\, \pi(\theta \mid x)\, d\theta$. One way to approximate this is by Monte Carlo. That is,

$$E(\Theta \mid x) \approx \frac{1}{T} \sum_{t=1}^T \theta^{(t)},$$


where $\theta^{(1)}, \ldots, \theta^{(T)}$ is an (approximately) independent sample from the posterior distribution $\pi(\theta \mid x)$. If such a sample can be obtained, and if $T$ is sufficiently large, then we know by the law of large numbers that this sample average will be a good approximation to the expectation. These MCMC approaches strive to obtain this sample of $\theta^{(t)}$'s efficiently and with as little input as possible about the (possibly complex) posterior distribution $\pi(\theta \mid x)$.
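As a minimal sketch of how such a sample might be produced (my own illustration, not the notes' code), here is a random-walk Metropolis sampler; note that it only needs $\pi(\theta) L_x(\theta)$ up to a constant, so the troublesome denominator integral never appears. The model and prior below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(1.0, 1.0, size=20)                  # hypothetical N(theta, 1) data

def log_post(theta):
    """log of prior * likelihood, up to an additive constant (N(theta, 1) model, Cauchy prior)."""
    return -0.5 * np.sum((x - theta)**2) - np.log(1 + theta**2)

T, step, theta = 10_000, 0.5, 0.0
samples = np.empty(T)
for t in range(T):
    prop = theta + step * rng.normal()             # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop                               # accept; otherwise keep the current value
    samples[t] = theta

print(samples[2000:].mean())                       # approximates E(Theta | x), discarding a burn-in period
```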

    4.4 Asymptotic theory

We discussed briefly the question of how to choose the prior. But one can ask if the choice of prior really matters that much. The answer is "no, not really" in low-dimensional problems, because there is an asymptotic theory in Bayesian analysis which says that, under suitable conditions, the data wash away the effect of the prior when $n$ is large. These results can be rather technical, so here I'll only briefly mention the main result. It parallels the asymptotic normality of MLEs discussed earlier in the course.

Let $I(\theta)$ be the Fisher information and $\hat\theta_n$ the MLE. Then the Bayesian asymptotic normality, or the Bernstein-von Mises theorem, says that the posterior distribution of $\Theta$, given $X = x$, is approximately normal with mean $\hat\theta_n$ and variance $[nI(\hat\theta_n)]^{-1}$. More formally, under suitable conditions on likelihood and prior,

$$[nI(\hat\theta_n)]^{1/2} (\Theta - \hat\theta_n) \to \mathsf{N}(0, 1), \quad \text{in distribution}.$$

(This is not the only version, but I think it's the simplest.) Notice here that the statement looks similar to that of the asymptotic normality of the MLE, except that the roles of the terms are rearranged: here the random variable is $\Theta$, and the distribution we're talking about is its posterior distribution. Stripping away all the details, what the Bernstein-von Mises theorem says is that, no matter the prior, the posterior distribution will look like a normal distribution when $n$ is large. From this, one can then reach the conclusion that the choice of prior does not really matter, provided $n$ is sufficiently large.
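To see the Bernstein-von Mises phenomenon numerically, one can compare an exact posterior with its normal approximation. The sketch below (mine, with simulated data) uses the Bernoulli model with the Jeffreys prior from Example 2, for which $I(\theta) = \{\theta(1-\theta)\}^{-1}$.

```python
import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(5)
n = 200
x = rng.binomial(1, 0.3, size=n)
xbar = x.mean()

# Exact posterior under the Beta(1/2, 1/2) prior vs. the N(mle, [n I(mle)]^{-1}) approximation.
a_post, b_post = n * xbar + 0.5, n - n * xbar + 0.5
grid = np.linspace(0.15, 0.45, 7)
print(beta.cdf(grid, a_post, b_post))
print(norm.cdf(grid, loc=xbar, scale=np.sqrt(xbar * (1 - xbar) / n)))
# For n this large, the two sets of probabilities should agree closely.
```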

Here I am mostly focusing on simple finite-dimensional parameters and, in such cases, the message I just presented is generally true. However, when the parameter is high- or even infinite-dimensional, this intuition can break down. So one must really take care in choosing a prior for these more complicated problems. There are known cases where very intuitive choices of prior distribution lead to some form of inconsistency in the posterior.
