Introduction to Bayesian Estimation May 31, 2013


  • Introduction to Bayesian Estimation

    May 31, 2013

  • The Plan

    1. What is Bayesian statistics? How is it different from frequentist methods?

    2. 4 Bayesian Principles

    3. Overview of main concepts in Bayesian analysis

    Main reading: Ch. 1 in Gary Koop's Bayesian Econometrics

    Christian P. Robert's The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation

  • Probability and statistics: What's the difference?

    Probability is a branch of mathematics

    I There is little disagreement about whether the theorems follow from the axioms

    Statistics is an inversion problem: What is a good probabilistic description of the world, given the observed outcomes?

    I There is some disagreement about how we interpret data/observations

  • Why probabilistic models?

    Is the world characterized by randomness?

    I Is the weather random?

    I Is a coin flip random?

    I ECB interest rates?

  • What is the meaning of probability, randomness and uncertainty?

    Two main schools of thought:

    I The classical (or frequentist) view is that probability corresponds to the frequency of occurrence in repeated experiments

    I The Bayesian view is that probabilities are statements about our state of knowledge, i.e. a subjective view.

    The difference has implications for how we interpret estimated statistical models, and there is no general agreement about which approach is better.

  • Frequentist vs Bayesian statistics

    Variance of estimator vs variance of parameter

    I Bayesians think of parameters as having distributions, while frequentists conduct inference by thinking about the variance of an estimator

    Frequentist confidence intervals:

    I If the point estimate is the truth, and with repeated draws from the population of equal sample length, what is the interval that the estimate falls in 95 per cent of the time?

    Bayesian probability intervals:

    I Conditional on the observed data, a prior distribution of θ and a functional form (i.e. a model), what is the shortest interval that contains θ with 95% probability?

    The main conceptual difference is that Bayesians condition on the data, while frequentists design procedures that work well ex ante.

  • Frequentist vs Bayesian statistics

    Coin flip example:

    After flipping a coin 10 times and counting the number of heads, what is the probability that the next flip will come up heads?

    I A Bayesian with a uniform prior would say that the probability of the next flip coming up heads is (x + 1)/12, where x is the number of heads in the observed sample (the posterior mean of the heads probability; see the sketch below)

    I A frequentist cannot answer this question: a frequentist confidence interval can only be used to compute the probability of having observed the sample conditional on a given null hypothesis about the probability of heads vs tails.
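
    A minimal numerical sketch of the Bayesian side of this contrast, assuming the uniform Beta(1, 1) prior mentioned above and a hypothetical sample of 7 heads in 10 flips:

      # Sketch: posterior predictive probability of heads after x heads in n flips,
      # under a uniform Beta(1, 1) prior on the probability of heads.
      def predictive_prob_heads(x, n, a=1.0, b=1.0):
          # Beta(a, b) prior + binomial likelihood -> Beta(a + x, b + n - x) posterior;
          # the predictive probability of heads is the posterior mean.
          return (a + x) / (a + b + n)

      x, n = 7, 10                          # hypothetical sample
      print(predictive_prob_heads(x, n))    # 8/12 = 0.667, vs the plug-in estimate x/n = 0.7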

  • Bayesian statistics

    Bayesianism has obviously come a long way. It used to be that you could tell a Bayesian by his tendency to hold meetings in isolated parts of Spain and his obsession with coherence, self-interrogations, and other manifestations of paranoia. Things have changed...

    Peter Clifford, 1993

    Quote arrived here via Jesus Fernandez-Villaverde (U Penn)

  • Bayesian statistics

    Bayesians used to be fringe types, but are now more or less the mainstream in macro,

    I This is largely due to increased computing power

    Bayesian methods have several advantages:

    I Facilitates incorporating information from outside of sample

    I Easy to compute confidence/probability intervals of functions of parameters

    I Good small sample properties

  • The subjective view of probability

    What does subjective mean?

    I Coin flip: What is prob(heads) after flip but before looking?

    Important: Probabilities can be viewed as statements about our knowledge of a parameter, even if the parameter is a constant

    I E.g. treating risk aversion as a random variable does not imply that we think of it as varying across time.

  • The subjective view of probability

    Is a subjective view of probability useful?

    I Yes, it allows people to incorporate information from outside the sample

    I Yes, because subjectivity is always there (it is just more hidden in frequentist methods)

    I Yes, because one can use objective, or non-informative, priors

    I And: Sensitivity to priors can be checked

  • 4 Principles of Bayesian Statistics

    1. The Likelihood Principle

    2. The Sufficiency Principle

    3. The Conditionality Principle

    4. The Stopping Rule Principle

    These are principles that appeal to common sense.

  • The Likelihood Principle

    All the information about a parameter from an experiment is contained in the likelihood function

  • The Likelihood Principle

    The likelihood principle can be derived from two other principles:

    I The Sufficiency Principle

    I If two samples imply the same sufficient statistic (given a model that admits such a thing), then they should lead to the same inference about θ

    I Example: Two samples from a normal distribution with the same mean and variance should lead to the same inference about the parameters μ and σ²

    I The Conditionality Principle

    I If two possible experiments on the parameter θ are available, and one of the experiments is chosen with probability p, then the inference on θ should only depend on the chosen experiment

  • The Stopping Rule Principle

    The Stopping Rule Principle is an implication of the Likelihood Principle

    If a sequence of experiments, E1, E2, ..., is directed by a stopping rule, τ, which indicates when the experiment should stop, inference about θ should depend on τ only through the resulting sample.

  • The Stopping Rule Principle: Example

    In an experiment to find out the fraction of the population that watched a TV show, a researcher found 9 viewers and 3 non-viewers.

    Two possible models:

    1. The researcher interviewed 12 people and thus observed x ∼ B(12, θ)

    2. The researcher questioned N people until he obtained 3 non-viewers, with N ∼ Neg(3, 1 − θ)

    The likelihood principle implies that the inference about θ should be identical whichever model is used. A frequentist would draw different conclusions depending on whether he thought (1) or (2) generated the experiment. In fact, the null hypothesis θ = 0.5 (against the one-sided alternative θ > 0.5) would be rejected at the 5 per cent level if the experiment is (2), but not if it is (1); see the numerical check below.
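
    A small numerical check of this claim, sketched with scipy's binom and nbinom distributions; the 5 per cent significance level and the one-sided alternative θ > 0.5 are the assumptions:

      # Sketch: frequentist p-values for H0: theta = 0.5 against theta > 0.5,
      # for the same data (9 viewers, 3 non-viewers) under the two experiments.
      from scipy.stats import binom, nbinom

      # (1) Binomial: 12 interviews fixed in advance, x = 9 viewers observed.
      p_binomial = 1 - binom.cdf(8, 12, 0.5)   # P(X >= 9) ~ 0.073

      # (2) Negative binomial: interview until 3 non-viewers; scipy's nbinom counts
      #     the number of viewers ("failures") before the 3rd non-viewer ("success").
      p_negbin = 1 - nbinom.cdf(8, 3, 0.5)     # P(viewers >= 9) ~ 0.033

      print(p_binomial, p_negbin)              # only (2) rejects at the 5 per cent level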

  • The Stopping Rule Principle

    I learned the stopping rule principle from Professor Barnard, in conversation in the summer of 1952. Frankly, I thought it a scandal that anyone in the profession could advance an idea so patently wrong, even as today I can scarcely believe that some people resist an idea so patently right. (Savage, 1962)

    Quote (again) arrived here via Jesus Fernandez-Villaverde (U Penn)

  • The main components in Bayesian inference

  • Bayes Rule

    Deriving Bayes Rule

    p(A, B) = p(A | B) p(B)

    Symmetrically, p(A, B) = p(B | A) p(A)

    Implying p(A | B) p(B) = p(B | A) p(A)

    or

    p(A | B) = p(B | A) p(A) / p(B)

    which is known as Bayes Rule.
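
    A tiny numerical illustration of the rule, with made-up probabilities for two events A and B:

      # Sketch: Bayes Rule with hypothetical numbers.
      p_B_given_A = 0.9      # p(B | A)
      p_B_given_notA = 0.3   # p(B | not A)
      p_A = 0.2              # p(A)

      p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)    # law of total probability
      p_A_given_B = p_B_given_A * p_A / p_B                   # Bayes Rule
      print(p_A_given_B)                                      # 0.18 / 0.42 = 0.43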

  • Data and parameters

    The purpose of Bayesian analysis is to use the data y to learn about the parameters θ

    I Parameters of a statistical model

    I Or anything not directly observed

    Replace A and B in Bayes Rule with θ and y to get

    p(θ | y) = p(y | θ) p(θ) / p(y)

    The probability density p(θ | y) then describes what we know about θ, given the data.

  • Parameters as random variables

    Since p(θ | y) is a density function, we thus treat θ as a random variable with a distribution.

    This is the subjective view of probability

    I Express subjective probabilities through the prior

    I More importantly, the randomness of a parameter is subjective in the sense of depending on what information is available to the person making the statement.

    The density p(θ | y) is proportional to the prior times the likelihood function:

    p(θ | y): the posterior

    p(y | θ): the likelihood function

    p(θ): the prior

    But what is the prior and the likelihood function?

  • The prior density p()

    Sometimes we know more about the parameters than what the data tells us, i.e. we have some prior information. The prior density p(θ) summarizes what we know about a parameter before observing the data y.

    I For a DSGE model, we may have information about deep parameters

    I Range of some parameters may be restricted by theory, e.g. risk aversion should be positive

    I The discount rate is the inverse of the average real interest rate

    I Price stickiness can be measured by surveys

    I We may know something about the mean of a process

  • Choosing priors

    How influential the priors will be for the posterior is a choice:

    I It is always possible to choose priors such that a given result is achieved no matter what the sample information is (i.e. dogmatic priors)

    I It is also possible to choose priors such that they do not influence the posterior (i.e. so-called non-informative priors)

    Important: Priors are a choice and must be motivated.

  • Choosing prior distributions

    The beta distribution is a good choice when parameter is in [0,1]

    P(x) = x^(a-1) (1 - x)^(b-1) / B(a, b)

    where

    B(a, b) = (a - 1)! (b - 1)! / (a + b - 1)!

    It is easier to parameterize using the expressions for the mean, mode and variance:

    mean: μ = a / (a + b)

    mode: x* = (a - 1) / (a + b - 2)

    variance: σ² = a b / [(a + b)² (a + b + 1)]
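
    In practice one often picks the beta prior by fixing a desired mean and standard deviation and inverting the expressions above for (a, b). A minimal sketch, with hypothetical target values:

      # Sketch: back out beta parameters (a, b) from a desired prior mean and s.d.,
      # by inverting mu = a / (a + b) and sigma^2 = a b / ((a + b)^2 (a + b + 1)).
      def beta_from_mean_sd(mu, sd):
          s = mu * (1 - mu) / sd**2 - 1      # s = a + b
          return mu * s, (1 - mu) * s        # a = mu * s, b = (1 - mu) * s

      a, b = beta_from_mean_sd(0.5, 0.15)    # hypothetical target mean and s.d.
      print(a, b)                            # roughly a = b = 5.06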

  • Examples of beta distributions holding mean fixed

    [Figure: examples of beta densities on [0, 1] with the mean held fixed]

  • Examples of beta distributions holding s.d. fixed

    [Figure: examples of beta densities on [0, 1] with the standard deviation held fixed]

  • Choosing prior distributions

    The inverse gamma distribution is a good choice when the parameter is positive

    P(x) = [b^a / Γ(a)] (1/x)^(a+1) exp(-b/x)

    where

    Γ(a) = (a - 1)!

    Again, it is easier to parameterize using the expressions for the mean, mode and variance:

    mean: μ = b / (a - 1), for a > 1

    mode: x* = b / (a + 1)

    variance: σ² = b² / [(a - 1)² (a - 2)], for a > 2
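
    As with the beta, one can back out (a, b) from a desired prior mean and standard deviation by inverting the expressions above. A minimal sketch, with hypothetical target values:

      # Sketch: back out inverse gamma parameters (a, b) from a desired mean and s.d.,
      # using mu = b / (a - 1) for a > 1 and sigma^2 = b^2 / ((a - 1)^2 (a - 2)) for a > 2.
      def invgamma_from_mean_sd(mu, sd):
          a = mu**2 / sd**2 + 2              # from sigma^2 = mu^2 / (a - 2)
          b = mu * (a - 1)                   # from mu = b / (a - 1)
          return a, b

      a, b = invgamma_from_mean_sd(0.5, 0.25)   # hypothetical target mean and s.d.
      print(a, b)                               # a = 6, b = 2.5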

  • Examples of inverse gamma distributions

    [Figure: examples of inverse gamma densities]

  • Conjugate Priors

    Conjugate priors are particularly convenient:

    I Combining distributions that are members of a conjugate family results in a new distribution that is a member of the same family

    Useful, but only insofar as the priors are chosen to actually reflect prior beliefs rather than just for analytical convenience

  • Improper Priors

    Improper priors are priors that are not probability density functions in the sense that they do not integrate to 1.

    I Can still be used as a form of uninformative priors for some types of analysis

    I The uniform distribution U(-∞, ∞) is popular

    I The mode of the posterior then coincides with the MLE

  • The likelihood function

    The likelihood function is the probability density of observing the data y conditional on a statistical model and the parameters θ.

    I For instance, if the data y conditional on θ are normally distributed, we have that

    p(y | θ) = (2π)^(-T/2) |Σ|^(-1/2) exp[ -(1/2) (y - μ)′ Σ^(-1) (y - μ) ]

    where μ and Σ are either elements or functions of θ.
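
    A direct transcription of this density into code, as a sketch; in practice one works with the log-likelihood, and μ and Σ would be produced by the model for a given θ:

      # Sketch: Gaussian log-likelihood of a data vector y given a mean vector mu
      # and covariance matrix Sigma (both of which would be functions of theta).
      import numpy as np

      def gaussian_loglike(y, mu, Sigma):
          T = y.shape[0]
          resid = y - mu
          sign, logdet = np.linalg.slogdet(Sigma)        # log |Sigma|
          quad = resid @ np.linalg.solve(Sigma, resid)   # (y - mu)' Sigma^{-1} (y - mu)
          return -0.5 * (T * np.log(2 * np.pi) + logdet + quad)

      y = np.array([0.3, -0.1, 0.4])                     # hypothetical data
      print(gaussian_loglike(y, np.zeros(3), np.eye(3)))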

  • The posterior density p( | y)

    The posterior density is often the object of fundamental interest in Bayesian estimation.

    I The posterior is the result.

    I We are interested in the entire conditional distribution of the parameters

  • Model comparison

    Bayesian statistics is not concerned with rejecting models, but with assigning probabilities to different models

    I We can index different models by i = 1, 2, ...m

    p(θ | y, Mi) = p(y | θ, Mi) p(θ | Mi) / p(y | Mi)

  • Model comparison

    The posterior model probability is given by

    p(Mi | y) = p(y | Mi) p(Mi) / p(y)

    where p(y | Mi) is called the marginal likelihood. It can be computed from

    ∫ p(θ | y, Mi) dθ = ∫ [p(y | θ, Mi) p(θ | Mi) / p(y | Mi)] dθ

    by using that ∫ p(θ | y, Mi) dθ = 1, so that

    p(y | Mi) = ∫ p(y | θ, Mi) p(θ | Mi) dθ
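
    The last expression suggests a crude (and often inefficient) way to approximate the marginal likelihood: average the likelihood over draws from the prior. A sketch for a toy model with hypothetical data, a N(θ, 1) likelihood and a N(0, 2²) prior:

      # Sketch: Monte Carlo estimate of p(y | M) = integral of p(y | theta, M) p(theta | M) dtheta,
      # by averaging the likelihood over draws from the prior.
      import numpy as np

      rng = np.random.default_rng(0)
      y = rng.normal(1.0, 1.0, size=50)                  # hypothetical data

      def likelihood(theta):
          # y_t ~ N(theta, 1) with known variance
          return np.exp(-0.5 * np.sum((y - theta) ** 2)) / (2 * np.pi) ** (y.size / 2)

      theta_draws = rng.normal(0.0, 2.0, size=10_000)    # draws from the prior N(0, 2^2)
      print(np.mean([likelihood(t) for t in theta_draws]))   # crude estimate of p(y | M)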

  • The posterior odds ratio

    The posterior odds ratio is the relative probability of two models conditional on the data

    p(Mi | y) / p(Mj | y) = [p(y | Mi) p(Mi)] / [p(y | Mj) p(Mj)]

    It is made up of the prior odds ratio p(Mi) / p(Mj) and the Bayes factor p(y | Mi) / p(y | Mj).
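
    A small worked example with hypothetical log marginal likelihoods and equal prior model probabilities:

      # Sketch: posterior odds and Bayes factor from hypothetical inputs.
      import math

      log_ml_i, log_ml_j = -101.3, -103.8        # hypothetical log p(y | Mi), log p(y | Mj)
      prior_i, prior_j = 0.5, 0.5                # equal prior model probabilities

      bayes_factor = math.exp(log_ml_i - log_ml_j)          # ~ 12.2
      posterior_odds = bayes_factor * (prior_i / prior_j)
      prob_i = posterior_odds / (1 + posterior_odds)        # p(Mi | y) when Mi, Mj exhaust the models
      print(bayes_factor, prob_i)                           # ~ 12.2 and ~ 0.92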

  • Concepts of Bayesian analysis

    Prior densities, likelihood functions, posterior densities, posterior odds ratios, Bayes factors etc. are part of almost all Bayesian analysis.

    I The data, the choice of prior densities and the likelihood function are the inputs into the analysis

    I Posterior densities etc are the outputs

    The rest of the course is about how to choose these inputs and how to compute the outputs

  • Bayesian computation

    More specifics about the outputs

    I Posterior mean: E(θ | y) = ∫ θ p(θ | y) dθ

    I Posterior variance: var(θ | y) = E(θ² | y) - [E(θ | y)]²

    I Posterior probabilities, e.g. Pr(θi > 0 | y)

    These objects can all be written in the form

    E(g(θ) | y) = ∫ g(θ) p(θ | y) dθ

    where g(θ) is the function of interest.

  • Posterior simulation

    There are only a few cases in which the expected value of functions of interest can be derived analytically.

    I Instead, we rely on posterior simulation and Monte Carlointegration.

    I Posterior simulation consists of constructing a sample from the posterior distribution p(θ | y)

    Monte Carlo integration then uses that

    ĝ_S = (1/S) Σ_{s=1}^{S} g(θ^(s))

    and that lim_{S→∞} ĝ_S = E(g(θ) | y), where θ^(s) is a draw from the posterior distribution.
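
    A sketch of Monte Carlo integration where, for illustration, the "posterior" is a distribution we can sample from directly (a normal), so the approximations can be checked against the exact answers:

      # Sketch: Monte Carlo integration with S draws theta^(s) from the posterior.
      # Here the posterior is taken to be N(0.8, 0.2^2), so the exact answers are known.
      import numpy as np

      rng = np.random.default_rng(0)
      draws = rng.normal(0.8, 0.2, size=100_000)     # theta^(s), s = 1, ..., S

      post_mean = draws.mean()                       # E(theta | y)        ~ 0.80
      post_var = draws.var()                         # var(theta | y)      ~ 0.04
      prob_pos = (draws > 0).mean()                  # Pr(theta > 0 | y)   ~ 1.00
      print(post_mean, post_var, prob_pos)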

  • The end product of Bayesian statistics

    Most of Bayesian econometrics consists of simulating distributions of parameters using numerical methods.

    I A simulated posterior is a numerical approximation to the distribution p(y | θ) p(θ)

    I This is useful since the distribution p(y | θ) p(θ) is (by Bayes Rule) proportional to p(θ | y)

    I We rely on ergodicity, i.e. that the moments of the constructed sample correspond to the moments of the distribution p(θ | y)

    The most popular procedure to simulate the posterior is called the Random Walk Metropolis Algorithm
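
    A minimal sketch of the Random Walk Metropolis algorithm for a scalar parameter; log_post stands for the (unnormalised) log of p(y | θ)p(θ), and the target used in the example is a placeholder:

      # Sketch: Random Walk Metropolis for a scalar parameter theta.
      import numpy as np

      def rw_metropolis(log_post, theta0, scale, S, seed=0):
          rng = np.random.default_rng(seed)
          draws = np.empty(S)
          theta, lp = theta0, log_post(theta0)
          for s in range(S):
              prop = theta + scale * rng.standard_normal()   # random walk proposal
              lp_prop = log_post(prop)
              if np.log(rng.uniform()) < lp_prop - lp:       # accept with prob min(1, ratio)
                  theta, lp = prop, lp_prop
              draws[s] = theta                               # keep the current value if rejected
          return draws

      # Placeholder target: log posterior proportional to a N(1, 0.5^2) density.
      draws = rw_metropolis(lambda t: -0.5 * ((t - 1.0) / 0.5) ** 2, 0.0, 0.5, 50_000)
      print(draws[1000:].mean(), draws[1000:].std())         # ~ 1.0 and ~ 0.5 after burn-in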

  • Sampling from a distribution

    Example:

    I Standard Normal distribution N(0, 1)

    If the distribution is ergodic, the sample shares all the properties of the true distribution asymptotically

    I Why don't we just compute the moments directly?

    I When we can, we should (as in the Normal distribution's case)

    I It is not always possible, either for tractability reasons or because of the computational burden

  • Sample from Standard Normal

    [Figure: a short sequence of draws from N(0, 1), plotted against the draw number]

  • Sample from Standard Normal

    [Figure: histogram of the short sample of draws from N(0, 1)]

  • Sample from Standard Normal

    [Figure: a much longer sequence of draws from N(0, 1), plotted against the draw number]

  • Sample from Standard Normal

    [Figure: histogram of the longer sample of draws from N(0, 1)]

  • What is a Markov Chain?

    A process for which the distribution of next-period variables is independent of the past, once we condition on the current state

    I A Markov chain is a stochastic process with the Markov property

    I "Markov chain" is sometimes taken to mean only processes with a countable number of states

    I Here, the term will be used in the broader sense

    Markov Chain Monte Carlo methods provide a way of generating samples that share the properties of the target density (i.e. the object of interest)

  • What is a Markov Chain?

    Markov property

    p(θ_{j+1} | θ_j) = p(θ_{j+1} | θ_j, θ_{j-1}, θ_{j-2}, ..., θ_1)

    The transition density function p(θ_{j+1} | θ_j) describes the distribution of θ_{j+1} conditional on θ_j.

    I In most applications, we know the conditional transition density and can figure out unconditional properties like E(θ) and E(θ²)

    I MCMC methods can be used to do the opposite: determine a particular conditional transition density such that the unconditional distribution converges to that of the target distribution.
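
    A small sketch of the first direction: simulate a Markov chain with a known transition density (an AR(1) with a placeholder coefficient) and recover its unconditional moments from the sample:

      # Sketch: theta_{j+1} = rho * theta_j + e_{j+1}, with e ~ N(0, 1).
      # The unconditional distribution is N(0, 1 / (1 - rho^2)).
      import numpy as np

      rng = np.random.default_rng(0)
      rho, S = 0.9, 100_000
      theta = np.empty(S)
      theta[0] = 0.0
      for j in range(S - 1):
          theta[j + 1] = rho * theta[j] + rng.standard_normal()

      print(theta.mean(), theta.var())   # roughly 0 and 1 / (1 - 0.81) = 5.26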

  • Summing up

    I The subjective view of probability

    I The Bayesian Principles

    I Bayesians condition on the data

    I Priors and likelihood functions are chosen by the researcher and must be motivated

    I The posterior density is the main object of interest

    I We often need to use numerical methods to approximate the posterior density

    Next time: Regression model with natural conjugate priors

  • That's it for today.