Introduction to Bayesian Statistics
Christiana Kartsonaki
February 11th, 2015
Introduction
Bayes’ theorem is an elementary result in probability theory which relates the conditional probability P(A | B) to P(B | A), for two events A and B.
Bayes’ theorem
P(A): probability of event A
P(A | B): conditional probability of event A given that event B has occurred
P(A | B) = P(B | A) P(A) / P(B)
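As a quick numerical check of the formula (the numbers below are hypothetical, in the style of a diagnostic-testing example), Bayes’ theorem can be evaluated directly in Python:

    # Hypothetical diagnostic-testing numbers: P(A) is the prevalence of a
    # condition, P(B | A) the test's sensitivity, P(B | not A) its
    # false-positive rate.
    p_a = 0.01
    p_b_given_a = 0.90
    p_b_given_not_a = 0.05

    # Law of total probability: P(B) = P(B | A) P(A) + P(B | not A) P(not A)
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

    # Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
    p_a_given_b = p_b_given_a * p_a / p_b
    print(p_a_given_b)  # about 0.154

Even with a fairly accurate test, P(A | B) remains modest because the prior P(A) is small.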
Bayesian inference
P(conclusion | data) ∝ P(data | conclusion) × P(conclusion)
posterior ∝ likelihood × prior
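A minimal sketch of this updating for a hypothetical two-hypothesis problem (a coin that is either fair or biased; the prior weights and the data are illustrative assumptions):

    # Two hypotheses about a coin: fair (P(heads) = 0.5) or biased (0.8).
    # posterior ∝ likelihood × prior, normalized over the two hypotheses.
    p_heads = {"fair": 0.5, "biased": 0.8}
    prior = {"fair": 0.9, "biased": 0.1}

    data = ["H", "H", "T", "H", "H"]  # hypothetical observations

    unnormalized = {}
    for h, p in p_heads.items():
        likelihood = 1.0
        for obs in data:
            likelihood *= p if obs == "H" else 1 - p
        unnormalized[h] = likelihood * prior[h]

    total = sum(unnormalized.values())
    posterior = {h: v / total for h, v in unnormalized.items()}
    print(posterior)  # prior opinion shifted towards 'biased' by the data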
Example
buses arrive in a random pattern (Poisson process), mean time interval unknown
data → likelihood
you know something about the interval (e.g. likely not every 1 hour, but not every 1 minute either)
prior distribution
the two sources of information are synthesized as a probability distribution
posterior distribution
→ merge information from data with ‘external’ information
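A grid-based sketch of this synthesis (all numbers are hypothetical): a prior on the mean interval μ that discourages very short and very long intervals, an exponential likelihood for the observed waiting times, and their normalized product as the posterior:

    import numpy as np
    from scipy import stats

    waits = np.array([12.0, 7.0, 15.0])   # hypothetical waits, in minutes
    mu = np.linspace(0.5, 120, 2000)      # grid for the mean interval

    # Prior: lognormal centred near 10 minutes (an illustrative choice
    # encoding 'probably not 1 minute, probably not 1 hour').
    prior = stats.lognorm.pdf(mu, s=0.75, scale=10.0)

    # Exponential likelihood with mean mu, multiplied over the waits.
    likelihood = np.prod(stats.expon.pdf(waits[:, None], scale=mu), axis=0)

    posterior = prior * likelihood
    posterior /= np.trapz(posterior, mu)  # normalize on the grid
    print(mu[np.argmax(posterior)])       # posterior mode of mu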
Statistical inference
parameter θ
data x
model f(x, θ)
Bayes’ theorem
p(θ | x) = p(x | θ) π(θ) / p(x)
π(θ): prior distribution
p(x | θ): likelihood
p(θ | x): posterior distribution
p(x): predictive probability of x (normalizing factor)
posterior ∝ likelihood × prior
Bayesian inference
Any quantity that does not depend on θ cancels out from the denominator and numerator of Bayes’ theorem.

So if we can recognise which density is proportional to the product of the likelihood and the prior, regarded solely as a function of θ, we know the posterior density of θ.
Frequentist and Bayesian statistics
Frequentist approaches → typically treat θ as an unknown constant
Bayesian approaches → treat it as a random variable
Likelihood
Likelihood → used in most approaches to formal statistical inference.
Describes the data generating process.
Prior and posterior distribution
prior distribution → represents information about the parameters other than that supplied by the data under analysis

posterior distribution → probability distribution as revised in the light of the data, determined by applying standard rules of probability theory

Sometimes the prior distribution is ‘flat’ over a particular scale, intended to represent the absence of initial information.

In complex problems with many nuisance parameters the use of flat prior distributions is suspect and, at the very least, needs careful study using sensitivity analyses.
Example

X1, . . . , Xn: random sample from an exponential distribution with density f(x | θ) = θ e^(−θx), x > 0

prior π(θ) = λ e^(−λθ), θ > 0, for some known value of λ

Then the likelihood is

f(x1, . . . , xn | θ) = θ^n e^(−θ ∑ xi),

where ∑ xi denotes the sum over i = 1, . . . , n. So the posterior distribution is

p(θ | x) ∝ π(θ) f(x | θ) = λ e^(−λθ) θ^n e^(−θ ∑ xi) ∝ θ^n e^(−θ (λ + ∑ xi)),

which is the Gamma(n + 1, λ + ∑ xi) density.
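A short check of this conjugate update (data simulated from a hypothetical rate; note that scipy parametrizes the Gamma by shape and scale = 1/rate):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    theta_true = 0.5   # hypothetical true rate, used only to simulate data
    lam = 1.0          # prior: Exponential(lam), i.e. Gamma(1, rate = lam)
    x = rng.exponential(scale=1 / theta_true, size=20)

    # Conjugate update: posterior is Gamma(n + 1, lam + sum(x)).
    shape = len(x) + 1
    rate = lam + x.sum()
    posterior = stats.gamma(a=shape, scale=1 / rate)

    print(posterior.mean())   # (n + 1) / (lam + sum(x))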
History
‘inverse probability’
prior distribution intended to represent initial ignorance → used systematically in statistical analysis by Gauss and especially Laplace (circa 1800)

the approach was criticized during the 19th century

in the middle of the 20th century attention shifted to a personalistic view of probability – individual belief as expressed by individual choice in a decision-making context
Prior distribution
The controversial aspects concern the prior. What does it mean?

What is the prior probability that treatment and control have identical effect?

What is the prior probability that the difference between two groups is between 5 and 15 units?

The prior distribution must be specified explicitly, i.e. in effect numerically.
Prior distribution
Broadly three approaches:
1. Summary of other data: empirical Bayes.

2. Prior measures personalistic opinion of investigator about conclusions. Not useful for ‘public’ transmission of knowledge.

3. Objective degree of uncertainty:
   1. Agreed measure of uncertainty.
   2. Ignorance, reference, flat prior for interval estimates. Laplace’s principle of indifference.
Empirical Bayes
‘empirical’ → frequency interpretation implied
e.g. an unknown parameter representing a mean of some measurement – likely to vary under different circumstances
can be represented by a widely dispersed distribution
leading to a posterior distribution with a frequency interpretation
Example: variances of gene expression for different probes on a microarray may be assumed to be a sample from a distribution with a common parameter
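A minimal sketch of the empirical Bayes idea in the normal case (all numbers hypothetical): the hyperparameters of the common distribution are estimated from the data themselves, and each unit’s raw estimate is then shrunk towards the overall mean. The method-of-moments step below is one simple choice among several:

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical setting: unit i has mean mu_i ~ N(m, tau2), observed
    # with known sampling variance s2.
    s2 = 1.0
    xbar = rng.normal(loc=rng.normal(0.0, 2.0, size=50), scale=np.sqrt(s2))

    # 'Empirical' step: estimate hyperparameters from the data themselves.
    m_hat = xbar.mean()
    tau2_hat = max(xbar.var(ddof=1) - s2, 0.0)  # method of moments

    # Posterior mean for each unit shrinks its raw estimate towards m_hat.
    w = tau2_hat / (tau2_hat + s2)
    posterior_means = m_hat + w * (xbar - m_hat)
    print(posterior_means[:5])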
Personalistic prior
reflects the investigator’s subjective beliefs
prior distribution is based on relatively informally recalled experience of a field, for example on data that have been seen only informally
Flat prior
a prior which aims to insert as little new information as possible
for relatively simple problems, limiting forms of the prior often reproduce, approximately or exactly, posterior intervals equivalent to confidence intervals
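A sketch of this for the exponential example: under the improper reference prior π(θ) ∝ 1/θ (an assumption here, one standard ‘ignorance’ choice for a rate) the posterior is Gamma(n, ∑ xi), and its equal-tailed posterior interval reproduces the exact χ²-based confidence interval:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    x = rng.exponential(scale=2.0, size=20)  # hypothetical data, rate 0.5
    n, s = len(x), x.sum()

    # Posterior under pi(theta) ∝ 1/theta: Gamma(n, rate = s).
    lo, hi = stats.gamma(a=n, scale=1 / s).ppf([0.025, 0.975])

    # Exact frequentist 95% CI for the rate, from 2 * theta * s ~ chi2(2n).
    lo_f, hi_f = stats.chi2(df=2 * n).ppf([0.025, 0.975]) / (2 * s)

    print(np.allclose([lo, hi], [lo_f, hi_f]))  # True: intervals coincide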
Priors
Is the prior distribution a positive insertion of evidence? If so, what is its basis and has the consistency of that evidence with the current data been checked?

If flat/ignorance/reference priors have been used, how have they been chosen? Has there been a sensitivity analysis? If the number of parameters over which a prior distribution is defined is appreciable, then the choice of a flat prior distribution could be misleading.

Each of a substantial number of individuals may have been allocated a value of an unknown parameter, the values having a stable frequency distribution across individuals – empirical Bayes.
Posterior distribution
Conclusions can be summarized using, for example:
posterior mean
posterior variance
credible intervals
Credible intervals
a region Cα(x) is a 100(1 − α)% credible region for θ if

∫_{Cα(x)} p(θ | x) dθ = 1 − α

there is posterior probability 1 − α that θ is in Cα(x)

credible interval – special case of a credible region

analogous to (frequentist) confidence intervals, but with a different interpretation
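For the exponential example above, an equal-tailed 95% credible interval follows directly from the quantiles of the Gamma posterior (the data summaries below are hypothetical):

    from scipy import stats

    # Posterior Gamma(n + 1, lam + sum(x)) from the exponential example.
    n, lam, sum_x = 20, 1.0, 38.7   # hypothetical summaries
    posterior = stats.gamma(a=n + 1, scale=1 / (lam + sum_x))

    lo, hi = posterior.ppf([0.025, 0.975])
    print(f"95% credible interval for theta: ({lo:.3f}, {hi:.3f})")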
Hypothesis testing
Frequentist approach to hypothesis testing → compares a null hypothesis H0 with an alternative H1 through a test statistic T that tends to be larger under H1 than under H0, and rejects H0 for small p-values p = P_{H0}(T ≥ t_obs), where t_obs is the value of T actually observed and the probability is computed as if H0 were true

Bayesian approach → attaches prior probabilities to models corresponding to H0 and H1 and compares their posterior probabilities using the Bayes factor

B10 = P(x | H1) / P(x | H0)
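A sketch of a Bayes factor computation for a simple binomial problem (the data, the point null and the uniform prior under H1 are all illustrative assumptions):

    from scipy import stats
    from scipy.integrate import quad

    n, k = 20, 15   # hypothetical data: k successes in n trials

    # H0: p = 0.5 (point null); H1: p ~ Uniform(0, 1).
    marginal_h0 = stats.binom.pmf(k, n, 0.5)
    marginal_h1, _ = quad(lambda p: stats.binom.pmf(k, n, p), 0, 1)

    b10 = marginal_h1 / marginal_h0
    print(f"Bayes factor B10 = {b10:.2f}")  # evidence for H1 over H0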
Computation
conjugate prior → when the prior and the posterior are from the same family of distributions (for example a normal prior with a normal likelihood)

makes calculations easier

however, often unrealistic, so posterior distributions need to be evaluated numerically
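A sketch of the normal–normal conjugate case mentioned above, with known data variance (all numbers hypothetical):

    import numpy as np

    # Prior N(m0, t02) on the mean mu; data x_i ~ N(mu, s2), s2 known.
    m0, t02, s2 = 0.0, 4.0, 1.0
    x = np.array([1.2, 0.8, 1.5, 1.1])

    n = len(x)
    precision = 1 / t02 + n / s2                   # posterior precision
    m_post = (m0 / t02 + x.sum() / s2) / precision
    v_post = 1 / precision
    print(m_post, v_post)  # posterior is N(m_post, v_post)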
Markov chain Monte Carlo (MCMC)
Markov chain Monte Carlo (MCMC): a stochastic simulation technique used for computing inferential quantities which cannot be obtained analytically

MCMC simulates a discrete-time Markov chain

it produces a dependent sequence of random variables {θ(1), . . . , θ(M)} whose approximate distribution is the posterior distribution of interest

MCMC is an iterative procedure: given the current state of the chain, θ(i), the algorithm makes a probabilistic update to θ(i+1)

Markov chains can automatically be constructed to match any posterior density
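A minimal random-walk Metropolis sketch (one instance of the Metropolis–Hastings algorithm named on the next slide), targeting the unnormalized Gamma posterior of the earlier exponential example; the data summaries and tuning constants are hypothetical:

    import numpy as np

    rng = np.random.default_rng(3)

    n, lam, S = 20, 1.0, 38.7   # hypothetical summaries: n, prior rate, sum(x)

    def log_post(theta):
        # log of the unnormalized posterior theta^n * exp(-theta (lam + S))
        return -np.inf if theta <= 0 else n * np.log(theta) - theta * (lam + S)

    theta, draws = 0.5, []
    for _ in range(10000):
        proposal = theta + rng.normal(scale=0.1)   # symmetric proposal
        if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
            theta = proposal                       # accept; otherwise keep
        draws.append(theta)

    print(np.mean(draws[2000:]))  # ≈ (n + 1) / (lam + S) after burn-in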
MCMC
Two of the most general procedures for MCMC simulation from a targetdistribution:
• Metropolis–Hastings algorithm
• Gibbs sampler
Software:
• WinBUGS – a Windows version of BUGS (Bayesian inference Using Gibbs Sampling)

• CODA: a collection of convergence diagnostics and sample output analysis programs
• JAGS (Just Another Gibbs Sampler)
MCMC – priors
In practice MCMC is mostly used with flat priors.

A prior that is flat for θ is not the same as one that is flat for, e.g., log(θ).

For models with fairly few parameters and reasonable data, this gives intervals with approximately the stated confidence level.

For a large number of parameters it may give a very bad answer. No general theory is known.
Discussion
Bayesian inference → based on Bayes’ theorem
differences in interpretation between Bayesian and frequentistinference
choice of prior controversial
computation usually done numerically; MCMC useful but to be usedwith caution
Further reading
Cox, D. R. (2006). Frequentist and Bayesian Statistics: A Critique (Keynote Address). In Statistical Problems in Particle Physics, Astrophysics and Cosmology (Vol. 1, p. 3).

Cox, D. R. and Donnelly, C. A. (2011). Principles of Applied Statistics. Cambridge University Press.

Davison, A. C. (2003). Statistical Models (Vol. 11). Cambridge University Press.

Gelman, A., Carlin, J. B., Stern, H. S. and Rubin, D. B. (2003). Bayesian Data Analysis. Chapman and Hall/CRC Texts in Statistical Science.

WinBUGS: http://www.mrc-bsu.cam.ac.uk/software/bugs/