9/1/2009
Probability Distributions
CPS 271
Ron Parr
Some Figures courtesy Andrew Ng and Chris Bishop and © original authors.
Thanks to Lise Getoor for some slides
Fitting Models to Data
• Suppose we have a space of possible hypotheses H
• Which hypothesis has the highest posterior:
• P(D) does not depend on H; maximize numerator
• With a uniform prior P(H), the maximizing hypothesis is called the Maximum Likelihood solution
(the model under which the data has highest probability)
• P(H) can be used for regularization
P(H|D) = P(D|H) P(H) / P(D)
Bernoulli Distribution
• Let x = I(heads). What is P(x=1)?
• P(x=1) = µ
• E(x)= µ
• Var(x)= µ(1− µ)
• Empirical mean = Sample mean = maximum likelihood estimate µML
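The empirical-mean estimate can be sketched in a few lines; the flip data below is hypothetical, not from the slides:

```python
def bernoulli_ml(flips):
    """ML estimate for a Bernoulli parameter:
    mu_ML = (number of heads) / (number of flips)."""
    return sum(flips) / len(flips)

flips = [1, 0, 1, 1, 0, 1, 0, 1]   # hypothetical outcomes, 1 = heads
mu_ml = bernoulli_ml(flips)        # 5/8 = 0.625
```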
Is The Empirical Mean Reasonable?
• ML solution is presented as frequentist solution
• We know:
– E(µML)= µ
– µML converges to µ
• What about small numbers of samples?
Binomial Distribution
• Probability of getting m heads in N flips?
• Add up different ways this can happen
Bin(m|N,µ) = C(N,m) µ^m (1−µ)^(N−m)
E(m) = Nµ
Var(m) = Nµ(1−µ)
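A minimal sketch of the binomial pmf and its moments, using the standard-library `math.comb` for the C(N,m) coefficient:

```python
import math

def binom_pmf(m, N, mu):
    """Bin(m | N, mu) = C(N, m) * mu^m * (1 - mu)^(N - m)."""
    return math.comb(N, m) * mu**m * (1 - mu)**(N - m)

def binom_mean(N, mu):
    return N * mu            # E[m] = N*mu

def binom_var(N, mu):
    return N * mu * (1 - mu) # Var[m] = N*mu*(1-mu)

# Sanity check: the pmf sums to 1 over m = 0..N
total = sum(binom_pmf(m, 10, 0.3) for m in range(11))
```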
Conjugate Priors
• We know µML maximizes P(D|H)
• For small data sets, this seems unreliable
• Can we maximize P(H|D)=P(D|H)P(H)/P(D)?
• Questions:
– What form should P(H) take?
– If P(D|H) is in some class (binomial, Bernoulli), we want P(D|H)P(H) = P(H,D) to generate answers that are also in this class
• In general, if P(D|H)P(H) is in the same class as P(H), we say that P(H) is conjugate for P(D|H)
Background: Gamma Function
• For discrete (integer) x: Γ(x+1) = x!
• For continuous x, the continuous generalization of the factorial:
Γ(x) = ∫₀^∞ u^(x−1) e^(−u) du
Γ(x+1) = x Γ(x)
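Python's standard library exposes the continuous Gamma function as `math.gamma`; a quick check of the two identities above:

```python
import math

# Gamma(x+1) = x! for small integers
for x in range(1, 6):
    assert math.isclose(math.gamma(x + 1), math.factorial(x))

# Recurrence Gamma(x+1) = x * Gamma(x) at a non-integer point
assert math.isclose(math.gamma(4.5), 3.5 * math.gamma(3.5))
```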
Beta Distribution
Beta(µ|a,b) = Γ(a+b) / (Γ(a) Γ(b)) · µ^(a−1) (1−µ)^(b−1)
E(µ) = a / (a+b)
var(µ) = ab / ((a+b)² (a+b+1))
Observation: Beta has very similar form to binomial
Bin(m|N,µ) = C(N,m) µ^m (1−µ)^(N−m)
Posterior with Beta Prior
• Ultimately want: P(µ|m,N,a,b)
• P(D|H) = Binomial: Bin(m|N,µ) = C(N,m) µ^m (1−µ)^(N−m)
• P(H) = Beta: Beta(µ|a,b) = Γ(a+b) / (Γ(a) Γ(b)) · µ^(a−1) (1−µ)^(b−1)
• Bayes Rule: P(H|D) = P(D|H) P(H) / P(D), so
P(µ|m,N,a,b) = P(m|µ,N) P(µ|a,b) / P(m|N,a,b)
Posterior with Beta Prior
• P(D|H) = Binomial
• P(H)=Beta
Bin(m|N,µ) = C(N,m) µ^m (1−µ)^(N−m)
Beta(µ|a,b) = Γ(a+b) / (Γ(a) Γ(b)) · µ^(a−1) (1−µ)^(b−1)

P(D|H)P(H) ∝ µ^m (1−µ)^(N−m) · µ^(a−1) (1−µ)^(b−1) = µ^(m+a−1) (1−µ)^(N−m+b−1)

P(H|D) = Γ(N+a+b) / (Γ(m+a) Γ(N−m+b)) · µ^(m+a−1) (1−µ)^(N−m+b−1) = Beta(µ | m+a, N−m+b)
(Ignore the part that doesn’t depend upon µ. Why can we get away with this?)
Interpreting the Beta Prior
• The resulting posterior, Beta(µ | m+a, N−m+b), has mean (m+a)/(N+a+b)
• A beta prior with parameters a,b is like having “imagined” a previous heads, b previous tails
• Examples:
– a=b=1000 implies strong prior towards fairness
– a=b=1 implies weak prior towards fairness
– a=1000, b=1 implies strong prior towards heads bias
– a=1, b=1000 implies strong prior towards tails bias
P(H|D) = Γ(N+a+b) / (Γ(m+a) Γ(N−m+b)) · µ^(m+a−1) (1−µ)^(N−m+b−1) = Beta(µ | m+a, N−m+b)
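The conjugate update above reduces to adding pseudo-counts; a minimal sketch of the pseudo-count view (the 9-heads-in-10-flips data is an illustrative choice, not from the slides):

```python
def beta_update(a, b, m, N):
    """Beta(a, b) prior + m heads in N flips -> Beta(a+m, b+N-m) posterior."""
    return a + m, b + (N - m)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Same data (9 heads in 10 flips), two different priors:
a1, b1 = beta_update(1, 1, 9, 10)        # weak prior -> posterior mean ~0.83
a2, b2 = beta_update(1000, 1000, 9, 10)  # strong fairness prior -> mean ~0.50
```

With a weak prior the data dominates; with a=b=1000 the "imagined" flips swamp the 10 real ones.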
Multinomial
• Multinomial generalizes binomial to >2 outcomes
• Dirichlet is conjugate
• α parameters correspond to phantom observations
Mult(m_1, …, m_K | µ, N) = N! / (m_1! ⋯ m_K!) · ∏_{k=1..K} µ_k^(m_k)

Dir(µ | α) = Γ(α_0) / (Γ(α_1) ⋯ Γ(α_K)) · ∏_{k=1..K} µ_k^(α_k − 1),  where α_0 = ∑_{k=1..K} α_k
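The multinomial pmf above can be sketched with exact integer arithmetic for the coefficient (the 3-sided-die example is hypothetical):

```python
import math

def multinomial_pmf(ms, mus):
    """Mult(m_1..m_K | mu, N) = N!/(m_1!...m_K!) * prod_k mu_k^(m_k)."""
    N = sum(ms)
    coef = math.factorial(N)
    for m in ms:
        coef //= math.factorial(m)   # multinomial coefficient is an integer
    p = float(coef)
    for m, mu in zip(ms, mus):
        p *= mu ** m
    return p

# Fair 3-sided die, 3 rolls, one of each face: 3!/(1!1!1!) * (1/3)^3 = 6/27
p = multinomial_pmf([1, 1, 1], [1/3, 1/3, 1/3])
```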
Multivariate Gaussian Distribution
• Also called multivariate normal
• First, recall the univariate Gaussian distribution:
p(x; µ, σ²) = (2πσ²)^(−1/2) exp( −(x−µ)² / (2σ²) )
• where µ is the mean and σ² is the variance
[Plot: two univariate Gaussian densities, mynorm(x, 3, 2) and mynorm(x, −2, 5)]
Multivariate Gaussian Distribution
• A 2-dimensional Gaussian is defined by a mean vector µ = (µ1,µ2) and
a covariance matrix:
Σ = [ σ²_{1,1}  σ²_{1,2} ; σ²_{2,1}  σ²_{2,2} ]
• where σ²_{i,j} = E[(x_i − µ_i)(x_j − µ_j)]
– is the variance if i = j
– the covariance if i ≠ j
p(x; µ, Σ) = 1 / (2π |Σ|^(1/2)) · exp( −(1/2) (x−µ)^T Σ^(−1) (x−µ) )
Standard normal distribution
• We get the standard normal for Σ = [1 0; 0 1] (the identity matrix) and µ = (0, 0)
[Surface plot of the standard normal density]
MVG examples
[Surface plot] µ = (1, 0), Σ = [1 0; 0 1]
MVG examples
[Surface plot] µ = (-0.5, 0), Σ = [1 0; 0 1]
MVG examples
[Surface plot] µ = (-1, -1.5), Σ = [1 0; 0 1]
MVG examples
[Surface plot] µ = (0, 0), Σ = [0.6 0; 0 0.6]
MVG examples
[Surface plot] µ = (0, 0), Σ = [2 0; 0 2]
MVG examples
[Surface plot] µ = (0, 0), Σ = [1 0.5; 0.5 1]
MVG examples
[Surface plot] µ = (0, 0), Σ = [1 0.8; 0.8 1]
MVG examples – contour plots
[Contour plot] µ = (0, 0), Σ = [1 0; 0 1]
MVG examples
[Contour plot] µ = (0, 0), Σ = [1 0.5; 0.5 1]
MVG examples
[Contour plot] µ = (0, 0), Σ = [1 0.8; 0.8 1]
Multivariate normal distribution
• We can generalize this to n dimensions
• Parameters:
– mean vector µ ∈ ℝ^n
– covariance matrix Σ ∈ ℝ^(n×n), symmetric and positive semi-definite (Σ ⪰ 0)
• Written N(µ, Σ); density is
p(x; µ, Σ) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −(1/2) (x−µ)^T Σ^(−1) (x−µ) )
• where |Σ| is the determinant of the matrix Σ
• For X ∼ N(µ, Σ):
– E[X] = ∫ x p(x; µ, Σ) dx = µ
– Cov(X) = E[XX^T] − (E[X])(E[X])^T = Σ
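For the 2-D case, the density can be written out explicitly with the 2×2 determinant and inverse; a minimal sketch with plain lists (no linear-algebra library, function name `mvg2_pdf` is my own):

```python
import math

def mvg2_pdf(x, mu, Sigma):
    """Density of a 2-D Gaussian N(mu, Sigma), with the 2x2
    determinant and inverse computed by hand."""
    (s11, s12), (s21, s22) = Sigma
    det = s11 * s22 - s12 * s21
    inv = [[ s22 / det, -s12 / det],
           [-s21 / det,  s11 / det]]
    d = [x[0] - mu[0], x[1] - mu[1]]
    # quadratic form (x-mu)^T Sigma^{-1} (x-mu)
    quad = (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
            + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

# Standard normal at the origin: 1/(2*pi)
p0 = mvg2_pdf((0, 0), (0, 0), [[1, 0], [0, 1]])
```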
A note about covariances
• By construction, the covariance matrix is
– Symmetric
– Positive semi-definite
• Diagonal covariance matrices:
– Can be expressed as a product of I and a vector of
variances
– Imply independence between variables
Useful Properties of Gaussians I
• Surfaces of equal probability for standard (mean 0, I
covariance) Gaussians are spheroids
• Surfaces of equal probability for general Gaussians are
ellipsoids
• Every general Gaussian can be viewed as a standard
Gaussian that has undergone an affine transformation
Useful Properties of Gaussians II
• A Gaussian distribution is completely specified by a vector of means and a covariance matrix
• Requires O(n²) space
• Requires O(n³) time to manipulate
• If these seem bad, recall that a joint distribution
over n binary variables requires O(2^n) space
Useful Properties of Gaussians III
• Marginals of Gaussians are Gaussian
• Given: x = (x_a, x_b),  µ = (µ_a, µ_b),  Σ = [Σ_aa Σ_ab; Σ_ba Σ_bb]
• Marginal Distribution: p(x_a) = N(x_a | µ_a, Σ_aa)
• (Marginalize by ignoring)
Useful Properties of Gaussians IV
• Conditionals of Gaussians are Gaussian
• Notation: Λ = Σ^(−1) = [Λ_aa Λ_ab; Λ_ba Λ_bb]
• Conditional Distribution:
p(x_a | x_b) = N(x_a | µ_{a|b}, Λ_aa^(−1))
µ_{a|b} = µ_a − Λ_aa^(−1) Λ_ab (x_b − µ_b)
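For scalar blocks x_a, x_b the conditional can be sketched in the equivalent covariance form (mean µ_a + Σ_ab/Σ_bb · (x_b − µ_b), variance Σ_aa − Σ_ab²/Σ_bb = Λ_aa^(−1)); the ρ=0.8 numbers below are an illustrative choice:

```python
def gaussian_conditional(mu_a, mu_b, s_aa, s_ab, s_bb, x_b):
    """p(x_a | x_b) for a 2-D Gaussian with scalar blocks,
    written in covariance form (equivalent to the precision form)."""
    mean = mu_a + s_ab / s_bb * (x_b - mu_b)
    var = s_aa - s_ab ** 2 / s_bb
    return mean, var

# Correlated case: observing x_b = 2 pulls the mean of x_a toward it
m, v = gaussian_conditional(0.0, 0.0, 1.0, 0.8, 1.0, 2.0)  # mean 1.6, var 0.36
```

Note that when Σ_ab = 0 the conditional collapses to the marginal N(µ_a, Σ_aa), i.e. the variables are independent.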
Visualizing Marginalization & Conditioning
9/1/2009
9
Useful Properties of Gaussians V
• Affine transformations of Gaussian variables
are Gaussian
– Suppose x is Gaussian
– y = Ax + b is Gaussian: N(Aµ + b, A Σ A^T)
• Mixtures of Gaussians are…
– Mixtures of Gaussians
– How is a mixture different from a linear
combination?
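The transformed parameters Aµ + b and A Σ A^T can be computed directly; a sketch for the 2-D case with plain lists (the particular A, b, µ, Σ are hypothetical):

```python
def affine_params(A, b, mu, Sigma):
    """If x ~ N(mu, Sigma), return the parameters of y = A x + b,
    namely (A mu + b, A Sigma A^T), for 2x2 matrices."""
    new_mu = [A[i][0] * mu[0] + A[i][1] * mu[1] + b[i] for i in range(2)]
    # AS = A @ Sigma
    AS = [[sum(A[i][k] * Sigma[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    # new_S = (A @ Sigma) @ A^T
    new_S = [[sum(AS[i][k] * A[j][k] for k in range(2)) for j in range(2)]
             for i in range(2)]
    return new_mu, new_S

# Stretch x_1 by 2 and shift: standard normal -> N((1, -1), diag(4, 1))
mu2, S2 = affine_params([[2, 0], [0, 1]], [1, -1], [0, 0], [[1, 0], [0, 1]])
```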
Useful Properties of Gaussians
• Lots of things can (arguably) be approximated well by Gaussians
• The central limit theorem: The sum of IID variables with finite variances will tend towards a Gaussian distribution
• Note: This is often used as a hand-waving argument to justify using the Gaussian distribution for almost anything
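The central limit theorem is easy to see empirically; a small seeded simulation (Uniform(0,1) draws, which have mean 1/2 and variance 1/12) whose sample means cluster around 1/2 with standard deviation roughly sqrt((1/12)/n):

```python
import random
import statistics

random.seed(0)
n, trials = 30, 2000

# Each entry is the mean of n IID Uniform(0,1) draws
means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(trials)]

avg = statistics.fmean(means)      # ~0.5
spread = statistics.pstdev(means)  # ~sqrt((1/12)/30)
```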
Mixtures of Gaussians
• Want to approximate distribution that is not unimodal?
• Density is weighted combination of Gaussians
• Idea: Flip coin (roll dice) to select Gaussian, then
sample from the Gaussian
• Can be arbitrarily expressive with enough Gaussians
p(x) = ∑_{k=1..K} π_k N(x | µ_k, Σ_k),  with ∑_{k=1..K} π_k = 1
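The flip-a-coin sampling idea can be sketched directly in 1-D (the bimodal two-component example is hypothetical):

```python
import random

def sample_mixture(pis, mus, sigmas, rng):
    """Sample from a 1-D mixture of Gaussians: first pick a component k
    with probability pi_k, then draw from N(mu_k, sigma_k^2)."""
    k = rng.choices(range(len(pis)), weights=pis)[0]
    return rng.gauss(mus[k], sigmas[k])

rng = random.Random(0)
# Bimodal example: two well-separated components, equal weights
xs = [sample_mixture([0.5, 0.5], [-5.0, 5.0], [1.0, 1.0], rng)
      for _ in range(1000)]
frac_left = sum(x < 0 for x in xs) / len(xs)   # roughly 0.5
```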
Mixture of Gaussians Example
Limitations of Gaussians
• Gaussians are unimodal (single peak at mean)
• O(n²) and O(n³) can get expensive
• Definite integrals of Gaussian distributions do not have a closed form solution (somewhat inconvenient)
– Must approximate, use lookup tables, etc.
– Sampling from Gaussian is somewhat inelegant
Fitting Gaussians
• Maximum Likelihood
• Mean:
• Covariance:
µ_ML = (1/N) ∑_{n=1..N} x_n

Σ_ML = (1/N) ∑_{n=1..N} (x_n − µ_ML)(x_n − µ_ML)^T
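In 1-D the two ML formulas reduce to the sample mean and the biased (1/N) sample variance; a minimal sketch with made-up data:

```python
def fit_gaussian_ml(xs):
    """ML fit of a 1-D Gaussian: sample mean and 1/N variance,
    matching the mu_ML and Sigma_ML formulas above."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n   # note 1/N, not 1/(N-1)
    return mu, var

mu, var = fit_gaussian_ml([1.0, 2.0, 3.0, 4.0])  # mu = 2.5, var = 1.25
```

The 1/N estimator is the maximum-likelihood one; it is biased low compared to the 1/(N−1) sample variance.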
Bayesian Fits with Known Variance
• Can use a Gaussian prior:
• Posterior:
p(µ) = N(µ | µ_0, σ_0²)

p(µ|X) = N(µ | µ_N, σ_N²), where

µ_N = σ² / (N σ_0² + σ²) · µ_0 + N σ_0² / (N σ_0² + σ²) · µ_ML

1/σ_N² = 1/σ_0² + N/σ²
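These two update formulas can be sketched directly (the prior and data values below are illustrative):

```python
def posterior_mean_var(mu0, var0, var, xs):
    """Posterior over the mean of a Gaussian with known variance `var`
    and prior N(mu0, var0), following the formulas above:
      mu_N  = var/(N*var0 + var) * mu0 + N*var0/(N*var0 + var) * mu_ML
      1/var_N = 1/var0 + N/var
    """
    n = len(xs)
    mu_ml = sum(xs) / n
    mu_n = (var * mu0 + n * var0 * mu_ml) / (n * var0 + var)
    var_n = 1.0 / (1.0 / var0 + n / var)
    return mu_n, var_n

# With a vague prior (huge var0), the posterior mean approaches mu_ML:
mu_n, var_n = posterior_mean_var(0.0, 1e6, 1.0, [2.0, 2.0, 2.0])
```

Note the two limits: as var0 → ∞ the posterior mean tends to µ_ML, and as var0 → 0 it stays pinned at µ_0.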
Bayesian Fit with Unknown Variance, Known Mean
• For a single variable, the gamma distribution is conjugate (as a prior on the precision 1/σ²)
• For multiple variables, Wishart is conjugate
• For unknown mean and variance jointly, the conjugate prior is the normal-gamma (normal-Wishart for multiple variables)