9/1/2009
Probability Distributions
CPS 271
Ron Parr
Some Figures courtesy Andrew Ng and Chris Bishop and © original authors.
Thanks to Lise Getoor for some slides
Fitting Models to Data
• Suppose we have a space of possible hypotheses H
• Which hypothesis has the highest posterior:
• P(D) does not depend on H; maximize numerator
• With a uniform prior P(H), the maximizing hypothesis is called the Maximum Likelihood solution
(the model under which the data has highest probability)
• P(H) can be used for regularization
P(H|D) = P(D|H) P(H) / P(D)
Bernoulli Distribution
• Let x = I(heads). What is P(x=1)?
• P(x=1) = µ
• E(x)= µ
• Var(x)= µ(1− µ)
• Empirical mean = Sample mean = maximum likelihood estimate µML
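The empirical-mean estimate can be sketched in a few lines; the flip data below is hypothetical, not from the slides:

```python
def bernoulli_ml(flips):
    """ML estimate for a Bernoulli parameter:
    mu_ML = (number of heads) / (number of flips)."""
    return sum(flips) / len(flips)

flips = [1, 0, 1, 1, 0, 1, 0, 1]   # hypothetical outcomes, 1 = heads
mu_ml = bernoulli_ml(flips)        # 5/8 = 0.625
```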
Is The Empirical Mean Reasonable?
• ML solution is presented as frequentist solution
• We know:
– E(µML)= µ
– µML converges to µ
• What about small numbers of samples?
Binomial Distribution
• Probability of getting m heads in N flips?
• Add up different ways this can happen
Bin(m|N,µ) = C(N,m) µ^m (1−µ)^(N−m)
E(m) = Nµ
Var(m) = Nµ(1−µ)
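A minimal sketch of the binomial pmf and its moments, using the standard-library `math.comb` for the C(N,m) coefficient:

```python
import math

def binom_pmf(m, N, mu):
    """Bin(m | N, mu) = C(N, m) * mu^m * (1 - mu)^(N - m)."""
    return math.comb(N, m) * mu**m * (1 - mu)**(N - m)

def binom_mean(N, mu):
    return N * mu            # E[m] = N*mu

def binom_var(N, mu):
    return N * mu * (1 - mu) # Var[m] = N*mu*(1-mu)

# Sanity check: the pmf sums to 1 over m = 0..N
total = sum(binom_pmf(m, 10, 0.3) for m in range(11))
```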
Conjugate Priors
• We know µML maximizes P(D|H)
• For small data sets, this seems unreliable
• Can we maximize P(H|D)=P(D|H)P(H)/P(D)?
• Questions:
– What form should P(H) take?
– If P(D|H) is in some class (binomial, Bernoulli), we want P(D|H)P(H) = P(H,D) to generate answers that are also in this class
• In general, if P(D|H)P(H) is in the same class as P(H), we say that P(H) is conjugate for P(D|H)
Background: Gamma Function
• For discrete (integer) x: Γ(x+1) = x!
• For continuous x, the continuous generalization of the factorial:
Γ(x) = ∫₀^∞ u^(x−1) e^(−u) du
Γ(x+1) = x Γ(x)
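Python's standard library exposes the continuous Gamma function as `math.gamma`; a quick check of the two identities above:

```python
import math

# Gamma(x+1) = x! for small integers
for x in range(1, 6):
    assert math.isclose(math.gamma(x + 1), math.factorial(x))

# Recurrence Gamma(x+1) = x * Gamma(x) at a non-integer point
assert math.isclose(math.gamma(4.5), 3.5 * math.gamma(3.5))
```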
Beta Distribution
Beta(µ|a,b) = Γ(a+b) / (Γ(a) Γ(b)) · µ^(a−1) (1−µ)^(b−1)
E(µ) = a / (a+b)
var(µ) = ab / ((a+b)² (a+b+1))
Observation: Beta has very similar form to binomial
Bin(m|N,µ) = C(N,m) µ^m (1−µ)^(N−m)
Posterior with Beta Prior
• Ultimately want: P(µ|m,N,a,b)
• P(D|H) = Binomial: Bin(m|N,µ) = C(N,m) µ^m (1−µ)^(N−m)
• P(H) = Beta: Beta(µ|a,b) = Γ(a+b) / (Γ(a) Γ(b)) · µ^(a−1) (1−µ)^(b−1)
• Bayes Rule: P(H|D) = P(D|H) P(H) / P(D), so
P(µ|m,N,a,b) = P(m|µ,N) P(µ|a,b) / P(m|N,a,b)
Posterior with Beta Prior
• P(D|H) = Binomial
• P(H)=Beta
Bin(m|N,µ) = C(N,m) µ^m (1−µ)^(N−m)
Beta(µ|a,b) = Γ(a+b) / (Γ(a) Γ(b)) · µ^(a−1) (1−µ)^(b−1)

P(D|H)P(H) ∝ µ^m (1−µ)^(N−m) · µ^(a−1) (1−µ)^(b−1) = µ^(m+a−1) (1−µ)^(N−m+b−1)

P(H|D) = Γ(N+a+b) / (Γ(m+a) Γ(N−m+b)) · µ^(m+a−1) (1−µ)^(N−m+b−1) = Beta(µ | m+a, N−m+b)
(Ignore the part that doesn’t depend upon µ. Why can we get away with this?)
Interpreting the Beta Prior
• The resulting posterior, Beta(µ | m+a, N−m+b), has mean (m+a)/(N+a+b)
• A beta prior with parameters a,b is like having “imagined” a previous heads, b previous tails
• Examples:
– a=b=1000 implies strong prior towards fairness
– a=b=1 implies weak prior towards fairness
– a=1000, b=1 implies strong prior towards heads bias
– a=1, b=1000 implies strong prior towards tails bias
P(H|D) = Γ(N+a+b) / (Γ(m+a) Γ(N−m+b)) · µ^(m+a−1) (1−µ)^(N−m+b−1) = Beta(µ | m+a, N−m+b)
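The conjugate update above reduces to adding pseudo-counts; a minimal sketch of the pseudo-count view (the 9-heads-in-10-flips data is an illustrative choice, not from the slides):

```python
def beta_update(a, b, m, N):
    """Beta(a, b) prior + m heads in N flips -> Beta(a+m, b+N-m) posterior."""
    return a + m, b + (N - m)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Same data (9 heads in 10 flips), two different priors:
a1, b1 = beta_update(1, 1, 9, 10)        # weak prior -> posterior mean ~0.83
a2, b2 = beta_update(1000, 1000, 9, 10)  # strong fairness prior -> mean ~0.50
```

With a weak prior the data dominates; with a=b=1000 the "imagined" flips swamp the 10 real ones.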
Multinomial
• Multinomial generalizes binomial to >2 outcomes
• Dirichlet is conjugate
• α parameters correspond to phantom observations
Mult(m_1, …, m_K | µ, N) = N! / (m_1! ⋯ m_K!) · ∏_{k=1..K} µ_k^(m_k)

Dir(µ | α) = Γ(α_0) / (Γ(α_1) ⋯ Γ(α_K)) · ∏_{k=1..K} µ_k^(α_k − 1),  where α_0 = ∑_{k=1..K} α_k
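The multinomial pmf above can be sketched with exact integer arithmetic for the coefficient (the 3-sided-die example is hypothetical):

```python
import math

def multinomial_pmf(ms, mus):
    """Mult(m_1..m_K | mu, N) = N!/(m_1!...m_K!) * prod_k mu_k^(m_k)."""
    N = sum(ms)
    coef = math.factorial(N)
    for m in ms:
        coef //= math.factorial(m)   # multinomial coefficient is an integer
    p = float(coef)
    for m, mu in zip(ms, mus):
        p *= mu ** m
    return p

# Fair 3-sided die, 3 rolls, one of each face: 3!/(1!1!1!) * (1/3)^3 = 6/27
p = multinomial_pmf([1, 1, 1], [1/3, 1/3, 1/3])
```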
Multivariate Gaussian Distribution
• Also called multivariate normal
• First, recall the univariate Gaussian distribution:
p(x; µ, σ²) = (2πσ²)^(−1/2) exp( −(x−µ)² / (2σ²) )
• where µ is the mean and σ² is the variance
[Plot: two univariate Gaussian densities, mynorm(x, 3, 2) and mynorm(x, −2, 5)]
Multivariate Gaussian Distribution
• A 2-dimensional Gaussian is defined by a mean vector µ = (µ1,µ2) and
a covariance matrix:
Σ = [ σ²_{1,1}  σ²_{1,2} ; σ²_{2,1}  σ²_{2,2} ]
• where σ²_{i,j} = E[(x_i − µ_i)(x_j − µ_j)]
– is the variance if i = j
– the covariance if i ≠ j
p(x; µ, Σ) = 1 / (2π |Σ|^(1/2)) · exp( −(1/2) (x−µ)^T Σ^(−1) (x−µ) )
Standard normal distribution
• We get the standard normal for Σ = [1 0; 0 1] (the identity matrix) and µ = (0, 0)
[Surface plot of the standard normal density]
MVG examples
[Surface plot] µ = (1, 0), Σ = [1 0; 0 1]
MVG examples
[Surface plot] µ = (-0.5, 0), Σ = [1 0; 0 1]
MVG examples
[Surface plot] µ = (-1, -1.5), Σ = [1 0; 0 1]
MVG examples
[Surface plot] µ = (0, 0), Σ = [0.6 0; 0 0.6]
MVG examples
[Surface plot] µ = (0, 0), Σ = [2 0; 0 2]
MVG examples
[Surface plot] µ = (0, 0), Σ = [1 0.5; 0.5 1]
MVG examples
[Surface plot] µ = (0, 0), Σ = [1 0.8; 0.8 1]
MVG examples – contour plots
[Contour plot] µ = (0, 0), Σ = [1 0; 0 1]
MVG examples
[Contour plot] µ = (0, 0), Σ = [1 0.5; 0.5 1]
MVG examples
[Contour plot] µ = (0, 0), Σ = [1 0.8; 0.8 1]
Multivariate normal distribution
• We can generalize this to n dimensions
• Parameters:
– mean vector µ ∈ ℝ^n
– covariance matrix Σ ∈ ℝ^(n×n), symmetric and positive semi-definite (Σ ⪰ 0)
• Written N(µ, Σ); density is
p(x; µ, Σ) = 1 / ((2π)^(n/2) |Σ|^(1/2)) · exp( −(1/2) (x−µ)^T Σ^(−1) (x−µ) )
• where |Σ| is the determinant of the matrix Σ
• For X ∼ N(µ, Σ):
– E[X] = ∫ x p(x; µ, Σ) dx = µ
– Cov(X) = E[XX^T] − (E[X])(E[X])^T = Σ
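For the 2-D case, the density can be written out explicitly with the 2×2 determinant and inverse; a minimal sketch with plain lists (no linear-algebra library, function name `mvg2_pdf` is my own):

```python
import math

def mvg2_pdf(x, mu, Sigma):
    """Density of a 2-D Gaussian N(mu, Sigma), with the 2x2
    determinant and inverse computed by hand."""
    (s11, s12), (s21, s22) = Sigma
    det = s11 * s22 - s12 * s21
    inv = [[ s22 / det, -s12 / det],
           [-s21 / det,  s11 / det]]
    d = [x[0] - mu[0], x[1] - mu[1]]
    # quadratic form (x-mu)^T Sigma^{-1} (x-mu)
    quad = (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
            + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

# Standard normal at the origin: 1/(2*pi)
p0 = mvg2_pdf((0, 0), (0, 0), [[1, 0], [0, 1]])
```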
A note about covariances
• By construction, the covariance matrix is
– Symmetric
– Positive semi-definite
• Diagonal covariance matrices:
– Can be expressed as a product of I and a vector of
variances
– Imply independence between variables
Useful Properties of Gaussians I
• Surfaces of equal probability for standard (mean 0, I
covariance) Gaussians are spheroids
• Surfaces of equal probability for general Gaussians are
ellipsoids
• Every general Gaussian can be viewed as a standard
Gaussian that has undergone an affine transformation
Useful Properties of Gaussians II
• A Gaussian distribution is completely specified by a vector of means and a covariance matrix
• Requires O(n²) space
• Requires O(n³) time to manipulate
• If these seem bad, recall that a joint distribution
over n binary variables requires O(2^n) space
Useful Properties of Gaussians III
• Marginals of Gaussians are Gaussian
• Given: x = (x_a, x_b),  µ = (µ_a, µ_b),  Σ = [Σ_aa Σ_ab; Σ_ba Σ_bb]
• Marginal Distribution: p(x_a) = N(x_a | µ_a, Σ_aa)
• (Marginalize by ignoring)
Useful Properties of Gaussians IV
• Conditionals of Gaussians are Gaussian
• Notation: Λ = Σ^(−1) = [Λ_aa Λ_ab; Λ_ba Λ_bb]
• Conditional Distribution:
p(x_a | x_b) = N(x_a | µ_{a|b}, Λ_aa^(−1))
µ_{a|b} = µ_a − Λ_aa^(−1) Λ_ab (x_b − µ_b)
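For scalar blocks x_a, x_b the conditional can be sketched in the equivalent covariance form (mean µ_a + Σ_ab/Σ_bb · (x_b − µ_b), variance Σ_aa − Σ_ab²/Σ_bb = Λ_aa^(−1)); the ρ=0.8 numbers below are an illustrative choice:

```python
def gaussian_conditional(mu_a, mu_b, s_aa, s_ab, s_bb, x_b):
    """p(x_a | x_b) for a 2-D Gaussian with scalar blocks,
    written in covariance form (equivalent to the precision form)."""
    mean = mu_a + s_ab / s_bb * (x_b - mu_b)
    var = s_aa - s_ab ** 2 / s_bb
    return mean, var

# Correlated case: observing x_b = 2 pulls the mean of x_a toward it
m, v = gaussian_conditional(0.0, 0.0, 1.0, 0.8, 1.0, 2.0)  # mean 1.6, var 0.36
```

Note that when Σ_ab = 0 the conditional collapses to the marginal N(µ_a, Σ_aa), i.e. the variables are independent.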
Visualizing Marginalization & Conditioning
9/1/2009
9
Useful Properties of Gaussians V
• Affine transformations of Gaussian variables
are Gaussian
– Suppose x is Gaussian
– y = Ax + b is Gaussian: N(Aµ + b, A Σ A^T)
• Mixtures of Gaussians are…
– Mixtures of Gaussians
– How is a mixture different from a linear
combination?
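The transformed parameters Aµ + b and A Σ A^T can be computed directly; a sketch for the 2-D case with plain lists (the particular A, b, µ, Σ are hypothetical):

```python
def affine_params(A, b, mu, Sigma):
    """If x ~ N(mu, Sigma), return the parameters of y = A x + b,
    namely (A mu + b, A Sigma A^T), for 2x2 matrices."""
    new_mu = [A[i][0] * mu[0] + A[i][1] * mu[1] + b[i] for i in range(2)]
    # AS = A @ Sigma
    AS = [[sum(A[i][k] * Sigma[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    # new_S = (A @ Sigma) @ A^T
    new_S = [[sum(AS[i][k] * A[j][k] for k in range(2)) for j in range(2)]
             for i in range(2)]
    return new_mu, new_S

# Stretch x_1 by 2 and shift: standard normal -> N((1, -1), diag(4, 1))
mu2, S2 = affine_params([[2, 0], [0, 1]], [1, -1], [0, 0], [[1, 0], [0, 1]])
```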
Useful Properties of Gaussians
• Lots of things can (arguably) be approximated well by Gaussians
• The central limit theorem: The sum of IID variables with finite variances will tend towards a Gaussian distribution
• Note: This is often used as a hand-waving argument to justify using the Gaussian distribution for almost anything
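The central limit theorem is easy to see empirically; a small seeded simulation (Uniform(0,1) draws, which have mean 1/2 and variance 1/12) whose sample means cluster around 1/2 with standard deviation roughly sqrt((1/12)/n):

```python
import random
import statistics

random.seed(0)
n, trials = 30, 2000

# Each entry is the mean of n IID Uniform(0,1) draws
means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(trials)]

avg = statistics.fmean(means)      # ~0.5
spread = statistics.pstdev(means)  # ~sqrt((1/12)/30)
```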
Mixtures of Gaussians
• Want to approximate distribution that is not unimodal?
• Density is weighted combination of Gaussians
• Idea: Flip coin (roll dice) to select Gaussian, then
sample from the Gaussian
• Can be arbitrarily expressive with enough Gaussians
p(x) = ∑_{k=1..K} π_k N(x | µ_k, Σ_k),  with ∑_{k=1..K} π_k = 1
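The flip-a-coin sampling idea can be sketched directly in 1-D (the bimodal two-component example is hypothetical):

```python
import random

def sample_mixture(pis, mus, sigmas, rng):
    """Sample from a 1-D mixture of Gaussians: first pick a component k
    with probability pi_k, then draw from N(mu_k, sigma_k^2)."""
    k = rng.choices(range(len(pis)), weights=pis)[0]
    return rng.gauss(mus[k], sigmas[k])

rng = random.Random(0)
# Bimodal example: two well-separated components, equal weights
xs = [sample_mixture([0.5, 0.5], [-5.0, 5.0], [1.0, 1.0], rng)
      for _ in range(1000)]
frac_left = sum(x < 0 for x in xs) / len(xs)   # roughly 0.5
```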
Mixture of Gaussians Example
Limitations of Gaussians
• Gaussians are unimodal (single peak at mean)
• O(n²) and O(n³) can get expensive
• Definite integrals of Gaussian distributions do not have a closed form solution (somewhat inconvenient)
– Must approximate, use lookup tables, etc.
– Sampling from Gaussian is somewhat inelegant
Fitting Gaussians
• Maximum Likelihood
• Mean:
• Covariance:
µ_ML = (1/N) ∑_{n=1..N} x_n

Σ_ML = (1/N) ∑_{n=1..N} (x_n − µ_ML)(x_n − µ_ML)^T
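In 1-D the two ML formulas reduce to the sample mean and the biased (1/N) sample variance; a minimal sketch with made-up data:

```python
def fit_gaussian_ml(xs):
    """ML fit of a 1-D Gaussian: sample mean and 1/N variance,
    matching the mu_ML and Sigma_ML formulas above."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n   # note 1/N, not 1/(N-1)
    return mu, var

mu, var = fit_gaussian_ml([1.0, 2.0, 3.0, 4.0])  # mu = 2.5, var = 1.25
```

The 1/N estimator is the maximum-likelihood one; it is biased low compared to the 1/(N−1) sample variance.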
Bayesian Fits with Known Variance
• Can use a Gaussian prior:
• Posterior:
p(µ) = N(µ | µ_0, σ_0²)

p(µ|X) = N(µ | µ_N, σ_N²), where

µ_N = σ² / (N σ_0² + σ²) · µ_0 + N σ_0² / (N σ_0² + σ²) · µ_ML

1/σ_N² = 1/σ_0² + N/σ²
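These two update formulas can be sketched directly (the prior and data values below are illustrative):

```python
def posterior_mean_var(mu0, var0, var, xs):
    """Posterior over the mean of a Gaussian with known variance `var`
    and prior N(mu0, var0), following the formulas above:
      mu_N  = var/(N*var0 + var) * mu0 + N*var0/(N*var0 + var) * mu_ML
      1/var_N = 1/var0 + N/var
    """
    n = len(xs)
    mu_ml = sum(xs) / n
    mu_n = (var * mu0 + n * var0 * mu_ml) / (n * var0 + var)
    var_n = 1.0 / (1.0 / var0 + n / var)
    return mu_n, var_n

# With a vague prior (huge var0), the posterior mean approaches mu_ML:
mu_n, var_n = posterior_mean_var(0.0, 1e6, 1.0, [2.0, 2.0, 2.0])
```

Note the two limits: as var0 → ∞ the posterior mean tends to µ_ML, and as var0 → 0 it stays pinned at µ_0.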
Bayesian Fit with Unknown Variance, Known Mean
• For a single variable, the gamma distribution is conjugate (as a prior on the precision 1/σ²)
• For multiple variables, Wishart is conjugate
• For unknown mean and variance jointly, the conjugate prior is the normal-gamma (normal-Wishart for multiple variables)