
Page 1: Probability distributions for ml

Outline: Introduction · Binary Variables · Multinomial Variables · The Gaussian Distribution · The Exponential Family · Nonparametric Methods

Probability Distributions for ML

Sung-Yub Kim

Dept of IE, Seoul National University

January 29, 2017

Page 2: Probability distributions for ml

Bishop, C. M. Pattern Recognition and Machine Learning. Information Science and Statistics, Springer, 2006.

Murphy, K. P. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning, MIT Press, 2012.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.

Page 3: Probability distributions for ml

Purpose: Density Estimation

Assumption: data points are independent and identically distributed (i.i.d.).

Parametric vs. Nonparametric: parametric estimation is more interpretable but rests on strong distributional assumptions. Nonparametric estimation also has parameters, but they control model complexity rather than the form of the distribution.

Page 4: Probability distributions for ml

Topics: Bernoulli and Binomial Distributions · MLE of the Bernoulli parameter · The Beta Distribution · Bayesian inference on binary variables · Difference between prior and posterior

Bernoulli Distribution (Ber(θ)): the Bernoulli distribution has a single parameter θ, the success probability of one trial. Its PMF is

Ber(x|\theta) = \theta^{I(x=1)}(1-\theta)^{I(x=0)}

Binomial Distribution (Bin(n, θ)): the binomial distribution has two parameters, the number of trials n and the success probability θ. Its PMF is

Bin(k|n,\theta) = \binom{n}{k}\theta^{k}(1-\theta)^{n-k}

Page 5: Probability distributions for ml

Likelihood of the Data: by the i.i.d. assumption,

p(D|\mu) = \prod_{n=1}^{N} p(x_n|\mu) = \prod_{n=1}^{N} \mu^{x_n}(1-\mu)^{1-x_n} \quad (1)

Log-likelihood of the Data: taking the logarithm,

\ln p(D|\mu) = \sum_{n=1}^{N} \ln p(x_n|\mu) = \sum_{n=1}^{N} \{x_n \ln\mu + (1-x_n)\ln(1-\mu)\} \quad (2)

MLE: since the maximizer is a stationary point of the log-likelihood, we get

\mu_{ML} := \hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n \quad (3)
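As a concrete illustration, here is a minimal Python sketch of eq. (3) on synthetic coin-flip data (all names and values here are illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    mu_true = 0.3
    x = rng.binomial(1, mu_true, size=1000)   # i.i.d. Bernoulli draws

    # MLE of the Bernoulli parameter: the sample mean, eq. (3)
    mu_ml = x.mean()

    # log-likelihood at the MLE, eq. (2)
    loglik = np.sum(x * np.log(mu_ml) + (1 - x) * np.log(1 - mu_ml))
    print(mu_ml, loglik)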

Page 6: Probability distributions for ml

Prior Distribution: the weakness of MLE is that it can overfit the data. To overcome this deficiency, we introduce a prior distribution over the parameter. At the same time, the prior should have a simple interpretation and useful analytical properties.

Conjugate Prior: a conjugate prior for a given likelihood is a prior such that the posterior belongs to the same family of distributions as the prior. Here we need a prior proportional to powers of µ and (1 − µ), so we choose the Beta distribution

Beta(\mu|a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1} \quad (4)

The Beta distribution has two parameters a and b, which can be interpreted as effective numbers of prior observations of each class. It is also easy to verify that the posterior is again a Beta distribution.

Page 7: Probability distributions for ml

Posterior Distribution: after some calculation,

p(\mu|m,l,a,b) = \frac{\Gamma(m+l+a+b)}{\Gamma(m+a)\Gamma(l+b)}\mu^{m+a-1}(1-\mu)^{l+b-1} \quad (5)

where m and l are the observed numbers of successes and failures.

Bayesian Inference: we can now make Bayesian predictions for binary variables. We want to know

p(x=1|D) = \int_0^1 p(x=1|\mu)\,p(\mu|D)\,d\mu = \int_0^1 \mu\,p(\mu|D)\,d\mu = E[\mu|D] \quad (6)

Therefore we get

p(x=1|D) = \frac{m+a}{m+a+l+b} \quad (7)

If the observed counts m and l are sufficiently large, this estimate converges to the MLE, and this asymptotic agreement between Bayesian and maximum-likelihood estimates holds very generally.
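A small sketch of the Beta-Bernoulli update (5)-(7), assuming hypothetical prior pseudo-counts a, b and observed counts m successes and l failures:

    a, b = 2.0, 2.0          # prior pseudo-counts (illustrative)
    m, l = 7, 3              # observed successes / failures

    # posterior is Beta(m + a, l + b), eq. (5)
    post_a, post_b = m + a, l + b

    # predictive probability of the next success, eq. (7)
    p_next = post_a / (post_a + post_b)

    # posterior mean and variance of mu (moments of a Beta distribution)
    mean = post_a / (post_a + post_b)
    var = post_a * post_b / ((post_a + post_b) ** 2 * (post_a + post_b + 1))
    print(p_next, mean, var)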

Page 8: Probability distributions for ml

Since

E_\theta[\theta] = E_D[\,E_\theta[\theta|D]\,] \quad (8)

the posterior mean of θ, averaged over the distribution generating the data, is equal to the prior mean of θ. Also, since

Var_\theta[\theta] = E_D[\,Var_\theta[\theta|D]\,] + Var_D[\,E_\theta[\theta|D]\,] \quad (9)

the posterior variance of θ is, on average, smaller than the prior variance.
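A quick Monte Carlo check of identities (8) and (9) for the Beta-Bernoulli model; this is only a sketch, assuming a Beta(2, 2) prior and N = 10 flips per simulated dataset:

    import numpy as np

    rng = np.random.default_rng(0)
    a, b, N, S = 2.0, 2.0, 10, 200_000

    theta = rng.beta(a, b, size=S)            # draw theta from the prior
    m = rng.binomial(N, theta)                # draw a dataset summary for each theta
    post_a, post_b = a + m, b + (N - m)       # Beta posterior for each dataset

    post_mean = post_a / (post_a + post_b)
    post_var = post_a * post_b / ((post_a + post_b) ** 2 * (post_a + post_b + 1))

    prior_mean = a / (a + b)
    prior_var = a * b / ((a + b) ** 2 * (a + b + 1))

    print(prior_mean, post_mean.mean())                    # eq. (8): should agree
    print(prior_var, post_var.mean() + post_mean.var())    # eq. (9): should agree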

Page 9: Probability distributions for ml

Topics: Multinomial and Multinoulli Distributions · MLE of the Multinoulli parameters · The Dirichlet Distribution and Bayesian Inference

Multinomial Distribution (Mu(x|n, θ)): the multinomial distribution differs from the binomial in the dimension of the output x and of θ. In the binomial, k counts the number of successes; in the multinomial, each entry x_j counts how many of the n trials fell into state j. The binomial is therefore the multinomial with the dimension of x and θ equal to 2.

Mu(x|n,\theta) = \binom{n}{x_0,\ldots,x_{K-1}}\prod_{j=0}^{K-1}\theta_j^{x_j}

Multinoulli Distribution (Mu(x|1, θ)): sometimes we are interested in the special case of the multinomial with n = 1, called the multinoulli (or categorical) distribution:

Mu(x|1,\theta) = \prod_{j=0}^{K-1}\theta_j^{I(x_j=1)}

Page 10: Probability distributions for ml

Likelihood of the Data: by the i.i.d. assumption,

p(D|\mu) = \prod_{n=1}^{N}\prod_{k=1}^{K}\mu_k^{x_{nk}} = \prod_{k=1}^{K}\mu_k^{\sum_n x_{nk}} = \prod_{k=1}^{K}\mu_k^{m_k} \quad (10)

where m_k = \sum_n x_{nk} are the sufficient statistics.

Log-likelihood of the Data: taking the logarithm,

\ln p(D|\mu) = \sum_{k=1}^{K} m_k\ln\mu_k \quad (11)

MLE: therefore, the MLE solves the constrained optimization problem

\max\Big\{\sum_{k=1}^{K} m_k\ln\mu_k \;\Big|\; \sum_{k=1}^{K}\mu_k = 1\Big\} \quad (12)

Page 11: Probability distributions for ml

MLE (cont.): a stationary point of the Lagrangian is a necessary condition for the constrained optimization problem. Therefore,

\nabla_\mu L(\mu;\lambda) = 0, \qquad \nabla_\lambda L(\mu;\lambda) = 0 \quad (13)

where

L(\mu;\lambda) = \sum_{k=1}^{K} m_k\ln\mu_k + \lambda\Big(\sum_{k=1}^{K}\mu_k - 1\Big) \quad (14)

Solving these equations, we get

\mu_k^{ML} = \frac{m_k}{N} \quad (15)
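A minimal sketch of the multinoulli MLE (15): count the occurrences of each state in synthetic one-hot data and normalize (names and values are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    theta_true = np.array([0.2, 0.5, 0.3])
    N, K = 1000, len(theta_true)

    states = rng.choice(K, size=N, p=theta_true)
    X = np.eye(K)[states]            # one-hot encoding, shape (N, K)

    m = X.sum(axis=0)                # sufficient statistics m_k
    mu_ml = m / N                    # eq. (15)
    print(mu_ml)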

Page 12: Probability distributions for ml

Dirichlet Distribution: by the same reasoning as for the Beta distribution, the conjugate prior for the multinoulli is the Dirichlet distribution

Dir(\mu|\alpha) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}\prod_{k=1}^{K}\mu_k^{\alpha_k-1} \quad (16)

where \alpha_0 = \sum_k \alpha_k.

Bayesian Inference: by the same argument as in the binomial case, the posterior is

p(\mu|D,\alpha) = Dir(\mu|\alpha+m) = \frac{\Gamma(\alpha_0+N)}{\Gamma(\alpha_1+m_1)\cdots\Gamma(\alpha_K+m_K)}\prod_{k=1}^{K}\mu_k^{\alpha_k+m_k-1} \quad (17)
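The Dirichlet update (17) simply adds the observed counts to the prior concentration parameters; a sketch with hypothetical values:

    import numpy as np

    alpha = np.array([1.0, 1.0, 1.0])        # prior concentration (illustrative)
    m = np.array([12, 30, 8])                # observed counts per state

    alpha_post = alpha + m                   # posterior Dir(mu | alpha + m), eq. (17)

    # posterior predictive probability of each state (mean of the Dirichlet)
    p_next = alpha_post / alpha_post.sum()
    print(alpha_post, p_next)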

Page 13: Probability distributions for ml

Topics: Univariate and Multivariate Gaussian · Basic Properties · Conditional and Marginal Distributions · Inference for the Gaussian · Student's t-distribution

Univariate Gaussian Distribution (N(x|µ, σ²) = N(x|µ, β⁻¹)):

\mathcal{N}(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{1}{2\sigma^2}(x-\mu)^2\Big) \quad (18)

\mathcal{N}(x|\mu,\beta^{-1}) = \sqrt{\frac{\beta}{2\pi}}\exp\Big(-\frac{\beta}{2}(x-\mu)^2\Big) \quad (19)

where β = 1/σ² is the precision.

Multivariate Gaussian Distribution (N(x|µ, Σ) = N(x|µ, Λ⁻¹)):

\mathcal{N}(x|\mu,\Sigma) = \frac{1}{(2\pi)^{D/2}\det(\Sigma)^{1/2}}\exp\Big(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\Big) \quad (20)

\mathcal{N}(x|\mu,\Lambda^{-1}) = \frac{\det(\Lambda)^{1/2}}{(2\pi)^{D/2}}\exp\Big(-\frac{1}{2}(x-\mu)^\top\Lambda(x-\mu)\Big) \quad (21)

where Λ = Σ⁻¹ is the precision matrix.

Page 14: Probability distributions for ml

Mahalanobis Distance: by the eigenvalue decomposition (EVD) of Σ, we get

\Delta^2 = (x-\mu)^\top\Sigma^{-1}(x-\mu) = \sum_{i=1}^{D}\frac{y_i^2}{\lambda_i} \quad (22)

where λ_i and u_i are the eigenvalues and eigenvectors of Σ and y_i = u_i^\top(x-\mu).

Change of Variable in the Gaussian: in the coordinates y, we get

p(y) = p(x)\,|J_{y\to x}| = \prod_{j=1}^{D}\frac{1}{(2\pi\lambda_j)^{1/2}}\exp\Big\{-\frac{y_j^2}{2\lambda_j}\Big\} \quad (23)

which is a product of D independent univariate Gaussian distributions.

First and Second Moments of the Gaussian: using the above, we get

E[x] = \mu, \qquad E[xx^\top] = \mu\mu^\top + \Sigma \quad (24)
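A sketch of (22)-(24) in Python: compute the Mahalanobis distance both directly and through the eigendecomposition of Σ, then check the moments on samples (synthetic µ and Σ; numpy only):

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.6],
                      [0.6, 1.0]])

    x = rng.multivariate_normal(mu, Sigma)

    # direct Mahalanobis distance
    d2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)

    # via the eigendecomposition: Delta^2 = sum_i y_i^2 / lambda_i, eq. (22)
    lam, U = np.linalg.eigh(Sigma)
    y = U.T @ (x - mu)
    d2_evd = np.sum(y**2 / lam)
    print(d2, d2_evd)                                     # should match

    # sample check of the moments, eq. (24)
    X = rng.multivariate_normal(mu, Sigma, size=100_000)
    print(X.mean(axis=0))                                 # ~ mu
    print((X[:, :, None] * X[:, None, :]).mean(axis=0))   # ~ mu mu^T + Sigma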

Page 15: Probability distributions for ml

Limitations of the Gaussian and Remedies: the Gaussian has two main limitations. First, a full covariance matrix contains many parameters to infer. Second, a single Gaussian cannot represent multimodal distributions. We therefore define some auxiliary concepts.

Diagonal Covariance

\Sigma = \mathrm{diag}(s^2) \quad (25)

Isotropic Covariance

\Sigma = \sigma^2 I \quad (26)

Mixture Model

p(x) = \sum_{k=1}^{K}\pi_k\,p(x|k) \quad (27)

where the mixing coefficients π_k are nonnegative and sum to one.

Page 16: Probability distributions for ml

Partitioning the Mahalanobis Distance: first, partition the covariance matrix and the precision matrix,

\Sigma = \begin{pmatrix}\Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb}\end{pmatrix}, \qquad \Sigma^{-1} = \Lambda = \begin{pmatrix}\Lambda_{aa} & \Lambda_{ab}\\ \Lambda_{ba} & \Lambda_{bb}\end{pmatrix} \quad (28)

where the aa and bb blocks are symmetric and \Sigma_{ba} = \Sigma_{ab}^\top, \Lambda_{ba} = \Lambda_{ab}^\top. Now partition the Mahalanobis distance:

(x-\mu)^\top\Sigma^{-1}(x-\mu) = (x_a-\mu_a)^\top\Lambda_{aa}(x_a-\mu_a) + (x_a-\mu_a)^\top\Lambda_{ab}(x_b-\mu_b) + (x_b-\mu_b)^\top\Lambda_{ba}(x_a-\mu_a) + (x_b-\mu_b)^\top\Lambda_{bb}(x_b-\mu_b) \quad (29)

Schur Complement: as in Gaussian elimination, a block matrix can be inverted using the Schur complement,

\begin{pmatrix}A & B\\ C & D\end{pmatrix}^{-1} = \begin{pmatrix}M & -MBD^{-1}\\ -D^{-1}CM & D^{-1}+D^{-1}CMBD^{-1}\end{pmatrix} \quad (30)

where M = (A - BD^{-1}C)^{-1}.

Page 17: Probability distributions for ml

Schur Complement (cont.): applying this to the partitioned covariance gives

\Lambda_{aa} = (\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1} \quad (31)

\Lambda_{ab} = -(\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma_{bb}^{-1} \quad (32)

Conditional Distribution: therefore, we get

x_a|x_b \sim \mathcal{N}(x|\mu_{a|b}, \Sigma_{a|b}) \quad (33)

where

\mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b) \quad (34)

\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} \quad (35)

Marginal Distribution: integrating out x_b, the log marginal of x_a has the quadratic form

\ln p(x_a) = -\frac{1}{2}x_a^\top(\Lambda_{aa}-\Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})x_a + x_a^\top(\Lambda_{aa}-\Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})\mu_a + \text{const} \quad (36)

Therefore, we get

x_a \sim \mathcal{N}(x|\mu_a, \Sigma_{aa}) \quad (37)
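A sketch of the conditional formulas (34)-(35) and the marginal (37), using a small hand-picked partitioned covariance (illustrative values only):

    import numpy as np

    # joint Gaussian over (x_a, x_b), each one-dimensional here
    mu = np.array([0.0, 1.0])
    Sigma = np.array([[2.0, 0.8],
                      [0.8, 1.0]])

    ia, ib = [0], [1]                      # index sets of the partition
    mu_a, mu_b = mu[ia], mu[ib]
    S_aa = Sigma[np.ix_(ia, ia)]
    S_ab = Sigma[np.ix_(ia, ib)]
    S_bb = Sigma[np.ix_(ib, ib)]

    x_b = np.array([2.0])                  # observed value of x_b

    # conditional p(x_a | x_b), eqs. (34)-(35)
    mu_cond = mu_a + S_ab @ np.linalg.solve(S_bb, x_b - mu_b)
    S_cond = S_aa - S_ab @ np.linalg.solve(S_bb, S_ab.T)

    # marginal p(x_a), eq. (37): just read off the corresponding block
    print(mu_cond, S_cond, mu_a, S_aa)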

Page 18: Probability distributions for ml

Given a marginal Gaussian for x and a conditional Gaussian for y given x of the form

x \sim \mathcal{N}(x|\mu, \Lambda^{-1}) \quad (38)

y|x \sim \mathcal{N}(y|Ax+b, L^{-1}) \quad (39)

the marginal distribution of y and the conditional distribution of x given y are

y \sim \mathcal{N}(y|A\mu+b,\; L^{-1} + A\Lambda^{-1}A^\top) \quad (40)

x|y \sim \mathcal{N}(x|\Sigma\{A^\top L(y-b) + \Lambda\mu\}, \Sigma) \quad (41)

where

\Sigma = (\Lambda + A^\top L A)^{-1} \quad (42)
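A sketch of equations (40)-(42) with small illustrative matrices (the matrices and observed y are assumptions made up for the example):

    import numpy as np

    # p(x) = N(mu, Lambda^{-1}),  p(y|x) = N(Ax + b, L^{-1})
    mu = np.array([0.0, 0.0])
    Lam = np.eye(2)                        # precision of x
    A = np.array([[1.0, 2.0]])             # maps R^2 -> R^1
    b = np.array([0.5])
    L = np.array([[4.0]])                  # precision of y given x

    Lam_inv, L_inv = np.linalg.inv(Lam), np.linalg.inv(L)

    # marginal of y, eq. (40)
    y_mean = A @ mu + b
    y_cov = L_inv + A @ Lam_inv @ A.T

    # posterior of x given an observed y, eqs. (41)-(42)
    y_obs = np.array([3.0])
    S = np.linalg.inv(Lam + A.T @ L @ A)
    x_mean = S @ (A.T @ L @ (y_obs - b) + Lam @ mu)
    print(y_mean, y_cov, x_mean, S)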

Page 19: Probability distributions for ml

Log-likelihood of the Data: by the same argument as for categorical data, the Gaussian log-likelihood is

\ln p(D|\mu,\Sigma) = -\frac{ND}{2}\ln 2\pi - \frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(x_n-\mu)^\top\Sigma^{-1}(x_n-\mu) \quad (43)

This log-likelihood depends on the data only through the quantities

\sum_{n=1}^{N}x_n, \qquad \sum_{n=1}^{N}x_n x_n^\top \quad (44)

which are therefore called the sufficient statistics.

MLE for the Gaussian: maximizing the log-likelihood gives

\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N}x_n \quad (45)

\Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})(x_n-\mu_{ML})^\top \quad (46)
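Equations (45)-(46) take one line each in numpy; a sketch on synthetic data (note the 1/N normalization, matching the MLE rather than the unbiased estimator):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.3], [0.3, 2.0]], size=5000)

    mu_ml = X.mean(axis=0)                                   # eq. (45)
    Sigma_ml = (X - mu_ml).T @ (X - mu_ml) / len(X)          # eq. (46), 1/N not 1/(N-1)
    print(mu_ml, Sigma_ml)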

Page 20: Probability distributions for ml

Sequential Estimation: since the Gaussian MLE has a closed form, it can be updated sequentially:

\mu_{ML}^{(N)} = \mu_{ML}^{(N-1)} + \frac{1}{N}\big(x_N - \mu_{ML}^{(N-1)}\big) \quad (47)

Robbins-Monro Algorithm: the same idea generalizes to sequential learning. The Robbins-Monro algorithm finds a root θ* of a regression function f(θ) = E[z|θ], i.e. f(θ*) = 0. Its iteration is

\theta^{(N)} = \theta^{(N-1)} - a_{N-1}\,z(\theta^{(N-1)}) \quad (48)

where z(θ^{(N−1)}) is the observed value of z when θ takes the value θ^{(N−1)}, and {a_N} is a sequence satisfying

\lim_{N\to\infty}a_N = 0, \qquad \sum_{N=1}^{\infty}a_N = \infty, \qquad \sum_{N=1}^{\infty}a_N^2 < \infty \quad (49)
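A sketch of the sequential update (47) next to the Robbins-Monro iteration (48) applied to the Gaussian-mean problem with step sizes a_n = σ²/n (the choice discussed on the next slide); the data stream is synthetic:

    import numpy as np

    rng = np.random.default_rng(0)
    sigma2, mu_true = 1.0, 3.0
    stream = rng.normal(mu_true, np.sqrt(sigma2), size=10_000)

    mu_seq = 0.0          # sequential MLE, eq. (47)
    theta = 0.0           # Robbins-Monro iterate, eq. (48)
    for n, x in enumerate(stream, start=1):
        mu_seq += (x - mu_seq) / n
        a = sigma2 / n                       # step size a_{n-1} = sigma^2 / n
        z = -(x - theta) / sigma2            # z(theta) = -d/d(theta) ln p(x|theta)
        theta = theta - a * z                # identical to the update above
    print(mu_seq, theta, stream.mean())      # all three agree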

Page 21: Probability distributions for ml

Generalized Sequential Learning: the Robbins-Monro algorithm can be applied to maximum-likelihood learning by taking f(θ) to be the expected gradient of the log-likelihood, i.e.

z(\theta) = -\frac{\partial}{\partial\theta}\ln p(x|\theta) \quad (50)

In the Gaussian-mean case, choosing a_N = \sigma^2/N recovers the sequential MLE update (47).

Bayesian Inference for the Mean Given the Variance: since the Gaussian likelihood is the exponential of a quadratic form in µ, we can choose a prior that is also Gaussian. Therefore, if we choose the prior

\mu \sim \mathcal{N}(\mu|\mu_0, \sigma_0^2) \quad (51)

the posterior is

\mu|D \sim \mathcal{N}(\mu|\mu_N, \sigma_N^2) \quad (52)

where

\mu_N = \frac{\sigma^2}{N\sigma_0^2+\sigma^2}\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2+\sigma^2}\mu_{ML}, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} \quad (53)
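A sketch of the posterior update (53) for the mean with known variance (prior values and data are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    sigma2 = 1.0                       # known noise variance
    mu0, sigma02 = 0.0, 10.0           # prior N(mu0, sigma0^2)

    x = rng.normal(2.0, np.sqrt(sigma2), size=50)
    N, mu_ml = len(x), x.mean()

    # eq. (53): posterior mean and variance
    mu_N = (sigma2 / (N * sigma02 + sigma2)) * mu0 \
         + (N * sigma02 / (N * sigma02 + sigma2)) * mu_ml
    sigma_N2 = 1.0 / (1.0 / sigma02 + N / sigma2)
    print(mu_N, sigma_N2)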

Page 22: Probability distributions for ml

Bayesian Inference for the Mean Given the Variance (cont.):
1. The posterior mean is a compromise between the prior mean and the MLE.
2. The posterior precision is the prior precision plus one contribution of the data precision for each observed data point.
3. If we take σ₀² → ∞, the posterior mean reduces to the MLE.

Bayesian Inference for the Variance Given the Mean: the Gaussian likelihood is proportional to the product of a power of the precision λ and the exponential of a linear function of λ, so we choose a Gamma prior,

Gam(\lambda|a_0,b_0) = \frac{1}{\Gamma(a_0)}b_0^{a_0}\lambda^{a_0-1}\exp(-b_0\lambda) \quad (54)

Then the posterior is

\lambda|D \sim Gam(\lambda|a_N, b_N) \quad (55)

where

a_N = a_0 + \frac{N}{2}, \qquad b_N = b_0 + \frac{N}{2}\sigma_{ML}^2 \quad (56)

and σ²_ML is the maximum-likelihood variance computed around the given mean.
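A sketch of the Gamma update (56) for the precision with known mean (prior values and data are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    mu = 0.0                               # known mean
    a0, b0 = 1.0, 1.0                      # Gamma prior on the precision lambda

    x = rng.normal(mu, 2.0, size=100)      # true precision is 1/4
    N = len(x)
    sigma2_ml = np.mean((x - mu) ** 2)     # ML variance around the known mean

    aN = a0 + N / 2.0                      # eq. (56)
    bN = b0 + N / 2.0 * sigma2_ml
    print(aN / bN)                         # posterior mean of the precision, ~ 0.25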

Page 23: Probability distributions for ml

Bayesian Inference for the Variance Given the Mean (cont.):
1. The parameter 2a₀ can be interpreted as the effective number of prior observations.
2. The ratio b₀/a₀ can be interpreted as the effective prior value of the variance of those observations.

Bayesian Inference When Mean and Precision Are Both Unknown: applying the same argument to the mean and the precision jointly, the conjugate prior is the normal-gamma distribution

p(\mu,\lambda) = \mathcal{N}(\mu|\mu_0, (\beta\lambda)^{-1})\,Gam(\lambda|a,b) \quad (57)

where

\mu_0 = c/\beta, \qquad a = 1+\beta/2, \qquad b = d - c^2/(2\beta) \quad (58)

Note that the precision of µ is a linear function of λ. For the multivariate case, the conjugate prior is similarly the normal-Wishart distribution

p(\mu,\Lambda|\mu_0,\beta,W,\nu) = \mathcal{N}(\mu|\mu_0, (\beta\Lambda)^{-1})\,\mathcal{W}(\Lambda|W,\nu) \quad (59)

where W denotes the Wishart distribution.

Page 24: Probability distributions for ml

Univariate t-distribution: if we place a Gamma prior on the precision and integrate the precision out, we obtain Student's t-distribution

St(x|\mu,\lambda,\nu) = \frac{\Gamma(\nu/2+1/2)}{\Gamma(\nu/2)}\Big(\frac{\lambda}{\pi\nu}\Big)^{1/2}\Big[1+\frac{\lambda(x-\mu)^2}{\nu}\Big]^{-\nu/2-1/2} \quad (60)

where ν = 2a (the degrees of freedom) and λ = a/b. The t-distribution can thus be viewed as an infinite mixture of Gaussians. Since it has heavier tails than the Gaussian, it yields more robust estimates in the presence of outliers.

Multivariate t-distribution: the multivariate version of this infinite mixture of Gaussians gives the multivariate t-distribution

St(x|\mu,\Lambda,\nu) = \frac{\Gamma(\nu/2+D/2)}{\Gamma(\nu/2)}\frac{|\Lambda|^{1/2}}{(\pi\nu)^{D/2}}\Big[1+\frac{\Delta^2}{\nu}\Big]^{-\nu/2-D/2} \quad (61)

where Δ² = (x − µ)ᵀΛ(x − µ) is the squared Mahalanobis distance.

Page 25: Probability distributions for ml

Topics: Distributions in the exponential family · Sigmoid and Softmax · MLE for the exponential family · Conjugate priors for the exponential family · Noninformative priors

The Exponential Family: the exponential family of distributions over x, given parameters η, is the set of distributions of the form

p(x|\eta) = h(x)\,g(\eta)\exp\{\eta^\top u(x)\} \quad (62)

where η are the natural parameters of the distribution and u(x) is some function of x. The function g(η) can be interpreted as the normalization factor.

Page 26: Probability distributions for ml

Logistic Sigmoid: for the Bernoulli distribution the usual parameter is µ, while the natural parameter is η. The two are connected by

\eta = \ln\Big(\frac{\mu}{1-\mu}\Big), \qquad \mu := \sigma(\eta) = \frac{1}{1+\exp(-\eta)} \quad (63)

and σ(η) is called the logistic sigmoid function.

Softmax Function: by the same argument, for the multinoulli distribution the relationship between the parameters and the natural parameters is given by the softmax function,

\mu_k = \frac{\exp(\eta_k)}{\sum_{j=1}^{K}\exp(\eta_j)} \quad (64)

Note that in this case u(x) = x, h(x) = 1, and g(\eta) = \big(\sum_{j=1}^{K}\exp(\eta_j)\big)^{-1}.
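A sketch of the parameter / natural-parameter maps (63)-(64), using a numerically stable softmax (subtracting the maximum is a standard trick, not something stated on the slides):

    import numpy as np

    def sigmoid(eta):
        # mu = 1 / (1 + exp(-eta)), eq. (63)
        return 1.0 / (1.0 + np.exp(-eta))

    def softmax(eta):
        # mu_k = exp(eta_k) / sum_j exp(eta_j), eq. (64)
        e = np.exp(eta - np.max(eta))
        return e / e.sum()

    mu = 0.8
    eta = np.log(mu / (1 - mu))            # natural parameter of a Bernoulli
    print(eta, sigmoid(eta))               # recovers mu = 0.8

    print(softmax(np.array([2.0, 0.0, -1.0])))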

Page 27: Probability distributions for ml

Gaussian: the Gaussian can also be written as a member of the exponential family, with

u(x) = \begin{pmatrix}x\\ x^2\end{pmatrix} \quad (65)

\eta = \begin{pmatrix}\mu/\sigma^2\\ -1/(2\sigma^2)\end{pmatrix} \quad (66)

g(\eta) = (-2\eta_2)^{1/2}\exp\Big(\frac{\eta_1^2}{4\eta_2}\Big) \quad (67)

Page 28: Probability distributions for ml

Estimating the Natural Parameter: the MLE argument generalizes to the whole exponential family. First, consider the log-likelihood of the data,

\ln p(D|\eta) = \sum_{n=1}^{N}\ln h(x_n) + N\ln g(\eta) + \eta^\top\sum_{n=1}^{N}u(x_n) \quad (68)

Next, find the stationary point of the log-likelihood:

N\nabla_\eta\ln g(\eta) + \sum_{n=1}^{N}u(x_n) = 0 \quad (69)

Therefore, the MLE satisfies

-\nabla_\eta\ln g(\eta) = \frac{1}{N}\sum_{n=1}^{N}u(x_n) \quad (70)

We see that the MLE depends on the data only through \sum_n u(x_n), which is therefore called the sufficient statistic of the exponential family.
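For the Bernoulli case, (70) reduces to matching the mean of the sufficient statistic u(x) = x, so the MLE is again the sample mean; a sketch:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.binomial(1, 0.3, size=1000)

    # sufficient statistic for the Bernoulli: u(x) = x
    u_bar = x.mean()                              # right-hand side of eq. (70)

    # for the Bernoulli, -d/d(eta) ln g(eta) = sigmoid(eta), so eq. (70) reads
    # sigmoid(eta_ML) = u_bar, i.e. eta_ML = logit(u_bar) and mu_ML = u_bar
    eta_ml = np.log(u_bar / (1 - u_bar))
    mu_ml = 1.0 / (1.0 + np.exp(-eta_ml))
    print(eta_ml, mu_ml)                          # mu_ml equals the sample mean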

Page 29: Probability distributions for ml

Conjugate Prior: for any member of the exponential family, there exists a conjugate prior that can be written in the form

p(\eta|\chi,\nu) = f(\chi,\nu)\,g(\eta)^{\nu}\exp\{\nu\,\eta^\top\chi\} \quad (71)

where f(χ, ν) is a normalization factor and g(η) is the same function as in the exponential-family density.

Posterior Distribution: with this conjugate prior, the posterior is

p(\eta|D,\chi,\nu) \propto g(\eta)^{\nu+N}\exp\Big\{\eta^\top\Big(\sum_{n=1}^{N}u(x_n) + \nu\chi\Big)\Big\} \quad (72)

We therefore see that ν can be interpreted as an effective number of pseudo-observations in the prior, each of which has the value χ for the sufficient statistic u(x).

Page 30: Probability distributions for ml

Noninformative Priors: we may seek a form of prior distribution, called a noninformative prior, which is intended to have as little influence on the posterior distribution as possible.

Generalizations of Noninformative Priors: this idea leads to two generalizations, namely the principle of transformation groups, as in the Jeffreys prior, and the principle of maximum entropy.

Page 31: Probability distributions for ml

Topics: Histogram Technique · Kernel Density Estimation · Nearest-Neighbour Methods

Histogram Technique: standard histograms partition x into distinct bins of width ∆_i and count the number n_i of observations of x falling into bin i. To turn this count into a normalized probability density, we divide by the total number N of observations and by the bin width ∆_i, obtaining the density value for each bin

p_i = \frac{n_i}{N\Delta_i} \quad (73)

Limitations of the Histogram: the estimated density has discontinuities at the bin edges that are artifacts of the binning rather than properties of the underlying distribution. The histogram approach also scales poorly with dimensionality, since the number of bins grows exponentially with D.

Lessons of the Histogram: first, to estimate the probability density at a particular location, we should consider the data points that lie within some local neighbourhood of that point. Second, the value of the smoothing parameter should be neither too large nor too small in order to obtain good results.
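A sketch of the histogram density estimate (73) in numpy (synthetic one-dimensional data; numpy's density=True option performs the same normalization):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, size=1000)

    counts, edges = np.histogram(x, bins=20)
    widths = np.diff(edges)                    # Delta_i
    p = counts / (len(x) * widths)             # eq. (73): p_i = n_i / (N * Delta_i)

    # sanity check: the estimated density integrates to one
    print(np.sum(p * widths))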

Page 32: Probability distributions for ml

Motivation: for large N, the binomial distribution of the number K of data points falling within a small region R is sharply peaked around its mean, so

K \simeq NP \quad (74)

If we also assume that the region R is small enough that the probability density p(x) is roughly constant over it, then

P \simeq p(x)V \quad (75)

where V is the volume of R. Therefore,

p(x) = \frac{K}{NV} \quad (76)

Note that these assumptions pull in opposite directions: R must be sufficiently small that the density is approximately constant over the region, yet sufficiently large that the number K of points falling inside it makes the binomial distribution sharply peaked.

Page 33: Probability distributions for ml

Kernel Density Estimation (KDE): if we fix V and determine K from the data, we obtain the kernel approach. For instance, we can count the data points falling within a unit cube centred on x using the function

k(u) = \begin{cases}1, & |u_i| \le 1/2,\ i = 1,\ldots,D\\ 0, & \text{otherwise}\end{cases} \quad (77)

which is called a Parzen window. Using a cube of side h centred on x, the number of points falling inside it is

K = \sum_{n=1}^{N}k\Big(\frac{x-x_n}{h}\Big) \quad (78)

which leads to the density estimate

p(x) = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{h^D}\,k\Big(\frac{x-x_n}{h}\Big) \quad (79)

We can also use a smoother kernel, such as the Gaussian kernel, which gives

p(x) = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{(2\pi h^2)^{D/2}}\exp\Big\{-\frac{\|x-x_n\|^2}{2h^2}\Big\} \quad (80)
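A sketch of the Gaussian-kernel estimator (80) in one dimension (synthetic data, hand-picked bandwidth h; both are assumptions for the example):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(0.0, 1.0, size=500)      # training points x_n
    h = 0.3                                    # bandwidth

    def kde(x, data, h, D=1):
        # p(x) = (1/N) sum_n N(x | x_n, h^2 I), eq. (80)
        diff = x - data
        norm = (2 * np.pi * h**2) ** (D / 2)
        return np.mean(np.exp(-diff**2 / (2 * h**2)) / norm)

    grid = np.linspace(-3, 3, 7)
    print([round(kde(x, data, h), 3) for x in grid])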

Page 34: Probability distributions for ml

Limitation of KDE: one difficulty with the kernel approach to density estimation is that the bandwidth h is fixed for all kernels. In regions of high data density, a large value of h over-smooths the estimate, while in regions of low density a small h leads to noisy, overfitted estimates. The optimal choice of h may therefore depend on the location within the data space.

Nearest Neighbours (NN): we can instead fix K and use the data to determine an appropriate V; this is the K-nearest-neighbour method. In this case the value of K governs the degree of smoothing, and K must be chosen by hyper-parameter optimization.

Error of K-NN: note that, for sufficiently large N, the error rate of the nearest-neighbour (K = 1) classifier is never more than twice the minimum achievable error rate of an optimal classifier.
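A sketch of the K-NN density estimate p(x) = K/(NV) in one dimension, where V is the length of the interval reaching the K-th nearest neighbour (illustrative values only; as noted in the references, this estimator does not integrate to one):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(0.0, 1.0, size=500)
    K, N = 20, len(data)

    def knn_density(x, data, K):
        # distance to the K-th nearest neighbour fixes the region R around x
        r = np.sort(np.abs(data - x))[K - 1]
        V = 2 * r                              # length of the interval [x - r, x + r]
        return K / (N * V)

    print([round(knn_density(x, data, K), 3) for x in [-2.0, 0.0, 2.0]])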
