
Page 1: Probability distributions for ml

Outline: Introduction · Binary Variables · Multinomial Variables · The Gaussian Distribution · The Exponential Family · Nonparametric Methods

Probability Distributions for ML

Sung-Yub Kim

Dept of IE, Seoul National University

January 29, 2017

Page 2: Probability distributions for ml

Bishop, C. M. Pattern Recognition and Machine Learning. Information Science and Statistics, Springer, 2006.

Murphy, K. P. Machine Learning: A Probabilistic Perspective. Adaptive Computation and Machine Learning, MIT Press, 2012.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.

Page 3: Probability distributions for ml

Purpose: Density Estimation

Assumption: data points are independent and identically distributed (i.i.d.).

Parametric vs. Nonparametric: parametric estimation is more interpretable but rests on strong distributional assumptions. Nonparametric estimation also has parameters, but they control model complexity rather than the form of the distribution.

Page 4: Probability distributions for ml

Topics: Bernoulli and Binomial Distributions · MLE of the Bernoulli parameter · The Beta Distribution · Bayesian inference on binary variables · Difference between prior and posterior

Bernoulli Distribution (Ber(θ)): the Bernoulli distribution has a single parameter θ, the success probability of one trial. Its PMF is

Ber(x|\theta) = \theta^{I(x=1)}(1-\theta)^{I(x=0)}

Binomial Distribution (Bin(n, θ)): the binomial distribution has two parameters, the number of trials n and the success probability θ. Its PMF is

Bin(k|n,\theta) = \binom{n}{k}\theta^{k}(1-\theta)^{n-k}

Page 5: Probability distributions for ml

Likelihood of the Data: by the i.i.d. assumption,

p(D|\mu) = \prod_{n=1}^{N} p(x_n|\mu) = \prod_{n=1}^{N} \mu^{x_n}(1-\mu)^{1-x_n} \quad (1)

Log-likelihood of the Data: taking the logarithm,

\ln p(D|\mu) = \sum_{n=1}^{N} \ln p(x_n|\mu) = \sum_{n=1}^{N} \{x_n \ln\mu + (1-x_n)\ln(1-\mu)\} \quad (2)

MLE: since the maximizer is a stationary point of the log-likelihood, we get

\mu_{ML} := \hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n \quad (3)
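As a concrete illustration, here is a minimal Python sketch of eq. (3) on synthetic coin-flip data (all names and values here are illustrative, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    mu_true = 0.3
    x = rng.binomial(1, mu_true, size=1000)   # i.i.d. Bernoulli draws

    # MLE of the Bernoulli parameter: the sample mean, eq. (3)
    mu_ml = x.mean()

    # log-likelihood at the MLE, eq. (2)
    loglik = np.sum(x * np.log(mu_ml) + (1 - x) * np.log(1 - mu_ml))
    print(mu_ml, loglik)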

Page 6: Probability distributions for ml

Prior Distribution: the weakness of MLE is that it can overfit the data. To overcome this deficiency, we introduce a prior distribution over the parameter. At the same time, the prior should have a simple interpretation and useful analytical properties.

Conjugate Prior: a conjugate prior for a given likelihood is a prior such that the posterior belongs to the same family of distributions as the prior. Here we need a prior proportional to powers of µ and (1 − µ), so we choose the Beta distribution

Beta(\mu|a,b) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\mu^{a-1}(1-\mu)^{b-1} \quad (4)

The Beta distribution has two parameters a and b, which can be interpreted as effective numbers of prior observations of each class. It is also easy to verify that the posterior is again a Beta distribution.

Page 7: Probability distributions for ml

Posterior Distribution: after some calculation,

p(\mu|m,l,a,b) = \frac{\Gamma(m+l+a+b)}{\Gamma(m+a)\Gamma(l+b)}\mu^{m+a-1}(1-\mu)^{l+b-1} \quad (5)

where m and l are the observed numbers of successes and failures.

Bayesian Inference: we can now make Bayesian predictions for binary variables. We want to know

p(x=1|D) = \int_0^1 p(x=1|\mu)\,p(\mu|D)\,d\mu = \int_0^1 \mu\,p(\mu|D)\,d\mu = E[\mu|D] \quad (6)

Therefore we get

p(x=1|D) = \frac{m+a}{m+a+l+b} \quad (7)

If the observed counts m and l are sufficiently large, this estimate converges to the MLE, and this asymptotic agreement between Bayesian and maximum-likelihood estimates holds very generally.
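A small sketch of the Beta-Bernoulli update (5)-(7), assuming hypothetical prior pseudo-counts a, b and observed counts m successes and l failures:

    a, b = 2.0, 2.0          # prior pseudo-counts (illustrative)
    m, l = 7, 3              # observed successes / failures

    # posterior is Beta(m + a, l + b), eq. (5)
    post_a, post_b = m + a, l + b

    # predictive probability of the next success, eq. (7)
    p_next = post_a / (post_a + post_b)

    # posterior mean and variance of mu (moments of a Beta distribution)
    mean = post_a / (post_a + post_b)
    var = post_a * post_b / ((post_a + post_b) ** 2 * (post_a + post_b + 1))
    print(p_next, mean, var)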

Page 8: Probability distributions for ml

Since

E_\theta[\theta] = E_D[\,E_\theta[\theta|D]\,] \quad (8)

the posterior mean of θ, averaged over the distribution generating the data, is equal to the prior mean of θ. Also, since

Var_\theta[\theta] = E_D[\,Var_\theta[\theta|D]\,] + Var_D[\,E_\theta[\theta|D]\,] \quad (9)

the posterior variance of θ is, on average, smaller than the prior variance.
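A quick Monte Carlo check of identities (8) and (9) for the Beta-Bernoulli model; this is only a sketch, assuming a Beta(2, 2) prior and N = 10 flips per simulated dataset:

    import numpy as np

    rng = np.random.default_rng(0)
    a, b, N, S = 2.0, 2.0, 10, 200_000

    theta = rng.beta(a, b, size=S)            # draw theta from the prior
    m = rng.binomial(N, theta)                # draw a dataset summary for each theta
    post_a, post_b = a + m, b + (N - m)       # Beta posterior for each dataset

    post_mean = post_a / (post_a + post_b)
    post_var = post_a * post_b / ((post_a + post_b) ** 2 * (post_a + post_b + 1))

    prior_mean = a / (a + b)
    prior_var = a * b / ((a + b) ** 2 * (a + b + 1))

    print(prior_mean, post_mean.mean())                    # eq. (8): should agree
    print(prior_var, post_var.mean() + post_mean.var())    # eq. (9): should agree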

Page 9: Probability distributions for ml

Topics: Multinomial and Multinoulli Distributions · MLE of the Multinoulli parameters · The Dirichlet Distribution and Bayesian Inference

Multinomial Distribution (Mu(x|n, θ)): the multinomial distribution differs from the binomial in the dimension of the output x and of θ. In the binomial, k counts the number of successes; in the multinomial, each entry x_j counts how many of the n trials fell into state j. The binomial is therefore the multinomial with the dimension of x and θ equal to 2.

Mu(x|n,\theta) = \binom{n}{x_0,\ldots,x_{K-1}}\prod_{j=0}^{K-1}\theta_j^{x_j}

Multinoulli Distribution (Mu(x|1, θ)): sometimes we are interested in the special case of the multinomial with n = 1, called the multinoulli (or categorical) distribution:

Mu(x|1,\theta) = \prod_{j=0}^{K-1}\theta_j^{I(x_j=1)}

Page 10: Probability distributions for ml

Likelihood of the Data: by the i.i.d. assumption,

p(D|\mu) = \prod_{n=1}^{N}\prod_{k=1}^{K}\mu_k^{x_{nk}} = \prod_{k=1}^{K}\mu_k^{\sum_n x_{nk}} = \prod_{k=1}^{K}\mu_k^{m_k} \quad (10)

where m_k = \sum_n x_{nk} are the sufficient statistics.

Log-likelihood of the Data: taking the logarithm,

\ln p(D|\mu) = \sum_{k=1}^{K} m_k\ln\mu_k \quad (11)

MLE: therefore, the MLE solves the constrained optimization problem

\max\Big\{\sum_{k=1}^{K} m_k\ln\mu_k \;\Big|\; \sum_{k=1}^{K}\mu_k = 1\Big\} \quad (12)

Page 11: Probability distributions for ml

MLE (cont.): a stationary point of the Lagrangian is a necessary condition for the constrained optimization problem. Therefore,

\nabla_\mu L(\mu;\lambda) = 0, \qquad \nabla_\lambda L(\mu;\lambda) = 0 \quad (13)

where

L(\mu;\lambda) = \sum_{k=1}^{K} m_k\ln\mu_k + \lambda\Big(\sum_{k=1}^{K}\mu_k - 1\Big) \quad (14)

Solving these equations, we get

\mu_k^{ML} = \frac{m_k}{N} \quad (15)
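A minimal sketch of the multinoulli MLE (15): count the occurrences of each state in synthetic one-hot data and normalize (names and values are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    theta_true = np.array([0.2, 0.5, 0.3])
    N, K = 1000, len(theta_true)

    states = rng.choice(K, size=N, p=theta_true)
    X = np.eye(K)[states]            # one-hot encoding, shape (N, K)

    m = X.sum(axis=0)                # sufficient statistics m_k
    mu_ml = m / N                    # eq. (15)
    print(mu_ml)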

Page 12: Probability distributions for ml

Dirichlet Distribution: by the same reasoning as for the Beta distribution, the conjugate prior for the multinoulli is the Dirichlet distribution

Dir(\mu|\alpha) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}\prod_{k=1}^{K}\mu_k^{\alpha_k-1} \quad (16)

where \alpha_0 = \sum_k \alpha_k.

Bayesian Inference: by the same argument as in the binomial case, the posterior is

p(\mu|D,\alpha) = Dir(\mu|\alpha+m) = \frac{\Gamma(\alpha_0+N)}{\Gamma(\alpha_1+m_1)\cdots\Gamma(\alpha_K+m_K)}\prod_{k=1}^{K}\mu_k^{\alpha_k+m_k-1} \quad (17)
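The Dirichlet update (17) simply adds the observed counts to the prior concentration parameters; a sketch with hypothetical values:

    import numpy as np

    alpha = np.array([1.0, 1.0, 1.0])        # prior concentration (illustrative)
    m = np.array([12, 30, 8])                # observed counts per state

    alpha_post = alpha + m                   # posterior Dir(mu | alpha + m), eq. (17)

    # posterior predictive probability of each state (mean of the Dirichlet)
    p_next = alpha_post / alpha_post.sum()
    print(alpha_post, p_next)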

Page 13: Probability distributions for ml

Topics: Univariate and Multivariate Gaussian · Basic Properties · Conditional and Marginal Distributions · Inference for the Gaussian · Student's t-distribution

Univariate Gaussian Distribution (N(x|µ, σ²) = N(x|µ, β⁻¹)):

\mathcal{N}(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{1}{2\sigma^2}(x-\mu)^2\Big) \quad (18)

\mathcal{N}(x|\mu,\beta^{-1}) = \sqrt{\frac{\beta}{2\pi}}\exp\Big(-\frac{\beta}{2}(x-\mu)^2\Big) \quad (19)

where β = 1/σ² is the precision.

Multivariate Gaussian Distribution (N(x|µ, Σ) = N(x|µ, Λ⁻¹)):

\mathcal{N}(x|\mu,\Sigma) = \frac{1}{(2\pi)^{D/2}\det(\Sigma)^{1/2}}\exp\Big(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\Big) \quad (20)

\mathcal{N}(x|\mu,\Lambda^{-1}) = \frac{\det(\Lambda)^{1/2}}{(2\pi)^{D/2}}\exp\Big(-\frac{1}{2}(x-\mu)^\top\Lambda(x-\mu)\Big) \quad (21)

where Λ = Σ⁻¹ is the precision matrix.

Page 14: Probability distributions for ml

Mahalanobis Distance: by the eigenvalue decomposition (EVD) of Σ, we get

\Delta^2 = (x-\mu)^\top\Sigma^{-1}(x-\mu) = \sum_{i=1}^{D}\frac{y_i^2}{\lambda_i} \quad (22)

where λ_i and u_i are the eigenvalues and eigenvectors of Σ and y_i = u_i^\top(x-\mu).

Change of Variable in the Gaussian: in the coordinates y, we get

p(y) = p(x)\,|J_{y\to x}| = \prod_{j=1}^{D}\frac{1}{(2\pi\lambda_j)^{1/2}}\exp\Big\{-\frac{y_j^2}{2\lambda_j}\Big\} \quad (23)

which is a product of D independent univariate Gaussian distributions.

First and Second Moments of the Gaussian: using the above, we get

E[x] = \mu, \qquad E[xx^\top] = \mu\mu^\top + \Sigma \quad (24)
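A sketch of (22)-(24) in Python: compute the Mahalanobis distance both directly and through the eigendecomposition of Σ, then check the moments on samples (synthetic µ and Σ; numpy only):

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.6],
                      [0.6, 1.0]])

    x = rng.multivariate_normal(mu, Sigma)

    # direct Mahalanobis distance
    d2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)

    # via the eigendecomposition: Delta^2 = sum_i y_i^2 / lambda_i, eq. (22)
    lam, U = np.linalg.eigh(Sigma)
    y = U.T @ (x - mu)
    d2_evd = np.sum(y**2 / lam)
    print(d2, d2_evd)                                     # should match

    # sample check of the moments, eq. (24)
    X = rng.multivariate_normal(mu, Sigma, size=100_000)
    print(X.mean(axis=0))                                 # ~ mu
    print((X[:, :, None] * X[:, None, :]).mean(axis=0))   # ~ mu mu^T + Sigma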

Page 15: Probability distributions for ml

Limitations of the Gaussian and Remedies: the Gaussian has two main limitations. First, a full covariance matrix contains many parameters to infer. Second, a single Gaussian cannot represent multimodal distributions. We therefore define some auxiliary concepts.

Diagonal Covariance

\Sigma = \mathrm{diag}(s^2) \quad (25)

Isotropic Covariance

\Sigma = \sigma^2 I \quad (26)

Mixture Model

p(x) = \sum_{k=1}^{K}\pi_k\,p(x|k) \quad (27)

where the mixing coefficients π_k are nonnegative and sum to one.

Page 16: Probability distributions for ml

Partitioning the Mahalanobis Distance: first, partition the covariance matrix and the precision matrix,

\Sigma = \begin{pmatrix}\Sigma_{aa} & \Sigma_{ab}\\ \Sigma_{ba} & \Sigma_{bb}\end{pmatrix}, \qquad \Sigma^{-1} = \Lambda = \begin{pmatrix}\Lambda_{aa} & \Lambda_{ab}\\ \Lambda_{ba} & \Lambda_{bb}\end{pmatrix} \quad (28)

where the aa and bb blocks are symmetric and \Sigma_{ba} = \Sigma_{ab}^\top, \Lambda_{ba} = \Lambda_{ab}^\top. Now partition the Mahalanobis distance:

(x-\mu)^\top\Sigma^{-1}(x-\mu) = (x_a-\mu_a)^\top\Lambda_{aa}(x_a-\mu_a) + (x_a-\mu_a)^\top\Lambda_{ab}(x_b-\mu_b) + (x_b-\mu_b)^\top\Lambda_{ba}(x_a-\mu_a) + (x_b-\mu_b)^\top\Lambda_{bb}(x_b-\mu_b) \quad (29)

Schur Complement: as in Gaussian elimination, a block matrix can be inverted using the Schur complement,

\begin{pmatrix}A & B\\ C & D\end{pmatrix}^{-1} = \begin{pmatrix}M & -MBD^{-1}\\ -D^{-1}CM & D^{-1}+D^{-1}CMBD^{-1}\end{pmatrix} \quad (30)

where M = (A - BD^{-1}C)^{-1}.

Page 17: Probability distributions for ml

Schur Complement (cont.): applying this to the partitioned covariance gives

\Lambda_{aa} = (\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1} \quad (31)

\Lambda_{ab} = -(\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba})^{-1}\Sigma_{ab}\Sigma_{bb}^{-1} \quad (32)

Conditional Distribution: therefore, we get

x_a|x_b \sim \mathcal{N}(x|\mu_{a|b}, \Sigma_{a|b}) \quad (33)

where

\mu_{a|b} = \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b) \quad (34)

\Sigma_{a|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba} \quad (35)

Marginal Distribution: integrating out x_b, the log marginal of x_a has the quadratic form

\ln p(x_a) = -\frac{1}{2}x_a^\top(\Lambda_{aa}-\Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})x_a + x_a^\top(\Lambda_{aa}-\Lambda_{ab}\Lambda_{bb}^{-1}\Lambda_{ba})\mu_a + \text{const} \quad (36)

Therefore, we get

x_a \sim \mathcal{N}(x|\mu_a, \Sigma_{aa}) \quad (37)
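A sketch of the conditional formulas (34)-(35) and the marginal (37), using a small hand-picked partitioned covariance (illustrative values only):

    import numpy as np

    # joint Gaussian over (x_a, x_b), each one-dimensional here
    mu = np.array([0.0, 1.0])
    Sigma = np.array([[2.0, 0.8],
                      [0.8, 1.0]])

    ia, ib = [0], [1]                      # index sets of the partition
    mu_a, mu_b = mu[ia], mu[ib]
    S_aa = Sigma[np.ix_(ia, ia)]
    S_ab = Sigma[np.ix_(ia, ib)]
    S_bb = Sigma[np.ix_(ib, ib)]

    x_b = np.array([2.0])                  # observed value of x_b

    # conditional p(x_a | x_b), eqs. (34)-(35)
    mu_cond = mu_a + S_ab @ np.linalg.solve(S_bb, x_b - mu_b)
    S_cond = S_aa - S_ab @ np.linalg.solve(S_bb, S_ab.T)

    # marginal p(x_a), eq. (37): just read off the corresponding block
    print(mu_cond, S_cond, mu_a, S_aa)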

Page 18: Probability distributions for ml

Given a marginal Gaussian for x and a conditional Gaussian for y given x of the form

x \sim \mathcal{N}(x|\mu, \Lambda^{-1}) \quad (38)

y|x \sim \mathcal{N}(y|Ax+b, L^{-1}) \quad (39)

the marginal distribution of y and the conditional distribution of x given y are

y \sim \mathcal{N}(y|A\mu+b,\; L^{-1} + A\Lambda^{-1}A^\top) \quad (40)

x|y \sim \mathcal{N}(x|\Sigma\{A^\top L(y-b) + \Lambda\mu\}, \Sigma) \quad (41)

where

\Sigma = (\Lambda + A^\top L A)^{-1} \quad (42)
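A sketch of equations (40)-(42) with small illustrative matrices (the matrices and observed y are assumptions made up for the example):

    import numpy as np

    # p(x) = N(mu, Lambda^{-1}),  p(y|x) = N(Ax + b, L^{-1})
    mu = np.array([0.0, 0.0])
    Lam = np.eye(2)                        # precision of x
    A = np.array([[1.0, 2.0]])             # maps R^2 -> R^1
    b = np.array([0.5])
    L = np.array([[4.0]])                  # precision of y given x

    Lam_inv, L_inv = np.linalg.inv(Lam), np.linalg.inv(L)

    # marginal of y, eq. (40)
    y_mean = A @ mu + b
    y_cov = L_inv + A @ Lam_inv @ A.T

    # posterior of x given an observed y, eqs. (41)-(42)
    y_obs = np.array([3.0])
    S = np.linalg.inv(Lam + A.T @ L @ A)
    x_mean = S @ (A.T @ L @ (y_obs - b) + Lam @ mu)
    print(y_mean, y_cov, x_mean, S)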

Page 19: Probability distributions for ml

Log-likelihood of the Data: by the same argument as for categorical data, the Gaussian log-likelihood is

\ln p(D|\mu,\Sigma) = -\frac{ND}{2}\ln 2\pi - \frac{N}{2}\ln|\Sigma| - \frac{1}{2}\sum_{n=1}^{N}(x_n-\mu)^\top\Sigma^{-1}(x_n-\mu) \quad (43)

This log-likelihood depends on the data only through the quantities

\sum_{n=1}^{N}x_n, \qquad \sum_{n=1}^{N}x_n x_n^\top \quad (44)

which are therefore called the sufficient statistics.

MLE for the Gaussian: maximizing the log-likelihood gives

\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N}x_n \quad (45)

\Sigma_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})(x_n-\mu_{ML})^\top \quad (46)
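Equations (45)-(46) take one line each in numpy; a sketch on synthetic data (note the 1/N normalization, matching the MLE rather than the unbiased estimator):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.3], [0.3, 2.0]], size=5000)

    mu_ml = X.mean(axis=0)                                   # eq. (45)
    Sigma_ml = (X - mu_ml).T @ (X - mu_ml) / len(X)          # eq. (46), 1/N not 1/(N-1)
    print(mu_ml, Sigma_ml)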

Page 20: Probability distributions for ml

Sequential Estimation: since the Gaussian MLE has a closed form, it can be updated sequentially:

\mu_{ML}^{(N)} = \mu_{ML}^{(N-1)} + \frac{1}{N}\big(x_N - \mu_{ML}^{(N-1)}\big) \quad (47)

Robbins-Monro Algorithm: the same idea generalizes to sequential learning. The Robbins-Monro algorithm finds a root θ* of a regression function f(θ) = E[z|θ], i.e. f(θ*) = 0. Its iteration is

\theta^{(N)} = \theta^{(N-1)} - a_{N-1}\,z(\theta^{(N-1)}) \quad (48)

where z(θ^{(N−1)}) is the observed value of z when θ takes the value θ^{(N−1)}, and {a_N} is a sequence satisfying

\lim_{N\to\infty}a_N = 0, \qquad \sum_{N=1}^{\infty}a_N = \infty, \qquad \sum_{N=1}^{\infty}a_N^2 < \infty \quad (49)
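A sketch of the sequential update (47) next to the Robbins-Monro iteration (48) applied to the Gaussian-mean problem with step sizes a_n = σ²/n (the choice discussed on the next slide); the data stream is synthetic:

    import numpy as np

    rng = np.random.default_rng(0)
    sigma2, mu_true = 1.0, 3.0
    stream = rng.normal(mu_true, np.sqrt(sigma2), size=10_000)

    mu_seq = 0.0          # sequential MLE, eq. (47)
    theta = 0.0           # Robbins-Monro iterate, eq. (48)
    for n, x in enumerate(stream, start=1):
        mu_seq += (x - mu_seq) / n
        a = sigma2 / n                       # step size a_{n-1} = sigma^2 / n
        z = -(x - theta) / sigma2            # z(theta) = -d/d(theta) ln p(x|theta)
        theta = theta - a * z                # identical to the update above
    print(mu_seq, theta, stream.mean())      # all three agree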

Page 21: Probability distributions for ml

Generalized Sequential Learning: the Robbins-Monro algorithm can be applied to maximum-likelihood learning by taking f(θ) to be the expected gradient of the log-likelihood, i.e.

z(\theta) = -\frac{\partial}{\partial\theta}\ln p(x|\theta) \quad (50)

In the Gaussian-mean case, choosing a_N = \sigma^2/N recovers the sequential MLE update (47).

Bayesian Inference for the Mean Given the Variance: since the Gaussian likelihood is the exponential of a quadratic form in µ, we can choose a prior that is also Gaussian. Therefore, if we choose the prior

\mu \sim \mathcal{N}(\mu|\mu_0, \sigma_0^2) \quad (51)

the posterior is

\mu|D \sim \mathcal{N}(\mu|\mu_N, \sigma_N^2) \quad (52)

where

\mu_N = \frac{\sigma^2}{N\sigma_0^2+\sigma^2}\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2+\sigma^2}\mu_{ML}, \qquad \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} \quad (53)
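A sketch of the posterior update (53) for the mean with known variance (prior values and data are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    sigma2 = 1.0                       # known noise variance
    mu0, sigma02 = 0.0, 10.0           # prior N(mu0, sigma0^2)

    x = rng.normal(2.0, np.sqrt(sigma2), size=50)
    N, mu_ml = len(x), x.mean()

    # eq. (53): posterior mean and variance
    mu_N = (sigma2 / (N * sigma02 + sigma2)) * mu0 \
         + (N * sigma02 / (N * sigma02 + sigma2)) * mu_ml
    sigma_N2 = 1.0 / (1.0 / sigma02 + N / sigma2)
    print(mu_N, sigma_N2)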

Page 22: Probability distributions for ml

Bayesian Inference for the Mean Given the Variance (cont.):
1. The posterior mean is a compromise between the prior mean and the MLE.
2. The posterior precision is the prior precision plus one contribution of the data precision for each observed data point.
3. If we take σ₀² → ∞, the posterior mean reduces to the MLE.

Bayesian Inference for the Variance Given the Mean: the Gaussian likelihood is proportional to the product of a power of the precision λ and the exponential of a linear function of λ, so we choose a Gamma prior,

Gam(\lambda|a_0,b_0) = \frac{1}{\Gamma(a_0)}b_0^{a_0}\lambda^{a_0-1}\exp(-b_0\lambda) \quad (54)

Then the posterior is

\lambda|D \sim Gam(\lambda|a_N, b_N) \quad (55)

where

a_N = a_0 + \frac{N}{2}, \qquad b_N = b_0 + \frac{N}{2}\sigma_{ML}^2 \quad (56)

and σ²_ML is the maximum-likelihood variance computed around the given mean.
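A sketch of the Gamma update (56) for the precision with known mean (prior values and data are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    mu = 0.0                               # known mean
    a0, b0 = 1.0, 1.0                      # Gamma prior on the precision lambda

    x = rng.normal(mu, 2.0, size=100)      # true precision is 1/4
    N = len(x)
    sigma2_ml = np.mean((x - mu) ** 2)     # ML variance around the known mean

    aN = a0 + N / 2.0                      # eq. (56)
    bN = b0 + N / 2.0 * sigma2_ml
    print(aN / bN)                         # posterior mean of the precision, ~ 0.25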

Page 23: Probability distributions for ml

Bayesian Inference for the Variance Given the Mean (cont.):
1. The parameter 2a₀ can be interpreted as the effective number of prior observations.
2. The ratio b₀/a₀ can be interpreted as the effective prior value of the variance of those observations.

Bayesian Inference When Mean and Precision Are Both Unknown: applying the same argument to the mean and the precision jointly, the conjugate prior is the normal-gamma distribution

p(\mu,\lambda) = \mathcal{N}(\mu|\mu_0, (\beta\lambda)^{-1})\,Gam(\lambda|a,b) \quad (57)

where

\mu_0 = c/\beta, \qquad a = 1+\beta/2, \qquad b = d - c^2/(2\beta) \quad (58)

Note that the precision of µ is a linear function of λ. For the multivariate case, the conjugate prior is similarly the normal-Wishart distribution

p(\mu,\Lambda|\mu_0,\beta,W,\nu) = \mathcal{N}(\mu|\mu_0, (\beta\Lambda)^{-1})\,\mathcal{W}(\Lambda|W,\nu) \quad (59)

where W denotes the Wishart distribution.

Page 24: Probability distributions for ml

Univariate t-distribution: if we place a Gamma prior on the precision and integrate the precision out, we obtain Student's t-distribution

St(x|\mu,\lambda,\nu) = \frac{\Gamma(\nu/2+1/2)}{\Gamma(\nu/2)}\Big(\frac{\lambda}{\pi\nu}\Big)^{1/2}\Big[1+\frac{\lambda(x-\mu)^2}{\nu}\Big]^{-\nu/2-1/2} \quad (60)

where ν = 2a (the degrees of freedom) and λ = a/b. The t-distribution can thus be viewed as an infinite mixture of Gaussians. Since it has heavier tails than the Gaussian, it yields more robust estimates in the presence of outliers.

Multivariate t-distribution: the multivariate version of this infinite mixture of Gaussians gives the multivariate t-distribution

St(x|\mu,\Lambda,\nu) = \frac{\Gamma(\nu/2+D/2)}{\Gamma(\nu/2)}\frac{|\Lambda|^{1/2}}{(\pi\nu)^{D/2}}\Big[1+\frac{\Delta^2}{\nu}\Big]^{-\nu/2-D/2} \quad (61)

where Δ² = (x − µ)ᵀΛ(x − µ) is the squared Mahalanobis distance.

Page 25: Probability distributions for ml

Topics: Distributions in the exponential family · Sigmoid and Softmax · MLE for the exponential family · Conjugate priors for the exponential family · Noninformative priors

The Exponential Family: the exponential family of distributions over x, given parameters η, is the set of distributions of the form

p(x|\eta) = h(x)\,g(\eta)\exp\{\eta^\top u(x)\} \quad (62)

where η are the natural parameters of the distribution and u(x) is some function of x. The function g(η) can be interpreted as the normalization factor.

Page 26: Probability distributions for ml

Logistic Sigmoid: for the Bernoulli distribution the usual parameter is µ, while the natural parameter is η. The two are connected by

\eta = \ln\Big(\frac{\mu}{1-\mu}\Big), \qquad \mu := \sigma(\eta) = \frac{1}{1+\exp(-\eta)} \quad (63)

and σ(η) is called the logistic sigmoid function.

Softmax Function: by the same argument, for the multinoulli distribution the relationship between the parameters and the natural parameters is given by the softmax function,

\mu_k = \frac{\exp(\eta_k)}{\sum_{j=1}^{K}\exp(\eta_j)} \quad (64)

Note that in this case u(x) = x, h(x) = 1, and g(\eta) = \big(\sum_{j=1}^{K}\exp(\eta_j)\big)^{-1}.
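A sketch of the parameter / natural-parameter maps (63)-(64), using a numerically stable softmax (subtracting the maximum is a standard trick, not something stated on the slides):

    import numpy as np

    def sigmoid(eta):
        # mu = 1 / (1 + exp(-eta)), eq. (63)
        return 1.0 / (1.0 + np.exp(-eta))

    def softmax(eta):
        # mu_k = exp(eta_k) / sum_j exp(eta_j), eq. (64)
        e = np.exp(eta - np.max(eta))
        return e / e.sum()

    mu = 0.8
    eta = np.log(mu / (1 - mu))            # natural parameter of a Bernoulli
    print(eta, sigmoid(eta))               # recovers mu = 0.8

    print(softmax(np.array([2.0, 0.0, -1.0])))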

Page 27: Probability distributions for ml

Gaussian: the Gaussian can also be written as a member of the exponential family, with

u(x) = \begin{pmatrix}x\\ x^2\end{pmatrix} \quad (65)

\eta = \begin{pmatrix}\mu/\sigma^2\\ -1/(2\sigma^2)\end{pmatrix} \quad (66)

g(\eta) = (-2\eta_2)^{1/2}\exp\Big(\frac{\eta_1^2}{4\eta_2}\Big) \quad (67)

Page 28: Probability distributions for ml

Estimating the Natural Parameter: the MLE argument generalizes to the whole exponential family. First, consider the log-likelihood of the data,

\ln p(D|\eta) = \sum_{n=1}^{N}\ln h(x_n) + N\ln g(\eta) + \eta^\top\sum_{n=1}^{N}u(x_n) \quad (68)

Next, find the stationary point of the log-likelihood:

N\nabla_\eta\ln g(\eta) + \sum_{n=1}^{N}u(x_n) = 0 \quad (69)

Therefore, the MLE satisfies

-\nabla_\eta\ln g(\eta) = \frac{1}{N}\sum_{n=1}^{N}u(x_n) \quad (70)

We see that the MLE depends on the data only through \sum_n u(x_n), which is therefore called the sufficient statistic of the exponential family.
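For the Bernoulli case, (70) reduces to matching the mean of the sufficient statistic u(x) = x, so the MLE is again the sample mean; a sketch:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.binomial(1, 0.3, size=1000)

    # sufficient statistic for the Bernoulli: u(x) = x
    u_bar = x.mean()                              # right-hand side of eq. (70)

    # for the Bernoulli, -d/d(eta) ln g(eta) = sigmoid(eta), so eq. (70) reads
    # sigmoid(eta_ML) = u_bar, i.e. eta_ML = logit(u_bar) and mu_ML = u_bar
    eta_ml = np.log(u_bar / (1 - u_bar))
    mu_ml = 1.0 / (1.0 + np.exp(-eta_ml))
    print(eta_ml, mu_ml)                          # mu_ml equals the sample mean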

Page 29: Probability distributions for ml

Conjugate Prior: for any member of the exponential family, there exists a conjugate prior that can be written in the form

p(\eta|\chi,\nu) = f(\chi,\nu)\,g(\eta)^{\nu}\exp\{\nu\,\eta^\top\chi\} \quad (71)

where f(χ, ν) is a normalization factor and g(η) is the same function as in the exponential-family density.

Posterior Distribution: with this conjugate prior, the posterior is

p(\eta|D,\chi,\nu) \propto g(\eta)^{\nu+N}\exp\Big\{\eta^\top\Big(\sum_{n=1}^{N}u(x_n) + \nu\chi\Big)\Big\} \quad (72)

We therefore see that ν can be interpreted as an effective number of pseudo-observations in the prior, each of which has the value χ for the sufficient statistic u(x).

Page 30: Probability distributions for ml

Noninformative Priors: we may seek a form of prior distribution, called a noninformative prior, which is intended to have as little influence on the posterior distribution as possible.

Generalizations of Noninformative Priors: this idea leads to two generalizations, namely the principle of transformation groups, as in the Jeffreys prior, and the principle of maximum entropy.

Page 31: Probability distributions for ml

Topics: Histogram Technique · Kernel Density Estimation · Nearest-Neighbour Methods

Histogram Technique: standard histograms partition x into distinct bins of width ∆_i and count the number n_i of observations of x falling into bin i. To turn this count into a normalized probability density, we divide by the total number N of observations and by the bin width ∆_i, obtaining the density value for each bin

p_i = \frac{n_i}{N\Delta_i} \quad (73)

Limitations of the Histogram: the estimated density has discontinuities at the bin edges that are artifacts of the binning rather than properties of the underlying distribution. The histogram approach also scales poorly with dimensionality, since the number of bins grows exponentially with D.

Lessons of the Histogram: first, to estimate the probability density at a particular location, we should consider the data points that lie within some local neighbourhood of that point. Second, the value of the smoothing parameter should be neither too large nor too small in order to obtain good results.
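A sketch of the histogram density estimate (73) in numpy (synthetic one-dimensional data; numpy's density=True option performs the same normalization):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, size=1000)

    counts, edges = np.histogram(x, bins=20)
    widths = np.diff(edges)                    # Delta_i
    p = counts / (len(x) * widths)             # eq. (73): p_i = n_i / (N * Delta_i)

    # sanity check: the estimated density integrates to one
    print(np.sum(p * widths))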

Page 32: Probability distributions for ml

Motivation: for large N, the binomial distribution of the number K of data points falling within a small region R is sharply peaked around its mean, so

K \simeq NP \quad (74)

If we also assume that the region R is small enough that the probability density p(x) is roughly constant over it, then

P \simeq p(x)V \quad (75)

where V is the volume of R. Therefore,

p(x) = \frac{K}{NV} \quad (76)

Note that these assumptions pull in opposite directions: R must be sufficiently small that the density is approximately constant over the region, yet sufficiently large that the number K of points falling inside it makes the binomial distribution sharply peaked.

Page 33: Probability distributions for ml

Kernel Density Estimation (KDE): if we fix V and determine K from the data, we obtain the kernel approach. For instance, we can count the data points falling within a unit cube centred on x using the function

k(u) = \begin{cases}1, & |u_i| \le 1/2,\ i = 1,\ldots,D\\ 0, & \text{otherwise}\end{cases} \quad (77)

which is called a Parzen window. Using a cube of side h centred on x, the number of points falling inside it is

K = \sum_{n=1}^{N}k\Big(\frac{x-x_n}{h}\Big) \quad (78)

which leads to the density estimate

p(x) = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{h^D}\,k\Big(\frac{x-x_n}{h}\Big) \quad (79)

We can also use a smoother kernel, such as the Gaussian kernel, which gives

p(x) = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{(2\pi h^2)^{D/2}}\exp\Big\{-\frac{\|x-x_n\|^2}{2h^2}\Big\} \quad (80)
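A sketch of the Gaussian-kernel estimator (80) in one dimension (synthetic data, hand-picked bandwidth h; both are assumptions for the example):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(0.0, 1.0, size=500)      # training points x_n
    h = 0.3                                    # bandwidth

    def kde(x, data, h, D=1):
        # p(x) = (1/N) sum_n N(x | x_n, h^2 I), eq. (80)
        diff = x - data
        norm = (2 * np.pi * h**2) ** (D / 2)
        return np.mean(np.exp(-diff**2 / (2 * h**2)) / norm)

    grid = np.linspace(-3, 3, 7)
    print([round(kde(x, data, h), 3) for x in grid])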

Page 34: Probability distributions for ml

Limitation of KDE: one difficulty with the kernel approach to density estimation is that the bandwidth h is fixed for all kernels. In regions of high data density, a large value of h over-smooths the estimate, while in regions of low density a small h leads to noisy, overfitted estimates. The optimal choice of h may therefore depend on the location within the data space.

Nearest Neighbours (NN): we can instead fix K and use the data to determine an appropriate V; this is the K-nearest-neighbour method. In this case the value of K governs the degree of smoothing, and K must be chosen by hyper-parameter optimization.

Error of K-NN: note that, for sufficiently large N, the error rate of the nearest-neighbour (K = 1) classifier is never more than twice the minimum achievable error rate of an optimal classifier.
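A sketch of the K-NN density estimate p(x) = K/(NV) in one dimension, where V is the length of the interval reaching the K-th nearest neighbour (illustrative values only; as noted in the references, this estimator does not integrate to one):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(0.0, 1.0, size=500)
    K, N = 20, len(data)

    def knn_density(x, data, K):
        # distance to the K-th nearest neighbour fixes the region R around x
        r = np.sort(np.abs(data - x))[K - 1]
        V = 2 * r                              # length of the interval [x - r, x + r]
        return K / (N * V)

    print([round(knn_density(x, data, K), 3) for x in [-2.0, 0.0, 2.0]])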
