ECE595 / STAT598: Machine Learning I
Lecture 11: Maximum-Likelihood Estimation

Spring 2020

Stanley Chan
School of Electrical and Computer Engineering
Purdue University

© Stanley Chan 2020. All Rights Reserved.



Overview

In linear discriminant analysis (LDA), there are generally two types of approaches:

Generative approach: Estimate the model, then define the classifier.

Discriminative approach: Directly define the classifier.


Outline

Generative Approaches

Lecture 9 Bayesian Decision Rules

Lecture 10 Evaluating Performance

Lecture 11 Parameter Estimation

Lecture 12 Bayesian Prior

Lecture 13 Connecting Bayesian and Linear Regression

Today’s Lecture
  Basic Principles
    Likelihood Function
    Maximum Likelihood Estimate
    1D Illustration
    Gaussian Distributions
  Examples
    Non-Gaussian Distributions
    Biased and Unbiased Estimators
    From MLE to MAP


What is Parameter Estimation?

The goal of parameter estimation is to determine Θ = (µ, Σ) from the dataset.

This is the step where you use the data.


MLE and MAP

There are two typical ways of estimating parameters.

Maximum-likelihood estimation (MLE): θ is deterministic.

Maximum-a-posteriori estimation (MAP): θ is random and has a prior distribution.


Maximum Likelihood Estimation

Given the dataset D = {x_1, ..., x_N}, how do we estimate the model parameters?

We are going to use the Gaussian as an illustration.

Denote θ as the model parameter. For a Gaussian, θ = {µ, Σ}.

The likelihood for one data point x_n is

$$
p(\boldsymbol{x}_n \mid \boldsymbol{\theta}) = \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}} \exp\left\{ -\frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right\}.
$$

θ is a deterministic quantity, not a random variable.
θ does not have a distribution.
θ is fixed but unknown.


Likelihood for the Entire Dataset

The likelihood for the entire dataset {x_1, ..., x_N} is

$$
p(\mathcal{D} \mid \boldsymbol{\theta})
= \prod_{n=1}^{N} \left\{ \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}} \exp\left\{ -\frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right\} \right\}
= \left( \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}} \right)^{N} \exp\left\{ -\sum_{n=1}^{N} \frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right\}.
$$

The negative log-likelihood is

$$
-\log p(\mathcal{D} \mid \boldsymbol{\theta})
= \frac{N}{2} \log |\boldsymbol{\Sigma}| + \frac{N}{2} \log (2\pi)^d
+ \sum_{n=1}^{N} \frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}).
$$


Maximum Likelihood Estimation

Goal: Find θ that maximizes the likelihood:

$$
\begin{aligned}
\hat{\boldsymbol{\theta}}
&= \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\; p(\mathcal{D} \mid \boldsymbol{\theta}) \\
&= \underset{\boldsymbol{\theta}}{\operatorname{argmax}} \prod_{n=1}^{N} \left\{ \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}} \exp\left\{ -\frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right\} \right\} \\
&= \underset{\boldsymbol{\theta}}{\operatorname{argmin}}\; -\log(\cdots) \\
&= \underset{\boldsymbol{\theta}}{\operatorname{argmin}}\; \frac{N}{2} \log |\boldsymbol{\Sigma}| + \frac{N}{2} \log (2\pi)^d + \sum_{n=1}^{N} \frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}).
\end{aligned}
$$

This optimization is called the maximum-likelihood estimation (MLE).
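For concreteness, here is a minimal NumPy sketch (not from the slides) of the negative log-likelihood above. The function name gaussian_nll and the toy data are illustrative assumptions; the point is that the NLL is smaller at the MLE than at an arbitrary guess.

import numpy as np

def gaussian_nll(X, mu, Sigma):
    """Negative log-likelihood of a d-dimensional Gaussian over the dataset X (N x d)."""
    N, d = X.shape
    diff = X - mu                                    # residuals x_n - mu
    sign, logdet = np.linalg.slogdet(Sigma)          # stable log|Sigma|
    quad = np.einsum('ni,ij,nj->', diff, np.linalg.inv(Sigma), diff)  # sum of quadratic forms
    return 0.5 * (N * logdet + N * d * np.log(2 * np.pi) + quad)

# Hypothetical 2D data.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 1.0]], size=500)

mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X, rowvar=False, bias=True)       # 1/N covariance (the MLE form)
print(gaussian_nll(X, mu_hat, Sigma_hat))            # lower
print(gaussian_nll(X, np.zeros(2), np.eye(2)))       # higher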


Illustrating MLE when N = 1. Known σ.

When N = 1, the MLE solution is

$$
\hat{\mu} = \underset{\mu}{\operatorname{argmax}}\; \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x_1 - \mu)^2}{2\sigma^2} \right\}
= \underset{\mu}{\operatorname{argmin}}\; (x_1 - \mu)^2 = x_1.
$$

Which µ will give you the best Gaussian?
When µ = x_1, the probability of obtaining x_1 is the highest.


Illustrating MLE when N = 2. Known σ.

When N = 2, the MLE solution is

$$
\hat{\mu} = \underset{\mu}{\operatorname{argmax}} \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{2} \exp\left\{ -\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2}{2\sigma^2} \right\}
= \underset{\mu}{\operatorname{argmin}}\; (x_1 - \mu)^2 + (x_2 - \mu)^2 = \frac{x_1 + x_2}{2}.
$$

Which µ will give you the best Gaussian?
When µ = (x_1 + x_2)/2, the probability of obtaining x_1 and x_2 is the highest.


Illustrating MLE when N = arbitrary integer

The MLE solution is

$$
\hat{\mu} = \underset{\mu}{\operatorname{argmax}} \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{N} \exp\left\{ -\sum_{n=1}^{N} \frac{(x_n - \mu)^2}{2\sigma^2} \right\}
= \underset{\mu}{\operatorname{argmin}} \sum_{n=1}^{N} (x_n - \mu)^2 = \frac{1}{N} \sum_{n=1}^{N} x_n.
$$

Which µ will give you the best Gaussian?
When µ = (1/N) Σ_{n=1}^{N} x_n, the probability of obtaining {x_n} is the highest.
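A quick numerical sanity check, assuming NumPy and made-up 1D data: the minimizer of the sum of squared deviations (the µ-dependent part of the NLL when σ is known) coincides with the sample mean.

import numpy as np

# Hypothetical 1D data with known sigma.
rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=200)

# Grid-search the sum of squared deviations over candidate values of mu.
mu_grid = np.linspace(0.0, 6.0, 601)
sse = ((x[None, :] - mu_grid[:, None]) ** 2).sum(axis=1)

print(mu_grid[np.argmin(sse)])   # numerically the minimizer ...
print(x.mean())                  # ... matches the sample mean (up to grid resolution)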


Estimation in High-dimension

Assume Σ is known and fixed. Thus, θ = µ. Estimate µ:

$$
\hat{\boldsymbol{\mu}}
= \underset{\boldsymbol{\mu}}{\operatorname{argmin}}\; \frac{N}{2} \log |\boldsymbol{\Sigma}| + \frac{N}{2} \log (2\pi)^d + \sum_{n=1}^{N} \frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu})
= \underset{\boldsymbol{\mu}}{\operatorname{argmin}} \sum_{n=1}^{N} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}),
$$

since the first two terms do not depend on µ and can be dropped.

Take the derivative and set it to zero:

$$
\nabla_{\boldsymbol{\mu}} \left\{ \sum_{n=1}^{N} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right\}
= -2 \sum_{n=1}^{N} \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) = \boldsymbol{0}.
$$


Estimation in High-dimension

Let us do some algebra:

$$
\sum_{n=1}^{N} \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) = \boldsymbol{0}
\;\Longrightarrow\;
\sum_{n=1}^{N} \boldsymbol{x}_n = \sum_{n=1}^{N} \boldsymbol{\mu} = N \boldsymbol{\mu}.
$$

Then we can show that the MLE solution is

$$
\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{x}_n.
$$

This is just the empirical average of the entire dataset!

You can show that if E[x_n] = µ for all n, then

$$
\mathbb{E}[\hat{\boldsymbol{\mu}}] = \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}[\boldsymbol{x}_n] = \boldsymbol{\mu}.
$$

We say that µ̂ is an unbiased estimator of µ since E[µ̂] = µ.
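A hypothetical Monte Carlo check of this unbiasedness claim (the values and seed are arbitrary): averaging µ̂ over many independent datasets recovers the true mean.

import numpy as np

# Average mu_hat over many independent datasets.
rng = np.random.default_rng(2)
mu_true = np.array([1.0, -2.0])
estimates = []
for _ in range(2000):
    X = rng.multivariate_normal(mean=mu_true, cov=np.eye(2), size=50)
    estimates.append(X.mean(axis=0))     # mu_hat for this dataset

print(np.mean(estimates, axis=0))        # close to mu_true, consistent with E[mu_hat] = mu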


When both µ and Σ are Unknown

What will be the MLE when both µ and Σ are unknown?

$$
(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}})
= \underset{\boldsymbol{\mu}, \boldsymbol{\Sigma}}{\operatorname{argmin}}\;
\underbrace{\frac{N}{2} \log |\boldsymbol{\Sigma}| + \frac{N}{2} \log (2\pi)^d + \sum_{n=1}^{N} \frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu})}_{\varphi(\boldsymbol{\mu}, \boldsymbol{\Sigma})}.
$$

You need to take the derivative with respect to µ and Σ, and solve

$$
\nabla_{\boldsymbol{\mu}} \varphi(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \boldsymbol{0},
\qquad
\nabla_{\boldsymbol{\Sigma}} \varphi(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \boldsymbol{0}.
$$

With some (tedious) matrix calculus, we can show that

$$
\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{x}_n,
\quad \text{and} \quad
\hat{\boldsymbol{\Sigma}} = \frac{1}{N} \sum_{n=1}^{N} (\boldsymbol{x}_n - \hat{\boldsymbol{\mu}})(\boldsymbol{x}_n - \hat{\boldsymbol{\mu}})^T.
$$

Exercise: Prove this result when x_n is a 1D scalar.
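A minimal NumPy sketch of these two formulas, with illustrative data; gaussian_mle is a made-up helper name, not a library function.

import numpy as np

def gaussian_mle(X):
    """MLE of a Gaussian: sample mean and 1/N sample covariance (X has shape N x d)."""
    N = X.shape[0]
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / N     # note 1/N, not 1/(N-1)
    return mu_hat, Sigma_hat

rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[0.0, 1.0], cov=[[1.0, 0.5], [0.5, 2.0]], size=1000)
mu_hat, Sigma_hat = gaussian_mle(X)
print(mu_hat)
print(Sigma_hat)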


Outline

Generative Approaches

Lecture 9 Bayesian Decision Rules

Lecture 10 Evaluating Performance

Lecture 11 Parameter Estimation

Lecture 12 Bayesian Prior

Lecture 13 Connecting Bayesian and Linear Regression

Today’s Lecture
  Basic Principles
    Likelihood Function
    Maximum Likelihood Estimate
    1D Illustration
    Gaussian Distributions
  Examples
    Non-Gaussian Distributions
    Biased and Unbiased Estimators
    From MLE to MAP


MLE does not need to be Gaussian

Suppose that x_n ~ Bernoulli(θ). Then,

$$
p(x_n \mid \theta) =
\begin{cases}
\theta, & \text{if } x_n = 1, \\
1 - \theta, & \text{if } x_n = 0.
\end{cases}
$$

We can write the likelihood function as

$$
p(\mathcal{D} \mid \theta)
= \prod_{n=1}^{N} \theta^{x_n} (1 - \theta)^{1 - x_n}
= \theta^{\sum_{n=1}^{N} x_n} (1 - \theta)^{\sum_{n=1}^{N} (1 - x_n)},
$$

and so the negative log-likelihood is

$$
-\log p(\mathcal{D} \mid \theta)
= -\log\left\{ \theta^{\sum_{n=1}^{N} x_n} (1 - \theta)^{\sum_{n=1}^{N} (1 - x_n)} \right\}
= -\left( \sum_{n=1}^{N} x_n \right) \log \theta - \left( \sum_{n=1}^{N} (1 - x_n) \right) \log (1 - \theta).
$$


Bernoulli MLE

Taking the derivative and setting it to zero:

$$
\frac{d}{d\theta} \left\{ -\log p(\mathcal{D} \mid \theta) \right\}
= \frac{d}{d\theta} \left\{ -\left( \sum_{n=1}^{N} x_n \right) \log \theta - \left( \sum_{n=1}^{N} (1 - x_n) \right) \log (1 - \theta) \right\}
= -\left( \sum_{n=1}^{N} x_n \right) \frac{1}{\theta} + \left( \sum_{n=1}^{N} (1 - x_n) \right) \frac{1}{1 - \theta}.
$$

Setting this to zero, we have

$$
\begin{aligned}
\left( \sum_{n=1}^{N} x_n \right) \frac{1}{\theta} &= \left( \sum_{n=1}^{N} (1 - x_n) \right) \frac{1}{1 - \theta} \\
\left( \sum_{n=1}^{N} x_n \right) (1 - \theta) &= \left( \sum_{n=1}^{N} (1 - x_n) \right) \theta \\
\left( \sum_{n=1}^{N} x_n \right) - \left( \sum_{n=1}^{N} x_n \right) \theta &= N\theta - \left( \sum_{n=1}^{N} x_n \right) \theta.
\end{aligned}
$$

Therefore,

$$
\hat{\theta} = \frac{1}{N} \sum_{n=1}^{N} x_n.
$$
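A small sketch, assuming NumPy and synthetic coin-flip data: the closed-form MLE (the sample mean) agrees with a brute-force minimization of the negative log-likelihood above.

import numpy as np

# Hypothetical coin-flip data with true theta = 0.3.
rng = np.random.default_rng(4)
x = rng.binomial(n=1, p=0.3, size=1000)

# Closed-form MLE: the sample mean.
theta_hat = x.mean()

# Cross-check by minimizing the negative log-likelihood over a grid.
theta_grid = np.linspace(0.001, 0.999, 999)
nll = -(x.sum() * np.log(theta_grid) + (len(x) - x.sum()) * np.log(1 - theta_grid))

print(theta_hat, theta_grid[np.argmin(nll)])   # the two estimates should agree closely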


Unbiased and Consistent Estimators

$$
\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{x}_n,
\quad \text{and} \quad
\hat{\boldsymbol{\Sigma}} = \frac{1}{N} \sum_{n=1}^{N} (\boldsymbol{x}_n - \hat{\boldsymbol{\mu}})(\boldsymbol{x}_n - \hat{\boldsymbol{\mu}})^T.
$$

µ̂ is the empirical average (or the sample mean).
Σ̂ is the empirical covariance (or the sample covariance).

µ̂ is an unbiased estimate of µ: E[µ̂] = µ for all N.
µ̂ is a consistent estimate of µ: as N → ∞, µ̂ → µ in probability.

E[Σ̂] = ((N − 1)/N) Σ.
Σ̂ is a biased estimate of Σ: E[Σ̂] ≠ Σ.
Σ̂ is a consistent estimate of Σ: as N → ∞, Σ̂ → Σ in probability.

You can make Σ̂ unbiased by defining

$$
\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{x}_n,
\quad \text{and} \quad
\hat{\boldsymbol{\Sigma}} = \frac{1}{N - 1} \sum_{n=1}^{N} (\boldsymbol{x}_n - \hat{\boldsymbol{\mu}})(\boldsymbol{x}_n - \hat{\boldsymbol{\mu}})^T.
$$
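A hypothetical 1D simulation of the bias (all values arbitrary): the 1/N estimator concentrates around ((N − 1)/N) σ², while the 1/(N − 1) estimator concentrates around σ².

import numpy as np

# Compare the 1/N and 1/(N-1) variance estimators over many trials.
rng = np.random.default_rng(5)
sigma2_true, N, trials = 4.0, 10, 20000
var_mle, var_unbiased = [], []
for _ in range(trials):
    x = rng.normal(loc=0.0, scale=np.sqrt(sigma2_true), size=N)
    var_mle.append(((x - x.mean()) ** 2).sum() / N)
    var_unbiased.append(((x - x.mean()) ** 2).sum() / (N - 1))

print(np.mean(var_mle))        # about (N-1)/N * sigma2_true = 3.6
print(np.mean(var_unbiased))   # about sigma2_true = 4.0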


Unbiased and Consistent Estimator

Where does (N − 1)/N come from? Here is a 1D explanation. Assume µ = 0.

$$
\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n,
\quad \text{and} \quad
\hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{\mu})^2.
$$

Taking the expectation yields

$$
\begin{aligned}
\mathbb{E}[\hat{\sigma}^2]
&= \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}\left[ (x_n - \hat{\mu})^2 \right] \\
&= \frac{1}{N} \sum_{n=1}^{N} \left\{ \mathbb{E}[x_n^2] - 2\,\mathbb{E}[\hat{\mu} x_n] + \mathbb{E}[\hat{\mu}^2] \right\} \\
&= \frac{1}{N} \sum_{n=1}^{N} \left\{ \sigma^2 - 2\,\mathbb{E}\left[ \frac{1}{N} \sum_{j=1}^{N} x_j x_n \right] + \mathbb{E}\left[ \left( \frac{1}{N} \sum_{j=1}^{N} x_j \right)^2 \right] \right\}.
\end{aligned}
$$


Unbiased and Consistent Estimator

$$
\begin{aligned}
\mathbb{E}[\hat{\sigma}^2]
&= \frac{1}{N} \sum_{n=1}^{N} \Bigg\{ \sigma^2
- 2\,\underbrace{\mathbb{E}\bigg[ \frac{1}{N} \sum_{j=1}^{N} x_j x_n \bigg]}_{\substack{= \frac{1}{N}\,\mathbb{E}[x_1 x_n + \cdots + x_N x_n] \\ = \frac{1}{N}(0 + \cdots + \sigma^2 + \cdots + 0) = \frac{\sigma^2}{N}}}
+ \underbrace{\mathbb{E}\bigg[ \bigg( \frac{1}{N} \sum_{j=1}^{N} x_j \bigg)^{2} \bigg]}_{\substack{= \frac{1}{N^2} \big( \sum_{j} \mathbb{E}[x_j^2] + \sum_{j \neq k} \mathbb{E}[x_j x_k] \big) \\ = \frac{1}{N^2}(N\sigma^2 + 0) = \frac{\sigma^2}{N}}}
\Bigg\} \\
&= \frac{1}{N} \sum_{n=1}^{N} \left\{ \sigma^2 - \frac{2}{N}\sigma^2 + \frac{1}{N}\sigma^2 \right\}
= \frac{N - 1}{N}\,\sigma^2.
\end{aligned}
$$


From ML to Decision Boundary

If you have a training set D, ...

Partition it according to the labels: D^(1) and D^(2).

Pick a model, e.g., Gaussian.

For each class, estimate the model parameters:
  µ_1, Σ_1 for D^(1)
  µ_2, Σ_2 for D^(2)

Construct a discriminant function

$$
g(\boldsymbol{x}) = \boldsymbol{x}^T (\boldsymbol{W}_1 - \boldsymbol{W}_2) \boldsymbol{x} + (\boldsymbol{w}_1 - \boldsymbol{w}_2)^T \boldsymbol{x} + (w_{10} - w_{20}).
$$

See the Bayesian Decision Rule lecture (Lecture 9) for the formula.

Define the hypothesis function

$$
h(\boldsymbol{x}) =
\begin{cases}
1, & \text{if } g(\boldsymbol{x}) > 0, \\
0, & \text{if } g(\boldsymbol{x}) < 0.
\end{cases}
$$
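A minimal sketch of this pipeline in NumPy, assuming synthetic two-class data. The (W_i, w_i, w_i0) coefficients below follow one standard Gaussian-discriminant parameterization; the constants in the Bayesian Decision Rule lecture may be written differently, and fit_class / make_classifier are illustrative names, not anything from the slides.

import numpy as np

def fit_class(X, prior):
    """MLE parameters for one class plus quadratic discriminant coefficients (one common form)."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False, bias=True)
    Sinv = np.linalg.inv(Sigma)
    W = -0.5 * Sinv
    w = Sinv @ mu
    w0 = -0.5 * mu @ Sinv @ mu - 0.5 * np.linalg.slogdet(Sigma)[1] + np.log(prior)
    return W, w, w0

def make_classifier(X1, X2):
    N1, N2 = len(X1), len(X2)
    W1, w1, w10 = fit_class(X1, N1 / (N1 + N2))
    W2, w2, w20 = fit_class(X2, N2 / (N1 + N2))
    def h(x):
        g = x @ (W1 - W2) @ x + (w1 - w2) @ x + (w10 - w20)
        return 1 if g > 0 else 0
    return h

# Hypothetical training data for the two classes.
rng = np.random.default_rng(6)
X1 = rng.multivariate_normal([0, 0], np.eye(2), size=200)
X2 = rng.multivariate_normal([3, 3], [[2, 0.5], [0.5, 1]], size=200)
h = make_classifier(X1, X2)
print(h(np.array([0.2, -0.1])), h(np.array([2.8, 3.1])))   # expected: 1, 0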


How well do you do?

You have a dataset {x_1, ..., x_N}.
You estimate µ by solving the maximum-likelihood problem.

You get an estimate

$$
\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n.
$$

How good is your estimate?

Hoeffding inequality, in the 1D case:

$$
\mathbb{P}\left[ |\hat{\mu} - \mu| > \epsilon \right] \le 2 e^{-2\epsilon^2 N}.
$$

As N → ∞, µ̂ → µ with very high probability.

Overall performance:
Is µ̂ ≈ µ?
Is µ̂ giving a good classifier?
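A hypothetical check of the Hoeffding bound, assuming NumPy and samples bounded in [0, 1] (e.g., Bernoulli), which is the setting where the stated bound applies; the values are arbitrary.

import numpy as np

rng = np.random.default_rng(7)
mu, N, eps, trials = 0.5, 100, 0.1, 50000

x = rng.binomial(1, mu, size=(trials, N))
mu_hat = x.mean(axis=1)

empirical = np.mean(np.abs(mu_hat - mu) > eps)
bound = 2 * np.exp(-2 * eps**2 * N)
print(empirical, bound)   # the empirical probability should not exceed the bound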


MLE and MAP

There are two typical ways of estimating parameters.

Maximum-likelihood estimation (MLE): θ is deterministic.

Maximum-a-posteriori estimation (MAP): θ is random and has a prior distribution.


From MLE to MAP

Likelihood:

$$
p(x_n \mid \theta_1) = \mathcal{N}(x_n \mid \theta_1, \sigma_1^2),
\quad \text{and} \quad
p(x_n \mid \theta_2) = \mathcal{N}(x_n \mid \theta_2, \sigma_2^2).
$$

Maximum-likelihood: You know nothing about θ_1 and θ_2, so you need to take measurements to estimate θ_1 and θ_2.

Maximum-a-Posteriori: You know something about θ_1 and θ_2.

Prior:

$$
p(\theta_1) = \mathcal{N}(\theta_1 \mid \mu_1, \gamma_1^2),
\quad \text{and} \quad
p(\theta_2) = \mathcal{N}(\theta_2 \mid \mu_2, \gamma_2^2).
$$


From MLE to MAP

In MLE, the parameter θ is deterministic.
What if we assume θ has a distribution?
This makes θ probabilistic.
So we treat Θ as a random variable, and θ as a state (realization) of Θ.

Distribution of Θ: p_Θ(θ).

p_Θ(θ) is the distribution of the parameter Θ.
Θ has its own mean and its own variance.


Maximum-a-Posteriori

By Bayes' Theorem again:

$$
p_{\Theta \mid X}(\theta \mid \boldsymbol{x}_n) = \frac{p_{X \mid \Theta}(\boldsymbol{x}_n \mid \theta)\, p_{\Theta}(\theta)}{p_X(\boldsymbol{x}_n)}.
$$

To maximize the posterior distribution:

$$
\begin{aligned}
\hat{\theta}
&= \underset{\theta}{\operatorname{argmax}}\; p_{\Theta \mid X}(\theta \mid \mathcal{D}) \\
&= \underset{\theta}{\operatorname{argmax}} \prod_{n=1}^{N} p_{\Theta \mid X}(\theta \mid \boldsymbol{x}_n) \\
&= \underset{\theta}{\operatorname{argmax}} \prod_{n=1}^{N} \frac{p_{X \mid \Theta}(\boldsymbol{x}_n \mid \theta)\, p_{\Theta}(\theta)}{p_X(\boldsymbol{x}_n)}
\qquad \text{(the denominator does not depend on } \theta \text{ and can be dropped)} \\
&= \underset{\theta}{\operatorname{argmin}}\; -\sum_{n=1}^{N} \left\{ \log p_{X \mid \Theta}(\boldsymbol{x}_n \mid \theta) + \log p_{\Theta}(\theta) \right\}.
\end{aligned}
$$
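A minimal MAP sketch, not from the slides: a Gaussian likelihood with a Gaussian prior on θ, estimated by a grid search over the negative log-posterior above. The closed-form line assumes this particular conjugate Gaussian-Gaussian model and is included only as a cross-check; all names and values are illustrative.

import numpy as np

rng = np.random.default_rng(8)
sigma, mu0, gamma = 1.0, 0.0, 0.5        # likelihood std, prior mean, prior std
theta_true, N = 2.0, 20
x = rng.normal(theta_true, sigma, size=N)

# Negative log-posterior (up to constants) on a grid of candidate theta values.
theta_grid = np.linspace(-1.0, 4.0, 5001)
neg_log_post = (((x[None, :] - theta_grid[:, None]) ** 2).sum(axis=1) / (2 * sigma**2)
                + (theta_grid - mu0) ** 2 / (2 * gamma**2))
theta_map = theta_grid[np.argmin(neg_log_post)]

# Closed form for this conjugate model (obtained by setting the derivative to zero).
theta_closed = (gamma**2 * x.sum() + sigma**2 * mu0) / (N * gamma**2 + sigma**2)
print(theta_map, theta_closed, x.mean())   # MAP is pulled from the MLE (sample mean) toward mu0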


Reading List

Maximum Likelihood Estimation

Duda-Hart-Stork, Pattern Classification, Chapter 3.2

Iowa State EE 527: http://www.ece.iastate.edu/~namrata/EE527_Spring08/l5.pdf

Purdue ECE 645, Lecture 18-20: https://engineering.purdue.edu/ChanGroup/ECE645.html

UCSD ECE 271A, Lecture 6: http://www.svcl.ucsd.edu/courses/ece271A/ece271A.htm

Univ. Orleans: https://www.univ-orleans.fr/deg/masters/ESA/CH/Chapter2_MLE.pdf