ECE595 / STAT598: Machine Learning I
Lecture 11: Maximum-Likelihood Estimation

Spring 2020

Stanley Chan
School of Electrical and Computer Engineering
Purdue University

© Stanley Chan 2020. All Rights Reserved.



Overview

In linear discriminant analysis (LDA), there are generally two types of approaches:

Generative approach: Estimate the model, then define the classifier.

Discriminative approach: Directly define the classifier.


Outline

Generative Approaches

Lecture 9 Bayesian Decision Rules

Lecture 10 Evaluating Performance

Lecture 11 Parameter Estimation

Lecture 12 Bayesian Prior

Lecture 13 Connecting Bayesian and Linear Regression

Today’s Lecture
  Basic Principles
    Likelihood Function
    Maximum Likelihood Estimate
    1D Illustration
    Gaussian Distributions
  Examples
    Non-Gaussian Distributions
    Biased and Unbiased Estimators
    From MLE to MAP


What is Parameter Estimation?

The goal of parameter estimation is to determine Θ = (µ, Σ) from the dataset.

This is the step where you use the data.


MLE and MAP

There are two typical ways of estimating parameters.

Maximum-likelihood estimation (MLE): θ is deterministic.

Maximum-a-posteriori estimation (MAP): θ is random and has a prior distribution.


Maximum Likelihood Estimation

Given the dataset D = {x_1, ..., x_N}, how do we estimate the model parameters?

We are going to use the Gaussian as an illustration.

Denote θ as the model parameter. For a Gaussian, θ = {µ, Σ}.

The likelihood for one data point x_n is

$$
p(\boldsymbol{x}_n \mid \boldsymbol{\theta}) = \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}} \exp\left\{ -\frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right\}.
$$

θ is a deterministic quantity, not a random variable.
θ does not have a distribution.
θ is fixed but unknown.


Likelihood for the Entire Dataset

The likelihood for the entire dataset {x_1, ..., x_N} is

$$
p(\mathcal{D} \mid \boldsymbol{\theta})
= \prod_{n=1}^{N} \left\{ \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}} \exp\left\{ -\frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right\} \right\}
= \left( \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}} \right)^{N} \exp\left\{ -\sum_{n=1}^{N} \frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right\}.
$$

The negative log-likelihood is

$$
-\log p(\mathcal{D} \mid \boldsymbol{\theta})
= \frac{N}{2} \log |\boldsymbol{\Sigma}| + \frac{N}{2} \log (2\pi)^d
+ \sum_{n=1}^{N} \frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}).
$$


Maximum Likelihood Estimation

Goal: Find θ that maximizes the likelihood:

$$
\begin{aligned}
\hat{\boldsymbol{\theta}}
&= \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\; p(\mathcal{D} \mid \boldsymbol{\theta}) \\
&= \underset{\boldsymbol{\theta}}{\operatorname{argmax}} \prod_{n=1}^{N} \left\{ \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}} \exp\left\{ -\frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right\} \right\} \\
&= \underset{\boldsymbol{\theta}}{\operatorname{argmin}}\; -\log(\cdots) \\
&= \underset{\boldsymbol{\theta}}{\operatorname{argmin}}\; \frac{N}{2} \log |\boldsymbol{\Sigma}| + \frac{N}{2} \log (2\pi)^d + \sum_{n=1}^{N} \frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}).
\end{aligned}
$$

This optimization is called the maximum-likelihood estimation (MLE).
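For concreteness, here is a minimal NumPy sketch (not from the slides) of the negative log-likelihood above. The function name gaussian_nll and the toy data are illustrative assumptions; the point is that the NLL is smaller at the MLE than at an arbitrary guess.

import numpy as np

def gaussian_nll(X, mu, Sigma):
    """Negative log-likelihood of a d-dimensional Gaussian over the dataset X (N x d)."""
    N, d = X.shape
    diff = X - mu                                    # residuals x_n - mu
    sign, logdet = np.linalg.slogdet(Sigma)          # stable log|Sigma|
    quad = np.einsum('ni,ij,nj->', diff, np.linalg.inv(Sigma), diff)  # sum of quadratic forms
    return 0.5 * (N * logdet + N * d * np.log(2 * np.pi) + quad)

# Hypothetical 2D data.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 1.0]], size=500)

mu_hat = X.mean(axis=0)
Sigma_hat = np.cov(X, rowvar=False, bias=True)       # 1/N covariance (the MLE form)
print(gaussian_nll(X, mu_hat, Sigma_hat))            # lower
print(gaussian_nll(X, np.zeros(2), np.eye(2)))       # higher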


Illustrating MLE when N = 1. Known σ.

When N = 1, the MLE solution is

$$
\hat{\mu} = \underset{\mu}{\operatorname{argmax}}\; \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x_1 - \mu)^2}{2\sigma^2} \right\}
= \underset{\mu}{\operatorname{argmin}}\; (x_1 - \mu)^2 = x_1.
$$

Which µ will give you the best Gaussian?
When µ = x_1, the probability of obtaining x_1 is the highest.


Illustrating MLE when N = 2. Known σ.

When N = 2, the MLE solution is

$$
\hat{\mu} = \underset{\mu}{\operatorname{argmax}} \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{2} \exp\left\{ -\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2}{2\sigma^2} \right\}
= \underset{\mu}{\operatorname{argmin}}\; (x_1 - \mu)^2 + (x_2 - \mu)^2 = \frac{x_1 + x_2}{2}.
$$

Which µ will give you the best Gaussian?
When µ = (x_1 + x_2)/2, the probability of obtaining x_1 and x_2 is the highest.


Illustrating MLE when N = arbitrary integer

The MLE solution is

$$
\hat{\mu} = \underset{\mu}{\operatorname{argmax}} \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^{N} \exp\left\{ -\sum_{n=1}^{N} \frac{(x_n - \mu)^2}{2\sigma^2} \right\}
= \underset{\mu}{\operatorname{argmin}} \sum_{n=1}^{N} (x_n - \mu)^2 = \frac{1}{N} \sum_{n=1}^{N} x_n.
$$

Which µ will give you the best Gaussian?
When µ = (1/N) Σ_{n=1}^{N} x_n, the probability of obtaining {x_n} is the highest.
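A quick numerical sanity check, assuming NumPy and made-up 1D data: the minimizer of the sum of squared deviations (the µ-dependent part of the NLL when σ is known) coincides with the sample mean.

import numpy as np

# Hypothetical 1D data with known sigma.
rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=200)

# Grid-search the sum of squared deviations over candidate values of mu.
mu_grid = np.linspace(0.0, 6.0, 601)
sse = ((x[None, :] - mu_grid[:, None]) ** 2).sum(axis=1)

print(mu_grid[np.argmin(sse)])   # numerically the minimizer ...
print(x.mean())                  # ... matches the sample mean (up to grid resolution)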


Estimation in High-dimension

Assume Σ is known and fixed. Thus, θ = µ. Estimate µ:

$$
\hat{\boldsymbol{\mu}}
= \underset{\boldsymbol{\mu}}{\operatorname{argmin}}\; \frac{N}{2} \log |\boldsymbol{\Sigma}| + \frac{N}{2} \log (2\pi)^d + \sum_{n=1}^{N} \frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu})
= \underset{\boldsymbol{\mu}}{\operatorname{argmin}} \sum_{n=1}^{N} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}),
$$

since the first two terms do not depend on µ and can be dropped.

Take the derivative and set it to zero:

$$
\nabla_{\boldsymbol{\mu}} \left\{ \sum_{n=1}^{N} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) \right\}
= -2 \sum_{n=1}^{N} \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) = \boldsymbol{0}.
$$


Estimation in High-dimension

Let us do some algebra:

$$
\sum_{n=1}^{N} \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu}) = \boldsymbol{0}
\;\Longrightarrow\;
\sum_{n=1}^{N} \boldsymbol{x}_n = \sum_{n=1}^{N} \boldsymbol{\mu} = N \boldsymbol{\mu}.
$$

Then we can show that the MLE solution is

$$
\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{x}_n.
$$

This is just the empirical average of the entire dataset!

You can show that if E[x_n] = µ for all n, then

$$
\mathbb{E}[\hat{\boldsymbol{\mu}}] = \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}[\boldsymbol{x}_n] = \boldsymbol{\mu}.
$$

We say that µ̂ is an unbiased estimator of µ since E[µ̂] = µ.
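A hypothetical Monte Carlo check of this unbiasedness claim (the values and seed are arbitrary): averaging µ̂ over many independent datasets recovers the true mean.

import numpy as np

# Average mu_hat over many independent datasets.
rng = np.random.default_rng(2)
mu_true = np.array([1.0, -2.0])
estimates = []
for _ in range(2000):
    X = rng.multivariate_normal(mean=mu_true, cov=np.eye(2), size=50)
    estimates.append(X.mean(axis=0))     # mu_hat for this dataset

print(np.mean(estimates, axis=0))        # close to mu_true, consistent with E[mu_hat] = mu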


When both µ and Σ are Unknown

What will be the MLE when both µ and Σ are unknown?

$$
(\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\Sigma}})
= \underset{\boldsymbol{\mu}, \boldsymbol{\Sigma}}{\operatorname{argmin}}\;
\underbrace{\frac{N}{2} \log |\boldsymbol{\Sigma}| + \frac{N}{2} \log (2\pi)^d + \sum_{n=1}^{N} \frac{1}{2} (\boldsymbol{x}_n - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}_n - \boldsymbol{\mu})}_{\varphi(\boldsymbol{\mu}, \boldsymbol{\Sigma})}.
$$

You need to take the derivative with respect to µ and Σ, and solve

$$
\nabla_{\boldsymbol{\mu}} \varphi(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \boldsymbol{0},
\qquad
\nabla_{\boldsymbol{\Sigma}} \varphi(\boldsymbol{\mu}, \boldsymbol{\Sigma}) = \boldsymbol{0}.
$$

With some (tedious) matrix calculus, we can show that

$$
\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{x}_n,
\quad \text{and} \quad
\hat{\boldsymbol{\Sigma}} = \frac{1}{N} \sum_{n=1}^{N} (\boldsymbol{x}_n - \hat{\boldsymbol{\mu}})(\boldsymbol{x}_n - \hat{\boldsymbol{\mu}})^T.
$$

Exercise: Prove this result when x_n is a 1D scalar.
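A minimal NumPy sketch of these two formulas, with illustrative data; gaussian_mle is a made-up helper name, not a library function.

import numpy as np

def gaussian_mle(X):
    """MLE of a Gaussian: sample mean and 1/N sample covariance (X has shape N x d)."""
    N = X.shape[0]
    mu_hat = X.mean(axis=0)
    centered = X - mu_hat
    Sigma_hat = centered.T @ centered / N     # note 1/N, not 1/(N-1)
    return mu_hat, Sigma_hat

rng = np.random.default_rng(3)
X = rng.multivariate_normal(mean=[0.0, 1.0], cov=[[1.0, 0.5], [0.5, 2.0]], size=1000)
mu_hat, Sigma_hat = gaussian_mle(X)
print(mu_hat)
print(Sigma_hat)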


Outline

Generative Approaches

Lecture 9 Bayesian Decision Rules

Lecture 10 Evaluating Performance

Lecture 11 Parameter Estimation

Lecture 12 Bayesian Prior

Lecture 13 Connecting Bayesian and Linear Regression

Today’s Lecture
  Basic Principles
    Likelihood Function
    Maximum Likelihood Estimate
    1D Illustration
    Gaussian Distributions
  Examples
    Non-Gaussian Distributions
    Biased and Unbiased Estimators
    From MLE to MAP


MLE does not need to be Gaussian

Suppose that x_n ~ Bernoulli(θ). Then,

$$
p(x_n \mid \theta) =
\begin{cases}
\theta, & \text{if } x_n = 1, \\
1 - \theta, & \text{if } x_n = 0.
\end{cases}
$$

We can write the likelihood function as

$$
p(\mathcal{D} \mid \theta)
= \prod_{n=1}^{N} \theta^{x_n} (1 - \theta)^{1 - x_n}
= \theta^{\sum_{n=1}^{N} x_n} (1 - \theta)^{\sum_{n=1}^{N} (1 - x_n)},
$$

and so the negative log-likelihood is

$$
-\log p(\mathcal{D} \mid \theta)
= -\log\left\{ \theta^{\sum_{n=1}^{N} x_n} (1 - \theta)^{\sum_{n=1}^{N} (1 - x_n)} \right\}
= -\left( \sum_{n=1}^{N} x_n \right) \log \theta - \left( \sum_{n=1}^{N} (1 - x_n) \right) \log (1 - \theta).
$$


Bernoulli MLE

Taking the derivative and setting it to zero:

$$
\frac{d}{d\theta} \left\{ -\log p(\mathcal{D} \mid \theta) \right\}
= \frac{d}{d\theta} \left\{ -\left( \sum_{n=1}^{N} x_n \right) \log \theta - \left( \sum_{n=1}^{N} (1 - x_n) \right) \log (1 - \theta) \right\}
= -\left( \sum_{n=1}^{N} x_n \right) \frac{1}{\theta} + \left( \sum_{n=1}^{N} (1 - x_n) \right) \frac{1}{1 - \theta}.
$$

Setting this to zero, we have

$$
\begin{aligned}
\left( \sum_{n=1}^{N} x_n \right) \frac{1}{\theta} &= \left( \sum_{n=1}^{N} (1 - x_n) \right) \frac{1}{1 - \theta} \\
\left( \sum_{n=1}^{N} x_n \right) (1 - \theta) &= \left( \sum_{n=1}^{N} (1 - x_n) \right) \theta \\
\left( \sum_{n=1}^{N} x_n \right) - \left( \sum_{n=1}^{N} x_n \right) \theta &= N\theta - \left( \sum_{n=1}^{N} x_n \right) \theta.
\end{aligned}
$$

Therefore,

$$
\hat{\theta} = \frac{1}{N} \sum_{n=1}^{N} x_n.
$$
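A small sketch, assuming NumPy and synthetic coin-flip data: the closed-form MLE (the sample mean) agrees with a brute-force minimization of the negative log-likelihood above.

import numpy as np

# Hypothetical coin-flip data with true theta = 0.3.
rng = np.random.default_rng(4)
x = rng.binomial(n=1, p=0.3, size=1000)

# Closed-form MLE: the sample mean.
theta_hat = x.mean()

# Cross-check by minimizing the negative log-likelihood over a grid.
theta_grid = np.linspace(0.001, 0.999, 999)
nll = -(x.sum() * np.log(theta_grid) + (len(x) - x.sum()) * np.log(1 - theta_grid))

print(theta_hat, theta_grid[np.argmin(nll)])   # the two estimates should agree closely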


Unbiased and Consistent Estimators

$$
\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{x}_n,
\quad \text{and} \quad
\hat{\boldsymbol{\Sigma}} = \frac{1}{N} \sum_{n=1}^{N} (\boldsymbol{x}_n - \hat{\boldsymbol{\mu}})(\boldsymbol{x}_n - \hat{\boldsymbol{\mu}})^T.
$$

µ̂ is the empirical average (or the sample mean).
Σ̂ is the empirical covariance (or the sample covariance).

µ̂ is an unbiased estimate of µ: E[µ̂] = µ for all N.
µ̂ is a consistent estimate of µ: as N → ∞, µ̂ → µ in probability.

E[Σ̂] = ((N − 1)/N) Σ.
Σ̂ is a biased estimate of Σ: E[Σ̂] ≠ Σ.
Σ̂ is a consistent estimate of Σ: as N → ∞, Σ̂ → Σ in probability.

You can make Σ̂ unbiased by defining

$$
\hat{\boldsymbol{\mu}} = \frac{1}{N} \sum_{n=1}^{N} \boldsymbol{x}_n,
\quad \text{and} \quad
\hat{\boldsymbol{\Sigma}} = \frac{1}{N - 1} \sum_{n=1}^{N} (\boldsymbol{x}_n - \hat{\boldsymbol{\mu}})(\boldsymbol{x}_n - \hat{\boldsymbol{\mu}})^T.
$$
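A hypothetical 1D simulation of the bias (all values arbitrary): the 1/N estimator concentrates around ((N − 1)/N) σ², while the 1/(N − 1) estimator concentrates around σ².

import numpy as np

# Compare the 1/N and 1/(N-1) variance estimators over many trials.
rng = np.random.default_rng(5)
sigma2_true, N, trials = 4.0, 10, 20000
var_mle, var_unbiased = [], []
for _ in range(trials):
    x = rng.normal(loc=0.0, scale=np.sqrt(sigma2_true), size=N)
    var_mle.append(((x - x.mean()) ** 2).sum() / N)
    var_unbiased.append(((x - x.mean()) ** 2).sum() / (N - 1))

print(np.mean(var_mle))        # about (N-1)/N * sigma2_true = 3.6
print(np.mean(var_unbiased))   # about sigma2_true = 4.0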


Unbiased and Consistent Estimator

Where does (N − 1)/N come from? Here is a 1D explanation. Assume µ = 0.

$$
\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n,
\quad \text{and} \quad
\hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{\mu})^2.
$$

Taking the expectation yields

$$
\begin{aligned}
\mathbb{E}[\hat{\sigma}^2]
&= \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}\left[ (x_n - \hat{\mu})^2 \right] \\
&= \frac{1}{N} \sum_{n=1}^{N} \left\{ \mathbb{E}[x_n^2] - 2\,\mathbb{E}[\hat{\mu} x_n] + \mathbb{E}[\hat{\mu}^2] \right\} \\
&= \frac{1}{N} \sum_{n=1}^{N} \left\{ \sigma^2 - 2\,\mathbb{E}\left[ \frac{1}{N} \sum_{j=1}^{N} x_j x_n \right] + \mathbb{E}\left[ \left( \frac{1}{N} \sum_{j=1}^{N} x_j \right)^2 \right] \right\}.
\end{aligned}
$$


Unbiased and Consistent Estimator

$$
\begin{aligned}
\mathbb{E}[\hat{\sigma}^2]
&= \frac{1}{N} \sum_{n=1}^{N} \Bigg\{ \sigma^2
- 2\,\underbrace{\mathbb{E}\bigg[ \frac{1}{N} \sum_{j=1}^{N} x_j x_n \bigg]}_{\substack{= \frac{1}{N}\,\mathbb{E}[x_1 x_n + \cdots + x_N x_n] \\ = \frac{1}{N}(0 + \cdots + \sigma^2 + \cdots + 0) = \frac{\sigma^2}{N}}}
+ \underbrace{\mathbb{E}\bigg[ \bigg( \frac{1}{N} \sum_{j=1}^{N} x_j \bigg)^{2} \bigg]}_{\substack{= \frac{1}{N^2} \big( \sum_{j} \mathbb{E}[x_j^2] + \sum_{j \neq k} \mathbb{E}[x_j x_k] \big) \\ = \frac{1}{N^2}(N\sigma^2 + 0) = \frac{\sigma^2}{N}}}
\Bigg\} \\
&= \frac{1}{N} \sum_{n=1}^{N} \left\{ \sigma^2 - \frac{2}{N}\sigma^2 + \frac{1}{N}\sigma^2 \right\}
= \frac{N - 1}{N}\,\sigma^2.
\end{aligned}
$$


From ML to Decision Boundary

If you have a training set D, ...

Partition it according to the labels: D^(1) and D^(2).

Pick a model, e.g., Gaussian.

For each class, estimate the model parameters:
  µ_1, Σ_1 for D^(1)
  µ_2, Σ_2 for D^(2)

Construct a discriminant function

$$
g(\boldsymbol{x}) = \boldsymbol{x}^T (\boldsymbol{W}_1 - \boldsymbol{W}_2) \boldsymbol{x} + (\boldsymbol{w}_1 - \boldsymbol{w}_2)^T \boldsymbol{x} + (w_{10} - w_{20}).
$$

See the Bayesian Decision Rule lecture (Lecture 9) for the formula.

Define the hypothesis function

$$
h(\boldsymbol{x}) =
\begin{cases}
1, & \text{if } g(\boldsymbol{x}) > 0, \\
0, & \text{if } g(\boldsymbol{x}) < 0.
\end{cases}
$$
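A minimal sketch of this pipeline in NumPy, assuming synthetic two-class data. The (W_i, w_i, w_i0) coefficients below follow one standard Gaussian-discriminant parameterization; the constants in the Bayesian Decision Rule lecture may be written differently, and fit_class / make_classifier are illustrative names, not anything from the slides.

import numpy as np

def fit_class(X, prior):
    """MLE parameters for one class plus quadratic discriminant coefficients (one common form)."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False, bias=True)
    Sinv = np.linalg.inv(Sigma)
    W = -0.5 * Sinv
    w = Sinv @ mu
    w0 = -0.5 * mu @ Sinv @ mu - 0.5 * np.linalg.slogdet(Sigma)[1] + np.log(prior)
    return W, w, w0

def make_classifier(X1, X2):
    N1, N2 = len(X1), len(X2)
    W1, w1, w10 = fit_class(X1, N1 / (N1 + N2))
    W2, w2, w20 = fit_class(X2, N2 / (N1 + N2))
    def h(x):
        g = x @ (W1 - W2) @ x + (w1 - w2) @ x + (w10 - w20)
        return 1 if g > 0 else 0
    return h

# Hypothetical training data for the two classes.
rng = np.random.default_rng(6)
X1 = rng.multivariate_normal([0, 0], np.eye(2), size=200)
X2 = rng.multivariate_normal([3, 3], [[2, 0.5], [0.5, 1]], size=200)
h = make_classifier(X1, X2)
print(h(np.array([0.2, -0.1])), h(np.array([2.8, 3.1])))   # expected: 1, 0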


How well do you do?

You have a dataset {x_1, ..., x_N}.
You estimate µ by solving the maximum-likelihood problem.

You get an estimate

$$
\hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n.
$$

How good is your estimate?

Hoeffding inequality, in the 1D case:

$$
\mathbb{P}\left[ |\hat{\mu} - \mu| > \epsilon \right] \le 2 e^{-2\epsilon^2 N}.
$$

As N → ∞, µ̂ → µ with very high probability.

Overall performance:
Is µ̂ ≈ µ?
Is µ̂ giving a good classifier?
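A hypothetical check of the Hoeffding bound, assuming NumPy and samples bounded in [0, 1] (e.g., Bernoulli), which is the setting where the stated bound applies; the values are arbitrary.

import numpy as np

rng = np.random.default_rng(7)
mu, N, eps, trials = 0.5, 100, 0.1, 50000

x = rng.binomial(1, mu, size=(trials, N))
mu_hat = x.mean(axis=1)

empirical = np.mean(np.abs(mu_hat - mu) > eps)
bound = 2 * np.exp(-2 * eps**2 * N)
print(empirical, bound)   # the empirical probability should not exceed the bound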


MLE and MAP

There are two typical ways of estimating parameters.

Maximum-likelihood estimation (MLE): θ is deterministic.

Maximum-a-posteriori estimation (MAP): θ is random and has a prior distribution.


From MLE to MAP

Likelihood:

$$
p(x_n \mid \theta_1) = \mathcal{N}(x_n \mid \theta_1, \sigma_1^2),
\quad \text{and} \quad
p(x_n \mid \theta_2) = \mathcal{N}(x_n \mid \theta_2, \sigma_2^2).
$$

Maximum-likelihood: You know nothing about θ_1 and θ_2, so you need to take measurements to estimate θ_1 and θ_2.

Maximum-a-Posteriori: You know something about θ_1 and θ_2.

Prior:

$$
p(\theta_1) = \mathcal{N}(\theta_1 \mid \mu_1, \gamma_1^2),
\quad \text{and} \quad
p(\theta_2) = \mathcal{N}(\theta_2 \mid \mu_2, \gamma_2^2).
$$


From MLE to MAP

In MLE, the parameter θ is deterministic.
What if we assume θ has a distribution?
This makes θ probabilistic.
So we treat Θ as a random variable, and θ as a state (realization) of Θ.

Distribution of Θ: p_Θ(θ).

p_Θ(θ) is the distribution of the parameter Θ.
Θ has its own mean and its own variance.


Maximum-a-Posteriori

By Bayes' Theorem again:

$$
p_{\Theta \mid X}(\theta \mid \boldsymbol{x}_n) = \frac{p_{X \mid \Theta}(\boldsymbol{x}_n \mid \theta)\, p_{\Theta}(\theta)}{p_X(\boldsymbol{x}_n)}.
$$

To maximize the posterior distribution:

$$
\begin{aligned}
\hat{\theta}
&= \underset{\theta}{\operatorname{argmax}}\; p_{\Theta \mid X}(\theta \mid \mathcal{D}) \\
&= \underset{\theta}{\operatorname{argmax}} \prod_{n=1}^{N} p_{\Theta \mid X}(\theta \mid \boldsymbol{x}_n) \\
&= \underset{\theta}{\operatorname{argmax}} \prod_{n=1}^{N} \frac{p_{X \mid \Theta}(\boldsymbol{x}_n \mid \theta)\, p_{\Theta}(\theta)}{p_X(\boldsymbol{x}_n)}
\qquad \text{(the denominator does not depend on } \theta \text{ and can be dropped)} \\
&= \underset{\theta}{\operatorname{argmin}}\; -\sum_{n=1}^{N} \left\{ \log p_{X \mid \Theta}(\boldsymbol{x}_n \mid \theta) + \log p_{\Theta}(\theta) \right\}.
\end{aligned}
$$
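A minimal MAP sketch, not from the slides: a Gaussian likelihood with a Gaussian prior on θ, estimated by a grid search over the negative log-posterior above. The closed-form line assumes this particular conjugate Gaussian-Gaussian model and is included only as a cross-check; all names and values are illustrative.

import numpy as np

rng = np.random.default_rng(8)
sigma, mu0, gamma = 1.0, 0.0, 0.5        # likelihood std, prior mean, prior std
theta_true, N = 2.0, 20
x = rng.normal(theta_true, sigma, size=N)

# Negative log-posterior (up to constants) on a grid of candidate theta values.
theta_grid = np.linspace(-1.0, 4.0, 5001)
neg_log_post = (((x[None, :] - theta_grid[:, None]) ** 2).sum(axis=1) / (2 * sigma**2)
                + (theta_grid - mu0) ** 2 / (2 * gamma**2))
theta_map = theta_grid[np.argmin(neg_log_post)]

# Closed form for this conjugate model (obtained by setting the derivative to zero).
theta_closed = (gamma**2 * x.sum() + sigma**2 * mu0) / (N * gamma**2 + sigma**2)
print(theta_map, theta_closed, x.mean())   # MAP is pulled from the MLE (sample mean) toward mu0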


Reading List

Maximum Likelihood Estimation

Duda-Hart-Stork, Pattern Classification, Chapter 3.2

Iowa State EE 527: http://www.ece.iastate.edu/~namrata/EE527_Spring08/l5.pdf

Purdue ECE 645, Lecture 18-20: https://engineering.purdue.edu/ChanGroup/ECE645.html

UCSD ECE 271A, Lecture 6: http://www.svcl.ucsd.edu/courses/ece271A/ece271A.htm

Univ. Orleans: https://www.univ-orleans.fr/deg/masters/ESA/CH/Chapter2_MLE.pdf