© Stanley Chan 2020. All Rights Reserved.
ECE595 / STAT598: Machine Learning I
Lecture 11: Maximum-Likelihood Estimation
Spring 2020
Stanley Chan
School of Electrical and Computer Engineering
Purdue University
Overview
In linear discriminant analysis (LDA), there are generally two types of approaches:
Generative approach: Estimate the model, then define the classifier.
Discriminative approach: Directly define the classifier.
Outline
Generative Approaches
- Lecture 9: Bayesian Decision Rules
- Lecture 10: Evaluating Performance
- Lecture 11: Parameter Estimation
- Lecture 12: Bayesian Prior
- Lecture 13: Connecting Bayesian and Linear Regression

Today's Lecture
- Basic Principles: Likelihood Function, Maximum Likelihood Estimate, 1D Illustration, Gaussian Distributions
- Examples: Non-Gaussian Distributions, Biased and Unbiased Estimators, From MLE to MAP
What is Parameter Estimation?
The goal of parameter estimation is to determine the model parameters Θ = (µ, Σ) from the dataset.
This is the step where you use the data.
MLE and MAP
There are two typical ways of estimating parameters.
Maximum-likelihood estimation (MLE): θ is deterministic.
Maximum-a-posteriori estimation (MAP): θ is random and has a prior distribution.
Maximum Likelihood Estimation
Given the dataset $\mathcal{D} = \{x_n\}_{n=1}^N$, how do we estimate the model parameters?
We are going to use the Gaussian as an illustration. Denote θ as the model parameter; in the Gaussian case, θ = {µ, Σ}.
The likelihood for one data point $x_n$ is
$$p(x_n \mid \theta) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left\{-\frac{1}{2}(x_n - \mu)^T \Sigma^{-1}(x_n - \mu)\right\}.$$
θ is a deterministic quantity, not a random variable.
θ does not have a distribution.
θ is fixed but unknown.
Likelihood for the Entire Dataset
The likelihood for the entire dataset $\{x_1, \ldots, x_N\}$ is
$$p(\mathcal{D} \mid \theta) = \prod_{n=1}^N \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left\{-\frac{1}{2}(x_n - \mu)^T \Sigma^{-1}(x_n - \mu)\right\} = \left(\frac{1}{\sqrt{(2\pi)^d |\Sigma|}}\right)^{N} \exp\left\{-\sum_{n=1}^N \frac{1}{2}(x_n - \mu)^T \Sigma^{-1}(x_n - \mu)\right\}.$$
The negative log-likelihood is
$$-\log p(\mathcal{D} \mid \theta) = \frac{N}{2}\log|\Sigma| + \frac{N}{2}\log(2\pi)^d + \sum_{n=1}^N \frac{1}{2}(x_n - \mu)^T \Sigma^{-1}(x_n - \mu).$$
Maximum Likelihood Estimation
Goal: Find the θ that maximizes the likelihood:
$$\hat{\theta} = \arg\max_\theta\, p(\mathcal{D} \mid \theta) = \arg\max_\theta \prod_{n=1}^N \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left\{-\frac{1}{2}(x_n - \mu)^T \Sigma^{-1}(x_n - \mu)\right\}$$
$$= \arg\min_\theta\, -\log p(\mathcal{D} \mid \theta) = \arg\min_\theta\, \frac{N}{2}\log|\Sigma| + \frac{N}{2}\log(2\pi)^d + \sum_{n=1}^N \frac{1}{2}(x_n - \mu)^T \Sigma^{-1}(x_n - \mu).$$
This optimization is called maximum-likelihood estimation (MLE).
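To make the objective concrete, here is a minimal NumPy sketch (ours, not from the slides; the helper name gaussian_nll is our own) that evaluates this negative log-likelihood and checks that the sample mean attains a lower value than a perturbed mean:

    import numpy as np

    def gaussian_nll(mu, Sigma, X):
        # Negative log-likelihood of N i.i.d. Gaussian samples X, shape (N, d)
        N, d = X.shape
        diff = X - mu
        Sinv = np.linalg.inv(Sigma)
        quad = 0.5 * np.einsum('ni,ij,nj->', diff, Sinv, diff)
        return 0.5 * N * np.log(np.linalg.det(Sigma)) + 0.5 * N * d * np.log(2 * np.pi) + quad

    rng = np.random.default_rng(0)
    X = rng.normal(loc=1.5, scale=2.0, size=(500, 1))
    Sigma = np.array([[4.0]])                 # covariance assumed known here
    print(gaussian_nll(X.mean(axis=0), Sigma, X)
          < gaussian_nll(X.mean(axis=0) + 0.3, Sigma, X))   # True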
Illustrating MLE when N = 1. Known σ.
When N = 1: The MLE solution is
$$\hat{\mu} = \arg\max_\mu \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x_1 - \mu)^2}{2\sigma^2}\right\} = \arg\min_\mu\, (x_1 - \mu)^2 = x_1.$$
Which µ will give you the best Gaussian? When µ = x₁, the probability of obtaining x₁ is the highest.
Illustrating MLE when N = 2. Known σ.
When N = 2: The MLE solution is
$$\hat{\mu} = \arg\max_\mu \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{2} \exp\left\{-\frac{(x_1 - \mu)^2 + (x_2 - \mu)^2}{2\sigma^2}\right\} = \arg\min_\mu\, (x_1 - \mu)^2 + (x_2 - \mu)^2 = \frac{x_1 + x_2}{2}.$$
Which µ will give you the best Gaussian? When µ = (x₁ + x₂)/2, the probability of obtaining x₁ and x₂ is the highest.
Illustrating MLE when N is an arbitrary integer. Known σ.
The MLE solution is
$$\hat{\mu} = \arg\max_\mu \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{N} \exp\left\{-\sum_{n=1}^N \frac{(x_n - \mu)^2}{2\sigma^2}\right\} = \arg\min_\mu \sum_{n=1}^N (x_n - \mu)^2 = \frac{1}{N}\sum_{n=1}^N x_n.$$
Which µ will give you the best Gaussian? When $\mu = \frac{1}{N}\sum_{n=1}^N x_n$, the probability of obtaining $\{x_n\}$ is the highest.
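A quick numerical sanity check (our own sketch, assuming NumPy): scanning candidate means confirms that the sample average maximizes the 1D Gaussian likelihood.

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.normal(loc=3.0, scale=1.0, size=100)    # sigma assumed known (= 1)
    candidates = np.linspace(2.0, 4.0, 2001)
    # Log-likelihood up to additive constants: -sum_n (x_n - mu)^2 / 2
    loglik = np.array([-0.5 * np.sum((x - m) ** 2) for m in candidates])
    print(candidates[np.argmax(loglik)], x.mean())  # nearly identical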
Estimation in High Dimensions
Assume Σ is known and fixed; thus θ = µ. Estimate µ:
$$\hat{\mu} = \arg\min_\mu\, \underbrace{\frac{N}{2}\log|\Sigma| + \frac{N}{2}\log(2\pi)^d}_{\text{constant in } \mu} + \sum_{n=1}^N \frac{1}{2}(x_n - \mu)^T \Sigma^{-1}(x_n - \mu) = \arg\min_\mu \sum_{n=1}^N (x_n - \mu)^T \Sigma^{-1}(x_n - \mu).$$
Take the derivative and set it to zero:
$$\nabla_\mu \left\{\sum_{n=1}^N (x_n - \mu)^T \Sigma^{-1}(x_n - \mu)\right\} = -2\sum_{n=1}^N \Sigma^{-1}(x_n - \mu) = 0.$$
Estimation in High Dimensions
Let us do some algebra:
$$\sum_{n=1}^N \Sigma^{-1}(x_n - \mu) = 0 \implies \sum_{n=1}^N x_n = \sum_{n=1}^N \mu = N\mu.$$
Then we can show that the MLE solution is
$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^N x_n.$$
This is just the empirical average of the entire dataset!
You can show that if E[xₙ] = µ for all n, then
$$\mathbb{E}[\hat{\mu}] = \frac{1}{N}\sum_{n=1}^N \mathbb{E}[x_n] = \mu.$$
We say that µ̂ is an unbiased estimator of µ since E[µ̂] = µ.
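A tiny simulation (ours, not from the slides) of this unbiasedness claim: averaging µ̂ over many repeated datasets recovers µ.

    import numpy as np

    rng = np.random.default_rng(8)
    mu_true, N, trials = 2.5, 20, 100_000
    X = rng.normal(mu_true, 1.0, size=(trials, N))   # many independent datasets
    mu_hats = X.mean(axis=1)                         # one MLE per dataset
    print(mu_hats.mean())                            # ~2.5 = mu_true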
When both µ and Σ are Unknown
What will be the MLE when both µ and Σ are unknown?
$$(\hat{\mu}, \hat{\Sigma}) = \arg\min_{\mu, \Sigma}\, \underbrace{\frac{N}{2}\log|\Sigma| + \frac{N}{2}\log(2\pi)^d + \sum_{n=1}^N \frac{1}{2}(x_n - \mu)^T \Sigma^{-1}(x_n - \mu)}_{\varphi(\mu, \Sigma)}.$$
You need to take the derivative with respect to µ and Σ, and solve
$$\nabla_\mu \varphi(\mu, \Sigma) = 0, \qquad \nabla_\Sigma \varphi(\mu, \Sigma) = 0.$$
With some (tedious) matrix calculus, we can show that
$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^N x_n, \quad \text{and} \quad \hat{\Sigma} = \frac{1}{N}\sum_{n=1}^N (x_n - \hat{\mu})(x_n - \hat{\mu})^T.$$
Exercise: Prove this result when $x_n$ is a 1D scalar.
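A minimal NumPy sketch (ours) of these closed-form estimates; note that np.cov with bias=True matches the 1/N convention used here.

    import numpy as np

    rng = np.random.default_rng(2)
    mu_true = np.array([1.0, -2.0])
    Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])
    X = rng.multivariate_normal(mu_true, Sigma_true, size=1000)   # shape (N, d)

    mu_hat = X.mean(axis=0)                     # MLE of the mean
    diff = X - mu_hat
    Sigma_hat = diff.T @ diff / len(X)          # MLE of the covariance (1/N)

    print(mu_hat)                               # close to mu_true
    print(np.allclose(Sigma_hat, np.cov(X.T, bias=True)))   # True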
Outline
Generative Approaches
- Lecture 9: Bayesian Decision Rules
- Lecture 10: Evaluating Performance
- Lecture 11: Parameter Estimation
- Lecture 12: Bayesian Prior
- Lecture 13: Connecting Bayesian and Linear Regression

Today's Lecture
- Basic Principles: Likelihood Function, Maximum Likelihood Estimate, 1D Illustration, Gaussian Distributions
- Examples: Non-Gaussian Distributions, Biased and Unbiased Estimators, From MLE to MAP
MLE does not need to be Gaussian
Suppose that $x_n \sim \text{Bernoulli}(\theta)$. Then
$$p(x_n \mid \theta) = \begin{cases} \theta, & \text{if } x_n = 1, \\ 1 - \theta, & \text{if } x_n = 0. \end{cases}$$
We can write the likelihood function as
$$p(\mathcal{D} \mid \theta) = \prod_{n=1}^N \theta^{x_n}(1-\theta)^{1-x_n} = \theta^{\sum_{n=1}^N x_n}(1-\theta)^{\sum_{n=1}^N (1-x_n)},$$
and so the negative log-likelihood is
$$-\log p(\mathcal{D} \mid \theta) = -\left(\sum_{n=1}^N x_n\right)\log\theta - \left(\sum_{n=1}^N (1-x_n)\right)\log(1-\theta).$$
Bernoulli MLE
Taking the derivative and setting it to zero:
$$\frac{d}{d\theta}\left\{-\log p(\mathcal{D} \mid \theta)\right\} = \frac{d}{d\theta}\left\{-\left(\sum_{n=1}^N x_n\right)\log\theta - \left(\sum_{n=1}^N (1-x_n)\right)\log(1-\theta)\right\} = -\left(\sum_{n=1}^N x_n\right)\frac{1}{\theta} + \left(\sum_{n=1}^N (1-x_n)\right)\frac{1}{1-\theta}.$$
Setting this to zero, we have
$$\left(\sum_{n=1}^N x_n\right)\frac{1}{\theta} = \left(\sum_{n=1}^N (1-x_n)\right)\frac{1}{1-\theta}$$
$$\left(\sum_{n=1}^N x_n\right)(1-\theta) = \left(\sum_{n=1}^N (1-x_n)\right)\theta$$
$$\left(\sum_{n=1}^N x_n\right) - \left(\sum_{n=1}^N x_n\right)\theta = N\theta - \left(\sum_{n=1}^N x_n\right)\theta.$$
Therefore,
$$\hat{\theta} = \frac{1}{N}\sum_{n=1}^N x_n.$$
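A small simulation (our own sketch, assuming NumPy) of the Bernoulli MLE: the fraction of ones recovers θ.

    import numpy as np

    rng = np.random.default_rng(3)
    theta_true = 0.3
    x = rng.binomial(n=1, p=theta_true, size=10_000)   # Bernoulli(theta) draws
    theta_hat = x.mean()                               # MLE: fraction of ones
    print(theta_hat)                                   # ~0.3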
Unbiased and Consistent Estimator
$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^N x_n, \quad \text{and} \quad \hat{\Sigma} = \frac{1}{N}\sum_{n=1}^N (x_n - \hat{\mu})(x_n - \hat{\mu})^T.$$
µ̂ is the empirical average (or the sample mean); Σ̂ is the empirical covariance (or the sample covariance).
µ̂ is an unbiased estimate of µ: E[µ̂] = µ for all N.
µ̂ is a consistent estimate of µ: as N → ∞, µ̂ → µ in probability.
Σ̂ is a biased estimate of Σ: E[Σ̂] = ((N − 1)/N) Σ ≠ Σ.
Σ̂ is a consistent estimate of Σ: as N → ∞, Σ̂ → Σ in probability.
You can make Σ̂ unbiased by defining
$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^N x_n, \quad \text{and} \quad \hat{\Sigma} = \frac{1}{N-1}\sum_{n=1}^N (x_n - \hat{\mu})(x_n - \hat{\mu})^T.$$
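A quick check (our own sketch) that the 1/N variance estimate is biased while the 1/(N − 1) version is not; NumPy's ddof argument switches between the two conventions.

    import numpy as np

    rng = np.random.default_rng(4)
    sigma2_true, N, trials = 4.0, 10, 200_000
    X = rng.normal(0.0, np.sqrt(sigma2_true), size=(trials, N))

    biased = X.var(axis=1, ddof=0).mean()       # 1/N      -> (N-1)/N * sigma^2
    unbiased = X.var(axis=1, ddof=1).mean()     # 1/(N-1)  -> sigma^2
    print(biased, (N - 1) / N * sigma2_true)    # both ~3.6
    print(unbiased, sigma2_true)                # both ~4.0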
Unbiased and Consistent Estimator
Where does (N − 1)/N come from? Here is a 1D explanation. Assume µ = 0:
$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^N x_n, \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{N}\sum_{n=1}^N (x_n - \hat{\mu})^2.$$
Taking the expectation yields
$$\mathbb{E}[\hat{\sigma}^2] = \frac{1}{N}\sum_{n=1}^N \mathbb{E}[(x_n - \hat{\mu})^2] = \frac{1}{N}\sum_{n=1}^N \left\{\mathbb{E}[x_n^2] - 2\,\mathbb{E}[\hat{\mu} x_n] + \mathbb{E}[\hat{\mu}^2]\right\} = \frac{1}{N}\sum_{n=1}^N \left\{\sigma^2 - 2\,\mathbb{E}\left[\frac{1}{N}\sum_{j=1}^N x_j x_n\right] + \mathbb{E}\left[\left(\frac{1}{N}\sum_{j=1}^N x_j\right)^2\right]\right\}.$$
Unbiased and Consistent Estimator
$$\mathbb{E}[\hat{\sigma}^2] = \frac{1}{N}\sum_{n=1}^N \left\{\sigma^2 - 2\,\underbrace{\mathbb{E}\left[\frac{1}{N}\sum_{j=1}^N x_j x_n\right]}_{= \frac{1}{N}(0 + \cdots + \sigma^2 + \cdots + 0) = \frac{\sigma^2}{N}} + \underbrace{\mathbb{E}\left[\left(\frac{1}{N}\sum_{j=1}^N x_j\right)^2\right]}_{= \frac{1}{N^2}\left(\sum_{j} \mathbb{E}[x_j^2] + \sum_{j \neq k} \mathbb{E}[x_j x_k]\right) = \frac{1}{N^2}(N\sigma^2 + 0) = \frac{\sigma^2}{N}}\right\}$$
$$= \frac{1}{N}\sum_{n=1}^N \left\{\sigma^2 - \frac{2\sigma^2}{N} + \frac{\sigma^2}{N}\right\} = \frac{N-1}{N}\sigma^2.$$
From ML to Decision Boundary
If you have a training set $\mathcal{D}$:
Partition it according to the labels: $\mathcal{D}^{(1)}$ and $\mathcal{D}^{(2)}$.
Pick a model, e.g., Gaussian.
For each class, estimate the model parameters: µ₁, Σ₁ for $\mathcal{D}^{(1)}$, and µ₂, Σ₂ for $\mathcal{D}^{(2)}$.
Construct a discriminant function
$$g(x) = x^T(W_1 - W_2)x + (w_1 - w_2)^T x + (w_{10} - w_{20});$$
see the Bayesian Decision Rules lecture (Lecture 9) for the formula.
Define the hypothesis function
$$h(x) = \begin{cases} 1, & \text{if } g(x) > 0, \\ 0, & \text{if } g(x) < 0. \end{cases}$$
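A compact sketch (ours, not the course code) of this pipeline. With equal class priors, the log-likelihood difference of the two fitted Gaussians expands into exactly the quadratic form g(x) above; the helper name fit_gaussian is our own.

    import numpy as np
    from scipy.stats import multivariate_normal

    def fit_gaussian(X):
        # MLE of (mu, Sigma) for one class partition, Sigma with 1/N
        mu = X.mean(axis=0)
        diff = X - mu
        return mu, diff.T @ diff / len(X)

    rng = np.random.default_rng(5)
    D1 = rng.multivariate_normal([0, 0], np.eye(2), size=200)   # class 1 samples
    D2 = rng.multivariate_normal([3, 3], np.eye(2), size=200)   # class 2 samples
    (mu1, S1), (mu2, S2) = fit_gaussian(D1), fit_gaussian(D2)

    def h(x):
        # g(x) = log p(x | class 1) - log p(x | class 2), equal priors assumed
        g = (multivariate_normal.logpdf(x, mu1, S1)
             - multivariate_normal.logpdf(x, mu2, S2))
        return 1 if g > 0 else 0

    print(h([0.5, 0.2]), h([2.8, 3.1]))   # 1 0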
How well do you do?
You have a dataset $\{x_1, \ldots, x_N\}$. You estimate µ by solving the maximum-likelihood problem, and you get the estimate
$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^N x_n.$$
How good is your estimate?
Hoeffding's inequality, in the 1D case:
$$\mathbb{P}\left[|\hat{\mu} - \mu| > \epsilon\right] \le 2e^{-2\epsilon^2 N}.$$
As N → ∞, µ̂ → µ with very high probability.
Overall performance: Is µ̂ ≈ µ? Is µ̂ giving a good classifier?
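A simulation sketch (ours) of the 1D Hoeffding bound. This form of the inequality assumes samples bounded in [0, 1], so the demo uses uniform draws.

    import numpy as np

    rng = np.random.default_rng(6)
    N, trials, eps = 50, 100_000, 0.1
    X = rng.uniform(0.0, 1.0, size=(trials, N))    # samples bounded in [0, 1]
    mu, mu_hat = 0.5, X.mean(axis=1)

    empirical = np.mean(np.abs(mu_hat - mu) > eps) # observed deviation rate
    bound = 2 * np.exp(-2 * eps**2 * N)            # Hoeffding upper bound
    print(empirical <= bound, empirical, bound)    # True, ~0.014, ~0.74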
MLE and MAP
There are two typical ways of estimating parameters.
Maximum-likelihood estimation (MLE): θ is deterministic.
Maximum-a-posteriori estimation (MAP): θ is random and has a prior distribution.
From MLE to MAP
Likelihood:
$$p(x_n \mid \theta_1) = \mathcal{N}(x_n \mid \theta_1, \sigma_1^2), \quad \text{and} \quad p(x_n \mid \theta_2) = \mathcal{N}(x_n \mid \theta_2, \sigma_2^2).$$
Maximum-likelihood: You know nothing about θ₁ and θ₂, so you need to take measurements to estimate them.
Maximum-a-posteriori: You know something about θ₁ and θ₂. Prior:
$$p(\theta_1) = \mathcal{N}(\theta_1 \mid \mu_1, \gamma_1^2), \quad \text{and} \quad p(\theta_2) = \mathcal{N}(\theta_2 \mid \mu_2, \gamma_2^2).$$
From MLE to MAP
In MLE, the parameter θ is deterministic. What if we assume θ has a distribution? This makes θ probabilistic. So we make Θ a random variable, and θ a state of Θ.
The distribution of Θ is pΘ(θ): it is the distribution of the parameter Θ, and Θ has its own mean and its own variance.
Maximum-a-Posteriori
By Bayes' theorem again:
$$p_{\Theta|X}(\theta \mid x_n) = \frac{p_{X|\Theta}(x_n \mid \theta)\, p_\Theta(\theta)}{p_X(x_n)}.$$
To maximize the posterior distribution:
$$\hat{\theta} = \arg\max_\theta\, p_{\Theta|X}(\theta \mid \mathcal{D}) = \arg\max_\theta \prod_{n=1}^N p_{\Theta|X}(\theta \mid x_n) = \arg\max_\theta \prod_{n=1}^N p_{X|\Theta}(x_n \mid \theta)\, p_\Theta(\theta),$$
where $p_X(x_n)$ is dropped because it does not depend on θ. Hence
$$\hat{\theta} = \arg\min_\theta\, -\sum_{n=1}^N \left\{\log p_{X|\Theta}(x_n \mid \theta) + \log p_\Theta(\theta)\right\}.$$
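A minimal sketch (ours) of this MAP objective for a 1D Gaussian mean with a Gaussian prior, following the formula above, where the prior term appears once per data point; all variable names are our own.

    import numpy as np

    rng = np.random.default_rng(7)
    sigma, mu0, gamma = 1.0, 0.0, 0.5        # likelihood std, prior mean/std
    x = rng.normal(2.0, sigma, size=20)      # observations

    thetas = np.linspace(-1.0, 3.0, 4001)
    # Objective: -sum_n [ log p(x_n | theta) + log p(theta) ], up to constants
    obj = np.array([np.sum((x - t) ** 2) / (2 * sigma**2)
                    + len(x) * (t - mu0) ** 2 / (2 * gamma**2) for t in thetas])
    theta_map = thetas[np.argmin(obj)]
    print(x.mean(), theta_map)               # MAP is pulled from the MLE toward mu0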
Reading List
Maximum Likelihood Estimation:
Duda, Hart, and Stork, Pattern Classification, Chapter 3.2
Iowa State EE 527: http://www.ece.iastate.edu/~namrata/EE527_Spring08/l5.pdf
Purdue ECE 645, Lectures 18-20: https://engineering.purdue.edu/ChanGroup/ECE645.html
UCSD ECE 271A, Lecture 6: http://www.svcl.ucsd.edu/courses/ece271A/ece271A.htm
Univ. Orleans: https://www.univ-orleans.fr/deg/masters/ESA/CH/Chapter2_MLE.pdf