ECE 8443 – Pattern Recognition
LECTURE 06: MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION

Objectives: Bias in ML Estimates; Bayesian Estimation; Example
Resources: D.H.S.: Chapter 3 (Part 2); Wiki: Maximum Likelihood; M.Y.: Maximum Likelihood Tutorial; J.O.S.: Bayesian Parameter Estimation; J.H.: Euro Coin


Slide 2: Gaussian Case: Unknown Mean (Review)

• Consider the case where only the mean, θ = μ, is unknown. The log likelihood of a single sample is:

  $\ln p(\mathbf{x}_k|\boldsymbol{\theta}) = -\tfrac{1}{2}\ln\!\left[(2\pi)^d|\boldsymbol{\Sigma}|\right] - \tfrac{1}{2}(\mathbf{x}_k-\boldsymbol{\theta})^t\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\boldsymbol{\theta})$

  which implies:

  $\nabla_{\boldsymbol{\theta}}\ln p(\mathbf{x}_k|\boldsymbol{\theta}) = \boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\boldsymbol{\theta})$

  because:

  $\nabla_{\boldsymbol{\theta}}\!\left[-\tfrac{1}{2}\ln\!\left[(2\pi)^d|\boldsymbol{\Sigma}|\right] - \tfrac{1}{2}(\mathbf{x}_k-\boldsymbol{\theta})^t\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\boldsymbol{\theta})\right] = \boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\boldsymbol{\theta})$

• The ML estimate must therefore satisfy:

  $\nabla_{\boldsymbol{\theta}}\,l = \sum_{k=1}^{n}\nabla_{\boldsymbol{\theta}}\ln p(\mathbf{x}_k|\boldsymbol{\theta}) = 0$

Slide 3: Gaussian Case: Unknown Mean (Review)

• Substituting into the expression for the total likelihood:

  $\nabla_{\boldsymbol{\mu}}\,l = \sum_{k=1}^{n}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_k-\hat{\boldsymbol{\mu}}) = 0$

• Rearranging terms:

  $\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\boldsymbol{\mu}}) = 0 \;\Rightarrow\; \sum_{k=1}^{n}\mathbf{x}_k = n\hat{\boldsymbol{\mu}} \;\Rightarrow\; \hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k$

• Significance??? The ML estimate of the mean is simply the sample mean of the training data.

Slide 4: Gaussian Case: Unknown Mean and Variance (Review)

• Let θ = [μ, σ²]ᵗ. The log likelihood of a SINGLE point is:

  $\ln p(x_k|\boldsymbol{\theta}) = -\tfrac{1}{2}\ln(2\pi\theta_2) - \frac{1}{2\theta_2}(x_k-\theta_1)^2$

  $\nabla_{\boldsymbol{\theta}}\,l = \nabla_{\boldsymbol{\theta}}\ln p(x_k|\boldsymbol{\theta}) = \begin{bmatrix} \dfrac{1}{\theta_2}(x_k-\theta_1) \\[6pt] -\dfrac{1}{2\theta_2} + \dfrac{(x_k-\theta_1)^2}{2\theta_2^2} \end{bmatrix}$

• The full likelihood leads to:

  $\sum_{k=1}^{n}\frac{1}{\hat{\theta}_2}(x_k-\hat{\theta}_1) = 0 \qquad\text{and}\qquad -\sum_{k=1}^{n}\frac{1}{\hat{\theta}_2} + \sum_{k=1}^{n}\frac{(x_k-\hat{\theta}_1)^2}{\hat{\theta}_2^2} = 0$

Slide 5: Gaussian Case: Unknown Mean and Variance (Review)

• This leads to these equations:

  $\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n}x_k \qquad\qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{k=1}^{n}(x_k-\hat{\mu})^2$

• In the multivariate case:

  $\hat{\boldsymbol{\mu}} = \frac{1}{n}\sum_{k=1}^{n}\mathbf{x}_k \qquad\qquad \hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{k=1}^{n}(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^t$

• The true covariance is the expected value of the matrix $(\mathbf{x}_k-\hat{\boldsymbol{\mu}})(\mathbf{x}_k-\hat{\boldsymbol{\mu}})^t$, which is a familiar result.
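As a quick numerical check (this sketch is an addition to the slides, not part of them), the ML estimates above are just the sample mean and the 1/n sample covariance; the true parameters and sample size below are arbitrary values chosen for illustration:

```python
# Minimal sketch: ML estimates of a multivariate Gaussian's mean and covariance.
import numpy as np

rng = np.random.default_rng(0)
true_mean = np.array([1.0, -2.0])                 # assumed "true" parameters
true_cov = np.array([[2.0, 0.5],
                     [0.5, 1.0]])
n = 1000
X = rng.multivariate_normal(true_mean, true_cov, size=n)   # n x d samples

mu_hat = X.mean(axis=0)                            # (1/n) sum_k x_k
diff = X - mu_hat
sigma_hat = (diff.T @ diff) / n                    # (1/n) sum_k (x_k - mu_hat)(x_k - mu_hat)^t

print("ML mean estimate:      ", mu_hat)
print("ML covariance estimate:\n", sigma_hat)
```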

Slide 6: Convergence of the Mean (Review)

• Does the maximum likelihood estimate of the variance converge to the true value of the variance? Let's start with a few simple results we will need later.

• Expected value of the ML estimate of the mean:

  $E[\hat{\mu}] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}x_i\right] = \frac{1}{n}\sum_{i=1}^{n}E[x_i] = \frac{1}{n}\sum_{i=1}^{n}\mu = \mu$

• Variance of the ML estimate of the mean:

  $\operatorname{var}[\hat{\mu}] = E[\hat{\mu}^2] - (E[\hat{\mu}])^2 = E\!\left[\frac{1}{n}\sum_{i=1}^{n}x_i\;\frac{1}{n}\sum_{j=1}^{n}x_j\right] - \mu^2 = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}E[x_i x_j] - \mu^2$

Slide 7: Variance of the ML Estimate of the Mean (Review)

• The expected value of $x_i x_j$ will be $\mu^2$ for $j \neq i$, since the two random variables are independent.

• The expected value of $x_i^2$ will be $\mu^2 + \sigma^2$.

• Hence, in the summation above, we have $n^2 - n$ terms with expected value $\mu^2$ and $n$ terms with expected value $\mu^2 + \sigma^2$.

• Thus,

  $\operatorname{var}[\hat{\mu}] = \frac{1}{n^2}\left[(n^2-n)\mu^2 + n(\mu^2+\sigma^2)\right] - \mu^2 = \frac{\sigma^2}{n}$

  which implies:

  $E[\hat{\mu}^2] = \operatorname{var}[\hat{\mu}] + (E[\hat{\mu}])^2 = \frac{\sigma^2}{n} + \mu^2$

• We see that the variance of the estimate goes to zero as $n$ goes to infinity, so our estimate converges to the true mean (the error goes to zero).
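The σ²/n behavior is easy to verify numerically. The following sketch (an addition to the slides; the parameter values are arbitrary) repeats the experiment many times and compares the empirical variance of μ̂ with σ²/n:

```python
# Monte Carlo check: variance of the ML estimate of the mean behaves like sigma^2 / n.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 3.0, 2.0        # assumed true parameters
n, trials = 50, 20000       # sample size and number of repeated experiments

# Each row is one experiment; each experiment yields one estimate mu_hat.
samples = rng.normal(mu, sigma, size=(trials, n))
mu_hats = samples.mean(axis=1)

print("empirical var[mu_hat]:", mu_hats.var())
print("theory  sigma^2 / n  :", sigma**2 / n)
```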

Slide 8: Variance Relationships

• Now we can combine these results. Recall our expression for the ML estimate of the variance, whose expected value we want:

  $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2 \quad\Rightarrow\quad E[\hat{\sigma}^2] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2\right]$

• We will need one more result:

  $\sigma^2 = E[(x_i-\mu)^2] = E[x_i^2] - 2E[x_i]\,\mu + \mu^2 = E[x_i^2] - \mu^2$

  Note that this implies:

  $E[x_i^2] = \sigma^2 + \mu^2$

Slide 9: Covariance Expansion

• Expanding the expectation of the ML variance estimate:

  $E[\hat{\sigma}^2] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2\right] = \frac{1}{n}\sum_{i=1}^{n}E[x_i^2 - 2x_i\hat{\mu} + \hat{\mu}^2] = \frac{1}{n}\sum_{i=1}^{n}\left(E[x_i^2] - 2E[x_i\hat{\mu}] + E[\hat{\mu}^2]\right) = \frac{1}{n}\sum_{i=1}^{n}\left((\sigma^2+\mu^2) - 2E[x_i\hat{\mu}] + \left(\mu^2+\frac{\sigma^2}{n}\right)\right)$

• One more intermediate term to derive. Expand the covariance and simplify:

  $E[x_i\hat{\mu}] = E\!\left[x_i\,\frac{1}{n}\sum_{j=1}^{n}x_j\right] = \frac{1}{n}\sum_{j=1}^{n}E[x_i x_j] = \frac{1}{n}\left((n-1)\mu^2 + (\mu^2+\sigma^2)\right) = \mu^2 + \frac{\sigma^2}{n}$

Slide 10: Biased Variance Estimate

• Substitute our previously derived expression for the second term:

  $E[\hat{\sigma}^2] = \frac{1}{n}\sum_{i=1}^{n}\left((\sigma^2+\mu^2) - 2\left(\mu^2+\frac{\sigma^2}{n}\right) + \left(\mu^2+\frac{\sigma^2}{n}\right)\right) = \frac{1}{n}\sum_{i=1}^{n}\left(\sigma^2 - \frac{\sigma^2}{n}\right) = \frac{1}{n}\sum_{i=1}^{n}\sigma^2\left(1-\frac{1}{n}\right) = \frac{n-1}{n}\,\sigma^2$

Slide 11: Expectation Simplification

• Therefore, the ML estimate is biased:

  $E[\hat{\sigma}^2] = E\!\left[\frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})^2\right] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2$

  However, the ML estimate converges (in the mean-squared-error sense).

• An unbiased estimator is:

  $\mathbf{C} = \frac{1}{n-1}\sum_{i=1}^{n}(\mathbf{x}_i-\hat{\boldsymbol{\mu}})(\mathbf{x}_i-\hat{\boldsymbol{\mu}})^t$

• These are related by:

  $\hat{\boldsymbol{\Sigma}} = \frac{n-1}{n}\,\mathbf{C}$

  and the ML estimate is asymptotically unbiased. See Burl, AJWills and AWM for excellent examples and explanations of the details of this derivation.
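A small simulation (added here for illustration; the parameter values are arbitrary) makes the bias visible: the average of the 1/n estimates sits near ((n−1)/n)σ², while the 1/(n−1) estimator averages near σ²:

```python
# Compare the biased ML variance estimate (1/n) with the unbiased estimator (1/(n-1)).
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 0.0, 1.5
n, trials = 10, 100000

samples = rng.normal(mu, sigma, size=(trials, n))
ml_var = samples.var(axis=1, ddof=0)        # (1/n)     * sum (x_i - mu_hat)^2
unbiased_var = samples.var(axis=1, ddof=1)  # (1/(n-1)) * sum (x_i - mu_hat)^2

print("mean of ML estimates:      ", ml_var.mean())
print("theory (n-1)/n * sigma^2:  ", (n - 1) / n * sigma**2)
print("mean of unbiased estimates:", unbiased_var.mean())
print("true sigma^2:              ", sigma**2)
```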

Slide 12: Introduction to Bayesian Parameter Estimation

• In Chapter 2, we learned how to design an optimal classifier if we knew the prior probabilities, P(ω_i), and the class-conditional densities, p(x|ω_i).

• Bayes: treat the parameters as random variables having some known prior distribution. Observation of samples converts this to a posterior.

• Bayesian learning: sharpen the a posteriori density, causing it to peak near the true value.

• Supervised vs. unsupervised: do we know the class assignments of the training data?

• Bayesian estimation and ML estimation produce very similar results in many cases.

• Reduces statistical inference (prior knowledge or beliefs about the world) to probabilities.

Slide 13: Class-Conditional Densities

• Posterior probabilities, P(ω_i|x), are central to Bayesian classification.

• Bayes formula allows us to compute P(ω_i|x) from the priors, P(ω_i), and the likelihood, p(x|ω_i).

• But what if the priors and class-conditional densities are unknown?

• The answer is that we can compute the posterior, P(ω_i|x), using all of the information at our disposal (e.g., training data).

• For a training set, D, Bayes formula becomes (likelihood × prior / evidence):

  $P(\omega_i|\mathbf{x},D) = \frac{p(\mathbf{x}|\omega_i,D)\,P(\omega_i|D)}{\sum_{j=1}^{c} p(\mathbf{x}|\omega_j,D)\,P(\omega_j|D)}$

• We assume the priors are known: P(ω_i|D) = P(ω_i).

• Also, assume functional independence: D_i has no influence on p(x|ω_j, D) if i ≠ j. This gives:

  $P(\omega_i|\mathbf{x},D) = \frac{p(\mathbf{x}|\omega_i,D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(\mathbf{x}|\omega_j,D_j)\,P(\omega_j)}$
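To make the formula concrete, here is a minimal sketch (an addition to the slides, using assumed Gaussian class-conditionals and priors rather than values from the lecture) that evaluates the posteriors P(ω_i|x) for a one-dimensional, two-class problem:

```python
# Posterior class probabilities P(w_i|x) from priors and class-conditional densities.
import numpy as np
from scipy.stats import norm

priors = np.array([0.6, 0.4])                      # assumed P(w_1), P(w_2)
class_means, class_stds = [0.0, 2.0], [1.0, 1.5]   # assumed class-conditional parameters

def posteriors(x):
    likelihoods = np.array([norm.pdf(x, m, s) for m, s in zip(class_means, class_stds)])
    evidence = np.sum(likelihoods * priors)        # p(x) = sum_j p(x|w_j) P(w_j)
    return likelihoods * priors / evidence         # P(w_i|x)

print(posteriors(1.0))   # posterior class probabilities at x = 1.0
```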

Slide 14: The Parameter Distribution

• Assume the parametric form of the evidence, p(x), is known: p(x|θ).

• Any information we have about θ prior to collecting samples is contained in a known prior density p(θ).

• Observation of samples converts this to a posterior, p(θ|D), which we hope is peaked around the true value of θ.

• Our goal is to estimate a parameter vector:

  $p(\mathbf{x}|D) = \int p(\mathbf{x},\boldsymbol{\theta}|D)\,d\boldsymbol{\theta}$

• We can write the joint distribution as a product:

  $p(\mathbf{x}|D) = \int p(\mathbf{x}|\boldsymbol{\theta},D)\,p(\boldsymbol{\theta}|D)\,d\boldsymbol{\theta} = \int p(\mathbf{x}|\boldsymbol{\theta})\,p(\boldsymbol{\theta}|D)\,d\boldsymbol{\theta}$

  because the samples are drawn independently.

• This equation links the class-conditional density $p(\mathbf{x}|D)$ to the posterior $p(\boldsymbol{\theta}|D)$. But numerical solutions are typically required!
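Since closed forms are rare, p(x|D) is usually evaluated numerically. The sketch below (an addition to the slides; the prior, noise level, and data are made-up values) approximates the integral on a grid of θ for a univariate Gaussian with unknown mean:

```python
# Grid approximation of p(x|D) = integral p(x|theta) p(theta|D) dtheta.
import numpy as np
from scipy.stats import norm

sigma = 1.0                                  # assumed known likelihood std. dev.
D = np.array([0.8, 1.2, 0.9, 1.4, 1.1])      # hypothetical training samples

theta = np.linspace(-5.0, 5.0, 2001)         # grid over the unknown mean
prior = norm.pdf(theta, loc=0.0, scale=2.0)  # assumed prior p(theta)

# Unnormalized posterior: prior times the product of sample likelihoods.
log_like = norm.logpdf(D[:, None], loc=theta[None, :], scale=sigma).sum(axis=0)
post = prior * np.exp(log_like)
post /= np.trapz(post, theta)                # normalize p(theta|D)

# Predictive density at a query point x.
x = 1.0
p_x_given_D = np.trapz(norm.pdf(x, loc=theta, scale=sigma) * post, theta)
print("p(x=1.0 | D) ~", p_x_given_D)
```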

Slide 15: Univariate Gaussian Case

• Case: only the mean is unknown: $p(x|\mu) \sim N(\mu,\sigma^2)$

• Known prior density: $p(\mu) \sim N(\mu_0,\sigma_0^2)$

• Using Bayes formula:

  $p(\mu|D) = \frac{p(D|\mu)\,p(\mu)}{\int p(D|\mu)\,p(\mu)\,d\mu} = \alpha \prod_{k=1}^{n} p(x_k|\mu)\,p(\mu)$

• Rationale: once a value of μ is known, the density for x is completely known. α is a normalization factor that depends on the data, D.

Slide 16: Univariate Gaussian Case

• Applying our Gaussian assumptions:

  $p(\mu|D) = \alpha \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{1}{2}\left(\frac{x_k-\mu}{\sigma}\right)^2\right]\cdot \frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_0}{\sigma_0}\right)^2\right]$

  $\;= \alpha' \exp\!\left[-\frac{1}{2}\left(\sum_{k=1}^{n}\left(\frac{x_k-\mu}{\sigma}\right)^2 + \left(\frac{\mu-\mu_0}{\sigma_0}\right)^2\right)\right]$

  $\;= \alpha'' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n}x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$

Slide 17: Univariate Gaussian Case (Cont.)

• Now we need to work this into a simpler form:

  $p(\mu|D) = \alpha'' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n}x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$

  $\;= \alpha'' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$

  where $\hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n}x_k$.

Slide 18: Univariate Gaussian Case (Cont.)

• p(μ|D) is an exponential of a quadratic function, which makes it a normal distribution. Because this is true for any n, it is referred to as a reproducing density.

• p(μ) is referred to as a conjugate prior.

• Write p(μ|D) ~ N(μ_n, σ_n²):

  $p(\mu|D) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{\mu-\mu_n}{\sigma_n}\right)^2\right]$

• Expand the quadratic term:

  $p(\mu|D) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{\mu^2}{\sigma_n^2} - \frac{2\mu_n\mu}{\sigma_n^2} + \frac{\mu_n^2}{\sigma_n^2}\right)\right]$

• Equate coefficients of our two functions:

  $\frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{\mu^2}{\sigma_n^2} - \frac{2\mu_n\mu}{\sigma_n^2} + \frac{\mu_n^2}{\sigma_n^2}\right)\right] = \alpha''\exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$

Slide 19: Univariate Gaussian Case (Cont.)

• Rearrange terms so that the dependencies on μ are clear:

  $\frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{1}{2}\left(\frac{1}{\sigma_n^2}\mu^2 - \frac{2\mu_n}{\sigma_n^2}\mu\right)\right] \propto \alpha''\exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}\right)\mu\right)\right]$

• Associate terms related to μ² and μ:

  $\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} \qquad\qquad \frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}$

• There is actually a third equation, involving the terms that do not depend on μ, but we can ignore this since it is not a function of μ and is a complicated equation to solve.

Slide 20: Univariate Gaussian Case (Cont.)

• Two equations and two unknowns. Solve for μ_n and σ_n². First, solve for σ_n²:

  $\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2} = \frac{n\sigma_0^2 + \sigma^2}{\sigma^2\sigma_0^2} \quad\Rightarrow\quad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}$

• Next, solve for μ_n:

  $\mu_n = \sigma_n^2\left(\frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}\right) = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2+\sigma^2}\left(\frac{n}{\sigma^2}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^2}\right)$

• Summarizing:

  $\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\right)\hat{\mu}_n + \left(\frac{\sigma^2}{n\sigma_0^2+\sigma^2}\right)\mu_0 \qquad\qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2+\sigma^2}$
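The closed-form update is easy to exercise numerically. This sketch (an addition to the slides; the prior and "true" mean are assumed values) shows μ_n approaching the sample mean and σ_n² shrinking roughly like σ²/n as n grows:

```python
# Closed-form Bayesian update of the mean of a univariate Gaussian with known sigma.
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0                      # known likelihood standard deviation
mu0, sigma0 = 0.0, 2.0           # prior p(mu) ~ N(mu0, sigma0^2)
true_mu = 1.7                    # used only to simulate data

for n in (1, 10, 100, 1000):
    x = rng.normal(true_mu, sigma, size=n)
    mu_hat_n = x.mean()
    # Posterior parameters from the two equations solved above.
    mu_n = (n * sigma0**2 * mu_hat_n + sigma**2 * mu0) / (n * sigma0**2 + sigma**2)
    var_n = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
    print(f"n={n:5d}  mu_n={mu_n:6.3f}  sigma_n^2={var_n:8.5f}")
```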

Slide 21: Bayesian Learning

• μ_n represents our best guess after n samples.

• σ_n² represents our uncertainty about this guess.

• σ_n² approaches σ²/n for large n: each additional observation decreases our uncertainty.

• The posterior, p(μ|D), becomes more sharply peaked as n grows large. This is known as Bayesian learning.

Slide 22: "The Euro Coin"

• Getting ahead a bit, let's see how we can put these ideas to work on a simple example due to David MacKay, and explained by Jon Hamaker.
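As a preview of the kind of reasoning the Euro-coin example involves, here is a hedged sketch (an addition to the slides; the spin counts are made up, not MacKay's data) of a conjugate Beta-Bernoulli update for a coin's probability of heads:

```python
# Bayesian update of a coin's head probability with a conjugate Beta prior.
from scipy.stats import beta

a0, b0 = 1.0, 1.0          # uniform Beta(1,1) prior over p(heads)
heads, tails = 60, 40      # hypothetical spin counts (illustration only)

posterior = beta(a0 + heads, b0 + tails)
print("posterior mean of p(heads):", posterior.mean())
print("95% credible interval:     ", posterior.interval(0.95))
```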

Slide 23: Summary

• Review of maximum likelihood parameter estimation in the Gaussian case, with an emphasis on convergence and bias of the estimates.

• Introduction of Bayesian parameter estimation.

• The role of the class-conditional distribution in a Bayesian estimate.

• Estimation of the posterior and probability density function assuming the only unknown parameter is the mean, and the conditional density of the "features" given the mean, p(x|μ), can be modeled as a Gaussian distribution.