
CSCE 666 Pattern Analysis | Ricardo Gutierrez-Osuna | CSE@TAMU

L6: Parameter estimation

• Introduction

• Parameter estimation
• Maximum likelihood

• Bayesian estimation

• Numerical examples


• In previous lectures we showed how to build classifiers when the underlying densities are known
 – Bayesian decision theory introduced the general formulation
 – Quadratic classifiers covered the special case of unimodal Gaussian data

• In most situations, however, the true distributions are unknown and must be estimated from data
 – Two approaches are commonplace
 • Parameter estimation (this lecture)
 • Non-parametric density estimation (the next two lectures)

• Parameter estimation
 – Assume a particular form for the density (e.g., Gaussian), so only its parameters (e.g., mean and variance) need to be estimated
 • Maximum likelihood
 • Bayesian estimation
• Non-parametric density estimation
 – Assume NO knowledge about the density
 • Kernel density estimation
 • Nearest neighbor rule


ML vs. Bayesian parameter estimation

• Maximum Likelihood

 – The parameters are assumed to be FIXED but unknown
 – ML seeks the value of $\theta$ that “best” explains the dataset X

$\hat{\theta} = \arg\max_{\theta}\, p(X|\theta)$

• Bayesian estimation

 – Parameters are assumed to be random variables with some (assumed) known a priori distribution

 – Bayesian methods seek to estimate the posterior density $p(\theta|X)$

 – The final density $p(x|X)$ is obtained by integrating out the parameters

$p(x|X) = \int p(x|\theta)\, p(\theta|X)\, d\theta$

[Figure: Maximum likelihood returns a single point estimate $\hat{\theta}$ from the likelihood $p(X|\theta)$; Bayesian estimation combines the prior $p(\theta)$ with the likelihood $p(X|\theta)$ to obtain the full posterior $p(\theta|X)$]
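To make the contrast concrete, here is a minimal NumPy sketch (illustrative only, not from the lecture) that estimates the mean of a Gaussian with known $\sigma$ both ways: ML returns the single value $\hat{\theta}$ that maximizes $p(X|\theta)$, while the Bayesian route returns a whole posterior $p(\theta|X)$ over a grid of candidate values. The dataset, prior width, and grid are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                                       # known std. dev. of p(x|theta)
X = rng.normal(loc=2.0, scale=sigma, size=20)     # observed i.i.d. dataset (hypothetical)

theta = np.linspace(-2.0, 6.0, 801)               # grid of candidate means
dtheta = theta[1] - theta[0]

# log-likelihood log p(X|theta) for each candidate mean (Gaussian, known sigma)
loglik = np.array([np.sum(-0.5 * ((X - t) / sigma) ** 2) for t in theta])

# ML: a single point estimate -- the theta that maximizes p(X|theta)
theta_ml = theta[np.argmax(loglik)]

# Bayesian: combine a broad prior p(theta) with the likelihood to get p(theta|X)
prior = np.exp(-0.5 * (theta / 10.0) ** 2)        # broad zero-mean prior (unnormalized)
post = np.exp(loglik - loglik.max()) * prior      # unnormalized posterior
post /= post.sum() * dtheta                       # normalize so it integrates to 1

post_mean = np.sum(theta * post) * dtheta
post_std = np.sqrt(np.sum((theta - post_mean) ** 2 * post) * dtheta)
print("ML point estimate:", theta_ml)
print("Posterior mean   :", post_mean, "  posterior std:", post_std)
```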


Maximum Likelihood

• Problem definition

 – Assume we seek to estimate a density $p(x)$ that is known to depend on a number of parameters $\theta = [\theta_1, \theta_2, \ldots, \theta_M]^T$
 • For a Gaussian pdf, $\theta_1 = \mu$, $\theta_2 = \sigma$, and $p(x) = N(\mu, \sigma)$
 • To make the dependence explicit, we write $p(x|\theta)$
 – Assume we have a dataset $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ drawn independently from the distribution $p(x|\theta)$ (an i.i.d. set)
 • Then we can write

$p(X|\theta) = \prod_{k=1}^{N} p(x^{(k)}|\theta)$

 • The ML estimate of $\theta$ is the value that maximizes the likelihood $p(X|\theta)$

$\hat{\theta} = \arg\max_{\theta}\, p(X|\theta)$

 • This corresponds to the intuitive idea of choosing the value of $\theta$ that is most likely to give rise to the data
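As a quick numerical illustration of the definitions above (a sketch with assumed data, not part of the original slides), the likelihood of an i.i.d. set is just the product of per-sample densities, and the ML estimate can be found by brute-force search over a grid of candidate parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_known = 2.0
X = rng.normal(loc=3.0, scale=sigma_known, size=50)   # i.i.d. samples (hypothetical)

def gauss_pdf(x, mu, sigma):
    """Per-sample density p(x | mu, sigma) of a univariate Gaussian."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

# Likelihood p(X|theta) = product over samples of p(x^(k)|theta), on a grid of mu
mus = np.linspace(0.0, 6.0, 601)
likelihood = np.array([np.prod(gauss_pdf(X, mu, sigma_known)) for mu in mus])

mu_ml = mus[np.argmax(likelihood)]                    # ML estimate by grid search
print("grid-search ML estimate of the mean:", mu_ml)
print("sample mean for comparison         :", X.mean())
```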


• For convenience, we will work with the log likelihood

 – Because the log is a monotonic function: $\arg\max_{\theta} p(X|\theta) = \arg\max_{\theta} \log p(X|\theta)$
 – Hence, the ML estimate of $\theta$ can be written as

$\hat{\theta} = \arg\max_{\theta} \log \prod_{k=1}^{N} p(x^{(k)}|\theta) = \arg\max_{\theta} \sum_{k=1}^{N} \log p(x^{(k)}|\theta)$

• This simplifies the problem, since now we have to maximize a sum of terms rather than a long product of terms

• An added advantage of taking logs will become very clear when the distribution is Gaussian

[Figure: taking logs preserves the location of the maximum: $p(X|\theta)$ and $\log p(X|\theta)$ peak at the same $\hat{\theta}$]
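A small sketch (assumed data and parameters) showing both points made above: the log-likelihood peaks at the same value as the likelihood, and for even moderately large N the raw product underflows to zero while the sum of logs remains perfectly usable.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 2.0
X = rng.normal(loc=3.0, scale=sigma, size=2000)        # many samples on purpose

def log_gauss_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(np.sqrt(2.0 * np.pi) * sigma)

mus = np.linspace(0.0, 6.0, 601)
loglik = np.array([np.sum(log_gauss_pdf(X, mu, sigma)) for mu in mus])
lik = np.array([np.prod(np.exp(log_gauss_pdf(X, mu, sigma))) for mu in mus])

print("argmax of log-likelihood:", mus[np.argmax(loglik)])  # well defined
print("largest raw likelihood  :", lik.max())               # underflows to 0.0 for N = 2000
```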


Example: Gaussian case, unknown $\mu$

• Problem statement

 – Assume a dataset $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ and a density of the form $p(x|\mu) = N(\mu, \sigma)$, where $\sigma$ is known

 – What is the ML estimate of the mean?

$\hat{\mu} = \arg\max_{\mu} \log p(X|\mu) = \arg\max_{\mu} \sum_{k=1}^{N} \log p(x^{(k)}|\mu)$

$= \arg\max_{\mu} \sum_{k=1}^{N} \log \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2}\left(x^{(k)}-\mu\right)^2 \right) \right]$

$= \arg\max_{\mu} \sum_{k=1}^{N} \left[ -\frac{1}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\left(x^{(k)}-\mu\right)^2 \right]$

 – The maxima of a function are defined by the zeros of its derivative

$\frac{\partial}{\partial\mu} \log p(X|\mu) = \sum_{k=1}^{N} \frac{1}{\sigma^2}\left(x^{(k)}-\mu\right) = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{N} \sum_{k=1}^{N} x^{(k)}$

 – So the ML estimate of the mean is the average value of the training data, a very intuitive result!
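A quick numerical check of this result (the data and grid below are assumptions for illustration): the closed-form ML estimate, the sample average, agrees with a brute-force maximization of the log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 0.5                                        # known
X = rng.normal(loc=1.7, scale=sigma, size=100)

mu_closed_form = X.sum() / len(X)                  # the sample average derived above

# Numerical check: maximize the Gaussian log-likelihood over a fine grid of candidate means
mus = np.linspace(0.0, 3.0, 3001)
loglik = np.array([np.sum(-0.5 * ((X - m) / sigma) ** 2) for m in mus])
mu_grid = mus[np.argmax(loglik)]

print("closed form:", mu_closed_form)
print("grid search:", mu_grid)                     # agrees up to the grid resolution
```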


Example: Gaussian case, both $\mu$ and $\sigma$ unknown

• A more general case is when neither $\mu$ nor $\sigma$ is known

 – Fortunately, the problem can be solved in the same fashion
 – The derivative becomes a gradient since we have two variables

$\nabla_{\theta} \log p(X|\theta) = \sum_{k=1}^{N} \nabla_{\theta} \log p(x^{(k)}|\theta) = 0$, with $\theta = [\theta_1, \theta_2]^T = [\mu, \sigma^2]^T$ and

$\nabla_{\theta} \log p(x^{(k)}|\theta) = \begin{bmatrix} \frac{1}{\theta_2}\left(x^{(k)}-\theta_1\right) \\[4pt] -\frac{1}{2\theta_2} + \frac{\left(x^{(k)}-\theta_1\right)^2}{2\theta_2^2} \end{bmatrix}$

 – Solving for $\mu$ and $\sigma^2$ yields

$\hat{\mu} = \frac{1}{N}\sum_{k=1}^{N} x^{(k)} \;;\quad \hat{\sigma}^2 = \frac{1}{N}\sum_{k=1}^{N}\left(x^{(k)}-\hat{\mu}\right)^2$

• Therefore, the ML estimate of the variance is the sample variance of the dataset, again a very pleasing result

 – Similarly, it can be shown that the ML estimates for the multivariate Gaussian are the sample mean vector and sample covariance matrix

$\hat{\mu} = \frac{1}{N}\sum_{k=1}^{N} x^{(k)} \;;\quad \hat{\Sigma} = \frac{1}{N}\sum_{k=1}^{N} \left(x^{(k)}-\hat{\mu}\right)\left(x^{(k)}-\hat{\mu}\right)^T$
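The multivariate result can be checked with a few lines of NumPy (synthetic data chosen here only for illustration); note the 1/N factor in the ML covariance, as opposed to NumPy's default 1/(N-1).

```python
import numpy as np

rng = np.random.default_rng(4)
true_mu = np.array([1.0, -2.0])
true_cov = np.array([[2.0, 0.6],
                     [0.6, 1.0]])
X = rng.multivariate_normal(true_mu, true_cov, size=500)   # rows are samples x^(k)

N = X.shape[0]
mu_hat = X.mean(axis=0)                    # ML estimate: sample mean vector
D = X - mu_hat
Sigma_ml = (D.T @ D) / N                   # ML estimate: sample covariance with 1/N

print("ML mean vector:\n", mu_hat)
print("ML covariance (1/N):\n", Sigma_ml)
print("np.cov (1/(N-1)) for comparison:\n", np.cov(X, rowvar=False))
```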


Bias and variance

• How good are these estimates?

 – Two measures of “goodness” are used for statistical estimates
 – BIAS: how close is the estimate to the true value?

 – VARIANCE: how much does it change for different datasets?

 – The bias-variance tradeoff
 • In most cases, you can only decrease one of them at the expense of the other

[Figure: dartboard-style illustration of bias and variance relative to the TRUE value: estimates scattered widely around the true value show LOW BIAS / HIGH VARIANCE; estimates tightly clustered away from the true value show HIGH BIAS / LOW VARIANCE]
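The tradeoff can be seen in a small simulation (all numbers below are hypothetical): over many independent datasets we compare the plain sample mean with a shrinkage estimator that pulls the estimate toward zero. The shrunk estimator has lower variance but acquires a bias.

```python
import numpy as np

rng = np.random.default_rng(5)
true_mu, sigma, N, trials = 2.0, 1.0, 10, 10_000

plain, shrunk = [], []
for _ in range(trials):
    X = rng.normal(true_mu, sigma, size=N)
    plain.append(X.mean())            # unbiased estimator of the mean
    shrunk.append(0.5 * X.mean())     # hypothetical shrinkage toward 0: biased, lower variance

for name, est in (("sample mean", np.array(plain)), ("shrunk mean", np.array(shrunk))):
    print(f"{name:12s} bias = {est.mean() - true_mu:+.3f}   variance = {est.var():.3f}")
```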


• What is the bias of the ML estimate of the mean?

$E\left[\hat{\mu}\right] = E\left[ \frac{1}{N}\sum_{k=1}^{N} x^{(k)} \right] = \frac{1}{N}\sum_{k=1}^{N} E\left[x^{(k)}\right] = \mu$

 – Therefore the mean is an unbiased estimate

• What is the bias of the ML estimate of the variance?

$E\left[\hat{\sigma}^2\right] = E\left[ \frac{1}{N}\sum_{k=1}^{N} \left(x^{(k)}-\hat{\mu}\right)^2 \right] = \frac{N-1}{N}\sigma^2 \neq \sigma^2$

 – Thus, the ML estimate of variance is BIASED

• This is because the ML estimate of the variance uses the estimate $\hat{\mu}$ instead of the true mean $\mu$

 – How “bad” is this bias? 

• For $N \to \infty$ the bias becomes zero asymptotically
• The bias is only noticeable when we have very few samples, in which case we should not be doing statistics in the first place!

 – Notice that MATLAB uses an unbiased estimate of the covariance

$\hat{\Sigma}_{unbiased} = \frac{1}{N-1}\sum_{k=1}^{N} \left(x^{(k)}-\hat{\mu}\right)\left(x^{(k)}-\hat{\mu}\right)^T$
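This bias is easy to see numerically (a sketch with assumed values): averaging the 1/N estimate over many small datasets gives roughly (N-1)/N times the true variance, whereas the 1/(N-1) version (what NumPy calls ddof=1, and what the slide notes MATLAB uses) is unbiased.

```python
import numpy as np

rng = np.random.default_rng(6)
true_var, N, trials = 4.0, 5, 100_000

ml_est, unbiased_est = [], []
for _ in range(trials):
    X = rng.normal(0.0, np.sqrt(true_var), size=N)
    ml_est.append(X.var(ddof=0))          # ML estimate: divides by N
    unbiased_est.append(X.var(ddof=1))    # unbiased estimate: divides by N-1

print("true variance              :", true_var)
print("average of ML estimates    :", np.mean(ml_est))          # approx (N-1)/N * true_var
print("(N-1)/N * true variance    :", (N - 1) / N * true_var)
print("average of unbiased estim. :", np.mean(unbiased_est))    # approx true_var
```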

 


Bayesian estimation

• In the Bayesian approach, our uncertainty about the parameters is represented by a pdf
 – Before we observe the data, the parameters are described by a prior density $p(\theta)$, which is typically very broad to reflect the fact that we know little about its true value

 – Once we obtain data, we make use of Bayes theorem to find the posterior $p(\theta|X)$
 • Ideally we want the data to sharpen the posterior $p(\theta|X)$, that is, reduce our uncertainty about the parameters

 – Remember, though, that our goal is to estimate $p(x)$ or, more exactly, $p(x|X)$, the density given the evidence provided by the dataset X



• Let us derive the expression of a Bayesian estimate

 – From the definition of conditional probability

$p(x, \theta | X) = p(x | \theta, X)\, p(\theta | X)$

 – $p(x|\theta, X)$ is independent of $X$, since knowledge of $\theta$ completely specifies the (parametric) density. Therefore

$p(x, \theta | X) = p(x | \theta)\, p(\theta | X)$

 – and, using the theorem of total probability, we can integrate $\theta$ out:

$p(x|X) = \int p(x|\theta)\, p(\theta|X)\, d\theta$

• The only unknown in this expression is $p(\theta|X)$; using Bayes rule

$p(\theta|X) = \frac{p(X|\theta)\, p(\theta)}{p(X)} = \frac{p(X|\theta)\, p(\theta)}{\int p(X|\theta)\, p(\theta)\, d\theta}$

 – where $p(X|\theta)$ can be computed using the i.i.d. assumption

$p(X|\theta) = \prod_{k=1}^{N} p(x^{(k)}|\theta)$

• NOTE: The last three expressions suggest a procedure to estimate $p(x|X)$. This is not to say that integrating these expressions is easy!
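For a one-dimensional parameter, the procedure above can be carried out by brute-force numerical integration on a grid (a sketch with an assumed Gaussian model and prior, not a general recipe):

```python
import numpy as np

rng = np.random.default_rng(7)
sigma = 1.0                                      # known std. dev. of p(x|theta)
X = rng.normal(2.0, sigma, size=15)              # observed i.i.d. data (hypothetical)

theta = np.linspace(-5.0, 9.0, 1401)             # grid over the unknown mean
dtheta = theta[1] - theta[0]

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (np.sqrt(2.0 * np.pi) * s)

# p(X|theta) from the i.i.d. product, then p(theta|X) from Bayes rule on the grid
lik = np.array([np.prod(gauss(X, t, sigma)) for t in theta])
prior = gauss(theta, 0.0, 5.0)                   # broad prior p(theta)
post = lik * prior
post /= post.sum() * dtheta                      # divide by the integral of p(X|theta) p(theta)

# p(x|X) = integral of p(x|theta) p(theta|X) dtheta, evaluated at one new point
x_new = 2.5
p_x_given_X = np.sum(gauss(x_new, theta, sigma) * post) * dtheta
print("p(x = 2.5 | X) =", p_x_given_X)
```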


• Example

 – Assume a univariate density where our random variable is generated from a normal distribution with known standard deviation $\sigma$

 – Our goal is to find the mean $\mu$ of the distribution given some i.i.d. data points $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$

 – To capture our knowledge about $\mu$, we assume that it also follows a normal density with mean $\mu_0$ and standard deviation $\sigma_0$

$p(\mu) = \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\left( -\frac{(\mu-\mu_0)^2}{2\sigma_0^2} \right)$

 – We use Bayes rule to develop an expression for the posterior $p(\mu|X)$

$p(\mu|X) = \frac{p(X|\mu)\, p(\mu)}{p(X)} = \frac{1}{p(X)} \left[ \frac{1}{\sqrt{2\pi}\,\sigma_0} e^{-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}} \right] \prod_{k=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{\left(x^{(k)}-\mu\right)^2}{2\sigma^2}}$

[Bishop, 1995]


 – To understand how Bayesian estimation changes the posterior as more data become available, we will find the maximum of $p(\mu|X)$

 – The partial derivative of $\log p(\mu|X)$ with respect to $\mu$ is

$\frac{\partial}{\partial\mu} \log p(\mu|X) = 0 \;\Rightarrow\; -\frac{\mu-\mu_0}{\sigma_0^2} + \sum_{k=1}^{N} \frac{x^{(k)}-\mu}{\sigma^2} = 0$

 – which, after some algebraic manipulation, becomes

$\mu_N = \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2} \left( \frac{1}{N}\sum_{k=1}^{N} x^{(k)} \right) + \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\, \mu_0$

• Therefore, as N increases, the estimate of the mean moves from the initial prior $\mu_0$ to the ML solution (the sample average)

 – Similarly, the variance of the posterior can be found to be

$\sigma_N^2 = \frac{\sigma_0^2\, \sigma^2}{N\sigma_0^2 + \sigma^2}$

so the uncertainty in the estimate also shrinks as N grows

[Bishop, 1995]
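The two closed-form expressions can be exercised directly (the data below are simulated for illustration): as more samples arrive, the posterior mean $\mu_N$ moves from the prior mean toward the sample average, and the posterior spread shrinks.

```python
import numpy as np

rng = np.random.default_rng(8)
sigma, mu0, sigma0 = 0.3, 0.0, 0.3               # known noise std, prior mean and std
X = rng.normal(0.8, sigma, size=10)              # simulated data with true mean 0.8

for N in (1, 2, 5, 10):
    xbar = X[:N].mean()
    w = N * sigma0**2 / (N * sigma0**2 + sigma**2)        # weight given to the data
    mu_N = w * xbar + (1.0 - w) * mu0                      # posterior mean (slide formula)
    var_N = sigma0**2 * sigma**2 / (N * sigma0**2 + sigma**2)
    print(f"N = {N:2d}   mu_N = {mu_N:.3f}   sigma_N = {np.sqrt(var_N):.3f}")
```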


Example

• Assume that the true mean of the distribution $p(x)$ is $\mu = 0.8$, with standard deviation $\sigma = 0.3$
 • In reality we would not know the true mean; we are just “playing God”

 – We generate a number of examples from this distribution

 – To capture our lack of knowledge about the mean, we assume a normal prior $p(\mu)$, with $\mu_0 = 0.0$ and $\sigma_0 = 0.3$

 – The figure below shows the posterior $p(\mu|X)$

• As $N$ increases, the estimate approaches its true value ($\mu = 0.8$) and the spread (or uncertainty in the estimate) decreases

[Figure: posterior $P(\mu|X)$ plotted over $\mu \in [0, 0.8]$ for N = 0, 1, 5, 10; as N grows the curves sharpen and their peaks move toward $\mu = 0.8$]
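The figure can be reproduced qualitatively with a short grid computation (a sketch; the random seed and grid resolution are arbitrary choices). Instead of the closed form, the posterior is built directly as prior times likelihood and renormalized, and its peak location and spread are reported for N = 0, 1, 5, 10.

```python
import numpy as np

rng = np.random.default_rng(9)
true_mu, sigma = 0.8, 0.3          # "playing God": the data-generating distribution
mu0, sigma0 = 0.0, 0.3             # prior over the unknown mean
X = rng.normal(true_mu, sigma, size=10)

mu = np.linspace(-0.5, 1.5, 2001)  # grid of candidate means
dmu = mu[1] - mu[0]

def gauss(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (np.sqrt(2.0 * np.pi) * s)

for N in (0, 1, 5, 10):
    post = gauss(mu, mu0, sigma0)                  # start from the prior (N = 0)
    for x in X[:N]:
        post *= gauss(x, mu, sigma)                # multiply in one likelihood term per sample
    post /= post.sum() * dmu                       # renormalize
    peak = mu[np.argmax(post)]
    mean = np.sum(mu * post) * dmu
    spread = np.sqrt(np.sum((mu - mean) ** 2 * post) * dmu)
    print(f"N = {N:2d}   posterior peak = {peak:.3f}   posterior std = {spread:.3f}")
```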


ML vs. Bayesian estimation

• What is the relationship between these two estimates?

 – By definition, $p(X|\theta)$ peaks at the ML estimate $\hat{\theta}$
 – If this peak is relatively sharp and the prior is broad, then the integral below will be dominated by the region around the ML estimate

$p(x|X) = \int p(x|\theta)\, p(\theta|X)\, d\theta \;\cong\; p(x|\hat{\theta}) \int p(\theta|X)\, d\theta = p(x|\hat{\theta})$

• Therefore, the Bayesian estimate will approximate the ML solution

 – As we have seen in the previous example, when the number of available data points increases, the posterior $p(\theta|X)$ tends to sharpen

• Thus, the Bayesian estimate of $p(x)$ will approach the ML solution as $N \to \infty$

• In practice, only when we have a limited number of observations will the two approaches yield different results
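A final numerical sketch of this convergence (assumed model and prior, as in the earlier sketches): for increasing N, the Bayesian predictive density $p(x|X)$, computed by grid integration, and the ML plug-in density $p(x|\hat{\theta})$ evaluated at the same point become indistinguishable.

```python
import numpy as np

rng = np.random.default_rng(10)
sigma = 1.0                                       # known std. dev. of p(x|theta)
theta = np.linspace(-4.0, 8.0, 2401)
dtheta = theta[1] - theta[0]

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (np.sqrt(2.0 * np.pi) * s)

x_new = 2.0
for N in (2, 10, 100, 1000):
    X = rng.normal(2.0, sigma, size=N)
    loglik = np.array([np.sum(np.log(gauss(X, t, sigma))) for t in theta])
    post = np.exp(loglik - loglik.max()) * gauss(theta, 0.0, 5.0)   # likelihood x broad prior
    post /= post.sum() * dtheta
    bayes = np.sum(gauss(x_new, theta, sigma) * post) * dtheta      # p(x|X)
    ml = gauss(x_new, X.mean(), sigma)                              # p(x|theta_ML) plug-in
    print(f"N = {N:5d}   Bayesian p(x|X) = {bayes:.4f}   ML plug-in = {ml:.4f}")
```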