
880.P20 Winter 2006 Richard Kass 1

Maximum Likelihood Method (MLM)

Suppose we are trying to measure the true value of some quantity (x_T). We make repeated measurements of this quantity {x₁, x₂, ..., xₙ}. The standard way to estimate x_T from our measurements is to calculate the mean value of the measurements:

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

and set $x_T = \bar{x}$. Does this procedure make sense?

The MLM answers this question and provides a method for estimating parameters from existing data.

Statement of the Maximum Likelihood Method (MLM): Assume we have made N measurements of x: {x₁, x₂, ..., xₙ}.

Assume we know the probability distribution function that describes x: f(x, α). Assume we want to determine the parameter α. The MLM says that we pick α such as to maximize the probability of getting the measurements (the xᵢ's) that we obtained!

How do we use the MLM? The probability of measuring x₁ is f(x₁, α)dx. The probability of measuring x₂ is f(x₂, α)dx. The probability of measuring xₙ is f(xₙ, α)dx. If the measurements are independent, the probability of getting our measurements is: L = f(x₁, α) f(x₂, α) ⋯ f(xₙ, α) dxⁿ. L is called the Likelihood Function. It is convenient to write L as:

$$L = \prod_{i=1}^{N} f(x_i, \alpha)$$

We want to pick the α that maximizes L. Thus we want to solve:

$$\left.\frac{\partial L}{\partial \alpha}\right|_{\alpha^*} = 0$$

(We drop the dxⁿ since it is just a proportionality constant.)


Maximum Likelihood Method (MLM)

In practice it is easier to maximize lnL rather than L itself, since lnL turns the product into a summation. Maximizing lnL gives the same α, since L and lnL attain their maximum at the same point.

$$\ln L = \sum_{i=1}^{N} \ln f(x_i, \alpha)$$

The maximization condition is now:

$$\left.\frac{\partial \ln L}{\partial \alpha}\right|_{\alpha^*} = \sum_{i=1}^{N} \left.\frac{\partial \ln f(x_i, \alpha)}{\partial \alpha}\right|_{\alpha^*} = 0$$

Note: α could be an array of parameters or just a single variable. The equations to determine α range from simple linear equations to coupled non-linear equations.

Example: Let f(x, μ) be given by a Gaussian distribution and let μ be the mean of the Gaussian. We want the best estimate of μ from our set of n measurements (x₁, x₂, ..., xₙ). Let's assume that σ is the same for each measurement.

$$f(x_i, \mu) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}$$

The likelihood function for this problem is:

$$L = \prod_{i=1}^{n} f(x_i, \mu) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{\!n} e^{-\frac{(x_1-\mu)^2}{2\sigma^2}}\, e^{-\frac{(x_2-\mu)^2}{2\sigma^2}} \cdots e^{-\frac{(x_n-\mu)^2}{2\sigma^2}} = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{\!n} e^{-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}}$$

We want to find the μ that maximizes the likelihood function (actually we will maximize lnL):

$$\frac{\partial \ln L}{\partial \mu} = \frac{\partial}{\partial \mu}\left[\, n \ln\frac{1}{\sigma\sqrt{2\pi}} - \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}\,\right] = 0$$

Since σ is the same for each data point we can factor it out:

$$\sum_{i=1}^{n} (x_i - \mu) = 0 \qquad\text{or}\qquad \sum_{i=1}^{n} x_i = n\mu$$

Finally, solving for μ we have:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$$

For the case where the σᵢ are different for each data point we get the weighted average:

$$\mu = \frac{\sum_{i=1}^{n} x_i/\sigma_i^2}{\sum_{i=1}^{n} 1/\sigma_i^2}$$

Average!
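A minimal Python sketch of this result (the measurements and resolutions below are made-up illustrative values, and numpy/scipy are assumed to be available): numerically maximizing lnL lands on the weighted average derived above.

    import numpy as np
    from scipy.optimize import minimize_scalar

    x = np.array([9.8, 10.4, 10.1, 9.5, 10.9])    # hypothetical measurements
    sigma = np.array([0.5, 0.8, 0.4, 0.6, 1.0])   # per-point sigma_i

    def neg_lnL(mu):
        # -lnL up to mu-independent constants
        return np.sum((x - mu)**2 / (2.0 * sigma**2))

    fit = minimize_scalar(neg_lnL, bounds=(5.0, 15.0), method="bounded")
    weighted_mean = np.sum(x / sigma**2) / np.sum(1.0 / sigma**2)
    print(fit.x, weighted_mean)                   # the two agree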


Maximum Likelihood Method (MLM)

Example: Let f(x, μ) be given by a Poisson distribution with μ the mean of the Poisson. We want the best estimate of μ from our set of n measurements (x₁, x₂, ..., xₙ). The Poisson distribution is a discrete distribution, so the xᵢ are integers. The Poisson distribution is given by:

$$f(x, \mu) = \frac{e^{-\mu}\mu^{x}}{x!}$$

The likelihood function for this problem is:

$$L = \prod_{i=1}^{n} f(x_i, \mu) = \prod_{i=1}^{n} \frac{e^{-\mu}\mu^{x_i}}{x_i!} = \frac{e^{-\mu}\mu^{x_1}}{x_1!}\cdot\frac{e^{-\mu}\mu^{x_2}}{x_2!}\cdots\frac{e^{-\mu}\mu^{x_n}}{x_n!} = \frac{e^{-n\mu}\,\mu^{\sum_{i=1}^{n} x_i}}{x_1!\,x_2!\cdots x_n!}$$

We want to find the μ that maximizes the likelihood function (actually we will maximize lnL).

$$\frac{d\ln L}{d\mu} = \frac{d}{d\mu}\left[-n\mu + \ln\mu\sum_{i=1}^{n} x_i - \ln(x_1!\,x_2!\cdots x_n!)\right] = -n + \frac{1}{\mu}\sum_{i=1}^{n} x_i = 0$$

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \text{Average!}$$
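The same check in Python for the Poisson case (the counts are made up; gammaln gives ln x!, so the code also works for large counts): the numerical maximum of lnL is the sample average.

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.special import gammaln

    counts = np.array([3, 7, 4, 6, 2, 5])          # hypothetical integer data

    def neg_lnL(mu):
        # -lnL = n*mu - ln(mu)*sum(x) + sum(ln x_i!)
        return len(counts)*mu - counts.sum()*np.log(mu) + gammaln(counts + 1).sum()

    fit = minimize_scalar(neg_lnL, bounds=(0.1, 20.0), method="bounded")
    print(fit.x, counts.mean())                    # both ~4.5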

Some general properties of the maximum likelihood method:
a) For large data samples (large n) the likelihood function, L, approaches a Gaussian distribution.
b) Maximum likelihood estimates are usually consistent. By consistent we mean that for large n the estimates converge to the true value of the parameters we wish to determine.
c) For many instances the estimate from MLM is unbiased. Unbiased means that for all sample sizes the parameter of interest is calculated correctly.
d) The maximum likelihood estimate of a parameter is the estimate with the smallest variance (the Cramér-Rao bound). We say the estimate is efficient.
e) The maximum likelihood estimate is sufficient. By sufficient we mean that it uses all the information in the observations (the xᵢ's).
f) The solution from MLM is unique.
The bad news is that we must know the correct probability distribution for the problem at hand!


Errors & Maximum Likelihood Method (MLM)

How do we calculate errors (σ's) using the MLM?

Start by looking at the case where we have a gaussian pdf. The likelihood function is:

$$L = \prod_{i=1}^{n} f(x_i, \mu) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{\!n} e^{-\frac{(x_1-\mu)^2}{2\sigma^2}}\, e^{-\frac{(x_2-\mu)^2}{2\sigma^2}} \cdots e^{-\frac{(x_n-\mu)^2}{2\sigma^2}} = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^{\!n} e^{-\sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}}$$

It is easier to work with lnL:

$$\ln L = -n\ln(\sigma\sqrt{2\pi}) - \sum_{i=1}^{n}\frac{(x_i-\mu)^2}{2\sigma^2}$$

If we take two derivatives of lnL with respect to μ we get:

$$\frac{\partial \ln L}{\partial \mu} = \sum_{i=1}^{n}\frac{x_i - \mu}{\sigma^2}$$

$$\frac{\partial^2 \ln L}{\partial \mu^2} = -\sum_{i=1}^{n}\frac{1}{\sigma^2} = -\frac{n}{\sigma^2}$$

For the case of a gaussian pdf we get the familiar result:

$$\sigma_\mu^2 = -\left[\frac{\partial^2 \ln L}{\partial \mu^2}\right]^{-1} = \frac{\sigma^2}{n}$$

The big news here is that the variance of the parameter of interest is related to the 2nd derivative of lnL.

Since our example uses a gaussian pdf the result is exact. More important, the result is asymptotically true for ALL pdf’s since for large samples (n) all likelihood functions become “gaussian”.
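A sketch of this idea in Python (a toy gaussian sample with assumed known σ): estimate the variance of the estimate from the numerical curvature of lnL and compare with σ²/n.

    import numpy as np

    rng = np.random.default_rng(1)
    sigma, n = 2.0, 1000
    x = rng.normal(10.0, sigma, size=n)            # toy gaussian sample

    def lnL(mu):                                   # constants dropped
        return -np.sum((x - mu)**2) / (2.0 * sigma**2)

    mu_hat, h = x.mean(), 1e-3
    d2 = (lnL(mu_hat + h) - 2.0*lnL(mu_hat) + lnL(mu_hat - h)) / h**2
    print(-1.0/d2, sigma**2/n)                     # both ~0.004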


Errors & MLM

The previous example was for one variable. We can generalize the result to the case where we determine several parameters from the likelihood function (e.g. α₁, α₂, … αₙ):

$$[V^{-1}]_{ij} = -\left.\frac{\partial^2 \ln L}{\partial \alpha_i\, \partial \alpha_j}\right|_{\alpha=\alpha^*}$$

Here Vᵢⱼ is a matrix (the "covariance matrix" or "error matrix"), and it is evaluated at the values of (α₁, α₂, … αₙ) that maximize the likelihood function.

In practice it is often very difficult or impossible to analytically evaluate the 2nd derivatives.

The procedure most often used to determine the variances in the parameters relies on the property that the likelihood function becomes gaussian (or parabolic) asymptotically. We expand lnL about the ML estimate for the parameters. For the one parameter case we have:

$$\ln L(\alpha) = \ln L(\alpha^*) + \left.\frac{\partial \ln L}{\partial \alpha}\right|_{\alpha^*}(\alpha - \alpha^*) + \frac{1}{2!}\left.\frac{\partial^2 \ln L}{\partial \alpha^2}\right|_{\alpha^*}(\alpha - \alpha^*)^2 + \cdots$$

Since we are evaluating lnL at the value of α (= α*) that maximizes L, the term with the 1st derivative is zero. Using the expression for the variance of α on the previous page and neglecting higher order terms we find:

$$\ln L(\alpha) = \ln L_{\max} - \frac{(\alpha - \alpha^*)^2}{2\sigma_{\alpha^*}^2} \qquad\text{or}\qquad \ln L(\alpha^* \pm k\sigma_{\alpha^*}) = \ln L_{\max} - \frac{k^2}{2}$$

Thus we can determine the kσ limits on the parameters by finding the values where lnL decreases by k²/2 from its maximum value.

This is what MINUIT does!


Example: Log-Likelihood Errors & MLM

[Figures: scans of lnL vs τ for the exponential example below. Left: 10 events, lnL between about -67 and -62 over τ = 0 to 600. Right: 10⁴ events, lnL near -5.613×10⁴ over τ = 97 to 104, fit by the parabola y = m3 - (τ - m1)²/(2 m2²) with m1 = 100.8 ± 0.013, m2 = 1.01 ± 0.0089, m3 = -56128.]

Example: Exponential decay. pdf: $f(t, \tau) = \frac{1}{\tau}\, e^{-t/\tau}$

Generate events according to an exponential distribution with τ = 100: generate times from an exponential using tᵢ = -τ₀ ln rᵢ, with the rᵢ uniform random numbers.

Calculate lnL vs τ and find the maximum of lnL and the points where lnL = lnL_max - 1/2 (the "1σ points").

Compare errors from the "exact" formula and the log-likelihood points.

$$L = \prod_{i=1}^{n}\frac{1}{\tau}\, e^{-t_i/\tau} \quad\text{and}\quad \ln L = -n\ln\tau - \frac{1}{\tau}\sum_{i=1}^{n} t_i \qquad\text{exact solution: } \frac{\partial \ln L}{\partial \tau} = 0 \;\Rightarrow\; \tau = \frac{1}{n}\sum_{i=1}^{n} t_i$$

Log-likelihood function for 10 events: lnL is maximum for τ = 189; 1σ points: (140, 265) vs exact: (129, 245). L is not gaussian.

Log-likelihood function for 10⁴ events: lnL is maximum for τ = 100.8; 1σ points: (99.8, 101.8) vs exact: (99.8, 101.8). L is well fit by a gaussian.

The variance of an exponential pdf with mean lifetime τ is: σ² = τ²/n.

ten events: 1104.082, 220.056, 27.039, 171.492, 10.217, 11.671, 94.930, 74.246, 12.534, 168.319
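A Python sketch of this toy study (a fresh random sample, so the numbers will differ from the ten events above): generate times with tᵢ = -τ₀ ln rᵢ, scan lnL(τ), and read off the points where lnL drops by 1/2 from its maximum.

    import numpy as np

    rng = np.random.default_rng(2)
    tau0, n = 100.0, 10
    t = -tau0 * np.log(rng.uniform(size=n))        # t_i = -tau0 * ln(r_i)

    taus = np.linspace(20.0, 600.0, 5000)
    lnL = -n*np.log(taus) - t.sum()/taus           # lnL = -n ln(tau) - sum(t)/tau

    i = np.argmax(lnL)
    band = taus[lnL >= lnL[i] - 0.5]               # "1 sigma" points of the scan
    tau_hat = t.mean()                             # exact ML solution
    print(tau_hat, taus[i])                        # grid maximum ~ sample mean
    print(band.min(), band.max())                  # lnL-based interval
    print(tau_hat - tau_hat/np.sqrt(n), tau_hat + tau_hat/np.sqrt(n))  # "exact": sigma^2 = tau^2/n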


Determining the Slope and Intercept with MLM

Example: MLM and determining slope and intercept of a line

Assume we have a set of measurements (x₁, y₁, σ₁), (x₂, y₂, σ₂), … (xₙ, yₙ, σₙ), and the points are thought to come from a straight line, y = α + βx, and the measurements come from a gaussian pdf. The likelihood function is:

$$L = \prod_{i=1}^{n} f(x_i, \alpha, \beta) = \prod_{i=1}^{n}\frac{1}{\sigma_i\sqrt{2\pi}}\, e^{-\frac{(y_i - q(x_i,\alpha,\beta))^2}{2\sigma_i^2}} = \prod_{i=1}^{n}\frac{1}{\sigma_i\sqrt{2\pi}}\, e^{-\frac{(y_i - \alpha - \beta x_i)^2}{2\sigma_i^2}}$$

We wish to find the α and β that maximize the likelihood function L. Thus we need to take some derivatives:

$$\frac{\partial \ln L}{\partial \alpha} = \frac{\partial}{\partial \alpha}\sum_{i=1}^{n}\left[\ln\frac{1}{\sigma_i\sqrt{2\pi}} - \frac{(y_i - \alpha - \beta x_i)^2}{2\sigma_i^2}\right] = \sum_{i=1}^{n}\frac{y_i - \alpha - \beta x_i}{\sigma_i^2} = 0$$

$$\frac{\partial \ln L}{\partial \beta} = \frac{\partial}{\partial \beta}\sum_{i=1}^{n}\left[\ln\frac{1}{\sigma_i\sqrt{2\pi}} - \frac{(y_i - \alpha - \beta x_i)^2}{2\sigma_i^2}\right] = \sum_{i=1}^{n}\frac{(y_i - \alpha - \beta x_i)\,x_i}{\sigma_i^2} = 0$$

We have to solve the two equations for the two unknowns, α and β. We can get an exact solution since these equations are linear in α and β; we just have to invert a matrix:

$$\sum_{i=1}^{n}\frac{y_i}{\sigma_i^2} - \alpha\sum_{i=1}^{n}\frac{1}{\sigma_i^2} - \beta\sum_{i=1}^{n}\frac{x_i}{\sigma_i^2} = 0 \qquad\quad \sum_{i=1}^{n}\frac{x_i y_i}{\sigma_i^2} - \alpha\sum_{i=1}^{n}\frac{x_i}{\sigma_i^2} - \beta\sum_{i=1}^{n}\frac{x_i^2}{\sigma_i^2} = 0$$

In matrix form:

$$\begin{pmatrix} \sum_{i=1}^{n} y_i/\sigma_i^2 \\ \sum_{i=1}^{n} x_i y_i/\sigma_i^2 \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} 1/\sigma_i^2 & \sum_{i=1}^{n} x_i/\sigma_i^2 \\ \sum_{i=1}^{n} x_i/\sigma_i^2 & \sum_{i=1}^{n} x_i^2/\sigma_i^2 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}$$
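A minimal Python sketch that builds and solves this 2×2 matrix equation (the straight-line data and σᵢ are made up):

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])       # roughly y = 1 + 2x
    s = np.array([0.2, 0.2, 0.3, 0.3, 0.4])       # per-point sigma_i
    w = 1.0 / s**2

    b = np.array([np.sum(w*y), np.sum(w*x*y)])
    M = np.array([[np.sum(w),   np.sum(w*x)],
                  [np.sum(w*x), np.sum(w*x**2)]])
    alpha, beta = np.linalg.solve(M, b)            # exact linear solution
    print(alpha, beta)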


Determining the Errors on the Slope and Intercept with MLM

Solving the matrix equation gives:

$$\alpha = \frac{\displaystyle\sum_{i=1}^{n}\frac{y_i}{\sigma_i^2}\sum_{i=1}^{n}\frac{x_i^2}{\sigma_i^2} - \sum_{i=1}^{n}\frac{y_i x_i}{\sigma_i^2}\sum_{i=1}^{n}\frac{x_i}{\sigma_i^2}}{\displaystyle\sum_{i=1}^{n}\frac{1}{\sigma_i^2}\sum_{i=1}^{n}\frac{x_i^2}{\sigma_i^2} - \left(\sum_{i=1}^{n}\frac{x_i}{\sigma_i^2}\right)^{\!2}} \qquad\text{and}\qquad \beta = \frac{\displaystyle\sum_{i=1}^{n}\frac{1}{\sigma_i^2}\sum_{i=1}^{n}\frac{x_i y_i}{\sigma_i^2} - \sum_{i=1}^{n}\frac{y_i}{\sigma_i^2}\sum_{i=1}^{n}\frac{x_i}{\sigma_i^2}}{\displaystyle\sum_{i=1}^{n}\frac{1}{\sigma_i^2}\sum_{i=1}^{n}\frac{x_i^2}{\sigma_i^2} - \left(\sum_{i=1}^{n}\frac{x_i}{\sigma_i^2}\right)^{\!2}}$$

Let's calculate the error (covariance) matrix for α and β:

$$[V^{-1}]_{ij} = -\frac{\partial^2 \ln L}{\partial \alpha_i\, \partial \alpha_j}$$

Taking the second derivatives of lnL:

$$\frac{\partial^2 \ln L}{\partial \alpha^2} = -\sum_{i=1}^{n}\frac{1}{\sigma_i^2} \qquad \frac{\partial^2 \ln L}{\partial \alpha\,\partial \beta} = -\sum_{i=1}^{n}\frac{x_i}{\sigma_i^2} \qquad \frac{\partial^2 \ln L}{\partial \beta^2} = -\sum_{i=1}^{n}\frac{x_i^2}{\sigma_i^2}$$

so

$$V^{-1} = \begin{pmatrix} \sum_{i=1}^{n} 1/\sigma_i^2 & \sum_{i=1}^{n} x_i/\sigma_i^2 \\ \sum_{i=1}^{n} x_i/\sigma_i^2 & \sum_{i=1}^{n} x_i^2/\sigma_i^2 \end{pmatrix}$$

Inverting this 2×2 matrix:

$$V = \frac{1}{D}\begin{pmatrix} \sum_{i=1}^{n} x_i^2/\sigma_i^2 & -\sum_{i=1}^{n} x_i/\sigma_i^2 \\ -\sum_{i=1}^{n} x_i/\sigma_i^2 & \sum_{i=1}^{n} 1/\sigma_i^2 \end{pmatrix} \qquad\text{with}\qquad D = \sum_{i=1}^{n}\frac{1}{\sigma_i^2}\sum_{i=1}^{n}\frac{x_i^2}{\sigma_i^2} - \left(\sum_{i=1}^{n}\frac{x_i}{\sigma_i^2}\right)^{\!2}$$

Note: We could also derive the variances of α and β just using propagation of errors on the formulas for α and β. The diagonal elements of V give σ_α² = V₁₁ and σ_β² = V₂₂.
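Continuing the line-fit sketch from the previous page (same made-up x and σ values): V is the inverse of the matrix of second derivatives, and its diagonal gives σ_α² and σ_β².

    import numpy as np

    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    s = np.array([0.2, 0.2, 0.3, 0.3, 0.4])
    w = 1.0 / s**2

    Vinv = np.array([[np.sum(w),   np.sum(w*x)],
                     [np.sum(w*x), np.sum(w*x**2)]])
    V = np.linalg.inv(Vinv)                        # covariance (error) matrix
    print(np.sqrt(V[0, 0]), np.sqrt(V[1, 1]))      # sigma_alpha, sigma_beta
    print(V[0, 1])                                 # covariance of alpha and beta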


Chi-Square (χ²) Distribution

Chi-square (χ²) distribution: Assume that our measurements (xᵢ ± σᵢ) come from a gaussian pdf with mean μ. Define a statistic called chi-square:

$$\chi^2 = \sum_{i=1}^{n}\frac{(x_i - \mu)^2}{\sigma_i^2}$$

It can be shown that the pdf for χ² is:

$$p(\chi^2, n) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\,[\chi^2]^{n/2 - 1}\, e^{-\chi^2/2} \qquad 0 \le \chi^2 \le \infty$$

This is a continuous pdf.

It is a function of two variables, χ² and n = number of degrees of freedom. (Γ = "Gamma Function")

[Figure: χ² distribution for different degrees of freedom ν]

A few words about the number of degrees of freedom n: n = # data points - # of parameters calculated from the data points.

Reminder: If you collected N events in an experiment and you histogram your data in n bins before performing the fit, then you have n data points!

EXAMPLE: You count cosmic ray events in 15 second intervals and sort the data into 5 bins:
number of intervals with 0 cosmic rays: 2
number of intervals with 1 cosmic ray: 7
number of intervals with 2 cosmic rays: 6
number of intervals with 3 cosmic rays: 3
number of intervals with 4 cosmic rays: 2
Although there were 36 cosmic rays in your sample, you have only 5 data points.

EXAMPLE: We have 10 data points, with μ and σ the mean and standard deviation of the data set.
If we calculate μ and σ from the 10 data points, then n = 8.
If we know μ and calculate σ, OR if we know σ and calculate μ, then n = 9.
If we know μ and σ, then n = 10.

RULE of THUMB: A good fit has χ²/DOF ≈ 1.

For n ≥ 20, P(χ² > y) can be approximated using a gaussian pdf with y = (2χ²)^{1/2} - (2n-1)^{1/2}.

A common approximation (useful for the poisson case) is "Pearson's χ²":

$$\chi^2 = \sum_{i=1}^{n}\frac{(N_i - P_i)^2}{P_i}$$

which is approximately χ² with n-1 DOF.
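In Python, χ² tail probabilities come directly from scipy's chi-square distribution (the values here match the exponential-fit example later in the notes):

    from scipy.stats import chi2

    n = 8                        # degrees of freedom
    y = 15.6                     # observed chi-square
    print(chi2.sf(y, n))         # P(chi^2 > y), ~0.048
    print(chi2.sf(float(n), n))  # chi^2/DOF = 1 gives probability ~0.43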


MLM, Chi-Square, and Least Squares Fitting

Assume we have n data points of the form (yᵢ, σᵢ) and we believe a functional

relationship exists between the points:

y=f(x,a,b…)

In addition, assume we know (exactly) the xi that goes with each yi.

We wish to determine the parameters a, b,..

A common procedure is to minimize the following χ² with respect to the parameters:

$$\chi^2 = \sum_{i=1}^{n}\frac{[y_i - f(x_i, a, b, ...)]^2}{\sigma_i^2}$$

If the yᵢ's are from a gaussian pdf then minimizing the χ² is equivalent to the MLM.

However, oftentimes the yᵢ's are NOT from a gaussian pdf. In these instances we call this technique "χ² fitting" or "Least Squares Fitting".

Strictly speaking, we can only use a χ² probability table when y is from a gaussian pdf.

However, there are many instances where even for non-gaussian pdfs the above sum approximates a χ² pdf.

From a common sense point of view minimizing the above sum makes sense regardless of the underlying pdf.


Least Squares Fitting Example

Example: Leo's 4.8 (P107). The following data from a radioactive source was taken at 15 s intervals. Determine the lifetime (τ) of the source.

i    tᵢ (s)   Nᵢ    yᵢ = ln Nᵢ
1    0        106   4.663
2    15       80    4.382
3    30       98    4.585
4    45       75    4.317
5    60       74    4.304
6    75       73    4.290
7    90       49    3.892
8    105      38    3.638
9    120      37    3.611
10   135      22    3.091

The pdf that describes radioactivity (or the decay of a charmed particle) is:

$$N(t) = N(0)\, e^{-t/\tau}$$

As written, the above pdf is not linear in τ. We can turn this into a linear problem by taking the natural log of both sides of the pdf.

$$\ln(N(t)) = \ln(N(0)) - t/\tau \;\equiv\; y = C + Dt$$

We can now use the methods of linear least squares to find D and then τ = -1/D.

In doing the LSQ fit, what do we use to weight the data points? The fluctuations in each bin are governed by Poisson statistics: σᵢ² = Nᵢ. However, in this problem the fitting variable is lnN, so we must use propagation of errors to transform the variances of N into the variances of lnN:

$$\sigma_y^2 = \left(\frac{\partial y}{\partial N}\right)^{\!2}\sigma_N^2 = \left(\frac{\partial \ln N}{\partial N}\right)^{\!2}\sigma_N^2 = \frac{1}{N^2}\, N = \frac{1}{N}$$

Technically the pdf is |dN(t)/(N(0)dt)| = N(t)/(τ N(0)).

Leo has a “1” here


Least Squares Fitting-Exponential Example

[Figure: y = ln N vs t (s) for the data above, with the weighted line of "best fit".]

The slope of the line is given by:

$$D = \frac{\displaystyle\sum_{i=1}^{n}\frac{1}{\sigma_i^2}\sum_{i=1}^{n}\frac{t_i y_i}{\sigma_i^2} - \sum_{i=1}^{n}\frac{t_i}{\sigma_i^2}\sum_{i=1}^{n}\frac{y_i}{\sigma_i^2}}{\displaystyle\sum_{i=1}^{n}\frac{1}{\sigma_i^2}\sum_{i=1}^{n}\frac{t_i^2}{\sigma_i^2} - \left(\sum_{i=1}^{n}\frac{t_i}{\sigma_i^2}\right)^{\!2}} = \frac{652\times 132800 - 33240\times 2780.3}{652\times 2684700 - (33240)^2} = -0.00903$$

Thus the lifetime (τ) = -1/D = 110.7 s.

The error in the lifetime is:

$$\sigma_D^2 = \frac{\displaystyle\sum_{i=1}^{n}\frac{1}{\sigma_i^2}}{\displaystyle\sum_{i=1}^{n}\frac{1}{\sigma_i^2}\sum_{i=1}^{n}\frac{t_i^2}{\sigma_i^2} - \left(\sum_{i=1}^{n}\frac{t_i}{\sigma_i^2}\right)^{\!2}} = \frac{652}{652\times 2684700 - (33240)^2} = 1.01\times 10^{-6}$$

$$\sigma_\tau = \frac{\sigma_D}{D^2} = \frac{1.005\times 10^{-3}}{(9.03\times 10^{-3})^2} = 12.3\ \text{s}$$

τ = 110.7 ± 12.3 sec.

Caution: Leo has a factor of ½ in his error matrix (V⁻¹)ᵢⱼ, Eq 4.72. He minimizes:

$$S = \sum_i \frac{[y_i - f(x_i; a, b, ...)]^2}{\sigma_i^2}$$

Using MLM we minimized:

$$-\ln L = \frac{\chi^2}{2} = \sum_i \frac{(y_i - f(x_i; a, b, ...))^2}{2\sigma_i^2}$$

Note: fitting without weighting yields τ = 96.8 s.
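A Python sketch reproducing the weighted fit above (data from the table; weights 1/σᵢ² = Nᵢ from the error propagation on the previous page):

    import numpy as np

    t = np.arange(0.0, 150.0, 15.0)                 # 0, 15, ..., 135 s
    N = np.array([106, 80, 98, 75, 74, 73, 49, 38, 37, 22], dtype=float)
    y = np.log(N)
    w = N                                           # 1/sigma_y^2 = N

    Sw, St, St2 = w.sum(), (w*t).sum(), (w*t**2).sum()
    Sy, Sty = (w*y).sum(), (w*t*y).sum()
    D = (Sw*Sty - St*Sy) / (Sw*St2 - St**2)         # weighted LSQ slope
    sD = np.sqrt(Sw / (Sw*St2 - St**2))             # error on the slope
    print(-1.0/D, sD/D**2)                          # tau ~110.7 s, error ~12.3 s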



Least Squares Fitting-Exponential Example

We can calculate the χ² to see how "good" the data fit an exponential decay distribution:

$$\chi^2 = \sum_{i=1}^{10}\frac{(N_i - Ae^{-t_i/\tau})^2}{\sigma_i^2} = \sum_{i=1}^{10}\frac{(N_i - Ae^{-t_i/\tau})^2}{Ae^{-t_i/\tau}} = 15.6$$

(The denominator uses the Poisson approximation σᵢ² ≈ Ae^{-tᵢ/τ}.)

For this problem: lnA = 4.725, so A = 112.73, and τ = 110.7 sec.

The chi sq per dof is 1.96. The chi sq prob. is 4.9%.

Mathematica Calculation:

(* assumes cnt[i] holds the counts N_i and x[i] the times t_i from the table,
   with a = 112.73 and tau = 110.7 from the fit; csq must start at 0 *)
csq = 0;
Do[csq = csq + (cnt[i] - a*Exp[-x[i]/tau])^2/(a*Exp[-x[i]/tau]), {i, 1, 10}];
Print["The chi sq per dof is ", csq/8]
xvt = 1 - CDF[ChiSquareDistribution[8], csq];
Print["The chi sq prob. is ", 100*xvt, "%"]

This is not such a good fit since the probability is only ~4.9%.


Extended MLM

Often we want to do a MLM fit to determine the numbers of signal and background events. Let's assume we know the pdfs that describe the signal (p_s) and background (p_b), and that the pdfs depend on some measured quantity x (e.g. energy, momentum, Cerenkov angle...). We can write the Likelihood for a single event (i) as: L = f_s p_s(xᵢ) + (1 - f_s) p_b(xᵢ), with f_s the fraction of signal events in the sample, and the number of signal events N_s = f_s N. The likelihood function to maximize (with respect to f_s) is:

$$L = \prod_{i=1}^{N}\left[f_s\, p_s(x_i) + (1 - f_s)\, p_b(x_i)\right]$$

Usually there is no closed form solution for f_s.

There are several drawbacks to this solution: 1) The numbers of signal and background events are 100% correlated. 2) The (poisson) fluctuations in the number of events (N) are not taken into account.

Another solution which explicitly takes into account 2) is the EXTENDED MLM:

$$L = \frac{e^{-\nu}\nu^{N}}{N!}\prod_{i=1}^{N}\left[f_s\, p_s(x_i) + (1 - f_s)\, p_b(x_i)\right] = \frac{e^{-\nu}}{N!}\prod_{i=1}^{N}\nu\left[f_s\, p_s(x_i) + (1 - f_s)\, p_b(x_i)\right]$$

$$\ln L = -\nu - \ln N! + \sum_{i=1}^{N}\ln\left[\nu\left(f_s\, p_s(x_i) + (1 - f_s)\, p_b(x_i)\right)\right]$$

Here ν = N_s + N_b, so we can re-write the likelihood function as:

$$\ln L = -(N_s + N_b) - \ln N! + \sum_{i=1}^{N}\ln\left[N_s\, p_s(x_i) + N_b\, p_b(x_i)\right]$$

The N! term drops out when we take derivatives to maximize L. We maximize L in terms of N_s and N_b.

If Ns & Nb are poisson then so is their product for fixed N
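A minimal extended-ML sketch in Python, with made-up pdfs (a gaussian signal on a flat background over [0, 10]) rather than the analysis pdfs of the next page:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    x = np.concatenate([rng.normal(5.0, 0.5, 200),     # "signal" events
                        rng.uniform(0.0, 10.0, 800)])  # "background" events

    def p_s(x): return norm.pdf(x, 5.0, 0.5)           # assumed known signal pdf
    def p_b(x): return np.full_like(x, 0.1)            # flat background pdf on [0, 10]

    def neg_lnL(par):
        Ns, Nb = par
        # extended -lnL = (Ns+Nb) - sum ln[Ns*ps + Nb*pb]   (N! dropped)
        return (Ns + Nb) - np.sum(np.log(Ns*p_s(x) + Nb*p_b(x)))

    fit = minimize(neg_lnL, x0=[100.0, 900.0], bounds=[(1.0, 2e3), (1.0, 2e3)])
    print(fit.x)                                       # ~[200, 800]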


Extended MLM Example: BF(B → D⁰K*)

Event yields are determined from an unbinned EMLM fit in the region 5.2 ≤ mES ≤ 5.3 GeV/c².

Choose simple PDFs to fit the mES distributions: A = Argus, G = Gaussian. Perform ML fits simultaneously in 3 regions. In each region fit the Kπ, Kππ⁰, K3π mES distributions (k = 1, 2, 3): 9 PDFs in all.

I) ΔE sideband: -100 ≤ ΔE ≤ -60 MeV & 60 ≤ ΔE ≤ 200 MeV. pdf: A_k

II) D⁰ sideband: |mD - mD,PDG| cut. Take into account "doubly peaking" backgrounds (DP). pdf: (N_noP A + N_DP G)_k

III) Signal region: |ΔE| ≤ 25 MeV. pdf: (N_qq A + N_DP G + N_sig G)_k

A scale factor scales the N_DP found in the D⁰ sideband fit.

[Figures: mES distributions and fits in the ΔE region, the D⁰ sideband, and the signal region; ~520 signal events in the signal region.]

The Argus function is:

$$A(m, m_{\max}) = A_0\, m\,\sqrt{1 - (m/m_{\max})^2}\; e^{\xi\,(1 - (m/m_{\max})^2)}$$

Notes: "fake" D⁰'s give fake B's (the doubly peaking background); there should be no B's in the ΔE sideband region.