

Chapter 2: Maximum Likelihood Estimation
Advanced Econometrics - HEC Lausanne

Christophe Hurlin

University of Orléans

December 9, 2013


Section 1

Introduction


1. Introduction

Maximum likelihood estimation (MLE) is a method of estimating the parameters of a model. It is one of the most widely used estimation methods.

The method of maximum likelihood selects the set of values of the model parameters that maximizes the likelihood function. Intuitively, this maximizes the "agreement" of the selected model with the observed data.

Maximum likelihood estimation gives a unified approach to estimation.


2. The Principle of Maximum Likelihood

What are the main properties of the maximum likelihood estimator?
- Is it asymptotically unbiased?
- Is it asymptotically efficient? Under which condition(s)?
- Is it consistent?
- What is the asymptotic distribution?

How to apply the maximum likelihood principle to the multiple linear regression model, to the probit/logit models, etc.?

... All of these questions are answered in this lecture...


1. Introduction

The outline of this chapter is the following:

Section 2: The principle of the maximum likelihood estimation

Section 3: The likelihood function

Section 4: Maximum likelihood estimator

Section 5: Score, Hessian and Fisher information

Section 6: Properties of maximum likelihood estimators


1. Introduction

References

Amemiya T. (1985), Advanced Econometrics, Harvard University Press.

Greene W. (2007), Econometric Analysis, sixth edition, Pearson - Prentice Hall.

Pelgrin F. (2010), Lecture Notes on Advanced Econometrics, HEC Lausanne (a special thank).

Ruud P. (2000), An Introduction to Classical Econometric Theory, Oxford University Press.

Zivot E. (2001), Maximum Likelihood Estimation, lecture notes.


Section 2

The Principle of Maximum Likelihood


2. The Principle of Maximum Likelihood

Objectives

In this section, we present a simple example in order:

1. To introduce the notations.
2. To introduce the notions of likelihood and log-likelihood.
3. To introduce the concept of maximum likelihood estimator.
4. To introduce the concept of maximum likelihood estimate.


2. The Principle of Maximum Likelihood

Example. Suppose that X_1, X_2, .., X_N are i.i.d. discrete random variables, such that X_i ~ Pois(θ), with a pmf (probability mass function) defined as:

$$\Pr(X_i = x_i) = \frac{\exp(-\theta)\,\theta^{x_i}}{x_i!}$$

where θ is an unknown parameter to estimate.


2. The Principle of Maximum Likelihood

Question: What is the probability of observing the particular sample {x_1, x_2, .., x_N}, assuming that a Poisson distribution with as yet unknown parameter θ generated the data?

This probability is equal to:

$$\Pr\left((X_1 = x_1) \cap \dots \cap (X_N = x_N)\right)$$


2. The Principle of Maximum Likelihood

Since the variables X_i are i.i.d., this joint probability is equal to the product of the marginal probabilities:

$$\Pr\left((X_1 = x_1) \cap \dots \cap (X_N = x_N)\right) = \prod_{i=1}^{N} \Pr(X_i = x_i)$$

Given the pmf of the Poisson distribution, we have:

$$\Pr\left((X_1 = x_1) \cap \dots \cap (X_N = x_N)\right) = \prod_{i=1}^{N} \frac{\exp(-\theta)\,\theta^{x_i}}{x_i!} = \exp(-\theta N)\,\frac{\theta^{\sum_{i=1}^{N} x_i}}{\prod_{i=1}^{N} x_i!}$$


2. The Principle of Maximum Likelihood

Definition. This joint probability is a function of θ (the unknown parameter) and corresponds to the likelihood of the sample {x_1, .., x_N}, denoted by:

$$L_N(\theta; x_1, .., x_N) = \Pr\left((X_1 = x_1) \cap \dots \cap (X_N = x_N)\right)$$

with

$$L_N(\theta; x_1, .., x_N) = \exp(-\theta N) \times \theta^{\sum_{i=1}^{N} x_i} \times \frac{1}{\prod_{i=1}^{N} x_i!}$$


2. The Principle of Maximum Likelihood

Example. Let us assume that for N = 10 we have a realization of the sample equal to {5, 0, 1, 1, 0, 3, 2, 3, 4, 1}; then:

$$L_N(\theta; x_1, .., x_N) = \Pr\left((X_1 = x_1) \cap \dots \cap (X_N = x_N)\right) = \frac{e^{-10\theta}\,\theta^{20}}{207{,}360}$$


2. The Principle of Maximum Likelihood

Question: What value of θ would make this sample most probable?


2. The Principle of Maximum Likelihood

The figure below plots the function L_N(θ; x) for various values of θ. It has a single mode at θ = 2, which would be the maximum likelihood estimate, or MLE, of θ.

[Figure: L_N(θ; x) plotted for θ between 0 and 4, on a vertical scale of 0 to 1.2 × 10⁻⁸; the curve peaks at θ = 2.]



2. The Principle of Maximum Likelihood

Consider maximizing the likelihood function L_N(θ; x_1, .., x_N) with respect to θ. Since the log function is monotonically increasing, we usually maximize ln L_N(θ; x_1, .., x_N) instead. In this case:

$$\ln L_N(\theta; x_1, .., x_N) = -\theta N + \ln(\theta) \sum_{i=1}^{N} x_i - \ln\left(\prod_{i=1}^{N} x_i!\right)$$

$$\frac{\partial \ln L_N(\theta; x_1, .., x_N)}{\partial \theta} = -N + \frac{1}{\theta} \sum_{i=1}^{N} x_i$$

$$\frac{\partial^2 \ln L_N(\theta; x_1, .., x_N)}{\partial \theta^2} = -\frac{1}{\theta^2} \sum_{i=1}^{N} x_i < 0$$


2. The Principle of Maximum Likelihood

Under suitable regularity conditions, the maximum likelihood estimate (estimator) is defined as:

$$\hat{\theta} = \arg\max_{\theta \in \mathbb{R}^+} \ln L_N(\theta; x_1, .., x_N)$$

FOC:

$$\left. \frac{\partial \ln L_N(\theta; x_1, .., x_N)}{\partial \theta} \right|_{\hat{\theta}} = -N + \frac{1}{\hat{\theta}} \sum_{i=1}^{N} x_i = 0 \iff \hat{\theta} = \frac{1}{N} \sum_{i=1}^{N} x_i$$

SOC:

$$\left. \frac{\partial^2 \ln L_N(\theta; x_1, .., x_N)}{\partial \theta^2} \right|_{\hat{\theta}} = -\frac{1}{\hat{\theta}^2} \sum_{i=1}^{N} x_i < 0$$

Hence $\hat{\theta}$ is a maximum.


2. The Principle of Maximum Likelihood

The maximum likelihood estimate (realization) is:

$$\hat{\theta} \equiv \hat{\theta}(x) = \frac{1}{N} \sum_{i=1}^{N} x_i$$

Given the sample {5, 0, 1, 1, 0, 3, 2, 3, 4, 1}, we have $\hat{\theta}(x) = 2$.

The maximum likelihood estimator (random variable) is:

$$\hat{\theta} = \frac{1}{N} \sum_{i=1}^{N} X_i$$
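As a numerical cross-check (a sketch added for illustration, not part of the original slides), the following Python snippet evaluates the Poisson log-likelihood derived above on this sample and maximizes it numerically; the maximizer agrees with the closed-form MLE, the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln  # gammaln(x + 1) = ln(x!)

x = np.array([5, 0, 1, 1, 0, 3, 2, 3, 4, 1])

def neg_log_lik(theta):
    # -ln L_N(theta; x) = theta*N - ln(theta)*sum(x_i) + sum(ln(x_i!))
    return theta * len(x) - np.log(theta) * x.sum() + gammaln(x + 1).sum()

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10), method="bounded")
print(res.x)     # ~2.0: the numerical maximizer
print(x.mean())  # 2.0: the closed-form MLE (sample mean)
```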


2. The Principle of Maximum Likelihood

Continuous variables

The reference to the probability of observing the given sample is not exact in a continuous distribution, since a particular sample has probability zero. Nonetheless, the principle is the same.

The likelihood function then corresponds to the pdf associated to the joint distribution of (X_1, X_2, .., X_N) evaluated at the point (x_1, x_2, .., x_N):

$$L_N(\theta; x_1, .., x_N) = f_{X_1, .., X_N}(x_1, x_2, .., x_N; \theta)$$


2. The Principle of Maximum Likelihood

Continuous variables

If the random variables {X_1, X_2, .., X_N} are i.i.d., then we have:

$$L_N(\theta; x_1, .., x_N) = \prod_{i=1}^{N} f_X(x_i; \theta)$$

where f_X(x_i; θ) denotes the pdf of the marginal distribution of X (or X_i, since all the variables have the same distribution).

The values of the parameters that maximize L_N(θ; x_1, .., x_N) or its log are the maximum likelihood estimates, denoted $\hat{\theta}(x)$.


Section 3

The Likelihood Function

Definitions and Notations


3. The Likelihood Function

Objectives

1. Introduce the notations for an estimation problem that deals with a marginal distribution or a conditional distribution (model).
2. Define the likelihood and the log-likelihood functions.
3. Introduce the concept of conditional log-likelihood.
4. Propose various applications.


3. The Likelihood Function

Notations

Let us consider a continuous random variable X, with a pdf denoted f_X(x; θ), for x ∈ ℝ.

θ = (θ_1 .. θ_K)' is a K × 1 vector of unknown parameters. We assume that θ ∈ Θ ⊆ ℝ^K.

Let us consider a sample {X_1, .., X_N} of i.i.d. random variables with the same arbitrary distribution as X.

The realisation of {X_1, .., X_N} (the data set) is denoted {x_1, .., x_N}, or x for simplicity.


3. The Likelihood Function

Example (Normal distribution). If X ~ N(m, σ²), then:

$$f_X(z; \theta) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(z - m)^2}{2\sigma^2}\right) \quad \forall z \in \mathbb{R}$$

with K = 2 and

$$\theta = \begin{pmatrix} m \\ \sigma^2 \end{pmatrix}$$


3. The Likelihood Function

Definition (Likelihood Function). The likelihood function is defined to be:

$$L_N : \Theta \times \mathbb{R}^N \to \mathbb{R}^+$$

$$(\theta; x_1, .., x_N) \longmapsto L_N(\theta; x_1, .., x_N) = \prod_{i=1}^{N} f_X(x_i; \theta)$$


3. The Likelihood Function

Definition (Log-Likelihood Function). The log-likelihood function is defined to be:

$$\ell_N : \Theta \times \mathbb{R}^N \to \mathbb{R}$$

$$(\theta; x_1, .., x_N) \longmapsto \ell_N(\theta; x_1, .., x_N) = \sum_{i=1}^{N} \ln f_X(x_i; \theta)$$


3. The Likelihood Function

Remark: the (log-)likelihood function depends on two types of arguments: the parameter vector θ and the data (the sample realisation x).


3. The Likelihood Function

Notations: In the rest of the chapter, I will use the following alternative notations:

$$L_N(\theta; x) \equiv L(\theta; x_1, .., x_N) \equiv L_N(\theta)$$

$$\ell_N(\theta; x) \equiv \ln L_N(\theta; x) \equiv \ln L(\theta; x_1, .., x_N) \equiv \ln L_N(\theta)$$


3. The Likelihood Function

Example (Sample of Normal Variables). We consider an N.i.d.(m, σ²) sample {Y_1, .., Y_N} and denote the realisation by {y_1, .., y_N}, or y. Let us define θ = (m σ²)'; then we have:

$$L_N(\theta; y) = \prod_{i=1}^{N} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(y_i - m)^2}{2\sigma^2}\right) = \left(\sigma^2 2\pi\right)^{-N/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - m)^2\right)$$

$$\ell_N(\theta; y) = -\frac{N}{2}\ln\left(\sigma^2\right) - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - m)^2$$
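As an illustrative sketch (not from the slides; the simulated sample is an assumption added here), the Gaussian log-likelihood above can be coded directly and cross-checked against scipy's normal density:

```python
import numpy as np
from scipy.stats import norm

def gaussian_loglik(theta, y):
    # theta = (m, sigma2); the closed-form expression derived above
    m, sigma2 = theta
    N = len(y)
    return (-N / 2 * np.log(sigma2) - N / 2 * np.log(2 * np.pi)
            - np.sum((y - m) ** 2) / (2 * sigma2))

rng = np.random.default_rng(0)
y = rng.normal(loc=1.0, scale=2.0, size=500)

print(gaussian_loglik((1.0, 4.0), y))            # closed form
print(norm.logpdf(y, loc=1.0, scale=2.0).sum())  # same value from scipy
```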


3. The Likelihood Function

Definition (Likelihood of one observation). We can also define the (log-)likelihood of one observation x_i:

$$L_i(\theta; x) = f_X(x_i; \theta) \quad \text{with} \quad L_N(\theta; x) = \prod_{i=1}^{N} L_i(\theta; x)$$

$$\ell_i(\theta; x) = \ln f_X(x_i; \theta) \quad \text{with} \quad \ell_N(\theta; x) = \sum_{i=1}^{N} \ell_i(\theta; x)$$


3. The Likelihood Function

Example (Exponential Distribution). Suppose that D_1, D_2, .., D_N are i.i.d. positive random variables (durations, for instance), with D_i ~ Exp(θ), θ > 0, and:

$$L_i(\theta; d_i) = f_D(d_i; \theta) = \frac{1}{\theta} \exp\left(-\frac{d_i}{\theta}\right)$$

$$\ell_i(\theta; d_i) = \ln\left(f_D(d_i; \theta)\right) = -\ln(\theta) - \frac{d_i}{\theta}$$

Then we have:

$$L_N(\theta; d) = \theta^{-N} \exp\left(-\frac{1}{\theta} \sum_{i=1}^{N} d_i\right) \qquad \ell_N(\theta; d) = -N\ln(\theta) - \frac{1}{\theta} \sum_{i=1}^{N} d_i$$
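Setting ∂ℓ_N/∂θ = 0 here gives θ̂ = (1/N) Σ d_i, the sample mean of the durations. A minimal numerical sketch (with simulated durations, an assumption added for illustration):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
d = rng.exponential(scale=3.0, size=1000)  # true theta = 3

def neg_loglik(theta):
    # -l_N(theta; d) = N*ln(theta) + sum(d_i)/theta
    return len(d) * np.log(theta) + d.sum() / theta

res = minimize_scalar(neg_loglik, bounds=(1e-6, 100), method="bounded")
print(res.x, d.mean())  # the numerical maximizer matches the sample mean
```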


3. The Likelihood Function

Remark: The (log-)likelihood and the maximum likelihood estimator are always based on an assumption (a bet?) about the distribution of Y:

$$Y_i \sim \text{Distribution with pdf } f_Y(y; \theta) \implies L_N(\theta; y) \text{ and } \ell_N(\theta; y)$$

In practice, we generally have no idea about the true distribution of Y_i. A solution: the quasi-maximum likelihood estimator.


3. The Likelihood Function

Remark: We can also use the MLE to estimate the parameters of a model (with a dependent variable and explanatory variables) such that:

$$y = g(x; \theta) + \varepsilon$$

where θ denotes the vector of parameters, X a set of explanatory variables, ε an error term, and g(.) the link function.

In this case, we generally consider the conditional distribution of Y given X, which is equivalent to the unconditional distribution of the error term ε:

$$Y \mid X \sim \mathcal{D} \iff \varepsilon \sim \mathcal{D}$$


3. The Likelihood Function

Notations (model)

Let us consider two continuous random variables Y and X.

We assume that Y has a conditional distribution given X = x, with a pdf denoted f_{Y|x}(y; θ), for y ∈ ℝ.

θ = (θ_1 .. θ_K)' is a K × 1 vector of unknown parameters. We assume that θ ∈ Θ ⊆ ℝ^K.

Let us consider a sample {X_i, Y_i}_{i=1}^{N} of i.i.d. random variables and a realisation {x_i, y_i}_{i=1}^{N}.


3. The Likelihood Function

Definition (Conditional likelihood function). The (conditional) likelihood function is defined to be:

$$L_N(\theta; y \mid x) = \prod_{i=1}^{N} f_{Y|X}(y_i \mid x_i; \theta)$$

where f_{Y|X}(y_i | x_i; θ) denotes the conditional pdf of Y_i given X_i.

Remark: The conditional likelihood function is the joint conditional density of the data, in which the unknown parameter is θ.


3. The Likelihood Function

Definition (Conditional log-likelihood function). The (conditional) log-likelihood function is defined to be:

$$\ell_N(\theta; y \mid x) = \sum_{i=1}^{N} \ln f_{Y|X}(y_i \mid x_i; \theta)$$

where f_{Y|X}(y_i | x_i; θ) denotes the conditional pdf of Y_i given X_i.


3. The Likelihood Function

Remark: The conditional probability density function (pdf) can be denoted by:

$$f_{Y|X}(y \mid x; \theta) \equiv f_Y(y \mid X = x; \theta) \equiv f_Y(y \mid X = x)$$


3. The Likelihood Function

Example (Linear Regression Model). Consider the following linear regression model:

$$y_i = x_i^\top \beta + \varepsilon_i$$

where x_i is a K × 1 vector of random variables and β = (β_1 .. β_K)' is a K × 1 vector of parameters. We assume that the ε_i are i.i.d. with ε_i ~ N(0, σ²). Then the conditional distribution of Y_i given X_i = x_i is:

$$Y_i \mid x_i \sim \mathcal{N}\left(x_i^\top \beta,\; \sigma^2\right)$$

$$L_i(\theta; y \mid x) = f_{Y|x}(y_i \mid x_i; \theta) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{\left(y_i - x_i^\top \beta\right)^2}{2\sigma^2}\right)$$

where θ = (β' σ²)' is a (K+1) × 1 vector.


3. The Likelihood Function

Example (Linear Regression Model, cont'd). Then, if we consider an i.i.d. sample {y_i, x_i}_{i=1}^{N}, the corresponding conditional (log-)likelihood is defined to be:

$$L_N(\theta; y \mid x) = \prod_{i=1}^{N} f_{Y|X}(y_i \mid x_i; \theta) = \prod_{i=1}^{N} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{\left(y_i - x_i^\top \beta\right)^2}{2\sigma^2}\right) = \left(\sigma^2 2\pi\right)^{-N/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(y_i - x_i^\top \beta\right)^2\right)$$

$$\ell_N(\theta; y \mid x) = -\frac{N}{2}\ln\left(\sigma^2\right) - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(y_i - x_i^\top \beta\right)^2$$


3. The Likelihood Function

Remark: Given this principle, we can derive the (conditional) likelihood and the log-likelihood functions associated to a specific sample for any type of econometric model in which the conditional distribution of the dependent variable is known:

- Dichotomic models: probit, logit, etc.
- Censored regression models: Tobit, etc.
- Time series models: AR, ARMA, VAR, etc.
- GARCH models
- ...


3. The Likelihood Function

Example (Probit/Logit Models). Let us consider a dichotomic variable Y_i such that Y_i = 1 if firm i is in default and 0 otherwise. X_i = (X_{i1} ... X_{iK})' denotes a K × 1 vector of individual characteristics. We assume that the conditional probability of default is defined as:

$$\Pr(Y_i = 1 \mid X_i = x_i) = F\left(x_i^\top \beta\right)$$

where β = (β_1 .. β_K)' is a vector of parameters and F(.) is a cdf (cumulative distribution function):

$$Y_i = \begin{cases} 1 & \text{with probability } F\left(x_i^\top \beta\right) \\ 0 & \text{with probability } 1 - F\left(x_i^\top \beta\right) \end{cases}$$


3. The Likelihood Function

Remark: Given the choice of the link function F(.), we get a probit or a logit model.


3. The Likelihood Function

Definition (Probit Model). In a probit model, the conditional probability of the event Y_i = 1 is:

$$\Pr(Y_i = 1 \mid X_i = x_i) = \Phi\left(x_i^\top \beta\right) = \int_{-\infty}^{x_i^\top \beta} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^2}{2}\right) du$$

where Φ(.) denotes the cdf of the standard normal distribution.


3. The Likelihood Function

Definition (Logit Model). In a logit model, the conditional probability of the event Y_i = 1 is:

$$\Pr(Y_i = 1 \mid X_i = x_i) = \Lambda\left(x_i^\top \beta\right) = \frac{1}{1 + \exp\left(-x_i^\top \beta\right)}$$

where Λ(.) denotes the cdf of the logistic distribution.


3. The Likelihood Function

Example (Probit/Logit Models, cont'd). What is the (conditional) log-likelihood of the sample {y_i, x_i}_{i=1}^{N}? Whatever the choice of F(.), the conditional distribution of Y_i given X_i = x_i is a Bernoulli distribution, since:

$$Y_i = \begin{cases} 1 & \text{with probability } F\left(x_i^\top \beta\right) \\ 0 & \text{with probability } 1 - F\left(x_i^\top \beta\right) \end{cases}$$

Then, for θ = β, we have:

$$L_i(\theta; y \mid x) = f_{Y|x}(y_i \mid x_i; \theta) = \left[F\left(x_i^\top \beta\right)\right]^{y_i} \left[1 - F\left(x_i^\top \beta\right)\right]^{1 - y_i}$$

where f_{Y|x}(y_i | x_i; θ) denotes the conditional probability mass function (pmf) of Y_i.


3. The Likelihood Function

Example (Probit/Logit Models, cont'd). The (conditional) likelihood and log-likelihood of the sample {y_i, x_i}_{i=1}^{N} are defined to be:

$$L_N(\theta; y \mid x) = \prod_{i=1}^{N} f_{Y|x}(y_i \mid x_i; \theta) = \prod_{i=1}^{N} \left[F\left(x_i^\top \beta\right)\right]^{y_i} \left[1 - F\left(x_i^\top \beta\right)\right]^{1 - y_i}$$

$$\ell_N(\theta; y \mid x) = \sum_{i=1}^{N} y_i \ln\left[F\left(x_i^\top \beta\right)\right] + \sum_{i=1}^{N} (1 - y_i) \ln\left[1 - F\left(x_i^\top \beta\right)\right] = \sum_{i:\, y_i = 1} \ln F\left(x_i^\top \beta\right) + \sum_{i:\, y_i = 0} \ln\left[1 - F\left(x_i^\top \beta\right)\right]$$

where f_{Y|x}(y_i | x_i; θ) denotes the conditional probability mass function (pmf) of Y_i.
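To make this concrete, here is a minimal sketch (assuming simulated data and the logistic link Λ; neither the data nor the optimizer choice comes from the slides) that maximizes the Bernoulli log-likelihood above numerically:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic cdf Lambda(.)

rng = np.random.default_rng(2)
N = 1000
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # constant + one regressor
beta_true = np.array([0.5, -1.0])
y = (rng.uniform(size=N) < expit(X @ beta_true)).astype(float)

def neg_loglik(beta):
    F = expit(X @ beta)
    # l_N = sum_{y_i=1} ln F(x_i'b) + sum_{y_i=0} ln(1 - F(x_i'b))
    return -np.sum(y * np.log(F) + (1 - y) * np.log(1 - F))

beta_hat = minimize(neg_loglik, x0=np.zeros(2), method="BFGS").x
print(beta_hat)  # close to beta_true for large N
```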


3. The Likelihood Function

Key Concepts

1. Likelihood (of a sample) function
2. Log-likelihood (of a sample) function
3. Conditional likelihood and log-likelihood function
4. Likelihood and log-likelihood of one observation


Section 4

Maximum Likelihood Estimator


4. Maximum Likelihood Estimator

Objectives

1. This section will be concerned with obtaining estimates of the parameters θ.
2. We will define the maximum likelihood estimator (MLE).
3. Before we begin that study, we consider the question of whether estimation of the parameters is possible at all: the question of identification.
4. We will introduce the invariance principle.


4. Maximum Likelihood Estimator

Definition (Identification). The parameter vector θ is identified (estimable) if, for any other parameter vector θ* ≠ θ and for some data y, we have:

$$L_N(\theta; y) \neq L_N(\theta^*; y)$$


4. Maximum Likelihood Estimator

Example. Let us consider a latent (continuous and unobservable) variable Y_i* such that:

$$Y_i^* = X_i^\top \beta + \varepsilon_i$$

with β = (β_1 .. β_K)' and X_i = (X_{i1} ... X_{iK})', and where the error term ε_i is i.i.d. such that E(ε_i) = 0 and V(ε_i) = σ². The distribution of ε_i is symmetric around 0, and we denote by G(.) the cdf of the standardized error term ε_i/σ. We assume that this cdf does not depend on σ or β. Example: ε_i/σ ~ N(0, 1).


4. Maximum Likelihood Estimator

Example (cont'd). We observe a dichotomic variable Y_i such that:

$$Y_i = \begin{cases} 1 & \text{if } Y_i^* > 0 \\ 0 & \text{otherwise} \end{cases}$$

Problem: are the parameters θ = (β' σ²)' identifiable?


4. Maximum Likelihood Estimator

Solution:

To answer this question, we have to compute the (log-)likelihood of the sample of observed data {y_i, x_i}_{i=1}^{N}. We have:

$$\Pr(Y_i = 1 \mid X_i = x_i) = \Pr(Y_i^* > 0 \mid X_i = x_i) = \Pr\left(\varepsilon_i > -x_i^\top \beta\right) = 1 - \Pr\left(\varepsilon_i \leq -x_i^\top \beta\right) = 1 - \Pr\left(\frac{\varepsilon_i}{\sigma} \leq -x_i^\top \frac{\beta}{\sigma}\right)$$

If we denote by G(.) the cdf associated to the distribution of ε_i/σ, since this distribution is symmetric around 0, then we have:

$$\Pr(Y_i = 1 \mid X_i = x_i) = G\left(x_i^\top \frac{\beta}{\sigma}\right)$$


4. Maximum Likelihood Estimator

Solution (cont'd): For θ = (β' σ²)', we have:

$$\ell_N(\theta; y \mid x) = \sum_{i=1}^{N} y_i \ln\left[G\left(x_i^\top \frac{\beta}{\sigma}\right)\right] + \sum_{i=1}^{N} (1 - y_i) \ln\left[1 - G\left(x_i^\top \frac{\beta}{\sigma}\right)\right]$$

This log-likelihood depends only on the ratio β/σ. So, for θ = (β' σ²)' and θ* = (kβ' kσ)' with k ≠ 1:

$$\ell_N(\theta; y \mid x) = \ell_N(\theta^*; y \mid x)$$

The parameters β and σ² cannot be identified. We can only identify the ratio β/σ.


4. Maximum Likelihood Estimator

Remark:

In this latent model, only the ratio β/σ can be identified, since:

$$\Pr(Y_i = 1 \mid X_i = x_i) = \Pr\left(\frac{\varepsilon_i}{\sigma} < x_i^\top \frac{\beta}{\sigma}\right) = G\left(x_i^\top \frac{\beta}{\sigma}\right)$$

The choice of a logit or probit model implies a normalisation on the variance of ε_i/σ, and then on σ²:

$$\text{probit:} \quad \Pr(Y_i = 1 \mid X_i = x_i) = \Phi\left(x_i^\top \tilde{\beta}\right) \quad \text{with } \tilde{\beta} = \beta/\sigma, \quad \mathbb{V}\left(\frac{\varepsilon_i}{\sigma}\right) = 1$$


4. Maximum Likelihood Estimator

Definition (Maximum Likelihood Estimator). A maximum likelihood estimator $\hat{\theta}$ of θ ∈ Θ is a solution to the maximization problem:

$$\hat{\theta} = \arg\max_{\theta \in \Theta} \ell_N(\theta; y \mid x)$$

or, equivalently,

$$\hat{\theta} = \arg\max_{\theta \in \Theta} L_N(\theta; y \mid x)$$


4. Maximum Likelihood Estimator

Remarks

1. Do not confuse the maximum likelihood estimator $\hat{\theta}$ (which is a random variable) and the maximum likelihood estimate $\hat{\theta}(x)$, which corresponds to the realisation of $\hat{\theta}$ on the sample x.

2. Generally, it is easier to maximise the log-likelihood than the likelihood (especially for the distributions that belong to the exponential family).

3. When we consider an unconditional likelihood, the MLE is defined by:

$$\hat{\theta} = \arg\max_{\theta \in \Theta} \ell_N(\theta; x)$$


4. Maximum Likelihood Estimator

Definition (Likelihood equations). Under suitable regularity conditions, a maximum likelihood estimator (MLE) of θ is defined to be the solution of the first-order conditions (FOC):

$$\left. \frac{\partial \ell_N(\theta; y \mid x)}{\partial \theta} \right|_{\hat{\theta}} = \underset{(K,1)}{0} \quad \text{or} \quad \left. \frac{\partial L_N(\theta; y \mid x)}{\partial \theta} \right|_{\hat{\theta}} = \underset{(K,1)}{0}$$

These conditions are generally called the likelihood or log-likelihood equations.


4. Maximum Likelihood Estimator

Notations

The first derivative (gradient) of the (conditional) log-likelihood evaluated at the point $\hat{\theta}$ satisfies:

$$\left. \frac{\partial \ell_N(\theta; y \mid x)}{\partial \theta} \right|_{\hat{\theta}} \equiv \frac{\partial \ell_N\left(\hat{\theta}; y \mid x\right)}{\partial \theta} = g\left(\hat{\theta}; y \mid x\right) = 0$$


4. Maximum Likelihood Estimator

Remark

The log-likelihood equations correspond to a linear/nonlinear system of K equations with K unknown parameters θ_1, .., θ_K:

$$\left. \frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta} \right|_{\hat{\theta}} = \begin{pmatrix} \left. \frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta_1} \right|_{\hat{\theta}} \\ \vdots \\ \left. \frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta_K} \right|_{\hat{\theta}} \end{pmatrix} = \begin{pmatrix} 0 \\ \vdots \\ 0 \end{pmatrix}$$


4. Maximum Likelihood Estimator

Definition (Second Order Conditions). Second order condition (SOC) of the likelihood maximisation problem: the Hessian matrix evaluated at $\hat{\theta}$ must be negative definite:

$$\left. \frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta \partial \theta^\top} \right|_{\hat{\theta}} \text{ is negative definite} \quad \text{or} \quad \left. \frac{\partial^2 L_N(\theta; y \mid x)}{\partial \theta \partial \theta^\top} \right|_{\hat{\theta}} \text{ is negative definite}$$


4. Maximum Likelihood Estimator

Remark:

The Hessian matrix (realisation) is a K × K matrix (arguments (θ; y | x) omitted inside the matrix for readability):

$$\frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta \partial \theta^\top} = \begin{pmatrix} \frac{\partial^2 \ell_N}{\partial \theta_1^2} & \frac{\partial^2 \ell_N}{\partial \theta_1 \partial \theta_2} & \cdots & \frac{\partial^2 \ell_N}{\partial \theta_1 \partial \theta_K} \\ \frac{\partial^2 \ell_N}{\partial \theta_2 \partial \theta_1} & \frac{\partial^2 \ell_N}{\partial \theta_2^2} & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 \ell_N}{\partial \theta_K \partial \theta_1} & \cdots & \cdots & \frac{\partial^2 \ell_N}{\partial \theta_K^2} \end{pmatrix}$$


4. Maximum Likelihood Estimator

Reminders

A negative definite matrix is a symmetric (Hermitian, if there are complex entries) matrix all of whose eigenvalues are negative.

The n × n Hermitian matrix M is said to be negative definite if:

$$x^\top M x < 0$$

for all non-zero x in ℝⁿ.


4. Maximum Likelihood Estimator

Example (MLE problem with one parameter). Let us consider a real-valued random variable X with a pdf given by:

$$f_X\left(x; \sigma^2\right) = \frac{x}{\sigma^2} \exp\left(-\frac{x^2}{2\sigma^2}\right) \quad \forall x \in [0, +\infty)$$

where σ² is an unknown parameter. Let us consider a sample {X_1, .., X_N} of i.i.d. random variables with the same arbitrary distribution as X.

Problem: What is the maximum likelihood estimator (MLE) of σ²?


4. Maximum Likelihood Estimator

Solution:

We have:

$$\ln f_X\left(x; \sigma^2\right) = -\frac{x^2}{2\sigma^2} + \ln(x) - \ln\left(\sigma^2\right)$$

So the log-likelihood of the sample {x_1, .., x_N} is:

$$\ell_N\left(\sigma^2; x\right) = \sum_{i=1}^{N} \ln f_X\left(x_i; \sigma^2\right) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} x_i^2 + \sum_{i=1}^{N} \ln(x_i) - N \ln\left(\sigma^2\right)$$


4. Maximum Likelihood Estimator

Solution (cont'd): The maximum likelihood estimator $\hat{\sigma}^2$ of σ² ∈ ℝ⁺ is a solution to the maximization problem:

$$\hat{\sigma}^2 = \arg\max_{\sigma^2 \in \mathbb{R}^+} \ell_N\left(\sigma^2; x\right) = \arg\max_{\sigma^2 \in \mathbb{R}^+} \left( -\frac{1}{2\sigma^2} \sum_{i=1}^{N} x_i^2 + \sum_{i=1}^{N} \ln(x_i) - N \ln\left(\sigma^2\right) \right)$$

$$\frac{\partial \ell_N\left(\sigma^2; x\right)}{\partial \sigma^2} = \frac{1}{2\sigma^4} \sum_{i=1}^{N} x_i^2 - \frac{N}{\sigma^2}$$

FOC (log-likelihood equation):

$$\left. \frac{\partial \ell_N\left(\sigma^2; x\right)}{\partial \sigma^2} \right|_{\hat{\sigma}^2} = \frac{1}{2\hat{\sigma}^4} \sum_{i=1}^{N} x_i^2 - \frac{N}{\hat{\sigma}^2} = 0 \iff \hat{\sigma}^2 = \frac{1}{2N} \sum_{i=1}^{N} x_i^2$$


4. Maximum Likelihood Estimator

Solution (cont'd): Check that $\hat{\sigma}^2$ is a maximum:

$$\frac{\partial \ell_N\left(\sigma^2; x\right)}{\partial \sigma^2} = \frac{1}{2\sigma^4} \sum_{i=1}^{N} x_i^2 - \frac{N}{\sigma^2} \qquad \frac{\partial^2 \ell_N\left(\sigma^2; x\right)}{\partial \sigma^4} = -\frac{1}{\sigma^6} \sum_{i=1}^{N} x_i^2 + \frac{N}{\sigma^4}$$

SOC:

$$\left. \frac{\partial^2 \ell_N\left(\sigma^2; x\right)}{\partial \sigma^4} \right|_{\hat{\sigma}^2} = -\frac{1}{\hat{\sigma}^6} \sum_{i=1}^{N} x_i^2 + \frac{N}{\hat{\sigma}^4} = -\frac{2N\hat{\sigma}^2}{\hat{\sigma}^6} + \frac{N}{\hat{\sigma}^4} \quad \text{since } \hat{\sigma}^2 = \frac{1}{2N} \sum_{i=1}^{N} x_i^2$$

$$= -\frac{N}{\hat{\sigma}^4} < 0$$


4. Maximum Likelihood Estimator

Conclusion:

The maximum likelihood estimator (MLE) of the parameter σ² is defined by:

$$\hat{\sigma}^2 = \frac{1}{2N} \sum_{i=1}^{N} X_i^2$$

The maximum likelihood estimate of the parameter σ² is equal to:

$$\hat{\sigma}^2(x) = \frac{1}{2N} \sum_{i=1}^{N} x_i^2$$
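As a sanity check (a sketch; this pdf is that of a Rayleigh distribution parameterized by σ², and the simulated sample is an assumption added here), compare the closed-form estimate with a numerical maximizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
sigma2_true = 2.0
x = rng.rayleigh(scale=np.sqrt(sigma2_true), size=5000)

def neg_loglik(s2):
    # -l_N(s2; x), dropping sum(ln x_i), which does not depend on s2
    return np.sum(x ** 2) / (2 * s2) + len(x) * np.log(s2)

print(minimize_scalar(neg_loglik, bounds=(1e-6, 50), method="bounded").x)
print(np.sum(x ** 2) / (2 * len(x)))  # closed-form MLE, ~2.0
```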


4. Maximum Likelihood Estimator

Example (Sample of normal variables). We consider an N.i.d.(m, σ²) sample {Y_1, .., Y_N}. Problem: what are the MLE of m and σ²?

Solution: Let us define θ = (m σ²)'. Then:

$$\hat{\theta} = \arg\max_{\sigma^2 \in \mathbb{R}^+, \, m \in \mathbb{R}} \ell_N(\theta; y)$$

with

$$\ell_N(\theta; y) = -\frac{N}{2}\ln\left(\sigma^2\right) - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - m)^2$$


4. Maximum Likelihood Estimator

Solution (cont'd):

$$\ell_N(\theta; y) = -\frac{N}{2}\ln\left(\sigma^2\right) - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - m)^2$$

The first derivative of the log-likelihood function is defined by:

$$\frac{\partial \ell_N(\theta; y)}{\partial \theta} = \begin{pmatrix} \frac{\partial \ell_N(\theta; y)}{\partial m} \\ \frac{\partial \ell_N(\theta; y)}{\partial \sigma^2} \end{pmatrix}$$

$$\frac{\partial \ell_N(\theta; y)}{\partial m} = \frac{1}{\sigma^2} \sum_{i=1}^{N} (y_i - m) \qquad \frac{\partial \ell_N(\theta; y)}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{N} (y_i - m)^2$$


4. Maximum Likelihood Estimator

Solution (cont'd): FOC (log-likelihood equations):

$$\left. \frac{\partial \ell_N(\theta; y)}{\partial \theta} \right|_{\hat{\theta}} = \begin{pmatrix} \frac{1}{\hat{\sigma}^2} \sum_{i=1}^{N} (y_i - \hat{m}) \\ -\frac{N}{2\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4} \sum_{i=1}^{N} (y_i - \hat{m})^2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

So the MLE corresponds to the empirical mean and variance:

$$\hat{\theta} = \begin{pmatrix} \hat{m} \\ \hat{\sigma}^2 \end{pmatrix} \quad \text{with} \quad \hat{m} = \frac{1}{N} \sum_{i=1}^{N} Y_i \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} \left(Y_i - \overline{Y}_N\right)^2$$


4. Maximum Likelihood Estimator

Solution (cont'd): From

$$\frac{\partial \ell_N(\theta; y)}{\partial m} = \frac{1}{\sigma^2} \sum_{i=1}^{N} (y_i - m) \qquad \frac{\partial \ell_N(\theta; y)}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{N} (y_i - m)^2$$

the Hessian matrix (realization) is:

$$\frac{\partial^2 \ell_N(\theta; y)}{\partial \theta \partial \theta^\top} = \begin{pmatrix} \frac{\partial^2 \ell_N(\theta; y)}{\partial m^2} & \frac{\partial^2 \ell_N(\theta; y)}{\partial m \partial \sigma^2} \\ \frac{\partial^2 \ell_N(\theta; y)}{\partial \sigma^2 \partial m} & \frac{\partial^2 \ell_N(\theta; y)}{\partial \sigma^4} \end{pmatrix} = \begin{pmatrix} -\frac{N}{\sigma^2} & -\frac{1}{\sigma^4} \sum_{i=1}^{N} (y_i - m) \\ -\frac{1}{\sigma^4} \sum_{i=1}^{N} (y_i - m) & \frac{N}{2\sigma^4} - \frac{1}{\sigma^6} \sum_{i=1}^{N} (y_i - m)^2 \end{pmatrix}$$


4. Maximum Likelihood Estimator

Solution (cont'd): SOC:

$$\left. \frac{\partial^2 \ell_N(\theta; y)}{\partial \theta \partial \theta^\top} \right|_{\hat{\theta}} = \begin{pmatrix} -\frac{N}{\hat{\sigma}^2} & -\frac{1}{\hat{\sigma}^4} \sum_{i=1}^{N} (y_i - \hat{m}) \\ -\frac{1}{\hat{\sigma}^4} \sum_{i=1}^{N} (y_i - \hat{m}) & \frac{N}{2\hat{\sigma}^4} - \frac{1}{\hat{\sigma}^6} \sum_{i=1}^{N} (y_i - \hat{m})^2 \end{pmatrix} = \begin{pmatrix} -\frac{N}{\hat{\sigma}^2} & 0 \\ 0 & \frac{N}{2\hat{\sigma}^4} - \frac{N\hat{\sigma}^2}{\hat{\sigma}^6} \end{pmatrix}$$

since $N\hat{m} = \sum_{i=1}^{N} y_i$ and $N\hat{\sigma}^2 = \sum_{i=1}^{N} (y_i - \hat{m})^2$. Hence:

$$\left. \frac{\partial^2 \ell_N(\theta; y)}{\partial \theta \partial \theta^\top} \right|_{\hat{\theta}} = \begin{pmatrix} -\frac{N}{\hat{\sigma}^2} & 0 \\ 0 & -\frac{N}{2\hat{\sigma}^4} \end{pmatrix} \quad \text{is negative definite.}$$


4. Maximum Likelihood Estimator

Example (Linear Regression Model). Consider the linear regression model:

$$y_i = x_i^\top \beta + \varepsilon_i$$

where x_i = (x_{i1} ... x_{iK})' and β = (β_1 .. β_K)' are K × 1 vectors. We assume that the ε_i are N.i.d.(0, σ²). Then the (conditional) log-likelihood of the observations (x_i, y_i) is given by:

$$\ell_N(\theta; y \mid x) = -\frac{N}{2}\ln\left(\sigma^2\right) - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(y_i - x_i^\top \beta\right)^2$$

where θ = (β' σ²)' is a (K+1) × 1 vector. Question: what are the MLE of β and σ²?


4. Maximum Likelihood Estimator

Notation 1: The derivative of a scalar y with respect to a K × 1 vector x = (x_1 ... x_K)' is a K × 1 vector:

$$\frac{\partial y}{\partial x} = \begin{pmatrix} \frac{\partial y}{\partial x_1} \\ \vdots \\ \frac{\partial y}{\partial x_K} \end{pmatrix}$$

Notation 2: If x and β are two K × 1 vectors, then:

$$\frac{\partial \left(x^\top \beta\right)}{\partial \beta} = \underset{(K,1)}{x}$$


4. Maximum Likelihood Estimator

Solution:

$$\hat{\theta} = \arg\max_{\beta \in \mathbb{R}^K, \, \sigma^2 \in \mathbb{R}^+} \left( -\frac{N}{2}\ln\left(\sigma^2\right) - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(y_i - x_i^\top \beta\right)^2 \right)$$

The first derivative of the log-likelihood function is a (K+1) × 1 vector:

$$\frac{\partial \ell_N(\theta; y \mid x)}{\partial \theta} = \begin{pmatrix} \frac{\partial \ell_N(\theta; y \mid x)}{\partial \beta} \\ \frac{\partial \ell_N(\theta; y \mid x)}{\partial \sigma^2} \end{pmatrix} = \begin{pmatrix} \frac{\partial \ell_N(\theta; y \mid x)}{\partial \beta_1} \\ \vdots \\ \frac{\partial \ell_N(\theta; y \mid x)}{\partial \beta_K} \\ \frac{\partial \ell_N(\theta; y \mid x)}{\partial \sigma^2} \end{pmatrix}$$


4. Maximum Likelihood Estimator

Solution (cont'd): The two blocks of the gradient are:

$$\underset{(K,1)}{\frac{\partial \ell_N(\theta; y \mid x)}{\partial \beta}} = \frac{1}{\sigma^2} \sum_{i=1}^{N} x_i \left(y_i - x_i^\top \beta\right) \qquad \underset{(1,1)}{\frac{\partial \ell_N(\theta; y \mid x)}{\partial \sigma^2}} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{N} \left(y_i - x_i^\top \beta\right)^2$$


4. Maximum Likelihood Estimator

Solution (cont'd): FOC (log-likelihood equations):

$$\left. \frac{\partial \ell_N(\theta; y \mid x)}{\partial \theta} \right|_{\hat{\theta}} = \begin{pmatrix} \frac{1}{\hat{\sigma}^2} \sum_{i=1}^{N} x_i \left(y_i - x_i^\top \hat{\beta}\right) \\ -\frac{N}{2\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4} \sum_{i=1}^{N} \left(y_i - x_i^\top \hat{\beta}\right)^2 \end{pmatrix} = \begin{pmatrix} 0_K \\ 0 \end{pmatrix}$$

So the MLE is defined by:

$$\hat{\theta} = \begin{pmatrix} \hat{\beta} \\ \hat{\sigma}^2 \end{pmatrix} \quad \text{with} \quad \hat{\beta} = \left( \sum_{i=1}^{N} X_i X_i^\top \right)^{-1} \left( \sum_{i=1}^{N} X_i Y_i \right) \qquad \hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} \left(Y_i - X_i^\top \hat{\beta}\right)^2$$
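A compact numerical sketch (simulated data, an assumption for illustration): the MLE of β coincides with the OLS estimator (X'X)⁻¹X'y, and σ̂² is the average squared residual (dividing by N, not N − K):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=N)

# beta_hat = (sum_i x_i x_i')^{-1} (sum_i x_i y_i) = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / N  # the MLE divides by N

print(beta_hat, sigma2_hat)     # close to beta_true and 1.5**2
```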


4. Maximum Likelihood Estimator

Solution (cont'd): The Hessian is a (K+1) × (K+1) matrix:

$$\underset{(K+1) \times (K+1)}{\frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta \partial \theta^\top}} = \begin{pmatrix} \underset{(K \times K)}{\frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \beta \partial \beta^\top}} & \underset{(K \times 1)}{\frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \beta \partial \sigma^2}} \\ \underset{(1 \times K)}{\frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \sigma^2 \partial \beta^\top}} & \underset{(1 \times 1)}{\frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \sigma^4}} \end{pmatrix}$$


4. Maximum Likelihood Estimator

Solution (cont'd): From the gradient components

$$\frac{\partial \ell_N(\theta; y \mid x)}{\partial \beta} = \frac{1}{\sigma^2} \sum_{i=1}^{N} x_i \left(y_i - x_i^\top \beta\right) \qquad \frac{\partial \ell_N(\theta; y \mid x)}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{N} \left(y_i - x_i^\top \beta\right)^2$$

the Hessian matrix (realization) is equal to:

$$\frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta \partial \theta^\top} = \begin{pmatrix} -\frac{1}{\sigma^2} \sum_{i=1}^{N} x_i x_i^\top & -\frac{1}{\sigma^4} \sum_{i=1}^{N} x_i \left(y_i - x_i^\top \beta\right) \\ -\frac{1}{\sigma^4} \sum_{i=1}^{N} x_i^\top \left(y_i - x_i^\top \beta\right) & \frac{N}{2\sigma^4} - \frac{1}{\sigma^6} \sum_{i=1}^{N} \left(y_i - x_i^\top \beta\right)^2 \end{pmatrix}$$


4. Maximum Likelihood Estimator

Solution (cont'd): Second order conditions (SOC):

$$\left. \frac{\partial^2 \ell_N(\theta)}{\partial \theta \partial \theta^\top} \right|_{\hat{\theta}} = \begin{pmatrix} -\frac{1}{\hat{\sigma}^2} \sum_{i=1}^{N} x_i x_i^\top & -\frac{1}{\hat{\sigma}^4} \sum_{i=1}^{N} x_i \left(y_i - x_i^\top \hat{\beta}\right) \\ -\frac{1}{\hat{\sigma}^4} \sum_{i=1}^{N} x_i^\top \left(y_i - x_i^\top \hat{\beta}\right) & \frac{N}{2\hat{\sigma}^4} - \frac{1}{\hat{\sigma}^6} \sum_{i=1}^{N} \left(y_i - x_i^\top \hat{\beta}\right)^2 \end{pmatrix}$$

Since $\sum_{i=1}^{N} x_i^\top \left(y_i - x_i^\top \hat{\beta}\right) = 0$ (FOC) and $N\hat{\sigma}^2 = \sum_{i=1}^{N} \left(y_i - x_i^\top \hat{\beta}\right)^2$:

$$\left. \frac{\partial^2 \ell_N(\theta)}{\partial \theta \partial \theta^\top} \right|_{\hat{\theta}} = \begin{pmatrix} -\frac{1}{\hat{\sigma}^2} \sum_{i=1}^{N} x_i x_i^\top & 0 \\ 0 & \frac{N}{2\hat{\sigma}^4} - \frac{N\hat{\sigma}^2}{\hat{\sigma}^6} \end{pmatrix}$$


4. Maximum Likelihood Estimator

Solution (cont'd): Second order conditions (SOC):

$$\left. \frac{\partial^2 \ell_N(\theta; y \mid x)}{\partial \theta \partial \theta^\top} \right|_{\hat{\theta}} = \begin{pmatrix} -\frac{1}{\hat{\sigma}^2} \sum_{i=1}^{N} x_i x_i^\top & 0 \\ 0 & -\frac{N}{2\hat{\sigma}^4} \end{pmatrix} \quad \text{is negative definite}$$

Since $\sum_{i=1}^{N} x_i x_i^\top$ is positive definite (assumption), the Hessian matrix is negative definite and $\hat{\theta}$ is the MLE of the parameters θ.


4. Maximum Likelihood Estimator

Theorem (Equivariance or Invariance Principle). Under suitable regularity conditions, the maximum likelihood estimator of a function g(.) of the parameter θ is $g(\hat{\theta})$, where $\hat{\theta}$ is the maximum likelihood estimator of θ.


4. Maximum Likelihood Estimator

Invariance Principle

The MLE is invariant to one-to-one transformations of θ. Any transformation that is not one-to-one either renders the model inestimable if it is one-to-many, or imposes restrictions if it is many-to-one.

For the practitioner, this result is extremely useful. For example, when a parameter appears in a likelihood function in the form 1/θ, it is usually worthwhile to reparameterize the model in terms of γ = 1/θ.

Example: Olsen (1978) and the reparametrisation of the likelihood function of the Tobit model.


4. Maximum Likelihood Estimator

Example (Invariance Principle). Suppose that the normal log-likelihood in the previous example is parameterized in terms of the precision parameter, γ² = 1/σ². The log-likelihood

$$\ell_N\left(m, \sigma^2; y\right) = -\frac{N}{2}\ln\left(\sigma^2\right) - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - m)^2$$

becomes

$$\ell_N\left(m, \gamma^2; y\right) = \frac{N}{2}\ln\left(\gamma^2\right) - \frac{N}{2}\ln(2\pi) - \frac{\gamma^2}{2} \sum_{i=1}^{N} (y_i - m)^2$$


4. Maximum Likelihood Estimator

Example (Invariance Principle, cont'd). The MLE for m is clearly still $\overline{Y}_N$. But the likelihood equation for γ² is now:

$$\frac{\partial \ell_N\left(m, \gamma^2; y\right)}{\partial \gamma^2} = \frac{N}{2\gamma^2} - \frac{1}{2} \sum_{i=1}^{N} (y_i - m)^2$$

and the MLE for γ² is now defined by:

$$\hat{\gamma}^2 = \frac{N}{\sum_{i=1}^{N} \left(Y_i - \hat{m}\right)^2} = \frac{1}{\hat{\sigma}^2}$$

as expected.


Key Concepts

1. Identification.
2. Maximum likelihood estimator.
3. Maximum likelihood estimate.
4. Log-likelihood equations.
5. Equivariance or invariance principle.
6. Gradient vector and Hessian matrix (deterministic elements).


Section 5

Score, Hessian and Fisher Information


5. Score, Hessian and Fisher Information

Objectives

We aim at introducing the following concepts:

1. Score vector and gradient
2. Hessian matrix
3. Fisher information matrix of the sample
4. Fisher information matrix of one observation, for marginal and conditional distributions
5. Average Fisher information matrix of one observation


5. Score, Hessian and Fisher Information

Definition (Score Vector). The (conditional) score vector is a K × 1 vector defined by:

$$\underset{(K,1)}{s_N(\theta; Y \mid x)} \equiv s(\theta) = \frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta}$$


5. Score, Hessian and Fisher Information

Remarks:

The score s_N(θ; Y | x) is a vector of random elements, since it depends on the random variables Y_1, .., Y_N.

For an unconditional log-likelihood ℓ_N(θ; x), the score is denoted by s_N(θ; X) = ∂ℓ_N(θ; X)/∂θ.

The score is a K × 1 vector such that:

$$s_N(\theta; Y \mid x) = \begin{pmatrix} \frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta_1} \\ \vdots \\ \frac{\partial \ell_N(\theta; Y \mid x)}{\partial \theta_K} \end{pmatrix}$$


5. Score, Hessian and Fisher Information

Corollary. By definition, the score vector satisfies:

$$\mathbb{E}_\theta\left(s_N(\theta; Y \mid x)\right) = 0_K$$

where E_θ means the expectation with respect to the conditional distribution Y | X = x.


5. Score, Hessian and Fisher Information

Remark: If we consider a variable X with a pdf f_X(x; θ), ∀x ∈ ℝ, then E_θ(.) means the expectation with respect to the distribution of X:

$$\mathbb{E}_\theta\left(s_N(\theta; X)\right) = \int_{-\infty}^{\infty} s_N(\theta; x)\, f_X(x; \theta)\, dx = 0$$

Remark: If we consider a variable Y with a conditional pdf f_{Y|x}(y; θ), ∀y ∈ ℝ, then E_θ(.) means the expectation with respect to the distribution of Y | X = x:

$$\mathbb{E}_\theta\left(s_N(\theta; Y \mid x)\right) = \int_{-\infty}^{\infty} s_N(\theta; y \mid x)\, f_{Y|x}(y; \theta)\, dy = 0$$


5. Score, Hessian and Fisher Information

Proof.
If we consider a variable $X$ with a pdf $f_X(x;\theta)$, $\forall x \in \mathbb{R}$, then:
$$\mathbb{E}_\theta\left(s_N(\theta; X)\right) = \int s_N(\theta; x)\, f_X(x;\theta)\, dx = N \int \frac{\partial \ln f_X(x;\theta)}{\partial\theta}\, f_X(x;\theta)\, dx$$
$$= N \int \frac{1}{f_X(x;\theta)}\, \frac{\partial f_X(x;\theta)}{\partial\theta}\, f_X(x;\theta)\, dx = N\, \frac{\partial}{\partial\theta} \int f_X(x;\theta)\, dx = N\, \frac{\partial 1}{\partial\theta} = 0$$
where the next-to-last step interchanges differentiation and integration, as permitted by the regularity conditions.

Example (Exponential Distribution)
Suppose that $D_1, D_2, \ldots, D_N$ are i.i.d. positive random variables with $D_i \sim \mathrm{Exp}(\theta)$ and $\mathbb{E}(D_i) = \theta > 0$:
$$f_D(d;\theta) = \frac{1}{\theta}\exp\left(-\frac{d}{\theta}\right), \quad \forall d \in \mathbb{R}^+$$
$$\ell_N(\theta; d) = -N\ln(\theta) - \frac{1}{\theta}\sum_{i=1}^N d_i$$
The score (scalar) is equal to:
$$s_N(\theta; D) = -\frac{N}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^N D_i$$

Example (Exponential Distribution, cont'd)
By definition:
$$\mathbb{E}_\theta\left(s_N(\theta; D)\right) = \mathbb{E}_\theta\left(-\frac{N}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^N D_i\right) = -\frac{N}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^N \mathbb{E}_\theta(D_i) = -\frac{N}{\theta} + \frac{N\theta}{\theta^2} = 0 \quad \blacksquare$$
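Illustration (not part of the original slides): a minimal Monte Carlo sketch, assuming only numpy, that checks $\mathbb{E}_\theta(s_N(\theta_0; D)) = 0$ for this exponential example; the true value $\theta_0$, the sample size $N$ and the number of replications are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, N, R = 2.0, 100, 20_000  # true parameter, sample size, replications (arbitrary)

def score(theta, d):
    # score of the exponential sample: s_N(theta; d) = -N/theta + sum(d)/theta^2
    return -d.size / theta + d.sum() / theta**2

# average the score over many samples simulated at theta0
scores = [score(theta0, rng.exponential(theta0, N)) for _ in range(R)]
print(np.mean(scores))  # close to 0: E_theta[ s_N(theta0; D) ] = 0
```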


Example (Linear Regression Model)
Let us consider the previous linear regression model $y_i = x_i^\top\beta + \varepsilon_i$. The score is defined by:
$$s_N(\theta; Y \mid x) = \begin{pmatrix} \frac{1}{\sigma^2}\sum_{i=1}^N x_i\left(Y_i - x_i^\top\beta\right) \\[1ex] -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N \left(Y_i - x_i^\top\beta\right)^2 \end{pmatrix}$$
Then, we have:
$$\mathbb{E}_\theta\left(s_N(\theta; Y \mid x)\right) = \mathbb{E}_\theta\begin{pmatrix} \frac{1}{\sigma^2}\sum_{i=1}^N x_i\left(Y_i - x_i^\top\beta\right) \\[1ex] -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N \left(Y_i - x_i^\top\beta\right)^2 \end{pmatrix}$$

Example (Linear Regression Model, cont'd)
We know that $\mathbb{E}_\theta(Y_i \mid x) = x_i^\top\beta$. So, we have:
$$\mathbb{E}_\theta\left(\frac{1}{\sigma^2}\sum_{i=1}^N x_i\left(Y_i - x_i^\top\beta\right)\right) = \frac{1}{\sigma^2}\sum_{i=1}^N x_i\left(\mathbb{E}_\theta(Y_i \mid x) - x_i^\top\beta\right) = \frac{1}{\sigma^2}\sum_{i=1}^N x_i\left(x_i^\top\beta - x_i^\top\beta\right) = 0_K$$

Example (Linear Regression Model, cont'd)
$$\mathbb{E}_\theta\left(-\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N \left(Y_i - x_i^\top\beta\right)^2\right) = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N \mathbb{E}_\theta\left(\left(Y_i - x_i^\top\beta\right)^2\right)$$
$$= -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N \mathbb{E}_\theta\left(\left(Y_i - \mathbb{E}_\theta(Y_i \mid x)\right)^2\right) = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N \mathbb{V}_\theta(Y_i \mid x) = -\frac{N}{2\sigma^2} + \frac{N\sigma^2}{2\sigma^4} = 0 \quad \blacksquare$$

Definition (Gradient)
The gradient vector associated to the log-likelihood function is a $K \times 1$ vector defined by:
$$\underset{(K,1)}{g_N(\theta; y \mid x)} \equiv g(\theta) = \frac{\partial \ell_N(\theta; y \mid x)}{\partial \theta}$$

Remarks

1 The gradient $g_N(\theta; y \mid x)$ is a vector of deterministic entries since it depends on the realisations $y_1, \ldots, y_N$.

2 For an unconditional log-likelihood, the gradient is defined by $g_N(\theta; x) = \partial\ell_N(\theta; x)/\partial\theta$.

3 The gradient is a $K \times 1$ vector such that:
$$g_N(\theta; y \mid x) = \begin{pmatrix} \partial\ell_N(\theta; y \mid x)/\partial\theta_1 \\ \vdots \\ \partial\ell_N(\theta; y \mid x)/\partial\theta_K \end{pmatrix}$$

Corollary
By definition of the FOC, the gradient vector satisfies:
$$g_N\left(\hat\theta; y \mid x\right) = 0_K$$
where $\hat\theta = \hat\theta(x)$ is the maximum likelihood estimate of $\theta$.

Example (Linear regression model)
In the linear regression model, the gradient associated to the log-likelihood function is defined to be:
$$g_N(\theta; y \mid x) = \begin{pmatrix} \frac{1}{\sigma^2}\sum_{i=1}^N x_i\left(y_i - x_i^\top\beta\right) \\[1ex] -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^N \left(y_i - x_i^\top\beta\right)^2 \end{pmatrix}$$
Given the FOC, we have:
$$g_N\left(\hat\theta; y \mid x\right) = \begin{pmatrix} \frac{1}{\hat\sigma^2}\sum_{i=1}^N x_i\left(y_i - x_i^\top\hat\beta\right) \\[1ex] -\frac{N}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\sum_{i=1}^N \left(y_i - x_i^\top\hat\beta\right)^2 \end{pmatrix} = \begin{pmatrix} 0_K \\ 0 \end{pmatrix}$$
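Illustration (not part of the original slides): a short numpy sketch, on simulated data with arbitrary design and parameter values, showing that the gradient evaluated at the ML estimates is numerically zero.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])  # x_i with a constant, K = 3
beta0, sigma2_0 = np.array([1.0, -0.5, 2.0]), 0.25
y = X @ beta0 + rng.normal(scale=np.sqrt(sigma2_0), size=N)

# ML estimates: OLS for beta, mean squared residual for sigma^2
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
sigma2_hat = e @ e / N

# gradient of the Gaussian log-likelihood at (beta_hat, sigma2_hat)
g_beta = X.T @ e / sigma2_hat
g_sigma2 = -N / (2 * sigma2_hat) + e @ e / (2 * sigma2_hat**2)
print(g_beta, g_sigma2)  # both numerically zero: g_N(theta_hat; y | x) = 0
```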


Definition (Hessian Matrix)
The Hessian matrix (deterministic) is defined to be:
$$H_N(\theta; y \mid x) = \frac{\partial^2\ell_N(\theta; y \mid x)}{\partial\theta\,\partial\theta^\top}$$
Remark: the matrix $\partial^2\ell_N(\theta; Y \mid x)/\partial\theta\,\partial\theta^\top$ is also called the Hessian matrix, but do not confuse the two matrices $\partial^2\ell_N(\theta; Y \mid x)/\partial\theta\,\partial\theta^\top$ (random) and $\partial^2\ell_N(\theta; y \mid x)/\partial\theta\,\partial\theta^\top$ (deterministic).

Random variable versus constant:
- Score vector $\partial\ell_N(\theta; Y \mid x)/\partial\theta$ (random) versus gradient vector $\partial\ell_N(\theta; y \mid x)/\partial\theta$ (constant)
- Hessian matrix $\partial^2\ell_N(\theta; Y \mid x)/\partial\theta\,\partial\theta^\top$ (random) versus Hessian matrix $\partial^2\ell_N(\theta; y \mid x)/\partial\theta\,\partial\theta^\top$ (constant)

Definition (Fisher Information Matrix)
The (conditional) Fisher information matrix associated to the sample $\{Y_1, \ldots, Y_N\}$ is the variance-covariance matrix of the score vector:
$$\underbrace{\mathcal{I}_N(\theta)}_{K \times K} = \mathbb{V}_\theta\left(s_N(\theta; Y \mid x)\right)$$
or equivalently:
$$\mathcal{I}_N(\theta) = \mathbb{V}_\theta\left(\frac{\partial\ell_N(\theta; Y \mid x)}{\partial\theta}\right)$$
where $\mathbb{V}_\theta$ means the variance with respect to the conditional distribution $Y \mid X$.

Corollary
Since by definition $\mathbb{E}_\theta(s_N(\theta; Y \mid x)) = 0$, an alternative definition of the Fisher information matrix of the sample $\{Y_1, \ldots, Y_N\}$ is:
$$\underbrace{\mathcal{I}_N(\theta)}_{K \times K} = \mathbb{E}_\theta\Big(\underbrace{s_N(\theta; Y \mid x)}_{K \times 1} \times \underbrace{s_N(\theta; Y \mid x)^\top}_{1 \times K}\Big)$$

Definition (Fisher Information Matrix)
The (conditional) Fisher information matrix of the sample $\{Y_1, \ldots, Y_N\}$ is also given by:
$$\mathcal{I}_N(\theta) = \mathbb{E}_\theta\left(-\frac{\partial^2\ell_N(\theta; Y \mid x)}{\partial\theta\,\partial\theta^\top}\right) = \mathbb{E}_\theta\left(-H_N(\theta; Y \mid x)\right)$$

Definition (Fisher Information Matrix, summary)
The (conditional) Fisher information matrix of the sample $\{Y_1, \ldots, Y_N\}$ can alternatively be defined by:
$$\mathcal{I}_N(\theta) = \mathbb{V}_\theta\left(s_N(\theta; Y \mid x)\right)$$
$$\mathcal{I}_N(\theta) = \mathbb{E}_\theta\left(s_N(\theta; Y \mid x)\, s_N(\theta; Y \mid x)^\top\right)$$
$$\mathcal{I}_N(\theta) = \mathbb{E}_\theta\left(-H_N(\theta; Y \mid x)\right)$$
where $\mathbb{E}_\theta$ and $\mathbb{V}_\theta$ denote the mean and the variance with respect to the conditional distribution $Y \mid X$, and where $s_N(\theta; Y \mid x)$ denotes the score vector and $H_N(\theta; Y \mid x)$ the Hessian matrix.

Definition (Fisher Information Matrix, summary)
In terms of the log-likelihood derivatives, the (conditional) Fisher information matrix of the sample $\{Y_1, \ldots, Y_N\}$ can alternatively be defined by:
$$\mathcal{I}_N(\theta) = \mathbb{V}_\theta\left(\frac{\partial\ell_N(\theta; Y \mid x)}{\partial\theta}\right)$$
$$\mathcal{I}_N(\theta) = \mathbb{E}_\theta\left(\frac{\partial\ell_N(\theta; Y \mid x)}{\partial\theta}\left(\frac{\partial\ell_N(\theta; Y \mid x)}{\partial\theta}\right)^\top\right)$$
$$\mathcal{I}_N(\theta) = \mathbb{E}_\theta\left(-\frac{\partial^2\ell_N(\theta; Y \mid x)}{\partial\theta\,\partial\theta^\top}\right)$$
where $\mathbb{E}_\theta$ and $\mathbb{V}_\theta$ denote the mean and the variance with respect to the conditional distribution $Y \mid X$.

Remarks

1 There are three equivalent definitions of the Fisher information matrix and, as a consequence, three different consistent estimators of the Fisher information matrix (see below).

2 The Fisher information matrix associated to the sample $\{Y_1, \ldots, Y_N\}$ can also be defined from the Fisher information matrix for observation $i$.

Definition (Fisher Information Matrix)
The (conditional) Fisher information matrix associated to the $i$th individual can be defined by:
$$\mathcal{I}_i(\theta) = \mathbb{V}_\theta\left(\frac{\partial\ell_i(\theta; Y_i \mid x_i)}{\partial\theta}\right)$$
$$\mathcal{I}_i(\theta) = \mathbb{E}_\theta\left(\frac{\partial\ell_i(\theta; Y_i \mid x_i)}{\partial\theta}\,\frac{\partial\ell_i(\theta; Y_i \mid x_i)^\top}{\partial\theta}\right)$$
$$\mathcal{I}_i(\theta) = \mathbb{E}_\theta\left(-\frac{\partial^2\ell_i(\theta; Y_i \mid x_i)}{\partial\theta\,\partial\theta^\top}\right)$$
where $\mathbb{E}_\theta$ and $\mathbb{V}_\theta$ denote the expectation and variance with respect to the true conditional distribution $Y_i \mid X_i$.

Definition (Fisher Information Matrix)
The (conditional) Fisher information matrix associated to the $i$th individual can alternatively be defined by:
$$\mathcal{I}_i(\theta) = \mathbb{V}_\theta\left(s_i(\theta; Y_i \mid x_i)\right)$$
$$\mathcal{I}_i(\theta) = \mathbb{E}_\theta\left(s_i(\theta; Y_i \mid x_i)\, s_i(\theta; Y_i \mid x_i)^\top\right)$$
$$\mathcal{I}_i(\theta) = \mathbb{E}_\theta\left(-H_i(\theta; Y_i \mid x_i)\right)$$
where $\mathbb{E}_\theta$ and $\mathbb{V}_\theta$ denote the expectation and variance with respect to the true conditional distribution $Y_i \mid X_i$.

Theorem
The Fisher information matrix associated to the sample $\{Y_1, \ldots, Y_N\}$ is equal to the sum of the individual Fisher information matrices:
$$\mathcal{I}_N(\theta) = \sum_{i=1}^N \mathcal{I}_i(\theta)$$

Remarks:

1 In the case of a marginal log-likelihood, the Fisher information matrix associated to the variable $X_i$ is the same for all observations $i$:
$$\mathcal{I}_i(\theta) = \mathcal{I}(\theta) \quad \forall i = 1, \ldots, N$$

2 In the case of a conditional log-likelihood, the Fisher information matrix associated to the variable $Y_i$ given $X_i = x_i$ depends on the observation $i$:
$$\mathcal{I}_i(\theta) \neq \mathcal{I}_j(\theta) \quad \forall i \neq j$$

Example (Exponential marginal distribution)
Suppose that $D_1, D_2, \ldots, D_N$ are i.i.d. positive random variables with $D_i \sim \mathrm{Exp}(\theta)$:
$$\mathbb{E}(D_i) = \theta \qquad \mathbb{V}(D_i) = \theta^2$$
$$f_D(d;\theta) = \frac{1}{\theta}\exp\left(-\frac{d}{\theta}\right), \quad \forall d \in \mathbb{R}^+$$
$$\ell_i(\theta; d_i) = -\ln(\theta) - \frac{d_i}{\theta}$$
Question: what is the Fisher information number (scalar) associated to $D_i$?

Solution
$$\ell_i(\theta; d_i) = -\ln(\theta) - \frac{d_i}{\theta}$$
The score of the observation $D_i$ is defined by:
$$s_i(\theta; D_i) = \frac{\partial\ell_i(\theta; D_i)}{\partial\theta} = -\frac{1}{\theta} + \frac{D_i}{\theta^2}$$
Let us use the three definitions of the information quantity $\mathcal{I}_i(\theta)$:
$$\mathcal{I}_i(\theta) = \mathbb{V}_\theta\left(s_i(\theta; D_i)\right) = \mathbb{E}_\theta\left(s_i(\theta; D_i)^2\right) = \mathbb{E}_\theta\left(-H_i(\theta; D_i)\right)$$

Solution, cont'd
$$s_i(\theta; D_i) = \frac{\partial\ell_i(\theta; D_i)}{\partial\theta} = -\frac{1}{\theta} + \frac{D_i}{\theta^2}$$
First definition:
$$\mathcal{I}_i(\theta) = \mathbb{V}_\theta\left(s_i(\theta; D_i)\right) = \mathbb{V}_\theta\left(-\frac{1}{\theta} + \frac{D_i}{\theta^2}\right) = \frac{1}{\theta^4}\,\mathbb{V}_\theta(D_i) = \frac{1}{\theta^2}$$
Conclusion: $\mathcal{I}_i(\theta) = \mathcal{I}(\theta)$ does not depend on $i$.

Solution, cont'd
Second definition:
$$\mathcal{I}_i(\theta) = \mathbb{E}_\theta\left(s_i(\theta; D_i)^2\right) = \mathbb{E}_\theta\left(\left(-\frac{1}{\theta} + \frac{D_i}{\theta^2}\right)^2\right) = \mathbb{V}_\theta\left(-\frac{1}{\theta} + \frac{D_i}{\theta^2}\right) = \frac{1}{\theta^2}$$
since $\mathbb{E}_\theta\left(-\frac{1}{\theta} + \frac{D_i}{\theta^2}\right) = 0$.
Conclusion: $\mathcal{I}_i(\theta) = \mathcal{I}(\theta)$ does not depend on $i$.

Solution, cont'd
$$H_i(\theta; D_i) = \frac{\partial^2\ell_i(\theta; D_i)}{\partial\theta^2} = \frac{1}{\theta^2} - \frac{2D_i}{\theta^3}$$
Third definition:
$$\mathcal{I}_i(\theta) = \mathbb{E}_\theta\left(-H_i(\theta; D_i)\right) = \mathbb{E}_\theta\left(-\left(\frac{1}{\theta^2} - \frac{2D_i}{\theta^3}\right)\right) = -\frac{1}{\theta^2} + \frac{2}{\theta^3}\,\mathbb{E}_\theta(D_i) = -\frac{1}{\theta^2} + \frac{2}{\theta^3}\,\theta = \frac{1}{\theta^2}$$
Conclusion: $\mathcal{I}_i(\theta) = \mathcal{I}(\theta)$ does not depend on $i$.

Example (Linear regression model)
We have shown that:
$$\frac{\partial^2\ell_i(\theta; Y_i \mid x_i)}{\partial\theta\,\partial\theta^\top} = \begin{pmatrix} -\frac{1}{\sigma^2}\underbrace{x_i}_{K \times 1}\underbrace{x_i^\top}_{1 \times K} & -\frac{1}{\sigma^4}x_i\left(Y_i - x_i^\top\beta\right) \\[1ex] -\frac{1}{\sigma^4}x_i^\top\left(Y_i - x_i^\top\beta\right) & \frac{1}{2\sigma^4} - \frac{1}{\sigma^6}\left(Y_i - x_i^\top\beta\right)^2 \end{pmatrix}$$
Question: what is the Fisher information matrix associated to the observation $Y_i$?

Solution
The information matrix is then defined by:
$$\underbrace{\mathcal{I}_i(\theta)}_{(K+1) \times (K+1)} = \mathbb{E}_\theta\left(-\frac{\partial^2\ell_i(\theta; Y_i \mid x_i)}{\partial\theta\,\partial\theta^\top}\right) = \mathbb{E}_\theta\left(-H_i(\theta; Y_i \mid x_i)\right)$$
where $\mathbb{E}_\theta$ means the expectation with respect to the conditional distribution $Y_i \mid X_i = x_i$:
$$\mathcal{I}_i(\theta) = \begin{pmatrix} \frac{1}{\sigma^2}x_i x_i^\top & \frac{1}{\sigma^4}x_i\left(\mathbb{E}_\theta(Y_i) - x_i^\top\beta\right) \\[1ex] \frac{1}{\sigma^4}x_i^\top\left(\mathbb{E}_\theta(Y_i) - x_i^\top\beta\right) & -\frac{1}{2\sigma^4} + \frac{1}{\sigma^6}\,\mathbb{E}_\theta\left(\left(Y_i - x_i^\top\beta\right)^2\right) \end{pmatrix}$$

Solution (cont'd)
Given that $\mathbb{E}_\theta(Y_i) = x_i^\top\beta$ and $\mathbb{E}_\theta\left(\left(Y_i - x_i^\top\beta\right)^2\right) = \sigma^2$, we have:
$$\mathcal{I}_i(\theta) = \begin{pmatrix} \frac{1}{\sigma^2}x_i x_i^\top & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}$$
Conclusion: $\mathcal{I}_i(\theta)$ depends on $x_i$ and $\mathcal{I}_i(\theta) \neq \mathcal{I}_j(\theta)$ for $i \neq j$.

Definition (Average Fisher information matrix)
For a conditional model, the average Fisher information matrix for one observation is defined by:
$$\mathcal{I}(\theta) = \mathbb{E}_X\left(\mathcal{I}_i(\theta)\right)$$
where $\mathbb{E}_X$ denotes the expectation with respect to $X$ (the conditioning variable).

Summary: For a conditional model (and only for a conditional model), we have:
$$\mathcal{I}(\theta) = \mathbb{E}_X\left(\mathbb{V}_\theta\left(\frac{\partial\ell_i(\theta; Y_i \mid X_i)}{\partial\theta}\right)\right) = \mathbb{E}_X\left(\mathbb{V}_\theta\left(s_i(\theta; Y_i \mid X_i)\right)\right)$$
$$\mathcal{I}(\theta) = \mathbb{E}_X\mathbb{E}_\theta\left(\frac{\partial\ell_i(\theta; Y_i \mid X_i)}{\partial\theta}\,\frac{\partial\ell_i(\theta; Y_i \mid X_i)^\top}{\partial\theta}\right) = \mathbb{E}_X\mathbb{E}_\theta\left(s_i(\theta; Y_i \mid X_i)\, s_i(\theta; Y_i \mid X_i)^\top\right)$$
$$\mathcal{I}(\theta) = \mathbb{E}_X\mathbb{E}_\theta\left(-\frac{\partial^2\ell_i(\theta; Y_i \mid X_i)}{\partial\theta\,\partial\theta^\top}\right) = \mathbb{E}_X\mathbb{E}_\theta\left(-H_i(\theta; Y_i \mid X_i)\right)$$

Summary: For a marginal distribution, we have:
$$\mathcal{I}(\theta) = \mathbb{V}_\theta\left(\frac{\partial\ell_i(\theta; Y_i)}{\partial\theta}\right) = \mathbb{V}_\theta\left(s_i(\theta; Y_i)\right)$$
$$\mathcal{I}(\theta) = \mathbb{E}_\theta\left(\frac{\partial\ell_i(\theta; Y_i)}{\partial\theta}\,\frac{\partial\ell_i(\theta; Y_i)^\top}{\partial\theta}\right) = \mathbb{E}_\theta\left(s_i(\theta; Y_i)\, s_i(\theta; Y_i)^\top\right)$$
$$\mathcal{I}(\theta) = \mathbb{E}_\theta\left(-\frac{\partial^2\ell_i(\theta; Y_i)}{\partial\theta\,\partial\theta^\top}\right) = \mathbb{E}_\theta\left(-H_i(\theta; Y_i)\right)$$

Example (Linear Regression Model)
In the linear model, the individual Fisher information matrix is equal to:
$$\mathcal{I}_i(\theta) = \begin{pmatrix} \frac{1}{\sigma^2}x_i x_i^\top & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}$$
and the average Fisher information matrix for one observation is defined by:
$$\mathcal{I}(\theta) = \mathbb{E}_X\left(\mathcal{I}_i(\theta)\right) = \begin{pmatrix} \frac{1}{\sigma^2}\,\mathbb{E}_X\left(X_i X_i^\top\right) & 0 \\ 0 & \frac{1}{2\sigma^4} \end{pmatrix}$$

Summary: in order to compute the average information matrix $\mathcal{I}(\theta)$ for one observation (see the sketch after this list):

Step 1: Compute the Hessian matrix or the score vector for one observation:
$$H_i(\theta; Y_i \mid x_i) = \frac{\partial^2\ell_i(\theta; Y_i \mid x_i)}{\partial\theta\,\partial\theta^\top} \qquad s_i(\theta; Y_i \mid x_i) = \frac{\partial\ell_i(\theta; Y_i \mid x_i)}{\partial\theta}$$

Step 2: Take the expectation (or the variance) with respect to the conditional distribution $Y_i \mid X_i = x_i$:
$$\mathcal{I}_i(\theta) = \mathbb{V}_\theta\left(s_i(\theta; Y_i \mid x_i)\right) = \mathbb{E}_\theta\left(-H_i(\theta; Y_i \mid x_i)\right)$$

Step 3: Take the expectation with respect to the conditioning variable $X$:
$$\mathcal{I}(\theta) = \mathbb{E}_X\left(\mathcal{I}_i(\theta)\right)$$

Theorem
In a sampling model (with i.i.d. observations), one has:
$$\mathcal{I}_N(\theta) = N\,\mathcal{I}(\theta)$$

Marginal distribution versus conditional distribution (model), for one observation:
- pdf: $f_{X_i}(\theta; x_i)$ versus $f_{Y_i \mid x_i}(\theta; y \mid x)$
- Score vector: $s_i(\theta; X_i)$ versus $s_i(\theta; Y_i \mid x_i)$
- Hessian matrix: $H_i(\theta; X_i)$ versus $H_i(\theta; Y_i \mid x_i)$
- Information matrix: $\mathcal{I}_i(\theta) = \mathcal{I}(\theta)$ versus $\mathcal{I}_i(\theta)$
- Average information matrix: $\mathcal{I}(\theta) = \mathcal{I}_i(\theta)$ versus $\mathcal{I}(\theta) = \mathbb{E}_X\left(\mathcal{I}_i(\theta)\right)$

with $\mathcal{I}_i(\theta) = \mathbb{V}_\theta\left(s_i(\theta; Y_i \mid x_i)\right) = \mathbb{E}_\theta\left(s_i(\theta; Y_i \mid x_i)\, s_i(\theta; Y_i \mid x_i)^\top\right) = \mathbb{E}_\theta\left(-H_i(\theta; Y_i \mid x_i)\right)$.

How to estimate the average Fisher information matrix?

This matrix is particularly important, since we will see that it corresponds to the asymptotic variance-covariance matrix of the MLE.

Assuming that we have a consistent estimator $\hat\theta$ of the parameter $\theta$, how can we estimate the average Fisher information matrix?

Definition (Estimators of the average Fisher Information Matrix)
If $\hat\theta$ converges in probability to $\theta_0$ (true value), then:
$$\widehat{\mathcal{I}}\left(\hat\theta\right) = \frac{1}{N}\sum_{i=1}^N \widehat{\mathcal{I}}_i\left(\hat\theta\right)$$
$$\widehat{\mathcal{I}}\left(\hat\theta\right) = \frac{1}{N}\sum_{i=1}^N \left(\left.\frac{\partial\ell_i(\theta; y_i \mid x_i)}{\partial\theta}\right|_{\hat\theta} \left.\frac{\partial\ell_i(\theta; y_i \mid x_i)}{\partial\theta}\right|_{\hat\theta}^\top\right)$$
$$\widehat{\mathcal{I}}\left(\hat\theta\right) = \frac{1}{N}\sum_{i=1}^N \left(-\left.\frac{\partial^2\ell_i(\theta; y_i \mid x_i)}{\partial\theta\,\partial\theta^\top}\right|_{\hat\theta}\right)$$
are three consistent estimators of the average Fisher information matrix.

1 The first estimator corresponds to the average of the $N$ Fisher information matrices (for $Y_1, \ldots, Y_N$) evaluated at the estimated value $\hat\theta$. This estimator will rarely be available in practice.

2 The second estimator corresponds to the average of the outer products of the individual score vectors evaluated at $\hat\theta$. It is known as the BHHH (Berndt, Hall, Hall, and Hausman, 1974) estimator or OPG (outer product of gradients) estimator:
$$\widehat{\mathcal{I}}\left(\hat\theta\right) = \frac{1}{N}\sum_{i=1}^N g_i\left(\hat\theta; y_i \mid x_i\right) g_i\left(\hat\theta; y_i \mid x_i\right)^\top$$

3 The third estimator corresponds to the opposite of the average of the Hessian matrices evaluated at $\hat\theta$:
$$\widehat{\mathcal{I}}\left(\hat\theta\right) = \frac{1}{N}\sum_{i=1}^N \left(-H_i\left(\hat\theta; y_i \mid x_i\right)\right)$$

Problem
These three estimators are asymptotically equivalent, but they can give different results in finite samples. Available evidence suggests that in small or moderately sized samples the Hessian-based estimator is preferable (Greene, 2007). However, in most cases, the BHHH estimator will be the easiest to compute.
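Illustration (not part of the original slides): the three estimators for the exponential example, evaluated at $\hat\theta$ on one simulated sample (numpy, arbitrary $\theta_0$ and $N$). Note that here the first and third estimators coincide exactly because of the FOC $\hat\theta = \bar{D}$.

```python
import numpy as np

rng = np.random.default_rng(4)
theta0, N = 2.0, 200
d = rng.exponential(theta0, N)
theta_hat = d.mean()                            # MLE of theta

s_i = -1 / theta_hat + d / theta_hat**2         # individual scores at theta_hat
h_i = 1 / theta_hat**2 - 2 * d / theta_hat**3   # individual Hessians at theta_hat

I1 = 1 / theta_hat**2   # average of the individual information matrices
I2 = np.mean(s_i**2)    # BHHH / OPG estimator
I3 = np.mean(-h_i)      # minus the average Hessian (= I1 here, by the FOC)
print(I1, I2, I3)       # asymptotically equivalent, not identical in finite samples
```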


Example (CAPM)
The empirical analogue of the CAPM is given by:
$$\widetilde{r}_{it} = \alpha_i + \beta_i\,\widetilde{r}_{mt} + \varepsilon_t$$
where $\widetilde{r}_{it} = r_{it} - r_{ft}$ is the excess return of security $i$ at time $t$, $\widetilde{r}_{mt} = r_{mt} - r_{ft}$ is the market excess return at time $t$, and $\varepsilon_t$ is an i.i.d. error term with:
$$\mathbb{E}(\varepsilon_t) = 0 \qquad \mathbb{V}(\varepsilon_t) = \sigma^2 \qquad \mathbb{E}\left(\varepsilon_t \mid \widetilde{r}_{mt}\right) = 0$$

Example (CAPM, cont'd)
Data (data file: capm.xls): Microsoft, SP500 and Tbill (closing prices) from 11/1/1993 to 04/03/2003.

[Figure: scatter plot of the Microsoft excess return (RMSFT) against the SP500 excess return (RSP500), and time series of RSP500 and RMSFT.]

Example (CAPM, cont'd)
We consider the CAPM model rewritten as follows:
$$\widetilde{r}_{it} = x_t^\top\beta + \varepsilon_t \qquad t = 1, \ldots, T$$
where $x_t = (1 \;\; \widetilde{r}_{mt})^\top$ is a $2 \times 1$ vector of random variables, $\theta = \left(\alpha_i : \beta_i : \sigma^2\right)^\top = \left(\beta^\top : \sigma^2\right)^\top$ is a $3 \times 1$ vector of parameters, and where the error term $\varepsilon_t$ satisfies $\mathbb{E}(\varepsilon_t) = 0$, $\mathbb{V}(\varepsilon_t) = \sigma^2$ and $\mathbb{E}\left(\varepsilon_t \mid \widetilde{r}_{mt}\right) = 0$.

Example (CAPM, cont'd)
Question: Compute three alternative estimators of the asymptotic variance-covariance matrix of the MLE $\hat\theta = \left(\hat\alpha_i \;\; \hat\beta_i \;\; \hat\sigma^2\right)^\top$, with:
$$\hat\beta = \begin{pmatrix}\hat\alpha_i \\ \hat\beta_i\end{pmatrix} = \left(\sum_{t=1}^T x_t x_t^\top\right)^{-1}\left(\sum_{t=1}^T x_t\,\widetilde{r}_{it}\right)$$
$$\hat\sigma^2 = \frac{1}{T}\sum_{t=1}^T\left(\widetilde{r}_{it} - x_t^\top\hat\beta\right)^2$$

Solution
The ML estimator is defined by:
$$\hat\theta = \arg\max_{\beta\in\mathbb{R}^2,\,\sigma^2\in\mathbb{R}^+}\; -\frac{T}{2}\ln\left(\sigma^2\right) - \frac{T}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{t=1}^T\left(\widetilde{r}_{it} - x_t^\top\beta\right)^2$$
The problem is regular, so we have:
$$\sqrt{T}\left(\hat\theta - \theta_0\right) \xrightarrow{d} \mathcal{N}\left(0, \mathcal{I}^{-1}(\theta_0)\right)$$
or equivalently $\hat\theta \overset{asy}{\sim} \mathcal{N}\left(\theta_0, \frac{1}{T}\mathcal{I}^{-1}(\theta_0)\right)$. The asymptotic variance-covariance matrix of $\hat\theta$ is:
$$\mathbb{V}\left(\hat\theta\right) = \frac{1}{T}\mathcal{I}^{-1}(\theta_0)$$

Solution (cont'd)
First estimator: The information matrix at time $t$ is defined by (third definition):
$$\mathcal{I}_t(\theta) = \mathbb{E}_\theta\left(-\frac{\partial^2\ell_t\left(\theta; \widetilde{R}_{it} \mid x_t\right)}{\partial\theta\,\partial\theta^\top}\right) = \mathbb{E}_\theta\left(-H_t\left(\theta; \widetilde{R}_{it} \mid x_t\right)\right)$$
where $\mathbb{E}_\theta$ means the expectation with respect to the conditional distribution $\widetilde{R}_{it} \mid X_t = x_t$:
$$\mathcal{I}_t(\theta) = \begin{pmatrix} \frac{1}{\sigma^2}x_t x_t^\top & \frac{1}{\sigma^4}x_t\left(\mathbb{E}_\theta\left(\widetilde{R}_{it}\right) - x_t^\top\beta\right) \\[1ex] \frac{1}{\sigma^4}x_t^\top\left(\mathbb{E}_\theta\left(\widetilde{R}_{it}\right) - x_t^\top\beta\right) & -\frac{1}{2\sigma^4} + \frac{1}{\sigma^6}\,\mathbb{E}_\theta\left(\left(\widetilde{R}_{it} - x_t^\top\beta\right)^2\right) \end{pmatrix}$$

Solution (cont'd)
First estimator: Given that $\mathbb{E}_\theta\left(\widetilde{R}_{it}\right) = x_t^\top\beta$ and $\mathbb{E}_\theta\left(\left(\widetilde{R}_{it} - x_t^\top\beta\right)^2\right) = \sigma^2$, we have:
$$\mathcal{I}_t(\theta) = \begin{pmatrix} \frac{1}{\sigma^2}x_t x_t^\top & 0_{2\times 1} \\ 0_{1\times 2} & \frac{1}{2\sigma^4} \end{pmatrix}$$

Solution (cont'd)
First estimator: An estimator of the asymptotic variance-covariance matrix of $\hat\theta$ is then given by:
$$\widehat{\mathbb{V}}_{asy}\left(\hat\theta\right) = \frac{1}{T}\widehat{\mathcal{I}}^{-1}\left(\hat\theta\right)$$
$$\widehat{\mathcal{I}}\left(\hat\theta\right) = \frac{1}{T}\sum_{t=1}^T \mathcal{I}_t\left(\hat\theta\right) = \begin{pmatrix} \frac{1}{T\hat\sigma^2}\sum_{t=1}^T x_t x_t^\top & 0_{2\times 1} \\ 0_{1\times 2} & \frac{1}{2\hat\sigma^4} \end{pmatrix}$$

Solution (cont'd)
Second definition (BHHH):
$$\widehat{\mathbb{V}}_{asy}\left(\hat\theta\right) = \frac{1}{T}\widehat{\mathcal{I}}^{-1}\left(\hat\theta\right) \qquad \widehat{\mathcal{I}}\left(\hat\theta\right) = \frac{1}{T}\sum_{t=1}^T\left(\left.\frac{\partial\ell_t(\theta; \widetilde{r}_{it} \mid x_t)}{\partial\theta}\right|_{\hat\theta} \left.\frac{\partial\ell_t(\theta; \widetilde{r}_{it} \mid x_t)}{\partial\theta}\right|_{\hat\theta}^\top\right)$$
with
$$\left.\frac{\partial\ell_t(\theta; \widetilde{r}_{it} \mid x_t)}{\partial\theta}\right|_{\hat\theta} = \begin{pmatrix} \frac{1}{\hat\sigma^2}x_t\left(\widetilde{r}_{it} - x_t^\top\hat\beta\right) \\[1ex] -\frac{1}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\left(\widetilde{r}_{it} - x_t^\top\hat\beta\right)^2 \end{pmatrix} = \begin{pmatrix} \frac{1}{\hat\sigma^2}x_t\hat\varepsilon_t \\[1ex] -\frac{1}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\hat\varepsilon_t^2 \end{pmatrix}$$

Solution (cont'd)
Second definition (BHHH):
$$\left.\frac{\partial\ell_t(\theta; \widetilde{r}_{it} \mid x_t)}{\partial\theta}\right|_{\hat\theta} \left.\frac{\partial\ell_t(\theta; \widetilde{r}_{it} \mid x_t)}{\partial\theta}\right|_{\hat\theta}^\top = \begin{pmatrix} \frac{1}{\hat\sigma^2}x_t\hat\varepsilon_t \\[1ex] -\frac{1}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\hat\varepsilon_t^2 \end{pmatrix}\begin{pmatrix} \frac{1}{\hat\sigma^2}x_t^\top\hat\varepsilon_t & -\frac{1}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\hat\varepsilon_t^2 \end{pmatrix}$$
$$= \begin{pmatrix} \frac{1}{\hat\sigma^4}x_t x_t^\top\hat\varepsilon_t^2 & \frac{1}{\hat\sigma^2}x_t\hat\varepsilon_t\left(-\frac{1}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\hat\varepsilon_t^2\right) \\[1ex] \frac{1}{\hat\sigma^2}x_t^\top\hat\varepsilon_t\left(-\frac{1}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\hat\varepsilon_t^2\right) & \left(-\frac{1}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\hat\varepsilon_t^2\right)^2 \end{pmatrix}$$

Solution (cont'd)
Second definition (BHHH): so we have
$$\widehat{\mathbb{V}}_{asy}\left(\hat\theta\right) = \frac{1}{T}\widehat{\mathcal{I}}^{-1}\left(\hat\theta\right)$$
with
$$\widehat{\mathcal{I}}\left(\hat\theta\right) = \frac{1}{T}\sum_{t=1}^T \begin{pmatrix} \frac{1}{\hat\sigma^4}x_t x_t^\top\hat\varepsilon_t^2 & \frac{1}{\hat\sigma^2}x_t\hat\varepsilon_t\left(-\frac{1}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\hat\varepsilon_t^2\right) \\[1ex] \frac{1}{\hat\sigma^2}x_t^\top\hat\varepsilon_t\left(-\frac{1}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\hat\varepsilon_t^2\right) & \left(-\frac{1}{2\hat\sigma^2} + \frac{1}{2\hat\sigma^4}\hat\varepsilon_t^2\right)^2 \end{pmatrix}$$

Solution (cont'd)
Third definition (inverse of the Hessian): we know that
$$\widehat{\mathbb{V}}_{asy}\left(\hat\theta\right) = \frac{1}{T}\widehat{\mathcal{I}}^{-1}\left(\hat\theta\right) \qquad \widehat{\mathcal{I}}\left(\hat\theta\right) = \frac{1}{T}\sum_{t=1}^T\left(-H_t\left(\hat\theta; \widetilde{r}_{it} \mid x_t\right)\right)$$
$$H_t\left(\hat\theta; \widetilde{r}_{it} \mid x_t\right) = \begin{pmatrix} -\frac{1}{\hat\sigma^2}x_t x_t^\top & -\frac{1}{\hat\sigma^4}x_t\left(\widetilde{r}_{it} - x_t^\top\hat\beta\right) \\[1ex] -\frac{1}{\hat\sigma^4}x_t^\top\left(\widetilde{r}_{it} - x_t^\top\hat\beta\right) & \frac{1}{2\hat\sigma^4} - \frac{1}{\hat\sigma^6}\left(\widetilde{r}_{it} - x_t^\top\hat\beta\right)^2 \end{pmatrix}$$

Solution (cont'd)
Third definition (inverse of the Hessian): Given the FOC (log-likelihood equations), $\sum_{t=1}^T x_t\left(\widetilde{r}_{it} - x_t^\top\hat\beta\right) = 0$ and $\sum_{t=1}^T\left(\widetilde{r}_{it} - x_t^\top\hat\beta\right)^2 = T\hat\sigma^2$. So:
$$\sum_{t=1}^T H_t\left(\hat\theta; \widetilde{r}_{it} \mid x_t\right) = \begin{pmatrix} -\frac{1}{\hat\sigma^2}\sum_{t=1}^T x_t x_t^\top & 0_{2\times 1} \\ 0_{1\times 2} & -\frac{T}{2\hat\sigma^4} \end{pmatrix}$$

Solution (cont'd)
Third definition (inverse of the Hessian): So, in this case, the third estimator of $\widehat{\mathcal{I}}\left(\hat\theta\right)$ coincides with the first one:
$$\widehat{\mathbb{V}}_{asy}\left(\hat\theta\right) = \frac{1}{T}\widehat{\mathcal{I}}^{-1}\left(\hat\theta\right) \qquad \widehat{\mathcal{I}}\left(\hat\theta\right) = \frac{1}{T}\sum_{t=1}^T\left(-H_t\left(\hat\theta; \widetilde{r}_{it} \mid x_t\right)\right) = \begin{pmatrix} \frac{1}{T\hat\sigma^2}\sum_{t=1}^T x_t x_t^\top & 0_{2\times 1} \\ 0_{1\times 2} & \frac{1}{2\hat\sigma^4} \end{pmatrix}$$

Solution (cont'd)
These three estimates of the asymptotic variance-covariance matrix are asymptotically equivalent, but can be quite different in finite samples:
$$\widehat{\mathbb{V}}_{asy}\left(\hat\theta\right) = \frac{1}{T}\widehat{\mathcal{I}}^{-1}\left(\hat\theta\right)$$
with
$$\widehat{\mathcal{I}}\left(\hat\theta\right) = \frac{1}{T}\sum_{t=1}^T \mathcal{I}_t\left(\hat\theta\right)$$
$$\widehat{\mathcal{I}}\left(\hat\theta\right) = \frac{1}{T}\sum_{t=1}^T\left(\left.\frac{\partial\ell_t(\theta; \widetilde{r}_{it} \mid x_t)}{\partial\theta}\right|_{\hat\theta} \left.\frac{\partial\ell_t(\theta; \widetilde{r}_{it} \mid x_t)}{\partial\theta}\right|_{\hat\theta}^\top\right)$$
$$\widehat{\mathcal{I}}\left(\hat\theta\right) = \frac{1}{T}\sum_{t=1}^T\left(-H_t\left(\hat\theta; \widetilde{r}_{it} \mid x_t\right)\right)$$
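Illustration (not part of the original slides): the capm.xls data are not reproduced here, so the numpy sketch below uses simulated CAPM-style data with arbitrary parameter values to compute the analytical/Hessian estimator and the BHHH estimator of the asymptotic standard errors.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 300
x = np.column_stack([np.ones(T), rng.normal(0, 0.02, T)])  # x_t = (1, market excess return)'
r = x @ np.array([0.001, 1.2]) + rng.normal(0, 0.03, T)    # simulated security excess returns

b = np.linalg.solve(x.T @ x, x.T @ r)   # ML = OLS for (alpha_i, beta_i)
e = r - x @ b
s2 = e @ e / T                          # ML estimate of sigma^2

# estimator 1 (= estimator 3 here): block-diagonal analytical information
I1 = np.zeros((3, 3))
I1[:2, :2] = (x.T @ x) / (T * s2)
I1[2, 2] = 1 / (2 * s2**2)

# estimator 2: BHHH, the average outer product of the individual gradients
g = np.column_stack([x * (e / s2)[:, None], -1 / (2 * s2) + e**2 / (2 * s2**2)])
I2 = g.T @ g / T

for I in (I1, I2):
    print(np.sqrt(np.diag(np.linalg.inv(I) / T)))  # asymptotic std errors of theta_hat
```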


Key Concepts

1 Gradient and Hessian Matrix (deterministic elements).

2 Score Vector (random elements).

3 Hessian Matrix (random elements).

4 Fisher information matrix associated to the sample.

5 (Average) Fisher information matrix for one observation.


Section 6

Properties of Maximum Likelihood Estimators


6. Properties of Maximum Likelihood Estimators

Objectives

Is the MLE a good estimator? Under which conditions is the MLE unbiased, consistent, and the BUE (Best Unbiased Estimator)? => regularity conditions

Is the MLE consistent?

Is the MLE optimal or efficient?

What is the asymptotic distribution of the MLE? The magic of the MLE...

Definition (Regularity conditions)
Greene (2007) identifies three regularity conditions:

R1. The first three derivatives of $\ln f_X(\theta; x_i)$ with respect to $\theta$ are continuous and finite for almost all $x_i$ and for all $\theta$. This condition ensures the existence of a certain Taylor series approximation and the finite variance of the derivatives of $\ell_i(\theta; x_i)$.

R2. The conditions necessary to obtain the expectations of the first and second derivatives of $\ln f_X(\theta; X_i)$ are met.

R3. For all values of $\theta$, $\left|\partial^3\ln f_X(\theta; x_i)/\partial\theta_i\,\partial\theta_j\,\partial\theta_k\right|$ is less than a function that has a finite expectation. This condition allows us to truncate the Taylor series.

Definition (Regularity conditions, Zivot 2001)
A pdf $f_X(\theta; x)$ is regular if and only if:

R1. The support of the random variable $X$, $S_X = \{x : f_X(\theta; x) > 0\}$, does not depend on $\theta$.

R2. $f_X(\theta; x)$ is at least three times differentiable with respect to $\theta$, and these derivatives are continuous.

R3. The true value of $\theta$ lies in a compact set $\Theta$.

Under these regularity conditions, the maximum likelihood estimator $\hat\theta$ possesses many appealing properties:

1 The maximum likelihood estimator is consistent.

2 The maximum likelihood estimator is asymptotically normal (the magic of the MLE...).

3 The maximum likelihood estimator is asymptotically optimal or efficient.

4 The maximum likelihood estimator is equivariant: if $\hat\theta$ is an estimator of $\theta$ then $g(\hat\theta)$ is an estimator of $g(\theta)$.

Theorem (Consistency)
Under regularity conditions, the maximum likelihood estimator is consistent:
$$\hat\theta \xrightarrow[N\to\infty]{p} \theta_0$$
or equivalently:
$$\operatorname*{plim}_{N\to\infty} \hat\theta = \theta_0$$
where $\theta_0$ denotes the true value of the parameter $\theta$.

Sketch of the proof (Greene, 2007)
Because $\hat\theta$ is the MLE, in any finite sample, for any $\theta \neq \hat\theta$ (including the true $\theta_0$) it must be true that:
$$\ln L_N\left(\hat\theta; y \mid x\right) \geq \ln L_N(\theta; y \mid x)$$
Consider, then, the random variable $L_N(\theta; Y \mid x)/L_N(\theta_0; Y \mid x)$. Because the log function is strictly concave, from Jensen's inequality we have:
$$\mathbb{E}_\theta\left(\ln\left(\frac{L_N(\theta; Y \mid x)}{L_N(\theta_0; Y \mid x)}\right)\right) \leq \ln\left(\mathbb{E}_\theta\left(\frac{L_N(\theta; Y \mid x)}{L_N(\theta_0; Y \mid x)}\right)\right)$$

Sketch of the proof, cont'd
The expectation on the right-hand side is exactly equal to one, as:
$$\mathbb{E}_\theta\left(\frac{L_N(\theta; Y \mid x)}{L_N(\theta_0; Y \mid x)}\right) = \int \left(\frac{L_N(\theta; y \mid x)}{L_N(\theta_0; y \mid x)}\right) L_N(\theta_0; y \mid x)\, dy = \int L_N(\theta; y \mid x)\, dy = 1$$
which is simply the integral of a joint density.

Sketch of the proof, cont'd
So we have:
$$\mathbb{E}_\theta\left(\ln\left(\frac{L_N(\theta; Y \mid x)}{L_N(\theta_0; Y \mid x)}\right)\right) \leq \ln\left(\mathbb{E}_\theta\left(\frac{L_N(\theta; Y \mid x)}{L_N(\theta_0; Y \mid x)}\right)\right) = \ln(1) = 0$$
Divide both sides of this inequality by $N$ to produce:
$$\mathbb{E}_\theta\left(\frac{1}{N}\ln L_N(\theta; Y \mid x)\right) \leq \mathbb{E}_\theta\left(\frac{1}{N}\ln L_N(\theta_0; Y \mid x)\right)$$
This produces a central result:

Theorem (Likelihood Inequality)
The expected value of the log-likelihood is maximized at the true value of the parameters. For any $\theta$, including $\hat\theta$:
$$\mathbb{E}_\theta\left(\frac{1}{N}\ell_N(\theta_0; Y_i \mid x_i)\right) \geq \mathbb{E}_\theta\left(\frac{1}{N}\ell_N(\theta; Y_i \mid x_i)\right)$$
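Illustration (not part of the original slides): the likelihood inequality can be visualised for the exponential example by estimating $\mathbb{E}_\theta\left(N^{-1}\ell_N(\theta)\right)$ on a grid of $\theta$ values (a hedged numpy sketch with arbitrary values).

```python
import numpy as np

rng = np.random.default_rng(6)
theta0, N, R = 2.0, 50, 5_000
samples = rng.exponential(theta0, size=(R, N))

# Monte Carlo estimate of E[(1/N) ell_N(theta; D)] = -ln(theta) - E(D_bar)/theta
grid = np.linspace(0.5, 5.0, 500)
expected = -np.log(grid) - samples.mean() / grid
print(grid[np.argmax(expected)])  # ~ theta0 = 2: the expected log-likelihood peaks at the truth
```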

Sketch of the proof, cont'd
Notice that:
$$\frac{1}{N}\ell_N(\theta; Y_i \mid x_i) = \frac{1}{N}\sum_{i=1}^N \ell_i(\theta; Y_i \mid x_i)$$
where the elements $\ell_i(\theta; Y_i \mid x_i)$ for $i = 1, \ldots, N$ are i.i.d. So, using a law of large numbers, we get:
$$\frac{1}{N}\ell_N(\theta; Y_i \mid x_i) \xrightarrow[N\to\infty]{p} \mathbb{E}_\theta\left(\frac{1}{N}\ell_N(\theta; Y_i \mid x_i)\right)$$

Sketch of the proof, cont'd
The likelihood inequality for $\theta = \hat\theta$ implies:
$$\mathbb{E}_\theta\left(\frac{1}{N}\ell_N(\theta_0; Y_i \mid x_i)\right) \geq \mathbb{E}_\theta\left(\frac{1}{N}\ell_N\left(\hat\theta; Y_i \mid x_i\right)\right)$$
with
$$\frac{1}{N}\ell_N(\theta_0; Y_i \mid x_i) \xrightarrow[N\to\infty]{p} \mathbb{E}_\theta\left(\frac{1}{N}\ell_N(\theta_0; Y_i \mid x_i)\right) \qquad \frac{1}{N}\ell_N\left(\hat\theta; Y_i \mid x_i\right) \xrightarrow[N\to\infty]{p} \mathbb{E}_\theta\left(\frac{1}{N}\ell_N\left(\hat\theta; Y_i \mid x_i\right)\right)$$
and thus:
$$\lim_{N\to\infty} \Pr\left(\frac{1}{N}\ell_N(\theta_0; Y_i \mid x_i) \geq \frac{1}{N}\ell_N\left(\hat\theta; Y_i \mid x_i\right)\right) = 1$$

Sketch of the proof, cont'd
So we have two results:
$$\lim_{N\to\infty} \Pr\left(\frac{1}{N}\ell_N(\theta_0; Y_i \mid x_i) \geq \frac{1}{N}\ell_N\left(\hat\theta; Y_i \mid x_i\right)\right) = 1$$
$$\frac{1}{N}\ell_N\left(\hat\theta; Y_i \mid x_i\right) \geq \frac{1}{N}\ell_N(\theta_0; Y_i \mid x_i) \quad \forall N$$
It necessarily implies that:
$$\frac{1}{N}\ell_N\left(\hat\theta; Y_i \mid x_i\right) \xrightarrow[N\to\infty]{p} \frac{1}{N}\ell_N(\theta_0; Y_i \mid x_i)$$
If $\theta$ is a scalar, we have immediately:
$$\hat\theta \xrightarrow[N\to\infty]{p} \theta_0$$
For the more general case with $\dim(\theta) = K$, see a formal proof in Amemiya (1985).

Amemiya T. (1985), Advanced Econometrics. Harvard University Press.

Remark
The proof of the consistency of the MLE is much easier when we have an explicit expression for the maximum likelihood estimator $\hat\theta$:
$$\hat\theta = \hat\theta(X_1, \ldots, X_N)$$

Example
Suppose that $D_1, D_2, \ldots, D_N$ are i.i.d. positive random variables with $D_i \sim \mathrm{Exp}(\theta_0)$, with:
$$f_D(d;\theta) = \frac{1}{\theta}\exp\left(-\frac{d}{\theta}\right), \quad \forall d \in \mathbb{R}^+$$
$$\mathbb{E}_\theta(D_i) = \theta_0 \qquad \mathbb{V}_\theta(D_i) = \theta_0^2$$
where $\theta_0$ is the true value of $\theta$. Question: show that the MLE is consistent.

Solution
The log-likelihood function associated to the sample $\{d_1, \ldots, d_N\}$ is defined by:
$$\ell_N(\theta; d) = -N\ln(\theta) - \frac{1}{\theta}\sum_{i=1}^N d_i$$
We admit that the maximum likelihood estimator corresponds to the sample mean:
$$\hat\theta = \frac{1}{N}\sum_{i=1}^N D_i$$

Solution, cont'd
Then, we have:
$$\mathbb{E}_\theta\left(\hat\theta\right) = \frac{1}{N}\sum_{i=1}^N \mathbb{E}_\theta(D_i) = \theta \quad (\hat\theta \text{ is unbiased})$$
$$\mathbb{V}_\theta\left(\hat\theta\right) = \frac{1}{N^2}\sum_{i=1}^N \mathbb{V}_\theta(D_i) = \frac{\theta^2}{N}$$
As a consequence:
$$\mathbb{E}_\theta\left(\hat\theta\right) = \theta \qquad \lim_{N\to\infty} \mathbb{V}_\theta\left(\hat\theta\right) = 0$$
and, since convergence in mean square implies convergence in probability:
$$\hat\theta \xrightarrow[N\to\infty]{p} \theta$$
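Illustration (not part of the original slides): a minimal numpy simulation of this consistency result, with arbitrary $\theta_0$.

```python
import numpy as np

rng = np.random.default_rng(7)
theta0 = 2.0
for N in (10, 100, 1_000, 10_000, 100_000):
    print(N, rng.exponential(theta0, N).mean())  # the MLE collapses onto theta0 as N grows
```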

Lemma
Under stronger conditions, the maximum likelihood estimator converges almost surely to $\theta_0$:
$$\hat\theta \xrightarrow[N\to\infty]{a.s.} \theta_0 \implies \hat\theta \xrightarrow[N\to\infty]{p} \theta_0$$

1 If we restrict ourselves to the class of unbiased estimators (linear and nonlinear), then we define the best estimator as the one with the smallest variance.

2 With linear estimators (next chapter), the Gauss-Markov theorem tells us that the ordinary least squares (OLS) estimator is best (BLUE).

3 When we expand the class of estimators to include linear and nonlinear estimators, it turns out that we can establish an absolute lower bound on the variance of any unbiased estimator $\hat\theta$ of $\theta$ under certain conditions.

4 Then, if an unbiased estimator $\hat\theta$ has a variance that is equal to the lower bound, we have found the best unbiased estimator (BUE).

Definition (Cramer-Rao or FDCR bound)
Let $X_1, \ldots, X_N$ be an i.i.d. sample with pdf $f_X(\theta; x)$. Let $\hat\theta$ be an unbiased estimator of $\theta$, i.e. $\mathbb{E}_\theta\left(\hat\theta\right) = \theta$. If $f_X(\theta; x)$ is regular, then:
$$\mathbb{V}_\theta\left(\hat\theta\right) \geq \mathcal{I}_N^{-1}(\theta_0) \quad \text{(FDCR or Cramer-Rao bound)}$$
where $\mathcal{I}_N(\theta_0)$ denotes the Fisher information number for the sample evaluated at the true value $\theta_0$.

Remarks

1 Hence, the Cramer-Rao bound is the inverse of the information matrix associated to the sample. Reminder: there are three definitions of $\mathcal{I}_N(\theta_0)$:
$$\mathcal{I}_N(\theta_0) = \mathbb{V}_\theta\left(\left.\frac{\partial\ell_N(\theta; Y \mid x)}{\partial\theta}\right|_{\theta_0}\right)$$
$$\mathcal{I}_N(\theta_0) = \mathbb{E}_\theta\left(\left.\frac{\partial\ell_N(\theta; Y \mid x)}{\partial\theta}\right|_{\theta_0} \left.\frac{\partial\ell_N(\theta; Y \mid x)^\top}{\partial\theta}\right|_{\theta_0}\right)$$
$$\mathcal{I}_N(\theta_0) = \mathbb{E}_\theta\left(-\left.\frac{\partial^2\ell_N(\theta; Y \mid x)}{\partial\theta\,\partial\theta^\top}\right|_{\theta_0}\right)$$

2 If $\theta$ is a vector, then $\mathbb{V}_\theta\left(\hat\theta\right) \geq \mathcal{I}_N^{-1}(\theta_0)$ means that $\mathbb{V}_\theta\left(\hat\theta\right) - \mathcal{I}_N^{-1}(\theta_0)$ is positive semi-definite.

Theorem (Efficiency)
Under regularity conditions, the maximum likelihood estimator is asymptotically efficient and attains the FDCR (Frechet - Darmois - Cramer - Rao) or Cramer-Rao bound:
$$\mathbb{V}_\theta\left(\hat\theta\right) = \mathcal{I}_N^{-1}(\theta_0)$$
where $\mathcal{I}_N(\theta_0)$ denotes the Fisher information matrix associated to the sample evaluated at the true value $\theta_0$.

Example (Exponential Distribution)
Suppose that $D_1, D_2, \ldots, D_N$ are i.i.d. positive random variables with $D_i \sim \mathrm{Exp}(\theta_0)$, with:
$$f_D(d;\theta) = \frac{1}{\theta}\exp\left(-\frac{d}{\theta}\right), \quad \forall d \in \mathbb{R}^+$$
$$\mathbb{E}_\theta(D_i) = \theta_0 \qquad \mathbb{V}_\theta(D_i) = \theta_0^2$$
where $\theta_0$ is the true value of $\theta$. Question: show that the MLE is efficient.

Solution
We have shown that the maximum likelihood estimator corresponds to the sample mean:
$$\hat\theta = \frac{1}{N}\sum_{i=1}^N D_i \qquad \mathbb{E}_\theta\left(\hat\theta\right) = \theta_0 \qquad \mathbb{V}_\theta\left(\hat\theta\right) = \frac{\theta_0^2}{N}$$

Solution, cont'd
The log-likelihood function is:
$$\ell_N(\theta; d) = -N\ln(\theta) - \frac{1}{\theta}\sum_{i=1}^N d_i$$
The score vector is defined by:
$$s_N(\theta; D) = \frac{\partial\ell_N(\theta; D)}{\partial\theta} = -\frac{N}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^N D_i$$

Solution, cont'd
Let us use one of the three definitions of the information quantity $\mathcal{I}_N(\theta)$:
$$\mathcal{I}_N(\theta) = \mathbb{V}_\theta\left(\frac{\partial\ell_N(\theta; D)}{\partial\theta}\right) = \mathbb{V}_\theta\left(-\frac{N}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^N D_i\right) = \frac{1}{\theta^4}\sum_{i=1}^N \mathbb{V}_\theta(D_i) = \frac{N\theta^2}{\theta^4} = \frac{N}{\theta^2}$$
Then, $\hat\theta$ is efficient and attains the Cramer-Rao bound:
$$\mathbb{V}_\theta\left(\hat\theta\right) = \mathcal{I}_N^{-1}(\theta_0) = \frac{\theta_0^2}{N} \quad \blacksquare$$
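Illustration (not part of the original slides): the empirical variance of $\hat\theta$ over many replications can be compared with the Cramer-Rao bound $\theta_0^2/N$ (numpy sketch, arbitrary values).

```python
import numpy as np

rng = np.random.default_rng(8)
theta0, N, R = 2.0, 50, 100_000
theta_hat = rng.exponential(theta0, size=(R, N)).mean(axis=1)  # R replications of the MLE

print(theta_hat.var())  # empirical variance of theta_hat
print(theta0**2 / N)    # Cramer-Rao bound I_N^{-1}(theta0) = theta0^2 / N
```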

Theorem (Convergence of the MLE)
Under suitable regularity conditions, the MLE is asymptotically normally distributed:
$$\sqrt{N}\left(\hat\theta - \theta_0\right) \xrightarrow{d} \mathcal{N}\left(0, \mathcal{I}^{-1}(\theta_0)\right)$$
where $\theta_0$ denotes the true value of the parameter and $\mathcal{I}(\theta_0)$ the (average) Fisher information matrix for one observation.

Corollary
Another way to write this result is to say that, for a large sample size $N$, the MLE $\hat\theta$ is approximately distributed according to a normal distribution:
$$\hat\theta \overset{asy}{\sim} \mathcal{N}\left(\theta_0, N^{-1}\mathcal{I}^{-1}(\theta_0)\right)$$
or equivalently:
$$\hat\theta \overset{asy}{\sim} \mathcal{N}\left(\theta_0, \mathcal{I}_N^{-1}(\theta_0)\right)$$
where $\mathcal{I}_N(\theta_0) = N\,\mathcal{I}(\theta_0)$ denotes the Fisher information matrix associated to the sample.

Definition (Asymptotic Variance)
The asymptotic variance of the MLE is defined by:
$$\mathbb{V}_{asy}\left(\hat\theta\right) = \mathcal{I}_N^{-1}(\theta_0)$$
where $\mathcal{I}_N(\theta_0)$ denotes the Fisher information matrix associated to the sample. This asymptotic variance of the MLE corresponds to the Cramer-Rao or FDCR bound.

The magic of the MLE
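Illustration (not part of the original slides): a numpy sketch of "the magic", simulating $\sqrt{N}(\hat\theta - \theta_0)$ for the exponential example and checking that it behaves like $\mathcal{N}(0, \theta_0^2)$.

```python
import numpy as np

rng = np.random.default_rng(9)
theta0, N, R = 2.0, 200, 50_000
theta_hat = rng.exponential(theta0, size=(R, N)).mean(axis=1)

z = np.sqrt(N) * (theta_hat - theta0)      # should be ~ N(0, I^{-1}(theta0)) = N(0, theta0^2)
print(z.mean(), z.var(), theta0**2)        # mean ~ 0, variance ~ theta0^2
print(np.mean(np.abs(z) > 1.96 * theta0))  # ~ 0.05, as for a Gaussian tail
```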


Proof (MLE convergence)

At the maximum likelihood estimator, the gradient of the log-likelihood (a $K \times 1$ vector) equals zero (FOC):

$$g_N\left(\hat{\theta}\right) \equiv g_N\left(\hat{\theta}; y \mid x\right) = \left. \frac{\partial \ell_N(\theta; y \mid x)}{\partial \theta} \right|_{\hat{\theta}} = 0_K$$

where $\hat{\theta} = \hat{\theta}(x)$ denotes here the ML estimate. Expand this set of equations in a Taylor series around the true parameters $\theta_0$. We will use the mean value theorem to truncate the Taylor series at the second term:

$$g_N\left(\hat{\theta}\right) = g_N(\theta_0) + H_N\left(\bar{\theta}\right)\left(\hat{\theta} - \theta_0\right) = 0$$

The Hessian is evaluated at a point $\bar{\theta}$ that lies between $\hat{\theta}$ and $\theta_0$, for instance $\bar{\theta} = \omega\hat{\theta} + (1-\omega)\theta_0$ for some $0 < \omega < 1$.
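As a quick sanity check on the FOC (a sketch with assumed values, not part of the original slides; it relies on NumPy and SciPy), one can maximize the exponential log-likelihood numerically and verify that the maximizer coincides with the closed-form MLE, the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Sketch: solve the FOC g_N(theta) = 0 numerically for the exponential model.
rng = np.random.default_rng(6)
D = rng.exponential(scale=2.0, size=500)   # assumed true theta_0 = 2
N = len(D)

def neg_loglik(t):
    # -l_N(theta) = N*log(theta) + sum(D_i)/theta
    return N * np.log(t) + D.sum() / t

res = minimize_scalar(neg_loglik, bounds=(1e-6, 100.0), method="bounded")
print(res.x, D.mean())   # the two values should agree closely
```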


Proof (MLE convergence, cont'd)

We then rearrange this equation and multiply the result by $\sqrt{N}$ to obtain:

$$\sqrt{N}\left(\hat{\theta} - \theta_0\right) = \left(-H_N\left(\bar{\theta}\right)\right)^{-1}\left(\sqrt{N}\, g_N(\theta_0)\right)$$

By dividing $H_N\left(\bar{\theta}\right)$ and $g_N(\theta_0)$ by N, we obtain:

$$\sqrt{N}\left(\hat{\theta} - \theta_0\right) = \left(-\frac{1}{N} H_N\left(\bar{\theta}\right)\right)^{-1}\left(\sqrt{N}\, \frac{1}{N}\, g_N(\theta_0)\right) = \left(-\frac{1}{N} H_N\left(\bar{\theta}\right)\right)^{-1}\left(\sqrt{N}\, \bar{g}(\theta_0)\right)$$

where $\bar{g}(\theta_0)$ denotes the sample mean of the individual gradient vectors:

$$\bar{g}(\theta_0) = \frac{1}{N}\, g_N(\theta_0) = \frac{1}{N}\sum_{i=1}^N g_i(\theta_0; y_i \mid x_i)$$


Proof (MLE convergence, cont'd)

Let us now consider the same expression in terms of random variables: $\hat{\theta}$ now denotes the ML estimator, $H_N\left(\bar{\theta}\right) = H_N\left(\bar{\theta}; Y \mid x\right)$, and $\bar{s}(\theta_0; Y \mid x)$ the mean score vector. We have:

$$\sqrt{N}\left(\hat{\theta} - \theta_0\right) = \left(-\frac{1}{N} H_N\left(\bar{\theta}; Y \mid x\right)\right)^{-1}\left(\sqrt{N}\, \bar{s}(\theta_0; Y \mid x)\right)$$

where the score vectors associated to the variables $Y_i$ are i.i.d.:

$$\bar{s}(\theta_0; Y \mid x) = \frac{1}{N}\sum_{i=1}^N s_i(\theta_0; Y_i \mid x_i)$$


Proof (MLE convergence, cont'd)

Let us consider the first element:

$$\bar{s}(\theta_0) = \frac{1}{N}\sum_{i=1}^N s_i(\theta_0; Y_i \mid x_i)$$

The individual scores $s_i(\theta_0; Y_i \mid x_i)$ are i.i.d. with:

$$E_\theta\left(s_i(\theta_0; Y_i \mid x_i)\right) = 0 \qquad E_X V_\theta\left(s_i(\theta_0; Y_i \mid x_i)\right) = E_X\left(I_i(\theta_0)\right) = I(\theta_0)$$

By using the Lindeberg-Levy central limit theorem, we have:

$$\sqrt{N}\, \bar{s}(\theta_0) \xrightarrow{d} \mathcal{N}\left(0, I(\theta_0)\right)$$


Proof (MLE convergence, cont'd)

We know that:

$$-\frac{1}{N} H_N\left(\bar{\theta}; Y \mid x\right) = -\frac{1}{N}\sum_{i=1}^N H_i\left(\bar{\theta}; Y_i \mid x_i\right)$$

where the Hessian matrices $H_i\left(\bar{\theta}; Y_i \mid x_i\right)$ are i.i.d. Besides, because $\operatorname{plim}\left(\hat{\theta} - \theta_0\right) = 0$, $\operatorname{plim}\left(\bar{\theta} - \theta_0\right) = 0$ as well. By applying a law of large numbers, we get:

$$-\frac{1}{N} H_N\left(\bar{\theta}; Y \mid x\right) \xrightarrow{p} E_X E_\theta\left(-H_i(\theta_0; Y_i \mid x_i)\right)$$

with:

$$E_X E_\theta\left(-H_i(\theta_0; Y_i \mid x_i)\right) = E_X E_\theta\left(-\frac{\partial^2 \ell_i(\theta_0; Y_i \mid x_i)}{\partial \theta\, \partial \theta^\top}\right) = I(\theta_0)$$
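This law-of-large-numbers step is easy to check numerically. The sketch below (illustrative values only, not from the slides) averages the individual Hessians of the exponential model at $\theta_0$ and compares the result to $I(\theta_0) = 1/\theta_0^2$:

```python
import numpy as np

# Sketch: for the exponential model, l_i(theta) = -log(theta) - D_i/theta,
# so H_i = 1/theta^2 - 2*D_i/theta^3. Check that -(1/N)*H_N -> I(theta_0).
rng = np.random.default_rng(1)
theta0, N = 2.0, 200_000                   # assumed true value, large N
D = rng.exponential(scale=theta0, size=N)

H_i = 1.0 / theta0**2 - 2.0 * D / theta0**3
print(-H_i.mean())       # close to I(theta0) = 1/theta0**2 = 0.25
print(1.0 / theta0**2)
```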


Reminder:

If $X_N$ (a $K \times K$ matrix) and $Y_N$ (a $K \times 1$ vector) verify:

$$X_N \xrightarrow{p} X \qquad Y_N \xrightarrow{d} \mathcal{N}\left(0, \Sigma\right)$$

then:

$$X_N Y_N \xrightarrow{d} \mathcal{N}\left(0,\; X\, \Sigma\, X^\top\right)$$


Proof (MLE convergence, cont'd)

Here we have:

$$\sqrt{N}\left(\hat{\theta} - \theta_0\right) = \left(-\frac{1}{N} H_N\left(\bar{\theta}; Y \mid x\right)\right)^{-1}\left(\sqrt{N}\, \bar{s}(\theta_0; Y \mid x)\right)$$

with:

$$\left(-\frac{1}{N} H_N\left(\bar{\theta}; Y \mid x\right)\right)^{-1} \xrightarrow{p} I^{-1}(\theta_0) \;\; \text{(a symmetric matrix)} \qquad \sqrt{N}\, \bar{s}(\theta_0) \xrightarrow{d} \mathcal{N}\left(0, I(\theta_0)\right)$$

Then, we get:

$$\sqrt{N}\left(\hat{\theta} - \theta_0\right) \xrightarrow{d} \mathcal{N}\left(0,\; I^{-1}(\theta_0)\, I(\theta_0)\, I^{-1}(\theta_0)\right)$$


Proof (MLE convergence, cont'd)

And finally:

$$\sqrt{N}\left(\hat{\theta} - \theta_0\right) \xrightarrow{d} \mathcal{N}\left(0, I^{-1}(\theta_0)\right)$$

The magic of the MLE...


Example (Exponential Distribution)
Suppose that $D_1, D_2, \ldots, D_N$ are i.i.d. positive random variables with $D_i \sim \operatorname{Exp}(\theta_0)$, with:

$$f_D(d; \theta) = \frac{1}{\theta}\exp\left(-\frac{d}{\theta}\right), \quad \forall d \in \mathbb{R}^+$$

$$E_\theta(D_i) = \theta_0 \qquad V_\theta(D_i) = \theta_0^2$$

where $\theta_0$ is the true value of $\theta$. Question: what is the asymptotic distribution of the MLE? Propose a consistent estimator of the asymptotic variance of $\hat{\theta}$.


Solution

We have shown that $\hat{\theta} = (1/N)\sum_{i=1}^N D_i$ and:

$$s_i(\theta; D_i) = \frac{\partial \ell_i(\theta; D_i)}{\partial \theta} = -\frac{1}{\theta} + \frac{D_i}{\theta^2}$$

The (average) Fisher information matrix associated to $D_i$ is:

$$I(\theta) = V_\theta\left(-\frac{1}{\theta} + \frac{D_i}{\theta^2}\right) = \frac{1}{\theta^4}\, V_\theta(D_i) = \frac{1}{\theta^2}$$

Then, the asymptotic distribution of $\hat{\theta}$ is:

$$\sqrt{N}\left(\hat{\theta} - \theta_0\right) \xrightarrow{d} \mathcal{N}\left(0, \theta_0^2\right)$$

or equivalently:

$$\hat{\theta} \overset{asy}{\sim} \mathcal{N}\left(\theta_0, \frac{\theta_0^2}{N}\right)$$
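This asymptotic distribution can be checked by Monte Carlo simulation. A minimal sketch (illustrative parameter values and seed, not part of the original slides):

```python
import numpy as np

# Sketch: draw R exponential samples, compute the MLE for each, and compare
# the moments of sqrt(N)*(theta_hat - theta_0) to the asymptotic N(0, theta_0^2).
rng = np.random.default_rng(2)
theta0, N, R = 2.0, 500, 10_000

theta_hat = rng.exponential(scale=theta0, size=(R, N)).mean(axis=1)  # MLE = sample mean
z = np.sqrt(N) * (theta_hat - theta0)

print(z.mean(), z.var())   # close to 0 and theta0**2 = 4.0
```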


Solution, cont'd

The asymptotic variance of $\hat{\theta}$ is:

$$V_{asy}\left(\hat{\theta}\right) = \frac{\theta_0^2}{N}$$

A consistent estimator of $V_{asy}\left(\hat{\theta}\right)$ is simply defined by:

$$\widehat{V}_{asy}\left(\hat{\theta}\right) = \frac{\hat{\theta}^2}{N} \quad \blacksquare$$
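In practice, this estimator delivers approximate standard errors and confidence intervals for $\hat{\theta}$. A minimal sketch (illustrative values; the 1.96 critical value gives an approximate 95% interval):

```python
import numpy as np

# Sketch: the MLE, its estimated asymptotic variance theta_hat^2 / N,
# and an approximate 95% confidence interval.
rng = np.random.default_rng(3)
theta0, N = 2.0, 500
D = rng.exponential(scale=theta0, size=N)

theta_hat = D.mean()                    # MLE of theta
se = np.sqrt(theta_hat**2 / N)          # estimated asymptotic standard error
ci = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
print(theta_hat, se, ci)                # the interval should cover theta0 ~95% of the time
```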


Example (Linear Regression Model)

Let us consider the previous linear regression model $y_i = x_i^\top \beta + \varepsilon_i$, with $\varepsilon_i \sim \mathcal{N}.i.d.\left(0, \sigma^2\right)$. Let us denote by $\theta$ the $(K+1) \times 1$ vector defined by $\theta = \left(\beta^\top \;\; \sigma^2\right)^\top$. The ML estimator of $\theta$ is defined by:

$$\hat{\theta} = \begin{pmatrix} \hat{\beta} \\ \hat{\sigma}^2 \end{pmatrix} \qquad \hat{\beta} = \left(\sum_{i=1}^N X_i X_i^\top\right)^{-1}\left(\sum_{i=1}^N X_i Y_i\right) \qquad \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^N\left(Y_i - X_i^\top \hat{\beta}\right)^2$$

Question: what is the asymptotic distribution of $\hat{\theta}$? Propose an estimator of the asymptotic variance.


Solution

This model satisfies the regularity conditions. We have shown that the average Fisher information matrix is equal to:

$$I(\theta) = \begin{pmatrix} \dfrac{1}{\sigma^2}\, E_X\left(X_i X_i^\top\right) & 0 \\ 0 & \dfrac{1}{2\sigma^4} \end{pmatrix}$$

From the MLE convergence theorem, we get immediately:

$$\sqrt{N}\left(\hat{\theta} - \theta_0\right) \xrightarrow{d} \mathcal{N}\left(0, I^{-1}(\theta_0)\right)$$

where $\theta_0$ is the true value of $\theta$.


Solution, cont'd

The asymptotic variance-covariance matrix of $\hat{\theta}$ is equal to:

$$V_{asy}\left(\hat{\theta}\right) = N^{-1} I^{-1}(\theta_0) = I_N^{-1}(\theta_0)$$

with:

$$I_N(\theta) = \begin{pmatrix} \dfrac{N}{\sigma^2}\, E_X\left(X_i X_i^\top\right) & 0 \\ 0 & \dfrac{N}{2\sigma^4} \end{pmatrix}$$


Solution, cont'd

A consistent estimator of $I_N(\theta)$ is:

$$\widehat{I}_N(\theta) = \widehat{V}_{asy}^{-1}\left(\hat{\theta}\right) = \begin{pmatrix} \dfrac{N}{\hat{\sigma}^2}\, \widehat{Q}_X & 0 \\ 0 & \dfrac{N}{2\hat{\sigma}^4} \end{pmatrix} \qquad \text{with } \widehat{Q}_X = \frac{1}{N}\sum_{i=1}^N x_i x_i^\top$$


Solution, cont'd

Thus we get:

$$\hat{\beta} \overset{asy}{\sim} \mathcal{N}\left(\beta_0,\; \hat{\sigma}^2\left(\sum_{i=1}^N x_i x_i^\top\right)^{-1}\right) \qquad \hat{\sigma}^2 \overset{asy}{\sim} \mathcal{N}\left(\sigma_0^2,\; \frac{2\hat{\sigma}^4}{N}\right)$$
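These formulas are straightforward to apply on data. The sketch below (simulated data with assumed design, parameter values, and seed; not from the original slides) computes the MLE and the estimated asymptotic standard errors:

```python
import numpy as np

# Sketch: MLE of (beta, sigma^2) in the Gaussian linear model and the
# estimated asymptotic standard errors derived above.
rng = np.random.default_rng(4)
N, K = 1_000, 3
beta0, sigma2_0 = np.array([1.0, -0.5, 2.0]), 0.5

X = rng.normal(size=(N, K))
y = X @ beta0 + rng.normal(scale=np.sqrt(sigma2_0), size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # OLS = MLE of beta
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / N                  # MLE of sigma^2 (divisor N, not N-K)

se_beta = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))
se_sigma2 = np.sqrt(2.0 * sigma2_hat**2 / N)
print(beta_hat, se_beta, sigma2_hat, se_sigma2)
```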


Summary

Under regularity conditions:

1. The MLE is consistent.
2. The MLE is asymptotically efficient and its variance attains the FDCR or Cramer-Rao bound.
3. The MLE is asymptotically normally distributed.


But finite sample properties can be very different from large sample properties:

1. The maximum likelihood estimator is consistent but can be severely biased in finite samples.
2. The estimated variance-covariance matrix can be seriously unreliable in finite samples.


Theorem (Equivariance)

Under regularity conditions, if $g(\cdot)$ is a continuously differentiable function of $\theta$ defined from $\mathbb{R}^K$ to $\mathbb{R}^P$, then:

$$g\left(\hat{\theta}\right) \xrightarrow{p} g(\theta_0) \qquad \sqrt{N}\left(g\left(\hat{\theta}\right) - g(\theta_0)\right) \xrightarrow{d} \mathcal{N}\left(0,\; G(\theta_0)\, I^{-1}(\theta_0)\, G(\theta_0)^\top\right)$$

where $\theta_0$ is the true value of the parameters and the $P \times K$ matrix $G(\theta_0)$ is defined by:

$$G(\theta) = \frac{\partial g(\theta)}{\partial \theta^\top}$$
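As an illustration of this delta-method result, consider the exponential example with $g(\theta) = 1/\theta$ (the rate), so that $G(\theta) = -1/\theta^2$ and the asymptotic variance of $g(\hat{\theta})$ is $G(\theta_0)\, I^{-1}(\theta_0)\, G(\theta_0)^\top / N = 1/(\theta_0^2 N)$. A sketch with assumed values (not from the original slides):

```python
import numpy as np

# Sketch: Monte Carlo check of the delta method for g(theta) = 1/theta
# in the exponential model.
rng = np.random.default_rng(5)
theta0, N, R = 2.0, 500, 10_000

theta_hat = rng.exponential(scale=theta0, size=(R, N)).mean(axis=1)  # MLE per sample
g_hat = 1.0 / theta_hat

print(g_hat.var())               # close to G * I^{-1} * G / N = 1 / (theta0**2 * N)
print(1.0 / (theta0**2 * N))     # = 5e-4 with these values
```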


Key Concepts of Chapter 2

1. Likelihood and log-likelihood function
2. Maximum likelihood estimator (MLE) and maximum likelihood estimate
3. Gradient and Hessian matrix (deterministic elements)
4. Score vector and Hessian matrix (random elements)
5. Fisher information matrix associated to the sample
6. (Average) Fisher information matrix for one observation
7. FDCR or Cramer-Rao bound: the notion of efficiency
8. Asymptotic distribution of the MLE
9. Asymptotic variance of the MLE
10. Estimator of the asymptotic variance of the MLE


End of Chapter 2

Christophe Hurlin (University of Orléans)
