Umeå School of Business and Economics
Spring semester 2015
Master thesis, one-year, 15 ECTS
Misspecification and inference
A review and case studies
Author: Gabriel Wallin
Supervisor: Ingeborg Waernbaum
Abstract
When conducting inference we usually have to make some assumptions. A common assumption is that the parametric model which describes the behavior of the investigated random phenomenon is correctly specified. If it is not, some inferential methods, e.g. the method of maximum likelihood, do not provide valid inference. This thesis investigates and presents some of the results regarding misspecified parametric models to illustrate the consequences of misspecification and how the parameter estimates are affected. The main question investigated is whether we can still learn something about the true parameter even though the model is misspecified. An important result is that the quasi-maximum likelihood estimate of a misspecified estimation model converges almost surely towards the parameter minimizing the distance between the true model and the estimation model. Using simulations, it is illustrated how this estimator in certain situations converges almost surely towards the true parameter times a scalar. This result also seems to hold for a situation not covered by any theorem. Furthermore, a general class of estimators called M-estimators is presented to extend the theoretical framework of misspecified models, and an example is given where the theory of M-estimators comes into use under model misspecification.
Sammanfattning (Swedish abstract)
Title: Felspecificering och inferens – en genomgång och fallstudier (Misspecification and inference – a review and case studies).
When drawing conclusions from data, one generally has to make certain assumptions. A common assumption is that the parametric model describing the behavior of the investigated random phenomenon is correctly specified. If this is not the case, some inferential methods, such as the maximum likelihood method, cannot be used. This thesis investigates and presents some of the results for misspecified parametric models to illustrate the consequences of misspecification and how the parameter estimates are affected. A main question investigated is whether it is still possible to learn something about the true parameter even when the model is misspecified. An important result presented is that the so-called quasi-maximum likelihood estimate from a misspecified model converges almost surely towards the parameter minimizing the distance between the true model and the estimation model. It is shown how this parameter in certain situations converges towards the true parameter multiplied by a scalar. This result is also illustrated for a situation not covered by any theorem. In addition, a general class of estimators called M-estimators is presented. These are used to extend the theory of misspecified models, and an example is presented where the theory of M-estimators is put to use.
Populärvetenskaplig sammanfattning (popular science summary)
When one wants to assess how much different variables are associated with, or explain, a specific variable, regression is a common approach. A regression model quantifies the relationship between the so-called explanatory variables and the dependent variable, and a problem that is almost always present is that the true data generating process describing this relationship is generally unknown. It is therefore up to the researcher to try to specify an estimation model believed to be as close to the truth as possible, so that the effects of the explanatory variables on the dependent variable can be well estimated. If the estimation model is not close to the true model, the effect estimates will not be close to the true effects either. The purpose of this thesis is therefore to examine whether there is any information about the true effects that can be extracted from a misspecified model. It is also shown how the commonly used estimation method maximum likelihood, in the case of a misspecified model, is the estimation method that approaches the estimate that comes closest to the true effects. In addition, special cases are presented where, despite a misspecified model, some information can be obtained about the effect of each explanatory variable on the dependent variable. The results are also generalized to cover more estimation methods than just maximum likelihood, and finally an example is given of a type of estimation method often used when one wants to estimate the effect of some form of treatment on a given outcome.
Contents
1 Introduction
  1.1 Purpose of the thesis
  1.2 Outline of the thesis
2 Theory part 1
  2.1 Maximum likelihood
  2.2 Kullback-Leibler Information Criterion
  2.3 Quasi Likelihood
3 Simulation study 1
  3.1 Design A
  3.2 Results A
  3.3 Design B
  3.4 Results B
4 Theory part 2
  4.1 M-estimators
5 Simulation study 2
  5.1 Design
  5.2 Results
6 Final recommendations
7 Discussion
8 Acknowledgements
References
A Proof of identifiability of the IPW estimator
1 Introduction
In statistical modeling we are interested in the underlying structure that generates the data. In that sense, we assume that there actually exists a true model that fully describes the data generating process (DGP). It then follows that the parametric model¹ used to make inferences will be a more or less adequate description of the DGP. In statistical research, model specification, model diagnostics and optimal model choice have been widely investigated, see for instance [4, 1] and [25]. This thesis instead takes the approach that we have ended up with an estimation model that is not an adequate description of the DGP. We will call this type of estimation model misspecified. A natural question then arises: What happens to the parameter estimates when the parametric model is misspecified? Is there still any information regarding the true parameters that can be found when making inference based on a misspecified model? As pointed out in King and Roberts [14]:

"Models are sometimes useful but almost never exactly correct, and so working out the theoretical implications for when our estimators still apply is fundamental to the massive multidisciplinary project of statistical model building."

Several papers that have investigated misspecified models have proposed robust alternatives for estimation, e.g. robust standard errors. Among others, Huber [12, 13] gave robust alternatives to the least squares estimate for regression models, and both White [32] and Eicker [5] investigated a covariance matrix estimator that is robust against heteroskedasticity. To have a broader understanding of statistical modeling it is also reasonable to add knowledge about the properties of misspecified models, since this situation is very likely to occur as soon as we wish to model a random phenomenon. This thesis will exemplify and illustrate some of the results regarding inference for misspecified models and show situations where there is still information to be gained about the true parameter. Our starting point is that we are facing two (possibly) different models: a true, unknown model and an estimation model. This means that the theory presented here is a complement to the literature on optimal model choice and model specification. This could in fact be seen as one joint theory, where model specification and optimal model choice, model diagnostics and the results for misspecified models are all parts of the theory of statistical modeling.
1.1 Purpose of the thesis
The purpose of this thesis is to review some of the theory for misspecified parametric models and to state and illustrate some of the existing results for misspecified parametric models when conducting statistical inference about an unknown parameter. The results will be exemplified and illustrated using simulation. Two main questions that will be given extra attention are:

• How will the parameter estimates be affected when the estimation model is misspecified?

• Given a misspecified model, are there any special cases where there is still some information that can be found about the true model?

¹ This thesis is only concerned with inference for parametric models, even though there of course exist other, "model-free", modes of inference.
1.2 Outline of the thesis
The thesis is organized as follows:

• Section 2 starts with an overview of a common method of statistical inference, the maximum likelihood method. It then gives a definition of the Kullback-Leibler information criterion and describes its role in the theory of model misspecification. Then another likelihood-type estimator, the quasi-maximum likelihood estimator, is proposed with motivation from the definition of the Kullback-Leibler information criterion. Some results for the estimator are given together with a simple illustration of how the Kullback-Leibler information criterion and the quasi-maximum likelihood estimator are connected.

• Section 3 uses simulations to illustrate some of the results for the QMLE.

• Section 4 introduces a broader class of so-called M-estimators together with a discussion of the contribution of M-estimation theory to the asymptotic theory of misspecified models.

• Section 5 illustrates some of the results for M-estimators using simulation.

• Section 6 gives some final recommendations, and Section 7 gives a summary and a discussion of the results presented in the thesis, together with suggestions for further research.
2 Theory part 1
2.1 Maximum likelihood
A parametric model or parametric family of distributions is a collection of probability distributions that can be described by a finite number of parameters. It is intended to describe the probability mass function (p.m.f.) or probability density function (p.d.f.) of a random variable. Consider a random variable whose functional form of the p.m.f. or p.d.f. is known but where the distribution depends on an unknown parameter² θ that takes values in a parameter space Θ. If we for example know that the random variable X is described by p_X(x; θ) = e^{−θ}θ^x/x!, x = 0, 1, 2, ..., and that θ ∈ Θ = {θ : 0 < θ < ∞}, it still might be the case that we need to specify the most probable p.m.f. of X. This means that we are interested in a specific member of the family of distributions contained in the parametric model {p_X(x; θ), θ ∈ Θ}, and thus we must estimate the parameter θ. One common estimator of θ in this type of setting is the maximum likelihood estimator (MLE). Let X_1, X_2, ..., X_n be a random sample of size n of independent and identically distributed (i.i.d.) random variables with realizations denoted by x_1, ..., x_n. If we denote the density of X as g_X(x; θ), regarded as a function of the unknown parameter θ, the likelihood function is defined as L(θ; x_1, ..., x_n) = ∏_{i=1}^n g_X(x_i; θ). To get the MLE of θ, the likelihood function, or more commonly the natural logarithm of the likelihood function, is maximized.
Given suitable regularity conditions³, the method of maximum likelihood gives estimates that have several appealing properties such as efficiency [7], consistency [30, 3] and asymptotic normality [3]. The last property means that

\[ \sqrt{n}(\hat\theta_{MLE} - \theta) \xrightarrow{d} \mathcal{N}_p\left(0, I(\theta)^{-1}\right), \quad \text{as } n \to \infty, \]

where \(\mathcal{N}_p\) is a p-variate normal distribution and \(I(\theta)\) is the Fisher information matrix given by

\[ I(\theta) = E\left[ \left( \frac{\partial}{\partial \theta^T} \log g_X(x; \theta) \right) \left( \frac{\partial}{\partial \theta} \log g_X(x; \theta) \right) \right], \]

which we can rewrite with the use of the score \(s(X; \theta) = \frac{\partial}{\partial \theta} \log g_X(x; \theta)\), so that

\[ I(\theta) = E\left[ s(X; \theta)^T s(X; \theta) \right] = -E\left( \frac{\partial^2 \log g_X(x; \theta)}{\partial \theta^2} \right). \]
The Bernoulli distribution can be used as an example. Let \(X_i \sim \text{Bernoulli}(p)\), i = 1, ..., n, so that the p.m.f. is given by \(p_X(x; p) = p^x(1-p)^{1-x}\) and \(\log p_X(x; p) = x \log p + (1-x)\log(1-p)\). The score becomes

\[ s(X; p) = \frac{X}{p} - \frac{1-X}{1-p}, \]

and

\[ -s'(X; p) = \frac{X}{p^2} + \frac{1-X}{(1-p)^2}. \]

So

\[ I(p) = -E\left( s'(X; p) \right) = \frac{1}{p(1-p)}, \]

and thus \(Var(\hat p) = I(p)^{-1} = p(1-p)\). □

² The parameter of interest could of course be vector valued. Throughout the thesis we will make no difference in notation or in use of the term between a parameter that is vector valued and a parameter that is not.
³ These regularity conditions are for the most part smoothness conditions on g(x; θ) [31] and will throughout the thesis be considered fulfilled.
Since the variance of the MLE reaches \(I(\theta)^{-1}\) asymptotically, and since \(Var(\hat\theta) \geq I(\theta)^{-1}\) holds in general for unbiased estimators, there does not exist an unbiased estimator with lower asymptotic variance. We call this property of the MLE efficiency.
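The Bernoulli example above can be checked numerically. The following is a small sketch (not part of the thesis; the sample size and the true p are arbitrary choices) showing the closed-form MLE and the Fisher information quantities:

```python
import random

random.seed(0)
n, p_true = 10_000, 0.3
x = [1 if random.random() < p_true else 0 for _ in range(n)]

# The Bernoulli log-likelihood l(p) = sum(x) log p + (n - sum(x)) log(1 - p)
# is maximized at the sample mean, so the MLE has a closed form.
p_hat = sum(x) / n

# Fisher information per observation: I(p) = 1 / (p (1 - p)); the
# asymptotic variance of sqrt(n) (p_hat - p) is therefore p (1 - p).
fisher_info = 1.0 / (p_hat * (1.0 - p_hat))
asy_var = p_hat * (1.0 - p_hat)
print(p_hat, fisher_info, asy_var)
```

With 10,000 draws, p_hat lands within a few standard errors of 0.3, and by construction fisher_info is the reciprocal of asy_var, mirroring the identity derived above.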
So far we have assumed that the functional form of the p.m.f. or p.d.f. is known. However, there are situations when this is not a reasonable assumption. When the true model is unknown we have to use an estimation model, and we thereby run the risk of misspecifying the model. A natural question then becomes what happens to the MLE under model misspecification. To investigate this, we start by defining a measure of the discrepancy of the estimation model from the true model, called the Kullback-Leibler information criterion (KLIC).
2.2 Kullback-Leibler Information Criterion
Before defining the KLIC, we first briefly discuss what is meant by information. It is closely related to what is sometimes called the "surprise" of an event and the information theory formalized by Shannon [26]. Say that you flip an unfair coin with probability 0.2 of receiving heads and probability 0.8 of receiving tails. The message that you will receive heads thus gives you a lot of information. A message that you will receive tails does not give you that much information; with a probability of 0.8, tails is almost what you expect. With this reasoning we can say that if some event is very unlikely to occur, the message that it will occur gives us a lot of information, and vice versa. We could use \(I_p = \log(1/p)\), where p denotes the probability of an event, as an information function; the information decreases with increasing p and vice versa, just as in our example regarding the coin flip. If the probability of the event changes from p to q we could measure the information value of the change by \(I_p - I_q\), where \(I_q = \log(1/q)\), so that we get \(I_p - I_q = \log(q/p)\). Expressed as an expected information value we have

\[ E(I_p - I_q) = \sum_{i=1}^{n} q_i \log\left( \frac{q_i}{p_i} \right). \tag{2.1} \]
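As a quick numerical check of the coin-flip reasoning above (a sketch; the probabilities 0.2 and 0.8 are taken from the example):

```python
import math

# "Surprise" of an event with probability p: I_p = log(1/p).
def info(p):
    return math.log(1.0 / p)

heads, tails = 0.2, 0.8
# The unlikely message (heads) carries more information than the
# expected one (tails), matching the discussion above.
print(info(heads), info(tails))
```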
Kullback and Leibler [15] used Shannon's ideas of information to define a divergence measure, or information criterion, that measures the difference between two probability distributions g and f.

Definition 1. The KLIC of f from g is defined as

\[ D(g : f) = E_g\left[ \log\left( \frac{g(x)}{f(x; \theta)} \right) \right]. \tag{2.2} \]
Note that the expectation is taken with respect to g and that \(D(g : f(x; \theta)) \geq 0\). It can further be shown that \(D(g : f(x; \theta)) = 0\) if and only if f = g. The KLIC can be seen as the information that is lost when using f, the probability distribution that we have assumed, to approximate g, the true and unknown probability distribution that generates the data. Renyi [22] showed that the KLIC, as in the opening example of this section, can be thought of in terms of information, i.e. the KLIC can be seen as the information gained when carrying out an experiment on a random phenomenon. White [34] describes this as the information gained when the experimenter learns that the observed phenomenon is described by g and not f, which was the initial belief.
For the continuous case we can write the KLIC as

\[ D(g : f(x; \theta)) = \int g(x) \log\left[ \frac{g(x)}{f(x; \theta)} \right] dx \tag{2.3} \]
\[ = \int g(x) \log(g(x)) \, dx - \int g(x) \log(f(x; \theta)) \, dx, \tag{2.4} \]

where the similarity with Equation 2.1 is apparent. The KLIC is not a metric since \(D(f : g) \neq D(g : f)\), i.e. the distance from f to g is not the same as the distance from g to f, meaning that the KLIC cannot be used as a goodness-of-fit measure in the usual sense. A simple example of how the KLIC is calculated is given in Example 1.
Example 1
Assume that the true model g(x) that generates the data is a standard normal distribution, N(0, 1), and that we misspecify the model using f(x), the density function of a normal distribution with mean 2 and variance 1, N(2, 1). This situation is illustrated in Figure 2.1.
Figure 2.1: The densities for Example 1: the correct model (mean 0, sd 1) and the wrong model (mean 2, sd 1).
One way of quantifying the distance between the models in Figure 2.1 is to calculate the
KLIC, which for this case is given by
\[
\begin{aligned}
D(g : f(x; \theta)) &= \int_{-\infty}^{\infty} g(x) \log\left( \frac{g(x)}{f(x)} \right) dx \\
&= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) \log \frac{ \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) }{ \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(x-2)^2}{2} \right) } \, dx \\
&= \int_{-\infty}^{\infty} g(x) \log\left( e^{\frac{1}{2}(x-2)^2 - \frac{x^2}{2}} \right) dx \\
&= \int_{-\infty}^{\infty} g(x)(2 - 2x) \, dx \\
&= 2 \int_{-\infty}^{\infty} g(x) \, dx - 2 \int_{-\infty}^{\infty} x g(x) \, dx \\
&= 2 \times 1 - 2 \times 0 \\
&= 2.
\end{aligned}
\]
As can be seen in Example 1, the calculation of the KLIC requires that g is known, which is not the case in this thesis. There have been suggestions of how to estimate the KLIC, see for instance [28, 19].
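Although g is unknown in practice, Example 1 specifies both densities, so the integral can be checked numerically. The sketch below approximates D(g : f) with a Riemann sum (the truncation at ±10 and the step size are arbitrary choices):

```python
import math

def g(x):  # true density, N(0, 1)
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def f(x):  # misspecified density, N(2, 1)
    return math.exp(-(x - 2.0) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

# D(g : f) = integral of g(x) log(g(x)/f(x)) dx, approximated on [-10, 10];
# the closed-form calculation in Example 1 gives exactly 2.
h = 0.001
klic = sum(g(-10.0 + i * h) * math.log(g(-10.0 + i * h) / f(-10.0 + i * h)) * h
           for i in range(20_000))
print(klic)
```

The sum lands very close to the exact value 2, confirming the hand calculation.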
Viewing the KLIC as a goodness-of-fit measure, we can think of a situation where we would like to compare two estimation models f₁ and f₂ against the true model g to evaluate which one gets closest. We could write the mean KLIC difference as

\[ I = D(g : f_1) - D(g : f_2) = \int g \log\left( \frac{g}{f_1} \right) dx - \int g \log\left( \frac{g}{f_2} \right) dx = \int g \log\left( \frac{f_2}{f_1} \right) dx, \]

where the right hand side of the last equality can be estimated using data, even though g is unknown. We then have three potential scenarios:

1. I = 0, meaning that f₁ and f₂ approximate g equally well
2. I > 0, meaning that f₂ is a better approximation of g than f₁
3. I < 0, meaning that f₁ is a better approximation of g than f₂.

From this we can choose the best model, i.e. the model that minimizes the distance to the true model.
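The point that the last integral can be estimated from data without ever evaluating g can be sketched as follows (the three models N(0, 1), N(2, 1) and N(0.5, 1) are illustrative choices, with f₂ constructed to be the closer one):

```python
import math, random

random.seed(1)

def log_norm_pdf(x, mu):
    # log density of N(mu, 1)
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi)

# Draws from the true model g = N(0, 1).
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# I = D(g : f1) - D(g : f2) = E_g[log(f2 / f1)] is estimated by a sample
# mean over draws from g; the density g itself is never evaluated.
I_hat = sum(log_norm_pdf(x, 0.5) - log_norm_pdf(x, 2.0) for x in xs) / len(xs)
# Closed form for these models: D(g : N(m, 1)) = m^2 / 2, so
# I = 2 - 0.125 = 1.875 > 0, i.e. f2 = N(0.5, 1) is the better model.
print(I_hat)
```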
2.3 Quasi Likelihood
In contrast to the situation described in Subsection 2.1, there are situations where we do not have any prior knowledge of the true functional form when we want to model a random phenomenon. When this is the case, we have to start by specifying the functional form of an estimation model and then estimate the parameter from it. If the parametric model that we specify includes the true model, the problem of inference reduces to estimating the parameter θ, which we can do consistently with the MLE.

We are interested in the parameter θ in f that needs to be estimated from our observations x, realizations of a random variable X with unknown density function g. Our objective should intuitively be to minimize (2.2) by an appropriate choice of θ. Indeed, Akaike [1] argued that a natural estimator of θ would be the parameter that minimizes the KLIC, i.e. the parameter minimizing the distance between the true and the false density. We define this parameter as
\[ \theta^* = \operatorname*{arg\,min}_{\theta \in \Theta} E\left[ \log\left( \frac{g(x)}{f(x; \theta)} \right) \right]. \tag{2.5} \]
By comparing equations 2.3 and 2.4, and since \(D(g : f) \geq 0\), we see that choosing θ to minimize (2.3) is the same as choosing θ to maximize

\[ \tilde{L}(\theta) = \int \log f(x; \theta) \, g(x) \, dx = E(\log f(X; \theta)). \]

By the law of large numbers, \(\tilde{L}(\theta)\) can be approximated by the sample average of \(\log f(X_i; \theta)\), so our minimization problem for (2.3) reduces to

\[ \max_{\theta \in \Theta} L_n(X; \theta) \equiv n^{-1} \sum_{i=1}^{n} \log f(X_i; \theta), \tag{2.6} \]

where we call the solution to (2.6) the quasi-maximum likelihood estimator (QMLE). White [33] has shown that the solution of (2.6) exists and is unique, and has furthermore given the following key result.
Theorem 2. \(\hat\theta_n \xrightarrow{a.s.} \theta^*\) as \(n \to \infty\), where \(\hat\theta_n\) is the parameter vector that solves \(\max_{\theta \in \Theta} L_n(X; \theta)\). □
So if our objective is to find a parameter estimate that minimizes the KLIC, Theorem 2 establishes that this is indeed what we are doing when we use the QMLE. White [33] calls the QMLE the estimator that "...minimizes our ignorance about the true structure", and he⁴ furthermore showed that

\[ \sqrt{n}(\hat\theta - \theta^*) \xrightarrow{d} \mathcal{N}(0, C(\theta^*)). \tag{2.7} \]
To define C(θ) for a parameter θ we first need to define the Hessian

\[ A_{jk}(\theta) = E\left( \frac{\partial^2 \log f(X_i; \theta)}{\partial \theta_j \, \partial \theta_k} \right) \]

and the outer product of the gradient,

\[ B_{jk}(\theta) = E\left( \frac{\partial \log f(X_i; \theta)}{\partial \theta_j} \cdot \frac{\partial \log f(X_i; \theta)}{\partial \theta_k} \right). \]

Now,

\[ C(\theta) = A(\theta)^{-1} B(\theta) A(\theta)^{-1}, \tag{2.8} \]

and we furthermore have that \(C(\hat\theta) \xrightarrow{a.s.} C(\theta^*)\). C(θ) is often estimated with the so-called sandwich estimator. If we specify the parametric family correctly, \(-A(\theta) = B(\theta)\), meaning that \(C(\theta) = -A(\theta)^{-1} = B(\theta)^{-1}\), where \(B(\theta)^{-1} = I(\theta)^{-1}\); thus the sandwich estimator reduces to \(I(\theta)^{-1}\), giving the efficient variance of the MLE. We will return to this estimator in Section 5.
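A one-parameter sketch of the sandwich idea (an illustration with assumed values, not the thesis's own computation): fit the location model f(x; θ) = N(θ, 1) to data whose true spread is much larger. The naive MLE formula −A(θ)⁻¹ is then badly wrong, while the sandwich C(θ) recovers the correct variance:

```python
import random

random.seed(2)
n = 50_000
# True DGP: N(0, sd = 3); estimation model (misspecified): N(theta, 1).
xs = [random.gauss(0.0, 3.0) for _ in range(n)]

# Under f(x; theta) = N(theta, 1) the QMLE of theta is the sample mean.
theta_hat = sum(xs) / n

# A = E[d^2 log f / d theta^2] = -1 and B = E[(d log f / d theta)^2]
# = E[(X - theta)^2], the latter estimated by its sample analogue.
A_hat = -1.0
B_hat = sum((x - theta_hat) ** 2 for x in xs) / n

C_hat = B_hat / (A_hat * A_hat)  # sandwich: close to the true Var(X) = 9
naive = -1.0 / A_hat             # MLE formula: 1, far too small here
print(C_hat, naive)
```

Here −A = 1 but B ≈ 9, so the "bread" and "meat" of the sandwich disagree, which is exactly the signature of misspecification noted above.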
To illustrate the connection between the KLIC, the MLE and the QMLE we revisit Example 1. A simple illustration of the KLIC is given in Figure 2.2, where it is shown, for a given variance, how the KLIC reaches its minimum value as the estimation model, N(µ, 1), gets closer to the true model N(0, 1), and how the KLIC grows both to the left and to the right of the minimum value.

Figure 2.2: The Kullback-Leibler information criterion plotted for different values of µ for the misspecified model.

⁴ Huber was the first to prove asymptotic normality for maximum likelihood estimators when the estimation model is not necessarily the underlying model that generates the data. White proves this under the same assumptions as those for which he gives his consistency proof. These assumptions are not as general as Huber's, but they are general enough to be used in many situations.
In a sense, Figure 2.2 illustrates the difference between the MLE and the QMLE. The MLE is based on the true model and thus the KLIC is equal to zero, asymptotically. White's result does not say that the QMLE reaches the minimum value of the KLIC, but that the KLIC will be minimized given the data and the misspecification. A natural question is what θ* can be in relation to θ, the true parameter. Can we, even if we misspecify the model, extract some information about the true parameter?
Li and Duan [16] investigate misspecified generalized linear models (GLMs) and state a proportionality result for the coefficient estimates. If Y is the outcome of interest, \(E[Y] = g^{-1}(X\theta)\), where \(g^{-1}\), the inverse of the link function g, connects the linear predictor Xθ to the outcome. Li and Duan give the following result when the link function g is misspecified.

Theorem 3. The estimated coefficients converge almost surely to the true parameter vector times a scalar factor, i.e. \(\hat\theta \xrightarrow{a.s.} \gamma\theta\), where \(\hat\theta\) is given by the QMLE. □
The result of Theorem 3 will from here on be referred to as convergence up to scale. Furthermore, for logistic regression we have convergence up to scale both when we omit a variable from the regression equation [18] and when we misspecify the distribution of the error term [24].

The result of Theorem 3 means that it is possible to get consistent estimates of the ratio of the regression coefficients, since \(\gamma\theta_l / \gamma\theta_m = \theta_l / \theta_m\), \(l \neq m\). This could for example be of interest in applied research where one is interested in the relative effect of two treatments on an outcome.
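The ratio result can be sketched numerically (an illustration with assumed values, not the thesis's own code): data are generated from a probit model with θ = (1, −0.5) and independent standard normal covariates, and then a logistic model, the wrong link, is fitted by Newton-Raphson. The individual QMLE coefficients are inflated by a factor γ, but their ratio stays close to θ₁/θ₂ = −2:

```python
import math, random

random.seed(4)
n = 20_000
theta1, theta2 = 1.0, -0.5  # true probit coefficients (assumed values)

def probit(z):  # standard normal distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

X = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(n)]
Y = [1 if random.random() < probit(theta1 * a + theta2 * b) else 0
     for a, b in X]

# QMLE: fit a logistic model (misspecified link) by Newton-Raphson.
b1 = b2 = 0.0
for _ in range(25):
    g1 = g2 = h11 = h12 = h22 = 0.0
    for (a, c), y in zip(X, Y):
        p = 1.0 / (1.0 + math.exp(-(b1 * a + b2 * c)))
        w = p * (1.0 - p)
        g1 += (y - p) * a          # gradient of the log-likelihood
        g2 += (y - p) * c
        h11 += w * a * a           # negative Hessian entries
        h12 += w * a * c
        h22 += w * c * c
    det = h11 * h22 - h12 * h12    # solve the 2x2 Newton step by hand
    b1 += (h22 * g1 - h12 * g2) / det
    b2 += (h11 * g2 - h12 * g1) / det

ratio = b1 / b2
print(b1, b2, ratio)  # b is roughly gamma * theta, so the ratio is near -2
```

Independent normal covariates satisfy the design condition of Li and Duan, which is why the proportionality shows up so cleanly here.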
Theorem 3 could be seen in light of the problem that it usually is not enough to base the choice of link function solely on the data [6]. Figure 2.3 illustrates a binary data set with 100 observations where the correct link function, the logit, is compared against the probit link and the complementary log-log link. It is apparent that especially the logit and the probit link functions are close to each other. It will be illustrated in the following section how a misspecified link function can affect the estimated parameters.
Figure 2.3: Fitted probabilities with different link functions. The solid line is the correct logit link function, (1 + exp(−X))⁻¹; the dashed line is the probit link function, φ(X), where φ is the standard normal distribution function; and the dotted line is the complementary log-log link function, 1 − exp(−exp(X)).
3 Simulation study 1
To illustrate the result of Theorem 3 and the convergence result of (2.7), a simulation study is conducted. A logistic function will be specified and estimated using two different estimation models: one that is correctly specified and one that is misspecified. This is performed for three different model misspecifications, of which one is displayed in the first design (Design A) and the two remaining in the second design (Design B). The coefficients for the correctly specified estimation model will be estimated using the MLE and the coefficients for the misspecified estimation models will be estimated using the QMLE. All calculations are performed using the software R [20].
3.1 Design A
Two normally distributed random variables are generated, X1 ∼ N (4, 2) and X2 ∼ N (5, 2).
Also, a Bernoulli distributed random variable is generated, T ∼ Bern(e(X)), where
e(X) =1
(1 + exp(−0.3X1 + 0.24X2))
The misspeci�ed model is
h(X) = φ(β0 + β1X1 + β2X2),
where φ denotes the standard normal distribution function. The scale parameter γ is estimated
by γ1 = β1
β1and γ2 = β2
β2and the estimates are expected to get closer to each other when the
sample size increases. We will use three di�erent sample sizes; n = 100, n = 500 and n = 1000,
each with 1000 replicates.
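One replicate of the Design A data generation can be sketched as follows (Python rather than the thesis's R; whether N(4, 2) denotes standard deviation 2 or variance 2 is not stated, so standard deviation is assumed here, and e(X) is taken exactly as printed above):

```python
import math, random

random.seed(6)
n = 1_000
x1 = [random.gauss(4.0, 2.0) for _ in range(n)]
x2 = [random.gauss(5.0, 2.0) for _ in range(n)]

# Propensity e(X) as printed in Design A, then T ~ Bernoulli(e(X)).
e = [1.0 / (1.0 + math.exp(-0.3 * a + 0.24 * b)) for a, b in zip(x1, x2)]
t = [1 if random.random() < p else 0 for p in e]
print(sum(t) / n)  # share of successes; about one half with these means
```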
3.2 Results A
Figure 3.1 shows QQ-plots for a sample size of n = 500 for the MLE of the two coe�cients. As
stated in Subsection 2.1, the MLE is approximately normally distributed. Since there are no
large deviations from the straight line, there is nothing indicating a violation of the normality
of the estimator. Figure 3.2 shows QQ-plots for the same sample size, but for the QMLE of the
misspeci�ed model. As there are no large deviations from the straight line, there is no sign of
violation of the normality of the estimator. These plots therefore give an empirical illustration
of the normality result of the QMLE estimator.
Figure 3.1: QQ-plots for the MLE of the two coefficients, for 500 observations and 1000 replicates.
Figure 3.2: QQ-plots of the QMLE when the link function is misspecified, for 500 observations and 1000 replicates.
Table 1 gives the true coefficient values, the MLE estimates and the QMLE estimates. Furthermore, the difference between \(\hat\gamma_2\) and \(\hat\gamma_1\) is displayed.
Table 1: Comparison between the correctly specified model and a misspecified model (Design A).

Specification                  n     β1     β2     β̂1      β̂2      γ̂2 − γ̂1
Correctly specified model     100   −0.3   0.24   −0.324   0.246
                              500   −0.3   0.24   −0.299   0.245
                             1000   −0.3   0.24   −0.300   0.240
Misspecified link function    100   −0.3   0.24   −0.196   0.149   −0.034
                              500   −0.3   0.24   −0.183   0.150    0.015
                             1000   −0.3   0.24   −0.183   0.147    0.001
We see that the MLE gives coefficient estimates that get closer to the true coefficients with increasing sample size, and that the estimates coincide with the true values for a sample size of n = 1000 (rounded to three decimals). For the QMLE we see that β1 is overestimated and β2 is underestimated for every sample size. The gamma estimates \(\hat\gamma_1\) and \(\hat\gamma_2\) get closer to each other with increasing sample size and differ only in the third decimal for a sample size of n = 1000, providing an empirical illustration of Theorem 3.
3.3 Design B
In a second simulation design, three uniformly distributed random variables are generated, \(X_1, X_2, X_3 \sim U(0, 1)\). Also, a Bernoulli distributed random variable is generated, \(T \sim \text{Bern}(e(X))\), where

\[ e(X) = \frac{1}{1 + \exp(2X_1 + X_2 - 3X_3)}. \]

Two misspecified models are used,

\[ m(X) = \frac{1}{1 + \exp(\beta_1 X_1 + \beta_2 X_2)} \]

and

\[ n(X) = \phi(\beta_0 + \beta_1 X_1 + \beta_2 X_2), \]

to investigate how the coefficient estimates are affected by excluding a covariate (m) and by choosing an incorrect link function and excluding a covariate (n). Table 2 is constructed in the same manner as Table 1, and the scale parameter is estimated in the same way as in Design A.
3.4 Results B
As in Design A, the coefficient estimates in the QQ-plots in Figures 3.3, 3.4 and 3.5 seem to follow the straight line reasonably well, giving empirical support to the distributional limit results of the MLE and QMLE, respectively.
Figure 3.3: QQ-plots for the MLE of the two coefficients, for 500 observations and 1000 replicates.
Figure 3.4: QQ-plots for the QMLE of the two coefficients, using m(X), for 500 observations and 1000 replicates.
Figure 3.5: QQ-plots for the QMLE of the two coefficients, using n(X), for 500 observations and 1000 replicates.
Table 2 gives the true coefficients, the MLE estimates and the QMLE estimates for both misspecifications. As in Table 1, the difference between the gamma estimates is displayed.
Table 2: Comparison between the correctly specified model and two different misspecified models (Design B).

Specification                       n     β1   β2   β3    β̂1      β̂2      γ̂2 − γ̂1
Correctly specified model          100    2    1   −3    2.281   1.167
                                   500    2    1   −3    2.016   1.038
                                  1000    2    1   −3    2.008   0.995
Omitted variable                   100    2    1   −3    1.884   0.965    0.023
                                   500    2    1   −3    1.717   0.875    0.017
                                  1000    2    1   −3    1.707   0.849   −0.004
Wrong link and omitted variable    100    2    1   −3    1.156   0.593    0.015
                                   500    2    1   −3    1.063   0.540    0.009
                                  1000    2    1   −3    1.058   0.525   −0.004
The MLE estimates get closer to the true coefficients with increasing sample size, and the QMLE underestimates both coefficients for both misspecifications. When using m(X), the scale estimates are close to each other for every sample size and get closer with increasing sample size, again giving an empirical illustration of Theorem 3.

The second misspecified model of Design B, n(X), deals with a situation that is not covered by any theorem, since we both omit a covariate and misspecify the link function. The parameter estimates underestimate the true coefficients, but as has been the case for the other misspecifications, the scale estimates are close for every sample size and get closer with increasing sample size. Hence it seems that we have up-to-scale convergence in this setting as well.

So far we have been concerned with the QMLE and have stated and illustrated some of its characteristics. Next, we will look not only at one type of estimator but at a whole class of estimators.
4 Theory part 2
4.1 M-estimators
A parametric model in general includes two parts, a systematic part and a random part.
For linear regression models the researcher needs to specify both the conditional mean of the
outcome variable given the explanatory variables (the systematic part) and the distribution of
the error term (the random part). For GLMs we also need to specify a link function that connects
the systematic part to the random part; a misspecification of the link function was studied in
Section 3. Both the random part and the systematic part are important when constructing the
likelihood used when conducting inference, and both parts can be misspecified [2]. As has
been pointed out in Section 2, it can be questioned how likely it is to know the complete functional
form of the parametric model but not θ, and so far we have been concerned with the question
of how the inference is affected if the model is misspecified. In addition to the results presented
so far in the thesis, several suggestions have been proposed over the years for dealing with
violations of model assumptions. Huber [10], for instance, introduced the so-called robust statistics, whose
purpose was to adjust classical inference methods so that they would not be sensitive to violations
of the model assumptions, e.g. outliers and departures from normality. His proposed estimator was a special case
of a broader class of estimators. As has been noted, several estimators are given by minimizing a
certain function. The quasi-maximum likelihood estimator, for instance, is given by maximizing
∏_{i=1}^{n} f(X_i; θ), which is equivalent to minimizing −∑_{i=1}^{n} log f(X_i; θ). Huber used this idea and
generalized it so that it covers not only one estimator but a whole class of estimators [10].
If we consider X_1, X_2, ..., X_n that are i.i.d.⁵ random variables with distribution function F, a
1 × p parameter vector θ and a known p × 1 function ψ, independent of i and n, an M-estimator⁶
then satisfies

∑_{i=1}^{n} ψ(X_i, θ) = 0.    (4.1)

⁵ They do not actually have to be identically distributed, but we will restrict ourselves to this case.
⁶ Huber also called the estimator a maximum likelihood type estimator.
We can redefine θ*, the parameter that minimizes the KLIC, using M-estimation theory, so
that θ* is the parameter solving

E_F ψ(X_1, θ*) = ∫ ψ(x, θ*) dF(x) = 0.    (4.2)
If there exists a unique solution to (4.2), then in general θ̂ →p θ* as n → ∞, where θ̂ is the
solution to (4.1) [27]. Furthermore, it can be shown that

√n (θ̂ − θ*) →d N(0, V(θ*))

as n → ∞, where V(θ*) = A(θ*)⁻¹ B(θ*) A(θ*)⁻¹, the sandwich matrix of (2.8). To estimate
V(θ*) we use the empirical sandwich estimator given by

V_n(X; θ̂) = A_n(X; θ̂)⁻¹ B_n(X; θ̂) A_n(X; θ̂)⁻¹    (4.3)
where

A_n(X; θ) = (1/n) ∑_{i=1}^{n} ( −ψ′(X_i, θ) )

and

B_n(X; θ) = (1/n) ∑_{i=1}^{n} ψ(X_i, θ) ψ(X_i, θ)^T.
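As a minimal numerical sketch (not from the thesis; all names and the choice of data are hypothetical), the empirical sandwich estimator (4.3) can be computed directly from the definitions of A_n and B_n. Here we use the scalar location problem ψ(x, θ) = x − θ, whose M-estimator is the sample mean:

```python
import numpy as np

def sandwich_variance(psi, dpsi, x, theta_hat):
    """Empirical sandwich estimator V_n = A_n^{-1} B_n A_n^{-1} for scalar theta."""
    A_n = np.mean(-dpsi(x, theta_hat))      # A_n = (1/n) * sum of -psi'
    B_n = np.mean(psi(x, theta_hat) ** 2)   # B_n = (1/n) * sum of psi^2
    return B_n / A_n ** 2

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=5000)   # skewed data; no parametric model assumed

# Location M-estimator: psi(x, theta) = x - theta, whose root is the sample mean.
theta_hat = x.mean()
V = sandwich_variance(lambda x, t: x - t,
                      lambda x, t: -np.ones_like(x),
                      x, theta_hat)

# For this psi, A_n = 1 and B_n is the sample variance, so V/n is the usual
# model-free (squared) standard error of the mean.
se = np.sqrt(V / len(x))
print(theta_hat, se)
```

For this particular ψ the sandwich estimate reduces to the ordinary variance of the data, which illustrates why the estimator is valid without distributional assumptions.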
Huber [10, 11] derived the asymptotic properties of the M-estimator, and because of its general
form, the class of M-estimators includes several familiar estimators. We will give three examples.
Example 1 - Ordinary least squares

The first example is the least-squares estimator. Consider the linear regression model Y =
Xβ + ε, where Y is the n × 1 response vector, X is an n × p matrix of explanatory
variables measured on n observations, β is a p × 1 coefficient vector and ε is an n × 1 vector
of i.i.d. N(0, σ²) error terms. We estimate the regression coefficients by β̂ = (X^T X)⁻¹ X^T Y. This can also be
rewritten as an M-estimator by letting ψ(Y_i, X_i, β) = (Y_i − X_i^T β) X_i, where X_i^T denotes the i-th row of X, so that we get

∑_{i=1}^{n} ψ(Y_i, X_i, β) = ∑_{i=1}^{n} (Y_i − X_i^T β) X_i = 0,

where we, by solving for β, get that β̂ = (X^T X)⁻¹ X^T Y. □
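The estimating-equation view of least squares can be checked numerically. The following sketch (simulated data; all variable names are hypothetical) verifies that the closed-form solution is a root of ∑(Y_i − X_i^T β) X_i:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))                 # design matrix; rows are X_i^T
beta_true = np.array([2.0, 1.0, -3.0])
Y = X @ beta_true + rng.normal(size=n)

# Closed-form least squares: beta_hat = (X^T X)^{-1} X^T Y.
beta_ls = np.linalg.solve(X.T @ X, X.T @ Y)

# M-estimator view: the summed psi is X^T Y - (X^T X) beta, which is linear
# in beta, so its unique root is exactly the least-squares solution.
def psi_sum(beta):
    return X.T @ (Y - X @ beta)

assert np.allclose(psi_sum(beta_ls), 0.0, atol=1e-8)
print(beta_ls)
```

Since the estimating equation is linear in β, root-finding and the normal equations coincide here; for nonlinear ψ (as in the next examples) a numerical solver is needed.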
Example 2 - QMLE

One could think of M-estimation not as a method per se for finding an estimator of an unknown
parameter θ, but rather as a framework: given an estimator, we can ask whether it is an
M-estimator. The QMLE is given by the parameter θ̂ solving max_{θ∈Θ} L_n(Xⁿ, θ) ≡
n⁻¹ ∑_{i=1}^{n} log f(X_i, θ), which is the same as minimizing −L_n(Xⁿ, θ). Thus, by letting
ψ(x, θ) = ∂ log f(x; θ)/∂θ, the maximization problem of the QMLE can be re-expressed as

∑_{i=1}^{n} ψ(X_i, θ) = ∑_{i=1}^{n} ∂ log f(X_i; θ)/∂θ = 0,

meaning that the QMLE indeed is an M-estimator. □
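To make the score-equation view concrete, here is a hedged sketch (simulated data; not from the thesis) in which a Poisson working model is fitted to overdispersed counts. The summed score is an estimating equation of the form (4.1), and its root, the QMLE, targets the KLIC minimizer, which for the Poisson score is the true mean even though the model is wrong:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
# Overdispersed counts: the Poisson "estimation model" is misspecified.
x = rng.negative_binomial(3, 0.3, size=10_000)   # mean = 3 * 0.7 / 0.3 = 7

# Poisson score: psi(x, theta) = d/dtheta log f(x; theta) = x / theta - 1.
def score_sum(theta):
    return np.sum(x / theta - 1.0)

# The QMLE is the root of the summed score, i.e. an M-estimator as in (4.1).
theta_qmle = brentq(score_sum, 1e-6, 100.0)

# Solving sum(x/theta - 1) = 0 gives theta = mean(x), so the KLIC minimizer
# theta* is the true mean despite the misspecified likelihood.
print(theta_qmle, x.mean())
```

This is the same phenomenon as in Subsection 2.3: the estimator converges to the pseudo-true value θ*, which here happens to be an interpretable feature of the true distribution.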
Example 3 - Causal inference

In the statistical theory of causal inference we are interested in estimating the average causal
effect (ACE) of an intervention or treatment on an outcome of interest. If Y_1 denotes the potential
outcome under treatment and Y_0 the potential outcome under nontreatment, the causal effect
would be the difference Y_1 − Y_0.⁷ The fundamental problem of causal inference is that we
wish to estimate the causal effect for every person in the population, which is impossible since
for every individual we only observe either Y_1 or Y_0 [9]. Therefore the outcomes Y_1 and Y_0 are called
potential, and the goal of inference changes to estimating the population average treatment
effect, τ = E(Y_1 − Y_0), which can be identified under certain conditions. Because background
covariates may confound the relationship between the treatment and the outcome, different
estimators have been proposed that take this problem into account. Several estimators use
the so-called propensity score, defined as the conditional probability of receiving the treatment
given the covariates, P(T = 1|X) ≡ e(X), where T is an indicator variable equal to one
if a person has received the treatment and zero if not, and X is a covariate vector. Rosenbaum
and Rubin [23] have shown that given (Y_1, Y_0) ⊥⊥ T | X, it is sufficient to condition on the
propensity score to achieve balance on the covariates between individuals in the different treatment
groups who have the same propensity score, i.e. X ⊥⊥ T | e(X). Usually, the propensity score is
unknown and has to be estimated. One common way is to assume that e(X) can be described
by a parametric model and use logistic regression to estimate it. We express this as
P(T = 1|X) = exp(X^T β) / (1 + exp(X^T β)).

The coefficients can be estimated by e.g. maximum likelihood. Usually the treatment variable T
is modeled as a sequence of independent Bernoulli trials with treatment probability e(X). The
likelihood of the coefficients is then given by

L(β|T) = ∏_{i=1}^{n} e(X_i, β)^{T_i} (1 − e(X_i, β))^{1−T_i}

with log-likelihood

⁷ This theoretical framework is often referred to as Rubin's model [9] after Donald B. Rubin.
l(β|T) = ∑_{i=1}^{n} [ log(1 − e(X_i, β)) + T_i log( e(X_i, β) / (1 − e(X_i, β)) ) ].

To get the coefficient estimates we take the derivative of the log-likelihood function with respect
to β:
∂ log(1 − e(X_i, β)) / ∂β = − ( 1 / (1 − e(X_i, β)) ) ∂e(X_i, β)/∂β
                          = − ( e(X_i, β) / [ e(X_i, β)(1 − e(X_i, β)) ] ) ∂e(X_i, β)/∂β    (4.4)
and

∂/∂β [ T_i log( e(X_i, β) / (1 − e(X_i, β)) ) ] = ( T_i / [ e(X_i, β)(1 − e(X_i, β)) ] ) ∂e(X_i, β)/∂β.    (4.5)
Combining the equations in (4.4) and (4.5), setting the sum equal to zero and solving for β gives
the MLE of the coefficients. Expressed as an M-estimator we have that

∑_{i=1}^{n} ψ(T_i, X_i, β) = ∑_{i=1}^{n} ( (T_i − e(X_i, β)) / [ e(X_i, β)(1 − e(X_i, β)) ] ) ∂e(X_i, β)/∂β = 0.    (4.6)
This is indeed what was done in Simulation 1, where the model (which could have been a model
for e(X)) was misspecified in the link function. Observe, though, that in equation (4.6) it is
assumed that e(X) is correctly specified.
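A short numerical sketch of (4.6) follows (simulated design; all names and coefficient values are hypothetical). For the logistic model, ∂e/∂β = e(1 − e) X_i, so the estimating equation reduces to ∑ (T_i − e(X_i, β)) X_i = 0, which a general-purpose root finder can solve:

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(3)
n = 2000
X = np.column_stack([np.ones(n), rng.uniform(2, 4, size=(n, 2))])
beta_true = np.array([-0.5, 1.5, -1.1])            # hypothetical coefficients
T = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

# For the logistic link, de/dbeta = e(1 - e) X_i, so (4.6) simplifies to
# sum_i (T_i - e(X_i, beta)) X_i = 0.
def psi_sum(beta):
    e = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
    return X.T @ (T - e)

sol = root(psi_sum, x0=np.zeros(3))
beta_hat = sol.x
print(sol.success, beta_hat)
```

Solving the score equation directly like this gives the same fit as a standard logistic regression routine, which is exactly the point of viewing the MLE as an M-estimator.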
When interested in estimating τ, the next step could be to use the estimate of e(X) in
an estimator of the ACE. One such proposed estimator is the inverse probability
weighting (IPW) estimator. If the observed outcome is defined as Y = TY_1 + (1 − T)Y_0, the IPW
estimator is expressed as

τ̂ = (1/n) ∑_{i=1}^{n} [ T_iY_i / e(X_i) − (1 − T_i)Y_i / (1 − e(X_i)) ].    (4.7)
A proof showing that we can estimate τ with the observed data using the IPW estimator is
given in the appendix. This estimator can be expressed as an M-estimator by letting

g(X_i) = T_iY_i / e(X_i) − (1 − T_i)Y_i / (1 − e(X_i)).

We then express the IPW estimator as the solution to ∑ ψ(g(X_i), τ) = ∑ (g(X_i) − τ) = 0, because
breaking τ out of the summation we get that nτ = ∑ g(X_i) and finally that

τ̂ = (1/n) ∑_{i=1}^{n} g(X_i) = (1/n) ∑_{i=1}^{n} [ T_iY_i / e(X_i) − (1 − T_i)Y_i / (1 − e(X_i)) ].
Usually the propensity score is unknown and is estimated with the data at hand. Thus we have
to add this step to the M-estimator expression of the IPW estimator, which gives

∑_{i=1}^{n} ψ(T_i, Y_i, X_i, β, τ) =
( ∑_{i=1}^{n} [ T_iY_i / e(X_i, β) − (1 − T_i)Y_i / (1 − e(X_i, β)) − τ ] )          ( 0 )
( ∑_{i=1}^{n} ( (T_i − e(X_i, β)) / [ e(X_i, β)(1 − e(X_i, β)) ] ) ∂e(X_i, β)/∂β )  =  ( 0 ).    (4.8)
The equation system (4.8) thus includes a parametric part and a nonparametric part, displaying
the flexibility of M-estimation in that it allows equations to be stacked onto each other
to, together, yield the sought estimator. This is called a partial M-estimator [27], and its
properties are similar to the general approach given by Randles [21], which
concerns statistics that contain estimated parameters. □
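The stacked system (4.8) can be sketched numerically (an illustrative sketch, not the thesis code; the data-generating design borrows from Simulation study 2 in Section 5). Since the score equations involve β only, the system is block-triangular: we can solve for β̂ first, read τ̂ off the last equation, and then verify that the full stacked ψ vanishes at the joint solution:

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(4)
n = 4000
X1, X2 = rng.uniform(2, 4, size=(2, n))
X = np.column_stack([np.ones(n), X1, X2])

# Data-generating design borrowed from Simulation study 2 (Section 5):
e_true = 1.0 / (1.0 + np.exp(-0.5 + 1.5 * X1 - 1.1 * X2))
T = rng.binomial(1, e_true)
Y1 = 1 + 3 * X1 + 4 * X2 + rng.normal(size=n)
Y0 = 2 + 2 * X1 + 3 * X2 + rng.normal(size=n)
Y = T * Y1 + (1 - T) * Y0                  # observed outcome; true ACE = 5

def e_model(beta):
    return 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))

def stacked_psi(beta, tau):
    """The stacked estimating equations of (4.8)."""
    e = e_model(beta)
    score = X.T @ (T - e)                  # logistic score part, cf. (4.6)
    ipw = np.sum(T * Y / e - (1 - T) * Y / (1 - e) - tau)
    return np.append(score, ipw)

# Block-triangular structure: the score depends on beta only, so solve it
# first and then read tau off the last (IPW) equation.
beta_hat = root(lambda b: X.T @ (T - e_model(b)), x0=np.zeros(3)).x
e_hat = e_model(beta_hat)
tau_hat = np.mean(T * Y / e_hat - (1 - T) * Y / (1 - e_hat))
print(tau_hat)
```

Stacking the equations like this is what justifies applying the M-estimation asymptotics, including the sandwich variance, to the IPW estimator with an estimated propensity score.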
In Subsection 2.3 we established that the QMLE has the KLIC minimizer as its almost sure limit,
but that this limit is not necessarily the true parameter. Assuming the same (incorrect)
parametric family, but now using M-estimation theory, we in general have a parameter
estimate that is a unique solution to (4.1). We have shown that this estimator could for instance
be the QMLE, and stated that in general θ̂ →p θ*, where θ* is the solution of (4.2). So what is θ*?
Equation (4.2) does not say that θ* is the true parameter, only that it is a solution to this
equation.⁸ So, for a misspecified model, using the fact that the QMLE is an M-estimator, the θ*
given by equation (4.2) will be equal to the KLIC minimizer given by (2.5).
An important consequence covered by M-estimation theory is that, in situations where the model is misspecified,
the variance of the estimator will no longer equal the inverse of the Fisher information. Thus,
we have to use the sandwich estimator given by (4.3) to estimate the variance, which White
[33] showed for the QMLE and Huber [11] for a whole class of M-estimators. To exemplify
how the empirical sandwich estimator works, we revisit simulation design A of Simulation
1, conducted in Section 3. The covariance matrix of the approximated sampling distribution for
the correctly specified model, using a sample size of n = 100, is given by
Σ̂ =
     0.645  −0.054  −0.080
    −0.054   0.018  −0.004
    −0.080  −0.004   0.020

and the empirical sandwich estimate of the covariance matrix for the same sample size for the
misspecified estimation model is given by

Σ̂_sand =
     0.234  −0.021  −0.028
    −0.021   0.006  −0.001
    −0.028  −0.001   0.006.
Clearly the estimated covariance matrices differ. It is important to point out that if the estimated
covariance matrix differs substantially from the empirical sandwich covariance matrix, this is a
clear indication that the model is misspecified and that model diagnostics are called for [14].
As this is the case in the example above, the advice would be to respecify the model to try to
get a better fit. In this way, the sandwich estimator is used as a diagnostic tool.

⁸ It can be shown that the solution is unique.
5 Simulation study 2

In a final simulation study we revisit Example 3 of Section 4 to illustrate the consequences
of model misspecification when estimating the ACE using the IPW estimator. We also
use the fact that the IPW estimator is an M-estimator to calculate the sandwich estimate of
the standard deviation and compare it with the standard deviation of the approximate sampling
distribution. For the IPW estimator to be unbiased for the ACE, e(X) has to be
correctly specified. This means that the simulation study will display the amount of bias that a
misspecified propensity score model introduces into the IPW estimator when estimating the ACE.
5.1 Design

Two uniformly distributed random variables are generated, X_1, X_2 ∼ U(2, 4). The potential
outcomes are

Y_1 = 1 + 3X_1 + 4X_2 + ε_1
Y_0 = 2 + 2X_1 + 3X_2 + ε_0

where ε_t ∼ N(0, 1), t = 0, 1, meaning that τ = E(Y_1 − Y_0) = 5. Using N independent Bernoulli
trials, the treatment variable is generated as T ∼ Bern(e(X_1, X_2)), with the probability of being
treated

P(T = 1|X_1, X_2) = e(X_1, X_2) = 1 / (1 + exp(−0.5 + 1.5X_1 − 1.1X_2)).

A misspecified propensity score model is specified as

q(X_1, X_2) = Φ(β_0 + β_1X_1 + β_2X_2),

where Φ denotes the standard normal distribution function, i.e. a probit link.
The ACE will be estimated using the IPW estimator given by (4.7). Lunceford and Davidian
[17] have derived the matrices A and B for the IPW estimator and stated the sandwich
estimator of its variance as n⁻² ∑_{i=1}^{n} Î_i², where

Î_i = T_iY_i / ê(X_i) − (1 − T_i)Y_i / (1 − ê(X_i)) − τ̂ − (T_i − ê(X_i)) Ĥ^T Ê⁻¹ X_i,

Ĥ = (1/n) ∑_{i=1}^{n} [ T_iY_i (1 − ê(X_i)) / ê(X_i) + (1 − T_i)Y_i ê(X_i) / (1 − ê(X_i)) ] X_i

and

Ê = (1/n) ∑_{i=1}^{n} ê(X_i)(1 − ê(X_i)) X_i X_i^T.
The ACE will, for both propensity score models, be estimated using sample sizes n = 500, n =
1000 and n = 3000, with 1000 replicates for each sample size.
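The design above can be sketched as a small Monte Carlo study (an illustrative re-implementation, not the thesis code; the replicate count is reduced for speed). The propensity score is fitted both with the correct logistic model and with the misspecified probit model, and the IPW bias is averaged over replicates:

```python
import numpy as np
from scipy.optimize import root
from scipy.stats import norm

rng = np.random.default_rng(5)

def one_replicate(n):
    X1, X2 = rng.uniform(2, 4, size=(2, n))
    X = np.column_stack([np.ones(n), X1, X2])
    e = 1.0 / (1.0 + np.exp(-0.5 + 1.5 * X1 - 1.1 * X2))
    T = rng.binomial(1, e)
    Y1 = 1 + 3 * X1 + 4 * X2 + rng.normal(size=n)
    Y0 = 2 + 2 * X1 + 3 * X2 + rng.normal(size=n)
    Y = T * Y1 + (1 - T) * Y0

    def ipw(e_hat):
        e_hat = np.clip(e_hat, 1e-6, 1 - 1e-6)
        return np.mean(T * Y / e_hat - (1 - T) * Y / (1 - e_hat))

    # Correctly specified PS model: logistic score equation.
    def logit_score(b):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ b, -30, 30)))
        return X.T @ (T - p)

    # Misspecified PS model: probit link q(X) = Phi(X beta).
    def probit_score(b):
        z = X @ b
        P = np.clip(norm.cdf(z), 1e-9, 1 - 1e-9)
        return X.T @ (norm.pdf(z) * (T - P) / (P * (1 - P)))

    b_logit = root(logit_score, np.zeros(3)).x
    b_probit = root(probit_score, np.zeros(3)).x
    return (ipw(1.0 / (1.0 + np.exp(-X @ b_logit))),
            ipw(norm.cdf(X @ b_probit)))

est = np.array([one_replicate(500) for _ in range(400)])
bias_true, bias_false = est.mean(axis=0) - 5.0   # true ACE is 5
print(bias_true, bias_false)
```

With this design the correctly specified fit should give bias close to zero, while the probit fit introduces a systematic bias, in line with the results reported in Table 3.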
5.2 Results

Table 3 gives the bias and standard deviation of the IPW estimator using the correctly
specified and the misspecified propensity score model, respectively. For every
sample size we approximate the sampling distribution of the estimator and calculate its standard
deviation (SD). This is compared against the sandwich estimator. In addition,
the mean squared error (MSE) is calculated using the variance of the approximated sampling
distribution.
Table 3: Results for the IPW estimator when misspecifying the link function. SD, standard deviation; SD_sandwich, standard deviation of sandwich estimator; MSE, mean squared error; PS, propensity score.

Simulation 3
n      Specification   Bias     SD      SD_sandwich   MSE
500    PS true        −0.043   0.765   1.116         0.587
       PS false        0.166   0.807   1.129         0.679
1000   PS true        −0.010   0.513   0.547         0.263
       PS false        0.193   0.523   0.553         0.311
3000   PS true         0.001   0.292   0.192         0.085
       PS false        0.212   0.294   0.194         0.131
The results in Table 3 show that the misspecified propensity score model makes the IPW
estimator biased, while the IPW estimator using the correctly specified propensity score model gives
estimates close to the true value. Using the misspecified propensity score model, the bias
of the IPW estimator increases with increasing sample size. Both these results are in accordance with
those given in [29], where it is investigated how the ACE estimate of the IPW estimator (among
others) is affected by different types of misspecifications of e(X).
When e(X) is correctly specified, the standard deviation of the approximated sampling dis-
tribution (SD) of the IPW estimator decreases with increasing sample size. The standard deviation ap-
proximated by the sandwich estimator (SD_sandwich) is greater than SD for sample sizes n = 500
and n = 1000 and smaller for n = 3000. The results are analogous for the case where e(X) is
misspecified. Lastly, we see that the MSE of the IPW estimator is smaller when e(X) is correctly specified,
for every sample size.
Theoretically, for a correctly specified model, the standard deviation estimated by the sandwich estimator should coincide
with that based on the inverse of the Fisher information, the variance limit of the MLE. The distance between
the sandwich estimate and the standard deviation of the approximate sampling distribution decreases
when the sample size increases from n = 500 to n = 1000, but increases when the sample size
goes from n = 1000 to n = 3000. It thus seems like the sandwich estimator
does not converge.
6 Final recommendations

White [33] noted that since −A(θ*) = B(θ*) only when the model is correctly specified, we
can test for model misspecification by testing the null hypothesis A(θ*) + B(θ*) = 0, where
A(θ*) and B(θ*) can be consistently estimated by A(θ̂) and B(θ̂), respectively. He called this the
information matrix test. He furthermore adjusted the Hausman test [8], which essentially measures
the distance between the MLE and the QMLE. Since this distance converges to zero for correctly
specified models but generally not otherwise, it ought to indicate when the QMLE is inconsistent.
We refer to [33] for a formal review and the derived test statistics of the two tests. As stated in
the beginning of this thesis, the theory of misspecified models can be seen as one component
of the theory of statistical modeling. To connect even more to that way of thinking about the
presented theory, we state White's recommendations for building an estimation model:

1. Use maximum likelihood to estimate the parameters of the model.

2. Apply the information matrix test to check for model misspecification.

3. If the null hypothesis is not rejected, we can go on with our MLE estimates; if it is rejected, we
investigate the misspecification with a Hausman test.

4. If the model passes the Hausman test, we apply the robust sandwich estimator; if not, the
model is misspecified to the extent that it needs to be investigated further before conducting
any inference.
7 Discussion

The purpose of this thesis was to review parts of the theory of misspecified parametric models.
The thesis started with a discussion of model misspecification for parametric models using the
KLIC. From that, the QMLE was derived. It was stated that the QMLE is √n-convergent and
also converges almost surely to the parameter that minimizes the KLIC, given the model and the
data. For certain cases, the QMLE also converges up to scale towards the true parameter. These
results were illustrated in Simulation 1, after which a broader class of estimators, the M-estimators,
was introduced and its properties stated. Lastly, a second simulation study was
conducted, illustrating the consequences of model misspecification for a partial M-estimator, the IPW
estimator.
This thesis has stated results leading up to a parameter estimate that has limit θ*. One
of the main questions that this thesis aims to answer is whether we can learn something about θ,
the true parameter that we actually are interested in, even in cases where the estimation model
is misspecified. We have shown three situations where we, in spite of a misspecified model, can
get unbiased estimates of the ratio of the parameters of interest and in that way gain knowledge
about the true parameter. Interestingly, we have by simulation found a situation, not covered
by any theorem, where this result also seems to hold. The up-to-scale convergence result
also implies that misspecification of the link function will not cause any problem for hypothesis
testing when testing whether the coefficient of interest is equal to zero. But the up-to-scale convergence
result has only been stated to hold for certain situations, meaning that there is a lack of results
for other types of misspecifications. If we for example omit a variable in a linear regression
model with normally distributed errors, we might be able to give an unbiased estimate neither
of the true coefficients nor of the coefficient ratio. This means that, even though the results
presented in this thesis contribute to the theory of statistical modeling, we do not have
complete knowledge of what we can and should do when facing a misspecified model. This is thus
possible content for further studies.
As argued in [14], readers should be suspicious when the robust and the standard errors
of the sample differ. It is more reasonable to use the robust covariance matrix as a model check
than as routine. King and Roberts [14] even argue that model diagnostics should be
performed to the extent that the choice between "classical" and robust standard errors does not
matter for the inference to be conducted. This is more restrictive than White's recommendation,
and the author thinks that it is at least important to report the steps made in the analysis
and what eventual restrictions these put on the results.
Another reflection concerns the potential conflict between striving for an efficient estimator
and actually wanting to capture the true parameter value in a confidence interval. Several esti-
mators have desirable properties under the correct model, but might not be robust against model
misspecification. If the estimation model is misspecified, we might instead have an estimator that
converges to another limit, not the true parameter value. Take the QMLE as an example: it
has its limit in θ* and reaches the variance C(θ*), but a confidence interval for the QMLE might
surround the wrong value. An estimator giving a broader interval might instead actually
include the true parameter value. Efficiency is a desirable property of an estimator, but for
misspecified models it could mean that you just get a narrow interval around the wrong value.
Finally, this thesis has shed some light on the very basic assumptions of statistical modeling,
in that we assume a true model that gives a complete description of the data-generating process;
that there, for example, exists a true parameter quantifying the relationship between an
explanatory variable and the outcome of interest. This might not be true, yet we think of it in
this way in order to meaningfully discuss the estimation model specified to gain knowledge about
some phenomenon of interest.
For further studies it would be interesting to investigate when we are able to conduct inference
about the true parameter and when we have to lean on the theory presented in this thesis, i.e.
when do we know that we have misspecified the model to the extent that we are no longer able
to conduct inference about θ?
8 Acknowledgements

I would like to express my gratitude to my supervisor Ingeborg Waernbaum for all her help and
guidance during the work on this thesis. I would also like to thank Laura Pazzagli for her help
with the sandwich estimator.
References

[1] Akaike, H. [1973], Information theory and an extension of the likelihood principle, in 'Proceedings of the Second International Symposium of Information Theory'.

[2] Boos, D. D. and Stefanski, L. A. [2013], Essential Statistical Inference, Springer-Verlag New York.

[3] Cramér, H. [1946], Mathematical Methods of Statistics, Princeton University Press.

[4] Durbin, J. and Watson, G. S. [1950], 'Testing for serial correlation in least squares regression: I', Biometrika 37(3/4), 409–428.

[5] Eicker, F. [1967], Limit theorems for regressions with unequal and dependent errors, in 'Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics', University of California Press, Berkeley, Calif., 59–82.

[6] Faraway, J. J. [2006], Extending the Linear Model with R - Generalized Linear, Mixed Effects and Nonparametric Models, CRC Press.

[7] Fisher, R. A. [1922], 'On the mathematical foundations of theoretical statistics', Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 222(594-604), 309–368.

[8] Hausman, J. A. [1978], 'Specification tests in econometrics', Econometrica 46(6), 1251–1271.

[9] Holland, P. W. [1986], 'Statistics and causal inference', Journal of the American Statistical Association 81(396), 945–960.

[10] Huber, P. J. [1964], 'Robust estimation of a location parameter', The Annals of Mathematical Statistics 35(1), 73–101.

[11] Huber, P. J. [1967], 'The behavior of maximum likelihood estimates under nonstandard conditions'.

[12] Huber, P. J. [1973], 'Robust regression: Asymptotics, conjectures and Monte Carlo', The Annals of Statistics 1(5), 799–821.

[13] Huber, P. J. [1981], Robust Statistics, Wiley, New York.

[14] King, G. and Roberts, M. E. [2014], 'How robust standard errors expose methodological problems they do not fix, and what to do about it', Political Analysis, 1–21.

[15] Kullback, S. and Leibler, R. A. [1951], 'On information and sufficiency', The Annals of Mathematical Statistics 22(1), 79–86.

[16] Li, K.-C. and Duan, N. [1989], 'Regression analysis under link violation', The Annals of Statistics 17(3), 1009–1052.

[17] Lunceford, J. K. and Davidian, M. [2004], 'Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study', Statistics in Medicine 23(19), 2937–2960.

[18] Manski, C. F. [1988], 'Identification of binary response models', Journal of the American Statistical Association 83(403), 729–738.

[19] Perez-Cruz, F. [2008], Kullback-Leibler divergence estimation of continuous distributions, in '2008 IEEE International Symposium on Information Theory', 1666–1670.

[20] R Core Team [2013], R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria. URL: http://www.R-project.org/

[21] Randles, R. H. [1982], 'On the asymptotic normality of statistics with estimated parameters', The Annals of Statistics 10(2), 462–474.

[22] Rényi, A. [1961], On measures of entropy and information, in 'Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics', University of California Press, Berkeley, Calif., 547–561.

[23] Rosenbaum, P. R. and Rubin, D. B. [1983], 'The central role of the propensity score in observational studies for causal effects', Biometrika 70(1), 41–55.

[24] Ruud, P. A. [1983], 'Sufficient conditions for the consistency of maximum likelihood estimation despite misspecification of distribution in multinomial discrete choice models', Econometrica 51(1), 225–228.

[25] Schwarz, G. [1978], 'Estimating the dimension of a model', The Annals of Statistics 6(2), 461–464.

[26] Shannon, C. E. [1948], 'A mathematical theory of communication', Bell System Technical Journal 27(3), 379–423.

[27] Stefanski, L. A. and Boos, D. D. [2002], 'The calculus of M-estimation', The American Statistician 56(1), 29–38.

[28] Viele, K. [2007], 'Nonparametric estimation of Kullback-Leibler information illustrated by evaluating goodness of fit', Bayesian Analysis 2(2), 239–280.

[29] Waernbaum, I. [2012], 'Model misspecification and robustness in causal inference: comparing matching with doubly robust estimation', Statistics in Medicine 31(15), 1572–1581.

[30] Wald, A. [1949], 'Note on the consistency of the maximum likelihood estimate', The Annals of Mathematical Statistics 20(4), 595–601.

[31] Wasserman, L. [2004], All of Statistics - A Concise Course in Statistical Inference, 1st edn, Springer-Verlag New York.

[32] White, H. [1980], 'A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity', Econometrica 48(4), 817–838.

[33] White, H. [1982], 'Maximum likelihood estimation of misspecified models', Econometrica 50(1), 1–25.

[34] White, H. [1996], Estimation, Inference and Specification Analysis, Cambridge University Press, Cambridge.
A Proof of identifiability of the IPW estimator

To prove that the IPW estimator can be identified with the observed data, we want to show
that

E[ TY/e(X) − (1 − T)Y/(1 − e(X)) ] = E[Y_1 − Y_0].

We will use that the observed outcome is defined as Y = TY_1 + (1 − T)Y_0, together with the
following assumptions:

A.1 (Y_1, Y_0) ⊥⊥ T | X

A.2 0 < P(T = 1|X) < 1.

We start by showing that E[ TY/e(X) ] = E[Y_1]:

E[ TY/e(X) ]
  = E[ TY_1/e(X) ]
  = E{ E[ TY_1/e(X) | X ] }
  = E{ (1/e(X)) E[ TY_1 | X ] }
  = E{ (1/e(X)) E[ T | X ] E[ Y_1 | X ] }
  = E{ E[ Y_1 | X ] }
  = E[ Y_1 ],

where the first equality follows from the definition of Y (since TY = TY_1), the second from the
law of total expectation, the fourth from assumption A.1, and the fifth from E[T|X] = e(X),
which is nonzero by A.2. Next we have that

E[ (1 − T)Y/(1 − e(X)) ]
  = E[ (1 − T)Y_0/(1 − e(X)) ]
  = E{ E[ (1 − T)Y_0/(1 − e(X)) | X ] }
  = E{ (1/(1 − e(X))) E[ (1 − T)Y_0 | X ] }
  = E{ (1/(1 − e(X))) E[ (1 − T) | X ] E[ Y_0 | X ] }
  = E{ E[ Y_0 | X ] }
  = E[ Y_0 ].

Thus, the proof is completed. □