Umeå School of Business and Economics
Spring semester 2015
Master thesis, one-year, 15 ECTS
Misspecification and inference
A review and case studies
Author: Gabriel Wallin
Supervisor: Ingeborg Waernbaum
Abstract
When conducting inference we usually have to make some assumptions. A common assumption is that the parametric model which describes the behavior of the investigated random phenomenon is correctly specified. If it is not, some inferential methods, e.g. the method of maximum likelihood, do not provide valid inference. This thesis investigates and presents some of the results regarding misspecified parametric models to illustrate the consequences of misspecification and how the parameter estimates are affected. The main question investigated is whether we can still learn something about the true parameter even though the model is misspecified. An important result is that the quasi-maximum likelihood estimate of a misspecified estimation model converges almost surely towards the parameter minimizing the distance between the true model and the estimation model. Using simulations, it is illustrated how this estimator in certain situations converges almost surely towards the true parameter times a scalar. This result also seems to hold for a situation not covered by any theorem. Furthermore, a general class of estimators called M-estimators is presented to extend the theoretical framework of misspecified models, and an example is given where the theory of M-estimators comes into use under model misspecification.
Sammanfattning (Swedish abstract)
Title: Felspecificering och inferens – en genomgång och fallstudier (Misspecification and inference – a review and case studies).
When drawing conclusions from data, one generally has to make certain assumptions. A common assumption is that the parametric model describing the behavior of the investigated random phenomenon is correctly specified. If this is not the case, some inferential methods, such as the maximum likelihood method, cannot be used. This thesis investigates and presents some of the results for misspecified parametric models to illustrate the consequences of misspecification and how the parameter estimates are affected. A main question investigated is whether it is still possible to learn something about the true parameter even when the model is misspecified. An important result presented is that the so-called quasi-maximum likelihood estimate from a misspecified model converges almost surely towards the parameter minimizing the distance between the true model and the estimation model. It is shown how this parameter in certain situations converges towards the true parameter multiplied by a scalar. This result is also illustrated for a situation not covered by any theorem. In addition, a general class of estimators called M-estimators is presented. These are used to extend the theory of misspecified models, and an example is presented where the theory of M-estimators is put to use.
Populärvetenskaplig sammanfattning (popular science summary)
When one wants to assess how much different variables are associated with, or explain, a specific variable, regression is a common approach. A regression model quantifies the relationship between the so-called explanatory variables and the dependent variable, and a problem that is almost always present is that the true data generating process describing this relationship is generally unknown. It is therefore up to the researcher to try to specify an estimation model believed to be as close to the truth as possible, so that the effects of the explanatory variables on the dependent variable can be well estimated. If the estimation model is not close to the true model, the effect estimates will not be close to the true effects either. The purpose of this thesis is therefore to examine whether there is any information about the true effects that can be extracted from a misspecified model. It is also shown how the commonly used estimation method maximum likelihood, in the case of a misspecified model, is the estimation method that approaches the estimate that comes closest to the true effects. In addition, special cases are presented where, despite a misspecified model, some information can be obtained about the effect of each explanatory variable on the dependent variable. The results are also generalized to cover more estimation methods than just maximum likelihood, and finally an example is given of a type of estimation method often used when one wants to estimate the effect of some form of treatment on a given outcome.
Contents
1 Introduction
  1.1 Purpose of the thesis
  1.2 Outline of the thesis
2 Theory part 1
  2.1 Maximum likelihood
  2.2 Kullback-Leibler Information Criterion
  2.3 Quasi Likelihood
3 Simulation study 1
  3.1 Design A
  3.2 Results A
  3.3 Design B
  3.4 Results B
4 Theory part 2
  4.1 M-estimators
5 Simulation study 2
  5.1 Design
  5.2 Results
6 Final recommendations
7 Discussion
8 Acknowledgements
References
A Proof of identifiability of the IPW estimator
1 Introduction
In statistical modeling we are interested in the underlying structure that generates the data. In that sense, we assume that there actually exists a true model that fully describes the data generating process (DGP). It then follows that the parametric model¹ used to make inferences will be a more or less adequate description of the DGP. In statistical research, model specification, model diagnostics and optimal model choice have been widely investigated, see for instance [4, 1] and [25]. This thesis instead takes the approach that we have ended up with an estimation model that is not an adequate description of the DGP. We will call this type of estimation model misspecified. A natural question then arises: What happens to the parameter estimates when the parametric model is misspecified? Is there still any information regarding the true parameters that can be found when making inference based on a misspecified model? As pointed out in King and Roberts [14]:

"Models are sometimes useful but almost never exactly correct, and so working out the theoretical implications for when our estimators still apply is fundamental to the massive multidisciplinary project of statistical model building."

Several papers that have investigated misspecified models have proposed robust alternatives for estimation, e.g. robust standard errors. Among others, Huber [12, 13] gave robust alternatives to the least squares estimate for regression models, and both White [32] and Eicker [5] investigated a covariance matrix estimator that is robust against heteroskedasticity. To have a broader understanding of statistical modeling it is also reasonable to add knowledge about the properties of misspecified models, since this situation is very likely to occur as soon as we wish to model a random phenomenon. This thesis will exemplify and illustrate some of the results regarding inference for misspecified models and show situations where there is still information to be gained about the true parameter. Our starting point is that we are facing two (possibly) different models: a true, unknown model and an estimation model. This means that the theory presented here is a complement to the literature on optimal model choice and model specification. This could in fact be seen as one joint theory, where model specification and optimal model choice, model diagnostics and the results for misspecified models are all parts of the theory of statistical modeling.
1.1 Purpose of the thesis
The purpose of this thesis is to review some of the theory for misspecified parametric models and to state and illustrate some of the existing results for misspecified parametric models when conducting statistical inference about an unknown parameter. The results will be exemplified and illustrated using simulation. Two main questions that will be given extra attention are:

• How will the parameter estimates be affected when the estimation model is misspecified?

• Given a misspecified model, are there any special cases where there is still some information that can be found about the true model?

¹ This thesis is only concerned with inference for parametric models, even though there of course exist other, "model-free", modes of inference.
1.2 Outline of the thesis
The thesis is organized as follows:

• Section 2 starts with an overview of a common method of statistical inference, the maximum likelihood method. It then gives a definition of the Kullback-Leibler information criterion and describes its role in the theory of model misspecification. Then another likelihood-type estimator, the quasi-maximum likelihood estimator, is proposed with motivation from the definition of the Kullback-Leibler information criterion. Some results for the estimator are given together with a simple illustration of how the Kullback-Leibler information criterion and the quasi-maximum likelihood estimator are connected.

• Section 3 uses simulations to illustrate some of the results for the QMLE.

• Section 4 introduces a broader class of so-called M-estimators together with a discussion of the contribution of M-estimation theory to the asymptotic theory of misspecified models.

• Section 5 illustrates some of the results for M-estimators using simulation.

• Section 6 gives some final recommendations, and Section 7 gives a summary and a discussion of the results presented in the thesis, together with suggestions for further research.
2 Theory part 1
2.1 Maximum likelihood
A parametric model or parametric family of distributions is a collection of probability distributions that can be described by a finite number of parameters. It is intended to describe the probability mass function (p.m.f.) or probability density function (p.d.f.) of a random variable. Consider a random variable whose functional form of the p.m.f. or p.d.f. is known but where the distribution depends on an unknown parameter² θ that takes values in a parameter space Θ. If we for example know that the random variable X is described by p_X(x; θ) = e^{−θ}θ^x/x!, x = 0, 1, 2, ..., and that θ ∈ Θ = {θ : 0 < θ < ∞}, it still might be the case that we need to specify the most probable p.m.f. of X. This means that we are interested in a specific member of the family of distributions contained in the parametric model {p_X(x; θ), θ ∈ Θ}, and thus we must estimate the parameter θ. One common estimator of θ in this type of setting is the maximum likelihood estimator (MLE). Let X_1, X_2, ..., X_n be a random sample of size n of independent and identically distributed (i.i.d.) random variables with realizations denoted by x_1, ..., x_n. If we denote the density of X as g_X(x; θ), regarded as a function of the unknown parameter θ, the likelihood function is defined as L(θ; x_1, ..., x_n) = ∏_{i=1}^n g_X(x_i; θ). To get the MLE of θ, the likelihood function, or more commonly the natural logarithm of the likelihood function, is maximized.
Given suitable regularity conditions³, the method of maximum likelihood gives estimates that have several appealing properties such as efficiency [7], consistency [30, 3] and asymptotic normality [3]. The last property means that

\[ \sqrt{n}(\hat\theta_{MLE} - \theta) \xrightarrow{d} \mathcal{N}_p\left(0, I(\theta)^{-1}\right), \quad \text{as } n \to \infty, \]

where \(\mathcal{N}_p\) is a p-variate normal distribution and \(I(\theta)\) is the Fisher information matrix given by

\[ I(\theta) = E\left[ \left( \frac{\partial}{\partial \theta^T} \log g_X(x; \theta) \right) \left( \frac{\partial}{\partial \theta} \log g_X(x; \theta) \right) \right], \]

which we can rewrite with the use of the score \(s(X; \theta) = \frac{\partial}{\partial \theta} \log g_X(x; \theta)\), so that

\[ I(\theta) = E\left[ s(X; \theta)^T s(X; \theta) \right] = -E\left( \frac{\partial^2 \log g_X(x; \theta)}{\partial \theta^2} \right). \]
The Bernoulli distribution can be used as an example. Let \(X_i \sim \text{Bernoulli}(p)\), i = 1, ..., n, so that the p.m.f. is given by \(p_X(x; p) = p^x(1-p)^{1-x}\) and \(\log p_X(x; p) = x \log p + (1-x)\log(1-p)\). The score becomes

\[ s(X; p) = \frac{X}{p} - \frac{1-X}{1-p}, \]

and

\[ -s'(X; p) = \frac{X}{p^2} + \frac{1-X}{(1-p)^2}. \]

So

\[ I(p) = -E\left( s'(X; p) \right) = \frac{1}{p(1-p)}, \]

and thus \(Var(\hat p) = I(p)^{-1} = p(1-p)\). □

² The parameter of interest could of course be vector valued. Throughout the thesis we will make no difference in notation or in use of the term between a parameter that is vector valued and a parameter that is not.
³ These regularity conditions are for the most part smoothness conditions on g(x; θ) [31] and will throughout the thesis be considered fulfilled.
Since the variance of the MLE reaches \(I(\theta)^{-1}\) asymptotically, and since \(Var(\hat\theta) \geq I(\theta)^{-1}\) holds in general for unbiased estimators, there does not exist an unbiased estimator with lower asymptotic variance. We call this property of the MLE efficiency.
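The Bernoulli example above can be checked numerically. The following is a small sketch (not part of the thesis; the sample size and the true p are arbitrary choices) showing the closed-form MLE and the Fisher information quantities:

```python
import random

random.seed(0)
n, p_true = 10_000, 0.3
x = [1 if random.random() < p_true else 0 for _ in range(n)]

# The Bernoulli log-likelihood l(p) = sum(x) log p + (n - sum(x)) log(1 - p)
# is maximized at the sample mean, so the MLE has a closed form.
p_hat = sum(x) / n

# Fisher information per observation: I(p) = 1 / (p (1 - p)); the
# asymptotic variance of sqrt(n) (p_hat - p) is therefore p (1 - p).
fisher_info = 1.0 / (p_hat * (1.0 - p_hat))
asy_var = p_hat * (1.0 - p_hat)
print(p_hat, fisher_info, asy_var)
```

With 10,000 draws, p_hat lands within a few standard errors of 0.3, and by construction fisher_info is the reciprocal of asy_var, mirroring the identity derived above.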
So far we have assumed that the functional form of the p.m.f. or p.d.f. is known. However, there are situations when this is not a reasonable assumption. When the true model is unknown we have to use an estimation model, and we thereby run the risk of misspecifying the model. A natural question then becomes what happens to the MLE under model misspecification. To investigate this, we start by defining a measure of the discrepancy of the estimation model from the true model, called the Kullback-Leibler information criterion (KLIC).
2.2 Kullback-Leibler Information Criterion
Before defining the KLIC, we first briefly discuss what is meant by information. It is closely related to what is sometimes called the "surprise" of an event and the information theory formalized by Shannon [26]. Say that you flip an unfair coin with probability 0.2 of receiving heads and probability 0.8 of receiving tails. The message that you will receive heads thus gives you a lot of information. A message that you will receive tails does not give you that much information; with a probability of 0.8, tails is almost what you expect. With this reasoning we can say that if some event is very unlikely to occur, the message that it will occur gives us a lot of information, and vice versa. We could use \(I_p = \log(1/p)\), where p denotes the probability of an event, as an information function; the information decreases with increasing p and vice versa, just as in our example regarding the coin flip. If the probability of the event changes from p to q we could measure the information value of the change by \(I_p - I_q\), where \(I_q = \log(1/q)\), so that we get \(I_p - I_q = \log(q/p)\). Expressed as an expected information value we have

\[ E(I_p - I_q) = \sum_{i=1}^{n} q_i \log\left( \frac{q_i}{p_i} \right). \tag{2.1} \]
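As a quick numerical check of the coin-flip reasoning above (a sketch; the probabilities 0.2 and 0.8 are taken from the example):

```python
import math

# "Surprise" of an event with probability p: I_p = log(1/p).
def info(p):
    return math.log(1.0 / p)

heads, tails = 0.2, 0.8
# The unlikely message (heads) carries more information than the
# expected one (tails), matching the discussion above.
print(info(heads), info(tails))
```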
Kullback and Leibler [15] used Shannon's ideas of information to define a divergence measure, or information criterion, that measures the difference between two probability distributions g and f.

Definition 1. The KLIC of f from g is defined as

\[ D(g : f) = E_g\left[ \log\left( \frac{g(x)}{f(x; \theta)} \right) \right]. \tag{2.2} \]
Note that the expectation is taken with respect to g and that \(D(g : f(x; \theta)) \geq 0\). It can further be shown that \(D(g : f(x; \theta)) = 0\) if and only if f = g. The KLIC can be seen as the information that is lost when using f, the probability distribution that we have assumed, to approximate g, the true and unknown probability distribution that generates the data. Renyi [22] showed that the KLIC, as in the opening example of this section, can be thought of in terms of information, i.e. the KLIC can be seen as the information gained when carrying out an experiment on a random phenomenon. White [34] describes this as the information gained when the experimenter learns that the observed phenomenon is described by g and not f, which was the initial belief.
For the continuous case we can write the KLIC as

\[ D(g : f(x; \theta)) = \int g(x) \log\left[ \frac{g(x)}{f(x; \theta)} \right] dx \tag{2.3} \]
\[ = \int g(x) \log(g(x)) \, dx - \int g(x) \log(f(x; \theta)) \, dx, \tag{2.4} \]

where the similarity with Equation 2.1 is apparent. The KLIC is not a metric since \(D(f : g) \neq D(g : f)\), i.e. the distance from f to g is not the same as the distance from g to f, meaning that the KLIC cannot be used as a goodness-of-fit measure in the usual sense. A simple example of how the KLIC is calculated is given in Example 1.
Example 1
Assume that the true model g(x) that generates the data is a standard normal distribution, N(0, 1), and that we misspecify the model using f(x), the density function of a normal distribution with mean 2 and variance 1, N(2, 1). This situation is illustrated in Figure 2.1.
Figure 2.1: The densities for Example 1: the correct model (mean 0, sd 1) and the wrong model (mean 2, sd 1).
One way of quantifying the distance between the models in Figure 2.1 is to calculate the
KLIC, which for this case is given by
\[
\begin{aligned}
D(g : f(x; \theta)) &= \int_{-\infty}^{\infty} g(x) \log\left( \frac{g(x)}{f(x)} \right) dx \\
&= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) \log \frac{ \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{x^2}{2} \right) }{ \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(x-2)^2}{2} \right) } \, dx \\
&= \int_{-\infty}^{\infty} g(x) \log\left( e^{\frac{1}{2}(x-2)^2 - \frac{x^2}{2}} \right) dx \\
&= \int_{-\infty}^{\infty} g(x)(2 - 2x) \, dx \\
&= 2 \int_{-\infty}^{\infty} g(x) \, dx - 2 \int_{-\infty}^{\infty} x g(x) \, dx \\
&= 2 \times 1 - 2 \times 0 \\
&= 2.
\end{aligned}
\]
As can be seen in Example 1, the calculation of the KLIC requires that g is known, which is not the case in this thesis. There have been suggestions of how to estimate the KLIC, see for instance [28, 19].
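Although g is unknown in practice, Example 1 specifies both densities, so the integral can be checked numerically. The sketch below approximates D(g : f) with a Riemann sum (the truncation at ±10 and the step size are arbitrary choices):

```python
import math

def g(x):  # true density, N(0, 1)
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def f(x):  # misspecified density, N(2, 1)
    return math.exp(-(x - 2.0) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

# D(g : f) = integral of g(x) log(g(x)/f(x)) dx, approximated on [-10, 10];
# the closed-form calculation in Example 1 gives exactly 2.
h = 0.001
klic = sum(g(-10.0 + i * h) * math.log(g(-10.0 + i * h) / f(-10.0 + i * h)) * h
           for i in range(20_000))
print(klic)
```

The sum lands very close to the exact value 2, confirming the hand calculation.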
Viewing the KLIC as a goodness-of-fit measure, we can think of a situation where we would like to compare two estimation models f₁ and f₂ against the true model g to evaluate which one gets closest. We could write the mean KLIC difference as

\[ I = D(g : f_1) - D(g : f_2) = \int g \log\left( \frac{g}{f_1} \right) dx - \int g \log\left( \frac{g}{f_2} \right) dx = \int g \log\left( \frac{f_2}{f_1} \right) dx, \]

where the right hand side of the last equality can be estimated using data, even though g is unknown. We then have three potential scenarios:

1. I = 0, meaning that f₁ and f₂ approximate g equally well
2. I > 0, meaning that f₂ is a better approximation of g than f₁
3. I < 0, meaning that f₁ is a better approximation of g than f₂.

From this we can choose the best model, i.e. the model that minimizes the distance to the true model.
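The point that the last integral can be estimated from data without ever evaluating g can be sketched as follows (the three models N(0, 1), N(2, 1) and N(0.5, 1) are illustrative choices, with f₂ constructed to be the closer one):

```python
import math, random

random.seed(1)

def log_norm_pdf(x, mu):
    # log density of N(mu, 1)
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2.0 * math.pi)

# Draws from the true model g = N(0, 1).
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# I = D(g : f1) - D(g : f2) = E_g[log(f2 / f1)] is estimated by a sample
# mean over draws from g; the density g itself is never evaluated.
I_hat = sum(log_norm_pdf(x, 0.5) - log_norm_pdf(x, 2.0) for x in xs) / len(xs)
# Closed form for these models: D(g : N(m, 1)) = m^2 / 2, so
# I = 2 - 0.125 = 1.875 > 0, i.e. f2 = N(0.5, 1) is the better model.
print(I_hat)
```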
2.3 Quasi Likelihood
In contrast to the situation described in Subsection 2.1, there are situations where we do not have any prior knowledge of the true functional form when we want to model a random phenomenon. When this is the case, we have to start by specifying the functional form of an estimation model and then estimate the parameter from it. If the parametric model that we specify includes the true model, the problem of inference reduces to estimating the parameter θ, which we can do consistently with the MLE.

We are interested in the parameter θ in f that needs to be estimated from our observations x, realizations of a random variable X with unknown density function g. Our objective should intuitively be to minimize (2.2) by an appropriate choice of θ. Indeed, Akaike [1] argued that a natural estimator of θ would be the parameter that minimizes the KLIC, i.e. the parameter minimizing the distance between the true and the false density. We define this parameter as
\[ \theta^* = \operatorname*{arg\,min}_{\theta \in \Theta} E\left[ \log\left( \frac{g(x)}{f(x; \theta)} \right) \right]. \tag{2.5} \]
By comparing equations 2.3 and 2.4, and since \(D(g : f) \geq 0\), we see that choosing θ to minimize (2.3) is the same as choosing θ to maximize

\[ \tilde{L}(\theta) = \int \log f(x; \theta) \, g(x) \, dx = E(\log f(X; \theta)). \]

By the law of large numbers, \(\tilde{L}(\theta)\) can be approximated by the sample average of \(\log f(X_i; \theta)\), so our minimization problem for (2.3) reduces to

\[ \max_{\theta \in \Theta} L_n(X; \theta) \equiv n^{-1} \sum_{i=1}^{n} \log f(X_i; \theta), \tag{2.6} \]

where we call the solution to (2.6) the quasi-maximum likelihood estimator (QMLE). White [33] has shown that the solution of (2.6) exists and is unique, and has furthermore given the following key result.
Theorem 2. \(\hat\theta_n \xrightarrow{a.s.} \theta^*\) as \(n \to \infty\), where \(\hat\theta_n\) is the parameter vector that solves \(\max_{\theta \in \Theta} L_n(X; \theta)\). □
So if our objective is to find a parameter estimate that minimizes the KLIC, Theorem 2 establishes that this is indeed what we are doing when we use the QMLE. White [33] calls the QMLE the estimator that "...minimizes our ignorance about the true structure", and he⁴ furthermore showed that

\[ \sqrt{n}(\hat\theta - \theta^*) \xrightarrow{d} \mathcal{N}(0, C(\theta^*)). \tag{2.7} \]
To define C(θ) for a parameter θ we first need to define the Hessian

\[ A_{jk}(\theta) = E\left( \frac{\partial^2 \log f(X_i; \theta)}{\partial \theta_j \, \partial \theta_k} \right) \]

and the outer product of the gradient,

\[ B_{jk}(\theta) = E\left( \frac{\partial \log f(X_i; \theta)}{\partial \theta_j} \cdot \frac{\partial \log f(X_i; \theta)}{\partial \theta_k} \right). \]

Now,

\[ C(\theta) = A(\theta)^{-1} B(\theta) A(\theta)^{-1}, \tag{2.8} \]

and we furthermore have that \(C(\hat\theta) \xrightarrow{a.s.} C(\theta^*)\). C(θ) is often estimated with the so-called sandwich estimator. If we specify the parametric family correctly, \(-A(\theta) = B(\theta)\), meaning that \(C(\theta) = -A(\theta)^{-1} = B(\theta)^{-1}\), where \(B(\theta)^{-1} = I(\theta)^{-1}\); thus the sandwich estimator reduces to \(I(\theta)^{-1}\), giving the efficient variance of the MLE. We will return to this estimator in Section 5.
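A one-parameter sketch of the sandwich idea (an illustration with assumed values, not the thesis's own computation): fit the location model f(x; θ) = N(θ, 1) to data whose true spread is much larger. The naive MLE formula −A(θ)⁻¹ is then badly wrong, while the sandwich C(θ) recovers the correct variance:

```python
import random

random.seed(2)
n = 50_000
# True DGP: N(0, sd = 3); estimation model (misspecified): N(theta, 1).
xs = [random.gauss(0.0, 3.0) for _ in range(n)]

# Under f(x; theta) = N(theta, 1) the QMLE of theta is the sample mean.
theta_hat = sum(xs) / n

# A = E[d^2 log f / d theta^2] = -1 and B = E[(d log f / d theta)^2]
# = E[(X - theta)^2], the latter estimated by its sample analogue.
A_hat = -1.0
B_hat = sum((x - theta_hat) ** 2 for x in xs) / n

C_hat = B_hat / (A_hat * A_hat)  # sandwich: close to the true Var(X) = 9
naive = -1.0 / A_hat             # MLE formula: 1, far too small here
print(C_hat, naive)
```

Here −A = 1 but B ≈ 9, so the "bread" and "meat" of the sandwich disagree, which is exactly the signature of misspecification noted above.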
To illustrate the connection between the KLIC, the MLE and the QMLE we revisit Example 1. A simple illustration of the KLIC is given in Figure 2.2, where it is shown, for a given variance, how the KLIC reaches its minimum value as the estimation model, N(µ, 1), gets closer to the true model N(0, 1), and how the KLIC grows both to the left and to the right of the minimum value.

Figure 2.2: The Kullback-Leibler information criterion plotted for different values of µ for the misspecified model.

⁴ Huber was the first to prove asymptotic normality for maximum likelihood estimators when the estimation model is not necessarily the underlying model that generates the data. White proves this under the same assumptions as those for which he gives his consistency proof. These assumptions are not as general as Huber's, but they are general enough to be used in many situations.
In a sense, Figure 2.2 illustrates the difference between the MLE and the QMLE. The MLE is based on the true model and thus the KLIC is equal to zero, asymptotically. White's result does not say that the QMLE reaches the minimum value of the KLIC, but that the KLIC will be minimized given the data and the misspecification. A natural question is what θ* can be in relation to θ, the true parameter. Can we, even if we misspecify the model, extract some information about the true parameter?
Li and Duan [16] investigate misspecified generalized linear models (GLMs) and state a proportionality result for the coefficient estimates. If Y is the outcome of interest, \(E[Y] = g^{-1}(X\theta)\), where \(g^{-1}\), the inverse of the link function g, connects the linear predictor Xθ to the outcome. Li and Duan give the following result when the link function g is misspecified.

Theorem 3. The estimated coefficients converge almost surely to the true parameter vector times a scalar factor, i.e. \(\hat\theta \xrightarrow{a.s.} \gamma\theta\), where \(\hat\theta\) is given by the QMLE. □
The result of Theorem 3 will from here on be referred to as convergence up to scale. Furthermore, for logistic regression we have convergence up to scale both when we omit a variable from the regression equation [18] and when we misspecify the distribution of the error term [24].

The result of Theorem 3 means that it is possible to get consistent estimates of the ratio of the regression coefficients, since \(\gamma\theta_l / \gamma\theta_m = \theta_l / \theta_m\), \(l \neq m\). This could for example be of interest in applied research where one is interested in the relative effect of two treatments on an outcome.
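The ratio result can be sketched numerically (an illustration with assumed values, not the thesis's own code): data are generated from a probit model with θ = (1, −0.5) and independent standard normal covariates, and then a logistic model, the wrong link, is fitted by Newton-Raphson. The individual QMLE coefficients are inflated by a factor γ, but their ratio stays close to θ₁/θ₂ = −2:

```python
import math, random

random.seed(4)
n = 20_000
theta1, theta2 = 1.0, -0.5  # true probit coefficients (assumed values)

def probit(z):  # standard normal distribution function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

X = [(random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)) for _ in range(n)]
Y = [1 if random.random() < probit(theta1 * a + theta2 * b) else 0
     for a, b in X]

# QMLE: fit a logistic model (misspecified link) by Newton-Raphson.
b1 = b2 = 0.0
for _ in range(25):
    g1 = g2 = h11 = h12 = h22 = 0.0
    for (a, c), y in zip(X, Y):
        p = 1.0 / (1.0 + math.exp(-(b1 * a + b2 * c)))
        w = p * (1.0 - p)
        g1 += (y - p) * a          # gradient of the log-likelihood
        g2 += (y - p) * c
        h11 += w * a * a           # negative Hessian entries
        h12 += w * a * c
        h22 += w * c * c
    det = h11 * h22 - h12 * h12    # solve the 2x2 Newton step by hand
    b1 += (h22 * g1 - h12 * g2) / det
    b2 += (h11 * g2 - h12 * g1) / det

ratio = b1 / b2
print(b1, b2, ratio)  # b is roughly gamma * theta, so the ratio is near -2
```

Independent normal covariates satisfy the design condition of Li and Duan, which is why the proportionality shows up so cleanly here.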
Theorem 3 could be seen in light of the problem that it usually is not enough to base the choice of link function solely on the data [6]. Figure 2.3 illustrates a binary data set with 100 observations where the correct link function, the logit, is compared against the probit link and the complementary log-log link. It is apparent that especially the logit and the probit link functions are close to each other. It will be illustrated in the following section how a misspecified link function can affect the estimated parameters.
Figure 2.3: Fitted probabilities with different link functions. The solid line is the correct logit link function, (1 + exp(−X))⁻¹; the dashed line is the probit link function, φ(X), where φ is the standard normal distribution function; and the dotted line is the complementary log-log link function, 1 − exp(−exp(X)).
3 Simulation study 1
To illustrate the result of Theorem 3 and the convergence result of (2.7), a simulation study is conducted. A logistic function will be specified and estimated using two different estimation models: one that is correctly specified and one that is misspecified. This is performed for three different model misspecifications, of which one is displayed in the first design (Design A) and the two remaining in the second design (Design B). The coefficients for the correctly specified estimation model will be estimated using the MLE and the coefficients for the misspecified estimation models will be estimated using the QMLE. All calculations are performed using the software R [20].
3.1 Design A
Two normally distributed random variables are generated, X1 ∼ N (4, 2) and X2 ∼ N (5, 2).
Also, a Bernoulli distributed random variable is generated, T ∼ Bern(e(X)), where
e(X) =1
(1 + exp(−0.3X1 + 0.24X2))
The misspeci�ed model is
h(X) = φ(β0 + β1X1 + β2X2),
where φ denotes the standard normal distribution function. The scale parameter γ is estimated
by γ1 = β1
β1and γ2 = β2
β2and the estimates are expected to get closer to each other when the
sample size increases. We will use three di�erent sample sizes; n = 100, n = 500 and n = 1000,
each with 1000 replicates.
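One replicate of the Design A data generation can be sketched as follows (Python rather than the thesis's R; whether N(4, 2) denotes standard deviation 2 or variance 2 is not stated, so standard deviation is assumed here, and e(X) is taken exactly as printed above):

```python
import math, random

random.seed(6)
n = 1_000
x1 = [random.gauss(4.0, 2.0) for _ in range(n)]
x2 = [random.gauss(5.0, 2.0) for _ in range(n)]

# Propensity e(X) as printed in Design A, then T ~ Bernoulli(e(X)).
e = [1.0 / (1.0 + math.exp(-0.3 * a + 0.24 * b)) for a, b in zip(x1, x2)]
t = [1 if random.random() < p else 0 for p in e]
print(sum(t) / n)  # share of successes; about one half with these means
```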
3.2 Results A
Figure 3.1 shows QQ-plots for a sample size of n = 500 for the MLE of the two coe�cients. As
stated in Subsection 2.1, the MLE is approximately normally distributed. Since there are no
large deviations from the straight line, there is nothing indicating a violation of the normality
of the estimator. Figure 3.2 shows QQ-plots for the same sample size, but for the QMLE of the
misspeci�ed model. As there are no large deviations from the straight line, there is no sign of
violation of the normality of the estimator. These plots therefore give an empirical illustration
of the normality result of the QMLE estimator.
Figure 3.1: QQ-plots for the MLE of the two coefficients, for 500 observations and 1000 replicates.
Figure 3.2: QQ-plots of the QMLE when the link function is misspecified, for 500 observations and 1000 replicates.
Table 1 gives the true coefficient values, the MLE estimates and the QMLE estimates. Furthermore, the difference between \(\hat\gamma_2\) and \(\hat\gamma_1\) is displayed.
Table 1: Comparison between the correctly specified model and a misspecified model (Design A).

Specification                  n     β1     β2     β̂1      β̂2      γ̂2 − γ̂1
Correctly specified model     100   −0.3   0.24   −0.324   0.246
                              500   −0.3   0.24   −0.299   0.245
                             1000   −0.3   0.24   −0.300   0.240
Misspecified link function    100   −0.3   0.24   −0.196   0.149   −0.034
                              500   −0.3   0.24   −0.183   0.150    0.015
                             1000   −0.3   0.24   −0.183   0.147    0.001
We see that the MLE gives coefficient estimates that get closer to the true coefficients with increasing sample size, and that the estimates coincide with the true values for a sample size of n = 1000 (rounded to three decimals). For the QMLE we see that β1 is overestimated and β2 is underestimated for every sample size. The gamma estimates \(\hat\gamma_1\) and \(\hat\gamma_2\) get closer to each other with increasing sample size and differ only in the third decimal for a sample size of n = 1000, providing an empirical illustration of Theorem 3.
3.3 Design B
In a second simulation design, three uniformly distributed random variables are generated, \(X_1, X_2, X_3 \sim U(0, 1)\). Also, a Bernoulli distributed random variable is generated, \(T \sim \text{Bern}(e(X))\), where

\[ e(X) = \frac{1}{1 + \exp(2X_1 + X_2 - 3X_3)}. \]

Two misspecified models are used,

\[ m(X) = \frac{1}{1 + \exp(\beta_1 X_1 + \beta_2 X_2)} \]

and

\[ n(X) = \phi(\beta_0 + \beta_1 X_1 + \beta_2 X_2), \]

to investigate how the coefficient estimates are affected by excluding a covariate (m) and by choosing an incorrect link function and excluding a covariate (n). Table 2 is constructed in the same manner as Table 1, and the scale parameter is estimated in the same way as in Design A.
3.4 Results B
As in Design A, the coefficient estimates in the QQ-plots in Figures 3.3, 3.4 and 3.5 seem to follow the straight line reasonably well, giving empirical support to the distributional limit results of the MLE and QMLE, respectively.
Figure 3.3: QQ-plots for the MLE of the two coefficients, for 500 observations and 1000 replicates.
Figure 3.4: QQ-plots for the QMLE of the two coefficients, using m(X), for 500 observations and 1000 replicates.
Figure 3.5: QQ-plots for the QMLE of the two coefficients, using n(X), for 500 observations and 1000 replicates.
Table 2 gives the true coefficients, the MLE estimates and the QMLE estimates for both misspecifications. As in Table 1, the difference between the gamma estimates is displayed.
Table 2: Comparison between the correctly specified model and two different misspecified models (Design B).

Specification                       n     β1   β2   β3    β̂1      β̂2      γ̂2 − γ̂1
Correctly specified model          100    2    1   −3    2.281   1.167
                                   500    2    1   −3    2.016   1.038
                                  1000    2    1   −3    2.008   0.995
Omitted variable                   100    2    1   −3    1.884   0.965    0.023
                                   500    2    1   −3    1.717   0.875    0.017
                                  1000    2    1   −3    1.707   0.849   −0.004
Wrong link and omitted variable    100    2    1   −3    1.156   0.593    0.015
                                   500    2    1   −3    1.063   0.540    0.009
                                  1000    2    1   −3    1.058   0.525   −0.004
The MLE estimates get closer to the true coefficients with increasing sample size, and the QMLE underestimates both coefficients for both misspecifications. When using m(X), the scale estimates are close to each other for every sample size and get closer with increasing sample size, again giving an empirical illustration of Theorem 3.

The second misspecified model of Design B, n(X), deals with a situation that is not covered by any theorem, since we both omit a covariate and misspecify the link function. The parameter estimates underestimate the true coefficients, but as has been the case for the other misspecifications, the scale estimates are close for every sample size and get closer with increasing sample size. Hence it seems that we have up-to-scale convergence in this setting as well.

So far we have been concerned with the QMLE and have stated and illustrated some of its characteristics. Next, we will look not only at one type of estimator but at a whole class of estimators.
4 Theory part 2
4.1 M-estimators
A parametric model in general includes two parts, a systematic part and a random part.
For linear regression models the researcher needs to specify both the conditional mean of the
outcome variable given the explanatory variables (the systematic part) and the distribution of
the error term (the random part). For GLMs we also need to specify a link function that connects
the systematic part to the random part; a misspecification of the link function was studied in
Section 3. Both the random part and the systematic part are important when constructing the
likelihood used when conducting inference, and both parts can be misspecified [2]. As has
been pointed out in Section 2, it can be questioned how likely it is to know the complete functional
form of the parametric model but not θ, and so far we have been concerned with the question
of how the inference is affected if the model is misspecified. In addition to the results presented
so far in the thesis, several suggestions have been proposed over the years for dealing with
violations of model assumptions. Huber [10], for instance, introduced the so-called robust statistics, whose
purpose was to adjust classical inference methods so that they would not be sensitive to violations
of the model assumptions, e.g. outliers and departures from normality. His proposed estimator was a special case
of a broader class of estimators. As has been noted, several estimators are given by minimizing a
certain function. The quasi-maximum likelihood estimator, for instance, is given by maximizing
∏_{i=1}^{n} f(X_i; θ), which is equivalent to minimizing −∑_{i=1}^{n} log f(X_i; θ). Huber used this idea and
generalized it so that it covers not only one estimator but a whole class of estimators [10].
If we consider X_1, X_2, ..., X_n that are i.i.d.⁵ random variables with distribution function F, a
1 × p parameter vector θ and a known p × 1 function ψ, independent of i and n, an M-estimator⁶
then satisfies

∑_{i=1}^{n} ψ(X_i, θ) = 0.    (4.1)

⁵ They do not actually have to be identically distributed, but we will restrict ourselves to this case.
⁶ Huber also called the estimator a maximum likelihood type estimator.
We can redefine θ*, the parameter that minimizes the KLIC, using M-estimation theory, so
that θ* is the parameter solving

E_F ψ(X_1, θ*) = ∫ ψ(x, θ*) dF(x) = 0.    (4.2)
If there exists a unique solution to (4.2), then in general θ̂ →p θ* as n → ∞, where θ̂ is the
solution to (4.1) [27]. Furthermore, it can be shown that

√n (θ̂ − θ*) →d N(0, V(θ*))

as n → ∞, where V(θ*) = A(θ*)⁻¹ B(θ*) A(θ*)⁻¹, the sandwich matrix of (2.8). To estimate
V(θ*) we use the empirical sandwich estimator given by

V_n(X; θ̂) = A_n(X; θ̂)⁻¹ B_n(X; θ̂) A_n(X; θ̂)⁻¹    (4.3)
where

A_n(X; θ) = (1/n) ∑_{i=1}^{n} ( −ψ′(X_i, θ) )

and

B_n(X; θ) = (1/n) ∑_{i=1}^{n} ψ(X_i, θ) ψ(X_i, θ)^T.
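As a minimal numerical sketch (not from the thesis; all names and the choice of data are hypothetical), the empirical sandwich estimator (4.3) can be computed directly from the definitions of A_n and B_n. Here we use the scalar location problem ψ(x, θ) = x − θ, whose M-estimator is the sample mean:

```python
import numpy as np

def sandwich_variance(psi, dpsi, x, theta_hat):
    """Empirical sandwich estimator V_n = A_n^{-1} B_n A_n^{-1} for scalar theta."""
    A_n = np.mean(-dpsi(x, theta_hat))      # A_n = (1/n) * sum of -psi'
    B_n = np.mean(psi(x, theta_hat) ** 2)   # B_n = (1/n) * sum of psi^2
    return B_n / A_n ** 2

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=5000)   # skewed data; no parametric model assumed

# Location M-estimator: psi(x, theta) = x - theta, whose root is the sample mean.
theta_hat = x.mean()
V = sandwich_variance(lambda x, t: x - t,
                      lambda x, t: -np.ones_like(x),
                      x, theta_hat)

# For this psi, A_n = 1 and B_n is the sample variance, so V/n is the usual
# model-free (squared) standard error of the mean.
se = np.sqrt(V / len(x))
print(theta_hat, se)
```

For this particular ψ the sandwich estimate reduces to the ordinary variance of the data, which illustrates why the estimator is valid without distributional assumptions.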
Huber [10, 11] derived the asymptotic properties of the M-estimator, and because of its general
form, the class of M-estimators includes several familiar estimators. We will give three examples.
Example 1 - Ordinary least squares

The first example is the least-squares estimator. Consider the linear regression model Y =
Xβ + ε, where Y is the n × 1 response vector, X is an n × p matrix of explanatory
variables measured on n observations, β is a p × 1 coefficient vector and ε is an n × 1 vector
of i.i.d. N(0, σ²) error terms. We estimate the regression coefficients by β̂ = (X^T X)⁻¹ X^T Y. This can also be
rewritten as an M-estimator by letting ψ(Y_i, X_i, β) = (Y_i − X_i^T β) X_i, where X_i^T denotes the i-th row of X, so that we get

∑_{i=1}^{n} ψ(Y_i, X_i, β) = ∑_{i=1}^{n} (Y_i − X_i^T β) X_i = 0,

where we, by solving for β, get that β̂ = (X^T X)⁻¹ X^T Y. □
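The estimating-equation view of least squares can be checked numerically. The following sketch (simulated data; all variable names are hypothetical) verifies that the closed-form solution is a root of ∑(Y_i − X_i^T β) X_i:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 3
X = rng.normal(size=(n, p))                 # design matrix; rows are X_i^T
beta_true = np.array([2.0, 1.0, -3.0])
Y = X @ beta_true + rng.normal(size=n)

# Closed-form least squares: beta_hat = (X^T X)^{-1} X^T Y.
beta_ls = np.linalg.solve(X.T @ X, X.T @ Y)

# M-estimator view: the summed psi is X^T Y - (X^T X) beta, which is linear
# in beta, so its unique root is exactly the least-squares solution.
def psi_sum(beta):
    return X.T @ (Y - X @ beta)

assert np.allclose(psi_sum(beta_ls), 0.0, atol=1e-8)
print(beta_ls)
```

Since the estimating equation is linear in β, root-finding and the normal equations coincide here; for nonlinear ψ (as in the next examples) a numerical solver is needed.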
Example 2 - QMLE

One could think of M-estimation not as a method per se for finding an estimator of an unknown
parameter θ, but rather as a framework: given an estimator, we can ask whether it is an
M-estimator. The QMLE is given by the parameter θ̂ solving max_{θ∈Θ} L_n(Xⁿ, θ) ≡
n⁻¹ ∑_{i=1}^{n} log f(X_i, θ), which is the same as minimizing −L_n(Xⁿ, θ). Thus, by letting
ψ(x, θ) = ∂ log f(x; θ)/∂θ, the maximization problem of the QMLE can be re-expressed as

∑_{i=1}^{n} ψ(X_i, θ) = ∑_{i=1}^{n} ∂ log f(X_i; θ)/∂θ = 0,

meaning that the QMLE indeed is an M-estimator. □
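To make the score-equation view concrete, here is a hedged sketch (simulated data; not from the thesis) in which a Poisson working model is fitted to overdispersed counts. The summed score is an estimating equation of the form (4.1), and its root, the QMLE, targets the KLIC minimizer, which for the Poisson score is the true mean even though the model is wrong:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
# Overdispersed counts: the Poisson "estimation model" is misspecified.
x = rng.negative_binomial(3, 0.3, size=10_000)   # mean = 3 * 0.7 / 0.3 = 7

# Poisson score: psi(x, theta) = d/dtheta log f(x; theta) = x / theta - 1.
def score_sum(theta):
    return np.sum(x / theta - 1.0)

# The QMLE is the root of the summed score, i.e. an M-estimator as in (4.1).
theta_qmle = brentq(score_sum, 1e-6, 100.0)

# Solving sum(x/theta - 1) = 0 gives theta = mean(x), so the KLIC minimizer
# theta* is the true mean despite the misspecified likelihood.
print(theta_qmle, x.mean())
```

This is the same phenomenon as in Subsection 2.3: the estimator converges to the pseudo-true value θ*, which here happens to be an interpretable feature of the true distribution.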
Example 3 - Causal inference

In the statistical theory of causal inference we are interested in estimating the average causal
effect (ACE) of an intervention or treatment on an outcome of interest. If Y_1 denotes the potential
outcome under treatment and Y_0 the potential outcome under nontreatment, the causal effect
would be the difference Y_1 − Y_0.⁷ The fundamental problem of causal inference is that we
wish to estimate the causal effect for every person in the population, which is impossible since
for every individual we only observe either Y_1 or Y_0 [9]. Therefore the outcomes Y_1 and Y_0 are called
potential, and the goal of inference changes to estimating the population average treatment
effect, τ = E(Y_1 − Y_0), which can be identified under certain conditions. Because background
covariates may confound the relationship between the treatment and the outcome, different
estimators have been proposed that take this problem into account. Several estimators use
the so-called propensity score, defined as the conditional probability of receiving the treatment
given the covariates, P(T = 1|X) ≡ e(X), where T is an indicator variable equal to one
if a person has received the treatment and zero if not, and X is a covariate vector. Rosenbaum
and Rubin [23] have shown that given (Y_1, Y_0) ⊥⊥ T | X, it is sufficient to condition on the
propensity score to achieve balance on the covariates between individuals in the different treatment
groups who have the same propensity score, i.e. X ⊥⊥ T | e(X). Usually, the propensity score is
unknown and has to be estimated. One common way is to assume that e(X) can be described
by a parametric model and use logistic regression to estimate it. We express this as
P(T = 1|X) = exp(X^T β) / (1 + exp(X^T β)).

The coefficients can be estimated by e.g. maximum likelihood. Usually the treatment variable T
is modeled as a sequence of independent Bernoulli trials with treatment probability e(X). The
likelihood of the coefficients is then given by

L(β|T) = ∏_{i=1}^{n} e(X_i, β)^{T_i} (1 − e(X_i, β))^{1−T_i}

with log-likelihood

⁷ This theoretical framework is often referred to as Rubin's model [9] after Donald B. Rubin.
l(β|T) = ∑_{i=1}^{n} [ log(1 − e(X_i, β)) + T_i log( e(X_i, β) / (1 − e(X_i, β)) ) ].

To get the coefficient estimates we take the derivative of the log-likelihood function with respect
to β:
∂ log(1 − e(X_i, β)) / ∂β = − ( 1 / (1 − e(X_i, β)) ) ∂e(X_i, β)/∂β
                          = − ( e(X_i, β) / [ e(X_i, β)(1 − e(X_i, β)) ] ) ∂e(X_i, β)/∂β    (4.4)
and

∂/∂β [ T_i log( e(X_i, β) / (1 − e(X_i, β)) ) ] = ( T_i / [ e(X_i, β)(1 − e(X_i, β)) ] ) ∂e(X_i, β)/∂β.    (4.5)
Combining the equations in (4.4) and (4.5), setting the sum equal to zero and solving for β gives
the MLE of the coefficients. Expressed as an M-estimator we have that

∑_{i=1}^{n} ψ(T_i, X_i, β) = ∑_{i=1}^{n} ( (T_i − e(X_i, β)) / [ e(X_i, β)(1 − e(X_i, β)) ] ) ∂e(X_i, β)/∂β = 0.    (4.6)
This is indeed what was done in Simulation 1, where the model (which could have been a model
for e(X)) was misspecified in the link function. Observe, though, that in equation (4.6) it is
assumed that e(X) is correctly specified.
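A short numerical sketch of (4.6) follows (simulated design; all names and coefficient values are hypothetical). For the logistic model, ∂e/∂β = e(1 − e) X_i, so the estimating equation reduces to ∑ (T_i − e(X_i, β)) X_i = 0, which a general-purpose root finder can solve:

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(3)
n = 2000
X = np.column_stack([np.ones(n), rng.uniform(2, 4, size=(n, 2))])
beta_true = np.array([-0.5, 1.5, -1.1])            # hypothetical coefficients
T = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

# For the logistic link, de/dbeta = e(1 - e) X_i, so (4.6) simplifies to
# sum_i (T_i - e(X_i, beta)) X_i = 0.
def psi_sum(beta):
    e = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
    return X.T @ (T - e)

sol = root(psi_sum, x0=np.zeros(3))
beta_hat = sol.x
print(sol.success, beta_hat)
```

Solving the score equation directly like this gives the same fit as a standard logistic regression routine, which is exactly the point of viewing the MLE as an M-estimator.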
When interested in estimating τ, the next step could be to use the estimate of e(X) in
an estimator of the ACE. One such proposed estimator is the inverse probability
weighting (IPW) estimator. If the observed outcome is defined as Y = TY_1 + (1 − T)Y_0, the IPW
estimator is expressed as

τ̂ = (1/n) ∑_{i=1}^{n} [ T_iY_i / e(X_i) − (1 − T_i)Y_i / (1 − e(X_i)) ].    (4.7)
A proof showing that we can estimate τ with the observed data using the IPW estimator is
given in the appendix. This estimator can be expressed as an M-estimator by letting

g(X_i) = T_iY_i / e(X_i) − (1 − T_i)Y_i / (1 − e(X_i)).

We then express the IPW estimator as the solution to ∑ ψ(g(X_i), τ) = ∑ (g(X_i) − τ) = 0, because
breaking τ out of the summation we get that nτ = ∑ g(X_i) and finally that

τ̂ = (1/n) ∑_{i=1}^{n} g(X_i) = (1/n) ∑_{i=1}^{n} [ T_iY_i / e(X_i) − (1 − T_i)Y_i / (1 − e(X_i)) ].
Usually the propensity score is unknown and is estimated with the data at hand. Thus we have
to add this step to the M-estimator expression of the IPW estimator, which gives

∑_{i=1}^{n} ψ(T_i, Y_i, X_i, β, τ) =
( ∑_{i=1}^{n} [ T_iY_i / e(X_i, β) − (1 − T_i)Y_i / (1 − e(X_i, β)) − τ ] )          ( 0 )
( ∑_{i=1}^{n} ( (T_i − e(X_i, β)) / [ e(X_i, β)(1 − e(X_i, β)) ] ) ∂e(X_i, β)/∂β )  =  ( 0 ).    (4.8)
The equation system (4.8) thus includes a parametric part and a nonparametric part, displaying
the flexibility of M-estimation in that it allows equations to be stacked onto each other
to, together, yield the sought estimator. This is called a partial M-estimator [27], and its
properties are similar to the general approach given by Randles [21], which
concerns statistics that contain estimated parameters. □
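The stacked system (4.8) can be sketched numerically (an illustrative sketch, not the thesis code; the data-generating design borrows from Simulation study 2 in Section 5). Since the score equations involve β only, the system is block-triangular: we can solve for β̂ first, read τ̂ off the last equation, and then verify that the full stacked ψ vanishes at the joint solution:

```python
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(4)
n = 4000
X1, X2 = rng.uniform(2, 4, size=(2, n))
X = np.column_stack([np.ones(n), X1, X2])

# Data-generating design borrowed from Simulation study 2 (Section 5):
e_true = 1.0 / (1.0 + np.exp(-0.5 + 1.5 * X1 - 1.1 * X2))
T = rng.binomial(1, e_true)
Y1 = 1 + 3 * X1 + 4 * X2 + rng.normal(size=n)
Y0 = 2 + 2 * X1 + 3 * X2 + rng.normal(size=n)
Y = T * Y1 + (1 - T) * Y0                  # observed outcome; true ACE = 5

def e_model(beta):
    return 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))

def stacked_psi(beta, tau):
    """The stacked estimating equations of (4.8)."""
    e = e_model(beta)
    score = X.T @ (T - e)                  # logistic score part, cf. (4.6)
    ipw = np.sum(T * Y / e - (1 - T) * Y / (1 - e) - tau)
    return np.append(score, ipw)

# Block-triangular structure: the score depends on beta only, so solve it
# first and then read tau off the last (IPW) equation.
beta_hat = root(lambda b: X.T @ (T - e_model(b)), x0=np.zeros(3)).x
e_hat = e_model(beta_hat)
tau_hat = np.mean(T * Y / e_hat - (1 - T) * Y / (1 - e_hat))
print(tau_hat)
```

Stacking the equations like this is what justifies applying the M-estimation asymptotics, including the sandwich variance, to the IPW estimator with an estimated propensity score.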
In Subsection 2.3 we established that the QMLE has the KLIC minimizer as its almost sure limit,
but that this limit is not necessarily the true parameter. Assuming the same (incorrect)
parametric family, but now using M-estimation theory, we in general have a parameter
estimate that is a unique solution to (4.1). We have shown that this estimator could for instance
be the QMLE, and stated that in general θ̂ →p θ*, where θ* is the solution of (4.2). So what is θ*?
Equation (4.2) does not say that θ* is the true parameter, only that it is a solution to this
equation.⁸ So, for a misspecified model, using the fact that the QMLE is an M-estimator, the θ*
given by equation (4.2) will be equal to the KLIC minimizer given by (2.5).
An important consequence covered by M-estimation theory is that, in situations where the model is misspecified,
the variance of the estimator will no longer equal the inverse of the Fisher information. Thus,
we have to use the sandwich estimator given by (4.3) to estimate the variance, which White
[33] showed for the QMLE and Huber [11] for a whole class of M-estimators. To exemplify
how the empirical sandwich estimator works, we revisit simulation design A of Simulation
1, conducted in Section 3. The covariance matrix of the approximated sampling distribution for
the correctly specified model, using a sample size of n = 100, is given by
Σ̂ =
     0.645  −0.054  −0.080
    −0.054   0.018  −0.004
    −0.080  −0.004   0.020

and the empirical sandwich estimate of the covariance matrix for the same sample size for the
misspecified estimation model is given by

Σ̂_sand =
     0.234  −0.021  −0.028
    −0.021   0.006  −0.001
    −0.028  −0.001   0.006.
Clearly the estimated covariance matrices differ. It is important to point out that if the estimated
covariance matrix differs substantially from the empirical sandwich covariance matrix, this is a
clear indication that the model is misspecified and that model diagnostics are called for [14].
As this is the case in the example above, the advice would be to respecify the model to try to
get a better fit. In this way, the sandwich estimator is used as a diagnostic tool.

⁸ It can be shown that the solution is unique.
5 Simulation study 2

In a final simulation study we revisit Example 3 of Section 4 to illustrate the consequences
of model misspecification when estimating the ACE using the IPW estimator. We also
use the fact that the IPW estimator is an M-estimator to calculate the sandwich estimate of
the standard deviation and compare it with the standard deviation of the approximate sampling
distribution. For the IPW estimator to be unbiased for the ACE, e(X) has to be
correctly specified. This means that the simulation study will display the amount of bias that a
misspecified propensity score model introduces into the IPW estimator when estimating the ACE.
5.1 Design

Two uniformly distributed random variables are generated, X_1, X_2 ∼ U(2, 4). The potential
outcomes are

Y_1 = 1 + 3X_1 + 4X_2 + ε_1
Y_0 = 2 + 2X_1 + 3X_2 + ε_0

where ε_t ∼ N(0, 1), t = 0, 1, meaning that τ = E(Y_1 − Y_0) = 5. Using N independent Bernoulli
trials, the treatment variable is generated as T ∼ Bern(e(X_1, X_2)), with the probability of being
treated

P(T = 1|X_1, X_2) = e(X_1, X_2) = 1 / (1 + exp(−0.5 + 1.5X_1 − 1.1X_2)).

A misspecified propensity score model is specified as

q(X_1, X_2) = Φ(β_0 + β_1X_1 + β_2X_2),

where Φ denotes the standard normal distribution function, i.e. a probit link.
The ACE will be estimated using the IPW estimator given by (4.7). Lunceford and Davidian
[17] have derived the matrices A and B for the IPW estimator and stated the sandwich
estimator of its variance as n⁻² ∑_{i=1}^{n} Î_i², where

Î_i = T_iY_i / ê(X_i) − (1 − T_i)Y_i / (1 − ê(X_i)) − τ̂ − (T_i − ê(X_i)) Ĥ^T Ê⁻¹ X_i,

Ĥ = (1/n) ∑_{i=1}^{n} [ T_iY_i (1 − ê(X_i)) / ê(X_i) + (1 − T_i)Y_i ê(X_i) / (1 − ê(X_i)) ] X_i

and

Ê = (1/n) ∑_{i=1}^{n} ê(X_i)(1 − ê(X_i)) X_i X_i^T.
The ACE will, for both propensity score models, be estimated using sample sizes n = 500, n =
1000 and n = 3000, with 1000 replicates for each sample size.
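The design above can be sketched as a small Monte Carlo study (an illustrative re-implementation, not the thesis code; the replicate count is reduced for speed). The propensity score is fitted both with the correct logistic model and with the misspecified probit model, and the IPW bias is averaged over replicates:

```python
import numpy as np
from scipy.optimize import root
from scipy.stats import norm

rng = np.random.default_rng(5)

def one_replicate(n):
    X1, X2 = rng.uniform(2, 4, size=(2, n))
    X = np.column_stack([np.ones(n), X1, X2])
    e = 1.0 / (1.0 + np.exp(-0.5 + 1.5 * X1 - 1.1 * X2))
    T = rng.binomial(1, e)
    Y1 = 1 + 3 * X1 + 4 * X2 + rng.normal(size=n)
    Y0 = 2 + 2 * X1 + 3 * X2 + rng.normal(size=n)
    Y = T * Y1 + (1 - T) * Y0

    def ipw(e_hat):
        e_hat = np.clip(e_hat, 1e-6, 1 - 1e-6)
        return np.mean(T * Y / e_hat - (1 - T) * Y / (1 - e_hat))

    # Correctly specified PS model: logistic score equation.
    def logit_score(b):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ b, -30, 30)))
        return X.T @ (T - p)

    # Misspecified PS model: probit link q(X) = Phi(X beta).
    def probit_score(b):
        z = X @ b
        P = np.clip(norm.cdf(z), 1e-9, 1 - 1e-9)
        return X.T @ (norm.pdf(z) * (T - P) / (P * (1 - P)))

    b_logit = root(logit_score, np.zeros(3)).x
    b_probit = root(probit_score, np.zeros(3)).x
    return (ipw(1.0 / (1.0 + np.exp(-X @ b_logit))),
            ipw(norm.cdf(X @ b_probit)))

est = np.array([one_replicate(500) for _ in range(400)])
bias_true, bias_false = est.mean(axis=0) - 5.0   # true ACE is 5
print(bias_true, bias_false)
```

With this design the correctly specified fit should give bias close to zero, while the probit fit introduces a systematic bias, in line with the results reported in Table 3.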
5.2 Results

Table 3 gives the bias and standard deviation of the IPW estimator using the correctly
specified and the misspecified propensity score model, respectively. For every
sample size we approximate the sampling distribution of the estimator and calculate its standard
deviation (SD). This is compared against the sandwich estimator. In addition,
the mean squared error (MSE) is calculated using the variance of the approximated sampling
distribution.
Table 3: Results for the IPW estimator when misspecifying the link function. SD, standard deviation; SD_sandwich, standard deviation of sandwich estimator; MSE, mean squared error; PS, propensity score.

Simulation 3
n      Specification   Bias     SD      SD_sandwich   MSE
500    PS true        −0.043   0.765   1.116         0.587
       PS false        0.166   0.807   1.129         0.679
1000   PS true        −0.010   0.513   0.547         0.263
       PS false        0.193   0.523   0.553         0.311
3000   PS true         0.001   0.292   0.192         0.085
       PS false        0.212   0.294   0.194         0.131
The results in Table 3 show that the misspecified propensity score model makes the IPW
estimator biased, while the IPW estimator using the correctly specified propensity score model gives
estimates close to the true value. Using the misspecified propensity score model, the bias
of the IPW estimator increases with increasing sample size. Both these results are in accordance with
those given in [29], where it is investigated how the ACE estimate of the IPW estimator (among
others) is affected by different types of misspecifications of e(X).
When e(X) is correctly specified, the standard deviation of the approximated sampling dis-
tribution (SD) of the IPW estimator decreases with increasing sample size. The standard deviation ap-
proximated by the sandwich estimator (SD_sandwich) is greater than SD for sample sizes n = 500
and n = 1000 and smaller for n = 3000. The results are analogous for the case where e(X) is
misspecified. Lastly, we see that the MSE of the IPW estimator is smaller when e(X) is correctly specified,
for every sample size.
Theoretically, for a correctly specified model, the standard deviation estimated by the sandwich estimator should coincide
with that based on the inverse of the Fisher information, the variance limit of the MLE. The distance between
the sandwich estimate and the standard deviation of the approximate sampling distribution decreases
when the sample size increases from n = 500 to n = 1000, but increases when the sample size
goes from n = 1000 to n = 3000. It thus seems like the sandwich estimator
does not converge.
6 Final recommendations

White [33] noted that since −A(θ*) = B(θ*) only when the model is correctly specified, we
can test for model misspecification by testing the null hypothesis A(θ*) + B(θ*) = 0, where
A(θ*) and B(θ*) can be consistently estimated by A(θ̂) and B(θ̂), respectively. He called this the
information matrix test. He furthermore adjusted the Hausman test [8], which essentially measures
the distance between the MLE and the QMLE. Since this distance converges to zero for correctly
specified models but generally not otherwise, it ought to indicate when the QMLE is inconsistent.
We refer to [33] for a formal review and the derived test statistics of the two tests. As stated in
the beginning of this thesis, the theory of misspecified models can be seen as one component
of the theory of statistical modeling. To connect even more to that way of thinking about the
presented theory, we state White's recommendations for building an estimation model:

1. Use maximum likelihood to estimate the parameters of the model.

2. Apply the information matrix test to check for model misspecification.

3. If the null hypothesis is not rejected, we can go on with our MLE estimates; if it is rejected, we
investigate the misspecification with a Hausman test.

4. If the model passes the Hausman test, we apply the robust sandwich estimator; if not, the
model is misspecified to the extent that it needs to be investigated further before conducting
any inference.
7 Discussion

The purpose of this thesis was to review parts of the theory of misspecified parametric models.
The thesis started with a discussion of model misspecification for parametric models using the
KLIC. From that, the QMLE was derived. It was stated that the QMLE is √n-convergent and
also converges almost surely to the parameter that minimizes the KLIC, given the model and the
data. For certain cases, the QMLE also converges up to scale towards the true parameter. These
results were illustrated in Simulation 1, after which a broader class of estimators, the M-estimators,
was introduced and its properties stated. Lastly, a second simulation study was
conducted, illustrating the consequences of model misspecification for a partial M-estimator, the IPW
estimator.
This thesis has stated results leading up to a parameter estimate that has limit θ*. One
of the main questions that this thesis aims to answer is whether we can learn something about θ,
the true parameter that we actually are interested in, even in cases where the estimation model
is misspecified. We have shown three situations where we, in spite of a misspecified model, can
get unbiased estimates of the ratio of the parameters of interest and in that way gain knowledge
about the true parameter. Interestingly, we have by simulation found a situation, not covered
by any theorem, where this result also seems to hold. The up-to-scale convergence result
also implies that misspecification of the link function will not cause any problem for hypothesis
testing when testing whether the coefficient of interest is equal to zero. But the up-to-scale convergence
result has only been stated to hold for certain situations, meaning that there is a lack of results
for other types of misspecifications. If we for example omit a variable in a linear regression
model with normally distributed errors, we might be able to give an unbiased estimate neither
of the true coefficients nor of the coefficient ratio. This means that, even though the results
presented in this thesis contribute to the theory of statistical modeling, we do not have
complete knowledge of what we can and should do when facing a misspecified model. This is thus
possible content for further studies.
As argued in [14], readers should be suspicious when the robust and the standard errors
of the sample differ. It is more reasonable to use the robust covariance matrix as a model check
than as routine. King and Roberts [14] even argue that model diagnostics should be
performed to the extent that the choice between "classical" and robust standard errors does not
matter for the inference to be conducted. This is more restrictive than White's recommendation,
and the author thinks that it is at least important to report the steps made in the analysis
and what eventual restrictions these put on the results.
Another reflection concerns the potential conflict between striving for an efficient estimator
and actually wanting to capture the true parameter value in a confidence interval. Several esti-
mators have desirable properties under the correct model, but might not be robust against model
misspecification. If the estimation model is misspecified, we might instead have an estimator that
converges to another limit, not the true parameter value. Take the QMLE as an example: it
has its limit in θ* and reaches the variance C(θ*), but a confidence interval for the QMLE might
surround the wrong value. An estimator giving a broader interval might instead actually
include the true parameter value. Efficiency is a desirable property of an estimator, but for
misspecified models it could mean that you just get a narrow interval around the wrong value.
Finally, this thesis has shed some light on the very basic assumptions of statistical modeling,
in that we assume a true model that gives a complete description of the data-generating process;
that there, for example, exists a true parameter quantifying the relationship between an
explanatory variable and the outcome of interest. This might not be true, yet we think of it in
this way in order to meaningfully discuss the estimation model specified to gain knowledge about
some phenomenon of interest.
For further studies it would be interesting to investigate when we are able to conduct inference
about the true parameter and when we have to lean on the theory presented in this thesis, i.e.
when do we know that we have misspecified the model to the extent that we are no longer able
to conduct inference about θ?
8 Acknowledgements

I would like to express my gratitude to my supervisor Ingeborg Waernbaum for all her help and
guidance during the work on this thesis. I would also like to thank Laura Pazzagli for her help
with the sandwich estimator.
References

[1] Akaike, H. [1973], Information theory and an extension of the likelihood principle, in 'Proceedings of the Second International Symposium of Information Theory'.

[2] Boos, D. D. and Stefanski, L. A. [2013], Essential Statistical Inference, Springer-Verlag New York.

[3] Cramér, H. [1946], Mathematical Methods of Statistics, Princeton University Press.

[4] Durbin, J. and Watson, G. S. [1950], 'Testing for serial correlation in least squares regression: I', Biometrika 37(3/4), 409–428.

[5] Eicker, F. [1967], Limit theorems for regressions with unequal and dependent errors, in 'Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics', University of California Press, Berkeley, Calif., 59–82.

[6] Faraway, J. J. [2006], Extending the Linear Model with R - Generalized Linear, Mixed Effects and Nonparametric Models, CRC Press.

[7] Fisher, R. A. [1922], 'On the mathematical foundations of theoretical statistics', Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 222(594-604), 309–368.

[8] Hausman, J. A. [1978], 'Specification tests in econometrics', Econometrica 46(6), 1251–1271.

[9] Holland, P. W. [1986], 'Statistics and causal inference', Journal of the American Statistical Association 81(396), 945–960.

[10] Huber, P. J. [1964], 'Robust estimation of a location parameter', The Annals of Mathematical Statistics 35(1), 73–101.

[11] Huber, P. J. [1967], 'The behavior of maximum likelihood estimates under nonstandard conditions'.

[12] Huber, P. J. [1973], 'Robust regression: Asymptotics, conjectures and Monte Carlo', The Annals of Statistics 1(5), 799–821.

[13] Huber, P. J. [1981], Robust Statistics, Wiley, New York.

[14] King, G. and Roberts, M. E. [2014], 'How robust standard errors expose methodological problems they do not fix, and what to do about it', Political Analysis, 1–21.

[15] Kullback, S. and Leibler, R. A. [1951], 'On information and sufficiency', The Annals of Mathematical Statistics 22(1), 79–86.

[16] Li, K.-C. and Duan, N. [1989], 'Regression analysis under link violation', The Annals of Statistics 17(3), 1009–1052.

[17] Lunceford, J. K. and Davidian, M. [2004], 'Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study', Statistics in Medicine 23(19), 2937–2960.

[18] Manski, C. F. [1988], 'Identification of binary response models', Journal of the American Statistical Association 83(403), 729–738.

[19] Perez-Cruz, F. [2008], Kullback-Leibler divergence estimation of continuous distributions, in '2008 IEEE International Symposium on Information Theory', 1666–1670.

[20] R Core Team [2013], R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria. URL: http://www.R-project.org/

[21] Randles, R. H. [1982], 'On the asymptotic normality of statistics with estimated parameters', The Annals of Statistics 10(2), 462–474.

[22] Rényi, A. [1961], On measures of entropy and information, in 'Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics', University of California Press, Berkeley, Calif., 547–561.

[23] Rosenbaum, P. R. and Rubin, D. B. [1983], 'The central role of the propensity score in observational studies for causal effects', Biometrika 70(1), 41–55.

[24] Ruud, P. A. [1983], 'Sufficient conditions for the consistency of maximum likelihood estimation despite misspecification of distribution in multinomial discrete choice models', Econometrica 51(1), 225–228.

[25] Schwarz, G. [1978], 'Estimating the dimension of a model', The Annals of Statistics 6(2), 461–464.

[26] Shannon, C. E. [1948], 'A mathematical theory of communication', Bell System Technical Journal 27(3), 379–423.

[27] Stefanski, L. A. and Boos, D. D. [2002], 'The calculus of M-estimation', The American Statistician 56(1), 29–38.

[28] Viele, K. [2007], 'Nonparametric estimation of Kullback-Leibler information illustrated by evaluating goodness of fit', Bayesian Analysis 2(2), 239–280.

[29] Waernbaum, I. [2012], 'Model misspecification and robustness in causal inference: comparing matching with doubly robust estimation', Statistics in Medicine 31(15), 1572–1581.

[30] Wald, A. [1949], 'Note on the consistency of the maximum likelihood estimate', The Annals of Mathematical Statistics 20(4), 595–601.

[31] Wasserman, L. [2004], All of Statistics - A Concise Course in Statistical Inference, 1st edn, Springer-Verlag New York.

[32] White, H. [1980], 'A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity', Econometrica 48(4), 817–838.

[33] White, H. [1982], 'Maximum likelihood estimation of misspecified models', Econometrica 50(1), 1–25.

[34] White, H. [1996], Estimation, Inference and Specification Analysis, Cambridge University Press, Cambridge.
A Proof of identifiability of the IPW estimator

To prove that the IPW estimator can be identified with the observed data, we want to show
that

E[ TY/e(X) − (1 − T)Y/(1 − e(X)) ] = E[Y_1 − Y_0].

We will use that the observed outcome is defined as Y = TY_1 + (1 − T)Y_0, together with the
following assumptions:

A.1 (Y_1, Y_0) ⊥⊥ T | X

A.2 0 < P(T = 1|X) < 1.

We start by showing that E[ TY/e(X) ] = E[Y_1]:

E[ TY/e(X) ]
  = E[ TY_1/e(X) ]
  = E{ E[ TY_1/e(X) | X ] }
  = E{ (1/e(X)) E[ TY_1 | X ] }
  = E{ (1/e(X)) E[ T | X ] E[ Y_1 | X ] }
  = E{ E[ Y_1 | X ] }
  = E[ Y_1 ],

where the first equality follows from the definition of Y (since TY = TY_1), the second from the
law of total expectation, the fourth from assumption A.1, and the fifth from E[T|X] = e(X),
which is nonzero by A.2. Next we have that

E[ (1 − T)Y/(1 − e(X)) ]
  = E[ (1 − T)Y_0/(1 − e(X)) ]
  = E{ E[ (1 − T)Y_0/(1 − e(X)) | X ] }
  = E{ (1/(1 − e(X))) E[ (1 − T)Y_0 | X ] }
  = E{ (1/(1 − e(X))) E[ (1 − T) | X ] E[ Y_0 | X ] }
  = E{ E[ Y_0 | X ] }
  = E[ Y_0 ].

Thus, the proof is completed. □