Models, estimation and goodness-of-fit
Generalized least squares
Misspecifications and orthogonalization
Linear models and their mathematical foundations: Multiple linear regression, part I

Steffen Unkel
Department of Medical Statistics, University Medical Center Göttingen, Germany

Winter term 2018/19
Introduction
In multiple linear regression, we attempt to predict a continuous (random) response variable y on the basis of an assumed linear relationship with several (fixed) predictor variables x_1, x_2, ..., x_k.

Given a sample of n observations on y and the associated x variables, the n model equations can be written as
\[
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix}
1 & x_{11} & x_{12} & \dots & x_{1k} \\
1 & x_{21} & x_{22} & \dots & x_{2k} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \dots & x_{nk}
\end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}
\]

or, more compactly,

\[
\underset{n \times 1}{\mathbf{y}} = \underset{n \times (k+1)}{\mathbf{X}} \, \underset{(k+1) \times 1}{\boldsymbol{\beta}} + \underset{n \times 1}{\boldsymbol{\varepsilon}} .
\]
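The matrix form lends itself directly to computation. A minimal NumPy sketch of setting up the model equations (the sample size, coefficient values, and noise level are illustrative assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3                                   # n observations, k predictors

# Design matrix: a leading column of ones for the intercept, then the k predictors
predictors = rng.normal(size=(n, k))           # the x_ij values
X = np.column_stack([np.ones(n), predictors])  # shape n x (k+1)

beta = np.array([2.0, 1.0, -0.5, 0.3])         # (beta_0, ..., beta_k), illustrative
eps = rng.normal(scale=0.1, size=n)            # error vector
y = X @ beta + eps                             # all n model equations at once
```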
Assumptions
Model assumptions:
A1 E(ε) = 0.
A2 Cov(ε) = σ²I.

Occasionally, we will make use of the following additional assumption:

A3 ε ∼ N_n(0, σ²I).
For the time being, we assume that for the n × (k+1) design matrix it holds that n > k+1 and rank(X) = k+1.

The β regression coefficients are sometimes referred to as partial regression coefficients.
Least squares estimation of β
To find β̂, we solve the optimization problem min_β ε^⊤ε.

If y = Xβ + ε, where X has size n × (k+1) with n > k+1 and rank(X) = k+1, then the (ordinary) least squares estimator β̂ that minimizes ε^⊤ε is

\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} .
\]
The least squares estimator is derived without any of the assumptions A1–A3.

If β̂ = (X^⊤X)⁻¹X^⊤y, then ε̂ = y − Xβ̂ = y − ŷ is the vector of residuals.
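As a sketch, the estimator can be computed by solving the normal equations X^⊤Xβ̂ = X^⊤y on simulated data (solving is numerically preferable to forming the inverse explicitly); the residual vector is then orthogonal to the columns of X:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 2.0, -1.0])              # illustrative true coefficients
y = X @ beta + rng.normal(scale=0.5, size=n)

# Ordinary least squares: solve (X'X) beta_hat = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_hat = X @ beta_hat                           # fitted values
resid = y - y_hat                              # residual vector eps_hat = y - y_hat

# The residuals lie orthogonal to the column space of X
assert np.allclose(X.T @ resid, 0.0, atol=1e-8)
```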
Basic geometry of least squares
Figure: A general point Xβ in the prediction space.
Basic geometry of least squares (2)
Figure: The right-angled triangle of vectors y, ŷ = Xβ̂ and ε̂ = y − ŷ.
Properties of the least squares estimator β̂
1. If assumption A1 holds, then E(β̂) = β.

2. If assumption A2 holds, then Cov(β̂) = σ²(X^⊤X)⁻¹.
3. Gauss–Markov theorem: If A1 and A2 hold, the least squares estimators β̂_j, j = 0, ..., k, have minimum variance among all linear unbiased estimators; the β̂_j (j = 0, ..., k) are best linear unbiased estimators (BLUE).

Corollary: If A1 and A2 hold, the BLUE of a^⊤β is a^⊤β̂, where β̂ = (X^⊤X)⁻¹X^⊤y.
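Properties 1 and 2 can be checked by simulation: holding the design X fixed and redrawing ε, the empirical mean and covariance of β̂ should match β and σ²(X^⊤X)⁻¹. A sketch with illustrative values (σ = 1, 5000 replications):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, sigma = 40, 2, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # fixed design
beta = np.array([0.5, 1.5, -2.0])                           # illustrative truth

XtX_inv = np.linalg.inv(X.T @ X)
P = XtX_inv @ X.T                       # maps a response y to beta_hat

# Monte Carlo: redraw the errors, re-estimate, and average
reps = 5000
ests = np.empty((reps, k + 1))
for r in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    ests[r] = P @ y

# E(beta_hat) = beta and Cov(beta_hat) = sigma^2 (X'X)^{-1}, up to Monte Carlo error
assert np.allclose(ests.mean(axis=0), beta, atol=0.05)
assert np.allclose(np.cov(ests.T), sigma**2 * XtX_inv, atol=0.05)
```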
Properties (2)
4. If x = (1, x_1, ..., x_k)^⊤ and z = (1, c_1x_1, ..., c_kx_k)^⊤, then ŷ = β̂^⊤x = β̂_z^⊤z, where β̂_z is the least squares estimator from the regression of y on z.

Corollary: The fitted value ŷ is invariant to a full-rank linear transformation on the x variables.
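The invariance in property 4 is easy to verify numerically for a diagonal rescaling z_j = c_j x_j (the data and scaling constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=n)

# Rescale the predictors: z_j = c_j * x_j (c_0 = 1 keeps the intercept column)
c = np.array([1.0, 10.0, 0.25])
Z = X * c

beta_hat_x = np.linalg.solve(X.T @ X, X.T @ y)
beta_hat_z = np.linalg.solve(Z.T @ Z, Z.T @ y)

# Fitted values agree even though the coefficients differ
assert np.allclose(X @ beta_hat_x, Z @ beta_hat_z)
# The coefficients themselves rescale inversely: beta_hat_z,j = beta_hat_x,j / c_j
assert np.allclose(beta_hat_z * c, beta_hat_x)
```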
Estimation of σ²
We estimate σ² by

\[
s^2 = \frac{1}{n-k-1} \sum_{i=1}^{n} (y_i - \mathbf{x}_i^\top \hat{\boldsymbol{\beta}})^2
= \frac{1}{n-k-1} (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^\top (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})
= \frac{\mathbf{y}^\top \mathbf{y} - \hat{\boldsymbol{\beta}}^\top \mathbf{X}^\top \mathbf{y}}{n-k-1}
= \frac{\mathrm{SSE}}{n-k-1} ,
\]

where x_i^⊤ is the ith row of X and SSE = y^⊤y − β̂^⊤X^⊤y.
If A1 and A2 hold, then E(s²) = σ² and an unbiased estimator of Cov(β̂) is Ĉov(β̂) = s²(X^⊤X)⁻¹.
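A sketch of computing s² and the resulting standard errors, including a numerical check of the identity SSE = y^⊤y − β̂^⊤X^⊤y (simulated data with illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(scale=2.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

# s^2 = SSE / (n - k - 1), the unbiased estimator of sigma^2
sse = resid @ resid
s2 = sse / (n - k - 1)

# Quadratic-form identity: SSE = y'y - beta_hat' X'y
assert np.isclose(sse, y @ y - beta_hat @ X.T @ y)

# Estimated covariance matrix of beta_hat and the coefficient standard errors
cov_hat = s2 * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov_hat))
```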
Maximum likelihood estimation
To the assumptions A1 and A2, we now add A3: ε ∼ N_n(0, σ²I).
If y ∼ N_n(Xβ, σ²I) and X is an n × (k+1) design matrix with rank(X) = k+1 < n, then the maximum likelihood estimators (MLEs) of β and σ² are

\[
\hat{\boldsymbol{\beta}}_{\mathrm{MLE}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} , \qquad
\hat{\sigma}^2_{\mathrm{MLE}} = \frac{1}{n} (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^\top (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) .
\]
Whereas β̂_MLE is the same as the least squares estimator, σ̂²_MLE is biased.
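The bias can be seen by simulation: since SSE/σ² ∼ χ²(n−k−1), we have E(σ̂²_MLE) = σ²(n−k−1)/n < σ², while E(s²) = σ². A sketch (β = 0 is taken without loss of generality, because the residuals (I − H)y do not depend on Xβ):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k, sigma2 = 20, 2, 4.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
H = X @ np.linalg.solve(X.T @ X, X.T)   # hat matrix: projection onto col(X)

# Draw many response vectors at once; each row is one replication
reps = 10000
Y = rng.normal(scale=np.sqrt(sigma2), size=(reps, n))
resid = Y - Y @ H                        # H is symmetric, so (H y)^T = y^T H
sse = np.einsum('ij,ij->i', resid, resid)

mle = sse / n                            # sigma2_MLE: divides by n, biased
unbiased = sse / (n - k - 1)             # s^2: divides by n-k-1, unbiased

# E(sigma2_MLE) = sigma^2 (n-k-1)/n, while E(s^2) = sigma^2
assert np.isclose(mle.mean(), sigma2 * (n - k - 1) / n, rtol=0.02)
assert np.isclose(unbiased.mean(), sigma2, rtol=0.02)
```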
Some properties of the MLEs
1. The MLEs β̂_MLE and σ̂²_MLE have the following distributional properties:

(i) β̂_MLE ∼ N_{k+1}(β, σ²(X^⊤X)⁻¹).
(ii) nσ̂²_MLE/σ² ∼ χ²(n − k − 1).
(iii) β̂_MLE and σ̂²_MLE are independent.

2. If y ∼ N_n(Xβ, σ²I), then β̂_MLE and σ̂²_MLE are jointly sufficient statistics for the parameters β and σ².
The multiple linear regression model in centered form
Let x̄_j = ∑_{i=1}^{n} x_{ij}/n (j = 1, ..., k). The centered multiple linear regression model for y is

\[
\mathbf{y} = (\mathbf{1}_n \;\, \mathbf{X}_c) \begin{pmatrix} \alpha \\ \boldsymbol{\beta}_1 \end{pmatrix} + \boldsymbol{\varepsilon} ,
\]

where α = β_0 + β_1x̄_1 + ··· + β_kx̄_k, β_1 = (β_1, ..., β_k)^⊤, X_c = (I_n − n⁻¹1_n1_n^⊤)X_1 and

\[
\mathbf{X}_1 = \begin{pmatrix}
x_{11} & x_{12} & \dots & x_{1k} \\
x_{21} & x_{22} & \dots & x_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \dots & x_{nk}
\end{pmatrix} .
\]
Recall the centering matrix I_n − n⁻¹1_n1_n^⊤.
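The centering matrix is a symmetric, idempotent projection that removes the column means; a quick numerical check on illustrative data:

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 10, 3
X1 = rng.normal(size=(n, k))

# Centering matrix C = I_n - (1/n) 1_n 1_n'
C = np.eye(n) - np.ones((n, n)) / n
Xc = C @ X1

# Each column of Xc has mean zero, i.e. Xc' 1_n = 0
assert np.allclose(Xc.mean(axis=0), 0.0)
# C is symmetric and idempotent (a projection matrix)
assert np.allclose(C, C.T)
assert np.allclose(C @ C, C)
```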
Least squares estimators in the centered model
The least squares estimators are given by

\[
\hat{\alpha} = \bar{y} , \qquad \hat{\boldsymbol{\beta}}_1 = (\mathbf{X}_c^\top \mathbf{X}_c)^{-1} \mathbf{X}_c^\top \mathbf{y} .
\]

The estimators above are the same as β̂ = (X^⊤X)⁻¹X^⊤y with the adjustment

\[
\hat{\beta}_0 = \hat{\alpha} - \hat{\beta}_1 \bar{x}_1 - \dots - \hat{\beta}_k \bar{x}_k = \bar{y} - \hat{\boldsymbol{\beta}}_1^\top \bar{\mathbf{x}} .
\]

We can express ŷ_1, ..., ŷ_n in centered form as follows: ŷ_i = α̂ + β̂_1(x_{i1} − x̄_1) + ··· + β̂_k(x_{ik} − x̄_k).

We can write the error sum of squares as follows:

\[
\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \bar{y})^2 - \hat{\boldsymbol{\beta}}_1^\top \mathbf{X}_c^\top \mathbf{y} = \mathbf{y}^\top \mathbf{y} - \hat{\boldsymbol{\beta}}^\top \mathbf{X}^\top \mathbf{y} .
\]
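A sketch verifying that the centered fit reproduces the uncentered one (simulated data with non-zero predictor means, so that the centering actually matters):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 60, 2
X1 = rng.normal(loc=5.0, size=(n, k))          # predictors with non-zero means
X = np.column_stack([np.ones(n), X1])
y = X @ np.array([3.0, 1.0, -2.0]) + rng.normal(size=n)

# Uncentered fit
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Centered fit: alpha_hat = y_bar, beta1_hat from the centered predictors
Xc = X1 - X1.mean(axis=0)
alpha_hat = y.mean()
beta1_hat = np.linalg.solve(Xc.T @ Xc, Xc.T @ y)

# Slopes agree; the intercept is recovered as y_bar - beta1_hat' x_bar
assert np.allclose(beta1_hat, beta_hat[1:])
assert np.isclose(alpha_hat - beta1_hat @ X1.mean(axis=0), beta_hat[0])
```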
Coefficient of determination
Recall the coefficient of determination:

\[
R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\mathrm{SSR}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} .
\]
For the multiple linear regression model, 0 ≤ R² ≤ 1 can be written as

\[
R^2 = \frac{\hat{\boldsymbol{\beta}}^\top \mathbf{X}^\top \mathbf{y} - n\bar{y}^2}{\mathbf{y}^\top \mathbf{y} - n\bar{y}^2}
= 1 - \frac{\mathbf{y}^\top \mathbf{y} - \hat{\boldsymbol{\beta}}^\top \mathbf{X}^\top \mathbf{y}}{\mathbf{y}^\top \mathbf{y} - n\bar{y}^2}
= \frac{\hat{\boldsymbol{\beta}}_1^\top \mathbf{X}_c^\top \mathbf{X}_c \hat{\boldsymbol{\beta}}_1}{\mathbf{y}^\top \mathbf{y} - n\bar{y}^2} .
\]
Some properties of R2
1. The positive square root of R² is the multiple correlation R between the response and the predictors.
2. The multiple correlation is equal to the simple correlation between the observed y_i's and the fitted ŷ_i's.

3. R² is invariant to full-rank linear transformations on the x's and to a scale change on y.
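Property 2 can be confirmed numerically: the squared correlation between y and ŷ equals R² (illustrative simulated data):

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 80, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

# R^2 = SSR / SST (the fitted values have mean y_bar because of the intercept)
sst = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)
R2 = ssr / sst

# The multiple correlation R equals the simple correlation between y and y_hat
R = np.corrcoef(y, y_hat)[0, 1]
assert np.isclose(R**2, R2)
```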
Adjusted R²

Adding a predictor to the model cannot decrease the value of R².

However, this may conflict with the principle of parsimony.

An adjusted R²_a has been proposed that includes a penalty for adding a predictor variable to the model.
It is defined as

\[
R_a^2 = \frac{\left( R^2 - \frac{k}{n-1} \right)(n-1)}{n-k-1} = \frac{(n-1)R^2 - k}{n-k-1} .
\]
R²_a can be negative.
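A small sketch of the formula, illustrating that R²_a can indeed drop below zero when a weak fit is bought with many predictors (the input values are illustrative):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """R^2_a = ((n - 1) R^2 - k) / (n - k - 1)."""
    return ((n - 1) * r2 - k) / (n - k - 1)

# A weak fit with many predictors is penalized below zero:
# ((19 * 0.05) - 5) / 14 = -4.05 / 14, which is negative
assert adjusted_r2(0.05, n=20, k=5) < 0

# A perfect fit is left untouched: ((29 * 1) - 4) / 25 = 1
assert adjusted_r2(1.0, n=30, k=4) == 1.0
```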
Model setting
We now consider situations for which the assumption A2 is violated.

Instead we impose the assumption Cov(ε) = σ²V, where V ≠ I is a known symmetric positive definite matrix of size n × n.
The matrix V has n diagonal elements and n(n − 1)/2 elements above (or below) the diagonal.

In certain applications, a simpler structure for V (e.g. diagonal) is assumed.
Generalized least squares (GLS) estimators
For the model with Cov(ε) = σ²V, we obtain the following results:

(i) The BLUE of β is β̂ = (X^⊤V⁻¹X)⁻¹X^⊤V⁻¹y.

(ii) The covariance matrix for β̂ is Cov(β̂) = σ²(X^⊤V⁻¹X)⁻¹.

(iii) An unbiased estimator of σ² is

\[
s^2 = \frac{(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^\top \mathbf{V}^{-1} (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})}{n-k-1}
= \frac{\mathbf{y}^\top \left[ \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}(\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{V}^{-1} \right] \mathbf{y}}{n-k-1} ,
\]

where β̂ is given in (i).
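A sketch of the GLS estimator for a diagonal V (heteroscedastic errors, an illustrative choice); it also checks the equivalent "whitening" route, which premultiplies the model by V^(−1/2) and then runs ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 2.0, -1.0])

# Known diagonal V: Var(eps_i) = sigma^2 * v_i (illustrative variances)
v = rng.uniform(0.5, 4.0, size=n)
y = X @ beta + rng.normal(scale=np.sqrt(v))

# GLS: beta_hat = (X' V^{-1} X)^{-1} X' V^{-1} y
Vinv = np.diag(1.0 / v)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# Whitening: divide each row by sqrt(v_i), then ordinary least squares
Xw = X / np.sqrt(v)[:, None]
yw = y / np.sqrt(v)
beta_ols_w = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)

assert np.allclose(beta_gls, beta_ols_w)
```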
Maximum likelihood estimators
For the model with Cov(ε) = σ²V, the maximum likelihood estimators are

\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X})^{-1} \mathbf{X}^\top \mathbf{V}^{-1} \mathbf{y} , \qquad
\hat{\sigma}^2 = \frac{(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^\top \mathbf{V}^{-1} (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})}{n} .
\]
Misspecification of the error structure
Suppose the model is y = Xβ + ε with E(y) = Xβ and Cov(y) = σ²V, and one uses the ordinary least squares estimator β̂_OLS = (X^⊤X)⁻¹X^⊤y to estimate β.

The consequences of using the ordinary least squares estimator on E(β̂_OLS) and Cov(β̂_OLS) for the case that the error structure Cov(ε) = σ²V holds will be discussed in the tutorial.
Model misspecification
Suppose the model is y = Xβ + ε with E(y) = Xβ and Cov(y) = σ²I.
Let the model be partitioned as

\[
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}
= (\mathbf{X}_1 \;\, \mathbf{X}_2) \begin{pmatrix} \boldsymbol{\beta}_1 \\ \boldsymbol{\beta}_2 \end{pmatrix} + \boldsymbol{\varepsilon}
= \mathbf{X}_1 \boldsymbol{\beta}_1 + \mathbf{X}_2 \boldsymbol{\beta}_2 + \boldsymbol{\varepsilon} .
\]
Suppose we leave out X_2β_2 when it should be included, i.e., when β_2 ≠ 0.
By doing so, we misspecify E(y).
Reduced model
We consider estimation of β_1 when underfitting.

We write the reduced model as

\[
\mathbf{y} = \mathbf{X}_1 \boldsymbol{\beta}_1^* + \boldsymbol{\varepsilon}^* ,
\]

using β_1^* to emphasize that these parameters and their estimates β̂_1^* will be different from β_1 and β̂_1, respectively, in the full model.
Fitting the reduced model
If we fit the model y = X_1β_1^* + ε^* when the correct model is y = X_1β_1 + X_2β_2 + ε with Cov(y) = σ²I, then the following results for the least squares estimator β̂_1^* = (X_1^⊤X_1)⁻¹X_1^⊤y can be obtained:

E(β̂_1^*) = will be discussed in the tutorial,
Cov(β̂_1^*) = σ²(X_1^⊤X_1)⁻¹.

Furthermore, Cov(β̂_1) − Cov(β̂_1^*) = σ²AB⁻¹A^⊤, which is a positive definite matrix, where A = (X_1^⊤X_1)⁻¹X_1^⊤X_2 and B = X_2^⊤X_2 − X_2^⊤X_1A. Therefore, Var(β̂_j) > Var(β̂_j^*).
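The variance comparison can be checked numerically: the difference of the two covariance matrices should be positive semidefinite, so each full-model variance is at least as large as its reduced-model counterpart (the design matrices below are illustrative, with σ² = 1):

```python
import numpy as np

rng = np.random.default_rng(10)
n, sigma2 = 40, 1.0
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])  # kept predictors
X2 = rng.normal(size=(n, 2))                            # omitted predictors
X = np.column_stack([X1, X2])

# Cov(beta1_hat) in the full model: top-left block of sigma^2 (X'X)^{-1};
# Cov(beta1_hat*) in the reduced model: sigma^2 (X1'X1)^{-1}
cov_full = sigma2 * np.linalg.inv(X.T @ X)[:2, :2]
cov_red = sigma2 * np.linalg.inv(X1.T @ X1)

# The difference is positive semidefinite, so full-model variances are larger
diff = cov_full - cov_red
assert np.all(np.linalg.eigvalsh(diff) >= -1e-10)
assert np.all(np.diag(cov_full) >= np.diag(cov_red) - 1e-12)
```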
Underfitting and overfitting
Underfitting leads in general to biased results but lower variances.

Overfitting leads to unbiased results but greater variances.

Seek an adequate balance between a biased model and a model with large variances.
Task: find an optimum subset of predictors.
Orthogonalization
Suppose that in the full model y = X_1β_1 + X_2β_2 + ε the columns of X_1 are orthogonal to the columns of X_2, that is, X_1^⊤X_2 = O.

If X_1^⊤X_2 = O, then the least squares estimator β̂_1^* obtained from fitting the reduced model is unbiased: E(β̂_1^*) = β_1.

Moreover, if X_1^⊤X_2 = O, then the estimator of β_1 in the full model is the same as the estimator of β_1^* in the reduced model.
The process of orthogonalization can give additional insights into the meaning of the regression coefficients.
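A sketch of the orthogonal case: X_2 is constructed orthogonal to X_1 by projection, after which the full-model and reduced-model estimates of β_1 coincide (illustrative data):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 30
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])

# Make X2 orthogonal to X1 by projecting random columns off col(X1)
Z = rng.normal(size=(n, 2))
X2 = Z - X1 @ np.linalg.solve(X1.T @ X1, X1.T @ Z)
assert np.allclose(X1.T @ X2, 0.0, atol=1e-10)   # X1' X2 = O

y = X1 @ np.array([1.0, 2.0]) + X2 @ np.array([0.5, -0.5]) + rng.normal(size=n)

# With X1' X2 = O, X'X is block diagonal, so the beta_1 estimates agree
X = np.column_stack([X1, X2])
beta_full = np.linalg.solve(X.T @ X, X.T @ y)
beta_red = np.linalg.solve(X1.T @ X1, X1.T @ y)
assert np.allclose(beta_full[:2], beta_red)
```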