
STAT 540: Data Analysis and Regression

Wen Zhou

http://www.stat.colostate.edu/~riczw/

Email: [email protected]

Department of Statistics

Colorado State University

Fall 2015

W. Zhou (Colorado State University) STAT 540 July 6th, 2015 1 / 62


Contents

1 Multiple Linear Regression Model

2 Inference on Multiple Regression

3 Inference about Regression Parameters

4 Estimation and Prediction

5 Geometric View of Regression and Linear Models

6 Estimating Estimable Functions of the Coefficient


Multiple Linear Regression I

Multiple linear regression model

1 Multiple linear regression model in matrix terms

2 Estimation of regression coefficients

Inference

1 ANOVA results

2 Inference about regression parameters

3 Estimation of mean response and prediction of new observation

Inference about regression parameters

Estimation and prediction

Geometric interpretation of linear model and regression

Estimating estimable functions of the regression coefficient β


1 Multiple Linear Regression Model


Multiple Linear Regression

Example: # of predictor variables = 2.

Yi = β0 + β1Xi1 + β2Xi2 + εi, εi ∼ iid N(0, σ2),

for i = 1, . . . , n.

Response surface:

E(Yi) = β0 + β1Xi1 + β2Xi2

Example:

- Y = Pine bark beetle density
- X1 = Temperature
- X2 = Tree species


Interpretation of Coefficients

β0: Intercept. When the model scope includes X1 = X2 = 0,

- β0 is interpreted as the mean response E(Y) at X1 = X2 = 0.

βj: Slope in the direction of Xj (the effect of Xj).

- ∂E(Y)/∂Xj = βj
- E(Y | X1 = x1 + 1, X2 = x2) − E(Y | X1 = x1, X2 = x2) = β1

Interpreted as the change in the mean response E(Y) per unit increase in Xj, when the other predictors X−j are held constant.

What if Xj is qualitative?
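The slope interpretation can be checked numerically. The following is a minimal sketch; numpy and the simulated data are assumptions for illustration, not part of the slides:

```python
import numpy as np

# Simulate the two-predictor model Y = b0 + b1*X1 + b2*X2 + eps and check
# the slope interpretation: raising X1 by one unit (X2 fixed) changes the
# fitted mean by exactly beta1-hat.
rng = np.random.default_rng(0)
n = 200
X1 = rng.uniform(0, 10, n)
X2 = rng.uniform(0, 5, n)
y = 2.0 + 1.5 * X1 - 0.7 * X2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), X1, X2])         # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # least squares fit

x1, x2 = 3.0, 2.0
delta = (beta_hat @ [1, x1 + 1, x2]) - (beta_hat @ [1, x1, x2])
print(np.isclose(delta, beta_hat[1]))  # True
```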


Multiple Linear Regression

A “general” linear regression model is, for i = 1, . . . , n,

Yi = β0 + Σ_{j=1}^{p} Xij βj + εi,   εi ∼ iid N(0, σ²).

Response surface:

E(Yi) = β0 + Σ_{j=1}^{p} Xij βj

Regression coefficients: β0, β1, . . . , βp−1, βp.

Predictor variables: X1, . . . , Xp are known constants/values.

The model is linear in the parameters, not necessarily in the shape of the

response surface.


Response Surface Examples

Polynomial regression

E(Y) = β0 + β1X + β2X² + β3X³.

Transformed variables

E(log(Y )) = β0 + β1X1 + β2√X2.

Interaction effects

E(Y ) = β0 + β1X1 + β2√X2 + β3X1X2.

- The change in the mean response corresponding to a unit change in X1 depends on X2, and vice versa.
- Testing whether β3 = 0 or not is very challenging in the high-dimensional setting (n = o(p)).
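Polynomial regression illustrates “linear in the parameters”: the response surface is curved, but the fit is ordinary least squares on the columns 1, X, X², X³. A minimal sketch (numpy and the simulated data are assumptions, not from the slides):

```python
import numpy as np

# Cubic polynomial regression is still a *linear* model, because
# E(Y) = b0 + b1*X + b2*X^2 + b3*X^3 is linear in the coefficients.
rng = np.random.default_rng(1)
x = np.linspace(-2, 2, 100)
y = 1 - 2 * x + 0.5 * x**2 + 0.25 * x**3 + rng.normal(0, 0.2, x.size)

X = np.column_stack([np.ones_like(x), x, x**2, x**3])  # cubic design matrix
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)       # ordinary least squares
print(np.round(beta_hat, 2))
```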


Qualitative Predictor Variables

Example: Let Y = length of hospital stay, X1 = age, and X2 = gender: 0 for

male and 1 for female.

- An additive model is Yi = β0 + β1Xi1 + β2Xi2 + εi.
- Thus the response surface for males (Xi2 = 0) is E(Yi) = β0 + β1Xi1, and for females (Xi2 = 1) it is E(Yi) = (β0 + β2) + β1Xi1.
- β2 is the difference in mean response between females and males at any fixed age.

This kind of model is sometimes called an ANCOVA model.


Qualitative Predictor Variables

Interaction: the relationship between X1 and Y for a fixed value of X2 = x2

depends on x2.

An interaction model is Yi = β0 + β1Xi1 + β2Xi2 + β3Xi1Xi2 + εi.

Thus the response surface for males (Xi2 = 0) is E(Yi) = β0 + β1Xi1, and for females (Xi2 = 1) it is E(Yi) = (β0 + β2) + (β1 + β3)Xi1.


Notation

n observations, one response variable, and p − 1 predictors, hence p regression coefficients (i.e. the intercept β0 is the pth).

Response variable: Yn×1 = (Y1, Y2, . . . , Yn)T .

The predictors are arranged in the design matrix

Xn×p =
[ 1  X11  X12  · · ·  X1,p−1 ]
[ 1  X21  X22  · · ·  X2,p−1 ]
[ ⋮   ⋮    ⋮           ⋮     ]
[ 1  Xn1  Xn2  · · ·  Xn,p−1 ]

Random error: εn×1 = (ε1, ε2, . . . , εn)T .

Regression coefficients: βp×1 = (β0, β1, . . . , βp−1)T .


Multiple Linear Regression Model in Matrix Terms

The multiple linear regression model can be written as

Y = Xβ + ε,

where, as we have seen before,

E(ε) = 0n×1,   Var{ε} = σ² In×n.

Thus, E(Y) = Xβ and Var{Y} = σ² I, and

Y ∼ N(Xβ, σ² I).


Least Squares Estimation

Consider the criterion:

Q = Σ_{i=1}^{n} (Yi − β0 − Σ_{j=1}^{p−1} βj Xij)² = (Y − Xβ)ᵀ(Y − Xβ).

The least squares estimate of β is

β̂ = (XᵀX)⁻¹XᵀY,

assuming that XᵀX is invertible.

- This is also the MLE.
- What condition on X do we need for XᵀX to be invertible?
- What if XᵀX is not invertible?
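The normal-equations formula can be computed directly and compared with a library solver. A minimal sketch (numpy and the simulated data are assumptions, not from the slides):

```python
import numpy as np

# Compute the least squares estimate beta-hat = (X'X)^{-1} X'y directly
# and compare it with np.linalg.lstsq, which is numerically preferred.
rng = np.random.default_rng(2)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + rng.normal(0, 0.5, n)

beta_normal_eq = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # QR/SVD-based solver
print(np.allclose(beta_normal_eq, beta_lstsq))  # True
```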


Fitted Values and Residuals

Fitted values: Ŷ = Xβ̂ = X(XᵀX)⁻¹XᵀY = HY,

where the hat matrix is H = X(XᵀX)⁻¹Xᵀ.

Residuals: e = Y − Ŷ = (I − H)Y.
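The stated properties of the hat matrix can be verified numerically. A minimal sketch (numpy and the simulated data are assumptions, not from the slides):

```python
import numpy as np

# Build the hat matrix H = X (X'X)^{-1} X' and check the projection
# properties behind the fitted-values/residuals decomposition above.
rng = np.random.default_rng(3)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)  # hat matrix
y_hat = H @ y                          # fitted values
e = y - y_hat                          # residuals = (I - H) y

print(np.allclose(H @ H, H))       # idempotent: HH = H -> True
print(np.allclose(H, H.T))         # symmetric -> True
print(np.isclose(e @ y_hat, 0.0))  # residuals orthogonal to fitted -> True
```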


2 Inference on Multiple Regression


Sums of Squares

We have the sums of squares in matrix form:

SSR = Σ_{i=1}^{n} (Ŷi − Ȳ)² = Yᵀ(H − (1/n)J)Y

SSE = Σ_{i=1}^{n} (Yi − Ŷi)² = Yᵀ(I − H)Y

SSTO = Σ_{i=1}^{n} (Yi − Ȳ)² = Yᵀ(I − (1/n)J)Y,

where J = 11ᵀ is the n × n matrix of ones.

Partitioning of the total sum of squares, and in particular of the degrees of freedom:

SSTO (df = n − 1) = SSR (df = p − 1) + SSE (df = n − p).
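The partition can be verified numerically. A minimal sketch (numpy and the simulated data are assumptions, not from the slides):

```python
import numpy as np

# Verify SSTO = SSR + SSE for a fitted multiple regression.
rng = np.random.default_rng(4)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(0, 1, n)

H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y

ssto = np.sum((y - y.mean()) ** 2)
ssr = np.sum((y_hat - y.mean()) ** 2)  # mean(y_hat) = mean(y) with an intercept
sse = np.sum((y - y_hat) ** 2)
print(np.isclose(ssto, ssr + sse))  # True
```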


Mean Squares

Define the mean squares

MSR = SSR/(p − 1),   MSE = SSE/(n − p).

It can be shown that E(MSE) = σ².

It can also be shown that

E(MSR) = σ² if βj = 0 for all j ≥ 1, and E(MSR) > σ² otherwise.


ANOVA Table

The ANOVA table is

Source       SS     df      MS                 F
Regression   SSR    p − 1   MSR = SSR/(p−1)    F = MSR/MSE
Error        SSE    n − p   MSE = SSE/(n−p)
Total        SSTO   n − 1

If β1 = · · · = βp−1 = 0, then

E(MSE) = E(MSR) = σ²,

in which case MSR/MSE ≈ 1.


Overall F Test for Regression Relation

Test

H0 : β1 = · · · = βp−1 = 0   vs.   Ha : not all βj (j ≥ 1) equal zero.

- It can be shown that under H0,

  F* = MSR/MSE ∼ F(p − 1, n − p).

- Thus we can perform an F-test at level α by the decision rule: reject H0 if F* > F(1 − α; p − 1, n − p).

Conditional on H0 being rejected, we may want to find

S = {j : βj ≠ 0}

(exactly or a.s.) – identification/selection.
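The F statistic is cheap to compute from the sums of squares. A minimal sketch (numpy and the simulated data are assumptions, not from the slides; the critical value would come from an F table):

```python
import numpy as np

# Compute the overall F statistic F* = MSR/MSE. Under H0 it follows
# F(p-1, n-p); here we only compute F* and compare it to an approximate
# table value F(0.95; 2, 57) ~ 3.16.
rng = np.random.default_rng(5)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(0, 1, n)

H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y
ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
f_star = (ssr / (p - 1)) / (sse / (n - p))
print(f_star > 3.2)  # True: the strong beta1 signal makes F* very large
```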


Coefficient of Multiple Determination, R2

The coefficient of multiple determination is denoted by R2 and is defined as

R² = SSR/SSTO = 1 − SSE/SSTO

Interpretation: The proportion of variation in the Yi’s explained by the

regression relation.


More on R2

As more predictors are added to the model (p ↑), R2 must increase. Why?

- Recall SSTO = SSR + SSE. SSTO is fixed for Y, while SSE is the minimum of the unconstrained convex optimization problem β̂ = argmin SSE(β0, . . . , βp−1).
- Suppose we consider an extra predictor and thus consider SSE(β0, . . . , βp). The β that minimizes this SSE cannot be inferior to the previous minimizer, because βp = 0 is a special case within the new minimization problem that incorporates the previous one.


Adjusted R2

R² depends on p (even for p ≪ n); how do we remove that dependence?

The adjusted coefficient of multiple determination is denoted by R²a and is defined as

R²a = 1 − (SSE/(n − p)) / (SSTO/(n − 1)) = 1 − ((n − 1)/(n − p)) · (SSE/SSTO).

The adjusted coefficient of multiple determination R²a may decrease when more predictors are in the model.

Many other statistics, such as AIC, BIC, and Mallows' Cp, will be discussed; they are superior to R²a.
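The monotonicity of R² can be seen numerically by adding a pure-noise predictor. A minimal sketch (numpy and the simulated data are assumptions, not from the slides):

```python
import numpy as np

# R^2 never decreases when a predictor is added, even a useless one;
# adjusted R^2 applies a degrees-of-freedom penalty.
rng = np.random.default_rng(6)
n = 40
x1 = rng.normal(size=n)
noise_pred = rng.normal(size=n)  # predictor unrelated to y
y = 1.0 + 2.0 * x1 + rng.normal(0, 1, n)

def r2_and_adj(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ beta) ** 2)
    ssto = np.sum((y - y.mean()) ** 2)
    n, p = X.shape
    return 1 - sse / ssto, 1 - (n - 1) / (n - p) * sse / ssto

r2_s, adj_s = r2_and_adj(np.column_stack([np.ones(n), x1]), y)
r2_b, adj_b = r2_and_adj(np.column_stack([np.ones(n), x1, noise_pred]), y)
print(r2_b >= r2_s)  # R^2 cannot decrease
```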


3 Inference about Regression Parameters


Estimation of Regression Coefficients

The mean satisfies

E(β̂) = β.

That is, the LS estimate β̂ is an unbiased estimator of β.

Variance-covariance matrix:

Σβ̂ := Var{β̂} = σ² (XᵀX)⁻¹.

- (Σβ̂)kk = Var{β̂k}
- (Σβ̂)kl = Cov{β̂k, β̂l}, k ≠ l


Inference about Regression Coefficients

The estimated variance-covariance matrix is

Σ̂β̂ := s²{β̂} = MSE · (XᵀX)⁻¹ =

[ s²{β̂0}        s{β̂0, β̂1}    · · ·  s{β̂0, β̂p−1} ]
[ s{β̂1, β̂0}    s²{β̂1}        · · ·  s{β̂1, β̂p−1} ]
[ ⋮              ⋮                     ⋮            ]
[ s{β̂p−1, β̂0}  s{β̂p−1, β̂1}  · · ·  s²{β̂p−1}     ]

Under the multiple linear regression model, we have

(β̂k − βk)/s{β̂k} ∼ tn−p,   for k = 0, 1, . . . , p − 1.


Inference about Regression Coefficients

Thus the (1 − α) confidence interval for βk is

β̂k ± t1−α/2;n−p · s{β̂k}.

Test H0 : βk = βk0 versus Ha : βk ≠ βk0. Under H0, we have

t* = (β̂k − βk0)/s{β̂k} ∼ tn−p.

Thus we can perform a t-test at level α by the decision rule: reject H0 if |t*| > t1−α/2;n−p.
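Standard errors and t statistics follow directly from s²{β̂} = MSE·(XᵀX)⁻¹. A minimal sketch (numpy and the simulated data are assumptions, not from the slides):

```python
import numpy as np

# Standard errors and t statistics for each coefficient, using the
# estimated variance-covariance matrix MSE * (X'X)^{-1}.
rng = np.random.default_rng(7)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(0, 1, n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
mse = np.sum((y - X @ beta_hat) ** 2) / (n - p)
se = np.sqrt(mse * np.diag(XtX_inv))  # s{beta_k-hat}
t_stats = beta_hat / se               # t* for H0: beta_k = 0
print(np.round(t_stats, 1))
```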


4 Estimation and Prediction


Estimation of Mean Response–Hidden Extrapolation

Define Xh = (1, Xh1, . . . , Xh,p−1)T .

Caution about hidden extrapolations.

- The region (with respect to X0) defined by

  d(X0) = X0ᵀ(XᵀX)⁻¹X0 ≤ hmax,

  where hmax = maxᵢ hᵢᵢ, is an ellipsoid enclosing all data points, the “regressor variable hull” (RVH).
- A prediction at any X0 outside the RVH (i.e., with d(X0) > hmax) is a hidden extrapolation, at least to some degree.
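The RVH check is a one-line computation once the leverages are available. A minimal sketch (numpy and the simulated design are assumptions, not from the slides):

```python
import numpy as np

# Flag hidden extrapolation by comparing d(x0) = x0'(X'X)^{-1}x0 with
# h_max, the largest leverage h_ii among the observed data points.
rng = np.random.default_rng(8)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0, 1, n), rng.uniform(0, 1, n)])

XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # leverages h_ii
h_max = h.max()

def hidden_extrapolation(x0):
    return x0 @ XtX_inv @ x0 > h_max

print(hidden_extrapolation(np.array([1.0, 0.5, 0.5])))   # near centroid: False
print(hidden_extrapolation(np.array([1.0, 5.0, -3.0])))  # far outside: True
```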


Estimation of Mean Response

The estimated mean response corresponding to Xh is Ŷh = Xhᵀβ̂.

- Mean: E(Ŷh) = Xhᵀβ = E(Yh).
- Variance: Var{Ŷh} = σ² Xhᵀ(XᵀX)⁻¹Xh.
- Estimated variance: s²{Ŷh} = MSE · Xhᵀ(XᵀX)⁻¹Xh.


Confidence Intervals for Mean Response

The (1 − α) confidence interval for E(Yh) is

Ŷh ± t1−α/2;n−p · s{Ŷh}.

The Working-Hotelling (1 − α) confidence band for the regression surface is

Ŷh ± W · s{Ŷh}, where W² = pF(1 − α; p, n − p).

The Bonferroni (1 − α) joint confidence intervals for g mean responses are

Ŷh ± B · s{Ŷh}, where B = t1−α/(2g);n−p.


Prediction of New Observation

The predicted new observation corresponding to Xh is Ŷh = Xhᵀβ̂, and

- Mean: E(Ŷh) = Xhᵀβ = E(Yh(new)).
- Prediction error variance: σ²pred = Var(Ŷh − Yh(new)) = σ² (1 + Xhᵀ(XᵀX)⁻¹Xh).
- Estimated prediction error variance: s²{pred} = MSE · (1 + Xhᵀ(XᵀX)⁻¹Xh).


Prediction Intervals for New Observation

The (1 − α) prediction interval for Yh(new) is

Ŷh ± t1−α/2;n−p · s{pred}.

The Scheffé (1 − α) joint prediction intervals for g new observations are

Ŷh ± S · s{pred}, where S² = gF(1 − α; g, n − p).

The Bonferroni (1 − α) joint prediction intervals for g new observations are

Ŷh ± B · s{pred}, where B = t1−α/(2g);n−p.
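The “+1” in the prediction-error variance is exactly the extra MSE term that widens prediction intervals relative to confidence intervals. A minimal sketch (numpy and the simulated data are assumptions, not from the slides):

```python
import numpy as np

# The estimated prediction-error variance exceeds the estimated
# mean-response variance by exactly MSE (the "+1" term).
rng = np.random.default_rng(9)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 1, n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
mse = np.sum((y - X @ beta_hat) ** 2) / (n - 2)

xh = np.array([1.0, 0.3])
var_mean = mse * (xh @ XtX_inv @ xh)      # s^2{Y_h-hat}
var_pred = mse * (1 + xh @ XtX_inv @ xh)  # s^2{pred}
print(np.isclose(var_pred - var_mean, mse))  # True
```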


5 Geometric View of Regression and Linear Models


Geometric Viewpoint: The Column Space of the Design Matrix

Xβ is a linear combination of the columns of X:

Xβ = [x1, . . . , xp] (β1, . . . , βp)ᵀ = β1x1 + · · · + βpxp

The set of all possible linear combinations of the columns of X is called the

column space of X and is denoted by

C(X) = {Xa : a ∈ Rp}

The Gauss-Markov linear model says y is a random vector whose mean is in

the column space of X and whose variance is σ2I for some positive real

number σ2, i.e.

E(y) ∈ C(X) and Var(y) = σ2I, σ2 ∈ R+


An Example Column Space

X = (1, 1)ᵀ ⇒ C(X) = {Xa : a ∈ R¹}
= { (1, 1)ᵀ a1 : a1 ∈ R }
= { a1 (1, 1)ᵀ : a1 ∈ R }
= { (a1, a1)ᵀ : a1 ∈ R }


Another Example Column Space

X =
[ 1 0 ]
[ 1 0 ]
[ 0 1 ]
[ 0 1 ]

⇒ C(X) = { X (a1, a2)ᵀ : a ∈ R² }
= { a1 (1, 1, 0, 0)ᵀ + a2 (0, 0, 1, 1)ᵀ : a1, a2 ∈ R }
= { (a1, a1, 0, 0)ᵀ + (0, 0, a2, a2)ᵀ : a1, a2 ∈ R }
= { (a1, a1, a2, a2)ᵀ : a1, a2 ∈ R }


Another Example Column Space

X1 =
[ 1 0 ]
[ 1 0 ]
[ 0 1 ]
[ 0 1 ]
,  X2 =
[ 1 1 0 ]
[ 1 1 0 ]
[ 1 0 1 ]
[ 1 0 1 ]

x ∈ C(X1) ⇒ x = X1 a for some a ∈ R²
⇒ x = X2 (0, a1, a2)ᵀ for some a = (a1, a2)ᵀ ∈ R²
⇒ x = X2 b for some b ∈ R³
⇒ x ∈ C(X2)

Thus

C(X1) ⊂ C(X2)


Another Example Column Space (continued)

x ∈ C(X2) ⇒ x = X2 a for some a ∈ R³

⇒ x = a1 (1, 1, 1, 1)ᵀ + a2 (1, 1, 0, 0)ᵀ + a3 (0, 0, 1, 1)ᵀ for some a ∈ R³

⇒ x = (a1 + a2, a1 + a2, a1 + a3, a1 + a3)ᵀ for some a1, a2, a3 ∈ R

⇒ x = X1 (a1 + a2, a1 + a3)ᵀ for some a1, a2, a3 ∈ R


Another Example Column Space (continued)

⇒ x = X1 (a1 + a2, a1 + a3)ᵀ for some a1, a2, a3 ∈ R
⇒ x = X1 b for some b ∈ R²
⇒ x ∈ C(X1)

Thus, C(X2) ⊂ C(X1), as we have shown C(X1) ⊂ C(X2). It follows that

C(X1) = C(X2).


Estimation of E(y)

A fundamental goal of linear model analysis is to estimate E(y).

We could, of course, use y itself to estimate E(y): y is obviously an unbiased estimator of E(y), but it is often not a very sensible estimator.

For example, suppose

(y1, y2)ᵀ = (1, 1)ᵀ μ + (ε1, ε2)ᵀ,

and we observe y = (6.1, 2.3)′.

Should we estimate E(y) = (μ, μ)′ by y = (6.1, 2.3)′?


Estimation of E(y)

The Gauss-Markov linear model says that E(y) ∈ C(X), so we should use that information when estimating E(y).

Consider estimating E(y) by the point in C(X) that is closest to y (as measured by the usual Euclidean distance).

This unique point is called the orthogonal projection of y onto C(X) and is denoted by ŷ (although it could be argued that Ê(y) might be better notation).

By definition, ‖y − ŷ‖ = min_{z ∈ C(X)} ‖y − z‖, where ‖a‖ = (Σ_{i=1}^{n} ai²)^{1/2}.


Geometric Viewpoint on Multiple Regression (and LM)

Geometrically, how do we minimize the distance between Y and C(X)?

- That point is Ŷ = Xβ̂ = HY, the orthogonal projection of Y onto C(X).
- The vector between Y and Xβ̂ is the residual e = Y − Xβ̂, and the distance is ‖e‖.

For R²: if we add another predictor, C(X) gains one more dimension, so ‖e‖ can only decrease.

- Note: if dim(C(X)) = n, then Ŷ = Y, e = 0, and R² = 1.


Orthogonal Projection Matrices

It can be shown that, as we did for least squares estimators, for all y ∈ Rⁿ, ŷ = PX y is optimal, i.e.

ŷ = PX y is the best estimator of E(y) in the class of linear unbiased estimators,

where the unique matrix PX = H is the hat matrix, called the orthogonal projection matrix. It satisfies

- HH = H (idempotent)
- H = H′ (symmetric)
- HX = X and X′H = X′ (Why? Intuitively, projecting the columns of X onto C(X) leaves them unchanged.)

If (X′X) is not invertible, we use a generalized inverse (X′X)⁻, where a generalized inverse A⁻ satisfies AA⁻A = A.

H is invariant to the choice of (X′X)⁻, which is itself not unique.

ŷ and y − ŷ are orthogonal. (Why?)
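The projection properties hold even for a rank-deficient design, using any generalized inverse. A minimal sketch (numpy's pseudoinverse supplies one particular generalized inverse; the design below is an assumption for illustration):

```python
import numpy as np

# With a rank-deficient design, H = X (X'X)^- X' is still the orthogonal
# projection onto C(X); np.linalg.pinv gives one generalized inverse.
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0]])  # rank 2, not full column rank

G = np.linalg.pinv(X.T @ X)     # a generalized inverse of X'X
H = X @ G @ X.T

print(np.allclose(H @ H, H))  # idempotent -> True
print(np.allclose(H, H.T))    # symmetric -> True
print(np.allclose(H @ X, X))  # HX = X -> True
```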


An Example Orthogonal Projection

Suppose (y1, y2)ᵀ = (1, 1)ᵀ μ + (ε1, ε2)ᵀ, and we observe y = (6.1, 2.3)′. Then, with X = (1, 1)ᵀ,

X(X′X)⁻¹X′ = (1, 1)ᵀ [ (1, 1) (1, 1)ᵀ ]⁻¹ (1, 1)

= (1, 1)ᵀ [2]⁻¹ (1, 1)

= (1, 1)ᵀ (1/2) (1, 1)

= (1/2) ·
[ 1 1 ]
[ 1 1 ]

=
[ 1/2 1/2 ]
[ 1/2 1/2 ]


An Example Orthogonal Projection

Thus, the orthogonal projection of y = (6.1, 2.3)′ onto the column space of X = (1, 1)ᵀ is

PX y = Hy =
[ 1/2 1/2 ] [ 6.1 ]   [ 4.2 ]
[ 1/2 1/2 ] [ 2.3 ] = [ 4.2 ]
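The same computation can be reproduced in a few lines (numpy is an assumed tool; the numbers are the slides' own example):

```python
import numpy as np

# Project y = (6.1, 2.3)' onto C(X) for X = (1, 1)': the projection
# averages the two coordinates, giving (4.2, 4.2)'.
X = np.array([[1.0], [1.0]])
y = np.array([6.1, 2.3])

H = X @ np.linalg.inv(X.T @ X) @ X.T  # hat matrix, all entries 1/2
y_proj = H @ y
print(np.allclose(y_proj, [4.2, 4.2]))  # True
```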


Geometric illustration

Suppose X = (1, 2)ᵀ and y = (2, 3/4)ᵀ.


The angle between ŷ and the residual y − ŷ is 90°, hence “orthogonal projection”.


6 Estimating Estimable Functions of the Coefficient


What if X is not full column rank?

If XᵀX is not invertible, then (XᵀX)⁻¹ has to be replaced by a generalized inverse (XᵀX)⁻.

If X is not of full column rank, then there are infinitely many vectors in the set {b : Xb = Xβ} for any fixed value of β. Thus, no matter what the value of E(y) is, there will be infinitely many vectors b such that Xb = E(y) when X is not of full column rank.

Our response vector y can help us learn about E(y) = Xβ, but when X is NOT of full column rank, there is NO hope of learning about β alone unless additional information about β is available.

However, we can still estimate estimable functions of β.


Treatment Effects Model

Researchers randomly assigned a total of six experimental units to two treatments

and measured a response of interest.

yij = µ+ τi + εij , i = 1, 2; j = 1, 2, 3

(y11, y12, y13, y21, y22, y23)′ = (μ + τ1, μ + τ1, μ + τ1, μ + τ2, μ + τ2, μ + τ2)′ + (ε11, ε12, ε13, ε21, ε22, ε23)′

Question: what are X and β?


Treatment Effects Model (continued)

In this case, it makes no sense to estimate β = [µ, τ1, τ2]′, because there are multiple (infinitely many, in fact) choices of β that define the same mean for y. For example,

(µ, τ1, τ2)′ = (5, −1, 1)′,  (0, 4, 6)′,  (999, −995, −993)′

all yield the same Xβ = E(y).

When multiple values for β define the same E(y), we say that β is

non-estimable.


Estimable Functions of β

A linear function of β, Cβ, is said to be estimable if there is a linear function of y, say Ay, that is an unbiased estimator of Cβ. If no such linear function exists, Cβ is non-estimable.

Note that Ay is an unbiased estimator of Cβ if and only if

E(Ay) = Cβ for all β ∈ Rᵖ  ⇔  AXβ = Cβ for all β  ⇔  AX = C.

This says that we can estimate Cβ as long as Cβ = AXβ = AE(y) for some A, i.e. as long as Cβ is a linear function of E(y).

The bottom line is that we can always estimate E(y) and all linear functions of E(y); all other linear functions of β are non-estimable.
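Estimability of c′β is equivalent to c′ lying in the row space of X, which is easy to check numerically via a rank test (the equivalence and the rank test are standard facts, stated here as an assumption rather than taken from the slides):

```python
import numpy as np

# c'b is estimable iff c' is in the row space of X, i.e. appending c'
# to X does not increase the rank. Check this for the treatment
# effects design used in the slides.
X = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 0, 1],
              [1, 0, 1]], dtype=float)

def is_estimable(c):
    return np.linalg.matrix_rank(np.vstack([X, c])) == np.linalg.matrix_rank(X)

print(is_estimable(np.array([0.0, 1.0, -1.0])))  # tau1 - tau2: True
print(is_estimable(np.array([0.0, 1.0, 0.0])))   # tau1 alone: False
```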


Treatment Effects Model (continued)

Xβ =
[ 1 1 0 ]              [ µ + τ1 ]
[ 1 1 0 ]   [ µ  ]     [ µ + τ1 ]
[ 1 1 0 ]   [ τ1 ]  =  [ µ + τ1 ]
[ 1 0 1 ]   [ τ2 ]     [ µ + τ2 ]
[ 1 0 1 ]              [ µ + τ2 ]
[ 1 0 1 ]              [ µ + τ2 ]

so that

[1, 0, 0, 0, 0, 0] Xβ = [1, 1, 0] β = µ + τ1
[0, 0, 0, 1, 0, 0] Xβ = [1, 0, 1] β = µ + τ2
[1, 0, 0, −1, 0, 0] Xβ = [0, 1, −1] β = τ1 − τ2

are estimable functions of β.


Estimating Estimable Functions of β

If Cβ is estimable, then there exists a matrix A such that C = AX and

Cβ = AXβ = AE(y) for any β ∈ Rᵖ.

It makes sense to estimate Cβ by

Cβ̂ := Aŷ = APX y = AX(X′X)⁻X′y,

which is unbiased, since

E(Cβ̂) = AX(X′X)⁻X′Xβ = APX Xβ = AXβ = Cβ.

Cβ̂ is called the Ordinary Least Squares (OLS) estimator of Cβ.

Note that although the “hat” is on β, it is Cβ that we are estimating.

Invariance of Cβ̂ to the choice of β̂: although there are infinitely many solutions to the normal equations when X is not of full column rank, Cβ̂ is the same for all normal-equation solutions β̂ whenever Cβ is estimable (STAT 640).


Treatment Effects Model (continued)

Suppose our aim is to estimate τ1 − τ2. As noted before,

[1, 0, 0, −1, 0, 0] Xβ = [0, 1, −1] β = τ1 − τ2.

Thus, we can compute the OLS estimator of τ1 − τ2 as

[1, 0, 0, −1, 0, 0] ŷ = [0, 1, −1] β̂,

where β̂ is any solution to the normal equations.


Treatment Effects Model (continued)

The normal equations in this case are X′Xb = X′y, with X the 6 × 3 design matrix above. They reduce to

[ 6 3 3 ] [ b1 ]   [ y.. ]
[ 3 3 0 ] [ b2 ] = [ y1. ]
[ 3 0 3 ] [ b3 ]   [ y2. ]

where y.. = Σij yij is the grand total and y1. = Σj y1j, y2. = Σj y2j are the treatment totals.


Treatment Effects Model (continued)

β̂1 = ( ȳ.. , ȳ1. − ȳ.. , ȳ2. − ȳ.. )′   and   β̂2 = ( 0 , ȳ1. , ȳ2. )′

are both solutions to the normal equations (check this).

Thus, the OLS estimator of Cβ = [0, 1, −1]β = τ1 − τ2 is

Cβ̂1 = [0, 1, −1] ( ȳ.. , ȳ1. − ȳ.. , ȳ2. − ȳ.. )′ = ȳ1. − ȳ2. = [0, 1, −1] ( 0 , ȳ1. , ȳ2. )′ = Cβ̂2

HW: Can you find two different generalized inverses of (X′X), A1 and A2, with (X′X)Ai(X′X) = (X′X) (so that Ai = (X′X)⁻ for each i), which give you β̂1 and β̂2, respectively?
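The invariance claim can be checked directly: both solutions satisfy the normal equations and agree on the estimable contrast. A minimal sketch (numpy and the simulated data are assumptions, not from the slides):

```python
import numpy as np

# Two different normal-equation solutions for the treatment effects
# model give the same estimate of the estimable function tau1 - tau2.
rng = np.random.default_rng(10)
X = np.repeat(np.array([[1, 1, 0], [1, 0, 1]], dtype=float), 3, axis=0)
y = X @ np.array([5.0, -1.0, 1.0]) + rng.normal(0, 1, 6)

ybar1, ybar2 = y[:3].mean(), y[3:].mean()  # treatment means
ybar = y.mean()                            # grand mean
beta1 = np.array([ybar, ybar1 - ybar, ybar2 - ybar])  # solution 1
beta2 = np.array([0.0, ybar1, ybar2])                 # solution 2

# Both satisfy the normal equations X'X b = X'y ...
print(np.allclose(X.T @ X @ beta1, X.T @ y))  # True
print(np.allclose(X.T @ X @ beta2, X.T @ y))  # True
# ... and agree on the estimable contrast C = [0, 1, -1]:
c = np.array([0.0, 1.0, -1.0])
print(np.isclose(c @ beta1, c @ beta2))       # True
```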


The Gauss-Markov Theorem

Under the Gauss-Markov linear model, the OLS estimator c′β̂ of an estimable linear function c′β is the unique Best Linear Unbiased Estimator (BLUE), in the sense that Var(c′β̂) is strictly less than the variance of any other linear unbiased estimator of c′β, for all β ∈ Rᵖ and all σ² ∈ R⁺.

The Gauss-Markov Theorem says that if we want to estimate an estimable

linear function c′β using a linear estimator that is unbiased, we should always

use the OLS estimator.

In our simple example of the treatment effects model, we could have used y11 − y21 to estimate τ1 − τ2. It is easy to see that y11 − y21 is a linear estimator that is unbiased for τ1 − τ2, but its variance is clearly larger than the variance of the OLS estimator ȳ1. − ȳ2. (as guaranteed by the Gauss-Markov Theorem).
