
Regression Analysis

Demetris Athienitis

Department of Statistics,

University of Florida

Contents

0 Review
  0.1 Random Variables and Probability Distributions
    0.1.1 Expected value and variance
    0.1.2 Covariance
    0.1.3 Mean and variance of linear combinations
  0.2 Central Limit Theorem
  0.3 Inference for Population Mean
    0.3.1 Confidence intervals
    0.3.2 Hypothesis tests
  0.4 Inference for Two Population Means
    0.4.1 Independent samples
    0.4.2 Paired data

1 Simple Linear Regression
  1.1 Model
  1.2 Parameter Estimation
    1.2.1 Regression function
    1.2.2 Variance

2 Inferences in Regression
  2.1 Inferences concerning β0 and β1
  2.2 Inferences involving E(Y) and Ypred
    2.2.1 Confidence interval on the mean response
    2.2.2 Prediction interval
    2.2.3 Confidence Band for Regression Line
  2.3 Analysis of Variance Approach
    2.3.1 F-test for β1
    2.3.2 Goodness of fit
  2.4 Normal Correlation Models

3 Diagnostics and Remedial Measures
  3.1 Diagnostics for Predictor Variable
  3.2 Checking Assumptions
    3.2.1 Graphical methods
    3.2.2 Significance tests
  3.3 Remedial Measures
    3.3.1 Box-Cox (Power) transformation
    3.3.2 Lowess (smoothed) plots

4 Simultaneous Inference and Other Topics
  4.1 Controlling the Error Rate
    4.1.1 Simultaneous estimation of mean responses
    4.1.2 Simultaneous predictions
  4.2 Regression Through the Origin
  4.3 Measurement Errors
    4.3.1 Measurement error in the dependent variable
    4.3.2 Measurement error in the independent variable
  4.4 Inverse Prediction
  4.5 Choice of Predictor Levels

5 Matrix Approach to Simple Linear Regression
  5.1 Special Types of Matrices
  5.2 Basic Matrix Operations
    5.2.1 Addition and subtraction
    5.2.2 Multiplication
  5.3 Linear Dependence and Rank
  5.4 Matrix Inverse
  5.5 Useful Matrix Results
  5.6 Random Vectors and Matrices
    5.6.1 Mean and variance of linear functions of random vectors
    5.6.2 Multivariate normal distribution
  5.7 Estimation and Inference in Regression
    5.7.1 Estimating parameters by least squares
    5.7.2 Fitted values and residuals
    5.7.3 Analysis of variance
    5.7.4 Inference

6 Multiple Regression I
  6.1 Model
  6.2 Special Types of Variables
  6.3 Matrix Form

7 Multiple Regression II
  7.1 Extra Sums of Squares
    7.1.1 Definition and decompositions
    7.1.2 Inference with extra sums of squares
  7.2 Other Linear Tests
  7.3 Coefficient of Partial Determination
  7.4 Standardized Regression Model
  7.5 Multicollinearity

9 Model Selection and Validation
  9.1 Data Collection Strategies
  9.2 Reduction of Explanatory Variables
  9.3 Model Selection Criteria
  9.4 Regression Model Building
    9.4.1 Backward elimination
    9.4.2 Forward selection
    9.4.3 Stepwise regression
  9.5 Model Validation

10 Diagnostics
  10.1 Outlying Y Observations
  10.2 Outlying X-Cases
  10.3 Influential Cases
    10.3.1 Fitted values
    10.3.2 Regression coefficients
  10.4 Multicollinearity

11 Remedial Measures

12 Autocorrelation in Time Series

Chapter 0

Review

In regression the emphasis is on finding links/associations between two or more variables. For two variables, a scatterplot can help in visualizing the association.

Example 0.1. A small study with 7 subjects on the pharmacodynamics of LSD, examining how LSD tissue concentration affects the subjects' math scores, yielded the following data.

Score 78.93 58.20 67.47 37.47 45.65 32.92 29.97
Conc.  1.17  2.97  3.26  4.69  5.83  6.00  6.41

Table 1: Math score with LSD tissue concentration


Figure 1: Scatterplot of Math score vs. LSD tissue concentration

http://www.stat.ufl.edu/~athienit/STA4210/scatterplot.R


Before we begin we will need to grasp some basic concepts.

0.1 Random Variables and Probability Distributions

Definition 0.1. A random variable is a function that assigns a numerical

value to each outcome of an experiment. It is a measurable function from a

probability space into a measurable space known as the state space.

It is an outcome characteristic that is unknown prior to the experiment.

For example, an experiment may consist of tossing two dice. One poten-

tial random variable could be the sum of the outcome of the two dice, i.e.

X= sum of two dice. Then, X is a random variable. Another experiment

could consist of applying different amounts of a chemical agent and a poten-

tial random variable could consist of measuring the amount of final product

created in grams.

Quantitative random variables can either be discrete, meaning they have a countable set of possible values, or continuous, meaning they have uncountably infinitely many possible values.

Notation: For a discrete random variable (r.v.) X , the probability distribu-

tion is the probability of a certain outcome occurring, denoted as

P (X = x) = pX(x).

This is also called the probability mass function (p.m.f.).

Notation: For a continuous random variable (r.v.) X , the probability den-

sity function (p.d.f.), denoted by fX(x), models the relative frequency of X .

Since there are infinitely many outcomes within an interval, the probability evaluated at any single point is always zero, i.e. P(X = x) = 0, ∀x, for X a continuous r.v.

Conditions for a function to be a:

• p.m.f.: 0 ≤ p(x) ≤ 1 and ∑_{∀x} p(x) = 1

• p.d.f.: f(x) ≥ 0 and ∫_{−∞}^{∞} f(x) dx = 1

Example 0.2. (Discrete) Suppose a storage tray contains 10 circuit boards,

of which 6 are type A and 4 are type B, but they both appear similar. An

inspector selects 2 boards for inspection. He is interested in X = number of

type A boards. What is the probability distribution of X?

The sample space of X is {0, 1, 2}. We can calculate the following:

p(2) = P (A on first)P (A on second|A on first)

= (6/10)(5/9) = 0.3333

p(1) = P (A on first)P (B on second|A on first)

+ P (B on first)P (A on second|B on first)

= (6/10)(4/9) + (4/10)(6/9) = 0.5333

p(0) = P (B on first)P (B on second|B on first)

= (4/10)(3/9) = 0.1334

Consequently,

X = x p(x)

0 0.1334

1 0.5333

2 0.3333

Total 1.0

Table 2: Probability Distribution of X
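As a quick check, the same probabilities follow from the hypergeometric distribution in R:

# X = number of type A boards when drawing 2 boards from 6 type A and 4 type B
x <- 0:2
p <- dhyper(x, m = 6, n = 4, k = 2)
round(p, 4)   # 0.1333 0.5333 0.3333
sum(p)        # equals 1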


Example 0.3. (Continuous) The lifetime of a certain battery has a distribution that can be approximated by f(x) = 0.5 e^(−0.5x), x > 0.

Figure 2: Probability density function of battery lifetime (in hundreds of hours).

Normal

The normal distribution (Gaussian distribution) is by far the most important

distribution in statistics. The normal distribution is identified by a location

parameter µ and a scale parameter σ2(> 0). A normal r.v. X is denoted as

X ∼ N(µ, σ2) with p.d.f.

f(x) = [1/(σ√(2π))] e^{−(x−µ)²/(2σ²)},   −∞ < x < ∞


Figure 3: Density function of N(0, 1).

It is symmetric, unimodal, bell shaped with E(X) = µ and V (X) = σ2.


Notation: A normal random variable with mean 0 and variance 1 is called a

standard normal r.v. It is usually denoted by Z ∼ N(0, 1). The c.d.f. of a

standard normal is given at the end of the textbook, is available online, but

most importantly has a built in function in software. Note that probabilities,

which can be expressed in terms of c.d.f, can be conveniently obtained.

Example 0.4. Find P (−2.34 < Z < −1). From the relevant remark,

P (−2.34 < Z < −1) = P (Z < −1)− P (Z < −2.34)

= 0.1587− 0.0096

= 0.1491
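In R the same probability is obtained directly from the standard normal c.d.f.:

pnorm(-1) - pnorm(-2.34)   # 0.1491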

Notation: Recall that ∫ f(t) dt arises as the limit of ∑ f(t_i)∆_i. Hence, for the following definitions and expressions we will only use notation for continuous variables; for the discrete case, wherever you see "∫" simply replace it with "∑".

0.1.1 Expected value and variance

The expected value of a r.v. is thought of as the long term average for that variable. Similarly, the variance is thought of as the long term average of the squared deviations of the r.v. from its expected value.

Definition 0.2. The expected value (or mean) of a r.v. X is

µ_X := E(X) = ∫_{−∞}^{∞} x f(x) dx   (discrete case: ∑_{∀x} x p(x)).

In actuality, this definition is a special case of a much broader statement.

Definition 0.3. The expected value (or mean) of a function h(·) of a r.v. X is

E(h(X)) = ∫_{−∞}^{∞} h(x) f(x) dx.

Due to this last definition, if the function h performs a simple linear

transformation, such as h(t) = at+ b, for constants a and b, then

E(aX + b) = ∫ (ax + b) f(x) dx = a ∫ x f(x) dx + b ∫ f(x) dx = aE(X) + b

Example 0.5. Referring back to Example 0.2, the expected value of the

number of type A boards (X) is

E(X) = ∑_{∀x} x p(x) = 0(0.1334) + 1(0.5333) + 2(0.3333) = 1.1999.

We can also calculate the expected value of (i) 5X + 3 and (ii) 3X².

(i) 5(1.1999) + 3 = 8.9995.

(ii) 3(0²)(0.1334) + 3(1²)(0.5333) + 3(2²)(0.3333) = 5.5995
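These calculations are easy to reproduce in R:

x <- 0:2
p <- c(0.1334, 0.5333, 0.3333)
EX <- sum(x * p)     # E(X) = 1.1999
5 * EX + 3           # E(5X + 3) = 8.9995
sum(3 * x^2 * p)     # E(3X^2) = 5.5995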

Definition 0.4. The variance of a r.v. X is

σ²_X := V(X) = E[(X − µ_X)²]
 = ∫ (x − µ_X)² f(x) dx
 = ∫ (x² − 2xµ_X + µ²_X) f(x) dx
 = ∫ x² f(x) dx − 2µ_X ∫ x f(x) dx + µ²_X ∫ f(x) dx
 = E(X²) − 2E²(X) + E²(X)
 = E(X²) − E²(X)

Example 0.6. This refers to Example 0.2. We know that E(X) = 1.1999 and E(X²) = 0²(0.1334) + 1²(0.5333) + 2²(0.3333) = 1.8665. Thus,

V(X) = E(X²) − E²(X) = 1.8665 − 1.1999² = 0.42674

Example 0.7. This refers to example 0.3. If we were to do this by hand we

would need to do integration by parts (multiple times). However we can use

software such as Wolfram Alpha.

1. Find E(X), so in Wolfram Alpha simply input:

integrate x*0.5*e^(-0.5*x) dx from 0 to infinity

So E(X) = 2.

2. Find E(X2), so input:

integrate x^2*0.5*e^(-0.5*x) dx from 0 to infinity

So, E(X²) = 8.

3. V(X) = E(X²) − E²(X) = 8 − 2² = 4.
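The same two integrals can also be evaluated numerically in R:

f <- function(x) 0.5 * exp(-0.5 * x)
EX  <- integrate(function(x) x * f(x), 0, Inf)$value    # 2
EX2 <- integrate(function(x) x^2 * f(x), 0, Inf)$value  # 8
EX2 - EX^2                                              # V(X) = 4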

Definition 0.5. The variance of a function h of a r.v. X is

V(h(X)) = ∫ [h(x) − E(h(X))]² f(x) dx = E(h²(X)) − E²(h(X))

Notice that if h stands for a linear transformation function then,

V(aX + b) = E[(aX + b − E(aX + b))²] = a²E[(X − E(X))²] = a²V(X)

If Z is standard normal then it has mean 0 and variance 1. Now if we

take a linear transformation of Z, say X = aZ + b, then

E(X) = E(aZ + b) = aE(Z) + b = b

and

V(X) = V(aZ + b) = a²V(Z) = a².

This fact together with the following proposition allows us to express any

normal r.v. as a linear transformation of the standard normal r.v. Z by

setting a = σ and b = µ.


Proposition 0.1. The r.v. X that is expressed as the linear transformation σZ + µ is also a normal r.v. with E(X) = µ and V(X) = σ².

Linear transformations are completely reversible, so given a normal r.v.

X with mean µ and variance σ² we can revert back to a standard normal by

Z = (X − µ)/σ.

As a consequence any probability statements made about an arbitrary normal

r.v. can be reverted to statements about a standard normal r.v.

Example 0.8. Let X ∼ N(15, 7). Find P (13.4 < X < 19.0).

We begin by noting

P(13.4 < X < 19.0) = P( (13.4 − 15)/√7 < (X − 15)/√7 < (19.0 − 15)/√7 )
 = P(−0.6047 < Z < 1.5119)
 = P(Z < 1.5119) − P(Z < −0.6047)
 = 0.6620312

If one is using a computer there is no need to revert back and forth from a

standard normal, but it is always useful to standardize concepts. You could

find the answer by using

pnorm(1.5119)-pnorm(-0.6047)

or

pnorm(19,15,sqrt(7))-pnorm(13.4,15,sqrt(7))

Example 0.9. The height of males in inches is assumed to be normally distributed with mean 69.1 and standard deviation 2.6. Let X ∼ N(69.1, 2.6²). Find the 90th percentile for the height of males.

Figure 4: N(69.1, 2.6²) distribution; the shaded region covers 90% of the area.

First we find the 90th percentile of the standard normal which is qnorm(0.9)=

1.281552. Then we transform to

2.6(1.281552) + 69.1 = 72.43204.

Or, just input into R: qnorm(0.9,69.1,2.6).

0.1.2 Covariance

The population covariance is a measure of the strength of the linear relationship between two variables.

Definition 0.6. Let X and Y be two r.vs. The population covariance of X

and Y is

Cov(X, Y ) = E [(X − E(X)) (Y − E(Y ))]

= E(XY )− E(X)E(Y )

Remark 0.1. If X and Y are independent, then

E(XY) = ∫∫ xy f(x, y) dx dy
 (ind.) = ∫∫ xy f_X(x) f_Y(y) dx dy
 = ( ∫ x f_X(x) dx )( ∫ y f_Y(y) dy )
 = E(X) E(Y)

and consequently Cov(X, Y) = 0. This is because under independence f(x, y) = f_X(x) f_Y(y). However, the converse is not true. Think of a circle such as sin²(X) + cos²(Y) = 1. Obviously, X and Y are dependent but they have no linear relationship. Hence, Cov(X, Y) = 0.

The covariance is not unitless, so a measure called the population correlation is used to describe the strength of the linear relationship. It is

• unitless

• ranges from −1 to 1

ρ_{XY} = Cov(X, Y) / ( √V(X) √V(Y) ).

A negative relationship implies a negative covariance and consequently a

negative correlation.

Moving away from the population parameters, to estimate the covariance and the correlation with sample statistics we use

σ̂_{XY} := Ĉov(X, Y) = (1/(n − 1)) ∑_{i=1}^n (x_i − x̄)(y_i − ȳ) = (1/(n − 1)) [ (∑_{i=1}^n x_i y_i) − n x̄ ȳ ]

Therefore,

r_{XY} := ρ̂_{XY} = [ (∑_{i=1}^n x_i y_i) − n x̄ ȳ ] / [ (n − 1) s_X s_Y ].    (1)

Example 0.10. Let's assume that we want to look at the relationship between two variables, height (in inches) and self esteem, for 20 individuals.

Height 68  71  62  75  58  60  67  68  71  69
Esteem 4.1 4.6 3.8 4.4 3.2 3.1 3.8 4.1 4.3 3.7

Height 68  67  63  62  60  63  65  67  63  61
Esteem 3.5 3.2 3.7 3.3 3.4 4.0 4.1 3.8 3.4 3.6

Table 3: Height to self esteem data

Hence,

r_{XY} = [ 4937.6 − 20(65.4)(3.755) ] / [ 19(4.406)(0.426) ] = 0.731,

so there is a moderate to strong positive linear relationship.
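In R, the sample covariance and correlation for these data are obtained directly:

height <- c(68,71,62,75,58,60,67,68,71,69,68,67,63,62,60,63,65,67,63,61)
esteem <- c(4.1,4.6,3.8,4.4,3.2,3.1,3.8,4.1,4.3,3.7,3.5,3.2,3.7,3.3,3.4,4.0,4.1,3.8,3.4,3.6)
cov(height, esteem)   # sample covariance
cor(height, esteem)   # approximately 0.73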

0.1.3 Mean and variance of linear combinations

Let X and Y be two r.vs. For (aX + b) + (cY + d), where a, b, c and d are constants,

E(aX + b + cY + d) = aE(X) + cE(Y) + b + d

V(aX + b + cY + d) = Cov(aX, aX) + Cov(cY, cY) + Cov(aX, cY) + Cov(cY, aX)
 = a²V(X) + c²V(Y) + 2ac Cov(X, Y)

Example 0.11. Let X be a r.v. with E(X) = 3 and V (X) = 2, and Y be

another r.v. independent of X with E(Y ) = −5 and V (Y ) = 6. Then,

E(X − 2Y ) = E(X)− 2E(Y ) = 3− 2(−5) = 13

and

V (X − 2Y ) = V (X) + 4V (Y ) = 2 + 4(6) = 26

Now we extend these two concepts to more than two r.vs. Let X_1, . . . , X_n be a sequence of r.vs and a_1, . . . , a_n a sequence of constants. Then the r.v. ∑_{i=1}^n a_i X_i has mean and variance

E( ∑_{i=1}^n a_i X_i ) = ∑_{i=1}^n a_i E(X_i)

and

V( ∑_{i=1}^n a_i X_i ) = ∑_{i=1}^n ∑_{j=1}^n a_i a_j Cov(X_i, X_j)    (2)
 = ∑_{i=1}^n a_i² V(X_i) + 2 ∑∑_{i<j} a_i a_j Cov(X_i, X_j)    (3)

Example 0.12. Assume a random sample, i.e. independent identically distributed (i.i.d.) r.vs X_1, . . . , X_n, is to be obtained, and of interest is the specific linear combination corresponding to the sample mean X̄ = (1/n) ∑_{i=1}^n X_i. Since the r.vs are i.i.d., let E(X_i) = µ and V(X_i) = σ² ∀i = 1, . . . , n. Then,

E(X̄) = (1/n) ∑_{i=1}^n E(X_i) = (1/n) nµ = µ

and, by independence,

V(X̄) = (1/n²) ∑_{i=1}^n V(X_i) = (1/n²) nσ² = σ²/n

Remark 0.2. As the sample size increases, the variance of the sample mean decreases, with lim_{n→∞} V(X̄) = 0.

A very useful theorem (whose proof is beyond the scope of this class) is the following.

Proposition 0.2. A linear combination of (independent) normal random

variables is a normal random variable.

0.2 Central Limit Theorem

The Central Limit Theorem (C.L.T.) is a powerful statement concerning

the mean of a random sample. There are three versions, the classical, the Lyapunov and the Lindeberg, but in effect they all make the same statement: the asymptotic distribution of the sample mean X̄ is normal, irrespective of the distribution of the individual r.vs X_1, . . . , X_n.

Proposition 0.3. (Central Limit Theorem)

Let X_1, . . . , X_n be a random sample, i.e. i.i.d., with E(X_i) = µ < ∞ and V(X_i) = σ² < ∞. Then, for X̄ = (1/n) ∑_{i=1}^n X_i,

(X̄ − µ)/(σ/√n)  →ᵈ  N(0, 1)   as n → ∞

Although the central limit theorem is an asymptotic statement, i.e. as the

sample size goes to infinity, we can in practice implement it for sufficiently

large sample sizes n > 30 as the distribution of X will be approximately

normal with mean and variance derived from Example 0.12.

X̄ approx.∼ N( µ, σ²/n )
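A small simulation sketch in R illustrates the statement, using the exponential population of Example 0.3 as an arbitrary non-normal choice:

set.seed(1)
xbar <- replicate(5000, mean(rexp(50, rate = 0.5)))  # 5000 sample means with n = 50
c(mean(xbar), var(xbar))                             # close to mu = 2 and sigma^2/n = 0.08
hist(xbar, breaks = 40, main = "Sampling distribution of the mean")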

0.3 Inference for Population Mean

When a population parameter is estimated by a sample statistic, such as µ̂ = x̄, the sample statistic is a point estimate of the parameter. Due to

sampling variability the point estimate will vary from sample to sample.

The fact that the sample estimate is not 100% accurate has to be taken into

account.

0.3.1 Confidence intervals

An alternative or complementary approach is to report an interval of plausible

values based on the point estimate sample statistic and its standard deviation

(a.k.a. standard error). A confidence interval (C.I.) is calculated by first

selecting the confidence level, the degree of reliability of the interval. A

100(1− α)% C.I. means that the method by which the interval is calculated

will contain the true population parameter 100(1 − α)% of the time. That

is, if a sample is replicated multiple times, the proportion of times that the

C.I. will not contain the population parameter is α.

For example, assume that we know the (in practice unknown) population

parameter µ is 0 and from multiple samples, multiple C.Is are created.


Figure 5: Multiple confidence intervals from different samples

Known population variance

Let X1, . . . , Xn be i.i.d. from some distribution with finite unknown mean µ

and known variance σ2. The methodology will require that X ∼ N(µ, σ2/n).

This can occur in the following ways:

• X1, . . . , Xn be i.i.d. from a normal distribution, so that by Proposition

0.2, X ∼ N(µ, σ2/n)


• n > 30 and the C.L.T. is invoked.

Let zc stand for the value of Z ∼ N(0, 1) such that P (Z ≤ zc) = c.

Hence, the proportion of C.Is containing the population parameter is,

(Standard normal density: the central area 1 − α lies between z_{α/2} and z_{1−α/2}, with α/2 in each tail.)

Due to the symmetry of the normal distribution, z_{1−α/2} = |z_{α/2}| and z_{α/2} = −z_{1−α/2}.

Note: Some books may define zc such that P (Z > zc) = c, i.e. c referring to

the area to the right.

1 − α = P( −z_{1−α/2} < (X̄ − µ)/(σ/√n) < z_{1−α/2} )    (4)
 = P( X̄ − z_{1−α/2} σ/√n < µ < X̄ + z_{1−α/2} σ/√n )

and the probability that (in the long run) the random C.I.,

X̄ ∓ z_{1−α/2} σ/√n,

contains the true value of µ is 1 − α. When a C.I. is constructed from a

single sample we can no longer talk about a probability as there is no long

run temporal concept but we can say that we are 100(1−α)% confident that

the methodology by which the interval was contrived will contain the true

population parameter.


Example 0.13. A forester wishes to estimate the average number of count

trees per acre on a plantation. The variance is assumed to be known as 12.1.

A random sample of n = 50 one acre plots yields a sample mean of 27.3.

A 95% C.I. for the true mean is then

27.3 ∓ z_{1−0.025} √(12.1/50) = 27.3 ∓ 1.96(0.4919) → (26.33581, 28.26419)
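In R:

xbar <- 27.3; sigma2 <- 12.1; n <- 50
xbar + c(-1, 1) * qnorm(0.975) * sqrt(sigma2 / n)   # approximately (26.34, 28.26)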

Unknown population variance

In practice the population variance is unknown, that is σ is unknown. A

large sample size implies that the sample variance s2 is a good estimate for σ2

and you will find that many simply replace it in the C.I. calculation. However,

there is a technically “correct” procedure for when variance is unknown.

Note that s² is calculated from data, so just like x̄, there is a corresponding random variable S² to denote the theoretical properties of the sample variance. In higher level statistics the distribution of S² is found since, once again, it is a statistic that depends on the random variables X_1, . . . , X_n. It is shown that

(X̄ − µ)/(S/√n) ∼ t_{n−1}    (5)

where t_{n−1} stands for Student's-t distribution with degrees of freedom parameter ν = n − 1. A Student's-t distribution is "similar" to the standard normal except that it places more "weight" on extreme values, as seen in Figure 6.


Figure 6: Standard normal and t4 probability density functions

It is important to note that Student’s-t is not just “similar” to the stan-

dard normal but asymptotically (as n → ∞) is the standard normal. One

just needs to view the t-table to see that under infinite degrees of freedom the

values in the table are exactly the same as the ones found for the standard

normal. Intuitively then, using Student’s-t when σ2 is unknown makes sense

as it adds more probability to extreme values due to the uncertainty placed

by estimating σ2.

The 100(1− α)% C.I. for µ is then

x̄ ∓ t_{1−α/2, n−1} s/√n.    (6)

Example 0.14. In a packaging plant, the sample mean and standard deviation for the fill weight of 100 boxes are x̄ = 12.05 and s = 0.1. The 95% C.I. for the mean fill weight of the boxes is

12.05 ∓ t_{1−0.025, 99} (0.1/√100) = 12.05 ∓ 1.984(0.01) → (12.03016, 12.06984).    (7)
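In R:

xbar <- 12.05; s <- 0.1; n <- 100
xbar + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)   # (12.03016, 12.06984)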

Remark 0.3. If we wanted a 90% C.I. we would simply replace t_{1−0.05/2, 99} with t_{1−0.10/2, 99} = 1.660, which leads to a C.I. of (12.0334, 12.0666), a narrower interval. Thus, as α ↑, 100(1 − α)% ↓, which implies a narrower interval.


Example 0.15. Suppose that a sample of 36 resistors is taken with x = 10

and s2 = 0.7. A 95% C.I. for µ is

10 ∓ t_{1−0.025, 35} √(0.7/36) = 10 ∓ 2.03(0.1394) → (9.71693, 10.28307)

Remark 0.4. So far we have only discussed two-sided confidence intervals, as in equation (4). However, one-sided confidence intervals might be more appropriate in certain circumstances, for example when one is interested in the minimum breaking strength, or the maximum current in a circuit. In these instances we are not interested in both an upper and a lower limit, but only in a lower or only in an upper limit. Then we simply replace z_{1−α/2} or t_{1−α/2, n−1} by z_{1−α} or t_{1−α, n−1}, e.g. a 100(1 − α)% C.I. for µ is

( x̄ − t_{1−α, n−1} s/√n, ∞ )   or   ( −∞, x̄ + t_{1−α, n−1} s/√n )

0.3.2 Hypothesis tests

A statistical hypothesis is a claim about a population characteristic (and on

occasion more than one). An example of a hypothesis is the claim that the population mean is some value, e.g. µ = 0.75.

Definition 0.7. The null hypothesis, denoted by H0, is the hypothesis that

is initially assumed to be true.

The alternative hypothesis, denoted by Ha or H1, is the complementary

assertion to H0 and is usually the hypothesis, the new statement that we

wish to test.

A test procedure is created under the assumption of H0 and then it is

determined how likely that assumption is compared to its complement Ha.

The decision will be based on

• Test statistic, a function of the sampled data.

• Rejection region/criteria, the set of all test statistic values for which

H0 will be rejected.

The basis for choosing a particular rejection region lies in an understanding

of the errors that can be made.


Definition 0.8. A type I error consists of rejecting H0 when it is actually

true.

A type II error consists of failing to reject H0 when in actuality H0 is

false.

The type I error is generally considered to be the most serious one, and

due to limitations, we can only control for one, so the rejection region is

chosen based upon the maximum P (type I error) = α that a researcher is

willing to accept.

Known population variance

We motivate the test procedure by an example whereby the drying time

of a certain type of paint, under fixed environmental conditions, is known

to be normally distributed with mean 75 min. and standard deviation 9

min. Chemists have added a new additive that is believed to decrease drying

time and have obtained a sample of 35 drying times and wish to test their

assertion. Hence,

H0 : µ ≥ 75 (or µ = 75)

Ha : µ < 75

Since we wish to control for the type I error, we set P (type I error) = α.

The default value of α is usually taken to be 5%.

An obvious candidate for a test statistic, that is an unbiased estimator

of the population mean, is X which is normally distributed. If the data

were not known to be normally distributed the normality of X can also be

confirmed by the C.L.T. Thus, under the null assumption H0

XH0∼ N

(

75,92

35

)

,

or equivalently

(X̄ − 75)/(9/√35) ∼ N(0, 1) under H0.

The test statistic will be

T.S. = (x̄ − 75)/(9/√35),

and assuming that x = 70.8 from the 35 samples, then, T.S. = −2.76. This

implies that 70.8 is 2.76 standard deviations below 75. Although this appears

to be far, we need to use the p-value to reach a formal conclusion.

Definition 0.9. The p-value of a hypothesis test is the probability of ob-

serving the specific value of the test statistic, T.S., or a more extreme value,

under the null hypothesis. The direction of the extreme values is indicated

by the alternative hypothesis.

Therefore, in this example values more extreme than -2.76 are

{x|x ≤ −2.76},

as indicated by the alternative, Ha : µ < 75. Thus,

p-value = P (Z ≤ −2.76) = 0.0029.

The criterion for rejecting the null is p-value < α. Here the null hypothesis is rejected in favor of the alternative hypothesis, as the probability of observing the test statistic value of −2.76 or a more extreme one (as indicated by Ha) is smaller than the probability of the type I error we are willing to undertake.


Figure 7: Rejection region and p-value.

If we can assume that X is normally distributed and σ2 is known then,

to test


(i) H0 : µ ≤ µ0 vs Ha : µ > µ0

(ii) H0 : µ ≥ µ0 vs Ha : µ < µ0

(iii) H0 : µ = µ0 vs Ha : µ ≠ µ0

at the α significance level, compute the test statistic

T.S. = (x̄ − µ0)/(σ/√n).    (8)

Reject the null if the p-value < α, i.e.

(i) P (Z ≥ T.S.) < α (area to the right of T.S. < α)

(ii) P (Z ≤ T.S.) < α (area to the left of T.S. < α)

(iii) P (|Z| ≥ |T.S.|) < α (area to the right of |T.S.| plus area to the left of

−|T.S.| < α)

Example 0.16. A scale is to be calibrated by weighing a 1000g weight 60

times. From the sample we obtain x = 1000.6 and s = 2. Test whether the

scale is calibrated correctly.

H0 : µ = 1000 vs Ha : µ ≠ 1000

T.S. = (1000.6 − 1000)/(2/√60) = 2.32379

Hence, the p-value is 0.02013675 and we reject the null hypothesis and con-

clude that the true mean is not 1000.


Figure 8: p-value.

Since 1000.6 is 2.32379 standard deviations greater than 1000, we can conclude that not only is the true mean not 1000, but it is greater than 1000.
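The test statistic and two-sided p-value in R:

ts <- (1000.6 - 1000) / (2 / sqrt(60))
2 * (1 - pnorm(abs(ts)))   # approximately 0.0201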

Example 0.17. A company representative claims that the number of calls

arriving at their center is no more than 15/week. To investigate the claim, 36

random weeks were selected from the company’s records with a sample mean

of 17 and sample standard deviation of 3. Do the sample data contradict

this statement?

First we begin by stating the hypotheses of

H0 : µ ≤ 15 vs Ha : µ > 15

The test statistic is

T.S. = (17 − 15)/(3/√36) = 4

The conclusion is that there is significant evidence to reject H0, as the p-value (the area to the right of 4 under the standard normal) is very close to 0.


Unknown population variance

If σ is unknown, which is usually the case, we replace it by its sample estimate s. Consequently,

(X̄ − µ0)/(S/√n) ∼ t_{n−1} under H0,

and for an observed value x̄, the test statistic becomes

T.S. = (x̄ − µ0)/(s/√n).

At the α significance level, for the same hypothesis tests as before, we reject

H0 if

(i) p-value= P (tn−1 ≥ T.S.) < α

(ii) p-value= P (tn−1 ≤ T.S.) < α

(iii) p-value= P (|tn−1| ≥ |T.S.|) < α

Example 0.18. In an ergonomic study, 5 subjects were chosen to study the maximum weight of lift (MAWL) for a frequency of 4 lifts/min. Assuming the MAWL values are normally distributed, do the following data suggest that the population mean of MAWL exceeds 25?

H0 : µ ≤ 25 vs Ha : µ > 25

T.S. = (27.54 − 25)/(5.47/√5) = 1.03832

The p-value is the area to the right of 1.03832 under the t4 distribution,

which is 0.1788813. Hence, we fail to reject the null hypothesis. In R input:

t.test(c(25.8, 36.6, 26.3, 21.8, 27.2),mu=25,alternative="greater")

Remark 0.5. The values contained within a two-sided 100(1 − α)% C.I. are

precisely those values (that when used in the null hypothesis) will result in

the p-value of a two sided hypothesis test to be greater than α.

For the one sided case, an interval that only uses the


• upper limit, contains precisely those values for which the p-value of

a one-sided hypothesis test, with alternative less than, will be greater

than α.

• lower limit, contains precisely those values for which the p-value of a

one-sided hypothesis test, with alternative greater than, will be greater

than α.

Example 0.19. The lifetime of single cell organism is believed to be on

average 257 hours. A small preliminary study was conducted to test whether

the average lifetime was different when the organism was placed in a certain

medium. The measurements are assumed to be normally distributed and

turned out to be 253, 261, 258, 255, and 256. The hypothesis test is

H0 : µ = 257 vs. Ha : µ 6= 257

With x = 256.6 and s = 3.05, the test statistic value is

T.S. = (256.6 − 257)/(3.05/√5) = −0.293.

The p-value is P (t4 < −0.293) + P (t4 > 0.293) = 0.7839. Hence, since the

p-value is large (> 0.05) we fail to reject H0 and conclude that population

mean is not statistically different from 257.

Instead of a hypothesis test, if a two-sided 95% C.I. is constructed,

256.6 ∓ t_{1−0.025, 4} (3.05/√5) = 256.6 ∓ 2.776(1.364) → (252.81, 260.39),

it is clear that the null hypothesis value of µ = 257 is a plausible value and consequently H0 is plausible, so it is not rejected.


0.4 Inference for Two Population Means

0.4.1 Independent samples

There are instances when a C.I. for the difference between two means is of

interest when one wishes to compare the sample mean from one population

to the sample mean of another.

Known population variances

Let X_1, . . . , X_{n_X} and Y_1, . . . , Y_{n_Y} represent two independent random samples with means µ_X, µ_Y and variances σ²_X, σ²_Y respectively. Once again the methodology will require X̄ and Ȳ to be normally distributed. This can occur by:

• X_1, . . . , X_{n_X} being i.i.d. from a normal distribution, so that by Proposition 0.2, X̄ ∼ N(µ_X, σ²_X/n_X)

• n_X > 40 and the C.L.T. being invoked.

Similarly for Ȳ. Note that if the C.L.T. is to be invoked we require the more conservative criterion n_X > 40, n_Y > 40, as we are using the theorem (and hence an approximation) twice.

To compare two population means µ_X and µ_Y we find it easier to work with a new parameter, the difference µ_K := µ_X − µ_Y. Then K := X̄ − Ȳ is a normal random variable (by Proposition 0.2) with

E(K) = E(X̄ − Ȳ) = µ_X − µ_Y,

and

V(K) = V(X̄ − Ȳ) = σ²_X/n_X + σ²_Y/n_Y.

Therefore,

K := X̄ − Ȳ ∼ N( µ_X − µ_Y, σ²_X/n_X + σ²_Y/n_Y ),

and hence a 100(1 − α)% C.I. for the difference µ_K = µ_X − µ_Y is

x̄ − ȳ ∓ z_{1−α/2} √( σ²_X/n_X + σ²_Y/n_Y ).

Example 0.20. In an experiment, 50 observations of soil NO3 concentration

(mg/L) were taken at each of two (independent) locations X and Y . We have

that x = 88.5, σX = 49.4, y = 110.6 and σY = 51.5. Construct a 95% C.I.

for the difference in means and interpret.

88.5 − 110.6 ∓ 1.96 √( 49.4²/50 + 51.5²/50 ) → (−41.880683, −2.319317)

Note that 0 is not in the interval as a plausible value. This implies that

µX − µY < 0 is plausible. In fact µX is less than µY by at least 2.32 units

and at most 41.88.
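In R:

d <- 88.5 - 110.6
se <- sqrt(49.4^2 / 50 + 51.5^2 / 50)
d + c(-1, 1) * qnorm(0.975) * se   # approximately (-41.88, -2.32)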

Unknown population variances

As in equation (5),

( X̄ − Ȳ − (µ_X − µ_Y) ) / √( s²_X/n_X + s²_Y/n_Y ) ∼ t_ν

where

ν = ( s²_X/n_X + s²_Y/n_Y )² / [ (s²_X/n_X)²/(n_X − 1) + (s²_Y/n_Y)²/(n_Y − 1) ].    (9)

Hence the 100(1 − α)% C.I. for µ_X − µ_Y is

x̄ − ȳ ∓ t_{1−α/2, ν} √( s²_X/n_X + s²_Y/n_Y ).

Example 0.21. Two methods are considered standard practice for surface

hardening. For Method A there were 15 specimens with a mean of 400.9

(N/mm2) and standard deviation 10.6. For Method B there were also 15

specimens with a mean of 367.2 and standard deviation 6.1. Assuming the

samples are independent and from a normal distribution the 98% C.I. for

µA − µB is

400.9 − 367.2 ∓ t_{1−0.01, ν} √( 10.6²/15 + 6.1²/15 )

where

ν = ( 10.6²/15 + 6.1²/15 )² / [ (10.6²/15)²/14 + (6.1²/15)²/14 ] = 22.36

and hence t_{1−0.01, 22.36} = 2.5052, giving a 98% C.I. for the difference µ_A − µ_B of (25.7892, 41.6108).

Notice that 0 is not in the interval so we can conclude that the two means

are different. In fact the interval is purely positive so we can conclude that

µA is at least 25.7892 N/mm2 larger than µB and at most 41.6108 N/mm2.
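A sketch of the same calculation in R from the summary statistics:

mA <- 400.9; sA <- 10.6; nA <- 15
mB <- 367.2; sB <- 6.1;  nB <- 15
se <- sqrt(sA^2 / nA + sB^2 / nB)
nu <- se^4 / ((sA^2 / nA)^2 / (nA - 1) + (sB^2 / nB)^2 / (nB - 1))  # Satterthwaite df, about 22.4
(mA - mB) + c(-1, 1) * qt(0.99, nu) * se   # approximately (25.79, 41.61)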

0.4.2 Paired data

There are instances when two samples are not independent, when a rela-

tionship exists between the two. For example, before treatment and after

treatment measurements made on the same experimental subject are depen-

dent on eachother through the experimental subject. This is a common event

in clinical studies where the effectiveness of a treatment, that may be quan-

tified by the difference in the before and after measurements, is dependent

upon the individual undergoing the treatment. Then, the data is said to be

paired.

Consider the data in the form of the pairs (X1, Y1), (X2, Y2), . . . , (Xn, Yn).

We note that the pairs, i.e. two dimensional vectors, are independent as the

experimental subjects are assumed to be independent with marginal expec-

tations E(Xi) = µX and E(Yi) = µY for all i = 1, . . . , n. By defining,

D1 = X1 − Y1

D2 = X2 − Y2

...

Dn = Xn − Yn

a two sample problem has been reduced to a one sample problem. Inference for µ_X − µ_Y is equivalent to one sample inference on µ_D, as was done in Section 0.3. This holds since,

µ_D := E(D̄) = E( (1/n) ∑_{i=1}^n D_i ) = E( (1/n) ∑_{i=1}^n (X_i − Y_i) ) = E(X̄ − Ȳ) = µ_X − µ_Y.

In addition we note that the variance of D̄ does incorporate the covariance between the two samples and does have to be calculated separately, as

σ²_D := V(D̄) = V( (1/n) ∑_{i=1}^n D_i ) = (1/n²) ∑_{i=1}^n V(D_i) = ( σ²_X + σ²_Y − 2σ_{XY} )/n.

Example 0.22. A new and an old type of rubber compound can be used in tires. A researcher is interested in a compound/type that does not wear easily. Ten cars were chosen at random to go around a track a predetermined number of times. Each car did this twice, once for each tire type, and the depth of the tread was then measured.

Car  1    2    3    4    5    6    7    8    9     10
New  4.35 5.00 4.21 5.03 5.71 4.61 4.70 6.03 3.80  4.70
Old  4.19 4.62 4.04 4.72 5.52 4.26 4.27 6.24 3.46  4.50
D    0.16 0.38 0.17 0.31 0.19 0.35 0.43 -0.21 0.34 0.20

We have d̄ = 0.232 and s_D = 0.183. Assuming that the data are normally distributed, a 95% C.I. for µ_new − µ_old = µ_D is

0.232 ∓ t_{1−0.025, 9} (0.183/√10) = 0.232 ∓ 2.262(0.0579) → (0.101, 0.363)

and we note that the interval is strictly greater than 0, implying that the difference is positive, i.e. that µ_new > µ_old. In fact we can conclude that µ_new is larger than µ_old by at least 0.101 units and at most 0.363 units.
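Equivalently, in R the paired interval is a one-sample t-interval on the differences:

new <- c(4.35, 5.00, 4.21, 5.03, 5.71, 4.61, 4.70, 6.03, 3.80, 4.70)
old <- c(4.19, 4.62, 4.04, 4.72, 5.52, 4.26, 4.27, 6.24, 3.46, 4.50)
t.test(new, old, paired = TRUE)$conf.int   # approximately (0.101, 0.363)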


Chapter 1

Simple Linear Regression

In this chapter we hypothesize a linear relationship between the two variables,

estimate and draw inference about the model parameters.

1.1 Model

The simplest deterministic mathematical relationship between two mathe-

matical variables x and y is a linear relationship

y = β0 + β1x,

where the coefficient

• β0 represents the y-axis intercept, the value of y when x = 0,

• β1 represents the slope, interpreted as the amount of change in the

value of y for a 1 unit increase in x.

To this model we add variability by introducing the random variable ε_i i.i.d.∼ N(0, σ²) for each observation i = 1, . . . , n. Hence, the statistical model by which we wish to model one random variable using known values of some predictor variable becomes

Y_i = β0 + β1 x_i + ε_i,   i = 1, . . . , n,    (1.1)

with systematic component β0 + β1 x_i,

where Yi represents the r.v. corresponding to the response, i.e. the variable

we wish to model and xi stands for the observed value of the predictor.


Therefore we have that

Y_i ind.∼ N(β0 + β1 x_i, σ²).    (1.2)

Notice that the Y s are no longer identical since their mean depends on the

value of xi.


Figure 1.1: Regression model.

Remark 1.1. An alternate form with centered predictor is

Y_i = β0 + β1(x_i − x̄) + β1 x̄ + ε_i = (β0 + β1 x̄) + β1(x_i − x̄) + ε_i = β*_0 + β1(x_i − x̄) + ε_i,   where β*_0 := β0 + β1 x̄.

In order to fit a regression line one needs estimates of the coefficients β0 and β1, giving the fitted mean line

ŷ_i = β̂0 + β̂1 x_i.

1.2 Parameter Estimation

1.2.1 Regression function

The goal is to have this line as “close” to the data points as possible. The

concept, is to minimize the error from the actual data points to the predicted

points (in the direction of Y , i.e. vertical)

min ∑_{i=1}^n (Y_i − E(Y_i))²  →  min ∑_{i=1}^n (Y_i − (β0 + β1 x_i))².

Hence, the goal is to find the values of β0 and β1 that minimize the sum of the squared distances between the points and their expected values under the model.

This is done by the following steps:

1. Take the partial derivatives with respect to β0 and β1.

2. Equate the two resulting equations to 0.

3. Solve the simultaneous equations for β0 and β1.

4. (Optional) Take second partial derivatives to verify that the solution is a minimum, not a maximum.

Therefore,

b1 := β̂1 = ∑_{i=1}^n (x_i − x̄)(y_i − ȳ) / ∑_{i=1}^n (x_i − x̄)²
 = [ (∑_{i=1}^n x_i y_i) − n x̄ ȳ ] / [ (∑_{i=1}^n x_i²) − n x̄² ]
 = ∑_{i=1}^n k_i y_i,   where k_i = (x_i − x̄) / ∑_{j=1}^n (x_j − x̄)²    (1.3)

and

b0 := β̂0 = ȳ − b1 x̄ = ∑_{i=1}^n l_i y_i,   where l_i = 1/n − x̄(x_i − x̄) / ∑_{j=1}^n (x_j − x̄)².

Hence both b1 and b0 are linear estimators, as they are linear combinations

of the responses.

Remark 1.2. Do not extrapolate the model to values of the predictor x that were not in the data, as it is not clear how the model may behave for other values. Also, do not fit a linear regression to data that do not appear to be linear.

Definition 1.1. The ith residual is defined to be the difference between the observed and fitted value of the response for point i,

e_i = y_i − ŷ_i

Notable Properties:

• ∑ e_i = 0

• ∑ y_i = ∑ ŷ_i

• ∑ x_i e_i = 0

• ∑ ŷ_i e_i = 0

• The regression line always goes through (x̄, ȳ)

1.2.2 Variance

The variance term in the model is

σ² = V(ε) = E(ε²)

Hence, to estimate it, the "sample mean" of the squared residuals e_i² seems a reasonable estimate:

s² = MSE = σ̂² = ∑_{i=1}^n (y_i − ŷ_i)² / (n − 2) = ∑_{i=1}^n e_i² / (n − 2) = SSE/(n − 2),

where MSE stands for Mean Squared Error and SSE for Sum of Squares Error. Note that in the denominator we have n − 2, as we lose 2 degrees of freedom since we had to estimate two parameters, β0 and β1, when estimating our center ŷ_i.

Remark 1.3. Estimation of model parameters can also be done via maximum likelihood, which yields exactly the same estimates of the parameters of the systematic component, β0 and β1, but a slightly biased estimate of σ²:

σ̂²_{ML} = ∑_{i=1}^n (y_i − ŷ_i)² / n,

so

MSE = [ n/(n − 2) ] σ̂²_{ML}

Example 1.1. Let x be the number of copiers serviced and Y be the time

spent (in minutes) by the technician for a known manufacturer.

             1   2  · · ·  44  45
Time (y)    20  60  · · ·  61  77
Copiers (x)  2   4  · · ·   4   5

Table 1.1: Quantity of copiers and service time

The complete dataset can be found at

http://www.stat.ufl.edu/~athienit/STA4210/Examples/copiers.csv


Figure 1.2: Scatterplot of Time vs Copiers.

The scatterplot shows that there is a strong positive relationship between

the two variables. Below is the R output.


Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.5802 2.8039 -0.207 0.837

Copiers 15.0352 0.4831 31.123 <2e-16 ***

---

Residual standard error: 8.914 on 43 degrees of freedom

Multiple R-squared: 0.9575, Adjusted R-squared: 0.9565

F-statistic: 968.7 on 1 and 43 DF, p-value: < 2.2e-16

http://www.stat.ufl.edu/~athienit/STA4210/Examples/copier.R

The estimated equation is

ŷ = −0.5802 + 15.0352x

We note that the slope b1 = 15.0352 implies that for each unit increase in

copier quantity, the service time increases by 15.0352 minutes (for quantity

values between 1 and 10).

If we wish to estimate the time needed for a service call for 5 copiers that

would be

−0.5802 + 15.0352(5) = 74.5958 minutes
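A minimal sketch of the fit above (the column names Time and Copiers are assumed here; the full script is at the copier.R link):

copiers <- read.csv("http://www.stat.ufl.edu/~athienit/STA4210/Examples/copiers.csv")
reg <- lm(Time ~ Copiers, data = copiers)           # response/predictor names assumed
summary(reg)
predict(reg, newdata = data.frame(Copiers = 5))     # about 74.6 minutes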

Example 1.2. Data on lot size (x) and work hours (y) was obtained from

25 recent runs of a manufacturing process. (See example on page 19 of

textbook). A simple linear regression model was fit in R yielding

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 62.366 26.177 2.382 0.0259 *

lotsize 3.570 0.347 10.290 4.45e-10 ***

Residual standard error: 48.82 on 23 degrees of freedom

Multiple R-squared: 0.8215, Adjusted R-squared: 0.8138

F-statistic: 105.9 on 1 and 23 DF, p-value: 4.449e-10



Figure 1.3: Scatterplot of Work Hours vs Lot Size.

We can obtain the residuals, but note that from their magnitude in hours alone it may not be easy to determine whether a value is large or small in the context of the problem. Later we shall discuss standardized residuals.

> round(resid(toluca.reg),1)

1 2 3 4 5 6 7 8 9 10 11

51.0 -48.5 -19.9 -7.7 48.7 -52.6 55.2 4.0 -66.4 -83.9 -45.2

12 13 14 15 16 17 18 19 20 21

-60.3 5.3 -20.8 -20.1 0.6 42.5 27.1 -6.7 -34.1 103.5

22 23 24 25

84.3 38.8 -6.0 10.7

> round(rstandard(toluca.reg),1)

1 2 3 4 5 6 7 8 9 10 11 12

1.1 -1.1 -0.4 -0.2 1.0 -1.1 1.2 0.1 -1.4 -1.8 -1.0 -1.3

13 14 15 16 17 18 19 20 21 22 23 24 25

0.1 -0.5 -0.4 0.0 0.9 0.6 -0.1 -0.7 2.3 1.8 0.8 -0.1 0.2

Note that the first residual implies that the actual observed value of work hours was 51 hours greater than the model estimate. However, this difference is only 1.1 standard deviations.

http://www.stat.ufl.edu/~athienit/STA4210/Examples/toluca.R


Chapter 2

Inferences in Regression

2.1 Inferences concerning β0 and β1

The coefficients b0 and b1 of equation (1.3) are linear combinations of the responses. Therefore, they have corresponding r.vs B0 and B1, and since the Y s are independent normal r.vs (see (1.1)), by Proposition 0.2 they are themselves normal r.vs. Re-expressing the r.v. B1,

B1 = ∑_{i=1}^n (x_i − x̄)(Y_i − Ȳ) / ∑_{i=1}^n (x_i − x̄)² = · · · = ∑_{i=1}^n k_i Y_i,   where k_i = (x_i − x̄) / ∑_{j=1}^n (x_j − x̄)².

Some notable properties are:

• ∑ k_i = 0

• ∑ k_i x_i = 1

• ∑ k_i² = 1 / ∑ (x_i − x̄)²

This implies

E(B1) = ∑_{i=1}^n k_i E(Y_i) = ∑_{i=1}^n k_i (β0 + β1 x_i) = β0 ∑_{i=1}^n k_i + β1 ∑_{i=1}^n k_i x_i = β1

and

V(B1) = ∑_{i=1}^n k_i² V(Y_i) = σ² ∑_{i=1}^n k_i² = σ² / ∑_{j=1}^n (x_j − x̄)².

Thus,

B1 ∼ N( β1, σ² / ∑_{i=1}^n (x_i − x̄)² ).

Remark 2.1. The larger the spread in the values of the predictor, the larger ∑_{i=1}^n (x_i − x̄)² will be, and hence the smaller the variances of B0 and B1. Also, since the (x_i − x̄)² are nonnegative terms, when we have more data points, i.e. larger n, we are summing more nonnegative terms and ∑_{i=1}^n (x_i − x̄)² is larger.

Remark 2.2. The intercept term is not of much practical importance, as it is the value of the response when the predictor value is 0, and it is included to provide us with a "nice" model whether significant or not. Hence, inference is omitted. It can be shown, in similar fashion, that

B0 ∼ N( β0, [ 1/n + x̄² / ∑_{i=1}^n (x_i − x̄)² ] σ² ).

Remark 2.3. The r.vs B0 and B1 are not independent and their covariance is not 0:

Cov(B0, B1) = Cov( ∑ l_i Y_i, ∑ k_i Y_i ) = ∑ l_i k_i V(Y_i),

since

Cov(l_i Y_i, k_j Y_j) = l_i k_i V(Y_i) if i = j, and 0 if i ≠ j.

In practice σ² is not known and is replaced by its estimate, MSE. This is a scenario we are already familiar with; similar to equation (5), we use a Student's t distribution instead of the normal:

(B1 − β1) / ( s / √(∑ (x_i − x̄)²) ) ∼ t_{n−2}.

This is because (not proven in this class)

(n − 2)s²/σ² = SSE/σ² ∼ χ²_{n−2}    (2.1)

is independent of B1, and a ratio of a standard normal to the square root of an independent chi-square (divided by its degrees of freedom) has a t-distribution. It is important to note that the degrees of freedom are n − 2, as 2 were lost due to the estimation of β0 and β1 in the mean.

Therefore, a 100(1 − α)% C.I. for β1 is

b1 ∓ t_{1−α/2, n−2} s_{b1},

where s_{b1} = s / √( ∑_{i=1}^n (x_i − x̄)² ).

Similarly, for a null hypothesis value H0 : β1 = β10, the test statistic is

T.S. = (b1 − β10) / s_{b1} ∼ t_{n−2} under H0,

and p-values and conclusions are obtained in the standard way; see Section 0.3.

We have not yet learned to perform inference on all parameters in the model Y_i ind.∼ N(β0 + β1 x_i, σ²). We can perform inference on the parameters associated with the mean, i.e. β1 (and β0), but not yet σ². From (2.1) we have that

1 − α = P( χ²_{α/2, n−2} < SSE/σ² < χ²_{1−α/2, n−2} )
 = P( SSE/χ²_{1−α/2, n−2} < σ² < SSE/χ²_{α/2, n−2} )

and hence the 100(1 − α)% C.I. for σ² is

( SSE/χ²_{1−α/2, n−2}, SSE/χ²_{α/2, n−2} )    (2.2)

Example 2.1. Back to the copier example 1.1, a 95% C.I. for

• β1 is

15.0352 ∓ t_{1−0.025, 43}(0.4831) = 15.0352 ∓ 2.016692(0.4831) → (14.061010, 16.009486).

• σ² (using the Toluca fit of example 1.2, with s = 48.82 on 23 degrees of freedom) is

( 23(48.82²)/38.076, 23(48.82²)/11.689 )
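In R, confint() gives the slope interval directly (assuming the reg object fitted to the copier data as sketched earlier), and the chi-square quantiles give the variance interval:

confint(reg, "Copiers", level = 0.95)            # C.I. for beta1
23 * 48.82^2 / qchisq(c(0.975, 0.025), df = 23)  # C.I. for sigma^2 from the Toluca fit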

2.2 Inferences involving E(Y ) and Ypred

2.2.1 Confidence interval on the mean response

The mean is no longer a constant but is in fact a “mean line”.

µ_{Y|X=x_obs} := E(Y | X = x_obs) = β0 + β1 x_obs

Hence, we can create an interval for the mean at a specific value of the

predictor xobs. We simply need to find a statistic to estimate the mean and

find its distribution. The sample statistic used is

ŷ = b0 + b1 x_obs

and the corresponding r.v. is

Ŷ = B0 + B1 x_obs = ∑_{i=1}^n [ 1/n + (x_obs − x̄)(x_i − x̄) / ∑_{j=1}^n (x_j − x̄)² ] Y_i.    (2.3)

Note that this can be expressed as a linear combination of the independent normal r.vs Y_i, whose distribution is known to be normal (equation (1.2)). Therefore, Ŷ is also a normal r.v. with mean

E(Ŷ) = E(B0) + E(B1) x_obs = β0 + β1 x_obs

and variance

V(Ŷ) = V(B0 + B1 x_obs)
 = V[ Ȳ + B1(x_obs − x̄) ]   since B0 = Ȳ − B1 x̄
 = V(Ȳ) + (x_obs − x̄)² V(B1) + 2(x_obs − x̄) Cov(Ȳ, B1)
 = σ²/n + (x_obs − x̄)² σ² / ∑_{i=1}^n (x_i − x̄)²,

since Cov(Ȳ, B1) = (1/n)σ² ∑ k_i = 0. Hence,

Ŷ ∼ N( β0 + β1 x_obs, [ 1/n + (x_obs − x̄)² / ∑_{j=1}^n (x_j − x̄)² ] σ² ).

Thus, a 100(1 − α)% C.I. for the mean response µ_{Y|X=x_obs} is

ŷ ∓ t_{1−α/2, n−2} s_Ŷ,   where s_Ŷ = s √( 1/n + (x_obs − x̄)² / ∑_{j=1}^n (x_j − x̄)² ).

Example 2.2. Refer back to Example 1.1. Assume we are interested in a

95% C.I. for the mean time value when the quantity of copiers is 5.

74.59608 ∓ t_{1−0.025, 43}(1.329831) = 74.59608 ∓ 2.016692(1.329831) → (71.91422, 77.27794)

In R,

> newdata=data.frame(Copiers=5)

> predict.lm(reg,se.fit=TRUE,newdata,interval="confidence",level=0.95)

$fit

fit lwr upr

1 74.59608 71.91422 77.27794

$se.fit

[1] 1.329831

$df

[1] 43

2.2.2 Prediction interval

Once a regression model is fitted, after obtaining data (x1, y1), . . . , (xn, yn),

it may be of interest to predict a future value of the response. From equation

(1.1), we have some idea where this new prediction value will lie, somewhere

around the mean response

β0 + β1x new

However, according to the model, equation (1.1), we do not expect new

predictions to fall exactly on the mean response, but close to them. Hence,


the r.v. corresponding to the statistic we plan to use is the same as equation

(2.3) with the addition of the error term ε ∼ N(0, σ²):

Ŷ_pred = B0 + B1 x_new + ε

Therefore,

Ŷ_pred ∼ N( β0 + β1 x_new, [ 1 + 1/n + (x_new − x̄)² / ∑_{j=1}^n (x_j − x̄)² ] σ² ),

and a 100(1 − α)% prediction interval (P.I.) for a new response, at a value of the predictor that is unobserved, i.e. not in the data, is

ŷ_pred ∓ t_{1−α/2, n−2} s_pred,   where s_pred = s √( 1 + 1/n + (x_new − x̄)² / ∑_{j=1}^n (x_j − x̄)² ).

Example 2.3. Refer back to Example 1.1. Let us estimate the future service

time value when copier quantity is 7 and create an interval around it. The predicted value is

−0.5802 + 15.0352(7) = 104.6666 minutes

a 95% P.I. around the predicted value is

104.6666 ∓ t_{1−0.025, 43}(9.058051) = 104.6666 ∓ 2.016692(9.058051) → (86.399, 122.9339)

In R

> newdata=data.frame(Copiers=7)

> predict.lm(reg,se.fit=TRUE,newdata,interval="prediction",level=0.95)

$fit

fit lwr upr

1 104.6666 86.39922 122.9339

$se.fit

[1] 1.6119

$df

[1] 43


Note that the se.fit provided is the value for the C.I., not the P.I. However, in the calculation of the P.I. the correct standard error term is used.

http://www.stat.ufl.edu/~athienit/STA4210/Example
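As a quick check, the P.I. standard error combines se.fit with the residual standard error of the fit:

sqrt(1.6119^2 + 8.914^2)   # about 9.058, the s_pred used above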

Example 2.4. Also see confidence and prediction intervals for example 1.2

http://www.stat.ufl.edu/~athienit/STA4210/Examples/toluca.R

2.2.3 Confidence Band for Regression Line

If we wish to create a simultaneous estimate of the population mean for all predictor values x, that is, a 100(1 − α)% simultaneous C.I. for β0 + β1 x, we use

ŷ ∓ W s_Ŷ,

known as the Working-Hotelling confidence band, where

W = √( 2 F_{1−α; 2, n−2} ).

Example 2.5. Continuing from example 1.2 (Toluca) we can not only eva-

luate the band at specific points but at all points and plot it with the script

found in

http://www.stat.ufl.edu/~athienit/STA4210/Examples/toluca.R

CI=predict(toluca.reg,se.fit=TRUE)

W=sqrt(2*qf(0.95,length(toluca.reg$coefficients),toluca.reg$df.residual))

Band=cbind( CI$fit - W * CI$se.fit, CI$fit + W * CI$se.fit )

points(sort(toluca$lotsize), sort(Band[,1]), type="l", lty=2)

points(sort(toluca$lotsize), sort(Band[,2]), type="l", lty=2)

legend("topleft",legend=c("Mean Line","95% CB"),col=c(1,1),

+ lty=c(1,2),bg="gray90")


Figure 2.1: Working-Hotelling 95% confidence band.


2.3 Analysis of Variance Approach

Next we introduce some notation that will be useful in conducting inference

of the model. In order to determine whether a regression model is adequate

we must compare it to the most naive model, which uses the sample mean Ȳ as its prediction, i.e. Ŷ = Ȳ. This model does not take into account any predictors, as the prediction is the same for all values of x. Then, the total distance of a point y_i to the sample mean ȳ can be broken down into two components, one measuring the error of the model for that point, and one measuring the "improvement" distance accounted for by the regression model.

(y_i − ȳ) [Total] = (y_i − ŷ_i) [Error] + (ŷ_i − ȳ) [Regression]

Looking back at Figure 1.1 and singling out a point we have that,

Figure 2.2: Sum of Squares breakdown.

Summing over all observations we have that SST = SSE + SSR, i.e.

∑_{i=1}^n (y_i − ȳ)² = ∑_{i=1}^n (y_i − ŷ_i)² + ∑_{i=1}^n (ŷ_i − ȳ)²,    (2.4)

since the cross-product term

∑_{i=1}^n (y_i − ŷ_i)(ŷ_i − ȳ) = ∑ e_i(ŷ_i − ȳ) = ∑ e_i ŷ_i − ȳ ∑ e_i = 0 − 0 = 0

Remark 2.4. A useful result is

SSR = ∑ (ŷ_i − ȳ)² = ∑ (b0 + b1 x_i − ȳ)² = ∑ (ȳ − b1 x̄ + b1 x_i − ȳ)² = b1² ∑ (x_i − x̄)² = b1² (n − 1) s²_x

Each sum of squares term has an associated degrees of freedom value:

       df
SSR    1
SSE    n − 2
SST    n − 1

We can summarize this information in an ANOVA table

Source   df       MS             E(MS)
Reg      1        SSR/1          σ² + β1² ∑(x_i − x̄)²
Error    n − 2    SSE/(n − 2)    σ²
Total    n − 1

Table 2.1: ANOVA table

Note that

SSE/σ² ∼ χ²_{n−2}  ⇒  E(SSE/σ²) = n − 2  ⇒  E( SSE/(n − 2) ) = σ²

and that

MSR = SSR = b1² ∑ (x_i − x̄)²  ⇒  E(MSR) = ∑ (x_i − x̄)² E(B1²)
 = ∑ (x_i − x̄)² [ V(B1) + E²(B1) ]
 = σ² + β1² ∑ (x_i − x̄)²

2.3.1 F-test for β1

In Section 2.1 we saw a t-test for testing the significance of β1, but now we introduce a different test that will be especially useful later in testing multiple β's simultaneously. In Table 2.1 we notice that

E(MSR)/E(MSE) = 1 if β1 = 0,  and > 1 if β1 ≠ 0.

By Cochran’s theorem it has been shown that under H0 : β1 = 0

• SSE/σ² ∼ χ²_{n−2} and SSR/σ² ∼ χ²_1, and the two are independent,

• (χ²_1/1) / (χ²_{n−2}/(n − 2)) ∼ F_{1, n−2}

Hence, we have that

T.S. = [ (SSR/σ²)/1 ] / [ (SSE/σ²)/(n − 2) ] = MSR/MSE ∼ F_{1, n−2} under H0.

The null is rejected if the p-value P(F_{1, n−2} > T.S.) < α, i.e. the area to the right of the T.S. is less than α.


Figure 2.3: F1,n−2 distribution and p-value.

Remark 2.5. The F-test and t-test for H0 : β1 = 0 vs. Ha : β1 ≠ 0 are equivalent since

MSR/MSE = b1² ∑ (x_i − x̄)² / MSE = b1² / [ MSE / ∑ (x_i − x̄)² ] = b1²/s²_{b1} = ( b1/s_{b1} )²

Example 2.6. Continuing from example 1.2, note that t² = 10.290² = 105.9 = F, with the same p-value.

2.3.2 Goodness of fit

A goodness of fit statistic is a quantity that measures how well a model

explains a given set of data. For regression, we will use the coefficient of

determination

R² = SSR/SST = 1 − SSE/SST,

which is the proportion of variability in the response (to its naive mean y)

that is explained by the regression model, and R2 ∈ [0, 1].

Remark 2.6. For simple linear regression with (only) one predictor, the coefficient of determination is the square of the correlation coefficient, with the sign matching that of the slope, i.e.

r = +√R² if b1 > 0,   −√R² if b1 < 0,   0 if b1 = 0

Example 2.7. In the output of example 1.2 we have R² = 0.8215, implying that 82.15% of the (naive) variability in the work hours can now be explained by the regression model that incorporates lot size as the only predictor.


2.4 Normal Correlation Models

Normal correlation models are useful when instead of a random normal re-

sponse and a fixed predictor, there are two random normal variables and one

will be used to model the other.

Let (Y1, Y2) have a bivariate normal distribution with p.d.f.

f(y1, y2) = [ 1 / ( 2πσ1σ2 √(1 − ρ12²) ) ] exp{ −1/(2(1 − ρ12²)) [ ((y1 − µ1)/σ1)² − 2ρ12 ((y1 − µ1)/σ1)((y2 − µ2)/σ2) + ((y2 − µ2)/σ2)² ] }

where ρ12 is the correlation coefficient σ12/(σ1σ2). It can be shown that marginally Y1 ∼ N(µ1, σ1²) and Y2 ∼ N(µ2, σ2²). Hence, the conditional density of (Y1|Y2 = y2), and similarly of (Y2|Y1 = y1), can be found as

f(y1|y2) = f(y1, y2)/f(y2) = [ 1/(√(2π) σ_{1|2}) ] exp{ −(1/2) ( (y1 − α_{1|2} − β_{1|2} y2) / σ_{1|2} )² }

where α_{1|2} = µ1 − µ2 ρ12 (σ1/σ2), β_{1|2} = ρ12(σ1/σ2), and σ²_{1|2} = σ1²(1 − ρ12²). Thus,

Y1|Y2 = y2 ∼ N( α_{1|2} + β_{1|2} y2, σ²_{1|2} )

and we can “model” or make educated guesses as to the values of variable

Y1 given Y2 (where Y2 is random).

To determine if Y2 is an adequate “predictor” for Y1, all we need to do is test H0 : ρ12 = 0, since under the null, (Y1|Y2) ≡ Y1. The sample estimate is the same as in equation (1). The test statistic is

T.S. = r12 √(n − 2) / √(1 − r12²) ∼ t_{n−2} under H0,

with p-values for two- and one-sided tests found in the usual way. However, working with confidence intervals is more practical, and even easier if we apply Fisher's transformation to the sample correlation

z′ = (1/2) log[(1 + r12)/(1 − r12)].

If the sample size is large, i.e. n ≥ 25, then approximately

z′ ∼ N( ζ, 1/(n − 3) ),   where ζ = (1/2) log[(1 + ρ12)/(1 − ρ12)],

and a 100(1 − α)% C.I. for ζ is

z′ ∓ z_{1−α/2} √(1/(n − 3)) → (L, U),

and hence a 100(1 − α)% C.I. for ρ12 (after back-transforming ζ) is

( (e^{2L} − 1)/(e^{2L} + 1),  (e^{2U} − 1)/(e^{2U} + 1) ).

Non-normal data: When the data are not normal then we must implement a nonparametric procedure such as the Spearman rank correlation coefficient.

1. Rank (y11, . . . , yn1) from 1 to n and label the ranks (R11, . . . , Rn1).

2. Rank (y12, . . . , yn2) from 1 to n and label the ranks (R12, . . . , Rn2).

3. Compute

   r_s = Σ_{i=1}^n (R_{i1} − R̄1)(R_{i2} − R̄2) / √[ Σ_{i=1}^n (R_{i1} − R̄1)² Σ_{i=1}^n (R_{i2} − R̄2)² ]

To test the null hypothesis of no association between Y1 and Y2 use the test statistic

T.S. = r_s √(n − 2) / √(1 − r_s²) ∼ t_{n−2} under H0.

Reject if p-value < α.

Example 2.8. Consider the Muscle mass problem 1.27 and let Y1 = muscle mass and Y2 = age; we wish to model (Y1|Y2).

> muscle=read.table("http://www.stat.ufl.edu/~rrandles/sta4210/

+ Rclassnotes/data/textdatasets/KutnerData/

+ Chapter%20%201%20Data%20Sets/CH01PR27.txt",col.names=c("Y1","Y2"))

> attach(muscle)

> n=length(Y1)


> r=cor(Y1,Y2);r

[1] -0.866064

> b1=r*sd(Y1)/sd(Y2);b1

[1] -1.189996

> b0=mean(Y1)-mean(Y2)*b1;b0

[1] 156.3466

> s2=var(Y1)*(1-r^2);s2

[1] 65.6686

Hence the estimated model is

Y1|Y2 = y2 ∼ N(156.35 − 1.19y2, 65.67),

with r12 = −0.866.

To test H0 : ρ12 = 0

> TS=(r*sqrt(n-2))/sqrt(1-r^2)

> 2*pt(-abs(TS),n-2) #2 sided pvalue

[1] 4.123987e-19

we reject the null due to the extremely small p-value. We can also create a

95% C.I. for ρ12

> zp=0.5*log((1+r)/(1-r))

> LU=zp+c(1,-1)*qnorm(0.025)*1/sqrt(n-3)

> (exp(2*LU)-1)/(exp(2*LU)+1)

[1] -0.9180874 -0.7847085

and conclude that there is a significant negative relationship.

Obviously, before performing any of these procedures we need to be able to assume that both variables are normal (how to assess this is covered later). If we cannot assume normality then we need to use Spearman's correlation

> rs=cor(Y1,Y2,method="spearman");rs # default method is pearson

[1] -0.8657217

> TSs=(rs*sqrt(n-2))/sqrt(1-rs^2)

> 2*pt(-abs(TSs),n-2) #2 sided pvalue

[1] 4.418881e-19

and reach the same conclusion.

http://www.stat.ufl.edu/~athienit/STA4210/Examples/corr_model.R


Chapter 3

Diagnostics and Remedial

Measures

3.1 Diagnostics for Predictor Variable

The goal is to identify any outlying values that could affect the appropriateness of the linear model. More information about influential cases will be covered in Chapter 10. The two main issues are:

• Outliers.

• The levels of the predictor are associated with the run order when the

experiment is run sequentially.

To check these we use

• Histogram and/or Boxplot

• Sequence Plot

Example 3.1. Continuing from example 1.2 we see that there do not appear

to be any outliers

[Histogram and box plot of Lot Size.]

and no pattern/dependency between the values of the predictor and the run order.

[Sequence plot of Lot Size versus run order.]

3.2 Checking Assumptions

Recall that for the simple linear regression model

Yi = β0 + β1 xi + ǫi,   i = 1, . . . , n,

we assume that the ǫi are i.i.d. N(0, σ²). However, once a model is fit, and before any inference or conclusions are made based upon it, the assumptions of the model need to be checked. These are:

1. Normality

2. Homogeneity of variance

3. Model fit/Linearity

4. Independence

with components of model fit being checked simultaneously within the first three. The assumptions are checked using the residuals ei := yi − ŷi for i = 1, . . . , n, or the standardized residuals, which are the residuals standardized so that their standard deviation should be 1.

3.2.1 Graphical methods

Normality

The simplest way to check for normality is with two graphical procedures:

• Histogram

• P-P or Q-Q plot

A probability plot is a graphical technique for comparing two data sets, either two sets of empirical observations, or one empirical set against a theoretical one.

Definition 3.1. The empirical distribution function, or empirical c.d.f., is the cumulative distribution function associated with the empirical measure of the sample. This c.d.f. is a step function that jumps up by 1/n at each of the n data points:

F_n(x) = (number of elements ≤ x)/n = (1/n) Σ_{i=1}^n I{x_i ≤ x}

Example 3.2. Consider the sample: 1, 5, 7, 8. The empirical c.d.f. is

F_4(x) = 0      if x < 1
         0.25   if 1 ≤ x < 5
         0.50   if 5 ≤ x < 7
         0.75   if 7 ≤ x < 8
         1      if x ≥ 8

[Step plot of F_4(x) against x.]

Figure 3.1: Empirical c.d.f.

The normal probability plot is a graphical technique for normality testing, assessing whether or not a data set is approximately normally distributed. The data are plotted against a theoretical normal distribution in such a way that the points should form an approximate straight line. Departures from this straight line indicate departures from normality.

There are two types of plots commonly used to compare the empirical c.d.f. to the normal theoretical one (G(·)):

• the P-P plot, which plots (F_n(x), G(x)) (with scales changed so that agreement looks linear),

• the Q-Q plot, which plots the quantile functions (F_n^{-1}(x), G^{-1}(x)).
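For a self-contained illustration of these two building blocks, here is a minimal R sketch on simulated data; the sample and object names are purely illustrative.

# Empirical c.d.f. and normal Q-Q plot for a small simulated sample
set.seed(1)
x  <- rnorm(30, mean = 10, sd = 2)     # illustrative data
Fn <- ecdf(x)                          # step function jumping by 1/n at each point
plot(Fn, main = "Empirical c.d.f.")
Fn(10)                                 # proportion of observations <= 10
qqnorm(x)                              # sample versus theoretical normal quantiles
qqline(x)                              # reference line; departures suggest non-normality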

Example 3.3. An experiment measuring lead concentrations (mg/kg dry weight) at 37 stations yielded 37 observations. Of interest is to determine if the data are normally distributed (of more practical use if sample sizes are small, e.g. < 30).

[Normal Q-Q plot of the lead concentrations and a smoothed histogram of the data overlaid with a normal density.]

Note that the data appears to be skewed right, with a lighter tail on the

left and a heavier tail on the right (as compared to the normal).

http://www.stat.ufl.edu/~athienit/IntroStat/QQ.R

With the vertical axis being the theoretical quantiles and the horizontal axis being the sample quantiles, the interpretation of P-P plots and Q-Q plots is equivalent. Compared to the straight line that corresponds to the distribution you wish to compare your data to, here is a quick guideline for how the tails behave:

              Left tail   Right tail
Above line    Heavier     Lighter
Below line    Lighter     Heavier

A histogram of the residuals is plotted and we try to determine if the histogram is symmetric and bell shaped like a normal distribution. In addition, to check the model fit, we assume the observed response values yi are centered around the regression line ŷ. Hence, the histogram of the residuals should be centered at 0.

Example 3.4. Referring to Example 1.1, we obtain the following.

[Normal Q-Q plot and histogram of the standardized residuals.]

Homogeneity of variance/Fit of model

Recall that the regression model assumes that the errors ǫi have constant variance σ². In order to check this assumption a plot of the residuals (ei) versus the fitted values (ŷi) is used. If the variance is constant, one expects to see a constant spread/distance of the residuals about the 0 line across all the ŷi values on the horizontal axis. Referring to Example 1.1, we see that this assumption does not appear to be violated.

[Plot of standardized residuals versus fitted values.]

Figure 3.2: Residual versus fitted values plot.

In addition, the same plot can be used to check the fit of the model. If the model is a good fit, one expects to see the residuals evenly spread on either side of the 0 line. For example, if we observe residuals that are more heavily sided above the 0 line for some interval of ŷi, then this is an indication that the regression line is not “moving” through the center of the data points for that section. By construction, the regression line does “move” through the center of the data overall, i.e. for the whole big picture. So if it is underestimating (or overestimating) for some portion then it will overestimate (or underestimate) for some other. This is an indication that there is some curvature and that perhaps some polynomial terms should be added. (To be discussed in the next chapter.)

Independence

To check for independence a time series plot of the residuals/standardized residuals is used, i.e. a plot of the value of the residual versus the value of its position in the data set. For example, the first data point (x1, y1) will yield the residual e1 = y1 − ŷ1. Hence, the order of e1 is 1, and so forth. Independence is graphically checked if there is no discernible pattern in the plot. That is, one cannot predict the next ordered residual by knowing a few previous ordered residuals. Referring to Example 1.1, we obtain the following plot where there does not appear to be any discernible pattern.

[Time series plot of the standardized residuals against order.]

Figure 3.3: Time series plot of residuals.

Remark 3.1. Note that when creating this plot, the order in which the data were obtained must be the same as the order in which they appear in the datasheet. For example, assume that each person in a group is asked a question in turn. Then possibly the second person might be influenced by the first person's response, and so forth. If the data were then sorted, e.g. alphabetically, this order may be lost.

It is also important to note that this graph is heavily influenced by the validity of the model fit. Here is an example we will actually be addressing later, in example 3.13.

[Diagnostic plots for the model of example 3.13: scatterplot of prop versus time, histogram of the standardized residuals, normal Q-Q plot, time series plot of the residuals (Independence), and residuals versus fitted values (Homogeneity / Fit).]

http://www.stat.ufl.edu/~athienit/STA4210/Examples/copier.R

3.2.2 Significance tests

Independence

• Runs Test (presumes data are in time order)

  – Write out the sequence of +/− signs of the residuals.

  – Count n1 = number of positive residuals and n2 = number of negative residuals.

  – Count u = number of “runs” of positive and negative residuals. So what is a run? For example, if we have the following 9 residuals

      −  |  + + +  |  − −  |  +  |  − −
     run 1  run 2   run 3  run 4  run 5

    then we have in fact u = 5 runs with n1 = 4 and n2 = 5.

The null hypothesis is that the data are independent/randomly placed. We will use the exact sampling distribution of u to determine the p-value. The p.m.f. of the corresponding r.v. U is

p(u) = 2 C(n1−1, k−1) C(n2−1, k−1) / C(n1+n2, n1),                            u = 2k (u even)

p(u) = [ C(n1−1, k) C(n2−1, k−1) + C(n2−1, k) C(n1−1, k−1) ] / C(n1+n2, n1),  u = 2k + 1 (u odd)

where C(a, b) denotes the binomial coefficient “a choose b” and k ∈ N. Then, the p-value is defined as P(U ≤ u). Luckily, there is no need to do this by hand.

Example 3.5. Continuing from example 1.2, we run the “Runs” test on the standardized residuals in R

> library(lawstat) #may need to install package
> runs.test(re,plot.it=TRUE)

Runs Test - Two sided

data: re
Standardized Runs Statistic = -1.015, p-value = 0.3101

and note that we fail to reject the null due to the large p-value. We have 11 runs out of a maximum of 25. There is also another runs.test under the randtest package (which actually provides the value of u).

[Plot of the residual signs in run order, produced by runs.test.]

Remark 3.2. It is notable that for n1 + n2 > 20, approximately

U ∼ N( μ_u, σ_u² ),   μ_u = 2n1n2/(n1 + n2) + 1,   σ_u² = 2n1n2(2n1n2 − n1 − n2) / [ (n1 + n2)²(n1 + n2 − 1) ],

and a test statistic

T.S. = (u − μ_u + 0.5)/σ_u

can be used to calculate the p-value P(Z ≤ T.S.).
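Below is a hedged R sketch of this normal approximation, written directly from the formulas above; it is an illustration only, not the lawstat implementation, and the residual vector passed to it is assumed to be in time order.

# Normal approximation to the runs test, following the formulas in Remark 3.2
runs.approx <- function(re) {
  s  <- sign(re)                         # +/- signs of the residuals
  s  <- s[s != 0]                        # drop exact zeros, if any
  n1 <- sum(s > 0); n2 <- sum(s < 0)
  u  <- 1 + sum(diff(s) != 0)            # number of runs
  mu <- 2 * n1 * n2 / (n1 + n2) + 1
  s2 <- 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) /
        ((n1 + n2)^2 * (n1 + n2 - 1))
  ts <- (u - mu + 0.5) / sqrt(s2)        # continuity-corrected statistic
  c(runs = u, T.S. = ts, p.value = pnorm(ts))
}
runs.approx(rnorm(30))                   # example call with simulated "residuals"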

• Durbin-Watson Test. For this test we assume that the error term in equation (1.1) is of the form

  ǫi = ρǫ_{i−1} + u_i,   u_i i.i.d. N(0, σ²),   |ρ| < 1.

  That is, the error term at a certain time period i is correlated with the error term at time period i − 1.

  The null hypothesis is H0 : ρ = 0, i.e. uncorrelated. The test statistic is

  T.S. = Σ_{i=2}^n (e_i − e_{i−1})² / Σ_{i=1}^n e_i²,

  where the denominator is SSE. Once the sampling distribution of the test statistic is determined, p-values can be obtained. However, the density function of this statistic is not easy to work with, so we leave the heavy lifting to software.

Example 3.6. Continuing from example 1.2,

> library(car)

> durbinWatsonTest(toluca.reg)

lag Autocorrelation D-W Statistic p-value

1 0.2593193 1.43179 0.166

Alternative hypothesis: rho != 0

> library(lmtest)

> dwtest(toluca.reg,alternative="two.sided")


Durbin-Watson test

data: toluca.reg

DW = 1.4318, p-value = 0.1616

alternative hypothesis: true autocorrelation is not 0

The p-value is large, i.e. greater than 5%, and hence we fail to reject

the null, and conclude independence.

Remark 3.3. The book suggests that in business and economics the

correlation tends to be positive and hence a one sided test should be

performed. However, this decision is context specific and left to the

researcher.

Normality test

As expected there are many tests for normality. For a current list visit

https://en.wikipedia.org/wiki/Normality_test. For now, we will dis-

cuss the Shapiro-Wilk Test.

The null hypothesis is that normality holds for the data entered; here we will use the standardized residuals.

Example 3.7. Continuing from example 1.2,

> shapiro.test(re)

Shapiro-Wilk normality test

data: re

W = 0.97917, p-value = 0.8683

and hence we fail to reject the assumption of normality.

Homogeneity of variance

• If the response can be split into t distinct groups, i.e. the predictor(s) are categorical, then use the Brown-Forsythe/Levene Test. This test is used to test whether multiple populations have the same variance.

  The null hypothesis is that

  H0 : V(ǫi) = σ² for all i,

  or equivalently

  H0 : σ1² = · · · = σt².

Remark 3.4. If the data cannot be split into distinct groups, this can be

done artificially by separating the responses based on their predictor

values or fitted values. For example we can create two groups, data

with “small” fitted values and data with “large” fitted values. Much in

the same way we create bins for a histogram.

The test statistic is tedious to calculate and is left for software. However,

T.S. ∼ F_{t−1, n−t} under H0,

where t is the number of groups and n is the grand total number of observations. The p-value is P(F_{t−1,n−t} ≥ T.S.). Reject the null if the p-value < α.

Example 3.8. Continuing with example 1.2, assume we wish to split

the data into two groups depending on whether the lot size is greater

than 75 or not.

> ind=I(toluca$lotsize>75)

> temp=cbind(toluca$lotsize,re,ind);temp

re ind

1 80 1.07281843 1

2 30 -1.06174371 0

3 50 -0.41228961 0

4 90 -0.15886988 1

5 70 1.01932471 0

6 60 -1.10742255 0

7 120 1.25374204 1

8 80 0.08237676 1

9 100 -1.45603607 1

10 50 -1.86517337 0

11 40 -0.96611354 0

12 70 -1.27729740 0

13 90 0.10987623 1

14 20 -0.45782448 0

15 110 -0.43096533 1


16 100 0.01286011 1

17 30 0.92610180 0

18 50 0.56452124 0

19 90 -0.13817529 1

20 110 -0.73719055 1

21 30 2.50810852 0

22 90 1.87651913 1

23 40 0.82578984 0

24 80 -0.12266660 1

25 70 0.21940878 0

> leveneTest(temp[,2],ind) # fcn in car library

Levene’s Test for Homogeneity of Variance (center = median)

Df F value Pr(>F)

group 1 1.6553 0.211

23

Warning message:

In leveneTest.default(temp[, 2], ind) : ind coerced to factor.

With a p-value greater than 0.05 we fail to reject the null.

• Breusch-Pagan/Cook-Weisberg Test. It tests whether the estimated variance of the residuals from a regression depends on the values of the independent/predictor variables. In that case, heteroskedasticity is present:

  σ² = E(ǫ²) = γ0 + γ1 x1 + · · · + γp xp.

  The null hypothesis is that the variance is independent of the independent/predictor variables. Although we will usually have software calculate the test statistic, the process is fairly simple.

  1. Obtain SSE = Σ_{i=1}^n e_i² from the original regression.

  2. Fit a regression with e_i² as the response using the same predictor(s), and obtain SSR⋆.

  3. Compute

     T.S. = (SSR⋆/2) / (SSE/n)² ∼ χ²_p under H0,

     where p is the number of predictors in the model. The null is rejected if the p-value P(χ²_p ≥ T.S.) < α.
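As a sanity check on the three steps above, here is a minimal R sketch that carries them out by hand on the built-in cars data; it is only an illustration of the recipe, not a replacement for ncvTest or bptest.

# Manual Breusch-Pagan statistic following the three steps above ('cars' data)
fit  <- lm(dist ~ speed, data = cars)
e2   <- resid(fit)^2                     # squared residuals from the original fit
SSE  <- sum(e2)                          # step 1
aux  <- lm(e2 ~ speed, data = cars)      # step 2: regress e^2 on the same predictor
SSRs <- sum((fitted(aux) - mean(e2))^2)  # SSR* from the auxiliary regression
n <- nrow(cars); p <- 1
TS   <- (SSRs / 2) / (SSE / n)^2         # step 3
c(T.S. = TS, p.value = pchisq(TS, df = p, lower.tail = FALSE))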

Example 3.9. Continuing with example 1.2,

> ncvTest(toluca.reg) # fcn in car library

Non-constant Variance Score Test

Variance formula: ~ fitted.values

Chisquare = 0.8209192 Df = 1 p = 0.3649116

Hence, we fail to reject the null since the p-value > α.

Linearity of regression

We will perform an F-test for Lack-of-Fit if there are t distinct levels of the predictor(s). It is not a valid test if the number of distinct levels is large, i.e. t ≈ n.

H0 : E(Yi) = β0 + β1 xi   vs   Ha : E(Yi) ≠ β0 + β1 xi

1. For each distinct level compute ȳj and ŷj, j = 1, . . . , t.

2. Compute SSLF = Σ_{j=1}^t Σ_{i=1}^{nj} (ȳj − ŷj)² = Σ_{j=1}^t nj (ȳj − ŷj)², with degrees of freedom t − 2.

3. Compute SSPE = Σ_{j=1}^t Σ_{i=1}^{nj} (yij − ȳj)², with degrees of freedom n − t.

4. Compute

   T.S. = [SSLF/(t − 2)] / [SSPE/(n − t)] ∼ F_{t−2, n−t} under H0.

The null is rejected if the p-value P(F_{t−2,n−t} ≥ T.S.) < α.

In R there is a work-around where you do not have to compute these sums of squares explicitly, as illustrated in the following example.

Example 3.10. Continuing with example 1.2, we note that there are 11

distinct levels of lot size in the 25 observations.

> length(unique(toluca$lotsize));length(toluca$lotsize)

[1] 11

[1] 25

> Reduced=toluca.reg # fit reduced model


> Full=lm(workhrs~0+as.factor(lotsize),data=toluca) # fit full model

> anova(Reduced, Full) # get lack-of-fit test

Analysis of Variance Table

Model 1: workhrs ~ lotsize

Model 2: workhrs ~ 0 + as.factor(lotsize)

Res.Df RSS Df Sum of Sq F Pr(>F)

1 23 54825

2 14 37581 9 17245 0.7138 0.6893

The p-value is greater than 0.05 so we fail to reject the null and conclude

that the model is an adequate (linear) fit.

3.3 Remedial Measures

• Nonlinear Relation: Add polynomials, fit an exponential regression function, or transform x and/or y (more emphasis on x).

• Non-Constant Variance: Weighted Least Squares, transform y and/or x, or fit a Generalized Linear Model.

• Non-Independence of Errors: Transform y, use Generalized Least Squares, or fit a Generalized Linear Model with correlated errors.

• Non-Normality of Errors: Box-Cox transformation, or fit a Generalized Linear Model.

• Omitted Predictors: Include important predictors in a multiple regression model.

• Outlying Observations: Robust estimation or nonparametric regression.

3.3.1 Box-Cox (Power) transformation

In the event that the model assumptions appear to be violated to a significant degree, a linear regression model on the available data is not valid. However, have no fear, your friendly statistician is here. The data can be transformed, in an attempt to fit a valid regression model to the new transformed data set. Both the response and the predictor can be transformed, but there is usually more emphasis on the response.

Remark 3.5. However, when we apply such a transformation, call it g(·), we are in fact fitting the mean line

E(g(Y)) = β0 + β1 x1 + . . .

As a result we cannot back-transform, i.e. apply the inverse transformation, to make inference on E(Y), since

g⁻¹[E(g(Y))] ≠ E(Y)

A common transformation mechanism is the Box-Cox transformation (also known as the power transformation). This transformation mechanism, when applied to the response variable, will attempt to remedy the “worst” of the violated assumptions, i.e. to reach a compromise. A word of caution is that in an attempt to remedy the worst it may worsen the validity of one of the other assumptions. The mechanism works by trying to identify the (minimum or maximum, depending on software) value of a parameter λ that will be used as the power to which the responses will be transformed. The transformation is

y_i^{(λ)} = (y_i^λ − 1)/(λ G_y^{λ−1})   if λ ≠ 0
y_i^{(λ)} = G_y log(y_i)                if λ = 0

where G_y = (Π_{i=1}^n y_i)^{1/n} denotes the geometric mean of the responses. Note that a value of λ = 1 effectively implies no transformation is necessary. There are many software packages that can calculate an estimate for λ, and if the sample size is large enough even create a C.I. around the value. Referring to Example 1.1, we see that λ = 1.11.

[Box-Cox profile log-likelihood plot with 95% interval; the maximizing value is λ = 1.11.]

Figure 3.4: Box-Cox plot.

However, one could argue that the value is close to 1 and that a transfor-

mation may not necessarily improve the overall validity of the assumptions,

so no transformation is necessary. In addition, we know that linear regres-

sion is somewhat robust to deviations from the assumptions, and it is more

practical to work with the untransformed data that are in the original units

of measurements. For example, if the data is in miles and a transformation

is used on the response, inference will be on log(miles).

Example 3.11. Continuing from example 1.1, we use the following R script:

http://www.stat.ufl.edu/~athienit/STA4210/Examples/boxcox.R
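For readers who want a self-contained illustration of estimating λ, here is a hedged sketch using boxcox() from the MASS package on the built-in cars data; it is a stand-in for, not a copy of, the linked script.

# Minimal Box-Cox illustration with MASS::boxcox ('cars' data)
library(MASS)
fit <- lm(dist ~ speed, data = cars)
bc  <- boxcox(fit, lambda = seq(-2, 2, by = 0.05))   # plots the profile log-likelihood
bc$x[which.max(bc$y)]                                # lambda maximizing the likelihood
# A value near 1 suggests no transformation of the response is needed.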

Example 3.12. http://www.stat.ufl.edu/~athienit/STA4210/Examples/diagnostic&BoxCox

If the model fit assumption is the major culprit violated, a trans-

formation of the predictor(s) will often resolve the issue without

having to transform the response and consequently changing its scale.

Example 3.13. In an experiment, 13 subjects were asked to memorize a list of disconnected items and were then asked to recall them at various times up to a week later.

• Response = proportion of items recalled correctly.

• Predictor = time, in minutes, since initially memorizing the list.

Time  1     5     15    30    60    120   240
Prop  0.84  0.71  0.61  0.56  0.54  0.47  0.45

Time  480   720   1440  2880  5760  10080
Prop  0.38  0.36  0.26  0.20  0.16  0.08

[Diagnostic plots for the untransformed fit: scatterplot of prop versus time, histogram of the standardized residuals, normal Q-Q plot, time series plot of the residuals (Independence), and residuals versus fitted values (Homogeneity / Fit).]

bcPower Transformation to Normality

Est.Power Std.Err. Wald Lower Bound Wald Upper Bound

dat$time 0.0617 0.1087 -0.1514 0.2748

Likelihood ratio tests about transformation parameters

LRT df pval

LR test, lambda = (0) 0.327992 1 5.668439e-01

LR test, lambda = (1) 46.029370 1 1.164935e-11

It seems that a decent choice for λ is 0, i.e. a log transformation for time.

[Diagnostic plots after the log transformation of time: scatterplot of prop versus l.time, histogram of the standardized residuals, normal Q-Q plot, time series plot of the residuals (Independence), and residuals versus fitted values (Homogeneity / Fit).]

http://www.stat.ufl.edu/~athienit/STA4210/Examples/diagnostic&Linearity.R

Remark 3.6. When creating graphs and checking for “patterns”, try to keep the axis for the standardized residuals ranging from −3 to 3, that is, from 3 standard deviations below 0 to 3 standard deviations above 0. Software has a tendency to “zoom” in.

Is glass smooth? If you are viewing by eye then yes. If you are viewing via an electron microscope then no.

In R just add plot(....., ylim=c(-3,3))

3.3.2 Lowess (smoothed) plots

• Nonparametric method of obtaining a smooth plot of the regression relation between y and x.

• Fits a regression in small neighborhoods around points along the regression line on the horizontal axis.

• Weights observations closer to the specific point higher than more distant points.

• Re-weights after fitting, putting lower weights on larger residuals (in absolute value).

• Obtains a fitted value for each point after the “final” regression is fit.

• The model is plotted along with the linear fit and confidence bands; the linear fit is good if the lowess curve lies within the bands.

Example 3.14. For example 1.2, assume we wish to fit a lowess regression in R using the loess function with smoothing parameter α = 0.5.

[Scatterplot of workhrs versus lotsize showing the loess fit (α = 0.5), its 95% confidence band, and the simple linear regression line.]

Figure 3.5: Lowess smoothed plot.

http://www.stat.ufl.edu/~athienit/STA4210/Examples/loess.R


Chapter 4

Simultaneous Inference and Other Topics

The main concept here is that if a 95% C.I. is created for β0 and another 95% C.I. for β1, we cannot say that we are 95% confident that these two confidence intervals are simultaneously both correct.

4.1 Controlling the Error Rate

Let αI denote the individual comparison Type I error rate. Thus, P(Type I error) = αI on each of the g tests.

Now assume we wish to combine all the individual tests into an overall/combined/simultaneous test

H0 = H01 ∩ H02 ∩ · · · ∩ H0g

H0 is rejected if any of the null hypotheses H0i is rejected.

The experimentwise error rate αE is the probability of falsely rejecting at least one of the g null hypotheses. If each of the g tests is done with αI, then assuming each test is independent and denoting the event of not falsely rejecting H0i by Ei,

αE = 1 − P(∩_{i=1}^g Ei) = 1 − Π_{i=1}^g P(Ei)   (independence)
   = 1 − (1 − αI)^g

For example, if αI = 0.05 and 10 comparisons are made then αE = 0.401, which is very large.

However, if we do not know whether the tests are independent, we use the Bonferroni inequality

P(∩_{i=1}^g Ei) ≥ Σ_{i=1}^g P(Ei) − g + 1

which implies

αE = 1 − P(∩_{i=1}^g Ei) ≤ g − Σ_{i=1}^g P(Ei) = Σ_{i=1}^g [1 − P(Ei)] = Σ_{i=1}^g αI = g αI.

Hence, αE ≤ g αI. So what we will do is choose an α to serve as an upper bound for αE. That is, we won't know the true value of αE but we will know it is bounded above by α, i.e. αE ≤ α. For example, if we set α = 0.05 then αE ≤ 0.05, or the simultaneous C.I. formed from g individual C.I.'s will have a confidence of at least 95% (if not more). Set

αI = α/g.

For example, if we have 5 multiple comparisons and wish the overall error rate to be 0.05, or simultaneous confidence of at least 95%, then each one (of the five) C.I.'s must be done at the

100(1 − 0.05/5)% = 99%

confidence level.

For additional details the reader can read the multiple comparisons problem

and the familywise error rate.


4.1.1 Simultaneous estimation of mean responses

• Bonferroni: Can be used for g simultaneous C.I.s, each done at the 100(1 − α/g)% level. If g is large then these intervals will be “too” wide for practical conclusions.

  ŷ ∓ t_{1−α/(2g), n−2} s_{Ŷ}

• Working-Hotelling: A confidence band is created for the entire regression line that can be used for any number of confidence intervals for means simultaneously.

  ŷ ∓ √(2 F_{1−α; 2, n−2}) s_{Ŷ}

4.1.2 Simultaneous predictions

• Bonferroni: Can be used for g simultaneous P.I.s, each done at the 100(1 − α/g)% level. If g is large then these intervals will be “too” wide for practical conclusions.

  ŷ ∓ t_{1−α/(2g), n−2} s_pred

• Scheffe: Widely used method. Like the Bonferroni, the width increases as g increases.

  ŷ ∓ √(g F_{1−α; g, n−2}) s_pred

A sketch comparing the critical multipliers of these procedures is given below.
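The following minimal R sketch compares the multipliers for a hypothetical case with n = 25, α = 0.05 and g = 3; these numbers and the object names are purely illustrative.

# Critical multipliers for simultaneous intervals (illustrative: n = 25, g = 3)
n <- 25; alpha <- 0.05; g <- 3
bonf    <- qt(1 - alpha / (2 * g), df = n - 2)   # Bonferroni (C.I.s or P.I.s)
wh      <- sqrt(2 * qf(1 - alpha, 2, n - 2))     # Working-Hotelling (mean responses)
scheffe <- sqrt(g * qf(1 - alpha, g, n - 2))     # Scheffe (predictions)
c(Bonferroni = bonf, Working.Hotelling = wh, Scheffe = scheffe)
# Each interval is then  yhat -/+ multiplier * (s_Yhat or s_pred).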

4.2 Regression Through the Origin

When theoretical reasoning (in the context at hand) suggests that the regression line must pass through the origin (x = 0, y = 0), the regression line must try to meet this criterion. This is done by restricting the intercept to 0, i.e. β0 = 0, yielding the model

Yi = β1 xi + ǫi

However, with this model there are some issues:

• V(Y |x = 0) = 0. The variance of the response at the origin is set to 0, which is not consistent with the “usual” regression model.

• Σ_{i=1}^n ei does not necessarily equal 0.

• SSE can potentially be larger than SST, affecting the analysis of variance and the interpretation of R².

The reader is referred to p.164 of the textbook for more details.

To estimate β1 via least squares we need to minimize

Σ_{i=1}^n (yi − β1 xi)².

Taking the derivative with respect to β1 and equating to 0, we have

−2 Σ_{i=1}^n [xi(yi − β1 xi)] = 0
⇒ Σ_{i=1}^n xi yi = b1 Σ_{i=1}^n xi²
⇒ b1 = Σ_{i=1}^n xi yi / Σ_{i=1}^n xi² = Σ_{i=1}^n ( xi / Σ_{i=1}^n xi² ) yi.

The only other difference is that the error degrees of freedom are now n − 1, hence slightly changing the MSE estimate. As a result,

• s²_{b1} = MSE / Σ_{i=1}^n xi²

• s²_{Ŷ} = MSE x² / Σ_{i=1}^n xi²

• s²_pred = MSE ( 1 + x² / Σ_{i=1}^n xi² )

which are used in the C.I. for β1, the C.I. for the mean response, and the P.I.

Example 4.1. A plumbing company operates 12 warehouses. A regression

is fit with work units performed (x) and total variable cost (y). A regression

through the origin yielded

[Scatterplot of labor versus work with the fitted regression line through the origin.]

> plumbing=read.table(...

> with(plumbing,plot(labor~work,pch=16))

> plumb.reg=lm(labor~0+work,data=plumbing)

> summary(plumb.reg)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

work 4.68527 0.03421 137 <2e-16 ***

Residual standard error: 14.95 on 11 degrees of freedom

Multiple R-squared: 0.9994,Adjusted R-squared: 0.9994

F-statistic: 1.876e+04 on 1 and 11 DF, p-value: < 2.2e-16

> abline(plumb.reg)

Now we can create a PI for when work is equal to 100. R can do this too and

it uses the right standard error. However, if you ask it to print the se.fit

it only provides the se.fit for the CI, not the PI

> syhat=sqrt(223.42*(100^2/sum(plumbing$work^2)));syhat

[1] 3.420475

> spred=sqrt(223.42*(1+100^2/sum(plumbing$work^2)));spred

[1] 15.33361

> newdata=data.frame(work=100)

> predict.lm(plumb.reg,newdata)+c(1,-1)*qt(0.025,11)*spred

[1] 434.7784 502.2765


> predict.lm(plumb.reg,newdata,se.fit=TRUE,interval="prediction")

$fit

fit lwr upr

1 468.5274 434.7781 502.2767

$se.fit

[1] 3.420502

$df

[1] 11

http://www.stat.ufl.edu/~athienit/STA4210/Examples/plumbing_origin.R

4.3 Measurement Errors

Firstly, let’s take a look at what we mean when a variable/effect is fixed or

random and why there is still confusion concerning the use of these.

http://andrewgelman.com/2005/01/25/why_i_dont_use/

4.3.1 Measurement error in the dependent variable

There is no problem as long as there is no bias, i.e. consistently recording

lower or higher values. The extra error term is absorbed into the existing

error term ǫ for the response Y .

4.3.2 Measurement error in the independent variable

Assume there is no bias in the measurement error.

• Not a problem when the observed (recorded) value is fixed and the actual value is random. For example, when the oven dial is set to 400°F the actual temperature inside is not exactly 400°F.

• When the observed (recorded) value is random it causes a problem by biasing β1 downward. Let Xi denote the true (unobserved) value, and Xi⋆ the observed (recorded) value. Then, the measurement error is

  δi = Xi⋆ − Xi

  The true model can be expressed as

  Yi = β0 + β1 Xi + ǫi = β0 + β1(Xi⋆ − δi) + ǫi = β0 + β1 Xi⋆ + (ǫi − β1 δi)

  and we assume that δi is

  – unbiased, i.e. E(δi) = 0,

  – uncorrelated with the random error, implying that E(ǫi δi) = E(ǫi) E(δi) = 0.

  Hence,

  Cov(Xi⋆, ǫi − β1δi) = E{[Xi⋆ − E(Xi⋆)][(ǫi − β1δi) − E(ǫi − β1δi)]}
                      = E{[Xi⋆ − Xi][ǫi − β1δi]}
                      = E{δi(ǫi − β1δi)}
                      = E{δiǫi} − β1 E{δi²}
                      = −β1 V(δi)

  Therefore, the recorded value Xi⋆ is not independent of the error term (ǫi − β1δi) and

  E(Yi|Xi⋆) = β0⋆ + β1⋆ Xi⋆

  where

  β1⋆ = β1 σ²_X / (σ²_X + σ²_δ) < β1.

4.4 Inverse Prediction

The goal is to predict a new predictor value based on an observed new value of the response. Once we have a model, it is easy to show (by rearranging terms in the prediction equation) that the prediction is

x̂_new = (y_new − b0)/b1

It has been shown (in higher level statistics courses) that if

t²_{1−α/2, n−2} MSE / [ b1² Σ_{i=1}^n (xi − x̄)² ]  is approximately  < 0.1,

then a 100(1 − α)% P.I. for x_new is

x̂_new ∓ t_{1−α/2, n−2} √{ (MSE/b1²) [ 1 + 1/n + (x̂_new − x̄)² / Σ_{i=1}^n (xi − x̄)² ] }

Remark 4.1. Bonferroni or Scheffe adjustments should be made for multiple simultaneous predictions.
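Here is a small R sketch of this prediction interval using the built-in cars data and a hypothetical new response value ynew = 60; apart from the formulas above, everything in it is illustrative.

# Inverse prediction interval for x_new, following the formula above ('cars' data)
fit  <- lm(dist ~ speed, data = cars)
b0   <- unname(coef(fit)[1]); b1 <- unname(coef(fit)[2])
MSE  <- summary(fit)$sigma^2
x    <- cars$speed; n <- length(x)
ynew <- 60                                   # hypothetical observed new response
xhat <- (ynew - b0) / b1                     # point prediction of x_new
tq   <- qt(0.975, df = n - 2)
check <- tq^2 * MSE / (b1^2 * sum((x - mean(x))^2))   # should be (approx.) < 0.1
se    <- sqrt((MSE / b1^2) * (1 + 1/n + (xhat - mean(x))^2 / sum((x - mean(x))^2)))
c(check = check, lower = xhat - tq * se, upper = xhat + tq * se)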

4.5 Choice of Predictor Levels

Recall that in most standard errors the term Σ_{i=1}^n (xi − x̄)² was present somewhere in a denominator. For example,

V(B1) = σ² / Σ_{i=1}^n (xi − x̄)²

So, in order to decrease the standard error we need to maximize this term, which in essence is a measure of spread of the predictor, by

(i) increasing the sample size n,

(ii) increasing the spacing of the predictor levels.

Depending on the goal of the research, when planning a controlled experiment and selecting predictor levels, choose:

• 2 levels if only interested in whether there is an effect and its direction,

• 3 levels if the goal is describing the relation and any possible curvature,

• 4 or more levels for further description of the response curve and any potential non-linearity such as an asymptote.

Chapter 5

Matrix Approach to Simple Linear Regression

We will cover the basics necessary to provide us with a better understanding of regression, which will be especially useful for multiple regression. The reader is also encouraged to review further topics and material at

• http://stattrek.com/tutorials/matrix-algebra-tutorial.aspx

• https://www.youtube.com/watch?v=xyAuNHPsq-g&list=PLFD0EB975BA0CC1E0

Definition 5.1. A matrix is a rectangular array of numbers or symbolic elements.

In many applications, the rows will represent individual cases and the columns will represent attributes or characteristics. The dimension of a matrix is its number of rows and columns, often denoted m × n, and it has the form

A_{m,n} = [ a_{1,1}  a_{1,2}  · · ·  a_{1,n}
            a_{2,1}  a_{2,2}  · · ·  a_{2,n}
              ...      ...    . . .    ...
            a_{m,1}  a_{m,2}  · · ·  a_{m,n} ]

5.1 Special Types of Matrices

• Square matrix: The number of rows is the same as the number of columns. For example,

  A_{2,2} = [ a_{1,1}  a_{1,2}
              a_{2,1}  a_{2,2} ]

• Vector: A column vector is a matrix with only one column, and a row vector is a matrix with only one row. For example,

  c = (c1, c2, . . . , cn)ᵀ

• Transpose: A matrix formed by interchanging rows and columns. For example,

  G = [ 6  15  22
        8  13  25 ]
  ⇒ Gᵀ = [  6   8
           15  13
           22  25 ]

• Matrix equality: Two matrices of the same dimension are equal when each element in the same position in each matrix is equal.

• Symmetric matrix: A square matrix whose transpose is equal to itself, i.e. Aᵀ = A, or element-wise a_{i,j} = a_{j,i}. For example,

  A = [  6  19  −8
        19  14   3
        −8   3   1 ]   ⇒   Aᵀ = A.

• Diagonal matrix: A square matrix with all off-diagonal elements equal to 0. For example,

  A_{3,3} = [ a1  0   0
              0   a2  0
              0   0   a3 ]  = diag(a1, a2, a3)

• Identity matrix: A diagonal matrix with all the diagonal elements equal to 1, i.e. I_m = diag(1, 1, . . . , 1). For example,

  I3 = [ 1  0  0
         0  1  0
         0  0  1 ]

  We will see later that I_m A_{m,n} = A_{m,n}, and that A_{m,n} I_n = A_{m,n}.

• Scalar matrix: A diagonal matrix with all the diagonal elements equal to the same scalar k, that is, kI_m. For example,

  kI3 = [ k  0  0
          0  k  0
          0  0  k ]

• 1-vector and matrix: The 1-vector is simply a column vector whose elements are all 1. Similarly for the matrix of ones, denoted by J:

  J3 = [ 1  1  1
         1  1  1
         1  1  1 ]

5.2 Basic Matrix Operations

To perform basic matrix operations in R, please visit
http://www.statmethods.net/advstats/matrix.html.

5.2.1 Addition and subtraction

Addition and subtraction are done elementwise for matrices of the same dimension:

A_{m,n} + B_{m,n} = [ a_{1,1}+b_{1,1}  a_{1,2}+b_{1,2}  · · ·  a_{1,n}+b_{1,n}
                      a_{2,1}+b_{2,1}  a_{2,2}+b_{2,2}  · · ·  a_{2,n}+b_{2,n}
                            ...              ...        . . .        ...
                      a_{m,1}+b_{m,1}  a_{m,2}+b_{m,2}  · · ·  a_{m,n}+b_{m,n} ]

Similarly for subtraction.

In regression, let

Y = (Y1, . . . , Yn)ᵀ,   E(Y) = (E(Y1), . . . , E(Yn))ᵀ,   ǫ = (ǫ1, . . . , ǫn)ᵀ.

The model can be expressed as

Y = E(Y) + ǫ

5.2.2 Multiplication

We begin with multiplication of a matrix A by a scalar k. Each element of A is multiplied by k, i.e.

kA_{m,n} = [ ka_{1,1}  ka_{1,2}  · · ·  ka_{1,n}
             ka_{2,1}  ka_{2,2}  · · ·  ka_{2,n}
                ...       ...    . . .     ...
             ka_{m,1}  ka_{m,2}  · · ·  ka_{m,n} ]

Multiplication of a matrix by a matrix is only defined if the inner dimensions are equal, that is, the column dimension of the first matrix equals the row dimension of the second matrix. So A_{m,n}B_{p,q} is only defined if n = p. The resulting matrix A_{m,n}B_{n,q} is of dimension m × q with (i, j)th element

[ab]_{i,j} = Σ_{k=1}^n a_{i,k} b_{k,j},   i = 1, . . . , m,   j = 1, . . . , q

Example 5.1. Let

A_{3,2} = [ 2   5
            3  −1
            0   7 ]      B_{2,2} = [ 3  −1
                                     2   4 ]

then

AB = [ 2(3) + 5(2)      2(−1) + 5(4)
       3(3) + (−1)(2)   3(−1) + (−1)(4)
       0(3) + 7(2)      0(−1) + 7(4) ]
   = [ 16  18
        7  −7
       14  28 ]

Remark 5.1. When AB is defined, the product can be expressed as a linear combination of the

• columns of A

• rows of B

Take example 5.1: the columns of AB are linear combinations of the columns of A,

AB = [ 3(2, 3, 0)ᵀ + 2(5, −1, 7)ᵀ     (−1)(2, 3, 0)ᵀ + 4(5, −1, 7)ᵀ ],

and the rows of AB are linear combinations of the rows of B,

AB = [ 2(3  −1) + 5(2  4)
       3(3  −1) + (−1)(2  4)
       0(3  −1) + 7(2  4) ].

In R

> A=matrix(c(2,3,0,5,-1,7),3,2); B=matrix(c(3,2,-1,4),2,2)
> A%*%B
     [,1] [,2]
[1,]   16   18
[2,]    7   -7
[3,]   14   28

Remark 5.2. Matrix multiplication is only defined when the inner dimensions match, and as such in example 5.1, AB is defined but BA is not. Even in cases where both AB and BA are defined, it is not necessarily true that AB = BA. Take for example

A = [ 1  2        B = [ 5  6
      3  4 ]            7  8 ]

Systems of linear equations can also be written in matrix form. For example, let x1 and x2 be unknowns such that

a_{1,1}x1 + a_{1,2}x2 = y1
a_{2,1}x1 + a_{2,2}x2 = y2

This can be expressed as

[ a_{1,1}  a_{1,2}   [ x1     =   [ y1
  a_{2,1}  a_{2,2} ]   x2 ]         y2 ]      (5.1)

Ax = y

Also, sums of squares can be expressed as a vector multiplication:

Σ_{i=1}^n xi² = xᵀx,   where x = (x1, . . . , xn)ᵀ.

Some useful multiplications that we will be using in regression are presented in the following list; a small numerical sketch follows the list.

List 1.

• Xβ = [ 1  x1      [ β0     =   [ β0 + β1 x1
         ..  ..       β1 ]            ...
         1  xn ]                   β0 + β1 xn ]

• yᵀy = Σ_{i=1}^n yi²

• XᵀX = [ n              Σ_{i=1}^n xi
          Σ_{i=1}^n xi   Σ_{i=1}^n xi² ]

• Xᵀy = [ Σ_{i=1}^n yi
          Σ_{i=1}^n xi yi ]

5.3 Linear Dependence and Rank

Definition 5.2. Let A be an m × n matrix made up of n column vectors a_i, i = 1, . . . , n, each of dimension m, i.e. A = [a_1 · · · a_n]. When n scalars k_1, . . . , k_n, not all zero, can be found such that

Σ_{i=1}^n k_i a_i = 0

then the n columns are said to be linearly dependent. If the equality holds only for k_1 = · · · = k_n = 0, then the columns are said to be linearly independent. The definition also holds for rows.

Example 5.2. Consider the matrix

A = [ 1  0.5  3
      2  7    3
      4  8    9 ]

Notice that if we let the scalars be k1 = 2, k2 = 1, k3 = −1, then

2(1, 0.5, 3)ᵀ + 1(2, 7, 3)ᵀ − 1(4, 8, 9)ᵀ = 0

Example 5.3. Consider the simple identity matrix

I3 = [ 1  0  0
       0  1  0
       0  0  1 ]

Notice that the only way to achieve the 0 vector is with scalars k1 = k2 = k3 = 0, i.e.

0(1, 0, 0)ᵀ + 0(0, 1, 0)ᵀ + 0(0, 0, 1)ᵀ = 0

Without going into too much detail we present the following definition.

Definition 5.3. The rank of a matrix is the number of linearly independent columns or rows of the matrix. Hence, rank(A_{m,n}) ≤ min(m, n). If equality holds, then the matrix is said to be of full rank.

There are many ways to determine the rank of a matrix, such as counting the number of non-zero eigenvalues, but the simplest way is to express the matrix in reduced row echelon form and count the number of non-zero rows. However, software can calculate it for us.

Example 5.4. Let

A = [ 0  1  2                A_rref = [ 1  2  1
      1  2  1        ⇒                  0  1  2
      2  7  8 ]                         0  0  0 ]

Hence, rank(A) = 2. Row 3 is a linear combination of Rows 1 and 2. Specifically, Row 3 = 3(Row 1) + 2(Row 2). Therefore, 3(Row 1) + 2(Row 2) − (Row 3) = (row of zeroes). Hence, matrix A has only two independent row vectors.


> A=matrix(c(0,1,2,1,2,7,2,1,8),3,3);A

[,1] [,2] [,3]

[1,] 0 1 2

[2,] 1 2 1

[3,] 2 7 8

> qr(A)$rank

[1] 2

and let

> B=matrix(c(1,2,3,0,1,2,2,0,1),3,3);B

[,1] [,2] [,3]

[1,] 1 0 2

[2,] 2 1 0

[3,] 3 2 1

> qr(B)$rank

[1] 3

> qr(B)$rank==min(dim(B)) #check if full rank

[1] TRUE

Remark 5.3. Other functions that calculate the rank also exist in R. The qr() function utilizes the QR decomposition.

Remark 5.4. If we are simply interested in whether a square matrix A is of full rank or not, recall from linear algebra that a matrix is full rank (a.k.a. nonsingular) if and only if it has a determinant that is not equal to zero, i.e. |A| ≠ 0. Hence, if A is not of full rank (singular) it has a determinant equal to 0, i.e. |A| = 0. For example, continuing example 5.4,

> A=matrix(c(0,1,2,1,2,7,2,1,8),3,3);A

[,1] [,2] [,3]

[1,] 0 1 2

[2,] 1 2 1

[3,] 2 7 8

> det(A)

[1] 0

> qr(A)$rank==min(dim(A))

[1] FALSE


> B=matrix(c(1,2,3,0,1,2,2,0,1),3,3);B

[,1] [,2] [,3]

[1,] 1 0 2

[2,] 2 1 0

[3,] 3 2 1

> det(B)

[1] 3

> qr(B)$rank==min(dim(B))

[1] TRUE

5.4 Matrix Inverse

Let A_{n,n} be a square matrix of full rank, i.e. rank(A) = n. Then A has a (unique) inverse A⁻¹ such that

A⁻¹A = AA⁻¹ = I_n

Computing the inverse of a matrix can be done manually, which requires finding the reduced row echelon form, but we will utilize software once again.

Example 5.5. Continuing from example 5.4, only B was nonsingular and hence has an inverse

> solve(B)

[,1] [,2] [,3]

[1,] 0.3333333 1.3333333 -0.6666667

[2,] -0.6666667 -1.6666667 1.3333333

[3,] 0.3333333 -0.6666667 0.3333333

> round(solve(B)%*%B,3)

[,1] [,2] [,3]

[1,] 1 0 0

[2,] 0 1 0

[3,] 0 0 1

Example 5.6. In regression,

(XᵀX)⁻¹ = [ 1/n + x̄²/Σ(xi − x̄)²     −x̄/Σ(xi − x̄)²
            −x̄/Σ(xi − x̄)²           1/Σ(xi − x̄)² ]

Recall from equation (5.1) that a system of equations (with unknown x) can be expressed in matrix form as

Ax = y.

Then, if A is nonsingular,

A⁻¹Ax = A⁻¹y   ⇒   x = A⁻¹y,

since A⁻¹A = I.

Example 5.7. Assume we have a system of 2 equations

12x1 + 6x2 = 48
10x1 − 2x2 = 12

that can be expressed as

[ 12   6    [ x1     =   [ 48
  10  −2 ]    x2 ]         12 ]

We can easily check that the 2 × 2 matrix of coefficients is nonsingular and has inverse

(1/84) [  2   6
         10  −12 ]

⇒ x = (1/84) [  2   6      [ 48     =   [ 2
               10  −12 ]     12 ]         4 ]

5.5 Useful Matrix Results

All rules assume that the matrices are conformable to the operations.

• Addition:
  – A + B = B + A
  – (A + B) + C = A + (B + C)

• Multiplication:
  – (AB)C = A(BC)
  – C(A + B) = CA + CB
  – k(A + B) = kA + kB for scalar k

• Transpose:
  – (Aᵀ)ᵀ = A
  – (A + B)ᵀ = Aᵀ + Bᵀ
  – (AB)ᵀ = BᵀAᵀ
  – (ABC)ᵀ = CᵀBᵀAᵀ

• Inverse:
  – (A⁻¹)⁻¹ = A
  – (AB)⁻¹ = B⁻¹A⁻¹ (if A and B are non-singular)
  – (ABC)⁻¹ = C⁻¹B⁻¹A⁻¹ (if A, B and C are non-singular)
  – (Aᵀ)⁻¹ = (A⁻¹)ᵀ

5.6 Random Vectors and Matrices

Let Y be a random column vector of dimension n, i.e. Y = (Y1, Y2, . . . , Yn)ᵀ. The expectation of this (multi-dimensional) random variable is

μ = E(Y) = (E(Y1), E(Y2), . . . , E(Yn))ᵀ

and the variance-covariance matrix is the n × n matrix defined as

V(Y) = E{[Y − E(Y)][Y − E(Y)]ᵀ},

whose (i, j)th element is E{[Yi − E(Yi)][Yj − E(Yj)]}, so that

V(Y) = [ σ1²    σ1,2   · · ·  σ1,n
         σ2,1   σ2²    · · ·  σ2,n
          ...    ...   . . .   ...
         σn,1   σn,2   · · ·  σn² ]  = Σ   (symmetric)

An alternate form is

Σ = E(YYᵀ) − μμᵀ

More information can be found at:

https://en.wikipedia.org/wiki/Covariance_matrix

Example 5.8. In the regression model, assuming dimension n, the only random term is ǫ (which in turn makes Y random) and we assume

E(ǫ) = (0, . . . , 0)ᵀ = 0   and   V(ǫ) = [ σ²  0   · · ·  0
                                            0   σ²  · · ·  0
                                            ...      . . .  ...
                                            0   0   · · ·  σ² ]  = σ²I_n

Hence, for the model

Y = Xβ + ǫ      (5.2)

• E(Y) = E(Xβ) + E(ǫ) = Xβ

• V(Y) = V(Xβ + ǫ) = σ²I_n

5.6.1 Mean and variance of linear functions of random vectors

Let A_{m,n} be a matrix of scalars and Y_{n,1} a random vector. Then

W_{m,1} = AY = [ a_{1,1}Y1 + a_{1,2}Y2 + · · · + a_{1,n}Yn
                 a_{2,1}Y1 + a_{2,2}Y2 + · · · + a_{2,n}Yn
                                  ...
                 a_{m,1}Y1 + a_{m,2}Y2 + · · · + a_{m,n}Yn ]

Since Y is a random vector, W_{m,1} is also a random vector with

E(W) = [ E(a_{1,1}Y1 + · · · + a_{1,n}Yn)        [ a_{1,1}E(Y1) + · · · + a_{1,n}E(Yn)
                     ...                    =                    ...
         E(a_{m,1}Y1 + · · · + a_{m,n}Yn) ]        a_{m,1}E(Y1) + · · · + a_{m,n}E(Yn) ]  = A E(Y)

and variance-covariance matrix

V(W) = E{[AY − AE(Y)][AY − AE(Y)]ᵀ}
     = E{A[Y − E(Y)][Y − E(Y)]ᵀAᵀ}
     = A E{[Y − E(Y)][Y − E(Y)]ᵀ} Aᵀ
     = A V(Y) Aᵀ

5.6.2 Multivariate normal distribution

Let Y_{n,1} be a random vector with mean μ and variance-covariance matrix Σ, i.e. N(μ, Σ). Then, if Y is multivariate normal it has p.d.f.

f(Y) = (2π)^{−n/2} |Σ|^{−1/2} exp{ −(1/2)(Y − μ)ᵀΣ⁻¹(Y − μ) }

and each element Yi ∼ N(μi, σi²), i = 1, . . . , n.

Remark 5.5.

• If A_{m,n} is a full rank matrix of scalars, then AY ∼ N(Aμ, AΣAᵀ).

• (True for any distribution) Two linear functions AU and BU are independent if and only if AΣB = 0. In particular, this means that Ui and Uj are independent if and only if the (i, j)th entry of Σ equals 0.

• YᵀAY ∼ χ²_r(λ) if and only if AΣ is idempotent of rank(AΣ) = r, with λ = (1/2)μᵀAμ.

• The quadratic forms YᵀAY and YᵀBY are independent if and only if AΣB = 0 (equivalently BΣA = 0). As a consequence, the Sums of Squares Error and Model (as well as its components) in linear models are independent.

5.7 Estimation and Inference in Regression

Assuming multivariate normal random errors in equation (5.2),

Y ∼ N(Xβ, σ²I_n)

5.7.1 Estimating parameters by least squares

For simple linear regression, recall from Section 1.2.1 that to estimate the parameters we had to solve a system of linear equations by minimizing

Σ_{i=1}^n (yi − (β0 + β1 xi))² = (y − Xβ)ᵀ(y − Xβ)

The resulting simultaneous equations after taking partial derivatives w.r.t. β0, β1 and equating to zero are:

n b0 + b1 Σ xi = Σ yi
b0 Σ xi + b1 Σ xi² = Σ xi yi

which, using the results of List 1, can be expressed and solved in matrix form:

XᵀXb = Xᵀy   ⇒   b = (XᵀX)⁻¹Xᵀy

Remark 5.6. To solve this system we assumed that XᵀX was nonsingular. This is nearly always the case for simple linear regression. However, for multiple regression we will need the following proposition to guarantee that the unique inverse exists.

Proposition 5.1. Let X_{n,p}, where n ≥ p. If rank(X) = p, then XᵀX is nonsingular, i.e. rank(XᵀX) = p.
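A small numerical sketch, using the built-in cars data, shows that solving the normal equations by matrix algebra reproduces what lm() returns; the object names are arbitrary.

# Least squares via the normal equations, compared against lm() ('cars' data)
y <- cars$dist
X <- cbind(1, cars$speed)                 # design matrix
b <- solve(t(X) %*% X) %*% t(X) %*% y     # b = (X'X)^{-1} X'y
cbind(normal.eqns = b, lm = coef(lm(dist ~ speed, data = cars)))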

5.7.2 Fitted values and residuals

The fitted response values are

ŷ = Xb = X(XᵀX)⁻¹Xᵀ y = Hy

where H := X(XᵀX)⁻¹Xᵀ (an n × n matrix) is called the projection matrix; that is, if you pre-multiply a vector by H the result is the projection of that vector onto the column space of X. Therefore, H is

• idempotent, i.e. HH = H,

• symmetric, i.e. Hᵀ = H.

The estimated residuals are

e = y − ŷ = y − Hy = (I_n − H)y

where it is easy to check that I_n − H is also idempotent. As a result,

• E(Ŷ) = E(HY) = H E(Y) = HXβ = X(XᵀX)⁻¹XᵀXβ = Xβ

• V(Ŷ) = Hσ²I_nHᵀ = σ²H, estimated by replacing σ² with MSE

• E(e) = E[(I_n − H)Y] = (I_n − H)E(Y) = (I_n − H)Xβ = Xβ − Xβ = 0

• V(e) = (I_n − H)σ²I_n(I_n − H)ᵀ = σ²(I_n − H), estimated by replacing σ² with MSE
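Continuing the same kind of sketch, the projection matrix and its properties can be checked numerically on the cars data; this is purely an illustration.

# Projection matrix H and its properties ('cars' data)
X <- cbind(1, cars$speed); y <- cars$dist
H <- X %*% solve(t(X) %*% X) %*% t(X)
all.equal(H %*% H, H)                     # idempotent
all.equal(t(H), H)                        # symmetric
yhat <- H %*% y                           # fitted values
e    <- (diag(nrow(X)) - H) %*% y         # residuals
range(yhat - fitted(lm(dist ~ speed, data = cars)))   # essentially zero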

5.7.3 Analysis of variance

Recall that

SST = Σ_{i=1}^n (yi − ȳ)² = Σ_{i=1}^n yi² − (Σ_{i=1}^n yi)²/n.

Now note that

yᵀy = Σ_{i=1}^n yi²   and   (1/n) yᵀJy = (Σ_{i=1}^n yi)²/n.

Therefore,

SST = yᵀy − (1/n) yᵀJy = yᵀ[I_n − n⁻¹J]y.

Also,

SSE = eᵀe = (y − Xb)ᵀ(y − Xb)
    = yᵀy − yᵀXb − bᵀXᵀy + bᵀXᵀXb
    = yᵀy − bᵀXᵀy
    = yᵀ(I_n − H)y,   since bᵀXᵀy = yᵀHy.

Finally,

SSR = SST − SSE = · · · = yᵀ[H − n⁻¹J]y

Remark 5.7. Note that SST, SSR and SSE are all of quadratic form, i.e. yᵀAy for symmetric matrices A.

5.7.4 Inference

Since b = (XᵀX)⁻¹Xᵀy, it is a linear function of the response. The corresponding random vector can be expressed as

B = AY,   where A = (XᵀX)⁻¹Xᵀ.

Hence,

• E(B) = A E(Y) = AXβ = β

• V(B) = A V(Y) Aᵀ = σ²(XᵀX)⁻¹

and thus

B ∼ N(β, σ²(XᵀX)⁻¹)

We can also express the C.I. and P.I. of Section 2.2 in matrix form.

• Estimated mean response at x_obs:

  ŷ = b0 + b1 x_obs = x_obsᵀ b,   x_obs = (1, x_obs)ᵀ,

  with s_{Ŷ} = √[ MSE ( x_obsᵀ(XᵀX)⁻¹x_obs ) ].

• Predicted response at x_new: the point estimate is the same, but

  s_pred = √[ MSE ( 1 + x_newᵀ(XᵀX)⁻¹x_new ) ].

Chapter 6

Multiple Regression I

This chapter incorporates large sections from Chapter 8 of the textbook.

6.1 Model

The multiple regression model is an extension of the simple regression model whereby, instead of only one predictor, there are multiple predictors to better aid in the estimation and prediction of the response. The goal is to determine the effects (if any) of each predictor, controlling for the others.

Let p − 1 denote the number of predictors and (yi, x1,i, x2,i, . . . , x_{p−1,i}) denote the p-dimensional data points for i = 1, . . . , n. The statistical model is

Yi = β0 + β1 x1,i + · · · + β_{p−1} x_{p−1,i} + ǫi   ⇔   Yi = Σ_{k=0}^{p−1} βk xk,i + ǫi,   x0,i ≡ 1,

for i = 1, . . . , n, where the ǫi are i.i.d. N(0, σ²).

Multiple regression models can also include polynomial terms (powers of predictors). For example, one can define x2,i := x1,i². The model is still linear as it is linear in the coefficients (β's). Polynomial terms are useful for accounting for potential curvature/nonlinearity in the relationship between predictors and the response. Also, a term such as x4,i = x1,i x3,i is coined the interaction term of x1 with x3. Such terms are of particular usefulness when an interaction exists between two predictors, i.e. when the level/magnitude of one predictor has a relationship with the level/magnitude of the other. For example, one may wish to fit a model with the following predictor terms, although there are only 2 unique predictors:

Yi = β0 + β1 x1,i + β2 x1,i² + β3 x2,i + β4 x1,i x2,i + β5 x1,i² x2,i + ǫi

In p dimensions, we no longer use the term regression line, but a response/regression surface. Let p = 3, i.e. 2 predictors and a response. The resulting model may look like

[3-D plot of a fitted regression surface over two predictors.]

The interpretation of the slope coefficients now requires an additional statement. A 1-unit increase in predictor xk will cause the response, y, to change by amount βk, assuming all other predictors are held constant. In a model with interaction terms special care needs to be taken. Take for example

E(Y |x1, x2) = β0 + β1 x1 + β2 x2 + β3 x1 x2,

where a 1-unit increase in x2, i.e. x2 + 1, leads to

E(Y |x1, x2 + 1) = E(Y |x1, x2) + β2 + β3 x1.

The effect of increasing x2 depends on the level of x1.

6.2 Special Types of Variables

• Distinct numeric predictors. The traditional form of variables used thus far.

• Polynomial terms. Used to allow for “curves” in the regression/response surface, as discussed earlier.

  Example 6.1. In an experiment using flyash % as a strength (sten) factor in a concrete compression test (PSI) for 28-day cured concrete, fitting a simple linear regression yielded the following.

  [Scatterplot of strength versus flyash with the first order fit.]

  Figure 6.1: First order model

Clearly a linear model in the predictor is not adequate. Maybe a second

order polynomial model might be more adequate,

> flyash2=dat$flyash^2

> reg.2poly=lm(strength~flyash+flyash2,data=dat)

> summary(reg.2poly)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 4486.3611 174.7531 25.673 8.25e-14 ***

flyash 63.0052 12.3725 5.092 0.000132 ***

flyash2 -0.8765 0.1966 -4.458 0.000460 ***

---

Residual standard error: 312.1 on 15 degrees of freedom

Multiple R-squared: 0.6485,Adjusted R-squared: 0.6016

F-statistic: 13.84 on 2 and 15 DF, p-value: 0.0003933

[Scatterplot of strength versus flyash with the second order fit.]

Figure 6.2: Second order model

It still seems that there is some room for improvement, hence

> flyash3=dat$flyash^3

> reg.3poly=lm(strength~flyash+flyash2+flyash3,data=dat)

> summary(reg.3poly)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 4.618e+03 1.091e+02 42.338 3.53e-16 ***

flyash -2.223e+01 1.812e+01 -1.227 0.240110

flyash2 3.078e+00 7.741e-01 3.976 0.001380 **

flyash3 -4.393e-02 8.498e-03 -5.170 0.000142 ***

---

Residual standard error: 189.4 on 14 degrees of freedom

Multiple R-squared: 0.8792,Adjusted R-squared: 0.8533

F-statistic: 33.95 on 3 and 14 DF, p-value: 1.118e-06

Before we continue, it is important to note that there are (mathematical) limitations to how many predictors can be added to a model. As a guideline we usually allow one predictor per 10 observations. For example, a dataset with sample size 60 should have at most 6 predictors. The X matrix is of dimension n × p, so as p increases while n remains constant, we run the risk of X not being of full column rank. So in this example we should only keep 2 predictors at most since we have 18 ≈ 20 observations. From the last output we see that the third and second order polynomial terms are significant (flyash3 and flyash2) but the first order term (flyash) is not significant, given the other two are already incorporated in the model.

>reg.3polym1=update(reg.3poly,.~.-flyash)

>summary(reg.3polym1)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 4.549e+03 9.504e+01 47.866 < 2e-16 ***

flyash2 2.166e+00 2.201e-01 9.840 6.18e-08 ***

flyash3 -3.445e-02 3.581e-03 -9.618 8.32e-08 ***

Residual standard error: 192.6 on 15 degrees of freedom

Multiple R-squared: 0.8662,Adjusted R-squared: 0.8483

F-statistic: 48.54 on 2 and 15 DF, p-value: 2.814e-07

[Scatterplot of strength versus flyash with the first, second, and third order fits.]

Figure 6.3: Third order model

http://www.stat.ufl.edu/~athienit/STA4210/Examples/poly.R

• Interaction terms. Used when the levels of one predictor influence another. We will see this in example 6.2.

• Transformed variables. A transformed response such as log(Y) or Y⁻¹ (as seen with power transformations) to achieve linearity (or to satisfy other assumptions).

• Categorical predictors. A categorical predictor is a variable with groups or classifications. The basic case of a variable with only two groups will be illustrated by the following example:

Example 6.2. A study is conducted to determine the effects of company size and the presence or absence of a safety program on the number of hours lost due to work-related accidents. A total of 40 companies are selected for the study. The variables are as follows:

y = lost work hours
x1 = number of employees
x2 = 1 if a safety program is used, 0 if no safety program is used.

The proposed model,

Yi = β0 + β1 x1,i + β2 x2,i + ǫi,

implies that

Yi = (β0 + β2) + β1 x1,i + ǫi   if x2 = 1
Yi = β0 + β1 x1,i + ǫi          if x2 = 0

When a safety program is used, i.e. x2 = 1, the intercept is β0 + β2, but the slope (for x1) remains the same in both cases. A scatterplot of the data and the associated regression lines, differentiated by whether x2 = 1 or 0, is presented.

[Scatterplot of y versus x1 with parallel fitted lines for x2 = 0 and x2 = 1.]

Although the overall fit of the model seems adequate, we see that the regression line for x2 = 1 (red) does not fit the data well, a fact that can also be seen by plotting the residuals in the assumption checking procedure. The model is too restrictive by forcing parallel lines. Adding an interaction term makes the model less restrictive:

Yi = β0 + β1 x1,i + β2 x2,i + β3 (x1 x2)i + ǫi

which implies

Yi = (β0 + β2) + (β1 + β3) x1,i + ǫi   if x2 = 1
Yi = β0 + β1 x1,i + ǫi                 if x2 = 0

Now, the slope for x1 is allowed to differ for x2 = 1 and x2 = 0.

y = - 1.8 + 0.0197 x1 + 10.7 x2 - 0.0110 x1x2

Predictor Coef SE Coef T P

Constant -1.84 10.13 -0.18 0.857

x1 0.019749 0.001546 12.78 0.000

x2 10.73 14.05 0.76 0.450

x1x2 -0.010957 0.002174 -5.04 0.000

S = 17.7488 R-Sq = 89.2% R-Sq(adj) = 88.3%

Analysis of Variance

Source DF SS MS F P

Regression 3 93470 31157 98.90 0.000

Residual Error 36 11341 315

Total 39 104811

Figure 6.4 also shows the better fit.

[Scatterplot of y versus x1 with separate fitted lines for x2 = 0 and x2 = 1.]

Figure 6.4: Scatterplot and fitted regression lines.

Remark 6.1. Since the interaction term x1x2 is deemed significant, then

for model parsimony, all lower order terms of the interaction, i.e. x1

and x2 should be kept in the model, irrespective of their statistical

significance. If x1x2 is significant then intuitively x1 and x2 are of

importance (maybe not in the statistical sense).

Now lets try and to perform inference on the slope coefficient for x1.

From the previous equation we saw that the slope takes on two values

depending on the value of x2.

– For x2 = 0, it is just β1 and inference in straightforward...right?

– For x2 = 1, it is β1 + β3. We can estimate this with b1 + b3 but

the variance is not known to us. From equation (3) we have that

V (B1 +B3) = V (B1) + V (B3) + 2Cov(B1, B3)

The sample variances and covariances can be found from the estimated covariance matrix V̂(B) = MSE (X^T X)^{-1}, or obtained in R using the vcov function. Then, create a 100(1 − α)% CI for β1 + β3:

b1 + b3 ∓ t_{1−α/2, n−p} √(s²_{b1} + s²_{b3} + 2 s_{b1,b3})
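As a quick illustration, here is a minimal R sketch of this interval, assuming (hypothetically) that the safety-program data sit in a data frame called safe with columns y, x1 and x2; the course script safe_reg.R linked below uses its own object names.

# fit the interaction model (hypothetical data frame and column names)
fit <- lm(y ~ x1 + x2 + x1:x2, data = safe)

b <- coef(fit)        # b0, b1, b2, b3
V <- vcov(fit)        # estimated covariance matrix, MSE*(X'X)^{-1}

# point estimate and standard error of b1 + b3
est <- b["x1"] + b["x1:x2"]
se  <- sqrt(V["x1", "x1"] + V["x1:x2", "x1:x2"] + 2 * V["x1", "x1:x2"])

# 95% CI for beta1 + beta3, using n - p error degrees of freedom
est + c(-1, 1) * qt(0.975, df = df.residual(fit)) * se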

Remark 6.2. This concept can easily be extended to linear combinations of more than two coefficients.

http://www.stat.ufl.edu/~athienit/IntroStat/safe_reg.R


In the previous example the qualitative predictor only had two levels,

the use or the lack of use of a safety program. To fully state all

levels only one dummy/indicator predictor was necessary. In general,

if a qualitative predictor has k levels, then k − 1 dummy/indicator

predictor variables are necessary. For example, a qualitative predictor

for a traffic light has three levels:

– red,

– yellow,

– green.

Therefore, only two binary predictors are necessary to fully model this

scenario.

x_red = 1 if red, 0 otherwise;    x_yellow = 1 if yellow, 0 otherwise

Breaking it down by case, we have an X matrix of the following form:

Color     intercept   x_red   x_yellow
Red       1           1       0
Yellow    1           0       1
Green     1           0       0

This restriction is usually expressed as βbase group = 0 where green is

the base group in this situation, and the model is

Yi = β0 + β1xredi + β2xyellowi + ǫi

and hence the mean line, piecewise is

E(Y) = β0 + β1   if red
E(Y) = β0 + β2   if yellow
E(Y) = β0        if green

Notice that if we created xgreen the X matrix would no longer be full

column rank.
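As a small R illustration (hypothetical data, not part of the notes), the default treatment coding builds exactly these k − 1 indicator columns, with the first level of the factor acting as the base group:

# a toy factor of light colors; "green" is first alphabetically, so it is the base
color <- factor(c("red", "yellow", "green", "red", "green"))
levels(color)                 # "green" "red" "yellow"
model.matrix(~ color)         # columns: (Intercept), colorred, coloryellow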


Remark 6.3. However, other restrictions do exist to make X full column rank too.

– The restriction

β1 + β2 + β3 = 0  ⇒  β3 = −β1 − β2,

i.e. the sum of the coefficients that correspond to the levels of the qualitative predictor (only those, not all β's) is equal to 0. So, green can be written as a linear combination of red and yellow. The model is

Yi = β0 + β1xredi + β2xyellowi + β3xgreeni + ǫi

and hence the mean line, piecewise is

E(Y) = β0 + β1          if red
E(Y) = β0 + β2          if yellow
E(Y) = β0 − β1 − β2     if green

for this case the X matrix has the form

Color     intercept   x_red   x_yellow
Red       1           1       0
Yellow    1           0       1
Green     1           -1      -1

– The model with no intercept/through the origin

Yi = β1xredi + β2xyellowi + β3xgreeni + ǫi

and hence the mean line, piecewise is

E(Y) = β1   if red
E(Y) = β2   if yellow
E(Y) = β3   if green

for this case the X matrix has the form


Color     x_red   x_yellow   x_green
Red       1       0          0
Yellow    0       1          0
Green     0       0          1

So now we have seen three alternative ways, but we will be using the

base group approach as is done in R. The model through the origin has

issues as discussed in an earlier section and the sum to zero implies

that some parameters have to be expressed as linear combinations of

others.
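For completeness, a brief sketch (same hypothetical color factor as above) of how the two alternative parameterizations can be produced in R:

color <- factor(c("red", "yellow", "green", "red", "green"))

# sum-to-zero restriction: the level effects are constrained to sum to 0
model.matrix(~ color, contrasts.arg = list(color = "contr.sum"))

# model through the origin: one indicator per level and no intercept column
model.matrix(~ 0 + color)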

Remark 6.4. The color variable has three categories, one may argue

that color (in some context) is an ordinal qualitative predictor and

therefore scores can be assigned, making it quantitative. In terms of

frequency (or wavelength) there is also an order of

Color     Frequency (THz)   Score
Red       400-484           442
Yellow    508-526           517
Green     526-606           566

Instead of creating 2 dummy/indicator variables we can create one

quantitative variable using the midpoint of the frequency band.

Example 6.3. Three different drugs are considered, drug A, B and C. Each

is administered at 4 dosage levels and the response is measured

            Product
Dose      A      B      C
0.2       2.0    1.8    1.3
0.4       4.3    4.1    2.0
0.8       6.5    4.9    2.8
1.6       8.9    5.7    3.4

Let d = dosage level and let

pB = 1 if drug B, 0 otherwise;    pC = 1 if drug C, 0 otherwise

The model (that includes the interaction term) is

Yi = β0 + β1di + β2pB + β3pC + β4(dpB)i + β5(dpC)i + ǫi


and

E(Y) = β0 + β1 d_i                    if drug A
E(Y) = (β0 + β2) + (β1 + β4) d_i      if drug B
E(Y) = (β0 + β3) + (β1 + β5) d_i      if drug C

[Plot: Response against Dose, by Product (A, B, C).]

With a simple visual inspection we see that the model fit is not adequate.

A log transformation on dosage seems to help

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.3072 0.2103 34.748 3.79e-08 ***

logDose 3.3038 0.2186 15.111 5.30e-06 ***

ProductB -2.1548 0.2974 -7.245 0.000351 ***

ProductC -4.3486 0.2974 -14.622 6.42e-06 ***

logDose:ProductB -1.5004 0.3092 -4.853 0.002844 **

logDose:ProductC -2.2795 0.3092 -7.372 0.000319 ***

---

Residual standard error: 0.3389 on 6 degrees of freedom

Multiple R-squared: 0.9877,Adjusted R-squared: 0.9774

F-statistic: 96.3 on 5 and 6 DF, p-value: 1.207e-05


[Plot: Response against logDose, by Product (A, B, C).]

The coefficients for the interactions are both significant and negative, so the slope for logDose is:

drug A: 3.3038

drug B: 3.3038− 1.5004

drug C: 3.3038− 2.2795

We can test whether the slope for B is different than that for A, by testing

β4 = 0, and for C versus A, by testing β5 = 0 (since A is the base group). A

question that may arise is if the slope for logDose is the same for drug B as

it is for drug C. That is H0 : β4 = β5. We will see in the next chapter how

to actually perform this test. In the meantime we can create a 95% CI for

β4 − β5.

> vmat=vcov(modelfull);round(vmat,3)

(Intercept) logDose ProductB ProductC logDose:PB logDose:PC

(Intercept) 0.044 0.027 -0.044 -0.044 -0.027 -0.027

logDose 0.027 0.048 -0.027 -0.027 -0.048 -0.048

ProductB -0.044 -0.027 0.088 0.044 0.054 0.027

ProductC -0.044 -0.027 0.044 0.088 0.027 0.054

logDose:ProductB -0.027 -0.048 0.054 0.027 0.096 0.048

logDose:ProductC -0.027 -0.048 0.027 0.054 0.048 0.096

> d=diff(coefficients(modelfull)[6:5]);names(d)=NULL;d

0.7791

> d+c(1,-1)*qt(0.025,6)*sqrt(vmat[5,5]+vmat[6,6]-2*vmat[5,6])

[1] 0.02247186 1.53563879


and note that 0 is not in the interval, and conclude that β4 > β5, the slope

of logDose under drug B is larger than that for C.

Remark 6.5. Since we are in fact making multiple comparisons, A vs B, A vs

C and B vs C, we should probably adjust using Bonferroni’s or some other

multiple comparison adjustment.

There is however a simpler way. If we make drug C the base group,

instead of A, the (different) model would be

Yi = β0 + β1di + β2pA + β3pB + β4(dpA)i + β5(dpB)i + ǫi

so the model under

drug A: Yi = β0 + β2 + (β1 + β4)di + ǫi

drug B: Yi = β0 + β3 + (β1 + β5)di + ǫi

drug C: Yi = β0 + β1di + ǫi

so comparing the slope for logDose between drug B and C, simply involves

performing inference on β5.

> ds_base_c=transform(ds,Product=relevel(Product,"C"))

> model_bc=lm(Response~logDose*Product,data=ds_base_c)

> summary(model_bc)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 2.9586 0.2103 14.069 8.05e-06 ***

logDose 1.0243 0.2186 4.685 0.003378 **

ProductA 4.3486 0.2974 14.622 6.42e-06 ***

ProductB 2.1938 0.2974 7.377 0.000318 ***

logDose:ProductA 2.2795 0.3092 7.372 0.000319 ***

logDose:ProductB 0.7791 0.3092 2.520 0.045312 *

---

Residual standard error: 0.3389 on 6 degrees of freedom

Multiple R-squared: 0.9877,Adjusted R-squared: 0.9774

F-statistic: 96.3 on 5 and 6 DF, p-value: 1.207e-05

which corresponds to the term “logDose:ProductB” in the output.

http://www.stat.ufl.edu/~athienit/STA4210/Examples/drug.R


6.3 Matrix Form

This section is merely an extension of section 5.7. The model (including

dimensions) is of the same form just different dimensions for some terms

Y_{n×1} = X_{n×p} β_{p×1} + ǫ_{n×1}

Estimates, fitted values, residuals, standard errors and sums of squares

are of the same form as in section 5.7. The differences/generalizations are:

• The degrees of freedom are

  Source   df
  SSR      p − 1
  SSE      n − p
  SST      n − 1

This is because we now have to estimate p parameters for our “mean”, that is, our response surface.

• The expected sums of squares are:

  – E(MSE) = σ²

  – E(MSR) = σ² + ∑_{k=1}^{p−1} β_k² SS_kk + ∑_{k=1}^{p−1} ∑_{k′≠k} β_k β_{k′} SS_{kk′}

    where SS_{kk′} = ∑_{i=1}^{n} (x_{ik} − x̄_k)(x_{ik′} − x̄_{k′}).

It can be shown that,

E(MSR) ≥ E(MSE)

with equality holding only if β1 = · · · = βp−1 = 0. Therefore, to test

H0 : β1 = · · · = βp−1 = 0 vs Ha : not all β’s equal zero

we use the test statistic

T.S. = MSR / MSE ∼ F_{p−1, n−p} under H0    (6.1)

and reject the null when p-value = P(F_{p−1,n−p} ≥ T.S.) < α.


• Intuitively, we note that SSR will always increase, or equivalently that SSE decreases, as we include more predictors in the model. This is because the fitted values (from a more complicated model) will better fit the observed values of the response. However, any increase in SSR, no matter how minuscule, will cause R² to increase. The question is: “Is the gain in SSR worth the added model complexity?” This has led to the introduction of the adjusted R², defined as

R²adj := R² − (1 − R²)(p − 1)/(n − p)   ( = 1 − MSE/(SST/(n − 1)) ).    (6.2)

The subtracted term (1 − R²)(p − 1)/(n − p) is the penalizing function.

As p−1 increases, R2 increases, but the second term which is subtracted

from R2 also increases. Hence, the second term can be thought of as a

penalizing factor.

Example 6.4. A linear regression model of 50 observation with 3 pre-

dictors may yield an R2(1) = 0.677, and an addition of 2 “unimportant”

predictors yields a slight increase to R2(2) = 0.679. This increase does

not seem to be worth the added model complexity. Notice,

R²adj(1) = 0.677 − (1 − 0.677)(3/46) = 0.6559
R²adj(2) = 0.679 − (1 − 0.679)(5/44) = 0.6425

that R2adj has decreased from model (1) to model (2).
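A one-line R check of this arithmetic (a small helper written here, not part of the course scripts), with p counting the intercept:

adj_r2 <- function(r2, n, p) r2 - (1 - r2) * (p - 1) / (n - p)
adj_r2(0.677, n = 50, p = 4)   # 3 predictors + intercept -> 0.6559
adj_r2(0.679, n = 50, p = 6)   # 5 predictors + intercept -> 0.6425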

• Inferences on the individual β's follow from sections 2.1 and 5.7. The only difference is that the degrees of freedom for the t-distribution are n − p (instead of n − 2). For example, to test H0 : βk = βk0,

(b_k − β_{k0}) / s_{b_k} ∼ t_{n−p} under H0    (6.3)

An individual test on βk, tests the significance of predictor k, assuming

all other predictors j for j 6= k are included in the model. This

can lead to different conclusions depending on what other predictors

are included in the model. We shall explore this in more detail in the

next chapter.


Consider the following theoretical toy example. Someone wishes to

measure the area of a square (the response) using as predictors two

potential variables, the length and the height of the square. Due to

measurement error, replicate measurements are taken.

– A simple linear regression is fitted with length as the only predic-

tor, x = length. For the test H0 : β1 = 0, do you think that we

would reject H0, i.e. is length a significant predictor of area?

– Now assume that a multiple regression model is fitted with both

predictors, x1 = length and x2 = height. Now, for the test H0 :

β1 = 0, do you think that we would reject H0, i.e. is length a

significant predictor of area given that height is already included

in the model?

This scenario is defined as confounding. In the toy example, “height” is

a confounding variable, i.e. an extraneous variable in a statistical model

that correlates with both the response variable and another predictor

variable.

• Confidence intervals on the mean response and predictions intervals

performed as in section 5.7 with the exception that

– the degrees of freedom for the t-distribution are now n− p

– xobs (or xnew) being

x_obs = (1, x_{1,obs}, . . . , x_{p−1,obs})^T

– The matrix X is an n× p matrix with columns being the predic-

tors, i.e. X = [1 x1 · · · xp−1]

– and for g simultaneous intervals

∗ the Bonferroni critical value is t1−α/(2g),n−p, that is, the degrees

of freedom change

∗ the Working-Hotelling critical value is W = √(p F_{1−α; p, n−p})
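A minimal sketch comparing the two critical values, using hypothetical values of n, p, g and α; in practice the smaller of the two multipliers is used.

n <- 45; p <- 6; g <- 3; alpha <- 0.05
B <- qt(1 - alpha / (2 * g), df = n - p)   # Bonferroni critical value
W <- sqrt(p * qf(1 - alpha, p, n - p))     # Working-Hotelling critical value
c(Bonferroni = B, Working.Hotelling = W)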


Example 6.5. In a biological experiment, researchers wanted to model the biomass of an organism with respect to salinity (SAL), acidity (pH), potassium (K), sodium (Na) and zinc (Zn) with a sample size of 45. The full model yielded the following results:

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 171.06949 1481.15956 0.115 0.90864

salinity -9.11037 28.82709 -0.316 0.75366

pH 311.58775 105.41592 2.956 0.00527

K -0.08950 0.41797 -0.214 0.83155

Na -0.01336 0.01911 -0.699 0.48877

Zn -4.47097 18.05892 -0.248 0.80576

Residual standard error: 477.8 on 39 degrees of freedom

Multiple R-squared: 0.4867,Adjusted R-squared: 0.4209

F-statistic: 7.395 on 5 and 39 DF, p-value: 5.866e-05

Analysis of Variance Table

Response: biomass

Df Sum Sq Mean Sq F value Pr(>F)

salinity 1 121832 121832 0.5338 0.4694

pH 1 7681463 7681463 33.6539 9.782e-07 ***

K 1 464316 464316 2.0343 0.1617

Na 1 157958 157958 0.6920 0.4105

Zn 1 13990 13990 0.0613 0.8058

Residuals 39 8901715 228249

Notice that the ANOVA table has broken down SSR with 5 df into 5 components. We will discuss the sequential sum of squares breakdown in the next chapter. For now, if we sum the SS for each of the predictors we will get SSR = 8439559.

Analysis of Variance

Source DF SS MS F P

Regression 5 8439559 1687912 7.395 0.000

Residual Error 39 8901715 228249.1

Total 44 17341274


Assuming all the model assumptions are met, we first take a look at the

overall fit of the model.

H0 : β1 = · · · = β5 = 0 vs Ha : at least one of them 6= 0

The test statistic value is T.S. = 7.395 with an associated p-value of approximately 0 (found using an F5,39 distribution). Hence, at least one predictor

appears to be significant. In addition, the coefficient of determination, R2, is

48.67%, indicating that a large proportion of the variability in the response

can be accounted for by the regression model.

Looking at the individual tests, pH is significant given all the other predictors, with a p-value of 0.00527, but salinity, K, Na and Zn have large p-values

(from the individual tests). Table 6.1 provides the pairwise correlations of

the quantitative predictor variables.

            biomass   salinity   pH       K        Na       Zn
biomass     .         -0.084     0.669    -0.150   -0.219   -0.503
salinity    .         .          -0.051   -0.021   0.162    -0.421
pH          .         .          .        0.019    -0.038   -0.722
K           .         .          .        .        0.792    0.074
Na          .         .          .        .        .        0.117
Zn          .         .          .        .        .        .

Table 6.1: Pearson correlation and associated p-value

Notice that pH and Zn are highly negatively correlated, so it seems rea-

sonable to attempt to remove Zn as its p-value is 0.80576 (and pH’s p-value

is small). Also, there is a strong positive correlation between K and Na and

since both their p-values are large at 0.83155 and 0.48877 respectively, we

should attempt to remove K (but not both). Although we will see later how

to perform simultaneous inference it is more advisable to test one predictor

at a time. In effect we will perform backwards elimination. That is, start with a complete model and see which predictors we can remove, one at a time.

1. Remove K that has the highest individual test p-value.

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 72.02975 1390.21648 0.052 0.95894


salinity -7.22888 27.12606 -0.266 0.79123

pH 314.31346 103.38903 3.040 0.00416 **

Na -0.01667 0.01106 -1.507 0.13972

Zn -3.73299 17.51434 -0.213 0.83230

Residual standard error: 472 on 40 degrees of freedom

Multiple R-squared: 0.4861,Adjusted R-squared: 0.4347

F-statistic: 9.458 on 4 and 40 DF, p-value: 1.771e-05

Where we note that R2adj has actually gone up. That is, even though

SSR is smaller for this model (than the one with K also in it) the penalizing function now doesn't penalize as much. So, K was not necessary. Also note how the p-value for Na has dropped from 0.4105. That is mainly due to correlation between K and Na.

2. Remove Zn that has the highest individual test p-value.

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -188.93696 650.73603 -0.290 0.773

salinity -3.18957 19.18052 -0.166 0.869

pH 332.67478 56.49655 5.888 6.24e-07 ***

Na -0.01743 0.01036 -1.682 0.100

Residual standard error: 466.5 on 41 degrees of freedom

Multiple R-squared: 0.4855,Adjusted R-squared: 0.4478

F-statistic: 12.9 on 3 and 41 DF, p-value: 4.5e-06

R2adj is still increasing.

3. Remove salinity

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -282.86356 319.38767 -0.886 0.3809

pH 333.10556 55.78001 5.972 4.36e-07 ***

Na -0.01770 0.01011 -1.752 0.0871 .


Residual standard error: 461.1 on 42 degrees of freedom

Multiple R-squared: 0.4851,Adjusted R-squared: 0.4606

F-statistic: 19.79 on 2 and 42 DF, p-value: 8.82e-07

R2adj is still increasing.

4. Now the question is whether we should remove Na, as its p-value is

“small-ish”.

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -593.63 271.89 -2.183 0.0345 *

pH 336.79 57.07 5.902 5.08e-07 ***

Residual standard error: 472 on 43 degrees of freedom

Multiple R-squared: 0.4475,Adjusted R-squared: 0.4347

F-statistic: 34.83 on 1 and 43 DF, p-value: 5.078e-07

But now R2adj has decreased, so it is beneficial to keep Na (with respect

to R2 criterion).

We can also create CI's and/or PI's using this model, and with the use of software, we do not actually have to compute any of the matrices.

> newdata=data.frame(pH=4.15,Na=10000)

> predict(modu3, newdata, interval="prediction",level=0.95)

fit lwr upr

1 922.4975 -29.45348 1874.448

http://www.stat.ufl.edu/~athienit/STA4210/Examples/linthurst.R


Chapter 7

Multiple Regression II

For a given dataset, the total sum of squares (SST) remains the same, no

matter what predictors are included (when no missing values exist among

variables) as the formula does not involve any x's. As we include more

predictors, the regression sum of squares (SSR) does not decrease (think of

it as increasing), and the error sum of squares (SSE) does not increase.

7.1 Extra Sums of Squares

7.1.1 Definition and decompositions

• When a model contains just x1, we denote: SSR(x1), SSE(x1)

• Model Containing x1, x2: SSR(x1, x2), SSE(x1, x2)

• Predictive contribution of x2 above that of x1:

SSR(x2|x1) = SSE(x1)− SSE(x1, x2) = SSR(x1, x2)− SSR(x1)

This can be extended to any number of predictors. Let's take a look at some formulas for models with 3 predictors:

SST = SSR(x1) + SSE(x1)

= SSR(x1, x2) + SSE(x1, x2)

= SSR(x1, x2, x3) + SSE(x1, x2, x3)


and

SSR(x1|x2) = SSR(x1, x2)− SSR(x2)

= SSE(x2)− SSE(x1, x2)

SSR(x2|x1) = SSR(x1, x2)− SSR(x1)

= SSE(x1)− SSE(x1, x2)

SSR(x3|x2, x1) = SSR(x1, x2, x3)− SSR(x1, x2)

= SSE(x1, x2)− SSE(x1, x2, x3)

SSR(x2, x3|x1) = SSR(x1, x2, x3)− SSR(x1)

= SSE(x1)− SSE(x1, x2, x3)

Similarly you can find other terms such as SSR(x2|x1, x3), SSR(x2, x1|x3) and

so forth. Using some of this notation we find that

SSR(x1, x2, x3) = SSR(x1) + SSR(x2|x1) + SSR(x3|x1, x2)

= SSR(x2) + SSR(x1|x2) + SSR(x3|x1, x2)

= SSR(x1) + SSR(x2, x3|x1)

For multiple regression when we request the ANOVA table in R, we obtain a

table where SSR is decomposed by sequential sums of squares.

Source        SS                  df      MS
Regression    SSR(x1, x2, x3)     3       MSR(x1, x2, x3)
  x1          SSR(x1)             1       MSR(x1)
  x2|x1       SSR(x2|x1)          1       MSR(x2|x1)
  x3|x1,x2    SSR(x3|x1, x2)      1       MSR(x3|x1, x2)
Error         SSE(x1, x2, x3)     n − 4   MSE(x1, x2, x3)
Total         SST                 n − 1

The sequential regression sums of squares differ depending on the order in which the variables are entered.

Example 7.1. Let us take a look at example 7.1 from the textbook.

> dat=read.table("http://www.stat.ufl.edu/~rrandles/sta4210/Rclassnotes/

+ data/textdatasets/KutnerData/Chapter%20%207%20Data%20Sets/CH07TA01.txt",

+ col.names=c("X1","X2","X3","Y"))

>


> reg123=lm(Y~X1+X2+X3,data=dat)

> summary(reg123)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 117.085 99.782 1.173 0.258

X1 4.334 3.016 1.437 0.170

X2 -2.857 2.582 -1.106 0.285

X3 -2.186 1.595 -1.370 0.190

Residual standard error: 2.48 on 16 degrees of freedom

Multiple R-squared: 0.8014, Adjusted R-squared: 0.7641

F-statistic: 21.52 on 3 and 16 DF, p-value: 7.343e-06

From the F-test we see that at least one predictor is significant. However,

the individual tests indicate that the predictors are not significant. We will

investigate this later but this is because we are testing an individual predictor

given all the other predictors. It will be helpful to view the sequential sum

of squares

Listing 7.1: Order 123 model

> anova(reg123)

Analysis of Variance Table

Response: Y

Df Sum Sq Mean Sq F value Pr(>F)

X1 1 352.27 352.27 57.2768 1.131e-06 ***

X2 1 33.17 33.17 5.3931 0.03373 *

X3 1 11.55 11.55 1.8773 0.18956

Residuals 16 98.40 6.15

Note that

• SSR(x1) = 352.27, so x1 contributes a lot

• SSR(x2|x1) = 33.17, so x2 contributes some above and beyond what x1 does

• SSR(x3|x1, x2) = 11.55, so x3 does not seem to contribute much above and beyond x1 and x2

If we switch the order in which the variables are entered


Listing 7.2: Order 213 model

> reg213=lm(Y~X2+X1+X3,data=dat)

> anova(reg213)

Analysis of Variance Table

Response: Y

Df Sum Sq Mean Sq F value Pr(>F)

X2 1 381.97 381.97 62.1052 6.735e-07 ***

X1 1 3.47 3.47 0.5647 0.4633

X3 1 11.55 11.55 1.8773 0.1896

Residuals 16 98.40 6.15

We note that x2 seems to be significant on its own, but that x1 does not

contribute anything above and beyond x2. Next we also try having x3 first.

Listing 7.3: Order 321 model

> reg321=lm(Y~X3+X2+X1,data=dat)

> anova(reg321)

Analysis of Variance Table

Response: Y

Df Sum Sq Mean Sq F value Pr(>F)

X3 1 10.05 10.05 1.6343 0.2193

X2 1 374.23 374.23 60.8471 7.684e-07 ***

X1 1 12.70 12.70 2.0657 0.1699

Residuals 16 98.40 6.15

We note that x3 even on its own does not appear to be significant. We shall

talk about the tests we see here in the next section.

http://www.stat.ufl.edu/~athienit/STA4210/Examples/bodyfat.R

7.1.2 Inference with extra sums of squares

Let p − 1 denote the total number of predictors in a model. Then, we can

simultaneously test for the significance of k(≤ p) predictors. For example,

let p− 1 = 3 and the full model is

Yi = β0 + β1x1,i + β2x2,i + β3x3,i + ǫi


Now, assume we wish to test whether we can simultaneously remove the first and third predictors, i.e. x1 and x3. Consequently, we wish to test the

hypotheses

H0 : β1 = β3 = 0 (given x2) vs Ha : at least one of them 6= 0

In effect we wish to compare the full model to the reduced model

Yi = β0 + β2x2,i + ǫi

Remark 7.1. A full model does not necessarily imply a model with all the

predictors. It simply means a model that has more predictors than the

reduced model, i.e. a “fuller” model.

The SSE of the reduced model will be larger than the SSE of the full model, as it contains only a subset of the predictors of the full model and can never

fit the data better. The general test statistic is based on comparing the

difference in SSE of the reduced model to the full model.

T.S. = [ (SSEred − SSEfull) / (dfEred − dfEfull) ] / [ SSEfull / dfEfull ]  ∼  F_{ν1, ν2} under H0    (7.1)

where

• ν1 = dfEred − dfEfull

• ν2 = dfEfull

and the p-value for this test is always the area to the right of the F-distribution,

i.e. P (Fν1,ν2 ≥ T.S.).

In our example we have that

• SSEred − SSEfull = SSE(x2)− SSE(x1, x2, x3) = SSR(x1, x3|x2)

• dfEred − dfEfull = (n − 2) − (n − 4) = 2

and hence equation (7.1) becomes

T.S. = [SSR(x1, x3|x2)/2] / [SSE(x1, x2, x3)/(n − 4)] = MSR(x1, x3|x2) / MSE(x1, x2, x3) ∼ F_{2, n−4} under H0


Remark 7.2. Note that ν1 = dfEred − dfEfull always equals the number of independent restrictions placed on the coefficients under the null hypothesis in a simultaneous test. In the previous example H0 : β1 = β3 = 0 meant 2

degrees of freedom but H0 : β1 = β3 is only 1 degree of freedom. We shall

see examples in the section “Other Linear Tests”.

Example 7.2. From example 7.1, assume we wish to test

H0 : β1 = β3 = 0 (given x2)

We need to fit the reduced model and obtain the information necessary for

equation (7.1).

> reg2=update(reg123,.~.-X1-X3)

> anova(reg2,reg123)

Analysis of Variance Table

Model 1: Y ~ X2

Model 2: Y ~ X1 + X2 + X3

Res.Df RSS Df Sum of Sq F Pr(>F)

1 18 113.424

2 16 98.405 2 15.019 1.221 0.321

With a large p-value we fail to reject the null hypothesis, and drop x1 and

x3. Remember that we actually recommend not performing simultaneous tests but one variable at a time.

Special cases

• The output we saw in example 7.1, i.e. listing 7.1 (and the other listings), also provided us with some default F-tests

> anova(reg123)

Response: Y

Df Sum Sq Mean Sq F value Pr(>F)

X1 1 352.27 352.27 57.2768 1.131e-06 ***

X2 1 33.17 33.17 5.3931 0.03373 *

X3 1 11.55 11.55 1.8773 0.18956

Residuals 16 98.40 6.15


– The first TS = 57.2768 tests whether x1 is significant without any other predictors, with an F-test with 1 and 16 degrees of freedom:

T.S. = MSR(x1) / MSE(x1, x2, x3) = SSR(x1) / MSE(x1, x2, x3)

– The second TS = 5.3931 tests whether x2 is significant above and beyond x1, with an F-test with 1 and 16 degrees of freedom:

T.S. = MSR(x2|x1) / MSE(x1, x2, x3) = SSR(x2|x1) / MSE(x1, x2, x3)

– The third TS = 1.8773 tests whether x3 is significant above and beyond (x1, x2), with an F-test with 1 and 16 degrees of freedom:

T.S. = MSR(x3|x1, x2) / MSE(x1, x2, x3) = SSR(x3|x1, x2) / MSE(x1, x2, x3)

• One coefficient. Assume we wish to test H0 : β3 = 0. We can either perform a t-test according to the corresponding bullet of section 6.3 and equation (6.3),

(b3 − 0) / s_{b3} ∼ t_{n−4} under H0

Equivalently, we can still use equation (7.1) and note that

– SSEred−SSEfull = SSE(x1, x2)−SSE(x1, x2, x3) = SSR(x3|x1, x2)

– dfEred− dfEfull

= 1

yielding

T.S. = [SSR(x3|x1, x2)/1] / [SSE(x1, x2, x3)/(n − 4)] = MSR(x3|x1, x2) / MSE(x1, x2, x3) ∼ F_{1, n−4} under H0

with p-value = P(F_{1,n−4} ≥ T.S.).

Back to example 7.1 we have the t-tests and can see the equivalent

F-tests that have the same p-value.

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 117.085 99.782 1.173 0.258

X1 4.334 3.016 1.437 0.170


X2 -2.857 2.582 -1.106 0.285

X3 -2.186 1.595 -1.370 0.190

> library(car)

> SS2=Anova(reg123,type=2);SS2 #notice same p-values

Anova Table (Type II tests)

Response: Y

Sum Sq Df F value Pr(>F)

X1 12.705 1 2.0657 0.1699

X2 7.529 1 1.2242 0.2849

X3 11.546 1 1.8773 0.1896

Residuals 98.405 16

• All coefficients (except intercept). Assume we wish to test

H0 : β1 = · · · = β3 = 0 vs Ha : not all β’s equal zero

We proceed in exactly the same way as the corresponding bullet of section 6.3 and equation (6.1).

This is because the model under the null (reduced model) is

Yi = β0 + ǫi ⇔ Yi = µ+ ǫi,

and thus SSEred = SST and dfEred = n − 1. Therefore,

T.S. = [ (SST − SSE) / ((n − 1) − (n − 4)) ] / [ SSE / (n − 4) ] = (SSR/3) / (SSE/(n − 4)) = MSR(x1, x2, x3) / MSE(x1, x2, x3) ∼ F_{3, n−4} under H0

In example 7.1, we can see this F-test in the summary.

> summary(reg123)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 117.085 99.782 1.173 0.258

X1 4.334 3.016 1.437 0.170

X2 -2.857 2.582 -1.106 0.285


X3 -2.186 1.595 -1.370 0.190

Residual standard error: 2.48 on 16 degrees of freedom

Multiple R-squared: 0.8014,Adjusted R-squared: 0.7641

F-statistic: 21.52 on 3 and 16 DF, p-value: 7.343e-06

7.2 Other Linear Tests

There are circumstances where we do not necessarily wish to test whether

a coefficient equals 0, or whether a group of coefficients all equal zero. For

example, consider the (full) model

Yi = β0 + β1x1,i + β2x2,i + β3x3,i + ǫi

and we wish to test

• H0 : β1 = β2 = β3. Under this null the reduced model is

Yi = β0 + β1 x1,i + β1 x2,i + β1 x3,i + ǫi = β0 + β1 z_i + ǫi,   where z_i = x1,i + x2,i + x3,i

The resulting F-test from equation (7.1) would have an F_{2,n−4} distribution.

• H0 : β3 = β1 + β2. Under this null the reduced model is

Yi = β0 + β1 x1,i + β2 x2,i + (β1 + β2) x3,i + ǫi = β0 + β1 z_{1,i} + β2 z_{2,i} + ǫi,   where z_{1,i} = x1,i + x3,i and z_{2,i} = x2,i + x3,i

The resulting F-test from equation (7.1) would have an F_{1,n−4} distribution.


• H0 : β0 = 10, β3 = 1. Under this null the reduced model is

Yi = 10 + β1 x1,i + β2 x2,i + x3,i + ǫi

Y⋆_i := Yi − 10 − x3,i = β1 x1,i + β2 x2,i + ǫi

which is regression through the origin. The resulting F-test from equation (7.1) would have an F_{2,n−4} distribution.

Example 7.3. Let us re-examine example 7.1 so far. With the sequential sums of squares we noted that x2 was significant above and beyond x1, with a p-value of 0.03373, but with the individual t-tests (and equivalent F-test) that it was not significant above and beyond (x1, x3), with a p-value of 0.285. We also concluded in the simultaneous test that H0 : β1 = β3 = 0 holds. That means that we either need only x2 or only the combo (x1, x3). Let's test

H0 : β2 = (β1 + β3)/2

Under this null the model becomes

Yi = β0 + β1 x1,i + [(β1 + β3)/2] x2,i + β3 x3,i + ǫi = β0 + β1 z_{1,i} + β3 z_{2,i} + ǫi,   where z_{1,i} = x1,i + 0.5 x2,i and z_{2,i} = x3,i + 0.5 x2,i

> dat[,"Z1"]=dat[,"X1"]+1/2*dat[,"X2"]

> dat[,"Z2"]=dat[,"X3"]+1/2*dat[,"X2"]

> reg2eq13=lm(Y~Z1+Z2,data=dat)

> anova(reg2eq13,reg123)

Analysis of Variance Table

Model 1: Y ~ Z1 + Z2

Model 2: Y ~ X1 + X2 + X3

Res.Df RSS Df Sum of Sq F Pr(>F)

1 17 107.150

2 16 98.405 1 8.745 1.4219 0.2505


We fail to reject the null, but that is no surprise to us at this point. It seems all we need is just x1 and x3, so let's try it.

> reg13=update(reg123,.~.-X2)

> summary(reg13)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 6.7916 4.4883 1.513 0.1486

X1 1.0006 0.1282 7.803 5.12e-07 ***

X3 -0.4314 0.1766 -2.443 0.0258 *

---

Residual standard error: 2.496 on 17 degrees of freedom

Multiple R-squared: 0.7862,Adjusted R-squared: 0.761

F-statistic: 31.25 on 2 and 17 DF, p-value: 2.022e-06

No further reduction seems necessary at the moment as each variable appears significant given the other.

Remark 7.3. In R if the null hypothesis requires a transformation of the

response, such as in the last bullet using Y ⋆i , you will have to perform the

F-test manually because the anova function will give you a warning that you

are using two different datasets as the response variable in the two models is

technically different.
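For instance, a minimal sketch of such a manual F-test for the last bullet of section 7.2, H0 : β0 = 10, β3 = 1, assuming a hypothetical data frame mydata with columns y, x1, x2, x3:

full <- lm(y ~ x1 + x2 + x3, data = mydata)

# transformed response under the null, then regression through the origin
mydata$ystar <- mydata$y - 10 - mydata$x3
reduced <- lm(ystar ~ 0 + x1 + x2, data = mydata)

sse_f <- deviance(full);    df_f <- df.residual(full)
sse_r <- deviance(reduced); df_r <- df.residual(reduced)

# general linear test statistic of equation (7.1)
TS <- ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
c(TS = TS, p.value = pf(TS, df_r - df_f, df_f, lower.tail = FALSE))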

7.3 Coefficient of Partial Determination

The coefficient of partial determination, similarly to the coefficient of determination R², is the proportion of variation in the response explained by a set

of predictors above and beyond another set of predictors.

Consider a model with 3 predictors, i.e. p − 1 = 3. The proportion of

variation in the response that is explained by x1, given that x2 and x3 are


already in the model is

R²_{y,x1|x2,x3} = [SSE(x2, x3) − SSE(x1, x2, x3)] / SSE(x2, x3)
              = [SSR(x1, x2, x3) − SSR(x2, x3)] / SSE(x2, x3)
              = SSR(x1|x2, x3) / SSE(x2, x3)

The coefficient of partial correlation is then defined as

r_{y,x1|x2,x3} = sgn(b1) √(R²_{y,x1|x2,x3})

Similarly for R²_{y,x2|x1,x3} and R²_{y,x3|x1,x2}. We can also express the proportion of

variation in the response that is explained by x2 and x3 given x1 as

R²_{y,x2,x3|x1} = [SSE(x1) − SSE(x1, x2, x3)] / SSE(x1)
              = [SSR(x1, x2, x3) − SSR(x1)] / SSE(x1)
              = SSR(x2, x3|x1) / SSE(x1)

Similarly for R²_{y,x1,x3|x2} and R²_{y,x1,x2|x3}.

Example 7.4. Sticking with example 7.1, we find that R²_{y,x2|x1,x3} is

> ### Coefficient of partial determination R^2_{Y x2|x1 x3}=SSR(x2|x1 x3)/SSE(x1 x3)

> SST=(dim(dat)[1]-1)*var(dat$Y)

> SS2["X2","Sum Sq"]/anova(lm(Y~X1+X3,data=dat))["Residuals","Sum Sq"]

[1] 0.07107507

This implies that x2 has a tiny effect in reducing the variance in the response

above and beyond (x1, x3). This agrees with the t-test for H0 : β2 = 0 given

(x1, x3) that we saw earlier.

Also note that R²_{y,x1,x3|x2}

> ### Coefficient of partial determination R^2_{Y x1 x3|x2} = SSR(x1 x3|x2)/SSE(x2)

> SSEx2=anova(lm(Y~X2,data=dat))["Residuals","Sum Sq"]

> (SSEx2-anova(reg123)["Residuals","Sum Sq"])/SSEx2

[1] 0.1324132


indicating that (x1, x3) have something to contribute above and beyond x2.

This all seems to agree with our tests leading us to the final model with just

x1 and x3.

7.4 Standardized Regression Model

Standardized regression simply means that all variables are standardized

which helps in

• removing round-off errors in computing (XTX)−1

• makes for an easier comparison of the magnitude of effects of predictors measured on different measurement scales. A coefficient β⋆_k from this model can be interpreted as: a 1 standard deviation increase in predictor k is associated with a change of β⋆_k standard deviations in the response (holding all others constant).

• (to be discussed later) reducing the standard error of coefficients due

to multicollinearity

The transformation used is known as the correlation transformation

y⋆_i = (1/√(n − 1)) (y_i − ȳ)/s_y,    x⋆_{k,i} = (1/√(n − 1)) (x_{k,i} − x̄_k)/s_{x_k},    k = 1, . . . , p − 1

The model is

Y⋆_i = β⋆_1 x⋆_{1,i} + · · · + β⋆_{p−1} x⋆_{p−1,i} + ǫ⋆_i

We can always revert back to the unstandardized coefficients

• β_k = (s_y / s_{x_k}) β⋆_k,   k = 1, . . . , p − 1

• β_0 = ȳ − β_1 x̄_1 − · · · − β_{p−1} x̄_{p−1}

Under this model,

y⋆ = (y⋆_1, . . . , y⋆_n)^T,    X⋆ = (x⋆_1 · · · x⋆_{p−1})


which results in

X⋆^T X⋆ = [ 1           r_{1,2}     · · ·   r_{1,p−1}
            r_{2,1}     1           · · ·   r_{2,p−1}
            ...                     . . .   ...
            r_{p−1,1}   r_{p−1,2}   · · ·   1         ]  =: r_xx,

X⋆^T y⋆ = (r_{y,1}, . . . , r_{y,p−1})^T  =: r_yx

because

• ∑_{i=1}^{n} (x⋆_{k,i})² = · · · = 1

• ∑_{i=1}^{n} x⋆_{k,i} x⋆_{k′,i} = · · · = r_{x_k, x_{k′}}

• ∑_{i=1}^{n} y⋆_i x⋆_{k,i} = · · · = r_{y, x_k}

Therefore,

X⋆^T X⋆ b⋆ = X⋆^T y⋆  ⇒  b⋆ = (X⋆^T X⋆)^{−1} X⋆^T y⋆ = r_xx^{−1} r_yx
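A quick R sketch of this identity for the bodyfat data of example 7.1 (data frame dat, predictors X1 and X3 as selected earlier); the result should agree, up to rounding, with the standardized fit shown in example 7.6 below.

rxx <- cor(dat[, c("X1", "X3")])           # correlations among the predictors
ryx <- cor(dat[, c("X1", "X3")], dat$Y)    # correlations of the predictors with Y
solve(rxx, ryx)                            # b* = rxx^{-1} ryx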

Example 7.5. So far we have concluded that for the bodyfat dataset in ex-

amples 7.1 that we only need x1 and x3 in the model. However, it seems that

these two variables are still somewhat correlated with a sample correlation

of rx1,x3 = 0.46.

> round(cor(dat[,1:4]),2)

X1 X2 X3 Y

X1 1.00 0.92 0.46 0.84

X2 0.92 1.00 0.08 0.88

X3 0.46 0.08 1.00 0.14

Y 0.84 0.88 0.14 1.00

We have mentioned that correlated variables may increase the standard errors of our coefficients, making it more necessary to implement standardized regression.

A useful tool is the Variance Inflation Factor (VIF). The square root of

the variance inflation factor tells you how much larger the standard error

is, compared with what it would be if that variable were uncorrelated with

the other predictor variables in the model. If the variance inflation factor of

a predictor variable were 5.27 (√5.27 = 2.3) this means that the standard

error for the coefficient of that predictor variable is 2.3 times as large as it


would be if that predictor variable were uncorrelated with the other predictor

variables.

Example 7.6. Continuing with our example, we see that the inflation is actually not much:

> library(car)

> sqrt(vif(reg13))

X1 X3

1.124775 1.124775

Performing standardized regression yields

> cor.trans=function(y){

+ n=length(y)

+ 1/sqrt(n-1)*(y-mean(y))/sd(y)

+ }

> dat_trans=as.data.frame(apply(dat[,1:4],2,cor.trans))

> reg13_trans=lm(Y~0+X1+X3,data=dat_trans)

> summary(reg13_trans)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

X1 0.9843 0.1226 8.029 2.33e-07 ***

X3 -0.3082 0.1226 -2.514 0.0217 *

compared to the standard errors of 0.1282 and 0.1766 respectively.

7.5 Multicollinearity

Consider the following theoretical toy example. Someone wishes to measure

the area of a square (the response) using as predictors two potential variables,

the length and the height of the square. Due to measurement error, replicate

measurements are taken.

• A simple linear regression is fitted with length as the only predictor,

x = length. For the test H0 : β1 = 0, do you think that we would reject

H0, i.e. is length a significant predictor of area?

• Now assume that a multiple regression model is fitted with both pre-

dictors, x1 = length and x2 = height. Now, for the test H0 : β1 = 0, do


you think that we would reject H0, i.e. is length a significant predictor

of area given that height is already included in the model?

This scenario is defined as confounding/collinearity. In the toy example,

“height” is a confounding variable, i.e. an extraneous variable in a statistical

model that correlates with both the response variable and another predictor

variable.

Example 7.7. In an experiment of 22 observations, a response y and two

predictors x1 and x2 were observed. Two simple linear regression models

were fitted:

(1)

y = 6.33 + 1.29 x1

Predictor Coef SE Coef T P

Constant 6.335 2.174 2.91 0.009

x1 1.2915 0.1392 9.28 0.000

S = 2.95954 R-Sq = 81.1% R-Sq(adj) = 80.2%

(2)

y = 54.0 - 0.919 x2

Predictor Coef SE Coef T P

Constant 53.964 8.774 6.15 0.000

x2 -0.9192 0.2821 -3.26 0.004

S = 5.50892 R-Sq = 34.7% R-Sq(adj) = 31.4%

Each predictor in their respective model is significant due to the small p-

values for their corresponding coefficients. The simple linear regression model

(1) is able to explain more of the variability in the response than model (2)

with R2 = 81.1%. Logically one would then assume that a multiple regression

model with both predictors would be the best model. The output of this

model is given below:

(3)

y = 12.8 + 1.20 x1 - 0.168 x2


Predictor Coef SE Coef T P

Constant 12.844 7.514 1.71 0.104

x1 1.2029 0.1707 7.05 0.000

x2 -0.1682 0.1858 -0.91 0.377

S = 2.97297 R-Sq = 81.9% R-Sq(adj) = 80.0%

We notice that the individual test for β1 still classifies x1 as significant

given x2, but x2 is no longer significant given x1. Also, we notice that the

coefficient of determination, R2, has increased only by 0.8%, and in fact R2adj

has decreased from 80.2% in (1) to 80.0% in (3). This is because x1 is acting

as a confounding variable on x2. The relationship of x2 with the response

y is mainly accounted for by the relationship of x1 with y. The correlation coefficient is r_{x1,x2} = −0.573, which indicates a moderate negative relationship.

However, since x1 is a better predictor, the multiple regression model is

still able to determine that x1 is significant given x2, but not vice versa.

When two variables are highly correlated, the estimates of their regression coefficients become unstable, and their standard errors become larger (leading to smaller test statistics and wider C.I.'s). We can see this using

VIF.

Example 7.8. We have seen another example of this in example 7.1. Recall that x1 and

x2 are highly correlated.

> round(cor(dat[,1:4]),2)

X1 X2 X3 Y

X1 1.00 0.92 0.46 0.84

X2 0.92 1.00 0.08 0.88

X3 0.46 0.08 1.00 0.14

Y 0.84 0.88 0.14 1.00

In listing 7.2 we noticed that x1 is not significant given x2, with a p-value of 0.4633, due to the fact that SSR(x1|x2) = 3.47, but, in listing 7.1, testing x2 given x1 yielded a p-value of 0.03373 due to SSR(x2|x1) = 33.17, indicating it was somewhat significant.


Using VIF we see that standard errors are greatly inflated for the model

with all three

> sqrt(vif(reg123))

X1 X2 X3

26.62410 23.75591 10.22771

http://www.stat.ufl.edu/~athienit/STA4210/Examples/bodyfat.R


Chapter 9

Model Selection and Validation

Note that Chapter 8 was merged back with Chapter 6.

9.1 Data Collection Strategies

• Controlled Experiments: Subjects (Experimental Units) assigned to

X-levels by experimenter

– Purely Controlled Experiments: Researcher only uses predictors

that were assigned to units

– Controlled Experiments with Covariates: Researcher has informa-

tion (additional predictors) associated with units

• Observational Studies: Subjects (Units) have X-levels associated with

them (not assigned by researcher)

– Confirmatory Studies: New (primary) predictor(s) believed to be

associated with Y , controlling for (control) predictor(s), known to

be associated with Y

– Exploratory Studies: Set of potential predictors believed that

some or all are associated with Y

9.2 Reduction of Explanatory Variables

• Controlled Experiments


– Purely Controlled Experiments: Rarely any need or desire to re-

duce number of explanatory variables

– Controlled Experiments with Covariates: Remove any covariates

that do not reduce the error variance

• Observational Studies

– Confirmatory Studies: Must keep in all control variables to com-

pare with previous research, should keep all primary variables as

well

– Exploratory Studies: Often have many potential predictors (and

polynomials and interactions). Want to fit parsimonious model

that explains much of the variation in Y , while keeping model as

basic as possible. Caution: do not make decisions based on single

variable t-tests, make use of Complete/Reduced models for testing

multiple predictors

9.3 Model Selection Criteria

With p − 1 predictors there are 2^{p−1} potential models (each variable can be in or out of the model), not including interaction terms etc.

• So far we have seen the adjusted R2 as in equation (6.2) where the goal

is to maximize the value

• Mallow's Cp criterion, where the goal is to find the smallest p so that Cp ≤ p, with

  Cp = SSE_p / MSE(X1, . . . , X_{p−1}) − (n − 2p)

  Note that in the first term the numerator is model specific, while the denominator is always the same (the MSE of the full model).

• Akaike Information Criterion (AIC) and the Bayesian Information Cri-

terion (BIC), where the goal is to choose the model with the minimum

value

AIC = n log(SSE/n) + 2p, BIC = n log(SSE/n) + p log(n)


• PRESS criterion, where once again we aim to minimize this value

  PRESS = ∑_{i=1}^{n} (y_i − ŷ_{i(i)})²

  where ŷ_{i(i)} is the fitted value for the ith case when it was not used in fitting the model (leave-one-out). From this we have the

– Ordinary Cross Validation (OCV)

  OCV = (1/n) ∑_{i=1}^{n} (y_i − ŷ_{i(i)})² = (1/n) ∑_{i=1}^{n} [ (y_i − ŷ_i) / (1 − h_ii) ]²

  due to the Leave-One-Out Lemma, where h_ii is the ith diagonal element of H = X(X^T X)^{−1} X^T.

– Generalized Cross Validation (GCV), where h_ii is replaced by the average of the diagonal elements of H, leading to a weighted version

  GCV = (1/n) ∑_{i=1}^{n} (y_i − ŷ_i)² / (1 − trace(H)/n)²
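All three quantities are easy to compute for any fitted lm object via the leave-one-out identity e_i/(1 − h_ii); a minimal sketch for a generic fitted model fit:

h <- hatvalues(fit)     # diagonal elements of H
e <- residuals(fit)
n <- length(e)

PRESS <- sum((e / (1 - h))^2)
OCV   <- PRESS / n
GCV   <- mean(e^2) / (1 - sum(h) / n)^2
c(PRESS = PRESS, OCV = OCV, GCV = GCV)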

Example 9.1. A cruise ship company wishes to model the crew size needed for a ship using predictors such as: age, tonnage, passengers, length,

cabins and passenger density (passdens). Without concerning ourselves with

potential interactions we will look at simple additive models.

> cruise <- read.fwf("http://www.stat.ufl.edu/~winner/data/cruise_ship.dat",

+ width=c(20,20,rep(8,7)), col.names=c("ship", "cline", "age", "tonnage",

+ "passengers", "length", "cabins", "passdens", "crew"))

> fit0=lm(crew~age+tonnage+passengers+length+cabins+passdens,data=cruise)

> summary(fit0)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.5213400 1.0570350 -0.493 0.62258

age -0.0125449 0.0141975 -0.884 0.37832

tonnage 0.0132410 0.0118928 1.113 0.26732

passengers -0.1497640 0.0475886 -3.147 0.00199 **

length 0.4034785 0.1144548 3.525 0.00056 ***

cabins 0.8016337 0.0892227 8.985 9.84e-16 ***

passdens -0.0006577 0.0158098 -0.042 0.96687


---

Residual standard error: 0.9819 on 151 degrees of freedom

Multiple R-squared: 0.9245,Adjusted R-squared: 0.9215

F-statistic: 308 on 6 and 151 DF, p-value: < 2.2e-16

> AIC(fit0)

[1] 451.4394

We will consider this to be the full model at the moment and will implement

some of the model selection criteria using the regsubsets function.

> library(leaps)

> allcruise <- regsubsets(crew~age+tonnage+passengers+length+cabins+
+ passdens, nbest=4,data=cruise)

> aprout <- summary(allcruise)

> with(aprout,round(cbind(which,rsq,adjr2,cp,bic),3))

## Prints "readable" results


(Intercept) age tonnage passengers length cabins passdens rsq adjr2 cp bic

1 1 0 0 0 0 1 0 0.904 0.903 37.772 -360.238

1 1 0 1 0 0 0 0 0.860 0.859 125.086 -300.954

1 1 0 0 1 0 0 0 0.838 0.837 170.523 -277.122

1 1 0 0 0 1 0 0 0.803 0.801 240.675 -246.201

2 1 0 0 0 1 1 0 0.916 0.915 15.952 -376.131

2 1 0 0 0 0 1 1 0.912 0.911 24.261 -368.502

2 1 0 1 0 0 1 0 0.911 0.909 26.792 -366.249

2 1 0 0 1 0 1 0 0.908 0.907 32.443 -361.332

3 1 0 0 1 1 1 0 0.922 0.921 5.857 -382.878

3 1 0 0 0 1 1 1 0.919 0.918 11.341 -377.413

3 1 0 1 1 0 1 0 0.918 0.916 14.023 -374.808

3 1 1 0 0 1 1 0 0.917 0.915 15.909 -373.002

4 1 0 1 1 1 1 0 0.924 0.922 3.847 -381.933

4 1 1 0 1 1 1 0 0.923 0.921 5.084 -380.652

4 1 0 0 1 1 1 1 0.923 0.921 5.197 -380.534

4 1 0 1 0 1 1 1 0.919 0.917 13.056 -372.631

5 1 1 1 1 1 1 0 0.924 0.922 5.002 -377.752

5 1 0 1 1 1 1 1 0.924 0.922 5.781 -376.939

5 1 1 0 1 1 1 1 0.924 0.921 6.240 -376.462

5 1 1 1 0 1 1 1 0.920 0.917 14.904 -367.717

6 1 1 1 1 1 1 1 0.924 0.921 7.000 -372.692


A good model choice might be the model in the 13th row, with 4 predictors: tonnage, passengers, length, and cabins, whose R²adj = 0.922, Cp = 3.847, and

BIC= −381.933. Also we note that this model’s AIC is lower than that of

the full model.

> fit3=update(fit0,.~.-age-passdens)

> AIC(fit3)

[1] 448.3229

We can also calculate the PRESS, OCV and GCV statistics that we would

compare to other potential models (but we haven’t here).

> library(qpcR)

> PRESS(fit3)$stat

[1] 154.8479

> library(dbstats)

> dblm(formula(fit3),data=cruise)$ocv

[1] 0.9673963

> dblm(formula(fit3),data=cruise)$gcv

[1] 0.9752566

http://www.stat.ufl.edu/~athienit/STA4210/Examples/selection_validation.R

9.4 Regression Model Building

As discussed, it is possible to have a large set of predictor variables (including

interactions). The goal is to fit a “parsimonious” model that explains as

much variation in the response as possible with a relatively small set of

predictors.

There are 3 automated procedures

• Backward Elimination (Top down approach)

• Forward Selection (Bottom up approach)

• Stepwise Regression (Combines Forward/Backward)

We will explore these procedures using two different elimination/selection

criteria. One that uses t-test and p-value and another that uses the AIC

value.


9.4.1 Backward elimination

1. Select a significance level to stay in the model (e.g. αs = 0.20, generally

.05 is too low, causing too many variables to be removed).

2. Fit the full model with all possible predictors.

3. Consider the predictor with lowest t-statistic (highest p-value).

• If p-value > αs , remove the predictor and fit model without this

variable (must re-fit model here because partial regression coeffi-

cients change).

• If p-value ≤ αs, stop and keep current model.

4. Continue until all predictors have p-values ≤ αs.

9.4.2 Forward selection

1. Choose a significance level to enter the model (e.g. αe = 0.20, generally

.05 is too low, causing too few variables to be entered).

2. Fit all simple regression models.

3. Consider the predictor with the highest t-statistic (lowest p-value).

• If p-value ≤ αe, keep this variable and fit all two variable models

that include this predictor.

• If p-value > αe, stop and keep previous model.

4. Continue until no new predictors have p-values ≤ αe

9.4.3 Stepwise regression

1. Select αs and αe, (αe < αs).

2. Start like Forward Selection (bottom up process) where new variables

must have p-value ≤ αe to enter.

3. Re-test all “old variables” that have already been entered, must have

p-value ≤ αs to stay in model.

4. Continue until no new variables can be entered and no old variables

need to be removed.


Remark 9.1. Although we created a function in R that follows the steps of

backward, forward and stepwise, there is also an already developed function

stepAIC that can perform all three procedures by adding/removing variables

depending on whether the AIC is reduced.

Example 9.2. Continuing from example 9.1, we perform backward elimination with αs = 0.20.

> source("http://www.stat.ufl.edu/~athienit/stepT.R")

> stepT(fit0,alpha.rem=0.2,direction="backward")

crew ~ age + tonnage + passengers + length + cabins + passdens

----------------------------------------------

Step 1 -> Removing:- passdens

Estimate Pr(>|t|)

(Intercept) -0.556 0.394

age -0.012 0.358

tonnage 0.013 0.150

passengers -0.149 0.000

length 0.404 0.001

cabins 0.802 0.000

crew ~ age + tonnage + passengers + length + cabins

----------------------------------------------

Step 2 -> Removing:- age

Estimate Pr(>|t|)

(Intercept) -0.819 0.164

tonnage 0.016 0.046

passengers -0.150 0.000

length 0.398 0.001

cabins 0.791 0.000

Final model:

crew ~ tonnage + passengers + length + cabins

We can also perform forward selection and stepwise regression by running

stepT(fit0,alpha.enter=0.2,direction="forward")

stepT(fit0,alpha.rem=0.2,alpha.enter=0.15,direction="both")


We can also use the built in function stepAIC

> library(MASS)

> fit1 <- lm(crew ~ age + tonnage + passengers + length + cabins + passdens)

> fit2 <- lm(crew ~ 1)

> stepAIC(fit1,direction="backward")

Start: AIC=1.05

crew ~ age + tonnage + passengers + length + cabins + passdens

Df Sum of Sq RSS AIC

- passdens 1 0.002 145.57 -0.943

- age 1 0.753 146.32 -0.130

- tonnage 1 1.195 146.77 0.347

<none> 145.57 1.055

- passengers 1 9.548 155.12 9.092

- length 1 11.980 157.55 11.551

- cabins 1 77.821 223.39 66.721

Step: AIC=-0.94

crew ~ age + tonnage + passengers + length + cabins

Df Sum of Sq RSS AIC

- age 1 0.815 146.39 -2.062

<none> 145.57 -0.943

- tonnage 1 2.007 147.58 -0.780

- length 1 12.069 157.64 9.641

- passengers 1 14.027 159.60 11.591

- cabins 1 79.556 225.13 65.944

Step: AIC=-2.06

crew ~ tonnage + passengers + length + cabins

Df Sum of Sq RSS AIC

<none> 146.39 -2.062

- tonnage 1 3.866 150.25 0.056

- length 1 11.739 158.13 8.126

- passengers 1 14.275 160.66 10.640


- cabins 1 78.861 225.25 64.028

Call:

lm(formula = crew ~ tonnage + passengers + length + cabins)

and can also perform forward and stepwise regression by running

stepAIC(fit2,direction="forward",scope=list(upper=fit1,lower=fit2))

stepAIC(fit2,direction="both",scope=list(upper=fit1,lower=fit2))

http://www.stat.ufl.edu/~athienit/STA4210/Examples/selection_validation.R

9.5 Model Validation

When we have a lot of data, we would like to see how well a model fit on

one set of data (training sample) compares to one fit on a new set of data

(validation sample), and how the training model fits the new data.

• We want the data sets to be similar with respect to the levels of the

predictors (so that the validation sample is not an extrapolation of the

training sample). Should calculate some summary statistics such as

means, standard deviations, etc.

• The training set should have at least 6-10 times as many observations

than potential predictors.

• Models should give “similar” model fits based on SSE, PRESS, Mallow's Cp, MSE and regression coefficients. Should obtain multiple models using multiple “adequate” training samples.

The Mean Square Prediction Error (MSPE) when training model is applied

to validation sample is

MSPE = ∑_{i=1}^{n_V} (y_i^V − ŷ_i^T)² / n_V

where n_V is the validation sample size, y_i^V represents a data point from the validation sample and ŷ_i^T represents a fitted value using the predictor settings corresponding to y_i^V but the coefficients from the training sample, i.e.

ŷ_i^T = b_0^T + b_1^T x_{1,i}^V + · · · + b_{p−1}^T x_{p−1,i}^V


If the MSPE is fairly close to the MSET of the regression model that was fitted

to the training data set, then it indicates that the selected regression model

is not seriously biased and gives an appropriate indication of the predictive

ability of the model. At this point you should now go ahead and fit the model

on the full data set. It is only a problem when MSPE is much greater than MSE_T.

Example 9.3. Continuing from example 9.1, we perform cross-validation

with a hold-out sample. Randomly sample 100 ships, fit model, obtain pre-

dictions for the remaining 58 ships by applying their predictor levels to the

regression coefficients from the fitted model.

> cruise.cv.samp <- sample(1:length(cruise$crew),100,replace=FALSE)

> cruise.cv.in <- cruise[cruise.cv.samp,]

> cruise.cv.out <- cruise[-cruise.cv.samp,]

> ### Check if training sample (and validation) is similar to the whole dataset

> summary(cruise[,4:7])

tonnage passengers length cabins

Min. : 2.329 Min. : 0.66 Min. : 2.790 Min. : 0.330

1st Qu.: 46.013 1st Qu.:12.54 1st Qu.: 7.100 1st Qu.: 6.133

Median : 71.899 Median :19.50 Median : 8.555 Median : 9.570

Mean : 71.285 Mean :18.46 Mean : 8.131 Mean : 8.830

3rd Qu.: 90.772 3rd Qu.:24.84 3rd Qu.: 9.510 3rd Qu.:10.885

Max. :220.000 Max. :54.00 Max. :11.820 Max. :27.000

> summary(cruise.cv.in[,4:7])

tonnage passengers length cabins

Min. : 3.341 Min. : 0.66 Min. : 2.790 Min. : 0.330

1st Qu.: 46.947 1st Qu.:12.65 1st Qu.: 7.168 1st Qu.: 6.327

Median : 73.941 Median :19.87 Median : 8.610 Median : 9.750

Mean : 73.581 Mean :19.24 Mean : 8.219 Mean : 9.177

3rd Qu.: 91.157 3rd Qu.:26.00 3rd Qu.: 9.605 3rd Qu.:11.473

Max. :220.000 Max. :54.00 Max. :11.820 Max. :27.000

> summary(cruise.cv.out[,4:7])

tonnage passengers length cabins

Min. : 2.329 Min. : 0.94 Min. : 2.960 Min. : 0.450


1st Qu.: 40.013 1st Qu.:10.62 1st Qu.: 6.370 1st Qu.: 5.335

Median : 70.367 Median :18.09 Median : 8.260 Median : 8.745

Mean : 67.325 Mean :17.11 Mean : 7.978 Mean : 8.232

3rd Qu.: 87.875 3rd Qu.:21.39 3rd Qu.: 9.510 3rd Qu.:10.430

Max. :160.000 Max. :37.82 Max. :11.320 Max. :18.170

> fit.cv.in <- lm(crew ~tonnage + passengers + length + cabins,

+ data=cruise.cv.in)

> summary(fit.cv.in)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -1.10180 0.77347 -1.424 0.157581

tonnage 0.00479 0.01177 0.407 0.685054

passengers -0.19192 0.05445 -3.525 0.000654 ***

length 0.45647 0.14573 3.132 0.002306 **

cabins 0.95060 0.14510 6.551 2.92e-09 ***

---

Residual standard error: 1.059 on 95 degrees of freedom

Multiple R-squared: 0.9203,Adjusted R-squared: 0.9169

F-statistic: 274.2 on 4 and 95 DF, p-value: < 2.2e-16

Then we obtain predicted values and prediction errors for the validation

sample. The model is based on same 4 predictors that we chose before

(columns 4-7 of cruise data) from which we compute the MSPE

> pred.cv.out <- predict(fit.cv.in,cruise.cv.out[,4:7])

> delta.cv.out <- cruise$crew[-cruise.cv.samp]-pred.cv.out

> (mspe <- sum((delta.cv.out)^2)/length(cruise$crew[-cruise.cv.samp]))

[1] 0.7578447

We note that the MSPE of 0.7578 is fairly close to the MSE of 1.059² = 1.121.

(At least it is not much greater).

http://www.stat.ufl.edu/~athienit/STA4210/Examples/selection_validation.R


Chapter 10

Diagnostics

See slides and examples at http://www.stat.ufl.edu/~athienit/STA4210

The notes here are incomplete and under construction

The goal of this chapter is to use refined diagnostics for checking the adequacy of the regression model, including detecting improper functional form for a predictor, outliers, influential observations and multicollinearity.

10.1 Outlying Y observations

Model errors (unobserved) are defined as

ǫ_i = Y_i − ∑_{j=0}^{p−1} β_j x_{i,j},   x_{i,0} = 1,   ǫ ∼ N(0, σ² I_n)

The observed residuals are

e_i = y_i − ∑_{j=0}^{p−1} b_j x_{i,j},   e ∼ N(0, σ²(I_n − H))

where H = X(X^T X)^{−1} X^T is the projection matrix. So the elements of the variance-covariance matrix σ²(I_n − H) are:

σ{e_i, e_j} = σ²(1 − h_ii)   if i = j
σ{e_i, e_j} = −h_ij σ²       if i ≠ j

Using the estimate of σ², namely MSE, we then have


• Semi-Studentized residual

  e⋆_i = e_i / √MSE

• Studentized residual, which uses the estimated variance of e_i:

  r_i = e_i / √(MSE(1 − h_ii))

• Studentized Deleted residual. When calculating a residual e_i = y_i − ŷ_i,

the ith observation (y_i, x_{i,1}, . . . , x_{i,p−1}) was used in the creation of the model (as were all the other points), and then the model was used to estimate the response for the ith observation. That is, each observation played a role in the creation of the model that was then used to estimate the response of said observation. Not very objective.

The solution is to delete/remove the ith observation, fit a model without

that observation in the data, and use the model to predict the response

of that observation by plugging in the predictor setting xi,1, . . . , xi,p−1.

This sounds very computationally intensive in that you have to fit as

many models as there are points. Luckily, it has been found that this

can be done without refitting. It can be shown that

SSE = (n − p) MSE = (n − p − 1) MSE_(i) + e_i² / (1 − h_ii)

⇒ t_i = e_i √[ (n − p − 1) / (SSE(1 − h_ii) − e_i²) ]

where MSE(i) is the MSE of the model with the ith observation deleted,

and ti is the “objective” residual. Then we can determine if a residual

is an outlier if it is more than 2 to 3 standard deviations from 0. We

can also use a Bonferroni adjustment and determine if an observation

is an outlier if it is greater than t1−α/(2n),n−p−1 but that will usually be

too large when n is large.
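A minimal sketch of the three residual types and the Bonferroni outlier flag for a generic fitted lm object fit; note that rstandard() and rstudent() return the studentized and studentized deleted residuals directly.

e   <- residuals(fit)
h   <- hatvalues(fit)
n   <- nobs(fit); p <- length(coef(fit))
MSE <- deviance(fit) / (n - p)

estar <- e / sqrt(MSE)                    # semi-studentized
r     <- e / sqrt(MSE * (1 - h))          # studentized, same as rstandard(fit)
t_del <- e * sqrt((n - p - 1) / (deviance(fit) * (1 - h) - e^2))  # rstudent(fit)

# flag potential Y-outliers with the Bonferroni cutoff
which(abs(t_del) > qt(1 - 0.05 / (2 * n), df = n - p - 1))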


10.2 Outlying X-Cases

Recall that H = X(X^T X)^{−1} X^T is the projection matrix, with (i, j) element

h_ij = x_i^T (X^T X)^{−1} x_j,   where x_i = (1, x_{i,1}, . . . , x_{i,p−1})^T

Note that

• h_ii ∈ [0, 1]

• ∑_{i=1}^{n} h_ii = trace(H) = trace(X(X^T X)^{−1} X^T) = trace(X^T X(X^T X)^{−1}) = trace(I_p) = p

Cases with X-levels close to the “center” of the sampled X-levels will have

small leverages, i.e. hii. Cases with “extreme” levels have large leverages, and

have the potential to “pull” the regression equation toward their observed

Y -values. We can see this by

ŷ = Hy  ⇒  ŷ_i = ∑_{j=1}^{n} h_ij y_j = ∑_{j=1}^{i−1} h_ij y_j + h_ii y_i + ∑_{j=i+1}^{n} h_ij y_j

Leverage values are considered large if > 2p/n (2 times larger than the mean p/n). Leverage values for potential new observations are

h_{new,new} = x_new^T (X^T X)^{−1} x_new

and are considered extrapolations if their leverage values are larger than those in the original dataset.
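A minimal sketch of both leverage checks for a generic fitted lm object fit:

h <- hatvalues(fit)                    # leverages h_ii
p <- length(coef(fit)); n <- nobs(fit)
which(h > 2 * p / n)                   # cases with leverage above twice the mean p/n

# leverage of a new predictor vector x_new = c(1, x1, ..., x_{p-1});
# values above max(h) suggest extrapolation
X <- model.matrix(fit)
h_new <- function(x_new) drop(t(x_new) %*% solve(crossprod(X)) %*% x_new)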

10.3 Influential Cases

10.3.1 Fitted values

10.3.2 Regression coefficients

10.4 Multicollinearity

See examples 7.6, 7.7 and 7.8


Chapter 11

Remedial Measures

See slides and examples at http://www.stat.ufl.edu/~athienit/STA4210


Chapter 12

Autocorrelation in Time Series

See slides and examples at http://www.stat.ufl.edu/~athienit/STA4210

END
