
Stat 204, Part 4: Association (Spring 2015)

Chapter 12: Analyzing the Association Between Quantitative Variables: Regression Analysis

These notes reflect material from our text, Statistics: The Art and Science of Learning from Data, Third Edition, by Alan Agresti and Catherine Franklin, published by Pearson, 2013.

Linear models

For two quantitative variables, it is often convenient to distinguish between an explanatory (predictor) and a response (predicted) variable, denoted x and y, respectively.

The means, µx, µy, standard deviations, σx, σy, and correlation coefficient, ρ, describe a population.

Fitting y ∼ x results in a linear model, µy = β0 + β1x, describing the mean of y as a function of x in the population.

An association between the variables x and y is characterized by its direction (positive or negative), form (linear or non-linear) and strength (which for linear relationships is measured by the correlation).

The sample means, x̄, ȳ, sample standard deviations, sx, sy, and sample correlation coefficient, r, describe a sample taken from the population.

Point estimates for β0 and β1 are determined from the sample and are denoted b0 and b1.

The linear model for the sample takes the form ŷ = b0 + b1x.

The residual, ei = yi − ŷi, measures the distance between the actual value, yi, and the predicted value, ŷi, corresponding to a particular xi.

Regression analysis uses properties of a linear model constructed from a sample to deduce properties of a linear relationship in the corresponding population.

Least squares line

Conditions for least squares: (1) a nearly linear relationship, (2) nearly normal residuals, (3) with nearly constant variability.

Formulas for the regression coefficients:

b1 = r (sy/sx), b0 = ȳ − b1x̄.

Use a least squares line to predict y from x: ŷ = b0 + b1x

The center of mass of the sample lies on the least squares line: ȳ = b0 + b1x̄

The squared correlation, r², describes the proportion of the variance of the response variable explained by the explanatory variable.
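As a quick sanity check of these formulas, here is a minimal R sketch on made-up data (the vectors u and v are hypothetical, used only for illustration):

u <- c(1, 2, 3, 4, 5)
v <- c(2.1, 3.9, 6.2, 7.8, 10.1)
b1 <- cor(u, v) * sd(v) / sd(u)   # b1 = r * s_y / s_x
b0 <- mean(v) - b1 * mean(u)      # b0 = v.bar - b1 * u.bar
c(b0, b1)
coef(lm(v ~ u))                   # lm reproduces the same coefficients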

Two quantitative variables

We illustrate simple regression with one of the examples explored by Agresti and Franklin in chapter 12, a data set describing 57 female high school athletes and their performances in several athletic activities. Read in the data set, select two athletic activities, and generate a scatterplot. We use x and y to describe these activities, rather than more descriptive names, to suggest that this type of analysis is widely applicable.

athletes <- read.csv("high_school_female_athletes.csv", header=TRUE)

head(athletes)

str(athletes)

summary(athletes)

x <- athletes$BRTF..60. # number of 60 lb bench presses

y <- athletes$X1RM.BENCH..lbs. # maximum bench press

plot(x, y,
     pch=19, col="darkred",
     xlab="number of 60 lb bench presses",
     ylab="maximum bench press (lbs)",
     main="Female High School Athletes")

[Figure: scatterplot titled "Female High School Athletes", maximum bench press (lbs) versus number of 60 lb bench presses]

A suggestion of a linear relationship?

Is there a suggestion of a linear relationship here? Use R's lm procedure to calculate a linear model for this data.

plot(x, y,
     pch=19, col="darkred",
     xlab="number of 60 lb bench presses",
     ylab="maximum bench press (lbs)",
     main="Female High School Athletes (lm)")

athletes.lm <- lm(y ~ x)
abline(athletes.lm, col="orange")


[Figure: scatterplot titled "Female High School Athletes (lm)" with the fitted regression line, maximum bench press (lbs) versus number of 60 lb bench presses]

A linear relationship in this context is described by an equation of the form

ŷ = a + bx,

where the coefficients a and b are part of the linear model. Create a function which calculates ŷ given x and use it to calculate a point along the regression line. The second student in the data set had an x value of 12. What value of ŷ would this linear model predict for the second student?

coefficients(athletes.lm)

# (Intercept) x

# 63.536856 1.491053

predict.y.hat <- function(x){
  a <- coefficients(athletes.lm)[1]
  b <- coefficients(athletes.lm)[2]
  y.hat <- as.numeric(a + b * x)
  return(y.hat)
}

predict.y.hat(12)

# 81.42949

We can use R’s function predict to do the same calculation.

# use predict

new.data <- data.frame(x=12)

predict(athletes.lm, new.data)

# 1

# 81.42949

R’s predict can calculate the predictions for every x in the data set.


# calculate y.hat for each student

y.hat <- predict(athletes.lm, data.frame(x, y))

head(data.frame(x, y, y.hat))

# x y y.hat

# 1 10 80 78.44739

# 2 12 85 81.42949

# 3 20 85 93.35792

# 4 5 65 70.99212

# 5 12 95 81.42949

# 6 10 75 78.44739

A residual is the difference between an actual y and the predicted ŷ. Verify that the second student's residual is

e2 = y2 − ŷ2 = 3.570507.
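As a cross-check, R can recover this residual directly from objects already defined above (a sketch):

y[2] - y.hat[2]            # 3.570507, computed by hand
residuals(athletes.lm)[2]  # the same value from the fitted model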

Testing for association

Do the data plausibly cluster around this least-squares line? Just how much evidence is there of a linear relationship in this data? We will test the null hypothesis that there is no linear relationship against the alternative hypothesis that there is one. If the regression line is horizontal, then knowing something about x gives no usable information about y, so there would be no association between these two variables. Therefore, the key thought is to determine whether the slope of the actual (population) regression line could plausibly be 0 or, equivalently, whether the correlation between the two variables is 0. We organize the discussion as a two-sided hypothesis test. Some key statistics are contained in the summary of the linear model for the associated sample.

H0 : β = 0

Ha : β ≠ 0

# are the two variables associated?

summary(athletes.lm)

# Call:

# lm(formula = y ~ x)

# Residuals:

# Min 1Q Median 3Q Max

# -17.9205 -5.9027 -0.7237 5.4989 19.0973

# Coefficients:

# Estimate Std. Error t value Pr(>|t|)

# (Intercept) 63.5369 1.9565 32.475 < 2e-16 ***

# x 1.4911 0.1497 9.958 6.48e-14 ***

# ---

# Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

# Residual standard error: 8.003 on 55 degrees of freedom

# Multiple R-squared: 0.6432, Adjusted R-squared: 0.6368

# F-statistic: 99.17 on 1 and 55 DF, p-value: 6.481e-14


The value of the slope b in the linear model for the sample ŷ = a + bx is the Estimate to the right of x. Its standard error is the next number to the right in that row, under the title Std. Error. Use b and its SE to calculate the test statistic and then determine its p-value.

# HT

# H_0 : beta == 0

# H_a : beta != 0

b <- 1.4911

se <- 0.1497

t <- (b - 0) / se

# 9.960588

n <- length(x)

p.value <- 2 * (1 - pt(t, df=n-2))

# 6.439294e-14

The p-value is very small, so we reject the null hypothesis, accept the alternative hypothesis, and conclude that the two quantitative variables are associated.

A confidence interval centered on the statistic b provides a range of plausible values for the slope β of the (population) regression line.

alpha <- 0.05

t.star <- qt(1 - alpha/2, df=n-2)

# 2.004045

ci <- b + t.star * se * c(-1, 1); ci

# 1.191094 1.791106
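R can also produce this interval directly from the fitted model with confint; up to the rounding of b and SE above, it agrees with the hand calculation:

confint(athletes.lm, "x", level=0.95)
# lower and upper limits near 1.191 and 1.791, matching the interval above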

So we are 95% confident that our confidence interval [1.191094, 1.791106] contains the population parameter β. Note that this interval does not contain the value 0, so we once again discover that these two quantitative variables are associated.

The F statistic mentioned in the summary of the simple linear regression model is an alternate test statistic for the proposition H0 : β = 0, and in fact it is equal to the square of the t statistic that we have used for that same purpose. The p-value obtained from the F statistic is exactly the same as the p-value obtained from the t statistic. F distributions will play a stronger role in multiple linear regression.
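A quick check, using the t computed above (a sketch):

t^2                                                  # about 99.2, the F statistic up to the rounding of b and SE
summary(athletes.lm)$coefficients["x", "t value"]^2  # the unrounded t squared, exactly the reported F, 99.17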

Strength of the association

When working with categorical variables, we used the chi-square test to determine if the variables were associated, and then we turned to measures of association, such as differences of proportions and relative risk, to determine the strength of the association. For quantitative variables, the correlation measures the strength of the association. The correlation is a number between -1 and 1. Values near 1 and -1 reflect the strongest (positive and negative, resp.) associations. A correlation of 0 means that the two variables are not linearly associated.

# correlation

cor(x, y)

# 0.8020251
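Squaring this correlation recovers the Multiple R-squared reported by summary(athletes.lm):

cor(x, y)^2  # 0.6432, the proportion of the variance in y explained by x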


Correlation matrix

R’s function cor can also return a matrix of correlations. Let’s add two more athletic activities to the mix, a leg press and a 40 yard dash. Which activities are most strongly associated? Which have the weakest association? Can you imagine why? What is the interpretation of the negative numbers in this matrix?

# matrix of correlations

# x bench press

# y max bench press

# add two more exercises

z <- athletes$LP.RTF..200. # leg press

w <- athletes$X40.YD..sec. # 40 yd run

corr.matrix <- cor(data.frame(x, y, z, w))

# x y z w

# x 1.00000000 0.80202510 0.61107645 -0.06509459

# y 0.80202510 1.00000000 0.57791717 -0.08076663

# z 0.61107645 0.57791717 1.00000000 0.09756962

# w -0.06509459 -0.08076663 0.09756962 1.00000000

Interpret this visualization of the correlation matrix.

library(corrplot)

corrplot(corr.matrix, method="circle")

[Figure: corrplot visualization of the correlation matrix for x, y, z, and w]


Regression toward the Mean

The equation of the regression line is ŷ = b0 + b1x, where b0 = ȳ − b1x̄ and b1 = r sy/sx, so we can rewrite it as

ŷ − ȳ = b1(x − x̄) = r (sy/sx)(x − x̄).

Now choose x one standard deviation to the right of x̄, so x − x̄ = sx. The corresponding predicted value ŷ is given by

ŷ − ȳ = r sy,

so the predicted value ŷ is r times one standard deviation sy above ȳ, and of course |r| ≤ 1. Therefore, if x moves one standard deviation to the right of its mean, x = x̄ + sx, then the predicted ŷ moves only r sy above its mean, ŷ = ȳ + r sy. Sons of tall fathers are likely shorter than their dads. Sons of short fathers are likely taller than their dads. This was first noticed by the famous pioneer of statistics, Francis Galton (1822-1911), and it is called regression toward the mean.
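The athletes data illustrate the effect numerically; this sketch reuses the objects defined earlier:

r <- cor(x, y)
x.new <- mean(x) + sd(x)                    # one standard deviation above x.bar
predict(athletes.lm, data.frame(x=x.new))   # predicted y at x.new
mean(y) + r * sd(y)                         # y.bar + r * s_y, the same value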

[Figure: "Regression toward the Mean", comparing the line y = x with the regression line ŷ = a + bx: starting from (x̄, ȳ), a run of sx along the regression line produces a rise of only r·sy]


Standardized residuals

How do data vary around the regression line? Residuals tell the story, but standardized residuals are more informative, in the same way that a z-score tells how many standard deviations a result lies from a given value.

standardized.residuals <- rstandard(athletes.lm)

hist(standardized.residuals, col="orangered")

[Figure: histogram of the standardized residuals]
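A normal quantile plot gives another view of the nearly-normal-residuals condition (a sketch):

qqnorm(standardized.residuals, pch=19, col="darkred")
qqline(standardized.residuals, col="orange")  # points near the line support normality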


MSE and RSE

A basic assumption of simple linear regression is that for each fixed x, the y values are normally distributed with mean µy (the point on the population regression line) and standard deviation σ. A single value σ describes the spread of the normal distributions about their means for each one of the x's. The value of σ can be estimated from the data. The mean square error, MSE, is the variance of all of those normal distributions, and the square root of MSE, known as the residual standard error, RSE, is the very important estimate of σ. The RSE and related statistics appear in the output of R's procedure aov (analysis of variance). The MSE is the residual sum of squares, Residual SS, divided by its degrees of freedom, n − 2, and the RSE is the square root of MSE.

aov(athletes.lm)

# Call:

# aov(formula = athletes.lm)

# Terms:

# x Residuals

# Sum of Squares 6351.755 3522.806

# Deg. of Freedom 1 55

# Residual standard error: 8.003188

# Estimated effects may be unbalanced

residual.ss <- 3522.806

df <- 55

mse <- residual.ss / df

rse <- sqrt(mse)

# 8.003188
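The same quantities can be pulled straight from the fitted model, without hard-coding the sums of squares (a sketch; sigma requires R 3.3 or later):

sum(residuals(athletes.lm)^2) / df.residual(athletes.lm)  # MSE
sigma(athletes.lm)                                        # RSE, 8.003188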

Prediction

Two types of prediction are important in this context. Given x, we would like to predict plausible values for µy (the population mean of y at that x) with a confidence interval, CI, and we would like to predict y values for individuals sharing that value of x with a prediction interval, PI. The PI will be wider than the associated CI because the PI encompasses a lot of individual variation, while the CI is a confidence interval for a (much more constrained) mean. In the following approximate formulas (Agresti and Franklin, 3e, p. 611), the RSE plays the role of σ, so these formulas resemble previous confidence intervals for means and values.

# approximate CI for the population mu_y

ci <- y.hat + t.star * rse / sqrt(n) * c(-1, 1)

# approximate PI for individual y values

pi <- y.hat + t.star * rse * c(-1, 1)

Here t* is calculated with an R command such as

t.star <- qt(0.975, df=n-2)

and the residual standard error, RSE, is obtained from the summary of the linear model or by calling aov on the linear model: summary(athletes.lm) or aov(athletes.lm).


Confidence and Prediction Intervals Using Predict

For more accurate confidence and prediction intervals, use R’s predict.

# confidence and prediction intervals using predict

?predict

# 95% CI for mu_y given x == 12

new.data <- data.frame(x=12)

predict(athletes.lm, new.data, interval="confidence")

# fit lwr upr

# 1 81.42949 79.28328 83.57571

# 95% PI for y given x == 12

predict(athletes.lm, new.data, interval="prediction")

# fit lwr upr

# 1 81.42949 65.24778 97.6112

Using predict to calculate confidence and prediction intervals for a whole range of x values produces confidence and prediction bands. Notice that the confidence band is narrowest near (x̄, ȳ) = (10.98, 79.91).
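One way to draw such bands is to evaluate predict over a grid of x values (a sketch reusing objects defined above):

grid <- data.frame(x=seq(min(x), max(x), length.out=100))
cb <- predict(athletes.lm, grid, interval="confidence")
pb <- predict(athletes.lm, grid, interval="prediction")
plot(x, y, pch=19, col="darkred",
     xlab="number of 60 lb bench presses",
     ylab="maximum bench press (lbs)",
     main="Female High School Athletes, confidence and prediction bands")
abline(athletes.lm, col="orange")
lines(grid$x, cb[, "lwr"], col="steelblue", lty=2)  # confidence band
lines(grid$x, cb[, "upr"], col="steelblue", lty=2)
lines(grid$x, pb[, "lwr"], col="seagreen", lty=3)   # prediction band
lines(grid$x, pb[, "upr"], col="seagreen", lty=3)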

[Figure: scatterplot titled "Female High School Athletes, confidence and prediction bands", showing the regression line with its confidence and prediction bands]


Outline for Presenting an Hypothesis Test

Agresti and Franklin suggest using a five-step outline for presenting hypothesis tests such as we are using in this chapter. Here is a sketch of the approach they recommend.

Assumptions We assume randomization, normal conditional distributions for y given x, with a linear trend for the means of these distributions, and a common standard deviation for all of them.

Hypotheses The null hypothesis is that the variables are independent, and the alternative hypothesis is that they are dependent (associated).

H0 : β = 0

Ha : β ≠ 0

Test Statistic The slope b of the sample regression line and its standard error, SE, are found in the Coefficients section of the summary of the linear model.

t = b/SE.

p-value The p-value is calculated with an R command such as

p.value <- 2 * (1 - pt(t, df=n-2))

Conclusion in Context Is there sufficient evidence to reject H0 or not? What does this mean in the context of this particular investigation?

Outline for Presenting a Confidence Interval

Confidence Interval A 95% confidence interval for the population parameter β is given by

b ± t* × SE

where b and SE are as in the associated hypothesis test, and t* is calculated with an R command such as

t.star <- qt(0.975, df=n-2)

Conclusion in Context The confidence interval provides a range of plausible values for the population parameter β. State clearly what this means in the context of the present study.


Exponential regression

The exponential regression model is µy = αβ^x.

The growth in the population of the World since 1900 can be modeled by such a curve.
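A common way to fit such a model is to regress log(y) on x and then exponentiate the coefficients. The sketch below uses rough historical world-population figures (approximate values, for illustration only; the variable names are hypothetical):

year <- c(1900, 1920, 1940, 1960, 1980, 2000)
pop <- c(1.65, 1.86, 2.30, 3.02, 4.44, 6.09)  # billions, approximate
world.lm <- lm(log(pop) ~ year)
alpha <- exp(coef(world.lm)[1])  # alpha, the model's value at year 0
beta <- exp(coef(world.lm)[2])   # beta, the estimated annual growth factor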

[Figure: "World Population", population (billions) versus year, 1900-2000]

Exercises

We will attempt to solve some of the following exercises as a community project in class today. Finish these solutions as homework exercises, write them up carefully and clearly, and hand them in at the beginning of class next Friday.

Exercises for Chapter 12: 12.7 (GPA), 12.9 (GPA), 12.15 (sit-ups), 12.21 (GPA), 12.23 (placebo), 12.26 (golfing), 12.33 (prices), 12.36 (boys), 12.46 (prices), 12.49 (leg presses), 12.61 (leaf litter)

Class work 12a – regression, strength of association

Exercises from Chapter 12: 12.7 (GPA), 12.8 (GPA), 12.10 (exercise), 12.22 (GPA), 12.24 (tutoring)

Class work 12b – inferences about the association

Exercises from Chapter 12: 12.35 (strength), 12.36 (boys), 12.39 (sales), 12.50 (predicting), 12.54 (predicting)

Class work 12c – exponential regression, review

Exercises from Chapter 12: 12.58 (population), 12.61 (leaf litter), 12.64 (correlation), 12.72 (predicting), 12.74 (population)
