BPS - 3rd Ed. Chapter 21: Inference for Regression


Linear Regression (from Chapter 5)

Objective: To quantify the linear relationship between an explanatory variable (x) and a response variable (y).

We can then predict the average response for all subjects with a given value of the explanatory variable.

Case Study: Crying and IQ

Researchers explored the crying of infants four to ten days old and their IQ test scores at age three to determine if more crying was a sign of higher IQ.

Karelitz, S. et al., “Relation of crying activity in early infancy to speech and intellectual development at age three years,” Child Development, 35 (1964), pp. 769-777.

Case Study: Crying and IQ
Data collection

Data collected on 38 infants
Snap of rubber band on foot caused infants to cry
– recorded the number of peaks in the most active 20 seconds of crying (explanatory variable x)
Measured IQ score at age three years using the Stanford-Binet IQ test (response variable y)

Case Study: Crying and IQ
Data

[Table: crying peaks (x) and IQ score (y) for the 38 infants; not reproduced]

Case Study: Crying and IQ
Data analysis

Scatterplot of y vs. x shows a moderate positive linear relationship, with no extreme outliers or potentially influential observations

Case Study: Crying and IQ
Data analysis

Correlation between crying and IQ is r = 0.455 (as calculated in Chapter 4)
Least-squares regression line for predicting IQ from crying is (as in Ch. 5)

$\hat{y} = a + bx = 91.27 + 1.493x$

$R^2 = 0.207$, so 21% of the variation in IQ scores is explained by crying intensity

Inference

We now want to extend our analysis to include inferences on various components involved in the regression analysis
– slope
– intercept
– correlation
– predictions

Regression Model, Assumptions

Conditions required for inference about regression (have n observations on an explanatory variable x and a response variable y)
1. For any fixed value of x, the response y varies according to a Normal distribution. Repeated responses y are independent of each other.
2. The mean response $\mu_y$ has a straight-line relationship with x: $\mu_y = \alpha + \beta x$. The slope $\beta$ and intercept $\alpha$ are unknown parameters.
3. The standard deviation of y (call it $\sigma$) is the same for all values of x. The value of $\sigma$ is unknown.

Regression Model, Assumptions

The regression model has three parameters: $\alpha$, $\beta$, and $\sigma$
The true regression line $\mu_y = \alpha + \beta x$ says that the mean response $\mu_y$ moves along a straight line as x changes (we cannot observe the true regression line; instead we observe y for various values of x)
Observed values of y vary about their means $\mu_y$ according to a Normal distribution (if we take many y observations at a fixed value of x, the Normal pattern will appear for these y values)

Regression Model, Assumptions

The standard deviation $\sigma$ is the same for all values of x, meaning the Normal distributions for y have the same spread at each value of x

Estimating Parameters: Slope and Intercept

When using the least-squares regression line $\hat{y} = a + bx$, the slope b is an unbiased estimator of the true slope $\beta$, and the intercept a is an unbiased estimator of the true intercept $\alpha$
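The unbiasedness claim can be illustrated with a small simulation. This is only a sketch: the "true" parameter values, the x values, and the helper name `least_squares` are hypothetical choices made for illustration and do not come from the case study. Data are generated repeatedly from the regression model, and the least-squares a and b are averaged over the repetitions.

```python
import random
import statistics


def least_squares(xs, ys):
    """Return (a, b): intercept and slope of the least-squares line."""
    x_bar = statistics.fmean(xs)
    y_bar = statistics.fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical "true" parameters, chosen only for illustration.
ALPHA, BETA, SIGMA = 90.0, 1.5, 2.0
xs = list(range(1, 21))  # 20 fixed x values (also hypothetical)

rng = random.Random(0)  # seeded so the run is repeatable
intercepts, slopes = [], []
for _ in range(2000):
    # One simulated sample: y varies Normally about the true line.
    ys = [ALPHA + BETA * x + rng.gauss(0.0, SIGMA) for x in xs]
    a, b = least_squares(xs, ys)
    intercepts.append(a)
    slopes.append(b)

mean_a = statistics.fmean(intercepts)
mean_b = statistics.fmean(slopes)
print(f"average a = {mean_a:.3f}, average b = {mean_b:.3f}")
```

Individual estimates vary from sample to sample, but their averages should land close to the chosen intercept and slope, which is what "unbiased" means here.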

Estimating Parameters: Standard Deviation

The standard deviation $\sigma$ describes the variability of the response y about the true regression line
A residual is the difference between an observed value of y and the value $\hat{y}$ predicted by the least-squares regression line:

residual = $y - \hat{y}$

The standard deviation $\sigma$ is estimated with a sample standard deviation of the residuals (this is a standard error since it is estimated from data)

Estimating Parameters: Standard Deviation

The regression standard error is the square root of the sum of squared residuals divided by their degrees of freedom (n - 2):

$s = \sqrt{\frac{1}{n-2} \sum (y - \hat{y})^2}$
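The formula for s can be sketched in a few lines of Python. The four data points below are hypothetical and were picked so the arithmetic is easy to follow by hand; they are not the crying and IQ data, and the function name is likewise just for illustration.

```python
import math


def regression_standard_error(xs, ys):
    """Fit the least-squares line, then return s = sqrt(SSR / (n - 2))."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = y_bar - b * x_bar    # intercept
    # Residual = observed y minus predicted y-hat.
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    ssr = sum(e ** 2 for e in residuals)
    return math.sqrt(ssr / (n - 2))


# Hypothetical data: the fitted line is y-hat = 1.3 + 0.8x, the residuals
# are -0.3, 0.9, -0.9, 0.3, and the sum of squared residuals is 1.8.
xs = [0, 1, 2, 3]
ys = [1, 3, 2, 4]
s = regression_standard_error(xs, ys)
print(f"s = {s:.4f}")
```

With n = 4 there are n - 2 = 2 degrees of freedom, so s is the square root of 1.8 / 2 = 0.9.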

Case Study: Crying and IQ

Since $\hat{y} = a + bx = 91.27 + 1.493x$, b = 1.493 is an unbiased estimator of the true slope $\beta$, and a = 91.27 is an unbiased estimator of the true intercept $\alpha$
– because the slope b = 1.493, we estimate that on the average IQ is about 1.5 points higher for each added crying peak
The regression standard error is s = 17.50
– see pages 566-567 in the text for this calculation

Case Study: Crying and IQ
Using Technology:

[Software regression output; not reproduced]

Confidence Interval for Slope

A level C confidence interval for the true slope $\beta$ is $b \pm t^* SE_b$
– $t^*$ is the critical value for the t distribution with df = n - 2 degrees of freedom that has area (1 - C)/2 to the right of it
– the standard error of b is a multiple of the regression standard error:

$SE_b = \frac{s}{\sqrt{\sum (x - \bar{x})^2}}$

Case Study: Crying and IQ
Confidence interval for slope

[Software regression output, with the slope b and its standard error $SE_b$ labeled; not reproduced]

Case Study: Crying and IQ
Confidence interval for slope

b = 1.4929, $SE_b$ = 0.4870, df = n - 2 = 38 - 2 = 36 (df = 36 is not in Table C, so use next smaller df = 30)
For a 95% C.I., (1 - C)/2 = .025, and $t^*$ = 2.042
So a 95% C.I. for the true slope $\beta$ is:

$b \pm t^* SE_b = 1.4929 \pm 2.042(0.4870) = 1.4929 \pm 0.9944$ = 0.4985 to 2.4873
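The same interval can be reproduced in code. This is a minimal sketch using only the numbers quoted on the slide; the function name is an assumption, and the critical value is typed in from Table C rather than computed.

```python
def slope_confidence_interval(b, se_b, t_star):
    """Return (lower, upper) for the interval b +/- t* SE_b."""
    margin = t_star * se_b
    return b - margin, b + margin


b = 1.4929      # estimated slope from the case study
se_b = 0.4870   # standard error of the slope
t_star = 2.042  # Table C value for df = 30, 95% confidence

lower, upper = slope_confidence_interval(b, se_b, t_star)
print(f"95% CI for the slope: {lower:.3f} to {upper:.3f}")
```

The endpoints agree with the slide to about three decimal places; the fourth decimal depends on how the margin of error is rounded. Because the whole interval lies above zero, it is consistent with a positive true slope.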

Hypothesis Tests for Slope

The most common hypothesis to test regarding the slope is that it is zero:
$H_0: \beta = 0$
– says regression line is horizontal (the mean of y does not change with x)
– no true linear relationship between x and y
– the straight-line dependence on x is of no value for predicting y
Standardize b to get a t test statistic:

Hypothesis Tests for Slope

Test statistic for $H_0: \beta = 0$:

$t = \frac{b}{SE_b}$

– follows t distribution with df = n - 2
P-value [for T ~ t(n - 2) distribution]:
$H_a: \beta > 0$: P-value = $P(T \geq t)$
$H_a: \beta < 0$: P-value = $P(T \leq t)$
$H_a: \beta \neq 0$: P-value = $2P(T \geq |t|)$

Case Study: Crying and IQ
Hypothesis Test for slope

[Software regression output, with the t statistic and P-value labeled; not reproduced]

t = b / $SE_b$ = 1.4929 / 0.4870 = 3.07

Significant linear relationship
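For readers without the software output, the two-sided P-value can be approximated directly. The sketch below uses the exact df = n - 2 = 36 and integrates the t density numerically with Simpson's rule; the function names and the choice of integration method are assumptions made for this illustration, not the method used by the slide's software.

```python
import math


def t_pdf(u, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + u * u / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t  # half the area lies to the right of zero


b, se_b, n = 1.4929, 0.4870, 38   # values from the case study
t_stat = b / se_b
p_two_sided = 2 * t_upper_tail(abs(t_stat), n - 2)
print(f"t = {t_stat:.2f}, two-sided P-value = {p_two_sided:.4f}")
```

A small two-sided P-value of this kind is what supports the conclusion of a significant linear relationship.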

Test for Correlation

The correlation between x and y is closely related to the slope (for both the population and the observed data)
– in particular, the correlation is 0 exactly when the slope is 0
Therefore, testing $H_0: \beta = 0$ is equivalent to testing that there is no correlation between x and y in the population from which the data were drawn
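One way to see this equivalence numerically: in simple linear regression the slope t statistic can also be written in terms of the sample correlation as $t = r\sqrt{n-2}/\sqrt{1-r^2}$. That identity is a standard result rather than something stated on the slides, and the function name below is an assumption for this sketch.

```python
import math


def t_from_correlation(r, n):
    """t statistic for H0: correlation = 0, computed from r and n alone."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


r, n = 0.455, 38  # correlation and sample size from the case study
t_stat = t_from_correlation(r, n)
print(f"t from correlation = {t_stat:.2f}")
```

Up to rounding, this reproduces the t statistic obtained from b / $SE_b$ for the crying and IQ data, so the two tests give the same P-value.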

Test for Correlation

There does exist a test for correlation that does not require a regression analysis
– Table F on page 661 of the text gives critical values and upper tail probabilities for the sample correlation r under the null hypothesis that the correlation is 0 in the population
– look up n and r in the table (if r is negative, look up its positive value), and read off the associated probability from the top margin of the table to obtain the P-value, just as is done for the t table (Table C)

Case Study: Crying and IQ

Test for $H_0$: correlation = 0
Correlation between crying and IQ is r = 0.455
Sample size is n = 38
From Table F: for $H_a$: correlation > 0, the P-value is between .001 and .0025 (using n = 40)
– P-value for two-sided test is between .002 and .005 (matches two-sided P-value for test on slope)
– one-sided P-value would be between .005 and .01 if we were very conservative and used n = 30

Inference about Prediction

Once a regression line is fit to the data, it is useful to obtain a prediction of the response for a particular value of the explanatory variable (x*); this is done by substituting the value of x* for x in the equation of the line ($\hat{y} = a + bx$) in order to calculate the predicted value $\hat{y}$
We now present confidence intervals that describe how accurate this prediction is
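As a small illustration of the substitution step, the fitted line from the case study can be wrapped in a function. The value x* = 20 crying peaks is a hypothetical input chosen for this sketch, and the function name is an assumption.

```python
A = 91.27   # intercept a from the case study
B = 1.493   # slope b from the case study


def predict_iq(crying_peaks):
    """Predicted IQ (y-hat) from the least-squares line y-hat = a + b x."""
    return A + B * crying_peaks


x_star = 20  # hypothetical number of crying peaks
y_hat = predict_iq(x_star)
print(f"predicted IQ at x* = {x_star}: {y_hat:.2f}")
```

The same number serves as the estimate of the mean response at x* and as the prediction for a single infant; only the margin of error attached to it differs.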

Inference about Prediction

There are two types of predictions
– predicting the mean response of all subjects with a certain value x* of the explanatory variable
– predicting the individual response for one subject with a certain value x* of the explanatory variable
Predicted values ($\hat{y}$) are the same for each case, but the margin of error is different

Inference about Prediction

To estimate the mean response $\mu_y$, use an ordinary confidence interval for the parameter $\mu_y = \alpha + \beta x^*$
– $\mu_y$ is the mean of responses y when x = x*
– 95% confidence interval: in repeated samples of n observations, 95% of the confidence intervals calculated (at x*) from these samples will contain the true value of $\mu_y$ at x*

Inference about Prediction

To estimate an individual response y, use a prediction interval
– estimates a single random response y rather than a parameter like $\mu_y$
– 95% prediction interval: take an observation on y for each of the n values of x in the original data, then take one more observation y at x = x*; the prediction interval from the n observations will cover the one more y in 95% of all repetitions

Inference about Prediction

Both confidence interval and prediction interval have the same form:

$\hat{y} \pm t^* SE$

– both $t^*$ values have df = n - 2
– the standard errors (SE) differ for the two intervals (formulas on next slide)
The prediction interval is wider than the confidence interval


Inference about Prediction
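For simple linear regression, the two standard errors are usually written as $SE_{\hat{\mu}} = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x - \bar{x})^2}}$ for the mean response and $SE_{\hat{y}} = s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x - \bar{x})^2}}$ for an individual response. These are the standard textbook forms, stated here as an assumption about what the slide displayed. The sketch below evaluates both; the x values, the value of s, and the function name are hypothetical.

```python
import math


def prediction_standard_errors(xs, s, x_star):
    """Return (SE for the mean response, SE for an individual response)."""
    n = len(xs)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    core = 1 / n + (x_star - x_bar) ** 2 / sxx
    se_mean = s * math.sqrt(core)
    # The extra "1 +" accounts for the variability of a single new y.
    se_individual = s * math.sqrt(1 + core)
    return se_mean, se_individual


xs = [0, 1, 2, 3]   # hypothetical x values, with mean 1.5
s = 1.0             # hypothetical regression standard error

se_mean, se_individual = prediction_standard_errors(xs, s, x_star=1.5)
print(f"SE (mean response)       = {se_mean:.4f}")
print(f"SE (individual response) = {se_individual:.4f}")
```

The extra 1 under the square root is why the prediction interval is always wider than the confidence interval, and the $(x^* - \bar{x})^2$ term makes both intervals widen as x* moves away from the mean of the observed x values.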

Checking Assumptions

Independent observations
– no repeated observations on the same individual
True relationship is linear
– look at scatterplot to check overall pattern
– plot of residuals against x magnifies any unusual pattern (should see ‘random’ scatter about zero)

Checking Assumptions

Constant standard deviation $\sigma$ of the response at all x values
– scatterplot: spread of data points about the regression line should be similar over the entire range of the data
– easier to see with a plot of residuals against x, with a horizontal line drawn at zero (should see ‘random’ scatter about zero) (or plot residuals against $\hat{y}$ for linear regression)

Checking Assumptions

Response y varies Normally about the true regression line
– residuals estimate the deviations of the response from the true regression line, so they should follow a Normal distribution
– make histogram or stemplot of the residuals and check for clear skewness or other departures from Normality
– numerous methods for carefully checking Normality exist (talk to a statistician!)

Residual Plots

x = number of beers, y = blood alcohol

Roughly linear relationship; spread is even across entire data range (‘random’ scatter about zero)

Residuals (4|1 = .041):
-2 | 731
-1 | 871
-0 | 91
 0 | 5578
 1 | 1
 2 | 39
 3 |
 4 | 1

(close to Normal)

Residual Plots

‘x’ = collection of explanatory variables, y = salary of player

Standard deviation is not constant everywhere (more variation among players with higher salaries)

Residual Plots

x = number of years, y = logarithm of salary of player

A clear curved pattern: relationship is not linear