58
Inference for Regression STAT E-150 Statistical Methods

Inference for Regression STAT E-150 Statistical Methods

Embed Size (px)

Citation preview

Page 1: Inference for Regression STAT E-150 Statistical Methods

Inference for Regression

STAT E-150Statistical Methods

Page 2: Inference for Regression STAT E-150 Statistical Methods

2

We have discussed how to find a simple linear regression model to make predictions. Now we will investigate our model further: 

- How do we evaluate the effectiveness of the model?- How do we know that the relationship is significant?- How much of the variability in the response variable can be

explained by its relationship to the predictor?

The simple linear model is of the form y = β1x + β0 + ε where ε ~ N(0, σε).

Recall that inference methods can address questions about the population based on our sample data.

Page 3: Inference for Regression STAT E-150 Statistical Methods

3

Consider our earlier example:

Medical researchers have noted that adolescent females are more likely to deliver low-birthweight babies than are adult females. Because LBW babies tend to have higher mortality rates, studies have been conducted to examine the relationship between birthweight and the mother’s age.

We found the fitted regression model

Observation 1 2 3 4 5 6 7 8 9 10

Maternal Age 15 17 18 15 16 19 17 16 18 19

Birthweight (g) 2289 3393 3271 2648 2897 3327 2970 2535 3138 3573

245.15 age 1163.4ig 5we ht

Page 4: Inference for Regression STAT E-150 Statistical Methods

4

How can we assess this model?

If the slope β1 is equal to zero, then there is no change in the response variable associated with a change in the predictor. We will therefore test the value of the slope to investigate whether there is a linear relationship between the two variables

Page 5: Inference for Regression STAT E-150 Statistical Methods

5

 T-test for the Slope of a Simple Linear Model H0: β1=0Ha: β1≠0 The test statistic is

with n-2 degrees of freedom. 

Page 6: Inference for Regression STAT E-150 Statistical Methods

6

Assumptions for the model and the errors:  

1. Linearity Assumption Straight Enough Condition: does the scatterplot appear linear?

Check the residuals to see if they appear to be randomly scattered

Quantitative Data Condition: Is the data quantitative?

Page 7: Inference for Regression STAT E-150 Statistical Methods

7

Assumptions for the model and the errors:  

2. Independence Assumption: the errors must be mutually independent

Randomization Condition: the individuals are a random sampleCheck the residuals for patterns, trends, clumping

Page 8: Inference for Regression STAT E-150 Statistical Methods

8

Assumptions for the model and the errors:  

3. Equal Variance Assumption: the variability of y should be about the same for all values of x

Does The Plot Thicken? Condition: Is the spread about the line nearly constant in the scatterplot?

Check the residuals for any patterns

Page 9: Inference for Regression STAT E-150 Statistical Methods

9

Assumptions for the model and the errors:  

4. Normal Population Assumption: the errors follow a Normal model at each value of x

Nearly Normal Condition:Look at a histogram or NPP of the residuals

Page 10: Inference for Regression STAT E-150 Statistical Methods

10

We can check the conditions to see if all assumptions are true. If this is the case, the idealized regression line will have a distribution of y-values for each x-value, and these distributions will be approximately normal with equal variation and with means along the regression line:

Page 11: Inference for Regression STAT E-150 Statistical Methods

11

Here are the steps:1. Create a scatterplot to see if the data is “straight enough”.2. Fit a regression model and find the residuals (ε) and predicted

values ( ).3. Draw a scatterplot of the residuals vs. x or ; this should have no

pattern or bend, or thickening, or thinning, or outliers.4. If the scatterplot is “straight enough”, create a histogram and NPP of

the residuals to check the “nearly normal” condition.5. Continue with the inference if all conditions are reasonably satisfied.

yy

Page 12: Inference for Regression STAT E-150 Statistical Methods

12

Here are the results for our data:

- The scatterplot of the data indicates a positive linear relationship between the variables 

Page 13: Inference for Regression STAT E-150 Statistical Methods

13

- The scatterplot of the residuals shows no particular pattern 

Page 14: Inference for Regression STAT E-150 Statistical Methods

14

- The Normal Probability Plot for the residuals indicates a Normal distribution 

Page 15: Inference for Regression STAT E-150 Statistical Methods

15

Here is our data:

H0: 

Ha:

Observation 1 2 3 4 5 6 7 8 9 10

Maternal Age (in years) 15 17 18 15 16 19 17 16 18 19

Birthweight (in grams) 2289 3393 3271 2648 2897 3327 2970 2535 3138 3573

Page 16: Inference for Regression STAT E-150 Statistical Methods

16

Here is our data:

H0: β1 = 0 

Ha: β1 ≠ 0

Observation 1 2 3 4 5 6 7 8 9 10

Maternal Age (in years) 15 17 18 15 16 19 17 16 18 19

Birthweight (in grams) 2289 3393 3271 2648 2897 3327 2970 2535 3138 3573

Page 17: Inference for Regression STAT E-150 Statistical Methods

17

The SPSS output included this table:

What does this value represent?

 

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

ˆ0β =

Page 18: Inference for Regression STAT E-150 Statistical Methods

18

The SPSS output included this table:

-1163.45

What does this value represent?

 

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

ˆ0β =

Page 19: Inference for Regression STAT E-150 Statistical Methods

19

The SPSS output included this table:

-1163.45

What does this value represent?This is the observed (sample) value of the y-intercept

 

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

ˆ0β =

Page 20: Inference for Regression STAT E-150 Statistical Methods

20

The SPSS output included this table:

What does this value represent?

 

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

ˆ1β =

Page 21: Inference for Regression STAT E-150 Statistical Methods

21

The SPSS output included this table:

245.15

What does this value represent?

 

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

ˆ1β =

Page 22: Inference for Regression STAT E-150 Statistical Methods

22

The SPSS output included this table:

245.15

What does this value represent? This is the observed (sample) value of the slope

 

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

ˆ1β =

Page 23: Inference for Regression STAT E-150 Statistical Methods

23

The SPSS output included this table:

What does this value represent?

 

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

ˆ1SE(β ) =

Page 24: Inference for Regression STAT E-150 Statistical Methods

24

The SPSS output included this table:

45.908

What does this value represent?

 

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

ˆ1SE(β ) =

Page 25: Inference for Regression STAT E-150 Statistical Methods

25

The SPSS output included this table:

45.908

What does this value represent?This is the standard error for the slope; this is how much we expect the sample slope to vary from one sample to

another. 

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

ˆ1SE(β ) =

Page 26: Inference for Regression STAT E-150 Statistical Methods

26

What is the value of the test statistic?

 

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

ˆ

ˆ1

1

βt = =

SE(β )

Page 27: Inference for Regression STAT E-150 Statistical Methods

27

5.34

What is the value of the test statistic?

 

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

ˆ

ˆ1

1

βt = =

SE(β )

Page 28: Inference for Regression STAT E-150 Statistical Methods

28

5.34

What is the value of the test statistic? 5.34

 

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

ˆ

ˆ1

1

βt = =

SE(β )

Page 29: Inference for Regression STAT E-150 Statistical Methods

29

What is the p-value? 

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

Page 30: Inference for Regression STAT E-150 Statistical Methods

30

What is the p-value? p = .001 

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

Page 31: Inference for Regression STAT E-150 Statistical Methods

31

What is the p-value? p = .001 

What is your decision?

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

Page 32: Inference for Regression STAT E-150 Statistical Methods

32

What is the p-value? p = .001 

What is your decision? Since p is small, reject the null hypothesis

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

Page 33: Inference for Regression STAT E-150 Statistical Methods

33

What is the p-value? p = .001 

What is your decision? Since p is small, reject the null hypothesis

What can you conclude?

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

Page 34: Inference for Regression STAT E-150 Statistical Methods

34

What is the p-value? p = .001 

What is your decision? Since p is small, reject the null hypothesis

What can you conclude?The data indicates that there is a linear relationship between the mother’s age and the baby’s birthweight.

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

Page 35: Inference for Regression STAT E-150 Statistical Methods

35

Confidence Interval for the Slope The slope of the population regression line, , is the rate of change of the mean response as the explanatory variable increases.

The slope of the least squares line, , is an estimate of . A confidence interval for the slope will show how accurate the estimate is.

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

Page 36: Inference for Regression STAT E-150 Statistical Methods

36

What is the confidence interval for the slope of the regression line?

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

Page 37: Inference for Regression STAT E-150 Statistical Methods

37

What is the confidence interval for the slope of the regression line?(139.285, 351.015)

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

Page 38: Inference for Regression STAT E-150 Statistical Methods

38

Confidence Interval for the Slope of a Simple Linear Model The confidence interval has the form  where t* is the critical value for the tn-2 density curve to obtain the desired confidence level.

How can we construct this?

Page 39: Inference for Regression STAT E-150 Statistical Methods

39

Confidence Interval for the Slope of a Simple Linear Model The confidence interval has the form  We know that = 245.15 and that = 45.908, but how do we find t*?

ˆ1SE(β )

Page 40: Inference for Regression STAT E-150 Statistical Methods

40

Confidence Interval for the Slope of a Simple Linear Model The confidence interval has the form  We know that df = n - 2 = 8. On the line for df=8, the value of t for a 95% confidence interval is 2.306.

Page 41: Inference for Regression STAT E-150 Statistical Methods

41

Calculate

= 245.15 ± 2.306(45.908) 

= 245.15 ± 105.86 = (139.29, 351.01)  

Page 42: Inference for Regression STAT E-150 Statistical Methods

42

Calculate

= 245.15 ± 2.306(45.908) 

= 245.15 ± 105.86 = (139.29, 351.01)  What is the confidence interval found by SPSS? 

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

Page 43: Inference for Regression STAT E-150 Statistical Methods

43

Calculate

= 245.15 ± 2.306(45.908) 

= 245.15 ± 105.86 = (139.29, 351.01)  What is the confidence interval found by SPSS? 

Coefficientsa

Model Unstandardized Coefficients Standardized Coefficients t Sig. 95.0% Confidence Interval for B

B Std. Error Beta Lower Bound Upper Bound

1 (Constant) -1163.450 783.138 -1.486 .176 -2969.369 642.469

Age 245.150 45.908 .884 5.340 .001 139.285 351.015

a. Dependent Variable: Birthweight

Page 44: Inference for Regression STAT E-150 Statistical Methods

44

Calculate

= 245.15 ± 2.306(45.908) 

= 245.15 ± 105.86 = (139.29, 351.01)  What does this interval tell us? 

Page 45: Inference for Regression STAT E-150 Statistical Methods

45

Calculate

= 245.15 ± 2.306(45.908) 

= 245.15 ± 105.86 = (139.29, 351.01)  What does this interval tell us?Based on the sample data, we are 95% confident that the true average increase in the weight of the baby associated with a one-year increase in age of the mother is between 139.29 and 351.01 g.

Page 46: Inference for Regression STAT E-150 Statistical Methods

46

Partitioning Variability - ANOVA ANOVA measures the effectiveness of the model by measuring how much of the variability in the response variable y is explained by the predictions based on the fitted model.

We can partition this variability into two parts: the variability explained by the model, and the unexplained variability due to error, as measured by the residuals.

 

Page 47: Inference for Regression STAT E-150 Statistical Methods

47

In our SPSS output,  

SS(Model) =  

SS(Error) =  

SS(Total) =

 

ANOVAa

Model Sum of Squares df Mean Square F Sig.

1

Regression 1201970.450 1 1201970.450 28.515 .001b

Residual 337212.450 8 42151.556

Total 1539182.900 9 a. Dependent Variable: Birthweight b. Predictors: (Constant), Age

Page 48: Inference for Regression STAT E-150 Statistical Methods

48

In our SPSS output,  

SS(Model) = 1201970.45  

SS(Error) = 337212.45 

SS(Total) = 1539182.9

 

ANOVAa

Model Sum of Squares df Mean Square F Sig.

1

Regression 1201970.450 1 1201970.450 28.515 .001b

Residual 337212.450 8 42151.556

Total 1539182.900 9 a. Dependent Variable: Birthweight b. Predictors: (Constant), Age

Page 49: Inference for Regression STAT E-150 Statistical Methods

49

What is the value of the test statistic? What is the p-value?

Decision: 

Conclusion:

 

ANOVAa

Model Sum of Squares df Mean Square F Sig.

1

Regression 1201970.450 1 1201970.450 28.515 .001b

Residual 337212.450 8 42151.556

Total 1539182.900 9 a. Dependent Variable: Birthweight b. Predictors: (Constant), Age

Page 50: Inference for Regression STAT E-150 Statistical Methods

50

What is the value of the test statistic? 28.515 What is the p-value?

Decision: 

Conclusion:

 

ANOVAa

Model Sum of Squares df Mean Square F Sig.

1

Regression 1201970.450 1 1201970.450 28.515 .001b

Residual 337212.450 8 42151.556

Total 1539182.900 9 a. Dependent Variable: Birthweight b. Predictors: (Constant), Age

Page 51: Inference for Regression STAT E-150 Statistical Methods

51

What is the value of the test statistic? 28.515 What is the p-value? .001

Decision: 

Conclusion:

 

ANOVAa

Model Sum of Squares df Mean Square F Sig.

1

Regression 1201970.450 1 1201970.450 28.515 .001b

Residual 337212.450 8 42151.556

Total 1539182.900 9 a. Dependent Variable: Birthweight b. Predictors: (Constant), Age

Page 52: Inference for Regression STAT E-150 Statistical Methods

52

What is the value of the test statistic? 28.515 What is the p-value? .001

Decision: Since p is small, reject the null hypothesis 

Conclusion: The data indicates that there is a linear relationship between the mother’s age and the baby’s birthweight.

 

ANOVAa

Model Sum of Squares df Mean Square F Sig.

1

Regression 1201970.450 1 1201970.450 28.515 .001b

Residual 337212.450 8 42151.556

Total 1539182.900 9 a. Dependent Variable: Birthweight b. Predictors: (Constant), Age

Page 53: Inference for Regression STAT E-150 Statistical Methods

53

The Coefficient of Determination This represents the amount of variation in the response variable can be explained by the model:

 

In our example,  

This tells us that 75.44% of the variation in the birthweights can be explained by the model.

Page 54: Inference for Regression STAT E-150 Statistical Methods

54

Inference for Correlation There is a relationship between the correlation between the variables and the slope of the least squares line:

 

So testing the value of the slope is the same as testing that there is no correlation between the variables; that is, that the correlation coefficient = 0.

Page 55: Inference for Regression STAT E-150 Statistical Methods

55

T-Test for Correlation 

H0: ρ = 0Ha: ρ ≠ 0

 

The test statistic is  

If the conditions of the simple linear model hold, we find the p-value using the t-distribution with n-2 degrees of freedom.

Page 56: Inference for Regression STAT E-150 Statistical Methods

56

The three tests we have discussed are equivalent; it can be shown that the test statistic F is the square of the test statistic t:

In our t-test, we found the test statistic t = 5.34

In the ANOVA, we found the test statistic F = 28.515

t2 = (5.34)2 = 28.5156 = F

Also, for both tests we found p = .001

Page 57: Inference for Regression STAT E-150 Statistical Methods

57

Standard Errors for Predicted Values

Let x be a specific value of x. The predicted value of y is

We can create two different intervals:

a prediction interval for an individual value of x

a confidence interval for the mean predicted value at x

υ = “nu”

Page 58: Inference for Regression STAT E-150 Statistical Methods

58

The basic format for an interval is

When we want to find a mean predicted value,

When we want to find an individual predicted value,

Since individual values vary more than means,

and the prediction interval will be wider than the confidence interval.

*ν n 2y t SE

22 2 e

ν 1 ν

sˆSE(μ ) SE (b ) (x x)

n

22 2 2e

ν 1 ν e

sˆSE(y ) SE (b ) (x x) s

n

ˆ ˆSE(μ ) SE(y )ν ν