Shonda Kuiper Grinnell College Comparing the two-sample t-test, ANOVA and regression

Shonda Kuiper Grinnell College. Statistical techniques taught in introductory statistics courses typically have one response variable and one explanatory


Shonda Kuiper

Grinnell College

Comparing the two-sample t-test, ANOVA and regression


Comparing Statistical Tests

Statistical techniques taught in introductory statistics courses typically have one response variable and one explanatory variable.

[Diagram: Explanatory Variable → Response Variable]

A response variable measures the outcome of a study.

An explanatory variable explains changes in the response variable.


Comparing Statistical Tests

Each variable can be classified as either categorical or quantitative.

                        Response Variable
Explanatory Variable    Categorical                             Quantitative
Categorical             Chi-Square test, Two proportion test    Two-sample t-test, ANOVA
Quantitative            Logistic Regression                     Regression

Categorical data place individuals into one of several groups (such as red/blue/white, male/female or yes/no).

Quantitative data consist of numerical values for which most arithmetic operations make sense.

Model for a Two-sample t-test

Statistical models have the following form:

observed value = mean response + random error

$Y_{ij} = \bar{Y}_i + \hat{\varepsilon}_{ij}$, where i = 1,2 and j = 1,2,3,4

Group         Observed  =  Group mean  +  Residual
Generic          70     =      80      +    -10
Generic          82     =      80      +      2
Generic          90     =      80      +     10
Generic          78     =      80      +     -2
Brand Name       75     =      85      +    -10
Brand Name       85     =      85      +      0
Brand Name       95     =      85      +     10
Brand Name       85     =      85      +      0

Generic Group: $\hat{\mu}_1 = \bar{Y}_1 = (70+82+90+78)/4 = 80$

Brand Name Group: $\hat{\mu}_2 = \bar{Y}_2 = (75+85+95+85)/4 = 85$
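The decomposition on this slide can be checked numerically. The following is a minimal Python sketch; the names `generic`, `brand_name` and `decompose` are illustrative choices, not part of the slides.

```python
# Battery lifetimes from the slide: four Generic and four Brand Name batteries.
generic = [70, 82, 90, 78]
brand_name = [75, 85, 95, 85]


def decompose(group):
    """Split each observation into (observed, group mean, residual)."""
    mean = sum(group) / len(group)
    return [(y, mean, y - mean) for y in group]


for label, group in [("Generic", generic), ("Brand Name", brand_name)]:
    for observed, mean, residual in decompose(group):
        # observed value = mean response + random error
        print(f"{label:10s} {observed} = {mean} + {residual}")
```

Within each group the residuals sum to zero, which is what makes the group mean the least-squares estimate of the mean response.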

Model for a Two-sample t-test

Null Hypothesis: the two groups of batteries last the same amount of time

[Figure: the sample means $\bar{Y}_1 = 80$ and $\bar{Y}_2 = 85$ estimate the population means μ1 and μ2]


Model for a Two-Sample t-test

The theoretical model used in the two-sample t-test is designed to account for these two group means (µ1 and µ2) and random error.

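Null Hypothesis: $H_0: \mu_1 = \mu_2$

Alternative Hypothesis: $H_a: \mu_1 \neq \mu_2$

observed value = mean response + random error

Theoretical model: $Y_{ij} = \mu_i + \varepsilon_{ij}$, where i = 1,2 and j = 1,2,3,4

Fitted model: $Y_{ij} = \bar{Y}_i + \hat{\varepsilon}_{ij}$, where i = 1,2 and j = 1,2,3,4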

Model for ANOVA

ANOVA: Instead of using two group means, we break the mean response into a grand mean, μ, and two group effects (α1 and α2).

$Y_{ij} = \bar{Y} + \hat{\alpha}_i + \hat{\varepsilon}_{ij}$, where i = 1,2 and j = 1,2,3,4

Observed  =  Grand mean  +  Group effect  +  Residual
   70     =     82.5     +      -2.5      +    -10
   82     =     82.5     +      -2.5      +      2
   90     =     82.5     +      -2.5      +     10
   78     =     82.5     +      -2.5      +     -2
   75     =     82.5     +       2.5      +    -10
   85     =     82.5     +       2.5      +      0
   95     =     82.5     +       2.5      +     10
   85     =     82.5     +       2.5      +      0

$\hat{\mu} = \bar{Y} = (70 + 82 + 90 + 78 + 75 + 85 + 95 + 85)/8 = 82.5$

$\hat{\alpha}_1 = \bar{Y}_1 - \bar{Y} = 80 - 82.5 = -2.5$

$\hat{\alpha}_2 = \bar{Y}_2 - \bar{Y} = 85 - 82.5 = 2.5$
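The ANOVA decomposition can be reproduced in a few lines. This is a minimal Python sketch using the battery data from the slides; the variable names are illustrative, and the group effect is taken to be the group mean minus the grand mean.

```python
# Battery lifetimes from the slides.
generic = [70, 82, 90, 78]
brand_name = [75, 85, 95, 85]
groups = [generic, brand_name]

all_values = [y for group in groups for y in group]
grand_mean = sum(all_values) / len(all_values)

# Group effect = group mean - grand mean.
effects = [sum(g) / len(g) - grand_mean for g in groups]

rows = []
for group, effect in zip(groups, effects):
    for y in group:
        residual = y - grand_mean - effect
        rows.append((y, grand_mean, effect, residual))
        # observed = grand mean + group effect + residual
        print(f"{y} = {grand_mean} + {effect} + {residual}")
```

With equal group sizes the two estimated effects are equal in size and opposite in sign, so they sum to zero.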

Model for ANOVA

[Figure: the group means $\bar{Y}_1 = 80$ and $\bar{Y}_2 = 85$ (estimating μ1 and μ2) shown relative to the grand mean $\hat{\mu} = \bar{Y} = 82.5$, with group effects $\hat{\alpha}_1 = -2.5$ and $\hat{\alpha}_2 = 2.5$]

Model for ANOVA

Null Hypothesis: $H_0: \mu_1 = \mu_2$

Alternative Hypothesis: $H_a: \mu_1 \neq \mu_2$

observed value = mean response + random error

$Y_{ij} = \mu_i + \varepsilon_{ij}$, where i = 1,2 and j = 1,2,3,4

$Y_{ij} = \{\mu + \alpha_i\} + \varepsilon_{ij}$, where i = 1,2 and j = 1,2,3,4

Model for Regression

Regression: Instead of using two group means, we create a model for a straight line (using β0 and β1).

observed value = mean response + random error

$Y_{ij} = \mu_i + \varepsilon_{ij}$, where i = 1,2 and j = 1,2,3,4

$Y_i = \{\beta_0 + \beta_1 X_i\} + \varepsilon_i$, where i = 1,2,…,8

$X_i$ is either 0 or 1. When $X_i = 0$ the mean response is $\beta_0 = \mu_1$; when $X_i = 1$ it is $\beta_0 + \beta_1 = \mu_2$.

$H_0: \mu_2 - \mu_1 = 0$, which is the same as $H_0: \beta_1 = 0$


Model for Regression

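$Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{\varepsilon}_i$, where i = 1,2,…,8

Observed  =  Intercept  +  Slope × X  +  Residual
   70     =     80      +      0      +    -10
   82     =     80      +      0      +      2
   90     =     80      +      0      +     10
   78     =     80      +      0      +     -2
   75     =     80      +      5      +    -10
   85     =     80      +      5      +      0
   95     =     80      +      5      +     10
   85     =     80      +      5      +      0

$\hat{\beta}_0 = \bar{Y}_1 = 80$

$\hat{\beta}_1 = \bar{Y}_2 - \bar{Y}_1 = 85 - 80 = 5$

Regression: Instead of using two group means, we create a model for a straight line (using β0 and β1).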
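To see why the fitted line has intercept 80 and slope 5, one can run ordinary least squares on a 0/1 indicator. The sketch below is a minimal Python illustration; coding Generic as 0 and Brand Name as 1 follows the slide's table, and the variable names are assumptions made for this example.

```python
# Battery lifetimes from the slides.
generic = [70, 82, 90, 78]
brand_name = [75, 85, 95, 85]

x = [0] * len(generic) + [1] * len(brand_name)  # 0 = Generic, 1 = Brand Name
y = generic + brand_name

n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = s_xy / s_xx         # least-squares slope
b0 = y_bar - b1 * x_bar  # least-squares intercept
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(b0, b1, residuals)
```

The intercept equals the Generic group mean and the slope equals the difference between the two group means, so the residuals match those of the two-sample t-test model.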

Model for Regression

Fitted values: $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$, where i = 1,2,…,8

Fitted  =  Intercept  +  Slope × X
  80    =     80      +      0
  80    =     80      +      0
  80    =     80      +      0
  80    =     80      +      0
  85    =     80      +      5
  85    =     80      +      5
  85    =     80      +      5
  85    =     80      +      5

The equation for the line is often written as: $\hat{y} = 80 + 5x$


Comparing the Two-sample t-test, Regression and ANOVA

When there are only two groups (and we have the same assumptions), all three models are algebraically equivalent.

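Two-sample t-test: $Y_{ij} = \mu_i + \varepsilon_{ij}$, where i = 1,2 and j = 1,2,3,4

$H_0: \mu_1 = \mu_2$

ANOVA: $Y_{ij} = \{\mu + \alpha_i\} + \varepsilon_{ij}$, where i = 1,2 and j = 1,2,3,4

Regression: $Y_i = \{\beta_0 + \beta_1 X_i\} + \varepsilon_i$, where i = 1,2,…,8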
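The equivalence of the three models can be illustrated numerically with the battery data. The following Python sketch computes the pooled two-sample t statistic, the one-way ANOVA F statistic and the t statistic for the regression slope by hand. It assumes equal variances in the two groups (the pooled t-test), and all function and variable names are illustrative.

```python
import math

# Battery lifetimes from the slides.
generic = [70, 82, 90, 78]
brand_name = [75, 85, 95, 85]


def mean(values):
    return sum(values) / len(values)


def ss(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


n1, n2 = len(generic), len(brand_name)
n = n1 + n2

# Pooled two-sample t-test (equal variances assumed).
pooled_var = (ss(generic) + ss(brand_name)) / (n - 2)
t_stat = (mean(brand_name) - mean(generic)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# One-way ANOVA: F = MS(between) / MS(within), with 2 - 1 = 1 numerator df.
grand = mean(generic + brand_name)
ss_between = n1 * (mean(generic) - grand) ** 2 + n2 * (mean(brand_name) - grand) ** 2
f_stat = (ss_between / (2 - 1)) / pooled_var

# Regression on a 0/1 indicator: t statistic for the slope.
x = [0] * n1 + [1] * n2
y = generic + brand_name
s_xx = ss(x)
b1 = sum((xi - mean(x)) * (yi - mean(y)) for xi, yi in zip(x, y)) / s_xx
b0 = mean(y) - b1 * mean(x)
sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
t_slope = b1 / math.sqrt((sse / (n - 2)) / s_xx)

print(t_stat, f_stat, t_slope)
```

For these data the two t statistics agree (about 0.857) and the F statistic is their square (about 0.735), so all three procedures lead to the same p-value for the two-sided test.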

Shonda Kuiper

Grinnell College

Introduction to Multiple Regression: Hypothesis Tests and R²

Goals of Multiple Regression

• Multiple regression analysis can be used to serve different goals. The goals will influence the type of analysis that is conducted. The most common goals of multiple regression are to:

• Describe: A model may be developed to describe the relationship between multiple explanatory variables and the response variable.

• Predict: A regression model may be used to generalize to observations outside the sample.

• Confirm: Theories are often developed about which variables or combination of variables should be included in a model. Hypothesis tests can be used to evaluate the relationship between the explanatory variables and the response.

Introduction to Multiple Regression

• Build a multiple regression model to predict retail price of cars

• Price = 35738 - 0.22 Mileage    R-Sq: 4.1%

• Slope coefficient (b1): t = -2.95 (p-value = 0.004)

Questions:

• What happens to Price as Mileage increases?

• Since b1 = -0.22 is small, can we conclude it is unimportant?

• Does mileage help you predict price? What does the p-value tell you?

• Does mileage help you predict price? What does the R-Sq value tell you?

• Are there outliers or influential observations?
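One way to think about the size of b1 is to look at the predicted change over a realistic range of mileage rather than over a single unit. The sketch below is a small Python illustration; it assumes Price is measured in dollars and Mileage in miles (the slides do not state the units), and the mileages used are hypothetical.

```python
def predicted_price(mileage):
    """Fitted line from the slide: Price = 35738 - 0.22 * Mileage."""
    return 35738 - 0.22 * mileage


# Hypothetical mileages chosen only for illustration.
for miles in (10_000, 20_000, 30_000):
    print(miles, round(predicted_price(miles), 2))

# Change in predicted price for an extra 10,000 units of Mileage.
drop_per_10k = predicted_price(20_000) - predicted_price(10_000)
print(round(drop_per_10k, 2))
```

A slope that looks small per unit of Mileage corresponds to a predicted change of about -2200 per 10,000 units, so the numeric size of a coefficient depends on the units of the explanatory variable.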

What is R²?

What happens when all the points fall on the regression line?

What happens when the regression line does not help us estimate Y?
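The two questions can be explored with the usual definition $R^2 = 1 - SSE/SST$, where SSE is the sum of squared residuals and SST is the total sum of squares around the mean. The figures from these slides did not survive, so the Python sketch below uses hypothetical data chosen only to show the two extremes; `r_squared` is an illustrative helper, not something from the slides.

```python
def r_squared(x, y):
    """R^2 = 1 - SSE/SST for a simple least-squares line."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst


# Hypothetical data, chosen only to show the two extremes.
on_line = r_squared([1, 2, 3, 4], [3, 5, 7, 9])  # every point lies on y = 1 + 2x
no_help = r_squared([1, 2, 3, 4], [5, 7, 7, 5])  # the fitted slope is 0
print(on_line, no_help)
```

When every point falls on the line, SSE is 0 and R² is 1. When the fitted line is flat at the mean of Y, SSE equals SST and R² is 0.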

Adjusted R²

$R^2_{adj} = 1 - \frac{n-1}{n-p}\,(1 - R^2)$

• R²adj includes a penalty when more terms are included in the model.

• n is the sample size and p is the number of coefficients (including the constant term: β0, β1, β2, β3, …, βp-1).

• When many terms are in the model: p is larger, so (n - 1)/(n - p) is larger, so R²adj is smaller.
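The effect of the penalty can be seen with a short calculation. This Python sketch assumes the standard definition $R^2_{adj} = 1 - \frac{n-1}{n-p}(1 - R^2)$, which matches the (n - 1)/(n - p) factor mentioned in the bullets; the values of R², n and p are hypothetical.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2, where p counts all coefficients including the intercept."""
    return 1 - (n - 1) / (n - p) * (1 - r2)


# Hypothetical values: the same R^2 = 0.50 with n = 21 observations.
for p in (2, 5, 11):
    print(p, round(adjusted_r2(0.50, 21, p), 4))
```

Holding R² and n fixed, the adjusted value falls as p grows, so an added term raises adjusted R² only if it improves R² by enough to offset the penalty.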


Shonda Kuiper

Grinnell College

Introduction to Multiple Regression: Variable Selection

Variable Selection Techniques

• Build a multiple regression model to predict retail price of cars

[Figure: Scatterplot of Price vs Mileage, R² = 2%]

Explanatory variables: Mileage, Cylinder, Liter, Leather, Cruise, Doors, Sound

Price = 6759 + 6289 Cruise + 3792 Cyl - 1543 Doors + 3349 Leather - 787 Liter - 0.17 Mileage - 1994 Sound

R² = 44.6%

Introduction to Multiple Regression

Step Forward Regression (Forward Selection):

Which single explanatory variable best predicts Price?

Price = 13921.9 + 9862.3 Cruise    R² = 18.56%

Price = -17.06 + 4054.2 Cyl    R² = 32.39%

Price = 24764.6 - 0.17 Mileage    R² = 2.04%

Price = 6185.8.6 + 4990.4 Liter    R² = 31.15%

Price = 23130.1 - 2631.4 Sound    R² = 1.55%

Price = 18828.8 + 3473.46 Leather    R² = 2.47%

Price = 27033.6 - 1613.2 Doors    R² = 1.93%

Introduction to Multiple Regression

Step Forward Regression:

Which combination of two terms best predicts Price?

Price = -17.06 + 4054.2 Cyl    R² = 32.39%

Price = -1046.4 + 3392.6 Cyl + 6000.4 Cruise    R² = 38.4% (38.2%)

Price = 3145.8 + 4027.6 Cyl - 0.152 Mileage    R² = 34% (33.8%)

Price = 1372.4 + 2976.4 Cyl + 1412.2 Liter    R² = 32.6% (32.4%)

Introduction to Multiple Regression

Step Forward Regression:

Which combination of terms best predicts Price?

Price = -17.06 + 4054.2 Cyl    R² = 32.39%

Price = -1046.4 + 3393 Cyl + 6000.4 Cruise    R² = 38.4% (38.2%)

Price = -2978.4 + 3276 Cyl + 6362 Cruise + 3139 Leather    R² = 40.4% (40.2%)

Price = 412.6 + 3233 Cyl + 6492 Cruise + 3162 Leather - 0.17 Mileage    R² = 42.3% (42%)

Price = 5530.3 + 3258 Cyl + 6320 Cruise + 2979 Leather - 0.17 Mileage - 1402 Doors    R² = 43.7% (43.3%)

Price = 7323.2 + 3200 Cyl + 6206 Cruise + 3327 Leather - 0.17 Mileage - 1463 Doors - 2024 Sound    R² = 44.6% (44.15%)

Price = 6759 + 3792 Cyl + 6289 Cruise + 3349 Leather - 787 Liter - 0.17 Mileage - 1543 Doors - 1994 Sound    R² = 44.6% (44.14%)
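The forward selection steps shown here can be sketched in code. The Python example below is a minimal illustration under stated assumptions: it ranks candidate terms by R² only (statistical software often uses F-tests, p-values or other criteria, with a stopping rule), it keeps adding terms until none remain, and it uses hypothetical toy data rather than the car data behind the slides. The helper names `solve`, `r_squared` and `forward_selection` are invented for this sketch.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def r_squared(columns, y):
    """R^2 of the least-squares fit of y on an intercept plus the given columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)  # normal equations: (X'X) beta = X'y
    fitted = [sum(coef * v for coef, v in zip(beta, row)) for row in design]
    y_bar = sum(y) / len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst


def forward_selection(data, y):
    """Add, one at a time, the variable that raises R^2 the most; return each step."""
    chosen, steps = [], []
    remaining = list(data)
    while remaining:
        best = max(remaining,
                   key=lambda name: r_squared([data[v] for v in chosen + [name]], y))
        chosen.append(best)
        remaining.remove(best)
        steps.append((tuple(chosen), r_squared([data[v] for v in chosen], y)))
    return steps


# Hypothetical toy data (NOT the car data behind the slides).
data = {
    "Cyl":     [4, 4, 6, 6, 8, 8, 4, 6],
    "Cruise":  [0, 1, 0, 1, 1, 1, 0, 0],
    "Mileage": [30, 10, 25, 15, 5, 20, 40, 35],  # thousands of miles
}
price = [14, 19, 20, 26, 33, 31, 12, 18]         # thousands of dollars

for terms, r2 in forward_selection(data, price):
    print(terms, round(r2, 3))
```

Because each model contains the previous one, R² can never decrease from one step to the next, which is why a penalized measure such as adjusted R² is more useful for deciding when to stop.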


Introduction to Multiple Regression

Step Backward Regression (Backward Elimination):

Price = 7323.2 + 3200 Cyl + 6206 Cruise + 3327 Leather - 0.17 Mileage - 1463 Doors - 2024 Sound    R² = 44.6% (44.15%)

Price = 6759 + 3792 Cyl + 6289 Cruise + 3349 Leather - 787 Liter - 0.17 Mileage - 1543 Doors - 1994 Sound    R² = 44.6% (44.14%)

Other criteria, such as the Akaike information criterion, the Bayesian information criterion and Mallows' Cp, are often used to find the best model.

Bidirectional stepwise procedures, which combine forward selection and backward elimination, are another option.


Introduction to Multiple Regression

Best Subsets Regression:

Here we see that Liter is the second best single predictor of price.


Introduction to Multiple Regression

Important Cautions:

• Stepwise regression techniques can often ignore very important explanatory variables. Best subsets is often preferable.

• Both best subsets and stepwise regression methods only consider linear relationships between the response and explanatory variables.

• Residual graphs are still essential in validating whether the model is appropriate.

• Transformations, interactions and quadratic terms can often improve the model.

• Whenever these iterative variable selection techniques are used, the p-values corresponding to the significance of each individual coefficient are not reliable.