# Introduction to Randomization Tests

Shonda Kuiper, Grinnell College


## Comparing the Two-Sample t-Test, ANOVA, and Regression

### Comparing Statistical Tests

Statistical techniques taught in introductory statistics courses typically have one response variable and one explanatory variable. The response variable measures the outcome of a study; the explanatory variable explains changes in the response variable.

Each variable can be classified as either categorical or quantitative. Categorical data place individuals into one of several groups (such as red/blue/white, male/female, or yes/no). Quantitative data consist of numerical values for which most arithmetic operations make sense. The appropriate technique depends on the types of both variables:

| Explanatory Variable | Response Variable | Technique |
|---|---|---|
| Categorical | Categorical | Chi-square test, two-proportion test |
| Categorical | Quantitative | Two-sample t-test, ANOVA |
| Quantitative | Categorical | Logistic regression |
| Quantitative | Quantitative | Regression |

### Model for a Two-Sample t-Test

Statistical models have the following form:

observed value = mean response + random error

The theoretical model used in the two-sample t-test is designed to account for the two group means (μ1 and μ2) and random error:

y_ij = μ_i + ε_ij,  where i = 1, 2 and j = 1, 2, 3, 4

With group means μ1 = 80 and μ2 = 85, the eight observations decompose as:

| Observed y_ij | Mean μ_i | Error ε_ij |
|---|---|---|
| 70 | 80 | -10 |
| 82 | 80 | 2 |
| 90 | 80 | 10 |
| 78 | 80 | -2 |
| 75 | 85 | -10 |
| 85 | 85 | 0 |
| 95 | 85 | 10 |
| 85 | 85 | 0 |

### Model for ANOVA

ANOVA writes each group mean as a grand mean (μ = 82.5) plus a group effect (α1 = -2.5, α2 = 2.5):

y_ij = μ + α_i + ε_ij,  where i = 1, 2 and j = 1, 2, 3, 4

| Observed y_ij | Grand mean μ | Effect α_i | Error ε_ij |
|---|---|---|---|
| 70 | 82.5 | -2.5 | -10 |
| 82 | 82.5 | -2.5 | 2 |
| 90 | 82.5 | -2.5 | 10 |
| 78 | 82.5 | -2.5 | -2 |
| 75 | 82.5 | 2.5 | -10 |
| 85 | 82.5 | 2.5 | 0 |
| 95 | 82.5 | 2.5 | 10 |
| 85 | 82.5 | 2.5 | 0 |

### Model for Regression

Regression uses an indicator variable x (x = 0 for group 1, x = 1 for group 2), so the mean response is an intercept (β0 = 80) plus a slope term (β1 = 5):

y_i = β0 + β1 x_i + ε_i,  where i = 1, 2, ..., 8

| Observed y_i | β0 | β1 x_i | Error ε_i |
|---|---|---|---|
| 70 | 80 | 0 | -10 |
| 82 | 80 | 0 | 2 |
| 90 | 80 | 0 | 10 |
| 78 | 80 | 0 | -2 |
| 75 | 80 | 5 | -10 |
| 85 | 80 | 5 | 0 |
| 95 | 80 | 5 | 10 |
| 85 | 80 | 5 | 0 |

### Comparing the Two-Sample t-Test, Regression, and ANOVA

When there are only two groups (and the same assumptions hold), all three models are algebraically equivalent.
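These equivalences are easy to verify numerically. The sketch below (NumPy and SciPy, using the eight observations from the tables above) checks that the pooled two-sample t statistic squared equals the one-way ANOVA F statistic, that the two p-values match, and that regressing on a 0/1 indicator recovers the group-1 mean as the intercept and the difference in group means as the slope.

```python
# Numerical check that the two-sample t-test, ANOVA, and regression on an
# indicator variable agree when there are exactly two groups.
import numpy as np
from scipy import stats

group1 = np.array([70.0, 82.0, 90.0, 78.0])   # mean 80
group2 = np.array([75.0, 85.0, 95.0, 85.0])   # mean 85

# Pooled two-sample t-test (equal variances, the textbook version)
t, p_t = stats.ttest_ind(group1, group2)

# One-way ANOVA on the same two groups
f, p_f = stats.f_oneway(group1, group2)

# Regression of y on a 0/1 indicator: slope = difference in group means
x = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y = np.concatenate([group1, group2])
slope, intercept = np.polyfit(x, y, 1)

print(t**2, f)            # t squared equals the ANOVA F statistic
print(p_t, p_f)           # identical p-values
print(intercept, slope)   # 80 and 5: group-1 mean and the mean difference
```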

## Introduction to Multiple Regression: Hypothesis Tests and R²

### Goals of Multiple Regression

Multiple regression analysis can be used to serve different goals, and the goals influence the type of analysis that is conducted. The most common goals of multiple regression are to:

- **Describe:** a model may be developed to describe the relationship between multiple explanatory variables and the response variable.
- **Predict:** a regression model may be used to generalize to observations outside the sample.
- **Confirm:** theories are often developed about which variables, or combinations of variables, should be included in a model. Hypothesis tests can be used to evaluate the relationship between the explanatory variables and the response.

### Example: Predicting the Retail Price of Cars

Build a regression model to predict the retail price of cars from mileage:

Price = 35738 - 0.22 Mileage  (R² = 4.1%)

Slope coefficient (b1): t = -2.95 (p-value = 0.004)

Questions:

- What happens to Price as Mileage increases?
- Since b1 = -0.22 is small, can we conclude it is unimportant?
- Does mileage help you predict price? What does the p-value tell you?
- Does mileage help you predict price? What does the R² value tell you?
- Are there outliers or influential observations?

### What is R²?

R² measures the proportion of the variability in Y that is explained by the regression line:

R² = 1 - SSE/SSTO

where SSE is the sum of squared errors about the regression line and SSTO is the total sum of squares about the mean of Y.

- When all the points fall on the regression line, SSE = 0, so R² = 1.
- When the regression line does not help us estimate Y (it does no better than the mean of Y), SSE = SSTO, so R² = 0.
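As a quick sketch (synthetic data, NumPy only), the function below computes R² = 1 - SSE/SSTO directly and checks the two extreme cases described above.

```python
import numpy as np

def r_squared(x, y):
    """Fit y = b0 + b1*x by least squares and return R^2 = 1 - SSE/SSTO."""
    b1, b0 = np.polyfit(x, y, 1)
    sse = np.sum((y - (b0 + b1 * x)) ** 2)   # error about the fitted line
    ssto = np.sum((y - y.mean()) ** 2)       # total variation about the mean
    return 1 - sse / ssto

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# All points fall exactly on a line: SSE = 0, so R^2 = 1.
print(r_squared(x, 3 + 2 * x))

# This y pattern gives a fitted slope of exactly 0, so the line is no
# better than the mean of y: SSE = SSTO and R^2 = 0.
print(r_squared(x, np.array([5.0, 1.0, 5.0, 1.0, 5.0])))
```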

### Adjusted R²

R²adj includes a penalty when more terms are included in the model:

R²adj = 1 - (1 - R²)(n - 1)/(n - p)

where n is the sample size and p is the number of coefficients (including the constant term: β0, β1, β2, ..., β_{p-1}).

When many terms are in the model, p is larger, so (n - 1)/(n - p) is larger and the penalty on R² grows.
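A small illustration of the penalty, under the definition above (synthetic data, NumPy only): adding a predictor that is unrelated to the response can never lower plain R², but adjusted R² may drop.

```python
import numpy as np

def fit_r2(X, y):
    """Least-squares fit of y on the columns of X (intercept added); returns (R2, p)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    sse = np.sum((y - X1 @ beta) ** 2)
    ssto = np.sum((y - y.mean()) ** 2)
    return 1 - sse / ssto, X1.shape[1]

def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p)."""
    return 1 - (1 - r2) * (n - 1) / (n - p)

rng = np.random.default_rng(0)
n = 20
x = rng.normal(size=n)
y = 2 + 3 * x + rng.normal(size=n)
junk = rng.normal(size=n)                  # predictor unrelated to y

r2_small, p_small = fit_r2(x.reshape(-1, 1), y)
r2_big, p_big = fit_r2(np.column_stack([x, junk]), y)

print(r2_big >= r2_small)                  # plain R^2 never decreases
print(adjusted_r2(r2_small, n, p_small))   # adjusted R^2 applies the penalty,
print(adjusted_r2(r2_big, n, p_big))       # so it can move the other way
```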

## Introduction to Multiple Regression: Variable Selection

### Variable Selection Techniques

Goal: build a multiple regression model to predict the retail price of cars. A model using Mileage alone explains very little of the variation in Price (R² = 2%). A model using all seven available terms does much better:

Price = 6759 + 6289 Cruise + 3792 Cyl - 1543 Doors + 3349 Leather - 787 Liter - 0.17 Mileage - 1994 Sound  (R² = 44.6%)

### Step Forward Regression (Forward Selection)

Step 1: Which single explanatory variable best predicts Price?

Price = 13921.9 + 9862.3 Cruise  (R² = 18.56%)
Price = -17.06 + 4054.2 Cyl  (R² = 32.39%)
Price = 24764.6 - 0.17 Mileage  (R² = 2.04%)
Price = 6185.8 + 4990.4 Liter  (R² = 31.15%)
Price = 23130.1 - 2631.4 Sound  (R² = 1.55%)
Price = 18828.8 + 3473.46 Leather  (R² = 2.47%)
Price = 27033.6 - 1613.2 Doors  (R² = 1.93%)

Cyl gives the largest R², so it enters the model first.

Step 2: Which combination of two terms best predicts Price? Starting from Price = -17.06 + 4054.2 Cyl (R² = 32.39%), try adding each remaining term:

Price = -1046.4 + 3392.6 Cyl + 6000.4 Cruise  (R² = 38.4%, adjusted R² = 38.2%)
Price = 3145.8 + 4027.6 Cyl - 0.152 Mileage  (R² = 34%, adjusted R² = 33.8%)
Price = 1372.4 + 2976.4 Cyl + 1412.2 Liter  (R² = 32.6%, adjusted R² = 32.4%)

Adding Cruise gives the biggest improvement, so Cruise enters second. Repeating the process, one term at a time:

Price = -2978.4 + 3276 Cyl + 6362 Cruise + 3139 Leather  (R² = 40.4%, adjusted R² = 40.2%)
Price = 412.6 + 3233 Cyl + 6492 Cruise + 3162 Leather - 0.17 Mileage  (R² = 42.3%, adjusted R² = 42%)
Price = 5530.3 + 3258 Cyl + 6320 Cruise + 2979 Leather - 0.17 Mileage - 1402 Doors  (R² = 43.7%, adjusted R² = 43.3%)
Price = 7323.2 + 3200 Cyl + 6206 Cruise + 3327 Leather - 0.17 Mileage - 1463 Doors - 2024 Sound  (R² = 44.6%, adjusted R² = 44.15%)

For comparison, the full model with all seven terms (adding Liter) barely changes the fit:

Price = 6759 + 3792 Cyl + 6289 Cruise + 3349 Leather - 787 Liter - 0.17 Mileage - 1543 Doors - 1994 Sound  (R² = 44.6%, adjusted R² = 44.14%)
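The greedy loop behind forward selection can be sketched in a few lines. This is a toy version on synthetic data (the column names just mimic the car example; none of the numbers above are reproduced): at each step, add the remaining column that gives the largest R².

```python
import numpy as np

def r2(cols, y):
    """R^2 of the least-squares fit of y on the given columns (plus an intercept)."""
    X1 = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    sse = np.sum((y - X1 @ beta) ** 2)
    return 1 - sse / np.sum((y - y.mean()) ** 2)

def forward_select(columns, y):
    """Greedy forward selection: repeatedly add the column that maximizes R^2."""
    chosen, remaining, path = [], list(columns), []
    while remaining:
        best = max(remaining,
                   key=lambda c: r2([columns[k] for k in chosen] + [columns[c]], y))
        chosen.append(best)
        remaining.remove(best)
        path.append((list(chosen), r2([columns[k] for k in chosen], y)))
    return path

rng = np.random.default_rng(1)
n = 100
columns = {"Cyl": rng.normal(size=n),
           "Mileage": rng.normal(size=n),
           "Cruise": rng.integers(0, 2, size=n).astype(float)}
y = 4000 * columns["Cyl"] + 6000 * columns["Cruise"] + rng.normal(scale=3000, size=n)

for chosen, r in forward_select(columns, y):
    print(chosen, round(r, 3))   # R^2 never decreases as terms are added
```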

### Step Backward Regression (Backward Elimination)

Backward elimination starts with all of the terms in the model and removes them one at a time, at each step dropping the term that contributes least to the fit. Here, removing Liter from the full model leaves R² essentially unchanged while adjusted R² improves slightly:

Price = 6759 + 3792 Cyl + 6289 Cruise + 3349 Leather - 787 Liter - 0.17 Mileage - 1543 Doors - 1994 Sound  (R² = 44.6%, adjusted R² = 44.14%)

Price = 7323.2 + 3200 Cyl + 6206 Cruise + 3327 Leather - 0.17 Mileage - 1463 Doors - 2024 Sound  (R² = 44.6%, adjusted R² = 44.15%)
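Backward elimination can be sketched symmetrically, here using adjusted R² as the stopping rule (synthetic data again; the names are hypothetical stand-ins for the car variables): drop the term whose removal most improves adjusted R², and stop when every removal hurts.

```python
import numpy as np

def adj_r2(cols, y):
    """Adjusted R^2 of the least-squares fit of y on the given columns."""
    X1 = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    sse = np.sum((y - X1 @ beta) ** 2)
    n, p = X1.shape
    return 1 - (sse / (n - p)) / (np.sum((y - y.mean()) ** 2) / (n - 1))

def backward_eliminate(columns, y):
    """Drop terms one at a time while doing so improves adjusted R^2."""
    kept = list(columns)
    while len(kept) > 1:
        current = adj_r2([columns[c] for c in kept], y)
        drop = max(kept, key=lambda c: adj_r2(
            [columns[k] for k in kept if k != c], y))
        if adj_r2([columns[k] for k in kept if k != drop], y) <= current:
            break                       # every removal hurts the fit: stop
        kept.remove(drop)
    return kept

rng = np.random.default_rng(3)
n = 150
columns = {"Cyl": rng.normal(size=n),
           "Liter": rng.normal(size=n),
           "Doors": rng.normal(size=n)}
y = 4000 * columns["Cyl"] + rng.normal(scale=1000, size=n)

print(backward_eliminate(columns, y))   # the informative term (Cyl) survives
```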

Other criteria, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and Mallows' Cp, are often used to find the best model.
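As a sketch of how such criteria work, the block below computes the Gaussian-error AIC up to an additive constant, AIC = n ln(SSE/n) + 2p, on synthetic data: lower is better, and the 2p term penalizes extra coefficients.

```python
import numpy as np

def aic(cols, y):
    """AIC (up to a constant) for a least-squares fit: n*ln(SSE/n) + 2p."""
    X1 = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    sse = np.sum((y - X1 @ beta) ** 2)
    n, p = X1.shape
    return n * np.log(sse / n) + 2 * p

rng = np.random.default_rng(2)
n = 200
x1, x2, junk = (rng.normal(size=n) for _ in range(3))
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)

print(aic([x1], y))            # underfit: the large SSE dominates
print(aic([x1, x2], y))        # the true model
print(aic([x1, x2, junk], y))  # extra useless term: SSE barely drops, 2p grows
```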

Bidirectional stepwise procedures combine the two approaches, allowing terms to be added or removed at each step.

### Best Subsets Regression

Best subsets regression examines every possible combination of the explanatory variables and reports the best-fitting models of each size.

The best subsets output (not reproduced here) shows that Liter is the second-best single predictor of price.

### Important Cautions

- Stepwise regression techniques can often ignore very important explanatory variables; best subsets is often preferable.
- Both best subsets and stepwise regression methods only consider linear relationships between the response and explanatory variables.
- Residual graphs are still essential in validating whether the model is appropriate.
- Transformations, interactions, and quadratic terms can often improve the model.
- Whenever these iterative variable selection techniques are used, the p-values corresponding to the significance of each individual coefficient are not reliable.