Shonda Kuiper Grinnell College. Statistical techniques taught in introductory statistics courses typically have one response variable and one explanatory


Introduction to Randomization Tests

Shonda Kuiper, Grinnell College

Comparing the two-sample t-test, ANOVA and regression

Comparing Statistical Tests

Statistical techniques taught in introductory statistics courses typically have one response variable and one explanatory variable.

Explanatory Variable -> Response Variable

The response variable measures the outcome of a study. The explanatory variable explains changes in the response variable.

Comparing Statistical Tests

Each variable can be classified as either categorical or quantitative.

                            Response: Categorical                  Response: Quantitative
Explanatory: Categorical    Chi-square test, two-proportion test   Two-sample t-test, ANOVA
Explanatory: Quantitative   Logistic regression                    Regression

Categorical data place individuals into one of several groups (such as red/blue/white, male/female or yes/no).

Quantitative data consist of numerical values for which most arithmetic operations make sense.

Model for a Two-Sample t-test

Statistical models have the following form:

observed value = mean response + random error

The theoretical model used in the two-sample t-test is designed to account for the two group means ($\mu_1$ and $\mu_2$) and random error:

$$y_{i,j} = \mu_i + \epsilon_{i,j}, \quad \text{where } i = 1, 2 \text{ and } j = 1, 2, 3, 4$$

  Observed   Mean   Error
     70    =  80  + (-10)
     82    =  80  +    2
     90    =  80  +   10
     78    =  80  + (-2)
     75    =  85  + (-10)
     85    =  85  +    0
     95    =  85  +   10
     85    =  85  +    0

Model for ANOVA

In the ANOVA model the mean response is split into a grand mean and a group effect:

$$y_{i,j} = \mu + \alpha_i + \epsilon_{i,j}, \quad \text{where } i = 1, 2 \text{ and } j = 1, 2, 3, 4$$

  Observed   Grand mean   Group effect   Error
     70    =    82.5    +    (-2.5)    + (-10)
     82    =    82.5    +    (-2.5)    +    2
     90    =    82.5    +    (-2.5)    +   10
     78    =    82.5    +    (-2.5)    + (-2)
     75    =    82.5    +      2.5     + (-10)
     85    =    82.5    +      2.5     +    0
     95    =    82.5    +      2.5     +   10
     85    =    82.5    +      2.5     +    0

Model for Regression

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad \text{where } i = 1, 2, \ldots, 8$$

  Observed   beta_0   beta_1 x_i   Error
     70    =   80   +     0      + (-10)
     82    =   80   +     0      +    2
     90    =   80   +     0      +   10
     78    =   80   +     0      + (-2)
     75    =   80   +     5      + (-10)
     85    =   80   +     5      +    0
     95    =   80   +     5      +   10
     85    =   80   +     5      +    0

Without the error term, the mean response is 80 + 0 = 80 for each of the first four observations and 80 + 5 = 85 for each of the last four.

Comparing the Two-Sample t-test, Regression and ANOVA

When there are only two groups (and we have the same assumptions), all three models are algebraically equivalent.
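The equivalence of the three models can be checked numerically on the eight observations from the slides. The following is a minimal sketch, not the author's code. It assumes a pooled-variance t-test and a 0/1 indicator for the regression explanatory variable; neither choice is stated on the slides, and all variable names are illustrative.

```python
from math import sqrt

# Observations from the slides: group 1 and group 2
group1 = [70, 82, 90, 78]
group2 = [75, 85, 95, 85]

def mean(values):
    return sum(values) / len(values)

n1, n2 = len(group1), len(group2)
m1, m2 = mean(group1), mean(group2)

# Two-sample t-test (assumption: pooled variance)
sse_groups = sum((v - m1) ** 2 for v in group1) + sum((v - m2) ** 2 for v in group2)
pooled_var = sse_groups / (n1 + n2 - 2)
t_stat = (m2 - m1) / sqrt(pooled_var * (1 / n1 + 1 / n2))

# One-way ANOVA with two groups
grand_mean = mean(group1 + group2)
ss_between = n1 * (m1 - grand_mean) ** 2 + n2 * (m2 - grand_mean) ** 2
f_stat = (ss_between / 1) / pooled_var  # 1 = number of groups - 1

# Simple regression on an indicator (assumption: x = 0 for group 1, 1 for group 2)
x = [0] * n1 + [1] * n2
y = group1 + group2
x_bar = mean(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - grand_mean) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = grand_mean - b1 * x_bar
sse_reg = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
slope_t = b1 / sqrt(sse_reg / (len(y) - 2) / sxx)

print("group means:", m1, m2)
print("regression intercept and slope:", b0, b1)
print("t (two-sample):", t_stat, " t (slope):", slope_t, " F (ANOVA):", f_stat)
```

Under these assumptions the slope equals the difference between the group means, the slope's t statistic matches the two-sample t statistic, and the ANOVA F statistic is the square of that t statistic.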

where i = 1, 2 and j = 1, 2, 3, 4 for the two-sample t-test and ANOVA models, and i = 1, 2, ..., 8 for the regression model.

Shonda Kuiper, Grinnell College

Introduction to Multiple Regression: Hypothesis Tests and R2

Goals of Multiple Regression

Multiple regression analysis can be used to serve different goals. The goals will influence the type of analysis that is conducted. The most common goals of multiple regression are to:

Describe: A model may be developed to describe the relationship between multiple explanatory variables and the response variable.

Predict: A regression model may be used to generalize to observations outside the sample.

Confirm: Theories are often developed about which variables or combination of variables should be included in a model. Hypothesis tests can be used to evaluate the relationship between the explanatory variables and the response.

Introduction to Multiple Regression

Build a multiple regression model to predict retail price of cars

Price = 35738 - 0.22 Mileage    R-Sq: 4.1%
Slope coefficient (b1): t = -2.95 (p-value = 0.004)

Questions: What happens to Price as Mileage increases?

Since b1 = -0.22 is small, can we conclude it is unimportant?

Does mileage help you predict price? What does the p-value tell you?

Does mileage help you predict price? What does the R-Sq value tell you?

Are there outliers or influential observations?
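A short calculation can help with the first two questions. This sketch simply evaluates the fitted line Price = 35738 - 0.22 Mileage from the slide. It assumes Price is in dollars and Mileage in miles, which the slides do not state, and the function name is illustrative.

```python
# Fitted line from the slide: Price = 35738 - 0.22 * Mileage
INTERCEPT = 35738.0
SLOPE = -0.22  # assumed units: dollars per mile

def predicted_price(mileage):
    """Predicted retail price at a given mileage under the fitted line."""
    return INTERCEPT + SLOPE * mileage

for miles in (0, 10_000, 50_000):
    print(miles, round(predicted_price(miles), 2))

# Change in predicted price for an additional 10,000 miles
drop_per_10k = predicted_price(10_000) - predicted_price(0)
print("change per 10,000 miles:", round(drop_per_10k, 2))
```

A slope that looks small per mile corresponds to a drop of about 2,200 in predicted price per 10,000 miles, so the size of a coefficient has to be read against the units of its variable.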

What is R2?

What happens when all the points fall on the regression line?

What happens when the regression line does not help us estimate Y?
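The formulas on these slides did not survive extraction. A common definition is R2 = 1 - SSE/SST, where SSE is the sum of squared residuals around the fitted line and SST is the sum of squared deviations of Y from its mean. The sketch below applies that definition to two small made-up data sets, one for each question. The data and function name are hypothetical and do not come from the slides.

```python
def r_squared(x, y):
    """R^2 = 1 - SSE/SST for a least-squares line fitted to (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

x = [1, 2, 3, 4]
on_the_line = [3, 5, 7, 9]  # hypothetical: every point satisfies y = 1 + 2x
no_help = [5, 7, 7, 5]      # hypothetical: the fitted slope is zero

print("all points on the line:", r_squared(x, on_the_line))
print("line does not help:", r_squared(x, no_help))
```

When every point falls on the line, each residual is zero, so SSE = 0 and R2 = 1. When the fitted line is flat at the mean of Y, SSE equals SST and R2 = 0.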

Adjusted R2

R2adj includes a penalty when more terms are included in the model:

$$R^2_{adj} = 1 - \frac{n - 1}{n - p}\,(1 - R^2)$$

n is the sample size and p is the number of coefficients (including the constant term: $\beta_0, \beta_1, \beta_2, \ldots, \beta_{p-1}$).

When many terms are in the model:

p is larger

(n - 1)/(n - p) is larger

R2adj is smaller
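The penalty can be seen by holding R2 fixed and increasing p. This is a minimal sketch using the usual form R2adj = 1 - [(n - 1)/(n - p)](1 - R2). The sample size and R2 values in the example are hypothetical, chosen only for illustration.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for sample size n and p coefficients (constant term included)."""
    return 1 - (n - 1) / (n - p) * (1 - r2)

# Hypothetical values: the same R^2 with more and more terms in the model
n = 50
r2 = 0.40
for p in (2, 4, 8):
    print(p, round(adjusted_r2(r2, n, p), 4))
```

In practice R2 itself never decreases when a term is added, so R2adj rises only if the gain in R2 outweighs the larger (n - 1)/(n - p) factor.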

Price = 35738 - 0.22 Mileage    R-Sq: 4.1%
Slope coefficient (b1): t = -2.95 (p-value = 0.004)

Shonda Kuiper, Grinnell College

Introduction to Multiple Regression: Variable Selection

Variable Selection Techniques

Build a multiple regression model to predict retail price of cars

R2 = 2%

Explanatory variables: Mileage, Cylinder, Liter, Leather, Cruise, Doors, Sound

Price = 6759 + 6289 Cruise + 3792 Cyl - 1543 Doors + 3349 Leather - 787 Liter - 0.17 Mileage - 1994 Sound    R2 = 44.6%

Introduction to Multiple Regression

Step Forward Regression (Forward Selection): Which single explanatory variable best predicts Price?

Price = 13921.9 + 9862.3 Cruise      R2 = 18.56%
Price = -17.06 + 4054.2 Cyl          R2 = 32.39%
Price = 24764.6 - 0.17 Mileage       R2 = 2.04%
Price = 6185.8.6 + 4990.4 Liter      R2 = 31.15%
Price = 23130.1 - 2631.4 Sound       R2 = 1.55%
Price = 18828.8 + 3473.46 Leather    R2 = 2.47%
Price = 27033.6 - 1613.2 Doors       R2 = 1.93%

Step Forward Regression: Which combination of two terms best predicts Price?

Price = -17.06 + 4054.2 Cyl          R2 = 32.39%

Price = -1046.4 + 3392.6 Cyl + 6000.4 Cruise    R2 = 38.4% (38.2%)
Price = 3145.8 + 4027.6 Cyl - 0.152 Mileage     R2 = 34% (33.8%)
Price = 1372.4 + 2976.4 Cyl + 1412.2 Liter      R2 = 32.6% (32.4%)

The values in parentheses appear to be R2adj.

Step Forward Regression: Which combination of terms best predicts Price?

Price = -17.06 + 4054.2 Cyl    R2 = 32.39%
Price = -1046.4 + 3393 Cyl + 6000.4 Cruise    R2 = 38.4% (38.2%)
Price = -2978.4 + 3276 Cyl + 6362 Cruise + 3139 Leather    R2 = 40.4% (40.2%)
Price = 412.6 + 3233 Cyl + 6492 Cruise + 3162 Leather - 0.17 Mileage    R2 = 42.3% (42%)
Price = 5530.3 + 3258 Cyl + 6320 Cruise + 2979 Leather - 0.17 Mileage - 1402 Doors    R2 = 43.7% (43.3%)
Price = 7323.2 + 3200 Cyl + 6206 Cruise + 3327 Leather - 0.17 Mileage - 1463 Doors - 2024 Sound    R2 = 44.6% (44.15%)
Price = 6759 + 3792 Cyl + 6289 Cruise + 3349 Leather - 787 Liter - 0.17 Mileage - 1543 Doors - 1994 Sound    R2 = 44.6% (44.14%)
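The sequence above follows a simple rule: at each step, add the term that raises R2 the most. The sketch below is one way to write that rule in plain Python and is not the software used for the slides. The car data set is not reproduced in this text, so the example runs on a small made-up data set; the values in `toy` and `price`, and the names `fit_r_squared` and `forward_selection`, are all hypothetical.

```python
def fit_r_squared(columns, y):
    """R^2 of a least-squares fit of y on a constant term plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations: (X'X | X'y)
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    coef = [A[i][k] / A[i][i] for i in range(k)]
    fitted = [sum(b * v for b, v in zip(coef, row)) for row in X]
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

def forward_selection(data, y):
    """Add one variable at a time, always the one that gives the largest R^2."""
    chosen, path = [], []
    remaining = list(data)
    while remaining:
        best = max(remaining,
                   key=lambda name: fit_r_squared([data[v] for v in chosen + [name]], y))
        chosen.append(best)
        remaining.remove(best)
        path.append((list(chosen), fit_r_squared([data[v] for v in chosen], y)))
    return path

# Hypothetical toy data (NOT the car data from the slides)
toy = {
    "Cyl": [4, 4, 6, 6, 8, 8, 4, 6],
    "Cruise": [0, 1, 0, 1, 1, 1, 0, 0],
    "Mileage": [30, 12, 25, 8, 15, 20, 40, 18],
}
price = [14, 19, 21, 27, 33, 31, 12, 22]

path = forward_selection(toy, price)
for terms, r2 in path:
    print(terms, round(r2, 4))
```

This sketch keeps adding terms until none remain, as the slide sequence does. A practical procedure would add a stopping rule, for example stopping when R2adj no longer improves.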

Introduction to Multiple Regression

Step Backward Regression (Backward Elimination):

Price = 7323.2 + 3200 Cyl + 6206 Cruise + 3327 Leather - 0.17 Mileage - 1463 Doors - 2024 Sound    R2 = 44.6% (44.15%)
Price = 6759 + 3792 Cyl + 6289 Cruise + 3349 Leather - 787 Liter - 0.17 Mileage - 1543 Doors - 1994 Sound    R2 = 44.6% (44.14%)

Other techniques, such as the Akaike information criterion, the Bayesian information criterion, and Mallows' Cp, are often used to find the best model.

Bidirectional stepwise procedures

Introduction to Multiple Regression

Best Subsets Regression:

Here we see that Liter is the second-best single predictor of price.

Introduction to Multiple Regression

Important Cautions:

Stepwise regression techniques can often ignore very important explanatory variables. Best subsets is often preferable.

Both best subsets and stepwise regression methods only consider linear relationships between the response and explanatory variables. Residual graphs are still essential in validating whether the model is appropriate.

Transformations, interactions and quadratic terms can often improve the model.

Whenever these iterative variable selection techniques are used, the p-values corresponding to the significance of each individual coefficient are not reliable.
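Best subsets regression, mentioned above, fits every combination of explanatory variables of each size and reports the best one, so a useful variable cannot be skipped because of the order in which terms were added. The following is a minimal sketch on a made-up data set, since the car data are not reproduced here. The values in `toy` and `price` and the function names are hypothetical, and ranking subsets by R2 within each size is an assumption about how the comparison is made.

```python
from itertools import combinations

def fit_r_squared(columns, y):
    """R^2 of a least-squares fit of y on a constant term plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations: (X'X | X'y)
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    coef = [A[i][k] / A[i][i] for i in range(k)]
    fitted = [sum(b * v for b, v in zip(coef, row)) for row in X]
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

def best_subsets(data, y):
    """For each model size, return the subset of variables with the largest R^2."""
    best = {}
    for size in range(1, len(data) + 1):
        scored = [(fit_r_squared([data[v] for v in subset], y), subset)
                  for subset in combinations(data, size)]
        best[size] = max(scored)
    return best

# Hypothetical toy data (NOT the car data from the slides)
toy = {
    "Cyl": [4, 4, 6, 6, 8, 8, 4, 6],
    "Cruise": [0, 1, 0, 1, 1, 1, 0, 0],
    "Mileage": [30, 12, 25, 8, 15, 20, 40, 18],
}
price = [14, 19, 21, 27, 33, 31, 12, 22]

best = best_subsets(toy, price)
for size, (r2, subset) in best.items():
    print(size, subset, round(r2, 4))
```

Because R2 never decreases as terms are added, comparisons across model sizes usually rely on a penalized measure such as R2adj or one of the information criteria. The cautions about p-values after selection apply to this approach as well.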