Stat 112 Notes 20

• Today:
– Interaction Variables (Chapter 7.1-7.2)
– Interpreting slope when Y is logged but not X
– Model Building (Chapter 8)


Interaction

• Interaction is a three-variable concept. One of these is the response variable (Y) and the other two are explanatory variables (X1 and X2).

• There is an interaction between X1 and X2 if the impact of an increase in X2 on Y depends on the level of X1.


Interaction Example

• What would you prefer, a leader with good intentions or a leader with evil intentions?

• What would you prefer, a smart leader or a dumb leader?

• Is a dumb leader with evil intentions the worst case or is there an interaction between a leader’s intelligence and intentions?


Toy Factory Manager Data

[Figure: Bivariate Fit of Time for Run By Run Size. Scatterplot of Time for Run (y-axis, 150 to 300) against Run Size (x-axis, 50 to 350). Squares = Manager A, + = Manager B, x = Manager C.]


Model without Interaction

Response: Time for Run

Expanded Estimates (nominal factors expanded to all levels)

Term         Estimate    Std Error   t Ratio   Prob>|t|
Intercept    176.70882   5.658644    31.23     <.0001
Run Size     0.243369    0.025076    9.71      <.0001
Manager[a]   38.409663   3.005923    12.78     <.0001
Manager[b]   -14.65115   3.031379    -4.83     <.0001
Manager[c]   -23.75851   2.995898    -7.93     <.0001

This model assumes that the effect of increasing run size is the same for each of the three managers.

[Figure: Whole Model Regression Plot of Time for Run (150 to 300) against Run Size (50 to 350), with one fitted line for each of managers a, b and c.]

The lines in the plot show the mean time for run given run size for each of the three managers. The lines are parallel because the model assumes no interaction.
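To see why the fitted lines are parallel, the estimates from the no-interaction model can be plugged in directly. The sketch below is only arithmetic on the table values; the function name `mean_time` and the run sizes 100 and 300 are illustrative choices, not part of the notes.

```python
# Expanded estimates copied from the no-interaction model table.
INTERCEPT = 176.70882
RUN_SIZE_SLOPE = 0.243369
MANAGER_EFFECT = {"a": 38.409663, "b": -14.65115, "c": -23.75851}


def mean_time(run_size, manager):
    """Estimated mean time for run under the model without interaction."""
    return INTERCEPT + MANAGER_EFFECT[manager] + RUN_SIZE_SLOPE * run_size


# Each manager gets a different intercept but shares the same slope.
intercepts = {m: INTERCEPT + effect for m, effect in MANAGER_EFFECT.items()}

# Gap between managers a and c at two illustrative run sizes.
gap_small = mean_time(100, "a") - mean_time(100, "c")
gap_large = mean_time(300, "a") - mean_time(300, "c")

for m, b0 in intercepts.items():
    print(f"Manager {m}: intercept {b0:.3f}, slope {RUN_SIZE_SLOPE}")
print(f"a minus c at run size 100: {gap_small:.3f}")
print(f"a minus c at run size 300: {gap_large:.3f}")
```

The gap between two managers is the same at every run size, which is exactly what "no interaction" means in this model.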


Interaction Model involving Categorical Variables in JMP

• To add interactions involving categorical variables in JMP, follow the same procedure as with two continuous variables. Run Fit Model in JMP, add the usual explanatory variables first, then highlight one of the variables in the interaction in the Construct Model Effects box and highlight the other variable in the interaction in the Columns box and then click Cross in the Construct Model Effects box.


Interaction Model

Response: Time for Run

Expanded Estimates (nominal factors expanded to all levels)

Term                                Estimate    Std Error   t Ratio   Prob>|t|
Intercept                           179.59191   5.619643    31.96     <.0001
Run Size                            0.2344284   0.024708    9.49      <.0001
Manager[Alice]                      38.188168   2.900342    13.17     <.0001
Manager[Bob]                        -13.5381    2.936288    -4.61     <.0001
Manager[Carol]                      -24.65007   2.887839    -8.54     <.0001
Manager[Alice]*(Run Size-209.317)   0.0728366   0.035263    2.07      0.0437
Manager[Bob]*(Run Size-209.317)     -0.097651   0.037178    -2.63     0.0112
Manager[Carol]*(Run Size-209.317)   0.0248147   0.032207    0.77      0.4444

\[
\begin{aligned}
\hat{E}(\text{time for run} \mid \text{run size} = x, \text{Manager} = \text{Alice})
&= 179.59 + 0.234x + 38.188 \cdot 1 - 13.538 \cdot 0 - 24.651 \cdot 0 \\
&\quad + 0.073 \cdot 1 \cdot (x - 209.317) - 0.098 \cdot 0 \cdot (x - 209.317) + 0.025 \cdot 0 \cdot (x - 209.317) \\
&= 179.59 + 0.234x + 38.188 + 0.073(x - 209.317)
\end{aligned}
\]

\[
\begin{aligned}
\hat{E}(\text{time for run} \mid \text{run size} = x, \text{Manager} = \text{Alice}) &= (179.59 + 38.188 - 0.073 \cdot 209.317) + (0.234 + 0.073)x \\
\hat{E}(\text{time for run} \mid \text{run size} = x, \text{Manager} = \text{Bob}) &= (179.59 - 13.538 + 0.098 \cdot 209.317) + (0.234 - 0.098)x \\
\hat{E}(\text{time for run} \mid \text{run size} = x, \text{Manager} = \text{Carol}) &= (179.59 - 24.651 - 0.025 \cdot 209.317) + (0.234 + 0.025)x
\end{aligned}
\]


• The runs supervised by Alice appear abnormally time consuming. Bob has higher initial fixed setup costs than Carol (186.565>149.706) but has lower per unit production time (0.136<0.259).

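\[
\begin{aligned}
\hat{E}(\text{time for run} \mid \text{run size} = x, \text{Manager} = \text{Alice}) &= 202.498 + 0.307x \\
\hat{E}(\text{time for run} \mid \text{run size} = x, \text{Manager} = \text{Bob}) &= 186.565 + 0.136x \\
\hat{E}(\text{time for run} \mid \text{run size} = x, \text{Manager} = \text{Carol}) &= 149.706 + 0.259x
\end{aligned}
\]

[Figure: Regression Plot of Time for Run (150 to 300) against Run Size (50 to 350), with separate fitted lines for Alice, Bob and Carol.]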
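The per-manager intercepts and slopes come from collecting terms in the interaction model. A minimal sketch of that arithmetic, using the rounded coefficients from the worked equations (the helper name `manager_line` is an illustrative choice, not something from JMP):

```python
# Rounded coefficients as used in the worked equations of the notes.
INTERCEPT = 179.59
RUN_SIZE_SLOPE = 0.234
RUN_SIZE_CENTER = 209.317  # centering constant shown in the interaction terms
MANAGER_EFFECT = {"Alice": 38.188, "Bob": -13.538, "Carol": -24.651}
INTERACTION = {"Alice": 0.073, "Bob": -0.098, "Carol": 0.025}


def manager_line(manager):
    """Return (intercept, slope) of the fitted line for one manager."""
    slope = RUN_SIZE_SLOPE + INTERACTION[manager]
    # Expanding coef * (x - center) moves -coef * center into the intercept.
    intercept = (INTERCEPT + MANAGER_EFFECT[manager]
                 - INTERACTION[manager] * RUN_SIZE_CENTER)
    return intercept, slope


lines = {m: manager_line(m) for m in MANAGER_EFFECT}
for m, (b0, b1) in lines.items():
    print(f"{m}: {b0:.3f} + {b1:.3f} * run size")
```

Because each manager now has its own slope, the fitted lines are no longer parallel.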


Interaction Model

• Interaction between run size and Manager: The effect on mean run time of increasing run size by one is different for different managers.

• Is there strong evidence that there really is an interaction for the population?

• Effect Test for Interaction:

• The Manager*Run Size effect test tests the null hypothesis that there is no interaction (the effect on mean run time of increasing run size is the same for all managers) vs. the alternative hypothesis that there is an interaction between run size and manager. p-value = 0.0333. Evidence that there is an interaction.

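\[
\begin{aligned}
\hat{E}(\text{time for run} \mid \text{run size} = x+1, \text{Manager} = \text{Alice}) - \hat{E}(\text{time for run} \mid \text{run size} = x, \text{Manager} = \text{Alice}) &= 0.234 + 0.073 = 0.307 \\
\hat{E}(\text{time for run} \mid \text{run size} = x+1, \text{Manager} = \text{Bob}) - \hat{E}(\text{time for run} \mid \text{run size} = x, \text{Manager} = \text{Bob}) &= 0.234 - 0.098 = 0.136 \\
\hat{E}(\text{time for run} \mid \text{run size} = x+1, \text{Manager} = \text{Carol}) - \hat{E}(\text{time for run} \mid \text{run size} = x, \text{Manager} = \text{Carol}) &= 0.234 + 0.025 = 0.259
\end{aligned}
\]

Effect Tests

Source             Nparm   DF   Sum of Squares   F Ratio   Prob > F
Run Size           1       1    22070.614        90.0192   <.0001
Manager            2       2    43981.452        89.6934   <.0001
Manager*Run Size   2       2    1778.661         3.6273    0.0333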


Interaction Profile Plot

Lower left hand plot shows mean time for run vs. run size for the three managers Alice, Bob and Carol. Upper right hand plot shows mean run time for three managers for a low run size of 58 and a high run size of 345.

[Figure: Interaction Profiles. Upper right panel: Time for Run (150 to 300) by Manager (Alice, Bob, Carol), with one profile for Run Size 58 and one for Run Size 345. Lower left panel: Time for Run (150 to 300) against Run Size (100 to 400), with one line each for Alice, Bob and Carol.]


Interactions Involving Categorical Variables: General Approach

• First fit model with an interaction between categorical explanatory variable and continuous explanatory variable. Use effect test on interaction to see if there is evidence of an interaction.

• If there is evidence of an interaction (p-value <0.05 for effect test), use interaction model.

• If there is not strong evidence of an interaction (p-value >0.05 for effect test), use model without interactions. The model without interactions is easier to interpret but should only be used if there is not strong evidence for interactions.
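The general approach amounts to a simple decision rule on the effect test's p-value. A minimal sketch, assuming the 0.05 cutoff from the bullets above; the function name `choose_model` is hypothetical, and treating a p-value of exactly 0.05 as "no strong evidence" is an assumption the notes do not address.

```python
ALPHA = 0.05  # cutoff used in the notes for the interaction effect test


def choose_model(interaction_p_value):
    """Pick a model from the p-value of the interaction effect test."""
    if interaction_p_value < ALPHA:
        return "interaction model"
    # Assumption: p-values at or above the cutoff count as no strong evidence.
    return "model without interactions"


# Toy factory data: Manager*Run Size effect test has p-value 0.0333.
toy_factory_choice = choose_model(0.0333)
# A hypothetical effect test with a large p-value, for contrast.
hypothetical_choice = choose_model(0.25)

print(toy_factory_choice)
print(hypothetical_choice)
```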


Example: A Sex Discrimination Lawsuit

• Did a bank discriminatorily pay higher starting salaries to men than to women? Harris Trust and Savings Bank was sued by a group of female employees who accused the bank of paying lower starting salaries to women. The data in harrisbank.JMP are the starting salaries for all 32 male and all 61 female skilled, entry-level clerical employees hired by the bank between 1969 and 1977, as well as the education levels and sex of the employees.


• No evidence of an interaction between Sex and Education. Fit model without interactions.

Expanded Estimates (nominal factors expanded to all levels)

Term                         Estimate    Std Error   t Ratio   Prob>|t|
Intercept                    4257.893    422.8744    10.07     <.0001
EDUC                         98.923456   31.79614    3.11      0.0025
SEX[FEMALE]                  -322.6792   68.97647    -4.68     <.0001
SEX[MALE]                    322.67916   68.97647    4.68      <.0001
SEX[FEMALE]*(EDUC-12.5054)   -36.7929    31.79614    -1.16     0.2503
SEX[MALE]*(EDUC-12.5054)     36.792897   31.79614    1.16      0.2503

Effect Tests

Source     Nparm   DF   Sum of Squares   F Ratio   Prob > F
EDUC       1       1    3159890.9        9.6794    0.0025
SEX        1       1    7144342.4        21.8847   <.0001
SEX*EDUC   1       1    437120.8         1.3390    0.2503


Discrimination Case Regression Results

• Strong evidence that there is a difference in the mean starting salaries of women and men of the same education level.

• Estimated difference: Men have 345.904-(-345.904)=$691.81 higher mean starting salaries than women of the same education level.

• 95% confidence interval for mean difference = (2*$214.55, 2*$477.25) = ($429.10, $954.50).

• Bank’s defense: Omitted variable bias. Variables such as Seniority, Age, Experience also need to be controlled for.

Expanded Estimates (nominal factors expanded to all levels)

Term          Estimate    Std Error   t Ratio   Prob>|t|   Lower 95%   Upper 95%
Intercept     4519.0292   358.2969    12.61     <.0001     3807.2099   5230.8486
EDUC          80.697765   27.67291    2.92      0.0045     25.720708   135.67482
SEX[FEMALE]   -345.9041   66.11594    -5.23     <.0001     -477.255    -214.5533
SEX[MALE]     345.90413   66.11594    5.23      <.0001     214.55328   477.25498

Effect Tests

Source   Nparm   DF   Sum of Squares   F Ratio   Prob > F
EDUC     1       1    2786560.8        8.5038    0.0045
SEX      1       1    8969209.7        27.3715   <.0001
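The estimated male-female difference and its confidence interval are both twice the SEX[MALE] values, because in the expanded estimates SEX[FEMALE] is the negative of SEX[MALE]. A short sketch of that arithmetic, using the values from the table (the variable names are illustrative):

```python
# Values copied from the SEX[MALE] row of the model without interaction.
MALE_EFFECT = 345.90413
MALE_LOWER_95 = 214.55328
MALE_UPPER_95 = 477.25498

# SEX[FEMALE] = -SEX[MALE], so male minus female is twice the male effect.
difference = MALE_EFFECT - (-MALE_EFFECT)

# Doubling the estimate doubles both endpoints of its confidence interval.
ci = (2 * MALE_LOWER_95, 2 * MALE_UPPER_95)

print(f"Estimated difference: ${difference:.2f}")
print(f"95% CI: (${ci[0]:.2f}, ${ci[1]:.2f})")
```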


Interpreting Coefficients When Y is Logged But Not X

• Example: In an industrial laboratory, under uniform conditions, batches of electrical insulating fluid were subject to constant voltages until the insulating property of the fluids broke down.

• Goal: Estimate E(Y|X), where Y=Breakdown Time, X=Voltage Level

• The log Y transformation works well for this data:

\[ E(\log Y \mid X) = \beta_0 + \beta_1 X \]

\[ E(Y \mid X) \approx \exp(\beta_0 + \beta_1 X) \]


Bivariate Fit of Breakdown Time By Voltage

Transformed Fit Log: Log(Breakdown Time) = 18.955513 - 0.5073671*Voltage

[Figure: Breakdown Time (y-axis, 0 to 2500) against Voltage (x-axis, 24 to 40), with the Transformed Fit Log curve.]


Interpreting Coefficients When Y is Logged But Not X

\[ E(\log Y \mid X) = \beta_0 + \beta_1 X \]

\[ E(Y \mid X) \approx \exp(\beta_0 + \beta_1 X) \]

\[
\frac{E(Y \mid X+1)}{E(Y \mid X)}
\approx \frac{\exp(\beta_0 + \beta_1 [X+1])}{\exp(\beta_0 + \beta_1 X)}
= \frac{\exp(\beta_0)\exp(\beta_1 [X+1])}{\exp(\beta_0)\exp(\beta_1 X)}
= \exp(\beta_1)
\]

Interpretation of \(\beta_1\): Increasing X by 1 multiplies the mean of Y by \(\exp(\beta_1)\).

Transformed Fit Log: Log(Breakdown Time) = 18.955513 - 0.5073671*Voltage

Increasing voltage by 1 multiplies the mean breakdown time by \(\exp(-0.507) = 0.602\).
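The multiplicative interpretation can be checked numerically from the fitted equation. This is a small sketch using only the two fitted coefficients; the helper `fitted_breakdown_time` and the voltages 30 and 31 are illustrative choices.

```python
import math

# Coefficients from the transformed fit of log(Breakdown Time) on Voltage.
INTERCEPT = 18.955513
SLOPE = -0.5073671


def fitted_breakdown_time(voltage):
    """Back-transform the fitted log-scale line to the original scale."""
    return math.exp(INTERCEPT + SLOPE * voltage)


# Multiplicative change in the fitted value per one-unit increase in voltage.
factor = math.exp(SLOPE)
percent_change = (factor - 1) * 100

# The ratio of fitted values one volt apart equals exp(slope) at any voltage.
ratio = fitted_breakdown_time(31) / fitted_breakdown_time(30)

print(f"factor = {factor:.3f}")
print(f"percent change per unit of voltage = {percent_change:.1f}%")
```

A factor of about 0.602 means each additional unit of voltage cuts the fitted breakdown time by roughly 40%.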


Deciding on Variables

• When we have many potential explanatory variables, how should we decide which to use?

1. Suppose our goal is to estimate the causal effect of a variable, controlling for all lurking variables (e.g., effect of pollution on mortality). Then it is best to include all possible lurking variables. Omitting variables, even if they do not appear significant, leads to potential bias.

2. Suppose our goal is to understand the association of certain variable(s) with a response, holding certain other variables fixed. We should think carefully about which variables we want to hold fixed.


Model Building for Prediction: General Principles

1. Include all variables that, for substantive reasons, might be expected to be important in predicting the outcome.

2. For explanatory variables with large effects, consider their interactions as well. For each interaction considered, add the interaction if the p-value < 0.05 on the interaction term.


Excluding Variables From a Model

• Should we ever drop a variable from a model? If there are many explanatory variables, then including variables that are not useful can worsen the estimates of the variables that are useful.

• We suggest the following strategy:

1. If an explanatory variable is statistically significant (p-value < 0.05 for two sided test) and has the expected sign, then we should definitely keep it in the model.

2. If an explanatory variable is not statistically significant and has the expected sign, it is generally fine to keep the variable in the model.

3. If an explanatory variable is not statistically significant and does not have the expected sign and is not of primary interest in and of itself, we can consider removing it from the model.

4. If an explanatory variable is statistically significant but does not have the expected sign, then we should keep it in the model, but think hard about whether the sign makes sense. Try to gather data on potential lurking variables and include them in the analysis.


SAT Data

• Y = Average score on 1982 SAT for the state.

• Explanatory Variables:

– X1=Takers (% of Total Eligible Students in the state who took the exam).

– X2=Income (Median Income of Families of Test Takers)

– X3=Years (Average Number of Years That Test Takers Had Formal Studies in Social Sciences, Natural Sciences and Humanities)

– X4=Public (Percentage of Test Takers who attend public schools)

– X5=Expend (Total State Expenditure on Secondary Schools, expressed in hundreds of dollars per student)

– X6=Rank (Median percentile ranking of test takers within their secondary classes)


Expected Signs

– X1=Takers (% of Total Eligible Students in the state who took the exam) (Expected sign: -)

– X2=Income (Median Income of Families of Test Takers) (Expected sign: +)

– X3=Years (Average Number of Years That Test Takers Had Formal Studies in Social Sciences, Natural Sciences and Humanities) (Expected sign: +)

– X4=Public (Percentage of Test Takers who attend public schools) (Expected sign: -)

– X5=Expend (Total State Expenditure on Secondary Schools, expressed in hundreds of dollars per student) (Expected sign: +)

– X6=Rank (Median percentile ranking of test takers within their secondary classes) (Expected sign: +)


Initial Model

Response: SAT

[Figure: Whole Model Actual by Predicted Plot.]

Parameter Estimates

Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   -94.65897   211.5096    -0.45     0.6567
TAKERS      -0.480081   0.693711    -0.69     0.4926
INCOME      -0.008195   0.152358    -0.05     0.9574
YEARS       22.610073   6.314576    3.58      0.0009
PUBLIC      -0.464152   0.579104    -0.80     0.4272
EXPEND      2.2120054   0.845972    2.61      0.0123
RANK        8.4762171   2.107807    4.02      0.0002

Income has opposite sign of what is expected and is not significant. We remove it from the model.

Model with Variable Selection

Parameter Estimates

Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   -100.4736   179.7255    -0.56     0.5790
TAKERS      -0.46208    0.600726    -0.77     0.4459
YEARS       22.6688     6.148595    3.69      0.0006
PUBLIC      -0.452261   0.529142    -0.85     0.3973
EXPEND      2.1859095   0.685131    3.19      0.0026
RANK        8.4964101   2.050469    4.14      0.0002
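The four-part exclusion strategy can be applied mechanically to the initial SAT model by comparing each estimate's sign and p-value with the expected signs listed earlier. The sketch below is one way to encode it; the function name `recommend` and the wording of the returned labels are illustrative, and the handling of a variable of primary interest under rule 3 is an assumption, since the notes only say when removal can be considered.

```python
ALPHA = 0.05  # two-sided significance cutoff from the strategy


def recommend(estimate, p_value, expected_sign, primary_interest=False):
    """Apply the four-rule keep/drop strategy to one explanatory variable."""
    significant = p_value < ALPHA
    sign_ok = (estimate > 0) == (expected_sign == "+")
    if significant and sign_ok:
        return "keep"                            # rule 1
    if not significant and sign_ok:
        return "fine to keep"                    # rule 2
    if not significant and not sign_ok:
        # Rule 3 applies only when the variable is not of primary interest.
        # Assumption: otherwise the variable stays in the model.
        return "keep" if primary_interest else "consider removing"
    return "keep, but investigate the sign"      # rule 4


# (estimate, p-value, expected sign) from the initial model and expected signs.
initial_model = {
    "TAKERS": (-0.480081, 0.4926, "-"),
    "INCOME": (-0.008195, 0.9574, "+"),
    "YEARS": (22.610073, 0.0009, "+"),
    "PUBLIC": (-0.464152, 0.4272, "-"),
    "EXPEND": (2.2120054, 0.0123, "+"),
    "RANK": (8.4762171, 0.0002, "+"),
}

recommendations = {
    name: recommend(est, p, sign)
    for name, (est, p, sign) in initial_model.items()
}
for name, advice in recommendations.items():
    print(f"{name}: {advice}")
```

Only INCOME is both non-significant and wrong-signed, which matches the decision to drop it in the model with variable selection.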