© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 1
Unit 8: Categorical predictors, I: Dichotomies
"There are two kinds of people in the world: Those who believe there are two kinds of people in the world and
those who don't." –Robert Benchley, American Humorist (1888-1946)
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 2
The S-030 roadmap: Where’s this unit in the big picture?
Building a solid foundation
  Unit 1: Introduction to simple linear regression
  Unit 2: Correlation and causality
  Unit 3: Inference for the regression model
Mastering the subtleties
  Unit 4: Regression assumptions: Evaluating their tenability
  Unit 5: Transformations to achieve linearity
Adding additional predictors
  Unit 6: The basics of multiple regression
  Unit 7: Statistical control in depth: Correlation and collinearity
Generalizing to other types of predictors and effects
  Unit 8: Categorical predictors I: Dichotomies
  Unit 9: Categorical predictors II: Polychotomies
  Unit 10: Interaction and quadratic effects
Pulling it all together
  Unit 11: Regression modeling in practice
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 3
In this unit, we’re going to learn about…
• Categorical predictors and regression: An unusual marriage
• Creating and naming conventions for a dummy (or indicator) variable
• Regressing Y on a dummy variable: How this relates to the two-sample t-test
  – What happens if we change the reference category?
• Including a dummy variable in a multiple regression model: How and why it operates
• Another example of using the simple and partial correlation matrices to foreshadow results
• Adjusted means: A simple way of presenting findings for categorical question predictors
• Graphic displays of regression findings: How do you decide which effects to highlight?
• Displaying and interpreting prototypical trajectories
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 4
Categorical predictors and regression: An unusual marriage?
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$
The assumptions focus on Y and ε; there are no assumptions about the X’s.
Categorical predictors are predictors whose values denote categories
Nominal predictors (unordered values): Sex, Religion, Political Party
Ordinal predictors (ordered values): Education, Religiosity, Neighborhood Integration
Another important distinction:
• Dichotomies (only 2 categories)
• Polychotomies (>2 categories)
Dummy (or indicator) variables
Variables whose values offer no meaningful quantitative information but simply
distinguish between categories
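$$\text{FEMALE} = \begin{cases} 0 & \text{if male} \\ 1 & \text{if female} \end{cases} \qquad \text{TREAT} = \begin{cases} 0 & \text{if control} \\ 1 & \text{if treated} \end{cases}$$
By convention, the variable name corresponds to the category given the value 1.
By convention, the category given the value 0 is called the reference category.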
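The naming convention can be illustrated with a few lines of Python (the deck's own code is SAS; this is only a sketch). The helper `make_dummy` and the five-person `sex` list are hypothetical, invented for illustration.

```python
def make_dummy(values, category_coded_one):
    """Return a 0/1 indicator: 1 for the named category, 0 for the reference category."""
    return [1 if v == category_coded_one else 0 for v in values]

# Hypothetical raw data, not taken from the deck.
sex = ["female", "male", "female", "male", "male"]

# The variable is named after the category coded 1; "male" is the reference category.
FEMALE = make_dummy(sex, "female")
print(FEMALE)
```

Naming the variable FEMALE rather than SEX makes the direction of the coding readable from the variable name alone.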
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 5
Do “primary” seat belt laws save lives?
ID  State  DPFat  NOFat  Seat BeltLaw  Miles Driven
40  RI        44     13       0            7071
 2  AK        47     17       0            4387
46  VT        61     19       0            6466
35  ND        71      9       0            7123
51  WY        71     15       0            7576
30  NH        76     28       0           11202
 8  DE        84     25       0            8007
....
47  VA       630    147       0           70320
26  MO       775    143       0           62980
 1  AL       777    124       0           53458
43  TN       789    172       0           60526
14  IL       824    313       0           99319
23  MI       846    259       0           91755
36  OH       944    253       0          103675
39  PA       975    271       0           98015
10  FL      1478    835       0          134007
12  HI        83     36       1            7947
 7  CT       199     96       1           28552
32  NM       237     97       1           21937
38  OR       306     99       1           32268
16  IA       312     61       1           27984
21  MD       345    146       1           46609
19  LA       512    184       1           38840
37  OK       541    107       1           41400
15  IN       615    133       1           68620
33  NY       822    546       1          120778
34  NC       870    269       1           81893
11  GA       973    257       1           93317
 5  CA      1817   1102       1          285612
44  TX      2012    613       1          198700
Source: Calkins, LN & Zlatoper, TJ (2001). The effects of mandatory seat belt laws on motor vehicle fatalities in the United States,
Social Science Quarterly, 82(4), 716-732
RQ: Do states with primary seat belt laws have lower traffic fatality rates?
• 14 states had a mandatory primary seat belt law (28%)
• 36 states did not have a primary seat belt law (72%)
Hypothesis 1: Seat belt laws save lives because seat belts save lives.
Hypothesis 2 (the offset hypothesis): Seat belts encourage riskier driving behavior that may offset any benefit associated with increased seat belt use.
Variables (n = 50):
• SeatBeltLaw: = 0 if no law, = 1 if law
• DPFat: Occupant fatalities (driver & passenger)
• NOFat: Non-occupant fatalities (pedestrians & bicyclists)
• Miles Driven: Potentially important covariate
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 6
Do seat belt laws save lives?
Number of fatalities, for car occupants and non-occupants, by presence of a primary seat belt law
                                      Occupants         Non-Occupants
States with a seatbelt law (n=14)     688.9 (583.8)     267.6 (295.4)
States without a seatbelt law (n=36)  416.1 (337.0)     123.7 (148.7)
Diff in means                         272.8             143.9
t (for diff)                          2.07              2.29
p (for diff)                          0.0439            0.0267
Cell entries are estimated means and (standard deviations).
[Figure panels: Occupant Fatalities; Non-occupant Fatalities]
A 2-sample t-test tests the null hypothesis that 2 population means are the same:
$$H_0: \mu_{\text{no law}} = \mu_{\text{law}}$$
Should we believe these t-tests?
• Is the homoscedasticity assumption tenable?
• Should we be concerned about the skewness of these outcomes?
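The t statistics on this slide can be recomputed from the reported means and standard deviations alone. The sketch below is in Python (the deck itself uses SAS's proc ttest) and assumes the pooled, equal-variance form of the test; the function name `pooled_t` is made up for illustration. It also prints the ratio of the two sample variances as a rough, informal look at the homoscedasticity question.

```python
import math

def pooled_t(n1, mean1, sd1, n2, mean2, sd2):
    """Two-sample t statistic assuming equal population variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff, df

# Summary statistics from the slide: law states first, then no-law states.
t_occ, df_occ = pooled_t(14, 688.9, 583.8, 36, 416.1, 337.0)
t_non, df_non = pooled_t(14, 267.6, 295.4, 36, 123.7, 148.7)

# Ratio of sample variances (law / no law): an informal homoscedasticity check.
var_ratio_occ = (583.8 / 337.0) ** 2
var_ratio_non = (295.4 / 148.7) ** 2

print(f"occupants:     t = {t_occ:.2f} on {df_occ} df, variance ratio = {var_ratio_occ:.1f}")
print(f"non-occupants: t = {t_non:.2f} on {df_non} df, variance ratio = {var_ratio_non:.1f}")
```

The recomputed statistics agree with the slide to two decimals, which suggests the reported tests are the pooled version; the variance ratios are what motivates the slide's question about whether that assumption is tenable.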
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 7
Can transformation help make the outcome distributions more symmetric?
[Stem-and-leaf displays with boxplots, four panels: Occupant Fatalities; Non-occupant Fatalities; Loge(Occupant Fatalities); Loge(Non-occupant Fatalities)]
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 8
Distribution of loge(n fatalities) by presence of seat belt law
Loge(number of fatalities), for car occupants and non-occupants, by presence of a primary seat belt law
                                      Occupants       Non-Occupants
States with a seatbelt law (n=14)     6.21 (0.87)     5.15 (0.94)
States without a seatbelt law (n=36)  5.64 (0.98)     4.28 (1.08)
Diff in means                         0.57            0.86
t (for diff)                          1.89            2.61
p (for diff)                          0.0643          0.0120
Cell entries are estimated means and (standard deviations).
[Figure panels: Loge(Occupant Fatalities); Loge(Non-occupant Fatalities)]
Loge(Occupant fatalities): NO LAW 5.64; LAW 6.21; Diff in means 0.57; t (for diff) 1.89; p (for diff) 0.0643
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 9
Simple regression with one dichotomous predictor: How & why it works
Dependent Variable: LDPFat
Variable       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept       1              5.64258          0.15809     35.69     <.0001
SeatBeltLaw     1              0.56572          0.29877      1.89     0.0643

$$\widehat{\text{LDPFat}} = 5.64 + 0.57\,\text{SeatBelt}$$
States without laws (when SeatBelt = 0):
$$\widehat{\text{LDPFat}} = 5.64 + 0.57(0) = 5.64$$
States with laws (when SeatBelt = 1):
$$\widehat{\text{LDPFat}} = 5.64 + 0.57(1) = 6.21$$
The y-intercept is the estimated value of Y when the dichotomous predictor = 0 (here, the mean loge(occupant fatalities) for states without a seat belt law).
The slope is the estimated difference in Y between categories of the dichotomous predictor (here, the mean difference in Y between states with and without seat belt laws).
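That the intercept and slope are just the two group means in disguise can be checked on any small data set. The Python sketch below uses hypothetical toy numbers (seven made-up observations, not the 50 states) and a hand-written least-squares helper, `simple_ols`, introduced here only for illustration.

```python
def simple_ols(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical toy data: a 0/1 dummy and a continuous outcome.
law = [0, 0, 0, 0, 1, 1, 1]
y = [5.0, 5.5, 6.0, 5.9, 6.1, 6.4, 6.2]

intercept, slope = simple_ols(law, y)

# Group means computed directly, for comparison.
mean_no_law = sum(yi for xi, yi in zip(law, y) if xi == 0) / law.count(0)
mean_law = sum(yi for xi, yi in zip(law, y) if xi == 1) / law.count(1)

print(intercept, mean_no_law)          # intercept vs. mean of the reference category
print(slope, mean_law - mean_no_law)   # slope vs. difference in means
```

The same identity is why the slope's t statistic in the SAS output (1.89) matches the two-sample t-test on the previous slide.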
What would have happened if we’d changed the reference category (when X=0)?
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 10
What happens if we change the “reference category”?
Coding 1: SeatBeltLaw (0 = no law, 1 = law)
Dependent Variable: LDPFat
Variable        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept        1              5.64258          0.15809     35.69     <.0001
SeatBeltLaw      1              0.56572          0.29877      1.89     0.0643

Coding 2: NoSeatBeltLaw (0 = law, 1 = no law)
Dependent Variable: LDPFat
Variable        DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept        1              6.20830          0.25352     24.49     <.0001
NoSeatBeltLaw    1             -0.56572          0.29877     -1.89     0.0643
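The intercept under the second coding can be checked against the raw data, because the 14 law states are listed in full on Slide 5. The Python sketch below assumes the DPFat values as read from that flattened listing (the state-to-value pairing is my parsing of it) and compares their mean natural log with the reported intercept of 6.20830.

```python
import math

# Occupant fatalities (DPFat) for the 14 primary-law states, as parsed from Slide 5.
dpfat_law = {
    "HI": 83, "CT": 199, "NM": 237, "OR": 306, "IA": 312, "MD": 345, "LA": 512,
    "OK": 541, "IN": 615, "NY": 822, "NC": 870, "GA": 973, "CA": 1817, "TX": 2012,
}

# Mean of loge(DPFat) among law states: the reference-category mean when 0 = law.
mean_log_law = sum(math.log(v) for v in dpfat_law.values()) / len(dpfat_law)

# The same quantity implied by the first coding: intercept + slope.
implied_from_first_coding = 5.64258 + 0.56572

print(f"mean loge(DPFat), law states:  {mean_log_law:.5f}")
print(f"intercept + slope (coding 1):  {implied_from_first_coding:.5f}")
```

Both routes land on the intercept of the recoded model, which is the sense in which recoding only changes which group mean the intercept reports.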
Results of hypothesis tests for the dichotomous predictor are identical, regardless of how it is coded.
The intercept is always the estimated value of Y in the reference category.
The se of the slope remains the same.
The sign of the slope is reversed.
The se and hypothesis test for the intercept change to focus on the reference category.
What happens if we statistically control for covariates? Vehicle miles, Weather, Urbanicity
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 11
Vehicle Miles: A theoretically important covariate
[Scatterplots: Loge(Occupant Fatalities) and Loge(Non-occupant Fatalities), each plotted against vehicle miles; r = 0.96*** in each panel]
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 12
What about the effect of SeatBeltLaws after controlling for LMiles?
[Plots of Loge(Occupant Fatalities) and Loge(Non-occupant Fatalities) against LMiles, with separate fitted lines for states with seat belt laws and states without seat belt laws]
Controlling for LMiles, states with laws have fewer occupant fatalities than states without laws.
Controlling for LMiles, states with laws have more non-occ. fatalities than states without laws.
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 13
Including a dichotomous predictor in a MR model: How & why it works
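$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1\,\text{SeatBelt} + \hat{\beta}_2\,\text{LMiles}$$
When SeatBelt = 0:
$$\hat{Y}_{NSB} = \hat{\beta}_0 + \hat{\beta}_1(0) + \hat{\beta}_2\,\text{LMiles} = \hat{\beta}_0 + \hat{\beta}_2\,\text{LMiles}$$
When SeatBelt = 1:
$$\hat{Y}_{SB} = \hat{\beta}_0 + \hat{\beta}_1(1) + \hat{\beta}_2\,\text{LMiles} = (\hat{\beta}_0 + \hat{\beta}_1) + \hat{\beta}_2\,\text{LMiles}$$
$\hat{\beta}_1$ = effect of Seat Belt Laws, controlling for vehicle miles
[Figure: Loge(Non-occupant Fatalities) with two parallel fitted lines, one for SeatBelt = 1 and one for SeatBelt = 0]
Realize that these lines are parallel because we’ve assumed that they’re parallel. This is the main effects assumption that we’ll learn how to examine (and if necessary relax) in Unit 10.
In many fields, this model is known as an Analysis of Covariance (ANCOVA) model.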
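The parallel-lines property follows directly from the algebra: the two fitted lines share the LMiles slope and differ only in their intercepts. A short Python sketch, using the non-occupant coefficients reported later in the deck (intercept -6.41013, SeatBeltLaw 0.18784, LMiles 1.04607); the three LMiles values are arbitrary illustrative points, not taken from the slides.

```python
# Coefficients from the deck's fitted model for LNOFat on SeatBeltLaw and LMiles.
B0, B1, B2 = -6.41013, 0.18784, 1.04607

def predict(seatbelt, lmiles):
    """Fitted loge(non-occupant fatalities) under the main-effects model."""
    return B0 + B1 * seatbelt + B2 * lmiles

# Vertical gap between the two lines at several (arbitrary) values of LMiles.
gaps = [predict(1, lm) - predict(0, lm) for lm in (8.5, 10.0, 12.0)]
print(gaps)
```

The gap equals the SeatBeltLaw coefficient at every LMiles value. That constancy is built into the model rather than discovered from the data, which is the point of the slide's warning.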
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 14
The effect of SeatBeltLaws on Occupant Fatalities (controlling for LMiles)
The REG Procedure
Dependent Variable: LDPFat

Number of Observations Read   50
Number of Observations Used   50

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2         43.08715      21.54358    304.21   <.0001
Error             47          3.32849       0.07082
Corrected Total   49         46.41564

Root MSE          0.26612   R-Square   0.9283
Dependent Mean    5.80098   Adj R-Sq   0.9252
Coeff Var         4.58747

Parameter Estimates
Variable       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept       1             -4.12267          0.41399     -9.96     <.0001
SeatBeltLaw     1             -0.05016          0.08775     -0.57     0.5703
LMiles          1              0.95514          0.04026     23.72     <.0001
Model is stat sig (p<.0001).
R2 statistic is very high.
Estimated effect of vehicle miles is large: Controlling for seatbelt laws, states whose total vehicle miles differ by 1% have occupant fatalities that differ by an average of approximately 1% as well (p<.0001).
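The "1% goes with approximately 1%" reading comes from both variables being logged: a slope of b on the log-log scale means multiplying X by (1 + p) multiplies predicted Y by (1 + p)^b. A minimal Python sketch using the reported LMiles coefficient; the helper name `pct_change_in_y` is hypothetical.

```python
def pct_change_in_y(pct_change_in_x, slope):
    """Percent difference in Y implied by a percent difference in X in a log-log model."""
    return ((1 + pct_change_in_x / 100) ** slope - 1) * 100

b_lmiles = 0.95514  # LMiles coefficient from the output above

one_pct = pct_change_in_y(1, b_lmiles)
ten_pct = pct_change_in_y(10, b_lmiles)

print(f"1% more vehicle miles  -> {one_pct:.3f}% more occupant fatalities")
print(f"10% more vehicle miles -> {ten_pct:.2f}% more occupant fatalities")
```

For small percentage differences the implied change is close to the slope itself, which is why the slide can read the coefficient directly as "approximately 1%".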
The effect of seatbelt laws disappears: Controlling for vehicle miles, there is no relationship between seatbelt laws and the number of occupant fatalities.
What has happened???
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 15
Developing your instinct for the effects of a dichotomous predictor: Comparing uncontrolled and controlled effects on occupant fatalities
Estimated effect of having a primary seat belt law on number of occupant fatalities:
  Uncontrolled: +0.57 (t = 1.89, p = .0643)
  Controlling for vehicle miles: -0.05 (t = -0.57, p = .5703)
$$\widehat{\text{LDPFat}} = -4.12 - 0.05\,\text{SeatBelt} + 0.96\,\text{LMiles}$$
When SeatBelt = 0:
$$\widehat{\text{LDPFat}} = -4.12 - 0.05(0) + 0.96\,\text{LMiles} = -4.12 + 0.96\,\text{LMiles}$$
When SeatBelt = 1:
$$\widehat{\text{LDPFat}} = -4.12 - 0.05(1) + 0.96\,\text{LMiles} = -4.17 + 0.96\,\text{LMiles}$$
[Figure: Loge(Occupant Fatalities) against LMiles, with fitted lines for states with and without laws. Marked values: 10.22 and 10.87 on the LMiles axis; uncontrolled means 5.64 (states without laws) and 6.21 (states with laws), a gap of +0.57; fitted lines a constant 0.05 apart (7.34 vs. 7.29; 4.00 vs. 3.95)]
States with Seat Belt Laws have many more vehicle miles. Might this explain why they have many more occupant fatalities???
Holding vehicle miles constant, states with Seat Belt Laws have no more occupant fatalities than those without laws.
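The move from +0.57 to -0.05 can be accounted for arithmetically. With a dummy in a least-squares model, each group's mean fitted value equals its mean observed value, so the uncontrolled difference equals the controlled difference plus the LMiles slope times the groups' difference in mean LMiles. The Python sketch below uses the reported coefficients and assumes (my reading of the figure) that 10.22 and 10.87 are the mean LMiles for states without and with laws.

```python
uncontrolled_diff = 0.56572   # SeatBeltLaw slope, model without LMiles
controlled_diff = -0.05016    # SeatBeltLaw slope, model with LMiles
b_lmiles = 0.95514            # LMiles slope in the controlled model

# Gap in mean LMiles implied by: uncontrolled = controlled + b_lmiles * gap
implied_gap = (uncontrolled_diff - controlled_diff) / b_lmiles

# Gap read off the figure (assumed to be the group means of LMiles).
reported_gap = 10.87 - 10.22

print(f"implied gap in mean LMiles:  {implied_gap:.3f}")
print(f"gap marked on the figure:    {reported_gap:.2f}")
```

The two gaps agree to rounding, so nearly all of the uncontrolled "effect" is the extra driving done in states that happen to have laws.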
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 16
What’s the effect of SeatBelt laws on non-occupant fatalities?

The REG Procedure
Dependent Variable: LNOFat

Number of Observations Read   50
Number of Observations Used   50

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2         55.30826      27.65413    262.68   <.0001
Error             47          4.94808       0.10528
Corrected Total   49         60.25634

Root MSE          0.32447   R-Square   0.9179
Dependent Mean    4.52624   Adj R-Sq   0.9144
Coeff Var         7.16856

Parameter Estimates
Variable       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept       1             -6.41013          0.50476    -12.70     <.0001
SeatBeltLaw     1              0.18784          0.10698      1.76     0.0857
LMiles          1              1.04607          0.04909     21.31     <.0001
Model is stat sig (DUH!).
R2 is very high (DUH!).
Estimated effect of vehicle miles is large: Controlling for seatbelt laws, states whose total vehicle miles differ by 1% have non-occupant fatalities that differ by an average of approximately 1% as well (p<.0001).
The effect of seatbelt laws diminishes: Controlling for vehicle miles, the effect of seatbelt laws on the number of non-occupant fatalities is no longer statistically significant at the 0.05 level (although this difference between stat sig & not stat sig is undoubtedly not stat sig!).
At this point in our analysis, our question predictor seems to have NO effect (controlling for LMILES)... Oy!
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 17
Making sense of the correlation matrix: Considering old & new variables
Pearson Correlation Coefficients, N = 50
Prob > |r| under H0: Rho=0

              LDPFat    LNOFat    SeatBeltLaw  LMiles    LTemp     PctUrban  lpopden
LDPFat        1.00000   0.92716   0.26363      0.96322   0.50966   0.33674   0.40214
                        <.0001    0.0643       <.0001    0.0002    0.0168    0.0038
LNOFat        0.92716   1.00000   0.35271      0.95525   0.50989   0.56415   0.50772
              <.0001              0.0120       <.0001    0.0002    <.0001    0.0002
SeatBeltLaw   0.26363   0.35271   1.00000      0.29585   0.33251   0.21303   0.20434
              0.0643    0.0120                 0.0370    0.0183    0.1374    0.1546
LMiles        0.96322   0.95525   0.29585      1.00000   0.41615   0.49945   0.53365
              <.0001    <.0001    0.0370                 0.0026    0.0002    <.0001
LTemp         0.50966   0.50989   0.33251      0.41615   1.00000   0.28925   0.33662
              0.0002    0.0002    0.0183       0.0026              0.0416    0.0168
PctUrban      0.33674   0.56415   0.21303      0.49945   0.28925   1.00000   0.67812
              0.0168    <.0001    0.1374       0.0002    0.0416              <.0001
lpopden       0.40214   0.50772   0.20434      0.53365   0.33662   0.67812   1.00000
              0.0038    0.0002    0.1546       <.0001    0.0168    <.0001
The two outcomes are highly correlated (but this does NOT mean we should only analyze one of them!).
SeatBelt states have more fatalities (knew this from t-test results).
Vehicle miles is a strong predictor.
Warmer states have more fatalities. Might want to include?
More urban states have more fatalities (esp non-occupants). Might want to include?
SeatBelt states have more vehicle miles. And that seems to have been enough to make the SeatBelt effects disappear!
Warmer states have SeatBelt laws and more vehicle miles. Might controlling for LTemp change things?
SeatBelt states are more urban, but this difference is not stat sig…
More urban states are warmer & have more vehicle miles. Getting a sense that we need to make sure that we can really include all these additional predictors…
The urbanicity variables are highly correlated. This does NOT mean they are definitely collinear, but we need to determine if both are needed in a model.
Correlation matrix guidance: Keep your eyes on the question predictor. See a control predictor with a big effect? Partial it out and look again... Here: partial LMILES out and look again.
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 18
What changes & what remains the same when we partial out LMiles?

Pearson Partial Correlation Coefficients, N = 50
Prob > |r| under H0: Partial Rho=0

              LDPFat    LNOFat    SeatBeltLaw  LTemp     PctUrban  lpopden
LDPFat        1.00000   0.08868   -0.08310     0.44532   -0.61999  -0.49233
                        0.5446    0.5703       0.0013    <.0001    0.0003
LNOFat        0.08868   1.00000   0.24809      0.41772   0.33969   -0.00821
              0.5446              0.0857       0.0028    0.0169    0.9554
SeatBeltLaw   -0.08310  0.24809   1.00000      0.24107   0.07887   0.05751
              0.5703    0.0857                 0.0952    0.5901    0.6947
LTemp         0.44532   0.41772   0.24107      1.00000   0.10334   0.14895
              0.0013    0.0028    0.0952                 0.4798    0.3071
PctUrban      -0.61999  0.33969   0.07887      0.10334   1.00000   0.56177
              <.0001    0.0169    0.5901       0.4798              <.0001
lpopden       -0.49233  -0.00821  0.05751      0.14895   0.56177   1.00000
              0.0003    0.9554    0.6947       0.3071    <.0001
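The first-order partial correlations can be reproduced from the simple correlation matrix with the standard formula r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The Python sketch below plugs in the simple correlations reported on the previous slide, with LMiles as the partialled-out variable; the function name `partial_corr` is introduced only for illustration.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Simple correlations from the previous slide.
r_occ_law, r_non_law = 0.26363, 0.35271       # outcomes with SeatBeltLaw
r_occ_miles, r_non_miles = 0.96322, 0.95525   # outcomes with LMiles
r_law_miles = 0.29585                         # SeatBeltLaw with LMiles
r_occ_non = 0.92716                           # the two outcomes

p_occ_law = partial_corr(r_occ_law, r_occ_miles, r_law_miles)
p_non_law = partial_corr(r_non_law, r_non_miles, r_law_miles)
p_occ_non = partial_corr(r_occ_non, r_occ_miles, r_non_miles)

print(f"LDPFat-SeatBeltLaw | LMiles: {p_occ_law:.4f}")
print(f"LNOFat-SeatBeltLaw | LMiles: {p_non_law:.4f}")
print(f"LDPFat-LNOFat      | LMiles: {p_occ_non:.4f}")
```

The results agree with the SAS partial correlation matrix up to rounding of the inputs, including the sign flip for occupant fatalities and the collapse of the correlation between the two outcomes.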
The two outcomes are no longer highly correlated. Hmmm...
SeatBelt states no longer have more fatalities (we knew this from regression results).
Warmer states still have more fatalities. Really need to include LTEMP, don’t we!
More urban states now have fewer occupant fatalities! Hmmm...
Warmer states still are more likely to have SeatBelt Laws (but the partial is now n.s.).
States with a greater %age of urban roads still have more non-occupant fatalities, but population density, by itself, no longer seems to matter. We probably want to include PctUrban, but we’re now unsure about LPopDen: need to see what happens.
Urbanicity is now uncorrelated with either SeatBelt laws or temperature. So we can probably add at least one urbanicity variable (but need to check about both).
The urbanicity variables are still highly correlated. But this still does NOT mean they are necessarily collinear!
• In general, the inter-correlations between predictors are smaller after we control for LMILES...
• But also, some of these correlations have changed sign! Hmmm...
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 19
What happens when we control for these additional covariates?
Non-occupant fatalities

The REG Procedure
Dependent Variable: LNOFat

Number of Observations Read   50
Number of Observations Used   50

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              5         56.89264      11.37853    148.84   <.0001
Error             44          3.36370       0.07645
Corrected Total   49         60.25634

Root MSE          0.27649   R-Square   0.9442
Dependent Mean    4.52624   Adj R-Sq   0.9378
Coeff Var         6.10863

Parameter Estimates
Variable       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept       1             -9.48379          1.12472     -8.43     <.0001
SeatBeltLaw     1              0.10353          0.09409      1.10     0.2772
LMiles          1              0.97871          0.05106     19.17     <.0001
LTemp           1              0.91980          0.29841      3.08     0.0035
PctUrban        1              1.04986          0.31712      3.31     0.0019
lpopden         1             -0.09652          0.04100     -2.35     0.0231
R2 statistic is even higher.
Positive effect of temperature: Controlling for all other variables in the model, states whose average temperatures are 1% higher have non-occupant fatality rates that are .92% higher.
Urbanicity variables tell a complex story. On the one hand, the higher the percentage of urban roads in a state, the higher the number of non-occupant fatalities; on the other hand, the higher the population density, the lower the number of non-occupant fatalities.
Effect of Vehicle Miles remains stable.
The effect of the seatbelt law has disappeared! Any observed differential in non-occupant fatalities between states with and without seatbelt laws is now well within the limits of sampling variation, once you control for vehicle miles, temperature and urbanicity. This suggests little support for the offset hypothesis.
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 20
What happens when we control for these additional covariates?
Occupant fatalities

The REG Procedure
Dependent Variable: LDPFat

Number of Observations Read   50
Number of Observations Used   50

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              5         45.49932       9.09986    436.96   <.0001
Error             44          0.91632       0.02083
Corrected Total   49         46.41564

Root MSE          0.14431   R-Square   0.9803
Dependent Mean    5.80098   Adj R-Sq   0.9780
Coeff Var         2.48769

Parameter Estimates
Variable       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept       1             -8.40250          0.58703    -14.31     <.0001
SeatBeltLaw     1             -0.10059          0.04911     -2.05     0.0465
LMiles          1              1.01653          0.02665     38.14     <.0001
LTemp           1              1.10171          0.15575      7.07     <.0001
PctUrban        1             -0.87943          0.16551     -5.31     <.0001
lpopden         1             -0.06346          0.02140     -2.97     0.0049
R2 statistic is almost perfect!
Positive effect of temperature: Controlling for all other variables in the model, states whose average temperatures are 1% higher have occupant fatality rates that are 1.1% higher.
City driving is safer for car occupants. The higher the percentage of urban roads and the denser the population, the lower the number of occupant fatalities.
Effect of Vehicle Miles remains stable.
The effect of the seatbelt law is now reversed! States with primary seat belt laws have lower numbers of occupant fatalities than states without these laws, once you control for vehicle miles, temperature and urbanicity.
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 21
How would we present these regression results?
Results of fitting a series of multiple regression models predicting occupant and non-occupant fatalities, examining the effects of the presence of a primary seat belt law (n=50 states)

                       loge(Occupant Fatalities)        loge(Non-Occupant Fatalities)
Predictor              Model A   Model B   Model C      Model D   Model E   Model F
Intercept              5.64***   -4.12***  -8.40***     4.28***   -6.41***  -9.48***
                       (0.16)    (0.41)    (0.59)       (0.17)    (0.50)    (1.12)
                       35.69     -9.96     -14.31       24.52     -12.70    -8.43
Seatbelt Law           0.57~     -0.05     -0.10*       0.86*     0.19~     0.10
                       (0.30)    (0.09)    (0.05)       (0.33)    (0.10)    (0.09)
                       1.89      -0.57     -2.05        2.61      1.76      1.10
Loge(Vehicle Miles)              0.96***   1.02***                1.05***   0.98***
                                 (0.04)    (0.03)                 (0.05)    (0.05)
                                 23.72     38.14                  21.31     19.17
Loge(Mean Temp)                            1.10***                          0.92**
                                           (0.16)                           (0.30)
                                           7.07                             3.08
Pct Urban Roads                            -0.88***                         1.05**
                                           (0.17)                           (0.32)
                                           -5.31                            3.31
Loge(Pop Density)                          -0.06**                          -0.09*
                                           (0.02)                           (0.04)
                                           -2.97                            -2.35
R2 (%)                 7.0       92.8      98.0         12.4      91.8      94.4
F                      3.59~     304.21    436.96       6.82*     262.68    148.84
(df)                   (1, 48)   (2, 47)   (5, 44)      (1, 48)   (2, 47)   (5, 44)
p                      0.0643    <.0001    <.0001       0.0120    <.0001    <.0001

Cell entries are estimated regression coefficients, (standard errors) and t-statistics.
~ p<.10, *p<.05, **p<.01, ***p<.001

We (the people) conclude: Seatbelt laws save the lives of people in cars and don’t hurt people on the streets.
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 22
Adjusted means: A simple way of presenting findings when your question predictor is CATEGORICAL (here, dichotomous)
To calculate adjusted means: Set all predictors, except for the categorical question predictor, to their sample means and then compute the predicted value of the outcome at each value of the categorical predictor.
Sample means: LMiles = 10.40, LTemp = 3.98, PctUrban = 0.52, LPopDen = 4.29

For occupants:
$$\widehat{\log(\text{OccFat})} = -8.40 - 0.10\,\text{SeatBelt} + 1.02\,\text{LMiles} + 1.10\,\text{LTemp} - 0.88\,\text{PctUrban} - 0.06\,\text{LPopDen}$$
$$\widehat{\log(\text{OccFat})} = -8.40 - 0.10\,\text{SeatBelt} + 14.27 = 5.87 - 0.10\,\text{SeatBelt}$$
SeatBelt = 0: 5.87; SeatBelt = 1: 5.77
Diff = -0.10, t = -2.05, p = 0.0465

For non-occupants:
$$\widehat{\log(\text{NonOcc})} = -9.48 + 0.10\,\text{SeatBelt} + 0.98\,\text{LMiles} + 0.92\,\text{LTemp} + 1.05\,\text{PctUrban} - 0.09\,\text{LPopDen}$$
$$\widehat{\log(\text{NonOcc})} = -9.48 + 0.10\,\text{SeatBelt} + 14.01 = 4.53 + 0.10\,\text{SeatBelt}$$
SeatBelt = 0: 4.53; SeatBelt = 1: 4.63
Diff = 0.10, t = 1.10, p = 0.2772
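The adjusted-means recipe is a single substitution, sketched below in Python (the deck's own code is SAS). The coefficients and sample means are the rounded values shown on this slide; the dictionary layout and the function name `adjusted_mean` are choices made here for illustration.

```python
# Sample means of the covariates, as given on the slide.
means = {"LMiles": 10.40, "LTemp": 3.98, "PctUrban": 0.52, "LPopDen": 4.29}

# Rounded coefficients of the fully controlled models (occupants, non-occupants).
occ = {"Intercept": -8.40, "SeatBelt": -0.10, "LMiles": 1.02,
       "LTemp": 1.10, "PctUrban": -0.88, "LPopDen": -0.06}
non = {"Intercept": -9.48, "SeatBelt": 0.10, "LMiles": 0.98,
       "LTemp": 0.92, "PctUrban": 1.05, "LPopDen": -0.09}

def adjusted_mean(coefs, seatbelt):
    """Predicted outcome with every covariate held at its sample mean."""
    covariate_part = sum(coefs[name] * means[name] for name in means)
    return coefs["Intercept"] + coefs["SeatBelt"] * seatbelt + covariate_part

for label, coefs in (("occupants", occ), ("non-occupants", non)):
    no_law = adjusted_mean(coefs, 0)
    law = adjusted_mean(coefs, 1)
    print(f"{label}: no law = {no_law:.2f}, law = {law:.2f}, diff = {law - no_law:+.2f}")
```

Because every covariate is fixed at the same value for both groups, the difference between the two adjusted means is always the seat belt coefficient itself.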
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 23
Presenting unadjusted and adjusted means
Loge(number of fatalities), for car occupants and non-occupants, by presence of a primary seat belt law, overall and adjusted for vehicle miles, average temperature, percentage of urban roads and population density

                   Occupants                 Non-Occupants
                   Unadjusted   Adjusted     Unadjusted   Adjusted
Law (n=14)         6.21         5.77         5.15         4.63
No Law (n=36)      5.64         5.87         4.28         4.53
Diff in means      0.57         -0.10        0.86         0.10
t (for diff)       1.89         -2.05        2.61         1.10
p (for diff)       0.0643       0.0465       0.0120       0.2772

Adjusted mean = controlling for temp, miles driven & urbanicity.
The difference between the means of the dichotomous predictor’s two categories is equal to the dichotomous predictor’s slope coefficient in a particular model. For example, for occupant fatalities:
Adjusted means: 5.77 – 5.87 = -0.10 in the controlled model
Unadjusted means: 6.21 – 5.64 = 0.57 in the uncontrolled model
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 24
Another example presenting adjusted differences between groups
British Medical Journal, 2005, 331, 1306-1311
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 25
Towards a graphic display of the regression findings: Which predictors would we want to highlight in a graph?
[Table of Models A–F repeated from Slide 21, annotated by predictor:]
Seatbelt Law: Question Predictor. Definitely want to document in a graph ...but as dichotomy, we probably don’t want it on the x-axis!
Loge(Vehicle Miles): Obvious Covariate. Don’t need to document in a graph ...so hold it constant at its mean?
Loge(Mean Temp): Interesting Covariate. Might want to document in a graph ...if so, how?
Pct Urban Roads: Interesting Covariate. Probably want to document in a graph ... Should we emphasize the difference in signs for the two outcomes?
Loge(Pop Density): Small, similar effect. Don’t need to document in a graph ...so hold it constant at its mean?
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 26
Sketching out the expected graph documenting the effects of Seatbelt laws, PctUrban and Temperature
Q1: With 2 outcomes, do I want 1 graph or 2?
  Depends on where the lines fall.
Q2: Which of the predictors should go on the X axis?
  Usually the question predictor, but because SeatBelt is a dichotomy, I’m choosing PctUrban to highlight the sign difference.
Q3: How should we display the effects of the other continuous predictor, LTemp?
  Need to choose prototypical values: ‘warm’ and ‘cold’ states.
Q4: What will the lines look like for states with and without seatbelt laws?
  Attend now to ranking; worry about scale later.
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 27
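Displaying prototypical trajectories, step one: Setting control variables at their means
Control variables set at their means: LMiles = 10.40, LPopDen = 4.29

Non-occupants:
$$\widehat{\log(\text{NonOcc})} = -9.48 + 0.10\,\text{SeatBelt} + 0.98\,\text{LMiles} + 0.92\,\text{LTemp} + 1.05\,\text{PctUrban} - 0.09\,\text{LPopDen}$$
$$\widehat{\log(\text{NonOcc})} = -9.48 + 9.81 + 0.10\,\text{SeatBelt} + 0.92\,\text{LTemp} + 1.05\,\text{PctUrban}$$
$$\widehat{\log(\text{NonOcc})} = 0.33 + 0.10\,\text{SeatBelt} + 0.92\,\text{LTemp} + 1.05\,\text{PctUrban}$$

Occupants:
$$\widehat{\log(\text{OccFat})} = -8.40 - 0.10\,\text{SeatBelt} + 1.02\,\text{LMiles} + 1.10\,\text{LTemp} - 0.88\,\text{PctUrban} - 0.06\,\text{LPopDen}$$
$$\widehat{\log(\text{OccFat})} = -8.40 + 10.35 - 0.10\,\text{SeatBelt} + 1.10\,\text{LTemp} - 0.88\,\text{PctUrban}$$
$$\widehat{\log(\text{OccFat})} = 1.95 - 0.10\,\text{SeatBelt} + 1.10\,\text{LTemp} - 0.88\,\text{PctUrban}$$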
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 28
Displaying prototypical trajectories, step two: Computing predicted values for selected levels of the remaining predictors
$$\widehat{\log(\text{OccFat})} = 1.95 - 0.10\,\text{SeatBelt} + 1.10\,\text{LTemp} - 0.88\,\text{PctUrban}$$
$$\widehat{\log(\text{NonOcc})} = 0.33 + 0.10\,\text{SeatBelt} + 0.92\,\text{LTemp} + 1.05\,\text{PctUrban}$$

Occupant Fatalities                      Non-occupant Fatalities
SeatBelt  LTemp  PctUrban  Yhat          SeatBelt  LTemp  PctUrban  Yhat
0         3.85   0.33      5.89          0         3.85   0.33      4.22
0         3.85   0.66      5.60          0         3.85   0.66      4.57
0         4.15   0.33      6.22          0         4.15   0.33      4.49
0         4.15   0.66      5.93          0         4.15   0.66      4.84
1         3.85   0.33      5.79          1         3.85   0.33      4.32
1         3.85   0.66      5.50          1         3.85   0.66      4.67
1         4.15   0.33      6.12          1         4.15   0.33      4.59
1         4.15   0.66      5.83          1         4.15   0.66      4.94

SeatBelt: Only 2 values, 0 & 1
PctUrban: Mean = 0.52, sd = 0.15. Displayed on X axis: calculate at .33 and .66
Selecting prototypical temperature values: Mean(LTemp) = 3.98, sd = 0.15
  Cold: about 1 sd below mean = 3.85 (47º F), ~ Illinois/Michigan
  Warm: about 1 sd above mean = 4.15 (63º F), ~ Mississippi/Tenn
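The table of prototypical predicted values is a small grid evaluation, which can be sketched in Python (not the deck's SAS) using the reduced equations and the prototypical values chosen on this slide. The function names `occ_hat` and `non_hat` are introduced here for illustration only.

```python
from itertools import product

def occ_hat(seatbelt, ltemp, pct_urban):
    """Predicted loge(occupant fatalities); LMiles and LPopDen held at their means."""
    return 1.95 - 0.10 * seatbelt + 1.10 * ltemp - 0.88 * pct_urban

def non_hat(seatbelt, ltemp, pct_urban):
    """Predicted loge(non-occupant fatalities); LMiles and LPopDen held at their means."""
    return 0.33 + 0.10 * seatbelt + 0.92 * ltemp + 1.05 * pct_urban

# Prototypical values: no law / law, cold / warm, low / high percentage of urban roads.
grid = list(product((0, 1), (3.85, 4.15), (0.33, 0.66)))

print("SeatBelt  LTemp  PctUrban  Occ Yhat  NonOcc Yhat")
for sb, lt, pu in grid:
    print(f"{sb:<9} {lt:<6} {pu:<9} {occ_hat(sb, lt, pu):<9.2f} {non_hat(sb, lt, pu):.2f}")
```

Each pair of rows that differs only in PctUrban supplies the two endpoints of one line in the final graph, so eight rows give four lines per outcome.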
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 29
The effects of seat belt laws, urbanicity & temperature on traffic fatalities
controlling for vehicle miles and population density
[Figure: Loge(Fatalities) (y-axis, 4.0 to 6.5) against Pct Urban Roads (x-axis, 0.25 to 0.75). Two sets of lines, Occupant Fatalities and Non-occupant Fatalities, each showing Seat Belt Law vs. No Law for Warm and Cold states]
Note: These differences in occupant fatalities by Seat Belt Law are statistically significant.
Note: These differences in non-occupant fatalities by Seat Belt Law are not statistically significant.
What would this graph look like if we were to also “just control” for the effect of LTemp?
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 30
In some situations, you might prefer a simpler display
The effects of seat belt laws and urbanicity on traffic fatalities, controlling for vehicle miles, population density and temperature (with Ltemp set at its mean of 3.98)
[Figure: Loge(Fatalities) (y-axis, 4 to 6.5) against Pct Urban Roads (x-axis, 0.25 to 0.75), with Seat Belt Law and No Law lines for Occupant Fatalities and for Non-occupant Fatalities. Labeled values: 5.87 and 5.77 (occupants); 4.53 and 4.63 (non-occupants)]
Note: These differences in occupant fatalities by Seat Belt Law are statistically significant.
Note: These differences in non-occupant fatalities by Seat Belt Law are not statistically significant.
How does this graph relate to the adjusted means?
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 31
What’s the big takeaway from this unit?
• Regression models can easily include dichotomous predictors
  – All assumptions are about Y at particular values of X (or X’s); there are no assumptions about the distribution of the predictors
  – The same toolkit we’ve developed for continuous predictors can be used for dichotomous predictors (including hypothesis tests, correlations and plots)
• Controlled effects are often different from uncontrolled effects
  – One of the major reasons we use multiple regression is that we have several predictors that affect the outcome for which we want to statistically control
  – Not only can we control for a single covariate, we can control for many covariates simultaneously (in this example, we had 4 covariates in addition to our question variable)
• Results of complex analyses can be displayed more simply using tables and graphs
  – As your models become more complex, the need for simpler numerical and graphical displays remains
  – Always important to think about how you will communicate your results to colleagues and broader audiences
  – Adjusted means and prototypical trajectories are powerful tools
© Judith D. Singer, Harvard Graduate School of Education Unit 8/Slide 32
Appendix: Annotated PC-SAS Code for Using Dichotomous Predictors
Note that this is just an excerpt from the full program

*------------------------------------------------------------------*
Creating boxplots of PDFAT & NOFAT distributions for
SEATBELTLAW=0 and SEATBELTLAW=1
*------------------------------------------------------------------*;
proc boxplot data=one;
  title2 "Fatalities by Presence/Absence of SeatBelt Laws";
  plot (PDFat NOFat)*SeatBeltLaw;
run;

proc boxplot, when used for dichotomous predictors, creates pairs of boxplots comparing each outcome variable's values across the two categories of the dichotomous predictor. The plot statement specifies the outcome variables to be used and the dichotomous predictor. Its syntax is outcome*predictor (note the use of parentheses because of the two outcome variables).

*-------------------------------------------------------------------*
Display PDFAT & NOFAT univariate summary information in tables for
SEATBELTLAW=0 & SEATBELTLAW=1
*------------------------------------------------------------------*;
proc means data=one;
  by SeatBeltLaw;
  var PDFat NOFat;
run;

proc means is a very useful tool to create table summaries of descriptive statistics, especially for categorical predictors. The by statement specifies the categorical predictor to be used in grouping the data; the data must already be sorted by that variable (for example, with proc sort). The var statement specifies the variables for which you require descriptive statistics.

*-------------------------------------------------------------------*
Comparing mean values of PDFAT & NOFAT for SEATBELTLAW=0 and
SEATBELTLAW=1
*------------------------------------------------------------------*;
proc ttest data=one;
  class SeatBeltLaw;
  var PDFat NOFat;
run;

proc ttest runs a two-sample t-test comparing the means of two groups. The class statement specifies the categorical predictor used to differentiate the two groups.
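To connect the t-test to regression, here is a minimal sketch (not part of the original program) that regresses each outcome on the dummy variable. It assumes the dataset one and the variables PDFat, NOFat and SeatBeltLaw from the steps above. The model labels OccModel and NonOccModel are hypothetical names chosen for illustration.

```sas
*-------------------------------------------------------------------*
Sketch only, not part of the original program
Regressing PDFAT & NOFAT on the dummy variable SEATBELTLAW
Assumes dataset ONE from the steps above
*------------------------------------------------------------------*;
proc reg data=one;
  title2 "Regressing fatalities on SeatBeltLaw";
  OccModel:    model PDFat = SeatBeltLaw;
  NonOccModel: model NOFat = SeatBeltLaw;
run;
quit;
```

In each model the intercept should equal the sample mean of the SeatBeltLaw=0 group and the slope should equal the difference between the two group means. The t-statistic for the slope should match the pooled (equal variances) t-statistic from proc ttest in magnitude, although its sign may differ depending on the order in which proc ttest subtracts the groups.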
Appendix: Annotated PC-SAS Code for Using Dichotomous Predictors
*-------------------------------------------------------------------*
For pedagogic purposes only: What happens if we change the reference
category? Creating new dichotomous predictor NOSEATBELTLAW
*------------------------------------------------------------------*;
data one;
  set one;
  NoSeatBeltLaw = 1 - SeatBeltLaw;
run;
Use the data step in the middle of the program to add new variables to the same data. The set statement names the dataset to read; because the data and set statements name the same dataset, the new variable is added to it. You can then run new PROCs on the same data, using the new variables.
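The effect of switching the reference category can be worked out algebraically. This is a sketch for a simple regression of an outcome Y on the dummy variable; the symbols for the coefficients and the error term are notation assumed here for illustration.

```latex
\begin{aligned}
Y_i &= \beta_0 + \beta_1\,\mathit{SeatBeltLaw}_i + \varepsilon_i \\
    &= \beta_0 + \beta_1\,(1 - \mathit{NoSeatBeltLaw}_i) + \varepsilon_i \\
    &= (\beta_0 + \beta_1) - \beta_1\,\mathit{NoSeatBeltLaw}_i + \varepsilon_i
\end{aligned}
```

The slope keeps its magnitude but changes sign, and the intercept becomes the mean of the group that was previously coded 1. The fit of the model and the test of the slope are unchanged.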
*-------------------------------------------------------------------*
Controlling for vehicle miles
Inspect bivariate scatterplots LDPFAT vs MILES, LDPFAT vs LMILES,
LNOFAT vs MILES, LNOFAT vs LMILES
Inspect same plots showing SEATBELTLAW=0 and SEATBELTLAW=1
*-----------------------------------------------------------------*;
proc gplot data=one;
  title2 "Examining the effect of vehicle miles";
  plot (LDPFat LNOFat)*(miles lmiles);
  plot (LDPFat LNOFat)*(miles lmiles)=SeatBeltLaw;
run;
proc gplot can also be used to draw a three-way plot, with plotting symbols denoting the third (here categorical) predictor. The plot statement syntax is outcome*predictor=categorical predictor. If you use a symbol statement in the program, SAS will use dots ● of different colors for each category of the predictor. Note that you can have multiple plot statements in a single GPLOT.
*-------------------------------------------------------------------*
Estimating partial correlations controlling for LMILES
*------------------------------------------------------------------*;
proc corr data=one;
  title2 "Partial correlation matrix controlling for Lmiles";
  var LDPFat LNOFat SeatBeltLaw ltemp PctUrban lpopden;
  partial lmiles;
run;
proc corr estimates bivariate correlations among the variables you specify. If you add a partial statement, it estimates partial correlations instead, controlling for the variable named in the partial statement.
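For a single control variable, the partial correlation can be written in terms of the three simple correlations. In this sketch Y and X stand for any two variables from the var statement and Z stands for the control variable (here lmiles); the symbols are notation assumed for illustration.

```latex
r_{YX \cdot Z}
  = \frac{r_{YX} - r_{YZ}\, r_{XZ}}
         {\sqrt{\left(1 - r_{YZ}^{2}\right)\left(1 - r_{XZ}^{2}\right)}}
```

Equivalently, it is the simple correlation between the residuals from regressing Y on Z and the residuals from regressing X on Z.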
Glossary terms included in Unit 8
• 2 sample t-test
• Adjusted mean
• Categorical predictor (nominal and ordinal)
• Dichotomous predictor
• Dummy variable
• Main effects assumption