Unit 8: Categorical predictors, I: Dichotomies

© Judith D. Singer, Harvard Graduate School of Education


"There are two kinds of people in the world: Those who believe there are two kinds of people in the world and those who don't." –Robert Benchley, American Humorist (1889-1945)


The S-030 roadmap: Where’s this unit in the big picture?

Units:
• Unit 1: Introduction to simple linear regression
• Unit 2: Correlation and causality
• Unit 3: Inference for the regression model
• Unit 4: Regression assumptions: Evaluating their tenability
• Unit 5: Transformations to achieve linearity
• Unit 6: The basics of multiple regression
• Unit 7: Statistical control in depth: Correlation and collinearity
• Unit 8: Categorical predictors I: Dichotomies
• Unit 9: Categorical predictors II: Polychotomies
• Unit 10: Interaction and quadratic effects
• Unit 11: Regression modeling in practice

Stages of the roadmap:
• Building a solid foundation
• Mastering the subtleties
• Adding additional predictors
• Generalizing to other types of predictors and effects
• Pulling it all together


In this unit, we’re going to learn about…

• Categorical predictors and regression: An unusual marriage

• Creating and naming conventions for a dummy (or indicator) variable

• Regressing Y on a dummy variable: How this relates to the two-sample t-test
  – What happens if we change the reference category?

• Including a dummy variable in a multiple regression model: How and why it operates

• Another example of using the simple and partial correlation matrices to foreshadow results

• Adjusted means: A simple way of presenting findings for categorical question predictors

• Graphic displays of regression findings: How do you decide which effects to highlight?

• Displaying and interpreting prototypical trajectories


Categorical predictors and regression: An unusual marriage?

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$

The regression assumptions focus on Y and ε; there are no assumptions about the X's.

Categorical predictors are predictors whose values denote categories

Nominal predictors (unordered values): Sex, Religion, Political Party

Ordinal predictors (ordered values): Education, Religiosity, Neighborhood Integration

Another important distinction:
• Dichotomies (only 2 categories)
• Polychotomies (>2 categories)

Dummy (or indicator) variables

Variables whose values offer no meaningful quantitative information but simply distinguish between categories

$$FEMALE = \begin{cases} 1 & \text{if female} \\ 0 & \text{if male} \end{cases} \qquad TREAT = \begin{cases} 1 & \text{if treated} \\ 0 & \text{if control} \end{cases}$$

By convention, the variable name corresponds to the category given the value 1.

By convention, the category given the value 0 is called the reference category.
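The naming convention can be sketched in a few lines of Python (used here for illustration only; the course itself works in SAS). The helper `make_dummy` and the sample labels are hypothetical and not part of the course materials.

```python
def make_dummy(values, category_coded_one):
    """Return 1 where the value equals the category coded 1, else 0."""
    return [1 if v == category_coded_one else 0 for v in values]

# Made-up labels, for illustration only
sex = ["female", "male", "male", "female"]
group = ["control", "treated", "treated"]

# The variable name matches the category coded 1;
# the category coded 0 ("male", "control") is the reference category.
FEMALE = make_dummy(sex, "female")
TREAT = make_dummy(group, "treated")
print(FEMALE, TREAT)
```

Naming the variable after the category coded 1 means a positive slope can be read directly as "higher for that category" without consulting a codebook.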


Do “primary” seat belt laws save lives?

                               Seat      Miles
ID   State   DPFat   NOFat   BeltLaw    Driven
40   RI         44      13      0         7071
 2   AK         47      17      0         4387
46   VT         61      19      0         6466
35   ND         71       9      0         7123
51   WY         71      15      0         7576
30   NH         76      28      0        11202
 8   DE         84      25      0         8007
....
47   VA        630     147      0        70320
26   MO        775     143      0        62980
 1   AL        777     124      0        53458
43   TN        789     172      0        60526
14   IL        824     313      0        99319
23   MI        846     259      0        91755
36   OH        944     253      0       103675
39   PA        975     271      0        98015
10   FL       1478     835      0       134007
12   HI         83      36      1         7947
 7   CT        199      96      1        28552
32   NM        237      97      1        21937
38   OR        306      99      1        32268
16   IA        312      61      1        27984
21   MD        345     146      1        46609
19   LA        512     184      1        38840
37   OK        541     107      1        41400
15   IN        615     133      1        68620
33   NY        822     546      1       120778
34   NC        870     269      1        81893
11   GA        973     257      1        93317
 5   CA       1817    1102      1       285612
44   TX       2012     613      1       198700

Source: Calkins, LN & Zlatoper, TJ (2001). The effects of mandatory seat belt laws on motor vehicle fatalities in the United States, Social Science Quarterly, 82(4), 716-732

RQ: Do states with primary seat belt laws have lower traffic fatality rates?

n = 50
• 14 states had a mandatory primary seat belt law (28%)
• 36 states did not have a primary seat belt law (72%)

Hypothesis 1: Seat belt laws save lives because seat belts save lives

Hypothesis 2 (the offset hypothesis): Seat belts encourage riskier driving behavior that may offset any benefit associated with increased seat belt use

Variables:
• DPFat: Occupant fatalities (driver & passenger)
• NOFat: Non-occupant fatalities (pedestrians & bicyclists)
• SeatBeltLaw: = 0 if no law, = 1 if law
• Miles Driven: Potentially important covariate


Do seat belt laws save lives?

Number of fatalities, for car occupants and non-occupants, by presence of a primary seat belt law

                                         Occupants        Non-Occupants
States with a seatbelt law (n=14)        688.9 (583.8)    267.6 (295.4)
States without a seatbelt law (n=36)     416.1 (337.0)    123.7 (148.7)
Diff in means                            272.8            143.9
t (for diff)                             2.07             2.29
p (for diff)                             0.0439           0.0267

Note: Cell entries are estimated means (standard deviations in parentheses)

[Figure panels: Occupant Fatalities; Non-occupant Fatalities]

A two-sample t-test tests the null hypothesis that two population means are the same:

$$H_0: \mu_{law} = \mu_{no\,law}$$

Should we believe these t-tests?
• Is the homoscedasticity assumption tenable?
• Should we be concerned about the skewness of these outcomes?
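As a check on the table above, the t statistics can be recomputed from the reported means, standard deviations and group sizes. This is a minimal Python sketch (not the course's SAS), assuming the equal-variance (pooled) form of the two-sample t-test, which is exactly the homoscedasticity assumption being questioned; the function name `pooled_t` is hypothetical.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic assuming equal population variances."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff, n1 + n2 - 2

# Summary statistics from the table: law states (n=14) vs. no-law states (n=36)
t_occ, df = pooled_t(688.9, 583.8, 14, 416.1, 337.0, 36)
t_non, _ = pooled_t(267.6, 295.4, 14, 123.7, 148.7, 36)
print(f"occupants: t = {t_occ:.2f}, non-occupants: t = {t_non:.2f}, df = {df}")
```

The standard deviations in the two groups differ by a factor of nearly two for both outcomes, which is one reason to doubt the pooled test on the raw counts.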


Can transformation help make the outcome distributions more symmetric?

[Figure: stem-and-leaf displays with boxplots for four distributions: Occupant Fatalities and Non-occupant Fatalities on the raw scale, and Loge(Occupant Fatalities) and Loge(Non-occupant Fatalities) after transformation]
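To see what the log transformation does to skewness, here is a small Python sketch (not the course's SAS) using the DPFat values of the 14 states with a primary seat belt law from the data listing. The moment-based skewness formula and the function name `skewness` are choices made for this illustration.

```python
import math

def skewness(xs):
    """Moment-based skewness m3 / m2**1.5 (0 for a perfectly symmetric sample)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

# DPFat for the 14 states with a law (HI, CT, NM, ..., CA, TX in the listing)
dpfat_law = [83, 199, 237, 306, 312, 345, 512, 541, 615, 822, 870, 973, 1817, 2012]
ldpfat_law = [math.log(x) for x in dpfat_law]

mean_raw = sum(dpfat_law) / len(dpfat_law)
mean_log = sum(ldpfat_law) / len(ldpfat_law)
skew_raw = skewness(dpfat_law)
skew_log = skewness(ldpfat_law)
print(f"raw: mean = {mean_raw:.1f}, skewness = {skew_raw:.2f}")
print(f"log: mean = {mean_log:.2f}, skewness = {skew_log:.2f}")
```

The raw counts are pulled to the right by the two largest states, while the logged values are much closer to symmetric.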


Distribution of loge(n fatalities) by presence of seat belt law

Loge(number of fatalities), for car occupants and non-occupants, by presence of a primary seat belt law

                                         Loge(Occupants)   Loge(Non-Occupants)
States with a seatbelt law (n=14)        6.21 (0.87)       5.15 (0.94)
States without a seatbelt law (n=36)     5.64 (0.98)       4.28 (1.08)
Diff in means                            0.57              0.86
t (for diff)                             1.89              2.61
p (for diff)                             0.0643            0.0120

Note: Cell entries are estimated means (standard deviations in parentheses)



Simple regression with one dichotomous predictor: How & why it works

Two-sample t-test results for Loge(Occupant fatalities):

NO LAW           5.64
LAW              6.21
Diff in means    0.57
t (for diff)     1.89
p (for diff)     0.0643

Dependent Variable: LDPFat

                      Parameter    Standard
Variable        DF    Estimate     Error       t Value    Pr > |t|
Intercept        1    5.64258      0.15809     35.69      <.0001
SeatBeltLaw      1    0.56572      0.29877      1.89      0.0643

$$\widehat{LDPFat} = 5.64 + 0.57\,SeatBelt$$

States without laws (when SeatBelt = 0): $\widehat{LDPFat} = 5.64 + 0.57(0) = 5.64$

States with laws (when SeatBelt = 1): $\widehat{LDPFat} = 5.64 + 0.57(1) = 6.21$

The y-intercept is the estimated value of Y when the dichotomous predictor = 0 (here, the mean loge(occupant fatalities) for non-seat belt law states)

The slope is the estimated difference in Y between categories of the dichotomous predictor (here, the mean difference in Y between states with and without seat belt laws)
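The claim that the intercept is the reference-group mean and the slope is the difference in group means can be checked numerically. The sketch below is in Python rather than SAS, and the seven data points are made up for illustration; only the algebra carries over to the seat belt data.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical toy data: a 0/1 dummy and an outcome
law = [0, 0, 0, 0, 1, 1, 1]
y = [5.0, 5.5, 6.0, 5.9, 6.0, 6.4, 6.5]

intercept, slope = fit_simple_ols(law, y)

group0 = [yi for xi, yi in zip(law, y) if xi == 0]
group1 = [yi for xi, yi in zip(law, y) if xi == 1]
mean0 = sum(group0) / len(group0)
mean1 = sum(group1) / len(group1)

print(f"intercept = {intercept:.3f}, mean of reference group = {mean0:.3f}")
print(f"slope = {slope:.3f}, difference in means = {mean1 - mean0:.3f}")
```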


What happens if we change the “reference category” (the category for which X = 0)?

Original coding: SeatBeltLaw (0 = no law, 1 = law)

                      Parameter    Standard
Variable        DF    Estimate     Error       t Value    Pr > |t|
Intercept        1     5.64258     0.15809     35.69      <.0001
SeatBeltLaw      1     0.56572     0.29877      1.89      0.0643

Reversed coding: NoSeatBeltLaw (0 = law, 1 = no law)

                      Parameter    Standard
Variable        DF    Estimate     Error       t Value    Pr > |t|
Intercept        1     6.20830     0.25352     24.49      <.0001
NoSeatBeltLaw    1    -0.56572     0.29877     -1.89      0.0643

• Results of hypothesis tests for the slope are identical, regardless of how a dichotomous predictor is coded
• The intercept is always the estimated value of Y in the reference category
• The se and hypothesis test for the intercept change to focus on the new reference category
• The se of the slope remains the same
• The sign of the slope is reversed

What happens if we statistically control for covariates? Candidates: vehicle miles, weather, urbanicity
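The reversed-coding results follow from the original fit without refitting. This Python sketch (not the course's SAS) works from the reported estimates; it assumes the standard result that, with a single dummy predictor, the standard error of the intercept is the root MSE divided by the square root of the reference group's size. The root MSE is not reported for this model, so it is backed out from the original intercept's standard error.

```python
import math

# Reported estimates for LDPFat regressed on SeatBeltLaw (0 = no law, 1 = law)
b0, b1 = 5.64258, 0.56572
se_b0 = 0.15809
n_no_law, n_law = 36, 14

# Recoding X* = 1 - X gives Y-hat = (b0 + b1) + (-b1) X*
new_intercept = b0 + b1
new_slope = -b1

# Root MSE implied by the original fit (an inferred value, not printed on the slide)
rmse = se_b0 * math.sqrt(n_no_law)

# The intercept's se now reflects the smaller group of 14 law states
new_se_intercept = rmse / math.sqrt(n_law)

# The slope's se depends on both group sizes, so it does not change
se_slope = rmse * math.sqrt(1 / n_no_law + 1 / n_law)

print(f"new intercept = {new_intercept:.5f}, new slope = {new_slope:.5f}")
print(f"se(new intercept) = {new_se_intercept:.5f}, se(slope) = {se_slope:.5f}")
```

The intercept's standard error grows because the new reference category has fewer states, not because anything about the fit has changed.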


Vehicle Miles: A theoretically important covariate





[Figure: scatterplots of the two fatality outcomes against vehicle miles; each panel is annotated r = 0.96***]

What about the effect of SeatBeltLaws after controlling for LMiles?

[Figure: the same scatterplots with separate plotting symbols for states with seat belt laws and states without seat belt laws]

• Controlling for LMiles, states with laws have more non-occupant fatalities than states without laws
• Controlling for LMiles, states with laws have fewer occupant fatalities than states without laws


Including a dichotomous predictor in a MR model: How & why it works

$$\hat{Y} = \hat\beta_0 + \hat\beta_1\,SeatBelt + \hat\beta_2\,LMiles$$

When SeatBelt = 0:

$$\hat{Y}_{NSB} = \hat\beta_0 + \hat\beta_1(0) + \hat\beta_2\,LMiles = \hat\beta_0 + \hat\beta_2\,LMiles$$

When SeatBelt = 1:

$$\hat{Y}_{SB} = \hat\beta_0 + \hat\beta_1(1) + \hat\beta_2\,LMiles = (\hat\beta_0 + \hat\beta_1) + \hat\beta_2\,LMiles$$

$\hat\beta_1$ is the effect of Seat Belt Laws, controlling for vehicle miles.

Realize that these lines are parallel because we've assumed that they're parallel. This is the main effects assumption that we'll learn how to examine (and if necessary relax) in Unit 10.

In many fields, this model is known as an Analysis of Covariance (ANCOVA) model.


The effect of SeatBeltLaws on Occupant Fatalities (controlling for LMiles)

The REG Procedure
Dependent Variable: LDPFat

Number of Observations Read    50
Number of Observations Used    50

Analysis of Variance

                              Sum of        Mean
Source              DF        Squares       Square       F Value    Pr > F
Model                2       43.08715      21.54358      304.21     <.0001
Error               47        3.32849       0.07082
Corrected Total     49       46.41564

Root MSE          0.26612    R-Square    0.9283
Dependent Mean    5.80098    Adj R-Sq    0.9252
Coeff Var         4.58747

Parameter Estimates

                      Parameter    Standard
Variable        DF    Estimate     Error       t Value    Pr > |t|
Intercept        1    -4.12267     0.41399      -9.96     <.0001
SeatBeltLaw      1    -0.05016     0.08775      -0.57     0.5703
LMiles           1     0.95514     0.04026      23.72     <.0001

• Model is stat sig (p<.0001)
• R2 statistic is very high
• Estimated effect of vehicle miles is large: Controlling for seatbelt laws, states whose total vehicle miles differ by 1% have occupant fatalities that differ by an average of approximately 1% as well (p<.0001)
• The effect of seatbelt laws disappears: Controlling for vehicle miles, there is no statistically significant relationship between seatbelt laws and the number of occupant fatalities (p = 0.5703)
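Because both the outcome and vehicle miles are logged, the coefficients can be read as percentage differences. The Python sketch below (not the course's SAS) converts the two reported slopes; the variable names are mine.

```python
import math

# Reported estimates from the model with SeatBeltLaw and LMiles
b_seatbelt, b_lmiles = -0.05016, 0.95514

# A 1% difference in miles multiplies predicted fatalities by 1.01 ** b_lmiles
pct_per_1pct_miles = (1.01 ** b_lmiles - 1) * 100

# A 0-to-1 difference in the dummy multiplies predicted fatalities by exp(b_seatbelt)
pct_law_vs_no_law = (math.exp(b_seatbelt) - 1) * 100

print(f"1% more vehicle miles: about {pct_per_1pct_miles:.2f}% more occupant fatalities")
print(f"law vs. no law, at equal miles: about {pct_law_vs_no_law:.1f}%")
```

The seat belt figure is only a point estimate; with p = 0.5703 it is not distinguishable from zero in this model.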


What has happened???


Developing your instinct for the effects of a dichotomous predictor: Comparing uncontrolled and controlled effects on occupant fatalities

Estimated effect of having a primary seat belt law on number of occupant fatalities

                                  Estimate    t        p
Uncontrolled                      +0.57       1.89     .0643
Controlling for vehicle miles     -0.05      -0.57     .5703

$$\widehat{LDPFat} = -4.12 - 0.05\,SeatBelt + 0.96\,LMiles$$

When SeatBelt = 0:

$$\widehat{LDPFat} = -4.12 - 0.05(0) + 0.96\,LMiles = -4.12 + 0.96\,LMiles$$

When SeatBelt = 1:

$$\widehat{LDPFat} = -4.12 - 0.05(1) + 0.96\,LMiles = -4.17 + 0.96\,LMiles$$

[Figure: Loge(Occupant Fatalities) plotted against LMiles, with parallel fitted lines for states with laws and states without laws. Marked on the plot: 5.64 (states without laws, at LMiles = 10.22) and 6.21 (states with laws, at LMiles = 10.87), a difference of +0.57; and, at the two ends of the fitted lines, 4.00 vs. 3.95 and 7.34 vs. 7.29, a difference of -0.05 in each case]

States with Seat Belt Laws have many more vehicle miles. Might this explain why they have many more occupant fatalities???

Holding vehicle miles constant, states with Seat Belt Laws have no more occupant fatalities than those without laws
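The gap between the uncontrolled (+0.57) and controlled (-0.05) estimates can be accounted for arithmetically. This Python sketch (not the course's SAS) assumes that the values 10.22 and 10.87 marked on the slide's plot are the mean LMiles of states without and with laws, a reading that is inferred from the numbers rather than stated on the slide.

```python
# Reported estimates from the model with SeatBeltLaw and LMiles
b_seatbelt, b_lmiles = -0.05016, 0.95514

# Assumed group means of LMiles (values marked on the plot's x-axis)
mean_lmiles_no_law, mean_lmiles_law = 10.22, 10.87

# Part of the raw gap that simply reflects law states driving more miles
gap_due_to_miles = b_lmiles * (mean_lmiles_law - mean_lmiles_no_law)

# Raw (uncontrolled) gap = controlled effect + part explained by miles
uncontrolled_gap = b_seatbelt + gap_due_to_miles

print(f"explained by vehicle miles: {gap_due_to_miles:+.2f}")
print(f"controlled seat belt effect: {b_seatbelt:+.2f}")
print(f"implied uncontrolled difference: {uncontrolled_gap:+.2f}")
```

Under that reading, the whole uncontrolled difference and slightly more is attributable to differences in vehicle miles, which is why the controlled estimate turns slightly negative.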


What's the effect of SeatBelt laws on non-occupant fatalities?

The REG Procedure
Dependent Variable: LNOFat

Parameter Estimates

                      Parameter    Standard
Variable        DF    Estimate     Error       t Value    Pr > |t|
Intercept        1    -6.41013     0.50476     -12.70     <.0001
SeatBeltLaw      1     0.18784     0.10698       1.76     0.0857
LMiles           1     1.04607     0.04909      21.31     <.0001

• Model is stat sig (DUH!)
• R2 is very high (DUH!)
• Estimated effect of vehicle miles is large: Controlling for seatbelt laws, states whose total vehicle miles differ by 1% have non-occupant fatalities that differ by an average of approximately 1% as well (p<.0001)
• The effect of seatbelt laws diminishes: Controlling for vehicle miles, the effect of seatbelt laws on the number of non-occupant fatalities is no longer statistically significant at the 0.05 level (although this difference between stat sig & not stat sig is undoubtedly not stat sig!)

At this point in our analysis, our question predictor seems to have NO effect (controlling for LMILES)... Oy!


Making sense of the correlation matrix: Considering old & new variables

Pearson Correlation Coefficients, N = 50
Prob > |r| under H0: Rho=0 (second line of each row)

               LDPFat    LNOFat    SeatBeltLaw   LMiles    LTemp     PctUrban   lpopden
LDPFat         1.00000   0.92716   0.26363       0.96322   0.50966   0.33674    0.40214
                         <.0001    0.0643        <.0001    0.0002    0.0168     0.0038
LNOFat         0.92716   1.00000   0.35271       0.95525   0.50989   0.56415    0.50772
               <.0001              0.0120        <.0001    0.0002    <.0001     0.0002
SeatBeltLaw    0.26363   0.35271   1.00000       0.29585   0.33251   0.21303    0.20434
               0.0643    0.0120                  0.0370    0.0183    0.1374     0.1546
LMiles         0.96322   0.95525   0.29585       1.00000   0.41615   0.49945    0.53365
               <.0001    <.0001    0.0370                  0.0026    0.0002     <.0001
LTemp          0.50966   0.50989   0.33251       0.41615   1.00000   0.28925    0.33662
               0.0002    0.0002    0.0183        0.0026              0.0416     0.0168
PctUrban       0.33674   0.56415   0.21303       0.49945   0.28925   1.00000    0.67812
               0.0168    <.0001    0.1374        0.0002    0.0416               <.0001
lpopden        0.40214   0.50772   0.20434       0.53365   0.33662   0.67812    1.00000
               0.0038    0.0002    0.1546        <.0001    0.0168    <.0001

Correlation matrix guidance: Keep your eyes on the question predictor. See a control predictor with a big effect? Partial it out and look again...

• The two outcomes are highly correlated
• SeatBelt states have more fatalities (knew this from t-test results)
• Vehicle miles is a strong predictor. Partial LMILES out and look again
• SeatBelt states have more vehicle miles. And that seems to have been enough to make the SeatBelt effects disappear!
• Warmer states have more fatalities. Might want to include?
• More urban states have more fatalities (esp non-occupants). Might want to include?
• Warmer states have SeatBelt laws and more vehicle miles. Might controlling for LTemp change things?
• SeatBelt states are more urban, but this difference is not stat sig…
• More urban states are warmer & have more vehicle miles. Getting a sense that we need to make sure that we can really include all these additional predictors…
• The urbanicity variables are highly correlated. This does NOT mean they are definitely collinear, but we need to determine if both are needed in a model (but this does NOT mean we should only analyze one of them!)


What changes & what remains the same when we partial out LMiles?

Pearson Partial Correlation Coefficients, N = 50
Prob > |r| under H0: Partial Rho=0 (second line of each row)

               LDPFat     LNOFat     SeatBeltLaw   LTemp     PctUrban   lpopden
LDPFat          1.00000    0.08868   -0.08310      0.44532   -0.61999   -0.49233
                           0.5446     0.5703       0.0013    <.0001      0.0003
LNOFat          0.08868    1.00000    0.24809      0.41772    0.33969   -0.00821
                0.5446                0.0857       0.0028     0.0169     0.9554
SeatBeltLaw    -0.08310    0.24809    1.00000      0.24107    0.07887    0.05751
                0.5703     0.0857                  0.0952     0.5901     0.6947
LTemp           0.44532    0.41772    0.24107      1.00000    0.10334    0.14895
                0.0013     0.0028     0.0952                  0.4798     0.3071
PctUrban       -0.61999    0.33969    0.07887      0.10334    1.00000    0.56177
               <.0001      0.0169     0.5901       0.4798               <.0001
lpopden        -0.49233   -0.00821    0.05751      0.14895    0.56177    1.00000
                0.0003     0.9554     0.6947       0.3071    <.0001

• The two outcomes are no longer highly correlated. Hmmm...
• SeatBelt states no longer have more fatalities (we knew this from regression results)
• Warmer states still have more fatalities. Really need to include LTEMP, don't we!
• More urban states now have fewer occupant fatalities! Hmmm...
• Warmer states still are more likely to have SeatBelt Laws (but the partial is now n.s.)
• States with a greater %age of urban roads still have more non-occupant fatalities, but population density, by itself, no longer seems to matter. We probably want to include PctUrban, but we're now unsure about LPopDen; need to see what happens
• Urbanicity is now uncorrelated with either SeatBelt laws or temperature. So we can probably add at least one urbanicity variable (but need to check about both)
• The urbanicity variables are still highly correlated. But this still does NOT mean they are necessarily collinear!

• In general, the inter-correlations between predictors are smaller after we control for LMILES...
• But also, some of these correlations have changed sign! Hmmm...
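Each entry of the partial correlation matrix can be reproduced from three entries of the simple correlation matrix using the standard first-order partial correlation formula. A minimal Python sketch follows (not the course's SAS, which uses the partial statement in proc corr); the function name `partial_corr` is mine.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """Partial correlation of X and Y, controlling for a single variable Z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Simple correlations from the unit's correlation matrix (Z = LMiles)
r_law_miles = 0.29585
r_occ_miles = 0.96322
r_non_miles = 0.95525

p_law_occ = partial_corr(0.26363, r_law_miles, r_occ_miles)   # SeatBeltLaw with LDPFat
p_law_non = partial_corr(0.35271, r_law_miles, r_non_miles)   # SeatBeltLaw with LNOFat
p_occ_non = partial_corr(0.92716, r_occ_miles, r_non_miles)   # LDPFat with LNOFat

print(f"SeatBeltLaw, LDPFat | LMiles: {p_law_occ:.4f}")
print(f"SeatBeltLaw, LNOFat | LMiles: {p_law_non:.4f}")
print(f"LDPFat, LNOFat | LMiles:      {p_occ_non:.4f}")
```

The results agree with the SAS partial correlations to about three decimals; small discrepancies come from the inputs being rounded to five places.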


What happens when we control for these additional covariates?

Non-occupant fatalities

The REG Procedure
Dependent Variable: LNOFat

Parameter Estimates

                      Parameter    Standard
Variable        DF    Estimate     Error       t Value    Pr > |t|
Intercept        1    -9.48379     1.12472      -8.43     <.0001
SeatBeltLaw      1     0.10353     0.09409       1.10     0.2772
LMiles           1     0.97871     0.05106      19.17     <.0001
LTemp            1     0.91980     0.29841       3.08     0.0035
PctUrban         1     1.04986     0.31712       3.31     0.0019
lpopden          1    -0.09652     0.04100      -2.35     0.0231

• R2 statistic is even higher
• Effect of Vehicle Miles remains stable
• Positive effect of temperature: Controlling for all other variables in the model, states whose average temperatures are 1% higher have non-occupant fatalities that are about 0.92% higher
• Urbanicity variables tell a complex story. On the one hand, the higher the percentage of urban roads in a state, the higher the number of non-occupant fatalities; on the other hand, the higher the population density, the lower the number of non-occupant fatalities
• The effect of the seatbelt law has disappeared! Any observed differential in non-occupant fatalities between states with and without seatbelt laws is now well within the limits of sampling variation, once you control for vehicle miles, temperature and urbanicity. This suggests little support for the offset hypothesis


What happens when we control for these additional covariates?

Occupant fatalities

The REG Procedure
Dependent Variable: LDPFat

Parameter Estimates

                      Parameter    Standard
Variable        DF    Estimate     Error       t Value    Pr > |t|
Intercept        1    -8.40250     0.58703     -14.31     <.0001
SeatBeltLaw      1    -0.10059     0.04911      -2.05     0.0465
LMiles           1     1.01653     0.02665      38.14     <.0001
LTemp            1     1.10171     0.15575       7.07     <.0001
PctUrban         1    -0.87943     0.16551      -5.31     <.0001
lpopden          1    -0.06346     0.02140      -2.97     0.0049

• R2 statistic is almost perfect!
• Effect of Vehicle Miles remains stable
• Positive effect of temperature: Controlling for all other variables in the model, states whose average temperatures are 1% higher have occupant fatalities that are about 1.1% higher
• City driving is safer for car occupants. The higher the percentage of urban roads and the denser the population, the lower the number of occupant fatalities.
• The effect of the seatbelt law is now reversed! States with primary seat belt laws have lower numbers of occupant fatalities than states without these laws, once you control for vehicle miles, temperature and urbanicity


How would we present these regression results?

Results of fitting a series of multiple regression models predicting occupant and non-occupant fatalities, examining the effects of the presence of a primary seat belt law (n=50 states)

                        loge(Occupant Fatalities)             loge(Non-Occupant Fatalities)
Predictor               Model A     Model B     Model C      Model D     Model E     Model F

Intercept               5.64***     -4.12***    -8.40***     4.28***     -6.41***    -9.48***
                        (0.16)      (0.41)      (0.59)       (0.17)      (0.50)      (1.12)
                        35.69       -9.96       -14.31       24.52       -12.70      -8.43

Seatbelt Law            0.57~       -0.05       -0.10*       0.86*       0.19~       0.10
                        (0.30)      (0.09)      (0.05)       (0.33)      (0.10)      (0.09)
                        1.89        -0.57       -2.05        2.61        1.76        1.10

Loge(Vehicle Miles)                 0.96***     1.02***                  1.05***     0.98***
                                    (0.04)      (0.03)                   (0.05)      (0.05)
                                    23.72       38.14                    21.31       19.17

Loge(Mean Temp)                                 1.10***                              0.92**
                                                (0.16)                               (0.30)
                                                7.07                                 3.08

Pct Urban Roads                                 -0.88***                             1.05**
                                                (0.17)                               (0.32)
                                                -5.31                                3.31

Loge(Pop Density)                               -0.06**                              -0.09*
                                                (0.02)                               (0.04)
                                                -2.97                                -2.35

R2 (%)                  7.0         92.8        98.0         12.4        91.8        94.4
F                       3.59~       304.21      436.96       6.82*       262.68      148.84
(df)                    (1, 48)     (2, 47)     (5, 44)      (1, 48)     (2, 47)     (5, 44)
p                       0.0643      <.0001      <.0001       0.0120      <.0001      <.0001

Cell entries are estimated regression coefficients, (standard errors) and t-statistics
~ p<.10, *p<.05, **p<.01, ***p<.001

We (the people) conclude: Seatbelt laws save the lives of people in cars and don't hurt people on the streets
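The F statistics in the bottom rows of the table are tied to R2 by the usual identity F = (R2/k) / ((1 - R2)/(n - k - 1)). The Python sketch below (not the course's SAS) checks three of them; for the one-predictor models it uses the squared simple correlation between the outcome and SeatBeltLaw from the correlation matrix as R2.

```python
def f_from_r2(r2, n, k):
    """Overall F statistic for a model with k predictors fit to n cases."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

n = 50
f_model_a = f_from_r2(0.26363**2, n, 1)   # LDPFat on SeatBeltLaw
f_model_d = f_from_r2(0.35271**2, n, 1)   # LNOFat on SeatBeltLaw
f_model_b = f_from_r2(0.9283, n, 2)       # LDPFat on SeatBeltLaw and LMiles

print(f"Model A: F = {f_model_a:.2f}  (reported 3.59)")
print(f"Model D: F = {f_model_d:.2f}  (reported 6.82)")
print(f"Model B: F = {f_model_b:.2f}  (reported 304.21)")
```

The small discrepancy for Model B comes from R2 being rounded to four decimals in the printed output.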


Adjusted means: A simple way of presenting findings when your question predictor is CATEGORICAL (here, dichotomous)

To calculate adjusted means: Set all predictors, except for the categorical question predictor, to their sample means and then compute the predicted value of the outcome at each value of the categorical predictor.

Sample means: LMiles = 10.40, LTemp = 3.98, PctUrban = 0.52, LPopDen = 4.29

For occupants:

$$\log(\widehat{OccFat}) = -8.40 - 0.10\,SeatBelt + 1.02\,LMiles + 1.10\,LTemp - 0.88\,PctUrban - 0.06\,LPopDen$$

$$\log(\widehat{OccFat}) = -8.40 - 0.10\,SeatBelt + 14.27 = 5.87 - 0.10\,SeatBelt$$

SeatBelt = 0: $\log(\widehat{OccFat}) = 5.87$
SeatBelt = 1: $\log(\widehat{OccFat}) = 5.77$
Diff = -0.10, t = -2.05, p = 0.0465

For non-occupants:

$$\log(\widehat{NonOcc}) = -9.48 + 0.10\,SeatBelt + 0.98\,LMiles + 0.92\,LTemp + 1.05\,PctUrban - 0.09\,LPopDen$$

$$\log(\widehat{NonOcc}) = -9.48 + 0.10\,SeatBelt + 14.01 = 4.53 + 0.10\,SeatBelt$$

SeatBelt = 0: $\log(\widehat{NonOcc}) = 4.53$
SeatBelt = 1: $\log(\widehat{NonOcc}) = 4.63$
Diff = 0.10, t = 1.10, p = 0.2772
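The adjusted-means recipe is mechanical enough to script. This Python sketch (not the course's SAS) plugs the rounded coefficients and sample means from the slide into the fitted equations; the dictionary layout and the function name `adjusted_mean` are mine.

```python
# Sample means of the covariates, as given on the slide
covariate_means = {"LMiles": 10.40, "LTemp": 3.98, "PctUrban": 0.52, "LPopDen": 4.29}

# Rounded coefficients from the two fully controlled models
occupant = {"Intercept": -8.40, "SeatBelt": -0.10, "LMiles": 1.02,
            "LTemp": 1.10, "PctUrban": -0.88, "LPopDen": -0.06}
non_occupant = {"Intercept": -9.48, "SeatBelt": 0.10, "LMiles": 0.98,
                "LTemp": 0.92, "PctUrban": 1.05, "LPopDen": -0.09}

def adjusted_mean(coefs, means, seatbelt):
    """Predicted outcome with every covariate held at its sample mean."""
    value = coefs["Intercept"] + coefs["SeatBelt"] * seatbelt
    for name, mean in means.items():
        value += coefs[name] * mean
    return value

occ_no_law = adjusted_mean(occupant, covariate_means, 0)
occ_law = adjusted_mean(occupant, covariate_means, 1)
non_no_law = adjusted_mean(non_occupant, covariate_means, 0)
non_law = adjusted_mean(non_occupant, covariate_means, 1)

print(f"occupants:     no law {occ_no_law:.2f}, law {occ_law:.2f}")
print(f"non-occupants: no law {non_no_law:.2f}, law {non_law:.2f}")
```

Whatever values the covariates are held at, the difference between the two adjusted means equals the seat belt coefficient; holding them at their means just anchors the pair at a typical state.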


Presenting unadjusted and adjusted means

Loge(number of fatalities), for car occupants and non-occupants, by presence of a primary seat belt law, overall and adjusted for vehicle miles, average temperature, percentage of urban roads and population density

                    Loge(Occupant fatalities)      Loge(Non-occupant fatalities)
                    Unadjusted    Adjusted         Unadjusted    Adjusted
Law (n=14)          6.21          5.77             5.15          4.63
No Law (n=36)       5.64          5.87             4.28          4.53
Diff in means       0.57          -0.10            0.86          0.10
t (for diff)        1.89          -2.05            2.61          1.10
p (for diff)        0.0643        0.0465           0.0120        0.2772

Adjusted mean = controlling for temp, miles driven & urbanicity

The difference between the means of the dichotomous predictor's two categories is equal to the dichotomous predictor's slope coefficient in a particular model. For example, for occupant fatalities:

• Unadjusted means: 6.21 – 5.64 = 0.57 in the uncontrolled model
• Adjusted means: 5.77 – 5.87 = -0.10 in the controlled model


Another example presenting adjusted differences between groups

British Medical Journal, 2005, 331, 1306-1311


Towards a graphic display of the regression findings: Which predictors would we want to highlight in a graph?


Looking back at the table of Models A through F, predictor by predictor:

• Seatbelt Law: Question Predictor. Definitely want to document in a graph ...but as a dichotomy, we probably don't want it on the x-axis!
• Loge(Vehicle Miles): Obvious Covariate. Don't need to document in a graph ...so hold it constant at its mean?
• Loge(Mean Temp): Interesting Covariate. Might want to document in a graph ...if so, how?
• Pct Urban Roads: Interesting Covariate. Probably want to document in a graph ...Should we emphasize the difference in signs for the two outcomes?
• Loge(Pop Density): Small, similar effect. Don't need to document in a graph ...so hold it constant at its mean?


Sketching out the expected graph documenting the effects of Seatbelt laws, PctUrban and Temperature

Q1: With 2 outcomes, do I want 1 graph or 2?
Depends on where the lines fall

Q2: Which of the predictors should go on the X axis?
Usually the question predictor, but because SeatBelt is a dichotomy, I'm choosing PctUrban to highlight the sign difference

Q3: How should we display the effects of the other continuous predictor, LTemp?
Need to choose prototypical values: 'warm' and 'cold' states

Q4: What will the lines look like for states with and without seatbelt laws?
Attend now to ranking; worry about scale later


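Displaying prototypical trajectories, step one: Setting control variables at their means

Set LMiles at its mean (10.40) and LPopDen at its mean (4.29).

Occupant fatalities:

$$\log(\widehat{OccFat}) = -8.40 - 0.10\,SeatBelt + 1.02\,LMiles + 1.10\,LTemp - 0.88\,PctUrban - 0.06\,LPopDen$$

$$\log(\widehat{OccFat}) = -8.40 + 10.35 - 0.10\,SeatBelt + 1.10\,LTemp - 0.88\,PctUrban$$

$$\log(\widehat{OccFat}) = 1.95 - 0.10\,SeatBelt + 1.10\,LTemp - 0.88\,PctUrban$$

Non-occupant fatalities:

$$\log(\widehat{NonOcc}) = -9.48 + 0.10\,SeatBelt + 0.98\,LMiles + 0.92\,LTemp + 1.05\,PctUrban - 0.09\,LPopDen$$

$$\log(\widehat{NonOcc}) = -9.48 + 9.81 + 0.10\,SeatBelt + 0.92\,LTemp + 1.05\,PctUrban$$

$$\log(\widehat{NonOcc}) = 0.33 + 0.10\,SeatBelt + 0.92\,LTemp + 1.05\,PctUrban$$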


Displaying prototypical trajectories, step two: Computing predicted values for selected levels of the remaining predictors

$$\log(\widehat{OccFat}) = 1.95 - 0.10\,SeatBelt + 1.10\,LTemp - 0.88\,PctUrban$$

$$\log(\widehat{NonOcc}) = 0.33 + 0.10\,SeatBelt + 0.92\,LTemp + 1.05\,PctUrban$$

Choosing values for each predictor:
• SeatBelt: Only 2 values, 0 & 1
• PctUrban (mean = 0.52, sd = 0.15): Displayed on X axis; calculate at .33 and .66
• LTemp (mean = 3.98, sd = 0.15): Selecting prototypical temperature values
  – Cold, about 1 sd below the mean: 3.85 (47°F), ~ Illinois/Michigan
  – Warm, about 1 sd above the mean: 4.15 (63°F), ~ Mississippi/Tenn

Occupant Fatalities

SeatBelt   LTemp   PctUrban   Yhat
0          3.85    0.33       5.89
0          3.85    0.66       5.60
0          4.15    0.33       6.22
0          4.15    0.66       5.93
1          3.85    0.33       5.79
1          3.85    0.66       5.50
1          4.15    0.33       6.12
1          4.15    0.66       5.83

Non-Occupant Fatalities

SeatBelt   LTemp   PctUrban   Yhat
0          3.85    0.33       4.22
0          3.85    0.66       4.57
0          4.15    0.33       4.49
0          4.15    0.66       4.84
1          3.85    0.33       4.32
1          3.85    0.66       4.67
1          4.15    0.33       4.59
1          4.15    0.66       4.94
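The predicted values for the prototypical states come from evaluating the two reduced equations over a small grid. Here is a Python sketch (not the course's SAS) using the rounded coefficients above; the function names `occ_hat` and `non_occ_hat` are mine.

```python
import math
from itertools import product

def occ_hat(seatbelt, ltemp, pcturban):
    """Predicted loge(occupant fatalities), LMiles and LPopDen at their means."""
    return 1.95 - 0.10 * seatbelt + 1.10 * ltemp - 0.88 * pcturban

def non_occ_hat(seatbelt, ltemp, pcturban):
    """Predicted loge(non-occupant fatalities), LMiles and LPopDen at their means."""
    return 0.33 + 0.10 * seatbelt + 0.92 * ltemp + 1.05 * pcturban

# Every combination of law status, prototypical temperature and urbanicity
grid = list(product([0, 1], [3.85, 4.15], [0.33, 0.66]))

print("SeatBelt  LTemp  PctUrban  Occ    NonOcc")
for s, t, u in grid:
    print(f"{s:<9} {t:<6} {u:<9} {occ_hat(s, t, u):<6.2f} {non_occ_hat(s, t, u):.2f}")

# Back-transforming the prototypical LTemp values to degrees Fahrenheit
cold_f, warm_f = math.exp(3.85), math.exp(4.15)
print(f"cold = {cold_f:.0f} F, warm = {warm_f:.0f} F")
```

Each line on the final graph needs only two of these points, because the fitted relationship with PctUrban is linear on the log scale.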


The effects of seat belt laws, urbanicity & temperature on traffic fatalities, controlling for vehicle miles and population density

[Figure: Loge(Fatalities) (y-axis, 4.0 to 6.5) plotted against Pct Urban Roads (x-axis, 0.25 to 0.75). For Occupant Fatalities and for Non-occupant Fatalities, separate lines are drawn for Warm and Cold states, each with a Seat Belt Law and with No Law]

Note: These differences in occupant fatalities by Seat Belt Law are statistically significant

Note: These differences in non-occupant fatalities by Seat Belt Law are not statistically significant

What would this graph look like if we were to also “just control” for the effect of LTemp?


In some situations, you might prefer a simpler display

The effects of seat belt laws and urbanicity on traffic fatalities, controlling for vehicle miles, population density and temperature (with LTemp set at its mean of 3.98)

[Figure: Loge(Fatalities) (y-axis, 4 to 6.5) plotted against Pct Urban Roads (x-axis, 0.25 to 0.75), with a Seat Belt Law line and a No Law line for Occupant Fatalities and for Non-occupant Fatalities. Values marked on the graph: 5.87 and 5.77 for occupants, 4.53 and 4.63 for non-occupants]

Note: These differences in occupant fatalities by Seat Belt Law are statistically significant

Note: These differences in non-occupant fatalities by Seat Belt Law are not statistically significant

How does this graph relate to the adjusted means?


What’s the big takeaway from this unit?

• Regression models can easily include dichotomous predictors
  – All assumptions are about Y at particular values of X (or X's); there are no assumptions about the distribution of the predictors
  – The same toolkit we've developed for continuous predictors can be used for dichotomous predictors (including hypothesis tests, correlations and plots)

• Controlled effects are often different from uncontrolled effects
  – One of the major reasons we use multiple regression is that we have several predictors that affect the outcome for which we want to statistically control
  – Not only can we control for a single covariate, we can control for many covariates simultaneously (in this example, we had 4 covariates in addition to our question variable)

• Results of complex analyses can be displayed more simply using tables and graphs
  – As your models become more complex, the need for simpler numerical and graphical displays remains
  – Always important to think about how you will communicate your results to colleagues and broader audiences
  – Adjusted means and prototypical trajectories are powerful tools


Appendix: Annotated PC-SAS Code for Using Dichotomous Predictors

Note that this is just an excerpt from the full program.

*------------------------------------------------------------------*
 Creating boxplots of DPFAT & NOFAT distributions
 for SEATBELTLAW=0 and SEATBELTLAW=1
*------------------------------------------------------------------*;
proc boxplot data=one;
  title2 "Fatalities by Presence/Absence of SeatBelt Laws";
  plot (DPFat NOFat)*SeatBeltLaw;
run;

proc boxplot, when used for dichotomous predictors, creates pairs of boxplots comparing the outcome variables' values across the two categories of the dichotomous predictor. The plot statement specifies the outcome variables to be used and the dichotomous predictor. Its syntax is outcome*predictor (note the use of parentheses because of the two outcome variables).

*------------------------------------------------------------------*
 Display DPFAT & NOFAT univariate summary information in tables
 for SEATBELTLAW=0 & SEATBELTLAW=1
*------------------------------------------------------------------*;
proc means data=one;
  by SeatBeltLaw;
  var DPFat NOFat;
run;

proc means is a very useful tool to create table summaries of descriptive statistics, especially for categorical predictors. The by statement specifies the categorical predictor to be used in grouping the data (the data must already be sorted by that variable). The var statement specifies the variables for which you require descriptive statistics.

*------------------------------------------------------------------*
 Comparing mean values of DPFAT & NOFAT
 for SEATBELTLAW=0 and SEATBELTLAW=1
*------------------------------------------------------------------*;
proc ttest data=one;
  class SeatBeltLaw;
  var DPFat NOFat;
run;

proc ttest runs a two-sample t-test comparing the means of two groups. The class statement specifies the categorical predictor used to differentiate the two groups.

*------------------------------------------------------------------*
 For pedagogic purposes only: What happens if we change the
 reference category? Creating new dichotomous predictor NOSEATBELTLAW
*------------------------------------------------------------------*;
data one;
  set one;
  NoSeatBeltLaw = 1 - SeatBeltLaw;
run;

Use the data step in the middle of the program to add new variables to the same data. The set statement specifies to which dataset to add the variable. You can then run new PROCs on the same data, using the new variables.

*------------------------------------------------------------------*
 Controlling for vehicle miles
 Inspect bivariate scatterplots LDPFAT vs MILES, LDPFAT vs LMILES,
 LNOFAT vs MILES, LNOFAT vs LMILES
 Inspect same plots showing SEATBELTLAW=0 and SEATBELTLAW=1
*------------------------------------------------------------------*;
proc gplot data=one;
  title2 "Examining the effect of vehicle miles";
  plot (LDPFat LNOFat)*(miles lmiles);
  plot (LDPFat LNOFat)*(miles lmiles)=SeatBeltLaw;
run;

proc gplot can also be used to represent a three-way plot, with plotting symbols denoting the 3rd (here categorical) predictor. The plot statement syntax is outcome*predictor=categorical predictor. If you use a symbol statement in the program, SAS will use dots ● of different colors for each category of the predictor. Note you can have multiple plot statements in a single GPLOT.

*------------------------------------------------------------------*
 Estimating partial correlations controlling for LMILES
*------------------------------------------------------------------*;
proc corr data=one;
  title2 "Partial correlation matrix controlling for Lmiles";
  var LDPFat LNOFat SeatBeltLaw ltemp PctUrban lpopden;
  partial lmiles;
run;

proc corr estimates bivariate correlations between variables you specify. By adding a partial statement to the syntax, it will estimate partial correlations, controlling for the variable named in the partial statement.

Glossary terms included in Unit 8

• 2 sample t-test
• Adjusted mean
• Categorical predictor (nominal and ordinal)
• Dichotomous predictor
• Dummy variable
• Main effects assumption
