
Page 1: Unit 7: Statistical control in depth: Correlation and collinearity

© Judith D. Singer, Harvard Graduate School of Education Unit 7/Slide 1

Unit 7: Statistical control in depth: Correlation and collinearity

Page 2: Unit 7: Statistical control in depth: Correlation and collinearity

The S-030 roadmap: Where's this unit in the big picture?

Building a solid foundation
• Unit 1: Introduction to simple linear regression
• Unit 2: Correlation and causality
• Unit 3: Inference for the regression model

Mastering the subtleties
• Unit 4: Regression assumptions: Evaluating their tenability
• Unit 5: Transformations to achieve linearity

Adding additional predictors
• Unit 6: The basics of multiple regression
• Unit 7: Statistical control in depth: Correlation and collinearity (this unit)

Generalizing to other types of predictors and effects
• Unit 8: Categorical predictors I: Dichotomies
• Unit 9: Categorical predictors II: Polychotomies
• Unit 10: Interaction and quadratic effects

Pulling it all together
• Unit 11: Regression modeling in practice

Page 3: Unit 7: Statistical control in depth: Correlation and collinearity

In this unit, we're going to learn about…

• What is really meant by statistical control?
  – Is statistical control always possible? The problem of collinearity
• Learning how to examine a correlation matrix and what it foreshadows for multiple regression
• Using Venn diagrams to develop your intuition about correlation
  – Measuring the additional explanatory power of additional predictors
• Partial correlation: terminology, interpretation, and relationship to simple correlation
• Multiple correlation: its relationship to $R^2$
• Suppressor effects: When statistical control can help reveal an effect
• The dangers of multicollinearity:
  – what it is
  – how to spot it
  – what to do about it

Page 4: Unit 7: Statistical control in depth: Correlation and collinearity

When and why is statistical control important?

Randomized experiments: Statistical control is not as crucial

• The researcher actively intervenes in the system, observing how changes in X produce changes in Y

• Random assignment ensures that, on average, treated and control groups are equivalent on all observed and, even more importantly, unobserved variables

• Even so, statistical control still helps because it increases the precision of our estimates

16 March 1992

Lead, Lies and Data Tape

Two psychologists, both of whom have testified for the lead industry and one of whom has received tens of thousands of dollars in research grants from the industry, have filed misconduct charges against the scientist who first linked "low" levels of lead to cognitive problems in children. They don't suspect that Herbert Needleman of the University of Pittsburgh stole, faked or fabricated data. Rather, they say, he selected the data and the statistical model -- the equations for analyzing those data -- that show lead in the worst possible light…

The allegations center on a 1979 paper. It describes how Needleman and colleagues measured the lead in baby teeth, looking for a link between lead and intelligence. NIH told Pittsburgh to convene a panel of inquiry. The panel's report, submitted in December and obtained by NEWSWEEK, found that Needleman didn't "fabricate, falsify or plagiarize." It did have problems with how he decided whether or not to include particular children in his analysis, but called this "a result of a lack of scientific rigor rather than the presence of scientific misconduct." The panel found Needleman's statistical model "questionable," though. On that basis, the university launched an investigation.

Scarr, Ernhart and the Pittsburgh panel all condemn Needleman for not using a different model -- one that, say, factored in the age of each child. If he had, they say, lead would not have had an impact on IQ. But last year Environmental Protection Agency scientist (and recipient of a MacArthur Foundation "genius" award) Joel Schwartz reanalyzed Needleman's data. He factored in age explicitly. "I found essentially the identical results," he says.

Observational studies, sample surveys and quasi experiments: Statistical control is much more important

• With no active external intervention, individuals effectively "choose" their own values of X

• Individuals with particular values of X may differ on observed variables—this is when statistical control can help

• More problematic is when individuals with particular values of X may also differ on unobserved variables—then you need statistical methods that are more advanced than we cover in S-030

Page 5: Unit 7: Statistical control in depth: Correlation and collinearity

How statistical control can help, Example I: Cross-sectional study examining predictors of reading scores in elementary school

Model A: $\widehat{READING} = 0.03 + 0.10\,HEIGHT$

Taller children have higher reading scores. Do we really believe this, or is there a 3rd variable for which we should statistically control?

[Scatterplot: READING vs. HEIGHT for children in grades 1–6, with a single positively sloped fitted line]

Model B: $\widehat{READING} = 0.02 + 0.90\,GRADE + 0.01\,HEIGHT$

Older students read better (duh). Controlling for GRADE, there's no statistically significant relationship between reading scores and height. Main effect: we've assumed the effect to be the same across all grades = parallel lines.

[Scatterplot: READING vs. HEIGHT with six parallel fitted lines, one per grade 1–6]

Controlling for a predictor can stop us from concluding (erroneously) that a spurious correlation is real.

Page 6: Unit 7: Statistical control in depth: Correlation and collinearity

How statistical control can help, Example II: Does the availability of guns save lives (or kill people)?

Model A: $\widehat{CRIME} = 0.30 - 0.10\,GunLicenses$

Communities with more gun licenses have lower violent crime rates. Do we really believe this, or is there a 3rd variable—urbanicity—for which we should statistically control?

[Scatterplot: VIOLENT CRIME RATE vs. # GUN LICENSES, with a single negatively sloped fitted line]

Model B: $\widehat{CRIME} = 0.02 + 0.30\,URBAN + 0.30\,GunLicenses$

The more urban the community, the higher the violent crime rate. Controlling for urbanicity, there's now a positive relationship between gun licenses and the violent crime rate (the sign of the estimated regression coefficient is reversed!).

[Scatterplot: parallel fitted lines with positive slopes for communities ranging from very rural to very urban]

Controlling for a predictor can reveal or reverse the direction of an effect.

Page 7: Unit 7: Statistical control in depth: Correlation and collinearity

How statistical control should be able to help (but sometimes can't!): Sex discrimination in clerical salaries at Yale

Model A: $\widehat{Wages} = 25 - 2\,Female$
On average, women have lower wages than men.

Model B: $\widehat{JobStatus} = 10 - 2\,Female$
On average, women are in lower status jobs than men.

Model C: $\widehat{Wages} = 25 + 2\,JobStatus - 0.0001\,Female$
Higher status jobs pay more, and there's no statistically significant wage differential between men and women controlling for job status. But can we really control statistically for the effects of job status and really evaluate the effects of gender?

[Scatterplot: Wages vs. Job Status, with separate point clouds for men and women]

If predictors are "too highly" correlated with each other, we can't statistically control for the effect of one and evaluate the effect of the other: This is known as (multi)collinearity.

Page 8: Unit 7: Statistical control in depth: Correlation and collinearity

Two new predictors for USNews: Research Funding & Pct Doc Students

Peer Ratings of US Graduate Schools of Education

RQ: Does research production predict variation in the peer ratings of GSEs?
• Total Research $ (ResFund)
• Pct Doctoral Students (PctDoc)

ID  School          PeerRat    GRE    L2Doc    ResFund  PctDoc
 1  Harvard           450    6.625   5.90689    17.4     35.8
 2  UCLA              410    5.780   5.72792    36.4     46.8
 3  Stanford          470    6.775   5.24793    15.1     48.0
 4  TC                440    6.045   7.59246    30.1     37.5
 5  Vanderbilt        430    6.605   4.45943    23.0     48.6
 6  Northwestern      390    6.770   3.32193     8.8     47.0
 7  Berkeley          440    6.050   5.42626    12.0     56.3
 8  Penn              380    6.040   5.93074    19.0     41.0
 9  Michigan          430    6.090   5.24793    19.0     62.7
10  Madison           430    5.800   6.72792    25.5     53.8
 . . .

Predictor: ResFund — Mean 11.29540, Std Dev 8.13018
[Stem-and-leaf plot and boxplot: right-skewed distribution; UCLA and TC/NYU sit at the high end, with HGSE also near the top]

Predictor: PctDoc — Mean 38.1965517, Std Dev 16.3568807
[Stem-and-leaf plot and boxplot: annotated schools include Stanford and Claremont toward the high end, HGSE in the upper middle, and UC Riverside, USC, and Penn State toward the low end]

Page 9: Unit 7: Statistical control in depth: Correlation and collinearity

Relationship between Peer Ratings and the two new predictors

The REG Procedure — Dependent Variable: PeerRat

                         Sum of         Mean
Source           DF     Squares       Square    F Value    Pr > F
Model             1       42509        42509      27.24    <.0001
Error            85      132664   1560.74781
Corrected Total  86      175172

Root MSE         39.50630    R-Square    0.2427
Dependent Mean  344.82759    Adj R-Sq    0.2338
Coeff Var        11.45683

                  Parameter    Standard
Variable     DF    Estimate       Error    t Value    Pr > |t|
Intercept     1   313.93941     7.27801      43.14      <.0001
ResFund       1     2.73458     0.52398       5.22      <.0001

$\widehat{PeerRat} = 313.94 + 2.73\,ResFund$

The REG Procedure — Dependent Variable: PeerRat

                         Sum of         Mean
Source           DF     Squares       Square    F Value    Pr > F
Model             1       38775        38775      24.16    <.0001
Error            85      136397   1604.67212
Corrected Total  86      175172

Root MSE         40.05836    R-Square    0.2214
Dependent Mean  344.82759    Adj R-Sq    0.2122
Coeff Var        11.61692

                  Parameter    Standard
Variable     DF    Estimate       Error    t Value    Pr > |t|
Intercept     1   295.24240    10.96333      26.93      <.0001
PctDoc        1     1.29816     0.26408       4.92      <.0001

$\widehat{PeerRat} = 295.24 + 1.30\,PctDoc$

[Scatterplots of each fit, with HGSE and Stanford labeled]
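For reference, a minimal SAS sketch that would reproduce the two panels above, assuming (as in the appendix) that the USNews data are in a dataset named one:

proc reg data=one;
  model PeerRat = ResFund;   /* first panel: peer ratings on research funding */
  model PeerRat = PctDoc;    /* second panel: peer ratings on pct doctoral students */
run;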

Page 10: Unit 7: Statistical control in depth: Correlation and collinearity

Examining the correlation matrix, Step 1: Get output (using PROC CORR)

Pearson Correlation Coefficients, N = 87
Prob > |r| under H0: ρ = 0

           PeerRat     L2Doc       GRE    ResFund    PctDoc
PeerRat    1.00000   0.46393   0.65654   0.49261   0.47048
                      <.0001    <.0001    <.0001    <.0001
L2Doc      0.46393   1.00000   0.14528   0.51096   0.31777
            <.0001              0.1794    <.0001    0.0027
GRE        0.65654   0.14528   1.00000   0.40573   0.17045
            <.0001    0.1794              <.0001    0.1145
ResFund    0.49261   0.51096   0.40573   1.00000   0.05695
            <.0001    <.0001    <.0001              0.6003
PctDoc     0.47048   0.31777   0.17045   0.05695   1.00000
            <.0001    0.0027    0.1145    0.6003

Cell entries are r and its p-value, all with N = 87.

• Always list the outcome first so the table is easiest to read
• Notice the symmetry
• Like most computer output, it provides "too much detail": 2 decimal places and *'s usually suffice (e.g., r = 0.32**, r = 0.17 (ns)); * p<0.05, ** p<0.01, *** p<0.001

Page 11: Unit 7: Statistical control in depth: Correlation and collinearity

Examining the correlation matrix, Step 2: Create a summary table

Correlation Matrix for Peer Ratings of Graduate Schools of Education (n=87)

                     Peer      Log2(N     Mean    Research   Pct Doc
                    Rating   doc grads)   GRE     Funding   students
Peer Rating           1
Log2(N doc grads)   0.46***      1
Mean GRE            0.66***    0.15        1
Research Funding    0.49***    0.51***   0.41***     1
Pct Doc students    0.47***    0.32**    0.17      0.06        1

* p<0.05, ** p<0.01, *** p<0.001

• The correlation between each predictor and Peer Ratings is statistically significant (p<0.001). We already knew this on the basis of the simple linear regressions, but typically we'd estimate these correlations before looking at those regression results.
• The correlation between our two original predictors—GRE and L2Doc—is not statistically significant.
• Research funding is significantly correlated (p<0.001) with both program size and mean GRE scores.
• The percentage of doctoral students is significantly correlated (p<0.01) with the log(# of doctoral students), but not with either mean GRE or Research Funding.
• What do these correlations foreshadow for multiple regression? The information in research funding may be redundant with other variables already in the model, but the information in PctDoc may explain additional variation in Peer Ratings.

Page 12: Unit 7: Statistical control in depth: Correlation and collinearity

A visual inspection of correlations: PeerRat vs. each predictor

[Four scatterplots of PeerRat against each predictor]
r = 0.46 (L2Doc), r = 0.66 (GRE), r = 0.49 (ResFund), r = 0.47 (PctDoc)

Page 13: Unit 7: Statistical control in depth: Correlation and collinearity

The scatterplot matrix: A graphic correlation matrix

[Scatterplot matrix of PeerRat, L2Doc, GRE, ResFund, and PctDoc, with the correlations overlaid:
PeerRat: .46 (L2Doc), .66 (GRE), .49 (ResFund), .47 (PctDoc); L2Doc: .15 (GRE), .51 (ResFund), .32 (PctDoc); GRE: .41 (ResFund), .17 (PctDoc); ResFund: .06 (PctDoc)]
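One way to draw such a scatterplot matrix in SAS (a sketch, not the course's own code; PROC SGSCATTER requires a reasonably recent SAS release, and the dataset name one follows the appendix):

proc sgscatter data=one;
  matrix PeerRat L2Doc GRE ResFund PctDoc;  /* all pairwise scatterplots */
run;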

Page 14: Unit 7: Statistical control in depth: Correlation and collinearity

Questions we can ask about correlations between variables: How Venn diagrams can help us understand complex interrelationships

One outcome (Y) and 2 predictors (X1 and X2) generate 3 correlations to examine:

• Correlation between each predictor and Y: $r_{Y1}$ and $r_{Y2}$
• Correlation between the two predictors: $r_{12}$

Our learning goal: To understand the interrelationships among the correlations

• How much variation in Y is explained by X1 and X2 together?
• How much variation in Y is explained by X1 after controlling for X2?
• How much variation in Y is explained by X2 after controlling for X1?

[Venn diagrams: circles for Y, X1, and X2, drawn both with and without overlap between X1 and X2]

Page 15: Unit 7: Statistical control in depth: Correlation and collinearity

Contrasting Venn diagrams with uncorrelated and correlated predictors

Uncorrelated predictors
Uncorrelated predictors are very rare, arising mostly in designed experiments. We can compute the overall $R^2$ by just summing the separate $R^2$'s:

$R^2_{Y|12} = R^2_{Y|1} + R^2_{Y|2}$

where $R^2_{Y|1}$ is the $R^2$ predicting Y using only X1, and $R^2_{Y|2}$ is the $R^2$ predicting Y using only X2.

[Venn diagram: X1 and X2 do not overlap each other; each overlaps Y separately]

Correlated predictors
Correlated predictors are very common, arising in almost all studies. We can't just sum the separate $R^2$ statistics because of the overlap. Labeling the portions of Y explained by X1 only, X2 only, and both jointly as "a", "b", and "c":

$R^2_{Y|12} = a + b + c$
$R^2_{Y|1} = a + c$
$R^2_{Y|2} = b + c$

[Venn diagram: X1 and X2 overlap each other and Y; regions a, b, and c mark the portions of Y they explain]

How do correlations between predictors affect their joint utility?
• Highly correlated predictors: Jointly explained portion "c" is large; additional independent portions "a" and "b" are small
• Fairly uncorrelated predictors: Jointly explained portion "c" is small; additional independent portions "a" and "b" are large
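Not on the slide, but a quick sanity check using numbers from this unit: ResFund and PctDoc are nearly uncorrelated (r = 0.06), so the $R^2$ statistics from their separate simple regressions (24.3% and 22.1%, Models D and E later in this handout) should come close to adding when both enter one model:

$$R^2_{Y|12} \approx R^2_{Y|1} + R^2_{Y|2} = 0.243 + 0.221 = 0.464$$

The sum is only approximate, because r = 0.06 is small but not exactly zero.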

Page 16: Unit 7: Statistical control in depth: Correlation and collinearity

Measuring the additional explanatory power of an additional predictor

Assuming that X1 is already in the model, how can we measure X2's additional contribution, over and above that already explained by X1?

[Venn diagram: Y partitioned into regions a (explained by X1 only), b (explained by X2 only), c (explained jointly), and d (unexplained)]

$\text{Total var}(Y) = a + b + c + d$
$\text{Residual var}(Y|X_1) = b + d$
$\text{Prop of Res Var}(Y|X_1)\text{ explained by }X_2 = \dfrac{b}{b+d}$

Clarifying terminology and notation
• Simple correlation, $r_{Y2}$ and $R^2_{Y|2}$: Proportion of variation in Y associated with X2
• Multiple correlation, $R^2_{Y|12}$: Proportion of variation in Y associated with both X1 and X2
• Partial correlation, $r_{Y2|1}$: "Y2" identifies the variables being correlated; "|1" identifies the variable(s) being controlled (or partialled out)

Squared simple correlation: $r_{Y2}^2 = R^2_{Y|2} = \dfrac{b+c}{a+b+c+d}$

Squared partial correlation: $r_{Y2|1}^2 = \dfrac{b}{b+d}$

How are partials related to simple correlations? Comparing these 2 equations, we see that "b" and "d" are in both denominators. So the relationship between simples and partials depends upon the size of "a" and "c" relative to "b" and "d".
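For computing (rather than visualizing) a first-order partial correlation, the standard formula, consistent with the Venn definitions above though not printed on the slide, is:

$$r_{Y2|1} = \frac{r_{Y2} - r_{Y1}\,r_{12}}{\sqrt{\left(1 - r_{Y1}^2\right)\left(1 - r_{12}^2\right)}}$$

This formula can be used to verify the reading/height/grade partials reported later in the unit.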

Page 17: Unit 7: Statistical control in depth: Correlation and collinearity

Understanding the relationship between partial and simple correlations

Recall the two squared correlations: simple $= \dfrac{b+c}{a+b+c+d}$ and partial $= \dfrac{b}{b+d}$.

• Partials can equal simples. When "a" and "c" are small, the two fractions are nearly identical: Simple ≈ Partial. Most common reason: X1 is relatively uncorrelated with Y. [Venn diagram: X1 overlaps little of Y]

• Partials can be smaller than simples. When "c" is large (and "a" isn't very large): Partial < Simple. Most common reason: X1 is very highly correlated with X2. [Venn diagram: X1 and X2 overlap heavily]

• Partials can be greater than simples. When "a" is large (and "c" is large or small): Partial > Simple. Most common reason: X1 is very highly correlated with Y. [Venn diagram: X1 covers much of Y]

Page 18: Unit 7: Statistical control in depth: Correlation and collinearity

Partial correlations for the USNews data, controlling for L2Doc

Pearson Partial Correlation Coefficients, N = 87
Prob > |r| under H0: Rho=0

           PeerRat       GRE    ResFund    PctDoc
PeerRat    1.00000   0.67217    0.33561   0.38461
                      <.0001     0.0016    0.0003
GRE        0.67217   1.00000    0.38978   0.13249
            <.0001               0.0002    0.2240
ResFund    0.33561   0.38978    1.00000  -0.12934
            0.0016    0.0002               0.2353
PctDoc     0.38461   0.13249   -0.12934   1.00000
            0.0003    0.2240     0.2353

Cell entries are r and its p-value, all with N = 87.

• Continue to list the outcome first so the table is easiest to read
• Again, notice the symmetry
• Like most computer output, it provides "too much detail"
• Major decision: Which variable(s), if any, should we partial out?

Page 19: Unit 7: Statistical control in depth: Correlation and collinearity

Understanding the link between partial correlations and MR

Partial correlations with PeerRat, controlling for L2Doc:

           partial r    p-value
GRE         0.67217      <.0001
ResFund     0.33561      0.0016
PctDoc      0.38461      0.0003

Three multiple regressions, each pairing L2Doc with one additional predictor:

                  Parameter    Standard
Variable     DF    Estimate       Error    t Value    Pr > |t|
Intercept     1   -87.29494    43.07364      -2.03      0.0459
L2Doc         1    15.34201     2.94746       5.21      <.0001
GRE           1    63.31660     7.60956       8.32      <.0001

Intercept     1   262.96680    20.06665      13.10      <.0001
L2Doc         1    11.70357     4.31624       2.71      0.0081
ResFund       1     1.91994     0.58799       3.27      0.0016

Intercept     1   233.68011    19.46286      12.01      <.0001
L2Doc         1    14.25165     3.83448       3.72      0.0004
PctDoc        1     0.99151     0.25964       3.82      0.0003

Partial correlations quantify the association between two variables after controlling statistically for one (or more) predictors. Multiple regression models quantify exactly the same kind of association. The two are intimately linked: the p-value for the partial correlation between Y and a predictor, say X2, after controlling statistically for other predictors, say just X1, is identical to the p-value for the slope coefficient for X2 in a multiple regression model that includes both X2 and X1. Compare: GRE (<.0001 in both), ResFund (0.0016 in both), and PctDoc (0.0003 in both).

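A minimal sketch of the paired analyses being compared here (dataset name one as in the appendix). The p-value printed for the PeerRat–GRE partial correlation should match the p-value for GRE's slope in the regression:

proc corr data=one;
  partial L2Doc;               /* control for program size */
  var PeerRat GRE;
run;

proc reg data=one;
  model PeerRat = L2Doc GRE;   /* GRE's t-test carries the identical p-value */
run;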

Page 20: Unit 7: Statistical control in depth: Correlation and collinearity

Comparing simple and partial correlations for the USNews data

Correlation Matrix for Peer Ratings of Graduate Schools of Education (n=87): simple correlations over partial correlations (controlling for Log2(N doctoral graduates))

                     Peer      Log2(N     Mean    Research
                    Rating   doc grads)   GRE     Funding
Peer Rating           1
Log2(N doc grads)   0.46***      1
                      --
Mean GRE            0.66***    0.15        1
                    0.67***     --
Research Funding    0.49***    0.51***   0.41***     1
                    0.34**      --       0.39***
Pct Doc students    0.47***    0.32**    0.17      0.06
                    0.38***     --       0.13     -0.13

Cell entries are simple correlations (first row) and partial correlations (second row). * p<0.05, ** p<0.01, *** p<0.001

• The partial correlation with mean GRE is virtually unchanged, while the partial correlations with Research Funding and PctDoc students decline (but are still statistically significant). [This makes sense because Log2(N doc grads) was virtually uncorrelated with mean GRE, but was significantly correlated with Research Funding and PctDoc students.]
• Research Funding remains correlated with Mean GRE after controlling for program size.
• PctDoc students remains uncorrelated with the other predictors, even after controlling for program size.

Page 21: Unit 7: Statistical control in depth: Correlation and collinearity

Results of fitting additional MR models to USNews data

Comparison of regression models predicting peer ratings of US Graduate Schools of Education (n=87) (US News and World Report, 2005)

Predictor                 Model C      Model D      Model E      Model F      Model G
Intercept                 -87.29*     313.93***    295.24***    -66.47       -78.34~
                          (43.07)      (7.28)      (10.96)      (47.95)      (39.73)
                           -2.03       43.14        26.93        -1.39        -1.97
Log2(N doctoral grads)     15.34***                              13.65***     11.91***
                           (2.95)                                (3.40)       (2.85)
                            5.21                                  4.01         4.19
Mean GRE scores            63.32***                              60.13***     59.56***
                           (7.61)                                (8.26)       (7.07)
                            8.32                                  7.28         8.43
Research Funding                        2.73***                   0.50
                                       (0.52)                    (0.50)
                                        5.22                      0.99
PctDoc                                               1.30***                   0.78***
                                                    (0.26)                    (0.19)
                                                     4.92                      4.01
R2 (%)                     57.0         24.3         22.1         57.5         64.0
F (df)                    55.63        27.24        24.16        37.40        49.10
                          (2, 84)      (1, 85)      (1, 85)      (3, 83)      (3, 83)
p                        <0.0001      <0.0001      <0.0001      <0.0001      <0.0001

Cell entries are estimated regression coefficients, (standard errors), and t-statistics. ~ p<0.10, * p<0.05, ** p<0.01, *** p<0.001

Some things to consider when selecting models to present:
• Does the model chosen reflect your underlying theory?
• Does the model allow you to address the effects of your key question predictor(s)?
• Are you unnecessarily including predictors you could reasonably set aside (the parsimony principle)?
• Are you excluding predictors that are statistically significant? [If so, why exclude them?]
• Always realize that NO model is ever "final"
• We'll spend much, much, much more time on this topic in Unit 11

Page 22: Unit 7: Statistical control in depth: Correlation and collinearity

Is it always possible to statistically control?

Our language for MR has used many terms for statistical control:
• Controlling for X1
• Holding X1 constant
• Removing the effects of X1
This language assumes that we can really hold X1 constant and X2 will still vary across its full range. But is this always true? What happens if holding X1 constant dramatically restricts the range in X2—can we really statistically control for one predictor and evaluate the effects of another?

Example: National Child Care Survey
• n = 45 two parent Latino families
• RQ: What is the relationship between parental education and family income?
• Two parental education predictors: Mother's and father's education

Two parent Latino families in the National Child Care Survey: Predicting family income as a function of mother's and father's education (n=45)

Predictor             Model A      Model B      Model C      Model D
Intercept            -103.54       -5.91       -141.19      -133.74
                     (148.31)    (114.27)      (149.33)     (140.34)
                       -0.70       -0.05         -0.95        -0.95
Mother's education     31.58**                   19.60
                      (11.27)                   (14.12)
                        2.80                      1.39
Father's education                 25.67**       15.89
                                   (9.18)       (11.50)
                                    2.80          1.38
Average education                                             35.02**
(AverEduc)                                                   (11.01)
                                                               3.18
R2 (%)                 15.4         15.4         19.1         19.1
F (df)                 7.84         7.83         4.96        10.12
                      (1, 43)      (1, 43)      (2, 42)      (1, 43)
p                     0.0076       0.0077       0.0116       0.0027

Cell entries are estimated regression coefficients, (standard errors), and t-statistics. * p<0.05, ** p<0.01, *** p<0.001

Page 23: Unit 7: Statistical control in depth: Correlation and collinearity

Multicollinearity: What it is, why it happens, how to spot it, and what to do

Correlation Matrix (NCCS data, n=45)

                      Income    Mother's education
Income                 1.00
Mother's education     0.39**
Father's education     0.39**        0.61***

[Venn diagram: Income, MomEd, and DadEd, with MomEd and DadEd overlapping heavily]

What is multicollinearity? When two (or more) predictors are so highly correlated that we cannot statistically control for one predictor and evaluate the effect of the other(s). Examples:
• Mother's & father's education
• Gender and job status at Yale
• Family background & school resources

How to spot multicollinearity
• Controlled & uncontrolled slopes differ dramatically for two (or more) predictors
• The estimated controlled slopes make no sense (e.g., the signs appear wrong!)
• Standard errors increase with added predictors
• You reject the omnibus F test but fail to reject individual t-tests for the constituent predictors

What to do about multicollinearity (see the SAS sketch below)
• Use better research designs—especially randomized trials—that eliminate confounding
• Collect more data, especially "unusual cases"
• Collapse collinear predictors into a composite
• Include just one of the collinear predictors in your MR model (but be sure to explain what you did and why you did it)

What happens when we create a composite—Average Education—for the NCCS data? (See Model D on the previous slide.)
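A sketch of how this might look in SAS. The VIF option on PROC REG's MODEL statement is standard; the dataset and variable names (nccs, faminc, momed, daded) are hypothetical stand-ins for the NCCS variables:

/* Diagnose: large variance inflation factors flag collinear predictors */
proc reg data=nccs;
  model faminc = momed daded / vif;
run;

/* One remedy: collapse the collinear predictors into a composite */
data nccs2;
  set nccs;
  avereduc = (momed + daded) / 2;  /* average parental education */
run;

proc reg data=nccs2;
  model faminc = avereduc;
run;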

Page 24: Unit 7: Statistical control in depth: Correlation and collinearity

Caution: Don't assume that all strongly correlated predictors are collinear

Correlation Matrix for Reading data

           Reading    Height
Reading       1
Height      0.65         1
Grade       0.75       0.85

[Scatterplot: READING vs. HEIGHT with parallel fitted lines for grades 1–6; Venn diagram of Reading, Height, and Grade]

$r_{Reading,Height|Grade} = 0.04$ — Partialling out Grade, there's virtually no effect of Height.

$r_{Reading,Grade|Height} = 0.49$ — Partialling out Height, there's still an effect of Grade: holding Height constant, there is still variation in Grade, and that variation is associated with Reading.

Conclusion: Height and grade are strongly correlated, but not collinear. Don't abuse the phrase…
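To verify these two partials, plug the correlations from the matrix above into the partial-correlation formula from the slide 16 discussion:

$$r_{Reading,Height|Grade} = \frac{0.65 - (0.75)(0.85)}{\sqrt{(1-0.75^2)(1-0.85^2)}} = \frac{0.0125}{0.348} \approx 0.04$$

$$r_{Reading,Grade|Height} = \frac{0.75 - (0.65)(0.85)}{\sqrt{(1-0.65^2)(1-0.85^2)}} = \frac{0.1975}{0.400} \approx 0.49$$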

Page 25: Unit 7: Statistical control in depth: Correlation and collinearity

Coda: Sometimes the direction of an effect can change upon statistical control!

Suppressor effects in predicting faculty salaries at the University of Kansas

Correlation matrix for salary survey data

            SALARY      RANK     DEPT HD
SALARY       1.00
RANK         0.66***     1.00
DEPT HD      0.69***     0.30**     1.00
YRS SERV     0.13        0.61***   -0.08

* p<.05  ** p<.01  *** p<.001

$\widehat{SALARY} = 64{,}217 + 10{,}493\,RANK + 15{,}905\,DEPTHD - 362\,YRSSERV$    ($R^2$ = 65%)

• Higher ranked professors have higher salaries
• Department heads have higher salaries
• Higher ranked professors are more likely to be department heads
• The more years of service, the higher the rank

But there are two very surprising findings concerning Years of Service:
• No correlation between years of service and salary?
• No correlation between years of service and being a department head?

[Scatterplot: Salary ($60,000–$100,000) vs. Years of Service at University (0–20), with separate fitted lines for Asst, Assoc, and Full Professors and for Dept Heads vs. non-Heads; within rank, the fitted lines decline with years of service]

"Salary compression…the failure of the organization to recognize seniority with adequate compensation increase while meeting current market values for lower ranked individuals hired into the institution." McCulley & Downey (1993) Salary compression in faculty salaries: Identification of a suppressor effect. Educational and Psychological Measurement, 53, 79-86.
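A check that is not on the slide but follows from the correlation matrix above: the same partial-correlation formula shows how controlling for rank reverses the years-of-service effect:

$$r_{Salary,YrsServ|Rank} = \frac{0.13 - (0.66)(0.61)}{\sqrt{(1-0.66^2)(1-0.61^2)}} = \frac{-0.273}{0.595} \approx -0.46$$

The simple correlation is a negligible 0.13; once rank is held constant, the association turns clearly negative. That is the suppressor (salary compression) effect.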

Page 26: Unit 7: Statistical control in depth: Correlation and collinearity

Start looking at the results sections of papers in your substantive fields…

Michal Kurlaender & John Yun (2007) Measuring school racial composition and student outcomes in a multiracial society, American Journal of Education, 113, 213-242.

Page 27: Unit 7: Statistical control in depth: Correlation and collinearity

Another example of presenting regression results in journals

Barbara Pan, Meredith Rowe, Judith Singer and Catherine Snow (2005) Maternal correlates of growth in toddler vocabulary production in low-income families, Child Development, 76(4), 763-782.

Page 28: Unit 7: Statistical control in depth: Correlation and collinearity

What's the big takeaway from this unit?

• Statistical control is a very powerful tool
  – The ability to statistically control for the effects of some predictors when evaluating the effects of other predictors greatly expands the utility of statistical models
  – It allows you to acknowledge the effects of some predictors and then put all individuals on a "level playing field" that holds those controlled predictors constant
• The pattern of correlations can help presage multiple regression results
  – Learn how to examine a correlation matrix and foreshadow how the predictors will behave in a multiple regression model
  – If you have one (or more) control predictors, consider examining a partial correlation matrix that removes that effect
• Controlled effects can be similar to or different from uncontrolled effects
  – The effects of some predictors will persist upon statistical control while the effects of others will change
  – Be sure to examine how your predictors' effects change as you fit more complex statistical models
  – Ask yourself whether the observed changes make sense
• Beware of the dangers of multicollinearity
  – Sometimes it isn't possible to statistically control
  – When your predictors are highly correlated, you may think you're statistically controlling for the effects of one when you're evaluating the effects of the other, but this may not be possible
  – But similarly, just because predictors are highly correlated, don't assume that you'll have collinearity problems

Page 29: Unit 7: Statistical control in depth: Correlation and collinearity

Appendix: Annotated PC-SAS Code for Estimating Partial Correlations

Note that the handouts include only annotations for the needed additional code. For the complete program, check program "Unit 7—Statistical Control in Depth" on the website. Note also that this annotation builds on the knowledge from "Unit 2 – Correlation and Causality".

proc corr data=one;
  var PeerRat L2Doc GRE ResFund PctDoc;
run;

proc corr estimates simple correlations between the variables specified. Its var statement syntax is var1 var2 var3 … varn.

proc corr data=one;
  partial l2doc;
  var PeerRat GRE ResFund PctDoc;
run;

proc corr can also estimate partial correlations. Use a partial statement to identify the variable(s) being controlled (partialled out).

Glossary terms included in Unit 7

• Correlation
• Cross-sectional data
• Main effects assumption/model
• Multicollinearity
• Statistical control