Lecture 7: Multiple Linear Regression Interpretation with different types of predictors BMTRY 701 Biostatistical Methods II




Page 1:

Lecture 7: Multiple Linear Regression
Interpretation with different types of predictors

BMTRY 701 Biostatistical Methods II

Page 2:

Interpreting regression coefficients

So far, we’ve considered continuous covariates. Covariates can take other forms:

• binary• nominal categorical• quadratics (or other transforms)• interactions

Interpretations may vary depending on the nature of your covariate

Page 3:

Binary covariates

Considered ‘qualitative’. The ordering of numeric assignments does not matter. Examples: MEDSCHL: 1 = yes; 2 = no. More popular examples:

• gender• mutation• pre vs. post-menopausal• Two age categories

Page 4:
Page 5:

How is MEDSCHL related to LOS?

How to interpret β1?

Coding of variables:
• 2 vs. 1
• I prefer 1 vs. 0
• difference? the intercept.

Let’s make a new variable:
• MS = 1 if MEDSCHL = 1 (yes)
• MS = 0 if MEDSCHL = 2 (no)

LOS_i = β0 + β1·MEDSCHL_i + e_i

Page 6:

How is MEDSCHL related to LOS?

What does β1 mean?

Same model, yet a different interpretation than for a continuous covariate. What if we had used the old coding?

With the new coding (MS = 1 if medical school, 0 if not):

LOS_i = β0 + β1·MS_i + e_i

E[LOS_i | MS_i = 0] = β0 + β1·0 = β0
E[LOS_i | MS_i = 1] = β0 + β1·1 = β0 + β1

With the old coding (MEDSCHL = 1 if yes, 2 if no):

LOS_i = β0 + β1·MEDSCHL_i + e_i

E[LOS_i | MEDSCHL_i = 1] = β0 + β1·1 = β0 + β1
E[LOS_i | MEDSCHL_i = 2] = β0 + β1·2 = β0 + 2·β1
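To see the “difference is the intercept” point numerically, here is a minimal sketch in R. The variable med01 is a hypothetical recoding that simply shifts MEDSCHL down by one, preserving the order of the categories:

data$med01 <- data$MEDSCHL - 1
coef(lm(LOS ~ MEDSCHL, data=data))  # slope for a one-unit increase in MEDSCHL (1 to 2)
coef(lm(LOS ~ med01, data=data))    # same slope; only the intercept shifts (by one slope-unit)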

Page 7:

R code

> data$ms <- ifelse(data$MEDSCHL==2, 0, data$MEDSCHL)
> table(data$ms, data$MEDSCHL)

     1  2
  0  0 96
  1 17  0

> reg <- lm(LOS ~ ms, data=data)
> summary(reg)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   9.4105     0.1871  50.290  < 2e-16 ***
ms            1.5807     0.4824   3.276  0.00140 **
---

Residual standard error: 1.833 on 111 degrees of freedom
Multiple R-squared: 0.08818, Adjusted R-squared: 0.07997
F-statistic: 10.73 on 1 and 111 DF, p-value: 0.001404

Page 8:

Scatterplot? Residual Plot?

[Residual plot: res vs. data$ms]

res <- reg$residuals
plot(data$ms, res)
abline(h=0)

[Scatterplot: data$LOS vs. data$ms]

Page 9:

Fitted Values

Only two fitted values:

Diagnostic plots are not as informative. Extrapolation and interpolation are meaningless! We can estimate LOS for MS = 0.5:

• LOS = 9.41 + 1.58*0.5 = 10.20
• Try to interpret the result…

Ŷ[MS_i = 0] = β̂0 = 9.41
Ŷ[MS_i = 1] = β̂0 + β̂1 = 9.41 + 1.58 = 10.99
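A quick way to reproduce that meaningless interpolation in R, assuming the fitted object reg from the LOS ~ ms model above:

predict(reg, newdata = data.frame(ms = 0.5))   # about 10.20
# There is no hospital “halfway” between having and not having a medical school,
# so this fitted value has no sensible interpretation.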

Page 10:

“Linear” regression?

But, what about the ‘linear’ assumption?
• we still need to adhere to the model assumptions
• recall that they relate to the residuals primarily
• residuals are independent and identically distributed:

The model is still linear in the parameters!

e_i ~ N(0, σ²), independently and identically distributed

Page 11:

MLR example: Add infection risk to our model

> reg <- lm(LOS ~ ms + INFRISK, data=data)
> summary(reg)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.4547     0.5146  12.542   <2e-16 ***
ms            0.9717     0.4316   2.251   0.0263 *
INFRISK       0.6998     0.1156   6.054    2e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.595 on 110 degrees of freedom
Multiple R-squared: 0.3161, Adjusted R-squared: 0.3036
F-statistic: 25.42 on 2 and 110 DF, p-value: 8.42e-10

How does interpretation change?

Page 12:

What about more than two categories?

We looked briefly at region a few lectures back. How to interpret? You need to define a reference category. For med school:
• reference was ms=0
• almost ‘subconscious’ with only two categories

With >2 categories, need to be careful of interpretation

Page 13:

LOS ~ REGION

Note how the ‘indicator’ or ‘dummy’ variable is defined:
• I(condition) = 1 if condition is true
• I(condition) = 0 if condition is false

LOS_i = β0 + β1·I(R_i = 2) + β2·I(R_i = 3) + β3·I(R_i = 4) + e_i
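In R, factor(REGION) builds these indicator columns behind the scenes. A minimal sketch of constructing them by hand (R2, R3, R4 are hypothetical column names), which should reproduce the same fit:

data$R2 <- as.numeric(data$REGION == 2)
data$R3 <- as.numeric(data$REGION == 3)
data$R4 <- as.numeric(data$REGION == 4)
summary(lm(LOS ~ R2 + R3 + R4, data=data))   # same fit as LOS ~ factor(REGION), reference = region 1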

Page 14:

Interpretation

β0 =

β1 =

β2 =

β3 =

E[LOS_i | R_i = 1] = β0
E[LOS_i | R_i = 2] = β0 + β1
E[LOS_i | R_i = 3] = β0 + β2
E[LOS_i | R_i = 4] = β0 + β3

LOS_i = β0 + β1·I(R_i = 2) + β2·I(R_i = 3) + β3·I(R_i = 4) + e_i

Page 15:

hypothesis tests?

H0: β1 = 0 vs. Ha: β1 ≠ 0. What does that test (in words)?

H0: β2 = 0 vs. Ha: β2 ≠ 0. What does that test (in words)?

What if we want to test region, in general? One of our next topics!

Page 16:

R

> reg <- lm(LOS ~ factor(REGION), data=data)
> summary(reg)

Call:
lm(formula = LOS ~ factor(REGION), data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-3.05893 -1.03135 -0.02344  0.68107  8.47107

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      11.0889     0.3165  35.040  < 2e-16 ***
factor(REGION)2  -1.4055     0.4333  -3.243  0.00157 **
factor(REGION)3  -1.8976     0.4194  -4.524 1.55e-05 ***
factor(REGION)4  -2.9752     0.5248  -5.669 1.19e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.675 on 109 degrees of freedom
Multiple R-squared: 0.2531, Adjusted R-squared: 0.2325
F-statistic: 12.31 on 3 and 109 DF, p-value: 5.376e-07

Page 17:

Interpreting

Is mean LOS different in region 2 vs. 1? What about region 3 vs. 1 and 4 vs. 1?

What about region 4 vs. 3?

How to test that? Two options?

• recode the data so that 3 or 4 is the reference
• use knowledge about the variance of linear combinations to estimate the p-value for the difference in the coefficients

For now…we’ll focus on the first.

Page 18:

Make REGION=4 the reference

Our model then changes:

LOS_i = β0 + β1·I(R_i = 3) + β2·I(R_i = 2) + β3·I(R_i = 1) + e_i

E[LOS_i | R_i = 4] = β0
E[LOS_i | R_i = 3] = β0 + β1
E[LOS_i | R_i = 2] = β0 + β2
E[LOS_i | R_i = 1] = β0 + β3

Page 19:

R code: recoding so last category is reference

> data$rev.region <- factor(data$REGION, levels=rev(sort(unique(data$REGION))) )

> reg <- lm(LOS ~ rev.region, data=data)
> summary(reg)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.1137     0.4186  19.381  < 2e-16 ***
rev.region3   1.0776     0.5010   2.151  0.03371 *
rev.region2   1.5697     0.5127   3.061  0.00277 **
rev.region1   2.9752     0.5248   5.669 1.19e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.675 on 109 degrees of freedom
Multiple R-squared: 0.2531, Adjusted R-squared: 0.2325
F-statistic: 12.31 on 3 and 109 DF, p-value: 5.376e-07
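An equivalent recoding, a small sketch using base R's relevel() (region.ref4 is a hypothetical name); it gives the same coefficient estimates as the rev.region fit, just listed in a different order:

data$region.ref4 <- relevel(factor(data$REGION), ref = "4")
summary(lm(LOS ~ region.ref4, data=data))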

Page 20:

Quite a few differences:

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   8.1137     0.4186  19.381  < 2e-16 ***
rev.region3   1.0776     0.5010   2.151  0.03371 *
rev.region2   1.5697     0.5127   3.061  0.00277 **
rev.region1   2.9752     0.5248   5.669 1.19e-07 ***

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      11.0889     0.3165  35.040  < 2e-16 ***
factor(REGION)2  -1.4055     0.4333  -3.243  0.00157 **
factor(REGION)3  -1.8976     0.4194  -4.524 1.55e-05 ***
factor(REGION)4  -2.9752     0.5248  -5.669 1.19e-07 ***

Page 21:

But the “model” is the same

Model 1: Residual standard error: 1.675 on 109 degrees of freedom

Model 2: Residual standard error: 1.675 on 109 degrees of freedom

The models represent the data equally well. However, the ‘reparameterization’ yields a difference in interpretation of the model parameters.

Page 22:

Diagnostics

# residual plot
reg <- lm(LOS ~ factor(REGION), data=data)
res <- reg$residuals
fit <- reg$fitted.values
plot(fit, res)
abline(h=0, lwd=2)

[Residual plot: res vs. fit for LOS ~ factor(REGION)]

Page 23:

Diagnostics

# residual plot
reg <- lm(logLOS ~ factor(REGION), data=data)
res <- reg$residuals
fit <- reg$fitted.values
plot(fit, res)
abline(h=0, lwd=2)

[Residual plot: res vs. fit for logLOS ~ factor(REGION)]

Page 24:

Next type: polynomials

Most common: quadratic. What does that mean? Including a linear and ‘squared’ term. Why? To adhere to model assumptions! Example:
• last week we saw LOS ~ NURSE
• quadratic actually made some sense

logLOS_i = β0 + β1·NURSE_i + β2·NURSE_i² + e_i

Page 25:

Scatterplot

[Scatterplot: data$logLOS vs. data$NURSE]

Page 26:

Fitting the model

> data$nurse2 <- data$NURSE^2
> reg <- lm(logLOS ~ NURSE + nurse2, data=data)
> summary(reg)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.090e+00  3.645e-02  57.355  < 2e-16 ***
NURSE        1.430e-03  3.525e-04   4.058 9.29e-05 ***
nurse2      -1.789e-06  6.262e-07  -2.857  0.00511 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.161 on 110 degrees of freedom
Multiple R-squared: 0.1965, Adjusted R-squared: 0.1819
F-statistic: 13.45 on 2 and 110 DF, p-value: 5.948e-06

Interpretable?
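The same quadratic model can be fit without creating nurse2 by hand, a small sketch using I() in the formula (reg.alt is a hypothetical name):

reg.alt <- lm(logLOS ~ NURSE + I(NURSE^2), data=data)
coef(reg.alt)   # matches the NURSE + nurse2 fit above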

Page 27:

How does it fit?

[Scatterplot: data$logLOS vs. data$NURSE with the fitted quadratic curve]

# make regression line
plot(data$NURSE, data$logLOS, pch=16)
coef <- reg$coefficients
nurse.values <- seq(15, 650, 5)
fit.line <- coef[1] + coef[2]*nurse.values + coef[3]*nurse.values^2
lines(nurse.values, fit.line, lwd=2)

Note: ‘abline’ will only work for simple linear regression. When there is more than one predictor, you need to make the line another way.
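One alternative to computing the fitted curve by hand is to use predict() on a grid of NURSE values; a sketch assuming the quadratic fit reg from above (nurse.grid is a hypothetical name):

nurse.grid <- data.frame(NURSE = seq(15, 650, 5))
nurse.grid$nurse2 <- nurse.grid$NURSE^2
lines(nurse.grid$NURSE, predict(reg, newdata=nurse.grid), lwd=2)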

Page 28:

Another approach to the same data

Does it make sense that it increases, and then decreases?

Or, would it make more sense to increase, and then plateau?

Which do you think makes more sense?

How to tell?
• use a data-driven approach
• tells us “what do the data suggest?”

Page 29:

Smoothing

An empirical way to look at the relationship: the data are ‘binned’ by x, and for each ‘bin’ the average y is estimated. But it is a little fancier:
• it is a ‘moving average’
• each x value is in multiple bins

Modern methods use models within bins:
• Lowess smoothing
• Cubic spline smoothing

Specifics are not so important: what matters is the “empirical” result.

Page 30:

smoother <- lowess(data$NURSE, data$logLOS)
plot(data$NURSE, data$logLOS, pch=16)
lines(smoother, lwd=2, col=2)
lines(nurse.values, fit.line, lwd=2)
legend(450, 3, c("Quadratic Model","Lowess Smooth"),
       lty=c(1,1), lwd=c(2,2), col=c(1,2))

[Scatterplot: data$logLOS vs. data$NURSE with Quadratic Model and Lowess Smooth curves]

Page 31:

Inference?

What do the data say? Looks like a plateau. How can we model that? One option: a spline. Zeger: the “broken arrow” model. Example: looks like a “knot” at NURSE = 250.
• there is a linear increase in logLOS until about NURSE=250
• then, the relationship is flat
• this implies one slope prior to NURSE=250, and one after NURSE=250

Page 32:

Implementing a spline

A little tricky. We need to define a new variable, NURSE*.

And then we write the model as follows:

NURSE*_i = 0               if NURSE_i ≤ 250
NURSE*_i = NURSE_i − 250   if NURSE_i > 250

logLOS_i = β0 + β1·NURSE_i + β2·NURSE*_i + e_i

Page 33:

How to interpret?

When in doubt, condition on different scenarios. What is E(logLOS) when NURSE<250?

What is E(logLOS) when NURSE>250?

E[logLOS_i | NURSE_i ≤ 250] = β0 + β1·NURSE_i + β2·0
                            = β0 + β1·NURSE_i

E[logLOS_i | NURSE_i > 250] = β0 + β1·NURSE_i + β2·(NURSE_i − 250)
                            = (β0 − 250·β2) + (β1 + β2)·NURSE_i

Page 34:

R

data$nurse.star <- ifelse(data$NURSE<=250, 0, data$NURSE-250)
data$nurse.star
reg.spline <- lm(logLOS ~ NURSE + nurse.star, data=data)

# make regression line
coef.spline <- reg.spline$coefficients
nurse.values <- seq(15, 650, 5)
nurse.values.star <- ifelse(nurse.values<=250, 0, nurse.values-250)
spline.line <- coef.spline[1] + coef.spline[2]*nurse.values + coef.spline[3]*nurse.values.star

plot(data$NURSE, data$logLOS, pch=16)
lines(smoother, lwd=2, col=2)
lines(nurse.values, fit.line, lwd=2)
lines(nurse.values, spline.line, col=4, lwd=3)
legend(450, 3, c("Quadratic Model","Lowess Smooth","Spline Model"),
       lty=c(1,1,1), lwd=c(2,2,3), col=c(1,2,4))

Page 35:

Interpreting the output

> summary(reg.spline)

Call:
lm(formula = logLOS ~ NURSE + nurse.star, data = data)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.1073877  0.0332135  63.450  < 2e-16 ***
NURSE        0.0010278  0.0002336   4.399 2.52e-05 ***
nurse.star  -0.0011114  0.0004131  -2.690  0.00825 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1616 on 110 degrees of freedom
Multiple R-squared: 0.1902, Adjusted R-squared: 0.1754
F-statistic: 12.91 on 2 and 110 DF, p-value: 9.165e-06

How do we interpret the coefficient on nurse.star?
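One way to answer that: the slope after the knot is the sum of the two slope coefficients. A quick check, assuming the fitted object reg.spline from above:

coef(reg.spline)["NURSE"] + coef(reg.spline)["nurse.star"]
# about 0.00103 - 0.00111 = -0.00008, i.e. essentially flat after NURSE = 250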

Page 36:

Why subtract the 250 in defining nurse.star? It ‘calibrates’ where the two pieces of the line meet. If it is not included, then they will not connect.

[Scatterplot: data$logLOS vs. data$NURSE with Quadratic Model, Lowess Smooth, and Spline Model curves]

Page 37:

Why a spline vs. the quadratic?

• it fits well!
• it is more interpretable
• it makes sense
• it is less sensitive to the outliers

Can be generalized to have more ‘knots’
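A sketch of how a second knot could be added, using a hypothetical knot at NURSE = 450 (the knot location and variable names are illustrative, not from the lecture):

data$nurse.star2 <- ifelse(data$NURSE<=450, 0, data$NURSE-450)
reg.spline2 <- lm(logLOS ~ NURSE + nurse.star + nurse.star2, data=data)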

Page 38:

Next time

ANOVA F-tests