Introduction to Biostatistics, Harvard Extension School, Fall 2007
© Scott Evans, Ph.D. and Lynne Peeples, M.S.

Regression Continued… Prediction, Model Evaluation, Multivariate Extensions, & ANOVA


Page 1

Regression Continued… Prediction, Model Evaluation, Multivariate Extensions, & ANOVA

Page 2

Variables of Interest?

• One (continuous) variable: methods from before the midterm…
• Two variables
    - Both continuous
        - Interested in predicting one from another: Simple Linear Regression
        - Interested in presence of association
            - Both variables normal: Pearson Correlation
            - Not normal: Spearman (Rank) Correlation
    - One continuous, one categorical: ANOVA*
• More than two variables: Multiple Linear Regression

*Note: If the categorical variable is ordinal, Spearman (rank) correlation methods are applicable…

Page 3

Correlation Review

[Scatterplots of y vs. x for Example 1 and Example 2, with quadrants I, II, III, IV marked around the point (x̄, ȳ).]

Page 4

Correlation Review: More Correlation!

[Scatterplots of y vs. x for Example 1 and Example 2, with quadrants I, II, III, IV marked around the point (x̄, ȳ).]

Page 5

Simple Linear Regression Review

Find the distance to the new line, $\hat{y}$, and to the "naïve" guess, $\bar{y}$:

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

[Scatterplot of y vs. x with the fitted line, a point at $(x_i, y_i)$, and the horizontal line at $\bar{y}$.]

Page 6

Simple Linear Regression Review

Find the distance to the new line, $\hat{y}$, and to the "naïve" guess, $\bar{y}$:

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

For each observed point $(x_i, y_i)$, the total distance from $\bar{y}$ splits into two pieces:

$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$, where $y_i - \hat{y}_i = e_i$ is the residual.

[Scatterplot showing the vertical distances from $(x_i, y_i)$ to the fitted line and from the fitted line to $\bar{y}$.]

Page 7

Linear Regression Continued…

1. Predicted Values
2. Model Evaluation
3. No longer "simple"… MULTIPLE Linear Regression
4. Parallels to "Analysis of Variance," aka ANOVA

Page 8

1. Predicted Values

Last week, we conducted hypothesis tests and CIs for the slope of our linear regression model.

However, we might also be interested in estimating the mean (and/or individual) value of y for particular values of x.

Page 9

Predicted Values: Newborn Infant Length Example

Last week, we came up with the least-squares line for the mean length of low birth weight babies, where y = length and x = gestational age (weeks):

$\hat{y} = 9.329 + 0.952\,x$    (the "hat" denotes an estimate)

What is the predicted mean length of infants at 20 weeks? 30 weeks?

Page 10

Predicted Values: Newborn Infant Length Example

We can make a point estimate. Let x = 29 weeks:

$\hat{y} = 9.329 + 0.952\,(29) = 36.93$ cm

Now, we are interested in a CI around this…
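A minimal sketch of this point prediction in Python (the coefficients come from the slide; the function name is mine):

    # Least-squares line from the example: y-hat = 9.329 + 0.952 * x
    b0, b1 = 9.329, 0.952

    def predicted_mean_length(weeks):
        """Predicted mean length (cm) at a given gestational age (weeks)."""
        return b0 + b1 * weeks

    print(predicted_mean_length(29))  # 36.937, reported as 36.93 on the slide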

Page 11

Predicted Values: CIs

Confidence interval for $\hat{y}$:

$\left(\hat{y} - t_{n-2,\,1-\alpha/2}\,\widehat{\mathrm{s.e.}}(\hat{y}),\ \ \hat{y} + t_{n-2,\,1-\alpha/2}\,\widehat{\mathrm{s.e.}}(\hat{y})\right)$

In order to calculate this interval, we need the standard error of $\hat{y}$:

$\widehat{\mathrm{s.e.}}(\hat{y}) = s_{y|x}\,\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$

Note, we get a different standard error of $\hat{y}$ for each x.

Page 12

Predicted Values: Newborn Infant Length Example

Notice that as x gets further from $\bar{x}$, the standard error gets larger (leading to a wider confidence interval):

$\widehat{\mathrm{s.e.}}(\hat{y}) = s_{y|x}\,\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$

$\mathrm{s.e.}(\hat{y})$ at 29 weeks = 0.265 cm

Page 13

Predicted Values: Newborn Infant Length Example

Plugging in x = 29 weeks and $\mathrm{s.e.}(\hat{y})$ = 0.265:

$\big(36.93 - 1.98\,(0.265),\ 36.93 + 1.98\,(0.265)\big) = (36.41,\ 37.45)$

The 95% CI for the mean length of infants at 29 weeks of gestation is (36.41, 37.45).
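The same interval, computed as a sketch using only the numbers on the slide (point estimate 36.93, s.e. 0.265, t ≈ 1.98):

    # 95% CI for the mean response at x = 29 weeks
    y_hat, se, t = 36.93, 0.265, 1.98
    lower, upper = y_hat - t * se, y_hat + t * se
    print(round(lower, 2), round(upper, 2))  # 36.41 37.45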

Page 14

Predicted Values: CIs

We can do the same for an individual infant…

Confidence interval for $\tilde{y}$:

$\left(\tilde{y} - t_{n-2,\,1-\alpha/2}\,\widehat{\mathrm{s.e.}}(\tilde{y}),\ \ \tilde{y} + t_{n-2,\,1-\alpha/2}\,\widehat{\mathrm{s.e.}}(\tilde{y})\right)$

In order to calculate this interval, we need the standard error (always larger than the standard error of $\hat{y}$):

$\widehat{\mathrm{s.e.}}(\tilde{y}) = s_{y|x}\,\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$

Note, we get a different standard error of $\tilde{y}$ for each x.

Page 15

Predicted Values: Newborn Infant Length Example

Again, as x gets further from $\bar{x}$, the standard error gets larger (leading to a wider confidence interval):

$\widehat{\mathrm{s.e.}}(\tilde{y}) = s_{y|x}\,\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}$

$\mathrm{s.e.}(\tilde{y})$ at x = 29 (an individual infant at 29 weeks) = 2.661 cm. There is much more variability at this level.

Page 16

Predicted Values: Newborn Infant Length Example

Plugging in x = 29 weeks and $\mathrm{s.e.}(\tilde{y})$ = 2.661 (note, the point estimate of $\tilde{y}$ equals $\hat{y}$):

$\big(36.93 - 1.98\,(2.661),\ 36.93 + 1.98\,(2.661)\big) = (31.66,\ 42.20)$

The 95% CI for the length of an individual infant at 29 weeks of gestation is (31.66, 42.20), a wider interval compared to (36.41, 37.45) for $\hat{y}$.
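The two standard-error formulas side by side, as a sketch (x is the vector of observed gestational ages and s_yx stands for $s_{y|x}$; both are stand-ins, not values from the slides):

    import numpy as np

    def se_mean_response(x0, x, s_yx):
        """s.e. of the estimated MEAN response at x0."""
        x = np.asarray(x, dtype=float)
        n, xbar = len(x), x.mean()
        return s_yx * np.sqrt(1.0 / n + (x0 - xbar) ** 2 / np.sum((x - xbar) ** 2))

    def se_individual(x0, x, s_yx):
        """s.e. for predicting an INDIVIDUAL response at x0 (extra 1 under the root)."""
        x = np.asarray(x, dtype=float)
        n, xbar = len(x), x.mean()
        return s_yx * np.sqrt(1.0 + 1.0 / n + (x0 - xbar) ** 2 / np.sum((x - xbar) ** 2))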

Page 17

2. Model Evaluation

Just how well does our model fit the data?
• Homoscedasticity (residual plots)
• Coefficient of Determination (R²)

Page 18

Review of Assumptions

Assumptions of the linear regression model:

1. The y values are distributed according to a normal distribution with mean $\mu_{y|x}$ and variance $\sigma^2_{y|x}$ that is unknown.
2. The relationship between X and Y is given by the formula $\mu_{y|x} = \beta_0 + \beta_1 x$.
3. The y values are independent.
4. For every value x, the standard deviation of the outcomes y is constant and equal to $\sigma_{y|x}$. This concept is called homoscedasticity.

Page 19

Model Evaluation: Homoscedasticity

[Scatterplot of y vs. x with the fitted line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$, a point $(x_i, y_i)$, and its residual $e_i = y_i - \hat{y}_i$.]

Page 20

Model Evaluation: Homoscedasticity

Calculate the residual distance $e_i = y_i - \hat{y}_i$ for each $(x_i, y_i)$. In the end, we have n $e_i$'s.

[Scatterplot of y vs. x with the fitted line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$.]

Page 21

Model Evaluation: Homoscedasticity

Now we plot each of the $e_i$'s against the fitted values ($\hat{y}$'s):
• Are the residuals increasing or decreasing as the fitted values get larger? They should be fairly consistent across $\hat{y}$'s.
• Look for outliers; if present, we may want to remove them and refit the line.

[Residual plot: residuals (roughly −4 to 4) vs. fitted values, scattered evenly around zero.]
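A sketch of such a residual plot with matplotlib (the data here are simulated purely for illustration):

    import numpy as np
    import matplotlib.pyplot as plt

    # Toy fitted values and observations with homoscedastic errors
    rng = np.random.default_rng(0)
    y_hat = np.linspace(30, 50, 60)
    y = y_hat + rng.normal(0, 2, size=60)

    plt.scatter(y_hat, y - y_hat)    # one e_i per observation
    plt.axhline(0, linestyle="--")   # residuals should straddle zero evenly
    plt.xlabel("Fitted Values (y-hats)")
    plt.ylabel("Residuals")
    plt.show()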

Page 22

Model Evaluation: Homoscedasticity

Example of heteroscedasticity: increasing variability as the fitted values increase. This suggests a nonlinear relationship… we may need to transform the data.

[Residual plot fanning out as fitted values increase, alongside the corresponding y vs. x scatterplot.]

Page 23

Model Evaluation: Coefficient of Determination

R² is a measure of the relative predictive power of a model, i.e., the proportion of variability in Y that is explained by the linear regression of Y on X.
• It is the Pearson correlation coefficient squared, aka r² = R²
• It also ranges between 0 and 1

Page 24

Model Evaluation: Coefficient of Determination

The closer to one, the better the model (greater ability to predict):
• R² = 1 would imply that your regression model provides perfect predictions (all data points lie on the least-squares regression line)
• R² = 0.7 would mean 70% of the variation in the response variable can be explained by the predictor(s)

Page 25

Model Evaluation: Coefficient of Determination

Given that R-squared is the Pearson correlation coefficient squared, we can solve:

$R^2 = r^2 = 1 - \frac{(n-2)\,s^2_{y|x}}{(n-1)\,s^2_y} = 1 - \frac{SSE}{SS_{Total}} = \frac{SSR}{SS_{Total}}$

If x explains none of the variation in y, then SSR = 0 and R² = 0.
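As a sketch, the same identity computed directly from observed and fitted values (the function name is mine):

    import numpy as np

    def r_squared(y, y_hat):
        """R^2 = 1 - SSE/SSTotal = SSR/SSTotal."""
        y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
        sse = np.sum((y - y_hat) ** 2)
        ss_total = np.sum((y - y.mean()) ** 2)
        return 1.0 - sse / ss_total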

Page 26

Model Evaluation: Coefficient of Determination

Adjusted R-squared = R² adjusted for the number of variables in the model, i.e., "punished" for additional variables. We want a more parsimonious (simple) model.

Note that R-squared does NOT tell you whether:
• The predictor is the true cause of the changes in the dependent variable (CORRELATION ≠ CAUSATION!!!)
• The correct regression line was used; you may be omitting variables (multiple linear regression…)

Page 27

3. Multiple Linear Regression

Extend the simple model to include more variables:
• Increases our power to make predictions!
• The model is no longer a line, but multidimensional
• Outcome = a function of many variables, e.g., sex, age, race, smoking status, exercise, education, treatment, genetic factors, etc.

Page 28

Multiple Regression

Naïve model:

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

Now, we can assess the effect of x1 on y while controlling for x2 (a potential "confounder"):

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$

We can continue adding predictors...

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3$

We can even add "interaction" terms (i.e., x1·x2):

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1 x_2$

Page 29

Multiple Regression

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$

Interpretation:
• β0 = y-intercept (the value of y when both x1 = 0 and x2 = 0); often not interpretable
• β1 = increase in y for every one-unit increase in x1, while holding x2 constant
• β2 = increase in y for every one-unit increase in x2, while holding x1 constant

Page 30

Multiple Linear Regression

• Can incorporate (and control for) many variables
• A single (continuous) dependent variable
• Multiple independent variables (predictors); these may be of any scale (continuous, nominal, or ordinal)

Page 31

Multiple Regression

Indicator ("dummy") variables are created and used for categorical variables, i.e.:

Race        x1   x2   x3
Caucasian    0    0    0
Black        1    0    0
Hispanic     0    1    0
Asian        0    0    1

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3$

so, for example, $\hat{y}_{Caucasian} = \hat{\beta}_0$ and $\hat{y}_{Hispanic} = \hat{\beta}_0 + \hat{\beta}_2$

Need (# categories − 1) "dummy" variables. This is the Analysis of Variance equivalent; will cover later…
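A sketch of this coding with pandas (declaring the category order makes Caucasian the reference level, matching the table above):

    import pandas as pd

    race = pd.Categorical(
        ["Caucasian", "Black", "Hispanic", "Asian"],
        categories=["Caucasian", "Black", "Hispanic", "Asian"],
    )
    # drop_first=True keeps k - 1 = 3 indicators; Caucasian becomes all zeros
    print(pd.get_dummies(race, drop_first=True))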

Page 32

Multiple Regression

Conduct the F-test for the overall model as before, but now with k and n − k − 1 degrees of freedom, where k = # of predictors in the model.

Conduct t-tests for the coefficients of the predictors as before, but now with n − k − 1 degrees of freedom. Note the F-test is no longer equivalent to the t-test when k > 1.

Page 33

Multiple Regression: Confounding

Multiple regression can estimate the effect of each variable while controlling for (adjusting for) the effects of other (potentially confounding) variables in the model. Confounding occurs when the effect of a variable of interest is distorted when you do not control for the effect of another "confounding" variable.

Page 34

Multiple Regression: Confounding

For example,

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$    (not accounting for confounding)

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$    (adjusting for the effect of x2)

By definition, a confounding variable is associated with both the dependent variable and the independent variable (the predictor of interest, i.e., x1).

Does β1 change in the second model? If yes, then there is evidence that x2 is confounding the association between x1 and the dependent variable.
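A sketch of that comparison with statsmodels (the data here are simulated so that x2 is associated with both x1 and y, i.e., a confounder by construction):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(1)
    x2 = rng.normal(size=200)
    x1 = 0.8 * x2 + rng.normal(size=200)
    y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=200)
    df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

    crude = smf.ols("y ~ x1", data=df).fit()          # not accounting for confounding
    adjusted = smf.ols("y ~ x1 + x2", data=df).fit()  # adjusting for x2
    # A marked change in the x1 coefficient suggests confounding by x2
    print(crude.params["x1"], adjusted.params["x1"])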

Page 35

Multiple Regression: Confounding

• Assume a model of blood pressure, with predictors of alcohol consumption and weight
• Weight and alcohol consumption may be associated

[Diagram: alcohol consumption (a confounder for the effect of weight on blood pressure) linked to both weight and blood pressure.]

Page 36

Multiple Regression: Effect Modification

Interactions (effect modification) may be investigated: the effect of one variable depends on the level of another variable.

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1 x_2$

Page 37

Multiple Regression: Effect Modification

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_1 x_2$

The effect of x1 depends on x2:
• x2 = 0: slope for x1 is β1 (non-smoker)
• x2 = 1: slope for x1 is β1 + β3 (smoker)

BP example: if x1 = weight and x2 = smoking status, then the effect on your BP of an additional 10 lbs would be different if you were a smoker vs. a non-smoker.

Page 38

Multiple Regression: Effect Modification

Non-smoker: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$

Smoker: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 + \hat{\beta}_3 x_1 = (\hat{\beta}_0 + \hat{\beta}_2) + (\hat{\beta}_1 + \hat{\beta}_3)\,x_1$, a new slope and intercept!

DIFFERENCE = $\hat{\beta}_2 + \hat{\beta}_3 x_1$

*The difference between smokers and non-smokers depends on x1.
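A small numeric sketch of the two fitted lines (all coefficient values here are made up for illustration):

    # Hypothetical fit: b0, b1 (weight), b2 (smoker), b3 (weight x smoker)
    b0, b1, b2, b3 = 90.0, 0.5, 4.0, 0.2

    def bp(weight, smoker):
        """y-hat = b0 + b1*x1 + b2*x2 + b3*x1*x2."""
        return b0 + b1 * weight + b2 * smoker + b3 * weight * smoker

    # Smoker-minus-non-smoker difference at a given weight: b2 + b3*weight
    print(bp(150, 1) - bp(150, 0))  # 4.0 + 0.2*150 = 34.0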

Page 39

Multiple Regression: Effect Modification

• DIFFERENCE in slope and intercept…

Non-smoker: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$
Smoker: $\hat{y} = (\hat{\beta}_0 + \hat{\beta}_2) + (\hat{\beta}_1 + \hat{\beta}_3)\,x_1$

[Plot of y vs. x showing the two non-parallel fitted lines.]

Page 40

Multiple Regression: Confounding or Effect Modification?

Confounding without effect modification: the overall association of the predictor of interest and the dependent variable is not the same as it is after stratifying on the third variable (the "confounder"). However, after stratifying, the association is the same within each stratum.

Page 41

Multiple Regression: Confounding or Effect Modification?

Effect modification without confounding: the overall association accurately estimates the average effect of the predictor on the dependent variable, but after stratifying on the third variable, that effect differs across strata.

Both: the overall association is not a correct estimate of the effect, and the effects differ across subgroups of the third variable.

Page 42

Multiple Regression

How to build a multiple regression model:

1. Examine two-way scatter plots of potential predictors against your dependent variable.
2. For those that look associated, evaluate each in a simple linear regression model ("univariate" analysis).
3. Pick out significant univariate predictors.
4. Use stepwise model-building techniques: backwards, forwards, stepwise, best subsets.

Page 43

Multiple Regression

Like simple linear regression, these models require an assessment of model adequacy and goodness of fit:
• Examination of residuals (comparison of observed vs. predicted values)
• Coefficient of Determination (pay attention to adjusted R-squared)

Page 44

Lead Exposure Example (from Rosner B. Fundamentals of Biostatistics. 5th ed.)

Study of the effect of lead exposure on neurological and psychological function in children. Compared mean finger-wrist tapping score (maxfwt), a measure of neurological function, between exposed (blood lead ≥ 40 μg/100 mL) and control (< 40 μg/100 mL) children; measured in taps per 10 seconds.

We already have the tools to do this in the "naïve" case: the 2-sample t-test!

Page 45

Lead Exposure Example (from Rosner B. Fundamentals of Biostatistics. 5th ed.)

Need a dummy variable for exposure:

CSCN2 = 1 if the child is exposed, 0 if the child is a control

With the 2-sample t-test, we compared the means of the exposed and controls.

Page 46

Lead Exposure Example (from Rosner B. Fundamentals of Biostatistics. 5th ed.)

Now, we can turn this into a simple linear regression model:

MAXFWT = α + β·CSCN2 + e

Estimates for each group:
• Exposed (CSCN2 = 1): MAXFWT = α + β·1 = α + β
• Controls (CSCN2 = 0): MAXFWT = α + β·0 = α

β represents the difference between groups (a one-unit increase in CSCN2). Testing β = 0 is the same as testing whether the mean difference = 0.
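A sketch of fitting this model with statsmodels (the data here are a simulated stand-in for the study data, with CSCN2 = 1 for exposed and 0 for control):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(2)
    df = pd.DataFrame({"CSCN2": np.repeat([0, 1], 40)})
    df["MAXFWT"] = 55 - 6.7 * df["CSCN2"] + rng.normal(0, 10, size=80)

    fit = smf.ols("MAXFWT ~ CSCN2", data=df).fit()
    # The CSCN2 coefficient estimates the exposed-minus-control mean difference;
    # its t-test reproduces the equal-variance two-sample t-test
    print(fit.params["CSCN2"], fit.pvalues["CSCN2"])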

Page 47

Lead Exposure Example

   Source |          SS    df           MS        Number of obs =        95
 ---------+------------------------------        F(1, 93)      =     9.021
    Model |   940.63327     1    940.63327       Prob > F      =    0.0034
 Residual |  9697.30357    93    104.27208       R-squared     =    0.0884
 ---------+------------------------------        Adj R-squared =    0.0786
    Total | 10637.93684    94    113.16954       Root MSE      =  10.21137

   MAXFWT |      Coef.    Std. Err.       t     P>|t|     [95% Conf. Interval]
 ---------+-------------------------------------------------------------------
    CSCN2 |  -6.657738   2.21666753   -3.003   0.0034    -11.04674   -2.2687363
    _cons |  55.095238   1.28651172   42.825   0.0001    52.547945    57.642531

As just shown, MAXFWT(exposed) = α + β = 55.095 − 6.658 = 48.437, and MAXFWT(controls) = α = 55.095.

Mean difference = −6.658!

Page 48

Lead Exposure Example

   Source |          SS    df           MS        Number of obs =        95
 ---------+------------------------------        F(1, 93)      =     9.021
    Model |   940.63327     1    940.63327       Prob > F      =    0.0034
 Residual |  9697.30357    93    104.27208       R-squared     =    0.0884
 ---------+------------------------------        Adj R-squared =    0.0786
    Total | 10637.93684    94    113.16954       Root MSE      =  10.21137

   MAXFWT |      Coef.    Std. Err.       t     P>|t|     [95% Conf. Interval]
 ---------+-------------------------------------------------------------------
    CSCN2 |  -6.657738   2.21666753   -3.003   0.0034    -11.04674   -2.2687363
    _cons |  55.095238   1.28651172   42.825   0.0001    52.547945    57.642531

Equivalent to a two-sample t-test (with equal variances) of H0: μc = μe: t = −3.003, p = 0.0034. The slope of −6.658 is equivalent to the mean difference between exposed and controls.

Page 49

Lead Exposure Example

   Source |          SS    df           MS        Number of obs =        95
 ---------+------------------------------        F(1, 93)      =     9.021
    Model |   940.63327     1    940.63327       Prob > F      =    0.0034
 Residual |  9697.30357    93    104.27208       R-squared     =    0.0884
 ---------+------------------------------        Adj R-squared =    0.0786
    Total | 10637.93684    94    113.16954       Root MSE      =  10.21137

   MAXFWT |      Coef.    Std. Err.       t     P>|t|     [95% Conf. Interval]
 ---------+-------------------------------------------------------------------
    CSCN2 |  -6.657738   2.21666753   -3.003   0.0034    -11.04674   -2.2687363
    _cons |  55.095238   1.28651172   42.825   0.0001    52.547945    57.642531

R-squared is not strong… exposure group alone explains little of the variability in MAXFWT.

Page 50

Lead Exposure Example

What other variables are related to neurological function? It is often strongly related to age and gender. Look at scatterplots of both age and gender vs. MAXFWT, separately. Both show evidence of association…

[Scatterplots: MAXFWT vs. age (years), and MAXFWT for males vs. females.]

Page 51

Lead Exposure Example

   Source |          SS    df           MS        Number of obs =        95
 ---------+------------------------------        F(2, 92)      =    48.109
    Model |   5438.1459     2   2719.07226       Prob > F      =    0.0001
 Residual |   5199.7909    92     56.51947       R-squared     =    0.5112
 ---------+------------------------------        Adj R-squared =    0.5006
    Total | 10637.93684    94    113.16954       Root MSE      =   7.51794

   MAXFWT |      Coef.    Std. Err.       t     P>|t|     [95% Conf. Interval]
 ---------+-------------------------------------------------------------------
    AGEYR |   2.520683   0.25705630    9.806   0.0001     2.011712     3.029642
      SEX |  -2.365745   1.58721503   -1.491   0.1395    -5.663369     0.622003
    _cons |  31.591389   3.16011063    9.997   0.0001    25.33437     37.848408

Age is in years, and sex is coded 1 = Male, 2 = Female. Both appear to be associated with MAXFWT; age is statistically significant (p = 0.0001).

Page 52

Lead Exposure Example: Our first multiple linear regression model…

   Source |          SS    df           MS        Number of obs =        95
 ---------+------------------------------        F(2, 92)      =    48.109
    Model |   5438.1459     2   2719.07226       Prob > F      =    0.0001
 Residual |   5199.7909    92     56.51947       R-squared     =    0.5112
 ---------+------------------------------        Adj R-squared =    0.5006
    Total | 10637.93684    94    113.16954       Root MSE      =   7.51794

   MAXFWT |      Coef.    Std. Err.       t     P>|t|     [95% Conf. Interval]
 ---------+-------------------------------------------------------------------
    AGEYR |   2.520683   0.25705630    9.806   0.0001     2.011712     3.029642
      SEX |  -2.365745   1.58721503   -1.491   0.1395    -5.663369     0.622003
    _cons |  31.591389   3.16011063    9.997   0.0001    25.33437     37.848408

Numerator df = k = 2 (sum of squares regression); denominator df = n − k − 1 = 92 (sum of squares error).

Page 53

Lead Exposure Example: Adjusted multiple linear regression model…

[Regression output for the model including CSCN2, AGEYR, and SEX.]

The coefficients for age and sex haven't changed by much. The coefficient for CSCN2 is smaller than the crude (naïve) difference: −5.147, down from −6.658 taps/10 seconds.

Page 54

Lead Exposure Example

R-squared is up to 0.56 (from 0.09 in the simple model). Note: adjusted R-squared compensates for the added complexity of the model. Since R-squared will ALWAYS increase as more variables are added, we want to keep things as simple as we can; adjusted R-squared takes that into account.

Page 55

Lead Exposure Example

Interpretation: holding sex and age constant (e.g., males at 10 years of age), the estimated mean difference between groups is −5.15 taps/10 seconds, with a 95% CI of (−8.23, −2.06).

Page 56

Other Regression Models

• Logistic Regression: used when the dependent variable is binary. Very common in public health/medical studies (e.g., disease vs. no disease).
• Poisson Regression: used when the dependent variable is a count.
• (Cox) Proportional Hazards (PH) Regression: used when the dependent variable is an "event time" with censoring.

Page 57

4. ANOVA

Variables of Interest?

• One (continuous) variable: methods from before the midterm…
• Two variables
    - Both continuous
        - Interested in predicting one from another: Simple Linear Regression
        - Interested in presence of association
            - Both variables normal: Pearson Correlation
            - Not normal: Spearman (Rank) Correlation
    - One continuous, one categorical: ANOVA*
• More than two variables: Multiple Linear Regression

*Note: If the categorical variable is ordinal, rank correlation methods are applicable…

Page 58

Analysis of Variance

Hypothesis test for a difference in the means of k groups:

H0: μ1 = μ2 = μ3 = … = μk
HA: At least one pair not equal

We assess differences in means using VARIANCES: within-group and between-group variability. If there is no difference in means, then the two types of variability should be equal (assuming within-group variability is constant across groups).

Note, if k = 2, this is the same as the two-sample t-test. Only need k − 1 "dummy" variables.
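A minimal sketch of this F-test with scipy (the group values are toy numbers, not study data):

    from scipy import stats

    # Toy tapping scores for three groups; H0: all group means are equal
    group_a = [48, 52, 55, 50]
    group_b = [44, 47, 49, 46]
    group_c = [53, 58, 56, 54]

    f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
    print(f_stat, p_value)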

Page 59

Analysis of Variance

This parallels the methods used for regression when we had one continuous and one categorical variable (with k levels).

In constructing least-squares lines, we evaluated how much variability in our response could be explained by our explanatory (predictor) variables vs. left unexplained (residual error)…

Page 60

Analysis of Variance

The total error (SSY) was split into two portions, variability explained by the regression (SSR) and residual variability (SSE):

$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \underbrace{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}_{\text{variability due to regression}} + \underbrace{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}_{\text{unexplained variability}}$

$SSY = SSR + SSE$

Page 61

Analysis of Variance

Similarly, we can think of this as the variability WITHIN and BETWEEN each level of the predictor:

$\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar{y})^2 = \underbrace{\sum_{i=1}^{k} n_i(\bar{y}_i - \bar{y})^2}_{\text{between-group sum of squares}} + \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar{y}_i)^2}_{\text{within-group sum of squares}}$

$SSY = SSR + SSE = SS_B + SS_W$
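The decomposition computed directly, as a sketch (the function name is mine):

    import numpy as np

    def between_within_ss(groups):
        """Return (SS_B, SS_W) for a list of 1-D arrays of observations."""
        arrays = [np.asarray(g, dtype=float) for g in groups]
        grand_mean = np.concatenate(arrays).mean()
        ss_b = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in arrays)
        ss_w = sum(np.sum((g - g.mean()) ** 2) for g in arrays)
        return ss_b, ss_w  # SS_B + SS_W equals the total sum of squares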

Page 62

Analysis of Variance

Box plots for five levels of an explanatory variable:
• The size of the boxes (Q1 to Q3) reflects "within-group" variability
• The placement of the boxes along the y-axis reflects "between-group" variability

[Box plots for groups A through E.]

Page 63

Analysis of Variance

Box plots for five levels of an explanatory variable and the total (combined): the TOTAL box and the ȳ line are added so we can see where the groups lie relative to the overall mean… How much overlap?

[Box plots for groups A through E plus TOTAL, with a horizontal line at ȳ.]

Page 64

Analysis of Variance Table

Source     df       Sum of Squares          Mean Square              F
Between    k − 1    SS_B (formerly SSR)     MS_B = SS_B / (k − 1)    F = MS_B / MS_W
Within     n − k    SS_W (formerly SSE)     MS_W = SS_W / (n − k)
Total      n − 1    SSY

k = # of groups (levels of the categorical variable). Remember, using parallel regression methods, we only needed k − 1 variables for k groups, so now we have k − 1 and n − k degrees of freedom…

Page 65

Analysis of Variance Table

The same test is conducted as we saw with regression: we test the ratio of the between-group mean square to the within-group mean square. The larger the between-group variability is relative to the within-group variability, the more likely we are to reject the null hypothesis.

(Between-group sum of squares: formerly known as SSR. Within-group sum of squares: formerly known as SSE.)

Page 66

Analysis of Variance

If we do reject the null hypothesis of all group means being equal (based on the F-test), then we only know that at least one pair differs. We still need to find where those differences lie:
• Post-hoc tests (aka multiple comparisons), e.g., Tukey, Bonferroni
• These perform two-sample tests, while adjusting to maintain the overall α level
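A sketch of a Tukey post-hoc comparison with statsmodels (toy data again):

    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # All observations in one list, with a group label for each
    values = [48, 52, 55, 50, 44, 47, 49, 46, 53, 58, 56, 54]
    labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

    # Pairwise two-sample comparisons, adjusted to keep the overall alpha at 0.05
    print(pairwise_tukeyhsd(values, labels, alpha=0.05))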

Page 67

Review: Variables of Interest?

• One (continuous) variable: methods from before the midterm…
• Two variables
    - Both continuous
        - Interested in predicting one from another: Simple Linear Regression
        - Interested in presence of association
            - Both variables normal: Pearson Correlation
            - Not normal: Spearman (Rank) Correlation
    - One continuous, one categorical: ANOVA*
• More than two variables: Multiple Linear Regression

*Note: If the categorical variable is ordinal, Spearman (rank) correlation methods are applicable…