21
university of copenhagen department of biostatistics Dag 4: Linear regression / ANOVA Susanne Rosthøj Section of Biostatistics Department of Public Health University of Copenhagen [email protected]

Linear regression / ANOVAstaff.pubhealth.ku.dk/~sr/MPH/slides/dag4.pdf ·  · 2017-05-05Linear regression / ANOVA Susanne Rosthøj ... One-way ANOVA In one-way ANOVA we ... therefore

  • Upload
    vudang

  • View
    226

  • Download
    0

Embed Size (px)

Citation preview

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

Dag 4: Linear regression / ANOVA

Susanne Rosthøj

Section of BiostatisticsDepartment of Public HealthUniversity of Copenhagen

[email protected]

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

Regression models

The purpose with a regression analysis is to relate one outcomevariable to one or more explanatory variables.

The type of the outcome variable determines which kind of regressionmodel is relevant:

Outcome Model Effect0/1 (binary) logistic regression odds ratioQuantitative linear regression difference in meansSurvival time Cox (Poisson) regression hazard (rate) ratio

2 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

Linear regression model

The outcome Y has a normal distribution with

EY = a + b · XVarY = σ2

where X is a quanititative explanatory variable.

n independent observations Y1,Y2, . . . ,Yn.

We write

Yi = a + b · Xi + εi, with εi ∼ N (0, σ2) independent

ε is the residual error.

3 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

Example: vitamin D as a function of BMI

Yi = a + b · BMIi + εi,

4 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

Estimation

Estimated regression line:

vitaminD = 111.05− 2.39 · BMI

SE of the effect of BMI: 0.69. 95%CI (-3.78;-1.00).

Test of the effect of BMI by a t-test

t = −2.390.69

= −3.47, p = 0.001.

Interpretation :For a 1 unit increase in BMI, vitamin D is decreased by 2.39 nmol/L(95% CI 1.00-3.78), .

5 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

Analysis in SAS

We use proc glm (General Linear Model):proc glm data=vitamin plots=DiagnosticsPanel;

model vitd = bmi / solution clparm;where country=4; * Irland;

run;

Discuss the output in the handout and find the numbers on theprevious slide.

6 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

Does the model fit to the data?

Our conclusions based on the model are valid only if the model is valid.

Assumptions :1) Independence between observations2) Linearity3) Normality (of residual errors ε)4) Homogeneity of variance (of residual errors ε)

Normality and homogeneity assessed through the residuals.

7 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

Assessment of 2) Linearity

Extend the model :

Yi = a + b · BMIi + c · BMI2i + εi

A test of linearity : H0 : c = 0.

proc glm data=vitamin plots=DiagnosticsPanel;model vitd = bmi bmi*bmi / solution clparm;where country=4; * Irland;

run;

Conclude from handout: Is the linear model plausible?

8 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

Predicted values and residualsThe predicted or fitted values :

Yi = 111.05− 2.39 · BMIi

Expected vitamin D level for a woman with BMI=21.8

111.05− 2.39× 21.8 = 58.9

For each woman we determine the residual

ri = Yi − (111.05− 2.39 · BMIi)

as the difference between observed and predicted value.

Residual for the woman with BMI=21.8 and Y=vitamin D=89.1

89.1− (111.05− 2.39 · 21.8) = 89.1− 58.9 = 30.29 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

Residuals

●●

●●

20 25 30 35

2040

6080

100

BMI

Vita

min

D●

10 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

Assessment of 3) NormalityQQ-plot of residuals (plot 4 i output) (evt histogram (plot 7)):

11 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

Assessment of 4) Homogeneity

Plot residuals as a function of predicted values (plot 1):

Constant variance? Trumpet-shape?12 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

One-way ANOVAIn one-way ANOVA we compare means for several groups.

Four countries 1=DK, 2=FI, 4=IRL, 6=PL.

Country N Mean SDDK 53 47.17 22.78FI 54 47.99 18.72IRL 41 48.01 20.22PL 65 32.56 12.46

A model

Yi = µi + εi, εi ∼ N (0, σ2)

µi =

µDK if i is from DKµFI if i is from FIµIRL if i is from IRLµPL if i is from PL

13 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

An illustration of the model

Sampling distribution:

Mean

µ1 µ2 µ3 µ4

14 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

Illustrating the dataAlways plot data before starting analysis :

15 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

Useful formulation of mean structure

We are interested in quantifying the differences between the groups

µi =

µDK DKµFI FIµIRL IRLµPL PL

=

47.17 DK47.99 FI48.01 IRL32.56 PL

=

a DKa + b1 FIa + b2 IRLa + b3 IRL

=

47.17 DK47.17 + 0.82 FI47.17 + 0.84 IRL47.17− 14.60 PL

16 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

The overall test for associationHypothesis

H0 : µDK = µFI = µIRL = µPL

against HA : at least two means are different

Equivalently the hypothesis is

H0 : b1 = b2 = b3 = 0

against HA : either b1 6= 0, b2 6= 0 or b3 6= 0.

The test statistics follows an F-distribution with df=(k − 1, n− k) wheren = 213 is the total number of patients, k = 4 is the number of groups.

The test statistic compare the variance between the groups to thevariance within the groups, therefore an ANalysis Of VAriance(even though we are actually comparing means...).

17 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

One-way ANOVA in SAS

We again use proc glm:

proc glm data=vitamin;class country (ref=’1’);model vitd = country / solution clparm;

run;

18 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

One-way ANOVA in SAS

Overall test of association (type III test):Source DF Type III SS Mean Square F Value Pr > F

country 3 10373.99129 457.99710 10.06 <.0001

Parameter estimates:

StandardParameter Estimate Error t Value Pr > |t| 95% Confidence Limits

Intercept 47.16603774 B 2.54727631 18.52 <.0001 42.14438954 52.18768593country 2 0.82470300 B 3.58567617 0.23 0.8183 -6.24402535 7.89343136country 4 0.84127934 B 3.85698593 0.22 0.8275 -6.76230351 8.44486218country 6 -14.60449927 B 3.43210354 -4.26 <.0001 -21.37047771 -7.83852084country 1 0.00000000 B . . . . .

19 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

Assumptions

Assumptions need to be checked:• the observations are independent (i.e. no siblings)• in each group, the distribution of the observations have to be

normal• the variances are the same in all groups.

20 / 1

u n i v e r s i t y o f c o p e n h a g e n d e p a r t m e n t o f b i o s t a t i s t i c s

Dealing with deviations from the model

Linearity :• Add more explanatory variables• Transform covariates (logarithms, square root, . . .).• Accept the non-linearity =⇒ non-linear regression.

Inhomogeneity :• Transform the outcome Y (often log).

Normality :• Transform the outcome Y (often log).

21 / 1