
Introduction to Biostatistical Analysis Using R

Statistics course for first-year PhD students

Lecturer: Lorenzo Marini
DAFNAE, University of Padova, Viale dell’Università 16, 35020 Legnaro, Padova.
E-mail: [email protected]
Tel.: +39 049 8272807
Skype: lorenzo.marini

http://www.biodiversity-lorenzomarini.eu/

Session 4

Lecture: Regression Analysis
Practical: multiple regression

Statistical modelling: more than one parameter

Nature of the response variable
- NORMAL: General Linear Models
- POISSON, BINOMIAL, …: Generalized Linear Models (GLM)

Nature of the explanatory variables (General Linear Models)
- Categorical: ANOVA (Session 3)
- Continuous: Regression (Session 4)
- Categorical + continuous: ANCOVA (Session 3)

REGRESSION

Linear methods

Simple linear
- One X
- Linear relation

Multiple linear
- 2 or more Xi
- Linear relation

Polynomial
- One X but more slopes
- Non-linear relation

Non-linear methods

Non-linear
- One X
- Complex relation

LINEAR REGRESSION lm()

Regression analysis is a technique for modelling and analysing numerical data consisting of values of a dependent variable (response variable) and of one or more independent continuous variables (explanatory variables).

Assumptions

Independence: the Y-values and the error terms must be independent of each other.

Linearity between Y and X.

Normality: the populations of Y-values and the error terms are normally distributed at each level of the predictor variable x.

Homogeneity of variance: the populations of Y-values and the error terms have the same variance at each level of the predictor variable x.

(Do not test Y itself for normality or heteroscedasticity; check the residuals instead!)

LINEAR REGRESSION lm()

AIMS

1. To describe the linear relationships between Y and Xi (EXPLANATORY APPROACH) and to quantify how much of the total variation in Y can be explained by the linear relationship with Xi.

2. To predict new values of Y from new values of Xi (PREDICTIVE APPROACH).

$Y_i = \alpha + \beta x_i + \varepsilon_i$

Y: response; Xi: predictors

We estimate one INTERCEPT and one or more SLOPES.

SIMPLE LINEAR REGRESSION: step by step

I step:
- Check linearity [visualization with plot()]

II step:
- Estimate the parameters (one slope and one intercept)

III step:
- Check residuals (check the assumptions by looking at the residuals: normality and homogeneity of variance)

SIMPLE LINEAR REGRESSION: Normality

Do not test normality on the whole of y; check the residuals instead.

SIMPLE LINEAR REGRESSION

MODEL

$y_i = \alpha + \beta x_i$

SLOPE

$\beta = \dfrac{\sum (x_i - x_{\mathrm{mean}})(y_i - y_{\mathrm{mean}})}{\sum (x_i - x_{\mathrm{mean}})^2}$

INTERCEPT

$\alpha = y_{\mathrm{mean}} - \beta \, x_{\mathrm{mean}}$

The model gives the fitted values.

RESIDUALS

Residuals = observed $y_i$ - fitted $y_i$

The model does not explain everything.
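As a minimal sketch of the slope and intercept formulas above, the estimates can be computed by hand and compared with lm(). The built-in `cars` data set (stopping distance against speed) is an assumption used here for illustration only; it is not the data of the course.

```r
# Illustrative data (assumption): built-in 'cars' data set
x <- cars$speed
y <- cars$dist

# Slope and intercept from the closed-form least-squares formulas
beta  <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
alpha <- mean(y) - beta * mean(x)

# Fitted values and residuals (observed - fitted)
y_fitted <- alpha + beta * x
res      <- y - y_fitted

# The same estimates obtained with lm()
fit <- lm(y ~ x)
print(c(alpha = alpha, beta = beta))
print(coef(fit))
```

Both routes should give the same two numbers, since lm() minimises the same residual sum of squares.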

SIMPLE LINEAR REGRESSION

Least squares regression explained

library(animation)

###########################################
## Slope changing
# save the animation in HTML pages
ani.options(ani.height = 450, ani.width = 600, outdir = getwd(),
    title = "Demonstration of Least Squares",
    description = "We want to find an estimate for the slope in 50 candidate slopes, so we just compute the RSS one by one.")
ani.start()
par(mar = c(4, 4, 0.5, 0.1), mgp = c(2, 0.5, 0), tcl = -0.3)
least.squares()
ani.stop()

############################################
## Intercept changing
# save the animation in HTML pages
ani.options(ani.height = 450, ani.width = 600, outdir = getwd(),
    title = "Demonstration of Least Squares",
    description = "We want to find an estimate for the intercept in 50 candidate intercepts, so we just compute the RSS one by one.")
ani.start()
par(mar = c(4, 4, 0.5, 0.1), mgp = c(2, 0.5, 0), tcl = -0.3)
least.squares(ani.type = "i")
ani.stop()

SIMPLE LINEAR REGRESSION

Hypothesis testing

H0: β = 0 (there is no relation between X and Y)

H1: β ≠ 0

We must measure the unreliability associated with each of the estimated parameters (i.e. we need the standard errors).

$SE(\beta) = \sqrt{\dfrac{\text{residual SS}/(n-2)}{\sum (x_i - x_{\mathrm{mean}})^2}}$

Parameter t testing (test the single parameter!)

$t = \dfrac{\beta - 0}{SE(\beta)}$
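The standard error and the t test of the slope can be reproduced by hand. This is a sketch under the assumption that the built-in `cars` data set stands in for the course data; the hand-made values are compared with the coefficient table of summary().

```r
# Illustrative data (assumption): built-in 'cars' data set
x <- cars$speed
y <- cars$dist
fit <- lm(y ~ x)

n   <- length(y)
rss <- sum(residuals(fit)^2)                 # residual SS

# Standard error of the slope: sqrt of (residual SS/(n-2)) / sum((x - xmean)^2)
se_beta <- sqrt((rss / (n - 2)) / sum((x - mean(x))^2))

# t statistic for H0: beta = 0, and its two-sided P-value on n-2 d.f.
t_beta <- (coef(fit)[["x"]] - 0) / se_beta
p_beta <- 2 * pt(abs(t_beta), df = n - 2, lower.tail = FALSE)

# Compare with the table produced by summary()
tab <- summary(fit)$coefficients
print(c(SE = se_beta, t = t_beta, P = p_beta))
print(tab["x", ])
```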

SIMPLE LINEAR REGRESSION

Measure of goodness-of-fit

If the model is significant (β ≠ 0): how much variation is explained?

Total SS = $\sum (y_{\mathrm{observed},i} - y_{\mathrm{mean}})^2$

Model SS = $\sum (y_{\mathrm{fitted},i} - y_{\mathrm{mean}})^2$

Residual SS = Total SS - Model SS

$R^2$ = Model SS / Total SS (explained variation)

$R^2$ does not provide information about the significance.
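The sums of squares and R2 can be computed directly from the fitted values. A minimal sketch, again assuming the built-in `cars` data set as a stand-in for the course data:

```r
# Illustrative data (assumption): built-in 'cars' data set
fit <- lm(dist ~ speed, data = cars)
y   <- cars$dist

total_ss <- sum((y - mean(y))^2)            # Total SS
model_ss <- sum((fitted(fit) - mean(y))^2)  # Model SS
resid_ss <- total_ss - model_ss             # Residual SS

r2 <- model_ss / total_ss                   # explained variation
print(r2)
print(summary(fit)$r.squared)               # same quantity reported by R
```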

SIMPLE LINEAR REGRESSION: example 1

If the model is significant, then model checking

1. Linearity between X and Y?

No patterns in the residuals vs. predictor plot: ok

SIMPLE LINEAR REGRESSION: example 1

2. Normality of the residuals

Q-Q plot + Shapiro-Wilk test on the residuals

> shapiro.test(residuals)

Shapiro-Wilk normality test

data: residuals
W = 0.9669, p-value = 0.2461

ok (both the Q-Q plot and the Shapiro-Wilk test)

SIMPLE LINEAR REGRESSION: example 1

3. Homoscedasticity

Call:
lm(formula = abs(residuals) ~ yfitted)

Coefficients:
             Estimate       SE      t      P
(Intercept)   2.17676  2.04315  1.065  0.293
yfitted       0.11428  0.07636  1.497  0.142

ok (the absolute residuals do not change significantly with the fitted values, P = 0.142)
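The three checks of example 1 (linearity, normality of the residuals, homoscedasticity) can be sketched in a few lines. The data of the example are not given in the slides, so the built-in `cars` data set is assumed here; the outcome of the checks on `cars` need not match the slide output.

```r
# Illustrative data (assumption): built-in 'cars' data set
fit     <- lm(dist ~ speed, data = cars)
res     <- residuals(fit)
yfitted <- fitted(fit)

# 1. Linearity: look for patterns in the residuals vs. predictor plot
plot(cars$speed, res, xlab = "speed", ylab = "residuals")
abline(h = 0, lty = 2)

# 2. Normality of the residuals: Q-Q plot + Shapiro-Wilk test
qqnorm(res)
qqline(res)
sw <- shapiro.test(res)
print(sw)

# 3. Homoscedasticity: regress the absolute residuals on the fitted values;
#    a significant slope suggests that the variance changes with y
het <- lm(abs(res) ~ yfitted)
print(summary(het)$coefficients)
```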

SIMPLE LINEAR REGRESSION: example 2

1. Linearity between X and Y?

no: NO LINEARITY between X and Y

SIMPLE LINEAR REGRESSION: example 2

2. Normality of the residuals

Q-Q plot + Shapiro-Wilk test on the residuals

> shapiro.test(residuals)

Shapiro-Wilk normality test

data: residuals
W = 0.8994, p-value = 0.001199

no (both the Q-Q plot and the Shapiro-Wilk test)

SIMPLE LINEAR REGRESSION: example 2

3. Homoscedasticity

[Figure: residual plots, panels labelled NO and YES]

SIMPLE LINEAR REGRESSION: example 2

How to deal with non-linearity and non-normality situations?

Transformation of the data
- Box-Cox transformation (power transformation of the response)
- Square-root transformation
- Log transformation
- Arcsin transformation

Polynomial regression

Regression with multiple terms (linear, quadratic, and cubic)

$Y = a + b_1 X + b_2 X^2 + b_3 X^3 + \text{error}$

X is one variable!!!

POLYNOMIAL REGRESSION: one X, n parameters

Hierarchy in the testing (always test the highest-order term first)!!!!

1. Fit $X + X^2 + X^3$. If the cubic term is significant (P<0.01): stop. If n.s.: drop it.
2. Fit $X + X^2$. If the quadratic term is significant (P<0.01): stop. If n.s.: drop it.
3. Fit $X$. If the linear term is significant (P<0.01): stop. If n.s.: no relation.

NB Do not delete lower terms even if non-significant
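The hierarchical testing of polynomial terms can be sketched by comparing nested models with anova(). The data below are simulated with a true quadratic relation (an assumption made for illustration; the names `m1`, `m2`, `m3` are hypothetical).

```r
# Simulated data (assumption): one X with a quadratic effect on y
set.seed(1)
x <- seq(0, 10, length.out = 60)
y <- 2 + 1.5 * x - 0.12 * x^2 + rnorm(60, sd = 0.5)

# Candidate models, from the most complex to the simplest
m3 <- lm(y ~ x + I(x^2) + I(x^3))
m2 <- lm(y ~ x + I(x^2))
m1 <- lm(y ~ x)

# Always test the highest-order term first
cubic_test <- anova(m2, m3)   # is the cubic term needed?
quad_test  <- anova(m1, m2)   # if not, is the quadratic term needed?
print(cubic_test)
print(quad_test)
```

If the quadratic term is retained, the linear term stays in the model whatever its own P-value.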

MULTIPLE LINEAR REGRESSION: more than one x

Multiple regression

Regression with two or more variables

The Multiple Regression Model

$Y = a + b_1 X_1 + b_2 X_2 + \dots + b_i X_i$

There are important issues involved in carrying out a multiple regression:

• which explanatory variables to include (VARIABLE SELECTION);

• NON-LINEARITY in the response to the explanatory variables;

• INTERACTIONS between explanatory variables;

• correlation between explanatory variables (COLLINEARITY);

• RELATIVE IMPORTANCE of variables

Assumptions

Same assumptions as in the simple linear regression!!!

MULTIPLE LINEAR REGRESSION: more than one x

Multiple regression MODEL

Regression with two or more variables

$Y = a + b_1 X_1 + b_2 X_2 + \dots + b_i X_i$

Each slope (bi) is a partial regression coefficient:

The bi are the most important parameters of the multiple regression model. They measure the expected change in the dependent variable associated with a one-unit change in an independent variable, holding the other independent variables constant. This interpretation of partial regression coefficients is very important because independent variables are often correlated with one another.

MULTIPLE LINEAR REGRESSION: more than one x

Multiple regression MODEL EXPANDED

We can add polynomial terms and interactions

Y = a + linear terms + quadratic & cubic terms + interactions

QUADRATIC AND CUBIC TERMS account for NON-LINEARITY

INTERACTIONS account for non-independent effects of the factors

MULTIPLE LINEAR REGRESSION: step by step

I step:
- Check collinearity (visualization with pairs() and correlation)
- Check linearity

II step:
- Variable selection and model building (different procedures to select the significant variables)

III step:
- Check residuals (check the assumptions by looking at the residuals: normality and homogeneity of variance)

MULTIPLE LINEAR REGRESSION: I STEP

Let’s begin with an example from air pollution studies. How is ozone concentration related to wind speed, air temperature and the intensity of solar radiation?

I STEP:
- Check collinearity
- Check linearity
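The first step can be sketched with pairs() and cor(). The ozone data file of the course is not part of the slides, so R's built-in `airquality` data set is assumed as a stand-in, with its columns renamed to the variable names used in the slides.

```r
# Stand-in data (assumption): built-in 'airquality', renamed to the slide names
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
names(aq) <- c("ozone", "rad", "wind", "temp")

# Collinearity and linearity at a glance: scatterplot matrix with smoothers
pairs(aq, panel = panel.smooth)

# Pairwise correlations between the explanatory variables
cor_x <- cor(aq[, c("rad", "wind", "temp")])
print(round(cor_x, 2))
```

Curved smoothers in the ozone row hint at non-linearity; large off-diagonal correlations hint at collinearity.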

MULTIPLE LINEAR REGRESSION: II STEP

II STEP: MODEL BUILDING

How to carry out a model simplification in multiple regression

Start with a complex model with interactions and quadratic and cubic terms.

Model simplification

1. Remove non-significant interaction terms.

2. Remove non-significant quadratic or other non-linear terms.

3. Remove non-significant explanatory variables.

4. Amalgamate explanatory variables that have similar parameter values.

The result is the Minimum Adequate Model.

MULTIPLE LINEAR REGRESSION: II STEP

Start with the most complicated model (this is one approach)

model1 <- lm(ozone ~ temp*wind*rad + I(rad^2) + I(temp^2) + I(wind^2))

               Estimate  Std.Error      t  Pr(>|t|)
(Intercept)     5.7E+02    2.1E+02   2.74      0.01  **
temp           -1.1E+01    4.3E+00  -2.50      0.01  *
wind           -3.2E+01    1.2E+01  -2.76      0.01  **
rad            -3.1E-01    5.6E-01  -0.56      0.58
I(rad^2)       -3.6E-04    2.6E-04  -1.41      0.16
I(temp^2)       5.8E-02    2.4E-02   2.44      0.02  *
I(wind^2)       6.1E-01    1.5E-01   4.16      0.00  ***
temp:wind       2.4E-01    1.4E-01   1.74      0.09
temp:rad        8.4E-03    7.5E-03   1.12      0.27
wind:rad        2.1E-02    4.9E-02   0.42      0.68
temp:wind:rad  -4.3E-04    6.6E-04  -0.66      0.51

Delete only the highest interaction, temp:wind:rad.

!!!!!!We cannot yet delete the lower-order terms, even the non-significant ones!!!!!!!

MULTIPLE LINEAR REGRESSION: II STEP

Manual model simplification (it is one of the many philosophies)

Delete the non-significant terms one by one, moving from the COMPLEX model to the SIMPLE one.

At each deletion, test: is the fit of the simpler model worse?

Hierarchy in the deletion:
1. Highest interactions
2. Cubic terms
3. Quadratic terms
4. Linear terms

IMPORTANT!!!

If a quadratic or cubic term is significant, you cannot delete the lower-order (linear or quadratic) terms, even if they are not significant.

If an interaction is significant, you cannot delete the linear terms involved, even if they are not significant.
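One deletion step can be sketched with update() and anova(). R's built-in `airquality` data set is assumed as a stand-in for the ozone data of the slides (columns renamed to match), so the estimates need not equal those in the slide table.

```r
# Stand-in data (assumption): built-in 'airquality', renamed to the slide names
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
names(aq) <- c("ozone", "rad", "wind", "temp")

# Most complex model: all interactions plus quadratic terms
model1 <- lm(ozone ~ temp * wind * rad + I(rad^2) + I(temp^2) + I(wind^2),
             data = aq)

# Delete only the highest-order interaction
model2 <- update(model1, . ~ . - temp:wind:rad)

# Is the fit of the simpler model significantly worse?
step_test <- anova(model2, model1)
print(step_test)
```

If the test is non-significant, model2 is kept and the next term in the hierarchy is examined in the same way.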

MULTIPLE LINEAR REGRESSION: III STEP

III STEP: we must check the assumptions

[Figure: residual diagnostic plots. Homogeneity of variance: NO (variance tends to increase with y). Normality: NO (non-normal errors).]

We can transform the data (e.g. log-transformation of y)

model <- lm(log(ozone) ~ temp + wind + rad + I(wind^2))

MULTIPLE LINEAR REGRESSION: more than one x

The log-transformation has improved our model, but maybe there is an outlier.

PARTIAL REGRESSION:

With partial regression we can remove the effect of one or more variables (covariates) and test a further factor, which becomes independent from the covariates.

WHEN?
• We would like to hold third variables constant, but cannot manipulate them.
• We can use statistical control.

HOW?
• Statistical control is based on residuals. If we regress Y on X1 and take the residuals of Y, this part of Y will be uncorrelated with X1, so anything the Y residuals correlate with will not be explained by X1.
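Statistical control through residuals can be sketched as follows, assuming the built-in `airquality` data set with Temp as the covariate (X1) and Wind as the further factor. Residualising Wind as well is an extra step not shown in the slides; it is included because it makes the slope equal to the partial regression coefficient of the full model.

```r
# Stand-in data (assumption): built-in 'airquality'
aq <- na.omit(airquality[, c("Ozone", "Wind", "Temp")])

# Remove the effect of the covariate (Temp) from the response
y_res <- residuals(lm(Ozone ~ Temp, data = aq))
print(cor(y_res, aq$Temp))   # essentially zero: residuals are uncorrelated with X1

# Extra step (not in the slides): remove Temp from Wind too
wind_res <- residuals(lm(Wind ~ Temp, data = aq))

# The slope of residuals on residuals equals the partial coefficient of Wind
partial <- lm(y_res ~ wind_res)
full    <- lm(Ozone ~ Temp + Wind, data = aq)
print(coef(partial)[["wind_res"]])
print(coef(full)[["Wind"]])
```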

PARTIAL REGRESSION: VARIATION PARTITIONING

Relative importance of groups of explanatory variables

Response variable: orthopteran species richness

Explanatory variables: SPACE (latitude + longitude) + ENVIRONMENT (temperature + land-cover heterogeneity)

[Figure: map of the sites; axes: Longitude (km), Latitude (km)]

Full.model <- lm(species ~ environment_i + space_i)

R2 = 76% (TOTAL EXPLAINED VARIATION)

[Figure: total variation split into explained variation (Environment, Space ∩ Envir., Space) and unexplained variation]

What is space and what is environment?

VARIATION PARTITIONING: varpart(vegan)

Full.model <- lm(SPECIES ~ temp + het + lat + long)
TVE = 76%

Env.model <- lm(SPECIES ~ temp + het), which gives env.residuals
Pure.Space.model <- lm(ENV.RESIDUALS ~ lat + long)
VE = 15%

Space.model <- lm(SPECIES ~ lat + long), which gives space.residuals
Pure.env.model <- lm(SPACE.RESIDUALS ~ temp + het)
VE = 40%

[Figure: bar showing the fractions Environment, Space and Unexplained, repeated for each model]
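A variation partitioning can also be sketched from the R2 of the three models. This is a simplified variant based on R2 differences, not the residual regressions of the slide, and it uses simulated data and hypothetical coefficients; varpart() in vegan reports adjusted R2 fractions, so its numbers would differ slightly.

```r
# Simulated data (assumption): environment is partly spatially structured
set.seed(42)
n    <- 200
lat  <- runif(n)
long <- runif(n)
temp <- 2 * lat + rnorm(n, sd = 0.5)
het  <- rnorm(n)
species <- 3 * temp + 2 * het + 2 * long + rnorm(n)

r2 <- function(m) summary(m)$r.squared
full  <- r2(lm(species ~ temp + het + lat + long))  # total explained variation
env   <- r2(lm(species ~ temp + het))               # environment alone
space <- r2(lm(species ~ lat + long))               # space alone

fractions <- c(pure_env    = full - space,
               pure_space  = full - env,
               shared      = env + space - full,
               unexplained = 1 - full)
print(round(fractions, 2))
```

The four fractions sum to one by construction; the shared fraction can be negative in some data sets.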

NON-LINEAR REGRESSION: nls()

Sometimes we have a mechanistic model for the relationship between y and x, and we want to estimate the parameters and standard errors of the parameters of a specific non-linear equation from data.

We must SPECIFY the exact nature of the function as part of the model formula when we use non-linear modelling.

In place of lm() we write nls() (this stands for ‘non-linear least squares’). Then, instead of y ~ x + I(x^2) + I(x^3) (polynomial), we write y ~ function to spell out the precise non-linear model we want R to fit to the data.

NON-LINEAR REGRESSION: step by step

1. Plot y against x

2. Get an idea of the family of functions that you can fit

3. Start fitting the different models

4. Specify initial guesses for the values of the parameters

5. [Get the MAM for each by model simplification]

6. Check the residuals

7. Compare PAIRS of models and choose the best

Alternative approach

Multimodel inference (minimum deviance + minimum number of parameters)

Compare GROUPS of models at a time

AIC = scaled deviance + 2k

k = parameter number + 1

Model weights and model averaging [see Burnham & Anderson, 2002]

nls(): examples of function families

[Figure: four panels showing asymptotic functions, S-shaped functions, humped functions and exponential functions]

nls(): Look at the data

[Figure: scatterplot of bone against age (age from 0 to 50, bone from 0 to 120)]

Which family? Asymptotic functions, S-shaped functions, humped functions or exponential functions?

Asymptotic exponential

$y \sim a - b\,e^{-cx}$

Understand the role of the parameters a, b, and c.

Using the data plot, work out sensible starting values. It always helps in cases like this to work out the equation’s behaviour at the limits, i.e. find the values of y when x = 0 and when x = ∞.

nls(): Look at the data

Fit the model $y \sim a - b\,e^{-cx}$

1. Estimate a, b, and c (iterative)

2. Extract the fitted values (yi)

3. Check the curve fitting graphically

[Figure: bone against age with the fitted curve]

Can we try another function from the same family?

Model choice is always an important issue in curve fitting (particularly for prediction).

Different behavior at the limits!

Think about your biological system, not just the residual deviance!
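The fitting steps for the asymptotic exponential can be sketched with nls(). The bone and age data of the slides are not available here, so the data are simulated with hypothetical parameter values; the starting values follow the behaviour at the limits (y is about a - b at x = 0 and about a for very large x).

```r
# Simulated data (assumption): asymptotic exponential with noise
set.seed(1)
age  <- runif(60, 0, 50)
bone <- 115 - 110 * exp(-0.12 * age) + rnorm(60, sd = 5)

# 1. Estimate a, b and c iteratively from rough starting values
m_exp <- nls(bone ~ a - b * exp(-c * age),
             start = list(a = 120, b = 110, c = 0.1))
print(summary(m_exp))

# 2. Extract the fitted values on a fine grid of ages
xv <- seq(0, 50, length.out = 200)
yv <- predict(m_exp, newdata = data.frame(age = xv))

# 3. Check the curve fitting graphically
plot(age, bone)
lines(xv, yv)
```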

nls(): Look at the data

Fit a second model

Michaelis–Menten

$y \sim \dfrac{ax}{1 + bx}$

1. Extract the fitted values (yi)

2. Check the curve fitting graphically

[Figure: bone against age with both fitted curves]

You can see that the asymptotic exponential (solid line) tends to get to its asymptote first, and that the Michaelis–Menten (dotted line) continues to increase. Model choice, therefore, would be enormously important if you intended to use the model for prediction to ages much greater than 50 months.
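Fitting the two candidate functions to the same data and comparing them might look like the sketch below. The data are simulated (an assumption, with hypothetical parameter values and starting values), so the ranking of the models here says nothing about the real bone data.

```r
# Simulated data (assumption): generated from an asymptotic exponential
set.seed(1)
age  <- runif(60, 0, 50)
bone <- 115 - 110 * exp(-0.12 * age) + rnorm(60, sd = 5)

# Model 1: asymptotic exponential
m_exp <- nls(bone ~ a - b * exp(-c * age),
             start = list(a = 120, b = 110, c = 0.1))

# Model 2: Michaelis-Menten (asymptote a/b, initial slope a)
m_mm <- nls(bone ~ a * age / (1 + b * age),
            start = list(a = 20, b = 0.15))

# Compare the two models (lower AIC = better compromise of fit and parameters)
aic_tab <- AIC(m_exp, m_mm)
print(aic_tab)

# Behaviour beyond the data range: predictions at age 100
new_age  <- data.frame(age = 100)
pred_100 <- c(exp = predict(m_exp, newdata = new_age),
              mm  = predict(m_mm,  newdata = new_age))
print(pred_100)
```

The two predictions at age 100 illustrate how models that fit similarly within the data range can diverge outside it.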

Application of regression: prediction

Regression models for prediction

A model can be used to predict values of y in space or in time knowing new xi values.

Spatial extent + data range

[Figure: bone against age (age from 0 to 80); prediction within the data range: YES; prediction beyond the data range: NO]

Before using a model for prediction it has to be VALIDATED!!!

2 APPROACHES

VALIDATION

1. In a data-rich situation, set aside a validation set (use one part of the data set to fit the model, and the second part to assess the prediction error of the final selected model).

[Figure: model fit, predicted vs. real y; Residual = Prediction error]

2. If data are scarce, we must resort to “artificially produced” validation sets:
- Cross-validation
- Bootstrap

K-FOLD CROSS-VALIDATION

Split the data randomly into K groups of roughly the same size.

Take turns using one group as the test set and the other K-1 as the training set for fitting the model.

Group:   1      2      3      4      5
1.       Test   Train  Train  Train  Train   Prediction error 1
2.       Train  Test   Train  Train  Train   Prediction error 2
3.       Train  Train  Test   Train  Train   Prediction error 3
4.       Train  Train  Train  Test   Train   Prediction error 4
5.       Train  Train  Train  Train  Test    Prediction error 5

The cross-validation estimate of prediction error is the average of these.
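The K-fold procedure can be sketched in base R. The built-in `cars` data set and the mean squared error as the measure of prediction error are assumptions made for illustration.

```r
# Illustrative data (assumption): built-in 'cars' data set, K = 5 groups
set.seed(123)
K   <- 5
dat <- cars

# Randomly assign each observation to one of K groups of equal size
fold <- sample(rep(1:K, length.out = nrow(dat)))

pred_error <- numeric(K)
for (k in 1:K) {
  train <- dat[fold != k, ]            # K-1 groups for fitting
  test  <- dat[fold == k, ]            # one group for testing
  fit   <- lm(dist ~ speed, data = train)
  pred  <- predict(fit, newdata = test)
  pred_error[k] <- mean((test$dist - pred)^2)   # prediction error of fold k
}

# Cross-validation estimate = average of the K prediction errors
cv_error <- mean(pred_error)
print(pred_error)
print(cv_error)
```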

BOOTSTRAP

1. Generate a large number (n = 10 000) of bootstrap samples.

2. For each bootstrap sample, compute the prediction error (Error1, Error2, Error3, …, Errorn).

3. The mean of these estimates is the bootstrap estimate of prediction error.
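One possible variant of the bootstrap estimate is sketched below: each model is fitted to a bootstrap sample and its prediction error is computed on the observations left out of that sample. This out-of-sample choice, the built-in `cars` data set and the reduced number of samples (1000 instead of 10 000, to keep the run short) are all assumptions.

```r
# Illustrative data (assumption): built-in 'cars' data set
set.seed(123)
B <- 1000                      # number of bootstrap samples (10 000 in the slide)
n <- nrow(cars)

boot_error <- numeric(B)
for (b in 1:B) {
  idx <- sample(n, replace = TRUE)        # 1. draw a bootstrap sample
  oob <- setdiff(seq_len(n), idx)         # observations not drawn
  fit <- lm(dist ~ speed, data = cars[idx, ])
  pred <- predict(fit, newdata = cars[oob, ])
  boot_error[b] <- mean((cars$dist[oob] - pred)^2)   # 2. prediction error
}

# 3. The mean of these estimates is the bootstrap estimate of prediction error
boot_estimate <- mean(boot_error)
print(boot_estimate)
```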

Application of regression: prediction

1. Do not use your model for prediction without carrying out a validation.

If you can, use an independent data set for validating the model.

If you cannot, use at least bootstrap or cross-validation.

2. Never extrapolate.