Introduction to Biostatistical Analysis Using R
Statistics course for first-year PhD students
Lecturer: Lorenzo Marini
DAFNAE, University of Padova, Viale dell'Università 16, 35020 Legnaro, Padova.
E-mail: [email protected]
Tel.: +39 049 8272807
Skype: lorenzo.marini
http://www.biodiversity-lorenzomarini.eu/
Session 4
Lecture: Regression Analysis
Practical: multiple regression
Statistical modelling: more than one parameter

Nature of the response variable:
- NORMAL → General Linear Models
- POISSON, BINOMIAL … → Generalized Linear Models (GLM)

Nature of the explanatory variables:
- Categorical → ANOVA (Session 3)
- Continuous → Regression (Session 4)
- Categorical + continuous → ANCOVA (Session 3)
REGRESSION

Linear methods:
- Simple linear: one X, linear relation
- Multiple linear: 2 or more Xi, linear relation
- Polynomial: one X but more slopes, non-linear relation

Non-linear methods:
- Non-linear: one X, complex relation
LINEAR REGRESSION lm()
Regression analysis is a technique used for the modeling and analysis of numerical data consisting of values of a dependent variable (response variable) and of one or more independent continuous variables (explanatory variables).

Assumptions:
- Independence: the Y-values and the error terms must be independent of each other.
- Linearity between Y and X.
- Normality: the populations of Y-values and the error terms are normally distributed for each level of the predictor variable x.
- Homogeneity of variance: the populations of Y-values and the error terms have the same variance at each level of the predictor variable x.

(Don't test for normality or heteroscedasticity on the raw data; check the residuals instead!)
AIMS
1. To describe the linear relationship between Y and Xi (EXPLANATORY APPROACH) and to quantify how much of the total variation in Y can be explained by the linear relationship with Xi.
2. To predict new values of Y from new values of Xi (PREDICTIVE APPROACH)
LINEAR REGRESSION lm()
Yi = α + βxi + εi

Y: response
Xi: predictors

We estimate one INTERCEPT and one or more SLOPES.
SIMPLE LINEAR REGRESSION: step by step

I step:
- Check linearity [visualization with plot()]

II step:
- Estimate the parameters (one slope and one intercept)

III step:
- Check residuals (check the assumptions looking at the residuals: normality and homogeneity of variance)
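A minimal R sketch of these three steps (the data frame dat and the variables x and y are hypothetical names):

# I step: check linearity visually
plot(y ~ x, data = dat)

# II step: estimate the intercept and the slope
mod <- lm(y ~ x, data = dat)
summary(mod)

# III step: check the residuals
plot(fitted(mod), resid(mod))           # homogeneity of variance
qqnorm(resid(mod)); qqline(resid(mod))  # normality
shapiro.test(resid(mod))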
SIMPLE LINEAR REGRESSION: Normality

Do not test normality over the whole y: the normality assumption concerns the distribution of y (i.e. of the errors) at each level of x, so check the residuals.

SIMPLE LINEAR REGRESSION: MODEL
ŷi = α + βxi

SLOPE: β = Σ[(xi − x̄)(yi − ȳ)] / Σ(xi − x̄)²

INTERCEPT: α = ȳ − β·x̄

The model gives the fitted values.

RESIDUALS: residual = observed yi − fitted ŷi

The model does not explain everything.
SIMPLE LINEAR REGRESSION: least squares explanation

library(animation)

## Slope changing: save the animation in HTML pages
ani.options(ani.height = 450, ani.width = 600, outdir = getwd(),
            title = "Demonstration of Least Squares",
            description = "We want to find an estimate for the slope in 50 candidate slopes, so we just compute the RSS one by one.")
ani.start()
par(mar = c(4, 4, 0.5, 0.1), mgp = c(2, 0.5, 0), tcl = -0.3)
least.squares()
ani.stop()

## Intercept changing: save the animation in HTML pages
ani.options(ani.height = 450, ani.width = 600, outdir = getwd(),
            title = "Demonstration of Least Squares",
            description = "We want to find an estimate for the slope in 50 candidate slopes, so we just compute the RSS one by one.")
ani.start()
par(mar = c(4, 4, 0.5, 0.1), mgp = c(2, 0.5, 0), tcl = -0.3)
least.squares(ani.type = "i")
ani.stop()
SIMPLE LINEAR REGRESSION: hypothesis testing

H0: β = 0 (there is no relation between X and Y)
H1: β ≠ 0

We must measure the unreliability associated with each of the estimated parameters (i.e. we need the standard errors):

SE(β) = √[(residual SS / (n − 2)) / Σ(xi − x̄)²]

t = (β − 0) / SE(β)

Parameter t testing (test the single parameter!)
SIMPLE LINEAR REGRESSION: measure of goodness-of-fit

Total SS = Σ(yobserved,i − ȳ)²
Model SS = Σ(yfitted,i − ȳ)²
Residual SS = Total SS − Model SS

R² = Model SS / Total SS (explained variation)

R² does not provide information about the significance: if the model is significant (β ≠ 0), it tells us how much variation is explained.
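A short sketch of these quantities computed by hand, with the same hypothetical dat as above; the results match summary(mod):

mod <- lm(y ~ x, data = dat)
total.SS    <- sum((dat$y - mean(dat$y))^2)
model.SS    <- sum((fitted(mod) - mean(dat$y))^2)
residual.SS <- total.SS - model.SS
R2   <- model.SS / total.SS                  # equals summary(mod)$r.squared
se.b <- sqrt((residual.SS / (nrow(dat) - 2)) / sum((dat$x - mean(dat$x))^2))
t.b  <- coef(mod)["x"] / se.b                # matches the t value in summary(mod)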
SIMPLE LINEAR REGRESSION: example 1

If the model is significant, then model checking:

1. Linearity between X and Y?
No patterns in the residuals vs. predictor plot → ok

2. Normality of the residuals: Q-Q plot + Shapiro-Wilk test on the residuals

> shapiro.test(residuals)

        Shapiro-Wilk normality test

data:  residuals
W = 0.9669, p-value = 0.2461

Q-Q plot and test → ok
SIMPLE LINEAR REGRESSION: example 1

3. Homoscedasticity: regress the absolute residuals on the fitted values; a non-significant slope indicates homogeneous variance.

Call: lm(formula = abs(residuals) ~ yfitted)
Coefficients:
             Estimate       SE      t      P
(Intercept)   2.17676  2.04315  1.065  0.293
yfitted       0.11428  0.07636  1.497  0.142

The slope is non-significant → ok
SIMPLE LINEAR REGRESSION: example 2

1. Linearity between X and Y? NO LINEARITY between X and Y.

2. Normality of the residuals: Q-Q plot + Shapiro-Wilk test on the residuals

> shapiro.test(residuals)

        Shapiro-Wilk normality test

data:  residuals
W = 0.8994, p-value = 0.001199

The residuals are not normal → no

3. Homoscedasticity? NO: the variance is not homogeneous.
SIMPLE LINEAR REGRESSION: example 2

How to deal with non-linearity and non-normality situations?

Transformation of the data:
- Box-Cox transformation (power transformation of the response)
- Square-root transformation
- Log transformation
- Arcsine transformation
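A minimal sketch of these transformations in R (boxcox() is in the MASS package; dat, x, and y as above; the arcsine form assumes y is a proportion):

library(MASS)
boxcox(y ~ x, data = dat)                      # profile likelihood for the Box-Cox power
mod.sqrt <- lm(sqrt(y) ~ x, data = dat)        # square-root transformation
mod.log  <- lm(log(y) ~ x, data = dat)         # log transformation
mod.asin <- lm(asin(sqrt(y)) ~ x, data = dat)  # arcsine transformation (proportions)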
POLYNOMIAL REGRESSION: one X, n parameters

Polynomial regression: regression with multiple terms (linear, quadratic, and cubic) of a single variable:

Y = a + b1X + b2X² + b3X³ + error      (X is one variable!!!)

Hierarchy in the testing (always test the highest-order term!!!!):

1. Fit X + X² + X³ and test X³: if significant (P < 0.01), stop; if n.s., delete X³.
2. Fit X + X² and test X²: if significant (P < 0.01), stop; if n.s., delete X².
3. Fit X and test X: if significant (P < 0.01), stop; if n.s., there is no relation.

NB: Do not delete lower-order terms even if non-significant.
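A sketch of this hierarchical testing with the same hypothetical dat:

m3 <- lm(y ~ x + I(x^2) + I(x^3), data = dat)
summary(m3)                        # test the cubic term first
m2 <- update(m3, . ~ . - I(x^3))   # if x^3 is n.s., delete it
summary(m2)                        # then test the quadratic term
anova(m2, m3)                      # deletion test: does the simpler model fit worse?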
MULTIPLE LINEAR REGRESSION: more than one X

Multiple regression: regression with two or more variables

Y = a + b1X1 + b2X2 + … + biXi
The Multiple Regression Model
There are important issues involved in carrying out a multiple regression:
• which explanatory variables to include (VARIABLE SELECTION);
• NON-LINEARITY in the response to the explanatory variables;
• INTERACTIONS between explanatory variables;
• correlation between explanatory variables (COLLINEARITY);
• RELATIVE IMPORTANCE of variables
Assumptions
Same assumptions as in the simple linear regression!!!
MULTIPLE LINEAR REGRESSION: more than one X

Multiple regression MODEL: regression with two or more variables

Y = a + b1X1 + b2X2 + … + biXi

Each slope (bi) is a partial regression coefficient. The bi are the most important parameters of the multiple regression model: they measure the expected change in the dependent variable associated with a one-unit change in an independent variable, holding the other independent variables constant. This interpretation of partial regression coefficients is very important because independent variables are often correlated with one another.
MULTIPLE LINEAR REGRESSION: more than one X

Multiple regression MODEL EXPANDED: we can add polynomial terms and interactions:

Y = a + linear terms + quadratic & cubic terms + interactions

QUADRATIC AND CUBIC TERMS account for NON-LINEARITY.
INTERACTIONS account for non-independent effects of the factors.
MULTIPLE LINEAR REGRESSION: step by step

I step:
- Check collinearity (visualization with pairs() and correlation)
- Check linearity

II step:
- Variable selection and model building (different procedures to select the significant variables)

III step:
- Check residuals (check the assumptions looking at the residuals: normality and homogeneity of variance)

Let's begin with an example from air pollution studies: how is ozone concentration related to wind speed, air temperature, and the intensity of solar radiation?
MULTIPLE LINEAR REGRESSION: I STEP
- Check collinearity
- Check linearity
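A sketch of this first step using R's built-in airquality data as a stand-in for the ozone example (its variable names differ slightly from the slides):

data(airquality)
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])
pairs(aq, panel = panel.smooth)  # pairwise scatterplots: linearity and collinearity
cor(aq)                          # correlation matrix among response and predictors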
How to carry out a model simplification in multiple regression

Start with a complex model with interactions and quadratic and cubic terms, then simplify towards the Minimum Adequate Model:

1. Remove non-significant interaction terms.
2. Remove non-significant quadratic or other non-linear terms.
3. Remove non-significant explanatory variables.
4. Amalgamate explanatory variables that have similar parameter values.
MULTIPLE LINEAR REGRESSION: II STEP

II STEP: MODEL BUILDING. Start with the most complicated model (it is one approach):

model1 <- lm(ozone ~ temp * wind * rad + I(rad^2) + I(temp^2) + I(wind^2))

                Estimate  Std.Error      t  Pr(>|t|)
(Intercept)      5.7E+02    2.1E+02   2.74      0.01  **
temp            -1.1E+01    4.3E+00  -2.50      0.01  *
wind            -3.2E+01    1.2E+01  -2.76      0.01  **
rad             -3.1E-01    5.6E-01  -0.56      0.58
I(rad^2)        -3.6E-04    2.6E-04  -1.41      0.16
I(temp^2)        5.8E-02    2.4E-02   2.44      0.02  *
I(wind^2)        6.1E-01    1.5E-01   4.16      0.00  ***
temp:wind        2.4E-01    1.4E-01   1.74      0.09
temp:rad         8.4E-03    7.5E-03   1.12      0.27
wind:rad         2.1E-02    4.9E-02   0.42      0.68
temp:wind:rad   -4.3E-04    6.6E-04  -0.66      0.51

Delete only the highest-order interaction, temp:wind:rad. !!! We cannot delete the other terms yet !!!
MULTIPLE LINEAR REGRESSION: II STEP

Manual model simplification (it is one of the many philosophies): delete the non-significant terms one by one, moving from the COMPLEX model towards the SIMPLE one. At each deletion, test: is the fit of the simpler model worse?

Hierarchy in the deletion:
1. Highest interactions
2. Cubic terms
3. Quadratic terms
4. Linear terms

If quadratic and cubic terms are significant, you cannot delete the linear or the quadratic term, even if they are not significant. If an interaction is significant, you cannot delete its linear terms, even if they are not significant.
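A sketch of one deletion step under this philosophy, with model1 as fitted above:

model2 <- update(model1, . ~ . - temp:wind:rad)  # delete the highest interaction
anova(model1, model2)  # F-test: is the fit of the simpler model worse?
summary(model2)        # if not, keep model2 and continue deleting term by term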
MULTIPLE LINEAR REGRESSION: III STEP

IMPORTANT!!! III STEP: we must check the assumptions.

[Residual plots: the variance tends to increase with y, and the errors are non-normal → NO, NO]

We can transform the data (e.g. log-transformation of y):

model <- lm(log(ozone) ~ temp + wind + rad + I(wind^2))

The log-transformation has improved our model, but maybe there is an outlier.
MULTIPLE LINEAR REGRESSION: more than one X

PARTIAL REGRESSION

With partial regression we can remove the effect of one or more variables (covariates) and test a further factor, which becomes independent of the covariates.

WHEN? When we would like to hold third variables constant but cannot manipulate them; we can use statistical control instead.

HOW? Statistical control is based on residuals. If we regress Y on X1 and take the residuals of Y, this part of Y will be uncorrelated with X1, so anything the Y residuals correlate with will not be explained by X1.
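A minimal sketch of this residual-based control (hypothetical variables y, x1, and x2 in dat):

dat$y.res <- resid(lm(y ~ x1, data = dat))  # the part of y uncorrelated with x1
partial <- lm(y.res ~ x2, data = dat)       # effect of x2 holding x1 constant
summary(partial)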
PARTIAL REGRESSION: VARIATION PARTITIONING

Relative importance of groups of explanatory variables.

[Figure: map of the sites by longitude (km) and latitude (km)]

Response variable: orthopteran species richness
Explanatory variables: SPACE (latitude + longitude) + ENVIRONMENT (temperature + land-cover heterogeneity)

Full.model <- lm(species ~ environment + space)
R² = 76% (TOTAL EXPLAINED VARIATION)

What is space and what is environment? The total variation splits into the explained variation (pure Environment, pure Space, and the shared fraction Space ∩ Environment) and the unexplained variation.
VARIATION PARTITIONING: varpart(vegan)

Full.model <- lm(SPECIES ~ temp + het + lat + long)      # TVE = 76%

Env.model <- lm(SPECIES ~ temp + het)
Pure.Space.model <- lm(env.residuals ~ lat + long)       # VE = 15%

Space.model <- lm(SPECIES ~ lat + long)
Pure.env.model <- lm(space.residuals ~ temp + het)       # VE = 40%

[Venn diagram: Environment, Space, their intersection, and the unexplained variation]
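The same partitioning can be obtained directly with varpart() from the vegan package; a sketch, assuming a response vector species and data frames env (temp, het) and space (lat, long):

library(vegan)
vp <- varpart(species, env, space)  # partitions R² into pure and shared fractions
vp
plot(vp)  # Venn diagram of the fractions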
NON-LINEAR REGRESSION: nls()
Sometimes we have a mechanistic model for the relationship between y and x, and we want to estimate the parameters and standard errors of the parameters of a specific non-linear equation from data.
We must SPECIFY the exact nature of the function as part of the model formula when we use non-linear modelling
In place of lm() we write nls() (this stands for 'non-linear least squares'). Then, instead of y ~ x + I(x^2) + I(x^3) (polynomial), we write y ~ function to spell out the precise non-linear model we want R to fit to the data.
NON-LINEAR REGRESSION: step by step

1. Plot y against x
2. Get an idea of the family of functions that you can fit
3. Start fitting the different models
4. Specify initial guesses for the values of the parameters
5. [Get the MAM for each model by model simplification]
6. Check the residuals
7. Compare PAIRS of models and choose the best (minimum deviance + minimum number of parameters)

Alternative approach: multimodel inference, comparing GROUPS of models at a time with AIC = scaled deviance + 2k, where k = parameter number + 1; use model weights and model averaging [see Burnham & Anderson, 2002].
nls(): examples of function families

Asymptotic functions, S-shaped functions, humped functions, exponential functions.

nls(): Look at the data

Using the data plot, work out sensible starting values. It always helps in cases like this to work out the equation's behaviour at the limits, i.e. find the values of y when x = 0 and when x = ∞.

[Figure: bone length plotted against age (0 to 50 months). Which family fits: asymptotic, S-shaped, humped, or exponential?]

Asymptotic exponential:

y ~ a − b·e^(−c·x)

Understand the role of the parameters a, b, and c.
nls(): Look at the data

Fit the model y ~ a − b·e^(−c·x):

1. Estimate a, b, and c (iterative)
2. Extract the fitted values (ŷi)
3. Check the curve fitting graphically

[Figure: fitted asymptotic exponential over the bone-vs-age data]

Model choice is always an important issue in curve fitting (particularly for prediction). Can we try another function from the same family? Note the different behaviour at the limits! Think about your biological system, not just the residual deviance!
nls(): Look at the data

Fit a second model, the Michaelis–Menten:

y ~ a·x / (1 + b·x)

1. Extract the fitted values (ŷi)
2. Check the curve fitting graphically

[Figure: both fitted curves over the bone-vs-age data]

You can see that the asymptotic exponential (solid line) tends to get to its asymptote first, and that the Michaelis–Menten (dotted line) continues to increase. Model choice, therefore, would be enormously important if you intended to use the model for prediction to ages much greater than 50 months.
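A sketch of the two fits, assuming a hypothetical data frame growth with columns age and bone and rough starting values read off the plot:

model.exp <- nls(bone ~ a - b * exp(-c * age), data = growth,
                 start = list(a = 120, b = 110, c = 0.064))
model.mm  <- nls(bone ~ a * age / (1 + b * age), data = growth,
                 start = list(a = 8, b = 0.08))
plot(bone ~ age, data = growth)
av <- seq(0, 50, 0.5)
lines(av, predict(model.exp, list(age = av)), lty = 1)  # solid: asymptotic exponential
lines(av, predict(model.mm,  list(age = av)), lty = 2)  # dotted: Michaelis-Menten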
Application of regression: prediction

Regression models for prediction: a model can be used to predict values of y in space or in time knowing new xi values, but only within the spatial extent and the data range used to fit it.

[Figure: prediction within the fitted age range (YES) vs. extrapolation beyond it (NO)]

Before using a model for prediction it has to be VALIDATED!!!

VALIDATION: 2 APPROACHES

1. In a data-rich situation, set aside a validation set (use one part of the data set to fit the model, and the second part for assessing the prediction error of the final selected model).
2. If data are scarce, we must resort to "artificially produced" validation sets: cross-validation or the bootstrap.

[Figure: real y plotted against predicted y around the model fit; residual = prediction error]
K-FOLD CROSS-VALIDATION

Split the data randomly into K groups with roughly the same size. Take turns using one group as the test set and the other k − 1 as the training set for fitting the model, computing a prediction error each time.

[Diagram: with K = 5, each fold serves once as Test while the other four are Train, giving prediction errors 1 to 5]

The cross-validation estimate of prediction error is the average of these.
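A minimal sketch of K-fold cross-validation for a simple regression model (hypothetical dat with columns x and y):

K <- 5
folds <- sample(rep(1:K, length.out = nrow(dat)))  # random fold assignment
cv.err <- numeric(K)
for (k in 1:K) {
  train <- dat[folds != k, ]
  test  <- dat[folds == k, ]
  fit   <- lm(y ~ x, data = train)
  cv.err[k] <- mean((test$y - predict(fit, newdata = test))^2)  # error on held-out fold
}
mean(cv.err)  # cross-validation estimate of prediction error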
BOOTSTRAP

1. Generate a large number (n = 10 000) of bootstrap samples.
2. For each bootstrap sample, compute the prediction error.
3. The mean of these estimates is the bootstrap estimate of prediction error.
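A sketch of the bootstrap estimate under the same assumptions; here the rows left out of each bootstrap sample (out-of-bag) serve as the test set, which is one common variant:

B <- 10000
boot.err <- numeric(B)
for (b in 1:B) {
  idx <- sample(nrow(dat), replace = TRUE)   # bootstrap sample of the rows
  oob <- setdiff(seq_len(nrow(dat)), idx)    # out-of-bag rows as test set
  fit <- lm(y ~ x, data = dat[idx, ])
  boot.err[b] <- mean((dat$y[oob] - predict(fit, newdata = dat[oob, ]))^2)
}
mean(boot.err)  # bootstrap estimate of prediction error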
Application of regression: prediction

1. Do not use your model for prediction without carrying out a validation: if you can, use an independent data set for validating the model; if you cannot, use at least the bootstrap or cross-validation.
2. Never extrapolate.