21

Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

  • Upload
    dohanh

  • View
    219

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Part II { Oneway Anova, Simple Linear Regression and

ANCOVA with R

Gilles Lamothe

February 21, 2017

Contents

1 Anova with one factor 2

1.1 The data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 A visual description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 The aov function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.4 Pairwise comparisons with Tukey's method . . . . . . . . . . . . . . . . . . . . . . . 4

1.5 Comparison to a control { Dunnett's test . . . . . . . . . . . . . . . . . . . . . . . . 5

1.6 Recode a vector with car . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.7 Test for trend . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.8 Fitting a polynomial response . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.9 Optimizing a function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Simple Linear Regression 9

2.1 Test for lack-of-�t . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3 Applying our new skills 11

4 General Linear Model { ANCOVA 12

4.1 Regression by groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4.2 A general linear model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4.3 Test for Lack-of-�t . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.4 Comparing the �tness status groups . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4.5 Comparing the e�ect of age according to �tness status . . . . . . . . . . . . . . . . . 19

5 Applying our new skills 21

1

Page 2: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 2

Intro to R Workshop - Psychology Statistics Club

1 Anova with one factor

1.1 The data

Consider the data in the �le Fish.txt. (Source: Design and Analysis of Experiements,by Dean and Voss).

Description:

� The response variable in this experiment is hemoglobin in the blood of browntrout (in grams per 100 ml).

� The trout were randomly separated in four trough.

� The food given to the trough contained 0, 5, 10 and 15 grams of sulfamerazineper 100 lbs of �sh (coded 1, 2, 3, 4).

� The measures were achieved on 10 randomly chosen �sh from each trough after35 days.

We import the data and display the names of the columns.

> fish<-read.table("Fish.txt",header=TRUE,sep="\t")> names(fish)[1] "hemoglobin" "group"> sapply(fish,is.factor)hemoglobin group

FALSE FALSE

We will start with an ANOVA. The explanatory variable must be a factor (i.e. acategorical variable). It identi�ed the groups.

> fish$group<-factor(fish$group)> levels(fish$group)[1] "1" "2" "3" "4"

1.2 A visual description

We can use comparative boxplots to describe the amount of hemoglobin in the bloodaccording the amount of sulfamerazine in the food.

with(fish,boxplot(hemoglobin~group,ylab="Hemoglobin",xlab="Group"))

Page 3: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 3

Does the amount of sulfamerazine in the food appear to have e�ects on the amountof hemoglobin in the blood?

1.3 The aov function

We will use the aov function to �t an ANOVA model with one factor. Then, withthe summary function, we obtain a summary of the �t.

> mod<-aov(hemoglobin~group,fish)> summary(mod)

Df Sum Sq Mean Sq F value Pr(>F)group 3 26.50 8.833 5.732 0.00259 **Residuals 36 55.48 1.541---Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Remarks:

� The group means are signi�cantly di�erent.

� An Anova model is a Linear model, so we could have used the lm function to �tthe model. Note that R2 = 32:32%.

> model<-lm(hemoglobin~group,fish)> summary(model)

Call:lm(formula = hemoglobin ~ group, data = fish)

Residuals:Min 1Q Median 3Q Max

-2.910 -0.915 0.000 0.730 2.390

Coefficients:Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.2000 0.3926 18.341 < 2e-16 ***group2 2.1100 0.5552 3.801 0.000537 ***group3 1.8300 0.5552 3.296 0.002209 **group4 1.4900 0.5552 2.684 0.010928 *---Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Page 4: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 4

Residual standard error: 1.241 on 36 degrees of freedomMultiple R-squared: 0.3232,Adjusted R-squared: 0.2668F-statistic: 5.732 on 3 and 36 DF, p-value: 0.002593

We can get the ANOVA table for the �tted linear model. This is the test for theequality of means.

> anova(model)Analysis of Variance Table

Response: hemoglobinDf Sum Sq Mean Sq F value Pr(>F)

group 3 26.499 8.8329 5.7316 0.002593 **Residuals 36 55.479 1.5411---Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

1.4 Pairwise comparisons with Tukey's method

We can use an aov object in the function TukeyHSD.

> TukeyHSD(mod)Tukey multiple comparisons of means

95% family-wise confidence level

Fit: aov(formula = hemoglobin ~ group, data = fish)

$groupdiff lwr upr p adj

2-1 2.11 0.61479395 3.6052061 0.00289183-1 1.83 0.33479395 3.3252061 0.01133894-1 1.49 -0.00520605 2.9852061 0.05108503-2 -0.28 -1.77520605 1.2152061 0.95752464-2 -0.62 -2.11520605 0.8752061 0.68171344-3 -0.34 -1.83520605 1.1552061 0.9274651

We can use the glht function from the multcomp package to test these multiplehypotheses.

> library(multcomp)> summary(glht(model, linfct = mcp(group = "Tukey")))

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts

Page 5: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 5

Fit: lm(formula = hemoglobin ~ group, data = fish)

Linear Hypotheses:Estimate Std. Error t value Pr(>|t|)

2 - 1 == 0 2.1100 0.5552 3.801 0.00275 **3 - 1 == 0 1.8300 0.5552 3.296 0.01120 *4 - 1 == 0 1.4900 0.5552 2.684 0.05091 .3 - 2 == 0 -0.2800 0.5552 -0.504 0.957534 - 2 == 0 -0.6200 0.5552 -1.117 0.681714 - 3 == 0 -0.3400 0.5552 -0.612 0.92747---Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1(Adjusted p values reported -- single-step method)

1.5 Comparison to a control { Dunnett's test

At times, we are not necessarily interested in all pairwise comparison. We might onlybe interested in comparing the control group to the other groups. To do so, we useDunnett's test.

> summary(glht(model, linfct = mcp(group = "Dunnett")))

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Dunnett Contrasts

Fit: lm(formula = hemoglobin ~ group, data = fish)

Linear Hypotheses:Estimate Std. Error t value Pr(>|t|)

2 - 1 == 0 2.1100 0.5552 3.801 0.00159 **3 - 1 == 0 1.8300 0.5552 3.296 0.00613 **4 - 1 == 0 1.4900 0.5552 2.684 0.02895 *---Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1(Adjusted p values reported -- single-step method)

Remark: R assumes that the �rst level for the factor corresponds to the controlgroup. Keep this in mind, when using Dunnett's test.

1.6 Recode a vector with car

When the explanatory variable x is quantitative, we might want to describe the meanresponse as a function of x.

Let us, import the data in the �le fish.txt again.

Page 6: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 6

> fish<-read.table("Fish.txt",header=TRUE,sep="\t")> names(fish)[1] "hemoglobin" "group"> unique(fish$group)[1] 1 2 3 4

The groups 1, 2, 3, 4 are associated to the quantity x in grams of sulfamerazineper 100 lbs of �sh, which are 0, 5, 10 and 15, respectively.

We can recode a vector with the function recode from the car package.

# Install the car packageinstall.packages("car")# Load the car packagelibrary(car)

We recode the values and assign them to the vector sulfamerazine. Then, weadd the new vector to the dataframe fish.

> sulfamerazine<-recode(fish$group,"1=0;2=5;3=10;4=15")> fish<-data.frame(fish,sulfamerazine)> unique(fish$sulfamerazine)[1] 0 5 10 15

1.7 Test for trend

We can do a test for trend with orthogonal polynomials in x of degree 0, 1, 2, . . . ,r-1, where r is the number of levels of x.

Our explanatory variable has r = 4 levels. So we will use polynomials with atmost a degree of three.

> model<-aov(hemoglobin~poly(sulfamerazine,3),fish)> summary.lm(model)

Call:aov(formula = hemoglobin ~ poly(sulfamerazine, 3), data = fish)

Residuals:Min 1Q Median 3Q Max

-2.910 -0.915 0.000 0.730 2.390

Coefficients:Estimate Std. Error t value Pr(>|t|)

(Intercept) 8.5575 0.1963 43.598 < 2e-16 ***poly(sulfamerazine, 3)1 2.9628 1.2414 2.387 0.02238 *poly(sulfamerazine, 3)2 -3.8738 1.2414 -3.120 0.00355 **

Page 7: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 7

poly(sulfamerazine, 3)3 1.6476 1.2414 1.327 0.19281---Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 1.241 on 36 degrees of freedomMultiple R-squared: 0.3232,Adjusted R-squared: 0.2668F-statistic: 5.732 on 3 and 36 DF, p-value: 0.002593

Remark: The cubic trend is not signi�cant. However, the quadratic trend issigni�cant. We need a polynomial of degree 2 to describe the mean hemoglobin as afunction of the amount of sulfamerazine.

1.8 Fitting a polynomial response

We want to �nd an estimated regression model of the form:

y = b0 + b1 x+ b2 x2;

where y is the estimated amount of hemoglobin and x is the amount sulfamerazinein the food.

We want to add x and x2 to the model. We use the syntax I(x2) to add x2 to themodel.

> lm(hemoglobin~sulfamerazine+I(sulfamerazine^2),fish)

Call:lm(formula = hemoglobin ~ sulfamerazine + I(sulfamerazine^2),

data = fish)

Coefficients:(Intercept) sulfamerazine I(sulfamerazine^2)

7.3165 0.4513 -0.0245

Our estimated model is

y = 7:3165 + 0:4513x� 0:0245x2:

We will produce a scatter plot with an overlay of the estimated regression model.

### produce a scatter plotwith(fish,plot(sulfamerazine,hemoglobin,xlab="Sulfamerazine (in grams per 100 lbs of fish)",ylab="Hemoglobin in the blood",cex.lab=1.25,cex.axis=1.25))

The above command produces the following scatter plot.

Page 8: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 8

To overlay the estimated model on the plot, we will need to evaluate the modelat many points. Since x goes from 0 to 15, then we will will produce a sequence ofmany values from 0 to 15. Then, we create a dataframe that contains all of thesevalues and name this vector sulfamerazine. It should have the same name as theexplanatory factor in the model.

x=seq(0,15,by=0.1)newData<-data.frame(sulfamerazine=x)

We now use the function predict, to evaluate the estimated model at all thesevalues.

P<-predict(model,newData)lines(x,P)text(10,7,expression(hat(y)==7.3165+0.4513*x-0.0245*x^2),cex=1.2)

Remarks:

� The function lines adds points to the plot that are joined by line segments.

� The function text adds text to the plot. The function expression also us todisplay a mathematical formula.

� Here is the resulting graph. Note that it appears that the optimal amount ofsulfamerazine is somewhere between 5 and 10.

1.9 Optimizing a function

R has a function called optimize for optimizing univariate functions. To maximizethe mean amount of hemoglobin in the blood, we the amount of sulfamerazine shouldbe at x = 9:21. This quantity corresponds to an estimated amount of hemoglobin inthe blood of y = 9:39.

Page 9: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 9

> quadf<-function(x) 7.3165+0.4513*x-0.0245*x^2> optimize(quadf, lower = 0, upper = 15,maximum = TRUE)$maximum[1] 9.210204

$objective[1] 9.394783

2 Simple Linear Regression

Let us reconsider, the �sh data, where we describe the hemoglobin in the blood as afunction of the amount of sulfamerazine in the food. Suppose that someone decidesto use a Regression approach instead of an ANOVA approach.

They might try to �t a simple linear regression and produce a scatter plot with anoverlay of the estimated line.

mod<-lm(hemoglobin~sulfamerazine,fish)with(fish,plot(sulfamerazine,hemoglobin))abline(mod)

Comments:

� The �t appears not bad. However, we might be concerned with the lack-of-�t ofour model. It does not appear to capture the curvature in the trend for smallvalues of x.

� When there a replication for at least one of the values of x, then we can do atest for lack-of-�t.

2.1 Test for lack-of-�t

To test for lack-of-�t, we start with allowing the mean response as a function of thelevels of x, but not with any particular pattern. That is, if there are r levels, thenthere could be r di�erent means. This is an ANOVA model.

Our null hypothesis will be the mean of the response can be expressed as a linearfunction of x. This is our simple linear model with x as a predictor.

The ANOVA model and the simple linear model are nested. So we can use theanova function to compare the models.

Page 10: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 10

> # fit the full model> mod<-lm(hemoglobin~factor(sulfamerazine),fish)> # fit the reduced model> mod0<-lm(hemoglobin~sulfamerazine,fish)> # compare the models> anova(mod0,mod)Analysis of Variance Table

Model 1: hemoglobin ~ sulfamerazineModel 2: hemoglobin ~ factor(sulfamerazine)

Res.Df RSS Df Sum of Sq F Pr(>F)1 38 73.2002 36 55.479 2 17.721 5.7494 0.00681 **---Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

We have signi�cant evidence for the lack-of-�t of the simple linear regressionmodel. We could try to improve the �t by approximating the response functionwith a quadratic function in x.

Let us test for the lack-of-�t of the quadratic model. The evidence for the lack-of-�tof the quadratic model is not signi�cant.

> # fit the full model> mod<-lm(hemoglobin~factor(sulfamerazine),fish)> # fit the reduced model> mod0<-lm(hemoglobin~sulfamerazine+I(sulfamerazine^2),fish)> # compare the models> anova(mod0,mod)Analysis of Variance Table

Model 1: hemoglobin ~ sulfamerazine + I(sulfamerazine^2)Model 2: hemoglobin ~ factor(sulfamerazine)

Res.Df RSS Df Sum of Sq F Pr(>F)1 37 58.1932 36 55.479 1 2.7144 1.7614 0.1928

Page 11: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 11

3 Applying our new skills

Exercises to try in class.

1. Consider a company that produces items made from glass. The data is in the�le glassworks.txt.(Source: Applied Linear Models, by Kutner et al.).

A large company is studying the e�ects of the length of special training for newemployees on the quality of work. Employees are randomly assigned to haveeither 6, 8, 10, or 12 hours of training. After the special training, each employeeis given the same amount of material. The response variable is the number ofacceptable pieces.

(a) Are there the e�ects of the duration of the special training on the quality ofthe work signi�cant?

(b) Do a trend analysis to describe the quality of the work as a function of thetime in training?

2. Consider the data in the �le Productivity.txt. (Source: Applied Linear Mod-els, by Kutner et al.).

Description: An economist compiled data on productivity improvements for asample of �rms producing computing equipment. The productivity improvementis measured on a scale from 0 to 100. The �rms were classi�ed according to thelevel of their average expenditures for R&D in the last three years (low, medium,high).

(a) Test whether or not the mean productivity improvements di�ers accordingto the level of R&D research.

Page 12: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 12

4 General Linear Model { ANCOVA

Rehabilitation: Consider the data in the �le Rehabilitation.txt.

Describe the association between physical �tness prior to surgery of persons under-going corrective knee surgery and time required in physical therapy until successfulrehabilitation.

� The response: The number of days required for successful completion of physicaltherapy.

� One explanatory factor: Status of �tness prior to surgery. It is categorical with3 levels: Below average (=1), average (=2), above average (=3).

� Observational Design: Cross sectional. A random selection of 24 male subjectsranging in age from 18 to 30 years that undergone similar corrective knee surgeryin the last year.

� This is an ANOVA.

Remarks:

� We can extend an ANOVA by including one ore more numerical explanatoryfactors. This is called an analysis of covariance (ANCOVA).

� The purposes of ANCOVA:

{ Reduce the error variance, so that the ANOVA becomes more powerful.

{ Control for confounding.

{ An ANCOVA model is a special case of a General Linear Model (i.e. alinear model with categorical and/or numerical predictors).

{ We will use the lm function to build the model.

Learning Objectives:

� Learn to create overlayed scatter plots.

� Learn to use the lm function with more than on predictor.

� Learn to test for the lack-of-�t of linearity.

� Multiple testing concerning regression coe�cients (i.e. compare many slopes orcompare many intercepts).

Page 13: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 13

4.1 Regression by groups

We display the names of the variables and display the levels of the categorical variable.

> dataR<-read.table("Rehabilitation.txt",header=TRUE,sep="\t")> names(dataR)[1] "time" "status" "age"> sapply(dataR,is.factor)

time status ageFALSE FALSE FALSE

> dataR$status<-factor(dataR$status)> levels(dataR$status)[1] "1" "2" "3"

We will create subsets of our dataframe according to the levels of the categoricalvariable.

## We use a logical argument to determine the rows to keep.## Nothing after the comma means we keep all columns.data1=dataR[dataR$status=="1",]data2=dataR[dataR$status=="2",]data3=dataR[dataR$status=="3",]

For each group, we will �nd the least square line for the time for rehabilitation asa function of age.

To properly overlay the three scatter plots, we will need the range for all ages andfor all values of the response.

ylim=range(dataR$time)xlim=range(dataR$age)

We are now ready to produce the three plots by using the plot to create the �rstplot and use points to overlay the other plots.

ylab="Time to rehabilitation (in days)"xlab="Age (in years)"plot(data1$age,data1$time,ylab=ylab,xlab=xlab,ylim=ylim,xlim=xlim)abline(lm(data1$time~data1$age),lty=1)points(data2$age,data2$time,ylim=ylim,xlim=xlim,pch=2)abline(lm(data2$time~data2$age),lty=2)points(data3$age,data3$time,ylim=ylim,xlim=xlim,pch=3)abline(lm(data3$time~data3$age),lty=3)legend("bottomright", c("Below Average","Average","Above Average"),lty=c(1,2,3), pch=c(1,2,3))

Page 14: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 14

Remarks:

� The slopes appear to be similar. Here the coe�cients for the three models.

> lm(time~age,data=data1)$coefficients(Intercept) age

6.703220 1.195104> lm(time~age,data=data2)$coefficients(Intercept) age

6.090041 1.144939> lm(time~age,data=data3)$coefficients(Intercept) age-0.6326823 1.1368930

� If we can assume that the slopes are the same, then to describe the e�ects of the�tness status it su�ces to compare the intercepts.

� To formally compare the slopes, we can build a linear model to allows for di�erentslopes according to �tness status and compare it to a reduced model where theslopes are assumed to be equal. (This is a general linear test.)

4.2 A general linear model

We �t a linear model: age and status are the predictors. The symbol * is indicatethat we want a model with interactions. This means that we are allowing a di�erentcoe�cient for age according to the levels of status.

> modelF=lm(time~age*status,data=dataR)> anova(modelF)Analysis of Variance Table

Response: timeDf Sum Sq Mean Sq F value Pr(>F)

age 1 835.75 835.75 2530.9102 < 2.2e-16 ***status 2 246.08 123.04 372.6086 2.257e-15 ***age:status 2 0.22 0.11 0.3359 0.7191Residuals 18 5.94 0.33

Page 15: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 15

---Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Remarks:

� There are not signi�cant interactions between age and status. These resultssuggest that the slope of the regression between the response and age is similarfor the three groups.

� We will �t the reduced model and compare it the above full model. The symbol+ indicates that we want an additive model. This means only one coe�cient forage (i.e. it does not depend on the �tness status).

> modelR=lm(time~age+status,data=dataR)> anova(modelR,modelF)Analysis of Variance Table

Model 1: time ~ age + statusModel 2: time ~ age * status

Res.Df RSS Df Sum of Sq F Pr(>F)1 20 6.16572 18 5.9439 2 0.22184 0.3359 0.7191

Remark: This is a general linear test for the equality of the slopes. The p-valueis large. So the evidence against the equality of the slopes is not signi�cant.Note that the p-value is exactly the same at the test for interaction. Actually,the tests are testing exactly the same thing.

4.3 Test for Lack-of-�t

Is it reasonable to assume that the associations between time for rehabilitation andage are linear? If not, then the ANCOVA will be biased.

If we have repeated values for the variable age, we can test for linearity. We simplymake the age a factor. This means that we make no assumption about the form ofthe association between the response and age.

> modelFull=lm(time~factor(age)*status,data=dataR)> anova(modelF,modelFull)Analysis of Variance Table

Model 1: time ~ age * statusModel 2: time ~ factor(age) * status

Res.Df RSS Df Sum of Sq F Pr(>F)1 18 5.94392 0 0.0000 18 5.9439

Remarks:

Page 16: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 16

� We did not get a p-value. This means that there are not repeated values for age.So we cannot measure the error variance.

� If we round the ages to the nearest integer, we might get repeated values. Thereis not signi�cant di�erence between the two models. Hence it is reasonable toassume linearity.

> modelFull=lm(time~factor(round(age,0))*status,data=dataR)> anova(modelF,modelFull)Analysis of Variance Table

Model 1: time ~ age * statusModel 2: time ~ factor(round(age, 0)) * status

Res.Df RSS Df Sum of Sq F Pr(>F)1 18 5.94392 6 1.1667 12 4.7772 2.0474 0.1949

� If rounding did not work, try to approximate deviations from linearity with aquadratic model. We will �t the larger model by using a quadratic function ofage.

> anova(modelF,modelFull)Analysis of Variance Table

Model 1: time ~ age * statusModel 2: time ~ (age + I(age^2)) * status

Res.Df RSS Df Sum of Sq F Pr(>F)1 18 5.94392 15 4.8200 3 1.124 1.1659 0.3555

There are no signi�cant di�erences between the two nested models. So it isreasonable to use the simpler model, i.e. to assume that the associations betweenthe response and age are linear.

4.4 Comparing the �tness status groups

We established that the slopes of the three groups are the same. This means that itsu�ces to compare the intercepts for a comparison of the three �tness status groups.

We will use the additive model. The p-value for the signi�cance of the status isvery small. The are signi�cant di�erences between the three groups.

> modelR=lm(time~age+status,data=dataR)> anova(modelR)Analysis of Variance Table

Response: timeDf Sum Sq Mean Sq F value Pr(>F)

Page 17: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 17

age 1 835.75 835.75 2710.95 < 2.2e-16 ***status 2 246.08 123.04 399.11 < 2.2e-16 ***Residuals 20 6.17 0.31---Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Warning:

� We will change the order of the predictors. Look at the F -value for status. Doyou notice something strange? It has changed a lot.

� R uses sequential sums of squares (i.e. of type I). This means that when testingfor the signi�cance of a predictor, we are only controlling for the e�ects of theprevious covariable.

� If we put age after status, then the test for status does not take age into account.

� To not worry about order, we could use type II sum of squares. We use theAnova function in the car package.

> anova(lm(time~status+age,data=dataR))Analysis of Variance Table

Response: timeDf Sum Sq Mean Sq F value Pr(>F)

status 2 672.00 336.00 1089.9 < 2.2e-16 ***age 1 409.83 409.83 1329.4 < 2.2e-16 ***Residuals 20 6.17 0.31---Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

> library(car)> Anova(lm(time~status+age,data=dataR))Anova Table (Type II tests)

Response: timeSum Sq Df F value Pr(>F)

status 246.08 2 399.11 < 2.2e-16 ***age 409.83 1 1329.39 < 2.2e-16 ***Residuals 6.17 20---Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Page 18: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 18

We established that there are signi�cant di�erences between the three �tness statusgroups. Where are does di�erences?

We will need the multcomp package. Once it is installed, we load it: > library(multcomp).

We need to interpret the regression coe�cients: �0; �1; �2; �3.

> modelR$coefficients(Intercept) age status2 status3

7.431688 1.167286 -1.847379 -8.722893

� �1 is the slope for age.

� Notice that status 1 is missing. This means that the intercept (i.e. �0 is forgroup 1).

� �2 is the e�ect for group 2 in comparison to group 1.

� �3 is the e�ect for group 3 in comparison to group 1.

We want to test:

�2 = 0 i.e. groups 1 and 2 are equal

�3 = 0 i.e. groups 1 and 3 are equal

�2 = �3 i.e. groups 2 and 3 are equal

We can use the glht function from the multcomp package to test these multiplehypotheses.

> summary(glht(modelR, linfct = mcp(status = "Tukey")))

Simultaneous Tests for General Linear Hypotheses

Multiple Comparisons of Means: Tukey Contrasts

Fit: lm(formula = time ~ age + status, data = dataR)

Linear Hypotheses:Estimate Std. Error t value Pr(>|t|)

2 - 1 == 0 -1.8474 0.2869 -6.438 <1e-05 ***3 - 1 == 0 -8.7229 0.3330 -26.198 <1e-05 ***3 - 2 == 0 -6.8755 0.2884 -23.842 <1e-05 ***---Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1(Adjusted p values reported -- single-step method)

Page 19: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 19

4.5 Comparing the e�ect of age according to �tness status

We concluded above that the e�ect of age on the time for rehabilitation were equalequal between the �tness groups. What would we do if they were not equal?

We will do some pairwise comparisons of the slopes. Let us consider the full modelwith interactions.

> modelF=lm(time~age*status,data=dataR)> modelF$coefficients(Intercept) age status2 status3 age:status2 age:status36.70321989 1.19510377 -0.61317875 -7.33590221 -0.05016525 -0.05821075

We would like to compare the coe�cients for age: age, age:status2 age:status3.We will denote them �1, �4, �5.

� �1 is the slope for age within group 1.

� �4 is the comparison (i.e. deviation) of the slope for age within group 2 comparedto group 1.

� �5 is the comparison (i.e. deviation) of the slope for age within group 3 comparedto group 1.

We want to test:

�4 = 0 slopes for groups 1 and 2 are equal

�5 = 0 slopes for groups 1 and 3 are equal

�4 � �5 = 0 slopes for groups 2 and 3 are equal

We can use the glht function from the multcomp package to test these multiplehypotheses.

library(multcomp)K <- rbind(c(0, 0, 0, 0, 1, 0),c(0, 0, 0, 0, 0, 1),c(0, 0, 0, 0, 1, -1))rownames(K) <- c("2 - 1", "3 - 1","2 - 3")summary(glht(modelF, linfct = K))

Remarks:

� The model has six coe�cients: �0; �1; : : : ; �5.

� We are building a matrix for the test: the columns are the coe�cients and therows are the comparisons.

Page 20: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 20

� The row (0, 0, 0, 1, -1, 0) means �4 � �5.

We obtain the following output.

Simultaneous Tests for General Linear Hypotheses

Fit: lm(formula = time ~ age * status, data = dataR)

Linear Hypotheses:Estimate Std. Error t value Pr(>|t|)

2 - 1 == 0 -0.050165 0.079799 -0.629 0.8063 - 1 == 0 -0.058211 0.081856 -0.711 0.7592 - 3 == 0 0.008045 0.092427 0.087 0.996(Adjusted p values reported -- single-step method)

We observe that there are no signi�cant di�erences in the pairwise comparisons.This should not be surprising, since we had concluded that the slopes were equal.

Page 21: Part II { Oneway Anova, Simple Linear Regression and ...aix1.uottawa.ca/~glamothe/PsyStatClub/RworkshopIII.pdf · Part II { Oneway Anova, Simple Linear Regression and ANCOAV with

Intro to R Workshop - Psychology Statistics Club 21

5 Applying our new skills

Exercises to try in class.

1. Consider the data in the �le Productivity.txt. (Source: Applied Linear Mod-els, by Kutner et al.).

Description: An economist compiled data on productivity improvements for asample of �rms producing computing equipment. The productivity improvementis measured on a scale from 0 to 100. The �rms were classi�ed according to thelevel of their average expenditures for R&D in the last three years (low, medium,high).

The economist also has information on annual productivity improvement in theprior year and wishes to use this information as a covariable in the model.

(a) Test whether the three regression lines that describe the productivity im-provements as a linear function of the productivity improvements in theprior year have the same slope?

(b) Test a formal test here as to whether the three regression functions arelinear?

(c) Make pairwise comparisons to compare the �rm according to their averageexpenditures for R&D in the last three years with an ANCOVA.