R for Macroecology
Tests and models
Smileys
Homework: solutions to the color assignment problem?
Statistical tests in R
This is why we are using R (and not C++)!
t.test(), aov(), lm(), glm(), and many more
Read the documentation!
With statistical tests, it's particularly important to read and understand the documentation of each function you use.
They may do complicated things depending on their options, and you want to make sure they do what you want.
Default behaviors can change (with, e.g., sample size).
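As a concrete example of a default that matters: t.test() performs the Welch test (variances not assumed equal) unless you ask for the pooled-variance version. A minimal sketch:

```r
x <- 1:10
y <- 3:12

# Default: Welch test, var.equal = FALSE
welch <- t.test(x, y)

# Classic two-sample t-test with pooled variance
pooled <- t.test(x, y, var.equal = TRUE)

welch$method   # contains "Welch"
pooled$method  # no "Welch" in the name
```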
Returns from statistical tests
Statistical tests are functions, so they return objects:

> x = 1:10
> y = 3:12
> t.test(x,y)

        Welch Two Sample t-test

data:  x and y
t = -1.4771, df = 18, p-value = 0.1569
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.8446619  0.8446619
sample estimates:
mean of x mean of y 
      5.5       7.5 
Returns from statistical tests
Statistical tests are functions, so they return objects:

> x = 1:10
> y = 3:12
> test = t.test(x,y)
> str(test)
List of 9
 $ statistic  : Named num -1.48
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named num 18
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 0.157
 $ conf.int   : atomic [1:2] -4.845 0.845
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num [1:2] 5.5 7.5
  ..- attr(*, "names")= chr [1:2] "mean of x" "mean of y"
 $ null.value : Named num 0
  ..- attr(*, "names")= chr "difference in means"
 $ alternative: chr "two.sided"
 $ method     : chr "Welch Two Sample t-test"
 $ data.name  : chr "x and y"
 - attr(*, "class")= chr "htest"

t.test() returns a list
Returns from statistical tests: getting the results out
This hopefully looks familiar after last week’s debugging

> x = 1:10
> y = 3:12
> test = t.test(x,y)
> test$p.value
[1] 0.1569322
> test$conf.int[2]
[1] 0.8446619
> test[[3]]
[1] 0.1569322
Model specification
Models in R use a common formula syntax:

Y ~ X1 + X2 + X3 + ... + Xi

This means Y is modeled as a linear function of X1 through Xi.
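A minimal sketch of this syntax in use (the data here are made up for illustration): a formula is an object in its own right, and a fitting function like lm() returns one coefficient per term plus an intercept.

```r
f <- y ~ x1 + x2     # a formula is not evaluated when it is created
class(f)             # "formula"

set.seed(1)          # made-up example data
x1 <- runif(20); x2 <- runif(20)
y  <- 2 + 3 * x1 - x2 + rnorm(20, sd = 0.1)

fit <- lm(f)         # lm() finds y, x1, x2 where the formula was created
coef(fit)            # intercept plus one slope per predictor
```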
Linear models
Basic linear models are fit with lm()
Again, lm() returns a list

> x = 1:10
> y = 3:12
> test = lm(y ~ x)
> test

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x  
          2            1  
Linear models
summary() is helpful for looking at a model

> summary(test)

Call:
lm(formula = y ~ x)

Residuals:
       Min         1Q     Median         3Q        Max 
-1.293e-15 -1.986e-16  1.343e-16  3.689e-16  6.149e-16 

Coefficients:
             Estimate Std. Error   t value Pr(>|t|)    
(Intercept) 2.000e+00  4.019e-16 4.976e+15   <2e-16 ***
x           1.000e+00  6.477e-17 1.544e+16   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.883e-16 on 8 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1 
F-statistic: 2.384e+32 on 1 and 8 DF, p-value: < 2.2e-16
Extracting coefficients and P values
For the list (e.g. "test") returned by lm(), test$coefficients gives the coefficient estimates, but not the standard errors or p-values.
Instead, use summary(test)$coefficients
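summary()$coefficients is a matrix with one row per term and named columns, so you can index it directly. A sketch with made-up noisy data (the perfect-fit example above has degenerate standard errors):

```r
set.seed(42)                   # made-up data with some noise added
x <- 1:10
y <- 3:12 + rnorm(10, sd = 0.5)
fit <- lm(y ~ x)

ctab <- summary(fit)$coefficients
colnames(ctab)                 # "Estimate" "Std. Error" "t value" "Pr(>|t|)"
ctab["x", "Estimate"]          # the slope
ctab["x", "Std. Error"]        # its standard error
ctab["x", "Pr(>|t|)"]          # its p-value
```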
Model specification: interactions
Interactions are specified with a * or a :

X1 * X2 means X1 + X2 + X1:X2
(X1 + X2 + X3)^2 means each term and all second-order interactions
- removes terms
An intercept is included by default, but can be removed with “-1”
More help is available using ?formula
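These expansions can be checked from the coefficient names that lm() produces (made-up data, purely for illustration):

```r
set.seed(2)                       # made-up data
x1 <- runif(20); x2 <- runif(20)
y  <- x1 + x2 + rnorm(20, sd = 0.1)

names(coef(lm(y ~ x1 * x2)))      # "(Intercept)" "x1" "x2" "x1:x2"
names(coef(lm(y ~ x1 : x2)))      # "(Intercept)" "x1:x2"  (interaction only)
names(coef(lm(y ~ x1 + x2 - 1)))  # "x1" "x2"  (intercept removed)
```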
Quadratic terms
Because ^2 means something specific in the context of a model formula, if you want to square one of your predictors you have to do something special: wrap it in I().

> x = 1:10
> y = 3:12
> test = lm(y ~ x + x^2)
> test$coefficients
(Intercept)           x 
          2           1 
> test = lm(y ~ x + I(x^2))
> test$coefficients
  (Intercept)             x        I(x^2) 
 2.000000e+00  1.000000e+00 -4.401401e-17
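An alternative to I() worth knowing: poly() builds polynomial terms inside the formula. With raw = TRUE it fits the ordinary powers of x; a sketch with made-up data:

```r
x <- 1:10
y <- 2 + x + 0.5 * x^2            # exact quadratic, no noise

fit <- lm(y ~ poly(x, 2, raw = TRUE))
coef(fit)                          # approximately 2, 1, 0.5
# Without raw = TRUE, poly() uses orthogonal polynomials: numerically
# stabler, but the coefficients no longer equal the raw powers' slopes.
```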
A break to try things out
t test, ANOVA, linear models
Plotting a test object
plot(t.test(x,y)) does nothing
plot(lm(y~x)) plots diagnostic graphs
Forward and backward selection
Uses the step() function
object – the starting model
scope – specifies the range of models to consider
direction – "backward", "forward", or "both"?
trace – print progress to the screen?
steps – set a maximum number of steps
k – penalty for adding variables (k = 2 gives AIC)
> x1 = runif(100)
> x2 = runif(100)
> x3 = runif(100)
> x4 = runif(100)
> x5 = runif(100)
> x6 = runif(100)
> y = x1+x2+x3+runif(100)
> model = step(lm(y~1), scope = y~x1+x2+x3+x4+x5+x6, direction = "both", trace = F)
> summary(model)

Call:
lm(formula = y ~ x2 + x3 + x1)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.517640 -0.264732 -0.003983  0.241431  0.525636 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.47422    0.09381   5.055 2.06e-06 ***
x2           0.90890    0.09887   9.192 8.08e-15 ***
x3           1.11036    0.10555  10.520  < 2e-16 ***
x1           1.00836    0.10865   9.281 5.23e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3059 on 96 degrees of freedom
Multiple R-squared: 0.7701, Adjusted R-squared: 0.7629 
F-statistic: 107.2 on 3 and 96 DF, p-value: < 2.2e-16
All subsets selection
Uses the leaps() function (from the leaps package):

leaps(x=, y=, wt=rep(1, NROW(x)), int=TRUE, method=c("Cp", "adjr2", "r2"),
      nbest=10, names=NULL, df=NROW(x), strictly.compatible=TRUE)

x – a matrix of predictors
y – a vector of the response
method – how to compare models (Mallows' Cp, adjusted R2, or R2)
nbest – number of models of each size to return
> Xmat = cbind(x1,x2,x3,x4,x5,x6)
> leaps(x = Xmat, y, method = "Cp", nbest = 2)
$which
      1     2     3     4     5     6
1 FALSE  TRUE FALSE FALSE FALSE FALSE
1  TRUE FALSE FALSE FALSE FALSE FALSE
2  TRUE FALSE  TRUE FALSE FALSE FALSE
2 FALSE  TRUE  TRUE FALSE FALSE FALSE
3  TRUE  TRUE  TRUE FALSE FALSE FALSE
3 FALSE  TRUE  TRUE  TRUE FALSE FALSE
4  TRUE  TRUE  TRUE  TRUE FALSE FALSE
4  TRUE  TRUE  TRUE FALSE  TRUE FALSE
5  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
5  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
6  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

$label
[1] "(Intercept)" "1" "2" "3" "4" "5" "6"

$size
 [1] 2 2 3 3 4 4 5 5 6 6 7

$Cp
 [1] 191.732341 197.672391  85.498456  87.117764   3.466464  81.286117   3.579295   5.078805
 [9]   5.138063   5.509598   7.000000
> leapOut = leaps(x = Xmat, y, method = "Cp", nbest = 2)
> aicVals = NULL
> for(i in 1:nrow(leapOut$which))
+ {
+   model = as.formula(paste("y~", paste(c("x1","x2","x3","x4","x5","x6")[leapOut$which[i,]], collapse = "+")))
+   test = lm(model)
+   aicVals[i] = AIC(test)
+ }
> aicVals
 [1] 159.13825 161.18166 113.95184 114.84992  52.81268 112.42959  52.81609
 [8]  54.40579  54.34347  54.74159  56.19513
> i = 8
> leapOut$which[i,]
    1     2     3     4     5     6 
 TRUE  TRUE  TRUE FALSE  TRUE FALSE 
> c("x1","x2","x3","x4","x5","x6")[leapOut$which[i,]]
[1] "x1" "x2" "x3" "x5"
> paste("y~", paste(c("x1","x2","x3","x4","x5","x6")[leapOut$which[i,]], collapse = "+"))
[1] "y~ x1+x2+x3+x5"
Comparing AIC of the best models
> data.frame(leapOut$which, aicVals)
      X1    X2    X3    X4    X5    X6   aicVals
1  FALSE  TRUE FALSE FALSE FALSE FALSE 159.13825
2   TRUE FALSE FALSE FALSE FALSE FALSE 161.18166
3   TRUE FALSE  TRUE FALSE FALSE FALSE 113.95184
4  FALSE  TRUE  TRUE FALSE FALSE FALSE 114.84992
5   TRUE  TRUE  TRUE FALSE FALSE FALSE  52.81268
6  FALSE  TRUE  TRUE  TRUE FALSE FALSE 112.42959
7   TRUE  TRUE  TRUE  TRUE FALSE FALSE  52.81609
8   TRUE  TRUE  TRUE FALSE  TRUE FALSE  54.40579
9   TRUE  TRUE  TRUE  TRUE  TRUE FALSE  54.34347
10  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  54.74159
11  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  56.19513
Practice with the mammal data
VIF, lm(), AIC(), leaps()
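A sketch of what VIF measures, to go with the practice exercise: each predictor is regressed on the others, and VIF_j = 1 / (1 - R_j^2). The helper vif_by_hand below is made up for illustration (in practice the car package's vif() is the standard tool), and the data are fabricated so that x3 nearly duplicates x1:

```r
# Hypothetical helper: VIF computed by hand from each predictor's R^2
vif_by_hand <- function(X) {
  sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
    1 / (1 - r2)
  })
}

set.seed(3)                               # made-up data
x1 <- runif(100); x2 <- runif(100)
x3 <- x1 + rnorm(100, sd = 0.05)          # x3 is nearly collinear with x1
X  <- cbind(x1, x2, x3)
vif_by_hand(X)                            # large VIFs for x1 and x3
```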