Hadley Wickham Stat405 Intro to modelling Tuesday, 16 November 2010

24 modelling


1. What is a linear model?

2. Removing trends

3. Transformations

4. Categorical data

5. Visualising models


What is a linear model?

[Figure sequence, built up over several slides, labelling in turn: observed value, predicted value, residual.]

y ~ x
# yhat = b1x + b0
# Want to find b's that minimise distance
# between y and yhat

z ~ x + y
# zhat = b2x + b1y + b0
# Want to find b's that minimise distance
# between z and zhat

z ~ x * y
# zhat = b3(x⋅y) + b2x + b1y + b0
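As a minimal sketch of what these formulas expand to, the base R code below fits the additive and the interaction model to simulated data. The data and the "true" coefficients (1, 2, 3, 4) are made up for illustration and are not from the slides. lm() chooses the b's by least squares, that is, by minimising the sum of squared distances between the observed and predicted values.

```r
set.seed(405)
n <- 200
x <- runif(n)
y <- runif(n)
# Hypothetical truth: z = 1 + 2x + 3y + 4xy + noise
z <- 1 + 2 * x + 3 * y + 4 * x * y + rnorm(n, sd = 0.1)

# z ~ x + y: an intercept plus one slope per variable
mod_add <- lm(z ~ x + y)
names(coef(mod_add))

# z ~ x * y: the same terms plus the x:y interaction
mod_int <- lm(z ~ x * y)
round(coef(mod_int), 1)

# The quantity lm() minimised: the sum of squared residuals
sum(resid(mod_int)^2)
```

Because the additive model is nested inside the interaction model, its sum of squared residuals can never be smaller.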


Assumptions

X is measured without error.

Relationship is linear.

Errors are independent.

Errors have normal distribution.

Errors have constant variance.
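One rough, informal way to probe the constant-variance assumption is to compare the spread of the residuals across the range of x. The sketch below uses simulated data and a hypothetical helper, spread_ratio(); neither comes from the slides, and this is an illustration rather than a formal test.

```r
set.seed(1)
n <- 1000
x <- runif(n, 1, 10)
# Constant error variance (assumption holds)
y_ok <- 2 + 3 * x + rnorm(n, sd = 1)
# Error spread grows with x (assumption violated)
y_bad <- 2 + 3 * x + rnorm(n, sd = x / 2)

# Hypothetical helper: ratio of residual spread in the upper
# half of x to the spread in the lower half
spread_ratio <- function(y, x) {
  r <- resid(lm(y ~ x))
  upper <- x > median(x)
  sd(r[upper]) / sd(r[!upper])
}

ratio_ok <- spread_ratio(y_ok, x)    # close to 1 when the variance is constant
ratio_bad <- spread_ratio(y_bad, x)  # well above 1 when the spread grows with x
c(ok = ratio_ok, bad = ratio_bad)
```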


Removing trends


library(ggplot2)

diamonds$x[diamonds$x == 0] <- NA

diamonds$y[diamonds$y == 0] <- NA

diamonds$y[diamonds$y > 30] <- NA

diamonds$z[diamonds$z == 0] <- NA

diamonds$z[diamonds$z > 30] <- NA

diamonds <- subset(diamonds, carat < 2)

qplot(x, y, data = diamonds)

qplot(x, z, data = diamonds)


mody <- lm(y ~ x, data = diamonds, na.action = na.exclude)

coef(mody)

# yhat = 0.05 + 0.99⋅x

# Plot x vs yhat

qplot(x, predict(mody), data = diamonds)

# Plot x vs (y - yhat) = residual

qplot(x, resid(mody), data = diamonds)

# Standardised residual:

qplot(x, rstandard(mody), data = diamonds)
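The na.exclude setting matters here because the residuals are plotted against a column of the original data. A small sketch on made-up data (not the diamonds data) shows the difference: with the default na.omit the residual vector is shorter than the data, while na.exclude pads it with NA so the two line up.

```r
set.seed(2)
# Made-up data with one missing x value
d <- data.frame(x = c(1:9, NA), y = c(2 * (1:9) + rnorm(9), 5))

mod_omit <- lm(y ~ x, data = d, na.action = na.omit)
mod_excl <- lm(y ~ x, data = d, na.action = na.exclude)

length(resid(mod_omit))  # 9: the row with the NA is dropped
length(resid(mod_excl))  # 10: padded with NA to match nrow(d)

# A residual is the observed value minus the predicted value
all.equal(unname(resid(mod_excl)), d$y - unname(predict(mod_excl)))
```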


qplot(x, resid(mody), data = diamonds)

qplot(x, y - x, data = diamonds)

Your turn

Do the same thing for z and x. What threshold might you use to remove outlying values?

Are the errors from predicting z and y from x related?


modz <- lm(z ~ x, data = diamonds, na.action = na.exclude)

coef(modz)
# zhat = 0.03 + 0.61x

qplot(x, rstandard(modz), data = diamonds)
last_plot() + ylim(-10, 10)

qplot(rstandard(mody), rstandard(modz))


Transformations


Can we use a linear model to remove this trend?

Linear models only need to be linear in their parameters; the variables can be any transformation of the data.
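To make "linear in the parameters" concrete, here is a small sketch on simulated data (the curve and its coefficients are invented for illustration). The response depends on log(x), so a straight line in x fits poorly, yet y = b0 + b1 * log(x) is still a linear model that lm() can fit.

```r
set.seed(3)
x <- seq(1, 10, length.out = 200)
# Hypothetical curved relationship: y depends on log(x), not on x
y <- 1 + 2 * log(x) + rnorm(length(x), sd = 0.1)

mod_raw <- lm(y ~ x)       # straight line in x: misses the curve
mod_log <- lm(y ~ log(x))  # still linear in b0 and b1

coef(mod_log)

# The transformed model leaves much less unexplained variation
c(raw = sum(resid(mod_raw)^2), log = sum(resid(mod_log)^2))
```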


Your turn

Use a linear model to remove the effect of carat on price. Confirm that this worked by plotting model residuals vs. color.

How can you interpret the model coefficients and residuals?


modprice <- lm(log(price) ~ log(carat), data = diamonds, na.action = na.exclude)

diamonds$relprice <- exp(resid(modprice))

qplot(carat, relprice, data = diamonds)
diamonds <- subset(diamonds, carat < 2)
qplot(carat, relprice, data = diamonds)

qplot(carat, relprice, data = diamonds) + facet_wrap(~ color)
qplot(relprice, ..density.., data = diamonds, colour = color, geom = "freqpoly", binwidth = 0.2)
qplot(relprice, ..density.., data = diamonds, colour = cut, geom = "freqpoly", binwidth = 0.2)


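Multiplicative model

log(Y) = a * log(X) + b

Y = c * X^a, where c = exp(b)

An additive model on the log scale becomes a multiplicative model on the original scale.

The intercept becomes a scaling constant (the predicted Y at X = 1); the slope becomes the exponent, so a fixed percentage change in X gives a fixed percentage change in Y.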
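For the log-log form log(Y) = a * log(X) + b, exponentiating both sides gives Y = exp(b) * X^a. The sketch below checks this on simulated data; the constant 3 and the exponent 1.5 are invented for illustration, and the noise is assumed to be multiplicative.

```r
set.seed(4)
X <- runif(500, 0.2, 2)
# Hypothetical multiplicative truth: Y = 3 * X^1.5, with multiplicative noise
Y <- 3 * X^1.5 * exp(rnorm(500, sd = 0.05))

mod <- lm(log(Y) ~ log(X))
b <- coef(mod)[[1]]  # intercept on the log scale
a <- coef(mod)[[2]]  # slope on the log scale

exp(b)  # back-transformed intercept: estimate of the constant
a       # slope: estimate of the exponent

# Doubling X multiplies the predicted Y by 2^a, whatever the starting X
pred <- exp(predict(mod, data.frame(X = c(0.5, 1, 2))))
c(pred[[2]] / pred[[1]], pred[[3]] / pred[[2]], 2^a)
```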


Residuals

resid(mod) = log(Y) - log(Yhat)

exp(resid(mod)) = Y / (Yhat)
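These two identities can be checked numerically. The following is a sketch on made-up data, where Yhat is taken to be the prediction transformed back to the original scale with exp().

```r
set.seed(5)
X <- runif(100, 0.2, 2)
Y <- 3 * X^1.5 * exp(rnorm(100, sd = 0.1))  # made-up data

mod <- lm(log(Y) ~ log(X))
Yhat <- unname(exp(fitted(mod)))  # predictions on the original scale

# Residuals on the log scale are differences of logs...
all.equal(unname(resid(mod)), log(Y) - log(Yhat))

# ...so exponentiating them gives the ratio of observed to predicted
rel <- unname(exp(resid(mod)))
all.equal(rel, Y / Yhat)
```

A value of rel above 1 means the observation is higher than the model predicts, and a value below 1 means it is lower.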


# Useful trick - close to 0, exp(x) ~ x + 1
x <- seq(-0.2, 0.2, length = 100)
qplot(x, exp(x)) + geom_abline(intercept = 1, slope = 1)

# Relative error of approximating exp(x) by x + 1
qplot(x, (x + 1 - exp(x)) / exp(x)) + scale_y_continuous("Percent error", labels = scales::percent)

# Not so useful here because the x is also
# transformed
coef(modprice)


Categorical data


Your turn

Compare the results of the following two calls. What can you say about the model?

library(plyr)

ddply(diamonds, "color", summarise, mean = mean(price))

coef(lm(price ~ color, data = diamonds))


Categorical data

A categorical variable is converted into a numeric matrix with one column for each level. A column contains 1 if the observation has that level and 0 otherwise.

However, if we do that naively, we end up with one column too many, because the intercept already has a column of its own.

So the column for the first level is dropped, and every coefficient is relative to the first level.
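The coding can be inspected directly with model.matrix(). This sketch uses a tiny made-up data frame with an unordered factor, for which R's default is treatment contrasts. An ordered factor would get polynomial contrasts by default, and the color and cut columns of diamonds are assumed here to be ordered, so their coefficient names would look different.

```r
# Made-up data: an unordered factor with three levels
d <- data.frame(
  g = factor(c("a", "a", "b", "b", "c", "c")),
  y = c(1, 3, 5, 7, 10, 12)
)

# Naive coding would need columns ga, gb and gc plus the intercept;
# treatment contrasts drop the column for the first level
mm <- model.matrix(y ~ g, data = d)
mm

group_means <- tapply(d$y, d$g, mean)
coefs <- coef(lm(y ~ g, data = d))

# Intercept = mean of the first level;
# other coefficients = differences from that mean
coefs
group_means - group_means[["a"]]
```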


Visualising models


# What do you think this model does?
lm(log(price) ~ log(carat) + color, data = diamonds)

# What about this one?
lm(log(price) ~ log(carat) * color, data = diamonds)

# Or this one?
lm(log(price) ~ cut * color, data = diamonds)

# How can we interpret the results?
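One way to read the difference between + and * with a categorical variable is to look at which coefficients each model estimates. The sketch below uses simulated data with a two-level unordered factor; the groups, intercepts and slopes are invented for illustration.

```r
set.seed(6)
n <- 300
g <- factor(sample(c("a", "b"), n, replace = TRUE))
x <- runif(n)
# Made-up truth: group b has a higher intercept AND a steeper slope
y <- ifelse(g == "a", 1 + 2 * x, 2 + 5 * x) + rnorm(n, sd = 0.1)

mod_add <- lm(y ~ x + g)  # one shared slope, separate intercepts
mod_int <- lm(y ~ x * g)  # separate intercepts and separate slopes

names(coef(mod_add))
names(coef(mod_int))

# Slope for group b = baseline slope + interaction term
slope_b <- coef(mod_int)[["x"]] + coef(mod_int)[["x:gb"]]
slope_b
```

The additive model forces the fitted lines for the two groups to be parallel, while the interaction model lets each group have its own slope.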


mod1 <- lm(log(price) ~ log(carat) + cut, data = diamonds)
mod2 <- lm(log(price) ~ log(carat) * cut, data = diamonds)

# One way is to explore predictions from the model
# over an evenly spaced grid. expand.grid makes
# this easy

grid <- expand.grid(
  carat = seq(0.2, 2, length = 20),
  cut = levels(diamonds$cut),
  KEEP.OUT.ATTRS = FALSE
)
str(grid)
grid

grid$p1 <- exp(predict(mod1, grid))
grid$p2 <- exp(predict(mod2, grid))


Your turn

Plot the predictions from the two sets of models. How are they different?


qplot(carat, p1, data = grid, colour = cut, geom = "line")
qplot(carat, p2, data = grid, colour = cut, geom = "line")

qplot(log(carat), log(p1), data = grid, colour = cut, geom = "line")
qplot(log(carat), log(p2), data = grid, colour = cut, geom = "line")

qplot(carat, p1 / p2, data = grid, colour = cut, geom = "line")


# Another approach is the effects package
# install.packages("effects")
library(effects)
effect("cut", mod1)

cut <- as.data.frame(effect("cut", mod1))
qplot(fit, reorder(cut, fit), data = cut)
qplot(fit, reorder(cut, fit), data = cut) + geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.1)

qplot(exp(fit), reorder(cut, fit), data = cut) + geom_errorbarh(aes(xmin = exp(lower), xmax = exp(upper)), height = 0.1)
