24 modelling

Hadley Wickham

Stat405: Intro to modelling

Tuesday, 16 November 2010

1. What is a linear model?

2. Removing trends

3. Transformations

4. Categorical data

5. Visualising models


What is a linear model?



[Figure: scatterplot with a fitted line, built up over several slides. Labels: observed value, predicted value, residual.]


y ~ x
# yhat = b1x + b0
# Want to find b's that minimise distance
# between y and yhat

z ~ x + y
# zhat = b2x + b1y + b0
# Want to find b's that minimise distance
# between z and zhat

z ~ x * y
# zhat = b3(x⋅y) + b2x + b1y + b0
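As a minimal sketch of what "find b's that minimise distance" means in practice: lm() measures distance as the sum of squared residuals. The data below are simulated for illustration (the values and the helper rss() are assumptions, not from the slides), and the coefficients from lm() are compared with those obtained by solving the normal equations directly.

```r
# Simulated data (hypothetical values, not from the slides)
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50, sd = 0.3)

# Fit y ~ x by least squares
mod <- lm(y ~ x)

# Solve the normal equations (X'X) b = X'y directly
X <- cbind(1, x)                       # the column of 1s gives the intercept b0
b <- solve(t(X) %*% X, t(X) %*% y)

# Illustrative helper: sum of squared distances between y and yhat
rss <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

coef(mod)                                 # b0 and b1 from lm()
drop(b)                                   # the same values from the normal equations
rss(b[1], b[2]) < rss(b[1] + 0.1, b[2])   # moving away from the fit increases the distance
```

Any other choice of b0 and b1 gives a larger sum of squares, which is the sense in which the fitted line is "closest" to the data.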


Assumptions

X is measured without error.

Relationship is linear.

Errors are independent.

Errors have normal distribution.

Errors have constant variance.
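One way to see why the constant-variance assumption matters is to break it on purpose. This is a rough sketch on simulated data (all names and values are illustrative assumptions): two responses share the same linear trend, but in the second the error spread grows with x. Comparing the spread of the standardised residuals in the upper and lower halves of x shows the difference.

```r
set.seed(2)
x <- runif(200, 1, 10)
y_ok  <- 1 + 2 * x + rnorm(200, sd = 1)         # constant variance
y_bad <- 1 + 2 * x + rnorm(200, sd = 0.5 * x)   # variance grows with x

r_ok  <- rstandard(lm(y_ok ~ x))
r_bad <- rstandard(lm(y_bad ~ x))

# Ratio of residual spread: upper half of x relative to lower half
upper <- x > median(x)
spread_ratio <- c(
  ok  = sd(r_ok[upper])  / sd(r_ok[!upper]),
  bad = sd(r_bad[upper]) / sd(r_bad[!upper])
)
spread_ratio   # near 1 when the assumption holds, well above 1 when it does not
```

Plotting the residuals against x, as on the following slides, is the visual version of the same check.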


Removing trends


library(ggplot2)

diamonds$x[diamonds$x == 0] <- NA

diamonds$y[diamonds$y == 0] <- NA

diamonds$y[diamonds$y > 30] <- NA

diamonds$z[diamonds$z == 0] <- NA

diamonds$z[diamonds$z > 30] <- NA

diamonds <- subset(diamonds, carat < 2)

qplot(x, y, data = diamonds)

qplot(x, z, data = diamonds)




mody <- lm(y ~ x, data = diamonds, na.action = na.exclude)

coef(mody)

# yhat = 0.05 + 0.99⋅x

# Plot x vs yhat

qplot(x, predict(mody), data = diamonds)

# Plot x vs (y - yhat) = residual

qplot(x, resid(mody), data = diamonds)

# Standardised residual:

qplot(x, rstandard(mody), data = diamonds)
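The plots above only work because na.exclude pads the residuals and predictions with NA, so they stay the same length as the data. The sketch below illustrates this on a small made-up data frame (not the diamonds data). It spells the argument out as na.action; the shorter na = na.exclude relies on R's partial argument matching.

```r
# Small made-up data frame with one missing x
df <- data.frame(x = c(1, 2, NA, 4, 5), y = c(1.1, 1.9, 3.2, 3.9, 5.1))

mod_omit    <- lm(y ~ x, data = df)                           # default: na.omit
mod_exclude <- lm(y ~ x, data = df, na.action = na.exclude)

length(resid(mod_omit))      # 4: the incomplete row is dropped
length(resid(mod_exclude))   # 5: padded with NA to match nrow(df)

# With na.exclude, residuals line up with the original rows,
# and each one is observed minus predicted
all.equal(unname(resid(mod_exclude)), df$y - unname(predict(mod_exclude)))
```

With the default na.omit, resid() would be shorter than the x column, so the residuals could not be plotted against the original rows.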


qplot(x, resid(mody), data=dclean)

qplot(x, y - x, data=dclean)

Your turn

Do the same thing for z and x. What threshold might you use to remove outlying values?

Are the errors from predicting z and y from x related?


modz <- lm(z ~ x, data = diamonds, na.action = na.exclude)

coef(modz)
# zhat = 0.03 + 0.61x

qplot(x, rstandard(modz), data = diamonds)
last_plot() + ylim(-10, 10)

qplot(rstandard(mody), rstandard(modz))


Transformations


Can we use a linear model to remove this trend?





Linear models are linear in their parameters, not necessarily in the data: each term can be any transformation of the data.
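A small sketch of this idea, using simulated data (the values are assumptions chosen for illustration): the response is curved in x but linear in log(x), so lm() can still fit it once the predictor is transformed.

```r
set.seed(3)
x <- runif(100, 1, 20)
y <- 3 + 2 * log(x) + rnorm(100, sd = 0.1)   # curved in x, linear in log(x)

mod_raw <- lm(y ~ x)        # straight line in x: leaves a curved trend behind
mod_log <- lm(y ~ log(x))   # still a linear model: yhat = b0 + b1 * log(x)

coef(mod_log)               # close to the simulated 3 and 2
c(raw = summary(mod_raw)$r.squared,
  log = summary(mod_log)$r.squared)
```

The model is linear because yhat is a weighted sum of the b's; what the b's multiply can be any function of the data.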


Your turn

Use a linear model to remove the effect of carat on price. Confirm that this worked by plotting model residuals vs. color.

How can you interpret the model coefficients and residuals?


modprice <- lm(log(price) ~ log(carat), data = diamonds, na.action = na.exclude)

diamonds$relprice <- exp(resid(modprice))

qplot(carat, relprice, data = diamonds)
diamonds <- subset(diamonds, carat < 2)
qplot(carat, relprice, data = diamonds)

qplot(carat, relprice, data = diamonds) + facet_wrap(~ color)
qplot(relprice, ..density.., data = diamonds, colour = color, geom = "freqpoly", binwidth = 0.2)
qplot(relprice, ..density.., data = diamonds, colour = cut, geom = "freqpoly", binwidth = 0.2)


Multiplicative model

log(Y) = a * log(X) + b

Y = c * X^a, where c = exp(b)

An additive model becomes a multiplicative model.

Intercept becomes the starting point (c is the value of Y at X = 1); slope becomes the power on X.
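Exponentiating both sides of log(Y) = a * log(X) + b gives Y = exp(b) * X^a. The sketch below checks this on simulated data with multiplicative noise (the values 5 and 1.7 are arbitrary assumptions for illustration): fitting on the log-log scale recovers both the multiplier and the power.

```r
set.seed(4)
X <- runif(200, 0.2, 2)
Y <- 5 * X^1.7 * exp(rnorm(200, sd = 0.05))   # multiplicative noise

mod <- lm(log(Y) ~ log(X))
b <- coef(mod)[["(Intercept)"]]
a <- coef(mod)[["log(X)"]]

# Back on the original scale: multiplier exp(b) and power a
c(multiplier = exp(b), power = a)   # close to the simulated 5 and 1.7
```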


Residuals

resid(mod) = log(Y) - log(Yhat)

exp(resid(mod)) = Y / (Yhat)
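The identity above can be confirmed numerically. This is a minimal sketch on simulated data (names and values are illustrative assumptions): the exponentiated residuals from a log-scale model equal the ratio of observed to predicted values on the original scale.

```r
set.seed(5)
X <- runif(50, 0.2, 2)
Y <- 3 * X^2 * exp(rnorm(50, sd = 0.1))

mod  <- lm(log(Y) ~ log(X))
Yhat <- exp(predict(mod))      # prediction back on the original scale

rel <- exp(resid(mod))         # relative residual
all.equal(unname(rel), unname(Y / Yhat))
range(rel)                     # ratios scattered around 1
```

A relative residual of 1.2 therefore means the observation is 20% above what the model predicts.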


# Useful trick - close to 0, exp(x) ~ x + 1
x <- seq(-0.2, 0.2, length = 100)
qplot(x, exp(x)) + geom_abline(intercept = 1, slope = 1)

# Recent versions of ggplot2 take labels = instead of formatter =
qplot(x, x / exp(x)) + scale_y_continuous("Percent error", labels = scales::percent)

# Not so useful here because the x is also
# transformed
coef(modprice)
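To put numbers on the trick, the sketch below compares exp(x) with 1 + x at a few values near 0 (the grid of values is an arbitrary choice). The practical reading is that a log-scale residual of 0.05 is roughly 5% above the prediction.

```r
x <- c(-0.2, -0.1, -0.01, 0.01, 0.1, 0.2)
linear  <- 1 + x                          # the approximation
rel_err <- (exp(x) - linear) / exp(x)     # error relative to the exact value

data.frame(x, exact = exp(x), linear, rel_err = round(rel_err, 4))
```

The relative error stays under about 2.5% across this range and shrinks quickly as x approaches 0.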


Categorical data


Your turn

Compare the results of the following two calls. What can you say about the model?

library(plyr)  # provides ddply() and summarise()

ddply(diamonds, "color", summarise, mean = mean(price))

coef(lm(price ~ color, data = diamonds))


Categorical data

Each categorical variable is converted into a numeric matrix, with one column for each level. A column contains 1 if that observation has that level, 0 otherwise.

However, if we just do that naively, we end up with one column too many, because the intercept already has a column of its own.

So the column for the first level is dropped, and everything is relative to the first level.
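The encoding can be inspected with model.matrix(). This sketch uses a tiny made-up data frame with an unordered factor (an assumption for illustration; for ordered factors R's default is polynomial contrasts, so the columns and coefficients are labelled and interpreted differently).

```r
# Made-up data with an unordered three-level factor
df <- data.frame(
  g = factor(c("a", "a", "b", "b", "c", "c")),
  y = c(1, 3, 4, 6, 10, 12)
)

mm <- model.matrix(y ~ g, data = df)
mm                                   # columns: (Intercept), gb, gc; none for "a"

means <- tapply(df$y, df$g, mean)    # a = 2, b = 5, c = 11
cf <- coef(lm(y ~ g, data = df))
cf                                   # (Intercept) = 2, gb = 3, gc = 9
```

The intercept is the mean of the first level, and each remaining coefficient is the difference between that level's mean and the first level's mean.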


Visualising models


# What do you think this model does?
lm(log(price) ~ log(carat) + color, data = diamonds)

# What about this one?
lm(log(price) ~ log(carat) * color, data = diamonds)

# Or this one?
lm(log(price) ~ cut * color, data = diamonds)

# How can we interpret the results?


mod1 <- lm(log(price) ~ log(carat) + cut, data = diamonds)
mod2 <- lm(log(price) ~ log(carat) * cut, data = diamonds)

# One way is to explore predictions from the model
# over an evenly spaced grid. expand.grid makes
# this easy

grid <- expand.grid(
  carat = seq(0.2, 2, length = 20),
  cut = levels(diamonds$cut),
  KEEP.OUT.ATTRS = FALSE
)
str(grid)
grid

grid$p1 <- exp(predict(mod1, grid))
grid$p2 <- exp(predict(mod2, grid))
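The same grid-and-predict pattern can be tried without the diamonds data. The sketch below uses simulated data with a two-level factor (all names and values are assumptions for illustration) and checks one consequence of the additive model: on the original scale, the ratio between the two groups' predictions is the same at every x.

```r
set.seed(6)
df <- data.frame(
  x = runif(120, 0.2, 2),
  g = factor(sample(c("low", "high"), 120, replace = TRUE))
)
df$y <- exp(1 + 1.5 * log(df$x) + ifelse(df$g == "high", 0.3, 0) +
              rnorm(120, sd = 0.05))

mod_add <- lm(log(y) ~ log(x) + g, data = df)   # same slope, shifted intercepts
mod_int <- lm(log(y) ~ log(x) * g, data = df)   # separate slope for each level

# Every combination of 5 evenly spaced x values and the 2 levels
grid <- expand.grid(x = seq(0.2, 2, length.out = 5), g = levels(df$g),
                    KEEP.OUT.ATTRS = FALSE)
grid$p_add <- exp(predict(mod_add, grid))
grid$p_int <- exp(predict(mod_int, grid))
grid

# Additive model: constant ratio between groups across x
ratio <- unname(grid$p_add[grid$g == "low"] / grid$p_add[grid$g == "high"])
ratio
```

With the interaction model the ratio is free to change with x, which is the difference the line plots of the two sets of predictions make visible.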


Your turn

Plot the predictions from the two sets of models. How are they different?


qplot(carat, p1, data = grid, colour = cut, geom = "line")
qplot(carat, p2, data = grid, colour = cut, geom = "line")

qplot(log(carat), log(p1), data = grid, colour = cut, geom = "line")
qplot(log(carat), log(p2), data = grid, colour = cut, geom = "line")

qplot(carat, p1 / p2, data = grid, colour = cut, geom = "line")


# Another approach is the effects package
# install.packages("effects")
library(effects)
effect("cut", mod1)

cut <- as.data.frame(effect("cut", mod1))
qplot(fit, reorder(cut, fit), data = cut)
qplot(fit, reorder(cut, fit), data = cut) +
  geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.1)

qplot(exp(fit), reorder(cut, fit), data = cut) +
  geom_errorbarh(aes(xmin = exp(lower), xmax = exp(upper)), height = 0.1)


Business
