24 modelling

Hadley Wickham

Stat405: Intro to modelling

Tuesday, 16 November 2010

1. What is a linear model?

2. Removing trends

3. Transformations

4. Categorical data

5. Visualising models

What is a linear model?

[Figure: observed values, predicted values, and the residual between them]

y ~ x
# yhat = b1x + b0
# Want to find b's that minimise distance
# between y and yhat

z ~ x + y
# zhat = b2x + b1y + b0
# Want to find b's that minimise distance
# between z and zhat

z ~ x * y
# zhat = b3(x⋅y) + b2x + b1y + b0
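The minimisation described above can be checked on a small example. This is only a sketch on made-up data (the variables `x`, `y` and the helper `rss()` are illustrative, not part of the course material), assuming that "distance" means the sum of squared differences between y and yhat, which is what `lm()` minimises.

```r
# Toy data (made up for illustration): y is roughly 2 + 3x plus noise
set.seed(1)
x <- runif(50)
y <- 2 + 3 * x + rnorm(50, sd = 0.1)

mod <- lm(y ~ x)
b <- coef(mod)  # b[1] is the intercept b0, b[2] is the slope b1

# Sum of squared distances between y and yhat for a given pair of b's
rss <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

rss_fit <- rss(b[1], b[2])

# Nudging either coefficient away from the fitted value makes the distance larger
rss_nudged <- c(rss(b[1] + 0.01, b[2]), rss(b[1], b[2] + 0.01))
rss_nudged > rss_fit
```

Any other choice of b0 and b1 gives a larger sum of squares than the fitted coefficients.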

Assumptions

X is measured without error.

Relationship is linear.

Errors are independent.

Errors have normal distribution.

Errors have constant variance.

Removing trends

library(ggplot2)

diamonds$x[diamonds$x == 0] <- NA

diamonds$y[diamonds$y == 0] <- NA

diamonds$y[diamonds$y > 30] <- NA

diamonds$z[diamonds$z == 0] <- NA

diamonds$z[diamonds$z > 30] <- NA

diamonds <- subset(diamonds, carat < 2)

qplot(x, y, data = diamonds)

qplot(x, z, data = diamonds)

mody <- lm(y ~ x, data = diamonds, na.action = na.exclude)

coef(mody)

# yhat = 0.05 + 0.99⋅x

# Plot x vs yhat

qplot(x, predict(mody), data = diamonds)

# Plot x vs (y - yhat) = residual

qplot(x, resid(mody), data = diamonds)

# Standardised residual:

qplot(x, rstandard(mody), data = diamonds)
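The plots above rely on the residuals having the same length as the data, which is why the model is fitted with `na.exclude`. The sketch below uses a made-up data frame `d` (not the diamonds data) to show the difference from the default handling of missing values, and to confirm that a residual is the observed value minus the predicted value.

```r
# Toy data with one missing x (made up for illustration)
d <- data.frame(x = c(1, 2, NA, 4, 5), y = c(1.1, 1.9, 3.2, 4.1, 4.8))

mod_omit <- lm(y ~ x, data = d)                          # default: drop incomplete rows
mod_excl <- lm(y ~ x, data = d, na.action = na.exclude)  # drop for fitting, pad afterwards

length(resid(mod_omit))  # 4: the incomplete row is gone
length(resid(mod_excl))  # 5: padded with NA, so it lines up with the rows of d

# A residual is the observed value minus the predicted value
all.equal(unname(resid(mod_excl)), d$y - unname(predict(mod_excl)))
```

Because the padded residuals line up with the original rows, they can be plotted against columns of the same data frame.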

qplot(x, resid(mody), data = diamonds)

qplot(x, y - x, data = diamonds)

Your turn

Do the same thing for z and x. What threshold might you use to remove outlying values?

Are the errors from predicting z and y from x related?

modz <- lm(z ~ x, data = diamonds, na.action = na.exclude)

coef(modz)
# zhat = 0.03 + 0.61x

qplot(x, rstandard(modz), data = diamonds)
last_plot() + ylim(-10, 10)

qplot(rstandard(mody), rstandard(modz))

Transformations

Can we use a linear model to remove this trend?

Linear models are linear in their parameters. The response and the predictors can be any transformation of the data.
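As a small illustration of "linear in the parameters", the sketch below fits curved relationships with `lm()`. The data are made up, and the names `mod_log` and `mod_quad` are only for this example.

```r
# Toy data (made up): y follows a curve, y = 1 + 2 * log(x), plus noise
set.seed(2)
x <- seq(1, 10, length.out = 40)
y <- 1 + 2 * log(x) + rnorm(40, sd = 0.05)

# Still a linear model: yhat = b1 * log(x) + b0 is linear in b0 and b1
mod_log <- lm(y ~ log(x))
coef(mod_log)

# I() protects arithmetic inside a formula:
# this fits yhat = b2 * x^2 + b1 * x + b0, again linear in the b's
mod_quad <- lm(y ~ x + I(x^2))
coef(mod_quad)
```

In both cases the curve comes from transforming the data, not from the way the coefficients enter the model.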

Your turn

Use a linear model to remove the effect of carat on price. Confirm that this worked by plotting model residuals vs. color.

How can you interpret the model coefficients and residuals?

modprice <- lm(log(price) ~ log(carat), data = diamonds, na.action = na.exclude)

diamonds$relprice <- exp(resid(modprice))

qplot(carat, relprice, data = diamonds)
diamonds <- subset(diamonds, carat < 2)
qplot(carat, relprice, data = diamonds)

qplot(carat, relprice, data = diamonds) + facet_wrap(~ color)
qplot(relprice, ..density.., data = diamonds, colour = color, geom = "freqpoly", binwidth = 0.2)
qplot(relprice, ..density.., data = diamonds, colour = cut, geom = "freqpoly", binwidth = 0.2)

Multiplicative model

log(Y) = a * log(X) + b

Y = c * X^a, where c = exp(b)

An additive model becomes a multiplicative model.

Intercept becomes starting point (c is the value of Y at X = 1), slope becomes geometric growth (each doubling of X multiplies Y by 2^a).

Residuals

resid(mod) = log(Y) - log(Yhat)

exp(resid(mod)) = Y / (Yhat)
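The identity above can be checked numerically. This is a sketch on made-up data that follows a power law with multiplicative noise; the numbers 5000 and 1.7 are arbitrary and not estimates from the diamonds data.

```r
# Toy data (made up): y = 5000 * x^1.7, with multiplicative noise
set.seed(3)
x <- runif(30, 0.2, 2)
y <- 5000 * x^1.7 * exp(rnorm(30, sd = 0.1))

mod <- lm(log(y) ~ log(x))

yhat  <- exp(predict(mod))  # predictions back on the original scale
ratio <- exp(resid(mod))    # exponentiated residuals

# exp(resid(mod)) equals Y / Yhat
all.equal(unname(ratio), unname(y / yhat))
```

A value of `ratio` of 1.1 therefore means the observation is 10% above what the model predicts, which is how `relprice` can be read.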

# Useful trick - close to 0, exp(x) ~ x + 1
x <- seq(-0.2, 0.2, length = 100)
qplot(x, exp(x)) + geom_abline(intercept = 1, slope = 1)

# formatter = percent was removed from ggplot2; labels = scales::percent replaces it
qplot(x, x / exp(x)) + scale_y_continuous("Percent error", labels = scales::percent)

# Not so useful here because the x is also
# transformed
coef(modprice)

Categorical data

Your turn

Compare the results of the following two expressions. What can you say about the model?

library(plyr)  # provides ddply() and summarise()

ddply(diamonds, "color", summarise, mean = mean(price))

coef(lm(price ~ color, data = diamonds))

Categorical data

A categorical variable is converted into a numeric matrix with one column for each level. Each entry is 1 if that observation has that level, and 0 otherwise.

However, if we just do that naively, we end up with too many columns (because we have one extra column for the intercept).

So the column for the first level is dropped, and everything is relative to the first level.
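The encoding described above can be inspected with `model.matrix()`. The sketch below uses a made-up data frame `d` with an unordered factor `g`, for which R uses its default treatment contrasts. One caveat: if `color` is stored as an ordered factor (which appears to be the case in recent ggplot2 releases), R defaults to polynomial contrasts instead, and the coefficients are no longer differences from the first level.

```r
# Toy data (made up): an unordered factor with three levels
d <- data.frame(
  g = factor(c("a", "a", "b", "b", "c", "c")),
  y = c(1, 3, 5, 7, 10, 12)
)

model.matrix(~ g - 1, data = d)  # naive encoding: one 0/1 column per level
model.matrix(~ g, data = d)      # with intercept: the column for level "a" is dropped

means <- tapply(d$y, d$g, mean)  # group means: a = 2, b = 6, c = 11
b <- coef(lm(y ~ g, data = d))

# (Intercept) is the mean of the first level;
# gb and gc are differences from that mean
b
means - means[1]
```

The intercept matches the mean of the first level, and the remaining coefficients match the differences between each group mean and the first.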

Visualising models

# What do you think this model does?
lm(log(price) ~ log(carat) + color, data = diamonds)

# What about this one?
lm(log(price) ~ log(carat) * color, data = diamonds)

# Or this one?
lm(log(price) ~ cut * color, data = diamonds)

# How can we interpret the results?

mod1 <- lm(log(price) ~ log(carat) + cut, data = diamonds)
mod2 <- lm(log(price) ~ log(carat) * cut, data = diamonds)

# One way is to explore predictions from the model
# over an evenly spaced grid. expand.grid makes
# this easy

grid <- expand.grid(
  carat = seq(0.2, 2, length = 20),
  cut = levels(diamonds$cut),
  KEEP.OUT.ATTRS = FALSE
)
str(grid)
grid

grid$p1 <- exp(predict(mod1, grid))
grid$p2 <- exp(predict(mod2, grid))

Your turn

Plot the predictions from the two sets of models. How are they different?

qplot(carat, p1, data = grid, colour = cut, geom = "line")
qplot(carat, p2, data = grid, colour = cut, geom = "line")

qplot(log(carat), log(p1), data = grid, colour = cut, geom = "line")
qplot(log(carat), log(p2), data = grid, colour = cut, geom = "line")

qplot(carat, p1 / p2, data = grid, colour = cut, geom = "line")
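One way to see the difference between `+` and `*` without plotting is to compare predictions between groups on a grid. The sketch below uses made-up data with a two-level factor `g` (all names here are illustrative). In the additive model the lines are parallel on the log scale, so the ratio of predictions between groups is the same at every x. In the interaction model each group has its own slope, so the ratio can change with x.

```r
# Toy data (made up): a power law in x, with group "b" shifted up on the log scale
set.seed(4)
d <- data.frame(
  x = runif(200, 0.2, 2),
  g = factor(sample(c("a", "b"), 200, replace = TRUE))
)
d$y <- exp(1 + 1.5 * log(d$x) + 0.3 * (d$g == "b") + rnorm(200, sd = 0.1))

m_add <- lm(log(y) ~ log(x) + g, data = d)  # common slope, separate intercepts
m_int <- lm(log(y) ~ log(x) * g, data = d)  # separate slopes and intercepts

grid <- expand.grid(
  x = seq(0.2, 2, length.out = 5),
  g = levels(d$g),
  KEEP.OUT.ATTRS = FALSE
)
grid$p_add <- exp(predict(m_add, grid))
grid$p_int <- exp(predict(m_int, grid))

# Ratio of group "b" to group "a" at each x on the grid
ratio_add <- grid$p_add[grid$g == "b"] / grid$p_add[grid$g == "a"]
ratio_int <- grid$p_int[grid$g == "b"] / grid$p_int[grid$g == "a"]

ratio_add  # constant across x
ratio_int  # allowed to vary with x
```

The same comparison applies to `mod1` and `mod2` above, with `cut` playing the role of `g`.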

# Another approach is the effects package
# install.packages("effects")
library(effects)
effect("cut", mod1)

cut <- as.data.frame(effect("cut", mod1))
qplot(fit, reorder(cut, fit), data = cut)
qplot(fit, reorder(cut, fit), data = cut) +
  geom_errorbarh(aes(xmin = lower, xmax = upper), height = 0.1)

qplot(exp(fit), reorder(cut, fit), data = cut) +
  geom_errorbarh(aes(xmin = exp(lower), xmax = exp(upper)), height = 0.1)
