February 2016
Statistics:
Multiple Regression in R
How to Use This Course Book
This course book accompanies the face-to-face session taught at IT Services. It contains a copy of the slideshow and the worksheets.
Software Used
We might use Excel to capture your data, but no other software is required. Since this is a Concepts course, we will concentrate on exploring ideas and underlying concepts that researchers will find helpful in undertaking data collection and interpretation.
Revision Information
Version Date Author Changes made
1.0 February 2016 John Fresen Course book version 1
Copyright
The copyright of this document lies with Oxford University IT Services.
Contents
1 Introduction
1.1. What You Should Already Know
1.2. What You Will Learn
2 Your Resources for These Exercises
2.1. Help and Support Resources
3 What Next?
3.1. Statistics Courses
3.2. IT Services Help Centre
Statistics: Concepts TRMSZ
1 IT Services
1 Introduction
Welcome to the course Multiple Regression in R.
This course introduces the concept of regression using Sir Francis Galton’s Parent-Child height data and then extends the concept to multiple regression using real examples. The course has an applied focus and makes minimal use of mathematics. No derivations of formulae are presented.
1.1. What You Should Already Know
We assume that you are familiar with entering and editing text, rearranging and formatting text - drag and drop, copy and paste, printing and previewing, and managing files and folders.
The computer network in IT Services may differ slightly from that which you are used to in your College or Department; if you are confused by the differences, ask for help from the teacher.
1.2. What You Will Learn
In this course we will cover the following topics:
What is regression
Simple linear regression
Influential observations
Multiple regression
Model selection
Post selection inference
Cross validation
From problem – to data – to conclusions
Where to get help
Topics covered in related Statistics courses, should you be interested, are given in Section 3.1.
2 Your Resources for These Exercises The exercises in this handbook will introduce you to some of the tasks you will need to carry out when working with WebLearn. Some sample files and documents are provided for you; if you are on a course held at IT Services, they will be on your network drive H:\ (Find it under My Computer).
During a taught course at IT Services, there may not be time to complete all the exercises. You will need to be selective, and choose your own priorities among the
variety of activities offered here. However, those exercises marked with a star *
should not be skipped.
Please complete the remaining exercises later in your own time, or book for a Computer8 session at IT Services for classroom assistance (See section 8.2).
2.1. Help and Support Resources
You can find support information for the exercises on this course and your future use of WebLearn, as follows:
WebLearn Guidance https://weblearn.ox.ac.uk/info (This should be your first port of call)
If at any time you are not clear about any aspect of this course, please make sure you ask John for help. If you are away from the class, you can get help and advice by emailing the central address [email protected].
The website for this course including reading material and other material can be found at https://weblearn.ox.ac.uk/x/Mvkigl
You are welcome to contact John about statistical issues and questions at [email protected]
3 What Next?
3.1. Statistics Courses
Now that you have a grasp of some basic concepts in Statistics, you may want to develop your skills further. IT Services offers further Statistics courses and details are available at http://courses.it.ox.ac.uk.
In particular, you might like to attend the course
Statistics: Introduction: this is a four-session module which covers the basics of statistics and aims to provide a platform for learning more advanced tools and techniques.
Courses on particular discipline areas or data analysis packages include:
Statistics: Designing clinical research and biostatistics
SPSS: An introduction
SPSS: An introduction to using syntax
STATA: An introduction to data access and management
STATA: Data manipulation and analysis
STATA: Statistical, survey and graphical analyses
3.2. IT Services Help Centre
The IT Services Help Centre at 13 Banbury Road is open by appointment during working hours, and on a drop-in basis from 6:00 pm to 8:30 pm, Monday to Friday.
The Help Centre is also a good place to get advice about any aspect of using computer software or hardware. You can contact the Help Centre on (2)73200 or by email on [email protected]
Your safety is important
Where is the fire exit?
Beware of hazards:
Tripping over bags and coats
Please report any equipment faults to us
Let us know if you have any other concerns
Your comfort is important
The toilets are along the corridor outside the lecture rooms
The rest area is where you registered; it has vending machines and a water cooler
The seats at the computers are adjustable
You can adjust the monitors for height, tilt and brightness
Session 1: The concept of regression from Galton
Thanks to:
Dave Baker, IT Services
Jill Fresen, IT Services
Jim Hanley, McGill University
Ian Sinclair, REES Group Oxford
Sir Francis Galton (16 February 1822 to 17 January 1911)
http://en.wikipedia.org/wiki/Francis_Galton
Sir Francis Galton was an incredible polymath
Cousin of Charles Darwin.
General: Genetics. What do we inherit from our ancestors?
Particular: Do tall parents have tall children and short parents, short children?
i.e. Does the height of children depend on the height of parents?
Data: Famous 1885 study: 205 sets of parents, 928 offspring
mph = average height of parents; ch = child height
Galton's peas experiment: he selected 700 pea pods of chosen sizes and measured the average diameter of the parent peas and the average diameter of the child peas.
Francis Galton: Do tall parents have tall children, and short parents short children? Does the height of a child depend on the height of the parents?
[Figure: frequency scatterplot of the Galton data, showing the frequency of each combination of child height and midparent height (midparent heights 62 to 74 inches).]
[Figures: Galton data: boxplots of the conditional distributions of child height given parent height; histograms of the marginal distributions of child height and parent height; plot of the child-parent data; sunflower plot of the data; plot of the data with child height jittered; and plot of the conditional distributions of child height given parent height. Each panel has parent height (64 to 72 inches) on the horizontal axis.]
Regression is a plot/trace of the means of the conditional distributions
[Figures: trace of the actual means (the regression of Child on Midparent); trace of the linear regression means, which assumes the means lie on a straight line; and the actual and linear regressions superimposed. Each panel has midparent height (62 to 74 inches) on the horizontal axis.]
The trace of actual means involves no assumptions, but the end distributions show a lot of sampling variation because of the small number of observations in those distributions.
Linear regression stabilises that
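The stabilising effect can be seen in a small simulation; this is a minimal sketch using made-up heights, not Galton's actual data:

```r
# Sketch: compare the trace of conditional means with a fitted straight line
# (simulated parent/child heights; all numbers hypothetical).
set.seed(1)
mph <- round(rnorm(900, mean = 68, sd = 1.8))   # midparent heights, grouped to whole inches
ch  <- 24 + 0.65 * mph + rnorm(900, sd = 2.2)   # child heights
cond.means <- tapply(ch, mph, mean)             # mean child height at each midparent height
fit <- lm(ch ~ mph)                             # linear regression stabilises the trace
plot(mph, ch, col = "grey")
lines(as.numeric(names(cond.means)), cond.means, type = "b")  # trace of actual means
abline(fit, lwd = 2)                            # fitted regression line
```

The end groups contain few observations, so the raw trace wobbles there; the fitted line does not.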
[Figures: the linear regression model, and the linear regression model fitted to the data, with midparent height (62 to 74 inches) on the horizontal axis.]
Linear regression model assumes:
1. Conditional distributions are normal
2. Conditional means lie on a straight line
3. Conditional distributions all have the same spread
In words: the distribution of child height, conditional on a given midparent height, is normal, with means lying on the straight line and constant spread.
In mathematics: child height | midparent height ~ Normal(𝛽0 + 𝛽1 · midparent height, 𝜎2)
This model can be extended in many ways
The linear model can be extended in many ways. Here are three; there are more:
1. Model the mean by a more general function, such as a polynomial, a trigonometric function, a Fourier series, radial basis functions or some non-parametric function
2. Model the variance as a function of the predictors
3. Generalize from the normal distribution to the exponential family, which includes the normal, exponential, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, Wishart, inverse Wishart and many others.
But in all cases we are modelling the mean and other parameters of conditional distributions. These are called Generalized Linear Models.
In R: lm() fits a linear model; glm() fits a generalized linear model; gam() (from the mgcv or gam packages) fits a generalized additive model.
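As a minimal sketch of these fitting functions on simulated data (variable names and numbers hypothetical):

```r
# Sketch: lm() for a normal response, glm() for a Poisson count response.
set.seed(2)
x <- runif(200, 0, 3)
y.norm  <- 1 + 2 * x + rnorm(200)                   # normal response
y.count <- rpois(200, lambda = exp(0.2 + 0.5 * x))  # Poisson count response
fit.lm  <- lm(y.norm ~ x)                           # linear model
fit.glm <- glm(y.count ~ x, family = poisson)       # generalized linear model
# A generalized additive model would be fitted with, e.g., mgcv::gam(y.norm ~ s(x))
coef(fit.lm)
coef(fit.glm)
```

Both calls model the mean of the conditional distribution of the response; glm() simply swaps the normal distribution for another member of the exponential family.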
Does the average diameter of child peas depend on the average diameter of parent peas? What are the sketches telling us? Would a linear regression model be suitable?
An important point about this example is that in regression, it is the slope of the regression line that is important, not the intercept.
Go to Lecture 2 A: Example 1 UWC Analysis
Detecting Influential Observations using R
Influential observations may suggest that your model is incorrect or that a data point has been mis-recorded.
Detecting Influential Observations in R
Case 1: Outlying in y-space, not x-space. Use studentized residuals to detect these.
Case 2: Outlying in x-space, not y-space. High leverage point. Use hatvalues to detect this.
Case 3: Outlying in both x-space and y-space. High leverage point.
The measures we use are:
Studentized residuals
Hatvalues
DFFITS
DFBETAS
Outliers in Y-space:
To detect outliers in the Y-space we compute the studentized residuals, defined as:
r_i = e_i / (s_(i) √(1 − h_ii))
where e_i is the i-th residual, s_(i) is the residual standard deviation estimated with case i removed, and h_ii is the i-th hatvalue.
This is just the standardizing transformation applied to the residuals. Thus we expect about 99% of studentized residuals to lie between -3 and 3.
Values outside of this range suggest possible outliers.
In R these are plotted with, for example: plot(rstudent(fit1), type="h")
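A minimal sketch on simulated data, with an outlier deliberately planted at case 25 (all numbers hypothetical):

```r
# Sketch: flag possible y-space outliers with studentized residuals.
set.seed(3)
x <- 1:50
y <- 5 + 0.8 * x + rnorm(50)
y[25] <- y[25] + 10                 # plant an outlier in the y-direction
fit <- lm(y ~ x)
r <- rstudent(fit)                  # externally studentized residuals
which(abs(r) > 3)                   # case 25 should stand out
```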
Outliers in X-space:
To detect outliers in the X-space we compute the hatvalues, defined as the diagonal elements of the hat matrix:
hatvalues = diag(X(XᵀX)⁻¹Xᵀ)
The i-th hatvalue measures the distance of case i's x-values from the centroid of the x-values.
In R these are plotted with, for example: plot(hatvalues(fit1), type="h")
Some of these will be small, some intermediate, and some large. They are assessed in a relative sense; there is, in general, no absolute cut-off point.
We want to avoid the situation in which one observation dominates the others.
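A minimal sketch on simulated data, with one case placed far out in x-space (all numbers hypothetical):

```r
# Sketch: a point far from the centroid in x-space gets a large hatvalue.
set.seed(4)
x <- c(rnorm(30, mean = 10), 25)    # the last case is far out in x
y <- 2 + 0.5 * x + rnorm(31)
fit <- lm(y ~ x)
h <- hatvalues(fit)
which.max(h)                        # the far-out case, 31, has the largest hatvalue
h[31] / mean(h)                     # and it is large relative to the average, p/n
```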
DFFITS (means change (DF) in FITted values)
DFFITS measures the effect or influence of removing the i-th observation on the predicted value of the i-th observation, divided by a standardizing quantity:
DFFITS_i = (Ŷ_i − Ŷ_i(i)) / constant
where Ŷ_i(i) is the predicted value for case i when case i is removed from the data.
This is a local measure of influence; it is only concerned with what happens at case i if case i is removed from the data.
In R these are plotted with, for example: plot(dffits(fit1), type="h")
Some of these will be small, some intermediate, and some large. They are assessed in a relative sense; there is, in general, no absolute cut-off point. We want to avoid the situation in which one observation dominates the others.
Cook's Distance measures the effect or influence of removing the i-th observation on all predicted values, divided by a standardizing quantity to normalize it:
D_i = Σ_j (Ŷ_j − Ŷ_j(i))² / constant
This is a global measure of influence; it considers the effect of removing case i on all predicted values.
In R these are plotted with, for example: plot(cooks.distance(fit1), type="h")
Some of these will be small, some intermediate, and some large. They are assessed in a relative sense; there is, in general, no absolute cut-off point.
DFBETAS (means change (DF) in the BETAS)
DFBETAS measures the effect or influence of removing the i-th observation on the estimated regression coefficients, divided by a standardizing quantity:
DFBETAS_k(i) = (β̂_k − β̂_k(i)) / constant
where β̂_k(i) is the k-th estimated coefficient when case i is removed from the data.
In R these are plotted with, for example: plot(dfbetas(fit1)[,k], type="h") for each coefficient k.
dfbetas(fit1) is an n-by-p matrix. The entry in row i and column k represents the effect of removing the i-th observation on the k-th regression coefficient.
Some of these will be small, some intermediate, and some large. They are assessed in a relative sense; there is, in general, no absolute cut-off point.
Go to Lecture 2 B: UWC Analysis, outliers and influential observations
Model Selection
The Executive Salary Data has the variables:
lsalary exper educat bonus numemp assets board age profits internat sales
Question: How do we select which variables to include in a model for estimating the mean log(salary)?
Model selection is still a vibrant research topic in statistics, with many contentious issues. One guiding principle, itself not without controversy, is:
Among competing hypotheses, the one with the fewest assumptions should be selected.
In other words: choose the simplest model that gives an adequate description of the data.
Common model selection criteria include:
Akaike information criterion (AIC)
Bayesian information criterion (BIC)
AIC = -2*log(likelihood(model)) + 2*(number of predictors)
BIC = -2*log(likelihood(model)) + (number of predictors)*log(number of observations)
The first term, -2*log(likelihood(model)), gets smaller with more predictors, but the model is penalized by the second term, which grows with the number of predictors (and, for BIC, with the sample size).
We choose the model with the smallest AIC (or BIC).
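A minimal sketch comparing two candidate models with R's built-in AIC() and BIC() functions, on simulated data (variable names hypothetical):

```r
# Sketch: compare a small model with a needlessly large one.
set.seed(5)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); junk <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)          # x2 and junk are irrelevant
fit.small <- lm(y ~ x1)
fit.big   <- lm(y ~ x1 + x2 + junk)
AIC(fit.small); AIC(fit.big)        # smaller is better
BIC(fit.small); BIC(fit.big)        # BIC penalizes the extra predictors more heavily
```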
Recent excellent articles are:
Statistical Model Choice, Gerda Claeskens (2016)
Valid Post-Selection Inference, Berk, Brown, Buja, Zhang and Zhao, Annals of Statistics (2013)
Problem: If we have k predictors, we have 2^k possible models (without considering interactions and transformations such as log, etc.)
For the Exec Salary Data we have 10 predictors, so there are 2^10 = 1024 models to consider without interactions or transformations.
We consider stepwise selection:
Forward selection, which involves starting with no variables in the model, testing the addition of each variable using a chosen model comparison criterion, adding the variable (if any) that improves the model the most, and repeating this process until none improves the model.
Backward elimination, which involves starting with all candidate variables, testing the deletion of each variable using a chosen model comparison criterion, deleting the variable (if any) that improves the model the most by being deleted, and repeating this process until no further improvement is possible.
Bidirectional elimination, a combination of the above, testing at each step for variables to be included or excluded.
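These three strategies can be sketched with base R's step() function, on simulated data (the Lecture 4 worksheet later in this book uses MASS::stepAIC, which behaves the same way; variable names here are hypothetical):

```r
# Sketch: forward, backward and bidirectional stepwise selection by AIC.
set.seed(6)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)   # x3 is irrelevant
full <- lm(y ~ x1 + x2 + x3, data = d)
back <- step(full, direction = "backward")            # backward elimination
fwd  <- step(lm(y ~ 1, data = d), direction = "forward",
             scope = formula(full))                   # forward selection
both <- step(full, direction = "both")                # bidirectional
formula(back)                                         # the selected model
```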
Inference after Model Selection
The following articles provide a great discussion of post-selection inference and make suggestions for how to proceed:
Ernst Wit, Edwin van den Heuvel and Jan-Willem Romeijn, Statistica Neerlandica, doi:10.1111/j.1467-9574.2012.00530.x
Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang and Linda Zhao (2013) Valid post-selection inference. Ann. Statist., Volume 41, Number 2, 802-837
Two major problems that arise in the process of model selection are:
First, the distributions of the estimated regression parameters are no longer valid. (This means that the tests and confidence intervals normally calculated are no longer valid.)
Second, we should see how well the model works on unseen data. This might conceptually be achieved by splitting the data into two subsets, a training set and a validation set. This is called cross-validation.
Cross-validation is important in guarding against testing hypotheses suggested by the data (called "Type III errors").
Leave-p-out cross-validation
Leave-p-out cross-validation (LpO CV) involves using p observations as the validation set and the remaining observations as the training set. This is repeated over all ways of splitting the original sample into a validation set of p observations and a training set, so it is very expensive computationally.
Leave-one-out cross-validation
Leave-one-out cross-validation (LOOCV) is a particular case of leave-p-out cross-validation with p = 1.
LOOCV doesn't have the computational problem of general LpO cross-validation.
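A minimal LOOCV sketch for a linear model on simulated data (all numbers hypothetical):

```r
# Sketch: leave-one-out cross-validation for lm().
set.seed(7)
n <- 60
x <- runif(n)
y <- 1 + 3 * x + rnorm(n, sd = 0.5)
errs <- numeric(n)
for (i in 1:n) {
  fit  <- lm(y ~ x, subset = -i)                      # train without case i
  pred <- predict(fit, newdata = data.frame(x = x[i]))
  errs[i] <- (y[i] - pred)^2                          # test on the held-out case
}
mean(errs)                                            # LOOCV estimate of prediction error
```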
k-fold cross-validation
In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples.
Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k-1 subsamples are used as training data.
The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate.
2-fold cross-validation is a special case.
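A minimal k-fold sketch on simulated data, with k = 5 (all numbers hypothetical):

```r
# Sketch: 5-fold cross-validation by random partition.
set.seed(8)
n <- 100; k <- 5
x <- runif(n)
y <- 1 + 3 * x + rnorm(n)
fold <- sample(rep(1:k, length.out = n))              # random equal-sized folds
cv.err <- numeric(k)
for (j in 1:k) {
  train <- fold != j
  fit   <- lm(y ~ x, subset = train)
  pred  <- predict(fit, newdata = data.frame(x = x[!train]))
  cv.err[j] <- mean((y[!train] - pred)^2)             # fold j held out once
}
mean(cv.err)                                          # averaged over the k folds
```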
Go to lecture 5 Example 4 Executive salary data
Lecture 3: Example 2: Multiple Regression Executive Salary Data – Four variables
Problem and Data
This data set was collected by a large Human Resources Placement company specializing in the placement of Chief Executive Officers (CEOs). The objective of the research is to develop a model that will predict the average salary of CEOs based on a number of factors or variables associated with the executive and the company. Because of its skewed nature, salary was converted to log-salary and recorded as lsalary to bring it closer to normality. The purpose of the model is to help guide executives as to the distribution of salaries (mean and standard deviation) earned by executives with similar qualifications and backgrounds. The data is given in execsaldata.xlsx, execsaldata.txt and execsaldata.csv. For this exercise we will only consider the following predictors: experience (exper); number of employees (numemp); number of years of formal education (educat); company assets in millions of dollars (assets).
Model: 𝑙𝑠𝑎𝑙𝑎𝑟𝑦 = 𝛽0 + 𝛽1𝑒𝑥𝑝𝑒𝑟 + 𝛽2𝑒𝑑𝑢𝑐𝑎𝑡 + 𝛽3𝑛𝑢𝑚𝑒𝑚𝑝 + 𝛽4𝑎𝑠𝑠𝑒𝑡𝑠 + 𝑛𝑜𝑖𝑠𝑒, where the noise follows a normal distribution with a mean of zero and constant variance. The data were obtained by taking a random sample of 100 companies listed on the New York Stock Exchange during 2008 and are given in the comma-delimited file execsaldata.csv; an Excel version is also given.
Assignment: Perform an analysis that will lead to a regression model for predicting the average salary of executives based on their experience and background in terms of the factors considered above. You might consider, inter alia, the following points in your analysis, but do not be restricted to this list.
1. Read the data into R and attach it.
2. Summarize the univariate marginal distributions of experience, number of employees, number of years of formal education and the company assets.
3. Produce the matrix of scatterplots of these variables using the pairs statement, compute the correlation coefficients, and comment on the plots and correlation matrix.
4. Fit the regression model to the data. Construct and interpret the SUMMARY and ANOVA tables.
5. How accurate is the prediction equation? Assess graphically the accuracy of the model by plotting the observed salary against the fitted values, say salhat.
6. Perform a graphical analysis of the residuals to assess if the assumptions on the error terms are reasonable.
7. Assess the possibility of outliers and influential observations.
8. Can you make recommendations about estimating the average salary of executives from the predictor variables considered here?
9. Can you criticize the data or the model?
Recommended Steps – You may copy and paste into R
lsalary exper educat bonus numemp assets board age profits internat sales
Step 1 Read and attach data.
My data is stored in the directory Data of my memory stick.
One can specify any directory.
esd = read.csv("E:/Data/execsaldata.csv",header=T,sep=",")
attach(esd)
head(esd)
names(esd)
Step 2 Plot marginal distributions of the response variable and the four
predictor variables in a 2x3 graphics layout
par(mfrow=c(2,3))
hist(lsalary,prob=T,col="gray");lines(density(lsalary));rug(lsalary)
hist(exper,prob=T,col="gray");lines(density(exper));rug(exper)
hist(educat,prob=T,col="gray");lines(density(educat));rug(educat)
hist(numemp,prob=T,col="gray");lines(density(numemp));rug(numemp)
hist(assets,prob=T,col="gray");lines(density(assets));rug(assets)
# notice that these are on vastly different scales
Step 3 Pairs plot (matrix of scatterplots)
newdata <- cbind(lsalary,exper,educat,numemp,assets)
pairs(newdata)
Step 4 Can combine steps 2 and 3 (See help for pairs)
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
pairs(newdata, panel = panel.smooth,
diag.panel = panel.hist, cex.labels = 1.0, font.labels = 2)
Step 5 Compute the correlation coefficients:
cor(newdata)
# or, to have fewer decimal places
round(cor(newdata),3)
Step 6 Fit the linear model to the data and get summary and anova:
fit1 = lm(lsalary~exper+educat+numemp+assets)
summary(fit1)
anova(fit1)
Step 7 Compute predicted values and residuals:
salhat = fitted.values(fit1)
res = residuals(fit1)
Step 8 # check the fit and assumptions on noise terms i.e.
# are they normal and independent of predictor and model?
par(mfrow=c(2,4))
plot(salhat,lsalary,xlim=c(10.6,12.1),ylim=c(10.6,12.1),
main="Observed vs fitted values")
abline(0,1) #superimposes an “ideal” straight line
plot(res) # sequential plot of residuals
qqnorm(res);qqline(res) #normal probability plot of residuals
plot(exper,res) #residuals vs exper
plot(educat,res) #residuals vs educat
plot(numemp,res) #residuals vs numemp
plot(assets,res) #residuals vs assets
plot(salhat,res) #residuals vs fitted values
Step 9 # Detecting possible outlying and influential observations
# Are there outliers in the y-space or x-space?
# Are there influential observations as measured by
# DFFITS,Cook’s Distance or DFBETAS?
par(mfrow=c(3,3))
plot(rstudent(fit1),type="h")
plot(hatvalues(fit1),type="h")
plot(dffits(fit1),type="h")
plot(cooks.distance(fit1),type="h")
plot(dfbetas(fit1)[,1],type="h")
plot(dfbetas(fit1)[,2],type="h")
plot(dfbetas(fit1)[,3],type="h")
plot(dfbetas(fit1)[,4],type="h")
plot(dfbetas(fit1)[,5],type="h")
# We might wish to plot hatvalues against cooks.distance and even
# consider other combinations of plots.
#---------------
# An interesting 2X2 matrix of plots is provided by plot(fit1)
par(mfrow=c(2,2))
plot(fit1)
Step 10 What are our conclusions?
First interpret the diagnostic plots for assumptions on residuals
and then consider the possibility of influential observations and
outliers. Do the assumptions on the noise terms seem reasonable?
If the diagnostics are satisfactory we then look at the summary and
ANOVA to assess and interpret the model: What is the mean function
and what is the spread about the mean function? Is the model a
reasonable approximation to the data?
Can we criticize the data or the model and make suggestions for
further analysis or research?
Example 1: UWC Analysis
Mathematical model: 𝑟𝑒𝑠𝑢𝑙𝑡 = 𝛽0 + 𝛽1𝑟𝑎𝑡𝑖𝑛𝑔 + 𝑛𝑜𝑖𝑠𝑒
assume 𝑛𝑜𝑖𝑠𝑒 follows a normal distribution with a mean of zero and variance 𝜎2
We could write this as 𝑟𝑒𝑠𝑢𝑙𝑡|𝑟𝑎𝑡𝑖𝑛𝑔~𝑁𝑜𝑟𝑚𝑎𝑙(𝛽0 + 𝛽1𝑟𝑎𝑡𝑖𝑛𝑔, 𝜎2)
The conditional distribution of result given rating is normal with a mean of
𝛽0 + 𝛽1𝑟𝑎𝑡𝑖𝑛𝑔 and variance of 𝜎2
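Before analysing the real data, it can help to simulate from this model and check that lm() recovers the parameters; a minimal sketch with hypothetical values 𝛽0 = 20, 𝛽1 = 5 and 𝜎 = 8:

```r
# Sketch: simulate result | rating ~ Normal(b0 + b1*rating, sigma^2) and refit.
set.seed(9)
rating <- runif(200, 1, 10)
result <- 20 + 5 * rating + rnorm(200, sd = 8)   # b0 = 20, b1 = 5, sigma = 8
fit <- lm(result ~ rating)
coef(fit)                                        # estimates close to (20, 5)
summary(fit)$sigma                               # estimate of sigma, close to 8
```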
Suggested steps:
Step 1 # Read data into R, attach data, print first 6 lines
uwc = read.table("E:/Data/uwcdata.csv",header = T,sep=",")
attach(uwc)
head(uwc)
Step 2 # Plot marginal distributions
par(mfrow=c(1,2)) #1 row by 2 cols graphics window
hist(rating,prob=T,col="gray");lines(density(rating));rug(rating)
hist(result,prob=T,col="gray");lines(density(result));rug(result)
Step 3 # fit the linear model (linear regression model)
fit1 <- lm(result ~ rating)
# fit1 is an object generated by the routine “lm” containing a lot
# of information about the fitted model.
# The rest of the steps are simply accessing information in fit1
Step 4 # obtain summary and anova of fit
# compute fitted values and residuals
summary(fit1)
anova(fit1)
yhat <- fitted.values(fit1) # fitted values
res <- residuals(fit1) # residuals
Step 5 # various common plots put into a 2X3 matrix of scatter-plots to
# check the fit and assumptions on noise terms i.e.
# are they normal and independent of predictor and model?
#
# plot 1: scatterplot of data and superimposed fitted model -
# only do this plot when there is only a single predictor
# plot 2: scatterplot of observed values vs fitted values to see
# how close the fitted values are to the observed data –
# do this plot no matter how many predictors
# plot 3: plot of residuals (random or pattern?)
# plot 4: Q-Q plot of residuals (are they approx normal?)
# plot 5: residuals vs predictor (random or pattern?)
# if there are many predictors we plot residuals against
# each predictor in turn
# plot 6: residuals vs fitted (random or pattern?)
par(mfrow=c(2,3))
plot(rating,result,ylim=c(0,100), pch=19,cex=1.5,
main="Result vs Rating UWC data\n showing pass mark and fitted
model")
abline(fit1) #superimposes fitted straight line
abline(h=48,lty=2) # superimposes dashed horizontal line at 48
plot(yhat,result,xlim=c(0,100),ylim=c(0,100),
main="Observed vs fitted values")
abline(0,1) #superimposes an “ideal” straight line
plot(res)
qqnorm(res);qqline(res) #normal probability plot of residuals
plot(rating,res) #residuals vs predictor
plot(yhat,res) #residuals vs fitted values
Step 6 # Detecting possible outlying and influential observations
# Are there outliers in the y-space or x-space?
# Are there influential observations as measured by
# DFFITS,Cook’s Distance or DFBETAS?
par(mfrow=c(2,3))
plot(rstudent(fit1),type="h")
plot(hatvalues(fit1),type="h")
plot(dffits(fit1),type="h")
plot(cooks.distance(fit1),type="h")
plot(dfbetas(fit1)[,1],type="h")
plot(dfbetas(fit1)[,2],type="h")
Step 7 What are our conclusions?
First interpret the diagnostic plots for assumptions on
residuals and then consider the possibility of
influential observations and outliers.
If the diagnostics are satisfactory we then look at the
summary and ANOVA to assess and interpret the model:
What is the mean function and what is the spread about
the mean function?
Can we criticize the data or the model and make
suggestions for further analysis or research?
Lecture 3: Example 2: Multiple Regression Executive Salary Data – Four variables
Problem and Data This data set was collected by a large Human Resources Placement company specializing in the placement of Chief Executive Officers (CEO's). The objective of the research is to develop a model that will predict the average salary of CEO's based on a number of factors or variables associated with the executive and the company. Because of its skewed nature salary was converted to log-salary and recorded as lsalary to bring it closer to normality. The purpose of the model is to help guide executives as to the distribution of salaries (mean and standard deviation) earned by executives with similar qualifications and backgrounds. The data is given in execsaldata.xlxs, execsaldata.txt, execsaldata.csv. For this exercise we will only consider the following predictors: experience (exper); number of employees (numemp) ; number of years of formal education (educat) ; company assets in millions of dollars (assets)
Model: 𝑙𝑠𝑎𝑙𝑎𝑟𝑦 = 𝛽0 + 𝛽1𝑒𝑥𝑝𝑒𝑟 + 𝛽2𝑒𝑑𝑢𝑐𝑎𝑡 + 𝛽3𝑛𝑢𝑚𝑒𝑚𝑝 + 𝛽4𝑎𝑠𝑠𝑒𝑡𝑠 + 𝑛𝑜𝑖𝑠𝑒 where the noise follows a normal distribution with a mean of zero and constant variance The data were obtained by taking a random sample of 100 companies listed on the New York Stock Exchange during 2008 and is given in a tab delimited file execsaldata.csv. But an excel version is also given.
Assignment: Perform an analysis that will lead to a regression model for predicting the average salary of executives based on their experience and background in terms of the factors considered above. You might consider, inter-alia, the following points in your analysis but do not be restricted to this list. 1. Read the data into R and attach it. 2. Summarize the univariate marginal distributions of experience, number of employees, number of years
of formal education and the company assets. 3. Provide the matrix of these variables, using the pairs statement, compute the correlation
coefficients and comment on the plots and correlation matrix. 4. Fit the regression model to the data. Construct and interpret the SUMMARY and ANOVA
tables. 5. How accurate is the prediction equation? Assess graphically the accuracy of the model by
plotting the observed salary against the fitted values, say salhat. 6. Perform a graphical analysis of the residuals to assess if the assumptions on the error terms
are reasonable. 7. Assess the possibility of outliers and influential observations. 8. Can you make recommendations about estimating the average salary of executives from the
predictor variables considered here? 9. Can you criticize the data or the model?
Recommended Steps – You may copy and paste into R
lsalary exper educat bonus numemp assets board age profits internat sales
Step 1 Read and attach data.
My data is stored in the directory Data of my memory stick.
One can specify any directory.
esd = read.csv("E:/Data/execsaldata.csv",header=T,sep=",")
attach(esd)
head(esd)
names(esd)
Step 2 Plot marginal distributions of the response variable and the four
predictor variables on a 2X3 scatterplot matrix
par(mfrow=c(2,3))
hist(lsalary,prob=T,col="gray");lines(density(lsalary));rug(lsalary)
hist(exper,prob=T,col="gray");lines(density(exper));rug(exper)
hist(educat,prob=T,col="gray");lines(density(educat));rug(educat)
hist(numemp,prob=T,col="gray");lines(density(numemp));rug(numemp)
hist(assets,prob=T,col="gray");lines(density(assets));rug(assets)
# notice that these are on vastly different scales
Step 3 Pairs plot (matrix of scatterplots)
newdata <- cbind(lsalary,exper,educat,numemp,assets)
pairs(newdata)
Step 4 Can combine steps 2 and 3 (See help for pairs)
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
pairs(newdata, panel = panel.smooth,
diag.panel = panel.hist, cex.labels = 1.0, font.labels = 2)
Step 5 Compute the correlation coefficients:
cor(newdata)
# or, to have fewer decimal places
round(cor(newdata),3)
Step 6 Fit the linear model to the data and get summary and anova:
fit1 = lm(lsalary~exper+educat+numemp+assets)
summary(fit1)
anova(fit1)
Step 7 Compute predicted values and residuals:
salhat = fitted.values(fit1)
res = residuals(fit1)
Step 8 # check the fit and assumptions on noise terms i.e.
# are they normal and independent of predictor and model?
par(mfrow=c(2,4))
plot(salhat,lsalary,xlim=c(10.6,12.1),ylim=c(10.6,12.1),
main="Observed vs fitted values")
abline(0,1) #superimposes an “ideal” straight line
plot(res) # sequential plot of residuals
qqnorm(res);qqline(res) #normal probability plot of residuals
plot(exper,res) #residuals vs exper
plot(educat,res) #residuals vs eduac
plot(numemp,res) #residuals vs numemp
plot(assets,res) #residuals vs assets
plot(salhat,res) #residuals vs fitted values
Step 9 # Detecting possible outlying and influential observations
# Are there outliers in the y-space or x-space?
# Are there influential observations as measured by
# DFFITS,Cook’s Distance or DFBETAS?
par(mfrow=c(3,3))
plot(rstudent(fit1),type="h")
plot(hatvalues(fit1),type="h")
plot(dffits(fit1),type="h")
plot(cooks.distance(fit1),type="h")
plot(dfbetas(fit1)[,1],type="h")
plot(dfbetas(fit1)[,2],type="h")
plot(dfbetas(fit1)[,3],type="h")
plot(dfbetas(fit1)[,4],type="h")
plot(dfbetas(fit1)[,5],type="h")
# We might wish to plot hatvalues against cooks.distance and even
# consider other combinations of plots.
#---------------
# An interesting 2X2 matrix of plots is provided by plot(fit1)
par(mfrow=c(2,2))
plot(fit1)
Step 10 What are our conclusions?
First interpret the diagnostic plots for assumptions on residuals
and then consider the possibility of influential observations and
outliers. Do the assumptions on the noise terms seem reasonable?
If the diagnostics are satisfactory we then look at the summary and
ANOVA to assess and interpret the model: What is the mean function
and what is the spread about the mean function? Is the model a
reasonable approximation to the data?
Can we criticize the data or the model and make suggestions for
further analysis or research?
Lecture 4: Example 5: Model Selection – Executive Salary Data (all variables)
Stepwise Regression in R
Recommended Steps – You may copy and paste into R
Variables: lsalary exper educat bonus numemp assets board age profits internat sales
Step 1 Read and attach data.
My data is stored in the directory Data of my memory stick.
One can specify any directory.
esd = read.csv("E:/Data/execsaldata.csv",header=T,sep=",")
attach(esd)
head(esd)
names(esd)
Step 2 # Invoke the MASS library that contains the stepAIC function
library(MASS)
Step 3 Pairs plot (matrix of scatterplots)
newdata <- cbind(exper,educat,numemp,assets,age,profits,sales)
Step 4 Can combine steps 2 and 3 (See help for pairs)
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
pairs(newdata, panel = panel.smooth,
diag.panel = panel.hist, cex.labels = 1.0, font.labels = 2)
Step 5 Compute the correlation coefficients:
round(cor(newdata),3)
Step 6 Perform stepwise regression
fit1 <- lm(lsalary ~ .,data=esd)
esd.step <- stepAIC(fit1, direction = "backward" )
Step 6 Fit the linear model to the data and get summary and anova:
fit2 <- lm(lsalary ~ exper+educat+bonus+numemp+assets+board+age+
profits+internat+sales)
summary(fit2)
anova(fit2)
Step 7 Fit reduced model – take out non-significant terms
fit3 <- lm(lsalary ~ exper+educat+bonus+numemp+assets)
summary(fit3)
anova(fit3)
Step 7 Compute predicted values and residuals:
salhat = fitted.values(fit3)
res = residuals(fit3)
Step 8 # check the fit and assumptions on noise terms i.e.
# are they normal and independent of predictor and model?
par(mfrow=c(3,3))
plot(salhat,lsalary,xlim=c(10.6,12.1),ylim=c(10.6,12.1),
main="Observed vs fitted values")
abline(0,1) #superimposes an “ideal” straight line
plot(res) # sequential plot of residuals
qqnorm(res);qqline(res) #normal probability plot of residuals
plot(exper,res) #residuals vs exper
plot(educat,res) #residuals vs educat
plot(bonus,res) #residuals vs bonus
plot(numemp,res) #residuals vs numemp
plot(assets,res) #residuals vs assets
plot(salhat,res) #residuals vs fitted values
Step 9 # Detecting possible outlying and influential observations
# Are there outliers in the y-space or x-space?
# Are there influential observations as measured by
# DFFITS, Cook’s Distance or DFBETAS?
par(mfrow=c(4,3)) # ten diagnostic plots need more than a 3x3 grid
plot(rstudent(fit3),type="h")
plot(hatvalues(fit3),type="h")
plot(dffits(fit3),type="h")
plot(cooks.distance(fit3),type="h")
plot(dfbetas(fit3)[,1],type="h")
plot(dfbetas(fit3)[,2],type="h")
plot(dfbetas(fit3)[,3],type="h")
plot(dfbetas(fit3)[,4],type="h")
plot(dfbetas(fit3)[,5],type="h")
plot(dfbetas(fit3)[,6],type="h")
Step 10 What are our conclusions?
First interpret the diagnostic plots for assumptions on residuals
and then consider the possibility of influential observations and
outliers. Do the assumptions on the noise terms seem reasonable?
If the diagnostics are satisfactory we then look at the summary and
ANOVA to assess and interpret the model: What is the mean function
and what is the spread about the mean function? Is the model a
reasonable approximation to the data?
Can we criticize the data or the model and make suggestions for
further analysis or research?
Lecture 5: Inference after model selection
The following articles provide an excellent discussion of post-selection
inference and make suggestions for how to proceed:
Ernst Wit, Edwin van den Heuvel and Jan-Willem Romeijn (2012). ‘All models are wrong...’: an introduction to model uncertainty. Statistica Neerlandica, doi:10.1111/j.1467-9574.2012.00530.x
Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang and Linda Zhao (2013). Valid post-selection inference. Annals of Statistics, 41(2), 802-837.
Two major problems arise in the process of model selection:
First, the distributions of the estimated regression parameters are no longer valid. (This means that the tests and confidence intervals normally calculated are no longer valid.)
Second, we should see how well the model works on unseen data. Conceptually this can be achieved by splitting the data into two subsets, a “training” set and a “validation” set. This is called cross-validation. Cross-validation is important in guarding against testing hypotheses suggested by the data (so-called "Type III errors").
Exhaustive cross-validation
Exhaustive cross-validation methods learn and test on all possible ways to divide the original sample into a training set and a validation set.
Leave-p-out cross-validation
Leave-p-out cross-validation (LpO CV) uses p observations as the validation set and the remaining observations as the training set. This is repeated for every way of cutting the original sample into a validation set of p observations and a training set. LpO cross-validation requires fitting and validating the model C(n,p) times (where n is the number of observations in the original sample), so it becomes computationally infeasible as soon as n is reasonably large.
Leave-one-out cross-validation
Leave-one-out cross-validation (LOOCV) is the special case of leave-p-out cross-validation with p = 1. LOOCV does not suffer from the computational problem of general LpO cross-validation because C(n,1) = n.
k-fold cross-validation
In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate.
2-fold cross-validation
This is the simplest variation of k-fold cross-validation, also called the holdout method. For each fold, we randomly assign the data points to two sets d0 and d1 of equal size (usually implemented by shuffling the data array and then splitting it in two). We then train on d0 and test on d1, followed by training on d1 and testing on d0.
Cross-validation only yields meaningful results if the validation set and training set are drawn from the same population and only if selection biases are controlled.
Cross-validation in R – Example 5: Executive salary data
esd = read.csv("E:/Data/execsaldata.csv",header=T,sep=",")
attach(esd)
head(esd)
fit3 <- lm(lsalary ~ exper+educat+bonus+numemp+assets)
library(cvTools)
cvFit(fit3, data = esd, y = lsalary,
      cost = rmspe, K = 5, R = 1, foldType = "consecutive")
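The mechanics of k-fold cross-validation can also be sketched in base R without cvTools. This is a minimal sketch on simulated data (so it runs on its own); the fold assignment and RMSPE calculation are our own, not part of the course code:

```r
# Manual 5-fold cross-validation sketch on simulated data.
set.seed(2)
n <- 100
x <- rnorm(n)
y <- 3 + 1.5 * x + rnorm(n)
d <- data.frame(x = x, y = y)

K <- 5
fold <- rep(1:K, each = n / K)   # consecutive blocks of n/K observations
errs <- numeric(K)
for (k in 1:K) {
  train <- d[fold != k, ]        # fit on the other K-1 folds
  valid <- d[fold == k, ]        # hold out fold k for validation
  fit   <- lm(y ~ x, data = train)
  pred  <- predict(fit, newdata = valid)
  errs[k] <- sqrt(mean((valid$y - pred)^2))  # RMSPE on held-out fold
}
cv_error <- mean(errs)           # averaged over the K folds
```

The averaged cv_error estimates how well the model predicts unseen data, which is exactly what cvFit reports.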
Practical 1: Ht Wt UCLA Data
The dataset UCLA ht wt sample.csv contains 250 records of human heights and
weights. These were obtained by taking a random sample of 250 from an
original sample of 25000 children, collected in 1993 by a Growth Survey of
children from birth to 18 years of age recruited from Maternal and Child
Health Centres (MCHC) and schools, and used to develop Hong Kong's current
growth charts for weight, height, weight-for-age, weight-for-height and
body mass index (BMI).
http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights
To reduce the size of the data for this exercise, a random sample of 250
rows was generated and stored in UCLA ht wt sample.csv. Large data sets
require a different treatment that we won’t cover here.
The columns contain the new index (new.index), the original index (index),
the height (ht) and the weight (wt).
Use the UWC analysis to guide your R code.
Step 1 # Read data into R, attach data, print first 6 lines
d <- read.csv("E:/Data/UCLA ht wt sample.csv",header=T)
attach(d)
head(d)
Step 2 # Plot marginal distributions of height and weight
# Provide the boxplot of the conditional distributions of weight
# given height
boxplot(wt~ht,range=0,varwidth=T,col= "gray", main= "Boxplots of the distributions\n of weight given height", xlab="height (in)", ylab="weight (lb)")
Step 3 # fit the linear model of wt on ht
Step 4 # obtain summary and anova of fit
# compute fitted values and residuals
Step 5 # various common plots put into a 2X3 matrix of scatter-plots to
# check the fit and assumptions on noise terms i.e.
# are they normal and independent of predictor and model?
#
# plot 1: scatterplot of data and superimposed fitted model -
# only do this plot when there is only a single predictor
# plot 2: scatterplot of observed values vs fitted values to see
# how close the fitted values are to the observed data –
# do this plot no matter how many predictors
# plot 3: plot of residuals (random or pattern?)
# plot 4: Q-Q plot of residuals (are they approx normal?)
# plot 5: residuals vs predictor (random or pattern?)
# if there are many predictors we plot residuals against
# each predictor in turn
# plot 6: residuals vs fitted (random or pattern?)
Step 6 # Detecting possible outlying and influential observations
# Are there outliers in the y-space or x-space?
# Are there influential observations as measured by
# DFFITS,Cook’s Distance or DFBETAS?
Step 7 What are our conclusions?
First interpret the diagnostic plots for assumptions on
residuals and then consider the possibility of influential
observations and outliers.
If the diagnostics are satisfactory we then look at the
summary and ANOVA to assess and interpret the model: What is
the mean function and what is the spread about the mean
function?
Can we criticize the data or the model and make suggestions
for further analysis or research?
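The steps above can be sketched as a template. The block below uses simulated heights and weights so that it is self-contained; with the real data, drop the simulation lines and use ht and wt from the attached data frame:

```r
# Template sketch for Practical 1, Steps 3-6, on simulated data.
set.seed(3)
ht <- round(rnorm(250, mean = 66, sd = 3))    # simulated heights (in)
wt <- -180 + 5 * ht + rnorm(250, sd = 10)     # simulated weights (lb)

fit   <- lm(wt ~ ht)               # Step 3: fit the linear model
summary(fit); anova(fit)           # Step 4: summary and anova
wthat <- fitted.values(fit)
res   <- residuals(fit)

par(mfrow = c(2, 3))               # Step 5: 2x3 matrix of diagnostic plots
plot(ht, wt); abline(fit)          # plot 1: data with fitted line
plot(wthat, wt); abline(0, 1)      # plot 2: observed vs fitted
plot(res)                          # plot 3: sequential plot of residuals
qqnorm(res); qqline(res)           # plot 4: normal probability plot
plot(ht, res)                      # plot 5: residuals vs predictor
plot(wthat, res)                   # plot 6: residuals vs fitted

par(mfrow = c(2, 2))               # Step 6: influence diagnostics
plot(rstudent(fit), type = "h")
plot(hatvalues(fit), type = "h")
plot(dffits(fit), type = "h")
plot(cooks.distance(fit), type = "h")
```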
Practical 2: Salamander Problem and Data
The data set for this assignment was obtained from Bill Peterman during the fall of 2005, then a post-graduate student in Ecology and Conservation at the University of Missouri. See his webpages at
http://senr.osu.edu/our-people/william-peterman
http://petermanresearch.weebly.com/dr-bill-peterman.html
The data given in the appendix were collected on 45 salamanders to ascertain the time to anesthetization (seconds) when submerged in different concentrations of Tricaine Methanesulfonate (MS-222 for short), a fine white powder that dissolves easily in water. The salamanders were placed in a container with the solution and were completely submerged. The temperature of the water-anesthetic solution (MS-222 was the anesthetic) was measured in degrees Celsius. The covariates considered were snout vent length (sl) measured in millimetres, total length (tl) measured in millimetres, mass measured in grams, the pH of the solution, and the temperature.
The study was motivated because Bill needed to insert electronic tracking devices into the salamanders so that they could be easily tracked, but he could find no guidelines about the concentration required for anesthetization of salamanders. The objective was to develop a model to predict the time required for anesthetization in terms of the concentration and the size of the salamander as measured by its mass. We will ignore the pH and temperature.
Model Building Considerations
It seems appropriate to exclude temperature and pH from the analysis because they were strongly correlated with the concentration. Further, it seemed sensible to use mass rather than either snout vent length (sl) or total length (tl) in the analysis: because of their high correlation only one of these measurements would be included, and mass is by far the more reliable and intuitively appealing measurement. After considering the scatterplots of time to anesthetization (anes, measured in minutes) against concentration (conc, mg/L), it seemed that the analysis should be based on log transformations of anesthetization time, concentration and mass. Thus, the complete model to be contemplated is
ln(anes) = β0 + β1 ln(conc) + β2 ln(mass) + ε
Analysis: Perform an analysis that will lead to a regression model for predicting the average time to anesthetization in terms of concentration, mass and ph.
Suggestions for the analysis step of the Salamander Data (approximate steps):
1 Read data into R.
2 Analyse the marginal distributions of the original/untransformed data and comment on these (e.g. stem-and-leaf, summary, Q-Q plots, histograms, etc.).
3 Obtain scatterplot and correlation matrices of the original data and comment on these.
4 Transform data to logs: log(anes), log(conc) and log(mass).
5 Check that the transformed data are approximately normal. (At this stage we are not so interested in the means and sd’s but in the shape of their distributions – are they approximately normal?)
6 Repeat step 3 for the transformed data as a precursor to the model fitting.
7 Fit and assess the contemplated model:
Model 1: ln(anes) = β0 + β1 ln(conc) + β2 ln(mass) + ε
8 Provide the ANOVA and summary table, check that the assumptions on the error terms are reasonable for the model, and perform the usual diagnostic plots looking for outliers in the Y-space and the X-space and for influence as measured by DFFITS, Cook’s Distance and DFBETAS.
9 Interpretations.
10 Criticisms and recommendations on the experiment, the data and the model.
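Model 1 can be fitted with a sketch like the following. The values here are simulated stand-ins for the 45 salamanders (the coefficients chosen are illustrative assumptions, not estimates from the real data); the variable names anes, conc and mass follow the text:

```r
# Sketch: fitting ln(anes) = b0 + b1*ln(conc) + b2*ln(mass) + e
# on simulated data standing in for the 45 salamanders.
set.seed(4)
n    <- 45
conc <- runif(n, 100, 1000)   # concentration (mg/L), illustrative range
mass <- runif(n, 1, 20)       # mass (g), illustrative range
anes <- exp(8 - 0.8 * log(conc) + 0.3 * log(mass) + rnorm(n, sd = 0.2))

fit.m1 <- lm(log(anes) ~ log(conc) + log(mass))
summary(fit.m1)
anova(fit.m1)

# The usual diagnostics then follow, as in the lecture examples:
res <- residuals(fit.m1)
qqnorm(res); qqline(res)
```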
Picture of salamander species used in the anesthetization study (pictures by Bill Peterman):
Practical 4: Birthweight data
The data JRHbirthwt.csv for this exercise comes from a recent research
project at the John Radcliffe Hospital. The 17 variables are:
Age
PAPPA
hCG
NT
trisomy
Parity (categorical)
BMI
Smoking (categorical)
Ethnicity (categorical)
Conception
Gestation in weeks (won’t use this)
Gestation in days
Delivery (categorical)
Centile
Birthwt
PET2 (categorical)
G3M (categorical)
The objective is to develop a model to predict birthweight from the other
variables.
I plotted all the marginal distributions and re-coded to eliminate sparse
categories, and then converted the categorical variables to factors:
Ethnicity.new <- 1*(Ethnicity==1)+2*(Ethnicity==2)+1*(Ethnicity>2)
Ethnicity.new <- as.factor( Ethnicity.new)
PET.new <-1*(PET2==1)+2*(PET2>1)
PET.new <- as.factor(PET.new)
G3M.new <- 1*(G3M==1)+2*(G3M>1)
G3M.new <- as.factor(G3M.new)
Smoking <- as.factor(Smoking)
Parity.new <- 0*(Parity==0)+1*(Parity==1)+1*(Parity==2)
Parity.new <- as.factor(Parity.new)
Conception.new <- 1*(Conception==1)+2*(Conception>1)
Conception.new <- as.factor(Conception.new)
Perform a stepwise regression on these variables.
Fit the best resulting model
Obtain the summary and anova
Select your model
Assess assumptions and influential observations
What are your conclusions?
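The requested stepwise regression follows the same pattern as Lecture 4. A minimal sketch, on simulated data since the JRH data cannot be reproduced here (all variable values and effect sizes below are illustrative assumptions):

```r
# Sketch of backward stepwise selection on simulated birthweight data.
library(MASS)                  # for stepAIC
set.seed(5)
n <- 200
Age     <- rnorm(n, 30, 5)
BMI     <- rnorm(n, 25, 4)
Smoking <- factor(sample(0:1, n, replace = TRUE))
noise1  <- rnorm(n)            # a variable with no real effect
Birthwt <- 3200 + 15 * Age - 150 * (Smoking == 1) + rnorm(n, sd = 300)
d <- data.frame(Birthwt, Age, BMI, Smoking, noise1)

full <- lm(Birthwt ~ ., data = d)                       # full model
best <- stepAIC(full, direction = "backward", trace = FALSE)
summary(best)
anova(best)

# Then assess assumptions and influential observations as before:
par(mfrow = c(2, 2)); plot(best)
```

With the real data, the full model would include the recoded factors (Ethnicity.new, PET.new, G3M.new, Smoking, Parity.new, Conception.new) alongside the continuous predictors.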