February 2016
Statistics:
Multiple Regression in R
How to Use This Course Book
This course book accompanies the face-to-face session taught at IT Services. It contains a copy of the slideshow and the worksheets.
Software Used
We might use Excel to capture your data, but no other software is required. Since this is a Concepts course, we will concentrate on exploring ideas and underlying concepts that researchers will find helpful in undertaking data collection and interpretation.
Revision Information
Version Date Author Changes made
1.0 February 2016 John Fresen Course book version 1
Copyright
The copyright of this document lies with Oxford University IT Services.
Contents
1 Introduction
1.1. What You Should Already Know
1.2. What You Will Learn
2 Your Resources for These Exercises
2.1. Help and Support Resources
3 What Next?
3.1. Statistics Courses
3.2. IT Services Help Centre
Statistics: Concepts TRMSZ
1 IT Services
1 Introduction
Welcome to the course Multiple Regression in R.
This course introduces the concept of regression using Sir Francis Galton’s Parent-Child height data and then extends the concept to multiple regression using real examples. The course has an applied focus and makes minimal use of mathematics. No derivations of formulae are presented.
1.1. What You Should Already Know
We assume that you are familiar with entering and editing text, rearranging and formatting text - drag and drop, copy and paste, printing and previewing, and managing files and folders.
The computer network in IT Services may differ slightly from that which you are used to in your College or Department; if you are confused by the differences, ask for help from the teacher.
1.2. What You Will Learn
In this course we will cover the following topics:
What is regression
Simple linear regression
Influential observations
Multiple regression
Model selection
Post selection inference
Cross validation
From problem – to data – to conclusions
Where to get help
Topics covered in related Statistics courses, should you be interested, are given in Section 3.1.
2 Your Resources for These Exercises The exercises in this handbook will introduce you to some of the tasks you will need to carry out when working with WebLearn. Some sample files and documents are provided for you; if you are on a course held at IT Services, they will be on your network drive H:\ (Find it under My Computer).
During a taught course at IT Services, there may not be time to complete all the exercises. You will need to be selective, and choose your own priorities among the
variety of activities offered here. However, those exercises marked with a star *
should not be skipped.
Please complete the remaining exercises later in your own time, or book for a Computer8 session at IT Services for classroom assistance (See section 8.2).
2.1. Help and Support Resources
You can find support information for the exercises on this course and your future use of WebLearn, as follows:
WebLearn Guidance https://weblearn.ox.ac.uk/info (This should be your first port of call)
If at any time you are not clear about any aspect of this course, please make sure you ask John for help. If you are away from the class, you can get help and advice by emailing the central address [email protected].
The website for this course including reading material and other material can be found at https://weblearn.ox.ac.uk/x/Mvkigl
You are welcome to contact John about statistical issues and questions at [email protected]
3 What Next?
3.1. Statistics Courses
Now that you have a grasp of some basic concepts in Statistics, you may want to develop your skills further. IT Services offers further Statistics courses and details are available at http://courses.it.ox.ac.uk.
In particular, you might like to attend the course
Statistics: Introduction: this is a four-session module which covers the basics of statistics and aims to provide a platform for learning more advanced tools and techniques.
Courses on particular discipline areas or data analysis packages include:
Statistics: Designing clinical research and biostatistics
SPSS: An introduction
SPSS: An introduction to using syntax
STATA: An introduction to data access and management
STATA: Data manipulation and analysis
STATA: Statistical, survey and graphical analyses
3.2. IT Services Help Centre
The IT Services Help Centre at 13 Banbury Road is open by appointment during working hours, and on a drop-in basis from 6:00 pm to 8:30 pm, Monday to Friday.
The Help Centre is also a good place to get advice about any aspect of using computer software or hardware. You can contact the Help Centre on (2)73200 or by email on [email protected]
Your safety is important
Where is the fire exit?
Beware of hazards:
Tripping over bags and coats
Please report any equipment faults to us
Let us know if you have any other concerns
Your comfort is important
The toilets are along the corridor outside the lecture rooms
The rest area is where you registered; it has vending machines and a water cooler
The seats at the computers are adjustable
You can adjust the monitors for height, tilt and brightness
Session 1: The concept of regression from Galton
Thanks to:
Dave Baker, IT Services
Jill Fresen, IT Services
Jim Hanley, McGill University
Ian Sinclair, REES Group Oxford
Sir Francis Galton (16 February 1822 to 17 January 1911)
http://en.wikipedia.org/wiki/Francis_Galton
Sir Francis Galton was an incredible polymath
Cousin of Charles Darwin.
General: Genetics. What do we inherit from our ancestors?
Particular: Do tall parents have tall children and short parents, short children?
i.e. Does the height of children depend on the height of parents?
Data: Famous 1885 study: 205 sets of parents, 928 offspring
mph = average height of parents; ch = child height
Galton's peas experiment: he selected 700 pea pods of chosen sizes and measured the average diameter of the parent peas and the average diameter of the child peas.
Francis Galton: Do tall parents have tall children, and short parents short children? Does the height of a child depend on the height of the parents?
[Figure: frequency scatterplot of the Galton data, showing the frequency of each combination of child height and midparent height (midparent heights 62 to 74 inches).]
[Figures: Galton data: boxplots of the conditional distributions of child height given parent height; histograms of the marginal distributions of child height and parent height; plot of the child-parent data; sunflower plot of the data; plot of the data with child height jittered; and plot of the conditional distributions of child height given parent height. Each panel has parent height (64 to 72 inches) on the horizontal axis.]
Regression is a plot/trace of the means of the conditional distributions
[Figures: trace of the actual means (the regression of Child on Midparent); trace of the linear regression means, which assumes the means lie on a straight line; and the actual and linear regressions superimposed. Each panel has midparent height (62 to 74 inches) on the horizontal axis.]
The trace of actual means involves no assumptions, but the end distributions show a lot of sampling variation because of the small number of observations in those distributions.
Linear regression stabilises that
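The stabilising effect can be seen in a small simulation; this is a minimal sketch using made-up heights, not Galton's actual data:

```r
# Sketch: compare the trace of conditional means with a fitted straight line
# (simulated parent/child heights; all numbers hypothetical).
set.seed(1)
mph <- round(rnorm(900, mean = 68, sd = 1.8))   # midparent heights, grouped to whole inches
ch  <- 24 + 0.65 * mph + rnorm(900, sd = 2.2)   # child heights
cond.means <- tapply(ch, mph, mean)             # mean child height at each midparent height
fit <- lm(ch ~ mph)                             # linear regression stabilises the trace
plot(mph, ch, col = "grey")
lines(as.numeric(names(cond.means)), cond.means, type = "b")  # trace of actual means
abline(fit, lwd = 2)                            # fitted regression line
```

The end groups contain few observations, so the raw trace wobbles there; the fitted line does not.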
[Figures: the linear regression model, and the linear regression model fitted to the data, with midparent height (62 to 74 inches) on the horizontal axis.]
Linear regression model assumes:
1. Conditional distributions are normal
2. Conditional means lie on a straight line
3. Conditional distributions all have the same spread
In words: the distribution of child height, conditional on a given midparent height, is normal, with means lying on the straight line and constant spread.
In mathematics: child height | midparent height ~ Normal(𝛽0 + 𝛽1 · midparent height, 𝜎2)
This model can be extended in many ways
The linear model can be extended in many ways. Here are three; there are more:
1. Model the mean by a more general function, such as a polynomial, a trigonometric function, a Fourier series, radial basis functions or some non-parametric function
2. Model the variance as a function of the predictors
3. Generalize from the normal distribution to the exponential family, which includes the normal, exponential, gamma, chi-squared, beta, Dirichlet, Bernoulli, categorical, Poisson, Wishart, inverse Wishart and many others.
But in all cases we are modelling the mean and other parameters of conditional distributions. These are called Generalized Linear Models.
In R: lm() fits a linear model; glm() fits a generalized linear model; gam() (from the mgcv or gam packages) fits a generalized additive model.
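As a minimal sketch of these fitting functions on simulated data (variable names and numbers hypothetical):

```r
# Sketch: lm() for a normal response, glm() for a Poisson count response.
set.seed(2)
x <- runif(200, 0, 3)
y.norm  <- 1 + 2 * x + rnorm(200)                   # normal response
y.count <- rpois(200, lambda = exp(0.2 + 0.5 * x))  # Poisson count response
fit.lm  <- lm(y.norm ~ x)                           # linear model
fit.glm <- glm(y.count ~ x, family = poisson)       # generalized linear model
# A generalized additive model would be fitted with, e.g., mgcv::gam(y.norm ~ s(x))
coef(fit.lm)
coef(fit.glm)
```

Both calls model the mean of the conditional distribution of the response; glm() simply swaps the normal distribution for another member of the exponential family.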
Does the average diameter of child peas depend on the average diameter of parent peas? What are the sketches telling us? Would a linear regression model be suitable?
An important point about this example is that in regression, it is the slope of the regression line that is important, not the intercept.
Go to Lecture 2 A: Example 1 UWC Analysis
Detecting Influential Observations using R
Influential observations may suggest that your model is incorrect or that a data point has been mis-recorded.
Detecting Influential Observations in R
Case 1: Outlying in y-space, not x-space. Use studentized residuals to detect these.
Case 2: Outlying in x-space, not y-space. High leverage point. Use hatvalues to detect this.
Case 3: Outlying in both x-space and y-space. High leverage point.
The measures we use are:
Studentized residuals
Hatvalues
DFFITS
DFBETAS
Outliers in Y-space:
To detect outliers in the Y-space we compute the studentized residuals, defined as:
r_i = e_i / (s_(i) √(1 − h_ii))
where e_i is the i-th residual, s_(i) is the residual standard deviation estimated with case i removed, and h_ii is the i-th hatvalue.
This is just the standardizing transformation applied to the residuals. Thus we expect about 99% of studentized residuals to lie between -3 and 3.
Values outside of this range suggest possible outliers.
In R these are plotted with, for example: plot(rstudent(fit1), type="h")
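A minimal sketch on simulated data, with an outlier deliberately planted at case 25 (all numbers hypothetical):

```r
# Sketch: flag possible y-space outliers with studentized residuals.
set.seed(3)
x <- 1:50
y <- 5 + 0.8 * x + rnorm(50)
y[25] <- y[25] + 10                 # plant an outlier in the y-direction
fit <- lm(y ~ x)
r <- rstudent(fit)                  # externally studentized residuals
which(abs(r) > 3)                   # case 25 should stand out
```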
Outliers in X-space:
To detect outliers in the X-space we compute the hatvalues, defined as the diagonal elements of the hat matrix:
hatvalues = diag(X(XᵀX)⁻¹Xᵀ)
The i-th hatvalue measures the distance of case i's x-values from the centroid of the x-values.
In R these are plotted with, for example: plot(hatvalues(fit1), type="h")
Some of these will be small, some intermediate, and some large. They are assessed in a relative sense; there is, in general, no absolute cut-off point.
We want to avoid the situation in which one observation dominates the others.
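A minimal sketch on simulated data, with one case placed far out in x-space (all numbers hypothetical):

```r
# Sketch: a point far from the centroid in x-space gets a large hatvalue.
set.seed(4)
x <- c(rnorm(30, mean = 10), 25)    # the last case is far out in x
y <- 2 + 0.5 * x + rnorm(31)
fit <- lm(y ~ x)
h <- hatvalues(fit)
which.max(h)                        # the far-out case, 31, has the largest hatvalue
h[31] / mean(h)                     # and it is large relative to the average, p/n
```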
DFFITS (means change (DF) in FITted values)
DFFITS measures the effect or influence of removing the i-th observation on the predicted value of the i-th observation, divided by a standardizing quantity:
DFFITS_i = (Ŷ_i − Ŷ_i(i)) / constant
where Ŷ_i(i) is the predicted value for case i when case i is removed from the data.
This is a local measure of influence; it is only concerned with what happens at case i if case i is removed from the data.
In R these are plotted with, for example: plot(dffits(fit1), type="h")
Some of these will be small, some intermediate, and some large. They are assessed in a relative sense; there is, in general, no absolute cut-off point. We want to avoid the situation in which one observation dominates the others.
Cook's Distance measures the effect or influence of removing the i-th observation on all predicted values, divided by a standardizing quantity to normalize it:
D_i = Σ_j (Ŷ_j − Ŷ_j(i))² / constant
This is a global measure of influence; it considers the effect of removing case i on all predicted values.
In R these are plotted with, for example: plot(cooks.distance(fit1), type="h")
Some of these will be small, some intermediate, and some large. They are assessed in a relative sense; there is, in general, no absolute cut-off point.
DFBETAS (means change (DF) in the BETAS)
DFBETAS measures the effect or influence of removing the i-th observation on the estimated regression coefficients, divided by a standardizing quantity:
DFBETAS_k(i) = (β̂_k − β̂_k(i)) / constant
where β̂_k(i) is the k-th estimated coefficient when case i is removed from the data.
In R these are plotted with, for example: plot(dfbetas(fit1)[,k], type="h") for each coefficient k.
dfbetas(fit1) is an n-by-p matrix. The entry in row i and column k represents the effect of removing the i-th observation on the k-th regression coefficient.
Some of these will be small, some intermediate, and some large. They are assessed in a relative sense; there is, in general, no absolute cut-off point.
Go to Lecture 2 B: UWC Analysis, outliers and influential observations
Model Selection
The Executive Salary Data has the variables:
lsalary exper educat bonus numemp assets board age profits internat sales
Question: How do we select which variables to include in a model for estimating the mean log(salary)?
Model selection is still a vibrant research topic in statistics, with many contentious issues. One guiding principle, itself not without controversy, is:
Among competing hypotheses, the one with the fewest assumptions should be selected.
In other words: choose the simplest model that gives an adequate description of the data.
Common model selection criteria include:
Akaike information criterion (AIC)
Bayesian information criterion (BIC)
AIC = -2*log(likelihood(model)) + 2*(number of predictors)
BIC = -2*log(likelihood(model)) + (number of predictors)*log(number of observations)
The first term, -2*log(likelihood(model)), gets smaller with more predictors, but the model is penalized by the second term, which grows with the number of predictors (and, for BIC, with the sample size).
We choose the model with the smallest AIC (or BIC).
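A minimal sketch comparing two candidate models with R's built-in AIC() and BIC() functions, on simulated data (variable names hypothetical):

```r
# Sketch: compare a small model with a needlessly large one.
set.seed(5)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); junk <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)          # x2 and junk are irrelevant
fit.small <- lm(y ~ x1)
fit.big   <- lm(y ~ x1 + x2 + junk)
AIC(fit.small); AIC(fit.big)        # smaller is better
BIC(fit.small); BIC(fit.big)        # BIC penalizes the extra predictors more heavily
```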
Recent excellent articles are:
Statistical Model Choice, Gerda Claeskens (2016)
Valid Post-Selection Inference, Berk, Brown, Buja, Zhang and Zhao, Annals of Statistics (2013)
Problem: If we have k predictors, we have 2^k possible models (without considering interactions and transformations such as log, etc.)
For the Exec Salary Data we have 10 predictors, so there are 2^10 = 1024 models to consider without interactions or transformations.
We consider stepwise selection:
Forward selection, which involves starting with no variables in the model, testing the addition of each variable using a chosen model comparison criterion, adding the variable (if any) that improves the model the most, and repeating this process until none improves the model.
Backward elimination, which involves starting with all candidate variables, testing the deletion of each variable using a chosen model comparison criterion, deleting the variable (if any) that improves the model the most by being deleted, and repeating this process until no further improvement is possible.
Bidirectional elimination, a combination of the above, testing at each step for variables to be included or excluded.
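These three strategies can be sketched with base R's step() function, on simulated data (the Lecture 4 worksheet later in this book uses MASS::stepAIC, which behaves the same way; variable names here are hypothetical):

```r
# Sketch: forward, backward and bidirectional stepwise selection by AIC.
set.seed(6)
n <- 100
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)   # x3 is irrelevant
full <- lm(y ~ x1 + x2 + x3, data = d)
back <- step(full, direction = "backward")            # backward elimination
fwd  <- step(lm(y ~ 1, data = d), direction = "forward",
             scope = formula(full))                   # forward selection
both <- step(full, direction = "both")                # bidirectional
formula(back)                                         # the selected model
```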
Inference after Model Selection
The following articles provide a great discussion of post-selection inference and make suggestions for how to proceed:
Ernst Wit, Edwin van den Heuvel and Jan-Willem Romeijn, Statistica Neerlandica, doi:10.1111/j.1467-9574.2012.00530.x
Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang and Linda Zhao (2013) Valid post-selection inference. Ann. Statist., Volume 41, Number 2, 802-837
Two major problems that arise in the process of model selection are:
First, the distributions of the estimated regression parameters are no longer valid. (This means that the tests and confidence intervals normally calculated are no longer valid.)
Second, we should see how well the model works on unseen data. This might conceptually be achieved by splitting the data into two subsets, a training set and a validation set. This is called cross-validation.
Cross-validation is important in guarding against testing hypotheses suggested by the data (called "Type III errors").
Leave-p-out cross-validation
Leave-p-out cross-validation (LpO CV) involves using p observations as the validation set and the remaining observations as the training set. This is repeated over all ways of splitting the original sample into a validation set of p observations and a training set, so it is very expensive computationally.
Leave-one-out cross-validation
Leave-one-out cross-validation (LOOCV) is a particular case of leave-p-out cross-validation with p = 1.
LOOCV doesn't have the computational problem of general LpO cross-validation.
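A minimal LOOCV sketch for a linear model on simulated data (all numbers hypothetical):

```r
# Sketch: leave-one-out cross-validation for lm().
set.seed(7)
n <- 60
x <- runif(n)
y <- 1 + 3 * x + rnorm(n, sd = 0.5)
errs <- numeric(n)
for (i in 1:n) {
  fit  <- lm(y ~ x, subset = -i)                      # train without case i
  pred <- predict(fit, newdata = data.frame(x = x[i]))
  errs[i] <- (y[i] - pred)^2                          # test on the held-out case
}
mean(errs)                                            # LOOCV estimate of prediction error
```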
k-fold cross-validation
In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples.
Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k-1 subsamples are used as training data.
The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate.
2-fold cross-validation is a special case.
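A minimal k-fold sketch on simulated data, with k = 5 (all numbers hypothetical):

```r
# Sketch: 5-fold cross-validation by random partition.
set.seed(8)
n <- 100; k <- 5
x <- runif(n)
y <- 1 + 3 * x + rnorm(n)
fold <- sample(rep(1:k, length.out = n))              # random equal-sized folds
cv.err <- numeric(k)
for (j in 1:k) {
  train <- fold != j
  fit   <- lm(y ~ x, subset = train)
  pred  <- predict(fit, newdata = data.frame(x = x[!train]))
  cv.err[j] <- mean((y[!train] - pred)^2)             # fold j held out once
}
mean(cv.err)                                          # averaged over the k folds
```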
Go to lecture 5 Example 4 Executive salary data
Lecture 3: Example 2: Multiple Regression Executive Salary Data – Four variables
Problem and Data
This data set was collected by a large Human Resources Placement company specializing in the placement of Chief Executive Officers (CEOs). The objective of the research is to develop a model that will predict the average salary of CEOs based on a number of factors or variables associated with the executive and the company. Because of its skewed nature, salary was converted to log-salary and recorded as lsalary to bring it closer to normality. The purpose of the model is to help guide executives as to the distribution of salaries (mean and standard deviation) earned by executives with similar qualifications and backgrounds. The data is given in execsaldata.xlsx, execsaldata.txt and execsaldata.csv. For this exercise we will only consider the following predictors: experience (exper); number of employees (numemp); number of years of formal education (educat); company assets in millions of dollars (assets).
Model: 𝑙𝑠𝑎𝑙𝑎𝑟𝑦 = 𝛽0 + 𝛽1𝑒𝑥𝑝𝑒𝑟 + 𝛽2𝑒𝑑𝑢𝑐𝑎𝑡 + 𝛽3𝑛𝑢𝑚𝑒𝑚𝑝 + 𝛽4𝑎𝑠𝑠𝑒𝑡𝑠 + 𝑛𝑜𝑖𝑠𝑒, where the noise follows a normal distribution with a mean of zero and constant variance. The data were obtained by taking a random sample of 100 companies listed on the New York Stock Exchange during 2008 and are given in the comma-delimited file execsaldata.csv; an Excel version is also given.
Assignment: Perform an analysis that will lead to a regression model for predicting the average salary of executives based on their experience and background in terms of the factors considered above. You might consider, inter alia, the following points in your analysis, but do not be restricted to this list.
1. Read the data into R and attach it.
2. Summarize the univariate marginal distributions of experience, number of employees, number of years of formal education and the company assets.
3. Produce the matrix of scatterplots of these variables using the pairs statement, compute the correlation coefficients, and comment on the plots and correlation matrix.
4. Fit the regression model to the data. Construct and interpret the SUMMARY and ANOVA tables.
5. How accurate is the prediction equation? Assess graphically the accuracy of the model by plotting the observed salary against the fitted values, say salhat.
6. Perform a graphical analysis of the residuals to assess if the assumptions on the error terms are reasonable.
7. Assess the possibility of outliers and influential observations.
8. Can you make recommendations about estimating the average salary of executives from the predictor variables considered here?
9. Can you criticize the data or the model?
Recommended Steps – You may copy and paste into R
lsalary exper educat bonus numemp assets board age profits internat sales
Step 1 Read and attach data.
My data is stored in the directory Data of my memory stick.
One can specify any directory.
esd = read.csv("E:/Data/execsaldata.csv",header=T,sep=",")
attach(esd)
head(esd)
names(esd)
Step 2 Plot marginal distributions of the response variable and the four
predictor variables in a 2x3 graphics layout
par(mfrow=c(2,3))
hist(lsalary,prob=T,col="gray");lines(density(lsalary));rug(lsalary)
hist(exper,prob=T,col="gray");lines(density(exper));rug(exper)
hist(educat,prob=T,col="gray");lines(density(educat));rug(educat)
hist(numemp,prob=T,col="gray");lines(density(numemp));rug(numemp)
hist(assets,prob=T,col="gray");lines(density(assets));rug(assets)
# notice that these are on vastly different scales
Step 3 Pairs plot (matrix of scatterplots)
newdata <- cbind(lsalary,exper,educat,numemp,assets)
pairs(newdata)
Step 4 Can combine steps 2 and 3 (See help for pairs)
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
pairs(newdata, panel = panel.smooth,
diag.panel = panel.hist, cex.labels = 1.0, font.labels = 2)
Step 5 Compute the correlation coefficients:
cor(newdata)
# or, to have fewer decimal places
round(cor(newdata),3)
Step 6 Fit the linear model to the data and get summary and anova:
fit1 = lm(lsalary~exper+educat+numemp+assets)
summary(fit1)
anova(fit1)
Step 7 Compute predicted values and residuals:
salhat = fitted.values(fit1)
res = residuals(fit1)
Step 8 # check the fit and assumptions on noise terms i.e.
# are they normal and independent of predictor and model?
par(mfrow=c(2,4))
plot(salhat,lsalary,xlim=c(10.6,12.1),ylim=c(10.6,12.1),
main="Observed vs fitted values")
abline(0,1) #superimposes an “ideal” straight line
plot(res) # sequential plot of residuals
qqnorm(res);qqline(res) #normal probability plot of residuals
plot(exper,res) #residuals vs exper
plot(educat,res) #residuals vs educat
plot(numemp,res) #residuals vs numemp
plot(assets,res) #residuals vs assets
plot(salhat,res) #residuals vs fitted values
Step 9 # Detecting possible outlying and influential observations
# Are there outliers in the y-space or x-space?
# Are there influential observations as measured by
# DFFITS,Cook’s Distance or DFBETAS?
par(mfrow=c(3,3))
plot(rstudent(fit1),type="h")
plot(hatvalues(fit1),type="h")
plot(dffits(fit1),type="h")
plot(cooks.distance(fit1),type="h")
plot(dfbetas(fit1)[,1],type="h")
plot(dfbetas(fit1)[,2],type="h")
plot(dfbetas(fit1)[,3],type="h")
plot(dfbetas(fit1)[,4],type="h")
plot(dfbetas(fit1)[,5],type="h")
# We might wish to plot hatvalues against cooks.distance and even
# consider other combinations of plots.
#---------------
# An interesting 2X2 matrix of plots is provided by plot(fit1)
par(mfrow=c(2,2))
plot(fit1)
Step 10 What are our conclusions?
First interpret the diagnostic plots for assumptions on residuals
and then consider the possibility of influential observations and
outliers. Do the assumptions on the noise terms seem reasonable?
If the diagnostics are satisfactory we then look at the summary and
ANOVA to assess and interpret the model: What is the mean function
and what is the spread about the mean function? Is the model a
reasonable approximation to the data?
Can we criticize the data or the model and make suggestions for
further analysis or research?
Example 1: UWC Analysis
Mathematical model: 𝑟𝑒𝑠𝑢𝑙𝑡 = 𝛽0 + 𝛽1𝑟𝑎𝑡𝑖𝑛𝑔 + 𝑛𝑜𝑖𝑠𝑒
assume 𝑛𝑜𝑖𝑠𝑒 follows a normal distribution with a mean of zero and variance 𝜎2
We could write this as 𝑟𝑒𝑠𝑢𝑙𝑡|𝑟𝑎𝑡𝑖𝑛𝑔~𝑁𝑜𝑟𝑚𝑎𝑙(𝛽0 + 𝛽1𝑟𝑎𝑡𝑖𝑛𝑔, 𝜎2)
The conditional distribution of result given rating is normal with a mean of
𝛽0 + 𝛽1𝑟𝑎𝑡𝑖𝑛𝑔 and variance of 𝜎2
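Before analysing the real data, it can help to simulate from this model and check that lm() recovers the parameters; a minimal sketch with hypothetical values 𝛽0 = 20, 𝛽1 = 5 and 𝜎 = 8:

```r
# Sketch: simulate result | rating ~ Normal(b0 + b1*rating, sigma^2) and refit.
set.seed(9)
rating <- runif(200, 1, 10)
result <- 20 + 5 * rating + rnorm(200, sd = 8)   # b0 = 20, b1 = 5, sigma = 8
fit <- lm(result ~ rating)
coef(fit)                                        # estimates close to (20, 5)
summary(fit)$sigma                               # estimate of sigma, close to 8
```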
Suggested steps:
Step 1 # Read data into R, attach data, print first 6 lines
uwc = read.table("E:/Data/uwcdata.csv",header = T,sep=",")
attach(uwc)
head(uwc)
Step 2 # Plot marginal distributions
par(mfrow=c(1,2)) #1 row by 2 cols graphics window
hist(rating,prob=T,col="gray");lines(density(rating));rug(rating)
hist(result,prob=T,col="gray");lines(density(result));rug(result)
Step 3 # fit the linear model (linear regression model)
fit1 <- lm(result ~ rating)
# fit1 is an object generated by the routine “lm” containing a lot
# of information about the fitted model.
# The rest of the steps are simply accessing information in fit1
Step 4 # obtain summary and anova of fit
# compute fitted values and residuals
summary(fit1)
anova(fit1)
yhat <- fitted.values(fit1) # fitted values
res <- residuals(fit1) # residuals
Step 5 # various common plots put into a 2X3 matrix of scatter-plots to
# check the fit and assumptions on noise terms i.e.
# are they normal and independent of predictor and model?
#
# plot 1: scatterplot of data and superimposed fitted model -
# only do this plot when there is only a single predictor
# plot 2: scatterplot of observed values vs fitted values to see
# how close the fitted values are to the observed data –
# do this plot no matter how many predictors
# plot 3: plot of residuals (random or pattern?)
# plot 4: Q-Q plot of residuals (are they approx normal?)
# plot 5: residuals vs predictor (random or pattern?)
# if there are many predictors we plot residuals against
# each predictor in turn
# plot 6: residuals vs fitted (random or pattern?)
par(mfrow=c(2,3))
plot(rating,result,ylim=c(0,100), pch=19,cex=1.5,
main="Result vs Rating UWC data\n showing pass mark and fitted
model")
abline(fit1) #superimposes fitted straight line
abline(h=48,lty=2) # superimposes dashed horizontal line at 48
plot(yhat,result,xlim=c(0,100),ylim=c(0,100),
main="Observed vs fitted values")
abline(0,1) #superimposes an “ideal” straight line
plot(res)
qqnorm(res);qqline(res) #normal probability plot of residuals
plot(rating,res) #residuals vs predictor
plot(yhat,res) #residuals vs fitted values
Step 6 # Detecting possible outlying and influential observations
# Are there outliers in the y-space or x-space?
# Are there influential observations as measured by
# DFFITS,Cook’s Distance or DFBETAS?
par(mfrow=c(2,3))
plot(rstudent(fit1),type="h")
plot(hatvalues(fit1),type="h")
plot(dffits(fit1),type="h")
plot(cooks.distance(fit1),type="h")
plot(dfbetas(fit1)[,1],type="h")
plot(dfbetas(fit1)[,2],type="h")
Step 7 What are our conclusions?
First interpret the diagnostic plots for assumptions on
residuals and then consider the possibility of
influential observations and outliers.
If the diagnostics are satisfactory we then look at the
summary and ANOVA to assess and interpret the model:
What is the mean function and what is the spread about
the mean function?
Can we criticize the data or the model and make
suggestions for further analysis or research?
Lecture 3: Example 2: Multiple Regression Executive Salary Data – Four variables
Problem and Data This data set was collected by a large Human Resources Placement company specializing in the placement of Chief Executive Officers (CEO's). The objective of the research is to develop a model that will predict the average salary of CEO's based on a number of factors or variables associated with the executive and the company. Because of its skewed nature salary was converted to log-salary and recorded as lsalary to bring it closer to normality. The purpose of the model is to help guide executives as to the distribution of salaries (mean and standard deviation) earned by executives with similar qualifications and backgrounds. The data is given in execsaldata.xlxs, execsaldata.txt, execsaldata.csv. For this exercise we will only consider the following predictors: experience (exper); number of employees (numemp) ; number of years of formal education (educat) ; company assets in millions of dollars (assets)
Model: 𝑙𝑠𝑎𝑙𝑎𝑟𝑦 = 𝛽0 + 𝛽1𝑒𝑥𝑝𝑒𝑟 + 𝛽2𝑒𝑑𝑢𝑐𝑎𝑡 + 𝛽3𝑛𝑢𝑚𝑒𝑚𝑝 + 𝛽4𝑎𝑠𝑠𝑒𝑡𝑠 + 𝑛𝑜𝑖𝑠𝑒 where the noise follows a normal distribution with a mean of zero and constant variance The data were obtained by taking a random sample of 100 companies listed on the New York Stock Exchange during 2008 and is given in a tab delimited file execsaldata.csv. But an excel version is also given.
Assignment: Perform an analysis that will lead to a regression model for predicting the average salary of executives based on their experience and background in terms of the factors considered above. You might consider, inter-alia, the following points in your analysis but do not be restricted to this list. 1. Read the data into R and attach it. 2. Summarize the univariate marginal distributions of experience, number of employees, number of years
of formal education and the company assets. 3. Provide the matrix of these variables, using the pairs statement, compute the correlation
coefficients and comment on the plots and correlation matrix. 4. Fit the regression model to the data. Construct and interpret the SUMMARY and ANOVA
tables. 5. How accurate is the prediction equation? Assess graphically the accuracy of the model by
plotting the observed salary against the fitted values, say salhat. 6. Perform a graphical analysis of the residuals to assess if the assumptions on the error terms
are reasonable. 7. Assess the possibility of outliers and influential observations. 8. Can you make recommendations about estimating the average salary of executives from the
predictor variables considered here? 9. Can you criticize the data or the model?
Recommended Steps – You may copy and paste into R
lsalary exper educat bonus numemp assets board age profits internat sales
Step 1 Read and attach data.
My data is stored in the directory Data of my memory stick.
One can specify any directory.
esd = read.csv("E:/Data/execsaldata.csv",header=T,sep=",")
attach(esd)
head(esd)
names(esd)
Step 2 Plot marginal distributions of the response variable and the four
predictor variables on a 2X3 scatterplot matrix
par(mfrow=c(2,3))
hist(lsalary,prob=T,col="gray");lines(density(lsalary));rug(lsalary)
hist(exper,prob=T,col="gray");lines(density(exper));rug(exper)
hist(educat,prob=T,col="gray");lines(density(educat));rug(educat)
hist(numemp,prob=T,col="gray");lines(density(numemp));rug(numemp)
hist(assets,prob=T,col="gray");lines(density(assets));rug(assets)
# notice that these are on vastly different scales
Step 3 Pairs plot (matrix of scatterplots)
newdata <- cbind(lsalary,exper,educat,numemp,assets)
pairs(newdata)
Step 4 Can combine steps 2 and 3 (See help for pairs)
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
pairs(newdata, panel = panel.smooth,
diag.panel = panel.hist, cex.labels = 1.0, font.labels = 2)
Step 5 Compute the correlation coefficients:
cor(newdata)
# or, to have fewer decimal places
round(cor(newdata),3)
Step 6 Fit the linear model to the data and get summary and anova:
fit1 = lm(lsalary~exper+educat+numemp+assets)
summary(fit1)
anova(fit1)
Step 7 Compute predicted values and residuals:
salhat = fitted.values(fit1)
res = residuals(fit1)
Step 8 # check the fit and assumptions on noise terms i.e.
# are they normal and independent of predictor and model?
par(mfrow=c(2,4))
plot(salhat,lsalary,xlim=c(10.6,12.1),ylim=c(10.6,12.1),
main="Observed vs fitted values")
abline(0,1) #superimposes an “ideal” straight line
plot(res) # sequential plot of residuals
qqnorm(res);qqline(res) #normal probability plot of residuals
plot(exper,res) #residuals vs exper
plot(educat,res) #residuals vs eduac
plot(numemp,res) #residuals vs numemp
plot(assets,res) #residuals vs assets
plot(salhat,res) #residuals vs fitted values
Step 9 # Detecting possible outlying and influential observations
# Are there outliers in the y-space or x-space?
# Are there influential observations as measured by
# DFFITS,Cook’s Distance or DFBETAS?
par(mfrow=c(3,3))
plot(rstudent(fit1),type="h")
plot(hatvalues(fit1),type="h")
plot(dffits(fit1),type="h")
plot(cooks.distance(fit1),type="h")
plot(dfbetas(fit1)[,1],type="h")
plot(dfbetas(fit1)[,2],type="h")
plot(dfbetas(fit1)[,3],type="h")
plot(dfbetas(fit1)[,4],type="h")
plot(dfbetas(fit1)[,5],type="h")
# We might wish to plot hatvalues against cooks.distance and even
# consider other combinations of plots.
#---------------
# An interesting 2X2 matrix of plots is provided by plot(fit1)
par(mfrow=c(2,2))
plot(fit1)
Step 10 What are our conclusions?
First interpret the diagnostic plots for assumptions on residuals
and then consider the possibility of influential observations and
outliers. Do the assumptions on the noise terms seem reasonable?
If the diagnostics are satisfactory we then look at the summary and
ANOVA to assess and interpret the model: What is the mean function
and what is the spread about the mean function? Is the model a
reasonable approximation to the data?
Can we criticize the data or the model and make suggestions for
further analysis or research?
Lecture 4: Example 5: Model Selection – Executive Salary Data (all variables)
Stepwise Regression in R
Recommended Steps – You may copy and paste into R
Variables: lsalary exper educat bonus numemp assets board age profits internat sales
Step 1 Read and attach data.
My data is stored in the directory Data of my memory stick.
One can specify any directory.
esd = read.csv("E:/Data/execsaldata.csv",header=T,sep=",")
attach(esd)
head(esd)
names(esd)
Step 2 # Invoke the MASS library that contains the stepAIC function
library(MASS)
Step 3 Pairs plot (matrix of scatterplots)
newdata <- cbind(exper,educat,numemp,assets,age,profits,sales)
Step 4 Can combine steps 2 and 3 (See help for pairs)
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
}
pairs(newdata, panel = panel.smooth,
diag.panel = panel.hist, cex.labels = 1.0, font.labels = 2)
Step 5 Compute the correlation coefficients:
round(cor(newdata),3)
Step 6 Perform stepwise regression
fit1 <- lm(lsalary ~ .,data=esd)
esd.step <- stepAIC(fit1, direction = "backward" )
Step 6 Fit the linear model to the data and get summary and anova:
fit2 <- lm(lsalary ~ exper+educat+bonus+numemp+assets+board+age+
profits+internat+sales)
summary(fit2)
anova(fit2)
Step 7 Fit reduced model – take out non-significant terms
fit3 <- lm(lsalary ~ exper+educat+bonus+numemp+assets)
summary(fit3)
anova(fit3)
Step 7 Compute predicted values and residuals:
salhat = fitted.values(fit3)
res = residuals(fit3)
Step 8 # check the fit and assumptions on noise terms i.e.
# are they normal and independent of predictor and model?
par(mfrow=c(3,3))
plot(salhat,lsalary,xlim=c(10.6,12.1),ylim=c(10.6,12.1),
main="Observed vs fitted values")
abline(0,1) #superimposes an “ideal” straight line
plot(res) # sequential plot of residuals
qqnorm(res);qqline(res) #normal probability plot of residuals
plot(exper,res) #residuals vs exper
plot(educat,res) #residuals vs educat
plot(bonus,res) #residuals vs bonus
plot(numemp,res) #residuals vs numemp
plot(assets,res) #residuals vs assets
plot(salhat,res) #residuals vs fitted values
Step 9 # Detecting possible outlying and influential observations
# Are there outliers in the y-space or x-space?
# Are there influential observations as measured by
# DFFITS, Cook’s Distance or DFBETAS?
par(mfrow=c(4,3)) # ten diagnostic plots need more than a 3x3 grid
plot(rstudent(fit3),type="h")
plot(hatvalues(fit3),type="h")
plot(dffits(fit3),type="h")
plot(cooks.distance(fit3),type="h")
plot(dfbetas(fit3)[,1],type="h")
plot(dfbetas(fit3)[,2],type="h")
plot(dfbetas(fit3)[,3],type="h")
plot(dfbetas(fit3)[,4],type="h")
plot(dfbetas(fit3)[,5],type="h")
plot(dfbetas(fit3)[,6],type="h")
Step 10 What are our conclusions?
First interpret the diagnostic plots for assumptions on residuals
and then consider the possibility of influential observations and
outliers. Do the assumptions on the noise terms seem reasonable?
If the diagnostics are satisfactory we then look at the summary and
ANOVA to assess and interpret the model: What is the mean function
and what is the spread about the mean function? Is the model a
reasonable approximation to the data?
Can we criticize the data or the model and make suggestions for
further analysis or research?
Lecture 5: Inference after model selection
The following articles provide an excellent discussion of post-selection
inference and make suggestions for how to proceed:
Ernst Wit, Edwin van den Heuvel and Jan-Willem Romeijn (2012). ‘All models are wrong...’: an introduction to model uncertainty. Statistica Neerlandica, doi:10.1111/j.1467-9574.2012.00530.x
Richard Berk, Lawrence Brown, Andreas Buja, Kai Zhang and Linda Zhao (2013). Valid post-selection inference. Annals of Statistics, 41(2), 802-837.
Two major problems arise in the process of model selection:
First, the distributions of the estimated regression parameters are no longer valid. (This means that the tests and confidence intervals normally calculated are no longer valid.)
Second, we should see how well the model works on unseen data. Conceptually this can be achieved by splitting the data into two subsets, a “training” set and a “validation” set. This is called cross-validation. Cross-validation is important in guarding against testing hypotheses suggested by the data (so-called "Type III errors").
Exhaustive cross-validation
Exhaustive cross-validation methods learn and test on all possible ways to divide the original sample into a training set and a validation set.
Leave-p-out cross-validation
Leave-p-out cross-validation (LpO CV) uses p observations as the validation set and the remaining observations as the training set. This is repeated for every way of cutting the original sample into a validation set of p observations and a training set. LpO cross-validation requires fitting and validating the model C(n,p) times (where n is the number of observations in the original sample), so it becomes computationally infeasible as soon as n is reasonably large.
Leave-one-out cross-validation
Leave-one-out cross-validation (LOOCV) is the special case of leave-p-out cross-validation with p = 1. LOOCV does not suffer from the computational problem of general LpO cross-validation because C(n,1) = n.
k-fold cross-validation
In k-fold cross-validation, the original sample is randomly partitioned into k equal-sized subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged (or otherwise combined) to produce a single estimate.
2-fold cross-validation
This is the simplest variation of k-fold cross-validation, also called the holdout method. For each fold, we randomly assign the data points to two sets d0 and d1 of equal size (usually implemented by shuffling the data array and then splitting it in two). We then train on d0 and test on d1, followed by training on d1 and testing on d0.
Cross-validation only yields meaningful results if the validation set and training set are drawn from the same population and only if selection biases are controlled.
Cross-validation in R – Example 5: Executive salary data
esd = read.csv("E:/Data/execsaldata.csv",header=T,sep=",")
attach(esd)
head(esd)
fit3 <- lm(lsalary ~ exper+educat+bonus+numemp+assets)
library(cvTools)
cvFit(fit3, data = esd, y = lsalary,
      cost = rmspe, K = 5, R = 1, foldType = "consecutive")
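The mechanics of k-fold cross-validation can also be sketched in base R without cvTools. This is a minimal sketch on simulated data (so it runs on its own); the fold assignment and RMSPE calculation are our own, not part of the course code:

```r
# Manual 5-fold cross-validation sketch on simulated data.
set.seed(2)
n <- 100
x <- rnorm(n)
y <- 3 + 1.5 * x + rnorm(n)
d <- data.frame(x = x, y = y)

K <- 5
fold <- rep(1:K, each = n / K)   # consecutive blocks of n/K observations
errs <- numeric(K)
for (k in 1:K) {
  train <- d[fold != k, ]        # fit on the other K-1 folds
  valid <- d[fold == k, ]        # hold out fold k for validation
  fit   <- lm(y ~ x, data = train)
  pred  <- predict(fit, newdata = valid)
  errs[k] <- sqrt(mean((valid$y - pred)^2))  # RMSPE on held-out fold
}
cv_error <- mean(errs)           # averaged over the K folds
```

The averaged cv_error estimates how well the model predicts unseen data, which is exactly what cvFit reports.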
Practical 1: Ht Wt UCLA Data
The dataset UCLA ht wt sample.csv contains 250 records of human heights and
weights. These were obtained by taking a random sample of 250 from an
original sample of 25000 children, collected in 1993 by a Growth Survey of
children from birth to 18 years of age recruited from Maternal and Child
Health Centres (MCHC) and schools, and used to develop Hong Kong's current
growth charts for weight, height, weight-for-age, weight-for-height and
body mass index (BMI).
http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights
To reduce the size of the data for this exercise, a random sample of 250
rows was generated and stored in UCLA ht wt sample.csv. Large data sets
require a different treatment that we won’t cover here.
The columns contain the new index (new.index), the original index (index),
the height (ht) and the weight (wt).
Use the UWC analysis to guide your R code.
Step 1 # Read data into R, attach data, print first 6 lines
d <- read.csv("E:/Data/UCLA ht wt sample.csv",header=T)
attach(d)
head(d)
Step 2 # Plot marginal distributions of height and weight
# Provide the boxplot of the conditional distributions of weight
# given height
boxplot(wt~ht,range=0,varwidth=T,col= "gray", main= "Boxplots of the distributions\n of weight given height", xlab="height (in)", ylab="weight (lb)")
Step 3 # fit the linear model of wt on ht
Step 4 # obtain summary and anova of fit
# compute fitted values and residuals
Step 5 # various common plots put into a 2X3 matrix of scatter-plots to
# check the fit and assumptions on noise terms i.e.
# are they normal and independent of predictor and model?
#
# plot 1: scatterplot of data and superimposed fitted model -
# only do this plot when there is only a single predictor
# plot 2: scatterplot of observed values vs fitted values to see
# how close the fitted values are to the observed data –
# do this plot no matter how many predictors
# plot 3: plot of residuals (random or pattern?)
# plot 4: Q-Q plot of residuals (are they approx normal?)
# plot 5: residuals vs predictor (random or pattern?)
# if there are many predictors we plot residuals against
# each predictor in turn
# plot 6: residuals vs fitted (random or pattern?)
Step 6 # Detecting possible outlying and influential observations
# Are there outliers in the y-space or x-space?
# Are there influential observations as measured by
# DFFITS,Cook’s Distance or DFBETAS?
Step 7 What are our conclusions?
First interpret the diagnostic plots for assumptions on
residuals and then consider the possibility of influential
observations and outliers.
If the diagnostics are satisfactory we then look at the
summary and ANOVA to assess and interpret the model: What is
the mean function and what is the spread about the mean
function?
Can we criticize the data or the model and make suggestions
for further analysis or research?
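The steps above can be sketched as a template. The block below uses simulated heights and weights so that it is self-contained; with the real data, drop the simulation lines and use ht and wt from the attached data frame:

```r
# Template sketch for Practical 1, Steps 3-6, on simulated data.
set.seed(3)
ht <- round(rnorm(250, mean = 66, sd = 3))    # simulated heights (in)
wt <- -180 + 5 * ht + rnorm(250, sd = 10)     # simulated weights (lb)

fit   <- lm(wt ~ ht)               # Step 3: fit the linear model
summary(fit); anova(fit)           # Step 4: summary and anova
wthat <- fitted.values(fit)
res   <- residuals(fit)

par(mfrow = c(2, 3))               # Step 5: 2x3 matrix of diagnostic plots
plot(ht, wt); abline(fit)          # plot 1: data with fitted line
plot(wthat, wt); abline(0, 1)      # plot 2: observed vs fitted
plot(res)                          # plot 3: sequential plot of residuals
qqnorm(res); qqline(res)           # plot 4: normal probability plot
plot(ht, res)                      # plot 5: residuals vs predictor
plot(wthat, res)                   # plot 6: residuals vs fitted

par(mfrow = c(2, 2))               # Step 6: influence diagnostics
plot(rstudent(fit), type = "h")
plot(hatvalues(fit), type = "h")
plot(dffits(fit), type = "h")
plot(cooks.distance(fit), type = "h")
```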
Practical 2: Salamander Problem and Data
The data set for this assignment was obtained from Bill Peterman during the fall of 2005, then a post-graduate student in Ecology and Conservation at the University of Missouri. See his webpages at
http://senr.osu.edu/our-people/william-peterman
http://petermanresearch.weebly.com/dr-bill-peterman.html
The data given in the appendix were collected on 45 salamanders to ascertain the time to anesthetization (seconds) when submerged in different concentrations of Tricaine Methanesulfonate (MS-222 for short), a fine white powder that dissolves easily in water. The salamanders were placed in a container with the solution and were completely submerged. The temperature of the water-anesthetic solution (MS-222 was the anesthetic) was measured in degrees Celsius. The covariates considered were snout vent length (sl) measured in millimetres, total length (tl) measured in millimetres, mass measured in grams, the pH of the solution, and the temperature.
The study was motivated because Bill needed to insert electronic tracking devices into the salamanders so that they could be easily tracked, but he could find no guidelines about the concentration required for anesthetization of salamanders. The objective was to develop a model to predict the time required for anesthetization in terms of the concentration and the size of the salamander as measured by its mass. We will ignore the pH and temperature.
Model Building Considerations
It seems appropriate to exclude temperature and pH from the analysis because they were strongly correlated with the concentration. Further, it seemed sensible to use mass rather than either snout vent length (sl) or total length (tl) in the analysis: because of their high correlation only one of these measurements would be included, and mass is by far the more reliable and intuitively appealing measurement. After considering the scatterplots of time to anesthetization (anes, measured in minutes) against concentration (conc, mg/L), it seemed that the analysis should be based on log transformations of anesthetization time, concentration and mass. Thus, the complete model to be contemplated is
ln(anes) = β0 + β1 ln(conc) + β2 ln(mass) + ε
Analysis: Perform an analysis that will lead to a regression model for predicting the average time to anesthetization in terms of concentration, mass and ph.
Suggestions for the analysis step of the Salamander Data (approximate steps):
1 Read data into R.
2 Analyse the marginal distributions of the original/untransformed data and comment on these (e.g. stem-and-leaf, summary, Q-Q plots, histograms, etc.).
3 Obtain scatterplot and correlation matrices of the original data and comment on these.
4 Transform data to logs: log(anes), log(conc) and log(mass).
5 Check that the transformed data are approximately normal. (At this stage we are not so interested in the means and sd’s but in the shape of their distributions – are they approximately normal?)
6 Repeat step 3 for the transformed data as a precursor to the model fitting.
7 Fit and assess the contemplated model:
Model 1: ln(anes) = β0 + β1 ln(conc) + β2 ln(mass) + ε
8 Provide the ANOVA and summary table, check that the assumptions on the error terms are reasonable for the model, and perform the usual diagnostic plots looking for outliers in the Y-space and the X-space and for influence as measured by DFFITS, Cook’s Distance and DFBETAS.
9 Interpretations.
10 Criticisms and recommendations on the experiment, the data and the model.
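Model 1 can be fitted with a sketch like the following. The values here are simulated stand-ins for the 45 salamanders (the coefficients chosen are illustrative assumptions, not estimates from the real data); the variable names anes, conc and mass follow the text:

```r
# Sketch: fitting ln(anes) = b0 + b1*ln(conc) + b2*ln(mass) + e
# on simulated data standing in for the 45 salamanders.
set.seed(4)
n    <- 45
conc <- runif(n, 100, 1000)   # concentration (mg/L), illustrative range
mass <- runif(n, 1, 20)       # mass (g), illustrative range
anes <- exp(8 - 0.8 * log(conc) + 0.3 * log(mass) + rnorm(n, sd = 0.2))

fit.m1 <- lm(log(anes) ~ log(conc) + log(mass))
summary(fit.m1)
anova(fit.m1)

# The usual diagnostics then follow, as in the lecture examples:
res <- residuals(fit.m1)
qqnorm(res); qqline(res)
```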
Picture of salamander species used in the anesthetization study (pictures by Bill Peterman):
Practical 4: Birthweight data
The data JRHbirthwt.csv for this exercise comes from a recent research
project at the John Radcliffe Hospital. The 17 variables are:
Age
PAPPA
hCG
NT
trisomy
Parity (categorical)
BMI
Smoking (categorical)
Ethnicity (categorical)
Conception
Gestation in weeks (won’t use this)
Gestation in days
Delivery (categorical)
Centile
Birthwt
PET2 (categorical)
G3M (categorical)
The objective is to develop a model to predict birthweight from the other
variables.
I plotted all the marginal distributions and re-coded to eliminate sparse
categories, and then converted the categorical variables to factors:
Ethnicity.new <- 1*(Ethnicity==1)+2*(Ethnicity==2)+1*(Ethnicity>2)
Ethnicity.new <- as.factor( Ethnicity.new)
PET.new <-1*(PET2==1)+2*(PET2>1)
PET.new <- as.factor(PET.new)
G3M.new <- 1*(G3M==1)+2*(G3M>1)
G3M.new <- as.factor(G3M.new)
Smoking <- as.factor(Smoking)
Parity.new <- 0*(Parity==0)+1*(Parity==1)+1*(Parity==2)
Parity.new <- as.factor(Parity.new)
Conception.new <- 1*(Conception==1)+2*(Conception>1)
Conception.new <- as.factor(Conception.new)
Perform a stepwise regression on these variables.
Fit the best resulting model
Obtain the summary and anova
Select your model
Assess assumptions and influential observations
What are your conclusions?
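The requested stepwise regression follows the same pattern as Lecture 4. A minimal sketch, on simulated data since the JRH data cannot be reproduced here (all variable values and effect sizes below are illustrative assumptions):

```r
# Sketch of backward stepwise selection on simulated birthweight data.
library(MASS)                  # for stepAIC
set.seed(5)
n <- 200
Age     <- rnorm(n, 30, 5)
BMI     <- rnorm(n, 25, 4)
Smoking <- factor(sample(0:1, n, replace = TRUE))
noise1  <- rnorm(n)            # a variable with no real effect
Birthwt <- 3200 + 15 * Age - 150 * (Smoking == 1) + rnorm(n, sd = 300)
d <- data.frame(Birthwt, Age, BMI, Smoking, noise1)

full <- lm(Birthwt ~ ., data = d)                       # full model
best <- stepAIC(full, direction = "backward", trace = FALSE)
summary(best)
anova(best)

# Then assess assumptions and influential observations as before:
par(mfrow = c(2, 2)); plot(best)
```

With the real data, the full model would include the recoded factors (Ethnicity.new, PET.new, G3M.new, Smoking, Parity.new, Conception.new) alongside the continuous predictors.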