
Stat 6341, Statistical Computing

Stat 6341 Syllabus

STAT 6341: Numerical Linear Algebra and Statistical Computing

Course Information

Instructor: Dr. Larry P. Ammann
Office hours: Tues. 2:30-3:30pm, others by appt.
Email: [email protected]
Office: FO 2.410C
Phone: (972) 883-2161
Recommended text: Modern Applied Statistics with S, 4th Ed.
Authors: W.N. Venables and B.D. Ripley
Additional resources: Matrix Computations
Authors: G. Golub and C. van Loan

These notes are copyrighted by their author, Larry P. Ammann, and are intended for the use of students currently registered for Stat 6341. They may not be copied or used for any other purpose without permission of the author.

Tentative Schedule

Topics                                                            Chapters
Introduction to the S language and statistical programming in R  VR 1-3
Numerical linear algebra                                         class notes
Simulation                                                       class notes
QR decomposition and least squares regression                    class notes
Data explorations                                                VR 4-5
Statistical models                                               VR 6-7
SVD and multivariate data                                        class notes, VR 11

This course will make use of the statistics programming language R. Pre-compiled binaries for R are freely available for Windows, MacOS, and Linux at
http://cran.r-project.org

An online introduction to R is located athttp://cran.r-project.org/doc/manuals/R-intro.html


The course grade will be based on homework projects using R.

Student Learning Objectives

Understand numerical, statistical, and computational issues associated with the major matrix decompositions (LU, QR, SVD) used for statistical models. Become familiar with developing and performing simulation studies. Understand how to express basic mathematical and statistical problems in a high-level statistical programming language.

Note: the complete syllabus is available here:
http://www.utdallas.edu/~ammann/stat6341_syllabus.pdf


Introduction to R

R is an open-source implementation of the S statistical language developed at Bell Labs. Executables for Windows, Macs, and Linux are freely available at
https://cran.r-project.org/

An extensive collection of articles and books about R in English and other languages can be found on the CRAN site under the Documentation – Contributed link. An extension of R that provides a helpful user interface is RStudio:
https://www.rstudio.com/

Example R scripts

An introduction to some of the features of R:
http://www.utdallas.edu/~ammann/stat6341scripts/crabs.r

Function that returns the variable selection path from forward stepwise regression:
http://www.utdallas.edu/~ammann/stat6341scripts/fstep1.r

Simulation of random variables

There are several uses for simulation:

1. Demonstration of statistical or probabilistic effects.

2. Understanding the responses of structures during their design to ranges of environmental or operational conditions so that their designs can be optimized. Examples of this application include aircraft and ship design and communication networks. This approach allows designers to develop and test products without building prototypes for testing. It also provides opportunities for designers to observe behaviors under extreme conditions that would be impossible or extremely expensive to create for physical prototypes.

3. Simulation of theories for the behavior of physical systems. Simulation results can be compared to observations for validation of theories. Examples include climate models, models for astronomical phenomena such as solar system formation and galactic evolution, and models for chemical reactions that can accelerate development of new drugs or vaccines.

4. Comparison of statistical methods under a variety of conditions and models. This is a common method for the comparison of new methods to existing methods when analytical comparisons are difficult or impossible to obtain.

One of R's strengths is its extensive library of functions for simulation of random variables. This includes the following distributions:


Discrete r.v.’s:

Distribution          Basename
binomial              binom
geometric             geom
hypergeometric        hyper
multinomial           multinom
negative binomial     nbinom
Poisson               pois

Continuous r.v.'s:

Distribution                        Basename
beta                                beta
Cauchy (with location and scale)    cauchy
chi-squared                         chisq
exponential                         exp
gamma                               gamma
F                                   f
log-normal                          lnorm
normal                              norm
t                                   t
uniform                             unif
Weibull                             weibull

Other distributions are available in contributed packages.

For each distribution there are functions in R to generate the cdf, density or pmf, quantiles, and random samples. These functions use the following convention: dname returns the density or probability mass function, pname returns the cdf, qname returns quantiles, and rname returns random samples. For example,

n = 200              # sample size
mu = 100             # mean
sig = 20             # standard deviation
X = rnorm(n,mu,sig)  # simulated sample
X.hist = hist(X,plot=FALSE)
x0 = seq(mu-4*sig,mu+4*sig,length=250)
d0 = dnorm(x0,mu,sig)
y0 = d0*n*unique(diff(X.hist$breaks))  # scale density to the count axis
y.lim = max(c(y0,X.hist$counts))
png("NormalDens.png",width=480,height=480)
plot(X.hist,col="cyan",ylim=c(0,y.lim),xlim=mu+c(-4,4)*sig,main="")
title(paste("Histogram of N(",mu,",",sig,")",sep=""))
mtext("with Density Function",side=3,line=.25)
lines(x0,y0,col="red")
graphics.off()


Random number generators must be initialized before they are used. Ordinarily this is done internally by R. However, there are situations in which a repeatable sequence of generated values is required, for example, when testing code. In this case, we must initialize the RNG by specifying what is referred to as the seed. In R this is accomplished by the function set.seed(seed), where seed is an integer. Calling set.seed with the same seed before using one of the r-functions listed above will reproduce the same set of values.
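A short illustration of this reproducibility:

set.seed(123)
x1 = rnorm(5)
set.seed(123)    # re-initialize with the same seed
x2 = rnorm(5)
identical(x1,x2) # TRUE: same seed, same values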


Additional notes: simulation of the Gamma distribution

The Gamma distribution is a two-parameter family with alternative parameterizations. The density function is

    f(x) = x^(α−1) e^(−x/β) / (β^α Γ(α)),   x > 0, α, β > 0,
         = λ^α x^(α−1) e^(−λx) / Γ(α),      x > 0, α, λ > 0,

where α is called the shape parameter, β is called the scale parameter, and λ = 1/β is called the rate parameter. The expected value and standard deviation of the gamma distribution are

    E(X) = αβ = α/λ,
    SD(X) = √α·β = √α/λ.

It usually is more natural to specify the distribution in terms of its mean and s.d., µ, σ. The relationship between α, β and µ, σ can be inverted to obtain the natural parameters of this distribution in terms of its mean and s.d. That is,

    α = µ²/σ²,   β = σ²/µ.

The plot below shows why α is called the shape parameter.


This plot was generated by the following R code.

n = 100            # sample size
N = 400            # number of x-values
mu = c(.75,1,5)    # means
Gam.col = c("red","ForestGreen","blue")
sig = 1            # common s.d.
alpha = (mu/sig)^2 # shape parameters
beta = (sig^2)/mu  # scale parameters
x = seq(0.05,8,length=N)
y1 = dgamma(x,alpha[1],scale=beta[1])
y2 = dgamma(x,alpha[2],scale=beta[2])
y3 = dgamma(x,alpha[3],scale=beta[3])
Y = cbind(y1,y2,y3)
png("GammaDens.png",width=480,height=480)
plot(x,y1,xlab="",ylab="Density",ylim=range(Y),type="n")
for(k in seq(mu)) {
  lines(x,Y[,k],col=Gam.col[k],lwd=1.5)
}
legend(x[N],max(Y),legend=paste("Mean =",mu),
  lty=1,col=Gam.col,xjust=1)
title("Gamma Densities")
mtext(paste("SD =",sig[1]),side=1,line=2)
graphics.off()

Note that the Chi-square distribution is a special case of the gamma distribution with shape = df/2 and scale = 2 (rate = 1/2).
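A quick numerical check of this relationship:

df = 5
x = 2.3
dchisq(x,df) - dgamma(x,shape=df/2,scale=2) # 0, up to rounding error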

Examples

Demonstration of statistical or probabilistic effects

This example considers how we could illustrate the performance of a confidence interval to estimate a population parameter. First consider large sample confidence intervals for a population proportion.

Simulation goals:
1. Illustrate how the central limit theorem is applicable.
2. Demonstrate that it is the interval that is random, not the parameter. That is, the interval is what changes from sample to sample.
3. Demonstrate how the widths of confidence intervals depend on sample size.
4. Show that the probability the interval contains the population proportion p does not depend on p.
5. Construct scripts for this demonstration that could be adapted easily to other types of confidence intervals.

To satisfy goal 1, we can simulate N random samples from Binomial(n,p), construct confidence intervals for each sample, and determine whether or not each interval contains p. Parameters for this simulation would include

    N, n, p, α

Visualizations for this demonstration include:
1. Display the histogram and quantile-quantile plot for sample proportions of the simulated samples.
2. Display the intervals in a way that shows how they vary from sample to sample and include an indicator for which intervals contain p and which do not. This can be accomplished by plotting vertical bars representing the intervals across the x-axis. Different colors for these bars can be used to differentiate between intervals that contain p and those that do not.

The script to accomplish these goals should begin with specification of the simulation parameters so that they can be modified easily. This is followed by obtaining the simulated samples, constructing the confidence intervals, and then determining which intervals contain p.

### parameters ###
N = 200     # number of simulated samples
n = 400     # sample size
p = .36     # population proportion
alpha = .05 # for 95% confidence interval
conf.int.colors = c("cyan","red") # red-green is not a good choice in case a viewer is color-blind
### simulate samples ###
Xall = rbinom(N,n,p)/n # sample proportions
### construct intervals and store them in a matrix with 2 columns
zval = qnorm(1-alpha/2)
sig = sqrt(p*(1-p)/n)
sighat = sqrt(Xall*(1-Xall)/n)
conf.mat = matrix(0,N,2)
conf.mat[,1] = Xall - zval*sighat
conf.mat[,2] = Xall + zval*sighat
### indicate which do not contain p
ci.out = conf.mat[,1] > p | conf.mat[,2] < p
cnt.out = sum(ci.out) # number of intervals that missed p
conf.col = rep(conf.int.colors[1],N)
conf.col[ci.out] = conf.int.colors[2]
### show histogram and QQ-plot
hist(Xall,col="cyan",main="")
title(paste("Histogram of Sample Proportions\nfrom",N,"Random Samples"))
title(sub=paste("Sample size =",n))
### plot confidence intervals ###
### add each bar sequentially for in-class display
Y.lim = range(conf.mat)
plot(0,0,xlim=c(0,N),ylim=Y.lim,xlab="",ylab="",type="n")
title(paste("Simulation of",paste(round(100*(1-alpha)),"%",sep=""),
    "Confidence Intervals for",paste("Binom(",p,")",sep="")))
mtext(paste("Sample size =",n),line=.25)
for(k in seq(N)) {
  rect(k-1,conf.mat[k,1],k,conf.mat[k,2],col=conf.col[k])
}
mtext(paste(paste(round(100*cnt.out/N,1),"%",sep=""),
    "of these confidence intervals do not contain",p),side=1,line=3)
abline(h=p)
### repeat confidence interval plot using n=1000
### keep same y-limits
n = 1000 # sample size
### simulate samples ###
Xall = rbinom(N,n,p)/n # sample proportions
### construct intervals and store them in a matrix with 2 columns
zval = qnorm(1-alpha/2)
sig = sqrt(p*(1-p)/n)
sighat = sqrt(Xall*(1-Xall)/n)
conf.mat = matrix(0,N,2)
conf.mat[,1] = Xall - zval*sighat
conf.mat[,2] = Xall + zval*sighat
### indicate which do not contain p
ci.out = conf.mat[,1] > p | conf.mat[,2] < p
cnt.out = sum(ci.out) # number of intervals that missed p
conf.col = rep(conf.int.colors[1],N)
conf.col[ci.out] = conf.int.colors[2]
### plot confidence intervals ###
### add each bar sequentially for in-class display
plot(0,0,xlim=c(0,N),ylim=Y.lim,xlab="",ylab="",type="n")
title(paste("Simulation of",paste(round(100*(1-alpha)),"%",sep=""),
    "Confidence Intervals for",paste("Binom(",p,")",sep="")))
mtext(paste("Sample size =",n),line=.25)
for(k in seq(N)) {
  rect(k-1,conf.mat[k,1],k,conf.mat[k,2],col=conf.col[k])
}
mtext(paste(paste(round(100*cnt.out/N,1),"%",sep=""),
    "of these confidence intervals do not contain",p),side=1,line=3)
abline(h=p)

Note that we have used the same code twice, so let's define a function that returns the matrix of confidence limits. Arguments for this function should be parameters of the simulation. Then define a function that plots the intervals. Include arguments for specifying whether or not we want to include histogram and qqnorm plots. These functions are defined in the script:
http://www.utdallas.edu/~ammann/stat6341scripts/simConfInt.r

Examples of its use are in
http://www.utdallas.edu/~ammann/stat6341scripts/confsim.r

Now suppose we wish to extend these functions to include confidence intervals for a mean:
http://www.utdallas.edu/~ammann/stat6341scripts/simConfInt1.r

Examples are in
http://www.utdallas.edu/~ammann/stat6341scripts/confsim1.r

Further extend to include confidence intervals for a s.d.:
http://www.utdallas.edu/~ammann/stat6341scripts/simConfInt2.r

Examples are in
http://www.utdallas.edu/~ammann/stat6341scripts/confsim2.r

Additional examples that simulate heavy-tailed and asymmetric distributions are in
http://www.utdallas.edu/~ammann/stat6341scripts/confsim3.r

Practical Guidelines for Monte Carlo Simulation

Monte Carlo simulation has become an important tool for analyses of models and comparison of statistical methods. This section discusses some practical issues that should be considered when implementing a Monte Carlo study to compare methods.

1. Decide which methods should be compared. These methods should all be applicable to the same data. If a method has tuning parameters, then you must consider how the tuning parameters are determined for a particular data set. In some cases, a tuning parameter may be derived from statistical theory. For example, the MAD estimator for scale is given by

    MAD = k·median{|Xi − MD|},

where MD is the median of the data. The value for the tuning parameter k = 1.4826 is chosen to make this an unbiased estimator of the standard deviation if the data is normally distributed. Another estimator is based on the inter-quartile range, scaled to make it unbiased for normal distributions,

    s_iqr = IQR/1.349.

However, some methods such as predictive models may have tuning parameters that are not determined by theory. In such cases optimal values may be obtained by k-fold cross-validation. This requires a measure of optimality such as sum of squared errors. The sample is randomly partitioned into k groups and a grid of values for the tuning parameter is specified. For each value of the tuning parameter, the model is fit using the data with group j removed and then the fitted model is applied to group j. The measure of optimality is obtained from the fitted values in group j and this is repeated for each group. The optimality measures are then averaged to obtain an estimate of optimality for the corresponding value of the tuning parameter. This is repeated for each value of the tuning parameter and the value is chosen that has the highest optimality. (A sketch of this procedure appears after this list.) Obviously the use of k-fold cross-validation within a large Monte Carlo study will add considerable computation time. An alternative is to treat each value of the tuning parameter as a separate method. This approach makes possible an analysis of the sensitivity of the method to the tuning parameter.

2. Determine what population distributions should be considered for the comparison. In practice it is often the case that populations are not normally distributed, so you must consider what characteristics of populations are important to include as part of the study. Do we only wish to compare methods for normally distributed populations or do we also wish to consider the effects of asymmetry, heavy tails, or contamination on the methods?

3. Use a fixed seed and a small number of replications to begin code writing so the results can be checked for correctness. Once you are satisfied that your code is working correctly, generate data for all replications of a particular population distribution and save the data using the R function save(). All methods should be applied to the same data. If you decide later to include other methods in your comparison, you will be able to apply the new methods to the same data used for all of the previous methods without rerunning the entire Monte Carlo study.

4. In addition to averaging across replications to estimate mean values of the measure of interest, you should also obtain standard errors associated with Monte Carlo estimates. This can be treated as a sample size determination problem. For example, if you are estimating a proportion such as the proportion of times a confidence interval covers a specified parameter value, then the s.e. would be

    se = √(p(1 − p)/N),

where p is the average proportion and N is the number of replications. Examination of this standard error can show whether or not the number of replications is sufficient to compare methods with the required level of accuracy.

5. Try to avoid looping over replications of a particular population distribution if possible. The R functions apply, tapply, sapply, and vapply can be helpful for this purpose.

6. If you must loop over a set of parameter values, it can be helpful to construct functions that perform some of the complex computations within a loop so that memory usage won't increase unnecessarily. R generally releases memory required for those computations when a function returns. Avoid making multiple references to the same elements of an array inside a loop because that causes multiple copies of the entire array to be made, potentially increasing execution times significantly.

7. Think about how you want to display results of the study so you can organize the output of your code efficiently. Consider defining your own functions to give the same list of output components from different methods.
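Below is a minimal sketch of the k-fold cross-validation procedure described in item 1. The functions fit.fun and pred.fun are hypothetical placeholders for a model-fitting routine and its prediction method; this is not part of the course scripts.

kfold.cv = function(x, y, tune.grid, fit.fun, pred.fun, K=10) {
  n = length(y)
  fold = sample(rep(1:K, length=n)) # random partition into K groups
  cv.err = numeric(length(tune.grid))
  for(i in seq(along=tune.grid)) {
    err = numeric(K)
    for(j in 1:K) {
      in.j = (fold == j)
      fit = fit.fun(x[!in.j, , drop=FALSE], y[!in.j], tune.grid[i])
      yhat = pred.fun(fit, x[in.j, , drop=FALSE])
      err[j] = mean((y[in.j] - yhat)^2) # squared-error measure of optimality
    }
    cv.err[i] = mean(err) # average over the K groups
  }
  tune.grid[which.min(cv.err)] # best tuning value (smallest CV error)
}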


Example. Consider the problem of estimating a location parameter of some population. There are a number of methods to obtain confidence intervals, the most commonly used of which is based on the t-distribution. However, it is well-known that the sample mean and s.d. are not robust, so alternatives have been proposed that purport to be more robust. This study will compare 8 methods.

stdT. Standard confidence interval for a population mean,

    X̄ ± qt(1 − α/2, n − 1) s/√n.

IQR. Replace the mean with the median and the s.d. with the scaled IQR. Use same t-value as standard.

MAD. Replace the mean with the median and the s.d. with the scaled MAD. Use same t-value as standard.

Gini. Replace the mean with the median and the s.d. with the scaled sum of Gini differences. Use same t-value as standard.

Gini1. Replace the mean with the median and the s.d. with the scaled lower quartile of Gini differences. Use same t-value as standard.

Huber. Replace the sample mean with the Huber M-estimator of location and the s.d. with the MAD scale. Use same t-value as standard.

Hubers. Replace the sample mean and s.d. with Huber M-estimators of location and scale. Use same t-value as standard.

Wilcoxon. Use the Hodges-Lehmann estimator associated with the Wilcoxon signed rank test to obtain the confidence interval.

Note that all of the methods except stdT, Huber, and Hubers give confidence intervals for the population median, not the mean.
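As an illustration, here is a minimal sketch of the MAD interval (not the course's simConfInt code): the mean is replaced by the median and s by mad(), which applies the k = 1.4826 scaling by default.

ci.mad = function(x, alpha=.05) {
  n = length(x)
  tval = qt(1 - alpha/2, n - 1) # same t-value as the standard interval
  m = median(x)
  s = mad(x)                    # scaled MAD, constant 1.4826 by default
  c(lower = m - tval*s/sqrt(n), upper = m + tval*s/sqrt(n))
}
ci.mad(rnorm(50, mean=10, sd=2))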

The tasks for this project can be partitioned into four parts.

Define parameters and functions. All of the parameters needed for this simulation project are defined in one file:
http://www.utdallas.edu/~ammann/stat6341scripts/Parms.r

All other scripts use these parameters and attributes of the objects defined here. For example, if the number of methods is needed, then use

nmeths = length(meths)

instead of the literal 8. That way any changes or additions to those parameters or functions only need to be made in one location.

To unify the interface to the confidence interval methods, a separate function is defined for each method. These functions all have the same main arguments and the same format for returning a list with four components: mean miscoverage probability, s.e. of mean miscoverage probability, mean interval width, and s.e. of mean interval width. Since some methods are designed to estimate the mean and others the median, these functions include optional arguments for the population mean and median, but a particular method just uses the appropriate location parameter. After running the scripts initially, I decided to add a pseudomedian argument for the Wilcoxon confidence interval since that is what this method estimates for asymmetric distributions. However, this parameter is used as the reference only if the pmed1 argument is not null. I included this option for huber and hubers just for experimentation since those two methods actually give M-estimates of location, not the median or pseudomedian. By organizing the code in this way, I only needed to make these changes in the parameter file and then make any needed modifications to the function call in the other scripts.

This file needs to be sourced just once before using the other scripts for the project. All of the objects defined here, including the function definitions, are saved in the file Data/Parms.RData. Note that this file also sources the script described below that defines a function to generate contaminated normal data. All other scripts access these objects by putting the line

load("Data/Parms.RData")

at the beginning of the other scripts. The subdirectory Data is used to store this parameter file along with all of the data generated for the project.

Generate data. The distributions used for this study are normal, t-distribution with 3 d.f., asymmetrically contaminated normal, symmetrically contaminated normal, and gamma. R has functions defined for the normal, t, and gamma distributions, so to unify the code, I wrote a new function that generates data for contaminated normal. That code is here:
http://www.utdallas.edu/~ammann/stat6341scripts/rnormcontam.r

and is sourced by the Parms.r script. Notes:

• The number of observations that are contaminated in each simulated sample is a random variable, not a fixed number. This corresponds to the probability model

    Y = (1 − P)X + P·Xc,

where P is a Bernoulli r.v. with success probability p and P, X, Xc are independent. This is not the same as replacing a fixed proportion p of observations in each sample. (A sketch of such a generator appears after these notes.)

• The code to identify specific rows in each sample that are replaced by the contaminated values is vectorized, so there is no looping over replications to accomplish this task.

• After writing and using this function, an option was added to use the Gamma distribution as the contaminating distribution. This was done by adding the optional argument useGamma with default value FALSE so that code that used the original function could use the modified function without needing any changes. This script is here:
http://www.utdallas.edu/~ammann/stat6341scripts/rnormcontam1.r
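A minimal sketch of such a generator under the mixture model above; the argument names here are hypothetical and the actual rnormcontam.r may differ.

rnormcontam.sketch = function(N, n, p.cont, mu=0, sig=1, mu.c=0, sig.c=5) {
  X = matrix(rnorm(N*n, mu, sig), n, N)         # clean data, one column per sample
  P = matrix(rbinom(N*n, 1, p.cont), n, N) == 1 # Bernoulli contamination indicators
  X[P] = rnorm(sum(P), mu.c, sig.c)             # vectorized replacement, no loops
  X
}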


All of the generated data objects are saved in the Data subdirectory using self-describing file names generated by the paste() function. These objects are accessed with the load() function. For any Monte Carlo study all methods must be applied to the same data, so if any methods are modified or new methods added, these can be applied without rerunning everything. The code for data generation is here:
http://www.utdallas.edu/~ammann/stat6341scripts/ConfintMCdata.r
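For example, a self-describing file name might be built and used like this (the naming scheme is illustrative, not the one in ConfintMCdata.r):

n = 50; Nrep = 1000
Xall = matrix(rt(n*Nrep, 3), n, Nrep) # Nrep samples from t with 3 d.f.
fname = paste("Data/tdf3_n", n, "_N", Nrep, ".RData", sep="")
save(Xall, file=fname) # saves to "Data/tdf3_n50_N1000.RData"
load(fname)            # later restores Xall under its original name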

Apply methods. To avoid loops over the methods within this script, Parms.r defines a function All.ci for this purpose. That avoids memory buildup since a function releases memory needed within the function once it returns. The script
http://www.utdallas.edu/~ammann/stat6341scripts/ConfintMC.r

performs the tasks needed here. This script collects each output component (mean miscoverage probability, etc.) within separate arrays for each type of population in the study. These output arrays are saved in the Out subdirectory using self-describing file names for each population type. Dimension names are defined for these arrays to make it easier to generate summaries and graphics of the results.

Summarize results. The file
http://www.utdallas.edu/~ammann/stat6341scripts/ConfintMCplot.r

loads the appropriate output arrays and then generates png graphics to display the results. These graphic files are saved in the Graphics subdirectory using self-describing file names. Since the methods will be compared based on miscoverage probability and interval width, plots for both of these are put on one page. Tabular summaries including standard errors also are generated and saved in the Tables subdirectory. This script uses the xtable library to generate these tables in LaTeX so they can be easily incorporated into a report.

Least Squares and the QR Decomposition

Suppose we would like to solve the linear system of equations,

    Ax = b,   A ∈ ℝ^(n×p),

where n > p. Since there are more equations than unknowns, an exact solution does not exist. In this case we look for an approximate solution by minimizing a norm of the difference,

    ‖Ax − b‖.

When the 2-norm is used here, this is referred to as the least squares problem. This section considers the case in which A is full rank. Less than full rank problems are discussed in later sections.

An algebraic expression for the solution can be derived from the normal equations,

    AᵀA x_ls = Aᵀb.

These equations can be solved using the Cholesky decomposition of AᵀA:


1. Compute S = AᵀA.

2. Obtain the Cholesky decomposition of S, S = GGᵀ, where G is lower triangular.

3. Let d = Aᵀb and solve the lower triangular system Gy = d via forward substitution.

4. Solve the upper triangular system Gᵀx_ls = y via backward substitution.

The biggest problem associated with this approach is the increased floating point error introduced in the first step: forming AᵀA roughly squares the condition number of the problem.
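These steps transcribe directly into R (shown for illustration; in practice lm() or qr.solve() would be used):

set.seed(1)
n = 100; p = 3
A = matrix(rnorm(n*p), n, p)
b = rnorm(n)
S = crossprod(A)         # step 1: S = A'A
G = t(chol(S))           # step 2: chol() returns the upper factor, so transpose
d = crossprod(A, b)      # step 3: d = A'b
y = forwardsolve(G, d)   #         solve Gy = d
xls = backsolve(t(G), y) # step 4: solve G'x = y
max(abs(xls - qr.solve(A, b))) # agrees with the QR-based solution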

A more numerically stable approach is provided by the QR decomposition. Let

    A = QR,

where Q is an n × n orthogonal matrix and R is an n × p upper triangular matrix. If A is full rank, then all of the diagonal elements of R are non-zero. This decomposition always exists but it is not necessarily unique. Multiplication of the ith column of Q and the ith row of R by -1 does not change the product. Therefore, the QRD is unique if a requirement is added that the diagonal elements of R must be positive.
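In R the decomposition is computed by qr(); note that the computed factors need not satisfy the positive-diagonal convention:

A = matrix(rnorm(20), 5, 4)
A.qr = qr(A)    # compact QR object
Q = qr.Q(A.qr)  # reduced Q (5 x 4); qr.Q(A.qr, complete=TRUE) gives the full Q
R1 = qr.R(A.qr) # upper triangular factor (4 x 4)
max(abs(Q %*% R1 - A[, A.qr$pivot])) # ~ 1e-15; pivot records any column permutation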

Properties of the QRD. Let

    Q = [Q1 Q2],   Q1 ∈ ℝ^(n×p), Q2 ∈ ℝ^(n×(n−p)).

1. The columns of Q1 form an orthonormal basis for range(A) and the columns of Q2 form an orthonormal basis for range(A)⊥. More generally, let

    Ak = [a1 ··· ak],  Qk = [q1 ··· qk],   1 ≤ k ≤ p,

denote the leading submatrices of A and Q, respectively. Then

    span{a1, ..., ak} = span{q1, ..., qk},   1 ≤ k ≤ p.

2. A = Q1R1, where R1 contains the first p rows of R. Note that the remaining rows of R are 0 and that R1 is upper triangular. This is referred to as the reduced QRD.

3. AᵀA = (Q1R1)ᵀ(Q1R1) = R1ᵀQ1ᵀQ1R1 = R1ᵀR1. Hence, R1ᵀ is the Cholesky factor of the symmetric matrix AᵀA.

4. Q1Q1ᵀ is the projection matrix onto range(A).

5. I − Q1Q1ᵀ is the projection matrix onto range(A)⊥.


To show how these properties generate solutions to least squares problems, let A = QR be the full QRD of A. Use the column partition of Q defined above to obtain

    Qᵀb = [Q1ᵀb; Q2ᵀb] = [c; d].

Qᵀ is an orthogonal matrix, and so

    ‖Ax − b‖₂² = ‖Qᵀ(Ax − b)‖₂²
               = ‖QᵀQRx − Qᵀb‖₂²
               = ‖[R1x; 0] − [c; d]‖₂²
               = ‖R1x − c‖₂² + ‖d‖₂².

Since d does not involve x, this is minimized by solving the square, upper triangular system of equations

    R1x = c = Q1ᵀb.

Note that just the reduced QRD is needed to obtain the LS solution. Fitted values are given by

    b̂ = A x_ls = Q1R1 x_ls = Q1Q1ᵀb,

so fitted values are the projection of b onto range(A). Residuals are given by

    e = b − b̂ = b − Q1Q1ᵀb = (I − Q1Q1ᵀ)b,

which is the projection of b onto range(A)⊥.
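R exposes these formulas directly through qr.coef(), qr.fitted(), and qr.resid():

set.seed(2)
A = matrix(rnorm(30), 10, 3)
b = rnorm(10)
A.qr = qr(A)
xls = qr.coef(A.qr, b)    # solves R1 x = Q1'b by back substitution
bhat = qr.fitted(A.qr, b) # Q1 Q1' b, the projection of b onto range(A)
e = qr.resid(A.qr, b)     # (I - Q1 Q1') b
max(abs(xls - coef(lm(b ~ A - 1)))) # agrees with lm()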

Linear Models in R

The basic function in R to fit least squares regression models is lm(). To illustrate the use of this function, consider the data set
http://www.utdallas.edu/~ammann/stat6341scripts/cars.data

The goal here is to determine how mileage (mpg) depends on displacement, horsepower, and weight for cars with displacement below a specified cut point. Also, cars with diesel engines will be removed.

Cars = read.table("http://www.utdallas.edu/~ammann/stat6341scripts/cars.data",
    header=TRUE,row.names=1)
diesel.ndx = grep("diesel",dimnames(Cars)[[1]])
Cars = Cars[-diesel.ndx,] #remove diesel cars
mpgcut = 200              #parameterize cut point for displacement
Cars1 = Cars[Cars$displacement < mpgcut,]
Cars.lm = lm(mpg ~ displacement + horsepower + weight, data=Cars1)
# residual plots
plot(Cars.lm) #standard diagnostic plots for lm
# redo residuals vs fitted with annotation for this data
plot(fitted(Cars.lm),residuals(Cars.lm),pch=19)
abline(h=0,col="red")
title("Residuals vs Fitted for Cars data")
# same text will be used for several plots so assign it a name
info.txt = paste("Diesel cars removed, Displacement <",mpgcut)
mtext(info.txt,line=.25)
#looks OK
plot(residuals(Cars.lm) ~ displacement, data=Cars1,pch=19)
abline(h=0,col="red")
title("Residuals vs Displacement for Cars data")
mtext(info.txt,line=.25)
#looks OK
plot(residuals(Cars.lm) ~ horsepower, data=Cars1,pch=19)
abline(h=0,col="red")
title("Residuals vs Horsepower for Cars data")
mtext(info.txt,line=.25)
#looks OK
plot(residuals(Cars.lm) ~ weight, data=Cars1,pch=19)
abline(h=0,col="red")
title("Residuals vs Weight for Cars data")
#looks OK
qqnorm(residuals(Cars.lm),main="Normal Q-Q Plot for Cars Data")
qqline(residuals(Cars.lm),col="red")
mtext(info.txt,line=.25)
#looks OK
#print summary of LS fit
print(summary(Cars.lm)) #print() is only needed if using source()
#now consider a quadratic model that just uses horsepower and horsepower^2 as predictors
horsepower2 = Cars1$horsepower^2
Cars2.lm = lm(mpg ~ horsepower + horsepower2, data=Cars1)
#diagnostic plots
plot(Cars2.lm)
#looks good
print(summary(Cars2.lm))
#plot data
plot(mpg ~ horsepower, data=Cars1,pch=19)
#define colors for various lines that will be superimposed on plot
mpg.col = c("blue","ForestGreen","DarkRed")
names(mpg.col) = c("fit","conf","pred")
#obtain fitted values over the range of horsepower values
#construct data frame for predict function
HP = seq(min(Cars1$horsepower),max(Cars1$horsepower),length=400)
HP.data = data.frame(horsepower=HP,horsepower2=HP^2)
mpg.conf = predict(Cars2.lm,newdata=HP.data,interval="confidence")
lines(HP,mpg.conf[,"fit"],col=mpg.col["fit"])
title("MPG vs Horsepower: Quadratic Fit")
#add 95% confidence bands
lines(HP,mpg.conf[,"upr"],col=mpg.col["conf"])
lines(HP,mpg.conf[,"lwr"],col=mpg.col["conf"])
#now add prediction bands
mpg.pred = predict(Cars2.lm,newdata=HP.data,interval="prediction")
lines(HP,mpg.pred[,"upr"],col=mpg.col["pred"])
lines(HP,mpg.pred[,"lwr"],col=mpg.col["pred"])
#note that part of upper prediction band is outside plot
#redo with expanded y-axis limits
mpg.lim = range(c(mpg.pred,Cars1$mpg))
plot(mpg ~ horsepower, data=Cars1,pch=19,ylim=mpg.lim)
lines(HP,mpg.conf[,"fit"],col=mpg.col["fit"])
title("MPG vs Horsepower: Quadratic fit")
mtext(info.txt,line=.25)
lines(HP,mpg.conf[,"upr"],col=mpg.col["conf"])
lines(HP,mpg.conf[,"lwr"],col=mpg.col["conf"])
lines(HP,mpg.pred[,"upr"],col=mpg.col["pred"])
lines(HP,mpg.pred[,"lwr"],col=mpg.col["pred"])
legend(min(Cars1$horsepower),mpg.lim[1],
    legend=c("Fitted","95% Confidence","95% Prediction"),
    lty=1,col=mpg.col,yjust=0)

The lm() function is most conveniently applied with the data contained in a data frame. This data frame is specified in lm() by the argument data=. If transformations are required for some variables, then a new data frame can be constructed that contains the transformed variables along with the needed untransformed variables. Use of the data= argument allows use of some shortcuts for the model formula. For example, if the model contains more than a few variables, then the formula

Y ~ .

is a shortcut for using all variables in the specified data frame except Y as predictor variables in an additive model.

It is critically important that assumptions are checked before using a model; otherwise we are just doing arithmetic, not statistical analysis. The plot() function applied to a linear model object by default produces four plots:


1. residuals vs fitted: checks for nonlinearity and marks potential outliers

2. normal quantile-quantile plot of residuals: checks normality assumption

3. scale-location: checks for homogeneity of variance

4. Cook’s D: checks for observations with high influence on the fit

Models with multiple predictors also should include plots of residuals vs each predictor to check for possible nonlinearity in specific predictors. These are not needed for a polynomial model with no other predictors. In such cases the basic residual plot is sufficient.

Predicted values are obtained by the predict() function. By default, predict() with an lm object as its only argument returns predicted values for the data set used for fitting the model. Optional arguments produce additional information.

The predict() function in R is a generic function that calls a specific function depending on the class attribute of its main argument. For example, class(Cars.lm) returns lm. That instructs R to use the specific function predict.lm. Other classes use other specific functions for predict(). A list of the specific predict functions is given by methods(predict).

Some generic functions include a function with the suffix default. For example, print.default() is defined for the generic print() function. For these generic functions the default method is used if the object's class does not match any of the suffixes for the generic function. This process can be circumvented by using a specific function instead of the generic function. For example,

summary(Cars.lm)     #uses summary.lm() since class(Cars.lm) is lm
summary.aov(Cars.lm) #force use of summary.aov()

A summary of a linear model fit is obtained by the summary() function. This function returns a list whose components include residuals, coefficient estimates, r-squared, and the estimated residual s.d. The helper functions residuals() and coefficients() extract the corresponding components of a fit. These are generic functions with specific functions defined for the appropriate type of model object specified in their argument. To reduce typing, resid() and coef() are the same functions. Applied to a summary object, coef() returns a matrix whose columns include coefficient estimates, their standard errors, t-values, and the corresponding p-values. Those p-values are equivalent to a partial F-test for significance of a parameter when it is added to a model with all other parameters included. In the language of SAS these correspond to Type III tests. However, those p-values should not be used for variable selection. That topic is discussed in more detail below.
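For example:

coef(Cars.lm)          # named vector of coefficient estimates
coef(summary(Cars.lm)) # matrix: Estimate, Std. Error, t value, Pr(>|t|)
resid(Cars.lm)[1:5]    # first few residuals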

To obtain predicted values for a linear model fit using predictor values other than the data used for fitting, it is necessary to define a new data frame that contains variables with the same names as those in the right hand side of the model formula. An example is given in the code above:

HP = seq(min(Cars1$horsepower),max(Cars1$horsepower),length=400)
HP.data = data.frame(horsepower=HP,horsepower2=HP^2)
mpg.conf = predict(Cars2.lm,newdata=HP.data,interval="confidence")


Confidence and prediction bands are obtained with the interval= argument. The default value for that argument is "none". The argument interval="confidence" uses the standard error associated with prediction of the mean response, and the argument interval="prediction" uses the standard error for prediction of an individual response. These arguments return a matrix with columns named fit, lwr, upr.

Factor variables in linear models

If any of the predictor variables are factors (categorical variables), then R adds columns of dummy variables according to how those factors appear in the model formula before the linear model is fit. If all predictors are factors, then the linear model corresponds to analysis of variance. In such cases the fit should be obtained using the aov() function instead of lm(). There is no difference computationally, but summary and plot methods for aov objects differ from those methods for lm objects.

It is important to understand how dummy variables are coded in R in order to understand how to interpret coefficient estimates. By default, dummy variables are created using treatment contrasts. In this case the first level of a factor is treated as if it is the control group. Coefficient estimates are reported for each of the other levels and these represent estimates of the mean deviations from the control group. Note that this coding only affects how coefficients are interpreted; it has no effect on p-values of hypothesis tests.
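The default coding can be inspected directly:

f = factor(c("a","b","c","a","b","c"))
contrasts(f)      # treatment contrasts: first level "a" is the baseline
model.matrix(~ f) # intercept plus dummy columns fb and fc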

If the model includes a factor variable and a numeric variable, then we may wish to include an interaction term. Interaction between a factor and a numeric variable allows for different slopes in the relationship between the response and the numeric variable. For example, the crabs data set includes measurements of several physical attributes of crabs from two different species and from both sexes. For purposes of illustration, the Species and Sex factors will be combined into a single factor with four levels: BM, BF, OM, OF.

crabs = read.table("stat6341scripts/crabs.csv",header=TRUE,sep=",")
SpSex = paste(crabs$Species,crabs$Sex,sep="")
SpSex = factor(SpSex)
#fit model to predict CL based on FL and SpSex with interaction
CL.lm = lm(CL ~ SpSex*FL,data=crabs)
plot(CL.lm) #looks ok
summary(CL.lm)

The interaction terms are shown in the matrix of coefficients by a : that separates the name of the numeric variable and the levels of the factor. Note that the first level, BF, does not appear in that list. Since the default treatment contrasts were used for coding, the estimates for (Intercept) and FL represent the estimated intercept and slope for the BF group. The estimate for SpSexBM represents the estimated difference between the intercepts for BF and BM. The estimate for SpSexBM:FL represents the estimated difference between the slopes for BF and BM. A partial F-test for significance of the interaction term in this model can be obtained by fitting an additive model and using the anova() function.

CL.lm1 = lm(CL ~ SpSex+FL,data=crabs)
anova(CL.lm1,CL.lm)

The additive model assumes that the slopes for the levels of the factor are the same but the intercepts may differ. In this case the interaction term is highly significant. This indicates that the slopes in the linear relationships between the numeric predictor and the response variable differ significantly among the levels of the factor. If a factor*numeric interaction term is not significant, then we may wish to perform a partial F-test to determine whether or not there is a difference among the intercepts. This can be done by fitting a model without the factor and then using the anova() function.

CL.lm0 = lm(CL ~ FL,data=crabs)
anova(CL.lm0,CL.lm1,CL.lm)

Note that we only should consider this test if the interaction term is not significant.

The three models in this example represent nested models applied to the same data, and the anova() function reports the results of the corresponding partial F-tests. A similar situation occurs if we are considering polynomial regression models. Suppose for example X is a numeric predictor and we wish to compare linear, quadratic, and cubic models.

X2 = X^2
X3 = X^3
Y.lm0 = lm(Y ~ 1)           #intercept only
Y.lm1 = lm(Y ~ X)           #linear
Y.lm2 = lm(Y ~ X + X2)      #quadratic
Y.lm3 = lm(Y ~ X + X2 + X3) #cubic
anova(Y.lm3,Y.lm2,Y.lm1,Y.lm0)

Of course, we also should examine diagnostic plots of the final model.

Variable selection

Typically, in studies that have large numbers of predictor variables available, many of those variables may be unrelated to the response among individuals in the population of interest. The problem then is to identify the important variables and fit a reduced model to predict responses. This problem is referred to as variable selection. It will be discussed here in the context of prediction of new observations rather than fitting an existing data set. Variable selection represents an attempt to find an optimal balance between precision and bias. As we have seen, addition of a variable to a model always reduces residual error variance unless the new variable is an exact linear combination of the predictor variables already in the model or it has 0 correlation with the residuals of the current model. Even if a variable is generated randomly, the probability of that happening is essentially 0. The effect of including weakly correlated variables in a model is increased bias when the model is used to predict responses for new observations not in the data set used for fitting. We refer to such situations as over-fitting.

If the number of potential predictor variables is small, then models with each possible subset of predictors could be fit and compared. Obviously residual error variance or r-squared should not be the basis for comparison of models, since criteria based on those measures would always select the largest model. There are two basic approaches to this problem that are used most often: penalized likelihood methods and shrinkage methods.

Penalized likelihood methods subtract from the maximized likelihood function a quantity that is a function of the number of variables in the model. These likelihood penalties are designed to adjust for the increase in bias that would occur if a noise variable is added to the model. One of the earliest such methods is Mallow's Cp statistic, defined by

    Cp_k = RSS_k/s²_e + 2k,

where RSS_k is the residual sum of squares, k is the number of variables in the model, and s²_e is a low-bias estimate of residual error variance that does not depend on k. Typically, the low-bias estimate of residual error variance is obtained from the largest possible model. (Note that Mallow's definition used 2k − n instead of 2k.) Since s²_e is constant with respect to k, this is equivalent to

    Cp*_k = RSS_k + 2k·s²_e.    (1)

An information-based penalty was developed by Akaike and is referred to as Akaike's Information Criterion (AIC). A similar measure introduced by Schwarz is referred to as the Bayes Information Criterion (BIC). These are defined for linear regression by

    AIC_k = n log(RSS_k/n) + 2k,
    BIC_k = n log(RSS_k/n) + k log(n).

Addition of a variable decreases RSS but increases the penalty. The best model is the one with the smallest value of the criterion. Note that the dimension penalties in these criteria do not include precision associated with the model being evaluated, nor do they include model bias.
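These criteria are easy to compute from an lm fit; the helper below works on the same scale as R's extractAIC() and step(), which differs from AIC() by an additive constant that is irrelevant when comparing models fit to the same data:

ic = function(fit) {
  n = length(residuals(fit))
  rss = sum(residuals(fit)^2)   # RSS_k
  k = length(coefficients(fit)) # number of parameters
  c(AIC = n*log(rss/n) + 2*k, BIC = n*log(rss/n) + k*log(n))
}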

When there are more than a few potential predictor variables, it is most efficient to use a forward stepwise approach to the selection of variables. The variable most strongly correlated with the response is selected initially and a linear model is fit. At each step the next variable selected from the remaining variables is the one most strongly correlated with the residuals of the current fit. This is continued until all variables have been added to the model or a predefined stopping criterion has been satisfied. The selection criterion (Cp, AIC, or BIC) is evaluated at each step and the model selected is the one with the minimum value of the criterion.


This process is performed in R with the step() function. By default this function performs forward stepwise regression using AIC for the selection criterion. Steps are terminated when the AIC values of all remaining variables are higher than the AIC of the current model. The BIC criterion is chosen with the argument k=log(n), where n is the sample size. This function is implemented by updating the QR decomposition and so has computational complexity that is the same order of magnitude as the complexity of obtaining the QRD using all predictors.

Shrinkage methods subtract from the likelihood function a penalty that is proportional to a norm of the coefficients. Under standard linear model assumptions maximizing the likelihood corresponds to minimizing the sum of squared residuals, and so the goal is to minimize

    Σ e²ᵢ + c‖β‖,

where c > 0 is a tuning parameter. The idea here is that larger models would have larger values for the norm of the coefficients, so the reduction in RSS associated with a larger model would need to be high enough to offset the increased norm of its coefficients. Coefficients that are essentially 0 in these methods are removed from the model. If the 2-norm is used here, the method is referred to as ridge regression, but this does not ordinarily result in reduction of the number of variables. Use of the 1-norm almost always results in removal of weak variables. This method is referred to as the lasso.

Recall that the reduction in RSS when a variable a is added to a model is given by

    RSS_k − RSS_{k+1} = u²,

where

    u = dᵀY/‖d‖,   d = (I − QQᵀ)a.

Note that d is the projection of a onto the orthogonal complement of range(X). This projection removes all of the linear relationships between a and the variables already in the model. In practice, removal of all of those relationships may cause the stepwise process to follow a sub-optimal path. An alternative algorithm can be defined by taking only a very small step in the direction of the projection onto the orthogonal complement of range(X). This algorithm is referred to as stagewise regression. An efficient implementation of forward stagewise regression, referred to as Least Angle Regression (LARS), is available in the contributed package lars. The authors of that package show that LARS is related to lasso variable selection. This package includes options for forward stagewise, LARS, and lasso. Mallow's Cp statistic is used for variable selection. The lars package is not distributed with R. It must be downloaded and installed from the CRAN site, for example via the Package Installer menu item for R.
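A minimal sketch of its use on simulated data (assuming the lars package has been installed):

library(lars)
set.seed(3)
x = matrix(rnorm(50*5), 50, 5)     # 5 candidate predictors
y = x[,1] + .5*x[,2] + rnorm(50)   # only the first two matter
x.lasso = lars(x, y, type="lasso") # type can also be "lar" or "forward.stagewise"
plot(x.lasso)    # coefficient paths as the penalty is relaxed
summary(x.lasso) # reports Cp at each step to guide selection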

Big Data problems contain large numbers of potential predictor variables, and so variable selection is an integral step in the analysis of such data. For example, identification of genetic biomarkers with genomic data may lead to new drugs and treatments as well as better understanding of disease mechanisms. However, genomic data sets often contain tens of thousands of variables in the form of gene expressions. Further exacerbating that problem is the much smaller sample sizes typically used for such studies. Classical methods for variable selection almost always lead to over-fitting the data by selecting variables that may appear useful for the sample in the study but only add bias to prediction of responses from new individuals from the population. For these reasons, variable selection for Big Data remains an important topic for future research.

Multicollinearity and Principal Components

Multicollinearity in regression refers to a situation in which a subset of the predictor variables is nearly linearly dependent. This implies that the data matrix is nearly rank-deficient. That creates two problems in regression. The first, and more serious, problem is that near rank-deficiency produces a numerically unstable least squares solution. Suppose for example that the first three variables in a data matrix are very close to linearly dependent. That is, there exists a non-zero vector c such that

    c₁X₁ + c₂X₂ + c₃X₃ = ε,   with ‖ε‖ ≈ 0.

This implies that the 3rd diagonal element of R in the QRD of the data matrix will be approximately 0. The backsolve algorithm used to obtain the least squares estimate of the regression coefficients involves division by those diagonal elements, resulting in numerical instability of the solution.

Another problem occurs even if the multicollinearity is not severe enough to cause numerical instability. Suppose the first two variables are highly correlated. This implies that if we know the value of the first variable, we can obtain a fairly accurate estimate of the value of the second variable. However, it still is possible that the second variable would be selected by a variable selection algorithm in spite of this strong correlation. Recall that in the forward stepwise algorithm, after a variable is selected for the model the remaining variables are projected onto the orthogonal complement of the range of the variables in the current model. That corresponds to fitting regression models to predict each of the remaining variables from variables in the current model and then extracting the residuals from those fits. This removes all of the linear relationships the remaining variables have with variables in the current model. However, some of those residuals may have sufficiently strong linear relationships with the response variable to justify addition of the second variable even though it is highly correlated with the first variable selected. The problem occurs when we attempt to interpret the model coefficients.

Ordinarily we interpret regression coefficients separately as slopes. Suppose for example the coefficient for the first variable is 1.5 and the correlation between the first and second variables is 0.9. Then we would say that the response variable increases an average of 1.5 for every unit increase in the first variable. We could interpret the second variable similarly, but the strong correlation between the variables implies that the data would not have observations with high values for the first variable and low values for the second, and vice versa. Therefore, we cannot interpret coefficients of those variables separately.

Example. The sleep data set contains body weight, brain weight, sleep time, and lifespan of 54 mammals. The following code reads this data and constructs a pairs plot with homo sapiens colored red.

Sleep = read.table("http://www.utdallas.edu/~ammann/stat6341scripts/Sleep.data",
    header=TRUE,row.names=1)
n = dim(Sleep)[1]
sleep.color = rep("black",n)
hndx = match("homo sapiens",dimnames(Sleep)[[1]])
sleep.color[hndx] = "red"
par(oma=c(0,0,2,0))
pairs(Sleep,pch=19,col=sleep.color)
mtext("Pairs plot of sleep data",outer=TRUE,line=0.5,cex=1.2)

The compressed small values and stretched out high values for all of these variables except Sleep indicate that those variables have exponential relationships. So we can create a new data frame that contains log-transformed values.

library(MASS)
logSleep = Sleep
logSleep$BodyWgt = log(Sleep$BodyWgt)
logSleep$BrainWgt = log(Sleep$BrainWgt)
logSleep$LifeSpan = log(Sleep$LifeSpan)
pairs(logSleep,pch=19,col=sleep.color)
mtext("Pairs plot of logSleep",outer=TRUE,line=0.5,cex=1.2)
# that looks reasonable so fit linear model to predict LifeSpan
LifeSpan.lm = lm(LifeSpan ~ .,data=logSleep)
plot(LifeSpan.lm,main="Full model: data=logSleep",pch=19,col=sleep.color)
LifeSpan.step = step(LifeSpan.lm,k=log(n)) #backward stepwise selection with BIC
print(summary(LifeSpan.step))
plot(LifeSpan.step,main="Reduced model: data=logSleep",pch=19,col=sleep.color)
#forward stepwise requires the scope argument and starts with the null model
LifeSpan.lm0 = lm(LifeSpan ~ 1,data=logSleep)
LifeSpan.forward = stepAIC(LifeSpan.lm0,direction="forward",
    scope= ~BodyWgt+BrainWgt+Sleep,k=log(n))
#note that the upper formula cannot use . as a shortcut
print(summary(LifeSpan.forward))

R usage. When performing forward stepwise selection, the scope argument must be included. If you wish to start with the intercept-only model as above, then that model is used as the object argument and the scope argument only needs to give the formula for the upper model. If the upper model contains more than a few variables, this formula can be constructed using paste and as.formula as shown below.


ndx = match("LifeSpan",names(logSleep)) # get index for response variable
# don't include the response in the formula!
upper.form = as.formula(paste("~",paste(names(logSleep)[-ndx],collapse="+")))
LifeSpan.forward = stepAIC(LifeSpan.lm0,direction="forward",
    scope= upper.form,k=log(n))

In this case the model selected by forward stepwise is the same as the model obtained by backward selection, but that may not always be the case. We use forward stepwise when we wish to add only important variables, and we use backward stepwise when we want to start with the full model and then remove unimportant variables.

The correlation between log(BodyWgt) and log(BrainWgt) is 0.9587, so 91.9% of the variability in log(BodyWgt) can be explained by a linear relationship between the two variables. The correlation between the residuals of that fit and log(LifeSpan) is -0.1765. This correlation is referred to as a partial correlation, and it is sufficiently strong to overcome the BIC penalty when this variable is added to a model that already contains log(BrainWgt).
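These quantities can be checked directly. A minimal sketch, using the logSleep data frame constructed above and assuming these columns have no missing values:

cor(logSleep$BodyWgt, logSleep$BrainWgt)       # about 0.9587; squaring gives 0.919
# remove the linear relationship of log(BodyWgt) with log(BrainWgt)
r.body = residuals(lm(BodyWgt ~ BrainWgt, data=logSleep))
cor(r.body, logSleep$LifeSpan)                 # partial correlation, about -0.1765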

The correlations among these three variables are all positive:

         BodyWgt BrainWgt LifeSpan
BodyWgt   1.0000   0.9587   0.7070
BrainWgt  0.9587   1.0000   0.7899
LifeSpan  0.7070   0.7899   1.0000

However, the coefficient for BodyWgt is negative. This illustrates that correlation coefficients do not tell the whole story about relationships among more than two variables.

The residual plots showed that the two brown bats were high-residual/high-leverage observations. Bats have highly developed brain functions for processing their high-frequency sonar signals, and their bodies must be extremely light to enable flight. So it would be reasonable to remove them from the data. Also, the Echidna is a monotreme, so it will be removed as well.

bndx = grep("brown bat",dimnames(logSleep)[[1]])
bndx = c(bndx,match("Echidna",dimnames(logSleep)[[1]]))
logSleep1 = logSleep[-bndx,]
n1 = dim(logSleep1)[[1]]
sleep.color1 = sleep.color[-bndx]
LifeSpan1.lm = lm(LifeSpan ~ .,data=logSleep1)
plot(LifeSpan1.lm,main="Full model: data=logSleep with outliers removed",
    col=sleep.color1,pch=19)
LifeSpan1.step = step(LifeSpan1.lm,k=log(n1))
plot(LifeSpan1.step,main="Reduced model: data=logSleep with brown bats removed",
    col=sleep.color1,pch=19)
print(summary(LifeSpan1.step))

If the multicollinearity is severe, we can consider principal components regression. This involves obtaining the SVD of the centered data matrix and using the matrix of right singular vectors (eigenvectors of the covariance matrix) to rotate the centered data matrix. Note that these new variables are uncorrelated and are referred to as principal components. If any of the singular values are close to 0, we can reduce dimensions by removing the corresponding principal components and then use the remaining PCs as predictor variables in regression to predict the response variable. This is equivalent to using the Moore-Penrose generalized inverse of $X^TX$ to obtain the regression estimates instead of the ordinary inverse.
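A minimal sketch of this equivalence with simulated data (not one of the course examples): regression on the retained PCs, rotated back to the original coordinates, matches the solution based on the Moore-Penrose inverse built from the retained singular values.

set.seed(3)
x1 = rnorm(100); x2 = rnorm(100)
X = cbind(x1, x2, x1 + x2 + rnorm(100, sd=1e-8)) # nearly singular data matrix
y = x1 - x2 + rnorm(100)
Xc = scale(X, scale=FALSE)      # center the predictors
yc = y - mean(y)                # center the response
X.svd = svd(Xc)
keep = X.svd$d > max(X.svd$d)*1e-6        # drop PCs with tiny singular values
Z = Xc %*% X.svd$v[, keep]                # retained principal components
b.pcr = X.svd$v[, keep] %*% qr.solve(Z, yc)   # back-rotated PCR coefficients
# Moore-Penrose solution using only the retained singular values:
b.mp = X.svd$v[, keep] %*% ((t(X.svd$u[, keep]) %*% yc)/X.svd$d[keep])
cbind(b.pcr, b.mp)                        # the two solutions agree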

Example. The MASS library contains a data set named Boston. This data set contains 506 observations on 14 variables related to housing prices, one of which is categorical. The last variable, medv, gives the median value of owner-occupied homes and will be treated as the response variable. The script below performs some initial pre-processing of variables to remove categorical and nearly categorical variables, applies a log-transformation to crim, and rescales some variables. An initial fit showed that rm has a quadratic relationship with the response, so rm2 was added to the data matrix. The initial fit also revealed some outliers, which were then removed. The final fit was obtained and backward stepwise selection was performed using the BIC selection criterion.

Next, principal components regression was applied to this data. The details of this process are given here, but most of that work is done within the function prcomp. The data is centered by subtracting column means, and then the SVD of the centered data matrix is obtained. The matrix of right singular vectors is used to rotate the centered data. Note that the s.d.'s of the principal components are the singular values divided by the square root of the sample size minus one.

library(MASS)
Boston1 = Boston[,-c(2,4,12,14)] # remove zn,chas,black,medv
Boston1[,1] = log(Boston[,1])
Boston1[,"tax"] = Boston1[,"tax"]/20 # rescale to rate per 200,000
Boston1[,"age"] = Boston1[,"age"]/10 # rescale age
Boston1[,"nox"] = Boston1[,"nox"]*10 # rescale nox
medianValue = Boston[,"medv"]
# histogram of variable 1 indicates log-transform needed
# fit linear model
MV.lm = lm(medianValue ~ .,data = Boston1)
Boston2 = cbind(Boston1,rm2=Boston1[,"rm"]^2)
MV2.lm = lm(medianValue ~ .,data = Boston2)
bndx = c(365:373) # remove outliers
Boston2a = Boston2[-bndx,]
medianValue2a = medianValue[-bndx]
n = dim(Boston2a)[1]
p = dim(Boston2a)[2]
MV2a.lm = lm(medianValue2a ~ .,data = Boston2a)
summary(MV2a.lm)
b.all = coef(MV2a.lm)[-1]
MV2a.step = stepAIC(MV2a.lm,k=log(n)) # BIC backward stepwise
summary(MV2a.step)


b.step = rep(0,p)
names(b.step) = names(Boston2a)
b = coef(MV2a.step)[-1]
b.step[names(b)] = b
### PC regression
Bc = scale(Boston2a,scale=FALSE) # center only
Bc.svd = svd(Bc)
d = Bc.svd$d
V = Bc.svd$v
Bc.cvar = cumsum(d^2)/sum(d^2)
plot(Bc.cvar,type="b",ylab="",xlab="# of Principal Components",ylim=c(0.5,1))
title("% of Total Variance Explained vs Number of PCs")
Bcr = Bc %*% V # rotated data
print(d/sqrt(n-1)) # s.d.'s of PCs
Bc.pc = prcomp(Boston2a) # principal components: equivalent to the previous computation
###
# fit principal components regression with all PCs
Bcr1 = as.data.frame(Bcr)
names(Bcr1) = paste("PC",seq(dim(Bcr1)[2]),sep="")
MVpc1.lm = lm(medianValue2a ~ .,data=Bcr1)
MVpc1.step = stepAIC(MVpc1.lm,k=log(n)) # BIC backward selection
summary(MVpc1.step)
V1names = names(coef(MVpc1.step)[-1])
vndx = match(V1names,names(Bcr1))
V1 = V[,vndx] # columns selected by stepAIC
b.pcr = as.vector(V1 %*% coef(MVpc1.step)[-1]) # back-rotate PCR coefficients
names(b.pcr) = names(Boston2a)
rbind(b.all,b.step,b.pcr) # compare coefficients
sum(residuals(MV2a.lm)^2) # SSE for full model
sum(residuals(MV2a.step)^2) # SSE for reduced model
sum(residuals(MVpc1.step)^2) # SSE for reduced PCR model
# sum of squared differences between fitted values of the models
sum((fitted(MVpc1.step)-fitted(MV2a.lm))^2)
sum((fitted(MVpc1.step)-fitted(MV2a.step))^2)

Assignments

Homework 1

Due date: September 28
Please email the graphics for problem 3 to me with the subject line stat6341 homework 1. Problems 1 and 2 can be turned in handwritten on paper. If you wish to include graphics in a document such as Word, please save your document as a pdf file and email that, not the Word file.


1. Show that if $A$ is a square matrix and $A^TA$ is nonsingular, then $A$ is nonsingular. Hint: use the SVD.

2. Let $S$ be an $n \times n$ matrix with $S^T = -S$. Show that $I-S$ is nonsingular and that $(I-S)^{-1}(I+S)$ is orthogonal. Hint: first show that $(I-S)^T(I-S)$ is nonsingular. Then show and use the fact that
$$(I+S)(I-S) = (I-S)(I+S).$$

3. The file http://www.utdallas.edu/~ammann/stat6341scripts/OgleSMCV.csv
contains stellar magnitudes (luminosity) and log(period) for a family of variable stars called Cepheid variables in the Small Magellanic Cloud. The first column of this file gives IDs for the Cepheids and so can be used as row names. These variable stars are important to astronomers because the periods of their variability (logPeriod) are directly related to their luminosity. This enables astronomers to estimate the distances of these stars from their periods. Two types of Cepheids are contained in this data set, FU and FO, and these types have slightly different period-luminosity relationships. Note: stellar magnitudes are reversed in the sense that a higher value for magnitude corresponds to a dimmer star.

a. Construct a plot of infra-red luminosity (I) vs logPeriod that uses different colors for the two types of Cepheids. Include an informative title and a legend that shows which color is assigned to which type.

b. Construct a pairs plot of I,V,B,logPeriod that uses different colors for the Cepheid types. Include an informative title.

c. Construct a graphic that shows how infra-red luminosity differs between types. Repeat for visual (V), blue (B), and logPeriod.

Homework 2

Due date: Oct. 17, 2017

1. Suppose that $A$ is a $2 \times 2$ matrix with singular values $\sigma_1 = \sigma_2 = \sigma > 0$.

a. Show that if $z \in \Re^2$ with $\|z\|_2 = 1$, then $\|Az\|_2 = \sigma$.

b. Let $u, v$ be any pair of orthonormal vectors in $\Re^2$. Show that $u, v$ are right singular vectors of $A$.

2. Let $X$ be a full rank $n \times p$ matrix with $n > p$, let $X = UDV^T$ denote the skinny SVD of $X$, and let $X = QR$ denote its skinny QRD. Show that
$$U^TQQ^TU = Q^TUU^TQ = I.$$


3. Let
$$A = \begin{pmatrix}
19 & 7 & 11 & 6 \\
13 & 15 & 20 & 12 \\
20 & 34 & 18 & 9 \\
9 & 37 & 17 & 10 \\
29 & 1 & 30 & 2 \\
20 & 12 & 18 & 9
\end{pmatrix}.$$
Find Householder matrices $H_1, H_2$ such that
$$B = H_1 A H_2$$
satisfies $B[2:6, 1] = 0$ and $B[1, 3:4] = 0$.

4. The file http://www.utdallas.edu/~ammann/stat6341scripts/Sleep.data
contains measurements of body weight, brain weight, sleep time, and life span for a set of mammals.
[a] Construct a regression model to predict LifeSpan based on the other variables. Note: BodyWgt, BrainWgt, LifeSpan are heavily skewed, so a log transformation of those variables would be appropriate. Construct appropriate diagnostic plots to show whether or not the model assumptions are reasonable.
[b] Use backward selection with BIC to select the best model. Use this model to predict LifeSpan for homo sapiens. How does this predicted value compare to the actual life span for homo sapiens?
[c] Since the two brown bats and the Echidna are very different from other mammals, remove those observations from the data and repeat [a] and [b].

Homework 3

Due date: Oct. 26, 2017

1. Use the data in the file http://www.utdallas.edu/~ammann/stat6341scripts/cars.csv
The first column is not data; it should be used for row names. The column labelled origin is coded as follows: 1=US, 2=Europe, 3=Asia. That column should be converted to a factor with levels US, Europe, Asia. The goal here is to construct a model to predict mpg based on displacement, horsepower, weight, acceleration, origin.
[a] Obtain the least squares model and check assumptions.
[b] Use forward stepwise regression with the BIC criterion to select the most important variables for prediction.
[c] Summarize the properties of the final model and interpret its coefficients.
[d] There are several diesel-fueled cars in this data set. Obtain hat values for these cars and discuss their influence on the least squares fit.
[e] Create a new data frame with the diesel cars removed and repeat parts [a] and [b] for this reduced data set. How do the coefficients differ between the final model with all cars and the final model after removing diesel cars?
Note: see the R functions step(), influence.measures(), and grep().
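For part [d], a minimal sketch of the workflow mentioned in the note; the object names cars.df and mpg.lm are hypothetical stand-ins for your data frame and fitted model:

dndx = grep("diesel", rownames(cars.df)) # rows whose names mention diesel
hv = hatvalues(mpg.lm)                   # leverage of each observation
hv[dndx]                                 # compare to the average leverage p/n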

2. Use the data in the file http://www.utdallas.edu/~ammann/stat6341scripts/Smoking.data
This data set gives cigarette consumption and the rates of several types of cancer. The goal is to determine if there is a relationship between cigarette consumption and cancer rate. The variables are:

STATE: state
CIG: cigarette consumption
BLAD: bladder cancer
LUNG: lung cancer
KID: kidney cancer
LEUK: leukemia

Note that STATE should be used as row names, not as a variable. However, there is a built-in data set in R, state.region, that categorizes states into four regions: Northeast, South, North Central, and West. Use this variable as an additional factor. Since the Smoking data includes DC but state.region does not, assign the region for DC to be South, since both Maryland and Virginia are included in that region. This can be done with a lookup table. Note that state.region is a factor, so to add an entry for DC, we first must convert state.region to an ordinary character vector and then combine that vector with the region for DC. Then this new vector must be converted to a factor when it is added to the Smoking data frame.

Region = c(as.vector(state.region),"South")
names(Region) = c(state.abb,"DC")
Smoking$Region = factor(Region[dimnames(Smoking)[[1]]])

[a] Fit models to predict bladder cancer rate based on cigarette consumption and Region. Consider three models: CIG only, CIG+Region, CIG*Region. Use a 5% level of significance for partial-F tests to determine which model to use.
[b] Check the assumptions of the final regression model.
[c] Construct a plot of bladder cancer rate vs CIG. Superimpose lines representing fitted values. If Region is in the model, then use different colors for the different regions and include a legend.


3. The file http://www.utdallas.edu/~ammann/stat6341scripts/DiabetesFull.csv
contains data from a diabetes study. The response variable Y is the last column of this data set. The other variables are potential predictor variables. The goal here is to compare the mean squared prediction error of several potential least squares models.
Model 1: use all variables to predict Y.
Model 2: select variables using backward stepwise selection with the AIC criterion.
Model 3: select variables using backward stepwise selection with the BIC criterion.
Model 4: select variables using forward stepwise selection with the AIC criterion. Start the selection process with the intercept-only model.
Model 5: select variables using forward stepwise selection with the BIC criterion. Start the selection process with the intercept-only model.
Estimate the mean squared prediction error for each model as follows. Treat the first 300 observations as training data and fit the models using just the training data. Use the remaining 142 observations as test data. Obtain predicted values from each model for the test data and then obtain the mean squared prediction error for each model. Discuss the results.
Note. Backward stepwise variable selection is performed in R with the function step(full.lm), where full.lm is the model with all potential predictors included. The code below illustrates how to perform forward stepwise selection. In this code X is a data frame that contains all of the predictor variables.

Y0.lm = lm(Y ~ 1, data=X)
Yall.lm = lm(Y ~ ., data=X)
Ystep.lm = step(Y0.lm,direction="forward",scope=list(lower=Y0.lm, upper=Yall.lm))

The default value for the argument k in this function is 2, which corresponds to the AIC selection criterion. To use BIC, the argument k=log(n) must be included. Note that the argument data=X must be used for the intercept-only model even though the predictor variables are not used in that model. Both models in the scope argument must use the same data frame.
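A minimal sketch of the prediction-error computation for one model, assuming the data frame X above also contains the response column Y and has the 442 rows described in the problem:

train = X[1:300, ]                 # first 300 observations for fitting
test = X[301:442, ]                # remaining 142 observations for testing
fit = lm(Y ~ ., data=train)
pred = predict(fit, newdata=test)
mean((test$Y - pred)^2)            # mean squared prediction error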

Homework 4

1. Data for this problem is in the file www.utdallas.edu/~ammann/stat6341scripts/Tax.csv
This data contains property tax amounts for a sample of houses along with related physical attributes of the houses. The problem is to understand how taxes are determined from the other variables.
a) Fit a regression model to predict Taxes based on the other variables in this data set, check assumptions, and make any transformations if needed. Summarize the model and include diagnostic plots to show that assumptions have been verified.


b) Use BIC to reduce the model to just the important predictor variables and provide a summary of the reduced model.
c) Identify any observations that have studentized residuals in the reduced model with absolute value greater than 2. For those observations compare the actual tax to 95% prediction intervals for their taxes and interpret.
d) Let p denote the number of parameters in the reduced model (including the intercept) and let n denote the number of observations. We consider an observation to be influential if
$$\mathrm{dffits} > 2\sqrt{(p+1)/(n-p-1)}.$$
Are any of the observations in c) influential by this definition? Note that such observations would have high residuals and high influence. Remove those observations, refit the model, and then reduce this model using BIC. How does this model differ from the model in part b?
e) Use the model from part d) to obtain 95% prediction intervals for the taxes of the observations that were removed. How do these prediction intervals compare to the ones obtained in part c? How do the actual taxes of the removed observations compare to the new prediction intervals for their taxes?

2. Use the data in http://www.utdallas.edu/~ammann/stat6341scripts/Temperature1.data
This file contains average January minimum temperatures in degrees F from 1931-1960 for 51 U.S. cities. The Pacific coast cities Los Angeles, San Francisco, Portland, and Seattle were removed since their winter temperatures are controlled mainly by Pacific ocean currents.
a) Construct an informative plot of temperature versus latitude.
b) Fit a model to predict January minimum temperature based on latitude and longitude. Interpret the coefficients of this model.
c) Are the model assumptions reasonable?
d) Is longitude an important predictor? Use a 5% level of significance. If it is not significant, refit the model with just latitude.
e) The latitude of Richardson is 33.0 with a longitude of 96.75. Use your regression model in d) to predict the January minimum temperature for Richardson and obtain a 90% prediction interval for this temperature. Richardson's actual January minimum temperature is 34. How does that compare to the temperatures in the prediction interval?
f) How does Richardson's actual January minimum temperature compare to a 90% confidence interval for the mean temperature of all cities at the same latitude?

3. The file http://www.utdallas.edu/~ammann/stat6341scripts/OgleSMCV.csv
contains stellar magnitudes (luminosity) and log(period) for a family of variable stars called Cepheid variables in the Small Magellanic Cloud. The first column of this file gives IDs for the Cepheids and so can be used as row names. These variable stars are important to astronomers because the periods of their variability (logPeriod) are directly related to their luminosity. This enables astronomers to estimate the distances of these stars from their periods. Two types of Cepheids are contained in this data set, FU and FO, and these types have slightly different period-luminosity relationships. Note: stellar magnitudes are reversed in the sense that a higher value for magnitude corresponds to a dimmer star. Also, BV = B-V and VI = V-I, so those variables should be ignored.
a) Fit a model to predict MV based on I,V,B,logPeriod,Type that includes all two-way interactions between Type and the other variables. Summarize this model and include diagnostic plots to check assumptions.
b) Define as high-residual outliers stars with studentized residuals greater than 3 in absolute value, and define as high-leverage outliers stars with
$$\mathrm{dffits} > 2\sqrt{(p+1)/(n-p-1)},$$
where p is the number of parameters in the model. Remove both high-residual outliers and high-leverage outliers, refit the full model, then reduce the model using BIC. How does this model compare to the original full model?
c) High-residual stars and high-leverage stars may have been misclassified as FO or FU by the automated photometry software used by this study. For each of those stars use the reduced model to obtain the predicted MV based on their values for I,V,B,logPeriod but with Type = FO for all of them. Then repeat but with Type = FU. Obtain the prediction errors using Type = FO and the prediction errors using Type = FU. Reclassify these stars according to which type gives the smaller prediction error. Report the results as a table showing the original type and reclassified type of each of these stars. Summarize these results in a two-way frequency table that gives counts of stars according to their original classification and their new classification.

Simulation Project

The problem considered here is to compare several methods for ANOVA under different conditions.

Method 1: Perform Levene's test for homogeneity of variance. If this test does not reject the null hypothesis, then perform ANOVA. If the F-test does not reject, then say all means are equal. If the F-test rejects, then perform pairwise t-tests with no adjustment of p-values. If Levene's test rejects the null hypothesis, then perform pairwise t-tests with p-values adjusted by method holm. A sketch of this decision logic is given after the list of methods.

Method 2: Same as Method 1 except use p-value adjustment method fdr instead of holm.

Method 3: Perform pairwise t-tests with p-values adjusted by method holm.
Method 4: Perform pairwise t-tests with p-values adjusted by method fdr.
Method 5: Perform the Kruskal-Wallis test (kruskal.test). If this test does not reject, then say all means are equal. If the Kruskal-Wallis test rejects, then perform pairwise rank sum tests (pairwise.wilcox.test) with no adjustment of p-values.
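Here is a sketch of the decision logic for Method 1. It assumes the leveneTest function from the car package; the function name method1 and the significance level alpha are choices made for this sketch, not part of the project specification.

library(car) # for leveneTest
method1 = function(x, g, alpha=0.05) {
  # x: response values, g: group factor
  lev.p = leveneTest(x, g)[1, "Pr(>F)"]
  if (lev.p > alpha) {
    # variances look homogeneous: start with the ANOVA F-test
    aov.p = anova(lm(x ~ g))[1, "Pr(>F)"]
    if (aov.p > alpha) return("all means equal")
    pairwise.t.test(x, g, p.adjust.method="none")
  } else {
    # heterogeneous variances: skip ANOVA, adjust p-values by holm
    pairwise.t.test(x, g, p.adjust.method="holm")
  }
}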


The link to the scripts and data is here:
http://www.utdallas.edu/~ammann/stat6341proj

R Scripts

Scripts and data used in this course are listed here.

Oct. 5, 2017:
http://www.utdallas.edu/~ammann/stat6341scripts/step_example.r

Script for simulation of confidence intervals:
http://www.utdallas.edu/~ammann/stat6341scripts/ConfIntSim.r
