R in 27411: Biological Data Analysis and Chemometrics

Per Bruun Brockhoff
DTU Compute, DK-2800 Lyngby

March 22, 2014

Contents

1 Introduction to R
  1.1 R in 27411
  1.2 Rstudio
  1.3 Getting started
    1.3.1 Console and scripts
    1.3.2 Assignments and vectors
  1.4 Descriptive statistics
  1.5 Import of data
  1.6 Storage of Text and Graphics
  1.7 Scatterplots
2 Introduction day
  2.1 Height-weight example from Lattin Chapter 2
  2.2 Exercise: EDUC scores data from the Lattin book
3 PCA, Principal Component Analysis
  3.1 Example/Exercise: Fisher's Iris Data
  3.2 Exercise: Wine Data (To be presented by Team 1 next time)
4 MLR, Multiple Linear Regression (OLS)
  4.1 Example: Leslie Salt Data
  4.2 Exercise: Prostate Cancer data
5 PCR, Principal Component Regression
  5.1 Example: Leslie Salt Data (again)
  5.2 Exercise: Prostate Cancer data


6 PLSR, Partial Least Squares Regression
  6.1 Motivating example - Leslie Salt again
  6.2 NIR Example: yarn data
  6.3 Exercise: NIR data on Degree of Esterification
7 RR, Ridge and Lasso Regression
  7.1 Example: PAC data from package chemometrics
  7.2 Exercise: Prostate Cancer data and maybe pectin data
8 Discriminant Analysis, LDA, QDA, k-NN, Bayes, PLSDA
  8.1 Example: Iris data
  8.2 Exercise: Wine data

1 Introduction to R

The program R is an open source statistics program that you can download to your own laptop for free. Go to http://mirrors.dotsrc.org/cran/ and select your platform (Windows, Mac, or Linux).

In the course 02441 Applied Statistics and Statistical Software, which runs as a 3-week course in January every year, you will get the chance to work more with R on practical statistical problems - a directly project-oriented way to reach a more advanced level of working with the program (see http://www.imm.dtu.dk/courses/02441, where you can also find references to good (online) textbook material). Also, in the new version of the course 02418 (the substitute for 02416), you will further develop your competences in using R for more advanced statistical analysis.

1.1 R in 27411

In Spring 2014, support for R is offered in course 27411 for the first time, and the material will be updated throughout the course. Within the course, R should be seen as an alternative or supplement to the software Unscrambler. Parts of the course will also use the software NTSYS. Some of the methods in this course are covered by tools in the base R installation, and some are available in various add-on R packages. Not uncommonly, a method will be available in more than one version, as there are by now thousands of R packages available. This R note should be seen more as a help to find a good way through (or into) the tools, help and material already available than as a source of much new tutorial material in itself. So expect mostly "copy-and-paste" from, and links to, other sources! The two main sources of inspiration for us are the two R packages chemometrics and ChemometricsWithR, which in turn are companions to the two books:

• K. Varmuza and P. Filzmoser (2009). Introduction to Multivariate Statistical Analysis in Chemometrics, CRC Press.

• Ron Wehrens (2012). Chemometrics With R: Multivariate Data Analysis in the Natural Sciences and Life Sciences. Springer, Heidelberg.


Both of these books are available as ebooks (which you can access using your DTU ID):

• http://www.crcnetbase.com.globalproxy.cvt.dk/isbn/978-1-4200-5947-2

• http://link.springer.com.globalproxy.cvt.dk/book/10.1007/978-3-642-17841-2/page/1

1.2 Rstudio

RStudio is a free and open source integrated development environment (IDE) for R. You can run it on your desktop (Windows, Mac, or Linux) or even over the web using RStudio Server. It works as (an extended) alternative to running R in the basic way, and it is what will be used in the course. Download it from http://www.rstudio.com/ and follow the installation instructions. To use the software, you only need to open Rstudio (not R itself).

1.3 Getting started

1.3.1 Console and scripts

Once you have opened Rstudio, you will see a number of different windows. One of them is the console. Here you can write commands and execute them by hitting Enter. For instance:

> 2+3
[1] 5

In the console you cannot go back and change previous commands, and neither can you save your work for later. To do this you need to write a script. Go to File → New → R Script. In the script you can write a line and execute it in the console by hitting Ctrl+R (Windows) or Cmd+Enter (Mac). You can also mark several lines and execute them all at the same time.

1.3.2 Assignments and vectors

If you want to assign a value to a variable, you either use = or <-. For instance:

> x = 2
> y <- 3
> x+y
[1] 5

It is often useful to assign a set of values to a variable as a vector. This is done with the function c (short for concatenate).

> x = c(1, 4, 6, 2)
> x
[1] 1 4 6 2

Use the colon : if you need a sequence, e.g. 1 to 10:

> x = 1:10
> x
[1] 1 2 3 4 5 6 7 8 9 10


You can also make a sequence with a specific step size different from 1 with seq(from, to, stepsize):

> x = seq(0, 1, 0.1)
> x
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

If you are in doubt about how to use a certain function, the help page can be opened by typing ? followed by the function name, e.g. ?seq.

1.4 Descriptive statistics

Some useful commands are:

• mean(x) - mean value of the vector x

• var(x) - variance

• sd(x) - standard deviation

• median(x) - median

• quantile(x,p) - finds the p quantile. p can consist of several different values, e.g. quantile(x,c(0.25,0.75))

• boxplot(x) - draws a boxplot

Remember to use the help function if you are in doubt about anything.
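A minimal illustration of these commands on a small made-up vector:

x = c(2, 5, 7, 1, 9, 3)
mean(x)                     # average of the values
var(x)                      # sample variance
sd(x)                       # sample standard deviation
median(x)                   # middle value
quantile(x, c(0.25, 0.75))  # 25% and 75% quantiles
boxplot(x)                  # draws a boxplot of the values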

1.5 Import of data

Data is imported into R by using the read.table function. When importing data from a spreadsheet, it can sometimes be better to export the data from the spreadsheet to comma delimited (CSV) format before importing it into R. (The package RODBC provides a more advanced way of directly importing e.g. Excel files, but we will not go into this here.) Examples will be given below.
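A minimal sketch of such an import, assuming a hypothetical comma-separated file "mydata.csv" with a header row:

# Read a CSV file with variable names in the first line:
mydata = read.table("mydata.csv", header=TRUE, sep=",")
head(mydata)  # inspect the first rows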

1.6 Storage of Text and Graphics

Text from windows in R can be copied into other programs in the usual manner:

• mark the text by holding the left mouse button down and drag it over the desired text.

• open another program (e.g. StarOffice), place the pointer at the desired location and pressthe middle mouse button.

All text in the 'Commands Window' or the 'Report Window' can be stored in a text file by activating the window and choosing 'File' → 'Save As ...'. Graphics can be stored in a graphics file by activating the graphics window and choosing 'File' → 'Save As ...'. It is possible to choose from a range of graphics formats (JPEG is the default). One may also explicitly define graphics devices, e.g. pdf, to write the graphics directly to a PDF file.
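A minimal sketch of the device approach, with a hypothetical output file name:

# Open a PDF device, draw a plot into it, and close the device again:
pdf("myplot.pdf", width=7, height=5)
plot(1:10, (1:10)^2)
dev.off()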


1.7 Scatterplots

Have a look at: http://www.statmethods.net/graphs/scatterplot.html
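A minimal base R example of the kind of plots shown there, using the built-in mtcars data:

# Scatterplot of mileage versus weight with a fitted regression line:
plot(mtcars$wt, mtcars$mpg, xlab="Weight", ylab="Miles per gallon", pch=19)
abline(lm(mpg ~ wt, data=mtcars), col="red")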

2 Introduction day

2.1 Height-weight example from Lattin Chapter 2

# Data from chapter 2:

X1=c(57,58,60,59,61,60,62,61,62,63,62,64,63,65,64,66,67,66,68,69)
X2=c(93,110,99,111,115,122,110,116,122,128,134,117,123,129,135,128,135,148,142,155)

mean(X1)
mean(X2)
sd(X1)
sd(X2)

X_d1=X1-mean(X1)
X_d2=X2-mean(X2)

X_s1=X_d1/sd(X1)
X_s2=X_d2/sd(X2)

# Table 2.1:
Tab21=cbind(X1,X2,X_d1,X_d2,X_s1,X_s2)
Tab21

# Raw data in matrix X and using the scale function
X=cbind(X1,X2)
scale(X,scale=F)
scale(X)
X_s=scale(X)

# Means and standard deviations in columns:
apply(Tab21,2,mean)
apply(Tab21,2,sd)

plot(X1,X2)
plot(X_s1,X_s2)
abline(h=0,v=0)
arrows(0,0,1,1)

# How to find this direction from data:
cor(X)
t(X_s)%*%X_s/19
eigen(cor(X))

# Using notation from Chapter 2:
W=eigen(cor(X))$vectors
W # In PCA: loadings

# The projected values: (Fig. 2.8, Fig. 2.10)
z1=X_s%*%W[,1]
z2=X_s%*%W[,2]

plot(z1,z2,ylim=c(-3,3),xlim=c(-3,3))

hist(z1,xlim=c(-3,3))
hist(z2,xlim=c(-3,3))
var(z1)
var(z2)


# These variances are also found as the so-called
# singular values or eigenvalues of the correlation matrix:
eigen(cor(X))$values

# And the D-matrix is the diagonal of the square roots of these:
D=diag(sqrt(eigen(cor(X))$values))
D # In PCA: the explained variances

# And z1 and z2 can be standardized by these sds:
z_s1=z1/sd(z1)
z_s2=z2/sd(z2)

z_s2

# Or the same could be extracted from the matrices:
Z_s=X_s%*%W%*%solve(D)
Z_s # In PCA: the standardized scores

var(Z_s)

# So we have done the SVD, check:
Z_s%*%D%*%t(W)
X_s

2.2 Exercise: EDUC scores data from the Lattin book

1. Have a look at the R introduction in Appendix 3 of the Varmuza-book: http://www.crcnetbase.com.globalproxy.cvt.dk/doi/pdfplus/10.1201/9781420059496.ax3 (They do not mention Rstudio, but this is still recommended.)

2. Complete Lattin exercise 2.2 by:

(a) If needed: Install R and Rstudio

(b) Start Rstudio

(c) Import the EDUC SCORES data from chapter 2 in Lattin et al. (2003) (Hint: Use the file EDUC_SCORES.txt available under File Sharing in Campusnet)

(d) Use the following code to solve the different steps:

#import:
educ_scores=read.table("EDUC_SCORES.txt",header=TRUE,sep=",",row.names=1)
educ_scores

#Select the relevant columns:
X=educ_scores[,1:3]
X

# a)
?apply
apply(X,2,mean)

# b)
?scale
Xd=scale(X,scale=F)
Xd

# c)
apply(X,2,sd)
Xs=scale(X)
Xs


# d)
?cov
t(Xd)%*%Xd
7*cov(X)

# e)
(t(Xd)%*%Xd)/7
cov(X)

# f)
(t(Xs)%*%Xs)/7
cor(X)

(e) Some additional matrix scatterplotting:

# Scatterplot matrices using pairs:
pairs(X)
Gender=factor(educ_scores[,4])
pairs(X, main = "Scatterplots - Gender grouping",
      col = c("red", "blue")[Gender])

# Scatterplot matrices from the lattice package
library(lattice)
splom(X, groups=Gender)

# 3D scatterplot
library(scatterplot3d)
scatterplot3d(X, main="3D Scatterplot")

# 3D scatterplot with coloring and vertical drop lines
scatterplot3d(X, pch=16, highlight.3d=TRUE,
              type="h", main="3D Scatterplot")

# 3D scatterplot with coloring, vertical lines
# and a regression plane
s3d <- scatterplot3d(X, pch=16, highlight.3d=TRUE,
                     type="h", main="3D Scatterplot")
fit <- lm(X[,3] ~ X[,1]+X[,2])
s3d$plane3d(fit)

# Spinning 3D scatterplot
library(rgl)
plot3d(X, col="red", size=3)

3. Look at the Leslie Salt data (to be used for the MLR day). Import the Leslie Salt data from chapter 3 in Lattin et al. (2003) and do some initial explorative analysis:

#import:
saltdata=read.table("LESLIE_SALT.csv",header=TRUE,sep=";",dec=".")
head(saltdata)
summary(saltdata)
dim(saltdata)

pairs(saltdata)

# Scatterplot matrices from the gclus package
library(gclus)
dta <- saltdata # get data - just put your own data here!
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)


cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
       main="Variables Ordered and Colored by Correlation")

# 8 normal probability plots on the same plot page:
par(mfrow=c(2,4))
for (i in 1:8) qqnorm(saltdata[,i])

# Adding the log-price:
saltdata$logPrice=log(saltdata$Price)

# 9 histograms with color:
par(mfrow=c(3,3))
for (i in 1:9) hist(saltdata[,i],col=i)

attach(saltdata)

# Scatterplot for logPrice, with added fit line:
par(mfrow=c(1,1))
plot(logPrice~Size)
abline(lm(logPrice~Size), col="red")

# Scatterplot for logPrice, identifying the observation numbers:
par(mfrow=c(1,1))
plot(logPrice~Size,type="n")
text(Size,logPrice,labels=row.names(saltdata))
abline(lm(logPrice~Size), col="red")

# For all of them:
par(mfrow=c(2,4))
for (i in 2:8) {
  plot(logPrice~saltdata[,i],type="n",xlab=names(saltdata)[i])
  text(saltdata[,i],logPrice,labels=row.names(saltdata))
  abline(lm(logPrice~saltdata[,i]), col="red")
}

# Saving directly to a png file:
png("logprice_relations.png",width=800,height=600)
par(mfrow=c(2,4))
for (i in 2:8) {
  plot(logPrice~saltdata[,i],type="n",xlab=names(saltdata)[i])
  text(saltdata[,i],logPrice,labels=row.names(saltdata))
  abline(lm(logPrice~saltdata[,i]), col="red")
}
dev.off()

# Correlation (with 2 decimals)
round(cor(saltdata),2)

# Making a table for viewing in a browser - to be included in e.g. Word/Powerpoint
# Go find/use the cortable.html file afterwards
library(xtable)
capture.output(print(xtable(cor(saltdata)),type="html"),file="cortable.html")

3 PCA, Principal Component Analysis

Read Wehrens' book, Chapter 4, pp 43-56, and the Varmuza-book, chapter 3, sections 3.1-3.7. Please note that there are a number of important plots exemplified as part of the iris example:

1. Variance plots ("scree-type" plots)

2. Scores, loadings and biplots (main plots for interpretation of structure)


3. Explained variances for each variable

4. Validation/diagnostics plots:

(a) Leverage and residuals (also called "score distances" and "orthogonal distances", cf. the nice Figure 3.15, page 79 in the Varmuza-book)

(b) The "influence plot": residuals versus leverage

5. Jackknifing/bootstrapping/cross-validating the PCA for various purposes:

(a) Deciding on number of components

(b) Sensitivity/uncertainty investigation of scores and loadings.

3.1 Example/Exercise: Fisher's Iris Data

The exercise can be solved in either NTSYS (SVD), Unscrambler or R. JCF can support in NTSYS/Unscrambler. Sophie can support in R. The Fisher Iris data set is classic, cf.:

• Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7: 179-188.

• Anderson, E. (1935). The irises of the Gaspe Peninsula. Bulletin of the American Iris Society 59: 2-5.

There are 150 objects: 50 Iris setosa, 50 Iris versicolor and 50 Iris virginica. The flowers of these 150 plants have been measured with a ruler. The variables are sepal length (SL), sepal width (SW), petal length (PL) and petal width (PW), all in all only four variables. The original hypothesis was that I. versicolor was a hybrid of the two other species, i.e. I. setosa x I. virginica. I. setosa is diploid, I. virginica is tetraploid, and I. versicolor is hexaploid. Note that we call the data set Fisherout.xls (an Excel file in this case). When you look at the data in the plots later, it is a good idea to only use the first two letters (se, ve, vi). Import the data set into your choice of software. First examine the raw data and check whether there are obvious mistakes. After that one could use other Unscrambler features to examine the statistical properties of the objects and variables, but in this case we go directly to PCA, as this gives a very fine overview of the data and will often show outliers immediately. Perform the PCA with leverage correction and with centering. Examine the four standard plots (score plot, loading plot, influence plot and explained variance plot).

1. How many principal components would you need, and what does the first PC (PC1) describe?

2. What percentage of the variation is described by the first two PCs?

3. Can you find an outlier? If so, do you have an idea how this outlier came about (loadings plot or scores plot)? In R: Do you see a problem in the influence plot? If there is an outlier, in which other plot can you see the problem? If you see severe outliers, remove them from the data and run the PCA again (and answer 1. and 2. again).


4. Does a standardization (autoscaling) give a better model? (Answer 1. and 2. again.)

5. How many PCs are needed to explain 70%, 75% and 90% of the variation in the data?

6. How many PCs can you maximally get in this dataset?

7. Compare the score and the loading plot, and make a biplot. Do any of the variables "tell the same story"?

8. Are any variables more discriminative than others? Are any variables dispensable?

9. Can you see the presupposed classes? Any class overlap?

10. Does the original hypothesis seem to be OK?

The iris data can already be found within R, so no import is needed:

# E.g. set the relevant working directory for you - here the one used by PBB locally:
# setwd("C:/perbb/DTUkemometri/PCAnew")

# Reading the IRIS csv-data, which is a copy of the file uploaded by JCF:
# Note that the Iris data given by JCF is slightly different from the IRIS data available in R:
JCFiris=read.table("Copy of Fisherout.csv",header=T,sep=";",dec=",")

library(ChemometricsWithR)

data(iris)

# Note that the Iris data given by JCF is slightly different from the IRIS data available in R:
summary(iris)
summary(JCFiris)

head(iris)
head(JCFiris)

# Basic PCA on covariances (WITHOUT standardization)
irisPC_without=PCA(scale(iris[,1:4],scale = FALSE))
par(mfrow=c(2,2))
scoreplot(irisPC_without, col = iris$Species)
loadingplot(irisPC_without, show.names = TRUE)
biplot(irisPC_without, score.col = iris$Species)
screeplot(irisPC_without,type="percentage")

# Basic PCA on correlations (WITH standardization)
irisPC=PCA(scale(iris[,1:4]))
par(mfrow=c(2,2))
scoreplot(irisPC, col = iris$Species)
loadingplot(irisPC, show.names = TRUE)
biplot(irisPC, score.col = iris$Species)
screeplot(irisPC,type="percentage")

# Plotting more components:
# Scores:
par(mfrow=c(4,4))
for (i in 1:4) for (j in 1:4) scoreplot(irisPC, col = iris$Species,pc=c(i,j))

# Loadings:
par(mfrow=c(4,4))
for (i in 1:4) for (j in 1:4) loadingplot(irisPC, show.names = TRUE,pc=c(i,j))

# Diagnostics using the chemometrics package
library(chemometrics)


irisPCA=princomp(iris[,1:4], cor = TRUE)

summary(irisPCA)
loadings(irisPCA)
#scores(irisPCA)

par(mfrow=c(2,2))
plot(irisPCA) # shows a screeplot
biplot(irisPCA)

# How to plot residuals versus leverage?
# The score distances res$SDist express the leverage values
# The orthogonal distances express the residuals
res=pcaDiagplot(iris[,1:4],irisPCA)
pcaVarexpl(iris[,1:4],a=2)

plot(res$SDist,res$ODist,type="n")
text(res$SDist,res$ODist,labels=row.names(iris))

# Influence plot: residuals versus leverage for different numbers of components:
par(mfrow=c(2,2))
for (i in 1:4) {
  res=pcaDiagplot(iris[,1:4],a=i,irisPCA,plot=FALSE)
  plot(res$SDist,res$ODist,type="n")
  text(res$SDist,res$ODist,labels=row.names(iris))
}

# Doing something similar to "jackknifing of the loadings":
# Leaving out a certain number of the observations and plotting the loadings for each subset of data
# Random samples of a certain proportion of the original number of observations are left out
par(mfrow=c(2,2))

n=length(iris[,1])

leave_out_size=0.50
for (k in 1:4){
  irisPC=PCA(scale(iris[sample(1:n,round(n*(1-leave_out_size))),1:4]))
  loadingplot(irisPC, show.names = TRUE)
}

3.2 Exercise: Wine Data (To be presented by Team 1 next time)

The exercise can be solved in either NTSYS, Unscrambler or R. JCF can support in NTSYS/Unscrambler. Sophie can support in R. The second dataset is called VIN2:

• Forina, M., Armanino, C., Castino, M. and Ubigli, M. (1986). Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25: 189-201.

• Forina, M., Lanteri, S., Armanino, C., Casolino, C. and Casale, M. (2010). V-PARVUS. An extendable package of programs for data exploration, classification, and correlation. (www.parvus.unige.it)

The dataset VIN2.csv is an Excel CSV file. In this dataset there are 178 objects (Italian wines): the first 59 are Barolo wines (B1-B59), the next 71 are Grignolino wines (G60-G130), and the last 48 are Barbera wines (S131-S178). These wines have been characterized by 13 variables (chemical and physical measurements):


1. Alcohol (in %)

2. Malic acid

3. Ash

4. Alkalinity of Ash

5. Magnesium

6. Total phenols

7. Flavanoids

8. Nonflavanoid phenols

9. Proanthocyanins

10. Colour intensity

11. Colour hue

12. OD280 / OD315 of diluted wines

13. Proline (amino acid)

Questions:

1. Examine the raw data. Are there any severe outliers you can detect? What do you think happened with the outlier, if any?

2. Correct wrong data, if any (in the Excel file), and use PCA again. Do the score and loading plots look significantly different now?

3. Try PCA without standardization: Which variables are important here and why?

4. Try PCA with standardization. Which variables are important here, and would you recommend removing any of them from the data set? Which variables are especially important for the Barbera wines?

5. Suppose that alcohol % and proanthocyanins were especially healthy; which wine would you recommend?

6. (IF using Unscrambler) Use the Unscrambler jack-knife method to test for significance of the variables (run the PCA again, with jack-knifing). Are all the variables stable? (In R: Think about how to do something similar - see e.g. the example for the iris data where the loadings are plotted for different subsets of data, and the sketch after the code below.)

The wine data can already be found within R, so no import is needed:


# Wines data:

# From the JCF uploaded file:
# Also slightly different from the version in the package
JCFwines=read.table("VIN2.csv",header=T,sep=";",dec=",")

# The wines data from the package:
# The wine class information is here stored in the wine.classes object
data(wines, package = "ChemometricsWithRData")
head(wines)
head(JCFwines)

summary(wines)
summary(JCFwines)

# Basic PCA
wines.PC <- PCA(scale(wines))
par(mfrow=c(2,2))
scoreplot(wines.PC, col = wine.classes, pch = wine.classes)
loadingplot(wines.PC, show.names = TRUE)
biplot(wines.PC, score.col = wine.classes)
screeplot(wines.PC,type="percentage")
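For question 6 in R, a minimal sketch of the subset-loadings idea from the iris example, applied to the wines data (assuming the wines object loaded above):

# Loadings for random 50% subsets of the wines - stable patterns suggest stable variables:
par(mfrow=c(2,2))
n=nrow(wines)
for (k in 1:4){
  wines.sub.PC <- PCA(scale(wines[sample(1:n,round(n/2)),]))
  loadingplot(wines.sub.PC, show.names = TRUE)
}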

4 MLR, Multiple Linear Regression (OLS)

• Check Wehrens' book, Chapter 8, pp 145-148,

• and the Varmuza-book, chapter 4, sections 4.3.1-4.3.2.

• And some basics on Regression in R: http://www.statmethods.net/stats/regression.html

• And Regression diagnostics in R: http://www.statmethods.net/stats/regression.html

4.1 Example: Leslie Salt Data

# Regression part
#import:
saltdata=read.table("LESLIE_SALT.csv",header=TRUE,sep=";",dec=".")
# Re-create the log-price used as response below:
saltdata$logPrice=log(saltdata$Price)

# Simple regression
lm1=lm(logPrice~Distance,data=saltdata)
summary(lm1)

# Full MLR model:
lm2=lm(logPrice~County+Size+Elevation+Sewer+Date+Flood+Distance,data=saltdata)
summary(lm2)
step(lm2,direction="backward")

lm3=lm(formula = logPrice ~ Elevation + Sewer + Date + Flood + Distance, data = saltdata)
summary(lm3)

# Plot fitted vs. observed - identifying the observation numbers:
par(mfrow=c(1,1))
plot(saltdata$logPrice,lm3$fitted,type="n")
text(saltdata$logPrice,lm3$fitted,labels=row.names(saltdata))

# Regression diagnostics plots given automatically, either 4:
par(mfrow=c(2,2))


plot(lm3)

# Or 6:
par(mfrow=c(3,2))
plot(lm3,1:6)

# Plot residuals versus individual xs:
par(mfrow=c(2,3))
for (i in 4:8) {
  plot(lm3$residuals~saltdata[,i],type="n",xlab=names(saltdata)[i])
  text(saltdata[,i],lm3$residuals,labels=row.names(saltdata))
  lines(lowess(saltdata[,i],lm3$residuals),col="blue")
}

# Check for potential quadratic effects and/or interactions:
# Here just two examples of such:
saltdata$Distance_square=saltdata$Distance^2
saltdata$County_Elevation=saltdata$County*saltdata$Elevation

lm4=lm(formula = logPrice ~ Elevation + Sewer + Date + Flood + Distance
       + Distance_square + County_Elevation, data = saltdata)
summary(lm4)

# How to remove a potential outlier and re-analyze:
# E.g. let's try without observations no 2 and 10:
# Remove and copy-paste:
saltdata_red=saltdata[-c(2,10),]
# Check:
dim(saltdata)
dim(saltdata_red)
row.names(saltdata_red)

lm3_red=lm(formula = logPrice ~ Elevation + Sewer + Date + Flood + Distance, data = saltdata_red)
summary(lm3_red)
par(mfrow=c(3,2))
plot(lm3_red,1:6)

4.2 Exercise: Prostate Cancer data

Analyze the Prostate data from the following book website: http://statweb.stanford.edu/~tibs/ElemStatLearn/ (check pp. 47-48 of Hastie et al. for a short description of the data). (Use the uploaded version of the data file in Campusnet to avoid import problems.)

prostate=read.table("prostatedata.txt",header=TRUE,sep=";",dec=",")
head(prostate)
summary(prostate)
dim(prostate)
pairs(prostate)

Try to answer the following questions:

1. What are the pair-wise relations between lpsa and the other variables – any indications of important relations?

2. Are there any clearly non-normal (e.g. skew) distributions among the variables?

3. Run the 8-variable MLR analysis and try to reduce the model by removing the most non-significant variables one by one – what is the final model?

4. Interpret the parameters of the final model – compare with the investigation in 1.


5. What is the estimate (and interpretation) of the residual standard deviation?

6. Investigate the validity/assumptions of the final model:

(a) Residual checks

(b) Influential/outlying observations

(c) Any additional model structure? (Non-linearities, interactions?) (Diagnostics plots and/or model extensions; see also the sketch below.)
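A minimal sketch for questions 3 and 6, assuming the standard column names of the prostate data (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45 as predictors and lpsa as response; any train/test indicator column should be left out of the model):

# Full 8-variable model, then automatic backward elimination:
lm_full=lm(lpsa ~ lcavol+lweight+age+lbph+svi+lcp+gleason+pgg45, data=prostate)
summary(lm_full)
step(lm_full, direction="backward")

# Standard diagnostics plots for the chosen model:
par(mfrow=c(2,2))
plot(lm_full)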

5 PCR, Principal Component Regression

PCR and the other biased regression methods presented in this course (PLS, Ridge and Lasso) are all, together with even more methods (e.g. MLR=OLS), introduced in each of the three books:

• The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Second Edition, February 2009, Trevor Hastie, Robert Tibshirani, Jerome Friedman.

• Ron Wehrens (2012). Chemometrics With R: Multivariate Data Analysis in the Natural Sciences and Life Sciences. Springer, Heidelberg. (Chapters 8 and 9)

• K. Varmuza and P. Filzmoser (2009). Introduction to Multivariate Statistical Analysis in Chemometrics, CRC Press. (Chapter 4)

The latter two are directly linked with R packages, and here we will most directly use the last of them. These things are NOT covered by the Lattin book, so we give here a reading list for the most relevant parts of chapter 4, when it comes to syllabus content for course 27411:

• Section 4.1 (4p) (Concepts - ALL models)

• Section 4.2.1-4.2.3 (6p) (Errors in (ALL) models)

• Section 4.2.5-4.2.6 (3.5p) (CV + bootstrap - ALL models)

• [Section 4.3.1-4.3.2.1 (9.5p) (Simple regression and MLR (=OLS))]

• [Section 4.5.3 (1p) (Stepwise Variable Selection in MLR)]

• Section 4.6 (2p) (PCR)

• Section 4.7.1 (3.5p) (PLS)

• Section 4.8.2 (2.5p) (Ridge and Lasso)

• Section 4.9-4.9.1.2 (5p) (An example using PCR and PLS)

• Section 4.9.1.4-4.9.1.5 (5p) (An example using Ridge and Lasso)

• Section 4.10 (2p) (Summary)


5.1 Example: Leslie Salt Data (again)

Below I have included R code primarily based on the pls package. It shows how one can work through all the steps of a PCR analysis of a data set.

# E.g.: Set your own working directory - on the perbb computer it is:
setwd("C:/perbb/DTUkemometri/R")

# Example: using the Salt data:
saltdata=read.table("LESLIE_SALT.csv",header=TRUE,sep=";",dec=".")
saltdata$logPrice=log(saltdata$Price)

# Define the X-matrix as a matrix in the data frame:
saltdata$X=as.matrix(saltdata[,2:8])

# First of all we consider a random selection of 4 properties as a TEST set
saltdata$train=TRUE
saltdata$train[sample(1:length(saltdata$train),4)]=FALSE

saltdata_TEST=saltdata[saltdata$train==FALSE,]
saltdata_TRAIN=saltdata[saltdata$train==TRUE,]

# Now all the work is performed on the TRAIN data set

# First: 1) Explore the data - we already did this previously, so no more of that here

# Next: 2) Model the data
# Run the PCR with a maximal/large number of components using the pls package
library(pls)
mod <- pcr(logPrice ~ X, ncomp = 7, data = saltdata_TRAIN, validation="LOO", scale=TRUE,
           jackknife = TRUE)

# Initial set of plots:
par(mfrow=c(2,2))
plot(mod,labels=rownames(saltdata_TRAIN),which="validation")
plot(mod,"validation",estimate=c("train","CV"),legendpos = "topright")
plot(mod,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")
scoreplot(mod,labels=rownames(saltdata_TRAIN))

# Choice of components:

# What would segmented CV give:
mod_segCV <- pcr(logPrice ~ X, ncomp = 7, data = saltdata_TRAIN, scale=TRUE, validation="CV",
                 segments = 5, segment.type = c("random"), jackknife = TRUE)

# Initial set of plots:
plot(mod_segCV,"validation",estimate=c("train","CV"),legendpos = "topright")
plot(mod_segCV,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")

# Let us look at some more components:

# Scores:
scoreplot(mod, comps = 1:4,labels=rownames(saltdata_TRAIN))

# Loadings:
loadingplot(mod,comps = 1:4,scatter=TRUE,labels=names(saltdata_TRAIN))

# We choose 4 components
mod4 <- pcr(logPrice ~ X, ncomp = 4, data = saltdata_TRAIN, validation="LOO", scale=TRUE,
            jackknife = TRUE)

# Then 3) Validate:
# Let's validate some more, using 4 components


# We take the predicted values, and hence the residuals, from the predplot function
# Hence these are the (CV) VALIDATED versions!

par(mfrow=c(2,2))
k=4
obsfit=predplot(mod4,labels=rownames(saltdata_TRAIN),which="validation")
Residuals=obsfit[,1]-obsfit[,2]
plot(obsfit[,2],Residuals,type="n",main=k,xlab="Fitted",ylab="Residuals")
text(obsfit[,2],Residuals,labels=rownames(saltdata_TRAIN))

qqnorm(Residuals)

# To plot residuals against X-leverage, we need to find the X-leverage:
# AND then find the leverage values as the diagonal of the Hat matrix:
# Based on fitted X-values:
Xf=fitted(mod)[,,k]
H=Xf%*%solve(t(Xf)%*%Xf)%*%t(Xf)
leverage=diag(H)

plot(leverage,Residuals,type="n",main=k)
text(leverage,Residuals,labels=rownames(saltdata_TRAIN))

# This set of four plots could be reproduced for other values of k
# One could also redo the analysis leaving out certain observations

# Let's also plot the residuals versus each input X:
par(mfrow=c(3,3))
for (i in 2:8){
  plot(Residuals~saltdata_TRAIN[,i],type="n",xlab=names(saltdata_TRAIN)[i])
  text(saltdata_TRAIN[,i],Residuals,labels=row.names(saltdata_TRAIN))
  lines(lowess(saltdata_TRAIN[,i],Residuals),col="blue")
}

# Now let’s look at the results - 4) "interpret/conclude"

par(mfrow=c(2,2))
# Plot coefficients with uncertainty from jackknife:
obsfit=predplot(mod4,labels=rownames(saltdata_TRAIN),which="validation")
abline(lm(obsfit[,2]~obsfit[,1]))
plot(mod,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")

coefplot(mod4,se.whiskers=TRUE,labels=prednames(mod4),cex.axis=0.5)
biplot(mod4)

# And then finally some output numbers:
jack.test(mod4,ncomp=4)

# And now let’s try to 5) predict the 4 data points from the TEST set:

preds=predict(mod4,newdata=saltdata_TEST,comps=4)
plot(saltdata_TEST$logPrice,preds)
rmsep=sqrt(mean((saltdata_TEST$logPrice-preds)^2))
rmsep

5.2 Exercise: Prostate Cancer data

Use the Prostate data also used for the MLR exercises. Make sure to support all the analysis by a lot of plotting - try to "play around" with plotting results with/without extreme observations. Also try to look at results for different choices of the number of components (to get a feeling for the consequence of that). The R code (including comments) for the Leslie Salt data above can be used as a template for a good approach; a sketch of the first steps is also given after the question list below.

1. Define test sets and training sets according to the last column in the data set.

2. Do PCR on the training set – use cross-validation.

3. Go through ALL the relevant steps:

(a) model selection

(b) validation

(c) interpretation

(d) etc.

4. Predict the test set lpsa values.

5. What is the average prediction error for the test set?

(a) Compare with the cross-validation error in the training set - try both "LOO" and a segmented version of CV

6. Compare with an MLR prediction using ALL predictors
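A minimal sketch of steps 1-2, assuming the prostate data file from Campusnet with the standard column layout (eight predictors, then lpsa, then a train indicator column named train as the last column):

prostate=read.table("prostatedata.txt",header=TRUE,sep=";",dec=",")
prostate$X=as.matrix(prostate[,1:8])  # the eight predictors
prostate_TRAIN=prostate[prostate$train==TRUE,]
prostate_TEST=prostate[prostate$train==FALSE,]

library(pls)
mod_pcr <- pcr(lpsa ~ X, ncomp = 8, data = prostate_TRAIN, validation="LOO", scale=TRUE,
               jackknife = TRUE)
plot(mod_pcr,"validation",estimate=c("train","CV"),legendpos = "topright")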

6 PLSR, Partial Least Squares Regression

6.1 Motivating example - Leslie Salt again

# E.g.: Set your own working directory - on the perbb computer it is:
setwd("C:/perbb/DTUkemometri/R")

# Example: using the Salt data:
saltdata=read.table("LESLIE_SALT.csv",header=TRUE,sep=";",dec=".")
saltdata$logPrice=log(saltdata$Price)

# Define the X-matrix as a matrix in the data frame:
saltdata$X=as.matrix(saltdata[,2:8])

# First of all we consider a random selection of 4 properties as a TEST set
saltdata$train=TRUE
saltdata$train[sample(1:length(saltdata$train),4)]=FALSE

saltdata_TEST=saltdata[saltdata$train==FALSE,]
saltdata_TRAIN=saltdata[saltdata$train==TRUE,]

# Now all the work is performed on the TRAIN data set

# First: 1) Explore the data - we already did this previously, so no more of that here

# Next: 2) Model the data
# Run the PCR with a maximal/large number of components using the pls package
library(pls)
mod_pcr <- pcr(logPrice ~ X, ncomp = 7, data = saltdata_TRAIN, validation="LOO", scale=TRUE,
               jackknife = TRUE)

# Initial set of plots:


par(mfrow=c(2,2))
plot(mod_pcr,labels=rownames(saltdata_TRAIN),which="validation")
plot(mod_pcr,"validation",estimate=c("train","CV"),legendpos = "topright")
plot(mod_pcr,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")
scoreplot(mod_pcr,labels=rownames(saltdata_TRAIN))

# Now with PLS:
mod_pls <- plsr(logPrice ~ X, ncomp = 7, data = saltdata_TRAIN, validation="LOO", scale=TRUE,
                jackknife = TRUE)

# Initial set of plots:
par(mfrow=c(2,2))
plot(mod_pls,labels=rownames(saltdata_TRAIN),which="validation")
plot(mod_pls,"validation",estimate=c("train","CV"),legendpos = "topright")
plot(mod_pls,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")
scoreplot(mod_pls,labels=rownames(saltdata_TRAIN))

# Let us look at the coefficients of the final models:
# Plot coefficients with uncertainty from jackknife:
par(mfrow=c(2,2))
coefplot(mod_pcr,ncomp=1,se.whiskers=TRUE,labels=prednames(mod_pcr),cex.axis=0.5,main="logPrice, PCR with 1")
coefplot(mod_pcr,ncomp=2,se.whiskers=TRUE,labels=prednames(mod_pcr),cex.axis=0.5,main="logPrice, PCR with 2")
coefplot(mod_pcr,ncomp=4,se.whiskers=TRUE,labels=prednames(mod_pcr),cex.axis=0.5,main="logPrice, PCR with 4")
coefplot(mod_pls,ncomp=2,se.whiskers=TRUE,labels=prednames(mod_pls),cex.axis=0.5,main="logPrice, PLS with 2")

6.2 NIR Example: yarn data

library(chemometrics)

data(yarn)
#summary(yarn)
?yarn
str(yarn)

# Plotting of the 21 individual NIR spectra:
max_X=max(yarn$NIR)
min_X=min(yarn$NIR)
par(mfrow=c(1,1))
plot(yarn$NIR[1,],type="n",ylim=c(min_X,max_X))
for (i in 1:21) lines(yarn$NIR[i,],col=i)

# Plot of y:
plot(yarn$density,type="n")
lines(yarn$density)

yarn_TEST=yarn[yarn$train==FALSE,]
yarn_TRAIN=yarn[yarn$train==TRUE,]

# PLS:
mod_pls <- plsr(density ~ NIR, ncomp = 10, data = yarn_TRAIN, validation="LOO", scale=TRUE,
                jackknife = TRUE)

# Initial set of plots:
par(mfrow=c(2,2))
plot(mod_pls,labels=rownames(yarn_TRAIN),which="validation")
plot(mod_pls,"validation",estimate=c("train","CV"),legendpos = "topright")
plot(mod_pls,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")
scoreplot(mod_pls,labels=rownames(yarn_TRAIN))

# Choice of components:

# What would segmented CV give:
mod_segCV <- plsr(density ~ NIR, ncomp = 10, data = yarn_TRAIN, scale=TRUE, validation="CV",
                  segments = 5, segment.type = c("random"), jackknife = TRUE)


# Initial set of plots:
par(mfrow=c(2,1))
plot(mod_segCV,"validation",estimate=c("train","CV"),legendpos = "topright")
plot(mod_segCV,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")

# Let us look at some more components:

# Scores:
scoreplot(mod_pls, comps = 1:4,labels=rownames(yarn_TRAIN))

# Loadings:
loadingplot(mod_pls,comps = 1:4,scatter=TRUE,labels=prednames(mod_pls))

# Loadings, one by one - the spectral view:
par(mfrow=c(2,2))
loadingplot(mod_pls,comps = 1,scatter=FALSE,labels=prednames(mod_pls))
loadingplot(mod_pls,comps = 2,scatter=FALSE,labels=prednames(mod_pls))
loadingplot(mod_pls,comps = 3,scatter=FALSE,labels=prednames(mod_pls))
loadingplot(mod_pls,comps = 4,scatter=FALSE,labels=prednames(mod_pls))

# We choose 4 components
mod4 <- plsr(density ~ NIR, ncomp = 4, data = yarn_TRAIN, validation="LOO", scale=TRUE,
             jackknife = TRUE)

# Then 3) Validate:
# Let's validate some more, using 4 components
# We take the predicted values, and hence the residuals, from the predplot function
# Hence these are the (CV) VALIDATED versions!

par(mfrow=c(2,2))
obsfit=predplot(mod4,labels=rownames(yarn_TRAIN),which="validation")
Residuals=obsfit[,1]-obsfit[,2]
plot(obsfit[,2],Residuals,type="n",xlab="Fitted",ylab="Residuals")
text(obsfit[,2],Residuals,labels=rownames(yarn_TRAIN))

qqnorm(Residuals)

# To plot residuals against X-leverage, we need to find the X-leverage:
# AND then find the leverage values as the diagonal of the Hat matrix:
# Based on fitted X-values:
k=4  # number of components used
Xf=scores(mod4)
H=Xf%*%solve(t(Xf)%*%Xf)%*%t(Xf)
leverage=diag(H)

plot(leverage,abs(Residuals),type="n",main=k)
text(leverage,abs(Residuals),labels=rownames(yarn_TRAIN))

# This set of four plots could be reproduced for other values of k

# One could also redo the analysis leaving out certain observations

# Now let’s look at the results - 4) "interpret/conclude"

par(mfrow=c(2,2))
# Plot coefficients with uncertainty from jackknife:
obsfit=predplot(mod4,labels=rownames(yarn_TRAIN),which="validation")
abline(lm(obsfit[,2]~obsfit[,1]))
plot(mod_pls,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")

coefplot(mod4,se.whiskers=TRUE,labels=prednames(mod4),cex.axis=0.5)
biplot(mod4)

# And now let’s try to 5) predict the 4 data points from the TEST set:

preds=predict(mod4,newdata=yarn_TEST,ncomp=4)


plot(yarn_TEST$density,preds)
rmsep=sqrt(mean((yarn_TEST$density-preds)^2))
rmsep

6.3 Exercise: NIR data on Degree of Esterification

Use the Pectin data (to be downloaded from File Sharing in CN) relating the degree of esterification (DE) of pectins to their NIR spectra as obtained by the diffuse reflectance technique. The data file contains data for 70 different granulated pectin samples with 1 Y-variable (DE) and 1039 regressor X-variables in the NIR region (3996 cm^-1 - 12004 cm^-1). Set up and validate a PLS model following the standard steps as shown in the PCR approach. Compare with PCR results also! Remember to check for outliers. Are any spectral ranges more important than others for predicting DE? How precisely can DE be predicted by the models? Define your own test set by a random selection of 23 samples:

# E.g.: Set your own working directory - on the perbb computer it is:
setwd("C:/perbb/DTUkemometri/Niels")

# Exercise data: Pectin data:
pectin=read.table("Pectin_and_DE.txt",header=TRUE,sep=";",dec=",")
pectin$NIR=as.matrix(pectin[,-1])

length(pectin$DE)
# First of all we consider a random selection of 23 observations as a TEST set
pectin$train=TRUE
pectin$train[sample(1:length(pectin$DE),23)]=FALSE

pectin_TEST=pectin[pectin$train==FALSE,]
pectin_TRAIN=pectin[pectin$train==TRUE,]

# The training X-data is 47 rows and 1039 columns:
dim(pectin_TRAIN$NIR)
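A minimal sketch of the modelling step on these training data, following the PCR/PLS templates above:

library(pls)
mod_pls <- plsr(DE ~ NIR, ncomp = 10, data = pectin_TRAIN, validation="LOO", scale=TRUE,
                jackknife = TRUE)
plot(mod_pls,"validation",estimate=c("train","CV"),legendpos = "topright")
# The loading plots in "spectral view" (scatter=FALSE) indicate which spectral
# ranges drive each component:
loadingplot(mod_pls,comps = 1,scatter=FALSE,labels=prednames(mod_pls))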

7 RR, Ridge and Lasso Regression

7.1 Example: PAC data from package chemometrics

library(chemometrics)
data(PAC)
?PAC

# First find optimal lambda:

# By GCV (formula approximation of real CV), sec 4.3.2.2 in the Varmuza-book:
ridge_res=plotRidge(y~X,data=PAC,lambda=seq(1,20,by=0.1))
ridge_res$lambdaopt

# By real CV, e.g. LOO:
# Note that the ridgeCV function shown in the book does NOT do this for us
# It only looks into an already chosen optimal lambda
# So we use the ridge.cv function from the parcor package:

library(parcor)

# First full LOO
res_LOO=ridge.cv(scale(PAC$X),PAC$y,lambda=seq(1,20,by=0.5),k=209,plot.it=TRUE)
res_LOO$lambda.opt

# Then 9 (random) versions of 10-fold CV:
par(mfrow=c(3,3))


for (i in 1:9){
  res_CV10=ridge.cv(scale(PAC$X),PAC$y,lambda=seq(1,20,by=0.5),k=10,plot.it=TRUE)
  title(main=res_CV10$lambda.opt)
}

# Let's do our own "repeated double 10-fold CV":
K=10
lambda.opts=rep(0,K)
for (i in 1:K){
  res_CV10=ridge.cv(scale(PAC$X),PAC$y,lambda=seq(1,30,by=0.5),k=10,plot.it=FALSE)
  lambda.opts[i]=res_CV10$lambda.opt
}

median(lambda.opts)

par(mfrow=c(1,1))
boxplot(lambda.opts)

# Now let's try the ridgeCV function to investigate the choice(s) more carefully,
# as suggested by the book:
res=ridgeCV(y~X,data=PAC,lambdaopt=4.3,repl=20)

# Look at the coefficients:
ridge_lm=lm.ridge(y~X,data=PAC,lambda=4.3)

# Coefficients on raw X-scale:
coef(ridge_lm)

# Coefficients on standardized X-scale:
ridge_lm$coef

# Now let’s try to fit the lasso:

# We could use the lars package:
library(lars)
lasso_result=lars(PAC$X,PAC$y)
cv.lars(PAC$X,PAC$y)

# Or from the chemometrics package:
res_lasso=lassoCV(y~X,data=PAC,fraction=seq(0,1,by=0.05),K=10)
res_lasso$sopt
res_coef=lassocoef(y~X,data=PAC,sopt=res_lasso$sopt)

# In an auto-scaled version:
#res_lasso_scaled=lassoCV(y~scale(X),data=PAC,fraction=seq(0,1,by=0.05),K=10)
#res_lasso_scaled$sopt
#res_coef_scaled=lassocoef(y~scale(X),data=PAC,sopt=res_lasso_scaled$sopt)

# The actual coefficients can be looked at by: (on raw X-scale - I believe :-) ...)
res_coef=res_coef$coefficients
res_coef
beta_lasso=res_coef

# Let's try to predict a NEW data set
# Here we take the original data - just for illustration:
PAC_test=PAC

# First by ridge regression:
ridge_lm=lm.ridge(y~X,data=PAC,lambda=4.3)

# Coefficients on raw X-scale (WITH intercept):
beta_ridge=coef(ridge_lm)

yhat_ridge=beta_ridge[1]+PAC_test$X%*%beta_ridge[-1]


residuals=PAC$y-yhat_ridge
MSEP=mean(residuals^2)
MSEP
sqrt(MSEP)
plot(PAC_test$y,yhat_ridge,ylim=c(180,520))

# Then by lasso regression:

yhat_lasso=predict(lasso_result,newx=PAC_test$X,s=0.3,mode="fraction")$fit
residuals=PAC_test$y-yhat_lasso
MSEP=mean(residuals^2)
MSEP
sqrt(MSEP)
plot(PAC_test$y,yhat_lasso,ylim=c(180,520))

7.2 Exercise: Prostate Cancer data and maybe pectin data

1. Try both ridge and lasso regression on the Prostate cancer data - a scaled version (see the starting sketch after this list).

(a) Work on the training data

(b) Find the optimal penalty

(c) Will lasso completely leave out some of the variables?

(d) Interpret/Look at the coefficients

(e) Predict the test data - compare results with other methods, e.g. MLR/PCR/(PLS)

(f) For the ridge model: Compare the "size" of the variable coefficients with the coefficients from MLR and PCR - what can you conclude?

2. Try lasso on the pectin NIR data - how many wavelengths will be completely left out?
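A minimal sketch for question 1, reusing the functions from the PAC example above and assuming the prostate training objects from the earlier exercises (standard column layout with the eight predictors in columns 1-8):

Xs=scale(prostate_TRAIN[,1:8])  # scaled predictors
y=prostate_TRAIN$lpsa

# Ridge: find the optimal penalty by 10-fold CV (parcor package):
library(parcor)
res=ridge.cv(Xs,y,lambda=seq(0.1,20,by=0.1),k=10,plot.it=TRUE)
res$lambda.opt

# Lasso: path and CV curve (lars package):
library(lars)
lasso_res=lars(Xs,y)
cv.lars(Xs,y)  # pick the fraction with the lowest CV error
# Coefficients at a chosen fraction (0.5 is just a placeholder);
# coefficients that are exactly zero answer question (c):
coef(lasso_res, s=0.5, mode="fraction")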

8 Discriminant Analysis, LDA, QDA, k-NN, Bayes, PLSDA

Apart from the Lattin chapter, also read in the Varmuza book:

• Section 5.1, Intro, 2.5 pages

• Section 5.2, Linear Methods, 12 pages

• Section 5.3.3, Nearest Neighbour methods (k-NN), 3 pages

• Section 5.7, Evaluation of classification, 3 pages


8.1 Example: Iris data

# We use the iris data:

# lda and qda used below are from the MASS package:
library(MASS)
Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
                   Sp = rep(c("s","c","v"), rep(50,3)))

# We make test and training data:
train <- sample(1:150, 75)
test = (1:150)[-train]
Iris_train=Iris[train,]
Iris_test=Iris[test,]

# Distribution in the three classes in the training data:
table(Iris_train$Sp)

# PART 1: LDA
lda_train_LOO <- lda(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W., Iris_train, prior = c(1,1,1)/3, CV=TRUE)

# Assess the accuracy of the prediction
# Percent correct for each category:
ct <- table(Iris_train$Sp, lda_train_LOO$class)
diag(prop.table(ct, 1))
# Total percent correct:
sum(diag(prop.table(ct)))

# Plotting only works without the CV stuff:
lda_train <- lda(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W., Iris_train, prior = c(1,1,1)/3)

plot(lda_train) # CVA components score plot no 1 and 2
plot(lda_train, dimen=1, type="both") # fit from lda

# Or REALLY nice plots in another package:
library(klaR)
partimat(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,method="lda",prior=c(1,1,1)/3)

# See the effect of a different prior (using an extreme one for illustration):
partimat(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,method="lda",prior=c(0.02,.49,0.49))

# And now predicting NEW data:
predict(lda_train, Iris_test)$class

# Find the confusion table:
ct <- table(Iris_test$Sp, predict(lda_train, Iris_test)$class)
ct
diag(prop.table(ct, 1))
# Total percent correct:
sum(diag(prop.table(ct)))

# PART 2: QDA
# Most stuff from LDA can be reused:

qda_train_LOO <- qda(Sp˜Sepal.L.+Sepal.W.+Petal.L.+Petal.W., Iris_train, prior = c(1,1,1)/3, CV=TRUE)

# Assess the accuracy of the prediction
# Percent correct for each category:
ct <- table(Iris_train$Sp, qda_train_LOO$class)
ct
diag(prop.table(ct, 1))
# Total percent correct:
sum(diag(prop.table(ct)))

# Plotting only works without the CV stuff:
qda_train <- qda(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W., Iris_train, prior = c(1,1,1)/3)

# Or REALLY nice plots in another package:


library(klaR)
partimat(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,method="qda",prior=c(1,1,1)/3)

# And now predicting NEW data:
predict(qda_train, Iris_test)$class

# Find the confusion table:
ct <- table(Iris_test$Sp, predict(qda_train, Iris_test)$class)
ct
diag(prop.table(ct, 1))
# Total percent correct:
sum(diag(prop.table(ct)))

# PART 3: (Naive) Bayes method
# With "usekernel=TRUE" it fits a density to the observed data and uses that
# instead of the normal distribution

# Let's explore the results first:
# (Can take a while due to the fitting of densities)
partimat(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,method="naiveBayes",prior=c(1,1,1)/3,usekernel=TRUE)

bayes_fit=NaiveBayes(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,prior=c(1,1,1)/3,usekernel=TRUE)
plot(bayes_fit)

# And now predicting NEW data:
predict(bayes_fit, Iris_test)$class

# Find the confusion table:
ct <- table(Iris_test$Sp, predict(bayes_fit, Iris_test)$class)
ct
diag(prop.table(ct, 1))
# Total percent correct:
sum(diag(prop.table(ct)))

# PART 4: k-nearest neighbours:

# Explorative plot. WARNING: THIS takes some time to produce!!
partimat(Sp~Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris[train,],method="sknn",kn=3)

knn_fit_5=sknn(Sp˜Sepal.L.+Sepal.W.+Petal.L.+Petal.W.,data=Iris_train,method="sknn",kn=5)

# And now predicting NEW data:
predict(knn_fit_5, Iris_test)$class

# Find the confusion table:
ct <- table(Iris_test$Sp, predict(knn_fit_5, Iris_test)$class)
ct
diag(prop.table(ct, 1))
# Total percent correct:
sum(diag(prop.table(ct)))

# PART 5: PLS-DA
# We have to use the "usual" PLS functions

# Define the response vector (2 classes) OR matrix (>2 classes):

# Let's try with K=2. Group 1: s, Group -1: c and v
Iris_train$Y=-1
Iris_train$Y[Iris_train$Sp=="s"]=1
table(Iris_train$Y)

Iris_train$X=as.matrix(Iris_train[,1:4])
# From here use standard PLS1 predictions, e.g.:
library(pls)
# PLS:


mod_pls <- plsr(Y ~ X, ncomp = 4, data = Iris_train, validation="LOO", scale=TRUE,
                jackknife = TRUE)

# Initial set of plots:
par(mfrow=c(2,2))
plot(mod_pls,labels=rownames(Iris_train),which="validation")
plot(mod_pls,"validation",estimate=c("train","CV"),legendpos = "topright")
plot(mod_pls,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")
scoreplot(mod_pls,labels=Iris_train$Sp)

# Be careful about interpretations due to the binary setting
# You should really do a CV-based confusion table for each component:
preds=array(dim=c(length(Iris_train[,1]),4))

for (i in 1:4) preds[,i]=predict(mod_pls,ncomp=i)
preds[preds<0]=-1
preds[preds>0]=1

# Look at the results from each of the components:

for (i in 1:4) {
  ct <- table(Iris_train$Y, preds[,i])
  CV_error=1-sum(diag(prop.table(ct)))
  print(CV_error)
}

# The prediction of new data would be handled similarly

# WHAT if, as in this case, K=3 classes:
# Look at Varmuza, Sec. 5.2.2.3
K=3
Iris_train$Y=matrix(rep(1,length(Iris_train[,1])*K),ncol=K)

Iris_train$Y[Iris_train$Sp!="c",1]=-1
Iris_train$Y[Iris_train$Sp!="s",2]=-1
Iris_train$Y[Iris_train$Sp!="v",3]=-1

mod_pls <- plsr(Y ~ X, ncomp = 4, data = Iris_train, validation="LOO", scale=TRUE,
                jackknife = TRUE)

# Initial set of plots:
par(mfrow=c(1,1))
scoreplot(mod_pls,labels=Iris_train$Sp)
plot(mod_pls,labels=rownames(Iris_train),which="validation")
plot(mod_pls,"validation",estimate=c("train","CV"),legendpos = "topright")
plot(mod_pls,"validation",estimate=c("train","CV"),val.type="R2",legendpos = "bottomright")

# Predictions from PLS need to be transformed into actual classifications:
# Select the largest one:
preds=array(dim=c(length(Iris_train[,1]),4))
for (i in 1:4) preds[,i]=apply(predict(mod_pls,ncomp=i),1,which.max)

preds2=array(dim=c(length(Iris_train[,1]),4))
preds2[preds==1]="c"
preds2[preds==2]="s"
preds2[preds==3]="v"

# Look at the results for each of the components:
for (i in 1:4) {
  ct <- table(Iris_train$Sp, preds2[,i])
  CV_error=1-sum(diag(prop.table(ct)))
  print(CV_error)
}


8.2 Exercise: Wine data

Use the wine data previously used. Predict the wine class using the different methods suggested. Try also a PCA-based version of some of the methods. What are the results? Make your own test and training data. Apply CV on the training data and predict the test data. By looking at various plots (in addition to prediction ability), consider which models seem to be most appropriate for the data.
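A minimal sketch of an LDA start, assuming the wines and wine.classes objects from the PCA exercise:

data(wines, package = "ChemometricsWithRData")
library(MASS)
wines_df=data.frame(wines, class=factor(wine.classes))

# Own test and training data (half of the 178 wines for training):
train=sample(1:nrow(wines_df), 89)
lda_fit=lda(class ~ ., data=wines_df[train,])

# Predict the test data and make the confusion table:
preds=predict(lda_fit, wines_df[-train,])$class
table(wines_df$class[-train], preds)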
