
Institut f. Statistik u. Wahrscheinlichkeitstheorie

1040 Wien, Wiedner Hauptstr. 8-10/107

AUSTRIA

http://www.statistik.tuwien.ac.at

Linear and nonlinear methods for regression

and classification and applications in R

P. Filzmoser

Forschungsbericht CS-2008-3

July 2008

Contact: [email protected]


Linear and Nonlinear Methods for Regression and Classification and Applications in R

Peter Filzmoser

Department of Statistics and Probability Theory, Vienna University of Technology

Wiedner Hauptstr. 8-10, Vienna, 1040, Austria, [email protected]

Summary

The field of statistics has changed drastically since the introduction of the computer. Computational statistics is nowadays a very popular field with many new developments of statistical methods and algorithms, and many interesting applications. One challenging problem is the increasing size and complexity of data sets. New technologies and methods had to be developed not only for storing and filtering such data, but also for analyzing them.

This manuscript is concerned with linear and nonlinear methods for regression and classification. The first sections treat “classical” methods like least squares regression and discriminant analysis. More advanced methods like spline interpolation, generalized additive models, and tree-based methods follow.

Each section introducing a new method is followed by a section with examples from practice and solutions in R [R Development Core Team, 2008]. The results of the different methods are compared in order to get an idea of their performance.

The manuscript is essentially based on the book “The Elements of Statistical Learning”, Hastie et al. [2001].


Contents

1 Fundamentals
  1.1 The linear regression model
  1.2 Comparison of models and model selection
    1.2.1 Test for several coefficients to be zero
    1.2.2 Explained variance
    1.2.3 Information criteria
    1.2.4 Resampling methods

2 Linear regression methods
  2.1 Least Squares (LS) regression
    2.1.1 Parameter estimation
  2.2 Variable selection
    2.2.1 Bias versus variance and interpretability
    2.2.2 Best subset regression
    2.2.3 Stepwise algorithms
  2.3 Methods using derived inputs as regressors
    2.3.1 Principal Component Regression (PCR)
    2.3.2 Partial Least Squares (PLS) regression
  2.4 Shrinkage methods
    2.4.1 Ridge regression
    2.4.2 Lasso Regression

3 Linear regression methods in R
  3.1 Least squares regression in R
    3.1.1 Parameter estimation in R
  3.2 Variable selection in R
    3.2.1 Best subset regression in R
    3.2.2 Stepwise algorithms in R
  3.3 Methods using derived inputs as regressors in R
    3.3.1 Principal component regression in R
    3.3.2 Partial least squares regression in R
  3.4 Shrinkage methods in R
    3.4.1 Ridge regression in R
    3.4.2 Lasso regression in R

4 Linear methods for classification
  4.1 Linear regression of an indicator matrix
  4.2 Discriminant analysis
    4.2.1 Linear Discriminant Analysis (LDA)
    4.2.2 Quadratic Discriminant Analysis (QDA)
    4.2.3 Regularized Discriminant Analysis (RDA)
  4.3 Logistic regression

5 Linear methods for classification in R
  5.1 Linear regression of an indicator matrix in R
  5.2 Discriminant analysis in R
    5.2.1 Linear discriminant analysis in R
    5.2.2 Quadratic discriminant analysis in R
    5.2.3 Regularized discriminant analysis in R
  5.3 Logistic regression in R

6 Basis expansions
  6.1 Interpolation with splines
    6.1.1 Piecewise polynomial functions
    6.1.2 Splines
    6.1.3 Natural cubic splines
  6.2 Smoothing splines

7 Basis expansions in R
  7.1 Interpolation with splines in R
  7.2 Smoothing splines in R

8 Generalized Additive Models (GAM)
  8.1 General aspects on GAM
  8.2 Parameter estimation with GAM

9 Generalized additive models in R

10 Tree-based methods
  10.1 Regression trees
  10.2 Classification trees

11 Tree-based methods in R
  11.1 Regression trees in R
  11.2 Classification trees in R

Bibliography


1 Fundamentals

In this section the linear model is introduced, and basic information about model comparison and model selection is provided. It is assumed that the reader has some knowledge of the theory of least squares regression, including statistical testing.

1.1 The linear regression model

The aim of the linear regression model is to describe the output variable $y$ through a linear combination of one or more input variables $x_1, x_2, \ldots, x_p$. “Linear combination” means that the input variables are first weighted with constants and then summed. The result should explain the output variable as well as possible. The classical linear model has the form

$$y = f(x_1, x_2, \ldots, x_p) + \varepsilon$$

with the linear function

$$f(x_1, x_2, \ldots, x_p) = \beta_0 + \sum_{j=1}^{p} x_j \beta_j . \qquad (1)$$

The goal is to find a functional relation $f$ which justifies $y \approx f(x_1, x_2, \ldots, x_p)$, so that the error term $\varepsilon$ in the model above is as small as possible. A common requirement is that the error term is normally distributed with expectation 0 and variance $\sigma^2$. The variables $x_j$ can come from different sources:

• quantitative inputs, for example the height of different people

• transformations of quantitative inputs, such as log, square-root, or square

• basis expansions, such as $x_2 = x_1^2$, $x_3 = x_1^3$, leading to a polynomial representation

• numeric or “dummy” coding of the levels of qualitative inputs

• interactions between variables, for example $x_3 = x_1 \cdot x_2$

No matter the source of the $x_j$, the model is linear in the parameters.

Typically we estimate the parameters $\beta_j$ from a set of training data, consisting of measurements $y_1, \ldots, y_n$ of the response variable and $x_1, \ldots, x_n$ of the regressor variables. Each vector $x_i$ contains the measurements $x_{i1}, x_{i2}, \ldots, x_{ip}$ of the $p$ variables. The estimated parameters are denoted by $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p$, and by inserting these values in the linear model (1) one gets for each observation

$$\hat y_i = \hat f(x_{i1}, x_{i2}, \ldots, x_{ip}) = \hat\beta_0 + \sum_{j=1}^{p} x_{ij} \hat\beta_j .$$

The predicted values $\hat y_i$ should be as close as possible to the actually measured values $y_i$. The definition of “as close as possible”, as well as a measure of how good the model actually is, will be given in the next sections.


1.2 Comparison of models and model selection

Let us consider a statistical model which combines several input variables $x_1, x_2, \ldots, x_p$ linearly in order to explain an output variable $y$ as well as possible. Two questions arise frequently: what does “as well as possible” really mean, and is there another model that explains $y$ as well as the one just found? Often the squared differences between the observed values $y_i$ and the fitted values $\hat y_i$ are used to measure the performance of a model; the differences $y_i - \hat y_i$ are called residuals.

There are different strategies to compare several models: One could be interested in reducing the number of input variables used to explain the output variable, since this simplifies the model and makes it easier to understand. In addition, the measurement of variables is often expensive, so a smaller model would also be cheaper. Another strategy is to use all of the input variables but derive a small number of new input variables from them (see Section 2). Both strategies aim at describing the output variable as well as possible, not just for the values already measured but also for values acquired in the future.

1.2.1 Test for several coefficients to be zero

Let us assume that we have two models of different size:

$$M_0: \quad y = \beta_0 x_0 + \beta_1 x_1 + \cdots + \beta_{p_0} x_{p_0} + \varepsilon$$

$$M_1: \quad y = \beta_0 x_0 + \beta_1 x_1 + \cdots + \beta_{p_1} x_{p_1} + \varepsilon$$

with $p_0 < p_1$. By simultaneously testing several coefficients to be zero we want to find out whether the additional variables $x_{p_0+1}, \ldots, x_{p_1}$ in the larger model $M_1$ provide a significant gain in explanation over the smaller model $M_0$.

Rephrased, we test $H_0$: “the smaller model is true”, meaning that model $M_0$ is acceptable, versus $H_1$: “the larger model is true”. The basis for this test is the residual sum of squares

$$\mathrm{RSS}_0 = \sum_{i=1}^{n} \Big( y_i - \hat\beta_0 - \sum_{j=1}^{p_0} \hat\beta_j x_{ij} \Big)^2 = \sum_{i=1}^{n} (y_i - \hat y_i)^2$$

for model $M_0$ and, respectively,

$$\mathrm{RSS}_1 = \sum_{i=1}^{n} \Big( y_i - \hat\beta_0 - \sum_{j=1}^{p_1} \hat\beta_j x_{ij} \Big)^2$$

for model $M_1$. This leads to the test statistic

$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/(p_1 - p_0)}{\mathrm{RSS}_1/(n - p_1 - 1)} .$$

The $F$-statistic measures the change in residual sum of squares per additional parameter in the bigger model, normalized by an estimate of $\sigma^2$. Under Gaussian assumptions $\varepsilon_i \sim N(0, \sigma^2)$ and the null hypothesis that the smaller model is correct, the $F$-statistic is distributed according to

$$F \sim F_{p_1 - p_0,\; n - p_1 - 1} .$$

For a better understanding of this test statistic we use the fact that the Total Sum of Squares (TSS) can be expressed as the sum of two components: the sum of squares explained by the regression (Regression Sum of Squares, RegSS) and the Residual Sum of Squares (RSS):


• Total Sum of Squares: $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar y)^2$

• Residual Sum of Squares: $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat y_i)^2$

• Regression Sum of Squares: $\mathrm{RegSS} = \sum_{i=1}^{n} (\hat y_i - \bar y)^2$

It is easy to show that $\mathrm{TSS} = \mathrm{RegSS} + \mathrm{RSS}$.

For the models $M_0$ and $M_1$ we have

$$\mathrm{TSS} = \mathrm{RegSS}_0 + \mathrm{RSS}_0 = \mathrm{RegSS}_1 + \mathrm{RSS}_1 ,$$

which leads to

$$\mathrm{RegSS}_1 - \mathrm{RegSS}_0 = \mathrm{RSS}_0 - \mathrm{RSS}_1 \;\geq\; 0 \quad \text{because } p_0 < p_1 .$$

If $\mathrm{RSS}_0 - \mathrm{RSS}_1$ is large, $M_1$ explains the data significantly better than $M_0$.

Attention: A test for $H_0: \beta_0 = \beta_1 = \cdots = \beta_p = 0$ might not give the same result as single tests for $H_0: \beta_j = 0$, $j = 0, 1, \ldots, p$, especially if the $x$-variables are highly correlated.
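As a minimal sketch in R (using a small simulated data frame, not the example data of the later sections), the F-test for nested models can be computed by hand and compared with the result of anova():

# simulated example data (hypothetical, only for illustration)
set.seed(1)
dat <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
m0 <- lm(y ~ x1, data = dat)                 # smaller model M_0 (p0 = 1)
m1 <- lm(y ~ x1 + x2 + x3, data = dat)       # larger model M_1 (p1 = 3)
rss0 <- sum(residuals(m0)^2)
rss1 <- sum(residuals(m1)^2)
n <- nrow(dat); p0 <- 1; p1 <- 3
Fval <- ((rss0 - rss1)/(p1 - p0)) / (rss1/(n - p1 - 1))
1 - pf(Fval, p1 - p0, n - p1 - 1)            # p-value of the test
anova(m0, m1)                                # the same test, computed by anova()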

1.2.2 Explained variance

The multiple R-square (coefficient of determination) describes the proportion of variance that is explained by the model:

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} = \frac{\mathrm{RegSS}}{\mathrm{TSS}} = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2} = \mathrm{Cor}^2(y, \hat y) \;\in [0, 1]$$

The closer $R^2$ is to 1, the better the fit of the regression. If the model has no intercept, $\bar y = 0$ is used. Since the denominator remains constant, $R^2$ grows with the size of the model. In general we are not interested in the model with the maximum $R^2$. Rather, we would like to select the model for which the addition of new variables leads only to a marginal increase in $R^2$.

The adjusted R-square $\bar R^2$ is defined as

$$\bar R^2 = 1 - \frac{\mathrm{RSS}/(n - p - 1)}{\mathrm{TSS}/(n - 1)} .$$

It gives a reduced $R^2$ value and, by taking the degrees of freedom into account, prevents the effect of obtaining a larger $R^2$ even though the fit gets worse [see, for example, Fox, 1997].
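A minimal sketch in R (reusing the simulated data frame dat from the sketch above): R-square and adjusted R-square can be computed directly from the residual and total sums of squares and agree with the values reported by summary():

m <- lm(y ~ x1 + x2 + x3, data = dat)
rss <- sum(residuals(m)^2)
tss <- sum((dat$y - mean(dat$y))^2)
n <- nrow(dat); p <- 3
1 - rss/tss                             # R-square
summary(m)$r.squared                    # the same value
1 - (rss/(n - p - 1)) / (tss/(n - 1))   # adjusted R-square
summary(m)$adj.r.squared                # the same value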


1.2.3 Information criteria

We now consider the analysis of nested models $M_1, \ldots, M_q$. The models are ordered as $M_1 \prec M_2 \prec \cdots \prec M_q$; if $M_i \prec M_j$, the model $M_i$ can be seen as a special case of $M_j$ which is obtained by setting some parameters equal to zero, e.g.

$$M_1: \quad y = \beta_0 + \beta_1 x_1 + \varepsilon$$

$$M_2: \quad y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

Here we have $M_1 \prec M_2$, since $M_1$ is the special case of $M_2$ where $\beta_2 = 0$.

A possibility to evaluate nested models is given by information criteria. The most popular criteria are AIC and BIC, which in different ways combine the goodness of fit of the model with a penalty term for the number of parameters used in the model.

Akaike’s information criterion (AIC): The AIC is defined as

$$\mathrm{AIC} = -2 \max \log L(\theta \,|\, \mathrm{data}) + 2p ,$$

where $\max \log L(\theta \,|\, \mathrm{data})$ is the maximized log-likelihood function for the parameters $\theta$, conditional on the data at hand, and $p$ is the number of parameters in the model. For a linear regression model with normally distributed errors, the maximum likelihood estimators of the unknown parameters are

$$\hat\beta_{ML} = \hat\beta_{LS} = (X^\top X)^{-1} X^\top y , \qquad \hat\sigma^2_{ML} = \mathrm{RSS}/n .$$

It follows that

$$\max \log L(\hat\beta_{ML}, \hat\sigma_{ML}) = -\frac{n}{2}\,(1 + \log(2\pi)) - \frac{n}{2} \log\Big(\frac{\mathrm{RSS}}{n}\Big) .$$

After dropping the constant, the AIC becomes

$$\mathrm{AIC} = n \log(\mathrm{RSS}/n) + 2p = \frac{\mathrm{RSS}}{\sigma^2} + 2p \quad \text{if } \sigma^2 \text{ is known.}$$

Within nested models the (log-)likelihood is monotonically increasing and the RSS is monotonically decreasing.

Bayes information criterion (BIC): Similar to AIC, the BIC can be used if model selection is based on the maximization of the log-likelihood function. The BIC is defined as

$$\mathrm{BIC} = -2 \max \log L + \log(n)\, p = \frac{\mathrm{RSS}}{\sigma^2} + \log(n)\, p \quad \text{if } \sigma^2 \text{ is known.}$$

Mallow’s $C_p$: For known $\sigma^2$, Mallow’s $C_p$ is defined as

$$C_p = \frac{\mathrm{RSS}}{\sigma^2} + 2p - n .$$


If the residual variance $\sigma^2$ is not known, it can be estimated by a regression with the full model (using all variables). If a full model cannot be computed (too many variables, collinearity, etc.), a regression can be performed on the relevant principal components (see Section 2.3.1), and the variance is estimated from the resulting residuals. For the “true” model, $C_p$ is approximately $p$, the number of parameters used in this model, and otherwise it is greater than $p$. Thus a model where $C_p$ is approximately $p$ should be selected, preferably the one with the smallest $p$.
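A minimal sketch in R (again with the simulated data frame dat from Section 1.2.1): AIC(), BIC() and extractAIC() evaluate the criteria for fitted models. Note that AIC() keeps all constants of the log-likelihood (and counts $\sigma^2$ as a parameter), so its absolute values differ from the simplified formula $n \log(\mathrm{RSS}/n) + 2p$ above, while the ranking of nested models is the same.

m0 <- lm(y ~ x1, data = dat)
m1 <- lm(y ~ x1 + x2 + x3, data = dat)
AIC(m0); AIC(m1)                        # Akaike's criterion
BIC(m0); BIC(m1)                        # Bayes criterion
extractAIC(m1)                          # the n*log(RSS/n) + 2p form (up to a constant)
# Mallow's C_p for m0, with sigma^2 estimated from the full model m1
n <- nrow(dat)
sigma2 <- sum(residuals(m1)^2)/(n - 3 - 1)
sum(residuals(m0)^2)/sigma2 + 2*2 - n   # p = 2 parameters (intercept and one slope)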

1.2.4 Resampling methods

After choosing a model which provides a good fit to the training data, we want it to model test data as well, with the requirement $y_{\mathrm{Test}} \approx \hat f(x_{\mathrm{Test}})$ ($\hat f$ was calculated from the training data only). One can define a loss function $L$ which measures the error between $y$ and $\hat f(x)$, e.g.

$$L\big(y, \hat f(x)\big) = \big(y - \hat f(x)\big)^2 \quad \text{(quadratic error)} \qquad \text{or} \qquad L\big(y, \hat f(x)\big) = \big|y - \hat f(x)\big| \quad \text{(absolute error).}$$

The test error is the expected prediction error on an independent test set,

$$\mathrm{Err} = \mathrm{E}\big[ L\big(y, \hat f(x)\big) \big] .$$

With a concrete sample, Err can be estimated by

$$\widehat{\mathrm{Err}} = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, \hat f(x_i)\big) .$$

When the quadratic loss is used, this estimate of Err is well known under the name Mean Squared Error (MSE):

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \hat f(x_i)\big)^2 .$$

Usually only one data set is available. The evaluation of the model is then done by resampling methods (e.g. cross-validation, bootstrap).

Cross validation: A given sample is randomly divided into $q$ parts. One part is chosen to be the test set, the other parts form the training data. The idea is as follows: for the $k$th part, $k = 1, 2, \ldots, q$, we fit the model to the other $q - 1$ parts of the data and calculate the prediction error of the fitted model when predicting the $k$th part. To avoid complicated notation, $\hat f$ denotes the fitted function computed with the $k$th part of the data removed; these functions all differ depending on the part left out. The evaluation of the model (calculation of the expected prediction error) is done on the $k$th part of the data, i.e. for $k = 3$:

  part:  1         2         3      4         5         · · ·  q
         Training  Training  Test   Training  Training  · · ·  Training

$\hat y_i = \hat f(x_i)$ represents the prediction for $x_i$, calculated by leaving out the $k$th part of the data, where $x_i$ belongs to part $k$. Since each observation appears exactly once in a test set, we obtain $n$ predicted values $\hat y_i$, $i = 1, \ldots, n$.


The estimated cross-validation error is then

$$\widehat{\mathrm{Err}}_{CV} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat y_i) .$$

Choice of $q$: The case $q = n$ is known as “leave-one-out cross-validation”. In this case the fit for observation $i$ is computed using all the data except the $i$th. Disadvantages of this method are

• high computational effort,

• high variance due to the similarity of the $n$ “training sets”.

A 5-fold or 10-fold cross-validation, i.e. $q = 5$ or $q = 10$, should therefore be preferred.
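A minimal sketch of q-fold cross-validation in R (with the simulated data frame dat from Section 1.2.1 and quadratic loss):

q <- 5
n <- nrow(dat)
folds <- sample(rep(1:q, length.out = n))            # random assignment of observations to q parts
yhat <- numeric(n)
for (k in 1:q) {
  test <- which(folds == k)
  fit <- lm(y ~ x1 + x2 + x3, data = dat[-test, ])   # fit without the kth part
  yhat[test] <- predict(fit, newdata = dat[test, ])  # predict the kth part
}
mean((dat$y - yhat)^2)                               # estimated CV error (MSE)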

Bootstrap: The basic idea is to randomly draw data sets with replacement from the training data, each sample being of the same size as the original training set. This is done $q$ times, producing $q$ data sets. Then we refit the model to each of the bootstrap data sets and examine the behavior of the fits over the $q$ replications. The mean prediction error is then

$$\widehat{\mathrm{Err}}_{Boot} = \frac{1}{q} \frac{1}{n} \sum_{k=1}^{q} \sum_{i=1}^{n} L\big(y_i, \hat f^k(x_i)\big) ,$$

with $\hat f^k$ denoting the function fitted on bootstrap sample $k$ and $\hat f^k(x_i)$ the corresponding prediction for observation $x_i$.

Problem and possible improvements: Due to the large overlap between test and training sets, $\widehat{\mathrm{Err}}_{Boot}$ is frequently too optimistic. The probability of an observation being included in a particular bootstrap data set is $1 - (1 - \frac{1}{n})^n \approx 1 - e^{-1} = 0.632$. A possible improvement would be to calculate $\widehat{\mathrm{Err}}_{Boot}$ only for those observations not included in the bootstrap data set, which holds for about $\frac{1}{3}$ of the observations. The “leave-one-out bootstrap” implements this idea: the prediction $\hat f^k(x_i)$ is used only for those data sets $k$ which do not include $x_i$:

$$\widehat{\mathrm{Err}}_{Boot-1} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|C^{-i}|} \sum_{k \in C^{-i}} L\big(y_i, \hat f^k(x_i)\big)$$

$C^{-i}$ is the set of indices of the bootstrap samples $k$ that do not contain observation $i$.
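A minimal sketch of the leave-one-out bootstrap error in R (simulated data frame dat as above, quadratic loss):

q <- 200                              # number of bootstrap samples
n <- nrow(dat)
loss <- vector("list", n)             # losses L(y_i, f^k(x_i)) for samples k not containing i
for (k in 1:q) {
  idx <- sample(1:n, n, replace = TRUE)
  fit <- lm(y ~ x1 + x2 + x3, data = dat[idx, ])
  out <- setdiff(1:n, idx)            # observations not in the kth bootstrap sample
  pred <- predict(fit, newdata = dat[out, ])
  for (j in seq_along(out)) loss[[out[j]]] <- c(loss[[out[j]]], (dat$y[out[j]] - pred[j])^2)
}
mean(sapply(loss, function(v) if (length(v)) mean(v) else NA), na.rm = TRUE)  # Err_Boot-1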

2 Linear regression methods

A linear regression model assumes that the regression function $\mathrm{E}(y \,|\, x_1, x_2, \ldots, x_p)$ is linear in the inputs $x_1, x_2, \ldots, x_p$. Linear models were largely developed in the pre-computer age of statistics, but even in today's computer era there are still good reasons to study and use them, since they are the foundation of more advanced methods. Some important characteristics of linear models are:

• They are simple and often provide an adequate and interpretable description of how the inputs affect the output.

• For prediction purposes they can often outperform fancier nonlinear models, especially in situations with

  – small numbers of training data, or

  – a low signal-to-noise ratio.

• Finally, linear models can be applied to transformations of the inputs and can therefore also be used to model nonlinear relations.

2.1 Least Squares (LS) regression

2.1.1 Parameter estimation

The multiple linear regression model has the form $y = X\beta + \varepsilon$ with the $n$ observed values $y = (y_1, y_2, \ldots, y_n)^\top$, the design matrix

$$X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} = \begin{pmatrix} 1 & x_1^\top \\ 1 & x_2^\top \\ \vdots & \vdots \\ 1 & x_n^\top \end{pmatrix}$$

and the error term $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)^\top$. The parameters we are looking for are the regression coefficients $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^\top$. The most popular estimation method is least squares, in which we choose the coefficients $\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p)^\top$ that minimize the residual sum of squares (RSS)

$$\mathrm{RSS}(\beta) = \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2 = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 .$$

This approach makes no assumptions about the validity of the model; it simply finds the best linear fit to the data.

The solution of the minimization problem is given by

$$\hat\beta = (X^\top X)^{-1} X^\top y .$$

The estimated $y$-values are

$$\hat y = X \hat\beta = X (X^\top X)^{-1} X^\top y .$$
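A minimal sketch in R (simulated data frame dat as in Section 1): the least squares solution computed directly from the formula agrees with the coefficients returned by lm():

X <- cbind(1, dat$x1, dat$x2, dat$x3)          # design matrix with intercept column
y <- dat$y
beta.hat <- solve(t(X) %*% X, t(X) %*% y)      # (X'X)^{-1} X'y
cbind(beta.hat, coef(lm(y ~ x1 + x2 + x3, data = dat)))   # identical coefficients
y.hat <- X %*% beta.hat                        # fitted values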

2.2 Variable selection

2.2.1 Bias versus variance and interpretability

There are two reasons why the least squares estimates of the parameters in the full model are often unsatisfactory:


1. Prediction quality: Least squares estimates often have a small bias but a high variance. The prediction quality can sometimes be improved by shrinking the regression coefficients or by setting some coefficients equal to zero. In this way the bias increases, but the variance of the prediction is reduced significantly, which leads to an overall improved prediction. This tradeoff between bias and variance can easily be seen by decomposing the mean squared error (MSE) of an estimator $\hat\theta$ of $\theta$:

$$\mathrm{MSE}(\hat\theta) := \mathrm{E}(\hat\theta - \theta)^2 = \mathrm{Var}(\hat\theta) + \underbrace{[\mathrm{E}(\hat\theta) - \theta]^2}_{\text{squared bias}}$$

For the BLUE (Best Linear Unbiased Estimator) $\hat\beta$ we have

$$\mathrm{MSE}(\hat\beta) = \mathrm{Var}(\hat\beta) + \underbrace{[\mathrm{E}(\hat\beta) - \beta]^2}_{=0} .$$

However, there could exist an estimator $\tilde\beta$ with

$$\mathrm{MSE}(\tilde\beta) < \mathrm{MSE}(\hat\beta) ,$$

with $\mathrm{E}(\tilde\beta) \neq \beta$ (biased) and $\mathrm{Var}(\tilde\beta) < \mathrm{Var}(\hat\beta)$. A smaller MSE can thus lead to a better prediction of new values. Such estimators $\tilde\beta$ can be obtained by variable selection, by building linear combinations of the regressor variables (Section 2.3), or by shrinkage methods (Section 2.4).

2. Interpretability: If many predictor variables are available, it makes sense to identify the ones with the largest influence and to set those to zero that are not relevant for the prediction. Thus we eliminate variables that only explain some details, but we keep those which account for the major part of the explanation of the response variable.

With variable selection only a subset of all input variables is used; the rest are eliminated from the model. For this subset, the least squares estimator is used for parameter estimation. There are many different strategies for choosing a subset.

2.2.2 Best subset regression

Best subset regression finds, for each $k \in \{0, 1, \ldots, p\}$, the subset of size $k$ that gives the smallest RSS. An efficient algorithm is the so-called leaps and bounds algorithm, which can handle up to 30 or 40 regressor variables.

Leaps and bounds algorithm: This algorithm creates a tree of models and calculates the RSS (or another criterion) for particular subsets. In Figure 1 the AIC for the first subsets is shown. Subsequently, large branches are eliminated when they cannot reduce the criterion further. The AIC for the model $x_2 + x_3$ is 20. Through the elimination of $x_2$ or $x_3$ we get an AIC of at least 18, since the value can decrease by at most $2p = 2$ (this follows from the formula $\mathrm{AIC} = n \log(\mathrm{RSS}/n) + 2p$). By eliminating a regressor in $x_1 + x_2$, a smaller AIC (at least 8) can be obtained. Therefore, the branch $x_2 + x_3$ is left out of the further analysis.

2.2.3 Stepwise algorithms

With more than about 40 input variables a search through all possible subsets becomes infeasible. The following algorithms can be used instead:


Figure 1: Tree of models where complete branches can be eliminated.

• Forward stepwise selection starts with the intercept and then sequentially adds to the model the predictor that most improves the fit. The improvement of the fit is often based on the $F$-statistic

$$F = \frac{\mathrm{RSS}(\hat\beta) - \mathrm{RSS}(\tilde\beta)}{\mathrm{RSS}(\tilde\beta)/(n - k - 2)} ,$$

where $\hat\beta$ is the parameter vector of the current model with $k$ parameters and $\tilde\beta$ the parameter vector of the model with $k + 1$ parameters. Typically, predictors are added sequentially until no predictor produces an $F$ ratio greater than the 90th or 95th percentile of the $F_{1, n-k-2}$ distribution.

• Backward stepwise selection starts with the full model and sequentially deletes predictors. It typically uses the same $F$ ratio as forward stepwise selection. Backward selection can only be used when $n > p$, in order to have a well-defined full model.

• There are also hybrid stepwise strategies that consider both forward and backward moves at each stage and make the “best” move.

2.3 Methods using derived inputs as regressors

In many situations we have a large number of inputs, often highly correlated. For the model

$$y = X\beta + \varepsilon$$

with $X$ fixed and $\varepsilon \sim N(0, \sigma^2 I)$ we have $\hat\beta_{LS} \sim N\big(\beta, \sigma^2 (X^\top X)^{-1}\big)$.

In case of highly correlated regressors, $X^\top X$ is almost singular and $(X^\top X)^{-1}$ therefore has very large elements, which leads to a numerically unstable $\hat\beta$. As a consequence, tests for the regression parameters,

$$H_0: \beta_j = 0 \quad \text{versus} \quad H_1: \beta_j \neq 0$$

with test statistic

$$z_j = \frac{\hat\beta_j}{\hat\sigma \sqrt{d_j}} ,$$

where $d_j$ is the $j$th diagonal element of $(X^\top X)^{-1}$, become unreliable.

To avoid this problem, we use methods based on derived input directions. The methods in this section produce a small number of linear combinations $z_k$, $k = 1, 2, \ldots, q$, of the original inputs $x_j$, which are then used as inputs in the regression. The methods differ in how the linear combinations are constructed.

2.3.1 Principal Component Regression (PCR)

This method looks for a transformation of the original data into a new set of uncorrelated variables called principal components; see, for example, Johnson and Wichern [2002]. The transformation ranks the new variables according to their importance, i.e. according to the size of their variance, and eliminates those of least importance. Then a least squares regression is performed on the reduced set of principal components. Since PCR is not scale invariant, one should center and scale the data first.

Consider a $p$-dimensional random vector $x = (x_1, \ldots, x_p)^\top$ with covariance matrix $\Sigma$, which we assume to be positive definite. Let $V = (v_1, \ldots, v_p)$ be a $(p \times p)$ matrix with orthonormal column vectors, i.e. $v_i^\top v_i = 1$, $i = 1, \ldots, p$, and $V^\top = V^{-1}$. We are looking for the linear transformation

$$z = V^\top x , \qquad z_i = v_i^\top x .$$

The variance of the random variable $z_i$ is given by

$$\mathrm{Var}(z_i) = \mathrm{E}[v_i^\top x x^\top v_i] = v_i^\top \Sigma v_i .$$

Maximizing the variance $\mathrm{Var}(z_i)$ under the condition $v_i^\top v_i = 1$ with a Lagrange multiplier $a_i$ leads to

$$\phi_i = v_i^\top \Sigma v_i - a_i (v_i^\top v_i - 1) .$$

Setting the partial derivative to zero gives

$$\frac{\partial \phi_i}{\partial v_i} = 2 \Sigma v_i - 2 a_i v_i = 0 ,$$

which is

$$(\Sigma - a_i I)\, v_i = 0 .$$

In matrix form we have

$$\Sigma V = V A \qquad \text{or} \qquad \Sigma = V A V^\top ,$$

where $A = \mathrm{Diag}(a_1, \ldots, a_p)$. This is the well-known eigenvalue problem: the $v_i$ are the eigenvectors of $\Sigma$ and the $a_i$ the corresponding eigenvalues, with $a_1 \geq a_2 \geq \cdots \geq a_p$. Since $\Sigma$ is positive definite, all eigenvalues are real, non-negative numbers.

$z_i$ is called the $i$th principal component of $x$, and we have

$$\mathrm{Cov}(z) = V^\top \mathrm{Cov}(x)\, V = V^\top \Sigma V = A .$$

The variance of the $i$th principal component equals the eigenvalue $a_i$, and the variances are ranked in descending order; the last principal component has the smallest variance. The principal components are orthogonal to each other (they are even uncorrelated), since $A$ is a diagonal matrix.

In the following we use $q$ ($1 \leq q \leq p$) principal components for the regression. The regression model for observed data $X$ and $y$ can then be expressed as

$$y = X\beta + \varepsilon = X V V^\top \beta + \varepsilon = Z\theta + \varepsilon$$

with the $(n \times q)$ matrix of the empirical principal components $Z = XV$ and the new regression coefficients $\theta = V^\top \beta$. The least squares estimates are

$$\hat\theta_k = (z_k^\top z_k)^{-1} z_k^\top y \qquad \text{and} \qquad \hat\theta = (\hat\theta_1, \ldots, \hat\theta_q)^\top .$$

Since the $z_k$ are orthogonal, the regression is just a sum of univariate regressions,

$$\hat y_{PCR} = \sum_{k=1}^{q} \hat\theta_k z_k .$$

Since the $z_k$ are linear combinations of the original $x_j$, we can express the solution in terms of coefficients of the $x_j$:

$$\hat\beta_{PCR}(q) = \sum_{k=1}^{q} \hat\theta_k v_k = V \hat\theta .$$

Note that for $q = p$ we would just get back the usual least squares estimates for the full model; for $q < p$ we get a “reduced” regression.
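A minimal sketch of PCR computed by hand in R (simulated data frame dat as before; in practice one would rather use pcr() from the package pls, see Section 3.3.1):

Xs <- scale(as.matrix(dat[, c("x1", "x2", "x3")]))   # centered and scaled regressors
ys <- dat$y - mean(dat$y)                            # centered response
V <- eigen(cov(Xs))$vectors                          # eigenvectors of the covariance matrix
q <- 2                                               # number of components retained
Z <- Xs %*% V[, 1:q]                                 # scores of the first q principal components
theta <- solve(t(Z) %*% Z, t(Z) %*% ys)              # LS regression on the components
beta.pcr <- V[, 1:q] %*% theta                       # coefficients in terms of the x_j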

2.3.2 Partial Least Squares (PLS) regression

This technique also constructs a set of linear combinations of the inputs for the regression, but unlike principal component regression it uses $y$ in addition to $X$ for this construction [e.g., Tenenhaus, 1998]. We assume that both $y$ and $X$ are centered. Instead of calculating the parameters $\beta$ in the linear model

$$y_i = x_i^\top \beta + \varepsilon_i$$

we estimate the parameters $\gamma$ in the so-called latent variable model

$$y_i = t_i^\top \gamma + \varepsilon_i .$$

We assume:

• The new coefficients $\gamma$ are of dimension $q \leq p$.

• The values $t_i$ of the latent variables are collected in an $(n \times q)$ score matrix $T$.

• Due to the reduced dimension, the regression of $y$ on $T$ should be more stable.

• $T$ cannot be observed directly; we construct its columns sequentially for $k = 1, 2, \ldots, q$ through the PLS criterion

$$a_k = \arg\max_{a}\ \mathrm{Cov}(y, Xa)$$

under the constraints $\|a_k\| = 1$ and $\mathrm{Cov}(Xa_k, Xa_j) = 0$ for $1 \leq j < k$. The vectors $a_k$, $k = 1, 2, \ldots, q$, are called loadings, and they are collected in the columns of the matrix $A$. The score matrix is then

$$T = XA ,$$

and $y$ can be written as

$$y = T\gamma + \varepsilon = (XA)\gamma + \varepsilon = X \underbrace{(A\gamma)}_{\beta} + \varepsilon = X\beta + \varepsilon .$$

In other words, PLS does a regression on a weighted version of $X$ which contains incomplete or partial information (hence the name of the method). The additional use of the least squares method for the fit leads to the name Partial Least Squares. Since PLS also uses $y$ to determine the PLS directions, this method is expected to have better prediction performance than, for instance, PCR. In contrast to PCR, PLS looks for directions with high variance and large correlation with $y$.

2.4 Shrinkage methods

Shrinkage methods keep all variables in the model and assign different (continuous) weights to them. In this way we obtain a smoother procedure with a smaller variability.

2.4.1 Ridge regression

Ridge regression shrinks the coefficients by imposing a penalty on their size. The ridge coefficients minimize a penalized residual sum of squares,

$$\hat\beta_{Ridge} = \arg\min_{\beta} \Bigg\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Bigg\} . \qquad (2)$$

Here $\lambda \geq 0$ is a complexity parameter that controls the amount of shrinkage: the larger the value of $\lambda$, the greater the amount of shrinkage. The coefficients are shrunk towards zero (and towards each other).

An equivalent way to write the ridge problem is

$$\hat\beta_{Ridge} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 \qquad \text{under the constraint} \quad \sum_{j=1}^{p} \beta_j^2 \leq s ,$$


which makes the size constraint on the parameters explicit. By penalizing the RSS we try to avoid the situation that highly correlated regressors (e.g. $x_j$ and $x_k$) cancel each other: an especially large positive coefficient $\beta_j$ can be canceled by a similarly large negative coefficient $\beta_k$. By imposing a size constraint on the coefficients this phenomenon can be prevented.

The ridge solutions $\hat\beta_{Ridge}$ are not equivariant under different scalings of the inputs, so one normally standardizes the inputs. In addition, notice that the intercept $\beta_0$ has been left out of the penalty term. Penalizing the intercept would make the procedure depend on the origin chosen for $y$; that is, adding a constant $c$ to each of the targets $y_i$ would not simply result in a shift of the predictions by the same amount $c$.

We center the $x_{ij}$ (each $x_{ij}$ is replaced by $x_{ij} - \bar x_j$) and estimate $\beta_0$ by $\bar y = \sum_{i=1}^{n} y_i / n$. The remaining coefficients are estimated by a ridge regression without intercept, hence the matrix $X$ has $p$ rather than $p + 1$ columns. Rewriting (2) in matrix form gives

$$\mathrm{RSS}(\lambda) = (y - X\beta)^\top (y - X\beta) + \lambda \beta^\top \beta$$

and the solution becomes

$$\hat\beta_{Ridge} = (X^\top X + \lambda I)^{-1} X^\top y .$$

$I$ is the $(p \times p)$ identity matrix. Advantages of this method are:

• With the choice of the quadratic penalty $\beta^\top \beta$, the resulting ridge regression coefficients are again a linear function of $y$.

• The solution adds a positive constant to the diagonal of $X^\top X$ before inversion. This makes the problem nonsingular even if $X^\top X$ is not of full rank, which was the main motivation for its introduction around 1970.

• The effective degrees of freedom are

$$\mathrm{df}(\lambda) = \mathrm{tr}\big( X (X^\top X + \lambda I)^{-1} X^\top \big) ,$$

thus

$$\lambda = 0 \ \Rightarrow\ \mathrm{df}(\lambda) = \mathrm{tr}\big( X (X^\top X)^{-1} X^\top \big) = \mathrm{tr}(I_p) = p , \qquad \lambda \to \infty \ \Rightarrow\ \mathrm{df}(\lambda) \to 0 .$$

It can be shown that PCR is very similar to ridge regression: both methods use the principal components of the input matrix $X$. Ridge regression shrinks the coefficients of the principal components, where the shrinkage depends on the corresponding eigenvalues; PCR completely discards the components corresponding to the smallest $p - q$ eigenvalues.
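A minimal sketch of the closed-form ridge solution in R (simulated data frame dat as before; note that lm.ridge() from MASS, used in Section 3.4.1, scales the data slightly differently, so the numbers need not match exactly):

lambda <- 1
Xs <- scale(as.matrix(dat[, c("x1", "x2", "x3")]))   # standardized inputs, no intercept column
ys <- dat$y - mean(dat$y)                            # centered response; beta_0 is estimated by mean(y)
beta.ridge <- solve(t(Xs) %*% Xs + lambda * diag(ncol(Xs)), t(Xs) %*% ys)
H <- Xs %*% solve(t(Xs) %*% Xs + lambda * diag(ncol(Xs))) %*% t(Xs)
sum(diag(H))                                         # effective degrees of freedom df(lambda)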

2.4.2 Lasso Regression

The lasso is a shrinkage method like ridge regression, but the $L_1$ norm rather than the $L_2$ norm is used in the constraint. The lasso estimator is defined by

$$\hat\beta_{Lasso} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 \qquad \text{under the constraint} \quad \sum_{j=1}^{p} |\beta_j| \leq s .$$


Just as in ridge regression we standardize the data. The solution for $\hat\beta_0$ is $\bar y$, and thereafter we fit a model without an intercept.

Lasso and ridge regression differ in their penalty term. The lasso solutions are nonlinear in $y$, and a quadratic programming algorithm is used to compute them. Because of the nature of the constraint, making $s$ sufficiently small will cause some of the coefficients to be exactly 0; thus the lasso does a kind of continuous subset selection. If $s$ is chosen larger than $s_0 = \sum_{j=1}^{p} |\hat\beta_j|$ (where the $\hat\beta_j$ are the least squares estimates), then the lasso estimates equal the least squares estimates. On the other hand, for $s = s_0/2$ the least squares coefficients are shrunk by about 50% on average; however, the nature of the shrinkage is not obvious. Like the subset size in subset selection, or the penalty parameter in ridge regression, $s$ should be adaptively chosen to minimize an estimate of the expected prediction error.

3 Linear regression methods in R

3.1 Least squares regression in R

The various regression methods introduced throughout the manuscript will be illustrated with the so-called body fat data. The data set can be found in the R package UsingR under the name “fat”. It consists of 15 physical measurements of 251 men.

• body.fat: percentage of body-fat calculated by Brozek’s equation

• age: age in years

• weight: weight (in pounds)

• height: height (in inches)

• BMI: adiposity index

• neck: neck circumference (cm)

• chest: chest circumference (cm)

• abdomen: abdomen circumference (cm)

• hip: hip circumference (cm)

• thigh: thigh circumference (cm)

• knee: knee circumference (cm)

• ankle: ankle circumference (cm)

• bicep: extended biceps circumference (cm)

• forearm: forearm circumference (cm)

• wrist: wrist circumference (cm)

To measure the percentage of body fat, an extensive (and expensive) underwater technique has to be performed. The goal here is to establish a model which allows the prediction of the percentage of body fat from easily measurable and collectible variables, in order to avoid the underwater procedure. Nowadays, a rather effortless method called bio-impedance analysis provides a reliable way to determine the body fat percentage.

• Reading in the data and preparation of the variables


> library(UsingR)
> data(fat)
> attach(fat)
> fat$body.fat[fat$body.fat == 0] <- NA
# exclude variables that are not used for the analysis
> fat <- fat[, -c(1, 3, 4, 9)]
# exclude a sample with wrong body height
> fat <- fat[-42, ]
# transform the body height to centimeters
> fat[, 4] <- fat[, 4] * 2.54

3.1.1 Parameter estimation in R

> model.lm <- lm(body.fat ~ ., data = fat)
> summary(model.lm)

Call:
lm(formula = body.fat ~ ., data = fat)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.1062  -2.6605  -0.2011   2.8920   9.2619 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -44.91075   36.67739  -1.224  0.22200    
age           0.05740    0.03004   1.911  0.05725 .  
weight       -0.16239    0.10076  -1.612  0.10838    
height        0.17192    0.20001   0.860  0.39089    
BMI           0.75340    0.73339   1.027  0.30534    
neck         -0.42594    0.21857  -1.949  0.05251 .  
chest        -0.05969    0.09907  -0.603  0.54740    
abdomen       0.87126    0.08569  10.168  < 2e-16 ***
hip          -0.22543    0.13796  -1.634  0.10359    
thigh         0.21780    0.13660   1.594  0.11220    
knee         -0.01257    0.22965  -0.055  0.95639    
ankle         0.12398    0.20837   0.595  0.55243    
bicep         0.16357    0.16000   1.022  0.30769    
forearm       0.39166    0.18627   2.103  0.03656 *  
wrist        -1.49585    0.49586  -3.017  0.00284 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.988 on 235 degrees of freedom
  (1 observation deleted due to missingness)

Multiple R-Squared: 0.7432, Adjusted R-squared: 0.7279

F-statistic: 48.58 on 14 and 235 DF, p-value: < 2.2e-16

The coefficients of age, neck, abdomen, forearm and wrist have comparatively large absolute t-values and small p-values, so for these variables the null hypothesis $\beta_i = 0$ should be rejected. Due to the very small p-value of the F-statistic, the null hypothesis $\beta_1 = \cdots = \beta_p = 0$ should be rejected as well. With an R-squared of 0.7432 we can assume that the model provides a good fit.

3.2 Variable selection in R

3.2.1 Best subset regression in R


> library(leaps)
> lm.regsubset <- regsubsets(body.fat ~ ., data = fat, nbest = 1,
+     nvmax = 8)

regsubsets() from the package leaps provides the “best” model for each subset size. Here only one “best” model per subset size was considered. The ranking of the models is done using the BIC.

> lm.regsubset2 <- regsubsets(body.fat ~ ., data = fat, nbest = 2,
+     nvmax = 8)
> plot(lm.regsubset2)


Figure 2: Model selection with leaps()

This plot shows, for each model size from 1 to 8 regressors, the two best models computed by regsubsets(). The BIC, coded in grey scale, does not improve after the fifth stage (starting from the bottom, see Figure 2). The optimal model can then be chosen from the models with “saturated” grey, and preferably the model with the smallest number of variables is taken.

3.2.2 Stepwise algorithms in R

• Stepwise selection with step()

> model.lmstep <- step(model.lm)

Start:  AIC=706.23
body.fat ~ age + weight + height + BMI + neck + chest + abdomen +
    hip + thigh + knee + ankle + bicep + forearm + wrist

           Df Sum of Sq    RSS   AIC
- knee      1   0.04766 3738.3 704.2
- ankle     1       5.6 3743.9 704.6
- chest     1       5.8 3744.1 704.6
- height    1      11.8 3750.0 705.0
- bicep     1      16.6 3754.9 705.3
- BMI       1      16.8 3755.1 705.4
<none>                  3738.3 706.2
- thigh     1      40.4 3778.7 706.9
- weight    1      41.3 3779.6 707.0
- hip       1      42.5 3780.8 707.1
- age       1      58.1 3796.4 708.1
- neck      1      60.4 3798.7 708.2
- forearm   1      70.3 3808.6 708.9
- wrist     1     144.8 3883.1 713.7
- abdomen   1    1644.6 5382.9 795.4

...

Step:  AIC=697.41
body.fat ~ age + weight + neck + abdomen + hip + thigh + forearm +
    wrist

           Df Sum of Sq    RSS   AIC
<none>                  3786.2 697.4
- hip       1      37.4 3823.6 697.9
- age       1      59.3 3845.5 699.3
- neck      1      61.2 3847.4 699.4
- weight    1      74.7 3860.9 700.3
- thigh     1      77.5 3863.7 700.5
- forearm   1     114.0 3900.2 702.8
- wrist     1     135.8 3922.1 704.2
- abdomen   1    2712.5 6498.7 830.5

Call:
lm(formula = body.fat ~ age + weight + neck + abdomen + hip +
    thigh + forearm + wrist, data = fat)

Coefficients:
(Intercept)          age       weight         neck      abdomen          hip
  -18.46826      0.05577     -0.08081     -0.41183      0.87775     -0.20063
      thigh      forearm        wrist
    0.26719      0.46567     -1.39341

step() continues removing variables as long as the AIC can be reduced, and stops when no further reduction is possible.

• Comparison of the models with anova()

> anova(model.lm, model.lmstep)

Analysis of Variance Table

Model 1: body.fat ~ age + weight + height + BMI + neck + chest + abdomen +
    hip + thigh + knee + ankle + bicep + forearm + wrist
Model 2: body.fat ~ age + weight + neck + abdomen + hip + thigh + forearm +
    wrist
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    235 3738.3
2    241 3786.2 -6     -47.9 0.5019 0.8066

By using the smaller model model.lmstep no essential information is lost; it can therefore be used for prediction instead of model.lm.

3.3 Methods using derived inputs as regressors in R

3.3.1 Principal component regression in R

> x <- scale(fat[, -1], T, T)
> y <- scale(fat[, 1], T, F)
> library(pls)
> model.pcr <- pcr(y ~ x, ncomp = 14, validation = "CV", segments = 10,
+     segment.type = "random")
> plot(model.pcr, plottype = "validation", val.type = "MSEP", legend = "topright")


Figure 3: Determination of the number of components for PCR

The plot (see Figure 3) suggests using 12 components. The RMSEP is defined as

$$\mathrm{RMSEP} = \sqrt{\widehat{\mathrm{Err}}_{CV}} ,$$

where $\widehat{\mathrm{Err}}_{CV}$ is the mean prediction error of Section 1.2.4 with the quadratic loss function $L$. Here the RMSEP has a value of 4.143.

3.3.2 Partial least squares regression in R

> model.plsr <- plsr(y ~ x, ncomp = 14, validation = "CV", segments = 10,
+     segment.type = "random")
> plot(RMSEP(model.plsr), legendpos = "topright")


Figure 4: Determination of the number of components for PLS regression


Figure 5: Comparison of the fit provided by PCR and PLS regression

Analyzing Figure 4 suggests that 6 components are sufficient. The RMSEP is 4.211.

Figure 5 compares the fits provided by PCR and PLS regression. The predictions obtained by CV are plotted against the observed values. The left panel shows the $\hat y_i$ computed from the first 12 PCR components, the right panel uses the first 6 components of the PLS regression.

3.4 Shrinkage methods in R

3.4.1 Ridge regression in R

• Ridge regression without a special selection of λ

> library(MASS)
> model.ridge <- lm.ridge(body.fat ~ ., data = fat, lambda = 1)
> model.ridge$Inter

[1] 1

> model.ridge$coef

        age      weight      height         BMI        neck       chest 
 0.79271367 -2.74668538  0.30184658  1.25208862 -1.06552555 -0.34544269 
    abdomen         hip       thigh        knee       ankle       bicep 
 8.87964121 -1.43870319  1.06181696 -0.04751459  0.19309025  0.44022071 
    forearm       wrist 
 0.79583229 -1.41917220 

• Optimal selection of λ using GCV

> model.ridge <- lm.ridge(body.fat ~ ., data = fat, lambda = seq(0,
+     15, by = 0.2))
> select(model.ridge)

modified HKB estimator is 1.513289 
modified L-W estimator is 4.41101 
smallest value of GCV  at 1.2


> plot(model.ridge$lambda, model.ridge$GCV, type = "l")

Generalized cross-validation (GCV) is a good approximation of leave-one-out (LOO) cross-validation for linear fits with quadratic loss. For a linear fit we have $\hat y = S y$ with the corresponding transformation matrix $S$, and this results in

$$\frac{1}{n} \sum_{i=1}^{n} \big( y_i - \hat f^{-i}(x_i) \big)^2 = \frac{1}{n} \sum_{i=1}^{n} \bigg( \frac{y_i - \hat f(x_i)}{1 - S_{ii}} \bigg)^2 ,$$

with $\hat f^{-i}(x_i)$ the fit computed without observation $i$ and $S_{ii}$ the $i$th diagonal element of $S$. The GCV approximation is

$$\mathrm{GCV} = \frac{1}{n} \sum_{i=1}^{n} \bigg( \frac{y_i - \hat f(x_i)}{1 - \mathrm{trace}(S)/n} \bigg)^2 .$$


Figure 6: Optimal selection of λ with GCV

3.4.2 Lasso regression in R

• Lasso regression

> library(lars)
> model.lars <- lars(x, y)
> plot(model.lars)

• Optimal selection of s with CV

> set.seed(123)
> cv.lasso <- cv.lars(x, y, K = 10, fraction = seq(0, 1, by = 0.05))

cv.lars() computes the 10-fold cross-validation MSEP (the MSE for the test data) of the lasso regression. We have set a random seed here in order to obtain a reproducible solution.


Figure 7: Lasso regression


Figure 8: Optimal selection of s with CV

For each considered fraction (the standardized value of s) the average MSEP over all CV segments is computed (points in Figure 8), together with its standard error (drawn as intervals). The optimal standardized s corresponds to the smallest value of s for which the mean MSEP is no more than one standard error above the minimum mean MSEP. This rule selects a fraction of 0.3.


4 Linear methods for classification

We assume that $K$ different classes exist and that the class membership is known for the training data. The task then is to establish a classification rule that allows a reliable assignment of the test data to the classes. Linear classification tries to find linear functions of the form (1) which separate the observations into the different classes. Observations within one class should have as many features in common as possible, whereas observations from different classes should have little in common. If the predictor $G(x)$ takes values in a discrete set $\mathcal{G}$, we can always divide the input space into regions corresponding to the classification. The decision boundaries can be either smooth or rough. The fitted linear model for the $k$th response is $\hat f_k(x) = \hat\beta_{k0} + \hat\beta_k^\top x$. The decision boundary between class $k$ and $l$ is then the set of points for which $\hat f_k(x) = \hat f_l(x)$, that is, $\{x : (\hat\beta_{k0} - \hat\beta_{l0}) + (\hat\beta_k - \hat\beta_l)^\top x = 0\}$. New data points $x$ can then be classified with this predictor.

4.1 Linear regression of an indicator matrix

Here each of the response categories is coded by an indicator variable. Thus if $\mathcal{G}$ has $K$ classes, there are $K$ such indicators $y_k$, $k = 1, 2, \ldots, K$, with

$$y_k = \begin{cases} 1 & \text{for } G = k \\ 0 & \text{otherwise.} \end{cases}$$

These response variables are collected in a vector $y = (y_1, y_2, \ldots, y_K)$, and the $n$ training instances form an $(n \times K)$ indicator response matrix $Y$ with 0/1 values indicating the class membership. We fit a linear regression model to each of the columns $y_k$ of $Y$, which gives

$$\hat\beta_k = (X^\top X)^{-1} X^\top y_k \qquad \text{for } k = 1, 2, \ldots, K .$$

The $\hat\beta_k$ can be combined in a $((p+1) \times K)$ matrix $\hat B$, since

$$\hat B = (\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_K) = (X^\top X)^{-1} X^\top (y_1, y_2, \ldots, y_K) = (X^\top X)^{-1} X^\top Y .$$

The fitted values are then

$$\hat Y = X \hat B = X (X^\top X)^{-1} X^\top Y .$$

A new observation $x$ can be classified as follows:

• Compute the vector of the $K$ fitted values:

$$\hat f(x) = \big[ (1, x^\top)\, \hat B \big]^\top = \big( \hat f_1(x), \ldots, \hat f_K(x) \big)^\top$$

• Identify the largest component of $\hat f(x)$ and classify $x$ accordingly:

$$\hat G(x) = \arg\max_{k \in \mathcal{G}}\ \hat f_k(x)$$


4.2 Discriminant analysis

4.2.1 Linear Discriminant Analysis (LDA)

$\mathcal{G}$ consists of $K$ classes. Let the prior probability of an observation belonging to class $k$ be $\pi_k$, $k = 1, 2, \ldots, K$, with $\sum_{k=1}^{K} \pi_k = 1$. Suppose $h_k(x)$ is the density function of $x$ in class $G = k$. Then

$$P(G = k \,|\, x) = \frac{h_k(x)\, \pi_k}{\sum_{l=1}^{K} h_l(x)\, \pi_l}$$

is the conditional probability that, for a given observation $x$, the random variable $G$ equals $k$. $h_k(x)$ is often assumed to be the density of a multivariate normal distribution $\varphi_k$,

$$\varphi_k(x) = \frac{1}{\sqrt{(2\pi)^p |\Sigma_k|}} \exp\Big( -\frac{(x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)}{2} \Big) .$$

LDA arises in the special case when we assume that the classes have a common covariance matrix $\Sigma_k = \Sigma$, $k = 1, 2, \ldots, K$ [see, e.g., Huberty, 1994]. When comparing two classes $k$ and $l$, it is sufficient to look at the log-ratio

$$\log \frac{P(G = k \,|\, x)}{P(G = l \,|\, x)} = \log \frac{\varphi_k(x)\, \pi_k}{\varphi_l(x)\, \pi_l} = \log \frac{\varphi_k(x)}{\varphi_l(x)} + \log \frac{\pi_k}{\pi_l} = \log \frac{\pi_k}{\pi_l} - \frac{1}{2} (\mu_k + \mu_l)^\top \Sigma^{-1} (\mu_k - \mu_l) + x^\top \Sigma^{-1} (\mu_k - \mu_l) .$$

The decision boundary between the classes $k$ and $l$ is given by

$$P(G = k \,|\, x) = P(G = l \,|\, x) ,$$

which is linear in $x$ (in $p$ dimensions this is a hyperplane). Using this equality, the log-ratio becomes

$$\log \frac{P(G = k \,|\, x)}{P(G = l \,|\, x)} = \log(1) = 0 = \log \varphi_k(x) + \log \pi_k - \log \varphi_l(x) - \log \pi_l$$

$$= \log \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} - \frac{1}{2} (x - \mu_k)^\top \Sigma^{-1} (x - \mu_k) + \log \pi_k - \log \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} + \frac{1}{2} (x - \mu_l)^\top \Sigma^{-1} (x - \mu_l) - \log \pi_l$$

$$= \Big( -\tfrac{1}{2} x^\top \Sigma^{-1} x + x^\top \Sigma^{-1} \mu_k - \tfrac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k \Big) + \Big( \tfrac{1}{2} x^\top \Sigma^{-1} x - x^\top \Sigma^{-1} \mu_l + \tfrac{1}{2} \mu_l^\top \Sigma^{-1} \mu_l - \log \pi_l \Big) = \delta_k(x) - \delta_l(x) ,$$

since the quadratic terms $\pm\frac{1}{2} x^\top \Sigma^{-1} x$ cancel. The linear discriminant scores for class $k$,

$$\delta_k(x) = x^\top \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k$$

(for $k = 1, \ldots, K$), provide a description of the decision rule. Observation $x$ is classified to the group for which $\delta_k(x)$ is largest, i.e.

$$\hat G(x) = \arg\max_{k}\ \delta_k(x) .$$

Considering the two classes $k$ and $l$, $x$ is classified to class $k$ if

$$\delta_k(x) > \delta_l(x) \iff x^\top \Sigma^{-1} (\mu_k - \mu_l) - \frac{1}{2} (\mu_k + \mu_l)^\top \Sigma^{-1} (\mu_k - \mu_l) > \log \frac{\pi_l}{\pi_k} .$$

The parameters of the distributions ($\mu_k$, $\Sigma$) as well as the prior probabilities $\pi_k$ are usually unknown and have to be estimated from the training data. Common estimates are:

$$\hat\pi_k = \frac{n_k}{n} \quad \text{with } n_k \text{ the number of observations in group } k ,$$

$$\hat\mu_k = \frac{1}{n_k} \sum_{g_i = k} x_i ,$$

$$\hat\Sigma = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{g_i = k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^\top .$$

Here $g_i$ denotes the true group number of observation $x_i$ from the training data. Using the linear discriminant functions we can now estimate the group membership of each $x_i$. This gives a value $\hat G(x_i)$, which can be compared to $g_i$. A reliable classification rule should correctly classify as many observations as possible. The misclassification rate is the relative frequency of incorrectly classified observations.
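A minimal sketch in R of the misclassification rate for LDA, here on the standard iris data rather than the data sets used elsewhere in this manuscript (note that the error computed on the training data is optimistic; a resampling method from Section 1.2.4 would give a more honest estimate):

library(MASS)
fit <- lda(Species ~ ., data = iris)     # estimates pi_k, mu_k and the pooled covariance
pred <- predict(fit)$class               # group with the largest discriminant score
table(iris$Species, pred)                # confusion table: true versus assigned groups
mean(pred != iris$Species)               # misclassification rate on the training data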

4.2.2 Quadratic Discriminant Analysis (QDA)

Just as in LDA we assume a prior probability $\pi_k$ for class $k$, $k = 1, 2, \ldots, K$, with $\sum_{k=1}^{K} \pi_k = 1$. The conditional probability that $G$ takes the value $k$ is

$$P(G = k \,|\, x) = \frac{\varphi_k(x)\, \pi_k}{\sum_{l=1}^{K} \varphi_l(x)\, \pi_l}$$

($\varphi$ is the density of the multivariate normal distribution). QDA does not assume the covariance matrices to be equal, which complicates the formulas. Considering the log-ratio of the conditional probabilities results in the quadratic discriminant scores (quadratic functions in $x$)

$$\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) + \log \pi_k ,$$

which can be used for the group assignment. Accordingly, observation $x$ is classified to the group which yields the maximal value of the quadratic discriminant score. $\pi_k$ and $\mu_k$ can be estimated from the training data as in LDA; $\Sigma_k$ can be estimated by the sample covariance matrix of the observations from group $k$.


4.2.3 Regularized Discriminant Analysis (RDA)

This method is a compromise between linear and quadratic discriminant analysis which shrinks the separate covariance matrices of QDA towards a common covariance matrix as in LDA. The common (pooled) covariance matrix has the form

$$\hat\Sigma = \frac{1}{\sum_{k=1}^{K} n_k} \sum_{k=1}^{K} n_k \hat\Sigma_k ,$$

where $n_k$ is the number of observations in group $k$. With the pooled covariance matrix, the regularized covariance matrices have the form

$$\hat\Sigma_k(\alpha) = \alpha \hat\Sigma_k + (1 - \alpha) \hat\Sigma$$

with $\alpha \in [0, 1]$. $\alpha$ provides a compromise between LDA ($\alpha = 0$) and QDA ($\alpha = 1$); the idea is to keep the degrees of freedom flexible. $\alpha$ can be estimated using cross-validation.
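A minimal sketch in R of this regularization step, with artificial data for the group covariance matrices (the package klaR provides a full RDA implementation with a related parametrization):

set.seed(1)
x1 <- matrix(rnorm(40), ncol = 2)        # artificial observations of group 1
x2 <- matrix(rnorm(60), ncol = 2)        # artificial observations of group 2
S <- list(cov(x1), cov(x2))              # estimated group covariance matrices
nk <- c(nrow(x1), nrow(x2))              # group sizes
Sp <- Reduce(`+`, Map(`*`, S, nk)) / sum(nk)   # pooled covariance matrix
alpha <- 0.5
alpha * S[[1]] + (1 - alpha) * Sp        # regularized covariance matrix of group 1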

4.3 Logistic regression

Logistic regression also deals with the problem of classifying observations that originate from two or more groups [see, e.g., Kleinbaum and Klein, 2002]. The difference to the previous classification methods is that the output of logistic regression includes inference statistics which provide information about which variables are well suited for separating the groups and which contribute nothing to this goal.

In logistic regression the posterior probabilities of the $K$ classes are modeled by linear functions in $x$, with the constraints that the probabilities remain in the interval $[0, 1]$ and that they sum up to 1. Consider the following models:

$$\log \frac{P(G = 1 \,|\, x)}{P(G = K \,|\, x)} = \beta_{10} + \beta_1^\top x$$

$$\log \frac{P(G = 2 \,|\, x)}{P(G = K \,|\, x)} = \beta_{20} + \beta_2^\top x$$

$$\vdots$$

$$\log \frac{P(G = K-1 \,|\, x)}{P(G = K \,|\, x)} = \beta_{(K-1)0} + \beta_{K-1}^\top x$$

Here we chose class $K$ as the denominator, but the choice of the denominator is arbitrary in the sense that the estimates are equivariant under this choice. A few calculations yield

$$P(G = k \,|\, x) = P(G = K \,|\, x) \exp\big( \beta_{k0} + \beta_k^\top x \big) \quad \text{for } k = 1, 2, \ldots, K-1 ,$$

$$\sum_{k=1}^{K} P(G = k \,|\, x) = 1 = P(G = K \,|\, x) \Big( 1 + \sum_{k=1}^{K-1} \exp\big( \beta_{k0} + \beta_k^\top x \big) \Big) ,$$

$$P(G = K \,|\, x) = \frac{1}{1 + \sum_{l=1}^{K-1} \exp\big( \beta_{l0} + \beta_l^\top x \big)} ,$$

$$P(G = k \,|\, x) = \frac{\exp\big( \beta_{k0} + \beta_k^\top x \big)}{1 + \sum_{l=1}^{K-1} \exp\big( \beta_{l0} + \beta_l^\top x \big)} \quad \text{for } k = 1, 2, \ldots, K-1 .$$


In the special case of $K = 2$ classes the model is

$$\log \frac{P(G = 1 \,|\, x)}{P(G = 2 \,|\, x)} = \beta_0 + \beta_1^\top x$$

or

$$\underbrace{P(G = 1 \,|\, x)}_{p_1(x)} = \frac{\exp\big( \beta_0 + \beta_1^\top x \big)}{1 + \exp\big( \beta_0 + \beta_1^\top x \big)} , \qquad \underbrace{P(G = 2 \,|\, x)}_{p_2(x)} = \frac{1}{1 + \exp\big( \beta_0 + \beta_1^\top x \big)} , \qquad p_1(x) + p_2(x) = 1 .$$

The parameters can be estimated by the ML method using the $n$ training data points. The log-likelihood function is

$$l(\beta) = \sum_{i=1}^{n} \log p_{g_i}(x_i; \beta) ,$$

where

$$\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix}$$

and $x_i$ includes the intercept. $p_{g_i}$ is the probability that observation $x_i$ belongs to class $g_i \in \{1, 2\}$. With

$$p(x; \beta) = p_1(x; \beta) = 1 - p_2(x; \beta)$$

and an indicator $y_i = 1$ for $g_i = 1$ and $y_i = 0$ for $g_i = 2$, $l(\beta)$ can be written as

$$l(\beta) = \sum_{i=1}^{n} \Big\{ y_i \log p(x_i; \beta) + (1 - y_i) \log[1 - p(x_i; \beta)] \Big\}$$

$$= \sum_{i=1}^{n} \bigg\{ y_i \log \frac{\exp(\beta^\top x_i)}{1 + \exp(\beta^\top x_i)} + (1 - y_i) \log \Big( 1 - \frac{\exp(\beta^\top x_i)}{1 + \exp(\beta^\top x_i)} \Big) \bigg\}$$

$$= \sum_{i=1}^{n} \Big\{ y_i \beta^\top x_i - y_i \log\big( 1 + \exp(\beta^\top x_i) \big) - (1 - y_i) \log\big( 1 + \exp(\beta^\top x_i) \big) \Big\}$$

$$= \sum_{i=1}^{n} \Big\{ y_i \beta^\top x_i - \log\big( 1 + \exp(\beta^\top x_i) \big) \Big\} .$$

To maximize the log-likelihood we set its derivative equal to zero. These score equations are p + 1 nonlinear functions in β of the form

∂l(β)/∂β = ∑_{i=1}^n [ yi xi − exp(βᵀxi) / (1 + exp(βᵀxi)) · xi ] = ∑_{i=1}^n [yi − p(xi; β)] xi = 0 .

To solve these equations, we use the Newton-Raphson algorithm, which requires the second derivative or Hessian matrix:

∂²l(β) / (∂β ∂βᵀ) = . . . = − ∑_{i=1}^n p(xi; β) [1 − p(xi; β)] xi xiᵀ

Starting with βold we get

βnew = βold − [ ∂²l(β) / (∂β ∂βᵀ) ]⁻¹ ∂l(β)/∂β ,

where the derivatives are evaluated at βold.

Matrix notation provides a better insight into this method:
y . . . (n × 1) vector of the yi
X . . . (n × (p + 1)) matrix of observations xi
p . . . (n × 1) vector of estimated probabilities p(xi, βold)
W . . . (n × n) diagonal matrix with weights p(xi, βold)(1 − p(xi, βold)) in the diagonal

The above derivatives can now be written as:

∂l(β)/∂β = Xᵀ(y − p)

∂²l(β) / (∂β ∂βᵀ) = −XᵀWX

The Newton-Raphson algorithm has the form

βnew = βold + (XᵀWX)⁻¹Xᵀ(y − p)
     = (XᵀWX)⁻¹XᵀW [Xβold + W⁻¹(y − p)]
     = (XᵀWX)⁻¹XᵀW z      (a weighted LS estimator)

with the adjusted response z = Xβold + W⁻¹(y − p).

This algorithm is also referred to as IRLS (iteratively reweighted LS), since each iteration solves the weighted least squares problem

βnew ←− argminβ (z − Xβ)ᵀ W (z − Xβ) .

Some remarks:

• β = 0 is a good starting value for the iterative procedure, although convergence is never guaranteed.

• For K ≥ 3 the Newton-Raphson algorithm can also be expressed as an IRLS algorithm, but in this case W is no longer a diagonal matrix.

• Logistic regression is mostly used for modeling and inference. The goal is to understand the role of the input variables in explaining the outcome.
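To make the iterations concrete, here is a minimal two-class IRLS sketch (not the manuscript's code; the function name irls.logreg is arbitrary, X is the raw regressor matrix and y a 0/1 vector):

# Newton-Raphson / IRLS for logistic regression, following the matrix formulas above
irls.logreg <- function(X, y, maxit = 25, tol = 1e-8) {
  X <- cbind(1, X)                     # add the intercept column
  beta <- rep(0, ncol(X))              # starting value beta = 0
  for (it in 1:maxit) {
    p <- 1 / (1 + exp(-X %*% beta))    # current probabilities p(x_i; beta)
    W <- as.numeric(p * (1 - p))       # diagonal weights
    z <- X %*% beta + (y - p) / W      # adjusted response
    beta.new <- solve(t(X) %*% (W * X), t(X) %*% (W * z))   # weighted LS step
    converged <- max(abs(beta.new - beta)) < tol
    beta <- beta.new
    if (converged) break
  }
  drop(beta)
}

The resulting coefficients should agree with those from glm(..., family = binomial).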

Comparison of logistic regression and LDA: Logistic regression uses

log [P(G = k|x) / P(G = K|x)] = βk0 + βkᵀx ,

whereas LDA uses

log [P(G = k|x) / P(G = K|x)] = log(πk/πK) − (1/2)(µk + µK)ᵀ Σ⁻¹ (µk − µK) + xᵀ Σ⁻¹ (µk − µK)
                              = αk0 + αkᵀx .

The linearity in x in LDA is achieved by:

• the assumption of equal covariance matrices

• the assumption of multivariate normally distributed groups

Even though the models have the same form, the coefficients are estimated differently. The joint distribution of x and G can be expressed as

P (x, G = k) = P (x)P (G = k|x) .

Logistic regression does not specify P(x), whereas LDA assumes a mixture model (with ϕ as the normal distribution):

P(x) = ∑_{k=1}^K πk ϕ(x; µk, Σ)

5 Linear methods for classification in R

5.1 Linear regression of an indicator matrix in R

The different classification methods introduced in this manuscript will be illustrated with the so-called Pima Indian data. The data set can be found in the R package library(mlbench) under the name “PimaIndiansDiabetes2”.

> library(mlbench)
> data(PimaIndiansDiabetes2)
> pid <- na.omit(PimaIndiansDiabetes2)

The data consist of a population of 392 women with “Pima Indian” heritage, who live in the area of Phoenix, Arizona. They were tested for diabetes. The goal is to get a classification rule for the diagnosis of diabetes. The variables are:

• pregnant: number of pregnancies

• glucose: plasma glucose concentration (glucose tolerance test)

• pressure: diastolic blood pressure (mm Hg)

• triceps: triceps skin fold thickness (mm)

• insulin: serum insulin concentration

• mass: body mass index (BMI)

• pedigree: diabetes pedigree function

• age: age in years

• diabetes: “pos” or “neg” for the diabetes diagnosis

Generation of the indicator vector for the regression:

> pidind <- pid
> pidind$diabetes <- as.numeric(pid$diabetes) - 1

Random selection (using a fixed random seed) of the training data, and fitting of a linear model:

> set.seed(101)
> train <- sample(1:nrow(pid), 300)
> mod.reg <- lm(diabetes ~ ., data = pidind[train, ])

The model is applied to the test data, and the resulting misclassification rate is computed.

> predictind <- predict(mod.reg, newdata = pidind[-train, ])
> TAB <- table(pid$diabetes[-train], predictind > 0.5)
> mcrreg <- 1 - sum(diag(TAB))/sum(TAB)
> mcrreg

[1] 0.2391304

The resulting misclassification rate is 0.239. The misclassification rates of different classification methods will be collected in a table and then compared and analyzed throughout the rest of this manuscript.

      INDR   LDA   QDA   RDA   GLM   GAM
MCR  0.239

5.2 Discriminant analysis in R

5.2.1 Linear discriminant analysis in R

Classification of the PimaIndianDiabetes data with lda() from library(MASS)

> set.seed(101)
> train <- sample(1:nrow(pid), 300)
> library(MASS)
> mod.lda <- lda(diabetes ~ ., data = pid[train, ])
> predictLDA <- predict(mod.lda, newdata = pid[-train, ])
> TAB <- table(pid$diabetes[-train], predictLDA$class)
> mcrlda <- 1 - sum(diag(TAB))/sum(TAB)
> mcrlda

[1] 0.2391304

      INDR   LDA   QDA   RDA   GLM   GAM
MCR  0.239 0.239

5.2.2 Quadratic discriminant analysis in R

Classification of the PimaIndianDiabetes data with qda() from library(MASS)

> set.seed(101)
> train <- sample(1:nrow(pid), 300)
> library(MASS)
> mod.qda <- qda(diabetes ~ ., data = pid[train, ])
> predictQDA <- predict(mod.qda, newdata = pid[-train, ])
> TAB <- table(pid$diabetes[-train], predictQDA$class)
> mcrqda <- 1 - sum(diag(TAB))/sum(TAB)
> mcrqda

[1] 0.25

      INDR   LDA   QDA   RDA   GLM   GAM
MCR  0.239 0.239  0.25

5.2.3 Regularized discriminant analysis in R

Classification of the PimaIndianDiabetes data with rda() from library(klaR)

> set.seed(101)
> train <- sample(1:nrow(pid), 300)
> library(klaR)
> mod.rda <- rda(diabetes ~ ., data = pid)
> predictRDA <- predict(mod.rda, newdata = pid[-train, ])
> TAB <- table(pid$diabetes[-train], predictRDA$class)
> mcrrda <- 1 - sum(diag(TAB))/sum(TAB)
> mcrrda

[1] 0.2173913

      INDR   LDA   QDA   RDA   GLM   GAM
MCR  0.239 0.239  0.25 0.217

5.3 Logistic regression in R

• Fit of a model with glm()

Logistic regression can be carried out with the function glm() (for Generalized Linear Models) using the binomial family. Syntax and interpretation of the inference statistics are analogous to lm().

> modelglm <- glm(diabetes ~ ., data = pid, family = binomial)
> summary(modelglm)

Call:
glm(formula = diabetes ~ ., family = binomial, data = pid)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.7823 -0.6603 -0.3642  0.6409  2.5612

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.004e+01  1.218e+00  -8.246  < 2e-16 ***
pregnant     8.216e-02  5.543e-02   1.482  0.13825
glucose      3.827e-02  5.768e-03   6.635 3.24e-11 ***
pressure    -1.420e-03  1.183e-02  -0.120  0.90446
triceps      1.122e-02  1.708e-02   0.657  0.51128
insulin     -8.253e-04  1.306e-03  -0.632  0.52757
mass         7.054e-02  2.734e-02   2.580  0.00989 **
pedigree     1.141e+00  4.274e-01   2.669  0.00760 **
age          3.395e-02  1.838e-02   1.847  0.06474 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 498.10  on 391  degrees of freedom
Residual deviance: 344.02  on 383  degrees of freedom

AIC: 362.02

Number of Fisher Scoring iterations: 5

• Model selection with step()

Reducing the number of explanatory variables can improve the prediction performance. Stepwise logistic regression has the same syntax as stepwise linear regression:

> mod.glm <- step(modelglm, direction = "both")

The result of stepwise logistic regression is a smaller model that should have a better prediction performance than the full model. The inference statistics tell

which variables are significantly contributing to the group separation:

> summary(mod.glm)

Call:
glm(formula = diabetes ~ pregnant + glucose + mass + pedigree +
    age, family = binomial, data = pid)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-2.8827 -0.6535 -0.3694  0.6521  2.5814

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.992080   1.086866  -9.193  < 2e-16 ***
pregnant     0.083953   0.055031   1.526 0.127117
glucose      0.036458   0.004978   7.324 2.41e-13 ***
mass         0.078139   0.020605   3.792 0.000149 ***
pedigree     1.150913   0.424242   2.713 0.006670 **
age          0.034360   0.017810   1.929 0.053692 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 498.10 on 391 degrees of freedom

Residual deviance: 344.89 on 386 degrees of freedom

AIC: 356.89

Number of Fisher Scoring iterations: 5

• Prediction of the class membership and presentation of the results, see Figure 9

> predictGLM <- predict(mod.glm, newdata = pid)
> plot(predictGLM, pch = as.numeric(pid$diabetes) + 1)

Figure 9: Prediction of the group memberships; the line is the separation from logistic regression, the symbol corresponds to the true group membership.

• Misclassification rate

Simple evaluation of the misclassification rate:

> set.seed(101)
> train <- sample(1:nrow(pid), 300)
> predictLR <- predict(mod.glm, newdata = pid[-train, ])
> TAB <- table(pid$diabetes[-train], predictLR > 0)
> mcrlr <- 1 - sum(diag(TAB))/sum(TAB)
> mcrlr

[1] 0.2173913

      INDR   LDA   QDA   RDA   GLM   GAM
MCR  0.239 0.239  0.25 0.217 0.217

6 Basis expansions

So far we have used linear models for both regression and classification because

• they are easy to interpret and have a closed solution;

• with small n and/or large p, linear models are often the only possibility to avoid overfitting.

However, in reality there is often no linear relation, and the errors are not normally distributed. Methods that are not based on linearity are:

• Generalized Linear Models (GLM): The extension of normally distributed errors to the family of exponential distributions like the Gamma or Poisson distribution [see, e.g., McLachlan and Peel, 2000];

• Mixture models: The populations originate from K different classes, where each class is modeled with its own parameters;

• Nonlinear regression: Parametric nonlinear relation between the regressor and the response variable(s), e.g. y = a exp(bx) + ε.

Nonlinearity can be introduced by basis expansions. The idea is to augment/replace the vectors of inputs x with transformations of x, and then use linear models in this new space of derived input features. Denote by hm(x) the mth transformation of x, m = 1, . . . , M. We then model

f(x) = ∑_{m=1}^M βm hm(x) .

Some widely used examples are:

• hm(x) = xm, m = 1, . . . , p . . . describes the original linear model

• hm(x) = xj² or hm(x) = xj xk . . . quadratic transformations

• hm(x) = log(xj), √xj . . . nonlinear transformations

• hm(x) = I(Lm ≤ xk < Um) . . . indicator function, which results in models with a constant contribution for xk in the interval [Lm, Um), or piecewise constant fits in case of several non-overlapping regions.
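A small R sketch of this idea (the simulated data and the chosen transformations are only for illustration): the derived inputs hm(x) are computed first and then passed to an ordinary linear model.

# Basis expansion: build derived input features and fit a linear model in them
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + 0.3 * rnorm(100)
H <- data.frame(y = y, h1 = x, h2 = x^2, h3 = log(x + 1),
                h4 = as.numeric(x >= 5))          # quadratic, log and indicator terms
mod <- lm(y ~ ., data = H)                        # linear model in the new feature space
summary(mod)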

6.1 Interpolation with splines

A spline is created by joining together several functions [compare Hastie and Tibshirani, 1990]. The name comes from a tool called “spline” (a tool for drawing curves). This thin flexible rod is fixed by weights and then used to draw curves through the given points. Since polynomials are among the easiest functions, they are often preferred as basic elements for splines.

6.1.1 Piecewise polynomial functions

We assume x to be univariate. A piecewise polynomial function f(x) is obtained by dividing the domain of x into k + 1 disjoint intervals and defining for each interval (−∞, ξ1), [ξ1, ξ2), . . ., [ξk−1, ξk), [ξk, ∞) a separate polynomial function of order ≤ M. The boundaries ξ1, . . . , ξk of the intervals are called knots. Piecewise constant functions are the easiest of all piecewise polynomials. Piecewise polynomials of order M = 2, 3, 4 are called piecewise linear, quadratic, and cubic polynomials, respectively. To determine a piecewise polynomial function of order M with k knots ξ1, . . . , ξk we need M(k + 1) parameters, since each of the k + 1 polynomials consists of M coefficients.

[Figure 10 panels: Piecewise constant (upper left), Piecewise linear (upper right), Continuous piecewise linear (lower left), Piecewise-linear basis function (lower right); the knots ξ1 and ξ2 are marked in each panel.]

Figure 10: Piecewise constant and linear polynomials

Figure 10 shows data generated from the model shown by the dashed lines, with additional random noise added. The data range is split into 3 regions, thus we obtain the knots ξ1 and ξ2. In the upper left picture we fit a constant function in each interval. The basis functions are thus:

h1(x) = I(x < ξ1),   h2(x) = I(ξ1 ≤ x < ξ2),   h3(x) = I(ξ2 ≤ x)

It is easy to see that the LS solutions for the model f(x) = ∑_{m=1}^3 βm hm(x) are the arithmetic means β̂m = ȳm of the y-values in each region.

Figure 10 upper right shows a piecewise fit by linear functions. Thus we need 3 additional basis functions, namely hm+3(x) = hm(x) · x for m = 1, 2, 3. In the lower left plot we also use piecewise linear functions, but with the constraint of continuity at the knots. These constraints also lead to constraints on the parameters. For example, at the first knot we require f(ξ1⁻) = f(ξ1⁺), which means that β1 + ξ1β4 = β2 + ξ1β5. Thus the number of parameters is reduced by 1 per knot; for 2 knots by 2, and we end up with 4 free parameters in the model.

These basis functions and the constraints can be obtained in a more direct way by using the following definitions:

h1(x) = 1,   h2(x) = x,   h3(x) = (x − ξ1)+,   h4(x) = (x − ξ2)+

where t+ denotes the positive part. The function h3 is shown in the lower right panel of Figure 10.
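As a small illustration (simulated data and arbitrarily chosen knots, not part of the original example), the continuous piecewise linear fit can be reproduced with exactly these four basis functions:

# Continuous piecewise linear fit via the truncated power basis
# h1(x)=1, h2(x)=x, h3(x)=(x-xi1)+, h4(x)=(x-xi2)+
set.seed(1)
x   <- sort(runif(80, 0, 3))
y   <- ifelse(x < 1, 1, ifelse(x < 2, 2 * x - 1, 3)) + 0.2 * rnorm(80)
xi  <- c(1, 2)                         # the knots xi1 and xi2
pos <- function(t) pmax(t, 0)          # positive part t_+
fit <- lm(y ~ x + pos(x - xi[1]) + pos(x - xi[2]))
plot(x, y)
lines(x, fitted(fit), lwd = 2)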

6.1.2 Splines

A spline of order M with knots ξi, i = 1, . . . , k, is a piecewise polynomial function of order M which has continuous derivatives up to order M − 2. The general form of the basis functions for splines is

hj(x) = x^(j−1),   j = 1, . . . , M,
hM+l(x) = (x − ξl)^(M−1)+,   l = 1, . . . , k.

We have: number of basis functions = number of parameters (= df).

A cubic spline (M = 4) with 2 knots has the following basis functions:

h1(x) = 1,   h3(x) = x²,   h5(x) = (x − ξ1)³+
h2(x) = x,   h4(x) = x³,   h6(x) = (x − ξ2)³+

Figure 11 shows piecewise cubic polynomials, with increasing order of continuity at the knots. The curve in the lower right picture has continuous derivatives of order 1 and 2, and thus it is a cubic spline.

Usually there is no need to go beyond cubic splines. In practice the most widely used orders are M = 1, M = 2 and M = 4.

The following parameters have to be chosen:

• order M of the splines

• number of knots

• placement of knots: chosen by the user, for instance at the appropriate percentiles of x (e.g. for k = 3 at the percentiles 25, 50, 75%).

The spline basis functions suggested above are not too attractive numerically. The so-called B-spline basis is numerically more suitable, and it is an equivalent form of the basis. In R the function for B-splines is called bs(), and the argument df allows to select the number of spline basis functions.

[Figure 11 panels: Discontinuous, Continuous, Continuous first derivative, Continuous second derivative; the knots ξ1 and ξ2 are marked in each panel.]

Figure 11: Piecewise cubic polynomials

6.1.3 Natural cubic splines

Polynomials tend to be erratic near the lower and upper data range, which can result in poor approximations (Figure 12). This can be avoided by using natural cubic splines (Figure 13). They have the additional constraint that the function has to be linear beyond the boundary knots. In this way we get back 4 degrees of freedom (two constraints each in both boundary regions), which can then be “invested” in a larger number of knots.

6.2 Smoothing splines

This spline method avoids the knot selection problem by controlling the complexity of the fit through regularization.

Consider the following problem: Among all functions f(x) with two continuous derivatives, find the one that minimizes the penalized residual sum of squares

RSS(f, λ) = ∑_{i=1}^n (yi − f(xi))² + λ ∫ [f″(t)]² dt .    (3)

The first term measures closeness to the data, the second term penalizes curvature in the function. The smoothing parameter λ establishes a tradeoff between the two terms. There are two special cases:

• λ = 0: f can be any function that interpolates the data

• λ = ∞: the simple least-squares fit, where no second derivative can be tolerated

(3) has a unique minimizer, which is a natural cubic spline with knots at the values xi, i = 1, . . . , n. It seems that this solution is over-parameterized, since n knots correspond to n degrees of freedom. However, the penalty term reduces them, since the spline coefficients are shrunk towards the linear fit.

Since the solution is a natural cubic spline, we can write it as

f(x) = ∑_{j=1}^n Nj(x) θj

where the Nj form an n-dimensional set of basis functions representing the natural cubic splines. The criterion (3) thus reduces to

RSS(θ, λ) = (y − Nθ)ᵀ(y − Nθ) + λ θᵀ ΩN θ

with Nij = Nj(xi) and (ΩN)jk = ∫ Nj″(t) Nk″(t) dt. The solution is obtained as

θ̂ = (NᵀN + λΩN)⁻¹ Nᵀ y ,

which is a generalized form of Ridge regression. The fit of the smoothing spline is given by

f̂(x) = ∑_{j=1}^n Nj(x) θ̂j .

7 Basis expansions in R

7.1 Interpolation with splines in R

• Model fit with B-splines

Generation of a sine function overlaid with noise.

> x = seq(1, 10, length = 100)
> y = sin(x) + 0.1 * rnorm(x)
> x1 = seq(-1, 12, length = 100)

Since we are looking for a linear relationship between the y variable and the spline bases, the function lm() can be used (Figure 12):

> library(splines)
> lm1 = lm(y ~ bs(x, df = 6))
> plot(x, y, xlim = range(x1))
> lines(x1, predict.lm(lm1, list(x = x1)))

• Usage of natural cubic splines (Figure 13)

> lm3 = lm(y ~ ns(x, df = 6))

> lines(x1, predict.lm(lm3, list(x = x1)))


Figure 12: Prediction with 6 spline basis functions


Figure 13: Prediction with natural cubic splines

7.2 Smoothing splines in R

• Fit with smooth.spline() from library(splines)

> library(splines)
> m1 = smooth.spline(x, y, df = 6)
> plot(x, y, xlim = range(x1), ylim = c(-1.5, 1.5))
> lines(predict(m1, x1))


Figure 14: Prediction at the boundaries with smoothing splines

• Choice of degrees of freedom with cross-validation

> m2 = smooth.spline(x, y, cv = TRUE)
> plot(x, y, xlim = range(x1), ylim = c(-1.5, 1.5))
> lines(predict(m2, x1), lty = 2)

[Figure 15: smoothing spline fits with df = 6 and df = 11.7 (chosen by cross-validation).]

Figure 15: Choice of degrees of freedom

• Example: smoothing splines with the bone density data

The bone density data consist of 261 teens from North America. The average age of the teens (age) and the relative change in the bone density (spnbmd) were measured at two consecutive visits.

> library(ElemStatLearn)
> data(bone)

The following R code plots the data, separately for male and female persons, and fits smoothing splines to the data subsets.

> plot(spnbmd ~ age, data = bone, pch = ifelse(gender == "male",
+     1, 3), xlab = "Age", ylab = "Relative Change in Spinal BMD")
> bone.spline.male <- smooth.spline(bone$age[bone$gender == "male"],
+     bone$spnbmd[bone$gender == "male"], df = 12)
> bone.spline.female <- smooth.spline(bone$age[bone$gender == "female"],
+     bone$spnbmd[bone$gender == "female"], df = 12)
> lines(bone.spline.male, lty = 1, lwd = 2)
> lines(bone.spline.female, lty = 2, lwd = 2)
> legend("topright", legend = c("Male data", "Female data", "Model for male",
+     "Model for female"), pch = c(1, 3, NA, NA), lty = c(NA, NA,
+     1, 2), lwd = 2, xjust = 1)


Figure 16: Bone density data

The response variable measures the relative change of the bone density. For the male and the female measurements, separate smoothing splines with 12 degrees of freedom were fitted. This corresponds to a value of λ ≈ 0.0002.

8 Generalized Additive Models (GAM)

For GAMs the weighted sum of the regressor variables is replaced by a weighted sum of transformed regressor variables [see Hastie and Tibshirani, 1990]. In order to achieve more flexibility, the relations between y and x are modeled in a non-parametric way, for instance by cubic splines. This allows one to identify and characterize nonlinear effects in a better way.

8.1 General aspects on GAM

GAMs are generalizations of Generalized Linear Models (GLM) to nonlinear functions. Let us consider the special case of multiple linear regression. There, we model the conditional expectation of y by a linear function:

IE(y|x1, x2, . . . , xp) = α + β1x1 + . . . + βpxp

A generalization is to use unspecified nonlinear (but smooth) functions fj instead of the original regressor variables xj:

IE(y|x1, x2, . . . , xp) = α + f1(x1) + f2(x2) + . . . + fp(xp)

Another regression model is the logistic regression model, which in the case of 2 groups is

log [µ(x) / (1 − µ(x))] = α + β1x1 + . . . + βpxp ,

where µ(x) = P(y = 1|x). The generalization with nonlinear functions is called the additive logistic regression model, and it replaces the linear terms by

log [µ(x) / (1 − µ(x))] = α + f1(x1) + . . . + fp(xp) .

For GLMs, and analogously for GAMs, the “left hand side” is replaced by different other functions. The right hand side remains a linear combination of the input variables for GLM, or of nonlinear functions of the input variables for GAM.

Generally we have for GAM:

The conditional expectation µ(x) of y is related to an additive function of the predictors via a link function g:

g[µ(x)] = α + f1(x1) + . . . + fp(xp)

Examples for classical link functions are:

• g(µ) = µ: the identity link, used for linear and additive models of Gaussian response data

• g(µ) = logit(µ) or g(µ) = probit(µ): the logit or probit link function (probit(µ) = Φ⁻¹(µ)) is used for modeling binomial probabilities

• g(µ) = log(µ): log-linear or log-additive models for Poisson count data.

All these link functions arise from the exponential family, which forms the class of generalized linear models. These are all extended in the same way to generalized additive models.
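As a brief illustration (simulated data, not from the manuscript), the three classical links correspond to the following family arguments of glm(); gam() from library(mgcv), used in Section 9, accepts the same family specifications:

# The three classical link functions as family arguments in R
set.seed(1)
x <- runif(200)
y.gauss <- 1 + 2 * x + rnorm(200, sd = 0.3)
y.binom <- rbinom(200, 1, plogis(-1 + 3 * x))
y.pois  <- rpois(200, lambda = exp(0.5 + x))
m1 <- glm(y.gauss ~ x, family = gaussian)   # identity link
m2 <- glm(y.binom ~ x, family = binomial)   # logit link; probit: binomial(link = "probit")
m3 <- glm(y.pois  ~ x, family = poisson)    # log link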

8.2 Parameter estimation with GAM

The model has the form

yi = α + ∑_{j=1}^p fj(xij) + εi,   i = 1, . . . , n

with IE(εi) = 0. Given observations (xi, yi), a criterion like the penalized residual sum of squares (PRSS) can be specified for this problem:

PRSS(α, f1, f2, . . . , fp) = ∑_{i=1}^n [ yi − α − ∑_{j=1}^p fj(xij) ]² + ∑_{j=1}^p λj ∫ [fj″(tj)]² dtj

The term ∫ [fj″(tj)]² dtj measures how strongly the function fj is curved: for linear fj the integral is 0, nonlinear fj yield values larger than 0. The λj are tuning parameters. They regularize the tradeoff between the fit of the model and the roughness of the functions. The larger λj, the smoother the function. It can be shown that (independent of the choice of λj) an additive model with cubic splines minimizes the PRSS: each of the functions fj is a cubic spline in the component xj with knots xij, i = 1, . . . , n. Uniqueness of the solution is obtained by the restrictions

∑_{i=1}^n fj(xij) = 0   ∀j   =⇒   α̂ = (1/n) ∑_{i=1}^n yi =: ave(yi)

and the non-singularity of X .

Iterative algorithm for finding the solution:

1. Initialization: α̂ = ave(yi), f̂j ≡ 0 ∀i, j

2. Cycle over j = 1, 2, . . . , p, 1, 2, . . . , p, . . .

   • f̂j ← Sj[ {yi − α̂ − ∑_{k≠j} f̂k(xik)}, i = 1, . . . , n ]

   • f̂j ← f̂j − (1/n) ∑_{i=1}^n f̂j(xij)

   until all f̂j are stabilized.

The functions Sj[·] = ∑_{l=1}^n Nl(·) θ̂l denote cubic smoothing splines, see Section 6.2. This algorithm is also known as the backfitting algorithm.
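A compact sketch of the backfitting idea (simulated data; smooth.spline() with a fixed df is used here as the smoother Sj, and the function name backfit is arbitrary):

# Backfitting with cubic smoothing splines as the smoothers S_j
backfit <- function(X, y, df = 4, maxit = 20) {
  n <- nrow(X); p <- ncol(X)
  alpha <- mean(y)                                       # alpha-hat = ave(y_i)
  f <- matrix(0, n, p)                                   # fitted functions, initialized to 0
  for (it in 1:maxit) {
    for (j in 1:p) {
      r <- y - alpha - rowSums(f[, -j, drop = FALSE])    # partial residuals
      sj <- smooth.spline(X[, j], r, df = df)
      f[, j] <- predict(sj, X[, j])$y
      f[, j] <- f[, j] - mean(f[, j])                    # re-centering step
    }
  }
  list(alpha = alpha, f = f)
}
# e.g. with two simulated regressors:
# X <- cbind(runif(200, 0, 3), runif(200, -2, 2))
# y <- sin(2 * X[, 1]) + X[, 2]^2 + 0.2 * rnorm(200)
# fit <- backfit(X, y)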

9 Generalized additive models in R

• Fit of a model to the PimaIndianDiabetes data with gam() from library(mgcv)

> library(mgcv)
> set.seed(101)
> train = sample(1:nrow(pid), 300)
> mod.gam <- gam(diabetes ~ s(pregnant) + s(insulin) + s(pressure) +
+     s(triceps) + s(glucose) + s(age) + s(mass) + s(pedigree),
+     data = pid, subset = train)
> summary(mod.gam)

Family: gaussian
Link function: identity

Formula:
diabetes ~ s(pregnant) + s(insulin) + s(pressure) + s(triceps) +
    s(glucose) + s(age) + s(mass) + s(pedigree)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.33000    0.02067   15.96   <2e-16 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Est.rank     F  p-value
s(pregnant) 3.043        7 2.209   0.0338 *
s(insulin)  2.095        5 2.207   0.0538 .
s(pressure) 1.000        1 0.548   0.4597
s(triceps)  1.000        1 1.890   0.1703
s(glucose)  7.322        9 6.708 1.11e-08 ***
s(age)      7.602        9 4.237 3.74e-05 ***
s(mass)     3.218        7 1.396   0.2070
s(pedigree) 1.269        3 1.043   0.3738

---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.422   Deviance explained = 47.3%
GCV score = 0.14119   Scale est. = 0.12822   n = 300

The estimated degrees of freedom (“edf”) for each term have been computed by GCV. In this case, 3.043 edf's have been used for the term pregnant. The degrees of freedom of pressure and triceps are each 1. This suggests that both fits represent a straight line (Figure 17). The estimated GCV value of 0.14 indicates that the model provides a good fit.

[Figure 17 panels: estimated smooth terms for pregnant, insulin, pressure, triceps, glucose, age, mass, and pedigree.]

Figure 17: Fit of the different terms with additional 95% confidence regions

• Prediction with gam()

> gam.res <- predict(mod.gam, pid[-train, ]) > 0.5
> gam.TAB <- table(pid$diabetes[-train], as.numeric(gam.res))
> gam.TAB

     0  1
  0 51 10
  1 16 15

• Misclassification rate for test set

> mcrgam <- 1 - sum(diag(gam.TAB))/sum(gam.TAB)
> mcrgam

[1] 0.2826087

      INDR   LDA   QDA   RDA   GLM   GAM
MCR  0.239 0.239  0.25 0.217 0.217 0.283

According to the table above, GAMs show a rather poor performance in comparison to the other methods. These results, however, are obtained from the selected training and test set, and the picture could be different for another selection. Therefore, it is advisable to repeat the random selection of training and test data several times, to apply the classification methods, and to average the resulting misclassification rates.
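A minimal sketch of such a repeated evaluation (shown here only for LDA, with an arbitrary number of 100 replications; library(MASS) is assumed to be loaded):

# Repeat the random training/test split and average the misclassification rate
set.seed(123)
mcr <- replicate(100, {
  train <- sample(1:nrow(pid), 300)
  mod   <- lda(diabetes ~ ., data = pid[train, ])
  pred  <- predict(mod, newdata = pid[-train, ])$class
  TAB   <- table(pid$diabetes[-train], pred)
  1 - sum(diag(TAB)) / sum(TAB)
})
mean(mcr)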

10 Tree-based methods

Tree-based methods partition the feature space of the x-variables into a set of rectangular regions which should be as homogeneous as possible, and then fit a simple model in each region [see, e.g., Breiman et al., 1984]. In each step, a decision rule is determined by a split variable and a split point, which afterwards is used to assign an observation to the corresponding partition. Then a simple model (i.e. a constant) is fit to every region. To simplify matters, we restrict attention to binary partitions; therefore we always have only two branches in each split. Mostly, the result is presented in the form of a tree, leading to an easy to understand and interpretable representation. Tree models are nonparametric estimation methods, since no assumptions about the distribution of the regressors are made. They are very flexible in application, which also makes them computationally intensive, and the results are highly dependent on the observed data. Even a small change in the observations can result in a severe change of the tree structure.

10.1 Regression trees

We have data of the form (xi, yi), i = 1, . . . , n, with xi = (xi1, . . . , xip). The algorithm has to make decisions about the:

• Split variable

• Split point

• Form of the tree

Suppose that we have a partition into M regions R1, . . . , RM and we model the response as a constant cm in each region: f(x) = ∑_{m=1}^M cm I(x ∈ Rm). If we adopt as our criterion minimization of the sum of squares ∑ (yi − f(xi))², it is easy to see that the best cm is just the average of yi in each region:

ĉm = ave(yi | xi ∈ Rm)

Now finding the best binary partition in terms of minimization of the above sum-of-squares criterion is generally computationally infeasible. Hence we look for an approximation of the solution.

Approximative solution:

• Consider a split variable xj and a split point s and define a pair of half-planes by

R1(j, s) = {x | xj ≤ s} ;   R2(j, s) = {x | xj > s}

• Search for the splitting variable xj and the split point s that solve

min_{j,s} [ min_{c1} ∑_{xi∈R1(j,s)} (yi − c1)² + min_{c2} ∑_{xi∈R2(j,s)} (yi − c2)² ] .

For any choice of j and s, the inner minimization is solved by

ĉ1 = ave(yi | xi ∈ R1(j, s))   and   ĉ2 = ave(yi | xi ∈ R2(j, s)) .
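To make the search concrete, here is a minimal sketch (a single split for a single variable, simulated usage only; this is not the rpart() implementation) that scans all candidate split points:

# Exhaustive search of the best split point s for one split variable x
best.split <- function(x, y) {
  s.cand <- sort(unique(x))
  s.cand <- s.cand[-length(s.cand)]    # drop the largest value so R2 is never empty
  rss <- sapply(s.cand, function(s) {
    left  <- y[x <= s]                 # R1(j, s)
    right <- y[x >  s]                 # R2(j, s)
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  c(split = s.cand[which.min(rss)], rss = min(rss))
}
# Repeating this for every variable j yields the best pair (j, s) for one node.
# e.g.: x <- runif(100); y <- ifelse(x < 0.4, 1, 3) + 0.2 * rnorm(100); best.split(x, y)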

For each splitting variable, the determination of the split point s can be done very quickly, and hence by scanning through all of the inputs, the determination of the best pair (j, s) is feasible. Afterwards, this algorithm is applied to all other regions. In order to avoid overfitting, we use a tuning parameter which tries to regulate the model's complexity. The strategy is to grow a large tree T0, stopping the splitting process when some minimum node size is reached. Then this large tree is pruned using cost complexity pruning. By pruning (thus reduction of inner nodes) of T0, we get a sub-tree T. We index terminal nodes by m, representing region Rm. Let |T| denote the number of terminal nodes in T and

ĉm = (1/nm) ∑_{xi∈Rm} yi ,   nm . . . number of observations in region Rm

Qm(T) = (1/nm) ∑_{xi∈Rm} (yi − ĉm)²

and thus we define the cost complexity criterion

cα(T) = ∑_{m=1}^{|T|} nm Qm(T) + α|T|

which has to be minimized. The tuning parameter α ≥ 0 regulates the compromise between tree size (large α results in a small tree) and goodness of fit (α = 0 results in the full tree T0). The optimal value of α for the final tree Tα can be chosen by cross-validation.

It can be shown that for every α there exists a unique smallest sub-tree Tα ⊆ T0 which minimizes cα. For finding Tα, we successively eliminate that internal node which yields the smallest increase (per node) of ∑_m nm Qm(T). This is done until no internal node is left. It can be shown that this sequence of sub-trees must include Tα.

10.2 Classification trees

The goal is to partition the space of the x-variables such that the observations are separated into the classes 1, . . . , K, which are given by the known output variable y with values in 1, . . . , K. Afterwards, new data should be assigned to the corresponding class.

In a node m representing a region Rm with nm observations, let

p̂mk = (1/nm) ∑_{xi∈Rm} I(yi = k)

be the proportion of class k in node m. We classify the observations in node m to that class k(m) for which k(m) = argmaxk p̂mk. So, the observation is assigned to the majority class in node m.

Different measures Qm(T ) of node impurity include the following:

1. Misclassification error: (1/nm) ∑_{i∈Rm} I(yi ≠ k(m)) = 1 − p̂mk(m)

2. Gini index: ∑_{k≠k′} p̂mk p̂mk′ = ∑_{k=1}^K p̂mk (1 − p̂mk)

3. Cross-entropy or deviance: −∑_{k=1}^K p̂mk log p̂mk

The three criteria are very similar, but cross-entropy and the Gini index are differentiable, and hence more amenable to numerical optimization. In addition, cross-entropy and the Gini index are more sensitive to changes in the node probabilities than the misclassification rate. For this reason, either the Gini index or cross-entropy should be used when growing a tree.
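A small helper (purely illustrative, not from the manuscript) that computes the three impurity measures from the class frequencies in a node:

# Node impurity measures based on the estimated class proportions p_mk
node.impurity <- function(y.node) {
  p <- table(y.node) / length(y.node)            # class proportions in the node
  c(misclass = 1 - max(p),
    gini     = sum(p * (1 - p)),
    entropy  = -sum(ifelse(p > 0, p * log(p), 0)))
}
# e.g. node.impurity(c("spam", "good", "good", "good"))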

11 Tree-based methods in R

11.1 Regression trees in R

• Growing a regression tree with rpart() from library(rpart)

> library(rpart)

> mod.tree <- rpart(body.fat ~ ., data = fat, cp = 0.005, xval = 20)

– library(mvpart) or library(tree) can be used as well for growing a tree.

– rpart() does cross-validation (cv) internally. The number of cv replications can be controlled by the control parameters.

• Output of the tree

> mod.tree

n=250 (1 observation deleted due to missingness)

node), split, n, deviance, yval
      * denotes terminal node

 1) root 250 14557.340000 18.963200
   2) abdomen< 91.9 131 3834.820000 13.912210
     4) abdomen< 85.45 65 1027.789000 10.695380
       8) hip< 92.9 28 362.304300 9.264286
        16) ankle>=20.8 23 189.056500 8.143478 *
        17) ankle< 20.8 5 11.448000 14.420000 *
       9) hip>=92.9 37 564.742700 11.778380
        18) wrist>=17.5 28 349.528600 10.657140
          36) height>=173.355 25 258.945600 10.076000 *
          37) height< 173.355 3 11.780000 15.500000 *
...
  15) abdomen>=112.3 11 220.900000 34.300000
    30) weight>=219.075 9 65.040000 32.666670 *
    31) weight< 219.075 2 23.805000 41.650000 *

For each branch of the tree we obtain results in the following order:

– branch number

– split

– number of data following the split

– deviance associated with the split

– predicted value

– ”*”, if node is a terminal node

• Plot of the tree

> plot(mod.tree)
> text(mod.tree)


Figure 18: Regression tree of the body fat data

The procedure to prune the tree and the choice of the optimal complexity are shown in the next section (Section 11.2).

11.2 Classification trees in R

The data set Spam consists of 4601 observations and 58 variables. The goal is to classify the emails into “good” and “spam” emails using classification trees. The data set is also available in library(ElemStatLearn) with data(spam).

> load("Spam.RData")
> names(Spam)

 [1] "make"            "address"         "all"             "X3d"
 [5] "our"             "over"            "remove"          "internet"
 [9] "order"           "mail"            "receive"         "will"
[13] "people"          "report"          "addresses"       "free"
[17] "business"        "email"           "you"             "credit"
[21] "your"            "font"            "X000"            "money"
[25] "hp"              "hpl"             "george"          "X650"
[29] "lab"             "labs"            "telnet"          "X857"
[33] "data"            "X415"            "X85"             "technology"
[37] "X1999"           "parts"           "pm"              "direct"
[41] "cs"              "meeting"         "original"        "project"
[45] "re"              "edu"             "table"           "conference"
[49] "semicolon"       "parenthesis"     "bracket"         "exclamationMark"
[53] "dollarSign"      "hashSign"        "capitalAverage"  "capitalLongest"
[57] "capitalTotal"    "class"

> set.seed(100)
> train = sample(1:nrow(Spam), 3065)
> library(mvpart)

• Prediction with the classification tree

> tree1.pred <- predict(tree1, Spam[-train, ], type = "class")

> tree1.tab = table(Spam[-train, "class"], tree1.pred)
> tree1.tab

       tree1.pred
        good spam
  good   866   54
  spam    69  547

• Misclassification rate for the test set

> 1 - sum(diag(tree1.tab))/sum(tree1.tab)

[1] 0.08007812

The misclassification rate is not necessarily the best measure for the evaluation, because one would rather accept that spam emails are not identified as spam than that “true” emails are incorrectly assigned as spam.
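As a small addition (based only on the confusion table tree1.tab above), the two error types can therefore be reported separately, since they have different costs:

# "good" emails classified as spam (the costly error) vs. spam passed as good
fp.rate <- tree1.tab["good", "spam"] / sum(tree1.tab["good", ])   # 54 / 920
fn.rate <- tree1.tab["spam", "good"] / sum(tree1.tab["spam", ])   # 69 / 616
c(good.as.spam = fp.rate, spam.as.good = fn.rate)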

• Results of the cv

> plotcp(tree1, upper = "size")

Figure 19 shows the complexity of the tree and the cv estimate. The upper points are the MSEs within the cv, the lower points are the MSEs for the training data. Taking the minimum of the cv-errors and increasing it by one standard error gives the horizontal line. The smallest tree that has an MSE which is still below this line is achieved with α = 0.0035, which is our optimal tree complexity. The optimal size of the tree is approximately 20 nodes.


Figure 19: Cross-validation results of tree1: cv-error with standard error (points on top) and error in test set (points below)

• Pruning of the tree

> tree2 <- prune(tree1, cp = 0.0035)
> plot(tree2)
> text(tree2)

Pruning of the tree with the optimal α.

Figure 20: Result of the pruning of tree1

• Prediction with the new tree

> tree2.pred = predict(tree2, Spam[-train, ], type = "class")
> tree2.tab = table(Spam[-train, "class"], tree2.pred)
> tree2.tab

       tree2.pred
        good spam
  good   868   52
  spam    80  536

• Misclassification rate for the test set

> 1 - sum(diag(tree2.tab))/sum(tree2.tab)

[1] 0.0859375

References

L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984.

J. Fox. Applied Regression Analysis, Linear Models, and Related Methods. Sage Publications, Thousand Oaks, CA, USA, 1997.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York, 2001.

T.J. Hastie and R.J. Tibshirani. Generalized Additive Models. Chapman and Hall, London, 1990.

C.J. Huberty. Applied Discriminant Analysis. John Wiley & Sons, Inc., New York, 1994.

R.A. Johnson and D.W. Wichern. Applied Multivariate Statistical Analysis. Prentice Hall, Upper Saddle River, New Jersey, 5th edition, 2002.

D.G. Kleinbaum and M. Klein. Logistic Regression. Springer, New York, 2nd edition, 2002.

G. McLachlan and D. Peel. Finite Mixture Models. Wiley, New York, 2000.

R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2008. http://www.R-project.org.

M. Tenenhaus. La Régression PLS: Théorie et Pratique. Editions Technip, Paris, 1998. (In French).