Part VII: Tree Based Methods

Seppo Pynnönen, Applied Multivariate Statistical Analysis. As of Dec 2, 2020.

Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with Applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.



1 Tree based methods

    Basics of decision trees
        Regression trees
        Classification trees
        Trees versus linear models
        Advantages and disadvantages of trees
    Bagging, random forests, boosting
        Bagging
        Random forest
        Boosting


Tree based methods involve stratifying or segmenting the predictor space into a number of simple regions.

As a result, the splitting of the predictor space can be summarized in a decision tree.

Decision trees can be applied to both regression and classification problems.


Basics of decision trees

Regression trees

Example 1

Predicting baseball players' salaries in terms of Years in the major league and Hits the player made in the previous year.

An example of a regression tree is given below.

To the left of Years < 4.5 is the left-hand branch and to the right of Years ≥ 4.5 is the right-hand branch.

The tree has two internal nodes and three terminal nodes (leaves).

The number in each leaf is the mean log(Salary) for the observations that fall there.


[Figure: the regression tree for the salary data. The first split is Years < 4.5; the right-hand branch is split further at Hits < 117.5. The leaf means of log(Salary) are 5.11, 6.00, and 6.74. Source: James et al. (2013), Fig 8.1.]


The figure below indicates the partition of the predictor space that makes up the terminal nodes, or leaves.

[Figure: the corresponding partition of the (Years, Hits) predictor space into the three regions R1 (Years < 4.5), R2 (Years ≥ 4.5, Hits < 117.5), and R3 (Years ≥ 4.5, Hits ≥ 117.5). Source: James et al. (2013), Fig 8.2.]


Building a regression tree has two steps.

1 Divide the predictor space into J distinct, non-overlapping regions R1, R2, ..., RJ.

2 For every observation that falls into the region Rj, make the same prediction, namely the mean of the response values of the training observations in Rj.

The regions R1, ..., RJ are constructed by minimizing the RSS

\sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2,   (1)

where \hat{y}_{R_j} is the mean response for the training observations within Rj.

It is, however, computationally infeasible to consider every possible partition.

Instead, a top-down, recursive binary splitting approach is adopted in practice.

The process is called greedy because at each step of the tree-building process the best split is chosen at that particular step, rather than looking ahead and picking a split that would lead to a better tree in some future step.


In recursive binary splitting, first a predictor xj and a cut-point s are selected such that splitting the predictor space into the regions {x | xj < s} and {x | xj ≥ s} leads to the greatest reduction in RSS (all predictors x1, ..., xp and all cut-points s are considered).

Generally, for any j and s, define the half-planes

R_1(j, s) = \{x \mid x_j < s\}  and  R_2(j, s) = \{x \mid x_j \ge s\},   (2)

and choose j and s to minimize

\sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2,   (3)

where \hat{y}_{R_1} and \hat{y}_{R_2} are the mean responses of the training observations in R_1(j, s) and R_2(j, s), respectively.


Once the regions R1, ..., RJ are created, we predict the response for a given test observation using the mean of the training observations in the region to which the test observation belongs.

[Figure: an example of recursive binary splitting on two predictors X1 and X2 with splits X1 ≤ t1, X2 ≤ t2, X1 ≤ t3, and X2 ≤ t4, the resulting partition of the feature space into regions R1, ..., R5, the corresponding tree, and the piecewise-constant prediction surface. Source: James et al. (2013), Fig 8.3.]


Tree pruning:

In building the tree, a common strategy is to grow a large initial tree T0 (a tree with lots of branches and leaves, implying a small training RSS) and prune it back to a smaller subtree. A popular approach is cost complexity pruning, also known as weakest link pruning, in which:

A sequence of trees indexed by a non-negative tuning parameter α is considered. For each α there corresponds a subtree T ⊂ T0 (T0 corresponds to α = 0) such that

\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|   (4)

is as small as possible, where |T| is the number of terminal nodes of the tree T, Rm is the rectangle corresponding to the mth terminal node, and \hat{y}_{R_m} is the predicted response (i.e., the mean) associated with Rm. (α controls the trade-off between subtree complexity and fit to the training data.)


It turns out that as α is increased, branches get pruned from the tree in a nested and predictable fashion, so obtaining the whole sequence of subtrees as a function of α is easy.

The value of α is selected using cross-validation, after which the corresponding subtree is obtained from the whole data set.

In R, the initial tree can be constructed with tree(), the appropriate number of nodes can be selected by cross-validation using cv.tree(), and finally the tree can be pruned with prune.tree() to the one with the smallest CV MSE.
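The whole α-indexed sequence of subtrees in (4) can also be inspected directly; a minimal sketch, assuming an initial tree such as the tree_salary object fitted in Example 2 below:

> prune_seq <- prune.tree(tree_salary) # cost-complexity pruning sequence of the initial tree
> cbind(size = prune_seq$size, dev = prune_seq$dev, alpha = prune_seq$k) # subtree sizes, their deviances, and the alpha at which each subtree becomes optimal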


Example 2

The initial regression tree of log(Salary) on nine background features.

> library(ISLR)

> library(tree)

> Hitters <- na.omit(Hitters) # remove missing values

> set.seed(10) # for replication

> nrow(Hitters) # n of observations is 263

[1] 263

> train <- sample(nrow(Hitters), 132) # training sample

> tree_salary <- tree(log(Salary) ~ AtBat + Hits + HmRun + Runs

+ + RBI + Walks + Years + PutOuts + Assists,

+ data = Hitters, subset = train) # use nine features for the intial tree

> summary(tree_salary) # summary of the results

Regression tree:

tree(formula = log(Salary) ~ AtBat + Hits + HmRun + Runs + RBI +

Walks + Years + PutOuts + Assists, data = Hitters, subset = train)

Variables actually used in tree construction:

[1] "Years" "Hits" "Assists" "RBI"

Number of terminal nodes: 8

Residual mean deviance: 0.1884 = 23.36 / 124

Distribution of residuals:

Min. 1st Qu. Median Mean 3rd Qu. Max.

-1.91600 -0.24250 -0.02596 0.00000 0.29310 1.15500

Only 4 of the initial 9 variables are used in the resulting 8-node initial tree, T0.


> par(mfrow = c(1, 1)) # full plot window

> plot(tree_salary) # plot the initial tree

> title(main = "Initial Regression Tree\nof log(Salary) on Selected Features", cex.main = .8)

> text(tree_salary, cex = .8, digits = 3) # add values and splitting points

[Figure: "Initial Regression Tree of log(Salary) on Selected Features". The tree splits on Years < 4.5, Years < 3.5, Hits < 122, Hits < 117.5, Years < 5.5, Assists < 237.5, and RBI < 66, with eight leaf means 4.60, 5.23, 5.55, 5.37, 5.95, 6.60, 7.19, and 6.42.]


Cross-validation based results:

> set.seed(20) # for replication

> cv_salary <- cv.tree(tree_salary, K = 6) # 6 fold cross-validation, 6 divides 132 exactly

> par(mfrow = c(1, 2)) # two regions

> plot(cv_salary$size, cv_salary$dev, type = "b", xlab = "Tree size", col = "orange",

+ ylab = "Residaul Sum of Squares",

+ main = "Cross-Validation RSS for the Test Sample",

+ cex.main = .8) # size of the model against RSS

> min_pos <- which.min(cv_salary$dev) # position of the smallest CV RSS

> cv_salary$size[min_pos] # tree size of the smallest RSS

[1] 6

> cv_salary$dev[min_pos] # smallest CV RSS

[1] 42.77326

> points(cv_salary$size[min_pos], cv_salary$dev[min_pos],

+ col = "red", pch = 20) # mark in the plot

> prune_salary <- prune.tree(tree_salary, best = 6) # prune the full tree to the best CV result

> summary(prune_salary)

Regression tree:

snip.tree(tree = tree_salary, nodes = c(6L, 14L))

Variables actually used in tree construction:

[1] "Years" "Hits" "Assists"

Number of terminal nodes: 6

Residual mean deviance: 0.2151 = 27.1 / 126

Distribution of residuals:

Min. 1st Qu. Median Mean 3rd Qu. Max.

-1.91600 -0.21720 -0.03029 0.00000 0.25260 1.15500

> plot(prune_salary)

> text(prune_salary, cex = .8, digits = 3)

> title(main = "Training Set\nBest Cross-Validation Tree", cex.main = .8)


[Figure: left panel, "Cross-Validation RSS for the Test Sample": CV RSS plotted against tree sizes 1-8, with the minimum at size 6 marked. Right panel, "Training Set: Best Cross-Validation Tree": the pruned six-node tree with splits Years < 4.5, Years < 3.5, Hits < 122, Hits < 117.5, and Assists < 237.5, and leaf means 4.60, 5.23, 5.55, 5.87, 6.95, and 6.42.]

The tree with six nodes has the smallest CV RSS (or deviance).

It is interesting to note that the model predicts a higher salary for those with lower Assists!?


Below are presented test MSEs, cross-validation MSEs, and training sample MSEs for different tree sizes. We observe that the minimum test MSE is at tree size 4 or 5, while, as found above, the minimum CV MSE is at tree size 6 (in fact, all of these are well within the tolerance limits of tree size 3 [or even 2]).

As a result, it would be wise to check the performance of tree size 3 as well.

[Figure: "Test, CV, and Training Mean Squared Errors for Different Tree Sizes with Standard Error Bands": MSE plotted against tree sizes 1-8 for the test set, cross-validation, and the training set.]


Finally, it may be interesting to see how the prediction results compare to linear regression. Using the training data, we select the best model in terms of the Cp criterion.

> library(leaps) # regression subsets

> subsets_salary_sum <- summary(subsets_salary <- regsubsets(log(Salary) ~ AtBat + Hits + HmRun

+ + Runs + RBI + Walks + Years

+ + PutOuts + Assists,

+ data = Hitters, subset = train))

> best_cp <- which.min(subsets_salary_sum$cp) # best model by the CP criterion

> (fml <- formula(paste("log(Salary) ~ ", paste(names(coef(subsets_salary, best_cp))[-1],

+ collapse = "+")))) # formula for lm()

log(Salary) ~ Hits + Walks + Years

> lm_train <- lm(fml, data = Hitters, subset = train) # train data regression fit of the best model

> summary(lm_train) # results

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 3.982820 0.149021 26.727 < 2e-16 ***

Hits 0.007128 0.001333 5.348 3.95e-07 ***

Walks 0.007926 0.003033 2.613 0.0101 *

Years 0.112590 0.011859 9.494 < 2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5853 on 128 degrees of freedom

Multiple R-squared: 0.5968,Adjusted R-squared: 0.5874

F-statistic: 63.16 on 3 and 128 DF, p-value: < 2.2e-16

> lm_test <- predict(lm_train, newdata = Hitters[-train, ]) # test set predictions

> mean((lm_test - log(Hitters$Salary[-train]))^2) # test MSE

[1] 0.4792482

> reg_tree_test <- predict(prune_salary, newdata = Hitters[-train, ]) # predictions

> mean((reg_tree_test - log(Hitters$Salary[-train]))^2) # reg tree test MSE

[1] 0.4100664

Here the regression tree produces better results in terms of MSE.


Classification trees

Classification trees are similar to regression trees, with the exception that they are used to predict qualitative responses.

The classification error rate (the fraction of misclassified observations over the regions) is used instead of the RSS in the splitting process.

Given K classes in the output variable, a new observation is classified to the most common class of the region into which it falls on the basis of the explanatory variables.

Let \hat{p}_{mk} denote the proportion of training observations in the mth region that are from the kth class.

The error rate is then

E = 1 - \max_k \hat{p}_{mk}.   (5)


It turns out that the classification error rate is not sufficiently sensitive for tree-growing.

The Gini index

G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})   (6)

serves as a total variance measure across the K classes and as a measure of node purity (the closer the \hat{p}_{mk} are to zero or one, the more predominantly the node contains observations from a single class, and the smaller G gets).


An alternative to the Gini index is the cross-entropy

D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}.   (7)

Like the Gini index, the cross-entropy takes a small value if the mth node is pure.

These two indexes perform quite similarly when building a classification tree.
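As a small numerical illustration of (5)-(7), the following R sketch computes the error rate, Gini index, and cross-entropy for a node with given class proportions (the proportions are made up for illustration):

p_mk <- c(0.8, 0.15, 0.05)            # class proportions in the mth node (K = 3)
E <- 1 - max(p_mk)                    # classification error rate (5)
G <- sum(p_mk * (1 - p_mk))           # Gini index (6)
D <- -sum(p_mk * log(p_mk))           # cross-entropy (7)
c(error = E, gini = G, entropy = D)   # purer nodes give smaller G and D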


Trees versus linear models

In linear models the prediction equation is

f(x) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j,   (8)

whereas in regression trees we have

f(x) = \sum_{m=1}^{M} c_m \cdot I(x \in R_m),   (9)

where R_1, ..., R_M is a partition of the feature space.

If the relationships are linear, the linear model will most likely perform better.

For (highly) non-linear and complex relationships between the features and the response, decision trees can be expected to perform better.
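As an illustration of (9), the three-region tree of Example 1 corresponds to the following piecewise-constant prediction function (the leaf means 5.11, 6.00, and 6.74 are read off Fig 8.1; the function name is illustrative):

predict_log_salary <- function(years, hits) {
  if (years < 4.5) {
    5.11    # c_1, region R1: Years < 4.5
  } else if (hits < 117.5) {
    6.00    # c_2, region R2: Years >= 4.5, Hits < 117.5
  } else {
    6.74    # c_3, region R3: Years >= 4.5, Hits >= 117.5
  }
}
predict_log_salary(6, 150)   # predicted log(Salary) for Years = 6, Hits = 150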


Advantages and disadvantages

Advantages

Easy to explain.
They mirror human decision making to some extent.
Easy to display graphically.
They easily handle qualitative predictors (no need to transform them to dummy variables).

Disadvantages

Trees generally produce less accurate predictions than model based regression and classification approaches.


Bagging, random forests, boosting

Bagging

Decision trees tend to suffer from high variance (sampling variance).

For example, if a sample is split randomly into two training sub-samples, the fits may differ considerably from each other.

A procedure with low variance produces closely similar results from different samples (linear regression tends to have low variance).

Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing variance.

Bagging relies on the idea that averaging observations reduces variance, as individual errors tend to cancel out in averaging.


Generating B different bootstrap samples, computing the prediction \hat{f}^{*b}(x) from each bootstrap sample, b = 1, ..., B, and taking the average

\hat{f}_{bag}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x),   (10)

is called bagging.

Bagging can improve predictions as it reduces the variance.

The approach is to generate B regression trees using B bootstrapped training sets, and to average the resulting predictions.

These trees are grown deep and not pruned:

Individual trees have high variance but low bias.

Averaging reduces the variance.
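A minimal R sketch of bagging regression trees by hand, i.e., of the average in (10); bag_trees() and the choice B = 100 are illustrative only, and in practice the randomForest package with mtry = p does this more efficiently, as in Example 3:

library(tree)
bag_trees <- function(formula, data, newdata, B = 100) {
  preds <- matrix(NA, nrow(newdata), B)
  for (b in 1:B) {
    boot <- data[sample(nrow(data), replace = TRUE), ]   # bootstrap sample b
    fit_b <- tree(formula, data = boot)                  # tree fitted to bootstrap sample b (grown deep and unpruned in proper bagging)
    preds[, b] <- predict(fit_b, newdata = newdata)      # prediction f*b(x) on the new data
  }
  rowMeans(preds)                                        # the bagged average (10)
}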


In classification problems there are a few possible approaches.

The simplest is to record the class predicted by each of the B trees and take the majority vote, i.e., the overall prediction is the most commonly occurring class among the B predictions.
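For example, a majority vote over B = 5 illustrative class predictions for a single test observation can be taken as follows:

votes <- c("High", "Low", "High", "High", "Low")   # predictions from B = 5 trees
names(which.max(table(votes)))                     # majority vote: "High"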


Bagging: Out-of-bag error estimation

Estimation of the test error can be performed for a bagged model without cross-validation.

Each bootstrapped subset in the bagged tree makes use of around two-thirds of the observations on average.

The remaining third, called the out-of-bag (OOB) observations, can be used for test set purposes.

We can predict the response for the ith observation using each of the trees in which that observation was OOB.

This yields around B/3 predictions for the ith observation.

A single prediction can be obtained by averaging these predictions.

The OOB MSE or classification error can be computed using the associated prediction errors, thereby giving a valid estimate of the test error.
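With the randomForest package (used in Example 3 below), the OOB predictions and the OOB error estimate come directly out of the fit. A minimal sketch, here simply using the full Boston data for illustration:

> library(MASS); library(randomForest)
> rf <- randomForest(medv ~ ., data = Boston) # regression forest with default settings
> oob_pred <- predict(rf) # predict() without newdata returns the OOB predictions
> mean((oob_pred - Boston$medv)^2) # OOB estimate of the test MSE
> tail(rf$mse, 1) # OOB MSE as reported by randomForest (should agree closely)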


Bagging: Variable importance measures

With a single decision tree, the diagram is easily interpreted (e.g., which variables are important).

This advantage disappears with bagging, because the prediction results are based on several trees.

The importance of the variables is hard to infer from bagging.

So, bagging improves prediction at the expense of interpretability.

However, an overall summary of the importance of each variable can be obtained using the RSS (regression trees) or the Gini index (classification trees) by recording the total reduction in the RSS (Gini index) due to splits over a given predictor, averaged over all B trees.


Random forest

Random forests provide an improvement over bagged trees.

In bagging, most of the trees use the same strong predictors at the top of the tree, which implies that the trees tend to be fairly similar, i.e., the trees are correlated.

Averaging highly correlated trees does not lead to as large a reduction in variance.

In random forests, when building the decision trees, each time a split is considered a random sample of m predictors is chosen as split candidates from the p predictors, typically m ≈ √p (m = p corresponds to bagging).

The purpose of the randomization is to decorrelate the trees, thereby making the average of the trees less variable (and hence more reliable).


Boosting


Boosting is yet another approach for improving decision tree predictions.

While in bagging separate decision trees are fitted to bootstrap samples of the training data, in boosting the trees are grown sequentially, so that each tree is grown using information from the previously grown trees.

The sequential process is based on fitting relatively small trees: given the current model, a tree is fitted to the residuals of the current model, and the current model is then updated.

The procedure is described in more detail in James et al. (2013), pp. 322-323.
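A minimal R sketch of that sequential scheme for regression, loosely following Algorithm 8.2 in James et al. (2013); boost_sketch() and its default values are illustrative, and the size of each fitted tree is not restricted to d splits here (the tuning parameters B, λ, and d are discussed next):

library(tree)
boost_sketch <- function(x_data, y, B = 100, lambda = 0.01) {
  f_hat <- rep(0, length(y))                  # current model, starts at 0
  r <- y                                      # current residuals
  for (b in 1:B) {
    df_b <- data.frame(x_data, r = r)
    fit_b <- tree(r ~ ., data = df_b)         # small tree fitted to the residuals
    update <- lambda * predict(fit_b, df_b)   # shrunken contribution of tree b
    f_hat <- f_hat + update                   # update the current model
    r <- r - update                           # update the residuals
  }
  f_hat                                       # fitted values of the boosted model
}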


Boosting has three tuning parameters:

The number of trees B, selected by cross-validation.

The shrinkage parameter λ, which controls the rate at which boosting learns. Typically small, like 0.01 or 0.001.

The number of splits d in each tree, which controls the complexity of the boosted ensemble; often d = 1, i.e., each tree is only a stump, consisting of a single split.


Example 3

The R package randomForest can be used for bagging and random forests (a random forest with m = p is bagging).

Housing prices in Boston (the Boston data set in the MASS library).

> library(MASS) # MASS library include Boston data

> library(randomForest)

> library(help = randomForest)

> set.seed(1) # for replication

> colnames(Boston) # variables in Boston

[1] "crim" "zn" "indus" "chas" "nox" "rm" "age"

[8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"

> train <- sample(nrow(Boston), nrow(Boston) / 2) # training sample (its definition is not shown on the original slide; this is one common choice)
> bag_boston <- randomForest(medv ~ ., data = Boston, subset = train,
+ mtry = 13, # force all 13 predictors at each split, which implies bagging
+ importance = TRUE) # importance saves predictor importance info

> bag_boston # print results

Call:

randomForest(formula = medv ~ ., data = Boston, mtry = 13, importance = TRUE, subset = train)

Type of random forest: regression

Number of trees: 500

No. of variables tried at each split: 13

Mean of squared residuals: 9.403441

% Var explained: 88.12


> yhat_bag <- predict(bag_boston, newdata = Boston[-train, ]) # test set predictions (this line is not shown on the original slide)
> plot(yhat_bag, Boston$medv[-train], xlim = c(0, 50), ylim = c(0, 50),
+ xlab = "Predicted Housing Prices",
+ ylab = "Realized Housing Prices", pch = 20, col = "orange") # scatter plot
> abline(0, 1, col = "grey", lty = "dashed") # diagonal line
> round(mean((yhat_bag - Boston$medv[-train])^2), 2) # MSE

[1] 22.74

[Figure: scatter plot of predicted versus realized housing prices for the bagged model on the test set, with the 45-degree line for reference.]


Next, fit a random forest with m = 6 (the R default is p/3 for regression trees and √p for classification trees).

> set.seed(1)

> rf_boston <- randomForest(medv ~ ., data = Boston, subset = train,
+ mtry = 6, importance = TRUE)

> yhat_rf <- predict(rf_boston, newdata = Boston[-train, ])

> round(mean((yhat_rf - Boston$medv[-train])^2), 2) # MSE

[1] 21.89

> importance(rf_boston) # importance of predictors

IncNodePurity

crim 779.81

zn 125.78

indus 1220.03

chas 13.90

nox 1143.43

rm 8371.00

age 407.09

dis 470.80

rad 114.38

tax 630.53

ptratio 957.97

black 200.24

lstat 5354.91


> varImpPlot(rf_boston, main = "Variable Importance") # plot of variable importance

[Figure: "Variable Importance" dot chart of IncNodePurity for each predictor; rm and lstat are by far the most important, followed by indus, nox, and ptratio, while chas, rad, and zn matter least.]


Boosted regression trees can be fitted with the gbm package using the gbm() function.
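The fitting call itself is not shown on the slide; a sketch of what it may have looked like, given that 5000 trees are used in the prediction further below (the interaction depth and shrinkage values are assumptions, and "gaussian" is gbm's squared-error loss for regression):

> library(gbm)
> set.seed(1)
> boost_boston <- gbm(medv ~ ., data = Boston[train, ], distribution = "gaussian",
+ n.trees = 5000, # the number of trees B (5000 is implied by the prediction call below)
+ interaction.depth = 4, # the number of splits d in each tree (assumed value)
+ shrinkage = 0.01) # the shrinkage parameter lambda (assumed value)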

> summary(boost_boston)

var rel.inf

lstat lstat 33.21544106

rm rm 28.34225145

dis dis 11.68472578

crim crim 7.88991584

black black 4.67558175

nox nox 3.88590502

age age 3.15203732

ptratio ptratio 2.40626229

indus indus 1.89905161

tax tax 1.58509825

chas chas 0.75175057

rad rad 0.41883727

zn zn 0.09314181


[Figure: relative influence plot produced by summary(boost_boston); lstat and rm have by far the largest relative influence, with the remaining predictors well below.]

> yhat_boost <- predict(boost_boston, newdata = Boston[-train, ], n.trees = 5000)

> mean((yhat_boost - Boston$medv[-train])^2) # MSE

[1] 11.1852
