Part VII
Tree Based Methods

As of Dec 2, 2020. Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

Seppo Pynnonen, Applied Multivariate Statistical Analysis
1 Tree based methods
    Basics of decision trees
        Regression trees
        Classification trees
        Trees versus linear models
        Advantages and disadvantages of trees
    Bagging, random forests, boosting
        Bagging
        Random forest
        Boosting
Tree based methods involve stratifying or segmenting the predictor space into a number of simple regions.

As a result, the prediction rule can be summarized in a decision tree.

Decision trees can be applied to both regression and classification problems.
Basics of decision trees

Regression trees
Example 1

Predicting baseball players' salaries in terms of Years in the major league and Hits that the player has made in the previous year.

An example regression tree is given below.

To the left of Years < 4.5 is the left-hand branch and to the right of Years ≥ 4.5 is the right-hand branch.

The tree has two internal nodes and three terminal nodes (leaves). The number in each leaf is the mean log(Salary) for the observations that fall there.
[Regression tree for the Hitters data: first split at Years < 4.5, then Hits < 117.5 on the right branch; leaf means of log(Salary) are 5.11, 6.00, and 6.74. Source: James et al. (2013), Fig 8.1]
The figure below indicates the partition of the predictor space that forms the terminal nodes, or leaves.

[Partition of the (Years, Hits) predictor space into the three regions R1, R2, R3, with boundaries at Years = 4.5 and Hits = 117.5. Source: James et al. (2013), Fig 8.2]
Building a regression tree has two steps.

1 Divide the predictor space into J distinct, non-overlapping regions R1, R2, ..., RJ.
2 For every observation that falls into region Rj, make the prediction that is the mean of the response values for the training observations in Rj.

The regions R1, ..., RJ are constructed by minimizing the RSS

∑_{j=1}^{J} ∑_{i∈R_j} (y_i − ȳ_{R_j})²,   (1)

where ȳ_{R_j} is the mean response for the training observations within Rj.

It is, however, computationally infeasible to consider every possible partition. Instead, a top-down, recursive binary splitting approach is adopted in practice.

The process is called greedy because at each step of the tree-building process the best split is made at that particular step, rather than picking a split that would lead to a better tree in some future step.
In recursive binary splitting, a predictor xj and a cut-point s are selected such that splitting the predictor space into the regions {x | xj < s} and {x | xj ≥ s} leads to the greatest reduction in RSS (all predictors x1, ..., xp and all cut-points s are considered).

Generally, for any j and s, define the half-planes

R1(j, s) = {x | xj < s} and R2(j, s) = {x | xj ≥ s},   (2)

and choose j and s to minimize

∑_{i: x_i∈R1(j,s)} (y_i − ȳ_{R1})² + ∑_{i: x_i∈R2(j,s)} (y_i − ȳ_{R2})²,   (3)

where ȳ_{R1} and ȳ_{R2} are the mean responses of the training observations in R1(j, s) and R2(j, s), respectively.
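As an illustration (not from the slides), one greedy split over a single numeric predictor can be found by brute force over candidate cut-points; `best_split` below is a hypothetical helper on simulated data:

```r
# Minimal sketch of one greedy split, eq. (3): try every candidate
# cut-point s and keep the one with the smallest combined RSS
best_split <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2   # midpoints as candidate s
  rss <- sapply(cuts, function(s) {
    left <- y[x < s]; right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(s = cuts[which.min(rss)], rss = min(rss))
}

x <- c(1, 2, 3, 10, 11, 12)
y <- c(1, 1, 1, 5, 5, 5)
best_split(x, y)   # cut-point s = 6.5 separates the two groups, RSS = 0
```

A full tree builder would apply this recursively to each resulting half, over all p predictors.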
Once the regions R1, ..., RJ are created, we predict the response for a given test observation using the mean of the training observations in the region to which the test observation belongs.
[Example of a partition built by recursive binary splitting: successive splits X1 ≤ t1, X2 ≤ t2, X1 ≤ t3 and X2 ≤ t4 divide the (X1, X2) space into the regions R1, ..., R5, together with the corresponding tree. Source: James et al. (2013), Fig 8.3]
Tree pruning:

In building the tree, a strategy is to build a large initial tree T0 (a tree with lots of branches and leaves, implying a small training RSS) and prune it back to a smaller subtree.

A popular approach is cost complexity pruning, also known as weakest link pruning, in which a sequence of trees indexed by a non-negative tuning parameter α is considered. For each α there corresponds a subtree T ⊂ T0 (T0 corresponds to α = 0) such that

∑_{m=1}^{|T|} ∑_{x_i∈R_m} (y_i − ȳ_{R_m})² + α|T|   (4)

is as small as possible, where |T| is the number of terminal nodes of the tree T, Rm is the rectangle corresponding to the mth terminal node, and ȳ_{R_m} is the predicted response (i.e., the mean) associated with Rm. (α controls the trade-off between subtree complexity and its fit to the training data.)
It turns out that as α is increased, branches get pruned from the tree in a nested and predictable fashion, so obtaining the whole sequence of subtrees as a function of α is easy.

The value of α is selected using cross-validation, after which the corresponding subtree is obtained from the full data.

In R, the initial tree can be constructed with tree(), the appropriate number of nodes selected by cross-validation with cv.tree(), and finally the tree pruned with prune.tree() to the one with the smallest CV MSE.
Example 2

The initial tree for a regression tree of log(Salary) on nine background features.
> library(ISLR)
> library(tree)
> Hitters <- na.omit(Hitters) # remove missing values
> set.seed(10) # for replication
> nrow(Hitters) # n of observations is 263
[1] 263
> train <- sample(nrow(Hitters), 132) # training sample
> tree_salary <- tree(log(Salary) ~ AtBat + Hits + HmRun + Runs
+ + RBI + Walks + Years + PutOuts + Assists,
+ data = Hitters, subset = train) # use nine features for the initial tree
> summary(tree_salary) # summary of the results
Regression tree:
tree(formula = log(Salary) ~ AtBat + Hits + HmRun + Runs + RBI +
Walks + Years + PutOuts + Assists, data = Hitters, subset = train)
Variables actually used in tree construction:
[1] "Years" "Hits" "Assists" "RBI"
Number of terminal nodes: 8
Residual mean deviance: 0.1884 = 23.36 / 124
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.91600 -0.24250 -0.02596 0.00000 0.29310 1.15500
Only 4 of the initial 9 variables are used in the resulting 8-node initial tree, T0.
> par(mfrow = c(1, 1)) # full plot window
> plot(tree_salary) # plot the initial tree
> title(main = "Initial Regression Tree\nof log(Salary) on Selected Features", cex.main = .8)
> text(tree_salary, cex = .8, digits = 3) # add values and splitting points
[Plot of the initial regression tree of log(Salary) on the selected features: splits on Years (< 4.5, < 3.5, < 5.5), Hits (< 122, < 117.5), Assists (< 237.5) and RBI (< 66), with eight leaf means ranging from 4.60 to 7.19]
Cross-validation based results:
> set.seed(20) # for replication
> cv_salary <- cv.tree(tree_salary, K = 6) # 6 fold cross-validation, 6 divides 132 exactly
> par(mfrow = c(1, 2)) # two regions
> plot(cv_salary$size, cv_salary$dev, type = "b", xlab = "Tree size", col = "orange",
+      ylab = "Residual Sum of Squares",
+      main = "Cross-Validation RSS for the Test Sample",
+      cex.main = .8) # size of the model against RSS
> min_pos <- which.min(cv_salary$dev) # position of the smallest CV RSS
> cv_salary$size[min_pos] # tree size of the smallest RSS
[1] 6
> cv_salary$dev[min_pos] # smallest CV RSS
[1] 42.77326
> points(cv_salary$size[min_pos], cv_salary$dev[min_pos],
+        col = "red", pch = 20) # mark in the plot
> prune_salary <- prune.tree(tree_salary, best = 6) # prune the full tree to the best CV result
> summary(prune_salary)
Regression tree:
snip.tree(tree = tree_salary, nodes = c(6L, 14L))
Variables actually used in tree construction:
[1] "Years" "Hits" "Assists"
Number of terminal nodes: 6
Residual mean deviance: 0.2151 = 27.1 / 126
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.91600 -0.21720 -0.03029 0.00000 0.25260 1.15500
> plot(prune_salary)
> text(prune_salary, cex = .8, digits = 3)
> title(main = "Training Set\nBest Cross-Validation Tree", cex.main = .8)
[Left panel: cross-validation RSS against tree size, minimized at size 6. Right panel: the pruned six-node training-set tree, splitting on Years (< 4.5, < 3.5), Hits (< 122, < 117.5) and Assists (< 237.5), with leaf means 4.60, 5.23, 5.55, 5.87, 6.95 and 6.42]
The tree with six nodes has the smallest CV RSS (or deviance).
It is interesting to note that the model predicts a higher salary for those with fewer Assists!?
Below are presented the test MSEs, cross-validation MSEs, and training sample MSEs for different tree sizes. We observe that the minimum test MSE is at tree size 4 or 5, while as found above the minimum CV MSE is at tree size 6 (in fact, all are well within the tolerance limits already for tree size 3 [or even 2]).

As a result, it would be wise to check the performance of tree size 3 as well.
[Plot of test, cross-validation, and training mean squared errors for different tree sizes, with standard error bands]
Finally, it may be interesting to see how the prediction results compare to linear regression. Using the training data, we select the best model in terms of the Cp criterion.
> library(leaps) # regression subsets
> subsets_salary_sum <- summary(subsets_salary <- regsubsets(log(Salary) ~ AtBat + Hits + HmRun
+ + Runs + RBI + Walks + Years
+ + PutOuts + Assists,
+ data = Hitters, subset = train))
> best_cp <- which.min(subsets_salary_sum$cp) # best model by the CP criterion
> (fml <- formula(paste("log(Salary) ~ ", paste(names(coef(subsets_salary, best_cp))[-1],
+ collapse = "+")))) # formula for lm()
log(Salary) ~ Hits + Walks + Years
> lm_train <- lm(fml, data = Hitters, subset = train) # train data regression fit of the best model
> summary(lm_train) # results
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.982820 0.149021 26.727 < 2e-16 ***
Hits 0.007128 0.001333 5.348 3.95e-07 ***
Walks 0.007926 0.003033 2.613 0.0101 *
Years 0.112590 0.011859 9.494 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5853 on 128 degrees of freedom
Multiple R-squared: 0.5968,Adjusted R-squared: 0.5874
F-statistic: 63.16 on 3 and 128 DF, p-value: < 2.2e-16
> lm_test <- predict(lm_train, newdata = Hitters[-train, ]) # test set predictions
> mean((lm_test - log(Hitters$Salary[-train]))^2) # test MSE
[1] 0.4792482
> reg_tree_test <- predict(prune_salary, newdata = Hitters[-train, ]) # predictions
> mean((reg_tree_test - log(Hitters$Salary[-train]))^2) # reg tree test MSE
[1] 0.4100664
Here the regression tree produces better results in terms of test MSE (0.41 vs. 0.48 for the linear regression).
Classification trees
Classification trees are similar to regression trees, with the exception that they are used to predict qualitative responses.

The classification error rate (the fraction of misclassified observations in a region) is used instead of RSS in the splitting process.

Given K classes in the output variable, a new observation is classified to the most common class in its region on the basis of the explanatory variables.

Let p_mk denote the proportion of training observations in the mth region that are from the kth class.

The error rate is then

E = 1 − max_k p_mk.   (5)
It turns out that the classification error rate is not sufficiently sensitive for tree-growing.

The Gini index

G = ∑_{k=1}^{K} p_mk(1 − p_mk)   (6)

is a measure of total variance across the K classes and serves as a measure of node purity (the closer each p_mk is to zero or one, the more predominantly the node contains observations from a single class, and the smaller G gets).
An alternative to the Gini index is the cross-entropy

D = −∑_{k=1}^{K} p_mk log p_mk.   (7)

Like the Gini index, the cross-entropy takes a small value if the mth node is pure.

These two indices perform quite similarly when building a classification tree.
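The three impurity measures (5)-(7) are easy to compute from the class proportions of a node; `node_impurity` below is an illustrative helper, not from the slides:

```r
# Sketch: the three node-impurity measures computed from the class
# proportions p of a single node
node_impurity <- function(p) {
  p <- p[p > 0]                      # drop empty classes (0 * log 0 = 0)
  c(error   = 1 - max(p),            # classification error rate, eq. (5)
    gini    = sum(p * (1 - p)),      # Gini index, eq. (6)
    entropy = -sum(p * log(p)))      # cross-entropy, eq. (7)
}

node_impurity(c(0.5, 0.5))  # maximally impure two-class node: 0.5, 0.5, log(2)
node_impurity(c(1, 0))      # pure node: all three measures are 0
```

Note how Gini and entropy, unlike the error rate, keep decreasing smoothly as the node becomes purer, which is what makes them more sensitive for tree-growing.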
Trees versus linear models
In linear models the prediction equation is

f(x) = β0 + ∑_{j=1}^{p} βj xj,   (8)

whereas in regression trees we have

f(x) = ∑_{m=1}^{M} cm · I(x ∈ Rm),   (9)

where R1, ..., RM is a partition of the feature space.
If the relationships are linear, the linear model most likely performs better.

For (highly) non-linear and complex relationships between the features and the response, decision trees can be expected to perform better.
Advantages and disadvantages

Advantages

Trees are easy to explain.
They mirror human decision making to some extent.
They are easy to display graphically.
They easily handle qualitative predictors (no need to transform them to dummy variables).

Disadvantages

Trees generally produce less accurate predictions than model-based regression and classification approaches.
Bagging, random forests, boosting
Bagging
Decision trees tend to suffer from high variance (sampling variance).

For example, if a sample is split randomly into two training sub-samples, the two fits may differ considerably from each other.

A procedure with low variance will produce closely similar results across different samples (linear regression tends to have low variance).

Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing variance.

Bagging relies on the idea that averaging observations reduces variance, as individual errors tend to cancel out in averaging.
Generating B different bootstrap samples, computing the prediction f*b(x) from each bootstrap sample, b = 1, ..., B, and taking the average

f_bag(x) = (1/B) ∑_{b=1}^{B} f*b(x),   (10)

is called bagging.

Bagging can improve predictions as it reduces the variance.

The approach is to generate B regression trees using B bootstrapped training sets, and average the resulting predictions.

These trees are grown deep and not pruned: individual trees have high variance but low bias, and averaging reduces the variance.
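The bagging average (10) can be sketched in base R; to keep the example self-contained, each "tree" below is just a one-split stump on simulated data (an illustrative simplification, not the full procedure):

```r
# Minimal sketch of bagging, eq. (10): fit a predictor to B bootstrap
# samples and average the B predictions (simulated data, stump "trees")
set.seed(1)
x <- runif(200); y <- (x > 0.5) * 2 + rnorm(200, sd = 0.3)

stump_predict <- function(xb, yb, xnew) {
  s <- median(xb)                                 # crude split point
  ifelse(xnew < s, mean(yb[xb < s]), mean(yb[xb >= s]))
}

B <- 100
preds <- replicate(B, {
  i <- sample(length(x), replace = TRUE)          # bootstrap sample
  stump_predict(x[i], y[i], x)                    # f*b(x) on the original x
})
yhat_bag <- rowMeans(preds)                       # bagged prediction, eq. (10)
```

In practice one would of course bag deep unpruned trees (e.g. via the tree or randomForest packages) rather than stumps; the averaging step is the same.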
In classification problems there are a few possible approaches. The simplest is to record the class predicted by each of the B trees and take the majority vote, i.e., the overall prediction is the most commonly occurring class among the B predictions.
Bagging: Out-of-bag error estimation
The test error of a bagged model can be estimated without cross-validation.

Each bootstrapped subset in the bagged tree makes use of around two-thirds of the observations on average.

The remaining third, called the out-of-bag (OOB) observations, can be used for test set purposes.

We can predict the response for the ith observation using each of the trees in which that observation was OOB.

This will yield around B/3 predictions for the ith observation. A single prediction is obtained by averaging these predictions.

The OOB MSE or classification error can then be computed from the associated prediction errors, giving a valid estimate of the test error.
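The "two-thirds" figure follows from the bootstrap itself and is easy to verify numerically:

```r
# The probability that a given observation appears at least once in a
# bootstrap sample of size n is 1 - (1 - 1/n)^n, which tends to
# 1 - exp(-1) ≈ 0.632 as n grows
n <- 1000
p_in <- 1 - (1 - 1/n)^n
p_in   # ≈ 0.632, so roughly a third of observations are out-of-bag
```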
Bagging: Variable importance measures
With a single decision tree, the diagram is easily interpreted (e.g., which variables are important).

This advantage disappears with bagging, because the prediction results are based on several trees, so the importance of variables is hard to infer.

Thus, bagging improves prediction at the expense of interpretability.

However, an overall summary of the importance of each variable can be obtained using the RSS (regression trees) or the Gini index (classification trees) by recording the total reduction in RSS (or Gini index) due to splits over a given predictor, averaged over all B trees.
Random forest
Random forests provide an improvement over bagged trees.

In bagging most of the trees use the same strong predictors at the top of the tree, which implies that the trees tend to be fairly similar, i.e., the trees are correlated.

Highly correlated results do not lead to as large a reduction in variance.

In random forests, each time a split is considered when building a decision tree, a random sample of m predictors is chosen as split candidates from the p predictors, typically m ≈ √p (m = p implies bagging).

The purpose of the randomization is to decorrelate the trees, thereby making the average of the trees less variable (and hence more reliable).
Boosting
Boosting is yet another approach for improving decision tree predictions.

While in bagging separate decision trees are fitted to bootstrap samples of the training data, in boosting the trees are grown sequentially, so that each tree is grown using information from previously grown trees.

The sequential process is based on relatively small trees: given the current model, a tree is fitted to the residuals of the current model, and the current model is then updated.

The procedure is described in more detail in James et al. (2013), pp. 322-323.
Boosting has three tuning parameters:

The number of trees B, selected by cross-validation.
The shrinkage parameter λ, which controls the rate at which boosting learns. Typically small, like 0.01 or 0.001.
The number of splits d in each tree, which controls the complexity of the boosted ensemble; often d = 1, i.e., each tree is only a stump, consisting of a single split.
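The sequential residual-fitting idea with d = 1 can be sketched in base R on simulated data (an illustrative toy version, not the gbm algorithm itself):

```r
# Minimal sketch of boosting with stumps (d = 1): repeatedly fit a
# one-split tree to the current residuals and take a shrunken step
set.seed(1)
x <- runif(200); y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

stump <- function(x, r) {                 # fit one split to residuals r
  s <- quantile(x, seq(0.1, 0.9, 0.1))    # candidate cut-points
  rss <- sapply(s, function(si) {
    sum((r[x < si] - mean(r[x < si]))^2) +
      sum((r[x >= si] - mean(r[x >= si]))^2)
  })
  si <- s[which.min(rss)]
  list(s = si, left = mean(r[x < si]), right = mean(r[x >= si]))
}
pred_stump <- function(m, x) ifelse(x < m$s, m$left, m$right)

B <- 200; lambda <- 0.1                   # B trees, shrinkage lambda
f <- rep(0, length(y)); r <- y
for (b in 1:B) {
  m <- stump(x, r)                        # fit a stump to the residuals
  f <- f + lambda * pred_stump(m, x)      # update the model by a shrunken step
  r <- y - f                              # update the residuals
}
mean(r^2)                                 # training MSE shrinks as B grows
```

With a smaller λ the fit improves more slowly, which is why λ and B must be tuned together.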
Example 3

The R package randomForest can be used for bagging and random forests (a random forest with m = p is bagging).

Housing prices in Boston (data set in the MASS library).

> library(MASS) # the MASS library includes the Boston data
> library(randomForest)
> library(help = randomForest)
> set.seed(1) # for replication
> train <- sample(nrow(Boston), nrow(Boston) / 2) # training sample
> colnames(Boston) # variables in Boston
[1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
[8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
> bag_boston <- randomForest(medv ~ ., data = Boston, subset = train,
+                            mtry = 13, # force all variables, which implies bagging
+                            importance = TRUE) # importance saves predictor importance info
> bag_boston # print results
Call:
randomForest(formula = medv ~ ., data = Boston, mtry = 13, importance = TRUE, subset = train)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 13
Mean of squared residuals: 9.403441
% Var explained: 88.12
> yhat_bag <- predict(bag_boston, newdata = Boston[-train, ]) # test set predictions
> plot(yhat_bag, Boston$medv[-train], xlim = c(0, 50), ylim = c(0, 50),
+      xlab = "Predicted Housing Prices",
+      ylab = "Realized Housing Prices", pch = 20, col = "orange") # scatter plot
> abline(0, 1, col = "grey", lty = "dashed") # diagonal line
> round(mean((yhat_bag - Boston$medv[-train])^2), 2) # MSE
[1] 22.74
[Scatter plot of predicted vs. realized housing prices for the test set, with the dashed 45-degree reference line]
Next fit a random forest with m = 6 (the R default is p/3 for regression trees and √p for classification trees).

> set.seed(1)
> rf_boston <- randomForest(medv ~ ., data = Boston, subset = train,
+                           mtry = 6, importance = TRUE)
> yhat_rf <- predict(rf_boston, newdata = Boston[-train, ])
> round(mean((yhat_rf - Boston$medv[-train])^2), 2) # MSE
[1] 21.89
> importance(rf_boston) # importance of predictors
IncNodePurity
crim 779.81
zn 125.78
indus 1220.03
chas 13.90
nox 1143.43
rm 8371.00
age 407.09
dis 470.80
rad 114.38
tax 630.53
ptratio 957.97
black 200.24
lstat 5354.91
> varImpPlot(rf_boston, main = "Variable Importance") # plot of variable importance
[Dot chart of variable importance (IncNodePurity, 0 to about 8000): rm and lstat dominate, followed by indus, nox, ptratio and crim; chas is least important]
Tree based methods
Bagging, random forests, boosting
Boosted regression trees can be fitted with the gbm() function of the gbm package.
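The fitting call that produces boost_boston is not shown; a plausible call, sketched below, uses n.trees = 5000 to match the later predict() call, while distribution, interaction.depth and shrinkage are assumed (illustrative) choices:

```r
# Hypothetical fitting call for boost_boston (the call itself is not
# shown on the slide); parameters other than n.trees are assumptions
library(MASS)   # Boston data
library(gbm)
set.seed(1)
train <- sample(nrow(Boston), nrow(Boston) / 2)  # training sample
boost_boston <- gbm(medv ~ ., data = Boston[train, ],
                    distribution = "gaussian",   # squared-error loss for regression
                    n.trees = 5000,              # B, number of trees
                    interaction.depth = 4,       # d, splits per tree
                    shrinkage = 0.01)            # lambda, the learning rate
```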
> summary(boost_boston)
var rel.inf
lstat lstat 33.21544106
rm rm 28.34225145
dis dis 11.68472578
crim crim 7.88991584
black black 4.67558175
nox nox 3.88590502
age age 3.15203732
ptratio ptratio 2.40626229
indus indus 1.89905161
tax tax 1.58509825
chas chas 0.75175057
rad rad 0.41883727
zn zn 0.09314181
[Bar chart of relative influence by predictor, produced by summary(boost_boston); lstat and rm dominate]
> yhat_boost <- predict(boost_boston, newdata = Boston[-train, ], n.trees = 5000)
> mean((yhat_boost - Boston$medv[-train])^2) # MSE
[1] 11.1852