Trees
Applied Multivariate Statistics – Spring 2012
Overview
Intuition for Trees
Regression Trees
Classification Trees
Idea of Trees: Regression Trees
Continuous response
[Figure: scatterplot of the response y against x, next to the corresponding binary tree. The root node (mean Y = 1.2) splits at X ≤ 1 vs. X > 1 into nodes with means Y = 1.4 and Y = 0.7; these split again at X ≤ 0.3 vs. X > 0.3 and at X ≤ 2 vs. X > 2, giving four leaves that predict the constants Y = 0.2, 1.9, 1.3 and 0.3.]
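Read as a prediction rule, the pictured tree is just nested if/else statements. A minimal R sketch (function name illustrative; thresholds and leaf values taken from the figure):

    # Predict y for a single value of x by walking down the tree
    predict_tree <- function(x) {
      if (x <= 1) {
        if (x <= 0.3) 0.2 else 1.9   # left branch: split at X = 0.3
      } else {
        if (x <= 2) 1.3 else 0.3     # right branch: split at X = 2
      }
    }
    predict_tree(0.5)  # falls into the region 0.3 < X <= 1, returns 1.9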
Idea of Trees: Classification Tree
Discrete response
[Figure: classification tree for “Survived the Titanic?”; node counts are shown as No/Yes.
Root (800/200, predict No) splits on sex:
  Sex = F (150/50, predict No): splits on age; Age < 35 → 3/17, predict Yes; Age ≥ 35 → 147/33, predict No.
  Sex = M (650/150, predict No): splits on age; Age < 27 → 70/130, predict Yes; Age ≥ 27 → 580/20, predict No.]
Misclassification rate:
- Total: (3 + 33 + 70 + 20) / 1000 = 0.126
- “Yes” class: (33 + 20) / 200 = 53/200 = 0.265
- “No” class: (3 + 70) / 800 = 73/800 ≈ 0.09
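A quick check of these rates from the four leaf counts, as an R sketch (the data frame and variable names are illustrative):

    # Leaf counts (No/Yes) and majority-vote predictions per leaf
    leaves <- data.frame(no  = c(3, 147, 70, 580),
                         yes = c(17, 33, 130, 20))
    pred   <- ifelse(leaves$yes > leaves$no, "Yes", "No")
    errors <- ifelse(pred == "Yes", leaves$no, leaves$yes)  # minority count per leaf
    sum(errors) / sum(leaves)                     # total rate: 0.126
    sum(errors[pred == "No"])  / sum(leaves$yes)  # "Yes" class: 0.265
    sum(errors[pred == "Yes"]) / sum(leaves$no)   # "No" class: ~0.091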
Intuition of Trees: Recursive Partitioning
For simplicity: restrict to recursive binary splits.
Fighting overfitting: Cost-complexity pruning
[Figure: training and test error as a function of model complexity; training error keeps decreasing while test error eventually rises again.]
Overfitting: Fitting the training data perfectly might not be good for predicting future data
In practice: Use cross-validation
For trees:
1. Fit a very detailed model
2. Prune it using a complexity penalty to optimize cross-validation performance
Building Regression Trees 1/2
Assume a given partition of the predictor space into regions R_1, …, R_M.
Tree model: f(x) = Σ_{m=1}^{M} c_m · 1(x ∈ R_m), i.e., a constant c_m in each region.
Goal: minimize the sum of squared residuals Σ_i (y_i − f(x_i))².
Solution: c_m is the average of the data points y_i in region R_m.
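A one-line check (standard, not spelled out on the slide) that the region-wise average is optimal: within a region the fit is the constant c_m, and setting the derivative of the squared error to zero gives the mean:

    \frac{\partial}{\partial c_m} \sum_{x_i \in R_m} (y_i - c_m)^2
      = -2 \sum_{x_i \in R_m} (y_i - c_m) = 0
      \;\Longrightarrow\;
      \hat{c}_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i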
Building Regression Trees 2/2
Finding the best binary partition is computationally infeasible.
Use a greedy approach: for splitting variable j and split point s, define the two generated regions
R_1(j, s) = {X | X_j ≤ s} and R_2(j, s) = {X | X_j > s}.
Choose the splitting variable j and split point s that solve
min_{j,s} [ min_{c_1} Σ_{x_i ∈ R_1(j,s)} (y_i − c_1)² + min_{c_2} Σ_{x_i ∈ R_2(j,s)} (y_i − c_2)² ]
The inner minimization is solved by the region averages, ĉ_1 = ave(y_i | x_i ∈ R_1(j, s)) and ĉ_2 = ave(y_i | x_i ∈ R_2(j, s)).
Repeat the splitting process on each of the two resulting regions.
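A minimal R sketch of this greedy search for a single numeric predictor (function name and test data are illustrative; for p predictors one would repeat the scan over all variables j):

    # Scan all candidate split points s and keep the one with the
    # smallest total sum of squared residuals of the two half-regions
    best_split <- function(x, y) {
      candidates <- sort(unique(x))
      best <- list(s = NA, rss = Inf)
      for (s in candidates[-length(candidates)]) {  # keep both sides non-empty
        left  <- y[x <= s]
        right <- y[x > s]
        rss   <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
        if (rss < best$rss) best <- list(s = s, rss = rss)
      }
      best
    }

    # Example: data with a jump at x = 1 should yield a split near 1
    set.seed(1)
    x <- runif(100, 0, 3)
    y <- ifelse(x <= 1, 1.4, 0.7) + rnorm(100, sd = 0.1)
    best_split(x, y)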
Pruning Regression Trees
Stop splitting when some minimal node size (= number of samples per node) is reached (e.g., 5).
Then cut the tree back again (“pruning”) to optimize the cost-complexity criterion
C_α(T) = Σ_{m=1}^{|T|} N_m · Q_m(T) + α · |T|
where the first term (the “impurity measure” summed over the leaves) measures goodness of fit and |T|, the number of leaves, measures complexity.
The tuning parameter α is chosen by cross-validation.
Classification Trees
Regression Tree:
Quality of split measured by “Squared error”
Classification Tree:
Quality of split measured by general “Impurity measure”
Classification Trees: Impurity Measures
Proportion of class k observations in node m: p̂_{mk} = (1/N_m) Σ_{x_i ∈ R_m} 1(y_i = k)
Majority class in node m: k(m) = argmax_k p̂_{mk}
Common impurity measures Q_m(T):
- Misclassification error: 1 − p̂_{m,k(m)}
- Gini index: Σ_k p̂_{mk} (1 − p̂_{mk})
- Cross-entropy (deviance): −Σ_k p̂_{mk} log p̂_{mk}
For just two classes, with p the proportion in one class:
- Misclassification error: 1 − max(p, 1 − p)
- Gini index: 2p(1 − p)
- Cross-entropy: −p log p − (1 − p) log(1 − p)
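The three measures side by side, as a small R sketch (function name illustrative):

    # Impurity of a node, given its vector of class proportions p
    impurity <- function(p) {
      c(misclass = 1 - max(p),
        gini     = sum(p * (1 - p)),
        entropy  = -sum(ifelse(p > 0, p * log(p), 0)))  # 0·log 0 := 0
    }
    impurity(c(0.5, 0.5))  # maximally impure two-class node
    impurity(c(0.9, 0.1))  # much purer node: all three measures drop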
Example: Gini Index
Side effects after treatment? 100 persons, 50 with and 50 without side effects; node counts are shown as No/Yes.

Split on sex: 50/50 →
  F: 30/40, Gini = 2 · (30/70) · (40/70) ≈ 0.49
  M: 20/10, Gini = 2 · (20/30) · (10/30) ≈ 0.44
  Total Gini = 0.49 + 0.44 = 0.93

Split on age: 50/50 →
  young: 10/50, Gini = 2 · (10/60) · (50/60) ≈ 0.28
  old: 40/0, Gini = 0
  Total Gini = 0.28 + 0 = 0.28

0.28 < 0.93, therefore: choose the split on age. (The two child impurities are simply added here; weighting them by child node size, as is usually done, leads to the same conclusion in this example.)
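The same comparison in R, reproducing the numbers above (function name illustrative):

    # Two-class Gini index from No/Yes counts: 2·p·(1 − p)
    gini <- function(no, yes) { p <- no / (no + yes); 2 * p * (1 - p) }
    gini(30, 40) + gini(20, 10)  # split on sex: 0.49 + 0.44 ~ 0.93
    gini(10, 50) + gini(40, 0)   # split on age: 0.28 + 0.00 ~ 0.28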
Classification Trees: Impurity Measures
Usually:
- Gini index is used for building
- Misclassification error is used for pruning
(The Gini index is more sensitive to changes in the node proportions, which helps when growing the tree; the misclassification error directly reflects the final prediction accuracy.)
Example: Pruning using Misclass. Error (MCE)
With α = 0.5 and node counts shown as No/Yes, compare the cost-complexity criterion with and without the additional “short/tall” split:

Pruned candidate (two leaves):
  50/50 splits on age: young → 10/50 (MCE = 10/60 ≈ 0.167), old → 40/0 (MCE = 0)
  C_α(T) = 60 · 0.167 + 40 · 0 + 0.5 · 2 ≈ 11.0

Unpruned tree (three leaves):
  as above, but the young node 10/50 splits further: short → 0/50 (MCE = 0), tall → 10/0 (MCE = 0)
  C_α(T) = 50 · 0 + 10 · 0 + 40 · 0 + 0.5 · 3 = 1.5

The unpruned tree has the smaller C_α(T), therefore: don’t prune.
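The same computation in R (a sketch; α and the leaf counts as above):

    # C_α(T) = Σ_m N_m·Q_m(T) + α·|T|, with Q_m = misclassification error
    mce   <- function(no, yes) min(no, yes) / (no + yes)
    alpha <- 0.5
    # unpruned subtree, three leaves: 0/50, 10/0, 40/0
    50 * mce(0, 50) + 10 * mce(10, 0) + 40 * mce(40, 0) + alpha * 3  # 1.5
    # pruned subtree, two leaves: 10/50, 40/0
    60 * mce(10, 50) + 40 * mce(40, 0) + alpha * 2                   # 11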
Trees in R
Function “rpart” (recursive partitioning) in package “rpart”, together with “print”, “plot”, and “text”
Function “rpart” automatically runs 10-fold CV for the cost-complexity parameter α (called cp in rpart) while building the tree
Functions “plotcp” and “printcp” display this cost-complexity information
Function “prune” for manual pruning
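A minimal end-to-end sketch with these functions, using the cu.summary data that ships with “rpart” (the cp value passed to “prune” is illustrative):

    library(rpart)
    # Regression tree for mileage; rpart runs 10-fold CV internally
    fit <- rpart(Mileage ~ Price + Country + Reliability + Type,
                 method = "anova", data = cu.summary)
    print(fit)             # text representation of the fitted tree
    plot(fit); text(fit)   # draw the tree and label the splits
    printcp(fit)           # cost-complexity table with CV error per cp value
    plotcp(fit)            # plot CV error against the complexity parameter
    fit2 <- prune(fit, cp = 0.05)  # manual pruning at a chosen cp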
Concepts to know
Trees as recursive partitionings
Concept of cost-complexity pruning
Impurity measures
R functions to know
From package “rpart”: “rpart”, “print”, “plot”, “text”, “plotcp”, “printcp”, “prune”