Trees
Applied Multivariate Statistics – Spring 2012
Overview
Intuition for Trees
Regression Trees
Classification Trees
Idea of Trees: Regression Trees
Continuous response
[Figure: scatterplot of the response y against x, next to the corresponding binary tree. The root node (mean Y = 1.2) splits at X ≤ 1 vs. X > 1 into nodes with means Y = 1.4 and Y = 0.7; these split again at X ≤ 0.3 vs. X > 0.3 and at X ≤ 2 vs. X > 2, giving four leaves that predict the constants Y = 0.2, 1.9, 1.3 and 0.3.]
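Read as a prediction rule, the pictured tree is just nested if/else statements. A minimal R sketch (function name illustrative; thresholds and leaf values taken from the figure):

    # Predict y for a single value of x by walking down the tree
    predict_tree <- function(x) {
      if (x <= 1) {
        if (x <= 0.3) 0.2 else 1.9   # left branch: split at X = 0.3
      } else {
        if (x <= 2) 1.3 else 0.3     # right branch: split at X = 2
      }
    }
    predict_tree(0.5)  # falls into the region 0.3 < X <= 1, returns 1.9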
Idea of Trees: Classification Tree
Discrete response
[Figure: classification tree for “Survived the Titanic?”; node counts are shown as No/Yes.
Root (800/200, predict No) splits on sex:
  Sex = F (150/50, predict No): splits on age; Age < 35 → 3/17, predict Yes; Age ≥ 35 → 147/33, predict No.
  Sex = M (650/150, predict No): splits on age; Age < 27 → 70/130, predict Yes; Age ≥ 27 → 580/20, predict No.]
Misclassification rate:
- Total: (3 + 33 + 70 + 20) / 1000 = 0.126
- “Yes” class: (33 + 20) / 200 = 53/200 = 0.265
- “No” class: (3 + 70) / 800 = 73/800 ≈ 0.09
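A quick check of these rates from the four leaf counts, as an R sketch (the data frame and variable names are illustrative):

    # Leaf counts (No/Yes) and majority-vote predictions per leaf
    leaves <- data.frame(no  = c(3, 147, 70, 580),
                         yes = c(17, 33, 130, 20))
    pred   <- ifelse(leaves$yes > leaves$no, "Yes", "No")
    errors <- ifelse(pred == "Yes", leaves$no, leaves$yes)  # minority count per leaf
    sum(errors) / sum(leaves)                     # total rate: 0.126
    sum(errors[pred == "No"])  / sum(leaves$yes)  # "Yes" class: 0.265
    sum(errors[pred == "Yes"]) / sum(leaves$no)   # "No" class: ~0.091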
Intuition of Trees: Recursive Partitioning
For simplicity: restrict to recursive binary splits.
Fighting overfitting: Cost-complexity pruning
[Figure: training and test error as a function of model complexity; training error keeps decreasing while test error eventually rises again.]
Overfitting: Fitting the training data perfectly might not be good for predicting future data
In practice: Use cross-validation
For trees:
1. Fit a very detailed model
2. Prune it using a complexity penalty to optimize cross-validation performance
Building Regression Trees 1/2
Assume a given partition of the predictor space into regions R_1, …, R_M.
Tree model: f(x) = Σ_{m=1}^{M} c_m · 1(x ∈ R_m), i.e., a constant c_m in each region.
Goal: minimize the sum of squared residuals Σ_i (y_i − f(x_i))².
Solution: c_m is the average of the data points y_i in region R_m.
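A one-line check (standard, not spelled out on the slide) that the region-wise average is optimal: within a region the fit is the constant c_m, and setting the derivative of the squared error to zero gives the mean:

    \frac{\partial}{\partial c_m} \sum_{x_i \in R_m} (y_i - c_m)^2
      = -2 \sum_{x_i \in R_m} (y_i - c_m) = 0
      \;\Longrightarrow\;
      \hat{c}_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i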
Building Regression Trees 2/2
Finding the best binary partition is computationally infeasible.
Use a greedy approach: for splitting variable j and split point s, define the two generated regions
R_1(j, s) = {X | X_j ≤ s} and R_2(j, s) = {X | X_j > s}.
Choose the splitting variable j and split point s that solve
min_{j,s} [ min_{c_1} Σ_{x_i ∈ R_1(j,s)} (y_i − c_1)² + min_{c_2} Σ_{x_i ∈ R_2(j,s)} (y_i − c_2)² ]
The inner minimization is solved by the region averages, ĉ_1 = ave(y_i | x_i ∈ R_1(j, s)) and ĉ_2 = ave(y_i | x_i ∈ R_2(j, s)).
Repeat the splitting process on each of the two resulting regions.
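A minimal R sketch of this greedy search for a single numeric predictor (function name and test data are illustrative; for p predictors one would repeat the scan over all variables j):

    # Scan all candidate split points s and keep the one with the
    # smallest total sum of squared residuals of the two half-regions
    best_split <- function(x, y) {
      candidates <- sort(unique(x))
      best <- list(s = NA, rss = Inf)
      for (s in candidates[-length(candidates)]) {  # keep both sides non-empty
        left  <- y[x <= s]
        right <- y[x > s]
        rss   <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
        if (rss < best$rss) best <- list(s = s, rss = rss)
      }
      best
    }

    # Example: data with a jump at x = 1 should yield a split near 1
    set.seed(1)
    x <- runif(100, 0, 3)
    y <- ifelse(x <= 1, 1.4, 0.7) + rnorm(100, sd = 0.1)
    best_split(x, y)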
Pruning Regression Trees
Stop splitting when some minimal node size (= number of samples per node) is reached (e.g., 5).
Then cut the tree back again (“pruning”) to optimize the cost-complexity criterion
C_α(T) = Σ_{m=1}^{|T|} N_m · Q_m(T) + α · |T|
where the first term (the “impurity measure” summed over the leaves) measures goodness of fit and |T|, the number of leaves, measures complexity.
The tuning parameter α is chosen by cross-validation.
Classification Trees
Regression Tree:
Quality of split measured by “Squared error”
Classification Tree:
Quality of split measured by general “Impurity measure”
Classification Trees: Impurity Measures
Proportion of class k observations in node m: p̂_{mk} = (1/N_m) Σ_{x_i ∈ R_m} 1(y_i = k)
Majority class in node m: k(m) = argmax_k p̂_{mk}
Common impurity measures Q_m(T):
- Misclassification error: 1 − p̂_{m,k(m)}
- Gini index: Σ_k p̂_{mk} (1 − p̂_{mk})
- Cross-entropy (deviance): −Σ_k p̂_{mk} log p̂_{mk}
For just two classes, with p the proportion in one class:
- Misclassification error: 1 − max(p, 1 − p)
- Gini index: 2p(1 − p)
- Cross-entropy: −p log p − (1 − p) log(1 − p)
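The three measures side by side, as a small R sketch (function name illustrative):

    # Impurity of a node, given its vector of class proportions p
    impurity <- function(p) {
      c(misclass = 1 - max(p),
        gini     = sum(p * (1 - p)),
        entropy  = -sum(ifelse(p > 0, p * log(p), 0)))  # 0·log 0 := 0
    }
    impurity(c(0.5, 0.5))  # maximally impure two-class node
    impurity(c(0.9, 0.1))  # much purer node: all three measures drop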
Example: Gini Index
Side effects after treatment? 100 persons, 50 with and 50 without side effects; node counts are shown as No/Yes.

Split on sex: 50/50 →
  F: 30/40, Gini = 2 · (30/70) · (40/70) ≈ 0.49
  M: 20/10, Gini = 2 · (20/30) · (10/30) ≈ 0.44
  Total Gini = 0.49 + 0.44 = 0.93

Split on age: 50/50 →
  young: 10/50, Gini = 2 · (10/60) · (50/60) ≈ 0.28
  old: 40/0, Gini = 0
  Total Gini = 0.28 + 0 = 0.28

0.28 < 0.93, therefore: choose the split on age. (The two child impurities are simply added here; weighting them by child node size, as is usually done, leads to the same conclusion in this example.)
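The same comparison in R, reproducing the numbers above (function name illustrative):

    # Two-class Gini index from No/Yes counts: 2·p·(1 − p)
    gini <- function(no, yes) { p <- no / (no + yes); 2 * p * (1 - p) }
    gini(30, 40) + gini(20, 10)  # split on sex: 0.49 + 0.44 ~ 0.93
    gini(10, 50) + gini(40, 0)   # split on age: 0.28 + 0.00 ~ 0.28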
Classification Trees: Impurity Measures
Usually:
- Gini index is used for building
- Misclassification error is used for pruning
(The Gini index is more sensitive to changes in the node proportions, which helps when growing the tree; the misclassification error directly reflects the final prediction accuracy.)
Example: Pruning using Misclass. Error (MCE)
With α = 0.5 and node counts shown as No/Yes, compare the cost-complexity criterion with and without the additional “short/tall” split:

Pruned candidate (two leaves):
  50/50 splits on age: young → 10/50 (MCE = 10/60 ≈ 0.167), old → 40/0 (MCE = 0)
  C_α(T) = 60 · 0.167 + 40 · 0 + 0.5 · 2 ≈ 11.0

Unpruned tree (three leaves):
  as above, but the young node 10/50 splits further: short → 0/50 (MCE = 0), tall → 10/0 (MCE = 0)
  C_α(T) = 50 · 0 + 10 · 0 + 40 · 0 + 0.5 · 3 = 1.5

The unpruned tree has the smaller C_α(T), therefore: don’t prune.
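The same computation in R (a sketch; α and the leaf counts as above):

    # C_α(T) = Σ_m N_m·Q_m(T) + α·|T|, with Q_m = misclassification error
    mce   <- function(no, yes) min(no, yes) / (no + yes)
    alpha <- 0.5
    # unpruned subtree, three leaves: 0/50, 10/0, 40/0
    50 * mce(0, 50) + 10 * mce(10, 0) + 40 * mce(40, 0) + alpha * 3  # 1.5
    # pruned subtree, two leaves: 10/50, 40/0
    60 * mce(10, 50) + 40 * mce(40, 0) + alpha * 2                   # 11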
Trees in R
Function “rpart” (recursive partitioning) in package “rpart”, together with “print”, “plot”, and “text”
Function “rpart” automatically runs 10-fold CV for the cost-complexity parameter α (called cp in rpart) while building the tree
Functions “plotcp” and “printcp” display this cost-complexity information
Function “prune” for manual pruning
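A minimal end-to-end sketch with these functions, using the cu.summary data that ships with “rpart” (the cp value passed to “prune” is illustrative):

    library(rpart)
    # Regression tree for mileage; rpart runs 10-fold CV internally
    fit <- rpart(Mileage ~ Price + Country + Reliability + Type,
                 method = "anova", data = cu.summary)
    print(fit)             # text representation of the fitted tree
    plot(fit); text(fit)   # draw the tree and label the splits
    printcp(fit)           # cost-complexity table with CV error per cp value
    plotcp(fit)            # plot CV error against the complexity parameter
    fit2 <- prune(fit, cp = 0.05)  # manual pruning at a chosen cp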
Concepts to know
Trees as recursive partitionings
Concept of cost-complexity pruning
Impurity measures
R functions to know
From package “rpart”: “rpart”, “print”, “plot”, “text”, “plotcp”, “printcp”, “prune”