Lecture 6: Decision Tree, Random Forest, and Boosting
Tuo Zhao
Schools of ISyE and CSE, Georgia Tech
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Decision Trees
Decision Trees
Decision trees have a long history in machine learning
The first popular algorithm dates back to 1979
Very popular in many real-world problems
Intuitive to understand
Easy to build
History
1961: EPAM (Elementary Perceiver and Memorizer), a cognitive simulation model of human concept learning
1966: CLS, an early algorithm for decision tree construction
1979: ID3, based on information theory
1993: C4.5, improved over ID3
Decision trees also have a history in statistics as CART (Classification and Regression Trees)
Motivation
How do people make decisions?
Consider a variety of factors
Follow a logical path of checks
Should I eat at this restaurant?
If there is no wait
Yes
If there is a short wait and I am hungry
Yes
Else
No
Decision Flowchart

[Figure: flowchart of the restaurant decision above.]
Example: Should We Play Tennis?
If temperature is not hot
Play
If outlook is overcast
Play tennis
Otherwise
Don’t play tennis
Decision Tree

[Figure: the corresponding decision tree for the play-tennis example.]
Structure
A decision tree consists of:
Nodes: tests for variables
Branches: results of tests
Leaves: classifications
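To make this concrete, here is a minimal sketch of the structure as a recursive data type in Python. All names are illustrative, and the leaf labels below are one plausible guess at the play-tennis tree, not taken from the lecture:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Node:
    variable: Optional[str] = None                # test for a variable (internal node)
    children: Optional[Dict[str, "Node"]] = None  # one branch per test result
    label: Optional[str] = None                   # classification (leaf)

def classify(node: Node, x: Dict[str, str]) -> str:
    """Follow the branch matching x's value at each test until reaching a leaf."""
    while node.label is None:
        node = node.children[x[node.variable]]
    return node.label

# Hypothetical play-tennis tree rooted at Outlook:
tree = Node(variable="Outlook", children={
    "Sunny":    Node(label="Don't play"),
    "Overcast": Node(label="Play"),
    "Rainy":    Node(label="Play"),
})
print(classify(tree, {"Outlook": "Overcast"}))  # -> Play
```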
Function Class
What functions can decision trees model?
Non-linear: very powerful function class
A decision tree can encode any Boolean function
Proof
Given a truth table for a function
Construct a path in the tree for each row of the table
Given a row as input, follow that path to the desired leaf (output)
Problem: exponentially large trees!
Smaller Trees
Can we produce smaller decision trees for functions?
Yes, when there are a few key decision factors
Counterexamples:
Parity function: returns 1 if an even number of inputs are 1, and 0 otherwise
Majority function: returns 1 if more than half of the inputs are 1
Bias-Variance Tradeoff
Bias: the representation power of decision trees
Variance: deep trees require a sample size exponential in the depth
Fitting Decision Trees
What Makes a Good Tree?
Small tree:
Occam’s razor: Simpler is better
Avoids over-fitting
A decision tree may be human readable, but it need not use human logic!
How do we build small trees that accurately capture data?
Learning an optimal decision tree is computationally intractable
Greedy Algorithm
We can get good trees by simple greedy algorithms
Adjustments are usually to fix greedy selection problems
Recursive:
Select the “best” variable and generate child nodes: one for each possible value;
Partition the samples using the possible values, and assign these subsets of samples to the child nodes;
Repeat for each child node until all samples associated with a node are either all positive or all negative.
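A minimal sketch of this greedy recursion in Python. The split score below is plain training misclassification, chosen only to keep the sketch self-contained; the lecture's preferred criterion (information gain) is defined in the next slides and can be swapped in:

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def split_error(samples, var):
    """Training misclassifications if each child predicts its majority label."""
    err = 0
    for value in {x[var] for x, _ in samples}:
        ys = [y for x, y in samples if x[var] == value]
        err += len(ys) - Counter(ys).most_common(1)[0][1]
    return err

def build_tree(samples, variables):
    """samples: list of (x, y) pairs with x a dict of variable -> value.
    Returns a label (leaf) or a (variable, {value: subtree}) pair (internal node)."""
    labels = [y for _, y in samples]
    if len(set(labels)) == 1:          # all positive or all negative: stop
        return labels[0]
    if not variables:                  # no further splits possible
        return majority(labels)
    var = min(variables, key=lambda v: split_error(samples, v))  # "best" variable
    children = {}
    for value in {x[var] for x, _ in samples}:
        subset = [(x, y) for x, y in samples if x[var] == value]
        children[value] = build_tree(subset, [v for v in variables if v != var])
    return (var, children)
```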
Variable Selection
The best variable for partitioning
The most informative variable
Select the variable that is most informative about the labels
Classification error?
Example:
P(Y = 1|X1 = 1) = 0.75 vs. P(Y = 1|X2 = 1) = 0.55
Which to choose?
Information Theory
The quantification of information
Founded by Claude Shannon
Simple concepts:
Entropy: H(X) = −∑_x P(X = x) log P(X = x)
Conditional Entropy: H(Y|X) = ∑_x P(X = x) H(Y|X = x)
Information Gain: IG(Y|X) = H(Y) − H(Y|X)
Select the variable with the highest information gain
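These quantities take only a few lines of Python when estimated from a labeled sample; `info_gain` below can replace `split_error` in the greedy sketch above. The 5-day dataset is a hypothetical reconstruction consistent with the worked example on the next slide:

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y P(Y=y) log2 P(Y=y), from empirical counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(samples, var):
    """IG(Y|X) = H(Y) - sum_x P(X=x) H(Y|X=x), for feature `var`."""
    n = len(samples)
    cond = 0.0
    for value in {x[var] for x, _ in samples}:
        ys = [y for x, y in samples if x[var] == value]
        cond += (len(ys) / n) * entropy(ys)
    return entropy([y for _, y in samples]) - cond

days = [({"Out.": "Sunny"}, "No"), ({"Out.": "Sunny"}, "No"),
        ({"Out.": "Overcast"}, "Yes"),
        ({"Out.": "Rainy"}, "Yes"), ({"Out.": "Rainy"}, "Yes")]
print(round(entropy([y for _, y in days]), 2))  # 0.97
print(round(info_gain(days, "Out."), 2))        # 0.97
```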
Example: Should We Play Tennis?
H(Tennis) = −3/5 log2(3/5) − 2/5 log2(2/5) = 0.97
H(Tennis|Out. = Sunny) = −2/2 log2(2/2) − 0/2 log2(0/2) = 0
H(Tennis|Out. = Overcast) = −0/1 log2(0/1) − 1/1 log2(1/1) = 0
H(Tennis|Out. = Rainy) = −0/2 log2(0/2) − 2/2 log2(2/2) = 0
H(Tennis|Out.) = 2/5 × 0 + 1/5 × 0 + 2/5 × 0 = 0
Example: Should We Play Tennis?
IG(Tennis|Out.) = 0.97 − 0 = 0.97
If we knew the Outlook we’d be able to predict Tennis!
Outlook is the variable to pick for our decision tree
When to Stop Training?
All data have same label
Return that label
No examples
Return majority label of all data
No further splits possible
Return majority label of passed data
What if the max IG = 0?
Example: No Information Gain?
Y  X1  X2
0  0   0
1  0   1
1  1   0
0  1   1
Both features give 0 IG
Once we divide the data, perfect classification!
We need a little exploration sometimes (Unstable Equilibrium)
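A quick numeric check of this pathology (self-contained; the data is the XOR table above):

```python
import math
from collections import Counter

def H(ys):
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

# XOR table from the slide: Y = X1 xor X2.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
ys = [y for _, y in data]
for j in (0, 1):
    cond = sum(H([y for x, y in data if x[j] == v]) * 2 / 4 for v in (0, 1))
    print(f"IG(Y|X{j + 1}) = {H(ys) - cond}")   # both print 0.0

# Yet after splitting on X1, X2 determines Y exactly in each branch,
# so the two-level tree classifies perfectly.
```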
Decision Tree with Continuous Features
Decision Stump
Threshold for binary classification:
Check Xj ≥ δj or Xj < δj.
More than two classes:
Threshold for three classes: check Xj ≥ δj, γj ≤ Xj < δj, or Xj < γj.
Decompose one node into two nodes:
First node: check Xj ≥ δj or Xj < δj.
Second node: if Xj < δj, check Xj ≥ γj or Xj < γj.
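A sketch of how a stump threshold δj can be chosen for one continuous feature: scan the midpoints between sorted values and keep the split with the fewest training errors. Misclassification is used as the score here purely for illustration; information gain can be scanned the same way:

```python
def best_stump_threshold(xs, ys):
    """Return (training errors, delta) for the best split X >= delta.
    Each side predicts its majority label; ys must be 0/1."""
    pairs = sorted(zip(xs, ys))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                 # can't split equal values
        delta = (pairs[i - 1][0] + pairs[i][0]) / 2  # candidate threshold
        left = [y for x, y in pairs if x < delta]
        right = [y for x, y in pairs if x >= delta]
        errs = sum(min(side.count(0), side.count(1)) for side in (left, right))
        best = min(best, (errs, delta))
    return best

print(best_stump_threshold([0.1, 0.35, 0.4, 0.8], [0, 0, 0, 1]))
# -> (0, 0.6) up to float rounding
```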
Decision Tree for Spam Classification

[Figure (Hastie): a pruned classification tree for the SPAM data (600/1536 misclassified at the root). Internal nodes split on word/character frequencies such as ch$ < 0.0555, remove < 0.06, ch! < 0.191, george < 0.005, hp < 0.03, CAPMAX < 10.5, CAPAVE < 2.7505, free < 0.065, business < 0.145, and 1999 < 0.58; leaves are labeled email or spam with their misclassification counts.]
Decision Tree for Spam Classification

[Figure (Hastie): ROC curve (sensitivity vs. specificity) for the pruned tree on the SPAM data; TREE error 8.7%.]

SPAM data. Overall error rate on test data: 8.7%.
The ROC curve is obtained by varying the threshold c0 of the classifier: C(X) = +1 if P(+1|X) > c0.
Sensitivity: proportion of true spam identified.
Specificity: proportion of true email identified.
We may want specificity to be high, and suffer some spam:
Specificity: 95% =⇒ Sensitivity: 79%
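A sketch of how the curve is traced, assuming we have predicted probabilities P(+1|x) for the test points (illustrative code, not from the lecture):

```python
def roc_points(probs, labels):
    """Sweep the threshold c0; classify C(x) = +1 if P(+1|x) > c0.
    labels: +1 = spam, -1 = email. Returns (specificity, sensitivity) pairs."""
    points = []
    for c0 in sorted(set(probs)) + [1.0]:
        pred = [+1 if p > c0 else -1 for p in probs]
        tp = sum(1 for p, y in zip(pred, labels) if p == +1 and y == +1)
        tn = sum(1 for p, y in zip(pred, labels) if p == -1 and y == -1)
        points.append((tn / labels.count(-1), tp / labels.count(+1)))
    return points

print(roc_points([0.9, 0.8, 0.3, 0.1], [+1, +1, -1, -1]))
# [(0.5, 1.0), (1.0, 1.0), (1.0, 0.5), (1.0, 0.0), (1.0, 0.0)]
```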
Decision Tree vs. SVM

[Figure (Hastie): ROC curves for TREE vs. SVM on the SPAM data; SVM error 6.7%, TREE error 8.7%.]

Comparing ROC curves on the test data is a good way to compare classifiers. SVM dominates TREE here.
SVM outperforms Decision Tree.
AUC: Area Under the Curve.
Bagging and Random Forest
Toy Example
[Figure (Hastie): toy example, no noise; 200 training points in the (X1, X2) plane, classes 0 and 1 separated by the circle X1^2 + X2^2 = 1.]

Deterministic problem; the noise comes from the sampling distribution of X.
Use a training sample of size 200.
Here the Bayes error is 0%.
Nonlinearly separable data.
Optimal decision boundary: X1^2 + X2^2 = 1.
Toy Example: Decision Tree

[Figure (Hastie): classification tree grown on the toy data, splitting on thresholds of x.1 and x.2 (e.g., x.2 < −1.067, x.2 > 1.150, x.1 > 1.136), with misclassification counts at each node.]

[Figure (Hastie): the tree's decision boundary on the training sample, axis-parallel rectangles approximating the circle; error rate 0.073.]

When the nested spheres are in 10 dimensions, classification trees produce a rather noisy and inaccurate rule C(X), with error rates around 30%.
Sample size: 200.
7 branching nodes; 6 layers.
Classification error: 7.3% when d = 2; > 30% when d = 10.
Model Averaging
Decision trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers.
Bagging (Breiman, 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
Boosting (Freund & Schapire, 1996): Fit many large or small trees to reweighted versions of the training data. Classify by weighted majority vote.
Random Forests (Breiman, 1999): Fancier version of bagging.
In general, Boosting > Random Forests > Bagging > Single Tree.
Bagging/Bootstrap Aggregation
Motivation: Average a given procedure over many samples to reduce the variance.
Given a classifier C(S, x), based on our training data S, producing a predicted class label at input point x.
To bag C, we draw bootstrap samples S1, ..., SB, each of size N, with replacement from the training data. Then

CBag(x) = Majority Vote{C(Sb, x)}, b = 1, ..., B.

Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction.
However, the simple, interpretable structure of a single tree is lost.
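A compact bagging sketch using scikit-learn trees as the base procedure C (assumes numpy and scikit-learn are available; the synthetic circle data mirrors the 2-d toy example):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag_fit(X, y, B=100, seed=0):
    """Fit B deep trees, each on a bootstrap sample of size N (with replacement)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample S_b
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bag_predict(trees, X):
    """Classify by majority vote over the B trees."""
    votes = np.stack([t.predict(X) for t in trees])  # shape (B, n)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)    # circular boundary
trees = bag_fit(X, y, B=50)
print((bag_predict(trees, X) == y).mean())           # training accuracy
```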
Toy Example: Bagging Trees

[Figure (Hastie): the original tree and five bootstrap trees; each is a small tree, and the bootstrap trees split on somewhat different variables and thresholds (e.g., x.2 < 0.39, x.2 < 0.36, x.4 < 0.395, x.3 < −1.575).]

2 branching nodes; 2 layers.
5 dependent trees.
Toy Example: Bagging Trees

[Figure (Hastie): decision boundary of the bagged ensemble on the training sample; error rate 0.032.]

Bagging averages many trees, and produces smoother decision boundaries.
A smoother decision boundary.
Classification error: 3.2% (single deeper tree: 7.3%).
Random Forest
Bagging features and samples simultaneously:
At each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = √d or log2(d), where d is the number of features.
For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the “out-of-bag” error rate.
Random forests try to improve on bagging by “de-correlating” the trees. Each tree has the same expectation.
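A sketch combining both ideas: bootstrap the rows, subsample features per split via scikit-learn's max_features option, and track the out-of-bag error as described (names other than max_features are mine):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def forest_fit(X, y, B=500, seed=0):
    """Random forest sketch: bagging plus m = sqrt(d) features per split."""
    rng = np.random.default_rng(seed)
    forest = []
    oob_votes = [[] for _ in range(len(X))]
    for b in range(B):
        idx = rng.integers(0, len(X), size=len(X))    # bootstrap rows
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
        forest.append(tree.fit(X[idx], y[idx]))
        oob = np.setdiff1d(np.arange(len(X)), idx)    # rows left out of S_b
        for i, pred in zip(oob, tree.predict(X[oob])):
            oob_votes[i].append(pred)
    oob_error = np.mean([np.bincount(v).argmax() != y[i]
                         for i, v in enumerate(oob_votes) if v])
    return forest, oob_error
```

Because each tree considers only a random subset of features at each split, the trees are less correlated than in plain bagging, which makes the averaging more effective.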
Random Forest for Spam Classification

[Figure (Hastie): ROC curves for TREE, SVM, and Random Forest on the SPAM data; Random Forest error 5.0%, SVM 6.7%, TREE 8.7%.]

Random Forest dominates both other methods on the SPAM data: 5.0% error. Used 500 trees with default settings for the randomForest package in R.
RF outperforms SVM.
500 trees.
Random Forest: Variable Importance Scores

[Figure (Hastie), “Spam: Variable Importance”: relative importance of the SPAM variables; the top ones include !, $, hp, remove, free, CAPAVE, your, CAPMAX, george, CAPTOT, edu, you, and our.]

Each bootstrap sample leaves out roughly 1/3 of the observations each time.
Permute the values of each variable to check how important it is.
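A sketch of the permutation idea for a single fitted model (random forests apply it tree by tree on the out-of-bag samples and average; this simplified version permutes columns of a held-out set):

```python
import numpy as np

def permutation_importance(model, X, y, seed=0):
    """Importance of feature j = drop in accuracy when column j is permuted,
    which breaks that feature's relationship with the labels."""
    rng = np.random.default_rng(seed)
    base = (model.predict(X) == y).mean()
    drops = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])   # scramble one feature
        drops.append(base - (model.predict(Xp) == y).mean())
    return np.array(drops)                     # larger drop = more important
```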
Boosting
Boosting

[Figure (Hastie): schematic of boosting; classifiers C1(x), C2(x), C3(x), ..., CM(x) are fit in sequence, each to a reweighted version of the training sample.]

Average many trees, each grown to re-weighted versions of the training data (iterative).
Final classifier is a weighted average of classifiers:

C(x) = sign(∑_{m=1}^{M} αm Cm(x))
AdaBoost vs. Bagging

[Figure (Hastie): test error vs. number of terms for Bagging and AdaBoost with 100-node trees on the nested-spheres data; AdaBoost keeps improving while Bagging flattens out.]

2000 points from nested spheres in R^10.
Bayes error rate is 0%.
Trees are grown best-first without pruning.
The leftmost term is a single tree.
Each classifier for bagging is a tree with a single node (decision stump).
AdaBoost

AdaBoost (Freund & Schapire, 1996)

1. Initialize the observation weights wi = 1/N, i = 1, 2, ..., N.
2. For m = 1 to M repeat steps (a)–(d):
(a) Fit a classifier Cm(x) to the training data using weights wi.
(b) Compute the weighted error of the newest tree:
errm = ∑_{i=1}^{N} wi I(yi ≠ Cm(xi)) / ∑_{i=1}^{N} wi.
(c) Compute αm = log[(1 − errm)/errm].
(d) Update the weights for i = 1, ..., N:
wi ← wi · exp[αm · I(yi ≠ Cm(xi))]
and renormalize so the wi sum to 1.
3. Output C(x) = sign(∑_{m=1}^{M} αm Cm(x)).
wi’s are the weights of the samples.
errm is the weighted training error.
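A direct transcription of the algorithm into Python, using depth-1 scikit-learn trees as the weak classifiers Cm (the stump choice and the small numerical guard on errm are my additions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=100):
    """AdaBoost (Freund & Schapire, 1996); labels y must be +1/-1."""
    N = len(X)
    w = np.full(N, 1.0 / N)                        # step 1
    classifiers, alphas = [], []
    for m in range(M):                             # step 2
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)           # (a) fit with weights w_i
        miss = stump.predict(X) != y
        err = max(w[miss].sum() / w.sum(), 1e-12)  # (b) weighted error
        alpha = np.log((1 - err) / err)            # (c)
        w = w * np.exp(alpha * miss)               # (d) up-weight mistakes
        w /= w.sum()                               #     and renormalize
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Step 3: C(x) = sign(sum_m alpha_m C_m(x))."""
    votes = sum(a * c.predict(X) for c, a in zip(classifiers, alphas))
    return np.sign(votes)
```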
AdaBoost: Boosting Stumps

[Figure (Hastie): test error vs. boosting iterations; boosted stumps beat both a single stump and a single 400-node tree on the nested-spheres problem.]

A stump is a two-node tree, obtained after a single split.
Boosting stumps works remarkably well on the nested-spheres problem.
The ensemble of classifiers is much more efficient than the simple combination of classifiers.
Overfitting of AdaBoost

[Figure (Hastie): training and test error vs. number of terms for boosting on the nested-spheres problem; training error reaches zero while test error keeps decreasing.]

Nested spheres in 10 dimensions; Bayes error is 0%.
Boosting drives the training error to zero.
Further iterations continue to improve the test error in many examples.
AdaBoost is often observed to be robust to overfitting.
Overfitting of AdaBoost: Noisy Problems

[Figure (Hastie): training and test error vs. number of terms for boosting stumps on a noisy problem; the Bayes error is marked at 25%.]

Nested Gaussians in 10 dimensions; the optimal Bayes classification error is 25%.
Boosting with stumps.
Here the test error does increase, but quite slowly.