Lecture 6: Decision Tree, Random Forest, and Boosting
Tuo Zhao
Schools of ISyE and CSE, Georgia Tech
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Decision Trees
Decision Trees
Decision trees have a long history in machine learning
The first popular algorithm dates back to 1979
Very popular in many real-world problems
Intuitive to understand
Easy to build
History
1961: EPAM (Elementary Perceiver and Memorizer), a cognitive simulation model of human concept learning
1966: CLS, an early algorithm for decision tree construction
1979: ID3, based on information theory
1993: C4.5, improved over ID3
Decision trees also have a history in statistics as CART (Classification and Regression Trees)
Motivation
How do people make decisions?
Consider a variety of factors
Follow a logical path of checks
Should I eat at this restaurant?
If there is no wait
Yes
If there is a short wait and I am hungry
Yes
Else
No
Decision Flowchart

[Figure: flowchart of the restaurant decision above.]
Example: Should We Play Tennis?
If temperature is not hot
Play
If outlook is overcast
Play tennis
Otherwise
Don’t play tennis
Decision Tree

[Figure: the corresponding decision tree for the play-tennis example.]
Structure
A decision tree consists of:
Nodes: tests for variables
Branches: results of tests
Leaves: classifications
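To make this concrete, here is a minimal sketch of the structure as a recursive data type in Python. All names are illustrative, and the leaf labels below are one plausible guess at the play-tennis tree, not taken from the lecture:

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Node:
    variable: Optional[str] = None                # test for a variable (internal node)
    children: Optional[Dict[str, "Node"]] = None  # one branch per test result
    label: Optional[str] = None                   # classification (leaf)

def classify(node: Node, x: Dict[str, str]) -> str:
    """Follow the branch matching x's value at each test until reaching a leaf."""
    while node.label is None:
        node = node.children[x[node.variable]]
    return node.label

# Hypothetical play-tennis tree rooted at Outlook:
tree = Node(variable="Outlook", children={
    "Sunny":    Node(label="Don't play"),
    "Overcast": Node(label="Play"),
    "Rainy":    Node(label="Play"),
})
print(classify(tree, {"Outlook": "Overcast"}))  # -> Play
```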
Function Class
What functions can decision trees model?
Non-linear: very powerful function class
A decision tree can encode any Boolean function
Proof
Given a truth table for a function
Construct a path in the tree for each row of the table
Given a row as input, follow that path to the desired leaf (output)
Problem: exponentially large trees!
Smaller Trees
Can we produce smaller decision trees for functions?
Yes, when there are a few key decision factors
Counterexamples:
Parity function: returns 1 if an even number of inputs are 1, and 0 otherwise
Majority function: returns 1 if more than half of the inputs are 1
Bias-Variance Tradeoff
Bias: the representation power of decision trees
Variance: deep trees require a sample size exponential in the depth
Fitting Decision Trees
What Makes a Good Tree?
Small tree:
Occam’s razor: Simpler is better
Avoids over-fitting
A decision tree may be human readable, but it need not use human logic!
How do we build small trees that accurately capture data?
Learning an optimal decision tree is computationally intractable
Greedy Algorithm
We can get good trees by simple greedy algorithms
Adjustments are usually to fix greedy selection problems
Recursive:
Select the “best” variable and generate child nodes: one for each possible value;
Partition the samples using the possible values, and assign these subsets of samples to the child nodes;
Repeat for each child node until all samples associated with a node are either all positive or all negative.
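A minimal sketch of this greedy recursion in Python. The split score below is plain training misclassification, chosen only to keep the sketch self-contained; the lecture's preferred criterion (information gain) is defined in the next slides and can be swapped in:

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def split_error(samples, var):
    """Training misclassifications if each child predicts its majority label."""
    err = 0
    for value in {x[var] for x, _ in samples}:
        ys = [y for x, y in samples if x[var] == value]
        err += len(ys) - Counter(ys).most_common(1)[0][1]
    return err

def build_tree(samples, variables):
    """samples: list of (x, y) pairs with x a dict of variable -> value.
    Returns a label (leaf) or a (variable, {value: subtree}) pair (internal node)."""
    labels = [y for _, y in samples]
    if len(set(labels)) == 1:          # all positive or all negative: stop
        return labels[0]
    if not variables:                  # no further splits possible
        return majority(labels)
    var = min(variables, key=lambda v: split_error(samples, v))  # "best" variable
    children = {}
    for value in {x[var] for x, _ in samples}:
        subset = [(x, y) for x, y in samples if x[var] == value]
        children[value] = build_tree(subset, [v for v in variables if v != var])
    return (var, children)
```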
Variable Selection
The best variable for partitioning
The most informative variable
Select the variable that is most informative about the labels
Classification error?
Example:
P(Y = 1|X1 = 1) = 0.75 vs. P(Y = 1|X2 = 1) = 0.55
Which to choose?
Information Theory
The quantification of information
Founded by Claude Shannon
Simple concepts:
Entropy: H(X) = −∑_x P(X = x) log P(X = x)
Conditional Entropy: H(Y|X) = ∑_x P(X = x) H(Y|X = x)
Information Gain: IG(Y|X) = H(Y) − H(Y|X)
Select the variable with the highest information gain
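These quantities take only a few lines of Python when estimated from a labeled sample; `info_gain` below can replace `split_error` in the greedy sketch above. The 5-day dataset is a hypothetical reconstruction consistent with the worked example on the next slide:

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y P(Y=y) log2 P(Y=y), from empirical counts."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(samples, var):
    """IG(Y|X) = H(Y) - sum_x P(X=x) H(Y|X=x), for feature `var`."""
    n = len(samples)
    cond = 0.0
    for value in {x[var] for x, _ in samples}:
        ys = [y for x, y in samples if x[var] == value]
        cond += (len(ys) / n) * entropy(ys)
    return entropy([y for _, y in samples]) - cond

days = [({"Out.": "Sunny"}, "No"), ({"Out.": "Sunny"}, "No"),
        ({"Out.": "Overcast"}, "Yes"),
        ({"Out.": "Rainy"}, "Yes"), ({"Out.": "Rainy"}, "Yes")]
print(round(entropy([y for _, y in days]), 2))  # 0.97
print(round(info_gain(days, "Out."), 2))        # 0.97
```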
Example: Should We Play Tennis?
H(Tennis) = −3/5 log2(3/5) − 2/5 log2(2/5) = 0.97
H(Tennis|Out. = Sunny) = −2/2 log2(2/2) − 0/2 log2(0/2) = 0
H(Tennis|Out. = Overcast) = −0/1 log2(0/1) − 1/1 log2(1/1) = 0
H(Tennis|Out. = Rainy) = −0/2 log2(0/2) − 2/2 log2(2/2) = 0
H(Tennis|Out.) = 2/5 × 0 + 1/5 × 0 + 2/5 × 0 = 0
Example: Should We Play Tennis?
IG(Tennis|Out.) = 0.97 − 0 = 0.97
If we knew the Outlook we’d be able to predict Tennis!
Outlook is the variable to pick for our decision tree
When to Stop Training?
All data have same label
Return that label
No examples
Return majority label of all data
No further splits possible
Return majority label of passed data
What if the max IG = 0?
Example: No Information Gain?
Y  X1  X2
0  0   0
1  0   1
1  1   0
0  1   1
Both features give 0 IG
Once we divide the data, perfect classification!
We need a little exploration sometimes (Unstable Equilibrium)
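A quick numeric check of this pathology (self-contained; the data is the XOR table above):

```python
import math
from collections import Counter

def H(ys):
    n = len(ys)
    return -sum(c / n * math.log2(c / n) for c in Counter(ys).values())

# XOR table from the slide: Y = X1 xor X2.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
ys = [y for _, y in data]
for j in (0, 1):
    cond = sum(H([y for x, y in data if x[j] == v]) * 2 / 4 for v in (0, 1))
    print(f"IG(Y|X{j + 1}) = {H(ys) - cond}")   # both print 0.0

# Yet after splitting on X1, X2 determines Y exactly in each branch,
# so the two-level tree classifies perfectly.
```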
Decision Tree with Continuous Features
Decision Stump
Threshold for binary classification:
Check Xj ≥ δj or Xj < δj.
More than two classes:
Threshold for three classes: check Xj ≥ δj, γj ≤ Xj < δj, or Xj < γj.
Decompose one node into two nodes:
First node: check Xj ≥ δj or Xj < δj.
Second node: if Xj < δj, check Xj ≥ γj or Xj < γj.
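A sketch of how a stump threshold δj can be chosen for one continuous feature: scan the midpoints between sorted values and keep the split with the fewest training errors. Misclassification is used as the score here purely for illustration; information gain can be scanned the same way:

```python
def best_stump_threshold(xs, ys):
    """Return (training errors, delta) for the best split X >= delta.
    Each side predicts its majority label; ys must be 0/1."""
    pairs = sorted(zip(xs, ys))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                 # can't split equal values
        delta = (pairs[i - 1][0] + pairs[i][0]) / 2  # candidate threshold
        left = [y for x, y in pairs if x < delta]
        right = [y for x, y in pairs if x >= delta]
        errs = sum(min(side.count(0), side.count(1)) for side in (left, right))
        best = min(best, (errs, delta))
    return best

print(best_stump_threshold([0.1, 0.35, 0.4, 0.8], [0, 0, 0, 1]))
# -> (0, 0.6) up to float rounding
```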
Decision Tree for Spam Classification

[Figure (Hastie): a pruned classification tree for the SPAM data (600/1536 misclassified at the root). Internal nodes split on word/character frequencies such as ch$ < 0.0555, remove < 0.06, ch! < 0.191, george < 0.005, hp < 0.03, CAPMAX < 10.5, CAPAVE < 2.7505, free < 0.065, business < 0.145, and 1999 < 0.58; leaves are labeled email or spam with their misclassification counts.]
Decision Tree for Spam Classification

[Figure (Hastie): ROC curve (sensitivity vs. specificity) for the pruned tree on the SPAM data; TREE error 8.7%.]

SPAM data. Overall error rate on test data: 8.7%.
The ROC curve is obtained by varying the threshold c0 of the classifier: C(X) = +1 if P(+1|X) > c0.
Sensitivity: proportion of true spam identified.
Specificity: proportion of true email identified.
We may want specificity to be high, and suffer some spam:
Specificity: 95% =⇒ Sensitivity: 79%
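A sketch of how the curve is traced, assuming we have predicted probabilities P(+1|x) for the test points (illustrative code, not from the lecture):

```python
def roc_points(probs, labels):
    """Sweep the threshold c0; classify C(x) = +1 if P(+1|x) > c0.
    labels: +1 = spam, -1 = email. Returns (specificity, sensitivity) pairs."""
    points = []
    for c0 in sorted(set(probs)) + [1.0]:
        pred = [+1 if p > c0 else -1 for p in probs]
        tp = sum(1 for p, y in zip(pred, labels) if p == +1 and y == +1)
        tn = sum(1 for p, y in zip(pred, labels) if p == -1 and y == -1)
        points.append((tn / labels.count(-1), tp / labels.count(+1)))
    return points

print(roc_points([0.9, 0.8, 0.3, 0.1], [+1, +1, -1, -1]))
# [(0.5, 1.0), (1.0, 1.0), (1.0, 0.5), (1.0, 0.0), (1.0, 0.0)]
```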
Decision Tree vs. SVM

[Figure (Hastie): ROC curves for TREE vs. SVM on the SPAM data; SVM error 6.7%, TREE error 8.7%.]

Comparing ROC curves on the test data is a good way to compare classifiers. SVM dominates TREE here.
SVM outperforms Decision Tree.
AUC: Area Under the Curve.
Bagging and Random Forest
Toy Example
[Figure (Hastie): toy example, no noise; 200 training points in the (X1, X2) plane, classes 0 and 1 separated by the circle X1^2 + X2^2 = 1.]

Deterministic problem; the noise comes from the sampling distribution of X.
Use a training sample of size 200.
Here the Bayes error is 0%.
Nonlinearly separable data.
Optimal decision boundary: X1^2 + X2^2 = 1.
Toy Example: Decision Tree

[Figure (Hastie): classification tree grown on the toy data, splitting on thresholds of x.1 and x.2 (e.g., x.2 < −1.067, x.2 > 1.150, x.1 > 1.136), with misclassification counts at each node.]

[Figure (Hastie): the tree's decision boundary on the training sample, axis-parallel rectangles approximating the circle; error rate 0.073.]

When the nested spheres are in 10 dimensions, classification trees produce a rather noisy and inaccurate rule C(X), with error rates around 30%.
Sample size: 200.
7 branching nodes; 6 layers.
Classification error: 7.3% when d = 2; > 30% when d = 10.
Model Averaging
Decision trees can be simple, but often produce noisy (bushy) or weak (stunted) classifiers.
Bagging (Breiman, 1996): Fit many large trees to bootstrap-resampled versions of the training data, and classify by majority vote.
Boosting (Freund & Schapire, 1996): Fit many large or small trees to reweighted versions of the training data. Classify by weighted majority vote.
Random Forests (Breiman, 1999): Fancier version of bagging.
In general, Boosting > Random Forests > Bagging > Single Tree.
Bagging/Bootstrap Aggregation
Motivation: Average a given procedure over many samples to reduce the variance.
Given a classifier C(S, x), based on our training data S, producing a predicted class label at input point x.
To bag C, we draw bootstrap samples S1, ..., SB, each of size N, with replacement from the training data. Then

CBag(x) = Majority Vote{C(Sb, x)}, b = 1, ..., B.

Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction.
However, the simple, interpretable structure of a single tree is lost.
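A compact bagging sketch using scikit-learn trees as the base procedure C (assumes numpy and scikit-learn are available; the synthetic circle data mirrors the 2-d toy example):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag_fit(X, y, B=100, seed=0):
    """Fit B deep trees, each on a bootstrap sample of size N (with replacement)."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample S_b
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bag_predict(trees, X):
    """Classify by majority vote over the B trees."""
    votes = np.stack([t.predict(X) for t in trees])  # shape (B, n)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)    # circular boundary
trees = bag_fit(X, y, B=50)
print((bag_predict(trees, X) == y).mean())           # training accuracy
```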
Toy Example: Bagging Trees

[Figure (Hastie): the original tree and five bootstrap trees; each is a small tree, and the bootstrap trees split on somewhat different variables and thresholds (e.g., x.2 < 0.39, x.2 < 0.36, x.4 < 0.395, x.3 < −1.575).]

2 branching nodes; 2 layers.
5 dependent trees.
Toy Example: Bagging Trees

[Figure (Hastie): decision boundary of the bagged ensemble on the training sample; error rate 0.032.]

Bagging averages many trees, and produces smoother decision boundaries.
A smoother decision boundary.
Classification error: 3.2% (single deeper tree: 7.3%).
Random Forest
Bagging features and samples simultaneously:
At each tree split, a random sample of m features is drawn, and only those m features are considered for splitting. Typically m = √d or log2(d), where d is the number of features.
For each tree grown on a bootstrap sample, the error rate for observations left out of the bootstrap sample is monitored. This is called the “out-of-bag” error rate.
Random forests try to improve on bagging by “de-correlating” the trees. Each tree has the same expectation.
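A sketch combining both ideas: bootstrap the rows, subsample features per split via scikit-learn's max_features option, and track the out-of-bag error as described (names other than max_features are mine):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def forest_fit(X, y, B=500, seed=0):
    """Random forest sketch: bagging plus m = sqrt(d) features per split."""
    rng = np.random.default_rng(seed)
    forest = []
    oob_votes = [[] for _ in range(len(X))]
    for b in range(B):
        idx = rng.integers(0, len(X), size=len(X))    # bootstrap rows
        tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
        forest.append(tree.fit(X[idx], y[idx]))
        oob = np.setdiff1d(np.arange(len(X)), idx)    # rows left out of S_b
        for i, pred in zip(oob, tree.predict(X[oob])):
            oob_votes[i].append(pred)
    oob_error = np.mean([np.bincount(v).argmax() != y[i]
                         for i, v in enumerate(oob_votes) if v])
    return forest, oob_error
```

Because each tree considers only a random subset of features at each split, the trees are less correlated than in plain bagging, which makes the averaging more effective.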
Random Forest for Spam Classification

[Figure (Hastie): ROC curves for TREE, SVM, and Random Forest on the SPAM data; Random Forest error 5.0%, SVM 6.7%, TREE 8.7%.]

Random Forest dominates both other methods on the SPAM data: 5.0% error. Used 500 trees with default settings for the randomForest package in R.
RF outperforms SVM.
500 trees.
Random Forest: Variable Importance Scores

[Figure (Hastie), “Spam: Variable Importance”: relative importance of the SPAM variables; the top ones include !, $, hp, remove, free, CAPAVE, your, CAPMAX, george, CAPTOT, edu, you, and our.]

Each bootstrap sample leaves out roughly 1/3 of the observations each time.
Permute the values of each variable to check how important it is.
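A sketch of the permutation idea for a single fitted model (random forests apply it tree by tree on the out-of-bag samples and average; this simplified version permutes columns of a held-out set):

```python
import numpy as np

def permutation_importance(model, X, y, seed=0):
    """Importance of feature j = drop in accuracy when column j is permuted,
    which breaks that feature's relationship with the labels."""
    rng = np.random.default_rng(seed)
    base = (model.predict(X) == y).mean()
    drops = []
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])   # scramble one feature
        drops.append(base - (model.predict(Xp) == y).mean())
    return np.array(drops)                     # larger drop = more important
```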
Boosting
Boosting

[Figure (Hastie): schematic of boosting; classifiers C1(x), C2(x), C3(x), ..., CM(x) are fit in sequence, each to a reweighted version of the training sample.]

Average many trees, each grown to re-weighted versions of the training data (iterative).
Final classifier is a weighted average of classifiers:

C(x) = sign(∑_{m=1}^{M} αm Cm(x))
AdaBoost vs. Bagging

[Figure (Hastie): test error vs. number of terms for Bagging and AdaBoost with 100-node trees on the nested-spheres data; AdaBoost keeps improving while Bagging flattens out.]

2000 points from nested spheres in R^10.
Bayes error rate is 0%.
Trees are grown best-first without pruning.
The leftmost term is a single tree.
Each classifier for bagging is a tree with a single node (decision stump).
AdaBoost

AdaBoost (Freund & Schapire, 1996)

1. Initialize the observation weights wi = 1/N, i = 1, 2, ..., N.
2. For m = 1 to M repeat steps (a)–(d):
(a) Fit a classifier Cm(x) to the training data using weights wi.
(b) Compute the weighted error of the newest tree:
errm = ∑_{i=1}^{N} wi I(yi ≠ Cm(xi)) / ∑_{i=1}^{N} wi.
(c) Compute αm = log[(1 − errm)/errm].
(d) Update the weights for i = 1, ..., N:
wi ← wi · exp[αm · I(yi ≠ Cm(xi))]
and renormalize so the wi sum to 1.
3. Output C(x) = sign(∑_{m=1}^{M} αm Cm(x)).
wi’s are the weights of the samples.
errm is the weighted training error.
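A direct transcription of the algorithm into Python, using depth-1 scikit-learn trees as the weak classifiers Cm (the stump choice and the small numerical guard on errm are my additions):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=100):
    """AdaBoost (Freund & Schapire, 1996); labels y must be +1/-1."""
    N = len(X)
    w = np.full(N, 1.0 / N)                        # step 1
    classifiers, alphas = [], []
    for m in range(M):                             # step 2
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)           # (a) fit with weights w_i
        miss = stump.predict(X) != y
        err = max(w[miss].sum() / w.sum(), 1e-12)  # (b) weighted error
        alpha = np.log((1 - err) / err)            # (c)
        w = w * np.exp(alpha * miss)               # (d) up-weight mistakes
        w /= w.sum()                               #     and renormalize
        classifiers.append(stump)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Step 3: C(x) = sign(sum_m alpha_m C_m(x))."""
    votes = sum(a * c.predict(X) for c, a in zip(classifiers, alphas))
    return np.sign(votes)
```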
AdaBoost: Boosting Stumps

[Figure (Hastie): test error vs. boosting iterations; boosted stumps beat both a single stump and a single 400-node tree on the nested-spheres problem.]

A stump is a two-node tree, obtained after a single split.
Boosting stumps works remarkably well on the nested-spheres problem.
The ensemble of classifiers is much more efficient than the simple combination of classifiers.
Overfitting of AdaBoost

[Figure (Hastie): training and test error vs. number of terms for boosting on the nested-spheres problem; training error reaches zero while test error keeps decreasing.]

Nested spheres in 10 dimensions; Bayes error is 0%.
Boosting drives the training error to zero.
Further iterations continue to improve the test error in many examples.
AdaBoost is often observed to be robust to overfitting.
Overfitting of AdaBoost: Noisy Problems

[Figure (Hastie): training and test error vs. number of terms for boosting stumps on a noisy problem; the Bayes error is marked at 25%.]

Nested Gaussians in 10 dimensions; the optimal Bayes classification error is 25%.
Boosting with stumps.
Here the test error does increase, but quite slowly.