Ensemble Learning
Reading:
R. Schapire, A brief introduction to boosting
Ensemble learning
[Diagram: training sets S1, S2, ..., SN each produce a hypothesis h1, h2, ..., hN; the hypotheses are combined into an ensemble hypothesis H.]
Advantages of ensemble learning
• Can be very effective at reducing generalization error!
(E.g., by voting.)
• Ideal case: the hi have independent errors
Example
Given three hypotheses, h1, h2, h3, with hi(x) ∈ {−1, 1}
Suppose each hi has 60% generalization accuracy, and assume errors are independent.
Now suppose H(x) is the majority vote of h1, h2, and h3. What is the probability that H is correct?
h1 h2 h3 H probability
C C C C .216
C C I C .144
C I I I .096
C I C C .144
I C C C .144
I I C I .096
I C I I .096
I I I I .064
Total probability correct: .648
Another Example
Again, given three hypotheses, h1, h2, h3.
Suppose each hi has 40% generalization accuracy, and assume errors are independent.
Now suppose we classify x as the majority vote of h1, h2, and h3. What is the probability that the classification is correct?
h1 h2 h3 H probability
C C C C .064
C C I C .096
C I I I .144
C I C C .096
I C C C .096
I I C I .144
I C I I .144
I I I I .216
Total probability correct: .352
General case
In general, if hypotheses h1, ..., hM all have generalization accuracy A, what is the probability that a majority vote will be correct?
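The general-case probability is the upper tail of a binomial distribution: the majority vote is correct whenever more than M/2 of the independent hypotheses are correct. A minimal sketch (function name is mine; assumes odd M and independent errors, as in the examples above):

```python
from math import comb

def majority_vote_accuracy(M, A):
    """Probability that a majority vote of M independent hypotheses,
    each with generalization accuracy A, is correct (M odd)."""
    # Correct iff more than M/2 of the M hypotheses are correct.
    return sum(comb(M, k) * A**k * (1 - A)**(M - k)
               for k in range(M // 2 + 1, M + 1))

print(round(majority_vote_accuracy(3, 0.6), 3))  # → 0.648 (first example)
print(round(majority_vote_accuracy(3, 0.4), 3))  # → 0.352 (second example)
```

Note the two directions: with A > 0.5 the ensemble beats each individual hypothesis, and with A < 0.5 it is worse, exactly as the two worked tables show.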
Possible problems with ensemble learning
• Errors are typically not independent
• Training time and classification time are increased by a factor of M.
• Hard to explain how ensemble hypothesis does classification.
• How to get enough data to create M separate data sets,
S1, ..., SM?
• Three popular methods:
– Voting:
• Train a classifier on M different training sets Si to obtain M different classifiers hi.
• For a new instance x, define H(x) as:
H(x) = sgn( α1h1(x) + α2h2(x) + ... + αMhM(x) )
where αi is a confidence measure for classifier hi.
– Bagging (Breiman, 1990s):
• To create Si, create “bootstrap replicates” of the original training set S.
– Boosting (Schapire & Freund, 1990s):
• To create Si, reweight examples in the original training set S as a function of whether or not they were misclassified on the previous round.
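A bootstrap replicate, as used by bagging, is simply a sample of |S| examples drawn from S uniformly at random with replacement. A minimal sketch (function name is mine):

```python
import random

def bootstrap_replicate(S, rng=random):
    """Create one bootstrap replicate of training set S:
    |S| examples drawn uniformly at random, with replacement."""
    return [rng.choice(S) for _ in S]

S = list(range(1, 11))   # stand-in for a training set of 10 examples
S1 = bootstrap_replicate(S)
print(len(S1))           # → 10 (same size as S, usually with repeats)
```

Because sampling is with replacement, each replicate typically omits roughly a third of the original examples and repeats others, which is what makes the M learned classifiers differ.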
Adaptive Boosting (Adaboost)
A method for combining different weak hypotheses (training error close to but less than 50%) to produce a strong hypothesis (training error close to 0%)
Sketch of algorithm
Given examples S and learning algorithm L, with | S | = N
• Initialize probability distribution over examples w1(i) = 1/N .
• Repeatedly run L on training sets St ⊆ S to produce h1, h2, ..., hK.
– At each step, derive St from S by choosing examples probabilistically according to probability distribution wt . Use St to learn ht.
• At each step, derive wt+1 by giving more probability to examples that were misclassified at step t.
• The final ensemble classifier H is a weighted sum of the ht’s, with each weight being a function of the corresponding ht’s error on its training set.
Adaboost algorithm
• Given S = {(x1, y1), ..., (xN, yN)} where xi ∈ X, yi ∈ {+1, −1}
• Initialize w1(i) = 1/N. (Uniform distribution over data)
• For t = 1, ..., K:
– Select new training set St from S with replacement, according to wt
– Train L on St to obtain hypothesis ht
– Compute the training error εt of ht on S:
εt = Σi wt(i) · I( ht(xi) ≠ yi )   [the summed weight of the examples ht misclassifies]
– Compute coefficient:
αt = (1/2) ln( (1 − εt) / εt )
– Compute new weights on the data:
For i = 1 to N:
wt+1(i) = wt(i) · exp( −αt yi ht(xi) ) / Zt
where Zt is a normalization factor chosen so that wt+1 will be a probability distribution:
Zt = Σi wt(i) · exp( −αt yi ht(xi) )
• At the end of K iterations of this algorithm, we have
h1, h2, . . . , hK
We also have
α1, α2, . . . , αK, where
αt = (1/2) ln( (1 − εt) / εt )
• Ensemble classifier:
H(x) = sgn( Σt αt ht(x) )
• Note that hypotheses with higher accuracy on their training sets are weighted more strongly.
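The whole algorithm can be sketched compactly. This is a minimal illustration, not the homework's SVM-based version: the weak learner is a decision stump on 1-D toy data (all names are mine), and instead of resampling St from wt it trains on the weighted error directly, a standard equivalent formulation of the same update.

```python
import math

def train_stump(X, y, w):
    """Weak learner L: the decision stump (single threshold test)
    with the lowest weighted training error on (X, y)."""
    best = None
    for thr in sorted(set(X)):
        for pol in (1, -1):
            preds = [pol if x > thr else -pol for x in X]
            err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    err, thr, pol = best
    return (lambda x, t=thr, p=pol: p if x > t else -p), err

def adaboost(X, y, K):
    N = len(X)
    w = [1.0 / N] * N                      # w1(i) = 1/N
    hs, alphas = [], []
    for _ in range(K):
        ht, eps = train_stump(X, y, w)     # learn ht under weights wt
        eps = max(eps, 1e-10)              # guard against eps = 0
        alpha = 0.5 * math.log((1 - eps) / eps)
        # Reweight: misclassified examples gain probability mass.
        w = [wi * math.exp(-alpha * yi * ht(xi))
             for wi, xi, yi in zip(w, X, y)]
        Z = sum(w)                         # normalization factor Zt
        w = [wi / Z for wi in w]
        hs.append(ht)
        alphas.append(alpha)
    # Ensemble classifier H(x) = sgn( sum_t alpha_t * h_t(x) )
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, hs)) >= 0 else -1

# Toy 1-D data that no single stump classifies perfectly:
X = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 1, -1, -1, 1, 1, -1, -1]
H = adaboost(X, y, K=3)
print(sum(H(x) == yi for x, yi in zip(X, y)) / len(X))  # → 1.0
```

The best single stump here gets only 6/8 of the training examples right, yet three boosted stumps reach zero training error: a weak learner turned into a strong one, as claimed above.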
A Hypothetical Example
Training set S = {x1, ..., x8}, where {x1, x2, x3, x4} are class +1
and {x5, x6, x7, x8} are class −1
t = 1 :
w1 = {1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8}
S1 = {x1, x2, x2, x5, x5, x6, x7, x8} (notice some repeats)
Train classifier on S1 to get h1
Run h1 on S. Suppose classifications are: {1, −1, −1, −1, −1, −1, −1, −1}
• Calculate error:
Calculate α’s:
Calculate new w’s:
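This step can be checked numerically. A sketch (labels and h1's votes are from the example above; note the exact values come out to 0.1 and 0.167, slightly different from the rounded numbers shown on the next slide):

```python
import math

# Step t = 1 of the worked example: uniform weights, h1's votes on S.
y  = [1, 1, 1, 1, -1, -1, -1, -1]      # true labels for x1..x8
h1 = [1, -1, -1, -1, -1, -1, -1, -1]   # h1's classifications on S
w1 = [1/8] * 8

# Error = summed weight of misclassified examples (x2, x3, x4).
eps1 = sum(w for w, yi, p in zip(w1, y, h1) if p != yi)
alpha1 = 0.5 * math.log((1 - eps1) / eps1)

# New weights: misclassified examples go up, correct ones go down.
w2 = [w * math.exp(-alpha1 * yi * p) for w, yi, p in zip(w1, y, h1)]
Z1 = sum(w2)
w2 = [w / Z1 for w in w2]

print(eps1)              # → 0.375
print(round(alpha1, 3))  # → 0.255
print([round(w, 3) for w in w2])
```

After normalization the misclassified examples (x2, x3, x4) carry more probability than the correctly classified ones, so they are more likely to be drawn into S2.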
t = 2
w2 = {0.102, 0.163, 0.163, 0.163, 0.102, 0.102, 0.102, 0.102}
S2 = {x1, x2, x2, x3, x4, x4, x7, x8}
Train classifier on S2 to get h2
Run h2 on S. Suppose classifications are: {1, 1, 1, 1, 1, 1, 1, 1}
Calculate error:
Calculate α’s:
Calculate w’s:
t = 3:
w3 = {0.082, 0.139, 0.139, 0.139, 0.125, 0.125, 0.125, 0.125}
S3 = {x2, x3, x3, x3, x5, x6, x7, x8}
Train classifier on S3 to get h3
Run h3 on S. Suppose classifications are: {1, 1, −1, 1, −1, −1, 1, −1}
Calculate error:
• Calculate α’s:
• Ensemble classifier: H(x) = sgn( α1h1(x) + α2h2(x) + α3h3(x) )
What is the accuracy of H on the training data?
Recall the training set: {x1, x2, x3, x4} are class +1; {x5, x6, x7, x8} are class −1.

Example  Actual class   h1   h2   h3
x1        1              1    1    1
x2        1             −1    1    1
x3        1             −1    1   −1
x4        1              1    1    1
x5       −1             −1    1   −1
x6       −1             −1    1   −1
x7       −1              1    1    1
x8       −1             −1    1    1
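The question above can be answered by plugging the table into the ensemble formula. A sketch, assuming ε2 and ε3 are computed from the example's weight vectors (h2 misclassifies {x5, ..., x8} under w2; h3 misclassifies {x3, x7} under w3):

```python
import math

# Training errors of h1, h2, h3 from the worked example.
eps = [0.375, 0.408, 0.264]
alpha = [0.5 * math.log((1 - e) / e) for e in eps]

actual = [1, 1, 1, 1, -1, -1, -1, -1]
h = [
    [1, -1, -1, 1, -1, -1, 1, -1],   # h1's votes on x1..x8
    [1, 1, 1, 1, 1, 1, 1, 1],        # h2's votes
    [1, 1, -1, 1, -1, -1, 1, -1],    # h3's votes
]

# H(xi) = sgn( alpha1*h1(xi) + alpha2*h2(xi) + alpha3*h3(xi) )
H = [1 if sum(a * ht[i] for a, ht in zip(alpha, h)) >= 0 else -1
     for i in range(8)]
accuracy = sum(p == yi for p, yi in zip(H, actual)) / 8
print(H)          # predictions for x1..x8
print(accuracy)   # → 0.75 (x3 and x7 are misclassified)
```

So the ensemble gets 6 of 8 training examples right: it fixes h2's useless all-positive votes but still misses x3 and x7, where the weighted votes lean the wrong way.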
HW 2: SVMs and Boosting
Caveat: I don’t know if boosting will help. But worth a try!
Boosting implementation in HW 2
Use arrays and vectors.
Let S be the training set. You can set S to DogsVsCats.train or some subset of examples.
Let M = |S| be the number of training examples. Let K be the number of boosting iterations.
; Initialize:
Read in S as an array of M rows and 65 columns.
Create the weight vector w(1) = (1/M, 1/M, 1/M, ..., 1/M)
; Run Boosting Iterations:
For T = 1 to K {
; Create training set S<T>:
For i = 1 to M {
Select (with replacement) an example from S using weights in w.
(Use “roulette wheel” selection algorithm)
Print that example to S<T> (i.e., “S1”)
; Learn classifier T
svm_learn <kernel parameters> S<T> h<T>
; Calculate Error<T> :
svm_classify S h<T> <T>.predictions
(e.g., “svm_classify S h1 1.predictions”)
Use <T>.predictions to determine which examples in S were misclassified.
Error<T> = Sum of weights of examples that were misclassified.
; Calculate Alpha<T>
; Calculate new weight vector, using Alpha<T> and <T>.predictions file
} ; T ← T + 1
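The “roulette wheel” selection used in the loop above picks example i with probability w[i] by drawing a uniform random number and locating it in the cumulative weight distribution. A sketch (function name is mine):

```python
import random
from bisect import bisect_left
from itertools import accumulate

def roulette_select(S, w, rng=random):
    """Draw |S| examples from S with replacement, where example i is
    chosen with probability proportional to w[i] (roulette wheel)."""
    cdf = list(accumulate(w))      # cumulative weights ("wheel" sectors)
    total = cdf[-1]
    # A uniform draw in [0, total) lands in sector i with prob. w[i]/total.
    return [S[bisect_left(cdf, rng.random() * total)] for _ in S]

S = ["x%d" % i for i in range(1, 9)]
w = [1/8] * 8
print(roulette_select(S, w))       # a sample of 8 examples drawn per w
```

With uniform w this behaves like a bootstrap replicate; as boosting shifts weight onto misclassified examples, those examples dominate later samples S<T>.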
At this point, you have
h1, h2, ..., hK [the K different classifiers]
Alpha1, Alpha2, ..., AlphaK
; Run Ensemble Classifier on test data (DogsVsCats.test):
For T = 1 to K {
svm_classify DogsVsCats.test h<T> Test<T>.predictions
}
At this point you have: Test1.predictions, Test2.predictions, ..., TestK.predictions
To classify test example x1:
The sign of row 1 of Test1.predictions gives h1(x1)
The sign of row 1 of Test2.predictions gives h2(x1)
etc.
H(x1) = sgn [Alpha1 * h1(x1) + Alpha2 * h2(x1) + ... + AlphaK * hK(x1)]
Compute H(x) for every x in the test set.
Sources of Classifier Error: Bias, Variance, and Noise
• Bias: The classifier cannot learn the correct hypothesis (no matter what training data is given), so an incorrect hypothesis h is learned. The bias is the average error of h over all possible training sets.
• Variance: The training data is not representative enough of all data, so the learned classifier h varies from one training set to another.
• Noise: The training data contains errors, so an incorrect hypothesis h is learned.
Give examples of these.
Adaboost seems to reduce both bias and variance.
Adaboost does not seem to overfit for increasing K.
Optional: Read about “Margin-theory” explanation of success of Boosting