Classifier Ensembles
Ludmila Kuncheva
School of Computer Science, Bangor University
[email protected]
Part 2
Slide 2
[Diagram: Data set → Features → Classifier 1, Classifier 2, ..., Classifier L → Combiner]
Levels of questions:
A. Combination level: selection or fusion? Voting or another combination method? Trainable or non-trainable combiner? And why not another classifier?
B. Classifier level: same or different classifiers? Decision trees, neural networks or other? How many?
C. Feature level: all features or subsets of features? Random or selected subsets?
D. Data level: independent or dependent bootstrap samples? Selected data sets?
Building ensembles: Boosting, Random Subspace, Random Forest, Rotation Forest, Bagging, Linear Oracle
Slide 3
[Same diagram and levels of questions as Slide 2, annotated with where each method operates: Boosting, Random Subspace, Random Forest, Rotation Forest, Bagging, Linear Oracle]
Building ensembles. The combination level (A) seems under-researched...
Ensemble (label outputs: R, G, B)
[Illustration: a colour with R = 204, G = 102, B = 54]
Classifier outputs: Red, Blue, Red, Green, Red
Majority vote → Red
Combiner
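A minimal sketch of the plurality (majority) vote combiner over label outputs; the five labels are the classifier outputs above.

```python
# Plurality (majority) vote over label outputs.
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label (ties broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

print(majority_vote(["Red", "Blue", "Red", "Green", "Red"]))  # -> Red
```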
Slide 7
Ensemble (label outputs: R, G, B)
[Illustration: a colour with R = 200, G = 219, B = 190]
Classifier outputs: Red, Blue, Red, Green, Red
Majority vote → Red, but weighted majority vote → Green
(classifier weights shown on the slide: 0.05, 0.50, 0.02, 0.10, 0.70, 0.10, 0.27, 0.70, 0.50)
Combiner
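A sketch of the weighted majority vote. The five weights below are a hypothetical arrangement of the slide's numbers, chosen only so that one heavily weighted Green vote outweighs three Red votes.

```python
# Weighted majority vote: each classifier's vote counts with its weight.
from collections import defaultdict

def weighted_majority_vote(labels, weights):
    score = defaultdict(float)
    for label, w in zip(labels, weights):
        score[label] += w
    return max(score, key=score.get)

labels = ["Red", "Blue", "Red", "Green", "Red"]
weights = [0.05, 0.10, 0.02, 0.70, 0.10]   # hypothetical weights
print(weighted_majority_vote(labels, weights))  # -> Green
```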
Slide 8
Ensemble (label outputs: R, G, B)
Classifier outputs: Red, Blue, Red, Green, Red. The label outputs themselves (e.g. the string RBRRGR) can be fed as input to another classifier acting as the combiner; here it outputs Green.
Combiner
Ensemble (continuous outputs, [R, G, B])
Classifier outputs: [0.6 0.3 0.1], [0.1 0.0 0.6], [0.7 0.6 0.5], [0.4 0.3 0.1], [0 1 0], [0.9 0.7 0.8]
Mean R = 0.45, Mean G = 0.48
Combiner
Slide 12
Ensemble (continuous outputs, [R, G, B])
Classifier outputs: [0.6 0.3 0.1], [0.1 0.0 0.6], [0.7 0.6 0.5], [0.4 0.3 0.1], [0 1 0], [0.9 0.7 0.8]
Mean R = 0.45, Mean G = 0.48, Mean B = 0.35 → Class GREEN
Combiner
Slide 13
Ensemble (continuous outputs, [R, G, B])
Decision profile (one row per classifier, one column per class):
0.6 0.3 0.1
0.1 0.0 0.6
0.7 0.6 0.5
0.4 0.3 0.1
0.0 1.0 0.0
0.9 0.7 0.8
Mean R = 0.45, Mean G = 0.48, Mean B = 0.35 → Class GREEN
Combiner
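A sketch of the mean combiner reproducing the numbers above; the array layout follows the decision profile on the slide.

```python
# Mean (average) combiner over the decision profile:
# six classifiers, three classes [R, G, B].
import numpy as np

decision_profile = np.array([
    [0.6, 0.3, 0.1],
    [0.1, 0.0, 0.6],
    [0.7, 0.6, 0.5],
    [0.4, 0.3, 0.1],
    [0.0, 1.0, 0.0],
    [0.9, 0.7, 0.8],
])

support = decision_profile.mean(axis=0)   # [0.45, 0.4833, 0.35]
classes = ["Red", "Green", "Blue"]
print(classes[int(np.argmax(support))])   # -> Green
```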
Slide 14
Time for an example: combiner matters
Slide 15
Data set: let's call this The Tropical Fish, or just the fish data. A 50-by-50 grid = 2500 objects in 2-d. Bayes error rate = 0%. We induce label noise to make the problem more interesting: [panels with 10% noise and 45% noise].
Slide 16
Example: 2 ensembles. Ensemble 1: train 50 linear classifiers on bootstrap samples. Ensemble 2: throw 50 random lines ("straws") across the data, and label the fish side of each line so that its accuracy is greater than 0.5.
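A sketch of how the first ensemble could be built. The data generator below merely stands in for the fish data, and LogisticRegression stands in for the slide's linear classifier; both are assumptions.

```python
# Ensemble 1 sketch: 50 linear classifiers on bootstrap samples.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2500, 2))        # 50-by-50 grid stand-in
y = (X[:, 0] + X[:, 1] > 1).astype(int)      # stand-in 'fish' labels
flip = rng.random(len(y)) < 0.10             # induce 10% label noise
y = np.where(flip, 1 - y, y)

ensemble = []
for _ in range(50):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
    ensemble.append(LogisticRegression().fit(X[idx], y[idx]))

# Majority vote of the 50 linear classifiers
votes = np.mean([clf.predict(X) for clf in ensemble], axis=0)
y_pred = (votes > 0.5).astype(int)
print("ensemble training accuracy:", np.mean(y_pred == y))
```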
Slide 17
Example: 2 ensembles. Each classifier returns an estimate of the support for class "fish", P(fish | x). And, of course, we also have the support for the other class, 1 − P(fish | x), but we will not need this.
Slide 18
Example: 2 ensembles 10% label noise
Slide 19
Example: 2 ensembles 45% label noise
Slide 20
Example: 2 ensembles 45% label noise
Slide 21
Example: 2 ensembles 45% label noise
Slide 22
Example: 2 ensembles. What does the example show?
- The combiner matters (a lot).
- Noise helps the ensemble!
- The trained combiners for continuous outputs are best (linear, tree).
- BKS (Behaviour Knowledge Space) works here because of the small number of classes and classifiers.
However, nothing is as simple as it looks...
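Since BKS comes up here: a minimal sketch of a Behaviour Knowledge Space combiner (function names and toy data are mine, not from the slides). The lookup table has one cell per combination of label outputs, which is exactly why BKS only works with few classes and few classifiers.

```python
# BKS (Behaviour Knowledge Space) combiner: a lookup table indexed by
# the tuple of label outputs, storing the most frequent true class
# observed for that tuple in the training data.
from collections import Counter, defaultdict

def bks_train(label_outputs, y_true):
    """label_outputs: one tuple of L labels per training object."""
    cells = defaultdict(Counter)
    for combo, y in zip(label_outputs, y_true):
        cells[combo][y] += 1
    return {combo: c.most_common(1)[0][0] for combo, c in cells.items()}

def bks_predict(table, combo, fallback):
    # The table has (number of classes)^L cells, so unseen
    # combinations are common; hence the fallback label.
    return table.get(combo, fallback)

table = bks_train([("R", "B", "R"), ("R", "B", "R"), ("G", "G", "R")],
                  ["Red", "Red", "Green"])
print(bks_predict(table, ("R", "B", "R"), fallback="Red"))  # -> Red
```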
Slide 23
The Combining Classifier: to Train or Not to Train?
[Image credit: http://samcnitt.tumblr.com/]
Slide 24
Slide 25
Train the COMBINER if you have enough data! Otherwise, like with any classifier, we may overfit the data. Get this: almost NOBODY trains the combiner, not in the CLASSIC ensemble methods anyway. Ha-ha-ha, what is "enough data"?
580 papers on diversity, by venue:
MULTIPLE CLASSIFIER SYSTEMS 30
INT JOINT CONF ON NEURAL NETWORKS (IJCNN) 22
PATTERN RECOGNITION 17
NEUROCOMPUTING 14
EXPERT SYSTEMS WITH APPLICATIONS 13
INFORMATION SCIENCES 12
APPLIED SOFT COMPUTING 11
PATTERN RECOGNITION LETTERS 10
INFORMATION FUSION 9
IEEE INT JOINT CONF ON NEURAL NETWORKS 9
KNOWLEDGE-BASED SYSTEMS 7
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 7
INT J OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE 6
MACHINE LEARNING 5
IEEE TRANSACTIONS ON NEURAL NETWORKS 5
JOURNAL OF MACHINE LEARNING RESEARCH 5
APPLIED INTELLIGENCE 4
INTELLIGENT DATA ANALYSIS 4
IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION 4
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING 4
NEURAL INFORMATION PROCESSING 4
Diversity
Slide 29
Where in the world are we? China 140, UK 68, USA 63, Spain 55, Brazil 41, Canada 32, Poland 28, Iran 23, Italy 19... Diversity
Slide 30
Are we still talking about diversity in classifier ensembles?
Apparently yes... That elusive diversity... We want the classifiers
in the ensemble to be ACCURATE and DIVERSE simultaneously. And HOW
CAN THIS HAPPEN?!? Diversity
Slide 31
All ensemble methods we have seen so far strive to keep the
individual accuracy high while increasing diversity. How can we
measure diversity? WHAT can we do with the diversity value?
Slide 32
Measure diversity for a PAIR of classifiers. Independent outputs are not the point; we care about independent ERRORS. Hence, use ORACLE outputs (1 = correct, 0 = wrong) and count instances in a 2x2 table:

                      Classifier 2 correct   Classifier 2 wrong
Classifier 1 correct         N11                   N10
Classifier 1 wrong           N01                   N00

e.g. N10 = number of instances labelled correctly by classifier 1 and mislabelled by classifier 2.
Diversity
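A sketch of two common pairwise measures computed from oracle outputs and the table above: the disagreement measure (N10 + N01)/N and Yule's Q statistic.

```python
# Pairwise diversity from oracle outputs (1 = correct, 0 = wrong):
# fill the 2x2 table, then compute the disagreement measure and
# Yule's Q statistic (Q assumes a nonzero denominator).
import numpy as np

def pairwise_diversity(o1, o2):
    o1, o2 = np.asarray(o1), np.asarray(o2)
    n11 = np.sum((o1 == 1) & (o2 == 1))   # both correct
    n10 = np.sum((o1 == 1) & (o2 == 0))   # 1 correct, 2 wrong
    n01 = np.sum((o1 == 0) & (o2 == 1))   # 1 wrong, 2 correct
    n00 = np.sum((o1 == 0) & (o2 == 0))   # both wrong
    disagreement = (n10 + n01) / (n11 + n10 + n01 + n00)
    q = (n11 * n00 - n10 * n01) / (n11 * n00 + n10 * n01)
    return disagreement, q

print(pairwise_diversity([1, 1, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0]))
# -> (0.5, 0.0): the pair disagrees on half the instances
```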
Do we need more NEW pairwise diversity measures? Looks like we don't... Diversity. And the same holds for non-pairwise measures... Far too many already.
Slide 36
All ensemble methods we have seen so far strive to keep the individual accuracy high while increasing diversity. How can we measure diversity? WHAT can we do with the diversity value?
- Compare ensembles
- Explain why a certain ensemble heuristic works and others don't
- Construct an ensemble by overproducing classifiers and then selecting a subset with high accuracy and high diversity (a sketch follows below)
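As referenced in the list above, a sketch of greedy overproduce-and-select. The linear accuracy/agreement trade-off and its weight alpha are illustrative assumptions, not a published algorithm.

```python
# Overproduce-and-select sketch: greedily grow the ensemble with the
# classifier that best trades validation accuracy against average
# agreement with the members already chosen.
import numpy as np

def overproduce_and_select(oracle, k, alpha=0.5):
    """oracle: pool_size x N 0/1 matrix of correctness on validation data."""
    oracle = np.asarray(oracle)
    selected = [int(np.argmax(oracle.mean(axis=1)))]  # seed: most accurate
    while len(selected) < k:
        best, best_score = None, -np.inf
        for i in set(range(len(oracle))) - set(selected):
            acc = oracle[i].mean()
            agree = np.mean([np.mean(oracle[i] == oracle[j]) for j in selected])
            score = acc - alpha * agree    # reward accuracy, penalise agreement
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```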
Slide 37
Why is diversity so baffling? The problem is that diversity is
NOT monotonically related to the ensemble accuracy. In other words,
diverse ensembles may be good or may be bad...
Slide 38
Good diversity and bad diversity
Slide 39
Good and Bad diversity
3 classifiers: A, B, C; 15 objects; each vote is either wrong or correct; individual accuracy = 10/15 = 0.667. P = ensemble accuracy under MAJORITY VOTE:
- independent classifiers: P = 11/15 = 0.733
- identical classifiers: P = 10/15 = 0.667
- dependent classifiers 1: P = 7/15 = 0.467
- dependent classifiers 2: P = 15/15 = 1.000
[Vote tables for A, B, C shown on the slide]
Slide 40
Good and Bad diversity
[Same four vote tables as the previous slide, now annotated:] the disagreement that lifts the majority vote above the individual accuracy (dependent classifiers 2, P = 1.000) is GOOD diversity; the disagreement that drags it below (dependent classifiers 1, P = 0.467) is BAD diversity.
Slide 41
Good and Bad diversity Data set Z Ensemble, L = 7 classifiers
Are these outputs diverse?
Slide 42
Good and Bad diversity Data set Z Ensemble, L = 7 classifiers
How about these?
Slide 43
Good and Bad diversity Data set Z Ensemble, L = 7 classifiers 3 vs 4... Can't be more diverse, really...
Slide 44
Good and Bad diversity Data set Z Ensemble, L = 7 classifiers
MAJORITY VOTE Good diversity
Slide 45
Good and Bad diversity Data set Z Ensemble, L = 7 classifiers
MAJORITY VOTE Bad diversity
Slide 46
Good and Bad diversity: decomposition of the majority vote error. Write N for the number of objects, L for the number of classifiers, λ(x) for the number of classifiers wrong on object x, and Z+ / Z− for the objects on which the majority vote is correct / wrong. Then:

\[
E_{\mathrm{maj}} \;=\; \underbrace{\frac{1}{NL}\sum_{x}\lambda(x)}_{\text{individual error}}
\;-\; \underbrace{\frac{1}{NL}\sum_{x\in Z_{+}}\lambda(x)}_{\text{subtract GOOD diversity}}
\;+\; \underbrace{\frac{1}{NL}\sum_{x\in Z_{-}}\bigl(L-\lambda(x)\bigr)}_{\text{add BAD diversity}}
\]

Brown G., L.I. Kuncheva, "Good" and "bad" diversity in majority vote ensembles, Proc. Multiple Classifier Systems (MCS'10), Cairo, Egypt, LNCS 5997, 2010, 124-133.
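A sketch of the decomposition computed from oracle outputs; the L x N matrix layout and function name are my assumptions, and the assert checks the identity above.

```python
# Good/bad diversity decomposition of the majority vote error
# (after Brown & Kuncheva, 2010). `oracle` is an L x N 0/1 array,
# 1 meaning the classifier is correct on that object.
import numpy as np

def decompose_majority_vote_error(oracle):
    oracle = np.asarray(oracle)
    L, N = oracle.shape
    lam = L - oracle.sum(axis=0)                   # wrong votes per object
    maj_correct = oracle.sum(axis=0) > L / 2
    e_ind = lam.mean() / L                         # average individual error
    good = lam[maj_correct].sum() / (N * L)        # disagreement the vote survives
    bad = (L - lam)[~maj_correct].sum() / (N * L)  # correct votes wasted on lost objects
    e_maj = float(np.mean(~maj_correct))
    assert np.isclose(e_maj, e_ind - good + bad)   # the identity above
    return e_maj, e_ind, good, bad

# Example: 3 classifiers, 5 objects
print(decompose_majority_vote_error([[1, 1, 0, 1, 0],
                                     [1, 0, 1, 0, 0],
                                     [0, 1, 1, 1, 1]]))
```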
Slide 47
Good and Bad diversity. Note that the diversity quantity is 3 in both cases; what differs is whether the disagreement helps or hurts the majority vote.
Slide 48
Ensemble Margin: [illustration of POSITIVE vs NEGATIVE margins]
Slide 49
Ensemble Margin. The average margin keeps the sign. However, nearly all diversity measures are functions of the average absolute margin or the average squared margin, and there the margin has no sign...
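For reference, a hedged sketch of the voting margin in this notation; the symbols v_j and y are my assumptions, not from the slide.

```latex
% Voting margin for an object x with true class y and L classifiers,
% where v_j(x) is the number of votes for class j:
\[
  m(x) \;=\; \frac{v_y(x) \;-\; \max_{j \neq y} v_j(x)}{L},
  \qquad -1 \;\le\; m(x) \;\le\; 1 .
\]
% m(x) > 0 exactly when the plurality vote labels x correctly (POSITIVE
% margin) and m(x) < 0 when it errs (NEGATIVE margin). Averaging |m(x)|
% or m(x)^2 over the data, as most diversity measures do, discards the
% sign -- and with it the good/bad distinction.
```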
Slide 50
Ensemble Margin
Slide 51
The bottom line is: Diversity is not MONOTONICALLY related to
ensemble accuracy So, stop looking for what is not there...
Slide 52
Where next in classifier ensembles?
Slide 53
Kappa-error diagrams:
- proposed by Margineantu and Dietterich in 1997
- visualise individual accuracy and diversity in a 2-dimensional plot
- have been used to decide which ensemble members can be pruned without much harm to the overall performance
Slide 54
Example: sonar data (UCI): 208 instances, 60 features, 2 classes; ensemble size L = 11 classifiers; base classifier: C4.5 decision tree.
AdaBoost 75.0%, Bagging 77.0%, Random Subspace 80.9%, Random Oracle 83.3%, Rotation Forest 84.7%.
Kuncheva L.I., A bound on kappa-error diagrams for analysis of classifier ensembles, IEEE Transactions on Knowledge and Data Engineering, 2013, 25 (3), 494-501 (DOI: 10.1109/TKDE.2011.234).
Slide 55
Kappa-error diagrams. For a pair of classifiers C1 and C2, count the instances:

                C2 correct   C2 wrong
C1 correct          a            b
C1 wrong            c            d

\[
  \kappa \;=\; \frac{\theta_{\text{observed}} - \theta_{\text{chance}}}{1 - \theta_{\text{chance}}}
\]

Each pair contributes one point: kappa on one axis, the pair's average error on the other.
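A sketch of computing the two coordinates of one kappa-error point for a pair of classifiers; the chance-agreement term follows the standard Cohen's kappa construction, and the function name is mine.

```python
# One point on a kappa-error diagram: kappa measures agreement between
# the two classifiers' predicted labels; the second coordinate is the
# average of their two error rates.
import numpy as np

def kappa_error_point(pred1, pred2, y_true):
    pred1, pred2, y = map(np.asarray, (pred1, pred2, y_true))
    classes = np.unique(np.concatenate([pred1, pred2]))
    observed = np.mean(pred1 == pred2)                 # observed agreement
    chance = sum(np.mean(pred1 == c) * np.mean(pred2 == c)
                 for c in classes)                     # chance agreement
    kappa = (observed - chance) / (1.0 - chance)
    avg_error = (np.mean(pred1 != y) + np.mean(pred2 != y)) / 2
    return kappa, avg_error

print(kappa_error_point([0, 1, 1, 0, 1], [0, 1, 0, 0, 1], [0, 1, 1, 1, 1]))
```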
[Kappa-error diagrams (kappa vs error): simulated ensembles, L = 3]
Slide 58
[Kappa-error diagram (kappa vs error), real data: 77,422,500 pairs of classifiers] There is room for improvement...
Slide 59
Is there space for new classifier ensembles? Looks like
yes...
Slide 60
Strength of classifiers vs number of classifiers L:
- L = 1: the perfect classifier
- 3-8 classifiers: heterogeneous models, trained combiner (stacked generalisation)
- 100+ classifiers: same model, non-trained combiner (bagging, boosting, etc.); must engineer diversity
Large ensembles of nearly identical classifiers → REDUNDANCY. Small ensembles of weak classifiers → INSUFFICIENCY.
How about here? 30-50 classifiers: same or different models? Trained or non-trained combiner? Selection or fusion?
Slide 61
MathWorks recommendations: AdaBoost and... wait for it... wait for iiiiit... AdaBoost
Slide 62
MathWorks recommendations: ...plus, it is quite expensive.
Slide 63
One final play instead of conclusions...
Slide 64
For the winner: [an illustration] by my favourite illustrator Marcello Barenghi. Well, I'll give you a less crinkled one :)
Slide 65
Time for you now... A guessing game. Recall our digit example. Data for this example: a small part of MNIST; a single decision tree scores 68.2%. The competitors are: Bagging, AdaBoost, Random Forest, Random Subspace and Rotation Forest, ALL with 10 decision trees. YOUR TASK: rank the competitors and predict the ensemble accuracy for each one. The WINNER will be a correct ranking and predictions within 3% of the true accuracies (MSE for a tie-break). The judge is WEKA.
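If you want to play along outside WEKA, here is a hedged scikit-learn sketch of four of the five competitors. Rotation Forest is not in scikit-learn, load_digits only stands in for the MNIST subset, and the depth-3 AdaBoost trees are my choice, so the numbers will not match the slides.

```python
# A sketch of the guessing-game experiment in scikit-learn.
from sklearn.datasets import load_digits
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
models = {
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=10),
    "AdaBoost": AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                                   n_estimators=10),  # shallow trees so
                                                      # boosting can reweight
    "Random Forest": RandomForestClassifier(n_estimators=10),
    "Random Subspace": BaggingClassifier(DecisionTreeClassifier(),
                                         n_estimators=10, bootstrap=False,
                                         max_features=0.5),
}
for name, model in models.items():
    print(f"{name}: {cross_val_score(model, X, y, cv=10).mean():.3f}")
```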
Slide 66
Ensembles of 10 (single decision tree: 68.2%):
1. Rotation Forest 85.0%
2. AdaBoost 82.9%
3. Random Subspace 79.1%
4. Random Forest 78.7%
5. Bagging 75.6%
Slide 67
[Same ranking as the previous slide] But you know what the funny thing is?...
Slide 68
Rotation Forest 85.0%
AdaBoost 82.9%
Random Subspace 79.1%
Random Forest 78.7%
Bagging 75.6%
decision tree 68.2%
1-nn 87.4%
SVM 89.5%
Slide 69
The moral of the story...
1. There may be a simpler solution. Don't overlook it!
2. The most acclaimed methods are not always the best.
Heeeeey, this proves the fallibility of my classifier ensemble theory, Marcello Pelillo! (who left already...) :(
Slide 70
Everyone, WAKE UP! And thank you for still being here :)
1. Classifier combiners. Nobody talks about this...
2. Time for an example: combiner matters
3. Diversity. Everybody talks about this...
4. Good diversity and bad diversity
5. Where next in classifier ensembles?
6. One final play instead of conclusions...