Classifier Ensembles
Ludmila Kuncheva
School of Computer Science, Bangor University
[email protected]
Part 2
Slide 2
[Diagram: Data set → Features → Classifier 1, Classifier 2, ..., Classifier L → Combiner]
Levels of questions:
A. Combination level: selection or fusion? Voting or another combination method? Trainable or non-trainable combiner? And why not another classifier?
B. Classifier level: same or different classifiers? Decision trees, neural networks or other? How many?
C. Feature level: all features or subsets of features? Random or selected subsets?
D. Data level: independent or dependent bootstrap samples? Selected data sets?
Building ensembles: Boosting, Random Subspace, Random Forest, Rotation Forest, Bagging, Linear Oracle
Slide 3
[Same diagram and levels of questions as Slide 2, annotated with where each method operates: Boosting, Random Subspace, Random Forest, Rotation Forest, Bagging, Linear Oracle]
Building ensembles. The combination level (A) seems under-researched...
Ensemble (label outputs: R, G, B)
[Illustration: a colour with R = 204, G = 102, B = 54]
Classifier outputs: Red, Blue, Red, Green, Red
Majority vote → Red
Combiner
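A minimal sketch of the plurality (majority) vote combiner over label outputs; the five labels are the classifier outputs above.

```python
# Plurality (majority) vote over label outputs.
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label (ties broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

print(majority_vote(["Red", "Blue", "Red", "Green", "Red"]))  # -> Red
```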
Slide 7
Ensemble (label outputs: R, G, B)
[Illustration: a colour with R = 200, G = 219, B = 190]
Classifier outputs: Red, Blue, Red, Green, Red
Majority vote → Red, but weighted majority vote → Green
(classifier weights shown on the slide: 0.05, 0.50, 0.02, 0.10, 0.70, 0.10, 0.27, 0.70, 0.50)
Combiner
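A sketch of the weighted majority vote. The five weights below are a hypothetical arrangement of the slide's numbers, chosen only so that one heavily weighted Green vote outweighs three Red votes.

```python
# Weighted majority vote: each classifier's vote counts with its weight.
from collections import defaultdict

def weighted_majority_vote(labels, weights):
    score = defaultdict(float)
    for label, w in zip(labels, weights):
        score[label] += w
    return max(score, key=score.get)

labels = ["Red", "Blue", "Red", "Green", "Red"]
weights = [0.05, 0.10, 0.02, 0.70, 0.10]   # hypothetical weights
print(weighted_majority_vote(labels, weights))  # -> Green
```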
Slide 8
Ensemble (label outputs: R, G, B)
Classifier outputs: Red, Blue, Red, Green, Red. The label outputs themselves (e.g. the string RBRRGR) can be fed as input to another classifier acting as the combiner; here it outputs Green.
Combiner
Ensemble (continuous outputs, [R, G, B])
Classifier outputs: [0.6 0.3 0.1], [0.1 0.0 0.6], [0.7 0.6 0.5], [0.4 0.3 0.1], [0 1 0], [0.9 0.7 0.8]
Mean R = 0.45, Mean G = 0.48
Combiner
Slide 12
Ensemble (continuous outputs, [R, G, B])
Classifier outputs: [0.6 0.3 0.1], [0.1 0.0 0.6], [0.7 0.6 0.5], [0.4 0.3 0.1], [0 1 0], [0.9 0.7 0.8]
Mean R = 0.45, Mean G = 0.48, Mean B = 0.35 → Class GREEN
Combiner
Slide 13
Ensemble (continuous outputs, [R, G, B])
Decision profile (one row per classifier, one column per class):
0.6 0.3 0.1
0.1 0.0 0.6
0.7 0.6 0.5
0.4 0.3 0.1
0.0 1.0 0.0
0.9 0.7 0.8
Mean R = 0.45, Mean G = 0.48, Mean B = 0.35 → Class GREEN
Combiner
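A sketch of the mean combiner reproducing the numbers above; the array layout follows the decision profile on the slide.

```python
# Mean (average) combiner over the decision profile:
# six classifiers, three classes [R, G, B].
import numpy as np

decision_profile = np.array([
    [0.6, 0.3, 0.1],
    [0.1, 0.0, 0.6],
    [0.7, 0.6, 0.5],
    [0.4, 0.3, 0.1],
    [0.0, 1.0, 0.0],
    [0.9, 0.7, 0.8],
])

support = decision_profile.mean(axis=0)   # [0.45, 0.4833, 0.35]
classes = ["Red", "Green", "Blue"]
print(classes[int(np.argmax(support))])   # -> Green
```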
Slide 14
Time for an example: combiner matters
Slide 15
Data set: let's call this The Tropical Fish, or just the fish data. A 50-by-50 grid = 2500 objects in 2-d. Bayes error rate = 0%. We induce label noise to make the problem more interesting: [panels with 10% noise and 45% noise].
Slide 16
Example: 2 ensembles. Ensemble 1: train 50 linear classifiers on bootstrap samples. Ensemble 2: throw 50 random lines ("straws") across the data, and label the fish side of each line so that its accuracy is greater than 0.5.
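A sketch of how the first ensemble could be built. The data generator below merely stands in for the fish data, and LogisticRegression stands in for the slide's linear classifier; both are assumptions.

```python
# Ensemble 1 sketch: 50 linear classifiers on bootstrap samples.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(2500, 2))        # 50-by-50 grid stand-in
y = (X[:, 0] + X[:, 1] > 1).astype(int)      # stand-in 'fish' labels
flip = rng.random(len(y)) < 0.10             # induce 10% label noise
y = np.where(flip, 1 - y, y)

ensemble = []
for _ in range(50):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
    ensemble.append(LogisticRegression().fit(X[idx], y[idx]))

# Majority vote of the 50 linear classifiers
votes = np.mean([clf.predict(X) for clf in ensemble], axis=0)
y_pred = (votes > 0.5).astype(int)
print("ensemble training accuracy:", np.mean(y_pred == y))
```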
Slide 17
Example: 2 ensembles. Each classifier returns an estimate of the support for class "fish", P(fish | x). And, of course, we also have the support for the other class, 1 − P(fish | x), but we will not need this.
Slide 18
Example: 2 ensembles 10% label noise
Slide 19
Example: 2 ensembles 45% label noise
Slide 20
Example: 2 ensembles 45% label noise
Slide 21
Example: 2 ensembles 45% label noise
Slide 22
Example: 2 ensembles. What does the example show?
- The combiner matters (a lot).
- Noise helps the ensemble!
- The trained combiners for continuous outputs are best (linear, tree).
- BKS (Behaviour Knowledge Space) works here because of the small number of classes and classifiers.
However, nothing is as simple as it looks...
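Since BKS comes up here: a minimal sketch of a Behaviour Knowledge Space combiner (function names and toy data are mine, not from the slides). The lookup table has one cell per combination of label outputs, which is exactly why BKS only works with few classes and few classifiers.

```python
# BKS (Behaviour Knowledge Space) combiner: a lookup table indexed by
# the tuple of label outputs, storing the most frequent true class
# observed for that tuple in the training data.
from collections import Counter, defaultdict

def bks_train(label_outputs, y_true):
    """label_outputs: one tuple of L labels per training object."""
    cells = defaultdict(Counter)
    for combo, y in zip(label_outputs, y_true):
        cells[combo][y] += 1
    return {combo: c.most_common(1)[0][0] for combo, c in cells.items()}

def bks_predict(table, combo, fallback):
    # The table has (number of classes)^L cells, so unseen
    # combinations are common; hence the fallback label.
    return table.get(combo, fallback)

table = bks_train([("R", "B", "R"), ("R", "B", "R"), ("G", "G", "R")],
                  ["Red", "Red", "Green"])
print(bks_predict(table, ("R", "B", "R"), fallback="Red"))  # -> Red
```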
Slide 23
The Combining Classifier: to Train or Not to Train?
[Image credit: http://samcnitt.tumblr.com/]
Slide 24
Slide 25
Train the COMBINER if you have enough data! Otherwise, like with any classifier, we may overfit the data. Get this: almost NOBODY trains the combiner, not in the CLASSIC ensemble methods anyway. Ha-ha-ha, what is "enough data"?
580 papers on diversity, by venue:
MULTIPLE CLASSIFIER SYSTEMS 30
INT JOINT CONF ON NEURAL NETWORKS (IJCNN) 22
PATTERN RECOGNITION 17
NEUROCOMPUTING 14
EXPERT SYSTEMS WITH APPLICATIONS 13
INFORMATION SCIENCES 12
APPLIED SOFT COMPUTING 11
PATTERN RECOGNITION LETTERS 10
INFORMATION FUSION 9
IEEE INT JOINT CONF ON NEURAL NETWORKS 9
KNOWLEDGE-BASED SYSTEMS 7
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 7
INT J OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE 6
MACHINE LEARNING 5
IEEE TRANSACTIONS ON NEURAL NETWORKS 5
JOURNAL OF MACHINE LEARNING RESEARCH 5
APPLIED INTELLIGENCE 4
INTELLIGENT DATA ANALYSIS 4
IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION 4
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING 4
NEURAL INFORMATION PROCESSING 4
Diversity
Slide 29
Where in the world are we? China 140, UK 68, USA 63, Spain 55, Brazil 41, Canada 32, Poland 28, Iran 23, Italy 19... Diversity
Slide 30
Are we still talking about diversity in classifier ensembles?
Apparently yes... That elusive diversity... We want the classifiers
in the ensemble to be ACCURATE and DIVERSE simultaneously. And HOW
CAN THIS HAPPEN?!? Diversity
Slide 31
All ensemble methods we have seen so far strive to keep the
individual accuracy high while increasing diversity. How can we
measure diversity? WHAT can we do with the diversity value?
Slide 32
Measure diversity for a PAIR of classifiers. Independent outputs are not the point; we care about independent ERRORS. Hence, use ORACLE outputs (1 = correct, 0 = wrong) and count instances in a 2x2 table:

                      Classifier 2 correct   Classifier 2 wrong
Classifier 1 correct         N11                   N10
Classifier 1 wrong           N01                   N00

e.g. N10 = number of instances labelled correctly by classifier 1 and mislabelled by classifier 2.
Diversity
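A sketch of two common pairwise measures computed from oracle outputs and the table above: the disagreement measure (N10 + N01)/N and Yule's Q statistic.

```python
# Pairwise diversity from oracle outputs (1 = correct, 0 = wrong):
# fill the 2x2 table, then compute the disagreement measure and
# Yule's Q statistic (Q assumes a nonzero denominator).
import numpy as np

def pairwise_diversity(o1, o2):
    o1, o2 = np.asarray(o1), np.asarray(o2)
    n11 = np.sum((o1 == 1) & (o2 == 1))   # both correct
    n10 = np.sum((o1 == 1) & (o2 == 0))   # 1 correct, 2 wrong
    n01 = np.sum((o1 == 0) & (o2 == 1))   # 1 wrong, 2 correct
    n00 = np.sum((o1 == 0) & (o2 == 0))   # both wrong
    disagreement = (n10 + n01) / (n11 + n10 + n01 + n00)
    q = (n11 * n00 - n10 * n01) / (n11 * n00 + n10 * n01)
    return disagreement, q

print(pairwise_diversity([1, 1, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0]))
# -> (0.5, 0.0): the pair disagrees on half the instances
```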
Do we need more NEW pairwise diversity measures? Looks like we don't... Diversity. And the same holds for non-pairwise measures... Far too many already.
Slide 36
All ensemble methods we have seen so far strive to keep the individual accuracy high while increasing diversity. How can we measure diversity? WHAT can we do with the diversity value?
- Compare ensembles
- Explain why a certain ensemble heuristic works and others don't
- Construct an ensemble by overproducing classifiers and then selecting a subset with high accuracy and high diversity (a sketch follows below)
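As referenced in the list above, a sketch of greedy overproduce-and-select. The linear accuracy/agreement trade-off and its weight alpha are illustrative assumptions, not a published algorithm.

```python
# Overproduce-and-select sketch: greedily grow the ensemble with the
# classifier that best trades validation accuracy against average
# agreement with the members already chosen.
import numpy as np

def overproduce_and_select(oracle, k, alpha=0.5):
    """oracle: pool_size x N 0/1 matrix of correctness on validation data."""
    oracle = np.asarray(oracle)
    selected = [int(np.argmax(oracle.mean(axis=1)))]  # seed: most accurate
    while len(selected) < k:
        best, best_score = None, -np.inf
        for i in set(range(len(oracle))) - set(selected):
            acc = oracle[i].mean()
            agree = np.mean([np.mean(oracle[i] == oracle[j]) for j in selected])
            score = acc - alpha * agree    # reward accuracy, penalise agreement
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected
```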
Slide 37
Why is diversity so baffling? The problem is that diversity is
NOT monotonically related to the ensemble accuracy. In other words,
diverse ensembles may be good or may be bad...
Slide 38
Good diversity and bad diversity
Slide 39
Good and Bad diversity
3 classifiers: A, B, C; 15 objects; each vote is either wrong or correct; individual accuracy = 10/15 = 0.667. P = ensemble accuracy under MAJORITY VOTE:
- independent classifiers: P = 11/15 = 0.733
- identical classifiers: P = 10/15 = 0.667
- dependent classifiers 1: P = 7/15 = 0.467
- dependent classifiers 2: P = 15/15 = 1.000
[Vote tables for A, B, C shown on the slide]
Slide 40
Good and Bad diversity
[Same four vote tables as the previous slide, now annotated:] the disagreement that lifts the majority vote above the individual accuracy (dependent classifiers 2, P = 1.000) is GOOD diversity; the disagreement that drags it below (dependent classifiers 1, P = 0.467) is BAD diversity.
Slide 41
Good and Bad diversity Data set Z Ensemble, L = 7 classifiers
Are these outputs diverse?
Slide 42
Good and Bad diversity Data set Z Ensemble, L = 7 classifiers
How about these?
Slide 43
Good and Bad diversity Data set Z Ensemble, L = 7 classifiers 3 vs 4... Can't be more diverse, really...
Slide 44
Good and Bad diversity Data set Z Ensemble, L = 7 classifiers
MAJORITY VOTE Good diversity
Slide 45
Good and Bad diversity Data set Z Ensemble, L = 7 classifiers
MAJORITY VOTE Bad diversity
Slide 46
Good and Bad diversity: decomposition of the majority vote error. Write N for the number of objects, L for the number of classifiers, λ(x) for the number of classifiers wrong on object x, and Z+ / Z− for the objects on which the majority vote is correct / wrong. Then:

\[
E_{\mathrm{maj}} \;=\; \underbrace{\frac{1}{NL}\sum_{x}\lambda(x)}_{\text{individual error}}
\;-\; \underbrace{\frac{1}{NL}\sum_{x\in Z_{+}}\lambda(x)}_{\text{subtract GOOD diversity}}
\;+\; \underbrace{\frac{1}{NL}\sum_{x\in Z_{-}}\bigl(L-\lambda(x)\bigr)}_{\text{add BAD diversity}}
\]

Brown G., L.I. Kuncheva, "Good" and "bad" diversity in majority vote ensembles, Proc. Multiple Classifier Systems (MCS'10), Cairo, Egypt, LNCS 5997, 2010, 124-133.
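A sketch of the decomposition computed from oracle outputs; the L x N matrix layout and function name are my assumptions, and the assert checks the identity above.

```python
# Good/bad diversity decomposition of the majority vote error
# (after Brown & Kuncheva, 2010). `oracle` is an L x N 0/1 array,
# 1 meaning the classifier is correct on that object.
import numpy as np

def decompose_majority_vote_error(oracle):
    oracle = np.asarray(oracle)
    L, N = oracle.shape
    lam = L - oracle.sum(axis=0)                   # wrong votes per object
    maj_correct = oracle.sum(axis=0) > L / 2
    e_ind = lam.mean() / L                         # average individual error
    good = lam[maj_correct].sum() / (N * L)        # disagreement the vote survives
    bad = (L - lam)[~maj_correct].sum() / (N * L)  # correct votes wasted on lost objects
    e_maj = float(np.mean(~maj_correct))
    assert np.isclose(e_maj, e_ind - good + bad)   # the identity above
    return e_maj, e_ind, good, bad

# Example: 3 classifiers, 5 objects
print(decompose_majority_vote_error([[1, 1, 0, 1, 0],
                                     [1, 0, 1, 0, 0],
                                     [0, 1, 1, 1, 1]]))
```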
Slide 47
Good and Bad diversity. Note that the diversity quantity is 3 in both cases; what differs is whether the disagreement helps or hurts the majority vote.
Slide 48
Ensemble Margin: [illustration of POSITIVE vs NEGATIVE margins]
Slide 49
Ensemble Margin. The average margin keeps the sign. However, nearly all diversity measures are functions of the average absolute margin or the average squared margin, and there the margin has no sign...
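For reference, a hedged sketch of the voting margin in this notation; the symbols v_j and y are my assumptions, not from the slide.

```latex
% Voting margin for an object x with true class y and L classifiers,
% where v_j(x) is the number of votes for class j:
\[
  m(x) \;=\; \frac{v_y(x) \;-\; \max_{j \neq y} v_j(x)}{L},
  \qquad -1 \;\le\; m(x) \;\le\; 1 .
\]
% m(x) > 0 exactly when the plurality vote labels x correctly (POSITIVE
% margin) and m(x) < 0 when it errs (NEGATIVE margin). Averaging |m(x)|
% or m(x)^2 over the data, as most diversity measures do, discards the
% sign -- and with it the good/bad distinction.
```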
Slide 50
Ensemble Margin
Slide 51
The bottom line is: Diversity is not MONOTONICALLY related to
ensemble accuracy So, stop looking for what is not there...
Slide 52
Where next in classifier ensembles?
Slide 53
Kappa-error diagrams:
- proposed by Margineantu and Dietterich in 1997
- visualise individual accuracy and diversity in a 2-dimensional plot
- have been used to decide which ensemble members can be pruned without much harm to the overall performance
Slide 54
Example: sonar data (UCI): 208 instances, 60 features, 2 classes; ensemble size L = 11 classifiers; base classifier: C4.5 decision tree.
AdaBoost 75.0%, Bagging 77.0%, Random Subspace 80.9%, Random Oracle 83.3%, Rotation Forest 84.7%.
Kuncheva L.I., A bound on kappa-error diagrams for analysis of classifier ensembles, IEEE Transactions on Knowledge and Data Engineering, 2013, 25 (3), 494-501 (DOI: 10.1109/TKDE.2011.234).
Slide 55
Kappa-error diagrams. For a pair of classifiers C1 and C2, count the instances:

                C2 correct   C2 wrong
C1 correct          a            b
C1 wrong            c            d

\[
  \kappa \;=\; \frac{\theta_{\text{observed}} - \theta_{\text{chance}}}{1 - \theta_{\text{chance}}}
\]

Each pair contributes one point: kappa on one axis, the pair's average error on the other.
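A sketch of computing the two coordinates of one kappa-error point for a pair of classifiers; the chance-agreement term follows the standard Cohen's kappa construction, and the function name is mine.

```python
# One point on a kappa-error diagram: kappa measures agreement between
# the two classifiers' predicted labels; the second coordinate is the
# average of their two error rates.
import numpy as np

def kappa_error_point(pred1, pred2, y_true):
    pred1, pred2, y = map(np.asarray, (pred1, pred2, y_true))
    classes = np.unique(np.concatenate([pred1, pred2]))
    observed = np.mean(pred1 == pred2)                 # observed agreement
    chance = sum(np.mean(pred1 == c) * np.mean(pred2 == c)
                 for c in classes)                     # chance agreement
    kappa = (observed - chance) / (1.0 - chance)
    avg_error = (np.mean(pred1 != y) + np.mean(pred2 != y)) / 2
    return kappa, avg_error

print(kappa_error_point([0, 1, 1, 0, 1], [0, 1, 0, 0, 1], [0, 1, 1, 1, 1]))
```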
[Kappa-error diagrams (kappa vs error): simulated ensembles, L = 3]
Slide 58
[Kappa-error diagram (kappa vs error), real data: 77,422,500 pairs of classifiers] There is room for improvement...
Slide 59
Is there space for new classifier ensembles? Looks like
yes...
Slide 60
Strength of classifiers vs number of classifiers L:
- L = 1: the perfect classifier
- 3-8 classifiers: heterogeneous models, trained combiner (stacked generalisation)
- 100+ classifiers: same model, non-trained combiner (bagging, boosting, etc.); must engineer diversity
Large ensembles of nearly identical classifiers → REDUNDANCY. Small ensembles of weak classifiers → INSUFFICIENCY.
How about here? 30-50 classifiers: same or different models? Trained or non-trained combiner? Selection or fusion?
Slide 61
MathWorks recommendations: AdaBoost and... wait for it... wait for iiiiit... AdaBoost
Slide 62
MathWorks recommendations: ...plus, it is quite expensive.
Slide 63
One final play instead of conclusions...
Slide 64
For the winner: [an illustration] by my favourite illustrator Marcello Barenghi. Well, I'll give you a less crinkled one :)
Slide 65
Time for you now... A guessing game. Recall our digit example. Data for this example: a small part of MNIST; a single decision tree scores 68.2%. The competitors are: Bagging, AdaBoost, Random Forest, Random Subspace and Rotation Forest, ALL with 10 decision trees. YOUR TASK: rank the competitors and predict the ensemble accuracy for each one. The WINNER will be a correct ranking and predictions within 3% of the true accuracies (MSE for a tie-break). The judge is WEKA.
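If you want to play along outside WEKA, here is a hedged scikit-learn sketch of four of the five competitors. Rotation Forest is not in scikit-learn, load_digits only stands in for the MNIST subset, and the depth-3 AdaBoost trees are my choice, so the numbers will not match the slides.

```python
# A sketch of the guessing-game experiment in scikit-learn.
from sklearn.datasets import load_digits
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
models = {
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=10),
    "AdaBoost": AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                                   n_estimators=10),  # shallow trees so
                                                      # boosting can reweight
    "Random Forest": RandomForestClassifier(n_estimators=10),
    "Random Subspace": BaggingClassifier(DecisionTreeClassifier(),
                                         n_estimators=10, bootstrap=False,
                                         max_features=0.5),
}
for name, model in models.items():
    print(f"{name}: {cross_val_score(model, X, y, cv=10).mean():.3f}")
```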
Slide 66
Ensembles of 10 (single decision tree: 68.2%):
1. Rotation Forest 85.0%
2. AdaBoost 82.9%
3. Random Subspace 79.1%
4. Random Forest 78.7%
5. Bagging 75.6%
Slide 67
[Same ranking as the previous slide] But you know what the funny thing is?...
Slide 68
Rotation Forest 85.0%
AdaBoost 82.9%
Random Subspace 79.1%
Random Forest 78.7%
Bagging 75.6%
decision tree 68.2%
1-nn 87.4%
SVM 89.5%
Slide 69
The moral of the story...
1. There may be a simpler solution. Don't overlook it!
2. The most acclaimed methods are not always the best.
Heeeeey, this proves the fallibility of my classifier ensemble theory, Marcello Pelillo! (who left already...) :(
Slide 70
Everyone, WAKE UP! And thank you for still being here :)
1. Classifier combiners. Nobody talks about this...
2. Time for an example: combiner matters
3. Diversity. Everybody talks about this...
4. Good diversity and bad diversity
5. Where next in classifier ensembles?
6. One final play instead of conclusions...