Classifier Ensembles, Part 2. Ludmila Kuncheva, School of Computer Science, Bangor University, [email protected]


  • Slide 1
  • Classifier Ensembles, Part 2. Ludmila Kuncheva, School of Computer Science, Bangor University, [email protected]
  • Slide 2
  • Building ensembles: levels of questions (diagram: Data set → Features → Classifier 1, Classifier 2, ..., Classifier L → Combiner)
    A. Combination level: selection or fusion? Voting or another combination method? Trainable or non-trainable combiner? And why not another classifier?
    B. Classifier level: same or different classifiers? Decision trees, neural networks or other? How many?
    C. Feature level: all features or subsets of features? Random or selected subsets?
    D. Data level: independent/dependent bootstrap samples? Selected data sets?
    Methods: Bagging, Boosting, Random Subspace, Random Forest, Rotation Forest, Linear Oracle.
  • Slide 3
  • Building ensembles: levels of questions (same diagram and questions as Slide 2). This seems under-researched...
  • Slide 4
  • Classifier combiners. Nobody talks about this...
  • Slide 5
  • Label outputs vs. continuous-valued outputs (figure: the classifier outputs for classes 1, 2, 3 feed the combiner; the continuous-valued outputs form the decision profile).
  • Slide 6
  • Ensemble (label outputs: R, G, B). Classifier votes: Red, Blue, Red, Green, Red. Combiner: majority vote → Red.
  • Slide 7
  • Ensemble (label outputs: R, G, B). Classifier votes: Red, Blue, Red, Green, Red. Combiner: weighted majority vote → Green (values shown on the slide: 0.05, 0.50, 0.02, 0.10, 0.70, 0.10, 0.27, 0.70, 0.50).
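To make the two label-output combiners above concrete, here is a minimal Python sketch (not from the slides; the weights in the usage example are illustrative, not the values shown on Slide 7):

```python
from collections import Counter

def majority_vote(labels):
    """Plurality vote over label outputs, e.g. ['R', 'B', 'R', 'G', 'R'] -> 'R'."""
    return Counter(labels).most_common(1)[0][0]

def weighted_majority_vote(labels, weights):
    """Each classifier's vote counts with its weight; the class with the
    largest total weight wins."""
    score = {}
    for label, w in zip(labels, weights):
        score[label] = score.get(label, 0.0) + w
    return max(score, key=score.get)

votes = ['R', 'B', 'R', 'G', 'R']
print(majority_vote(votes))                                      # R
# A heavily weighted 'G' voter can overturn the plain majority:
print(weighted_majority_vote(votes, [0.1, 0.1, 0.1, 0.9, 0.1]))  # G
```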
  • Slide 8
  • Ensemble (label outputs: R, G, B). Classifier votes: Red, Blue, Red, Green, Red. Combiner: the label outputs are fed into another classifier, which outputs Green.
  • Slide 9
  • Ensemble (continuous outputs, [R,G,B]) [0.6 0.3 0.1] [0.1 0.0 0.6] [0.7 0.6 0.5] [0.4 0.3 0.1] [0 1 0] [0.9 0.7 0.8] Combiner
  • Slide 10
  • Ensemble (continuous outputs, [R,G,B]) [0.6 0.3 0.1] [0.1 0.0 0.6] [0.7 0.6 0.5] [0.4 0.3 0.1] [0 1 0] [0.9 0.7 0.8] Mean R = 0.45 Combiner
  • Slide 11
  • Ensemble (continuous outputs, [R,G,B]) [0.6 0.3 0.1] [0.1 0.0 0.6] [0.7 0.6 0.5] [0.4 0.3 0.1] [0 1 0] [0.9 0.7 0.8] Mean R = 0.45 Mean G = 0.48 Combiner
  • Slide 12
  • Ensemble (continuous outputs, [R,G,B]) [0.6 0.3 0.1] [0.1 0.0 0.6] [0.7 0.6 0.5] [0.4 0.3 0.1] [0 1 0] [0.9 0.7 0.8] Mean R = 0.45 Mean G = 0.48 Mean B = 0.35 Class GREEN Combiner
  • Slide 13
  • Ensemble (continuous outputs, [R, G, B]): [0.6 0.3 0.1], [0.1 0.0 0.6], [0.7 0.6 0.5], [0.4 0.3 0.1], [0 1 0], [0.9 0.7 0.8]. Mean R = 0.45, Mean G = 0.48, Mean B = 0.35 → class GREEN. Decision profile (one row per classifier, one column per class):
    0.6 0.3 0.1
    0.1 0.0 0.6
    0.7 0.6 0.5
    0.4 0.3 0.1
    0.0 1.0 0.0
    0.9 0.7 0.8
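The mean combiner applied to the decision profile above, as a short sketch (the numbers are copied from the slide):

```python
import numpy as np

# Decision profile: one row per classifier, one column per class [R, G, B].
DP = np.array([[0.6, 0.3, 0.1],
               [0.1, 0.0, 0.6],
               [0.7, 0.6, 0.5],
               [0.4, 0.3, 0.1],
               [0.0, 1.0, 0.0],
               [0.9, 0.7, 0.8]])

support = DP.mean(axis=0)                        # column-wise mean support per class
print(support)                                   # approx. [0.45, 0.48, 0.35]
print(['R', 'G', 'B'][int(np.argmax(support))])  # G -> class GREEN
```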
  • Slide 14
  • Time for an example: combiner matters
  • Slide 15
  • Data set: let's call this data 'The Tropical Fish', or just the fish data. A 50-by-50 grid = 2500 objects in 2-d. Bayes error rate = 0%. Induce label noise to make the problem more interesting: noise 10% and noise 45%.
  • Slide 16
  • Example: 2 ensembles. (1) Train 50 linear classifiers on bootstrap samples. (2) 'Throw' 50 random straws (lines) and label the fish side of each so that the accuracy is greater than 0.5.
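For the first ensemble, a rough sketch of bagging 50 linear classifiers with scikit-learn, assuming hypothetical placeholder arrays X and y for the fish data (this is not the code behind the slides, just an illustration of the idea):

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for the 50-by-50 fish grid with noisy labels.
rng = np.random.default_rng(0)
X = rng.uniform(size=(2500, 2))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # stand-in labels, not the real fish shape

# 50 linear classifiers, each trained on a bootstrap sample of the data.
ensemble = BaggingClassifier(LogisticRegression(), n_estimators=50,
                             bootstrap=True, random_state=0).fit(X, y)
print(ensemble.score(X, y))
```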
  • Slide 17
  • Example: 2 ensembles. Each classifier returns an estimate of the support for class 'fish'. And, of course, we also have the complementary estimate for the other class, but we will not need it.
  • Slide 18
  • Example: 2 ensembles 10% label noise
  • Slide 19
  • Example: 2 ensembles 45% label noise
  • Slide 20
  • Example: 2 ensembles 45% label noise
  • Slide 21
  • Example: 2 ensembles 45% label noise
  • Slide 22
  • Example: 2 ensembles. What does the example show?
    - The combiner matters (a lot).
    - Noise helps the ensemble!
    - The trained combiner for continuous labels is best (linear, tree).
    - BKS works because of the small number of classes and classifiers.
    However, nothing is as simple as it looks...
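For reference, BKS (Behaviour Knowledge Space) is a trained label-output combiner: a lookup table indexed by the full vector of predicted labels, storing the most frequent true class seen with that vector in training. The table has up to c^L cells for c classes and L classifiers, which is why the slide notes it only works with few classes and classifiers. A minimal sketch (illustrative names, not the implementation behind the experiment):

```python
from collections import Counter, defaultdict

def train_bks(label_outputs, y_true):
    """label_outputs: one tuple of L predicted labels per training object."""
    table = defaultdict(Counter)
    for key, y in zip(label_outputs, y_true):
        table[tuple(key)][y] += 1
    # For each observed combination of votes, keep the most frequent true class.
    return {key: counts.most_common(1)[0][0] for key, counts in table.items()}

def bks_predict(table, label_output, fallback):
    """Fall back (e.g. to majority vote) for vote combinations never seen in training."""
    return table.get(tuple(label_output), fallback(label_output))
```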
  • Slide 23
  • The Combining Classifier: to Train or Not to Train? (http://samcnitt.tumblr.com/)
  • Slide 24
  • Slide 25
  • Train the COMBINER if you have enough data! Otherwise, like with any classifier, we may overfit the data. Get this: almost NOBODY trains the combiner, not in the CLASSIC ensemble methods anyway. Ha-ha-ha, what is enough data?
  • Slide 26
  • Diversity. Everybody talks about this...
  • Slide 27
  • Diversity. Search for CLASSIFIER ENSEMBLE DIVERSITY on 10 Sep 2014: 580 publications, 4594 citations.
  • Slide 28
  • Where were the 580 papers on diversity published? (venue: number of papers)
    MULTIPLE CLASSIFIER SYSTEMS 30
    INT JOINT CONF ON NEURAL NETWORKS (IJCNN) 22
    PATTERN RECOGNITION 17
    NEUROCOMPUTING 14
    EXPERT SYSTEMS WITH APPLICATIONS 13
    INFORMATION SCIENCES 12
    APPLIED SOFT COMPUTING 11
    PATTERN RECOGNITION LETTERS 10
    INFORMATION FUSION 9
    IEEE INT JOINT CONF ON NEURAL NETWORKS 9
    KNOWLEDGE-BASED SYSTEMS 7
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 7
    INT J OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE 6
    MACHINE LEARNING 5
    IEEE TRANSACTIONS ON NEURAL NETWORKS 5
    JOURNAL OF MACHINE LEARNING RESEARCH 5
    APPLIED INTELLIGENCE 4
    INTELLIGENT DATA ANALYSIS 4
    IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION 4
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING 4
    NEURAL INFORMATION PROCESSING 4
  • Slide 29
  • Diversity. Where in the world are we? China 140, UK 68, USA 63, Spain 55, Brazil 41, Canada 32, Poland 28, Iran 23, Italy 19...
  • Slide 30
  • Are we still talking about diversity in classifier ensembles? Apparently yes... That elusive diversity... We want the classifiers in the ensemble to be ACCURATE and DIVERSE simultaneously. And HOW CAN THIS HAPPEN?!? Diversity
  • Slide 31
  • All ensemble methods we have seen so far strive to keep the individual accuracy high while increasing diversity. How can we measure diversity? WHAT can we do with the diversity value?
  • Slide 32
  • Measure diversity for a PAIR of classifiers. Cross-tabulate correct/wrong decisions of Classifier 1 against correct/wrong decisions of Classifier 2; each cell counts instances, e.g. the number of instances labelled correctly by classifier 1 and mislabelled by classifier 2. Independent outputs are not the same as independent errors; hence, use ORACLE (correct/wrong) outputs.
  • Slide 33
  • From the same 2-by-2 table (Classifier 1 correct/wrong vs. Classifier 2 correct/wrong) we get the pairwise diversity measures: Q statistic, kappa, correlation (rho), disagreement, double fault, ...
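As a quick reference, the common pairwise measures computed from oracle outputs (a = both correct, b = only classifier 1 correct, c = only classifier 2 correct, d = both wrong); the formulas below are the usual textbook definitions rather than a quote from a slide, and degenerate cases (zero denominators) are not handled. Kappa is computed in the kappa-error sketch further down.

```python
import numpy as np

def pairwise_diversity(correct1, correct2):
    """correct1, correct2: boolean arrays, True where the classifier is correct."""
    c1 = np.asarray(correct1, dtype=bool)
    c2 = np.asarray(correct2, dtype=bool)
    n = len(c1)
    a = np.sum(c1 & c2)       # both correct
    b = np.sum(c1 & ~c2)      # classifier 1 correct, classifier 2 wrong
    c = np.sum(~c1 & c2)      # classifier 1 wrong, classifier 2 correct
    d = np.sum(~c1 & ~c2)     # both wrong
    return {
        'Q': (a * d - b * c) / (a * d + b * c),
        'rho': (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d)),
        'disagreement': (b + c) / n,
        'double_fault': d / n,
    }
```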
  • Slide 34
  • SEVENTY SIX !!! Diversity
  • Slide 35
  • Do we need more NEW pairwise diversity measures? Looks like we don't... And the same holds for non-pairwise measures... Far too many already.
  • Slide 36
  • All ensemble methods we have seen so far strive to keep the individual accuracy high while increasing diversity. How can we measure diversity? WHAT can we do with the diversity value?
    - Compare ensembles
    - Explain why a certain ensemble heuristic works and others don't
    - Construct an ensemble by overproducing classifiers and then selecting those with high accuracy and high diversity
  • Slide 37
  • Why is diversity so baffling? The problem is that diversity is NOT monotonically related to the ensemble accuracy. In other words, diverse ensembles may be good or may be bad...
  • Slide 38
  • Good diversity and bad diversity
  • Slide 39
  • Good and bad diversity. 3 classifiers: A, B, C; 15 objects (wrong and correct votes shown in the figure); individual accuracy = 10/15 = 0.667. P = ensemble accuracy under MAJORITY VOTE:
    - independent classifiers: P = 11/15 = 0.733
    - identical classifiers: P = 10/15 = 0.667
    - dependent classifiers 1: P = 7/15 = 0.467
    - dependent classifiers 2: P = 15/15 = 1.000
  • Slide 40
  • Good and bad diversity (same example as the previous slide). The diversity that lifts the majority vote above the individual accuracy (P = 15/15 = 1.000) is good diversity; the diversity that drags it below (P = 7/15 = 0.467) is bad diversity.
  • Slide 41
  • Good and Bad diversity Data set Z Ensemble, L = 7 classifiers Are these outputs diverse?
  • Slide 42
  • Good and Bad diversity Data set Z Ensemble, L = 7 classifiers How about these?
  • Slide 43
  • Good and Bad diversity Data set Z Ensemble, L = 7 classifiers 3 vs 4... Can't be more diverse, really...
  • Slide 44
  • Good and Bad diversity Data set Z Ensemble, L = 7 classifiers MAJORITY VOTE Good diversity
  • Slide 45
  • Good and Bad diversity Data set Z Ensemble, L = 7 classifiers MAJORITY VOTE Bad diversity
  • Slide 46
  • Good and bad diversity: decomposition of the majority vote error: majority vote error = individual error - GOOD diversity + BAD diversity. Brown G., L.I. Kuncheva, "Good" and "bad" diversity in majority vote ensembles, Proc. Multiple Classifier Systems (MCS'10), Cairo, Egypt, LNCS 5997, 2010, 124-133.
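A small sketch of the decomposition from oracle outputs, following the structure stated on the slide (majority vote error = individual error - good diversity + bad diversity). The exact normalisation by N*L is my reading of the paper, so treat it as an approximation of the published formula:

```python
import numpy as np

def good_bad_decomposition(oracle):
    """oracle: N x L boolean matrix, True where classifier j is correct on object i.
    Assumes an odd number of classifiers L and plain majority vote."""
    oracle = np.asarray(oracle, dtype=float)
    n, L = oracle.shape
    votes = oracle.sum(axis=1)                  # correct votes per object
    ens_correct = votes > L / 2                 # majority vote decision
    e_maj = 1.0 - ens_correct.mean()            # majority vote error
    e_ind = 1.0 - votes.mean() / L              # average individual error
    # Good diversity: wrong votes absorbed on objects the ensemble still gets right.
    good = np.sum(L - votes[ens_correct]) / (n * L)
    # Bad diversity: correct votes wasted on objects the ensemble gets wrong.
    bad = np.sum(votes[~ens_correct]) / (n * L)
    assert np.isclose(e_maj, e_ind - good + bad)   # the decomposition holds exactly
    return e_maj, e_ind, good, bad
```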
  • Slide 47
  • Good and bad diversity. Note that the amount of diversity is 3 in both cases.
  • Slide 48
  • Ensemble Margin POSITIVE NEGATIVE
  • Slide 49
  • Ensemble margin. The average margin keeps the sign. However, nearly all diversity measures are functions of the average absolute margin or the average squared margin, where the margin has no sign...
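With oracle outputs, the margin and the three averages mentioned above can be written as follows (a standard oracle-output formulation, given here as my reading rather than a quote from the slides), where $v_j$ is the number of correct votes out of $L$ on object $\mathbf{x}_j$ and $N$ is the number of objects:

```latex
m(\mathbf{x}_j) = \frac{v_j - (L - v_j)}{L} = \frac{2v_j - L}{L}, \qquad
\bar{m} = \frac{1}{N}\sum_{j=1}^{N} m(\mathbf{x}_j), \qquad
\bar{m}_{\mathrm{abs}} = \frac{1}{N}\sum_{j=1}^{N} \left| m(\mathbf{x}_j) \right|, \qquad
\bar{m}_{\mathrm{sq}} = \frac{1}{N}\sum_{j=1}^{N} m(\mathbf{x}_j)^2 .
```

The margin is positive when the majority vote is correct and negative when it is wrong; this sign is exactly what the absolute and squared versions discard.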
  • Slide 50
  • Ensemble Margin
  • Slide 51
  • The bottom line is: diversity is not MONOTONICALLY related to ensemble accuracy. So, stop looking for what is not there...
  • Slide 52
  • Where next in classifier ensembles?
  • Slide 53
  • Kappa-error diagrams:
    - proposed by Margineantu and Dietterich in 1997
    - visualise individual accuracy and diversity in a 2-dimensional plot
    - have been used to decide which ensemble members can be pruned without much harm to the overall performance
  • Slide 54
  • Example: sonar data (UCI): 260 instances, 60 features, 2 classes; ensemble size L = 11 classifiers; base model: tree C4.5. AdaBoost 75.0%, Bagging 77.0%, Random Subspace 80.9%, Random Oracle 83.3%, Rotation Forest 84.7%. Kuncheva L.I., A bound on kappa-error diagrams for analysis of classifier ensembles, IEEE Transactions on Knowledge and Data Engineering, 2013, 25(3), 494-501 (DOI: 10.1109/TKDE.2011.234).
  • Slide 55
  • Kappa-error diagrams. For a pair of classifiers C1 and C2, cross-tabulate their correct/wrong decisions:
                  C2 correct   C2 wrong
    C1 correct        a            b
    C1 wrong          c            d
    One axis is the pair's error; the other is kappa = (observed agreement - chance agreement) / (1 - chance agreement).
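A point in the kappa-error diagram for one pair of classifiers can be computed directly from this table; here is a small sketch (plotting and the bound are omitted):

```python
import numpy as np

def kappa_error_point(correct1, correct2):
    """Return (kappa, mean pairwise error) for one pair of classifiers,
    given boolean 'correct' vectors over the same evaluation set."""
    c1 = np.asarray(correct1, dtype=bool)
    c2 = np.asarray(correct2, dtype=bool)
    n = len(c1)
    a = np.sum(c1 & c2) / n       # both correct
    b = np.sum(c1 & ~c2) / n      # C1 correct, C2 wrong
    c = np.sum(~c1 & c2) / n      # C1 wrong, C2 correct
    d = np.sum(~c1 & ~c2) / n     # both wrong
    observed = a + d                                    # observed agreement
    chance = (a + b) * (a + c) + (c + d) * (b + d)      # agreement expected by chance
    kappa = (observed - chance) / (1 - chance)
    mean_error = ((c + d) + (b + d)) / 2                # average individual error of the pair
    return kappa, mean_error
```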
  • Slide 56
  • Kappa-error diagrams: plot of error against kappa, with the (tight) bound marked.
  • Slide 57
  • Kappa-error diagrams: simulated ensembles, L = 3 (error against kappa).
  • Slide 58
  • Kappa-error diagrams on real data: 77,422,500 pairs of classifiers (error against kappa). There is room for improvement.
  • Slide 59
  • Is there space for new classifier ensembles? Looks like yes...
  • Slide 60
  • Number of classifiers L versus strength of classifiers:
    - L = 1: the perfect classifier
    - 3-8 classifiers: heterogeneous, trained combiner (stacked generalisation)
    - 100+ classifiers: same model, non-trained combiner (bagging, boosting, etc.); must engineer diversity
    Large ensembles of nearly identical classifiers: REDUNDANCY. Small ensembles of weak classifiers: INSUFFICIENCY.
    How about here? 30-50 classifiers: same or different models? Trained or non-trained combiner? Selection or fusion?
  • Slide 61
  • MathWorks recommendations: AdaBoost and... wait for it... wait for iiiiit... AdaBoost
  • Slide 62
  • MathWorks recommendations: plus, it is quite expensive.
  • Slide 63
  • One final play instead of conclusions...
  • Slide 64
  • For the winner: by my favourite illustrator Marcello Barenghi. Well, I'll give you a less crinkled one :)
  • Slide 65
  • A guessing game. Time for you now... Recall our digit example. Data for this example: a small part of MNIST... A single decision tree scores 68.2%. The competitors are: Bagging, AdaBoost, Random Forest, Random Subspace and Rotation Forest, ALL with 10 decision trees. YOUR TASK: rank the competitors and predict the ensemble accuracy for each one. The WINNER will be a correct ranking and predictions within 3% of the true accuracies (MSE for a tie-break). The judge is WEKA.
  • Slide 66
  • Ensembles of 10 (single decision tree: 68.2%):
    1. Rotation Forest 85.0%
    2. AdaBoost 82.9%
    3. Random Subspace 79.1%
    4. Random Forest 78.7%
    5. Bagging 75.6%
  • Slide 67
  • Ensembles of 10 (same results as the previous slide). But you know what the funny thing is?...
  • Slide 68
  • Rotation Forest 85.0%, AdaBoost 82.9%, Random Subspace 79.1%, Random Forest 78.7%, Bagging 75.6%, decision tree 68.2%... and 1-nn 87.4%, SVM 89.5%.
  • Slide 69
  • The moral of the story... 1. There may be a simpler solution. Don't overlook it! 2. The most acclaimed methods are not always the best. Heeeeey, this proves the fallibility of my classifier ensemble theory, Marcello Pelillo! (who left already...) :(
  • Slide 70
  • Everyone, WAKE UP! And thank you for still being here :)
    1. Classifier combiners. Nobody talks about this...
    2. Time for an example: combiner matters
    3. Diversity. Everybody talks about this...
    4. Good diversity and bad diversity
    5. Where next in classifier ensembles?
    6. One final play instead of conclusions...