Classifier ensembles: Does the combination rule matter?
Ludmila Kuncheva
School of Computer Science, Bangor University, UK

[Diagram: a classifier ensemble — feature values (object description) are fed to several classifiers; a combiner merges their outputs into a single class label]
Congratulations! The Netflix Prize sought to substantially improve the accuracy of predictions about how much someone is going to enjoy a movie based on their movie preferences. On September 21, 2009 we awarded the $1M Grand Prize to team “BellKor’s Pragmatic Chaos”. Read about their algorithm, checkout team scores on the Leaderboard, and join the discussions on the Forum. We applaud all the contributors to this quest, which improves our ability to connect people to the movies they love.
Cited 7194 times by 28 July 2013 (Google Scholar)
Saso Dzeroski
David Hand
S. Dzeroski and B. Zenko (2004) Is combining classifiers better than selecting the best one? Machine Learning, 54, 255-273.
David J. Hand (2006) Classifier technology and the illusion of progress, Statist. Sci. 21 (1), 1-14.
Classifier combination? Hmmmm…..
We are kidding ourselves; there is no real progress in spite of ensemble methods.
Chances are that the single best classifier will be better than the ensemble.
Quo Vadis?
"combining classifiers" OR "classifier combination" OR "classifier ensembles" OR "ensemble of classifiers" OR "combining multiple classifiers" OR "committee of classifiers" OR "classifier committee" OR "committees of neural networks" OR "consensus aggregation" OR "mixture of experts" OR "bagging predictors" OR adaboost OR (( "random subspace" OR "random forest" OR "rotation forest" OR boosting) AND "machine learning")
[Figure: visibility vs. time — the curve rises through naive euphoria to the peak of inflated expectations, falls into the trough of disillusionment, then climbs the slope of enlightenment towards the asymptote of reality]
Gartner’s Hype Cycle: a typical evolution pattern of a new technology
Where are we?...
[Figure: per mil of published papers on classifier ensembles vs. time, 1990–2010, rising from near 0 to about 0.35; each year is annotated with the journal of its top-cited paper, with application papers marked]

Journal abbreviations (counts in parentheses):
IEEE TPAMI = IEEE Transactions on Pattern Analysis and Machine Intelligence (6)
IEEE TSMC = IEEE Transactions on Systems, Man and Cybernetics
JASA = Journal of the American Statistical Association
IJCV = International Journal of Computer Vision
JTB = Journal of Theoretical Biology
PPL = Protein and Peptide Letters (2)
JAE = Journal of Animal Ecology
PR = Pattern Recognition
ML = Machine Learning (4)
NN = Neural Networks
CC = Cerebral Cortex
[Figure: number of citations (0–4500) vs. time, 1990–2012, for four highly cited papers:]
[ML] Bagging predictors
[IEEE TPAMI] On combining classifiers
[ML] Random forests
[IJCV] Robust real-time face detection
International Workshop on Multiple Classifier Systems, 2000–2013, continuing
[Diagram: Data set → Features → Classifier 1, Classifier 2, …, Classifier L → Combiner]
Levels of questions

A Combination level
• selection or fusion?
• voting or another combination method?
• trainable or non-trainable combiner?

B Classifier level
• same or different classifiers?
• decision trees, neural networks or other?
• how many?

C Feature level
• all features or subsets of features?
• random or selected subsets?

D Data level
• independent/dependent bootstrap samples?
• selected data sets?
[Diagram: strength of classifiers vs. number of classifiers L]

The perfect classifier
• 3–8 classifiers
• heterogeneous
• trained combiner (stacked generalisation)

• 100+ classifiers
• same model
• non-trained combiner (bagging, boosting, etc.)

Large ensembles of nearly identical classifiers — REDUNDANCY
Small ensembles of weak classifiers — INSUFFICIENCY?
Must engineer diversity…

How about here?
• 30–50 classifiers
• same or different models?
• trained or non-trained combiner?
• selection or fusion?
• IS IT WORTH IT?
Diversity is absolutely CRUCIAL!
Diversity is pretty impossible…
Label outputs vs. continuous-valued outputs

Label outputs: classifiers 1, 2 and 3 each assign a class label to object x, e.g. ω1, ω2, ω1.

Continuous-valued outputs: classifiers 1, 2 and 3 each return a degree of support for every class, collected in a decision profile; e.g. P3(ω2|x) is the support of classifier 3 for class ω2.
204 R102 G
54 B
Red
Blue
RedRed
Green Red
Red
Majority vote
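The majority (plurality) vote on the slide can be sketched in a few lines of Python; the vote list mirrors the slide's six classifiers, and `plurality_vote` is an illustrative helper name, not from the talk:

```python
from collections import Counter

def plurality_vote(labels):
    """Return the label that receives the most votes."""
    return Counter(labels).most_common(1)[0][0]

votes = ["Red", "Blue", "Red", "Red", "Green", "Red"]
print(plurality_vote(votes))  # Red (4 of 6 votes)
```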
Ensemble (label outputs, R, G, B)

The same six votes — Red, Blue, Red, Red, Green, Red — now with classifier weights 0.05, 0.50, 0.02, 0.10, 0.70, 0.10.
Weighted majority vote: Red 0.27, Green 0.70, Blue 0.50 → Green
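A sketch of the weighted majority vote with the same votes and the slide's weights (the helper name is mine):

```python
from collections import defaultdict

def weighted_vote(labels, weights):
    """Sum the weight of every classifier voting for each label."""
    score = defaultdict(float)
    for label, w in zip(labels, weights):
        score[label] += w
    return max(score, key=score.get), dict(score)

votes   = ["Red", "Blue", "Red", "Red", "Green", "Red"]
weights = [0.05, 0.50, 0.02, 0.10, 0.70, 0.10]
winner, scores = weighted_vote(votes, weights)
# Red: 0.05 + 0.02 + 0.10 + 0.10 = 0.27; Blue: 0.50; Green: 0.70 -> Green
```

One heavily weighted dissenter (the fifth classifier, weight 0.70) overturns the plain majority.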
Ensemble (label outputs, R, G, B)

The six label outputs, concatenated as R, B, R, R, G, R, can instead be fed as features into another classifier acting as the combiner → Green
Ensemble (continuous outputs, [R, G, B])

Six classifiers output support vectors:
[0.6 0.3 0.1]
[0.1 0.0 0.6]
[0.7 0.6 0.5]
[0.4 0.3 0.1]
[0.0 1.0 0.0]
[0.9 0.7 0.8]

Mean R = 0.45
Mean G = 0.48
Mean B = 0.35
Class GREEN
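The average (sum) rule above reduces to a column-wise mean; a minimal sketch using the slide's numbers (variable names are mine):

```python
# support vectors [R, G, B] from the six classifiers on the slide
outputs = [
    [0.6, 0.3, 0.1],
    [0.1, 0.0, 0.6],
    [0.7, 0.6, 0.5],
    [0.4, 0.3, 0.1],
    [0.0, 1.0, 0.0],
    [0.9, 0.7, 0.8],
]
classes = ["Red", "Green", "Blue"]
# column-wise mean: one averaged support value per class
means = [sum(col) / len(outputs) for col in zip(*outputs)]
winner = classes[means.index(max(means))]
print([round(m, 2) for m in means], winner)  # [0.45, 0.48, 0.35] Green
```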
Decision profile

The six support vectors stacked as a matrix (rows = classifiers, columns = classes):

0.6 0.3 0.1
0.1 0.0 0.6
0.7 0.6 0.5
0.4 0.3 0.1
0.0 1.0 0.0
0.9 0.7 0.8

The entry in row 4, column 3 is the support that classifier #4 gives to the hypothesis that the object to classify comes from class #3.
Would be nice if these were probability distributions...
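A decision profile is simply an L×c array, and the annotated entry can be read off by indexing (pure Python sketch; with 0-based indices, classifier #4 / class #3 is `DP[3][2]`):

```python
# decision profile: rows = classifiers, columns = classes
DP = [
    [0.6, 0.3, 0.1],
    [0.1, 0.0, 0.6],
    [0.7, 0.6, 0.5],
    [0.4, 0.3, 0.1],
    [0.0, 1.0, 0.0],
    [0.9, 0.7, 0.8],
]
# support that classifier #4 gives to class #3
print(DP[3][2])  # 0.1
# rows need not sum to 1 -- hence the wish for probability distributions
row_sums = [sum(row) for row in DP]
```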
Decision profile

Rows correspond to classifiers D1, D2, D3 and columns to classes ω1, ω2, ω3; the entry P3(ω2|x) is the support of classifier D3 for class ω2.

…We can take probability outputs from the classifiers
Combination Rules

For label outputs s1, s2, …, sL:
• Majority (plurality) vote
• Weighted majority vote
• Naïve Bayes
• BKS
• A classifier

For continuous-valued outputs, collected in the decision profile DP(x) = [d_{i,j}], i = 1, …, L, j = 1, …, c:
• Simple rules: minimum, maximum, product, average (sum)
• c regressions
• A classifier
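The simple rules are just column-wise reductions of the decision profile. A sketch on the slide's matrix (function and names are mine; note that minimum and product collapse to zero here because some classifiers give a class zero support, a known sensitivity of those two rules):

```python
import math

DP = [
    [0.6, 0.3, 0.1],
    [0.1, 0.0, 0.6],
    [0.7, 0.6, 0.5],
    [0.4, 0.3, 0.1],
    [0.0, 1.0, 0.0],
    [0.9, 0.7, 0.8],
]

def combine(dp, rule):
    """Apply a simple combination rule column-wise (one value per class)."""
    reducers = {
        "minimum": min,
        "maximum": max,
        "product": math.prod,
        "average": lambda col: sum(col) / len(col),
    }
    return [reducers[rule](col) for col in zip(*dp)]

print(combine(DP, "maximum"))  # [0.9, 1.0, 0.8] -> highest support: class 2
print(combine(DP, "average"))  # highest average support is also class 2
```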
[Diagram: the same ensemble, but with a classifier in place of the combiner]
http://samcnitt.tumblr.com/
Bob Duin: “The Combining Classifier: to Train or Not to Train?”
Tin Ho: “Multiple Classifier Combination: Lessons and Next Steps”, 2002
“Instead of looking for the best set of features and the best classifier, now we look for the best set of classifiers and then the best combination method. One can imagine that very soon we will be looking for the best set of combination methods and then the best way to use them all. If we do not take the chance to review the fundamental problems arising from this challenge, we are bound to be driven into such an infinite recurrence, dragging along more and more complicated combination schemes and theories and gradually losing sight of the original problem.”
Conclusions - 1

Classifier ensembles: Does the combination rule matter?
In a word, yes. But how much it matters depends upon
• the base classifier model,
• the training of the individual classifiers,
• the diversity,
• the possibility to train the combiner, and more.

Conclusions - 2

1. The choice of the combiner should not be side-lined.
2. The combiner should be chosen in relation to the rest of the ensemble and the available data.
Questions to you:
1. What is the future of classifier ensembles? (Are they here to stay or are they a mere phase?)
2. In what direction(s) will they evolve/dissolve?
3. What will be the ‘classifier of the future’? Or the ‘classification paradigm of the future’?
4. And one last question: How can we get a handle on the ever-growing scientific literature in each and every area? How can we find the gems among the pile of stones?