CptS 570 – Machine Learning
School of EECS
Washington State University
No one learner is always best (No Free Lunch)
Combination of learners can overcome individual weaknesses
How to choose learners that complement one another?
How to combine their outputs to maximize accuracy?
Ensemble: weighted majority vote of several learners
Different algorithms
◦ E.g., parametric vs. non-parametric
Different parameter settings
◦ E.g., random initial weights in a neural network
Different input representations
◦ E.g., feature selection
◦ E.g., multi-modal training data (e.g., audio & video)
Different training sets
◦ Bagging: different samples of the same training set
◦ Boosting/Cascading: weight more heavily the examples missed by the previous learned classifier
◦ Partitioning: mixture of experts
All learners generate an output
◦ Voting, stacking
One or a few learners generate an output
◦ Chosen by a gating function
◦ Mixture of experts
Learner output weighted by accuracy and complexity
◦ Cascading, boosting
L learners, K outputs
d_ji(x) is the prediction of learner j for output i
Regression:

$$y = \sum_{j=1}^{L} w_j d_j(\mathbf{x})$$

Classification:

$$y_i = \sum_{j=1}^{L} w_j d_{ji}(\mathbf{x}), \quad \text{where } w_j \ge 0 \text{ and } \sum_{j=1}^{L} w_j = 1$$

$$\text{Choose } C_i \text{ if } y_i = \max_{k=1,\ldots,K} y_k$$
Majority voting: w_j = 1/L
If a learner produces P(C_i|x), these can be used as the weights after normalization
Weight w_j can be the accuracy of learner j on a validation set
Learn the weights (stacked generalization)
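A minimal NumPy sketch of this weighted voting rule (the function name and example numbers are illustrative, not from the slides):

```python
import numpy as np

def weighted_vote(probs, weights):
    """Weighted majority vote over L learners and K classes.

    probs:   (L, K) array; probs[j][i] = d_ji, learner j's score for class i
    weights: length-L array of nonnegative weights, normalized to sum to 1
    Returns the index of the class C_i with the largest combined score y_i.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                    # enforce sum_j w_j = 1
    y = w @ np.asarray(probs, float)   # y_i = sum_j w_j * d_ji
    return int(np.argmax(y))           # choose C_i if y_i = max_k y_k

# Illustrative numbers: 3 learners, 2 classes; weights from validation accuracy
print(weighted_vote([[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]], [0.8, 0.6, 0.55]))  # -> 1
```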
Example:

$$P(C_i \mid x) = \sum_{\text{all models } \mathcal{M}_j} P(C_i \mid x, \mathcal{M}_j)\, P(\mathcal{M}_j)$$

where d_ji = P(C_i|x, M_j) and w_j = P(M_j)
Majority voting implies a uniform prior over models
Can't include all models, so choose a few with suspected high probability
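Viewed this way, Bayesian combination is just the weighted_vote sketch above with the model priors as weights (numbers illustrative):

```python
# Rows are P(C_i | x, M_j); weights are the model priors P(M_j).
posteriors = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]   # P(C_i | x, M_j)
priors = [0.5, 0.3, 0.2]                            # P(M_j), sums to 1
print(weighted_vote(posteriors, priors))            # -> 0
```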
Assuming each learner is independent and better than random
Then adding more learners will maintain bias, but reduce variance (i.e., error)
$$E[y] = E\!\left[\frac{1}{L}\sum_j d_j\right] = \frac{1}{L}\sum_j E[d_j] = E[d_j]$$

$$\mathrm{Var}(y) = \mathrm{Var}\!\left(\frac{1}{L}\sum_j d_j\right) = \frac{1}{L^2}\,\mathrm{Var}\!\left(\sum_j d_j\right) = \frac{1}{L^2}\cdot L\cdot \mathrm{Var}(d_j) = \frac{1}{L}\,\mathrm{Var}(d_j)$$
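A quick simulation (illustrative, not from the slides) confirming both identities: averaging L independent learners leaves the mean unchanged and divides the variance by L:

```python
import numpy as np

rng = np.random.default_rng(0)
L, trials = 10, 100_000

d = rng.normal(loc=0.7, scale=0.2, size=(trials, L))  # L independent learners
y = d.mean(axis=1)                                     # ensemble output y

print(d[:, 0].var())   # ~0.04  = Var(d_j)
print(y.var())         # ~0.004 = Var(d_j) / L
print(y.mean())        # ~0.7   = E[d_j], bias unchanged
```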
General case:

$$\mathrm{Var}(y) = \mathrm{Var}\!\left(\frac{1}{L}\sum_j d_j\right) = \frac{1}{L^2}\left[\sum_j \mathrm{Var}(d_j) + 2\sum_j\sum_{i<j} \mathrm{Cov}(d_j, d_i)\right]$$

If learners are positively correlated, then variance (and error) increases
If learners are negatively correlated, then variance (and error) decreases
◦ But bias increases
Voting is a form of smoothing that maintains low bias but decreases variance
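A companion simulation (again illustrative) of the covariance term: with L = 2 and equal per-learner variances, positive correlation keeps Var(y) high, while independence halves it:

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 100_000
shared = rng.normal(size=trials)                 # common noise source

# Positively correlated pair: Cov(d1, d2) = Var(shared) = 1
d1 = shared + 0.3 * rng.normal(size=trials)
d2 = shared + 0.3 * rng.normal(size=trials)
# Independent pair with the same per-learner variance (1.09)
e1 = np.sqrt(1.09) * rng.normal(size=trials)
e2 = np.sqrt(1.09) * rng.normal(size=trials)

print(((d1 + d2) / 2).var())   # ~1.045 = (1/4)(1.09 + 1.09 + 2*1.0)
print(((e1 + e2) / 2).var())   # ~0.545 = 1.09 / 2
```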
Given training set X of size N
Generate L different training sets, each of size N, by sampling with replacement from X
◦ Called "bootstrapping"
Use one learning algorithm to learn L classifiers from the different training sets
Learning algorithm must be unstable
◦ I.e., small changes in the training set result in different classifiers
◦ E.g., decision trees, neural networks
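A bagging sketch, assuming `learn` is any unstable learning algorithm with an (X, y) -> model-with-.predict interface (an illustrative assumption, e.g. a decision-tree learner):

```python
import numpy as np

def bag(X, y, L, learn, seed=0):
    """Train L classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    models = []
    for _ in range(L):
        idx = rng.integers(0, N, size=N)      # N draws WITH replacement
        models.append(learn(X[idx], y[idx]))  # one unstable learner per sample
    return models

def bagged_predict(models, X):
    """Majority vote of the L classifiers (integer class labels assumed)."""
    votes = np.stack([m.predict(X) for m in models])          # (L, N)
    return np.apply_along_axis(lambda c: np.bincount(c).argmax(), 0, votes)
```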
Similar to bagging, but the L training sets are chosen to increase negative correlation
Use one learning algorithm to learn L classifiers
Training set for classifier j is biased toward examples missed by classifier j-1
Learning algorithm should be weak (not too accurate)
Adaptive Boosting (AdaBoost)
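An AdaBoost sketch for binary labels in {-1, +1}; the weak_learn(X, y, w) interface (train on example weights w) is an assumed illustration, not the slides' notation:

```python
import numpy as np

def adaboost(X, y, T, weak_learn):
    """Train up to T weak classifiers, re-weighting examples each round."""
    N = len(X)
    w = np.full(N, 1.0 / N)               # uniform example weights at start
    models, alphas = [], []
    for _ in range(T):
        m = weak_learn(X, y, w)
        pred = m.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-12, None)  # weighted error
        if err >= 0.5:                    # weak learner must beat random
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)    # up-weight the missed examples
        w /= w.sum()                      # renormalize to a distribution
        models.append(m)
        alphas.append(alpha)
    return models, np.array(alphas)

def adaboost_predict(models, alphas, X):
    """Sign of the alpha-weighted vote of the weak classifiers."""
    return np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))
```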
Figure (not shown): each point represents 1 of 27 test domains.
Dietterich, "Machine Learning Research: Four Current Directions," AI Magazine, Winter 1997.
Weights depend on the test instance
Competitive learning
◦ Weight w_j(x) is driven toward 1 (others toward 0) for the learner j best in the region near x

$$y = \sum_{j=1}^{L} w_j(\mathbf{x})\, d_j(\mathbf{x})$$
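A mixture-of-experts sketch; the gate.scores(x) interface (one score per expert, softmaxed into the instance-dependent weights w_j(x)) is an assumption for this illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))        # subtract max for numerical stability
    return e / e.sum()

def mixture_predict(x, experts, gate):
    """y = sum_j w_j(x) d_j(x): gating weights depend on the test instance x."""
    w = softmax(gate.scores(x))                      # w_j(x), sums to 1
    d = np.array([m.predict(x) for m in experts])    # expert outputs d_j(x)
    return w @ d
```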
Combining function f( ) is learned
Train f on data not used to train base learners
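A stacked-generalization sketch; the split names and learner interfaces are illustrative assumptions. The key point from the slide is that f is trained on data the base learners never saw:

```python
import numpy as np

def stack(X_train, y_train, X_blend, y_blend, base_learners, meta_learn):
    """Train base learners on one split and the combiner f on another."""
    models = [learn(X_train, y_train) for learn in base_learners]
    # Base-learner predictions become the combiner's input features
    Z = np.column_stack([m.predict(X_blend) for m in models])
    f = meta_learn(Z, y_blend)          # f is learned, not fixed
    return models, f

def stacked_predict(models, f, X):
    Z = np.column_stack([m.predict(X) for m in models])
    return f.predict(Z)
```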
Ensemble need not be fixed
Can modify the ensemble to improve accuracy or reduce correlation of base learners
Subset selection
◦ Add/remove base learners while performance improves (a greedy sketch follows this list)
Meta-learners
◦ Stack learners to construct new features
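A greedy forward-selection sketch of the add-while-it-helps idea; evaluate is an assumed helper that scores an ensemble (e.g., its majority-vote accuracy) on a validation set:

```python
def select_subset(models, X_val, y_val, evaluate):
    """Greedily add the base learner that most improves validation score."""
    chosen, best = [], float("-inf")
    remaining = list(models)
    while remaining:
        score, m = max(((evaluate(chosen + [m], X_val, y_val), m)
                        for m in remaining), key=lambda t: t[0])
        if score <= best:            # stop once adding no longer helps
            break
        best = score
        chosen.append(m)
        remaining.remove(m)
    return chosen
```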
Use classifier d_j only if the previous classifiers lacked confidence
Order classifiers by increasing complexity
Differs from boosting
◦ Both errant and uncertain examples are passed to the next learner
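A cascading sketch; the (label, confidence) interface and per-stage thresholds are illustrative assumptions. Classifiers are tried in order of increasing complexity, falling through only when confidence is low:

```python
def cascade_predict(x, classifiers, thresholds):
    """Stop at the first classifier that is confident enough about x."""
    label = None
    for clf, theta in zip(classifiers, thresholds):
        label, conf = clf.predict_with_confidence(x)
        if conf >= theta:            # confident: skip the costlier models
            return label
    return label                     # otherwise the last classifier decides
```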
Typically, the hypothesis space H does not contain the target function f
Weighted combinations of several approximations may represent classifiers outside of H
Decision surfaces defined by learned decision trees.
Decision surface defined by vote over learned decision trees.
$1M prize to the team improving Netflix's movie recommender by 10%
Won by team "BellKor's Pragmatic Chaos", which combined classifiers from 3 teams
◦ BellKor, BigChaos, Pragmatic Theory
Second place, "The Ensemble", combined classifiers from 23 other teams
Solutions were effectively ensembles of over 800 classifiers
www.netflixprize.com
Töscher et al., "The BigChaos Solution to the Netflix Grand Prize", 2009.
Combining learners can overcome the weaknesses of individual learners
Base learners must do better than random and have uncorrelated errors
Ensembles typically take a majority vote of base classifiers
Boosting, stacking
Application to recommender systems
◦ Netflix Prize