Ensembling & Boosting
Wayne Chen, 2016/08
XGBoost: Kaggle Winning Solution
Giuliano Janson: won two competitions and retired from Kaggle.
Persistence: every Kaggler nowadays can put up a great model in a few hours and usually achieve 95% of the final score. Only persistence will get you the remaining 5%.
Ensembling: you need to know how to do it "like a pro". Forget about averaging models: nowadays many Kagglers build meta-models, and even meta-meta-models.
Why is an Ensemble Needed?
Occam's Razor
An explanation of the data should be made as simple as possible, but no simpler.
Simple is good.
Training data might not provide sufficient information for choosing a single best learner.
The search processes of learning algorithms might be imperfect (it is difficult to reach the unique best hypothesis).
The hypothesis space being searched might not contain the true target function.
Decision trees: ID3, C4.5, CART
Tree-based methods split on entropy. Example: 14 samples, 5 positive (1M, 4F) and 9 negative (6M, 3F).
E_all = -5/14 * log(5/14) - 9/14 * log(9/14)
Entropy is 1 for a 50%/50% split, 0 for a 100%/0% split.
Information gain of a split attribute: E_gender = P(M) * E(1,6) + P(F) * E(4,3); Gain = E_all - E_gender
http://www.saedsayad.com/decision_tree.htm
http://blogs.sas.com/content/jmp/2013/03/25/partitioning-a-quadratic-in-jmp/
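The slide's numbers can be checked in a few lines of Python (a small sketch; the helper name `entropy` is mine, using base-2 logs as the 1-bit maximum in the slide implies):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# 14 samples: 5 positive (1M, 4F) and 9 negative (6M, 3F)
e_all = entropy([5, 9])

# Split on gender: males are (1 pos, 6 neg), females are (4 pos, 3 neg)
e_gender = 7/14 * entropy([1, 6]) + 7/14 * entropy([4, 3])
gain = e_all - e_gender   # information gain of the gender split
```

This gives E_all ≈ 0.940 and a gain of roughly 0.15 bits for the gender split.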
Boost Ensemble
Ensemble
Try many models: decision tree, NN, SVM, regression ..
You can even ensemble Kaggle submission CSV files. It works! Majority voting:
Three models at 70%, 70%, 70% accuracy: a majority-vote ensemble reaches ~78%. Averaging predictions often reduces overfit.
http://mlwave.com/kaggle-ensembling-guide/
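The ~78% figure follows from the binomial distribution, assuming the three models' errors are independent (a sketch; `majority_vote_accuracy` is my own helper name):

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent models,
    each correct with probability p, is correct (n odd)."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

# Three independent 70%-accurate models
acc = majority_vote_accuracy(0.7, 3)   # 0.784, i.e. ~78%
```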
Ensemble
Like building a team: Kobe, Curry, LBJ each bring different strengths.
Uncorrelated models usually perform better. Make base learners as accurate as possible, and as diverse as possible. Combine them with majority vote or weighted averaging, e.g. a voting ensemble of a RandomForest and a GradientBoostingMachine.
Three highly correlated models:
1111111100 = 80% accuracy
1111111100 = 80% accuracy
1011111100 = 70% accuracy
Majority vote: 1111111100 = 80% accuracy

Three less correlated models:
1111111100 = 80% accuracy
0111011101 = 70% accuracy
1000101111 = 60% accuracy
Majority vote: 1111111101 = 90% accuracy
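The bit-string example can be reproduced directly (a small sketch; helper names are mine, and the ground truth is assumed to be all ones, as the slide's accuracies imply):

```python
def majority_vote(predictions):
    """Combine equal-length 0/1 prediction strings by per-position majority."""
    return ''.join('1' if votes.count('1') > len(predictions) / 2 else '0'
                   for votes in zip(*predictions))

def accuracy(pred, truth):
    return sum(a == b for a, b in zip(pred, truth)) / len(truth)

truth = '1111111111'   # assumed ground truth: all ten labels are 1

# Highly correlated models (80%, 80%, 70%) gain nothing from voting:
correlated = ['1111111100', '1111111100', '1011111100']
# Less correlated models (80%, 70%, 60%) beat their best member:
diverse = ['1111111100', '0111011101', '1000101111']
```

With these inputs, `majority_vote(correlated)` stays at 80% accuracy while `majority_vote(diverse)` reaches 90%, matching the slide.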
Ensemble
Randomly sampling not only data but also features.
Majority vote over trees; minimal tuning, yet performance that beats many more complex methods.
n: subsample size
m: subfeature set size
tree size, tree number
http://www.slideshare.net/0xdata/jan-vitek-distributedrandomforest522013
Base learners are the individual models of an ensemble (e.g. a simple neural network), each trained by a base learning algorithm (decision tree, neural network, ..).
Boosting - boosts weak learners into strong learners (sequential learners). Bagging - like RandomForest, sampling from data or features. Stacking - combines learners trained in parallel.
Stacking employs different learning algorithms to train the individual learners, which are then combined by a second-level learner called a meta-learner.
Ensemble
Bagging (Bootstrap AGGregatING)
Draw m bootstrap samples; train a base learner on each sample by calling the base learning algorithm.
Cherkauer (1996) combined 32 neural networks trained on different input features.
Randomness can also be injected directly: random initialization in backpropagation for neural networks, random feature selection for trees.
Majority voting
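Bagging can be sketched in plain Python with threshold stumps as base learners (a toy illustration, not from the slides; all names are mine):

```python
import random

def fit_stump(data):
    """Pick the threshold that best separates labels on 1-D points.
    Returns (threshold, right_label): predict right_label for x >= threshold."""
    best = None
    for t in sorted(x for x, _ in data):
        for right in (0, 1):
            acc = sum((y == right) == (x >= t) for x, y in data) / len(data)
            if best is None or acc > best[0]:
                best = (acc, t, right)
    return best[1], best[2]

def predict(stump, x):
    t, right = stump
    return right if x >= t else 1 - right

random.seed(0)
# Toy 1-D data: label is 1 for x above 5
data = [(x, int(x > 5)) for x in range(11)]

# Bagging: train each stump on a bootstrap sample (drawn with replacement)
stumps = []
for _ in range(25):
    sample = [random.choice(data) for _ in data]
    stumps.append(fit_stump(sample))

def bagged_predict(x):
    """Majority vote over all bootstrap-trained stumps."""
    votes = sum(predict(s, x) for s in stumps)
    return int(votes > len(stumps) / 2)
```

Each stump sees a slightly different bootstrap sample, and the majority vote smooths out their individual quirks.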
Boosting Family
AdaBoost (Adaptive Boosting) Gradient Tree Boosting XGBoost
Combination of Additive Models
Bagging can significantly reduce the variance Boosting can significantly reduce the bias
http://slideplayer.com/slide/4816467/
AdaBoost
Starts with equal weights on all training examples, then increases the weights of incorrectly classified examples each round.
http://www.37steps.com/exam/adaboost_comp/html/adaboost_comp.html
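The reweighting scheme can be sketched as a minimal AdaBoost on 1-D data, using the standard alpha = 0.5 * ln((1 - err) / err) update (a toy sketch; all names are mine):

```python
import math

def adaboost(data, rounds):
    """AdaBoost with threshold stumps on 1-D points, labels in {-1, +1}."""
    n = len(data)
    w = [1 / n] * n                        # start with equal example weights
    ensemble = []                          # list of (alpha, (threshold, sign))
    for _ in range(rounds):
        # Pick the stump (threshold t, sign s) with lowest weighted error
        best = None
        for t, _ in data:
            for s in (-1, 1):
                err = sum(wi for wi, (x, y) in zip(w, data)
                          if (s if x >= t else -s) != y)
                if best is None or err < best[0]:
                    best = (err, t, s)
        err, t, s = best
        err = max(err, 1e-10)              # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, (t, s)))
        # Up-weight misclassified examples, down-weight correct ones
        w = [wi * math.exp(-alpha * y * (s if x >= t else -s))
             for wi, (x, y) in zip(w, data)]
        total = sum(w)
        w = [wi / total for wi in w]       # renormalize to sum to 1
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * (s if x >= t else -s) for a, (t, s) in ensemble)
    return 1 if score >= 0 else -1

data = [(x, 1 if x > 5 else -1) for x in range(11)]
ens = adaboost(data, 5)
```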
Gradient Boosting
Additive training: each new predictor is optimized by moving in the opposite direction of the gradient to minimize the loss function.
Boosted trees appear under many names: GBDT, GBRT, MART, LambdaMART.
Gradient Boosting Model Steps
Each leaf carries a weight, and a cost function scores the model. Additive training: each round adds a tree that reduces the cost (error). A greedy algorithm grows the new tree from a single leaf, and gradients are used to update the leaf weights.
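For squared loss the negative gradient is just the residual y - F(x), so the steps above can be sketched as boosting regression stumps on the residuals (a toy sketch; all names are mine):

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump: predict the mean residual
    on each side of the threshold (minimizes squared error)."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

def gradient_boost(xs, ys, rounds=50, lr=0.1):
    """Each round fits a stump to the current residuals and adds it
    with shrinkage factor lr (the learning rate)."""
    f0 = sum(ys) / len(ys)                 # initial model: the mean
    trees = []
    pred = [f0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        pred = [p + lr * tree(x) for p, x in zip(pred, xs)]
    return lambda x: f0 + lr * sum(t(x) for t in trees)

xs = list(range(10))
ys = [x * x for x in xs]                   # target: a quadratic
model = gradient_boost(xs, ys)
```

Note the shrinkage factor `lr`: each tree contributes only a small step, which is exactly the training tip discussed next.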
Training Tips
Shrinkage
Reduces the influence of each individual tree and leaves space for future trees to improve the model.
It is better to improve the model with many small steps than with large steps.
Subsampling, Early Stopping, Post-Pruning
Among the 29 challenge-winning solutions published on Kaggle in 2015, 17 used XGBoost (deep neural nets: 11).
Every winning team in the top 10 of KDDCup 2015 used XGBoost.
Scalability enables data scientists to process hundred millions of examples on a desktop.
Parallelized with OpenMP over CPU threads; data is held in a DMatrix; cache-aware and sparsity-aware learning.
XGBoost
Column Block for Parallel Learning
The most time-consuming part of tree learning is getting the data into sorted order. XGBoost stores data in in-memory blocks in compressed column (CSC) format, with each column sorted by the corresponding feature value. Block compression and block sharding reduce I/O cost.
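The idea can be illustrated in a few lines: sort each feature column once, then every split-finding pass scans the presorted order instead of re-sorting (a conceptual sketch only, not XGBoost's actual implementation):

```python
# Toy dataset: each row is (feature_0, feature_1, label)
rows = [
    (3.0, 9.0, 1),
    (1.0, 7.0, 0),
    (2.0, 8.0, 1),
]

# For each feature, store row indices sorted by that feature's value
# (the "column block" idea: sort once, scan many times).
sorted_columns = [
    sorted(range(len(rows)), key=lambda i: rows[i][f])
    for f in range(2)
]

# Split finding now walks each column in feature order with no sorting
for f, order in enumerate(sorted_columns):
    values = [rows[i][f] for i in order]
    # candidate split points are the midpoints between consecutive values
```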
Results
Use it in Python
from xgboost import XGBClassifier

xgb_model = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=8,
    scale_pos_weight=1,
    seed=27)
gamma : Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight : Minimum sum of instance weight(hessian) needed in a child.
colsample_bytree : Subsample ratio of columns when constructing each tree.
Ensemble in Kaggle
Voting ensembles, weighted majority vote, bagged perceptrons, rank averaging, historical ranks, stacking & blending (Netflix)
Voting ensemble of around 30 convnets. The best single model scored 0.93170. Final score 0.94120.
Ensemble in Kaggle
No Free Lunch
An ensemble is usually much better than a single learner; the bias-variance tradeoff explains why (boosting reduces bias, averaging/voting reduces variance).
Drawbacks: ensembles are hard to interpret -- like DNNs or non-linear SVMs. There is no ensemble method which outperforms other ensemble methods consistently.
Selecting some base learners instead of using all of them to compose the ensemble can be a better choice -- selective ensembles.
XGBoost (tabular data) vs. Deep Learning (more and more complex data, harder tuning).
Reference
Gradient boosting machines, a tutorial - Alexey Natekin and Alois Knoll
XGBoost: A Scalable Tree Boosting System - Tianqi Chen
NTU cmlab tutorials: http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/tutorials/
http://mlwave.com/kaggle-ensembling-guide/