Ensembling & Boosting
Wayne Chen, 2016/08
XGBoost: Kaggle Winning Solution
Giuliano Janson: won two competitions and retired from Kaggle.
Persistence: every Kaggler nowadays can put up a great model in a few hours and usually achieve 95% of the final score. Only persistence will get you the remaining 5%.
Ensembling: you need to know how to do it "like a pro". Forget about averaging models: nowadays many Kagglers build meta-models, and even meta-meta-models.
Why is an Ensemble Needed?
Occam's Razor
An explanation of the data should be made as simple as possible, but no simpler.
Simple is good.
Training data might not provide sufficient information for choosing a single best learner.
The search processes of learning algorithms might be imperfect (it is difficult to reach the unique best hypothesis).
The hypothesis space being searched might not contain the true target function.
Decision trees: ID3, C4.5, CART
Tree-based methods split on entropy. Example: 14 samples, 5 positive (1M, 4F) and 9 negative (6M, 3F).
E_all = -5/14 * log(5/14) - 9/14 * log(9/14)
Entropy is 1 for a 50%/50% split, 0 for a 100%/0% split.
Information gain of a split attribute: E_gender = P(M) * E(1,6) + P(F) * E(4,3); Gain = E_all - E_gender
http://www.saedsayad.com/decision_tree.htm
http://blogs.sas.com/content/jmp/2013/03/25/partitioning-a-quadratic-in-jmp/
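The slide's numbers can be checked in a few lines of Python (a small sketch; the helper name `entropy` is mine, using base-2 logs as the 1-bit maximum in the slide implies):

```python
from math import log2

def entropy(counts):
    """Shannon entropy (base 2) of a class distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# 14 samples: 5 positive (1M, 4F) and 9 negative (6M, 3F)
e_all = entropy([5, 9])

# Split on gender: males are (1 pos, 6 neg), females are (4 pos, 3 neg)
e_gender = 7/14 * entropy([1, 6]) + 7/14 * entropy([4, 3])
gain = e_all - e_gender   # information gain of the gender split
```

This gives E_all ≈ 0.940 and a gain of roughly 0.15 bits for the gender split.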
Boost Ensemble
Ensemble
Try many models: decision tree, NN, SVM, regression ..
You can even ensemble Kaggle submission CSV files. It works! Majority voting:
Three models at 70%, 70%, 70% accuracy: a majority-vote ensemble reaches ~78%. Averaging predictions often reduces overfit.
http://mlwave.com/kaggle-ensembling-guide/
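The ~78% figure follows from the binomial distribution, assuming the three models' errors are independent (a sketch; `majority_vote_accuracy` is my own helper name):

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent models,
    each correct with probability p, is correct (n odd)."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

# Three independent 70%-accurate models
acc = majority_vote_accuracy(0.7, 3)   # 0.784, i.e. ~78%
```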
Ensemble
Like building a team: Kobe, Curry, LBJ each bring different strengths.
Uncorrelated models usually perform better. Make base learners as accurate as possible, and as diverse as possible. Combine them with majority vote or weighted averaging, e.g. a voting ensemble of a RandomForest and a GradientBoostingMachine.
Three highly correlated models:
1111111100 = 80% accuracy
1111111100 = 80% accuracy
1011111100 = 70% accuracy
Majority vote: 1111111100 = 80% accuracy

Three less correlated models:
1111111100 = 80% accuracy
0111011101 = 70% accuracy
1000101111 = 60% accuracy
Majority vote: 1111111101 = 90% accuracy
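The bit-string example can be reproduced directly (a small sketch; helper names are mine, and the ground truth is assumed to be all ones, as the slide's accuracies imply):

```python
def majority_vote(predictions):
    """Combine equal-length 0/1 prediction strings by per-position majority."""
    return ''.join('1' if votes.count('1') > len(predictions) / 2 else '0'
                   for votes in zip(*predictions))

def accuracy(pred, truth):
    return sum(a == b for a, b in zip(pred, truth)) / len(truth)

truth = '1111111111'   # assumed ground truth: all ten labels are 1

# Highly correlated models (80%, 80%, 70%) gain nothing from voting:
correlated = ['1111111100', '1111111100', '1011111100']
# Less correlated models (80%, 70%, 60%) beat their best member:
diverse = ['1111111100', '0111011101', '1000101111']
```

With these inputs, `majority_vote(correlated)` stays at 80% accuracy while `majority_vote(diverse)` reaches 90%, matching the slide.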
Ensemble
Randomly sampling not only data but also features.
Majority vote over trees; minimal tuning, yet performance that beats many more complex methods.
n: subsample size
m: subfeature set size
tree size, tree number
http://www.slideshare.net/0xdata/jan-vitek-distributedrandomforest522013
Base learners are the individual models of an ensemble (e.g. a simple neural network), each trained by a base learning algorithm (decision tree, neural network, ..).
Boosting - boosts weak learners into strong learners (sequential learners). Bagging - like RandomForest, sampling from data or features. Stacking - combines learners trained in parallel.
Stacking employs different learning algorithms to train the individual learners, which are then combined by a second-level learner called a meta-learner.
Ensemble
Bagging (Bootstrap AGGregatING)
Draw m bootstrap samples; train a base learner on each sample by calling the base learning algorithm.
Cherkauer (1996) combined 32 neural networks trained on different input features.
Randomness can also be injected directly: random initialization in backpropagation for neural networks, random feature selection for trees.
Majority voting
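Bagging can be sketched in plain Python with threshold stumps as base learners (a toy illustration, not from the slides; all names are mine):

```python
import random

def fit_stump(data):
    """Pick the threshold that best separates labels on 1-D points.
    Returns (threshold, right_label): predict right_label for x >= threshold."""
    best = None
    for t in sorted(x for x, _ in data):
        for right in (0, 1):
            acc = sum((y == right) == (x >= t) for x, y in data) / len(data)
            if best is None or acc > best[0]:
                best = (acc, t, right)
    return best[1], best[2]

def predict(stump, x):
    t, right = stump
    return right if x >= t else 1 - right

random.seed(0)
# Toy 1-D data: label is 1 for x above 5
data = [(x, int(x > 5)) for x in range(11)]

# Bagging: train each stump on a bootstrap sample (drawn with replacement)
stumps = []
for _ in range(25):
    sample = [random.choice(data) for _ in data]
    stumps.append(fit_stump(sample))

def bagged_predict(x):
    """Majority vote over all bootstrap-trained stumps."""
    votes = sum(predict(s, x) for s in stumps)
    return int(votes > len(stumps) / 2)
```

Each stump sees a slightly different bootstrap sample, and the majority vote smooths out their individual quirks.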
Boosting Family
AdaBoost (Adaptive Boosting) Gradient Tree Boosting XGBoost
Combination of Additive Models
Bagging can significantly reduce the variance Boosting can significantly reduce the bias
http://slideplayer.com/slide/4816467/
AdaBoost
Starts with equal weights on all training examples, then increases the weights of incorrectly classified examples each round.
http://www.37steps.com/exam/adaboost_comp/html/adaboost_comp.html
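The reweighting scheme can be sketched as a minimal AdaBoost on 1-D data, using the standard alpha = 0.5 * ln((1 - err) / err) update (a toy sketch; all names are mine):

```python
import math

def adaboost(data, rounds):
    """AdaBoost with threshold stumps on 1-D points, labels in {-1, +1}."""
    n = len(data)
    w = [1 / n] * n                        # start with equal example weights
    ensemble = []                          # list of (alpha, (threshold, sign))
    for _ in range(rounds):
        # Pick the stump (threshold t, sign s) with lowest weighted error
        best = None
        for t, _ in data:
            for s in (-1, 1):
                err = sum(wi for wi, (x, y) in zip(w, data)
                          if (s if x >= t else -s) != y)
                if best is None or err < best[0]:
                    best = (err, t, s)
        err, t, s = best
        err = max(err, 1e-10)              # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, (t, s)))
        # Up-weight misclassified examples, down-weight correct ones
        w = [wi * math.exp(-alpha * y * (s if x >= t else -s))
             for wi, (x, y) in zip(w, data)]
        total = sum(w)
        w = [wi / total for wi in w]       # renormalize to sum to 1
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * (s if x >= t else -s) for a, (t, s) in ensemble)
    return 1 if score >= 0 else -1

data = [(x, 1 if x > 5 else -1) for x in range(11)]
ens = adaboost(data, 5)
```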
Gradient Boosting
Additive training: each new predictor is optimized by moving in the opposite direction of the gradient to minimize the loss function.
Boosted trees appear under many names: GBDT, GBRT, MART, LambdaMART.
Gradient Boosting Model Steps
Each leaf carries a weight, and a cost function scores the model. Additive training: each round adds a tree that reduces the cost (error). A greedy algorithm grows the new tree from a single leaf, and gradients are used to update the leaf weights.
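For squared loss the negative gradient is just the residual y - F(x), so the steps above can be sketched as boosting regression stumps on the residuals (a toy sketch; all names are mine):

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump: predict the mean residual
    on each side of the threshold (minimizes squared error)."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x < t]
        right = [r for x, r in zip(xs, residuals) if x >= t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lm) ** 2 for r in left)
               + sum((r - rm) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

def gradient_boost(xs, ys, rounds=50, lr=0.1):
    """Each round fits a stump to the current residuals and adds it
    with shrinkage factor lr (the learning rate)."""
    f0 = sum(ys) / len(ys)                 # initial model: the mean
    trees = []
    pred = [f0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        pred = [p + lr * tree(x) for p, x in zip(pred, xs)]
    return lambda x: f0 + lr * sum(t(x) for t in trees)

xs = list(range(10))
ys = [x * x for x in xs]                   # target: a quadratic
model = gradient_boost(xs, ys)
```

Note the shrinkage factor `lr`: each tree contributes only a small step, which is exactly the training tip discussed next.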
Training Tips
Shrinkage
Reduces the influence of each individual tree and leaves space for future trees to improve the model.
It is better to improve the model with many small steps than with large steps.
Subsampling, Early Stopping, Post-Pruning
Among the 29 challenge-winning solutions published on Kaggle in 2015, 17 used XGBoost (deep neural nets: 11).
Every winning team in the top 10 of KDDCup 2015 used XGBoost.
Scalability enables data scientists to process hundred millions of examples on a desktop.
Parallelized with OpenMP over CPU threads; data is held in a DMatrix; cache-aware and sparsity-aware learning.
XGBoost
Column Block for Parallel Learning
The most time-consuming part of tree learning is getting the data into sorted order. XGBoost stores data in in-memory blocks in compressed column (CSC) format, with each column sorted by the corresponding feature value. Block compression and block sharding reduce I/O cost.
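The idea can be illustrated in a few lines: sort each feature column once, then every split-finding pass scans the presorted order instead of re-sorting (a conceptual sketch only, not XGBoost's actual implementation):

```python
# Toy dataset: each row is (feature_0, feature_1, label)
rows = [
    (3.0, 9.0, 1),
    (1.0, 7.0, 0),
    (2.0, 8.0, 1),
]

# For each feature, store row indices sorted by that feature's value
# (the "column block" idea: sort once, scan many times).
sorted_columns = [
    sorted(range(len(rows)), key=lambda i: rows[i][f])
    for f in range(2)
]

# Split finding now walks each column in feature order with no sorting
for f, order in enumerate(sorted_columns):
    values = [rows[i][f] for i in order]
    # candidate split points are the midpoints between consecutive values
```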
Results
Use it in Python
from xgboost import XGBClassifier

xgb_model = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=8,
    scale_pos_weight=1,
    seed=27)
gamma : Minimum loss reduction required to make a further partition on a leaf node of the tree.
min_child_weight : Minimum sum of instance weight(hessian) needed in a child.
colsample_bytree : Subsample ratio of columns when constructing each tree.
Ensemble in Kaggle
Voting ensembles, weighted majority vote, bagged perceptrons, rank averaging, historical ranks, stacking & blending (Netflix)
Voting ensemble of around 30 convnets. The best single model scored 0.93170. Final score 0.94120.
Ensemble in Kaggle
No Free Lunch
An ensemble is usually much better than a single learner; the bias-variance tradeoff explains why (boosting reduces bias, averaging/voting reduces variance).
Drawbacks: ensembles are hard to interpret -- like DNNs or non-linear SVMs. There is no ensemble method which outperforms other ensemble methods consistently.
Selecting some base learners instead of using all of them to compose the ensemble can be a better choice -- selective ensembles.
XGBoost (tabular data) vs. Deep Learning (more and more complex data, harder tuning).
Reference
Gradient boosting machines, a tutorial - Alexey Natekin and Alois Knoll
XGBoost: A Scalable Tree Boosting System - Tianqi Chen
NTU cmlab tutorials: http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/tutorials/
http://mlwave.com/kaggle-ensembling-guide/