Ensembling & Boosting: A Concept Introduction (Wayne Chen, 2016-08)


  • Ensembling & Boosting

    Wayne Chen, 2016-08


  • Deep Learning vs. traditional ML

    XGBoost : Kaggle Winning Solution

    Giuliano Janson: Won two games and retired from Kaggle

    Persistence: every Kaggler nowadays can put up a great model in a few hours and usually achieve 95% of the final score. Only persistence will get you the remaining 5%.

    Ensembling: you need to know how to do it "like a pro". Forget about simply averaging models; nowadays many Kagglers build meta-models, and even meta-meta-models.

  • Why Ensemble is needed?

    Occam's Razor

    An explanation of the data should be made as simple as possible, but no simpler.

    Simple is good.

    Training data might not provide sufficient information for choosing a single best learner.

    The search processes of the learning algorithms might be imperfect (it can be difficult to reach the unique best hypothesis).

    The hypothesis space being searched might not contain the true target function.

  • ID3, C4.5, CART: tree-based methods. Entropy example: 14 examples, 5 of one class (1M, 4F) and 9 of the other (6M, 3F).

    E_all = -5/14 * log2(5/14) - 9/14 * log2(9/14). Entropy is 1 for a 50%/50% split and 0 for a 100%/0% split.

    Information gain of a split attribute: E_gender = P(M) * E(1,6) + P(F) * E(4,3), and Gain = E_all - E_gender (a short sketch follows below).

    http://www.saedsayad.com/decision_tree.htm
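
    A minimal Python sketch of these formulas, reproducing the 5-vs-9 example above (base-2 logarithm; the function names are just illustrative):

    from math import log2

    def entropy(counts):
        # Entropy of a class distribution, e.g. entropy([5, 9]) for the full set.
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c > 0)

    e_all = entropy([5, 9])                                      # ~0.940 for all 14 examples
    e_gender = 7/14 * entropy([1, 6]) + 7/14 * entropy([4, 3])   # weighted entropy after the gender split
    gain = e_all - e_gender                                      # information gain of splitting on gender
    print(e_all, e_gender, gain)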

  • http://blogs.sas.com/content/jmp/2013/03/25/partitioning-a-quadratic-in-jmp/

  • Boost Ensemble


  • Ensemble

    Try different models: decision tree, NN, SVM, regression, ...

    Ensembling Kaggle submission CSV files works! Majority voting:

    Three models at 70%, 70%, 70% accuracy: a majority-vote ensemble reaches ~78% (a quick check follows after this slide). Averaging predictions often reduces overfitting.

    http://mlwave.com/kaggle-ensembling-guide/
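
    The ~78% figure follows from a simple binomial argument, assuming the three models err independently; a quick check in Python:

    # The majority of three independent 70%-accurate models is correct when
    # all three are right or exactly two of them are.
    p = 0.7
    p_majority = p**3 + 3 * p**2 * (1 - p)
    print(p_majority)   # 0.784, i.e. roughly the ~78% quoted above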

  • Ensemble

    Kobe, Curry, LBJ

    Uncorrelated models usually perform better: base learners should be as accurate as possible and as diverse as possible. Combine by majority vote or weighted averaging, e.g. a voting ensemble of RandomForest and GradientBoostingMachine.

    Highly correlated models:
    1111111100 = 80% accuracy
    1111111100 = 80% accuracy
    1011111100 = 70% accuracy
    Majority vote: 1111111100 = 80% accuracy

    Less correlated models:
    1111111100 = 80% accuracy
    0111011101 = 70% accuracy
    1000101111 = 60% accuracy
    Majority vote: 1111111101 = 90% accuracy (a sketch follows below)
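
    A small Python sketch of the less-correlated case above, assuming the ground truth is all ones for the ten test cases:

    from collections import Counter

    ground_truth = "1111111111"          # assumed: every test case is positive

    def majority_vote(predictions):
        # Bitwise majority vote over equally long prediction strings.
        return "".join(Counter(bits).most_common(1)[0][0] for bits in zip(*predictions))

    def accuracy(pred):
        return sum(p == t for p, t in zip(pred, ground_truth)) / len(ground_truth)

    models = ["1111111100", "0111011101", "1000101111"]   # 80%, 70%, 60% accuracy
    ensemble = majority_vote(models)
    print(ensemble, accuracy(ensemble))                   # 1111111101 0.9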

  • Ensemble

    Randomly sample not only the data but also the features.

    Majority vote. Minimal tuning. Its performance beats many more complex methods (a minimal sketch follows after this slide).

    n: subsample size

    m: subfeature set size

    tree size, tree number
    http://www.slideshare.net/0xdata/jan-vitek-distributedrandomforest522013
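
    A minimal Random Forest sketch with scikit-learn (>= 0.22 for max_samples); the data is synthetic and the hyperparameter values are illustrative, not tuned:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rf = RandomForestClassifier(
        n_estimators=500,      # tree number
        max_features="sqrt",   # sub-feature set size m considered at each split
        max_samples=0.8,       # bootstrap subsample size n per tree
        n_jobs=-1,
        random_state=0,
    )
    rf.fit(X_train, y_train)
    print(rf.score(X_test, y_test))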

  • Base learners are the components of an ensemble (e.g. decision trees, simple neural networks), each trained by a base learning algorithm (e.g. decision tree, neural network, ...).

    Boosting - boosts weak learners into strong learners (sequential learners).

    Bagging - like RandomForest, samples from the data or the features (parallel learners).

    Stacking - employs different learning algorithms to train individual learners, which are then combined by a second-level learner called the meta-learner (a sketch follows below).
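
    A minimal stacking sketch with scikit-learn's StackingClassifier, assuming heterogeneous base learners combined by a logistic-regression meta-learner (synthetic data, illustrative settings):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    stack = StackingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
            ("svm", SVC(probability=True, random_state=0)),
        ],
        final_estimator=LogisticRegression(),  # the second-level meta-learner
        cv=5,                                  # out-of-fold predictions feed the meta-learner
    )
    stack.fit(X, y)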

  • Bagging Ensemble (Bootstrap Aggregating)

    Draw bootstrap samples (of size m) and train a base learner on each by calling a base learning algorithm.

    Each round of sampling trains one model.

    Cherkauer (1996): combined 32 neural networks trained on different input features.

    Sources of randomness: random initialization of backpropagation for neural networks, random feature selection for trees.

    Predictions are combined by majority voting (a sketch follows below).
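
    A minimal bagging sketch with scikit-learn's BaggingClassifier (its default base learner is a decision tree); the sampling ratios are illustrative:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import BaggingClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    bag = BaggingClassifier(
        n_estimators=100,    # number of bootstrap samples / base learners
        max_samples=0.8,     # size of each bootstrap sample
        max_features=0.8,    # random feature subset per base learner
        random_state=0,
    )
    bag.fit(X, y)            # predictions are combined by majority voting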

  • Boost Family

    AdaBoost (Adaptive Boosting) Gradient Tree Boosting XGBoost

    Combination of additive models.

    Bagging can significantly reduce the variance; boosting can significantly reduce the bias.

  • http://slideplayer.com/slide/4816467/

    AdaBoost initially assigns equal weights to all training examples, then increases the weights of incorrectly classified examples in each round.

  • AdaBoost

    http://www.37steps.com/exam/adaboost_comp/html/adaboost_comp.html
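
    A minimal AdaBoost sketch with scikit-learn, where weak learners are added sequentially and misclassified examples are re-weighted each round (settings are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
    ada.fit(X, y)   # the default base learner is a depth-1 decision tree (stump)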

  • Gradient Boosting

    Additive training: each new predictor is optimized by moving in the opposite direction of the gradient, so as to minimize the loss function.

    Boosted trees go by many names: GBDT, GBRT, MART, LambdaMART.

  • Gradient Boosting Model Steps

    Score a tree by its leaf weights through the cost (objective) function. Additive training:

    Each round adds the new tree that most reduces the cost (error). A greedy algorithm builds the new tree from a single leaf, and gradient statistics are used to update the leaf weights (sketched below).
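
    A sketch of the additive-training objective behind these steps, in the notation of the XGBoost paper (g_i and h_i are the first and second derivatives of the loss, I_j is the instance set of leaf j):

    % Round t adds one new tree f_t on top of the current prediction
    \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)

    % Objective for round t: training loss plus a regularizer on the new tree
    \mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t)

    % Optimal weight of leaf j and the resulting score used to grow the tree greedily
    w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}, \qquad
    \tilde{\mathcal{L}}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^{2}}{\sum_{i \in I_j} h_i + \lambda} + \gamma T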

  • Training Tips

    Shrinkage

    Reduces the influence of each individual tree and leaves space for future trees to improve the model.

    It is better to improve the model by many small steps than by a few large steps.

    Subsampling, Early Stopping, Post-Pruning
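
    A hedged sketch of shrinkage plus early stopping with the xgboost package (synthetic data; parameter values are illustrative, not tuned):

    import xgboost as xgb
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvalid = xgb.DMatrix(X_valid, label=y_valid)

    params = {
        "objective": "binary:logistic",
        "eta": 0.05,        # shrinkage: many small steps rather than a few large ones
        "max_depth": 5,
        "subsample": 0.8,   # row subsampling per tree
    }
    booster = xgb.train(
        params, dtrain,
        num_boost_round=2000,
        evals=[(dvalid, "valid")],
        early_stopping_rounds=50,   # stop once the validation score stops improving
    )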

  • XGBoost

    Among the 29 challenge-winning solutions published on Kaggle in 2015, 17 used XGBoost (deep neural nets: 11).

    In KDDCup 2015, every winning solution mentioned it, including the top-10 teams on the leaderboard.

    Scalability enables data scientists to process hundreds of millions of examples on a desktop.

    OpenMP for CPU multi-threading, the DMatrix data structure, cache-aware and sparsity-aware learning.

  • Column Block for Parallel Learning

    The most time-consuming part of tree learning is getting the data into sorted order. Data is kept in in-memory blocks in a compressed column format, with each column sorted by the corresponding feature value. Block compression and block sharding.

  • Results

  • Use it in Python

    from xgboost import XGBClassifier

    xgb_model = XGBClassifier(
        learning_rate=0.1,
        n_estimators=1000,
        max_depth=5,
        min_child_weight=1,
        gamma=0,
        subsample=0.8,
        colsample_bytree=0.8,
        objective='binary:logistic',
        nthread=8,
        scale_pos_weight=1,
        seed=27)

    gamma : Minimum loss reduction required to make a further partition on a leaf node of the tree.

    min_child_weight : Minimum sum of instance weight(hessian) needed in a child.

    colsample_bytree : Subsample ratio of columns when constructing each tree.
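
    A typical follow-up, assuming X_train, y_train and X_test are placeholders for your own feature matrix and labels:

    xgb_model.fit(X_train, y_train)
    pred = xgb_model.predict_proba(X_test)[:, 1]   # predicted probability of the positive class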

  • Ensemble in Kaggle

    Voting ensembles, weighted majority vote, bagged perceptrons, rank averaging, historical ranks, stacking & blending (Netflix)

  • Voting ensemble of around 30 convnets. The best single model scored 0.93170. Final score 0.94120.


  • No Free Lunch

    An ensemble is usually much better than a single learner (bias-variance tradeoff: boosting reduces bias, while averaging/voting reduces variance).

    Ensembles are not easily interpretable, much like DNNs and non-linear SVMs. No ensemble method outperforms the other ensemble methods consistently. Selecting some base learners instead of using all of them to compose the ensemble can be a better choice (selective ensembles).

    XGBoost (tabular data) vs. Deep Learning (larger and more complex data, harder tuning).

  • Reference

    Gradient Boosting Machines, a Tutorial - Alexey Natekin and Alois Knoll
    XGBoost: A Scalable Tree Boosting System - Tianqi Chen
    NTU CMLab learning tutorials: http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/tutorials/
    Kaggle Ensembling Guide: http://mlwave.com/kaggle-ensembling-guide/