
Comparison Study of Decision Tree Ensembles for Regression

SEONHO PARK

Objectives

• Empirical study of decision tree ensembles for regression problems

• To verify their predictive performance and time efficiency

• Candidate implementations from open-source libraries (instantiated in the sketch below)

• Scikit-Learn
• BaggingRegressor
• RandomForestRegressor
• ExtraTreesRegressor
• AdaBoostRegressor
• GradientBoostingRegressor

• XGBoost

• XGBRegressor
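As a reference point, a minimal sketch of how the candidates can be instantiated, assuming scikit-learn and xgboost are installed (default parameters, as used in this study):

```python
# Minimal sketch: instantiate the six candidate regressors with default
# parameters, as in this study. Assumes scikit-learn and xgboost are installed.
from sklearn.ensemble import (
    BaggingRegressor,
    RandomForestRegressor,
    ExtraTreesRegressor,
    AdaBoostRegressor,
    GradientBoostingRegressor,
)
from xgboost import XGBRegressor

candidates = {
    "Bagging": BaggingRegressor(),
    "RandomForest": RandomForestRegressor(),
    "ExtraTrees": ExtraTreesRegressor(),
    "AdaBoost": AdaBoostRegressor(),
    "GradientBoosting": GradientBoostingRegressor(),
    "XGBoost": XGBRegressor(),
}

for name, model in candidates.items():
    print(name, "->", model.__class__.__name__)
```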

Decision Tree

Figure: example decision tree with node tests x1 > 3.0? and x2 > 2.5? (N/Y branches) and the corresponding partition of the (x1, x2) feature space at 3.0 and 2.5

• Expressed as a recursive partition of the feature space

• Used for both classification and regression

• Building blocks: nodes, leaves

• A node splits the instance space into two or more sub-spaces according to a certain discrete function of the input feature values

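As a concrete illustration of the recursive partition, a minimal sketch that fits a shallow regression tree on toy data and prints its learned splits (the data and thresholds below are ours, not those of the figure):

```python
# Minimal sketch: fit a shallow regression tree and print its node tests,
# illustrating the recursive partition of the feature space (toy data).
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 2))                   # two features: x1, x2
y = 2.0 * (X[:, 0] > 3.0) + 1.0 * (X[:, 1] > 2.5)      # piecewise-constant target

tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))   # the learned node tests
```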

Decision Tree Inducers

• How is a decision tree generated?

• An inducer is characterized by its rules for splitting and pruning nodes

• Well-known decision tree inducers:

ID3 (Quinlan, 1986), C4.5 (Quinlan, 1993), CART (Breiman et al., 1984)

• CART is the most general and popular

CART

• CART stands for Classification and Regression Trees

• Has ability to generate regression trees

• Minimization of misclassification costs

• In regression, the cost is the least-squares error between the target values and the predicted values

• Splits maximize the decrease of the impurity function:

$x_j^{*} = \arg\max_{x_j \le x_j^{R}} \left[\, i(t_p) - P_l\, i(t_l) - P_r\, i(t_r) \,\right]$

• For regression, this reduces to minimizing the within-node variance of the two children (sketched below):

$x_j^{*} = \arg\min_{x_j \le x_j^{R}} \left[\, \mathrm{Var}(Y_l) + \mathrm{Var}(Y_r) \,\right]$
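A minimal sketch of the regression criterion above for a single feature; the brute-force threshold search is purely illustrative, not how CART implementations search in practice:

```python
# Minimal sketch: choose the threshold on one feature that minimizes
# Var(Y_left) + Var(Y_right), the regression split criterion shown above.
import numpy as np

def best_split(x, y):
    best_thr, best_cost = None, np.inf
    for thr in np.unique(x)[:-1]:                # candidate thresholds
        left, right = y[x <= thr], y[x > thr]
        cost = np.var(left) + np.var(right)      # Var(Y_l) + Var(Y_r)
        if cost < best_cost:
            best_thr, best_cost = thr, cost
    return best_thr, best_cost

# toy data: a step in the target around x = 3
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.9, 1.1, 1.0, 3.1, 2.9, 3.0])
print(best_split(x, y))   # picks the threshold separating the two levels
```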

CART

• Pruning

• Stopping/pruning criterion: minimum number of points per node, $N_{\min}$

Figure: Roman Timofeev, Classification and Regression Trees: Theory and Applications (2004)

Decision Tree Pros And Cons

• Advantages

• Explicability: easy to understand and interpret (white-box models)

• Make minimal assumptions

• Requires little data preparation

• Addressing nonlinearity in an intuitive manner

• Can handle both nominal and numerical features

• Perform well with large datasets

• Disadvantages

• Heuristics such as the greedy algorithm make only locally optimal decisions at each node

• Instability and overfitting: not robust to noise (outliers)

Ensemble Methods

• Ensemble tree methods can be classified into two types: bagging and boosting

• Bagging Methods: Tree Bagging, Random Forest, Extra Trees

• Boosting Methods: AdaBoost, Gradient Boosting

Figure: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_iris.html

Averaging Methods

• Random Forest (L. Breiman, 2001)

• Tree Bagging + splits chosen among a random subset of the features

• Extra Trees (Extremely Randomized Trees) (P. Geurts et al., 2006)

• Random Forest + extra trees

• Extra tree: split thresholds at nodes are drawn at random

• Tree Bagging (L. Breiman, 1996)

• What is Bagging?

• BAGGING is an abbreviation of Bootstrap AGGregatING

• Bootstrap: training samples are drawn with replacement

• Variant: features drawn as random subsets, 'Random Subspaces' (1998)

• Variant: both samples and features drawn as random subsets, 'Random Patches' (2012); see the sketch below
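A minimal sketch of how these variants map onto scikit-learn's BaggingRegressor options (the 0.5 fractions are illustrative, not the study's settings; the base learner defaults to a decision tree):

```python
# Minimal sketch: the bagging variants expressed through BaggingRegressor options.
from sklearn.ensemble import BaggingRegressor

# Tree Bagging: bootstrap samples of the training set, all features
tree_bagging = BaggingRegressor(bootstrap=True)

# Random Subspaces: all samples, random subsets of the features
random_subspaces = BaggingRegressor(bootstrap=False, max_features=0.5)

# Random Patches: random subsets of both samples and features
random_patches = BaggingRegressor(max_samples=0.5, max_features=0.5)
```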

Boosting Methods – AdaBoost

• AdaBoost (Y. Freund and R. Schapire, 1995)

• AdaBoost is an abbreviation of 'Adaptive Boosting'

• Sequential decision-making method

• Boosted classifier in the form:

$H(x) = \sum_{t=1}^{T} \rho_t\, h_t(x)$

where $H$ is the hypothesis of the strong learner, $h_t$ the hypothesis of the $t$-th weak learner, and $\rho_t$ its weight

Figure: Schapire and Freund, Boosting: Foundations and Algorithms (2012)
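A minimal sketch of the sequential construction, assuming scikit-learn's AdaBoostRegressor (which implements AdaBoost.R2, as noted later); staged_predict exposes the ensemble's prediction after each boosting round (toy data is ours):

```python
# Minimal sketch: watch the training error of the boosted model shrink
# as weak learners are added one round at a time.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

model = AdaBoostRegressor(n_estimators=50).fit(X, y)
for t, y_stage in enumerate(model.staged_predict(X), start=1):
    if t % 10 == 0:
        print(t, float(np.mean((y - y_stage) ** 2)))   # training MSE per stage
```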

Boosting Methods – AdaBoost

• Suppose you are given (x1, y1), (x2, y2), ..., (xn, yn), and the task is to fit a model H(x). A friend wants to help and gives you a model H. You check the model and find it is good but not perfect. There are some errors: H(x1) = 0.8, H(x2) = 1.4, ..., while y1 = 0.9, y2 = 1.3, ... How can you improve this model?

• Rule

• Use the friend's model H without any modification

• Add an additional model h to improve the prediction, so the new prediction will be H + h

$H_T(x) = \sum_{t=1}^{T} \rho_t\, h_t(x) = H_{T-1}(x) + \rho_T\, h_T(x)$

Boosting Methods – AdaBoost

• Wish to improve the model such that:

$H(x_1) + h(x_1) = y_1$
$H(x_2) + h(x_2) = y_2$
$\vdots$
$H(x_n) + h(x_n) = y_n$

• Equivalently, the additional model $h$ should satisfy:

$h(x_1) = y_1 - H(x_1)$
$h(x_2) = y_2 - H(x_2)$
$\vdots$
$h(x_n) = y_n - H(x_n)$

• Fit a weak learner $h$ to the residual data $(x_1,\, y_1 - H(x_1)),\, (x_2,\, y_2 - H(x_2)),\, \dots,\, (x_n,\, y_n - H(x_n))$
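A minimal sketch of the H <- H + h idea: repeatedly fit a small tree to the current residuals and add it to the model. The learning rate, tree depth, and toy data are ours; this is not AdaBoost.R2 or any library's implementation:

```python
# Minimal sketch: boosting by residual fitting.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(300)

prediction = np.zeros_like(y)          # H starts as the zero model
learning_rate = 0.1
for _ in range(100):
    residual = y - prediction                           # y_i - H(x_i)
    h = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * h.predict(X)          # H <- H + rho * h

print("training MSE:", float(np.mean((y - prediction) ** 2)))
```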

Boosting Methods – Gradient Boosting

• AdaBoost: updates with the residual of the loss function, which converges to 0:

$y - H = 0$

• In scikit-learn, the AdaBoost.R2* algorithm is implemented

• Gradient Boosting (L. Breiman, 1997): updates with the negative gradients of the loss function, which converge to 0:

$-\frac{\partial L}{\partial H} = 0$

*Drucker,H., Improving Regressors using Boosting Techniques (1997)

Boosting Methods – Gradient Boosting

• Loss function: $L(y, H)$

• First-order optimality:

$\frac{\partial L(y_i, H_i)}{\partial H_i} = 0, \quad \forall i = 1, \dots, n$

• If the loss function is the squared loss,

$L(y, H) = \frac{1}{2}\,(y - H)^2$

• then the negative gradients can be interpreted as residuals:

$-\frac{\partial L(y_i, H_i)}{\partial H_i} = y_i - H_i, \quad \forall i = 1, \dots, n$
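A minimal numeric check of this identity with a central finite difference (the values 1.3 and 1.4 are arbitrary):

```python
# Check: for L = 0.5 * (y - H)**2, the negative gradient w.r.t. H is y - H.
def loss(H, y=1.3):
    return 0.5 * (y - H) ** 2

H, y, eps = 1.4, 1.3, 1e-6
grad = (loss(H + eps) - loss(H - eps)) / (2 * eps)   # numerical dL/dH
print(-grad)    # ~ -0.1
print(y - H)    # -0.1, matches the negative gradient
```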

Boosting Methods – Gradient Boosting

• The squared loss is sensitive to outliers, which can lead to overfitting

• Other loss functions

• Absolute loss:

$L(y, H) = \left| y - H \right|$

• Huber loss:

$L(y, H) = \begin{cases} \dfrac{1}{2}\,(y - H)^2 & \text{if } \left| y - H \right| \le \delta \\ \delta \left( \left| y - H \right| - \delta / 2 \right) & \text{otherwise} \end{cases}$
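A minimal sketch of the negative gradients these losses induce, i.e. the targets a gradient-boosting step would fit a tree to; the helper names and delta = 1.0 are ours:

```python
# Minimal sketch: negative gradients -dL/dH of the three losses.
import numpy as np

def neg_grad_squared(y, H):
    return y - H                          # the residual itself

def neg_grad_absolute(y, H):
    return np.sign(y - H)                 # only the sign of the residual

def neg_grad_huber(y, H, delta=1.0):
    r = y - H
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

y = np.array([0.9, 1.3, 10.0])            # last point is an outlier
H = np.array([0.8, 1.4, 1.0])
print(neg_grad_squared(y, H))             # outlier dominates the update
print(neg_grad_huber(y, H))               # outlier influence capped at delta
```

In scikit-learn, GradientBoostingRegressor exposes these choices through its loss parameter (for example, loss='huber').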

XGBoost

• Among the 29 Kaggle challenge winning solutions during 2015:

• 17 used XGBoost (gradient boosting trees)

(8 solely used XGBoost, 9 combined XGBoost with deep neural nets)

• 11 used deep neural nets

(2 solely used them, 9 combined them with XGBoost)

• In KDDCup 2015, ensemble trees were used by every winning team in the top 10

*Tianqi Chen, XGBoost: A Scalable Tree Boosting System (2016)

Ensemble Method Pros and Cons

• Advantages

• Avoid overfitting

• Fast and scalable: can handle large-scale data

• Almost works 'out of the box'

• Disadvantages

• Overfitting

• Ad hoc heuristics

• Does not provide a probabilistic framework (confidence intervals, posterior distributions)

Empirical Test Suites

• Diabetes1)

• Concrete Slump Test2)

• Machine CPU1)

• Body Fat3)

• Yacht Hydrodynamics2)

• Chemical4)

• Boston Housing5)

• Istanbul stock exchange2)

• Concrete compressive strength2)

• Engine 4)

• Airfoil Self-Noise2)

• Wine Quality (Red) 2)

• Pumadyn (32) 1)

• Pumadyn (8) 1)

• Bank (8) 1)

• Bank (32) 1)

• Wine Quality (White) 2)

• Computer Activity6)

• Computer Activity_small6)

• Kinematics of Robot Arm1)

• Combined Cycle Power Plant2)

• California Housing7)

• Friedman8)

1) http://www.dcc.fc.up.pt/~ltorgo/
2) https://archive.ics.uci.edu/ml/datasets/
3) http://www.people.vcu.edu/~rjohnson/bios546/programs/
4) MATLAB neural fitting toolbox
5) https://rpubs.com/yroy/Boston
6) http://www.cs.toronto.edu/~delve/
7) http://www.cs.cmu.edu/afs/cs/academic/class/15381-s07/www/hw6/cal_housing.arff
8) http://tunedit.org/repo/UCI/numeric/fried.arff

Description of Comparison Methods

• Corrected t-test*

$t_{corr} = \dfrac{\mu_d}{\sqrt{\left(\dfrac{1}{N_s} + \dfrac{n_T}{n_L}\right)\sigma_d^2}}, \qquad \mu_d = \dfrac{1}{N_s}\sum_{i=1}^{N_s} d_i, \qquad \sigma_d^2 = \dfrac{1}{N_s - 1}\sum_{i=1}^{N_s}\left(d_i - \mu_d\right)^2$

where $d_i = e_i^A - e_i^B$ denotes the difference between the errors of the two compared algorithms on the $i$-th repetition

• The data set is divided into a learning sample of size $n_L$ and a test sample of size $n_T$

• The statistic is assumed to follow a Student distribution with $N_s - 1$ degrees of freedom

• A 95% confidence interval (5% type I error) is used to verify the hypothesis

• Each experiment is repeated 30 times independently ($N_s = 30$)

• Parameters of the ensemble trees are kept at their defaults

*Nadeau, C., Bengio, Y., Inference for the Generalization Error (2003)
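A minimal sketch of this test under the definitions above; the function and variable names are ours, and the differences d below are random numbers for illustration only:

```python
# Minimal sketch of the corrected resampled t-test (Nadeau & Bengio, 2003).
import numpy as np
from scipy import stats

def corrected_t_test(d, n_train, n_test, alpha=0.05):
    d = np.asarray(d, dtype=float)       # per-repetition error differences e_A - e_B
    n_s = len(d)                         # number of repetitions (30 in this study)
    mu = d.mean()
    var = d.var(ddof=1)                  # unbiased variance of the differences
    t = mu / np.sqrt((1.0 / n_s + n_test / n_train) * var)
    p = 2 * stats.t.sf(abs(t), df=n_s - 1)   # two-sided p-value, N_s - 1 d.o.f.
    return t, p, p < alpha               # reject the null at the 95% level?

rng = np.random.RandomState(0)
d = 0.02 + 0.05 * rng.randn(30)          # illustrative differences only
print(corrected_t_test(d, n_train=300, n_test=100))
```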

Empirical Test Results

• Accuracy: $R^2 = 1 - \dfrac{\sum_{i} \left(y_i - \tilde{y}_i\right)^2}{\sum_{i} \left(y_i - \bar{y}\right)^2}$, where $\tilde{y}_i$ is the predicted value and $\bar{y}$ the mean target value

• GradientBoosting > XGBoost > ExtraTrees > Bagging > RandomForest > AdaBoost

Win/Draw/Loss records comparing the algorithm in the column versus the algorithm in the row

                  | Bagging | Random Forest | Extra Trees | AdaBoost | Gradient Boosting | XGBoost
Bagging           | -       | 0/27/0        | 10/16/1     | 0/8/19   | 11/9/7            | 7/13/7
Random Forest     | 0/27/0  | -             | 7/19/1      | 0/8/19   | 11/9/7            | 8/12/7
Extra Trees       | 1/16/10 | 1/19/7        | -           | 0/7/20   | 8/12/7            | 7/13/7
AdaBoost          | 19/8/0  | 7/9/11        | 20/7/0      | -        | 20/6/1            | 19/8/0
Gradient Boosting | 7/9/11  | 7/12/8        | 7/12/0      | 1/6/20   | -                 | 1/24/2
XGBoost           | 7/13/7  | 7/12/8        | 7/13/7      | 0/8/19   | 2/24/1            | -

Empirical Test Results

• Computational Cost

• ExtraTrees > XGBoost > RandomForest > Bagging > GradientBoosting > AdaBoost

Win/Draw/Loss records comparing the algorithm in the column versus the algorithm in the row

                  | Bagging | Random Forest | Extra Trees | AdaBoost | Gradient Boosting | XGBoost
Bagging           | -       | 11/13/3       | 20/7/0      | 0/4/23   | 7/3/17            | 11/14/2
Random Forest     | 3/13/11 | -             | 24/3/0      | 0/2/25   | 3/7/17            | 10/15/2
Extra Trees       | 0/7/20  | 0/3/24        | -           | 0/0/27   | 0/0/27            | 2/23/2
AdaBoost          | 23/4/0  | 25/2/0        | 27/0/0      | -        | 24/3/0            | 21/4/2
Gradient Boosting | 17/3/7  | 17/7/3        | 27/0/0      | 0/3/24   | -                 | 18/7/2
XGBoost           | 2/14/11 | 2/15/10       | 2/23/2      | 2/4/21   | 2/7/18            | -