
I Don’t Want to Be a Dummy! Encoding Predictors for Trees

Max Kuhn

NYRC


Trees

Tree–based models are nested sets of if/else statements that make predictions in the terminal nodes:

> library(rpart)

> library(AppliedPredictiveModeling)

> data(schedulingData)

> rpart(Class ~ ., data = schedulingData, control = rpart.control(maxdepth = 2))

n= 4331

node), split, n, loss, yval, (yprob)

* denotes terminal node

1) root 4331 2100 VF (0.511 0.311 0.119 0.060)

2) Protocol=C,D,E,F,G,I,J,K,L,N 2884 860 VF (0.703 0.206 0.068 0.023) *

3) Protocol=A,H,M,O 1447 690 F (0.126 0.521 0.219 0.133)

6) Iterations< 1.5e+02 1363 610 F (0.134 0.553 0.232 0.081) *

7) Iterations>=1.5e+02 84 1 L (0.000 0.000 0.012 0.988) *


Rules

Similarly, rule–based models are non–nested sets of if statements:

> library(C50)

> summary(C5.0(Class ~ ., data = schedulingData, rules = TRUE))

<snip>

Rule 109: (17/7, lift 9.7)

Protocol in {F, J, N}

Compounds > 818

InputFields > 152

NumPending <= 0

Hour > 0.6333333

Day = Tue

-> class L [0.579]

Default class: VF


Bayes!

Bayesian regression and classification models don’t really specify anything about the predictors beyond Pr[X] and Pr[X|Y].

If there were only one categorical predictor, we could have Pr[X|Y] be a table of raw probabilities:

> xtab <- table(schedulingData$Day, schedulingData$Class)

> apply(xtab, 2, function(x) x/sum(x))

VF F M L

Mon 0.1678 0.1492 0.15 0.162

Tue 0.1913 0.2019 0.27 0.255

Wed 0.2090 0.2101 0.19 0.228

Thu 0.1678 0.1589 0.18 0.154

Fri 0.2171 0.2183 0.20 0.178

Sat 0.0068 0.0082 0.00 0.023

Sun 0.0403 0.0535 0.00 0.000


Dummy Variables

For the other models, we typically encode a predictor with C categories into C − 1 binary dummy variables:

> design_mat <- model.matrix(Class ~ Day, data = head(schedulingData))

> design_mat[, colnames(design_mat) != "(Intercept)"]

DayTue DayWed DayThu DayFri DaySat DaySun

1 1 0 0 0 0 0

2 1 0 0 0 0 0

3 0 0 1 0 0 0

4 0 0 0 1 0 0

5 0 0 0 1 0 0

6 0 1 0 0 0 0

In this case, one predictor generates six columns in the design matrix


Encoding Choices

We make the decision on how to encode the data prior to creating the model.

That means we choose whether to present the model with the grouped categories or ungrouped binary dummy variables.

As a result, we could get different representations of the model (see the next two slides).

Does it matter? Let’s do some experiments!
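As a rough sketch of the two choices, reusing rpart and the scheduling data loaded on the earlier slides (the object names below are made up for illustration):

> ## grouped categories: the factor columns go to the tree as-is
> fit_factors <- rpart(Class ~ ., data = schedulingData)
> ## ungrouped dummy variables: expand the factors first, then fit on the 0/1 columns
> dummies <- model.matrix(Class ~ ., data = schedulingData)[, -1]
> dummy_df <- data.frame(dummies, Class = schedulingData$Class)
> fit_dummies <- rpart(Class ~ ., data = dummy_df)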


A Tree with Categorical Data

[Figure: a tree fit to the grouped day-of-week factor (wday). The single split sends {Sat, Sun} to node 2 (n = 1530) and {Mon, Tues, Wed, Thurs, Fri} to node 3 (n = 3826); each terminal node shows a boxplot of the response on a 0 to 25,000 scale.]


A Tree with Dummy Variables

[Figure: the same tree fit to dummy variables. The first split is on the Sun indicator (≥ 0.5 vs. < 0.5), giving terminal node 2 (n = 765); node 3 then splits on the Sat indicator, giving terminal node 4 (n = 765) and terminal node 5 (n = 3826). Boxplots of the response use the same 0 to 25,000 scale.]


Data Sets

Classification:

German Credit, 13 categorical predictors out of 20 (ROC AUC ≈ 0.76)

UCI Car Evaluation, 6 of 6 (Acc ≈ 0.96)

APM High Performance Computing, 2 of 7 (κ ≈ 0.7)

Regression:

Sacramento house prices, 3 of 8, but one has 37 unique values and another has 68 (RMSE ≈ 0.13, R² ≈ 0.6)

For each data set, we ran 10 separate simulations where 20% of the data were held out for testing. Repeated cross-validation was used to tune the models when they had tuning parameters.
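A minimal sketch of that setup (the slides don't show the actual simulation code; createDataPartition and trainControl are the caret functions that would typically do this):

> library(caret)
> set.seed(1)
> ## 80% training / 20% testing, stratified by the outcome
> in_train <- createDataPartition(schedulingData$Class, p = 0.8, list = FALSE)
> training <- schedulingData[ in_train, ]
> testing  <- schedulingData[-in_train, ]
> ## repeated 10-fold cross-validation for tuning
> ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)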


Simulations

Each model was fit twice on each data set (with and without dummy variables):

single trees (CART, C5.0)

single rulesets (C5.0, Cubist)

bagged trees

random forests

boosted models (SGB trees, C5.0, Cubist)

A number of performance metrics were computed for each (e.g. RMSE, binomial or multinomial log–loss, etc.) and the test set results were used to compare models.

Confidence intervals were computed using a linear mixed model to account for the resample–to–resample correlation structure.
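One way such a model could look (a hypothetical sketch; the actual code isn't in the slides, and the results data frame with columns metric, encoding, and resample is invented for illustration):

> library(lme4)
> ## one row per fit: the test-set metric, the encoding used, and the simulation ID
> fit <- lmer(metric ~ encoding + (1 | resample), data = results)
> confint(fit, method = "Wald")  ## Wald interval for the encoding effect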


Regression Model Results

[Figure: confidence intervals for the RMSE difference (x-axis from −0.010 to 0.010) for RF, CART, Cubist_boost, GBM, Cubist, and Bagging; the left side is labeled "(DV Better)" and the right side "(Factors Better)".]


Classification Model Results

[Figure: loss ratios (x-axis ticks at 1, 2, and 4) for CART, C50rule_boost, C50rule, C50tree_boost, C50tree, RF, and Bagging across three panels (German Credit, UCI Cars, HPC); a ratio > 1 means factors did better.]


It Depends!

For classification:

The larger differences in the UCI car data suggest that, when the percentage of categorical predictors is large, the encoding can matter a lot.

However, the magnitude of improvement of factors over dummy variables depends on the model.

For 2 or 3 data sets, there was no real difference.

For regression:

It doesn’t seem to matter (except when it does)

Two very similar models (bagging and random forests) showed effects in different directions.


It Depends!

All of this is also dependent on how easy the problem is.

If no models are able to adequately model the data, the choice of factor vs. dummy won't matter.

Also, if the categorical predictors are really important, the difference would most likely be discernible.

For the Sacramento data, ZIP code is very important. For the HPC data, the protocol variable is also very informative.

However, one thing is definitive:


Factors Usually Take Less Time to Train

[Figure: training-time speedup for using factors (x-axis ticks at 1, 2, and 4) for C50rule, C50tree, CART, C50tree_boost, C50rule_boost, RF, Bagging, Cubist_boost, Cubist, and GBM across four panels (German Credit, UCI Cars, HPC, Sacramento).]


R and Dummy Variables

In almost all cases, using a formula with a model function will convert factors to dummy variables.

However, some do not (e.g. rpart, randomForest, gbm, C5.0, NaiveBayes, etc.). This makes sense for these models.

If you are tuning your model with train, the formula method will create dummy variables and the non–formula method does not:

> ## dummy variables presented to underlying model:

> train(Class ~ ., data = schedulingData, ...)

>

> ## any factors are preserved

> train(x = schedulingData[, -ncol(schedulingData)],

+ y = schedulingData$Class,

+ ...)
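For example, a filled-in version of those calls might look like this (the model and resampling choices here are placeholders for illustration, not the ones used in the experiments):

> ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
> ## formula interface: Day and the other factors become dummy variables before reaching rpart
> train(Class ~ ., data = schedulingData, method = "rpart", trControl = ctrl)
> ## x/y interface: the factor columns are passed to rpart unchanged
> train(x = schedulingData[, -ncol(schedulingData)],
+       y = schedulingData$Class,
+       method = "rpart", trControl = ctrl)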
