Strata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles

How to Create Predictive Models in R using Ensembles

Strata - Hadoop World, New York

October 28, 2013

Giovanni Seni, Ph.D. Intuit

@IntuitInc [email protected]

Santa Clara University

[email protected]

© 2013 G.Seni 2013 Strata Conference + Hadoop World 2

Reference


Overview

•  Motivation, In a Nutshell & Timeline

•  Predictive Learning & Decision Trees

•  Ensemble Methods - Diversity & Importance Sampling

–  Bagging –  Random Forest –  Ada Boost –  Gradient Boosting –  Rule Ensembles

•  Summary


Motivation

Volume 9 Issue 2


Motivation (2)

“1′st Place Algorithm Description: … 4. Classification: Ensemble classification methods are used to combine multiple classifiers. Two separate Random Forest ensembles are created based on the shadow index (one for the shadow-covered area and one for the shadow-free area). The random forest “Out of Bag” error is used to automatically evaluate features according to their impact, resulting in 45 features selected for the shadow-free and 55 for the shadow-covered part.”


Motivation (3)

•  “What are the best of the best techniques at winning Kaggle competitions?

–  Ensembles of Decisions Trees –  Deep Learning

account for 90% of top 3 winners!”

Jeremy Howard, Chief Scientist of Kaggle KDD 2013

⇒ Key common characteristics: –  Resistance to overfitting –  Universal approximations


Ensemble Methods in a Nutshell

•  “Algorithmic” statistical procedure

•  Based on combining the fitted values from a number of fitting attempts

•  Loosely related to:

–  Iterative procedures –  Bootstrap procedures

•  Original idea: a “weak” procedure can be strengthened if it can operate “by committee”

–  e.g., combining low-bias/high-variance procedures

•  Accompanied by interpretation methodology


Timeline

•  CART (Breiman, Friedman, Stone, Olshen, 1983)

•  Bagging (Breiman, 1996)

–  Random Forest (Ho, 1995; Breiman 2001)

•  AdaBoost (Freund, Schapire, 1997)

•  Boosting – a statistical view (Friedman et. al., 2000)

–  Gradient Boosting (Friedman, 2001) –  Stochastic Gradient Boosting (Friedman, 1999)

•  Importance Sampling Learning Ensembles (ISLE) (Friedman, Popescu, 2003)


Timeline (2)

•  Regularization – variance control techniques:

–  Lasso (Tibshirani, 1996) –  LARS (Efron, 2004) –  Elastic Net (Zou, Hastie, 2005) –  GLMs via Coordinate Descent (Friedman, Hastie, Tibshirani, 2008)

•  Rule Ensembles (Friedman, Popescu, 2008)


Overview


Ø Predictive Learning & Decision Trees

•  Ensemble Methods

•  Summary


Predictive Learning Procedure Summary

•  Given "training" data

–  is a random sample from some unknown (joint) distribution

•  Build a functional model

–  Offers adequate and interpretable description of how the inputs affect the outputs

–  Parsimony is an important criterion: simpler models are preferred for the sake of scientific insight into the relationship

•  Need to specify:

Nii

Niniii yxxxyD 1121 },{},,,,{ x==

D

)(),,,( 21 xFxxxFy n

==

y-x

>< strategy search criterion, score model,


Predictive Learning Procedure Summary (2)

•  Model: underlying functional form sought from data

•  Score criterion: judges (lack of) quality of fitted model

–  Loss function : penalizes individual errors in prediction

–  Risk : the expected loss over all predictions

•  Search Strategy: minimization procedure of score criterion

∈= );()( axx FF

ℱ family of functions indexed by a

)(minarg* aaa

R=

));(,()( , axa x FyLER y

=

),( FyL


Predictive Learning Procedure Summary (3)

•  “Surrogate” Score criterion:

–  Training data:

–  unknown unknown

⇒ Use approximation: Empirical Risk

• 

•  If not ,

)(minarg aaa

R

=∑=

=N

iiFyL

NR

1

));(,(1)( axa

),(~},{ 1 ypy Nii xx

),( yp x * a⇒

⇒

nN >> )()( *aa RR >>


Predictive Learning Example

•  A simple data set

•  What is the class of new point ?

•  Many approaches… no method is universally better; try several / use committee

Attribute-1

( x1 )

Attribute-2

( x2 )

Class

(y)

1.0 2.0 blue

2.0 1.0 green

… … …

4.5 3.5 ? 1x

2x


•  Ordinary Linear Regression (OLR)

–  Model:

1x

2x

F(x) = a0 + ajx j

j=1

n

∑

⇒ Not flexible enough

⎩⎨⎧

≥else

0)(xF

Predictive Learning Example (2)

;


•  Model:

where if , otherwise

1x

2x

2 5

3

51 ≥x

32 ≥x

21 ≥x

Sub-regions of input variable space

{ } ==

MmmR 1

Decision Trees Overview

R1 R2

R3 R4

∑=

==M

mRm mIcTy

1ˆ )(ˆ)(ˆ xx

IR (x) =1 x ∈ R 0


Decision Trees Overview (2)

•  Score criterion:

–  Classification – "0-1 loss" ⇒ misclassification error (or surrogate)

–  Regression – least squares – i.e.,

•  Search: Find

–  i.e., find best regions and constants

cm, Rm{ }1

M= argmin

TM = cm ,Rm{ }1M

yi −TM (xi )( )2

i=1

N

∑

cm, Rm{ }1

M= argmin

TM = cm ,Rm{ }1M

I yi ≠TM (xi )( )i=1

N

∑

( )2ˆ)ˆ,( yyyyL −= R(TM )

mR mc

T = argminT

R(T )


Decision Trees Overview (3)

•  Join optimization with respect to and simultaneously is very difficult

⇒ use a greedy iterative procedure

mR mc

• j1, s1 R0

R0

R1

• j1, s1 R0 j2, s2 •

R1

R2

R3

R4

• j1, s1 R0 j2, s2 •

R1 •

R3

• R4

• j3, s3 R2

R6

R5

• j1, s1 R0 j2, s2 •

R1 •

R3

• R4

• j3, s3 R2 •

R5

• j4, s4 R6 •

R7

• R8


Decision Trees What is the “right” size of a model?

•  Dilemma

–  If model (# of splits) is too small, then approximation is too crude (bias) ⇒ increased errors

–  If model is too large, then it fits the training data too closely (overfitting, increased variance) ⇒ increased errors

⇒ο ο

ο ο ο

ο

ο ο

ο ο

ο ο ο

ο

ο

ο

x

y

ο ο

ο ο ο

ο

ο ο

ο ο

ο ο ο

ο

ο

ο 1c

2c

ο ο

ο ο ο

ο

ο ο

ο ο

ο ο ο

ο

ο

ο 1c

2c3c

x x

y y

vs


Decision Trees What is the “right” size of a model? (2)

–  Right sized tree, , when test error is at a minimum –  Error on the training is not a useful estimator!

•  If test set is not available, need alternative method

Model Complexity

Pred

ictio

n Er

ror

Low Bias High Variance

High Bias Low Variance

Test Sample

Training Sample

High Low *M

*M


•  Two strategies

–  Prepruning - stop growing a branch when information becomes unreliable

•  #(Rm) – i.e., number of data points, too small

⇒ same bound everywhere in the tree

•  Next split not worthwhile

⇒ Not sufficient condition

–  Postpruning - take a fully-grown tree and discard unreliable parts

(i.e., not supported by test data)

•  C4.5: pessimistic pruning

•  CART: cost-complexity pruning (more statistically grounded)

Decision Trees Pruning to obtain “right” size


Decision Trees Hands-on Exercise

•  Start Rstudio

•  Navigate to directory:

example.1.LinearBoundary

•  Set working directory: use setwd() or with GUI

•  Load and run “fitModel_CART.R”

•  If curious, also see “gen2DdataLinear.R”

•  After boosting discussion, load and run “fitModel_GBM.R

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

x1

x2


•  Ability to deal with irrelevant inputs

–  i.e., automatic variable subset selection

–  Measure anything you can measure

–  Score provided for selected variables ("importance")

•  No data preprocessing needed

-  Naturally handle all types of variables

•  numeric, binary, categorical

-  Invariant under monotone transformations:

•  Variable scales are irrelevant

•  Immune to bad distributions (e.g., outliers)

( )jjj xgx =

−jx

Decision Trees Key Features


•  Computational scalability

–  Relatively fast: ( )NnNO log

•  Missing value tolerant

-  Moderate loss of accuracy due to missing values

-  Handling via "surrogate" splits

•  "Off-the-shelf" procedure

-  Few tunable parameters

•  Interpretable model representation

-  Binary tree graphic

Decision Trees Key Features (2)


•  Discontinuous piecewise constant model

–  In order to have many splits you need to have a lot of data

•  In high-dimensions, you often run out of data after a few splits

–  Also note error is bigger near region boundaries

x

)(xF

Decision Trees Limitations


•  Not good for low interaction

–  e.g., is worst function for trees

–  In order for to enter model, must split on it

•  Path from root to node is a product of indicators

( )x*F( )

( )∑

∑

=

=

=

+=

n

jjj

n

jjjo

xf

xaaF

1

*

1

* x

(no interaction, additive)

lx

•  Not good for that has dependence on many variables

-  Each split reduces training data for subsequent splits (data fragmentation)

( )x*F

Decision Trees Limitations (2)


•  High variance caused by greedy search strategy (local optima)

–  Errors in upper splits are propagated down to affect all splits below it

⇒ Small changes in data (sampling fluctuations) can cause big changes in tree

- Very deep trees might be questionable

- Pruning is important

•  What to do next? –  Live with problems –  Use other methods (when possible) –  Fix-up trees: use ensembles

Decision Trees Limitations (3)


Overview

•  In a Nutshell & Timeline


Ø Ensemble Methods

–  In a Nutshell, Diversity & Importance Sampling –  Generic Ensemble Generation –  Bagging, RF, AdaBoost, Boosting, Rule Ensembles

•  Summary


Ensemble Methods In a Nutshell

•  Model:

–  : “basis” functions (or “base learners”)

–  i.e., linear model in a (very) high dimensional space of derived variables

•  Learner characterization:

–  : a specific set of joint parameter values – e.g., split definitions at internal nodes and predictions at terminal nodes

–  : function class – i.e., set of all base learners of specified family

∑ =+=

M

m mmTccF10 )()( xx

{ }MmT 1)(x

);()( mm TT pxx =

mp

{ } PT ∈ppx );(


Ensemble Methods In a Nutshell (2)

•  Learning: two-step process; approximate solution to

–  Step 1: Choose points

•  i.e., select

–  Step 2: Determine weights

•  e.g., via regularized LR

{cm, pm}oM = argmin

{cm , pm}oM

L yi, c0 + cmT (x;pm )m=1

M∑( )

i=1

N

∑

{ }Mm 1p

{ } { } PM

m TT ∈⊂ ppxx );()( 1

{ }Mmc 0


Ensemble Methods Importance Sampling (Friedman, 2003)

•  How to judiciously choose the “basis” functions (i.e., )?

•  Goal: find “good” so that

•  Connection with numerical integration:

– 

{ }Mm 1p

)( )}{,}{;( *11 xpx FcF M

mM

m ≅{ }Mm 1p

)(w )( M

1m m mII ppp ∑∫ =Ρ≈∂

vs. Accuracy improves when we choose more points from this region…


Importance Sampling Numerical Integration via Monte Carlo Methods

•  = sampling pdf of -- i.e,

–  Simple approach: -- i.e., uniform

–  In our problem: inversely related to ’s “risk”

•  i.e., has high error ⇒ lack of relevance of ⇒ low

•  “Quasi” Monte Carlo:

–  with/out knowledge of the other points that will be used

•  i.e., single point vs. group importance

–  Sequential approximation: ’s relevance judged in the context of the (fixed) previously selected points

P∈p)(pr

iidr m )(p

mp);( mT px

mp

{ }Mm r 1)(~ pp

p

)( mr p


Ensemble Methods Importance Sampling – Characterization of

•  Let

)(prBroad

{ }MmT 1);( px•  Ensemble of “strong” base learners - i.e., all with

)()( ∗≈ pp RiskRisk m

);( mT px•  ’s yield similar highly correlated predictions ⇒ unexceptional performance

)(prNarrow

•  Diverse ensemble - i.e., predictions are not highly correlated with each other

•  However, many “weak” base learners - i.e., ⇒ poor performance )()( ∗>> pp RiskRisk m

( )pp p Riskminarg=∗


Ensemble Methods Approximate Process of Drawing from

•  Heuristic sampling strategy: sampling around by iteratively applying small perturbations to existing problem structure

–  Generating ensemble members

–  is a (random) modification of any of •  Data distribution - e.g., by re-weighting the observations

•  Loss function - e.g., by modifying its argument

•  Search algorithm (used to find minp)

–  Width of is controlled by degree of perturbation

∗p

);()( mm TT pxx =

( ){ }}

);( ,minarg { to1For

pxp p TyLPERTURBMm

xymm Ε=

=

{}⋅ PERTURB

)(pr

© 2013 G.Seni 2013 Strata Conference + Hadoop World

Generic Ensemble Generation Step 1: Choose Base Learners

35

•  Forward Stagewise Fitting Procedure:

–  Algorithm control:

•  : random sub-sample of size ⇒ impacts ensemble "diversity"

•  : “memory” function ( )

υη , ,L

N≤η)(ηmS

∑−

=− ⋅=1

11 )()( m

k km TF xx υ 10 ≤≤υ

Modification of data distribution Modification of loss function

(“sequential” approximation)

𝐹0(x) = 0 For 𝑚 = 1 to 𝑀 { // Fit a single base learner p𝑚 = argmin

p. 𝐿0𝑦𝑖 , 𝐹𝑚−1 + 𝑇(x𝑖 ;p)8

𝑖∈𝑆𝑚 (𝜂)

// Update additive expansion 𝑇𝑚(𝑥) = 𝑇0x; p𝒎8 𝐹𝑚(x) = 𝐹𝑚−1(x) + 𝜐 ∙ 𝑇𝑚(x) } write {𝑇𝑚(x)}1𝑀

p! !!

!


•  Given , coefficients can be obtained by a

regularized linear regression

–  Regularization here helps reduce bias (in addition to variance) of the model

–  New iterative fast algorithms for various loss/penalty combinations

•  “GLMs via Coordinate Descent” (2008)

{ } { }MmmMmm TT 11 );()( == = pxx

+⎟⎠

⎞⎜⎝

⎛+= ∑ ∑

= =

)( ,minarg}{N

1i 10

}{

M

mimmi

cm TccyLc

m

x)(cP⋅λ

Generic Ensemble Generation Step 2: Choose Coefficients c! !

!!


•  Bagging = Bootstrap Aggregation

• 

• 

• 

• 

• 

–  i.e., perturbation of the data distribution only –  Potential improvements? –  R package: ipred

L(y, y)

2/N=η

)(xmT

0=υ

{ }M1/1 ,0 Mcc mo ==

( )

{ }Mm

mmm

mm

Siiimim

T

TFFTT

TFyLMm

F

m

1

1

)(1

0

)( write

})( )()(

);()(

);()( ,minarg { to1For

0)(

x

xxxpxx

pxxp

x

p

⋅+=

=

+=

=

=

−

∈−∑

υ

ηη

υ

( ) L⇒ no memory

⇒ are large un-pruned trees

i.e., not fit to the data (avg)

Bagging (Breiman, 1996)

: as available for single tree


Bagging Hands-on Exercise

-2 -1 0 1 2

-1.0

-0.5

0.0

0.5

1.0

x1

x2


example.2.EllipticalBoundary


•  Load and run

–  fitModel_Bagging_by_hand.R –  fitModel_CART.R (optional)

•  If curious, also see gen2DdataNonLinear.R

•  After class, load and run fitModel_Bagging.R


•  Under , averaging reduces variance and leaves bias unchanged

•  Consider “idealized” bagging (aggregate) estimator:

–  fit to bootstrap data set

–  is sampled from actual population distribution (not training data)

–  We can write:

⇒  true population aggregation never increases mean squared error!

⇒  Bagging will often decrease MSE…

2)()ˆ,( yyyyL −=

)( )( xx Zff

Ε=

Zf

{ }Nii xyZ 1,=

Z

[ ] [ ][ ] [ ][ ]2

2 2

2 2

)(

)()()(

)()()()(

x

xxx

xxxx

fY

fffY

fffYfY

Z

ZZ

−Ε≥

−Ε+−Ε=

−+−Ε=−Ε

Bagging Why it helps?


•  Random Forest = Bagging + algorithm randomizing

–  Subset splitting As each tree is constructed…

•  Draw a random sample of predictors before each node is split

•  Find best split as usual but selecting only from subset of predictors

⇒ Increased diversity among - i.e., wider

•  Width (inversely) controlled by

–  Speed improvement over Bagging –  R package: randomForest

⎣ ⎦1)(log2 += nns

{ }MmT 1)(x )(pr

sn

Random Forest (Ho, 1995; Breiman, 2001)


•  ISLE improvements:

–  Different data sampling strategy (not fixed) –  Fit coefficients to data

•  xxx_6_5%_P : 6 terminal nodes trees 5% samples without replacement Post-processing – i.e., using estimated “optimal” quadrature coefficients

⇒ Significantly faster to build!

Bag RF Bag_6_5%_P RF_6_5%_P

Co

mp

arat

ive

RM

S E

rro

r

Bagging vs. Random Forest vs. ISLE 100 Target Functions Comparison (Popescu, 2005)


•  Equivalence to Forward Stagewise Fitting Procedure –  We need to show is equivalent to line a. above

–  is equivalent to line c.

•  R package adabag

[ ]

( )∑

∑∑

=

+

=

=

≠⋅⋅=

−=

≠=

=

=

M

m mm

imimmi

mi

mmm

N

imi

imiN

imi

m

mim

i

Tsign

TyIwwerrerr

w

TyIwerr

wTMm

Nw

1

)()1(

1)(

1)(

)(

)0(

)(Output

})((expSet d.

))1(log( Compue c.

))((

Compute b. with data training to)( classifier aFit a.

{ to1For 1 :n weightsobservatio

x

x

x

x

α

α

α

( )p

p ⋅= minargm

( )

Mmm

mmmm

mm

Siiimi

cmm

Tc

TcFFTT

TcFyLcMm

F

m

1

1

)(1

,

0

)}(,{ write}

)( )()( );()(

);()( ,minarg),( { to1For

0)(

x

xxxpxx

pxxp

x

p

⋅⋅+=

=

⋅+=

=

=

−

∈−∑

υ

η

( )c

mc ⋅= minarg

Boo

k

AdaBoost (Freund & Schapire, 1997)


AdaBoost Hands-on Exercise


example.2.EllipticalBoundary


•  Load and run

–  fitModel_Adaboost_by_hand.R

•  After class, load and run fitModel_Adaboost.R and fitModel_RandomForest.R

-2 -1 0 1 2

-1.0

-0.5

0.0

0.5

1.0

x1

x2


•  Boosting with any differentiable loss criterion

•  General

• 

• 

• 

• 

• 

–  Potential improvements? –  R package: gbm

( )

Mmm

mmmm

mm

Siiimi

cmm

Tc

TcFFTT

TcFyLcMm

cF

m

1

1

)(1

,

00

)}(),{( write}

)( )()( );()(

);()( ,minarg),( { to1For

)(

x

xxxpxx

pxxp

x

p

⋅

⋅⋅+=

=

⋅+=

=

=

−

∈−∑

υ

υ

η

⇒ “shrunk” sequential partial regression coefficients { }M1mc

η

υ

mc ( ) L

)ˆ,( yyL

⇒= 1.0υ Sequential sampling

2N=η

⇒ )(xmT Any “weak” learner

∑ ==

N

i ico cyLc1

),(minarg

0c

Stochastic Gradient Boosting (Friedman, 2001)


•  More robust than

•  Resistant to outliers in …trees already providing resistance to outliers in

•  Note:

–  Trees are fitted to pseudo-response

⇒ Can’t interpret interpret individual trees

–  Original tree constants are overwritten

–  “shrunk” version of tree gets added to ensemble

( )2Fy −

y

x

{ }

( )

{ } { }( )

{ }

( )

}

ˆ )()(

expansion // Update

1 )(ˆ

tscoefficien find :Step2//

,~ treeregression-LS node terminal

)( ~

)( find :Step1//

{ to1For )(

11

11

11

1

10

∑=

−

−∈

−

∈⋅+=

=−=

−=

−=

=

=

J

jjmijmmm

NimiRjm

Nii

Jjm

imii

m

Ni

RIFF

JjFymedian

yJR

Fysigny

T

MmymedianF

jmi

…

xxx

x

x

x

x

x

x

γυ

γ

Stochastic Gradient Boosting LAD Regression – L !,! = ! − ! !


•  Sequential ISLE tend to perform better than parallel ones

–  Consistent with results observed in classical Monte Carlo integration

Bag RF

Bag_6_5%_P RF_6_5%_P

Co

mp

arat

ive

RM

S E

rro

r •  xxx_6_5%_P : 6 terminal nodes trees 5% samples without replacement Post-processing – i.e., using estimated “optimal” quadrature coefficients

•  Seq_υ_η%_P : “Sequential” ensemble 6 terminal nodes trees υ : “memory” factor η % samples without replacement Post-processing

Boost Seq_0.01_20%_P

Seq_0.1_50%_P

“Sequential”

“Parallel”

Parallel vs. Sequential Ensembles 100 Target Functions Comparison (Popescu, 2005)


•  Trees as collection of conjunctive rules:

–  These simple rules, , can be used as base learners

–  Main motivation is interpretability

∑=

∈=J

jjmjmm RIcT

1)ˆ(ˆ)( xx

1x2x

yR2

R1

R3

R5

R4

152215

27

)27()22()( 211 >⋅>= xIxIr x⇒ 1R

⇒ 2R )270()22()( 212 ≤≤⋅>= xIxIr x

⇒ 3R )0()2215()( 213 xIxIr ≤⋅≤<=x

⇒ 4R )15()150()( 214 >⋅≤≤= xIxIr x

⇒ 5R )150()150()( 215 ≤≤⋅≤≤= xIxIr x

{ }1,0)( ∈xmr

Rule Ensembles (Friedman & Popescu, 2005)


•  Rule-based model:

–  Still a piecewise constant model

•  Fitting

–  Step 1: derive rules from tree ensemble (shortcut)

•  Tree size controls rule “complexity” (interaction order)

–  Step 2: fit coefficients using linear regularized procedure:

∑+=m

mmraaF )()( 0 xx

({ak},{bj}) = argmin{ak },{bj }

L yi, F x; ak{ }0

K , bj{ }1

P( )( )i=1

N∑

⇒ complement the non-linear rules

with purely linear terms:

Rule Ensembles ISLE Procedure

+!!! ⋅ !(a)+ !(b) !


Boosting & Rule Ensembles Hands-on Exercise

0 200 400 600 800 1000

500

1000

1500

2000

2500

Iteration

Abs

olut

e lo

ss


example.3.Diamonds


•  Load and run

–  viewDiamondData.R –  fitModel_GBM.R –  fitModel_RE.R

•  After class, go to:

example.1.LinearBoundary

Run fitModel_GBM.R


Overview



•  Ensemble Methods

Ø Summary


Summary

•  Ensemble methods have been found to perform extremely well in a variety of problem domains

•  Shown to have desirable statistical properties

•  Latest ensemble research brings together important foundational strands of statistics

•  Emphasis on accuracy but significant progress has been made on interpretability

Go build Ensembles and keep in touch!

Education

Strata 2013: Tutorial-- How to Create Predictive Models in R using Ensembles