
Some of the new features in SPM 7



Advances in Boosted Tree Technology: TreeNet Model Compression and Optimal Rule Extraction

Dan Steinberg, Mikhail Golovnya, N. Scott Cardell

May 2012

Salford Systems

http://www.salford-systems.com


Beyond TreeNet

• TreeNet has set a high bar for automatic off-the-shelf model performance
– TreeNet was used to win all four 1st place awards in the Duke/Teradata churn modeling competition of 2002
– Awards in 2010, 2009, 2008, 2007, and 2004 were all based on TreeNet

• TreeNet was first developed (as MART) in 1999 and essentially perfected in 2000
– Many improvements since then, but the fundamentals are largely those of the 2000 technology

• In subsequent work Friedman has introduced major extensions that go beyond the framework of boosted trees


Importance Sampled Learning Ensembles (ISLE)

• Friedman’s work in 2003 is somewhat more complex than what we describe here
– He presented his paper at our first data mining conference in San Francisco in March of 2004

• We focus on the concept of model compression
• A TreeNet model is grown myopically, one added tree at a time
– From the current model, attempt to improve it by predicting residuals
– Each tree represents incremental learning and error correction
– Slow learning, small steps
– During model development we do not know where we are going to end up

• Once the TreeNet model is complete, can we review it and “clean it up”?


Post-Processing With Regularized Regression

• Friedman’s ISLE takes a TreeNet model as its raw material and considers how we can refine it using regression

• Consider: every tree takes our raw data as input and generates outputs at the terminal nodes

• Each tree can be thought of as a new variable constructed out of the original data (sketched below)
– No missing values in tree outputs, even if there were missing values in the raw data
– Outliers among such predictors are expected to be rare, as each terminal node averages its cases and the trees are typically small

• We might create many more generated variables than original raw variables
– The Boston data set has 13 predictors; TN might generate 1,000 trees
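
To make the idea concrete, a minimal sketch follows, using scikit-learn's GradientBoostingRegressor as a stand-in for TreeNet and the bundled diabetes data in place of Boston; the name tree_features is ours.

# Sketch: each boosted tree's output becomes a constructed variable.
# scikit-learn's GradientBoostingRegressor stands in for TreeNet here.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor

X, y = load_diabetes(return_X_y=True)
gbm = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.01,
                                max_leaf_nodes=6, random_state=0).fit(X, y)

# One column per tree: the learning-rate-scaled contribution of tree m.
# estimators_ has shape (n_estimators, 1) for single-output regression.
tree_features = np.column_stack(
    [gbm.learning_rate * t.predict(X) for t in gbm.estimators_[:, 0]])
print(tree_features.shape)  # (442, 1000): 10 raw predictors -> 1000 tree columns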


Regularized Regression

• Modern regression techniques began with Ridge regression, followed by the Lasso, and finally hybrid models

• These methods have advantages over classical regression
– Can handle highly correlated variables (Ridge)
– Can work with data sets with more columns than rows
– Can do variable selection (Lasso, Ridge-Lasso hybrids)
– Much more effective and reliable than old-fashioned stepwise regression

• Regularized regression is still regression, and thus suffers from all the primary limitations of classical regression
– No missing value handling
– Linear additive model (no interactions)
– Sensitive to the functional form of predictors


Regularized Regression Applied to Trees

• Applying regularized regression to trees is not vulnerable to these traditional problems
– Missing values already handled and transformed to non-missing
– Interactions incorporated into the tree structure
– Trees are invariant with respect to typical univariate transformations
– Any order-preserving transform will not affect the tree

• What will a regularized regression on trees accomplish? (A sketch follows below)
– Combine all identical trees into one
– Combine several similar trees into a compromise tree
– Bypass any meandering while TreeNet searched for the optimum
– Reweight the trees (in TN all trees have equal weight)
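
A sketch of the compression step itself, again with scikit-learn standing in for TreeNet/SPM: fit a Lasso on the tree columns and count the survivors. LassoCV and the tree_matrix helper are our choices, not the SPM implementation.

# Sketch: compress the ensemble with a Lasso on the tree columns.
# Trees whose coefficients shrink to zero drop out; the rest are reweighted.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
gbm = GradientBoostingRegressor(n_estimators=500, learning_rate=0.01,
                                max_leaf_nodes=6, random_state=0).fit(X_tr, y_tr)

def tree_matrix(model, Z):
    # One column per tree: that tree's scaled contribution.
    return np.column_stack(
        [model.learning_rate * t.predict(Z) for t in model.estimators_[:, 0]])

lasso = LassoCV(cv=5).fit(tree_matrix(gbm, X_tr), y_tr)
kept = int(np.sum(lasso.coef_ != 0))
print(f"{kept} of {gbm.n_estimators} trees survive "
      f"({100 * (1 - kept / gbm.n_estimators):.0f}% compression)")
print("test R^2:", lasso.score(tree_matrix(gbm, X_te), y_te))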


Regularized Regression of TreeNet

• In this mode of ISLE we develop the best TreeNet model we can

• Post-process the results, allowing for different degrees of compression

• By default we run four models on the TreeNet (rough analogues are sketched below)
– Ridge (no compression, just reweighting)
– Lasso (compression possible)
– Ridged Lasso (hybrid of Lasso and Ridge, but mostly Lasso)
– Compact (maximum compression)

• The goal is usually to find a substantial degree of compression while giving up little or nothing in test sample performance

• Alternatively, one could focus only on beating TN performance
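
A rough sketch of how the four defaults map onto standard regularizers (our mapping; "Compact" has no direct scikit-learn counterpart, so a strong Lasso penalty stands in, and the alpha values are arbitrary):

# Rough analogues of the four default post-processing models (our mapping).
from sklearn.linear_model import ElasticNet, Lasso, Ridge

post_processors = {
    "Ridge": Ridge(alpha=1.0),                              # reweight only
    "Lasso": Lasso(alpha=0.01),                             # compression possible
    "Ridged Lasso": ElasticNet(alpha=0.01, l1_ratio=0.9),   # mostly Lasso
    "Compact": Lasso(alpha=0.1),                            # push for max compression
}
# Each would be fit on the tree-column matrix from the earlier sketch.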


Model Compression: Early Days

• TreeNet has always offered model truncation
• Instead of using the fully articulated model, stop the process early
• In 2005 this method was being used by a major web portal
– A TreeNet model was used to predict likely response to an item presented to a visitor on a web page (ad, link, photo, story)
– To implement real-time response, the TN model was limited to the first 30 trees (truncation is sketched below)
– This sacrificed considerable predictive accuracy to have a model that could score fast enough in real time
– The truncated TreeNet at 30 trees was still better than the other alternatives
– Consider that the model might have been rebuilt every hour
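
Truncation is easy to emulate with any staged booster; a sketch with scikit-learn, where staged_predict yields the prediction after each added tree (the 30-tree cutoff mirrors the web-portal example):

# Sketch: model truncation, i.e. scoring with only the first 30 trees.
from itertools import islice

from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)
gbm = GradientBoostingRegressor(n_estimators=1000, random_state=0).fit(X, y)

# staged_predict yields predictions after 1, 2, ... trees; take the 30th.
pred_30 = next(islice(gbm.staged_predict(X), 29, None))
print("30 trees  :", mean_squared_error(y, pred_30))         # training-set MSE,
print("1000 trees:", mean_squared_error(y, gbm.predict(X)))  # illustration only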


Illustrative Example: Boston Housing Data Set
Set Up Model


TreeNet Controls

1000 trees, Least Squares, AUTO Learnrate


Post Processor Controls: What Type of Post Processing


Post Processor Details: Use All Defaults

• Standardizing the “trees” gives them all equal weight in the regularized regression
• Worth experimenting with unstandardized trees – larger-variance trees will dominate


Two Stage Modeling Process

• The first stage here is a TreeNet, but in SPM it could also be
– A single CART tree (the focus would be on nodes, e.g. from the maximal tree)
– An ensemble of CART trees (the bagger)
– A MARS model (basis functions from the maximal model)
– Random Forests

• In ISLE mode we need to operate on a collection of variables created by a learning machine
– These can come from any of our tree engines or from MARS

• We first get the first-stage results: a model
• Then the second stage: model refinement
– Model compression or model selection (e.g. tree pruning)


TreeNet Results

Test set R² = 0.87875, MSE = 7.407


TreeNet Results: Residual Stats

One substantial outlier lies more than 5 IQRs outside the central data range


TreeNet and Compressed TreeNet
Both Models Reported Below

• The dashed lines show the evolution of the compressed model
• Because we can choose any of our 1,000 trees to start, the compressed model starts off much better than the original TreeNet, and its first tree enters with a fitted coefficient


ISLE Reweighted TreeNet: Test Data Results


TreeNet vs ISLE Residuals
ISLE is wider in the center but narrower top to bottom

Panels: TreeNet Residuals | ISLE Compressed TreeNet


Comment on the First Tree

• It is interesting to observe that in this example the compressed model with just one tree in it outperforms the TreeNet model with just one tree

• Trees are built without look-ahead, but having a menu of 1,000 trees to choose from allows the 2nd-stage model to do better

• The worst-case scenario is that the 2nd stage chooses the same first tree
• A coefficient can spread out the predictions


TreeNet Model Compression

• TreeNet has set a high bar for predictive accuracy in the data mining field

• We now offer several ways in which a TreeNet can be further improved by post-processing

• Consider that a TreeNet model is built one step at a time, without knowledge of where we will end up
– Some trees are exact or almost exact copies of other trees
– Some trees may exhibit some “wandering” before the right direction is found
– Trees are each built on a different random subset of the data, and some trees may just be “unlucky”
– Post-processing can combine multiple copies of essentially the same tree and skip any unnecessary wandering


How Much Compression is Possible?

• Our experience derives from working with data from several industries (retail sales, online web advertising, credit risk, direct marketing)

• Compression of 80% is not uncommon for the best model generated by the post-processing

• However, the user is free to truncate the compressed model, as it is also built up sequentially (we add one tree at a time to the model)

• The user can thus choose from a possibly broad range of tradeoffs, opting for the even greater compression available from a less accurate model (a penalty sweep is sketched below)

• In the BOSTON example 90% compression also performs quite well (about 40 trees instead of the optimal 91 trees)
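
One way to explore this tradeoff is to sweep the Lasso penalty and watch tree count against test accuracy. A sketch, with scikit-learn standing in for SPM and an arbitrary alpha grid:

# Sketch: sweep the Lasso penalty to trade compression against accuracy.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
gbm = GradientBoostingRegressor(n_estimators=500, learning_rate=0.01,
                                max_leaf_nodes=6, random_state=0).fit(X_tr, y_tr)

def cols(Z):
    # One column per tree, as in the earlier sketches.
    return np.column_stack(
        [gbm.learning_rate * t.predict(Z) for t in gbm.estimators_[:, 0]])

F_tr, F_te = cols(X_tr), cols(X_te)
for alpha in (0.001, 0.01, 0.1, 1.0):  # stronger penalty -> fewer trees kept
    m = Lasso(alpha=alpha, max_iter=10000).fit(F_tr, y_tr)
    kept = int(np.sum(m.coef_ != 0))
    print(f"alpha={alpha}: {kept:3d} trees kept, test R^2 = {m.score(F_te, y_te):.3f}")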


A Comment on the Theory behind ISLE

• In Friedman’s paper on ISLE he provides a rationale for this approach quite different from ours

• Consider that our goal is to learn a model from data where it is clear that a linear regression is not adequate

• How can we automatically manufacture basis functions that capture more complex structure than the raw variables?
– Imagine offering high-order polynomials
– Some have suggested adding Xi*Xj interactions and also 1/Xi as new predictors, plus log(Xi) for all strictly positive regressors
– Friedman proposes TreeNet as a vehicle for generating such new variables in the search for a model more faithful to the truth
– Think of TreeNet as a search engine for features (constructed predictors)


From Trees to Nodes

• In a second round of work on the idea of post-processing a tree ensemble, Friedman suggested working with nodes

• Every node in a decision tree (other than the root) defines a potentially interesting subset of data

• Analysts have long thought about the terminal nodes of a CART tree in this way
– Each terminal node is a segment, or can be thought of as an interesting rule
– Cardell and Steinberg proposed blending CART and logistic regression in this way (each terminal node is a dummy variable)

• Now we extend this thinking to all nodes below the root
• Tibshirani proposed using all the nodes of a maximal tree in a Lasso model to “prune” the tree


Nodes in a Single TreeNet Tree
Tree grown to have T = 6 terminal nodes

• A typical TreeNet tree has T = 6 terminal nodes
• One level down from the root there are 2 nodes
• The next level has 4 nodes (3 terminal)
• The next 2 levels have 2 nodes each
• The total is 10 non-root nodes
• The count will always be T + (T-1) - 1 = 2(T-1)

• Represent each node as a 0/1 indicator (sketched below)
– A record passes through this node (1) or does not pass through this node (0)

• With 10 node indicators for each 6-terminal-node tree, a 1,000-tree TreeNet will generate 10,000 node indicators

• Now we want to post-process this node representation of the TreeNet
• The methodology can generate an immense number of predictors
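
Building the node-indicator representation is straightforward with any tree library that exposes decision paths. A sketch for a single tree using scikit-learn (decision_path is sklearn's API, not SPM's):

# Sketch: one 0/1 indicator per non-root node ("record passes through").
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
tree = DecisionTreeRegressor(max_leaf_nodes=6, random_state=0).fit(X, y)

# decision_path returns an (n_samples, n_nodes) indicator matrix that
# includes the root; drop column 0 since every record passes the root.
indicators = tree.decision_path(X).toarray()[:, 1:]
print(indicators.shape)  # (442, 10): 2*(T-1) = 10 node dummies for T = 6

Looping this over every tree in an ensemble and stacking the columns yields the 10,000-indicator matrix described above.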


Use Regularized Regression to Post Process

• Essential because even if we start with a small data set (rows and columns) we might generate thousands of trees

• The regularized regression is used to
– SELECT trees (only a subset of the original trees will be used)
– REWEIGHT trees (originally all had equal weight)

• The new model is still an ensemble of regression trees, but now recombined differently
– Some trees might get a negative weight

• The new model could have two advantages
– It could be MUCH smaller than the original model (good for deployment)
– It could be more accurate on holdout data

• No guarantees but results often attractive


Variations on Node Post Processing

• Pure: nodes (only node dummies in the 2nd-stage model)
• Hybrid: nodes + trees (mix of ISLE and nodes)
• Hybrid: raw predictors + nodes (Friedman’s preferred)
• Hybrid: raw predictors + ISLE variables
• Hybrid: raw predictors + ISLE trees + nodes

• In addition we could add the original TreeNet prediction to any of these sets of predictors

• For ideal interaction detection, include the TreeNet prediction from a purely additive model, plus node indicators, as regressors
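
A sketch of assembling one hybrid design, raw predictors plus node dummies for every tree (the variant marked above as Friedman's preferred); sparse stacking keeps the large number of columns manageable:

# Sketch: hybrid design matrix = raw predictors + node dummies for all trees.
from scipy.sparse import hstack
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor

X, y = load_diabetes(return_X_y=True)
gbm = GradientBoostingRegressor(n_estimators=200, max_leaf_nodes=6,
                                random_state=0).fit(X, y)

# Node dummies for every tree in the ensemble (root columns dropped).
node_blocks = [t.decision_path(X)[:, 1:] for t in gbm.estimators_[:, 0]]
hybrid = hstack([X] + node_blocks).tocsr()
print(hybrid.shape)  # 10 raw columns plus roughly 10 dummies per tree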


Raw Predictor Problems

• Much of our empirical work involves incomplete data (missing values), and the 2nd-stage model requires complete data (listwise deletion)

• While the hybrid models involving raw variables can capture nonlinearity and interactions, the raw predictors act as everyday regressors
– Issue of functional form
– Issue of outliers

• Using ISLE variables may be far better for working with data for which careful cleaning and repair is not an option


Same Data Post-Processing Nodes

• In this example running only on nodes does not do well
• See the upper dotted performance curves

• Still, we will examine the outputs generated
• Which method works best will vary with the specifics of the data


Pure RuleSeeker

• Each variable in the model is a node, or a RULE
• Worthwhile to examine mean target, lift, support, and agreement with test data
• All shown above


Rule Table: Display is Sortable

• The number of terms in a rule is determined by the location of the node in the tree
• Deep nodes can involve more variables (the minimum is one; the max equals the depth of the tree)


Rule Statistics

More columns from the Rule Table Display


Lift Report: High Lifts Represent Interesting Segments

One dot for each rule (here displaying test data results)


Parametric Bootstrap For Interaction Statistics


Final Details

• We have described RuleSeeker as a way to post-process a TreeNet model, and this is a fundamental use of the method

• When our goal from the start is to extract rules, we are advised to modify the TreeNet controls in two ways
– Allow the sizes of the trees to vary at random
– Use very small subsets of the data when growing each tree

• Friedman recommends an average tree size of 4 terminal nodes, using a Poisson distribution to generate varying tree sizes (this will often yield a few trees with 10-16 nodes)

• Friedman describes experiments in which each tree in the TreeNet is grown on just 5% of the available data
– The TreeNet first stage is inferior to a standard TreeNet, but the 2nd stage can actually outperform the standard TreeNet
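
A hand-rolled sketch of this recipe, since stock scikit-learn fixes the tree size: a least-squares boosting loop in which each tree's leaf count is drawn from a Poisson with a mean of 4 terminal nodes and each tree is fit to residuals on roughly 5% of the data. The distribution parameters and constants are our reading of the slide, not Friedman's code.

# Sketch: boosting with Poisson-varying tree sizes on 5% subsamples.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
rng = np.random.default_rng(0)
lr, trees = 0.05, []
pred = np.full(len(y), y.mean())  # start from the mean, as in LS boosting

for _ in range(200):  # Friedman's approach grows relatively few trees
    leaves = 2 + rng.poisson(2)               # mean 4 terminal nodes, min 2
    idx = rng.choice(len(y), size=len(y) // 20, replace=False)  # ~5% sample
    t = DecisionTreeRegressor(max_leaf_nodes=leaves).fit(
        X[idx], (y - pred)[idx])              # fit residuals on the subsample
    pred += lr * t.predict(X)
    trees.append(t)

print("mean leaf count:", np.mean([t.get_n_leaves() for t in trees]))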


RuleSeeker and Huge Data

• If the RuleSeeker approach can in fact outperform standard TreeNet this suggests a sampling approach to massive data sets

• Extract rather small (possibly stratified) samples from each of many data repositories

• Grow a TreeNet tree
• Repeat the random draws to grow subsequent trees
• Friedman’s approach does not grow very many trees (200)
• The 2nd-stage regression must be run on a much larger sample, but regression is much easier to distribute than trees


RuleSeeker Summary

• A RuleSeeker model has several interesting dimensions
– It is a post-processed version of a TreeNet
– The RuleSeeker model could offer better performance than the original TN
– The RuleSeeker model might also be more compact
– Rules extracted could be seen as important INTERACTIONS
– Rules could be studied as rules

• Compare train vs test lift (we want good agreement)
• Consider the tradeoff of lift versus support
– Rules can guide targeting, but are only worthwhile if support is sufficient


Big Data

• Currently we support 64-bit single-server operation
• Typical modern servers mean 32 cores and 512GB RAM
– Shortly we expect to see 200 cores and 2TB RAM at modest prices
– Our training data can reach about 1/3 of RAM without disk thrashing
– 200GB of training data (50 million rows by 1,000 predictors)

• MapReduce/Hadoop appears to be the emerging standard for massively parallel data stores and computation

• Our approach will be bagging models that extract random samples from each of the data stores

• Each mapper and reducer is typically expected to have 4GB RAM
• We will require reducers to be equipped with 16GB