Regression Tree Ensembles
Sergey Bakin
Problem Formulation
Training data set of N data points (xi, yi), i = 1, …, N.
x is a P-dimensional vector of predictor variables: the xi can be fixed design points or independently sampled from the same distribution.
y is a numeric response variable.
Problem: estimate the regression function E(y|x) = F(x), which can be a very complex function.
Ensembles of Models
Before the 1990s: a multitude of techniques was developed to tackle regression problems.
1990s: a new idea emerged: use a collection of "basic" models (an ensemble).
Substantial improvements in accuracy compared with any single "basic" model.
Examples: Bagging, Boosting, Random Forests.
Key Ingredients of Ensembles
Type of "basic" model used in the ensemble (RT, K-NN, NN)
The way basic models are built (data sub-sampling schemes, injection of randomness)
The way basic models are combined
Possible postprocessing (tuning) of the resulting ensemble (optional)
Random Forests (RF)
Developed by Leo Breiman, Department of Statistics, University of California, Berkeley, in the late 1990s.
RF is resistant to overfitting.
RF is capable of handling a large number of predictors.
Key Features of RF
A randomised regression tree is the basic model.
Each tree is grown on a bootstrap sample.
The ensemble (forest) is formed by averaging the predictions from the individual trees.
Regression Trees
Performs recursive binary division of the data: start with the root node (all points) and split it into two parts (left node and right node).
Each split attempts to separate data points with high yi's from data points with low yi's as much as possible.
A split is based on a single predictor and a split point.
To find the best splitter, all possible splitters and split points are tried.
Splitting is then repeated for the children (see the R sketch below).
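A minimal sketch of growing such a tree in R with the rpart package (d is a hypothetical data frame holding the response y and the predictors; this is not the presenter's actual code):

  library(rpart)
  # recursive binary splitting: at each node, every predictor and split
  # point competes, and the split maximising the reduction in deviance wins
  fit <- rpart(y ~ ., data = d, method = "anova")
  summary(fit)  # prints primary and competitor splits per node, as below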
[Tree diagram: root node (385 records, event rate 0.78) split on x2 at -110.6631; left child: event rate 0.56, 180 records; right child: event rate 0.91, 205 records. Total deviance explained = 45.8%.]
RT Competitor List
Primary splits:
  x2  < -110.6631 to the right, improve=734.0907, (0 missing)
  x6  <  107.5704 to the left,  improve=728.0376, (0 missing)
  x51 <  101.4707 to the left,  improve=720.1280, (0 missing)
  x30 < -113.879  to the right, improve=716.6580, (0 missing)
  x67 < -93.76226 to the right, improve=715.6400, (0 missing)
  x78 <  93.27373 to the left,  improve=715.6400, (0 missing)
  x62 <  93.99937 to the left,  improve=715.6400, (0 missing)
  x44 <  96.059   to the left,  improve=715.6400, (0 missing)
  x25 < -85.65475 to the right, improve=685.0943, (0 missing)
  x21 < -118.4764 to the right, improve=685.0736, (0 missing)
  x82 <  119.6532 to the left,  improve=685.0736, (0 missing)
  x79 < -81.00349 to the right, improve=675.7913, (0 missing)
  x18 < -70.78995 to the right, improve=663.0757, (0 missing)
[Tree diagram: root node (385 records, event rate 0.78) split on x2 at -110.6631 (45.8% of deviance). Left child (event rate 0.56, 180 records) split on x82 at 118.656 (a further 5.3%): terminal nodes with event rate 0.43 (101 records) and 0.61 (79 records). Right child (event rate 0.91, 205 records) split on x3 at 114.4024 (a further 12.4%): terminal nodes with event rate 0.64 (42 records) and 0.96 (163 records). Total deviance explained = 63.5%.]
Predictions from a Tree Model
A prediction from a tree is obtained by "dropping" x down the tree until it reaches a terminal node.
The predicted value is the average of the response values of the training data points in that terminal node.
Example (from the tree above): if x2 >= -110.66 and x82 >= 118.65, then prediction = 0.61.
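In R, this "dropping down the tree" is exactly what predict() does for an rpart fit (newx is a hypothetical data frame of new points):

  # each row of newx is routed to a terminal node; the prediction
  # is that node's mean training response
  yhat <- predict(fit, newdata = newx)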
Pruning of CART Trees
Prediction error (PE) = Variance + Bias².
PE vs tree size has a U-shape: very large and very small trees are both bad.
Trees are grown until the terminal nodes become small...
... and then pruned back. Holdout data are used to estimate the PE of the candidate trees, and the tree with the smallest PE is selected.
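A sketch of this procedure in R: rpart estimates PE by cross-validation (rather than a separate holdout set) and stores it in the cp table, from which the subtree to prune back to can be chosen:

  printcp(fit)  # cross-validated error (xerror) vs tree size
  # pick the complexity parameter with the smallest estimated PE
  best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
  pruned  <- prune(fit, cp = best_cp)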
Randomised Regression Trees I
Each tree is grown on a bootstrap sample: N data points are sampled with replacement.
Each such sample contains ~63% of the original data points; some records occur multiple times.
Each tree is built on its own bootstrap sample, so the trees are likely to be different.
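The ~63% figure is easy to verify in R; this small sketch draws one bootstrap sample of indices:

  N   <- 1000
  idx <- sample(N, N, replace = TRUE)   # N draws with replacement
  # expected fraction of distinct points is 1 - (1 - 1/N)^N ~ 1 - e^(-1)
  length(unique(idx)) / N               # typically close to 0.632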
Randomised Regression Trees II
At each split, only M randomly selected predictors are allowed to compete as potential splitters (e.g. 10 out of 100).
A new group of eligible splitters is selected at random at each step.
The splitter selected at each step is therefore likely to be somewhat suboptimal.
Every predictor gets a chance to compete as a splitter: important predictors are very likely to be eventually used as splitters.
Competitor List for Randomised RT
Primary splits:
  x6  <  107.5704 to the left,  improve=728.0376, (0 missing)
  x78 <  93.27373 to the left,  improve=715.6400, (0 missing)
  x62 <  93.99937 to the left,  improve=715.6400, (0 missing)
  x79 < -81.00349 to the right, improve=675.7913, (0 missing)
  x80 <  63.85983 to the left,  improve=654.7728, (0 missing)
  x24 <  59.5085  to the left,  improve=648.3837, (0 missing)
  x90 < -59.35043 to the right, improve=646.8825, (0 missing)
  x75 < -52.43783 to the right, improve=639.5996, (0 missing)
  x68 <  50.18278 to the left,  improve=631.1139, (0 missing)
  Y   < -33.42134 to the right, improve=606.9931, (0 missing)
  x34 <  132.8378 to the left,  improve=555.2047, (0 missing)
Randomised Regression Trees III
M = 1: the splitter is selected at random, but not the split point.
M = P: the original deterministic CART algorithm.
The trees are deliberately not pruned.
Combining the Trees
Each tree represents a regression model which fits the training data very closely: a low-bias, high-variance model.
The idea behind RF: take the predictions from a large number of highly variable trees and average them.
The result is a low-bias, low-variance model.
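A minimal sketch with the randomForest package in R (mtry plays the role of M; d and newx are the hypothetical data frames used above):

  library(randomForest)
  rf <- randomForest(y ~ ., data = d,
                     ntree = 500,  # number of bootstrap trees to average
                     mtry  = 10)   # M: predictors tried at each split
  # the forest prediction is the average of the 500 tree predictions
  yhat <- predict(rf, newdata = newx)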
[Figure: RF vs Individual Trees. Poisson log-likelihood vs number of trees in the RF (×50), comparing the RF, a single tree, and a constant model.]
Correlation vs Strength
Another decomposition for the PE of Random Forests: PE(RF) <= ρ̄ · PE(Tree), where
ρ̄ is the correlation between any two trees in the forest, and
PE(Tree) is the prediction error (strength) of a single tree.
M = 1: low correlation, low strength. M = P: high correlation, high strength.
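The correlation term can be inspected directly, since randomForest can return the per-tree predictions (a sketch, reusing rf and newx from above):

  # predict.all = TRUE returns the individual tree predictions;
  # higher mtry tends to make these columns more strongly correlated
  ind <- predict(rf, newdata = newx, predict.all = TRUE)$individual
  cor(ind[, 1], ind[, 2])   # correlation between two trees' predictions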
RF as a K-NN Regression Model I
RF induces a proximity measure in the predictor space:
P(x1, x2) = proportion of trees where x1 and x2 landed in the same terminal node.
Prediction at a point x:
  ŷ(x) = Σ P(x, xi) · yi, summed over i = 1, …, N
Only a fraction of the data points actually contributes to the prediction.
This strongly resembles the formula used for K-NN predictions.
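The proximity matrix is available directly from randomForest; a sketch of the K-NN-style prediction above, with the weights additionally normalised to sum to one (an assumption, not spelled out on the slide):

  rf <- randomForest(y ~ ., data = d, proximity = TRUE)
  P  <- rf$proximity                    # P[i, j] = proportion of trees where
                                        # points i and j share a terminal node
  yhat_knn <- (P %*% d$y) / rowSums(P)  # proximity-weighted average of y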
RF as a K-NN Regression Model II
Lin, Y. and Jeon, Y. Random Forests and Adaptive Nearest Neighbours. Technical Report 1055, Department of Statistics, University of Wisconsin, 2002.
Breiman, L. Consistency for a Simple Model of Random Forests. Technical Report 670, Statistics Department, University of California at Berkeley, 2004.
It was shown that:
Randomisation does reduce the variance component.
The optimal M is independent of the sample size.
RF does behave as an adaptive K-nearest-neighbour model: the shape and size of the neighbourhood are adapted to the local behaviour of the target regression function.
Case Study: Postcode Ranking in Motor Insurance
575 postcodes in NSW.
For each postcode: the number of claims as well as the "exposure", i.e. the number of policies in the postcode.
Problem: ranking of the postcodes for pricing purposes.
Approach
Each postcode is represented by the (x, y) coordinates of its centroid.
Model the expected claim frequency as a function of (x, y).
The target surface is likely to be highly irregular.
Add coordinates of the postcodes along 100 randomly generated directions to allow greater flexibility (see the sketch below).
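One way to generate such directional coordinates in R (a sketch, assuming the directions are uniform random angles; d$x and d$y are the centroid coordinates):

  theta <- runif(100, 0, pi)   # 100 random direction angles
  # project each centroid onto each direction; every projection
  # becomes an extra predictor the trees can split on
  proj <- outer(d$x, cos(theta)) + outer(d$y, sin(theta))
  colnames(proj) <- paste0("dir", 1:100)
  d_aug <- cbind(d, proj)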
Tuning RF: M
[Figure: Poisson log-likelihood vs number of trees in the RF (×50), for M = 1, 5, 10, 20 and 40.]
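A sketch of the tuning loop behind these plots (train and test are hypothetical splits of the postcode data; poisson_ll is a hypothetical helper for the Poisson log-likelihood metric on the axes, ignoring exposure weighting):

  # Poisson log-likelihood up to a constant; guard against mu = 0
  poisson_ll <- function(y, mu) sum(y * log(pmax(mu, 1e-6)) - mu)
  for (m in c(1, 5, 10, 20, 40)) {
    rf <- randomForest(y ~ ., data = train, mtry = m, ntree = 1000)
    cat("M =", m, " LL =", poisson_ll(test$y, predict(rf, test)), "\n")
  }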
Tuning RF: Size of Node
[Figure: Poisson log-likelihood vs number of trees in the RF (×40), for node size = 3, 6, 10 and 20.]
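The node-size comparison can be sketched the same way (nodesize is randomForest's minimum terminal node size; poisson_ll as defined above):

  for (ns in c(3, 6, 10, 20)) {
    rf <- randomForest(y ~ ., data = train, nodesize = ns, ntree = 800)
    cat("node size =", ns, " LL =", poisson_ll(test$y, predict(rf, test)), "\n")
  }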
[Figure: Metro Sydney. Map of postcode centroids (X = longitude, Y = latitude), labelled with their postcodes.]
[Figure: Surface Profile at X = 151.13. Predicted claim rate vs Y (latitude), with suburbs marked along the profile: North Ryde, Gladesville, Five Dock, Ashfield, Haberfield, Turramurra, Wahroonga, Summer Hill, Dulwich Hill, Earlwood.]
[Figure: Country Town Surrounded by Farms. Map of postcode centroids (X = longitude, Y = latitude) in a rural region.]
[Figure: Five Dock RF Neighbours. Map (X = longitude, Y = latitude) of the suburbs with highest RF proximity to Five Dock: Haberfield, Hunters Hill, Drummoyne, Concord, Gladesville, Croydon, Burwood, Rozelle, Leichhardt, Homebush, Concord West, Ashfield, Lane Cove, Annandale, Strathfield, Balmain, Ryde, Summer Hill, North Ryde.]
[Figure: Waterloo RF Neighbours. Map (X = longitude, Y = latitude) of the suburbs with highest RF proximity to Waterloo: Redfern, Alexandria, Darlinghurst, Paddington, Rosebery, Kensington, Mascot, Erskineville, Edgecliff, Bondi Junction, St Peters, Randwick, Botany, Newtown, Kingsford, Double Bay, Chippendale, Potts Point, Waverley.]
Things Not Covered
Clustering, missing value imputation and outlier detection
Identification of important variables
OOB testing of RF models
Postprocessing of RF models