Regression Tree Ensembles
Sergey Bakin
Problem Formulation
Training data set of N data points (xi, yi), i = 1, …, N.
x is a P-dimensional vector of predictor variables: the xi can be fixed design points or independently sampled from the same distribution.
y is a numeric response variable.
Problem: estimate the regression function E(y|x) = F(x), which can be a very complex function.
Ensembles of Models
Before the 1990s: a multitude of techniques was developed to tackle regression problems.
1990s: a new idea emerged: use a collection of "basic" models (an ensemble).
Substantial improvements in accuracy compared with any single "basic" model.
Examples: Bagging, Boosting, Random Forests.
Key Ingredients of Ensembles
Type of "basic" model used in the ensemble (RT, K-NN, NN)
The way basic models are built (data sub-sampling schemes, injection of randomness)
The way basic models are combined
Possible postprocessing (tuning) of the resulting ensemble (optional)
Random Forests (RF)
Developed by Leo Breiman, Department of Statistics, University of California, Berkeley, in the late 1990s.
RF is resistant to overfitting.
RF is capable of handling a large number of predictors.
Key Features of RF
A randomised regression tree is the basic model.
Each tree is grown on a bootstrap sample.
The ensemble (forest) is formed by averaging the predictions from the individual trees.
Regression Trees
Performs recursive binary division of the data: start with the root node (all points) and split it into two parts (left node and right node).
Each split attempts to separate data points with high yi's from data points with low yi's as much as possible.
A split is based on a single predictor and a split point.
To find the best splitter, all possible splitters and split points are tried.
Splitting is then repeated for the children (see the R sketch below).
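A minimal sketch of growing such a tree in R with the rpart package (d is a hypothetical data frame holding the response y and the predictors; this is not the presenter's actual code):

  library(rpart)
  # recursive binary splitting: at each node, every predictor and split
  # point competes, and the split maximising the reduction in deviance wins
  fit <- rpart(y ~ ., data = d, method = "anova")
  summary(fit)  # prints primary and competitor splits per node, as below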
[Tree diagram: root node (385 records, event rate 0.78) split on x2 at -110.6631; left child: event rate 0.56, 180 records; right child: event rate 0.91, 205 records. Total deviance explained = 45.8%.]
RT Competitor List
Primary splits:
  x2  < -110.6631 to the right, improve=734.0907, (0 missing)
  x6  <  107.5704 to the left,  improve=728.0376, (0 missing)
  x51 <  101.4707 to the left,  improve=720.1280, (0 missing)
  x30 < -113.879  to the right, improve=716.6580, (0 missing)
  x67 < -93.76226 to the right, improve=715.6400, (0 missing)
  x78 <  93.27373 to the left,  improve=715.6400, (0 missing)
  x62 <  93.99937 to the left,  improve=715.6400, (0 missing)
  x44 <  96.059   to the left,  improve=715.6400, (0 missing)
  x25 < -85.65475 to the right, improve=685.0943, (0 missing)
  x21 < -118.4764 to the right, improve=685.0736, (0 missing)
  x82 <  119.6532 to the left,  improve=685.0736, (0 missing)
  x79 < -81.00349 to the right, improve=675.7913, (0 missing)
  x18 < -70.78995 to the right, improve=663.0757, (0 missing)
[Tree diagram: root node (385 records, event rate 0.78) split on x2 at -110.6631 (45.8% of deviance). Left child (event rate 0.56, 180 records) split on x82 at 118.656 (a further 5.3%): terminal nodes with event rate 0.43 (101 records) and 0.61 (79 records). Right child (event rate 0.91, 205 records) split on x3 at 114.4024 (a further 12.4%): terminal nodes with event rate 0.64 (42 records) and 0.96 (163 records). Total deviance explained = 63.5%.]
Predictions from a Tree Model
A prediction from a tree is obtained by "dropping" x down the tree until it reaches a terminal node.
The predicted value is the average of the response values of the training data points in that terminal node.
Example (from the tree above): if x2 >= -110.66 and x82 >= 118.65, then prediction = 0.61.
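In R, this "dropping down the tree" is exactly what predict() does for an rpart fit (newx is a hypothetical data frame of new points):

  # each row of newx is routed to a terminal node; the prediction
  # is that node's mean training response
  yhat <- predict(fit, newdata = newx)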
Pruning of CART Trees
Prediction error (PE) = Variance + Bias².
PE vs tree size has a U-shape: very large and very small trees are both bad.
Trees are grown until the terminal nodes become small...
... and then pruned back. Holdout data are used to estimate the PE of the candidate trees, and the tree with the smallest PE is selected.
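A sketch of this procedure in R: rpart estimates PE by cross-validation (rather than a separate holdout set) and stores it in the cp table, from which the subtree to prune back to can be chosen:

  printcp(fit)  # cross-validated error (xerror) vs tree size
  # pick the complexity parameter with the smallest estimated PE
  best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
  pruned  <- prune(fit, cp = best_cp)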
Randomised Regression Trees I
Each tree is grown on a bootstrap sample: N data points are sampled with replacement.
Each such sample contains ~63% of the original data points; some records occur multiple times.
Each tree is built on its own bootstrap sample, so the trees are likely to be different.
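The ~63% figure is easy to verify in R; this small sketch draws one bootstrap sample of indices:

  N   <- 1000
  idx <- sample(N, N, replace = TRUE)   # N draws with replacement
  # expected fraction of distinct points is 1 - (1 - 1/N)^N ~ 1 - e^(-1)
  length(unique(idx)) / N               # typically close to 0.632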
Randomised Regression Trees II
At each split, only M randomly selected predictors are allowed to compete as potential splitters (e.g. 10 out of 100).
A new group of eligible splitters is selected at random at each step.
The splitter selected at each step is therefore likely to be somewhat suboptimal.
Every predictor gets a chance to compete as a splitter: important predictors are very likely to be eventually used as splitters.
Competitor List for Randomised RT
Primary splits:
  x6  <  107.5704 to the left,  improve=728.0376, (0 missing)
  x78 <  93.27373 to the left,  improve=715.6400, (0 missing)
  x62 <  93.99937 to the left,  improve=715.6400, (0 missing)
  x79 < -81.00349 to the right, improve=675.7913, (0 missing)
  x80 <  63.85983 to the left,  improve=654.7728, (0 missing)
  x24 <  59.5085  to the left,  improve=648.3837, (0 missing)
  x90 < -59.35043 to the right, improve=646.8825, (0 missing)
  x75 < -52.43783 to the right, improve=639.5996, (0 missing)
  x68 <  50.18278 to the left,  improve=631.1139, (0 missing)
  Y   < -33.42134 to the right, improve=606.9931, (0 missing)
  x34 <  132.8378 to the left,  improve=555.2047, (0 missing)
Randomised Regression Trees III
M = 1: the splitter is selected at random, but not the split point.
M = P: the original deterministic CART algorithm.
The trees are deliberately not pruned.
Combining the Trees
Each tree represents a regression model which fits the training data very closely: a low-bias, high-variance model.
The idea behind RF: take the predictions from a large number of highly variable trees and average them.
The result is a low-bias, low-variance model.
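A minimal sketch with the randomForest package in R (mtry plays the role of M; d and newx are the hypothetical data frames used above):

  library(randomForest)
  rf <- randomForest(y ~ ., data = d,
                     ntree = 500,  # number of bootstrap trees to average
                     mtry  = 10)   # M: predictors tried at each split
  # the forest prediction is the average of the 500 tree predictions
  yhat <- predict(rf, newdata = newx)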
[Figure: RF vs Individual Trees. Poisson log-likelihood vs number of trees in the RF (×50), comparing the RF, a single tree, and a constant model.]
Correlation vs Strength
Another decomposition for the PE of Random Forests: PE(RF) <= ρ̄ · PE(Tree), where
ρ̄ is the correlation between any two trees in the forest, and
PE(Tree) is the prediction error (strength) of a single tree.
M = 1: low correlation, low strength. M = P: high correlation, high strength.
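The correlation term can be inspected directly, since randomForest can return the per-tree predictions (a sketch, reusing rf and newx from above):

  # predict.all = TRUE returns the individual tree predictions;
  # higher mtry tends to make these columns more strongly correlated
  ind <- predict(rf, newdata = newx, predict.all = TRUE)$individual
  cor(ind[, 1], ind[, 2])   # correlation between two trees' predictions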
RF as a K-NN Regression Model I
RF induces a proximity measure in the predictor space:
P(x1, x2) = proportion of trees where x1 and x2 landed in the same terminal node.
Prediction at a point x:
  ŷ(x) = Σ P(x, xi) · yi, summed over i = 1, …, N
Only a fraction of the data points actually contributes to the prediction.
This strongly resembles the formula used for K-NN predictions.
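The proximity matrix is available directly from randomForest; a sketch of the K-NN-style prediction above, with the weights additionally normalised to sum to one (an assumption, not spelled out on the slide):

  rf <- randomForest(y ~ ., data = d, proximity = TRUE)
  P  <- rf$proximity                    # P[i, j] = proportion of trees where
                                        # points i and j share a terminal node
  yhat_knn <- (P %*% d$y) / rowSums(P)  # proximity-weighted average of y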
RF as a K-NN Regression Model II
Lin, Y. and Jeon, Y. Random Forests and Adaptive Nearest Neighbours. Technical Report 1055, Department of Statistics, University of Wisconsin, 2002.
Breiman, L. Consistency for a Simple Model of Random Forests. Technical Report 670, Statistics Department, University of California at Berkeley, 2004.
It was shown that:
Randomisation does reduce the variance component.
The optimal M is independent of the sample size.
RF does behave as an adaptive K-nearest-neighbour model: the shape and size of the neighbourhood are adapted to the local behaviour of the target regression function.
Case Study: Postcode Ranking in Motor Insurance
575 postcodes in NSW.
For each postcode: the number of claims as well as the "exposure", i.e. the number of policies in the postcode.
Problem: ranking of the postcodes for pricing purposes.
Approach
Each postcode is represented by the (x, y) coordinates of its centroid.
Model the expected claim frequency as a function of (x, y).
The target surface is likely to be highly irregular.
Add coordinates of the postcodes along 100 randomly generated directions to allow greater flexibility (see the sketch below).
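One way to generate such directional coordinates in R (a sketch, assuming the directions are uniform random angles; d$x and d$y are the centroid coordinates):

  theta <- runif(100, 0, pi)   # 100 random direction angles
  # project each centroid onto each direction; every projection
  # becomes an extra predictor the trees can split on
  proj <- outer(d$x, cos(theta)) + outer(d$y, sin(theta))
  colnames(proj) <- paste0("dir", 1:100)
  d_aug <- cbind(d, proj)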
Tuning RF: M
[Figure: Poisson log-likelihood vs number of trees in the RF (×50), for M = 1, 5, 10, 20 and 40.]
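A sketch of the tuning loop behind these plots (train and test are hypothetical splits of the postcode data; poisson_ll is a hypothetical helper for the Poisson log-likelihood metric on the axes, ignoring exposure weighting):

  # Poisson log-likelihood up to a constant; guard against mu = 0
  poisson_ll <- function(y, mu) sum(y * log(pmax(mu, 1e-6)) - mu)
  for (m in c(1, 5, 10, 20, 40)) {
    rf <- randomForest(y ~ ., data = train, mtry = m, ntree = 1000)
    cat("M =", m, " LL =", poisson_ll(test$y, predict(rf, test)), "\n")
  }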
Tuning RF: Size of Node
[Figure: Poisson log-likelihood vs number of trees in the RF (×40), for node size = 3, 6, 10 and 20.]
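The node-size comparison can be sketched the same way (nodesize is randomForest's minimum terminal node size; poisson_ll as defined above):

  for (ns in c(3, 6, 10, 20)) {
    rf <- randomForest(y ~ ., data = train, nodesize = ns, ntree = 800)
    cat("node size =", ns, " LL =", poisson_ll(test$y, predict(rf, test)), "\n")
  }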
[Figure: Metro Sydney. Map of postcode centroids (X = longitude, Y = latitude), labelled with their postcodes.]
[Figure: Surface Profile at X = 151.13. Predicted claim rate vs Y (latitude), with suburbs marked along the profile: North Ryde, Gladesville, Five Dock, Ashfield, Haberfield, Turramurra, Wahroonga, Summer Hill, Dulwich Hill, Earlwood.]
[Figure: Country Town Surrounded by Farms. Map of postcode centroids (X = longitude, Y = latitude) in a rural region.]
[Figure: Five Dock RF Neighbours. Map (X = longitude, Y = latitude) of the suburbs with highest RF proximity to Five Dock: Haberfield, Hunters Hill, Drummoyne, Concord, Gladesville, Croydon, Burwood, Rozelle, Leichhardt, Homebush, Concord West, Ashfield, Lane Cove, Annandale, Strathfield, Balmain, Ryde, Summer Hill, North Ryde.]
[Figure: Waterloo RF Neighbours. Map (X = longitude, Y = latitude) of the suburbs with highest RF proximity to Waterloo: Redfern, Alexandria, Darlinghurst, Paddington, Rosebery, Kensington, Mascot, Erskineville, Edgecliff, Bondi Junction, St Peters, Randwick, Botany, Newtown, Kingsford, Double Bay, Chippendale, Potts Point, Waverley.]
Things Not Covered
Clustering, missing value imputation and outlier detection
Identification of important variables
OOB testing of RF models
Postprocessing of RF models