Knowledge Discovery and Data Mining
Lecture 05 - Tree methods - Introduction
Tom Kelsey
School of Computer Science
University of St Andrews
http://[email protected]
ID5059-05-TM, 11 Feb 2015
Administration
P2 description needs to be agreed:
Forest Cover Type Prediction
Allstate Purchase Prediction Challenge
Don’t Get Kicked!
Claim Prediction Challenge (Allstate)
KDD Cup 2013 - Author Disambiguation Challenge (Track 2)
Validation Recap
Validation analysis example
(Figure slides: validation analysis example plots, not reproduced here.)
Validation Recap
Response variable: the y variable – the variable(s) we seek to predict. If categorical, we are classifying.
Covariates: the x variable(s) – the variable(s) we think might be used to predict the response, a.k.a. attributes or predictor variables.
Covariate space: conceptually, the space defined by our covariates, e.g. the x values give the coordinates of observations in a space.
Mean Squared Error (MSE): the theoretical expected/average squared error between the "true" quantity and our "method" of estimating it (the estimator). In practice we use the estimated MSE = n^{-1} Σ_i (y_i − ŷ_i)^2 (a small sketch follows below).
Supervised/unsupervised learning: supervised learning means we know the response values for model building (most common); unsupervised learning does not (e.g. clustering).
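As a concrete illustration of the estimated MSE and train/test validation above, here is a minimal Python sketch (not from the slides; the data, split and straight-line model are placeholders chosen purely for illustration):

    import numpy as np

    def estimated_mse(y, y_hat):
        """Estimated MSE = n^-1 * sum_i (y_i - y_hat_i)^2."""
        y, y_hat = np.asarray(y), np.asarray(y_hat)
        return np.mean((y - y_hat) ** 2)

    # Toy example: fit on a training set, assess generalisation on a held-out test set.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=200)
    y = 2.0 * x + rng.normal(scale=1.0, size=200)   # "true" relationship plus noise

    train, test = slice(0, 150), slice(150, 200)    # simple split, for illustration only
    beta = np.polyfit(x[train], y[train], deg=1)    # fit a straight line on the training data
    y_hat_test = np.polyval(beta, x[test])

    print("estimated test MSE:", estimated_mse(y[test], y_hat_test))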
Selection & Validation Summary
Process
1. Choose a measure for models.
2. Choose one or more candidate models.
3. For each model, find the number of parameters giving optimal generalisation MSE (a sketch of such a selection loop follows below).

Notes
Covariates are the variables we have available to potentially predict the response.
These may be represented by functions in our X design matrix, so by one or more columns.
Parsimony is achieved by reducing the size of the design matrix / the number of parameters to estimate.
Clearly there is a tradeoff: few covariates to achieve parsimony, possibly many parameters to achieve good generalisation error.
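A minimal Python sketch of step 3 of the process, assuming a held-out validation set and using polynomial degree as a stand-in for "number of parameters" (the data, the degree range and the validation split are illustrative assumptions, not from the slides):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(-3, 3, size=300)
    y = np.sin(x) + rng.normal(scale=0.3, size=300)   # unknown "true" curve plus noise

    x_train, y_train = x[:200], y[:200]
    x_val, y_val = x[200:], y[200:]

    best_degree, best_mse = None, np.inf
    for degree in range(1, 11):                       # candidate model complexities
        coefs = np.polyfit(x_train, y_train, deg=degree)
        val_mse = np.mean((y_val - np.polyval(coefs, x_val)) ** 2)
        if val_mse < best_mse:
            best_degree, best_mse = degree, val_mse

    print(f"chosen degree: {best_degree}, validation MSE: {best_mse:.4f}")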
Tree methods
Need-to-knows
1. How recursive binary partitioning of R^p works.
2. How to sketch a partitioning of R^2 on the basis of a series of simple binary splits.
3. How to go from a series of binary splitting rules to a tree representation, and vice versa.
Supervised/Unsupervised learning
http://practiceovertheory.com/blog/2010/02/15/machine-learning-who-s-the-boss/
A classification problem....
For the Titanic worked examples in week 1, the models I described differed in subtle ways.
Regression and Tree both returned probabilities rather than the 1s and 0s returned by random forests.
I used random forests to impute missing values, estimate the relative importance of covariates, and estimate the misclassification rate.
The tree model supplied a confusion matrix, allowing more detailed error analysis than the simple misclassification rate (it is easy to derive confusion matrices for the other two – see the sketch below).
Straightforward training/test data validation – I didn’t examine the overfit/underfit tradeoff in any detail.
I performed naïve covariate selection using my understanding of the problem domain – this is often a source of significant error in model development & validation.
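To make the point about deriving confusion matrices concrete, here is a small Python sketch that thresholds predicted probabilities at 0.5 and tabulates the result. The labels, probabilities and threshold are made up for illustration; this is not the week-1 Titanic code:

    import numpy as np

    def confusion_matrix(y_true, p_pred, threshold=0.5):
        """Return (TN, FP, FN, TP) from 0/1 labels and predicted probabilities."""
        y_true = np.asarray(y_true)
        y_pred = (np.asarray(p_pred) >= threshold).astype(int)
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        tp = np.sum((y_true == 1) & (y_pred == 1))
        return tn, fp, fn, tp

    y_true = [0, 0, 1, 1, 1, 0, 1, 0]
    p_pred = [0.2, 0.6, 0.8, 0.4, 0.9, 0.1, 0.7, 0.3]

    tn, fp, fn, tp = confusion_matrix(y_true, p_pred)
    print("confusion matrix:", [[tn, fp], [fn, tp]])
    print("misclassification rate:", (fp + fn) / len(y_true))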
Historical perspective
1960: Automatic Interaction Detection (AID), related to the clustering literature
THAID, CHAID
ID3, C4.5, C5.0
CART (1984, Breiman et al.)
Recursive partitioning on R^p
Take an n × p matrix X, defining a p-dimensional space R^p. We wish to apply a simple rule recursively:
1. Select a variable x_i and split on the basis of a single value x_i = a. We now have two spaces: x_i ≤ a and x_i > a.
2. Select one of the current sub-spaces, select a variable x_j, and split this sub-space on the basis of a single value x_j = b.
3. Repeatedly select sub-spaces and split them in two.
4. Note that this process extends beyond the two dimensions represented by x_1 and x_2. In three dimensions (i.e. including an x_3) the partitions would be boxes; beyond this the partitions are conceptually hyper-boxes. (A code sketch of the recursive splitting follows below.)
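The recursive rule above can be written down very compactly. Here is a minimal Python sketch of the recursion only – it is not CART: the variable is chosen by cycling through the columns and the split value is simply the median, both illustrative assumptions made so the recursion has something to do:

    import numpy as np

    def partition(X, rows, depth=0, max_depth=2):
        """Recursively split the rows of X in two, printing the regions created.

        Illustrative only: the split variable cycles through columns and the
        split value is the median; CART instead searches for the best variable
        and split point (see 'Tree construction' later in the lecture).
        """
        if depth == max_depth or len(rows) < 2:
            print("  " * depth + f"leaf with {len(rows)} observations")
            return
        j = depth % X.shape[1]                   # variable x_j to split on
        a = np.median(X[rows, j])                # split value (illustrative choice)
        left = [r for r in rows if X[r, j] <= a]
        right = [r for r in rows if X[r, j] > a]
        print("  " * depth + f"split on x{j + 1} at {a:.2f}")
        partition(X, left, depth + 1, max_depth)
        partition(X, right, depth + 1, max_depth)

    X = np.random.default_rng(2).uniform(0, 10, size=(20, 2))   # 20 points in R^2
    partition(X, rows=list(range(len(X))))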
An arbitrary 2-D space
(Figure: an empty two-dimensional covariate space with axes X1 and X2.)
Space splitting
(Figure: a single split of the space at X1 = a, giving two regions, Y = f(X1 < a) and Y = f(X1 >= a).)
Space splitting
(Figure: split of the X1 < a subspace at X2 = b, giving regions Y = f(X2 >= b & X1 < a) and Y = f(X2 < b & X1 < a), with Y = f(X1 >= a) unchanged.)
Space splitting
(Figure: further split of the X1 >= a subspace at X2 = c, so the regions are now Y = f(X2 >= c & X1 >= a), Y = f(X2 < c & X1 >= a), Y = f(X2 >= b & X1 < a) and Y = f(X2 < b & X1 < a).)
Space splitting
(Figure: a further split at X1 = d within the X2 >= c & X1 >= a region, giving Y = f(X2 >= c & a <= X1 < d) and Y = f(X2 >= c & X1 >= d); the other regions are unchanged.)
Space splitting
(Figure: a potential 3-D surface over the partitioned X1–X2 space, with Z as the response axis.)
Binary partitioning process as a tree
(Figure: an example tree diagram for a contrived partitioning – a root split on X1 at a, further splits at b, c and d, and terminal nodes C1 to C5 corresponding to the final regions above.)
Tree representation
The splitting points are called nodes – these have a binary splitting rule associated with them.
The two new spaces created by the split are represented by lines leaving the node; these are referred to as the branches.
A tree with one split is a stump.
The nodes at the bottom of the diagram are referred to as the terminal nodes, and collectively represent all the final partitions/subspaces of the data.
You can ‘drop’ a vector x down the tree to determine which subspace this coordinate falls into (a sketch follows below).
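A minimal Python sketch of "dropping" an observation down a tree. It hard-codes the contrived partitioning from the figures: the numeric values of a, b, c, d and the mapping of labels C1–C5 to regions are assumptions made for illustration, not taken from the slides:

    # Illustrative values for the split points a, b, c, d from the figures.
    a, b, c, d = 5.0, 3.0, 4.0, 8.0

    def drop(x1, x2):
        """Follow the binary splitting rules to a terminal node (region label)."""
        if x1 < a:
            return "C1" if x2 < b else "C2"
        if x2 < c:
            return "C3"
        return "C4" if x1 < d else "C5"

    print(drop(2.0, 1.0))   # falls in the X1 < a, X2 < b region -> C1
    print(drop(9.0, 6.0))   # falls in the X1 >= d, X2 >= c region -> C5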
Exercise
The following is a summary of a series of splits in R^2:
(x1 > 10)
(x1 ≤ 10) & (x2 ≤ 5)
(x1 ≤ 10) & (x2 > 5) & (x2 ≤ 10)
1. Sketch the progression of splits in two dimensions.
2. Produce a tree that summarises this series of splits.
Tree construction
We can model the response as a constant for each region (or, equivalently, leaf).
If we are minimising sums of squares, the optimal constant for a region/leaf is the average of the observed outputs for all inputs associated with that region/leaf.
Computing the optimal binary partition for given inputs and output is computationally intractable in general.
Instead a greedy algorithm is used: it finds an optimal variable and split point for the current region, then continues for the resulting sub-regions (a sketch follows below).
This is quick to compute (sums of averages) but errors at the root lead to errors at the leaves.
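A minimal Python sketch of the greedy search for a single split that minimises the sum of squares, with each side predicted by its average; applied repeatedly to the sub-regions this grows a tree. The data are made up, and no stopping rules or computational speed-ups are included:

    import numpy as np

    def best_split(X, y):
        """Return (variable index, split value, SSE) of the greedy best binary split."""
        best = (None, None, np.inf)
        for j in range(X.shape[1]):
            for a in np.unique(X[:, j]):
                left, right = y[X[:, j] <= a], y[X[:, j] > a]
                if len(left) == 0 or len(right) == 0:
                    continue
                # Optimal constant per side is the mean; the score is the total SSE.
                sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
                if sse < best[2]:
                    best = (j, a, sse)
        return best

    rng = np.random.default_rng(3)
    X = rng.uniform(0, 10, size=(50, 2))
    y = np.where(X[:, 0] > 6, 5.0, 1.0) + rng.normal(scale=0.5, size=50)

    j, a, sse = best_split(X, y)
    print(f"split on x{j + 1} at {a:.2f}, SSE = {sse:.2f}")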
How big should the tree be?
Tradeoff between bias and variance
Small tree – high bias, low variance: not big enough to capture the correct model structure.
Large tree – low bias, high variance: overfitting – in the extreme case each input is in exactly one region.
The optimal size should be adaptively chosen from the data.
We could stop splitting based on a threshold for decreases in the sum of squares, but this might rule out a useful split further down the tree.
Instead we construct a tree that is probably too large, and prune it by cost-complexity calculations – next lecture (a brief preview sketch follows below).
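Cost-complexity pruning is next lecture's topic, but as a brief preview: scikit-learn, if available, exposes the cost-complexity pruning path of a grown tree directly. This is a hedged sketch with made-up data, not the course's reference implementation:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(4)
    X = rng.uniform(0, 10, size=(200, 2))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

    # Grow a deliberately large tree, then inspect the cost-complexity pruning path:
    # each alpha corresponds to a smaller pruned subtree of the full tree.
    tree = DecisionTreeRegressor(random_state=0)
    path = tree.cost_complexity_pruning_path(X, y)
    print("number of candidate alphas:", len(path.ccp_alphas))
    print("smallest and largest alpha:", path.ccp_alphas[0], path.ccp_alphas[-1])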
Regression trees
Consider our general regression problem (note this could also be a classification problem):
y = f (X) + e
and the usual approximation model (linear in its parameters):
y = Xβ + e
‘Standard’ interactions of form βp(X1X2)
These are simple in form and yet quite hard to interpret succinctly.
What is probably the simplest interaction form to interpret? Recursive binary splitting rules for the covariate space (a brief sketch contrasting the two follows below).
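To contrast the two interaction forms, here is a small hedged Python sketch fitting a linear-in-parameters model with a β·(x1·x2) interaction column and a shallow regression tree to the same data. The data, the use of scikit-learn, and the depth-2 choice are all illustrative assumptions, not from the slides:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(5)
    X = rng.uniform(0, 10, size=(300, 2))
    # Response with an interaction: the effect of x1 depends on whether x2 is large.
    y = np.where((X[:, 0] > 5) & (X[:, 1] > 5), 10.0, 2.0) + rng.normal(scale=0.5, size=300)

    # Linear-in-parameters model with a 'standard' x1*x2 interaction column.
    design = np.column_stack([np.ones(len(X)), X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    lin_mse = np.mean((y - design @ beta) ** 2)

    # Regression tree: the interaction is expressed as nested binary splits.
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
    tree_mse = np.mean((y - tree.predict(X)) ** 2)

    print(f"linear model with interaction, training MSE: {lin_mse:.3f}")
    print(f"depth-2 regression tree, training MSE:       {tree_mse:.3f}")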
Advantages of tree models
Actually tree models in general, and CART in particular:
Nonparametric
  no probabilistic assumptions
Automatically performs variable selection
  important variables appear at or near the root
Any combination of continuous/discrete variables is allowed
  in the Titanic example, there was no need to specify that the response is categorical
  so we can automatically bin massively categorical variables into a few categories
  e.g. zip code, make/model, etc.
Advantages of tree models
Discovers interactions among variables
Handles missing values automatically
  using surrogate splits
Invariant to monotonic transformations of the predictive variables
Not sensitive to outliers in the predictive variables
Easy to spot when CART is struggling to capture a linear relationship (and therefore might not be suitable)
  repeated splits on the same variable
Good for data exploration, visualisation and multidisciplinary discussion
  in the Titanic example it gives hard values for "child" to support the heuristic "women & children first"
Disadvantages of tree models
Discrete output values, rather than continuous
  one predicted response per leaf node, and only a finite number of leaf nodes
Trees can be large and hence hard to interpret
Can be unstable when covariates are correlated
  slightly different data can give completely different trees
Not good for describing linear relationships
Not always the best predictive model
  might be outperformed by NN, RF, SVM, etc.
Tree methods
Need-to-knows
1. How recursive binary partitioning of R^p works.
2. How to sketch a partitioning of R^2 on the basis of a series of simple binary splits.
3. How to go from a series of binary splitting rules to a tree representation, and vice versa.