Knowledge Discovery and Data Mining
Lecture 05 - Tree methods - Introduction
Tom Kelsey
School of Computer Science
University of St Andrews
http://[email protected]
ID5059-05-TM, 11 Feb 2015
Administration
P2 description needs to be agreed:
Forest Cover Type Prediction
Allstate Purchase Prediction Challenge
Don’t Get Kicked!
Claim Prediction Challenge (Allstate)
KDD Cup 2013 - Author Disambiguation Challenge (Track 2)
Validation Recap
Validation analysis example
(Figure slides: validation analysis example plots, not reproduced here.)
Validation Recap
Response variable: the y variable – the variable(s) we seek to predict. If categorical, we are classifying.
Covariates: the x variable(s) – the variable(s) we think might be used to predict the response, a.k.a. attributes or predictor variables.
Covariate space: conceptually, the space defined by our covariates, e.g. the x values give the coordinates of observations in a space.
Mean Squared Error (MSE): the theoretical expected/average squared error between the "true" quantity and our "method" of estimating it (the estimator). In practice we use the estimated MSE = n^{-1} Σ_i (y_i − ŷ_i)^2 (a small sketch follows below).
Supervised/unsupervised learning: supervised learning means we know the response values for model building (most common); unsupervised learning does not (e.g. clustering).
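As a concrete illustration of the estimated MSE and train/test validation above, here is a minimal Python sketch (not from the slides; the data, split and straight-line model are placeholders chosen purely for illustration):

    import numpy as np

    def estimated_mse(y, y_hat):
        """Estimated MSE = n^-1 * sum_i (y_i - y_hat_i)^2."""
        y, y_hat = np.asarray(y), np.asarray(y_hat)
        return np.mean((y - y_hat) ** 2)

    # Toy example: fit on a training set, assess generalisation on a held-out test set.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, size=200)
    y = 2.0 * x + rng.normal(scale=1.0, size=200)   # "true" relationship plus noise

    train, test = slice(0, 150), slice(150, 200)    # simple split, for illustration only
    beta = np.polyfit(x[train], y[train], deg=1)    # fit a straight line on the training data
    y_hat_test = np.polyval(beta, x[test])

    print("estimated test MSE:", estimated_mse(y[test], y_hat_test))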
Selection & Validation Summary
Process
1. Choose a measure for models.
2. Choose one or more candidate models.
3. For each model, find the number of parameters giving optimal generalisation MSE (a sketch of such a selection loop follows below).

Notes
Covariates are the variables we have available to potentially predict the response.
These may be represented by functions in our X design matrix, so by one or more columns.
Parsimony is achieved by reducing the size of the design matrix / the number of parameters to estimate.
Clearly there is a tradeoff: few covariates to achieve parsimony, possibly many parameters to achieve good generalisation error.
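A minimal Python sketch of step 3 of the process, assuming a held-out validation set and using polynomial degree as a stand-in for "number of parameters" (the data, the degree range and the validation split are illustrative assumptions, not from the slides):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(-3, 3, size=300)
    y = np.sin(x) + rng.normal(scale=0.3, size=300)   # unknown "true" curve plus noise

    x_train, y_train = x[:200], y[:200]
    x_val, y_val = x[200:], y[200:]

    best_degree, best_mse = None, np.inf
    for degree in range(1, 11):                       # candidate model complexities
        coefs = np.polyfit(x_train, y_train, deg=degree)
        val_mse = np.mean((y_val - np.polyval(coefs, x_val)) ** 2)
        if val_mse < best_mse:
            best_degree, best_mse = degree, val_mse

    print(f"chosen degree: {best_degree}, validation MSE: {best_mse:.4f}")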
Tree methods
Need-to-knows
1. How recursive binary partitioning of R^p works.
2. How to sketch a partitioning of R^2 on the basis of a series of simple binary splits.
3. How to go from a series of binary splitting rules to a tree representation, and vice versa.
Supervised/Unsupervised learning
http://practiceovertheory.com/blog/2010/02/15/machine-learning-who-s-the-boss/
A classification problem....
For the Titanic worked examples in week 1, the models I described differed in subtle ways.
Regression and Tree both returned probabilities rather than the 1s and 0s returned by random forests.
I used random forests to impute missing values, estimate the relative importance of covariates, and estimate the misclassification rate.
The tree model supplied a confusion matrix, allowing more detailed error analysis than the simple misclassification rate (it is easy to derive confusion matrices for the other two – see the sketch below).
Straightforward training/test data validation – I didn’t examine the overfit/underfit tradeoff in any detail.
I performed naïve covariate selection using my understanding of the problem domain – this is often a source of significant error in model development & validation.
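To make the point about deriving confusion matrices concrete, here is a small Python sketch that thresholds predicted probabilities at 0.5 and tabulates the result. The labels, probabilities and threshold are made up for illustration; this is not the week-1 Titanic code:

    import numpy as np

    def confusion_matrix(y_true, p_pred, threshold=0.5):
        """Return (TN, FP, FN, TP) from 0/1 labels and predicted probabilities."""
        y_true = np.asarray(y_true)
        y_pred = (np.asarray(p_pred) >= threshold).astype(int)
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        tp = np.sum((y_true == 1) & (y_pred == 1))
        return tn, fp, fn, tp

    y_true = [0, 0, 1, 1, 1, 0, 1, 0]
    p_pred = [0.2, 0.6, 0.8, 0.4, 0.9, 0.1, 0.7, 0.3]

    tn, fp, fn, tp = confusion_matrix(y_true, p_pred)
    print("confusion matrix:", [[tn, fp], [fn, tp]])
    print("misclassification rate:", (fp + fn) / len(y_true))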
Historical perspective
1960: Automatic Interaction Detection (AID), related to the clustering literature
THAID, CHAID
ID3, C4.5, C5.0
CART (1984, Breiman et al.)
Recursive partitioning on R^p
Take an n × p matrix X, defining a p-dimensional space R^p. We wish to apply a simple rule recursively:
1. Select a variable x_i and split on the basis of a single value x_i = a. We now have two spaces: x_i ≤ a and x_i > a.
2. Select one of the current sub-spaces, select a variable x_j, and split this sub-space on the basis of a single value x_j = b.
3. Repeatedly select sub-spaces and split them in two.
4. Note that this process extends beyond the two dimensions represented by x_1 and x_2. In three dimensions (i.e. including an x_3) the partitions would be boxes; beyond this the partitions are conceptually hyper-boxes. (A code sketch of the recursive splitting follows below.)
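The recursive rule above can be written down very compactly. Here is a minimal Python sketch of the recursion only – it is not CART: the variable is chosen by cycling through the columns and the split value is simply the median, both illustrative assumptions made so the recursion has something to do:

    import numpy as np

    def partition(X, rows, depth=0, max_depth=2):
        """Recursively split the rows of X in two, printing the regions created.

        Illustrative only: the split variable cycles through columns and the
        split value is the median; CART instead searches for the best variable
        and split point (see 'Tree construction' later in the lecture).
        """
        if depth == max_depth or len(rows) < 2:
            print("  " * depth + f"leaf with {len(rows)} observations")
            return
        j = depth % X.shape[1]                   # variable x_j to split on
        a = np.median(X[rows, j])                # split value (illustrative choice)
        left = [r for r in rows if X[r, j] <= a]
        right = [r for r in rows if X[r, j] > a]
        print("  " * depth + f"split on x{j + 1} at {a:.2f}")
        partition(X, left, depth + 1, max_depth)
        partition(X, right, depth + 1, max_depth)

    X = np.random.default_rng(2).uniform(0, 10, size=(20, 2))   # 20 points in R^2
    partition(X, rows=list(range(len(X))))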
An arbitrary 2-D space
(Figure: an empty two-dimensional covariate space with axes X1 and X2.)
Space splitting
(Figure: a single split of the space at X1 = a, giving two regions, Y = f(X1 < a) and Y = f(X1 >= a).)
Space splitting
(Figure: split of the X1 < a subspace at X2 = b, giving regions Y = f(X2 >= b & X1 < a) and Y = f(X2 < b & X1 < a), with Y = f(X1 >= a) unchanged.)
Space splitting
(Figure: further split of the X1 >= a subspace at X2 = c, so the regions are now Y = f(X2 >= c & X1 >= a), Y = f(X2 < c & X1 >= a), Y = f(X2 >= b & X1 < a) and Y = f(X2 < b & X1 < a).)
Space splitting
(Figure: a further split at X1 = d within the X2 >= c & X1 >= a region, giving Y = f(X2 >= c & a <= X1 < d) and Y = f(X2 >= c & X1 >= d); the other regions are unchanged.)
Space splitting
(Figure: a potential 3-D surface over the partitioned X1–X2 space, with Z as the response axis.)
Binary partitioning process as a tree
(Figure: an example tree diagram for a contrived partitioning – a root split on X1 at a, further splits at b, c and d, and terminal nodes C1 to C5 corresponding to the final regions above.)
Tree representation
The splitting points are called nodes – these have a binary splitting rule associated with them.
The two new spaces created by the split are represented by lines leaving the node; these are referred to as the branches.
A tree with one split is a stump.
The nodes at the bottom of the diagram are referred to as the terminal nodes, and collectively represent all the final partitions/subspaces of the data.
You can ‘drop’ a vector x down the tree to determine which subspace this coordinate falls into (a sketch follows below).
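A minimal Python sketch of "dropping" an observation down a tree. It hard-codes the contrived partitioning from the figures: the numeric values of a, b, c, d and the mapping of labels C1–C5 to regions are assumptions made for illustration, not taken from the slides:

    # Illustrative values for the split points a, b, c, d from the figures.
    a, b, c, d = 5.0, 3.0, 4.0, 8.0

    def drop(x1, x2):
        """Follow the binary splitting rules to a terminal node (region label)."""
        if x1 < a:
            return "C1" if x2 < b else "C2"
        if x2 < c:
            return "C3"
        return "C4" if x1 < d else "C5"

    print(drop(2.0, 1.0))   # falls in the X1 < a, X2 < b region -> C1
    print(drop(9.0, 6.0))   # falls in the X1 >= d, X2 >= c region -> C5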
Exercise
The following is a summary of a series of splits in R^2:
(x1 > 10)
(x1 ≤ 10) & (x2 ≤ 5)
(x1 ≤ 10) & (x2 > 5) & (x2 ≤ 10)
1. Sketch the progression of splits in two dimensions.
2. Produce a tree that summarises this series of splits.
Tree construction
We can model the response as a constant for each region (or, equivalently, leaf).
If we are minimising sums of squares, the optimal constant for a region/leaf is the average of the observed outputs for all inputs associated with that region/leaf.
Computing the optimal binary partition for given inputs and output is computationally intractable in general.
Instead a greedy algorithm is used: it finds an optimal variable and split point for the current region, then continues for the resulting sub-regions (a sketch follows below).
This is quick to compute (sums of averages) but errors at the root lead to errors at the leaves.
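A minimal Python sketch of the greedy search for a single split that minimises the sum of squares, with each side predicted by its average; applied repeatedly to the sub-regions this grows a tree. The data are made up, and no stopping rules or computational speed-ups are included:

    import numpy as np

    def best_split(X, y):
        """Return (variable index, split value, SSE) of the greedy best binary split."""
        best = (None, None, np.inf)
        for j in range(X.shape[1]):
            for a in np.unique(X[:, j]):
                left, right = y[X[:, j] <= a], y[X[:, j] > a]
                if len(left) == 0 or len(right) == 0:
                    continue
                # Optimal constant per side is the mean; the score is the total SSE.
                sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
                if sse < best[2]:
                    best = (j, a, sse)
        return best

    rng = np.random.default_rng(3)
    X = rng.uniform(0, 10, size=(50, 2))
    y = np.where(X[:, 0] > 6, 5.0, 1.0) + rng.normal(scale=0.5, size=50)

    j, a, sse = best_split(X, y)
    print(f"split on x{j + 1} at {a:.2f}, SSE = {sse:.2f}")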
How big should the tree be?
Tradeoff between bias and variance
Small tree – high bias, low variance: not big enough to capture the correct model structure.
Large tree – low bias, high variance: overfitting – in the extreme case each input is in exactly one region.
The optimal size should be adaptively chosen from the data.
We could stop splitting based on a threshold for decreases in the sum of squares, but this might rule out a useful split further down the tree.
Instead we construct a tree that is probably too large, and prune it by cost-complexity calculations – next lecture (a brief preview sketch follows below).
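Cost-complexity pruning is next lecture's topic, but as a brief preview: scikit-learn, if available, exposes the cost-complexity pruning path of a grown tree directly. This is a hedged sketch with made-up data, not the course's reference implementation:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(4)
    X = rng.uniform(0, 10, size=(200, 2))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

    # Grow a deliberately large tree, then inspect the cost-complexity pruning path:
    # each alpha corresponds to a smaller pruned subtree of the full tree.
    tree = DecisionTreeRegressor(random_state=0)
    path = tree.cost_complexity_pruning_path(X, y)
    print("number of candidate alphas:", len(path.ccp_alphas))
    print("smallest and largest alpha:", path.ccp_alphas[0], path.ccp_alphas[-1])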
Regression trees
Consider our general regression problem (note this could also be a classification problem):
y = f (X) + e
and the usual approximation model (linear in its parameters):
y = Xβ + e
‘Standard’ interactions of form βp(X1X2)
These are simple in form and yet quite hard to interpret succinctly.
What is probably the simplest interaction form to interpret? Recursive binary splitting rules for the covariate space (a brief sketch contrasting the two follows below).
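To contrast the two interaction forms, here is a small hedged Python sketch fitting a linear-in-parameters model with a β·(x1·x2) interaction column and a shallow regression tree to the same data. The data, the use of scikit-learn, and the depth-2 choice are all illustrative assumptions, not from the slides:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(5)
    X = rng.uniform(0, 10, size=(300, 2))
    # Response with an interaction: the effect of x1 depends on whether x2 is large.
    y = np.where((X[:, 0] > 5) & (X[:, 1] > 5), 10.0, 2.0) + rng.normal(scale=0.5, size=300)

    # Linear-in-parameters model with a 'standard' x1*x2 interaction column.
    design = np.column_stack([np.ones(len(X)), X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    lin_mse = np.mean((y - design @ beta) ** 2)

    # Regression tree: the interaction is expressed as nested binary splits.
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
    tree_mse = np.mean((y - tree.predict(X)) ** 2)

    print(f"linear model with interaction, training MSE: {lin_mse:.3f}")
    print(f"depth-2 regression tree, training MSE:       {tree_mse:.3f}")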
Advantages of tree models
Actually tree models in general, and CART in particular:
Nonparametric
  no probabilistic assumptions
Automatically performs variable selection
  important variables appear at or near the root
Any combination of continuous/discrete variables is allowed
  in the Titanic example, there was no need to specify that the response is categorical
  so we can automatically bin massively categorical variables into a few categories
  e.g. zip code, make/model, etc.
Advantages of tree models
Discovers interactions among variables
Handles missing values automatically
  using surrogate splits
Invariant to monotonic transformations of the predictive variables
Not sensitive to outliers in the predictive variables
Easy to spot when CART is struggling to capture a linear relationship (and therefore might not be suitable)
  repeated splits on the same variable
Good for data exploration, visualisation and multidisciplinary discussion
  in the Titanic example it gives hard values for "child" to support the heuristic "women & children first"
Disadvantages of tree models
Discrete output values, rather than continuous
  one predicted response per leaf node, and only a finite number of leaf nodes
Trees can be large and hence hard to interpret
Can be unstable when covariates are correlated
  slightly different data can give completely different trees
Not good for describing linear relationships
Not always the best predictive model
  might be outperformed by NN, RF, SVM, etc.
Tree methods
Need-to-knows
1. How recursive binary partitioning of R^p works.
2. How to sketch a partitioning of R^2 on the basis of a series of simple binary splits.
3. How to go from a series of binary splitting rules to a tree representation, and vice versa.