
Knowledge Discovery and Data Mining
Lecture 05 - Tree methods - Introduction

Tom Kelsey

School of Computer Science
University of St Andrews

http://[email protected]

Tom Kelsey · ID5059-05-TM · 11 Feb 2015


Administration

P2 description needs to be agreed:
Forest Cover Type Prediction
Allstate Purchase Prediction Challenge
Don't Get Kicked!
Claim Prediction Challenge (Allstate)
KDD Cup 2013 - Author Disambiguation Challenge (Track 2)


Validation Recap

[Figure: validation analysis example]


Validation Recap

Response variable: The y variable. The variable(s) we seek to predict. If categorical, we are classifying.

Covariates: The x variable(s). The variable(s) we think might be used to predict the response – a.k.a. attributes, predictor variables.

Covariate space: Conceptually, the space defined by our covariates, e.g. the x values give the coordinates of observations in a space.

Mean Squared Error (MSE): The theoretical expected/average squared error between the “true” quantity and our “method” of estimating it (the estimator). In practice we use the estimated MSE = n⁻¹ Σᵢ (yᵢ − ŷᵢ)².

Supervised/unsupervised learning: Supervised learning means we know the response values for model building (most common). Unsupervised learning does not (e.g. clustering).
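A minimal sketch (not from the slides) of computing the estimated MSE for a set of predictions, using NumPy; the numbers are illustrative only.

import numpy as np

def estimated_mse(y, y_hat):
    # Estimated MSE = n^-1 * sum_i (y_i - yhat_i)^2
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean((y - y_hat) ** 2)

# Toy illustration: observed responses vs. a model's predictions.
print(estimated_mse([3.0, -1.0, 2.5], [2.6, -0.7, 2.9]))   # approx 0.137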


Selection & Validation Summary

Process
1 Choose a measure for models.
2 Choose a (or some) candidate model(s).
3 For each model, find the number of parameters giving optimal generalisation MSE.

Notes
Covariates are the variables we have available to potentially predict the response.
These may be represented by functions in our X design matrix, so by one or more columns.
Parsimony is achieved by reducing the size of the design matrix / the number of parameters to estimate.
Clearly there is a tradeoff: few covariates to achieve parsimony, but possibly many parameters to achieve good generalisation error.
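As a concrete, illustrative instance of steps 1–3, the sketch below takes MSE as the measure, polynomial fits as the candidate models, and picks the number of parameters by validation MSE. The synthetic data and the polynomial model family are assumptions made here for illustration, not taken from the lecture.

import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=300)
y = x**3 - 2 * x + rng.normal(scale=2.0, size=300)
x_tr, y_tr, x_va, y_va = x[:200], y[:200], x[200:], y[200:]

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# For each candidate (a polynomial of a given degree, i.e. degree + 1 parameters),
# estimate generalisation MSE on a held-out validation set and keep the best.
validation_mse = {}
for degree in range(1, 10):
    coeffs = np.polyfit(x_tr, y_tr, deg=degree)
    validation_mse[degree] = mse(y_va, np.polyval(coeffs, x_va))

best = min(validation_mse, key=validation_mse.get)
print(f"degree {best} minimises validation MSE ({validation_mse[best]:.3f})")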


Tree methods

Need-to-knows
1 How recursive binary partitioning of R^p works.
2 How to sketch a partitioning of R^2 on the basis of a series of simple binary splits.
3 How to go from a series of binary splitting rules to a tree representation and vice versa.


Supervised/Unsupervised learning

http://practiceovertheory.com/blog/2010/02/15/machine-learning-who-s-the-boss/


A classification problem....

For the Titanic worked examples in week 1, the models I described differed in subtle ways.

The regression and tree models both returned probabilities rather than the 1s and 0s returned by random forests.
I used random forests to impute missing values, estimate the relative importance of covariates, and estimate the misclassification rate.
The tree model supplied a confusion matrix, allowing more detailed error analysis than a simple misclassification rate (it is easy to derive confusion matrices for the other two).
Validation was a straightforward training/test data split – I didn't examine the overfit/underfit tradeoff in any detail.
I performed naïve covariate selection using my understanding of the problem domain – this is often a source of significant error in model development & validation.


Historical perspective

1960: Automatic Interaction Detection (AID), related to the clustering literature.
THAID, CHAID
ID3, C4.5, C5.0
CART (1984, Breiman et al.)


Recursive partitioning on R^p

Take an n × p matrix X, defining a p-dimensional space R^p. We wish to apply a simple rule recursively (see the sketch after this list):

1 Select a variable x_i and split on the basis of a single value x_i = a. We now have two spaces: x_i ≤ a and x_i > a.
2 Select one of the current sub-spaces, select a variable x_j, and split this sub-space on the basis of a single value x_j = b.
3 Repeatedly select sub-spaces and split in two.
4 Note that this process can extend beyond just the two dimensions represented by x_1 and x_2. If this were 3 dimensions (i.e. we include an x_3) then the partitions would be cubes. Beyond this the partitions are conceptually hyper-cubes.
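A minimal sketch (not from the slides) of this recursive rule in Python: each region of R^p is held as a list of row indices, and we repeatedly pick a region, a variable and a split value and cut the region in two. The choices made here (which region, which variable, splitting at the median) are arbitrary placeholders to show the mechanics; choosing them well is the subject of the later slides.

import numpy as np

def split_region(X, rows, var, value):
    # Split one region (a list of row indices) into x_var <= value and x_var > value.
    left = [i for i in rows if X[i, var] <= value]
    right = [i for i in rows if X[i, var] > value]
    return left, right

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 2))          # n = 20 observations, p = 2 covariates
regions = [list(range(len(X)))]        # start with all of R^p as one region

for _ in range(3):                     # apply the rule recursively
    rows = regions.pop(0)              # arbitrary choice of sub-space to split
    var = len(regions) % X.shape[1]    # arbitrary choice of variable
    value = float(np.median(X[rows, var]))
    regions.extend(split_region(X, rows, var, value))

print([len(r) for r in regions])       # sizes of the resulting sub-regions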


An arbitrary 2-D space

[Figure: the 2-D space spanned by X1 and X2, as yet unsplit]


Space splitting

[Figure: a single split – the space is split at X1 = a into regions Y = f(X1 < a) and Y = f(X1 >= a)]


Space splitting

[Figure: split of a subspace – the region X1 < a is split at X2 = b, giving regions Y = f(X1 >= a), Y = f(X2 >= b & X1 < a) and Y = f(X2 < b & X1 < a)]


Space splitting

[Figure: further splitting of a subspace – the region X1 >= a is split at X2 = c, giving regions Y = f(X2 >= c & X1 >= a), Y = f(X2 < c & X1 >= a), Y = f(X2 >= b & X1 < a) and Y = f(X2 < b & X1 < a)]


Space splitting

[Figure: further splitting – the region X2 >= c & X1 >= a is split at X1 = d, giving regions Y = f(X2 >= c & a <= X1 < d), Y = f(X2 >= c & X1 >= d), Y = f(X2 < c & X1 >= a), Y = f(X2 >= b & X1 < a) and Y = f(X2 < b & X1 < a)]


Space splitting

[Figure: a potential 3-D surface – the response Z plotted over the partitioned (X1, X2) space]


Binary partitioning process as a tree

An example tree diagram for a contrived partitioning:

X1 < a?
├── yes: X2 < b?
│        ├── yes: C1
│        └── no:  C2
└── no:  X2 < c?
         ├── yes: C3
         └── no:  X1 < d?
                  ├── yes: C4
                  └── no:  C5


Tree representation

The splitting points are called nodes – these have a binary splitting rule associated with them.
The two new spaces created by the split are represented by lines leaving the node; these are referred to as the branches. A tree with one split is a stump.
The nodes at the bottom of the diagram are referred to as the terminal nodes (or leaves) and collectively represent all the final partitions/subspaces of the data.
You can ‘drop’ a vector x down the tree to determine which subspace this coordinate falls into, as sketched below.
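A minimal sketch (not from the slides) of ‘dropping’ a point down the tree from the previous slide, with the tree stored as nested Python dictionaries. The numeric split values a = 2, b = 1, c = 3, d = 5 are arbitrary values chosen here for illustration.

# Internal nodes hold a (variable index, split value) rule; terminal nodes are labels.
tree = {
    "var": 0, "split": 2.0,                       # X1 < a ?
    "left": {"var": 1, "split": 1.0,              # X2 < b ?
             "left": "C1", "right": "C2"},
    "right": {"var": 1, "split": 3.0,             # X2 < c ?
              "left": "C3",
              "right": {"var": 0, "split": 5.0,   # X1 < d ?
                        "left": "C4", "right": "C5"}},
}

def drop(node, x):
    # Follow the binary splitting rules until a terminal node (a leaf label) is reached.
    while isinstance(node, dict):
        node = node["left"] if x[node["var"]] < node["split"] else node["right"]
    return node

print(drop(tree, [4.0, 6.0]))   # X1 >= a, X2 >= c, X1 < d  ->  'C4'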


Exercise

The following is the summary of a series of splits in R^2:

(x1 > 10)
(x1 ≤ 10) & (x2 ≤ 5)
(x1 ≤ 10) & (x2 > 5) & (x2 ≤ 10)

1 Sketch the progression of splits in 2 dimensions.
2 Produce a tree that summarises this series of splits.


Tree construction

We can model the response as a constant for each region (or, equivalently, leaf).
If we are minimising sums of squares, the optimal constant for a region/leaf is the average of the observed outputs for all inputs associated with that region/leaf.
Computing the optimal binary partition for given inputs and outputs is, in general, computationally intractable.
A greedy algorithm is used that finds an optimal variable and split point given an initial choice (or guess), then continues for the sub-regions.
This is quick to compute (sums of averages), but errors at the root lead to errors at the leaves. A sketch of a single greedy split search follows.
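A minimal sketch (not from the slides) of one step of that greedy search for a regression tree: within a single region, try every variable and every observed split point, predicting each side by its mean, and keep the split that most reduces the sum of squared errors. The exhaustive scan over observed values is an implementation assumption made here, not a detail taken from the lecture.

import numpy as np

def sse(y):
    # Sum of squared errors around the region's mean (its optimal constant).
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(X, y):
    # Greedy step: the (variable, value) pair minimising left SSE + right SSE.
    best = (None, None, sse(y))
    for j in range(X.shape[1]):
        for value in np.unique(X[:, j])[:-1]:     # candidate split points
            left, right = y[X[:, j] <= value], y[X[:, j] > value]
            total = sse(left) + sse(right)
            if total < best[2]:
                best = (j, float(value), total)
    return best

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 2))
y = np.where(X[:, 0] > 6, 5.0, 1.0) + rng.normal(scale=0.5, size=100)
print(best_split(X, y))    # should recover a split on variable 0 near 6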


How big should the tree be?

Tradeoff between bias and variance

Small tree – high bias, low variance: not big enough to capture the correct model structure.
Large tree – low bias, high variance: overfitting – in the extreme case each input is in exactly one region.
The optimal size should be adaptively chosen from the data.

We could stop splitting based on a threshold for decreases in the sum of squares, but this might rule out a useful split further down the tree.
Instead we construct a tree that is probably too large, and prune it by cost-complexity calculations – next lecture.
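As a preview of that grow-then-prune idea, here is a minimal sketch using scikit-learn (an assumption made here – the slide does not name any tooling): grow a deliberately large regression tree, then apply cost-complexity pruning with the penalty chosen by validation MSE.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 2))
y = np.sin(X[:, 0]) + 0.5 * (X[:, 1] > 5) + rng.normal(scale=0.3, size=400)
X_tr, y_tr, X_va, y_va = X[:300], y[:300], X[300:], y[300:]

# Grow a tree that is probably too large (no depth limit).
big = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

# Candidate cost-complexity penalties, then pick the one with the best validation MSE.
alphas = big.cost_complexity_pruning_path(X_tr, y_tr).ccp_alphas
scores = {}
for a in alphas:
    pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
    scores[a] = np.mean((y_va - pruned.predict(X_va)) ** 2)

best_alpha = min(scores, key=scores.get)
print(f"best ccp_alpha = {best_alpha:.4f}, validation MSE = {scores[best_alpha]:.3f}")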


Regression trees

Consider our general regression problem (note: this can also be classification):

y = f (X) + e

and the usual approximation model (linear in its parameters):

y = Xβ + e

‘Standard’ interactions have the form β_p(X1 X2).

These are simple in form but quite hard to interpret succinctly.
What is probably the simplest interaction form to interpret? Recursive binary splitting rules for the covariate space.
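For comparison, the tree-based alternative can be written down explicitly. The following is the standard regression-tree formulation (not reproduced from the slide itself): with the covariate space partitioned into regions R_1, ..., R_M, the fitted model is piecewise constant, and under squared-error loss each constant is the average response in its region.

\hat{f}(x) = \sum_{m=1}^{M} \hat{c}_m \, \mathbb{1}(x \in R_m),
\qquad
\hat{c}_m = \mathrm{ave}\left( y_i \mid x_i \in R_m \right)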


Advantages of tree models

Actually tree models in general, and CART in particular.

Nonparametric
  no probabilistic assumptions
Automatically performs variable selection
  important variables at or near the root
Any combination of continuous/discrete variables allowed
  in the Titanic example, no need to specify that the response is categorical
  so we can automatically bin massively categorical variables into a few categories, e.g. zip code, make/model, etc.


Advantages of tree models

Discovers interactions among variables
Handles missing values automatically
  using surrogate splits
Invariant to monotonic transformations of predictive variables
Not sensitive to outliers in predictive variables
Easy to spot when CART is struggling to capture a linear relationship (and therefore might not be suitable)
  repeated splits on the same variable
Good for data exploration, visualisation, multidisciplinary discussion
  in the Titanic example, gives hard values for "child" to support the heuristic "women & children first"


Disadvantages of tree models

Discrete output values, rather than continuous
  one response value for each of a finite number of leaf nodes
Trees can be large and hence hard to interpret
Can be unstable when covariates are correlated
  slightly different data gives completely different trees
Not good for describing linear relationships
Not always the best predictive model
  might be outperformed by NN, RF, SVM, etc.


Tree methods

Need-to-knows
1 How recursive binary partitioning of R^p works.
2 How to sketch a partitioning of R^2 on the basis of a series of simple binary splits.
3 How to go from a series of binary splitting rules to a tree representation and vice versa.
