
1

Classification with Decision Trees II

Instructor: Qiang Yang
Hong Kong University of Science and Technology

Qyang@cs.ust.hk

Thanks: Eibe Frank and Jiawei Han


2

Part II: Industrial-strength algorithms

Requirements for an algorithm to be useful in a wide range of real-world applications:

Can deal with numeric attributes
Doesn't fall over when missing values are present
Is robust in the presence of noise
Can (at least in principle) approximate arbitrary concept descriptions

Basic schemes (may) need to be extended to fulfill these requirements


3

Decision trees

Extending ID3 to deal with numeric attributes: pretty straightforward

Dealing sensibly with missing values: a bit trickier

Stability for noisy data: requires sophisticated pruning mechanism

End result of these modifications: Quinlan’s C4.5

Best-known and (probably) most widely-used learning algorithm

Commercial successor: C5.0


4

Numeric attributes

Standard method: binary splits (e.g. temp < 45)

Difference to nominal attributes: every attribute offers many possible split points

Solution is a straightforward extension:
Evaluate info gain (or other measure) for every possible split point of the attribute
Choose the "best" split point
Info gain for the best split point is the info gain for the attribute

Computationally more demanding


5

An example

Split on temperature attribute from weather data:

E.g. 4 yeses and 2 nos for temperature < 71.5, and 5 yeses and 3 nos for temperature >= 71.5

Info([4,2],[5,3]) = (6/14)info([4,2]) + (8/14)info([5,3]) = 0.939 bits

Split points are placed halfway between values
All split points can be evaluated in one pass!

Temperature: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
Play:        Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No
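To make the evaluation concrete, here is a minimal Python sketch (not part of the original slides) that scores every candidate split point on the temperature column above and reproduces the 0.939-bit figure for the 71.5 split; the helper names info and split_info are my own.

```python
import math

def info(counts):
    """Entropy in bits of a list of class counts, e.g. info([4, 2])."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def split_info(left, right):
    """Weighted average entropy of the two subsets produced by a binary split."""
    n = sum(left) + sum(right)
    return sum(left) / n * info(left) + sum(right) / n * info(right)

temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ['Y', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'N']

# Try every split point halfway between adjacent distinct values.  (C4.5 keeps
# running class counts so this needs only one pass; recounting here is O(n^2)
# but keeps the sketch short.)
best = None
for i in range(1, len(temps)):
    if temps[i] == temps[i - 1]:
        continue
    threshold = (temps[i] + temps[i - 1]) / 2
    left  = [labels[:i].count('Y'), labels[:i].count('N')]
    right = [labels[i:].count('Y'), labels[i:].count('N')]
    e = split_info(left, right)
    if best is None or e < best[1]:
        best = (threshold, e)

print(round(split_info([4, 2], [5, 3]), 3))   # 0.939 bits for the 71.5 split
print(best)                                   # split point with the lowest info
```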


6

Avoiding repeated sorting

Instances need to be sorted according to the values of the numeric attribute considered

Time complexity for sorting: O(n log n)
Does this have to be repeated at each node? No!
Sort order from the parent node can be used to derive the sort order for the children
Time complexity of the derivation: O(n)
Only drawback: need to create and store an array of sorted indices for each numeric attribute
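A small Python sketch (my own illustration, not the slides' code) of how a child's sort order can be derived from the parent's in a single O(n) pass; sorted_idx and goes_left are assumed inputs.

```python
def child_sort_orders(sorted_idx, goes_left):
    """Derive the children's sorted index arrays from the parent's in one
    O(n) pass instead of re-sorting each child in O(n log n).

    sorted_idx : instance indices ordered by the numeric attribute's value
    goes_left  : goes_left[i] is True if instance i is routed to the left child
    """
    left, right = [], []
    for i in sorted_idx:                                # walking in sorted order ...
        (left if goes_left[i] else right).append(i)     # ... keeps each child sorted
    return left, right

# Hypothetical example: instances 0..4 sorted by some attribute as 3,0,4,1,2;
# instances 0, 1 and 3 satisfy the split test and go left.
print(child_sort_orders([3, 0, 4, 1, 2], [True, True, False, True, False]))
# -> ([3, 0, 1], [4, 2])
```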


7

Notes on binary splits

Information in nominal attributes is computed using one multi-way split on that attribute

This is not the case for binary splits on numeric attributes

The same numeric attribute may be tested several times along a path in the decision tree

Disadvantage: tree is relatively hard to read
Possible remedies: pre-discretization of numeric attributes, or multi-way splits instead of binary ones


8

Example of Binary Split

[Figure: a path in a decision tree that repeatedly tests the same numeric attribute with binary splits: Age <= 3, then Age <= 5, then Age <= 10]


9

Missing values

C4.5 splits instances with missing values into pieces (with weights summing to 1)

A piece going down a particular branch receives a weight proportional to the popularity of the branch

Info gain etc. can be used with fractional instances using sums of weights instead of counts

During classification, the same procedure is used to split instances into pieces

Probability distributions are merged using weights
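As a rough illustration of the weighting scheme (a sketch under my own naming, not C4.5's actual code), an instance with a missing value can be split into fractional pieces like this:

```python
def distribute(instance_weight, branch_counts):
    """Split an instance with a missing test value into weighted pieces, one per
    branch, in proportion to the (weighted) number of training instances that
    went down each branch.  The pieces always sum to the original weight."""
    total = sum(branch_counts.values())
    return {branch: instance_weight * count / total
            for branch, count in branch_counts.items()}

# Hypothetical node on Outlook: 5 training instances went down 'sunny',
# 4 down 'overcast' and 5 down 'rainy'.  A unit-weight instance with
# Outlook missing becomes three fractional pieces:
print(distribute(1.0, {'sunny': 5, 'overcast': 4, 'rainy': 5}))
# {'sunny': 0.357..., 'overcast': 0.285..., 'rainy': 0.357...}
```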


10

Stopping Criteria

When all cases have the same class. The leaf node is labeled by this class.

When there is no available attribute. The leaf node is labeled by the majority class.

When the number of cases is less than a specified threshold. The leaf node is labeled by the majority class.


11

Pruning

Pruning simplifies a decision tree to prevent overfitting to noise in the data

Two main pruning strategies:
1. Postpruning: takes a fully-grown decision tree and discards unreliable parts
2. Prepruning: stops growing a branch when information becomes unreliable

Postpruning is preferred in practice because prepruning can stop growing too early


12

Prepruning

Usually based on a statistical significance test
Stops growing the tree when there is no statistically significant association between any attribute and the class at a particular node
Most popular test: the chi-squared test
ID3 used the chi-squared test in addition to information gain
Only statistically significant attributes were allowed to be selected by the information gain procedure


13

The Weather example: Observed Count

Observed counts for Outlook vs. Play:

Outlook         Play = Yes   Play = No   Outlook subtotal
Sunny               2            0              2
Cloudy              0            1              1
Play subtotal:      2            1       Total count in table = 3


14

The Weather example: Expected Count

Expected counts for Outlook vs. Play:

Outlook         Play = Yes        Play = No         Subtotal
Sunny           2*2/3 = 1.33      2*1/3 = 0.67          2
Cloudy          1*2/3 = 0.67      1*1/3 = 0.33          1
Subtotal:           2                 1          Total count in table = 3

If the attribute and the class were independent, the counts would look like this (expected count = row subtotal * column subtotal / total count).


15

Question: How different between observed and expected?

• If the chi-squared value is very large, then the two attributes (A1 and A2) are not independent, that is, they are dependent!

• Degrees of freedom: if the table has n*m cells, then the degrees of freedom = (n-1)*(m-1)

• If all attributes at a node are independent of the class attribute, then stop splitting further.
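A short Python sketch of the test (my own helper, not from the slides): it computes the chi-squared statistic and degrees of freedom for the Outlook-vs-Play table above; the statistic would then be compared against the chi-squared distribution to decide whether to stop splitting.

```python
def chi_squared(observed):
    """Chi-squared statistic and degrees of freedom for a contingency table,
    given as a list of rows.  Expected cell count = row total * column total
    divided by the grand total."""
    row_tot = [sum(r) for r in observed]
    col_tot = [sum(c) for c in zip(*observed)]
    grand = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_tot[i] * col_tot[j] / grand
            stat += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof

# The 2x2 Outlook-vs-Play table from the slides (rows Sunny/Cloudy, columns Yes/No):
print(chi_squared([[2, 0], [0, 1]]))   # -> (3.0, 1)
```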


16

Postpruning

Builds the full tree first and prunes it afterwards
Attribute interactions are visible in the fully-grown tree

Problem: identification of subtrees and nodes that are due to chance effects

Two main pruning operations:
1. Subtree replacement
2. Subtree raising

Possible strategies: error estimation, significance testing, MDL principle


17

Subtree replacement

Bottom-up: tree is considered for replacement once all its subtrees have been considered


18

Subtree raising

Deletes a node and redistributes its instances
Slower than subtree replacement (worthwhile?)


19

Estimating error rates

Pruning operation is performed if this does not increase the estimated error

Of course, error on the training data is not a useful estimator (would result in almost no pruning)

One possibility: using hold-out set for pruning (reduced-error pruning)

C4.5’s method: using upper limit of 25% confidence interval derived from the training data

Standard Bernoulli-process-based method


Training Set

Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N


21

Post-pruning in C4.5

Bottom-up pruning: at each non-leaf node v, if merging the subtree at v into a leaf node improves accuracy, perform the merging.
Method 1: compute accuracy using examples not seen by the algorithm.
Method 2: estimate accuracy using the training examples:
Consider classifying E examples incorrectly out of N examples as observing E events in N trials in the binomial distribution.
For a given confidence level CF, the upper limit on the error rate over the whole population is $U_{CF}(N, E)$ with CF% confidence.


22

Usage in Statistics: Sampling error estimation

Example:
population: 1,000,000 people, could be regarded as infinite
population mean: percentage of left-handed people
sample: 100 people
sample mean: 6 left-handed
How to estimate the REAL population mean?

[Figure: distribution ("Possibility (%)") of the estimate around the observed value of 6, with lower limit L_0.25(100, 6) near 2 and upper limit U_0.25(100, 6) near 10 bounding the 75% confidence interval; the pessimistic estimate is the upper limit U_0.25(100, 6)]


23

Usage in Decision Tree (DT): error estimation for some node in the DT

example:
unknown testing data: could be regarded as an infinite universe
population mean: percentage of errors made by this node
sample: 100 examples from the training data set
sample mean: 6 errors on the training data set
How to estimate the REAL average error rate?

[Figure: the same picture as before, now for the node's error rate: observed 6 errors out of 100, with L_0.25(100, 6) near 2 and U_0.25(100, 6) near 10 bounding the 75% confidence interval; the pessimistic estimate is the upper limit]

Heuristic! But works well...


24

C4.5’s method

Error estimate for subtree is weighted sum of error estimates for all its leaves

Error estimate for a node:

If c = 25% then z = 0.69 (from normal distribution)

f is the error on the training data
N is the number of instances covered by the leaf

$e = \frac{f + \frac{z^2}{2N} + z\sqrt{\frac{f}{N} - \frac{f^2}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}}$
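A minimal Python sketch of this formula (my own code, not C4.5's); with f = 2/6 and N = 6 it yields e ≈ 0.47, which matches the (f = 0.33, e = 0.47) leaf in the later example slide.

```python
import math

def pessimistic_error(f, N, z=0.69):
    """Upper confidence limit on a leaf's true error rate, using the normal
    approximation to the binomial (C4.5-style pessimistic estimate).
    f : error rate observed on the training instances at the leaf
    N : number of training instances covered by the leaf
    z : normal deviate for the chosen confidence, ~0.69 for c = 25%"""
    num = f + z * z / (2 * N) + z * math.sqrt(f / N - f * f / N + z * z / (4 * N * N))
    return num / (1 + z * z / N)

print(round(pessimistic_error(2 / 6, 6), 2))   # ~0.47 for f = 0.33, N = 6
```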


25

Example for Estimating Error

Consider a subtree rooted at Outlook with 3 leaf nodes:
Sunny: Play = yes (0 errors, 6 instances)
Overcast: Play = yes (0 errors, 9 instances)
Cloudy: Play = no (0 errors, 1 instance)

With $U_{0.25}(6,0) = 0.206$, $U_{0.25}(9,0) = 0.143$ and $U_{0.25}(1,0) = 0.750$, the estimated error for this subtree is 6*0.206 + 9*0.143 + 1*0.750 = 3.273

If the subtree is replaced with the single leaf "yes", the estimated error is 16*$U_{0.25}(16,1)$ = 16*0.157 = 2.512

Since 2.512 < 3.273, the pruning is performed and the subtree is merged into a leaf (see next page)
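For reference, a small Python sketch (my own, not C4.5's code) that computes the exact binomial upper limit by bisection; it reproduces 0.206, 0.143 and 0.750 above, while for (16, 1) it gives roughly 0.16 rather than the quoted 0.157, presumably because C4.5 computes the limit with its own approximation.

```python
from math import comb

def binom_cdf(e, n, p):
    """P(X <= e) for X ~ Binomial(n, p): probability of at most e errors."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(e + 1))

def upper_limit(n, e, cf=0.25):
    """U_CF(n, e): the error rate p at which observing e or fewer errors in n
    instances has probability CF, found by bisection on the binomial CDF."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if binom_cdf(e, n, mid) > cf:
            lo = mid          # CDF still too large -> the limit is higher
        else:
            hi = mid
    return hi

for n, e in [(6, 0), (9, 0), (1, 0), (16, 1)]:
    print(n, e, round(upper_limit(n, e), 3))
# (6,0) -> 0.206, (9,0) -> 0.143, (1,0) -> 0.75, (16,1) -> ~0.16
```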


26

Example continued

[Figure: the Outlook subtree, with branches sunny -> yes, overcast -> yes and cloudy -> no, is replaced by the single leaf "yes"]


27

Example

[Figure: a subtree whose three leaves have (f = 0.33, e = 0.47), (f = 0.5, e = 0.72) and (f = 0.33, e = 0.47); combining the leaf estimates using the ratios 6:2:6 gives 0.51, while the node treated as a single leaf has f = 5/14 and e = 0.46; since 0.46 < 0.51, the subtree is replaced by a leaf]


28

Complexity of tree induction

Assume m attributes, n training instances and a tree depth of O(log n)

Cost for building a tree: O(m n log n)
Complexity of subtree replacement: O(n)
Complexity of subtree raising: O(n (log n)^2)
  Every instance may have to be redistributed at every node between its leaf and the root: O(n log n)
  Cost for redistribution (on average): O(log n)

Total cost: O(m n log n) + O(n (log n)^2)


29

The CART Algorithm

Splitting criterion: the spread of the class values and the standard deviation reduction (SDR) of a split:

$SD(T) = \sum_{x \in T} P(x)\,(x - \bar{x})^2$

$SDR = sd(T) - \sum_i \frac{|T_i|}{|T|}\, sd(T_i)$

Node models: a linear combination of the attribute values,

$y = w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots + w_k x_k = \sum_{j=0}^{k} w_j x_j$

e.g. for the first training instance: $y^{(1)} = w_0 x_0^{(1)} + w_1 x_1^{(1)} + w_2 x_2^{(1)} + \dots + w_k x_k^{(1)}$

with the least-squares weights $W = (X^T X)^{-1} X^T y$
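A brief numpy sketch of the least-squares weight computation shown above, using made-up data (the target is generated from y = 1 + x1 + 3*x2, so the recovered weights should be close to [1, 1, 3]):

```python
import numpy as np

# Hypothetical data: 5 instances, a constant term x0 = 1 plus two attributes.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 1.0, 5.0],
              [1.0, 4.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 0.0, 4.0]])
y = np.array([12.0, 17.0, 11.0, 13.0, 13.0])   # generated from y = 1 + x1 + 3*x2

# Normal-equations form from the slide: W = (X^T X)^{-1} X^T y
W = np.linalg.inv(X.T @ X) @ X.T @ y
# Numerically it is usually better to call a least-squares solver directly:
W_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(W)      # ~[1. 1. 3.]
print(W_ls)   # same result
```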


30

Numeric prediction

Counterparts exist for all schemes that we previously discussed:
Decision trees, rule learners, SVMs, etc.

All classification schemes can be applied to regression problems using discretization
Prediction: weighted average of the intervals' midpoints (weighted according to class probabilities)

Regression is more difficult than classification (evaluated by mean squared error rather than percent correct)


31

Regression trees

Differences to decision trees:
Splitting criterion: minimizing intra-subset variation
Pruning criterion: based on a numeric error measure
Leaf node predicts the average class value of the training instances reaching that node

Can approximate piecewise constant functions
Easy to interpret
More sophisticated version: model trees


32

Model trees

Regression trees with linear regression functions at each node

Linear regression applied to instances that reach a node after full regression tree has been built

Only a subset of the attributes is used for LR: attributes occurring in the subtree (+ maybe attributes occurring in the path to the root)
Fast: overhead for LR is not large because usually only a small subset of attributes is used in the tree


33

Smoothing

Naïve method for prediction outputs value of LR for corresponding leaf node

Performance can be improved by smoothing predictions using internal LR models

Predicted value is weighted average of LR models along path from root to leaf

Smoothing formula: $p' = \frac{np + kq}{n + k}$, where p is the prediction passed up from the node below, q is the value predicted by the model at this node, n is the number of training instances that reach the node below, and k is a smoothing constant

The same effect can be achieved by incorporating the internal models into the leaf nodes
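A small Python sketch of applying the smoothing formula repeatedly on the way from a leaf back to the root (my own helper; the default k = 15 is an assumption, adjust as needed):

```python
def smooth(leaf_prediction, path_models, path_counts, k=15):
    """Blend a model tree's raw leaf prediction with the linear models met on
    the way back to the root, applying p' = (n*p + k*q) / (n + k) at each node.
    path_models : predictions q of the internal models, ordered leaf -> root
    path_counts : number n of training instances at the node below each of them
    k           : smoothing constant (15 is an assumed default here)"""
    p = leaf_prediction
    for q, n in zip(path_models, path_counts):
        p = (n * p + k * q) / (n + k)
    return p

# Hypothetical path: the leaf predicts 10.0; the two internal models on the way
# up predict 12.0 and 15.0, with 20 and 60 training instances below them.
print(round(smooth(10.0, [12.0, 15.0], [20, 60]), 2))   # ~11.69
```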


34

Building the tree

Splitting criterion: standard deviation reduction

Termination criteria (important when building trees for numeric prediction):

Standard deviation becomes smaller than certain fraction of sd for full training set (e.g. 5%)

Too few instances remain (e.g. less than four)

$SDR = sd(T) - \sum_i \frac{|T_i|}{|T|}\, sd(T_i)$
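A minimal Python sketch (my own code) of computing SDR for one candidate binary split:

```python
import statistics

def sdr(parent_values, subsets):
    """Standard deviation reduction of a candidate split:
       SDR = sd(T) - sum_i |T_i|/|T| * sd(T_i)"""
    n = len(parent_values)
    weighted = sum(len(s) / n * statistics.pstdev(s) for s in subsets)
    return statistics.pstdev(parent_values) - weighted

# Hypothetical numeric class values at a node and one candidate binary split
# that separates the low values from the high ones:
T = [5.0, 7.0, 8.0, 20.0, 22.0, 25.0]
print(round(sdr(T, [[5.0, 7.0, 8.0], [20.0, 22.0, 25.0]]), 2))   # large reduction, ~6.36
```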


35

Pruning

Pruning is based on the estimated absolute error of the LR models
Heuristic estimate: $\frac{n + v}{n - v} \times$ average absolute error, where n is the number of training instances at the node and v is the number of parameters in the model
LR models are pruned by greedily removing terms to minimize the estimated error
Model trees allow for heavy pruning: often a single LR model can replace a whole subtree
Pruning proceeds bottom-up: the error for the LR model at an internal node is compared to the error for the subtree


36

Nominal attributes

Nominal attributes are converted into binary attributes (that can be treated as numeric ones)

Nominal values are sorted using the average class value
If there are k values, k-1 binary attributes are generated: the i-th binary attribute is 0 if an instance's value is one of the first i in the ordering, 1 otherwise
It can be proven that the best split on one of the new attributes is the best binary split on the original
But M5' only does the conversion once
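A rough Python sketch of the conversion on a made-up three-valued attribute; the function name and data are mine, not the actual M5' implementation.

```python
def nominal_to_binary(values, targets):
    """Convert one k-valued nominal attribute into k-1 synthetic binary
    attributes.  The values are first ordered by their average target value;
    binary attribute number i (1-based) is 0 when the instance's value is
    among the first i categories of that ordering, and 1 otherwise."""
    cats = sorted(set(values),
                  key=lambda v: sum(t for x, t in zip(values, targets) if x == v)
                                / values.count(v))
    rank = {v: i for i, v in enumerate(cats)}
    k = len(cats)
    return [[0 if rank[x] < i else 1 for i in range(1, k)] for x in values]

# Made-up attribute with three values and a numeric target:
motor = ['A', 'B', 'C', 'A', 'C', 'B']
y     = [1.0, 4.0, 9.0, 2.0, 8.0, 5.0]
print(nominal_to_binary(motor, y))
# averages: A=1.5 < B=4.5 < C=8.5, so A -> [0, 0], B -> [1, 0], C -> [1, 1]
```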


37

Missing values

Modified splitting criterion:

$SDR = \frac{m}{|T|} \times \left[ sd(T) - \sum_i \frac{|T_i|}{|T|}\, sd(T_i) \right]$

where m is the number of instances without missing values for the attribute and T is the set of instances that reach the node

Procedure for deciding into which subset the instance goes: surrogate splitting
Choose the attribute for splitting that is most highly correlated with the original attribute
Problem: complex and time-consuming
Simple solution: always use the class
Testing: replace the missing value with the average


38

Pseudo-code for M5’

Four methods:
Main method: MakeModelTree()
Method for splitting: split()
Method for pruning: prune()
Method that computes error: subtreeError()

We'll briefly look at each method in turn
The linear regression method is assumed to perform attribute subset selection based on error


39

MakeModelTree()

MakeModelTree(instances)
{
  SD = sd(instances)
  for each k-valued nominal attribute
    convert into k-1 synthetic binary attributes
  root = newNode
  root.instances = instances
  split(root)
  prune(root)
  printTree(root)
}


40

split()

split(node)
{
  if sizeof(node.instances) < 4 or sd(node.instances) < 0.05*SD
    node.type = LEAF
  else
    node.type = INTERIOR
    for each attribute
      for all possible split positions of the attribute
        calculate the attribute's SDR
    node.attribute = attribute with maximum SDR
    split(node.left)
    split(node.right)
}


41

prune()

prune(node)
{
  if node = INTERIOR then
    prune(node.leftChild)
    prune(node.rightChild)
    node.model = linearRegression(node)
    if subtreeError(node) > error(node) then
      node.type = LEAF
}


42

subtreeError()

subtreeError(node)
{
  l = node.left; r = node.right
  if node = INTERIOR then
    return (sizeof(l.instances)*subtreeError(l)
            + sizeof(r.instances)*subtreeError(r)) / sizeof(node.instances)
  else
    return error(node)
}


43

Model tree for servo data


44

Variations of CART

Applying logistic regression:
predict the probability of "True" or "False" instead of making a numeric-valued prediction
predict a probability value (p) rather than the outcome itself
the log odds ratio is modeled as a linear function of the attributes:

$\log\!\left(\frac{p}{1 - p}\right) = \sum_i W_i X_i$

so that

$p = \frac{1}{1 + e^{-WX}}$
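A tiny Python sketch of these two formulas with made-up weights (the weight and attribute values are arbitrary):

```python
import math

def log_odds(p):
    """log(p / (1 - p)): the quantity modeled as the linear sum of W_i * X_i."""
    return math.log(p / (1 - p))

def predict_probability(weights, x):
    """Invert the log-odds model: p = 1 / (1 + exp(-W.X))."""
    wx = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-wx))

w = [-1.0, 0.8, 0.5]        # arbitrary weights, with x[0] = 1 acting as the bias term
x = [1.0, 2.0, 1.0]
p = predict_probability(w, x)
print(round(p, 3), round(log_odds(p), 3))   # log_odds(p) equals w.x = 1.1
```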


45

Other Trees

• Classification Trees (misclassification criterion)
  Current node: $Q = \min(p_0, p_1)$
  Children nodes (L, R): $Q = p_L \min(p_{L0}, p_{L1}) + p_R \min(p_{R0}, p_{R1})$

• Decision Trees (entropy criterion)
  Current node: $Q = -p_0 \log p_0 - p_1 \log p_1$
  Children nodes (L, R): $Q = p_L Q_L + p_R Q_R$

• GINI index used in CART
  Current node: $Q = p_0\, p_1$
  Children nodes (L, R): $Q = p_L Q_L + p_R Q_R$
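For illustration, a short Python sketch (my own code) of the three node criteria and the common way of combining the children, evaluated on a made-up split:

```python
import math

def misclassification(p0, p1):
    """Q = min(p0, p1): error if the node predicts its majority class."""
    return min(p0, p1)

def entropy(p0, p1):
    """Q = -p0*log(p0) - p1*log(p1)."""
    return -sum(p * math.log2(p) for p in (p0, p1) if p > 0)

def gini(p0, p1):
    """Q = p0*p1, the two-class Gini-style criterion from the slide."""
    return p0 * p1

def children_quality(measure, pL, pL0, pL1, pR, pR0, pR1):
    """Children combined as Q = pL*Q(L) + pR*Q(R)."""
    return pL * measure(pL0, pL1) + pR * measure(pR0, pR1)

# Made-up split: 60% of instances go left with a 90/10 class mix,
# 40% go right with a 30/70 class mix.
for m in (misclassification, entropy, gini):
    print(m.__name__, round(children_quality(m, 0.6, 0.9, 0.1, 0.4, 0.3, 0.7), 3))
```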


47

Efforts on Scalability

Most algorithms assume data can fit in memory.
Recent efforts focus on disk-resident implementations for decision trees:
Random sampling
Partitioning

Examples:
SLIQ (EDBT'96 -- [MAR96])
SPRINT (VLDB'96 -- [SAM96])
PUBLIC (VLDB'98 -- [RS98])
RainForest (VLDB'98 -- [GRG98])