Machine Learning
Lecture # 5: Decision Trees for Numeric Data
What if attributes are numerical?
Discretization
• Process of converting an interval/numerical variable into a finite (discrete) set of elements (labels)
• A discretization algorithm computes a series of cut points which define a set of intervals that are mapped to labels
• Motivations: many methods (also in mathematics and computer science) cannot deal with numerical variables; in these cases, discretization is required to be able to use them, despite the loss of information
• Several approaches:
  – Supervised vs. unsupervised
  – Static vs. dynamic
Unsupervised vs Supervised Discretization
• Unsupervised discretization just uses the attribute values, for instance, discretize humidity as <70, 70-79, >=80
• Supervised discretization also uses the class attribute to generate intervals with lower entropy
• Example using humidity
– Values alone may suggest <60, 60-70
– Considering the class value might suggest different intervals by grouping
values to maintain information about the class attribute
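The difference can be made concrete with a small sketch. Below is a minimal example of unsupervised equal-width binning in plain Python; the humidity values are made up for illustration, and a supervised method would additionally look at the class labels when placing the cut points:

def equal_width_cut_points(values, n_bins=3):
    # Unsupervised: cut points depend only on the attribute values, not the class
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

humidity = [65, 70, 70, 75, 78, 80, 85, 90, 95, 96]   # made-up values
print(equal_width_cut_points(humidity))               # two cut points defining three bins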
The Temperature attribute
• First, sort the temperature values, including the class labels
• Then, check all the cut points and choose the one with the best information gain
• e.g. temperature < 71.5: yes/4, no/2
temperature ≥ 71.5: yes/5, no/3
• Info([4,2],[5,3]) = 6/14 info([4,2]) + 8/14 info([5,3]) = 0.939
• Place split points halfway between values
• Can evaluate all split points in one pass!
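A minimal sketch of this procedure in plain Python (helper names are illustrative, not from any library): it sorts the values once, tries the midpoint between each pair of adjacent distinct values, and returns the cut point with the lowest weighted child entropy, i.e. the highest information gain. This straightforward version rebuilds the child lists at every candidate cut; the "one pass" trick instead maintains running class counts while sweeping left to right.

import math
from collections import Counter

def entropy(labels):
    # info([p, n]) = -sum_i (c_i / N) * log2(c_i / N)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_numeric_split(values, labels):
    pairs = sorted(zip(values, labels))
    best_cut, best_info = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                  # no cut point between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2     # split point halfway between values
        left  = [c for v, c in pairs if v < cut]
        right = [c for v, c in pairs if v >= cut]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if info < best_info:
            best_cut, best_info = cut, info
    return best_cut, best_info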
Person   Hair Length   Weight   Age   Class
Homer    0”            250      36    M
Marge    10”           150      34    F
Bart     2”            90       10    M
Lisa     6”            78       8     F
Maggie   4”            20       1     F
Abe      1”            170      70    M
Selma    8”            160      41    F
Otto     10”           180      38    M
Krusty   6”            200      45    M
Comic    8”            290      38    ?
Let us try splitting on Hair Length

Hair Length <= 5?
  yes: 3M, 1F
  no:  2M, 3F

Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
Entropy(4F,5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Gain(A) = E(current set) - Σ E(all child sets)
Gain(Hair Length <= 5) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911
Let us try splitting on Weight

Weight <= 160?
  yes: 1M, 4F
  no:  4M, 0F

Entropy(4F,5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911
Gain(Weight <= 160) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900
Let us try splitting on Age

Age <= 40?
  yes: 3M, 3F
  no:  2M, 1F

Entropy(4F,5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911
Gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183
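As a check, a small Python sketch reproduces all three gains from the table above (the Comic row is left out because its class is unknown):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(values, labels, test):
    # Information gain of splitting the node on the boolean test(value)
    left  = [c for v, c in zip(values, labels) if test(v)]
    right = [c for v, c in zip(values, labels) if not test(v)]
    n = len(labels)
    return entropy(labels) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

hair   = [0, 10, 2, 6, 4, 1, 8, 10, 6]               # Homer ... Krusty
weight = [250, 150, 90, 78, 20, 170, 160, 180, 200]
age    = [36, 34, 10, 8, 1, 70, 41, 38, 45]
sex    = ["M", "F", "M", "F", "F", "M", "F", "M", "M"]

print(gain(hair,   sex, lambda v: v <= 5))    # ~0.0911
print(gain(weight, sex, lambda v: v <= 160))  # ~0.5900
print(gain(age,    sex, lambda v: v <= 40))   # ~0.0183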
Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the people at or below 160 are not perfectly classified... So we simply recurse! This time we find that we can split on Hair Length, and we are done!

Weight <= 160?
  no  → Male
  yes → Hair Length <= 2?
          yes → Male
          no  → Female

We don't need to keep the data around, just the test conditions.

How would these people be classified?
It is trivial to convert Decision Trees to rules...

Rules to Classify Males/Females:
  If Weight greater than 160, classify as Male
  Else if Hair Length less than or equal to 2, classify as Male
  Else classify as Female
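As a sketch, the same rules written as a plain Python function (attribute names follow the table above):

def classify(weight, hair_length):
    # Rule form of the learned tree: test Weight first, then Hair Length
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"

print(classify(290, 8))   # the "Comic" row -> "Male"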
[Figure: a decision tree splitting on "Wears green?" with Male/Female leaves]

The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting the data... When you have few datapoints, there are many possible splitting rules that perfectly classify the data but will not generalize to future datasets.

For example, the rule "Wears green?" perfectly classifies the data, and so does "Mother's name is Jacqueline?", and so does "Has blue shoes"...
Avoid Overfitting in Classification

• The generated tree may overfit the training data
  – Too many branches, some may reflect anomalies due to noise or outliers
  – Results in poor accuracy for unseen samples
• Two approaches to avoid overfitting
  – Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    • Difficult to choose an appropriate threshold
  – Postpruning: Remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
    • Use a set of data different from the training data to decide which is the "best pruned tree"
[Figure: the three "Pigeon Problem" scatter plots]
Which of the "Pigeon Problems" can be solved by a Decision Tree?
1) Deep Bushy Tree
2) Useless
3) Deep Bushy Tree
The Decision Tree has a hard time with correlated attributes.
Advantages/Disadvantages of Decision Trees

• Advantages:
  – Easy to understand (Doctors love them!)
  – Easy to generate rules
• Disadvantages:
  – May suffer from overfitting.
  – Classifies by rectangular partitioning (so does not handle correlated features very well).
  – Can be quite large; pruning is necessary.
  – Does not handle streaming data easily.
Decision Tree with Real Features
Example from MIT course
Major issues

Q1: Choosing the best attribute: what quality measure to use?
Q2: Determining when to stop splitting: avoid overfitting
Q3: Handling continuous attributes
Q4: Handling training data with missing attribute values
Q5: Handling attributes with different costs
Q6: Dealing with a continuous goal attribute
Q1: What quality measure
• Information gain
• Gain Ratio
• …
How to avoid Overfitting
• Stop growing the tree earlier
– Ex: InfoGain < threshold
– Ex: Number of examples in a node < threshold
– …
• Grow full tree, then post-prune
In practice, the latter works better than the former.
Post-pruning

• Split data into training and validation set
• Do until further pruning is harmful:
  – Evaluate the impact on the validation set of pruning each possible node (plus those below it)
  – Greedily remove the one whose removal most improves performance on the validation set
• Produces a smaller tree with the best performance measure
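A minimal sketch of this greedy reduced-error pruning loop, assuming an accuracy(tree, data) helper, an internal_nodes(tree) iterator, and a pruned flag that makes a node behave as a majority-class leaf; all three are illustrative assumptions, not a real library API:

def reduced_error_prune(tree, validation_data):
    # Keep pruning while some node can be turned into a leaf without
    # hurting accuracy on the held-out validation set.
    improved = True
    while improved:
        improved = False
        best_node, best_acc = None, accuracy(tree, validation_data)
        for node in internal_nodes(tree):
            node.pruned = True                   # temporarily replace the subtree by a leaf
            acc = accuracy(tree, validation_data)
            node.pruned = False
            if acc >= best_acc:                  # prefer the prune that helps most
                best_node, best_acc = node, acc
        if best_node is not None:
            best_node.pruned = True              # make the best prune permanent
            improved = True
    return tree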
Performance measure
• Accuracy:
  – on validation data
  – K-fold cross validation
• Misclassification cost: Sometimes more accuracy is desired for some classes than others.
• MDL: size(tree) + errors(tree)
Rule post-pruning
• Convert tree to equivalent set of rules
• Prune each rule independently of others
• Sort final rules into desired sequence for use
• Perhaps most frequently used method (e.g., C4.5)
Q3: Handling numeric attributes
• Continuous attribute → discrete attribute
• Example
– Original attribute: Temperature = 82.5
– New attribute: (temperature > 72.3) = t, f
Question: how to choose thresholds?
Q4: Unknown attribute values
Assume an attribute can take the value “blank”.
• Assign the most common value of A among the training data at node n.
• Assign the most common value of A among the training examples at node n that have the same target class.
• Assign probability pi to each possible value vi of A, and pass a fraction (pi) of the example down to each descendant in the tree. This method is used in C4.5.
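A tiny sketch of the first strategy (impute the most common observed value of the attribute); the helper name is made up for illustration:

from collections import Counter

def impute_most_common(value, observed_values):
    # Replace a missing value (None) with the most frequent non-missing value
    if value is None:
        return Counter(v for v in observed_values if v is not None).most_common(1)[0][0]
    return value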
Q5: Attributes with cost

• Consider that a medical diagnosis test (e.g., a blood test) has a cost
• Question: how to learn a consistent tree with low expected cost?
• One approach: replace gain by

  Gain²(S, A) / Cost(A)

  – Tan and Schlimmer (1990)
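As a one-line sketch of that criterion (the gain value would come from the usual information-gain computation; the cost numbers below are made up):

def cost_sensitive_gain(gain_value, cost):
    # Tan and Schlimmer (1990): prefer informative attributes that are cheap to measure
    return gain_value ** 2 / cost

print(cost_sensitive_gain(0.4, 3.0))   # expensive blood test
print(cost_sensitive_gain(0.2, 1.0))   # cheap question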
Common algorithms
• ID3
• C4.5
• CART
ID3
• Proposed by Quinlan (so is C4.5)
• Can handle basic cases: discrete attributes, no missing information, etc.
• Information gain as quality measure
C4.5
• An extension of ID3:
– Several quality measures
– Incomplete information (missing attribute values)
– Numerical (continuous) attributes
– Pruning of decision trees
– Rule derivation
– Random mode and batch mode
CART
• CART (classification and regression tree)
• Proposed by Breiman et al. (1984)
• Constant numerical values in leaves
• Variance as measure of impurity
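A minimal sketch of variance as an impurity measure for a regression split (plain Python, illustrative only, not CART itself):

def variance(ys):
    # Impurity of a node holding numeric target values
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def variance_reduction(left_ys, right_ys):
    # Split quality: parent variance minus weighted child variance
    ys = left_ys + right_ys
    n = len(ys)
    return variance(ys) - (len(left_ys) / n) * variance(left_ys) - (len(right_ys) / n) * variance(right_ys)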
Strengths of decision tree methods
• Ability to generate understandable rules
• Ease of calculation at classification time
• Ability to handle both continuous and categorical variables
• Ability to clearly indicate best attributes
The weaknesses of decision tree methods
• Greedy algorithm: no global optimization
• Error-prone with too many classes: the number of training examples per node shrinks quickly in a tree with many levels/branches.
• Expensive to train: sorting, combinations of attributes, calculating quality measures, etc.
• Trouble with non-rectangular regions: the rectangular classification boxes may not correspond well with the actual distribution of records in the decision space.
Acknowledgements
Introduction to Machine Learning, Alpaydin
Statistical Pattern Recognition: A Review – A.K. Jain et al., PAMI (22), 2000
Pattern Recognition and Analysis Course – A.K. Jain, MSU
"Pattern Classification" by Duda et al., John Wiley & Sons
http://www.doc.ic.ac.uk/~sgc/teaching/pre2012/v231/lecture13.html
Some material adopted from slides by Dr. Adam Prugel-Bennett, Dr. Andrew Ng, and Dr. Amanullah
Material in these slides has been taken from the resources listed above.