Machine Learning

Lecture # 5: Decision Trees for Numeric Data


Page 1

Machine Learning

Lecture # 5: Decision Trees for Numeric Data

Page 2

What if attributes are numerical?

Page 3

Discretization

• The process of converting an interval/numerical variable into a finite (discrete) set of elements (labels)

• A discretization algorithm computes a series of cut-points that define a set of intervals, which are then mapped to labels (see the sketch after this list)

• Motivation: many methods (in mathematics and computer science as well) cannot deal with numerical variables; in these cases discretization is required to be able to use them, despite the loss of information

• Several approaches

– Supervised vs unsupervised

– Static vs dynamic
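As a concrete illustration, here is a minimal sketch of the mapping step (function and label names are our own, not from the lecture): given sorted cut-points, each numeric value is assigned the label of the interval it falls into. The cut-points below happen to match the humidity intervals on the next slide.

    import bisect

    def discretize(value, cut_points, labels):
        """Map a numeric value to the label of its interval.
        cut_points must be sorted and len(labels) == len(cut_points) + 1."""
        return labels[bisect.bisect_right(cut_points, value)]

    # Intervals <70, 70-79, >=80, as in the humidity example that follows
    for humidity in (65, 70, 79, 80):
        print(humidity, discretize(humidity, [70, 80], ["low", "mid", "high"]))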

Page 4

Unsupervised vs Supervised Discretization

• Unsupervised discretization uses only the attribute values; for instance, discretize humidity as <70, 70-79, >=80

• Supervised discretization also uses the class attribute, to generate intervals with lower entropy

• Example using humidity:

– The values alone may suggest <60, 60-70

– Considering the class value might suggest different intervals, grouping values so as to preserve information about the class attribute

Page 5

The Temperature attribute

• First, sort the temperature values, including the class labels

• Then, check all the cut points and choose the one with the best information gain

• e.g., temperature < 71.5: yes/4, no/2; temperature >= 71.5: yes/5, no/3

• Info([4,2],[5,3]) = 6/14 × info([4,2]) + 8/14 × info([5,3]) = 0.939

• Place split points halfway between adjacent values

• Can evaluate all split points in one pass! (a sketch follows below)
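A minimal sketch of this procedure (our own illustration, not code from the lecture): sort the values once, score the halfway point between each pair of adjacent distinct values by information gain, and keep the best cut.

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def best_split(values, labels):
        """Best halfway cut point for one numeric attribute, by information gain."""
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        base = entropy(labels)
        best_gain, best_cut = -1.0, None
        for i in range(1, n):
            if pairs[i - 1][0] == pairs[i][0]:
                continue  # no cut point between duplicate values
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # halfway between neighbours
            left = [lab for _, lab in pairs[:i]]
            right = [lab for _, lab in pairs[i:]]
            gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
            if gain > best_gain:
                best_gain, best_cut = gain, cut
        return best_cut, best_gain

Because the values are sorted once and the class counts on each side could be updated incrementally, all cut points can indeed be evaluated in a single pass; the list rebuilds above trade that efficiency for clarity.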

Page 6

Person   Hair Length   Weight   Age   Class
Homer    0"            250      36    M
Marge    10"           150      34    F
Bart     2"            90       10    M
Lisa     6"            78       8     F
Maggie   4"            20       1     F
Abe      1"            170      70    M
Selma    8"            160      41    F
Otto     10"           180      38    M
Krusty   6"            200      45    M
Comic    8"            290      38    ?
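Before trying splits on this data, a minimal sketch (the Python representation is ours) encodes the nine labeled rows and checks the base entropy used on the next three slides:

    import math
    from collections import Counter

    # The nine labeled training examples; Comic is the unlabeled query.
    data = [
        # (name, hair_length, weight, age, cls)
        ("Homer",  0, 250, 36, "M"),
        ("Marge", 10, 150, 34, "F"),
        ("Bart",   2,  90, 10, "M"),
        ("Lisa",   6,  78,  8, "F"),
        ("Maggie", 4,  20,  1, "F"),
        ("Abe",    1, 170, 70, "M"),
        ("Selma",  8, 160, 41, "F"),
        ("Otto",  10, 180, 38, "M"),
        ("Krusty", 6, 200, 45, "M"),
    ]

    labels = [row[4] for row in data]          # 4 F, 5 M
    n = len(labels)
    base = -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())
    print(round(base, 4))                      # 0.9911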

Page 7

Let us try splitting on Hair Length.

Hair Length <= 5?
  yes / no

Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))

Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911

Gain(A) = E(current set) - Σ E(all child sets)

Gain(Hair Length <= 5) = 0.9911 - (4/9 × 0.8113 + 5/9 × 0.9710) = 0.0911

Page 8

Let us try splitting on Weight.

Weight <= 160?
  yes / no

Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911

Gain(Weight <= 160) = 0.9911 - (5/9 × 0.7219 + 4/9 × 0) = 0.5900

(Same Entropy and Gain definitions as on the previous slide.)

Page 9

Let us try splitting on Age.

Age <= 40?
  yes / no

Entropy(4F,5M) = -(4/9)log2(4/9) - (5/9)log2(5/9) = 0.9911

Gain(Age <= 40) = 0.9911 - (6/9 × 1 + 3/9 × 0.9183) = 0.0183

(Same Entropy and Gain definitions as before; all three gains are reproduced in the sketch below.)
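A few lines are enough to reproduce all three gains (a sketch; the helper names and row encoding are ours):

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    # (hair_length, weight, age, cls) for the nine labeled people
    rows = [(0, 250, 36, "M"), (10, 150, 34, "F"), (2, 90, 10, "M"),
            (6, 78, 8, "F"), (4, 20, 1, "F"), (1, 170, 70, "M"),
            (8, 160, 41, "F"), (10, 180, 38, "M"), (6, 200, 45, "M")]

    def gain(col, threshold):
        labels = [r[3] for r in rows]
        left = [r[3] for r in rows if r[col] <= threshold]
        right = [r[3] for r in rows if r[col] > threshold]
        n = len(rows)
        return entropy(labels) - (len(left) / n) * entropy(left) \
                               - (len(right) / n) * entropy(right)

    print(round(gain(0, 5), 4))    # Hair Length <= 5  ->  0.0911
    print(round(gain(1, 160), 4))  # Weight <= 160     ->  0.5900
    print(round(gain(2, 40), 4))   # Age <= 40         ->  0.0183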

Page 10

Weight <= 160?
  yes → Hair Length <= 2?
  no  → Male

Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the under-160 people are not perfectly classified... So we simply recurse! This time we find that we can split on Hair Length, and we are done! (A sketch of the recursion follows below.)
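A minimal recursive sketch of this greedy procedure (our own illustration, not the lecture's code): pick the attribute/threshold pair with the highest information gain, split, and recurse until each node is pure.

    import math
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def build(rows, label_of, features):
        """Return a leaf label, or (feature, threshold, left_subtree, right_subtree)."""
        labels = [label_of(r) for r in rows]
        if len(set(labels)) == 1:
            return labels[0]                      # pure node: emit a leaf
        best = None                               # (gain, feature, threshold)
        for f in features:
            vals = sorted({f(r) for r in rows})
            for lo, hi in zip(vals, vals[1:]):
                t = (lo + hi) / 2                 # halfway candidate cut
                left = [label_of(r) for r in rows if f(r) <= t]
                right = [label_of(r) for r in rows if f(r) > t]
                g = entropy(labels) - (len(left) / len(rows)) * entropy(left) \
                                    - (len(right) / len(rows)) * entropy(right)
                if best is None or g > best[0]:
                    best = (g, f, t)
        if best is None or best[0] <= 0:
            return Counter(labels).most_common(1)[0][0]  # no useful split: majority leaf
        _, f, t = best
        return (f, t,
                build([r for r in rows if f(r) <= t], label_of, features),
                build([r for r in rows if f(r) > t], label_of, features))

On the toy table the learned thresholds land at the halfway points (Weight <= 165, then Hair Length <= 3) rather than the slides' round 160 and 2, but they classify the nine training examples identically.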

Page 11

Weight <= 160?
  yes → Hair Length <= 2?
          yes → Male
          no  → Female
  no  → Male

We don't need to keep the data around, just the test conditions. How would these people be classified?

Page 12

It is trivial to convert Decision Trees to rules...

Weight <= 160?
  yes → Hair Length <= 2?
          yes → Male
          no  → Female
  no  → Male

Rules to Classify Males/Females (also written as code below):

If Weight greater than 160, classify as Male
Elseif Hair Length less than or equal to 2, classify as Male
Else classify as Female
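The same rule list as a small Python function (a hypothetical rendering for illustration; the function name and signature are ours):

    def classify(weight, hair_length):
        """Rule form of the learned tree for the toy Males/Females example."""
        if weight > 160:
            return "Male"
        elif hair_length <= 2:
            return "Male"
        else:
            return "Female"

    print(classify(weight=290, hair_length=8))  # the Comic row -> "Male"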

Page 13

Wears green?
  Yes / No
  Male / Female

The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting the data... When you have few datapoints, there are many possible splitting rules that perfectly classify the data but will not generalize to future datasets.

For example, the rule "Wears green?" perfectly classifies the data; so does "Mother's name is Jacqueline?"; so does "Has blue shoes"...

Page 14

Avoid Overfitting in Classification

• The generated tree may overfit the training data

– Too many branches, some of which may reflect anomalies due to noise or outliers

– The result is poor accuracy for unseen samples

• Two approaches to avoid overfitting

– Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold

• Difficult to choose an appropriate threshold

– Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees

• Use a set of data different from the training data to decide which is the "best pruned tree"

Page 15

[Figure: three 2-D "Pigeon Problem" scatter plots]

Which of the "Pigeon Problems" can be solved by a Decision Tree?

1) Deep Bushy Tree

2) Useless

3) Deep Bushy Tree

The Decision Tree has a hard time with correlated attributes.

Page 16

Advantages/Disadvantages of Decision Trees

• Advantages:

– Easy to understand (Doctors love them!)

– Easy to generate rules

• Disadvantages:

– May suffer from overfitting

– Classifies by rectangular partitioning (so does not handle correlated features very well)

– Can be quite large; pruning is necessary

– Does not handle streaming data easily

Page 17

Decision Tree with Real Features

Example from MIT course

[Pages 18-33: figure-only slides continuing the MIT-course example]
Page 34

Major issues

Q1: Choosing the best attribute: what quality measure to use?

Q2: Determining when to stop splitting: avoid overfitting

Q3: Handling continuous attributes

Q4: Handling training data with missing attribute values

Q5: Handling attributes with different costs

Q6: Dealing with a continuous goal attribute

Page 35

Q1: What quality measure

• Information gain

• Gain Ratio

• …

Page 36

How to avoid Overfitting

• Stop growing the tree earlier

– Ex: InfoGain < threshold

– Ex: number of examples in a node < threshold

– …

• Grow full tree, then post-prune

In practice, the latter works better than the former.

Page 37

Post-pruning

• Split data into a training and a validation set

• Do until further pruning is harmful:

– Evaluate the impact on the validation set of pruning each possible node (plus those below it)

– Greedily remove the ones that don't improve performance on the validation set

Produces a smaller tree with the best performance measure (a sketch follows below)
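A compact sketch of this reduced-error pruning loop (entirely our own illustration), assuming the nested-tuple trees built earlier: (feature, threshold, left, right) for internal nodes and a class label for leaves. For brevity a collapsed node predicts the majority among its leaf labels, where classic reduced-error pruning uses the majority of training examples at that node.

    from collections import Counter

    def predict(tree, x):
        """Walk a nested-tuple tree down to a leaf label."""
        while isinstance(tree, tuple):
            f, t, left, right = tree
            tree = left if f(x) <= t else right
        return tree

    def accuracy(tree, val):
        return sum(predict(tree, x) == y for x, y in val) / max(len(val), 1)

    def leaf_labels(tree):
        if not isinstance(tree, tuple):
            return [tree]
        return leaf_labels(tree[2]) + leaf_labels(tree[3])

    def prune(tree, val):
        """Bottom-up: collapse a subtree into a majority leaf whenever that
        does not hurt accuracy on the validation examples reaching it."""
        if not isinstance(tree, tuple):
            return tree
        f, t, left, right = tree
        val_l = [(x, y) for x, y in val if f(x) <= t]
        val_r = [(x, y) for x, y in val if f(x) > t]
        pruned = (f, t, prune(left, val_l), prune(right, val_r))
        leaf = Counter(leaf_labels(pruned)).most_common(1)[0][0]
        return leaf if accuracy(leaf, val) >= accuracy(pruned, val) else pruned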

Page 38

Performance measure

• Accuracy

– on validation data

– K-fold cross validation

• Misclassification cost: sometimes more accuracy is desired for some classes than for others

• MDL: size(tree) + errors(tree)

Page 39

Rule post-pruning

• Convert tree to equivalent set of rules

• Prune each rule independently of others

• Sort final rules into desired sequence for use

• Perhaps the most frequently used method (e.g., in C4.5)

Page 40

Q3: handling numeric attributes

• Continuous attribute → discrete attribute

• Example

– Original attribute: Temperature = 82.5

– New attribute: (Temperature > 72.3), taking values t or f

Question: how to choose thresholds?

Page 41

Q4: Unknown attribute values

Assume an attribute can take the value "blank".

• Assign the most common value of A among the training data at node n.

• Assign the most common value of A among the training data at node n which have the same target class.

• Assign probability p_i to each possible value v_i of A, and assign a fraction (p_i) of the example to each descendant in the tree. This method is used in C4.5 (a sketch follows below).
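A minimal sketch of that fractional-weight idea (the representation is ours, not C4.5's actual code): each training example carries a weight, initially 1, and an example with a missing value is sent down every branch with its weight scaled by the branch's share of the known-value examples.

    def split_with_missing(examples, attr):
        """Distribute (example, weight) pairs among the branches of a test
        on `attr`; `example` is a dict and a missing value is None.
        Assumes at least one example has a known value for `attr`."""
        known = [(e, w) for e, w in examples if e[attr] is not None]
        missing = [(e, w) for e, w in examples if e[attr] is None]
        total = sum(w for _, w in known)
        branches = {}
        for e, w in known:
            branches.setdefault(e[attr], []).append((e, w))
        for value, branch in branches.items():
            share = sum(w for _, w in branch) / total   # the fraction p_i for v_i
            branch.extend((e, w * share) for e, w in missing)
        return branches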

Page 42

Q5: Attributes with cost

• Consider medical diagnosis, where a test (e.g., a blood test) has a cost

• Question: how to learn a consistent tree with low expected cost?

• One approach: replace gain by Gain(S, A)^2 / Cost(A)

– Tan and Schlimmer (1990)
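In code the substitution is a one-liner (a sketch; the function name is ours, and the gain value can come from any information-gain routine such as the earlier one):

    def cost_adjusted_gain(gain, cost):
        """Tan and Schlimmer (1990): score an attribute by Gain(S, A)**2 / Cost(A),
        so informative but expensive tests are penalized."""
        return gain ** 2 / cost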

Page 43

Common algorithms

• ID3

• C4.5

• CART

Page 44

ID3

• Proposed by Quinlan (as is C4.5)

• Can handle basic cases: discrete attributes, no missing information, etc.

• Information gain as quality measure

Page 45

C4.5

• An extension of ID3:

– Several quality measures

– Incomplete information (missing attribute values)

– Numerical (continuous) attributes

– Pruning of decision trees

– Rule derivation

– Random mode and batch mode

Page 46

CART

• CART (classification and regression tree)

• Proposed by Breiman et al. (1984)

• Constant numerical values in leaves

• Variance as the measure of impurity (see the sketch below)
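On the regression side, the analogue of information gain is the reduction in variance; here is a minimal sketch (our own illustration) of scoring one candidate split:

    def variance(ys):
        """Mean squared deviation: the impurity a regression tree minimizes."""
        m = sum(ys) / len(ys)
        return sum((y - m) ** 2 for y in ys) / len(ys)

    def variance_reduction(xs, ys, t):
        """Impurity drop from splitting at threshold t; t must lie strictly
        between min(xs) and max(xs) so both sides are non-empty. Each leaf
        would then predict the constant mean of its side's ys."""
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        n = len(ys)
        return variance(ys) - (len(left) / n) * variance(left) \
                            - (len(right) / n) * variance(right)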

Page 47

Strengths of decision tree methods

• Ability to generate understandable rules

• Ease of calculation at classification time

• Ability to handle both continuous and categorical variables

• Ability to clearly indicate best attributes

Page 48

The weaknesses of decision tree methods

• Greedy algorithm: no global optimization

• Error-prone with too many classes: the number of training examples per node becomes small quickly in a tree with many levels/branches

• Expensive to train: sorting, combinations of attributes, calculating quality measures, etc.

• Trouble with non-rectangular regions: the rectangular classification boxes may not correspond well with the actual distribution of records in the decision space

Page 49


Acknowledgements

Introduction to Machine Learning, E. Alpaydin

Statistical Pattern Recognition: A Review, A. K. Jain et al., PAMI (22), 2000

Pattern Recognition and Analysis Course, A. K. Jain, MSU

"Pattern Classification" by Duda et al., John Wiley & Sons

http://www.doc.ic.ac.uk/~sgc/teaching/pre2012/v231/lecture13.html

Some material adopted from Dr. Adam Prugel-Bennett, Dr. Andrew Ng, and Dr. Amanullah's slides

Material in these slides has been taken from the following resources.