Machine Learning
Lecture # 5: Decision Trees for Numeric Data
What if attributes are numerical?
Discretization
• Process of converting an interval/numerical variable into a finite (discrete) set of elements (labels)
• A discretization algorithm computes a series of cut points which define a set of intervals that are mapped to labels
• Motivations: many methods (also in mathematics and computer science) cannot deal with numerical variables; in these cases, discretization is required to be able to use them, despite the loss of information
• Several approaches:
  – Supervised vs. unsupervised
  – Static vs. dynamic
Unsupervised vs Supervised Discretization
• Unsupervised discretization just uses the attribute values, for instance, discretize humidity as <70, 70-79, >=80
• Supervised discretization also uses the class attribute to generate intervals with lower entropy
• Example using humidity
– Values alone may suggest <60, 60-70
– Considering the class value might suggest different intervals by grouping
values to maintain information about the class attribute
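The difference can be made concrete with a small sketch. Below is a minimal example of unsupervised equal-width binning in plain Python; the humidity values are made up for illustration, and a supervised method would additionally look at the class labels when placing the cut points:

def equal_width_cut_points(values, n_bins=3):
    # Unsupervised: cut points depend only on the attribute values, not the class
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

humidity = [65, 70, 70, 75, 78, 80, 85, 90, 95, 96]   # made-up values
print(equal_width_cut_points(humidity))               # two cut points defining three bins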
The Temperature attribute
• First, sort the temperature values, including the class labels
• Then, check all the cut points and choose the one with the best information gain
• e.g. temperature < 71.5: yes/4, no/2
temperature ≥ 71.5: yes/5, no/3
• Info([4,2],[5,3]) = 6/14 info([4,2]) + 8/14 info([5,3]) = 0.939
• Place split points halfway between values
• Can evaluate all split points in one pass!
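A minimal sketch of this procedure in plain Python (helper names are illustrative, not from any library): it sorts the values once, tries the midpoint between each pair of adjacent distinct values, and returns the cut point with the lowest weighted child entropy, i.e. the highest information gain. This straightforward version rebuilds the child lists at every candidate cut; the "one pass" trick instead maintains running class counts while sweeping left to right.

import math
from collections import Counter

def entropy(labels):
    # info([p, n]) = -sum_i (c_i / N) * log2(c_i / N)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_numeric_split(values, labels):
    pairs = sorted(zip(values, labels))
    best_cut, best_info = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                  # no cut point between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2     # split point halfway between values
        left  = [c for v, c in pairs if v < cut]
        right = [c for v, c in pairs if v >= cut]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if info < best_info:
            best_cut, best_info = cut, info
    return best_cut, best_info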
Person   Hair Length   Weight   Age   Class
Homer    0”            250      36    M
Marge    10”           150      34    F
Bart     2”            90       10    M
Lisa     6”            78       8     F
Maggie   4”            20       1     F
Abe      1”            170      70    M
Selma    8”            160      41    F
Otto     10”           180      38    M
Krusty   6”            200      45    M
Comic    8”            290      38    ?
Let us try splitting on Hair Length

Hair Length <= 5?
  yes: 3M, 1F
  no:  2M, 3F

Entropy(S) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
Entropy(4F,5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911

Gain(A) = E(current set) - Σ E(all child sets)
Gain(Hair Length <= 5) = 0.9911 - (4/9 * 0.8113 + 5/9 * 0.9710) = 0.0911
Let us try splitting on Weight

Weight <= 160?
  yes: 1M, 4F
  no:  4M, 0F

Entropy(4F,5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911
Gain(Weight <= 160) = 0.9911 - (5/9 * 0.7219 + 4/9 * 0) = 0.5900
Let us try splitting on Age

Age <= 40?
  yes: 3M, 3F
  no:  2M, 1F

Entropy(4F,5M) = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911
Gain(Age <= 40) = 0.9911 - (6/9 * 1 + 3/9 * 0.9183) = 0.0183
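As a check, a small Python sketch reproduces all three gains from the table above (the Comic row is left out because its class is unknown):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(values, labels, test):
    # Information gain of splitting the node on the boolean test(value)
    left  = [c for v, c in zip(values, labels) if test(v)]
    right = [c for v, c in zip(values, labels) if not test(v)]
    n = len(labels)
    return entropy(labels) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

hair   = [0, 10, 2, 6, 4, 1, 8, 10, 6]               # Homer ... Krusty
weight = [250, 150, 90, 78, 20, 170, 160, 180, 200]
age    = [36, 34, 10, 8, 1, 70, 41, 38, 45]
sex    = ["M", "F", "M", "F", "F", "M", "F", "M", "M"]

print(gain(hair,   sex, lambda v: v <= 5))    # ~0.0911
print(gain(weight, sex, lambda v: v <= 160))  # ~0.5900
print(gain(age,    sex, lambda v: v <= 40))   # ~0.0183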
Of the 3 features we had, Weight was best. But while people who weigh over 160 are perfectly classified (as males), the people at or below 160 are not perfectly classified... So we simply recurse! This time we find that we can split on Hair Length, and we are done!

Weight <= 160?
  no  → Male
  yes → Hair Length <= 2?
          yes → Male
          no  → Female

We don't need to keep the data around, just the test conditions.

How would these people be classified?
It is trivial to convert Decision Trees to rules...

Rules to Classify Males/Females:
  If Weight greater than 160, classify as Male
  Else if Hair Length less than or equal to 2, classify as Male
  Else classify as Female
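As a sketch, the same rules written as a plain Python function (attribute names follow the table above):

def classify(weight, hair_length):
    # Rule form of the learned tree: test Weight first, then Hair Length
    if weight > 160:
        return "Male"
    elif hair_length <= 2:
        return "Male"
    else:
        return "Female"

print(classify(290, 8))   # the "Comic" row -> "Male"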
[Figure: a decision tree splitting on "Wears green?" with Male/Female leaves]

The worked examples we have seen were performed on small datasets. However, with small datasets there is a great danger of overfitting the data... When you have few datapoints, there are many possible splitting rules that perfectly classify the data but will not generalize to future datasets.

For example, the rule "Wears green?" perfectly classifies the data, and so does "Mother's name is Jacqueline?", and so does "Has blue shoes"...
Avoid Overfitting in Classification

• The generated tree may overfit the training data
  – Too many branches, some may reflect anomalies due to noise or outliers
  – Results in poor accuracy for unseen samples
• Two approaches to avoid overfitting
  – Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    • Difficult to choose an appropriate threshold
  – Postpruning: Remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
    • Use a set of data different from the training data to decide which is the "best pruned tree"
[Figure: the three "Pigeon Problem" scatter plots]
Which of the "Pigeon Problems" can be solved by a Decision Tree?
1) Deep Bushy Tree
2) Useless
3) Deep Bushy Tree
The Decision Tree has a hard time with correlated attributes.
Advantages/Disadvantages of Decision Trees

• Advantages:
  – Easy to understand (Doctors love them!)
  – Easy to generate rules
• Disadvantages:
  – May suffer from overfitting.
  – Classifies by rectangular partitioning (so does not handle correlated features very well).
  – Can be quite large; pruning is necessary.
  – Does not handle streaming data easily.
Decision Tree with Real Features
Example from MIT course
Major issues

Q1: Choosing the best attribute: what quality measure to use?
Q2: Determining when to stop splitting: avoid overfitting
Q3: Handling continuous attributes
Q4: Handling training data with missing attribute values
Q5: Handling attributes with different costs
Q6: Dealing with a continuous goal attribute
Q1: What quality measure
• Information gain
• Gain Ratio
• …
How to avoid Overfitting
• Stop growing the tree earlier
– Ex: InfoGain < threshold
– Ex: Number of examples in a node < threshold
– …
• Grow full tree, then post-prune
In practice, the latter works better than the former.
Post-pruning

• Split data into training and validation set
• Do until further pruning is harmful:
  – Evaluate the impact on the validation set of pruning each possible node (plus those below it)
  – Greedily remove the one whose removal most improves performance on the validation set
• Produces a smaller tree with the best performance measure
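A minimal sketch of this greedy reduced-error pruning loop, assuming an accuracy(tree, data) helper, an internal_nodes(tree) iterator, and a pruned flag that makes a node behave as a majority-class leaf; all three are illustrative assumptions, not a real library API:

def reduced_error_prune(tree, validation_data):
    # Keep pruning while some node can be turned into a leaf without
    # hurting accuracy on the held-out validation set.
    improved = True
    while improved:
        improved = False
        best_node, best_acc = None, accuracy(tree, validation_data)
        for node in internal_nodes(tree):
            node.pruned = True                   # temporarily replace the subtree by a leaf
            acc = accuracy(tree, validation_data)
            node.pruned = False
            if acc >= best_acc:                  # prefer the prune that helps most
                best_node, best_acc = node, acc
        if best_node is not None:
            best_node.pruned = True              # make the best prune permanent
            improved = True
    return tree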
Performance measure
• Accuracy:
  – on validation data
  – K-fold cross validation
• Misclassification cost: Sometimes more accuracy is desired for some classes than others.
• MDL: size(tree) + errors(tree)
Rule post-pruning
• Convert tree to equivalent set of rules
• Prune each rule independently of others
• Sort final rules into desired sequence for use
• Perhaps most frequently used method (e.g., C4.5)
Q3: Handling numeric attributes
• Continuous attribute → discrete attribute
• Example
– Original attribute: Temperature = 82.5
– New attribute: (temperature > 72.3) = t, f
Question: how to choose thresholds?
Q4: Unknown attribute values
Assume an attribute can take the value “blank”.
• Assign the most common value of A among the training data at node n.
• Assign the most common value of A among the training examples at node n that have the same target class.
• Assign probability pi to each possible value vi of A, and pass a fraction (pi) of the example down to each descendant in the tree. This method is used in C4.5.
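A tiny sketch of the first strategy (impute the most common observed value of the attribute); the helper name is made up for illustration:

from collections import Counter

def impute_most_common(value, observed_values):
    # Replace a missing value (None) with the most frequent non-missing value
    if value is None:
        return Counter(v for v in observed_values if v is not None).most_common(1)[0][0]
    return value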
Q5: Attributes with cost

• Consider that a medical diagnosis test (e.g., a blood test) has a cost
• Question: how to learn a consistent tree with low expected cost?
• One approach: replace gain by

  Gain²(S, A) / Cost(A)

  – Tan and Schlimmer (1990)
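As a one-line sketch of that criterion (the gain value would come from the usual information-gain computation; the cost numbers below are made up):

def cost_sensitive_gain(gain_value, cost):
    # Tan and Schlimmer (1990): prefer informative attributes that are cheap to measure
    return gain_value ** 2 / cost

print(cost_sensitive_gain(0.4, 3.0))   # expensive blood test
print(cost_sensitive_gain(0.2, 1.0))   # cheap question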
Common algorithms
• ID3
• C4.5
• CART
ID3
• Proposed by Quinlan (so is C4.5)
• Can handle basic cases: discrete attributes, no missing information, etc.
• Information gain as quality measure
C4.5
• An extension of ID3:
– Several quality measures
– Incomplete information (missing attribute values)
– Numerical (continuous) attributes
– Pruning of decision trees
– Rule derivation
– Random mode and batch mode
CART
• CART (classification and regression tree)
• Proposed by Breiman et al. (1984)
• Constant numerical values in leaves
• Variance as measure of impurity
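A minimal sketch of variance as an impurity measure for a regression split (plain Python, illustrative only, not CART itself):

def variance(ys):
    # Impurity of a node holding numeric target values
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def variance_reduction(left_ys, right_ys):
    # Split quality: parent variance minus weighted child variance
    ys = left_ys + right_ys
    n = len(ys)
    return variance(ys) - (len(left_ys) / n) * variance(left_ys) - (len(right_ys) / n) * variance(right_ys)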
Strengths of decision tree methods
• Ability to generate understandable rules
• Ease of calculation at classification time
• Ability to handle both continuous and categorical variables
• Ability to clearly indicate best attributes
The weaknesses of decision tree methods
• Greedy algorithm: no global optimization
• Error-prone with too many classes: the number of training examples per node shrinks quickly in a tree with many levels/branches.
• Expensive to train: sorting, combinations of attributes, calculating quality measures, etc.
• Trouble with non-rectangular regions: the rectangular classification boxes may not correspond well with the actual distribution of records in the decision space.
Acknowledgements
Introduction to Machine Learning, Alpaydin
Statistical Pattern Recognition: A Review – A.K. Jain et al., PAMI (22), 2000
Pattern Recognition and Analysis Course – A.K. Jain, MSU
"Pattern Classification" by Duda et al., John Wiley & Sons
http://www.doc.ic.ac.uk/~sgc/teaching/pre2012/v231/lecture13.html
Some material adopted from slides by Dr. Adam Prugel-Bennett, Dr. Andrew Ng, and Dr. Amanullah
Material in these slides has been taken from the resources listed above.