Classification by Decision Tree Induction
• Decision tree
  – A flow-chart-like tree structure
  – Internal nodes denote attributes
  – Branches represent the values of the node attribute
  – Leaf nodes represent class labels or class distributions
Constructing a decision tree, one step at a time

[Figure: eight training examples, each described by Boolean attributes a (address), c (criminal record), i (income), e, o, u and labelled Y or N, are split first on address?, then each branch on criminal?, and one branch further on income?, with the remaining examples listed at every node.]

Address was maybe not the best attribute to start with…
A Worked Example

Weekend  Weather  Parents  Money  Decision (Category)
W1 Sunny Yes Rich Cinema
W2 Sunny No Rich Tennis
W3 Windy Yes Rich Cinema
W4 Rainy Yes Poor Cinema
W5 Rainy No Rich Stay in
W6 Rainy Yes Poor Cinema
W7 Windy No Poor Cinema
W8 Windy No Rich Shopping
W9 Windy Yes Rich Cinema
W10 Sunny No Rich Tennis
Decision Tree Learning
• Building a Decision Tree
1. First test all attributes and select the one that would function as the best root;
2. Break up the training set into subsets based on the branches of the root node;
3. Test the remaining attributes to see which ones fit best underneath the branches of the root node;
4. Continue this process for all other branches until
   a. all examples of a subset are of one type,
   b. there are no examples left (return the majority classification of the parent), or
   c. there are no more attributes left (the default value should be the majority classification).
(A code sketch of these steps follows.)
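As a rough illustration (not part of the original slides), the steps above can be written as one recursive function. The names build_tree and choose_attribute are my own; choose_attribute is assumed to be a caller-supplied scoring helper, for example one that picks the attribute with the largest information gain (introduced in the next section).

```python
from collections import Counter

def build_tree(examples, attributes, choose_attribute, parent_majority=None):
    """Minimal sketch of the recursive build procedure described above.
    `examples` is a list of (attribute_dict, label) pairs;
    `choose_attribute(examples, attributes)` returns the attribute to split on."""
    if not examples:                       # 4b: no examples left
        return parent_majority
    labels = [label for _, label in examples]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:              # 4a: all examples are of one type
        return labels[0]
    if not attributes:                     # 4c: no attributes left
        return majority
    best = choose_attribute(examples, attributes)        # step 1: pick the best attribute
    tree = {best: {}}
    for value in set(ex[best] for ex, _ in examples):    # step 2: split on its values
        subset = [(ex, lbl) for ex, lbl in examples if ex[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, choose_attribute, majority)  # steps 3-4
    return tree
```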
Decision Tree Learning
• Determining which attribute is best (Entropy & Gain)
• Entropy (E) is the minimum number of bits needed to encode the classification (yes or no) of an arbitrary example
• E(S) = Σ_{i=1..c} -p_i log2 p_i
  – where S is a set of training examples,
  – c is the number of classes, and
  – p_i is the proportion of the training set that is of class i
• For our entropy equation, 0 log2 0 = 0
• The information gain G(S,A), where A is an attribute:
• G(S,A) = E(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · E(S_v)
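A minimal sketch of these two formulas in Python (the function names entropy and information_gain are my own, and the data layout, a list of (attribute_dict, label) pairs, is an assumption):

```python
import math
from collections import Counter

def entropy(labels):
    """E(S) = sum over classes i of -p_i * log2(p_i), with 0*log2(0) taken as 0."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """G(S,A) = E(S) - sum over values v of A of (|Sv|/|S|) * E(Sv)."""
    labels = [lbl for _, lbl in examples]
    n = len(examples)
    remainder = 0.0
    for value in set(ex[attribute] for ex, _ in examples):
        subset = [lbl for ex, lbl in examples if ex[attribute] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder
```

A helper like this could serve as the choose_attribute function in the earlier sketch by selecting the attribute with the largest gain.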
Information Gain

• Gain(S,A): expected reduction in entropy due to sorting S on attribute A
[Figure: the same set S = [29+, 35-] split two ways. A1=? gives [21+, 5-] (True) and [8+, 30-] (False); A2=? gives [18+, 33-] (True) and [11+, 2-] (False).]
Gain(S,A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v|/|S|) · Entropy(S_v)

Entropy([29+,35-]) = -(29/64) log2(29/64) - (35/64) log2(35/64) = 0.99
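Plugging the counts from the figure into the gain formula (a small self-contained check; entropy_counts is an illustrative helper and the resulting gains are approximate):

```python
import math

def entropy_counts(pos, neg):
    # entropy of a node from its positive/negative example counts
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

s = entropy_counts(29, 35)                                                        # ~0.99
gain_a1 = s - (26/64) * entropy_counts(21, 5) - (38/64) * entropy_counts(8, 30)  # ~0.27
gain_a2 = s - (51/64) * entropy_counts(18, 33) - (13/64) * entropy_counts(11, 2) # ~0.12
```

So A1 gives the larger expected reduction in entropy and would be preferred over A2.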
Training Examples
Day Outlook Temp. Humidity Wind Play Tennis
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Weak Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Strong Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
ID3 Algorithm
[Figure: ID3's first split on the full set [D1,D2,…,D14] [9+,5-]. Root: Outlook. Sunny → Ssunny = [D1,D2,D8,D9,D11] [2+,3-], still to be split; Overcast → [D3,D7,D12,D13] [4+,0-], labelled Yes; Rain → [D4,D5,D6,D10,D14] [3+,2-], still to be split.]
Gain(Ssunny, Humidity) = 0.970 - (3/5)·0.0 - (2/5)·0.0 = 0.970
Gain(Ssunny, Temp.) = 0.970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = 0.570
Gain(Ssunny, Wind) = 0.970 - (2/5)·1.0 - (3/5)·0.918 = 0.019
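These numbers can be reproduced with the entropy_counts helper sketched earlier (an illustrative check, not part of the slides):

```python
e_sunny = entropy_counts(2, 3)                                                      # ~0.970
gain_humidity = e_sunny - (3/5)*entropy_counts(0, 3) - (2/5)*entropy_counts(2, 0)   # ~0.970
gain_temp = e_sunny - (2/5)*entropy_counts(0, 2) - (2/5)*entropy_counts(1, 1) \
                    - (1/5)*entropy_counts(1, 0)                                    # ~0.570
gain_wind = e_sunny - (2/5)*entropy_counts(1, 1) - (3/5)*entropy_counts(1, 2)       # ~0.019
```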
ID3 Algorithm
[Figure: the resulting tree.]
Outlook
  Sunny → Humidity
    High → No [D1,D2]
    Normal → Yes [D8,D9,D11]
  Overcast → Yes [D3,D7,D12,D13]
  Rain → Wind
    Strong → No [D6,D14]
    Weak → Yes [D4,D5,D10]
Converting a Tree to Rules
[Figure: the same tree as above, with Outlook at the root, Humidity under Sunny, and Wind under Rain.]
R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes
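The five rules can be transcribed directly into code; this is just an illustrative sketch (the function name play_tennis is my own):

```python
def play_tennis(outlook, humidity, wind):
    """The five rules above, written as a plain if-chain."""
    if outlook == "Sunny" and humidity == "High":
        return "No"       # R1
    if outlook == "Sunny" and humidity == "Normal":
        return "Yes"      # R2
    if outlook == "Overcast":
        return "Yes"      # R3
    if outlook == "Rain" and wind == "Strong":
        return "No"       # R4
    if outlook == "Rain" and wind == "Weak":
        return "Yes"      # R5
```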
Decision tree classifiers

• Require no prior knowledge of the data distribution and work well on noisy data.
• Have been applied to:
  – classifying medical patients by disease,
  – equipment malfunctions by cause,
  – loan applicants by likelihood of repayment.
• The internal nodes are simple decision rules on one or more attributes, and the leaf nodes are predicted class labels.
Decision trees
[Figure: an example tree with decision nodes Salary < 1 M, Prof = teaching, and Age < 30, and leaf labels Good/Bad.]
Sample Experience Table

Example  Hour  Weather  Accident  Stall  Commute (Target)
D1 8 AM Sunny No No Long
D2 8 AM Cloudy No Yes Long
D3 10 AM Sunny No No Short
D4 9 AM Rainy Yes No Long
D5 9 AM Sunny Yes Yes Long
D6 10 AM Sunny No No Short
D7 10 AM Cloudy No No Short
D8 9 AM Rainy No No Medium
D9 9 AM Sunny Yes No Long
D10 10 AM Cloudy Yes Yes Long
D11 10 AM Rainy No No Short
D12 8 AM Cloudy Yes No Long
D13 9 AM Sunny No No Medium
Decision Tree Analysis
• For choosing the best course of action when future outcomes are uncertain.
[Figure: a decision tree for a product pricing strategy; node values are expected payoffs and percentages are branch probabilities.]

Product pricing strategy: 17,330
  Choose high price: 17,330
    Competitor enters (70%): 8,900
      Competitor is superior (10%): -25,000
        Increase promotion: -49,000 | Maintain course: -30,000 | Abandon product: -25,000
      Competitor is equal (60%): 8,000
        Increase promotion: 8,000 | Maintain course: 6,000 | Abandon product: -25,000
      Competitor is inferior (30%): 22,000
        Increase promotion: 10,000 | Maintain course: 22,000 | Abandon product: -25,000
    No competition (30%): 37,000
      Sales high (10%): 95,000 | Sales typical (80%): 35,000 | Sales low (10%): -5,000
  Choose low price: 13,000
    Sales high (10%): 42,000 | Sales typical (80%): 12,000 | Sales low (10%): -8,000
  Don't launch product: 0
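The node values in the figure come from "rolling back" the tree: a decision node takes the value of its best child, a chance node takes the probability-weighted average of its children. A minimal sketch (the function rollback and the tuple encoding are my own illustration):

```python
# A chance node is ("chance", [(probability, subtree), ...]);
# a decision node is ("decision", [(label, subtree), ...]); a plain number is a payoff.
def rollback(node):
    if isinstance(node, (int, float)):
        return node
    kind, branches = node
    if kind == "chance":
        return sum(p * rollback(child) for p, child in branches)   # expected value
    return max(rollback(child) for _, child in branches)           # pick the best action

# e.g. the "no competition" chance node from the figure:
no_competition = ("chance", [(0.10, 95_000), (0.80, 35_000), (0.10, -5_000)])
print(rollback(no_competition))   # 37000.0
```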
Classification

• Data: It has k attributes A1, …, Ak. Each tuple (case or example) is described by values of the attributes and a class label.
• Goal: To learn rules or to build a model that can be used to predict the classes of new (or future, or test) cases.
• The data used for building the model is called the training data.
Data and its format
• Data
  – attribute-value pairs
  – with/without class
• Data type
  – continuous/discrete
  – nominal
• Data format
  – flat
  – if not flat, what should we do?
Induction Algorithm
• We calculate the gain for each attribute and choose the one with the maximum gain to be the node in the tree.
• After building that node, we calculate the gain for the remaining attributes and again choose the one with the maximum gain.
Decision Tree Induction Algorithm
• Create a root node for the tree
• If all cases are positive, return the single-node tree with label +
• If all cases are negative, return the single-node tree with label -
• Otherwise begin
  – For each possible value of the selected attribute
    • Add a new tree branch
    • Let the cases for that branch be the subset of the data that have this value
  – Add new nodes with new subtrees until a leaf node is reached
Impurity Measures
• Information entropy:
  – zero when the set consists of only one class; maximal (1 for two classes) when all classes are present in equal numbers

  Entropy(S) = -Σ_{i=1..k} p_i log p_i

• Another measure of impurity, the Gini index:

  Gini(S) = 1 - Σ_{i=1..k} p_i²
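A short sketch of the Gini measure (the function name gini is my own; entropy was sketched earlier):

```python
from collections import Counter

def gini(labels):
    """Gini(S) = 1 - sum over classes i of p_i^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["Y", "Y", "Y", "Y"]))   # 0.0 -> pure node
print(gini(["Y", "Y", "N", "N"]))   # 0.5 -> maximally mixed two-class node
```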
Review of Log2
log2(0) is undefined (by convention, 0 log2 0 = 0 in the entropy formula)
log2(1) = 0
log2(2) = 1
log2(4) = 2
log2(1/2) = -1
log2(1/4) = -2
(1/2) log2(1/2) = (1/2)(-1) = -1/2
Example of Decision Tree
Size  Shape  Color  Choice
Medium Brick Blue Yes
Small Wedge Red No
Large Wedge Red No
Small Sphere Red Yes
Large Pillar Green Yes
Large Pillar Red No
Large Sphere Green Yes
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes based on a training set. This step is also called learning.
  – Each tuple/sample is assumed to belong to a predefined class
  – The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future test data/objects
  – Estimate the accuracy of the model
    • The known label of each test example is compared with the result classified by the model
    • The accuracy rate is the % of test cases that are correctly classified by the model
  – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.
Classification Process (1): Model Construction
Training Data

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Classification Algorithms

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'

Classifier (Model)
Classification Process (2): Use the Model in Prediction
Classifier
Testing Data

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
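Applying the rule learned in step (1) to the unseen tuple (Jeff, Professor, 4) — a small illustrative sketch, with the function name tenured my own:

```python
def tenured(rank, years):
    # the rule from step (1): rank = 'professor' OR years > 6
    return "yes" if rank == "Professor" or years > 6 else "no"

print(tenured("Professor", 4))   # 'yes' -> Jeff is predicted to be tenured
```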
Example
• Consider the problem of learning whether or not to go jogging on a particular day.
Attribute Possible Values
WEATHER Warm, Cold, Raining
JOGGED_YESTERDAY Yes, No
The Data

WEATHER  JOGGED_YESTERDAY  CLASSIFICATION
C N +
W Y -
R Y -
C Y -
R N -
W Y -
C N -
W N +
C Y -
W Y +
W N +
C N +
R Y -
W Y -
ID3: Some Issues
• Sometimes we arrive at a node with no examples.
  This means that such a combination of values has not been observed.
  We just assign as its value the majority vote of its parent's examples.
• Sometimes we arrive at a node with both positive and negative examples and no attributes left.
  This means that there is noise in the data.
  We just assign as its value the majority vote of the examples.
Problems with Decision Trees

• ID3 is not optimal
  – Uses expected entropy reduction, not actual reduction
• Must use discrete (or discretized) attributes
  – What if we left for work at 9:30 AM?
  – We could break the attributes down into smaller values…
Problems with Decision Trees
• While decision trees classify quickly, the time for building a tree may be higher than for other types of classifiers
• Decision trees suffer from a problem of errors propagating throughout the tree
  – A very serious problem as the number of classes increases
Decision Tree Characteristics
• The training data may contain missing attribute values.
  – Decision tree methods can be used even when some training examples have unknown values.
Unknown Attribute Values

What if some examples are missing values of A? Use the training example anyway and sort it through the tree:
• If node n tests A, assign the most common value of A among the other examples sorted to node n
• Assign the most common value of A among the other examples with the same target value
• Assign probability p_i to each possible value v_i of A
  – Assign fraction p_i of the example to each descendant in the tree
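A minimal sketch of the first option (the function name most_common_value and the (attribute_dict, label) data layout are assumptions of mine):

```python
from collections import Counter

def most_common_value(examples, attribute):
    """Fill a missing value of A with the most common value of A
    among the other examples sorted to this node."""
    values = [ex[attribute] for ex, _ in examples if ex.get(attribute) is not None]
    return Counter(values).most_common(1)[0][0]
```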
Rule Generation
Once a decision tree has been constructed, it is a simple matter to convert it into a set of rules.

• Converting to rules allows distinguishing among the different contexts in which a decision node is used.
Rule Generation
• Converting to rules improves readability.
– Rules are often easier for people to understand.
• To generate rules, trace each path in the decision tree, from root node to leaf node.
Rule Simplification
Once a rule set has been devised:
– Individual rules are first simplified by eliminating redundant and unnecessary rules.
– Then, attempt to replace the rules that share the most common consequent with a default rule that is triggered when no other rule is triggered.
Continuous Valued Attributes
Create a discrete attribute to test a continuous one:
• Temperature = 24.5 °C
• (Temperature > 20.0 °C) = {true, false}
Where to set the threshold?
Temperature   15 °C  18 °C  19 °C  22 °C  24 °C  27 °C
PlayTennis    No     No     Yes    Yes    Yes    No
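One common choice is to place candidate thresholds midway between consecutive values where the class label changes. A short illustrative sketch (the function name candidate_thresholds is my own):

```python
def candidate_thresholds(values, labels):
    """Candidate split points: midpoints between consecutive sorted values
    whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb]

# for the small table above (temperatures in degrees C)
print(candidate_thresholds([15, 18, 19, 22, 24, 27],
                           ["No", "No", "Yes", "Yes", "Yes", "No"]))   # [18.5, 25.5]
```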
Pruning Trees
• There is another technique for reducing the number of attributes used in a tree: pruning
• Two types of pruning:
  – Pre-pruning (forward pruning)
  – Post-pruning (backward pruning)
Prepruning
• In prepruning, we decide during the building process when to stop adding attributes (possibly based on their information gain)
• However, this may be problematic. Why?
  – Sometimes attributes individually do not contribute much to a decision, but combined, they may have a significant impact
Postpruning
• Postpruning waits until the full decision tree has been built and then prunes its attributes
• Two techniques:
  – Subtree Replacement
  – Subtree Raising
Subtree Replacement
• Node 6 replaces the subtree
• This generalizes the tree a little more, but may increase accuracy
[Figure: subtree replacement — a subtree in the tree (nodes A and B, leaves 4 and 5) is replaced by the single leaf node 6.]
Subtree Raising
• The entire subtree is raised onto another node
• This was not discussed in detail, as it is not clear whether it is really worthwhile (it is very time consuming)
[Figure: subtree raising — the subtree rooted at C, with leaves 1, 2, and 3, is raised onto node A.]