
Machine Learning

Decision Tree


Decision Tree


[Figure: the PlayTennis decision tree. Outlook is the root; Sunny leads to a Humidity test (High → No, Normal → Yes), Overcast leads directly to Yes, and Rain leads to a Wind test (Strong → No, Weak → Yes).]


Classification by Decision Tree Induction

• Decision tree
  – A flow-chart-like tree structure
  – An internal node denotes an attribute
  – A branch represents a value of the node's attribute
  – Leaf nodes represent class labels or a class distribution
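To make this structure concrete, here is a minimal Python sketch of such a tree (a sketch only; the Leaf/Node/classify names are illustrative, not from the slides), using the PlayTennis tree shown above:

# Minimal sketch of the structure described above: internal nodes test an
# attribute, branches are attribute values, leaves carry a class label.
from dataclasses import dataclass, field
from typing import Dict, Union

@dataclass
class Leaf:
    label: str                      # predicted class label

@dataclass
class Node:
    attribute: str                  # attribute tested at this internal node
    branches: Dict[str, Union["Node", Leaf]] = field(default_factory=dict)

def classify(tree, example):
    """Follow the branch matching the example's attribute value until a leaf."""
    while isinstance(tree, Node):
        tree = tree.branches[example[tree.attribute]]
    return tree.label

# The PlayTennis tree from the figure above:
play_tennis = Node("Outlook", {
    "Sunny":    Node("Humidity", {"High": Leaf("No"), "Normal": Leaf("Yes")}),
    "Overcast": Leaf("Yes"),
    "Rain":     Node("Wind", {"Strong": Leaf("No"), "Weak": Leaf("Yes")}),
})
print(classify(play_tennis, {"Outlook": "Sunny", "Humidity": "High"}))   # No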


Decision trees

[Figure: a small example tree. The root asks "high income?"; if no, the decision is NO; if yes, a second test "criminal record?" leads to NO for yes and YES for no.]

Constructing a decision tree, one step at a time

Training examples (attributes a, c, i correspond to the address?, criminal record?, and income? tests; e, o, u are additional attributes):
+a, -c, +i, +e, +o, +u: Y
-a, +c, -i, +e, -o, -u: N
+a, -c, +i, -e, -o, -u: Y
-a, -c, +i, +e, -o, -u: Y
-a, +c, +i, -e, -o, -u: N
-a, -c, +i, -e, -o, +u: Y
+a, -c, -i, -e, +o, -u: N
+a, +c, +i, -e, +o, -u: N

[Figure: the tree is grown one split at a time. Splitting first on address? separates the -a and +a examples. Each group is then split on criminal?; on the -a side this is enough (+c → N, -c → Y), while on the +a side the +c branch is N and the -c branch is still mixed, so it is split again on income? (+i → Y, -i → N).]

Address was maybe not the best attribute to start with…

A Worked Example

Weekend  Weather  Parents  Money  Decision (Category)

W1 Sunny Yes Rich Cinema

W2 Sunny No Rich Tennis

W3 Windy Yes Rich Cinema

W4 Rainy Yes Poor Cinema

W5 Rainy No Rich Stay in

W6 Rainy Yes Poor Cinema

W7 Windy No Poor Cinema

W8 Windy No Rich Shopping

W9 Windy Yes Rich Cinema

W10 Sunny No Rich Tennis

Decision Tree Learning

• Building a Decision Tree

1. First test all attributes and select the one that would function as the best root;
2. Break up the training set into subsets based on the branches of the root node;
3. Test the remaining attributes to see which ones fit best underneath the branches of the root node;
4. Continue this process for all other branches until
   a. all examples of a subset are of one type,
   b. there are no examples left (return the majority classification of the parent), or
   c. there are no more attributes left (the default value should be the majority classification).

Decision Tree Learning

• Determining which attribute is best (Entropy & Gain)• Entropy (E) is the minimum number of bits needed in order to

classify an arbitrary example as yes or no• E(S) = c

i=1 –pi log2 pi ,

– Where S is a set of training examples,– c is the number of classes, and– pi is the proportion of the training set that is of class i

• For our entropy equation 0 log2 0 = 0

• The information gain G(S,A) where A is an attribute• G(S,A) E(S) - v in Values(A) (|Sv| / |S|) * E(Sv)
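These two formulas translate directly into a short, self-contained Python sketch (function and variable names are illustrative, not from the slides):

import math
from collections import Counter

def entropy(labels):
    """E(S) = sum over classes i of -p_i * log2(p_i), with 0 * log2(0) taken as 0."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """G(S, A) = E(S) - sum over values v of A of (|S_v| / |S|) * E(S_v)."""
    n = len(labels)
    remainder = 0.0
    for v in set(ex[attribute] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attribute] == v]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder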

Information Gain

• Gain(S, A): the expected reduction in entropy due to sorting S on attribute A


[Figure: two candidate splits of the same collection [29+, 35-]. Splitting on A1 gives branches [21+, 5-] (True) and [8+, 30-] (False); splitting on A2 gives [18+, 33-] (True) and [11+, 2-] (False).]

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

Entropy([29+, 35-]) = −(29/64) log₂(29/64) − (35/64) log₂(35/64) = 0.99
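A quick, self-contained check of these numbers (a sketch; entropy2 is just the binary case of the entropy function above):

import math

def entropy2(pos, neg):
    """Entropy of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c)

def gain(parent, children):
    """Information gain of splitting `parent` = (pos, neg) into `children` subsets."""
    n = sum(parent)
    return entropy2(*parent) - sum((p + q) / n * entropy2(p, q) for p, q in children)

print(round(entropy2(29, 35), 2))                      # 0.99
print(round(gain((29, 35), [(21, 5), (8, 30)]), 3))    # 0.266: gain of the A1 split
print(round(gain((29, 35), [(18, 33), (11, 2)]), 3))   # 0.121: gain of A2, so A1 is the better attribute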

Training Examples


Day Outlook Temp. Humidity Wind Play Tennis

D1 Sunny Hot High Weak No

D2 Sunny Hot High Strong No

D3 Overcast Hot High Weak Yes

D4 Rain Mild High Weak Yes

D5 Rain Cool Normal Weak Yes

D6 Rain Cool Normal Strong No

D7 Overcast Cool Normal Weak Yes

D8 Sunny Mild High Weak No

D9 Sunny Cool Normal Weak Yes

D10 Rain Mild Normal Strong Yes

D11 Sunny Mild Normal Strong Yes

D12 Overcast Mild High Strong Yes

D13 Overcast Hot Normal Weak Yes

D14 Rain Mild High Strong No

ID3 Algorithm


[Figure: the partially built ID3 tree. The full set [D1, …, D14] with [9+, 5-] is split on Outlook. The Overcast branch ({D3, D7, D12, D13}, [4+, 0-]) is already pure and becomes a Yes leaf; the Sunny branch (S_sunny = {D1, D2, D8, D9, D11}, [2+, 3-]) and the Rain branch ({D4, D5, D6, D10, D14}, [3+, 2-]) still need an attribute.]

Gain(S_sunny, Humidity) = 0.970 − (3/5)·0.0 − (2/5)·0.0 = 0.970
Gain(S_sunny, Temp.)    = 0.970 − (2/5)·0.0 − (2/5)·1.0 − (1/5)·0.0 = 0.570
Gain(S_sunny, Wind)     = 0.970 − (2/5)·1.0 − (3/5)·0.918 = 0.019
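These three gains can be reproduced from the Sunny rows of the training table (a self-contained sketch; dictionary keys such as "Play" are illustrative names):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attribute):
    labels = [r["Play"] for r in rows]
    g = entropy(labels)
    for v in set(r[attribute] for r in rows):
        sub = [r["Play"] for r in rows if r[attribute] == v]
        g -= len(sub) / len(rows) * entropy(sub)
    return g

# S_sunny = {D1, D2, D8, D9, D11} from the training examples above
s_sunny = [
    {"Temp": "Hot",  "Humidity": "High",   "Wind": "Weak",   "Play": "No"},   # D1
    {"Temp": "Hot",  "Humidity": "High",   "Wind": "Strong", "Play": "No"},   # D2
    {"Temp": "Mild", "Humidity": "High",   "Wind": "Weak",   "Play": "No"},   # D8
    {"Temp": "Cool", "Humidity": "Normal", "Wind": "Weak",   "Play": "Yes"},  # D9
    {"Temp": "Mild", "Humidity": "Normal", "Wind": "Strong", "Play": "Yes"},  # D11
]
for a in ("Humidity", "Temp", "Wind"):
    print(a, round(gain(s_sunny, a), 3))   # Humidity 0.971, Temp 0.571, Wind 0.02
# Humidity has the highest gain, so it is placed under the Sunny branch.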

ID3 Algorithm


[Figure: the finished ID3 tree. Outlook is the root. Sunny → Humidity: High → No ({D1, D2}), Normal → Yes ({D8, D9, D11}). Overcast → Yes ({D3, D7, D12, D13}). Rain → Wind: Strong → No ({D6, D14}), Weak → Yes ({D4, D5, D10}).]

Converting a Tree to Rules


[Figure: the PlayTennis tree from the previous slide (Outlook root; Sunny → Humidity, Overcast → Yes, Rain → Wind), with one rule read off per root-to-leaf path.]

R1: If (Outlook=Sunny) ∧ (Humidity=High) Then PlayTennis=No
R2: If (Outlook=Sunny) ∧ (Humidity=Normal) Then PlayTennis=Yes
R3: If (Outlook=Overcast) Then PlayTennis=Yes
R4: If (Outlook=Rain) ∧ (Wind=Strong) Then PlayTennis=No
R5: If (Outlook=Rain) ∧ (Wind=Weak) Then PlayTennis=Yes

Decision tree classifiers

• Do not require any prior knowledge of the data distribution, and work well on noisy data.
• Have been applied to:
  – classifying medical patients based on their disease,
  – equipment malfunction by cause,
  – loan applicants by likelihood of payment.
• The internal nodes are simple decision rules on one or more attributes, and leaf nodes are predicted class labels.

Decision trees

[Figure: an example tree with internal tests "Salary < 1 M", "Prof = teaching", and "Age < 30", and leaves labelled Good and Bad.]

Sample Experience Table

Example  Hour  Weather  Accident  Stall  Commute (Target)

D1 8 AM Sunny No No Long

D2 8 AM Cloudy No Yes Long

D3 10 AM Sunny No No Short

D4 9 AM Rainy Yes No Long

D5 9 AM Sunny Yes Yes Long

D6 10 AM Sunny No No Short

D7 10 AM Cloudy No No Short

D8 9 AM Rainy No No Medium

D9 9 AM Sunny Yes No Long

D10 10 AM Cloudy Yes Yes Long

D11 10 AM Rainy No No Short

D12 8 AM Cloudy Yes No Long

D13 9 AM Sunny No No Medium

Decision Tree Analysis

• For choosing the best course of action when future outcomes are uncertain.

[Figure: decision-analysis tree for a product pricing strategy; overall expected value 17,330.]

Choose high price: 17,330
  Competitor enters (70%): 8,900
    Competitor is superior (10%): -25,000 (Increase promotion 1: -49,000; Maintain course 1: -30,000; Abandon product 1: -25,000)
    Competitor is equal (60%): 8,000 (Increase promotion 2: 8,000; Maintain course 2: 6,000; Abandon product 2: -25,000)
    Competitor is inferior (30%): 22,000 (Increase promotion 3: 10,000; Maintain course 3: 22,000; Abandon product 3: -25,000)
  No competition (30%): 37,000 (Sales high 1: 95,000 at 10%; Sales typical 1: 35,000 at 80%; Sales low 1: -5,000 at 10%)
Choose low price: 13,000 (Sales high 2: 42,000 at 10%; Sales typical 2: 12,000 at 80%; Sales low 2: -8,000 at 10%)
Don't launch product: 0
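The figure's expected values can be reproduced by "rolling back" the tree: a chance node takes the probability-weighted average of its branches, and a decision node takes the best (maximum) alternative. A minimal sketch, with the probabilities matched to branches as reconstructed above:

def chance(*branches):              # branches are (probability, value) pairs
    return sum(p * v for p, v in branches)

def decide(*alternatives):          # a decision node keeps the best expected value
    return max(alternatives)

competitor_enters = chance(
    (0.10, decide(-49_000, -30_000, -25_000)),   # competitor is superior
    (0.60, decide(8_000, 6_000, -25_000)),       # competitor is equal
    (0.30, decide(10_000, 22_000, -25_000)),     # competitor is inferior
)                                                # expected value ≈ 8,900
no_competition = chance((0.10, 95_000), (0.80, 35_000), (0.10, -5_000))   # ≈ 37,000
high_price = chance((0.70, competitor_enters), (0.30, no_competition))    # ≈ 17,330
low_price = chance((0.10, 42_000), (0.80, 12_000), (0.10, -8_000))        # ≈ 13,000

print(round(decide(high_price, low_price, 0)))   # 17330: choose the high price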

Classification

• Data: it has k attributes A1, …, Ak. Each tuple (case or example) is described by values of the attributes and a class label.
• Goal: to learn rules or to build a model that can be used to predict the classes of new (or future or test) cases.
• The data used for building the model is called the training data.

Data and its format

• Data
  – attribute-value pairs
  – with/without class
• Data type
  – continuous/discrete
  – nominal
• Data format
  – flat
  – if not flat, what should we do?

Induction Algorithm

• We calculate the gain for each attribute and choose the attribute with the maximum gain as the node in the tree.
• After building that node, we calculate the gain for the remaining attributes and again choose the one with the maximum gain.

Decision Tree Induction Algorithm

• Create a root node for the tree
• If all cases are positive, return the single-node tree with label +
• If all cases are negative, return the single-node tree with label −
• Otherwise begin:
  – For each possible value of the node's attribute:
    • Add a new tree branch
    • Let the cases be the subset of all data that have this value
  – Add new nodes with new subtrees until a leaf node is reached
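A compact recursive sketch of this procedure (ID3-style, reusing the entropy/gain idea from earlier; all names are illustrative):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    g = entropy(labels)
    for v in set(r[attr] for r in rows):
        sub = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        g -= len(sub) / len(rows) * entropy(sub)
    return g

def id3(rows, labels, attributes):
    """Return a nested-dict tree {attribute: {value: subtree}}, or a label at a leaf."""
    if len(set(labels)) == 1:                    # all cases positive or all negative
        return labels[0]
    if not attributes:                           # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, labels, a))
    tree = {best: {}}
    for v in set(r[best] for r in rows):         # one branch per observed value of `best`
        sub = [(r, lab) for r, lab in zip(rows, labels) if r[best] == v]
        sub_rows, sub_labels = zip(*sub)
        tree[best][v] = id3(list(sub_rows), list(sub_labels),
                            [a for a in attributes if a != best])
    return tree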

Impurity Measures

• Information entropy:
  – Zero when the set consists of only one class; one (for two classes) when all classes are present in equal number
• Other measures of impurity: Gini

Entropy(S) = −Σ_{i=1}^{k} p_i log₂ p_i

Gini(S) = 1 − Σ_{i=1}^{k} p_i²
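A small sketch comparing the two measures on class proportions (function names and example proportions are illustrative):

import math

def entropy(ps):
    """-sum p_i log2 p_i over class proportions ps (0 log 0 taken as 0)."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def gini(ps):
    """1 - sum p_i^2 over class proportions ps."""
    return 1 - sum(p * p for p in ps)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # evenly mixed: 1.0 and 0.5 (maximum impurity)
print(entropy([0.9, 0.1]), gini([0.9, 0.1]))   # mostly one class: about 0.469 and 0.18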

Review of Log2

log₂(0) is undefined (but by the convention above, 0 · log₂ 0 = 0)
log₂(1) = 0
log₂(2) = 1
log₂(4) = 2
log₂(1/2) = −1
log₂(1/4) = −2
(1/2) log₂(1/2) = (1/2)(−1) = −1/2

Example

No. A1 A2 Classification

1 T T +

2 T T +

3 T F -

4 F F +

5 F T -

6 F T -

Example of Decision Tree

Size  Shape  Color  Choice

Medium Brick Blue Yes

Small Wedge Red No

Large Wedge Red No

Small Sphere Red Yes

Large Pillar Green Yes

Large Pillar Red No

Large Sphere Green Yes

Classification—A Two-Step Process

• Model construction: describing a set of predetermined classes based on a training set. This step is also called learning.
  – Each tuple/sample is assumed to belong to a predefined class
  – The model is represented as classification rules, decision trees, or mathematical formulae
• Model usage: classifying future test data/objects
  – Estimate the accuracy of the model
    • The known label of each test example is compared with the classified result from the model
    • The accuracy rate is the % of test cases that are correctly classified by the model
  – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction

Training Data:

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof  3      no
Mary   Assistant Prof  7      yes
Bill   Professor       2      yes
Jim    Associate Prof  7      yes
Dave   Assistant Prof  6      no
Anne   Associate Prof  3      no

Classification Algorithm → Classifier (Model):

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

The classifier is applied first to the testing data, then to unseen data.

Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
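Applying the learned classifier to the unseen record is then a one-line check (a sketch that simply encodes the rule produced on the previous slide):

def tenured(rank, years):
    """IF rank = 'professor' OR years > 6 THEN tenured = 'yes' (rank matched case-insensitively)."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

print(tenured("Professor", 4))   # Jeff -> 'yes'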

Example

• Consider the problem of learning whether or not to go jogging on a particular day.

Attribute Possible Values

WEATHER Warm, Cold, Raining

JOGGED_YESTERDAY Yes, No

The Data

WEATHER  JOGGED_YESTERDAY  CLASSIFICATION

C N +

W Y -

R Y -

C Y -

R N -

W Y -

C N -

W N +

C Y -

W Y +

W N +

C N +

R Y -

W Y -

Test Data

WEATHER  JOGGED_YESTERDAY  CLASSIFICATION

W Y -

R N +

C N +

C Y -

W N +

ID3: Some Issues

• Sometimes we arrive at a node with no examples.
  – This means that the combination of attribute values has not been observed.
  – We just assign as its value the majority vote of its parent.
• Sometimes we arrive at a node with both positive and negative examples and no attributes left.
  – This means that there is noise in the data.
  – We just assign as its value the majority vote of the examples.

Problems with Decision Tree

• ID3 is not optimal
  – It uses the expected entropy reduction, not the actual reduction
• It must use discrete (or discretized) attributes
  – What if we left for work at 9:30 AM?
  – We could break the attributes down into smaller values…

Problems with Decision Trees

• While decision trees classify quickly, the time for building a tree may be higher than for another type of classifier
• Decision trees suffer from a problem of errors propagating throughout the tree
  – This becomes a very serious problem as the number of classes increases

Decision Tree characteristics

• The training data may contain missing attribute values.
  – Decision tree methods can be used even when some training examples have unknown values.

Unknown Attribute Values

What if some examples are missing values of attribute A? Use the training example anyway and sort it through the tree:
• If node n tests A, assign the most common value of A among the other examples sorted to node n
• Assign the most common value of A among the other examples with the same target value
• Assign a probability p_i to each possible value v_i of A
  – Assign fraction p_i of the example to each descendant in the tree
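A minimal sketch of the first strategy (fill in the most common observed value of A among the examples at the node); the probability-based strategy would instead send fractions of the example down each branch:

from collections import Counter

def most_common_value(rows, attribute):
    """Most common observed value of `attribute` in `rows` (missing values ignored)."""
    values = [r[attribute] for r in rows if r.get(attribute) is not None]
    return Counter(values).most_common(1)[0][0]

def fill_missing(rows, attribute):
    """Replace missing values of `attribute` with the node's most common value."""
    fill = most_common_value(rows, attribute)
    return [dict(r, **{attribute: r[attribute] if r.get(attribute) is not None else fill})
            for r in rows]

rows = [{"Outlook": "Sunny"}, {"Outlook": None}, {"Outlook": "Sunny"}, {"Outlook": "Rain"}]
print(fill_missing(rows, "Outlook"))   # the missing Outlook value becomes "Sunny"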


Rule Generation

• Once a decision tree has been constructed, it is a simple matter to convert it into a set of rules.
• Converting to rules allows distinguishing among the different contexts in which a decision node is used.

Rule Generation

• Converting to rules improves readability.
  – Rules are often easier for people to understand.
• To generate rules, trace each path in the decision tree, from root node to leaf node.
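A sketch of that tracing step for the nested-dict tree representation used earlier (illustrative names):

def tree_to_rules(tree, conditions=()):
    """Trace every root-to-leaf path of a nested-dict tree into an IF-THEN rule."""
    if not isinstance(tree, dict):                       # leaf: emit one rule
        lhs = " AND ".join(f"({a}={v})" for a, v in conditions) or "(true)"
        return [f"IF {lhs} THEN class={tree}"]
    rules = []
    for attribute, branches in tree.items():
        for value, subtree in branches.items():
            rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules

play_tennis = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}
for rule in tree_to_rules(play_tennis):
    print(rule)   # reproduces rules R1-R5 of the PlayTennis tree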

Rule Simplification

Once a rule set has been devised:
– First, individual rules are simplified by eliminating redundant and unnecessary rules.
– Then, attempt to replace those rules that share the most common consequent by a default rule that is triggered when no other rule is triggered.

Attribute-Based Representations

• Examples of decisions

Continuous Valued Attributes

Create a discrete attribute to test the continuous one:
• Temperature = 24.5 °C
• (Temperature > 20.0 °C) ∈ {true, false}
Where to set the threshold?


Temperature  15 °C  18 °C  19 °C  22 °C  24 °C  27 °C
PlayTennis   No     No     Yes    Yes    Yes    No
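One common way to place the threshold is to evaluate the information gain of (Temperature > t) at midpoints between consecutive sorted values where the class changes. A self-contained sketch (names are illustrative):

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Pick the binary split (value > t) with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy([l for _, l in pairs])
    best_t, best_gain = None, -1.0
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 == l2:
            continue                           # only boundaries between classes matter
        t = (v1 + v2) / 2                      # candidate threshold: the midpoint
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        g = (base
             - len(left) / len(pairs) * entropy(left)
             - len(right) / len(pairs) * entropy(right))
        if g > best_gain:
            best_t, best_gain = t, g
    return best_t, best_gain

temps  = [15, 18, 19, 22, 24, 27]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, labels))   # (18.5, ≈0.459): test Temperature > 18.5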

Pruning Trees

• There is another technique for reducing the number of attributes used in a tree: pruning
• Two types of pruning:
  – Pre-pruning (forward pruning)
  – Post-pruning (backward pruning)

Prepruning

• In pre-pruning, we decide during the building process when to stop adding attributes (possibly based on their information gain)
• However, this may be problematic. Why?
  – Sometimes attributes individually do not contribute much to a decision, but combined they may have a significant impact

Postpruning

• Post-pruning waits until the full decision tree has been built and then prunes the attributes
• Two techniques:
  – Subtree Replacement
  – Subtree Raising

Subtree Replacement

• Entire subtree is replaced by a single leaf node

[Figure: a tree with root A, an inner node B, and a subtree rooted at C with leaves 1, 2, and 3; leaves 4 and 5 hang elsewhere under A/B.]

Subtree Replacement

• Node 6 replaces the subtree rooted at C
• This generalizes the tree a little more, but may increase accuracy

[Figure: the same tree after replacement, with the subtree at C collapsed into the single leaf 6 (leaves 4 and 5 unchanged).]
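A minimal sketch of subtree replacement in the reduced-error style: working bottom-up, a subtree is replaced by a majority-class leaf whenever that classifies a held-out pruning set at least as well. This is one common way to do it, not necessarily the exact procedure intended here; names are illustrative and the majority class is taken from the pruning examples reaching the node:

from collections import Counter

def classify(tree, example):
    """Nested-dict tree {attribute: {value: subtree_or_label}}; assumes all values are seen."""
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[example[attribute]]
    return tree

def prune(tree, rows, labels):
    """Replace a subtree by a majority leaf if pruning-set accuracy does not drop."""
    if not isinstance(tree, dict) or not rows:
        return tree
    attribute, branches = next(iter(tree.items()))
    for value, subtree in branches.items():          # prune children first (bottom-up)
        reached = [(r, l) for r, l in zip(rows, labels) if r[attribute] == value]
        branches[value] = prune(subtree, [r for r, _ in reached], [l for _, l in reached])
    keep_correct = sum(classify(tree, r) == l for r, l in zip(rows, labels))
    majority = Counter(labels).most_common(1)[0][0]
    leaf_correct = sum(l == majority for l in labels)
    return majority if leaf_correct >= keep_correct else tree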

Subtree Raising

• Entire subtree is raised onto another node

[Figure: the original tree again (root A, node B, subtree C with leaves 1, 2, 3, plus leaves 4 and 5) before raising.]

Subtree Raising

• The entire subtree is raised onto another node
• This was not discussed in detail, as it is not clear whether it is really worthwhile (it is very time consuming)

[Figure: after raising, the subtree rooted at C (leaves 1, 2, 3) takes the place of B directly under A.]
