
Page 1

3.1

Ch. 6: Decision Trees

Stephen Marsland, Machine Learning: An Algorithmic Perspective. CRC 2009

Based on slides from Stephen Marsland, Jia Li, and Ruoming Jin.

Longin Jan Latecki, Temple University, [email protected]

Page 2

3.2

Illustrating Classification Task

[Diagram: a Training Set is fed to a Learning algorithm, which induces a model ("Learn Model", induction); the Model is then applied to a Test Set ("Apply Model", deduction).]

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Page 3

3.3

Decision Trees

Break the classification down into a series of choices, one feature at a time

Lay the choices out in a tree

Progress down the tree to the leaves to reach a decision

Page 4

3.4

Play Tennis Example

Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes

Page 5

3.5

Day Outlook Temp Humid Wind Play?

1 Sunny Hot High Weak No

2 Sunny Hot High Strong No

3 Overcast Hot High Weak Yes

4 Rain Mild High Weak Yes

5 Rain Cool Normal Weak Yes

6 Rain Cool Normal Strong No

7 Overcast Cool Normal Strong Yes

8 Sunny Mild High Weak No

9 Sunny Cool Normal Weak Yes

10 Rain Mild Normal Weak Yes

11 Sunny Mild Normal Strong Yes

12 Overcast Mild High Strong Yes

13 Overcast Hot Normal Weak Yes

14 Rain Mild High Strong No

Page 6

3.6

Rules and Decision Trees

Can turn the tree into a set of rules for "play = yes":

(outlook = sunny & humidity = normal) |
(outlook = overcast) |
(outlook = rain & wind = weak)

How do we generate the trees?
Need to choose features
Need to choose the order of the features
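As a quick illustration (added here, not part of the original slides), the play-tennis tree above can be written directly as nested conditionals; the function and argument names are made up for the example.

def play_tennis(outlook, humidity, wind):
    # Hand-coded version of the play-tennis decision tree above.
    # Each branch mirrors one path from the root (Outlook) down to a leaf.
    if outlook == "Sunny":
        return "Yes" if humidity == "Normal" else "No"
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Rain":
        return "Yes" if wind == "Weak" else "No"
    raise ValueError("unknown outlook: " + outlook)

print(play_tennis("Sunny", "High", "Strong"))   # day 2 in the table -> "No"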

Page 7

3.7

A tree-structured classification rule for a medical example


Page 8

3.8

The construction of a tree involves the following three elements:

1. The selection of the splits.
2. The decision when to declare a node terminal or to continue splitting.
3. The assignment of each terminal node to one of the classes.

Page 9

3.9

Goodness of split


The goodness of a split is measured by an impurity function defined for each node.

Intuitively, we want each leaf node to be "pure", that is, dominated by a single class.

Page 10

3.10

How to determine the Best Split

Before splitting: 10 records of class C0, 10 records of class C1.

Three candidate test conditions:

Own Car?     Yes: C0: 6, C1: 4      No: C0: 4, C1: 6
Car Type?    Family: C0: 1, C1: 3   Sports: C0: 8, C1: 0   Luxury: C0: 1, C1: 7
Student ID?  c1: C0: 1, C1: 0  ...  c10: C0: 1, C1: 0      c11: C0: 0, C1: 1  ...  c20: C0: 0, C1: 1

Which test condition is the best?

Page 11

3.11

How to determine the Best Split

Greedy approach: nodes with a homogeneous class distribution are preferred

Need a measure of node impurity:

C0: 5, C1: 5   (non-homogeneous, high degree of impurity)
C0: 9, C1: 1   (homogeneous, low degree of impurity)

Page 12

3.12

Measures of Node Impurity

Entropy

Gini Index

Page 13

3.13

Entropy

Let F be a feature with possible values f1, …, fn

Let p be the probability distribution of F; usually p is simply given by the histogram (p1, ..., pn), where pi is the proportion of the data that has value F = fi.

The entropy of p tells us how much extra information we get from knowing the value of the feature, i.e., that F = fi for a given data point.

Entropy measures the amount of impurity. It makes sense to pick the feature that provides the most information.
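A minimal sketch (added here, not from the slides) of computing the entropy of such a histogram in Python; the function name is made up.

import math

def entropy(counts):
    # Entropy of a class-count histogram, in bits (log base 2).
    # counts: non-negative class counts, e.g. [2, 4].
    total = sum(counts)
    result = 0.0
    for c in counts:
        if c > 0:                 # the term 0 * log 0 is taken to be 0
            p = c / total
            result -= p * math.log2(p)
    return result

# Reproduces the values on the "Examples for computing Entropy" slide below:
print(entropy([0, 6]))   # 0.0
print(entropy([1, 5]))   # ~0.65
print(entropy([2, 4]))   # ~0.92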

Page 14

3.14

E.g., if F is a feature with two possible values +1 and -1, and p1 = 1 and p2 = 0, then we get no new information from knowing that F = +1 for a given example, so the entropy is zero. If p1 = 0.5 and p2 = 0.5, then the entropy is at its maximum of 1 bit.

Page 15

3.15

Entropy and Decision Tree

Entropy at a given node t:

Entropy(t) = - Σ_j p(j|t) log p(j|t)

(NOTE: p(j|t) is the relative frequency of class j at node t.)

Measures the homogeneity of a node:

Maximum (log nc) when records are equally distributed among all classes, implying least information

Minimum (0.0) when all records belong to one class, implying most information

Entropy-based computations are similar to the GINI index computations

Page 16

3.16

Examples for computing Entropy

Entropy(t) = - Σ_j p(j|t) log2 p(j|t)

C1: 0, C2: 6  ->  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = - 0 log 0 - 1 log 1 = - 0 - 0 = 0

C1: 1, C2: 5  ->  P(C1) = 1/6, P(C2) = 5/6
Entropy = - (1/6) log2(1/6) - (5/6) log2(5/6) = 0.65

C1: 2, C2: 4  ->  P(C1) = 2/6, P(C2) = 4/6
Entropy = - (2/6) log2(2/6) - (4/6) log2(4/6) = 0.92

Page 17

3.17

Splitting Based on Information Gain

Information Gain:

GAIN_split = Entropy(p) - Σ_{i=1..k} (n_i / n) Entropy(i)

Parent node p is split into k partitions; n_i is the number of records in partition i and n is the number of records at p.

Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).

Used in ID3 and C4.5.

Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
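A small sketch (added, not from the slides) of the gain formula above; the helper names are made up, and the entropy function is repeated so the snippet stands alone.

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    # GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i)
    # parent_counts: class counts at the parent, e.g. [9, 5]
    # child_counts_list: class counts in each partition, e.g. [[6, 2], [3, 3]]
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts_list)
    return entropy(parent_counts) - weighted

# The Wind example on the next slide: S = [9+, 5-], Weak = [6+, 2-], Strong = [3+, 3-]
print(round(information_gain([9, 5], [[6, 2], [3, 3]]), 3))   # ~0.048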

Page 18

3.18

Information Gain

Choose the feature that provides the highest information gain over all examples

That is all there is to ID3: at each stage, pick the feature with the highest information gain.

Page 19

3.19

Example

Values(Wind) = {Weak, Strong}
S = [9+, 5-]
S(Weak) = [6+, 2-]
S(Strong) = [3+, 3-]

Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S(Weak)) - (6/14) Entropy(S(Strong))

             = 0.94 - (8/14)(0.811) - (6/14)(1.00) ≈ 0.048

Page 20

3.20

Day Outlook Temp Humid Wind Play?

1 Sunny Hot High Weak No

2 Sunny Hot High Strong No

3 Overcast Hot High Weak Yes

4 Rain Mild High Weak Yes

5 Rain Cool Normal Weak Yes

6 Rain Cool Normal Strong No

7 Overcast Cool Normal Strong Yes

8 Sunny Mild High Weak No

9 Sunny Cool Normal Weak Yes

10 Rain Mild Normal Weak Yes

11 Sunny Mild Normal Strong Yes

12 Overcast Mild High Strong Yes

13 Overcast Hot Normal Weak Yes

14 Rain Mild High Strong No
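As a cross-check (added, not in the slides), the information gain of each feature on the 14 examples above can be computed directly; the data list and helper names below are made up for the illustration.

import math
from collections import Counter

# The 14 play-tennis examples above, as (Outlook, Temp, Humid, Wind, Play?).
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]
features = ["Outlook", "Temp", "Humid", "Wind"]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(col):
    labels = [row[-1] for row in data]
    remainder = 0.0
    for v in set(row[col] for row in data):
        sub = [row[-1] for row in data if row[col] == v]
        remainder += len(sub) / len(labels) * entropy(sub)
    return entropy(labels) - remainder

for i, f in enumerate(features):
    print(f, round(gain(i), 3))
# Outlook 0.247, Temp 0.029, Humid 0.152, Wind 0.048 -> Outlook is chosen as the root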

Page 21

3.21

ID3 (Quinlan)

Searches over the space of possible trees
Greedy search, no backtracking
Susceptible to local minima
Uses all features, no pruning

Can deal with noise
Leaf labels are the most common value of the examples reaching them

Page 22

3.22

ID3

[Diagram: the root tests the feature F with the highest Gain(S,F); each value v, w, x, y of F gets its own branch. A branch whose subset Sv contains only examples of one category c1 becomes a leaf labelled c1; otherwise the branch is split again on the feature G with the highest Gain(Sw,G), and so on.]
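A compact sketch (added, not the book's implementation) of the ID3 recursion just described, assuming examples are stored as Python dicts mapping feature name to value; all names are made up for the illustration.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def id3(examples, labels, features):
    # examples: list of dicts {feature: value}; labels: class label per example.
    if len(set(labels)) == 1:                 # pure node: make a leaf
        return labels[0]
    if not features:                          # no features left: majority label
        return Counter(labels).most_common(1)[0][0]

    def gain(f):
        n = len(labels)
        remainder = 0.0
        for v in set(ex[f] for ex in examples):
            sub = [lab for ex, lab in zip(examples, labels) if ex[f] == v]
            remainder += len(sub) / n * entropy(sub)
        return entropy(labels) - remainder

    best = max(features, key=gain)            # greedy choice, no backtracking
    tree = {best: {}}
    for v in set(ex[best] for ex in examples):
        sub_ex = [ex for ex in examples if ex[best] == v]
        sub_lab = [lab for ex, lab in zip(examples, labels) if ex[best] == v]
        tree[best][v] = id3(sub_ex, sub_lab, [f for f in features if f != best])
    return tree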

Page 23

3.23

Search

Page 24

3.24

Example

Outlook  [9+, 5-]
  Sunny     [2+, 3-]  -> ?
  Overcast  [4+, 0-]  -> Yes
  Rain      [3+, 2-]  -> ?

Page 25

3.25

Inductive Bias

How does the algorithm generalise from the training examples?
Choose the features with the highest information gain
Minimise the amount of information that is left
Bias towards shorter trees (Occam's Razor)
Put the most useful features near the root

Page 26

3.26

Missing Data

Suppose that one feature has no value
Can skip that node and carry on down the tree, following all paths out of that node
Can therefore still get a classification
This is virtually impossible with neural networks

Page 27

3.27

C4.5

Improved version of ID3, also by Quinlan
Uses a validation set to avoid overfitting

Could just stop choosing features (early stopping)

Better results come from post-pruning:
Build the whole tree
Chop off some parts of the tree afterwards

Page 28

3.28

Post-Pruning

Run over the tree
Prune each node by replacing the subtree below it with a leaf
Evaluate the error and keep the leaf if the error is the same or better
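A rough sketch (added) of this kind of reduced-error post-pruning, assuming the dict-based tree produced by the ID3 sketch earlier; the classify and prune helpers are made up, and for brevity the candidate leaf is labelled with the majority class of the validation examples reaching the node, which is a simplification.

from collections import Counter

def classify(tree, example):
    # Walk a dict-based tree of the form {feature: {value: subtree_or_label}}.
    while isinstance(tree, dict):
        feature = next(iter(tree))
        tree = tree[feature].get(example[feature])
    return tree

def prune(tree, val_examples, val_labels):
    # Bottom-up: replace a subtree with a majority-class leaf whenever that
    # does not increase the error on the validation examples that reach it.
    if not isinstance(tree, dict) or not val_labels:
        return tree
    feature = next(iter(tree))
    for value in list(tree[feature]):
        reach_ex = [ex for ex in val_examples if ex[feature] == value]
        reach_lab = [lab for ex, lab in zip(val_examples, val_labels) if ex[feature] == value]
        tree[feature][value] = prune(tree[feature][value], reach_ex, reach_lab)
    leaf = Counter(val_labels).most_common(1)[0][0]
    subtree_err = sum(classify(tree, ex) != lab for ex, lab in zip(val_examples, val_labels))
    leaf_err = sum(leaf != lab for lab in val_labels)
    return leaf if leaf_err <= subtree_err else tree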

Page 29

3.29

Rule Post-Pruning

Turn the tree into a set of if-then rules
Remove preconditions from each rule in turn, and check the accuracy
Sort the rules according to accuracy
Rules are easy to read

Page 30

3.30

Rule Post-Pruning

IF ((outlook = sunny) & (humidity = high)) THEN playTennis = no

Remove preconditions:
Consider IF (outlook = sunny)
and IF (humidity = high)
Test the accuracy
If one of them is better, try removing both

Page 31

3.31

ID3 Decision Tree

Outlook
  Sunny -> Humidity
    High -> No
    Normal -> Yes
  Overcast -> Yes
  Rain -> Wind
    Strong -> No
    Weak -> Yes

Page 32

3.32

Test Case

Outlook = Sunny
Temperature = Cool
Humidity = High
Wind = Strong

Page 33

3.33

Party Example, Section 6.4, p. 147

Construct a decision tree based on these data:


Deadline, Party, Lazy, Activity
Urgent, Yes, Yes, Party
Urgent, No, Yes, Study
Near, Yes, Yes, Party
None, Yes, No, Party
None, No, Yes, Pub
None, Yes, No, Party
Near, No, No, Study
Near, No, Yes, TV
Near, Yes, Yes, Party
Urgent, No, No, Study
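As an illustration (added, not part of the slides), one way to build and inspect a tree for these data in Python with scikit-learn, whose DecisionTreeClassifier implements CART (here with the entropy criterion) rather than ID3; the variable names are made up.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

rows = [
    ("Urgent", "Yes", "Yes", "Party"), ("Urgent", "No", "Yes", "Study"),
    ("Near", "Yes", "Yes", "Party"),   ("None", "Yes", "No", "Party"),
    ("None", "No", "Yes", "Pub"),      ("None", "Yes", "No", "Party"),
    ("Near", "No", "No", "Study"),     ("Near", "No", "Yes", "TV"),
    ("Near", "Yes", "Yes", "Party"),   ("Urgent", "No", "No", "Study"),
]
df = pd.DataFrame(rows, columns=["Deadline", "Party", "Lazy", "Activity"])

X = pd.get_dummies(df[["Deadline", "Party", "Lazy"]])   # one-hot encode the categories
y = df["Activity"]

clf = DecisionTreeClassifier(criterion="entropy").fit(X, y)
print(export_text(clf, feature_names=list(X.columns)))

# Classify a new evening: near deadline, there is a party, feeling lazy.
query = pd.DataFrame([{"Deadline": "Near", "Party": "Yes", "Lazy": "Yes"}])
query = pd.get_dummies(query).reindex(columns=X.columns, fill_value=0)
print(clf.predict(query))   # expected: ['Party']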

Page 34

3.35

Measure of Impurity: GINI

Gini index for a given node t:

GINI(t) = 1 - Σ_j [p(j|t)]²

(NOTE: p(j|t) is the relative frequency of class j at node t.)

Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information

Minimum (0.0) when all records belong to one class, implying most interesting information

C1: 0, C2: 6  ->  Gini = 0.000
C1: 1, C2: 5  ->  Gini = 0.278
C1: 2, C2: 4  ->  Gini = 0.444
C1: 3, C2: 3  ->  Gini = 0.500
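A one-function sketch (added) that reproduces the Gini values above; the function name is made up.

def gini(counts):
    # Gini index of a class-count histogram: 1 - sum_j p(j)^2
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(round(gini([0, 6]), 3))   # 0.0
print(round(gini([1, 5]), 3))   # 0.278
print(round(gini([2, 4]), 3))   # 0.444
print(round(gini([3, 3]), 3))   # 0.5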

Page 35

3.36

Examples for computing GINI

GINI(t) = 1 - Σ_j [p(j|t)]²

C1: 0, C2: 6  ->  P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 - P(C1)² - P(C2)² = 1 - 0 - 1 = 0

C1: 1, C2: 5  ->  P(C1) = 1/6, P(C2) = 5/6
Gini = 1 - (1/6)² - (5/6)² = 0.278

C1: 2, C2: 4  ->  P(C1) = 2/6, P(C2) = 4/6
Gini = 1 - (2/6)² - (4/6)² = 0.444

Page 36

3.37

Splitting Based on GINI

Used in CART, SLIQ, SPRINT.

When a node p is split into k partitions (children), the quality of the split is computed as

GINI_split = Σ_{i=1..k} (n_i / n) GINI(i)

where n_i = number of records at child i, and n = number of records at node p.
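A small sketch (added) of the weighted split formula; it recomputes the B? example on the next slide, and the helper names are made up.

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(child_counts_list):
    # GINI_split = sum_i (n_i / n) * GINI(child_i), for a list of child count histograms.
    n = sum(sum(child) for child in child_counts_list)
    return sum(sum(child) / n * gini(child) for child in child_counts_list)

# The B? split on the next slide: N1 = [5, 2], N2 = [1, 4]
print(round(gini_split([[5, 2], [1, 4]]), 3))   # ~0.371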

Page 37

3.38

Binary Attributes: Computing GINI Index

Splits into two partitions. Larger and purer partitions are sought.

The attribute test B? splits the parent into Node N1 (Yes) and Node N2 (No).

Parent: C1: 6, C2: 6, Gini = 0.500

      N1   N2
C1     5    1
C2     2    4

Gini(N1) = 1 - (5/7)² - (2/7)² = 0.408

Gini(N2) = 1 - (1/5)² - (4/5)² = 0.320

Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 ≈ 0.371

Page 38

3.39

Categorical Attributes: Computing Gini Index

For each distinct value, gather counts for each class in the dataset

Use the count matrix to make decisions

Two-way split (find the best partition of values):

CarType   {Sports, Luxury}   {Family}
C1               3              1
C2               2              4
Gini = 0.400

CarType   {Sports}   {Family, Luxury}
C1            2              2
C2            1              5
Gini = 0.419

Multi-way split:

CarType   Family   Sports   Luxury
C1            1        2        1
C2            4        1        1
Gini = 0.393
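A brute-force sketch (added, not from the slides) of the "find best partition of values" step, using the CarType counts above; the helper names are made up.

from itertools import combinations

# Class counts per CarType value, from the multi-way table above: {value: [C1, C2]}
counts = {"Family": [1, 4], "Sports": [2, 1], "Luxury": [1, 1]}

def gini(c):
    total = sum(c)
    return 1.0 - sum((x / total) ** 2 for x in c)

def gini_split(groups):
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * gini(g) for g in groups)

def merged(values):
    # Element-wise sum of the count vectors for a set of attribute values.
    return [sum(counts[v][j] for v in values) for j in range(2)]

values = sorted(counts)
best = None
for r in range(1, len(values)):                 # every non-trivial two-way partition (mirrors included)
    for left in combinations(values, r):
        right = tuple(v for v in values if v not in left)
        score = gini_split([merged(left), merged(right)])
        if best is None or score < best[0]:
            best = (score, left, right)
print(best)   # picks {Family} vs {Sports, Luxury} with GINI_split = 0.400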

Page 39

3.40

Homework

Written homework for everyone, due on Oct. 5: Problem 6.3, p. 151.

Matlab Homework:

Demonstrate decision tree learning and classification on the party example.