
1

COMP3503 Inductive Decision Trees

with

Daniel L. Silver

2

Agenda
• Explanatory/Descriptive Modeling
• Inductive Decision Tree Theory
• The Weka IDT System
• Weka IDT Tutorial

3

Explanatory/Descriptive Modeling

4

Overview of Data Mining Methods

Automated Exploration/Discovery
• e.g. discovering new market segments
• distance and probabilistic clustering algorithms

Prediction/Classification
• e.g. forecasting gross sales given current factors
• statistics (regression, K-nearest neighbour)
• artificial neural networks, genetic algorithms

Explanation/Description
• e.g. characterizing customers by demographics
• inductive decision trees/rules
• rough sets, Bayesian belief nets

[Figures: a scatter plot of classes A and B over predictors x1 and x2; a fitted curve f(x) over x; example rule: if age > 35 and income < $35k then ...]

5

Inductive Modeling = Learning

Objective: Develop a general model or hypothesis from specific examples

• Function approximation (curve fitting)
• Classification (concept learning, pattern recognition)

[Figures: classification of classes A and B in the (x1, x2) plane; function approximation of f(x) over x]

6

Inductive Modeling with IDT

Basic Framework for Inductive Learning

[Diagram: the Environment supplies training examples (x, f(x)) to the Inductive Learning System, which produces an induced model or classifier. Testing examples are passed through the model to give output classifications (x, h(x)), and we ask: is h(x) ≈ f(x)?]

The focus is on developing a model h(x) that can be understood (is transparent).
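To make this framework concrete, here is a minimal, hypothetical sketch using the Weka Java API (assumed to be on the classpath): a J48 tree h is induced from training examples (x, f(x)), and its output classifications h(x) are compared with the true labels f(x) on separate testing examples. The file names train.arff and test.arff are placeholders.

    import weka.classifiers.trees.J48;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class FrameworkDemo {
        public static void main(String[] args) throws Exception {
            // Training and testing examples (x, f(x)); file names are placeholders.
            Instances train = DataSource.read("train.arff");
            Instances test  = DataSource.read("test.arff");
            train.setClassIndex(train.numAttributes() - 1);  // last attribute holds f(x)
            test.setClassIndex(test.numAttributes() - 1);

            // The inductive learning system induces a model/classifier h from (x, f(x)).
            J48 h = new J48();
            h.buildClassifier(train);

            // For each testing example, compare the output classification h(x) with f(x).
            for (int i = 0; i < test.numInstances(); i++) {
                Instance x = test.instance(i);
                String hx = test.classAttribute().value((int) h.classifyInstance(x));
                String fx = test.classAttribute().value((int) x.classValue());
                System.out.printf("h(x) = %-12s f(x) = %-12s %s%n",
                        hx, fx, hx.equals(fx) ? "agree" : "disagree");
            }
        }
    }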

7

Inductive Decision Tree Theory

8

Inductive Decision Trees

Decision Tree
• A representational structure
• An acyclic, directed graph
• Nodes are either a:
  - Leaf - indicates a class or value (distribution)
  - Decision node - a test on a single attribute; there is one branch and subtree for each possible outcome of the test
• Classification is made by traversing from the root to a leaf in accord with the tests

[Diagram: a small decision tree with root test A?, internal decision nodes B?, C?, D?, and leaf nodes such as Yes]

9

Inductive Decision Trees (IDTs)

A Long and Diverse History
Independently developed in the 60's and 70's by researchers in ...
• Statistics: L. Breiman & J. Friedman - CART (Classification and Regression Trees)
• Pattern Recognition: U. of Michigan - AID, G.V. Kass - CHAID (Chi-squared Automated Interaction Detection)
• AI and Info. Theory: R. Quinlan - ID3, C4.5 (Iterative Dichotomizer)

closest to Scenario

10

Inducing a Decision Tree

Given: A set of examples with Pos. & Neg. classes
Problem: Generate a Decision Tree model to classify a separate (validation) set of examples with minimal error
Approach: Occam's Razor - produce the simplest model that is consistent with the training examples -> a narrow, short tree, in which every traverse is as short as possible
Formally: Finding the absolute simplest tree is intractable, so greedy heuristics are used to get as close as we can

11

Inducing a Decision Tree

How do we produce an optimal tree?

Heuristic (strategy) 1: Grow the tree from the top down. Place the most important variable test at the root of each successive subtree.

The most important variable is:
• the variable (predictor) that gains the most ground in classifying the set of training examples
• the variable that has the most significant relationship to the response variable
• the variable on which the response is most dependent, or least independent

12

Inducing a Decision Tree

Importance of a predictor variable

CHAID/CART
• The Chi-squared [or F (Fisher)] statistic is used to test the independence between the categorical [or continuous] response variable and each predictor variable
• The lowest probability (p-value) from the test determines the most important predictor (p-values are first corrected by the Bonferroni adjustment)

C4.5 (section 4.3 of WFH, and PDF slides)
• Information-theoretic gain is computed for each predictor, and the one with the highest gain is chosen (see the sketch below)
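To make the C4.5 criterion concrete, here is a small, self-contained Java sketch of entropy and information gain for one categorical predictor. It is illustrative only, not Weka's internal implementation; the toy attribute and class values are made up.

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class InfoGain {

        // Entropy (in bits) of a list of class labels.
        static double entropy(List<String> labels) {
            Map<String, Integer> counts = new HashMap<>();
            for (String c : labels) counts.merge(c, 1, Integer::sum);
            double h = 0.0, n = labels.size();
            for (int count : counts.values()) {
                double p = count / n;
                h -= p * (Math.log(p) / Math.log(2));
            }
            return h;
        }

        // Gain(S, A) = H(S) - sum over values v of (|S_v| / |S|) * H(S_v)
        static double informationGain(List<String> attrValues, List<String> labels) {
            Map<String, List<String>> partitions = new HashMap<>();
            for (int i = 0; i < labels.size(); i++)
                partitions.computeIfAbsent(attrValues.get(i), k -> new ArrayList<>())
                          .add(labels.get(i));
            double remainder = 0.0, n = labels.size();
            for (List<String> subset : partitions.values())
                remainder += (subset.size() / n) * entropy(subset);
            return entropy(labels) - remainder;
        }

        public static void main(String[] args) {
            // Toy data: how much does "outlook" help to predict "play"?
            List<String> outlook = Arrays.asList("sunny", "sunny", "overcast", "rain", "rain");
            List<String> play    = Arrays.asList("no",    "no",    "yes",      "yes",  "no");
            System.out.printf("Gain(outlook) = %.3f bits%n", informationGain(outlook, play));
        }
    }

The predictor with the highest gain becomes the test at the current node; Weka's J48 actually uses the gain ratio refinement of this measure.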

13

Inducing a Decision Tree

How do we produce an optimal tree?

Heuristic (strategy) 2: To be fair to predictor variables that have only 2 values, divide variables with many values into similar groups or segments, which are then treated as separate variables (CART/CHAID only).

The p-values from the Chi-squared or F statistic are used to determine the variable/value combinations that are most similar in terms of their relationship to the response variable.

14

Inducing a Decision Tree

How do we produce an optimal tree?

Heuristic (strategy) 3: Prevent overfitting the tree to the training data, so that it generalizes well to a validation set, by:

Stopping: Prevent the split on a predictor variable if it is above a chosen level of statistical significance - simply make it a leaf (CHAID)

Pruning: After a complex tree has been grown, replace a split (subtree) with a leaf if the predicted validation error is no worse than that of the more complex tree (CART, C4.5)

15

Inducing a Decision Tree

Stopping (pre-pruning) means choosing a level of significance (CART) ...

If the probability (p-value) of the statistic is less than the chosen level of significance, then a split is allowed.

Typically the significance level is set to:
• 0.05, which provides 95% confidence
• 0.01, which provides 99% confidence

16

Inducing a Decision Tree

Stopping can also mean a minimum number of examples at a leaf node (C4.5 = J48) ...
• M factor = minimum number of examples allowed at a leaf node
• M = 2 is the default

17

Inducing a Decision Tree

Pruning means reducing the complexity of a tree (C4.5 = J48) ...
• C factor = confidence in the data used to train the tree
• C = 25% is the default
• If, at 25% confidence, a pruned branch is expected to generate the same or fewer errors on a test set than the unpruned branch, then prune it.

p.196 WFH, PDF slides
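As a rough sketch of how the M and C factors are set programmatically (assuming the Weka Java API is on the classpath; the dataset name weather.nominal.arff is a placeholder), J48 exposes them as setMinNumObj and setConfidenceFactor:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48PruningDemo {
        public static void main(String[] args) throws Exception {
            // Load a dataset with a categorical class (placeholder file name).
            Instances data = DataSource.read("weather.nominal.arff");
            data.setClassIndex(data.numAttributes() - 1);  // last attribute is the class

            J48 tree = new J48();
            tree.setConfidenceFactor(0.25f);  // C factor: smaller values prune more heavily
            tree.setMinNumObj(2);             // M factor: minimum examples allowed at a leaf
            tree.buildClassifier(data);

            System.out.println(tree);  // prints the induced (pruned) tree
        }
    }

The same settings appear in the Weka Explorer as J48's -C and -M options.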

18

The Weka IDT System

Weka SimpleCART creates a tree-based classification model
• The target or response variable must be categorical (multiple classes allowed)
• Uses the Chi-Squared test for significance
• Prunes the tree by using a test/tuning set


19

The Weka IDT System

Weka J48 creates a tree-based classification model = Ross Quinlan's original C4.5 algorithm
• The target or response variable must be categorical
• Uses the information gain test for significance
• Prunes the tree by using a test/tuning set


20

The Weka IDT System

Weka M5P creates a tree-based regression model (model trees) = also by Ross Quinlan
• The target or response variable must be continuous
• Splits are chosen to maximize the reduction in the standard deviation of the target
• Prunes the tree by using a test/tuning set


21

IDT Training

22

IDT Training

How do you ensure that a decision tree has been well trained?

Objective: To achieve good generalization accuracy on new examples/cases
• Establish a maximum acceptable error rate
• Train the tree using a method to prevent over-fitting - stopping / pruning
• Validate the trained tree against a separate test set

23

IDT Training

Approach #1: Large Sample
When the amount of available data is large ...

• Divide the available examples randomly into a Training Set (70%) and a held-out Test Set (30%)
• Use the Training Set to develop one IDT model
• Compute goodness of fit on the Test Set
• Generalization = goodness of fit on the held-out examples
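A minimal sketch of this hold-out procedure with the Weka Java API; the 70/30 proportions follow the slide, while the file name customers.arff and the random seed are placeholder assumptions:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class HoldoutDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("customers.arff");  // placeholder file name
            data.setClassIndex(data.numAttributes() - 1);
            data.randomize(new Random(1));                       // divide randomly

            int trainSize = (int) Math.round(data.numInstances() * 0.70);
            Instances train = new Instances(data, 0, trainSize);
            Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

            J48 tree = new J48();
            tree.buildClassifier(train);                         // develop one IDT model

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);                      // goodness of fit on the held-out set
            System.out.println(eval.toSummaryString("Hold-out results:", false));
        }
    }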

24

IDT Training

Approach #2: Cross-validation
When the amount of available data is small ...

• Divide the available examples randomly into a Training Set (90%) and a held-out Test Set (10%)
• Repeat 10 times so that 10 different IDT models are developed, each tested on a different 10% of the data
• Tabulate goodness-of-fit statistics across the 10 folds
• Generalization = mean and std. dev. of goodness of fit
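A minimal sketch of 10-fold cross-validation with the Weka Java API (placeholder file name). Weka's Evaluation class builds the 10 models internally, tests each on the 10% of data it did not see, and reports the pooled goodness-of-fit statistics:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidationDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("customers.arff");  // placeholder file name
            data.setClassIndex(data.numAttributes() - 1);

            // 10-fold cross-validation: ten trees are built, each tested on a held-out fold.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.println(eval.toSummaryString("10-fold cross-validation results:", false));
        }
    }

To obtain the mean and standard deviation of the per-fold accuracies, evaluate each fold separately and tabulate the results rather than relying on the pooled summary.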

25

IDT Training

How do you select between two induced decision trees?

• A statistical test of hypothesis is required to ensure that a significant difference exists between the fit of the two IDT models
• If the Large Sample method has been used, then apply McNemar's test* or a difference-of-proportions test
• If Cross-validation has been used, then use a paired t test for the difference of two proportions

*We assume a classification problem; if this is function approximation, then use a paired t test for the difference of means.
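As an illustration of McNemar's test for the large-sample case, here is a small self-contained sketch; the disagreement counts b and c are made-up values that would normally come from comparing the two trees' predictions on the same test set:

    /** McNemar's test for comparing two classifiers evaluated on the same test set. */
    public class McNemarDemo {
        public static void main(String[] args) {
            // b = examples tree A classified correctly that tree B got wrong
            // c = examples tree B classified correctly that tree A got wrong
            int b = 30, c = 15;  // illustrative counts only

            // Continuity-corrected McNemar statistic, approximately chi-squared with 1 d.o.f.
            double chi2 = Math.pow(Math.abs(b - c) - 1.0, 2) / (b + c);
            System.out.printf("McNemar chi-squared = %.3f%n", chi2);

            // 3.84 is the critical value for alpha = 0.05 with 1 degree of freedom.
            System.out.println(chi2 > 3.84
                    ? "Significant difference between the two trees (95% confidence)"
                    : "No significant difference between the two trees");
        }
    }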

26

Pros and Cons of IDTs

Cons:
• Only one response variable at a time
• Different significance tests are required for nominal and continuous responses
• Can have difficulties with noisy data
• Discriminant functions are often suboptimal due to orthogonal decision hyperplanes

27

Pros and Cons of IDTs

Pros:
• A proven modeling method for over 20 years
• Provides explanation and prediction
• Ability to learn arbitrary functions
• Handles unknown values well
• Rapid training and recognition speed
• Has inspired many inductive learning algorithms using statistical regression

28

The IDT Application Development Process

Guidelines for inducing decision trees:
1. IDTs are a good method to start with
2. Get a suitable training set
3. Use a sensible coding for input variables
4. Develop the simplest tree by adjusting tuning parameters (significance level)
5. Use a method to prevent over-fitting
6. Determine confidence in generalization through cross-validation

29

THE END

…@acadiau.ca