1
COMP3503 Inductive Decision Trees
with
Daniel L. Silver
2
Agenda
 Explanatory/Descriptive Modeling
 Inductive Decision Tree Theory
 The Weka IDT System
 Weka IDT Tutorial
4
Overview of Data Mining Methods
Automated Exploration/Discovery
• e.g. discovering new market segments
• distance and probabilistic clustering algorithms
Prediction/Classification
• e.g. forecasting gross sales given current factors
• statistics (regression, K-nearest neighbour)
• artificial neural networks, genetic algorithms
Explanation/Description
• e.g. characterizing customers by demographics
• inductive decision trees/rules
• rough sets, Bayesian belief nets
[Figure: x1/x2 axes showing clusters A and B, a fitted curve f(x), and a sample rule: if age > 35 and income < $35k then ...]
5
Inductive Modeling = Learning
Objective: Develop a general model or hypothesis from specific examples
 Function approximation (curve fitting)
 Classification (concept learning, pattern recognition)
6
Inductive Modeling with IDT
Basic Framework for Inductive Learning

[Diagram: the Environment supplies Training Examples (x, f(x)) to the Inductive Learning System, which produces an Induced Model or Classifier; given Testing Examples, the model's Output Classification is (x, h(x)), and we ask: h(x) ~= f(x)?]

The focus is on developing a model h(x) that can be understood (is transparent).
8
Inductive Decision Trees
Decision Tree
 A representational structure
 An acyclic, directed graph
 Nodes are either a:
• Leaf - indicates class or value (distribution)
• Decision node - a test on a single attribute; will have one branch and subtree for each possible outcome of the test
 Classification is made by traversing from root to a leaf in accord with the tests
[Diagram: a tree with decision nodes A? (the Root), B?, C?, D?, and a Leaf labelled Yes]
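The root-to-leaf traversal just described can be sketched in a few lines (a minimal Python illustration, not Weka code; the tree and attribute names are invented for the example):

```python
# A decision node tests one attribute and routes the example down the
# branch matching the attribute's value; a leaf holds the predicted class.

class Leaf:
    def __init__(self, label):
        self.label = label

class Decision:
    def __init__(self, attribute, branches):
        self.attribute = attribute   # name of the attribute tested
        self.branches = branches     # dict: attribute value -> subtree

def classify(node, example):
    """Traverse from the root to a leaf in accord with the tests."""
    while isinstance(node, Decision):
        node = node.branches[example[node.attribute]]
    return node.label

# A hypothetical tree: test 'outlook' at the root, then 'windy' below.
tree = Decision("outlook", {
    "sunny": Leaf("no"),
    "overcast": Leaf("yes"),
    "rainy": Decision("windy", {True: Leaf("no"), False: Leaf("yes")}),
})

print(classify(tree, {"outlook": "rainy", "windy": False}))  # yes
```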
9
Inductive Decision Trees (IDTs)
A Long and Diverse History
Independently developed in the 60's and 70's by researchers in ...
Statistics: L. Breiman & J. Friedman - CART (Classification and Regression Trees)
Pattern Recognition: U. of Michigan - AID, G.V. Kass - CHAID (Chi-squared Automatic Interaction Detection)
AI and Info. Theory: R. Quinlan - ID3 (Iterative Dichotomizer 3) and its successor C4.5 - closest to our scenario
10
Inducing a Decision Tree
Given: A set of examples with Pos. & Neg. classes
Problem: Generate a Decision Tree model to classify a separate (validation) set of examples with minimal error
Approach: Occam's Razor - produce the simplest model that is consistent with the training examples -> a narrow, short tree; every traverse should be as short as possible
Formally: Finding the absolute simplest tree is intractable, so in practice we use heuristics that come as close as we can
11
Inducing a Decision Tree
How do we produce an optimal tree?
Heuristic (strategy) 1: Grow the tree from the top down. Place the most important variable test at the root of each successive subtree
The most important variable is:
• the variable (predictor) that gains the most ground in classifying the set of training examples
• the variable that has the most significant relationship to the response variable
• the variable on which the response is most dependent, or least independent
12
Inducing a Decision Tree
Importance of a predictor variable
CHAID/CART
• The Chi-squared [or F (Fisher)] statistic is used to test the independence between the categorical [or continuous] response variable and each predictor variable
• The lowest probability (p-value) from the test determines the most important predictor (p-values are first corrected by the Bonferroni adjustment)
C4.5 (section 4.3 of WFH, and PDF slides)
• Information-theoretic Gain is computed for each predictor, and the one with the highest Gain is chosen
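C4.5's gain criterion comes straight from the entropy of the class counts; a small sketch (the data at the end is illustrative, not from the course material):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attribute, labels):
    """Entropy of the whole set minus the weighted entropy of the
    subsets produced by splitting on `attribute` (examples: dicts)."""
    n = len(labels)
    split = {}
    for ex, y in zip(examples, labels):
        split.setdefault(ex[attribute], []).append(y)
    remainder = sum(len(part) / n * entropy(part) for part in split.values())
    return entropy(labels) - remainder

# A perfectly predictive attribute removes all uncertainty: gain = 1 bit.
xs = [{"a": 1}, {"a": 1}, {"a": 2}, {"a": 2}]
ys = ["yes", "yes", "no", "no"]
print(info_gain(xs, "a", ys))  # 1.0
```

The predictor with the highest gain is placed at the root of the (sub)tree, per Heuristic 1.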
13
Inducing a Decision Tree
How do we produce an optimal tree?
Heuristic (strategy) 2: To be fair to predictor variables that have only 2 values, divide variables with multiple values into similar groups or segments, which are then treated as separate variables (CART/CHAID only)
 The p-values from the Chi-squared or F statistic are used to determine which variable/value combinations are most similar in terms of their relationship to the response variable
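The grouping idea can be sketched as follows. This is a greatly simplified stand-in: real CHAID repeatedly merges the pair of categories with the highest chi-squared p-value, whereas here we simply group categories whose positive-class rates lie within a tolerance; the category names and counts are invented:

```python
def merge_similar_values(value_stats, tol=0.1):
    """Greedy grouping of attribute values whose positive-class rates
    are within `tol` of each other. value_stats: {value: (pos, total)}.
    A simplified stand-in for CHAID's chi-squared-based merging."""
    rates = {v: p / t for v, (p, t) in value_stats.items()}
    groups = []
    for v in sorted(rates, key=rates.get):
        if groups and abs(rates[v] - rates[groups[-1][-1]]) < tol:
            groups[-1].append(v)     # similar to the last group -> merge
        else:
            groups.append([v])       # start a new segment
    return groups

# Hypothetical (positives, total) counts per category of one predictor.
stats = {"single": (8, 10), "married": (15, 20), "divorced": (2, 10)}
print(merge_similar_values(stats))  # [['divorced'], ['married', 'single']]
```

Each resulting segment is then treated as a single value of the predictor when the split is evaluated.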
14
Inducing a Decision Tree
How do we produce an optimal tree?
Heuristic (strategy) 3: Prevent overfitting the tree to the training data, so that it generalizes well to a validation set, by:
 Stopping: Prevent the split on a predictor variable if it is above a level of statistical significance - simply make it a leaf (CHAID)
 Pruning: After a complex tree has been grown, replace a split (subtree) with a leaf if the predicted validation error is no worse than that of the more complex tree (CART, C4.5)
15
Inducing a Decision Tree
 Stopping (pre-pruning) means choosing a level of significance (CHAID) ...
 If the probability (p-value) of the statistic is less than the chosen level of significance, then a split is allowed
 Typically the significance level is set to:
• 0.05, which provides 95% confidence
• 0.01, which provides 99% confidence
16
Inducing a Decision Tree
 Stopping can also mean requiring a minimum number of examples at a leaf node (C4.5 = J48) ...
 M factor = minimum number of examples allowed at a leaf node
 M = 2 is the default
17
Inducing a Decision Tree
 Pruning means reducing the complexity of a tree (C4.5 = J48) ...
 C factor = confidence in the data used to train the tree
 C = 25% is the default
 If there is 25% confidence that a pruned branch will generate no more errors on a test set than the unpruned branch, then prune it.
 p. 196 WFH, PDF slides
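The error estimate behind this pruning decision (WFH p. 196) takes the observed training error rate f at a node and inflates it to the upper limit of a one-sided confidence interval at confidence C. A sketch, assuming the standard normal approximation to the binomial that C4.5 uses:

```python
from math import sqrt
from statistics import NormalDist

def pessimistic_error(f, n, c=0.25):
    """Upper confidence limit on the true error rate, given an observed
    error rate f over n examples, at confidence level c. With c = 0.25
    the z value is about 0.67 (WFH round it to 0.69)."""
    z = NormalDist().inv_cdf(1 - c)
    return (f + z * z / (2 * n)
            + z * sqrt(f / n - f * f / n + z * z / (4 * n * n))) \
           / (1 + z * z / n)

# E.g. 2 errors out of 6 training examples (f = 0.33) is inflated to
# roughly 0.47 -- the estimate C4.5 compares when deciding to prune.
print(round(pessimistic_error(2 / 6, 6), 2))
```

A subtree is replaced by a leaf when the leaf's pessimistic error is no worse than the combined estimate over the subtree's branches.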
18
The Weka IDT System
 Weka SimpleCART creates a tree-based classification model
 The target or response variable must be categorical (multiple classes allowed)
 Uses the Chi-squared test for significance
 Prunes the tree by using a test/tuning set
Copyright (c) 2002, All Rights Reserved
19
The Weka IDT System
 Weka J48 creates a tree-based classification model = Ross Quinlan's original C4.5 algorithm
 The target or response variable must be categorical
 Uses the information gain test for significance
 Prunes the tree by using a test/tuning set
20
The Weka IDT System
 Weka M5P creates a tree-based model (a model tree) = also by Ross Quinlan
 The target or response variable must be continuous
 Uses an information gain test for significance
 Prunes the tree by using a test/tuning set
22
IDT Training
How do you ensure that a decision tree has been well trained?
 Objective: To achieve good generalization accuracy on new examples/cases
 Establish a maximum acceptable error rate
 Train the tree using a method to prevent over-fitting - stopping / pruning
 Validate the trained tree against a separate test set
23
IDT Training
Approach #1: Large Sample
When the amount of available data is large ...

[Diagram: the Available Examples are divided randomly into a Training Set (70%), used to develop one IDT model, and a Hold-Out (HO) / Test Set (30%), used to compute goodness of fit. Generalization = goodness of fit.]
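The random 70/30 division can be sketched as (a plain-Python illustration; the integers stand in for labelled examples):

```python
import random

def train_test_split(examples, train_frac=0.70, seed=0):
    """Divide the available examples randomly: a large training set to
    develop the single IDT model, and a held-out test set on which
    goodness of fit (generalization) is computed."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)   # fixed seed: reproducible
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))              # stand-ins for labelled examples
train, test = train_test_split(data)
print(len(train), len(test))         # 70 30
```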
24
IDT Training
Approach #2: Cross-validation
When the amount of available data is small ...

[Diagram: the Available Examples are divided into a Training Set (90%) and a Hold-Out / Test Set (10%); this is repeated 10 times, developing 10 different IDT models and tabulating goodness-of-fit stats. Generalization = mean and std. dev. of goodness of fit.]
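The 10 repeated 90/10 divisions amount to 10-fold cross-validation; a minimal sketch of generating the folds (plain Python, not Weka's implementation):

```python
def cross_validation_folds(examples, k=10):
    """Yield (train, holdout) pairs: each pass holds out a different
    ~10% slice and trains on the remaining ~90%, so k different IDT
    models are developed and their goodness-of-fit stats tabulated."""
    for i in range(k):
        holdout = examples[i::k]    # every k-th example, offset by i
        train = [x for j, x in enumerate(examples) if j % k != i]
        yield train, holdout

data = list(range(50))              # stand-ins for labelled examples
folds = list(cross_validation_folds(data))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 45 5
```

Generalization is then reported as the mean and standard deviation of the k per-fold goodness-of-fit figures.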
25
IDT Training
How do you select between two induced decision trees?
 A statistical test of hypothesis is required to ensure that a significant difference exists between the fit of two IDT models
 If the Large Sample method has been used, then apply McNemar's test* or a difference-of-proportions test
 If Cross-validation has been used, then use a paired t test for the difference of two proportions
*We assume a classification problem; if this is function approximation, then use a paired t test for a difference of means
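For the Large Sample case, McNemar's test needs only the counts of examples on which the two trees disagree; a sketch (the counts at the end are invented for illustration):

```python
def mcnemar_statistic(b, c):
    """McNemar's chi-squared statistic with continuity correction.
    b = examples tree 1 classifies correctly and tree 2 gets wrong,
    c = examples tree 2 classifies correctly and tree 1 gets wrong.
    Compare against the chi-squared critical value with 1 degree of
    freedom (3.841 at the 0.05 significance level)."""
    return (abs(b - c) - 1) ** 2 / (b + c)

# Hypothetical disagreement counts between two induced trees:
stat = mcnemar_statistic(b=15, c=5)
print(stat, stat > 3.841)  # 4.05 True -> significant difference at 0.05
```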
26
Pros and Cons of IDTs
Cons:
 Only one response variable at a time
 Different significance tests required for nominal and continuous responses
 Can have difficulties with noisy data
 Discriminant functions are often suboptimal due to orthogonal decision hyperplanes
27
Pros and Cons of IDTs
Pros:
 A proven modeling method for 20 years
 Provides explanation and prediction
 Ability to learn arbitrary functions
 Handles unknown values well
 Rapid training and recognition speed
 Has inspired many inductive learning algorithms using statistical regression
28
The IDT Application Development Process
Guidelines for inducing decision trees:
1. IDTs are a good method to start with
2. Get a suitable training set
3. Use a sensible coding for input variables
4. Develop the simplest tree by adjusting tuning parameters (significance level)
5. Use a method to prevent over-fitting
6. Determine confidence in generalization through cross-validation