Chapter 8: Decision Tree Algorithms
Rule Based
Suitable for automatic generation
Contents
Presents the concept of decision tree models
Discusses the concept of rule interestingness
Demonstrates decision tree rules on a case
Reviews real applications of decision tree models
Shows the application of decision tree models to larger data sets
Demonstrates See5 decision tree analysis in the appendix
Problems
Grocery stores have a massive data problem in inventory control, dealt with to a high degree by bar-coding.
The massive database of transactions can be mined to monitor customer demand.
Decision trees provide a means to obtain product-specific forecasting models in the form of rules (IF-THEN) that are easy to implement.
Decision trees can be used by grocery stores in a number of policy decisions, including ordering inventory replenishment and evaluating alternative promotion campaigns.
Decision tree
A decision tree refers to a tree structure of rules (often association rules).
The decision tree modeling process involves collecting those variables that the analyst thinks might bear on the decision at issue, and analyzing these variables for their ability to predict the outcome.
The algorithm automatically determines which variables are most important, based on their ability to sort the data into the correct output category.
Decision trees have a relative advantage over ANN and GA in that a reusable set of rules is provided, thus explaining model conclusions.
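As a minimal sketch of this process (not part of the chapter), the example below fits a decision tree to a small hypothetical applicant table with scikit-learn and prints the induced rules; the column names and records are invented for illustration.

```python
# A minimal sketch (assumes scikit-learn is installed); the data are hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical loan-applicant records: categorical inputs, binary outcome.
data = pd.DataFrame({
    "age":    ["young", "young", "middle", "old", "middle", "old"],
    "income": ["low", "high", "average", "high", "low", "average"],
    "risk":   ["high", "low", "average", "low", "high", "low"],
    "ontime": [0, 1, 1, 1, 0, 1],
})

# One-hot encode the categorical predictors so the tree can split on them.
X = pd.get_dummies(data[["age", "income", "risk"]])
y = data["ontime"]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # reusable IF-THEN style rules
```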
Decision Tree Applications
Classifying loan applications, screening potential consumers, and rating job applicants.
Decision trees provide a way to implement the rule-based system approach.
Supervised learning models (tree structures, association models, rule induction):
Decision trees (categorical attributes): ID3, C4.5/C5, CART
Regression trees (continuous attributes): CART, M5, Cubist
Rule induction: CN2, ITRULE
Types of Trees
Classification tree: variable values are classes; finite conditions
Regression tree: variable values are continuous numbers; used for prediction or estimation
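To make the distinction concrete, a brief sketch using scikit-learn (the toy data are invented): a classification tree predicts a class label, while a regression tree estimates a continuous number.

```python
# Minimal sketch of the two tree types (scikit-learn; toy data invented for illustration).
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[25, 20_000], [40, 60_000], [60, 30_000], [35, 90_000]]   # e.g., age, income

# Classification tree: finite class outcomes (e.g., on-time = 1, late = 0).
clf = DecisionTreeClassifier().fit(X, [0, 1, 1, 1])
print(clf.predict([[30, 40_000]]))        # -> a class label

# Regression tree: continuous outcome (e.g., yearly clothing expenditure).
reg = DecisionTreeRegressor().fit(X, [300.0, 1200.0, 500.0, 2000.0])
print(reg.predict([[30, 40_000]]))        # -> an estimated number
```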
Rule Induction
Automatically process data: classification (logical, easier) or regression (estimation, messier)
Search through data for patterns & relationships: pure knowledge discovery
Assumes no prior hypothesis; disregards human judgment
Decision trees
Logical branching
Historical: ID3 (an early rule-generating system), C4.5, See5
Branches: the different possible values
Nodes: the points from which branches emanate
Decision tree operation
A bank may have a database of past applicants for short-term loans (see Table 4.4).
The bank's policy treats applicants differently by age group, income level, and risk.
A tree sorts the possible combinations of these variables. An exhaustive tree enumerates all combinations of variable values, as in Table 8.1.
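The exhaustive enumeration is easy to reproduce; a small sketch follows (the value labels are assumed to match the three levels per variable in Table 8.1).

```python
# Enumerate every combination of the three applicant variables (3 x 3 x 3 = 27 rows),
# mirroring the exhaustive tree of Table 8.1. The value labels are assumed.
from itertools import product

ages = ["young", "middle", "old"]
incomes = ["low", "average", "high"]
risks = ["low", "average", "high"]

combinations = list(product(ages, incomes, risks))
print(len(combinations))        # 27
for age, income, risk in combinations[:3]:
    print(age, income, risk)    # first few branches of the exhaustive tree
```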
Decision tree operation
A rule-based system model would require that bank loan officers whose judgment is respected be interviewed to classify the decision for each of these combinations of variables.
Some situations can be reduced directly in the decision tree.
Rule interestingness
Data, even categorical data, can potentially involve many rules.
In Table 8.1 there are 3 × 3 × 3 = 27 combinations. With 10 variables, each with 4 possible values, the combinations exceed a million (4^10 = 1,048,576), which is unreasonable to enumerate.
Decision tree models identify the most useful rules in terms of predicting outcomes. Rule effectiveness is measured in terms of confidence and support. Confidence is the degree of accuracy of a rule; support is the degree to which the antecedent conditions occur in the data.
Tanagra Example
Support & Confidence
Support for an association rule indicates the proportion of records covered by the set of attributes in the association rule.
Example: if there were 10 million book purchases and a rule's antecedent covered 10 of them, support for that rule would be 10/10,000,000, a very small support measure of 0.000001. These concepts are often used in the form of threshold levels in machine learning systems.
Minimum confidence levels and support levels can be specified to retain rules identified by the decision tree.
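A minimal sketch of how support and confidence could be computed for a single IF-THEN rule over a list of records; the field names, records, and threshold values are illustrative, not from the chapter.

```python
# Support = fraction of records matching the rule's antecedent;
# confidence = fraction of those matching records for which the consequent also holds.
def rule_metrics(records, antecedent, consequent):
    matched = [r for r in records if antecedent(r)]
    support = len(matched) / len(records)
    confidence = (sum(consequent(r) for r in matched) / len(matched)) if matched else 0.0
    return support, confidence

# Hypothetical loan records.
records = [
    {"risk": "low", "ontime": True},
    {"risk": "low", "ontime": True},
    {"risk": "high", "ontime": False},
    {"risk": "average", "ontime": True},
]

# Rule: IF risk = low THEN on-time.
s, c = rule_metrics(records,
                    antecedent=lambda r: r["risk"] == "low",
                    consequent=lambda r: r["ontime"])
print(f"support={s:.2f}, confidence={c:.2f}")     # support=0.50, confidence=1.00

# Keep only rules above minimum thresholds, as rule-generating tools do.
MIN_SUPPORT, MIN_CONFIDENCE = 0.1, 0.8
print(s >= MIN_SUPPORT and c >= MIN_CONFIDENCE)   # True
```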
Machine learning
Rule-induction algorithms can automatically process categorical data (and can also work on continuous data). A clear outcome is needed.
Rule induction works by searching through data for patterns and relationships.
Machine learning starts with no assumptions, looking only at input data and results.
Recursive partitioning algorithms split data (original data) into finer and finer subsets leading to a decision tree.
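A compact sketch of the recursive partitioning idea in plain Python; the entropy helper and stopping rule are simplified relative to commercial tools such as See5.

```python
# Recursive partitioning: pick the attribute whose split gives the lowest weighted
# entropy, split the data on it, and recurse on each subset until the subsets are pure.
from math import log2
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def split_entropy(rows, labels, attr):
    total = len(rows)
    weighted = 0.0
    for v in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        weighted += (len(subset) / total) * entropy(subset)
    return weighted

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1 or not attrs:          # pure node, or nothing left to split
        return Counter(labels).most_common(1)[0][0]
    best = min(attrs, key=lambda a: split_entropy(rows, labels, a))
    tree = {}
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        tree[(best, v)] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attrs if a != best])
    return tree

# Tiny usage example with invented records.
rows = [{"risk": "low"}, {"risk": "high"}, {"risk": "low"}]
print(build_tree(rows, ["ontime", "late", "ontime"], ["risk"]))
```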
Cases
20 past loan application cases in Table 8.3.
Cases
Automatic machine learning begins by identifying those variables that offer the greatest likelihood of distinguishing between the possible outcomes.
For each of the three variables, the outcome probabilities are illustrated in Table 8.5 (next slide).
Most data mining packages use an entropy measure to gauge the discriminating power of each variable when splitting the data (chi-square measures can also be used to select variables).
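Since chi-square is mentioned as an alternative selection measure, here is a hedged sketch using scipy's chi2_contingency on the grouped Age counts quoted later from Table 8.4 (8/4, 4/1, 3/0 on-time/late); a larger statistic indicates a stronger association with the outcome.

```python
# Chi-square test of association between Age group and outcome, as an alternative
# variable-selection measure to entropy. Counts are the grouped Age data (Table 8.4).
from scipy.stats import chi2_contingency

#            on-time  late
age_counts = [[8, 4],        # young
              [4, 1],        # middle
              [3, 0]]        # old

chi2, p_value, dof, expected = chi2_contingency(age_counts)
print(f"chi2={chi2:.3f}, p={p_value:.3f}")   # larger chi2 -> stronger association
```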
Cases
Table 8.4: Grouped data
Table 8.5: Combination outcomes
Entropy formula
$$\mathrm{Inform} = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$
where p is the number of positive examples and n is the number of negative examples in the training set for each value of the attribute.
The lower the measure (entropy), the greater the information content.
This measure can be used to automatically select the variable with the most productive rule potential.
Entropy formula
The entropy formula has a problem if either p or n is 0, since log2 of 0 is undefined (such terms are treated as 0). Entropy for each Age category generated by the formula is shown in Table 8.6.
Category Young: [-(8/12)(-0.585) - (4/12)(-1.585)] × (12/20) = 0.551
The lower the entropy measure, the greater the information content (the greater the agreement probability).
Rule: If (Risk = low) then predict on-time payment
Else predict late
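A short sketch of the Inform measure defined above, with the zero-count guard, reproducing the Age entropy values given on these slides (the category counts are taken from Table 8.6 as quoted here).

```python
# Inform(p, n) = -p/(p+n) log2 p/(p+n) - n/(p+n) log2 n/(p+n), with 0*log2(0) treated as 0.
from math import log2

def inform(p, n):
    total = p + n
    result = 0.0
    for x in (p, n):
        if x:                                   # guard: log2 is undefined at 0
            result -= (x / total) * log2(x / total)
    return result

# Age categories (on-time, late) from the 20 loan cases: young 8/4, middle 4/1, old 3/0.
groups = {"young": (8, 4), "middle": (4, 1), "old": (3, 0)}
total_cases = sum(p + n for p, n in groups.values())   # 20

weighted = sum(((p + n) / total_cases) * inform(p, n) for p, n in groups.values())
print(round(((8 + 4) / 20) * inform(8, 4), 3))   # 0.551 for the Young category
print(round(weighted, 3))                        # 0.731 total for Age
```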
Entropy
Young: [-(8/12)(-0.585) - (4/12)(-1.585)] × (12/20) = 0.551
Middle: [-(4/5)(-0.322) - (1/5)(-2.322)] × (5/20) = 0.180
Old: [-(3/3)(0) - (0/3)(0)] × (3/20) = 0.000
Age sum: 0.731; Income: 0.782; Risk: 0.446 (lowest)
By the measures, Risk has the greatest information content. If Risk is low, the data indicates a 1.0 probability that the applicant will pay the loan back on time.
Evaluation
Two types of errors may occur:
1. Applicants rated as low risk may actually not pay on time (from the data, the probability of this case is 0.0).
2. Applicants rated as high or average risk might actually have paid if given a loan (from the data, the probability of this happening is 5/20 = 0.25).
Expected error: 0.25 × 0.5 (the probability of being wrong) = 0.125. Test the model using another set of data.
Evaluation
The entropy formula for Age, given that Risk was not low, is 0.99, while the same calculation for Income is 1.971; Age has greater discriminating power.
If Age is middle, the one case did not pay on time.
If (Risk NOT low) AND (Age = Middle)
Then predict Late
Else predict On-time
Evaluation
For the last variable, Income, given that Risk was not low and Age was not middle, there are nine cases left, shown in Table 8.8.
A third rule takes advantage of the cases with a unanimous outcome:
If (Risk NOT low) AND (Age NOT middle) AND (Income high)
Then predict Late
Else predict On-time
See page 141 for more explanations
Rule accuracy
The expected accuracy of the three rules is shown in Table 8.9.
The expected error is 0.375 (1 - 0.625 expected accuracy).
An additional rule could be generated for the case of Risk not low, Age young, and Income not high: four cases with low income (p(on-time) = 0.5) and four cases with average income (p(on-time) = 0.75).
The greater discrimination is provided by average income, resulting in the following rule:
If (Risk NOT low) AND (Age NOT middle) AND (Income average)
Then predict On-time
Else predict either
Rule accuracy
There is no added accuracy obtained with this rule, shown in Table 8.10.
The expected error is 4/20 × 0.5 = 0.10, the same as without the rule.
When machine learning methods encounter no improvement, they generally stop.
Rule accuracy
Table 8.11 shows the results.
Inventory Prediction
Groceries: maybe over 100,000 SKUs; barcode data input
Data mining to discover patterns: random sample of over 1.6 million records, 30 months, 95 outlets; test sample of 400,000 records
Rule induction more workable than regression: 28,000 rules; very accurate, up to 27% improvement
Clinical Database
Headache: over 60 possible causes
Exclusive reasoning uses negative rules (use when a symptom is absent)
Inclusive reasoning uses positive rules
Probabilistic rule induction expert system
Headache: training sample of over 50,000 cases, 45 classes, 147 attributes
Meningitis: 1,200 samples on 41 attributes, 4 outputs
Clinical Database
Used AQ15, C4.5: average accuracy 82%
Expert system: average accuracy 92%
Rough set rule system: average accuracy 70%
Using both positive & negative rules from rough sets: average accuracy over 90%
Software Development Quality
Telecommunications company
Goal: find patterns in modules being developed that are likely to contain faults discovered by customers
Typical module: several million lines of code
Probability of fault averaged 0.074
Apply greater effort for those: specification, testing, inspection
Software Quality
Preprocessed data; reduced data; used CART (Classification & Regression Trees); could specify prior probabilities
First model: 9 rules, 6 variables; better at cross-validation, but variable values not available until late
Second model: 4 rules, 2 variables; about the same accuracy, with data available earlier
Rules and evaluation
The second model's rules
The two models were very close in accuracy. The first model was better at cross-validation accuracy, but its variables were available only just prior to release. The second model had the advantage of being based on data available at an earlier stage and required less extensive data reduction. See also page 146 for the expert system.
Applications of methods to larger data sets
Expenditure application to find the characteristics of potential customers for each expenditure category.
A simple case is to categorize clothing expenditures (or other expenditures in the data set) per year as a 2-class classification problem.
Data preparation and data transformation: see page 154
Comparisons of Apriori, C4.5, and C5.0
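A hedged sketch of setting up such a 2-class problem with a decision tree; C5.0 itself is commercial, so scikit-learn's entropy-based tree is used as a stand-in, and the file name, column names, and threshold are illustrative assumptions.

```python
# Two-class classification of yearly clothing expenditure (high vs. low) with a
# decision tree. The data file, feature names, and threshold are illustrative only.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("expenditure.csv")           # hypothetical prepared data set
df["clothing_high"] = (df["clothing_per_year"] > df["clothing_per_year"].median()).astype(int)

X = df[["age", "income", "household_size"]]   # assumed predictor columns
y = df["clothing_high"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=20)
model.fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```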
Fuzzy Decision Trees
So far we have assumed distinct (crisp) outcomes
Many data points are not that clear-cut
Fuzzy: Membership function represents belief (between 0 and 1)
Fuzzy relationships have been incorporated in decision tree algorithms
Fuzzy Example
Age: Young 0.3, Middle 0.9, Old 0.2
Income: Low 0.0, Average 0.8, High 0.3
Risk: Low 0.1, Average 0.8, High 0.3
Definitions: memberships will not necessarily sum to 1.0; if ambiguous, select the alternative with the larger membership value; aggregate with the mean
Fuzzy Model
Rule 1: IF Risk = Low THEN On-time. Membership function: 0.1
Rule 2: IF Risk NOT Low & Age = Middle THEN Late. Risk NOT Low: MAX(0.8, 0.3) = 0.8; Age Middle: 0.9. Membership function: mean = 0.85
Fuzzy Model cont.
Rule 3: IF Risk NOT Low & Age NOT Middle & Income = High THEN Late. Risk: MAX(0.8, 0.3) = 0.8; Age: MAX(0.3, 0.2) = 0.3; Income: 0.3. Membership function: mean = 0.467
Fuzzy Model cont.
Rule 4: IF Risk NOT Low & Age NOT Middle & Income NOT High THEN On-time. Risk: MAX(0.8, 0.3) = 0.8; Age: MAX(0.3, 0.2) = 0.3; Income: MAX(0.0, 0.8) = 0.8. Membership function: mean = 0.633
Fuzzy Model cont.
Highest membership function is 0.633, for Rule 4
Conclusion: On-time
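A small sketch that reproduces this fuzzy evaluation in code, using the membership values from the example slide; NOT is taken as the maximum membership over the remaining categories and rule memberships are aggregated with the mean, per the slide's definitions.

```python
# Fuzzy evaluation of the four loan rules: combine antecedent memberships with the
# mean, then pick the rule with the highest membership. Values from the example slide.
age    = {"young": 0.3, "middle": 0.9, "old": 0.2}
income = {"low": 0.0, "average": 0.8, "high": 0.3}
risk   = {"low": 0.1, "average": 0.8, "high": 0.3}

def not_(memberships, excluded):
    """Membership of 'NOT excluded' = max membership among the other categories."""
    return max(v for k, v in memberships.items() if k != excluded)

def mean(values):
    return sum(values) / len(values)

rules = {
    "1: Risk low -> on-time": mean([risk["low"]]),
    "2: Risk not low & Age middle -> late": mean([not_(risk, "low"), age["middle"]]),
    "3: Risk not low & Age not middle & Income high -> late":
        mean([not_(risk, "low"), not_(age, "middle"), income["high"]]),
    "4: Risk not low & Age not middle & Income not high -> on-time":
        mean([not_(risk, "low"), not_(age, "middle"), not_(income, "high")]),
}

for name, m in rules.items():
    print(f"{name}: {m:.3f}")
print("fired rule:", max(rules, key=rules.get))   # rule 4, membership 0.633 -> on-time
```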
Decision Trees
Very effective & useful
Automatic machine learning, thus unbiased (but omits judgment)
Can handle very large data sets; not affected much by missing data
Lots of software available