Classification Trees
Trees and Rules
Goal: Classify or predict an outcome based on a set of predictors
The output is a set of rules
Rules are represented by tree diagrams
Also called CART (Classification And Regression Trees), Decision Trees, or just Trees
Key Ideas
Recursive partitioning: Repeatedly split the records into two parts so as to achieve maximum homogeneity within the new parts
Pruning the tree: Simplify the tree by pruning peripheral branches to avoid overfitting
The Idea of Recursive Partitioning
Recursive partitioning of the predictors to achieve homogeneous groups
At each step a single predictor is used for splitting the data
The splits carve the predictor space into rectangles of different sizes
[Figure: the predictor space, axes X1 and X2, partitioned into rectangles]
Recursive Partitioning Steps
Pick one of the predictor variables, xi
Pick a value of xi, say si, that divides the training data into two (not necessarily equal) portions
Measure how “pure” or homogeneous each of the resulting portions is (“pure” = containing records of mostly one class)
The idea is to pick xi and si to maximize purity
Repeat the process on each resulting portion
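As a rough illustration of one partitioning step in Python (the function names, the majority-fraction purity measure, and the toy records are illustrative, not part of the slides):

```python
# A sketch of one partitioning step (illustrative names and toy data).
def majority_fraction(labels):
    """Purity of a portion = fraction of records belonging to its majority class."""
    if not labels:
        return 1.0
    return max(labels.count(c) for c in set(labels)) / len(labels)

def split_records(records, predictor, s):
    """Divide records into two (not necessarily equal) portions at value s."""
    left = [r for r in records if r[predictor] <= s]
    right = [r for r in records if r[predictor] > s]
    return left, right

records = [{"Age": 25, "y": "Light"}, {"Age": 30, "y": "Light"},
           {"Age": 48, "y": "Regular"}, {"Age": 55, "y": "Regular"}]

left, right = split_records(records, "Age", 42.5)
print(majority_fraction([r["y"] for r in left]),
      majority_fraction([r["y"] for r in right]))  # 1.0 1.0 -> both portions are pure
```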
Example: Beer Preference
Using the predictors Age, Gender, Married, and Income, we want to classify the preference of new beer drinkers (Regular/Light beer)
Two classes, 4 predictors
100 records, partitioned into training/validation sets
Splitting the space in trees
[Scatter plot of the training data: Income ($0 to $80,000) on the x-axis, Age (0 to 100) on the y-axis, with Regular beer and Light beer drinkers marked]
A Classification Tree for the Beer Preference Example (training)
[Tree diagram: the root splits on Age at 42.5 (branch counts 29 and 31); its children split on Income at 34,375 and 41,173; deeper nodes split on Income at 39,180 and Age at 51.5; leaf (terminal) nodes are labeled Regular or Light; callouts mark a leaf/terminal node, a splitting value, and the # of records on each branch]
Method Settings
1. Determining splits/partitions
2. Terminating tree growth
3. Finding rule to predict the class for each record
4. Pruning of tree (cutting branches)
Three famous algorithms: CART (in XLMiner, SAS, CART), CHAID (in SAS), C4.5 (in Clementine by SPSS)
Determining the best split
The best split is the one that best discriminates between records in the different classes
Goal is to have a single class predominate in each resulting node
The split that maximizes the reduction in impurity of a node is chosen
The CART algorithm (Classification And Regression Trees) evaluates all possible binary splits
The C4.5 algorithm splits a categorical predictor with K categories into K children (creates bushiness)
Two impurity measures are the entropy and the Gini index
Impurity of a split = weighted average of the impurities of its children
Entropy Measure
Entropy of a node = − Σ pi log2(pi), summed over the classes i = 1, …, K
K = number of classes; pi = the proportion of records in class i
Entropy
0 ≤ Entropy ≤ log2(K)
Entropy = log2(K): bad; the K classes are equally likely (as impure as possible)
Entropy = 0: good; all records are in the same class (pure)
Entropy For 2 classes
[Plot: entropy of a node as a function of p1 for two classes; it is 0 at p1 = 0 and p1 = 1 and reaches its maximum of 1 at p1 = 0.5]
Entropy: Example
Assume we have K = 2 classes (buy, not buy); then the maximum entropy is log2(2) = 1
Ex.1—Impure Node: p1 = 0.5, p2 = 0.5
Entropy = – [0.5log2(0.5) + 0.5log2(0.5)] = 1
Ex.2—Pure Node: p1 = 1, p2 = 0
Entropy = – [1·log2(1) + 0·log2(0)] = 0 (taking 0·log2(0) = 0 by convention)
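A short Python sketch of the entropy computation (the function name is illustrative) reproduces both examples:

```python
import math

def entropy(proportions):
    """Entropy of a node: -sum(p_i * log2(p_i)), with 0*log2(0) taken as 0."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 -> as impure as possible for K = 2
print(entropy([1.0, 0.0]))  # -0.0 -> pure node
```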
Splitting the 100 beer drinkers (50 prefer light, 50 regular) by gender
Splitting on Gender gives two children:
43 females: 25 Regular, 18 Light
57 males: 25 Regular, 32 Light
Before the split (p1 = p2 = 0.5): entropy = 1
After the split:
Females: −[ (18/43) log2(18/43) + (25/43) log2(25/43) ] = 0.98
Males: −[ (32/57) log2(32/57) + (25/57) log2(25/57) ] = 0.989
Combined: (0.43)(0.98) + (0.57)(0.989) = 0.985, not much of an improvement
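The same calculation in a Python sketch (names illustrative; results match the slide's figures up to rounding):

```python
import math

def entropy(proportions):
    return -sum(p * math.log2(p) for p in proportions if p > 0)

females = entropy([25/43, 18/43])   # ~0.981
males = entropy([25/57, 32/57])     # ~0.989
combined = (43/100) * females + (57/100) * males
print(round(females, 3), round(males, 3), round(combined, 3))
# 0.981 0.989 0.986 -> matches the slide's 0.98 / 0.989 / 0.985 up to rounding
```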
The Gini Impurity Index
The impurity of a node is GI = 1 − Σ pi², summed over the classes i = 1, …, K
K = number of classes; pi = the proportion of records in class i
The Gini Index
0 ≤ GI ≤ (K-1)/K
GI = (K-1)/K: bad; the K classes are equally likely (as impure as possible)
GI = 0: good; all records are in the same class (pure)
Splitting the 100 beer drinkers (50 prefer light, 50 regular) by gender
Splitting on Gender gives two children:
43 females: 25 Regular, 18 Light
57 males: 25 Regular, 32 Light
Before the split: 1 − [0.25 + 0.25] = 0.5
Females: 1 − [(25/43)² + (18/43)²] = 0.487
Males: 1 − [(25/57)² + (32/57)²] = 0.492
Combined: 0.43 × 0.487 + 0.57 × 0.492 = 0.49
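And the corresponding Gini sketch (names illustrative):

```python
def gini(proportions):
    """Gini impurity of a node: 1 - sum(p_i^2)."""
    return 1 - sum(p * p for p in proportions)

females = gini([25/43, 18/43])   # ~0.487
males = gini([25/57, 32/57])     # ~0.492
combined = (43/100) * females + (57/100) * males
print(round(females, 3), round(males, 3), round(combined, 3))  # 0.487 0.492 0.49
```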
Recursive Partitioning
Obtain overall impurity measure (weighted avg. of individual rectangles)
At each successive stage, compare this measure across all possible splits in all variables
Choose the split that reduces impurity the most
Chosen split points become nodes on the tree
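A rough sketch of that search in Python, assuming numeric predictors and using Gini impurity; the function names, candidate-split strategy, and toy records are illustrative rather than any particular package's implementation:

```python
# For every predictor and every candidate split value, compute the weighted
# impurity of the two resulting portions and keep the split that minimizes it.
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(records, predictors, target="y"):
    best = None
    for x in predictors:
        values = sorted({r[x] for r in records})
        # candidate split points: midpoints between consecutive observed values
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [r[target] for r in records if r[x] <= s]
            right = [r[target] for r in records if r[x] > s]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(records)
            if best is None or impurity < best[0]:
                best = (impurity, x, s)
    return best  # (weighted impurity, predictor, split value)

records = [{"Age": 25, "Income": 30000, "y": "Light"},
           {"Age": 30, "Income": 60000, "y": "Light"},
           {"Age": 48, "Income": 35000, "y": "Regular"},
           {"Age": 55, "Income": 52000, "y": "Regular"}]
print(best_split(records, ["Age", "Income"]))  # splits on Age between 30 and 48
```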
Learning about the “best predictors”
[Beer-example tree repeated: Age at the root (42.5), with Income and Age splits below]
Predictors close to the top are the most informative
Trees are sometimes used for variable selection
In the beer example, the tree uses only Age and Income
Tree Structure
Split points become nodes on the tree (circles with the split value in the center)
Rectangles represent “leaves” (terminal points: no further splits, classification value noted)
Numbers on the lines between nodes indicate # cases
Read down the tree to derive a rule, e.g. if lot size < 19 and income > 84.75, then class = “owner”
Determining Leaf Node Label
Each leaf node label is determined by “voting” of the records within it, and by the cutoff value
Records within each leaf node are from the training data
Default cutoff=0.5 means that the leaf node’s label is the majority class.
Cutoff = 0.75: requires 75% or more “1” records in the leaf to label it a “1” node
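A minimal sketch of cutoff-based labeling for a 2-class problem (the function name and toy leaf are illustrative):

```python
def leaf_label(training_labels, cutoff=0.5):
    """Label a leaf "1" if the proportion of class-"1" training records meets the cutoff."""
    p1 = sum(1 for y in training_labels if y == "1") / len(training_labels)
    return "1" if p1 >= cutoff else "0"

leaf = ["1", "1", "1", "0"]            # 75% of the records in this leaf are "1"
print(leaf_label(leaf))                # "1"  (majority vote, cutoff 0.5)
print(leaf_label(leaf, cutoff=0.8))    # "0"  (stricter cutoff not met)
```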
Converting a Tree into Rules
Convert the tree to IF-THEN rules (conditions joined with AND)
IF age > 42.5 AND income < $41,173 THEN prefers regular beer
IF age < 42.5 AND income < $34,375 THEN prefers regular beer
Some rules can be condensed, e.g.
IF income < $34,375 THEN prefers regular beer (since $34,375 < $41,173, this holds on both sides of the age split)
[Beer-example tree repeated]
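The two rules above can be applied directly in code; in the sketch below the final "Light" default is a simplification, since the full tree has further splits not captured by these two rules:

```python
# Applying the two extracted rules (the final "Light" default is a simplification;
# the full tree has additional branches not covered by these rules).
def classify(record):
    if record["age"] > 42.5 and record["income"] < 41173:
        return "Regular"
    if record["age"] < 42.5 and record["income"] < 34375:
        return "Regular"
    return "Light"

print(classify({"age": 50, "income": 30000}))  # Regular
print(classify({"age": 30, "income": 60000}))  # Light (by the simplified default)
```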
Labeling Leaf Nodes
Default: majority vote; in the 2-class case, majority vote = cutoff of 0.5
Changing the cutoff will change the labeling
Example: with success class = buyers, a cutoff of 0.75 means that to label a node “buyer”, at least 75% of the training set observations with that predictor combination should be buyers
Classifying existing and new records
“Drop” the record into the tree (= use the rules)
Classify the record as the majority class in its leaf node
[Beer-example tree repeated]
Predict the preference of a 50-year-old female with $50K income
[Beer-example tree repeated; follow the splits from the root to find the record's leaf]
Stopping Tree Growth
The natural end of the process is 100% purity in each leaf
This overfits the data: the tree ends up fitting noise in the data
Overfitting leads to low predictive accuracy on new data
Past a certain point, the error rate for the validation data starts to increase
Avoiding Over-fitting
Partitioning is done based on the training set
As we go down the tree, splitting is based on less and less data
Larger trees lead to higher prediction variance
Avoiding Over-fitting – cont.
How will the tree perform on new data?
The error rate of a tree = proportion of misclassifications
[Plot: error rate vs. # splits; the training-data error keeps decreasing while the unseen-data error eventually starts to rise]
Solution 1: Stopping Tree Growth
Rules/criteria used to stop tree growth:
Tree depth
Minimum # of records (cases) in a node
Minimum reduction in the impurity measure
Problem: hard to know when to stop…
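For readers working in Python rather than XLMiner, scikit-learn's DecisionTreeClassifier exposes stopping criteria that map onto these rules; the parameter values below are arbitrary examples, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

# Each argument corresponds to one of the stopping criteria above
# (the values are arbitrary examples, not recommendations).
tree = DecisionTreeClassifier(
    max_depth=5,                 # limit on tree depth
    min_samples_leaf=20,         # minimum # of records (cases) allowed in a node
    min_impurity_decrease=0.01,  # minimum reduction in the impurity measure
)
print(tree.get_params()["max_depth"])  # 5
```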
CHAID (Chi-square Automatic Interaction Detector): Stopping in time
Popular in marketing
Instead of using an impurity measure, CHAID uses the chi-square test of independence to determine splits
At each node, we split on the predictor that has the strongest association with the Y variable
Strength of association is measured by the p-value of chi-square test of independence (smaller p-value = stronger evidence of association)
Splitting terminates when no more association is found between the predictors and Y.
Requires categorical predictors (or bin interval predictors into categories)
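As an illustration of the chi-square test of independence on the beer example's Gender split (using SciPy, not an actual CHAID implementation):

```python
from scipy.stats import chi2_contingency

# Gender x Preference counts from the beer example:
#           Regular  Light
# Females      25      18
# Males        25      32
observed = [[25, 18], [25, 32]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(round(chi2, 3), round(p_value, 3))  # a fairly large p-value: weak evidence of association
```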
Solution 2: Pruning (CART, C4.5)
We would like to maximize prediction accuracy for unseen data
Use the validation set to prune the tree
[Plot: error rate vs. # splits for training and validation data; "Prune here" marks the point where the validation error is lowest]
Pruning
CART lets the tree grow to full extent, then prunes it back
The idea is to find the point at which the validation error begins to rise
Generate successively smaller trees by pruning leaves
At each pruning stage, multiple trees are possible
Use cost complexity to choose the best tree at that stage
Cost Complexity
CC(T) = Err(T) + α L(T)
CC(T) = cost complexity of a tree
Err(T) = proportion of misclassified records
L(T) = size of the tree (number of leaves)
α = penalty factor attached to tree size (set by user)
Among trees of a given size, choose the one with the lowest CC
Do this for each size of tree
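A tiny sketch of comparing candidate trees by cost complexity; the error rates, leaf counts, and alpha below are made up for illustration:

```python
# Comparing candidate trees by CC(T) = Err(T) + alpha * L(T); numbers are made up.
def cost_complexity(err, n_leaves, alpha):
    return err + alpha * n_leaves

candidates = [
    {"name": "full tree", "err": 0.00, "leaves": 12},
    {"name": "smaller tree", "err": 0.02, "leaves": 5},
]
alpha = 0.005  # the user-set penalty factor
best = min(candidates, key=lambda t: cost_complexity(t["err"], t["leaves"], alpha))
print(best["name"])  # "smaller tree": 0.02 + 0.005*5 = 0.045 vs. 0.00 + 0.005*12 = 0.06
```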
Pruning Results
This process yields a set of trees of different sizes and associated error rates
Two trees of interest:
Minimum error tree: has the lowest error rate on the validation data
Best pruned tree: the smallest tree within one std. error of the minimum error; this adds a bonus for simplicity/parsimony
Beer Preference Example
[Two tree diagrams side by side. Full tree (training): root Age 42.5, then Income 34,375 and 41,173, then Income 39,180 and Age 51.5, leaves labeled Regular/Light. Pruned tree (validation): root Age 42.5, then Income 34,375 and 41,173, then Age 51.5, leaves labeled Regular/Light; record counts shown on the branches]
Extension to numerical Y: Regression Trees
Everything is similar to classification trees, except that the label of a leaf node is the average of the observations in that node
Non-parametric, no assumptions; can find global relationships between Y and the predictors
A nice alternative to linear regression for large datasets
XLMiner: Prediction > Regression Trees
Differences from CT
Prediction is computed as the average of the numerical target variable in the rectangle (in CT it is the majority vote)
Impurity is measured by the sum of squared deviations from the leaf mean
Performance is measured by RMSE (root mean squared error)
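A short scikit-learn sketch (not XLMiner) of a regression tree; the toy data are made up, and the point is simply that predictions are leaf means and performance is RMSE:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: one numeric predictor and a numeric target (made up for illustration).
X = np.array([[20], [25], [30], [45], [50], [55]])
y = np.array([10.0, 12.0, 11.0, 30.0, 32.0, 31.0])

reg = DecisionTreeRegressor(max_depth=1).fit(X, y)  # a single split
pred = reg.predict(X)                               # each prediction is a leaf mean
rmse = np.sqrt(np.mean((y - pred) ** 2))            # root mean squared error
print(pred, round(float(rmse), 3))
```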
Advantages of trees
Easy to use and understand
Produce rules that are easy to interpret & implement
Variable selection & reduction is automatic
Do not require the assumptions of statistical models
Can work without extensive handling of missing data
Disadvantages
May not perform well where there is structure in the data that is not well captured by horizontal or vertical splits
Since the process deals with one variable at a time, there is no way to capture interactions between variables
Big Example: Personal Loan Offer
As part of its customer acquisition efforts, Universal Bank wants to run a campaign encouraging current customers to purchase a loan.
To improve target marketing, they want to find the customers that are most likely to accept the personal loan offer.
They use data from a previous campaign on 5,000 customers, 480 of whom accepted.
Personal Loan Data Description
ID                  Customer ID
Age                 Customer's age in completed years
Experience          # years of professional experience
Income              Annual income of the customer ($000)
ZIPCode             Home address ZIP code
Family              Family size of the customer
CCAvg               Avg. spending on credit cards per month ($000)
Education           Education level: 1 = Undergrad; 2 = Graduate; 3 = Advanced/Professional
Mortgage            Value of house mortgage, if any ($000)
Personal Loan       Did this customer accept the personal loan offered in the last campaign?
Securities Account  Does the customer have a securities account with the bank?
CD Account          Does the customer have a certificate of deposit (CD) account with the bank?
Online              Does the customer use internet banking facilities?
CreditCard          Does the customer use a credit card issued by Universal Bank?
File: “UniversalBankTrees.xls”
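A hedged sketch of the same workflow in Python (pandas + scikit-learn) rather than XLMiner; the file name and column names are taken from the slides and may not match the actual file exactly, and the partition sizes and ccp_alpha value are assumptions for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_excel("UniversalBankTrees.xls")   # file name from the slides

# Column names follow the data description above; they may need adjusting to the file.
predictors = ["Age", "Experience", "Income", "Family", "CCAvg", "Education",
              "Mortgage", "Securities Account", "CD Account", "Online", "CreditCard"]
X, y = df[predictors], df["Personal Loan"]

# Assumption: a simple 50/50 training/validation split (the slides use 2500/1500/1000).
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.5, random_state=1)

full_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.003, random_state=1).fit(X_train, y_train)

print(full_tree.score(X_valid, y_valid), pruned_tree.score(X_valid, y_valid))
```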
Full tree (CT_output4)
[Tree diagram: the root splits on Income at 92.5 (1,830 vs. 670 records); lower splits use CCAvg (2.95), Education (1.5), CD Account (0.5), Family (2.5), Income (114.5, 81.5, 116, 116.5, 168), Mortgage, and CCAvg again; several branches continue in sub-trees not shown; leaves are labeled 1 or 0]
Test Data (Using Full Tree)

Classification Confusion Matrix
                    Predicted Class
Actual Class        1       0
1                   88      14
0                   8       890

Error Report
Class      # Cases    # Errors    % Error
1          102        14          13.73
0          898        8           0.89
Overall    1000       22          2.20

Validation Data (Using Full Tree)

Classification Confusion Matrix
                    Predicted Class
Actual Class        1       0
1                   128     15
0                   17      1340

Error Report
Class      # Cases    # Errors    % Error
1          143        15          10.49
0          1357       17          1.25
Overall    1500       32          2.13

Training Data (Using Full Tree)

Classification Confusion Matrix
                    Predicted Class
Actual Class        1       0
1                   235     0
0                   0       2265

Error Report
Class      # Cases    # Errors    % Error
1          235        0           0.00
0          2265       0           0.00
Overall    2500       0           0.00
Pruned tree (CT_Output3)
[Tree diagram: the root splits on Income at 92.5 (1,083 vs. 417 records); lower splits use Education (1.5), Family (2.5), Income (114.5 and 116), and CCAvg (2.95); leaves are labeled 1 or 0]
# Decision Nodes   % Error Training   % Error Validation
41                 0                  2.133333
40                 0.04               2.2
39                 0.08               2.2
38                 0.12               2.2
37                 0.16               2.066667
36                 0.2                2.066667
35                 0.2                2.066667
34                 0.24               2.066667
33                 0.28               2.066667
32                 0.4                2.066667
31                 0.48               2.133333
30                 0.48               2.133333
29                 0.56               2.133333
28                 0.6                1.866667
27                 0.64               1.866667
26                 0.72               1.866667
25                 0.76               1.866667
24                 0.88               1.866667
23                 0.88               1.733333
22                 0.88               1.733333
21                 0.96               1.733333
20                 0.96               1.733333
19                 1                  1.733333
18                 1                  1.733333
17                 1.12               1.733333
16                 1.12               1.533333
15                 1.12               1.533333
14                 1.16               1.533333
13                 1.16               1.6
12                 1.2                1.6
11                 1.2                1.466667   <-- Min. Err. Tree (Std. Err. 0.003103929)
10                 1.6                1.666667
9                  2.2                1.666667
8                  2.2                1.866667
7                  2.24               1.866667
6                  2.24               1.6        <-- Best Pruned Tree
5                  4.44               1.8
4                  5.08               2.333333
3                  5.24               3.466667
2                  9.4                9.533333
1                  9.4                9.533333
0                  9.4                9.533333
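A small sketch of how the two trees of interest are read off such a table; only a few rows are reproduced, and the conversion of the reported std. error (a proportion) to percentage points is an assumption:

```python
# Reading the two trees of interest off the pruning table (only a few rows shown).
results = [(11, 1.466667), (10, 1.666667), (8, 1.866667), (6, 1.6), (5, 1.8), (4, 2.333333)]
std_err = 0.003103929 * 100   # assumption: reported std. error is a proportion; convert to % points

min_err_nodes, min_err = min(results, key=lambda r: r[1])            # minimum error tree
within_one_se = [r for r in results if r[1] <= min_err + std_err]    # within one std. error of the min.
best_pruned_nodes = min(nodes for nodes, _ in within_one_se)         # smallest such tree

print(min_err_nodes, best_pruned_nodes)  # 11 (min. error tree), 6 (best pruned tree)
```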
Test Data scoring - Summary Report (Using Best Pruned Tree)

Classification Confusion Matrix
                    Predicted Class
Actual Class        1       0
1                   88      14
0                   3       895

Error Report
Class      # Cases    # Errors    % Error
1          102        14          13.73
0          898        3           0.33
Overall    1000       17          1.70
Validation Data scoring

Classification Confusion Matrix
                    Predicted Class
Actual Class        1       0
1                   127     16
0                   8       1349

Error Report
Class      # Cases    # Errors    % Error
1          143        16          11.19
0          1357       8           0.59
Overall    1500       24          1.60