Classification Trees
Trees and Rules
Goal: Classify or predict an outcome based on a set of predictors
The output is a set of rules
Rules are represented by tree diagrams
Also called CART (Classification And Regression Trees), Decision Trees, or just Trees
Key Ideas
Recursive partitioning: Repeatedly split the records into two parts so as to achieve maximum homogeneity within the new parts
Pruning the tree: Simplify the tree by pruning peripheral branches to avoid overfitting
The Idea of Recursive Partitioning
Recursive partitioning of the predictors to achieve homogeneous groups
At each step a single predictor is used for splitting the data
The splits carve the predictor space into rectangles of different sizes
[Figure: the predictor space, axes X1 and X2, partitioned into rectangles]
Recursive Partitioning Steps
Pick one of the predictor variables, xi
Pick a value of xi, say si, that divides the training data into two (not necessarily equal) portions
Measure how “pure” or homogeneous each of the resulting portions is (“pure” = containing records of mostly one class)
The idea is to pick xi and si to maximize purity
Repeat the process on each resulting portion
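As a rough illustration of one partitioning step in Python (the function names, the majority-fraction purity measure, and the toy records are illustrative, not part of the slides):

```python
# A sketch of one partitioning step (illustrative names and toy data).
def majority_fraction(labels):
    """Purity of a portion = fraction of records belonging to its majority class."""
    if not labels:
        return 1.0
    return max(labels.count(c) for c in set(labels)) / len(labels)

def split_records(records, predictor, s):
    """Divide records into two (not necessarily equal) portions at value s."""
    left = [r for r in records if r[predictor] <= s]
    right = [r for r in records if r[predictor] > s]
    return left, right

records = [{"Age": 25, "y": "Light"}, {"Age": 30, "y": "Light"},
           {"Age": 48, "y": "Regular"}, {"Age": 55, "y": "Regular"}]

left, right = split_records(records, "Age", 42.5)
print(majority_fraction([r["y"] for r in left]),
      majority_fraction([r["y"] for r in right]))  # 1.0 1.0 -> both portions are pure
```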
Example: Beer Preference
Using the predictors Age, Gender, Married, and Income, we want to classify the preference of new beer drinkers (Regular/Light beer)
Two classes, 4 predictors
100 records, partitioned into training/validation sets
Splitting the space in trees
[Scatter plot of the training data: Income ($0 to $80,000) on the x-axis, Age (0 to 100) on the y-axis, with Regular beer and Light beer drinkers marked]
A Classification Tree for the Beer Preference Example (training)
[Tree diagram: the root splits on Age at 42.5 (branch counts 29 and 31); its children split on Income at 34,375 and 41,173; deeper nodes split on Income at 39,180 and Age at 51.5; leaf (terminal) nodes are labeled Regular or Light; callouts mark a leaf/terminal node, a splitting value, and the # of records on each branch]
Method Settings
1. Determining splits/partitions
2. Terminating tree growth
3. Finding rule to predict the class for each record
4. Pruning of tree (cutting branches)
Three famous algorithms: CART (in XLMiner, SAS, CART), CHAID (in SAS), C4.5 (in Clementine by SPSS)
Determining the best split
The best split is the one that best discriminates between records in the different classes
Goal is to have a single class predominate in each resulting node
The split that maximizes the reduction in impurity of a node is chosen
The CART algorithm (Classification And Regression Trees) evaluates all possible binary splits
The C4.5 algorithm splits a categorical predictor with K categories into K children (creates bushiness)
Two impurity measures are the entropy and the Gini index
Impurity of a split = weighted average of the impurities of its children
Entropy Measure
Entropy of a node = − Σ pi log2(pi), summed over the classes i = 1, …, K
K = number of classes; pi = the proportion of records in class i
Entropy
0 ≤ Entropy ≤ log2(K)
Entropy = log2(K): bad; the K classes are equally likely (as impure as possible)
Entropy = 0: good; all records are in the same class (pure)
Entropy For 2 classes
[Plot: entropy of a node as a function of p1 for two classes; it is 0 at p1 = 0 and p1 = 1 and reaches its maximum of 1 at p1 = 0.5]
Entropy: Example
Assume we have K = 2 classes (buy, not buy); then the maximum entropy is log2(2) = 1
Ex.1—Impure Node: p1 = 0.5, p2 = 0.5
Entropy = – [0.5log2(0.5) + 0.5log2(0.5)] = 1
Ex.2—Pure Node: p1 = 1, p2 = 0
Entropy = – [1·log2(1) + 0·log2(0)] = 0 (taking 0·log2(0) = 0 by convention)
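A short Python sketch of the entropy computation (the function name is illustrative) reproduces both examples:

```python
import math

def entropy(proportions):
    """Entropy of a node: -sum(p_i * log2(p_i)), with 0*log2(0) taken as 0."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([0.5, 0.5]))  # 1.0 -> as impure as possible for K = 2
print(entropy([1.0, 0.0]))  # -0.0 -> pure node
```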
Splitting the 100 beer drinkers (50 prefer light, 50 regular) by gender
Splitting on Gender gives two children:
43 females: 25 Regular, 18 Light
57 males: 25 Regular, 32 Light
Before the split (p1 = p2 = 0.5): entropy = 1
After the split:
Females: −[ (18/43) log2(18/43) + (25/43) log2(25/43) ] = 0.98
Males: −[ (32/57) log2(32/57) + (25/57) log2(25/57) ] = 0.989
Combined: (0.43)(0.98) + (0.57)(0.989) = 0.985, not much of an improvement
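The same calculation in a Python sketch (names illustrative; results match the slide's figures up to rounding):

```python
import math

def entropy(proportions):
    return -sum(p * math.log2(p) for p in proportions if p > 0)

females = entropy([25/43, 18/43])   # ~0.981
males = entropy([25/57, 32/57])     # ~0.989
combined = (43/100) * females + (57/100) * males
print(round(females, 3), round(males, 3), round(combined, 3))
# 0.981 0.989 0.986 -> matches the slide's 0.98 / 0.989 / 0.985 up to rounding
```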
The Gini Impurity Index
The impurity of a node is GI = 1 − Σ pi², summed over the classes i = 1, …, K
K = number of classes; pi = the proportion of records in class i
The Gini Index
0 ≤ GI ≤ (K-1)/K
GI = (K-1)/K: bad; the K classes are equally likely (as impure as possible)
GI = 0: good; all records are in the same class (pure)
Splitting the 100 beer drinkers (50 prefer light, 50 regular) by gender
Splitting on Gender gives two children:
43 females: 25 Regular, 18 Light
57 males: 25 Regular, 32 Light
Before the split: 1 − [0.25 + 0.25] = 0.5
Females: 1 − [(25/43)² + (18/43)²] = 0.487
Males: 1 − [(25/57)² + (32/57)²] = 0.492
Combined: 0.43 × 0.487 + 0.57 × 0.492 = 0.49
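And the corresponding Gini sketch (names illustrative):

```python
def gini(proportions):
    """Gini impurity of a node: 1 - sum(p_i^2)."""
    return 1 - sum(p * p for p in proportions)

females = gini([25/43, 18/43])   # ~0.487
males = gini([25/57, 32/57])     # ~0.492
combined = (43/100) * females + (57/100) * males
print(round(females, 3), round(males, 3), round(combined, 3))  # 0.487 0.492 0.49
```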
Recursive Partitioning
Obtain overall impurity measure (weighted avg. of individual rectangles)
At each successive stage, compare this measure across all possible splits in all variables
Choose the split that reduces impurity the most
Chosen split points become nodes on the tree
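A rough sketch of that search in Python, assuming numeric predictors and using Gini impurity; the function names, candidate-split strategy, and toy records are illustrative rather than any particular package's implementation:

```python
# For every predictor and every candidate split value, compute the weighted
# impurity of the two resulting portions and keep the split that minimizes it.
def gini(labels):
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(records, predictors, target="y"):
    best = None
    for x in predictors:
        values = sorted({r[x] for r in records})
        # candidate split points: midpoints between consecutive observed values
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [r[target] for r in records if r[x] <= s]
            right = [r[target] for r in records if r[x] > s]
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(records)
            if best is None or impurity < best[0]:
                best = (impurity, x, s)
    return best  # (weighted impurity, predictor, split value)

records = [{"Age": 25, "Income": 30000, "y": "Light"},
           {"Age": 30, "Income": 60000, "y": "Light"},
           {"Age": 48, "Income": 35000, "y": "Regular"},
           {"Age": 55, "Income": 52000, "y": "Regular"}]
print(best_split(records, ["Age", "Income"]))  # splits on Age between 30 and 48
```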
Learning about the “best predictors”
[Beer-example tree repeated: Age at the root (42.5), with Income and Age splits below]
Predictors close to the top are the most informative
Trees are sometimes used for variable selection
In the beer example, the tree uses only Age and Income
Tree Structure
Split points become nodes on the tree (circles with the split value in the center)
Rectangles represent “leaves” (terminal points: no further splits, classification value noted)
Numbers on the lines between nodes indicate # cases
Read down the tree to derive a rule, e.g. if lot size < 19 and income > 84.75, then class = “owner”
Determining Leaf Node Label
Each leaf node label is determined by “voting” of the records within it, and by the cutoff value
Records within each leaf node are from the training data
Default cutoff=0.5 means that the leaf node’s label is the majority class.
Cutoff = 0.75: requires 75% or more “1” records in the leaf to label it a “1” node
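A minimal sketch of cutoff-based labeling for a 2-class problem (the function name and toy leaf are illustrative):

```python
def leaf_label(training_labels, cutoff=0.5):
    """Label a leaf "1" if the proportion of class-"1" training records meets the cutoff."""
    p1 = sum(1 for y in training_labels if y == "1") / len(training_labels)
    return "1" if p1 >= cutoff else "0"

leaf = ["1", "1", "1", "0"]            # 75% of the records in this leaf are "1"
print(leaf_label(leaf))                # "1"  (majority vote, cutoff 0.5)
print(leaf_label(leaf, cutoff=0.8))    # "0"  (stricter cutoff not met)
```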
Converting a Tree into Rules
Convert the tree to IF-THEN rules (conditions joined with AND)
IF age > 42.5 AND income < $41,173 THEN prefers regular beer
IF age < 42.5 AND income < $34,375 THEN prefers regular beer
Some rules can be condensed, e.g.
IF income < $34,375 THEN prefers regular beer (since $34,375 < $41,173, this holds on both sides of the age split)
[Beer-example tree repeated]
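The two rules above can be applied directly in code; in the sketch below the final "Light" default is a simplification, since the full tree has further splits not captured by these two rules:

```python
# Applying the two extracted rules (the final "Light" default is a simplification;
# the full tree has additional branches not covered by these rules).
def classify(record):
    if record["age"] > 42.5 and record["income"] < 41173:
        return "Regular"
    if record["age"] < 42.5 and record["income"] < 34375:
        return "Regular"
    return "Light"

print(classify({"age": 50, "income": 30000}))  # Regular
print(classify({"age": 30, "income": 60000}))  # Light (by the simplified default)
```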
Labeling Leaf Nodes
Default: majority vote; in the 2-class case, majority vote = cutoff of 0.5
Changing the cutoff will change the labeling
Example: with success class = buyers, a cutoff of 0.75 means that to label a node “buyer”, at least 75% of the training set observations with that predictor combination should be buyers
Classifying existing and new records
“Drop” the record into the tree (= use the rules)
Classify the record as the majority class in its leaf node
[Beer-example tree repeated]
Predict the preference of a 50-year-old female with $50K income
[Beer-example tree repeated; follow the splits from the root to find the record's leaf]
Stopping Tree Growth
The natural end of the process is 100% purity in each leaf
This overfits the data: the tree ends up fitting noise in the data
Overfitting leads to low predictive accuracy on new data
Past a certain point, the error rate for the validation data starts to increase
Avoiding Over-fitting
Partitioning is done based on the training set
As we go down the tree, splitting is based on less and less data
Larger trees lead to higher prediction variance
Avoiding Over-fitting – cont.
How will the tree perform on new data?
The error rate of a tree = proportion of misclassifications
[Plot: error rate vs. # splits; the training-data error keeps decreasing while the unseen-data error eventually starts to rise]
Solution 1: Stopping Tree Growth
Rules/criteria used to stop tree growth:
Tree depth
Minimum # of records (cases) in a node
Minimum reduction in the impurity measure
Problem: hard to know when to stop…
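For readers working in Python rather than XLMiner, scikit-learn's DecisionTreeClassifier exposes stopping criteria that map onto these rules; the parameter values below are arbitrary examples, not recommendations:

```python
from sklearn.tree import DecisionTreeClassifier

# Each argument corresponds to one of the stopping criteria above
# (the values are arbitrary examples, not recommendations).
tree = DecisionTreeClassifier(
    max_depth=5,                 # limit on tree depth
    min_samples_leaf=20,         # minimum # of records (cases) allowed in a node
    min_impurity_decrease=0.01,  # minimum reduction in the impurity measure
)
print(tree.get_params()["max_depth"])  # 5
```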
CHAID (Chi-square Automatic Interaction Detector): Stopping in time
Popular in marketing
Instead of using an impurity measure, CHAID uses the chi-square test of independence to determine splits
At each node, we split on the predictor that has the strongest association with the Y variable
Strength of association is measured by the p-value of chi-square test of independence (smaller p-value = stronger evidence of association)
Splitting terminates when no more association is found between the predictors and Y.
Requires categorical predictors (or bin interval predictors into categories)
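As an illustration of the chi-square test of independence on the beer example's Gender split (using SciPy, not an actual CHAID implementation):

```python
from scipy.stats import chi2_contingency

# Gender x Preference counts from the beer example:
#           Regular  Light
# Females      25      18
# Males        25      32
observed = [[25, 18], [25, 32]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(round(chi2, 3), round(p_value, 3))  # a fairly large p-value: weak evidence of association
```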
Solution 2: Pruning (CART, C4.5)
We would like to maximize prediction accuracy for unseen data
Use the validation set to prune the tree
[Plot: error rate vs. # splits for training and validation data; "Prune here" marks the point where the validation error is lowest]
Pruning
CART lets the tree grow to full extent, then prunes it back
The idea is to find the point at which the validation error begins to rise
Generate successively smaller trees by pruning leaves
At each pruning stage, multiple trees are possible
Use cost complexity to choose the best tree at that stage
Cost Complexity
CC(T) = Err(T) + α L(T)
CC(T) = cost complexity of a tree
Err(T) = proportion of misclassified records
L(T) = size of the tree (number of leaves)
α = penalty factor attached to tree size (set by user)
Among trees of a given size, choose the one with the lowest CC
Do this for each size of tree
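A tiny sketch of comparing candidate trees by cost complexity; the error rates, leaf counts, and alpha below are made up for illustration:

```python
# Comparing candidate trees by CC(T) = Err(T) + alpha * L(T); numbers are made up.
def cost_complexity(err, n_leaves, alpha):
    return err + alpha * n_leaves

candidates = [
    {"name": "full tree", "err": 0.00, "leaves": 12},
    {"name": "smaller tree", "err": 0.02, "leaves": 5},
]
alpha = 0.005  # the user-set penalty factor
best = min(candidates, key=lambda t: cost_complexity(t["err"], t["leaves"], alpha))
print(best["name"])  # "smaller tree": 0.02 + 0.005*5 = 0.045 vs. 0.00 + 0.005*12 = 0.06
```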
Pruning Results
This process yields a set of trees of different sizes and associated error rates
Two trees of interest:
Minimum error tree: has the lowest error rate on the validation data
Best pruned tree: the smallest tree within one std. error of the minimum error; this adds a bonus for simplicity/parsimony
Beer Preference Example
[Two tree diagrams side by side. Full tree (training): root Age 42.5, then Income 34,375 and 41,173, then Income 39,180 and Age 51.5, leaves labeled Regular/Light. Pruned tree (validation): root Age 42.5, then Income 34,375 and 41,173, then Age 51.5, leaves labeled Regular/Light; record counts shown on the branches]
Extension to numerical Y: Regression Trees
Everything is similar to classification trees, except that the label of a leaf node is the average of the observations in that node
Non-parametric, no assumptions; can find global relationships between Y and the predictors
A nice alternative to linear regression for large datasets
XLMiner: Prediction > Regression Trees
Differences from CT
Prediction is computed as the average of the numerical target variable in the rectangle (in CT it is the majority vote)
Impurity is measured by the sum of squared deviations from the leaf mean
Performance is measured by RMSE (root mean squared error)
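A short scikit-learn sketch (not XLMiner) of a regression tree; the toy data are made up, and the point is simply that predictions are leaf means and performance is RMSE:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data: one numeric predictor and a numeric target (made up for illustration).
X = np.array([[20], [25], [30], [45], [50], [55]])
y = np.array([10.0, 12.0, 11.0, 30.0, 32.0, 31.0])

reg = DecisionTreeRegressor(max_depth=1).fit(X, y)  # a single split
pred = reg.predict(X)                               # each prediction is a leaf mean
rmse = np.sqrt(np.mean((y - pred) ** 2))            # root mean squared error
print(pred, round(float(rmse), 3))
```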
Advantages of trees
Easy to use and understand
Produce rules that are easy to interpret & implement
Variable selection & reduction is automatic
Do not require the assumptions of statistical models
Can work without extensive handling of missing data
Disadvantages
May not perform well where there is structure in the data that is not well captured by horizontal or vertical splits
Since the process deals with one variable at a time, there is no way to capture interactions between variables
Big Example: Personal Loan Offer
As part of its customer acquisition efforts, Universal Bank wants to run a campaign encouraging current customers to purchase a loan.
To improve target marketing, they want to find the customers that are most likely to accept the personal loan offer.
They use data from a previous campaign on 5,000 customers, 480 of whom accepted.
Personal Loan Data Description
ID                  Customer ID
Age                 Customer's age in completed years
Experience          # years of professional experience
Income              Annual income of the customer ($000)
ZIPCode             Home address ZIP code
Family              Family size of the customer
CCAvg               Avg. spending on credit cards per month ($000)
Education           Education level: 1 = Undergrad; 2 = Graduate; 3 = Advanced/Professional
Mortgage            Value of house mortgage, if any ($000)
Personal Loan       Did this customer accept the personal loan offered in the last campaign?
Securities Account  Does the customer have a securities account with the bank?
CD Account          Does the customer have a certificate of deposit (CD) account with the bank?
Online              Does the customer use internet banking facilities?
CreditCard          Does the customer use a credit card issued by Universal Bank?
File: “UniversalBankTrees.xls”
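A hedged sketch of the same workflow in Python (pandas + scikit-learn) rather than XLMiner; the file name and column names are taken from the slides and may not match the actual file exactly, and the partition sizes and ccp_alpha value are assumptions for illustration:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_excel("UniversalBankTrees.xls")   # file name from the slides

# Column names follow the data description above; they may need adjusting to the file.
predictors = ["Age", "Experience", "Income", "Family", "CCAvg", "Education",
              "Mortgage", "Securities Account", "CD Account", "Online", "CreditCard"]
X, y = df[predictors], df["Personal Loan"]

# Assumption: a simple 50/50 training/validation split (the slides use 2500/1500/1000).
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.5, random_state=1)

full_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.003, random_state=1).fit(X_train, y_train)

print(full_tree.score(X_valid, y_valid), pruned_tree.score(X_valid, y_valid))
```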
Full tree (CT_output4)
[Tree diagram: the root splits on Income at 92.5 (1,830 vs. 670 records); lower splits use CCAvg (2.95), Education (1.5), CD Account (0.5), Family (2.5), Income (114.5, 81.5, 116, 116.5, 168), Mortgage, and CCAvg again; several branches continue in sub-trees not shown; leaves are labeled 1 or 0]
Test Data (Using Full Tree)

Classification Confusion Matrix
                    Predicted Class
Actual Class        1       0
1                   88      14
0                   8       890

Error Report
Class      # Cases    # Errors    % Error
1          102        14          13.73
0          898        8           0.89
Overall    1000       22          2.20

Validation Data (Using Full Tree)

Classification Confusion Matrix
                    Predicted Class
Actual Class        1       0
1                   128     15
0                   17      1340

Error Report
Class      # Cases    # Errors    % Error
1          143        15          10.49
0          1357       17          1.25
Overall    1500       32          2.13

Training Data (Using Full Tree)

Classification Confusion Matrix
                    Predicted Class
Actual Class        1       0
1                   235     0
0                   0       2265

Error Report
Class      # Cases    # Errors    % Error
1          235        0           0.00
0          2265       0           0.00
Overall    2500       0           0.00
Pruned tree (CT_Output3)
[Tree diagram: the root splits on Income at 92.5 (1,083 vs. 417 records); lower splits use Education (1.5), Family (2.5), Income (114.5 and 116), and CCAvg (2.95); leaves are labeled 1 or 0]
# Decision Nodes   % Error Training   % Error Validation
41                 0                  2.133333
40                 0.04               2.2
39                 0.08               2.2
38                 0.12               2.2
37                 0.16               2.066667
36                 0.2                2.066667
35                 0.2                2.066667
34                 0.24               2.066667
33                 0.28               2.066667
32                 0.4                2.066667
31                 0.48               2.133333
30                 0.48               2.133333
29                 0.56               2.133333
28                 0.6                1.866667
27                 0.64               1.866667
26                 0.72               1.866667
25                 0.76               1.866667
24                 0.88               1.866667
23                 0.88               1.733333
22                 0.88               1.733333
21                 0.96               1.733333
20                 0.96               1.733333
19                 1                  1.733333
18                 1                  1.733333
17                 1.12               1.733333
16                 1.12               1.533333
15                 1.12               1.533333
14                 1.16               1.533333
13                 1.16               1.6
12                 1.2                1.6
11                 1.2                1.466667   <-- Min. Err. Tree (Std. Err. 0.003103929)
10                 1.6                1.666667
9                  2.2                1.666667
8                  2.2                1.866667
7                  2.24               1.866667
6                  2.24               1.6        <-- Best Pruned Tree
5                  4.44               1.8
4                  5.08               2.333333
3                  5.24               3.466667
2                  9.4                9.533333
1                  9.4                9.533333
0                  9.4                9.533333
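A small sketch of how the two trees of interest are read off such a table; only a few rows are reproduced, and the conversion of the reported std. error (a proportion) to percentage points is an assumption:

```python
# Reading the two trees of interest off the pruning table (only a few rows shown).
results = [(11, 1.466667), (10, 1.666667), (8, 1.866667), (6, 1.6), (5, 1.8), (4, 2.333333)]
std_err = 0.003103929 * 100   # assumption: reported std. error is a proportion; convert to % points

min_err_nodes, min_err = min(results, key=lambda r: r[1])            # minimum error tree
within_one_se = [r for r in results if r[1] <= min_err + std_err]    # within one std. error of the min.
best_pruned_nodes = min(nodes for nodes, _ in within_one_se)         # smallest such tree

print(min_err_nodes, best_pruned_nodes)  # 11 (min. error tree), 6 (best pruned tree)
```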
Test Data scoring - Summary Report (Using Best Pruned Tree)

Classification Confusion Matrix
                    Predicted Class
Actual Class        1       0
1                   88      14
0                   3       895

Error Report
Class      # Cases    # Errors    % Error
1          102        14          13.73
0          898        3           0.33
Overall    1000       17          1.70
Validation Data scoring

Classification Confusion Matrix
                    Predicted Class
Actual Class        1       0
1                   127     16
0                   8       1349

Error Report
Class      # Cases    # Errors    % Error
1          143        16          11.19
0          1357       8           0.59
Overall    1500       24          1.60