2
Chapter 7. Classification and Prediction
What is classification? What is prediction? Issues regarding classification and prediction Classification by decision tree induction Bayesian Classification Classification by Neural Networks Classification by Support Vector Machines (SVM) Classification based on concepts from association
rule mining Other Classification Methods Prediction Classification accuracy Summary
3
Supervised vs. Unsupervised Learning
Supervised learning (classification) Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating the class of the observations
New data is classified based on the training set Unsupervised learning (clustering)
The class labels of the training data are unknown Given a set of measurements, observations,
etc. with the aim of establishing the existence of classes or clusters in the data
4
Prediction Problems: Classification vs. Numeric Prediction
Classification predicts categorical class labels (discrete or nominal): it classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
Numeric Prediction models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications: credit/loan approval; medical diagnosis: whether a tumor is cancerous or benign; fraud detection: whether a transaction is fraudulent; web page categorization: which category a page belongs to
5
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes Each tuple/sample is assumed to belong to a predefined
class, as determined by the class label attribute The set of tuples used for model construction is training set The model is represented as classification rules, decision
trees, or mathematical formulae Model usage: for classifying future or unknown objects
Estimate accuracy of the model The known label of test sample is compared with the
classified result from the model Accuracy rate is the percentage of test set samples that
are correctly classified by the model Test set is independent of the training set, otherwise overfitting will occur If the accuracy is acceptable, use the model to classify
data tuples whose class labels are not known
6
Classification Process (1): Model Construction
Training Data:

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classification Algorithms learn the Classifier (Model), e.g.:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
7
Classification Process (2): Use the Model in Prediction
Classifier (Model) applied to Testing Data:

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) -> Tenured?
8
Chapter 7. Classification and Prediction
What is classification? What is prediction? Issues regarding classification and prediction Classification by decision tree induction Bayesian Classification Classification by Neural Networks Classification by Support Vector Machines (SVM) Classification based on concepts from association
rule mining Other Classification Methods Prediction Classification accuracy Summary
9
Issues Regarding Classification and Prediction (1): Data Preparation
Data cleaning Preprocess data in order to reduce noise and
handle missing values Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes Data transformation
Generalize and/or normalize data
10
Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
Predictive accuracy Speed and scalability
time to construct the model time to use the model
Robustness handling noise and missing values
Scalability efficiency in disk-resident databases
Interpretability: understanding and insight provided by the model
Goodness of rules decision tree size compactness of classification rules
11
Chapter 7. Classification and Prediction
What is classification? What is prediction? Issues regarding classification and prediction Classification by decision tree induction Bayesian Classification Classification by Neural Networks Classification by Support Vector Machines (SVM) Classification based on concepts from association
rule mining Other Classification Methods Prediction Classification accuracy Summary
12
Classification by Decision Tree Induction
Decision tree A flow-chart-like tree structure Internal node denotes a test on an attribute Branch represents an outcome of the test Leaf nodes represent class labels or class distribution
Decision tree generation consists of two phases Tree construction
At start, all the training examples are at the root Partition examples recursively based on selected
attributes Tree pruning
Identify and remove branches that reflect noise or outliers
Use of the decision tree, once it is built: classifying an unknown sample
13
Training Dataset
age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31…40    high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31…40    low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31…40    medium  no       excellent      yes
31…40    high    yes      fair           yes
>40      medium  no       excellent      no
This follows an example from Quinlan’s ID3 (Han, Table 7.1)
Original data from DMPML Ch 4 Sec 4.3 pp 8-9,89-94
14
Output: A Decision Tree for “buys_computer”
Root: age? (14 cases)
  <=30 (5 cases): student? (no -> no, yes -> yes)
  31..40 (4 cases): yes
  >40 (5 cases): credit rating? (excellent -> no, fair -> yes)
15
In Practice
For the i-th data object, the input information Ii is known at some time, and the output Oi (yes/no) is assigned at a later time
For all data objects in the training data set (i = 1..14), both input and output are known
[Timeline: both Ii and Oi lie in the past, before the current time]
16
In Practice
For a new customer n, the input information In is known at the current time, but the output On is not known
It is yet to be classified (yes or no) before the actual buying behavior is realized
The value of a data mining study is to predict this buying behavior beforehand
[Timeline: In lies at or before the current time, while On lies in the future]
17
ID3 Algorithm for Decision Tree Induction
ID3 algorithm Quinlan (1986) Tree is constructed in a top-down recursive divide-and-
conquer manner At start, all the training examples are at the root Attributes are categorical (if continuous-valued, they are
discretized in advance) Examples are partitioned recursively based on selected
attributes Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain) Conditions for stopping partitioning
All samples for a given node belong to the same class There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf There are no samples left
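To make the algorithm concrete, here is a minimal ID3-style sketch in Python. It is illustrative only, not code from the slides: the helper names (entropy, info_gain, id3) are assumptions, and the data is assumed to be a list of dicts of categorical attribute values plus a parallel list of class labels.

```python
# Minimal ID3 sketch (illustrative; names and data layout are assumptions, not from the slides).
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Reduction in entropy obtained by splitting on the categorical attribute attr."""
    n = len(labels)
    parts = {}
    for row, y in zip(rows, labels):
        parts.setdefault(row[attr], []).append(y)
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

def id3(rows, labels, attrs):
    """Top-down, recursive, divide-and-conquer tree construction."""
    if len(set(labels)) == 1:                      # all samples in one class -> leaf
        return labels[0]
    if not attrs:                                  # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {best: {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attrs if a != best])
    return node
```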
18
Entropy of a Simple Event
the average amount of information needed to predict an event s
Expected information of an event s is I(s) = -log2(p_s) = log2(1/p_s), where p_s is the probability of event s
Entropy is high for low-probability events and low otherwise:
as p_s goes to zero, the event s becomes rarer and rarer, and hence its entropy approaches infinity
as p_s goes to one, the event s becomes more and more certain, and hence its entropy approaches zero
Hence entropy is a measure of the randomness, disorder, or rareness of an event
19
[Plot: entropy = -log2(p) versus probability p, shown for p = 0.25, 0.5, 1]

Computational Formulas
log2(1) = 0, log2(2) = 1
log2(0.5) = log2(1/2) = log2(2^-1) = -log2(2) = -1
log2(0.25) = -2
Change of base: log_a(x) = log_b(x) / log_b(a) for any bases a and b
Choosing b = e makes it convenient to compute logarithms in any base:
log_a(x) = log_e(x) / log_e(a), in particular log2(x) = ln(x) / ln(2)
20
Entropy of a simple event
Entropy can be interpreted as the number of bits to encode that event
if p = 1/2, -log2(1/2) = 1 bit is required to encode that event
(0 and 1 are equally likely; one of them represents the specific event)
if p = 1/4, -log2(1/4) = 2 bits are required to encode that event
(00, 01, 10, 11 are each equally likely; only one represents the specific event)
if p = 1/8, -log2(1/8) = 3 bits are required to encode that event
(000, 001, 010, 011, 100, 101, 110, 111 are each equally likely; only one represents the specific event)
if p = 1/3, -log2(1/3) = -(loge(1/3))/loge(2) = 1.58 bits are required to encode that event
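A throwaway check of these numbers (assuming only the standard library):

```python
# Bits needed to encode an event of probability p: -log2(p)
import math
for p in (1/2, 1/4, 1/8, 1/3):
    print(f"p = {p:.3f}  ->  {-math.log2(p):.2f} bits")
# prints 1.00, 2.00, 3.00, 1.58
```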
21
Composite events
Consider the four events A, B, C, D A:tossing a fair coin where
PH = PT = ½ B:tossing an unfair coin where
PH = ¼ PT = 3/4 C:tossing an unfair coin where
PH = 0.001 PT = 0.999 D:tossing an unfair coin where
PH = 0.0, PT = 1.0 Which of these events is the most certain? How much information is needed to guess the
result of each toss in A to D? What is the expected information to predict the
outcome of events A, B, C, D, respectively?
22
Entropy of composite events
Average or expected information is highest when each event is equally likely
Event A (the fair coin) requires the most information; the expected information required to guess falls as the probability
of head approaches either 0 or 1 (as PH goes to 0 or PT goes to 1), moving toward case D: the composite event of tossing the coin becomes more and more certain
So for case D no information is needed, as the answer (tail) is already known
In general, what is the expected information to predict the outcome when the probability of head is p?
ent(S) = p[-log2(p)] + (1-p)[-log2(1-p)] = -p log2(p) - (1-p) log2(1-p)
The lower the entropy, the more certain the event and the less information is needed to predict it
This is the weighted average of the information of the simple events, weighted by their probabilities
23
Examples
When the event is certain (pH = 1, pT = 0 or pH = 0, pT = 1): ent(S) = -1·log2(1) - 0·log2(0) = 0
Note that lim_{x->0+} x·log2(x) = 0
For a fair coin (pH = 0.5, pT = 0.5):
ent(S) = -(1/2)·log2(1/2) - (1/2)·log2(1/2) = -1/2(-1) - 1/2(-1) = 1, so ent(S) = 1 when p = 0.5 and 1-p = 0.5
If the head and tail probabilities are unequal, the entropy is between 0 and 1
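The coin-entropy formula is easy to check numerically; a small sketch, assuming the convention 0·log2(0) = 0 noted above:

```python
import math

def coin_entropy(p):
    """ent(S) = -p*log2(p) - (1-p)*log2(1-p), with the 0*log2(0) = 0 convention."""
    return sum(q * math.log2(1 / q) for q in (p, 1 - p) if q > 0)

print(coin_entropy(1.0))   # 0.0   -> certain event
print(coin_entropy(0.5))   # 1.0   -> fair coin
print(coin_entropy(0.25))  # ~0.811 -> unequal probabilities, between 0 and 1
```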
24
[Plot: P(head) versus entropy for the event of tossing a coin. Entropy = -p·log2(p) - (1-p)·log2(1-p); it is 0 at p = 0 and p = 1 and peaks at 1 when p = 0.5]
25
In general
Entropy is a measure of the (im)purity of a sample variable S, defined as
Ent(S) = Σ_{s∈S} p_s·(-log2 p_s) = -Σ_{s∈S} p_s·log2 p_s
where s is a value of S (an element of the sample space)
and p_s is the estimated or subjective probability of s, for any s∈S
Note that Σ_s p_s = 1
26
Information needed to classify an object
class entropy is computed in a similar manner Entropy is 0 if all members of S belong to the same
class no information to classify an object
entropy of a partition is the weighted average of all entropies of all classes
Total number of objects: S. Suppose there are two classes C1 and C2
with cardinalities S1 and S2:
I(S1,S2) = -(S1/S)·log2(S1/S) - (S2/S)·log2(S2/S)
Or in general, with m classes C1, C2, …, Cm:
I(S1,S2,…,Sm) = -(S1/S)·log2(S1/S) - (S2/S)·log2(S2/S) - … - (Sm/S)·log2(Sm/S)
Probability of an object belonging to class i: pi = Si/S
27
Example cont.: At the root of the tree
There are 14 cases S = 14 9 yes denoted by Y, S1 = 9, py = 9/14 5 no denoted by N, S2 = 5, pn = 5/14
Y s and N s are almost equally likely How much information on the average is needed
to classify a person as buyer or not Without knowing any characteristics such as
age, income, …
I(S1,S2) = (S1/S)(-log2(S1/S)) + (S2/S)(-log2(S2/S))
I(9,5) = (9/14)(-log2(9/14)) + (5/14)(-log2(5/14)) = (9/14)(0.637) + (5/14)(1.485) = 0.940 bits
of information, close to 1
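The same computation in a short sketch (the helper name I mirrors the slide's notation and is otherwise an assumption):

```python
import math

def I(*counts):
    """Expected information (entropy) of a class distribution given by counts."""
    s = sum(counts)
    return -sum((c / s) * math.log2(c / s) for c in counts if c)

print(f"{I(9, 5):.3f}")   # 0.940 bits
```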
28
Expected information to classify with an attribute (1)
An attribute A with n distinct values a1,a2,…,an partitions the dataset into n distinct parts. Si,j is the number of objects in partition i (i=1..n) belonging to class Cj (j=1..m)
Expected information to classify an object knowing the value of attribute A is the weighted average of entropies of all
partitions Weighted by the frequency of that partition i
29
Expected information to classify with an attribute (2)
Ent(A) = I(A) = Σ_{i=1..n} (ai/S)·I(Si,1,…,Si,m)
= (a1/S)·I(S1,1,…,S1,m) + (a2/S)·I(S2,1,…,S2,m) + … + (an/S)·I(Sn,1,…,Sn,m)
where I(Si,1,…,Si,m) is the entropy of partition i:
I(Si,1,…,Si,m) = Σ_{j=1..m} (Sij/ai)·(-log2(Sij/ai)) = -(Si1/ai)·log2(Si1/ai) - (Si2/ai)·log2(Si2/ai) - … - (Sim/ai)·log2(Sim/ai)
ai = Σ_{j=1..m} Sij = the number of objects in partition i
Here Sij/ai is the probability of class j in partition i
30
Information Gain
Gain in information using the distinct values of attribute A is the reduction in entropy, i.e., in the information needed to classify an object:
Gain(A) = I(S1,…,Sm) - I(A) = average information without knowing A - average information knowing A
E.g., knowing such characteristics as age interval or income interval: how much does it help to classify a new object?
Can information gain be negative? Is it always greater than or equal to zero?
31
Another Example
Pick a student at random in BU. What is the chance that she is staying in a dorm?
Initially we have no specific information about her
If I ask her initials, does it help us in predicting the probability of her staying in a dorm? No
If I ask her address and record the city, does it help us in predicting the chance of her staying in a dorm? Yes
32
Attribute selection at the root
There are four attributes age, income, student, credit rating
Which of these provides the highest information in classifying a new customer, or equivalently,
which of these results in the highest information gain?
33
Testing age at the root node
Root: 14 cases (9 Y, 5 N)
  age <=30: 5 cases (2 Y, 3 N), I(2,3) = 0.971, decision: N
  age 31..40: 4 cases (4 Y, 0 N), I(4,0) = 0, decision: Y
  age >40: 5 cases (3 Y, 2 N), I(3,2) = 0.971, decision: Y
Accuracy: 10/14
Entropy(age) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2)
= 5/14·(-3/5·log2(3/5) - 2/5·log2(2/5)) + 4/14·(-4/4·log2(4/4) - 0/4·log2(0/4)) + 5/14·(-3/5·log2(3/5) - 2/5·log2(2/5)) = 0.694
Gain(age) = 0.940 - 0.694 = 0.246
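Reproducing these numbers in Python (a sketch; note that unrounded arithmetic gives 0.247 where the slides, rounding the intermediate entropy to 0.694, report 0.246):

```python
import math

def I(*counts):
    s = sum(counts)
    return -sum((c / s) * math.log2(c / s) for c in counts if c)

# age partitions: <=30 -> (2 Y, 3 N), 31..40 -> (4 Y, 0 N), >40 -> (3 Y, 2 N)
E_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)
print(f"Entropy(age) = {E_age:.3f}")          # 0.694
print(f"Gain(age) = {I(9, 5) - E_age:.3f}")   # 0.247 (0.246 in the slides after rounding)
```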
34
Expected information for age <=30
If age is <=30, the information needed to classify a new customer is I(2,3) = 0.971 bits, as the training data tells us that, knowing age <=30,
with 0.4 probability a customer buys but with 0.6 probability she does not
I(2,3) = -(3/5)·log2(3/5) - (2/5)·log2(2/5) = 0.6·0.737 + 0.4·1.322 = 0.971
But what is the weight of the age range <=30? 5 out of 14 samples are in that range, so (5/14)·I(2,3) is the weighted information needed to
classify a customer as buyer or not
35
Information gain by age
gain(age)= I(Y,N)-Entropy(age) = 0.940 – 0.694 =0.246 is the information gain Or reduction of entropy to classify a new
object Knowing the age of the customer increases
our ability to classify her as buyer or not, or helps us to predict her buying behavior
36
Attribute Selection by Information Gain Computation
Class Y: buys_computer = “yes”; Class N: buys_computer = “no”
I(Y,N) = I(9,5) = 0.940
Compute the entropy for age: “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s, hence its weighted term is (5/14)·I(2,3)
E(age) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694
Gain(age) = I(Y,N) - E(age) = 0.246
Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048

age     Yi  Ni  I(Yi, Ni)
<=30    2   3   0.971
31…40   4   0   0
>40     3   2   0.971

(Training data as in the buys_computer table above.)
37
Testing income at the root node
Root: 14 cases (9 Y, 5 N)
  income low: 4 cases (3 Y, 1 N), I(3,1) = ?, decision: Y
  income medium: 6 cases (4 Y, 2 N), I(4,2) = ?, decision: Y
  income high: 4 cases (2 Y, 2 N), I(2,2) = 1, decision: ?
Accuracy: 9/14
Entropy(income) = 4/14·(-3/4·log2(3/4) - 1/4·log2(1/4)) + 6/14·(-4/6·log2(4/6) - 2/6·log2(2/6)) + 4/14·(-2/4·log2(2/4) - 2/4·log2(2/4)) = 0.911
Gain(income) = 0.940 - 0.911 = 0.029
38
Testing student at the root node
Root: 14 cases (9 Y, 5 N)
  student no: 7 cases (3 Y, 4 N), I(3,4) = ?, decision: N
  student yes: 7 cases (6 Y, 1 N), I(6,1) = ?, decision: Y
Accuracy: 10/14
Entropy(student) = 7/14·(-3/7·log2(3/7) - 4/7·log2(4/7)) + 7/14·(-6/7·log2(6/7) - 1/7·log2(1/7)) = 0.789
Gain(student) = 0.940 - 0.789 = 0.151
39
Testing credit rating at the root node
Root: 14 cases (9 Y, 5 N)
  credit rating fair: 8 cases (6 Y, 2 N), I(6,2) = ?, decision: Y
  credit rating excellent: 6 cases (3 Y, 3 N), I(3,3) = 1, decision: ?
Accuracy: 9/14
Entropy(credit_rating) = 8/14·(-6/8·log2(6/8) - 2/8·log2(2/8)) + 6/14·(-3/6·log2(3/6) - 3/6·log2(3/6)) = 0.892
Gain(credit_rating) = 0.940 - 0.892 = 0.048
40
Comparing gains
Information gains for the attributes at the root node:
Gain(age) = 0.246, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048
Age provides the highest gain in information, so age is chosen as the attribute at the root
node; branch according to the distinct values of age (see the sketch after this slide)
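All four gains can be recomputed directly from the training table; the sketch below hard-codes the 14 tuples from the table above (attribute names as in the slides; everything else is an assumption):

```python
import math
from collections import Counter

# (age, income, student, credit_rating, buys_computer): the 14 training tuples
data = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
attributes = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

labels = [row[-1] for row in data]
for j, name in enumerate(attributes):
    parts = {}
    for row in data:
        parts.setdefault(row[j], []).append(row[-1])
    remainder = sum(len(p) / len(data) * entropy(p) for p in parts.values())
    print(f"Gain({name}) = {entropy(labels) - remainder:.3f}")
# Gain(age) = 0.247, Gain(income) = 0.029, Gain(student) = 0.152, Gain(credit_rating) = 0.048
# (the slides report 0.246 and 0.151 for age and student, having rounded intermediate entropies)
```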
41
After selecting age: first level of the tree
Root: age? (14 cases)
  <=30 (5 cases): continue
  31..40 (4 cases): yes
  >40 (5 cases): continue
42
Attribute selection at the age <=30 node (5 cases: 2 Y, 3 N)
  income: low: 1 Y, 0 N; medium: 1 Y, 1 N; high: 0 Y, 2 N
  student: no: 0 Y, 3 N; yes: 2 Y, 0 N
  credit rating: fair: 1 Y, 2 N; excellent: 1 Y, 1 N
Information gain for student is the highest, as knowing whether the customer is a student or not provides perfect information to classify her buying behavior
43
Attribute selection at the age >40 node (5 cases: 3 Y, 2 N)
  income: low: 1 Y, 1 N; medium: 2 Y, 1 N; high: 0 Y, 0 N
  student: no: 1 Y, 1 N; yes: 2 Y, 1 N
  credit rating: fair: 3 Y, 0 N; excellent: 0 Y, 2 N
Information gain for credit rating is the highest, as knowing whether the customer's credit rating is fair or excellent provides perfect information to classify her buying behavior
44
Exercise
Calculate all information gains in the second level of the tree that is after branching by the distinct values of age
45
Output: A Decision Tree for “buys_computer”
Root: age? (14 cases)
  <=30 (5 cases): student? (no -> no: 3 N, yes -> yes: 2 Y)
  31..40 (4 cases): yes (4 Y)
  >40 (5 cases): credit rating? (excellent -> no: 2 N, fair -> yes: 3 Y)
Accuracy: 14/14 on the training set
46
Advantages and Disadvantages of Decision Trees
Advantages: easy to understand and map nicely to production rules; suitable for categorical as well as numerical inputs; no statistical assumptions about the distribution of attributes; generation, and application to classify unknown samples, is very fast
Disadvantages: output attributes must be categorical; unstable: slight variations in the training data may result in different attribute selections and hence different trees; numerical input attributes lead to complex trees, as attribute splits are usually binary; not suitable for non-rectangular regions, such as regions separated by linear or nonlinear combinations of attributes, i.e., by lines (in 2 dimensions), planes (in 3 dimensions), or in general by hyperplanes (in n dimensions)
47
A classification problem for which decision trees are not suitable
[Scatter plot: income versus age. The two classes X and O are separated by an oblique line A-A; a decision tree, which splits on one attribute at a time, cannot represent this boundary well]
48
Other Attribute Selection Measures
Gain Ratio Gini index (CART, IBM IntelligentMiner)
All attributes are assumed continuous-valued Assume there exist several possible split
values for each attribute May need other tools, such as clustering, to
get the possible split values Can be modified for categorical attributes
49
Gain Ratio
Add another attribute, transaction ID (TID): for each observation the TID is different
E(TID) = (1/14)·I(1,0) + (1/14)·I(1,0) + … + (1/14)·I(1,0) = 0
gain(TID) = 0.940 - 0 = 0.940, the highest gain, so TID would be chosen as the test attribute,
which makes no sense; use the gain ratio rather than the gain
Split information: a measure of the information value of the split itself, without considering class information, only the number and size of the child nodes
A kind of normalization for information gain
50
Split information = Σ_i (-Si/S)·log2(Si/S): the information needed to assign an instance to one
of the branches
Gain ratio = gain(S) / split information(S)
In the previous example, split info and gain ratio for TID:
split info(TID) = [-(1/14)·log2(1/14)]·14 = 3.807
gain ratio(TID) = (0.940 - 0.0)/3.807 = 0.247
Split info for age: I(5,4,5) = -(5/14)·log2(5/14) - (4/14)·log2(4/14) - (5/14)·log2(5/14) = 1.577
gain ratio(age) = gain(age)/split info(age) = 0.246/1.577 = 0.156
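A small sketch of the split-information and gain-ratio computation (split_info is an assumed helper name):

```python
import math

def split_info(sizes):
    """Information value of a split, based only on the number and size of child nodes."""
    s = sum(sizes)
    return -sum((n / s) * math.log2(n / s) for n in sizes if n)

print(round(split_info([1] * 14), 3))            # 3.807  (TID: 14 singleton branches)
print(round(0.940 / split_info([1] * 14), 3))    # ~0.247 gain ratio for TID
print(round(split_info([5, 4, 5]), 3))           # 1.577  (age: branches of size 5, 4, 5)
print(round(0.246 / split_info([5, 4, 5]), 3))   # ~0.156 gain ratio for age
```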
51
Gini Index (IBM IntelligentMiner)
If a data set T contains examples from n classes, the gini index gini(T) is defined as
gini(T) = 1 - Σ_{j=1..n} pj²
where pj is the relative frequency of class j in T
If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
gini_split(T) = (N1/N)·gini(T1) + (N2/N)·gini(T2)
The attribute providing the smallest gini_split(T) is chosen to split the node (one needs to enumerate all possible splitting points for each attribute)
52
Gini Index (CART) Example
Ex.: D has 9 tuples with buys_computer = “yes” and 5 with “no”
gini(D) = 1 - (9/14)² - (5/14)² = 0.459
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}
gini_{income∈{low,medium}}(D) = (10/14)·gini(D1) + (4/14)·gini(D2) = 0.443
Similarly, D1: {medium, high}, D2: {low} gives a gini index of 0.450, and D1: {low, high}, D2: {medium} gives 0.458
The lowest gini index, 0.443, is for D1: {low, medium}, D2: {high}, so this is the best binary split on income
53
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules One rule is created for each path from the root to a leaf Each attribute-value pair along a path forms a conjunction The leaf node holds the class prediction Rules are easier for humans to understand Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
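As an illustration, the nested-dict tree produced by the earlier ID3 sketch can be flattened into IF-THEN rules, one per root-to-leaf path (a sketch; extract_rules is an assumed helper name):

```python
def extract_rules(tree, path=()):
    """One IF-THEN rule per root-to-leaf path of the nested-dict tree."""
    if not isinstance(tree, dict):                      # leaf: the class prediction
        cond = " AND ".join(f'{attr} = "{value}"' for attr, value in path)
        return [f'IF {cond} THEN class = "{tree}"']
    (attr, branches), = tree.items()
    rules = []
    for value, subtree in branches.items():
        rules += extract_rules(subtree, path + ((attr, value),))
    return rules
```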
54
Approaches to Determine the Final Tree Size
Separate training (2/3) and testing (1/3) sets Use cross validation, e.g., 10-fold cross
validation Use all the data for training
but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
Use minimum description length (MDL) principle halting growth of the tree when the encoding
is minimized
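One concrete way to run the 10-fold cross-validation idea is sketched below with scikit-learn, a library the slides do not reference; the dataset and parameters are placeholders.

```python
# 10-fold cross-validation of a decision tree, using scikit-learn
# (an assumed tool, not from the slides; iris is a placeholder dataset).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(max_depth=3, random_state=0), X, y, cv=10)
print(scores.mean())   # estimated predictive accuracy over the 10 folds
```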
55
Enhancements to basic decision tree induction
Allow for continuous-valued attributes Dynamically define new discrete-valued attributes
that partition the continuous attribute value into a discrete set of intervals
Handle missing attribute values Assign the most common value of the attribute Assign probability to each of the possible values
Attribute construction Create new attributes based on existing ones that
are sparsely represented This reduces fragmentation, repetition, and
replication
56
Avoid Overfitting in Classification
Overfitting: An induced tree may overfit the training data Too many branches, some may reflect anomalies
due to noise or outliers Poor accuracy for unseen samples
Two approaches to avoid overfitting Prepruning: Halt tree construction early—do not
split a node if this would result in the goodness measure falling below a threshold
Difficult to choose an appropriate threshold Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees Use a set of data different from the training data
to decide which is the “best pruned tree”
57
If the error rate of a subtree is higher than the error obtained by replacing the subtree with its most frequent leaf or branch, prune the subtree
How to estimate the prediction error? Do not use the training samples:
pruning always increases the error on the training sample
Estimate the error based on a test set: cost-complexity or reduced-error pruning
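A hedged sketch of cost-complexity pruning with a held-out test set, using scikit-learn (not referenced in the slides): grow a full tree, enumerate its pruning path, and keep the pruned tree with the lowest test error.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                  # placeholder dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# candidate pruning strengths (ccp_alpha values) for the fully grown tree
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_test, y_test),         # highest test accuracy = lowest test error
)
print(best.get_n_leaves(), best.score(X_test, y_test))
```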
58
Example
Labour negotiation data: the dependent variable, or class to be
predicted, is the acceptability of the contract: good or bad
independent variables: duration; wage increase in 1st year: <2.5%, >2.5%; working hours per week: <36, >36; health plan contribution: none, half, full
Figure 6.2 of WF shows a branch of a decision tree
59
wage increase 1st year
  <2.5: working hours per week
    <=36: 1 bad, 1 good
    >36: health plan contribution
      none (a): 4 bad, 2 good
      half (b): 1 bad, 1 good
      full (c): 4 bad, 2 good
  >2.5: …
60
AnswerTree
Variables measurement levels case weights frequency variables
Growing methods Stopping rules Tree parameters
costs, prior probabilities, scores and profits Gain summary Accuracy of tree Cost-complexity pruning
61
Variables
Categorical Variables nominal or ordinal
Continuous Variables
All tree-growing methods accept all types of variables; QUEST requires that the target variable be nominal
Target and predictor variables
target variable (dependent variable) predictor (independent variables)
Case weight and frequency variables
62
Case weight and frequency variables
CASE WEIGHT VARIABLES unequal treatment to the cases Ex: direct marketing
10,000 households respond and 1,000,000 do not respond
take all responders but only 1% of the nonresponders (10,000)
case weight 1 for responders and case weight 100 for nonresponders
FREQUENCY VARIABLES count of a record representing more than one
individual
63
Tree-Growing Methods
CHAID Chi-squared Automatic Interaction Detector Kass (1980)
Exhaustive CHAID Biggs, de Ville, Suen (1991)
C&RT (CART) Classification and Regression Trees Breiman, Friedman, Olshen, and Stone (1984)
QUEST Quick, Unbiased, Efficient Statistical Tree Loh, Shih (1997)
64
CHAID
evaluates all values of a potential predictor and merges values that are judged to be
statistically homogeneous with respect to the target variable
continuous target: F test; categorical target: chi-square test
not binary: can produce more than two branches at any particular level, giving wider trees than the binary methods; works for all types of variables, accepts case weights and frequency variables, and treats missing values as a single valid category (see the chi-square sketch after this slide)
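A minimal sketch of the CHAID-style merge test using scipy's chi-square test of independence (scipy is an assumption, not something the slides reference); the counts in the contingency table are made up for illustration.

```python
from scipy.stats import chi2_contingency

# assumed counts: rows = two candidate predictor categories, columns = target classes
table = [[30, 70],
         [25, 75]]
chi2, p, dof, expected = chi2_contingency(table)
merge = p > 0.05   # large p-value: no significant difference, so treat the categories as homogeneous and merge
print(round(p, 3), merge)
```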
65
Exhaustive CHAID
CHAID may not find the optimal split for a variable, as it stops merging categories once all remaining categories are statistically different
Exhaustive CHAID continues merging until only two super-categories are left,
examines the series of merges for the predictor and takes the set of categories giving the strongest
association with the target; it computes an adjusted p value for that
association and the best split for each predictor,
then chooses the predictor based on the p values; otherwise it is identical to CHAID
longer to compute, but safer to use
66
CART
Binary tree-growing algorithm; the binary splits may not present the structure efficiently, as it
partitions the data into two subsets and the same predictor variable may be used
several times at different levels
supports misclassification costs and prior probability distributions
Computation can take a long time with large data sets
Surrogate splitting is used for missing values
67
Steps in the CART analysis
At the root node t=1, search for the split s* giving the largest decrease in impurity: Δi(s*,1) = max_{s∈S} Δi(s,1)
split node 1 into t=2 and t=3 using s*
repeat the split-searching process in t=2 and t=3
continue until one of the stopping rules is met
68
Stopping rules
all cases have identical values for all predictors the node becomes pure; all cases have the
same value of the target variable the depth of the tree has reached its
prespecified maximum value the number of cases in a node less than a
prespecified minimum parent node size split at node results in producing a child node
with cases less than a prespecified min child node size
for CART only: max decrease in impurity is less than a prespecified value
69
QUEST
performs variable selection and split-point selection separately
computationally more efficient than CART
70
Classification in Large Databases
Classification—a classical problem extensively studied by statisticians and machine learning researchers
Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed
Why decision tree induction in data mining? relatively faster learning speed (than other
classification methods) convertible to simple and easy to understand
classification rules can use SQL queries for accessing databases comparable classification accuracy with other methods
71
Scalable Decision Tree Induction Methods in Data Mining Studies
SLIQ (EDBT’96 — Mehta et al.) builds an index for each attribute and only class list
and the current attribute list reside in memory SPRINT (VLDB’96 — J. Shafer et al.)
constructs an attribute list data structure PUBLIC (VLDB’98 — Rastogi & Shim)
integrates tree splitting and tree pruning: stop growing the tree earlier
RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti) separates the scalability aspects from the criteria
that determine the quality of the tree builds an AVC-list (attribute, value, class label)