SCALABLE CLASSIFICATION AND REGRESSION
TREE CONSTRUCTION
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
Alin Viorel Dobra
August 2003
© 2003 Alin Viorel Dobra
ALL RIGHTS RESERVED
SCALABLE CLASSIFICATION AND REGRESSION TREE CONSTRUCTION
Alin Viorel Dobra, Ph.D.
Cornell University 2003
Automating the learning process is one of the long standing goals of Artifi-
cial Intelligence and its more recent specialization, Machine Learning. Supervised
learning is a particular learning task in which the goal is to establish the con-
nection between some of the attributes of the data made available for learning,
called attribute variables, and the remaining attributes called predicted attributes.
This thesis is concerned exclusively with supervised learning using tree structured
models: classification trees for predicting discrete outputs and regression trees for
predicting continuous outputs.
In the case of classification and regression trees most methods for selecting
the split variable have a strong preference for variables with large domains. Our
first contribution is a theoretical characterization of this preference and a general
corrective method that can be applied to any split selection method. We further
show how the general corrective method can be applied to the Gini gain for discrete
variables when building k-ary splits.
In the presence of large amounts of data, efficiency of the learning algorithms
with respect to the computational effort and memory requirements becomes very
important. Our second contribution is a scalable construction algorithm for re-
gression trees with linear models in the leaves. The key to scalability is to use the
EM Algorithm for Gaussian Mixtures to locally reduce the regression problem to
a much easier classification problem.
The use of strict split predicates in classification and regression trees has unde-
sirable properties like data fragmentation and sharp decision boundaries, properties
that result in decreased accuracy. Our third contribution is the generalization of
the classic classification and regression trees by allowing probabilistic splits in a
manner that significantly improves the accuracy but, at the same time, does not
significantly increase the computational effort to build these types of models.
Biographical Sketch
Alin Dobra was born on September 20th, 1974 in Bistrita, Romania. He received
a B.S. in Computer Science from the Technical University of Cluj-Napoca, Romania,
in June 1998. He expects to receive a Ph.D. in Computer Science from Cornell
University in August 2003.
In the summers of 2001 and 2002, he interned at Bell Laboratories in Murray
Hill, NJ and worked with Minos Garofalakis and Rajeev Rastogi.
In Fall 2003, he is joining the Department of Computer and Information Science
and Engineering at the University of Florida, Gainesville, as an Assistant Professor.
To my parents
Acknowledgements
First and foremost I would like to thank my thesis adviser, Professor Johannes
Gehrke. This thesis would not have been possible without his guidance and support
for the last three years.
Many thanks and my love go to my wife, Delia, who has been there for me all
these years. I do not even want to imagine how my life would have been without
her and her support.
My eternal gratitude goes to my parents, especially my father, who put my
education above their personal comfort for more than 20 years. They encouraged
and supported my scientific curiosity from an early age even though they never
got the chance to pursue their own scientific dreams. I hope this thesis will bring
them much personal satisfaction and pride.
I met many great people during my five year stay at Cornell University. I thank
them all.
Table of Contents
1 Introduction
  1.1 Our Contributions
    1.1.1 Bias and bias correction in classification tree construction
    1.1.2 Scalable linear regression tree construction
    1.1.3 Probabilistic classification and regression trees
  1.2 Thesis Overview and Prerequisites
    1.2.1 Prerequisites
    1.2.2 Thesis Overview
2 Classification and Regression Trees
  2.1 Classification
  2.2 Classification Trees
  2.3 Building Classification Trees
    2.3.1 Tree Growing Phase
    2.3.2 Pruning Phase
  2.4 Regression Trees
3 Bias Correction in Classification Tree Construction
  3.1 Introduction
  3.2 Preliminaries
    3.2.1 Split Selection
  3.3 Bias in Split Selection
    3.3.1 A Definition of Bias
    3.3.2 Experimental Demonstration of the Bias
  3.4 Correction of the Bias
  3.5 A Tight Approximation of the Distribution of Gini Gain
    3.5.1 Computation of the Expected Value and Variance of Gini Gain
    3.5.2 Approximating the Distribution of Gini Gain with Parametric Distributions
  3.6 Experimental Evaluation
  3.7 Discussion
4 Scalable Linear Regression Trees
  4.1 Introduction
  4.2 Preliminaries: EM Algorithm for Gaussian Mixtures
  4.3 Previous solutions to linear regression tree construction
    4.3.1 Quinlan's construction algorithm
    4.3.2 Karalic's construction algorithm
    4.3.3 Chaudhuri et al.'s construction algorithm
  4.4 Scalable Linear Regression Trees
    4.4.1 Efficient Implementation of the EM Algorithm
    4.4.2 Split Point and Attribute Selection
  4.5 Empirical Evaluation
    4.5.1 Experimental testbed and methodology
    4.5.2 Experimental results: Accuracy
    4.5.3 Experimental results: Scalability
  4.6 Discussion
5 Probabilistic Decision Trees
  5.1 Introduction
  5.2 Probabilistic Decision Trees (PDTs)
    5.2.1 Generalized Decision Trees (GDTs)
    5.2.2 From Generalized Decision Trees to Probabilistic Decision Trees
    5.2.3 Speeding up Inference with PDTs
  5.3 Learning PDTs
    5.3.1 Computing sufficient statistics for PDTs
    5.3.2 Adapting DT algorithms to PDTs
    5.3.3 Split Point Fluctuations
  5.4 Empirical Evaluation
    5.4.1 Experimental Setup
    5.4.2 Experimental Results: Accuracy
    5.4.3 Experimental Results: Running Time
  5.5 Related Work
  5.6 Discussion
6 Conclusions
A Probability and Statistics Notions
  A.1 Basic Probability Notions
    A.1.1 Probabilities and Random Variables
  A.2 Basic Statistical Notions
    A.2.1 Discrete Distributions
    A.2.2 Continuous Distributions
B Proofs for Chapter 3
  B.0.3 Variance of the Gini gain random variable
  B.0.4 Mean and Variance of the χ2-test for the two-class case
List of Tables
1.1 Example Training Database
3.1 P-values at point x for parametric distributions as a function of expected value, µ, and variance, σ².
3.2 Experimental moments and predictions of moments for N = 100, n = 2, p1 = .5 obtained by Monte Carlo simulation with 1000000 repetitions. -T are theoretical approximations, -E are experimental approximations.
3.3 Experimental moments and predictions of moments for N = 100, n = 10, p1 = .5 obtained by Monte Carlo simulation with 1000000 repetitions. -T are theoretical approximations, -E are experimental approximations.
3.4 Experimental moments and predictions of moments for N = 100, n = 2, p1 = .01 obtained by Monte Carlo simulation with 1000000 repetitions. -T are theoretical approximations, -E are experimental approximations.
4.1 Accuracy on real (upper part) and synthetic (lower part) datasets of GUIDE and SECRET. In parenthesis we indicate O for orthogonal splits. The winner is in bold font if it is statistically significant and in italics otherwise.
5.1 Datasets used in experiments; top for classification and bottom for regression.
5.2 Classification tree experimental results.
5.3 Constant regression trees experimental results.
5.4 Linear regression trees experimental results.
B.1 Formulae for expressions over random vector [X1 . . . Xk] distributed Multinomial(N, p1, . . . , pk)
List of Figures
1.1 Example of classification tree for training data in Table 1.1
2.1 Classification Tree Induction Schema
3.1 Summary of notation for Chapter 3.
3.2 Contingency table for a generic dataset D and attribute variable X.
3.3 The bias of the Gini gain.
3.4 The bias of the information gain.
3.5 The bias of the gain ratio.
3.6 The bias of the p-value of the χ2-test (using a χ2-distribution).
3.7 The bias of the p-value of the G2-test (using a χ2-distribution).
3.8 Experimental p-value of Gini gain with one standard deviation error bars against p-value of theoretical gamma approximation for N = 100, n = 2 and p1 = .5.
3.9 Experimental p-value of Gini gain with one standard deviation error bars against p-value of theoretical gamma approximation for N = 100, n = 10 and p1 = .5.
3.10 Experimental p-value of Gini gain with one standard deviation error bars against p-value of theoretical gamma approximation for N = 100, n = 2 and p1 = .01.
3.11 Probability density function of Gini gain for attribute variables X1 and X2.
3.12 Bias of the p-value of the Gini gain using the gamma correction
4.1 Example of situation where average based decision is different from linear regression based decision
4.2 Example where classification on sign of residuals is unintuitive.
4.3 SECRET algorithm
4.4 Projection on Xr, Y space of training data.
4.5 Projection on Xd, Xr, Y space of same training data as in Figure 4.4
4.6 Separator hyperplane for two Gaussian distributions in two-dimensional space.
4.7 Tabular and graphical representation of running time (in seconds) of GUIDE, GUIDE with 0.01 of points as split points, SECRET and SECRET with oblique splits for synthetic dataset 3DSin (3 continuous attributes).
4.8 Tabular and graphical representation of running time (in seconds) of GUIDE, GUIDE with 0.01 of points as split points, SECRET and SECRET with oblique splits for synthetic dataset Fried (11 continuous attributes).
4.9 Running time of SECRET with linear regressors as a function of the number of attributes for dataset 3Dsin.
4.10 Accuracy of the best quadratic approximation of the running time for dataset 3Dsin.
4.11 Running time of SECRET with linear regressors as a function of the size of the 3Dsin dataset.
4.12 Accuracy as a function of learning time for SECRET and GUIDE with four sampling proportions.
5.1 Tabular and graphical representation of running time (in seconds) of vanilla SECRET and probabilistic SECRET(P), both with constant regressors, for synthetic dataset Fried (11 continuous attributes).
5.2 Tabular and graphical representation of running time (in seconds) of vanilla SECRET and probabilistic SECRET(P), both with linear regressors, for synthetic dataset Fried (11 continuous attributes).
B.1 Dependency of the function (1 − 6p1 + 6p1²)/(p1(1 − p1)) on p1.
Chapter 1
Introduction
Automating the learning process is one of the long standing goals of Artificial In-
telligence – and its more recent specialization, Machine Learning – but also the
core goal of newer research areas like Data-mining. The ability to learn from ex-
amples has found numerous applications in the scientific and business communities
– the applications include scientific experiments, medical diagnosis, fraud detec-
tion, credit approval, and target marketing (Brachman et al., 1996; Inman, 1996;
Fayyad et al., 1996) – since it allows the identification of interesting patterns or
connections either in the examples provided or, more importantly, in the natural or
artificial process that generated the data. In this thesis we are only concerned with
data presented in tabular format – we call each column an attribute and we asso-
ciate a name with it. Attributes whose domain is numerical are called numerical
attributes, whereas attributes whose domain is not numerical are called categorical
attributes. An example of a dataset about people living in a metropolitan area
is depicted in Table 1.1. Attribute “Car type” of this dataset is categorical and
attribute “Age” is numerical.
Two types of learning tasks have been identified in the literature: unsupervised
and supervised learning. They differ in the semantics associated with the attributes
of the learning examples and their goals.

Table 1.1: Example Training Database

Car Type   Driver Age   Children   Lives in Suburb?
sedan          23           0          yes
sports         31           1          no
sedan          36           1          no
truck          25           2          no
sports         30           0          no
sedan          36           0          no
sedan          25           0          yes
truck          36           1          no
sedan          30           2          yes
sedan          31           1          yes
sports         25           0          no
sedan          45           1          yes
sports         23           2          no
truck          45           0          yes
The general goal of unsupervised learning is to find interesting patterns in the
data, patterns that are useful for a higher level understanding of the structure of
the data. Types of interesting patterns that are useful are: groupings or clusters in
the data as found by various clustering algorithms (see for example the excellent
surveys (Berkhin, 2002; Jain et al., 1999)), and frequent item-sets (Agrawal &
Srikant, 1994; Hipp et al., 2000). Unsupervised learning techniques usually assign
the same role to all the attributes.
Supervised learning tries to determine a connection between a subset of the
attributes, called the inputs or attribute variables, and the dependent attribute or
outputs.1 Two of the central problems in supervised learning – the only ones we
are concerned with in this thesis – are classification and regression. Both problems
1 It is possible to have more than one dependent attribute, but for the purpose of this thesis we consider only one.
have as goal the construction of a succinct model that can predict the value of the
dependent attribute from the attribute variables. The difference between the two
tasks is the fact that the dependent attribute is categorical for classification and
numerical for regression.
Many classification and regression models have been proposed in the litera-
ture: Neural networks (Sarle, 1994; Kohonen, 1995; Bishop, 1995; Ripley, 1996),
genetic algorithms (Goldberg, 1989), Bayesian methods (Cheeseman et al., 1988;
Cheeseman & Stutz, 1996), log-linear models and other statistical methods (James,
1985; Agresti, 1990; Christensen, 1997), decision tables (Kohavi, 1995), and tree-
structured models, so-called classification and regression trees (Sonquist et al.,
1971; Gillo, 1972; Morgan & Messenger, 1973; Breiman et al., 1984). Excellent
overviews of classification and regression methods were given by Weiss and Ku-
likowski (1991), Michie et al. (1994) and Hand (1997).
Classification and regression trees – we call them collectively decision trees –
are especially attractive in a data mining environment for several reasons. First,
due to their intuitive representation, the resulting model is easy to assimilate by
humans (Breiman et al., 1984; Mehta et al., 1996). Second, decision trees are
non-parametric and thus especially suited for exploratory knowledge discovery.
Third, decision trees can be constructed relatively fast compared to other meth-
ods (Mehta et al., 1996; Shafer et al., 1996; Lim et al., 1997). And last, the
accuracy of decision trees is comparable or superior to other classification and re-
gression models (Murthy, 1995; Lim et al., 1997; Hand, 1997). In this thesis, we
restrict our attention exclusively to classification and regression trees. Figure 1.1
depicts a classification tree, built based on data in Table 1.1, that predicts if a
person lives in a suburb based on other information about the person. The
predicates that label the edges (e.g., Age ≤ 30) are called split predicates and the
attributes involved in such predicates, split attributes. In traditional classification
and regression trees only deterministic split predicates are used (i.e., given the split
predicate and the value of the attributes, we can determine whether the predicate is
true or false). Prediction with classification trees is done by navigating the tree on
true predicates until a leaf is reached, when the prediction in the leaf (YES or NO
in our example) is returned. The regions of the attribute variable space where the
decision is given by the same leaf will be called, throughout the thesis, decision
regions and the boundaries between such regions decision boundaries.
Figure 1.1: Example of classification tree for training data in Table 1.1
As can be observed from the figure, classification trees are easy to under-
stand – we immediately observe, for example, that people younger than 30 who
drive sports cars tend not to live in suburbs – and have a very compact repre-
sentation. For these reasons and others, detailed in Chapter 2, classification and
regression trees have been the subject of much research for the last two decades.
Nevertheless, at least in our opinion, more research is still necessary to fully un-
derstand and develop these types of learning models, especially from a statistical
perspective. The synergy of Statistics, Machine Learning and Data-mining meth-
ods, when applied to classification and regression tree construction, is the main
theme in this thesis. The overall goal of our work was to design learning algo-
rithms that have good statistical properties, good accuracy and require reasonable
computational effort, even for large data-sets.
1.1 Our Contributions
Three problems in classification and regression tree construction received our at-
tention:
1.1.1 Bias and bias correction in classification tree con-
struction
Often, learning algorithms have undesirable preferences, especially in the pres-
ence of large amounts of noise. In the case of classification and regression trees
most methods for selecting the split variable have a strong preference for variables
with large domains. In this thesis we provide a theoretical characterization of this
preference and a general corrective method that can be applied to any split selec-
tion criteria to remove this undesirable bias. We show how the general corrective
method can be applied to the Gini gain for discrete variables when building k-ary
splits.
1.1.2 Scalable linear regression tree construction
In the presence of large amounts of data, efficiency of the learning algorithms with
respect to the computational effort and memory requirements becomes very impor-
tant. Part of this thesis is concerned with the scalable construction of regression
trees with linear models in the leaves. The key to scalability is to use the EM
Algorithm for Gaussian Mixtures to locally – at the level of each node being built
– reduce the regression problem to a classification problem. As a side benefit,
regression trees with oblique splits (involving a linear combination of predictor
attributes instead of a single attribute) can be easily built.
1.1.3 Probabilistic classification and regression trees
The use of strict split predicates in classification and regression trees has two
undesirable consequences. First, data is fragmented at an exponential rate and
therefore decisions in leaves are based on a small number of samples. Second, deci-
sion boundaries are sharp because a single leaf is responsible for prediction. One
principled way to address both these problems is to generalize classification and
regression trees to make probabilistic decisions. More specifically, a probabilistic
model is assigned to each branch and it is used to determine the probability to
follow the branch. Instead of using a single leaf to predict the output for a given
input, all leaves are used, but their contributions are weighted by the probability
to reach them when starting from the root. In this thesis we show how to find well
motivated probabilistic models and to design scalable algorithms for building such
probabilistic classification and regression trees.
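To make the weighted-leaf prediction concrete, the following Python sketch shows how a probabilistic tree could combine all leaves; it is an illustration only – the dictionary node layout and the sigmoid-style branch model are assumptions made for this example, not the models developed in Chapter 5.

```python
import math

# Sketch of prediction with probabilistic splits: every leaf contributes,
# weighted by the probability of reaching it from the root.  The branch
# probability functions below are hypothetical placeholders.

def predict(node, x):
    """Return the probability-weighted prediction for input x."""
    if node["leaf"]:
        return node["value"]
    total = 0.0
    for child, branch_prob in node["children"]:
        # branch_prob(x) is in [0, 1]; over all children the values sum to 1.
        total += branch_prob(x) * predict(child, x)
    return total

def soft_le_30(x):
    # Soft (probabilistic) version of the strict predicate Age <= 30.
    return 1.0 / (1.0 + math.exp((x["Age"] - 30) / 2.0))

tree = {"leaf": False, "children": [
    ({"leaf": True, "value": 0.2}, soft_le_30),                  # mostly "no suburb"
    ({"leaf": True, "value": 0.8}, lambda x: 1 - soft_le_30(x)), # mostly "suburb"
]}
print(predict(tree, {"Age": 25}), predict(tree, {"Age": 45}))
```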
1.2 Thesis Overview and Prerequisites
1.2.1 Prerequisites
This thesis requires relatively few prerequisites. We assume the reader is famil-
iar with basic linear algebra and calculus, in particular the notions of equations,
vectors, matrices and Riemann integrals. Standard textbooks on Linear Algebra
(our favorite reference is (Hefferon, 2003)) and Calculus (for example (Swokowski,
1991)) suffice. The thesis relies heavily on notions of Probability Theory and
Statistics. In Appendix A we provide an overview of the necessary notions and
results for reading this thesis. Certainly, readers familiar with these topics will
find it easier to follow the presentation – especially the proofs – but the exposition
in Appendix A should suffice.
1.2.2 Thesis Overview
Chapter 2 provides a broad introduction to classification and regression tree con-
struction. In the rest of the thesis we assume that the reader is familiar with
these notions. In Chapter 3 we address the bias and bias correction problem for
classification tree construction. We provide proofs of results in this chapter in
Appendix B. Chapter 4 is dedicated to the linear regression tree construction
problem, and Chapter 5 to probabilistic decision trees. Concluding remarks and
directions of future research are given in Chapter 6.
Chapter 2
Classification and Regression
Trees
In this chapter we give an introduction to classification and regression trees. We
first start by formally introducing the classification trees and present some con-
struction algorithms for building such classifiers. Then, we explain how regression
trees differ. As mentioned in the introduction, we collectively refer to these types
of models as decision trees.
2.1 Classification
Let X1, . . . , Xm, C be random variables where Xi has domain Dom(Xi). The random
variable C has domain Dom(C) = {1, . . . , k}. We call X1, . . . , Xm attribute variables
– m is the number of such attribute variables – and C the class label or predicted
attribute.
A classifier C is a function C : Dom(X1) × · · · × Dom(Xm) → Dom(C). Let
Ω = Dom(X1) × · · · × Dom(Xm) × Dom(C) be the set of events. The underlying
assumption in classification is the fact that the generative process for the data
is probabilistic; it generates the datasets according to an unknown probability
distribution P over the set of events Ω.
For a given classifier C and a given probability distribution P over Ω we can
introduce a functional RP(C) = P[C(X1, . . . , Xm) ≠ C] called the generalization
error of the classifier C. Given some information about P in the form of a set of
samples, we would like to build a classifier that best approximates P . This leads
us to the following:
Classifier Construction Problem: Given a training dataset D of N inde-
pendent identically distributed samples from Ω, sampled according to probability
distribution P , find a function C that minimizes the functional RP (C), where P is
the probability distribution used to generate D.
In general, the classifier construction problem is very hard to solve if we allow
the classifier to be an arbitrary function. Arguments rooted in statistical learning
theory (Vapnik, 1998) suggest that we have to restrict the class of classifiers that
we allow in order to hope to solve this problem. For this reason we restrict our
attention to a special type of classifier – classification trees.
2.2 Classification Trees
A classification tree is a directed, acyclic graph T with tree shape. The root of
the tree – denoted by Root(T ) – does not have any incoming edges. Every other
node has exactly one incoming edge and may have 0, 2 or more outgoing edges.
We call a node T without outgoing edges a leaf node, otherwise T is called an
internal node. Each leaf node is labeled with one class label; each internal node T
is labeled with one attribute variable XT , called the split attribute. We denote the
class label associated with a leaf node T by Label(T ).
Each edge (T, T ′) from an internal node T to one of its children T ′ has a
predicate q(T,T ′) associated with it where q(T,T ′) involves only the splitting attribute
XT of node T . The set of predicates QT on the outgoing edges of an internal node
T must contain disjoint predicates involving the split attribute whose disjunction
is true – for any value of the split attribute exactly one of the predicates in QT is
true. We will refer to the set of predicates in QT as the splitting predicates of T.
Given a classification tree T , we can define the associated classifier
CT (x1, . . . , xm) in the following recursive manner:
C(x1, . . . , xm, T) =
    Label(T)                  if T is a leaf node
    C(x1, . . . , xm, Tj)     if T is an internal node, Xi is the label of T,
                              and q(T,Tj)(xi) = true
(2.1)

CT(x1, . . . , xm) = C(x1, . . . , xm, Root(T))     (2.2)
thus, to make a prediction, we start at the root node and navigate the tree on
true predicates until a leaf is reached, when the class label associated with it is
returned as the result of the prediction.
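The recursion of Equations 2.1 and 2.2 can be transcribed directly into code. The sketch below is purely illustrative – the dictionary representation of nodes and predicates is an assumption made for the example, not a prescribed data structure.

```python
# Sketch of the recursive classifier C_T of Equations 2.1-2.2.
# A node is either a leaf {"label": ...} or an internal node
# {"attr": name, "edges": [(predicate, child), ...]} whose predicates
# partition the domain of the split attribute.

def classify(node, x):
    if "label" in node:                  # leaf: return Label(T)
        return node["label"]
    value = x[node["attr"]]              # value of the split attribute X_T
    for predicate, child in node["edges"]:
        if predicate(value):             # exactly one predicate q_(T,T_j) is true
            return classify(child, x)
    raise ValueError("split predicates do not cover the attribute domain")

# Part of the tree of Figure 1.1 (the subtree for Age > 30 is simplified here).
tree = {"attr": "Age", "edges": [
    (lambda v: v <= 30, {"attr": "Car Type", "edges": [
        (lambda v: v == "sedan", {"label": "YES"}),
        (lambda v: v in ("sports", "truck"), {"label": "NO"})]}),
    (lambda v: v > 30, {"label": "YES"}),
]}
print(classify(tree, {"Age": 25, "Car Type": "sports"}))   # -> NO
```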
If the tree T is a well-formed classification tree (as defined above), then the
function CT () is also well defined and, by our definition, a classifier which we call
a classification tree classifier, or in short a classification tree.
Two main variations have been proposed for classification trees – both are
in extensive use. If we allow at most two branches for any of the intermediate
nodes we get a binary classification tree; otherwise we get a k-ary classification
tree. Binary classification trees were introduced by Breiman et al. (1984); k-
ary classification trees were introduced by Quinlan (1986). The main difference
between these types of trees is in what predicates are allowed for discrete attribute
variables (for continuous attribute variables both allow only predicates of the form
X > c where c is a constant). For binary classification trees, predicates of the
form X ∈ S, with S a subset of the possible values of the attribute, are allowed.
This means that for each node we have to determine both a split attribute and
a split set. For discrete attributes in k-ary classification trees, there are as many
split predicates as there are values for the attribute variable and all are of the form
X = xi, with xi one of the possible values of X. In this situation, no split set has
to be determined but the fanout of the tree can be very large.
For continuous attribute variables, both types of classification trees split a node
into two parts on predicates of the form X ≤ s and its complement X > s, where
the real number s is called the split point.
Figure 1.1 shows an example of a binary classification tree that is built to
predict the data in the dataset in Table 1.1.
2.3 Building Classification Trees
Now that we introduced the classification trees, we can formally state the classifi-
cation tree construction problem by instantiating the general classifier construction
problem:
Classification Tree Construction Problem: Given a training dataset D of
N independent identically distributed samples from Ω, sampled according to prob-
ability distribution P , find a classification tree T such that the misclassification
rate functional RP (CT ) of the corresponding classifier CT is minimized.
The main issue with solving the classification tree problem in particular and the
classifier problem in general, is the fact that the classifier has to be a good predictor
for the distribution not for the sample made available from the distribution. This
means that we cannot just simply build a classifier that is as good as possible
with respect to the available sample – it is easy to see that we can achieve zero
error with arbitrary classification trees if we do not have contradicting examples –
since the noise in the data will be learned as well. This noise learning phenomenon,
called overfitting, is one of the main problems in classification. For this reason,
classification trees are built in two phases. In the first phase a tree as large as
possible is constructed in a manner that minimizes the error with respect to some
subset of the available data – subset that we call training data. In the second
phase the remaining samples – we call them the pruning data – are used to prune
the large tree by removing subtrees in a manner that reduces the estimate of the
generalization error computed using the pruning data. We discuss each of these
two phases individually in what follows.
2.3.1 Tree Growing Phase
Several aspects of decision tree construction have been shown to be NP-hard. Some
of these are: building optimal trees from decision tables (Hyafil & Rivest, 1976),
constructing minimum cost classification tree to represent a simple function (Cox
et al., 1989), and building optimal classification trees in terms of size to store
information in a dataset (Murphy & Mccraw, 1991).
In order to deal with the complexity of choosing the split attributes and split
sets and points, most of the classification tree construction algorithms use the
Input: node T, data-partition D, split selection method V
Output: classification tree T for D rooted at T

Top-Down Classification Tree Induction Schema:
BuildTree(Node T, data-partition D, split attribute selection method V)
(1)  Apply V to D to find the split attribute X for node T.
(2)  Let n be the number of children of T.
(3)  if (T splits)
(4)      Partition D into D1, . . . , Dn and label node T with split attribute X
(5)      Create children nodes T1, . . . , Tn of T and label the edge (T, Ti)
         with predicate q(T,Ti)
(6)      foreach i ∈ {1, .., n}
(7)          BuildTree(Ti, Di, V)
(8)      endforeach
(9)  else
(10)     Label T with the majority class label of D
(11) endif

Figure 2.1: Classification Tree Induction Schema
greedy induction schema in Figure 2.1. It consists in deciding, at each step, upon
a split attribute and split set or point, if necessary, partitioning the data according
with the newly determined split predicates and recursively repeating the process
on these partitions, one for each child. The construction process at a node is
terminated when a termination condition is satisfied. The only difference between
the two types of classification trees is the fact that for k-ary trees no split set needs
to be determined for discrete attributes.
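The schema in Figure 2.1 can be written as the following skeleton. This is an illustrative sketch, not the thesis's implementation: the split selection method V is assumed to be a function that returns the chosen attribute together with the split predicates, or None when the node should become a leaf (see the stopping criteria discussed later).

```python
from collections import Counter

# Skeleton of the top-down induction schema of Figure 2.1.
# select_split(data) is assumed to return (attribute, [predicate, ...]) or None;
# each record is (x, c) with x a dict of attribute values and c the class label.
# data is assumed to be non-empty.

def build_tree(data, select_split):
    split = select_split(data)
    if split is None:                                # termination: node becomes a leaf
        majority = Counter(c for _, c in data).most_common(1)[0][0]
        return {"label": majority}
    attr, predicates = split
    node = {"attr": attr, "edges": []}
    for q in predicates:                             # partition D into D_1, ..., D_n
        part = [(x, c) for (x, c) in data if q(x[attr])]
        node["edges"].append((q, build_tree(part, select_split)))
    return node
```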
We now discuss how the split attribute and split set or point are picked at each
step in the recursive construction process, then show some common termination
conditions.
Split Attribute Selection
At each step in the recursive construction algorithm, we have to decide on what
attribute variable to split. The purpose of the split is to separate, as much as
possible, the class labels from each other. To make this intuition useful, we need
a metric that estimates how much the separation of the classes is improved when
a particular split is performed. We call such a metric a split criterion or a split
selection method.
There is extensive research in the machine learning and statistics literature on
devising split selection criteria that produce classification trees with high predictive
accuracy (Murthy, 1997). We briefly discuss here only the ones relevant for our
work.
A very popular class of split selection methods are impurity-based (Breiman
et al., 1984; Quinlan, 1986). The popularity is well deserved since studies have
shown that this class of split selection methods have high predictive accuracy (Lim
et al., 1997), and at the same time they are simple and intuitive. Each impurity-
based split selection criterion is based on an impurity function Φ(p1, . . . , pk), with pj
interpreted as the probability of seeing the class label cj. Intuitively, the impurity
function measures how impure the data is. It is required to have the following
properties (Breiman et al., 1984):
1. to be concave:

   ∂²Φ(p1, . . . , pk) / ∂pi² < 0
2. to be symmetric in all its arguments, i.e. for π a permutation,
Φ(p1, . . . , pk) = Φ(pπ1 , . . . , pπk)
3. to have unique maximum at (1/k, . . . , 1/k) when the mix of class labels is
most impure
4. to achieve the minimum for (1, 0, . . . , 0), (0, 1, 0, . . . , 0), . . . , (0, . . . , 0, 1), when
the mix of class labels is the most pure
With this, for a node T of the classification tree being built, the impurity at
node T is:
i(T) = Φ(P[C = c1 | T], . . . , P[C = ck | T])
where P [C = cj|T ] is the probability that the class label is cj given that the data
reaches node T. We defer the discussion on how these statistics are computed to
the end of this section.
Given a set Q of split predicates on attribute variable X that split a node T
into nodes T1, . . . , Tn, we can define the reduction in impurity as:
∆i(T, X, Q) = i(T) − Σ_{i=1}^{n} P[Ti | T] · i(Ti)
            = i(T) − Σ_{i=1}^{n} P[q(T,Ti)(X) | T] · i(Ti)     (2.3)
Intuitively, the reduction in impurity is the amount of purity gained by splitting,
where the impurity after split is the weighted sum of impurities of each child node.
By instantiating the impurity function we get the first two split selection cri-
teria:
Gini Gain. This split criterion was introduced by Breiman et al. (1984). By
setting the impurity function to be the Gini index:
gini(T) = 1 − Σ_{j=1}^{k} P[C = cj | T]²

and plugging it into Equation 2.3 we get the Gini gain split criterion:

GG(T, X, Q) = gini(T) − Σ_{i=1}^{n} P[q(T,Ti)(X) | T] · gini(Ti)     (2.4)
For two class labels and a binary split, the Gini gain takes the more compact form:

GGb(T, X, Q) = 2 (P[T1 | T] (P[C = c0 | T1] − P[C = c0 | T]))² / (P[T1 | T] (1 − P[T1 | T]))     (2.5)
Information Gain. This split criterion was introduced by Quinlan (1986). By
setting the impurity function to be the entropy of the dataset
entropy(T) = − Σ_{j=1}^{k} P[C = cj | T] log P[C = cj | T]

and plugging it into Equation 2.3 we get the information gain split criterion:

IG(T, X, Q) = entropy(T) − Σ_{j=1}^{n} P[q(T,Tj)(X) | T] · entropy(Tj)     (2.6)
Gain Ratio. Quinlan introduced this adjusted version of the information gain
to remove the preference of information gain for attribute variables with large
domains (Quinlan, 1986).
GR(T, X, Q) = IG(T, X, Q) / ( − Σ_{j=1}^{|Dom(X)|} P[X = xj | T] log P[X = xj | T] )     (2.7)
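Computed on the empirical estimates, the impurity-based criteria reduce to simple arithmetic on class counts. The sketch below is illustrative only – the count-vector input format is an assumption – and evaluates the Gini gain and the information gain of a proposed split.

```python
import math

# Sketch: Gini gain and information gain from class counts, i.e. the empirical
# estimates of the probabilities appearing in Equations 2.4 and 2.6.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain(parent_counts, children_counts, impurity):
    """impurity(parent) minus the weighted impurity of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * impurity(child) for child in children_counts)
    return impurity(parent_counts) - weighted

# Example: 10 records of class c0 and 10 of c1, split into two children.
parent = [10, 10]
children = [[8, 2], [2, 8]]
print(gain(parent, children, gini))      # Gini gain
print(gain(parent, children, entropy))   # information gain (in bits)
```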
Two other popular split selection methods come from the statistics literature:
The χ2 Statistic (test).
χ²(T, X) = Σ_{i=1}^{|Dom(X)|} Σ_{j=1}^{k} (P[X = xi | T] · P[C = cj | T] − P[X = xi, C = cj | T])² / (P[X = xi | T] · P[C = cj | T])     (2.8)
estimates how much the class labels depend on the value of the split attribute.
Notice that the χ2-test does not depend on the set Q of split predicates. A known
result in the statistics literature, see for example (Shao, 1999), is the fact that
the χ2-test has, asymptotically, a χ2 distribution with (|Dom(X)| − 1)(k − 1) degrees
of freedom.
The G2-statistic.
G²(T, X, Q) = 2 · NT · IG(T, X, Q) · loge 2,     (2.9)
where NT is the number of records at node T . Asymptotically, the G2-statistic has
also a χ2 distribution (Mingers, 1987). Interestingly, it is identical to the informa-
tion gain up to a multiplicative constant, which immediately gives an asymptotic
approximation for the distribution of information gain.
Note that all split criteria except χ2-test take the set of split predicates as
argument. For discrete attribute variables in k-ary classification trees, the set of
predicates is completely determined by specifying the attribute variable, but this
is not the case for discrete variables for binary trees or continuous variables. In
these last two situations we also have to determine the best split set or point in
order to evaluate how good a split on a particular attribute variable is.
Split Set Selection for Discrete Attributes
Most of the set selection methods proposed in the literature use the same split
criterion used for split attribute selection in order to evaluate all possible splits
and select the best one as the split set. This method is referred to as exhaustive search,
since all possible splits of the set of values of an attribute variable are evaluated, at
least in principle. In general, this process of finding the split set is computationally
intensive except when the domain of the split attribute and the number of class
labels are small. There is though a notable exception due to Breiman et al. (1984),
when there is an efficient algorithm to find the split set: the case when there are
only two class labels and an impurity based selection criterion is used. Since this
algorithm is relevant for some parts of our work, we describe it here.
Let us first start with the following:
Theorem 1 (Breiman et al. (1984)). Let I be a finite set, qi, ri, i ∈ I be
positive quantities and Φ(x) a concave function. For I1, I2 a partitioning of I, an
optimum of the problem
argmin_{I1,I2}  Σ_{i∈I1} qi Φ( (Σ_{i∈I1} qi ri) / (Σ_{i∈I1} qi) ) + Σ_{i∈I2} qi Φ( (Σ_{i∈I2} qi ri) / (Σ_{i∈I2} qi) )

has the property that:

∀i ∈ I1, ∀j ∈ I2 : ri < rj
A direct consequence of this theorem is an efficient algorithm for solving this type
of optimization problem: order the elements of I in increasing order of ri and
consider only the |I| ways to split the set I in this order. The correctness of the
algorithm is guaranteed by the fact that the optimum split will be among the splits
considered.
With this, setting I = Dom(X), qi = P [X = xi|T ], ri = P [C = c0|X = xi, T ]
and Φ(x) to be the Gini index or entropy for the two class labels case (both are
concave):
gini(T) = 2 P[C = c0 | T] (1 − P[C = c0 | T])

entropy(T) = −P[C = c0 | T] ln(P[C = c0 | T]) − (1 − P[C = c0 | T]) ln(1 − P[C = c0 | T])
the optimization criterion, up to a constant factor, is exactly the Gini gain or
information gain. Thus, to efficiently find the best split set, we order elements of
Dom(X) in increasing order of ri = P[C = c0 | X = xi, T] and consider splits only
in this order.
Since all the split criteria we introduced, except the χ2-test, use either the Gini
gain or information gain multiplied with a factor that does not depend on the split
set, this fast split set selection method can be used for all of them.
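For two class labels this gives a very simple procedure, sketched below for the Gini gain (an illustration with an assumed count-dictionary input, not code from the thesis): order the attribute values by the empirical P[C = c0 | X = xi, T] and evaluate only the contiguous splits of that ordering.

```python
# Sketch of the efficient split-set search for two class labels.
# counts[x] = (n0, n1): number of records with X = x and class c0 / c1 at the node.

def gini_from_counts(n0, n1):
    n = n0 + n1
    return 0.0 if n == 0 else 2.0 * (n0 / n) * (n1 / n)

def best_split_set(counts):
    # Order the attribute values by the empirical P[C = c0 | X = x].
    values = sorted(counts, key=lambda x: counts[x][0] / sum(counts[x]))
    tot0 = sum(n0 for n0, _ in counts.values())
    tot1 = sum(n1 for _, n1 in counts.values())
    parent_gini = gini_from_counts(tot0, tot1)
    best = (None, -1.0)
    left0 = left1 = 0
    for i in range(len(values) - 1):                 # contiguous splits only
        n0, n1 = counts[values[i]]
        left0, left1 = left0 + n0, left1 + n1
        right0, right1 = tot0 - left0, tot1 - left1
        w_left = (left0 + left1) / (tot0 + tot1)
        gain = parent_gini - w_left * gini_from_counts(left0, left1) \
                           - (1 - w_left) * gini_from_counts(right0, right1)
        if gain > best[1]:
            best = (set(values[:i + 1]), gain)
    return best                                      # (split set S, Gini gain)

print(best_split_set({"sedan": (5, 2), "sports": (0, 4), "truck": (1, 2)}))
```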
It is worth mentioning that Loh and Shih (1997) proposed a different technique
that consists in transforming values of discrete attributes into continuous values
and using split point selection methods for continuous attributes to obtain the split
for discrete attributes.
Split Point Selection for Continuous Attributes
Two methods have been proposed in the literature to deal with the split point
selection problem for continuous attributes: exhaustive search and Quadratic Dis-
criminant Analysis.
Exhaustive search uses the same split selection criteria as does the split at-
tribute selection method and consists in evaluating all the possible ways to split
the domain of the continuous attribute in two parts. To make the process efficient,
the available data is first sorted on the attribute variable that is being evaluated and
then traversed in order. At the same time, the sufficient statistics are incrementally
maintained and the value of the split criteria computed for each split point. This
means that the overall process requires a sort and a linear traversal with constant
processing time per value. Most of the classification tree construction algorithms
proposed in the literature use the exhaustive search.
Loh and Shih (1997) proposed using Quadratic Discriminant Analysis (QDA)
to find the split point for continuous attributes, and showed that, from the point
of view of accuracy of the produced trees, it is as good as exhaustive search. An
apparent problem with QDA is that it works only for two class label problems.
Loh and Shih (1997) suggested a solution to this problem: group the class labels
into two super-classes based on some class similarity and define QDA and the split
set problem in terms of these super-classes. This method can be used to deal with
the intractability of finding splits for categorical attributes when the number of
classes is larger than two.
We now briefly describe QDA. The idea is to approximate the distribution of
the data-points with the same class label with a normal distribution, and to take
as the split point the point between the centers of the two distributions with equi-
probability to belong to each of the distributions. More precisely, for a continuous
attribute X, the parameters of the two normal distributions – probability to belong
to the distribution αi, mean µi, and variance σi² – are determined with the formulae:

αi = P[C = ci | T]
µi = E[X | C = ci, T]
σi² = E[X² | C = ci, T] − µi²
and the equation of the split point µ is:

α1 · (1 / (σ1 √(2π))) · e^(−(µ − µ1)²/(2σ1²)) = α2 · (1 / (σ2 √(2π))) · e^(−(µ − µ2)²/(2σ2²))
which reduces to the following quadratic equation for the split point:

µ² (1/σ1² − 1/σ2²) − 2µ (µ1/σ1² − µ2/σ2²) + µ1²/σ1² − µ2²/σ2² = 2 ln(α1/α2) − ln(σ1²/σ2²)     (2.10)
If σ1² is very close to σ2², solving the second order equation is not numerically
stable. In this case it is preferable to solve the linear equation:

2µ(µ1 − µ2) = µ1² − µ2² − 2σ1² ln(α1/α2)

that is numerically solvable as long as µ1 ≠ µ2.
To compute the Gini gain of the variable X with split point µ we just need to
compute the sufficient statistics P[X ≤ µ | C = c0, T], P[X ≤ µ | C = c1, T], and

P[X ≤ µ | T] = P[C = c0 | T] P[X ≤ µ | C = c0, T] + P[C = c1 | T] P[X ≤ µ | C = c1, T]

and plug them into Equation 2.5. The probability P[X ≤ µ | C = c0, T] is nothing
but the cumulative distribution function (c.d.f.) of the normal distribution with
mean µ1 and variance σ1² at point µ. That is:

P[X ≤ µ | C = c0, T] = ∫_{x ≤ µ} (1 / (σ1 √(2π))) e^(−(x − µ1)²/(2σ1²)) dx
                     = (1/2) (1 + Erf((µ − µ1) / (σ1 √2)))
P[X ≤ µ | C = c1, T] is obtained similarly.
The advantage of QDA is the fact that no sorting of the data is necessary. The
sufficient statistics (see next section) can be easily computed in a single pass over
the data in any order and solving the quadratic equation gives the split point.
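A small numerical sketch of the QDA split-point computation is given below. It is an illustration under the assumptions above (two classes, per-class normal approximations) and falls back to the linear equation when the two variances are nearly equal; the function and variable names are not taken from the thesis.

```python
import math

# Sketch of QDA split-point selection for a continuous attribute and two classes.
# xs0, xs1: attribute values of the records with class c0 and c1 at the node
# (both assumed non-empty).

def qda_split_point(xs0, xs1, eps=1e-9):
    n0, n1 = len(xs0), len(xs1)
    a0, a1 = n0 / (n0 + n1), n1 / (n0 + n1)                  # alpha_i
    m0, m1 = sum(xs0) / n0, sum(xs1) / n1                    # mu_i
    v0 = sum(x * x for x in xs0) / n0 - m0 * m0              # sigma_i^2
    v1 = sum(x * x for x in xs1) / n1 - m1 * m1
    if abs(v0 - v1) < eps:
        # Near-equal variances: use the linear equation instead of (2.10).
        return (m0 * m0 - m1 * m1 - 2 * v0 * math.log(a0 / a1)) / (2 * (m0 - m1))
    # Quadratic equation (2.10) written as A*mu^2 + B*mu + C = 0.
    A = 1 / v0 - 1 / v1
    B = -2 * (m0 / v0 - m1 / v1)
    C = m0 * m0 / v0 - m1 * m1 / v1 - 2 * math.log(a0 / a1) + math.log(v0 / v1)
    disc = math.sqrt(B * B - 4 * A * C)
    roots = [(-B - disc) / (2 * A), (-B + disc) / (2 * A)]
    # Keep the root that lies between the two class means.
    lo, hi = min(m0, m1), max(m0, m1)
    between = [r for r in roots if lo <= r <= hi]
    return between[0] if between else roots[0]

print(qda_split_point([1.0, 2.0, 2.5, 3.0], [5.0, 6.0, 7.5, 8.0]))
```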
Stopping Criteria
The recursive process of constructing classification trees has to be eventually
stopped. The most popular stopping criteria – we use it throughout the thesis
– is to stop the growth of the tree when the number of data-points on which the
decision is based goes below a prescribed minimum. By stopping the growth when
a small amount of data is available, we avoid making statistically insignificant decisions
that are likely to be very noisy, thus wrong. Other possibilities are to stop the tree
growth when no predictive attribute can be found – this can be quite damaging to the
construction algorithm since no single variable might be predictive but a combination
of variables can be – or when the tree reaches a maximum height.
Computing the Sufficient Statistics
So far, we have seen how the classification tree construction process can be re-
duced to sufficient statistics computation for every node. Here we explain how the
sufficient statistics can be estimated using the training data. The idea is to use the
usual empirical estimates; throughout the thesis we use the symbol e= to denote
the empirical estimate of a probability or expectation. This means that:

1. for probabilities of the form P[p(Xj) | T], with p(Xj) some predicate on attribute
   variable Xj, the estimate is simply the number of data-points in the training
   dataset at node T, DT, for which the predicate p(Xj) holds, over the overall
   number of data-points in DT:

   P[p(Xj) | T] e= |{(x, c) ∈ DT : p(xj)}| / |DT|

2. for conditional probabilities of the form P[p(Xj) | C = c0, T], the estimate is:

   P[p(Xj) | C = c0, T] e= |{(x, c0) ∈ DT : p(xj)}| / |{(x, c0) ∈ DT}|

3. for expectations of functions of attributes, like E[f(Xj) | T], the estimate is
   simply the average value of the function applied to the attribute for the
   data-points in DT:

   E[f(Xj) | T] e= (Σ_{(x,c)∈DT} f(xj)) / |DT|

   where f(x) is the function whose expectation is being estimated

4. for expectations of the form E[f(Xj) | C = c0, T], the estimate is:

   E[f(Xj) | C = c0, T] e= (Σ_{(x,c0)∈DT} f(xj)) / |{(x, c0) ∈ DT}|
Note that the estimates for all these sufficient statistics can be computed in a
single pass over the data. Gehrke et al. (1998) explain how these sufficient statistics
can be efficiently computed using limited memory and secondary storage.
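As an illustration of the single pass, the sketch below (with a hypothetical record layout: pairs (x, c) where x is a dictionary of attribute values) collects the joint counts and class counts at a node, from which the empirical estimates above follow by division.

```python
from collections import Counter

# Sketch: one pass over the data D_T collects the counts needed for the
# empirical estimates above (for one attribute; repeat per attribute).

def sufficient_statistics(data, attribute):
    joint = Counter()              # co-occurrence counts of (attribute value, class)
    class_counts = Counter()       # counts of each class label
    for x, c in data:
        joint[(x[attribute], c)] += 1
        class_counts[c] += 1
    n = len(data)
    # e.g. empirical P[X = v | C = c, T] = joint[(v, c)] / class_counts[c]
    #      empirical P[C = c | T]        = class_counts[c] / n
    return joint, class_counts, n
```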
2.3.2 Pruning Phase
In this thesis we use exclusively Quinlan’s re-substitution error pruning (Quinlan,
1993a). A comprehensive overview of other pruning techniques can be found in
(Murthy, 1997).
Re-substitution error pruning consists in eliminating subtrees in order to obtain
a tree with the smallest error on the pruning set, a separate part of the data used
only for pruning. To achieve this, every node estimates its contribution to the error
on the pruning data when the majority class is used as an estimate. Then, starting
from the leaves and going upward, every node compares the contribution to the
error by using the local prediction with the smallest possible contribution to the
error of its children (if a node is not a leaf in the final tree, it has no contribution to
the error, only leaves contribute), and prunes the tree if the local error contribution
is smaller – this results in the node becoming a leaf. Since after visiting any node
the subtree rooted at it is optimally pruned – this is the invariant maintained – the
whole tree is optimally pruned when the overall process finishes.
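A sketch of this bottom-up procedure is given below; it is illustrative only (the node representation matches the earlier sketches and is an assumption), and it collapses a node into a leaf whenever its own error on the pruning set is no larger than the summed error of its optimally pruned children.

```python
from collections import Counter

# Sketch of pruning with a separate pruning set; prune_data holds the (x, c)
# records of the pruning set that are routed to this node.

def prune(node, prune_data):
    """Prune the subtree rooted at node; return (pruned node, its pruning-set error)."""
    if "label" in node:                                  # leaf: error of its own label
        return node, sum(1 for _, c in prune_data if c != node["label"])
    # Error this node would incur if it became a leaf predicting the majority class.
    counts = Counter(c for _, c in prune_data)
    majority, hits = counts.most_common(1)[0] if counts else (None, 0)
    leaf_error = len(prune_data) - hits
    children_error, new_edges = 0, []
    for predicate, child in node["edges"]:
        part = [(x, c) for (x, c) in prune_data if predicate(x[node["attr"]])]
        pruned_child, err = prune(child, part)
        children_error += err
        new_edges.append((predicate, pruned_child))
    if counts and leaf_error <= children_error:
        return {"label": majority}, leaf_error           # collapse into a leaf
    return {"attr": node["attr"], "edges": new_edges}, children_error
```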
2.4 Regression Trees
We start with the formal definition of the regression problem and we present re-
gression trees, a particular type of regressors.
We have the random variables X1, . . . , Xm as in the previous section to which
we add the random variable Y, with the real line as its domain, which we call the predicted
attribute or output.
A regressor R is a function R : Dom(X1) × · · · × Dom(Xm) → Dom(Y). Now
if we let the set of events to be Ω = Dom(X1) × · · · × Dom(Xm) × Dom(Y ) we can
define probability measures P over Ω. Using such a probability measure and some
loss function L (e.g., the square loss function L(a, x) = ‖a − x‖²) we can define the
regressor error as RP(R) = EP[L(Y, R(X1, . . . , Xm))] where EP is the expectation
with respect to probability measure P . In this thesis we use only the square loss
function. With this we have:
Regressor Construction Problem: Given a training dataset D of N inde-
pendent identically distributed samples from Ω, sampled according to probability
distribution P , find a function R that minimizes the functional RP (R).
Regression Trees, the particular type of regressors we are interested in, are the
natural generalization of classification trees for regression problems. Instead of
associating a class label to every node, a real value or a functional dependency of
some of the inputs is used.
Regression trees were introduced by Breiman et al. (1984) and implemented in
their CART system. Regression trees in CART are binary trees, have a constant
numerical value in the leaves and use the variance as a measure of impurity. Thus
the split selection measure is:
Err(T) = Σ_{i=1}^{NT} (yi − ȳT)²     (2.11)

∆Err(T) = Err(T) − Err(T1) − Err(T2)     (2.12)

where ȳT denotes the average of the predicted attribute over the data-points at node T.
The reason for using variance as the impurity measure is justified by the fact
that the best constant predictor in a node is the average of the value of the predicted
variable on the training examples that correspond to the node; the variance is thus the
mean square error of the average used as a predictor.
An alternative split criterion proposed by Breiman et al. (1984) and used also
in (Torgo, 1997a) is based on the sample variance as the impurity measure:

ErrS(T) = Var(Y | T) e= (1/NT) Err(T)

∆ErrS(T) = ErrS(T) − P[T1 | T] · ErrS(T1) − P[T2 | T] · ErrS(T2)
Interestingly, if the maximum likelihood estimate is used for all the probabilities
and expectations, as it is usually done in practice, we have the following connection
between the variance and sample variance criteria:
∆ErrS(T) e= Err(T)/NT − (NT1/NT) · (Err(T1)/NT1) − (NT2/NT) · (Err(T2)/NT2) = ∆Err(T)/NT
Due to this connection, if there are no missing values, minimizing one of the criteria
results also in minimizing the other.
For a categorical attribute variable X, minimizing ∆ErrS(T ) can be done very
efficiently since the objective function in Theorem 1 with:
Φ(x) = −x²
qi = P[X = xi | T]
ri = E[Y | X = xi, T]

is exactly this criterion up to additive and multiplicative constants that do not
influence the solution (Breiman et al., 1984). This means that we can simply order
the elements in Dom(X) in increasing order of E[Y | X = xi, T] and consider splits
only in this order. If the empirical estimates are used for qi = P[X = xi | T] and
ri = E[Y | X = xi, T], the criterion ∆Err(T) is minimized.
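For illustration, the sketch below applies this ordering to find the best binary split of a categorical attribute under the variance criterion ∆Err of Equation 2.12; the per-value lists of Y values are an assumed input format, not the thesis's data structures.

```python
# Sketch: best binary split of a categorical attribute for a regression tree.
# stats[x] = list of Y values of the records with X = x at the node.

def sse(ys):
    """Sum of squared deviations from the mean, i.e. Err(T) for the group."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_regression_split(stats):
    # Order the categories by the empirical mean E[Y | X = x_i, T].
    values = sorted(stats, key=lambda x: sum(stats[x]) / len(stats[x]))
    all_ys = [y for x in values for y in stats[x]]
    best = (None, -1.0)
    for i in range(1, len(values)):                    # contiguous splits only
        left = [y for x in values[:i] for y in stats[x]]
        right = [y for x in values[i:] for y in stats[x]]
        gain = sse(all_ys) - sse(left) - sse(right)    # Delta Err(T) of Eq. (2.12)
        if gain > best[1]:
            best = (set(values[:i]), gain)
    return best

print(best_regression_split({"sedan": [3.0, 3.5], "sports": [9.0, 8.0], "truck": [4.0]}))
```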
As in the case of classification trees, prediction is made by navigating the tree
following branches with true predicates until a leaf is reached. The numerical value
associated with the leaf is the prediction of the model.
Usually the top-down induction schema algorithm like the one in Figure 2.1 is
used to build regression trees. Pruning is used to improve the accuracy on unseen
examples like in the classification tree case. Pruning methods for classification
trees can be straightforwardly adapted for regression trees (Torgo, 1998).
For the case of re-substitution error, we simply define the contribution to the
pruning error at a node to be Err(T ). Then the pruning mechanism designed for
classification trees can be also used for regression trees.
Chapter 3
Bias Correction in Classification
Tree Construction
In this chapter we address the problem of bias in split variable selection in clas-
sification tree construction. A split criterion is unbiased if the selection of a split
variable X is based only on the strength of the dependency between X and the
class label, regardless of other characteristics (such as the size of the domain of
X); otherwise the split criterion is biased. In this chapter we make the following
four contributions: (1) We give a definition that allows us to quantify the extent
of the bias of a split criterion, (2) we show that the p-value of any split criterion
is a nearly unbiased criterion, (3) we give theoretical and experimental evidence
that the correction is successful, and (4) we demonstrate the power of our method
by correcting the bias of the Gini gain.
3.1 Introduction
Split variable selection is one of the main components of classification tree con-
struction. The quality of the split selection criterion has a major impact on the
quality (generalization, interpretability and accuracy) of the resulting tree. Many
popular split criteria suffer from bias toward attribute variables with large domains
(White & Liu, 1994; Kononenko, 1995).
Consider two attribute variables X1 and X2 whose association with the class
label is equally strong (or weak). Intuitively, a split selection criterion is unbiased
if on a random instance the criterion chooses both X1 and X2 with probability 1/2
as split variables. Unfortunately, this is usually not the case.
There are two previous meanings associated with the notion of bias in decision
tree construction. First, Quinlan (1986) calls bias the preference toward attributes
with large domains, preference that is easily observed when the dataset contains
exactly one data-point for each possible value of the attribute variable with the
large domain. In this case the attribute with large domain has the best possible
value for the entropy gain irrespective of how predictive it actually is, thus is always
preferred to an attribute with smaller domain that might be more predictive (but
not perfect). Second, White and Liu (1994) call bias the difference in distribution
of the split criteria applied to different attribute variables. In this chapter, we start
in Section 3.3 by giving a precise, quantitative definition of bias in split variable
selection. By extending the studies by White and Liu (1994) and Kononenko
(1995), we quantify in an extensive experimental study the bias in split selection
for the case that none of the attribute variables is correlated with the class label.
Section 3.4 contains the heart of our contribution in this chapter. Assume that
we use split criterion s(D, X) to calculate the quality q of attribute variable X as
split variable for training dataset D. Consider the p-value p of the value q, which
is the probability to see a value as extreme as the observed value q in the case that
X is not correlated with the class label. In Section 3.4, we prove that choosing the
variable with the lowest p-value results in a split selection criterion that is nearly
unbiased — independent of the initial split criterion s. Since previous criteria such
as χ2 and G2 (Mingers, 1987) and the permutation test (Frank & Witten, 1998) are
p-values, our theorem explains why χ2, G2, and the permutation test are virtually
unbiased. We continue in Section 3.5 by computing a tight approximation of the
distribution of Breiman’s Gini index for k-ary splits which gives us a theoretical
approximation of the p-value of the index. We demonstrate in Section 3.6 that our
new criterion is nearly unbiased.
Note that the general method that we propose is similar in spirit but different
from the work of Jensen and Cohen (2000) on the problems with multiple compar-
isons in induction algorithms. The bias in split selection for discrete variables is
not due to multiple comparisons, but rather due to inherent statistical fluctuations
as we explain in Section 3.3.
3.2 Preliminaries
In this section we introduce some more notation, useful only within this chapter and
Appendix B, which contains proofs of some results in this chapter. This notation
will allow us to keep the formulae concise and simplify the expressions for the
split selection criteria in Section 2.3.1, simplifications that facilitate theoretical
endeavors.
3.2.1 Split Selection
As in the rest of the thesis, we denote by D the training dataset consisting of
N data-points. We consider, without loss of generality, the problem of selecting
the split attribute at the root node of the classification tree. For an attribute
variable X with domain {x1, . . . , xn}, let Ni be the number of data-points in the
dataset D for which X = xi, for i ∈ {1, .., n}. As before, we denote by {c1, . . . , ck}
the domain of the class label C. Let Sj be the number of training records in
D for which C = cj, for j ∈ {1, .., k}. Denote by Aij the number of data-points for
which X = xi ∧ C = cj. Also let pj, j ∈ {1, .., k}, be the prior probability to see
class label cj in the dataset D. Obviously the following normalization constraint
holds:

$$\sum_{j=1}^{k} p_j = 1.$$

We summarize the notation in Figure 3.1.
Symbol   Meaning
D        dataset
N        size of D
X        attribute variable
xi       the i-th value in Dom(X)
Ni       number of data-points in D for which X = xi
C        class label
cj       the j-th value of the class label
Sj       number of data-points in D for which C = cj
pj       probability to observe class label cj
Aij      number of data-points in D for which X = xi and C = cj

Figure 3.1: Summary of notation for Chapter 3.
Using the notation we just introduced, we can form a contingency table for
dataset D as shown in Figure 3.2. We call the numbers on the last column and
the last row marginals since they obey the following marginal constraints:

$$\sum_{i=1}^{n} N_i = N, \qquad \sum_{i=1}^{n} A_{ij} = S_j, \qquad \sum_{j=1}^{k} A_{ij} = N_i$$
Using the contingency table we have the following maximum likelihood estimates:

$$P[X = x_i] = \frac{N_i}{N}, \quad p_j = P[C = c_j] = \frac{S_j}{N}, \quad P[C = c_j \wedge X = x_i] = \frac{A_{ij}}{N}, \quad P[C = c_j \mid X = x_i] = \frac{A_{ij}}{N_i}$$
Note that this contingency table contains the sufficient statistics for split selec-
tion criteria that make univariate splits (Gehrke et al., 1998); thus given the table,
any split selection criterion can compute the quality of X as split variable.
X      c1    ...   cj    ...   ck
x1     A11   ...   A1j   ...   A1k   | N1
...
xi     Ai1   ...   Aij   ...   Aik   | Ni
...
xn     An1   ...   Anj   ...   Ank   | Nn
       S1    ...   Sj    ...   Sk    | N

Figure 3.2: Contingency table for a generic dataset D and attribute variable X.
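As an illustration (this is not code from the thesis; function and variable names are our own), the contingency table and its marginals can be accumulated in a single pass over the data, assuming the attribute and class values have been integer-coded:

import numpy as np

def contingency_table(x, y, n, k):
    # x, y: integer-coded columns with values in {0..n-1} and {0..k-1}
    # A[i, j] counts the data-points with X = x_i and C = c_j
    A = np.zeros((n, k), dtype=np.int64)
    np.add.at(A, (x, y), 1)      # scatter-add one count per data-point
    Ni = A.sum(axis=1)           # row marginals N_i
    Sj = A.sum(axis=0)           # column marginals S_j
    return A, Ni, Sj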
We express now the split criteria introduced in Section 2.3.1 in terms of the
elements of the contingency table in Figure 3.2. We use the new formulae in the
theoretical developments in this chapter.
χ2 Statistic.

$$\chi^2 = \sum_{i=1}^{n}\sum_{j=1}^{k} \frac{(A_{ij} - E(A_{ij}))^2}{E(A_{ij})}, \qquad E(A_{ij}) = \frac{N_i S_j}{N} \qquad (3.1)$$
Gini Gain.

$$\Delta g = \sum_{i=1}^{n} P[X = x_i] \sum_{j=1}^{k} P[C = c_j \mid X = x_i]^2 - \sum_{j=1}^{k} P[C = c_j]^2 = \frac{1}{N} \sum_{j=1}^{k}\left(\sum_{i=1}^{n}\frac{A_{ij}^2}{N_i} - \frac{S_j^2}{N}\right) \qquad (3.2)$$
Information Gain.

$$IG = \sum_{j=1}^{k}\Phi(P[C = c_j]) + \sum_{i=1}^{n}\Phi(P[X = x_i]) - \sum_{j=1}^{k}\sum_{i=1}^{n}\Phi(P[C = c_j \wedge X = x_i]) = \frac{1}{N}\left(\sum_{j=1}^{k}\sum_{i=1}^{n} A_{ij}\log A_{ij} - \sum_{j=1}^{k} S_j \log S_j - \sum_{i=1}^{n} N_i \log N_i + N \log N\right), \qquad (3.3)$$

where $\Phi(p) = -p \log p$.
Gain Ratio.

$$GR = \frac{IG}{\sum_{i=1}^{n}\Phi(P[X = x_i])} = \frac{IG}{\frac{1}{N}\left(N \log N - \sum_{i=1}^{n} N_i \log N_i\right)} \qquad (3.4)$$
G2 Statistic.

$$G^2 = 2 \cdot N \cdot IG \cdot \log_e 2 \qquad (3.5)$$
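As a companion to Equations 3.1 to 3.5, the following sketch (our own illustrative code, not from the thesis) evaluates the criteria directly from a contingency table; it uses base-2 logarithms with the convention 0 log 0 = 0 and assumes that no row or column of the table is entirely zero:

import numpy as np

def split_criteria(A):
    # A: contingency table, rows = attribute values, columns = class labels
    A = np.asarray(A, dtype=float)
    N = A.sum()
    Ni = A.sum(axis=1)                      # N_i
    Sj = A.sum(axis=0)                      # S_j
    E = np.outer(Ni, Sj) / N                # expected counts N_i S_j / N
    chi2 = ((A - E) ** 2 / E).sum()         # Equation 3.1
    gini = ((A ** 2 / Ni[:, None]).sum() - (Sj ** 2).sum() / N) / N   # Equation 3.2

    def nlogn(v):                           # sum of v log2(v), skipping zero entries
        v = v[v > 0]
        return (v * np.log2(v)).sum()

    ig = (nlogn(A.ravel()) - nlogn(Sj) - nlogn(Ni) + N * np.log2(N)) / N   # Eq. 3.3
    gr = ig / ((N * np.log2(N) - nlogn(Ni)) / N)                           # Eq. 3.4
    g2 = 2.0 * N * ig * np.log(2)                                          # Eq. 3.5
    return chi2, gini, ig, gr, g2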
3.3 Bias in Split Selection
In this section we introduce formally the notion of bias in split variable selection
for the case when there is no correlation between attribute variables and the class
label (i.e., the predictor variables are not predictive of the class label). We then
show that three popular split selection criteria are biased toward attribute variables
with large domains.
3.3.1 A Definition of Bias
In order to study the behavior of the split criteria for the case where there is
no correlation between an attribute variable and the class label we formalize the
following setting:
Null Hypothesis: For every i ∈ {1, .., n}, the random vector (Ai1, . . . , Aik) has
the distribution Multinomial(Ni, p1, . . . , pk).

Intuitively, the Null Hypothesis assumes that for each value of the attribute
variable the distribution of the class label results from pure multi-face coin tossing,
thus the distribution of the class label obeys a multinomial distribution. Since
$\sum_{i=1}^{n} A_{ij} = S_j$, the random vector (S1, . . . , Sk) has the distribution
Multinomial(N, p1, . . . , pk).
We now give a formal definition of the bias. Let s be a split criterion, and let
s(D, X) be the value of s when applied to dataset D and attribute variable X. Usually
the split variable selection method compares the values of the split criterion for two
variables and picks the one with the biggest corresponding value (for the case when
smaller values of the split criterion are preferable, we can use −s as the split criterion).
Now let D be a random dataset whose values are distributed according to the Null
Hypothesis. Thus s(D, X) is now a random variable that has a given distribution
under the Null Hypothesis. Define the probability that split selection method s
chooses attribute variable X1 over X2 as follows:
Ps(X1, X2) = P [s(D, X1) > s(D, X2)] (3.6)
We can now define the bias of the split criterion between X1 and X2 as the loga-
rithmic odds of choosing X1 over X2 as a split variable when neither X1 nor X2 is
correlated with the class label, formally:
$$\mathrm{Bias}(X_1, X_2) = \log_{10}\left(\frac{P_s(X_1, X_2)}{1 - P_s(X_1, X_2)}\right) \qquad (3.7)$$
When the split criterion is unbiased, Bias(X1, X2)= log10(0.5/(1−0.5))=0. The
bias is positive if s prefers X1 over X2 and negative, otherwise. A larger value for
|Bias(X1, X2)| indicates stronger bias; we desire split criteria with values of the
bias as close to 0 as possible. Furthermore, $10^{|\mathrm{Bias}(X_1,X_2)|}$ is the odds of choosing
X1 over X2.
Our notion of bias is inherently statistical in nature. It reflects the intuition
that, under the Null Hypothesis, the split criterion should have no preference for
any attribute variable. There have been several attempts to define the bias in split
variable selection. Quinlan’s Gain Ratio (Quinlan, 1986) was designed to correct
for an anomaly in choosing the split variable that he observed, but as we will show
in Section 3.3.2, the Gain Ratio merely reduces the bias, but it does not remove
it. White and Liu (1994) point out that Quinlan’s definition of the bias is non-
statistical in nature. Their own definition of the bias is based on the equality of
the distributions of the split criterion for different attribute variables. It is harder
to use in practice since it implies a test of the equality of two distributions instead
of two numbers as in our case. Loh and Shih (1997) introduce a notion of bias
whose formalization coincides with our definition.
3.3.2 Experimental Demonstration of the Bias
We performed an extensive experimental study to demonstrate the bias according
to our definition in Section 3.3.1. We generated synthetic training datasets with
two attribute variables and two class labels. We chose n1 = 10 different values for
attribute variable X1 and n2 = 2 different values for attribute variable X2 (results
from experiments with different values for n1 and n2 were qualitatively similar). We
varied N, the size of the training database, between 10 and 1000 records in steps of
40 records, and we varied the value of the prior probability p1 of the first class label
exponentially between 0 and 1/2. Since all split criteria are invariant
to class labels permutations, the graphs depicting the bias are symmetric with
respect to p1 = 1/2; we present here only the part of the graphs with p1 ≤ 1/2.
To estimate Ps(X1, X2), we performed 100000 Monte Carlo trials in which we
generated random training databases distributed according to the Null Hypothesis
(thus the standard error of all our measurements is smaller than 0.0016). Exactly
the same random instances were used for all split criteria considered.
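The Monte Carlo procedure just described can be sketched as follows; this is illustrative code, not the implementation used for the experiments, and it assumes equi-probable attribute values with N divisible by the domain sizes:

import numpy as np

def estimate_bias(criterion, n1, n2, N, p1, trials=100_000, seed=0):
    # Estimate Bias(X1, X2) = log10(P_s / (1 - P_s)) under the Null Hypothesis
    # for two uncorrelated attributes with n1 and n2 values and class prior (p1, 1-p1).
    rng = np.random.default_rng(seed)
    p = np.array([p1, 1.0 - p1])
    wins = 0.0
    for _ in range(trials):
        A1 = rng.multinomial(N // n1, p, size=n1)   # contingency table for X1
        A2 = rng.multinomial(N // n2, p, size=n2)   # contingency table for X2
        s1, s2 = criterion(A1), criterion(A2)
        wins += 1.0 if s1 > s2 else (0.5 if s1 == s2 else 0.0)   # fair coin on ties
    P = wins / trials
    return np.log10(P / (1.0 - P))

For instance, estimate_bias(lambda A: split_criteria(A)[1], 10, 2, 1000, 0.5) would correspond to the Gini gain in the setting above.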
The results of our experiments are shown in Figures 3.3 to 3.7. Figure 3.3
shows the bias of the Gini gain, Figure 3.4 shows the bias of the information gain,
Figure 3.5 shows the bias of Quinlan’s gain ratio, Figure 3.6 shows the bias of
the p-value of the χ2-test according to the χ2 distribution (with n − 1 degrees of
freedom), and Figure 3.7 shows the bias of the p-value of the G2-statistic according
to the χ2 distribution (with n − 1 degrees of freedom). The χ2-distribution with
n − 1 degrees of freedom has to be used since there are 2n entries in the contingency
table with n marginal constraints ($\sum_{j} A_{ij} = N_i$) and the additional constraint that
Sj/N is used as an estimate for pj.
For values of p1 between 10−2 and 1/2 both the Gini gain and the information
gain show a very strong bias – X1 is chosen $10^{1.80} \approx 63$ times more often than X2.
The gain ratio is less biased – X1 is chosen $10^{0.8} \approx 6.3$ times more often than X2
– but the bias is still significant. The χ2 test is basically unbiased in this region
except for really small values of N. The G2 test is unbiased for large values of N
and for p1 close to 1/2, but the bias is noticeable in important border cases that
are relevant in practice (for example, for p1 = 10−2 and N = 1000 the bias has
value 0.20) – thus it is not always unbiased.
For values of p1 between 10−4 and 10−2, the Gini gain, the information gain,
and the gain ratio start having less and less bias. Both the χ2 and the G2 criterion
have a preference toward attribute variables with few values; the bias gets as low
as −0.2 (corresponding to odds of 1.58) when p1N = 1. The maximum negative bias
corresponds to datasets that, on average, have a single data-point with class label
c1. We postpone the explanation of this phenomenon to Section 3.6. The region
where p1 < 10−4 corresponds to training datasets where no record has class label c1
(all records have the same class label). In this case the Gini gain, the information
gain and the gain ratio have value 0, whereas the χ2 and G2 criteria have value 1,
irrespective of the split variable. In our experiments, we tossed a fair coin in the
case that the split criterion returns the same value for variables X1 and X2, thus
the bias is 0.
Figure 3.3: The bias of the Gini gain.
Figure 3.4: The bias of the information gain.
Figure 3.5: The bias of the gain ratio.
Figure 3.6: The bias of the p-value of the χ2-test (using a χ2-distribution).
Figure 3.7: The bias of the p-value of the G2-test (using a χ2-distribution).
One surprising insight from our experiments is that the bias for the Gini gain,
the information gain and the gain ratio does not vanish as N gets arbitrarily large. In
addition, the bias does not seem to have a significant dependency on p1 as long as
all entries in the contingency table for variable X1 are moderately populated (i.e.
Aij > 5).
We obtained similar results for different variable domain sizes. The bias is more
pronounced for bigger differences in the domain sizes of X1 and X2. When the
domain sizes are identical (n1 = n2), the bias is almost nonexistent. These facts
suggest that the size of the domain is the most significant factor that influences
the behavior under the Null Hypothesis. This conclusion, for the Gini gain, is
supported by the theoretical developments in Section 3.5.
The bias for the Gini gain, the information gain and the gain ratio comes
from the fact that under the Null Hypothesis the value of the split criterion is
not exactly zero. The values of s(D, X) monotonically increase with n, the size
of the domain of X, and variables with more values tend to have larger values
of s(D, X) due to the fact that the counts in the contingency table have bigger
statistical fluctuations. The bias is thus due to the inability of traditional split
criteria to account for these normal statistical fluctuations. In the next section,
we will present a technique that allows us to remove the bias from existing split
criteria.
3.4 Correction of the Bias
In this section we present a general method for removing the bias of any arbitrary
split criterion. We will use this result in Section 3.5 to show how the bias of Gini
gain can be corrected.
Let us first give some intuition behind our method. We observed in Section
3.3.2 that the expected value of several split criteria under the Null Hypothesis
depends on the size of the domain of the attribute variables. Assume that the
value of the split criterion for variable X1 (X2) is v1 (v2). Instead of comparing v1
and v2 directly and incurring a biased variable selection, we compute the p-value
p1, the probability that the value of the split criterion is as extreme as v1 under
the Null Hypothesis, and similarly the p-value p2. We then choose the split attribute
variable with the lower p-value, since it is the least likely to be good by chance.
The remainder of this section is devoted to a formal proof that the p-value of any
split criterion is virtually unbiased under the assumption that the Null Hypothesis
holds.
Let X and XH be two identically distributed random variables (i.e., ∀x ∈ Dom(X):
$p_x \stackrel{\mathrm{def}}{=} P[X = x] = P[X_H = x]$), and let Y and YH be two other identically
distributed random variables. Define $C_X(x) \stackrel{\mathrm{def}}{=} 1 - P[X_H \le x] = 1 - \sum_{x' \le x} p_{x'}$,
and similarly define $C_Y(y)$. Let $\Delta \stackrel{\mathrm{def}}{=} \max_x P[X = x] + \max_y P[Y = y]$.

Lemma 1. Let X and Y be two independent discrete random variables. Then for all γ ∈ [0, 1]:

$$P[C_X(X) < C_Y(Y)] + \gamma P[C_X(X) = C_Y(Y)] \in \left(\tfrac{1}{2} - \Delta,\ \tfrac{1}{2} + \Delta\right).$$
Proof:

$$P = P[C_X(X) < C_Y(Y)] = \sum_{x}\sum_{y} I\big(C_X(x) < C_Y(y)\big)\, P[X = x \wedge Y = y] = \sum_{x}\sum_{y} I\!\left(\sum_{x' \le x} p_{x'} > \sum_{y' \le y} p_{y'}\right) p_x p_y \qquad (3.8)$$

where I(·) is the indicator function.

For a fixed x, let $y_x$ be the biggest value of $y \in \mathrm{Dom}(Y)$ such that $\sum_{x' \le x} p_{x'} > \sum_{y' \le y} p_{y'}$ still holds. Equation 3.8 then can be rewritten as follows:

$$P = \sum_{x} p_x \sum_{y \le y_x} p_y \qquad (3.9)$$

On the other hand, using the definition of $y_x$ we have:

$$\sum_{x' \le x} p_{x'} - \sum_{y \le y_x} p_y > 0, \quad \text{and} \qquad (3.10)$$

$$\sum_{x' \le x} p_{x'} - \sum_{y \le y_x} p_y \le p_{y_x^{+}} \le \max_y p_y \qquad (3.11)$$

where $y_x^{+}$ is the smallest $y \in \mathrm{Dom}(Y)$ such that $y > y_x$. The previous two inequalities imply:

$$\sum_{x' \le x} p_{x'} - \max_y p_y \le \sum_{y \le y_x} p_y < \sum_{x' \le x} p_{x'} \qquad (3.12)$$

Multiplying by $p_x$, summing up over x and using the result of Equation 3.9 we obtain:

$$\sum_{x} p_x \sum_{x' \le x} p_{x'} - \max_y p_y \le P < \sum_{x} p_x \sum_{x' \le x} p_{x'} \qquad (3.13)$$

To further simplify, let X′ be a random variable with the same distribution as X. We then obtain:

$$\sum_{x} p_x \sum_{x' \le x} p_{x'} = \sum_{x}\sum_{x'} I(x' \le x)\, P[X' = x' \wedge X = x] = P[X' \le X] = \frac{1}{2} - \frac{1}{2} P[X' = X] = \frac{1}{2} - \frac{1}{2}\sum_{x} p_x^2 \in \left(\frac{1}{2} - \frac{1}{2}\max_x p_x,\ \frac{1}{2} - \frac{1}{2}\min_x p_x\right) \qquad (3.14)$$

Using Equations 3.13 and 3.14 we get:

$$\frac{1}{2} - \max_x p_x - \max_y p_y < P < \frac{1}{2} - \frac{1}{2}\min_x p_x \qquad (3.15)$$

If the roles of x and y are switched we obtain:

$$\frac{1}{2} - \Delta < P[C_X(X) > C_Y(Y)] < \frac{1}{2} - \frac{1}{2}\min_y p_y, \qquad (3.16)$$

which implies:

$$\frac{1}{2} + \frac{1}{2}\min_y p_y < P[C_X(X) \le C_Y(Y)] < \frac{1}{2} + \Delta, \qquad (3.17)$$

thus

$$\frac{1}{2} - \Delta < P + \gamma P[C_X(X) = C_Y(Y)] < \frac{1}{2} + \Delta \qquad (3.18)$$
According to Lemma 1, if the p-value of a criterion is used to decide the split
variable, the probability of choosing one variable over another is not farther than
∆ from 1/2. In practice, even for small sizes of the dataset, any split criterion has
a huge number of possible values and the probability of the criterion taking on
any particular value is much smaller than 1/2, thus ∆ ≈ 0, and the p-value is a
virtually unbiased split criterion. Hence, we can guarantee that the p-value of any
split criterion s is unbiased under the Null Hypothesis, as long as s does not take
on any single value with a significant probability.
Using the above fact, a general method to remove the bias in split variable
selection consists of two steps. First, we compute the value v of the original split
criteria s on the given dataset. Second, we compute the p-value of v under the
Null Hypothesis and we select the variable with the smallest p-value as the split
variable.
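A minimal sketch of this two-step correction (illustrative names of our own; pvalue_fn stands for whichever of the four approaches discussed below is used to obtain the p-value):

def select_split_variable(tables, criterion, pvalue_fn):
    # tables: one contingency table per candidate attribute variable
    best_attr, best_p = None, 2.0
    for attr, A in tables.items():
        v = criterion(A)        # step 1: value of the original split criterion
        p = pvalue_fn(v, A)     # step 2: its p-value under the Null Hypothesis
        if p < best_p:
            best_attr, best_p = attr, p
    return best_attr, best_p    # the attribute with the smallest p-value wins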
The above method requires the computation of the p-value of a given criterion.
We can distinguish four ways in which this can be accomplished:
• Exact computation. Use the exact distribution of the split criterion. The
main drawback is that this is almost always very expensive; it is reasonably
efficient only for n = 2 and k = 2 (Martin, 1997).
• Bootstrapping. Use Monte Carlo simulations with random instances gen-
erated according to the Null Hypothesis. This method was used by Frank
and Witten (1998) to correct the bias; its main drawback is the high cost of
the Monte Carlo simulations.
• Asymptotic approximation. Use an asymptotic approximation of the dis-
tribution of the split criterion (e.g., use the χ2-distribution to approximate
the χ2-test (Kass, 1980) and the G2-statistic (Mingers, 1987)). Approxi-
mations often work well in practice, but they can be inaccurate for border
conditions (e.g., small entries in the contingency table).
• Tight approximation. Use a tight approximation of the distribution of
the criterion with a nice distribution. While conceptually superior to the
previous three methods, such tight approximations might be hard to find.
3.5 A Tight Approximation of the distribution of Gini
Gain
In this section we give a tight approximation of the distribution of the Gini gain,
and we use our approximation in combination with Lemma 1 to compose a new
unbiased split criterion.
Note that the p-value of the Gini gain can be well approximated if the cumu-
lative distribution function (c.d.f) of the distribution of the Gini gain can be well
approximated (since the p-value=1-c.d.f.).
We experimentally observed by looking at the shape of the probability distri-
bution function of the Gini gain that it is very close to the shape of distributions
from the Gamma family. Our experiments, reported later in this section, show
that the Gamma distribution – using the expected value and variance of the Gini
gain as distribution parameters (which completely specify a Gamma distribution)
– is a very good approximation of the distribution of the Gini gain. In the re-
mainder of this section, we will show how to compute exactly the expected value
and the variance of the Gini gain under the Null Hypothesis. We then use these
values for a tight approximation of the Gini gain with parametric distributions,
the best of which is the approximation with the Gamma distribution.
3.5.1 Computation of the Expected Value and Variance of
Gini Gain
As mentioned in Section 3.2, the contingency table described in Figure 3.2 contains
the sufficient statistics for the computation of the Gini gain. Thus in order to
analyze the distribution of the Gini gain, it is sufficient to look at the distribution
of the entries in the contingency table. Consider a given fixed set of parameters N,
n, k, Ni, i ∈ {1, .., n}, and pj, j ∈ {1, .., k}. If the Null Hypothesis holds, the Aij's
and Sj’s are random variables with multinomial distributions (see Section 3.3).
Using the definition of the Gini gain (Equation 3.2), linearity of expectation, the
fact that the Aij’s and Sj’s have multinomial distributions, and the normalization
constraint on the pj’s, we get the following formula for the expectation E(∆g) of
the Gini gain under the Null Hypothesis:
$$E(\Delta g) = \frac{1}{N}\sum_{j=1}^{k}\left(\sum_{i=1}^{n}\frac{E(A_{ij}^2)}{N_i} - \frac{E(S_j^2)}{N}\right) = \frac{1}{N}\sum_{j=1}^{k}\left(\sum_{i=1}^{n}\frac{N_i p_j(1 - p_j + N_i p_j)}{N_i} - \frac{N p_j(1 - p_j + N p_j)}{N}\right) = \frac{1}{N}\sum_{j=1}^{k}\Big(n p_j(1-p_j) + N p_j^2 - p_j(1-p_j) - N p_j^2\Big) = \frac{n-1}{N}\left(1 - \sum_{j=1}^{k} p_j^2\right) \qquad (3.19)$$
so the expected value of the Gini gain is indeed linear in n as observed by White
and Liu (1994).
Computation of the variance Var(∆g) of the Gini gain results in the following
formula (see Appendix B for the proof):

$$\mathrm{Var}(\Delta g) = \frac{1}{N^2}\left[(n-1)\left(2\sum_{j=1}^{k} p_j^2 + 2\Big(\sum_{j=1}^{k} p_j^2\Big)^2 - 4\sum_{j=1}^{k} p_j^3\right) + \left(\sum_{i=1}^{n}\frac{1}{N_i} - \frac{2n}{N} + \frac{1}{N}\right)\left(-2\sum_{j=1}^{k} p_j^2 - 6\Big(\sum_{j=1}^{k} p_j^2\Big)^2 + 8\sum_{j=1}^{k} p_j^3\right)\right] \qquad (3.20)$$
For the two class label problem, the expected value and variance take the simpler form:

$$E(\Delta g) = \frac{2(n-1)\, p_1(1-p_1)}{N} \qquad (3.21)$$

$$\mathrm{Var}(\Delta g) = \frac{4 p_1(1-p_1)}{N^2}\left[(1 - 6p_1 + 6p_1^2)\left(\sum_{i=1}^{n}\frac{1}{N_i} - \frac{2n}{N} + \frac{1}{N}\right) + 2(n-1)\, p_1(1-p_1)\right] \qquad (3.22)$$
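A short sketch implementing Equations 3.19 and 3.20 (illustrative code with our own names; in practice the pj's would be the maximum likelihood estimates Sj/N):

import numpy as np

def gini_null_moments(Ni, pj):
    # Exact mean and variance of the Gini gain under the Null Hypothesis.
    Ni = np.asarray(Ni, dtype=float)
    pj = np.asarray(pj, dtype=float)
    N, n = Ni.sum(), len(Ni)
    s2, s3 = (pj ** 2).sum(), (pj ** 3).sum()
    mean = (n - 1) / N * (1.0 - s2)                              # Equation 3.19
    a = (1.0 / Ni).sum() - 2.0 * n / N + 1.0 / N
    var = ((n - 1) * (2 * s2 + 2 * s2 ** 2 - 4 * s3)
           + a * (-2 * s2 - 6 * s2 ** 2 + 8 * s3)) / N ** 2      # Equation 3.20
    return mean, var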
3.5.2 Approximating the Distribution of Gini Gain with
Parametric Distributions
Now that we have exact values of the expected value and variance of the Gini gain,
we can approximate its distribution with any two-parameter parametric distribu-
tion simply by requiring their expected values and variances to be the same. The
quality of the approximation will depend on how well the shape of the parametric
distribution follows the shape of the distribution of the Gini gain. To compute
an estimate of the p-value of the distribution of the Gini gain we simply use the
parametric approximation with appropriate values for the parameters. We consider
here three families of parametric distributions: normal, gamma and beta. For each
of them we listed, in Table 3.1, the formulae for their p-values as a function of the
expected value and variance. By replacing µ with the exact expected value of
Gini gain computed with formula 3.19, and σ2 with the exact variance of Gini
gain computed with formula 3.20, and substituting in the equations in Table 3.1
estimates of the p-value of the distribution of Gini gain are obtained.
The p-value of the Gini gain, the corrected criterion, depends only on n, N ,
and the Ni's and pj's since µ and σ2 depend only on these quantities. Because
the true value of the pj is unknown, we suggest using the usual maximum likeli-
hood estimate pj = Sj,e/N where Sj,e is the number of data-points in the training
database with class label cj.
Table 3.1: P-values at point x for parametric distributions as a function of the expected value, µ, and the variance, σ2.

  Normal, N(µ, σ2): parameters µ and σ; moment generating function $e^{\mu t + \sigma^2 t^2/2}$; p-value $\frac{1}{2}\left(1 - \Phi\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right)$, where Φ(x) is the error function.
  Gamma, Γ(γ, θ): parameters $\gamma = \mu^2/\sigma^2$ and $\theta = \sigma^2/\mu$; moment generating function $(1 - \theta t)^{-\gamma}$; p-value $1 - Q(\gamma, x/\theta)$, where Q(x, y) is the regularized incomplete gamma function.
  Beta, B(α, β): parameters $\alpha = -\mu(-\mu + \mu^2 + \sigma^2)/\sigma^2$ and $\beta = (\mu - 2\mu^2 + \mu^3 - \sigma^2 + \mu\sigma^2)/\sigma^2$; no closed-form moment generating function; p-value $1 - I(x; \alpha, \beta)$, where I(x; a, b) is the regularized incomplete beta function.
To determine the quality of the approximation of the distribution of the Gini
gain with normal and gamma distributions we performed two kinds of empirical
tests, each of which tests a different aspect of the approximation. The first test
compares the empirical moments of the Gini gain with the moments of the gamma,
beta and normal distribution with parameters determined as explained before. If
moments are close, the moment generating functions are close, which is proof that
the approximation is good both in the body and in the tail (Shao, 1999). The sec-
ond test compares empirical values of the p-value with values computed using the
three approximations. This test provides direct evidence that the approximation
of the p-value is good in the body and the beginning of the tail, but says nothing about
the end of the tail, since large numbers of Monte Carlo trials are required to get data
in the tail. In what follows, we give more details and experimental results for these
two tests.
Quality of the Approximation of Moments
Indirect evidence that two distributions approximate each other well – under mild
conditions, which are satisfied by any distribution with finite moments, as is the
case here since the values the Gini gain can take are between 0 and 1 – is provided
by the fact that their moments are close to each other. This is because the moment
generating function Ψ(t) – under these mild conditions – uniquely determines the
distribution, and the moments are the derivatives of first and higher order of Ψ(t)
at 0:
$$E[X^n] = \left.\frac{\partial^n \Psi(t)}{\partial t^n}\right|_{t=0}$$
Thus if the moments are close the moment generating functions are close, which
in turn implies that the distributions are close. For both the normal and gamma
distributions the moments can be computed analytically from their moment gener-
ating functions Ψ(t), depicted in Table 3.1. For the Beta distribution the moments
can be shown to be (Papoulis, 1991):
$$E[X^n] = \frac{\Gamma(\alpha + \beta)\,\Gamma(\alpha + n)}{\Gamma(\alpha + \beta + n)\,\Gamma(\alpha)}$$
To compare the moments of the true distribution of the Gini gain with the
moments of the approximations we ran experiments consisting of 1000000 Monte
Carlo trials for various values of the parameters N , n and p1. In all the experiments
we are reporting here there are only two class labels and the values of the attribute
variable are equi-probable.
The results for N = 100 and various values for n and p1 are reported in
Tables 3.2, 3.3 and 3.4. The numbers on the columns “Gamma-T”, “Beta-
T”, and “Normal-T” correspond to the ratio error of the approximation with the
gamma, beta and normal approximations, respectively, and parameters determined
with the theoretical formulae for E [∆g()] and Var (∆g()). The numbers on the
“-E” columns are obtained using the approximations with parameters determined
from the experimental values of the expected value and variance of Gini gain. All
the numbers on columns other than “Moment” or “Exp. Value” are the ratio error
with respect to the experimental moment. The ratio error is defined here to be
the ratio of the approximated value over the true value, if the true value is smaller,
and the negative of the ratio of the true value over the approximation, otherwise.
Thus, the sign indicates which value is larger and the magnitude the error of the
approximation. Numbers very close to 1 or −1 indicate good approximations.
There are a number of things to notice from these experimental results:
1. The approximations using the theoretical values of the expected value and
variance of Gini gain are as good as the approximations using the experi-
mental values. This is to be expected since the formulae for expected value
and variance are exact and a large number of experiments (1000000) are
performed so the experimental error is very small.
2. The gamma approximation is slightly better than the beta approximation
and much better than the normal approximation.
3. When Np1 is large, the gamma and beta approximations are reasonably
good, even for higher moments but the approximation completely breaks
down when Np1 ≈ 1, since the distribution is very discrete (see Section 3.6).
We essentially observed the same behavior when the values of the attribute variable
are not equi-probable.
The approximation with the beta distribution is not possible for all values of
the quantities n, N , Ni and pj since the corresponding parameters are not positive,
as required by the definition of the distribution. As we have seen, for the case when
the parameters are positive, the quality of the approximation is slightly worse or
comparable to the quality of the approximation with the gamma distribution.
As we have seen from experimental data on the approximation of the moments,
the normal approximation is significantly worse than the gamma approximation.
For this reason and because the approximation with beta distribution is not always
possible, for the rest of the chapter we focus our attention exclusively on the gamma
approximation.
Table 3.2: Experimental moments and predictions of moments for N = 100, n = 2, p1 = .5, obtained by Monte Carlo simulation with 1000000 repetitions. -T are theoretical approximations, -E are experimental approximations. [Columns: Moment, Exp. Value, Gamma-T, Gamma-E, Beta-T, Beta-E, Normal-T, Normal-E; rows: the first ten raw moments E[X^m] and central moments E[(X − E[X])^m] of the Gini gain, reported as ratio errors of each approximation.]
Table 3.3: Experimental moments and predictions of moments for N = 100, n = 10, p1 = .5, obtained by Monte Carlo simulation with 1000000 repetitions. -T are theoretical approximations, -E are experimental approximations. [Same column and row layout as Table 3.2.]
Table 3.4: Experimental moments and predictions of moments for N = 100, n = 2, p1 = .01, obtained by Monte Carlo simulation with 1000000 repetitions. -T are theoretical approximations, -E are experimental approximations. [Same column and row layout as Table 3.2.]
Quality of the Approximation of P-value
To assess the quality of the approximation of the p-value of the true distribution of
the Gini gain we used the same samples (ran the same experiments) that were used
to compute the information in Tables 3.2, 3.3 and 3.4, and we obtained the
results in Figures 3.8, 3.9 and 3.10, respectively.
As can be seen, for large Np1 the approximation is reasonably good, but when
Np1 ≈ 1 the approximation breaks down completely since the distribution is very
discrete (see Section 3.6). We observed essentially the same trends for different
values of the parameters.
All these experiments suggest that the Gamma approximation behaves well in
practice. Thus, the formula we propose to compute the p-value of the Gini gain,
based on the gamma approximation, is:

$$\text{p-value}(\Delta g_e) = 1 - Q\!\left(\frac{E(\Delta g)^2}{\mathrm{Var}(\Delta g)},\ \frac{\Delta g_e\, E(\Delta g)}{\mathrm{Var}(\Delta g)}\right), \qquad (3.23)$$

where $\Delta g_e$ is the actual value of the Gini gain computed on the given dataset and
Q(x, y) = Γ(x, y)/Γ(x) is the regularized incomplete gamma function.
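Equation 3.23 amounts to the survival function of a gamma distribution with shape E(∆g)²/Var(∆g) and scale Var(∆g)/E(∆g) evaluated at the observed Gini gain. A hedged sketch using SciPy (not the software used for the experiments reported here; names are our own) is:

from scipy.stats import gamma

def gini_gamma_pvalue(gini_value, mean, var):
    # mean, var: the exact moments from Equations 3.19 and 3.20
    shape, scale = mean ** 2 / var, var / mean
    return gamma.sf(gini_value, a=shape, scale=scale)   # P[Gamma >= observed value]

Combined with the moments computed above, the corrected criterion simply selects the attribute with the smallest such p-value.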
Figure 3.8: Experimental p-value of Gini gain with one standard deviation error bars against p-value of theoretical gamma approximation for N = 100, n = 2 and p1 = .5.
Figure 3.9: Experimental p-value of Gini gain with one standard deviation error bars against p-value of theoretical gamma approximation for N = 100, n = 10 and p1 = .5.
Figure 3.10: Experimental p-value of Gini gain with one standard deviation error bars against p-value of theoretical gamma approximation for N = 100, n = 2 and p1 = .01.
Explanation of the Bias of the Gini Gain
Now that we have a tight theoretical approximation of the distribution of the Gini
gain at the Null Hypothesis, we can provide an explanation for the existence of the
bias. In Figure 3.11 we depicted the probability density function of the Gini gain,
as approximated by the gamma distribution, for two attribute variables, X1 with
size of its domain 2, and X2 with size of its domain 10. The number of data-points
is 100 and each of the two classes and the values of the attribute variables are
equi-probable. Notice that the distribution of the Gini gain for X1 is much closer
to 0 than the distribution for X2, which explains why it is far more probable for
X2 to be chosen as the split attribute as we previously observed. Using the p-value
of the Gini gain maps both these distributions approximately onto the uniform
distribution on the interval [0, 1], thus the probability of choosing either of the
two attribute variables as the split attribute is the same.
Practical Considerations
Note that there is a very important numerical precision problem associated with
the formula for the p-value, Equation 3.23. Even for moderate correlation between an
attribute variable and the class label, the value of the second term in Equation 3.23
approaches 1 very rapidly (by far exceeding the precision of the processor). Thus
the computed value of the p-value is 0 in this case, seemingly limiting the usefulness
of our criterion for the case that correlations between an attribute variable and
the class label are present. This “non-discrimination” anomaly was also observed
by Kononenko (1995).
Figure 3.11: Probability density function of Gini gain for attribute variables X1 and X2.
For our criterion, we can avoid this problem by directly computing the log-
arithm of the p-value using a series expansion.3 In this manner, values of the
logarithm of the p-value (which can be used instead of the p-value since the log-
arithm is a monotonically increasing function) can be computed accurately even
for datasets with millions of records and very strong correlations.
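As an illustration, a modern numerical library can return this logarithm directly; the sketch below uses SciPy's log survival function rather than the series expansion and the ANA implementation referenced in the footnotes (this is our own assumption, not the setup used in the thesis):

from scipy.stats import gamma

def gini_gamma_log_pvalue(gini_value, mean, var):
    # Logarithm of the gamma-approximation p-value; stays finite even when
    # the p-value itself underflows to zero for strong correlations.
    return gamma.logsf(gini_value, a=mean ** 2 / var, scale=var / mean)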
The computational complexity of our new criterion is O(n + k) since we have
to compute the sum of inverses of the Ni’s and pj’s; all other factors can be
computed in time O(1), including the logarithm of the incomplete regularized
gamma function. Thus our new criterion can be computed efficiently in practice.4
3.6 Experimental Evaluation
In this section we will show experimental evidence that our theoretical corrections
behave well in practice. To evaluate the bias of the gamma correction of the Gini
gain we repeated the experiment from Section 3.3. The bias of our correction
of the Gini gain as a function of N and p1 is depicted in Figure 3.12. As can
be observed by comparing Figures 3.6 and 3.12, the bias for the corrected Gini
gain and the χ2-test are practically identical for all values of p1 and N . Also, for
p1 between 10−4 and 10−2 all the statistical methods are biased toward attribute
variables with small n in precisely the same way. As mentioned in Section 3.3, the
most extreme bias is obtained for p1 = 1/N . In this case the probability to see
exactly one data-point with class label c1 is $N\cdot\frac{1}{N}\left(1-\frac{1}{N}\right)^{N-1} \approx e^{-1}$.

3 We used the implementation of the incomplete gamma function in the Statistics package ANA (Shine & Strous, 2001).
4 On a Pentium III 933MHz the computation of the incomplete regularized gamma function takes 155µs. This is also the time to compute the contingency table for 14000 samples in the most favorable case (one attribute variable and highly optimized code for this special case).

The margin ∆ from the Lemma in Section 3.4 is at least $2e^{-1} \approx 0.73$, which means that the
exact correction (using the exact distribution of the split criteria) can have any
bias. Thus around p1 = 1/N we cannot expect any of the statistical methods to be
perfectly unbiased. Moreover, since in this situation the distribution is extremely
discrete, we cannot expect any approximation of the distribution with a continuous
distribution like gamma to be reasonable. This explains the experimental results
in Section 3.5 where we have seen that the p-value of the distribution of Gini gain
is grossly underestimated.
Note that for small entries in the contingency table the χ2-distribution is a
poor approximation of the χ2-test.5 In the case that a predictor variable is not
correlated with the class label, the overestimation of the variance does not seem
to matter (but this might not be the case when correlations are present).
To summarize our experiments, the gamma correction of the Gini gain and the
χ2 criterion have very good behavior under the Null Hypothesis. The G2 criterion
behaves well if the class labels are almost equiprobable, but some bias is present if this
is not the case. The Gini gain, the information gain, and the gain ratio have
significant biases toward variables with more values.
5 We observed that for this case the expected value according to the χ2-distribution is correct, but the variance is overestimated. See Appendix B.0.4 for a proof.
Figure 3.12: Bias of the p-value of the Gini gain using the gamma correction.
3.7 Discussion
This chapter addresses the fundamental problem of bias in split variable selection
in classification tree construction. Our contribution is (1) a general method to
provably remove the bias introduced by categorical variables with large domains
and (2) the application of our method to the removal of the bias for the Gini
gain.
Previous work for some split criteria suggests that removal of the bias by the
usage of p-values improves the quality of the split when correlations are weak and
at the same time preserves the good behavior for strong correlations (Mingers,
1987; Frank & Witten, 1998). This suggests that bias removal in general might be
useful in practice.
Chapter 4
Scalable Linear Regression Trees
There is considerable interest in developing regression models for large datasets that are both accurate
and easy to interpret. Regressors that have these properties are regression trees
with linear models in the leaves, but the algorithms already proposed for construct-
ing them are not scalable due to the fact that they require a large number of linear
systems to be formed and solved. In this chapter we propose a novel regression
tree construction algorithm that is both accurate and can truly scale to very large
datasets. The main idea is, for every intermediate node, to use the EM algorithm
for Gaussian mixtures to find two clusters in the data and to locally transform the
regression problem into a classification problem based on closeness to these clus-
ters. Goodness of split measures, like the Gini gain, can then be used to determine
the split variable and point much like in classification tree construction. Scalability
of the algorithm can be enhanced by employing scalable versions of the EM and
the classification tree construction algorithms. Tests on real and artificial data
show that the proposed algorithm has competitive accuracy but requires orders of
magnitude less computation time for large datasets.
4.1 Introduction
Even though regression trees were introduced early in the development of clas-
sification trees (CART, Breiman et al. (1984)), regression trees received far less
attention from the research community. Quinlan (1992) generalized the regression
trees in CART by using a linear model in the leaves to improve the accuracy of
the prediction. The impurity measure used to choose the split variable and the
split point was the standard deviation of the predictor for the training examples
at the node. Karalic (1992) argued that the mean square error of the linear model
in a node is a more appropriate impurity measure for the linear regression trees
since data well predicted by a linear model can have large variance. This is a
crucial observation since evaluating the variance is much easier than estimating
the error of a linear model (which requires solving a linear system). Even more,
if discrete attributes are present among the predictor attributes and binary trees
are built (as is the case in CART), the problem of finding the best split attribute
becomes intractable for linear regression trees since the theorem that justifies a
linear algorithm for finding the best split (Theorem 9.6 in (Breiman et al., 1984),
see Section 2.3.1) does not seem to apply. To address computational concerns of
normal linear regression models, Alexander and Grimshaw (1996) proposed the
use of simple linear regressors (i.e., the linear model depends on only one predictor
attribute), which can be trained more efficiently but are not as accurate.
Torgo proposed the use of even more sophisticated functional models in the
leaves (i.e., kernel regressors) (Torgo, 1997b; Torgo, 1997a). For such regression
trees both construction and deployment of the model is expensive but they po-
tentially are superior to the linear regression trees in terms of accuracy. More
recently, Li et al. (2000) proposed a linear regression tree algorithm that can
produce oblique splits1 using Principal Hessian Analysis but the algorithm cannot
accommodate discrete attributes.
There are a number of contributions to regression tree construction coming from
the statistics community. Chaudhuri et al. (1994) proposed the use of statistical
tests for split variable selection instead of error of fit methods. The main idea
is to fit a model (constant, linear or higher order polynomial) for every node in
the tree and to partition the data at each node into two classes: data-points with
positive residuals2 and data-points with negative residuals. In this manner the
regression problem is locally reduced to a classification problem, so it becomes
much simpler. Statistical tests used in classification tree construction, Student’s
t-test in this case, can be used from this point on. Unfortunately, it is not clear
why differences in the distributions of the signs of the residuals are good criteria
on which decisions about splits are made. A further enhancement was proposed
recently by Loh (2002). It consists mostly in the use of the χ2-test instead of the
t-test in order to accommodate discrete attributes, the detection of interactions of
pairs of predictor attributes, and a sophisticated calibration mechanism to ensure
the unbiasedness of the split attribute selection criterion.
In this chapter we introduce SECRET (Scalable EM and Classification based
Regression Trees), a new construction algorithm for regression trees with linear
models in the leaves, which produces regression trees with accuracy comparable
to the ones produced by existing algorithms and, at the same time, requiring far
less computational effort on large datasets. Our experiments show that SECRET
improves the running time of regression tree construction by up to two orders of
1 Oblique splits are linear inequalities involving two or more predictor attributes, of the form $a_1 X_1 + \cdots + a_k X_k > c$.
2 Residuals are the differences between the true values and the values predicted by the regression model.
magnitude when compared to previous work while at the same time constructing
trees of comparable quality. Our main idea is to use the EM algorithm on the data
partition in an intermediate node to determine two Gaussian clusters, hopefully
with shapes close to flat disks. We then use these two Gaussian clusters to locally
transform the regression problem into a classification problem by labeling every
data-point with class label 1 if the probability of belonging to the first cluster exceeds
the probability of belonging to the second cluster, or with class label 2 if the converse is
true. A split attribute and a corresponding split point to separate the two classes
can be determined then using goodness of split measures for classification trees
like the Gini gain (Breiman et al., 1984). Least square linear regression can be
used to determine the linear regressors in the leaves.
The local reduction to a classification problem allows us to avoid forming and
solving the large number of linear systems of equations required by an exhaustive
search method such as the method used by RETIS (Karalic, 1992). Even more,
scalable versions of the EM algorithm for Gaussian mixtures (Bradley et al., 1998)
and classification tree construction (Gehrke et al., 1998) can be used to improve
the scalability of the proposed solution. An extra benefit of the method is the fact
that good oblique splits can be easily obtained.
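To make the local reduction concrete, the following sketch (our own illustrative code, not the thesis's implementation) uses scikit-learn's GaussianMixture as a stand-in for the EM step and then searches for the binary split of one continuous attribute with the largest Gini gain of the induced two-class problem; the quadratic threshold scan is kept simple for clarity.

import numpy as np
from sklearn.mixture import GaussianMixture

def secret_split_labels(X_regress, y, seed=0):
    # Fit two Gaussian clusters in (regressor attributes, output) space and
    # label every data-point with the more probable cluster (0 or 1).
    Z = np.column_stack([X_regress, y])
    gm = GaussianMixture(n_components=2, covariance_type='full', random_state=seed).fit(Z)
    return gm.predict(Z)

def gini_impurity(labels):
    p = np.bincount(labels) / len(labels)
    return 1.0 - (p ** 2).sum()

def best_split_point(x_split, labels):
    # Exhaustive search over thresholds on one continuous split attribute.
    order = np.argsort(x_split)
    x, c = x_split[order], labels[order]
    parent, best_gain, best_thr = gini_impurity(c), -np.inf, None
    for t in range(1, len(x)):
        if x[t] == x[t - 1]:
            continue
        gain = parent - (t * gini_impurity(c[:t])
                         + (len(x) - t) * gini_impurity(c[t:])) / len(x)
        if gain > best_gain:
            best_gain, best_thr = gain, (x[t - 1] + x[t]) / 2.0
    return best_thr, best_gain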
The rest of the chapter is organized as follows. In Section 4.2 we give a short
introduction to the EM algorithm for Gaussian mixtures. In Section 4.3 we present in
greater detail some of the previously proposed solutions and we comment on their
shortcomings. Section 4.4 contains the description of SECRET, our proposal for a
linear regression tree algorithm. We then show results of an extensive experimental
study of SECRET in Section 4.5.
4.2 Preliminaries: EM Algorithm for Gaussian Mixtures
In this section we discuss a particular solution to the problem of approximating
some unknown distribution, from which a sample is available, with a mixture of
Gaussian distributions – the EM algorithm for Gaussian mixtures.
The EM algorithm (Dempster, Laird & Rubin, 1977) is a very general
method that can be used to determine parameters of models with hidden variables.
Here, we will be concerned only with its application to determining a mixture of
Gaussian distributions that best approximate an unknown distribution from which
samples are available. Our introduction follows, in large, the excellent tutorial of
Bilmes (1997) where details and complete proofs of the EM algorithm for Gaussian
mixtures can be found.
The Gaussian mixture density estimation problem is the following: find the
most likely values of the parameter set Θ = (α1, . . . , αM , µ1, . . . , µM , Σ1, . . . , ΣM)
of the probabilistic model:
$$p(\mathbf{x}, \Theta) = \sum_{i=1}^{M} \alpha_i\, p_i(\mathbf{x} \mid \mu_i, \Sigma_i) \qquad (4.1)$$

$$p_i(\mathbf{x} \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_i|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{x}-\mu_i)^T \Sigma_i^{-1} (\mathbf{x}-\mu_i)} \qquad (4.2)$$
given sample X = (x1, . . . ,xN) (training data). In the above formulae pi is the
density of the Gaussian distribution with mean µi and covariance matrix Σi. αi
is the weight of the component i of the mixture, M is the number of mixture
components or clusters and is fixed and given, and d is the dimensionality of the
space.
The EM algorithm for estimating the parameters of the Gaussian components
proceeds by repeatedly applying the following two steps until the termination con-
dition is satisfied:
Expectation (E step):

$$h_{ij} = \frac{\alpha_i\, p_i(\mathbf{x}_j \mid \mu_i, \Sigma_i)}{\sum_{k=1}^{M} \alpha_k\, p_k(\mathbf{x}_j \mid \mu_k, \Sigma_k)} \qquad (4.3)$$

Maximization (M step):

$$\alpha_i = \frac{1}{N}\sum_{j=1}^{N} h_{ij} \qquad (4.4)$$

$$\mu_i = \frac{\sum_{j=1}^{N} h_{ij}\, \mathbf{x}_j}{\sum_{j=1}^{N} h_{ij}} \qquad (4.5)$$

$$\Sigma_i = \frac{\sum_{j=1}^{N} h_{ij}\, (\mathbf{x}_j - \mu_i)(\mathbf{x}_j - \mu_i)^T}{\sum_{j=1}^{N} h_{ij}} \qquad (4.6)$$
In the above formulae, hij are the hidden parameters and they can be interpreted
as the probability that the point xj belongs to the i-th component.
The termination condition is usually specified either by a maximum number of
steps or as a minimum average movement between consecutive steps of the centers
of the Gaussian distributions. In our work we use exclusively the former since it
is simpler and works well in practice.
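A minimal illustration of Equations 4.3 to 4.6 with a fixed number of iterations might look as follows; the initialization, the small ridge added to the covariances, and all names are our own choices, and for large datasets the scalable EM variants cited in this chapter would be used instead.

import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(X, M=2, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    alpha = np.full(M, 1.0 / M)
    mu = X[rng.choice(N, M, replace=False)]                  # means start at random points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(M)])
    for _ in range(iters):
        # E step (Eq. 4.3): h[j, i] = responsibility of component i for point x_j
        dens = np.column_stack([alpha[i] * multivariate_normal.pdf(X, mu[i], Sigma[i])
                                for i in range(M)])
        h = dens / dens.sum(axis=1, keepdims=True)
        # M step (Eqs. 4.4-4.6): re-estimate weights, means and covariances
        w = h.sum(axis=0)
        alpha = w / N
        mu = (h.T @ X) / w[:, None]
        for i in range(M):
            diff = X - mu[i]
            Sigma[i] = (h[:, i, None] * diff).T @ diff / w[i] + 1e-6 * np.eye(d)
    return alpha, mu, Sigma, h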
4.3 Previous solutions to linear regression tree construc-
tion
In this section we analyze some of the previously proposed construction algorithms
for linear regression trees and, for each, we point out its major drawbacks.
4.3.1 Quinlan’s construction algorithm
For efficiency reasons, the algorithm proposed by Quinlan (1992) proceeds as if
a regression tree with constant models in the leaves were being constructed, until the tree
is fully grown, when linear models are fit on the data-points available at each
leaf. This is equivalent to using the split selection criterion in Equation 2.12
during the growing phase. Then linear regressors in the leaves are constructed by
performing another pass over the data in which the set of data-points from the
training examples corresponding to each of the leaves is determined and the least
square linear problem for these data-points is formed and solved (using the SVD
decomposition (Golub & Loan, 1996)).
The same approach was later used by Torgo (1997a; 1997b) with more com-
plicated models in the leaves like kernels and local polynomials.
As pointed out by Karalic (1992) the variance of the output variable is a poor
estimator of the impurity of the fit when linear regressors are used in the leaves
since the points can be arranged along a line (so the error of the linear fit is almost
zero) but they occupy a significant region (so the variance is large). To correct
this problem, he suggested that the following impurity function should be used:
$$\mathrm{Err}_l(T) \stackrel{\mathrm{def}}{=} E\left[(Y - f(X))^2 \,\middle|\, T\right] \qquad (4.7)$$

where $f(\mathbf{x}) = [1\ \ \mathbf{x}^T]\,\mathbf{c}$ is the linear regressor with the smallest least square error.
It is easy to see (see for example (Golub & Loan, 1996)) that $\mathbf{c}$ is the solution of
the LSE equation:

$$E\left[\left.\begin{pmatrix} 1 & X^T \\ X & X X^T \end{pmatrix}\right| T\right]\mathbf{c} = E\left[\left.\begin{pmatrix} 1 \\ X \end{pmatrix} Y \right| T\right] \qquad (4.8)$$
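For illustration, the least-squares problem of Equation 4.8 for the data-points that reach a leaf can be solved with an SVD-based routine, in the spirit of the Golub and Van Loan reference (a sketch with our own names):

import numpy as np

def fit_leaf_linear_model(X, y):
    # Solve min_c || [1  X] c - y ||^2 ; c[0] is the intercept, c[1:] the slopes.
    A = np.column_stack([np.ones(len(y)), X])
    c, *_ = np.linalg.lstsq(A, y, rcond=None)   # SVD-based least squares
    return c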
To see more clearly that Err(T ) given by Equation 2.12 is not appropriate for
the linear regressor case, consider the situation in Figure 4.1. The two thick lines
represent a large number of points (possibly infinite). The best split for the linear
regressor is x = 0 and the fit is perfect after the split (thus Errl(T1) = Errl(T2) = 0).
Obviously this split also achieves the maximum possible impurity decrease (1/12).
Figure 4.1: Example of situation where average based decision is different from linear regression based decision.
Figure 4.2: Example where classification on sign of residuals is unintuitive.
For the case when Err(T) is used, $E[Y \mid T] = 1/2$ so $\mathrm{Err}(T) = 1/12$. To
determine the split point for this situation suppose the split point is $s - 1$ in
Figure 4.1. The points with property $x < s - 1$ belong to $T_1$ and the rest to
$T_2$. Then $E[Y \mid T_1] = s/2$, $\mathrm{Err}(T_1) = s^3/12$, $E[Y \mid T_2] = (-2 + s^2)/(2(-2 + s))$ and
$\mathrm{Err}(T_2) = (4 - 8s + 12s^2 - 8s^3 + s^4)/(24 - 12s)$. Thus by splitting, the impurity
decreases by $\Delta \mathrm{Err}(T) = (1 - s)^2 s/(4(2 - s))$. The extremum points in the interval
$[0, 1]$ are $s = 1$ and $s = (3 - \sqrt{5})/2$. Looking at the second derivative in these
points one can observe that $\Delta \mathrm{Err}(T)$ has a minimum in $s = 1$ and a maximum
in $s = (3 - \sqrt{5})/2$. Thus the maximum impurity decrease is obtained if the split
point is $-(\sqrt{5}-1)/2 = -0.618034$ or, symmetrically, $0.618034$. Either of these splits
is very far from the split obtained using $\mathrm{Err}_l(T)$ (at point 0), thus splitting the
points in proportion 19% to 81% instead of the ideal 50% to 50%.
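The calculation above can also be checked numerically. The sketch below (illustrative code with our own discretization of the dataset and of the candidate split points) samples the triangular dataset of Figure 4.1 and compares the split chosen by the variance-based impurity Err(T) with the one chosen by the linear-model impurity Err_l(T):

import numpy as np

# Triangular dataset of Figure 4.1: y = x + 1 on [-1, 0], y = 1 - x on [0, 1]
x = np.linspace(-1.0, 1.0, 2001)
y = np.where(x < 0, x + 1.0, 1.0 - x)

def err_const(xs, ys):                     # Err(T): variance around the mean
    return ys.var()

def err_lin(xs, ys):                       # Err_l(T): error of the best linear fit
    c = np.polyfit(xs, ys, 1)
    return np.mean((np.polyval(c, xs) - ys) ** 2)

def best_split(err):
    parent = err(x, y)
    best_gain, best_t = -1.0, None
    for t in x[100:-100:25]:               # candidate split points
        m = x < t
        gain = parent - (m.mean() * err(x[m], y[m]) + (~m).mean() * err(x[~m], y[~m]))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t

print(best_split(err_const))               # roughly +/-0.618, far from the kink
print(best_split(err_lin))                 # roughly 0, the natural split point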
This example suggests that the split point selection based on Err(T ) produces
an unnecessary fragmentation of the data that is not related to the natural organi-
zation of the data-points for the case of linear regression trees. This fragmentation
produces unnecessarily large and unnatural trees, anomalies that are not corrected
by the pruning stage. Indeed, when we used a dataset with the triangular shape
described above as the input to a regression tree construction algorithm that used
Err(T ) from Equation 2.12 as split criterion we obtained the following split points
starting from the root and navigating breadth first for three levels: 0.6185, -0.5255,
0.8095, -0.7625, 0.3585, 0.7145, 0.9055. The splits are not only unintuitive but the gen-
erated tree is very unbalanced. Note that this example is not an extreme case but
rather a normal one, so this behavior is probably the norm not the exception.
4.3.2 Karalic’s construction algorithm
Using the split criterion in Equation 4.7 the problem mentioned above is avoided
and much higher quality trees are built. If exhaustive search is used to determine
the split point, the computational cost of the algorithm becomes prohibitively
expensive for large datasets for two main reasons:
• If the split attribute is continuous, as many split points as there are training
data-points have to be considered. For each of them a linear system has
to be formed and solved. Even if the matrix and the vector that form the
linear system are maintained incrementally (which can be dangerous from
numerical stability point of view), for every level of the tree constructed, a
number of linear systems equal to the size of the dataset have to be solved.
• If the split attribute is discrete the situation is potentially much worse since
Theorem 9.6 in (Breiman et al., 1984) does not seem to apply for this split
criterion. This means that an exponential number (in the size of the domain
of the split variable) of linear systems have to be formed and solved.
The first problem can be alleviated if a sample of the points available are
considered as split points. Even if this simplification is made, the data-points have
to be sorted in every intermediate node on all the possible split attributes. Also, it
is not clear how these modifications influence the accuracy of the regression trees
produced. The second problem seems unavoidable if exhaustive search is used.
4.3.3 Chaudhuri’s et al. construction algorithm
In order to avoid forming and solving so many linear systems, Chaudhuri et al.
(1994) proposed to locally classify the data-points available at an intermediate
node based on the sign of the residual with respect to the least square error linear
model. For the data-points in Figure 4.2 (the set of data-points is identical to the
one in Figure 4.1) this corresponds to points above and below the dashed line. As
it can be observed, when projected on the X axis, the negative class surrounds the
positive class so two split points are necessary to differentiate between them (the
node has to be split into three parts). When the number of predictor attributes is
greater than 1 (multidimensional case), the separating surface between class labels
+ and − is nonlinear. Moreover, if best regressors are fit in these two classes, the
prediction is only slightly improved. The solution adopted by Chaudhuri et al. is
to use Quadratic Discriminant Analysis (QDA) to determine the split point. This
usually leads to choosing as split point approximately the mean of the dataset,
irrespective of where the optimal split is, so the reduction is not always useful. For
this reason GUIDE (Loh, 2002) uses this method only to select the split attribute,
not the split point.
4.4 Scalable Linear Regression Trees
For constant regression trees (i.e. regression trees with constants as models in
the leaves), algorithms for scalable classification trees can be straightforwardly
adapted (Gehrke et al., 1998). The main obstacle in achieving scalability for linear
regression trees is the observation previously made that the problem of partitioning
the domain of a discrete variable into two parts is intractable. In addition, the amount of
sufficient statistics that has to be maintained grows from two real numbers for
constant regressors (the mean and the mean of the squares) to a quantity quadratic in the
number of regression attributes (the matrix A^T A that defines the linear system),
which can also be a problem.
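For concreteness, the sketch below shows the standard normal-equations form of this computation (Equation 4.8 is not reproduced in this section, so the exact notation is assumed): the only state kept is the matrix A^T A and the vector A^T y, whose sizes depend on the number of regressor attributes but not on the number of data-points that stream through.

import numpy as np

def fit_leaf_regressor(X, y):
    """Accumulate the normal equations A^T A c = A^T y one data-point at a time;
    the state is quadratic in the number of regressor attributes and independent
    of the number of data-points."""
    d = X.shape[1] + 1                      # +1 for the intercept term
    AtA, Aty = np.zeros((d, d)), np.zeros(d)
    for xi, yi in zip(X, y):
        a = np.append(xi, 1.0)              # one row of A: regressor values plus a constant 1
        AtA += np.outer(a, a)
        Aty += yi * a
    return np.linalg.solve(AtA, Aty)        # linear model coefficients, intercept last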
In this work we make the distinction in (Loh, 2002) between predictor at-
tributes: (1) split discrete attributes – used only in the split predicates in inter-
mediate nodes in the regression tree, (2) split continuous attributes – continuous
attributes used only for splitting, (3) regression attributes – continuous attributes
used in the linear combination that specifies the linear regressors in the leaves as
well as for specifying split predicates. By allowing some continuous attributes to
participate in splits but not in regression in the leaves we add greater flexibility to
the learning algorithm. The partitioning of the continuous attributes into split and
regression attributes is beyond the scope of this thesis (and is usually performed by the user
(Loh, 2002)).
The main idea behind our algorithm is to locally transform the regression prob-
lem into a classification problem by first identifying two general Gaussian distribu-
tions in the regressor attributes–output space using the EM algorithm for Gaussian
mixtures and then by classifying the data-points based on the probability of be-
longing to these two distributions. Classification tree techniques are then used to
select the split attribute and the split point. Our algorithm, called SECRET, is
shown in Figure 4.3.
The role of EM is to find two natural classes in the data that have approximately
a linear organization. The role of the classification is to identify the predictor
attributes that can make the difference between these two classes in the input
space. To see this more clearly suppose we are in the process of building a linear
regression tree and we have to decide on the split attribute and split point for the
node T . Suppose the set of training examples available at node T contains tuples
with three components: a regressor attribute Xr, a discrete attribute Xd and the
predicted attribute Y . The projection of the training data on the Xr, Y space
Input: node T, data-partition D
Output: regression tree T for D rooted at T

Linear regression tree construction algorithm: BuildTree(node T, data-partition D)
 (1)  normalize data-points to unitary sphere
 (2)  find two Gaussian clusters in regressor–output space (EM)
 (3)  label data-points based on closeness to these clusters
 (4)  foreach split attribute
 (5)      find best split point and determine its Gini gain
 (6)  endforeach
 (7)  let X be the attribute with the greatest Gini gain and
      Q the corresponding best split predicate set
 (8)  if (T splits)
 (9)      partition D into D1, D2 based on Q and label node T with split attribute X
 (10)     create children nodes T1, T2 of T and label the edge (T, Ti) with predicate q(T,Ti)
 (11)     BuildTree(T1, D1); BuildTree(T2, D2)
 (12) else
 (13)     label T with the least square linear regressor of D
 (14) endif

Figure 4.3: SECRET algorithm
might look like Figure 4.4. The data-points are approximately organized in two
clusters with Gaussian distributions that are marked as ellipsoids. Differentiating
between the two clusters is crucial for prediction, but the information in the regression
attribute is not sufficient to make this distinction, even though within each cluster
it allows good predictions. The information in the discrete attribute Xd can
make this distinction, as can be observed from Figure 4.5 where the projection is
made on the Xd, Xr, Y space. If other split attributes had been present, a split on
Xd would have been preferred since the resulting splits are pure.
Figure 4.4: Projection on Xr, Y space of training data.

Figure 4.5: Projection on Xd, Xr, Y space of same training data as in Figure 4.4.

For the situation in Figure 4.2, the EM algorithm will approximate each of the
two distinct linear portions with a very narrow Gaussian distribution. Using these
two clusters, all the points at the left of origin will have the first class label and
the points at the right of the origin the second class label. Obviously, using 0 as
the split point provides the best separation between classes. This is exactly the
best split point for this situation since it results in perfect approximation of the
data after linear models are fitted for each cluster.
Observe that the EM algorithm for Gaussian mixtures is used here in a very restricted
setting: there are only two mixture components, so the likelihood function has a simpler
form with fewer local maxima. Since EM is sensitive to distances, before
running the algorithm the training data has to be normalized by performing a
linear transformation that makes the data look as close as possible to a unitary
sphere centered at the origin. Experimentally we observed that, with this
transformation and in this restricted scenario, the EM algorithm with randomly
initialized clusters works well.
We first describe how the EM algorithm can be implemented efficiently, followed
by details on the integration of the resulting mixtures with the split selection
procedure and the linear regression in the leaves.
4.4.1 Efficient Implementation of the EM Algorithm
The following two ideas are used to implement the EM algorithm efficiently:
1. steps E and M are performed simultaneously, which means that quantities
hij do not have to be stored explicitly
2. all the operations are expressed in terms of the Cholesky decomposition G_i
of the covariance matrix Σ_i = G_i G_i^T. G_i has the useful property that it is
lower triangular, so solving linear systems takes quadratic effort in the number
of dimensions and computing the determinant is linear in the number of
dimensions.
Note that these modifications can be made in addition to the techniques used in
(Bradley et al., 1998) to make the EM algorithm scalable.
Using the Cholesky decomposition we immediately have Σ_i^{-1} = G_i^{-T} G_i^{-1} and
|Σ_i| = |G_i|^2. Substituting in Equation 4.2 we get:

p_i(x | µ_i, G_i) = \frac{1}{(2π)^{d/2} |G_i|} \, e^{-\frac{1}{2} \| G_i^{-1}(x - µ_i) \|^2}
The quantity x' = G_i^{-1}(x − µ_i) can be computed by solving the linear system
G_i x' = x − µ_i, which takes quadratic effort in the number of dimensions. For this
reason the inverse of G_i need not be precomputed, since solving the linear system
takes as much time as a matrix–vector multiplication. This is in line with the
recommendations given by Golub and Van Loan (1996) to avoid inverting matrices whenever
possible.
The following quantities have to be maintained incrementally for each cluster:

s_i = \sum_{j=1}^{N} h_{ij}

s_{x,i} = \sum_{j=1}^{N} h_{ij} x_j

S_i = \sum_{j=1}^{N} h_{ij} x_j^T x_j

where the quantities h_{ij} are computed with the formula in Equation 4.3 for each training
example x_j and are discarded after updating s_i, s_{x,i}, S_i for every cluster i (we
need only small temporary storage for the h_{ij}'s).
After all the training examples have been seen, the new parameters of the two
distributions are computed with the formulae:

α_i = s_i / N

µ_i = s_{x,i} / s_i

Σ_i = S_i / s_i − µ_i^T µ_i

G_i = Chol(Σ_i)
Moreover, if the data-points come from a Gaussian distribution with mean
µ_i and covariance matrix G_i G_i^T, the transformation x' = G_i^{-1}(x − µ_i) results in data-points
with a Gaussian distribution with mean 0 and the identity matrix as covariance
matrix. This means that this transformation can be used for data normalization in
the tree growing phase, normalization that is of crucial importance as we pointed
out earlier in the section.
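Putting the two ideas above together, a minimal sketch of one combined E/M pass is given below. It assumes the responsibilities h_ij of Equation 4.3 are the usual normalized mixture weights α_i p_i(x_j) / Σ_k α_k p_k(x_j); the function and variable names are illustrative, not the thesis implementation.

import numpy as np
from scipy.linalg import solve_triangular

def em_pass(X, alpha, mu, G):
    """One simultaneous E/M pass over the data for a two-cluster Gaussian mixture.
    alpha[i], mu[i], G[i] are the weight, mean and lower-triangular Cholesky factor
    of cluster i (Sigma_i = G_i G_i^T).  Only s_i, s_{x,i}, S_i are stored; the
    responsibilities h_ij are computed per data-point and discarded immediately."""
    alpha = np.asarray(alpha, dtype=float)
    N, d = X.shape
    k = len(alpha)
    s, sx, S = np.zeros(k), np.zeros((k, d)), np.zeros((k, d, d))
    for x in X:
        p = np.empty(k)
        for i in range(k):
            # x' = G_i^{-1}(x - mu_i) by forward substitution (no explicit inverse)
            xp = solve_triangular(G[i], x - mu[i], lower=True)
            det_Gi = np.prod(np.diag(G[i]))          # |G_i| = product of the diagonal
            p[i] = np.exp(-0.5 * xp @ xp) / ((2 * np.pi) ** (d / 2) * det_Gi)
        h = alpha * p / (alpha * p).sum()            # assumed form of h_ij (Equation 4.3)
        s += h
        sx += h[:, None] * x
        S += h[:, None, None] * np.outer(x, x)
    # closed-form parameter updates from the sufficient statistics
    alpha_new = s / N
    mu_new = sx / s[:, None]
    G_new = [np.linalg.cholesky(S[i] / s[i] - np.outer(mu_new[i], mu_new[i]))
             for i in range(k)]
    return alpha_new, mu_new, G_new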
4.4.2 Split Point and Attribute Selection
Once the two Gaussian mixtures are identified, the data-points can be labeled
based on the closeness to the two clusters (i.e., if a data-point is closer to cluster
1 than to cluster 2 it is labeled with class label 1, otherwise it is labeled with class
label 2). After this classification is performed locally, split point and attribute
selection methods from classification tree construction can be used.
We are using the Gini gain as the split selection criterion to find the split point. That
is, for each attribute (or collection of attributes for oblique splits) we determine
the best split point and compute the Gini gain. Then the predictor attribute with
the greatest Gini gain is chosen as split attribute.
For the discrete attributes the algorithm of Breiman et al. (1984) finds the split
point in time linear in the size of the domain of the discrete attribute (since we
only have two class labels, see Section 2.3.1). We use this algorithm, unchanged,
in the present work to find the split point for discrete attributes.
Split point selection for continuous attributes
Since the EM algorithm for Gaussian mixtures produces two normal distributions,
it is reasonable to assume that the projection of the data-points with the same class
label on a continuous attribute X also has a normal distribution. As explained in
Section 2.3.1, the split point that best separates the two normal distributions can
be found using Quadratic Discriminant Analysis (QDA). The reason for preferring
QDA to a direct minimization of the Gini gain is the fact that it gives qualitatively
similar splits but requires less computational effort (Loh & Shih, 1997). We already
explained in Section 2.3.1 how to find the split point using QDA.
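Since Section 2.3.1 is not reproduced here, the sketch below assumes that the QDA split point is the value where the two class-conditional normal densities, weighted by the class proportions, intersect between the two class means; if the formulation in Section 2.3.1 omits the class weights, pass w1 = w2 = 1.

import numpy as np

def qda_split_point(mu1, s1, w1, mu2, s2, w2):
    """Root of w1*N(x; mu1, s1^2) = w2*N(x; mu2, s2^2) lying between the class means.
    mu1, mu2 are class means, s1, s2 standard deviations, w1, w2 class proportions."""
    a = 0.5 / s2 ** 2 - 0.5 / s1 ** 2
    b = mu1 / s1 ** 2 - mu2 / s2 ** 2
    c = (mu2 ** 2 / (2 * s2 ** 2) - mu1 ** 2 / (2 * s1 ** 2)
         + np.log(w1 * s2 / (w2 * s1)))
    if abs(a) < 1e-12:                         # equal variances: the equation is linear
        return -c / b
    roots = np.roots([a, b, c])
    roots = roots[np.isreal(roots)].real
    if len(roots) == 0:                        # degenerate case: one class dominates everywhere
        return (mu1 + mu2) / 2
    lo, hi = min(mu1, mu2), max(mu1, mu2)
    inside = [r for r in roots if lo <= r <= hi]
    return inside[0] if inside else roots[np.argmin(abs(roots - (mu1 + mu2) / 2))]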
Finding a good oblique split for two Gaussian mixtures
Ideally, given two Gaussian distributions, we would like to find the separating
hyperplane that maximizes the Gini gain. Fukanaga showed that the problem of
minimizing the expected value of the 0-1 loss (the classification error function)
generates an equation involving the normal of the hyperplane that is not solvable
algebraically (Fukanaga, 1990). Following the same treatment, it is easy to see
that the problem of maximizing the Gini gain generates the same equation, thus
algebraic solutions are not possible for the Gini gain either. Fortunately, a good
solution to the problem of determining a separating hyperplane can be found using
Linear Discriminant Analysis (LDA) (Fukanaga, 1990).
The solution of an LDA problem for two mixtures consists of minimizing
Fisher's separability criterion (Fukanaga, 1990):

J(n) = \frac{n^T Σ_w n}{n^T Σ_b n}

with

Σ_w = \sum_{i=1,2} α_i Σ_i, \qquad Σ_b = \sum_{i=1,2} α_i (µ − µ_i)(µ − µ_i)^T, \qquad µ = \sum_{i=1,2} α_i µ_i

which has as result a vector n with the property that the projections of the two
Gaussian distributions on this vector are as separated as possible. The solution of the
optimization problem is (Fukanaga, 1990):

n = \frac{Σ_w^{-1}(µ_1 − µ_2)}{\| Σ_w^{-1}(µ_1 − µ_2) \|_2}

The value of Fisher's criterion is invariant to the choice of origin on the projection,
so we can make the projection on the line given by the vector n, which optimizes
Fisher's criterion, and the origin of the coordinate system.
The two multidimensional Gaussian distributions are transformed into unidimensional
normal distributions on the projection line, with means η_i = n^T µ_i and
variances σ_i^2 = n^T Σ_i n for i = 1, 2, the coordinates being line coordinates with
the projection of the origin as the 0. This situation is depicted
in Figure 4.6 for the bi-dimensional case.
Figure 4.6: Separator hyperplane for two Gaussian distributions in two dimensional space.
On the projection line (n, O), the QDA can be used to find the split point η, as
in the previous section. The point η on the projection line corresponds to ηn in the
initial space. The equation of the separating hyperplane that has n as normal and
contains the point ηn is n^T(x − ηn) = 0 ⇔ n^T x − η = 0. With this, a point x belongs
to the first partition if sign(η_1 − η)(n^T x − η) ≥ 0. The hyperplane that contains
this point of the projection line and that is perpendicular to the projection line is
a good separator of the two multidimensional Gaussian distributions.
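Putting the pieces together, the following sketch computes the oblique split for two Gaussian clusters; it uses the within-class scatter Σ_w as written above and reuses qda_split_point from the previous sketch to place η on the projection line. Names are illustrative.

import numpy as np

def oblique_split(alpha, mu, Sigma):
    """Oblique split for two Gaussian clusters with weights alpha[i], means mu[i]
    and covariances Sigma[i].  Returns (n, eta, eta1); a point x is routed to the
    first partition when sign(eta1 - eta) * (n @ x - eta) >= 0."""
    Sw = alpha[0] * Sigma[0] + alpha[1] * Sigma[1]          # within-class scatter
    n = np.linalg.solve(Sw, mu[0] - mu[1])
    n /= np.linalg.norm(n)                                  # Fisher direction, unit length
    eta1, eta2 = n @ mu[0], n @ mu[1]                       # projected means
    s1, s2 = np.sqrt(n @ Sigma[0] @ n), np.sqrt(n @ Sigma[1] @ n)   # projected std devs
    eta = qda_split_point(eta1, s1, alpha[0], eta2, s2, alpha[1])   # from the sketch above
    return n, eta, eta1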
In order to be able to compare the efficacy of the split with other splits, we have
to compute its Gini gain. The same method as for the case of unidimensional splits
of continuous variables can be used here. The only unsolved problem is computing
the p-value of a Gaussian distribution with respect to a half-space. The solution
is given by the following result:
Proposition 1. For a Gaussian distribution with mean µ and covariance matrix
Σ = GG^T, positive definite, and density p_{µ,Σ}(x), and a hyperplane with normal n
that contains the point x_c, the p-value with respect to the hyperplane is:

P[n^T(x − x_c) ≥ 0 | µ, Σ] = \int_{n^T(x − x_c) ≥ 0} p_{µ,Σ}(x) \, dx = \frac{σ}{2\sqrt{|Σ||S|}} \left( 1 + \mathrm{Erf}\!\left( \frac{µ'_1}{σ\sqrt{2}} \right) \right)

where

Σ'^{-1} = \begin{pmatrix} s & w^T \\ w & S \end{pmatrix} = M^T Σ^{-1} M

with M orthogonal such that M^T n = e_1, σ = 1/\sqrt{s − w^T S^{-1} w}, and µ' = M^T(µ − x_c).
Proof. Since M^T n = e_1, the first column of M has to be n (which is supposed to
have unit norm) and the rest of the columns are vectors orthogonal to n. Such an
orthogonal matrix can be found using Gram-Schmidt orthogonalization starting
with n and the d − 1 coordinate unit vectors (versors) least parallel to n. Doing the transformation
x' = M^T(x − x_c), which transforms the hyperplane (n, x_c) into (e_1, 0), we get x − µ =
M(x' − µ'). Using the notation Φ for P[n^T(x − x_c) ≥ 0 | µ, Σ] and substituting in
the definition we get:

Φ = \int_{x'_1 ≥ 0} \int_{x'_2} \cdots \int_{x'_d} p(x') \, dx'
  = \int_{x'_1 ≥ 0} \int_{x'_2} \cdots \int_{x'_d} \frac{1}{(2π)^{d/2}\sqrt{|Σ|}} \, e^{-\frac{1}{2}(x' − µ')^T Σ'^{-1}(x' − µ')} \, dx'
With the notation y = x' − µ' and L for the set of indexes 2 . . . d, thus y^T = [y_1 \; y_L^T],
the exponent in the above integral can be rewritten as:

y^T Σ'^{-1} y = s y_1^2 + 2 y_1 y_L^T w + y_L^T S y_L
             = s y_1^2 − y_1^2 w^T S^{-1} w + (y_L + y_1 S^{-1} w)^T S (y_L + y_1 S^{-1} w)
With this we get:

Φ_L(x'_1) = \int_{x'_L} \exp\left( -\frac{1}{2}(x' − µ')^T Σ'^{-1}(x' − µ') \right) dx'_L
          = \int_{y_L} \exp\left( -\frac{1}{2}\left[ s y_1^2 − y_1^2 w^T S^{-1} w + (y_L + y_1 S^{-1} w)^T S (y_L + y_1 S^{-1} w) \right] \right) dy_L
          = \frac{(2π)^{\frac{d−1}{2}}}{\sqrt{|S|}} \exp\left( -\frac{1}{2}(x'_1 − µ'_1)^2 (s − w^T S^{-1} w) \right)
and substituting back into the expression for Φ above we have:

Φ = \int_{x'_1 ≥ 0} \frac{1}{(2π)^{d/2}\sqrt{|Σ|}} \, Φ_L(x'_1) \, dx'_1
  = \frac{σ}{\sqrt{|Σ||S|}} \int_{x'_1 ≥ 0} \frac{1}{\sqrt{2π}\,σ} \, e^{-(x'_1 − µ'_1)^2 / 2σ^2} \, dx'_1
  = \frac{σ}{\sqrt{|Σ||S|}} \int_{t ≥ -µ'_1/(σ\sqrt{2})} \frac{1}{\sqrt{π}} \, e^{-t^2} \, dt
  = \frac{σ}{\sqrt{|Σ||S|}} \left( \frac{1}{2} \, \frac{2}{\sqrt{π}} \int_{-\frac{µ'_1}{σ\sqrt{2}}}^{0} e^{-t^2} \, dt + \frac{1}{\sqrt{π}} \int_{0}^{∞} e^{-t^2} \, dt \right)
  = \frac{σ}{2\sqrt{|Σ||S|}} \left( 1 + \mathrm{Erf}\!\left( \frac{µ'_1}{σ\sqrt{2}} \right) \right)

We now show that s − w^T S^{-1} w > 0, so the above computations are sound. Since
Σ is positive definite by supposition, Σ'^{-1} = M^T Σ^{-1} M is positive definite. This
means that v^T Σ'^{-1} v > 0 for any nonzero v. Taking v = [1 \; −(S^{-1}w)^T]^T we get the
required inequality.
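A direct transcription of the proposition, with M built via QR (a Gram-Schmidt-style orthogonalization) as in the proof; a Monte-Carlo check is included since the formula is easy to mistype. Function names are mine.

import numpy as np
from math import erf, sqrt

def halfspace_pvalue(n, xc, mu, Sigma):
    """P[n^T (x - xc) >= 0] for x ~ N(mu, Sigma), following Proposition 1."""
    d = len(mu)
    n = np.asarray(n, float) / np.linalg.norm(n)
    M, _ = np.linalg.qr(np.column_stack([n, np.eye(d)]))   # orthogonal, first column +/- n
    if M[:, 0] @ n < 0:
        M[:, 0] *= -1                                      # fix the sign so that M^T n = e_1
    Sp_inv = M.T @ np.linalg.inv(Sigma) @ M                # Sigma'^{-1} = M^T Sigma^{-1} M
    s, w, S = Sp_inv[0, 0], Sp_inv[1:, 0], Sp_inv[1:, 1:]
    sigma = 1.0 / sqrt(s - w @ np.linalg.solve(S, w))
    mu1p = (M.T @ (mu - xc))[0]                            # first component of mu'
    return sigma / (2 * sqrt(np.linalg.det(Sigma) * np.linalg.det(S))) \
        * (1 + erf(mu1p / (sigma * sqrt(2))))

# quick Monte-Carlo sanity check in three dimensions
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)); Sigma = A @ A.T + np.eye(3)
mu, xc, n = rng.standard_normal(3), rng.standard_normal(3), rng.standard_normal(3)
print(halfspace_pvalue(n, xc, mu, Sigma))
print((((rng.multivariate_normal(mu, Sigma, 200000) - xc) @ n) >= 0).mean())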
Finding linear regressors
If the current node is a leaf, or in preparation for the situation that all the descendants
of this node are pruned, we have to find the best linear regressor that fits the
training data. We identified two ways the LSE linear regressor can be computed.
The first method consists of a traversal of the original dataset and the
identification of the subset that falls into this node. The least square linear system
in Equation 4.8 is formed with these data-points and solved. Note that, in the
case that all the sufficient statistics can be maintained in main memory, a single
traversal of the training dataset per tree level will suffice.
The second method uses the fact that the split selection method tries to find
a split attribute and a split point that can differentiate best between the two
Gaussian mixtures found in the regressor–output space. The least square problem
can be solved at the level of each of these mixtures under the assumption that the
distribution of the data-points is normal with the parameters identified by the EM
algorithm. This method is less precise since the split is usually not perfect but can
be used when the number of traversals over the dataset becomes a concern.
Before we show how this can be done we need to introduce some notation. We
use subscript I to denote the first d− 1 components (the components referring to
regressors) and o to refer to the last component – the output. Thus, for example
for a matrix G, GII refers to its (d− 1)× (d− 1) upper part. The following result
provides the solution:
Proposition 2. For a Gaussian distribution with mean µ and covariance matrix
Σ = GG^T, the LSE linear regressor is given by:

y = c^T (x_I − µ_I) + µ_o     (4.9)

where c is the solution of the linear equation c^T G_{II} = G_{oI}.
Proof. The LSE linear regressor is the function f(x_I) that minimizes
E[(x_o − f(x_I))^2] over the distribution of x. It can be shown (see (Shao, 1999), Example 1.19) that,
out of all measurable functions, y = E[x_o | x_I] is the LSE estimator.
Thus, it remains only to compute E[x_o | x_I] for x distributed according to a
Gaussian distribution with mean µ and covariance matrix GG^T. We denote by
p(x) the density of this distribution. We have:

E[x_o | x_I] = \frac{\int_{x_o} x_o \, p(x) \, dx_o}{\int_{x_o} p(x) \, dx_o}     (4.10)

Doing the transformation x' = G^{-1}(x − µ) we have x = Gx' + µ, so x_o = G_{oI} x'_I +
G_{oo} x'_o + µ_o and dx_o = G_{oo} \, dx'_o. Making the change of variable in Equation 4.10 we
get:
E[x_o | x_I] = \frac{\int_{x'_o} (G_{oI} x'_I + G_{oo} x'_o + µ_o) \, e^{-\frac{1}{2} x'^T_I x'_I} e^{-\frac{1}{2} x'^2_o} \, dx'_o}{\int_{x'_o} e^{-\frac{1}{2} x'^T_I x'_I} e^{-\frac{1}{2} x'^2_o} \, dx'_o}
            = G_{oI} x'_I + µ_o + G_{oo} \frac{\int_{x'_o} x'_o e^{-\frac{1}{2} x'^2_o} \, dx'_o}{\int_{x'_o} e^{-\frac{1}{2} x'^2_o} \, dx'_o}
            = G_{oI} G_{II}^{-1} (x_I − µ_I) + µ_o
            = c^T (x_I − µ_I) + µ_o     (4.11)

since the integrand x'_o e^{-\frac{1}{2} x'^2_o} is antisymmetric, so its integral over the whole domain of x'_o is zero.
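With the mixture parameters already in Cholesky form, the second method amounts to one small triangular solve per cluster. A sketch using the notation of Proposition 2 (the last coordinate of x is the output; the names are mine):

import numpy as np
from scipy.linalg import solve_triangular

def gaussian_lse_regressor(mu, Sigma):
    """LSE linear regressor implied by a Gaussian over (regressors, output):
    returns (c, mu_I, mu_o) such that y(x_I) = c^T (x_I - mu_I) + mu_o."""
    G = np.linalg.cholesky(Sigma)                 # lower triangular, Sigma = G G^T
    G_II, G_oI = G[:-1, :-1], G[-1, :-1]
    # c^T G_II = G_oI  <=>  G_II^T c = G_oI^T: one triangular solve
    c = solve_triangular(G_II.T, G_oI, lower=False)
    return c, mu[:-1], mu[-1]

def leaf_predict(c, mu_I, mu_o, x_I):
    return c @ (x_I - mu_I) + mu_o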
Experimentally we observed that the first method is usually more precise. The
reason for the difference in precision can be attributed to the fact that the best
split found by the split selection method is not perfect, so the two clusters are not
perfectly differentiated. For medium size datasets we recommend the use of the
first method. If computation time is a concern, the second method, which is only
slightly less precise, can be used.
4.5 Empirical Evaluation
In this section we present the results of an extensive experimental study of SE-
CRET, the linear regression tree construction algorithm we propose. The purpose
of the study was twofold: (1) to compare the accuracy of SECRET with GUIDE
(Loh, 2002), a state-of-the-art linear regression tree algorithm and (2) to compare
the scalability properties of SECRET and GUIDE through running time analysis.
The main findings of our study are:
• Accuracy of prediction. SECRET is more accurate than GUIDE on three
datasets, as accurate on six datasets and less accurate on three datasets. This
suggests that overall the prediction accuracy to be expected from SECRET
is comparable to the accuracy of GUIDE. On four of the datasets, the use of
oblique splits resulted in significant improvement in accuracy.
• Scalability to large datasets. For datasets of small to moderate sizes
(up to 5000 tuples), GUIDE slightly outperforms SECRET. The behavior
of the two methods for large datasets is very different; for datasets with
256000 tuples and 3 attributes, SECRET runs about 200 times faster than
GUIDE. Even if GUIDE considers only 1% of the points available as possible
split points, SECRET still runs 20 times faster. Also, there is no significant
change in running time when SECRET produces oblique splits.
4.5.1 Experimental testbed and methodology
GUIDE (Loh, 2002) is a regression tree construction algorithm that was designed
to be both accurate and fast. The extensive study by Loh (Loh, 2002) showed
that GUIDE outperforms previous regression tree construction algorithms and
compares favorably with MARS (Friedman, 1991), a state-of-the-art regression
algorithm based on spline functions. GUIDE uses statistical techniques to pick
the split variable and can use exhaustive search or just a sample of the points
to find the split point. In our accuracy experiments we set up GUIDE to use
exhaustive search since it is more accurate than split point candidate sampling,
the only other option. For the scalability experiments we report running times for
both the exhaustive search and split point candidate sampling of size 1%.
For the experimental study we used nine real life and three synthetic datasets.
Real life datasets:
Abalone Dataset from UCI machine learning repository used to predict the age of
abalone from physical measurements. Contains 4177 cases with 8 attributes
(1 nominal and 7 continuous).
Baseball Dataset from UCI repository, containing information about baseball play-
ers used to predict their salaries. Consists of 261 cases with 20 attributes (3
nominal and 17 continuous).
Boston Data containing characteristics and prices of houses around Boston, from
UCI repository. Contains 506 cases with 14 attributes (2 nominal and 12
continuous).
Kin8nm Data containing information on the forward kinematics of an 8 link robot
arm from the DELVE repository. Contains 8192 cases with 8 continuous
attributes.
Mpg Subset of the auto-mpg data in the UCI repository (tuples with missing
values were removed). The data contains characteristics of automobiles that
can be used to predict gas consumption. Contains 392 cases with 8 attributes
(3 nominal and 5 continuous).
Mumps Data from StatLib archive containing incidence of mumps in each of
the 48 contiguous states from 1953 to 1989. The predictor variables are
year and longitude and latitude of state center. The dependent variable is
the logarithm of the number of mumps cases. Contains 1523 cases with 4
continuous attributes.
Stock Data containing daily stock of 10 aerospace companies from StatLib repos-
itory. The goal is to predict the stock of the 10th company from the stock
of the other 9. Contains 950 cases with 10 continuous attributes.
TA Data from UCI repository containing information about teaching assistants
at University of Wisconsin. The goal is to predict their performance. Con-
tains 151 cases with 6 attributes (4 nominal and 2 continuous).
Tecator Data from StatLib archive containing characteristics of spectra of pork
meat with the purpose of predicting the fat content. We used the first 10
principal components of the wavelengths to predict the fat content. Contains
240 cases with 11 continuous attributes.
Synthetic datasets:
Cart Synthetic dataset proposed by Breiman et al. ((Breiman et al., 1984), p. 238)
with 10 predictor attributes: X1 ∈ {−1, 1}, Xi ∈ {−1, 0, 1}, i ∈ {2, . . . , 10},
and the predicted attribute determined by: if X1 = 1 then Y = 3 + 3X2 +
2X3 + X4 + σ(0, 2), else Y = −3 + 3X5 + 2X6 + X7 + σ(0, 2). We interpreted
all the 10 predictor attributes as discrete attributes.
Fried Artificial dataset used by Friedman (Friedman, 1991) containing 10 contin-
uous predictor attributes with independent values uniformly distributed in
the interval [0, 1]. The value of the predicted variable is obtained with the
equation: Y = 10 sin(πX1X2) + 20(X3 − 0.5)2 + 10X4 + 5X5 + σ(0, 1).
3DSin Artificial dataset containing 2 continuous predictor attributes uniformly
distributed in interval [−3, 3], with the output defined as
Y = 3 sin(X1) sin(X2). There is no noise added.
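For reference, the three synthetic datasets above can be generated with the sketch below; σ(0, s) is read here as zero-mean Gaussian noise with standard deviation s (if the thesis means variance s, the scale should be adjusted accordingly).

import numpy as np
rng = np.random.default_rng(0)

def gen_3dsin(n):
    X = rng.uniform(-3, 3, size=(n, 2))
    return X, 3 * np.sin(X[:, 0]) * np.sin(X[:, 1])        # no noise added

def gen_fried(n):
    X = rng.uniform(0, 1, size=(n, 10))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4] + rng.normal(0, 1, n))
    return X, y

def gen_cart(n):
    X = np.column_stack([rng.choice([-1, 1], n), rng.choice([-1, 0, 1], size=(n, 9))])
    y = np.where(X[:, 0] == 1,
                 3 + 3 * X[:, 1] + 2 * X[:, 2] + X[:, 3],
                 -3 + 3 * X[:, 4] + 2 * X[:, 5] + X[:, 6]) + rng.normal(0, 2, n)
    return X, y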
We performed all the experiments reported in this chapter on a Pentium III
933MHz running Redhat Linux 7.2.
Table 4.1: Accuracy on real (upper part) and synthetic (lower part) datasets of GUIDE and SECRET. In parenthesis we indicate O for oblique splits. The winner is in bold font if it is statistically significant and in italics otherwise.

                      Constant Regressors                              Linear Regressors
            GUIDE           SECRET          SECRET(O)       GUIDE           SECRET          SECRET(O)
Abalone     5.32±0.05       5.50±0.10       5.41±0.10       4.63±0.04       4.67±0.04       4.76±0.05
Baseball    0.224±0.009     0.200±0.008     0.289±0.012     0.173±0.005     0.243±0.011     0.280±0.009
Boston      23.34±0.72      28.00±0.92      30.91±0.94      40.63±6.63      24.01±0.69      26.11±0.66
Kin8nm      0.0419±0.0002   0.0437±0.0002   0.0301±0.0003   0.0235±0.0002   0.0222±0.0002   0.0162±0.0001
Mpg         12.94±0.33      30.09±2.28      26.26±2.45      34.92±21.92     15.88±0.68      16.76±0.74
Mumps       1.34±0.02       1.59±0.02       1.56±0.02       1.02±0.02       1.23±0.02       1.32±0.04
Stock       2.23±0.06       2.20±0.06       2.18±0.07       1.49±0.09       1.35±0.05       1.03±0.03
TA          0.74±0.02       0.69±0.01       0.69±0.01       0.81±0.04       0.72±0.01       0.79±0.08
Tecator     57.59±2.40      49.72±1.72      28.21±1.75      13.46±0.72      12.08±0.53      7.80±0.53
3DSin       0.1435±0.0020   0.4110±0.0006   0.2864±0.0077   0.0448±0.0018   0.0384±0.0026   0.0209±0.0004
Cart        1.506±0.005     1.171±0.001     N/A             N/A             N/A             N/A
Fried       7.29±0.01       7.45±0.01       6.43±0.03       1.21±0.00       1.26±0.01       1.50±0.01
4.5.2 Experimental results: Accuracy
For each experiment with real datasets we used a random partitioning into 50% of
the datapoints for training, 30% for pruning and 20% for testing. For the synthetic
datasets we generated randomly 16384 tuples for training, 16384 tuples for pruning
and 16384 tuples for testing for each experiment. We repeated each experiment
100 times in order to get accurate estimates. For comparison purposes we built
regression trees with both constant (by using all the continuous attributes as split
attributes) and linear (by using all continuous attributes as regressor attributes)
regression models in the leaves. In all the experiments we used Quinlan’s resubsti-
tution error pruning (Quinlan, 1993b). For both algorithms we set the minimum
number of data-points in a node to be considered for splitting to 1% of the size of
the dataset, which resulted in trees at the end of the growth phase with around 75
nodes.
Table 4.1 contains the average mean square error and its standard deviation for
GUIDE, SECRET and SECRET with oblique splits (SECRET(O)) with constant
(left part) and linear (right part) regressors in the leaves, on each of the twelve
datasets. GUIDE and SECRET with linear regressor in the leaves have equal
accuracy (we considered accuracies equal if they were less than three standard
deviations away from each other) on six datasets (Abalone, Boston, Mpg, Stock,
TA and Tecator), GUIDE wins on three datasets (Baseball, Mumps and Fried)
and SECRET wins on the remaining three (Kin8nm, 3DSin and Cart). These
findings suggest that the two algorithms are comparable from the accuracy point
of view, neither dominating the other. The use of oblique splits in SECRET made
a big difference in four datasets (Kin8nm 27%, Stock 24%, Tecator 35% and 3DSin
45%). These datasets usually have less noise and are complicated but smooth (so
they offer more opportunities for intelligent splits). At the same time the use of
oblique splits resulted in significantly worse performance on two of the datasets
(Baseball 13% and Fried 19%).
4.5.3 Experimental results: Scalability
We chose to use only synthetic datasets for scalability experiments since the sizes of
the real datasets are too small. The learning time of both GUIDE and SECRET is
mostly dependent on the size of the training set and on the number of attributes, as
is confirmed by some other experiments we are not reporting here. As in the case of
accuracy experiments, we set the minimum number of data-points in a node to be
considered for further splits to 1% of the size of the training set. We measured only
the time to grow the trees, ignoring the time necessary for pruning and testing. The
reason for this is the fact that pruning and testing can be implemented efficiently
and for large datasets do not make a significant contribution to the running time.
For GUIDE we report running times for both exhaustive search and sample split
point (only 1% of the points available in a node are considered as possible split
points), denoted by GUIDE(S).
Size        GUIDE     GUIDE(S)   SECRET   SECRET(O)
250          0.07       0.05       0.21      0.21
500          0.13       0.07       0.33      0.34
1000         0.30       0.12       0.55      0.58
2000         0.94       0.24       1.08      1.12
4000         3.28       0.66       2.11      2.07
8000        12.58       2.40       4.07      4.12
16000       48.93       9.48       8.16      8.37
32000      264.50      43.25      16.71     16.19
64000     1389.88     184.50      35.62     35.91
128000    6369.94     708.73      73.35     71.67
256000   25224.02    2637.94     129.95    131.70

Figure 4.7: Tabular and graphical representation of running time (in seconds) of GUIDE, GUIDE with 0.01 of the points as split points, SECRET and SECRET with oblique splits for synthetic dataset 3DSin (3 continuous attributes).
Size        GUIDE     GUIDE(S)   SECRET   SECRET(O)
250          0.09       0.07       0.47      0.43
500          0.17       0.14       0.87      0.92
1000         0.36       0.28       1.85      1.83
2000         1.12       0.80       3.58      3.69
4000         2.90       2.38       7.33      7.36
8000        10.46       8.43      13.77     14.05
16000       42.16      33.09      27.80     28.68
32000      194.63     123.63      56.87     58.01
64000     1082.70     533.16     122.26    124.60
128000    4464.88    1937.94     223.42    222.75
256000   18052.16    8434.33     460.12    470.68

Figure 4.8: Tabular and graphical representation of running time (in seconds) of GUIDE, GUIDE with 0.01 of the points as split points, SECRET and SECRET with oblique splits for synthetic dataset Fried (11 continuous attributes).
Figure 4.9: Running time of SECRET with linear regressors as a function of the number of attributes for dataset 3DSin (curves for 4000, 8000, 12000 and 16000 data-points).
Figure 4.10: Accuracy of the best quadratic approximation of the running time for dataset 3DSin (16000 data-points).
Figure 4.11: Running time of SECRET with linear regressors as a function of the size of the 3DSin dataset (curves for 3, 17 and 33 attributes).
Figure 4.12: Accuracy as a function of learning time for SECRET and GUIDE with four sampling proportions (1%, 25%, 50% and 100%).
Results of experiments with the 3DSin dataset and Fried dataset are depicted
in Figures 4.7 and 4.8 respectively. A number of observations are apparent from
these two sets of results:
1. The performance of the two versions of SECRET (with and without oblique
splits) is virtually indistinguishable.
2. The running time of both versions of GUIDE grows quadratically with the dataset
size for large datasets.
3. As the number of attributes went up from 3 (3DSin) to 11 (Fried) the com-
putation time for GUIDE(S), SECRET and SECRET(O) went up about
3.5 times but went slightly down for GUIDE. An interesting question raised
by these results is: How does SECRET scale with the number of regres-
sor attributes? In order to answer this question, we added more regressor
attributes, with values generated randomly in interval [0, 1], to the three ex-
isting attributes of the 3DSin dataset and we measured the running time of
SECRET for various sizes of the training data. The dependency of the run-
ning time on the total number of attributes (the three existing ones plus the
extra attributes added) for various sizes of the training dataset are depicted
in Figure 4.9. These results suggest that the dependency on the number of
regressor attributes is quadratic, a fact reinforced by the good match that
can be observed in Figure 4.10 between the shape of the best quadratic
approximation and the actual observations for the experiment with 16000
data-points. This is to be expected since both the EM algorithm and the
computation of the linear models in the leaves maintain a square matrix with
as many rows and columns as there are regressor attributes (thus of overall quadratic size),
each element of which has to be updated for every
data-point. This quadratic dependency is unavoidable if linear models are
fitted in leaves, thus the scalability of SECRET in this respect is as good as
possible.
4. For fixed size trees, SECRET scales linearly with the number of training
examples. This is apparent from Figures 4.7 and 4.8, as well as from Figure 4.11,
which depicts the results of Figure 4.9 as a function of the size of the dataset.
5. For large datasets (256000 tuples) SECRET is two orders of magnitude faster
than GUIDE and one order of magnitude faster than GUIDE(S).
Since SECRET is much faster than GUIDE and, as we have seen, it has compa-
rable accuracy, a natural question is how much the accuracy of GUIDE decreases if
its running time is limited. To shed some light into this issue we performed exper-
iments on the 3DSin dataset, experiments in which we fixed the minimum number
of data-points in a node to be considered for further splits, the cutoff, to 10 and
we varied only the size of the dataset. Since data is noiseless, we expect the larger
trees that are build for larger datasets, due to fact that the cutoff is constant, to
be more accurate than the trees built on small datasets. We are comparing the
dependency of the accuracy on the running time of GUIDE with various sampling
proportions (proportion of data-points to be considered potential split points) and
SECRET. The results of these experiments are depicted in Figure 4.12. Notice
that, by allowing the same running time, SECRET is by as much as 300 times
more accurate than any of the variants of GUIDE (when the training time is 70
seconds). Curiously, the accuracy of GUIDE does not increase at the initial rate,
as the accuracy of SECRET does, and it levels off at about 0.001 irrespective of
the running time. Even if the initial rate of error decrease were maintained throughout
the range of the running time, SECRET would still be about 30 times more
accurate if the training time is limited to 70 seconds. This significantly reduced
error of SECRET is mostly due to the fact that the 3DSin dataset is smooth so
the larger the tree constructed the more precise the prediction. Since SECRET is
much faster, it can be run on a larger dataset in the same amount of time thus
producing larger and more accurate trees.
Scalability properties of SECRET algorithm: As the experiments reported
in this section suggest, for a fixed tree size, SECRET scales linearly with the size
of the training dataset and quadratically with the number of regressor attributes.
With respect to discrete and split attributes, SECRET has the same scalability
properties as classification tree algorithms thus, by using scalable classification tree
construction techniques (Gehrke et al., 1998), SECRET can achieve good behavior
even for very large datasets. Furthermore, since most of the computational effort
goes into running the EM Algorithm for Gaussian mixtures for every node, for
large datasets SECRET can be further sped up by employing the scalable EM
algorithm of Bradley et al. (1998).
4.6 Discussion
In this chapter we introduced SECRET, a new linear regression tree construction
algorithm designed to overcome the scalability problems of previous algorithms.
The main idea is, for every intermediate node, to find two Gaussian clusters in the
regressor–output space and then to classify the data-points based on the closeness
to these clusters. Techniques from classification tree construction are then used
locally to choose the split variable and split point. In this way the problem of
forming and solving a large number of linear systems, required by an exhaustive
search algorithm, is avoided entirely. Moreover, this reduction to a local classifi-
cation problem allows us to efficiently build regression trees with oblique splits.
Experiments on real and synthetic datasets showed that the proposed algorithm
is as accurate as GUIDE, a state-of-the-art regression tree algorithm, if normal
(axis-parallel) splits are used, and on some datasets up to 45% more accurate if oblique splits are
used.
used. At the same time our algorithm requires significantly smaller computational
resources for large datasets.
Chapter 5
Probabilistic Decision Trees
Classification and regression trees prove to be accurate models, and easy to inter-
pret and build due to their tree structure. Nevertheless, they have a number of
shortcomings that reduce their accuracy: (1) for data-points close to the decision
boundaries, the prediction is discontinuous, (2) natural fluctuations into the data
are not taken into account, and (3) data is strictly partitioned at every node and a
data-point will influence only one of the children nodes thus the amount of data is
exponentially reduced. All of these shortcomings are due to the fact that the splits
employed in the nodes are sharp. In this chapter we address all these concerns with
traditional classification and regression trees by allowing probabilistic splits in the
nodes, splits that depend also on the natural fluctuations in the data. The result of
these modifications to traditional classification and regression trees is a new type
of model we call probabilistic decision trees. They prove to be significantly more
accurate than their traditional counterparts, by as much as 36%, and, at the same
time, they require only a modest increase in the computational effort.
5.1 Introduction
As we have seen in Chapter 2, classification and regression trees use deterministic
conditions as split predicates. These conditions determine exactly one branch to
be followed at each step in order to reach a unique leaf. Usually the conditions
involve a single attribute, the split attribute, and a very simple condition of the
form X ∈ S for discrete attributes or X ≤ c for continuous attributes, with S some
set and c a constant. This type of model has the advantage that it is simple
and has predictable behavior, requiring a simple walk starting from the root of
the tree on true predicates to reach the leaf. Despite all these good properties, the
classic classification and regression trees – we refer to them collectively as decision
trees – have a number of shortcomings that, in general, decrease their efficiency,
and, in some cases, severely limit their applicability.
• Discontinuity of Decision. Decision trees can be thought of as a compact
specification of a partitioning of the decision domain (the crossproduct of the
domain of all the split attributes) into distinct regions–each such partition,
that we will call a decision partition, is specified by the conjunction of the
predicates on the path from the root to a leaf–together with the decision,
that corresponds to the leaf that ends the path. These decision partitions do
not overlap and have crisp borders, which results in discontinuous decisions.
More precisely, when the input is close to the border, a small modification of the
input might result in an entirely different decision. This means
that the decision is arbitrary near the border, since it depends on small
fluctuations, thus unlikely to be good. For regression trees it also means
that the model, seen as a function, is discontinuous on the border, thus,
it is not everywhere differentiable – in particular it has no first derivative
on the border. This discontinuity property of regression trees makes them
unsuitable for applications that require continuity, and, in general, results in
lack of precision around the border.
• Exponential Fragmentation of the Data. In the process of building the
trees, only data-points corresponding to the node being built (data-points
that satisfy the conjunction of predicates on the path from the root to the
node) are used in subsequent decision for this node and any of its descendants.
Thus, after a split predicate is decided upon, the data is partitioned into two
parts, one corresponding to each of the children, and each of the data-points
contributes to the construction of exactly one child. This means the amount
of data available to take decisions decreases exponentially with every level of
the tree built, a process that is usually referred to as data fragmentation.
The statistical significance of a decision, either split for an intermediate node
or label/regressor for a leaf node, depends mostly on the number of data-
points on which the decision is based – having larger amounts of data
results in greater statistical significance which translates in greater precision
for the decision tree. With this in mind, it is immediately apparent that data
fragmentation results in decreased precision in the regions of the decision
space close to the split points of nodes higher in the tree. This is where,
most likely, the decision has to be refined, but the amount of data on which
the decision is based is severely reduced by fragmentation.
• Fluctuations due to Noise. Even though having crisp decision surfaces
is appealing from the point of view of the users (the classifier is easy to
understand) determining a precise boundary for the region is problematic
since data is usually noisy and finite. This suggests that the placement of the
boundary is somewhat arbitrary and, due to the fact that statistical signifi-
cance decreases with the reduction of the number of data-points, the deeper
in the tree the less precise the placement of the boundary is. Traditional
decision trees do not give any indication on how firm the fine structure of
the tree actually is. Quite often small modifications in the training data
result in vastly different trees; this sometimes creates problems for the end
user. Taking the natural fluctuations into account should greatly improve
the stability of the structure of the classifier and make it less prone to data
fragmentation due to arbitrary decisions.
• Probabilistic Decisions. For some applications it is desirable for classifi-
cation trees to produce a probabilistic answer instead of a deterministic one.
Such a probabilistic answer can be easily transformed into a deterministic
prediction, and, in addition, provides extra information about the confidence
in the answer. For example, a user would trust a prediction more knowing
that the probability of this result is .9 rather than .51,
even though both will produce the same deterministic answer, for example
class label YES. Estimating such probabilities only from the information at
the leaves, as proposed by Breiman et al. (1984), is error-prone, especially
for inputs close to decision borders, due to data fragmentation, and is discontinuous
on the border.
In view of the above discussion on the shortcomings of traditional decision trees,
we think that there is a real need to refine the decision trees in a systematic way
to address all of these issues, but, at the same time, to maintain their desirable
properties. This is exactly the purpose of the new model we propose in this chapter,
the probabilistic decision tree (PDT). In the next section we provide the detailed
description of PDTs, and in Section 5.3 we show how they can be constructed.
We show then, in Section 5.4, experimental evidence that the accuracy is almost
always increased, sometimes dramatically, and, at the same time, the increase in
the computational effort is small. We comment on related work in Section 5.5 and
conclude in Section 5.6.
5.2 Probabilistic Decision Trees (PDTs)
We designed the Probabilistic Decision Trees (PDTs) in order to directly address
limitations of the classic decision trees (DTs), limitations that, as we have seen,
have a negative impact on either the precision or the applicability of decision trees.
Before we delve into details, let us first give a high level description of the
probabilistic decision trees and explain how they address the shortcomings of the
traditional decision trees.
The main ideas behind PDTs are:
• Make the splits fuzzy.1 Instead of deciding at the level of each of the
intermediate nodes if the left or right branch should be followed, a proba-
bility is associated with following each of the branches – the sum of the two
probabilities should always be 1, so it is enough to specify only one of the
two. Thus sometimes both branches are actually followed, but one might be
more important than the other. This effectively allows, in the most general
case, all the leaves to be reached by any of the inputs. To obtain a prediction,
the predictors of all leaves of the tree are used, but the importance of the
contribution is determined by the probability to reach the respective leaf.
1 Here by fuzzy splits we mean relaxed splits, not splits necessarily expressed with fuzzy logic.
• Predict probability vectors instead of class labels. For regression
trees, since linear combinations of real numbers are easy to obtain, the infor-
mation in multiple leaves can be easily combined. Class labels, on the other
hand, cannot be directly combined in a linear fashion. The natural solution
to this problem is to replace the class label prediction with probability vec-
tor prediction–every class label has an entry in the vector that specifies the
probability for the tree to produce the class label. It is easy to obtain linear
combinations of probability vectors, and to produce class label predictions
from probability vectors by simply returning the most probable label.
• Retain the good properties of DTs. In addition to addressing the short-
comings of traditional decision trees, we would like to retain, as much as
possible, the desirable properties of DTs. More precisely, we would like the
PTDs to be as close to DTs with respect to usage and construction; in this
manner PDTs will still be interpretable, the prediction can be made effi-
ciently, and the PDTs can be built in a scalable fashion by merely adapting
the decision tree algorithms. It is quite clear that, in order to address the
problems of DTs, the efficiency of prediction and construction will decrease;
it is important though to keep the performance degradation minimal.
The way we will achieve all three desiderata is by, first, generalizing the decision
trees in order to allow imprecise splits and, for the case of classification trees, the
prediction of probability vectors; then, by severely restricting the general model to
make sure that the prediction and construction of the model can be done efficiently,
and, at the same time, the shortcomings of DTs are still addressed. We will refer to
the general model as generalized decision trees (GDTs) and the two specializations
as generalized classification trees (GCTs) and generalized regression trees (GRTs).
5.2.1 Generalized Decision Trees (GDTs)
The GDTs generalize the decision trees by replacing split predicates in intermediate
nodes with probability distributions over the inputs. Since in this chapter we are
interested only in binary trees, we show the generalization only for this type of
decision trees, but all the ideas can easily be extended to the general case.
Like decision trees, GDTs consist of a set of nodes that are linked in a tree
structure; the nodes without descendants are called the leaves, all the other nodes
are called intermediate nodes. The information that each type of node contains is
the following:
• Intermediate Nodes. Let us denote by T the node and by TL and TR its
left and right descendants respectively. Also, we denote by x ∈ T the event
that input x is routed to the node T. The only information associated with the
node is P [x ∈ TL|x ∈ T ], the probability the input x is routed to the left
node, TL, given the fact that it was routed to the node T –the probability to
follow the right branch is completely determined by this probability. In its
simplest form the probability P [x ∈ TL|x ∈ T ] depends on a single predictor
attribute, in which case we call the attribute the split attribute, but in general
it can be a general probability distribution function that depends on all the
attributes.
• Leaf Nodes. The information associated with each leaf is a probability
vector, that specifies the probability for each of the class labels, in the case
of generalized classification trees (GCTs), and a regressor – constant, linear
or more complicated– in the case of generalized regression trees (GRTs). For
leaves of GCTs, we denote by P [C = c|L] the probability that the prediction
is class label c given the leaf node L, which is nothing else but the c-th
component of the probability vector associated with leaf L, P [C|L]. For
leaves of GRTs, we denote by fL(x) the regressor function that produces the
numerical output when given the input for the leaf L. We denote the set of
all leaves in the GDTs by L.
Notice that GDTs are complete probabilistic models in the sense that the tree
structure together with the information in the node specify a probabilistic model.
In order to produce predictions with this model, we can simply use the fact that,
given some input x, the prediction with the smallest expected squared error is the
expected value of the prediction given the model (Shao, 1999). For the two types
of models, this best predictor is:
• GCT: Denoting by V the random vector distributed according to the distribution
specified by the GCT (Vc is its c-th component), the best predictor
for the probability of seeing some class label c given the input x, vc(x), is:

v_c(x) = E[V_c | x] = P[C = c | x] = \sum_{L \in \mathcal{L}} P[C = c | L] \, P[x ∈ L]     (5.1)
To establish the above result we used the fact that reaching each of the leaves
is an independent event and the fact that the probability to see a particular
class label in any of the leaves is independent of the input. To transform the
probability vector into a class label, we simply return the class label with
the highest probability.
• GRT: Denoting by Y the random variable distributed according to the dis-
tribution specified by the GRT, the best predictor for the output given the
input x, y(x), is:
y(x) = E[Y | x] = \sum_{L \in \mathcal{L}} f_L(x) \, P[x ∈ L]     (5.2)
where we used again the fact that reaching each leaf is an independent result.
The probability to reach any leaf, given some input x, P [x ∈ L], can be com-
puted recursively – using the Bayes rule and the fact that x ∈ T ′ ⇒ x ∈ T if T ′
is a descendant of T – starting at the root R of the tree and following the path
leading to the leaf L using the equations:
P[x ∈ R] = 1     (5.3)

P[x ∈ T'] = P[x ∈ T' | x ∈ T] · P[x ∈ T]     (5.4)

where T is some intermediate node and T' is one of its children. Thus, if the path from
R to L is R, T1, T2, ..., Tn, L, then

P[x ∈ L] = P[x ∈ T1 | x ∈ R] · P[x ∈ T2 | x ∈ T1] · · · P[x ∈ L | x ∈ Tn]     (5.5)
so it is simply the product of the probabilities, conditioned on the input x, along
the path.
If the conditional probabilities in the intermediate nodes are defined as:

P[x ∈ TL | x ∈ T] = 1 if pL(x) = true, and 0 otherwise     (5.6)

with pL(x) some predicate on inputs (for example X > 10), then the GDTs degenerate
into traditional decision trees – thus indeed GDTs generalize DTs.
Generalized classification trees have been proposed before under the name
Bayesian decision trees (Chipman et al., 1996; Chipman et al., 1998). They can
also easily emulate fuzzy or soft decision trees (Sison & Chong, 1994; Guetova
et al., 2002).
5.2.2 From Generalized Decision Trees to Probabilistic De-
cision Trees
Clearly, the GDTs allow imprecise splits and predict, in the case of GCTs, proba-
bility vectors, but they are very hard to learn since any probability distribution is
allowed in the nodes. Also, if the probability distributions are not 0 or 1 for the
majority of the data-points, most of the leaves have to be consulted to make a pre-
diction, which means that making a prediction might be too slow for some practical
applications. In order to retain the good properties of DTs, the GDTs have to be
drastically restricted. We call this restricted version of GDTs probabilistic decision
trees (PDTs), and the two specializations probabilistic classification trees (PCTs)
and probabilistic regression trees (PRTs). In what follows, we point out, for each
restriction, how the performance is improved but the shortcomings of traditional
trees are still addressed.
The notion of split variable and split predicate has a lot of appeal from the user's
point of view. At the same time, algorithms to find the split variable and split
point for decision trees are efficient and well studied. In order to retain these
good properties of decision trees and, at the same time, to allow probabilistic
splits, we require the probabilities associated with each node to characterize the
fluctuations of the split point under noise in the data, instead of being general
probability distributions. In this manner, the fluctuations are naturally captured
in the model; the only problem is determining these fluctuations instead of finding
general densities that would fit the data (which is known to be a hard problem).
By requiring the probabilities to capture the fluctuations of the split point, we
take advantage of the fact that, except capturing fluctuations, split attributes and
points in decision trees provide a good way to capture the underlying structure of
the system being learned. Furthermore, since we assume that the training data
are samples from the distribution that describe this system, by generating other
datasets from the same distribution – which, in general, will not coincide with the
dataset made available – we will observe fluctuations of the split points for any of
the split variables. These are the fluctuations we want to capture and leverage
in our models.
Let us now be more specific and, for the two types of split variables, continuous
and discrete, specify precisely the probabilities that can be used in the intermediate
nodes together with their interpretation from the point of view of the fluctuations.
We defer the discussion on how to find such probabilities to Section 5.3.3.
Continuous Split Variables
Due to the noise in the data-points available for learning, the split point – as com-
puted by the split selection methods for decision tree construction algorithms on
these datasets, which usually means choosing the point where the split selection
criteria is minimized – will fluctuate. Since we want to keep things as simple as
possible, and we want to obtain decision trees that can produce predictions effi-
ciently, we restrict the specification of the probability for nodes with continuous
split variables to two parameters: the mean and the variance. The distribution
of the split point is well confined in space; it is, in general, reasonably well
approximated by a normal distribution. For this reason, in a node T with split variable X
and with split point mean and variance µ and σ^2 respectively, we restrict the probability
function P[x ∈ TL | x ∈ T] to:

P[x ∈ TL | x ∈ T] = P[X < N(µ, σ^2)] = \frac{1}{2}\left[ 1 − \mathrm{Erf}\!\left( \frac{x − µ}{σ\sqrt{2}} \right) \right]     (5.7)
where N(µ, σ2) is the normal distribution with mean µ and variance σ2, and Erf(x)
is the error function (the last quantity is the cumulative distribution function
expressed in terms of the error function). The probability specified by this formula
is nothing but the probability that the observed value x of the split attribute X
is at the left of the split point that is distributed as N(µ, σ2) (the actual random
variable). With this interpretation, we only need to determine the mean and
variance of the split point, not a full probability distribution – a much easier and
less error prone task.
Discrete Split Variables
Split points for discrete variables, in traditional decision trees, are specified by the
subset of values for which the left branch should be taken. Due to the fluctuations
in the training data, a particular value will sometimes be placed into this subset and
sometimes in its complement. The way we model these fluctuations is by associating,
with each possible value, the probability to belong to the subset, which is exactly a
probability vector that we use to model information about class labels in the GCTs.
Thus, for a given value of the split variable we can simply find (by consulting our
probability vector) the probability that the left branch should be followed. Notice
that, as opposed to the continuous case, where we had to restrict the possible
distributions in order to keep the model reasonably simple and easy to use, in
the discrete case our representation has full generality (specifies completely the
distribution).
One of the appealing features of PDTs is the fact that they can be easily
converted or interpreted as traditional decision trees. This can be achieved by:
1. using the split predicates X < µ instead of the probability function P [X <
N(µ, σ2)]
2. converting the probability vectors for discrete attributes into the subset
for the left branch by selecting the values with corresponding probability at
least 0.5
3. replacing the probability vector in the leaves with the most probable class
label, if necessary.
5.2.3 Speeding up Inference with PDTs
Having a fast inference mechanism is important in its own right, but becomes
critical during learning since the inference, using the already built part of the tree,
is extensively used in the process of building new nodes. Thus, speeding up the
inference results in speedups of the learning process as well.
The inefficiency of the inference of PDTs, inherited from the GDTs, is due to the
fact that all leaves participate in the prediction process, each having a contribution
proportional with the probability that the leaf is reached from the root given the
input. This means that, instead of using time proportional with the height of the
tree, as is the case for DTs, time proportional with the number of nodes (which is
exponentially bigger) is used. The main observation for alleviating this inefficiency
is the fact that most of the leaves are very unlikely to have significant contribution
to the prediction, which suggests that they can be excluded altogether from the
decision process. Obviously, the set of leaves relevant for prediction is input
dependent.
To make the above principle practical, we observe that, due to the fact that the
probability to reach a leaf is the product of probabilities on the path, the aggregate
contribution of all the leaves that belong to a subtree rooted at some intermediate
node is exactly the probability to reach that node (i.e. the product of probabilities
from the root to the node). This suggests that, if the probability to follow the
left branch, for some input x, is small – say smaller than α = 0.01, with α some
predefined threshold – then the contribution of the left subtree to the prediction,
when compared to the contribution of the right subtree, can be ignored. Thus,
we can improve the inference time by following a branch only if the probability to
follow it is larger than α. This means that, usually, only a skinny subtree of the
original tree has to be consulted to make a prediction, greatly improving the
inference time. For the rest of the chapter, we assume that this pruning rule is
always used with the default value for the threshold α = 0.01, but, to keep things
simple, we will provide the explanation as if no pruning is explicitly performed.
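The following sketch shows one way the α-pruned inference could be organized; the Node class is a hypothetical, stripped-down PDT node used only for illustration. Because pruned branches are dropped, the accumulated class probabilities may sum to slightly less than one and can be renormalized before being reported.

class Node:
    def __init__(self, prob_left=None, left=None, right=None, class_probs=None):
        self.prob_left = prob_left        # function x -> P[left branch | x]
        self.left, self.right = left, right
        self.class_probs = class_probs    # dict label -> probability (leaves only)

def predict_probs(node, x, alpha=0.01, weight=1.0, acc=None):
    acc = {} if acc is None else acc
    if node.class_probs is not None:      # leaf: accumulate weighted class probabilities
        for label, p in node.class_probs.items():
            acc[label] = acc.get(label, 0.0) + weight * p
        return acc
    p = node.prob_left(x)
    if p > alpha:                         # explore a branch only if its probability exceeds alpha
        predict_probs(node.left, x, alpha, weight * p, acc)
    if 1.0 - p > alpha:
        predict_probs(node.right, x, alpha, weight * (1.0 - p), acc)
    return acc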
5.3 Learning PDTs
Since, by design, structurally the PDTs are very close to DTs, the learning algo-
rithms for DTs can simply be adapted to construct PDTs. In what follows, we
explain only the modifications to the traditional decision tree construction algo-
rithms that are necessary to construct PDTs.
We have two distinct sets of problems to deal with in order to be able to adapt
the DT algorithms to PDTs. First, we have to show how sufficient statistics should
be computed for each of the nodes in a PDT. The sufficient statistics, as discussed
in Chapter 2, are the aggregate information for each of the nodes, information
that is necessary and sufficient for all construction algorithms. Second, we have
to show how the sufficient statistics are to be used to construct the PDTs. The
design principle for constructing PDTs from sufficient statistics is to retain most
of the algorithm for DTs construction. In particular, we want to replace the split
point by a split distribution but to keep as much of the rest of the construction
algorithm the same. Let us now see, in detail, how we deal with each of these
issues.
5.3.1 Computing sufficient statistics for PDTs
Intuitively, the sufficient statistics for node T characterize the data-points from the
training or pruning dataset that are routed at node T . They capture aggregate
properties only; these properties are further used by the construction algorithm.
In the context of PDTs, the main problem we have to deal with is the fact that,
in general, a data-point is routed with nonzero probability to multiple nodes in
the tree. Thus, as opposed to DT construction, a data-point might contribute
to the construction of multiple nodes, but its contribution should depend on the
probability of the data-point to reach each of the nodes. The natural way to
capture this intuitive requirement – which, as we will see, reduces to the classic
case when PDTs have the node probabilities defined as in Equation 5.6, thus they
degenerate into DTs – is by conditioning the sufficient statistics on the fact that
the data-point, which we denote by the random vector (X, C), is routed to the
node T . We were doing this conditioning implicitly in Chapter 2 but there it had
the direct interpretation: restrict the computation of the statistic to points routed
to node T . Here we have to be more explicit. We have two types of sufficient
statistics to deal with: probabilities and expectations.
Computing Probabilities
We are interested in conditioning only on the data-point (X, C) being routed to
node T . Thus, all the conditional probabilities we have to deal with have the form
P [p(X, C)|X ∈ T ] for PCTs and P [p(X, Y )|X ∈ T ] for PRTs, where p(·) is some
predicate. To keep things simple, we show only the formulae for classifiers, but the
formulae for regressors can be obtained by simply replacing C by Y . Using the
formula for the conditional probability, the way we should estimate these quantities
given the training (or pruning) dataset D is:
P[p(X,C) \mid X \in T] = \frac{\sum_{(x,c)\in D \,\wedge\, p(x,c)} P[x \in T]}{\sum_{(x,c)\in D} P[x \in T]}   (5.8)
This formula is intuitive since P [x ∈ T ] weights the contribution of the data-
point (x, c) and the denominator of the fraction normalizes the result. The quan-
tity P [x ∈ T ] is computed using the part of the tree already constructed (more
specifically the path from the root to T ) and Equation 5.5.
For instance, to estimate the probability that the discrete attribute Xi takes
the value aj and that the class label is c given the fact that the data-point is routed
to node T we use the formula:
P[X_i = a_j \wedge C = c_k \mid X \in T] = \frac{\sum_{(x,c)\in D \,\wedge\, x_i = a_j \,\wedge\, c = c_k} P[x \in T]}{\sum_{(x,c)\in D} P[x \in T]}   (5.9)
Similarly, to estimate the conditional probability that a continuous attribute
Xi takes values less than a, we use the formula:
P[X_i < a \mid X \in T] = \frac{\sum_{(x,c)\in D \,\wedge\, x_i < a} P[x \in T]}{\sum_{(x,c)\in D} P[x \in T]}   (5.10)
Note that, when the probabilities in the nodes are degenerate, as in Equa-
tion 5.6, the sum in both the numerator and denominator in all the above fractions
becomes the sum over the data-points at node T instead of the sum over the train-
ing set and the P [x ∈ T ] terms disappear. Thus, all these formulae are identical
with their classic counterpart in the degenerate case, re-confirming the fact that
the new formulae generalize the old ones.
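A minimal sketch of Equation 5.8, in which every training point contributes with weight P[x ∈ T]; the function reach_prob, standing for the probability computed from the already-built path to T (Equation 5.5), is an assumption of the sketch.

def conditional_prob(data, reach_prob, predicate):
    """Weighted estimate of P[predicate | X in T] over data = [(x, c), ...]."""
    num = sum(reach_prob(x) for (x, c) in data if predicate(x, c))
    den = sum(reach_prob(x) for (x, c) in data)
    return num / den if den > 0 else 0.0

# Example instantiations (D, reach_prob, i, a_j, c_k, a are placeholders):
#   Equation 5.9:  conditional_prob(D, reach_prob, lambda x, c: x[i] == a_j and c == c_k)
#   Equation 5.10: conditional_prob(D, reach_prob, lambda x, c: x[i] < a)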
Computing Expectations
For some decision tree construction algorithms, in addition to estimating probabil-
ities, we have to estimate expected values. Of particular interest are the compu-
tation of means and variances of points routed to a particular node. The general
form of expectation we would like to estimate in this context is: E[f(X, C)|X ∈ T ]
for classifiers and E[f(X, Y )|X ∈ T ] for regressors, where f is some predefined
function. Again, we show only the formulae for classifiers since the formulae for
regressors can be obtained by simply replacing C by Y . For any real function f(·),
E[f(X,C) \mid X \in T] = \frac{\sum_{(x,c)\in D} f(x,c)\, P[x \in T]}{\sum_{(x,c)\in D} P[x \in T]}   (5.11)
For instance, to estimate the parameters of the normal distributions required to
determine the split point for continuous attribute Xi using the statistical method
in Chapter 2, the following formulae have to be used, formulae that just instantiate
the above result:
\mu_c = E[X_i \mid X \in T, C = c] = \frac{\sum_{(x,c')\in D \,\wedge\, c'=c} x_i\, P[x \in T]}{\sum_{(x,c')\in D \,\wedge\, c'=c} P[x \in T]}

\sigma_c^2 = E[X_i^2 \mid X \in T, C = c] - \mu_c^2 = \frac{\sum_{(x,c')\in D \,\wedge\, c'=c} x_i^2\, P[x \in T]}{\sum_{(x,c')\in D \,\wedge\, c'=c} P[x \in T]} - \mu_c^2
Notice again that, in the case the probabilities in the nodes are degenerate
(given by Equation 5.6), these formulae become the classic formulae.
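The class-conditional mean and variance above can be computed with the same weighting scheme; the following sketch assumes the data is a list of (attribute vector, label) pairs, that reach_prob(x) returns P[x ∈ T], and that at least one data-point of the requested class reaches T.

def class_mean_var(data, reach_prob, attr_idx, label):
    """Weighted mean and variance of attribute attr_idx, restricted to class `label`."""
    w_sum = wx = wx2 = 0.0
    for (x, c) in data:
        if c != label:
            continue
        w = reach_prob(x)
        v = x[attr_idx]
        w_sum += w
        wx += w * v
        wx2 += w * v * v
    mu = wx / w_sum
    var = wx2 / w_sum - mu * mu
    return mu, var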
By allowing the f(·) function to produce a vector or a matrix instead of a real
number, formulae for the mean and covariance matrix of a multidimensional normal
distribution (conditioned on the data being routed at node T ) are obtained. This
means that the conditional version of the EM algorithm, described in Section 4.2,
can simply be obtained by weighting the contributions of the data-points in the
same manner as above; this suggests that algorithms like SECRET for regression
tree construction (see Chapter 4) can be adapted to produce PDTs.
Interestingly, the amount of space required for maintaining the sufficient statis-
tics does not increase much; the only increase is due to the fact that we have to
use a real number for quantities for which integers previously sufficed.
5.3.2 Adapting DT algorithms to PDTs
Now that we have shown how sufficient statistics, the building blocks of the learning
algorithm, are computed, we show how the DT construction algorithm can be
adapted to build PDTs. We talk about each of the two decision tree construction
phases in what follows.
Tree Growth
As a reminder, in the tree growth phase for DTs, a tree as large as possible is
built incrementally using the training dataset. For every node being constructed,
first, sufficient statistics are gathered. Then, using these statistics, the suitability
of each of the attributes for the role of split attribute is evaluated and the most
promising one chosen. Lastly, or concurrently with the previous step, the best
split point for the split attribute is determined. Once the split attribute and point
are determined, if the stopping criterion is satisfied, the growth is stopped for the current node and leaves are built for each of the two descendants – a process which requires statistics for each of the leaves to be gathered; otherwise, the process is repeated recursively on the two descendants.
Since the decision tree construction algorithms in Chapter 2 have been expressed
in terms of sufficient statistics, simply by using the new formulae for these statistics
in Section 5.3.1, we obtain most of the construction algorithm for probabilistic
decision trees. The only issue that remains to be addressed is the determination of
the probability distribution associated with the node for the split attribute settled
upon – the distribution that specifies the probability to follow the left branch for
a given input. We defer the discussion on how such probability distributions are
determined to Section 5.3.3.
Tree Pruning
In the construction of DTs, after the tree is grown, a separate dataset, called
pruning dataset, is usually used to determine a way to prune the tree in order
to maximize the chances that the model will be a good predictor for unseen data.
The pruning phase is usually necessary since the growth phase tends to overfit the data
(learn the noise).
In order to decide if the tree has to be pruned at a node T or not, two
types of quantities have to be computed: an estimate of the contribution of the
node T to the generalization error if the tree is pruned at T , and, in the case of
complexity based pruning methods, an estimate of the complexity of the subtree
rooted at T (Frank, 2000). Usually the tree is pruned at T if the contribution
to the generalization error of T is less than the smallest possible contribution of
its descendants plus the complexity cost of the subtree. Once the generalization
error estimate for each node is determined, the subtrees that have to be pruned
are determined in a bottom-up fashion, starting with leaves and moving up.
To adapt the pruning methods to PDTs it is enough to specify how to estimate
the generalization error for each node, and to account for the increase in complexity
of the tree, if necessary. To compute the complexity of a node, the same methods
developed for classic decision trees can be used; the additional cost is due to the
fact that maintaining information about the probabilities in the nodes requires
more space than a simple split point.
The contribution of a node, if considered a leaf, to the generalization error is
usually estimated by computing the contribution of the node to the empirical error
with respect to the pruning set. In the case of PDTs, usually, we have more than
one leaf responsible for the error of a data-point since multiple leaves are reached
with significant probability. This error should be interpreted as an expectation
due to the probabilistic interpretation of PDTs. Taking (x, c) to be the input, Er
the error metric, and T the tree with L its leaf-set, we have:
\mathrm{Er}_T(x, c) = \sum_{L \in \mathcal{L}} \mathrm{Er}_L(x, c)\, P[x \in L]   (5.12)
Note that this formula becomes exactly the error formula for DTs when the distri-
bution is degenerate since only one leaf will have P[x ∈ L] ≠ 0. Also, the formula
is very intuitive since the blame for the error is distributed among all leaves in
proportion to the probability that the data-point is routed to each leaf.
If we look now at the global error, we can rewrite it in the following manner:
\mathrm{Er}_T(D_p) = \sum_{(x,c)\in D_p} \left[ \sum_{L \in \mathcal{L}} \mathrm{Er}_L(x, c)\, P[x \in L] \right]
                   = \sum_{L \in \mathcal{L}} \sum_{(x,c)\in D_p} \mathrm{Er}_L(x, c)\, P[x \in L]
                   = \sum_{L \in \mathcal{L}} \mathrm{Er}_L(D_p)

where Dp is the pruning set, and

\mathrm{Er}_L(D_p) = \sum_{(x,c)\in D_p} \mathrm{Er}_L(x, c)\, P[x \in L]   (5.13)
is the contribution of the leaf L to the overall error. This last equation gives exactly
the quantity we need to compute in order to be able to prune the tree, a quantity that can be computed incrementally in a straightforward manner.
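A minimal sketch of Equation 5.13: the pruning-set error is accumulated per leaf, each data-point contributing in proportion to the probability of reaching that leaf. The helper functions reach_prob and err (for instance, 0/1 loss for classification) are assumptions of the sketch.

def leaf_errors(leaves, pruning_set, reach_prob, err):
    """Per-leaf error contributions Er_L(D_p); their sum is the overall error Er_T(D_p)."""
    contrib = {leaf: 0.0 for leaf in leaves}
    for (x, c) in pruning_set:
        for leaf in leaves:
            contrib[leaf] += err(leaf, x, c) * reach_prob(leaf, x)
    return contrib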
5.3.3 Split Point Fluctuations
One of the main design goals of PDTs – the only one that has yet to be ensured –
is to take into account the natural fluctuations in the data and to incorporate them
in the model. With this goal in mind, we observe that, if we would have multiple
training datasets – all generated with the same generative method but possibly
distinct, since the generation process is probabilistic – and we would compute the
split point, for each of these datasets we would get a different split point with close
but not exactly the same value. This means that, because the size of the training
dataset is finite, the split point is not constant but it naturally fluctuates around its
expected value. Due to this uncertainty of where the split point actually is, when
presented with an input x – especially when the input is close to the expected
value of the split point – we do not know for sure if we should send the data-point
to the left or right child. This uncertainty is perfectly captured by a probability
distribution, which is exactly the information we decided to associate with each
node in a PDT. Thus, we have a very natural way for interpreting and computing
the probability distribution in the nodes of PDTs; it is the probability that the
input x, a fixed and given quantity, satisfies the split predicate, a predicate that is
defined with respect to the split point, a probabilistic or fluctuating quantity.
Before we show how the distribution of the split points is actually estimated for
the two types of attributes, discrete and continuous – which allows us to construct
the probabilities in the nodes of PDTs – let us present, in short, two different ways
to estimate these distributions:
Empirical method. The idea behind the empirical method is to determine the
distribution of the split point experimentally by generating multiple training
datasets, computing the split point for each such dataset, thus obtaining a
set of samples for the split point, and then approximating the distribution
with a parametric or non-parametric method. The requirements for the gen-
erated training datasets are to have the same size – the size of the dataset
critically influences the amount of fluctuation – and the same underlying
distribution as the training dataset available in the learning process. Boot-
strapping is a very general method, developed in the statistical literature
(Davison & Hinkley, 1997), to generate as many datasets as needed satisfy-
ing these requirements. It consists of sampling, with replacement from the
original dataset, multiple datasets of the same size. Thus, by computing the
split point of these re-sampled datasets, samples from the distribution of the
split point are obtained. The advantage of the empirical method is the fact
that, in the manner described above, the distribution of the split point can
be approximated with any precision. The disadvantage is the fact that it is
computationally intensive; even when parametric approximations are used
to model the distributions, the number of samples that have to be produced
is in the hundreds in order to get reasonable approximations.
Analytical method. The idea behind the analytical method is to model, with
parametric distributions, the training dataset. Using this parametric model
and analytical analysis, formulae are developed for the distribution of the
split point. The advantage of the method is the fact that the distribution of
the split point can be determined very fast – essentially by simply instantiat-
ing variables in a formula. The disadvantages are the fact that the parametric
modeling might not be very accurate, and the fact that the method can be
applied only to restricted circumstances due to the fact that the analysis
tends to be very hard in general.
In developing both these kinds of methods we will assume, for simplicity, that
the training dataset available at the node being constructed is non-probabilistic.
This is not a restriction, since the results are easily extensible to the probabilistic
version simply by using the new formulae for the sufficient statistics developed in
Section 5.3.1.
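As an illustration of the empirical method described above, the following sketch estimates the mean and variance of the split point from bootstrap replicates; find_split_point stands for whatever split selection method is in use and is an assumption of the sketch.

import random

def bootstrap_split_distribution(data, find_split_point, n_replicates=200):
    """Bootstrap samples of the split point and their mean and variance."""
    samples = []
    for _ in range(n_replicates):
        # sample, with replacement, a dataset of the same size as the original
        replicate = [random.choice(data) for _ in data]
        samples.append(find_split_point(replicate))
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)
    return mean, var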
Split Point Fluctuations for Continuous Attributes
For continuous attributes we decided in Section 5.2.2 to model the probability of
following the left branch as the p-value of a normal distribution. This is equivalent,
using our interpretation of the probability, to modeling the distribution of the split
point with a normal distribution. There are two parameters we have to determine
for this normal distribution: the mean and the variance. We can take the mean
to be simply the split point of the provided training dataset. This leaves us with
only one problem: the estimation of the variance of the split point. We can use the
empirical method described above to estimate the variance by simply computing
samples of the split point and then computing their variance. Alternatively, we
can use the analytical method to estimate the variance. In what follows we explore
this second possibility.
We were able to perform the analysis only for a restricted scenario: the two
class label classification problem for split points determined using the statistical
method (see Chapter 2). The result also applies to regression since, using the
method proposed in Chapter 4, regression can locally be reduced to classification
using the EM algorithm.
Let µc, σ2c and Nc be the mean, variance and number of samples, respectively,
for the normal distribution that models the data-points with class label c. For the
case when we have only two class labels, the split point, as explained in Chapter 2,
can be found by solving the quadratic Equation 2.10. For this scenario, we would
like to find the variance of the split point when only finite samples of size Nc
are available from the datasets. Since the quadratic equation has a complicated
solution, finding its exact variance proves to be difficult.
One method to develop an approximation to the variance is to use the delta
method (Shao, 1999). It consists in expressing the variance of the solution in
terms of the variance of the ingredients – estimates of the mean and variance
of the two normal distributions using finite samples – and the derivatives of the
solution with respect to the ingredients. Even if only a first order approximation is
attempted, the resulting formulae tend to be quite large and have to be produced
in a mechanical way to avoid making mistakes. We found that, in practice, a first
order approximation is for the most part satisfactory but occasionally is very far from
the true value.
A second method to approximate the variance of the split point is based on
ignoring the error in estimating the variances and the differences in the size of the
two normal distributions. Equivalently, in Equation 2.10 the quantities σ1 and
σ2 are considered constants and the right hand side is replaced by 0. With these simplifications, the only solution of the quadratic equation between µ1 and µ2 is:
\mu = \frac{\mu_1 \sigma_2 + \mu_2 \sigma_1}{\sigma_1 + \sigma_2}   (5.14)
Ignoring the fluctuations in estimating the variances, using the independence of the
two sampling processes, and remembering that by using Nc samples to estimate
the mean, the variance of the mean estimate is \sigma_c^2 / N_c, we get the following expression
for the variance of the split point:
\mathrm{Var}(\mu) = \frac{\mathrm{Var}(\mu_1)\,\sigma_2 + \mathrm{Var}(\mu_2)\,\sigma_1}{\sigma_1 + \sigma_2}
                  = \frac{1}{\sigma_1 + \sigma_2}\left(\frac{\sigma_1^2 \sigma_2}{N_1} + \frac{\sigma_2^2 \sigma_1}{N_2}\right)
                  = \frac{\sigma_1 \sigma_2}{\sigma_1 + \sigma_2}\left(\frac{\sigma_1}{N_1} + \frac{\sigma_2}{N_2}\right)   (5.15)
We found this formula to be just as good as the formula obtained with the delta
method, and at the same time much simpler and more stable numerically; for this
reason we recommend the use of this formula in place of the one produced by the
delta method.
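The resulting approximation is easy to apply; the following sketch simply instantiates Equations 5.14 and 5.15 from the per-class means, standard deviations and sample counts.

def split_point_mean_var(mu1, sigma1, n1, mu2, sigma2, n2):
    """Approximate mean (Eq. 5.14) and variance (Eq. 5.15) of the split point
    for the two-class statistical split method."""
    mean = (mu1 * sigma2 + mu2 * sigma1) / (sigma1 + sigma2)
    var = (sigma1 * sigma2 / (sigma1 + sigma2)) * (sigma1 / n1 + sigma2 / n2)
    return mean, var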
Split Point Fluctuations for Discrete Attributes
For discrete attributes, the probability associated with nodes in the PDTs takes
the most general form: the probability to follow the left branch is specified for
each possible value of the attribute. As in the case of continuous attributes, both
the empirical and analytical methods can be used to estimate these probabilities.
We explore both alternatives in what follows.
As explained before, the empirical method can be used to generate multiple
training datasets. For each of these, a split set – the subset of the values of the
discrete attribute for which the left branch is followed – can be determined. By
counting how many times the value a of the attribute appears in the split set
and dividing it by the number of datasets we get an estimate of the probability
that the left branch should be followed when value a is observed. Performing this
computation for all values gives the full probability. Notice that this method is
applicable to any split point selection method.
Let us explore now the analytical method. As in the previous situation, we
restrict our attention to the two class label problem. In this case, for convex split
criteria like Gini and information gain, the split set can be found very efficiently
using the result of Breiman et al. (1984) (see Section 2.3.1). As it turns out, the
same result can be used to compute the probabilistic split. Let X be a discrete
attribute, with possible values a1 . . . an. Remember from Section 2.3.1 that, if
we denote by ri = P [C = 0|X = ai] – that is ri is the probability that the first
class label is observed given that X takes value ai – Breiman’s theorem states
that the split set that minimizes a convex split selection function can be found
by considering only splits in the increasing order of quantity ri. Thus, a natural
order over values ai, with respect to minimizing the split criterion, is given by the
increasing order of the quantities ri; finding a split set is equivalent to determining
some split point B, for this order, and including in the set only values ai for
which ri < B. Using this interpretation, we have a natural way to determine the
probability that the left branch is followed if value ai is observed: first, quantities
ri are determined from the training dataset, then, the split point B is determined
so that the split point selection criteria is minimized – this is achieved by simply
considering all the possible split points, no more than the size of the domain of attribute X, and picking the best one – and finally, the probability that the left branch is followed is set to P[ri < B].
The only thing that remains to be investigated is how to compute P [ri < B].
This can be easily accomplished if we observe that the quantity ri is computed
by dividing Ni,0, the number of times X = ai ∧ C = 0 in the training dataset, to
Ni, the number of times X = ai in the training dataset. If we assume that the
counts Ni,0 are binomially distributed with probability ri and count Ni, an assumption which is natural, then P[ri < B] is simply P[Ni,0 < B · Ni]. This probability can
be easily computed by expressing it in terms of the incomplete regularized beta
function Iβ(x; a, b). Putting now everything together, we have
P[T_L \mid T, X = a_i] = P[r_i < B] = P[N_{i,0} < B \cdot N_i]
                       = 1 - I_\beta(r_i;\; B \cdot N_i + 1,\; N_i - B \cdot N_i)   (5.16)
Thus, the computation of the probabilities for each possible value of X is not much
more expensive than in the traditional split point selection method.
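A sketch of Equation 5.16 using SciPy's regularized incomplete beta function; SciPy is assumed to be available, B·Ni is passed to the beta function as a real number strictly between 0 and Ni, and the choice of the threshold B is assumed to have been made already.

from scipy.special import betainc

def prob_left_discrete(n_i0, n_i, B):
    """P[left branch | X = a_i] from the observed counts, per Equation 5.16."""
    r_i = n_i0 / n_i                     # estimate of P[C = 0 | X = a_i]
    # P[N_i0 < B * N_i] = 1 - I_beta(r_i; B*N_i + 1, N_i - B*N_i)
    return 1.0 - betainc(B * n_i + 1.0, n_i - B * n_i, r_i)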
Being able to apply the analytical method only for problems with two class
labels is not as severe a restriction as it might seem since the multiple class clas-
sification problem can be locally reduced, by clustering the class labels based on
similarity, to a two class label problem (Loh & Shih, 1997).
5.4 Empirical Evaluation
In this section we present the results of an extensive experimental study of the
probabilistic decision tree version of SECRET, the regression tree system intro-
duced in Chapter 4. The purpose of the study was twofold: (a) to compare the
accuracy of PDTs on both classification and regression tasks with traditional DTs,
both implemented within SECRET, and (b) to estimate the computational penalty
incurred by the PDTs when compared to DTs. To get a base line reference point,
we include in the classification experiments QUEST (Loh & Shih, 1997), a state-
of-the-art classification tree algorithm and GUIDE (Loh, 2002), a state-of-the-art
regression tree algorithm.
The main findings of our study are:
• Accuracy of prediction. We found the PDTs on some of the learning tasks
to be significantly more accurate than DTs; the improvement is up to 11% for
classification, 36% for constant regression and 24% for linear regression. In
general, the improvement is largest for linear regression followed by constant
regression. For a single dataset, abalone, in the constant regression tests we
observed a significant degradation in accuracy.
• Computational penalty. We found PDTs to incur minimal computational
penalty when compared with DTs. In particular the increase in computa-
tional effort is 11% for constant regression trees with small training datasets
but as low as 1% for linear regression trees with large training datasets.
5.4.1 Experimental Setup
To perform the experiments whose results are reported in this section, we added
classification tree construction code and implemented the probabilistic decision tree
algorithm described in Section 5.2 to SECRET, the system described in Chapter 4.
For the experimental study we used 12 datasets for classification tests, and 14
datasets for the regression tests. We list the characteristics and source of these
datasets in Table 5.1. Since our implementation of classification trees in SECRET
supports only two class labels, we derived two-class learning tasks from the datasets
used for classification by predicting each of the class labels against all others, as
long as the support of the class label was at least 10%.
5.4.2 Experimental Results: Accuracy
For each experiment we used a random partitioning of the available data into 50%
for training, 30% for pruning and the remaining 20% for testing. We repeated each
experiment reported 100 times; we report both the average error and its standard
deviation for all experiments. In all experiments we used the re-substitution error
pruning. For all algorithms we set the minimum number of data-points to be
considered for splitting to 1%.
Classification Experiments
In Table 5.2 we report the experimental results of classification tasks listed in the
top part of Table 5.1 for QUEST, vanilla SECRET and the probabilistic classifi-
cation tree SECRET, denoted SECRET(P). The numbers after the data-set name
denote the class label in the original dataset that gets mapped to the first class
label in the modified dataset; all other class labels get mapped to the second class
Table 5.1: Datasets used in experiments; top for classification and bottom for regression.

Name       Source                             #cases   #nominal   #continuous
breast     UCI                                   683          1             9
cmc        UCI                                  1473          8             2
dna        StatLog                              2000         61             0
heart      UCI                                   270          7             7
led        UCI                                  2000          1             7
liver      UCI                                   345          1             6
pid        UCI                                   532          1             7
sat        StatLog                              2000          1            36
seg        StatLog                              2310          1            19
veh        StatLog                               846          1            18
voting     UCI                                   435         17             0
wave       UCI                                   600          1            21

abalone    UCI                                  4177          1             7
bank       DELVE                                8192          0             9
baseball   UCI                                   261          3            17
cart       Breiman et al. (Breiman et al., 1984)  40768      10             1
fried      Friedman (Friedman, 1991)           40768          0            11
house      UCI                                   506          2            12
kin8nm     DELVE                                8192          0             8
mpg        UCI                                   392          3             5
mumps      StatLib                              1523          0             4
price      UCI                                   159          1            15
puma       DELVE                                8192          0             9
stock      StatLib                               950          0            10
tae        UCI                                   151          4             2
tecator    StatLib                               240          0            11
label.
On the last column of Table 5.2, we list the improvement, in percentage, of
the probabilistic version of SECRET over the vanilla version. As it can be noticed
from these experimental results, QUEST and vanilla SECRET have about the
same accuracy across all learning tasks. SECRET(P) has higher accuracy for all
data-sets except voting, where the decrease is 2%, and for some of the tasks like
sat-1 and veh-4 the increase is as high as 10%.
Table 5.2: Classification tree experimental results.

Dataset   QUEST           SECRET          SECRET(P)        %
breast    0.051 ± 0.002   0.051 ± 0.002   0.049 ± 0.002    4
cmc-1     0.305 ± 0.003   0.316 ± 0.003   0.303 ± 0.002    4
cmc-2     0.227 ± 0.002   0.233 ± 0.002   0.225 ± 0.002    4
cmc-3     0.324 ± 0.002   0.334 ± 0.002   0.326 ± 0.002    2
dna-3     0.097 ± 0.002   0.042 ± 0.001   0.040 ± 0.000    5
dna-1     0.029 ± 0.001   0.030 ± 0.001   0.029 ± 0.001    2
dna-2     0.055 ± 0.001   0.051 ± 0.001   0.050 ± 0.001    2
heart     0.257 ± 0.006   0.224 ± 0.005   0.221 ± 0.005    1
liver     0.398 ± 0.006   0.380 ± 0.005   0.366 ± 0.005    4
pid       0.250 ± 0.003   0.253 ± 0.003   0.240 ± 0.003    5
sat-1     0.034 ± 0.001   0.025 ± 0.000   0.023 ± 0.000   10
sat-3     0.055 ± 0.001   0.047 ± 0.001   0.046 ± 0.001    3
sat-7     0.087 ± 0.001   0.077 ± 0.001   0.071 ± 0.001    7
veh-1     0.228 ± 0.003   0.231 ± 0.003   0.220 ± 0.003    5
veh-2     0.239 ± 0.002   0.246 ± 0.003   0.235 ± 0.003    4
veh-3     0.066 ± 0.002   0.053 ± 0.002   0.050 ± 0.002    6
veh-4     0.077 ± 0.002   0.080 ± 0.002   0.072 ± 0.002   11
voting    0.053 ± 0.005   0.050 ± 0.002   0.051 ± 0.002   -2
wave-1    0.185 ± 0.001   0.182 ± 0.001   0.172 ± 0.001    6
wave-2    0.166 ± 0.001   0.161 ± 0.001   0.153 ± 0.001    5
wave-3    0.159 ± 0.001   0.155 ± 0.001   0.147 ± 0.001    5
Regression Experiments
In Tables 5.3 and 5.4 we depicted the experimental results for the regression tasks
in Table 5.1(bottom) for regression trees with constants in leaves and with linear
models in leaves, respectively. In both tables, on the second column we indicated
the scaling factor for the accuracy results (the numbers on the GUIDE, SECRET,
SECRET(O) and SECRET(P) columns have to be multiplied by the scaling factor
to get the actual experimental results). On the last two columns of these tables
we report the improvement, in percentage, of vanilla SECRET versus probabilistic
regression tree SECRET without and with oblique splits, respectively.
As observed in Chapter 4, GUIDE and SECRET have comparable accuracy.
We notice here, by analyzing the results in Tables 5.3 and 5.4, that the probabilistic
regression tree version of SECRET – denoted as SECRET(P) if univariate splits
are used and SECRET(OP) if oblique splits are used – overall outperforms, in
terms of accuracy, the vanilla version. More precisely, on the constant regression
tree tests, SECRET(P) significantly outperforms SECRET by as much as 36%
on four tasks (baseball, house, stock and tecator), noticeably outperforms it on
four datasets (bank, mpg, mumps and price) and outperforms it by a small margin
on the remaining five tasks (abalone, fried, kin, puma and tae); exactly the same
trends are observed for the oblique splits version of the algorithms.
On the linear regression experiments, the results are good but not as impressive:
SECRET(P) significantly outperforms SECRET on four tasks (baseball,mpg,stock
and tecator) by as much as 20%, noticeably outperforms on two tasks (mumps and
tae), has about the same accuracy on seven tasks (bank, cart, fried, house, kin,
price and puma) and loses significantly on a single dataset (abalone); exactly
the same behavior is observed for the oblique splits version of the algorithms with
small variations in the improvements. We think that the decrease in accuracy,
especially on the abalone task, is due to the fact that SECRET with linear models in
leaves is more prone to making extrapolation errors than the version with constant
models and, especially on tasks where there is not enough fine structure to exploit,
the use of probabilistic splits can be detrimental. Smaller improvements for linear
regression trees, when compared to constant regression trees, are to be expected
since the decision surface is more rugged for the latter so smoothing provided by
probabilistic splits should help more.
Especially interesting is the result on the tecator task for both sets of experiments,
where SECRET was doing better than GUIDE in the first place but is doing much
better with probabilistic splits. This, in our opinion, suggests that the probabilis-
tic regression trees are especially useful for tasks that have fine but complicated
structure.
5.4.3 Experimental Results: Running Time
To compare the computational effort of vanilla and probabilistic SECRET, we
timed their running time (clock time) on a Dual Pentium III Xeon 933MHz running
Linux Mandrake 9.1 (only one of the processors was used). In Figures 5.1 and 5.2
we report the dependency of running times for regression on the fried task on the
size of the training dataset for the case when constant and linear regressors are
used in the leaves, respectively. Experiments for classification and other datasets
give similar qualitative results. In all experiments, the number of data-points in a
node to be considered for further splits was set to 1% of the size of the training set, and only the time to grow the tree is measured.
Table 5.3: Constant regression trees experimental results.

Dataset   ×       GUIDE          SECRET         SECRET(O)      SECRET(P)      SECRET(OP)     S/SP(%)  SO/SOP(%)
abalone   10^0    5.31 ± 0.04    5.35 ± 0.04    5.26 ± 0.04    5.23 ± 0.04    5.16 ± 0.04       2        2
bank      10^-3   2.40 ± 0.01    2.30 ± 0.01    6.34 ± 0.05    2.16 ± 0.01    5.97 ± 0.05       6        6
baseball  10^-1   2.26 ± 0.11    2.23 ± 0.11    2.54 ± 0.12    1.82 ± 0.08    2.29 ± 0.08      18       10
fried     10^0    7.30 ± 0.01    7.69 ± 0.01    6.37 ± 0.03    7.57 ± 0.01    6.37 ± 0.05       2        0
house     10^1    2.26 ± 0.08    2.74 ± 0.09    2.93 ± 0.09    2.30 ± 0.08    2.45 ± 0.07      16       16
kin       10^-2   4.22 ± 0.02    4.34 ± 0.02    3.07 ± 0.03    4.30 ± 0.02    2.97 ± 0.03       1        3
mpg       10^1    1.44 ± 0.04    1.33 ± 0.04    1.32 ± 0.03    1.26 ± 0.03    1.23 ± 0.03       5        7
mumps     10^0    1.34 ± 0.02    1.59 ± 0.02    1.56 ± 0.02    1.56 ± 0.02    1.49 ± 0.02       2        5
price     10^6    8.89 ± 0.37    8.89 ± 0.43    9.51 ± 0.43    8.29 ± 0.57    8.73 ± 0.66       7        8
puma      10^1    1.16 ± 0.01    1.23 ± 0.01    1.43 ± 0.01    1.21 ± 0.01    1.41 ± 0.01       1        2
stock     10^0    2.19 ± 0.06    2.62 ± 0.09    2.11 ± 0.05    2.14 ± 0.07    1.81 ± 0.05      18       14
tae       10^-1   6.99 ± 0.12    6.82 ± 0.10    6.83 ± 0.10    6.71 ± 0.09    6.71 ± 0.09       2        2
tecator   10^1    5.96 ± 0.25    5.62 ± 0.21    3.66 ± 0.15    3.62 ± 0.19    2.73 ± 0.12      36       25
Table 5.4: Linear regression trees experimental results.

Dataset   ×       GUIDE          SECRET         SECRET(O)      SECRET(P)      SECRET(OP)     S/SP(%)  SO/SOP(%)
abalone   10^0    4.73 ± 0.04    4.81 ± 0.04    4.88 ± 0.05    5.26 ± 0.34    5.43 ± 0.35      -9      -11
bank      10^-4   9.40 ± 0.04    9.50 ± 0.04    11.04 ± 0.05   9.41 ± 0.04    10.92 ± 0.05      1        1
baseball  10^-1   1.75 ± 0.05    2.31 ± 0.09    2.64 ± 0.09    2.08 ± 0.09    2.44 ± 0.09      10        8
cart      10^0    1.62 ± 0.00    1.12 ± 0.00    1.12 ± 0.00    1.15 ± 0.00    1.16 ± 0.00      -3       -3
fried     10^0    1.15 ± 0.00    1.20 ± 0.01    1.34 ± 0.01    1.19 ± 0.01    1.34 ± 0.01       1        0
house     10^1    2.42 ± 0.13    2.27 ± 0.07    2.53 ± 0.07    2.35 ± 0.07    2.46 ± 0.07      -4        3
kin       10^-2   2.42 ± 0.02    2.27 ± 0.02    1.69 ± 0.01    2.23 ± 0.02    1.62 ± 0.01       2        4
mpg       10^1    1.18 ± 0.03    1.73 ± 0.08    1.70 ± 0.06    1.45 ± 0.04    1.44 ± 0.04      16       15
mumps     10^0    1.03 ± 0.02    1.30 ± 0.02    1.31 ± 0.02    1.19 ± 0.02    1.19 ± 0.02       8        9
price     10^6    8.92 ± 0.48    8.19 ± 0.34    8.91 ± 0.46    8.02 ± 0.33    8.71 ± 0.42       2        2
puma      10^1    1.06 ± 0.00    1.05 ± 0.00    1.12 ± 0.01    1.05 ± 0.00    1.11 ± 0.01       1        1
stock     10^0    1.52 ± 0.07    1.40 ± 0.08    1.12 ± 0.05    1.12 ± 0.12    0.85 ± 0.02      20       24
tae       10^-1   7.28 ± 0.15    7.28 ± 0.17    7.28 ± 0.17    6.95 ± 0.11    6.95 ± 0.11       5        5
tecator   10^1    1.23 ± 0.06    1.17 ± 0.07    0.80 ± 0.05    1.01 ± 0.06    0.70 ± 0.06      13       12
We make a number of observations on these experimental results:
1. the increase in computational effort is modest – at most 11% but as little as
1%
2. as the number of data-points in the training data-set increases, the difference
between running times gets smaller; it is 11% for datasets of size 5096 but
only 4% when size is 40768 for constant regression trees.
3. the increase in running time is smaller for linear regression trees when com-
pared to the constant regression trees
5.5 Related Work
Modifications of the traditional decision trees in order to allow for imprecise splits
have been previously proposed in the literature. The most notable proposals are:
Bayesian decision trees, (Chipman et al., 1996; Chipman et al., 1998), and fuzzy
decision trees, (Sison & Chong, 1994; Guetova et al., 2002). Structurally, Bayesian decision trees coincide with the generalized classification trees introduced in Section 5.2. They differ in the way the probabilities in the nodes are
chosen: some prior distribution is assumed for the class labels and the probabili-
ties are the Bayesian posterior given the input. Usually Monte Carlo methods are
necessary to find the posterior, thus the method is quite computationally inten-
sive. By interpreting the membership functions as probabilities, the fuzzy decision
trees also coincide with the generalized classification trees. The fundamental dif-
ference between these classifiers and our proposal is the fact that the fuzziness of
the membership functions is determined using human expert knowledge about the
Size     SECRET          SECRET(P)       Slowdown(%)
 5096     1.99 ± 0.01     2.22 ± 0.01    11
10192     3.83 ± 0.01     4.14 ± 0.01     8
20384     7.62 ± 0.02     8.01 ± 0.01     5
40768    15.38 ± 0.06    16.02 ± 0.05     4

[Plot: running time (s) versus number of data-points, for vanilla and probabilistic SECRET.]

Figure 5.1: Tabular and graphical representation of running time (in seconds) of vanilla SECRET and probabilistic SECRET(P), both with constant regressors, for synthetic dataset Fried (11 continuous attributes).
Size     SECRET          SECRET(P)       Slowdown(%)
 5096     7.92 ± 0.04     8.66 ± 0.07     8
10192    15.85 ± 0.09    16.52 ± 0.13     4
20384    31.93 ± 0.25    33.12 ± 0.30     4
40768    64.18 ± 0.32    65.13 ± 0.35     1

[Plot: running time (s) versus number of data-points, for vanilla and probabilistic SECRET.]

Figure 5.2: Tabular and graphical representation of running time (in seconds) of vanilla SECRET and probabilistic SECRET(P), both with linear regressors, for synthetic dataset Fried (11 continuous attributes).
attribute variables, whereas in our method they are determined by estimating the
fluctuations of the split point.
In terms of overall architecture our proposal is closely related to hierarchical
mixtures of experts (Jordan & Jacobs, 1993; Waterhouse & Robinson, 1994) and to Bayesian network classifiers (Friedman & Goldszmidt, 1996). Hierarchical mixtures of experts consist of a tree structure that is used to determine what should
be the contribution to the final prediction of each of the experts, one for each leaf.
Bayesian network classifiers are restricted Bayesian networks (for an overview of
Bayesian networks see (Heckerman, 1995)) that have tree structure and predict
class labels. Both these models are more general – thus more expressive – but at
the same time they require much greater computational effort for learning.
The problem of discontinuity of the prediction for decision trees has been ad-
dressed by techniques like smoothing by averaging (Chaudhuri et al., 1994; Bun-
tine, 1993), that change the way the prediction is obtained: the predictions of all
the nodes on the path are combined by assigning to each one a weight and tak-
ing the final prediction to be the weighted sum. Thus, only the way the decision
tree is interpreted is changed, not the actual structure of the model. If the right
weighting along the path is chosen, no pruning is necessary, but obtaining a data
independent weighting is challenging.
By fitting a naive Bayesian classifier in the leaves of a decision tree, probabilistic
predictions can be made (Breiman et al., 1984). These types of models are usually
called class probability trees. Buntine combined smoothing by averaging with class
probability trees (Buntine, 1993).
5.6 Discussion
In this chapter we presented the probabilistic decision trees, a probabilistic version
of traditional classification and regression trees. By allowing probabilistic splits
that take into consideration the natural fluctuation of the decision boundaries, but
at the same time maintaining most of the structure and properties of decision trees,
the computational effort for building these trees is increased by a modest amount
– at most 11% in our experiments but usually much less – but the accuracy is
significantly increased, by as much as 36% for constant regression trees. Side
benefits of the proposed models are: (1) possibility to predict probability vectors,
not only class labels, (2) continuity of the prediction, and (3) the models are
probabilistic and capture natural fluctuations of the data.
Chapter 6
Conclusions
In this thesis we studied a particular type of supervised learning models: classi-
fication and regression trees. These models are very important since they are among the models preferred by end users due to their simplicity and, at the same
time, they can be constructed efficiently and have good accuracy. Our work re-
lied heavily on Statistics as a method to analyze, capture and leverage statistical
behavior of training examples.
The three distinct contributions we made in this thesis to the classification and
regression tree construction problem are:
Bias in Classification Tree Construction
We analyzed statistical properties of split criteria for split attribute variable selec-
tion, analysis that resulted in a general method to correct the bias toward variables
with larger domains. The corrective method is simply to use the p-value of the
criterion with respect to the Null Hypothesis (i.e. the attribute variables are not
correlated with the class label). We further showed how this correction can be
applied to the Gini gain by approximating the distribution of the Gini gain with
a Gamma distribution with particular parameters.
We think our work on the bias in split selection not only showed principled
ways to correct the bias but also resulted in interesting statistical characterization
of the classification tree construction algorithms.
Scalable Linear Regression Trees
Our second contribution in this thesis is a construction algorithm for regression
trees with linear models in the leaves that is as accurate as previously proposed
construction algorithms but requires substantially less computational effort – up to two orders of magnitude less on large datasets. The main idea is to use the
EM Algorithm for Gaussian mixtures and to classify the data-points based on
closeness to each of the resulting clusters to locally reduce, for each node being
constructed, the regression problem to a much simpler classification problem. As
a side benefit, the proposed construction method allows, with little extra compu-
tational effort, the construction of regression trees with oblique splits, a problem
previously considered to be very hard.
Probabilistic Decision Trees
Our last contribution is a statistically sound and elegant method to incorporate
natural fluctuations of the data into classification and regression trees. The main
idea is to allow the split points to be probabilistic instead of fixed and to deter-
mine these probabilities by analyzing the behavior of the splits under noise. Our
proposal, probabilistic decision trees, maintains the good properties of the decision
trees and avoids some of their shortcomings: the prediction is continuous, proba-
bilities can be predicted instead of class labels, the model gives extra information
about the statistical fluctuations in the data. Moreover, probabilistic decision trees
are generally more accurate than decision trees by as much as 36%, even though
they incur only small increases in the computational effort.
Appendix A
Probability and Statistics Notions
In this chapter we review some useful notions from Probability and Statistics lit-
erature to help the reader not familiar with these mathematical tools required to
understand the developments in our thesis. The intent is to focus on intuition and
usefulness rather than strict rigor in order to keep notation and explanations sim-
ple. For the interested reader, we provide references that contain a more rigorous
treatment. Throughout this introduction we assume that the reader is familiar
with elementary notions of set theory and elementary calculus.
A.1 Basic Probability Notions
In this section we give a brief overview of useful notions from Probability Theory.
A rigorous introduction can be found, for example, in (Resnick, 1999), but this
overview should suffice for the purpose of this thesis.
A.1.1 Probabilities and Random Variables
We first introduce the notion of probability and random variable and their condi-
tional counterparts, then we introduce variance and covariance and give some of
their useful properties.
Probability
Let Ω be some set and F be a set of subsets of Ω that contains ∅ and Ω and it
is closed under union, intersection and complementation with respect to Ω (i.e.
the intersection, union and complement of sets in F gives sets in F). The pair
(Ω,F) is called a probability space, and any element A ∈ F is called an event. If an
event does not contain any other event, it is called an elementary event. We call
Ω the probability space and F the set of measurable sets. With this, a mapping
P : F → [0, 1] from the set of measurable sets to the real numbers between 0 and
1 is called a probability function, in short probability, if the following properties
hold:
P [Ω] = 1 (A.1)
∀A, B, A ∩B = ∅, P [A ∪B] = P [A] + P [B] (A.2)
where A, B ∈ F are two measurable sets.
These properties are enough to show that the following properties also hold:
P [∅] = 0 (A.3)
P [A] ≤ P [B], if A ⊂ B (A.4)
P [Ā] = 1 − P [A] (A.5)
P [A−B] = P [A]− P [A ∩B] (A.6)
P [A ∪B] = P [A] + P [B]− P [A ∩B] (A.7)
where we denoted by Ā the complement of event A. P [A ∩ B] is usually replaced
by the simpler notation P [A, B], the probability that events A and B happen
together.
Two types of probabilities are interesting for the purpose of understanding this
thesis: discrete probabilities and continuous probabilities. We briefly take a look
at each, deferring further discussion until random variables are introduced.
If set Ω is a finite set and we take F = 2^Ω – the powerset of Ω, i.e. the set of all
the possible subsets – any probability over Ω is fully specified by the probabilities
of the elementary events, which are nothing else than the elements of Ω. We call
such a probability a discrete probability.
If we take Ω = R, with R the set of all real numbers, and F to be the transi-
tive closure under intersection and complement of the compact intervals over the
real numbers (the so called Borel set), any probability defined over Ω is called a
continuous probability. The notion of continuous probability is also extended to
vector spaces over the real numbers in the natural manner. We will see examples
of continuous probabilities in the next section. A continuous probability function
P can be specified by its density function p(x). Intuitively, p(x)dx is the probabil-
ity to see value x. Obviously for any x ∈ R this probability is 0, but this allows
the specification of the probabilities of intervals, that are the elementary events of
continuous probabilities:
P[[a, b]] = \int_a^b p(x)\, dx   (A.8)
where [a, b] is a compact interval of R.
Independent Events
Events A and B are called independent if:
P [A, B] = P [A] · P [B] (A.9)
The notion of independent events is very important because of this factorization
property of the probability, factorization that greatly simplifies the analysis.
Conditional Probability and Bayes Rule
The conditional probability that event B happens given that event A happened,
denoted by P [B|A], is defined as:
P[B \mid A] = \frac{P[A, B]}{P[A]}   (A.10)
The conditional probability has the following useful properties:
P [A|Ω] = P [A] (A.11)
P [B|A] = 1, if A ⊂ B (A.12)
P[B \mid A] = \frac{P[B] \cdot P[A \mid B]}{P[A]}   (A.13)
The last formula is called Bayes rule.
Also, conditional probabilities have all the properties normal probabilities have.
Random Variables
A mapping X : Ω → R is called a random variable with respect to the probability
space (Ω,F) if it has the property that:
∀a ∈ R, {ω ∈ Ω | X(ω) < a} ∈ F (A.14)
For discrete probability spaces, any mapping is a random variable. For continuous
spaces, it is enough to require the mapping to be continuous everywhere except a
finite number of points. Moreover, by combining random variables using continuous
functions, random variables are also obtained. What this amounts to is the fact
that all mappings we have to deal with in our thesis are random variables.
A random variable defined over a discrete or continuous probability space is
called discrete random variable or continuous random variable, respectively. To
specify a discrete random variable, it is enough to specify the value of the random
variable for each elementary event. For continuous random variables, we have
to specify the values of the random variable for each real number. We will see
examples of random variables in the next section.
A very important notion with respect to random variables is the notion of
expectation. Intuitively, the expectation of a random variable is its average value
with respect to a probability function. We denote the expectation of a random
variable X by EP [X]. If the probability function is understood from the context,
we use the simpler notation E[X].
For discrete random variables, the expectation is defined as:
E[X] = \sum_{\omega \in \Omega} X(\omega)\, P[\omega]   (A.15)
For convenience, we also use the alternative notation Xω instead of X(ω).
For continuous variables, the expectation of random variable X with respect
to the probability function P with density p(x) is defined as:
E[X] = \int_{-\infty}^{\infty} X(x)\, p(x)\, dx   (A.16)
A probability space together with a probability function are usually called
a distribution. Discrete distributions are usually specified by the probability of
the elementary events and continuous distributions by the density function p(x).
We say that a random variable X is distributed according to the distribution D,
denoted by X ∼ D, if the probability is specified by the distribution and X is the
identity function. This means that for discrete distributions, Ω is a subset of R in
general but a subset of N or Z most often.
Important properties of expectation are:
1. expectation of a constant:
E(a) = a
2. linearity of expectation:
E [aX] = aE [X]
E [X + Y ] = E [X] + E [Y ]
3. expected value of sum (no independence required):
E\left[\sum_i X_i\right] = \sum_i E[X_i]   (A.17)
Independent Random Variables
Two random variables X and Y , defined over the same probability space Ω,F are
independent if and only if for all x, y ∈ R, the events {ω ∈ Ω | X(ω) < x} and {ω ∈ Ω | Y(ω) < y} are independent. In this case it can be shown that:
E [XY ] = E [X] E [Y ] (A.18)
which is one of the most useful properties of expectation.
Variance and Covariance
Variance is an important property of distributions since it indicates how spread
(or localized) the distribution is. It is defined as:
\mathrm{Var}(X) = E\left[(X - E[X])^2\right] = E[X^2] - (E[X])^2   (A.19)
The covariance of two random variables X and Y is defined as:
Cov (X, Y ) = E [XY ]− E [X] E [Y ]
and gives an idea of how much random variables X and Y influence each other.
Notice that if X and Y are independent, Cov (X, Y ) = 0.
Some of the useful properties of variance are:
1. variance of a constant:
Var (a) = 0
2. scalar multiplication:
Var (aX) = a2Var (X)
3. variance of sum of random variables:
Var (X + Y ) = Var (X) + Var (Y ) + 2Cov (X, Y )
or in general
\mathrm{Var}\left(\sum_i X_i\right) = \sum_i \mathrm{Var}(X_i) + \sum_i \sum_{i' \neq i} \mathrm{Cov}(X_i, X_{i'})
4. variance of sum of independent random variables:
\mathrm{Var}\left(\sum_i X_i\right) = \sum_i \mathrm{Var}(X_i)
A very useful property of covariance is the fact that it is bilinear:
Cov (aX, Y ) = aCov (X, Y )
Cov (X, aY ) = aCov (X, Y )
Cov (X1 + X2, Y ) = Cov (X1, Y ) + Cov (X2, Y )
Cov (X, Y1 + Y2) = Cov (X, Y1) + Cov (X, Y2)
\mathrm{Cov}\left(\sum_i X_i, \sum_j Y_j\right) = \sum_i \sum_j \mathrm{Cov}(X_i, Y_j)
Also, the covariance is commutative:
Cov (X, Y ) = Cov (Y,X)
Conditional Expectation
Conditional expectation generalizes the notion of expectation. For a random variable X defined over the discrete probability (Ω, 2^Ω, P), and an event
A ∈ 2^Ω, the conditional expectation is defined as:
E[X \mid A] = \frac{\sum_{\omega \in A} X(\omega)\, P[\omega]}{P[A]}   (A.20)
For a continuous probability with density p(x), the conditional expectation is
defined as:
E[X \mid A] = \frac{\int_A X(x)\, p(x)\, dx}{P[A]}   (A.21)
Conditional expectation has all the properties normal expectation has. More-
over, since the notion of variance is entirely based on the notion of expectation,
we can define conditional variance in terms of the conditional expectation as:
\mathrm{Var}(X \mid A) = E[X^2 \mid A] - (E[X \mid A])^2
Random Vectors
The notion of random variable can be extended to vectors and, more generally, to
matrices. If X = [X1, . . . , Xn] is a random vector – a vector of random variables –
its expectation is the vector of expectations of components:
E [X] = [E [X1] , . . . , E [Xn]]
With this, the variance of random vector X is a matrix, called the covariance
matrix:
\mathrm{Var}(X) = E[X^T X] - E[X]^T E[X]
              = \begin{bmatrix}
                  \mathrm{Var}(X_1)      & \mathrm{Cov}(X_1, X_2) & \dots  & \mathrm{Cov}(X_1, X_n) \\
                  \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2)      & \dots  & \mathrm{Cov}(X_2, X_n) \\
                  \vdots                 & \vdots                 & \ddots & \vdots                 \\
                  \mathrm{Cov}(X_n, X_1) & \mathrm{Cov}(X_n, X_2) & \dots  & \mathrm{Var}(X_n)
                \end{bmatrix}   (A.22)
A.2 Basic Statistical Notions
In this section we introduce some useful statistical notions. More information can
be found, for example, in (Wilks, 1962; Pratt et al., 1995; Shao, 1999).
P-Value
The p-value of an observed value x of a random variable X is the probability that
the random variable X would take a value as high or higher than x. Mathematically,
the p-value is P [X > x]. Intuitively, a very small p-value is statistical proof that
x is not a sample of the random variable X.
A.2.1 Discrete Distributions
Binomial Distribution
The binomial distribution is the distribution of the number of times one sees heads in N flips of an asymmetric coin that has probability p of landing heads. If X is a random variable binomially distributed with parameters N and p, it can be shown
that:
E(X) = Np
Var (X) = Np(1− p)
The p-value of the binomial distribution is:
P [X > x] = I(p; x + 1, N − x)
where
I(x; a, b) = \int_0^x \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, t^{a-1} (1 - t)^{b-1}\, dt
is the incomplete regularized beta function.
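As a quick numerical check of this identity (a sketch, assuming SciPy is available):

from scipy.stats import binom
from scipy.special import betainc

N, p, x = 20, 0.3, 7
print(binom.sf(x, N, p))              # P[X > x] computed directly
print(betainc(x + 1, N - x, p))       # I(p; x+1, N-x); the two values should agree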
Multinomial Distribution
The multinomial distribution generalizes the binomial distribution to multiple di-
mensions. It has as parameters N , the number of trials, and (p1, . . . , pn), the
probabilities of an n-faced coin. The multinomial distribution is the distribution of the number of times each of the faces is observed out of N trials. If we let
X ∼ Multinomial(N, p1, . . . , pn), and denote by Xi the i-th component of X we
have:
E [Xi] = Npi
Var (Xi) = Npi(1− pi)
Cov (Xi, Xj) = −Npipj
A.2.2 Continuous Distributions
Normal (unidimensional Gaussian) Distribution
Normal distribution, denoted by N(µ, σ2), has two parameters: the mean µ and
variance σ2. σ must always be a positive quantity. Given X ∼ N(µ, σ2),
E [X] = µ (A.23)
Var (X) = σ2 (A.24)
p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}   (A.25)

P[X > x] = \frac{1}{2}\left(1 - \mathrm{Erf}\!\left(\frac{x - \mu}{\sqrt{2}\,\sigma}\right)\right)   (A.26)
where \mathrm{Erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt is the standard error function.
Gaussian Distribution
Gaussian distribution, denoted by N(µ, Σ), has two parameters: the mean vector
µ and the covariance matrix Σ. Σ has to be positive definite which means that
it always has a Choleski decomposition Σ = GG^T (Golub & Loan, 1996). For
X ∼ N(µ, Σ),
E [X] = µ (A.27)
Var (X) = Σ (A.28)
p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}   (A.29)
Gamma Distribution
The gamma distribution (with parameters α and θ) is the distribution with density
p(x) = \frac{x^{\alpha - 1}\, e^{-x/\theta}}{\Gamma(\alpha)\, \theta^{\alpha}}

and p-value

P[X > x] = 1 - \frac{\Gamma(\alpha, x/\theta)}{\Gamma(\alpha)} = 1 - Q(\alpha, x/\theta)   (A.30)
where Γ(x) is the gamma function and Γ(x, y) is the incomplete gamma function.
Q(x, y) is called the incomplete regularized gamma function.
Mean and variance of a random variable X with gamma distribution are:
E(X) = αθ (A.31)
Var(X) = αθ2 (A.32)
The χ2 distribution, discussed below, is a particular case of the gamma distribution.
Beta Distribution
The beta distribution has parameters α and β and density:
p(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}
The p-value is 1− I(x; α, β) with I the incomplete regularized beta function.
χ2-test and χ2-distribution
Having a set of random variables Xi, the χ2-test is defined as:
\chi^2 = \sum_{i=1}^{n} \frac{(X_i - E_i)^2}{E_i}   (A.33)
where Ei is the expected value of Xi under some hypothesis (that is tested using
the χ2-test).
It can be shown that asymptotically χ2 has a χ2 distribution, which coincides with a gamma distribution with parameters α = r/2 and θ = 2, where r is the number of degrees of freedom (the number of variables n minus the number of constraints between the variables).
The mean and the variance for the χ2 distribution are:
E(χ2) = r (A.34)
Var(χ2) = 2r (A.35)
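For concreteness, the sketch below (assuming SciPy is available; the counts and expectations are made up) computes the statistic of Equation A.33 and obtains its p-value both from the χ²-distribution and from the equivalent gamma distribution with α = r/2, θ = 2:

    import numpy as np
    from scipy.stats import chi2, gamma

    # Observed counts X_i and their expected values E_i under the tested hypothesis.
    observed = np.array([18, 31, 23, 28])
    expected = np.array([25.0, 25.0, 25.0, 25.0])
    stat = ((observed - expected) ** 2 / expected).sum()   # Equation A.33

    r = len(observed) - 1                                  # degrees of freedom
    print("p-value (chi2) :", chi2(df=r).sf(stat))
    print("p-value (gamma):", gamma(a=r / 2, scale=2).sf(stat))   # same value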
Appendix B
Proofs for Chapter 3
B.0.3 Variance of the Gini gain random variable
In this section we show the derivation of the formula in Equation 3.20, the variance
of the Gini gain.
Using the notation in Chapter 3, the formula for the Gini gain as a function
of sufficient statistics is (Equation 3.2)
\Delta g(T) = \frac{1}{N} \sum_{j=1}^{k} \left( \sum_{i=1}^{n} \frac{A_{ij}^2}{N_i} - \frac{S_j^2}{N} \right)    (B.1)
Using properties of variance and covariance and the connections between them (see Appendix A), we have:

Var(\Delta g(T)) = \frac{1}{N^2} \left[ Var\left( \sum_{j=1}^{k} \sum_{i=1}^{n} \frac{A_{ij}^2}{N_i} \right) + Var\left( \sum_{j=1}^{k} \frac{S_j^2}{N} \right) - 2\, Cov\left( \sum_{j=1}^{k} \sum_{i=1}^{n} \frac{A_{ij}^2}{N_i}, \sum_{j=1}^{k} \frac{S_j^2}{N} \right) \right]    (B.2)
We proceed by analyzing each term inside the square brackets individually. To simplify the formulae we write \sum_i, \sum_j, \sum_{j'}, \sum_{j' \neq j} instead of \sum_{i=1}^{n}, \sum_{j=1}^{k}, \sum_{j'=1}^{k} and \sum_{j'=1, j' \neq j}^{k}, respectively. Also, recall from Chapter 3 that for every i ∈ {1, .., n} the random vector (A_{i1}, . . . , A_{ik}) has the distribution Multinomial(N_i, p_1, . . . , p_k), that (S_1, . . . , S_k) has the distribution Multinomial(N, p_1, . . . , p_k), and that for i ≠ i′, A_{ij} is independent of A_{i′j}.
First term:

Var\left( \sum_j \sum_i \frac{A_{ij}^2}{N_i} \right)
  = \sum_i \frac{1}{N_i^2} Var\left( \sum_j A_{ij}^2 \right)
  = \sum_i \frac{1}{N_i^2} \left( \sum_j Var(A_{ij}^2) + \sum_j \sum_{j' \neq j} Cov(A_{ij}^2, A_{ij'}^2) \right)
  = \sum_j \left( \sum_i \frac{Var(A_{ij}^2)}{N_i^2} \right) + \sum_j \sum_{j' \neq j} \sum_i \left( \frac{1}{N_i^2} Cov(A_{ij}^2, A_{ij'}^2) \right)    (B.3)
Second term:

Var\left( \sum_j \frac{S_j^2}{N} \right)
  = \sum_j \frac{Var(S_j^2)}{N^2} + \sum_j \sum_{j' \neq j} \frac{1}{N^2} Cov(S_j^2, S_{j'}^2)    (B.4)
Third term:

Cov\left( \sum_j \sum_i \frac{A_{ij}^2}{N_i}, \sum_j \frac{S_j^2}{N} \right)
  = \sum_i \sum_j \sum_{j'} \frac{1}{N_i N} Cov(A_{ij}^2, S_{j'}^2)    (B.5)
Observing that:

Cov(A_{ij}^2, S_{j'}^2)
  = Cov\left( A_{ij}^2, \left( \sum_{i'} A_{i'j'} \right)^2 \right)
  = \sum_{i'} \sum_{i''} Cov(A_{ij}^2, A_{i'j'} A_{i''j'})
  = Cov(A_{ij}^2, A_{ij'}^2) + 2 \sum_{i' \neq i} Cov(A_{ij}^2, A_{ij'} A_{i'j'})
  = Cov(A_{ij}^2, A_{ij'}^2) + 2 \sum_{i' \neq i} \left( E[A_{ij}^2 A_{ij'} A_{i'j'}] - E[A_{ij}^2] E[A_{ij'} A_{i'j'}] \right)
  = Cov(A_{ij}^2, A_{ij'}^2) + 2 \left( E[S_{j'}] - E[A_{ij'}] \right) \left( E[A_{ij}^2 A_{ij'}] - E[A_{ij}^2] E[A_{ij'}] \right)    (B.6)
we get:

Cov\left( \sum_j \sum_i \frac{A_{ij}^2}{N_i}, \sum_j \frac{S_j^2}{N} \right)
  = \sum_j \sum_i \frac{Var(A_{ij}^2)}{N_i N}
    + 2 \sum_j \sum_i \frac{1}{N_i N} \left( E[S_j] - E[A_{ij}] \right) \left( E[A_{ij}^3] - E[A_{ij}^2] E[A_{ij}] \right)
    + \sum_j \sum_{j' \neq j} \sum_i \frac{1}{N_i N} Cov(A_{ij}^2, A_{ij'}^2)
    + 2 \sum_j \sum_{j' \neq j} \sum_i \frac{1}{N_i N} \left( E[S_{j'}] - E[A_{ij}] \right) \left( E[A_{ij}^2 A_{ij'}] - E[A_{ij}^2] E[A_{ij'}] \right)    (B.7)
We now focus on the sub-terms inside summations. The formulae for the sub-
terms that appear in the above equations are depicted in Table B.1.
We compute the sub-terms of each of the three main terms of the variance. We
leave the computation of the last sum over j and j′ until the end.
Table B.1: Formulae for expressions over a random vector (X_1, . . . , X_k) distributed Multinomial(N, p_1, . . . , p_k)

    Expression                                Formula
    E[X_j]                                    N p_j
    Var(X_j^2)                                N (1 - p_j) p_j (1 + 6(N-1) p_j + 2(N-1)(2N-3) p_j^2)
    E[X_j^3] - E[X_j^2] E[X_j]                N p_j (1 - p_j) (2N p_j - 2 p_j + 1)
    Cov(X_j^2, X_{j'}^2)                      -N p_j p_{j'} (1 + 2(p_j + p_{j'})(N-1) + 2(N-1)(2N-3) p_j p_{j'})
    E[X_j^2 X_{j'}] - E[X_j^2] E[X_{j'}]      -N p_j p_{j'} (2N p_j - 2 p_j + 1)
First term sub-terms:

\sum_i \frac{Var(A_{ij}^2)}{N_i^2}
  = \sum_i \frac{1}{N_i^2} N_i p_j (1-p_j) \left( (1 - 6p_j + 6p_j^2) + N_i (6p_j - 10p_j^2) + 4 N_i^2 p_j^2 \right)
  = p_j (1-p_j) \left( (1 - 6p_j + 6p_j^2) \sum_i \frac{1}{N_i} + n (6p_j - 10p_j^2) + 4 N p_j^2 \right)    (B.8)
\sum_i \frac{1}{N_i^2} Cov(A_{ij}^2, A_{ij'}^2)
  = - \sum_i \frac{1}{N_i^2} N_i p_j p_{j'} \left( 1 - 2(p_j + p_{j'}) + 6 p_j p_{j'} + N_i (2p_j + 2p_{j'} - 10 p_j p_{j'}) + 4 N_i^2 p_j p_{j'} \right)
  = - p_j p_{j'} \left( (1 - 2(p_j + p_{j'}) + 6 p_j p_{j'}) \sum_i \frac{1}{N_i} + n (2p_j + 2p_{j'} - 10 p_j p_{j'}) + 4 N p_j p_{j'} \right)    (B.9)
Second term sub-terms:

\frac{Var(S_j^2)}{N^2}
  = \frac{1}{N^2} N p_j (1-p_j) \left( (1 - 6p_j + 6p_j^2) + N (6p_j - 10p_j^2) + 4 N^2 p_j^2 \right)
  = \frac{p_j (1-p_j)}{N} \left( (1 - 6p_j + 6p_j^2) + N (6p_j - 10p_j^2) + 4 N^2 p_j^2 \right)    (B.10)
\frac{1}{N^2} Cov(S_j^2, S_{j'}^2)
  = - \frac{1}{N^2} N p_j p_{j'} \left( 1 - 2(p_j + p_{j'}) + 6 p_j p_{j'} + N (2p_j + 2p_{j'} - 10 p_j p_{j'}) + 4 N^2 p_j p_{j'} \right)
  = - \frac{p_j p_{j'}}{N} \left( 1 - 2(p_j + p_{j'}) + 6 p_j p_{j'} + N (2p_j + 2p_{j'} - 10 p_j p_{j'}) + 4 N^2 p_j p_{j'} \right)    (B.11)
Third term sub-terms:

\sum_i \frac{Var(A_{ij}^2)}{N_i N}
  = \sum_i \frac{1}{N_i N} N_i p_j (1-p_j) \left( (1 - 6p_j + 6p_j^2) + N_i (6p_j - 10p_j^2) + 4 N_i^2 p_j^2 \right)
  = \frac{p_j (1-p_j)}{N} \left( n (1 - 6p_j + 6p_j^2) + N (6p_j - 10p_j^2) + 4 p_j^2 \sum_i N_i^2 \right)    (B.12)
\sum_i \frac{1}{N_i N} \left( E[S_j] - E[A_{ij}] \right) \left( E[A_{ij}^3] - E[A_{ij}^2] E[A_{ij}] \right)
  = \sum_i \frac{1}{N_i N} (N p_j - N_i p_j)\, N_i p_j (1-p_j) (2 N_i p_j - 2 p_j + 1)
  = \frac{p_j (1-p_j)}{N} \left( (n-1)(1 - 2p_j) p_j N + 2 N^2 p_j^2 - 2 p_j^2 \sum_i N_i^2 \right)    (B.13)
\sum_i \frac{1}{N_i N} Cov(A_{ij}^2, A_{ij'}^2)
  = - \sum_i \frac{1}{N_i N} N_i p_j p_{j'} \left( 1 - 2(p_j + p_{j'}) + 6 p_j p_{j'} + N_i (2p_j + 2p_{j'} - 10 p_j p_{j'}) + 4 N_i^2 p_j p_{j'} \right)
  = - \frac{p_j p_{j'}}{N} \left( n (1 - 2(p_j + p_{j'}) + 6 p_j p_{j'}) + N (2p_j + 2p_{j'} - 10 p_j p_{j'}) + 4 p_j p_{j'} \sum_i N_i^2 \right)    (B.14)
\sum_i \frac{1}{N_i N} \left( E[S_{j'}] - E[A_{ij}] \right) \left( E[A_{ij}^2 A_{ij'}] - E[A_{ij}^2] E[A_{ij'}] \right)
  = - \sum_i \frac{1}{N_i N} (N p_{j'} - N_i p_j)\, N_i p_j p_{j'} (2 N p_j - 2 p_j + 1)
  = - \frac{p_j p_{j'}}{N} \left( n (1 - 2p_j) p_{j'} N + 2 N^2 p_j p_{j'} - N p_j (1 - 2p_j) - 2 p_j p_{j'} \sum_i N_i^2 \right)    (B.15)
In the final formula, all four sub-terms of this third term have to be multiplied by -2 (the coefficient of the covariance in Equation B.2), and the second and fourth sub-terms by an extra factor of 2 (carried over from Equation B.7).

We now put everything together by grouping all the terms that share the same factor: N, \sum_i \frac{1}{N_i}, or \sum_i N_i^2. We ignore the \frac{1}{N^2} factor in front of the expression for Var(\Delta g(T)).
We use the following identities extensively:

\sum_j \sum_{j' \neq j} p_j p_{j'} = \sum_j p_j \left( \sum_{j'} p_{j'} - p_j \right) = \sum_j p_j (1 - p_j) = 1 - \sum_j p_j^2    (B.16)

\sum_j \sum_{j' \neq j} p_j^2 p_{j'} = \sum_j p_j^2 \left( \sum_{j'} p_{j'} - p_j \right) = \sum_j p_j^2 (1 - p_j) = \sum_j p_j^2 - \sum_j p_j^3    (B.17)

\sum_j \sum_{j' \neq j} p_j p_{j'}^2 = \sum_j p_j \left( \sum_{j'} p_{j'}^2 - p_j^2 \right) = \sum_j p_j \sum_{j'} p_{j'}^2 - \sum_j p_j^3 = \sum_j p_j^2 - \sum_j p_j^3    (B.18)

\sum_j \sum_{j' \neq j} p_j^2 p_{j'}^2 = \sum_j p_j^2 \left( \sum_{j'} p_{j'}^2 - p_j^2 \right) = \left( \sum_j p_j^2 \right)^2 - \sum_j p_j^4    (B.19)
\sum_i \frac{1}{N_i} factor:

\sum_j p_j (1-p_j)(1 - 6p_j + 6p_j^2) - \sum_j \sum_{j' \neq j} p_j p_{j'} (1 - 2(p_j + p_{j'}) + 6 p_j p_{j'})
  = \sum_j \left( p_j - 7 p_j^2 + 12 p_j^3 - 6 p_j^4 \right) - \sum_j \sum_{j' \neq j} \left( p_j p_{j'} - 2 p_j^2 p_{j'} - 2 p_j p_{j'}^2 + 6 p_j^2 p_{j'}^2 \right)
  = 1 - 7 \sum_j p_j^2 + 12 \sum_j p_j^3 - 6 \sum_j p_j^4 - 1 + \sum_j p_j^2 + 2 \sum_j p_j^2 - 2 \sum_j p_j^3
    + 2 \sum_j p_j^2 - 2 \sum_j p_j^3 - 6 \left( \sum_j p_j^2 \right)^2 + 6 \sum_j p_j^4
  = -2 \sum_j p_j^2 + 8 \sum_j p_j^3 - 6 \left( \sum_j p_j^2 \right)^2    (B.20)

\sum_i N_i^2 factor. We ignore the factor -\frac{2}{N} in front of all these terms (only the third term contributes such factors):

\sum_j \left( 4 p_j^3 (1-p_j) - 4 p_j^3 (1-p_j) \right) + \sum_j \sum_{j' \neq j} \left( -4 p_j^2 p_{j'}^2 + 4 p_j^2 p_{j'}^2 \right) = 0    (B.21)

N factor:

\sum_j \left( 4 p_j^3 (1-p_j) + 4 p_j^3 (1-p_j) - 8 p_j^3 (1-p_j) \right) + \sum_j \sum_{j' \neq j} \left( -4 p_j^2 p_{j'}^2 - 4 p_j^2 p_{j'}^2 + 8 p_j^2 p_{j'}^2 \right) = 0    (B.22)
\frac{1}{N} factor:

\sum_j \left( p_j (1-p_j)(1 - 6p_j + 6p_j^2) - 2n\, p_j (1-p_j)(1 - 6p_j + 6p_j^2) \right)
  + \sum_j \sum_{j' \neq j} \left( -p_j p_{j'} (1 - 2(p_j + p_{j'}) + 6 p_j p_{j'}) + 2n\, p_j p_{j'} (1 - 2(p_j + p_{j'}) + 6 p_j p_{j'}) \right)
  = (2n - 1) \left[ - \sum_j \left( p_j - 7 p_j^2 + 12 p_j^3 - 6 p_j^4 \right) + \sum_j \sum_{j' \neq j} \left( p_j p_{j'} - 2 p_j^2 p_{j'} - 2 p_j p_{j'}^2 + 6 p_j^2 p_{j'}^2 \right) \right]
  = -(2n - 1) \left( -2 \sum_j p_j^2 + 8 \sum_j p_j^3 - 6 \left( \sum_j p_j^2 \right)^2 \right)    (B.23)

since inside the brackets we have the negation of the expression obtained for the \sum_i \frac{1}{N_i} factor.
Free factors:

\sum_j \left( n\, p_j (1-p_j)(6p_j - 10p_j^2) + n\, p_j (1-p_j)(6p_j - 10p_j^2) - 2 p_j (1-p_j)(6p_j - 10p_j^2) - 4(n-1) p_j (1-p_j)(1 - 2p_j) p_j \right)
  + \sum_j \sum_{j' \neq j} \left( -n\, p_j p_{j'} (2p_j + 2p_{j'} - 10 p_j p_{j'}) - p_j p_{j'} (2p_j + 2p_{j'} - 10 p_j p_{j'}) + 2 p_j p_{j'} (2p_j + 2p_{j'} - 10 p_j p_{j'}) + 4n\, p_j p_{j'}^2 (1 - 2p_j) - 4 p_j p_{j'}^2 (1 - 2p_j) \right)
  = (n-1) \sum_j \left( 6 p_j^2 - 16 p_j^3 + 10 p_j^4 - 4 p_j^2 + 12 p_j^3 - 8 p_j^4 \right)
    - (n-1) \sum_j \sum_{j' \neq j} \left( 2 p_j^2 p_{j'} + 2 p_j p_{j'}^2 - 10 p_j^2 p_{j'}^2 \right)
    + 4(n-1) \sum_j \sum_{j' \neq j} \left( p_j p_{j'}^2 - 2 p_j^2 p_{j'}^2 \right)
  = (n-1) \left( 2 \sum_j p_j^2 - 4 \sum_j p_j^3 + 2 \sum_j p_j^4 \right)
    - (n-1) \left( 2 \sum_j p_j^2 - 2 \sum_j p_j^3 + 2 \sum_j p_j^2 - 2 \sum_j p_j^3 - 10 \left( \sum_j p_j^2 \right)^2 + 10 \sum_j p_j^4 \right)
    + 4(n-1) \left( \sum_j p_j^2 - \sum_j p_j^3 - 2 \left( \sum_j p_j^2 \right)^2 + 2 \sum_j p_j^4 \right)
  = (n-1) \left( 2 \sum_j p_j^2 + 2 \left( \sum_j p_j^2 \right)^2 - 4 \sum_j p_j^3 \right)    (B.24)
Now, putting everything together, we get:

Var(\Delta g) = \frac{1}{N^2} \left[ (n-1) \left( 2 \sum_{j=1}^{k} p_j^2 + 2 \left( \sum_{j=1}^{k} p_j^2 \right)^2 - 4 \sum_{j=1}^{k} p_j^3 \right) + \left( \sum_{i=1}^{n} \frac{1}{N_i} - \frac{2n}{N} + \frac{1}{N} \right) \left( -2 \sum_{j=1}^{k} p_j^2 - 6 \left( \sum_{j=1}^{k} p_j^2 \right)^2 + 8 \sum_{j=1}^{k} p_j^3 \right) \right]    (B.25)
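A derivation this long is easy to get wrong, so it can be worth cross-checking numerically. The sketch below uses a hypothetical setup with made-up group sizes N_i and class probabilities p_j; as assumed throughout this section, the class counts in every group are drawn from Multinomial(N_i, p). It estimates Var(Δg) by simulation and prints it next to the value of Equation B.25 so the two can be compared:

    import numpy as np

    # Monte Carlo estimate of Var(Delta g) under the assumptions of this section:
    # every row (A_i1, ..., A_ik) ~ Multinomial(N_i, p_1, ..., p_k), independently.
    rng = np.random.default_rng(0)
    p = np.array([0.2, 0.3, 0.5])          # class probabilities p_j (made up)
    Ni = np.array([50, 80, 70, 100])       # group sizes N_i (made up)
    N, n, k = Ni.sum(), len(Ni), len(p)

    def gini_gain(A):
        """Equation B.1: (1/N) * sum_j ( sum_i A_ij^2 / N_i  -  S_j^2 / N )."""
        S = A.sum(axis=0)
        return ((A ** 2 / Ni[:, None]).sum() - (S ** 2).sum() / N) / N

    gains = np.array([gini_gain(np.vstack([rng.multinomial(m, p) for m in Ni]))
                      for _ in range(100_000)])

    s2, s3 = (p ** 2).sum(), (p ** 3).sum()
    eq_b25 = ((n - 1) * (2 * s2 + 2 * s2 ** 2 - 4 * s3)
              + ((1.0 / Ni).sum() - 2 * n / N + 1 / N)
              * (-2 * s2 - 6 * s2 ** 2 + 8 * s3)) / N ** 2
    print("empirical Var(Delta g):", gains.var())
    print("Equation B.25         :", eq_b25)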
B.0.4 Mean and Variance of the χ²-test for the two-class case

Let p_1 be the probability of the first class label and p_2 = 1 - p_1 that of the second. Using the notation of Chapter 3 and starting from the definition of χ² (Equation A.33), we have:

\chi^2 = \sum_{i=1}^{n} \left( \frac{(A_{i1} - p_1 N_i)^2}{p_1 N_i} + \frac{(A_{i2} - p_2 N_i)^2}{p_2 N_i} \right)
  = \sum_{i=1}^{n} \left( \frac{(A_{i1} - p_1 N_i)^2}{p_1 N_i} + \frac{(N_i - A_{i1} - (1 - p_1) N_i)^2}{(1 - p_1) N_i} \right)
  = \sum_{i=1}^{n} (A_{i1} - p_1 N_i)^2 \left( \frac{1}{p_1 N_i} + \frac{1}{(1 - p_1) N_i} \right)
  = \sum_{i=1}^{n} \frac{(A_{i1} - p_1 N_i)^2}{p_1 (1 - p_1) N_i}
  = \frac{1}{p_1 (1 - p_1)} \sum_{i=1}^{n} \left( \frac{A_{i1}^2}{N_i} - 2 p_1 A_{i1} + p_1^2 N_i \right)
  = \frac{1}{p_1 (1 - p_1)} \left( \sum_{i=1}^{n} \frac{A_{i1}^2}{N_i} - 2 p_1 N_1 + p_1^2 N \right)    (B.26)

where N_1 = \sum_{i=1}^{n} A_{i1}.
Using the linearity of expectation we have:

E(\chi^2) = \frac{1}{p_1 (1 - p_1)} \left( \sum_{i=1}^{n} \frac{E(A_{i1}^2)}{N_i} - 2 p_1 E(N_1) + p_1^2 N \right)
  = \frac{1}{p_1 (1 - p_1)} \left( \sum_{i=1}^{n} p_1 (1 - p_1 + N_i p_1) - 2 p_1 N p_1 + p_1^2 N \right)
  = \frac{1}{p_1 (1 - p_1)} \left( n p_1 (1 - p_1) + N p_1^2 - N p_1^2 \right)
  = n    (B.27)
Now we look at the variance:

Var(\chi^2) = \frac{1}{p_1^2 (1 - p_1)^2} Var\left( \sum_{i=1}^{n} \frac{A_{i1}^2}{N_i} - 2 p_1 N_1 + p_1^2 N \right)
  = \frac{1}{p_1^2 (1 - p_1)^2} \left( Var\left( \sum_{i=1}^{n} \frac{A_{i1}^2}{N_i} \right) + 4 p_1^2 Var(N_1) - 4 p_1 Cov\left( \sum_{i=1}^{n} \frac{A_{i1}^2}{N_i}, N_1 \right) \right)    (B.28)
The third term is the only one that needs to be analyzed separately:

Cov\left( \sum_{i=1}^{n} \frac{A_{i1}^2}{N_i}, N_1 \right)
  = E\left( \sum_{i=1}^{n} \frac{A_{i1}^2}{N_i} N_1 \right) - E\left( \sum_{i=1}^{n} \frac{A_{i1}^2}{N_i} \right) E(N_1)
  = \sum_{i=1}^{n} \frac{E(A_{i1}^2 N_1) - E(A_{i1}^2) E(N_1)}{N_i}    (B.29)
E(A_{i1}^2 N_1) = E\left( A_{i1}^2 \sum_{j=1}^{n} A_{j1} \right)
  = E\left( A_{i1}^2 \left( A_{i1} + \sum_{j=1, j \neq i}^{n} A_{j1} \right) \right)
  = E(A_{i1}^3) + E(A_{i1}^2) \sum_{j=1, j \neq i}^{n} E(A_{j1})
  = E(A_{i1}^3) - E(A_{i1}^2) E(A_{i1}) + E(A_{i1}^2) \sum_{j=1}^{n} E(A_{j1})
  = E(A_{i1}^3) - E(A_{i1}^2) E(A_{i1}) + E(A_{i1}^2) E(N_1)    (B.30)
Substituting Equation B.30 into Equation B.29 we get:

Cov\left( \sum_{i=1}^{n} \frac{A_{i1}^2}{N_i}, N_1 \right)
  = \sum_{i=1}^{n} \frac{E(A_{i1}^3) - E(A_{i1}^2) E(A_{i1})}{N_i}
  = \sum_{i=1}^{n} p_1 (1 - p_1) (1 - 2 p_1 + 2 N_i p_1)
  = p_1 (1 - p_1) \left( n (1 - 2 p_1) + 2 N p_1 \right)    (B.31)
Substituting Equation B.31 into Equation B.28 we have:

Var(\chi^2) = \frac{1}{p_1^2 (1 - p_1)^2} \Bigl( p_1 (1 - p_1) \left[ (1 - 6 p_1 + 6 p_1^2) \sum_{i=1}^{n} \frac{1}{N_i} + n (6 p_1 - 10 p_1^2) + 4 N p_1^2 \right]
    + 4 p_1^2\, p_1 (1 - p_1) N - 4 p_1\, p_1 (1 - p_1) \left[ n (1 - 2 p_1) + 2 N p_1 \right] \Bigr)
  = \frac{1}{p_1 (1 - p_1)} \left[ (1 - 6 p_1 + 6 p_1^2) \sum_{i=1}^{n} \frac{1}{N_i} + n (6 p_1 - 10 p_1^2) + 4 N p_1^2 + 4 N p_1^2 - n (4 p_1 - 8 p_1^2) - 8 N p_1^2 \right]
  = \frac{1}{p_1 (1 - p_1)} \left[ (1 - 6 p_1 + 6 p_1^2) \sum_{i=1}^{n} \frac{1}{N_i} + 2 n p_1 (1 - p_1) \right]
  = 2n + \frac{1 - 6 p_1 + 6 p_1^2}{p_1 (1 - p_1)} \sum_{i=1}^{n} \frac{1}{N_i}    (B.32)
In this last formula, if we assume that all the values of the split variable are equi-probable, so that N_i = N/n and \sum_{i=1}^{n} \frac{1}{N_i} = \frac{n^2}{N}, we get:

Var(\chi^2) = 2n + \frac{1 - 6 p_1 + 6 p_1^2}{p_1 (1 - p_1)} \cdot \frac{n^2}{N}    (B.33)
[Figure B.1: Dependency of the function (1 − 6p_1 + 6p_1²) / (p_1(1 − p_1)) on p_1, plotted for p_1 ∈ (0, 1).]
Since we have 2n variables (A_{i1}, A_{i2}, for i ∈ {1, .., n}) and n constraints (A_{i1} + A_{i2} = N_i), the number of degrees of freedom is n. However, as we have shown, the variance is not exactly the value 2n predicted by the χ²-distribution, but the expression in Equation B.33. Looking at the part of the second term that depends on p_1, depicted in Figure B.1, we notice that when p_1 is close to 0 or 1 the second term can become significant, especially if N is not much larger than n. In these situations the χ²-distribution can be a quite poor approximation to the distribution of the statistic.
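To see how large the correction can get, the short sketch below evaluates Equation B.33 against the nominal value 2n for a few illustrative (made-up) combinations of p_1, n, and N:

    # Compare the corrected variance of Equation B.33 with the nominal value 2n.
    def corrected_var(n, N, p1):
        return 2 * n + (1 - 6 * p1 + 6 * p1 ** 2) / (p1 * (1 - p1)) * n ** 2 / N

    for p1 in (0.5, 0.1, 0.01):
        for n, N in ((10, 1000), (10, 100)):
            print(f"p1={p1:<5} n={n:<3} N={N:<5} 2n={2 * n:<4} "
                  f"corrected={corrected_var(n, N, p1):8.2f}")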
Bibliography
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules.
Proc. 20th Int. Conf. Very Large Data Bases, VLDB (pp. 487–499). Morgan
Kaufmann.
Agresti, A. (1990). Categorical data analysis. John Wiley and Sons.
Alexander, W. P., & Grimshaw, S. D. (1996). Treed regression. Journal of Com-
putational and Graphical Statistics, 156–175.
Berkhin, P. (2002). Survey of clustering data mining techniques (Technical Report).
Accrue Software, San Jose, CA.
Bilmes, J. (1997). A gentle tutorial of the EM algorithm and its application to
parameter estimation for gaussian mixture and hidden markov models (Technical
Report). University of California at Berkeley.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford University
Press, New York, NY.
Brachman, R. J., Khabaza, T., Kloesgen, W., Piatetsky-Shapiro, G., & Simoudis,
E. (1996). Mining business databases. Communications of the ACM, 39.
Bradley, P. S., Fayyad, U. M., & Reina, C. (1998). Scaling clustering algorithms
to large databases. Knowledge Discovery and Data Mining (pp. 9–15).
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification
and regression trees. Belmont: Wadsworth.
Buntine, W. (1993). Learning classification trees. Artificial Intelligence frontiers
in statistics (pp. 182–201). Chapman & Hall,London.
Chaudhuri, P., Huang, M.-C., Loh, W.-Y., & Yao, R. (1994). Piecewise-polynomial
regression trees. Statistica Sinica, 4, 143–167.
Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W., & Freeman, D. (1988).
Autoclass: A Bayesian classification system. Proceedings of the Fifth Interna-
tional Conference on Machine Learning. Morgan Kaufmann.
Cheeseman, P., & Stutz, J. (1996). Bayesian classification (autoclass): Theory and
results. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy
(Eds.), Advances in knowledge discovery and data mining, chapter 6, 153–180.
AAAI/MIT Press.
Chipman, H., George, E., & McCulloch, R. (1996). Bayesian cart.
Chipman, H. A., George, E. I., & McCulloch, R. E. (1998). Bayesian CART model
search. Journal of the American Statistical Association, 93, 935–947.
Christensen, R. (1997). Log-linear models and logistic regression. Springer, 2nd edition.
Cox, L. A., Qiu, Y., & Kuehner, W. (1989). Heuristic least-cost computation
of discrete classification functions with uncertain argument values. Annals of
Operations Research, 21, 1–30.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their applica-
tion. No. 1 in Cambridge Series in Statistical and Probabilistic Mathematics.
Cambridge Univ Press.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from
incomplete data via the EM algorithm. J. R. Statist. Soc. B, 39, 185–197.
Fayyad, U., Haussler, D., & Stolorz, P. (1996). Mining scientific data. Communi-
cations of the ACM, 39.
Frank, E. (2000). Pruning decision trees and lists. Doctoral dissertation, Depart-
ment of Computer Science, University of Waikato, Hamilton, New Zealand.
Frank, E., & Witten, I. H. (1998). Using a permutation test for attribute selection
in decision trees. International Conference on Machine Learning.
Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of
Statistics, 19, 1–141 (with discussion).
Friedman, N., & Goldszmidt, M. (1996). Building classifiers using Bayesian net-
works. AAAI/IAAI, Vol. 2 (pp. 1277–1284).
Fukunaga, K. (1990). Introduction to statistical pattern recognition, second edition.
Academic Press.
Gehrke, J., Ramakrishnan, R., & Ganti, V. (1998). Rainforest – a framework for
fast decision tree construction of large datasets. Proceedings of the 24th Interna-
tional Conference on Very Large Databases (pp. 416–427). Morgan Kaufmann.
Gillo, M. (1972). MAID: A Honeywell 600 program for an automatised survey
analysis. Behavioral Science, 17, 251–252.
Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine
learning. Morgan Kaufmann.
Golub, G. H., & Loan, C. F. V. (1996). Matrix computations. Johns Hopkins.
Guetova, M., Holldobler, S., & Storr, H. (2002). Incremental fuzzy decision trees.
25th German Conference on Artificial Intelligence (KI2002).
Hand, D. (1997). Construction and assessment of classification rules. John Wiley
& Sons, Chichester, England.
Heckerman, D. (1995). A tutorial on learning with Bayesian networks.
Hefferon, J. (2003). Linear algebra. http://joshua.smcvt.edu/linearalgebra/.
Hipp, J., Guntzer, U., & Nakhaeizadeh, G. (2000). Algorithms for association rule
mining – a general survey and comparison. SIGKDD Explorations, 2, 58–64.
Hyafil, L., & Rivest, R. L. (1976). Constructing optimal binary decision trees is
np-complete. Information Processing Letters, 5, 15–17.
Inman, W. (1996). The data warehouse and data mining. Communications of the
ACM, 39.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM
Computing Surveys, 31, 264–323.
James, M. (1985). Classification algorithms. Wiley.
Jensen, D. D., & Cohen, P. R. (2000). Multiple comparisons in induction algo-
rithms. Machine Learning, 38, 309–338.
Jordan, M. I., & Jacobs, R. A. (1993). Hierarchical mixtures of experts and the
EM algorithm (Technical Report AIM-1440).
Karalic, A. (1992). Linear regression in regression tree leaves. International School
for Synthesis of Expert Knowledge, Bled, Slovenia.
Kass, G. (1980). An exploratory technique for investigating large quantities of
categorical data. Applied Statistics, 119–127.
Kohavi, R. (1995). The power of decision tables. Proceedings of the 8th European
Conference on Machine Learning (ECML). Heraclion, Crete, Greece: Springer.
Kohonen, T. (1995). Self-organizing maps. Heidelberg: Springer-Verlag.
Kononenko, I. (1995). On biases in estimating multivalued attributes.
Li, K.-C., Lue, H.-H., & Chen, C.-H. (2000). Interactive tree-structured regression
via principal Hessian directions. Journal of the American Statistical Association,
547–560.
Lim, T.-S., Loh, W.-Y., & Shih, Y.-S. (1997). An empirical comparison of decision
trees and other classification methods (Technical Report 979). Department of
Statistics, University of Wisconsin, Madison.
Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interac-
tion detection. Statistica Sinica. in press.
Loh, W.-Y., & Shih, Y.-S. (1997). Split selection methods for classification trees.
Statistica Sinica, 7.
Martin, J. K. (1997). An exact probability metric for decision tree splitting. Ma-
chine Learning, 28, 257–291.
Mehta, M., Agrawal, R., & Rissanen, J. (1996). SLIQ: A fast scalable classifier
for data mining. Proc. of the Fifth Int’l Conference on Extending Database
Technology (EDBT). Avignon, France.
Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine learning, neural
and statistical classification. Ellis Horwood.
Mingers, J. (1987). Expert systems – rule induction with statistical data. J. Opl.
Res. Soc., 38, 39–47.
Morgan, J., & Messenger, R. (1973). THAID: a sequential search program for the
analysis of nominal scale dependent variables (Technical Report). Institute for
Social Research, University of Michigan, Ann Arbor, Michigan.
Murphy, O. J., & Mccraw, R. L. (1991). Designing storage efficient decision trees.
IEEE Transactions on Computers, 40, 315–319.
Murthy, S. K. (1995). On growing better decision trees from data. Doctoral disser-
tation, Department of Computer Science, Johns Hopkins University, Baltimore,
Maryland.
Murthy, S. K. (1997). Automatic construction of decision trees from data: A
multi-disciplinary survey. Data Mining and Knowledge Discovery.
Papoulis, A. (1991). Probability, random variables and stochastic processes.
McGraw-Hill Science/Engineering/Math.
Pratt, J. W., Raiffa, H., & Schlaifer, R. (1995). Statistical decision theory. The
MIT Press.
Quinlan, J. (1993a). A case study in machine learning. Proceedings ACSC-16
Sixteenth Australian Computer Science Conference (pp. 731–7).
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Quinlan, J. R. (1992). Learning with Continuous Classes. 5th Australian Joint
Conference on Artificial Intelligence (pp. 343–348).
Quinlan, J. R. (1993b). C4.5: Programs for machine learning. Morgan Kaufman.
Resnick, S. I. (1999). A probability path. Birkhauser.
Ripley, B. (1996). Pattern recognition and neural networks. Cambridge University
Press, Cambridge.
Sarle, W. (1994). Neural networks and statistical models. Proceedings of the Nine-
teenth Annual SAS Users Group International Conference (pp. 1538–1550). SAS
Institute, Inc., Cary, NC.
Shafer, J., Agrawal, R., & Mehta, M. (1996). SPRINT: A scalable parallel classifier
for data mining. Proc. of the 22nd Int’l Conference on Very Large Databases.
Bombay, India.
Shao, J. (1999). Mathematical statistics. Springer-Verlag.
Shine, R. A., & Strous, L. (2001). Ana. http://ana.lmsal.com.
Sison, L. G., & Chong, E. K. (1994). Fuzzy modeling by induction and pruning of
decision trees. IEEE International Symposium on Intelligent Control. Columbus,
Ohio.
Sonquist, J., Baker, E., & Morgan, J. (1971). Searching for structure (Techni-
cal Report). Institute for Social Research, University of Michigan, Ann Arbor,
Michigan.
Swokowski, E. W. (1991). Calculus. PWS Publishing Co. 5th edition.
Torgo, L. (1997a). Functional models for regression tree leaves. Proc. 14th Inter-
national Conference on Machine Learning (pp. 385–393). Morgan Kaufmann.
Torgo, L. (1997b). Kernel regression trees. European Conference on Machine
Learning. Poster paper.
Torgo, L. (1998). A comparative study of reliable error estimators for pruning
regression trees. Iberoamerican Conference on Artificial Intelligence. Springer-
Verlag.
Vapnik, V. N. (1998). Statistical learning theory. Wiley-Interscience.
Waterhouse, S. R., & Robinson, A. J. (1994). Classification using hierarchical
mixtures of experts. Proceedings of the 1994 IEEE Workshop on Neural Networks
for Signal Processing IV (pp. 177–186). Long Beach, CA: IEEE Press.
Weiss, S. M., & Kulikowski, C. A. (1991). Computer systems that learn: Classi-
fication and prediction methods from statistics, neural nets, machine learning,
and expert systems. Morgan Kaufman.
White, A. P., & Liu, W. Z. (1994). Bias in information-based measures in decision
tree induction. Machine Learning, 15, 321–329.
Wilks, S. S. (1962). Mathematical statistics. John Wiley & Sons, Inc.