SCALABLE CLASSIFICATION AND REGRESSION
TREE CONSTRUCTION
A Dissertation
Presented to the Faculty of the Graduate School
of Cornell University
in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy
by
Alin Viorel Dobra
August 2003
© 2003 Alin Viorel Dobra
ALL RIGHTS RESERVED
SCALABLE CLASSIFICATION AND REGRESSION TREE CONSTRUCTION
Alin Viorel Dobra, Ph.D.
Cornell University 2003
Automating the learning process is one of the long standing goals of Artifi-
cial Intelligence and its more recent specialization, Machine Learning. Supervised
learning is a particular learning task in which the goal is to establish the con-
nection between some of the attributes of the data made available for learning,
called attribute variables, and the remaining attributes called predicted attributes.
This thesis is concerned exclusively with supervised learning using tree structured
models: classification trees for predicting discrete outputs and regression trees for
predicting continuous outputs.
In the case of classification and regression trees most methods for selecting
the split variable have a strong preference for variables with large domains. Our
first contribution is a theoretical characterization of this preference and a general
corrective method that can be applied to any split selection method. We further
show how the general corrective method can be applied to the Gini gain for discrete
variables when building k-ary splits.
In the presence of large amounts of data, efficiency of the learning algorithms
with respect to the computational effort and memory requirements becomes very
important. Our second contribution is a scalable construction algorithm for re-
gression trees with linear models in the leaves. The key to scalability is to use the
EM Algorithm for Gaussian Mixtures to locally reduce the regression problem to
a much easier classification problem.
The use of strict split predicates in classification and regression trees has unde-
sirable properties like data fragmentation and sharp decision boundaries, properties
that result in decreased accuracy. Our third contribution is the generalization of
the classic classification and regression trees by allowing probabilistic splits in a
manner that significantly improves the accuracy but, at the same time, does not
significantly increase the computational effort to build these types of models.
Biographical Sketch
Alin Dobra was born on September 20th, 1974 in Bistrita, Romania. He received
a B.S. in Computer Science from the Technical University of Cluj-Napoca, Romania,
in June 1998. He expects to receive a Ph.D. in Computer Science from Cornell
University in August 2003.
In the summers of 2001 and 2002, he interned at Bell Laboratories in Murray
Hill, NJ and worked with Minos Garofalakis and Rajeev Rastogi.
In Fall 2003, he is joining the Department of Computer and Information Science
and Engineering at the University of Florida, Gainesville, as an Assistant Professor.
To my parents
Acknowledgements
First and foremost I would like to thank my thesis adviser, Professor Johannes
Gehrke. This thesis would not have been possible without his guidance and support
for the last three years.
Many thanks and my love go to my wife, Delia, who has been there for me all
these years. I do not even want to imagine how my life would have been without
her and her support.
My eternal gratitude goes to my parents, especially my father, who put my
education above their personal comfort for more than 20 years. They encouraged
and supported my scientific curiosity from an early age even though they never
got the chance to pursue their own scientific dreams. I hope this thesis will bring
them much personal satisfaction and pride.
I met many great people during my five year stay at Cornell University. I thank
them all.
Table of Contents
1 Introduction
  1.1 Our Contributions
    1.1.1 Bias and bias correction in classification tree construction
    1.1.2 Scalable linear regression tree construction
    1.1.3 Probabilistic classification and regression trees
  1.2 Thesis Overview and Prerequisites
    1.2.1 Prerequisites
    1.2.2 Thesis Overview
2 Classification and Regression Trees
  2.1 Classification
  2.2 Classification Trees
  2.3 Building Classification Trees
    2.3.1 Tree Growing Phase
    2.3.2 Pruning Phase
  2.4 Regression Trees
3 Bias Correction in Classification Tree Construction
  3.1 Introduction
  3.2 Preliminaries
    3.2.1 Split Selection
  3.3 Bias in Split Selection
    3.3.1 A Definition of Bias
    3.3.2 Experimental Demonstration of the Bias
  3.4 Correction of the Bias
  3.5 A Tight Approximation of the Distribution of Gini Gain
    3.5.1 Computation of the Expected Value and Variance of Gini Gain
    3.5.2 Approximating the Distribution of Gini Gain with Parametric Distributions
  3.6 Experimental Evaluation
  3.7 Discussion
4 Scalable Linear Regression Trees
  4.1 Introduction
  4.2 Preliminaries: EM Algorithm for Gaussian Mixtures
  4.3 Previous solutions to linear regression tree construction
    4.3.1 Quinlan's construction algorithm
    4.3.2 Karalic's construction algorithm
    4.3.3 Chaudhuri et al.'s construction algorithm
  4.4 Scalable Linear Regression Trees
    4.4.1 Efficient Implementation of the EM Algorithm
    4.4.2 Split Point and Attribute Selection
  4.5 Empirical Evaluation
    4.5.1 Experimental testbed and methodology
    4.5.2 Experimental results: Accuracy
    4.5.3 Experimental results: Scalability
  4.6 Discussion
5 Probabilistic Decision Trees
  5.1 Introduction
  5.2 Probabilistic Decision Trees (PDTs)
    5.2.1 Generalized Decision Trees (GDTs)
    5.2.2 From Generalized Decision Trees to Probabilistic Decision Trees
    5.2.3 Speeding up Inference with PDTs
  5.3 Learning PDTs
    5.3.1 Computing sufficient statistics for PDTs
    5.3.2 Adapting DT algorithms to PDTs
    5.3.3 Split Point Fluctuations
  5.4 Empirical Evaluation
    5.4.1 Experimental Setup
    5.4.2 Experimental Results: Accuracy
    5.4.3 Experimental Results: Running Time
  5.5 Related Work
  5.6 Discussion
6 Conclusions
A Probability and Statistics Notions
  A.1 Basic Probability Notions
    A.1.1 Probabilities and Random Variables
  A.2 Basic Statistical Notions
    A.2.1 Discrete Distributions
    A.2.2 Continuous Distributions
B Proofs for Chapter 3
  B.0.3 Variance of the Gini gain random variable
  B.0.4 Mean and Variance of the χ2-test for the two-class case
List of Tables
1.1 Example Training Database
3.1 P-values at point x for parametric distributions as a function of expected value, µ, and variance, σ².
3.2 Experimental moments and predictions of moments for N = 100, n = 2, p1 = .5 obtained by Monte Carlo simulation with 1000000 repetitions. -T are theoretical approximations, -E are experimental approximations.
3.3 Experimental moments and predictions of moments for N = 100, n = 10, p1 = .5 obtained by Monte Carlo simulation with 1000000 repetitions. -T are theoretical approximations, -E are experimental approximations.
3.4 Experimental moments and predictions of moments for N = 100, n = 2, p1 = .01 obtained by Monte Carlo simulation with 1000000 repetitions. -T are theoretical approximations, -E are experimental approximations.
4.1 Accuracy on real (upper part) and synthetic (lower part) datasets of GUIDE and SECRET. In parenthesis we indicate O for orthogonal splits. The winner is in bold font if it is statistically significant and in italics otherwise.
5.1 Datasets used in experiments; top for classification and bottom for regression.
5.2 Classification tree experimental results.
5.3 Constant regression trees experimental results.
5.4 Linear regression trees experimental results.
B.1 Formulae for expressions over random vector [X1 . . . Xk] distributed Multinomial(N, p1, . . . , pk)
List of Figures
1.1 Example of classification tree for training data in Table 1.1
2.1 Classification Tree Induction Schema
3.1 Summary of notation for Chapter 3.
3.2 Contingency table for a generic dataset D and attribute variable X.
3.3 The bias of the Gini gain.
3.4 The bias of the information gain.
3.5 The bias of the gain ratio.
3.6 The bias of the p-value of the χ2-test (using a χ2-distribution).
3.7 The bias of the p-value of the G2-test (using a χ2-distribution).
3.8 Experimental p-value of Gini gain with one standard deviation error bars against p-value of theoretical gamma approximation for N = 100, n = 2 and p1 = .5.
3.9 Experimental p-value of Gini gain with one standard deviation error bars against p-value of theoretical gamma approximation for N = 100, n = 10 and p1 = .5.
3.10 Experimental p-value of Gini gain with one standard deviation error bars against p-value of theoretical gamma approximation for N = 100, n = 2 and p1 = .01.
3.11 Probability density function of Gini gain for attribute variables X1 and X2.
3.12 Bias of the p-value of the Gini gain using the gamma correction
4.1 Example of situation where average based decision is different from linear regression based decision
4.2 Example where classification on sign of residuals is unintuitive.
4.3 SECRET algorithm
4.4 Projection on Xr, Y space of training data.
4.5 Projection on Xd, Xr, Y space of same training data as in Figure 4.4
4.6 Separator hyperplane for two Gaussian distributions in two-dimensional space.
4.7 Tabular and graphical representation of running time (in seconds) of GUIDE, GUIDE with 0.01 of points as split points, SECRET and SECRET with oblique splits for synthetic dataset 3DSin (3 continuous attributes).
4.8 Tabular and graphical representation of running time (in seconds) of GUIDE, GUIDE with 0.01 of points as split points, SECRET and SECRET with oblique splits for synthetic dataset Fried (11 continuous attributes).
4.9 Running time of SECRET with linear regressors as a function of the number of attributes for dataset 3Dsin.
4.10 Accuracy of the best quadratic approximation of the running time for dataset 3Dsin.
4.11 Running time of SECRET with linear regressors as a function of the size of the 3Dsin dataset.
4.12 Accuracy as a function of learning time for SECRET and GUIDE with four sampling proportions.
5.1 Tabular and graphical representation of running time (in seconds) of vanilla SECRET and probabilistic SECRET(P), both with constant regressors, for synthetic dataset Fried (11 continuous attributes).
5.2 Tabular and graphical representation of running time (in seconds) of vanilla SECRET and probabilistic SECRET(P), both with linear regressors, for synthetic dataset Fried (11 continuous attributes).
B.1 Dependency of the function (1 − 6p1 + 6p1²)/(p1(1 − p1)) on p1.
Chapter 1
Introduction
Automating the learning process is one of the long standing goals of Artificial In-
telligence – and its more recent specialization, Machine Learning – but also the
core goal of newer research areas like Data-mining. The ability to learn from ex-
amples has found numerous applications in the scientific and business communities
– the applications include scientific experiments, medical diagnosis, fraud detec-
tion, credit approval, and target marketing (Brachman et al., 1996; Inman, 1996;
Fayyad et al., 1996) – since it allows the identification of interesting patterns or
connections either in the examples provided or, more importantly, in the natural or
artificial process that generated the data. In this thesis we are only concerned with
data presented in tabular format – we call each column an attribute and we asso-
ciate a name with it. Attributes whose domain is numerical are called numerical
attributes, whereas attributes whose domain is not numerical are called categorical
attributes. An example of a dataset about people living in a metropolitan area
is depicted in Table 1.1. Attribute “Car type” of this dataset is categorical and
attribute “Age” is numerical.
Two types of learning tasks have been identified in the literature: unsupervised
and supervised learning. They differ in the semantics associated with the attributes
of the learning examples and their goals.

Table 1.1: Example Training Database

Car Type   Driver Age   Children   Lives in Suburb?
sedan          23           0          yes
sports         31           1          no
sedan          36           1          no
truck          25           2          no
sports         30           0          no
sedan          36           0          no
sedan          25           0          yes
truck          36           1          no
sedan          30           2          yes
sedan          31           1          yes
sports         25           0          no
sedan          45           1          yes
sports         23           2          no
truck          45           0          yes
The general goal of unsupervised learning is to find interesting patterns in the
data, patterns that are useful for a higher level understanding of the structure of
the data. Types of interesting patterns that are useful are: groupings or clusters in
the data as found by various clustering algorithms (see for example the excellent
surveys (Berkhin, 2002; Jain et al., 1999)), and frequent item-sets (Agrawal &
Srikant, 1994; Hipp et al., 2000). Unsupervised learning techniques usually assign
the same role to all the attributes.
Supervised learning tries to determine a connection between a subset of the
attributes, called the inputs or attribute variables, and the dependent attribute or
outputs.1 Two of the central problems in supervised learning – the only ones we
are concerned with in this thesis – are classification and regression. Both problems
1 It is possible to have more than one dependent attribute, but for the purpose of this thesis we consider only one.
have as goal the construction of a succinct model that can predict the value of the
dependent attribute from the attribute variables. The difference between the two
tasks is the fact that the dependent attribute is categorical for classification and
numerical for regression.
Many classification and regression models have been proposed in the litera-
ture: Neural networks (Sarle, 1994; Kohonen, 1995; Bishop, 1995; Ripley, 1996),
genetic algorithms (Goldberg, 1989), Bayesian methods (Cheeseman et al., 1988;
Cheeseman & Stutz, 1996), log-linear models and other statistical methods (James,
1985; Agresti, 1990; Christensen, 1997), decision tables (Kohavi, 1995), and tree-
structured models, so-called classification and regression trees (Sonquist et al.,
1971; Gillo, 1972; Morgan & Messenger, 1973; Breiman et al., 1984). Excellent
overviews of classification and regression methods were given by Weiss and Ku-
likowski (1991), Michie et al. (1994) and Hand (1997).
Classification and regression trees – we call them collectively decision trees –
are especially attractive in a data mining environment for several reasons. First,
due to their intuitive representation, the resulting model is easy to assimilate by
humans (Breiman et al., 1984; Mehta et al., 1996). Second, decision trees are
non-parametric and thus especially suited for exploratory knowledge discovery.
Third, decision trees can be constructed relatively fast compared to other meth-
ods (Mehta et al., 1996; Shafer et al., 1996; Lim et al., 1997). And last, the
accuracy of decision trees is comparable or superior to other classification and re-
gression models (Murthy, 1995; Lim et al., 1997; Hand, 1997). In this thesis, we
restrict our attention exclusively to classification and regression trees. Figure 1.1
depicts a classification tree, built based on data in Table 1.1, that predicts if a
person lives in a suburb based on other information about the person. The
predicates that label the edges (e.g., Age ≤ 30) are called split predicates and the
attributes involved in such predicates, split attributes. In traditional classification
and regression trees only deterministic split predicates are used (i.e., given the split
predicate and the value of the attributes, we can determine whether the predicate is
true or false). Prediction with classification trees is done by navigating the tree on
true predicates until a leaf is reached, when the prediction in the leaf (YES or NO
in our example) is returned. The regions of the attribute variable space where the
decision is given by the same leaf will be called, throughout the thesis, decision
regions and the boundaries between such regions decision boundaries.
Figure 1.1: Example of classification tree for training data in Table 1.1
As can be observed from the figure, classification trees are easy to under-
stand – we immediately observe, for example, that people younger than 30 who
drive sports cars tend not to live in suburbs – and have a very compact repre-
sentation. For these reasons and others, detailed in Chapter 2, classification and
regression trees have been the subject of much research for the last two decades.
Nevertheless, at least in our opinion, more research is still necessary to fully un-
derstand and develop these types of learning models, especially from a statistical
perspective. The synergy of Statistics, Machine Learning and Data-mining meth-
ods, when applied to classification and regression tree construction, is the main
theme in this thesis. The overall goal of our work was to design learning algo-
rithms that have good statistical properties, good accuracy and require reasonable
computational effort, even for large data-sets.
1.1 Our Contributions
Three problems in classification and regression tree construction received our at-
tention:
1.1.1 Bias and bias correction in classification tree con-
struction
Often, learning algorithms have undesirable preferences, especially in the pres-
ence of large amounts of noise. In the case of classification and regression trees
most methods for selecting the split variable have a strong preference for variables
with large domains. In this thesis we provide a theoretical characterization of this
preference and a general corrective method that can be applied to any split selec-
tion criteria to remove this undesirable bias. We show how the general corrective
method can be applied to the Gini gain for discrete variables when building k-ary
splits.
1.1.2 Scalable linear regression tree construction
In the presence of large amounts of data, efficiency of the learning algorithms with
respect to the computational effort and memory requirements becomes very impor-
tant. Part of this thesis is concerned with the scalable construction of regression
trees with linear models in the leaves. The key to scalability is to use the EM
Algorithm for Gaussian Mixtures to locally – at the level of each node being built
– reduce the regression problem to a classification problem. As a side benefit,
regression trees with oblique splits (involving a linear combination of predictor
attributes instead of a single attribute) can be easily built.
1.1.3 Probabilistic classification and regression trees
The use of strict split predicates in classification and regression trees has two
undesirable consequences. First, data is fragmented at an exponential rate and
therefore decisions in leaves are based on a small number of samples. Second, deci-
sion boundaries are sharp because a single leaf is responsible for prediction. One
principled way to address both these problems is to generalize classification and
regression trees to make probabilistic decisions. More specifically, a probabilistic
model is assigned to each branch and it is used to determine the probability to
follow the branch. Instead of using a single leaf to predict the output for a given
input, all leaves are used, but their contributions are weighted by the probability
to reach them when starting from the root. In this thesis we show how to find well
motivated probabilistic models and to design scalable algorithms for building such
probabilistic classification and regression trees.
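To make the weighted-leaf prediction concrete, the following Python sketch shows how a probabilistic tree could combine all leaves; it is an illustration only – the dictionary node layout and the sigmoid-style branch model are assumptions made for this example, not the models developed in Chapter 5.

```python
import math

# Sketch of prediction with probabilistic splits: every leaf contributes,
# weighted by the probability of reaching it from the root.  The branch
# probability functions below are hypothetical placeholders.

def predict(node, x):
    """Return the probability-weighted prediction for input x."""
    if node["leaf"]:
        return node["value"]
    total = 0.0
    for child, branch_prob in node["children"]:
        # branch_prob(x) is in [0, 1]; over all children the values sum to 1.
        total += branch_prob(x) * predict(child, x)
    return total

def soft_le_30(x):
    # Soft (probabilistic) version of the strict predicate Age <= 30.
    return 1.0 / (1.0 + math.exp((x["Age"] - 30) / 2.0))

tree = {"leaf": False, "children": [
    ({"leaf": True, "value": 0.2}, soft_le_30),                  # mostly "no suburb"
    ({"leaf": True, "value": 0.8}, lambda x: 1 - soft_le_30(x)), # mostly "suburb"
]}
print(predict(tree, {"Age": 25}), predict(tree, {"Age": 45}))
```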
1.2 Thesis Overview and Prerequisites
1.2.1 Prerequisites
This thesis requires relatively few prerequisites. We assume the reader is famil-
iar with basic linear algebra and calculus, in particular the notions of equations,
vectors, matrices and Riemann integrals. Standard textbooks on Linear Algebra
(our favorite reference is (Hefferon, 2003)) and Calculus (for example (Swokowski,
1991)) suffice. The thesis relies heavily on notions of Probability Theory and
Statistics. In Appendix A we provide an overview of the necessary notions and
results for reading this thesis. Certainly, readers familiar with these topics will
find it easier to follow the presentation – especially the proofs – but the exposition
in Appendix A should suffice.
1.2.2 Thesis Overview
Chapter 2 provides a broad introduction to classification and regression tree con-
struction. In the rest of the thesis we assume that the reader is familiar with
these notions. In Chapter 3 we address the bias and bias correction problem for
classification tree construction. We provide proofs of results in this chapter in
Appendix B. Chapter 4 is dedicated to the linear regression tree construction
problem, and Chapter 5 to probabilistic decision trees. Concluding remarks and
directions of future research are given in Chapter 6.
Chapter 2
Classification and Regression
Trees
In this chapter we give an introduction to classification and regression trees. We
first start by formally introducing the classification trees and present some con-
struction algorithms for building such classifiers. Then, we explain how regression
trees differ. As mentioned in the introduction, we collectively refer to these types
of models as decision trees.
2.1 Classification
Let X1, . . . , Xm, C be random variables where Xi has domain Dom(Xi). The random
variable C has domain Dom(C) = {1, . . . , k}. We call X1, . . . , Xm attribute variables
– m is the number of such attribute variables – and C the class label or predicted
attribute.
A classifier C is a function C : Dom(X1) × · · · × Dom(Xm) → Dom(C). Let
Ω = Dom(X1) × · · · × Dom(Xm) × Dom(C) be the set of events. The underlying
assumption in classification is the fact that the generative process for the data
is probabilistic; it generates the datasets according to an unknown probability
distribution P over the set of events Ω.
For a given classifier C and a given probability distribution P over Ω we can
introduce a functional RP(C) = P[C(X1, . . . , Xm) ≠ C] called the generalization
error of the classifier C. Given some information about P in the form of a set of
samples, we would like to build a classifier that best approximates P . This leads
us to the following:
Classifier Construction Problem: Given a training dataset D of N inde-
pendent identically distributed samples from Ω, sampled according to probability
distribution P , find a function C that minimizes the functional RP (C), where P is
the probability distribution used to generate D.
In general, the classifier construction problem is very hard to solve if we allow
the classifier to be an arbitrary function. Arguments rooted in statistical learning
theory (Vapnik, 1998) suggest that we have to restrict the class of classifiers that
we allow in order to hope to solve this problem. For this reason we restrict our
attention to a special type of classifier – classification trees.
2.2 Classification Trees
A classification tree is a directed, acyclic graph T with tree shape. The root of
the tree – denoted by Root(T ) – does not have any incoming edges. Every other
node has exactly one incoming edge and may have 0, 2 or more outgoing edges.
We call a node T without outgoing edges a leaf node, otherwise T is called an
internal node. Each leaf node is labeled with one class label; each internal node T
is labeled with one attribute variable XT , called the split attribute. We denote the
class label associated with a leaf node T by Label(T ).
Each edge (T, T ′) from an internal node T to one of its children T ′ has a
predicate q(T,T ′) associated with it where q(T,T ′) involves only the splitting attribute
XT of node T . The set of predicates QT on the outgoing edges of an internal node
T must contain disjoint predicates involving the split attribute whose disjunction
is true – for any value of the split attribute exactly one of the predicates in QT is
true. We will refer to the set of predicates in QT as the splitting predicates of T.
Given a classification tree T , we can define the associated classifier
CT (x1, . . . , xm) in the following recursive manner:
C(x1, . . . , xm, T) =
    Label(T)                  if T is a leaf node
    C(x1, . . . , xm, Tj)     if T is an internal node, Xi is the label of T,
                              and q(T,Tj)(xi) = true
(2.1)

CT(x1, . . . , xm) = C(x1, . . . , xm, Root(T))     (2.2)
thus, to make a prediction, we start at the root node and navigate the tree on
true predicates until a leaf is reached, when the class label associated with it is
returned as the result of the prediction.
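The recursion of Equations 2.1 and 2.2 can be transcribed directly into code. The sketch below is purely illustrative – the dictionary representation of nodes and predicates is an assumption made for the example, not a prescribed data structure.

```python
# Sketch of the recursive classifier C_T of Equations 2.1-2.2.
# A node is either a leaf {"label": ...} or an internal node
# {"attr": name, "edges": [(predicate, child), ...]} whose predicates
# partition the domain of the split attribute.

def classify(node, x):
    if "label" in node:                  # leaf: return Label(T)
        return node["label"]
    value = x[node["attr"]]              # value of the split attribute X_T
    for predicate, child in node["edges"]:
        if predicate(value):             # exactly one predicate q_(T,T_j) is true
            return classify(child, x)
    raise ValueError("split predicates do not cover the attribute domain")

# Part of the tree of Figure 1.1 (the subtree for Age > 30 is simplified here).
tree = {"attr": "Age", "edges": [
    (lambda v: v <= 30, {"attr": "Car Type", "edges": [
        (lambda v: v == "sedan", {"label": "YES"}),
        (lambda v: v in ("sports", "truck"), {"label": "NO"})]}),
    (lambda v: v > 30, {"label": "YES"}),
]}
print(classify(tree, {"Age": 25, "Car Type": "sports"}))   # -> NO
```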
If the tree T is a well-formed classification tree (as defined above), then the
function CT () is also well defined and, by our definition, a classifier which we call
a classification tree classifier, or in short a classification tree.
Two main variations have been proposed for classification trees – both are
in extensive use. If we allow at most two branches for any of the intermediate
nodes we get a binary classification tree; otherwise we get a k-ary classification
tree. Binary classification trees were introduced by Breiman et al. (1984); k-
ary classification trees were introduced by Quinlan (1986). The main difference
between these types of trees is in what predicates are allowed for discrete attribute
variables (for continuous attribute variables both allow only predicates of the form
X > c where c is a constant). For binary classification trees, predicates of the
form X ∈ S, with S a subset of the possible values of the attribute, are allowed.
This means that for each node we have to determine both a split attribute and
a split set. For discrete attributes in k-ary classification trees, there are as many
split predicates as there are values for the attribute variable and all are of the form
X = xi, with xi one of the possible values of X. In this situation, no split set has
to be determined but the fanout of the tree can be very large.
For continuous attribute variables, both types of classification trees split a node
into two parts on predicates of the form X ≤ s and its complement X > s, where
the real number s is called the split point.
Figure 1.1 shows an example of a binary classification tree that is built to
predict the data in the dataset in Table 1.1.
2.3 Building Classification Trees
Now that we introduced the classification trees, we can formally state the classifi-
cation tree construction problem by instantiating the general classifier construction
problem:
Classification Tree Construction Problem: Given a training dataset D of
N independent identically distributed samples from Ω, sampled according to prob-
ability distribution P , find a classification tree T such that the misclassification
rate functional RP (CT ) of the corresponding classifier CT is minimized.
The main issue with solving the classification tree problem in particular and the
classifier problem in general, is the fact that the classifier has to be a good predictor
for the distribution not for the sample made available from the distribution. This
means that we cannot just simply build a classifier that is as good as possible
with respect to the available sample – it is easy to see that we can achieve zero
error with arbitrary classification trees if we do not have contradicting examples –
since the noise in the data will be learned as well. This noise learning phenomenon,
called overfitting, is one of the main problems in classification. For this reason,
classification trees are built in two phases. In the first phase a tree as large as
possible is constructed in a manner that minimizes the error with respect to some
subset of the available data – subset that we call training data. In the second
phase the remaining samples – we call them the pruning data – are used to prune
the large tree by removing subtrees in a manner that reduces the estimate of the
generalization error computed using the pruning data. We discuss each of these
two phases individually in what follows.
2.3.1 Tree Growing Phase
Several aspects of decision tree construction have been shown to be NP-hard. Some
of these are: building optimal trees from decision tables (Hyafil & Rivest, 1976),
constructing minimum cost classification tree to represent a simple function (Cox
et al., 1989), and building optimal classification trees in terms of size to store
information in a dataset (Murphy & Mccraw, 1991).
In order to deal with the complexity of choosing the split attributes and split
sets and points, most of the classification tree construction algorithms use the
Input: node T, data-partition D, split selection method V
Output: classification tree T for D rooted at T

Top-Down Classification Tree Induction Schema:
BuildTree(Node T, data-partition D, split attribute selection method V)
(1)  Apply V to D to find the split attribute X for node T.
(2)  Let n be the number of children of T.
(3)  if (T splits)
(4)      Partition D into D1, . . . , Dn and label node T with split attribute X
(5)      Create children nodes T1, . . . , Tn of T and label the edge (T, Ti)
         with predicate q(T,Ti)
(6)      foreach i ∈ {1, .., n}
(7)          BuildTree(Ti, Di, V)
(8)      endforeach
(9)  else
(10)     Label T with the majority class label of D
(11) endif

Figure 2.1: Classification Tree Induction Schema
greedy induction schema in Figure 2.1. It consists in deciding, at each step, upon
a split attribute and split set or point, if necessary, partitioning the data according
with the newly determined split predicates and recursively repeating the process
on these partitions, one for each child. The construction process at a node is
terminated when a termination condition is satisfied. The only difference between
the two types of classification trees is the fact that for k-ary trees no split set needs
to be determined for discrete attributes.
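The schema in Figure 2.1 can be written as the following skeleton. This is an illustrative sketch, not the thesis's implementation: the split selection method V is assumed to be a function that returns the chosen attribute together with the split predicates, or None when the node should become a leaf (see the stopping criteria discussed later).

```python
from collections import Counter

# Skeleton of the top-down induction schema of Figure 2.1.
# select_split(data) is assumed to return (attribute, [predicate, ...]) or None;
# each record is (x, c) with x a dict of attribute values and c the class label.
# data is assumed to be non-empty.

def build_tree(data, select_split):
    split = select_split(data)
    if split is None:                                # termination: node becomes a leaf
        majority = Counter(c for _, c in data).most_common(1)[0][0]
        return {"label": majority}
    attr, predicates = split
    node = {"attr": attr, "edges": []}
    for q in predicates:                             # partition D into D_1, ..., D_n
        part = [(x, c) for (x, c) in data if q(x[attr])]
        node["edges"].append((q, build_tree(part, select_split)))
    return node
```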
We now discuss how the split attribute and split set or point are picked at each
step in the recursive construction process, then show some common termination
conditions.
Split Attribute Selection
At each step in the recursive construction algorithm, we have to decide on what
attribute variable to split. The purpose of the split is to separate, as much as
possible, the class labels from each other. To make this intuition useful, we need
a metric that estimates how much the separation of the classes is improved when
a particular split is performed. We call such a metric a split criterion or a split
selection method.
There is extensive research in the machine learning and statistics literature on
devising split selection criteria that produce classification trees with high predictive
accuracy (Murthy, 1997). We briefly discuss here only the ones relevant for our
work.
A very popular class of split selection methods are impurity-based (Breiman
et al., 1984; Quinlan, 1986). The popularity is well deserved since studies have
shown that this class of split selection methods have high predictive accuracy (Lim
et al., 1997), and at the same time they are simple and intuitive. Each impurity-
based split selection criterion is based on an impurity function Φ(p1, . . . , pk), with pj
interpreted as the probability of seeing the class label cj. Intuitively, the impurity
function measures how impure the data is. It is required to have the following
properties (Breiman et al., 1984):
1. to be concave:

   ∂²Φ(p1, . . . , pk) / ∂pi² < 0
2. to be symmetric in all its arguments, i.e. for π a permutation,
Φ(p1, . . . , pk) = Φ(pπ1 , . . . , pπk)
3. to have unique maximum at (1/k, . . . , 1/k) when the mix of class labels is
most impure
4. to achieve the minimum for (1, 0, . . . , 0), (0, 1, 0, . . . , 0), . . . , (0, . . . , 0, 1), when
the mix of class labels is the most pure
With this, for a node T of the classification tree being built, the impurity at
node T is:
i(T) = Φ(P[C = c1 | T], . . . , P[C = ck | T])
where P [C = cj|T ] is the probability that the class label is cj given that the data
reaches node T. We defer the discussion on how these statistics are computed to
the end of this section.
Given a set Q of split predicates on attribute variable X that split a node T
into nodes T1, . . . , Tn, we can define the reduction in impurity as:
∆i(T, X, Q) = i(T) − Σ_{i=1}^{n} P[Ti | T] · i(Ti)
            = i(T) − Σ_{i=1}^{n} P[q(T,Ti)(X) | T] · i(Ti)     (2.3)
Intuitively, the reduction in impurity is the amount of purity gained by splitting,
where the impurity after split is the weighted sum of impurities of each child node.
By instantiating the impurity function we get the first two split selection cri-
teria:
Gini Gain. This split criterion was introduced by Breiman et al. (1984). By
setting the impurity function to be the Gini index:
gini(T) = 1 − Σ_{j=1}^{k} P[C = cj | T]²

and plugging it into Equation 2.3 we get the Gini gain split criterion:

GG(T, X, Q) = gini(T) − Σ_{i=1}^{n} P[q(T,Ti)(X) | T] · gini(Ti)     (2.4)
For two class labels and a binary split, the Gini gain takes the more compact form:

GGb(T, X, Q) = 2 (P[T1 | T] (P[C = c0 | T1] − P[C = c0 | T]))² / (P[T1 | T] (1 − P[T1 | T]))     (2.5)
Information Gain. This split criterion was introduced by Quinlan (1986). By
setting the impurity function to be the entropy of the dataset
entropy(T) = − Σ_{j=1}^{k} P[C = cj | T] log P[C = cj | T]

and plugging it into Equation 2.3 we get the information gain split criterion:

IG(T, X, Q) = entropy(T) − Σ_{j=1}^{n} P[q(T,Tj)(X) | T] · entropy(Tj)     (2.6)
Gain Ratio. Quinlan introduced this adjusted version of the information gain
to remove the preference of information gain for attribute variables with large
domains (Quinlan, 1986).
GR(T, X, Q) = IG(T, X, Q) / ( − Σ_{j=1}^{|Dom(X)|} P[X = xj | T] log P[X = xj | T] )     (2.7)
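Computed on the empirical estimates, the impurity-based criteria reduce to simple arithmetic on class counts. The sketch below is illustrative only – the count-vector input format is an assumption – and evaluates the Gini gain and the information gain of a proposed split.

```python
import math

# Sketch: Gini gain and information gain from class counts, i.e. the empirical
# estimates of the probabilities appearing in Equations 2.4 and 2.6.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain(parent_counts, children_counts, impurity):
    """impurity(parent) minus the weighted impurity of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * impurity(child) for child in children_counts)
    return impurity(parent_counts) - weighted

# Example: 10 records of class c0 and 10 of c1, split into two children.
parent = [10, 10]
children = [[8, 2], [2, 8]]
print(gain(parent, children, gini))      # Gini gain
print(gain(parent, children, entropy))   # information gain (in bits)
```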
Two other popular split selection methods come from the statistics literature:
The χ2 Statistic (test).
χ²(T, X) = Σ_{i=1}^{|Dom(X)|} Σ_{j=1}^{k} (P[X = xi | T] · P[C = cj | T] − P[X = xi, C = cj | T])² / (P[X = xi | T] · P[C = cj | T])     (2.8)
estimates how much the class labels depend on the value of the split attribute.
Notice that the χ2-test does not depend on the set Q of split predicates. A known
result in the statistics literature, see for example (Shao, 1999), is the fact that
the χ2-test has, asymptotically, a χ2 distribution with (|Dom(X)| − 1)(k − 1) degrees
of freedom.
The G2-statistic.
G²(T, X, Q) = 2 · NT · IG(T, X, Q) · loge 2,     (2.9)
where NT is the number of records at node T . Asymptotically, the G2-statistic has
also a χ2 distribution (Mingers, 1987). Interestingly, it is identical to the informa-
tion gain up to a multiplicative constant, which immediately gives an asymptotic
approximation for the distribution of information gain.
Note that all split criteria except χ2-test take the set of split predicates as
argument. For discrete attribute variables in k-ary classification trees, the set of
predicates is completely determined by specifying the attribute variable, but this
is not the case for discrete variables for binary trees or continuous variables. In
these last two situations we also have to determine the best split set or point in
order to evaluate how good a split on a particular attribute variable is.
Split Set Selection for Discrete Attributes
Most of the set selection methods proposed in the literature use the same split
criterion used for split attribute selection in order to evaluate all possible splits
and select the best one as the split set. This method is referred to as exhaustive search,
since all possible splits of the set of values of an attribute variable are evaluated, at
least in principle. In general, this process of finding the split set is computationally
intensive except when the domain of the split attribute and the number of class
labels are small. There is though a notable exception due to Breiman et al. (1984),
when there is an efficient algorithm to find the split set: the case when there are
only two class labels and an impurity based selection criterion is used. Since this
algorithm is relevant for some parts of our work, we describe it here.
Let us first start with the following:
Theorem 1 (Breiman et al. (1984)). Let I be a finite set, qi, ri, i ∈ I be
positive quantities and Φ(x) a concave function. For I1, I2 a partitioning of I, an
optimum of the problem
argmin_{I1,I2}  Σ_{i∈I1} qi Φ( (Σ_{i∈I1} qi ri) / (Σ_{i∈I1} qi) ) + Σ_{i∈I2} qi Φ( (Σ_{i∈I2} qi ri) / (Σ_{i∈I2} qi) )

has the property that:

∀i ∈ I1, ∀j ∈ I2 : ri < rj
A direct consequence of this theorem is an efficient algorithm for solving this type
of optimization problem: order the elements of I in increasing order of ri and
consider only the |I| ways to split the set I in this order. The correctness of the
algorithm is guaranteed by the fact that the optimum split will be among the splits
considered.
With this, setting I = Dom(X), qi = P [X = xi|T ], ri = P [C = c0|X = xi, T ]
and Φ(x) to be the Gini index or entropy for the two class labels case (both are
concave):
gini(T) = 2 P[C = c0 | T] (1 − P[C = c0 | T])

entropy(T) = −P[C = c0 | T] ln(P[C = c0 | T]) − (1 − P[C = c0 | T]) ln(1 − P[C = c0 | T])
the optimization criterion, up to a constant factor, is exactly the Gini gain or
information gain. Thus, to efficiently find the best split set, we order elements of
Dom(X) in increasing order of ri = P[C = c0 | X = xi, T] and consider splits only
in this order.
Since all the split criteria we introduced, except the χ2-test, use either the Gini
gain or information gain multiplied with a factor that does not depend on the split
set, this fast split set selection method can be used for all of them.
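For two class labels this gives a very simple procedure, sketched below for the Gini gain (an illustration with an assumed count-dictionary input, not code from the thesis): order the attribute values by the empirical P[C = c0 | X = xi, T] and evaluate only the contiguous splits of that ordering.

```python
# Sketch of the efficient split-set search for two class labels.
# counts[x] = (n0, n1): number of records with X = x and class c0 / c1 at the node.

def gini_from_counts(n0, n1):
    n = n0 + n1
    return 0.0 if n == 0 else 2.0 * (n0 / n) * (n1 / n)

def best_split_set(counts):
    # Order the attribute values by the empirical P[C = c0 | X = x].
    values = sorted(counts, key=lambda x: counts[x][0] / sum(counts[x]))
    tot0 = sum(n0 for n0, _ in counts.values())
    tot1 = sum(n1 for _, n1 in counts.values())
    parent_gini = gini_from_counts(tot0, tot1)
    best = (None, -1.0)
    left0 = left1 = 0
    for i in range(len(values) - 1):                 # contiguous splits only
        n0, n1 = counts[values[i]]
        left0, left1 = left0 + n0, left1 + n1
        right0, right1 = tot0 - left0, tot1 - left1
        w_left = (left0 + left1) / (tot0 + tot1)
        gain = parent_gini - w_left * gini_from_counts(left0, left1) \
                           - (1 - w_left) * gini_from_counts(right0, right1)
        if gain > best[1]:
            best = (set(values[:i + 1]), gain)
    return best                                      # (split set S, Gini gain)

print(best_split_set({"sedan": (5, 2), "sports": (0, 4), "truck": (1, 2)}))
```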
It is worth mentioning that Loh and Shih (1997) proposed a different technique
that consists in transforming values of discrete attributes into continuous values
and using split point selection methods for continuous attributes to obtain the split
for discrete attributes.
Split Point Selection for Continuous Attributes
Two methods have been proposed in the literature to deal with the split point
selection problem for continuous attributes: exhaustive search and Quadratic Dis-
criminant Analysis.
Exhaustive search uses the same split selection criteria as does the split at-
tribute selection method and consists in evaluating all the possible ways to split
the domain of the continuous attribute in two parts. To make the process efficient,
the available data is first sorted on the attribute variable that is being evaluated and
then traversed in order. At the same time, the sufficient statistics are incrementally
maintained and the value of the split criteria computed for each split point. This
means that the overall process requires a sort and a linear traversal with constant
processing time per value. Most of the classification tree construction algorithms
proposed in the literature use the exhaustive search.
Loh and Shih (1997) proposed using Quadratic Discriminant Analysis (QDA)
to find the split point for continuous attributes, and showed that, from the point
of view of accuracy of the produced trees, it is as good as exhaustive search. An
apparent problem with QDA is that it works only for two class label problems.
Loh and Shih (1997) suggested a solution to this problem: group the class labels
into two super-classes based on some class similarity and define QDA and the split
set problem in terms of these super-classes. This method can be used to deal with
the intractability of finding splits for categorical attributes when the number of
classes is larger than two.
We now briefly describe QDA. The idea is to approximate the distribution of
the data-points with the same class label with a normal distribution, and to take
as the split point the point between the centers of the two distributions with equi-
probability to belong to each of the distributions. More precisely, for a continuous
attribute X, the parameters of the two normal distributions – probability to belong
to the distribution αi, mean µi, and variance σi² – are determined with the formulae:

αi = P[C = ci | T]
µi = E[X | C = ci, T]
σi² = E[X² | C = ci, T] − µi²
and the equation of the split point µ is:

α1 · (1 / (σ1 √(2π))) · e^(−(µ − µ1)²/(2σ1²)) = α2 · (1 / (σ2 √(2π))) · e^(−(µ − µ2)²/(2σ2²))
which reduces to the following quadratic equation for the split point:

µ² (1/σ1² − 1/σ2²) − 2µ (µ1/σ1² − µ2/σ2²) + µ1²/σ1² − µ2²/σ2² = 2 ln(α1/α2) − ln(σ1²/σ2²)     (2.10)
If σ1² is very close to σ2², solving the second order equation is not numerically
stable. In this case it is preferable to solve the linear equation:

2µ(µ1 − µ2) = µ1² − µ2² − 2σ1² ln(α1/α2)

that is numerically solvable as long as µ1 ≠ µ2.
To compute the Gini gain of the variable X with split point µ we just need to
compute the sufficient statistics P[X ≤ µ | C = c0, T], P[X ≤ µ | C = c1, T], and

P[X ≤ µ | T] = P[C = c0 | T] P[X ≤ µ | C = c0, T] + P[C = c1 | T] P[X ≤ µ | C = c1, T]

and plug them into Equation 2.5. The probability P[X ≤ µ | C = c0, T] is nothing
but the cumulative distribution function (c.d.f.) of the normal distribution with
mean µ1 and variance σ1² at point µ. That is:

P[X ≤ µ | C = c0, T] = ∫_{x ≤ µ} (1 / (σ1 √(2π))) e^(−(x − µ1)²/(2σ1²)) dx
                     = (1/2) (1 + Erf((µ − µ1) / (σ1 √2)))
P[X ≤ µ | C = c1, T] is obtained similarly.
The advantage of QDA is the fact that no sorting of the data is necessary. The
sufficient statistics (see next section) can be easily computed in a single pass over
the data in any order and solving the quadratic equation gives the split point.
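A small numerical sketch of the QDA split-point computation is given below. It is an illustration under the assumptions above (two classes, per-class normal approximations) and falls back to the linear equation when the two variances are nearly equal; the function and variable names are not taken from the thesis.

```python
import math

# Sketch of QDA split-point selection for a continuous attribute and two classes.
# xs0, xs1: attribute values of the records with class c0 and c1 at the node
# (both assumed non-empty).

def qda_split_point(xs0, xs1, eps=1e-9):
    n0, n1 = len(xs0), len(xs1)
    a0, a1 = n0 / (n0 + n1), n1 / (n0 + n1)                  # alpha_i
    m0, m1 = sum(xs0) / n0, sum(xs1) / n1                    # mu_i
    v0 = sum(x * x for x in xs0) / n0 - m0 * m0              # sigma_i^2
    v1 = sum(x * x for x in xs1) / n1 - m1 * m1
    if abs(v0 - v1) < eps:
        # Near-equal variances: use the linear equation instead of (2.10).
        return (m0 * m0 - m1 * m1 - 2 * v0 * math.log(a0 / a1)) / (2 * (m0 - m1))
    # Quadratic equation (2.10) written as A*mu^2 + B*mu + C = 0.
    A = 1 / v0 - 1 / v1
    B = -2 * (m0 / v0 - m1 / v1)
    C = m0 * m0 / v0 - m1 * m1 / v1 - 2 * math.log(a0 / a1) + math.log(v0 / v1)
    disc = math.sqrt(B * B - 4 * A * C)
    roots = [(-B - disc) / (2 * A), (-B + disc) / (2 * A)]
    # Keep the root that lies between the two class means.
    lo, hi = min(m0, m1), max(m0, m1)
    between = [r for r in roots if lo <= r <= hi]
    return between[0] if between else roots[0]

print(qda_split_point([1.0, 2.0, 2.5, 3.0], [5.0, 6.0, 7.5, 8.0]))
```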
Stopping Criteria
The recursive process of constructing classification trees has to be eventually
stopped. The most popular stopping criteria – we use it throughout the thesis
– is to stop the growth of the tree when the number of data-points on which the
decision is based goes below a prescribed minimum. By stopping the growth when
a small amount of data is available, we avoid making statistically insignificant decisions
that are likely to be very noisy, thus wrong. Other possibilities are to stop the tree
growth when no predictive attribute can be found – this can be quite damaging to the
construction algorithm since no single variable might be predictive but a combination
of variables can be – or when the tree reaches a maximum height.
Computing the Sufficient Statistics
So far, we have seen how the classification tree construction process can be re-
duced to sufficient statistics computation for every node. Here we explain how the
sufficient statistics can be estimated using the training data. The idea is to use the
usual empirical estimates; throughout the thesis we use the symbol e= to denote
the empirical estimate of a probability or expectation. This means that:

1. for probabilities of the form P[p(Xj) | T], with p(Xj) some predicate on attribute
   variable Xj, the estimate is simply the number of data-points in the training
   dataset at node T, DT, for which the predicate p(Xj) holds, over the overall
   number of data-points in DT:

   P[p(Xj) | T] e= |{(x, c) ∈ DT : p(xj)}| / |DT|

2. for conditional probabilities of the form P[p(Xj) | C = c0, T], the estimate is:

   P[p(Xj) | C = c0, T] e= |{(x, c0) ∈ DT : p(xj)}| / |{(x, c0) ∈ DT}|

3. for expectations of functions of attributes, like E[f(Xj) | T], the estimate is
   simply the average value of the function applied to the attribute for the
   data-points in DT:

   E[f(Xj) | T] e= (Σ_{(x,c)∈DT} f(xj)) / |DT|

   where f(x) is the function whose expectation is being estimated

4. for expectations of the form E[f(Xj) | C = c0, T], the estimate is:

   E[f(Xj) | C = c0, T] e= (Σ_{(x,c0)∈DT} f(xj)) / |{(x, c0) ∈ DT}|
Note that the estimates for all these sufficient statistics can be computed in a
single pass over the data. Gehrke et al. (1998) explain how these sufficient statistics
can be efficiently computed using limited memory and secondary storage.
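As an illustration of the single pass, the sketch below (with a hypothetical record layout: pairs (x, c) where x is a dictionary of attribute values) collects the joint counts and class counts at a node, from which the empirical estimates above follow by division.

```python
from collections import Counter

# Sketch: one pass over the data D_T collects the counts needed for the
# empirical estimates above (for one attribute; repeat per attribute).

def sufficient_statistics(data, attribute):
    joint = Counter()              # co-occurrence counts of (attribute value, class)
    class_counts = Counter()       # counts of each class label
    for x, c in data:
        joint[(x[attribute], c)] += 1
        class_counts[c] += 1
    n = len(data)
    # e.g. empirical P[X = v | C = c, T] = joint[(v, c)] / class_counts[c]
    #      empirical P[C = c | T]        = class_counts[c] / n
    return joint, class_counts, n
```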
2.3.2 Pruning Phase
In this thesis we use exclusively Quinlan’s re-substitution error pruning (Quinlan,
1993a). A comprehensive overview of other pruning techniques can be found in
(Murthy, 1997).
Re-substitution error pruning consists in eliminating subtrees in order to obtain
a tree with the smallest error on the pruning set, a separate part of the data used
only for pruning. To achieve this, every node estimates its contribution to the error
on the pruning data when the majority class is used as an estimate. Then, starting
from the leaves and going upward, every node compares the contribution to the
error by using the local prediction with the smallest possible contribution to the
error of its children (if a node is not a leaf in the final tree, it has no contribution to
the error, only leaves contribute), and prunes the tree if the local error contribution
is smaller – this results in the node becoming a leaf. Since after visiting any node
the subtree rooted at it is optimally pruned – this is the invariant maintained – the
whole tree is optimally pruned when the overall process finishes.
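A sketch of this bottom-up procedure is given below; it is illustrative only (the node representation matches the earlier sketches and is an assumption), and it collapses a node into a leaf whenever its own error on the pruning set is no larger than the summed error of its optimally pruned children.

```python
from collections import Counter

# Sketch of pruning with a separate pruning set; prune_data holds the (x, c)
# records of the pruning set that are routed to this node.

def prune(node, prune_data):
    """Prune the subtree rooted at node; return (pruned node, its pruning-set error)."""
    if "label" in node:                                  # leaf: error of its own label
        return node, sum(1 for _, c in prune_data if c != node["label"])
    # Error this node would incur if it became a leaf predicting the majority class.
    counts = Counter(c for _, c in prune_data)
    majority, hits = counts.most_common(1)[0] if counts else (None, 0)
    leaf_error = len(prune_data) - hits
    children_error, new_edges = 0, []
    for predicate, child in node["edges"]:
        part = [(x, c) for (x, c) in prune_data if predicate(x[node["attr"]])]
        pruned_child, err = prune(child, part)
        children_error += err
        new_edges.append((predicate, pruned_child))
    if counts and leaf_error <= children_error:
        return {"label": majority}, leaf_error           # collapse into a leaf
    return {"attr": node["attr"], "edges": new_edges}, children_error
```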
2.4 Regression Trees
We start with the formal definition of the regression problem and we present re-
gression trees, a particular type of regressors.
We have the random variables X1, . . . , Xm as in the previous section to which
we add the random variable Y, with the real line as its domain, which we call the predicted
attribute or output.
A regressor R is a function R : Dom(X1) × · · · × Dom(Xm) → Dom(Y). Now
if we let the set of events to be Ω = Dom(X1) × · · · × Dom(Xm) × Dom(Y ) we can
define probability measures P over Ω. Using such a probability measure and some
loss function L (e.g., the square loss function L(a, x) = ‖a − x‖²) we can define the
regressor error as RP(R) = EP[L(Y, R(X1, . . . , Xm))] where EP is the expectation
with respect to probability measure P . In this thesis we use only the square loss
function. With this we have:
Regressor Construction Problem: Given a training dataset D of N inde-
pendent identically distributed samples from Ω, sampled according to probability
distribution P , find a function R that minimizes the functional RP (R).
Regression Trees, the particular type of regressors we are interested in, are the
natural generalization of classification trees for regression problems. Instead of
associating a class label to every node, a real value or a functional dependency of
some of the inputs is used.
Regression trees were introduced by Breiman et al. (1984) and implemented in
their CART system. Regression trees in CART are binary trees, have a constant
numerical value in the leaves and use the variance as a measure of impurity. Thus
the split selection measure is:
Err(T) = Σ_{i=1}^{NT} (yi − ȳT)²     (2.11)

∆Err(T) = Err(T) − Err(T1) − Err(T2)     (2.12)

where ȳT denotes the average of the predicted attribute over the data-points at node T.
The reason for using variance as the impurity measure is justified by the fact
that the best constant predictor in a node is the average of the value of the predicted
variable on the training examples that correspond to the node; the variance is thus the
mean square error of the average used as a predictor.
An alternative split criterion proposed by Breiman et al. (1984) and used also
in (Torgo, 1997a) is based on the sample variance as the impurity measure:

ErrS(T) = Var(Y | T) e= (1/NT) Err(T)

∆ErrS(T) = ErrS(T) − P[T1 | T] · ErrS(T1) − P[T2 | T] · ErrS(T2)
Interestingly, if the maximum likelihood estimate is used for all the probabilities
and expectations, as it is usually done in practice, we have the following connection
between the variance and sample variance criteria:
∆ErrS(T) e= Err(T)/NT − (NT1/NT) · (Err(T1)/NT1) − (NT2/NT) · (Err(T2)/NT2) = ∆Err(T)/NT
Due to this connection, if there are no missing values, minimizing one of the criteria
results also in minimizing the other.
For a categorical attribute variable X, minimizing ∆ErrS(T ) can be done very
efficiently since the objective function in Theorem 1 with:
Φ(x) = −x²
qi = P[X = xi | T]
ri = E[Y | X = xi, T]

is exactly this criterion up to additive and multiplicative constants that do not
influence the solution (Breiman et al., 1984). This means that we can simply order
the elements in Dom(X) in increasing order of E[Y | X = xi, T] and consider splits
only in this order. If the empirical estimates are used for qi = P[X = xi | T] and
ri = E[Y | X = xi, T], the criterion ∆Err(T) is minimized.
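For illustration, the sketch below applies this ordering to find the best binary split of a categorical attribute under the variance criterion ∆Err of Equation 2.12; the per-value lists of Y values are an assumed input format, not the thesis's data structures.

```python
# Sketch: best binary split of a categorical attribute for a regression tree.
# stats[x] = list of Y values of the records with X = x at the node.

def sse(ys):
    """Sum of squared deviations from the mean, i.e. Err(T) for the group."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_regression_split(stats):
    # Order the categories by the empirical mean E[Y | X = x_i, T].
    values = sorted(stats, key=lambda x: sum(stats[x]) / len(stats[x]))
    all_ys = [y for x in values for y in stats[x]]
    best = (None, -1.0)
    for i in range(1, len(values)):                    # contiguous splits only
        left = [y for x in values[:i] for y in stats[x]]
        right = [y for x in values[i:] for y in stats[x]]
        gain = sse(all_ys) - sse(left) - sse(right)    # Delta Err(T) of Eq. (2.12)
        if gain > best[1]:
            best = (set(values[:i]), gain)
    return best

print(best_regression_split({"sedan": [3.0, 3.5], "sports": [9.0, 8.0], "truck": [4.0]}))
```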
As in the case of classification trees, prediction is made by navigating the tree
following branches with true predicates until a leaf is reached. The numerical value
associated with the leaf is the prediction of the model.
Usually the top-down induction schema algorithm like the one in Figure 2.1 is
used to build regression trees. Pruning is used to improve the accuracy on unseen
examples like in the classification tree case. Pruning methods for classification
trees can be straightforwardly adapted for regression trees (Torgo, 1998).
For the case of re-substitution error, we simply define the contribution to the
pruning error at a node to be Err(T ). Then the pruning mechanism designed for
classification trees can be also used for regression trees.
Chapter 3
Bias Correction in Classification
Tree Construction
In this chapter we address the problem of bias in split variable selection in clas-
sification tree construction. A split criterion is unbiased if the selection of a split
variable X is based only on the strength of the dependency between X and the
class label, regardless of other characteristics (such as the size of the domain of
X); otherwise the split criterion is biased. In this chapter we make the following
four contributions: (1) We give a definition that allows us to quantify the extent
of the bias of a split criterion, (2) we show that the p-value of any split criterion
is a nearly unbiased criterion, (3) we give theoretical and experimental evidence
that the correction is successful, and (4) we demonstrate the power of our method
by correcting the bias of the Gini gain.
3.1 Introduction
Split variable selection is one of the main components of classification tree con-
struction. The quality of the split selection criterion has a major impact on the
quality (generalization, interpretability and accuracy) of the resulting tree. Many
popular split criteria suffer from bias toward attribute variables with large domains
(White & Liu, 1994; Kononenko, 1995).
Consider two attribute variables X1 and X2 whose association with the class
label is equally strong (or weak). Intuitively, a split selection criterion is unbiased
if on a random instance the criterion chooses both X1 and X2 with probability 1/2
as split variables. Unfortunately, this is usually not the case.
There are two previous meanings associated with the notion of bias in decision
tree construction. First, Quinlan (1986) calls bias the preference toward attributes
with large domains, preference that is easily observed when the dataset contains
exactly one data-point for each possible value of the attribute variable with the
large domain. In this case the attribute with large domain has the best possible
value for the entropy gain irrespective of how predictive it actually is, thus is always
preferred to an attribute with smaller domain that might be more predictive (but
not perfect). Second, White and Liu (1994) call bias the difference in distribution
of the split criteria applied to different attribute variables. In this chapter, we start
in Section 3.3 by giving a precise, quantitative definition of bias in split variable
selection. By extending the studies by White and Liu (1994) and Kononenko
(1995), we quantify in an extensive experimental study the bias in split selection
for the case that none of the attribute variables is correlated with the class label.
Section 3.4 contains the heart of our contribution in this chapter. Assume that
we use split criterion s(D, X) to calculate the quality q of attribute variable X as
split variable for training dataset D. Consider the p-value p of the value q, which
is the probability to see a value as extreme as the observed value q in the case that
X is not correlated with the class label. In Section 3.4, we prove that choosing the
variable with the lowest p-value results in a split selection criterion that is nearly
unbiased — independent of the initial split criterion s. Since previous criteria such
as χ2 and G2 (Mingers, 1987) and the permutation test (Frank & Witten, 1998) are
p-values, our theorem explains why χ2, G2, and the permutation test are virtually
unbiased. We continue in Section 3.5 by computing a tight approximation of the
distribution of Breiman’s Gini index for k-ary splits which gives us a theoretical
approximation of the p-value of the index. We demonstrate in Section 3.6 that our
new criterion is nearly unbiased.
Note that the general method that we propose is similar in spirit but different
from the work of Jensen and Cohen (2000) on the problems with multiple compar-
isons in induction algorithms. The bias in split selection for discrete variables is
not due to multiple comparisons, but rather due to inherent statistical fluctuations
as we explain in Section 3.3.
3.2 Preliminaries
In this section we introduce some more notation, useful only within this chapter and
Appendix B, which contains proofs of some results in this chapter. This notation
will allow us to keep the formulae concise and simplify the expressions for the
split selection criteria in Section 2.3.1, simplifications that facilitate theoretical
endeavors.
3.2.1 Split Selection
As in the rest of the thesis, we denote by D the training dataset consisting of
N data-points. We consider, without loss of generality, the problem of selecting
the split attribute at the root node of the classification tree. For an attribute
variable X with domain {x1, . . . , xn}, let Ni be the number of data-points in the
dataset D for which X = xi, for i ∈ {1, .., n}. As before, we denote by {c1, . . . , ck}
the domain of the class label C. Let Sj be the number of training records in
D for which C = cj, for j ∈ {1, .., k}. Denote by Aij the number of data-points for
which X = xi ∧ C = cj. Also let pj, j ∈ {1, .., k}, be the prior probability to see
class label cj in the dataset D. Obviously the following normalization constraint
holds:

$$\sum_{j=1}^{k} p_j = 1.$$

We summarize the notation in Figure 3.1.
Symbol   Meaning
D        dataset
N        size of D
X        attribute variable
xi       the i-th value in Dom(X)
Ni       number of data-points in D for which X = xi
C        class label
cj       the j-th value of the class label
Sj       number of data-points in D for which C = cj
pj       probability to observe class label cj
Aij      number of data-points in D for which X = xi and C = cj

Figure 3.1: Summary of notation for Chapter 3.
Using the notation we just introduced, we can form a contingency table for
dataset D as shown in Figure 3.2. We call the numbers on the last column and
the last row marginals since they obey the following marginal constraints:

$$\sum_{i=1}^{n} N_i = N, \qquad \sum_{i=1}^{n} A_{ij} = S_j, \qquad \sum_{j=1}^{k} A_{ij} = N_i$$
Using the contingency table we have the following maximum likelihood estimates:

$$P[X = x_i] = \frac{N_i}{N}, \quad p_j = P[C = c_j] = \frac{S_j}{N}, \quad P[C = c_j \wedge X = x_i] = \frac{A_{ij}}{N}, \quad P[C = c_j \mid X = x_i] = \frac{A_{ij}}{N_i}$$
Note that this contingency table contains the sufficient statistics for split selec-
tion criteria that make univariate splits (Gehrke et al., 1998); thus given the table,
any split selection criterion can compute the quality of X as split variable.
X      c1    ...   cj    ...   ck
x1     A11   ...   A1j   ...   A1k   | N1
...
xi     Ai1   ...   Aij   ...   Aik   | Ni
...
xn     An1   ...   Anj   ...   Ank   | Nn
       S1    ...   Sj    ...   Sk    | N

Figure 3.2: Contingency table for a generic dataset D and attribute variable X.
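As an illustration (this is not code from the thesis; function and variable names are our own), the contingency table and its marginals can be accumulated in a single pass over the data, assuming the attribute and class values have been integer-coded:

import numpy as np

def contingency_table(x, y, n, k):
    # x, y: integer-coded columns with values in {0..n-1} and {0..k-1}
    # A[i, j] counts the data-points with X = x_i and C = c_j
    A = np.zeros((n, k), dtype=np.int64)
    np.add.at(A, (x, y), 1)      # scatter-add one count per data-point
    Ni = A.sum(axis=1)           # row marginals N_i
    Sj = A.sum(axis=0)           # column marginals S_j
    return A, Ni, Sj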
We express now the split criteria introduced in Section 2.3.1 in terms of the
elements of the contingency table in Figure 3.2. We use the new formulae in the
theoretical developments in this chapter.
χ2 Statistic.

$$\chi^2 = \sum_{i=1}^{n}\sum_{j=1}^{k} \frac{(A_{ij} - E(A_{ij}))^2}{E(A_{ij})}, \qquad E(A_{ij}) = \frac{N_i S_j}{N} \qquad (3.1)$$
Gini Gain.

$$\Delta g = \sum_{i=1}^{n} P[X = x_i] \sum_{j=1}^{k} P[C = c_j \mid X = x_i]^2 - \sum_{j=1}^{k} P[C = c_j]^2 = \frac{1}{N} \sum_{j=1}^{k}\left(\sum_{i=1}^{n}\frac{A_{ij}^2}{N_i} - \frac{S_j^2}{N}\right) \qquad (3.2)$$
Information Gain.

$$IG = \sum_{j=1}^{k}\Phi(P[C = c_j]) + \sum_{i=1}^{n}\Phi(P[X = x_i]) - \sum_{j=1}^{k}\sum_{i=1}^{n}\Phi(P[C = c_j \wedge X = x_i]) = \frac{1}{N}\left(\sum_{j=1}^{k}\sum_{i=1}^{n} A_{ij}\log A_{ij} - \sum_{j=1}^{k} S_j \log S_j - \sum_{i=1}^{n} N_i \log N_i + N \log N\right), \qquad (3.3)$$

where $\Phi(p) = -p \log p$.
Gain Ratio.

$$GR = \frac{IG}{\sum_{i=1}^{n}\Phi(P[X = x_i])} = \frac{IG}{\frac{1}{N}\left(N \log N - \sum_{i=1}^{n} N_i \log N_i\right)} \qquad (3.4)$$
G2 Statistic.

$$G^2 = 2 \cdot N \cdot IG \cdot \log_e 2 \qquad (3.5)$$
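As a companion to Equations 3.1 to 3.5, the following sketch (our own illustrative code, not from the thesis) evaluates the criteria directly from a contingency table; it uses base-2 logarithms with the convention 0 log 0 = 0 and assumes that no row or column of the table is entirely zero:

import numpy as np

def split_criteria(A):
    # A: contingency table, rows = attribute values, columns = class labels
    A = np.asarray(A, dtype=float)
    N = A.sum()
    Ni = A.sum(axis=1)                      # N_i
    Sj = A.sum(axis=0)                      # S_j
    E = np.outer(Ni, Sj) / N                # expected counts N_i S_j / N
    chi2 = ((A - E) ** 2 / E).sum()         # Equation 3.1
    gini = ((A ** 2 / Ni[:, None]).sum() - (Sj ** 2).sum() / N) / N   # Equation 3.2

    def nlogn(v):                           # sum of v log2(v), skipping zero entries
        v = v[v > 0]
        return (v * np.log2(v)).sum()

    ig = (nlogn(A.ravel()) - nlogn(Sj) - nlogn(Ni) + N * np.log2(N)) / N   # Eq. 3.3
    gr = ig / ((N * np.log2(N) - nlogn(Ni)) / N)                           # Eq. 3.4
    g2 = 2.0 * N * ig * np.log(2)                                          # Eq. 3.5
    return chi2, gini, ig, gr, g2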
3.3 Bias in Split Selection
In this section we introduce formally the notion of bias in split variable selection
for the case when there is no correlation between attribute variables and the class
label (i.e., the predictor variables are not predictive of the class label). We then
show that three popular split selection criteria are biased toward attribute variables
with large domains.
3.3.1 A Definition of Bias
In order to study the behavior of the split criteria for the case where there is
no correlation between an attribute variable and the class label we formalize the
following setting:
Null Hypothesis: For every i ∈ {1, .., n}, the random vector (Ai1, . . . , Aik) has
the distribution Multinomial(Ni, p1, . . . , pk).

Intuitively, the Null Hypothesis assumes that for each value of the attribute
variable the distribution of the class label results from pure multi-face coin tossing,
thus the distribution of the class label obeys a multinomial distribution. Since
$\sum_{i=1}^{n} A_{ij} = S_j$, the random vector (S1, . . . , Sk) has the distribution
Multinomial(N, p1, . . . , pk).
We now give a formal definition of the bias. Let s be a split criterion, and let
s(D, X) be the value of s when applied to dataset D and attribute variable X. Usually
the split variable selection method compares the values of the split criterion for two
variables and picks the one with the biggest corresponding value (for the case when
smaller values of the split criterion are preferable, we can use −s as the split criterion).
Now let D be a random dataset whose values are distributed according to the Null
Hypothesis. Thus s(D, X) is now a random variable that has a given distribution
under the Null Hypothesis. Define the probability that split selection method s
chooses attribute variable X1 over X2 as follows:
Ps(X1, X2) = P [s(D, X1) > s(D, X2)] (3.6)
We can now define the bias of the split criterion between X1 and X2 as the loga-
rithmic odds of choosing X1 over X2 as a split variable when neither X1 nor X2 is
correlated with the class label, formally:
$$\mathrm{Bias}(X_1, X_2) = \log_{10}\left(\frac{P_s(X_1, X_2)}{1 - P_s(X_1, X_2)}\right) \qquad (3.7)$$
When the split criterion is unbiased, Bias(X1, X2)= log10(0.5/(1−0.5))=0. The
bias is positive if s prefers X1 over X2 and negative, otherwise. A larger value for
|Bias(X1, X2)| indicates stronger bias; we desire split criteria with values of the
bias as close to 0 as possible. Furthermore, $10^{|\mathrm{Bias}(X_1,X_2)|}$ is the odds of choosing
X1 over X2.
Our notion of bias is inherently statistical in nature. It reflects the intuition
that, under the Null Hypothesis, the split criterion should have no preference for
any attribute variable. There have been several attempts to define the bias in split
variable selection. Quinlan’s Gain Ratio (Quinlan, 1986) was designed to correct
for an anomaly in choosing the split variable that he observed, but as we will show
in Section 3.3.2, the Gain Ratio merely reduces the bias, but it does not remove
it. White and Liu (1994) point out that Quinlan’s definition of the bias is non-
statistical in nature. Their own definition of the bias is based on the equality of
the distributions of the split criterion for different attribute variables. It is harder
to use in practice since it implies a test of the equality of two distributions instead
of two numbers as in our case. Loh and Shih (1997) introduce a notion of bias
whose formalization coincides with our definition.
3.3.2 Experimental Demonstration of the Bias
We performed an extensive experimental study to demonstrate the bias according
to our definition in Section 3.3.1. We generated synthetic training datasets with
two attribute variables and two class labels. We chose n1 = 10 different values for
attribute variable X1 and n2 = 2 different values for attribute variable X2 (results
from experiments with different values for n1 and n2 were qualitatively similar). We
varied N, the size of the training database, between 10 and 1000 records in steps of
40 records, and we varied the value of the prior probability p1 of the first class label
exponentially between 0 and 1/2. Since all split criteria are invariant
to class labels permutations, the graphs depicting the bias are symmetric with
respect to p1 = 1/2; we present here only the part of the graphs with p1 ≤ 1/2.
To estimate Ps(X1, X2), we performed 100000 Monte Carlo trials in which we
generated random training databases distributed according to the Null Hypothesis
(thus the standard error of all our measurements is smaller than 0.0016). Exactly
the same random instances were used for all split criteria considered.
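The Monte Carlo procedure just described can be sketched as follows; this is illustrative code, not the implementation used for the experiments, and it assumes equi-probable attribute values with N divisible by the domain sizes:

import numpy as np

def estimate_bias(criterion, n1, n2, N, p1, trials=100_000, seed=0):
    # Estimate Bias(X1, X2) = log10(P_s / (1 - P_s)) under the Null Hypothesis
    # for two uncorrelated attributes with n1 and n2 values and class prior (p1, 1-p1).
    rng = np.random.default_rng(seed)
    p = np.array([p1, 1.0 - p1])
    wins = 0.0
    for _ in range(trials):
        A1 = rng.multinomial(N // n1, p, size=n1)   # contingency table for X1
        A2 = rng.multinomial(N // n2, p, size=n2)   # contingency table for X2
        s1, s2 = criterion(A1), criterion(A2)
        wins += 1.0 if s1 > s2 else (0.5 if s1 == s2 else 0.0)   # fair coin on ties
    P = wins / trials
    return np.log10(P / (1.0 - P))

For instance, estimate_bias(lambda A: split_criteria(A)[1], 10, 2, 1000, 0.5) would correspond to the Gini gain in the setting above.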
The results of our experiments are shown in Figures 3.3 to 3.7. Figure 3.3
shows the bias of the Gini gain, Figure 3.4 shows the bias of the information gain,
Figure 3.5 shows the bias of Quinlan’s gain ratio, Figure 3.6 shows the bias of
the p-value of the χ2-test according to the χ2 distribution (with n − 1 degrees of
freedom), and Figure 3.7 shows the bias of the p-value of the G2-statistic according
to the χ2 distribution (with n − 1 degrees of freedom). The χ2-distribution with
n − 1 degrees of freedom has to be used since there are 2n entries in the contingency
table with n marginal constraints ($\sum_{j} A_{ij} = N_i$) and the additional constraint that
Sj/N is used as an estimate for pj.
For values of p1 between 10−2 and 1/2 both the Gini gain and the information
gain show a very strong bias – X1 is chosen $10^{1.80} \approx 63$ times more often than X2.
The gain ratio is less biased – X1 is chosen $10^{0.8} \approx 6.3$ times more often than X2
– but the bias is still significant. The χ2 test is basically unbiased in this region
except for really small values of N. The G2 test is unbiased for large values of N
and for p1 close to 1/2, but the bias is noticeable in important border cases that
are relevant in practice (for example, for p1 = 10−2 and N = 1000 the bias has
value 0.20) – thus it is not always unbiased.
For values of p1 between 10−4 and 10−2, the Gini gain, the information gain,
and the gain ratio start having less and less bias. Both the χ2 and the G2 criterion
have a preference toward attribute variables with few values; the bias gets as low
as −0.2 (corresponding to odds of 1.58) when p1N = 1. The maximum negative bias
corresponds to datasets that, on average, have a single data-point with class label
c1. We postpone the explanation of this phenomenon to Section 3.6. The region
where p1 < 10−4 corresponds to training datasets where no record has class label c1
(all records have the same class label). In this case the Gini gain, the information
gain and the gain ratio have value 0, whereas the χ2 and G2 criteria have value 1,
irrespective of the split variable. In our experiments, we tossed a fair coin in the
case that the split criterion returns the same value for variables X1 and X2, thus
the bias is 0.
Figure 3.3: The bias of the Gini gain.
Figure 3.4: The bias of the information gain.
Figure 3.5: The bias of the gain ratio.
Figure 3.6: The bias of the p-value of the χ2-test (using a χ2-distribution).
Figure 3.7: The bias of the p-value of the G2-test (using a χ2-distribution).
One surprising insight from our experiments is that the bias for the Gini gain,
the information gain and the gain ratio does not vanish as N gets arbitrarily large. In
addition, the bias does not seem to have a significant dependency on p1 as long as
all entries in the contingency table for variable X1 are moderately populated (i.e.
Aij > 5).
We obtained similar results for different variable domain sizes. The bias is more
pronounced for bigger differences in the domain sizes of X1 and X2. When the
domain sizes are identical (n1 = n2), the bias is almost nonexistent. These facts
suggest that the size of the domain is the most significant factor that influences
the behavior under the Null Hypothesis. This conclusion, for the Gini gain, is
supported by the theoretical developments in Section 3.5.
The bias for the Gini gain, the information gain and the gain ratio comes
from the fact that under the Null Hypothesis the value of the split criterion is
not exactly zero. The values of s(D, X) monotonically increase with n, the size
of the domain of X, and variables with more values tend to have larger values
of s(D, X) due to the fact that the counts in the contingency table have bigger
statistical fluctuations. The bias is thus due to the inability of traditional split
criteria to account for these normal statistical fluctuations. In the next section,
we will present a technique that allows us to remove the bias from existing split
criteria.
3.4 Correction of the Bias
In this section we present a general method for removing the bias of any arbitrary
split criterion. We will use this result in Section 3.5 to show how the bias of Gini
gain can be corrected.
Let us first give some intuition behind our method. We observed in Section
3.3.2 that the expected value of several split criteria under the Null Hypothesis
depends on the size of the domain of the attribute variables. Assume that the
value of the split criterion for variable X1 (X2) is v1 (v2). Instead of comparing v1
and v2 directly and incurring a biased variable selection, we compute the p-value
p1, the probability that the value of the split criterion is as extreme as v1 under
the Null Hypothesis, and similarly the p-value p2. We then choose the split attribute
variable with the lower p-value, since it is the least likely to be good by chance.
The remainder of this section is devoted to a formal proof that the p-value of any
split criterion is virtually unbiased under the assumption that the Null Hypothesis
holds.
Let X and XH be two identically distributed random variables (i.e., ∀x ∈ Dom(X):
$p_x \stackrel{\mathrm{def}}{=} P[X = x] = P[X_H = x]$), and let Y and YH be two other identically
distributed random variables. Define $C_X(x) \stackrel{\mathrm{def}}{=} 1 - P[X_H \le x] = 1 - \sum_{x' \le x} p_{x'}$,
and similarly define $C_Y(y)$. Let $\Delta \stackrel{\mathrm{def}}{=} \max_x P[X = x] + \max_y P[Y = y]$.

Lemma 1. Let X and Y be two independent discrete random variables. Then for all γ ∈ [0, 1]:

$$P[C_X(X) < C_Y(Y)] + \gamma P[C_X(X) = C_Y(Y)] \in \left(\tfrac{1}{2} - \Delta,\ \tfrac{1}{2} + \Delta\right).$$
Proof:

$$P = P[C_X(X) < C_Y(Y)] = \sum_{x}\sum_{y} I\big(C_X(x) < C_Y(y)\big)\, P[X = x \wedge Y = y] = \sum_{x}\sum_{y} I\!\left(\sum_{x' \le x} p_{x'} > \sum_{y' \le y} p_{y'}\right) p_x p_y \qquad (3.8)$$

where I(·) is the indicator function.

For a fixed x, let $y_x$ be the biggest value of $y \in \mathrm{Dom}(Y)$ such that $\sum_{x' \le x} p_{x'} > \sum_{y' \le y} p_{y'}$ still holds. Equation 3.8 then can be rewritten as follows:

$$P = \sum_{x} p_x \sum_{y \le y_x} p_y \qquad (3.9)$$

On the other hand, using the definition of $y_x$ we have:

$$\sum_{x' \le x} p_{x'} - \sum_{y \le y_x} p_y > 0, \quad \text{and} \qquad (3.10)$$

$$\sum_{x' \le x} p_{x'} - \sum_{y \le y_x} p_y \le p_{y_x^{+}} \le \max_y p_y \qquad (3.11)$$

where $y_x^{+}$ is the smallest $y \in \mathrm{Dom}(Y)$ such that $y > y_x$. The previous two inequalities imply:

$$\sum_{x' \le x} p_{x'} - \max_y p_y \le \sum_{y \le y_x} p_y < \sum_{x' \le x} p_{x'} \qquad (3.12)$$

Multiplying by $p_x$, summing up over x and using the result of Equation 3.9 we obtain:

$$\sum_{x} p_x \sum_{x' \le x} p_{x'} - \max_y p_y \le P < \sum_{x} p_x \sum_{x' \le x} p_{x'} \qquad (3.13)$$

To further simplify, let X′ be a random variable with the same distribution as X. We then obtain:

$$\sum_{x} p_x \sum_{x' \le x} p_{x'} = \sum_{x}\sum_{x'} I(x' \le x)\, P[X' = x' \wedge X = x] = P[X' \le X] = \frac{1}{2} - \frac{1}{2} P[X' = X] = \frac{1}{2} - \frac{1}{2}\sum_{x} p_x^2 \in \left(\frac{1}{2} - \frac{1}{2}\max_x p_x,\ \frac{1}{2} - \frac{1}{2}\min_x p_x\right) \qquad (3.14)$$

Using Equations 3.13 and 3.14 we get:

$$\frac{1}{2} - \max_x p_x - \max_y p_y < P < \frac{1}{2} - \frac{1}{2}\min_x p_x \qquad (3.15)$$

If the roles of x and y are switched we obtain:

$$\frac{1}{2} - \Delta < P[C_X(X) > C_Y(Y)] < \frac{1}{2} - \frac{1}{2}\min_y p_y, \qquad (3.16)$$

which implies:

$$\frac{1}{2} + \frac{1}{2}\min_y p_y < P[C_X(X) \le C_Y(Y)] < \frac{1}{2} + \Delta, \qquad (3.17)$$

thus

$$\frac{1}{2} - \Delta < P + \gamma P[C_X(X) = C_Y(Y)] < \frac{1}{2} + \Delta \qquad (3.18)$$
According to Lemma 1, if the p-value of a criterion is used to decide the split
variable, the probability of choosing one variable over another is not farther than
∆ from 1/2. In practice, even for small sizes of the dataset, any split criterion has
a huge number of possible values and the probability of the criterion taking on
any particular value is much smaller than 1/2, thus ∆ ≈ 0, and the p-value is a
virtually unbiased split criterion. Hence, we can guarantee that the p-value of any
split criterion s is unbiased under the Null Hypothesis, as long as s does not take
on any single value with a significant probability.
Using the above fact, a general method to remove the bias in split variable
selection consists of two steps. First, we compute the value v of the original split
criteria s on the given dataset. Second, we compute the p-value of v under the
Null Hypothesis and we select the variable with the smallest p-value as the split
variable.
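A minimal sketch of this two-step correction (illustrative names of our own; pvalue_fn stands for whichever of the four approaches discussed below is used to obtain the p-value):

def select_split_variable(tables, criterion, pvalue_fn):
    # tables: one contingency table per candidate attribute variable
    best_attr, best_p = None, 2.0
    for attr, A in tables.items():
        v = criterion(A)        # step 1: value of the original split criterion
        p = pvalue_fn(v, A)     # step 2: its p-value under the Null Hypothesis
        if p < best_p:
            best_attr, best_p = attr, p
    return best_attr, best_p    # the attribute with the smallest p-value wins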
The above method requires the computation of the p-value of a given criterion.
We can distinguish four ways in which this can be accomplished:
• Exact computation. Use the exact distribution of the split criterion. The
main drawback is that this is almost always very expensive; it is reasonably
efficient only for n = 2 and k = 2 (Martin, 1997).
• Bootstrapping. Use Monte Carlo simulations with random instances gen-
erated according to the Null Hypothesis. This method was used by Frank
and Witten (1998) to correct the bias; its main drawback is the high cost of
the Monte Carlo simulations.
• Asymptotic approximation. Use an asymptotic approximation of the dis-
tribution of the split criterion (e.g., use the χ2-distribution to approximate
the χ2-test (Kass, 1980) and the G2-statistic (Mingers, 1987)). Approxi-
mations often work well in practice, but they can be inaccurate for border
conditions (e.g., small entries in the contingency table).
• Tight approximation. Use a tight approximation of the distribution of
the criterion with a nice distribution. While conceptually superior to the
previous three methods, such tight approximations might be hard to find.
3.5 A Tight Approximation of the distribution of Gini
Gain
In this section we give a tight approximation of the distribution of the Gini gain,
and we use our approximation in combination with Lemma 1 to compose a new
unbiased split criterion.
Note that the p-value of the Gini gain can be well approximated if the cumu-
lative distribution function (c.d.f) of the distribution of the Gini gain can be well
approximated (since the p-value=1-c.d.f.).
We experimentally observed by looking at the shape of the probability distri-
bution function of the Gini gain that it is very close to the shape of distributions
from the Gamma family. Our experiments, reported later in this section, show
that the Gamma distribution – using the expected value and variance of the Gini
gain as distribution parameters (which completely specify a Gamma distribution)
– is a very good approximation of the distribution of the Gini gain. In the re-
mainder of this section, we will show how to compute exactly the expected value
and the variance of the Gini gain under the Null Hypothesis. We then use these
values for a tight approximation of the Gini gain with parametric distributions,
the best of which is the approximation with the Gamma distribution.
3.5.1 Computation of the Expected Value and Variance of
Gini Gain
As mentioned in Section 3.2, the contingency table described in Figure 3.2 contains
the sufficient statistics for the computation of the Gini gain. Thus in order to
analyze the distribution of the Gini gain, it is sufficient to look at the distribution
of the entries in the contingency table. Consider a given fixed set of parameters N,
n, k, Ni, i ∈ {1, .., n}, and pj, j ∈ {1, .., k}. If the Null Hypothesis holds, the Aij's
and Sj’s are random variables with multinomial distributions (see Section 3.3).
Using the definition of the Gini gain (Equation 3.2), linearity of expectation, the
fact that the Aij’s and Sj’s have multinomial distributions, and the normalization
constraint on the pj’s, we get the following formula for the expectation E(∆g) of
the Gini gain under the Null Hypothesis:
$$E(\Delta g) = \frac{1}{N}\sum_{j=1}^{k}\left(\sum_{i=1}^{n}\frac{E(A_{ij}^2)}{N_i} - \frac{E(S_j^2)}{N}\right) = \frac{1}{N}\sum_{j=1}^{k}\left(\sum_{i=1}^{n}\frac{N_i p_j(1 - p_j + N_i p_j)}{N_i} - \frac{N p_j(1 - p_j + N p_j)}{N}\right) = \frac{1}{N}\sum_{j=1}^{k}\Big(n p_j(1-p_j) + N p_j^2 - p_j(1-p_j) - N p_j^2\Big) = \frac{n-1}{N}\left(1 - \sum_{j=1}^{k} p_j^2\right) \qquad (3.19)$$
so the expected value of the Gini gain is indeed linear in n as observed by White
and Liu (1994).
Computation of the variance Var(∆g) of the Gini gain results in the following
formula (see Appendix B for the proof):

$$\mathrm{Var}(\Delta g) = \frac{1}{N^2}\left[(n-1)\left(2\sum_{j=1}^{k} p_j^2 + 2\Big(\sum_{j=1}^{k} p_j^2\Big)^2 - 4\sum_{j=1}^{k} p_j^3\right) + \left(\sum_{i=1}^{n}\frac{1}{N_i} - \frac{2n}{N} + \frac{1}{N}\right)\left(-2\sum_{j=1}^{k} p_j^2 - 6\Big(\sum_{j=1}^{k} p_j^2\Big)^2 + 8\sum_{j=1}^{k} p_j^3\right)\right] \qquad (3.20)$$
For the two class label problem, the expected value and variance take the simpler form:

$$E(\Delta g) = \frac{2(n-1)\, p_1(1-p_1)}{N} \qquad (3.21)$$

$$\mathrm{Var}(\Delta g) = \frac{4 p_1(1-p_1)}{N^2}\left[(1 - 6p_1 + 6p_1^2)\left(\sum_{i=1}^{n}\frac{1}{N_i} - \frac{2n}{N} + \frac{1}{N}\right) + 2(n-1)\, p_1(1-p_1)\right] \qquad (3.22)$$
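A short sketch implementing Equations 3.19 and 3.20 (illustrative code with our own names; in practice the pj's would be the maximum likelihood estimates Sj/N):

import numpy as np

def gini_null_moments(Ni, pj):
    # Exact mean and variance of the Gini gain under the Null Hypothesis.
    Ni = np.asarray(Ni, dtype=float)
    pj = np.asarray(pj, dtype=float)
    N, n = Ni.sum(), len(Ni)
    s2, s3 = (pj ** 2).sum(), (pj ** 3).sum()
    mean = (n - 1) / N * (1.0 - s2)                              # Equation 3.19
    a = (1.0 / Ni).sum() - 2.0 * n / N + 1.0 / N
    var = ((n - 1) * (2 * s2 + 2 * s2 ** 2 - 4 * s3)
           + a * (-2 * s2 - 6 * s2 ** 2 + 8 * s3)) / N ** 2      # Equation 3.20
    return mean, var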
3.5.2 Approximating the Distribution of Gini Gain with
Parametric Distributions
Now that we have exact values of the expected value and variance of the Gini gain,
we can approximate its distribution with any two-parameter parametric distribu-
tion simply by requiring their expected values and variances to be the same. The
quality of the approximation will depend on how well the shape of the parametric
distribution follows the shape of the distribution of the Gini gain. To compute
an estimate of the p-value of the distribution of the Gini gain we simply use the
parametric approximation with appropriate values for the parameters. We consider
here three families of parametric distributions: normal, gamma and beta. For each
of them we listed, in Table 3.1, the formulae for their p-values as a function of the
expected value and variance. By replacing µ with the exact expected value of
Gini gain computed with formula 3.19, and σ2 with the exact variance of Gini
gain computed with formula 3.20, and substituting in the equations in Table 3.1
estimates of the p-value of the distribution of Gini gain are obtained.
The p-value of the Gini gain, the corrected criterion, depends only on n, N ,
and the Ni's and pj's since µ and σ2 depend only on these quantities. Because
the true value of the pj is unknown, we suggest using the usual maximum likeli-
hood estimate pj = Sj,e/N where Sj,e is the number of data-points in the training
database with class label cj.
Table 3.1: P-values at point x for parametric distributions as a function of the expected value, µ, and the variance, σ2.

  Normal, N(µ, σ2): parameters µ and σ; moment generating function $e^{\mu t + \sigma^2 t^2/2}$; p-value $\frac{1}{2}\left(1 - \Phi\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right)$, where Φ(x) is the error function.
  Gamma, Γ(γ, θ): parameters $\gamma = \mu^2/\sigma^2$ and $\theta = \sigma^2/\mu$; moment generating function $(1 - \theta t)^{-\gamma}$; p-value $1 - Q(\gamma, x/\theta)$, where Q(x, y) is the regularized incomplete gamma function.
  Beta, B(α, β): parameters $\alpha = -\mu(-\mu + \mu^2 + \sigma^2)/\sigma^2$ and $\beta = (\mu - 2\mu^2 + \mu^3 - \sigma^2 + \mu\sigma^2)/\sigma^2$; no closed-form moment generating function; p-value $1 - I(x; \alpha, \beta)$, where I(x; a, b) is the regularized incomplete beta function.
To determine the quality of the approximation of the distribution of the Gini
gain with normal and gamma distributions we performed two kinds of empirical
tests, each of which tests a different aspect of the approximation. The first test
compares the empirical moments of the Gini gain with the moments of the gamma,
beta and normal distribution with parameters determined as explained before. If
moments are close, the moment generating functions are close, which is proof that
the approximation is good both in the body and in the tail (Shao, 1999). The sec-
ond test compares empirical values of the p-value with values computed using the
three approximations. This test provides direct evidence that the approximation
of the p-value is good in the body and the beginning of the tail, but says nothing about
the end of the tail, since large numbers of Monte Carlo trials are required to get data
in the tail. In what follows, we give more details and experimental results for these
two tests.
Quality of the Approximation of Moments
Indirect evidence that two distributions approximate each other well – under mild
conditions, which are satisfied by any distribution with finite moments, as is the
case here since the values the Gini gain can take are between 0 and 1 – is provided
by the fact that their moments are close to each other. This is because the moment
generating function Ψ(t) – under these mild conditions – uniquely determines the
distribution, and the moments are the derivatives of first and higher order of Ψ(t)
at 0:
$$E[X^n] = \left.\frac{\partial^n \Psi(t)}{\partial t^n}\right|_{t=0}$$
Thus if the moments are close the moment generating functions are close, which
in turn implies that the distributions are close. For both the normal and gamma
distributions the moments can be computed analytically from their moment gener-
ating functions Ψ(t), depicted in Table 3.1. For the Beta distribution the moments
can be shown to be (Papoulis, 1991):
$$E[X^n] = \frac{\Gamma(\alpha + \beta)\,\Gamma(\alpha + n)}{\Gamma(\alpha + \beta + n)\,\Gamma(\alpha)}$$
To compare the moments of the true distribution of the Gini gain with the
moments of the approximations we ran experiments consisting of 1000000 Monte
Carlo trials for various values of the parameters N , n and p1. In all the experiments
we are reporting here there are only two class labels and the values of the attribute
variable are equi-probable.
The results for N = 100 and various values for n and p1 are reported in
Tables 3.2, 3.3 and 3.4. The numbers on the columns “Gamma-T”, “Beta-
T”, and “Normal-T” correspond to the ratio error of the approximation with the
gamma, beta and normal approximations, respectively, and parameters determined
with the theoretical formulae for E [∆g()] and Var (∆g()). The numbers on the
“-E” columns are obtained using the approximations with parameters determined
from the experimental values of the expected value and variance of Gini gain. All
the numbers on columns other than “Moment” or “Exp. Value” are the ratio error
with respect to the experimental moment. The ratio error is defined here to be
the ratio of the approximated value over the true value, if the true value is smaller,
and the negative of the ratio of the true value over the approximation, otherwise.
Thus, the sign indicates which value is larger and the magnitude the error of the
approximation. Numbers very close to 1 or −1 indicate good approximations.
There are a number of things to notice from these experimental results:
1. The approximations using the theoretical values of the expected value and
variance of Gini gain are as good as the approximations using the experi-
mental values. This is to be expected since the formulae for expected value
and variance are exact and a large number of experiments (1000000) are
performed so the experimental error is very small.
2. The gamma approximation is slightly better than the beta approximation
and much better than the normal approximation.
3. When Np1 is large, the gamma and beta approximations are reasonably
good, even for higher moments but the approximation completely breaks
down when Np1 ≈ 1, since the distribution is very discrete (see Section 3.6).
We essentially observed the same behavior when the values of the attribute variable
are not equi-probable.
The approximation with the beta distribution is not possible for all values of
the quantities n, N , Ni and pj since the corresponding parameters are not positive,
as required by the definition of the distribution. As we have seen, for the case when
the parameters are positive, the quality of the approximation is slightly worse or
comparable to the quality of the approximation with the gamma distribution.
As we have seen from experimental data on the approximation of the moments,
the normal approximation is significantly worse than the gamma approximation.
For this reason and because the approximation with beta distribution is not always
possible, for the rest of the chapter we focus our attention exclusively on the gamma
approximation.
Table 3.2: Experimental moments and predictions of moments for N = 100, n = 2, p1 = .5, obtained by Monte Carlo simulation with 1000000 repetitions. -T are theoretical approximations, -E are experimental approximations. [Columns: Moment, Exp. Value, Gamma-T, Gamma-E, Beta-T, Beta-E, Normal-T, Normal-E; rows: the first ten raw moments E[X^m] and central moments E[(X − E[X])^m] of the Gini gain, reported as ratio errors of each approximation.]
Table 3.3: Experimental moments and predictions of moments for N = 100, n = 10, p1 = .5, obtained by Monte Carlo simulation with 1000000 repetitions. -T are theoretical approximations, -E are experimental approximations. [Same column and row layout as Table 3.2.]
Table 3.4: Experimental moments and predictions of moments for N = 100, n = 2, p1 = .01, obtained by Monte Carlo simulation with 1000000 repetitions. -T are theoretical approximations, -E are experimental approximations. [Same column and row layout as Table 3.2.]
Quality of the Approximation of P-value
To assess the quality of the approximation of the p-value of the true distribution of
the Gini gain we used the same samples (ran the same experiments) that were used
to compute the information in Tables 3.2, 3.3 and 3.4, and we obtained the
results in Figures 3.8, 3.9 and 3.10, respectively.
As can be seen, for large Np1 the approximation is reasonably good, but when
Np1 ≈ 1 the approximation breaks down completely since the distribution is very
discrete (see Section 3.6). We observed essentially the same trends for different
values of the parameters.
All these experiments suggest that the Gamma approximation behaves well in
practice. Thus, the formula we propose to compute the p-value of the Gini gain,
based on the gamma approximation, is:

$$\text{p-value}(\Delta g_e) = 1 - Q\!\left(\frac{E(\Delta g)^2}{\mathrm{Var}(\Delta g)},\ \frac{\Delta g_e\, E(\Delta g)}{\mathrm{Var}(\Delta g)}\right), \qquad (3.23)$$

where $\Delta g_e$ is the actual value of the Gini gain computed on the given dataset and
Q(x, y) = Γ(x, y)/Γ(x) is the regularized incomplete gamma function.
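Equation 3.23 amounts to the survival function of a gamma distribution with shape E(∆g)²/Var(∆g) and scale Var(∆g)/E(∆g) evaluated at the observed Gini gain. A hedged sketch using SciPy (not the software used for the experiments reported here; names are our own) is:

from scipy.stats import gamma

def gini_gamma_pvalue(gini_value, mean, var):
    # mean, var: the exact moments from Equations 3.19 and 3.20
    shape, scale = mean ** 2 / var, var / mean
    return gamma.sf(gini_value, a=shape, scale=scale)   # P[Gamma >= observed value]

Combined with the moments computed above, the corrected criterion simply selects the attribute with the smallest such p-value.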
Figure 3.8: Experimental p-value of Gini gain with one standard deviation error bars against p-value of theoretical gamma approximation for N = 100, n = 2 and p1 = .5.
Figure 3.9: Experimental p-value of Gini gain with one standard deviation error bars against p-value of theoretical gamma approximation for N = 100, n = 10 and p1 = .5.
Figure 3.10: Experimental p-value of Gini gain with one standard deviation error bars against p-value of theoretical gamma approximation for N = 100, n = 2 and p1 = .01.
Explanation of the Bias of the Gini Gain
Now that we have a tight theoretical approximation of the distribution of the Gini
gain at the Null Hypothesis, we can provide an explanation for the existence of the
bias. In Figure 3.11 we depicted the probability density function of the Gini gain,
as approximated by the gamma distribution, for two attribute variables, X1 with
size of its domain 2, and X2 with size of its domain 10. The number of data-points
is 100 and each of the two classes and the values of the attribute variables are
equi-probable. Notice that the distribution of the Gini gain for X1 is much closer
to 0 than the distribution for X2, which explains why it is far more probable for
X2 to be chosen as the split attribute as we previously observed. Using the p-value
of the Gini gain maps both these distributions approximately onto the uniform
distribution on the interval [0, 1], thus the probability of choosing either of the
two attribute variables as the split attribute is the same.
Practical Considerations
Note that there is a very important numerical precision problem associated with
the formula for the p-value, Equation 3.23. Even for moderate correlation between an
attribute variable and the class label, the value of the second term in Equation 3.23
approaches 1 very rapidly (by far exceeding the precision of the processor). Thus
the computed value of the p-value is 0 in this case, seemingly limiting the usefulness
of our criterion for the case that correlations between an attribute variable and
the class label are present. This “non-discrimination” anomaly was also observed
by Kononenko (1995).
Figure 3.11: Probability density function of Gini gain for attribute variables X1 and X2.
For our criterion, we can avoid this problem by directly computing the log-
arithm of the p-value using a series expansion.3 In this manner, values of the
logarithm of the p-value (which can be used instead of the p-value since the log-
arithm is a monotonically increasing function) can be computed accurately even
for datasets with millions of records and very strong correlations.
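As an illustration, a modern numerical library can return this logarithm directly; the sketch below uses SciPy's log survival function rather than the series expansion and the ANA implementation referenced in the footnotes (this is our own assumption, not the setup used in the thesis):

from scipy.stats import gamma

def gini_gamma_log_pvalue(gini_value, mean, var):
    # Logarithm of the gamma-approximation p-value; stays finite even when
    # the p-value itself underflows to zero for strong correlations.
    return gamma.logsf(gini_value, a=mean ** 2 / var, scale=var / mean)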
The computational complexity of our new criterion is O(n + k) since we have
to compute the sum of inverses of the Ni’s and pj’s; all other factors can be
computed in time O(1), including the logarithm of the incomplete regularized
gamma function. Thus our new criterion can be computed efficiently in practice.4
3.6 Experimental Evaluation
In this section we will show experimental evidence that our theoretical corrections
behave well in practice. To evaluate the bias of the gamma correction of the Gini
gain we repeated the experiment from Section 3.3. The bias of our correction
of the Gini gain as a function of N and p1 is depicted in Figure 3.12. As can
be observed by comparing Figures 3.6 and 3.12, the bias for the corrected Gini
gain and the χ2-test are practically identical for all values of p1 and N . Also, for
p1 between 10−4 and 10−2 all the statistical methods are biased toward attribute
variables with small n in precisely the same way. As mentioned in Section 3.3, the
most extreme bias is obtained for p1 = 1/N . In this case the probability to see
exactly one data-point with class label c1 is $N\cdot\frac{1}{N}\left(1-\frac{1}{N}\right)^{N-1} \approx e^{-1}$.

3 We used the implementation of the incomplete gamma function in the Statistics package ANA (Shine & Strous, 2001).
4 On a Pentium III 933MHz the computation of the incomplete regularized gamma function takes 155µs. This is also the time to compute the contingency table for 14000 samples in the most favorable case (one attribute variable and highly optimized code for this special case).

The margin ∆ from the Lemma in Section 3.4 is at least $2e^{-1} \approx 0.73$, which means that the
exact correction (using the exact distribution of the split criteria) can have any
bias. Thus around p1 = 1/N we cannot expect any of the statistical methods to be
perfectly unbiased. Moreover, since in this situation the distribution is extremely
discrete, we cannot expect any approximation of the distribution with a continuous
distribution like gamma to be reasonable. This explains the experimental results
in Section 3.5 where we have seen that the p-value of the distribution of Gini gain
is grossly underestimated.
Note that for small entries in the contingency table the χ2-distribution is a
poor approximation of the χ2-test.5 In the case that a predictor variable is not
correlated with the class label, the overestimation of the variance does not seem
to matter (but this might not be the case when correlations are present).
To summarize our experiments, the gamma correction of the Gini gain and the
χ2 criterion have very good behavior under the Null Hypothesis. The G2 criterion
behaves well if the class labels are almost equiprobable, but some bias is present if this
is not the case. The Gini gain, the information gain, and the gain ratio have
significant biases toward variables with more values.
5 We observed that for this case the expected value according to the χ2-distribution is correct, but the variance is overestimated. See Appendix B.0.4 for a proof.
Figure 3.12: Bias of the p-value of the Gini gain using the gamma correction.
3.7 Discussion
This chapter addresses the fundamental problem of bias in split variable selection
in classification tree construction. Our contribution is (1) a general method to
provably remove the bias introduced by categorical variables with large domains
and (2) the application of our method to the removal of the bias for the Gini
gain.
Previous work for some split criteria suggests that removal of the bias by the
usage of p-values improves the quality of the split when correlations are weak and
at the same time preserves the good behavior for strong correlations (Mingers,
1987; Frank & Witten, 1998). This suggests that bias removal in general might be
useful in practice.
Chapter 4
Scalable Linear Regression Trees
There is considerable interest in developing regression models for large datasets that are both accurate
and easy to interpret. Regressors that have these properties are regression trees
with linear models in the leaves, but the algorithms already proposed for construct-
ing them are not scalable due to the fact that they require a large number of linear
systems to be formed and solved. In this chapter we propose a novel regression
tree construction algorithm that is both accurate and can truly scale to very large
datasets. The main idea is, for every intermediate node, to use the EM algorithm
for Gaussian mixtures to find two clusters in the data and to locally transform the
regression problem into a classification problem based on closeness to these clus-
ters. Goodness of split measures, like the Gini gain, can then be used to determine
the split variable and point much like in classification tree construction. Scalability
of the algorithm can be enhanced by employing scalable versions of the EM and
the classification tree construction algorithms. Tests on real and artificial data
show that the proposed algorithm has competitive accuracy but requires orders of
magnitude less computation time for large datasets.
4.1 Introduction
Even though regression trees were introduced early in the development of clas-
sification trees (CART, Breiman et al. (1984)), regression trees received far less
attention from the research community. Quinlan (1992) generalized the regression
trees in CART by using a linear model in the leaves to improve the accuracy of
the prediction. The impurity measure used to choose the split variable and the
split point was the standard deviation of the predictor for the training examples
at the node. Karalic (1992) argued that the mean square error of the linear model
in a node is a more appropriate impurity measure for the linear regression trees
since data well predicted by a linear model can have large variance. This is a
crucial observation since evaluating the variance is much easier than estimating
the error of a linear model (which requires solving a linear system). Even more,
if discrete attributes are present among the predictor attributes and binary trees
are built (as is the case in CART), the problem of finding the best split attribute
becomes intractable for linear regression trees since the theorem that justifies a
linear algorithm for finding the best split (Theorem 9.6 in (Breiman et al., 1984),
see Section 2.3.1) does not seem to apply. To address computational concerns of
normal linear regression models, Alexander and Grimshaw (1996) proposed the
use of simple linear regressors (i.e., the linear model depends on only one predictor
attribute), which can be trained more efficiently but are not as accurate.
Torgo proposed the use of even more sophisticated functional models in the
leaves (i.e., kernel regressors) (Torgo, 1997b; Torgo, 1997a). For such regression
trees both construction and deployment of the model is expensive but they po-
tentially are superior to the linear regression trees in terms of accuracy. More
recently, Li et al. (2000) proposed a linear regression tree algorithm that can
produce oblique splits1 using Principal Hessian Analysis but the algorithm cannot
accommodate discrete attributes.
There are a number of contributions to regression tree construction coming from
the statistics community. Chaudhuri et al. (1994) proposed the use of statistical
tests for split variable selection instead of error of fit methods. The main idea
is to fit a model (constant, linear or higher order polynomial) for every node in
the tree and to partition the data at each node into two classes: data-points with
positive residuals2 and data-points with negative residuals. In this manner the
regression problem is locally reduced to a classification problem, so it becomes
much simpler. Statistical tests used in classification tree construction, Student’s
t-test in this case, can be used from this point on. Unfortunately, it is not clear
why differences in the distributions of the signs of the residuals are good criteria
on which decisions about splits are made. A further enhancement was proposed
recently by Loh (2002). It consists mostly in the use of the χ2-test instead of the
t-test in order to accommodate discrete attributes, the detection of interactions of
pairs of predictor attributes, and a sophisticated calibration mechanism to ensure
the unbiasedness of the split attribute selection criterion.
In this chapter we introduce SECRET (Scalable EM and Classification based
Regression Trees), a new construction algorithm for regression trees with linear
models in the leaves, which produces regression trees with accuracy comparable
to the ones produced by existing algorithms and, at the same time, requiring far
less computational effort on large datasets. Our experiments show that SECRET
improves the running time of regression tree construction by up to two orders of
1 Oblique splits are linear inequalities involving two or more predictor attributes, of the form $a_1 X_1 + \cdots + a_k X_k > c$.
2 Residuals are the differences between the true values and the values predicted by the regression model.
magnitude when compared to previous work while at the same time constructing
trees of comparable quality. Our main idea is to use the EM algorithm on the data
partition in an intermediate node to determine two Gaussian clusters, hopefully
with shapes close to flat disks. We then use these two Gaussian clusters to locally
transform the regression problem into a classification problem by labeling every
data-point with class label 1 if the probability of belonging to the first cluster exceeds
the probability of belonging to the second cluster, or with class label 2 if the converse is
true. A split attribute and a corresponding split point to separate the two classes
can be determined then using goodness of split measures for classification trees
like the Gini gain (Breiman et al., 1984). Least square linear regression can be
used to determine the linear regressors in the leaves.
The local reduction to a classification problem allows us to avoid forming and
solving the large number of linear systems of equations required by an exhaustive
search method such as the method used by RETIS (Karalic, 1992). Even more,
scalable versions of the EM algorithm for Gaussian mixtures (Bradley et al., 1998)
and classification tree construction (Gehrke et al., 1998) can be used to improve
the scalability of the proposed solution. An extra benefit of the method is the fact
that good oblique splits can be easily obtained.
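To make the local reduction concrete, the following sketch (our own illustrative code, not the thesis's implementation) uses scikit-learn's GaussianMixture as a stand-in for the EM step and then searches for the binary split of one continuous attribute with the largest Gini gain of the induced two-class problem; the quadratic threshold scan is kept simple for clarity.

import numpy as np
from sklearn.mixture import GaussianMixture

def secret_split_labels(X_regress, y, seed=0):
    # Fit two Gaussian clusters in (regressor attributes, output) space and
    # label every data-point with the more probable cluster (0 or 1).
    Z = np.column_stack([X_regress, y])
    gm = GaussianMixture(n_components=2, covariance_type='full', random_state=seed).fit(Z)
    return gm.predict(Z)

def gini_impurity(labels):
    p = np.bincount(labels) / len(labels)
    return 1.0 - (p ** 2).sum()

def best_split_point(x_split, labels):
    # Exhaustive search over thresholds on one continuous split attribute.
    order = np.argsort(x_split)
    x, c = x_split[order], labels[order]
    parent, best_gain, best_thr = gini_impurity(c), -np.inf, None
    for t in range(1, len(x)):
        if x[t] == x[t - 1]:
            continue
        gain = parent - (t * gini_impurity(c[:t])
                         + (len(x) - t) * gini_impurity(c[t:])) / len(x)
        if gain > best_gain:
            best_gain, best_thr = gain, (x[t - 1] + x[t]) / 2.0
    return best_thr, best_gain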
The rest of the chapter is organized as follows. In Section 4.2 we give a short
introduction to the EM algorithm for Gaussian mixtures. In Section 4.3 we present in
greater detail some of the previously proposed solutions and we comment on their
shortcomings. Section 4.4 contains the description of SECRET, our proposal for a
linear regression tree algorithm. We then show results of an extensive experimental
study of SECRET in Section 4.5.
4.2 Preliminaries: EM Algorithm for Gaussian Mixtures
In this section we discuss a particular solution to the problem of approximating
some unknown distribution, from which a sample is available, with a mixture of
Gaussian distributions – the EM algorithm for Gaussian mixtures.
The EM algorithm (Dempster, Laird & Rubin, 1977) is a very general
method that can be used to determine parameters of models with hidden variables.
Here, we will be concerned only with its application to determining a mixture of
Gaussian distributions that best approximate an unknown distribution from which
samples are available. Our introduction follows, in large, the excellent tutorial of
Bilmes (1997) where details and complete proofs of the EM algorithm for Gaussian
mixtures can be found.
The Gaussian mixture density estimation problem is the following: find the
most likely values of the parameter set Θ = (α1, . . . , αM , µ1, . . . , µM , Σ1, . . . , ΣM)
of the probabilistic model:
$$p(\mathbf{x}, \Theta) = \sum_{i=1}^{M} \alpha_i\, p_i(\mathbf{x} \mid \mu_i, \Sigma_i) \qquad (4.1)$$

$$p_i(\mathbf{x} \mid \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_i|^{1/2}}\, e^{-\frac{1}{2}(\mathbf{x}-\mu_i)^T \Sigma_i^{-1} (\mathbf{x}-\mu_i)} \qquad (4.2)$$
given sample X = (x1, . . . ,xN) (training data). In the above formulae pi is the
density of the Gaussian distribution with mean µi and covariance matrix Σi. αi
is the weight of the component i of the mixture, M is the number of mixture
components or clusters and is fixed and given, and d is the dimensionality of the
space.
The EM algorithm for estimating the parameters of the Gaussian components
proceeds by repeatedly applying the following two steps until the termination con-
dition is satisfied:
Expectation (E step):

$$h_{ij} = \frac{\alpha_i\, p_i(\mathbf{x}_j \mid \mu_i, \Sigma_i)}{\sum_{k=1}^{M} \alpha_k\, p_k(\mathbf{x}_j \mid \mu_k, \Sigma_k)} \qquad (4.3)$$

Maximization (M step):

$$\alpha_i = \frac{1}{N}\sum_{j=1}^{N} h_{ij} \qquad (4.4)$$

$$\mu_i = \frac{\sum_{j=1}^{N} h_{ij}\, \mathbf{x}_j}{\sum_{j=1}^{N} h_{ij}} \qquad (4.5)$$

$$\Sigma_i = \frac{\sum_{j=1}^{N} h_{ij}\, (\mathbf{x}_j - \mu_i)(\mathbf{x}_j - \mu_i)^T}{\sum_{j=1}^{N} h_{ij}} \qquad (4.6)$$
In the above formulae, hij are the hidden parameters and they can be interpreted
as the probability that the point xj belongs to the i-th component.
The termination condition is usually specified either by a maximum number of
steps or as a minimum average movement between consecutive steps of the centers
of the Gaussian distributions. In our work we use exclusively the former since it
is simpler and works well in practice.
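A minimal illustration of Equations 4.3 to 4.6 with a fixed number of iterations might look as follows; the initialization, the small ridge added to the covariances, and all names are our own choices, and for large datasets the scalable EM variants cited in this chapter would be used instead.

import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(X, M=2, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    alpha = np.full(M, 1.0 / M)
    mu = X[rng.choice(N, M, replace=False)]                  # means start at random points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(M)])
    for _ in range(iters):
        # E step (Eq. 4.3): h[j, i] = responsibility of component i for point x_j
        dens = np.column_stack([alpha[i] * multivariate_normal.pdf(X, mu[i], Sigma[i])
                                for i in range(M)])
        h = dens / dens.sum(axis=1, keepdims=True)
        # M step (Eqs. 4.4-4.6): re-estimate weights, means and covariances
        w = h.sum(axis=0)
        alpha = w / N
        mu = (h.T @ X) / w[:, None]
        for i in range(M):
            diff = X - mu[i]
            Sigma[i] = (h[:, i, None] * diff).T @ diff / w[i] + 1e-6 * np.eye(d)
    return alpha, mu, Sigma, h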
4.3 Previous solutions to linear regression tree construc-
tion
In this section we analyze some of the previously proposed construction algorithms
for linear regression trees and, for each, we point out its major drawbacks.
4.3.1 Quinlan’s construction algorithm
For efficiency reasons, the algorithm proposed by Quinlan (1992) proceeds as if
a regression tree with constant models in the leaves were being constructed, until the tree
is fully grown, when linear models are fit on the data-points available at each
leaf. This is equivalent to using the split selection criterion in Equation 2.12
during the growing phase. Then linear regressors in the leaves are constructed by
performing another pass over the data in which the set of data-points from the
training examples corresponding to each of the leaves is determined and the least
square linear problem for these data-points is formed and solved (using the SVD
decomposition (Golub & Loan, 1996)).
The same approach was later used by Torgo (1997a; 1997b) with more com-
plicated models in the leaves like kernels and local polynomials.
As pointed out by Karalic (1992) the variance of the output variable is a poor
estimator of the impurity of the fit when linear regressors are used in the leaves
since the points can be arranged along a line (so the error of the linear fit is almost
zero) but they occupy a significant region (so the variance is large). To correct
this problem, he suggested that the following impurity function should be used:
$$\mathrm{Err}_l(T) \stackrel{\mathrm{def}}{=} E\left[(Y - f(X))^2 \,\middle|\, T\right] \qquad (4.7)$$

where $f(\mathbf{x}) = [1\ \ \mathbf{x}^T]\,\mathbf{c}$ is the linear regressor with the smallest least square error.
It is easy to see (see for example (Golub & Loan, 1996)) that $\mathbf{c}$ is the solution of
the LSE equation:

$$E\left[\left.\begin{pmatrix} 1 & X^T \\ X & X X^T \end{pmatrix}\right| T\right]\mathbf{c} = E\left[\left.\begin{pmatrix} 1 \\ X \end{pmatrix} Y \right| T\right] \qquad (4.8)$$
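For illustration, the least-squares problem of Equation 4.8 for the data-points that reach a leaf can be solved with an SVD-based routine, in the spirit of the Golub and Van Loan reference (a sketch with our own names):

import numpy as np

def fit_leaf_linear_model(X, y):
    # Solve min_c || [1  X] c - y ||^2 ; c[0] is the intercept, c[1:] the slopes.
    A = np.column_stack([np.ones(len(y)), X])
    c, *_ = np.linalg.lstsq(A, y, rcond=None)   # SVD-based least squares
    return c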
To see more clearly that Err(T ) given by Equation 2.12 is not appropriate for
the linear regressor case, consider the situation in Figure 4.1. The two thick lines
represent a large number of points (possibly infinite). The best split for the linear
regressor is x = 0 and the fit is perfect after the split (thus Errl(T1) = Errl(T2) = 0).
Obviously this split also achieves the maximum possible impurity decrease (1/12).
Figure 4.1: Example of situation where average based decision is different from linear regression based decision.
Figure 4.2: Example where classification on sign of residuals is unintuitive.
For the case when Err(T) is used, $E[Y \mid T] = 1/2$ so $\mathrm{Err}(T) = 1/12$. To
determine the split point for this situation suppose the split point is $s - 1$ in
Figure 4.1. The points with property $x < s - 1$ belong to $T_1$ and the rest to
$T_2$. Then $E[Y \mid T_1] = s/2$, $\mathrm{Err}(T_1) = s^3/12$, $E[Y \mid T_2] = (-2 + s^2)/(2(-2 + s))$ and
$\mathrm{Err}(T_2) = (4 - 8s + 12s^2 - 8s^3 + s^4)/(24 - 12s)$. Thus by splitting, the impurity
decreases by $\Delta \mathrm{Err}(T) = (1 - s)^2 s/(4(2 - s))$. The extremum points in the interval
$[0, 1]$ are $s = 1$ and $s = (3 - \sqrt{5})/2$. Looking at the second derivative in these
points one can observe that $\Delta \mathrm{Err}(T)$ has a minimum in $s = 1$ and a maximum
in $s = (3 - \sqrt{5})/2$. Thus the maximum impurity decrease is obtained if the split
point is $-(\sqrt{5}-1)/2 = -0.618034$ or, symmetrically, $0.618034$. Either of these splits
is very far from the split obtained using $\mathrm{Err}_l(T)$ (at point 0), thus splitting the
points in proportion 19% to 81% instead of the ideal 50% to 50%.
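The calculation above can also be checked numerically. The sketch below (illustrative code with our own discretization of the dataset and of the candidate split points) samples the triangular dataset of Figure 4.1 and compares the split chosen by the variance-based impurity Err(T) with the one chosen by the linear-model impurity Err_l(T):

import numpy as np

# Triangular dataset of Figure 4.1: y = x + 1 on [-1, 0], y = 1 - x on [0, 1]
x = np.linspace(-1.0, 1.0, 2001)
y = np.where(x < 0, x + 1.0, 1.0 - x)

def err_const(xs, ys):                     # Err(T): variance around the mean
    return ys.var()

def err_lin(xs, ys):                       # Err_l(T): error of the best linear fit
    c = np.polyfit(xs, ys, 1)
    return np.mean((np.polyval(c, xs) - ys) ** 2)

def best_split(err):
    parent = err(x, y)
    best_gain, best_t = -1.0, None
    for t in x[100:-100:25]:               # candidate split points
        m = x < t
        gain = parent - (m.mean() * err(x[m], y[m]) + (~m).mean() * err(x[~m], y[~m]))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t

print(best_split(err_const))               # roughly +/-0.618, far from the kink
print(best_split(err_lin))                 # roughly 0, the natural split point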
This example suggests that the split point selection based on Err(T ) produces
an unnecessary fragmentation of the data that is not related to the natural organi-
zation of the data-points for the case of linear regression trees. This fragmentation
produces unnecessarily large and unnatural trees, anomalies that are not corrected
by the pruning stage. Indeed, when we used a dataset with the triangular shape
described above as the input to a regression tree construction algorithm that used
Err(T ) from Equation 2.12 as split criterion we obtained the following split points
starting from the root and navigating breadth first for three levels: 0.6185, -0.5255,
0.8095, -0.7625, 0.3585, 0.7145, 0.9055. The splits are not only unintuitive but the gen-
erated tree is very unbalanced. Note that this example is not an extreme case but
rather a normal one, so this behavior is probably the norm not the exception.
4.3.2 Karalic’s construction algorithm
Using the split criterion in Equation 4.7 the problem mentioned above is avoided
and much higher quality trees are built. If exhaustive search is used to determine
the split point, the computational cost of the algorithm becomes prohibitively
expensive for large datasets for two main reasons:
• If the split attribute is continuous, as many split points as there are training
data-points have to be considered. For each of them a linear system has
to be formed and solved. Even if the matrix and the vector that form the
linear system are maintained incrementally (which can be dangerous from
numerical stability point of view), for every level of the tree constructed, a
number of linear systems equal to the size of the dataset have to be solved.
• If the split attribute is discrete the situation is potentially much worse since
Theorem 9.6 in (Breiman et al., 1984) does not seem to apply for this split
criterion. This means that an exponential number (in the size of the domain
of the split variable) of linear systems have to be formed and solved.
The first problem can be alleviated if a sample of the points available are
considered as split points. Even if this simplification is made, the data-points have
to be sorted in every intermediate node on all the possible split attributes. Also, it
is not clear how these modifications influence the accuracy of the regression trees
produced. The second problem seems unavoidable if exhaustive search is used.
4.3.3 Chaudhuri’s et al. construction algorithm
In order to avoid forming and solving so many linear systems, Chaudhuri et al.
(1994) proposed to locally classify the data-points available at an intermediate
node based on the sign of the residual with respect to the least square error linear
model. For the data-points in Figure 4.2 (the set of data-points is identical to the
one in Figure 4.1) this corresponds to points above and below the dashed line. As
it can be observed, when projected on the X axis, the negative class surrounds the
positive class so two split points are necessary to differentiate between them (the
node has to be split into three parts). When the number of predictor attributes is
greater than 1 (multidimensional case), the separating surface between class labels
+ and − is nonlinear. Moreover, if best regressors are fit in these two classes, the
prediction is only slightly improved. The solution adopted by Chaudhuri et al. is
to use Quadratic Discriminant Analysis (QDA) to determine the split point. This
usually leads to choosing as split point approximately the mean of the dataset,
irrespective of where the optimal split is, so the reduction is not always useful. For
this reason GUIDE (Loh, 2002) uses this method only to select the split attribute,
not the split point.
4.4 Scalable Linear Regression Trees
For constant regression trees (i.e. regression trees with constants as models in
the leaves), algorithms for scalable classification trees can be straightforwardly
adapted (Gehrke et al., 1998). The main obstacle in achieving scalability for linear
regression trees is the observation previously made that the problem of partitioning
the domain of a discrete variable into two parts is intractable. In addition, the amount of
sufficient statistics that has to be maintained grows from two real numbers for
constant regressors (the mean and the mean of the squares) to a quantity quadratic in the
number of regression attributes (the matrix A^T A that defines the linear system),
which can also be a problem.
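For concreteness, the sketch below shows the standard normal-equations form of this computation (Equation 4.8 is not reproduced in this section, so the exact notation is assumed): the only state kept is the matrix A^T A and the vector A^T y, whose sizes depend on the number of regressor attributes but not on the number of data-points that stream through.

import numpy as np

def fit_leaf_regressor(X, y):
    """Accumulate the normal equations A^T A c = A^T y one data-point at a time;
    the state is quadratic in the number of regressor attributes and independent
    of the number of data-points."""
    d = X.shape[1] + 1                      # +1 for the intercept term
    AtA, Aty = np.zeros((d, d)), np.zeros(d)
    for xi, yi in zip(X, y):
        a = np.append(xi, 1.0)              # one row of A: regressor values plus a constant 1
        AtA += np.outer(a, a)
        Aty += yi * a
    return np.linalg.solve(AtA, Aty)        # linear model coefficients, intercept last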
In this work we make the distinction in (Loh, 2002) between predictor at-
tributes: (1) split discrete attributes – used only in the split predicates in inter-
mediate nodes in the regression tree, (2) split continuous attributes – continuous
attributes used only for splitting, (3) regression attributes – continuous attributes
used in the linear combination that specifies the linear regressors in the leaves as
well as for specifying split predicates. By allowing some continuous attributes to
participate in splits but not in regression in the leaves we add greater flexibility to
the learning algorithm. The partitioning of the continuous attributes into split and
regression attributes is beyond the scope of this thesis (and is usually performed by the user
(Loh, 2002)).
The main idea behind our algorithm is to locally transform the regression prob-
lem into a classification problem by first identifying two general Gaussian distribu-
tions in the regressor attributes–output space using the EM algorithm for Gaussian
mixtures and then by classifying the data-points based on the probability of be-
longing to these two distributions. Classification tree techniques are then used to
select the split attribute and the split point. Our algorithm, called SECRET, is
shown in Figure 4.3.
The role of EM is to find two natural classes in the data that have approximately
a linear organization. The role of the classification is to identify the predictor
attributes that can make the difference between these two classes in the input
space. To see this more clearly suppose we are in the process of building a linear
regression tree and we have to decide on the split attribute and split point for the
node T . Suppose the set of training examples available at node T contains tuples
with three components: a regressor attribute Xr, a discrete attribute Xd and the
predicted attribute Y . The projection of the training data on the Xr, Y space
Input: node T, data-partition D
Output: regression tree T for D rooted at T

Linear regression tree construction algorithm: BuildTree(node T, data-partition D)
 (1)  normalize data-points to unitary sphere
 (2)  find two Gaussian clusters in regressor–output space (EM)
 (3)  label data-points based on closeness to these clusters
 (4)  foreach split attribute
 (5)      find best split point and determine its Gini gain
 (6)  endforeach
 (7)  let X be the attribute with the greatest Gini gain and
      Q the corresponding best split predicate set
 (8)  if (T splits)
 (9)      partition D into D1, D2 based on Q and label node T with split attribute X
 (10)     create children nodes T1, T2 of T and label the edge (T, Ti) with predicate q(T,Ti)
 (11)     BuildTree(T1, D1); BuildTree(T2, D2)
 (12) else
 (13)     label T with the least square linear regressor of D
 (14) endif

Figure 4.3: SECRET algorithm
might look like Figure 4.4. The data-points are approximately organized in two
clusters with Gaussian distributions that are marked as ellipsoids. Differentiating
between the two clusters is crucial for prediction, but the information in the regression
attribute is not sufficient to make this distinction, even though within each cluster
it allows good predictions. The information in the discrete attribute Xd can
make this distinction, as can be observed from Figure 4.5 where the projection is
made on the Xd, Xr, Y space. If other split attributes had been present, a split on
Xd would have been preferred since the resulting splits are pure.
Figure 4.4: Projection on Xr, Y space of training data.

Figure 4.5: Projection on Xd, Xr, Y space of same training data as in Figure 4.4.

For the situation in Figure 4.2, the EM algorithm will approximate each of the
two distinct linear portions with a very narrow Gaussian distribution. Using these
two clusters, all the points at the left of origin will have the first class label and
the points at the right of the origin the second class label. Obviously, using 0 as
the split point provides the best separation between classes. This is exactly the
best split point for this situation since it results in perfect approximation of the
data after linear models are fitted for each cluster.
Observe that the EM algorithm for Gaussian mixtures is used here in a very restricted
setting: there are only two mixture components, so the likelihood function has a simpler
form with fewer local maxima. Since EM is sensitive to distances, before
running the algorithm the training data has to be normalized by performing a
linear transformation that makes the data look as close as possible to a unitary
sphere centered at the origin. Experimentally we observed that, with this
transformation and in this restricted scenario, the EM algorithm with randomly
initialized clusters works well.
We first describe how the EM algorithm can be implemented efficiently, followed
by details on the integration of the resulting mixtures with the split selection
procedure and the linear regression in the leaves.
4.4.1 Efficient Implementation of the EM Algorithm
The following two ideas are used to implement the EM algorithm efficiently:
1. steps E and M are performed simultaneously, which means that quantities
hij do not have to be stored explicitly
2. all the operations are expressed in terms of the Cholesky decomposition G_i
of the covariance matrix Σ_i = G_i G_i^T. G_i has the useful property that it is
lower triangular, so solving linear systems takes quadratic effort in the number
of dimensions and computing the determinant is linear in the number of
dimensions.
Note that these modifications can be made in addition to the techniques used in
(Bradley et al., 1998) to make the EM algorithm scalable.
Using the Cholesky decomposition we immediately have Σ_i^{-1} = G_i^{-T} G_i^{-1} and
|Σ_i| = |G_i|^2. Substituting in Equation 4.2 we get:

p_i(x | µ_i, G_i) = \frac{1}{(2π)^{d/2} |G_i|} \, e^{-\frac{1}{2} \| G_i^{-1}(x - µ_i) \|^2}
The quantity x' = G_i^{-1}(x − µ_i) can be computed by solving the linear system
G_i x' = x − µ_i, which takes quadratic effort in the number of dimensions. For this
reason the inverse of G_i need not be precomputed, since solving the linear system
takes as much time as a matrix–vector multiplication. This is in line with the
recommendations given by Golub and Van Loan (1996) to avoid inverting matrices whenever
possible.
The following quantities have to be maintained incrementally for each cluster:

s_i = \sum_{j=1}^{N} h_{ij}

s_{x,i} = \sum_{j=1}^{N} h_{ij} x_j

S_i = \sum_{j=1}^{N} h_{ij} x_j^T x_j

where the quantities h_{ij} are computed with the formula in Equation 4.3 for each training
example x_j and are discarded after updating s_i, s_{x,i}, S_i for every cluster i (we
need only small temporary storage for the h_{ij}'s).
After all the training examples have been seen, the new parameters of the two
distributions are computed with the formulae:

α_i = s_i / N

µ_i = s_{x,i} / s_i

Σ_i = S_i / s_i − µ_i^T µ_i

G_i = Chol(Σ_i)
Moreover, if the data-points come from a Gaussian distribution with mean
µ_i and covariance matrix G_i G_i^T, the transformation x' = G_i^{-1}(x − µ_i) results in data-points
with a Gaussian distribution with mean 0 and the identity matrix as covariance
matrix. This means that this transformation can be used for data normalization in
the tree growing phase, normalization that is of crucial importance as we pointed
out earlier in the section.
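Putting the two ideas above together, a minimal sketch of one combined E/M pass is given below. It assumes the responsibilities h_ij of Equation 4.3 are the usual normalized mixture weights α_i p_i(x_j) / Σ_k α_k p_k(x_j); the function and variable names are illustrative, not the thesis implementation.

import numpy as np
from scipy.linalg import solve_triangular

def em_pass(X, alpha, mu, G):
    """One simultaneous E/M pass over the data for a two-cluster Gaussian mixture.
    alpha[i], mu[i], G[i] are the weight, mean and lower-triangular Cholesky factor
    of cluster i (Sigma_i = G_i G_i^T).  Only s_i, s_{x,i}, S_i are stored; the
    responsibilities h_ij are computed per data-point and discarded immediately."""
    alpha = np.asarray(alpha, dtype=float)
    N, d = X.shape
    k = len(alpha)
    s, sx, S = np.zeros(k), np.zeros((k, d)), np.zeros((k, d, d))
    for x in X:
        p = np.empty(k)
        for i in range(k):
            # x' = G_i^{-1}(x - mu_i) by forward substitution (no explicit inverse)
            xp = solve_triangular(G[i], x - mu[i], lower=True)
            det_Gi = np.prod(np.diag(G[i]))          # |G_i| = product of the diagonal
            p[i] = np.exp(-0.5 * xp @ xp) / ((2 * np.pi) ** (d / 2) * det_Gi)
        h = alpha * p / (alpha * p).sum()            # assumed form of h_ij (Equation 4.3)
        s += h
        sx += h[:, None] * x
        S += h[:, None, None] * np.outer(x, x)
    # closed-form parameter updates from the sufficient statistics
    alpha_new = s / N
    mu_new = sx / s[:, None]
    G_new = [np.linalg.cholesky(S[i] / s[i] - np.outer(mu_new[i], mu_new[i]))
             for i in range(k)]
    return alpha_new, mu_new, G_new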
4.4.2 Split Point and Attribute Selection
Once the two Gaussian mixtures are identified, the data-points can be labeled
based on the closeness to the two clusters (i.e., if a data-point is closer to cluster
1 than to cluster 2 it is labeled with class label 1, otherwise it is labeled with class
label 2). After this classification is performed locally, split point and attribute
selection methods from classification tree construction can be used.
We are using the Gini gain as the split selection criterion to find the split point. That
is, for each attribute (or collection of attributes for oblique splits) we determine
the best split point and compute the Gini gain. Then the predictor attribute with
the greatest Gini gain is chosen as split attribute.
For the discrete attributes the algorithm of Breiman et al. (1984) finds the split
point in time linear in the size of the domain of the discrete attribute (since we
only have two class labels, see Section 2.3.1). We use this algorithm, unchanged,
in the present work to find the split point for discrete attributes.
Split point selection for continuous attributes
Since the EM algorithm for Gaussian mixtures produces two normal distributions,
it is reasonable to assume that the projection of the data-points with the same class
label on a continuous attribute X also has a normal distribution. As explained in
Section 2.3.1, the split point that best separates the two normal distributions can
be found using Quadratic Discriminant Analysis (QDA). The reason for preferring
QDA to a direct minimization of the Gini gain is the fact that it gives qualitatively
similar splits but requires less computational effort (Loh & Shih, 1997). We already
explained in Section 2.3.1 how to find the split point using QDA.
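Since Section 2.3.1 is not reproduced here, the sketch below assumes that the QDA split point is the value where the two class-conditional normal densities, weighted by the class proportions, intersect between the two class means; if the formulation in Section 2.3.1 omits the class weights, pass w1 = w2 = 1.

import numpy as np

def qda_split_point(mu1, s1, w1, mu2, s2, w2):
    """Root of w1*N(x; mu1, s1^2) = w2*N(x; mu2, s2^2) lying between the class means.
    mu1, mu2 are class means, s1, s2 standard deviations, w1, w2 class proportions."""
    a = 0.5 / s2 ** 2 - 0.5 / s1 ** 2
    b = mu1 / s1 ** 2 - mu2 / s2 ** 2
    c = (mu2 ** 2 / (2 * s2 ** 2) - mu1 ** 2 / (2 * s1 ** 2)
         + np.log(w1 * s2 / (w2 * s1)))
    if abs(a) < 1e-12:                         # equal variances: the equation is linear
        return -c / b
    roots = np.roots([a, b, c])
    roots = roots[np.isreal(roots)].real
    if len(roots) == 0:                        # degenerate case: one class dominates everywhere
        return (mu1 + mu2) / 2
    lo, hi = min(mu1, mu2), max(mu1, mu2)
    inside = [r for r in roots if lo <= r <= hi]
    return inside[0] if inside else roots[np.argmin(abs(roots - (mu1 + mu2) / 2))]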
Finding a good oblique split for two Gaussian mixtures
Ideally, given two Gaussian distributions, we would like to find the separating
hyperplane that maximizes the Gini gain. Fukanaga showed that the problem of
minimizing the expected value of the 0-1 loss (the classification error function)
generates an equation involving the normal of the hyperplane that is not solvable
algebraically (Fukanaga, 1990). Following the same treatment, it is easy to see
that the problem of maximizing the Gini gain generates the same equation, thus
algebraic solutions are not possible for the Gini gain either. Fortunately, a good
solution to the problem of determining a separating hyperplane can be found using
Linear Discriminant Analysis (LDA) (Fukanaga, 1990).
The solution of an LDA problem for two mixtures consists of minimizing
Fisher's separability criterion (Fukanaga, 1990):

J(n) = \frac{n^T Σ_w n}{n^T Σ_b n}

with

Σ_w = \sum_{i=1,2} α_i Σ_i, \qquad Σ_b = \sum_{i=1,2} α_i (µ − µ_i)(µ − µ_i)^T, \qquad µ = \sum_{i=1,2} α_i µ_i

which has as result a vector n with the property that the projections of the two
Gaussian distributions on this vector are as separated as possible. The solution of the
optimization problem is (Fukanaga, 1990):

n = \frac{Σ_w^{-1}(µ_1 − µ_2)}{\| Σ_w^{-1}(µ_1 − µ_2) \|_2}

The value of Fisher's criterion is invariant to the choice of origin on the projection,
so we can make the projection on the line given by the vector n, which optimizes
Fisher's criterion, and the origin of the coordinate system.
The two multidimensional Gaussian distributions are transformed into unidimensional
normal distributions on the projection line, with means η_i = n^T µ_i and
variances σ_i^2 = n^T Σ_i n for i = 1, 2, the coordinates being line coordinates with
the projection of the origin as the 0. This situation is depicted
in Figure 4.6 for the bi-dimensional case.
Figure 4.6: Separator hyperplane for two Gaussian distributions in two dimensional space.
On the projection line (n, O), the QDA can be used to find the split point η, as
in the previous section. The point η on the projection line corresponds to ηn in the
initial space. The equation of the separating hyperplane that has n as normal and
contains the point ηn is n^T(x − ηn) = 0 ⇔ n^T x − η = 0. With this, a point x belongs
to the first partition if sign(η_1 − η)(n^T x − η) ≥ 0. The hyperplane that contains
this point of the projection line and that is perpendicular to the projection line is
a good separator of the two multidimensional Gaussian distributions.
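Putting the pieces together, the following sketch computes the oblique split for two Gaussian clusters; it uses the within-class scatter Σ_w as written above and reuses qda_split_point from the previous sketch to place η on the projection line. Names are illustrative.

import numpy as np

def oblique_split(alpha, mu, Sigma):
    """Oblique split for two Gaussian clusters with weights alpha[i], means mu[i]
    and covariances Sigma[i].  Returns (n, eta, eta1); a point x is routed to the
    first partition when sign(eta1 - eta) * (n @ x - eta) >= 0."""
    Sw = alpha[0] * Sigma[0] + alpha[1] * Sigma[1]          # within-class scatter
    n = np.linalg.solve(Sw, mu[0] - mu[1])
    n /= np.linalg.norm(n)                                  # Fisher direction, unit length
    eta1, eta2 = n @ mu[0], n @ mu[1]                       # projected means
    s1, s2 = np.sqrt(n @ Sigma[0] @ n), np.sqrt(n @ Sigma[1] @ n)   # projected std devs
    eta = qda_split_point(eta1, s1, alpha[0], eta2, s2, alpha[1])   # from the sketch above
    return n, eta, eta1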
In order to be able to compare the efficacy of the split with other splits, we have
to compute its Gini gain. The same method as for the case of unidimensional splits
of continuous variables can be used here. The only unsolved problem is computing
the p-value of a Gaussian distribution with respect to a half-space. The solution
is given by the following result:
Proposition 1. For a Gaussian distribution with mean µ and covariance matrix
Σ = GG^T, positive definite, and density p_{µ,Σ}(x), and a hyperplane with normal n
that contains the point x_c, the p-value with respect to the hyperplane is:

P[n^T(x − x_c) ≥ 0 | µ, Σ] = \int_{n^T(x − x_c) ≥ 0} p_{µ,Σ}(x) \, dx = \frac{σ}{2\sqrt{|Σ||S|}} \left( 1 + \mathrm{Erf}\!\left( \frac{µ'_1}{σ\sqrt{2}} \right) \right)

where

Σ'^{-1} = \begin{pmatrix} s & w^T \\ w & S \end{pmatrix} = M^T Σ^{-1} M

with M orthogonal such that M^T n = e_1, σ = 1/\sqrt{s − w^T S^{-1} w}, and µ' = M^T(µ − x_c).
Proof. Since M^T n = e_1, the first column of M has to be n (which is supposed to
have unit norm) and the rest of the columns are vectors orthogonal to n. Such an
orthogonal matrix can be found using Gram-Schmidt orthogonalization starting
with n and the d − 1 coordinate unit vectors (versors) least parallel to n. Doing the transformation
x' = M^T(x − x_c), which transforms the hyperplane (n, x_c) into (e_1, 0), we get x − µ =
M(x' − µ'). Using the notation Φ for P[n^T(x − x_c) ≥ 0 | µ, Σ] and substituting in
the definition we get:

Φ = \int_{x'_1 ≥ 0} \int_{x'_2} \cdots \int_{x'_d} p(x') \, dx'
  = \int_{x'_1 ≥ 0} \int_{x'_2} \cdots \int_{x'_d} \frac{1}{(2π)^{d/2}\sqrt{|Σ|}} \, e^{-\frac{1}{2}(x' − µ')^T Σ'^{-1}(x' − µ')} \, dx'
With the notation y = x' − µ' and L for the set of indexes 2 . . . d, thus y^T = [y_1 \; y_L^T],
the exponent in the above integral can be rewritten as:

y^T Σ'^{-1} y = s y_1^2 + 2 y_1 y_L^T w + y_L^T S y_L
             = s y_1^2 − y_1^2 w^T S^{-1} w + (y_L + y_1 S^{-1} w)^T S (y_L + y_1 S^{-1} w)
With this we get:

Φ_L(x'_1) = \int_{x'_L} \exp\left( -\frac{1}{2}(x' − µ')^T Σ'^{-1}(x' − µ') \right) dx'_L
          = \int_{y_L} \exp\left( -\frac{1}{2}\left[ s y_1^2 − y_1^2 w^T S^{-1} w + (y_L + y_1 S^{-1} w)^T S (y_L + y_1 S^{-1} w) \right] \right) dy_L
          = \frac{(2π)^{\frac{d−1}{2}}}{\sqrt{|S|}} \exp\left( -\frac{1}{2}(x'_1 − µ'_1)^2 (s − w^T S^{-1} w) \right)
and substituting back into the expression for Φ above we have:

Φ = \int_{x'_1 ≥ 0} \frac{1}{(2π)^{d/2}\sqrt{|Σ|}} \, Φ_L(x'_1) \, dx'_1
  = \frac{σ}{\sqrt{|Σ||S|}} \int_{x'_1 ≥ 0} \frac{1}{\sqrt{2π}\,σ} \, e^{-(x'_1 − µ'_1)^2 / 2σ^2} \, dx'_1
  = \frac{σ}{\sqrt{|Σ||S|}} \int_{t ≥ -µ'_1/(σ\sqrt{2})} \frac{1}{\sqrt{π}} \, e^{-t^2} \, dt
  = \frac{σ}{\sqrt{|Σ||S|}} \left( \frac{1}{2} \, \frac{2}{\sqrt{π}} \int_{-\frac{µ'_1}{σ\sqrt{2}}}^{0} e^{-t^2} \, dt + \frac{1}{\sqrt{π}} \int_{0}^{∞} e^{-t^2} \, dt \right)
  = \frac{σ}{2\sqrt{|Σ||S|}} \left( 1 + \mathrm{Erf}\!\left( \frac{µ'_1}{σ\sqrt{2}} \right) \right)

We now show that s − w^T S^{-1} w > 0, so the above computations are sound. Since
Σ is positive definite by supposition, Σ'^{-1} = M^T Σ^{-1} M is positive definite. This
means that v^T Σ'^{-1} v > 0 for any nonzero v. Taking v = [1 \; −(S^{-1}w)^T]^T we get the
required inequality.
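A direct transcription of the proposition, with M built via QR (a Gram-Schmidt-style orthogonalization) as in the proof; a Monte-Carlo check is included since the formula is easy to mistype. Function names are mine.

import numpy as np
from math import erf, sqrt

def halfspace_pvalue(n, xc, mu, Sigma):
    """P[n^T (x - xc) >= 0] for x ~ N(mu, Sigma), following Proposition 1."""
    d = len(mu)
    n = np.asarray(n, float) / np.linalg.norm(n)
    M, _ = np.linalg.qr(np.column_stack([n, np.eye(d)]))   # orthogonal, first column +/- n
    if M[:, 0] @ n < 0:
        M[:, 0] *= -1                                      # fix the sign so that M^T n = e_1
    Sp_inv = M.T @ np.linalg.inv(Sigma) @ M                # Sigma'^{-1} = M^T Sigma^{-1} M
    s, w, S = Sp_inv[0, 0], Sp_inv[1:, 0], Sp_inv[1:, 1:]
    sigma = 1.0 / sqrt(s - w @ np.linalg.solve(S, w))
    mu1p = (M.T @ (mu - xc))[0]                            # first component of mu'
    return sigma / (2 * sqrt(np.linalg.det(Sigma) * np.linalg.det(S))) \
        * (1 + erf(mu1p / (sigma * sqrt(2))))

# quick Monte-Carlo sanity check in three dimensions
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)); Sigma = A @ A.T + np.eye(3)
mu, xc, n = rng.standard_normal(3), rng.standard_normal(3), rng.standard_normal(3)
print(halfspace_pvalue(n, xc, mu, Sigma))
print((((rng.multivariate_normal(mu, Sigma, 200000) - xc) @ n) >= 0).mean())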
Finding linear regressors
If the current node is a leaf, or in preparation for the situation that all the descendants
of this node are pruned, we have to find the best linear regressor that fits the
training data. We identified two ways the LSE linear regressor can be computed.
The first method consists of a traversal of the original dataset and the
identification of the subset that falls into this node. The least square linear system
in Equation 4.8 is formed with these data-points and solved. Note that, in the
case that all the sufficient statistics can be maintained in main memory, a single
traversal of the training dataset per tree level will suffice.
The second method uses the fact that the split selection method tries to find
a split attribute and a split point that can differentiate best between the two
Gaussian mixtures found in the regressor–output space. The least square problem
can be solved at the level of each of these mixtures under the assumption that the
distribution of the data-points is normal with the parameters identified by the EM
algorithm. This method is less precise since the split is usually not perfect but can
be used when the number of traversals over the dataset becomes a concern.
Before we show how this can be done we need to introduce some notation. We
use subscript I to denote the first d− 1 components (the components referring to
regressors) and o to refer to the last component – the output. Thus, for example
for a matrix G, GII refers to its (d− 1)× (d− 1) upper part. The following result
provides the solution:
Proposition 2. For a Gaussian distribution with mean µ and covariance matrix
Σ = GG^T, the LSE linear regressor is given by:

y = c^T (x_I − µ_I) + µ_o     (4.9)

where c is the solution of the linear equation c^T G_{II} = G_{oI}.
Proof. The LSE linear regressor is the function f(x_I) that minimizes
E[(x_o − f(x_I))^2] over the distribution of x. It can be shown (see (Shao, 1999), Example 1.19) that,
out of all measurable functions, y = E[x_o | x_I] is the LSE estimator.
Thus, it remains only to compute E[x_o | x_I] for x distributed according to a
Gaussian distribution with mean µ and covariance matrix GG^T. We denote by
p(x) the density of this distribution. We have:

E[x_o | x_I] = \frac{\int_{x_o} x_o \, p(x) \, dx_o}{\int_{x_o} p(x) \, dx_o}     (4.10)

Doing the transformation x' = G^{-1}(x − µ) we have x = Gx' + µ, so x_o = G_{oI} x'_I +
G_{oo} x'_o + µ_o and dx_o = G_{oo} \, dx'_o. Making the change of variable in Equation 4.10 we
get:
E[x_o | x_I] = \frac{\int_{x'_o} (G_{oI} x'_I + G_{oo} x'_o + µ_o) \, e^{-\frac{1}{2} x'^T_I x'_I} e^{-\frac{1}{2} x'^2_o} \, dx'_o}{\int_{x'_o} e^{-\frac{1}{2} x'^T_I x'_I} e^{-\frac{1}{2} x'^2_o} \, dx'_o}
            = G_{oI} x'_I + µ_o + G_{oo} \frac{\int_{x'_o} x'_o e^{-\frac{1}{2} x'^2_o} \, dx'_o}{\int_{x'_o} e^{-\frac{1}{2} x'^2_o} \, dx'_o}
            = G_{oI} G_{II}^{-1} (x_I − µ_I) + µ_o
            = c^T (x_I − µ_I) + µ_o     (4.11)

since the integrand x'_o e^{-\frac{1}{2} x'^2_o} is antisymmetric, so its integral over the whole domain of x'_o is zero.
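With the mixture parameters already in Cholesky form, the second method amounts to one small triangular solve per cluster. A sketch using the notation of Proposition 2 (the last coordinate of x is the output; the names are mine):

import numpy as np
from scipy.linalg import solve_triangular

def gaussian_lse_regressor(mu, Sigma):
    """LSE linear regressor implied by a Gaussian over (regressors, output):
    returns (c, mu_I, mu_o) such that y(x_I) = c^T (x_I - mu_I) + mu_o."""
    G = np.linalg.cholesky(Sigma)                 # lower triangular, Sigma = G G^T
    G_II, G_oI = G[:-1, :-1], G[-1, :-1]
    # c^T G_II = G_oI  <=>  G_II^T c = G_oI^T: one triangular solve
    c = solve_triangular(G_II.T, G_oI, lower=False)
    return c, mu[:-1], mu[-1]

def leaf_predict(c, mu_I, mu_o, x_I):
    return c @ (x_I - mu_I) + mu_o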
Experimentally we observed that the first method is usually more precise. The
reason for the difference in precision can be attributed to the fact that the best
split found by the split selection method is not perfect, so the two clusters are not
perfectly differentiated. For medium size datasets we recommend the use of the
first method. If computation time is a concern, the second method, which is only
slightly less precise, can be used.
4.5 Empirical Evaluation
In this section we present the results of an extensive experimental study of SE-
CRET, the linear regression tree construction algorithm we propose. The purpose
of the study was twofold: (1) to compare the accuracy of SECRET with GUIDE
(Loh, 2002), a state-of-the-art linear regression tree algorithm and (2) to compare
the scalability properties of SECRET and GUIDE through running time analysis.
The main findings of our study are:
• Accuracy of prediction. SECRET is more accurate than GUIDE on three
datasets, as accurate on six datasets and less accurate on three datasets. This
suggests that overall the prediction accuracy to be expected from SECRET
is comparable to the accuracy of GUIDE. On four of the datasets, the use of
oblique splits resulted in significant improvement in accuracy.
• Scalability to large datasets. For datasets of small to moderate sizes
(up to 5000 tuples), GUIDE slightly outperforms SECRET. The behavior
of the two methods for large datasets is very different; for datasets with
256000 tuples and 3 attributes, SECRET runs about 200 times faster than
GUIDE. Even if GUIDE considers only 1% of the points available as possible
split points, SECRET still runs 20 times faster. Also, there is no significant
change in running time when SECRET produces oblique splits.
4.5.1 Experimental testbed and methodology
GUIDE (Loh, 2002) is a regression tree construction algorithm that was designed
to be both accurate and fast. The extensive study by Loh (Loh, 2002) showed
that GUIDE outperforms previous regression tree construction algorithms and
compares favorably with MARS (Friedman, 1991), a state-of-the-art regression
algorithm based on spline functions. GUIDE uses statistical techniques to pick
the split variable and can use exhaustive search or just a sample of the points
to find the split point. In our accuracy experiments we set up GUIDE to use
exhaustive search since it is more accurate than split point candidate sampling,
the only other option. For the scalability experiments we report running times for
both the exhaustive search and split point candidate sampling of size 1%.
For the experimental study we used nine real life and three synthetic datasets.
Real life datasets:
Abalone Dataset from UCI machine learning repository used to predict the age of
abalone from physical measurements. Contains 4177 cases with 8 attributes
(1 nominal and 7 continuous).
Baseball Dataset from UCI repository, containing information about baseball play-
ers used to predict their salaries. Consists of 261 cases with 20 attributes (3
nominal and 17 continuous).
Boston Data containing characteristics and prices of houses around Boston, from
UCI repository. Contains 506 cases with 14 attributes (2 nominal and 12
continuous).
Kin8nm Data containing information on the forward kinematics of an 8 link robot
arm from the DELVE repository. Contains 8192 cases with 8 continuous
attributes.
Mpg Subset of the auto-mpg data in the UCI repository (tuples with missing
values were removed). The data contains characteristics of automobiles that
can be used to predict gas consumption. Contains 392 cases with 8 attributes
(3 nominal and 5 continuous).
Mumps Data from StatLib archive containing incidence of mumps in each of
the 48 contiguous states from 1953 to 1989. The predictor variables are
year and longitude and latitude of state center. The dependent variable is
the logarithm of the number of mumps cases. Contains 1523 cases with 4
continuous attributes.
Stock Data containing daily stock of 10 aerospace companies from StatLib repos-
itory. The goal is to predict the stock of the 10th company from the stock
of the other 9. Contains 950 cases with 10 continuous attributes.
TA Data from UCI repository containing information about teaching assistants
at University of Wisconsin. The goal is to predict their performance. Con-
tains 151 cases with 6 attributes (4 nominal and 2 continuous).
Tecator Data from StatLib archive containing characteristics of spectra of pork
meat with the purpose of predicting the fat content. We used the first 10
principal components of the wavelengths to predict the fat content. Contains
240 cases with 11 continuous attributes.
Synthetic datasets:
Cart Synthetic dataset proposed by Breiman et al. ((Breiman et al., 1984), p. 238)
with 10 predictor attributes: X1 ∈ {−1, 1}, Xi ∈ {−1, 0, 1}, i ∈ {2, . . . , 10},
and the predicted attribute determined by: if X1 = 1 then Y = 3 + 3X2 +
2X3 + X4 + σ(0, 2), else Y = −3 + 3X5 + 2X6 + X7 + σ(0, 2). We interpreted
all the 10 predictor attributes as discrete attributes.
Fried Artificial dataset used by Friedman (Friedman, 1991) containing 10 contin-
uous predictor attributes with independent values uniformly distributed in
the interval [0, 1]. The value of the predicted variable is obtained with the
equation: Y = 10 sin(πX1X2) + 20(X3 − 0.5)2 + 10X4 + 5X5 + σ(0, 1).
3DSin Artificial dataset containing 2 continuous predictor attributes uniformly
distributed in interval [−3, 3], with the output defined as
Y = 3 sin(X1) sin(X2). There is no noise added.
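For reference, the three synthetic datasets above can be generated with the sketch below; σ(0, s) is read here as zero-mean Gaussian noise with standard deviation s (if the thesis means variance s, the scale should be adjusted accordingly).

import numpy as np
rng = np.random.default_rng(0)

def gen_3dsin(n):
    X = rng.uniform(-3, 3, size=(n, 2))
    return X, 3 * np.sin(X[:, 0]) * np.sin(X[:, 1])        # no noise added

def gen_fried(n):
    X = rng.uniform(0, 1, size=(n, 10))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1]) + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4] + rng.normal(0, 1, n))
    return X, y

def gen_cart(n):
    X = np.column_stack([rng.choice([-1, 1], n), rng.choice([-1, 0, 1], size=(n, 9))])
    y = np.where(X[:, 0] == 1,
                 3 + 3 * X[:, 1] + 2 * X[:, 2] + X[:, 3],
                 -3 + 3 * X[:, 4] + 2 * X[:, 5] + X[:, 6]) + rng.normal(0, 2, n)
    return X, y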
We performed all the experiments reported in this chapter on a Pentium III
933MHz running Redhat Linux 7.2.
Table 4.1: Accuracy on real (upper part) and synthetic (lower part) datasets of GUIDE and SECRET. In parenthesis we indicate O for oblique splits. The winner is in bold font if it is statistically significant and in italics otherwise.

                      Constant Regressors                              Linear Regressors
            GUIDE           SECRET          SECRET(O)       GUIDE           SECRET          SECRET(O)
Abalone     5.32±0.05       5.50±0.10       5.41±0.10       4.63±0.04       4.67±0.04       4.76±0.05
Baseball    0.224±0.009     0.200±0.008     0.289±0.012     0.173±0.005     0.243±0.011     0.280±0.009
Boston      23.34±0.72      28.00±0.92      30.91±0.94      40.63±6.63      24.01±0.69      26.11±0.66
Kin8nm      0.0419±0.0002   0.0437±0.0002   0.0301±0.0003   0.0235±0.0002   0.0222±0.0002   0.0162±0.0001
Mpg         12.94±0.33      30.09±2.28      26.26±2.45      34.92±21.92     15.88±0.68      16.76±0.74
Mumps       1.34±0.02       1.59±0.02       1.56±0.02       1.02±0.02       1.23±0.02       1.32±0.04
Stock       2.23±0.06       2.20±0.06       2.18±0.07       1.49±0.09       1.35±0.05       1.03±0.03
TA          0.74±0.02       0.69±0.01       0.69±0.01       0.81±0.04       0.72±0.01       0.79±0.08
Tecator     57.59±2.40      49.72±1.72      28.21±1.75      13.46±0.72      12.08±0.53      7.80±0.53
3DSin       0.1435±0.0020   0.4110±0.0006   0.2864±0.0077   0.0448±0.0018   0.0384±0.0026   0.0209±0.0004
Cart        1.506±0.005     1.171±0.001     N/A             N/A             N/A             N/A
Fried       7.29±0.01       7.45±0.01       6.43±0.03       1.21±0.00       1.26±0.01       1.50±0.01
4.5.2 Experimental results: Accuracy
For each experiment with real datasets we used a random partitioning into 50% of
the datapoints for training, 30% for pruning and 20% for testing. For the synthetic
datasets we generated randomly 16384 tuples for training, 16384 tuples for pruning
and 16384 tuples for testing for each experiment. We repeated each experiment
100 times in order to get accurate estimates. For comparison purposes we built
regression trees with both constant (by using all the continuous attributes as split
attributes) and linear (by using all continuous attributes as regressor attributes)
regression models in the leaves. In all the experiments we used Quinlan’s resubsti-
tution error pruning (Quinlan, 1993b). For both algorithms we set the minimum
number of data-points in a node to be considered for splitting to 1% of the size of
the dataset, which resulted in trees at the end of the growth phase with around 75
nodes.
Table 4.1 contains the average mean square error and its standard deviation for
GUIDE, SECRET and SECRET with oblique splits (SECRET(O)) with constant
(left part) and linear (right part) regressors in the leaves, on each of the twelve
datasets. GUIDE and SECRET with linear regressor in the leaves have equal
accuracy (we considered accuracies equal if they were less than three standard
deviations away from each other) on six datasets (Abalone, Boston, Mpg, Stock,
TA and Tecator), GUIDE wins on three datasets (Baseball, Mumps and Fried)
and SECRET wins on the remaining three (Kin8nm, 3DSin and Cart). These
findings suggest that the two algorithms are comparable from the accuracy point
of view, neither dominating the other. The use of oblique splits in SECRET made
a big difference in four datasets (Kin8nm 27%, Stock 24%, Tecator 35% and 3DSin
45%). These datasets usually have less noise and are complicated but smooth (so
they offer more opportunities for intelligent splits). At the same time the use of
oblique splits resulted in significantly worse performance on two of the datasets
(Baseball 13% and Fried 19%).
4.5.3 Experimental results: Scalability
We chose to use only synthetic datasets for scalability experiments since the sizes of
the real datasets are too small. The learning time of both GUIDE and SECRET is
mostly dependent on the size of the training set and on the number of attributes, as
is confirmed by some other experiments we are not reporting here. As in the case of
accuracy experiments, we set the minimum number of data-points in a node to be
considered for further splits to 1% of the size of the training set. We measured only
the time to grow the trees, ignoring the time necessary for pruning and testing. The
reason for this is the fact that pruning and testing can be implemented efficiently
and for large datasets do not make a significant contribution to the running time.
For GUIDE we report running times for both exhaustive search and sample split
point (only 1% of the points available in a node are considered as possible split
points), denoted by GUIDE(S).
Size        GUIDE     GUIDE(S)   SECRET   SECRET(O)
250          0.07       0.05       0.21      0.21
500          0.13       0.07       0.33      0.34
1000         0.30       0.12       0.55      0.58
2000         0.94       0.24       1.08      1.12
4000         3.28       0.66       2.11      2.07
8000        12.58       2.40       4.07      4.12
16000       48.93       9.48       8.16      8.37
32000      264.50      43.25      16.71     16.19
64000     1389.88     184.50      35.62     35.91
128000    6369.94     708.73      73.35     71.67
256000   25224.02    2637.94     129.95    131.70

Figure 4.7: Tabular and graphical representation of running time (in seconds) of GUIDE, GUIDE with 0.01 of the points as split points, SECRET and SECRET with oblique splits for synthetic dataset 3DSin (3 continuous attributes).
Size        GUIDE     GUIDE(S)   SECRET   SECRET(O)
250          0.09       0.07       0.47      0.43
500          0.17       0.14       0.87      0.92
1000         0.36       0.28       1.85      1.83
2000         1.12       0.80       3.58      3.69
4000         2.90       2.38       7.33      7.36
8000        10.46       8.43      13.77     14.05
16000       42.16      33.09      27.80     28.68
32000      194.63     123.63      56.87     58.01
64000     1082.70     533.16     122.26    124.60
128000    4464.88    1937.94     223.42    222.75
256000   18052.16    8434.33     460.12    470.68

Figure 4.8: Tabular and graphical representation of running time (in seconds) of GUIDE, GUIDE with 0.01 of the points as split points, SECRET and SECRET with oblique splits for synthetic dataset Fried (11 continuous attributes).
Figure 4.9: Running time of SECRET with linear regressors as a function of the number of attributes for dataset 3DSin (curves for 4000, 8000, 12000 and 16000 data-points).
Figure 4.10: Accuracy of the best quadratic approximation of the running time for dataset 3DSin (16000 data-points).
Figure 4.11: Running time of SECRET with linear regressors as a function of the size of the 3DSin dataset (curves for 3, 17 and 33 attributes).
Figure 4.12: Accuracy as a function of learning time for SECRET and GUIDE with four sampling proportions (1%, 25%, 50% and 100%).
Results of experiments with the 3DSin dataset and Fried dataset are depicted
in Figures 4.7 and 4.8 respectively. A number of observations are apparent from
these two sets of results:
1. The performance of the two versions of SECRET (with and without oblique
splits) is virtually indistinguishable.
2. The running time of both versions of GUIDE grows quadratically with the dataset
size for large datasets.
3. As the number of attributes went up from 3 (3DSin) to 11 (Fried) the com-
putation time for GUIDE(S), SECRET and SECRET(O) went up about
3.5 times but went slightly down for GUIDE. An interesting question raised
by these results is: How does SECRET scale with the number of regres-
sor attributes? In order to answer this question, we added more regressor
attributes, with values generated randomly in interval [0, 1], to the three ex-
isting attributes of the 3DSin dataset and we measured the running time of
SECRET for various sizes of the training data. The dependency of the run-
ning time on the total number of attributes (the three existing ones plus the
extra attributes added) for various sizes of the training dataset are depicted
in Figure 4.9. These results suggest that the dependency on the number of
regressor attributes is quadratic, a fact reinforced by the good match that
can be observed in Figure 4.10 between the shape of the best quadratic
approximation and the actual observations for the experiment with 16000
data-points. This is to be expected since both the EM algorithm and the
computation of the linear models in the leaves maintain a square matrix with
as many rows and columns as there are regressor attributes (thus of overall quadratic size),
each element of which has to be updated for every
data-point. This quadratic dependency is unavoidable if linear models are
fitted in leaves, thus the scalability of SECRET in this respect is as good as
possible.
4. For fixed size trees, SECRET scales linearly with the number of training
examples. This is apparent from Figures 4.7 and 4.8, as well as from Figure 4.11,
which depicts the results of Figure 4.9 as a function of the size of the dataset.
5. For large datasets (256000 tuples) SECRET is two orders of magnitude faster
than GUIDE and one order of magnitude faster than GUIDE(S).
Since SECRET is much faster than GUIDE and, as we have seen, it has compa-
rable accuracy, a natural question is how much the accuracy of GUIDE decreases if
its running time is limited. To shed some light into this issue we performed exper-
iments on the 3DSin dataset, experiments in which we fixed the minimum number
of data-points in a node to be considered for further splits, the cutoff, to 10 and
we varied only the size of the dataset. Since data is noiseless, we expect the larger
trees that are build for larger datasets, due to fact that the cutoff is constant, to
be more accurate than the trees built on small datasets. We are comparing the
dependency of the accuracy on the running time of GUIDE with various sampling
proportions (proportion of data-points to be considered potential split points) and
SECRET. The results of these experiments are depicted in Figure 4.12. Notice
that, by allowing the same running time, SECRET is by as much as 300 times
more accurate than any of the variants of GUIDE (when the training time is 70
seconds). Curiously, the accuracy of GUIDE does not increase at the initial rate,
as the accuracy of SECRET does, and it levels off at about 0.001 irrespective of
the running time. Even if the initial rate of error decrease were maintained throughout
the range of the running time, SECRET would still be about 30 times more
accurate if the training time is limited to 70 seconds. This significantly reduced
error of SECRET is mostly due to the fact that the 3DSin dataset is smooth so
the larger the tree constructed the more precise the prediction. Since SECRET is
much faster, it can be run on a larger dataset in the same amount of time thus
producing larger and more accurate trees.
Scalability properties of SECRET algorithm: As the experiments reported
in this section suggest, for a fixed tree size, SECRET scales linearly with the size
of the training dataset and quadratically with the number of regressor attributes.
With respect to discrete and split attributes, SECRET has the same scalability
properties as classification tree algorithms thus, by using scalable classification tree
construction techniques (Gehrke et al., 1998), SECRET can achieve good behavior
even for very large datasets. Furthermore, since most of the computational effort
goes into running the EM Algorithm for Gaussian mixtures for every node, for
large datasets SECRET can be further sped up by employing the scalable EM
algorithm of Bradley et al. (1998).
4.6 Discussion
In this chapter we introduced SECRET, a new linear regression tree construction
algorithm designed to overcome the scalability problems of previous algorithms.
The main idea is, for every intermediate node, to find two Gaussian clusters in the
regressor–output space and then to classify the data-points based on the closeness
to these clusters. Techniques from classification tree construction are then used
locally to choose the split variable and split point. In this way the problem of
forming and solving a large number of linear systems, required by an exhaustive
search algorithm, is avoided entirely. Moreover, this reduction to a local classifi-
cation problem allows us to efficiently build regression trees with oblique splits.
Experiments on real and synthetic datasets showed that the proposed algorithm
is as accurate as GUIDE, a state-of-the-art regression tree algorithm, if normal
(axis-parallel) splits are used, and on some datasets up to 45% more accurate if oblique splits are
used.
used. At the same time our algorithm requires significantly smaller computational
resources for large datasets.
Chapter 5
Probabilistic Decision Trees
Classification and regression trees prove to be accurate models, and easy to inter-
pret and build due to their tree structure. Nevertheless, they have a number of
shortcomings that reduce their accuracy: (1) for data-points close to the decision
boundaries, the prediction is discontinuous, (2) natural fluctuations into the data
are not taken into account, and (3) data is strictly partitioned at every node and a
data-point will influence only one of the children nodes thus the amount of data is
exponentially reduced. All of these shortcomings are due to the fact that the splits
employed in the nodes are sharp. In this chapter we address all these concerns with
traditional classification and regression trees by allowing probabilistic splits in the
nodes, splits that depend also on the natural fluctuations in the data. The result of
these modifications to traditional classification and regression trees is a new type
of model we call probabilistic decision trees. They prove to be significantly more
accurate than their traditional counterparts, by as much as 36%, and, at the same
time, they require only a modest increase in the computational effort.
5.1 Introduction
As we have seen in Chapter 2, classification and regression trees use deterministic
conditions as split predicates. These conditions determine exactly one branch to
be followed at each step in order to reach a unique leaf. Usually the conditions
involve a single attribute, the split attribute, and a very simple condition of the
form X ∈ S for discrete attributes or X ≤ c for continuous attributes, with S some
set and c a constant. This type of model has the advantage that it is simple
and has predictable behavior, requiring a simple walk starting from the root of
the tree on true predicates to reach the leaf. Despite all these good properties, the
classic classification and regression trees – we refer to them collectively as decision
trees – have a number of shortcomings that, in general, decrease their efficiency,
and, in some cases, severely limit their applicability.
• Discontinuity of Decision. Decision trees can be thought of as a compact
specification of a partitioning of the decision domain (the crossproduct of the
domain of all the split attributes) into distinct regions–each such partition,
that we will call a decision partition, is specified by the conjunction of the
predicates on the path from the root to a leaf–together with the decision,
that corresponds to the leaf that ends the path. These decision partitions do
not overlap and have crisp borders, which results in discontinuous decisions.
More precisely, when the input is close to the border, a small modification of the
input might result in an entirely different decision. This means
that the decision is arbitrary near the border, since it depends on small
fluctuations, thus unlikely to be good. For regression trees it also means
that the model, seen as a function, is discontinuous on the border, thus,
it is not everywhere differentiable – in particular it has no first derivative
on the border. This discontinuity property of regression trees makes them
unsuitable for applications that require continuity, and, in general, results in
lack of precision around the border.
• Exponential Fragmentation of the Data. In the process of building the
trees, only data-points corresponding to the node being built (data-points
that satisfy the conjunction of predicates on the path from the root to the
node) are used in subsequent decision for this node and any of its descendants.
Thus, after a split predicate is decided upon, the data is partitioned into two
parts, one corresponding to each of the children, and each of the data-points
contributes to the construction of exactly one child. This means the amount
of data available to take decisions decreases exponentially with every level of
the tree built, a process that is usually referred to as data fragmentation.
The statistical significance of a decision, either split for an intermediate node
or label/regressor for a leaf node, depends mostly on the number of data-
points on which the decision is based – having larger amounts of data
results in greater statistical significance which translates in greater precision
for the decision tree. With this in mind, it is immediately apparent that data
fragmentation results in decreased precision in the regions of the decision
space close to the split points of nodes higher in the tree. This is where,
most likely, the decision has to be refined, but the amount of data on which
the decision is based is severely reduced by fragmentation.
• Fluctuations due to Noise. Even though having crisp decision surfaces
is appealing from the point of view of the users (the classifier is easy to
understand) determining a precise boundary for the region is problematic
since data is usually noisy and finite. This suggests that the placement of the
boundary is somewhat arbitrary and, due to the fact that statistical signifi-
cance decreases with the reduction of the number of data-points, the deeper
in the tree the less precise the placement of the boundary is. Traditional
decision trees do not give any indication on how firm the fine structure of
the tree actually is. Quite often small modifications in the training data
result in vastly different trees; this sometimes creates problems for the end
user. Taking the natural fluctuations into account should greatly improve
the stability of the structure of the classifier and make it less prone to data
fragmentation due to arbitrary decisions.
• Probabilistic Decisions. For some applications it is desirable for classifi-
cation trees to produce a probabilistic answer instead of a deterministic one.
Such a probabilistic answer can be easily transformed into a deterministic
prediction, and, in addition, provides extra information about the confidence
in the answer. For example, a user would trust a prediction more knowing
that the probability of this result is .9 rather than .51,
even though both will produce the same deterministic answer, for example
class label YES. Estimating such probabilities only from the information at
the leaves, as proposed by Breiman et al. (1984), is error-prone, especially
for inputs close to decision borders, due to data fragmentation, and is discontinuous
on the border.
In view of the above discussion on the shortcomings of traditional decision trees,
we think that there is a real need to refine the decision trees in a systematic way
to address all of these issues, but, at the same time, to maintain their desirable
properties. This is exactly the purpose of the new model we propose in this chapter,
the probabilistic decision tree (PDT). In the next section we provide the detailed
description of PDTs, and in Section 5.3 we show how they can be constructed.
We show then, in Section 5.4, experimental evidence that the accuracy is almost
always increased, sometimes dramatically, and, at the same time, the increase in
the computational effort is small. We comment on related work in Section 5.5 and
conclude in Section 5.6.
5.2 Probabilistic Decision Trees (PDTs)
We designed the Probabilistic Decision Trees (PDTs) in order to directly address
limitations of the classic decision trees (DTs), limitations that, as we have seen,
have a negative impact on either the precision or the applicability of decision trees.
Before we delve into details, let us first give a high level description of the
probabilistic decision trees and explain how they address the shortcomings of the
traditional decision trees.
The main ideas behind PDTs are:
• Make the splits fuzzy.1 Instead of deciding at the level of each of the
intermediate nodes if the left or right branch should be followed, a proba-
bility is associated with following each of the branches – the sum of the two
probabilities should always be 1, so it is enough to specify only one of the
two. Thus sometimes both branches are actually followed, but one might be
more important than the other. This effectively allows, in the most general
case, all the leaves to be reached by any of the inputs. To obtain a prediction,
the predictors of all leaves of the tree are used, but the importance of the
contribution is determined by the probability to reach the respective leaf.
1 Here by fuzzy splits we mean relaxed splits, not splits necessarily expressed with fuzzy logic.
• Predict probability vectors instead of class labels. For regression
trees, since linear combinations of real numbers are easy to obtain, the infor-
mation in multiple leaves can be easily combined. Class labels, on the other
hand, cannot be directly combined in a linear fashion. The natural solution
to this problem is to replace the class label prediction with probability vec-
tor prediction–every class label has an entry in the vector that specifies the
probability for the tree to produce the class label. It is easy to obtain linear
combinations of probability vectors, and to produce class label predictions
from probability vectors by simply returning the most probable label.
• Retain the good properties of DTs. In addition to addressing the short-
comings of traditional decision trees, we would like to retain, as much as
possible, the desirable properties of DTs. More precisely, we would like the
PTDs to be as close to DTs with respect to usage and construction; in this
manner PDTs will still be interpretable, the prediction can be made effi-
ciently, and the PDTs can be built in a scalable fashion by merely adapting
the decision tree algorithms. It is quite clear that, in order to address the
problems of DTs, the efficiency of prediction and construction will decrease;
it is important though to keep the performance degradation minimal.
The way we will achieve all three desiderata is by, first, generalizing the decision
trees in order to allow imprecise splits and, for the case of classification trees, the
prediction of probability vectors; then, by severely restricting the general model to
make sure that the prediction and construction of the model can be done efficiently,
and, at the same time, the shortcomings of DTs are still addressed. We will refer to
the general model as generalized decision trees (GDTs) and the two specializations
as generalized classification trees (GCTs) and generalized regression trees (GRTs).
5.2.1 Generalized Decision Trees (GDTs)
The GDTs generalize the decision trees by replacing split predicates in intermediate
nodes with probability distributions over the inputs. Since in this chapter we are
interested only in binary trees, we show the generalization only for this type of
decision trees, but all the ideas can easily be extended to the general case.
Like decision trees, GDTs consist of a set of nodes that are linked in a tree
structure; the nodes without descendants are called the leaves, all the other nodes
are called intermediate nodes. The information that each type of node contains is
the following:
• Intermediate Nodes. Let us denote by T the node and by TL and TR its
left and right descendants respectively. Also, we denote by x ∈ T the event
that input x is routed to the node T. The only information associated with the
node is P [x ∈ TL|x ∈ T ], the probability the input x is routed to the left
node, TL, given the fact that it was routed to the node T –the probability to
follow the right branch is completely determined by this probability. In its
simplest form the probability P [x ∈ TL|x ∈ T ] depends on a single predictor
attribute, in which case we call the attribute the split attribute, but in general
it can be a general probability distribution function that depends on all the
attributes.
• Leaf Nodes. The information associated with each leaf is a probability
vector, that specifies the probability for each of the class labels, in the case
of generalized classification trees (GCTs), and a regressor – constant, linear
or more complicated– in the case of generalized regression trees (GRTs). For
leaves of GCTs, we denote by P [C = c|L] the probability that the prediction
is class label c given the leaf node L, which is nothing else but the c-th
component of the probability vector associated with leaf L, P [C|L]. For
leaves of GRTs, we denote by fL(x) the regressor function that produces the
numerical output when given the input for the leaf L. We denote the set of
all leaves in the GDTs by L.
Notice that GDTs are complete probabilistic models in the sense that the tree
structure together with the information in the node specify a probabilistic model.
In order to produce predictions with this model, we can simply use the fact that,
given some input x, the prediction with the smallest expected squared error is the
expected value of the prediction given the model (Shao, 1999). For the two types
of models, this best predictor is:
• GCT: Denoting by V the random vector distributed according to the distribution
specified by the GCT (Vc is its c-th component), the best predictor
for the probability of seeing some class label c given the input x, vc(x), is:

v_c(x) = E[V_c | x] = P[C = c | x] = \sum_{L \in \mathcal{L}} P[C = c | L] \, P[x ∈ L]     (5.1)
To establish the above result we used the fact that reaching each of the leaves
is an independent event and the fact that the probability to see a particular
class label in any of the leaves is independent of the input. To transform the
probability vector into a class label, we simply return the class label with
the highest probability.
• GRT: Denoting by Y the random variable distributed according to the dis-
tribution specified by the GRT, the best predictor for the output given the
input x, y(x), is:
y(x) = E[Y | x] = \sum_{L \in \mathcal{L}} f_L(x) \, P[x ∈ L]     (5.2)
where we used again the fact that reaching each leaf is an independent result.
The probability to reach any leaf, given some input x, P [x ∈ L], can be com-
puted recursively – using the Bayes rule and the fact that x ∈ T ′ ⇒ x ∈ T if T ′
is a descendant of T – starting at the root R of the tree and following the path
leading to the leaf L using the equations:
P[x ∈ R] = 1     (5.3)

P[x ∈ T'] = P[x ∈ T' | x ∈ T] · P[x ∈ T]     (5.4)

where T is some intermediate node and T' is one of its children. Thus, if the path from
R to L is R, T1, T2, ..., Tn, L, then

P[x ∈ L] = P[x ∈ T1 | x ∈ R] · P[x ∈ T2 | x ∈ T1] · · · P[x ∈ L | x ∈ Tn]     (5.5)
so it is simply the product of the probabilities, conditioned on the input x, along
the path.
If the conditional probabilities in the intermediate nodes are defined as:

P[x ∈ TL | x ∈ T] = 1 if pL(x) = true, and 0 otherwise     (5.6)

with pL(x) some predicate on inputs (for example X > 10), then the GDTs degenerate
into traditional decision trees – thus indeed GDTs generalize DTs.
Generalized classification trees have been proposed before under the name
Bayesian decision trees (Chipman et al., 1996; Chipman et al., 1998). They can
also easily emulate fuzzy or soft decision trees (Sison & Chong, 1994; Guetova
et al., 2002).
5.2.2 From Generalized Decision Trees to Probabilistic De-
cision Trees
Clearly, the GDTs allow imprecise splits and predict, in the case of GCTs, proba-
bility vectors, but they are very hard to learn since any probability distribution is
allowed in the nodes. Also, if the probability distributions are not 0 or 1 for the
majority of the data-points, most of the leaves have to be consulted to make a pre-
diction, which means that making a prediction might be too slow for some practical
applications. In order to retain the good properties of DTs, the GDTs have to be
drastically restricted. We call this restricted version of GDTs probabilistic decision
trees (PDTs), and the two specializations probabilistic classification trees (PCTs)
and probabilistic regression trees (PRTs). In what follows, we point out, for each
restriction, how the performance is improved but the shortcomings of traditional
trees are still addressed.
The notion of split variable and split predicate has a lot of appeal from the user's
point of view. At the same time, algorithms to find the split variable and split
point for decision trees are efficient and well studied. In order to retain these
good properties of decision trees and, at the same time, to allow probabilistic
splits, we require the probabilities associated with each node to characterize the
fluctuations of the split point under noise in the data, instead of being general
probability distributions. In this manner, the fluctuations are naturally captured
in the model; the only problem is determining these fluctuations instead of finding
general densities that would fit the data (which is known to be a hard problem).
By requiring the probabilities to capture the fluctuations of the split point, we
take advantage of the fact that, except capturing fluctuations, split attributes and
points in decision trees provide a good way to capture the underlying structure of
the system being learned. Furthermore, since we assume that the training data
are samples from the distribution that describe this system, by generating other
datasets from the same distribution – which, in general, will not coincide with the
dataset made available – we will observe fluctuations of the split points for any of
the split variables. These are the fluctuations we want to capture and leverage
in our models.
Let us now be more specific and, for the two types of split variables, continuous
and discrete, specify precisely the probabilities that can be used in the intermediate
nodes together with their interpretation from the point of view of the fluctuations.
We defer the discussion on how to find such probabilities to Section 5.3.3.
Continuous Split Variables
Due to the noise in the data-points available for learning, the split point – as com-
puted by the split selection methods for decision tree construction algorithms on
these datasets, which usually means choosing the point where the split selection
criteria is minimized – will fluctuate. Since we want to keep things as simple as
possible, and we want to obtain decision trees that can produce predictions effi-
ciently, we restrict the specification of the probability for nodes with continuous
split variables to two parameters: the mean and the variance. The distribution
of the split point is well confined in space; it is, in general, reasonably well
approximated by a normal distribution. For this reason, in a node T with split variable X
and with split point mean and variance µ and σ^2 respectively, we restrict the probability
function P[x ∈ TL | x ∈ T] to:

P[x ∈ TL | x ∈ T] = P[X < N(µ, σ^2)] = \frac{1}{2}\left[ 1 − \mathrm{Erf}\!\left( \frac{x − µ}{σ\sqrt{2}} \right) \right]     (5.7)
where N(µ, σ2) is the normal distribution with mean µ and variance σ2, and Erf(x)
is the error function (the last quantity is the cumulative distribution function
expressed in terms of the error function). The probability specified by this formula
is nothing but the probability that the observed value x of the split attribute X
is at the left of the split point that is distributed as N(µ, σ2) (the actual random
variable). With this interpretation, we only need to determine the mean and
variance of the split point, not a full probability distribution – a much easier and
less error prone task.
Discrete Split Variables
Split points for discrete variables, in traditional decision trees, are specified by the
subset of values for which the left branch should be taken. Due to the fluctuations
in the training data, a particular value will sometimes be placed into this subset and
sometimes in its complement. The way we model these fluctuations is by associating,
with each possible value, the probability to belong to the subset, which is exactly a
probability vector that we use to model information about class labels in the GCTs.
Thus, for a given value of the split variable we can simply find (by consulting our
probability vector) the probability that the left branch should be followed. Notice
that, as opposed to the continuous case, where we had to restrict the possible
distributions in order to keep the model reasonably simple and easy to use, in
the discrete case our representation has full generality (specifies completely the
distribution).
One of the appealing features of PDTs is the fact that they can be easily
converted or interpreted as traditional decision trees. This can be achieved by:
1. using the split predicates X < µ instead of the probability function P [X <
N(µ, σ2)]
2. converting the probability vectors for discrete attributes into the subset
for the left branch by selecting the values with corresponding probability at
least 0.5
3. replacing the probability vector in the leaves with the most probable class
label, if necessary.
5.2.3 Speeding up Inference with PDTs
Having a fast inference mechanism is important in its own right, but becomes
critical during learning since the inference, using the already built part of the tree,
is extensively used in the process of building new nodes. Thus, speeding up the
inference results in speedups of the learning process as well.
The inefficiency of the inference of PDTs, inherited from the GDTs, is due to the
fact that all leaves participate in the prediction process, each having a contribution
proportional with the probability that the leaf is reached from the root given the
input. This means that, instead of using time proportional with the height of the
tree, as is the case for DTs, time proportional with the number of nodes (which is
exponentially bigger) is used. The main observation for alleviating this inefficiency
is the fact that most of the leaves are very unlikely to have significant contribution
to the prediction, which suggests that they can be excluded altogether from the
decision process. Obviously, the set of leaves relevant for prediction is input
dependent.
To make the above principle practical, we observe that, due to the fact that the
probability to reach a leaf is the product of probabilities on the path, the aggregate
contribution of all the leaves that belong to a subtree rooted at some intermediate
node is exactly the probability to reach that node (i.e. the product of probabilities
from the root to the node). This suggests that, if the probability to follow the
left branch, for some input x, is small – say smaller than α = 0.01, with α some
predefined threshold – then the contribution of the left subtree to the prediction,
when compared to the contribution of the right subtree, can be ignored. Thus,
we can improve the inference time by following a branch only if the probability to
follow it is larger than α. This means that, usually, only a skinny subtree of the
original tree has to be consulted to make a prediction, greatly improving the
inference time. For the rest of the chapter, we assume that this pruning rule is
always used with the default value for the threshold α = 0.01, but, to keep things
simple, we will provide the explanation as if no pruning is explicitly performed.
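The following sketch shows one way the α-pruned inference could be organized; the Node class is a hypothetical, stripped-down PDT node used only for illustration. Because pruned branches are dropped, the accumulated class probabilities may sum to slightly less than one and can be renormalized before being reported.

class Node:
    def __init__(self, prob_left=None, left=None, right=None, class_probs=None):
        self.prob_left = prob_left        # function x -> P[left branch | x]
        self.left, self.right = left, right
        self.class_probs = class_probs    # dict label -> probability (leaves only)

def predict_probs(node, x, alpha=0.01, weight=1.0, acc=None):
    acc = {} if acc is None else acc
    if node.class_probs is not None:      # leaf: accumulate weighted class probabilities
        for label, p in node.class_probs.items():
            acc[label] = acc.get(label, 0.0) + weight * p
        return acc
    p = node.prob_left(x)
    if p > alpha:                         # explore a branch only if its probability exceeds alpha
        predict_probs(node.left, x, alpha, weight * p, acc)
    if 1.0 - p > alpha:
        predict_probs(node.right, x, alpha, weight * (1.0 - p), acc)
    return acc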
5.3 Learning PDTs
Since, by design, structurally the PDTs are very close to DTs, the learning algo-
rithms for DTs can simply be adapted to construct PDTs. In what follows, we
explain only the modifications to the traditional decision tree construction algo-
rithms that are necessary to construct PDTs.
We have two distinct sets of problems to deal with in order to be able to adapt
the DT algorithms to PDTs. First, we have to show how sufficient statistics should
be computed for each of the nodes in a PDT. The sufficient statistics, as discussed
in Chapter 2, are the aggregate information for each of the nodes, information
that is necessary and sufficient for all construction algorithms. Second, we have
to show how the sufficient statistics are to be used to construct the PDTs. The
design principle for constructing PDTs from sufficient statistics is to retain most
of the algorithm for DTs construction. In particular, we want to replace the split
point by a split distribution but to keep as much of the rest of the construction
algorithm the same. Let us now see, in detail, how we deal with each of these
issues.
5.3.1 Computing sufficient statistics for PDTs
Intuitively, the sufficient statistics for node T characterize the data-points from the
training or pruning dataset that are routed at node T . They capture aggregate
properties only; these properties are further used by the construction algorithm.
In the context of PDTs, the main problem we have to deal with is the fact that,
in general, a data-point is routed with nonzero probability to multiple nodes in
the tree. Thus, as opposed to DT construction, a data-point might contribute
to the construction of multiple nodes, but its contribution should depend on the
probability of the data-point to reach each of the nodes. The natural way to
capture this intuitive requirement – which, as we will see, reduces to the classic
case when PDTs have the node probabilities defined as in Equation 5.6, thus they
degenerate into DTs – is by conditioning the sufficient statistics on the fact that
the data-point, which we denote by the random vector (X, C), is routed to the
node T . We were doing this conditioning implicitly in Chapter 2 but there it had
the direct interpretation: restrict the computation of the statistic to points routed
to node T . Here we have to be more explicit. We have two types of sufficient
statistics to deal with: probabilities and expectations.
Computing Probabilities
We are interested in conditioning only on the data-point (X, C) being routed to
node T . Thus, all the conditional probabilities we have to deal with have the form
P [p(X, C)|X ∈ T ] for PCTs and P [p(X, Y )|X ∈ T ] for PRTs, where p(·) is some
predicate. To keep things simple, we show only the formulae for classifiers, but the
formulae for regressors can be obtained by simply replacing C by Y . Using the
formula for the conditional probability, the way we should estimate these quantities
given the training (or pruning) dataset D is:
P[p(X,C) \mid X \in T] = \frac{\sum_{(x,c)\in D \,\wedge\, p(x,c)} P[x \in T]}{\sum_{(x,c)\in D} P[x \in T]}   (5.8)
This formula is intuitive since P [x ∈ T ] weights the contribution of the data-
point (x, c) and the denominator of the fraction normalizes the result. The quan-
tity P [x ∈ T ] is computed using the part of the tree already constructed (more
specifically the path from the root to T ) and Equation 5.5.
For instance, to estimate the probability that the discrete attribute Xi takes
the value aj and that the class label is c given the fact that the data-point is routed
to node T we use the formula:
P[X_i = a_j \wedge C = c_k \mid X \in T] = \frac{\sum_{(x,c)\in D \,\wedge\, x_i = a_j \,\wedge\, c = c_k} P[x \in T]}{\sum_{(x,c)\in D} P[x \in T]}   (5.9)
Similarly, to estimate the conditional probability that a continuous attribute
Xi takes values less than a, we use the formula:
P[X_i < a \mid X \in T] = \frac{\sum_{(x,c)\in D \,\wedge\, x_i < a} P[x \in T]}{\sum_{(x,c)\in D} P[x \in T]}   (5.10)
Note that, when the probabilities in the nodes are degenerate, as in Equa-
tion 5.6, the sum in both the numerator and denominator in all the above fractions
becomes the sum over the data-points at node T instead of the sum over the train-
ing set and the P [x ∈ T ] terms disappear. Thus, all these formulae are identical
with their classic counterpart in the degenerate case, re-confirming the fact that
the new formulae generalize the old ones.
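A minimal sketch of Equation 5.8, in which every training point contributes with weight P[x ∈ T]; the function reach_prob, standing for the probability computed from the already-built path to T (Equation 5.5), is an assumption of the sketch.

def conditional_prob(data, reach_prob, predicate):
    """Weighted estimate of P[predicate | X in T] over data = [(x, c), ...]."""
    num = sum(reach_prob(x) for (x, c) in data if predicate(x, c))
    den = sum(reach_prob(x) for (x, c) in data)
    return num / den if den > 0 else 0.0

# Example instantiations (D, reach_prob, i, a_j, c_k, a are placeholders):
#   Equation 5.9:  conditional_prob(D, reach_prob, lambda x, c: x[i] == a_j and c == c_k)
#   Equation 5.10: conditional_prob(D, reach_prob, lambda x, c: x[i] < a)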
Computing Expectations
For some decision tree construction algorithms, in addition to estimating probabil-
ities, we have to estimate expected values. Of particular interest are the compu-
tation of means and variances of points routed to a particular node. The general
form of expectation we would like to estimate in this context is: E[f(X, C)|X ∈ T ]
for classifiers and E[f(X, Y )|X ∈ T ] for regressors, where f is some predefined
function. Again, we show only the formulae for classifiers since the formulae for
regressors can be obtained by simply replacing C by Y . For any real function f(·),
E[f(X,C) \mid X \in T] = \frac{\sum_{(x,c)\in D} f(x,c)\, P[x \in T]}{\sum_{(x,c)\in D} P[x \in T]}   (5.11)
For instance, to estimate the parameters of the normal distributions required to
determine the split point for continuous attribute Xi using the statistical method
in Chapter 2, the following formulae have to be used, formulae that just instantiate
the above result:
\mu_c = E[X_i \mid X \in T, C = c] = \frac{\sum_{(x,c')\in D \,\wedge\, c'=c} x_i\, P[x \in T]}{\sum_{(x,c')\in D \,\wedge\, c'=c} P[x \in T]}

\sigma_c^2 = E[X_i^2 \mid X \in T, C = c] - \mu_c^2 = \frac{\sum_{(x,c')\in D \,\wedge\, c'=c} x_i^2\, P[x \in T]}{\sum_{(x,c')\in D \,\wedge\, c'=c} P[x \in T]} - \mu_c^2
Notice again that, in the case the probabilities in the nodes are degenerate
(given by Equation 5.6), these formulae become the classic formulae.
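The class-conditional mean and variance above can be computed with the same weighting scheme; the following sketch assumes the data is a list of (attribute vector, label) pairs, that reach_prob(x) returns P[x ∈ T], and that at least one data-point of the requested class reaches T.

def class_mean_var(data, reach_prob, attr_idx, label):
    """Weighted mean and variance of attribute attr_idx, restricted to class `label`."""
    w_sum = wx = wx2 = 0.0
    for (x, c) in data:
        if c != label:
            continue
        w = reach_prob(x)
        v = x[attr_idx]
        w_sum += w
        wx += w * v
        wx2 += w * v * v
    mu = wx / w_sum
    var = wx2 / w_sum - mu * mu
    return mu, var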
By allowing the f(·) function to produce a vector or a matrix instead of a real
number, formulae for the mean and covariance matrix of a multidimensional normal
distribution (conditioned on the data being routed at node T ) are obtained. This
means that the conditional version of the EM algorithm, described in Section 4.2,
can simply be obtained by weighting the contributions of the data-points in the
same manner as above; this suggests that algorithms like SECRET for regression
tree construction (see Chapter 4) can be adapted to produce PDTs.
Interestingly, the amount of space required for maintaining the sufficient statis-
tics does not increase much; the only increase is due to the fact that we have to
use a real number for quantities for which integers previously sufficed.
5.3.2 Adapting DT algorithms to PDTs
Now that we have shown how sufficient statistics, the building blocks of the learning
algorithm, are computed, we show how the DT construction algorithm can be
adapted to build PDTs. We talk about each of the two decision tree construction
phases in what follows.
Tree Growth
As a reminder, in the tree growth phase for DTs, a tree as large as possible is
built incrementally using the training dataset. For every node being constructed,
first, sufficient statistics are gathered. Then, using these statistics, the suitability
of each of the attributes for the role of split attribute is evaluated and the most
promising one chosen. Lastly, or concurrently with the previous step, the best
split point for the split attribute is determined. Once the split attribute and point
are determined, if the stopping criterion is satisfied, the growth is stopped for the current node and leaves are built for each of the two descendants – a process which requires statistics for each of the leaves to be gathered; otherwise, the process is repeated recursively on the two descendants.
Since the decision tree construction algorithms in Chapter 2 have been expressed
in terms of sufficient statistics, simply by using the new formulae for these statistics
in Section 5.3.1, we obtain most of the construction algorithm for probabilistic
decision trees. The only issue that remains to be addressed is the determination of
the probability distribution associated with the node for the split attribute settled
upon – the distribution that specifies the probability to follow the left branch for
a given input. We defer the discussion on how such probability distributions are
determined to Section 5.3.3.
Tree Pruning
In the construction of DTs, after the tree is grown, a separate dataset, called
pruning dataset, is usually used to determine a way to prune the tree in order
to maximize the chances that the model will be a good predictor for unseen data.
The pruning phase is usually necessary since the growth phase tends to overfit the data
(learn the noise).
In order to decide if the tree has to be pruned at a node T or not, two
types of quantities have to be computed: an estimate of the contribution of the
node T to the generalization error if the tree is pruned at T , and, in the case of
complexity based pruning methods, an estimate of the complexity of the subtree
rooted at T (Frank, 2000). Usually the tree is pruned at T if the contribution
to the generalization error of T is less than the smallest possible contribution of
its descendants plus the complexity cost of the subtree. Once the generalization
error estimate for each node is determined, the subtrees that have to be pruned
are determined in a bottom-up fashion, starting with leaves and moving up.
To adapt the pruning methods to PDTs it is enough to specify how to estimate
the generalization error for each node, and to account for the increase in complexity
of the tree, if necessary. To compute the complexity of a node, the same methods
developed for classic decision trees can be used; the additional cost is due to the
fact that maintaining information about the probabilities in the nodes requires
more space than a simple split point.
The contribution of a node, if considered a leaf, to the generalization error is
usually estimated by computing the contribution of the node to the empirical error
with respect to the pruning set. In the case of PDTs, usually, we have more than
one leaf responsible for the error of a data-point since multiple leaves are reached
with significant probability. This error should be interpreted as an expectation
due to the probabilistic interpretation of PDTs. Taking (x, c) to be the input, Er
the error metric, and T the tree with L its leaf-set, we have:
\mathrm{Er}_T(x, c) = \sum_{L \in \mathcal{L}} \mathrm{Er}_L(x, c)\, P[x \in L]   (5.12)
Note that this formula becomes exactly the error formula for DTs when the distri-
bution is degenerate since only one leaf will have P[x ∈ L] ≠ 0. Also, the formula
is very intuitive since the blame for the error is distributed among all leaves in
proportion to the probability that the data-point is routed to each leaf.
If we look now at the global error, we can rewrite it in the following manner:
\mathrm{Er}_T(D_p) = \sum_{(x,c)\in D_p} \left[ \sum_{L \in \mathcal{L}} \mathrm{Er}_L(x, c)\, P[x \in L] \right]
                   = \sum_{L \in \mathcal{L}} \sum_{(x,c)\in D_p} \mathrm{Er}_L(x, c)\, P[x \in L]
                   = \sum_{L \in \mathcal{L}} \mathrm{Er}_L(D_p)

where Dp is the pruning set, and

\mathrm{Er}_L(D_p) = \sum_{(x,c)\in D_p} \mathrm{Er}_L(x, c)\, P[x \in L]   (5.13)
is the contribution of the leaf L to the overall error. This last equation gives exactly
the quantity we need to compute in order to be able to prune the tree, a quantity that can be computed incrementally in a straightforward manner.
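A minimal sketch of Equation 5.13: the pruning-set error is accumulated per leaf, each data-point contributing in proportion to the probability of reaching that leaf. The helper functions reach_prob and err (for instance, 0/1 loss for classification) are assumptions of the sketch.

def leaf_errors(leaves, pruning_set, reach_prob, err):
    """Per-leaf error contributions Er_L(D_p); their sum is the overall error Er_T(D_p)."""
    contrib = {leaf: 0.0 for leaf in leaves}
    for (x, c) in pruning_set:
        for leaf in leaves:
            contrib[leaf] += err(leaf, x, c) * reach_prob(leaf, x)
    return contrib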
5.3.3 Split Point Fluctuations
One of the main design goals of PDTs – the only one that has yet to be ensured –
is to take into account the natural fluctuations in the data and to incorporate them
in the model. With this goal in mind, we observe that, if we would have multiple
training datasets – all generated with the same generative method but possibly
distinct, since the generation process is probabilistic – and we would compute the
split point, for each of these datasets we would get a different split point with close
but not exactly the same value. This means that, because the size of the training
dataset is finite, the split point is not constant but it naturally fluctuates around its
expected value. Due to this uncertainty of where the split point actually is, when
presented with an input x – especially when the input is close to the expected
value of the split point – we do not know for sure if we should send the data-point
to the left or right child. This uncertainty is perfectly captured by a probability
distribution, which is exactly the information we decided to associate with each
node in a PDT. Thus, we have a very natural way for interpreting and computing
the probability distribution in the nodes of PDTs; it is the probability that the
input x, a fixed and given quantity, satisfies the split predicate, a predicate that is
defined with respect to the split point, a probabilistic or fluctuating quantity.
Before we show how the distribution of the split points is actually estimated for
the two types of attributes, discrete and continuous – which allows us to construct
the probabilities in the nodes of PDTs – let us present, in short, two different ways
to estimate these distributions:
Empirical method. The idea behind the empirical method is to determine the
distribution of the split point experimentally by generating multiple training
datasets, computing the split point for each such dataset, thus obtaining a
set of samples for the split point, and then approximating the distribution
with a parametric or non-parametric method. The requirements for the gen-
erated training datasets are to have the same size – the size of the dataset
critically influences the amount of fluctuation – and the same underlying
distribution as the training dataset available in the learning process. Boot-
strapping is a very general method, developed in the statistical literature
(Davison & Hinkley, 1997), to generate as many datasets as needed satisfy-
ing these requirements. It consists of sampling, with replacement from the
original dataset, multiple datasets of the same size. Thus, by computing the
split point of these re-sampled datasets, samples from the distribution of the
split point are obtained. The advantage of the empirical method is the fact
that, in the manner described above, the distribution of the split point can
be approximated with any precision. The disadvantage is the fact that it is
computationally intensive; even when parametric approximations are used
to model the distributions, the number of samples that have to be produced
is in the hundreds in order to get reasonable approximations.
Analytical method. The idea behind the analytical method is to model, with
parametric distributions, the training dataset. Using this parametric model
and analytical analysis, formulae are developed for the distribution of the
split point. The advantage of the method is the fact that the distribution of
the split point can be determined very fast – essentially by simply instantiat-
ing variables in a formula. The disadvantages are the fact that the parametric
modeling might not be very accurate, and the fact that the method can be
applied only to restricted circumstances due to the fact that the analysis
tends to be very hard in general.
In developing both these kinds of methods we will assume, for simplicity, that
the training dataset available at the node being constructed is non-probabilistic.
This is not a restriction, since the results are easily extensible to the probabilistic
version simply by using the new formulae for the sufficient statistics developed in
Section 5.3.1.
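As an illustration of the empirical method described above, the following sketch estimates the mean and variance of the split point from bootstrap replicates; find_split_point stands for whatever split selection method is in use and is an assumption of the sketch.

import random

def bootstrap_split_distribution(data, find_split_point, n_replicates=200):
    """Bootstrap samples of the split point and their mean and variance."""
    samples = []
    for _ in range(n_replicates):
        # sample, with replacement, a dataset of the same size as the original
        replicate = [random.choice(data) for _ in data]
        samples.append(find_split_point(replicate))
    mean = sum(samples) / len(samples)
    var = sum((s - mean) ** 2 for s in samples) / (len(samples) - 1)
    return mean, var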
Split Point Fluctuations for Continuous Attributes
For continuous attributes we decided in Section 5.2.2 to model the probability of
following the left branch as the p-value of a normal distribution. This is equivalent,
using our interpretation of the probability, to modeling the distribution of the split
point with a normal distribution. There are two parameters we have to determine
for this normal distribution: the mean and the variance. We can take the mean
to be simply the split point of the provided training dataset. This leaves us with
only one problem: the estimation of the variance of the split point. We can use the
empirical method described above to estimate the variance by simply computing
samples of the split point and then computing their variance. Alternatively, we
can use the analytical method to estimate the variance. In what follows we explore
this second possibility.
We were able to perform the analysis only for a restricted scenario: the two
class label classification problem for split points determined using the statistical
method (see Chapter 2). The result also applies to regression since, using the
method proposed in Chapter 4, regression can locally be reduced to classification
using the EM algorithm.
Let µc, σ2c and Nc be the mean, variance and number of samples, respectively,
for the normal distribution that models the data-points with class label c. For the
case when we have only two class labels, the split point, as explained in Chapter 2,
can be found by solving the quadratic Equation 2.10. For this scenario, we would
like to find the variance of the split point when only finite samples of size Nc
are available from the datasets. Since the quadratic equation has a complicated
solution, finding its exact variance proves to be difficult.
One method to develop an approximation to the variance is to use the delta
method (Shao, 1999). It consists in expressing the variance of the solution in
terms of the variance of the ingredients – estimates of the mean and variance
of the two normal distributions using finite samples – and the derivatives of the
solution with respect to the ingredients. Even if only a first order approximation is
attempted, the resulting formulae tend to be quite large and have to be produced
in a mechanical way to avoid making mistakes. We found that, in practice, a first
order approximation is for the most part satisfactory but occasionally is very far from
the true value.
A second method to approximate the variance of the split point is based on
ignoring the error in estimating the variances and the differences in the size of the
two normal distributions. Equivalently, in Equation 2.10 the quantities σ1 and
σ2 are considered constants and the right hand side is replaced by 0. With these simplifications, the only solution of the quadratic equation between µ1 and µ2 is:
\mu = \frac{\mu_1 \sigma_2 + \mu_2 \sigma_1}{\sigma_1 + \sigma_2}   (5.14)
Ignoring the fluctuations in estimating the variances, using the independence of the
two sampling processes, and remembering that by using Nc samples to estimate
the mean, the variance of the mean estimate is \sigma_c^2 / N_c, we get the following expression
for the variance of the split point:
\mathrm{Var}(\mu) = \frac{\mathrm{Var}(\mu_1)\,\sigma_2 + \mathrm{Var}(\mu_2)\,\sigma_1}{\sigma_1 + \sigma_2}
                  = \frac{1}{\sigma_1 + \sigma_2}\left(\frac{\sigma_1^2 \sigma_2}{N_1} + \frac{\sigma_2^2 \sigma_1}{N_2}\right)
                  = \frac{\sigma_1 \sigma_2}{\sigma_1 + \sigma_2}\left(\frac{\sigma_1}{N_1} + \frac{\sigma_2}{N_2}\right)   (5.15)
We found this formula to be just as good as the formula obtained with the delta
method, and at the same time much simpler and more stable numerically; for this
reason we recommend the use of this formula in place of the one produced by the
delta method.
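The resulting approximation is easy to apply; the following sketch simply instantiates Equations 5.14 and 5.15 from the per-class means, standard deviations and sample counts.

def split_point_mean_var(mu1, sigma1, n1, mu2, sigma2, n2):
    """Approximate mean (Eq. 5.14) and variance (Eq. 5.15) of the split point
    for the two-class statistical split method."""
    mean = (mu1 * sigma2 + mu2 * sigma1) / (sigma1 + sigma2)
    var = (sigma1 * sigma2 / (sigma1 + sigma2)) * (sigma1 / n1 + sigma2 / n2)
    return mean, var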
Split Point Fluctuations for Discrete Attributes
For discrete attributes, the probability associated with nodes in the PDTs takes
the most general form: the probability to follow the left branch is specified for
each possible value of the attribute. As in the case of continuous attributes, both
the empirical and analytical methods can be used to estimate these probabilities.
We explore both alternatives in what follows.
As explained before, the empirical method can be used to generate multiple
training datasets. For each of these, a split set – the subset of the values of the
discrete attribute for which the left branch is followed – can be determined. By
counting how many times the value a of the attribute appears in the split set
and dividing it by the number of datasets we get an estimate of the probability
that the left branch should be followed when value a is observed. Performing this
computation for all values gives the full probability. Notice that this method is
applicable to any split point selection method.
Let us explore now the analytical method. As in the previous situation, we
restrict our attention to the two class label problem. In this case, for convex split
criteria like Gini and information gain, the split set can be found very efficiently
using the result of Breiman et al. (1984) (see Section 2.3.1). As it turns out, the
same result can be used to compute the probabilistic split. Let X be a discrete
attribute, with possible values a1 . . . an. Remember from Section 2.3.1 that, if
we denote by ri = P [C = 0|X = ai] – that is ri is the probability that the first
class label is observed given that X takes value ai – Breiman’s theorem states
that the split set that minimizes a convex split selection function can be found
by considering only splits in the increasing order of quantity ri. Thus, a natural
order over values ai, with respect to minimizing the split criterion, is given by the
increasing order of the quantities ri; finding a split set is equivalent to determining
some split point B, for this order, and including in the set only values ai for
which ri < B. Using this interpretation, we have a natural way to determine the
probability that the left branch is followed if value ai is observed: first, quantities
ri are determined from the training dataset, then, the split point B is determined
so that the split point selection criteria is minimized – this is achieved by simply
considering all the possible split points, no more than the size of the domain of attribute X, and picking the best one – and finally, the probability that the left branch is followed is set to P[ri < B].
The only thing that remains to be investigated is how to compute P [ri < B].
This can be easily accomplished if we observe that the quantity ri is computed
by dividing Ni,0, the number of times X = ai ∧ C = 0 in the training dataset, to
Ni, the number of times X = ai in the training dataset. If we assume that the
counts Ni,0 are binomially distributed with probability ri and count Ni, an assumption which is natural, then P[ri < B] is simply P[Ni,0 < B · Ni]. This probability can
be easily computed by expressing it in terms of the incomplete regularized beta
function Iβ(x; a, b). Putting now everything together, we have
P[T_L \mid T, X = a_i] = P[r_i < B] = P[N_{i,0} < B \cdot N_i]
                       = 1 - I_\beta(r_i;\; B \cdot N_i + 1,\; N_i - B \cdot N_i)   (5.16)
Thus, the computation of the probabilities for each possible value of X is not much
more expensive than in the traditional split point selection method.
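A sketch of Equation 5.16 using SciPy's regularized incomplete beta function; SciPy is assumed to be available, B·Ni is passed to the beta function as a real number strictly between 0 and Ni, and the choice of the threshold B is assumed to have been made already.

from scipy.special import betainc

def prob_left_discrete(n_i0, n_i, B):
    """P[left branch | X = a_i] from the observed counts, per Equation 5.16."""
    r_i = n_i0 / n_i                     # estimate of P[C = 0 | X = a_i]
    # P[N_i0 < B * N_i] = 1 - I_beta(r_i; B*N_i + 1, N_i - B*N_i)
    return 1.0 - betainc(B * n_i + 1.0, n_i - B * n_i, r_i)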
Being able to apply the analytical method only for problems with two class
labels is not as severe a restriction as it might seem since the multiple class clas-
sification problem can be locally reduced, by clustering the class labels based on
similarity, to a two class label problem (Loh & Shih, 1997).
5.4 Empirical Evaluation
In this section we present the results of an extensive experimental study of the
probabilistic decision tree version of SECRET, the regression tree system intro-
duced in Chapter 4. The purpose of the study was twofold: (a) to compare the
accuracy of PDTs on both classification and regression tasks with traditional DTs,
both implemented within SECRET, and (b) to estimate the computational penalty
incurred by the PDTs when compared to DTs. To get a base line reference point,
we include in the classification experiments QUEST (Loh & Shih, 1997), a state-
of-the-art classification tree algorithm and GUIDE (Loh, 2002), a state-of-the-art
regression tree algorithm.
The main findings of our study are:
• Accuracy of prediction. We found the PDTs on some of the learning tasks
to be significantly more accurate than DTs; the improvement is up to 11% for
classification, 36% for constant regression and 24% for linear regression. In
general, the improvement is largest for linear regression followed by constant
regression. For a single dataset, abalone, in the constant regression tests we
observed a significant degradation in accuracy.
• Computational penalty. We found PDTs to incur minimal computational
penalty when compared with DTs. In particular the increase in computa-
tional effort is 11% for constant regression trees with small training datasets
but as low as 1% for linear regression trees with large training datasets.
5.4.1 Experimental Setup
To perform the experiments whose results are reported in this section, we added
classification tree construction code and implemented the probabilistic decision tree
algorithm described in Section 5.2 to SECRET, the system described in Chapter 4.
For the experimental study we used 12 datasets for classification tests, and 14
datasets for the regression tests. We list the characteristics and source of these
datasets in Table 5.1. Since our implementation of classification trees in SECRET
supports only two class labels, we derived two-class learning tasks from the datasets
used for classification by predicting each of the class labels against all others, as
long as the support of the class label was at least 10%.
5.4.2 Experimental Results: Accuracy
For each experiment we used a random partitioning of the available data into 50%
for training, 30% for pruning and the remaining 20% for testing. We repeated each
experiment reported 100 times; we report both the average error and its standard
deviation for all experiments. In all experiments we used the re-substitution error
pruning. For all algorithms we set the minimum number of data-points to be
considered for splitting to 1%.
Classification Experiments
In Table 5.2 we report the experimental results of classification tasks listed in the
top part of Table 5.1 for QUEST, vanilla SECRET and the probabilistic classifi-
cation tree SECRET, denoted SECRET(P). The numbers after the data-set name
denote the class label in the original dataset that gets mapped to the first class
label in the modified dataset; all other class labels get mapped to the second class
Table 5.1: Datasets used in experiments; top for classification and bottom for regression.

Name       Source                             #cases   #nominal   #continuous
breast     UCI                                   683          1             9
cmc        UCI                                  1473          8             2
dna        StatLog                              2000         61             0
heart      UCI                                   270          7             7
led        UCI                                  2000          1             7
liver      UCI                                   345          1             6
pid        UCI                                   532          1             7
sat        StatLog                              2000          1            36
seg        StatLog                              2310          1            19
veh        StatLog                               846          1            18
voting     UCI                                   435         17             0
wave       UCI                                   600          1            21

abalone    UCI                                  4177          1             7
bank       DELVE                                8192          0             9
baseball   UCI                                   261          3            17
cart       Breiman et al. (Breiman et al., 1984)  40768      10             1
fried      Friedman (Friedman, 1991)           40768          0            11
house      UCI                                   506          2            12
kin8nm     DELVE                                8192          0             8
mpg        UCI                                   392          3             5
mumps      StatLib                              1523          0             4
price      UCI                                   159          1            15
puma       DELVE                                8192          0             9
stock      StatLib                               950          0            10
tae        UCI                                   151          4             2
tecator    StatLib                               240          0            11
label.
On the last column of Table 5.2, we list the improvement, in percentage, of
the probabilistic version of SECRET over the vanilla version. As it can be noticed
from these experimental results, QUEST and vanilla SECRET have about the
same accuracy across all learning tasks. SECRET(P) has higher accuracy for all
data-sets except voting, where the decrease is 2%, and for some of the tasks like
sat-1 and veh-4 the increase is as high as 10%.
Table 5.2: Classification tree experimental results.

Dataset   QUEST           SECRET          SECRET(P)        %
breast    0.051 ± 0.002   0.051 ± 0.002   0.049 ± 0.002    4
cmc-1     0.305 ± 0.003   0.316 ± 0.003   0.303 ± 0.002    4
cmc-2     0.227 ± 0.002   0.233 ± 0.002   0.225 ± 0.002    4
cmc-3     0.324 ± 0.002   0.334 ± 0.002   0.326 ± 0.002    2
dna-3     0.097 ± 0.002   0.042 ± 0.001   0.040 ± 0.000    5
dna-1     0.029 ± 0.001   0.030 ± 0.001   0.029 ± 0.001    2
dna-2     0.055 ± 0.001   0.051 ± 0.001   0.050 ± 0.001    2
heart     0.257 ± 0.006   0.224 ± 0.005   0.221 ± 0.005    1
liver     0.398 ± 0.006   0.380 ± 0.005   0.366 ± 0.005    4
pid       0.250 ± 0.003   0.253 ± 0.003   0.240 ± 0.003    5
sat-1     0.034 ± 0.001   0.025 ± 0.000   0.023 ± 0.000   10
sat-3     0.055 ± 0.001   0.047 ± 0.001   0.046 ± 0.001    3
sat-7     0.087 ± 0.001   0.077 ± 0.001   0.071 ± 0.001    7
veh-1     0.228 ± 0.003   0.231 ± 0.003   0.220 ± 0.003    5
veh-2     0.239 ± 0.002   0.246 ± 0.003   0.235 ± 0.003    4
veh-3     0.066 ± 0.002   0.053 ± 0.002   0.050 ± 0.002    6
veh-4     0.077 ± 0.002   0.080 ± 0.002   0.072 ± 0.002   11
voting    0.053 ± 0.005   0.050 ± 0.002   0.051 ± 0.002   -2
wave-1    0.185 ± 0.001   0.182 ± 0.001   0.172 ± 0.001    6
wave-2    0.166 ± 0.001   0.161 ± 0.001   0.153 ± 0.001    5
wave-3    0.159 ± 0.001   0.155 ± 0.001   0.147 ± 0.001    5
Regression Experiments
In Tables 5.3 and 5.4 we depicted the experimental results for the regression tasks
in Table 5.1(bottom) for regression trees with constants in leaves and with linear
models in leaves, respectively. In both tables, on the second column we indicated
the scaling factor for the accuracy results (the numbers on the GUIDE, SECRET,
SECRET(O) and SECRET(P) columns have to be multiplied by the scaling factor
to get the actual experimental results). On the last two columns of these tables
we report the improvement, in percentage, of vanilla SECRET versus probabilistic
regression tree SECRET without and with oblique splits, respectively.
As observed in Chapter 4, GUIDE and SECRET have comparable accuracy.
We notice here, by analyzing the results in Tables 5.3 and 5.4, that the probabilistic
regression tree version of SECRET – denoted as SECRET(P) if univariate splits
are used and SECRET(OP) if oblique splits are used – overall outperforms, in
terms of accuracy, the vanilla version. More precisely, on the constant regression
tree tests, SECRET(P) significantly outperforms SECRET by as much as 36%
on four tasks (baseball, house, stock and tecator), noticeably outperforms it on
four datasets (bank, mpg, mumps and price) and outperforms it by a small margin
on the remaining five tasks (abalone, fried, kin, puma and tae); exactly the same
trends are observed for the oblique splits version of the algorithms.
On the linear regression experiments, the results are good but not as impressive:
SECRET(P) significantly outperforms SECRET on four tasks (baseball,mpg,stock
and tecator) by as much as 20%, noticeably outperforms on two tasks (mumps and
tae), has about the same accuracy on seven tasks (bank, cart, fried, house, kin,
price and puma) and loses significantly on a single dataset (abalone); exactly
the same behavior is observed for the oblique splits version of the algorithms with
small variations in the improvements. We think that the decrease in accuracy,
especially on the abalone task, is due to the fact that SECRET with linear models in
leaves is more prone to making extrapolation errors than the version with constant
models and, especially on tasks where there is not enough fine structure to exploit,
the use of probabilistic splits can be detrimental. Smaller improvements for linear
regression trees, when compared to constant regression trees, are to be expected
since the decision surface is more rugged for the latter so smoothing provided by
probabilistic splits should help more.
Especially interesting is the result on the tecator task for both sets of experiments,
where SECRET was doing better than GUIDE in the first place but is doing much
better with probabilistic splits. This, in our opinion, suggests that the probabilis-
tic regression trees are especially useful for tasks that have fine but complicated
structure.
5.4.3 Experimental Results: Running Time
To compare the computational effort of vanilla and probabilistic SECRET, we
timed their running time (clock time) on a Dual Pentium III Xeon 933MHz running
Linux Mandrake 9.1 (only one of the processors was used). In Figures 5.1 and 5.2
we report the dependency of running times for regression on the fried task on the
size of the training dataset for the case when constant and linear regressors are
used in the leaves, respectively. Experiments for classification and other datasets
give similar qualitative results. In all experiments, the number of data-points in a
node to be considered for further splits was set to 1% of the size of the training set, and only the time to grow the tree is measured.
Table 5.3: Constant regression trees experimental results.

Dataset   ×       GUIDE          SECRET         SECRET(O)      SECRET(P)      SECRET(OP)     S/SP(%)  SO/SOP(%)
abalone   10^0    5.31 ± 0.04    5.35 ± 0.04    5.26 ± 0.04    5.23 ± 0.04    5.16 ± 0.04       2        2
bank      10^-3   2.40 ± 0.01    2.30 ± 0.01    6.34 ± 0.05    2.16 ± 0.01    5.97 ± 0.05       6        6
baseball  10^-1   2.26 ± 0.11    2.23 ± 0.11    2.54 ± 0.12    1.82 ± 0.08    2.29 ± 0.08      18       10
fried     10^0    7.30 ± 0.01    7.69 ± 0.01    6.37 ± 0.03    7.57 ± 0.01    6.37 ± 0.05       2        0
house     10^1    2.26 ± 0.08    2.74 ± 0.09    2.93 ± 0.09    2.30 ± 0.08    2.45 ± 0.07      16       16
kin       10^-2   4.22 ± 0.02    4.34 ± 0.02    3.07 ± 0.03    4.30 ± 0.02    2.97 ± 0.03       1        3
mpg       10^1    1.44 ± 0.04    1.33 ± 0.04    1.32 ± 0.03    1.26 ± 0.03    1.23 ± 0.03       5        7
mumps     10^0    1.34 ± 0.02    1.59 ± 0.02    1.56 ± 0.02    1.56 ± 0.02    1.49 ± 0.02       2        5
price     10^6    8.89 ± 0.37    8.89 ± 0.43    9.51 ± 0.43    8.29 ± 0.57    8.73 ± 0.66       7        8
puma      10^1    1.16 ± 0.01    1.23 ± 0.01    1.43 ± 0.01    1.21 ± 0.01    1.41 ± 0.01       1        2
stock     10^0    2.19 ± 0.06    2.62 ± 0.09    2.11 ± 0.05    2.14 ± 0.07    1.81 ± 0.05      18       14
tae       10^-1   6.99 ± 0.12    6.82 ± 0.10    6.83 ± 0.10    6.71 ± 0.09    6.71 ± 0.09       2        2
tecator   10^1    5.96 ± 0.25    5.62 ± 0.21    3.66 ± 0.15    3.62 ± 0.19    2.73 ± 0.12      36       25
Table 5.4: Linear regression trees experimental results.

Dataset   ×       GUIDE          SECRET         SECRET(O)      SECRET(P)      SECRET(OP)     S/SP(%)  SO/SOP(%)
abalone   10^0    4.73 ± 0.04    4.81 ± 0.04    4.88 ± 0.05    5.26 ± 0.34    5.43 ± 0.35      -9      -11
bank      10^-4   9.40 ± 0.04    9.50 ± 0.04    11.04 ± 0.05   9.41 ± 0.04    10.92 ± 0.05      1        1
baseball  10^-1   1.75 ± 0.05    2.31 ± 0.09    2.64 ± 0.09    2.08 ± 0.09    2.44 ± 0.09      10        8
cart      10^0    1.62 ± 0.00    1.12 ± 0.00    1.12 ± 0.00    1.15 ± 0.00    1.16 ± 0.00      -3       -3
fried     10^0    1.15 ± 0.00    1.20 ± 0.01    1.34 ± 0.01    1.19 ± 0.01    1.34 ± 0.01       1        0
house     10^1    2.42 ± 0.13    2.27 ± 0.07    2.53 ± 0.07    2.35 ± 0.07    2.46 ± 0.07      -4        3
kin       10^-2   2.42 ± 0.02    2.27 ± 0.02    1.69 ± 0.01    2.23 ± 0.02    1.62 ± 0.01       2        4
mpg       10^1    1.18 ± 0.03    1.73 ± 0.08    1.70 ± 0.06    1.45 ± 0.04    1.44 ± 0.04      16       15
mumps     10^0    1.03 ± 0.02    1.30 ± 0.02    1.31 ± 0.02    1.19 ± 0.02    1.19 ± 0.02       8        9
price     10^6    8.92 ± 0.48    8.19 ± 0.34    8.91 ± 0.46    8.02 ± 0.33    8.71 ± 0.42       2        2
puma      10^1    1.06 ± 0.00    1.05 ± 0.00    1.12 ± 0.01    1.05 ± 0.00    1.11 ± 0.01       1        1
stock     10^0    1.52 ± 0.07    1.40 ± 0.08    1.12 ± 0.05    1.12 ± 0.12    0.85 ± 0.02      20       24
tae       10^-1   7.28 ± 0.15    7.28 ± 0.17    7.28 ± 0.17    6.95 ± 0.11    6.95 ± 0.11       5        5
tecator   10^1    1.23 ± 0.06    1.17 ± 0.07    0.80 ± 0.05    1.01 ± 0.06    0.70 ± 0.06      13       12
We make a number of observations on these experimental results:
1. the increase in computational effort is modest – at most 11% but as little as
1%
2. as the number of data-points in the training data-set increases, the difference
between running times gets smaller; it is 11% for datasets of size 5096 but
only 4% when size is 40768 for constant regression trees.
3. the increase in running time is smaller for linear regression trees when com-
pared to the constant regression trees
5.5 Related Work
Modifications of the traditional decision trees in order to allow for imprecise splits
have been previously proposed in the literature. The most notable proposals are:
Bayesian decision trees, (Chipman et al., 1996; Chipman et al., 1998), and fuzzy
decision trees, (Sison & Chong, 1994; Guetova et al., 2002). Structurally, Bayesian decision trees coincide with the generalized classification trees introduced in Section 5.2. They differ in the way the probabilities in the nodes are
chosen: some prior distribution is assumed for the class labels and the probabili-
ties are the Bayesian posterior given the input. Usually Monte Carlo methods are
necessary to find the posterior, thus the method is quite computationally inten-
sive. By interpreting the membership functions as probabilities, the fuzzy decision
trees also coincide with the generalized classification trees. The fundamental dif-
ference between these classifiers and our proposal is the fact that the fuzziness of
the membership functions is determined using human expert knowledge about the
Size     SECRET          SECRET(P)       Slowdown(%)
 5096     1.99 ± 0.01     2.22 ± 0.01    11
10192     3.83 ± 0.01     4.14 ± 0.01     8
20384     7.62 ± 0.02     8.01 ± 0.01     5
40768    15.38 ± 0.06    16.02 ± 0.05     4

[Plot: running time (s) versus number of data-points, for vanilla and probabilistic SECRET.]

Figure 5.1: Tabular and graphical representation of running time (in seconds) of vanilla SECRET and probabilistic SECRET(P), both with constant regressors, for synthetic dataset Fried (11 continuous attributes).
Size     SECRET          SECRET(P)       Slowdown(%)
 5096     7.92 ± 0.04     8.66 ± 0.07     8
10192    15.85 ± 0.09    16.52 ± 0.13     4
20384    31.93 ± 0.25    33.12 ± 0.30     4
40768    64.18 ± 0.32    65.13 ± 0.35     1

[Plot: running time (s) versus number of data-points, for vanilla and probabilistic SECRET.]

Figure 5.2: Tabular and graphical representation of running time (in seconds) of vanilla SECRET and probabilistic SECRET(P), both with linear regressors, for synthetic dataset Fried (11 continuous attributes).
attribute variables, whereas in our method they are determined by estimating the
fluctuations of the split point.
In terms of overall architecture our proposal is closely related to hierarchical
mixtures of experts (Jordan & Jacobs, 1993; Waterhouse & Robinson, 1994) and to Bayesian network classifiers (Friedman & Goldszmidt, 1996). Hierarchical mixtures of experts consist of a tree structure that is used to determine what should
be the contribution to the final prediction of each of the experts, one for each leaf.
Bayesian network classifiers are restricted Bayesian networks (for an overview of
Bayesian networks see (Heckerman, 1995)) that have tree structure and predict
class labels. Both these models are more general – thus more expressive – but at
the same time they require much greater computational effort for learning.
The problem of discontinuity of the prediction for decision trees has been ad-
dressed by techniques like smoothing by averaging (Chaudhuri et al., 1994; Bun-
tine, 1993), that change the way the prediction is obtained: the predictions of all
the nodes on the path are combined by assigning to each one a weight and tak-
ing the final prediction to be the weighted sum. Thus, only the way the decision
tree is interpreted is changed, not the actual structure of the model. If the right
weighting along the path is chosen, no pruning is necessary, but obtaining a data
independent weighting is challenging.
By fitting a naive Bayesian classifier in the leaves of a decision tree, probabilistic
predictions can be made (Breiman et al., 1984). These types of models are usually
called class probability trees. Buntine combined smoothing by averaging with class
probability trees (Buntine, 1993).
5.6 Discussion
In this chapter we presented the probabilistic decision trees, a probabilistic version
of traditional classification and regression trees. By allowing probabilistic splits
that take into consideration the natural fluctuation of the decision boundaries, but
at the same time maintaining most of the structure and properties of decision trees,
the computational effort for building these trees is increased by a modest amount
– at most 11% in our experiments but usually much less – but the accuracy is
significantly increased, by as much as 36% for constant regression trees. Side
benefits of the proposed models are: (1) possibility to predict probability vectors,
not only class labels, (2) continuity of the prediction, and (3) the models are
probabilistic and capture natural fluctuations of the data.
Chapter 6
Conclusions
In this thesis we studied a particular type of supervised learning models: classi-
fication and regression trees. These models are very important since they are among the models preferred by end users due to their simplicity and, at the same
time, they can be constructed efficiently and have good accuracy. Our work re-
lied heavily on Statistics as a method to analyze, capture and leverage statistical
behavior of training examples.
The three distinct contributions we made in this thesis to the classification and
regression tree construction problem are:
Bias in Classification Tree Construction
We analyzed statistical properties of split criteria for split attribute variable selec-
tion, analysis that resulted in a general method to correct the bias toward variables
with larger domains. The corrective method is simply to use the p-value of the
criterion with respect to the Null Hypothesis (i.e. the attribute variables are not
correlated with the class label). We further showed how this correction can be
applied to the Gini gain by approximating the distribution of the Gini gain with
a Gamma distribution with particular parameters.
We think our work on the bias in split selection not only showed principled
ways to correct the bias but also resulted in interesting statistical characterization
of the classification tree construction algorithms.
Scalable Linear Regression Trees
Our second contribution in this thesis is a construction algorithm for regression
trees with linear models in the leaves that is as accurate as previously proposed
construction algorithms but requires substantially less computational effort – up to two orders of magnitude less on large datasets. The main idea is to use the
EM Algorithm for Gaussian mixtures and to classify the data-points based on
closeness to each of the resulting clusters to locally reduce, for each node being
constructed, the regression problem to a much simpler classification problem. As
a side benefit, the proposed construction method allows, with little extra compu-
tational effort, the construction of regression trees with oblique splits, a problem
previously considered to be very hard.
Probabilistic Decision Trees
Our last contribution is a statistically sound and elegant method to incorporate
natural fluctuations of the data into classification and regression trees. The main
idea is to allow the split points to be probabilistic instead of fixed and to deter-
mine these probabilities by analyzing the behavior of the splits under noise. Our
proposal, probabilistic decision trees, maintains the good properties of the decision
trees and avoids some of their shortcomings: the prediction is continuous, proba-
bilities can be predicted instead of class labels, the model gives extra information
about the statistical fluctuations in the data. Moreover, probabilistic decision trees
are generally more accurate than decision trees by as much as 36%, even though
they incur only small increases in the computational effort.
Appendix A
Probability and Statistics Notions
In this chapter we review some useful notions from Probability and Statistics lit-
erature to help the reader not familiar with these mathematical tools required to
understand the developments in our thesis. The intent is to focus on intuition and
usefulness rather than strict rigor in order to keep notation and explanations sim-
ple. For the interested reader, we provide references that contain a more rigorous
treatment. Throughout this introduction we assume that the reader is familiar
with elementary notions of set theory and elementary calculus.
A.1 Basic Probability Notions
In this section we give a brief overview of useful notions from Probability Theory.
A rigorous introduction can be found, for example, in (Resnick, 1999), but this
overview should suffice for the purpose of this thesis.
A.1.1 Probabilities and Random Variables
We first introduce the notion of probability and random variable and their condi-
tional counterparts, then we introduce variance and covariance and give some of
their useful properties.
Probability
Let Ω be some set and F be a set of subsets of Ω that contains ∅ and Ω and it
is closed under union, intersection and complementation with respect to Ω (i.e.
the intersection, union and complement of sets in F gives sets in F). The pair
(Ω,F) is called a probability space, and any element A ∈ F is called an event. If an
event does not contain any other event, it is called an elementary event. We call
Ω the probability space and F the set of measurable sets. With this, a mapping
P : F → [0, 1] from the set of measurable sets to the real numbers between 0 and
1 is called a probability function, in short probability, if the following properties
hold:
P [Ω] = 1 (A.1)
∀A, B, A ∩B = ∅, P [A ∪B] = P [A] + P [B] (A.2)
where A, B ∈ F are two measurable sets.
These properties are enough to show that the following properties also hold:
P [∅] = 0 (A.3)
P [A] ≤ P [B], if A ⊂ B (A.4)
P [Ā] = 1 − P [A] (A.5)
P [A−B] = P [A]− P [A ∩B] (A.6)
P [A ∪B] = P [A] + P [B]− P [A ∩B] (A.7)
where we denoted by Ā the complement of event A. P [A ∩ B] is usually replaced
by the simpler notation P [A, B], the probability that events A and B happen
together.
Two types of probabilities are interesting for the purpose of understanding this
thesis: discrete probabilities and continuous probabilities. We briefly take a look
at each, deferring further discussion until random variables are introduced.
If set Ω is a finite set and we take F = 2^Ω – the powerset of Ω, i.e. the set of all
the possible subsets – any probability over Ω is fully specified by the probabilities
of the elementary events, which are nothing else than the elements of Ω. We call
such a probability a discrete probability.
If we take Ω = R, with R the set of all real numbers, and F to be the transi-
tive closure under intersection and complement of the compact intervals over the
real numbers (the so called Borel set), any probability defined over Ω is called a
continuous probability. The notion of continuous probability is also extended to
vector spaces over the real numbers in the natural manner. We will see examples
of continuous probabilities in the next section. A continuous probability function
P can be specified by its density function p(x). Intuitively, p(x)dx is the probabil-
ity to see value x. Obviously for any x ∈ R this probability is 0, but this allows
the specification of the probabilities of intervals, that are the elementary events of
continuous probabilities:
P[[a, b]] = \int_a^b p(x)\, dx   (A.8)
where [a, b] is a compact interval of R.
Independent Events
Events A and B are called independent if:
P [A, B] = P [A] · P [B] (A.9)
The notion of independent events is very important because of this factorization
property of the probability, factorization that greatly simplifies the analysis.
Conditional Probability and Bayes Rule
The conditional probability that event B happens given that event A happened,
denoted by P [B|A], is defined as:
P[B \mid A] = \frac{P[A, B]}{P[A]}   (A.10)
The conditional probability has the following useful properties:
P [A|Ω] = P [A] (A.11)
P [B|A] = 1, if A ⊂ B (A.12)
P[B \mid A] = \frac{P[B] \cdot P[A \mid B]}{P[A]}   (A.13)
The last formula is called Bayes rule.
Also, conditional probabilities have all the properties normal probabilities have.
Random Variables
A mapping X : Ω → R is called a random variable with respect to the probability
space (Ω,F) if it has the property that:
∀a ∈ R, {ω ∈ Ω | X(ω) < a} ∈ F (A.14)
For discrete probability spaces, any mapping is a random variable. For continuous
spaces, it is enough to require the mapping to be continuous everywhere except a
finite number of points. Moreover, by combining random variables using continuous
functions, random variables are also obtained. What this amounts to is the fact
that all mappings we have to deal with in our thesis are random variables.
A random variable defined over a discrete or continuous probability space is
called discrete random variable or continuous random variable, respectively. To
specify a discrete random variable, it is enough to specify the value of the random
variable for each elementary event. For continuous random variables, we have
to specify the values of the random variable for each real number. We will see
examples of random variables in the next section.
A very important notion with respect to random variables is the notion of
expectation. Intuitively, the expectation of a random variable is its average value
with respect to a probability function. We denote the expectation of a random
variable X by EP [X]. If the probability function is understood from the context,
we use the simpler notation E[X].
For discrete random variables, the expectation is defined as:
E[X] = \sum_{\omega \in \Omega} X(\omega)\, P[\omega]   (A.15)
For convenience, we also use the alternative notation Xω instead of X(ω).
For continuous variables, the expectation of random variable X with respect
to the probability function P with density p(x) is defined as:
E[X] = \int_{-\infty}^{\infty} X(x)\, p(x)\, dx   (A.16)
A probability space together with a probability function are usually called
a distribution. Discrete distributions are usually specified by the probability of
the elementary events and continuous distributions by the density function p(x).
We say that a random variable X is distributed according to the distribution D,
denoted by X ∼ D, if the probability is specified by the distribution and X is the
identity function. This means that for discrete distributions, Ω is a subset of R in
general but a subset of N or Z most often.
Important properties of expectation are:
1. expectation of a constant:
E(a) = a
2. linearity of expectation:
E [aX] = aE [X]
E [X + Y ] = E [X] + E [Y ]
3. expected value of sum (no independence required):
E\left[\sum_i X_i\right] = \sum_i E[X_i]   (A.17)
Independent Random Variables
Two random variables X and Y , defined over the same probability space Ω,F are
independent if and only if for all x, y ∈ R, the events {ω ∈ Ω | X(ω) < x} and {ω ∈ Ω | Y(ω) < y} are independent. In this case it can be shown that:
E [XY ] = E [X] E [Y ] (A.18)
which is one of the most useful properties of expectation.
Variance and Covariance
Variance is an important property of distributions since it indicates how spread
(or localized) the distribution is. It is defined as:
\mathrm{Var}(X) = E\left[(X - E[X])^2\right] = E[X^2] - (E[X])^2   (A.19)
The covariance of two random variables X and Y is defined as:
Cov (X, Y ) = E [XY ]− E [X] E [Y ]
and gives an idea of how much random variables X and Y influence each other.
Notice that if X and Y are independent, Cov (X, Y ) = 0.
Some of the useful properties of variance are:
1. variance of a constant:
Var (a) = 0
2. scalar multiplication:
Var (aX) = a2Var (X)
3. variance of sum of random variables:
Var (X + Y ) = Var (X) + Var (Y ) + 2Cov (X, Y )
or in general
\mathrm{Var}\left(\sum_i X_i\right) = \sum_i \mathrm{Var}(X_i) + \sum_i \sum_{i' \neq i} \mathrm{Cov}(X_i, X_{i'})
4. variance of sum of independent random variables:
\mathrm{Var}\left(\sum_i X_i\right) = \sum_i \mathrm{Var}(X_i)
A very useful property of covariance is the fact that it is bilinear:
Cov (aX, Y ) = aCov (X, Y )
Cov (X, aY ) = aCov (X, Y )
Cov (X1 + X2, Y ) = Cov (X1, Y ) + Cov (X2, Y )
Cov (X, Y1 + Y2) = Cov (X, Y1) + Cov (X, Y2)
\mathrm{Cov}\left(\sum_i X_i, \sum_j Y_j\right) = \sum_i \sum_j \mathrm{Cov}(X_i, Y_j)
Also, the covariance is commutative:
Cov (X, Y ) = Cov (Y,X)
Conditional Expectation
Conditional expectation generalizes the notion of expectation. For a random variable X defined over the discrete probability (Ω, 2^Ω, P), and an event
A ∈ 2^Ω, the conditional expectation is defined as:
E[X \mid A] = \frac{\sum_{\omega \in A} X(\omega)\, P[\omega]}{P[A]}   (A.20)
For a continuous probability with density p(x), the conditional expectation is
defined as:
E[X \mid A] = \frac{\int_A X(x)\, p(x)\, dx}{P[A]}   (A.21)
Conditional expectation has all the properties normal expectation has. More-
over, since the notion of variance is entirely based on the notion of expectation,
we can define conditional variance in terms of the conditional expectation as:
\mathrm{Var}(X \mid A) = E[X^2 \mid A] - (E[X \mid A])^2
Random Vectors
The notion of random variable can be extended to vectors and, more generally, to
matrices. If X = [X1, . . . , Xn] is a random vector – a vector of random variables –
its expectation is the vector of expectations of components:
E [X] = [E [X1] , . . . , E [Xn]]
With this, the variance of random vector X is a matrix, called the covariance
matrix:
\mathrm{Var}(X) = E[X^T X] - E[X]^T E[X]
              = \begin{bmatrix}
                  \mathrm{Var}(X_1)      & \mathrm{Cov}(X_1, X_2) & \dots  & \mathrm{Cov}(X_1, X_n) \\
                  \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2)      & \dots  & \mathrm{Cov}(X_2, X_n) \\
                  \vdots                 & \vdots                 & \ddots & \vdots                 \\
                  \mathrm{Cov}(X_n, X_1) & \mathrm{Cov}(X_n, X_2) & \dots  & \mathrm{Var}(X_n)
                \end{bmatrix}   (A.22)
A.2 Basic Statistical Notions
In this section we introduce some useful statistical notions. More information can
be found, for example, in (Wilks, 1962; Pratt et al., 1995; Shao, 1999).
P-Value
The p-value of an observed value x of a random variable X is the probability that
the random variable X would take a value as high or higher than x. Mathematically,
the p-value is P [X > x]. Intuitively, a very small p-value is statistical proof that
x is not a sample of the random variable X.
A.2.1 Discrete Distributions
Binomial Distribution
The binomial distribution is the distribution of the number of times one sees heads in N flips of an asymmetric coin that has probability p of landing heads. If X is a random variable binomially distributed with parameters N and p, it can be shown
that:
E(X) = Np
Var (X) = Np(1− p)
The p-value of the binomial distribution is:
P [X > x] = I(p; x + 1, N − x)
where
I(x; a, b) = \int_0^x \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\, t^{a-1} (1 - t)^{b-1}\, dt
is the incomplete regularized beta function.
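As a quick numerical check of this identity (a sketch, assuming SciPy is available):

from scipy.stats import binom
from scipy.special import betainc

N, p, x = 20, 0.3, 7
print(binom.sf(x, N, p))              # P[X > x] computed directly
print(betainc(x + 1, N - x, p))       # I(p; x+1, N-x); the two values should agree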
Multinomial Distribution
The multinomial distribution generalizes the binomial distribution to multiple di-
mensions. It has as parameters N , the number of trials, and (p1, . . . , pn), the
probabilities of an n-faced coin. The multinomial distribution is the distribution of the number of times each of the faces is observed out of N trials. If we let
X ∼ Multinomial(N, p1, . . . , pn), and denote by Xi the i-th component of X we
have:
E [Xi] = Npi
Var (Xi) = Npi(1− pi)
Cov (Xi, Xj) = −Npipj
A.2.2 Continuous Distributions
Normal (unidimensional Gaussian) Distribution
Normal distribution, denoted by N(µ, σ2), has two parameters: the mean µ and
variance σ2. σ must always be a positive quantity. Given X ∼ N(µ, σ2),
E [X] = µ (A.23)
Var (X) = σ2 (A.24)
p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}   (A.25)

P[X > x] = \frac{1}{2}\left(1 - \mathrm{Erf}\!\left(\frac{x - \mu}{\sqrt{2}\,\sigma}\right)\right)   (A.26)
where \mathrm{Erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt is the standard error function.
Gaussian Distribution
Gaussian distribution, denoted by N(µ, Σ), has two parameters: the mean vector
µ and the covariance matrix Σ. Σ has to be positive definite which means that
it always has a Choleski decomposition Σ = GG^T (Golub & Loan, 1996). For
X ∼ N(µ, Σ),
E [X] = µ (A.27)
Var (X) = Σ (A.28)
p(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)}   (A.29)
Gamma Distribution
The gamma distribution (with parameters α and θ) is the distribution with density
p(x) = \frac{x^{\alpha - 1}\, e^{-x/\theta}}{\Gamma(\alpha)\, \theta^{\alpha}}

and p-value

P[X > x] = 1 - \frac{\Gamma(\alpha, x/\theta)}{\Gamma(\alpha)} = 1 - Q(\alpha, x/\theta)   (A.30)
where Γ(x) is the gamma function and Γ(x, y) is the incomplete gamma function.
Q(x, y) is called the incomplete regularized gamma function.
Mean and variance of a random variable X with gamma distribution are:
E(X) = αθ (A.31)
Var(X) = αθ2 (A.32)
The χ2 distribution, discussed below, is a particular case of the gamma distribution.
Beta Distribution
The beta distribution has parameters α and β and density:
p(x) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1}
The p-value is 1− I(x; α, β) with I the incomplete regularized beta function.
χ2-test and χ2-distribution
Having a set of random variables Xi, the χ2-test is defined as:
\chi^2 = \sum_{i=1}^{n} \frac{(X_i - E_i)^2}{E_i}   (A.33)
where Ei is the expected value of Xi under some hypothesis (that is tested using
the χ2-test).
It can be shown that asymptotically χ2 has a χ2 distribution, which coincides with a gamma distribution with parameters α = r/2 and θ = 2, where r is the number of degrees of freedom (the number of variables n minus the number of constraints between the variables).
The mean and the variance for the χ2 distribution are:
E(χ2) = r (A.34)
Var(χ2) = 2r (A.35)
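For concreteness, the sketch below (assuming SciPy is available; the counts and expectations are made up) computes the statistic of Equation A.33 and obtains its p-value both from the χ²-distribution and from the equivalent gamma distribution with α = r/2, θ = 2:

    import numpy as np
    from scipy.stats import chi2, gamma

    # Observed counts X_i and their expected values E_i under the tested hypothesis.
    observed = np.array([18, 31, 23, 28])
    expected = np.array([25.0, 25.0, 25.0, 25.0])
    stat = ((observed - expected) ** 2 / expected).sum()   # Equation A.33

    r = len(observed) - 1                                  # degrees of freedom
    print("p-value (chi2) :", chi2(df=r).sf(stat))
    print("p-value (gamma):", gamma(a=r / 2, scale=2).sf(stat))   # same value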
Appendix B
Proofs for Chapter 3
B.0.3 Variance of the Gini gain random variable
In this section we show the derivation of the formula in Equation 3.20, the variance
of the Gini gain.
Using the notation in Chapter 3, the formula for the Gini gain as a function
of sufficient statistics is (Equation 3.2)
\Delta g(T) = \frac{1}{N} \sum_{j=1}^{k} \left( \sum_{i=1}^{n} \frac{A_{ij}^2}{N_i} - \frac{S_j^2}{N} \right)    (B.1)
Using properties of variance and covariance and the connections between them (see Appendix A), we have:

Var(\Delta g(T)) = \frac{1}{N^2} \left[ Var\left( \sum_{j=1}^{k} \sum_{i=1}^{n} \frac{A_{ij}^2}{N_i} \right) + Var\left( \sum_{j=1}^{k} \frac{S_j^2}{N} \right) - 2\, Cov\left( \sum_{j=1}^{k} \sum_{i=1}^{n} \frac{A_{ij}^2}{N_i}, \sum_{j=1}^{k} \frac{S_j^2}{N} \right) \right]    (B.2)
We proceed by analyzing each term inside the square brackets individually. To simplify the formulae we write \sum_i, \sum_j, \sum_{j'}, \sum_{j' \neq j} instead of \sum_{i=1}^{n}, \sum_{j=1}^{k}, \sum_{j'=1}^{k} and \sum_{j'=1, j' \neq j}^{k}, respectively. Also, recall from Chapter 3 that for every i ∈ {1, .., n} the random vector (A_{i1}, . . . , A_{ik}) has the distribution Multinomial(N_i, p_1, . . . , p_k), that (S_1, . . . , S_k) has the distribution Multinomial(N, p_1, . . . , p_k), and that for i ≠ i′, A_{ij} is independent of A_{i′j}.
First term:

Var\left( \sum_j \sum_i \frac{A_{ij}^2}{N_i} \right)
  = \sum_i \frac{1}{N_i^2} Var\left( \sum_j A_{ij}^2 \right)
  = \sum_i \frac{1}{N_i^2} \left( \sum_j Var(A_{ij}^2) + \sum_j \sum_{j' \neq j} Cov(A_{ij}^2, A_{ij'}^2) \right)
  = \sum_j \left( \sum_i \frac{Var(A_{ij}^2)}{N_i^2} \right) + \sum_j \sum_{j' \neq j} \sum_i \left( \frac{1}{N_i^2} Cov(A_{ij}^2, A_{ij'}^2) \right)    (B.3)
Second term:

Var\left( \sum_j \frac{S_j^2}{N} \right)
  = \sum_j \frac{Var(S_j^2)}{N^2} + \sum_j \sum_{j' \neq j} \frac{1}{N^2} Cov(S_j^2, S_{j'}^2)    (B.4)
Third term:

Cov\left( \sum_j \sum_i \frac{A_{ij}^2}{N_i}, \sum_j \frac{S_j^2}{N} \right)
  = \sum_i \sum_j \sum_{j'} \frac{1}{N_i N} Cov(A_{ij}^2, S_{j'}^2)    (B.5)
Observing that:

Cov(A_{ij}^2, S_{j'}^2)
  = Cov\left( A_{ij}^2, \left( \sum_{i'} A_{i'j'} \right)^2 \right)
  = \sum_{i'} \sum_{i''} Cov(A_{ij}^2, A_{i'j'} A_{i''j'})
  = Cov(A_{ij}^2, A_{ij'}^2) + 2 \sum_{i' \neq i} Cov(A_{ij}^2, A_{ij'} A_{i'j'})
  = Cov(A_{ij}^2, A_{ij'}^2) + 2 \sum_{i' \neq i} \left( E[A_{ij}^2 A_{ij'} A_{i'j'}] - E[A_{ij}^2] E[A_{ij'} A_{i'j'}] \right)
  = Cov(A_{ij}^2, A_{ij'}^2) + 2 \left( E[S_{j'}] - E[A_{ij'}] \right) \left( E[A_{ij}^2 A_{ij'}] - E[A_{ij}^2] E[A_{ij'}] \right)    (B.6)
we get:

Cov\left( \sum_j \sum_i \frac{A_{ij}^2}{N_i}, \sum_j \frac{S_j^2}{N} \right)
  = \sum_j \sum_i \frac{Var(A_{ij}^2)}{N_i N}
    + 2 \sum_j \sum_i \frac{1}{N_i N} \left( E[S_j] - E[A_{ij}] \right) \left( E[A_{ij}^3] - E[A_{ij}^2] E[A_{ij}] \right)
    + \sum_j \sum_{j' \neq j} \sum_i \frac{1}{N_i N} Cov(A_{ij}^2, A_{ij'}^2)
    + 2 \sum_j \sum_{j' \neq j} \sum_i \frac{1}{N_i N} \left( E[S_{j'}] - E[A_{ij}] \right) \left( E[A_{ij}^2 A_{ij'}] - E[A_{ij}^2] E[A_{ij'}] \right)    (B.7)
We now focus on the sub-terms inside summations. The formulae for the sub-
terms that appear in the above equations are depicted in Table B.1.
We compute the sub-terms of each of the three main terms of the variance. We
leave the computation of the last sum over j and j′ until the end.
Table B.1: Formulae for expressions over a random vector (X_1, . . . , X_k) distributed Multinomial(N, p_1, . . . , p_k)

    Expression                                Formula
    E[X_j]                                    N p_j
    Var(X_j^2)                                N (1 - p_j) p_j (1 + 6(N-1) p_j + 2(N-1)(2N-3) p_j^2)
    E[X_j^3] - E[X_j^2] E[X_j]                N p_j (1 - p_j) (2N p_j - 2 p_j + 1)
    Cov(X_j^2, X_{j'}^2)                      -N p_j p_{j'} (1 + 2(p_j + p_{j'})(N-1) + 2(N-1)(2N-3) p_j p_{j'})
    E[X_j^2 X_{j'}] - E[X_j^2] E[X_{j'}]      -N p_j p_{j'} (2N p_j - 2 p_j + 1)
First term sub-terms:

\sum_i \frac{Var(A_{ij}^2)}{N_i^2}
  = \sum_i \frac{1}{N_i^2} N_i p_j (1-p_j) \left( (1 - 6p_j + 6p_j^2) + N_i (6p_j - 10p_j^2) + 4 N_i^2 p_j^2 \right)
  = p_j (1-p_j) \left( (1 - 6p_j + 6p_j^2) \sum_i \frac{1}{N_i} + n (6p_j - 10p_j^2) + 4 N p_j^2 \right)    (B.8)
\sum_i \frac{1}{N_i^2} Cov(A_{ij}^2, A_{ij'}^2)
  = - \sum_i \frac{1}{N_i^2} N_i p_j p_{j'} \left( 1 - 2(p_j + p_{j'}) + 6 p_j p_{j'} + N_i (2p_j + 2p_{j'} - 10 p_j p_{j'}) + 4 N_i^2 p_j p_{j'} \right)
  = - p_j p_{j'} \left( (1 - 2(p_j + p_{j'}) + 6 p_j p_{j'}) \sum_i \frac{1}{N_i} + n (2p_j + 2p_{j'} - 10 p_j p_{j'}) + 4 N p_j p_{j'} \right)    (B.9)
Second term sub-terms:

\frac{Var(S_j^2)}{N^2}
  = \frac{1}{N^2} N p_j (1-p_j) \left( (1 - 6p_j + 6p_j^2) + N (6p_j - 10p_j^2) + 4 N^2 p_j^2 \right)
  = \frac{p_j (1-p_j)}{N} \left( (1 - 6p_j + 6p_j^2) + N (6p_j - 10p_j^2) + 4 N^2 p_j^2 \right)    (B.10)
\frac{1}{N^2} Cov(S_j^2, S_{j'}^2)
  = - \frac{1}{N^2} N p_j p_{j'} \left( 1 - 2(p_j + p_{j'}) + 6 p_j p_{j'} + N (2p_j + 2p_{j'} - 10 p_j p_{j'}) + 4 N^2 p_j p_{j'} \right)
  = - \frac{p_j p_{j'}}{N} \left( 1 - 2(p_j + p_{j'}) + 6 p_j p_{j'} + N (2p_j + 2p_{j'} - 10 p_j p_{j'}) + 4 N^2 p_j p_{j'} \right)    (B.11)
Third term sub-terms:

\sum_i \frac{Var(A_{ij}^2)}{N_i N}
  = \sum_i \frac{1}{N_i N} N_i p_j (1-p_j) \left( (1 - 6p_j + 6p_j^2) + N_i (6p_j - 10p_j^2) + 4 N_i^2 p_j^2 \right)
  = \frac{p_j (1-p_j)}{N} \left( n (1 - 6p_j + 6p_j^2) + N (6p_j - 10p_j^2) + 4 p_j^2 \sum_i N_i^2 \right)    (B.12)
\sum_i \frac{1}{N_i N} \left( E[S_j] - E[A_{ij}] \right) \left( E[A_{ij}^3] - E[A_{ij}^2] E[A_{ij}] \right)
  = \sum_i \frac{1}{N_i N} (N p_j - N_i p_j)\, N_i p_j (1-p_j) (2 N_i p_j - 2 p_j + 1)
  = \frac{p_j (1-p_j)}{N} \left( (n-1)(1 - 2p_j) p_j N + 2 N^2 p_j^2 - 2 p_j^2 \sum_i N_i^2 \right)    (B.13)
\sum_i \frac{1}{N_i N} Cov(A_{ij}^2, A_{ij'}^2)
  = - \sum_i \frac{1}{N_i N} N_i p_j p_{j'} \left( 1 - 2(p_j + p_{j'}) + 6 p_j p_{j'} + N_i (2p_j + 2p_{j'} - 10 p_j p_{j'}) + 4 N_i^2 p_j p_{j'} \right)
  = - \frac{p_j p_{j'}}{N} \left( n (1 - 2(p_j + p_{j'}) + 6 p_j p_{j'}) + N (2p_j + 2p_{j'} - 10 p_j p_{j'}) + 4 p_j p_{j'} \sum_i N_i^2 \right)    (B.14)
\sum_i \frac{1}{N_i N} \left( E[S_{j'}] - E[A_{ij}] \right) \left( E[A_{ij}^2 A_{ij'}] - E[A_{ij}^2] E[A_{ij'}] \right)
  = - \sum_i \frac{1}{N_i N} (N p_{j'} - N_i p_j)\, N_i p_j p_{j'} (2 N p_j - 2 p_j + 1)
  = - \frac{p_j p_{j'}}{N} \left( n (1 - 2p_j) p_{j'} N + 2 N^2 p_j p_{j'} - N p_j (1 - 2p_j) - 2 p_j p_{j'} \sum_i N_i^2 \right)    (B.15)
In the final formula, all four sub-terms of this third term have to be multiplied by -2 (the coefficient of the covariance in Equation B.2), and the second and fourth sub-terms by an extra factor of 2 (carried over from Equation B.7).

We now put everything together by grouping all the terms that share the same factor: N, \sum_i \frac{1}{N_i}, or \sum_i N_i^2. We ignore the \frac{1}{N^2} factor in front of the expression for Var(\Delta g(T)).
We use the following identities extensively:

\sum_j \sum_{j' \neq j} p_j p_{j'} = \sum_j p_j \left( \sum_{j'} p_{j'} - p_j \right) = \sum_j p_j (1 - p_j) = 1 - \sum_j p_j^2    (B.16)

\sum_j \sum_{j' \neq j} p_j^2 p_{j'} = \sum_j p_j^2 \left( \sum_{j'} p_{j'} - p_j \right) = \sum_j p_j^2 (1 - p_j) = \sum_j p_j^2 - \sum_j p_j^3    (B.17)

\sum_j \sum_{j' \neq j} p_j p_{j'}^2 = \sum_j p_j \left( \sum_{j'} p_{j'}^2 - p_j^2 \right) = \sum_j p_j \sum_{j'} p_{j'}^2 - \sum_j p_j^3 = \sum_j p_j^2 - \sum_j p_j^3    (B.18)

\sum_j \sum_{j' \neq j} p_j^2 p_{j'}^2 = \sum_j p_j^2 \left( \sum_{j'} p_{j'}^2 - p_j^2 \right) = \left( \sum_j p_j^2 \right)^2 - \sum_j p_j^4    (B.19)
\sum_i \frac{1}{N_i} factor:

\sum_j p_j (1-p_j)(1 - 6p_j + 6p_j^2) - \sum_j \sum_{j' \neq j} p_j p_{j'} (1 - 2(p_j + p_{j'}) + 6 p_j p_{j'})
  = \sum_j \left( p_j - 7 p_j^2 + 12 p_j^3 - 6 p_j^4 \right) - \sum_j \sum_{j' \neq j} \left( p_j p_{j'} - 2 p_j^2 p_{j'} - 2 p_j p_{j'}^2 + 6 p_j^2 p_{j'}^2 \right)
  = 1 - 7 \sum_j p_j^2 + 12 \sum_j p_j^3 - 6 \sum_j p_j^4 - 1 + \sum_j p_j^2 + 2 \sum_j p_j^2 - 2 \sum_j p_j^3
    + 2 \sum_j p_j^2 - 2 \sum_j p_j^3 - 6 \left( \sum_j p_j^2 \right)^2 + 6 \sum_j p_j^4
  = -2 \sum_j p_j^2 + 8 \sum_j p_j^3 - 6 \left( \sum_j p_j^2 \right)^2    (B.20)

\sum_i N_i^2 factor. We ignore the factor -\frac{2}{N} in front of all these terms (only the third term contributes such factors):

\sum_j \left( 4 p_j^3 (1-p_j) - 4 p_j^3 (1-p_j) \right) + \sum_j \sum_{j' \neq j} \left( -4 p_j^2 p_{j'}^2 + 4 p_j^2 p_{j'}^2 \right) = 0    (B.21)

N factor:

\sum_j \left( 4 p_j^3 (1-p_j) + 4 p_j^3 (1-p_j) - 8 p_j^3 (1-p_j) \right) + \sum_j \sum_{j' \neq j} \left( -4 p_j^2 p_{j'}^2 - 4 p_j^2 p_{j'}^2 + 8 p_j^2 p_{j'}^2 \right) = 0    (B.22)
\frac{1}{N} factor:

\sum_j \left( p_j (1-p_j)(1 - 6p_j + 6p_j^2) - 2n\, p_j (1-p_j)(1 - 6p_j + 6p_j^2) \right)
  + \sum_j \sum_{j' \neq j} \left( -p_j p_{j'} (1 - 2(p_j + p_{j'}) + 6 p_j p_{j'}) + 2n\, p_j p_{j'} (1 - 2(p_j + p_{j'}) + 6 p_j p_{j'}) \right)
  = (2n - 1) \left[ - \sum_j \left( p_j - 7 p_j^2 + 12 p_j^3 - 6 p_j^4 \right) + \sum_j \sum_{j' \neq j} \left( p_j p_{j'} - 2 p_j^2 p_{j'} - 2 p_j p_{j'}^2 + 6 p_j^2 p_{j'}^2 \right) \right]
  = -(2n - 1) \left( -2 \sum_j p_j^2 + 8 \sum_j p_j^3 - 6 \left( \sum_j p_j^2 \right)^2 \right)    (B.23)

since inside the brackets we have the negation of the expression obtained for the \sum_i \frac{1}{N_i} factor.
Free factors:

\sum_j \left( n\, p_j (1-p_j)(6p_j - 10p_j^2) + n\, p_j (1-p_j)(6p_j - 10p_j^2) - 2 p_j (1-p_j)(6p_j - 10p_j^2) - 4(n-1) p_j (1-p_j)(1 - 2p_j) p_j \right)
  + \sum_j \sum_{j' \neq j} \left( -n\, p_j p_{j'} (2p_j + 2p_{j'} - 10 p_j p_{j'}) - p_j p_{j'} (2p_j + 2p_{j'} - 10 p_j p_{j'}) + 2 p_j p_{j'} (2p_j + 2p_{j'} - 10 p_j p_{j'}) + 4n\, p_j p_{j'}^2 (1 - 2p_j) - 4 p_j p_{j'}^2 (1 - 2p_j) \right)
  = (n-1) \sum_j \left( 6 p_j^2 - 16 p_j^3 + 10 p_j^4 - 4 p_j^2 + 12 p_j^3 - 8 p_j^4 \right)
    - (n-1) \sum_j \sum_{j' \neq j} \left( 2 p_j^2 p_{j'} + 2 p_j p_{j'}^2 - 10 p_j^2 p_{j'}^2 \right)
    + 4(n-1) \sum_j \sum_{j' \neq j} \left( p_j p_{j'}^2 - 2 p_j^2 p_{j'}^2 \right)
  = (n-1) \left( 2 \sum_j p_j^2 - 4 \sum_j p_j^3 + 2 \sum_j p_j^4 \right)
    - (n-1) \left( 2 \sum_j p_j^2 - 2 \sum_j p_j^3 + 2 \sum_j p_j^2 - 2 \sum_j p_j^3 - 10 \left( \sum_j p_j^2 \right)^2 + 10 \sum_j p_j^4 \right)
    + 4(n-1) \left( \sum_j p_j^2 - \sum_j p_j^3 - 2 \left( \sum_j p_j^2 \right)^2 + 2 \sum_j p_j^4 \right)
  = (n-1) \left( 2 \sum_j p_j^2 + 2 \left( \sum_j p_j^2 \right)^2 - 4 \sum_j p_j^3 \right)    (B.24)
Now, putting everything together, we get:

Var(\Delta g) = \frac{1}{N^2} \left[ (n-1) \left( 2 \sum_{j=1}^{k} p_j^2 + 2 \left( \sum_{j=1}^{k} p_j^2 \right)^2 - 4 \sum_{j=1}^{k} p_j^3 \right) + \left( \sum_{i=1}^{n} \frac{1}{N_i} - \frac{2n}{N} + \frac{1}{N} \right) \left( -2 \sum_{j=1}^{k} p_j^2 - 6 \left( \sum_{j=1}^{k} p_j^2 \right)^2 + 8 \sum_{j=1}^{k} p_j^3 \right) \right]    (B.25)
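A derivation this long is easy to get wrong, so it can be worth cross-checking numerically. The sketch below uses a hypothetical setup with made-up group sizes N_i and class probabilities p_j; as assumed throughout this section, the class counts in every group are drawn from Multinomial(N_i, p). It estimates Var(Δg) by simulation and prints it next to the value of Equation B.25 so the two can be compared:

    import numpy as np

    # Monte Carlo estimate of Var(Delta g) under the assumptions of this section:
    # every row (A_i1, ..., A_ik) ~ Multinomial(N_i, p_1, ..., p_k), independently.
    rng = np.random.default_rng(0)
    p = np.array([0.2, 0.3, 0.5])          # class probabilities p_j (made up)
    Ni = np.array([50, 80, 70, 100])       # group sizes N_i (made up)
    N, n, k = Ni.sum(), len(Ni), len(p)

    def gini_gain(A):
        """Equation B.1: (1/N) * sum_j ( sum_i A_ij^2 / N_i  -  S_j^2 / N )."""
        S = A.sum(axis=0)
        return ((A ** 2 / Ni[:, None]).sum() - (S ** 2).sum() / N) / N

    gains = np.array([gini_gain(np.vstack([rng.multinomial(m, p) for m in Ni]))
                      for _ in range(100_000)])

    s2, s3 = (p ** 2).sum(), (p ** 3).sum()
    eq_b25 = ((n - 1) * (2 * s2 + 2 * s2 ** 2 - 4 * s3)
              + ((1.0 / Ni).sum() - 2 * n / N + 1 / N)
              * (-2 * s2 - 6 * s2 ** 2 + 8 * s3)) / N ** 2
    print("empirical Var(Delta g):", gains.var())
    print("Equation B.25         :", eq_b25)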
B.0.4 Mean and Variance of the χ²-test for the two-class case

Let p_1 be the probability of the first class label and p_2 = 1 - p_1 that of the second. Using the notation of Chapter 3 and starting from the definition of χ² (Equation A.33), we have:

\chi^2 = \sum_{i=1}^{n} \left( \frac{(A_{i1} - p_1 N_i)^2}{p_1 N_i} + \frac{(A_{i2} - p_2 N_i)^2}{p_2 N_i} \right)
  = \sum_{i=1}^{n} \left( \frac{(A_{i1} - p_1 N_i)^2}{p_1 N_i} + \frac{(N_i - A_{i1} - (1 - p_1) N_i)^2}{(1 - p_1) N_i} \right)
  = \sum_{i=1}^{n} (A_{i1} - p_1 N_i)^2 \left( \frac{1}{p_1 N_i} + \frac{1}{(1 - p_1) N_i} \right)
  = \sum_{i=1}^{n} \frac{(A_{i1} - p_1 N_i)^2}{p_1 (1 - p_1) N_i}
  = \frac{1}{p_1 (1 - p_1)} \sum_{i=1}^{n} \left( \frac{A_{i1}^2}{N_i} - 2 p_1 A_{i1} + p_1^2 N_i \right)
  = \frac{1}{p_1 (1 - p_1)} \left( \sum_{i=1}^{n} \frac{A_{i1}^2}{N_i} - 2 p_1 N_1 + p_1^2 N \right)    (B.26)

where N_1 = \sum_{i=1}^{n} A_{i1}.
Using the linearity of expectation we have:

E(\chi^2) = \frac{1}{p_1 (1 - p_1)} \left( \sum_{i=1}^{n} \frac{E(A_{i1}^2)}{N_i} - 2 p_1 E(N_1) + p_1^2 N \right)
  = \frac{1}{p_1 (1 - p_1)} \left( \sum_{i=1}^{n} p_1 (1 - p_1 + N_i p_1) - 2 p_1 N p_1 + p_1^2 N \right)
  = \frac{1}{p_1 (1 - p_1)} \left( n p_1 (1 - p_1) + N p_1^2 - N p_1^2 \right)
  = n    (B.27)
Now we look at the variance:

Var(\chi^2) = \frac{1}{p_1^2 (1 - p_1)^2} Var\left( \sum_{i=1}^{n} \frac{A_{i1}^2}{N_i} - 2 p_1 N_1 + p_1^2 N \right)
  = \frac{1}{p_1^2 (1 - p_1)^2} \left( Var\left( \sum_{i=1}^{n} \frac{A_{i1}^2}{N_i} \right) + 4 p_1^2 Var(N_1) - 4 p_1 Cov\left( \sum_{i=1}^{n} \frac{A_{i1}^2}{N_i}, N_1 \right) \right)    (B.28)
The third term is the only one that needs to be analyzed separately:

Cov\left( \sum_{i=1}^{n} \frac{A_{i1}^2}{N_i}, N_1 \right)
  = E\left( \sum_{i=1}^{n} \frac{A_{i1}^2}{N_i} N_1 \right) - E\left( \sum_{i=1}^{n} \frac{A_{i1}^2}{N_i} \right) E(N_1)
  = \sum_{i=1}^{n} \frac{E(A_{i1}^2 N_1) - E(A_{i1}^2) E(N_1)}{N_i}    (B.29)
E(A_{i1}^2 N_1) = E\left( A_{i1}^2 \sum_{j=1}^{n} A_{j1} \right)
  = E\left( A_{i1}^2 \left( A_{i1} + \sum_{j=1, j \neq i}^{n} A_{j1} \right) \right)
  = E(A_{i1}^3) + E(A_{i1}^2) \sum_{j=1, j \neq i}^{n} E(A_{j1})
  = E(A_{i1}^3) - E(A_{i1}^2) E(A_{i1}) + E(A_{i1}^2) \sum_{j=1}^{n} E(A_{j1})
  = E(A_{i1}^3) - E(A_{i1}^2) E(A_{i1}) + E(A_{i1}^2) E(N_1)    (B.30)
Substituting Equation B.30 into Equation B.29 we get:

Cov\left( \sum_{i=1}^{n} \frac{A_{i1}^2}{N_i}, N_1 \right)
  = \sum_{i=1}^{n} \frac{E(A_{i1}^3) - E(A_{i1}^2) E(A_{i1})}{N_i}
  = \sum_{i=1}^{n} p_1 (1 - p_1) (1 - 2 p_1 + 2 N_i p_1)
  = p_1 (1 - p_1) \left( n (1 - 2 p_1) + 2 N p_1 \right)    (B.31)
Substituting Equation B.31 into Equation B.28 we have:

Var(\chi^2) = \frac{1}{p_1^2 (1 - p_1)^2} \Bigl( p_1 (1 - p_1) \left[ (1 - 6 p_1 + 6 p_1^2) \sum_{i=1}^{n} \frac{1}{N_i} + n (6 p_1 - 10 p_1^2) + 4 N p_1^2 \right]
    + 4 p_1^2\, p_1 (1 - p_1) N - 4 p_1\, p_1 (1 - p_1) \left[ n (1 - 2 p_1) + 2 N p_1 \right] \Bigr)
  = \frac{1}{p_1 (1 - p_1)} \left[ (1 - 6 p_1 + 6 p_1^2) \sum_{i=1}^{n} \frac{1}{N_i} + n (6 p_1 - 10 p_1^2) + 4 N p_1^2 + 4 N p_1^2 - n (4 p_1 - 8 p_1^2) - 8 N p_1^2 \right]
  = \frac{1}{p_1 (1 - p_1)} \left[ (1 - 6 p_1 + 6 p_1^2) \sum_{i=1}^{n} \frac{1}{N_i} + 2 n p_1 (1 - p_1) \right]
  = 2n + \frac{1 - 6 p_1 + 6 p_1^2}{p_1 (1 - p_1)} \sum_{i=1}^{n} \frac{1}{N_i}    (B.32)
In this last formula, if we assume that all the values of the split variable are equi-probable, so that N_i = N/n and \sum_{i=1}^{n} \frac{1}{N_i} = \frac{n^2}{N}, we get:

Var(\chi^2) = 2n + \frac{1 - 6 p_1 + 6 p_1^2}{p_1 (1 - p_1)} \cdot \frac{n^2}{N}    (B.33)
[Figure B.1: Dependency of the function (1 − 6p_1 + 6p_1²) / (p_1(1 − p_1)) on p_1, plotted for p_1 ∈ (0, 1).]
Since we have 2n variables (A_{i1}, A_{i2}, for i ∈ {1, .., n}) and n constraints (A_{i1} + A_{i2} = N_i), the number of degrees of freedom is n. However, as we have shown, the variance is not exactly the value 2n predicted by the χ²-distribution, but the expression in Equation B.33. Looking at the part of the second term that depends on p_1, depicted in Figure B.1, we notice that when p_1 is close to 0 or 1 the second term can become significant, especially if N is not much larger than n. In these situations the χ²-distribution can be a quite poor approximation to the distribution of the statistic.
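To see how large the correction can get, the short sketch below evaluates Equation B.33 against the nominal value 2n for a few illustrative (made-up) combinations of p_1, n, and N:

    # Compare the corrected variance of Equation B.33 with the nominal value 2n.
    def corrected_var(n, N, p1):
        return 2 * n + (1 - 6 * p1 + 6 * p1 ** 2) / (p1 * (1 - p1)) * n ** 2 / N

    for p1 in (0.5, 0.1, 0.01):
        for n, N in ((10, 1000), (10, 100)):
            print(f"p1={p1:<5} n={n:<3} N={N:<5} 2n={2 * n:<4} "
                  f"corrected={corrected_var(n, N, p1):8.2f}")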
Bibliography
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules.
Proc. 20th Int. Conf. Very Large Data Bases, VLDB (pp. 487–499). Morgan
Kaufmann.
Agresti, A. (1990). Categorical data analysis. John Wiley and Sons.
Alexander, W. P., & Grimshaw, S. D. (1996). Treed regression. Journal of Com-
putational and Graphical Statistics, 156–175.
Berkhin, P. (2002). Survey of clustering data mining techniques (Technical Report).
Accrue Software, San Jose, CA.
Bilmes, J. (1997). A gentle tutorial of the EM algorithm and its application to
parameter estimation for gaussian mixture and hidden markov models (Technical
Report). University of California at Berkeley.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford University
Press, New York, NY.
Brachman, R. J., Khabaza, T., Kloesgen, W., Piatetsky-Shapiro, G., & Simoudis,
E. (1996). Mining business databases. Communications of the ACM, 39.
Bradley, P. S., Fayyad, U. M., & Reina, C. (1998). Scaling clustering algorithms
to large databases. Knowledge Discovery and Data Mining (pp. 9–15).
Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification
and regression trees. Belmont: Wadsworth.
Buntine, W. (1993). Learning classification trees. Artificial Intelligence frontiers
in statistics (pp. 182–201). Chapman & Hall,London.
Chaudhuri, P., Huang, M.-C., Loh, W.-Y., & Yao, R. (1994). Piecewise-polynomial
regression trees. Statistica Sinica, 4, 143–167.
Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W., & Freeman, D. (1988).
Autoclass: A Bayesian classification system. Proceedings of the Fifth Interna-
tional Conference on Machine Learning. Morgan Kaufmann.
Cheeseman, P., & Stutz, J. (1996). Bayesian classification (autoclass): Theory and
results. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy
(Eds.), Advances in knowledge discovery and data mining, chapter 6, 153–180.
AAAI/MIT Press.
Chipman, H., George, E., & McCulloch, R. (1996). Bayesian cart.
Chipman, H. A., George, E. I., & McCulloch, R. E. (1998). Bayesian CART model
search. Journal of the American Statistical Association, 93, 935–947.
Christensen, R. (1997). Log-linear models and logistic regression. Springer, 2nd edition.
Cox, L. A., Qiu, Y., & Kuehner, W. (1989). Heuristic least-cost computation
of discrete classification functions with uncertain argument values. Annals of
Operations Research, 21, 1–30.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their applica-
tion. No. 1 in Cambridge Series in Statistical and Probabilistic Mathematics.
Cambridge Univ Press.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from
incomplete data via the EM algorithm. J. R. Statist. Soc. B, 39, 185–197.
Fayyad, U., Haussler, D., & Stolorz, P. (1996). Mining scientific data. Communi-
cations of the ACM, 39.
Frank, E. (2000). Pruning decision trees and lists. Doctoral dissertation, Depart-
ment of Computer Science, University of Waikato, Hamilton, New Zealand.
Frank, E., & Witten, I. H. (1998). Using a permutation test for attribute selection
in decision trees. International Conference on Machine Learning.
Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of
Statistics, 19, 1–141 (with discussion).
Friedman, N., & Goldszmidt, M. (1996). Building classifiers using Bayesian net-
works. AAAI/IAAI, Vol. 2 (pp. 1277–1284).
Fukunaga, K. (1990). Introduction to statistical pattern recognition, second edition.
Academic Press.
Gehrke, J., Ramakrishnan, R., & Ganti, V. (1998). Rainforest – a framework for
fast decision tree construction of large datasets. Proceedings of the 24th Interna-
tional Conference on Very Large Databases (pp. 416–427). Morgan Kaufmann.
Gillo, M. (1972). MAID: A Honeywell 600 program for an automatised survey
analysis. Behavioral Science, 17, 251–252.
Goldberg, D. E. (1989). Genetic algorithms in search, optimization and machine
learning. Morgan Kaufmann.
Golub, G. H., & Loan, C. F. V. (1996). Matrix computations. Johns Hopkins.
Guetova, M., Holldobler, S., & Storr, H. (2002). Incremental fuzzy decision trees.
25th German Conference on Artificial Intelligence (KI2002).
Hand, D. (1997). Construction and assessment of classification rules. John Wiley
& Sons, Chichester, England.
Heckerman, D. (1995). A tutorial on learning with Bayesian networks.
Hefferon, J. (2003). Linear algebra. http://joshua.smcvt.edu/linearalgebra/.
Hipp, J., Guntzer, U., & Nakhaeizadeh, G. (2000). Algorithms for association rule
mining – a general survey and comparison. SIGKDD Explorations, 2, 58–64.
Hyafil, L., & Rivest, R. L. (1976). Constructing optimal binary decision trees is
np-complete. Information Processing Letters, 5, 15–17.
Inman, W. (1996). The data warehouse and data mining. Communications of the
ACM, 39.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM
Computing Surveys, 31, 264–323.
James, M. (1985). Classification algorithms. Wiley.
Jensen, D. D., & Cohen, P. R. (2000). Multiple comparisons in induction algo-
rithms. Machine Learning, 38, 309–338.
Jordan, M. I., & Jacobs, R. A. (1993). Hierarchical mixtures of experts and the
EM algorithm (Technical Report AIM-1440).
Karalic, A. (1992). Linear regression in regression tree leaves. International School
for Synthesis of Expert Knowledge, Bled, Slovenia.
Kass, G. (1980). An exploratory technique for investigating large quantities of
categorical data. Applied Statistics, 119–127.
Kohavi, R. (1995). The power of decision tables. Proceedings of the 8th European
Conference on Machine Learning (ECML). Heraclion, Crete, Greece: Springer.
Kohonen, T. (1995). Self-organizing maps. Heidelberg: Springer-Verlag.
Kononenko, I. (1995). On biases in estimating multivalued attributes.
Li, K.-C., Lue, H.-H., & Chen, C.-H. (2000). Interactive tree-structured regression
via principal Hessian directions. Journal of the American Statistical Association,
547–560.
Lim, T.-S., Loh, W.-Y., & Shih, Y.-S. (1997). An empirical comparison of decision
trees and other classification methods (Technical Report 979). Department of
Statistics, University of Wisconsin, Madison.
Loh, W.-Y. (2002). Regression trees with unbiased variable selection and interac-
tion detection. Statistica Sinica. in press.
Loh, W.-Y., & Shih, Y.-S. (1997). Split selection methods for classification trees.
Statistica Sinica, 7.
Martin, J. K. (1997). An exact probability metric for decision tree splitting. Ma-
chine Learning, 28, 257–291.
Mehta, M., Agrawal, R., & Rissanen, J. (1996). SLIQ: A fast scalable classifier
for data mining. Proc. of the Fifth Int’l Conference on Extending Database
Technology (EDBT). Avignon, France.
Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine learning, neural
and statistical classification. Ellis Horwood.
Mingers, J. (1987). Expert systems – rule induction with statistical data. J. Opl.
Res. Soc., 38, 39–47.
Morgan, J., & Messenger, R. (1973). THAID: a sequential search program for the
analysis of nominal scale dependent variables (Technical Report). Institute for
Social Research, University of Michigan, Ann Arbor, Michigan.
Murphy, O. J., & Mccraw, R. L. (1991). Designing storage efficient decision trees.
IEEE Transactions on Computers, 40, 315–319.
Murthy, S. K. (1995). On growing better decision trees from data. Doctoral disser-
tation, Department of Computer Science, Johns Hopkins University, Baltimore,
Maryland.
Murthy, S. K. (1997). Automatic construction of decision trees from data: A
multi-disciplinary survey. Data Mining and Knowledge Discovery.
Papoulis, A. (1991). Probability, random variables and stochastic processes.
McGraw-Hill Science/Engineering/Math.
Pratt, J. W., Raiffa, H., & Schlaifer, R. (1995). Statistical decision theory. The
MIT Press.
Quinlan, J. (1993a). A case study in machine learning. Proceedings ACSC-16
Sixteenth Australian Computer Science Conference (pp. 731–7).
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106.
Quinlan, J. R. (1992). Learning with Continuous Classes. 5th Australian Joint
Conference on Artificial Intelligence (pp. 343–348).
Quinlan, J. R. (1993b). C4.5: Programs for machine learning. Morgan Kaufman.
Resnick, S. I. (1999). A probability path. Birkhauser.
Ripley, B. (1996). Pattern recognition and neural networks. Cambridge University
Press, Cambridge.
Sarle, W. (1994). Neural networks and statistical models. Proceedings of the Nine-
teenth Annual SAS Users Group International Conference (pp. 1538–1550). SAS
Institute, Inc., Cary, NC.
Shafer, J., Agrawal, R., & Mehta, M. (1996). SPRINT: A scalable parallel classifier
for data mining. Proc. of the 22nd Int’l Conference on Very Large Databases.
Bombay, India.
Shao, J. (1999). Mathematical statistics. Springer-Verlag.
Shine, R. A., & Strous, L. (2001). Ana. http://ana.lmsal.com.
Sison, L. G., & Chong, E. K. (1994). Fuzzy modeling by induction and pruning of
decision trees. IEEE International Symposium on Intelligent Control. Columbus,
Ohio.
Sonquist, J., Baker, E., & Morgan, J. (1971). Searching for structure (Techni-
cal Report). Institute for Social Research, University of Michigan, Ann Arbor,
Michigan.
Swokowski, E. W. (1991). Calculus. PWS Publishing Co. 5th edition.
Torgo, L. (1997a). Functional models for regression tree leaves. Proc. 14th Inter-
national Conference on Machine Learning (pp. 385–393). Morgan Kaufmann.
Torgo, L. (1997b). Kernel regression trees. European Conference on Machine
Learning. Poster paper.
Torgo, L. (1998). A comparative study of reliable error estimators for pruning
regression trees. Iberoamerican Conference on Artificial Intelligence. Springer-
Verlag.
Vapnik, V. N. (1998). Statistical learning theory. Wiley-Interscience.
Waterhouse, S. R., & Robinson, A. J. (1994). Classification using hierarchical
mixtures of experts. Proceedings of the 1994 IEEE Workshop on Neural Networks
for Signal Processing IV (pp. 177–186). Long Beach, CA: IEEE Press.
Weiss, S. M., & Kulikowski, C. A. (1991). Computer systems that learn: Classi-
fication and prediction methods from statistics, neural nets, machine learning,
and expert systems. Morgan Kaufman.
White, A. P., & Liu, W. Z. (1994). Bias in information-based measures in decision
tree induction. Machine Learning, 15, 321–329.
Wilks, S. S. (1962). Mathematical statistics. John Wiley & Sons, Inc.