
Bagging Decision Trees on Data Sets with Classification Noise


Page 1: Bagging Decision Trees on Data Sets with Classification Noise

Bagging Decision Trees on Data Sets with Classification Noise

Joaquín Abellán and Andrés R. Masegosa

Department of Computer Science and Artificial Intelligence

University of Granada

Sofia, February 2010

6th International Symposium on Foundations of Information and Knowledge Systems


Page 2: Bagging Decision Trees on Data Sets with Classification Noise

Part I

Introduction


Page 3: Bagging Decision Trees on Data Sets with Classification Noise

Introduction

Ensembles of Decision Trees (DT)

Features

They usually build different DTs for different samples of the training dataset.
The final prediction is a combination of the individual predictions of each tree.
They take advantage of the inherent instability of DTs.
Bagging, AdaBoost and Randomization are the most well-known approaches.


Page 4: Bagging Decision Trees on Data Sets with Classification Noise

Introduction

Classification Noise (CN) in the class values

Definition

The class values of the samples given to the learning algorithm contain some errors.
Random classification noise: the label of each example is flipped randomly and independently with some fixed probability, called the noise rate.
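As a concrete illustration, below is a minimal sketch of injecting random classification noise into a label vector. The function name and the choice of drawing the replacement class uniformly from the remaining classes are assumptions for illustration, not details taken from the slides.

```python
import numpy as np

def add_classification_noise(y, noise_rate, seed=0):
    """Flip each label independently with probability `noise_rate` (sketch).

    The replacement class is drawn uniformly from the other classes; this
    uniform choice is an assumption, not a detail from the paper.
    """
    rng = np.random.default_rng(seed)
    y_noisy = np.asarray(y).copy()
    classes = np.unique(y_noisy)
    flip = rng.random(len(y_noisy)) < noise_rate     # which labels get corrupted
    for i in np.where(flip)[0]:
        others = classes[classes != y_noisy[i]]
        y_noisy[i] = rng.choice(others)
    return y_noisy
```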

Causes

It is mainly due to errors in the data capture process.
Very common in real-world applications: surveys, biological or medical information, etc.

Effects on ensembles of decision trees

The presence of classification noise degrades the performance of any classification inducer.
AdaBoost is known to be strongly affected by the presence of classification noise.
Bagging is the ensemble approach with the best response to classification noise, especially with C4.5 [21] as the decision tree inducer.


Page 5: Bagging Decision Trees on Data Sets with Classification Noise

Introduction

Motivation of this study

Previous Works [7]

Simple decision trees (no continuous attributes, no missing values, no post-pruning) built with different split criteria were considered in a Bagging scheme.
Classic split criteria (Info-Gain, Info-Gain Ratio and Gini Index) and a new split criterion based on imprecise probabilities were analyzed.
Imprecise Info-Gain generates the most robust Bagging ensembles on data sets with classification noise.

Contributions

An extension of Bagging ensembles of credal decision trees to deal with:

Continuous Variables
Missing Data
Post-Pruning Process

Evaluate the performance on data sets with different rates of random classification noise.
Experimental comparison with Bagging ensembles built with the C4.5R8 decision tree inducer.

Outline

Description of the different split criteria.
Bagging Decision Trees.
Experimental Results.
Conclusions and Future Works.


Page 6: Bagging Decision Trees on Data Sets with Classification Noise

Part II

Previous Knowledge


Page 7: Bagging Decision Trees on Data Sets with Classification Noise

Decision Trees

Description

Attributes are placed at the nodes.
Class value predictions are placed at the leaves.
Each leaf corresponds to a classification decision rule.

Learning

Split Criteria selects the attribute to place at each branching node.
Stop Criteria decides when to fix a leaf and stop the branching.
Post-Pruning simplifies the tree by pruning those branches with low support in their associated decision rules.
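To make the three components concrete, here is a minimal recursive tree-building skeleton. It uses plain Info-Gain over discrete attributes as the split criterion and a simple purity/size rule as the stop criterion (both choices are illustrative assumptions); post-pruning would be applied afterwards to the returned tree.

```python
import numpy as np

def _entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def build_tree(X, y, min_leaf=2):
    """Recursive decision-tree skeleton (sketch): split criterion = Info-Gain over
    discrete attributes, stop criterion = pure node / too few instances / no
    informative split. Post-pruning would be applied to the returned tree."""
    vals, counts = np.unique(y, return_counts=True)
    majority = vals[np.argmax(counts)]
    if len(vals) == 1 or len(y) <= min_leaf:              # stop criterion
        return {"leaf": majority}
    gains = [_entropy(y) - sum((X[:, j] == v).mean() * _entropy(y[X[:, j] == v])
                               for v in np.unique(X[:, j]))
             for j in range(X.shape[1])]                  # split criterion
    best = int(np.argmax(gains))
    if gains[best] <= 0:                                  # no informative split
        return {"leaf": majority}
    return {"attr": best,
            "children": {v: build_tree(X[X[:, best] == v], y[X[:, best] == v], min_leaf)
                         for v in np.unique(X[:, best])}}
```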


Page 8: Bagging Decision Trees on Data Sets with Classification Noise

C4.5 Tree Inducer

Description

It is the most famous tree inducer, introduced by Quinlan in 1993 [21].
Eight different releases have been proposed; in this work we consider the last one, C4.5R8.
It was ranked the most influential data mining algorithm (IEEE ICDM'06).

Features

Split Criteria: the Info-Gain Ratio, a quotient between the information gain of an attribute and its entropy.
Numeric Attributes: it looks for the optimal binary split point in terms of information gain.
Missing Values: assumes the Missing at Random hypothesis; the missing variable is marginalized out when making predictions.
Post-Pruning: it employs pessimistic error pruning: an upper bound of the estimated error is computed and, when the bound of a leaf is higher than the bound of its ancestor, the leaf is pruned.
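A sketch of how the Gain Ratio score can be computed for a discrete attribute; the function names are illustrative and this is not Quinlan's original implementation.

```python
import numpy as np

def _entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def info_gain_ratio(x, y):
    """Gain Ratio of a discrete attribute x w.r.t. class labels y (sketch):
    Info-Gain divided by the entropy of the attribute (the split information)."""
    cond = sum((x == v).mean() * _entropy(y[x == v]) for v in np.unique(x))
    gain = _entropy(y) - cond
    split_info = _entropy(x)
    return gain / split_info if split_info > 0 else 0.0
```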


Page 9: Bagging Decision Trees on Data Sets with Classification Noise

Bagging Decision Trees

Procedure

Ti samples are generated by random sampling with replacement from the initial training dataset.
From each Ti sample, a simple decision tree is built using a given split criterion.
The final prediction is made by majority voting.
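A minimal Bagging sketch along these lines; scikit-learn's DecisionTreeClassifier is used only as a stand-in base inducer (the paper uses C4.5 or the credal tree), and the function names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # stand-in base inducer (assumption)

def bagging_fit(X, y, n_trees=100, seed=0):
    """Build one tree per bootstrap sample Ti drawn with replacement."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))        # sampling with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Combine the individual predictions by majority vote (integer labels assumed)."""
    votes = np.stack([t.predict(X) for t in trees])       # shape: (n_trees, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```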

Motivation

Employing different decision tree inducers (C4.5, CART, ...) we get different Bagging ensembles.
Bagging ensembles have been reported to be one of the best classification models when there are noisy data samples [10,13,18].


Page 10: Bagging Decision Trees on Data Sets with Classification Noise

Part III

Bagging Credal Decision Trees


Page 11: Bagging Decision Trees on Data Sets with Classification Noise

Credal Decision Trees (CDT) Inducer

New Features
Handles numeric attributes and the presence of missing values in the data set.
Adaptation of C4.5 methods.
Strong simplification of the algorithm and far fewer parameters.
Takes advantage of the good properties of the Imprecise Information Gain [2] measure against overfitting.


Page 12: Bagging Decision Trees on Data Sets with Classification Noise

Imprecise Information Gain (Imprecise-IG) measure

Maximum Entropy Function

Probability intervals for multinomial variables are computed from the data set using Walley's IDM [22].
We then obtain a credal set of probability distributions for the class variable, K(C).
The maximum entropy S(K) of a credal set is efficiently computed using Abellán and Moral's method [2].

Imprecise-IG Split criteria

Imprecise Info-Gain for each variable X is defined as:

IIG(X, C) = S(K(C)) - Σ_i p(x_i) · S(K(C | X = x_i))

It was successfully applied to build simple decision trees in [3].
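A sketch of the Imprecise-IG computation under the IDM: the lower bounds of the probability intervals are n_i/(N+s) and the remaining mass s/(N+s) is distributed by water-filling over the smallest class frequencies, which yields the maximum entropy of the credal set. The water-filling shortcut stands in for Abellán and Moral's exact algorithm, and the function names are illustrative.

```python
import numpy as np

def max_entropy_idm(counts, s=1.0):
    """Maximum entropy S(K) of the IDM credal set for the given class counts
    (sketch): the lower bounds n_i/(N+s) are raised by water-filling the free
    mass s/(N+s)."""
    p = np.asarray(counts, dtype=float)
    total = p.sum() + s
    p = p / total                    # lower bounds of the probability intervals
    mass = s / total                 # probability mass still to distribute
    while mass > 1e-12:
        tied = p <= p.min() + 1e-12              # classes at the current minimum
        k = int(tied.sum())
        if k == len(p):                          # all level: spread evenly
            p += mass / k
            break
        gap = (np.min(p[~tied]) - p.min()) * k   # mass needed to reach next level
        if gap >= mass:
            p[tied] += mass / k
            break
        p[tied] = np.min(p[~tied])
        mass -= gap
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def imprecise_info_gain(x, y, s=1.0):
    """IIG(X, C) = S(K(C)) - sum_i p(x_i) * S(K(C | X = x_i))  (sketch)."""
    classes = np.unique(y)
    root = max_entropy_idm([np.sum(y == c) for c in classes], s)
    cond = sum((x == v).mean() *
               max_entropy_idm([np.sum(y[x == v] == c) for c in classes], s)
               for v in np.unique(x))
    return root - cond
```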


Page 13: Bagging Decision Trees on Data Sets with Classification Noise

Split Criteria

Credal DT

The attribute with the maximum Imprecise-IG score is selected as the split attribute at each branching node.

C4.5R8

There are multiple conditions and heuristics:

Maximum Info-Gain Ratio score.
IG score higher than the average IG score of the valid split attributes.
Valid split attributes are those whose number of values is smaller than 30% of the number of samples in that branch.


Page 14: Bagging Decision Trees on Data Sets with Classification Noise

Stop Criteria

Credal DT

All split attributes have negative Imprecise-IG.

Minimum number of instances in a leaf higher than 2.

C4.5R8

The Info-Gain measure is always positive.

There are multiple conditions and heuristics:

Minimum number of instances in a leaf higher than 2.
When there is no valid split attribute.


Page 15: Bagging Decision Trees on Data Sets with Classification Noise

Numeric Attributes

Credal DT

Each possible split point is evaluated. The one which generates the bi-partition with the highest Imprecise-IG score is selected.

C4.5R8

Same approach: the split point with the maximum Info-Gain is selected.
There are extra restrictions and heuristics:

Minimum number of instances: 10% of the ratio between the number of instances in this branch and the cardinality of the class variable.
The Info-Gain is corrected by subtracting the logarithm of the number of evaluated split points divided by the number of instances in this branch.
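A sketch of the exhaustive split-point search for a numeric attribute; the `score` argument stands for whichever split criterion is plugged in (Imprecise-IG or Info-Gain), and using the observed values themselves as candidate thresholds is an illustrative simplification.

```python
import numpy as np

def best_binary_split(values, y, score):
    """Evaluate every candidate threshold on a numeric attribute and keep the one
    whose induced bi-partition maximizes `score(partition, y)` (sketch)."""
    best_t, best_s = None, -np.inf
    for t in np.unique(values)[:-1]:             # each distinct value but the last
        partition = (values <= t).astype(int)    # bi-partition induced by threshold t
        s_val = score(partition, y)
        if s_val > best_s:
            best_t, best_s = t, s_val
    return best_t, best_s
```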


Page 16: Bagging Decision Trees on Data Sets with Classification Noise

Post-Pruning

Credal DT

Reduced Error Pruning [20].

The simplest pruning process.

Two folds are used to build the tree and one fold to estimate the test error.
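As a sketch of the pruning decision applied at a single node in Reduced Error Pruning: the subtree is replaced by a leaf labeled with the majority class of the grow data at that node whenever that does not increase the error on the held-out pruning fold. Function and argument names are illustrative.

```python
import numpy as np

def prune_node(subtree_pred, leaf_label, y_prune):
    """Reduced Error Pruning decision for one node (sketch): return True if
    replacing the subtree by a leaf predicting `leaf_label` (the majority class
    of the grow data at that node) does not increase error on the prune fold."""
    err_subtree = np.mean(np.asarray(subtree_pred) != np.asarray(y_prune))
    err_leaf = np.mean(np.asarray(y_prune) != leaf_label)
    return err_leaf <= err_subtree
```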


Page 17: Bagging Decision Trees on Data Sets with Classification Noise

Part IV

Experiments


Page 18: Bagging Decision Trees on Data Sets with Classification Noise

Experimental Set-up

Benchmark

25 UCI data sets with very different features.
Bagging ensembles of 100 trees.
Bagging-CDT versus Bagging-C4.5R8.
Different noise rates were applied to the training datasets (not to the test datasets): 0%, 5%, 10%, 20% and 30%.
10-fold cross-validation repeated 10 times was used to estimate the classification accuracy.

Statistical Tests [12,24]

Corrected Paired T-test.
Wilcoxon Signed-Ranks Test.
Sign Test.
Friedman Test.
Nemenyi Test.
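For instance, the Wilcoxon Signed-Ranks Test over the paired per-dataset accuracies of the two ensembles can be run with SciPy; the accuracy values below are placeholders for illustration only, not results from the paper.

```python
from scipy.stats import wilcoxon

# Placeholder paired accuracies (one entry per data set); not the paper's results.
acc_bcdt = [0.82, 0.75, 0.91, 0.66, 0.88, 0.79]
acc_bc45 = [0.80, 0.74, 0.89, 0.67, 0.85, 0.78]

stat, p_value = wilcoxon(acc_bcdt, acc_bc45)   # Wilcoxon Signed-Ranks Test
print(f"Wilcoxon statistic = {stat:.2f}, p-value = {p_value:.3f}")
```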


Page 19: Bagging Decision Trees on Data Sets with Classification Noise

Performance Evaluation without tree post-pruning

Analysis

There are no statistically significant differences at low noise levels.
B-CDT outperforms B-C4.5 ensembles at high noise levels.
B-CDT always induces simpler decision trees (lower number of nodes).


Page 20: Bagging Decision Trees on Data Sets with Classification Noise

Performance Evaluation with tree post-pruning

Analysis

Post-pruning methods help to improve the performance when there is noise in the data.
The performance of B-C4.5 does not degrade as quickly.
B-CDT still has better performance at high noise levels.
B-CDT also induces simpler trees.


Page 21: Bagging Decision Trees on Data Sets with Classification Noise

Bias-Variance (BV) Error Analysis

Bias-Variance Decomposition of 0-1 Loss Functions [17]

Error = Bias² + Variance

Bias: error component due to the incapacity of the predictor to model the underlying distribution.
Variance: error component that stems from the particularities of the training data set (i.e., a measure of overfitting).
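A Monte-Carlo sketch of estimating these two components for the 0-1 loss: many models are trained on bootstrap replicates, the main prediction per test point is the most voted class, the squared-bias term is the error of that main prediction, and the variance term is the average disagreement with it. The exact definitions in [17] may differ slightly, and scikit-learn's DecisionTreeClassifier is only a stand-in learner.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # stand-in learner (assumption)

def bias_variance_01(X_tr, y_tr, X_te, y_te, n_rounds=50, seed=0):
    """Monte-Carlo estimate of a bias/variance decomposition of the 0-1 loss
    (sketch; integer class labels assumed)."""
    rng = np.random.default_rng(seed)
    preds = np.empty((n_rounds, len(y_te)), dtype=int)
    for r in range(n_rounds):
        idx = rng.integers(0, len(y_tr), size=len(y_tr))     # bootstrap training set
        preds[r] = DecisionTreeClassifier().fit(X_tr[idx], y_tr[idx]).predict(X_te)
    main = np.array([np.bincount(col).argmax() for col in preds.T])  # main prediction
    bias2 = np.mean(main != y_te)                  # squared-bias term of the 0-1 loss
    variance = np.mean(preds != main[None, :])     # disagreement with main prediction
    return bias2, variance
```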


Page 22: Bagging Decision Trees on Data Sets with Classification Noise

BV Error Differences: No pruning

Analysis

Bias differences remain stable across the different noise levels.
Variance differences increase with the noise level.
B-CDT does not overfit as much as B-C4.5 (lower variance error) at high noise levels.
The large number of heuristics in C4.5 may not help when there is spurious data.


Page 23: Bagging Decision Trees on Data Sets with Classification Noise

Part V

Conclusions and Future Works


Page 24: Bagging Decision Trees on Data Sets with Classification Noise

Conclusions and Future Works

Conclusions

We have presented an interesting application of information-based uncertainty measures to a challenging data mining problem.
A very simple decision tree inducer that handles numeric attributes and deals with missing values has been proposed.
An extensive experimental evaluation has been carried out to analyze the effect of classification noise on Bagging ensembles.
Bagging with decision trees induced by the Imprecise-IG measure has better performance and less overfitting at medium-high noise levels.
The Imprecise-IG is a robust split criterion for building Bagging ensembles of decision trees.

Future Works

Develop a new pruning method based on imprecise probabilities.
Extend these methods to carry out credal classification.
Apply new imprecise models such as the NPI model.


Page 25: Bagging Decision Trees on Data Sets with Classification Noise

Thanks for your attention!!

Questions?
