1. Introduction
Decision trees are one of the methods for concept learning from examples. They are widely
used in machine learning and knowledge acquisition systems. Their main application area is
classification tasks:
We are given a set of records, called the training set. Each record in the training set has the same
structure, consisting of a number of attribute/value pairs. One of these attributes represents the
class of the record. We also have a test set for which the class information is unknown. The
problem is to derive a decision tree, using the examples from the training set, which will
determine the class of each record in the test set.
The leaves of an induced decision tree are labeled with class names, and the other nodes represent
attribute-based tests with a branch for each possible value of the particular attribute.
Once the tree is formed, we can classify objects from the test set: starting at the root of the tree,
we evaluate the test and take the branch appropriate to the outcome. The process continues until a
leaf is encountered, at which point the object is asserted to belong to the class named by the leaf.
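The classification procedure can be sketched in a few lines of Python (the nested-dictionary tree encoding and the attribute names are my own illustrative choices; the tree is the one from Figure 1):

```python
# A decision tree encoded as nested dicts: an internal node maps an
# attribute name to a {value: subtree} dict; a leaf is a class name.
tree = {"outlook": {
    "sunny":    {"humidity": {"high": "N", "normal": "P"}},
    "overcast": "P",
    "rain":     {"windy": {"true": "N", "false": "P"}},
}}

def classify(node, record):
    # Walk from the root: evaluate the test at the current node and take
    # the branch matching the record's value, until a leaf is reached.
    while isinstance(node, dict):
        attribute = next(iter(node))
        node = node[attribute][record[attribute]]
    return node

print(classify(tree, {"outlook": "sunny", "humidity": "normal", "windy": "false"}))  # -> P
```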
Induction of decision trees has been a very active area of machine learning, and many approaches
and techniques have been developed for building trees with high classification performance. The
most commonly addressed problems are:
selecting of the best attribute for splitting
dealing with noise in real-world tasks
pruning of complex decision trees
dealing with unknown attribute values
dealing with continuous attribute values.
My intention is to give an overview of methods addressing the first three problems, which I find
to be strongly mutually dependent and of primary importance for building trees with good
classification ability.
2. Selection Criterion
For a given training set it is possible to construct many decision trees that will correctly classify
all of its objects. Among all accurate trees we are interested in the simplest one. This search is
guided by the Occam's Razor heuristic: among all rules that accurately account for the training set,
the simplest is likely to have the highest success rate when used to classify unseen objects. This
heuristic is also supported by analysis: Pearl and Quinlan [11] have derived upper bounds on the
expected error using different formalisms for generalizing from a set of known cases. For a
training set of predetermined size, these bounds increase with the complexity of the induced
generalization.
Since a decision tree is made of nodes that represent attribute-based tests, simplifying the tree
means reducing the number of tests. We can achieve this by carefully selecting the order in which
the tests are conducted. The example given in [11] shows how different decision trees may be
constructed for the same training set, given in Table 1. Each of the examples in the training set is
described in terms of 4 discrete attributes: outlook {sunny, overcast, rain}, temperature {cool,
mild, hot}, humidity {high, normal} and windy {true, false}, and each belongs to one of the
classes N or P. Figure 1 shows the decision tree when the attribute outlook is used for the first
test, and Figure 2 shows the decision tree when temperature is tested first. The difference in
complexity is obvious.
No.  Outlook   Temperature  Humidity  Windy  Class
1    sunny     hot          high      false  N
2    sunny     hot          high      true   N
3    overcast  hot          high      false  P
4    rain      mild         high      false  P
5    rain      cool         normal    false  P
6    rain      cool         normal    true   N
7    overcast  cool         normal    true   P
8    sunny     mild         high      false  N
9    sunny     cool         normal    false  P
10   rain      mild         normal    false  P
11   sunny     mild         normal    true   P
12   overcast  mild         high      true   P
13   overcast  hot          normal    false  P
14   rain      mild         high      true   N

Table 1. A small training set
Figure 1. A simple decision tree:
outlook = sunny: test humidity (high -> N, normal -> P)
outlook = overcast: P
outlook = rain: test windy (true -> N, false -> P)
Figure 2. A complex decision tree (rooted at temperature, with repeated tests on outlook, windy
and humidity in its subtrees)
This implies that the choice of test is crucial for the simplicity of the decision tree, a point on
which many researchers, such as Quinlan [11], Fayyad [5], and White and Liu [16], agree.
A method of choosing the test to form the root of a decision tree is usually referred to as the
selection criterion. Many different selection criteria have been tested over the years; the most
common among them are maximum information gain and the GINI index. Both of these methods
belong to the class of impurity measures, which are designed to capture the aspects of a
partitioning of examples that are relevant to good classification.
Impurity measures
Let S be a set of training examples, with each example e ∈ S belonging to one of the classes in
C = {C1, C2, …, Ck}. We can define the class vector (c1, c2, …, ck) ∈ N^k, where
ci = |{e ∈ S | class(e) = Ci}|, and the class probability vector (p1, p2, …, pk) ∈ [0, 1]^k, where
pi = ci / |S|. It is obvious that Σi pi = 1.
A set of examples is said to be pure if all its examples belong to one class. Hence, if the
probability vector of a set of examples has a component equal to 1 (all other components being
equal to 0), the set is pure. On the other hand, if all components are equal we get the extreme case
of impurity.
To quantify the notion of impurity, a family of functions known as impurity measures [5] is
defined.
Definition 1. Let S be a set of training examples having a class probability vector PC. A function
φ: [0, 1]^k → R such that φ(PC) ≥ 0 is an impurity measure if it satisfies the following
conditions:
1. φ(PC) is minimum if ∃i such that the component PCi = 1.
2. φ(PC) is maximum if ∀i, 1 ≤ i ≤ k, PCi = 1/k.
3. φ(PC) is symmetric with respect to the components of PC.
4. φ(PC) is smooth (differentiable everywhere) in its range.
Conditions 1 and 2 express the well-known extreme cases, and condition 3 ensures that the
measure is not biased towards any of the classes.
For induction of decision trees, an impurity measure is used to evaluate the impurity of the
partition induced by an attribute.
Let PC(S) be the class probability vector of S and let A be a discrete attribute over the set S.
Assume the attribute A partitions the set S into the sets S1, S2, …, Sv. The impurity of the
partition is defined as the weighted average impurity of its component blocks:

φ(S, A) = Σi=1..v (|Si| / |S|) φ(PC(Si))

Finally, the goodness-of-split due to attribute A is defined as the reduction in impurity after the
partition:

Δφ(S, A) = φ(PC(S)) − φ(S, A)

If we choose entropy as the impurity measure:

φ(PC(S)) = − Σi=1..k pi log2 pi

so that φ(S, A) is the entropy of the partition, E(A, S), then the reduction in impurity gained by an
attribute is called the information gain.
This method was used in many algorithms for induction of decision trees such as ID3, GID3* and
CART.
The other popular impurity measure is the GINI index, used in CART [2]. To obtain the GINI
index we set φ to be

φ(PC(S)) = 1 − Σi=1..k pi²
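Both measures, and the resulting goodness-of-split, are easy to state in code. The sketch below (function names are my own) evaluates the outlook attribute of Table 1 with entropy as the impurity measure:

```python
from collections import Counter
from math import log2

def entropy(labels):
    # phi(PC) = -sum_i p_i log2 p_i
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # phi(PC) = 1 - sum_i p_i^2
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def goodness_of_split(phi, labels, values):
    # Delta-phi(S, A): impurity of S minus the weighted average impurity
    # of the blocks S_1..S_v induced by the attribute's values.
    n = len(labels)
    blocks = {}
    for value, label in zip(values, labels):
        blocks.setdefault(value, []).append(label)
    weighted = sum(len(b) / n * phi(b) for b in blocks.values())
    return phi(labels) - weighted

# Table 1: the class column and the outlook column for the 14 examples
classes = list("NNPPPNPNPPPPPN")
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
           "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"]
print(round(goodness_of_split(entropy, classes, outlook), 3))  # -> 0.247
```

The same call with `gini` in place of `entropy` gives the GINI-based goodness-of-split.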
All functions belonging to the family of impurity measures agree on the minima, maxima and
smoothness, and as a consequence they should result in similar trees [2], [9] regarding complexity
and accuracy. After a detailed analysis, however, Breiman [3] reports basic differences between
trees produced using information gain and the GINI index. GINI prefers splits that put the largest
class into one pure node and all others into the other; entropy favors size-balanced child nodes. If
the number of classes is small, both criteria should produce similar results. The difference appears
when the number of classes is larger: in this case GINI produces splits that are too unbalanced
near the root of the tree, while splits produced by entropy show a lack of uniqueness.
This analysis points out some of the problems associated with impurity measures. But,
unfortunately, these are not the only ones.
Some experiments carried out in the mid-eighties showed that the gain criterion tends to favor
attributes with many values [8]. This finding was supported by the analysis in [11]. One of the
solutions to this problem was offered by Kononenko et al. in [8]: the induced decision tree has to
be a binary tree, meaning that every test has only two outcomes. If we have an attribute A with
values A1, A2, …, Av, the decision tree no longer branches on each possible value. Instead, a
subset of the values is chosen, and the tree has one branch for that subset and another for the
remaining values. This criterion is known as the subset criterion. Kononenko et al. report that this
modification led to smaller decision trees with improved classification performance. Although it
is obvious that binary trees don't suffer from the bias in favor of attributes with a large number of
values, it is not known whether this is the only reason for their better performance.
This finding is repeated in [4], and in [5] Fayyad and Irani introduce the binary tree hypothesis:
for a top-down, non-backtracking decision tree generation algorithm applying a proper attribute
selection measure, selecting a single attribute-value pair at each node, and thus constructing a
binary tree, rather than selecting an attribute and branching on all its values simultaneously, is
likely to lead to a decision tree with fewer leaves.
A formal proof of this hypothesis does not exist; it is the result of informal analysis and empirical
evaluation. Fayyad has also shown in [4] that for every decision tree there exists a binary decision
tree that is logically equivalent to it. This means that for every decision tree we could induce a
logically equivalent binary decision tree that is expected to have fewer nodes and to be more
accurate.
But binary trees have some side-effects, explained in [11]:
First, such trees are undoubtedly less intelligible to human experts than is ordinarily the case,
with unrelated attribute values being grouped together and with multiple tests on the same
attribute.
Second, the subset criterion can require a large increase in computation, especially for attributes
with many values: for an attribute A with v values there are 2^(v−1) − 1 different ways of
specifying the distinguished subset of attribute values. But since a decision tree is induced only
once and then used for classification, and since computing power is rapidly increasing, this
problem seems to diminish.
In [11] Quinlan proposes another method for overcoming the bias in information gain, called the
gain ratio. The gain ratio GR normalizes the gain by the attribute information IV (Quinlan's
original notation):

GR(S, A) = gain(S, A) / IV(S, A), where IV(S, A) = − Σi=1..v (|Si| / |S|) log2(|Si| / |S|)

The attribute information is used as a normalizing factor because of its property to increase as the
number of possible values increases.
As mentioned in [11], this ratio may not always be defined (the denominator may be zero), or it
may tend to favor attributes for which the denominator is very small. As a solution, the gain ratio
criterion selects, from among those attributes with an average-or-better gain, the attribute that
maximizes GR. The experiments described in [14] show an improvement in tree simplicity and
prediction accuracy when the gain ratio criterion is used.
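A minimal sketch of the gain ratio (helper names are my own; the data is the outlook attribute of Table 1):

```python
from collections import Counter
from math import log2

def entropy(items):
    n = len(items)
    return -sum(c / n * log2(c / n) for c in Counter(items).values())

def gain(labels, values):
    # Information gain of the partition induced by the attribute's values.
    n = len(labels)
    blocks = {}
    for value, label in zip(values, labels):
        blocks.setdefault(value, []).append(label)
    return entropy(labels) - sum(len(b) / n * entropy(b) for b in blocks.values())

def gain_ratio(labels, values):
    # GR = gain / IV; IV is the entropy of the attribute-value distribution
    # itself, which grows with the number of distinct values.
    iv = entropy(values)
    return gain(labels, values) / iv if iv > 0 else 0.0

classes = list("NNPPPNPNPPPPPN")
outlook = ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
           "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"]
print(round(gain_ratio(classes, outlook), 3))  # -> 0.156
```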
There is also another measure, introduced by Lopez de Mantaras in [6], called the distance
measure dN:

1 − dN = gain(S, A) / ( − Σi Σj (|Sij| / |S|) log2(|Sij| / |S|) )

where |Sij| is the number of examples with value aj of attribute A that belong to class Ci.
This is just another attempt at normalizing information gain, but in this case with the cell
information (a cell is a subset of S which contains all examples with one attribute value that
belong to one class).
Although both of these normalized measures were claimed to be unbiased, the statistical analysis
in [17] shows that each again favors attributes with a larger number of values. These results also
suggest that, in this respect, information gain is the worst of the three measures, while gain ratio is
the least biased. Furthermore, their analysis shows that the magnitude of the bias strongly
depends on the number of classes, increasing as k increases.
Orthogonality measure
Recently, a conceptually new approach was introduced by Fayyad and Irani in [5]. In their
analysis they give a number of reasons why information entropy, as a representative of the class
of impurity measures, is not suitable for attribute selection.
Consider the following example: a set S of 110 examples belonging to three classes {C1, C2, C3},
whose class vector is (50, 10, 50). Assume that the attribute-value pairs (A, a1) and (A, a2) induce
two binary partitions of S, π1 and π2, shown in Figure 3. We can see that π2 completely separates
the class C2 from the classes C1 and C3. However, the information gain measure prefers partition
π1 (gain = 0.51) over π2 (gain = 0.43).
Figure 3. Two possible binary partitions (class counts listed in the order C1, C2, C3):
Partition π1, produced by a1: blocks (45, 8, 5) and (5, 2, 45), gain = 0.51
Partition π2, produced by a2: blocks (50, 0, 50) and (0, 10, 0), gain = 0.43
Analysis shows that if π2 is accepted, the subtree under this node has a smaller lower bound on its
number of leaves; if π1 is chosen, the subtree must have at least 6 leaves.
Intuitively, if the goal is to generate a tree with a small number of leaves, the selection measure
should be sensitive to total class separation: it should separate differing classes from each other
as much as possible, while separating as few examples of the same class as possible. The above
example shows that information entropy doesn't satisfy these demands; it is completely
insensitive to class separation and within-class fragmentation. The only exception is when the
learning problem has exactly two classes: then class purity and class separation become the same.
Another negative property of information gain emphasized in this paper is its tendency to induce
decision trees with near-minimal average depth. Empirical evaluation shows that such trees tend
to have a large number of leaves and a high error rate [4].
Another deficiency pointed out is actually embedded in the definition of impurity measures: their
symmetry with respect to the components of PC. This means that a set with a given class
probability vector evaluates identically to another set whose class vector is a permutation of the
first. Thus, if one of the subsets of a set S has a different majority class than the original, but the
distribution of classes is simply permuted, entropy will not detect the change. However, this
change in dominant class is generally a strong indicator that the attribute value is relevant to
classification.
Recognizing the above weaknesses of impurity measures, the authors define a desirable class of
selection measures:
Assuming induction of a binary tree (relying on the binary tree hypothesis), for a training set S
and attribute A, a test on this attribute induces a binary partition of the set S:

S = S1 ∪ S2, where S1 = {e ∈ S | e satisfies the test} and S2 = S \ S1.
A selection measure should satisfy the following properties:
1. It is maximum when the classes in S1 are disjoint from the classes in S2 (inter-class
separation).
2. It is minimum when the class distribution in S1 is identical to the class distribution in S2.
3. It favors partitions which keep examples of the same class in the same block (intra-class
cohesiveness).
4. It is sensitive to permutations in the class distribution.
5. It is non-negative, smooth (differentiable), and symmetric with respect to the classes.
This defines a family of measures, called C-SEP (for Class SEParation), for evaluating binary
partitions.
One such measure proposed in this paper, and proven to satisfy all requirements of the C-SEP
family, is the orthogonality measure, defined as:

ORT(π, S) = 1 − cos θ(V1, V2)

where θ(V1, V2) is the angle between the class vectors V1 and V2 of the partition blocks S1 and
S2, respectively.
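The measure is a one-liner over the two class vectors. Using the class vectors of Figure 3, ORT indeed prefers π2, whose blocks share no class (the function name is my own sketch):

```python
from math import sqrt

def ort(v1, v2):
    # ORT(pi, S) = 1 - cos(angle between the class vectors of the two
    # blocks); it equals 1 when the blocks share no class at all.
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2))
    return 1 - dot / norms

# Class vectors (C1, C2, C3) of the two partitions from Figure 3
print(round(ort((45, 8, 5), (5, 2, 45)), 2))   # pi1 -> 0.78
print(round(ort((50, 0, 50), (0, 10, 0)), 2))  # pi2 -> 1.0
```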
The results of an empirical comparison of the orthogonality measure, embedded in the O-BTREE
system, with the entropy measure used in GID3* (which branches only on a few individual values
while grouping the rest in one default branch), information gain in ID3, gain ratio in ID3-IV and
information gain for induction of binary trees in ID3-BIN, are taken from [5] and given in Figures
4, 5, 6 and 7. In these experiments 5 different data sets were used: RIE (Reactive Ion Etching),
which is synthetic, and the real-world data sets Soybean, Auto, Harr90 and Mushroom.
Descriptions of these sets may be found in [5]. The results are reported as ratios relative to GID3*
performance (GID3* = 1.0 in both cases).
Figure 4. Error ratios for RIE-random domains. Figure 5. Leaf ratios for RIE-random domains.
Figures 4, 5, 6 and 7 show that the results for the O-BTREE algorithm are almost always superior
to those of the other algorithms.
Conclusion
Until recently, most algorithms for induction of decision trees used one of the impurity measures
described in the previous section. These functions were borrowed from information theory
without any formal analysis of their suitability as selection criteria. The empirical results were
acceptable, and only small variations of these methods were further tested.
Fayyad and Irani's approach in [5] introduces a completely different family of measures, C-SEP,
for binary partitions. It recognizes very important properties of a measure, inter-class separation
and intra-class cohesiveness, that were not precisely captured by impurity measures. This is the
first step towards a better formalization of the selection criterion, which is necessary for further
improvement of decision trees' accuracy and simplicity.
Figure 6. Relative ratios of error rates (GID3* = 1). Figure 7. Ratios of numbers of leaves
(GID3* = 1).
3. Noise
When we use decision tree induction techniques in real-world domains we have to expect noisy
data. The description of an object may include attributes based on measurements or subjective
judgement, both of which can give rise to errors in the attribute values. Sometimes the class
information itself may contain errors. These defects in the data may lead to two known problems:
attribute inadequacy, meaning that even though some examples have identical descriptions
in terms of attribute values, they do not belong to the same class; inadequate attributes are
unable to distinguish among the objects in S
spurious tree complexity, which is the result of the tree induction algorithm trying to fit the
noisy data into the tree.
Recognizing these two problems, we can define two modifications of the tree-building algorithm
if it is to operate with a noise-affected training set [11]:
the algorithm must be able to decide that testing further attributes will not improve the
predictive accuracy of the decision tree
the algorithm must be able to work with inadequate attributes
In [11] Quinlan suggests the chi-square test for stochastic independence as an implementation of
the first modification:
Let S be a collection of objects belonging to one of two classes N and P, containing n and p
objects of each class respectively, and let A be an attribute with v values that produces subsets
{S1, S2, …, Sv} of S, where Si contains pi and ni objects of classes P and N, respectively. If the
value of A is irrelevant to the class of an object in S (if the values of A for these objects are just
noise, they would be expected to be unrelated to the objects' classes), the expected value p'i of pi
should be

p'i = p · (pi + ni) / (p + n)

If n'i is the corresponding expected value of ni, the statistic

Σi=1..v ( (pi − p'i)² / p'i + (ni − n'i)² / n'i )

is approximately chi-square with v − 1 degrees of freedom. This statistic can be used to determine
the confidence with which one can reject the hypothesis that A is independent of the class of the
objects in S [11].
The tree-building procedure can then be modified to prevent testing any attribute whose
irrelevance cannot be rejected with a very high (e.g., 99%) confidence level. One difficulty with
the chi-square test is that it is unreliable for very small values of the expectations p'i and n'i, so
the common practice is to use it only when all expectations are at least 4 [12].
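A sketch of the test statistic (function and variable names are my own; the example evaluates the outlook attribute of Table 1, where class P plays the role of p and class N of n; note that several expectations here are below 4, so in practice the test would be used with caution):

```python
def chi_square_statistic(blocks, p_total, n_total):
    # blocks: one (p_i, n_i) pair of class counts per attribute value.
    # Expected counts under independence: p'_i = p * (p_i + n_i) / (p + n),
    # and analogously n'_i; the statistic is approximately chi-square
    # with v - 1 degrees of freedom.
    total = p_total + n_total
    stat = 0.0
    for p_i, n_i in blocks:
        size = p_i + n_i
        exp_p = p_total * size / total
        exp_n = n_total * size / total
        stat += (p_i - exp_p) ** 2 / exp_p + (n_i - exp_n) ** 2 / exp_n
    return stat

# Outlook from Table 1: sunny (2 P, 3 N), overcast (4 P, 0 N), rain (3 P, 2 N)
print(round(chi_square_statistic([(2, 3), (4, 0), (3, 2)], p_total=9, n_total=5), 2))  # -> 3.55
```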
The second modification should enable the algorithm to cope with inadequate attributes. Quinlan
[12] suggests two possibilities:
The notion of class could be generalized to a continuous value lying between 0 and 1: if the
subset of objects at a leaf contained p examples belonging to class P and n examples
belonging to class N, the choice for c would be

c = p / (p + n)

In this case a class value of 0.8 would be interpreted as 'belonging to class P with probability
0.8'. The classification error is defined as:
if the object is really of class N: c − 0;
if the object is really of class P: 1 − c.
This method is called the probability method.
A voting model could be established: assign all objects to the more numerous class at the
leaf.
This method is called the majority method.
It can be verified that the first method minimizes the sum of the squares of the classification
errors, while the second minimizes the sum of absolute errors over the objects in S. If the goal is
to minimize the expected error, the second approach seems more suitable, and the empirical
results shown in [12] confirm this.
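A small numeric illustration of the two methods (the counts are hypothetical, not from [12]):

```python
# At a leaf with p objects of class P and n of class N (hypothetical counts):
p, n = 8, 2

# Probability method: assign the continuous class value c = p / (p + n).
c = p / (p + n)                               # 0.8
sum_squared = p * (1 - c) ** 2 + n * c ** 2   # sum of squared errors
sum_abs_prob = p * (1 - c) + n * c            # sum of absolute errors

# Majority method: assign every object to the more numerous class (here P).
sum_abs_majority = n * 1                      # only the n minority objects err

print(sum_squared, sum_abs_prob, sum_abs_majority)
```

Here the majority assignment gives the smaller sum of absolute errors (2 versus 3.2), matching the claim above.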
The two suggested modifications were tested on various data sets, with different noise levels
affecting different attributes or the class information; the results are shown in [12]. Quite different
forms of degradation are observed:
Destroying class information produces a linear increase in error, reaching 50% error at the
100% noise level, which means that objects are classified randomly.
Noise in a single attribute doesn't have a dramatic effect, and its impact appears to be
directly proportional to the importance of the attribute. The importance of an attribute can
be defined as the average classification error produced if the attribute is deleted altogether
from the data.
Noise in all attributes together leads to a relatively rapid increase in error, which generally
reaches a peak and then declines. The appearance of the peak is explained in [11].
These experiments led to a very interesting and unexpected observation given in [11]: for higher
noise levels, the performance of the correct decision tree on corrupted data was found to be
inferior to that of an imperfect decision tree formed from data corrupted to a similar level.
These observations suggest some basic tactics for dealing with noisy data:
It is important to eliminate noise affecting the class membership of the objects in the
training set.
It is worthwhile to exclude noisy, less important attributes; the payoff in noise reduction
increases with the importance of the attribute.
The training set should reflect the noise distribution and level expected when the induced
decision tree is used in practice.
The majority method of assigning classes to leaves is preferable to the probability
method.
Conclusion
The methods employed to cope with noise in decision tree induction are mostly based on
empirical results. Although they clearly improve decision trees in terms of simplicity and
accuracy, there is no formal theory to support them. This implies that laying a theoretical
foundation should be a necessity in the future.
4. Pruning
In noisy domains, pruning methods are employed to cut back a full-size tree to a smaller one that
is likely to give better classification performance. Decision trees generated from the examples in
the training set are generally overfitted to them and therefore fail to classify unseen examples
from a test set accurately.
Techniques used to prune the original tree usually consist of the following steps [15]:
generate a set of pruned trees
estimate the performance of each of these trees
select the best tree.
One of the major issues is which data set to use to test the performance of the pruned trees. The
ideal situation would be to have the complete set of test examples; only then could we make the
optimal tree selection. In practice this is not possible, and it is approximated with a very large,
independent test set, if one is available.
The real problem arises when such a test set is not available. Then the same set used for building
the decision tree has to be used to estimate the accuracy of the pruned trees. Resampling methods,
such as cross-validation, are the principal technique used in these situations. In f-fold cross-
validation the training set S is divided into f blocks of roughly equal size and class distribution;
then, for each block in turn, a classifier is constructed from the cases in the remaining blocks and
tested on the cases in the held-out block. The error rate of the classifier produced from all the
cases is estimated as the ratio of the total number of errors on the held-out cases to the total
number of cases. The average error rate from distinct cross-validations is then a relatively reliable
estimate of the error rate of the single classifier produced from all the cases. 10-fold cross-
validation has proven to be very reliable and is widely used for many different learning models.
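The procedure can be sketched as follows (the function names and the toy majority-class "classifier" are my own; `train_and_test` is assumed to build a classifier on its first argument and return its error count on the second):

```python
import random
from collections import Counter

def cross_validation_error(examples, train_and_test, f=10, seed=0):
    # Divide the examples into f blocks; for each block in turn, build a
    # classifier on the remaining blocks and count its errors on the
    # held-out block.  Return total errors / total number of cases.
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    blocks = [shuffled[i::f] for i in range(f)]
    errors = 0
    for i, held_out in enumerate(blocks):
        train = [e for j, b in enumerate(blocks) if j != i for e in b]
        errors += train_and_test(train, held_out)
    return errors / len(examples)

def majority_errors(train, test):
    # Stand-in "classifier": always predict the training set's majority class.
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(1 for _, label in test if label != majority)

# 30 cases, 20 of class P and 10 of class N
data = [((i,), "P" if i % 3 else "N") for i in range(30)]
print(cross_validation_error(data, majority_errors, f=10))
```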
Quinlan in [13] describes three techniques for pruning:
cost-complexity pruning,
reduced error pruning, and
pessimistic pruning.
Cost-Complexity Pruning
This technique was initially described in [2]. It consists of two stages:
First, a sequence of trees T0, T1, …, Tk is generated, where T0 is the original decision tree
and each Ti+1 is obtained by replacing one or more subtrees of Ti with leaves, until the final
tree Tk is just a leaf.
Then, each tree in the sequence is evaluated and one of them is selected as the final pruned
tree.
A cost-complexity measure is used for the evaluation of a pruned tree T:
If N is the total number of examples classified by T, E is the number of misclassified ones, and
L(T) is the number of leaves in T, then the cost-complexity is defined as the sum

E/N + α · L(T)

where α is some parameter. Now, suppose that we replace some subtree T* of tree T with the best
possible leaf. In general, the pruned tree would have M more misclassified examples and
L(T*) − 1 fewer leaves. The two trees would have the same cost-complexity if

α = M / (N · (L(T*) − 1))

To produce Ti+1 from Ti, each non-leaf subtree of Ti is examined to find the one with the
minimum value of α. The one or more subtrees with that value of α are then replaced by their
respective best leaves.
For the second stage of pruning, an independent test set containing N′ examples is used to test the
accuracy of the pruned trees. If E′ is the minimum number of errors observed with any Ti, the
standard error of E′ is given by

se(E′) = √( E′ · (N′ − E′) / N′ )

and the tree selected is the smallest one whose number of errors does not exceed E′ + se(E′).
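Both stages reduce to simple arithmetic. The sketch below (names and error counts are hypothetical) computes the critical α value and performs the second-stage selection:

```python
from math import sqrt

def critical_alpha(M, N, leaves_removed):
    # alpha at which the original tree and the tree pruned at T* have equal
    # cost-complexity: alpha = M / (N * (L(T*) - 1)).
    return M / (N * leaves_removed)

def select_pruned_tree(trees, n_test):
    # Second stage: trees is a list of (num_leaves, errors) pairs, errors
    # measured on an independent test set of N' = n_test examples.  Pick
    # the smallest tree within one standard error of the best error count.
    best = min(errors for _, errors in trees)
    threshold = best + sqrt(best * (n_test - best) / n_test)
    return min(leaves for leaves, errors in trees if errors <= threshold)

# Hypothetical pruning sequence: (leaves, errors on a 100-case test set)
sequence = [(15, 8), (9, 9), (5, 10), (1, 30)]
print(select_pruned_tree(sequence, 100))  # -> 5
```

With E′ = 8 and se(E′) ≈ 2.7, every tree with at most about 10.7 errors qualifies, and the smallest of them has 5 leaves.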
Reduced Error Pruning
This technique is probably the simplest and most intuitive one for finding small pruned trees of
high accuracy. First, the original tree T is used to classify an independent test set. Then, for every
non-leaf subtree T* of T, we examine the change in misclassifications over the test set that would
occur if T* were replaced by the best possible leaf. If the replacement does not increase the
number of errors and T* contains no subtree with the same property, T* is replaced by the leaf.
The process continues until any further replacement would increase the number of errors over the
test set.
Pessimistic Pruning
This technique does not require a separate test set. If a decision tree T was generated from a
training set with N examples and then tested on the same set, we can assume that at some leaf of
T there are K classified examples, of which J are misclassified. The ratio J/K does not provide a
reliable estimate of the error rate of that leaf when unseen objects are classified, since the tree T
has been tailored to the training set. Instead, we can use a more realistic measure known as the
continuity correction for the binomial distribution, in which J is replaced by J + 1/2.
Now, consider some subtree T* of T, containing L(T*) leaves and classifying K examples
(summed over all leaves of T*), with J of them misclassified. According to the above measure, it
will misclassify J + L(T*)/2 unseen cases. If T* is replaced with the best leaf, which misclassifies
E examples from the training set, the new pruned tree is accepted whenever E + 1/2 is within one
standard error of J + L(T*)/2 (the standard error being defined as in cost-complexity pruning).
All non-leaf subtrees are examined just once to see whether they should be pruned; once a subtree
is pruned, its own subtrees are not examined further. This strategy makes the algorithm much
faster than the previous two.
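The acceptance condition can be sketched as follows (the function name and the numbers are my own illustration; the standard-error form follows the cost-complexity definition above, applied to the corrected count):

```python
from math import sqrt

def accept_pruning(J, K, L, E):
    # J of the K training cases reaching subtree T* are misclassified over
    # its L leaves; the best replacement leaf would misclassify E of them.
    # Continuity-corrected estimate for the subtree: J + L/2; accept the
    # replacement if E + 1/2 is within one standard error of it.
    corrected = J + L / 2
    se = sqrt(corrected * (K - corrected) / K)
    return E + 0.5 <= corrected + se

# Hypothetical subtree: 100 cases, 6 misclassified, 5 leaves; the best
# leaf would misclassify 10 of the same 100 cases.
print(accept_pruning(J=6, K=100, L=5, E=10))  # -> True
```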
Quinlan compares these three techniques on 6 different domains with both real-world and
synthetic data. The general observation is that the simplified trees are of superior or equivalent
accuracy compared to the originals, so pruning is beneficial on both counts. Cost-complexity
pruning tends to produce smaller decision trees than either reduced error or pessimistic pruning,
but they are less accurate than the trees produced by the other two techniques, which suggests
that cost-complexity pruning may be overpruning. On the other hand, reduced error pruning and
pessimistic pruning produce trees of very similar accuracy; but since the latter uses only the
training set for pruning and is more efficient than the former, it can be pronounced the optimal
technique among the three.
OPT Algorithm
While the previously described techniques are used to prune decision trees generated from noisy
data, Bratko and Bohanec in [1] introduce the OPT algorithm for pruning accurate decision trees.
The problem they aim to solve is [1]: given a completely accurate, but complex, definition of a
concept, simplify the definition, possibly at the expense of accuracy, so that the simplified
definition still corresponds to the concept well in general, but may be inaccurate in some details.
So, while the previously mentioned techniques were designed to improve tree accuracy, this one
is designed to reduce the size of a tree that would otherwise be impractical to communicate to and
be understood by the user.
Bratko and Bohanec's approach is somewhat similar to the previous pruning algorithms: they
construct a sequence of pruned trees and then select the smallest tree that satisfies some required
accuracy. However, the sequence they construct is denser with respect to the number of leaves.
The sequence T0, T1, ..., Tn is constructed such that:
1. n = L(T0) − 1,
2. the trees in the sequence decrease in size by one leaf, i.e., L(Ti) = L(T0) − i for i = 0, 1, ..., n
(unless there is no pruned tree of the corresponding size), and
3. each Ti has the highest accuracy among all pruned trees of T0 of the same size.
This sequence is called the optimal pruning sequence and was initially suggested by Breiman et
al. [2]. To construct it efficiently, in quadratic (polynomial) time with respect to the number of
leaves of T0, they use dynamic programming. The construction is recursive, in that each subtree
of T0 is again a decision tree with its own optimal pruning sequence. The algorithm starts by
constructing the sequences that correspond to small subtrees near the leaves of T0. These are then
combined, yielding sequences that correspond to larger and larger subtrees of T0, until the
optimal pruning sequence for T0 itself is finally constructed.
The main advantage of the OPT algorithm is the density of its optimal pruning sequence, which
always contains an optimal tree. The sequences produced by cost-complexity pruning or reduced
error pruning are sparse and can therefore miss some optimal solutions.
One interesting observation derived from the experiments conducted by Bratko and Bohanec is
that for real-world data a relatively high accuracy was achieved with relatively small pruned
trees, regardless of the pruning technique used, while this was not the case with synthetic data.
This is further evidence of the usefulness of pruning, especially in real-world domains.
Conclusion
Whether we want to improve the classification accuracy of decision trees generated from noisy
data, or to simplify accurate but complex decision trees to make them more intelligible to human
experts, pruning has proved to be very successful. Recent papers [7], [10] suggest there is still
some room left for improving the basic and most commonly used techniques described in this
section.
5. Summary
The selection criterion is probably the most important factor determining the behavior of a top-
down decision tree generation algorithm. If it selects the attributes most relevant to the class
information near the root of the tree, then any pruning technique can successfully cut off the
branches of class-independent and/or noisy attributes, because they will appear near the leaves of
the tree. Thus an intelligent selection method, able to recognize the attributes most important for
classification, will initially generate simpler trees and will additionally ease the job of the pruning
algorithm.
The main problem in this domain seems to be the lack of a theoretical foundation: many
techniques are still used because of their acceptable empirical evaluation, not because they have
been formally proven superior. The development of a formal theory of decision tree induction is
necessary for a better understanding of the domain and for further improvement of decision trees'
classification accuracy, especially on noisy, incomplete, real-world data.
6. References
[1] Bratko, I. & Bohanec, M. (1994). Trading accuracy for simplicity in decision trees, Machine
Learning 15, 223-250.
[2] Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984). Classification and
Regression Trees. Monterey, CA: Wadsworth & Brooks.
[3] Breiman, L., (1996).Technical note: Some properties of splitting criteria, Machine Learning
24, 41-47.
[4] Fayyad, U.M. (1991). On the induction of decision trees for multiple concept learning, PhD
dissertation, EECS Department, The University of Michigan.
[5] Fayyad, U.M. & Irani, K.B. (1993). The attribute selection problem in decision tree
generation, Proceedings of the 10th National Conference on AI, AAAI-92, 104-110, MIT Press.
[6] Lopez de Mantaras, R. (1991). A distance-based attribute selection measure for decision tree
induction, Machine Learning 6, 81-92.
[7] Kearns, M. & Mansour, Y. (1998). A fast, bottom-up decision tree pruning algorithm with
near optimal generalization, Submitted.
[8] Kononenko, I. , Bratko, I., Roskar, R. (1984). Experiments in automatic learning of medical
diagnosis rules, Technical Report, Faculty of Electrical Engineering, E.Kardelj University,
Ljubljana.
[9] Mingers, J. (1989). An empirical comparison of selection measures for decision-tree
induction, Machine Learning 3, 319-342.
[10] Schapire, R.E. & Helmbold, D.P. (1995). Predicting nearly as well as the best pruning of a
decision tree, Proceedings of the 8th Annual Conference on Computational Learning Theory,
ACM Press, 61-68.
[11] Quinlan, J.R. (1986). Induction of decision trees, Machine Learning 1, 81-106.
[12] Quinlan, J.R. (1986). The effect of noise on concept learning, in Machine Learning: An
Artificial Intelligence Approach, Morgan Kaufmann: San Mateo, CA, 148-166.
[13] Quinlan, J.R. (1987). Simplifying decision trees, International Journal of Man-Machine
Studies, 27, 221-234.
[14] Quinlan, J.R. (1988). Decision trees and multi-valued attributes, Machine Intelligence 11,
305-318.
[15] Weiss, S.M. & Indurkhya, N. (1994). Small sample decision tree pruning, Proceedings of
the 11th International Conference on Machine Learning, Morgan Kaufmann, 335-342.
[16] White, A.P. & Liu, W.Z. (1994). The importance of attribute selection measures in decision
tree induction, Machine Learning 15, 25-41.
[17] White, A.P. & Liu, W.Z. (1994). Technical note: Bias in information-based measures in
decision tree induction, Machine Learning 15, 321-329.