
Pergamon

0952-1976(95)00022-4

Engng Applic. Artif. Intell. Vol. 8, No. 4, pp. 391-399, 1995. Copyright © 1995 Elsevier Science Ltd

Printed in Great Britain. All rights reserved. 0952-1976/95 $9.50 + 0.00

Contributed Paper

Automated Extraction of Attribute Hierarchies for an Improved Decision-tree Classifier

SHINICHI NAKASUKA University of Tokyo, Japan

TAKAZO KOISHI IBM Japan

(Received June 1993; in revised form March 1995)

A method to automatically extract inherent hierarchies of attribute values from data is proposed to improve the performance of a decision-tree classifier. When attributes used for a decision tree take continuous values, simple decision rules such as "IF a certain attribute is less than a certain value" yield good results in many cases. The rationale of this type of rule is the natural tendency that two nearby attribute values have a "similar meaning" in the sense that they suggest the same class with high probability. When only discrete-type, unordered attributes are available for tree generation, however, such efficient decision rules are hard to obtain because "distance relationships" between the attribute values are seldom known beforehand. In order to solve this problem, the proposed method estimates from the training data set a conceptual distance between each pair of attribute values, and by iteratively grouping two attribute values with minimum distance, generates hierarchies of attribute values, which are then utilized for making the decision tree. This method is applied to the task of fault diagnosis of a certain printed circuit board, and it is indicated that the generated attribute hierarchies can reduce the size of a decision tree sufficiently, which results in a significant improvement of its classification accuracy.

Keywords: Decision tree classifiers, conceptual distance, generalization, knowledge acquisition, knowledge-bases, fault diagnosis.

1. INTRODUCTION

Decision tree classifiers have been applied to many fields to solve pattern-recognition and concept-learning problems.1-10 In this classification schema, with a set of training data having attribute (or feature) values and a class name, a decision tree is generated by iteratively dividing the training data set by a sequence of decision rules until each of the divided data groups can represent a certain class. Once a decision tree has been generated, data of an unknown class can be classified into one of the classes by following the decision tree from the root node to one of the leaf nodes. The literature indicates that the main advantage of a tree classifier over a single-stage classifier is that only the attributes pertinent for classification at each decision stage are used, which results in an improvement of classification accuracy, more efficient use of attributes, and a reduction of the attribute dimensions needed for a single decision. Though an optimal decision sequence can seldom be realized in tree classifiers because of combinatorial problems, it has been reported that, due to the above advantages, even a suboptimal tree classifier can yield better results than a single-stage classifier. Applications are found in such fields as white blood cell classification,1 sound data classification,2 Chinese character recognition4,5 and the classification of remote sensing data.6 In these applications, the attributes are quantities such as the densities of certain components, coefficients of a certain transformation result, or intensities of certain wavelengths, which all take continuous values.

Correspondence should be sent to: Dr S. Nakasuka, Department of Aeronautics and Astronautics, University of Tokyo, Hongo 7, Bunkyo-ku, Tokyo 113, Japan.

Fig. 1. An example of an attribute hierarchy in Ref. 12.

Decision trees have also been applied where the attributes take discrete values. Quinlan7-10 described a program named ID3 which can deal with various classification problems using a decision tree with discrete-type attributes. In the example of applying it to chess endgame classification, several attributes describing the disposition of the chessmen, such as "the distance from the black King to the Knight (with values '1', '2' and 'more than 2')", are utilized to generate a decision tree which can tell whether a particular chess position in the endgame is lost for a certain side in a fixed number of plays.7 In Ref. 10, a simpler example was given, in which a typical representation of "Saturday Morning" is obtained by ID3 in the form of a tree classifier using four simple attributes taking the following values.

outlook = {sunny, overcast, rain}
temperature = {cool, mild, hot}
humidity = {high, normal}
windy = {true, false}

ID3 was modified further to deal with continuous-type attributes, and its application to the diagnosis of hyperthyroid conditions, where both discrete-type and continuous-type attributes exist, was reported.11

The most important task in the process of generating a decision tree is the selection of a decision rule at each node of the tree. When using continuous-type attributes, typical decision rules have the form "IF a certain attribute is less (greater) than a certain value" or "IF the distance between the attribute vector and a certain vector is less (greater) than a certain value". The rationale of using this type of decision rule is the natural tendency that attribute values which are close to one another will have a high probability of coming from data of the same class, so it is better to gather the data with nearby attribute values into the same group.

Discrete-type, unordered attributes, however, do not generally have this favorable feature, because in most cases no such "distance relationships" between the attribute values are available. For example, in the problem of fault diagnosis of a printed circuit board discussed in this paper, the attributes are the outcomes of certain electrical tests, for example a return code taking values of {00FF, 00DF, DFFD, 0020, 5FFD, ...}, and it is impossible to tell whether 00FF is nearer to 00DF than to 0020. As a consequence, the decision rule must inevitably have the form "IF a certain attribute is A or not", which yields binary branches, or "IF a certain attribute is Ai (i = 1 - n)", which yields as many branches as the number of attribute values. In the former case, the decision will be quite inefficient because only one value out of many attribute values can be separated at a time, and in the latter case, so many branches are generated from each node that the decision tree is likely to be of enormous size. Quinlan10 modified this decision rule so that "A" in the rule "IF a certain attribute is A" can take a subset S of the attribute values. He added, however, that the choice of S requires quite a large computation, because an attribute with v values has 2^(v-1) - 1 different ways of specifying the distinguished subset of attribute values. For the problem of fault diagnosis considered here, the averaged number of attribute values is about 16, which makes this strategy impractical.

In order to solve this problem, it would be a good strategy to prepare a hierarchy of the attribute values, such as that in Fig. 1, before making the decision tree. If such a hierarchy is known for the above-mentioned attribute of "return code", e.g. as in Fig. 2, in which the values located close together are likely to imply the same class, formulating a decision rule of the type "IF return code is within group 2" can reconcile the above problems of rule inefficiency and combinatorial explosion of the search space. This type of "concept hierarchy" has been utilized for conceptual clustering,12 learning by experimentation,13 or learning from examples.14 Though in these applications the hierarchy can be intuitively defined beforehand by humans, in many real-world problems, including the fault-diagnosis problem at hand, such a hierarchy is not available because little is known beforehand as to the relationships between the attribute values.

This paper proposes a method for automatically generating such hierarchies of the attribute values (called "Attribute Hierarchies" throughout the rest of this paper), using only the given training data. These hierarchies reflect the "conceptual distance" of the attribute values, such that values which have the same nearer ancestor can be judged "nearer" in the sense that they have a similar effect on inferring the class of the data. The algorithm for generating these attribute hierarchies is described in Section 2, followed by the description of the tree generation using the attribute hierarchies. The classification algorithm using the generated decision tree is slightly modified to be applied to the practical fault-diagnosis problem, which will be described in Section 4.

Fig. 2. An example of the attribute hierarchy of "Return code".

The application of this method to the problem of fault diagnosis is described in Section 5. A number of data items with several attributes and class names are obtained from the actual testing site of a certain printed circuit board; these are then used for training and evaluating the decision tree. The experiments using these data indicate that the generated attribute hierarchies can reduce the size of the decision tree sufficiently, which results in a significant improvement of the accuracy of the classification.

2. GENERATION OF AN ATTRIBUTE HIERARCHY

For generating effective attribute hierarchies, the following two problems must be considered.

(1) How to define the "distance of the attribute values".

(2) How to construct attribute hierarchies based on the above distance.

2.1. Definition of "distance of attribute values"

For the first question, there may be many candidate solutions. For example, the Hamming distance, when each value is described in the form of binary data, will be one choice. This distance, however, does not seem to improve the performance of the decision-tree because it has nothing to do with the classification tasks. The true distance between two values must reflect the possibility that the two values indicate the same class, i.e. the distance must be defined so that if the two values are "nearer" in the distance definition sense, two items of data containing these two values have a higher probability of coming from the same class. In order to express this kind of "distance", the following two types of measure have been developed. Throughout this section, these notations are used.

Notations

ai (i = 1 - n): the ith training data (vector)
ai(j) (j = 1 - m): the value of the jth attribute of the ith data
n, m: the numbers of training data and attributes, respectively.

Distance measure 1 (called "DM1" throughout the rest of the paper)

First, search for the pairs of training data which have the same class name and the same attribute values except for one attribute. Let these data be i1 and i2, and the dissimilar attribute be attribute j. Then add one point to the value pair ai1(j) and ai2(j). After all the pairings have been checked, the inverse of the points assigned to each value pair is defined as the distance of the value pair. In this definition, a pair with zero points is considered to have an infinite distance.

Distance measure 2 ("DM2")

First, search for the pairs of training data which have the same class name. Let these data be i1 and i2. Then add one point to all the value pairs ai1(j) and ai2(j) (j = 1 - m) for which ai1(j) ≠ ai2(j). After all the pairings are checked, the inverse of the points assigned to each value pair is defined as the distance of the value pair. In this definition, too, a pair with zero points is considered to have an infinite distance.

In either definition, pairs of data coming from the same class are used for distance measurement. In DM1, if two data are almost identical and only one attribute is different, the dissimilar attribute values are considered to be "near", because the two values are bound to the same class in this data pair. Value pairs having larger points can be said to indicate the same class more frequently, so they are judged as "nearer". In DM2, a more relaxed definition of nearness is employed, but the basic philosophy is the same as in DM1.
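The following minimal sketch (not the authors' code) illustrates the two scoring schemes; the function names dm1_points, dm2_points and distances are illustrative, and each training item is assumed to be a pair (attribute_value_tuple, class_name).

```python
from collections import defaultdict
from itertools import combinations

def dm1_points(data):
    """DM1 scoring: one point to a value pair whenever two same-class items
    are identical except for that single attribute."""
    points = defaultdict(lambda: defaultdict(int))   # attribute index -> {value pair: points}
    for (a1, c1), (a2, c2) in combinations(data, 2):
        if c1 != c2:
            continue
        diffs = [j for j, (v1, v2) in enumerate(zip(a1, a2)) if v1 != v2]
        if len(diffs) == 1:                          # identical except one attribute
            j = diffs[0]
            points[j][frozenset((a1[j], a2[j]))] += 1
    return points

def dm2_points(data):
    """DM2 scoring: one point to every differing value pair of every same-class data pair."""
    points = defaultdict(lambda: defaultdict(int))
    for (a1, c1), (a2, c2) in combinations(data, 2):
        if c1 != c2:
            continue
        for j, (v1, v2) in enumerate(zip(a1, a2)):
            if v1 != v2:
                points[j][frozenset((v1, v2))] += 1
    return points

def distances(points):
    """Distance of a value pair = 1 / points; pairs with no points (infinite
    distance) are simply absent from the result."""
    return {j: {pair: 1.0 / p for pair, p in pairs.items()}
            for j, pairs in points.items()}
```

Applied to the eight-item example of Section 2.2 below, dm1_points assigns three points (distance 1/3) to the pair {00FD, 0020} of the second attribute and one point to each of the other listed pairs, in agreement with Steps 1 and 2 of the algorithm.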

2.2. Algorithm for generation of attribute hierarchy

In order to construct hierarchies of the attribute values, the following iterative algorithm has been developed. During this iteration process, the original training data are transformed, so a copy of the original data should be used for this process.

In order to make the explanation clearer, a simplified example of the fault-diagnosis problem is employed. Let the training data be the following eight items with four attributes, coming from two classes.

Training data

data No.  Attribute 1  Attribute 2  Attribute 3  Attribute 4  class
1         03           00FD         250D         22           U14
2         01           0000         100D         09           U21
3         03           0020         250D         22           U14
4         03           0020         0000         00           U14
5         02           00FD         100D         09           U21
6         01           00FD         100D         09           U21
7         01           0020         100D         09           U21
8         03           00FD         0000         00           U14

Step 1. Check all the pairs of the training data to obtain DM1 (or DM2). If DM1 is used, the value pair {00FD, 0020} of the second attribute, for example, has the distance 1/3 (from the three data pairs {No. 1, No. 3}, {No. 4, No. 8} and {No. 6, No. 7}).

Step 2. List the value pairs having finite distances for each attribute. Value pairs with infinite distance should not be listed. If DM1 is employed, the list will be:



<Attribute 1>: {02, 01}  DM1 = 1
<Attribute 2>: {00FD, 0020}  DM1 = 1/3;  {0000, 00FD}  DM1 = 1;  {0000, 0020}  DM1 = 1
<Attribute 3>: (none)
<Attribute 4>: (none)

Step 3. Define a group of values for each listed pair, and assign an arbitrary group name to each group. If the same value appears more than once in the list (e.g. 00FD, 0020, 0000 of attribute 2), only the least-distance pair (i.e. {00FD, 0020}) for that particular value is picked up. In the example, {02, 01} and {00FD, 0020} are considered as groups, and names (for example, "Gn11" and "Gn21") are assigned to them.

Step 4. As to the values in the defined groups, replace their occurrences in the original training data set with their group names. In the example, "02" and "01" are replaced with "Gn11", and "00FD" and "0020" are replaced with "Gn21". Therefore the training data set will become:

data No.  Attribute 1  Attribute 2  Attribute 3  Attribute 4  class
1         03           Gn21         250D         22           U14
2         Gn11         0000         100D         09           U21
3         03           Gn21         250D         22           U14
4         03           Gn21         0000         00           U14
5         Gn11         Gn21         100D         09           U21
6         Gn11         Gn21         100D         09           U21
7         Gn11         Gn21         100D         09           U21
8         03           Gn21         0000         00           U14

Step 5. The process of Steps 1-4 is iterated. In each iteration step, the group names (e.g. Gn11, Gn21) are dealt with in exactly the same way as the original values (e.g. 03, 0000, 250D). This iteration is terminated when no value pairs with finite distances are found in Step 1.

In this example using DM1, the hierarchies in Fig. 3 are finally obtained. When using DM2 as the distance measure, the same hierarchy can be obtained in this example. Generally, however, the two distance measures generate different hierarchies.

In this example, the value "03" of attribute 1 is separated from the other values in the hierarchical tree. (The same situation is observed for attributes 3 and 4.) This is because "03" appears only in the data from class U14, while the other values appear only in the data from class U21, so there is no way to measure the distance between "03" and the other values. This is a favorable feature, because values coming from different classes can be completely separated in the hierarchy, which prevents useless groupings of unrelated values.
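A runnable sketch of the iterative grouping of Steps 1-5, using DM1, is given below; it is an illustration under the same data-representation assumptions as the previous listing, not the authors' implementation. The dictionary parents records, per attribute, the child-to-group links that constitute the attribute hierarchy.

```python
from collections import defaultdict
from itertools import combinations, count

def dm1_pairs(data, j):
    """Finite-distance value pairs of attribute j under DM1, nearest first."""
    points = defaultdict(int)
    for (a1, c1), (a2, c2) in combinations(data, 2):
        if c1 == c2:
            diffs = [k for k, (v1, v2) in enumerate(zip(a1, a2)) if v1 != v2]
            if diffs == [j]:                         # identical except attribute j
                points[frozenset((a1[j], a2[j]))] += 1
    return sorted(((1.0 / p, pair) for pair, p in points.items()), key=lambda t: t[0])

def build_hierarchies(data, n_attrs):
    """Steps 1-5 of Section 2.2 with DM1; returns one child -> parent-group map per attribute."""
    rows = [list(attrs) + [cls] for attrs, cls in data]   # work on a copy (Step 4 rewrites it)
    parents = [dict() for _ in range(n_attrs)]
    counters = [count(1) for _ in range(n_attrs)]
    while True:
        grouped = False
        for j in range(n_attrs):                     # attributes are handled one after another
            pairs = dm1_pairs([(tuple(r[:-1]), r[-1]) for r in rows], j)   # Steps 1-2
            used = set()
            for _, pair in pairs:
                if pair & used:                      # Step 3: keep only the least-distance
                    continue                         # pair for a value listed more than once
                used |= pair
                group = f"Gn{j + 1}{next(counters[j])}"   # arbitrary group name, e.g. Gn11
                for value in pair:
                    parents[j][value] = group
                for r in rows:                       # Step 4: replace the values by the group name
                    if r[j] in pair:
                        r[j] = group
                grouped = True
        if not grouped:                              # Step 5: stop when no finite pairs remain
            return parents

# The eight-item example of Section 2.2
training = [
    (("03", "00FD", "250D", "22"), "U14"), (("01", "0000", "100D", "09"), "U21"),
    (("03", "0020", "250D", "22"), "U14"), (("03", "0020", "0000", "00"), "U14"),
    (("02", "00FD", "100D", "09"), "U21"), (("01", "00FD", "100D", "09"), "U21"),
    (("01", "0020", "100D", "09"), "U21"), (("03", "00FD", "0000", "00"), "U14"),
]
hier = build_hierarchies(training, 4)
print(hier[0])   # 01 and 02 grouped under Gn11, as in Step 3
print(hier[1])   # 00FD and 0020 grouped under Gn21, which is in turn grouped with 0000
```

Processing the attributes sequentially within each pass is a simplification of the step-by-step description, but it yields the same grouping for the example data: Gn11 over {01, 02}, and Gn21 over {00FD, 0020}, later joined with 0000.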

Fig. 3. Generated attribute hierarchies for the example problem (one hierarchy per attribute).

3. GENERATION OF THE DECISION TREE

After the hierarchies of the attribute values have been constructed, a decision tree is generated from the same original training data in the generally employed way. The division of the training data is iterated until every leaf node contains data of one class, or the data cannot be divided any more. Other stopping conditions or tree-pruning techniques, such as those developed by Quinlan,10 are not employed. In this section, therefore, only the employed schema of the decision rule at each node and the criterion to determine the best decision rule are briefly explained. For further details of the tree-generation algorithm, see Refs 10 and 15, for example.

3.1. Schema of decision rule

The following schema is employed for the decision rule:

IF a(j) = b

where a(j) means the jth attribute and b means an attribute value which is either an original value or a group name defined in the attribute hierarchies. When b is a group name, the outcome of the rule is "yes" if a(j) is a descendant of the group "b" in the attribute hierarchy. In the example of the previous section, the rule

IF a(2) = Gn22

has the same meaning as

IF a(2) ∈ {0020, 00FD, 0000}.

Table 1. Classes, attributes and attribute values in the fault diagnosis problem

                          No. of values   Examples of values
Class                     44              U27, U14, U01, IP, 009, ...
Attributes (7 types):
  Faulty test step        45              0000, 0110, 0130, 2125, ...
  Lamp indication         7               00, ON, 08, 04, 09, 03, 10
  Return code 1           19              0000, 0001, 0009, DFFD, ...
  Return code 2           11              250D, 0000, 0005, 00FF, ...
  Register 1 reading      12              00, 04, 0C, F8, 22, 28, ...
  Register 2 reading      10              00, F8, C0, 80, FF, 82, ...
  Sensor reading          10              0000, A000, 6000, 6120, ...

As the maximum number of groups generated in the attribute hierarchy for an attribute taking v different values is v - 1, the number of variations of "b", i.e. the computational load required in the search for the best rule, is not more than twice that of the case when only original values are allowed in the decision rule.
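Evaluating this rule schema only requires climbing the hierarchy from a value towards the root. The small sketch below is an assumed illustration (the names satisfies and parents_attr2 are not from the paper), using the child-to-parent mapping produced by the hierarchy-generation sketch of Section 2.2.

```python
def satisfies(value, b, parents):
    """True if the attribute value equals b or is a descendant of group b."""
    while value is not None:
        if value == b:
            return True
        value = parents.get(value)                   # climb towards the root of the hierarchy
    return False

# Hierarchy of attribute 2 from the example: Gn21 = {00FD, 0020}, Gn22 = {Gn21, 0000}
parents_attr2 = {"00FD": "Gn21", "0020": "Gn21", "Gn21": "Gn22", "0000": "Gn22"}

assert satisfies("0020", "Gn22", parents_attr2)      # IF a(2) = Gn22 is "yes" for 0020
assert not satisfies("00FF", "Gn22", parents_attr2)  # a value outside the hierarchy fails
```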

3.2. Criterion for deciding the best rule

The same measure as used in ID3,10 known as "information gain", is employed as the criterion to decide the best decision rule. This information-theoretic criterion chooses the decision rule which is expected to gain the maximum information at the time the outcome of the test is obtained.

The "information gain" of a decision rule p at a node q is calculated as follows. Let node q have N training data including Ci data from the ith (i= 1 ~ no) class. Assume that these data are divided by rule p into m groups, with the ]th group containing Nj data. Let the number of data from class i (i= 1 -nc) in the ]th group be nji. Then the Information Gain G(p, q) is defined as

G(p, q) = t(q) - E(p, q)

~ G G l (q)= - E ~log2

i = 1

~ N~ nc n j i n i i

E(p, q)= - N E ~ log2 j=l i=l

The rule p which has the largest G(p, q) is chosen as the decision rule at node q. In the above formula, I(q) means the quantity of information required to classify the data belonging to node q, while E(p, q) describes the quantity of information still required after node q is divided by rule p. Therefore the difference between these values, G(p,q), represents the information gained by dividing node q by rule p.
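As a concrete illustration (a sketch under assumed names, not the authors' code), the information gain of a candidate rule can be computed directly from the class labels at a node and in each branch it produces:

```python
from collections import Counter
from math import log2

def info(labels):
    """I(q): information required to classify the data at a node with these class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(labels, splits):
    """G(p, q) = I(q) - E(p, q), where 'splits' holds the class labels of each
    group produced by candidate rule p."""
    n = len(labels)
    expected = sum(len(group) / n * info(group) for group in splits if group)
    return info(labels) - expected

# Root node of the Section 2.2 example: eight items, four from U14 and four from U21.
labels = ["U14", "U21", "U14", "U14", "U21", "U21", "U21", "U14"]
# The rule "IF a(1) = Gn11" sends the items whose attribute-1 value is 01 or 02 to "yes".
yes_branch = ["U21", "U21", "U21", "U21"]
no_branch = ["U14", "U14", "U14", "U14"]
print(gain(labels, [yes_branch, no_branch]))   # 1.0 bit: this rule separates the two classes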

4. CLASSIFICATION USING THE GENERATED TREES

The generated attribute hierarchies and decision tree are utilized to classify given data with unknown class names. When using the conventional, straightforward classification method, only one candidate for the true class is inferred. In such a problem as fault diagnosis, however, it would be better if several candidates for the failed part could be indicated by the classifier, because the repair engineers on site can take the next action even if the first guess turns out to be wrong. For this objective, the conventional classification algorithm is modified as follows.

Step 1. The decision tree is followed from the root node according to the outcomes of the decision rule at each node until one leaf node is reached. The class assigned to that leaf node is nominated as the first candidate. Assume that the thick line in Fig. 4 is this nominal node sequence. Let this sequence be node(i), i = 1 - ns [node(1) is the root node and node(ns) is the finally reached leaf node].

Step 2. At each followed node [node(i)], the branch which was not selected during Step 1 is intentionally selected, and the usual tree-following process is carried out from the selected branch. For example, at node 2, the wrong branch "No" is intentionally selected and the usual tree-following process is initiated from node 5. Let the class reached in this way by selecting the wrong branch at node(i) be cl(i). Then cl(ns - j), j = 1 - (ns - 1), is considered as the (j + 1)th candidate for the true class.

Using this method, the same number of candidates as the number of followed nodes (ns) is obtained. The candidates can be interpreted as the classes inferred if up to one decision rule along the nominal node sequence is ignored. In assigning priorities in Step 2, a candidate obtained by ignoring a decision rule nearer to the root node has a lower priority. This is because a rule nearer to the root is more reliable, as it is selected using a larger number of data.
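The following sketch shows this candidate-generation procedure under the assumption that the decision tree is stored as binary nodes holding a rule and yes/no children; the Node layout and function names are illustrative, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Node:
    rule: Optional[Callable[[dict], bool]] = None    # decision rule at an internal node
    yes: Optional["Node"] = None
    no: Optional["Node"] = None
    label: Optional[str] = None                      # class name at a leaf

def follow(node, x):
    """Usual tree-following from 'node' down to a leaf."""
    while node.label is None:
        node = node.yes if node.rule(x) else node.no
    return node.label

def classify_with_candidates(root, x):
    """First candidate: the leaf reached normally. Further candidates: the classes
    reached by deliberately taking the wrong branch at one node of the nominal path,
    starting from the deepest node (which gives the higher priority)."""
    path, node = [], root
    while node.label is None:
        taken = node.rule(x)
        path.append((node, taken))
        node = node.yes if taken else node.no
    candidates = [node.label]                        # nominal classification
    for n, taken in reversed(path):                  # deeper nodes first, as in Step 2
        wrong = n.no if taken else n.yes
        candidates.append(follow(wrong, x))
    return candidates
```

The list returned by classify_with_candidates has exactly ns entries, one per followed node, in the priority order described above.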

5. APPLICATION TO A FAULT-DIAGNOSIS PROBLEM

The proposed method for generating attribute hierarchies has been applied to the task of fault diagnosis of a certain printed circuit board. In this task, a failed part on a board had to be identified, using a given set of test results for that board. The attributes are the readings of the various indicators during the test, and the classes correspond to the parts on the board which have failed. 500 items of data, each containing a set of test results and the genuinely failed part, were provided by the actual plant for manufacturing and testing the printed circuit boards. These data were used partly as training data and partly as testing data. Table 1 lists the number of classes and attributes, and some examples of their values. This set of seven attributes carries complete information for locating the failed part; that is, there is no pair of data which have the same attribute values and different class names.


Fig. 4. An example of a decision tree and matching process.

In the experiments, a proportion of the 500 data items was randomly selected as the training data set, and attribute hierarchies and a decision tree were generated using these data. They were then tested on the remaining data to measure the classification accuracy. The accuracy was measured as the probability that the true class is included within the first n candidates, so it is a function of n. This process was iterated five times using different, randomly chosen training sets, and the performance was averaged. Four cases, in which the training data comprise 20, 40, 60 and 80% of the total (500) data, were tested. For comparison, the case without generating the attribute hierarchies was also examined.

Figure 5 shows the comparison of the classification accuracy of the three cases: a case with no attribute hierarchies ("NO"), and cases with attribute hierarchies generated using DM1 and DM2 as the distance measures ("DM1" and "DM2" respectively). Figure 6 shows the number of nodes in the decision tree, and Fig. 7 shows the total number of attribute values used in generating the decision tree. The group names generated in the attribute hierarchies are also included in this number of total attribute values. Figure 8 shows one example of the attribute hierarchy (for "register 1 reading"), generated using DM1, for two cases. From these figures, the following observations can be made.

• When the training data set is as small as 20% of the total data, the generation of an attribute hierarchy has little (or even a slightly adverse) effect on the classification accuracy. This is because, with too few training data, reliable and effective attribute hierarchies cannot be generated. This situation clearly appears in Fig. 8, where only a few attribute values are associated within the attribute hierarchy, as compared with the case of more training data.

• When the number of training data becomes larger, the generation of attribute hierarchies considerably improves the classification accuracy. For example, focusing only on the accuracy of the first candidate, an improvement from 71 to 80% in the case of 40% training data, and from 80 to 87% in the case of 60% training data, is observed when using DM1. The same improvement is also observed for DM2, and for the second and third candidates.

Fig. 5. Comparison of classification accuracy of the three cases ("NO", "DM1", "DM2"): classification accuracy (%) versus training data percentage (%), for the 1st candidate, the 1st and 2nd candidates, and the 1st to 3rd candidates.


Fig. 6. Number of nodes in the decision tree (versus training data percentage).

Fig. 7. Number of total attribute values used for decision-tree generation (versus training data percentage).

As can be inferred from Fig. 6, this improvement is achieved by the significant reduction of the size of the decision tree: the number of nodes is reduced from 133 to 84 for 40% training data, and from 180 to 117 for 60% training data. As is mentioned by Quinlan,10 too large a decision tree in many cases results in bad classification performance. When using attribute hierarchies, such an unfavorable condition can be prevented effectively by allowing the group names, in addition to the original values, to be used in the decision rules.

• As to the two distance measures, DM2 performs on average a little better than DM1, but the difference is generally small. This minor difference seems to come from the difference in the maturity of the generated attribute hierarchies. In DM1, value pairs are assigned points after a stricter evaluation than in DM2, by also taking the other values appearing in the same data into account. In cases where only a few training data are provided, therefore, the strict evaluation of DM1 tends to assign no points to some value pairs, so group formations are less likely to occur than in DM2. This results in "too coarse" hierarchies in the 20% training data case in Fig. 8. Table 2 summarizes the computational time required to generate the attribute hierarchies and the decision trees. The time additionally required for the generation of the attribute hierarchies is quite small compared with the time required for decision-tree generation. When using the attribute hierarchies, the time for decision-tree generation also becomes larger, because the search space for the best decision rule at each node becomes larger by allowing the use of the generated group names. The time required in total, however, does not exceed twice that of the case when attribute hierarchies are not generated. DM2 requires about 20-40% more computation than DM1, which is the result of the use of more attribute values in the tree generation (see Fig. 7) and more nodes generated in the decision tree (see Fig. 6) in the case using DM2.

Fig. 8. Generated attribute hierarchy for "register 1 reading" (two cases: from 20% training data and from 40% training data).

Table 2. Required computational time for generation of attribute hierarchies and a decision tree

Training data               NO      DM1     DM2
20%   Att. hier.            -       29      111
      Dec. tree             550     696     846
      Total                 550     725     957
40%   Att. hier.            -       108     236
      Dec. tree             1532    1982    2681
      Total                 1532    2090    2917
60%   Att. hier.            -       227     372
      Dec. tree             2585    3155    3820
      Total                 2585    3382    4192
80%   Att. hier.            -       375     634
      Dec. tree             4220    4820    5862
      Total                 4220    5195    6496

Values are in seconds; an IBM-5570/T PC is used.

From these observations, it can be said that by allowing "meaningful" groups of attribute values to be used in the decision rules, a significant reduction of the size of the decision tree can be achieved, which results in an improvement of the classification accuracy. In the previously mentioned method of generating a subset of attribute values, described in Ref. 10, the grouping of values is made quite locally, at the time of selection of a decision rule at each node, using the same criterion as for the rule selection. An exhaustive search for the best group must be performed at every node, which requires a heavy computational load. The method described here, on the other hand, first obtains "a guide" for such a grouping from the global perspective, in the form of attribute hierarchies, and then utilizes it uniformly during the selection of a decision rule at each node of the decision tree. The requirement of obtaining efficient decision rules without too much additional computational load can thus be achieved to some extent by the proposed method.

The two proposed distance measures yield comparable performance, except for the case where the training data set is very small, when the performance of DM1 is a little degraded. As discussed earlier, this comes from the low maturity of the attribute hierarchies when using DM1. In order to compensate for this shortcoming, it would be a good strategy to employ DM2 tentatively as the distance measure when DM1 produces no more value pairs in Step 1 of the algorithm described in Section 2.

6. CONCLUSIONS

It is a good strategy to use hierarchies of attribute values for generating a decision tree when the attributes take discrete values, and for the case when such hierarchies cannot be obtained beforehand (which is typical of most real-world problems) a method has been proposed to generate them automatically, using only the training data set. In the process of generating the attribute hierarchies, the conceptual distances between value pairs are measured using the newly developed distance measures. The generated hierarchies then provide an efficient guide for grouping the attribute values to be inserted in the decision rules, which makes it possible to form quite effective binary decision rules without too much additional computation. This method has been applied to a fault-diagnosis problem of a certain printed circuit board, and the results strongly confirm the advantage of using such attribute hierarchies.

REFERENCES

1. Mui J. K. and Fu K. S. Automated classification of nucleated blood cells using a binary tree classifier. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-2, 429-443 (1980).

2. Dattatreya G. R. and Sarma V. V. S. Bayesian and decision tree approaches for pattern recognition including feature measurement costs. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-3, 293-298 (1981).

3. Sethi I. K. and Sarvarayudu G. P. R. Hierarchical classifier design using mutual information. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-4, 441-445 (1982).

4. Gu Y. X., Wang Q. R. and Suen C. Y. Application of a multilayer decision tree in computer recognition of Chinese characters. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-5, 83-89 (1983).

5. Wang Q. R. and Suen C. Y. Large tree classifier with heuristic search and global training. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-9, 91-101 (1987).

6. Argentiero P., Chin R. and Beaudet P. An automated approach to the design of decision tree classifiers. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-4, 51-57 (1982).

7. Quinlan J. R. Learning efficient classification procedures and their application to chess endgames. Machine Learning: An Artificial Intelligence Approach, Vol. 1, pp. 463-482. Morgan Kaufmann, Los Altos (1983).

8. Quinlan J. R. The effect of noise on concept learning. Machine Learning: An Artificial Intelligence Approach, Vol. 2, pp. 149-166. Morgan Kaufmann, Los Altos (1986).

9. Quinlan J. R. Simplifying decision trees. Int. J. Man-Mach. Stud. 27, 221-234 (1987).

10. Quinlan J. R. Induction of decision trees. Mach. Learn. 1, 81-106 (1986).

11. Quinlan J. R. Probabilistic decision trees. Machine Learning: An Artificial Intelligence Approach, Vol. 3, pp. 140-152. Morgan Kaufmann, San Mateo (1990).

12. Kodratoff Y. and Tecuci G. Learning based on conceptual distance. IEEE Trans. Pattern Anal. Mach. Intell. PAMI-10, 897-909 (1988).

13. Mitchell T. M., Utgoff P. E. and Banerji R. Learning by experimentation: acquiring and refining problem-solving heuristics. Machine Learning: An Artificial Intelligence Approach, Vol. 1, pp. 163-190. Morgan Kaufmann, Los Altos (1983).

14. Vrain C. OGUST: a system that learns using domain properties expressed as theorems. Machine Learning: An Artificial Intelligence Approach, Vol. 3, pp. 360-382. Morgan Kaufmann, San Mateo (1990).

15. Nakasuka S. and Yoshida T. Dynamic scheduling system utilizing machine learning as a knowledge acquisition tool. Int. J. Prod. Res. 30, 411-431 (1992).