
Refining Classifiers with Neural Networks
Marcin S. Szczuka*
Institute of Mathematics, Warsaw University, Banacha 2, 02-097 Warsaw, Poland

This article is devoted to the presentation of two approaches to the task of combining classification using decision rules with that of neural networks. The main stress is put on applying the knowledge derived from data with methods coming from the area of rough sets and Boolean reasoning to the construction of a feedforward artificial network architecture. Such an approach allows for both easier construction of a neural network based decision support system and better classification results for previously unseen cases. One more paradigm presented within the paper is the application of a simplified neural network to the task of resolving conflicts that occur in rule-based decision support systems. The approaches presented in the paper are illustrated with examples and results of actual numerical experiments. © 2001 John Wiley & Sons, Inc.

1. INTRODUCTION

The task of supporting classification and decision-making processes is widely recognized as one of the main fields for application of techniques coming from artificial intelligence. Several techniques have been discovered over the years and have proven to be effective in many applications.

The problem that exists at the base of all data-driven classification processes is how to establish the representation of classes of objects that should be treated as distinct. If we have a set of examples taken from a real-life situation, we can only approximate the decision classes as sets in an attribute-value universe that is only partly known to us. The challenge is to approximate them in a way that assures proper behavior of the derived classification, not only for the examples seen so far, but also for those not known yet.

In this work we deal with two specific kinds of classification problems. In the first approach, we assume that the primary data set consists of real (floating point) numbers. We also assume that the set of possible decisions (classes) is finite and quite small.

* e-mail: [email protected]
Contract grant sponsor: Polish State Committee for Scientific Research; contract grant number: 8T11C02412.
Contract grant sponsor: ESPRIT; contract grant number: 20288 CRIT2.




To deal with such problems, we propose a method that is a hybrid of two techniques. We use some intelligent scaling (quantization) techniques described in Refs. 1-3. This method allows us to restructure our examples and to build a set of binary rules (a binary decision tree) that allows us to make decisions. This method returns a partition of the attribute-value space into subsets bounded by hyperplanes.

The next step is to construct an artificial neural network that makes decisions similar to those defined by the decision tree (or tree-based decision rules) we started with. The idea of building neural networks based on prior knowledge has been investigated before (see, e.g., Ref. 4). The algorithms there construct the network with respect to so-called domain knowledge represented as a set of Horn clauses. Our approach is a little bit different because it incorporates both knowledge creation and utilization in neural network construction. Moreover, the set of rules we use is nondivergent. It is worth noticing that a method that eliminates the exhaustive step of searching for a proper network architecture is advantageous.

The second approach, however different, was inspired by achievements in applying neural networks to classification tuning. Because the task of resolving conflicts in rule-based systems is crucial for this kind of decision support, we wanted to check whether the adaptiveness of neural networks would be helpful. The encouraging results from hyperplane-based networks suggested a chance for improving rule-based systems. Therefore, we decided to treat rule outputs as a set of new, higher order attributes. With this approach, we were able to produce decent classification and use fewer rules.

This article begins with a brief introduction to the notation from related fields. Then we present the basic constructs, i.e., hyperplane-based and rule-based classifiers. The next two sections discuss the proposed modifications together with some examples and initial experimental results.

2. BASIC NOTATION

In this work, a lot of terminology and, what is even more important, a lot of methodology comes from rough set theory. Therefore, we begin by introducing in detail some basic notions that will appear in this article. The basic notions we use are information systems, decision tables, indiscernibility, information function, etc. For further reference, see Refs. 5 and 6.

An information system is a pair $\mathbb{A} = (U, A)$, where U is a nonempty, finite set called the universe and A is a nonempty, finite set of attributes, i.e., $a: U \to V_a$ for $a \in A$, where $V_a$ is called the value set of attribute a. Elements of U are called objects.

Every information system $\mathbb{A} = (U, A)$ and a nonempty set $B \subseteq A$ define a B-information function $\mathrm{Inf}_B(u) = (a_i(u) : a_i \in B)$ for $u \in U$ and some linear order $A = \{a_1, \ldots, a_n\}$. The set $\{\mathrm{Inf}_B(u) : u \in U\}$ is called the B-information set and is denoted by $V_B$.

In the case of real-valued attributes, where for each $i \le n$, $a_i: U \to \mathbb{R}$ is a real function on the universe U, its elements can be characterized as points

$$P_u = (a_1(u), a_2(u), \ldots, a_n(u))$$



in n-dimensional affine space $\mathbb{R}^n$. The validity of such a representation is based on the assumption that for the A-indiscernibility relation $\mathrm{IND}(A)$ (which can be associated with any $B \subseteq A$ as well), defined by

$$\mathrm{IND}(B) = \{(u, u') \in U \times U : \mathrm{Inf}_B(u) = \mathrm{Inf}_B(u')\}$$

all equivalence classes are singletons that correspond to particular objects in U.

A decision table is any information system of the form $\mathbb{A} = (U, A \cup \{d\})$, where $d \notin A$ is a distinguished attribute called the decision. The elements of A are called conditions.

We assume that the set $V_d$ of values of the decision d is equal to $\{v_1, \ldots, v_{n_d}\}$ for some positive integer $n_d$ called the range of d. The decision d determines the partition $\{C_1, \ldots, C_{n_d}\}$ of the universe U, where $C_l = \{u \in U : d(u) = v_l\}$ for $1 \le l \le n_d$. The set $C_l$ is called the lth decision class of $\mathbb{A}$.

For a decision table with real-valued conditions, according to the assumption of the discernibility of objects with respect to $\mathrm{IND}(A)$, every object belongs to exactly one of the decision classes $C_1, C_2, \ldots, C_{n_d}$, where

$$C_l = \{u \in U : d(u) = v_l\}$$

The main task is how to approximate these classes by a possibly small and regular family of subsets $X_k \subseteq \mathbb{R}^n$, where any $X_k$ indicates some decision value $v_{l(k)}$, e.g., in terms of its high frequency of occurrence for objects in $X_k$.

Given the indiscernibility relation, we may define the notion of a reduct. $B \subseteq A$ is a reduct of the information system if $\mathrm{IND}(B) = \mathrm{IND}(A)$ and no proper subset of B has this property. Intuitively, a reduct is a minimal set of attributes that preserves the ability to distinguish objects at the same level as the original decision table.
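To make the reduct condition concrete, here is a minimal Python sketch that checks it directly on a toy decision table; the dictionary-based row format, the attribute names, and the helpers `ind_classes` and `is_reduct` are illustrative assumptions, not constructs from the paper.

```python
from itertools import combinations

def ind_classes(rows, attrs):
    """Equivalence classes of IND(attrs): objects grouped by their attrs-information vectors."""
    classes = {}
    for i, row in enumerate(rows):
        classes.setdefault(tuple(row[a] for a in attrs), set()).add(i)
    return set(frozenset(c) for c in classes.values())

def is_reduct(rows, all_attrs, candidate):
    """candidate is a reduct iff IND(candidate) = IND(all_attrs) and dropping any
    single attribute from candidate breaks this equality (minimality)."""
    full = ind_classes(rows, all_attrs)
    if ind_classes(rows, candidate) != full:
        return False
    return all(ind_classes(rows, list(sub)) != full
               for sub in combinations(candidate, len(candidate) - 1))

# Toy table: three objects described by attributes a, b, c.
rows = [{"a": 1, "b": 0, "c": 1},
        {"a": 1, "b": 1, "c": 0},
        {"a": 0, "b": 1, "c": 0}]
print(is_reduct(rows, ["a", "b", "c"], ["a", "b"]))   # True: {a, b} already discerns all objects
```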

A decision rule is a formula $\varphi$ of the form

$$(a_{i_1} = v_1) \wedge \cdots \wedge (a_{i_k} = v_k) \Rightarrow d = v_d \qquad (1)$$

where $1 \le i_1 < \cdots < i_k \le m$ and $v_j \in V_{a_{i_j}}$. The set of all rules for a particular decision table $\mathbb{B} \subseteq \mathbb{A}$ is denoted by $\mathrm{RUL}(\mathbb{B})$. Atomic subformulae $(a_{i_1} = v_1)$ are

called conditions. We say that a rule r is applicable to an object, or, alternatively, that the object matches the rule, if the object's attribute values satisfy the premise of the rule.

With a rule we can associate some characteristics. Support, denoted $\mathrm{Supp}_{\mathbb{A}}(r)$, is equal to the number of objects from $\mathbb{A}$ for which rule r applies correctly, i.e., the premise of the rule is satisfied and the decision given by the rule agrees with the one preset in the decision table. $\mathrm{Match}_{\mathbb{A}}(r)$ is the number of objects in $\mathbb{A}$ for which rule r applies in general. Analogously, the notion of a matching set for a collection of rules may be introduced. By $\mathrm{Match}_{\mathbb{A}}(R, u)$ we denote the subset M of the rule set R such that the rules in M are applicable to the object $u \in U$. A rule is said to be optimal if removal of any of its conditions causes a decrease of its support. Support and matching are also used to define the coefficient of consistency $\mu_{\mathbb{A}}(r)$ for a rule, which is $\mu_{\mathbb{A}}(r) = \mathrm{Supp}_{\mathbb{A}}(r) / \mathrm{Match}_{\mathbb{A}}(r)$.
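The rule characteristics above translate almost literally into code. The sketch below assumes a simple, hypothetical rule representation (a dictionary with a condition map and a decision value) and a table given as a list of attribute-value dictionaries with the decision stored under the key "d"; none of these names come from the original text.

```python
def applies(rule, obj):
    """A rule is applicable to an object if all of its conditions are satisfied."""
    return all(obj[a] == v for a, v in rule["conditions"].items())

def match(rule, table):
    """Match_A(r): the number of objects to which the rule applies."""
    return sum(applies(rule, obj) for obj in table)

def support(rule, table):
    """Supp_A(r): the number of objects to which the rule applies correctly."""
    return sum(applies(rule, obj) and obj["d"] == rule["decision"] for obj in table)

def consistency(rule, table):
    """mu_A(r) = Supp_A(r) / Match_A(r); returns 0 when the rule matches no object."""
    m = match(rule, table)
    return support(rule, table) / m if m else 0.0

# Hypothetical rule: (a1 = 1) and (a3 = 0) => d = 1
rule = {"conditions": {"a1": 1, "a3": 0}, "decision": 1}
```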

In our study, we refer only to the rules that are derived using knowledge contained in the data.



Within that framework, we introduce the meaning of the premise of rule r in the decision table $\mathbb{A}$. The meaning of $\mathrm{Pred}(r)$ is denoted by $\|\mathrm{Pred}(r)\|_{\mathbb{A}}$ and is defined inductively in the following way:

1. If $\mathrm{Pred}(r)$ is of the form $(a = v)$, then $\|\mathrm{Pred}(r)\|_{\mathbb{A}} = \{u \in U : a(u) = v\}$.
2. $\|\mathrm{Pred}(r) \wedge \mathrm{Pred}(r')\|_{\mathbb{A}} = \|\mathrm{Pred}(r)\|_{\mathbb{A}} \cap \|\mathrm{Pred}(r')\|_{\mathbb{A}}$; $\|\mathrm{Pred}(r) \vee \mathrm{Pred}(r')\|_{\mathbb{A}} = \|\mathrm{Pred}(r)\|_{\mathbb{A}} \cup \|\mathrm{Pred}(r')\|_{\mathbb{A}}$; $\|\neg\mathrm{Pred}(r)\|_{\mathbb{A}} = U \setminus \|\mathrm{Pred}(r)\|_{\mathbb{A}}$.

One more key notion in our study is the dynamic rule. If we consider a family F of subsets (subtables) of $\mathbb{A}$ ($F \subseteq P(\mathbb{A})$), then we call a rule $r \in \bigcup_{\mathbb{B} \in F} \mathrm{RUL}(\mathbb{B})$ F-dynamic (usually simply dynamic) if and only if

$$\|\mathrm{Pred}(r)\|_{\mathbb{B}} \neq \emptyset \Rightarrow r \in \mathrm{RUL}(\mathbb{B}) \quad \text{for any } \mathbb{B} \in F \qquad (2)$$

In our further study, we rely on certain numerical characteristics of dynamic rules, one of which is the stability coefficient for the dynamic rule r relative to F, denoted by $SC_{\mathbb{A}}^F$ and defined as

$$SC_{\mathbb{A}}^F = \frac{\mathrm{card}(\{\mathbb{B} \in F : r \in \mathrm{RUL}(\mathbb{B})\})}{\mathrm{card}(\{\mathbb{B} \in F : \|\mathrm{Pred}(r)\|_{\mathbb{B}} \neq \emptyset\})} \qquad (3)$$

This coefficient reflects the frequency of occurrence of a particular rule in the sets of rules generated by subsequent steps of the rule generation algorithm. The more frequent the rule (the higher $SC_{\mathbb{A}}^F$), the better its reliability.
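As a rough illustration of Eq. (3), the fragment below estimates the stability coefficient by counting subtables; it reuses `applies()` from the previous sketch, and `rules_of` is a hypothetical callback standing in for whatever rule-generation procedure produced RUL(B).

```python
def stability_coefficient(rule, subtables, rules_of):
    """SC^F_A(r) from Eq. (3): among the subtables B in F where the premise of r is
    nonempty, the fraction of subtables whose rule set RUL(B) contains r."""
    generated = sum(rule in rules_of(B) for B in subtables)
    nonempty = sum(any(applies(rule, obj) for obj in B) for B in subtables)
    return generated / nonempty if nonempty else 0.0
```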

Further in this article we frequently use terminology that comes from the field of artificial neural networks. Therefore, it is appropriate to briefly explain the meaning of the terms used.

By a neural network, or simply a network, we mean the basic feedforward, multilayer network that consists of neurons that calculate their output by summing weighted inputs and produce an output signal by applying some neuron excitation function. The notions we use are basically similar to those introduced in Refs. 7 and 8. One important remark is that the networks presented subsequently may not be fully connected. We represent the inputs to our networks as input neurons that calculate nothing; they only pass the weighted signals to the first hidden layer of neurons. To calculate neuron outputs, we use one of the following functions (a direct NumPy transcription of them follows the list):

- Bipolar threshold function, given by the formula

$$f(x) = \begin{cases} 1 & \text{for } x \ge 0 \\ -1 & \text{for } x < 0 \end{cases}$$

- Bipolar sigmoidal function, given by the formula

$$f(x) = \frac{2}{1 + e^{-\beta x}} - 1$$



- Simple sigmoid (logistic) function, given by the formula

$$f(x) = \frac{1}{1 + e^{-\beta x}} \qquad (4)$$
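The transcription announced above is a straightforward NumPy version of the three excitation functions, with `beta` standing for the slope parameter that appears in the formulas.

```python
import numpy as np

def bipolar_threshold(x):
    return np.where(np.asarray(x) >= 0, 1.0, -1.0)

def bipolar_sigmoid(x, beta=1.0):
    return 2.0 / (1.0 + np.exp(-beta * np.asarray(x))) - 1.0

def logistic(x, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * np.asarray(x)))
```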

3. HYPERPLANE-BASED CLASSIFICATION

In Refs. 1-4, the search for decision class approximations was performed by defining hyperplanes over $\mathbb{R}^n$. Any hyperplane

$$H = \{(x_1, x_2, \ldots, x_n) \in \mathbb{R}^n : \alpha_0 + \alpha_1 x_1 + \cdots + \alpha_n x_n = 0\}$$

where $\alpha_0, \alpha_1, \alpha_2, \ldots, \alpha_n \in \mathbb{R}$, splits decision class $C_l$ into two subclasses defined by

$$C_l^{U,H} = \{u \in C_l : H(u) \ge 0\}$$

and

$$C_l^{L,H} = \{u \in C_l : H(u) < 0\}$$

where, for a given hyperplane, the function $H: U \to \mathbb{R}$ is defined by $H(u) = H(\mathrm{Inf}_A(u))$.

Based on the chosen hyperplanes, we can establish decision rules that classify objects according to their positions with respect to the hyperplanes. Such a rule is formulated as a conjunction of binary hyperplane attributes and indicates a specific decision value. The value of the hyperplane attribute $H_l(u)$ for a specific point u in U is equal to 1 iff $u \in C_l^{U,H}$ and to 0 otherwise.

Let us introduce an exemplary measure to estimate the quality of hyperplanes with respect to the decision classes $C_1, C_2, \ldots, C_{n_d}$. Consider the function

$$\mathrm{award}(H) = \sum_{l_1 \neq l_2} \mathrm{card}(C_{l_1}^{U,H}) \cdot \mathrm{card}(C_{l_2}^{L,H}) \qquad (5)$$

If $\mathrm{award}(H) > \mathrm{award}(H')$ for some hyperplanes H, H', then the number of pairs of objects from different decision classes discerned by H is greater than the corresponding number for H'. Thus, this H should be preferred when building decision rules. If H does not discern enough pairs of objects, then we can search for subsequent hyperplanes until a satisfactory degree of decision class approximation is obtained. By $n_h$ we denote the number of hyperplanes found by such an algorithm.
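A small sketch of the quality measure (5), written under the assumption that objects are given as rows of a NumPy array of attribute values and decisions as an array of labels; the function and argument names are chosen for this example only.

```python
import numpy as np

def award(hyperplane, points, labels):
    """award(H) from Eq. (5): for every ordered pair of distinct decision classes, count
    objects of the first class in the upper half-space times objects of the second class
    in the lower one. `hyperplane` is the coefficient vector (alpha_0, alpha_1, ..., alpha_n)."""
    points, labels = np.asarray(points, dtype=float), np.asarray(labels)
    alpha0, alphas = hyperplane[0], np.asarray(hyperplane[1:], dtype=float)
    upper = (alpha0 + points @ alphas) >= 0
    total = 0
    for l1 in np.unique(labels):
        for l2 in np.unique(labels):
            if l1 != l2:
                total += np.sum(upper & (labels == l1)) * np.sum(~upper & (labels == l2))
    return int(total)
```

A greedy search of the kind described in the text would keep the hyperplane with the highest award, add it to the current set, and repeat until enough pairs of objects are discerned.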

The number of decision rules (equal to $2^{n_h}$ due to all possible combinations of positions of objects with respect to the $n_h$ hyperplanes) can be reduced to the number $n_r \le 2^{n_h}$ of minimal decision rules of the form $\varphi_k \Rightarrow d = v_{l(k)}$, where no component of $\varphi_k$ that corresponds to a hyperplane $H_j$ can be rejected without decreasing a given approximation degree.



4. RULE-BASED DECISION SYSTEMS

4.1. The Principle

Among others, we may use decision (classification) support systems based on rules derived from data. There are several approaches to generating such rules. They differ in the way the rules are generated as well as in the form of rule representation and use. Nevertheless, all the approaches have some common, basic questions to answer. Probably the most important question when classifying new, unseen objects concerns the trustworthiness of a rule or group of rules. Depending on the approach, there may be several issues to resolve while deciding how a newcomer object should be classified.

Given a set of decision rules $R = (r_1, \ldots, r_m)$ derived from data by some method and a new object $o_i$, we may face several problems while trying to decide trustworthiness; namely:

1. There may be no rule in R that is applicable to $o_i$. In other words, the values of conditional attributes of $o_i$ do not satisfy the conditions of any rule in R. In such a case, we cannot make a decision because there is no knowledge within our rule set that covers the case of $o_i$.

2. There are several rules in R that are applicable to $o_i$, but they give contradictory outputs. This situation, known as a conflict between rules, must be resolved by applying procedures that measure the confidence of particular rules (or groups of them).

There are a number of possible solutions to these two problems. Usually, to resolve the problem of nonapplicability of rules, one of three methods is applied:

- The object is assigned a default decision value according to preset assumptions.

- The rule that has the best applicability (according to a given criterion) is chosen and the decision is determined by this rule. The applicability criterion may be based, e.g., on the number of conditions in the rule that are satisfied by the object. Another such criterion may be induced by preferences among decision values, as in the case of an ordered decision domain.

- The "don't know" signal is returned to the user.

Of course, rule-based decision systems are usually built in a manner that avoids failure to recognize new objects. Still, the actual accuracy depends on the quality of the derived rules.

The matter of resolving conflicts between rules may be even more complicated, especially in cases when we have large numbers of them and no external, additional information about their applicability and importance. To cope with this problem, several techniques may be applied. Presenting all of them here is hardly possible, but we discuss some of them next.

The most popular way to establish a final decision is based on a comparison of the number of rules from different decision classes that are applicable to a given object. The object class assignment is determined by the majority of rules



it fulfills (in comparison with other classes). This method, however, causes a unification of rule importance, which may be a serious weakness. To avoid this unification, weights can be assigned to rules (or groups of them):

$$W_{BSS}(M, o) = \begin{cases} \dfrac{\sum_{r \in \mathrm{Match}_{\mathbb{A}}(M, o)} \mathrm{Supp}_{\mathbb{A}}(r) \cdot SC_{\mathbb{A}}(r)}{\sum_{r \in M} \mathrm{Supp}_{\mathbb{A}}(r) \cdot SC_{\mathbb{A}}(r)} & \text{if } \sum_{r \in M} \mathrm{Supp}_{\mathbb{A}}(r) \cdot SC_{\mathbb{A}}(r) \neq 0 \\[2ex] 0 & \text{otherwise} \end{cases} \qquad (6)$$

where $SC_{\mathbb{A}}(r)$ is the stability coefficient (3), determined during the process of rule calculation using dynamic reducts (see Refs. 9 and 10 for a detailed explanation). To give some intuition about $SC_{\mathbb{A}}(r)$, it is worth knowing that it mainly depends on the frequency of occurrence of rule r in the sets of optimal rules at subsequent steps of the dynamic algorithm for rule generation (see Ref. 10). We use this method because numerous experiments (see Refs. 9-11) prove that it is, on average, the best available.
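For intuition, the following fragment resolves a conflict by computing a weight of the form (6) for each decision class and picking the class with the largest weight; it reuses `applies()` and `support()` from the sketch in Section 2, and `sc` is a hypothetical mapping from a rule to its stability coefficient.

```python
def class_weight(rules_for_class, obj, table, sc):
    """A weight in the spirit of Eq. (6) for the group M of rules voting for one class:
    the Supp*SC mass of rules in M that match obj, normalized by the Supp*SC mass of all
    rules in M (0 when that mass is 0)."""
    total = sum(support(r, table) * sc(r) for r in rules_for_class)
    if total == 0:
        return 0.0
    matched = sum(support(r, table) * sc(r) for r in rules_for_class if applies(r, obj))
    return matched / total

def classify(obj, rules_by_class, table, sc):
    """Conflict resolution: choose the decision class with the largest weight."""
    return max(rules_by_class,
               key=lambda c: class_weight(rules_by_class[c], obj, table, sc))
```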

4.2. Rough Set Rule Induction

The process of creating rules using rough set techniques is essential for our ideas of classifier construction. Therefore, some basic information about methods for rule induction is needed. The basis for deriving rules is reduct calculation. Numerous practical experiments show that there usually is a need to calculate several reducts to get satisfactory classification quality. Most of the cases that involve large sets of data require reducts and rules to be calculated using dynamic techniques. From a technical point of view, the process of calculating the reducts and rules is computationally expensive, and for real-world solutions, some approximate techniques like heuristics or genetic algorithms are engaged (see, e.g., Ref. 12).

The derived set of rules R may be unsatisfactory. The major concerns are the following:

- The number of rules is excessive, so the cost of storing, checking, and explaining the rules is not acceptable.

- The rules are so general that they do not really contain any valid knowledge, or they are too specific, so they describe a very small part of the universe in too much detail.

To avoid at least part of the problems mentioned, we may apply shortening procedures. These procedures allow us to shorten the rules and, consequently, to reduce the number of rules. The process of rule shortening comprises several steps that, in consequence, lead to the removal of some descriptors from a particular rule. Usually, after shortening, the number of rules decreases because



repetitions occur in the set of shortened rules. There are several methods that achieve this goal; for details, see, e.g., Refs. 10, 13, and 14.

5. HYPERPLANE-BASED NETWORK

5.1. Initial Construction

Once the hyperplanes and decision rules are constructed for a given $\mathbb{A}$, we may represent them in a neural network. It is worth noticing that, given a decision table $\mathbb{A} = (U, A \cup \{d\})$ and a set of $n_h$ hyperplanes that induce $n_r$ decision rules, we can construct a four-layer neural network with $n + 1$ inputs, $n_h$ and $n_r$ neurons in the hidden layers, respectively, and $n_d$ outputs, such that the network recognizes objects in U just like the corresponding hyperplane decision tree.

Such a network has n inputs that correspond to the conditional attributes and one additional constant input called the bias. Every input neuron sends its signal to all neurons in the first hidden layer. For each hyperplane we construct one neuron in this layer. This neuron has weights equal to the coefficients that describe the corresponding hyperplane.

For all neurons in the first hidden layer, the threshold functions have the form

$$h_j(x) = \begin{cases} 1 & \text{for } x \ge 0 \\ -1 & \text{for } x < 0 \end{cases}$$

This is also the case for the thresholds in the second layer, which are given as

$$r_k(x) = \begin{cases} 1 & \text{for } x \ge 1 \\ 0 & \text{for } x < 1 \end{cases}$$

Neurons in this layer correspond to binary hyperplane decision rules. The weights that connect these two layers correspond to the way the hyperplane attributes occur in the rules. For instance, let the kth minimal decision rule $\varphi_k$ be of the form

$$(H_2(u) < 0) \;\&\; (H_4(u) \ge 0) \;\&\; (H_7(u) < 0) \Rightarrow d(u) = v_4 \qquad (7)$$

Then the corresponding weights that lead to the kth neuron in the second hidden layer take the values

$$w_{jk} = \begin{cases} \tfrac{1}{3} & \text{for } j = 4 \\ -\tfrac{1}{3} & \text{for } j = 2 \text{ or } 7 \\ 0 & \text{otherwise} \end{cases} \qquad (8)$$

Thus, according to the preceding example, the kth neuron in the second hidden layer will be active (its threshold function will reach 1) for some $u \in U$ iff u satisfies the conditions of decision rule (7).



For every decision value, we construct one neuron in the output layer, which results in $n_d$ outputs from the network. The lth output is supposed to be active iff the given object put into the network belongs to the corresponding decision class $C_l$. To achieve such behavior, we link every decision rule neuron only with the output neuron that corresponds to the decision value indicated by the decision rule. Thus, in our example, the weights between the kth neuron in the second hidden layer and the output layer are

$$w_{kl} = \begin{cases} 1 & \text{for } l = 4 \\ 0 & \text{otherwise} \end{cases}$$

All neurons in the output layer receive threshold functions

$$\mathrm{out}_l(x) = \begin{cases} 1 & \text{for } x \ge 1 \\ 0 & \text{for } x < 1 \end{cases}$$

To summarize, in the initial construction we obtain a neural network with four layers (two hidden) that has at least the same quality of classification as the decision rules (decision tree) it has been built over. Figure 1 shows an outline of an artificial neural network constructed in this manner.
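The construction described in this section can be summarized in a few lines of Python. The sketch below is a simplified reading of it: hyperplane coefficients become the first weight matrix, each rule contributes a row of ±1/(number of its conditions) weights as in Eq. (8), and each rule neuron is wired to the output neuron of its decision. The data structures and function names are assumptions made for this illustration, not constructs from the paper.

```python
import numpy as np

def build_network(hyperplanes, rules, n_decisions):
    """Weight matrices of the four-layer network: inputs (with bias) -> hyperplane layer
    -> rule layer -> decision layer. `hyperplanes` is a list of coefficient vectors
    (alpha_0, ..., alpha_n); each rule is a pair (signs, decision), where signs maps a
    hyperplane index j to +1 for the condition H_j(u) >= 0 and -1 for H_j(u) < 0."""
    n_h, n_r = len(hyperplanes), len(rules)
    W1 = np.array(hyperplanes, dtype=float)          # row j = hyperplane j; column 0 is the bias weight
    W2 = np.zeros((n_r, n_h))
    W3 = np.zeros((n_decisions, n_r))
    for k, (signs, decision) in enumerate(rules):
        for j, sign in signs.items():
            W2[k, j] = sign / len(signs)             # +-1/(number of conditions), cf. Eq. (8)
        W3[decision, k] = 1.0                        # rule neuron feeds only its decision neuron
    return W1, W2, W3

def forward(W1, W2, W3, x):
    """Threshold forward pass for an attribute vector x (without the bias input)."""
    z = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    h = np.where(W1 @ z >= 0, 1.0, -1.0)             # bipolar hyperplane layer
    r = (W2 @ h >= 1.0).astype(float)                # rule layer fires iff all conditions hold
    return (W3 @ r >= 1.0).astype(float)             # one active output per recognized class
```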

5.2. Extending the Network

Although the preceding network is valid in its construction, it does not use all its potential power. First of all, we can extend this model by replacing the functions of the hyperplane and rule-based neurons with continuous ones. If we use bipolar sigmoidal functions instead of threshold functions in the first hidden layer, we weaken the hyperplane attributes. Whereas the neurons used to reflect the position of the given input according to a specific hyperplane, they now decide how close (or up to what degree) the specific point is to one of the half-spaces determined by the hyperplane. In other words, the hyperplane attribute is no longer binary. It now behaves a little bit like the attributes (features) used, e.g., in fuzzy set theory (see Ref. 15).

Figure 1. Outline of constructed network.



Following this idea, let us consider the class of bipolar sigmoidal functions of the form

$$h_j(x) = \frac{2}{1 + e^{-\alpha_j x}} - 1$$

for the hyperplane layer. Parameters $\alpha_j$ express degrees of vagueness for particular hyperplanes. The parallel nature of computations along the neural network justifies searching for such parameters locally, for each $H_j$ with respect to other hyperplanes, by applying adequate statistical or entropy-based methods (compare with Refs. 16 and 17).

Degrees of vagueness that are proportional to the risk of relying on the corresponding hyperplane cuts have a very simple interpretation. Let us weaken the decision rule thresholds by replacing the initial function $r_k$ by

$$r_k(x) = \begin{cases} 1 & \text{for } x \ge 1 - \varepsilon_k \\ 0 & \text{for } x < 1 - \varepsilon_k \end{cases}$$

where the parameter $\varepsilon_k$ expresses the degree of belief in the decision rule supported by $\varphi_k$ or, more precisely, in the quality of the hyperplanes that generate it. Then, for a fixed $\varepsilon_k$, increasing $\alpha_j$ for some $H_j$ that occurs in $\varphi_k$ implies that for objects that are "uncertain" with respect to the jth cut, the rule neuron function $r_k$ equals 0 and no classification is obtained.

If we want to modify the functions in the second hidden layer similarly as in the first, the idea of extracting initial weights from the degrees of precision for reasoning with given hyperplanes as conditions should be followed. We claim that formulas for the decision rule functions should be derived from the shapes of the functions in the previous layer. Thus, for the function

$$r_k(x) = \frac{1}{1 + e^{-\gamma_k x}}$$

which corresponds to decision rule $\varphi_k$, the quantity $\gamma_k$ is given by the formula

$$\gamma_k = \sum_{j=1}^{n_h} \alpha_j \cdot |w_{jk}|$$

where $w_{jk}$ stands for the weight of the connection between the jth neuron in the first hidden layer and the kth neuron in the second (rule-based) hidden layer.

We also have to change the output layer accordingly, to preserve its proper behavior. To do this, we modify the neurons so that they produce the sum of their inputs. The decision is made by choosing the output neuron that produces the highest output.
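Putting the pieces of this section together, a continuous forward pass might look like the following sketch, which assumes the matrices from the earlier `build_network()` example; the slope vectors `alphas` and `gammas` correspond to the vagueness parameters discussed above, and the exact placement of the slopes is an assumption of this illustration rather than a literal transcription of the paper.

```python
import numpy as np

def soft_forward(W1, W2, W3, x, alphas, gammas):
    """Continuous variant of the network: bipolar sigmoids with per-hyperplane slopes in
    the first hidden layer, logistic units in the rule layer, summation and argmax at the output."""
    z = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    h = 2.0 / (1.0 + np.exp(-alphas * (W1 @ z))) - 1.0   # graded hyperplane attributes
    r = 1.0 / (1.0 + np.exp(-gammas * (W2 @ h)))         # graded rule activations
    return int(np.argmax(W3 @ r))                        # decision = output neuron with the highest sum

# gamma_k derived from the previous layer, as in the text: gamma_k = sum_j alpha_j * |w_jk|
# gammas = np.abs(W2) @ alphas
```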

The foregoing changes simplify the learning if we are interested in performing adaptive fitting. As in classical neural network learning, we can manipulate some coefficients, like the learning rate, to control the learning process.7,18 In our approach, we use this ability to give such operations a meaning.



The change of weights in the first hidden layer corresponds to a change of elevation of the hyperplanes. Hence, by setting constraints on the values of the learning coefficients, we can constrain learning in cases where we do not want the hyperplanes to change too rapidly. Standard tricks from network learning, like the momentum factor,7 also can be used, although they do not yet have an explicit interpretation in terms of hyperplanes and decision trees.

During the learning process, we should still remember the interpretation of weights and functions. Starting from the initial structure obtained from the data by the sequential algorithm for finding hyperplanes, we begin to modify weights via the given learning method. Then, however, even for possibly improved classification, we cannot determine how the decision rules actually behave over the data. To some extent, the changes can be interpreted as giving a degree of accuracy (trustworthiness) to particular rules or their parts. Another point is to keep decision rules minimal for the current hyperplane weights, to make the whole process clearer. Thus, it turns out to be very important to preserve the balance between what is derived from the learning process and what is obtained from the described construction. Some discussion of these aspects can be found in Refs. 19 and 20.

5.3. Illustrative Example

To better understand the methodology presented, let us review a very simple example. The iris classification database has been widely known since its first publication by Fisher.21 This simple database contains 150 examples of irises. Each object is described by four real-valued attributes. The task is to decide to which of three possible categories every particular object belongs. In our terms, we have three decision classes that should be approximated. Objects are distributed evenly between decision classes; each class consists of exactly 50 objects. In the standard approach, half of the data table (75 objects) is used for learning and the second half for testing. For our methods, the partition was 40% of cases (chosen at random) for training and 60% for testing in the process of knowledge discovery and network construction.

The iris example is very good for us because the decision classes are almost linearly separable. Except for two objects, it is possible to separate objects with different decisions by using only two hyperplanes, as is shown symbolically in Figure 2. So there are only three very simple decision rules.

Figure 2. Classes defined by hyperplanes.



Figure 3. Network for iris data (outline).

From the two hyperplanes and three decision rules, we can now construct the neural network. The outline of this network is shown in Figure 3.

Altogether, the network has 12 neurons, of which 5 are located in hidden layers. If we change this network by applying continuous functions in both hidden layers, we obtain an improvement, and at the end of the process only one object remains misclassified. Moreover, the construction method is quite general for this data set. Several trials with different choices of learning and testing sets showed, on average, the same results.

5.4. Rules as Attributes

In the classical approach, once we have decision rules, we are at the end of classifier construction. However, there is also another way to treat the rules because they describe the relationships that exist in our data. Therefore, we may treat the rules as features of objects. In this view, the process of rule extraction becomes the process of new feature extraction. These features are of higher "order" because they take into account specific configurations of attribute values with respect to decision.

Let us consider the set of rules $R = (r_1, \ldots, r_m)$. We can construct a new decision table based on them. With every rule $r_i$ in R, we connect a new attribute $ar_i$. The decision attribute remains unchanged, as does the universe of objects. The values of the attributes over objects may be defined in different ways, depending on the nature of the data. For the purposes of this research, we use the following three possibilities (a small sketch of the construction follows the list):

- $ar_i(o_j) = d_k$, where $d_k$ is the value of the decision returned by rule $r_i$ if it is applicable to object $o_j$, and is equal to 0 (or any other constant) otherwise.

- $ar_i(o_j) = \mathrm{const}$ (usually $\mathrm{const} = 1$ or $-1$) if the rule $r_i$ applies to the object $o_j$, and is equal to 0 (or any other constant) otherwise.

- In the case of tables with binary decision, $ar_i(o_j) = 1$ if the rule $r_i$ applies to the object $o_j$ and the output of this rule points at decision value 1, and $ar_i(o_j) = -1$ if the rule $r_i$ applies to the object $o_j$ and the output of this rule points at decision value 0. When the rule is not applicable, $ar_i(o_j) = 0$.
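The promised sketch of the construction uses the second encoding from the list (1 if the rule applies, 0 otherwise) and the rule representation and `applies()` helper from the sketch in Section 2; attribute names such as `ar0`, `ar1`, ... are of course only illustrative.

```python
def rules_to_table(rules, table):
    """Build the new decision table: one column ar_i per rule, the decision column unchanged."""
    new_rows = []
    for obj in table:
        row = {f"ar{i}": int(applies(r, obj)) for i, r in enumerate(rules)}
        row["d"] = obj["d"]       # universe and decision attribute stay as they were
        new_rows.append(row)
    return new_rows
```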



Due to technical restrictions in further steps of the classifier construction, it is sometimes necessary to modify the preceding methods, e.g., by encoding the decision values in the first of the approaches in order to use a neural network, as in our case.

It can easily be seen how important it is to keep the rule set at a reasonable size. Otherwise, the newly produced decision table may become practically unmanageable due to the number of attributes.

5.5. The Making of the Classifier

Equipped with the decision table extracted using the set of rules, we may now proceed with the construction of the final classification (decision) system. To keep computation to a reasonable size with respect to time and space complexity, we apply very simple and straightforward methods. Namely, we use a simple sigmoidal neural network with no hidden layers.7 The overall process of classifier construction is illustrated in Figure 4.
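As a stand-in for this final stage, the following minimal sketch trains a sigmoidal network with no hidden layers (inputs connected directly to logistic output units plus a bias) by plain gradient descent on squared error; the hyperparameters, the squared-error criterion, and the absence of regularization and momentum are simplifying assumptions rather than the exact setting used in the experiments.

```python
import numpy as np

def train_flat_network(X, y, n_classes, lr=0.1, epochs=500):
    """X: (m, p) array of rule-based attributes; y: integer class indices 0..n_classes-1."""
    X = np.hstack([np.ones((len(X), 1)), np.asarray(X, dtype=float)])   # bias column
    T = np.eye(n_classes)[np.asarray(y)]                                # one-hot targets
    W = np.zeros((X.shape[1], n_classes))
    for _ in range(epochs):
        out = 1.0 / (1.0 + np.exp(-X @ W))              # logistic output units
        grad = X.T @ ((out - T) * out * (1.0 - out))    # gradient of the squared error
        W -= lr * grad / len(X)
    return W

def predict(W, X):
    X = np.hstack([np.ones((len(X), 1)), np.asarray(X, dtype=float)])
    return np.argmax(X @ W, axis=1)
```

A useful side effect of such a flat architecture, noted later in the text, is that the learned weights remain directly interpretable as importances of the individual rule-based attributes.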

We start with an initial training decision table for which we calculate reducts and a set of possibly best rules. We may derive rules in a dynamic or nondynamic way, depending on the particular situation (data). These rules are then used to construct a new decision table in the manner described in the previous section. Over the new data table so constructed, we build a neural network-based classifier to classify newly formed objects. Then the classifier is checked for quality against a testing set.

Of course, with the proposed scheme we may construct various classifiers because some parameters may be adjusted at any step of this process. In the process of reduct and rule calculation, we may establish restrictions on the number and size of reducts (rules) as well as on rule specificity, generality, coverage, and so on. During neural network construction, we may apply different learning algorithms. The learning coefficients of those algorithms may vary as well.

To complete the picture of the classifier, it is important to add a handful of technical details. For the purpose of the research presented in this article, we used dynamic calculation of rules based on a genetic algorithm, incorporating some discretization techniques for attributes that are continuous in nature.

Figure 4. The layout of a new classifier.



Table I. Data sets used for experiments.

Data Set       Objects  Attributes  Attribute Type  Rank(d)
Monk1          432      6           Symbolic        2
Monk2          432      6           Symbolic        2
Monk3          432      6           Symbolic        2
Lymphography   148      18          Symbolic        4
EEG            550      105         Binary          2

For details, consult Ref. 8. On the side of neural networks, we used a simple architecture with neurons that have the classical sigmoid or hyperbolic tangent as the activation function. Usually, the network is equipped with a bias and trained using gradient descent with regularization, momentum, and an adaptive learning

rate (see Refs. 7 and 22).

The simple architecture of the neural network has one additional advantage. From its weights we may decipher the importance of particular attributes (rules) for decision making. This is usually not the case with more complicated neural architectures, for which such an interpretation is difficult and the role of single inputs is not transparent.

5.6. Experimental Results

The proposed methods have been tested against real data tables. For testing, we used two benchmark data sets taken from a repository23 and one data set received from a medical source. Table I describes the basic parameters of the decision tables used in the experiments. The EEG data were originally represented as a matrix of signals that was further converted to binary form by applying wavelet analysis and discretization techniques, as originally proposed in Refs. 8 and 11 and developed in Ref. 24. The MONK data sets have a preset partition into training and testing sets; the rest of the data sets were tested using a cross-validation method.

The rules were calculated using dynamic techniques. Then we performed several experiments using different rule shortening ratios. Table II shows the best results.

Table II. The results of experiments.

Data Set       Number of Rules  Shortening Ratio  Method  Error (Proposed)  Error (Other)
Monk1          31               0.6               TT      0 / 0.03          0 / 0
Monk2          26               0.6               TT      0 / 0.06          0 / 0.049
Monk3          44               0.6               TT      0 / 0.051         0 / 0.046
Lymphography   78               0.8               CV-10   0.03 / 0.19       0 / 0.15
EEG            13               0.3               CV-5    0 / 0.01          0.11 / 0.16



The columns in this table describe the number of rules used for the new table ("Number of Rules"), the shortening ratio of the rules (between 0 and 1), the method of training and testing (TT = train and test, CV-n = n-fold cross-validation), the average error on the training and testing sets as a fraction of the number of cases, and the best results from other rough set methods for comparison. The experiments were performed several times to get average (representative) results. The comparison is made with the best result from the application of combined rough set methods. However, it is important to mention that those best classifiers are usually based on much larger sets of rules.

The results are comparable to those published in Refs. 9 and 25, but our classifiers usually use far fewer rules and a simpler setting than the best results in Refs. 9 and 10. The most significant boost is visible if we compare our outcome with classification that uses only the calculated rules and the classical weight setting. Especially in the case of a small shortening ratio, which corresponds to a significant reduction of rules, the impact of the proposed methods is clearly visible.

6. CONCLUSIONS AND FURTHER RESEARCH

The methodology for constructing a classifier by combining hyperplane and neural network methods looks promising. The possibility of creating a network from the knowledge we have is very convenient and saves us a lot of work usually spent searching for the proper architecture. The network's ability to learn and modify its states, as well as its flexibility and tolerance to vagueness, results in better fitting and extensibility. If we add to that the fact that the hyperplane method itself proved to be very effective in many cases, the future potential seems bright.

There are several open questions about further development of the proposed methods. One possible direction of further research focuses on an extension of the hyperplane method. We can try to use higher order surfaces instead of hyperplanes, although this method seems to be much more complex and costly to implement. Also, the construction of the network for such a curve-based classification is not so obvious. Another possible way to extend the described methods is to try to work with heterogeneous data. If we have to deal with data that have both numerical and symbolic attributes, it may be less obvious how to establish decision classes. We have to consider the coding of symbolic attributes for use with our methods, not to mention potential problems with an ordering of possible attribute values that may not exist at all. Anyway, there seems to be a lot more to do in the future.

The proposed approach allows us to construct a classifier with a combination of rule-based systems and neural networks. The rough set rules derived with respect to the discernibility of an object seem to possess extended importance when used as new feature generators. Application of the neural network in the last stage of classifier construction allows better fitting to a particular set of data and makes further addition of new knowledge to the system easier due to its adaptiveness.



Initial experiments show promising results, especially in cases of binary decision. Reduction of the number of rules used makes the system obtained in this way closer to natural intuition.

Because work on this issue is just beginning, there is still a lot to do in many directions. Most interesting, from our point of view, is further investigation of the relationship between the process of rule induction with rough sets and the quality of the induced rules as new attributes.

I want to thank Nguyen Hung Son, Jan Bazan, and Piotr Wojdyłło for sharing their expertise and allowing me to use some of their solutions and tools. This work was supported by Grant 8T11C02412 from the Polish State Committee for Scientific Research and ESPRIT project 20288 CRIT2.

References

1. Nguyen HS, Skowron A. Quantization of real-valued attributes, rough set and Boolean reasoning approaches. In: Proceedings of the Second Joint Annual Conference on Information Sciences, Wrightsville Beach, North Carolina, Sept 28-Oct 1, 1995. p 34-37.

2. Nguyen HS, Nguyen SH. From optimal hyperplanes to optimal decision tree. In: Proceedings of the IV International Workshop on Rough Sets, Fuzzy Sets and Machine Discovery (RSFD'96), Tokyo, Japan, Nov 6-8, 1996. p 82-88.

3. Nguyen HS, Nguyen SH, Skowron A. Searching for features defined by hyperplanes. In: Ras ZW, Michalewicz M, editors. Proceedings of the IX International Symposium on Methodologies for Information Systems (ISMIS'96), Zakopane, Poland. Lecture Notes in Artificial Intelligence, Vol. 1079. Berlin: Springer-Verlag; 1996. p 366-375.

4. Shavlik JW, Towell GG. Knowledge-based artificial neural networks. Artificial Intelligence 1995;70.

5. Pawlak Z. Rough sets: Theoretical aspects of reasoning about data. Dordrecht: Kluwer; 1991.

6. Skowron A, Rauszer C. The discernibility matrices and functions in information systems. In: Slowinski R, editor. Intelligent decision support: Handbook of applications and advances of the rough sets theory. Dordrecht: Kluwer; 1992. p 331-362.

7. Karayannis NB, Venetsanopoulos AN. Artificial neural networks: Learning algorithms, performance evaluation and applications. Dordrecht: Kluwer; 1993.

8. Nguyen SH, Nguyen HS. Discretization methods in data mining. In: Skowron A, Polkowski L, editors. Rough sets in knowledge discovery 1. Heidelberg: Physica-Verlag; 1998. p 451-482.

9. Bazan J. A comparison of dynamic and non-dynamic rough set methods for extracting laws from decision tables. In: Skowron A, Polkowski L, editors. Rough sets in knowledge discovery 1. Heidelberg: Physica-Verlag; 1998. p 321-365.

10. Bazan J. Approximate reasoning methods for synthesis of decision algorithms (in Polish). PhD Thesis, Department of Mathematics, Computer Science and Mechanics, Warsaw University, Warsaw, 1998.

11. Wojdyłło P. Wavelets, rough sets and artificial neural networks in EEG analysis. In: Proceedings of RSCTC'98. Lecture Notes in Artificial Intelligence, Vol. 1424. Berlin: Springer-Verlag; 1998. p 444-449.

12. Wroblewski J. Covering with reducts: A fast algorithm for rule generation. In: Proceedings of RSCTC'98. Lecture Notes in Artificial Intelligence, Vol. 1424. Berlin: Springer-Verlag; 1998. p 402-407.

13. Agrawal R, Mannila H, Srikant R, Toivonen H, Verkamo AI. Fast discovery of association rules. In: Advances in knowledge discovery and data mining. Menlo Park, CA / Cambridge, MA: AAAI Press / MIT Press; 1996. p 307-328.

14. Ziarko W. Variable precision rough set model. J Comput Syst Sci 1993;40:39-59.

15. Kruse R, Gebhardt J, Klawonn F. Foundations of fuzzy systems. Chichester: John Wiley & Sons; 1994.

16. Kohavi R, Sahami M. Error-based and entropy-based discretization of continuous features. In: Simoudis E, Han J, Fayyad UM, editors. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, August 1996. p 114-119.

17. Vapnik VN. The nature of statistical learning theory. New York: Springer-Verlag; 1995.

18. Hecht-Nielsen R. Neurocomputing. Reading, MA: Addison-Wesley; 1990.

19. Szczuka M, Ślęzak D. Hyperplane-based neural networks for real-valued decision tables. In: Proceedings of RSSC'97, Raleigh, North Carolina, 1997. p 265-268.

20. Nguyen HS, Szczuka M, Ślęzak D. Neural networks design: Rough set approach to real-valued data. In: Proceedings of PKDD'97, Trondheim, Norway. Lecture Notes in Artificial Intelligence, Vol. 1263. Berlin: Springer-Verlag; 1997. p 359-366.

21. Fisher R. The use of multiple measurements in taxonomic problems. In: Contributions to mathematical statistics. New York: John Wiley & Sons; 1950.

22. Arbib MA, editor. The handbook of brain theory and neural networks. Cambridge, MA: MIT Press; 1995.

23. The Machine Learning Repository, University of California at Irvine, http://www.ics.uci.edu/~mlearn/MLRepository.html; 1996.

24. Szczuka M, Wojdyłło P. Neuro-wavelet classifiers for EEG signals based on rough set methods. Unpublished report.

25. Michie D, Spiegelhalter DJ, Taylor CC. Machine learning, neural and statistical classification. London: Ellis Horwood; 1994.