
Constructing a Fuzzy Decision Tree by Integrating Fuzzy Sets and Entropy

TIEN-CHIN WANG (王天津), HSIEN-DA LEE (李賢達)

Department of Information Management, I-Shou University

    Kaohsiung, Taiwan

Abstract: - Decision tree induction is one of the most common approaches for extracting knowledge from a set of feature-based examples. In the real world, much data occurs in fuzzy and uncertain forms, and a decision tree must be able to deal with such fuzzy data. This paper presents a tree construction procedure that builds a fuzzy decision tree from a collection of fuzzy data by integrating fuzzy set theory and entropy. It proposes a fuzzy decision tree induction method for fuzzy data whose numeric attributes can be represented by fuzzy numbers, interval values, or crisp values, whose nominal attributes are represented by crisp nominal values, and whose classes carry confidence factors. It also presents an experimental result to show the applicability of the proposed method.

Key-Words: Fuzzy Decision Tree, Fuzzy Sets, Entropy, Information Gain, Classification, Data Mining

1 Introduction

Decision trees have been widely and successfully used in machine learning. More recently, fuzzy representations have been combined with decision trees. Many methods have been proposed to construct decision trees from collections of data. Due to observation error, uncertainty, and so on, much of the data collected in the real world is obtained in fuzzy form. Fuzzy decision trees treat features as fuzzy variables and still yield simple decision trees. Moreover, the use of fuzzy sets is expected to deal with uncertainty due to noise and imprecision. Research on fuzzy decision tree induction for fuzzy data has not yet been sufficiently developed. This paper is concerned with a fuzzy decision tree induction method for such fuzzy data. It proposes a tree-building procedure to construct a fuzzy decision tree from a collection of fuzzy data.

Decision trees and decision rules are data-mining methodologies applied in many real-world applications as a powerful solution to classification problems [1]. Classification is a process of learning a function that maps a data item into one of several predefined classes. Every classification based on inductive-learning algorithms is given as input a set of samples, each consisting of a vector of attribute values and a corresponding class. For example, a simple classification might group students into three groups based on their scores: (1) those whose scores are above 90, (2) those whose scores are between 70 and 90, and (3) those whose scores are below 70.

1.1 Fuzzy set theory

Fuzzy set theory was first proposed by Zadeh to represent and manipulate data and information that possess non-statistical uncertainty. Fuzzy set theory is primarily concerned with quantifying and reasoning using natural language, in which words can have ambiguous meanings. It can be thought of as an extension of traditional crisp sets, in which each element must either be in or not in a set. Fuzzy sets are defined on a non-fuzzy universe of discourse, which is an ordinary set. A fuzzy set F of a universe of discourse U is characterized by a membership function μ_F(x) which assigns to every element x ∈ U a membership degree μ_F(x) ∈ [0, 1]. An element x ∈ U is said to be in a fuzzy set F if and only if μ_F(x) > 0, and to be a full member if and only if μ_F(x) = 1 [5]. Membership functions can either be chosen by the user arbitrarily, based on the user's experience, or they can be designed by using optimization procedures [6][7]. Typically, a fuzzy subset A can be represented as

A = {μ_A(x_1)/x_1, μ_A(x_2)/x_2, ..., μ_A(x_n)/x_n}

where the separating symbol / is used to associate a membership value with its coordinate on the horizontal axis. For example, in Fig. 1, let F = "integers close to 10"; then one choice for μ_F(x) is expressed as

F = 0.0/8 + 0.5/9 + 1.0/10 + 0.5/11 + 0.0/12
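As a concrete illustration of this notation, the following minimal Python sketch represents the discrete fuzzy set "integers close to 10" as a mapping from elements to membership degrees. The dictionary values are taken directly from the example above; the helper names (membership, is_member, is_full_member) are illustrative choices, not part of the paper.

# A discrete fuzzy set as a mapping element -> membership degree.
# Degrees taken from the example F = 0.0/8 + 0.5/9 + 1.0/10 + 0.5/11 + 0.0/12.
F = {8: 0.0, 9: 0.5, 10: 1.0, 11: 0.5, 12: 0.0}

def membership(fuzzy_set, x):
    """Return mu(x); elements outside the listed support have degree 0."""
    return fuzzy_set.get(x, 0.0)

def is_member(fuzzy_set, x):
    """x belongs to the fuzzy set iff mu(x) > 0."""
    return membership(fuzzy_set, x) > 0.0

def is_full_member(fuzzy_set, x):
    """x is a full member iff mu(x) = 1."""
    return membership(fuzzy_set, x) == 1.0

if __name__ == "__main__":
    print(membership(F, 9))       # 0.5
    print(is_member(F, 12))       # False (degree 0.0)
    print(is_full_member(F, 10))  # True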


Fig. 1. Triangular membership function expression for a number close to 10

1.2 Fuzzy Decision Trees

A decision tree [4][8] is a formalism for expressing mappings from attribute values to classes. It consists of tests or attribute nodes linked to two or more subtrees, and leaves or decision nodes labeled with a class that indicates the decision. The main advantage of the decision-tree approach is that it visualizes the solution; it is easy to follow any path through the tree. Relationships discovered by a decision tree can be expressed as a set of rules, which can then be used in developing an expert system. A decision tree model employs a recursive divide-and-conquer strategy to divide the data set into partitions so that all of the records in a partition have the same class label [9]. In classical decision trees, a data item follows down only one branch of a node, the branch whose condition it satisfies, and finally arrives at exactly one leaf node. In tree-structured representations, a set of data is represented by a node, and the entire data set is represented by the root node. When a split is made, several child nodes, which correspond to partitioned data subsets, are formed. If a node is not to be split any further, it is called a leaf; otherwise, it is an internal node. Decision trees classify data by sorting them down the tree from the root to the leaf nodes. Typical decision tree induction algorithms include ID3 and CART [10][11]. Decision trees were popularized by Quinlan with the ID3 algorithm. Systems based on ID3 work well in symbolic domains, and a large variety of extensions to the basic ID3 algorithm have been developed by different researchers. ID3 is designed to deal with symbolic domain data; the algorithm is applied recursively to each child node until all samples at a node belong to a single class. CART, in contrast, is designed to deal with continuous numeric domain data. A number of variants of these algorithms have been developed, and the fuzzy decision tree is one of them. Fuzzy decision trees allow data to follow down multiple branches of a node simultaneously, with different satisfaction degrees in [0, 1] [12].

    Fuzzy decision trees attempt to combine elements of symbolic and sub-symbolic approaches. Fuzzy sets and fuzzy logic allow modeling language-related uncertainties, while providing a symbolic framework for knowledge comprehensibility. Fuzzy decision trees differ from traditional crisp decision trees in three respects [10]: (1) They use splitting criteria based on fuzzy restrictions. (2) Their inference procedures are different. (3) The fuzzy sets representing the data have to be defined.

Fuzzy decision tree induction has two major components: a procedure for fuzzy decision tree building and an inference procedure for decision making [13]. To apply an ID3-like procedure to fuzzy decision tree construction, the following components must be developed: an attribute value space partitioning method, a branching attribute selection method, a branching test method to decide to which degree data follows down the branches of a node, and a leaf node labeling method to determine the classes for which leaf nodes stand.

1.3 Entropy Heuristics

Attribute selection in the ID3 and C4.5 algorithms is based on minimizing an information entropy measure applied to the examples at a node [1]. The entropy measure is used to calculate the information gain, which reflects the quality of an attribute as the branching attribute. The attribute-selection part of ID3 is based on the assumption that the complexity of the decision tree is strongly related to the amount of information conveyed by the value of the given attribute. An information-based heuristic selects the attribute providing the highest information gain. A data set with some discrete-valued condition attributes and one discrete-valued decision attribute can be presented in the form of a knowledge representation system J = (U, C ∪ D), where U = {u_1, u_2, ..., u_s} is the set of data samples, C = {c_1, c_2, ..., c_n} is the set of condition attributes, and D = {d} is the one-element set containing the decision (class label) attribute. Suppose this class label attribute has m distinct values defining m distinct classes d_i (for i = 1, ..., m), and let s_i be the number of samples of U in class d_i.


The expected information (entropy) needed to classify a given sample is given by

I(s_1, ..., s_m) = − Σ_{i=1..m} p_i log2(p_i)    (1)

where p_i is the probability that an arbitrary sample belongs to class d_i, estimated as s_i / s (s is the total number of samples). Let attribute c_i have v distinct values {A_1, A_2, ..., A_v}. Attribute c_i can be used to partition U into v subsets {S_1, S_2, ..., S_v}, where S_j (j = 1, ..., v) contains those samples in U that have value A_j of c_i. Let s_ij be the number of samples of class d_i in subset S_j. The entropy of attribute c_i is given by

E(c_i) = Σ_{j=1..v} ((s_1j + ... + s_mj) / s) · I(s_1j, ..., s_mj)    (2)

The term (s_1j + ... + s_mj) / s acts as the weight of the j-th subset; it is the number of samples in the subset divided by the total number of samples. The smaller the entropy value, the greater the purity of the subset partitions. Thus the attribute that leads to the largest information gain is selected as the branching attribute. For a given subset S_j, the expected information is

I(s_1j, ..., s_mj) = − Σ_{i=1..m} p_ij log2(p_ij)    (3)

where p_ij = s_ij / |S_j| (|S_j| is the number of samples in subset S_j) is the probability that a sample in S_j belongs to class d_i. The information gain of attribute c_i is then given by

Gain(c_i) = I(s_1, ..., s_m) − E(c_i)    (4)

We compute the information gain of each condition attribute; the attribute with the highest information gain is the most informative and most discriminating attribute of the given set.
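To make formulas (1)-(4) concrete, the following Python sketch computes the entropy of a labeled sample set and the information gain of a condition attribute. It is a generic illustration of the standard ID3 measures described above, not code from the paper; the function names and the data layout (a list of dicts plus a class key) are assumptions chosen for readability.

import math
from collections import Counter

def entropy(labels):
    """I(s_1, ..., s_m): expected information of a list of class labels, formulas (1)/(3)."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def attribute_entropy(samples, attribute, class_key):
    """E(c_i): weighted entropy of the partition induced by `attribute`, formula (2)."""
    total = len(samples)
    partitions = {}
    for row in samples:
        partitions.setdefault(row[attribute], []).append(row[class_key])
    return sum((len(subset) / total) * entropy(subset) for subset in partitions.values())

def information_gain(samples, attribute, class_key):
    """Gain(c_i) = I(s_1, ..., s_m) - E(c_i), formula (4)."""
    labels = [row[class_key] for row in samples]
    return entropy(labels) - attribute_entropy(samples, attribute, class_key)

On a list of dictionaries holding the fuzzified records of Table 2 below, a call such as information_gain(samples, "ETS", "Admission") should reproduce the gain value computed in Section 3.1.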

    2 Experiment

In this section, an example is given to illustrate the proposed fuzzy decision tree algorithm. The example is intended to show that the fuzzy decision tree algorithm can be used to evaluate student admission to a graduate school. The data set includes 10 applicants, as shown in Table 1.

Table 1. The data set of students

Student no.  GPA  ETS  WE         Ref.  Admission
1            3.2  75   Fair       Yes   Yes
2            2.8  52   Excellent  N/A   No
3            2.7  69   Fair       Yes   No
4            3.6  86   Excellent  Yes   Yes
5            2.1  63   Fair       Yes   No
6            2.6  91   Fair       N/A   Yes
7            2.8  63   Excellent  Yes   No
8            2.3  77   Fair       Yes   No
9            3.6  68   Fair       Yes   Yes
10           3.5  90   Fair       N/A   Yes

    Each case consists of four condition attributes: grade point average (denoted GPA), entrance test score (denoted ETS), working experience (denoted WE), and reference (denoted Ref).

In this example, triangular membership functions are used to represent fuzzy sets because of their simplicity, easy comprehension, and computational efficiency. Membership functions are usually predefined by experienced experts; they can also be derived through automatic adjustment [14].

As shown in Fig. 2 and Fig. 3, the GPA and ETS attributes each have three fuzzy regions: Low, Middle, and High. Thus, three fuzzy membership values are produced for each GPA and test score according to the predefined membership functions.

    Fig. 2. The membership function for examinees’ GPAs


Fig. 3. The membership function for examinees' scores

    3 Problem Solution

For the experimental data in Table 1, the decision-tree construction algorithm proceeds as described in the following subsections.

3.1 Calculate Information Gain

STEP 1. To represent a continuous fuzzy set, we need to express it as a function and then map the elements of the set to their degrees of membership [3]. Transform the quantitative value of each examinee's score into a fuzzy set. Take the entrance test score (ETS) for example: the score 85 can be converted into the fuzzy set (0.0/Low + 0.0/Middle + 0.5/High) using the predefined membership functions in Fig. 3. The transformation procedure is repeated for the other scores, and the result is shown in Table 2; a small sketch of this fuzzification step follows the table.

Table 2. The data set of students in fuzzy form

no.  GPA     ETS     WE         Ref.  Admission
1    Middle  Middle  Fair       Yes   Yes
2    Middle  Low     Excellent  N/A   No
3    Middle  Middle  Fair       Yes   No
4    High    High    Excellent  Yes   Yes
5    Low     Low     Fair       Yes   No
6    Middle  High    Fair       N/A   Yes
7    Middle  Low     Excellent  Yes   No
8    Low     Middle  Fair       Yes   No
9    High    Middle  Fair       Yes   Yes
10   High    High    Fair       N/A   Yes

STEP 2. Form a knowledge representation system J = (U, C ∪ D), where U = {1, ..., 10}, C = {GPA, ETS, WE, REF}, and D = {Admission}. The class label attribute Admission has two distinct values {yes, no}, so there are two distinct classes (m = 2). Let class d_1 represent yes and class d_2 represent no. There are 5 samples of class yes and 5 samples of class no, so by formula (1):

I(s_1, s_2) = −(5/10) log2(5/10) − (5/10) log2(5/10) = 1

STEP 3. Compute the entropy for each attribute. The attribute GPA has three distinct values {High, Middle, Low}, so U can be partitioned into three subsets {S_1, S_2, S_3}.

For GPA = "High": s_11 = 3, s_21 = 0
I(s_11, s_21) = −(3/3) log2(3/3) − 0 = 0    by formula (3)

For GPA = "Middle": s_12 = 2, s_22 = 3
I(s_12, s_22) = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971    by formula (3)

For GPA = "Low": s_13 = 0, s_23 = 2
I(s_13, s_23) = 0 − (2/2) log2(2/2) = 0    by formula (3)

E(GPA) = (3/10)·I(s_11, s_21) + (5/10)·I(s_12, s_22) + (2/10)·I(s_13, s_23) = 0.485    by formula (2)

Gain(GPA) = I(s_1, s_2) − E(GPA) = 0.514    by formula (4)

STEP 4. In the same manner as STEP 3, compute Gain(ETS) = 0.6, Gain(WE) = 0.3389, and Gain(Ref) = 0.05. Since ETS has the highest information gain among the four attributes, ETS is selected as the attribute on which to split the tree. A small sketch that reproduces this computation on the data of Table 2 is given below.
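The following self-contained Python sketch recomputes the whole-set entropy I(s_1, s_2) and the gains of ETS and GPA on the fuzzified data of Table 2, matching the values obtained above (Gain(GPA) is 0.5145 before rounding, reported as 0.514 in STEP 3). The data layout and function names are illustrative choices, not part of the paper.

import math
from collections import Counter

# (GPA, ETS, Admission) for students 1-10, taken from Table 2.
rows = [
    ("Middle", "Middle", "Yes"), ("Middle", "Low", "No"),
    ("Middle", "Middle", "No"),  ("High", "High", "Yes"),
    ("Low", "Low", "No"),        ("Middle", "High", "Yes"),
    ("Middle", "Low", "No"),     ("Low", "Middle", "No"),
    ("High", "Middle", "Yes"),   ("High", "High", "Yes"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, column):
    """Information gain of the attribute stored in `column` (0 = GPA, 1 = ETS)."""
    labels = [r[-1] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[column], []).append(r[-1])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

print(round(entropy([r[-1] for r in rows]), 3))  # 1.0   -> I(s_1, s_2)
print(round(gain(rows, 1), 3))                   # 0.6   -> Gain(ETS)
print(round(gain(rows, 0), 4))                   # 0.5145 -> Gain(GPA), reported as 0.514 above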

3.2 Constructing a Decision Tree

We use the selected condition attribute ETS to form the decision tree, obtaining the following equivalence classes:

high: {4, 6, 10}    middle: {1, 3, 8, 9}    low: {2, 5, 7}

The subset for middle, {1, 3, 8, 9}, needs to be split further. Following the algorithm described in section 1.3, the attribute GPA has the highest information gain and is used to split this subset. The whole decision tree is then complete, as shown in Fig. 4. A sketch of this two-level partitioning step is given below.
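The partition just described can be sketched as follows: first group the samples of Table 2 by their ETS value, then split only the impure middle group by GPA. This is an illustrative rendering of the construction of Fig. 4 with assumed variable names; it does not re-derive the attribute choices.

# Student no. -> (ETS, GPA, Admission), taken from Table 2.
data = {
    1: ("Middle", "Middle", "Yes"), 2: ("Low", "Middle", "No"),
    3: ("Middle", "Middle", "No"),  4: ("High", "High", "Yes"),
    5: ("Low", "Low", "No"),        6: ("High", "Middle", "Yes"),
    7: ("Low", "Middle", "No"),     8: ("Middle", "Low", "No"),
    9: ("Middle", "High", "Yes"),   10: ("High", "High", "Yes"),
}

def partition(ids, index):
    """Group the given student ids by the attribute at `index` (0 = ETS, 1 = GPA)."""
    groups = {}
    for i in ids:
        groups.setdefault(data[i][index], []).append(i)
    return groups

# Level 1: split every sample on ETS.
ets_groups = partition(data.keys(), 0)
print(ets_groups)  # {'Middle': [1, 3, 8, 9], 'Low': [2, 5, 7], 'High': [4, 6, 10]}

# Level 2: only the impure ETS = Middle node is split again, this time on GPA.
gpa_groups = partition(ets_groups["Middle"], 1)
print(gpa_groups)  # {'Middle': [1, 3], 'Low': [8], 'High': [9]}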


Fig. 4. Decision tree based on information gain. The root node splits samples 1-10 on ETS: high leads to {4, 6, 10} (yes), low leads to {2, 5, 7} (no), and middle leads to {1, 3, 8, 9}, which is split on GPA into high {9} (yes), low {8} (no), and middle {1, 3} (?).

3.3 Extract classification rules

Data classification is an important data mining task [2] that tries to identify common characteristics in a set of N objects contained in a database and to categorize them into different groups. We extract classification IF-THEN rules from the equivalence classes. For the equivalence class {4, 6, 10}, the samples all have identical attribute values:

ETS = high, Admission = yes

So we use the condition attribute value (ETS = high) as the rule antecedent and the class label attribute value (Admission = yes) as the rule consequent, obtaining the following classification rule:

IF ETS = "high" THEN Admission = "yes"

The other classification rules can be extracted in the same manner (a small sketch of this extraction step follows the rule list below). We obtain the following rules:

1. IF ETS = "high" THEN Admission = "yes"
2. IF ETS = "low" THEN Admission = "no"
3. IF ETS = "middle" AND GPA = "high" THEN Admission = "yes"
4. IF ETS = "middle" AND GPA = "low" THEN Admission = "no"
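As an illustration of this extraction step, the Python sketch below turns the leaves of the tree in Fig. 4 into IF-THEN rule strings. The leaf list is copied from the tree above; the representation (a list of condition dictionaries paired with a class label) and the function name are assumptions for the example, and the ambiguous leaf {1, 3} is skipped because it carries no single class.

# Leaves of the tree in Fig. 4: (conditions on the path from the root, class label).
leaves = [
    ({"ETS": "high"}, "yes"),
    ({"ETS": "low"}, "no"),
    ({"ETS": "middle", "GPA": "high"}, "yes"),
    ({"ETS": "middle", "GPA": "low"}, "no"),
]

def leaf_to_rule(conditions, label):
    """Render one root-to-leaf path as an IF-THEN classification rule."""
    antecedent = " AND ".join(f'{attr}="{value}"' for attr, value in conditions.items())
    return f'IF {antecedent} THEN Admission="{label}"'

for conditions, label in leaves:
    print(leaf_to_rule(conditions, label))
# IF ETS="high" THEN Admission="yes"
# IF ETS="low" THEN Admission="no"
# IF ETS="middle" AND GPA="high" THEN Admission="yes"
# IF ETS="middle" AND GPA="low" THEN Admission="no"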

4 Conclusion

This paper is concerned with fuzzy sets and decision trees. We present a fuzzy decision tree model based on fuzzy set theory and information theory. The model provides a fuzzy decision tree induction method for fuzzy data whose numeric attributes can be represented by fuzzy numbers, interval values, or crisp values, whose nominal attributes are represented by crisp nominal values, and whose classes carry confidence factors. An example is used to demonstrate the method's validity. First, we applied fuzzy set theory to transform real-world data into fuzzy linguistic forms. Second, we used information theory to construct a decision tree. Finding the best split point and performing the split are the main tasks in a decision tree induction method. Through the integration of fuzzy set theory and information theory, classification tasks originally thought too difficult or complex become feasible, and the method provides an alternative for evaluating the best possible candidates.


References:
[1] M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, Wiley, 2003.
[2] U.M. Fayyad, G. Piatetsky-Shapiro and P. Smyth, From Data Mining to Knowledge Discovery, in Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.
[3] M. Negnevitsky, Artificial Intelligence, Addison Wesley, 2002.
[4] S.J. Russell, P. Norvig, et al., Artificial Intelligence: A Modern Approach, Prentice-Hall, Englewood Cliffs, NJ, 1995.
[5] H.J. Zimmermann, Fuzzy Set Theory and Its Applications, Kluwer Academic Publishers, 1991.
[6] J.-S.R. Jang, Self-Learning Fuzzy Controllers Based on Temporal Back Propagation, IEEE Trans. on Neural Networks, Vol. 3, September 1992, pp. 714-723.
[7] S. Horikawa, T. Furuhashi and Y. Uchikawa, On Fuzzy Modeling Using Fuzzy Neural Networks with the Back-Propagation Algorithm, IEEE Trans. on Neural Networks, Vol. 3, September 1992, pp. 801-806.
[8] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA, 1993.
[9] S.-T. Tsai and C.-T. Yang, Decision Tree Construction for Data Mining on Grid Computing, Proc. IEEE International Conference on e-Technology, e-Commerce and e-Service, 2004.
[10] C.Z. Janikow, Fuzzy Decision Trees: Issues and Methods, IEEE Trans. on Systems, Man, and Cybernetics - Part B, Vol. 28, No. 1, February 1998, pp. 1-14.
[11] J. Jang, Structure Determination in Fuzzy Modeling: A Fuzzy CART Approach, Proc. IEEE Conf. on Fuzzy Systems, 1994, pp. 480-485.


[12] R.L.P. Chang and T. Pavlidis, Fuzzy Decision Tree Algorithms, IEEE Trans. on Systems, Man, and Cybernetics, Vol. 7, No. 1, 1977, pp. 28-35.
[13] Keon-Myung Lee, Kyung-Mi Lee, Jee-Hyong Lee and Hyung Lee-Kwang, A Fuzzy Decision Tree Induction Method for Fuzzy Data, Proc. IEEE International Fuzzy Systems Conference, Vol. 1, August 1999, pp. 16-21.
[14] T.P. Hong, C.H. Chen, Y.L. Wu and Y.C. Lee, Using Divide-and-Conquer GA Strategy in Fuzzy Data Mining, Proc. Ninth IEEE Symposium on Computers and Communications, 2004.

    Proceedings of the 5th WSEAS International Conference on Applied Computer Science, Hangzhou, China, April 16-18, 2006 (pp306-311)