
A Survey of Information Theory Application on Data Mining

T. Y. HSU 662096093

ECE 534 Final Project

Abstract:

In the data mining area, "classification" is one of the most important issues, and decision tree generation is a very useful and reliable solution to it. There are several ways to construct a decision tree; among them, information theory provides a very effective and scalable method. This is a survey project in information theory: we focus on the generation of decision trees for classification based on information theory. Ten major references are included. The topics covered include discrete-valued attributes, continuous-valued attributes, decision trees, decision forests, feature extraction, data compression, single class labels, and multiple class labels.

1. Building Decision Trees with the ID3 Algorithm[2]

Abstract

This paper details the Iterative Dichotomiser 3 (ID3) classification algorithm, invented by Ross Quinlan[1], which is used to generate a decision tree. ID3 builds a decision tree from a known set of training examples. Each tuple in the training examples has multiple attributes and is marked with a class (e.g., TRUE/FALSE). ID3 then builds up a decision tree, which is used to classify other testing examples. Each leaf node in the decision tree contains a class name. Each non-leaf node is viewed as a decision node, i.e., an attribute test with one branch for each possible value of that attribute. In ID3, information gain (the expected mutual information) is used to decide which attribute belongs to a decision node.

Introduction

ID3 was originally developed at the University of Sydney. Ross Quinlan first presented ID3 in 1975 in a book, Machine Learning. The main method of ID3 is based on the Concept Learning System (CLS) algorithm. The CLS approach over a set of training examples C is as follows:

Step 1: If all examples in C are positive, create a TRUE node and stop. If all examples in C are negative, create a FALSE node and stop. Otherwise, choose a feature F with values v_1, . . . , v_n and create a decision node.

Step 2: Divide the training examples in C into subsets C_1, C_2, . . . , C_n according to their values of F.

Step 3: Recursively apply the algorithm to each of the subsets C_i.
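The recursion above translates almost directly into code. Below is a minimal Python sketch of the CLS procedure, assuming a hypothetical representation in which each example is a dict of feature values plus a boolean "label" entry; the choose_feature argument stands in for whatever selection heuristic is used (CLS leaves the choice open, ID3 fills it with information gain).

# Minimal CLS sketch. Hypothetical data layout: each example is a dict of
# feature values plus a boolean "label"; choose_feature(examples, features)
# returns the feature to test at this node.

def cls(examples, features, choose_feature):
    labels = [e["label"] for e in examples]
    if all(labels):                       # Step 1: all positive -> TRUE leaf
        return True
    if not any(labels):                   # all negative -> FALSE leaf
        return False
    if not features:                      # no features left: fall back to majority
        return sum(labels) >= len(labels) / 2
    f = choose_feature(examples, features)
    node = {"feature": f, "branches": {}}
    for v in {e[f] for e in examples}:    # Step 2: split C into C_1..C_n by value of F
        subset = [e for e in examples if e[f] == v]
        remaining = [g for g in features if g != f]
        node["branches"][v] = cls(subset, remaining, choose_feature)  # Step 3: recurse
    return node

ID3 is obtained by plugging an information-gain-based choose_feature into this skeleton.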

ID3 is developed from CLS by adding a feature-selection heuristic. ID3 searches through all attributes of the training examples and selects the attribute that best separates the given training examples. If that attribute classifies the training set well, ID3 stops; otherwise, it recursively partitions the subsets to find the best attribute. ID3 uses a greedy approach: it always picks the currently best attribute and never backtracks to reconsider previous decisions.

Data Description

The training example data used by ID3 must satisfy some requirements:

Attribute-value: each attribute has a fixed number of values.
Predefined classes: the classes of the examples have to be predefined.
Discrete classes: continuous classes are not directly allowed.
Sufficient examples: there must be enough training data, since ID3 tries to distinguish useful patterns from chance occurrences.

Attribute Selection

This section presents how ID3 decides the best attribute. A statistical quantity from information theory, called information gain, is used. Information gain measures the reduction in impurity or disorder achieved by a specific attribute; the attribute with the highest information gain is chosen. Information gain is the most popular impurity function used for decision tree learning and was introduced by ID3. The information gain measure is based on the entropy function from information theory:

Given a collection D of class outcomes,

$$\mathrm{Entropy}(D) = -\sum_{j=1}^{|C|} \Pr(c_j)\log \Pr(c_j), \qquad \sum_{j=1}^{|C|} \Pr(c_j) = 1,$$

where $\Pr(c_j)$ is the probability of class $c_j$ in the training dataset D, defined as

$$\Pr(c_j) = \frac{\text{number of examples of class } c_j \text{ in } D}{\text{total number of examples in } D}.$$

In order to know which attribute reduces the impurity most when it is used to divide D, every attribute is evaluated. Assume that the number of possible values of the attribute $A_i$ is $v$ and that $A_i$ is used to divide the data set D into $v$ disjoint subsets $D_1, D_2, \ldots, D_v$. The conditional entropy (after partitioning) is


$$\mathrm{Entropy}_{A_i}(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Entropy}(D_j)$$

As a result, the information gain of $A_i$ can be evaluated by

$$\mathrm{InfoGain}(D, A_i) = \mathrm{Entropy}(D) - \mathrm{Entropy}_{A_i}(D)$$

Clearly, the information gain is used to select the attribute $A_s$ leading to the largest reduction in impurity. If $\mathrm{InfoGain}(D, A_s)$ is too small, ID3 halts; otherwise, $A_s$ reduces the impurity enough to be used reliably as a decision node to separate the dataset. The algorithm then operates recursively to create decision subtrees. Note that for the subtree extensions, $A_s$ is not considered again.
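These two formulas are straightforward to code. The sketch below is a minimal Python version of Entropy(D) and InfoGain(D, A_i), assuming a hypothetical data layout in which each example is a (features, label) pair with the features stored in a dict; it is not taken from any ID3 source code.

import math
from collections import Counter

def entropy(labels):
    """Entropy(D) = -sum_j Pr(c_j) * log2 Pr(c_j) over the class labels in D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, attribute):
    """InfoGain(D, A_i) = Entropy(D) - sum_j |D_j|/|D| * Entropy(D_j),
    where D is partitioned into D_1..D_v by the values of `attribute`.
    `examples` is a list of (features_dict, label) pairs."""
    n = len(examples)
    labels = [label for _, label in examples]
    partition = {}
    for features, label in examples:
        partition.setdefault(features[attribute], []).append(label)
    remainder = sum(len(subset) / n * entropy(subset) for subset in partition.values())
    return entropy(labels) - remainder

ID3's attribute selection then amounts to max(attributes, key=lambda a: info_gain(examples, a)).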

Conclusion

The algorithm given above shows how ID3 creates a reliable decision tree. Conceptually, this algorithm is a simple but powerful classification approach.

2. C4.5: Programs for Machine Learning[3]

Abstract

The construction of decision trees is widely used across machine learning research. Besides ID3, C4.5, also developed by Ross Quinlan[3], is probably the most well-known algorithm for generating a decision tree in the machine learning area. C4.5 is a successor of the ID3 algorithm. The decision trees constructed by C4.5 can be used for classification, and C4.5 is also considered a statistical classifier. In Quinlan's book, C4.5: Programs for Machine Learning, he discusses a wide range of issues about decision trees, from methods for creating an initial tree to ways of pruning, converting trees to rules, and dealing with various special scenarios.

Introduction

C4.5 constructs decision trees from a set of training data in the same way as ID3, using the statistical property of information entropy. The training data is a set D. Each tuple example $X_i = (A_1, A_2, \ldots)$ is a vector where $A_1, A_2, \ldots$ represent attributes or features of the tuple. Each example in the training data belongs to a predefined class $c_j \in C = \{c_1, c_2, \ldots\}$. At each node of the decision tree, C4.5 selects the attribute of the data that most effectively splits its set of examples into subsets. Its criterion is the normalized information gain (difference in entropy) resulting from choosing an attribute to partition the data set. The attribute with the highest normalized information gain is selected to make the decision. The C4.5 algorithm then recursively runs on the smaller sublists.

In C4.5, there are a few base cases:

When all the examples in the list belong to the same class, it simply creates a leaf node for the decision tree labeled with that class.

If none of the features provides any information gain, or an instance of a previously unseen class occurs, C4.5 creates a decision node higher up the decision tree using the expected value of the class.

The above cases are almost the same as in the ID3 algorithm. However, C4.5 improves on ID3 in several ways. The major improvements are as follows:

Handling both continuous and discrete attributes (to deal with continuous attributes, C4.5 creates a threshold and then splits the list into those examples whose attribute value is above the threshold and those that are less than or equal to it[4]).

Handling training data with missing attribute values (missing attribute values are allowed and simply not used in gain and entropy calculations).

Handling attributes with differing costs.

Besides, in C4.5, the information gain ratio is used for attribute selection rather than the information gain. The information gain ratio is defined as

$$\mathrm{InfoGainRatio}(D, A_i) = \frac{\mathrm{InfoGain}(D, A_i)}{-\sum_{j=1}^{s} \frac{|D_j|}{|D|}\log\frac{|D_j|}{|D|}}$$

where $s$ is the number of possible values of $A_i$ and $D_j$ is the corresponding subset of D. The attribute with the highest InfoGainRatio value is then chosen to extend the decision tree. The information gain ratio corrects an extreme situation: if $A_i$ has very many possible values and the conditional entropy equals 0, then the information gain obtained by using this attribute is maximal, yet the partition could be useless. We will use an obvious example to explain this later.
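Continuing the earlier sketch (and reusing its entropy() and info_gain() helpers on the same hypothetical (features, label) layout), the gain ratio is just the gain divided by the split information:

import math
from collections import Counter

def gain_ratio(examples, attribute):
    """InfoGainRatio(D, A_i) = InfoGain(D, A_i) / SplitInfo(D, A_i), where
    SplitInfo is the entropy of the partition sizes induced by `attribute`.
    Reuses info_gain() from the earlier sketch."""
    n = len(examples)
    sizes = Counter(features[attribute] for features, _ in examples)
    split_info = -sum((s / n) * math.log2(s / n) for s in sizes.values())
    if split_info == 0:          # attribute takes a single value: no useful split
        return 0.0
    return info_gain(examples, attribute) / split_info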

Conclusion

Decision trees have been widely used as classifiers in many real-world areas. The decision trees generated by C4.5 are reliably accurate, resulting in fast, valuable classifiers. These properties make decision trees a very popular and useful tool for classification.

Case study for ID3 and C4.5[5]

Example:

Table 1 shows a simple loan application data set with four attributes. The first attribute is Age (values: young, middle, and old). The second attribute is Has_job. The third attribute is Own_house. The fourth attribute is Credit_rating. The goal is that, when a new customer arrives, the classifier can predict whether the new customer's loan application should be approved or not.


Table 1

Let's calculate the information gain values for the Age, Has_job, Own_house, and Credit_rating attributes, in order to find the best root node of a decision tree. The entropy of D is

$$\mathrm{Entropy}(D) = -\tfrac{6}{15}\log_2\tfrac{6}{15} - \tfrac{9}{15}\log_2\tfrac{9}{15} = 0.971$$

Try Age:

$$\mathrm{Entropy}_{\mathrm{Age}}(D) = \tfrac{5}{15}\mathrm{Entropy}(D_1) + \tfrac{5}{15}\mathrm{Entropy}(D_2) + \tfrac{5}{15}\mathrm{Entropy}(D_3) = \tfrac{5}{15}(0.971) + \tfrac{5}{15}(0.971) + \tfrac{5}{15}(0.722) = 0.888$$

Try Own_house:

$$\mathrm{Entropy}_{\mathrm{Own\_house}}(D) = \tfrac{6}{15}\mathrm{Entropy}(D_1) + \tfrac{9}{15}\mathrm{Entropy}(D_2) = \tfrac{6}{15}(0) + \tfrac{9}{15}(0.918) = 0.551$$

Try Has_job:

$$\mathrm{Entropy}_{\mathrm{Has\_job}}(D) = \tfrac{5}{15}\mathrm{Entropy}(D_1) + \tfrac{10}{15}\mathrm{Entropy}(D_2) = 0.647$$

Try Credit_rating:

$$\mathrm{Entropy}_{\mathrm{Credit\_rating}}(D) = \tfrac{4}{15}\mathrm{Entropy}(D_1) + \tfrac{6}{15}\mathrm{Entropy}(D_2) + \tfrac{5}{15}\mathrm{Entropy}(D_3) = 0.608$$

Finally, the information gains for the attributes are:

$$\mathrm{InfoGain}(D, \mathrm{Age}) = 0.971 - 0.888 = 0.083$$
$$\mathrm{InfoGain}(D, \mathrm{Own\_house}) = 0.971 - 0.551 = 0.420$$
$$\mathrm{InfoGain}(D, \mathrm{Has\_job}) = 0.971 - 0.647 = 0.324$$
$$\mathrm{InfoGain}(D, \mathrm{Credit\_rating}) = 0.971 - 0.608 = 0.363$$

Obviously, Own_house is the best root attribute. Then, the decision tree is:

Since the left branch in the tree has only one class, it leads to a leaf node. On the other hand, the right branch requires further extension. The recursive procedure is the same as above, but only the subset of the data with Own_house = false is used.
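As a quick sanity check, the subset sizes and entropies quoted above can be recombined in a few lines of Python (the per-subset class counts for Has_job and Credit_rating are not reproduced here, so their conditional entropies are taken directly from the text):

import math

def H(*probs):
    """Entropy (log base 2) of a small probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_D = H(6 / 15, 9 / 15)                                     # ~0.971

H_age       = 5/15 * 0.971 + 5/15 * 0.971 + 5/15 * 0.722    # ~0.888
H_own_house = 6/15 * 0.0 + 9/15 * H(3/9, 6/9)               # ~0.551 (H(3/9, 6/9) ~ 0.918)
H_has_job   = 0.647                                         # value quoted in the text
H_credit    = 0.608                                         # value quoted in the text

for name, h in [("Age", H_age), ("Own_house", H_own_house),
                ("Has_job", H_has_job), ("Credit_rating", H_credit)]:
    print(f"InfoGain(D, {name}) = {H_D - h:.3f}")
# Prints 0.083, 0.420, 0.324 and 0.363, matching the figures above.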

In C4.5, the information gain ratio is used instead of the information gain. For the above data set, if we use an ID attribute to partition the data, each training example will belong to a subset with only one class label, so $\mathrm{Entropy}_{\mathrm{ID}}(D) = 0$. The ID attribute therefore leads to the maximal information gain, yet this partition is nonsensical. The information gain ratio corrects this issue by normalizing the information gain:

$$\mathrm{InfoGainRatio}(D, A_i) = \frac{\mathrm{InfoGain}(D, A_i)}{-\sum_{j=1}^{s} \frac{|D_j|}{|D|}\log\frac{|D_j|}{|D|}}$$

For the ID attribute above, the denominator is $\log|D|$; this denominator is called the split information in C4.5. After the normalization, InfoGainRatio(D, ID) is no longer maximal, and the issue is removed.

3. Re Optimization of ID3 and C4.5 Decision Tree[6]

Abstract

ID3 and C4.5 are widely used decision tree algorithms for supervised classification. In this paper, the authors attempt to re-optimize ID3 and C4.5 by modifying the attribute selection step in the construction of decision trees. This modification remedies the information gain computation in ID3 and the split information computation in C4.5. The paper shows that a better decision tree with higher classification accuracy can be obtained.

Introduction


This section focuses on the difference between this re-optimization approach and ID3/C4.5. In ID3 and C4.5, it has been shown that a primary key attribute is never useful in classification: if the split criterion selects a primary key attribute, the decision node has as many partitions as tuples, each containing a single example. Therefore, a prepruning strategy is to remove the primary key attribute. Beyond that, this paper changes the measure of attribute selection itself: it first selects an attribute and then enumerates all value subsets of that attribute. Using the same formulas as in ID3 and C4.5, the information gain ratio is calculated for each subset, and the subset with the highest gain ratio is chosen as the decision point.
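The paper does not spell out the enumeration in pseudocode; one plausible reading is sketched below, reusing entropy() from the earlier sketch on the same hypothetical (features, label) layout: every value subset of the chosen attribute induces a binary split (value in subset vs. not), and the subset with the highest gain ratio wins.

from itertools import combinations

def best_value_subset(examples, attribute):
    """Sketch of subset-based selection: enumerate proper, non-empty value subsets
    of `attribute`, score each binary split by its gain ratio, keep the best."""
    values = sorted({features[attribute] for features, _ in examples})
    labels = [label for _, label in examples]
    n = len(examples)
    best_subset, best_ratio = None, float("-inf")
    for r in range(1, len(values)):
        for subset in combinations(values, r):
            inside  = [l for f, l in examples if f[attribute] in subset]
            outside = [l for f, l in examples if f[attribute] not in subset]
            gain = entropy(labels) - (len(inside) / n * entropy(inside)
                                      + len(outside) / n * entropy(outside))
            split_info = entropy(["in"] * len(inside) + ["out"] * len(outside))
            ratio = gain / split_info if split_info > 0 else 0.0
            if ratio > best_ratio:
                best_subset, best_ratio = set(subset), ratio
    return best_subset, best_ratio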

Conclusion

This paper provides a new method of attribute selection and a simple prepruning strategy to improve classification accuracy over ID3 and C4.5.

4. A Maximum Contribution Method for Classification Based on Information Theory[7]

Abstract

Based on information theory, this paper proposes a maximum contribution method for classification. Drawing on the theory of channel capacity, the notion of contribution is developed from probability distributions, probability transfer matrices, and the mutual information between feature spaces and class spaces.

Introduction

As in the earlier case study, every tuple entity is described by m features $V_k$ (k = 1, 2, ..., m). The possible values of $V_k$ are $v_{k1}, v_{k2}, \ldots, v_{kq_k}$; for convenience, a value of $V_k$ is written $v_{kj_k}$ ($j_k = 1, 2, \ldots, q_k$). Assume that the class set is U with values $u_1, u_2, \ldots, u_r$. The major task is to study all tuple entities with their m features and observed class labels in U, and then derive a classification rule. The label in U can then be predicted from the observed feature values using this rule. U can be viewed as the sender and $V_k$ as the receiver. From information theory, the probability transfer matrix can be defined as

$$P(V_k|U) = \begin{pmatrix} p(v_{k1}|u_1) & p(v_{k2}|u_1) & \cdots & p(v_{kq_k}|u_1) \\ p(v_{k1}|u_2) & p(v_{k2}|u_2) & \cdots & p(v_{kq_k}|u_2) \\ \vdots & \vdots & & \vdots \\ p(v_{k1}|u_r) & p(v_{k2}|u_r) & \cdots & p(v_{kq_k}|u_r) \end{pmatrix}$$

Besides, the average mutual information between $V_k$ and U is $I(U, V_k)$, and the entropy of $V_k$ is $H(V_k)$. The relational factor of $u_i$ and $v_{kj_k}$ is defined as $R_{ik} = p(v_{kj_k}|u_i)/p(v_{kj_k})$. Based on these definitions, the information gain ratio is $G_k = I(U, V_k)/H(V_k)$. The weighted sum of $R_{ik}$ and $G_k$ is

$$A_i = \sum_k R_{ik} G_k$$

$A_i$ is called the contribution of a group of feature values $(v_{1j_1}, v_{2j_2}, \ldots, v_{mj_m})$ given $u_i$ ($i = 1, 2, \ldots, r$). $A_i$ is an integrated measure of information content, while $R_{ik}$ reflects the transfer probability. Therefore, if there exists an $i'$ such that

$$A_{i'} = \max_i A_i$$

then a tuple example with values $v_{1j_1}, v_{2j_2}, \ldots, v_{mj_m}$ is classified as $u_{i'}$. Let's use the following example to understand the whole approach.

Table 2

If a person has medium income, is 30-40 years old, has below a college degree, and is female, which class does the person belong to? The probability distribution of the class domain is


$$P(\text{willing to buy computer}) = \tfrac{9}{14}, \qquad P(\text{not willing to buy computer}) = \tfrac{5}{14}$$

Each feature's probability transfer matrix is

$$P(\text{income}|U) = \begin{pmatrix} \tfrac{2}{9} & \tfrac{4}{9} & \tfrac{3}{9} \\ \tfrac{3}{5} & 0 & \tfrac{2}{5} \end{pmatrix}, \qquad P(\text{age}|U) = \begin{pmatrix} \tfrac{3}{9} & \tfrac{4}{9} & \tfrac{2}{9} \\ \tfrac{1}{5} & \tfrac{2}{5} & \tfrac{2}{5} \end{pmatrix}$$

$$P(\text{educational level}|U) = \begin{pmatrix} \tfrac{3}{9} & \tfrac{6}{9} \\ \tfrac{4}{5} & \tfrac{1}{5} \end{pmatrix}, \qquad P(\text{sex}|U) = \begin{pmatrix} \tfrac{3}{9} & \tfrac{6}{9} \\ \tfrac{3}{5} & \tfrac{2}{5} \end{pmatrix}$$

The class entropy is

$$H(\text{willing to buy computer}) = 0.94$$

Each feature's mutual information is

$$I(\text{income}) = H(\text{willing to buy computer}) - H(\text{willing to buy computer}\,|\,\text{income}) = 0.246$$
$$I(\text{age}) = H(\text{willing to buy computer}) - H(\text{willing to buy computer}\,|\,\text{age}) = 0.029$$
$$I(\text{educational level}) = H(\text{willing to buy computer}) - H(\text{willing to buy computer}\,|\,\text{educational level}) = 0.151$$
$$I(\text{sex}) = H(\text{willing to buy computer}) - H(\text{willing to buy computer}\,|\,\text{sex}) = 0.048$$

Finally, if the person belongs to the "buy" class, the contribution is A(yes) = 0.52; if the person belongs to the "not buy" class, the contribution is A(no) = 0.112. According to the maximum contribution method, the person should belong to the "buy" class.
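A generic implementation of the method is short. The sketch below is a hedged Python version in which the class priors, per-feature conditional probability tables, and the query are passed in as plain dicts (an illustrative layout, not the paper's); it computes G_k = I(U; V_k)/H(V_k), R_ik = p(v|u_i)/p(v), and the contributions A_i, and returns the class with the largest contribution.

import math

def entropy_bits(probs):
    """Shannon entropy (bits) of a probability vector, ignoring zero entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def feature_gain_ratio(priors, cond):
    """G_k = I(U; V_k) / H(V_k), where cond[u][v] = p(v | u)."""
    values = {v for u in cond for v in cond[u]}
    p_v = {v: sum(priors[u] * cond[u].get(v, 0.0) for u in priors) for v in values}
    H_V = entropy_bits(p_v.values())
    H_U = entropy_bits(priors.values())
    H_U_given_V = sum(
        p_v[v] * entropy_bits([priors[u] * cond[u].get(v, 0.0) / p_v[v] for u in priors])
        for v in values if p_v[v] > 0)
    return (H_U - H_U_given_V) / H_V if H_V > 0 else 0.0

def max_contribution_classify(priors, transfer, query):
    """Classify query = {feature: observed value} by A_i = sum_k R_ik * G_k."""
    contributions = {}
    for u in priors:
        A = 0.0
        for k, v in query.items():
            cond = transfer[k]
            p_v = sum(priors[w] * cond[w].get(v, 0.0) for w in priors)
            if p_v == 0:
                continue
            R = cond[u].get(v, 0.0) / p_v          # relational factor R_ik
            A += R * feature_gain_ratio(priors, cond)
        contributions[u] = A
    return max(contributions, key=contributions.get), contributions

Feeding in the priors and transfer matrices above (together with the column-to-value mapping that Table 2 provides in the paper) would reproduce the kind of A(yes) vs. A(no) comparison made in the example.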

Conclusion

This method is derived from information theory. It uses the probability transfer matrix, entropy, and conditional entropy to create a reliable model for classification.

5. Hierarchical Classifier Design Using Mutual Information[8]

Abstract

This paper proposes an efficient algorithm to generate a partitioning decision tree by maximizing the average mutual information gain at each partitioning decision node.

Introduction

As with ID3 and C4.5, this algorithm is based on the concept of mutual information and is applied to pattern classification or pattern recognition with multiple features and multiple classes. Let C denote the class label, m = |C|, X the feature event, and $P_e$ the probability of error allowed in classification; then, by Fano's inequality,

$$H(C|X) \le H(P_e) + P_e \log(m - 1)$$

From the definition of mutual information,

$$I(C; X) = H(C) - H(C|X)$$

$$I(C; X) \ge H(C) - H(P_e) - P_e \log(m - 1)$$

Denote the minimum mutual information

$$I_{\min} = -\sum_{i=1}^{m} p(c_i)\log p(c_i) + P_e \log P_e + (1 - P_e)\log(1 - P_e) - P_e \log(m - 1)$$

Consider a classification problem with m pattern classes and n features. A partition using hyperplanes parallel to the feature axes can be expressed as a binary tree, as in the following example.

The mutual information at the decision node $l_k$ is

$$I_k(C_k; X_k) = \sum_{C_k, X_k} p(c_{ki}, x_{kj}) \log \frac{p(c_{ki}, x_{kj})}{p(c_{ki})\,p(x_{kj})}$$

Therefore, the mutual information between the class set C and the decision tree T is

$$I(C; T) = \sum_k p_k I_k(C_k; X_k)$$

From a given I(C; T), $P_e$ can be determined; from a given $P_e$, a bound on I(C; T) can be provided. The main idea of the algorithm is that each partition maximizes the mutual information, or equivalently minimizes $P_e$. This is carried out recursively by the following partitioning approach. Given a data set with n features and m classes and a specified error criterion $P_e$, a threshold $t_{kj}$ is selected for the jth feature at the kth partition step so that $p_k I_k(C_k; X_k)$ is maximal over all possible i and j. The recursive partitioning algorithm is as follows:

(1) Initialization: specify $P_e$ and compute $I_{\min}$.

(2) Looping: order the samples on each feature axis; let $Y_{ij}$ be the ith-order sample on the jth feature axis. Examine all possible threshold points; a threshold point lies between $Y_{ij}$ and $Y_{i+1,j}$, i.e., $t_j = (x_{ij} + x_{i+1,j})/2$, where $x_{ij}$ is the jth feature of the ith-order sample $Y_{ij}$.

(3) Decision: by an approach similar to C4.5, determine the decision node NMAX from NODESET. If this node's corresponding mutual information is $\ge I_{\min}$, go to the next step; otherwise, terminate the algorithm.
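To make the two key quantities concrete, the sketch below shows, in Python with NumPy, (a) the Fano-derived stopping value I_min and (b) the scan over axis-parallel midpoint thresholds that maximizes the mutual information of a binary split at one node. The NODESET bookkeeping and the recursion itself are omitted, and the array layout is an assumption.

import math
import numpy as np

def H(probs):
    """Entropy (bits) of a probability vector."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def i_min(class_priors, Pe):
    """I_min = H(C) - H(Pe) - Pe*log2(m-1), from Fano's inequality."""
    m = len(class_priors)
    H_Pe = H([Pe, 1 - Pe]) if 0 < Pe < 1 else 0.0
    penalty = Pe * math.log2(m - 1) if m > 2 else 0.0
    return H(class_priors) - H_Pe - penalty

def best_threshold(X, y):
    """Return (feature j, threshold t, mutual information) of the best
    axis-parallel binary split of samples X (n x d) with labels y."""
    n, d = X.shape
    classes = np.unique(y)
    H_C = H([np.mean(y == c) for c in classes])
    best = (None, None, -1.0)
    for j in range(d):
        xs = np.unique(X[:, j])
        for a, b in zip(xs[:-1], xs[1:]):
            t = (a + b) / 2.0                      # midpoint t = (x_ij + x_{i+1,j}) / 2
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            cond = sum(len(side) / n * H([np.mean(side == c) for c in classes])
                       for side in (left, right) if len(side) > 0)
            if H_C - cond > best[2]:
                best = (j, t, H_C - cond)
    return best

In the full algorithm this scan is repeated node by node until the accumulated mutual information reaches the I_min target.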

Conclusion

This algorithm maximizes the mutual information at each decision step and can be used with multiple features and multiple classes.

6. Using Information Gain to Build Meaningful Decision Forests for Multilabel Classification[9]

Abstract

In supervised learning, a large amount of research focuses on the analysis of single-class-label data, where training examples carry a single class label from a set of disjoint labels C. However, in the real world there are many kinds of training examples associated with a set of labels ⊆ C; such data are called multiple class label data. When the information gain at a decision node would be higher, all examples with a specific classification are removed and reserved for another tree. In this way, the algorithm in this paper separates the classes into different categories.

Introduction

There exists a set of functions $(f_1, f_2, \ldots, f_N)$, each of which maps an object vector x to a class label in its corresponding category $C_i$ ($f_i : \mathbb{R}^n \to C_i$). The algorithm is almost the same as ID3 or C4.5, but with the following distinction: if the information gain would not be greater at a decision node, then the corresponding label is removed from the tree and reserved for other trees. The entropy at a specific node of the tree is

$$H(T) = -\sum_i P(l_i) \log P(l_i)$$


where T is a list of paired input vectors and class labels and $l_i$ is the label event. If a binary decision D partitions the data set into subsets $T_p$ and $T_n$, the conditional entropy is

$$H(T|D) = \frac{|T_p|}{|T|} H(T_p) + \frac{|T_n|}{|T|} H(T_n)$$

and the information gain is

$$\mathrm{InfoGain}(D, T) = H(T) - H(T|D)$$

The paper also defines another information gain,

$$\mathrm{InfoGain}(D, T_{k'}) = H(T_{k'}) - H(T|D_{k'})$$

where $T_{k'}$ is the set of all examples without label $l_k$. All examples with labels for which $\mathrm{InfoGain}(D, T_{k'}) < \mathrm{InfoGain}(D, T)$ are removed from the tree and reserved for the next tree, which is then constructed by the same approach.
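A compact sketch of that label-deferral test is given below, reusing entropy() from the first sketch. For simplicity it treats the data as (x, label) pairs and the decision D as a Boolean predicate on x; the exact multilabel bookkeeping of [9] is not reproduced, so treat this as one possible reading.

def binary_info_gain(pairs, decision):
    """InfoGain(D, T) = H(T) - H(T|D) for a binary decision over (x, label) pairs."""
    n = len(pairs)
    labels = [l for _, l in pairs]
    pos = [l for x, l in pairs if decision(x)]
    neg = [l for x, l in pairs if not decision(x)]
    return entropy(labels) - (len(pos) / n * entropy(pos) + len(neg) / n * entropy(neg))

def labels_to_defer(pairs, decision):
    """Labels l_k whose removal set T_k' satisfies
    InfoGain(D, T_k') < InfoGain(D, T) are deferred to the next tree."""
    full_gain = binary_info_gain(pairs, decision)
    deferred = []
    for lk in {l for _, l in pairs}:
        without_lk = [(x, l) for x, l in pairs if l != lk]
        if without_lk and binary_info_gain(without_lk, decision) < full_gain:
            deferred.append(lk)
    return deferred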

Conclusion

This algorithm addresses multiple-label classification. It generates not just one decision tree but a decision forest.

7. USING MUTUAL INFORMATION FOR SELECTING CONTINUOUS-VALUED[10]

Abstract

This paper uses information entropy minimization and mutual information maximization to deal with continuous-valued attributes. By using mutual information, it avoids selecting previously selected attributes during the construction of decision trees.

Introduction

In practice, this algorithm handles continuous attributes similarly to C4.5. For each continuous-valued condition attribute A, all examples are sorted, and the midpoint between each consecutive pair of values is a possible cut point; in this way, all possible cut points are obtained. For the evaluation of a possible cut point T, the data set is divided into two subsets. If a set S is the root node, $S_1 = \{e \mid e \in S, A(e) \le T\}$ and $S_2 = \{e \mid e \in S, A(e) > T\}$, and the information entropy resulting from T is

$$\mathrm{Entropy}(A, T, S) = \frac{|S_1|}{|S|}\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\mathrm{Entropy}(S_2)$$

The best cut point $T_A$ then makes this entropy minimal among all possible cut points. However, if the set S is not the root node, the class information entropy is

$$F(A, T_A, S) = \mathrm{Entropy}(A, T_A, S) + \alpha \sum_{q \in Q} \arctan I(A, q)$$

where $\alpha$ is a constant, Q is the set of condition attributes already used at previous decision nodes, and q ranges over those attributes. In that case, the best cut point $T_A$ is the one that minimizes F among all possible cut points. The procedure is applied recursively until all nodes become leaf nodes.
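The root-node part of this procedure (sort, take midpoints, minimize the weighted entropy) is sketched below, reusing entropy() and the (features, label) layout from earlier; the non-root penalty term with arctan I(A, q) would simply be added to the score.

def best_cut_point(examples, attribute):
    """Return the midpoint threshold minimizing
    |S1|/|S| * Entropy(S1) + |S2|/|S| * Entropy(S2) for a continuous attribute."""
    points = sorted({f[attribute] for f, _ in examples})
    n = len(examples)
    best_t, best_e = None, float("inf")
    for a, b in zip(points[:-1], points[1:]):
        t = (a + b) / 2.0                              # midpoint cut point
        s1 = [l for f, l in examples if f[attribute] <= t]
        s2 = [l for f, l in examples if f[attribute] > t]
        e = len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e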


Conclusion

This paper uses the idea of mutual information from information theory to construct a reliable and useful decision tree, focusing in particular on continuous-valued attributes.

8. Studies on incidence pattern recognition based on information entropy[11]

Abstract

This paper defines an objective weight called the IEW (information entropy weight) to measure the importance of different features. Besides, based on grey relational analysis, it proposes the notion of IID (information incidence degree). Using IEW and IID, it provides an effective algorithm for incidence-based supervised pattern classification.

Introduction

Assume that there are m class patterns $(A_1, A_2, \ldots, A_m)$ with n features $(X_1, X_2, \ldots, X_n)$, such that $x_{ij}$ is the observed value of the ith entity on the jth feature. The training data matrix can be expressed as

$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix} = \begin{pmatrix} A_1 \\ A_2 \\ \vdots \\ A_m \end{pmatrix}$$

where $p_{ij} \equiv x_{ij} / \sum_{i=1}^{m} x_{ij}$. Thus, the information entropy of the jth feature is

$$e_j = -\frac{1}{\log m} \sum_{i=1}^{m} p_{ij} \log p_{ij}$$

and the difference degree of the jth feature is $g_j \equiv 1 - e_j$ (i.e., $g_j$ reflects how much the jth feature's values differ). Based on $g_j$, the IEW is defined as

$$w_j = g_j \Big/ \sum_{j=1}^{n} g_j$$

Based on the IEW, a diagonal weight matrix is formed:

$$\mathrm{diag}(w_1, w_2, \ldots, w_n)$$

To reduce the influence of differences between the characteristic indexes, the data should be pretreated. For benefit (efficiency) indexes,


$$y_{ij} = \frac{x_{ij} - \min_i x_{ij}}{\max_i x_{ij} - \min_i x_{ij}}$$

For cost indexes,

$$y_{ij} = \frac{\max_i x_{ij} - x_{ij}}{\max_i x_{ij} - \min_i x_{ij}}$$

After this pretreatment, the new data matrix is $Y = (y_{ij})_{m \times n}$, and the final weighted data matrix Z is defined as

$$Z = (y_{ij} w_j)_{m \times n}$$

At the next step, the incidence coefficients are generated. Assume that a test pattern is $A_0 = (x_{01}, x_{02}, \ldots, x_{0n})$; then the incidence coefficient is defined as

$$\xi_{0i}(k) = \frac{\Delta_{\min} + \rho\,\Delta_{\max}}{\Delta_{0i}(k) + \rho\,\Delta_{\max}}$$

at the kth point between sequence $A_i$ and $A_0$, where $\rho$ is the distinguishing coefficient and

$$\Delta_{\min} = \min_i \min_k |x_{0k} - x_{ik}|, \qquad \Delta_{\max} = \max_i \max_k |x_{0k} - x_{ik}|, \qquad \Delta_{0i}(k) = |x_{0k} - x_{ik}|$$

The difference degree between $A_i$ and $A_0$ is defined as

$$d_{0i} = \sqrt[p]{\sum_{k=1}^{n} \big|w_k (1 - \xi_{0i}(k))\big|^p}$$

where p is the distance parameter. Based on the above definitions, the IID can be expressed as

$$\varepsilon_{0i} = 1 - d_{0i}$$

Apparently, the bigger $\varepsilon_{0i}$, the more similar the patterns $A_0$ and $A_i$. In other words, if

$$\varepsilon_{0l} = \max(\varepsilon_{01}, \varepsilon_{02}, \ldots, \varepsilon_{0m})$$

the test pattern $A_0$ should belong to the lth training class pattern.
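The whole chain (entropy weights, relational coefficients, incidence degree) fits in a short NumPy sketch. The symbols ξ, ε and the distinguishing coefficient ρ follow the reconstruction above, the pretreatment step is assumed to have been applied already, and the function and parameter names are illustrative.

import numpy as np

def iew(X):
    """Information entropy weights w_j from the (m x n) data matrix X:
    p_ij = x_ij / sum_i x_ij, e_j = -(1/log m) sum_i p_ij log p_ij,
    g_j = 1 - e_j, w_j = g_j / sum_j g_j."""
    X = np.asarray(X, dtype=float)
    m, _ = X.shape
    P = X / X.sum(axis=0, keepdims=True)
    logP = np.where(P > 0, np.log(np.where(P > 0, P, 1.0)), 0.0)
    e = -(P * logP).sum(axis=0) / np.log(m)
    g = 1.0 - e
    return g / g.sum()

def iid_classify(X_train, x_test, w, rho=0.5, p=2):
    """Classify x_test by the largest information incidence degree (IID)."""
    X_train = np.asarray(X_train, dtype=float)
    delta = np.abs(X_train - np.asarray(x_test, dtype=float))   # Delta_0i(k)
    dmin, dmax = delta.min(), delta.max()
    xi = (dmin + rho * dmax) / (delta + rho * dmax)              # incidence coefficients
    d = (np.abs(w * (1.0 - xi)) ** p).sum(axis=1) ** (1.0 / p)   # difference degree d_0i
    iid = 1.0 - d
    return int(np.argmax(iid)), iid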

Conclusion

Based on information entropy, this paper provides a useful IID indicator for pattern recognition.

9. A novel feature extraction algorithm[12]

Abstract

Based on information theory, this paper proposes a new concept of probability information distance (PID). In practice, PID can measure the variation between two different random variables. Using the PID criterion, the paper develops an effective algorithm for feature extraction.

Introduction


Assume that there are two classes ($C_1$ and $C_2$) and that $\{X_{jk}^{(i)}\}$ (class index i = 1, 2; data index $j = 1, \ldots, N_i$; feature index $k = 1, \ldots, n$) are squared-normalized data components with the property

$$\sum_{k=1}^{n} \big(X_{jk}^{(i)}\big)^2 = 1$$

Then, the square mean of each data component is

$$\mu_k^{(i)} = \frac{1}{N_i} \sum_{j=1}^{N_i} \big(X_{jk}^{(i)}\big)^2$$

Because the correlation between different features matters, each element of a symmetric matrix $G^{(i)}$ is defined as

$$g_{kl}^{(i)} = \frac{1}{N_i} \sum_{j=1}^{N_i} X_{jk}^{(i)} X_{jl}^{(i)}$$

where $g_{kk}^{(i)} = \mu_k^{(i)}$. Then a useful variant matrix A can be defined as

$$A = G^{(1)} - G^{(2)}$$

Find all eigenvalues of A and order them by their squared values,

$$\lambda_1^2 \ge \lambda_2^2 \ge \ldots \ge \lambda_n^2$$

The corresponding eigenvectors $u_1, \ldots, u_n$ can be used to construct an information compression matrix $T = (u_1, \ldots, u_d)$. By the transformation

$$y = T' x$$

where x is the input data, compressed data can be extracted. However, how should d be determined?

The total sum V is defined as

$$V = \sum_{k=1}^{n} \lambda_k^2$$

Then the variance square ratio is

$$V_d = \frac{\sum_{i=1}^{d} \lambda_i^2}{V}$$

According to the theory of information compression, when $V_d \ge 80\%$, effective feature extraction is achieved.
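The construction of T is a plain symmetric eigenproblem and is easy to sketch in NumPy. Below, X1 and X2 are assumed to be (N_i x n) arrays of squared-normalized samples for the two classes, and d is chosen as the smallest value with V_d at or above the given ratio; the function name and interface are illustrative.

import numpy as np

def pid_compression_matrix(X1, X2, ratio=0.8):
    """Build the compression matrix T = (u_1 ... u_d) from A = G(1) - G(2)."""
    G1 = X1.T @ X1 / X1.shape[0]           # g_kl^(1) = (1/N1) sum_j X_jk X_jl
    G2 = X2.T @ X2 / X2.shape[0]
    A = G1 - G2                            # variant matrix
    eigvals, eigvecs = np.linalg.eigh(A)   # A is symmetric
    order = np.argsort(eigvals ** 2)[::-1] # order by squared eigenvalue
    lam2 = eigvals[order] ** 2
    V = lam2.sum()
    d = int(np.searchsorted(np.cumsum(lam2) / V, ratio) + 1)  # smallest d with V_d >= ratio
    return eigvecs[:, order[:d]], d

# Compressed data for a sample x (or row-wise for a matrix X): y = T.T @ x, i.e. Y = X @ T.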

Case Study:

The following two tables show the mean temperatures of Spanish cities in 1989. All data have been normalized. The cities in the first class belong to the inland category; the cities in the second class belong to the coastline category.


Class 1 City

Class 2 City

According to the above approach, $\lambda_1, \ldots, \lambda_{12}$ and their corresponding eigenvectors can be found. Since $\lambda_1$ and $\lambda_2$ are large enough, T can be defined as $(u_1, u_2)$. The final compressed results are as follows:

Obviously, the data compression purpose could be reached.

10. Diversity and complexity of HIV-1 drug resistance: A bioinformatics approach to predicting phenotype from genotype[13]


Abstract

This paper focuses on the medical problem of drug resistance testing. For the various drugs, complex and varying patterns can be observed. Based on this information content, decision tree classifiers were generated to identify genotypic patterns characteristic of resistance or susceptibility to the various drugs. The core approach comes from fundamental information theory.

Introduction

Using mutual information, a decision tree is designed as the classification model. Each classification is performed by passing through the tree from the root node to a leaf node according to the amino acids (values) at the sequence positions (features) of the sample that appear on that path. The classifiers were created by recursively splitting the training sample data set; all new subsets are processed in the same way. To determine the split point, the normalized mutual information $I(X_i, Y)/H(X_i)$ (i.e., the information gain ratio) is calculated on the subset to be partitioned. This ratio represents the information content contributed by the partition for classification. The attribute for which this ratio is maximal (subject to a constraint on the information gain) is selected as the decision node. Unknown attribute values are allowed in this paper, and their values are imputed from the known values. Decision trees are pruned to avoid overfitting. At each leaf of the decision trees, the number of samples correctly classified by that leaf and the number of errors estimated to occur on unseen samples are also reported.

Discussion

Here we focus only on the application of mutual information in the medical area and on the procedure by which information theory is applied. From the results in this paper (one example is shown in the figure below), the effectiveness appears to be good.


[Figure: results from the generated decision tree, leave-one-out experiments, and an example.]

Reference:

[1] Quinlan, J. R. Induction of Decision Trees. Machine Learning 1, 1 (Mar. 1986), 81-106.
[2] Colin, A. Building Decision Trees with the ID3 Algorithm. Dr. Dobb's Journal, June 1996.
[3] Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
[4] Quinlan, J. R. Improved Use of Continuous Attributes in C4.5. Journal of Artificial Intelligence Research, 4:77-90, 1996.
[5] Liu, B. Web Data Mining.
[6] Thakur, D., Markandaiah, N., Sharan, R. D. Re Optimization of ID3 and C4.5 Decision Tree.
[7] Lin, K., Xue, Y., Wen, J. A Maximum Contribution Method for Classification Based on Information Theory.
[8] Sethi, I. K., Sarvarayudu, G. P. R. Hierarchical Classifier Design Using Mutual Information.
[9] Kevin G., Allison P. Using Information Gain to Build Meaningful Decision Forests for Multilabel Classification.
[10] Li, H., Wang, X. Z., Li, Y. Using Mutual Information for Selecting Continuous-Valued.
[11] Ding, S. F., Shi, Z. Z. Studies on Incidence Pattern Recognition Based on Information Entropy.
[12] Ding, S. F., Shi, Z. Z., Wang, Y. C., Li, S. S. A Novel Feature Extraction Algorithm.
[13] Niko B., Barbara S., Hauke W., Rolf K., Thomas L., Daniel H., Klaus K., and Joachim S. Diversity and Complexity of HIV-1 Drug Resistance: A Bioinformatics Approach to Predicting Phenotype from Genotype.