Feature Selection with Conditional Mutual
Information Maximin in Text Categorization
Department of Computer Science,
Hong Kong University of Science and Technology
Outline
Introduction
Information Theory Review
Conditional Mutual Information Maximin (CMIM)
Experimental Results
Conclusion
Introduction
The text categorization pipeline:
Preprocessing
Feature selection (the important step)
filter method
wrapper method
embedded method
Training the classifier
Testing
Classic Feature Selection Methods
Ranking Criterion
Information gain
Mutual information
χ² test
Drawback: each feature is scored on its own, so relationships among features are ignored
Documents  w1  w2  w3  class
d1          2   2   0   c1
d2          1   1   0   c1
d3          0   0   0   c2
d4          0   0   2   c2
(w1 and w2 are perfectly correlated: a ranking criterion scores them identically and would select both, although the second adds nothing over the first.)
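To make the drawback concrete, here is a minimal sketch (my own illustration, not from the slides) that ranks the table's features by their empirical mutual information with the class: w1 and w2 receive identical scores, so a top-2 ranking keeps two perfectly redundant features.

```python
from collections import Counter
from math import log2

# The toy table above: (w1, w2, w3, class) for documents d1..d4.
docs = [
    (2, 2, 0, "c1"),
    (1, 1, 0, "c1"),
    (0, 0, 0, "c2"),
    (0, 0, 2, "c2"),
]

def mutual_information(xs, ys):
    """Empirical I(X;Y) in bits, from paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

labels = [d[3] for d in docs]
for j, name in enumerate(("w1", "w2", "w3")):
    print(name, round(mutual_information([d[j] for d in docs], labels), 3))
# w1 and w2 both score 1.0 bit, so ranking selects both, even though w2 is
# a duplicate of w1; a conditional score would catch this: I(w2; C | w1) = 0.
```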
Information Theory Review
Entropy: H(X) = -Σ_x p(x) log p(x) = -E[log p(X)]
Mutual Information: I(X;Y) = E[log (p(X,Y) / (p(X) p(Y)))]
Conditional MI: I(X;Y|Z) = E[log (p(X,Y|Z) / (p(X|Z) p(Y|Z)))]
[Venn diagram of H(X) and H(Y), showing H(X|Y), I(X;Y), and I(X;Y|Z)]
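These quantities can be estimated from discrete samples via joint entropies, using the identities I(X;Y) = H(X) + H(Y) - H(X,Y) and I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z). A minimal plug-in estimator sketch (my own illustration, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Empirical H(X) in bits from a sequence of discrete samples."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def mi(xs, ys):
    """Empirical I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def cmi(xs, ys, zs):
    """Empirical I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(list(zip(xs, ys, zs))) - entropy(zs))
```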
Intuition of CMIM Algorithm
The joint conditional mutual information is too hard to estimate directly, so CMIM approximates it by minima over ever-smaller conditioning sets:

I(F*; C | F1,…,Fk) ≈ min I(F*; C | Fi,…,Fj)  over conditioning sets of size k-1
I(F*; C | F1,…,Fk) ≈ min I(F*; C | Fi,…,Fj)  over conditioning sets of size k-2
…
I(F*; C | F1,…,Fk) ≈ min I(F*; C | Fi, Fj)  over pairs
I(F*; C | F1,…,Fk) ≈ min I(F*; C | Fi)  over single features

CMIM uses the last, cheapest form: the minimum over single conditioning features.
CMIM Algorithm
Input:  n, the number of features to be selected
        v, the total number of features
Output: F, the set of selected features

1. Set F to be empty
2. m = 1
3. Add Fi to F, where Fi = argmax_{i=1..v} I(Fi; C)
4. Repeat
5.   m = m + 1
6.   Add Fi to F, where Fi = argmax_{i=1..v, Fi ∉ F} { min_{Fj ∈ F} I(Fi; C | Fj) }
7. Until m = n
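A direct Python sketch of steps 1-7 (my own rendering, under the assumption that the argmax in step 6 ranges over features not yet in F; the CMI estimator is the plug-in one from the information theory slide):

```python
from collections import Counter
from math import log2

def entropy(xs):
    """Empirical H(X) in bits from a sequence of discrete samples."""
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def mi(x, c):
    """Empirical I(X;C) = H(X) + H(C) - H(X,C)."""
    return entropy(x) + entropy(c) - entropy(list(zip(x, c)))

def cmi(x, c, z):
    """Empirical I(X;C|Z) = H(X,Z) + H(C,Z) - H(X,C,Z) - H(Z)."""
    return (entropy(list(zip(x, z))) + entropy(list(zip(c, z)))
            - entropy(list(zip(x, c, z))) - entropy(z))

def cmim(features, labels, n):
    """Greedy CMIM: `features` is a list of v discrete value vectors
    (one per feature, aligned with `labels`); returns n selected indices."""
    remaining = set(range(len(features)))
    # Steps 1-3: start with the feature of highest mutual information with C.
    first = max(remaining, key=lambda i: mi(features[i], labels))
    selected = [first]
    remaining.remove(first)
    # Steps 4-7: repeatedly add the candidate whose worst-case score,
    # min over already-selected Fj of I(Fi; C | Fj), is largest.
    while len(selected) < n and remaining:
        best = max(remaining,
                   key=lambda i: min(cmi(features[i], labels, features[j])
                                     for j in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

The min makes the criterion pessimistic about redundancy: a candidate that duplicates any already-selected feature scores 0 (in the slide-4 table, I(w2; C | w1) = 0), so it cannot be chosen while genuinely complementary features remain.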
Experiment Setup
Datasets:
WebKB: 4199 pages, 4 categories
NewsGroups: 20000 pages, 10 categories
Feature selection criteria:
CMIM
Information gain (IG)
Classifiers:
Naïve Bayes
Support vector machine
Results for WebKB
[Plots: micro-averaged and macro-averaged accuracy for SVM and NB]
Results for NewsGroups
[Plots: micro-averaged and macro-averaged accuracy for SVM and NB]
Discussion
Feature size
Acc_CMIM >> Acc_IG when the number of selected features is small
Category number
Acc_CMIM >> Acc_IG when the number of categories is small
Category deviation
MicroAcc_CMIM >> MicroAcc_IG when the deviation in category sizes is large
Conclusion
Uses simple triplets (Fi, C, Fj) to approximate the joint conditional mutual information
The CMIM algorithm tries to reduce the correlation among selected features
Complexity is O(NV³)