
Using Modified CHI Square and Rough Set for Text Categorization with Many Redundant Features

Abstract

Text categorization is a key problem of text mining. Although there is much research on this problem, the main work has focused on the classification of big categories. Very little research addresses text categorization problems characterized by many redundant features; we call this kind of problem fine text categorization. In this paper, we present an algorithm based on modified CHI-square feature selection and rough sets to solve this problem. The features of the categories are selected in an aggressive manner, and the classification rules are extracted using rough set theory. Experiments on real-world corpora show that our algorithm evidently improves classification precision and is thus promising.

1. Introduction

With the rapid development of the internet, the number of electronic documents is increasing at an exponential speed, leaving people in a state of information overload. Nowadays, people care not only about the existence of information but also about the convenience of accessing it. Text categorization is an important method for solving this problem. It aims to find and manage interesting information in large collections of electronic documents and has thus become an important basis of information retrieval.

Text categorization has aroused the interest of many researchers. Many machine learning algorithms have been introduced into this field, and some of them have achieved good experimental results. From our point of view, the main works can be classified into two classes: vector-based methods and rule-based methods. The vector-based methods include simple vector distance, Bayesian algorithms, kNN, artificial neural networks, support vector machines, etc. [1]. Among them, SVM and kNN are two of the top-performing algorithms [1]. These methods are easy to learn and are accurate in general situations. The rule-based methods classify documents according to a set of rules. Given a well-constructed rule set, these methods can achieve high accuracy and can cope with situations involving many redundant features. But the construction of rule sets is a hard task, especially when done manually. Alternatively, some automatic methods for constructing rule sets have been presented, such as [2-4].

But the majority of research on text categorization focuses on "big" categories such as politics, economy, autos, computers, sports, etc. Only very little research concerns the categorization of fine categories, which are characterized by a mass of redundant features that co-occur in more than one category. For example, in the computer field, "laptops" and "desktops" share many features, such as "CPU", "memory", "chipset", etc.; only very few of their features are informative enough to tell them apart. For convenience of description, we define this problem as the fine text categorization problem (FTC for short). As a result of the redundant features, traditional algorithms yield low accuracy.

With the development of text categorization techniques and new demands on information retrieval, it is urgent to find accurate algorithms for the fine text categorization problem. For example, in an automatic text categorization system for a specific field, as the depth of the category tree increases, the subcategories become very similar.

In this paper, we aim to solve the fine text categorization problem by adapting traditional algorithms to improve classification accuracy. The key idea is the use of aggressive feature selection based on a modified CHI2 statistic and the extraction of classification rules based on rough sets. We conducted a series of experiments to demonstrate the utility of our approach.

Liuling DAI
Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing 100081, PRC
[email protected]

Jinwu HU
School of Management and Economics, Beijing Institute of Technology, Beijing 100081, PRC

WanChun Liu
Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing 100081, PRC

2008 International Symposium on Computational Intelligence and Design
978-0-7695-3311-7/08 $25.00 © 2008 IEEE
DOI 10.1109/ISCID.2008.178


The rest of this paper is organized as follows. Section 2 introduces the feature selection method, while section 3 presents the rule extraction method. The algorithm is evaluated in section 4 before concluding the paper in section 5.

2. Feature selection

2.1. Aggressive feature selection

Feature selection is the preparation step of text categorization; it aims to extract feature words for the categories before training classification models. These features are important to the categories and are thus used to represent documents during the training and classification stages. Because the number of features is substantially smaller than the number of original words, this step decreases the complexity of the algorithm. Furthermore, many research results have shown that the feature selection step can improve categorization performance [5].

On the other hand, several studies have reported that feature selection is not beneficial or is even slightly harmful to classification performance [6, 7]. They implied that all features are informative and should not be removed; consequently, some works did not perform any feature selection at all [8]. Evgeniy [9] studied this phenomenon thoroughly and pointed out that the reason lies in the distribution of feature importance (measured, e.g., by IG). According to the results of Evgeniy's study, traditional feature selection techniques are not substantially helpful for problems plagued with redundant features. Similar results were reported in Forman's work [10], and this observation is also corroborated by our experiments in section 4. In conclusion, Evgeniy and Forman suggest that, under this circumstance, aggressive feature selection should be adopted. That is to say, only the tens or hundreds of most informative features should be used to learn the underlying concepts of the categories, as sketched below. In the present study, we follow this conclusion and verify it experimentally.
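As a minimal illustration of what "aggressive" means here (the function name and the generic `score` callable are our own, standing in for any metric such as IG or the CHI2 variants of section 2.2):

```python
def select_features_aggressively(vocabulary, score, k=100):
    """Keep only the k highest-scoring features (aggressive selection).

    vocabulary: iterable of candidate feature words
    score: callable mapping a word to its importance (e.g. IG or CHI2)
    k: number of features to retain; tens or hundreds, per the paper
    """
    ranked = sorted(vocabulary, key=score, reverse=True)
    return ranked[:k]
```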

2.2. Extension to the CHI2 statistic

Frequently used feature selection algorithms include MI, IG, DF, CHI2, etc. [10]. Among them, CHI2 is reported by many studies to be one of the most effective algorithms [6, 7, 9, 10]. It is defined as:

$$\mathrm{CHI}^2(t,c) = \frac{n \cdot (n_1 n_4 - n_2 n_3)^2}{(n_1 + n_2)(n_3 + n_4)(n_1 + n_3)(n_2 + n_4)} \qquad (1)$$

where $n_1$, $n_2$, $n_3$ and $n_4$ denote the co-occurrence frequencies between a feature $t$ and a category $c$, corresponding to the cases $(t, c)$, $(t, \bar{c})$, $(\bar{t}, c)$ and $(\bar{t}, \bar{c})$ respectively, and $n$ is their sum. We can see from this definition that $n_1$ and $n_4$ contribute positively to the relevance between $c$ and $t$, whereas $n_2$ and $n_3$ contribute negatively.

The definition of CHI2 only measures the absolute difference between $n_1 n_4$ and $n_2 n_3$, so it is a macro statistic over all categories. But in this paper, we intend to find the features that are positively indicative of a specific category. Furthermore, we can assert that $n_1 n_4$ matters more for $c$ than $n_2 n_3$ does. In order to embody this difference, we modify the definition of CHI2 as:

$$\delta\text{-}\mathrm{CHI}^2(t,c) = \delta \cdot \frac{n \cdot (n_1 n_4 - n_2 n_3)^2}{(n_1 + n_2)(n_3 + n_4)(n_1 + n_3)(n_2 + n_4)} \qquad (2)$$

where

$$\delta = \begin{cases} 1, & \text{if } n_1 n_4 - n_2 n_3 > 0 \\ -1, & \text{otherwise.} \end{cases}$$

In (2), we use the sign indicator δ to characterize the contribution of a feature to a category. For each category, we select only the features that have a positive δ-CHI2 value; a minimal sketch follows.
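Under the same assumptions as the previous snippet (the function names, the rule representation of the counts, and the zero-denominator guard are our own), the δ variant and the positive-value selection step can be sketched as:

```python
def delta_chi_square(n1, n2, n3, n4):
    """Signed CHI2 of equation (2): delta = 1 when n1*n4 - n2*n3 > 0, else -1."""
    n = n1 + n2 + n3 + n4
    denom = (n1 + n2) * (n3 + n4) * (n1 + n3) * (n2 + n4)
    if denom == 0:
        return 0.0  # our own guard for a degenerate table
    delta = 1 if n1 * n4 - n2 * n3 > 0 else -1
    return delta * n * (n1 * n4 - n2 * n3) ** 2 / denom

def positive_features(candidates, counts):
    """Keep only the features with a positive delta-CHI2 for the category.

    counts: dict mapping each feature to its (n1, n2, n3, n4) tuple.
    """
    return [t for t in candidates if delta_chi_square(*counts[t]) > 0]
```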

3. Rule extraction

After aggressive feature selection, only very few features are used to represent the categories. In this situation, vector-based methods such as SVM and kNN do not fare well; this is shown by Evgeniy's study [9], in which SVM and kNN are inferior to C4.5. We are therefore more interested in rule-based methods, and we check the utility of a rule extraction method based on rough set theory [11].

To extract classification rules from the training corpus, we take features as the precondition and category labels as the decision of the rules. A rule is then represented as:

$$r_i: k(d_i, f_1) \wedge k(d_i, f_2) \wedge \dots \wedge k(d_i, f_n) \Rightarrow d_i \in c_j \qquad (3)$$

where $r_i$ is the rule extracted from document $d_i$, $f_1, \dots, f_n$ are the selected features, $c_j$ ($j = 1, 2, \dots, m$) is the label of a category, and $k(d_i, f)$ is a boolean value indicating whether feature $f$ occurs in $d_i$. All the original rules together constitute a decision table, as shown in Table 1 and in the sketch that follows it.

Table 1. Original decision table

      f1    f2    ...   category
d1    1     0     ...   c1
d2    0     1     ...   c2
...   ...   ...   ...   ...
d4    0     0     ...   c1
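To make the construction of this decision table concrete, here is a minimal Python sketch; the function name, the toy documents, and the feature list are our own illustrative choices, not code from the paper:

```python
def build_decision_table(documents, labels, features):
    """Build the boolean decision table of Table 1: one row per document,
    one k(d, f) column per selected feature, plus the category label.

    documents: list of token sets (one set of words per document)
    labels: category label of each document
    features: the selected feature words (the columns of the table)
    """
    table = []
    for tokens, label in zip(documents, labels):
        row = [1 if f in tokens else 0 for f in features]
        table.append((row, label))
    return table

# Toy example over two hypothetical features:
docs = [{"cpu", "chipset", "battery"}, {"cpu", "rack", "raid"}]
print(build_decision_table(docs, ["laptops", "servers"], ["cpu", "battery"]))
# -> [([1, 1], 'laptops'), ([1, 0], 'servers')]
```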

To simplify and optimize the original rule set, we perform feature reduction and value reduction on it. The steps of optimization and reduction are as follows:


1. Remove the rules that appear in every category (if any). Such rules are called ordinary rules and are of no help for categorization.

2. Feature reduction. If a feature can be removed from the rules while retaining the classifying utility of the rule set, we delete the corresponding column from the decision table.

3. Value reduction. For each rule, if a feature can be removed without causing a collision with other rules, we remove this feature from the rule (see the sketch after this list).
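The following Python sketch illustrates step 3 in a greedy, simplified form; it is our own illustration of the idea, not the authors' rough-set reduction algorithm, and the rule representation (a dict of feature/value conditions plus a label) is an assumption:

```python
def value_reduction(rules):
    """Greedy sketch of value reduction: drop a condition from a rule
    whenever doing so does not make the weakened rule match a rule
    that carries a different category label.

    rules: list of (conditions, label) pairs, where conditions is a dict
    mapping a feature to its boolean value in the rule.
    """
    reduced = []
    for conditions, label in rules:
        kept = dict(conditions)
        for feature in list(kept):
            trial = {f: v for f, v in kept.items() if f != feature}
            # Collision: some rule with a different label agrees with the
            # weakened precondition on every remaining feature.
            collides = any(
                other_label != label
                and all(other.get(f) == v for f, v in trial.items())
                for other, other_label in rules
            )
            if not collides:
                kept = trial
        reduced.append((kept, label))
    return reduced
```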

After the above steps of reduction, each rule contains no redundant features and the number of conditional features is reduced.

4. Experimental results

4.1. Datasets and setups

Constructing datasets for text categorization from web directories has often been done in prior studies, using Yahoo! [12], ODP [13], etc. Evgeniy [9] constructed a dataset in which each subset consists of a pair of ODP categories with about 150 documents, so each subset corresponds to a binary classification task. Considering the scarcity of documents in each subset of Evgeniy's dataset, we constructed a dataset for the IT field from the websites http://www.anandtech.com and http://shopping.yahoo.com. The categories and the numbers of documents are shown in Table 2.

Table 2. Dataset used in our experiments

ID   Categories and document numbers
1    Laptops (249) vs Desktops (300)
2    Laptops (249) vs Servers (340)
3    Servers (340) vs Desktops (300)
4    CPU (159) vs Motherboard (501)
5    CPU (159) vs Memory (285)
6    Memory (285) vs Motherboard (501)

In our experiments, we used SVM [14] and C4.5 [2] as contrasting algorithms. The SVMs are implemented with the SVMlight package [15], using a linear kernel with C = 0.1.

Before feature selection, we unconditionally remove HTML tags, stop words, and words occurring in fewer than 3 documents. All the F1 measures reported below are obtained using a 4-fold cross validation scheme; a hedged sketch of an analogous setup follows.
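The paper uses SVMlight; purely as an illustration, an analogous pipeline can be written with scikit-learn (a substitution, not the authors' code). Here `CountVectorizer(min_df=3)` mirrors the removal of words occurring in fewer than 3 documents, `stop_words="english"` the stop word removal, `LinearSVC(C=0.1)` the linear-kernel SVM, and `cv=4` the 4-fold cross validation; the regex-based HTML stripping is a crude stand-in:

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def strip_html(raw):
    """Crude HTML tag removal; a real system would use a proper parser."""
    return re.sub(r"<[^>]+>", " ", raw)

def evaluate(raw_pages, labels):
    """4-fold cross-validated macro-F1 of a linear SVM, loosely mirroring
    the paper's setup (SVMlight, linear kernel, C = 0.1)."""
    texts = [strip_html(page) for page in raw_pages]
    pipeline = make_pipeline(
        CountVectorizer(stop_words="english", min_df=3),  # drop stop words and rare words
        LinearSVC(C=0.1),                                 # linear SVM, C = 0.1 as in the paper
    )
    return cross_val_score(pipeline, texts, labels, cv=4, scoring="f1_macro")
```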

4.2. Results and discussion

Figure 1 shows the average F1 measures of the classifiers with CHI2 feature selection at several selection levels. As we can see, on our dataset, rough set (RS) and C4.5 are better than SVM, and feature selection is beneficial. Another observation is that the F1 measure of the classifiers drops as more features are used. This is consistent with the results reported earlier by Evgeniy [9]. We therefore conclude that, on datasets with many redundant features, rule-based methods outperform vector-based methods and feature selection should be applied.

Figure 1. Average F1 measure of SVM, RS and C4.5 at several feature selection levels (x-axis: feature selection level; y-axis: average F1). [Figure omitted.]

Table 3 shows the average F1 measures of the classifiers with 100% of the features and with the optimal feature selection levels. As we can see, the optimal feature selection is very aggressive: in fact, only tens or hundreds of features are used to learn the classifiers.

Table 3. Average F1 of classifiers with 100% and optimal feature selection (FS) level

Feature selection + classifier   Avg F1, all features   Avg F1, optimal FS level
CHI2 + SVM                       0.721                  0.851 (1.0%)
CHI2 + C4.5                      0.796                  0.848 (0.5%)
CHI2 + RS                        0.804                  0.853 (0.5%)
δ-CHI2 + SVM                     0.762                  0.861 (1.0%)
δ-CHI2 + C4.5                    0.814                  0.872 (0.5%)
δ-CHI2 + RS                      0.830                  0.901 (0.5%)

Figure 2 illustrates the F1 measures of C4.5 and RS with δ-CHI2 and CHI2 feature selection at different levels. For all configurations, the F1 measure drops when more features are used. The figure also shows that, with the same classifier, δ-CHI2 is more effective than CHI2, and that, with the same feature selection method, RS is more effective than C4.5.


Figure 2. F1 of C4.5 and RS with δ-CHI2 and CHI2 feature selection (x-axis: feature selection level; y-axis: average F1; curves: δ-CHI2+C4.5, CHI2+RS, δ-CHI2+RS, CHI2+C4.5). [Figure omitted.]

All these results show that aggressive feature selection is very beneficial for the fine text categorization problem. We can also observe that the δ-CHI2 statistic outperforms the CHI2 statistic throughout all the experiments. In addition, the rough set based method is shown to be competitive with C4.5, with a slight advantage.

5. Conclusion

In this paper we are concerned with the fine text categorization problem, which is characterized by many redundant features; in other words, only a few features are useful for telling a category apart from the others. We modified the traditional CHI2 feature selection algorithm by adding a sign indicator for the contribution of features. At the same time, we extracted classification rules using rough set theory. As far as this kind of problem is concerned, we can draw the following conclusions:

(1) Rule-based algorithms are more powerful than vector-based algorithms;

(2) Aggressive feature selection deserves consideration;

(3) The δ-CHI2 statistic is helpful for improving accuracy;

(4) The rough set based method is competitive with C4.5, with a modest advantage.

Acknowledgement

This work is partially supported by the Beijing Key Discipline Program and the Basic Research Foundation of Beijing Institute of Technology (No. 20071142004).

References

[1] Y. Yang and X. Liu, "A re-examination of text categorization methods", Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, 1999, pp. 42-49.
[2] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[3] Y. Chen and J. An, "Depth first rule generation for text categorization", Advances in Intelligent IT: Active Media Technology, 2006, pp. 302-306.
[4] M. Sasaki and K. Kita, "Rule-based text categorization using hierarchical categories", IEEE International Conference on Systems, Man, and Cybernetics, vol. 3, 1998, pp. 2827-2830.
[5] Y. Yang and J. P. Pedersen, "A comparative study on feature selection in text categorization", The Fourteenth International Conference on Machine Learning, Morgan Kaufmann, 1997, pp. 412-420.
[6] J. Brank, M. Grobelnik, N. Milic-Frayling, and D. Mladenic, "Interaction of feature selection methods and linear classification models", Workshop on Text Learning held at ICML, 2002.
[7] M. Rogati and Y. Yang, "High-performing feature selection for text classification", CIKM '02, 2002, pp. 659-661.
[8] D. D. Lewis, Y. Yang, T. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization research", Journal of Machine Learning Research, 5, 2004, pp. 361-397.
[9] G. Evgeniy and M. Shaul, "Text categorization with many redundant features: Using aggressive feature selection to make SVMs competitive with C4.5", The 21st International Conference on Machine Learning (ICML), Banff, Alberta, Canada, July 2004, pp. 321-328.
[10] G. Forman, "An extensive empirical study of feature selection metrics for text classification", Journal of Machine Learning Research, 3, 2003, pp. 1289-1305.
[11] Z. Pawlak, Rough Sets, Norwell, MA: Kluwer Academic Publishers, 1991.
[12] D. Mladenic and M. Grobelnik, "Word sequences as features in text-learning", Proc. of the 7th Electrotechnical and Computer Science Conference, 1998, pp. 145-148.
[13] S. Chakrabarti, M. M. Joshi, K. Punera, and D. M. Pennock, "The structure of broad topics on the web", Proc. of the International World Wide Web Conference, 2002.
[14] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, 1995.
[15] T. Joachims, "Making large-scale SVM learning practical", Advances in Kernel Methods: Support Vector Learning, The MIT Press, 1999.
