


Fuzzy Sets and Systems 215 (2013) 74–89
www.elsevier.com/locate/fss

Fuzzy semi-supervised co-clustering for text documents

Yang Yan, Lihui Chen∗, William-Chandra Tjhi

Nanyang Technological University, School of Electrical and Electronic Engineering, Republic of Singapore

Received 23 September 2011; received in revised form 24 September 2012; accepted 25 October 2012
Available online 13 November 2012

Abstract

In this paper we propose a new heuristic semi-supervised fuzzy co-clustering algorithm (SS-HFCR) for the categorization of large web documents. In this approach, the clustering process is carried out by incorporating some prior knowledge, in the form of pair-wise constraints provided by users, into the fuzzy co-clustering framework. Each constraint specifies whether a pair of documents "must" or "cannot" be clustered together. Moreover, we formulate a competitive agglomeration cost function which is also able to make use of the prior knowledge in the clustering process. Experimental studies on a number of large benchmark datasets demonstrate the strengths and potential of SS-HFCR in terms of accuracy, stability and efficiency, compared with some recent popular semi-supervised clustering approaches.
© 2012 Elsevier B.V. All rights reserved.

Keywords: Semi-supervised learning; Heuristic; Must-link/cannot-link constraint; Fuzzy co-clustering

1. Introduction

Being an important machine learning technique, clustering is widely used in many real applications, especially in exploratory text document analysis. Document clustering is a popular application that automatically organizes a collection of documents into meaningful sub-groups of coherent topics based on similarity in content. Each sub-group in this case is called a document cluster. With document clustering, the user can benefit from a set of high quality document clusters, which offer a better view of how the huge amount of information in a collection is inherently structured. However, as most text datasets are known to be large, noisy, overlapping and high dimensional, the clustering process is always a challenging task. Several clustering techniques have been developed recently for textual documents, such as co-clustering [1], fuzzy clustering [2], clustering based on non-negative matrix factorization (NMF) [3], model-based clustering [4], spectral clustering [5], relational data clustering [6], etc. These techniques have merits in various aspects. For example, co-clustering is generally effective at handling high dimensional data by simultaneously grouping documents and words based on the high co-occurrence among them. Fuzzy clustering [7] is used for categorization applications which require a realistic representation of overlapping clusters, but suffers from outliers. Model-based clustering is good at outlier detection but usually has high time complexity.

Most co-clustering algorithms deal with dyadic data, e.g. document and word co-occurrence frequencies. The popular co-clustering models cover bipartitioning based on a spectral graph [8], bipartitioning based on non-negative matrix factorization [3], information-theoretic partitioning [1] of the empirical joint probability distribution,

∗ Corresponding author. Tel.: +65 6790 4484.
E-mail addresses: [email protected] (Y. Yan), [email protected] (L. Chen), [email protected] (W.-C. Tjhi).

0165-0114/$ - see front matter © 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.fss.2012.10.016


and fuzzy partitioning [9]. In the fuzzy co-clustering framework, the fuzzy relationship is represented by the captured degrees (memberships) to which documents, as well as words, belong to each cluster. It has been reported in a recent study [10] that fuzzy co-clustering algorithms outperform a modern variant of fuzzy c-means [7] and crisp co-clustering approaches [8] on certain benchmark datasets. Moreover, they have a lower time complexity compared with NMF [2] and soft-spectral co-clustering methods [11].

However, it is sometimes still a challenging task to categorize complicated datasets well relying purely on a completely unsupervised method. It is noted that, in many real applications, some prior knowledge of a dataset may be available to the users, such as the categorical labels of a small portion of the data. The technique that makes use of both labeled and unlabeled data for training purposes is called semi-supervised learning. 1 It has been successfully applied in various fields such as knowledge discovery [12], data mining, etc. [13]. From our understanding, the premise of semi-supervised clustering [14] is to leverage various types of prior knowledge in the clustering process without an expensive learning process, and thereby to improve its performance. Existing methods for semi-supervised clustering generally fall into two categories: similarity-adapting based [15] and search-based methods. In similarity-adapting based approaches, the proximity (similarity) values between a few reference objects are given, and the user can usually train, by distance metric learning, a new similarity function that governs the relationship between the remaining objects while easily satisfying the proximity "hints". In search-based approaches, the algorithm is designed to make use of ground truth class labels [14] or various constraints [16] to guide the search for an appropriate clustering. Providing a few class labels may be the simplest approach: users directly use them as seeds to form a set of initial clusters. This has been widely incorporated into k-means [14], fuzzy c-means [17], NMF [18] and model-based clustering [19]. However, the labeling process is usually expensive, and in real cases it is sometimes hard to give an explicit label set for each category, such as in the clustering of GPS data for lane finding. Most recent semi-supervised applications focus on building two sets of pair-wise categorical constraints, must-link and cannot-link constraints, instead of obtaining the exact ground truth labels. As pointed out by Chen [29], "It is much easier for a user to provide feedback in the form of pair-wise constraints than class labels, since providing constraints does not require the user to have significant prior knowledge about the categories in the dataset."

In this paper, we intend to combine the strengths of fuzzy clustering, co-clustering and semi-supervised clustering. A novel heuristic semi-supervised fuzzy co-clustering algorithm (SS-HFCR) is proposed to make use of prior knowledge in the form of pair-wise constraints on the document domain. Each constraint specifies whether a pair of documents "must" or "cannot" be clustered into the same category. Our objective is to increase the clustering accuracy and reduce the sensitivity to the fuzzifier parameters with limited prior knowledge, while the complexity of the algorithm remains relatively low. An experimental study is conducted on a small synthetic dataset and a number of publicly available benchmark datasets. The clustering results are compared with several existing popular semi-supervised methods to show the effectiveness of SS-HFCR.

The rest of the paper is organized as follows. The related semi-supervised clustering and fuzzy co-clustering works are reviewed in Section 2. The proposed SS-HFCR approach is explained in Section 3. In Section 4, we first illustrate the effectiveness of SS-HFCR on a toy problem. A more extensive empirical study on 10 large benchmark datasets is then provided in Section 5. Finally, the conclusion and future work are given in Section 6.

2. Related works

We generally group the works relevant to our proposed method into three categories, namely fuzzy co-clustering, semi-supervised fuzzy clustering, and semi-supervised co-clustering. In this section, the existing literature falling into these categories is briefly reviewed one by one.

The potential of using fuzzy co-clustering for document categorization has been discussed in the literature. The traditional methods proposed in [9,20] are partitioning-ranking based approaches for handling categorical attributes and large text corpora. Recently, a dual-partitioning based fuzzy co-clustering approach called HFCR [10] has been successfully formulated. It addresses a few issues of the partitioning-ranking approaches, such as poor performance on datasets with overlapping feature clusters.

From the literature review, we noted that some clustering approaches have explored various forms of prior knowledge incorporated into the fuzzy clustering framework. While prior knowledge in the form of class labels is used in [17,21],

1 http://en.wikipedia.org/wiki/Semi-supervised_learning


a pair-wise constraint-based approach, PCCA, is proposed in [22]. Furthermore, [23] introduces an active selection mechanism to select proper constraints, in order to reduce the performance impact caused by the constraint selection problem encountered in [22]. However, with a fixed fuzzifier m, this kind of approach may not be suitable for clustering high-dimensional and sparse text datasets. Instead of directly incorporating the constraints into the FCM framework as discussed above, P-FCM [24] is an interesting method which augments FCM by adding another optimization step at each iteration. A number of proximity "hints" (values) serve as the pair-wise constraints, and the overall difference between the given proximity values and those computed from the memberships is minimized by a separate gradient-driven optimization process. Obviously, extra computation time is needed by this process. On the other hand, while similarity-adapting methods [25,26] appear to be more applicable to a wide range of applications, as pointed out by the authors in [23], "similarity-adapting based approaches need either significantly more supervision or specific strong assumptions regarding the target similarity measure" [23]. In [27], a fully adaptive and a kernel-based distance measure are explored for fuzzy c-means.

Some efforts have also been made to extend existing co-clustering methods to semi-supervised versions by incorporating class labels [18,28] or pair-wise constraints [29–32]. Most of these methods are NMF-based [29–31] or information theory-based approaches [32]. SS-RNMF [29] is based on symmetric non-negative tri-factorization of a symmetric data similarity matrix R with additional must-link and cannot-link pair-wise constraints. Each constraint refers to a pair of objects. If objects i and j belong to the same category, then the corresponding element $r_{ij}$ in R is set to the value of the maximum similarity found in R, while $r_{ij}$ is set to the value of the minimum similarity if i and j belong to different clusters. Moreover, in some cases SS-NMF may not always use pair-wise constraints to improve the clustering. Lee and Yoo [18] proposed a weighted SS-NMF algorithm which employs joint weighted factorization, introducing 0/1 weights on the residuals in the decomposition of the data matrix.

In addition, recent research tends to combine the strengths of both search-based and similarity-adapting based approaches. In [33], Chen et al. proposed a triplet co-clustering algorithm, namely SS-NMF-CC, which has an additional layer to handle heterogeneous data. In SS-NMF-CC, first a distance metric L is learnt based on the provided must-link and cannot-link constraints on each pair of objects, and then the heterogeneous relational matrices can be calculated in the way explained in [15]. Second, two modality-importance factors α and β are selected based on the new metrics obtained from L, in order to decide the relative importance between different object types, as they may play distinct roles in the clustering of documents. It is noted that these two objectives must be achieved simultaneously because the modality selection and distance metric learning are strongly dependent on each other. Hence, an algorithm has been proposed to iteratively update L, α and β by certain optimization processes before the non-negative tri-factorization is performed. Therefore, the computational cost is higher than that of other NMF-based approaches.

In this paper, we focus on the development of a novel and more effective semi-supervised approach, based on our previous fuzzy co-clustering algorithm, with a low computational complexity, to improve the performance on homogeneous high dimensional datasets. An additional learning process such as that used in P-FCM [24] or SS-NMF-CC [33] is not required.

3. Proposed SS-HFCR algorithm

In this section we first introduce the formulation of the proposed heuristic approach (SS-HFCR) for semi-supervised fuzzy co-clustering. This is followed by the derivation of the update rules for the membership matrices U and V. The detailed steps and the time complexity of the algorithm are then presented.

3.1. Notations

Throughout the paper, we use bold uppercase characters to denote matrices and italic uppercase characters to denote scalars. The meanings of some frequently used notations are summarized in Table 1.

3.2. The formulation

The document collection is represented in the vector space model in this paper. Let D be a dataset of N objects (documents) drawn from an M-dimensional feature (word) space; the goal of clustering is then to partition the dataset correctly into C clusters.


Table 1
Frequently used notations.

D — The word-document association matrix
C / c — The number of categories (clusters) / the index of a particular category (cluster)
M — The number of words
N — The number of documents
$x_i$ — The i-th document in $\mathbb{R}^M$
$d_{ij}$ — The tf-idf value of word j in document i
U, V — The document and word fuzzy membership matrices
$u_{ci}$, $v_{cj}$ — The particular document and word memberships
$T_u$, $T_v$ — The user-defined fuzzifiers for the memberships
ML / CNL — The training sets which contain all must-link / cannot-link document pairs
$T_d$ — The weighting factor (penalty cost) of a constraint

Inspired by our earlier works in [10], the new objective function of SS-HFCR can be formulated as

$$J_{SS\text{-}HFCR} = \sum_{c=1}^{C}\sum_{i=1}^{N}\sum_{j=1}^{M} u_{ci}\, v_{cj}\, d_{ij} \;-\; T_u \sum_{c=1}^{C}\sum_{i=1}^{N} u_{ci}\ln u_{ci} \;-\; T_v \sum_{c=1}^{C}\sum_{j=1}^{M} v_{cj}\ln v_{cj} \;+\; T_u T_d \left(\sum_{(x_i,x_k)\in ML}\sum_{c=1}^{C} u_{ci}\, u_{ck} \;-\; \sum_{(x_i,x_k)\in CNL}\sum_{c=1}^{C} u_{ci}\, u_{ck}\right) \quad (1)$$

The optimization of the function J is subject to the two constraints given in Eqs. (2) and (3):

$$\sum_{c=1}^{C} u_{ci} = 1, \quad u_{ci}\in[0,1] \quad \text{for } i = 1,\dots,N \quad (2)$$

$$\sum_{c=1}^{C} v_{cj} = 1, \quad v_{cj}\in[0,1] \quad \text{for } j = 1,\dots,M \quad (3)$$

First, it is noted that, as a fuzzy co-clustering approach, each document i is given a membership $u_{ci}$ describing its degree of belongingness to cluster c. Similarly, each word j in the dataset is given a fuzzy membership $v_{cj}$ describing its degree of belongingness. Both of them should be taken into account in the objective function, so as to group together the documents and words which have a high correlation to each other. Therefore, to accomplish the clustering task, the first term, $\sum_{c=1}^{C}\sum_{i=1}^{N}\sum_{j=1}^{M} u_{ci}\, v_{cj}\, d_{ij}$, which is called the degree of aggregation, should be maximized among the co-clusters. In other words, the maximization of this term is intended to make the highly related documents and words (as indicated by high $d_{ij}$ values) be co-clustered together (i.e. be assigned to the same co-cluster). The motivation is based on the belief that a high quality co-cluster should be one with strong coherence bonding among its members (i.e. documents and words).

Meanwhile, the constraints in Eqs. (2) and (3) conform to Ruspini's condition, which indicates that $u_{ci}$ and $v_{cj}$ computed by this approach reflect how the documents and words, respectively, are partitioned across the various co-clusters. As discussed in [10], unlike the partitioning-ranking based approaches, e.g. Fuzzy CoDoK, dual-partitioning schemes without a careful design may suffer from a fundamental flaw that prevents an effective fuzzy co-clustering from being accomplished. The maximization of the degree of aggregation may not lead in the correct direction for the desired clustering result. The reason can be seen more clearly when we write the term as the component-wise inner product of two matrices, i.e. $G : D = \sum_i \sum_j g_{ij} d_{ij}$, where each element of G is defined as $g_{ij} = \sum_{c=1}^{C} u_{ci} v_{cj}$. It is noted that the value of $\sum_{i=1}^{N}\sum_{j=1}^{M} g_{ij}$ can vary from 0 to NK. This variation implies that the maximization of the degree of aggregation in this case will be biased towards the construction of co-clusters with larger $\sum_{i=1}^{N}\sum_{j=1}^{M} g_{ij}$ values; meanwhile, it does not entirely depend on the partitioning of D. Therefore, as pointed out in [10], it is not necessary for the co-clusters to have a large $\sum_{i=1}^{N}\sum_{j=1}^{M} g_{ij}$ value in order to capture the real inherent grouping structure of a given dataset.

It is also noted that $\sum_{i=1}^{N}\sum_{j=1}^{M} g_{ij}$ always equals a constant (i.e. N) in the partitioning-ranking based approaches; for this reason they are spared from the bias problem. Therefore, in a dual partitioning-based approach, two auxiliary


functions, with different normalized degree-of-aggregation terms as shown in Eqs. (4) and (5), need to be used to replace Eq. (1):

$$J_{ss\text{-}1} = \sum_{c=1}^{C}\sum_{i=1}^{N}\sum_{j=1}^{M} \frac{u_{ci}\, v_{cj}}{\sum_{q=1}^{M} v_{cq}}\, d_{ij} - T_u \sum_{c=1}^{C}\sum_{i=1}^{N} u_{ci}\ln u_{ci} - T_v \sum_{c=1}^{C}\sum_{j=1}^{M} v_{cj}\ln v_{cj} + T_u T_d \left(\sum_{(x_i,x_k)\in ML}\sum_{c=1}^{C} u_{ci} u_{ck} - \sum_{(x_i,x_k)\in CNL}\sum_{c=1}^{C} u_{ci} u_{ck}\right) \quad (4)$$

$$J_{ss\text{-}2} = \sum_{c=1}^{C}\sum_{i=1}^{N}\sum_{j=1}^{M} \frac{u_{ci}}{\sum_{p=1}^{N} u_{cp}}\, v_{cj}\, d_{ij} - T_u \sum_{c=1}^{C}\sum_{i=1}^{N} u_{ci}\ln u_{ci} - T_v \sum_{c=1}^{C}\sum_{j=1}^{M} v_{cj}\ln v_{cj} + T_u T_d \left(\sum_{(x_i,x_k)\in ML}\sum_{c=1}^{C} u_{ci} u_{ck} - \sum_{(x_i,x_k)\in CNL}\sum_{c=1}^{C} u_{ci} u_{ck}\right) \quad (5)$$

From the formulation of $J_{ss\text{-}1}$ and $J_{ss\text{-}2}$, it is clear that although both equations share similar principles with $J_{SS\text{-}HFCR}$, the original degree of aggregation of every co-cluster c is normalized by $\sum_{q=1}^{M} v_{cq}$ in $J_{ss\text{-}1}$, while it is normalized by $\sum_{p=1}^{N} u_{cp}$ in $J_{ss\text{-}2}$. With this normalization, we now have $(g_1)_{ij} = \sum_{c=1}^{C} u_{ci} v_{cj} / \sum_{q=1}^{M} v_{cq}$ in $J_{ss\text{-}1}$ and $(g_2)_{ij} = \sum_{c=1}^{C} \left(u_{ci}/\sum_{p=1}^{N} u_{cp}\right) v_{cj}$ in $J_{ss\text{-}2}$, respectively. We thus obtain a constant value for the aggregation term, as in the partitioning-ranking based approaches. In other words, the bias is removed by eliminating the variation in the values of these two terms. Therefore, this normalization process is essential in the formulation in order to avoid the bias and also to reduce the possibility of the computational overflow mentioned in [21].
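The effect of this normalization is easy to check numerically: under Eqs. (2) and (3), the entries $(g_1)_{ij}$ always sum to N and the entries $(g_2)_{ij}$ always sum to M, whatever the partition, while the unnormalized $g_{ij}$ do not sum to a constant. A small sanity check (our illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
C, N, M = 3, 10, 20
U = rng.random((C, N)); U /= U.sum(axis=0)  # document memberships obeying Eq. (2)
V = rng.random((C, M)); V /= V.sum(axis=0)  # word memberships obeying Eq. (3)

g = U.T @ V                                    # unnormalized g_ij of Eq. (1)
g1 = U.T @ (V / V.sum(axis=1, keepdims=True))  # (g1)_ij of J_ss-1
g2 = (U / U.sum(axis=1, keepdims=True)).T @ V  # (g2)_ij of J_ss-2

# g.sum() depends on the partition; g1.sum() == N and g2.sum() == M always
print(g.sum(), g1.sum(), g2.sum())
```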

As we aim to develop a semi-supervised clustering algorithm, some prior knowledge needs to be used in the clustering process. The prior knowledge is given in the form of two sets of pair-wise constraints: one set specifies the 'must-link' constraints, denoted as ML, and the other the 'cannot-link' constraints, denoted as CNL. Here, we assume each document in a pair-wise constraint has a 'virtual label', which is a categorical variable. This variable can take two types of values: one is a user-assigned category; the other is the ground truth categorical label, if such specific information is available. We make sure that each pair of documents in the ML set is very similar in content, so it indicates a must-link constraint, while each pair of documents in the CNL set is dissimilar in content, indicating a cannot-link constraint. The must-link constraints represent an equivalence relation. Hence, it is possible to derive a collection of transitive closures in the ML set, in which any document pair in the same transitive closure must share the same 'virtual label'. As stated at the beginning of this section, each document in the corpus is more or less related to several topics based on the fuzzy set principle, so the prior knowledge can also be specified through the pair-wise constraints via the fuzzy membership values during the initialization. Each document in a pair-wise constraint is given a higher degree of membership to the category c which corresponds to its virtual label, and a lower degree of membership to the other categories. Hence, through the clustering process, the term $\sum_{c=1}^{C} u_{ci} u_{ck}$ should be maximized if $x_i$ and $x_k$ have the same 'virtual label', and minimized if they have different 'virtual labels'. Therefore the combined supervised term in the objective function can be expressed as $\left(\sum_{(x_i,x_k)\in ML}\sum_{c=1}^{C} u_{ci} u_{ck} - \sum_{(x_i,x_k)\in CNL}\sum_{c=1}^{C} u_{ci} u_{ck}\right)$, and it should be maximized. $T_d$ is a weighting factor, which controls the relative importance of the prior knowledge brought from the document domain compared with the whole dataset. The overall design of SS-HFCR ensures that each document will get a fuzzy membership distribution, and that the violation of the designed pair-wise constraints is minimized at the end of the clustering process.
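As an aside, the transitive closures of the ML set can be recovered with a standard union-find pass; the sketch below is our illustration of this bookkeeping step, not code from the paper:

```python
def mustlink_closures(ML, n_docs):
    """Group documents into the transitive closures of the must-link relation."""
    parent = list(range(n_docs))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, k in ML:
        ri, rk = find(i), find(k)
        if ri != rk:
            parent[ri] = rk  # merge the two closures

    groups = {}
    for i in range(n_docs):
        groups.setdefault(find(i), []).append(i)
    # closures with more than one member are the ones induced by ML pairs
    return [g for g in groups.values() if len(g) > 1]

# e.g. ML = [(0, 1), (1, 2), (5, 6)] yields the closures [0, 1, 2] and [5, 6],
# whose members must share the same 'virtual label'
```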

The second and third terms in Eqs. (4) and (5) are the fuzzifier terms based on entropy regularization. Their purpose is similar to that of the parameter m in the fuzzy c-means algorithm, i.e. to fuzzify the resulting co-clusters. $T_u$ and $T_v$ are used to adjust the levels of fuzziness of the document and word memberships, respectively.

3.3. Updating rules derivation

Now we need to solve the maximization problem given in Eqs. (4) and (5) by finding the optimal values of U and V subject to the constraints given by Eqs. (2) and (3), where U and V denote the entire document and word membership matrices, respectively. Since u and v are continuous variables, we use the method of Lagrange multipliers with the first-order necessary condition to derive the update rules for u and v.


Therefore, the respective Lagrangian functions of Eqs. (4) and (5) are first constructed as (writing $J_{ss\text{-}1}$ and $J_{ss\text{-}2}$ for the objective functions in Eqs. (4) and (5))

$$L_1 = J_{ss\text{-}1} + \sum_{i=1}^{N}\lambda_i\left(\sum_{c=1}^{C} u_{ci} - 1\right) + \sum_{j=1}^{M}\gamma_j\left(\sum_{c=1}^{C} v_{cj} - 1\right) \quad (6)$$

$$L_2 = J_{ss\text{-}2} + \sum_{i=1}^{N}\lambda_i\left(\sum_{c=1}^{C} u_{ci} - 1\right) + \sum_{j=1}^{M}\gamma_j\left(\sum_{c=1}^{C} v_{cj} - 1\right) \quad (7)$$

where $\lambda_i$ and $\gamma_j$ are the Lagrange multipliers corresponding to the constraints in Eqs. (2) and (3).

From the necessary conditions for the optimality of the Lagrangian function $L_1$, we take the partial derivative of $L_1$ with respect to $u_{ci}$. By setting $\partial L_1/\partial u_{ci} = 0$, the updating rule for $u_{ci}$ can be derived as below. The same technique is applied to $\partial L_2/\partial v_{cj} = 0$ to get the updating rule for $v_{cj}$:

$$u_{ci} = \frac{\exp\left\{\dfrac{\sum_{j=1}^{M} v_{cj} d_{ij}}{T_u \sum_{j=1}^{M} v_{cj}} + T_d\left[\sum_{(x_i,x_s)\in ML} u_{cs} - \sum_{(x_i,x_t)\in CNL} u_{ct}\right]\right\}}{\sum_{f=1}^{C}\exp\left\{\dfrac{\sum_{j=1}^{M} v_{fj} d_{ij}}{T_u \sum_{j=1}^{M} v_{fj}} + T_d\left[\sum_{(x_i,x_s)\in ML} u_{fs} - \sum_{(x_i,x_t)\in CNL} u_{ft}\right]\right\}} \quad (8)$$

$$v_{cj} = \frac{\exp\left\{\dfrac{\sum_{i=1}^{N} u_{ci} d_{ij}}{T_v \sum_{i=1}^{N} u_{ci}}\right\}}{\sum_{f=1}^{C}\exp\left\{\dfrac{\sum_{i=1}^{N} u_{fi} d_{ij}}{T_v \sum_{i=1}^{N} u_{fi}}\right\}} \quad (9)$$

3.4. Algorithm and complexity

The SS-HFCR algorithm can now be described as follows: starting with a set of given pair-wise constraints and a non-negative initialization of U, the matrices V and U are iteratively updated with Eqs. (9) and (8), respectively, in an alternating manner, until either the successive estimates of U are close enough or the maximum number of iterations is reached. It is noted that the initialized memberships of the documents involved in the constraint sets should be manually adjusted in order to avoid violation of the constraints. During the iteration process, the quality of the partition, in terms of the criteria defined in Eqs. (4) and (5), is successively improved through the reassignment of documents to clusters based on the current word partition and, similarly, the reforming of word clusters based on the current document partition. The detailed algorithm of SS-HFCR is shown in Table 2.

The time complexity of SS-HFCR is $O(CNM\tau)$, where $\tau$ denotes the number of iterations. This complexity is the same as that of fuzzy c-means, as well as HFCR. We would like to point out that the actual time spent is much less, since the number of nonzero entries in the highly sparse document-word association matrix D is considerably smaller than MN.
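Putting the pieces together, a minimal driver for the loop in Table 2, reusing the update_V/update_U and mustlink_closures sketches above, might look as follows (again our illustration; in particular, the constraint-obeying initialization of step 2 is only roughed in here):

```python
def ss_hfcr(D, C, ML, CNL, Tu=1e-3, Tv=5e-3, Td=1.0, eps=1e-5, tau_max=200, seed=0):
    rng = np.random.default_rng(seed)
    N, _ = D.shape
    U = rng.random((C, N))
    U /= U.sum(axis=0, keepdims=True)  # random memberships obeying Eq. (2)
    # Step 2 (sketch): bias each must-link closure towards one common cluster
    for idx, group in enumerate(mustlink_closures(ML, N)):
        c = idx % C
        for i in group:
            U[:, i] = 0.1 / (C - 1)
            U[c, i] = 0.9
    for tau in range(tau_max):                      # step 3
        V = update_V(U, D, Tv)                      # step 3.1, Eq. (9)
        U_new = update_U(U, V, D, Tu, Td, ML, CNL)  # step 3.2, Eq. (8)
        converged = np.max(np.abs(U_new - U)) <= eps
        U = U_new
        if converged:
            break
    return U, V
```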

4. Experiments on a toy problem

In this section, a highly overlapped toy dataset D1 is designed and used to illustrate the performance of SS-HFCR. This experiment shows how the pair-wise constraints in SS-HFCR can help the highly overlapped documents to be grouped into the correct clusters, which HFCR cannot achieve. As shown in Table 3, D1 is a synthetic dataset with six documents over seven words in three different categories. Documents 4, 5 and 6 can be considered the highly overlapped


Table 2
The SS-HFCR algorithm.

Input: Dataset D, number of clusters C, constraint sets ML & CNL.
Output: Document membership matrix U and word membership matrix V.
Method:
1. Set the weighting factors $T_u$, $T_v$, $T_d$, the stopping threshold $\varepsilon$, and the maximum iteration number $\tau_{max}$.
2. Manually adjust the initial $u_{ci}$ for the documents that exist in the ML & CNL sets to obey all the constraints, then randomly assign the initial $u_{ci}$ for the other documents.
3. REPEAT
   3.1 Update $v_{cj}$ with Eq. (9);
   3.2 Update $u_{ci}$ with Eq. (8);
   3.3 $\tau = \tau + 1$;
   UNTIL ($\max_{c,i} |u_{ci}^{\tau+1} - u_{ci}^{\tau}| \le \varepsilon$) or $\tau > \tau_{max}$.

Table 3
Toy problem.

Dataset:

D1 = [ 1    0.7  0.5  0.5  0    0.8  0
       1    0.7  0.5  0.5  0    0.8  0
       0.1  0.3  0.5  0.8  0.5  0.3  0
       0.2  0.3  0.5  0.7  0.5  0.3  0
       0.3  0.3  0.5  0.7  0.6  0.4  0.7
       0.6  0.4  0.4  0.5  0.6  0.4  0.7 ]

Ideal result:

U = [ 1  1  0  0  0   0
      0  0  1  1  a2  0
      0  0  0  0  a3  1 ]

where 0 < a2 < 0.5 and 0.5 < a3 < 1.

Table 4
Results on the toy problem with two different initializations.

Results of HFCR:

U = [ 1  1  0     0     0     0
      0  0  0.98  0.84  0.62  0.05
      0  0  0.02  0.16  0.38  0.95 ]

or

U = [ 1  1  0     0     0     0
      0  0  1     0.56  0.29  0.10
      0  0  0     0.44  0.71  0.90 ]

Results of SS-HFCR with one pair-wise constraint:

With a must-link constraint between documents 5 and 6:

U = [ 1  1  0     0     0     0
      0  0  1     0.88  0.07  0.01
      0  0  0     0.12  0.93  0.99 ]

With a cannot-link constraint between documents 4 and 5:

U = [ 1  1  0     0     0     0
      0  0  1     0.96  0.18  0.08
      0  0  0     0.04  0.82  0.92 ]

ones. Ideally, document 5 should obtain a higher document membership in cluster 3 than in cluster 2, since, as we can see, document 4 does not contain word 7. From the left column of Table 4, we can see that, with different initializations, the two final document membership matrices U computed by HFCR give different clustering results. Document 5 has a high chance of obtaining a higher membership in cluster 2 rather than cluster 3, which is not accurate. However, if a must-link constraint between documents 5 and 6, or a cannot-link constraint between documents 4 and 5, is provided by the user, SS-HFCR is able to capture the real inherent structure of D1, as shown via the two U in the right column of Table 4. Due to space limitations, the word membership matrices V computed by HFCR and SS-HFCR are not provided here.
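Using the ss_hfcr sketch from Section 3, the toy experiment can be reproduced along the following lines (a usage illustration; the paper's exact initialization and parameter values for this experiment are not stated):

```python
D1 = np.array([
    [1.0, 0.7, 0.5, 0.5, 0.0, 0.8, 0.0],
    [1.0, 0.7, 0.5, 0.5, 0.0, 0.8, 0.0],
    [0.1, 0.3, 0.5, 0.8, 0.5, 0.3, 0.0],
    [0.2, 0.3, 0.5, 0.7, 0.5, 0.3, 0.0],
    [0.3, 0.3, 0.5, 0.7, 0.6, 0.4, 0.7],
    [0.6, 0.4, 0.4, 0.5, 0.6, 0.4, 0.7],
])
# a single must-link constraint between documents 5 and 6 (0-based indices 4, 5)
U, V = ss_hfcr(D1, C=3, ML=[(4, 5)], CNL=[])
print(U.round(2))  # document 5 should now lean towards document 6's cluster
```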


Table 5
Summary of the benchmark datasets.

Dataset   No. of clusters  No. of docs  No. of words  Balance  Brief description
Binary    2                500          3377          1        Politics (250), Middle-east (250)
Multi10   10               500          2115          1        Atheism (50), Hardware (50), Forsale (50), Rec.autos (50), Hockey (50), Crypt (50), Politics (50), Electronics (50), Medical (50), Space (50)
webkb4    4                4199         11,909        0.345    Student (1641), Project (504), Faculty (1124), Course (930)
re0       13               1504         2886          0.018    Housing (16), Money (608), Trade (319), Reserves (42), Cpi (60), Interest (219), Gnp (80), Retail (20), Ipi (37), Jobs (39), Lei (11), Bop (38), Wpi (15)
reuters3  3                1076         2837          0.752    Trade (361), Crude (408), Money-fx (307)
Yahoo_K1  6                2340         3640          0.043    Health (494), Entertainment (1389), Sports (141), Politics (114), Technology (60), Business (142)
SM        2                2000         5450          1        Soccer (1000), Motorsport (1000)
SS        2                2000         6337          1        Soccer (1000), Sport (1000)
CB        2                2000         4791          1        Commercial Bank (1000), Building Societies (1000)
WPD6s     6                600          2660          1        Commercial Bank (100), C++ (100), Astronomy (100), Biology (100), Soccer (100), Sport (100)

5. Experiments and discussion on large datasets

5.1. Dataset collections for testing

Besides the illustration on a toy problem, 10 benchmark datasets are used to evaluate the performance of SS-HFCR in categorizing real-world data. The number of clusters for these datasets ranges from 2 to 13, the number of documents ranges from 500 to 4199, and the vocabulary size ranges from over 2000 to over 11,000 words. Table 5 gives concise information on these datasets. Binary and Multi10 are subsets of the 20newsgroups collection; 2 webkb4 3 contains web pages collected from four universities plus miscellaneous web pages; re0 and reuters3 are subsets of the Reuters-21578 collection. 4 The Yahoo_K1 dataset consists of web pages in various subject directories of Yahoo!. 5 Words occurring in less than 0.5% or more than 99.5% of the documents are removed from all datasets. Meanwhile, each document was automatically indexed for keyword frequency extraction. Stemming was performed and stop words were discarded. The "Balance" column indicates the ratio of the smallest class size to the largest class size of a dataset.
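This preprocessing can be approximated with scikit-learn as sketched below (our reading of the setup; the paper does not name its tokenizer or stemmer, so stemming is omitted here, and raw_documents is an assumed list of strings):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# min_df/max_df implement the 0.5% / 99.5% document-frequency cut-offs;
# English stop words are discarded, matching the described setup.
vectorizer = TfidfVectorizer(min_df=0.005, max_df=0.995, stop_words="english")
D = vectorizer.fit_transform(raw_documents).toarray()  # N x M tf-idf matrix of d_ij
```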

5.2. Evaluation criteria

Accuracy, stability, and efficiency are important criteria for evaluating the performance of a clustering algorithm. Therefore the performance of SS-HFCR has been assessed in terms of these well-known criteria. The results are also compared with some well-received approaches from the literature.

For accuracy: two external measures, "Accuracy" and "NMI", are used as the metrics. External measures evaluate a clustering solution based on how much this clustering resembles a set of classes, commonly known as the ground truth,

2 http://people.csail.mit.edu/jrennie/20Newsgroups
3 http://www.cs.cmu.edu/∼webkb/
4 http://www.daviddlewis.com/resources/testcollections/reuters21578
5 ftp://ftp.cs.umn.edu/dept/users/boley/pddpdata/doc-K


which have been manually tagged by human experts; the more similar the clustering solution is to the ground truth, the better the clustering algorithm. Accuracy measures how accurately a clustering method assigns cluster labels $j_i$ with respect to the ground truth categories $c_i$. It discovers the one-to-one relationship between clusters and categories and measures the extent to which each cluster contains documents from the corresponding category. It is defined as follows:

$$Accuracy = \max \frac{\sum_{i=1}^{N} \delta(map(j_i), c_i)}{N} \quad (10)$$

where $\delta(x, y)$ is a function that equals 1 if x = y and 0 otherwise, and $map(j_i)$ is the permutation function which maps each cluster label to the corresponding label of the dataset. Although Accuracy is simple and compact, it may not be reasonable when we apply clustering to extremely unbalanced datasets. Sometimes, in real applications, the number of clusters found by a clustering algorithm might not be equal to the real number of classes. In such a case, NMI [4], which stands for normalized mutual information and is defined in Eq. (11), is a better measure. A higher Accuracy or NMI value indicates that better clustering performance is achieved:

$$NMI = \frac{\sum_{c=1}^{C}\sum_{j=1}^{Q} N_j^c \log\left(\dfrac{N \cdot N_j^c}{N_c \cdot N^j}\right)}{\sqrt{\left(\sum_{c=1}^{C} N_c \log\dfrac{N_c}{N}\right)\left(\sum_{j=1}^{Q} N^j \log\dfrac{N^j}{N}\right)}} \quad (11)$$

For stability: stability assesses how sensitive the algorithm is to variations in the pair-wise constraints. In other words,the stability examines whether the corresponding clustering results are “most stable” to the random selection of a fixednumber of document pairs as the constraints. In order to measure the stability, we calculate the standard deviation of20 independent trials for each dataset for each given number of the constraints. A relatively low standard deviationimplies that SS-HFCR is able to obtain a stable result with different pair-wise constraints.

For efficiency: we consider both the time complexity and the actual run time, to approximately estimate the efficiencyof the proposed method, and compare it with other semi-supervised clustering approaches discussed in the paper.

5.3. Experimental settings

We have two types of experimental setups. One makes use of user-assigned category values to form pair-wise constraints. The other makes use of ground-truth category labels to form pair-wise constraints.

5.3.1. SS-HFCR for assigned categorical values
This section refers to the first type of experimental setup. We first compare SS-HFCR with four relevant and popular semi-supervised clustering approaches incorporating pair-wise constraints that have been proposed in recent years. They are SS-RNMF [29], PMFCC [31], OSS-NMF [30] and PC-kmeans [16]. Due to space limitations, no direct comparisons were made in our study with other semi-supervised clustering approaches such as those reported in [1,14,34–36], as they were outperformed by the above four approaches. OSS-NMF-D refers to the OSS-NMF approach where the prior knowledge is provided only from the document domain.

In the experimental study, before the clustering process is carried out, the prior knowledge in the form of pair-wise constraints needs to be prepared. To form each pair-wise constraint, we first pick two documents from the dataset to form the pair. We then assign each of them a 'virtual label', which is the categorical value, and finally we place the constraint into either the must-link or the cannot-link set based on whether they have the same or different 'virtual labels', respectively.

The above step is repeated many times to produce a sufficient number of constraints for the simulation. The amount of prior knowledge is measured by the ratio of the total number of constraints in use to the number of all possible combinations of document pairs (N(N − 1)/2) for a particular dataset. The ratio increases from 0% to 5% in steps of 1%, except for webkb4 (0–0.2%). 0% indicates that the original unsupervised versions of the four approaches, proposed in [3,37,38] respectively, are applied.
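A sketch of this constraint-generation step (our illustration; the 'virtual labels' are simulated here by an assumed per-document list virtual_labels):

```python
import random

def make_constraints(virtual_labels, n_constraints, seed=0):
    """Randomly pair documents and route each pair into the ML or CNL set
    according to whether the two virtual labels agree."""
    rng = random.Random(seed)
    n = len(virtual_labels)
    ML, CNL = [], []
    while len(ML) + len(CNL) < n_constraints:
        i, k = rng.sample(range(n), 2)
        (ML if virtual_labels[i] == virtual_labels[k] else CNL).append((i, k))
    return ML, CNL
```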

The results of all algorithms given in the figures and tables from this section onwards are the mean values of 20 independent trials. The conditional random initialization described in Sections 3.3 and 3.4 is applied to SS-HFCR. For


Table 6
The values of $T_u$ in SS-HFCR on different datasets.

Dataset:  CB    Multi10  WPD6   Yahoo_K1  webkb4  re0
$T_u$:    1E−2  1E−3     1E−3   1E−2      1E−3    1E−3

PMFCC, the clustering results obtained by spherical k-means serve as the initialization for the document posterior probability matrix. For OSS-NMF-D, spherical k-means is applied to both document and word clustering to enforce the orthogonality condition on both domains. The stopping threshold $\varepsilon = 10^{-5}$ is used in all experiments. $\tau_{max}$ is set to 200 in SS-HFCR and PC-kmeans, 500 in PMFCC, and 1000 in OSS-NMF-D and SS-RNMF, for all the datasets. A significantly higher $\tau_{max}$ is set for the NMF-based approaches due to their sensitivity to the initialization; the specific $\tau_{max}$ was estimated by empirical study. For SS-HFCR, we set $T_u = 0.001$, $T_v = 0.005$ and $T_d = 1$ for all datasets except SS, SM and Yahoo_K1, for which $T_v = 0.001$. Lastly, since we evaluate the results of a fuzzy clustering algorithm, a de-fuzzification process that assigns every document to the cluster with the highest membership is required. This process is also applied to PMFCC and SS-RNMF, as both of them are soft-partitional algorithms.

For SS-RNMF, a bad local minimum may be hit with high probability due to its sensitivity to the initialization and to the selection of the pair-wise constraints, as reported in [29]. Therefore, it often results in a low accuracy (usually below 60%). To avoid this problem, the authors chose the clustering result of the trial with the minimal objective value as the valid one out of every three independent trials with different initial values. In this study, we closely follow the way the experiments for SS-RNMF were conducted in [29].

5.3.2. SS-HFCR for ground truth labels
This section refers to the second type of experimental setup. We also compare SS-HFCR with three popular semi-supervised clustering approaches in which the prior knowledge makes use of the documents' ground truth labels. The approaches in [14,17,18] are selected for this purpose.

We would like to point out that SS-HFCR can be applied not only when pair-wise constraints are available, but also when the documents' ground truth class labels are provided. In this experimental setup, a labeling set is first built by randomly picking a small number of documents from the dataset. The ground truth label is known for each document in this subset. Hence, a whole collection of pair-wise constraints can be formed by pairing any two documents within the labeling set. The corresponding ML and CNL sets can then be built based on the ground truth labels of the documents in each pair-wise constraint. Compared with pair-wise constraints randomly formed from the full set of all possible document pairs, these constraints can be considered a complete pair-relation map on a subset of the dataset. In this way, SS-HFCR can also be considered a label-based semi-supervised clustering approach. The clustering process is then guided by some ground truth labels at the initialization stage, and by the constraints in the ongoing stages. Therefore, we can also compare SS-HFCR with approaches in which ground truth labels are provided. The amount of prior knowledge is measured by the proportion of labeled documents to the total number of documents N in each dataset. It starts from 5% and increases through 10% and 15% to 20%. Due to space limitations, we present the results of only six datasets for this group of experiments. Again, $\tau_{max}$ is set to 200. For SS-HFCR, $T_v$ is set to 0.005 and $T_d$ to 1 for all datasets. The values of $T_u$ used for each of the datasets are listed in Table 6.
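The label-based setup can be sketched in the same style (ours; it builds the complete pair-relation map over a random labeled subset):

```python
import random
from itertools import combinations

def constraints_from_labels(labels, fraction, seed=0):
    """Pick a random labeled subset and pair every two of its documents,
    splitting the pairs into ML/CNL by their ground-truth labels."""
    rng = random.Random(seed)
    n = len(labels)
    subset = rng.sample(range(n), int(fraction * n))
    ML, CNL = [], []
    for i, k in combinations(subset, 2):
        (ML if labels[i] == labels[k] else CNL).append((i, k))
    return ML, CNL
```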

5.4. Results and discussions

The experimental results of SS-HFCR and the other compared semi-supervised algorithms are reported and discussed in this section. We divide the results and discussions into two sections to match the two experimental setups explained in Section 5.3.

5.4.1. Pair-wise constraints by assigned categorical values
This group of experiments runs SS-HFCR using pair-wise constraints generated from user-assigned categorical values. Each pair-wise constraint is formed by randomly pairing two documents in a dataset. Each document in the constraint has a user-assigned categorical value.


Table 7
Clustering results in Accuracy of SS-HFCR and PMFCC with less prior knowledge (standard deviation in parentheses).

No. of pair-wise  Binary               CB                   SS                   SM                   Yahoo_K1
constraints       SS-HFCR   PMFCC      SS-HFCR   PMFCC      SS-HFCR   PMFCC      SS-HFCR   PMFCC      SS-HFCR   PMFCC
50                87.1(5.3) 62.1(0.5)  67.8(0.4) 63.4(3.7)  89.7(4.1) 66.2(12)   82.8(6.0) 67.5(4.2)  74.0(3.5) 43.7(2.8)
100               93.2(2.0) 65.4(1.7)  67.9(0.7) 62.2(4.9)  87.7(4.6) 74.3(6.3)  85.3(5.2) 78.6(6.2)  75.6(4.2) 44.6(2.1)
150               93.9(0.6) 72.4(9.1)  69.5(0.8) 63.9(2.2)  90.2(3.9) 82.4(4.2)  86.7(3.5) 79.1(5.8)  77.4(2.8) 44.5(1.8)
200               95.6(1.0) 80.4(4.6)  70.0(1.3) 64.3(2.3)  93.3(1.2) 79.7(5.6)  87.5(2.1) 83.6(6.6)  78.2(1.8) 46.2(1.8)
250               96.2(0.4) 79.8(6.2)  71.1(1.2) 64.8(1.8)  95.8(1.5) 82.4(4.1)  89.4(1.0) 83.9(7.0)  78.2(1.6) 45.8(2.0)

The clustering results under the Accuracy measure for eight datasets are shown in Fig. 1; the results under the NMI measure for the other two extremely unbalanced datasets are given in Fig. 2. From these two figures, several advantages of SS-HFCR can be summarized. Firstly, SS-HFCR outperforms HFCR on all the datasets. Especially when a dataset is relatively simple, it is able to make a significant improvement by quickly learning from the few constraints provided: 100% accuracy is achieved on five datasets with less than 1% of constraints provided. Secondly, although HFCR may not be the best choice when no prior knowledge is available, SS-HFCR generally outperforms all four other algorithms on all datasets at all prior knowledge levels except re0. Thirdly, SS-HFCR shows consistency in achieving improved accuracy as more constraints are provided. However, this may not be guaranteed with the NMF-based approaches: Fig. 1(c), (d), (f), (g) show significant variations as the prior knowledge increases. This implies that a large number of constraints may sometimes even impose certain restrictions on the NMF-based clustering process. When the size of the constraint set is beyond a certain value, the quality of SS-RNMF fluctuates. Fourthly, SS-HFCR shows an advantage in terms of computational speed. It usually converges within 100 iterations, which is much faster than the NMF-based approaches.

Moreover, among the three NMF-based approaches, PMFCC seemingly always outperforms the other two. This also indicates that the additional orthogonality constraint may not be helpful for document categorization, since a hard partition mechanism may not be suitable for overlapping datasets [31]. We found that both SS-HFCR and PMFCC perform very well on several datasets at around 1% constraints. Therefore, more simulations were conducted at lower prior knowledge levels to see the improvements made as the number of constraints increases from 0 to 250 in steps of 50. The results tabulated in Table 7 confirm that SS-HFCR still outperforms PMFCC. With such a low demand on the amount of prior knowledge, our proposed method can still make good use of the knowledge to achieve a significant improvement. We also find that SS-HFCR produces more stable clustering results than the other algorithms; the standard deviation is generally very small (0–2) (e.g. webkb4, reuters3). In other words, for datasets on which HFCR may produce unstable results, e.g. Binary, SS-HFCR improves not only the clustering performance in terms of the Accuracy or NMI measure, but also the stability of the results as the number of constraints increases.

5.4.2. Pair-wise constraints generated from ground truth labels
This group of experiments runs SS-HFCR using pair-wise constraints generated from the documents' ground truth labels. Each pair-wise constraint is formed by pairing two documents in the labeling set. Each document has a ground truth label.

In Figs. 3 and 4, we show the clustering results of SS-HFCR and the other three approaches when the prior knowledge is given as ground truth labels. The clustering results in Accuracy on four datasets are shown in Fig. 3, while the NMI measures for Yahoo_K1 and re0 are given in Fig. 4. Moreover, we found that, irrespective of whether the label information is directly used as the initial fuzzy memberships in the clustering process, SS-HFCR performs equally well on most datasets. This is because the updating rules of SS-HFCR ensure that each document membership correctly reflects the degree of belongingness to every cluster throughout the clustering process. In this case, even if a document is initially assigned to the wrong cluster, in most cases it can be pulled back to the correct cluster with the help of the pair-wise constraints linked to it. Lastly, with the same amount of prior knowledge given, SS-HFCR converges much faster than the compared algorithms. Table 8 shows the average number of iterations required until convergence with 10% labeled documents, on the six datasets reported in Figs. 3 and 4.


Fig. 1. Clustering results in Accuracy on eight datasets for various percentages of constraints.


Fig. 2. Clustering results in NMI on two datasets for various percentages of constraints.

Table 8
Number of iterations of four algorithms.

Algorithm   CB   webkb4  Multi10  WPD6s  Yahoo_K1  re0
SS-HFCR      6     6     134        8     52        70
SS-WNMF     84    64     200      200    197       200
SFCM        32    48     200      200    123       164
Sd-kmeans   56   134     200      200     86        89

Table 9
The complexity of the related algorithms.

Algorithm                      Time complexity
SS-HFCR                        O(CMN)
SS-RNMF                        O(3CN(C + N))
PMFCC                          O(CMN + C²(M + N))
OSS-NMF                        O(CMN + C²(M + 2N + 2C))
Seeded-kmeans/PC-kmeans/SFCM   O(CMN)
SS-WNMF                        O(C²MN)

5.4.3. Time complexity
The computational cost of an algorithm is determined not only by the number of iterations required until convergence, but also by the time complexity. Table 9 lists the time complexity of all six related algorithms. It is noted that the time complexity of SS-HFCR is the same as that of k-means/fuzzy c-means, and it is lower than that of the three NMF-based approaches. Therefore, combining these two factors, SS-HFCR is generally the least time-consuming algorithm among the presented methods. In other words, the efficiency of SS-HFCR is higher than that of the other semi-supervised clustering approaches compared in this paper.

5.4.4. More discussion
As a dual-partitioning approach, two factors may affect the performance of SS-HFCR: (1) the absolute values of $T_u$ and $T_v$, and (2) the relative weight given to $T_u$ and $T_v$. Both may have a significant impact on the clustering results. We noted that the sensitivity to $T_u$ and $T_v$ is generally reduced as the prior knowledge increases. With a


Fig. 3. Clustering results in Accuracy on four datasets for various percentages of labeled documents.

suitable combination of $T_u$ and $T_v$, a relatively more stable clustering result (reflected by the standard deviation) and quick convergence are always achieved by SS-HFCR, compared with the other algorithms.

We also want to highlight one key difference in SS-HFCR between the two ways of forming the prior knowledge. Using roughly the same number of pair-wise constraints, where one set is generated from the labeling set (with the help of the ground truth labels) and the other is randomly selected from the full set of possible pairs (with user-assigned categorical values), the performance of SS-HFCR on the former is not as good as on the latter. For example, although the number of constraints generated by 10% randomly labeled documents is more or less equal to 1% of the constraints randomly selected from all possible pairs for CB, Fig. 3(a) shows that 100% Accuracy could not be achieved. The reason could be that, instead of concentrating the prior knowledge on a small number of labeled documents based on the ground truth, an equal number of pair-wise constraints selected from the full distribution of documents most likely covers a wider range of the relationships among documents, and therefore generates much better clustering results.

6. Conclusion

We presented SS-HFCR: a heuristic semi-supervised fuzzy co-clustering approach for categorizing large, high-dimensional text corpora. In SS-HFCR, the user can make use of a group of available pair-wise constraints on a particular


Fig. 4. Clustering results in NMI on Yahoo_K1 and re0 for various percentages of labeled documents.

dataset as the prior knowledge in the fuzzy co-clustering process. Each constraint specifies whether two documents are considered similar or dissimilar in content, based on the user's judgment or another authentic resource, or it can be generated from a small group of documents in the dataset with given ground truth labels. We treat the clustering process as solving a maximization problem on a competitive agglomeration cost function with fuzzy terms and pair-wise constraints. An iterative algorithm is developed to carry out the clustering process. The experimental study shows improved accuracy, stability, and execution time on a toy problem and a number of benchmark datasets. The results are compared with a few popular ground-truth-label-based/constraint-based semi-supervised clustering approaches.

Currently, we only consider constraints on the document domain. It is also possible to add similar constraints on the word domain, based on prior knowledge about meaningful keywords appearing in the documents. Beyond that, we also see the potential of a high-order co-clustering approach [33], incorporating some prior knowledge, for more complicated heterogeneous data analysis. In the future, we may also explore innovative approaches with lower computational complexity for clustering data in a heterogeneous network.

References

[1] I.S. Dhillon, S. Mallela, D.S. Modha, Information-theoretic co-clustering, in: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 03), 2003, pp. 89–98.
[2] S. Chatzis, A method for training finite mixture models under a fuzzy clustering principle, Fuzzy Sets Syst. 161 (2010) 3000–3013.
[3] W. Xu, X. Liu, Y. Gong, Document clustering based on non-negative matrix factorization, in: SIGIR Forum (ACM Special Interest Group on Information Retrieval), 2003, pp. 267–273.
[4] S. Zhong, J. Ghosh, Generative model-based document clustering: a comparative study, Knowl. Inf. Syst. 8 (2005) 374–384.
[5] I.S. Dhillon, Y. Guan, B. Kulis, Kernel k-means, spectral clustering and normalized cuts, in: KDD-2004—Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 551–556.
[6] B. Long, Z.M. Zhang, P.S. Yu, A probabilistic framework for relational clustering, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2007, pp. 470–479.
[7] M.E.S. Mendes, L. Sacks, Evaluating fuzzy clustering for relevance-based information access, in: IEEE International Conference on Fuzzy Systems, 2003, pp. 648–653.
[8] I.S. Dhillon, Co-clustering documents and words using bipartite spectral graph partitioning, in: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, pp. 269–274.
[9] K. Kummamuru, A. Dhawale, R. Krishnapuram, Fuzzy co-clustering of documents and keywords, in: IEEE International Conference on Fuzzy Systems, 2003, pp. 772–777.
[10] W.C. Tjhi, L. Chen, A heuristic-based fuzzy co-clustering algorithm for categorization of high-dimensional data, Fuzzy Sets Syst. 159 (2008) 371–389.
[11] D. Greene, P. Cunningham, Producing accurate interpretable clusters from high-dimensional data, in: Lecture Notes in Artificial Intelligence, vol. 3721, 2005, pp. 486–494.
[12] A. Klose, R. Kruse, Semi-supervised learning in knowledge discovery, Fuzzy Sets Syst. 149 (2005) 209–233.
[13] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: Proceedings of the Annual ACM Conference on Computational Learning Theory, 1998, pp. 92–100.
[14] S. Basu, A. Banerjee, R. Mooney, Semi-supervised clustering by seeding, in: Proceedings of the 19th International Conference on Machine Learning, 2002, pp. 19–26.
[15] E.P. Xing, A.Y. Ng, M.I. Jordan, S. Russell, Distance metric learning, with application to clustering with side-information, Adv. Neural Inf. Process. Syst. 15 (2003) 505–512.
[16] S. Basu, M. Bilenko, R.J. Mooney, A probabilistic framework for semi-supervised clustering, in: KDD-2004—Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 59–68.
[17] Y. Endo, Y. Hamasuna, M. Yamashiro, S. Miyamoto, On semi-supervised fuzzy c-means clustering, in: FUZZ-IEEE 2009, Korea, 2009.
[18] H. Lee, J. Yoo, S. Choi, Semi-supervised nonnegative matrix factorization, IEEE Signal Process. Lett. 17 (2010) 4–7.
[19] S. Zhong, Semi-supervised model-based document clustering: a comparative study, Mach. Learn. 65 (2006) 3–29.
[20] C.H. Oh, K. Honda, H. Ichihashi, Fuzzy clustering for categorical multivariate data, in: Annual Conference of the North American Fuzzy Information Processing Society—NAFIPS, 2001, pp. 2154–2159.
[21] K. Li, Z. Cao, L. Cao, R. Zhao, A novel semi-supervised fuzzy c-means clustering method, in: 2009 Chinese Control and Decision Conference, CCDC 2009, 2009, pp. 3761–3765.
[22] N. Grira, M. Crucianu, N. Boujemaa, Semi-supervised fuzzy clustering with pair-wise-constrained competitive agglomeration, in: IEEE International Conference on Fuzzy Systems (Fuzz'IEEE 2005), 2005.
[23] N. Grira, M. Crucianu, N. Boujemaa, Active semi-supervised fuzzy clustering, Pattern Recognition 41 (2008) 1851–1861.
[24] W. Pedrycz, V. Loia, S. Senatore, P-FCM: a proximity-based fuzzy clustering, Fuzzy Sets Syst. 148 (2004) 21–41.
[25] D.Y. Yeung, H. Chang, A kernel approach for semisupervised metric learning, IEEE Trans. Neural Networks 18 (2007) 141–149.
[26] X. Yin, S. Chen, E. Hu, D. Zhang, Semi-supervised clustering with metric learning: an adaptive kernel method, Pattern Recognition 43 (2010) 1320–1333.
[27] A. Bouchachia, W. Pedrycz, Enhancement of fuzzy clustering by mechanisms of partial supervision, Fuzzy Sets Syst. 157 (2006) 1733–1759.
[28] V. Sindhwani, J. Hu, A. Mojsilovic, Regularized co-clustering with dual supervision, in: Proceedings of NIPS, 2008.
[29] Y. Chen, M. Rege, M. Dong, J. Hua, Non-negative matrix factorization for semi-supervised data clustering, Knowl. Inf. Syst. 17 (2008) 355–379.
[30] H. Ma, W. Zhao, Q. Tan, Z. Shi, Orthogonal nonnegative matrix tri-factorization for semi-supervised document co-clustering, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2010, pp. 189–200.
[31] F. Wang, T. Li, C. Zhang, Semi-supervised clustering via matrix factorization, in: SIAM International Conference on Data Mining, 2008, pp. 1–12.
[32] Y. Song, S. Pan, S. Liu, Constrained co-clustering for textual documents, in: Association for the Advancement of Artificial Intelligence, 2010.
[33] Y. Chen, L. Wang, M. Dong, Non-negative matrix factorization for semisupervised heterogeneous data coclustering, IEEE Trans. Knowl. Data Eng. 22 (2010) 1459–1474.
[34] A. Hotho, S. Staab, G. Stumme, Text Clustering Based on Background Knowledge, Technical Report, 2003, p. 36.
[35] B. Kulis, S. Basu, I. Dhillon, R. Mooney, Semi-supervised graph clustering: a kernel approach, in: ICML 2005—Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 457–464.
[36] X. Ji, W. Xu, S. Zhu, Document clustering with prior knowledge, in: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, pp. 405–412.
[37] C. Ding, T. Li, W. Peng, H. Park, Orthogonal nonnegative matrix tri-factorizations for clustering, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006, pp. 126–135.
[38] I.S. Dhillon, D.S. Modha, Concept decompositions for large sparse text data using clustering, Mach. Learn. 42 (2001) 143–175.