




Algorithms for Clustering High Dimensional and Distributed Data

Tao Li    Shenghuo Zhu    Mitsunori Ogihara
Computer Science Department
University of Rochester
Rochester, NY 14627-0226, USA
{taoli, zsh, ogihara}@cs.rochester.edu

Abstract

Clustering is the problem of identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity classes. The clustering problem has been widely studied in machine learning, databases, and statistics. This paper studies the problem of clustering high dimensional data. The paper proposes an algorithm called the CoFD algorithm, which is a non-distance-based clustering algorithm for high dimensional spaces. Based on the Maximum Likelihood Principle, CoFD attempts to optimize its parameter settings to maximize the likelihood between the data points and the model generated by the parameters. Distributed versions of the algorithm, called the D-CoFD algorithms, are also proposed. Experimental results on both synthetic and real data sets show the efficiency and effectiveness of the CoFD and D-CoFD algorithms.

Keywords: CoFD, clustering, high dimensional, maximum likelihood, distributed.

The contact author is Tao Li. Contact information: Email: [email protected], telephone: +1-585-275-8479, fax: +1-585-273-4556.



1 Introduction

The problem of clustering data arises in many disciplines and has a wide range of applications. Intuitively, the clustering problem can be described as follows: let $X$ be a set of $n$ data points in a multi-dimensional space. Find a partition of $X$ into classes such that the points within each class are similar to each other. To measure similarity between points, a distance function is used. A wide variety of functions has been used to define the distance function. The clustering problem has been studied extensively in machine learning [17, 67], databases [32, 74], and statistics [13] from various perspectives and with various approaches and focuses.

Most clustering algorithms do not work efficiently in high dimensional spaces due to the curse of dimensionality. It has been shown that in a high dimensional space, the distance between every pair of points is almost the same for a wide variety of data distributions and distance functions [14]. Many feature selection techniques have been applied to reduce the dimensionality of the space. However, as demonstrated in [1], the correlations among the dimensions are often specific to data locality; in other words, some data points are correlated with a given set of features and others are correlated with respect to different features. As pointed out in [40], all methods that overcome the dimensionality problems use a metric for measuring neighborhoods, which is often implicit and/or adaptive.

In this paper, we present a non-distance-based algorithm for clustering in high dimensional spaces called CoFD1. The main idea of CoFD is as follows: suppose that a data set $X$ with feature set $Y$ needs to be clustered into $K$ classes, with the possibility of recognizing some data points to be outliers. The clustering of the data is then represented by two functions, the data map $D : X \to \{0, 1, \ldots, K\}$ and the feature map $F : Y \to \{0, 1, \ldots, K\}$, where $1, \ldots, K$ correspond to the clusters and $0$ corresponds to the set of outliers. Accuracy of such a representation is measured using the log likelihood. Then, by the Maximum Likelihood Principle, the best clustering is the representation that maximizes the likelihood. In CoFD, several approximation methods are used to optimize $D$ and $F$ iteratively. The CoFD algorithm can also be easily adapted to estimate the number of classes when the value $K$ is not given as a part of the input. An added bonus of CoFD is that it produces interpretable descriptions of the resulting classes, since it produces an explicit feature map. Furthermore, since the data and feature maps provide natural summary and representative information about the datasets and cluster structures, CoFD can be naturally extended to cluster distributed datasets. We present its distributed version, D-CoFD. The rest of the paper is organized as follows: Section 2 presents the CoFD algorithm.

1CoFD stands for Co-learning between Feature maps and Data maps.

Section 3 describes the D-CoFD algorithms. Section 4 shows our experimental results on both synthetic and real data sets. Section 5 surveys the related work. Finally, our conclusions are presented in Section 6.

2 The CoFD Algorithm

This section describes CoFD and the core idea behind it. We first present the CoFD algorithm for binary data sets. Then we show how to extend it to continuous or non-binary categorical data sets in Section 2.6.

2.1 The Model of CoFD

Suppose we wish to divide $X$ into $K$ classes with the possibility of declaring some data points to be outliers. Such a clustering can be represented by a pair of functions $(D, F)$, where $F : Y \to \{0, 1, \ldots, K\}$ is the feature map and $D : X \to \{0, 1, \ldots, K\}$ is the data map.

Given a representation $(D, F)$, we wish to be able to evaluate how good the representation is. To accomplish this we use the concept of positive features. Intuitively, a positive feature is one that best describes the class it is associated with. Suppose that we are dealing with a data set of animals in a zoo, where the vast majority of the animals are monkeys and, likewise, the vast majority of the animals are four-legged. Then, given an unidentified animal having four legs in the zoo, it is quite natural for one to guess that the animal is a monkey, because the conditional probability of that event is high. Therefore, we regard the feature "having four legs" as a positive (characteristic) feature of the class. In most practical cases, characteristic features of a class do not overlap with those of another class. Even if some overlaps exist, we can add combinations of those features into the feature space.

Let $n$ be the total number of data points and let $m$ be the number of features. Since we are assuming that the features are binary, $X$ can be represented as an $n \times m$ binary matrix $W$, which we call the data-feature matrix. Let $1 \le i \le n$ and $1 \le j \le m$. We say that the $j$th feature is active in the $i$th point if and only if $W_{ij} = 1$; we also say that the $i$th data point possesses the $j$th feature if and only if $W_{ij} = 1$.

The key idea behind the CoFD algorithm is the use of the Maximum Likelihood Principle, which states that the best model is the one that has the highest likelihood of generating the observed data. We apply this principle by regarding the data-feature matrix as the observed data and the representation $(D, F)$ as the model. Let the data map $D$ and the feature map $F$ be given. Consider the $n \times m$ matrix $\hat{W}(D, F)$, defined as follows: for each $i$, $1 \le i \le n$, and each $j$, $1 \le j \le m$, the $(i, j)$ entry of the matrix is $1$ if $D(x_i) = F(y_j) \ne 0$ and $0$ otherwise. This is the model represented by $D$ and $F$, interpreted as the consistence of $D(x_i)$ and $F(y_j)$. Note that $\hat{W}_{ij}(D, F) = 0$ if $D(x_i) = 0$ or $F(y_j) = 0$. For all $i$, $1 \le i \le n$, $j$, $1 \le j \le m$, $a \in \{0, 1\}$, and $b \in \{0, 1\}$, we consider $\Pr[W_{ij} = a \mid \hat{W}_{ij}(D, F) = b]$, the probability of the $j$th feature being active in the $i$th data point in the real data conditioned upon the $j$th feature being active in the $i$th data point in the model represented by $D$ and $F$. We assume that this conditional probability depends only on the values of $a$ and $b$. Let $p(a \mid b)$ denote the probability of an entry being observed as $a$ while the entry in the model is equal to $b$. Also, let $q(a, b)$ denote the proportion of pairs $(i, j)$ such that $W_{ij} = a$ and $\hat{W}_{ij}(D, F) = b$. Then the log likelihood of the model can be expressed as follows:

$$\log L(D, F) = \log \prod_{i=1}^{n} \prod_{j=1}^{m} p\bigl(W_{ij} \mid \hat{W}_{ij}(D, F)\bigr) = nm \sum_{a, b \in \{0, 1\}} q(a, b) \log p(a \mid b). \qquad (1)$$

Maximizing (1) over $p$ gives $p(a \mid b) = q(a, b)/q(b)$, where $q(b) = \sum_a q(a, b)$, so that at the optimum

$$\log L(D, F) = -\, nm \, H\bigl(W \mid \hat{W}(D, F)\bigr), \qquad (2)$$

where $H(W \mid \hat{W}(D, F))$ denotes the conditional entropy of the observed entries given the model entries under the empirical distribution $q$.
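To make the model concrete, the following Python sketch (our illustration, not the authors' implementation) builds the model matrix $\hat{W}(D, F)$ from a data map and a feature map, and evaluates the log likelihood of equation (1) with $p(a \mid b)$ estimated empirically as $q(a, b)/q(b)$; classes are labeled $1, \ldots, K$ and $0$ marks outliers.

    import numpy as np

    def model_matrix(data_map, feature_map):
        # W_hat[i, j] = 1 iff D(x_i) = F(y_j) != 0; label 0 denotes the outlier class.
        d = np.asarray(data_map)[:, None]      # shape (n, 1)
        f = np.asarray(feature_map)[None, :]   # shape (1, m)
        return ((d == f) & (d != 0)).astype(int)

    def log_likelihood(W, data_map, feature_map):
        # log L(D, F) = n*m * sum_{a,b} q(a, b) log p(a|b), with p(a|b) = q(a, b)/q(b).
        W_hat = model_matrix(data_map, feature_map)
        ll = 0.0
        for b in (0, 1):
            mask = (W_hat == b)
            n_b = mask.sum()
            for a in (0, 1):
                n_ab = np.logical_and(W == a, mask).sum()
                if n_ab > 0:
                    ll += n_ab * np.log(n_ab / n_b)
        return ll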

We apply a hill-climbing method to maximize $\log L(D, F)$, i.e., we alternately optimize one of $D$ and $F$ while fixing the other. First, we optimize $F$ while fixing $D$. The problem of optimizing $F$ over all data-feature pairs can be approximately decomposed into subproblems of optimizing each $F(y_j)$ over all data-feature pairs of feature $j$, i.e., minimizing the conditional entropy $H(W_{\cdot j} \mid \hat{W}_{\cdot j}(D, F))$. If $D$ and $F(y_j)$ are given, these entropies can be estimated directly. Therefore, $F(y_j)$ is assigned to the class for which the conditional entropy of $W_{\cdot j}$ is the smallest. Optimizing $D$ while fixing $F$ is the dual problem; hence, we only need to minimize $H(W_{i \cdot} \mid \hat{W}_{i \cdot}(D, F))$ for each $i$. A straightforward approximation for minimizing $H(W_{\cdot j} \mid \hat{W}_{\cdot j}(D, F))$ is to assign $F(y_j)$ to $\arg\max_k |\{i : W_{ij} = 1,\ D(x_i) = k\}|$. The same approximation applies to minimizing $H(W_{i \cdot} \mid \hat{W}_{i \cdot}(D, F))$, where we assign $D(x_i)$ to $\arg\max_k |\{j : W_{ij} = 1,\ F(y_j) = k\}|$. A direct improvement of this approximation is possible by using the idea of the entropy-constrained vector quantizer [21]: before taking the argmax, the counts above are adjusted by terms involving $\log m_k$ and $\log n_k$, respectively, where $m_k$ and $n_k$ are the number of features and the number of points in class $k$.

2.2 Algorithm Description

There are two auxiliary procedures in the algorithm: EstimateFeatureMap estimates the feature map from the data map, and EstimateDataMap estimates the data map from the feature map. Chi-square tests are used for deciding whether a feature is an outlier or not. The main procedure, CoFD, attempts to find the best clustering by an iterative process similar to the EM algorithm: the algorithm iteratively estimates the data and feature maps based on the estimates made in the previous round, until no more changes occur in the feature map. The pseudo-code of the algorithm is shown in Figure 1.

It can be observed from the pseudo-code description that the time complexities of both EstimateFeatureMap and EstimateDataMap are linear in the size of the data-feature matrix. The number of iterations in CoFD is not related to $n$ or $m$.
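Because the pseudo-code of Figure 1 is not reproduced here, the following Python sketch gives one plausible reading of the iteration, using the counting approximation of Section 2.1: EstimateFeatureMap assigns each feature to the class whose points possess it most often, EstimateDataMap does the dual for points, and the main loop alternates until the feature map stops changing. The chi-square outlier test is omitted, and the function names and details are our assumptions rather than the authors' code.

    import numpy as np

    def estimate_feature_map(W, data_map, K):
        # Assign feature j to argmax_k |{i : W[i, j] = 1 and data_map[i] = k}|.
        data_map = np.asarray(data_map)
        counts = np.zeros((K, W.shape[1]))
        for k in range(1, K + 1):
            counts[k - 1] = W[data_map == k].sum(axis=0)
        return counts.argmax(axis=0) + 1             # labels in {1, ..., K}

    def estimate_data_map(W, feature_map, K):
        # Dual step: assign point i to argmax_k |{j : W[i, j] = 1 and feature_map[j] = k}|.
        feature_map = np.asarray(feature_map)
        counts = np.zeros((W.shape[0], K))
        for k in range(1, K + 1):
            counts[:, k - 1] = W[:, feature_map == k].sum(axis=1)
        return counts.argmax(axis=1) + 1             # labels in {1, ..., K}

    def cofd(W, K, seed_data_map, max_iter=100):
        # Alternate the two estimates until the feature map no longer changes.
        data_map = np.asarray(seed_data_map)
        feature_map = estimate_feature_map(W, data_map, K)
        for _ in range(max_iter):
            data_map = estimate_data_map(W, feature_map, K)
            new_feature_map = estimate_feature_map(W, data_map, K)
            if np.array_equal(new_feature_map, feature_map):
                break
            feature_map = new_feature_map
        return data_map, feature_map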

2.3 An Example

To illustrate how CoFD works, an example is given in this section. Suppose we have the small binary data set presented in Table 1, consisting of six data points over seven features. Initially, two data points are chosen as seed points, one placed in class 1 and the other in class 2. EstimateFeatureMap then reports that three features are positive in class 1, that two features are positive in class 2, and that the remaining two features are outliers. Next, EstimateDataMap assigns three of the data points to class 1 and the other three to class 2. Running EstimateFeatureMap on this assignment, three features are positive in class 1, three features are positive in class 2, and one feature is an outlier. In the next iteration the result does not change, so the algorithm stops. The resulting clustering consists of the two classes of three data points each.

2.4 Refining Methods

Clustering results are sensitive to the initial seed points. Randomly chosen seed points may cause the search to become trapped in local minima. We use a refining procedure whose idea is to use conditional entropy to measure the similarity between a pair of clustering results: CoFD attempts to find the clustering result having the smallest average conditional entropy against all others.

Clustering a large data set may be time consuming. To speed up the algorithm, we focus on reducing the number of iterations. A small set of data points, for example a small fraction of the entire data set, may be selected as the bootstrap data set. First, the clustering algorithm is executed on the bootstrap data set. Then, the clustering algorithm is run on the entire data set, using the data map obtained from clustering the bootstrap data set instead of randomly generated seed points.
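A minimal sketch of this speed-up, reusing the cofd() and estimate_data_map() sketches above; the sample size is an arbitrary choice, not the fraction used in the paper.

    import numpy as np

    def cofd_with_bootstrap(W, K, sample_fraction=0.05, rng=None):
        # Phase 1: cluster a small random sample starting from random seed labels.
        rng = np.random.default_rng(rng)
        n = W.shape[0]
        sample = rng.choice(n, size=max(K, int(sample_fraction * n)), replace=False)
        seed = rng.integers(1, K + 1, size=sample.size)
        _, boot_feature_map = cofd(W[sample], K, seed)
        # Phase 2: the bootstrap feature map yields an initial data map for all points,
        # so the run on the full data set starts near a good solution.
        full_seed = estimate_data_map(W, boot_feature_map, K)
        return cofd(W, K, full_seed)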

CoFD can be easily adapted to estimate the number of classes instead of taking $K$ as an input parameter. We observed that the best number of classes yields the smallest average conditional entropy between clustering results obtained from different random seed point sets. Based on this observation, the algorithm for guessing the number of clusters is described in Figure 3.
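The selection rule of Figure 3 (not reproduced here) can be sketched as follows: for each candidate number of classes, cluster the data several times from different random seed points and keep the value whose clusterings agree most, i.e., whose average pairwise conditional entropy is smallest. The helper below reuses the cofd() sketch above; the number of runs is an arbitrary choice.

    import numpy as np
    from itertools import combinations

    def conditional_entropy(labels_a, labels_b):
        # H(A | B) computed from the joint distribution of two label vectors.
        a, b = np.asarray(labels_a), np.asarray(labels_b)
        n, h = len(a), 0.0
        for vb in np.unique(b):
            mask = (b == vb)
            _, counts = np.unique(a[mask], return_counts=True)
            p = counts / counts.sum()
            h -= (mask.sum() / n) * np.sum(p * np.log2(p))
        return h

    def guess_num_classes(W, candidate_ks, runs=5, rng=None):
        # Pick the K whose clusterings from different random seeds agree most.
        rng = np.random.default_rng(rng)
        best_k, best_score = None, np.inf
        for K in candidate_ks:
            maps = []
            for _ in range(runs):
                seed = rng.integers(1, K + 1, size=W.shape[0])
                data_map, _ = cofd(W, K, seed)
                maps.append(data_map)
            score = np.mean([conditional_entropy(maps[i], maps[j]) +
                             conditional_entropy(maps[j], maps[i])
                             for i, j in combinations(range(runs), 2)]) / 2
            if score < best_score:
                best_k, best_score = K, score
        return best_k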

2.5 Using Graph Partitioning to Initialize Feature Map

In this section, we describe the method of using graph partitioning to initialize the feature map. For binary datasets, we can find the frequent associations between features and build the association graph of the features. Then we use a graph partitioning algorithm to partition the features into several parts, i.e., to induce the initial feature map.

2.5.1 Feature Association

The problem of finding all frequent associations among attributes in categorical ("basket") databases [3], called association mining, is one of the most fundamental and most popular problems in data mining. Formally, association mining is the problem of, given a database $T$, each of whose records (transactions) is a subset of a universe $U$ (the set of all items), and a threshold value $\sigma$, $0 < \sigma \le 1$ (the so-called minimum support), enumerating all nonempty $S \subseteq U$ such that the proportion of transactions in $T$ containing every element of $S$ is at least $\sigma$, that is, $|\{t \in T : S \subseteq t\}| \ge \sigma |T|$. Such a set $S$ is called a frequent itemset. (There are variations in which the task is to enumerate all frequent itemsets having certain properties, e.g., [10, 22, 29].) There are many practical algorithms for this problem in various settings, e.g., [4, 5, 20, 34, 38, 43, 73] (see the survey by Hipp et al. [41]), and, using current commodity machines, these practical algorithms can be made to run reasonably quickly.

2.5.2 Graph Partitioning

After finding all the frequent associations2, we build the association graph. In the association graph, the nodes represent the features and the edge weights are the supports of the associations. Then we use graph partitioning to obtain an initial feature map. Graph partitioning partitions the nodes of a graph into several parts such that the sum of the weights of the edges connecting nodes in different parts is minimized. The partitioning of the association graph therefore divides the features into several subsets so that the connections between the subsets are minimized. In our experiments, we apply the METIS algorithms [50] to obtain an initial feature map.

2In our experiments, we use the Apriori algorithm to find the frequent associations.



Using graph partitioning to initialize the feature map enables our algorithm to explicitly consider the correlations between the features. In general, the feature space is much smaller than the data space; hence, graph partitioning on the feature space is very efficient.
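A small sketch of this initialization: only pairwise feature associations are used to build the weighted graph, and the Kernighan-Lin bisection from networkx stands in for METIS (so the sketch produces a two-way split of the features). The minimum support value is an assumption, not the one used in our experiments.

    from itertools import combinations
    import networkx as nx
    from networkx.algorithms.community import kernighan_lin_bisection

    def feature_association_graph(W, min_support=0.05):
        # Nodes are features; edge (j, jp) is weighted by the support of the pair,
        # i.e., the fraction of points possessing both features.
        n, m = W.shape
        G = nx.Graph()
        G.add_nodes_from(range(m))
        for j, jp in combinations(range(m), 2):
            support = float((W[:, j] * W[:, jp]).sum()) / n
            if support >= min_support:
                G.add_edge(j, jp, weight=support)
        return G

    def initial_feature_map(W, min_support=0.05):
        # Partition the association graph so that little edge weight crosses parts.
        G = feature_association_graph(W, min_support)
        part_a, part_b = kernighan_lin_bisection(G, weight="weight")
        fmap = [0] * W.shape[1]
        for j in part_a:
            fmap[j] = 1
        for j in part_b:
            fmap[j] = 2
        return fmap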

2.6 Extending to Non-binary Data Sets

In order to handle non-binary data sets, we first translate the raw attribute values of the data points into binary feature spaces. The translation scheme in [68] can be used to discretize categorical and continuous attributes. However, this type of translation is vulnerable to outliers that may drastically skew the range. In our algorithm, we use two other discretization methods. The first combines the Equal Frequency Intervals method with the idea of CMAC [6]: given $n$ instances, the method divides each dimension into $\ell$ bins, with each bin containing $n/\ell + \delta$ adjacent values3. In other words, with this method each dimension is divided into several overlapping segments, and the size of the overlap is determined by $\delta$. An attribute is then translated into a binary sequence whose bit-length equals the number of overlapping segments, where each bit represents whether the attribute value belongs to the corresponding segment. We can also use Gaussian mixture models to fit each attribute of the original data sets, since most of them are generated from Gaussian mixture models. The number of Gaussian distributions $g$ can be obtained by maximizing the Bayesian information criterion of the mixture model. The attribute value is then translated into $g$ feature values in the binary feature space: the $j$th feature value is $1$ if the probability that the value of the data point is generated from the $j$th Gaussian distribution is the largest.

3The parameter $\delta$ can be a constant for all bins or a different constant for each bin, depending on the distribution density of the dimension.
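The Gaussian-mixture translation can be sketched with scikit-learn as follows. The cap on the number of components is an assumption, and scikit-learn's BIC is minimized (its sign convention is the opposite of the "maximizing the Bayesian information criterion" phrasing above).

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def gmm_binarize_column(x, max_components=5):
        # Fit 1..max_components Gaussians, keep the best model by BIC, and emit one
        # binary feature per component: 1 for the component most likely to have
        # generated the value, 0 for the others.
        x = np.asarray(x, dtype=float).reshape(-1, 1)
        models = [GaussianMixture(n_components=g, random_state=0).fit(x)
                  for g in range(1, max_components + 1)]
        best = min(models, key=lambda m: m.bic(x))
        hard = best.predict(x)
        features = np.zeros((len(x), best.n_components), dtype=int)
        features[np.arange(len(x)), hard] = 1
        return features

    def binarize(data, max_components=5):
        # Translate every attribute separately and concatenate the binary features.
        data = np.asarray(data, dtype=float)
        return np.hstack([gmm_binarize_column(col, max_components) for col in data.T])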

3 Distributed Clustering

In this section, we first briefly review the motivation for distributed clustering and then give the D-CoFD algorithms for distributed clustering.

3.1 Motivation for Distributed Clustering

Over the years, data set sizes have grown rapidly with the advances in technology, the ever-increasing computing power and computer storage capacity, the permeation of the Internet into daily life, and the increasingly automated business, manufacturing, and scientific processes. Moreover, many of these data sets are, in nature,

geographically distributed across multiple sites. For example, the huge numbers of sales records of hundreds of chain stores are stored at different locations. To cluster such large and distributed data sets, it is important to investigate efficient distributed algorithms that reduce the communication overhead, central storage requirements, and computation times. With the high scalability of distributed systems and the easy partitioning and distribution of a centralized dataset, distributed clustering algorithms can also bring the resources of multiple machines to bear on a given problem as the data size scales up.

3.2 D-CoFD Algorithms

In a distributed environment, data sites may be homogeneous, i.e., different sites containing data for exactly the same set of features, or heterogeneous, i.e., different sites storing data for different sets of features, possibly with some common features among sites. As already mentioned, the data and feature maps provide natural summary information about the dataset and the cluster structures. Based on this observation, we present extensions of our CoFD algorithm for clustering homogeneous and heterogeneous datasets.

3.2.1 Clustering Homogeneous Datasets

We first present an extension of the CoFD algorithm, D-CoFD-Hom, for clustering homogeneous datasets. In this paper, we assume that the data sets have similar distributions and hence that the number of classes at each site is the same.

Algorithm D-CoFD-Hom
begin
1. Each site sends a sample of its data to the central site;
2. The central site decides the parameters for data transformation (e.g., the number of mixtures for each dimension) and broadcasts them to each site;
3. Each site preprocesses its data, performs CoFD, and obtains its data map $D$ and its corresponding feature map $F$;
4. Each site sends its feature map $F$ to the central site;
5. The central site decides the global feature map $F^*$ and broadcasts it to each site;
6. Each site estimates its data map $D$ from the global feature map $F^*$.
end

The task of estimating the global feature map at the central site can be described as follows: first, we need to identify the correspondence between each pair of sites. If we view each class as a collection of features4, we need to find a

mapping between the classes of two sites that maximizes the mutual information between the two sets of features. This can be reduced to a maximum-weight bipartite matching problem, which can be solved in polynomial time (see, e.g., [42] for bipartite matching). In our experiments, however, the number of classes is relatively small; thus, we used brute-force search to find the mapping between the classes on two sites. Once the best correspondence has been found for each pair of sites, the global feature map can be established using EstimateFeatureMap, i.e., by assigning each feature to a class.

4The feature indices are the same among all sites.
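One way to realize the brute-force correspondence search is sketched below: build the overlap matrix of the two feature maps, as in Section 4.4.1, and choose the relabeling of one site's classes with the largest total overlap (feasible because $K$ is small). This is our illustration of the step, not the authors' code; outliers (label 0) are ignored.

    import numpy as np
    from itertools import permutations

    def class_correspondence(fmap_a, fmap_b, K):
        # C[k, l] counts features assigned to class k+1 at site A and class l+1 at site B.
        fmap_a, fmap_b = np.asarray(fmap_a), np.asarray(fmap_b)
        C = np.zeros((K, K), dtype=int)
        for k in range(1, K + 1):
            for l in range(1, K + 1):
                C[k - 1, l - 1] = np.sum((fmap_a == k) & (fmap_b == l))
        # Brute force over permutations: maximize the total overlap on matched pairs.
        best_perm, best_overlap = None, -1
        for perm in permutations(range(K)):
            overlap = sum(C[k, perm[k]] for k in range(K))
            if overlap > best_overlap:
                best_perm, best_overlap = perm, overlap
        # Map each class of site A to its corresponding class of site B (1-based labels).
        return {k + 1: best_perm[k] + 1 for k in range(K)}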

3.2.2 Clustering Heterogeneous Datasets

Here we present another extension of the CoFD algorithm, D-CoFD-Het, for clustering heterogeneous datasets. We assume that each site contains relational data and that the row indices are used as the key linking rows across sites. The duality of the feature map and the data map in CoFD makes it natural to extend it to an algorithm for clustering heterogeneous datasets.

Algorithm D-CoFD-Het
begin
1. Each site preprocesses its data, performs CoFD, and obtains its data map $D$ and its corresponding feature map $F$;
2. Each site sends $D$ to the central site;
3. The central site decides the global data map $D^*$ and broadcasts it to each site, using the same approach as for estimating the global feature map;
4. Each site estimates its feature map $F$ from the global data map $D^*$.
end

4 Experimental Results

4.1 Experiments with CoFD algorithm

There are many ways to measure how accurately CoFD performs. One is the confusion matrix described in [1]: entry $(o, i)$ of a confusion matrix is the number of data points assigned to output class $o$ that were generated from input class $i$. The input map $I$ maps the data points to the input classes, so the information of the input map can be measured by the entropy $H(I)$. The goal of clustering is to find an output map $O$ that recovers this information. Thus, the conditional entropy $H(I \mid O)$ is interpreted as the information of the input map given the output map $O$, i.e., the portion of the information which is not recovered by the clustering algorithm. Therefore, the recovering rate of a clustering algorithm, defined as $1 - H(I \mid O)/H(I) = MI(I, O)/H(I)$ 5, can also be used as a performance measure for clustering. The purity [75], which measures the extent to which each cluster contains data points from primarily one class, is also a good metric for cluster quality. The purity of a clustering solution is obtained as a weighted sum of the individual cluster purities and is given by

$$\text{Purity} = \sum_{r=1}^{k} \frac{n_r}{n} P(S_r), \qquad P(S_r) = \frac{1}{n_r} \max_i n_r^i,$$

where $S_r$ is a particular cluster of size $n_r$, $n_r^i$ is the number of documents of the $i$-th input class that were assigned to the $r$-th cluster, $k$ is the number of clusters, and $n$ is the total number of points 6. In general, the larger the value of purity, the better the clustering solution. For a perfect clustering solution, the purity is $1$.

5$MI(I, O)$ is the mutual information between $I$ and $O$.
6$P(S_r)$ is also called the individual cluster purity.
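For reference, both measures can be computed from a confusion matrix as in the sketch below (rows are output clusters, columns are input classes); this illustrates the definitions above rather than reproducing code from the paper.

    import numpy as np

    def _entropy(p):
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def recovering_rate(confusion):
        # 1 - H(I|O)/H(I), with H(I|O) = sum_o p(o) H(I | O = o).
        C = np.asarray(confusion, dtype=float)
        n = C.sum()
        h_i = _entropy(C.sum(axis=0) / n)
        h_i_given_o = sum((row.sum() / n) * _entropy(row / row.sum())
                          for row in C if row.sum() > 0)
        return 1.0 - h_i_given_o / h_i

    def purity(confusion):
        # Weighted sum of individual cluster purities: sum_r (n_r / n) * max_i n_r^i / n_r.
        C = np.asarray(confusion, dtype=float)
        return C.max(axis=1).sum() / C.sum()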

All the experiments are performed on a Sun workstation running SunOS 5.7.

4.1.1 A Binary Synthetic Data Set

First we generate a binary synthetic data set $X_1$ to evaluate our algorithm. The data points are generated from several clusters in a binary feature space. Each cluster has a set of positive features and a set of negative features: for a point of the cluster, each positive feature is $1$ with high probability, each negative feature is $1$ with only a small probability, and each of the remaining features is $1$ with a fixed background probability.

Table 2 shows the confusion matrix of this experiment. Each input class is recovered entirely by a single output class and all the remaining entries are zeros, so $H(I \mid O) = 0$; i.e., the recovering rate of the algorithm is $1$ and all data points are correctly recovered.

4.1.2 A Continuous Synthetic Data Set

In this experiment we attempt to cluster a continuous data set. We use the method described in [1] to generate a data set $X_2$ in a high dimensional space, in which every input class is generated in a lower dimensional subspace. Five percent of the data points are chosen to be outliers, which are distributed uniformly at random throughout the entire space. Using the second translation method described in Section 2.6, we map all the data points into a binary feature space.

A small subset of the data points is then randomly chosen as the bootstrap data set. By running the CoFD algorithm on the bootstrap data set, we obtain a clustering of the bootstrap data set. Using this clustering as the seed points, we run the algorithm on the entire data set. Table 3 shows the confusion matrix of this experiment; the vast majority of the data points are correctly recovered, and the conditional entropy $H(I \mid O)$ is small relative to the input entropy $H(I)$, giving a high recovering rate. We also make a rough comparison with the result reported in [1]: computing the recovering rate from the confusion matrix reported in that paper and comparing it with ours seems to indicate that our algorithm is better than theirs in terms of recovering rate.

4.1.3 Zoo Database

We also evaluate the performance of the CoFD algorithm on the zoo database available at the UC Irvine Machine Learning Repository. The database contains 100 animals, each of which has 15 boolean attributes and 1 categorical attribute7. We translate each boolean attribute into two features, indicating whether the attribute is active and whether it is inactive. We translate the numeric attribute, "legs", into six features, which correspond to 0, 2, 4, 5, 6, and 8 legs, respectively. Table 4 shows the confusion matrix of this experiment, from which the recovering rate $1 - H(I \mid O)/H(I)$ is computed. In the confusion matrix, we found that the clusters with a large number of animals are likely to be correctly clustered.8

CoFD comes with an important by-product: the resulting classes can easily be described in terms of features, since the algorithm produces an explicit feature map. For example, one output class has the positive features "no eggs," "no backbone," "venomous," and "eight legs"; a second has the positive features "feathers," "airborne," "two legs," "domestic," and "catsize"; a third has the single positive feature "no legs"; a fourth has the positive features "aquatic," "no breathes," "fins," and "five legs"; a fifth has the positive features "six legs" and "no tail"; and the last has the positive feature "four legs." Hence, the second class can be described as animals having feathers and two legs and being airborne, which are the representative features of the birds, and the fourth class can be described as animals that are aquatic, have no breathes, and have fins and five legs9.

7The original data set has 101 data points, but one animal, "frog," appears twice, so we eliminated one of them. We also eliminated two attributes, "animal name" and "type."
8For example, the large input clusters are each mapped onto a single output cluster.
9The animals "dolphin" and "porpoise" belong to the mammal class but were clustered with the aquatic class, because their attributes "aquatic" and "fins" make them more similar to the animals in the aquatic class than their attribute "milk" makes them to the mammals.



4.2 Scalability Results

In this subsection we present the computational scalability results for our algorithm. The results are averaged over five runs in each case to eliminate randomness effects. We report the scalability results in terms of the number of points and the dimensionality of the space.

Number of points: The data sets we test have a fixed number of features (dimensions); they all contain the same number of clusters, and each cluster has the same average number of features. Figure 4 and Figure 5 show the scalability of the algorithm in terms of the number of data points. The value of the $y$ coordinate in Figure 4 is the running time in seconds. We implemented the algorithm in Octave on Linux; if it were implemented in C, the running time could be reduced considerably. Figure 4 contains two curves: one is the scalability result with bootstrap and one without. It can be seen from Figure 4 that with bootstrap, our algorithm scales sub-linearly with the number of points, and without bootstrap it scales linearly with the number of points. The value of the $y$ coordinate in Figure 5 is the ratio of the running time to the number of points. The linear and sub-linear scalability properties can also be observed in Figure 5.

Dimensionality of the space: The data sets we test have a fixed number of data points; they all contain the same number of clusters, and the average number of features (dimensions) per cluster is a fixed fraction of the total number of features. Figure 6 and Figure 7 show the scalability of the algorithm in terms of the dimensionality of the space. The value of the $y$ coordinate in Figure 6 is the running time in seconds. It can be seen from Figure 6 that our algorithm scales linearly with the dimensionality of the space. The value of the $y$ coordinate in Figure 7 is the ratio of the running time to the number of features.

4.3 Document Clustering Using CoFD

In this section, we apply our CoFD algorithm to cluster documents and compare its performance with other standard clustering algorithms. Document clustering has been used as a means for improving document retrieval performance. In our experiments, documents are represented using the binary vector-space model, in which each document is a binary vector in the term space and each element of the vector indicates the presence of the corresponding term.

4.3.1 Document Datasets

Several datasets are used in our experiments:



URCS: The URCS dataset is the collection of technical reports published in the year 2002 by the Computer Science Department at the University of Rochester10. There are 3 main research areas at URCS: AI11, Systems, and Theory, while AI has been loosely divided into two sub-areas: NLP12 and Robotics/Vision. Hence the dataset contains 476 abstracts that are divided into 4 different research areas.

WebKB: The WebKB dataset contains webpages gathered from university computer science departments. There are about 8300 documents, and they are divided into 7 categories: student, faculty, staff, course, project, department, and other. The raw text is about 27 MB. Among these 7 categories, student, faculty, course, and project are the four most populous entity-representing categories; the associated subset is typically called WebKB4. In this paper, we did experiments on both the 7-category and 4-category datasets.

Reuters: The Reuters-21578 Text Categorization Test collection contains documents collected from the Reuters newswire in 1987. It is a standard text categorization benchmark and contains 135 categories. In our experiments, we use a subset of the data collection that includes the 10 most frequent categories among the 135 topics; we call it Reuters-top 10.

K-dataset: The K-dataset is from the WebACE project [35] and was used in [15] for document clustering. It contains 2340 documents consisting of news articles from the Reuters news service via the Web in October 1997. These documents are divided into 20 classes.

The datasets and their characteristics are summarized in Table 5. To preprocess the datasets, we remove stop words using a standard stop list and perform stemming using a Porter stemmer; all HTML tags are skipped, and all header fields except the subject and organization of the posted article are ignored. In all our experiments, we first select the top 1000 words by mutual information with the class labels. The feature selection is done with the rainbow package [56].

10The dataset can be downloaded from http://www.cs.rochester.edu/u/taoli/data.
11Artificial Intelligence.
12Natural Language Processing.

4.3.2 Document Clustering Comparisons

Document clustering methods can be mainly categorized into two types: partitioning and hierarchical clustering [69]. Partitioning methods decompose a collection of documents into a given number of disjoint clusters which are optimal in terms of some predefined criterion functions. For example, the traditional $K$-means method tries to minimize the sum-of-squared-errors criterion function. The criterion functions of adaptive $K$-means approaches usually take more factors into consideration and hence are more complicated. There are many different hierarchical clustering

algorithms with different policies for combining clusters, such as single-linkage, complete-linkage, and the UPGMA method [45]. Single-linkage and complete-linkage use the minimum and the maximum distance between the two clusters, respectively, while UPGMA (the Unweighted Pair Group Method with Arithmetic mean) uses the average pairwise distance between the two clusters to define their similarity for aggregation.

In our experiments, we compare the performance of CoFD on these datasets with $K$-means, partitioning methods with various adaptive criterion functions, and hierarchical clustering algorithms using the single-linkage, complete-linkage, and UPGMA aggregating measures. For the partitioning and hierarchical algorithms, we use the CLUTO clustering package described in [75, 76]. The comparisons are shown in Table 6. Each entry is the purity of the corresponding column algorithm on the row dataset. P1, P2, P3, and P4 are the partitioning algorithms with the different adaptive criterion functions shown in Table 7. The Slink, Clink, and UPGMA columns are the hierarchical clustering algorithms using the single-linkage, complete-linkage, and UPGMA aggregating policies. In our experiments, we use the cosine measure of two document vectors as their similarity.

From Table 6, we observe that CoFD achieves the best performance on the URCS and Reuters-top 10 datasets. On all other datasets, the results obtained by CoFD are close to the best results: $K$-means has the best purity on WebKB4 with CoFD a close second, P2 achieves the best result on WebKB, and P1 gives the highest performance on the K-dataset, with CoFD's purities on WebKB and the K-dataset close behind. Figure 8 shows a graphical comparison. The comparison shows that, although there is no single winner on all the datasets, CoFD is a viable and competitive algorithm in the document clustering domain.

To get a better understanding of CoFD, Table 8 gives the confusion matrix built from the clustering results on the URCS dataset. The columns of the confusion matrix are NLP, Robotics/Vision, Systems, and Theory, respectively. The result shows that Systems and Theory are quite different from each other and from NLP and Robotics/Vision; that NLP and Robotics/Vision are similar to each other; and that the NLP sub-area of AI is more similar to Systems than Robotics/Vision is.

4.4 Experimental Results of the D-CoFD Algorithms

This subsection presents the experimental results of the D-CoFD algorithms for distributed clustering.



4.4.1 An Artificial Experiment

We generate a test data set $X_3$ using the algorithm in [59]. $X_3$ lies in a high dimensional space and includes several noise dimensions. The first translation method described in Section 2.6 is used for preprocessing. We first apply the CoFD algorithm to the whole dataset, for which the recovering rate is $1$, i.e., all data points are correctly recovered. We then horizontally partition the dataset into five subsets of equal size and apply the D-CoFD-Hom algorithm. An important step of the D-CoFD-Hom algorithm is to identify the correspondence between the classes of one site and those of another for each pair of sites. We use a brute-force approach to find the mapping that maximizes the mutual information between them. For example, given the feature maps $F_A$ and $F_B$ from site $A$ and site $B$, respectively, we want to establish the correspondence between the classes on site $A$ and those on site $B$. We first construct the confusion matrix of the two feature maps, in which the entry $(k, l)$ is the number of features that are positive in both class $k$ at site $A$ and class $l$ at site $B$. According to the confusion matrix, we can easily derive the correspondence. Table 9 gives a detailed example of a confusion matrix and its consequent correspondence. Overall, all the points are again correctly clustered and thus the recovering rate is $1$.

In another experiment, we vertically partition the dataset into three subsets, each containing all of the data points but only a subset of the features. As in the homogeneous case, to identify the correspondence between sets of classes, we use brute-force search to find the mapping that maximizes the mutual information between them. For example, given the data maps $D_A$ and $D_B$ of site $A$ and site $B$, we want to establish the correspondence between site $A$ and site $B$. We construct the confusion matrix of the two data maps, in which the entry $(k, l)$ is the number of points that are clustered into both class $k$ at site $A$ and class $l$ at site $B$; this is analogous to the confusion matrix of the feature maps in the homogeneous case. Then, according to the confusion matrix of the data maps, we can easily derive the mapping. Table 10 presents the confusion matrix and its correspondence. The outcome of the D-CoFD-Het algorithm is shown in Table 11; the recovering rate in this case is close to the centralized result. The above experiments demonstrate that the results of distributed clustering are very close to those of centralized clustering.

4.4.2 Dermatology Database

We also evaluate our algorithms on the dermatology database from the UC Irvine Machine Learning Repository. The database is used for the differential diagnosis of erythemato-squamous diseases. These diseases all share the clinical features of erythema and scaling, with very few differences. The dataset contains 366 data points over 34 attributes and was previously used for classification; in our experiment, we use it to demonstrate our distributed clustering algorithms. The confusion matrix of centralized clustering, i.e., CoFD, on this dataset is presented in Table 12, along with its recovering rate. We then horizontally partition the dataset into several subsets of equal size. The confusion matrix produced by D-CoFD-Hom is shown in Table 13.

5 Related Work

The Self-Organizing Map (SOM) [53] is widely used for clustering and is particularly well suited to the task of identifying a small number of prominent classes in a multidimensional data set. SOM uses an incremental approach in which points/patterns are processed one by one. However, the "centroids" with huge dimensionality produced by SOM are hard to interpret. Other traditional clustering techniques can be classified into partitional, hierarchical, density-based, and grid-based. Partitional clustering attempts to directly decompose the data set into $K$ disjoint classes such that the data points in a class are nearer to one another than to the data points in other classes. Hierarchical clustering proceeds successively by building a tree of clusters. Density-based clustering groups the neighboring points of a data set into classes based on density conditions. Grid-based clustering quantizes the object space into a finite number of cells that form a grid structure and then performs clustering on the grid structure. Most of these methods use distance functions as objective criteria and are not effective in high dimensional spaces. Next we review some recent clustering algorithms which have been proposed for high dimensional spaces or which do not use distance functions, and which are closely related to our work.

CLIQUE [2] is an automatic subspace clustering algorithm for high dimensional spaces. It uses equal-size cells and cell density to find dense regions in each subspace of a high dimensional space. The cell size and the density threshold need to be provided by the user as inputs. CLIQUE does not produce disjoint classes, and the highest dimensionality of the subspace classes reported is about ten. Our algorithm produces disjoint classes and does not require additional parameters such as the cell size and density; it can also find classes with higher dimensionality. Aggarwal et al. [1] introduce projected clustering and present algorithms for discovering interesting patterns in subspaces of high dimensional spaces. The core idea is a generalization of feature selection which allows the selection of different sets of dimensions for different subsets of the data sets. However, the algorithms are based on the Euclidean or Manhattan distance, and their feature selection method is a variant of singular value decomposition. Also, the algorithms assume that the



number of projected dimensions is given beforehand. Our algorithm needs neither the distance measures nor the number of dimensions for each class. Also, our algorithm does not require all projected classes to have the same number of dimensions.

Ramkumar and Swami [64] propose a method for clustering without distance functions. The method is based on the principles of co-occurrence maximization and label minimization, which are normally implemented using association and classification techniques. The method does not consider correlations of dimensions with respect to data locality.

The idea of co-clustering of data points and attributes dates back to [7, 39, 62]. Co-clustering is a simultaneous clustering of both points and their attributes, utilizing the canonical duality contained in the point-by-attribute data representation. Govaert [31] studies simultaneous block clustering of the rows and columns of contingency tables. Dhillon [24] presents a co-clustering algorithm using a bipartite graph between documents and words. Our algorithm, however, uses the association graph of the features to initialize the feature map and then applies an EM-type algorithm. Han et al. [36] propose a clustering algorithm based on association rule hypergraphs, but they do not consider the co-learning problem of the data map and the feature map. A detailed survey of hypergraph-based clustering can be found in [37].

Cheng et al. [19] propose an entropy-based subspace clustering method for mining numerical data. The criterion of the entropy-based method is that a low entropy subspace corresponds to a skewed distribution of unit densities. A decision-tree-based clustering method is proposed in [55]. There is also some recent work on clustering categorical datasets, such as ROCK [33] and CACTUS [30]. Our algorithm can be applied to both categorical and numerical data.

There are also many probabilistic clustering approaches, in which the data are considered to be a sample independently drawn from a mixture model of several probability distributions [58]. The area around the mean of each distribution constitutes a natural cluster. Generally, the clustering task is then reduced to finding the parameters of the distributions, such as the means and variances, that maximize the probability that the data are drawn from the mixture model (also known as maximizing the likelihood). The Expectation-Maximization (EM) method [23] is then used to find the mixture model parameters that maximize the likelihood. Moore [60] suggests an acceleration of the EM method based on a special data index, the KD-tree, in which the data are divided at each node into two descendants by splitting the widest attribute at the center of its range. The algorithm AutoClass [18] utilizes a mixture model and covers a wide range of distributions, including Bernoulli, Poisson, Gaussian, and log-normal distributions. The algorithm MCLUST [28] uses Gaussian models with ellipsoids of different volumes, shapes, and orientations. Smyth [67] uses probabilistic clustering for multivariate and sequential data, and Cadez et al. [16] use mixture models for



customer profiling based on transactional information. Barash and Friedman [9] use a context-specific Bayesian clustering method to help in understanding the connections between transcription factors and functional classes of genes. The context-specific approach has the ability to deal with many attributes that are irrelevant to the clustering and is suitable for the underlying biological problem. Kearns et al. [51] give an information-theoretic analysis of hard assignments (used by $K$-means) and soft assignments (used by the EM algorithm) for clustering, and propose a posterior partition algorithm which is close to the soft assignments of EM for clustering. The relationship between maximum likelihood and clustering is also discussed in [47].

Minimum encoding length approaches, including Minimum Message Length (MML) [70] and Minimum Description Length (MDL) [65], can also be used for clustering. Both approaches perform induction by seeking a theory that enables the most compact encoding of both the theory and the available data, and both admit probabilistic interpretations. Given prior probabilities for both theories and data, minimization of the MML encoding closely approximates maximization of the posterior probability, while an MDL code length defines an upper bound on the "unconditional likelihood" [66, 71]. Minimum encoding is an information-theoretic criterion for parameter estimation and model selection. Using minimum encoding for clustering, we look for the cluster structure that minimizes the size of the encoding. The algorithm SNOB [72] uses a mixture model in conjunction with the Minimum Message Length (MML) principle. An empirical comparison of prominent criteria is given in [63], which concludes that Minimum Message Length appears to be the best criterion for selecting the number of components for mixtures of Gaussian distributions.

Fasulo [26] gives a detailed survey of clustering approaches based on mixture models [8], dynamical systems [52], and clique graphs [11]. Similarity based on shared features has also been analyzed in cognitive science, for example in the family resemblance study by Rosch and Mervis. Some methods have been proposed for classification using maximum entropy (or maximum likelihood) [12, 61]. The classification problem is a supervised learning problem, for which the data points are already labeled with known classes. The clustering problem, however, is an unsupervised learning problem, as no labeled data are available.

Recently there has been a lot of interest in distributed clustering. Kargupta et al. [49] present collective principal component analysis for clustering heterogeneous datasets. Johnson and Kargupta [46] study hierarchical clustering from distributed and heterogeneous data. Both of the above works deal with vertically partitioned data. In our D-CoFD algorithms, we envisage data that is horizontally partitioned as well as vertically partitioned. Lazarevic et al. [54] propose a distributed clustering algorithm for learning regression modules from spatial datasets. Hulth



and Grenholm [44] propose a distributed clustering algorithm which is intended to be incremental and to work in real-time situations. McClean et al. [57] present algorithms to cluster databases that hold aggregate count data on a set of attributes that have been classified according to heterogeneous classification schemes. Forman and Zhang [27] present distributed K-means, K-Harmonic-Means, and EM algorithms for homogeneous (horizontally partitioned) datasets. Parallel clustering algorithms are also discussed, for example, in [25]. More work on distributed clustering can be found in [48].

6 Conclusions

In this paper, we first propose a novel clustering algorithm, CoFD, which does not require a distance function for high dimensional spaces. We offer a new perspective on the clustering problem by interpreting it as the dual problem of optimizing the feature map and the data map. As a by-product, CoFD also produces the feature map, which can provide interpretable descriptions of the resulting classes. CoFD also provides a method to select the number of classes based on conditional entropy. We introduce the recovering rate, an accuracy measure of clustering, to measure the performance of clustering algorithms and to compare the clustering results of different algorithms. We extend CoFD and propose the D-CoFD algorithms for clustering distributed datasets. Extensive experiments have been conducted to demonstrate the efficiency and effectiveness of the algorithms.

Documents are usually represented as high-dimensional vectors in term space. We evaluate CoFD for clustering on a variety of document collections and compare its performance with other widely used document clustering algorithms. Although there is no single winner on all the datasets, CoFD outperforms the rest on two datasets and its performance is close to the best on the other three datasets. In summary, CoFD is a viable and competitive clustering algorithm.

Acknowledgments

The project is supported in part by NIH Grants 5-P41-RR09283, RO1-AG18231,and P30-AG18254 and by NSF Grants EIA-0080124, EIA-0205061, NSF CCR-9701911, and DUE-9980943.



References

[1] Charu C. Aggarwal, Joel L. Wolf, Philip S. Yu, Cecilia Procopiuc, andJong Soo Park. Fast algorithms for projected clustering. In ACM SIGMODConference, pages 61–72, 1999.

[2] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In SIGMOD-98, 1998.

[3] R. Agrawal, T. Imielinski, and A. Swami. Mining associations between setsof items in massive databases. In Proceedings of the ACM-SIGMOD 1993International Conference on Management of Data, pages 207–216, 1993.

[4] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fastdiscovery of association rules. In Advances in Knowledge Discovery andData Mining., pages 307–328. AAAI/MIT Press, 1996.

[5] R. Agrawal and R. Srikant. Fast algorithms for mining association rules inlarge databases. In Proceedings of 20th Conference on Very Large Databases,pages 487–499, 1994.

[6] J. S. Albus. A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Trans. of the ASME, J. Dynamic Systems, Measurement, and Control, 97(3):220-227, September 1975.

[7] M. R. Anderberg. Cluster Analysis for Applications. Academic Press Inc.,1973.

[8] J. Banfield and A. Raftery. Model-based gaussian and non-gaussian cluster-ing. Biometrics, 49:803–821, 1993.

[9] Yoseph Barash and Nir Friedman. Context-specific bayesian clustering forgene expression data. In RECOMB, pages 12–21, 2001.

[10] R. J. Bayardo Jr. and R. Agrawal. Mining the most interesting rules. InProceedings of 5th International Conference on Knowledge Discovery andData Mining, pages 145–154, New York, NY, 1999. ACM Press.

[11] A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering gene expression patterns.J. of Comp. Biology, 6(3/4):281–297, 1999.

[12] Adam L. Berger, Stephen A. Della Pietra, and Vincent J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 1996.



[13] M. Berger and I. Rigoutsos. An algorithm for poinr clustering and grid gen-eration. IEEE Trans. on Systems, Man and Cybernetics, 21(5):1278–1286,1991.

[14] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearestneighbor meaningful? In ICDT Conference, 1999.

[15] Daniel Boley, Maria Gini, Robert Gross, Eui-Hong Han, Kyle Hastings,George Karypis, Vipin Kumar, Bamshad Mobasher, and Jerome Moore. Doc-ument categorization and query generation on the world wide web using we-bace. AI Review, 13(5).

[16] I. Cadez, P. Smyth, and H. Mannila. Probabilistic modeling of transactionaldata with applications to profiling, visualization, and prediction. In KDD-2001, pages 37–46, San Francisco, CA, 2001.

[17] P. Cheeseman, J. Kelly, and M. Self. AutoClass: A bayesian classificationsystem. In ICML’88, 1988.

[18] P. Cheeseman and J. Stutz. Bayesian classification (autoclass): Theory andresults. In Advances in Know/edge Discovery and Data Mining. AAAIPress/MIT Press, 1996.

[19] C-H Cheng, A. W-C Fu, and Y. Zhang. Entropy-based subspace clusteringfor mining numerical data. In KDD-99, 1999.

[20] D. Cheung, J. Han, V. T. Ng, and A. W. Fu an Y. Fu. Fast distributed algorithmfor mining association rules. In International Conference on Parallel andDistributed Information Systems, pages 31–42, 1996.

[21] P. A. Chou, T. Lookabaugh, and R. M. Gray. Entropy-constrained vectorquantization. IEEE Trans., ASSP-37(1):31, 1989.

[22] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. wull-man, and C. Yang. Finding interesting associations without support pruning.IEEE Trasactions on Knowledge and Data Engineering, 13(1):64–78, 2001.

[23] A.P. Dempster, N.M. Laird, and D.B. Rubin. Maximum likelihood from in-complete data. Journal of the Royal Statistical Society, (39):1–38, 1977.

[24] I. Dhillon. Co-clustering documents and words using bipartite spectral graphpartitioning. Technical Report 2001-05, UT Austin CS Dept, 2001.

21

Page 22: Algorithms-For-clustering-high Dimensional and Distributed Data

[25] Inderjit S. Dhillon and Dharmendra S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Large-Scale Parallel Data Mining, Lecture Notes in Artificial Intelligence, pages 245–260, 2000.

[26] D. Fasulo. An analysis of recent work on clustering algorithms. Technical Report 01-03-02, U. of Washington, Dept. of Comp. Sci. & Eng., 1999.

[27] G. Forman and B. Zhang. Distributed data clustering can be efficient and exact. SIGKDD Explorations, 2(2):34–38, 2001.

[28] C. Fraley and A. Raftery. MCLUST: Software for model-based cluster and discriminant analysis. Technical Report 342, Dept. Statistics, Univ. of Washington, 1999.

[29] A. Fu, R. Kwong, and J. Tang. Mining N-most interesting itemsets. In Proceedings of the 12th International Symposium on Methodologies for Intelligent Systems, pages 59–67. Springer-Verlag Lecture Notes in Computer Science 1932, 2000.

[30] Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan. CACTUS - clustering categorical data using summaries. In Knowledge Discovery and Data Mining, pages 73–83, 1999.

[31] G. Govaert. Simultaneous clustering of rows and columns. Control and Cybernetics, (24):437–458, 1985.

[32] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the 1998 ACM SIGMOD Conference, 1998.

[33] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25(5):345–366, 2000.

[34] E. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In ACM SIGMOD Conference, pages 277–288. ACM Press, 1997.

[35] Eui-Hong Han, Daniel Boley, Maria Gini, Robert Gross, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher, and Jerome Moore. WebACE: A web agent for document categorization and exploration. In Katia P. Sycara and Michael Wooldridge, editors, Proceedings of the 2nd International Conference on Autonomous Agents (Agents'98), pages 408–415, New York, 9–13, 1998. ACM Press.


[36] Eui-Hong Han, George Karypis, Vipin Kumar, and Bamshad Mobasher. Clustering based on association rule hypergraphs. In Research Issues on Data Mining and Knowledge Discovery, 1997.

[37] Eui-Hong Han, George Karypis, Vipin Kumar, and Bamshad Mobasher. Hypergraph based clustering in high-dimensional data sets: A summary of results. Data Engineering Bulletin, 21(1):15–22, 1998.

[38] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM SIGMOD Conference, pages 1–12. ACM Press, 2000.

[39] J. Hartigan. Clustering Algorithms. Wiley, 1975.

[40] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

[41] J. Hipp, U. Guntzer, and G. Nakhaeizadeh. Algorithms for association rule mining - a general survey and comparison. SIGKDD Explorations, 2(1):58–63, 2000.

[42] J. Hopcroft and R. Karp. An n^{5/2} algorithm for maximum matchings in bipartite graphs. SIAM Journal on Computing, 2(4):225–231, 1973.

[43] M. Houtsma and A. Swami. Set-oriented mining of association rules. Research Report RJ 9567, IBM Almaden Research Center, San Jose, CA, October 1993.

[44] N. Hulth and P. Grenholm. A distributed clustering algorithm. Technical Report 74, Lund University Cognitive Science, 1998.

[45] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

[46] Erik L. Johnson and Hillol Kargupta. Collective, hierarchical clustering from distributed, heterogeneous data. In M. Zaki and C. Ho, editors, Large-Scale Parallel Systems, pages 221–244. Springer, 1999.

[47] M. I. Jordan. Graphical Models: Foundations of Neural Computation. MIT Press, 2001.

[48] Hillol Kargupta and Philip Chan, editors. Advances in Distributed and Parallel Data Mining. AAAI Press, 2000.


[49] Hillol Kargupta, Weiyun Huang, Krishnamoorthy Sivakumar, and Erik Johnson. Distributed clustering using collective principal component analysis. In ACM SIGKDD-2000 Workshop on Distributed and Parallel Knowledge Discovery, 2000.

[50] George Karypis and Vipin Kumar. Multilevel algorithms for multi-constraint graph partitioning. Technical Report 98-019, 1998.

[51] M. Kearns, Y. Mansour, and A. Y. Ng. An information-theoretic analysis of hard and soft assignment methods for clustering. In Learning in Graphical Models. Kluwer AP, 1998.

[52] Jon M. Kleinberg. Two algorithms for nearest-neighbor search in high dimensions. pages 599–608, 1997.

[53] Teuvo Kohonen. The Self-Organizing Map. In New Concepts in Computer Science, 1990.

[54] A. Lazarevic, D. Pokrajac, and Z. Obradovic. Distributed clustering and local regression for knowledge discovery in multiple spatial databases. In 8th European Symposium on Artificial Neural Networks (ESANN 2000), 2000.

[55] Bing Liu, Yiyuan Xia, and Philip S. Yu. Clustering through decision tree construction. In SIGMOD-00, 2000.

[56] Andrew Kachites McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.

[57] S. McClean, B. Scotney, K. Greer, and R. Pairceir. Conceptual clustering of heterogeneous distributed databases. In Workshop on Ubiquitous Data Mining, PAKDD01, 2001.

[58] G. McLachlan and K. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, NY, 1988.

[59] G. Milligan. An algorithm for creating artificial test clusters. Psychometrika, 50(1):123–127, 1985.

[60] A. Moore. Very fast EM-based mixture model clustering using multiresolution kd-trees. In Proceedings of the Neural Information Processing Systems Conference, 1998.


[61] K. Nigam, J. Lafferty, and A. McCallum. Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61–67, 1999.

[62] S. Nishisato. Analysis of Categorical Data: Dual Scaling and its Applications. University of Toronto Press, Toronto, 1980.

[63] J. J. Oliver, R. A. Baxter, and C. S. Wallace. Unsupervised learning using MML. In Machine Learning: Proceedings of the Thirteenth International Conference (ICML 96), pages 364–372. Morgan Kaufmann Publishers, 1996.

[64] G.D. Ramkumar and A. Swami. Clustering data without distance function. IEEE Data(base) Engineering Bulletin, 21:9–14, 1998.

[65] J. Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of Statistics, 11(2):416–431, 1983.

[66] Jorma Rissanen. Stochastic complexity. Journal of the Royal Statistical Society B, 49(3), 1987.

[67] P. Smyth. Probabilistic model-based clustering of multivariate and sequential data. In Proceedings of Artificial Intelligence and Statistics, pages 299–304, San Mateo, CA, 1999. Morgan Kaufmann.

[68] R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. In SIGMOD-96, 1996.

[69] M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000.

[70] C.S. Wallace and D.M. Boulton. An information measure for classification. Computer Journal, 11(2):185–194, 1968.

[71] C.S. Wallace and P.R. Freeman. Estimation and inference by compact coding. Journal of the Royal Statistical Society (Series B), 49(3), 1987.

[72] C.S. Wallace and D.L. Dowe. Intrinsic classification by MML - the Snob program. In Proceedings of the 7th Australian Joint Conference on Artificial Intelligence, pages 37–44. World Scientific, 1994.

[73] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for discovery of association rules. Data Mining and Knowledge Discovery, 1:343–373, 1998.


[74] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In ACM SIGMOD Conference, 1996.

[75] Y. Zhao and G. Karypis. Criterion functions for document clustering: Experiments and analysis. Technical Report 01-40, Department of Computer Science, University of Minnesota, Minneapolis, MN, 2001.

[76] Y. Zhao and G. Karypis. Evaluation of hierarchical clustering algorithms for document datasets. Technical Report 02-022, Department of Computer Science, University of Minnesota, Minneapolis, MN, 2002.


List of Tables

1  A data set example.
2  Confusion matrix for the first synthetic data set.
3  Confusion matrix for the second synthetic data set.
4  Confusion matrix of the zoo data.
5  Document Data Sets Descriptions.
6  Document Clustering Comparison Table.
7  Criteria Functions for Partitioning Algorithms.
8  Confusion matrix of technical reports by CoFD.
9  Experiments in the homogeneous setting.
10 Experiments in the heterogeneous setting.
11 Confusion Matrix on Site A.
12 Confusion matrix of Centralized Result for Dermatology Database.
13 Confusion matrix of Distributed Result for Dermatology Database.


data point    a   b   c   d   e   f   g
1             1   1   0   0   1   0   0
2             1   1   1   1   0   0   1
3             1   0   1   0   0   0   0
4             0   1   0   0   1   1   0
5             0   0   0   1   1   1   1
6             0   0   0   1   0   1   0

Table 1: A data set example.
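As an illustration only, the following Python sketch loads the binary matrix of Table 1 and counts, for a hypothetical two-cluster partition {1, 2, 3} / {4, 5, 6} (an assumption made here, not a partition stated in the paper), how often each feature occurs in each cluster; features that concentrate in one cluster are the kind of cluster-characterizing features that CoFD works with. The variable names are illustrative.

import numpy as np

# Binary data matrix of Table 1: rows are data points 1-6, columns are features a-g.
X = np.array([
    [1, 1, 0, 0, 1, 0, 0],  # point 1
    [1, 1, 1, 1, 0, 0, 1],  # point 2
    [1, 0, 1, 0, 0, 0, 0],  # point 3
    [0, 1, 0, 0, 1, 1, 0],  # point 4
    [0, 0, 0, 1, 1, 1, 1],  # point 5
    [0, 0, 0, 1, 0, 1, 0],  # point 6
])
features = list("abcdefg")

# Hypothetical partition used only for illustration: cluster 0 = {1,2,3}, cluster 1 = {4,5,6}.
labels = np.array([0, 0, 0, 1, 1, 1])

# Count the occurrences of each feature within each cluster and assign the
# feature to the cluster where it occurs most often (ties go to cluster 0).
for j, name in enumerate(features):
    counts = [int(X[labels == c, j].sum()) for c in (0, 1)]
    print(name, counts, "-> cluster", int(np.argmax(counts)))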

Output \ Input     A    B    C    D    E
1                  0    0    0   83    0
2                  0    0    0    0   81
3                  0    0   75    0    0
4                 86    0    0    0    0
5                  0   75    0    0    0

Table 2: Confusion matrix for the first synthetic data set.

Output \ Input      A      B      C      D      E   Outliers
1                   1      0      1  17310      0         12
2                   0  15496      0      2     22        139
3                   0      0      0      0  24004          1
4                  10      0  17425      0      5         43
5               20520      0      0      0      0          6
Outliers           16     17     15     10     48       4897

Table 3: Confusion matrix for the second synthetic data set.

Output \ Input    1    2    3    4    5    6    7
A                 0    0    0    0    0    0    1
B                 0   20    0    0    0    0    0
C                39    0    0    0    0    0    0
D                 0    0    2    0    0    0    0
E                 2    0    1   13    0    0    4
F                 0    0    0    0    0    8    5
G                 0    0    2    0    3    0    0

Table 4: Confusion matrix of the zoo data.


Datasets          # documents   # classes
URCS                      476           4
WebKB4                  4,199           4
WebKB                   8,280           7
Reuters-top 10          2,900          10
K-dataset               2,340          20

Table 5: Document Data Sets Descriptions.

Datasets         CoFD    k-means   P1      P2      P3      P4      Slink   Clink   UPMGA
URCS             0.889   0.782     0.868   0.836   0.872   0.878   0.380   0.422   0.597
WebKB4           0.638   0.647     0.639   0.644   0.646   0.640   0.392   0.446   0.395
WebKB            0.501   0.489     0.488   0.505   0.503   0.493   NA      NA      NA
Reuters-top 10   0.734   0.717     0.681   0.684   0.724   0.691   0.393   0.531   0.498
K-dataset        0.754   0.644     0.762   0.748   0.716   0.750   0.220   0.514   0.551

Table 6: Document Clustering Comparison Table. Each entry is the purity of the corresponding column algorithm on the row dataset.
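Table 6 reports cluster purity. For reference, purity can be computed directly from a confusion matrix by crediting each output cluster with its dominant input class; the short Python sketch below uses this standard definition (assumed here to match the paper's usage) and takes the Table 2 matrix as input. The function name purity is illustrative.

import numpy as np

def purity(confusion):
    # Each output cluster (row) is credited with its dominant input class
    # (column); the credited counts are summed and divided by the total.
    confusion = np.asarray(confusion)
    return confusion.max(axis=1).sum() / confusion.sum()

# Confusion matrix of Table 2: rows are output clusters 1-5, columns are input classes A-E.
table2 = [
    [ 0,  0,  0, 83,  0],
    [ 0,  0,  0,  0, 81],
    [ 0,  0, 75,  0,  0],
    [86,  0,  0,  0,  0],
    [ 0, 75,  0,  0,  0],
]
print(purity(table2))  # prints 1.0, since every output cluster contains a single input class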

Table 7: Criteria Functions for Different Partitioning Algorithms (P1, P3, and P4 are maximization criteria; P2 is a minimization criterion), expressed in terms of the dataset, the clusters and their sizes, the number of clusters, and the pairwise distance/similarity function.


Output \ Input    1     2     3     4
A                68     0     8     0
B                 8     1     4   120
C                 0     0   160     0
D                25    70     6     6

Table 8: Confusion matrix of technical reports by CoFD.

Site A \ Site B   Outlier    1     2     3     4     5
Outlier                 7    0     0     0     0     0
1                       1    0     0    25     2     0
2                       1    0     0     0    27     0
3                       0   29     0     0     0     0
4                       1    0     0     0     1    28
5                       0    0    28     0     0     0

Cluster correspondence:
Site A    1    2    3    4    5
Site B    3    4    1    5    2

Table 9: Confusion matrix and correspondence of feature maps in the homogeneous setting.

Site A \ Site B      1      2      3      4      5
1                 1989      0      0      3      6
2                    1      2      0      0   1997
3                    2      0      0   1958      1
4                    0   1998      3     14      0
5                    0      0   1998     28      0

Cluster correspondence:
Site A    1    2    3    4    5
Site B    1    5    4    2    3

Table 10: Confusion matrix and correspondence of data maps in the heterogeneous setting.
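The cluster correspondences listed under Tables 9 and 10 can be recovered from the confusion matrices by maximum-weight bipartite matching. The Python sketch below shows one way to do this; it uses SciPy's assignment solver as a stand-in (the paper cites a bipartite matching algorithm [42], but the exact procedure sketched here is an assumption) and reproduces the Table 10 mapping.

import numpy as np
from scipy.optimize import linear_sum_assignment

# Confusion matrix of Table 10: rows are Site A clusters 1-5, columns are Site B clusters 1-5.
C = np.array([
    [1989,    0,    0,    3,    6],
    [   1,    2,    0,    0, 1997],
    [   2,    0,    0, 1958,    1],
    [   0, 1998,    3,   14,    0],
    [   0,    0, 1998,   28,    0],
])

# Find the pairing that maximizes the total matched count
# (negate the matrix because the solver minimizes total cost).
rows, cols = linear_sum_assignment(-C)
for a, b in zip(rows, cols):
    print("Site A cluster", a + 1, "<-> Site B cluster", b + 1)
# Output: 1<->1, 2<->5, 3<->4, 4<->2, 5<->3, matching the correspondence in Table 10.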


Output \ Input      1      2      3      4      5
1                   0      0      1      0   1997
2                1999      0      0      0      1
3                   1      0   1958      0      2
4                   0   2000     13      2      0
5                   0      0     28   1998      0

Table 11: Confusion Matrix on Site A.

Output \ Input     1     2     3     4     5     6
A                  0     5     0     1     0    20
B                112    13     0     2     4     0
C                  0     0    72     1     0     0
D                  0    14     0    15     5     0
E                  0     7     0     4    43     0
F                  0    22     0    26     0     0

Table 12: Confusion matrix of Centralized Result for Dermatology Database.

Output \ Input     1     2     3     4     5     6
A                  3     0     9     0    41     0
B                 99     6     0     1     0     0
C                  1     0    63     1     0     0
D                  5     4     0     2     8    20
E                  2    16     0     3     3     0
F                  2    35     0    42     0     0

Table 13: Confusion matrix of Distributed Result for Dermatology Database.


List of Figures

1 Description of CoFD algorithm.
2 Clustering and refining algorithm.
3 Algorithm of guessing the number of clusters.
4 Scalability with the number of points: running time.
5 Scalability with the number of points: running time/#points.
6 Scalability with the number of features: running time.
7 Scalability with the number of features: running time/#features.
8 Purity comparisons on various document datasets.


Procedure EstimateFeatureMap (data points: D, data map: M)
begin
  for each feature f do
    if the best class for f under M passes the acceptance threshold then
      F(f) := the class whose data points (under M) give f the strongest support;
    else
      F(f) := outlier;
    endif
  end
  return F;
end

Procedure EstimateDataMap (data points: D, feature map: F)
begin
  for each data point x do
    if the best class for x under F passes the acceptance threshold then
      M(x) := the class whose features (under F) give x the strongest support;
    else
      M(x) := outlier;
    endif
  end
  return M;
end

Algorithm CoFD (data points: D, number of classes: k)
begin
  let D0 be a set of randomly chosen distinct data points from D;
  assign each of them to one of the k classes; call the resulting map M0;
  F := EstimateFeatureMap(D0, M0);
  repeat
    M' := M;
    M := EstimateDataMap(D, F);
    F := EstimateFeatureMap(D, M);
  until the conditional entropy H(M' | M) is zero;
  return M;
end

Figure 1: Description of CoFD algorithm.
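To make the alternating structure of Figure 1 concrete, the Python sketch below implements a simplified variant: it assumes binary features, replaces the likelihood-based acceptance tests and outlier handling with plain majority-count assignments, and stops when the data map no longer changes (a simplified stand-in for the zero-conditional-entropy test). It illustrates the control flow of CoFD rather than its exact decision rules; the names estimate_feature_map, estimate_data_map, and cofd_like are illustrative, not the paper's.

import numpy as np

def estimate_feature_map(X, data_map, k):
    # Assign each feature to the cluster whose points exhibit it most often.
    # (Simplified: no outlier class for features, no acceptance threshold.)
    counts = np.zeros((X.shape[1], k))
    for c in range(k):
        counts[:, c] = X[data_map == c].sum(axis=0)
    return counts.argmax(axis=1)

def estimate_data_map(X, feature_map, k):
    # Assign each point to the cluster whose features it matches most often.
    counts = np.zeros((X.shape[0], k))
    for c in range(k):
        counts[:, c] = X[:, feature_map == c].sum(axis=1)
    return counts.argmax(axis=1)

def cofd_like(X, k, n_iter=20, seed=0):
    # Alternate the two estimation steps, starting from a random data map,
    # and stop when the data map stops changing.
    rng = np.random.default_rng(seed)
    data_map = rng.integers(0, k, size=X.shape[0])
    for _ in range(n_iter):
        feature_map = estimate_feature_map(X, data_map, k)
        new_map = estimate_data_map(X, feature_map, k)
        if np.array_equal(new_map, data_map):
            break
        data_map = new_map
    return data_map, feature_map

For example, cofd_like(X, 2) on the Table 1 matrix alternates the two steps until the assignment stabilizes and returns the resulting data map and feature map.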


Algorithm Refine (data points: D, number of classes: k)
begin
  run CoFD(D, k) r times and assign the results to M_1, ..., M_r;
  compute the average conditional entropies E_i = (1/r) * sum_j H(M_i | M_j) for i = 1, ..., r;
  return the M_i with the smallest E_i;
end

Figure 2: Clustering and refining algorithm.
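The selection step in Figure 2 scores each of the r clusterings by its average conditional entropy against the other runs and keeps the most consistent one. The Python sketch below computes the empirical conditional entropy H(A | B) of one labeling given another and applies that selection rule; the plug-in estimator and the function names are assumptions, not the paper's exact definitions.

import numpy as np

def conditional_entropy(a, b):
    # Empirical H(a | b), in bits, for two label vectors of equal length.
    a = np.asarray(a)
    b = np.asarray(b)
    n = len(a)
    h = 0.0
    for bv in np.unique(b):
        mask = (b == bv)
        p_b = mask.sum() / n
        _, counts = np.unique(a[mask], return_counts=True)
        p = counts / counts.sum()
        h -= p_b * np.sum(p * np.log2(p))
    return h

def pick_most_consistent(labelings):
    # Average conditional entropy of each run against all runs, then return
    # the run with the smallest average (the most agreed-upon clustering).
    r = len(labelings)
    scores = [np.mean([conditional_entropy(labelings[i], labelings[j]) for j in range(r)])
              for i in range(r)]
    return labelings[int(np.argmin(scores))]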

Algorithm GuessK (data points: D)
{ k_max is the estimated maximum number of clusters }
begin
  for k = 2, ..., k_max do
    run the clustering algorithm on (D, k) r times and assign the results to M_{k,1}, ..., M_{k,r};
    compute the average conditional entropies E_{k,i} = (1/r) * sum_j H(M_{k,i} | M_{k,j}) for i = 1, ..., r;
  end
  return the k with the best average conditional entropy scores;
end

Figure 3: Algorithm of guessing the number of clusters.
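Figure 3 applies the same consistency idea to choosing the number of clusters: for each candidate k, the data are clustered several times and k is scored by how well the runs agree. The Python sketch below is a schematic of that loop under stated assumptions: cluster_fn is any base clustering routine returning a label vector (for instance a wrapper around the cofd_like sketch above), and the final argmin over k is a simplifying assumption rather than the paper's exact selection rule.

import numpy as np

def conditional_entropy(a, b):
    # Empirical H(a | b) in bits (same estimator as in the previous sketch).
    a = np.asarray(a)
    b = np.asarray(b)
    n = len(a)
    h = 0.0
    for bv in np.unique(b):
        mask = (b == bv)
        _, counts = np.unique(a[mask], return_counts=True)
        p = counts / counts.sum()
        h -= (mask.sum() / n) * np.sum(p * np.log2(p))
    return h

def guess_k(X, cluster_fn, k_max, r=5):
    # Score each candidate k by the average pairwise conditional entropy of r
    # runs (low means the runs agree) and return the best-scoring k.
    best_k, best_score = None, np.inf
    for k in range(2, k_max + 1):
        runs = [cluster_fn(X, k, seed=s) for s in range(r)]
        score = np.mean([conditional_entropy(runs[i], runs[j])
                         for i in range(r) for j in range(r) if i != j])
        if score < best_score:
            best_k, best_score = k, score
    return best_k

# Example usage with the cofd_like sketch from Figure 1 (keeping only the data map):
# guess_k(X, lambda X, k, seed: cofd_like(X, k, seed=seed)[0], k_max=6)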

[Plot: running time in seconds versus #points, for runs with and without bootstrapping.]

Figure 4: Scalability with the number of points: running time.


[Plot: running time/#points versus #points, for runs with and without bootstrapping.]

Figure 5: Scalability with the number of points: running time/#points.

[Plot: running time in seconds versus #features.]

Figure 6: Scalability with the number of features: running time.


[Plot: running time/#features versus #features.]

Figure 7: Scalability with the number of features: running time/#features.

[Plot: purity of CoFD, K-Means, P1, P2, P3, P4, Slink, Clink, and UPMGA on URCS, WebKB4, WebKB, Reuters-top 10, and K-dataset.]

Figure 8: Purity comparisons on various document datasets.
