




Journal of Machine Learning Research n/a (2004) n/a Submitted 09/04; Published n/s

Cluster Ensembles for High Dimensional Clustering:

An Empirical Study

Xiaoli Z. Fern [email protected]

School of Electrical and Computer Engineering, Purdue University, W. Lafayette, IN 47907, USA

Carla E. Brodley [email protected]

Electrical Engineering and Computer Science, Tufts University, 161 College Avenue, Medford, MA 02155, USA

Editor:

Abstract

This paper studies cluster ensembles for high dimensional data clustering. We examine three different approaches to constructing cluster ensembles. To address high dimensionality, we focus on ensemble construction methods that build on two popular dimension reduction techniques, random projection and principal component analysis (PCA). We present evidence showing that ensembles generated by random projection perform better than those by PCA and further that this can be attributed to the capability of random projection to produce diverse base clusterings. We also examine four different consensus functions for combining the clusterings of the ensemble. We compare their performance using two types of ensembles, each with different properties. In both cases, we show that a recent consensus function based on bipartite graph partitioning achieves the best performance.

Keywords: Clustering, Cluster ensembles, Dimension reduction, Random projection

1. Introduction

Clustering for unsupervised data exploration and analysis has been investigated for decades in the statistics, data mining, and machine learning communities. The general goal of clustering is to partition a given set of data points in a multidimensional space into clusters such that the points within a cluster are similar to one another. Many clustering algorithms require a metric to compute the distance between data points. Thus, their performance is often directly influenced by the dimensionality used for calculating the chosen distance metric. Data sets with high dimensionality pose two fundamental challenges for clustering algorithms. First, in a high dimensional space the data tend to be sparse (the curse of dimensionality, Bellman, 1961). Indeed, Beyer et al. (1999) demonstrated that as the dimensionality increases, the difference between the distance from a given point to its nearest neighbor and its distances to the other points in the data set often becomes negligible, making it difficult if not impossible to identify any clustering structure in the data based on distance measures. Second, there often exist irrelevant and/or noisy features that may mislead clustering algorithms (Dy and Brodley, 2004). For example, in the presence of irrelevant features, points from distinct clusters may appear to be closer (as measured by the Euclidean distance) than points from the same cluster.

© 2004 Xiaoli Z. Fern and Carla E. Brodley.



To ameliorate the problems caused by high dimensionality, clustering is often used in combination with dimension reduction techniques. Representative examples of commonly used dimension reduction techniques include principal component analysis (PCA), feature selection (Agrawal et al., 1998, Dy and Brodley, 2004) and projection pursuit (Huber, 1985, Friedman, 1987). The goal of dimension reduction for clustering is to form a lower dimensional representation of the data such that “natural” clusters are easier for clustering algorithms to detect. To achieve this goal, dimension reduction techniques often rely on some “interestingness” criterion to search for a single lower dimensional representation. However, because the true structure of the data is unknown, it is inherently ambiguous what constitutes a good low dimensional representation. This makes it difficult to define a proper “interestingness” criterion. An alternative approach to this problem, which we explore in this paper, is to use multiple low-dimensional representations of the data. Each representation is used to cluster the data and a final clustering is obtained by combining all of the clustering solutions. Note that each low-dimensional representation presents the clustering algorithm with a different view of the original data. By combining the results from many different views, we expect to accumulate information about the structure of the data and produce better and more robust clustering results.

Although several recent papers have addressed cluster ensembles (Strehl and Ghosh, 2002, Fern and Brodley, 2003, Topchy et al., 2003), research in this area has been limited in a number of ways. First, there has not been a systematic investigation of this approach for high dimensional data clustering despite its clear promise as discussed above. Second, empirical evaluations of existing cluster ensemble systems have been limited to a small number of data sets, some of which are artificially constructed. We believe that a more thorough empirical evaluation is warranted. Lastly, the current cluster ensemble literature has mainly focused on the problem of combining multiple clusterings (commonly referred to as the consensus function problem) and little guidance is available on how to construct cluster ensembles (i.e., produce multiple clustering solutions). Many different approaches can be used to construct cluster ensembles and they may produce ensembles that vary significantly with respect to the amount by which the individual clustering solutions differ from one another (diversity) and how good each clustering solution is (quality). Further, it is likely that different consensus functions may be better suited to different types of cluster ensembles. An in-depth study of the above issues will provide useful guidance for applying cluster ensemble techniques in practice.

This paper investigates cluster ensemble techniques in the context of high dimensional data clustering. First, we examine three approaches to constructing cluster ensembles. To address high dimensionality, our ensemble constructors build on two popular dimension reduction techniques: random projection (RP) and principal component analysis (PCA). Second, we examine four consensus functions for combining clusterings. Through an in-depth empirical analysis, our goal in this paper is to study:

• the suitability of different ensemble construction methods and different consensus functions for high dimensional clustering,

• the influence that the diversity and quality of base clusterings have on the final ensemble performance, and




• the relationship between the performance of various consensus functions and the basic properties (diversity and quality) of cluster ensembles.

Our study indicates that, in comparison to the traditional approach of a single clustering with PCA dimension reduction, cluster ensembles are generally more effective for high dimensional data clustering. Among the various ensemble construction techniques examined in this paper, the random projection based approach is shown to be most effective in producing ensembles with high diversity and good quality for high dimensional data. Last, our experiments on four different consensus functions show that although there is no universal single best consensus function, on average the bipartite graph partitioning consensus function (Fern and Brodley, 2004) is the most stable and effective approach for combining cluster ensembles.

The remainder of this paper is arranged as follows. In Section 2, we introduce the basics of cluster ensembles. Section 3 describes three different methods for generating cluster ensembles and Section 4 introduces different consensus functions. Section 5 describes the data sets used for our experiments, the parameter choices and experimental methods. Sections 6 and 7 present the experiments on ensemble constructors and consensus functions respectively. In Section 8, we review the related work on cluster ensembles. Section 9 summarizes the paper and discusses directions for future work.

2. Cluster Ensembles

A cluster ensemble consists of two parts: an ensemble constructor and a consensus function. Given a data set, an ensemble constructor generates a collection of clustering solutions, i.e., a cluster ensemble. A consensus function then combines the clustering solutions of the ensemble and produces a single clustering as the final output of the ensemble system. Below we formally describe these two parts.

Ensemble Constructor: Given a data set of n instances X = {X1, X2, · · · , Xn}, an ensemble constructor generates a cluster ensemble, represented as Π = {π1, π2, · · · , πr}, where r is the ensemble size (the number of clusterings in the ensemble). Each clustering solution πi is simply a partition of the data set X into Ki disjoint clusters of instances, represented as πi = {c^i_1, c^i_2, · · · , c^i_{Ki}}, where ∪_k c^i_k = X. For simplicity this paper assumes that hard clusterings are used. However, it should be noted that each consensus function examined in this paper can be applied to cluster ensembles with soft clusterings, directly or with minor modifications.

Consensus Function: Given a cluster ensemble Π and a number K, the desired number of clusters, a consensus function Γ uses the information provided by Π and partitions X into K disjoint clusters as the final clustering solution πf. In a more general case, one can also use the original features of X in combination with Π to produce the final clustering. In this study we focus on the case where the original features of X are only used during ensemble construction.




3. Ensemble Constructors

In this section, we introduce three ensemble constructors. Because our goal in this paper is to cluster high dimensional data, we focus on approaches that build on dimension reduction techniques. Note that for all approaches described below, we use K-means as the base clustering algorithm for its simplicity and computational efficiency.

3.1 Random Projection Based Approach

Our first ensemble constructor is based on random projection, a dimension reduction technique that has seen growing popularity due to its simplicity and theoretical promise. We will refer to this approach as the RP-based approach and the resulting ensembles as RP ensembles.

Our motivation for using random projection is threefold. First, in comparison to traditional dimension reduction methods such as PCA, random projection is a generic dimension reduction technique that does not use any “interestingness” criterion to “optimize” the projection. Specifying a proper “interestingness” criterion for a data set can be challenging due to the unsupervised nature of clustering tasks. Second, random projection has been shown to have promise for high dimensional data clustering. In 1984, Diaconis and Freedman showed that various high-dimensional distributions look more Gaussian when randomly projected onto a low-dimensional subspace. Recently, Dasgupta (2000) showed that random projection can change the shape of highly eccentric clusters to be more spherical. These results suggest that random projection combined with standard K-means clustering may be well suited to finding structure in high dimensional data. Lastly, our initial experiments (Fern and Brodley, 2003) on random projection for clustering high dimensional data indicate that clustering with different random projection runs results in different clustering solutions that can be complementary to each other. We conjecture that this is a desirable property for cluster ensembles, rendering random projection a natural candidate for constructing cluster ensembles. Below we introduce random projection and the RP-based ensemble constructor.

A random projection from d dimensions to d′ dimensions is simply a randomly generated linear transformation, represented by a d × d′ matrix R. The transformation matrix R is typically generated by first setting each entry of the matrix to a value drawn from an i.i.d. N(0,1) distribution and then normalizing the columns to unit length. Generally speaking, a matrix R generated as described above will not be orthonormal. To obtain a real projection matrix, we would need to further orthogonalize R. However, as shown by Hecht-Nielsen (1994), high dimensional vectors with random directions may be sufficiently close to orthogonal. Therefore, in our experiments we do not perform orthogonalization.

Given a d-dimensional data set represented as an n × d matrix X, where n is the total number of data points in X, the mapping X × R results in a reduced-dimension data set X′. We then apply the standard K-means algorithm to the new data set X′ to obtain a clustering solution. To construct a cluster ensemble, the above process is repeated multiple times, and each time a different projection matrix is randomly generated.

The RP-based ensemble constructor requires two parameters: d′, the new dimension for random projection; and k0, the desired number of clusters for K-means in each clustering run. We will discuss the parameter choices in Section 5.2.
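To make the construction concrete, here is a minimal Python sketch of the RP-based constructor, assuming NumPy and scikit-learn's KMeans; the function name and interface are illustrative rather than the authors' original implementation.

    import numpy as np
    from sklearn.cluster import KMeans

    def rp_ensemble(X, d_prime, k0, ensemble_size, seed=0):
        # Each run draws a fresh d x d' Gaussian matrix, normalizes its columns to
        # unit length, projects the data, and clusters the projection with K-means.
        rng = np.random.default_rng(seed)
        n, d = X.shape
        ensemble = []
        for _ in range(ensemble_size):
            R = rng.normal(size=(d, d_prime))          # i.i.d. N(0,1) entries
            R /= np.linalg.norm(R, axis=0)             # columns normalized to unit length
            X_proj = X @ R                             # reduced-dimension view of the data
            km = KMeans(n_clusters=k0, n_init=10,
                        random_state=int(rng.integers(2**31 - 1)))
            ensemble.append(km.fit_predict(X_proj))
        return ensemble                                # list of length-n label vectors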




3.2 Combining PCA and Random Subsampling

PCA is a widely used dimension reduction technique. It selects projection directions that maximize data variance. Although the variance criterion used by PCA cannot guarantee finding a good low-dimensional representation of the data for clustering (Fukunaga, 1990), it is often considered to be a useful heuristic. Indeed, when comparing random projection with PCA applied to produce a single clustering, we observed that in many cases PCA leads to better performance than random projection. Given a data set, PCA chooses projection directions deterministically; therefore, we combine PCA with random subsampling to construct cluster ensembles.

Specifically, given a high-dimensional data set X, we first apply PCA to reduce the dimension of the data from d to d′. Random subsampling with a sampling rate of 65%1 is then applied to generate multiple subsamples of the reduced data. To avoid duplicating instances, the subsampling is performed without replacement. Each subsample is then clustered by K-means. Note that K-means only clusters instances that appear in the current subsample, which raises the possibility that some instances are never clustered in the entire ensemble. To avoid this situation, in each clustering run we further assign each instance that is absent from the current subsample to its closest cluster, based on its Euclidean distance (in the reduced d′-dimensional representation) to the cluster centers.
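As an illustration, the following sketch (again assuming NumPy and scikit-learn, with an illustrative interface) follows the PCASS procedure: a single PCA reduction shared by all runs, K-means on random subsamples, and nearest-center assignment for the held-out instances.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    def pcass_ensemble(X, d_p, k0, ensemble_size, rate=0.65, seed=0):
        rng = np.random.default_rng(seed)
        X_red = PCA(n_components=d_p).fit_transform(X)     # one shared PCA representation
        n = X_red.shape[0]
        ensemble = []
        for _ in range(ensemble_size):
            # subsample without replacement at the given rate
            sample = rng.choice(n, size=int(rate * n), replace=False)
            km = KMeans(n_clusters=k0, n_init=10,
                        random_state=int(rng.integers(2**31 - 1))).fit(X_red[sample])
            # predict() assigns every instance (including those absent from the
            # subsample) to its closest cluster center in the reduced space
            ensemble.append(km.predict(X_red))
        return ensemble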

In the remainder of the paper, we will refer to this ensemble constructor as PCASS and the resulting ensembles as PCASS ensembles. Note that we also explored an alternative approach, where we subsample first and then apply PCA. By doing so, subsampling is used as a tool for producing different principal projection directions. However, empirical comparisons between these two approaches indicate that they produce highly similar cluster ensembles. A possible explanation is that the features of the high dimensional data sets studied are highly redundant, so applying PCA to subsampled data leads to very similar projection directions. In light of the similar behavior of the two approaches, we will focus on PCASS in our subsequent discussion.

3.3 Combining Random Projection with PCA

When PCA is used in PCASS for constructing cluster ensembles, a critical limitation is that it produces only a single low-dimensional representation of the data in different clustering runs. If significant information is erroneously discarded by PCA, all clustering runs (performed on subsamples) in the ensemble will suffer the same loss without exception. In contrast, although a single run of random projection may lead to significant information loss, the other random projection runs in the ensemble can potentially compensate for the loss incurred in that run. To benefit from the strengths of both approaches, our third ensemble constructor combines random projection with PCA.

Given a high-dimensional data set X, we first apply random projection to reduce the dimension of the data set from d to d1. PCA is then applied to the d1-dimensional data to further reduce the dimension to d2 and form the final low-dimensional representation of the data, which is then clustered by K-means. We refer to this ensemble constructor as RPPCA, and the resulting ensembles as RPPCA ensembles.
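A sketch of one RPPCA clustering run, under the same assumptions as the snippets above (NumPy, scikit-learn, illustrative names): random-project to d1 dimensions, then PCA down to d2, then K-means.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    def rppca_clustering(X, d1, d2, k0, seed=0):
        rng = np.random.default_rng(seed)
        R = rng.normal(size=(X.shape[1], d1))
        R /= np.linalg.norm(R, axis=0)                      # random projection to d1 dimensions
        X_rp = X @ R
        X_final = PCA(n_components=d2).fit_transform(X_rp)  # further reduction to d2 dimensions
        return KMeans(n_clusters=k0, n_init=10, random_state=seed).fit_predict(X_final)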

1. The sampling rate was arbitrarily selected.




4. Consensus Functions

A consensus function takes an ensemble of clustering solutions and combines them to produce a single (presumably better) final clustering. Note that combining clusterings is a difficult task that does not come with a simple and natural solution. Because there is no clear correspondence among clusters obtained from different clusterings, we cannot directly apply simple schemes such as majority vote. In this paper we examine four different consensus functions, each of which approaches the clustering combination problem by transforming it into a more familiar problem. In particular, the first three consensus functions each reduce the clustering combination problem to a graph partitioning problem, and the fourth approach reduces it to a standard clustering problem.

4.1 Consensus Functions Using Graph Partitioning

Consensus functions using graph partitioning have two main advantages. First, graph partitioning is a well studied area and algorithms such as spectral clustering have been successful in a variety of applications (Shi and Malik, 2000, Dhillon, 2001). Second, cluster ensembles provide a simple and effective way to define similarity measures for computing the weights of the edges in a graph, which is an important and sometimes hard to satisfy prerequisite for the success of graph partitioning techniques (Bach and Jordan, 2003). Below we first describe the basics of graph partitioning and provide the necessary notation.

The input to a graph partitioning problem is a graph that consists of vertices and weighted edges, represented by G = (V, W), where V is a set of vertices and W is a nonnegative and symmetric |V| × |V| similarity matrix characterizing the similarity between each pair of vertices. To partition a graph into K parts is to find K disjoint sets of vertices P = {P1, P2, · · · , PK}, where ∪_k Pk = V. Unless a given graph has K, or more than K, strongly connected components, any K-way partition will cross some of the graph edges. The sum of the weights of these crossed edges is often referred to as the cut of a partition P: Cut(P, W) = Σ W(i, j), where the sum is taken over all pairs of vertices i and j that do not belong to the same set.

Given a weighted graph, a general goal of graph partitioning is to find a K-way partition that minimizes the cut, subject to the constraint that each part should contain roughly the same number of vertices. In practice, various graph partitioning algorithms define different optimization criteria based on the above goal. In this paper, we use a well-known spectral graph partitioning algorithm by Ng et al. (2002).

Given the basics of graph partitioning, we now introduce three different consensus functions, each of which formulates and solves a different graph partitioning problem given a cluster ensemble. The first approach models instances as vertices in a graph and the second models clusters as vertices. They were proposed by Strehl and Ghosh (2002). Here we refer to them as the instance-based and cluster-based graph formulations respectively to characterize their key distinction. Finally, the third consensus function, recently proposed by Fern and Brodley (2004), models instances and clusters simultaneously as vertices in a bipartite graph.




4.1.1 Instance-Based Graph Formulation

The Instance-Based Graph Formulation (IBGF) approach constructs a graph that models the pairwise relationship among instances of the data set X. Below we formally describe IBGF.

Given a cluster ensemble Π = {π1, · · · , πR}, IBGF constructs a fully connected graph G = (V, W), where

• V is a set of n vertices, each representing an instance of X.

• W is a similarity matrix with W(i, j) = (1/R) Σ_{r=1}^{R} I(gr(Xi) = gr(Xj)), where I(·) is an indicator function that returns 1 if the argument is true and 0 otherwise, and gr(·) takes an instance and returns the cluster to which it belongs in πr.

W(i, j) measures how frequently the instances i and j are clustered together in the given ensemble. In recent work (Fern and Brodley, 2003, Monti et al., 2003), this similarity measure has been shown to give satisfactory performance in domains where a good similarity (or distance) metric is otherwise hard to find. Once a graph is constructed, one can solve the graph partitioning problem using any chosen graph partitioning algorithm and the resulting partition can be directly output as the final clustering solution.

Note that IBGF constructs a fully connected graph with n² edges, where n is the number of instances. Depending on the algorithm used to partition the graph, the computational complexity of IBGF may vary, but it is generally more expensive than the other two graph formulation approaches examined in this paper.
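The following sketch builds the IBGF similarity matrix and partitions the resulting graph; scikit-learn's SpectralClustering with a precomputed affinity is used here as a stand-in for the spectral algorithm of Ng et al. (2002), and the interface is illustrative.

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def ibgf(ensemble, K, seed=0):
        labels = np.asarray(ensemble)                    # shape (R, n): one row per base clustering
        R, n = labels.shape
        W = np.zeros((n, n))
        for lab in labels:
            W += (lab[:, None] == lab[None, :])          # 1 where a pair shares a cluster in this run
        W /= R                                           # fraction of runs clustering i and j together
        sc = SpectralClustering(n_clusters=K, affinity='precomputed', random_state=seed)
        return sc.fit_predict(W)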

4.1.2 Cluster-Based Graph Formulation

Note that clusters formed in different clusterings may contain the same set of instances or largely overlap with each other. Such clusters are considered to correspond (be similar) to one another. Cluster-Based Graph Formulation (CBGF) constructs a graph that models the correspondence (similarity) relationship among the different clusters in a given ensemble and partitions the graph into groups such that the clusters of the same group correspond (are similar) to one another.

Given a cluster ensemble Π = {π1, · · · , πR}, we first rewrite Π as Π = {c^1_1, · · · , c^1_{K1}, · · · , c^R_1, · · · , c^R_{KR}}, where c^i_j represents the jth cluster formed in the ith clustering πi of the ensemble Π. Denote the total number of clusters in Π as t = Σ_{r=1}^{R} Kr. CBGF constructs a graph G = (V, W), where

• V is a set of t vertices, each representing a cluster

• W is a matrix such that W(i, j) is the similarity between the clusters ci and cj, computed using the Jaccard measure as W(i, j) = |ci ∩ cj| / |ci ∪ cj|.

Partitioning a cluster-based graph results in a grouping of the clusters. We then produce a final clustering of instances as follows. First we consider each group of clusters as a metacluster. For each clustering in the ensemble, an instance is considered to be associated with a metacluster if the metacluster contains the cluster to which the instance belongs. Note that an instance may be associated with different metaclusters in different runs. We assign an instance to the metacluster with which it is most frequently associated; ties are broken randomly.




Figure 1: An example of the Hybrid Bipartite Graph Formulation ((a) Clustering 1, (b) Clustering 2, (c) the bipartite graph built from the two clusterings).

A basic assumption of CBGF is that there exists a correspondence structure among the different clusters formed in the ensemble. This poses a potential problem: in cases where no such correspondence structure exists, this approach may fail to provide satisfactory performance. The advantage of CBGF is that it is computationally efficient. It generates a fully connected graph with t² edges, where t is the total number of clusters in the ensemble. This is significantly smaller than the n² edges of IBGF, assuming t ≪ n.
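A sketch of CBGF under the same assumptions (NumPy, scikit-learn's SpectralClustering as the graph partitioner, illustrative interface): build the Jaccard similarity between all clusters, partition the clusters into metaclusters, then assign each instance to the metacluster with which it is most frequently associated (ties here simply go to the lowest index rather than being broken randomly).

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def cbgf(ensemble, K, seed=0):
        labels = np.asarray(ensemble)                            # shape (R, n)
        R, n = labels.shape
        # boolean membership matrix: one row per cluster in the ensemble
        members = np.array([labels[r] == c
                            for r in range(R) for c in np.unique(labels[r])])
        inter = members.astype(float) @ members.T.astype(float)  # |c_i ∩ c_j|
        sizes = members.sum(axis=1).astype(float)
        union = sizes[:, None] + sizes[None, :] - inter          # |c_i ∪ c_j|
        W = inter / np.maximum(union, 1.0)                       # Jaccard similarity between clusters
        meta = SpectralClustering(n_clusters=K, affinity='precomputed',
                                  random_state=seed).fit_predict(W)
        votes = np.zeros((n, K))
        for m, row in zip(meta, members):                        # count metacluster associations
            votes[row, m] += 1
        return votes.argmax(axis=1)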

Strehl and Ghosh (2002) also proposed a hypergraph-based approach, which models clusters as hyperedges and instances as vertices in a hypergraph and uses a hypergraph partitioning algorithm to produce a final partition. Conceptually, this approach forms a different type of graph and has the limitation that it cannot model soft clusterings. Practically, our initial experiments showed that it performed worse than IBGF and CBGF on our data sets. For these reasons, we do not discuss this approach further in this paper.

4.1.3 Hybrid Bipartite Graph Formulation

The Hybrid Bipartite Graph Formulation (HBGF) approach models instances and clusters simultaneously as vertices in a graph. The graph edges can only connect instance vertices to cluster vertices, resulting in a bipartite graph. Below we first describe HBGF and then explain its conceptual advantages over IBGF and CBGF.

Description of HBGF: Given a cluster ensemble Π = {π1, · · · , πR}, HBGF constructs a graph G = (V, W), where

• V = V^C ∪ V^I, where V^C contains t vertices, each representing a cluster of the ensemble, and V^I contains n vertices, each representing an instance of the data set X.

• W is defined as follows. If the vertices i and j are both clusters or both instances, W(i, j) = 0; otherwise, if instance i belongs to cluster j, W(i, j) = W(j, i) = 1, and 0 otherwise.




Note that W can be written in block form as

    W = [ 0    A^T ]
        [ A    0   ]

where A is a connectivity matrix whose rows correspond to the instances and columns correspond to the clusters; A(i, j) is an indicator that takes value 1 if instance i belongs to the j-th cluster and 0 otherwise.

Figure 1 shows an example of HBGF. In particular, Figures 1(a) and (b) depict two different clusterings of nine instances and Figure 1(c) shows the graph constructed by HBGF, in which the diamond vertices represent the clusters and the round vertices represent the instances. An edge between an instance vertex and a cluster vertex indicates that the cluster contains the instance. All edges in the graph have equal weight of one (edges with zero weight are omitted from the graph). In this graph, cluster vertices are only connected to instance vertices and vice versa, forming a bipartite graph. If a new clustering is added to the ensemble, a new set of cluster vertices will be added to the graph and each cluster vertex will be connected to the instances that it contains.

Shown in Figure 1(c) as a dashed line, a partition of the bipartite graph partitions the cluster vertices and the instance vertices simultaneously. The partition of the instances can then be output as the final clustering.

Although HBGF's vertex set is the union of the vertex sets of IBGF and CBGF, it forms a sparse graph and, as shown by Dhillon (2001), due to its special structure the real size of the resulting bipartite graph partitioning problem is n × t, where n is the number of instances and t is the total number of clusters in the ensemble. This is significantly smaller than the size n² of IBGF, assuming that t ≪ n.
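The sketch below constructs the HBGF bipartite graph explicitly and partitions all instance and cluster vertices together, again with scikit-learn's SpectralClustering as a stand-in partitioner; a production implementation would instead exploit the bipartite structure and work with the n × t connectivity matrix A directly, as in Dhillon (2001). Names and interface are illustrative.

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def hbgf(ensemble, K, seed=0):
        labels = np.asarray(ensemble)                        # shape (R, n)
        R, n = labels.shape
        cols = [labels[r] == c for r in range(R) for c in np.unique(labels[r])]
        A = np.array(cols, dtype=float).T                    # (n, t) connectivity matrix
        t = A.shape[1]
        W = np.zeros((n + t, n + t))                         # instance vertices first, then clusters
        W[:n, n:] = A
        W[n:, :n] = A.T
        parts = SpectralClustering(n_clusters=K, affinity='precomputed',
                                   random_state=seed).fit_predict(W)
        return parts[:n]                                     # induced partition of the instances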

Conceptual advantages of HBGF: Compared to IBGF and CBGF, HBGF offers two important conceptual advantages. First, the reduction performed by HBGF is lossless: the original cluster ensemble can be easily reconstructed from an HBGF graph. In contrast, IBGF and CBGF do not have this property. To understand the second advantage of HBGF, note that IBGF and CBGF consider the similarity of instances and the similarity of clusters independently and, as shown below, such independent treatment may be problematic.

Consider two pairs of instances (X1, X2) and (X3, X4), and assume that X1 and X2 are never clustered together in the ensemble and the same is true for the pair (X3, X4). However, the instances X1 and X2 are each frequently clustered together with the same group of instances in the ensemble, i.e., X1 and X2 are frequently assigned to clusters that are similar to each other. In contrast, this is not true for X3 and X4. Intuitively, we consider X1 and X2 to be more similar to one another than X3 and X4. However, IBGF fails to differentiate these two cases and assigns both similarities to be zero. This is because IBGF ignores the information about the similarity of clusters when computing the similarity of instances.

A similar problem exists for CBGF. For example, consider two pairs of clusters (c1, c2) and (c3, c4). Assume that c1 ∩ c2 = ∅ and c3 ∩ c4 = ∅, and further assume that the instances of c1 and those of c2 are often clustered together in other clustering runs, whereas this is not the case for c3 and c4. CBGF will assign both similarities to zero, while intuitively we would consider clusters c1 and c2 to be more similar to one another than clusters c3 and c4. CBGF fails to differentiate these two cases because it does not take the similarity of instances into account.




Table 1: Description of the data sets

Name          Description                                               Source
eos           Land-cover classification data set #1                     Friedl et al. (2002)
modis         Land-cover classification data set #2
hrct          High resolution computed tomography lung image data      Dy et al. (1999)
chart         Synthetically generated control chart time series        UCI KDD archive (Hettich and Bay, 1999)
isolet6       Spoken letter recognition data set (6 letters only)      UCI ML archive
mfeat         Handwritten digits represented by Fourier coefficients   (Blake and Merz, 1998)
satimage      StatLog Satellite image data set (training set)
segmentation  Image segmentation data

In contrast, HBGF allows the similarity of instances and the similarity of clusters to be considered simultaneously in producing the final clustering. It thus avoids the above problems of IBGF and CBGF.

4.2 Consensus Function Using Centroid-based Clustering

The fourth consensus function, proposed by Topchy et al. (2003), transforms a cluster ensemble problem into a standard clustering problem by representing a given ensemble as a new set of features and then using the standard K-means clustering algorithm to produce the final clustering. Below we refer to this consensus function as KMCF, which stands for K-Means Consensus Function.

Given a cluster ensemble Π = {π1, · · · , πr}, KMCF first creates a new set of features as follows. For each partition πi = {c^i_1, c^i_2, · · · , c^i_{Ki}}, it adds Ki binary features to the new feature set, each corresponding to a cluster in πi. The resulting new feature set contains t features, where t is the total number of clusters in the ensemble Π. For an instance, the feature corresponding to c^i_j takes value 1 if that instance belongs to cluster c^i_j, and 0 otherwise. The new features are then standardized to have zero mean. Finally, K-means is applied to the standardized new features to produce the final clustering.
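A sketch of KMCF under the same assumptions (NumPy, scikit-learn, illustrative interface): one binary membership feature per cluster, standardized to zero mean, followed by K-means.

    import numpy as np
    from sklearn.cluster import KMeans

    def kmcf(ensemble, K, seed=0):
        labels = np.asarray(ensemble)                        # shape (R, n)
        feats = [(lab == c).astype(float)
                 for lab in labels for c in np.unique(lab)]
        F = np.array(feats).T                                # (n, t) binary feature matrix
        F = F - F.mean(axis=0)                               # standardize to zero mean
        return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(F)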

5. Experimental Setup

In this section we describe the data sets used in our experiments, the related parameter settings and the evaluation criterion.




Table 2: Summary of the data sets

              chart   eos    hrct   isol.  mfeat  modis  sati.  segm.
#inst.         600    2398   1545   1800   2000   4975   4435   2310
#class           6       8      8      6     10     10      6      7
org. dim.       60      20    183    617     76    112     36     19
dr, dp, d2      10       5     20     30     10     10      5      5
d1              20      10     40     60     20     20     10     10
k0              10      15     15     15     15     15     15     15

Table 3: Required parameters for ensemble constructors.

Approach   Parameters
RP         k0: # of clusters for each clustering run;  dr: dimension to reduce to for RP
PCASS      k0: # of clusters for each clustering run;  dp: dimension to reduce to for PCA;  r: subsampling rate
RPPCA      k0: # of clusters for each clustering run;  d1: intermediate dimension for RP;  d2: final dimension for PCA

5.1 Description of the Data Sets

Our experiments use eight data sets obtained from a variety of domains and sources. Table 1 briefly describes these data sets. Note that the ISOLET6 data set is a subsample of the UCI spoken letter recognition data set obtained by randomly selecting six out of the total of twenty-six letters. We chose to use only six letters to ensure that the IBGF approach can be applied within a reasonable amount of time. The MFEAT data set comes from the UCI multiple-feature data set; the original data set contains several different representations of the same data, among which the Fourier coefficients were used in our experiments. Rows 2-4 of Table 2 present the main characteristics of these data sets. The dimensions of these data sets range from dozens to hundreds. Rows 5-7 list the parameters selected for each data set, which we discuss in Section 5.2. It should be noted that all eight data sets are labeled; however, the labels are omitted during clustering and used only for evaluation.




5.2 Parameter Settings

In this paper, we examine three different ensemble constructors and four consensus functions. For the consensus functions, the only parameter to be specified is kf, the desired number of final clusters. Because we have access to the class information of our data sets, our experiments set kf to be the same as the class number (shown in row 3 of Table 2) for all data sets and all consensus functions.

In Table 3, we summarize the parameters to be specified for each ensemble constructor. One parameter that is common to all ensemble constructors is k0, the number of clusters for K-means in each clustering run during ensemble construction. The choices of k0 for each data set are shown in the last row of Table 2. These numbers were arbitrarily selected. Note that we use a slightly smaller number for the CHART data set because it contains only 600 instances.

The remaining parameters are decided as follows. First, for PCASS, we select dp, the dimensionality for PCA, by requiring that 90% of the data variance be preserved. Because the resulting dimensionalities for HRCT, ISOLET6 and MFEAT are still high, for computational efficiency we relax the requirement for these three data sets to preserving only 80% of the data variance. The sampling rate r of PCASS is set to 65% (arbitrarily selected). In order to have direct comparisons between RP and PCASS, we set dr, the dimensionality for RP, to be the same as dp. Finally, for RPPCA, we have two dimension parameters, the intermediate dimension d1 used by RP and the final dimension d2 used by PCA. We set the final dimension d2 to be the same as dp and set d1 to be 2 × dp.

We argue that the exact choices of the above parameters do not strongly influence the final performance. While the quality of the individual base clusterings may be significantly affected when different parameters are used, as we combine more and more base clusterings, we expect the ensemble performance to be comparable across different parameter settings. To verify this conjecture, we examined other choices for both dp (the final dimensionality used by the ensemble constructors) and k0 (the cluster number for K-means in base clustering). Indeed, we also examined the case where dp and k0 are selected randomly from a given range for each individual clustering run. All settings resulted in comparable performance. This is possibly due to the fact that cluster ensembles aggregate the information of multiple base clusterings: although the individual base clusterings may differ when the parameters change, the total aggregated information remains comparable.

5.3 Evaluation Criterion

Because the data sets used in our experiments are labeled, we can assess the quality of the final clusterings using external criteria that measure the discrepancy between the structure defined by a clustering and the structure defined by the class labels. Here we choose to use an information theoretic criterion, the Normalized Mutual Information (NMI) criterion (Strehl and Ghosh, 2002). Treating cluster labels and class labels as random variables, NMI measures the mutual information (Cover and Thomas, 1991) shared by the two random variables and normalizes it to a [0, 1] range. Note that the expected NMI value of a random partition of the data is 0 and the optimal value of 1 is attained when the class labels and cluster labels define the same partition of the data.
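In practice the criterion can be computed with scikit-learn; the geometric-mean normalization is passed explicitly under the assumption that it matches the Strehl and Ghosh (2002) definition used in the paper, and the function name is illustrative.

    from sklearn.metrics import normalized_mutual_info_score

    def evaluate_clustering(cluster_labels, class_labels):
        # NMI in [0, 1]: ~0 for a random partition, 1 when the two labelings agree exactly
        return normalized_mutual_info_score(class_labels, cluster_labels,
                                            average_method='geometric')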




6. Experiments on Ensemble Constructors

In this section, we empirically evaluate the three ensemble constructors described in Section 3. The purpose of our experiments is to study the suitability of the various ensemble constructors for the high dimensional clustering task.

6.1 Upper-Bound Performance Analysis

Given a cluster ensemble, we have a variety of choices for the consensus function, and using a different consensus function may lead to a different result. To obtain a unified evaluation of the ensemble constructors, we evaluate each cluster ensemble using the following procedure. Given a cluster ensemble, we apply all four consensus functions to the ensemble, resulting in four final clusterings. We then evaluate the four clusterings using the class labels and record the best NMI value obtained. We refer to this as the upper-bound performance evaluation for each ensemble constructor.

In Figure 2, we plot the upper-bound performance of the three ensemble constructors. The y-axis shows the NMI value and the x-axis shows the ensemble size. Each performance line consists of eleven data points. The left-most point has ensemble size one, showing the average quality of the base clusterings produced by an ensemble constructor. To obtain this value, we use each ensemble constructor to generate an ensemble of size 50 and average the quality of the 50 base clusterings. The other ten data points show the performance of ensembles with ten different ensemble sizes, ranging from 10 to 100. For each ensemble size, ten different ensembles are generated using each constructor and the plotted values are averaged across the ten runs.

Comparing with base clusterings: Recall that the optimal NMI value is 1, which is achieved when the evaluated clustering and the class labels define exactly the same partition of the data; the larger the NMI value, the better the performance. Comparing the ensembles with their base clusterings, from Figure 2 we see that RP ensembles consistently improve over the base clusterings for all eight data sets. In contrast, RPPCA fails to improve for the SEGMENTATION data set, whereas PCASS fails to improve for the HRCT, EOS and MODIS data sets.

Comparing with regular clustering with PCA dimension reduction: As another baseline comparison, we apply PCA to each data set to reduce the dimension to dp (using the same values as shown in Table 2), and then apply K-means to the reduced data with k equal to the class number. The resulting performance is shown in Figure 2 by the dotted lines.2 As shown in the figure, all three types of ensembles outperform regular clustering with PCA dimension reduction.

Comparing the three ensemble constructors: From Figure 2, we observe that RP ensembles achieve the best performance among the three ensemble constructors for all but the SATIMAGE data set. The performance lines of RP ensembles generally show a clear improving trend as the ensemble size increases. RPPCA ensembles also improve with an increase in ensemble size, but their final performance tends to be inferior to RP.

2. Note that it may appear surprising that in some cases this baseline performance is inferior to the performance of the base clusterings. This can be explained by the fact that larger k values are used by K-means in producing the base clusterings.




Figure 2: Upper-bound performance of the three types of ensembles (NMI versus ensemble size; panels (a) Chart, (b) Eos, (c) Hrct, (d) Isolet6, (e) Mfeat-fou, (f) Modis, (g) Satimage, (h) Segmentation; lines: RP, RPPCA, PCASS).




In contrast, PCASS behaves rather differently from the other two. In particular, its base clusterings often have better quality than those of the other two approaches, while the performance improvements introduced by the ensembles are often less significant or even negligible in some cases. Interestingly, we notice that for the ISOLET6 and MFEAT data sets, the base clusterings generated by RP have significantly lower quality than those generated by PCASS. However, as we increase the ensemble size, RP ensembles gradually improve to reach similar (for ISOLET6) or better (for MFEAT) final performance.

The above comparisons suggest that, among the three ensemble constructors, random projection (RP) generates the best performing cluster ensembles in most cases. Further, we conjecture that the superiority of RP ensembles arises because RP produces more diverse base clusterings. To verify this conjecture, we examine the diversity and quality of the cluster ensembles using the process described below. Note that the approach we take here is similar to the one used by Dietterich (2000) for analyzing supervised ensembles.

Diversity-quality analysis: Given a cluster ensemble, we graph the diversity versus quality for each pair of clusterings in the ensemble as a cloud of diversity-quality (d-q) points. Each d-q point corresponds to a pair of base clusterings in the ensemble. To measure the diversity (shown on the y axis of a graph), we calculate the NMI between each pair of base clusterings. To obtain a single quality measure (shown on the x axis of a graph) for each pair, we average the NMI values computed between each clustering and the class labels. Note that when the NMI between two clusterings is zero, the diversity is maximized. In contrast, maximizing the average NMI of a pair maximizes its quality. From the distribution of the d-q points, we can obtain information about the quality and diversity of a given ensemble. For example, for an ensemble to have high diversity and good quality, its d-q points should be close to the bottom-right region of a graph.
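A sketch of how the d-q points can be computed (scikit-learn's NMI with geometric normalization is assumed, as above; names are illustrative):

    import numpy as np
    from itertools import combinations
    from sklearn.metrics import normalized_mutual_info_score as nmi

    def dq_points(ensemble, class_labels):
        # quality of each base clustering: NMI against the class labels
        quality = [nmi(class_labels, lab, average_method='geometric') for lab in ensemble]
        points = []
        for i, j in combinations(range(len(ensemble)), 2):
            diversity = nmi(ensemble[i], ensemble[j], average_method='geometric')  # lower = more diverse
            pair_quality = (quality[i] + quality[j]) / 2.0
            points.append((pair_quality, diversity))   # (x, y) as plotted in Figure 3
        return points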

We first compare the RP ensembles and the PCASS ensembles in Figure 3 by examining the distribution of the d-q points of both approaches. For each approach, we obtain its d-q points using an ensemble of size 50. Note that the d-q points shown in the figure are uniformly subsampled to enhance readability. Our first observation is that, for all eight data sets, the d-q points of RP ensembles are located below those of PCASS ensembles, indicating that the clusterings of RP ensembles are generally more diverse than those of PCASS. Further, we relate the results shown in Figure 3 to the performance evaluation shown in Figure 2. Examining each data set individually, we make the following observations.

• For the CHART data set, the d-q points of the RP ensembles and the PCASS ensembles largely overlap. Correspondingly, we see that the performance lines of these two ensembles, as shown in Figure 2(a), are also close to each other.

• For the EOS, HRCT, MODIS, and SEGMENTATION data sets, we see that the quality of the two types of ensembles is similar whereas the diversity of the RP ensembles is higher. Comparing the upper-bound performance of the two approaches for these four data sets shown in Figure 2, we observe that higher diversity leads to a larger performance improvement over the base clusterings.




Figure 3: Diversity-quality analysis of RP ensembles and PCASS ensembles (one panel per data set: (a) Chart, (b) Eos, (c) Hrct, (d) Isolet6, (e) Mfeat, (f) Modis, (g) Satimage, (h) Segmentation; points: PCASS and RP).




Table 4: Diversity and quality of RP ensembles and RPPCA ensembles.

                     RP                     RPPCA
              Diversity   Quality    Diversity   Quality
chart           0.588      0.644       0.454      0.502
eos             0.624      0.269       0.647      0.255
hrct            0.431      0.275       0.434      0.260
satimage        0.496      0.483       0.517      0.461
modis           0.596      0.442       0.588      0.405
isolet6         0.608      0.667       0.740      0.739
mfeat           0.546      0.548       0.596      0.552
segmentation    0.607      0.478       0.736      0.535

• For the ISOLET6, MFEAT and SATIMAGE data sets, we note that the RP ensembles have lower quality but higher diversity than the PCASS ensembles. For ISOLET6 and MFEAT, we observe that, in comparison to PCASS, the final performance of RP is comparable or better even though it begins with base clusterings of lower quality. This suggests that, for cluster ensembles, lower quality in the base clusterings may be compensated for by higher diversity. However, this is not always the case. In particular, we also observe that PCASS outperforms RP for the SATIMAGE data set. The difference between SATIMAGE and the other two data sets is that SATIMAGE is the only data set where RP's d-q points lie in a different region from PCASS's d-q points. In particular, all of PCASS's d-q points have quality higher than 0.5, whereas many of RP's points have quality lower than 0.5.

In comparing RPPCA with RP, it is rather surprising to see that RPPCA performs so poorly compared to RP, because RPPCA was designed with the goal of combining the strengths of random projection and PCA. Indeed, our empirical evaluation appears to suggest that RPPCA actually inherits the weaknesses of both RP and PCA. To analyze the diversity and quality of RP ensembles and RPPCA ensembles, we show the average diversity and quality of the two types of ensembles in Table 4. We present this comparison as a table because the d-q points of RPPCA tend to overlap those of RP, making a figure hard to interpret.

In Table 4, we present the eight data sets in two groups based on whether RPPCA generates base clusterings of higher quality than RP. The first group contains five data sets. For these five data sets, RPPCA actually generates base clusterings that are of lower quality than those generated by RP. This indicates that our approach of combining PCA with RP does not necessarily lead to better dimension reduction performance for clustering, at least with the parameter settings used in this paper. Among these five data sets, the diversity of the RPPCA ensemble is significantly higher than that of the RP ensemble only for the CHART data set, for which the final ensemble performance of RP and RPPCA is comparable. For the other four data sets in the first group, the diversity of the RPPCA ensembles is either lower (EOS, SATIMAGE) or similar (HRCT, MODIS). Correspondingly, we observe that RPPCA ensembles perform worse than RP ensembles for these four data sets. In the second group, we see three data sets for which RPPCA does improve the base clusterings compared to RP. However, as we see from the table, the RP ensembles are much more diverse for these three data sets. Consequently, the final performance of the RP ensembles is still better than or at least similar to that of the RPPCA ensembles.

In conclusion, the above diversity-quality analysis provides strong evidence that the high diversity of RP ensembles is, if not the direct reason, at least an important contributing factor to their superior performance in comparison to the other two approaches. Generally speaking, our observations suggest that if the base clusterings of two ensembles are of similar quality, the higher the diversity, the better the final ensemble performance. In cases where the base clusterings of two ensembles differ in quality, high diversity may also compensate for low quality, but this may depend on the particular data set and the extent to which the quality drops.

6.2 Reality Performance Analysis

Note that the upper-bound evaluation used in the previous section measures the best performance achievable by each ensemble constructor using the four consensus functions described in Section 4. In reality, we cannot achieve this performance because we do not have the ground truth for selecting the best result from the four candidate clustering outputs. To evaluate the ensemble constructors in a more realistic setting, we use a second procedure based on an objective function defined by Strehl and Ghosh (2002). This objective function, referred to as the Sum of NMI (SNMI), measures the total amount of information shared by a final clustering and the participating clusterings in a given ensemble and is computed as follows. Given a cluster ensemble Π = {π1, · · · , πr} and a final clustering πf, the SNMI is defined as

    SNMI = Σ_{i=1}^{r} NMI(πi, πf)

where NMI(πi, πf) measures the normalized mutual information between a base clustering πi and the final clustering πf.

We apply the four consensus functions to each given ensemble and, among the four clustering outputs, we select the clustering that maximizes the SNMI value. We then compute the NMI value between the selected clustering and the class labels as the evaluation of the ensemble. In Figure 4, we plot the new performance measure of the three ensemble constructors.
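A sketch of the SNMI-based selection (same NMI assumptions as above; the candidate clusterings would be the outputs of the four consensus functions):

    from sklearn.metrics import normalized_mutual_info_score as nmi

    def snmi(final_labels, ensemble):
        # total information shared between a candidate final clustering and the ensemble
        return sum(nmi(final_labels, lab, average_method='geometric') for lab in ensemble)

    def select_by_snmi(candidates, ensemble):
        # pick the consensus output that maximizes SNMI; no class labels are needed
        return max(candidates, key=lambda labels: snmi(labels, ensemble))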

Figure 4 shows that, among the three ensemble constructors examined here, the ensembles generated by RP perform best in most cases. This is consistent with what we observed in the upper-bound performance analysis. It should be noted that SNMI sometimes fails to select the optimal clustering, which can be inferred from the fact that the performance based on SNMI is in some cases significantly worse than the upper-bound performance presented in Section 6.1. Note that, among the eight data sets used in the experiments, the difference between the upper-bound performance and the SNMI-based performance is least significant for the ISOLET6 data set, for which we see the highest base-clustering quality. This suggests that SNMI may be best suited as an objective function for cluster ensembles when the base clustering quality is high.




Figure 4: SNMI-based evaluation of the three types of ensembles (NMI versus ensemble size; panels and lines as in Figure 2).



7. Experiments on Consensus Functions

In this section, we evaluate the four consensus functions described in Section 4. Although we have identified RP as the most effective ensemble construction approach among the three constructors examined, we choose to evaluate the consensus functions using both RP ensembles and PCASS ensembles. This is because RP and PCASS behave differently with respect to the diversity and quality of the resulting base clusterings. In particular, RP produces base clusterings that are highly diverse, whereas the base clusterings of PCASS are often of better quality than those of RP. We are interested in evaluating the consensus functions using both types of ensembles and finding out whether their performance is sensitive to these basic properties of cluster ensembles.

7.1 Evaluation of Consensus Functions Using RP Ensembles

To generate RP ensembles, we use the same set of parameters as described in Section 5.2. For each data set, we use RP to generate ensembles of ten different ensemble sizes: 10, 20, · · ·, up to 100. For each ensemble size, ten different ensembles are generated. This creates a pool of 100 RP ensembles for each data set. For each ensemble Πi in the pool, we measure the quality of its base clusterings using NMI and the class labels and average them to obtain the average base-clustering quality (represented as qi) of ensemble Πi. To evaluate a consensus function using Πi, we apply the consensus function to Πi to produce a final clustering, which is then evaluated using NMI and the class labels. Denoting the obtained NMI value as fnmi, we then calculate

ri = fnmi / qi − 1,

which measures the rate at which ensemble Πi improves over its base clusterings when the particular consensus function is used to produce the final clustering. We use ri as the final evaluation of the consensus function under ensemble Πi.
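
A minimal sketch of this evaluation is given below, under the assumption that clusterings are integer label arrays and that NMI is computed with scikit-learn; names such as consensus_fn are illustrative and not from the paper.

    import numpy as np
    from sklearn.metrics import normalized_mutual_info_score as nmi

    def improvement_rate(base_labelings, class_labels, consensus_fn):
        # q_i: average quality of the base clusterings (NMI against the class labels).
        q_i = np.mean([nmi(class_labels, base) for base in base_labelings])
        # f_nmi: quality of the final clustering produced by the consensus function.
        final = consensus_fn(base_labelings)
        f_nmi = nmi(class_labels, final)
        # r_i = f_nmi / q_i - 1: relative improvement over the base clusterings.
        return f_nmi / q_i - 1.0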

For each data set, the pool of 100 RP ensembles is used to evaluate each of the four consensus functions. We evaluate the improvement rate for every ensemble in the pool and average the 100 results to obtain a final evaluation. Note that we choose not to present the evaluation results separately for different ensemble sizes, to facilitate a clear presentation and interpretation of the results. More importantly, we believe that this choice will not influence our evaluation because, as seen in Section 6, the ensemble performance appears to be stable across different ensemble sizes.

Columns 2-5 of Table 5 show the improvement rates of the four consensus functions. We show the results for each data set as well as the average over all data sets. The numbers shown for each individual data set are obtained by averaging 100 results, whereas the numbers shown for the overall average are obtained by averaging 800 results.

Table 5 also presents the ranking of the four consensus functions for each data set and for the final average. Note that the notation A < B indicates that the performance of A is on average inferior to that of B and the difference is statistically significant (according to a paired t-test with significance level 0.05). In contrast, A ≤ B indicates that the average performance of A is worse than that of B but the difference is not statistically significant. Finally, we highlight in boldface the best performance obtained in each row of the table.
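
As an illustration only, the ranking relation for a pair of consensus functions could be computed along the following lines; the use of SciPy's paired t-test is an assumption, since the paper does not specify an implementation.

    from scipy.stats import ttest_rel

    def compare(rates_a, rates_b, alpha=0.05):
        # rates_a, rates_b: improvement rates of consensus functions A and B on the
        # same pool of ensembles. Returns 'A < B' when A is worse on average and the
        # difference is significant, 'A <= B' when A is worse but not significantly,
        # and symmetrically for B.
        _, p_value = ttest_rel(rates_a, rates_b)
        mean_a = sum(rates_a) / len(rates_a)
        mean_b = sum(rates_b) / len(rates_b)
        significant = p_value < alpha
        if mean_a < mean_b:
            return "A < B" if significant else "A <= B"
        return "B < A" if significant else "B <= A"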

The first observation we make is that no single consensus function is the best for all data sets. Indeed, each of the four consensus functions examined wins the top position for some data set.

Table 5: The rates of improvement over base clusterings achieved by the four consensus functions when applied to RP ensembles.

Data set       IBGF(I)   CBGF(C)   HBGF(H)   KMCF(K)   Ranking

chart            .225      .248      .232      .235    I < H ≤ K < C
eos              .032     -.082      .050      .041    C < I < K ≤ H
hrct             .043     -.063      .039      .192    C < H ≤ I < K
isolet6          .275      .262      .276      .143    K < C < I ≤ H
mfeat            .258      .202      .260      .204    C ≤ K < I ≤ H
modis            .119      .114      .118      .102    K < C ≤ H ≤ I
satimage         .276      .179      .260      .171    K ≤ C < H ≤ I
segmentation     .077      .125      .131      .023    K < I < C ≤ H

Average          .163      .123      .171      .139    C < K < I < H

Table 6: The rates of improvement over base clusterings achieved by the four consensus functions when applied to PCASS ensembles.

Data set       IBGF(I)   CBGF(C)   HBGF(H)   KMCF(K)   Ranking

chart            .032      .086      .055      .164    I < H < C < K
eos             -.100     -.404     -.105     -.014    C < H < I < K
hrct            -.016     -.110      .008     -.131    K ≤ C < I < H
isolet6          .089      .083      .099     -.134    K < C ≤ I < H
mfeat            .033      .018      .032     -.048    K < C < H ≤ I
modis            .000     -.016     -.005     -.076    K < C < H < I
satimage         .215      .171      .221     -.191    K < C < I < H
segmentation    -.048      .052      .043     -.133    K < I < H < C

Average          .026     -.015      .044     -.070    K < C < I < H

Among the four consensus functions, HBGF and IBGF appear to be more stable than the other two because on average they achieve better performance and win the top positions more frequently. The performance of HBGF and IBGF is similar, but on average HBGF achieves a higher improvement rate of 17.1% (averaged across the eight data sets used in our experiments).

7.2 Evaluation of Consensus Functions Using PCASS Ensembles

We generate PCASS ensembles to evaluate the consensus functions using the same procedure as described in the previous section. The values of the parameters are set as described in Section 5.2.

From the results shown in Table 6, we observe that HBGF again achieves the most stable performance among the four consensus functions. On average, its improvement rate over the base clusterings is 4.4%. This is consistent with the results we observe for RP ensembles. Comparing Tables 5 and 6, we observe that all consensus functions show some degradation in their improvement rates over the base clusterings for PCASS ensembles, suggesting that the consensus functions are influenced by the diversity and quality of the base clusterings.

Among the four consensus functions, we note that KMCF sees the most significant degradation and fails to improve over the base clusterings for all but the CHART data set. We argue that the failure of KMCF may be directly attributed to the reduced diversity of the PCASS ensembles. Comparing the performance of KMCF across the data sets, we notice that low diversity in the base clusterings tends to correspond to poor KMCF performance. For example, among the eight data sets, KMCF performs worst on SATIMAGE and ISOLET6, which are the data sets whose ensembles have the least diversity.

In contrast to KMCF, we see a different trend for CBGF, whose poor final performance often correlates with low quality in the base clusterings. Indeed, among the eight data sets, CBGF performs worst on EOS and HRCT, which are the data sets for which the PCASS ensembles have the lowest base-clustering quality.

Finally, for HBGF and IBGF, we do not see either the quality or the diversity dominating the final performance. Instead, the final performance of these two approaches appears to be influenced jointly by both factors.

In conclusion, our experiments suggest that, among the four consensus functions, HBGF and IBGF often perform better than CBGF and KMCF. The performance of HBGF and IBGF tends to be comparable, but on average HBGF performs slightly better. We also observe that KMCF is sensitive to the diversity of the ensembles: low diversity leads to poor KMCF performance. In contrast, CBGF is more influenced by the quality of the base clusterings.

8. Related Work On Cluster Ensembles

The framework of cluster ensembles was recently formalized by Strehl and Ghosh (2002). There have been many instances of the cluster ensemble framework in the literature. They differ in how they construct the ensembles (different ensemble constructors) and how they combine the solutions (different consensus functions).

Examples of ensemble constructors include using different bootstrap samples (Fridlyand and Dudoit, 2001; Strehl and Ghosh, 2002), using different initializations of randomized clustering algorithms such as K-means (Fred and Jain, 2002), using different feature subsets to perform clustering (Kargupta et al., 1999; Strehl and Ghosh, 2002), and using multiple different clustering algorithms (Strehl and Ghosh, 2002). In this paper, we focused on construction approaches that build on dimension reduction techniques in an attempt to address the problems caused by high dimensionality. We identified RP as an effective ensemble constructor.
Further, we conducted an in-depth analysis of the properties of the ensembles generated by the different constructors, providing an explanation for the effectiveness of the RP ensemble constructor.

A main focus of cluster ensemble research has been on combining clusterings, and a variety of consensus functions are available in the literature. In this paper we do not intend to cover all existing consensus functions in our comparisons. Instead, we chose to examine four consensus functions that have been shown to be highly competitive compared to other alternatives (Strehl and Ghosh, 2002; Topchy et al., 2004; Fern and Brodley, 2004). Below we review various alternative consensus functions and relate them to those examined in this paper.

Many existing techniques for combining clusterings bear a strong similarity to the IBGF approach in the sense that they generate an instance similarity matrix based on a given cluster ensemble. The difference lies in how the matrix is used to produce the final clustering, for which various techniques, such as simple thresholding (Fred, 2001) and agglomerative clustering algorithms (Fred and Jain, 2002; Fern and Brodley, 2003; Monti et al., 2003), have been examined.
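
The shared idea behind these similarity-matrix methods can be sketched as a co-association matrix. The following is a generic illustration under the assumption that clusterings are integer label arrays; it is not the IBGF algorithm itself.

    import numpy as np

    def co_association(base_labelings):
        # sim[x, y] = fraction of base clusterings that place instances x and y
        # in the same cluster. How this matrix is then turned into a final
        # clustering (thresholding, agglomerative clustering, graph partitioning)
        # is where the individual methods differ.
        n = len(base_labelings[0])
        sim = np.zeros((n, n))
        for labels in base_labelings:
            labels = np.asarray(labels)
            sim += (labels[:, None] == labels[None, :]).astype(float)
        return sim / len(base_labelings)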

There also exist a number of cluster-based approaches that are similar in essence to CBGF. These approaches seek to find a correspondence structure among the clusters formed in the ensemble and use a voting scheme to assign instances to clusters once a correspondence structure is found. CBGF finds such a correspondence structure by partitioning a graph describing the similarity relationships among clusters. Other approaches include using a relabeling process (Dudoit and Fridlyand, 2003) to map the clusters to virtual class labels, and clustering the cluster centers to group similar clusters together (Dimitriadou et al., 2001).
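
A simplified sketch of the relabel-and-vote idea, not any specific method above, is given below. It assumes every base clustering uses the same number of clusters k, and uses a Hungarian assignment on the contingency table to align each clustering with a reference clustering before voting.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def relabel_and_vote(base_labelings, k):
        reference = np.asarray(base_labelings[0])
        votes = np.zeros((len(reference), k))
        for labels in base_labelings:
            labels = np.asarray(labels)
            # contingency[r, c]: number of instances in reference cluster r and cluster c.
            contingency = np.zeros((k, k))
            for r, c in zip(reference, labels):
                contingency[r, c] += 1
            # Find the label correspondence that maximizes the total overlap.
            _, mapping = linear_sum_assignment(-contingency)
            inverse = np.empty(k, dtype=int)
            inverse[mapping] = np.arange(k)
            relabeled = inverse[labels]
            # Each relabeled clustering casts one vote per instance.
            votes[np.arange(len(labels)), relabeled] += 1
        return votes.argmax(axis=1)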

9. Conclusions and Future Work

In this paper, we studied cluster ensembles for high dimensional data clustering. First, we examined three different ensemble constructors for generating cluster ensembles, which build on two popular dimension reduction techniques: random projection (RP) and principal component analysis (PCA). We empirically evaluated the ensemble constructors on eight data sets, and concluded that cluster ensembles constructed by RP perform better for high dimensional clustering in comparison to 1) the traditional approach of a single clustering with PCA dimension reduction and 2) ensembles generated by the other two ensemble constructors examined in the paper. In an attempt to explain the superiority of RP ensembles, we analyzed the diversity and quality of the ensembles generated by different constructors. Our analysis indicated that if the base clusterings have similar quality, the higher the diversity, the better the ensemble can potentially perform. Further, high diversity may also compensate for low base-clustering quality. These results provide strong evidence that RP's superior performance can be attributed to the fact that it produces highly diverse base clusterings.

We also examined four different consensus functions for combining clusterings. We used two different types of cluster ensembles (RP and PCASS ensembles) to evaluate the four consensus functions. We identified two of the graph-partitioning-based approaches, HBGF and IBGF, as superior to the other two approaches.
On average, HBGF improved over the base clusterings by 17.1% for RP ensembles and by 4.4% for PCASS ensembles, whereas IBGF improved by 16.3% and 2.6%, respectively. Overall, HBGF slightly outperformed IBGF. The experiments suggested that all four consensus functions are influenced by the quality and diversity of the base clusterings. We saw evidence that KMCF is highly sensitive to the diversity of the base clusterings; therefore, if a cluster ensemble lacks diversity, KMCF should be avoided as the consensus function. In contrast, CBGF is sensitive to the quality of the base clusterings. Finally, the quality and diversity of the base clusterings jointly impact HBGF and IBGF, making them more stable in the face of the trade-off between the quality and diversity of the base clusterings.

To conclude, we have identified random projection as an effective solution for generating cluster ensembles and HBGF as a good candidate for combining the clusterings. For future research, we note that, in this paper as well as in previous research on cluster ensembles, all base clusterings are treated equally by the consensus functions. There is no doubt that some base clusterings may be better than others. An interesting question is whether we can improve the consensus functions by assigning weights to the base clusterings according to their quality. Finally, we note that the empirical performance of cluster ensembles tends to level off as the ensemble size increases, which indicates that very large ensemble sizes may be unnecessary. However, it remains an open question how to decide on a proper ensemble size when applying cluster ensembles in practice.
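
As a purely hypothetical illustration of the weighting idea raised above, and not a method studied in this paper, one could weight each base clustering's contribution to a co-association matrix by an unsupervised quality estimate, for example its average NMI with the other base clusterings; scikit-learn's NMI routine is assumed here.

    import numpy as np
    from sklearn.metrics import normalized_mutual_info_score as nmi

    def weighted_co_association(base_labelings):
        r = len(base_labelings)
        # Hypothetical quality proxy: average agreement with the other base clusterings.
        weights = np.array([
            np.mean([nmi(base_labelings[i], base_labelings[j])
                     for j in range(r) if j != i])
            for i in range(r)
        ])
        n = len(base_labelings[0])
        sim = np.zeros((n, n))
        for w, labels in zip(weights, base_labelings):
            labels = np.asarray(labels)
            sim += w * (labels[:, None] == labels[None, :])
        return sim / weights.sum()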

References

R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 94–105. ACM Press, 1998.

F. R. Bach and M. I. Jordan. Learning spectral clustering. In Advances in Neural Information Processing Systems 16, 2003.

R. Bellman. Adaptive Control Processes. Princeton University Press, 1961.

K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is nearest neighbor meaningful? In Proceedings of the International Conference on Database Theory, pages 217–235, 1999.

C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998. URL http://www.ics.uci.edu/~mlearn/MLRepository.html.

T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.

S. Dasgupta. Experiments with random projection. In Uncertainty in Artificial Intelligence: Proceedings of the Sixteenth Conference (UAI-2000), pages 143–151. Morgan Kaufmann, 2000.

I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Knowledge Discovery and Data Mining, pages 269–274, 2001.

T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting and randomization. Machine Learning, 40(2):139–157, 2000.

E. Dimitriadou, A. Weingessel, and K. Hornik. Voting-merging: An ensemble method for clustering. In Proceedings of the International Conference on Artificial Neural Networks, pages 217–224, 2001.

S. Dudoit and J. Fridlyand. Bagging to improve the accuracy of a clustering procedure. Bioinformatics, 19, 2003.

J. G. Dy and C. E. Brodley. Feature selection for unsupervised learning. Journal of Machine Learning Research, 5:845–889, 2004.

J. G. Dy, C. E. Brodley, A. Kak, C. Shyu, and L. S. Broderick. The customized-queries approach to CBIR using EM. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 400–406. IEEE Computer Society Press, 1999.

X. Z. Fern and C. E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Proceedings of the Twentieth International Conference on Machine Learning, pages 186–193, 2003.

X. Z. Fern and C. E. Brodley. Solving cluster ensemble problems by bipartite graph partitioning. In Proceedings of the Twenty-First International Conference on Machine Learning, pages 281–288, 2004.

A. L. N. Fred and A. K. Jain. Data clustering using evidence accumulation. In Proceedings of the International Conference on Pattern Recognition, 2002.

A. L. N. Fred. Finding consistent clusters in data partitions. In Proceedings of the Third International Workshop on Multiple Classifier Systems, pages 309–318, 2001.

J. Fridlyand and S. Dudoit. Applications of resampling methods to estimate the number of clusters and to improve the accuracy of a clustering method. Technical Report 600, Statistics Department, UC Berkeley, 2001.

M. Friedl, D. McIver, J. Hodges, X. Zhang, D. Muchoney, A. Strahler, C. Woodcock, S. Gopal, A. Schneider, A. Cooper, A. Baccini, F. Gao, and C. Schaaf. Global land cover mapping from MODIS: Algorithms and early results. Remote Sensing of Environment, 83:287–302, 2002.

J. H. Friedman. Exploratory projection pursuit. Journal of the American Statistical Association, 82(397):249–266, 1987.

K. Fukunaga. Statistical Pattern Recognition (second edition). Academic Press, 1990.

R. Hecht-Nielsen. Context vectors: General purpose approximate meaning representations self-organized from raw data. In Computational Intelligence: Imitating Life, pages 43–56. IEEE Press, 1994.

S. Hettich and S. D. Bay. The UCI KDD archive, 1999. URL http://kdd.ics.uci.edu.

P. J. Huber. Projection pursuit. Annals of Statistics, 13(2):435–475, 1985.

H. Kargupta, B. Park, D. Hershberger, and E. Johnson. Collective data mining: A new perspective toward distributed data mining. In Advances in Distributed and Parallel Knowledge Discovery. MIT/AAAI Press, 1999.

S. Monti, P. Tamayo, J. Mesirov, and T. Golub. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning, 52:91–118, 2003.

A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, pages 849–856, 2002.

J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583–617, 2002.

A. Topchy, A. K. Jain, and W. Punch. Combining multiple weak clusterings. In Proceedings of the IEEE International Conference on Data Mining, pages 331–338, 2003.

A. Topchy, A. K. Jain, and W. Punch. A mixture model for clustering ensembles. In Proceedings of the SIAM Conference on Data Mining, pages 379–390, 2004.
