
Co-ClusterD: A Distributed Framework for Data Co-Clustering with Sequential Updates

    Xiang Cheng, Sen Su, Lixin Gao, Fellow, IEEE, and Jiangtao Yin

Abstract—Co-clustering has emerged as a powerful data mining tool for two-dimensional co-occurrence and dyadic data. However, co-clustering algorithms often require significant computational resources and have been dismissed as impractical for large data sets. Existing studies have provided strong empirical evidence that expectation-maximization (EM) algorithms (e.g., the k-means algorithm) with sequential updates can significantly reduce the computational cost without degrading the resulting solution. Motivated by this observation, we introduce sequential updates for alternate minimization co-clustering (AMCC) algorithms, which are variants of EM algorithms, and also show that AMCC algorithms with sequential updates converge. We then propose two approaches to parallelize AMCC algorithms with sequential updates in a distributed environment. Both approaches are proved to maintain the convergence properties of AMCC algorithms. Based on these two approaches, we present a new distributed framework, Co-ClusterD, which supports efficient implementations of AMCC algorithms with sequential updates. We design and implement Co-ClusterD, and show its efficiency through two AMCC algorithms: fast nonnegative matrix tri-factorization (FNMTF) and information theoretic co-clustering (ITCC). We evaluate our framework on both a local cluster of machines and the Amazon EC2 cloud. Empirical results show that AMCC algorithms implemented in Co-ClusterD can achieve much faster convergence and often obtain better results than their traditional concurrent counterparts.

    Index Terms—Co-Clustering, concurrent updates, sequential updates, cloud computing, distributed framework


    1 INTRODUCTION

CO-CLUSTERING is a powerful data mining tool for two-dimensional co-occurrence and dyadic data. It has practical importance in a wide range of applications such as text mining [1], recommendation systems [2], and the analysis of gene expression data [3]. Typically, clustering algorithms leverage an iterative refinement method to group input points into clusters. The cluster assignments are performed based on the current cluster information (e.g., the centroids of clusters in k-means clustering). The resulting cluster assignments can be utilized to further update the cluster information. Such a refinement process is iterated until the cluster assignments become stable. Depending on how frequently the cluster information is updated, clustering algorithms can be broadly categorized into two classes. The first class updates the cluster information after all input points have updated their cluster assignments. We refer to this class of algorithms as clustering algorithms with concurrent updates. In contrast, the second class updates the cluster information whenever a point changes its cluster assignment. We refer to this class of algorithms as clustering algorithms with sequential updates.

A number of existing studies (e.g., [4], [5], [6], [7], [8]) have provided strong empirical evidence that expectation-maximization (EM) algorithms (e.g., the k-means algorithm) with sequential updates can significantly reduce the computational cost without degrading the resulting solution. Motivated by this observation, we introduce sequential updates for alternate minimization co-clustering (AMCC) algorithms [9], which are variants of EM algorithms. We show that AMCC algorithms with sequential updates converge. Despite the potential advantages of sequential updates, parallelizing AMCC algorithms with sequential updates is challenging. Specifically, if we let each worker machine update the cluster information sequentially, it might result in inconsistent cluster information across worker machines, and thus the convergence properties of co-clustering algorithms cannot be guaranteed; if we synchronize the cluster information whenever a cluster assignment is changed, it will incur large synchronization overhead and thus result in poor performance in a distributed environment. Consequently, AMCC algorithms with sequential updates cannot be easily performed in a distributed manner.

Toward this end, we propose two approaches to parallelize sequential updates for AMCC algorithms. The first approach is referred to as dividing clusters. It divides the problem of clustering rows (or columns) into independent tasks, each of which is assigned to a worker. In order to make tasks independent, we randomly divide row (or column) clusters into multiple non-overlapping subsets at the beginning of each iteration, and let each worker perform row (or column) clustering with sequential updates on one of these subsets.

The second approach is referred to as batching points. Relaxing the stringent requirement of sequential updates, it parallelizes sequential updates by performing batch updates.

X. Cheng and S. Su are with the State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China. E-mail: {chengxiang, susen}@bupt.edu.cn.

L. Gao and J. Yin are with the Department of Electrical and Computer Engineering, University of Massachusetts Amherst, MA 01003. E-mail: {lgao, jyin}@ecs.umass.edu.

Manuscript received 26 Oct. 2013; revised 10 Apr. 2015; accepted 23 June 2015. Date of publication 30 June 2015; date of current version 3 Nov. 2015. Recommended for acceptance by I. Davidson. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TKDE.2015.2451634


Instead of updating the cluster information after each change in cluster assignments, batch updates perform a batch of row (or column) cluster assignments, and then update the cluster information. We typically divide the rows and columns of the input data matrix into several batches and let all workers perform row (or column) clustering with concurrent updates on each batch.

We formally prove that the dividing clusters and batching points approaches maintain the convergence properties of AMCC algorithms. Based on these two approaches, we design and implement a distributed framework, Co-ClusterD, to support efficient implementations of AMCC algorithms with sequential updates. Co-ClusterD provides an abstraction for AMCC algorithms with sequential updates and allows programmers to specify the sequential update operations via simple APIs. As a result, it eases the process of implementing AMCC algorithms and frees programmers from detailed mechanisms such as communication and synchronization between workers. We evaluate Co-ClusterD through two AMCC algorithms: fast nonnegative matrix tri-factorization (FNMTF) [10] and information theoretic co-clustering (ITCC) [11]. Experimenting on a local cluster of machines and the Amazon EC2 cloud, we show that AMCC algorithms implemented in Co-ClusterD run faster and often get better results than their traditional concurrent counterparts. In summary, we make three contributions in this paper.

• We introduce sequential updates for alternate minimization co-clustering (AMCC) algorithms and show that AMCC algorithms with sequential updates converge.

• We propose two approaches (i.e., dividing clusters and batching points) to parallelize AMCC algorithms with sequential updates in a distributed environment. We prove that both approaches maintain the convergence properties of AMCC algorithms. We also provide solutions for setting optimal parameters for these two approaches.

• We design and implement a distributed framework called Co-ClusterD which supports efficient implementations of AMCC algorithms with sequential updates. Empirical results on both a local cluster and the Amazon EC2 cloud show that AMCC algorithms implemented in Co-ClusterD converge much faster and often obtain better results than their traditional concurrent counterparts.

The rest of this paper is organized as follows. Section 2 gives an overview of AMCC and shows how update strategies affect its performance. Section 3 presents our approaches to parallelize AMCC algorithms with sequential updates. In Section 4, we present the design and implementation of our co-clustering framework. Section 5 is devoted to the experimental results. We discuss related work in Section 6 and conclude the paper in Section 7.

2 ALTERNATE MINIMIZATION BASED CO-CLUSTERING AND SEQUENTIAL UPDATES

In this section, we first describe the co-clustering problem we focus on. Next, we introduce sequential updates for alternate minimization co-clustering algorithms. Then, taking two AMCC algorithms (i.e., fast nonnegative matrix tri-factorization [10] and information theoretic co-clustering [11]) as examples, we show how sequential updates work. Finally, we compare the performance of concurrent updates and sequential updates through FNMTF and ITCC on a single machine.

    2.1 Alternate Minimization Based Co-Clustering

Co-clustering is also known as bi-clustering, block clustering, or direct clustering [12]. Formally, given an m × n matrix Z, a co-clustering can be defined by two maps ρ and γ, which group the rows and columns of Z into k and l disjoint or hard clusters respectively. In particular, ρ(u) = p (1 ≤ p ≤ k) means that row u is in row cluster p, and γ(v) = q (1 ≤ q ≤ l) indicates that column v is in column cluster q. If we reorder the rows and columns of Z and let rows and columns of the same cluster be close to each other, we obtain k × l correlated sub-matrices. Each sub-matrix is referred to as a co-cluster. For example, as shown in Fig. 1, given the 4 × 4 matrix A, after reordering according to ρ(u) and γ(v), we get matrix B, which contains four correlated sub-matrices (i.e., co-clusters).

The co-clustering problem can be viewed as a lossy data compression problem [9]. Given a specified number of row and column clusters, it attempts to retain as much information as possible about the original data matrix in terms of statistics based on the co-clustering. Let Z̃ be an approximation of the original data matrix Z. The goodness of the underlying co-clustering can be quantified by the expected distortion between Z and Z̃, which is shown as follows:

    E[d_f(Z, Z̃)] = \sum_{u=1}^{m} \sum_{v=1}^{n} w_{uv} d_f(z_{uv}, \tilde{z}_{uv}) = \sum_{p=1}^{k} \sum_{\{u \mid \rho(u)=p\}} \sum_{q=1}^{l} \sum_{\{v \mid \gamma(v)=q\}} w_{uv} d_f(z_{uv}, s_{pq}),    (1)

where d_f is a distance measure and can be any member of the Bregman divergence family [13] (e.g., Euclidean distance), z_uv and z̃_uv are the elements of Z and Z̃ respectively, w_uv denotes the pre-specified weight of pair (u, v), and s_pq is the statistic derived from co-cluster (p, q) of Z, which is used to approximate the elements in co-cluster (p, q) of Z. s_pq is considered as the cluster information of data co-clustering. The co-clustering problem is then to find (ρ, γ) such that Eq. (1) is minimized.
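As a concrete reading of Eq. (1), the small sketch below evaluates the objective for the squared Euclidean distance with unit weights w_uv = 1; the class and method names are ours and purely illustrative.

    // Illustrative sketch: evaluate Eq. (1) with d_f = squared Euclidean distance
    // and unit weights. rho[u] and gamma[v] are the row/column cluster maps,
    // s[p][q] is the statistic of co-cluster (p, q).
    public final class CoClusterObjective {
        public static double objective(double[][] z, int[] rho, int[] gamma, double[][] s) {
            double cost = 0.0;
            for (int u = 0; u < z.length; u++) {
                for (int v = 0; v < z[u].length; v++) {
                    double diff = z[u][v] - s[rho[u]][gamma[v]];
                    cost += diff * diff;   // w_uv * d_f(z_uv, s_pq) with w_uv = 1
                }
            }
            return cost;
        }
    }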

To find the optimal (ρ, γ), a broadly applicable approach is to leverage an iterative process, which monotonically decreases the objective function above by intertwining row and column clustering iterations. This kind of co-clustering approach is referred to as alternate minimization co-clustering [9] and is the main focus of this paper.

Fig. 1. According to ρ(u) and γ(v), reorder the input data matrix A such that the resulting sub-matrices (i.e., co-clusters) in B are correlated.

Typically, AMCC algorithms repeat the following four steps till convergence.

Step I: Keep γ fixed; for every row u, find its new row cluster assignment by the following equation:

    \rho(u) = \arg\min_{p} \sum_{q=1}^{l} \sum_{\{v \mid \gamma(v)=q\}} w_{uv} d_f(z_{uv}, s_{pq}).    (2)

Step II: With respect to (ρ, γ), update the cluster information (i.e., the statistic of each co-cluster) by the following equation:

    s_{pq} = \arg\min_{s_{pq}} \sum_{\{u \mid \rho(u)=p\}} \sum_{\{v \mid \gamma(v)=q\}} w_{uv} d_f(z_{uv}, s_{pq}).    (3)

Step III: Keep ρ fixed; for every column v, find its new column cluster assignment by the following equation:

    \gamma(v) = \arg\min_{q} \sum_{p=1}^{k} \sum_{\{u \mid \rho(u)=p\}} w_{uv} d_f(z_{uv}, s_{pq}).    (4)

Step IV: The same as Step II.

In the above general algorithm, some implementations might combine Step II and Step IV into one step.
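To make the iteration structure concrete, the schematic sketch below organizes Steps I-IV in code; the abstract methods are placeholders standing in for Eqs. (2)-(4) of a concrete algorithm (e.g., FNMTF or ITCC), not part of the authors' implementation.

    // Schematic sketch of the four-step AMCC loop with concurrent updates.
    abstract class ConcurrentAmcc {
        abstract int assignRow(double[][] z, int u, int[] gamma, double[][] s);              // Eq. (2)
        abstract int assignColumn(double[][] z, int v, int[] rho, double[][] s);             // Eq. (4)
        abstract void updateStatistics(double[][] z, int[] rho, int[] gamma, double[][] s);  // Eq. (3)

        void run(double[][] z, int[] rho, int[] gamma, double[][] s, int maxIters) {
            for (int iter = 0; iter < maxIters; iter++) {
                for (int u = 0; u < z.length; u++)        // Step I: reassign every row
                    rho[u] = assignRow(z, u, gamma, s);
                updateStatistics(z, rho, gamma, s);       // Step II: recompute all statistics
                for (int v = 0; v < z[0].length; v++)     // Step III: reassign every column
                    gamma[v] = assignColumn(z, v, rho, s);
                updateStatistics(z, rho, gamma, s);       // Step IV: same as Step II
            }
        }
    }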

    2.2 Sequential Updates

Many existing studies [4], [5], [6], [7], [8] have provided strong empirical evidence that expectation-maximization algorithms (e.g., the k-means algorithm) with sequential updates can significantly reduce the computational cost without degrading the resulting solution. Since AMCC algorithms are variants of EM algorithms, we introduce sequential updates for AMCC algorithms. Unlike concurrent updates, which perform the cluster information update after all rows (or columns) have updated their cluster assignments, sequential updates perform the cluster information update after each change in cluster assignments.

Specifically, AMCC algorithms with sequential updates repeat the following six steps till convergence.

Step I: Keep γ fixed; pick a row u in some order and find its new row cluster assignment by Eq. (2).

Step II: With respect to (ρ, γ), update the involved statistics of co-clusters by Eq. (3) once u changes its row cluster assignment.

Step III: Repeat Step I and Step II until all rows have been processed.

Step IV: Keep ρ fixed; pick a column v in some order and find its new column cluster assignment by Eq. (4).

Step V: With respect to (ρ, γ), update the involved statistics of co-clusters by Eq. (3) once v changes its column cluster assignment.

Step VI: Repeat Step IV and Step V until all columns have been processed.
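The row-clustering phase (Steps I-III) differs from its concurrent counterpart only in when the statistics are refreshed; the following schematic sketch highlights that difference. The abstract methods are again placeholders, not the authors' code.

    // Schematic sketch of the row-clustering phase with sequential updates.
    abstract class SequentialRowPhase {
        abstract int assignRow(double[][] z, int u, int[] gamma, double[][] s);            // Eq. (2)
        abstract void updateInvolvedStatistics(double[][] z, int u, int oldCluster,
                                               int newCluster, int[] gamma, double[][] s); // Eq. (3), restricted to affected co-clusters

        void rowPhase(double[][] z, int[] rho, int[] gamma, double[][] s) {
            for (int u = 0; u < z.length; u++) {              // pick rows in some order (Step I)
                int oldCluster = rho[u];
                int newCluster = assignRow(z, u, gamma, s);
                if (newCluster != oldCluster) {               // Step II: update only when the assignment changes
                    rho[u] = newCluster;
                    updateInvolvedStatistics(z, u, oldCluster, newCluster, gamma, s);
                }
            }
        }
    }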

The following theorem shows that sequential updates maintain the convergence properties of AMCC algorithms.

Theorem 1. Alternate minimization based co-clustering algorithms with sequential updates monotonically decrease the objective function given by Eq. (1).

Proof. The overall approximation error C(Z, Z̃) in Eq. (1) can be rewritten as the sum over the approximation errors due to each row and its assignment, i.e., C(Z, Z̃) = \sum_{u=1}^{m} C_r(u, ρ(u)), where C_r(u, ρ(u)) is the approximation error due to row u and its assignment ρ(u), and C_r(u, ρ(u)) = \sum_{q=1}^{l} \sum_{\{v \mid γ(v)=q\}} w_{uv} d_f(z_{uv}, s_{pq}). Since performing row clustering for a given row u by Eq. (2) will not increase C_r(u, ρ(u)), C(Z, Z̃) will not be increased in Step I. In addition, C(Z, Z̃) can also be rewritten as the sum over the approximation errors due to the statistic of each co-cluster, i.e., C(Z, Z̃) = \sum_{p=1}^{k} \sum_{q=1}^{l} C_s(s_{pq}), where C_s(s_{pq}) is the approximation error due to the statistic of co-cluster (p, q), and C_s(s_{pq}) = \sum_{\{u \mid ρ(u)=p\}} \sum_{\{v \mid γ(v)=q\}} w_{uv} d_f(z_{uv}, s_{pq}). Since updating the statistic of each involved co-cluster by Eq. (3) will not increase C_s(s_{pq}), C(Z, Z̃) will not be increased in Step II. Therefore, alternate minimization based co-clustering algorithms with sequential updates monotonically decrease the objective function given by Eq. (1) during the row clustering phase. For the column clustering phase, the proof is similar and is omitted for brevity. ∎

We observe that, in hard clustering, a row (or column) re-assignment only gives rise to the update of the cluster information related to the reassigned row (or column). Therefore, for AMCC algorithms with sequential updates, after a row (or column) re-assignment occurs, we only need to update the related cluster information instead of recalculating all the cluster information. In this way, we can significantly reduce the overhead of sequential updates. In fact, the idea of sequential updates can also be adopted by soft clustering algorithms. However, in soft clustering, a row (or column) re-assignment requires the update of all the cluster information, which might cause large computational overhead. Moreover, since the cluster information in AMCC algorithms is the statistics derived from co-clusters, we can incrementally update the related statistics by adding or subtracting the effects of the reassigned row (or column), and further reduce the computational overhead of sequential updates.

In the following section, we will use two concrete examples to show the details of sequential updates and how to incrementally update the cluster information.

2.3 Examples: FNMTF and ITCC

Many AMCC algorithms with different Bregman divergences (e.g., [3], [10], [11], [14]) have been proposed in recent years. However, all of these algorithms are proposed with concurrent updates. To show how sequential updates work and how to incrementally update the cluster information, we take two AMCC algorithms, FNMTF [10] and ITCC [11], as examples. One considers the Euclidean distance as the distance measure, and the other considers the Kullback-Leibler divergence as the distance measure.


2.3.1 FNMTF

FNMTF considers the Euclidean distance as the distance measure between the original matrix and the approximation matrix. Specifically, FNMTF constrains the factor matrices F and G to be cluster indicator matrices and tries to minimize the following objective function:

    \|Z - F S G^T\|^2 \quad \text{s.t.} \quad F \in C^{m \times k},\; G \in C^{n \times l},    (5)

where Z is the input data matrix, and matrix S is composed of the statistics given on the co-clusters determined by F and G.

FNMTF alternately solves for the three variables F, G, and S in Eq. (5), and iterates till convergence. Each iteration contains the following three steps.

First, fixing G and S, for each row z_{u·} in Z, find its new cluster assignment using the following equation:

    f_{up} = \begin{cases} 1, & p = \arg\min_{p'} \|z_{u\cdot} - \tilde{h}_{p'\cdot}\|^2, \\ 0, & \text{otherwise}, \end{cases}    (6)

where h̃_{p·} is the p-th row of SG^T and serves as the row-cluster prototype, which is similar to the "centroid" in k-means clustering.

Second, fixing F and S, for each column z_{·v} in Z, find its new cluster assignment using the following equation:

    g_{vq} = \begin{cases} 1, & q = \arg\min_{q'} \|z_{\cdot v} - \tilde{m}_{\cdot q'}\|^2, \\ 0, & \text{otherwise}, \end{cases}    (7)

where m̃_{·q} is the q-th column of FS and serves as the column-cluster prototype.

In the last step of the iteration, according to the current co-clustering determined by F and G, it updates each element of S by the following equation:

    s_{pq} = \frac{\sum_{\{u \mid \rho(u)=p\}} \sum_{\{v \mid \gamma(v)=q\}} z_{uv}}{|p| \cdot |q|},    (8)

where |p| denotes the number of rows in row cluster p and |q| denotes the number of columns in column cluster q. s_pq provides the statistic of co-cluster (p, q), which is the mean of all elements in co-cluster (p, q).

The procedures of FNMTF with concurrent updates are summarized in Algorithm 1.

    Algorithm 1. FNMTF with Concurrent Updates

    Initialize G ∈ C^{n×l} and F ∈ C^{m×k} with arbitrary cluster indicator matrices;
    repeat
        Fixing F and G, update S by (8);
        Fixing F and S, update G by (7);
        Fixing G and S, update F by (6);
    until convergence;

Unlike FNMTF with concurrent updates, FNMTF with sequential updates requires the most up-to-date S to perform row (or column) cluster assignments, and thus S should be updated immediately once a row (or column) changes its cluster assignment.

Since a row (or column) re-assignment involves only two rows (or columns) of S, and the elements of S are the means of co-clusters, we can update S incrementally and thus reduce the overhead of frequent updates of S. Algorithm 2 shows the incremental update for S after a particular row z_{u·} changes its row cluster assignment from p to p̂. The incremental update for S after reassigning a column can be done in a similar way and is omitted for brevity.

Algorithm 2. Incremental Update for S

    for q = 1 to l do
        s_{pq} = (s_{pq} · |p| · |q| − \sum_{\{v \mid \gamma(v)=q\}} z_{uv}) / ((|p| − 1) · |q|);
        s_{p̂q} = (s_{p̂q} · |p̂| · |q| + \sum_{\{v \mid \gamma(v)=q\}} z_{uv}) / ((|p̂| + 1) · |q|);
    end
    |p| = |p| − 1;
    |p̂| = |p̂| + 1;
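A direct Java rendering of Algorithm 2 might look as follows. It assumes S stores co-cluster means, that the sizes |p| and |q| are tracked in arrays, and that cluster p still contains at least one other row after the move; the names are ours and the sketch is only illustrative.

    // Sketch: incrementally update S (co-cluster means) after row u moves
    // from row cluster p to row cluster pHat (cf. Algorithm 2).
    final class IncrementalS {
        static void moveRow(double[] rowU, int p, int pHat, int[] gamma,
                            double[][] s, int[] rowClusterSize, int[] colClusterSize) {
            int l = s[0].length;
            double[] rowSumPerColCluster = new double[l];
            for (int v = 0; v < rowU.length; v++) {
                rowSumPerColCluster[gamma[v]] += rowU[v];      // sum of z_uv over columns in column cluster q
            }
            for (int q = 0; q < l; q++) {
                double sizeQ = colClusterSize[q];
                // remove row u's contribution from the mean of co-cluster (p, q)
                s[p][q] = (s[p][q] * rowClusterSize[p] * sizeQ - rowSumPerColCluster[q])
                          / ((rowClusterSize[p] - 1) * sizeQ);
                // add row u's contribution to the mean of co-cluster (pHat, q)
                s[pHat][q] = (s[pHat][q] * rowClusterSize[pHat] * sizeQ + rowSumPerColCluster[q])
                             / ((rowClusterSize[pHat] + 1) * sizeQ);
            }
            rowClusterSize[p] -= 1;
            rowClusterSize[pHat] += 1;
        }
    }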

The procedures of FNMTF with sequential updates are summarized in Algorithm 3.

Algorithm 3. FNMTF with Sequential Updates

    Initialize G ∈ C^{n×l} and F ∈ C^{m×k} with arbitrary cluster indicator matrices;
    Initialize S by (8);
    repeat
        Fix G;
        foreach row z_{u·} ∈ Z do
            update F by (6);
            update S with Algorithm 2;
        end
        Fix F;
        foreach column z_{·v} ∈ Z do
            update G by (7);
            update S in an incremental way (similar to Algorithm 2);
        end
    until convergence;

    2.3.2 ITCC

ITCC considers a non-negative matrix as the joint probability distribution of two discrete random variables and formulates the co-clustering problem as an optimization problem in information theory. Formally, let X and Y be discrete random variables taking values in {x_i}_{i=1}^{m} and {y_j}_{j=1}^{n} respectively. Suppose we want to cluster X into k disjoint clusters {x̂_i}_{i=1}^{k} and Y into l disjoint clusters {ŷ_j}_{j=1}^{l} simultaneously. ITCC tries to find a co-clustering which minimizes the loss in mutual information:

    I(X; Y) − I(X̂; Ŷ) = D(p(X, Y) || q(X, Y)),    (9)

where I(X; Y) is the mutual information between X and Y, D(·||·) denotes the Kullback-Leibler divergence, p(X, Y) denotes the joint probability distribution between X and Y, and q(X, Y) is a distribution of the form

    q(x, y) = p(x̂, ŷ) p(x|x̂) p(y|ŷ),  x ∈ x̂, y ∈ ŷ,    (10)

where p(x̂, ŷ) = \sum_{x ∈ x̂} \sum_{y ∈ ŷ} p(x, y), and p(x|x̂) = p(x)/p(x̂) for x̂ = C_X(x) and 0 otherwise, and similarly for p(y|ŷ). p(x̂, ŷ) provides the statistic of co-cluster (x̂, ŷ), which is the sum of all the elements in co-cluster (x̂, ŷ). p(X, Y) and q(X, Y) are both m × n matrices, p(X̂, Ŷ) is a k × l matrix, and p(X|X̂) and p(Y|Ŷ) are m × k and n × l column orthogonal matrices respectively.

ITCC starts with an initial co-clustering (C_X, C_Y) and refines it through an iterative row and column clustering process. According to (C_X, C_Y), ITCC computes the following three distributions:

    q(X̂, Ŷ), q(X|X̂), q(Y|Ŷ),    (11)

where q(X̂, Ŷ) = p(X̂, Ŷ), q(X|X̂) = p(X|X̂), and q(Y|Ŷ) = p(Y|Ŷ).

For row clustering, fixing C_Y, it tries to find a new cluster assignment for each row x as

    C_X(x) = \arg\min_{x̂} D(p(Y|x) || q(Y|x̂)),    (12)

where q(y|x̂) = p(y|ŷ) p(ŷ|x̂) serves as the row-cluster prototype. Recall that p(Y|Ŷ) is an n × l column orthogonal matrix. This implies that, given p(Y|Ŷ) and a row of p(X̂, Ŷ), we can obtain the corresponding row-cluster prototype without matrix multiplication. Then, according to the new row cluster assignments, it updates the distributions in (11).

For column clustering, fixing C_X, it tries to find a new cluster assignment for each column y as

    C_Y(y) = \arg\min_{ŷ} D(p(X|y) || q(X|ŷ)),    (13)

where q(x|ŷ) = p(x|x̂) p(x̂|ŷ) serves as the column-cluster prototype. Then, according to the new column cluster assignments, it updates the distributions in (11).

The procedures of ITCC with concurrent updates are summarized in Algorithm 4.

Algorithm 4. ITCC with Concurrent Updates

    Initialize C_X and C_Y with arbitrary cluster indicator matrices;
    Initialize the distributions in (11);
    repeat
        Fixing C_Y, update C_X by Eq. (12);
        update q(X̂, Ŷ) and q(X|X̂);
        Fixing C_X, update C_Y by Eq. (13);
        update q(X̂, Ŷ) and q(Y|Ŷ);
    until convergence;

To sequentialize ITCC, we should update the distributions in (11) once a row (or column) of p(X, Y) changes its cluster assignment. Since q(x̂, ŷ) is the sum of all the elements in the corresponding co-cluster, if a row (or column) changes its cluster assignment, only two rows (or columns) of q(X̂, Ŷ) need to be updated. Thus, we can update q(X̂, Ŷ) incrementally. Moreover, the distribution q(X|X̂) can, according to Eq. (10) and Eq. (11), also be obtained in an incremental manner. Since q(X) is constant, we only perform incremental updates on q(X̂) and calculate q(X|X̂) whenever needed. Algorithm 5 shows how to incrementally update q(X̂, Ŷ) and q(X̂) after a particular row x changes its row cluster assignment from x̂ to x̂*. The incremental update for q(X̂, Ŷ) and q(Ŷ) after reassigning a column can be done in a similar way and is omitted for brevity.

Algorithm 5. Incremental Update for q(X̂, Ŷ) and q(X̂)

    for ŷ = 1 to l do
        q(x̂, ŷ) = q(x̂, ŷ) − \sum_{y ∈ ŷ} p(x, y);
        q(x̂*, ŷ) = q(x̂*, ŷ) + \sum_{y ∈ ŷ} p(x, y);
    end
    q(x̂) = q(x̂) − q(x);
    q(x̂*) = q(x̂*) + q(x);

The procedures of ITCC with sequential updates are summarized in Algorithm 6.

Algorithm 6. ITCC with Sequential Updates

    Initialize C_X and C_Y with arbitrary cluster indicator matrices;
    Initialize the distributions in (11);
    repeat
        Fix C_Y;
        foreach x ∈ X do
            update C_X by Eq. (12);
            update q(X̂, Ŷ) and q(X|X̂) with Algorithm 5;
        end
        Fix C_X;
        foreach y ∈ Y do
            update C_Y by Eq. (13);
            update q(X̂, Ŷ) and q(Y|Ŷ) in an incremental way (similar to Algorithm 5);
        end
    until convergence;

    2.4 Concurrent versus Sequential

To show the benefits of sequential updates for AMCC algorithms, we implement FNMTF and ITCC with different update strategies in Java. Two real-world data sets, Coil20 [15] and 20-Newsgroup [16], are used to evaluate the co-clustering algorithms. For the 20-Newsgroup data set, we selected the top 1,000 words by mutual information. The details of the data sets are summarized in Table 1. The number of row (or column) clusters is set to 20. For the same co-clustering algorithm, we report the minimum objective function value it can obtain and the corresponding running time. The clustering results are also evaluated by two widely used metrics, i.e., clustering accuracy and normalized mutual information (NMI).

The experiments are performed on a single machine, which has an Intel Core 2 Duo E8200 2.66 GHz processor, 3 GB of RAM, and a 1 TB hard disk, and runs the 32-bit Linux Debian 6.0 OS. For a fair comparison, all of these algorithms use the same cluster assignments for initialization. Moreover, we run each algorithm 10 times, and use a different cluster initialization each time. The average results of these algorithms are reported.

TABLE 1
Description of Data Sets

Data sets   samples   features   non-zeros
Coil20      1,440     1,024      967,507
NG20        4,000     1,000      266,152


As shown in Figs. 2 and 3, both AMCC algorithms with sequential updates converge faster and get better results than their concurrent counterparts. It is not surprising that the AMCC algorithms with sequential updates also outperform their concurrent counterparts in terms of clustering accuracy and NMI, as shown in Tables 2 and 3. Such encouraging results motivate us to provide parallel solutions for AMCC algorithms with sequential updates.

3 PARALLELIZING ALTERNATE MINIMIZATION CO-CLUSTERING ALGORITHMS WITH SEQUENTIAL UPDATES

Parallelizing alternate minimization co-clustering algorithms with sequential updates is challenging. In this section, we propose two approaches to address this challenge.

    3.1 Dividing Clusters Approach

Suppose that, in a distributed environment consisting of a number of worker machines, each worker independently performs sequential updates during the iterative process. The statistics of co-clusters (S_cc) should be updated whenever a row (or column) changes its cluster assignment. However, since the workers run concurrently, this may result in inconsistent S_cc across workers. Thus, the convergence properties of co-clustering algorithms cannot be maintained. Therefore, we propose a dividing clusters approach to solve this problem.

The details of the dividing clusters approach are described as follows. Suppose we want to group the input data matrix into k row clusters and l column clusters, the number of workers is p (p ≤ min{k/2, l/2}), and each worker w_i holds a subset of rows R_i and a subset of columns C_i. When performing row clustering, we randomly divide the row clusters S^r into p non-overlapping row subsets S^r_1, S^r_2, ..., S^r_p, and guarantee that each subset contains at least two clusters. These subsets are distributed to the workers in a one-to-one manner. When worker w_i receives S^r_i, it can perform row clustering with sequential updates for its rows R_i among the subset of row clusters S^r_i. For example, assume that S^r_i is {1, 3, 6}; w_i will perform row clustering for its rows whose current cluster assignments are in S^r_i, and allow these rows to change their cluster assignments among row clusters 1, 3, and 6.

Fig. 2. Compare convergence speed and quality between concurrent updates and sequential updates.

Fig. 3. Compare convergence speed and quality between concurrent updates and sequential updates.

TABLE 2
Clustering Results of FNMTF Measured by Accuracy and NMI

Data sets   Metrics    Concurrent   Sequential
Coil20      Accuracy   0.685        0.721
            NMI        0.573        0.612
NG20        Accuracy   0.475        0.485
            NMI        0.481        0.490

TABLE 3
Clustering Results of ITCC Measured by Accuracy and NMI

Data sets   Metrics    Concurrent   Sequential
Coil20      Accuracy   0.681        0.696
            NMI        0.564        0.574
NG20        Accuracy   0.478        0.483
            NMI        0.489        0.495


Since w_i updates only a non-overlapping subset of S_cc, the sequential updates on worker w_i will never affect the updates on other workers. The subsets of S_cc and the cluster indicators updated by each worker are combined and synchronized over iterations. Here we have illustrated how to perform row clustering; column clustering can be done in a similar way.

An example of the dividing clusters approach is shown in Fig. 4. There are three workers in the distributed environment. Each worker holds a subset of rows R_i and a subset of columns C_i, and performs sequential updates on a non-overlapping subset of row (or column) clusters in each iteration. The mapping between workers and cluster subsets changes over iterations. Row and column clustering iterations are performed alternately until convergence.
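On the leading worker, the random division of row clusters can be sketched as follows. The round-robin partitioning is illustrative (any scheme that yields non-overlapping subsets with at least two clusters each would do), and it assumes p ≤ k/2; the class name is ours.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Sketch: randomly divide the k row clusters into p non-overlapping subsets,
    // one per worker, each containing at least two clusters (assumes p <= k/2).
    final class ClusterDivider {
        static List<List<Integer>> divide(int k, int p) {
            List<Integer> clusters = new ArrayList<>();
            for (int c = 0; c < k; c++) clusters.add(c);
            Collections.shuffle(clusters);                   // a new random mapping each iteration
            List<List<Integer>> subsets = new ArrayList<>();
            for (int i = 0; i < p; i++) subsets.add(new ArrayList<>());
            for (int i = 0; i < k; i++) {
                subsets.get(i % p).add(clusters.get(i));     // round-robin keeps the subsets balanced
            }
            return subsets;                                  // subsets.get(i) is sent to worker w_i
        }
    }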

The following lemma and theorem guarantee the convergence properties of AMCC algorithms parallelized by the dividing clusters approach.

Lemma 2. For alternate minimization based co-clustering algorithms, performing sequential updates for the row (or column) clustering iteration on any subset of row (or column) clusters monotonically decreases the objective function given by Eq. (1).

Proof. On any given subset of row clusters, performing row clustering for a given row u by Eq. (2) in Step I will still not increase the approximation error due to u and ρ(u), i.e., C_r(u, ρ(u)). Since the overall approximation error C(Z, Z̃) in Eq. (1) is the sum over the approximation errors due to each row and its assignment, C(Z, Z̃) will not be increased. In addition, while keeping the co-clustering (ρ, γ) fixed, performing the cluster information update by Eq. (3) in Step II will never increase C(Z, Z̃). Therefore, performing row clustering with sequential updates on any subset of row clusters monotonically decreases the objective function given by Eq. (1). For the column clustering case, the proof is similar and is omitted for brevity. ∎

Based on Lemma 2, we can easily derive the following theorem.

Theorem 3. The dividing clusters approach maintains the convergence properties of alternate minimization based co-clustering algorithms.

Discussion. Since each worker performs row (or column) clustering only on a subset of clusters in each iteration, sequential updates for row (or column) clustering should be executed in a series of iterations before switching to the other side of clustering. Intuitively, to ensure that the subset of rows (or columns) of the input data matrix held on each worker has sufficient opportunity to move to any row (or column) cluster, the number of such iterations should not be less than the number of non-overlapping subsets of clusters. We will further discuss the setting of this number in Section 5.3.

    3.2 Batching Points Approach

The dividing clusters approach eliminates the dependency on the cluster information for each worker and enables co-clustering with sequential updates in a parallel and distributed manner. However, it assumes that the number of workers is less than the number of clusters. Such an assumption might restrict the scalability of this approach. For example, when the number of workers is larger than the number of clusters, this approach cannot utilize the extra workers to perform data clustering. Therefore, by relaxing the stringent constraint of sequential updates, we introduce batch updates for AMCC algorithms. The difference between batch and sequential updates is that batch updates perform the cluster information update after a batch of rows (or columns) have updated their cluster assignments, rather than after each change in cluster assignments. The following lemma shows that batch updates maintain the convergence properties of AMCC algorithms.

Lemma 4. Alternate minimization based co-clustering algorithms with batch updates monotonically decrease the objective function given by Eq. (1).

Proof. The proof is similar to that of Theorem 1 and is omitted for brevity. ∎

Obviously, sequential and concurrent updates are extreme cases of batch updates. If the number of rows (or columns) in a batch is one, batch updating is equivalent to sequential updates; if the number of rows (or columns) in a batch equals the number of input rows (or columns), it is equivalent to concurrent updates. Batch updates provide a useful knob for tuning the update frequency of AMCC algorithms.

Compared to concurrent updates, batch updates refresh the cluster information more frequently, which implies that the advantages of sequential updates could be preserved. In addition, if the number of rows (or columns) in a batch is small and S_cc can be updated incrementally, we can still leverage incremental computation to achieve computational efficiency. Specifically, after processing a batch of rows (or columns) concurrently, S_cc can be incrementally updated according to the re-assignments in this batch. Therefore, the computational overhead of frequent updates of S_cc can be reduced.

We propose a batching points approach to parallelize batch updates. The details of this approach are as follows. Suppose each worker w_i holds a subset of rows R_i and a subset of columns C_i. When performing row clustering, we randomly divide R_i into p non-overlapping subsets R_i^1, R_i^2, ..., R_i^p.

    Fig. 4. Flow of dividing clusters approach.


We refer to R_i^j as a batch. Each worker processes only one of its batches with concurrent updates in each iteration. A synchronization process for the cluster information update is initiated at the end of each iteration. Such iterations continue until all batches have been processed; then the algorithm switches to column clustering, which is performed in a similar way. During the synchronization process, each worker computes the statistics of co-clusters on R_i (or C_i). We refer to such statistics obtained by each worker as one slice of S_cc. All of the slices will be combined to obtain a new S_cc used for the next iteration.

Fig. 5 presents an example of the batching points approach. There are three workers in the distributed environment. Each worker holds a subset of rows R_i and a subset of columns C_i, and further divides R_i and C_i into two batches respectively. For a row clustering iteration, each worker selects one of its row batches to perform co-clustering with concurrent updates. The row clustering iteration is performed twice, and then the column clustering iteration begins. Such row and column clustering iterations are performed until convergence.
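One worker's row-clustering pass under this approach can be sketched as follows; the batch construction and the synchronization call are placeholders standing in for the slice-combining step described above, not the framework's actual code.

    // Sketch of one worker's row clustering under the batching points approach.
    // rowsOfWorker holds the row indices of R_i; one batch is processed per
    // iteration, with a synchronization (slice combine) after each batch.
    abstract class BatchingWorker {
        abstract int assignRow(int rowIndex);      // Eq. (2), using the current S_cc
        abstract void synchronizeSlices();         // placeholder: send this worker's slice of S_cc, receive the combined S_cc

        void rowClustering(int[] rowsOfWorker, int[] rho, int numBatch) {
            int batchSize = (rowsOfWorker.length + numBatch - 1) / numBatch;
            for (int b = 0; b < numBatch; b++) {
                int from = b * batchSize;
                int to = Math.min(from + batchSize, rowsOfWorker.length);
                for (int i = from; i < to; i++) {  // concurrent updates within the batch
                    int u = rowsOfWorker[i];
                    rho[u] = assignRow(u);
                }
                synchronizeSlices();               // cluster information refreshed once per batch
            }
        }
    }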

The following theorem guarantees the convergence properties of AMCC algorithms parallelized by the batching points approach; it can be easily derived from Lemma 4.

Theorem 5. The batching points approach maintains the convergence properties of alternate minimization based co-clustering algorithms.

Discussion. The potential performance gain of the batching points approach is not free, since performing batch updates requires more synchronization than concurrent updates. In other words, there is a trade-off between communication overhead and computational efficiency in this approach. To find the optimal number of batches num_batch (num_batch ≥ 1) that offers the best trade-off, we need to quantify both the performance gain and the performance overhead of the batching points approach. We define one-pass processing as follows: all rows and columns of the input data matrix are processed by a given co-clustering algorithm in one pass.

Specifically, for concurrent updates, one-pass processing consists of a row clustering iteration and a column clustering iteration. For batch updates, it consists of num_batch row clustering iterations and num_batch column clustering iterations. Suppose t_bat and t_con are the running times of one-pass processing with batch updates and concurrent updates respectively, and Δcost_bat and Δcost_con are the corresponding decreases of the objective function. We consider the ratio of Δcost_bat to Δcost_con as the performance gain of batch updates, and the ratio of t_bat to t_con as its performance overhead. Typically, we expect the performance gain of batch updates to be no less than its performance overhead, and thus num_batch should satisfy the following inequality:

    \frac{t_{bat}}{t_{con}} = \frac{t_{comp} + 2 \cdot num_{batch} \cdot t_{sync}}{t_{comp} + 2 \cdot t_{sync}} \le \frac{\Delta cost_{bat}}{\Delta cost_{con}},    (14)

where t_comp is the computational time for one-pass processing, and t_sync is the time consumption of synchronization among workers for one iteration. Obviously, both the performance gain and the performance overhead are functions of num_batch, and they are specific to the data set and the experimental environment. In practice, it is not realistic to try all possible numbers to find the optimal number of batches, which would maximize the performance (i.e., maximize the ratio of performance gain to performance overhead). In Section 5.3, we will provide a practical method to determine this number.
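For reference, solving Eq. (14) for num_batch gives an explicit upper bound in terms of the quantities defined above; the rearrangement below only restates the inequality.

    \frac{t_{comp} + 2 \cdot num_{batch} \cdot t_{sync}}{t_{comp} + 2 \cdot t_{sync}} \le \frac{\Delta cost_{bat}}{\Delta cost_{con}}
    \;\Longrightarrow\;
    num_{batch} \le \frac{\frac{\Delta cost_{bat}}{\Delta cost_{con}} \cdot (t_{comp} + 2 \cdot t_{sync}) - t_{comp}}{2 \cdot t_{sync}}.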

    4 CO-CLUSTERD FRAMEWORK

In this section, we first show the design and implementation of our Co-ClusterD framework for alternate minimization co-clustering algorithms with sequential updates. Then we give the API sets that the framework provides. Finally, we describe the fault tolerance and load balance mechanisms adopted by Co-ClusterD.

4.1 Design and Implementation

Based on the approaches for parallelizing sequential updates proposed in the previous section, we design and implement Co-ClusterD, a distributed framework for AMCC algorithms with sequential updates. The architecture of the Co-ClusterD framework is shown in Fig. 6.

Co-ClusterD consists of a number of basic workers and a leading worker. Each basic worker independently performs row and column clustering by its updater component. As shown in Fig. 7, it maintains five key-value pair tables in its memory to store the following variables: a non-overlapping subset of rows of the input data matrix R_i, a non-overlapping subset of columns of the input data matrix C_i, row-cluster indicators of the input data matrix I_r, column-cluster indicators of the input data matrix I_c, and the statistics of co-clusters S_cc.

    Fig. 5. Flow of batching points approach.

Fig. 6. The architecture of the Co-ClusterD framework.


Specifically, the key field of the table for storing R_i (or C_i) is the row (or column) identifier, and the value field is the corresponding elements; the key field of the table for storing I_r (or I_c) is the row (or column) identifier, and the value field is the cluster identifier the row (or column) belongs to; the key field of the table for storing S_cc is the co-cluster identifier, and the value field is the statistic of the co-cluster. R_i and C_i are constant values during the data co-clustering process. At the beginning of each row (or column) clustering iteration, each basic worker receives S_cc and I_r (or I_c) from the leading worker, and then performs row (or column) clustering. The updated subsets of S_cc used in the dividing clusters approach (or the slices of S_cc used in the batching points approach) and the cluster indicators are sent to the leading worker by the basic workers in each clustering iteration.
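The five tables can be pictured as in-memory key-value maps; the sketch below uses plain Java collections with illustrative types and names, not the framework's own table classes.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a basic worker's in-memory state: five key-value tables.
    final class BasicWorkerState {
        Map<Integer, double[]> rows = new HashMap<>();        // R_i: row id -> row elements
        Map<Integer, double[]> columns = new HashMap<>();     // C_i: column id -> column elements
        Map<Integer, Integer> rowClusterOf = new HashMap<>(); // I_r: row id -> row cluster id
        Map<Integer, Integer> colClusterOf = new HashMap<>(); // I_c: column id -> column cluster id
        Map<Long, Double> coClusterStat = new HashMap<>();    // S_cc: packed co-cluster id (p, q) -> statistic

        static long coClusterKey(int p, int q) {              // pack (p, q) into a single key
            return (((long) p) << 32) | (q & 0xffffffffL);
        }
    }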

In Co-ClusterD, the leading worker plays a coordinating role during the data co-clustering process. It uses its divider and combiner components to divide clusters and to combine non-overlapping subsets of S_cc or slices of S_cc. In addition, it is also responsible for synchronizing cluster assignments and the cluster information among workers. In particular, after the leading worker receives the subsets (or slices) of S_cc and the cluster indicators from all the workers in each clustering iteration, it reconstructs S_cc and I_r (or I_c), and then broadcasts them to the basic workers.

For a given co-clustering job, Co-ClusterD proceeds in two stages: cluster information initialization and data co-clustering. In the cluster information initialization stage, assuming there are w workers in the distributed environment, Co-ClusterD first partitions the input data matrix into w row and w column subsets. Next, each worker loads one row subset and one column subset, and randomly initializes the cluster assignments for the rows and columns it holds. Then, each worker calculates its slice of S_cc and sends it to the leading worker. Finally, the leading worker combines all slices of S_cc to obtain the initial S_cc. In the data co-clustering stage, Co-ClusterD works on the co-clustering algorithm implemented by users. The algorithm can be easily implemented by overriding a number of APIs provided by Co-ClusterD (see Section 4.2 for details). It alternately performs row and column clustering until the number of iterations exceeds a user-defined threshold. In particular, for the dividing clusters approach, users can specify the number of iterations repeated for row (or column) clustering before switching to the other side of clustering. For the batching points approach, users can specify the number of row (or column) batches each worker holds.

Co-ClusterD is implemented on top of iMapreduce [17], a distributed framework based on Hadoop with built-in support for iterative algorithms (see [17] for details). In fact, Co-ClusterD is independent of the underlying distributed framework; it can also be implemented on other distributed frameworks (e.g., Spark [18] and MPI). We choose iMapreduce since it can better support the iterative processes of co-clustering algorithms.

Notice that, in the Co-ClusterD framework, the row and column subsets held by each worker are not shuffled; only the updated cluster information S_cc and the cluster assignments I_r (or I_c) are synchronized among workers over iterations through the network. Since S_cc (a k × l matrix, with k ≪ m and l ≪ n), I_r (an m × 1 matrix), and I_c (an n × 1 matrix) are very small, the communication overhead of the Co-ClusterD framework is small.

    4.2 API

Co-ClusterD allows users without much knowledge of distributed computing to write distributed AMCC algorithms. Users need to implement only a set of well-defined APIs provided by Co-ClusterD. In fact, these APIs are callback functions, which are automatically invoked by the framework during the data co-clustering process. The descriptions of these APIs are as follows.

1) cProto genClusterProto(bRow, pointS, Ind): Users specify how to generate the cluster prototype. Recall that the cluster prototype plays the role of the "centroid" in row (or column) clustering, and it can be constructed from the cluster indicators and the statistics of co-clusters S_cc. The parameter bRow indicates whether row clustering is performed. If bRow is true, pointS is one row of S_cc, and Ind is the column-cluster indicators of the input data matrix. Otherwise, pointS is one column of S_cc, and Ind is the row-cluster indicators of the input data matrix.

2) double disMeasure(point, cProto): Given a point and a cluster prototype cProto, users specify a measure to quantify the distance between them. point denotes a row or a column of the input data matrix.

3) Scc updateSInc(bRow, point, preCID, curCID, Scc, rInd, cInd): Users specify how to incrementally update S_cc when a point changes its cluster assignment from the previous cluster preCID to the current cluster curCID. rInd and cInd are the row-cluster and column-cluster indicators of the input data matrix respectively.

4) slice updateSliceInc(bRow, point, preCID, curCID, slice, subInd, Ind): Users specify how to incrementally update a slice of S_cc when a point changes its cluster assignment from the previous cluster preCID to the current cluster curCID. If bRow is true, subInd is the row-cluster indicators of the subset of rows the worker holds, and Ind is the column-cluster indicators of the input data matrix. Otherwise, subInd is the column-cluster indicators of the subset of columns the worker holds, and Ind is the row-cluster indicators of the input data matrix.

5) slice buildOneSlice(bRow, subInd, Ind): Users specify how to build one slice of S_cc. The parameters of this function have the same meaning as the parameters of function (4).

6) Scc combineSlices(slices, rInd, cInd): Users specify how to combine the slices of S_cc sent by workers. slices are all the slices of S_cc sent by workers. rInd and cInd are the row-cluster and column-cluster indicators of the input data matrix respectively.

Fig. 7. Basic worker.

Notice that we do not add constant variables such as the subset of rows (or columns) to the parameter lists of these APIs. Constant variables are initialized by an initialization process and exposed as global variables.

The API sets used by the dividing clusters approach and the batching points approach are summarized in Table 4.
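To make the callback style concrete, a possible FNMTF-flavored implementation of disMeasure and of the row-clustering case of genClusterProto is sketched below. The Java signatures (array-based types, class name) are our assumptions, since the framework's exact type definitions are not reproduced here.

    // Hypothetical sketch of two Co-ClusterD callbacks for FNMTF.
    // Points and prototypes are assumed to be double[], indicators int[],
    // and S_cc a double[][] of co-cluster means.
    final class FnmtfCallbacks {
        // disMeasure: squared Euclidean distance between a point and a cluster prototype.
        static double disMeasure(double[] point, double[] cProto) {
            double sum = 0.0;
            for (int i = 0; i < point.length; i++) {
                double d = point[i] - cProto[i];
                sum += d * d;
            }
            return sum;
        }

        // genClusterProto (row-clustering case): expand one row of S_cc into a full
        // prototype using the column-cluster indicators, i.e., one row of S * G^T.
        static double[] genClusterProto(double[] pointS, int[] colInd) {
            double[] proto = new double[colInd.length];
            for (int v = 0; v < colInd.length; v++) {
                proto[v] = pointS[colInd[v]];   // each column takes the statistic of its column cluster
            }
            return proto;
        }
    }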

    4.3 Fault Tolerance and Load Balance

In Co-ClusterD, fault tolerance is implemented by a global checkpoint/restore mechanism, which is performed at a user-defined time interval. The cluster information and cluster assignments held by the leading worker are periodically dumped to a reliable file system. If any worker fails (including the leading worker), the computation rolls back to the most recent iteration checkpoint and resumes from that iteration. In addition, since Co-ClusterD is based on iMapreduce [17], it also inherits all the salient features of MapReduce-style fault tolerance.

The synchronous computation model used by our Co-ClusterD framework also makes load balancing an important issue, because the running time of each iteration is determined by the slowest worker. Therefore, if the capacity of each computer in the distributed environment is homogeneous, the workload should be evenly distributed. Otherwise, the workload should be distributed according to the capacity of each worker.

    5 EXPERIMENTAL RESULTS

In this section, we evaluate the effectiveness and efficiency of our Co-ClusterD framework in the context of two alternate minimization based co-clustering algorithms, ITCC and FNMTF, on several real world data sets. We compare Co-ClusterD to DisCo [19], a state-of-the-art Hadoop based co-clustering framework. Since DisCo does not support AMCC algorithms with sequential updates, we implement only concurrent updates in DisCo. However, in Co-ClusterD, both concurrent and sequential updates are implemented. The experiments are performed on both small-scale and large-scale clusters.

5.1 Experiment Setup

We use real world data sets downloaded from the UCI Machine Learning Repository [16] to evaluate the co-clustering algorithms. These data sets are summarized in Table 5.

Since these data sets do not have class labels, metrics such as clustering accuracy and normalized mutual information are not used. In fact, for a clustering algorithm, the quality of the clustering results obtained by its different update strategies can also be reflected by the minimum objective function value it can obtain. Therefore, in our experiments, we only consider the objective function value as the performance metric to measure the quality of the clustering results. We also report the running time of the clustering algorithms.

We build a small-scale cluster of local machines and a large-scale cluster on the Amazon EC2 cloud to run experiments. The small-scale cluster consists of four machines. Each machine has an Intel Core 2 Duo E8200 2.66 GHz processor, 3 GB of RAM, and a 1 TB hard disk, and runs the 32-bit Linux Debian 6.0 OS. These four machines are connected to a switch with a communication bandwidth of 1 Gbps. The large-scale cluster consists of 100 High-CPU medium instances on the Amazon EC2 cloud. Each instance has 1.7 GB of memory and five EC2 compute units.

For a fair comparison, we enforce all the algorithms to be initialized with the same cluster assignments. We run the algorithms 10 times, and each time a different initialization is used. The average results of these algorithms are reported.

5.2 Small-Scale Experiments

For the small-scale experiments, the data sets KOS and NIPS are used for evaluation. The number of row (or column) clusters is set to 40. For the dividing clusters approach, each worker is randomly assigned a subset of row (or column) clusters of size 10 in each iteration. In addition, the number of iterations repeated for row (or column) clustering is set to 4. For the batching points approach, each worker divides its subset of rows (or columns) into 16 batches. We will discuss how to choose these parameters in the next section.

As shown in Figs. 8 and 9, we can observe that co-clustering algorithms with concurrent updates implemented in Co-ClusterD (denoted by "Concurrent") converge faster than those implemented in DisCo (denoted by "DisCo"), although they converge to the same values. This is because Co-ClusterD leverages a persistent job for the iterative process of the co-clustering algorithm rather than using one job per row (or column) clustering iteration, as DisCo does. Hence, Co-ClusterD reduces the repeated job initialization overhead in each iteration and achieves faster convergence. In addition, we can also observe that algorithms parallelized by the dividing clusters approach and the batching points approach (denoted by "Dividing Clusters" and "Batching Points" respectively) converge much faster and obtain better results than their concurrent counterparts implemented in DisCo and Co-ClusterD.

TABLE 4
API Sets for Different Approaches

Approach            Initialization   Data Co-clustering
Dividing Clusters   (5), (6)         (1), (2), (3)
Batching Points     (5), (6)         (1), (2), (4), (6)

TABLE 5
Description of Data Sets

Data sets   samples   features   non-zeros
KOS         3,430     6,906      467,714
NIPS        1,500     12,419     1,900,000
ENRON       39,861    28,102     6,400,000

    5.3 Tuning Parameters

To quantify the effects of the parameters used in Co-ClusterD, we conduct two experiments. In the first experiment, we run algorithms using the dividing clusters approach with different numbers of row (or column) clustering iterations before switching to the other side of clustering. In the second experiment, we run algorithms using the batching points approach with different numbers of batches. The number of row (or column) clusters is set to 40. All of these experiments are performed on our small-scale local cluster. Note that we performed these two experiments on different data sets and different co-clustering algorithms; due to space limitations, we show only representative results.

Fig. 10a presents the results of the first experiment, where the number that follows the name of the approach denotes the number of row (or column) iterations it continues to perform. From this figure, we observe that setting this number too small or too large results in slow convergence. According to our empirical results, setting it to the number of workers (i.e., the number of non-overlapping subsets of clusters) typically yields good performance for the dividing clusters approach.

Fig. 8. Cost function versus running time comparisons for co-clustering algorithms with different update strategies.

Fig. 9. Cost function versus running time comparisons for co-clustering algorithms with different update strategies.

Fig. 10. Tuning parameters for the dividing clusters and batching points approaches.

Fig. 10b plots the results of the second experiment, where the number that follows the name of the approach denotes the number of row (or column) batches each worker holds. The results shown in this figure indicate that setting this number too small or too large results in poor performance. They also show that there is an interval between the small and the large values in which the performance is not sensitive to the number of batches; for example, the performance with eight batches is similar to that with 32 batches. Based on these observations, we can use the following method to find a good candidate for the number of batches. We first give an initial guess for the number of batches that is relatively small (e.g., 4 or 8), and then run the co-clustering algorithm to measure the other variables in Eq. (14). Once we obtain the ratio $\Delta cost_{bat} / \Delta cost_{con}$ under the initial guess, we replace it with a higher value (e.g., 1x to 2x the obtained ratio) and take this new ratio as the expected gain. We can then derive a range for $num_{batch}$ from the expected gain and Eq. (14). Typically, selecting a number around the middle of this range achieves good performance.
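The sketch below outlines this selection procedure. It relies on a hypothetical helper, range_from_eq14, that inverts Eq. (14) to return the interval of batch counts consistent with a target gain; that helper, its signature, and the default values are assumptions for illustration only.

    # Illustrative sketch of the batch-count selection heuristic described above.
    # `run_trial` runs the co-clustering algorithm with a given number of batches
    # and returns the measured quantities appearing in Eq. (14); `range_from_eq14`
    # inverts Eq. (14). Both callables are hypothetical.
    def pick_num_batches(run_trial, range_from_eq14, initial_guess=8, boost=1.5):
        # 1. Run once with a small initial guess and measure the cost-reduction ratio.
        measured = run_trial(num_batches=initial_guess)
        ratio = measured["delta_cost_bat"] / measured["delta_cost_con"]

        # 2. Take a somewhat larger ratio (1x to 2x the observed one) as the expected gain.
        expected_gain = boost * ratio

        # 3. Derive the admissible range of batch counts from Eq. (14) and
        #    pick a value near the middle of that range.
        low, high = range_from_eq14(expected_gain, measured)
        return (low + high) // 2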

    5.4 Large-Scale Experiments

To validate the scalability of our framework, we also run experiments on the Amazon EC2 cloud. The ENRON data set is used for evaluation, and the number of row (or column) clusters is set to 20. Since the scalability of the dividing clusters approach depends on the relationship between the number of workers and the number of clusters, only the batching points approach is evaluated. The number of row (or column) batches each worker holds is set to 4.

As shown in Figs. 11a and 12a, the algorithms implemented in DisCo (denoted by "DisCo") and the algorithms implemented in the Co-ClusterD framework obtain similar speedups. The main reason is that the algorithms implemented in Co-ClusterD do not shuffle the row and column subsets held by workers and only synchronize the updated cluster information $S_{cc}$ and the cluster assignments $I_r$ (or $I_c$) over clustering iterations. We can also observe that the speedup obtained by Co-ClusterD with concurrent updates (denoted by "Concurrent") is slightly better than that of Co-ClusterD using the batching points approach (denoted by "Batching Points"). The reason is that concurrent updates perform fewer cluster information updates than the batching points approach and thus incur less synchronization overhead.

In Figs. 11b and 12b, we observe that, as the number of workers increases, the running time of the algorithms implemented in DisCo and Co-ClusterD is significantly reduced. Note that, since the baselines used to compute the speedups differ, a better speedup does not necessarily imply a shorter running time. As shown in Figs. 11b and 12b, co-clustering algorithms parallelized by the batching points approach converge almost twice as fast as their concurrent counterparts implemented in Co-ClusterD.

We also evaluate how the co-clustering algorithms implemented in our Co-ClusterD framework scale with increasing input size, adjusting the input size so that the amount of computation per worker remains fixed as the number of workers grows. For this experiment, ideal scaling corresponds to a constant running time as the input size increases with the number of workers. As shown in Figs. 13a and 13b, the achieved scaling is within 20 percent of the ideal, which is acceptable.
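For reference, the sketch below shows how such a weak-scaling efficiency can be computed from measured running times; the timing numbers are placeholders, not measurements reported in the paper.

    # Illustrative weak-scaling check: the input size grows with the number of
    # workers so that the work per worker stays fixed; ideally the running time
    # stays constant, i.e., efficiency stays at 1.0.
    def weak_scaling_efficiency(base_time, times_by_workers):
        """Efficiency of each configuration relative to the baseline."""
        return {n: base_time / t for n, t in times_by_workers.items()}

    measured = {25: 100.0, 50: 108.0, 100: 118.0}    # seconds, hypothetical
    print(weak_scaling_efficiency(100.0, measured))  # >= 0.8 means within 20% of ideal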

    6 RELATED WORK

In this section, we review work related to data co-clustering and large-scale data clustering.

Fig. 11. Speedup and running time comparisons.

Fig. 12. Speedup and running time comparisons.


6.1 Data Co-Clustering

There is a large body of work on algorithms for co-clustering. Dhillon [1] formulates the co-clustering problem as a graph partitioning problem and proposes a spectral co-clustering algorithm that uses eigenvectors to co-cluster the input data matrix. Ding et al. [21] explore the relationships between nonnegative matrix factorization (NMF) [20] and k-means clustering, and propose to use nonnegative matrix tri-factorization to co-cluster the input data matrix. Information theoretic co-clustering, proposed in [11], is an alternate minimization based co-clustering algorithm that monotonically optimizes the objective function by intertwining row and column clustering iterations at all stages. Many variations of this kind of co-clustering algorithm have been proposed using different optimization criteria, such as sum-squared distance [3] and code length [14]. Based on Bregman divergence, a more general framework for this kind of algorithm has been proposed in [9]. Our framework supports these alternate minimization based co-clustering algorithms with different update strategies, and they can be easily implemented by overriding a set of APIs provided by our framework.

    6.2 Large Scale Data Clustering

As huge data sets become prevalent, improving the scalability of clustering algorithms has drawn more and more attention, and many scalable clustering algorithms have been proposed recently. Dave et al. [22] propose a scheme for implementing k-means with concurrent updates on Microsoft's Windows Azure cloud. Ene et al. [23] design a method for implementing k-center and k-median on MapReduce. These studies differ from ours, as they are devoted to scaling up one-sided clustering algorithms. To minimize the I/O cost and the network cost among processing nodes, Cordeiro et al. [24] propose the best of both worlds (BoW) method for data clustering with MapReduce, which can automatically spot the bottleneck and choose a good strategy. Since the BoW method can be treated as a plug-in tool for most clustering algorithms, it is orthogonal to our work.

Folino et al. [25] propose an efficient parallel solution to the high-order co-clustering problem (i.e., the problem of simultaneously clustering heterogeneous types of domains) [26]. George and Merugu [27] design a parallel version of the weighted Bregman co-clustering algorithm [9] and use it to build an efficient real-time collaborative filtering framework. Deodhar et al. [28] develop a parallelized implementation of the simultaneous co-clustering and learning algorithm [29] based on MapReduce. Papadimitriou and Sun [19] propose the distributed co-clustering (DisCo) framework, under which various co-clustering algorithms can be implemented. Using the barycenter heuristic [30], Nisar et al. [31] propose a high-performance parallel approach for data co-clustering. Kwon and Cho [32] parallelize all the co-clustering algorithms in the Bregman co-clustering framework [9] using the message passing interface (MPI). Narang et al. [33], [34] present a novel hierarchical design for a soft real-time distributed co-clustering based collaborative filtering algorithm. However, while these studies focus on parallelizing co-clustering with concurrent updates, our work is devoted to parallelizing co-clustering with sequential updates.

    7 CONCLUSIONS

In this paper, we introduce sequential updates for alternate minimization based co-clustering algorithms and propose the dividing clusters and batching points approaches to parallelize AMCC algorithms with sequential updates. We prove that both approaches maintain the convergence properties of AMCC algorithms. Based on these two approaches, we design and implement a distributed framework, referred to as Co-ClusterD, which supports efficient implementations of AMCC algorithms with sequential updates. Empirical results show that AMCC algorithms implemented in Co-ClusterD achieve much faster convergence and often obtain better results than their traditional concurrent counterparts.

    ACKNOWLEDGMENTS

The authors thank the reviewers for their valuable comments, which significantly improved this paper. The work was supported in part by the following funding agencies of China: the National Natural Science Foundation under grant 61170274, the National Basic Research Program (973 Program) under grant 2011CB302506, and the Fundamental Research Funds for the Central Universities under grant 2014RC1103. This work was also supported in part by the US National Science Foundation under grants CNS-1217284 and CCF-1018114. A preliminary version of this paper was accepted for publication in the proceedings of the 2013 IEEE International Conference on Data Mining series (ICDM 2013), Dallas, Texas, December 7-11, 2013. Prof. Sen Su is the corresponding author.

    Fig. 13. Scaling input size.


REFERENCES

[1] I. Dhillon, "Co-clustering documents and words using bipartite spectral graph partitioning," in Proc. Knowl. Discovery Data Mining, 2001, pp. 269–274.

[2] S. Daruru, N. M. Marin, M. Walker, and J. Ghosh, "Pervasive parallelism in data mining: Dataflow solution to co-clustering large and sparse netflix data," in Proc. 15th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2009, pp. 1115–1124.

[3] H. Cho, I. Dhillon, Y. Guan, and S. Sra, "Minimum sum-squared residue co-clustering of gene expression data," in Proc. SIAM Int. Conf. Data Mining, 2004, vol. 114, pp. 114–125.

[4] R. M. Neal and G. E. Hinton, "A view of the EM algorithm that justifies incremental, sparse, and other variants," Learning in Graphical Models, pp. 355–368, 1999.

[5] B. Thiesson, C. Meek, and D. Heckerman, "Accelerating EM for large databases," Mach. Learn., vol. 45, no. 3, pp. 279–299, 2001.

[6] N. Slonim, N. Friedman, and N. Tishby, "Unsupervised document classification using sequential information maximization," in Proc. 25th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2002, pp. 129–136.

[7] I. Dhillon, Y. Guan, and J. Kogan, "Iterative clustering of high dimensional text data augmented by local search," in Proc. IEEE Int. Conf. Data Mining, 2003, pp. 131–138.

[8] J. Yin, Y. Zhang, and L. Gao, "Accelerating expectation-maximization algorithms with frequent updates," in Proc. IEEE Int. Conf. Cluster Comput., 2012, pp. 275–283.

[9] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha, "A generalized maximum entropy approach to Bregman co-clustering and matrix approximation," in Proc. 10th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2004, pp. 509–514.

[10] H. Wang, F. Nie, H. Huang, and F. Makedon, "Fast nonnegative matrix tri-factorization for large-scale data co-clustering," in Proc. 22nd Int. Joint Conf. Artif. Intell., 2011, pp. 1553–1558.

[11] I. Dhillon, S. Mallela, and D. Modha, "Information-theoretic co-clustering," in Proc. 9th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2003, pp. 89–98.

[12] J. Hartigan, "Direct clustering of a data matrix," J. Am. Statist. Assoc., vol. 67, pp. 123–129, 1972.

[13] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," in Proc. SDM, 2004, pp. 234–245.

[14] D. Chakrabarti, S. Papadimitriou, D. Modha, and C. Faloutsos, "Fully automatic cross-associations," in Proc. 10th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2004, pp. 79–88.

[15] Columbia University Image Library. [Online]. Available: http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php, 1996.

[16] UCI Machine Learning Repository. [Online]. Available: http://archive.ics.uci.edu/ml/datasets.html, 2013.

[17] Y. Zhang, Q. Gao, L. Gao, and C. Wang, "iMapreduce: A distributed computing framework for iterative computation," in Proc. IEEE Int. Symp. Parallel Distrib. Process. Workshops PhD Forum, 2011, pp. 1112–1121.

[18] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster computing with working sets," in Proc. 2nd USENIX Conf. Hot Topics Cloud Comput., 2010, pp. 10–10.

[19] S. Papadimitriou and J. Sun, "DisCo: Distributed co-clustering with map-reduce: A case study towards petabyte-scale end-to-end mining," in Proc. 8th IEEE Int. Conf. Data Mining, 2008, pp. 512–521.

[20] D. Lee and H. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, no. 6755, pp. 788–791, 1999.

[21] C. Ding, T. Li, W. Peng, and H. Park, "Orthogonal nonnegative matrix t-factorizations for clustering," in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2006, pp. 126–135.

[22] A. Dave, W. Lu, J. Jackson, and R. Barga, "Cloudclustering: Toward an iterative data processing pattern on the cloud," in Proc. IEEE Int. Symp. Parallel Distrib. Process. Workshops PhD Forum, 2011, pp. 1132–1137.

[23] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce," in Proc. 17th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2011, pp. 681–689.

[24] R. Cordeiro, C. T. Jr., A. Traina, J. Lopez, U. Kang, and C. Faloutsos, "Clustering very large multi-dimensional datasets with MapReduce," in Proc. 17th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Aug. 2011, no. 17, pp. 690–698.

[25] F. Folino, G. Greco, A. Guzzo, and L. Pontieri, "Scalable parallel co-clustering over multiple heterogeneous data types," in Proc. Int. Conf. High Perform. Comput. Simul., 2010, pp. 529–535.

[26] G. Greco, A. Guzzo, and L. Pontieri, "Coclustering multiple heterogeneous domains: Linear combinations and agreements," IEEE Trans. Knowl. Data Eng., vol. 22, no. 12, pp. 1649–1663, Dec. 2010.

[27] T. George and S. Merugu, "A scalable collaborative filtering framework based on co-clustering," in Proc. 5th IEEE Int. Conf. Data Mining, 2005, pp. 625–628.

[28] M. Deodhar, C. Jones, and J. Ghosh, "Parallel simultaneous co-clustering and learning with map-reduce," in Proc. IEEE Int. Conf. Granular Comput., 2010, pp. 149–154.

[29] M. Deodhar and J. Ghosh, "A framework for simultaneous co-clustering and learning from complex data," in Proc. 13th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2007, pp. 250–259.

[30] W. Ahmad and A. Khokhar, "Chawk: A highly efficient biclustering algorithm using bigraph crossing minimization," in Proc. 2nd Int. Workshop Data Mining Bioinf., 2007, pp. 1–12.

[31] A. Nisar, W. Ahmad, W. Liao, and A. N. Choudhary, "High performance parallel/distributed biclustering using barycenter heuristic," in Proc. SIAM Int. Conf. Data Mining, 2009, pp. 1050–1061.

[32] B. Kwon and H. Cho, "Scalable co-clustering algorithms," in Proc. ICA3PP, 2010, pp. 32–43.

[33] A. Narang, A. Srivastava, and N. P. K. Katta, "High performance distributed co-clustering and collaborative filtering," IBM Res., NY, United States, Tech. Rep. RI11019, 2011, pp. 1–28.

[34] A. Narang, A. Srivastava, and N. P. K. Katta, "High performance offline and online distributed collaborative filtering," in Proc. IEEE 12th Int. Conf. Data Mining, 2012, pp. 549–558.

Xiang Cheng received the PhD degree in computer science from the Beijing University of Posts and Telecommunications, China, in 2013. He is currently an assistant professor at the Beijing University of Posts and Telecommunications. His research interests include information retrieval, data mining, and data privacy.

Sen Su received the PhD degree in computer science from the University of Electronic Science and Technology, China, in 1998. He is currently a professor at the Beijing University of Posts and Telecommunications. His research interests include data privacy, cloud computing, and internet services.

Lixin Gao is a professor of electrical and computer engineering at the University of Massachusetts at Amherst. She received the PhD degree in computer science from the University of Massachusetts at Amherst in 1996. Her research interests include social networks, Internet routing, network virtualization, and cloud computing. She is a fellow of the ACM and the IEEE.

Jiangtao Yin received the BE degree from the Beijing Institute of Technology, China, in 2006, and the ME degree from the Beijing University of Posts and Telecommunications, China, in 2009, and is currently working toward the PhD degree in the Department of Electrical and Computer Engineering, University of Massachusetts at Amherst. Since fall 2009, he has been working as a research assistant at the University of Massachusetts at Amherst. His current research interests include distributed systems, big data analytics, and cloud computing.

    " For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

