
Designing scalable and efficient parallel clustering algorithms on arrays with reconfigurable optical buses ☆

Chin-Hsiung Wu, Shi-Jinn Horng*, Yi-Wen Chen, Wei-Yi Lee

Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, ROC

Received 2 August 1999; revised 15 March 2000; accepted 31 March 2000

Abstract

Clustering is a part of data analysis that is required in many fields. In most applications, such as unsupervised pattern recognition and image segmentation, the number of patterns may be very large. Therefore, designing fast and processor-efficient parallel algorithms for clustering is definitely of fundamental importance. In this paper, for N patterns and K centers each with M features, we propose several efficient and scalable parallel algorithms for squared error clustering on arrays with reconfigurable optical buses with various numbers of processors. Based on the advantages of both optical transmission and electronic computation, the proposed algorithms can be run in O((K/p) log N), O(log N), O((KNM/pqr) + log r + log q), O(K/p), O(1), O(K) and O(KM) time using p × M × N/log N, K × M × N/log N, p × q × r, p × M × N^{1+1/ε}, K × M × N^{1+1/ε}, M × N^{1+1/ε} and N^{1+1/ε} processors, for some constant ε and ε ≥ 1, respectively. These results are more efficient and scalable than the previously known algorithms. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Pattern recognition; Partitional clustering; Squared error clustering; Parallel algorithm; Arrays with reconfigurable optical buses

1. Introduction

Clustering techniques are widely applied in many areas such as the life sciences, medical sciences, social sciences, earth sciences, image processing and so on, and the applications continue to grow. In image processing, clustering is used for pattern recognition, image segmentation, registration, compression and object detection [1,6]. Cluster analysis is the process of classifying objects into subsets that have meaning in the context of a particular problem [6]. Conventionally, the objects are characterized as patterns, and the patterns are numerical vectors in pattern analysis. Assume that there are N patterns and each pattern contains M features. The clustering of such patterns is achieved by minimizing intracluster dissimilarity and maximizing intercluster dissimilarity.

Most existing clustering techniques can be classified into two categories: partitional and hierarchical. Hierarchical clustering produces a nested sequence of partitions, whereas partitional clustering produces a single partition. The squared error method is the most popular of the partitional clustering techniques. Both partitional and hierarchical clustering require intensive computation, even for a modest number of patterns. Thus, designing fast and processor-efficient parallel algorithms for clustering is definitely of fundamental importance.

The two major phases of partitional clustering are partition adjusting and cluster center updating, as described in Section 4. The sequential partitional clustering algorithm can be computed in O(KMN) time in a straightforward manner, where K is the given number of clusters. Parallel implementations of sequential partitional clustering algorithms have been studied extensively by many researchers. Ni and Jain [12] proposed a two-level systolic array for clustering with a speedup of 1300 over a serial processor, where each processor is pipelined to further enhance the system performance. Li and Fang [10] gave an O(K log MN) time parallel algorithm on the SIMD hypercube multiprocessors with MN processors. They viewed the hypercube as an M × N array and used the cumulative sums operation to perform the two major phases. Their algorithm can be run only when the number of processors is at least MN, and both M and N must be powers of 2. Ranka and Sahni [21] also gave two O(K + log MN) and O(log KMN) time parallel algorithms on the SIMD hypercube multiprocessors with MN and KMN processors, respectively.

Image and Vision Computing 18 (2000) 1033–1043

0262-8856/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S0262-8856(00)00044-5

www.elsevier.com/locate/imavis

☆ This work was supported by the National Science Council under contract no. NSC-88-2213-E011-082.
* Corresponding author. Tel.: +886-2737-6700; fax: +886-2737-6699.

E-mail address: [email protected] (S.-J. Horng).


They viewed an NMK hypercube as an N × M × K array with each N × M subarray forming one plane. They first described nine basic data manipulation operations, such as window sum, consecutive sum and so on, and then parallelized the clustering algorithm based on these operations. Hwang et al. [5] gave an O(KM + (KMN/p)) time parallel algorithm on the orthogonal multiprocessor sharing memory with an enhanced mesh (OMP) using p processors and p² memory modules. They first distributed the N patterns evenly over the memory modules, N/p² patterns per module, and then applied the parallel histogramming method to perform the two major phases. Jenq and Sahni [7] gave three O(KM + K log N), O(K log MN) and O(M + log KMN) time parallel algorithms on the reconfigurable mesh (RMESH), with N, MN and KMN processors, respectively. They applied the consecutive sums operation and bus reconfiguration to the two major phases according to the dimension and the size of the mesh. Tsai et al. [22] gave three O(KM), O(K) and O(1) time parallel algorithms on the reconfigurable array of processors with wider bus networks (RAPWBN), using N^{1+1/ε}, MN^{1+1/ε} and KMN^{1+1/ε} processors, each with one extra row, M rows and MN rows of bus networks, respectively, where ε is any constant and ε ≥ 1. They used these extra bus networks to configure the array and to compute some basic operations, such as multiple prefix sums; then, based on these basic operations, three parallel algorithms were derived according to the size of the array. However, the bus width of each bus network of the RAPWBN is extended to N^{1/ε} bits. Unfortunately, as the number of processors increases, the complexity of the interconnection networks grows much faster than the complexity of the processors. From a VLSI point of view, this is hard to implement because of the crosstalk problem of the extra bus networks and the complexity of the ports (pins) of each processor. Most of the existing parallel clustering algorithms stated above hold only for a very restricted class of problem sizes (for example, the number of processors must be N, NM or KNM, respectively), except that of Ref. [5]; but the time complexity of Ref. [5] is bounded by O(KM). In this paper, we present parallel clustering algorithms that solve clustering problems with any number of patterns, features and clusters, on an AROB of any size.

The mesh-connected computers (MCC) are useful for solving image processing problems [5,7,12,22] because of their simplicity and regularity in architecture. There are two drawbacks of the MCC: fixed architecture and long communication diameter. Recently, reconfigurable networks have received much attention from researchers because they can reduce the drawbacks of the MCC [2,7,23]. Unfortunately, exclusive access to the bus resources limits the throughput of end-to-end communication. Optical interconnections may provide an ultimate solution to this problem [4,8,9,14,18,19]. Due to unidirectional signal propagation and the predictable per-unit-length delay of the signal, optical buses enable synchronized concurrent access in a pipelined fashion. Pipelined optical buses can support massive amounts of data transfer simultaneously and realize various communication patterns in constant time.

The array with a reconfigurable optical bus system is defined as an array of processors connected to a reconfigurable optical bus system whose configuration can be dynamically changed by setting the local switches of each processor; messages can be transmitted concurrently on a bus in a pipelined fashion. More recently, two related models have been proposed, namely the array with reconfigurable optical buses (AROB) [16] and the linear array with a reconfigurable pipelined bus system (LARPBS) [13,14]. A major difference between them is that counting is not permitted in the LARPBS model during a bus cycle, whereas it is allowed in the AROB model. Many algorithms have been proposed for these two related models [9,15,17,18,20]. These indicate that arrays with a reconfigurable optical bus system are very efficient for parallel computation due to the high bandwidth and flexibility within a reconfigurable optical bus system.

The main contributions of this paper are the design of efficient and scalable algorithms for clustering with varying degrees of parallelism on the AROB model. We first review several basic operations that have been proposed on the AROB. Then, based on these operations and the power of the reconfigurable optical bus, several parallel squared error clustering algorithms are developed:

• an O((K/p) log N) time algorithm using p × M × N/log N processors;
• an O((KNM/pqr) + log r + log q) time algorithm using p × q × r processors;
• an O(K/p) time algorithm using p × N^{1+1/ε} × M processors, for some constant ε and ε ≥ 1;
• an O(1) time algorithm using K × N^{1+1/ε} × M processors;
• an O(K) time algorithm using N^{1+1/ε} × M processors;
• an O(KM) time algorithm using N^{1+1/ε} processors.

Compared to the results in the literature [5,7,10,21,22], our results are more scalable, feasible and flexible in both time complexity and hardware complexity. In particular, our algorithms can communicate and manipulate data quite well on the AROB model. Based on both the pipelined data transmission ability and the reconfigurability of the AROB, several scalable algorithms such as binary sum, minimum finding and data summation are first derived; a constant time integer sum algorithm is also derived. Then, using these basic operations, the derived parallel squared error clustering algorithms are highly scalable, feasible and flexible in all respects.

The remainder of this paper is organized as follows. We give a brief introduction to the AROB computation model in Section 2. Section 3 describes some basic operations that will be used in the parallel clustering algorithms. Section 4 gives the definitions and notations of the squared error clustering upon which our algorithms are based. Section 5 presents the parallel squared error clustering algorithms. Finally, some concluding remarks are included in Section 6.

2. The computation model

The AROB model is essentially a mesh using the basic structure of a classical reconfigurable network (RN) [2] and optical technology. A linear AROB (LAROB) of size N contains N processors connected to the optical bus with two couplers. One is used to write data on the upper (transmitting) segment of the bus and the other is used to read data from the lower (receiving) segment of the bus. That is, it extends the capabilities of the linear array with pipelined optical buses (APPB) [4] by permitting each processor to connect to the bus through a pair of switches. Each processor, with a local memory, is identified by a unique index denoted as P_{i0}, 0 ≤ i0 < N, and each switch can be set to cross or straight by the local processor. Each processor uses a set of control registers to store the information needed to control the transmission and reception of messages by that processor. An example of a linear AROB of size 5 is shown in Fig. 1(a). Two interesting switch configurations derivable from a processor of an LAROB are also shown in Fig. 1(b).

A petit cycle (τ) is defined as the time needed for a pulse to traverse the optical distance between two consecutive processors on the bus. A bus cycle (σ) is defined as the end-to-end propagation delay of the messages on the optical bus, i.e. the time needed to traverse the entire optical bus. Then the bus cycle σ = 2Nτ, where N is the number of processors in the array. A unit delay is defined to be the spatial length of a single optical pulse, shown as a loop in Fig. 1(a), which may introduce a time-slot delay between two processors on the receiving segments. The optical switches are used for reconfiguration.

A 2-D AROB of size M × N, denoted as a 2-D M × N AROB, contains M × N processors arranged in a 2-D grid. Each processor is identified by a unique 2-tuple index (i1, i0), 0 ≤ i1 < M, 0 ≤ i0 < N. The processor with index (i1, i0) is denoted by P_{i1,i0}. Each processor has four I/O ports, denoted by −S_j, +S_j, 0 ≤ j < 2, connected to a reconfigurable optical bus system. The interconnection among the four ports of a processor can be reconfigured during the execution of algorithms. Thus, multiple arbitrary linear arrays like the LAROB can be formed within a 2-D AROB. The two terminal processors located at the end points of a constructed LAROB may serve as the leader processors (similar to P₀ in Fig. 1(a)). The relative position of any processor on a bus to which it is connected is its distance from the leader processor. For more details on the AROB, see Ref. [16]. An example of a 2-D 4 × 4 AROB and the 10 allowed switch configurations are shown in Fig. 2.

For a unit of time, assume each processor can either perform arithmetic and logic operations or communicate with others on a bus. Since the bus cycle length can be considered to be O(σ), we assume that it is compatible with the computation time of any arithmetic or logic operation. Multiple processors may broadcast data on different buses, or broadcast the same data on the same bus, simultaneously in one time unit if there is no collision. Let var(i) and array(i)[k] denote a local variable var and an array variable array[k] in the processor with index i. For example, sum(0) and quota(0)[k] are a local variable sum and an array variable quota[k] of processor P₀.


Fig. 1. (a) An LAROB of size 5. (b) The switch states.

Fig. 2. (a) A 4 × 4 AROB. (b) The allowed switch configurations.



3. Basic operations

Several data operations that can be performed quickly on the AROB are described in this section. These data operations will be used in Section 5 to design efficient algorithms for clustering. For the sake of completeness, we first briefly outline the results in the following.

Lemma 1 [17]. Given N Boolean values, each either 0 or 1, the binary prefix sums of these N values can be computed in O(1) time on an N LAROB.

Lemma 2 [18]. Given N integers each of size O(log N) bits, the maximum/minimum of these N integers can be found in O(1) time on an N LAROB if counting is allowed during a bus cycle.

Lemma 3 [11]. Given N integers or normalized real numbers each of size O(log N) bits, these N numbers can be added by the bus split technique in O(log N) time on an N LAROB.
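For reference, the following short Python sketch (all names ours, chosen for illustration) pins down the input/output behaviour of these three primitives on ordinary arrays. It is a sequential simulation only, and does not model how the optical bus attains the stated O(1) and O(log N) bounds.

def binary_prefix_sums(bits):
    # Lemma 1: prefix sums of N Boolean values (O(1) on an N LAROB).
    out, running = [], 0
    for b in bits:
        running += b
        out.append(running)
    return out

def minimum(values):
    # Lemma 2: minimum of N O(log N)-bit integers (O(1) on an N LAROB).
    return min(values)

def summation(values):
    # Lemma 3: sum of N numbers by bus splitting (O(log N) on an N LAROB).
    return sum(values)

assert binary_prefix_sums([1, 0, 1, 1]) == [1, 1, 2, 3]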

In the real world, the number of processors available is not always enough for the application. If the number of processors available is p, 1 ≤ p ≤ N, then each processor holds an array with O(N/p) data items to be processed. Therefore, Lemmas 1–3 can be modified to run on a p LAROB. When only p processors are available, assume that data item a_i, 0 ≤ i < N, is stored in the local variable a(i mod p)[⌊i/p⌋] of processor P_{i mod p}; i.e. processor P_j holds data items a_j, a_{j+p}, …, a_{j+N−p}. The binary sum operation can be performed by first computing the parallel binary prefix sums over the data allocated to all p processors, each from its first data item to its last, using Lemma 1. This takes O(N/p) time. Then, processor P_{p−1} sums up the partial sums obtained previously, which takes O(N/p) time sequentially. Thus, the binary sum operation can be done in O(N/p) time on an LAROB with p processors. The maximum/minimum operation can also be done in O(N/p) time on an LAROB with p processors by a similar method. To add N numbers (integers or reals) on a p LAROB, each processor first sums up its O(N/p) numbers locally; then Lemma 3 is applied to sum the p partial sums in O(log p) time. According to the above description, we have the following corollaries; a small simulation of this two-level scheme is sketched after Corollary 3.

Corollary 1. Given N Boolean values, each either 0 or 1, these N values can be added in O(N/p) time on a p LAROB.

Corollary 2. Given N integers each of size O(log N) bits, the maximum/minimum of these N integers can be found in O(N/p) time on a p LAROB if counting is allowed during a bus cycle.

Corollary 3. Given N integers or normalized real numbers each of size O(log N) bits, these N numbers can be added in O(N/p + log p) time on a p LAROB.
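The data layout and the two-level summation just described can be simulated with a few lines of Python. This is a sketch under the stated distribution (processor P_j holds a_j, a_{j+p}, …); the parallel steps are executed sequentially here, so only the structure, not the O(N/p + log p) bound, is visible.

def blocked_sum(a, p):
    # Corollary 3 layout: data item a_i lives on processor P_{i mod p},
    # so processor j holds a_j, a_{j+p}, ..., a_{j+N-p}.
    partial = [sum(a[j::p]) for j in range(p)]   # O(N/p): local sums, concurrent on the machine
    # Lemma 3 combine: add the p partial sums in O(log p) rounds
    # (bus splitting on the machine; a halving tree here).
    while len(partial) > 1:
        mid = (len(partial) + 1) // 2
        partial = [partial[i] + (partial[i + mid] if i + mid < len(partial) else 0)
                   for i in range(mid)]
    return partial[0]

assert blocked_sum(list(range(100)), p=8) == sum(range(100))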

Recently, Wu et al. [24] showed that the summation of N O(log N)-bit numbers can be computed in constant time on a 1-D N^{1+1/ε} AROB for some fixed ε and ε ≥ 1. Since the summation operation is crucial for designing our clustering algorithms, we now describe the algorithm in detail.

3.1. Summation

Given N integers a_i with 0 ≤ a_i < N, 0 ≤ i < N, the summation of these N integers can be defined as

sum = Σ_{i=0}^{N−1} a_i.   (1)

For computing Eq. (1), Pavel and Akl [17] proposed an O(1) time algorithm on a 2-D N × log N AROB. In the following, we use another approach to design a more flexible algorithm for this problem on a 1-D AROB using N^{1+1/ε} processors, where ε is a constant and ε ≥ 1. Since a_i < N and 0 ≤ i < N, each digit has a value ranging from 0 to ν − 1 in the radix-ν system, and a ν-ary representation …m₃m₂m₁m₀ is equal to m₀ν⁰ + m₁ν¹ + m₂ν² + m₃ν³ + …. The maximum of sum is at most N(N − 1). With this approach, a_i and sum are equivalent to

a_i = Σ_{k=0}^{T−1} m_{i,k} ν^k,   (2)

sum = Σ_{l=0}^{U−1} S_l ν^l,   (3)

where T = ⌊log_ν N⌋ + 1, 0 ≤ i < N, U = ⌊log_ν N(N − 1)⌋ + 1, and 0 ≤ m_{i,k}, S_l < ν.

As sum = Σ_{i=0}^{N−1} Σ_{k=0}^{T−1} m_{i,k} ν^k = Σ_{k=0}^{T−1} Σ_{i=0}^{N−1} m_{i,k} ν^k, let d_k be the sum of the N coefficients m_{i,k}, 0 ≤ i < N, which is defined as

d_k = Σ_{i=0}^{N−1} m_{i,k},   (4)

where 0 ≤ k < T. Then sum can also be formulated as

sum = Σ_{k=0}^{T−1} d_k ν^k,   (5)

where 0 ≤ d_k < νN.

Let C₀ = 0 and d_u = 0, T ≤ u < U. The relationship between Eqs. (3) and (5) is described by Eqs. (6)–(8):

e_t = C_t + d_t, 0 ≤ t < U,   (6)

C_{t+1} = e_t div ν, 0 ≤ t < U,   (7)

S_t = e_t mod ν, 0 ≤ t < U,   (8)

where e_t is the sum at the t-th digit position and C_t is the carry into the t-th digit position. Hence, S_t of Eq. (8) corresponds to a coefficient of sum in Eq. (3) under the radix-ν system. Since the carry into the t-th digit position of e_t is not greater than N, we have C_t ≤ N, 0 ≤ t < U. Since sum ≤ N(N − 1), the number of digits representing sum under radix ν is not greater than U, where U = ⌊log_ν N(N − 1)⌋ + 1. Therefore, instead of computing Eq. (1) directly, we first compute the coefficients m_{i,k} for each a_i. Then each S_t can be computed by Eqs. (4) and (6)–(8). Finally, sum can be computed by Eq. (3). For the sake of readability, let P′_{i,j}, 0 ≤ i < N, 0 ≤ j < ν, denote the 2-D logical processor corresponding to the 1-D physical processor P_{i×ν+j}. The sum of N integers, each of size (log N) bits, can be computed in constant time on a 1-D νN AROB. The detailed algorithm (SUMI) is described in the following.

Procedure SUMI(a, sum)
1: Processor P′_{i,0}, 0 ≤ i < N, copies a(i) to processor P′_{i,j}, 0 ≤ j < ν.
2: Processor P′_{i,ν−1}, 0 ≤ i < N, sets q(i, ν−1)[0] = 0.
3: for k = 0 to U − 1 do Steps 3.1–3.6
3.1: Processor P′_{i,j}, 0 ≤ i < N, 0 ≤ j < ν, computes m(i, j)[k] (i.e. m_{i,k} of Eq. (2)) from a(i) by using Eq. (2).
3.2: Processor P′_{i,j}, 0 ≤ i < N, 0 ≤ j < ν, sets b(i, j) = 1 if j < m(i, j)[k]; b(i, j) = 0, otherwise.
3.3: Processor P′_{i,ν−1}, 0 ≤ i < N, sets b(i, ν−1) = q(i, ν−1)[k].
3.4: Apply Lemma 1 to compute the binary prefix sums of b(i, j), 0 ≤ i < N, 0 ≤ j < ν, and store the result in the local variable e(i, j)[k] of processor P′_{i,j}.
3.5: Processor P′_{N−1,ν−1} computes C(N−1, ν−1)[k+1] (i.e. C_{k+1} of Eq. (7)) and S(N−1, ν−1)[k] (i.e. S_k of Eq. (8)) from e(N−1, ν−1)[k] (i.e. e_k of Eq. (6)) by using Eqs. (7) and (8), respectively.
3.6: Processor P′_{i,ν−1}, 0 ≤ i < N, sets q(i, ν−1)[k+1] = 1 if i < C(N−1, ν−1)[k+1]; q(i, ν−1)[k+1] = 0, otherwise.
4: P′_{N−1,ν−1} computes the final sum from S(N−1, ν−1)[k], 0 ≤ k < U, by using Eq. (3).
5: Processor P′_{N−1,ν−1} copies sum to processor P′_{0,0}.
End {SUMI}

Lemma 4. Given N integers each of size O(log N) bits, these N integers can be added in O(U) time on a 1-D νN AROB for ν ≥ 2, U = ⌊log_ν N(N − 1)⌋ + 1.

Proof. The correctness of procedure SUMI follows directly from Lemma 1 and Eqs. (1)–(8). The time complexity is analyzed as follows. Steps 1 and 2 each take O(1) time. Each iteration of Step 3 takes O(1) time, and the loop is repeated U times; hence, the total time complexity of Step 3 is O(U). Finally, Steps 4 and 5 each take O(U) time. Hence, the total time complexity of the proposed algorithm is O(U). □
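The digit-serial arithmetic of SUMI can be checked with a short sequential emulation in Python: decompose each a_i into radix-ν digits, add one digit position per iteration together with the incoming carry (Eqs. (4) and (6)–(8)), and reassemble sum by Eq. (3). The helper below is our own illustrative emulation (ν is written v in code); on the AROB, each per-iteration digit sum d_k is delivered in O(1) time by the binary prefix sums of Lemma 1, which the emulation replaces with an ordinary loop.

import math

def sumi_emulation(a, v):
    # Sequential emulation of procedure SUMI in radix v (v >= 2).
    n = len(a)
    U = int(math.log(n * (n - 1), v)) + 1 if n > 1 else 1   # digits of sum
    S, carry = [], 0
    for k in range(U):                            # Step 3 runs U iterations
        d_k = sum((x // v ** k) % v for x in a)   # Eq. (4): digit sum d_k
        e_k = carry + d_k                         # Eq. (6)
        carry = e_k // v                          # Eq. (7): carry onward
        S.append(e_k % v)                         # Eq. (8): k-th digit of sum
    return sum(s * v ** k for k, s in enumerate(S))   # Eq. (3)

a = [7, 0, 5, 3, 6, 1, 2, 4]          # N = 8 integers with 0 <= a_i < N
assert sumi_emulation(a, v=4) == sum(a)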

Procedure SUMI is quite efficient. For example, if N = 2³² and ν = 32, then U = 13 only. For simplicity, assume ν = N^{1/ε}, where ε is a constant and ε ≥ 1. Then U = ⌊log_ν N(N − 1)⌋ + 1 = ⌊2ε⌋ + 1, also a constant. This leads to the following corollary.

Corollary 4. Given N integers each of size O(log N) bits, these N integers can be added in O(1) time on a 1-D N^{1+1/ε} AROB for some fixed constant ε and ε ≥ 1.

Procedure SUMI can be modified to handle signed integers and real numbers, respectively. The case of signed integers is easily derived, so we describe only the real-number case in the following. Assume that a real number is represented in normalized floating point, consisting of two parts, a mantissa and an exponent, each of O(log N) bits. Then the sum can be computed by the following four steps:

Step 1: Find the maximum exponent among the N real numbers by Lemma 2.
Step 2: Align all real numbers to the maximum exponent.
Step 3: Compute the addition of the N aligned mantissas by Corollary 4.
Step 4: Normalize the final sum.

The time complexity is analyzed as follows. Steps 1 and 2 each take O(1) time using N processors. Step 3 takes O(1) time using N^{1+1/ε} processors for a constant ε and ε ≥ 1. Step 4 takes O(1) time using a single processor. This leads to the following corollary.

Corollary 5. Given N normalized real numbers each of size O(log N) bits, these N numbers can be added in O(1) time on a 1-D N^{1+1/ε} AROB for some fixed ε and ε ≥ 1.
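As a concrete trace of these four steps, the sketch below emulates them on reals given as (mantissa, exponent) pairs with value mantissa · 2^exponent. The representation, the function name and the fixed precision are our illustrative assumptions, not part of the paper; note how Step 2 may truncate low-order bits of the smaller operands, which is the usual floating-point alignment loss.

def add_normalized(nums, prec=24):
    # nums: list of (mantissa, exponent) pairs, value = mantissa * 2**exponent.
    # Step 1: find the maximum exponent (Lemma 2 on the machine).
    emax = max(e for _, e in nums)
    # Step 2: align every mantissa to the maximum exponent.
    aligned = [m >> (emax - e) for m, e in nums]
    # Step 3: add the N aligned mantissas (Corollary 4 on the machine).
    total = sum(aligned)
    # Step 4: renormalize the result to at most prec mantissa bits.
    e = emax
    while total.bit_length() > prec:
        total >>= 1
        e += 1
    return total, e

# 3*2**2 + 5*2**0 = 17; alignment truncates 5 to 1*2**2, so the result is (4, 2) = 16.
print(add_normalized([(3, 2), (5, 0)]))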

4. Definitions and notations

Based on an initial partition of the patterns, the idea of squared error clustering is to alter the cluster memberships of the patterns to obtain a better partition. In the process of clustering, the patterns are grouped so that each has the largest similarity to (is closest to) a cluster center. There are several methods used for squared error clustering [1]. We concentrate on implementing a general method in this paper.

We follow the notations used in Ref. [12]. Let A = [a_{i,j}], 0 ≤ i < N, 0 ≤ j < M, be a pattern matrix of size N × M that consists of N patterns each with M features. In general, N is in the range of hundreds and M < 30. Let L = [l_i], 0 ≤ i < N, 0 ≤ l_i < K, be a vector of size 1 × N, where l_i is the label of pattern a_i. Also let S = {S₀, S₁, …, S_{K−1}} be a set of K partitions derived from the N patterns, where each S_i is called a cluster and each pattern belongs to exactly one cluster. Thus, cluster S_k can be defined as

S_k = { i | l_i = k, 0 ≤ i < N },   (9)

where 0 ≤ l_i < K, and K, the number of clusters, is a user-defined parameter. That is, the patterns with label k are put into cluster S_k.

The center of each cluster S_k, 0 ≤ k < K, denoted as c_{k,j}, 0 ≤ k < K, 0 ≤ j < M, is the mean matrix of the member patterns, and it can be expressed as

c_{k,j} = (1/|S_k|) Σ_{i∈S_k} a_{i,j}, 0 ≤ i < N, 0 ≤ j < M, 0 ≤ k < K,   (10)

where |S_k| denotes the number of patterns belonging to cluster S_k (its cardinality or size).

The dissimilarity between a pattern and a cluster is measured by the squared Euclidean distance (L₂ metric); it can also be extended to operate on any L_α metric for 0 ≤ α < ∞. That is, the dissimilarity between the pattern a_i and the cluster S_k is defined as

d²_{i,k} = Σ_{j=0}^{M−1} (a_{i,j} − c_{k,j})², 0 ≤ i < N, 0 ≤ k < K.   (11)


Fig. 3. An illustration of algorithm PSECA: (a) initialization; (b) after Step 1; (c) after Step 2 for k = 0; (d) after Step 2 for k = 1; (e) after Step 3; (f) after Step 4 for k = 0; (g) after Step 4 for k = 1; (h) after Step 2 for k = 0; (i) after Step 2 for k = 1; the final result.



Then the squared error for cluster S_k is defined as

e²_k = Σ_{i∈S_k} d²_{i,k}, 0 ≤ i < N, 0 ≤ k < K.   (12)

Finally, the squared error for the clustering is

E²_K = Σ_{k=0}^{K−1} e²_k.   (13)

For a given K, the goal of squared error clustering is to partition the N patterns into K clusters S₀, S₁, …, S_{K−1} such that E²_K is minimal. The proper value of K is problem dependent, usually K ≤ 20. It is difficult to find a suitable value of K for the given patterns without prior information. In practice, we can repeatedly try several different values of K and keep the best result (the one with minimal E²_K). However, this method has two serious disadvantages: the final result is influenced by the choice of the initial seed points, and it is possible to end up in a locally optimal grouping. It is also difficult to determine the best initial cluster centers for a given problem. One way to overcome this problem is to run the clustering algorithm with several different initial seed patterns. In this paper, we set the first K patterns as the initial cluster centers.

There are two major phases in squared error clustering, namely partition adjusting and cluster center updating. The partition adjusting phase associates each pattern with a suitable label according to its dissimilarity to the K cluster centers (Eq. (11)), in an attempt to reduce the squared error (Eq. (13)). For the pattern a_i, 0 ≤ i < N, its new label will be

l_i = k, if d²_{i,k} = min_{0≤k′<K} { d²_{i,k′} }.   (14)

The cluster center updating phase updates the cluster centers by averaging their member patterns whenever the label reassignment has changed the membership of some clusters. The two phases are iterated until the partition no longer changes, or until a maximum number of iterations is reached. We refer to an iteration as a pass in the following.
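Before turning to the parallel algorithms, it may help to see one pass written out sequentially. The following Python sketch follows Eqs. (10), (11) and (14) directly (all names are ours, for illustration) and is the O(KMN) baseline that Section 5 parallelizes.

def one_pass(patterns, centers):
    # Partition adjusting phase: give each pattern the label of its
    # nearest center (Eqs. (11) and (14)), ties broken by smallest k.
    K, M = len(centers), len(patterns[0])
    labels = []
    for a in patterns:
        d2 = [sum((a[j] - c[j]) ** 2 for j in range(M)) for c in centers]
        labels.append(d2.index(min(d2)))
    # Cluster center updating phase: each non-empty cluster's center
    # becomes the mean of its member patterns (Eq. (10)).
    for k in range(K):
        members = [a for a, l in zip(patterns, labels) if l == k]
        if members:
            centers[k] = [sum(a[j] for a in members) / len(members) for j in range(M)]
    return labels, centers

Repeating this pass until no label changes implements the iteration-to-convergence just described.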

5. Parallel squared error clustering algorithms

In this section, we shall present six efficient parallel squared error clustering algorithms on the AROB. Based on the technique described in the previous section, one pass of the squared error clustering algorithm takes O(KMN) time on a uniprocessor computer. We first design an efficient parallel algorithm for one-pass clustering on a 3-D AROB of size p × N × M, for 1 ≤ p ≤ K. As mentioned in Section 2, each processor P_{i2,i1,i0} is identified by a unique index, where 0 ≤ i2 < p, 0 ≤ i1 < N, 0 ≤ i0 < M. The input of this algorithm is a pattern matrix whose entries are integers or real numbers, and the output is a cluster label for each pattern indicating the cluster to which it belongs.

The high-level description of the implementation of this algorithm is stated in the following four phases:

Phase 1. Initial partition: assign the first K patterns as the initial cluster centers.
Phase 2. Partition adjusting: adjust the partition by assigning each pattern to the cluster in which it has the minimum squared distance.
Phase 3. Convergence check: the algorithm halts when no patterns are reassigned from one cluster to another.
Phase 4. Cluster center updating: update the center of each cluster by averaging the features of its member patterns.

A 3-D AROB of size p × N × M can be regarded as p 2-D AROBs, each of size N × M. Basically, each 2-D AROB is responsible for the computation of one cluster. Since there are K clusters and p ≤ K, Phases 2 and 4 are executed K/p times. Assume that the pattern matrix a_{i1,i0} is initially stored in the local variable a(0, i1, i0) of processor P_{0,i1,i0}, 0 ≤ i1 < N, 0 ≤ i0 < M, one item per processor. The final label of the i1-th pattern is stored in the local variable l(0, i1, 0) of processor P_{0,i1,0}, 0 ≤ i1 < N. The detailed algorithm (PSECA) is presented in the following. Assume there are seven patterns (N = 7), each with two features (M = 2), denoted as (0, 0), (0, 4), (1, 1), (1, 3), (2, 2), (3, 0) and (3, 5); that is, we use coordinates as features. Also assume that the number of clusters is four (K = 4) and p = 2. Based on these assumptions, an illustration of algorithm PSECA is shown in Fig. 3.

Algorithm PSECA

1: // Initial partition phase. //
1.1: Processor P_{0,i1,i0}, 0 ≤ i1 < N, 0 ≤ i0 < M, copies a(0, i1, i0) to a(i2, i1, i0) for 0 ≤ i2 < p, using i2-dimensional optical buses.
1.2: Processor P_{i2,i1,0}, 0 ≤ i2 < p, 0 ≤ i1 < N, sets l(i2, i1, 0) = nl(i2, i1, 0) = i1 if 0 ≤ i1 < K; l(i2, i1, 0) = nil, otherwise.
1.3: // Set the first K patterns as the initial cluster centers. //
Processor P_{i2,i1,i0}, 0 ≤ i2 < p, 0 ≤ i1 < K, 0 ≤ i0 < M, sets c(i2, i1, i0) = a(i2, i1, i0).

2: // Partition adjusting phase. //
repeat Steps 2.1 and 2.2 from k = 0 to (K/p) − 1
2.1: // Compute Eq. (11). //
2.1.1: Processor P_{i2,kp+i2,i0}, 0 ≤ i2 < p, 0 ≤ i0 < M, copies c(i2, kp + i2, i0) to c′(i2, i1, i0) for 0 ≤ i1 < N, using i1-dimensional optical buses.
2.1.2: // Compute the dissimilarity between the cluster center kp + i2 and the pattern i1 for each feature. //
Processor P_{i2,i1,i0}, 0 ≤ i2 < p, 0 ≤ i1 < N, 0 ≤ i0 < M, performs d(i2, i1, i0) ← (a(i2, i1, i0) − c′(i2, i1, i0))².
2.1.3: Processor P_{i2,i1,i0}, 0 ≤ i2 < p, 0 ≤ i1 < N, 0 ≤ i0 < M, performs d²(i2, i1, 0) ← Σ_{i0=0}^{M−1} d(i2, i1, i0), through the i0-dimensional optical buses by Lemma 3.
2.2: // Compute Eq. (14); initially, set MIN(0, i1, 0) = ∞ for 0 ≤ i1 < N. //
2.2.1: // Without loss of generality, if more than one item is suitable, we select the one with the smallest index. //
Processor P_{i2,i1,0}, 0 ≤ i2 < p, 0 ≤ i1 < N, applies Lemma 2 to compute MIN_d(0, i1, 0) = min{d²(i2, i1, 0), for 0 ≤ i2 < p} through the i2-dimensional optical buses, then sets ml(0, i1, 0) ← kp + m, where m is the index of MIN_d(0, i1, 0).
2.2.2: If MIN_d(0, i1, 0) < MIN(0, i1, 0) then set MIN(0, i1, 0) = MIN_d(0, i1, 0) and nl(i2, i1, 0) = ml(0, i1, 0), for 0 ≤ i2 < p, 0 ≤ i1 < N.
3: // Convergence check phase. //
3.1: Processor P_{0,i1,0}, 0 ≤ i1 < N, sets fl(0, i1, 0) = 1 if nl(0, i1, 0) ≠ l(0, i1, 0); fl(0, i1, 0) = 0, otherwise.
3.2: Processor P_{0,i1,0}, 0 ≤ i1 < N, computes ch(0, 0, 0) ← Σ_{i1=0}^{N−1} fl(0, i1, 0), through the i1-dimensional optical buses by Lemma 1.
3.3: If ch(0, 0, 0) > 0 then go to Step 4; otherwise, go to Step 5.

4: // Cluster center updating phase. //
4.1: Processor P_{i2,i1,0}, 0 ≤ i2 < p, 0 ≤ i1 < N, sets l(i2, i1, 0) = nl(i2, i1, 0).
4.2: repeat Steps 4.2.1–4.2.7 from k = 0 to (K/p) − 1
4.2.1: // Find the members of each cluster. //
Processor P_{i2,i1,0}, 0 ≤ i2 < p, 0 ≤ i1 < N, sets mem(i2, i1, 0) = 1 if nl(i2, i1, 0) = i2 + kp; mem(i2, i1, 0) = 0, otherwise.
4.2.2: // Compute the number of members. //
Processor P_{i2,i1,0}, 0 ≤ i2 < p, 0 ≤ i1 < N, applies Lemma 1 to compute S(i2, 0, 0) ← Σ_{i1=0}^{N−1} mem(i2, i1, 0), through the i1-dimensional optical buses.
4.2.3: Processor P_{i2,0,0}, 0 ≤ i2 < p, broadcasts S(i2, 0, 0) to S(i2, i1, i0), for 0 ≤ i1 < N, 0 ≤ i0 < M, using i1-dimensional and i0-dimensional optical buses.
4.2.4: // Member's property. //
Processor P_{i2,i1,i0}, 0 ≤ i2 < p, 0 ≤ i1 < N, 0 ≤ i0 < M, sets mem_p(i2, i1, i0) = a(i2, i1, i0) if nl(i2, i1, 0) = i2 + kp; mem_p(i2, i1, i0) = 0, otherwise.
4.2.5: Processor P_{i2,i1,i0}, 0 ≤ i2 < p, 0 ≤ i1 < N, 0 ≤ i0 < M, applies Lemma 3 to compute sum(i2, 0, i0) = Σ_{i1=0}^{N−1} mem_p(i2, i1, i0), through the i1-dimensional optical buses.
4.2.6: Processor P_{i2,0,i0}, 0 ≤ i2 < p, 0 ≤ i0 < M, broadcasts sum(i2, 0, i0) to sum(i2, i1, i0) for 0 ≤ i1 < N, using i1-dimensional optical buses.
4.2.7: // Update the center of each cluster. //
Processor P_{i2,i2+kp,i0}, 0 ≤ i2 < p, 0 ≤ i0 < M, computes c(i2, i2 + kp, i0) = sum(i2, i2 + kp, i0) / S(i2, i2 + kp, i0).
4.3: Go to Step 2.
5: Processor P_{0,i1,0}, 0 ≤ i1 < N, copies nl(0, i1, 0) to l(0, i1, 0).

Theorem 1. Given N patterns each with M features and a fixed K (usually, M ≪ N and K ≪ N), one pass of algorithm PSECA can be computed in O((K/p) log N) time on a p × N × M AROB.

Proof. The correctness of this algorithm follows directly from Eqs. (9)–(14). The time complexity is analyzed as follows. Steps 1.1–1.3 each take O(1) time, so the time complexity of Step 1 is O(1). Steps 2.1.1 and 2.1.2 each take O(1) time. Step 2.1.3 takes O(log M) time by Lemma 3, so the time complexity of Step 2.1 is O(log M). Step 2.2.1 takes O(1) time by Lemma 2, and Step 2.2.2 takes O(1) time, so Step 2.2 takes O(1) time. Steps 2.1 and 2.2 are repeated K/p times. Hence, the total time complexity of Step 2 is O((K/p) log M). Steps 3.1 and 3.3 each take O(1) time; Step 3.2 takes O(1) time by Lemma 1. So the time complexity of Step 3 is O(1). Step 4.1 takes O(1) time. Steps 4.2.1, 4.2.3, 4.2.4, 4.2.6 and 4.2.7 each take O(1) time. Step 4.2.2 takes O(1) time by Lemma 1. Step 4.2.5 takes O(log N) time by Lemma 3. These seven substeps are executed K/p times, so the time complexity of Step 4.2 is O((K/p) log N). Step 4.3 takes O(1) time. Hence, the total time complexity of Step 4 is O((K/p) log N). Step 5 takes O(1) time. Therefore, the total time complexity of one pass of the proposed algorithm is O((K/p) log N). □
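The arithmetic of PSECA can be traced on the paper's running example (N = 7, M = 2, K = 4, p = 2) with the sequential Python emulation below (our own sketch, not the AROB implementation). It merges the per-group minima exactly as Step 2.2 does (smallest index wins ties) and re-averages centers as in Step 4, so its final labeling can be compared against Fig. 3(i); the broadcasts and the Lemma 1–3 reductions that make each round fast on the AROB are simply collapsed into Python loops here.

patterns = [(0, 0), (0, 4), (1, 1), (1, 3), (2, 2), (3, 0), (3, 5)]   # N = 7, M = 2
K = 4                                       # clusters; p = 2 groups handle K/p each
centers = [list(a) for a in patterns[:K]]   # Phase 1: first K patterns as centers
labels = [None] * len(patterns)

while True:
    # Phase 2: partition adjusting. The K distances per pattern are really
    # computed by p groups over K/p rounds; merging their minima with
    # smallest-index tie-breaking (Step 2.2) equals one global argmin.
    new_labels = [min(range(K),
                      key=lambda k, a=a: sum((a[j] - centers[k][j]) ** 2
                                             for j in range(2)))
                  for a in patterns]
    # Phase 3: convergence check (Step 3 sums the change flags).
    if new_labels == labels:
        break
    labels = new_labels
    # Phase 4: cluster center updating (Steps 4.2.1-4.2.7).
    for k in range(K):
        members = [a for a, l in zip(patterns, labels) if l == k]
        if members:
            centers[k] = [sum(a[j] for a in members) / len(members) for j in range(2)]

print(labels)   # final cluster label of each pattern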

Practically, the number of processors available in the system is not always enough for all applications. If the number of processors available in the system is pqr, 1 ≤ q ≤ N and 1 ≤ r ≤ M, then each processor holds an array with O(KNM/pqr) data items to be processed. Therefore, the minimum operation used in Step 2.2.1 runs in O(KNM/pqr) time applying Corollary 2; the number of members (Step 4.2.2) is computed in O(KNM/pqr) time applying Corollary 1; and the summation operations used in Steps 2.1.3 and 4.2.5 run in O((KNM/pqr) + log r) and O((KNM/pqr) + log q) time, respectively, applying Corollary 3. This implies that the time complexity of the highly scalable algorithm PSECA becomes O((KNM/pqr) + log r + log q). Note that for a single processor (p = q = r = 1), the time complexity is O(KMN), the same as that of the sequential algorithm. The best time complexity, O(log N), is achieved when p = K, q = N and r = M. Especially, if the number of processors is pNM/log N (i.e. q = N/log N and r = M), then the time complexity is

O((K log N)/p + log(N/log N)) = O((K/p) log N).

That is, the number of processors can be reduced by a factor of O(log N) and the proposed algorithm is cost optimal.

The time complexity of the proposed algorithm can be improved by increasing the number of processors. The time complexity of algorithm PSECA is dominated by the summation operations in Steps 2.1.3 and 4.2.5. For these steps, we can apply Corollary 5 to sum up the M squared distance terms and the N patterns in O(1) time using p × N^{1+1/ε} × M processors, for some fixed ε and ε ≥ 1. Hence, the time complexity of the proposed algorithm can be reduced from O((K/p) log N) to O(K/p) by increasing the number of processors to p × N^{1+1/ε} × M. Furthermore, if the number of processors is increased to K × N^{1+1/ε} × M (i.e. p = K), then an O(1) time algorithm can be obtained. On the other hand, if the number of processors is reduced to N^{1+1/ε} × M (i.e. p = 1) or N^{1+1/ε} (i.e. p = M = 1), then O(K) time and O(KM) time algorithms can also be obtained, respectively. The results for various system sizes are listed in Table 1.

6. Concluding remarks

Clustering is a valuable tool for data analysis. In most applications, such as pattern recognition, the number of patterns may be very large, and clustering them is computationally intensive. This problem can be overcome by using parallel algorithms running on multiprocessor computers. To highly parallelize the squared error clustering algorithm on distributed memory systems, a more powerful communication mechanism is required. It is clear that all existing static networks with electronic interconnections have limited communication capability in supporting fast parallel clustering algorithms. In this paper, by integrating the advantages of both optical transmission and electronic computation, we have designed several efficient parallel algorithms on the AROB for squared error clustering.

If the system size s = pqr is less than the problem size I = KNM, the time complexity may be represented as O(T(I)/s + T_o(I, s)), where T(I) is the time complexity of the sequential clustering algorithm and T_o(I, s) is the overall overhead of a parallel implementation. A parallel implementation is scalable in the range [1…S] if linear speedup can be achieved for all 1 ≤ s ≤ S. A highly scalable parallel implementation means that the sequential algorithm can be highly parallelized and the overhead in parallelization is small (i.e. S is as large as Θ(T(I)/(T*(I)(log I)^k)), where k ≥ 0 is a constant [9]). According to this definition, we have obtained a highly scalable parallelization of the clustering algorithm on the AROB model. Therefore, the proposed algorithms are faster and more elegant than the previously known results on various parallel computation models. The comparison is summarized in Table 1.


Table 1
Summary of comparison results for parallel partitional clustering

Reference              Architecture   Bus width         Extra row bus network   Processors         Time complexity
Li and Fang [10]       Hypercube      O(log N)-bit      0                       MN                 O(K log MN)
Hwang et al. [5]       OMP            O(log N)-bit      0                       p                  O(KM + KMN/p)
Ranka and Sahni [21]   Hypercube      O(log N)-bit      0                       MN                 O(K + log MN)
Jenq and Sahni [7]     RMESH          O(log N)-bit      0                       N                  O(KM + K log N)
                                                        0                       MN                 O(K log MN)
                                                        0                       KMN                O(M + log KMN)
Tsai et al. [22]       RAPWBN         O(N^{1/ε})-bit    1                       N^{1+1/ε}          O(KM)
                                                        M                       MN^{1+1/ε}         O(K)
                                                        MN                      KMN^{1+1/ε}        O(1)
This paper             AROB           O(log N)-bit      0                       pMN/log N          O((K/p) log N)
                                                        0                       pMN^{1+1/ε}        O(K/p)
                                                        0                       MN^{1+1/ε}         O(K)
                                                        0                       KMN^{1+1/ε}        O(1)
                                                        0                       N^{1+1/ε}          O(KM)
                                                        0                       KMN/log N          O(log N)
                                                        0                       pqr                O((KNM/pqr) + log r + log q)



References

[1] M.R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.

[2] Y. Ben-Asher, D. Peleg, R. Ramaswami, A. Schuster, The power of reconfiguration, Journal of Parallel and Distributed Computing 13 (1991) 139–153.

[4] Z. Guo, R.G. Melhem, R.W. Hall, D.M. Chiarulli, S.P. Levitan, Pipelined communications in optically interconnected arrays, Journal of Parallel and Distributed Computing 12 (3) (1991) 269–282.

[5] K. Hwang, H.M. Alnuweiri, V.K.P. Kumar, D. Kim, Orthogonal multiprocessor sharing memory with an enhanced mesh for integrated image understanding, CVGIP: Image Understanding 53 (1991) 31–45.

[6] A. Jain, R. Dubes, Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ, 1988.

[7] J.F. Jenq, S. Sahni, Reconfigurable mesh algorithms for image shrinking, expanding, clustering, and template matching, Proc. Int. Parallel Processing Symposium, 1991, pp. 208–215.

[8] S.P. Levitan, D.M. Chiarulli, R.G. Melhem, Coincident pulse technique for multiprocessor interconnection structures, Applied Optics 29 (14) (1990) 2024–2033.

[9] K. Li, Y. Pan, S.Q. Zheng, Fast and processor efficient parallel matrix multiplication algorithms on a linear array with a reconfigurable pipelined bus system, IEEE Transactions on Parallel and Distributed Systems 9 (1998) 705–720.

[10] X. Li, Z. Fang, Parallel clustering algorithms, Parallel Computing 11 (1989) 275–290.

[11] R. Miller, V.K.P. Kumar, D. Reisis, Q.F. Stout, Image computations on reconfigurable VLSI arrays, Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 1988, pp. 925–930.

[12] L.M. Ni, A.K. Jain, A VLSI systolic architecture for pattern clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-7 (1) (1985) 79–89.

[13] Y. Pan, M. Hamdi, Quicksort on a linear array with a reconfigurable pipelined bus system, Proc. Int. Symp. on Parallel Architectures, Algorithms and Networks, 1996, pp. 313–319.

[14] Y. Pan, K. Li, Linear array with a reconfigurable pipelined bus system – concepts and applications, Information Sciences – an International Journal 106 (3/4) (1998) 237–258.

[15] Y. Pan, K. Li, S.Q. Zheng, Fast nearest neighbor algorithms on a linear array with a reconfigurable pipelined bus system, Parallel Algorithms and Applications 13 (1998) 1–25.

[16] S. Pavel, S.G. Akl, On the power of arrays with reconfigurable optical bus, Proc. Int. Conf. on Parallel and Distributed Processing Techniques and Applications, 1996, pp. 1443–1454.

[17] S. Pavel, S.G. Akl, Matrix operations using arrays with reconfigurable optical buses, Parallel Algorithms and Applications 8 (1996) 223–242.

[18] S. Pavel, S.G. Akl, Integer sorting and routing in arrays with reconfigurable optical buses, International Journal of Foundations of Computer Science (Special Issue on Interconnection Networks) 9 (1) (1998) 99–120.

[19] C. Qiao, R.G. Melhem, Time-division communications in multiprocessor arrays, IEEE Transactions on Computers 42 (5) (1993) 577–590.

[20] S. Rajasekaran, S. Sahni, Sorting, selection and routing on the arrays with reconfigurable optical buses, IEEE Transactions on Parallel and Distributed Systems 8 (11) (1997) 1123–1132.

[21] S. Ranka, S. Sahni, Clustering on a hypercube multicomputer, IEEE Transactions on Parallel and Distributed Systems 2 (2) (1991) 129–137.

[22] H.R. Tsai, S.J. Horng, S.S. Tsai, S.S. Lee, T.W. Kao, C.H. Chen, Parallel clustering algorithms on a reconfigurable array of processors with wider bus networks, Proc. Int. Conf. on Parallel and Distributed Systems, 1997, pp. 630–637.

[23] B.F. Wang, G.H. Chen, Constant time algorithms for transitive closure and some related graph problems on processor arrays with reconfigurable bus systems, IEEE Transactions on Parallel and Distributed Systems 1 (10) (1990) 500–507.

[24] C.H. Wu, S.J. Horng, H.R. Tsai, Template matching on arrays with reconfigurable optical buses, Proc. Int. Symp. on Operations Research and its Applications, 1998, pp. 127–141.


