



Journal of Parallel and Distributed Computing 60, 1137–1153 (2000)

Efficient Parallel Algorithms for Hierarchical Clustering on Arrays with Reconfigurable Optical Buses¹

Chin-Hsiung Wu

Department of Information Management, Chinese Naval Academy, Kaohsiung, Taiwan, R.O.C.

Shi-Jinn Horng

Department of Electrical Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, R.O.C.

E-mail: horng@mouse.ee.ntust.edu.tw

and

Horng-Ren Tsai

Department of Information Management, Ling Tung College, Taichung, Taiwan, R.O.C.

Received January 26, 1999; revised October 25, 1999; accepted April 5, 2000

Clustering is a basic operation in image processing and computer vision, and it plays an important role in unsupervised pattern recognition and image segmentation. While there are many methods for clustering, single-link hierarchical clustering is one of the most popular techniques. In this paper, exploiting the advantages of both optical transmission and electronic computation, we design efficient parallel hierarchical clustering algorithms on arrays with reconfigurable optical buses (AROB). We first design three efficient basic operations: the matrix multiplication of two N×N matrices, finding the minimum spanning tree of a graph with N vertices, and identifying the connected component containing a specified vertex. Based on these three data operations, an O(log N) time parallel hierarchical clustering algorithm is proposed using N³ processors. Furthermore, if four-port connectivity of the AROB is allowed, two constant-time clustering algorithms can also be derived using N⁴ and N³ processors, respectively. These results improve on previously known algorithms developed on various parallel computational models. © 2000 Academic Press

Keywords: cluster analysis; hierarchical clustering; image processing; pattern recognition; parallel algorithm; arrays with reconfigurable optical buses (AROB).

doi:10.1006/jpdc.2000.1644, available online at http://www.idealibrary.com

0743-7315/00 $35.00 Copyright © 2000 by Academic Press. All rights of reproduction in any form reserved.

¹ To whom correspondence should be addressed: Prof. Shi-Jinn Horng, Department of Electrical Engineering, National Taiwan University of Science and Technology, 43 Section 4, Kee-Lung Road, Taipei, Taiwan, R.O.C. Fax: 886-2-27376699. This work was supported by the National Science Council under Contract NSC-88-2213-E011-082.

1. INTRODUCTION

Clustering techniques are widely applied in many areas, such as the life sciences, medical sciences, social sciences, earth sciences, and image processing, and the applications continue to grow [1, 11]. Clustering is especially useful when very little prior information about the problem is available. Cluster analysis is the process of classifying objects into subsets that have meaning in the context of a particular problem [11]. Conventionally, the objects are characterized as patterns, and the patterns are numerical vectors in pattern analysis. Assuming that there are M features and each pattern contains all M features, clustering is a process of partitioning these N patterns in M-dimensional space into meaningful subsets or clusters. The clustering of such patterns is achieved by minimizing intracluster dissimilarity and maximizing intercluster dissimilarity. Detailed surveys of cluster analysis can be found in the literature [1, 11, 12].

Many hierarchical clustering methods have been proposed using different distance metrics and techniques [11]. Because of its simplicity and efficiency, hierarchical clustering with the single-link method described by Johnson [13] is one of the most popular clustering methods. The agglomerative approach starts with the disjoint partition, where each pattern initially forms a distinct cluster. Then, two clusters are merged to form a new cluster from the current level to the next level, according to the dissimilarity among all of the remaining clusters. This process is repeated to produce a sequence of nested partitions until only a single cluster remains. Finally, a dendrogram is constructed.
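To make the single-link procedure concrete, the following sequential sketch (a hypothetical Python helper, not the parallel algorithm developed in this paper) merges the two closest clusters at each level, where the distance between two clusters is the minimum pairwise dissimilarity between their members:

```python
# Sequential single-link agglomerative clustering (illustrative sketch).
def single_link(dist):
    """dist: symmetric N x N dissimilarity matrix; returns the merge sequence."""
    n = len(dist)
    clusters = {i: {i} for i in range(n)}   # each pattern starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # single-link distance between clusters = minimum pairwise distance
        a, b = min(
            ((p, q) for p in clusters for q in clusters if p < q),
            key=lambda e: min(dist[i][j] for i in clusters[e[0]] for j in clusters[e[1]]),
        )
        clusters[a] |= clusters.pop(b)       # merge cluster b into cluster a
        merges.append((a, b))
    return merges
```

Cutting the merge sequence after any prefix yields one level of the nested partitions.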

Efficient sequential and parallel clustering algorithms have been studied extensively by researchers. Assuming N patterns each with M features, the sequential hierarchical clustering algorithm can be computed in O(N²M + N³) time in a straightforward manner. Kurita [16] proposed an O(N² log N) time sequential algorithm to solve this problem. Li and Fang [20] proposed an O(N log N) time parallel algorithm on the SIMD hypercube multiprocessors with MN processors. Li [19] also proposed an O(N²) time parallel algorithm on the SIMD shuffle-exchange networks using N processors. Gower and Ross [7] first showed that hierarchical clustering with the single-link method can be derived from the minimum spanning tree (MST). Recently, based on the MST of a proximity matrix, Tsai, Horng, et al. [37] proposed an O(log² N) time parallel algorithm on the processor array with a reconfigurable bus system (PARBS) using N³ processors.

Images are often represented as a two-dimensional (2-D) array of pixels, and each image contains a large amount of data. The mesh-connected computer (MCC) is useful for solving such problems because of its simplicity and regularity in architecture. The MCC has two drawbacks, a fixed architecture and a long communication diameter, and these degrade the performance of algorithms developed on it. These two drawbacks can be overcome by equipping it with various types of bus systems.

Recently, reconfigurable networks have received much attention from researchers because they can overcome the drawbacks of the MCC [2, 3, 38]. Unfortunately, exclusive access to the bus resources limits the throughput of end-to-end communication. Optical interconnections may provide an ultimate solution to this problem [4, 8, 18, 24, 30, 31, 34, 35].



The array with a reconfigurable optical bus system is defined to be an array of processors connected to a reconfigurable optical bus system. The configuration can be dynamically changed by setting up the local switches of each processor, and a message can be transmitted concurrently on a bus in a pipelined fashion. Recently, two related models have been proposed, namely the array with reconfigurable optical buses (AROB) [32] and the linear array with a reconfigurable pipelined bus system (LARPBS) [29, 30]. A major difference between the two models lies in the fact that counting is not permitted in the LARPBS model during a bus cycle but is allowed in the AROB model. The AROB model is a powerful computation model which incorporates some of the advantages and characteristics of reconfigurable meshes and meshes with optical buses [32].

In this paper, we are interested in designing parallel clustering algorithms on the AROB. By integrating the advantages of both optical transmission and electronic computation, we first design three basic operations, taking O(1), O(log N), and O(log N) time, for the matrix multiplication of two N×N matrices, finding the MST of a graph with N vertices, and identifying the connected component containing a specified vertex, respectively. Based on these three basic operations, an O(log N) time parallel hierarchical clustering algorithm is proposed using N³ processors. This result improves on previously known algorithms developed on various computational models. Furthermore, if four-port connectivity of the AROB is allowed, two constant-time clustering algorithms can also be derived.

The remainder of this paper is organized as follows. We give a brief introduction to the AROB computation model in Section 2. Section 3 designs three basic operations which will be used in the parallel clustering algorithms. Section 4 develops our parallel clustering algorithms. Finally, some concluding remarks are given in the last section.

2. THE COMPUTATION MODEL

A linear array of processors with pipelined optical buses (1-D APPB) [8] of size N contains N processors connected to the optical bus with two couplers. One is used to write data on the upper (transmitting) segment of the bus and the other is used to read data from the lower (receiving) segment of the bus. An example of a 1-D APPB of size 5 is shown in Fig. 1.

FIG. 1. A linear APPB of size 5.



The linear AROB (LAROB or 1-D AROB) extends the capabilities of the 1-D APPB by permitting each processor to connect to the bus through a pair of switches. Each processor with a local memory is identified by a unique index denoted as P_i, 0 ≤ i < N, and each switch can be set to either cross or straight by the local processor. The optical switches are used for reconfiguration. If the switches of processor P_i, 1 ≤ i < N−1, are set to cross, then the LAROB will be partitioned into two independent subbuses, each of which forms an LAROB. Each processor uses a set of control registers to store information needed to control the transmission and reception of messages by that processor. An example of an LAROB of size 5 is shown in Fig. 2a. Two interesting switch configurations derivable from a processor of an LAROB are also shown in Fig. 2b.

A petit cycle (τ) is defined as the time needed for a pulse to traverse the optical distance between two consecutive processors on the bus. A bus cycle (σ) is defined as the end-to-end propagation delay of the messages on the optical bus, i.e., the time needed to traverse the entire optical bus. Then the bus cycle σ = 2Nτ, where N is the number of processors in the array. A unit delay is defined to be the spatial length of a single optical pulse, shown as a loop in Fig. 2a. The unit delay may introduce a time-slot delay between two processors on the receiving segments.

Several approaches can be applied to route messages from one processor to another in an optical bus system: the time-waiting function [8], the time-division multiplexing scheme [35], and the coincident pulse technique [4, 18, 35]. It has been shown that these approaches can be implemented in constant time [32]. Arbitrary permutations can be performed using these methods.

The AROB model is essentially a mesh using the basic structure of a classical reconfigurable network (RN) [2] and optical technology. A 2-D AROB of size M×N, denoted as 2-D M×N AROB, contains M×N processors arranged in a 2-D grid. Each processor is identified by a unique 2-tuple index (i, j), 0 ≤ i < M, 0 ≤ j < N. The processor with index (i, j) is denoted by P_{i,j}. Each processor has

FIG. 2. (a) An LAROB of size 5. (b) The switch states.



FIG. 3. (a) A 4_4 AROB. (b) The allowed switch configurations.

four I/O ports, denoted by −S_k, +S_k, 0 ≤ k < 2, to be connected with a reconfigurable optical bus system. The interconnection among the four ports of a processor can be reconfigured during the execution of algorithms. Thus, multiple arbitrary linear arrays like the LAROB can be specified in a 2-D AROB. A processor can be connected to up to four such buses at a time. Once an LAROB is obtained by reconfiguration, we must specify the position of each processor and the orientation of the waveguides in the constructed LAROB for implementing the time-division or coincident pulse techniques. The two terminal processors located at the end points of the constructed LAROB may serve as the leader processors (similar to P_0 in Fig. 2a). The relative position of any processor on a bus to which it is connected is its distance from the leader processor. For more details on the AROB, see [32]. The extended AROB allows switch settings to change during a bus cycle, triggered by the detection of a pulse [34]. An example of a 2-D 4×4 AROB and the allowed switch configurations are shown in Fig. 3.

For a unit of time, we assume each processor can either perform arithmetic and logic operations or communicate with others on a bus. Since the bus cycle length can be considered to be O(σ), it is compatible with the computation time of any arithmetic or logic operation. The AROB allows multiple processors to broadcast data on different buses, or to broadcast the same data on the same bus, simultaneously in a time unit if there is no collision. Let var(k) denote the local variable var (memory or register) in a processor with index k. For example, sum(0, 0, 1) is the local variable sum of processor P_{0,0,1}.

3. BASIC OPERATIONS

In this section, we design three data operations: computing the matrix multiplication, finding the MST of a graph, and identifying the connected component containing a specified vertex. These three data operations will be used for developing an efficient parallel hierarchical clustering algorithm in the next section. Several basic operations which have been proposed on the AROB are summarized in the following.

Lemma 1 [34]. Given N integer or normalized real numbers, these N numbers can be sorted in O(1) time either on an N LAROB if these numbers are of bounded magnitude and precision or on an N extended LAROB if these numbers are of unbounded magnitude and precision.

By applying Lemma 1, the maximum (minimum) of these N numbers can be easily found. This leads to the following corollary.

Corollary 1. Given N integer or normalized real numbers, the maximum (minimum) of these N numbers can be found in O(1) time either on an N LAROB if these numbers are of bounded magnitude and precision or on an N extended LAROB if these numbers are of unbounded magnitude and precision.

Lemma 2 [27]. Given N integer or normalized real numbers, each of O(log N) bits, these N numbers can be added by the bus-split technique in O(log N) time on an N LAROB.

Lemma 3 [33]. Given a data array of size N, the ordered compaction problem is the problem of moving the n nonzero (nonempty) data items to the first n consecutive locations (in row-major order) of the array while preserving their relative order. The ordered compaction problem can be solved in O(1) time on an N LAROB.
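The effect of the ordered compaction of Lemma 3 can be illustrated sequentially as follows (the AROB achieves this result in O(1); the function name and the zero sentinel are illustrative):

```python
# Sequential sketch of the ordered-compaction semantics of Lemma 3.
def ordered_compaction(items, empty=0):
    """Move nonempty items to the front, preserving their relative order."""
    kept = [x for x in items if x != empty]
    return kept + [empty] * (len(items) - len(kept))
```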

Lemma 4 [32]. Given N Boolean data, the logical OR (∨) of these N Boolean data can be computed in O(1) time on an N LAROB.

3.1. Computing the Matrix Multiplication

Let A = [a_{i1,i0}] and B = [b_{i1,i0}], 0 ≤ i1, i0 < N, be two N×N matrices, and let all elements of A and B be on the same domain D. The matrix multiplication C = [c_{i1,i0}] is defined by

c_{i1,i0} = ⊕_{i2=0}^{N−1} [a_{i1,i2} ⊗ b_{i2,i0}], 0 ≤ i1, i0 < N,   (1)

where ⊕ and ⊗ are two associative operators on the domain D. For standard matrix multiplication, some results on models with optical buses have been derived in the literature [21, 22, 24, 26, 33]. In this paper, we will design algorithms for two variations of matrix multiplication on the AROB model.
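A sequential sketch of Eq. (1) with user-supplied ⊕ and ⊗ operators may clarify the generalized product (the function name is illustrative):

```python
# Sequential sketch of the generalized product of Eq. (1): entries are
# combined with an "otimes" operator and reduced with an "oplus" operator.
from functools import reduce

def general_matmul(A, B, oplus, otimes):
    """Return C with C[i][j] = oplus over k of otimes(A[i][k], B[k][j])."""
    n = len(A)
    return [
        [reduce(oplus, (otimes(A[i][k], B[k][j]) for k in range(n)))
         for j in range(n)]
        for i in range(n)
    ]
```

With oplus = addition and otimes = multiplication this is the standard product; with oplus = min and otimes = max it is the variation used for the MST in Section 3.2.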

By the reconfigurability and pipelined ability of the optical bus configuration, we can compute Eq. (1) efficiently on a 3-D AROB. Initially, a_{i1,i0} and b_{i1,i0} are stored in the local variables a(0, i1, i0) and b(0, i1, i0) of processor P_{0,i1,i0}, 0 ≤ i1, i0 < N, one item per processor, respectively. The result c_{i1,i0} is stored in the local variable c(0, i1, i0) of processor P_{0,i1,i0}, 0 ≤ i1, i0 < N. Like Chen et al. [3], Eq. (1) can be computed on a 3-D AROB by the following three steps.

1. Broadcast the elements of A and B over the N×N×N processors through the optical buses so that a(i2, i1, i0) = a_{i1,i2} for 0 ≤ i0 < N and b(i2, i1, i0) = b_{i2,i0} for 0 ≤ i1 < N.

2. Compute e(i2, i1, i0) = a(i2, i1, i0) ⊗ b(i2, i1, i0).

3. Compute c(0, i1, i0) = ⊕_{i2=0}^{N−1} e(i2, i1, i0).

The first two steps each take O(1) time. The time complexity of Step 3 depends on the operator ⊕. That is, by properly replacing the associative operator ⊕, each variation of matrix multiplication can be computed efficiently. For example, if the associative operator ⊕ is replaced by the maximum (minimum) operation, then Step 3 can be computed in O(1) time by Corollary 1 on a 3-D N×N×N AROB. If matrix B is replaced by an N×1 vector and the associative operators ⊕ and ⊗ are replaced by the logical OR (∨) and logical AND (∧) operations, respectively, then this matrix–vector multiplication is a special case of matrix multiplication. For this case, Steps 1–3 require only N² processors and Step 3 can be computed in O(1) time by Lemma 4. Hence, this leads to the following lemmas.

Lemma 5. Given two N×N matrices A and B, if the operator ⊕ is the maximum (minimum) operator, then this variation of the matrix multiplication of A and B can be computed in O(1) time either on a 3-D N×N×N AROB if the matrix entries are of bounded magnitude and precision, or on a 3-D N×N×N extended AROB if the matrix entries are of unbounded magnitude and precision.

Lemma 6. Given an N×N Boolean matrix A and an N×1 Boolean vector B, if the operators ⊕ and ⊗ are the ∨ and ∧ operators, respectively, then the matrix–vector multiplication of A and B can be computed in O(1) time on a 2-D N×N AROB.

Compared to the algorithm proposed by Chen et al. [3], our algorithm for computing the variation of matrix multiplication with the maximum (minimum) operation runs in the same time complexity, but the number of processors is reduced by a factor of N.

3.2. Finding the Minimum Spanning Tree

Given a graph G = (V, E) with N vertices, the minimum spanning tree (MST) problem of G is to find a spanning tree of G with minimum total edge weight. Assume that the adjacency matrix of G is given by

a_{i,j} = w(e), if the edge e is an edge from vertex i to vertex j; ∞, otherwise.

Based on the approach specified by Maggs and Plotkin [25], the MST T of G can be represented by the following recursive formula:

Initially, c^1_{i,j} = a_{i,j}, 0 ≤ i, j < N.
For each iteration l, 2 ≤ l ≤ N: c^l_{i,j} = min_{0 ≤ k < N} [max(c^{l/2}_{i,k}, c^{l/2}_{k,j})], 0 ≤ i, j < N.
Finally, set t_{i,j} = 1 if c^N_{i,j} = a_{i,j}; t_{i,j} = 0, otherwise.   (2)

Like Chen et al. [3], c^l_{i,j} of Eq. (2) at each iteration l can be computed using the matrix multiplication of the previous subsection by replacing the operators ⊕ and ⊗ with minimum and maximum, respectively. That is, c^l_{i,j} of Eq. (2) can be computed in O(1) time at iteration l by Lemma 5. Continuing this process for at most log N iterations, c^N_{i,j} can be obtained from c^1_{i,j}, c^2_{i,j}, c^4_{i,j}, ..., c^{N/2}_{i,j}. Finally, all edges of the MST T of G can be determined by setting t_{i,j} = 1 if c^N_{i,j} = a_{i,j}, and t_{i,j} = 0 otherwise. Since each iteration l takes O(1) time by Lemma 5 and the number of iterations is at most log N, the total time complexity of the proposed algorithm is O(log N). Compared to the algorithm proposed by Chen et al. [3], our algorithm runs in the same time complexity, but the number of processors is reduced by a factor of N. Hence, this leads to the following lemma.

Lemma 7. Given a graph G with N vertices, the minimum spanning tree of G can be found in O(log N) time either on a 3-D N×N×N AROB if the edge weights are of bounded magnitude and precision or on a 3-D N×N×N extended AROB if the edge weights are of unbounded magnitude and precision.
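The recursion of Eq. (2) can be illustrated by the following sequential sketch: after about log₂ N "squarings" under the (min, max) product, c[i][j] holds the minimax path weight between vertices i and j, and (assuming distinct edge weights, as in Section 4) an edge belongs to the MST exactly when its weight equals that minimax value. The function name and the zeroed diagonal are illustrative details of the sketch:

```python
# Sequential sketch of the MST characterization behind Eq. (2).
INF = float('inf')

def mst_edges(a):
    """a: symmetric N x N weight matrix, INF where no edge, distinct weights."""
    n = len(a)
    # c starts as the weight matrix with a zero diagonal (paths of length 0)
    c = [[0 if i == j else a[i][j] for j in range(n)] for i in range(n)]
    for _ in range(max(1, (n - 1).bit_length())):   # about log2 N squarings
        c = [[min(max(c[i][k], c[k][j]) for k in range(n))
              for j in range(n)] for i in range(n)]
    # an edge is in the MST iff its weight equals the minimax path weight
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if a[i][j] != INF and c[i][j] == a[i][j]}
```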

3.3. Identifying the Connected Component Containing a Specified Vertex

Given a graph G = (V, E) with N vertices and a specified vertex v, the problem of identifying the connected component containing vertex v is to find a subgraph G′ = (V′, E′), V′ ⊆ V, E′ ⊆ E, such that there is a path from vertex v to every vertex in V′. This problem is a subtask of the connected component problem, and it can be solved by using the variation of matrix–vector multiplication mentioned in Section 3.1. Assume that the adjacency matrix of G is given by

a_{i,j} = 1, if there is an edge from vertex i to vertex j; 0, otherwise.

Based on the well-known technique [3, 23], we can rewrite Eq. (2) by replacing the operators min and max with logical OR and logical AND, respectively. Hence, this problem is a variation of the MST problem and the iteration step can be represented by

c^l_i = ∨_{0 ≤ k < N} [c^{l/2}_{i,k} ∧ b_k], 0 ≤ i < N,   (3)

where B = [b_k], 0 ≤ k < N, is the vth column of matrix A (i.e., the adjacency vector of vertex v). Since c^l_i of Eq. (3) at each iteration l can be computed by using the variation of matrix–vector multiplication, that is, Eq. (3) can be computed in O(1) time at each iteration l by Lemma 6, c^N_i can be obtained in O(log N) time. Finally, vertex i, 0 ≤ i < N, and vertex v are in the same connected component if and only if c^N_i = 1.
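The following sequential sketch illustrates what Eq. (3) computes; for simplicity it iterates to a fixed point, whereas the parallel algorithm reaches the answer in O(log N) rounds by squaring. The function name is illustrative:

```python
# Sequential sketch of the reachability computation behind Eq. (3).
def component_of(adj, v):
    """adj: N x N Boolean adjacency matrix; returns vertices connected to v."""
    n = len(adj)
    c = [i == v for i in range(n)]
    changed = True
    while changed:    # fixed point; the AROB version converges in O(log N) rounds
        nxt = [c[i] or any(adj[i][k] and c[k] for k in range(n)) for i in range(n)]
        changed = nxt != c
        c = nxt
    return {i for i in range(n) if c[i]}
```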

Lemma 8. Given an undirected graph G = (V, E) with |V| = N and a specified vertex v, the connected component of G containing vertex v can be found in O(log N) time on a 2-D N×N AROB.

4. PARALLEL HIERARCHICAL CLUSTERING ALGORITHMS

Let A = [a_{i,j}], 0 ≤ i < N, 0 ≤ j < M, be a pattern matrix of size N×M with N patterns, each having all M features. In general, N is in the range of hundreds and M < 30. A hierarchical clustering method is a procedure for transforming a proximity matrix into a sequence of nested partitions [11]. The direct input to hierarchical clustering is the proximity matrix D, which is usually generated from a pattern matrix A. Each entry of the proximity matrix D = [d_{i,k}], 0 ≤ i, k < N, represents the proximity of the pairwise patterns indexed by the row and column of pattern matrix A. Because the Euclidean distance is the most common of the Minkowski metrics, we use the Euclidean distance to measure the dissimilarity between patterns. That is,

d_{i,k} = (Σ_{j=0}^{M−1} (a_{i,j} − a_{k,j})²)^{1/2}, 0 ≤ i, k < N.   (4)
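Equation (4) can be rendered as a direct sequential computation of the proximity matrix (an illustrative Python sketch; the function name is an assumption):

```python
# Sketch of Eq. (4): pairwise Euclidean distances between the N rows
# (patterns) of the pattern matrix A.
def proximity_matrix(A):
    n, m = len(A), len(A[0])
    return [[sum((A[i][j] - A[k][j]) ** 2 for j in range(m)) ** 0.5
             for k in range(n)] for i in range(n)]
```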

The output of a hierarchical clustering algorithm can be represented by a dendrogram (i.e., a level tree of nested partitions). Each level (denoted as l_i, 1 ≤ i < N) consists of only one node (unlike a regular tree), each representing a cluster. We can cut a dendrogram at any level to obtain a clustering.

Tsai, Horng, et al. [37] have proposed an O(log² N) time parallel hierarchical clustering algorithm with the single-link method on a 3-D N×N×N PARBS. By the pipelined ability and reconfigurability of the optical buses, we also develop an efficient parallel clustering algorithm on a 3-D AROB in the following.

Let G = (V, E, W) denote a weighted proximity graph derived from the proximity matrix D, where V is the set of vertices, E is the set of edges, and w(e_{i,j}) is the associated weight of edge e_{i,j} ∈ E. Thus, the associated weight w(e_{i,j}) of edge e_{i,j} corresponds to the entry d_{i,j} of D, where e_{i,j} = (i, j) is the edge incident to vertices v_i and v_j. For the sake of convenience, we assume that no two edges in the MST have the same weight. Since D is a symmetric matrix and d_{i,i} = 0, 0 ≤ i < N, the upper triangular part of D suffices to specify the N(N−1)/2 edges of G. Hence, we can set d_{i,j} = ∞ for 0 ≤ j ≤ i < N.

Let T = (V, {t_1, t_2, t_3, ..., t_{N−1}}) be the MST of G, where w(t_1) < w(t_2) < w(t_3) < ... < w(t_{N−1}). For agglomerative hierarchical clustering, the edges in T will be cut in the order t_1, t_2, t_3, ..., t_{N−1}, and each cut corresponds to a level of the dendrogram. That is, cutting the edge t_i in T forms the (N−i)th level of the nested partitions of the hierarchical clustering. At level l_{N−i}, the two clusters containing the patterns corresponding to the two ending vertices of the cut edge t_i are merged to form a new cluster. Since the newly generated cluster of each cut may contain many vertices, we set the vertex with the smallest vertex number among them as a supervertex (denoted as sv) to identify the newly generated cluster, and the other vertices in the same cluster can be discarded. In order to determine which clusters will be merged at level l_{N−i}, we may find the connected components containing the two ending vertices of the cut edge t_i, respectively, from the subtree made up of t_k, 1 ≤ k < i. For the sake of readability, an example of hierarchical clustering based on the MST is shown in Fig. 4.

To easily specify the information of each edge of the MST and of each nested partition, two arrays T = [t_{i,j}] and C = [c_{i,j}], 0 ≤ i, j < N, are used to store the edges of the MST and the clusters, respectively. Each entry t_{i,j} consists of three fields: tw, tl, and tr, where tw represents the distance between vertices i and j, tl denotes the left ending vertex of edge (i, j), and tr denotes the right ending vertex of edge (i, j).



FIG. 4. An example of hierarchical clustering using MST. (a) Five patterns placed on the 2-D feature space. (b) The proximity matrix D derived from (a). (c) An MST derived from the proximity graph G. (d) t1 corresponds to l4: the two ending vertices of t1 (v1 and v2) are merged to form cluster 5, in which sv = v1. (e) t2 corresponds to l3: v3 and v4 are merged to form cluster 6, in which sv = v3. (f) t3 corresponds to l2: cluster 6 containing v3 merges with v0 to form cluster 7, in which sv = v0. (g) t4 corresponds to l1: cluster 5 containing v2 and cluster 7 containing v0 are merged to form cluster 8, in which sv = v0. Finally, cluster 8 contains all patterns, and the dendrogram is constructed.

Each entry c_{i,j} consists of three fields: cn, cl, and cr, where cn represents the cluster number, cl denotes the left child, and cr denotes the right child.

Following the definitions of the proximity matrix D, the proximity graph G, and the MST T, the parallel algorithm for agglomerative hierarchical clustering can be described by the following three phases.

Phase 1. Compute the proximity matrix D from pattern matrix A.

Phase 2. Begin with the disjoint clustering by setting each pattern as a cluster; then find an MST T of the proximity graph G.



Phase 3. Cut the N−1 edges of the MST T in the order of their weights simultaneously such that each cut forms a new clustering.
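The three phases can be rendered sequentially as follows (an illustrative sketch; Phase 2 uses a Kruskal-style scan in place of the matrix-based MST method of Section 3.2, and the function name is an assumption):

```python
# A compact sequential rendering of the three phases of the clustering
# procedure: proximity matrix, MST, then merges in increasing weight order.
import itertools

def dendrogram_merges(A):
    n = len(A)
    # Phase 1: proximity matrix (squared distances suffice for ordering)
    d = {(i, k): sum((A[i][j] - A[k][j]) ** 2 for j in range(len(A[0])))
         for i, k in itertools.combinations(range(n), 2)}
    # Phase 2: MST by a Kruskal-style scan with union-find
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x
    merges = []
    for (i, k), w in sorted(d.items(), key=lambda e: e[1]):
        ri, rk = find(i), find(k)
        if ri != rk:
            parent[ri] = rk
            merges.append((i, k))
    # Phase 3: MST edges in increasing weight order = single-link merge order
    return merges
```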

These three phases can be easily implemented in O(log N) time on a 3-D N×N×N AROB. Initially, the pattern matrix a_{i,j} is stored in the local variable a(i, j, 0) of processor P_{i,j,0}, 0 ≤ i < N, 0 ≤ j < M. Finally, the dendrogram consisting of the level number, cluster number, and two children is stored in the local variables l(i, j, 0), cn(i, j, 0), cl(i, j, 0), and cr(i, j, 0) of processor P_{i,j,0}, 0 ≤ i ≤ j < N, respectively. The detailed single-link hierarchical clustering algorithm (SLHCA) is shown in the following. Based on the patterns shown in Fig. 4, an illustration of Algorithm SLHCA is shown in Fig. 5.

FIG. 5. An illustration of algorithm SLHCA. (a) d(i, j, 0), after Step 1. (b) t(i, j, 0) of T, after Step 2.1. (c) After Step 2.2. (d) After Step 3.1. (e) After Step 3.2. (f) After Step 3.4, the final result.

Algorithm SLHCA(A, l, C);�* A is an input variable. l and C are output variables. *�

1: �� Phase 1. ��1.1: �� Distribute the value of ai, j over the N_M_N processors, where

M<N.�� Copy a(i, j, 0), 0�i<N, 0� j<M, to ai(i, j, k), 0�k<N throughthe i0-dimensional bus; then copy ai(k, j, k), 0� j<M, 0�k<N, toak(i, j, k), 0�i<N through the i2-dimensional bus. Thus, each processorPi, j, k , 0�i, k<N, 0� j<M, holds two pattern features ai, j and ak, j .

1.2: Compute d2(i, j, k) :=(ai(i, j, k)&ak(i, j, k))2, 0�i, k<N, 0� j<M.1.3: �� Compute Eq. (3) ��

Compute d(i, 0, k) :=(�M&1j=0 d2(i, j, k))1�2, 0�i, k<N, through the

i1-dimensional bus by Lemma 2.1.4: First copy d(i, 0, k), 0�i, k<N, to d $(i, k, k) through the i1-dimen-

sional bus and then copy d $(i, k, k), 0�i, k<N, to d(i, k, 0) throughthe i0-dimensional bus. Finally, set d(i, k, 0) :=� for 0�k�i<N.

2: �� Phase 2. ��2.1: �� Find all edges of T from G. ��

The proximity matrix D is a weighted matrix corresponding toproximity graph G. Find the MST T of G by Lemma 7. Let each edgeof T be stored in the local variable t(i, j, 0).

2.2: �� Sort the N&1 edges of T and put the sorted edges to diagonalprocessors. ��Compact the edge t(i, j, 0) according to its associated weight tw(i, j, 0){�,0�i, j<N, by Lemma 3. Then, sort these N&1 edges located on thefirst row of processor P0, j, 0 , 0� j<N&1, into nondecreasing order byLemma 1. Finally, copy t(0, j&1, 0) to t( j, j, 0), for 1� j<N.

2.3: �� Initial partition: set each pattern as a cluster and row i correspondingto level N&i. ��For 0�i, j<N, set l(i, j, 0) :=N&i, cn(0, j, 0) :=j, cl(0, j, 0) :=nil,cr(0, j, 0) :=nil, and sv(0, j, 0) :=1.

3: �� Phase 3. Let a 3-D N_N_N AROB be partitioned into N 2-D N_NAROB which are denoted as 2D-AROBi0

, 0�i0<N. A 2D-AROBi0contains

processor Pi2 , i1 , i0, 0�i2 , i1<N. ��

3.1: �� Identify the newly generated cluster of each level. ��3.1.1: �� Construct the subtree of level li , 1�i<N. ��

Copy t( j, j, 0), 1� j<N, to t(i, j, 0) for j�i through the i2-dimen-sional bus. Thus, the subtree of lN&i consists of edge t(i, j, 0), 1� j�i.

1148 WU, HORNG, AND TSAI

3.1.2: �� Label the vertices of each level which lie in the same connectedcomponent with vertices tl(i, i, 0) and tr(i, i, 0) by a new clusternumber N+i&1. ��For those edges belonging to row i of 2D-AROB0 , 1�i<N, findthe connected component CCi containing vertices tl(i, i, 0) andtr(i, i, 0) from the subtree by Lemma 8. If a vertex j, 0� j<N,belonged to the connected component CCi found above, then setcn(i, j, 0) :=N+i&1; cn(i, j, 0) :=nil, otherwise.

3.1.3: ** Find the supervertex of each cluster. **
For each row i of 2D-AROB₀, 1 ≤ i < N, find the minimal index j of cn(i, j, 0) with cn(i, j, 0) ≠ nil, 0 ≤ j < N, by Corollary 1. Then, set sv(i, j, 0) := 1 if the vertex j is the minimal index; sv(i, j, 0) := 0, otherwise.
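Sequentially, Steps 3.1.2–3.1.3 reduce to a connected-component query over the i smallest MST edges. The union-find sketch below (names hypothetical, not the paper's) labels the component containing the endpoints of the i-th edge with the new cluster number N + i − 1 and reports its minimal vertex as the supervertex:

```python
def label_new_cluster(n, edges, i):
    """edges: MST edges sorted by nondecreasing weight, each a pair (u, v).
    The subtree of level l_{N-i} consists of the first i edges; the newly
    generated cluster is the component containing the endpoints of edge i.
    Returns (cluster_number, members, supervertex)."""
    parent = list(range(n))

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges[:i]:            # build the components of the subtree
        parent[find(u)] = find(v)

    root = find(edges[i - 1][0])      # component of tl(i,i,0) / tr(i,i,0)
    members = [j for j in range(n) if find(j) == root]
    return n + i - 1, members, min(members)   # supervertex = minimal index
```

The AROB replaces this O(N α(N)) scan by the O(1)/O(log N) component-labeling of Lemma 8 and the minimum-index search of Corollary 1.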

3.2: ** Set the left and right children of each cluster. **

3.2.1: Processor P_{i, j, 0} with cn(i, j, 0) = nil, 0 ≤ i, j < N, establishes the local connection [−S₂, +S₂]; it breaks this established local connection, otherwise.

3.2.2: Processor P_{i, j, 0} with cn(i, j, 0) ≠ nil, 0 ≤ i, j < N, broadcasts cn(i, j, 0) on the port +S₂ of the established bus. Processor P_{i, j, 0} with cn(i, j, 0) ≠ nil, 1 ≤ i, j < N, receives the broadcast data from its port −S₂ on the established bus and stores them in pre_cn(i, j, 0).

3.2.3: For each row i of 2D-AROB₀, 1 ≤ i < N, set cl(i, j, 0) := pre_cn(i, tl(i, i, 0), 0) and cr(i, j, 0) := pre_cn(i, tr(i, i, 0), 0) if sv(i, j, 0) = 1, for 0 ≤ j < N; cl(i, j, 0) := nil and cr(i, j, 0) := nil, otherwise.

3.3: ** Broadcast the cluster of l_k, 0 < i < k ≤ N, which has not been merged, to l_i. **

3.3.1: For each row i of 2D-AROB₀, 1 ≤ i < N, processor P_{i, j, 0} with cn(i, j, 0) = nil, 0 ≤ j < N, establishes the local connection [−S₂, +S₂]; breaks this established local connection, otherwise.

3.3.2: For each row i of 2D-AROB₀, 0 ≤ i < N, processor P_{i, j, 0} with cn(i, j, 0) ≠ nil, 0 ≤ j < N, broadcasts c(i, j, 0) and sv(i, j, 0) on the port +S₂ of the established bus, respectively. Then, processor P_{i, j, 0} with cn(i, j, 0) = nil, 1 ≤ i, j < N, receives the broadcast data from its port −S₂ on the established bus and stores them in the corresponding local variables, respectively.

3.4: For each row i of 2D-AROB₀, 0 ≤ i < N, processors P_{i, j, 0} with sv(i, j, 0) = 1, 0 ≤ j < N, compact l(i, j, 0) and c(i, j, 0) together through the i₁-dimensional processors by Lemma 3.
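The Lemma 3 compaction invoked here (and in Step 2.2) is a stable packing of marked items: each marked processor computes its destination as a prefix sum of the marks and routes its value there in O(1) bus cycles. A sequential sketch of the same operation, with illustrative names:

```python
def compact(items, marked):
    """Stable compaction: move items whose mark is set to the front,
    preserving their relative order. On the AROB, each marked processor
    obtains its destination as the exclusive prefix sum of the marks
    (Lemma 3) and sends its item there over the pipelined optical bus."""
    dest = 0
    out = [None] * len(items)
    for item, m in zip(items, marked):
        if m:
            out[dest] = item      # dest equals the exclusive prefix sum so far
            dest += 1
    return out[:dest]
```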

Theorem 1. Given N patterns, each with M features (N ≥ M), algorithm SLHCA can be computed in O(log N) time either on a 3-D N×N×N AROB if all features are of bounded magnitude and precision, or on a 3-D N×N×N extended AROB if all features are of unbounded magnitude and precision.

Proof. The time complexity is analyzed as follows. Steps 1.1, 1.2, and 1.4 each take O(1) time. Step 1.3 takes O(log M) time by Lemma 2. Hence, Step 1 takes

1149 HIERARCHICAL CLUSTERING ON ARRAYS

O(log M) time. Step 2.1 takes O(log N) time using N×N×N processors by Lemma 7. Step 2.2 takes O(1) time by Lemmas 1 and 3. Step 2.3 takes O(1) time. Hence, Step 2 takes O(log N) time. Steps 3.1.1 and 3.1.2 each take O(1) time. Step 3.1.3 takes O(1) time by Corollary 1. Hence, Step 3.1 takes O(1) time. Steps 3.2 and 3.3 each take O(1) time. Step 3.4 takes O(1) time by Lemma 3. Therefore, the total time complexity is O(log N) using N³ processors. ∎
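For reference, the merge logic that algorithm SLHCA parallelizes is the classical MST-based single-link procedure [7]: the N−1 MST edges of the proximity graph, taken in nondecreasing order of weight, each merge two clusters, one per level. A sequential Kruskal-style sketch (names illustrative; levels are numbered N−1 down to 1, matching the initial partition of Step 2.3):

```python
def single_link_dendrogram(d):
    """d: symmetric N x N distance matrix (d[i][j] == d[j][i]).
    Returns the N-1 merges as (level, weight, frozenset of members).
    Sorting all pairs and keeping only cluster-joining edges is Kruskal's
    algorithm, so exactly the MST edges of G drive the merges."""
    n = len(d)
    pairs = sorted((d[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    comp = {i: frozenset([i]) for i in range(n)}   # vertex -> its cluster
    merges = []
    for w, u, v in pairs:
        if comp[u] is not comp[v]:                 # an MST edge of G
            merged = comp[u] | comp[v]
            for x in merged:
                comp[x] = merged                   # all members share one cluster
            merges.append((n - len(merges) - 1, w, merged))
            if len(merges) == n - 1:
                break
    return merges
```

When several edges have equal weight, this picks an arbitrary order among them, as does any MST-based formulation of the single-link method.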

Our next results assume an extended AROB wherein a processor can connect four ports [38]. By increasing the number of processors and extending the switch configurations, algorithm SLHCA can easily be modified to run more efficiently. Since the time complexity of the proposed algorithm is dominated by finding the MST of the proximity graph G in Step 2.1, and the MST of G can be found in O(1) time using N⁴ processors by applying the technique proposed by Wang and Chen [38], the time complexity of the modified algorithm can be reduced from O(log N) to O(1) by increasing the number of processors from N³ to N⁴. The hierarchical clustering with the single-link method can also be derived from the geometry of the coordinate space [10]. Furthermore, most applications of clustering have M < 30. When M is fixed, the Euclidean MST can be found in O(1) time using N³ processors by applying the approach proposed by Lai and Sheng [17].

Wang and Chen's connected component and MST algorithms and Lai and Sheng's EMST algorithm are based on tree-shaped 2-D buses (i.e., the four-port connection), while the AROB model defined in [32] permits only linear buses. Recently, some optical interconnections, such as time-division multiplexed (TDM) systems [36], wavelength-division multiplexed (WDM) systems [5], and time slot interchangers (TSI) [14], have been designed as various methods of establishing reconfigurability and switching for systems with Terabit/s throughput requirements. Such interconnections can be used to provide massively parallel data communication among a few ports [15, 36]. Most optical switch systems are constructed from switch matrices with crossbar or tree structures [9, 28]. For example, Okayama et al. [28] developed an optical switch matrix with an N×N tree structure by using 1×2, 2×1, and 2×2 switches. Actually, the switch systems mentioned above are more complex than the one proposed in the AROB model used in this paper.

In order to implement the four-port connection of the AROB, minor modifications in the hardware and functional specification of the switching system are required. In the sense of the four-port connection, the switch is functionally equivalent to a 1-to-3 splitter, and the index determination algorithm [32], used for specifying to each processor the length of the buses to which it is connected and its indices relative to these buses, is not necessarily required. If the four-port connection is allowed on an extended AROB, then Wang and Chen's connected component and MST algorithms and Lai and Sheng's EMST algorithm can run in O(1) time on the extended AROB. This leads to the following two corollaries.

Corollary 2. Given N patterns, each with M features (N ≥ M), the parallel hierarchical clustering algorithm can be computed in O(1) time on a 4-D N×N×N×N extended AROB, if the four-port connection is allowed.


Corollary 3. Given N patterns, each with M features (M is fixed, N ≥ M), the parallel hierarchical clustering algorithm can be computed in O(1) time on a 3-D N×N×N extended AROB, if the four-port connection is allowed.

5. CONCLUDING REMARKS

In a cycle time, the number of messages that can be transmitted by a pipelined optical bus is larger than the number that can be transmitted by an electrical bus; optical transmission thus reduces the data transmission time between processors considerably. The transmission time of a data item between processors is determined by the size of the data item and the bus capacity. Due to its high communication bandwidth, its bus reconfigurability, and the versatile communication patterns it supports, the AROB is useful for solving computational problems, and its computational power is superior to that of other existing reconfigurable networks, such as the PARBS.

To demonstrate the computational power of the AROB, we first design three basic operations: an O(1) time matrix multiplication of two N×N matrices, an O(log N) time algorithm for finding the MST of a graph with N vertices, and an O(log N) time algorithm for identifying the connected component containing a specified vertex. These three operations are computationally intensive and require global propagation of data. Based on them, an O(log N) time hierarchical clustering algorithm on a 3-D N×N×N AROB is derived. Compared to the algorithm proposed by Tsai, Horng, et al. [37], the time complexity of the proposed algorithm is reduced from O(log² N) to O(log N) with the same number of processors. Furthermore, two constant time results can be derived if the four-port connection of the AROB is provided by the advanced optical switch techniques.

Optical interconnections offer many advantages over their electronic counterparts, including high connection density, low crosstalk, and a relaxed bandwidth-distance product [6]. Currently, optical interconnection techniques are used to establish reconfigurability and simultaneous switching for massively parallel processing systems [36]. Finally, it should be mentioned that the Jitney Optical Bus, with 20 channels at 500 Mb/s per channel, has been designed for high-speed parallel computing and successfully demonstrated in IBM AS/400 and RS/6000 power parallel system testbeds [15]. Due to these new developments, models with reconfigurable optical buses are likely to become feasible architectures in the near future. Although the four-port connection is not defined in the standard AROB, recent experiments seem to indicate that the assumption of allowing it is reasonable. Thus, Corollaries 2 and 3 can be derived.

REFERENCES

1. M. R. Anderberg, ``Cluster Analysis for Applications,'' Academic Press, New York, 1973.

2. Y. Ben-Asher, D. Peleg, R. Ramaswami, and A. Schuster, The power of reconfiguration, J. Parallel Distrib. Comput. 13 (1991), 139–153.

3. G. H. Chen, B. F. Wang, and C. J. Lu, On the parallel computation of the algebraic path problems, IEEE Trans. Parallel Distrib. Systems 3 (1992), 251–256.


4. D. M. Chiarulli, R. G. Melhem, and S. P. Levitan, Using coincident optical pulses for parallel memory addressing, IEEE Comput. Mag. 20 (1987), 48–58.

5. P. Dowd, K. Bogineni, K. A. Aly, and J. Perreault, Hierarchical scalable photonic architectures for high performance processor interconnection, IEEE Trans. Comput. 20 (1993), 1105–1120.

6. M. Feldman, S. Esener, C. Guest, and S. Lee, Comparison between optical and electrical interconnects based on power and speed considerations, Appl. Optics 27 (1988), 1742–1751.

7. J. C. Gower and G. J. S. Ross, Minimum spanning tree and single-linkage cluster analysis, Appl. Statist. 18 (1969), 54–64.

8. Z. Guo, R. G. Melhem, R. W. Hall, D. M. Chiarulli, and S. P. Levitan, Pipelined communications in optically interconnected arrays, J. Parallel Distrib. Comput. 12 (1991), 269–282.

9. Y. Hanaoka, F. Shimokawa, and Y. Nishida, Low-loss intersecting grooved waveguides with low Δ for a self-holding optical matrix switch, IEEE Trans. Components, Packaging, Manufacturing Technol. B 18 (1995), 241–244.

10. J. A. Hartigan, ``Clustering Algorithms,'' Wiley, New York, 1975.

11. A. Jain and R. Dubes, ``Algorithms for Clustering Data,'' Prentice-Hall, Englewood Cliffs, NJ, 1988.

12. M. Jambu and M. O. Lebeaux, ``Cluster Analysis and Data Analysis,'' North-Holland, New York, 1983.

13. S. Johnson, Hierarchical clustering scheme, Psychometrika 23 (1967), 241–254.

14. H. F. Jordan, D. Lee, K. Y. Lee, and S. V. Ramanan, Serial array time slot interchangers and optical implementations, IEEE Trans. Comput. 43, 11 (1994), 1309–1318.

15. D. M. Kuchta, J. Crow, P. Pepeljugoski, K. Stawiasz, J. Trewhella, D. Booth, W. Nation, C. DeCusatis, and A. Muszynski, Low cost 10 Gigabit/s optical interconnects for parallel processing, in ``Proc. the Fifth International Conference on Massively Parallel Processing,'' pp. 210–215, 1998.

16. T. Kurita, An efficient agglomerative clustering algorithm using a heap, Pattern Recognition 24 (1991), 205–209.

17. T. H. Lai and M. J. Sheng, Constructing Euclidean minimum spanning trees and all nearest neighbors on reconfigurable meshes, IEEE Trans. Parallel Distrib. Systems 7 (1996), 806–817.

18. S. P. Levitan, D. M. Chiarulli, and R. G. Melhem, Coincident pulse technique for multiprocessor interconnection structures, Appl. Optics 29 (1990), 2024–2033.

19. X. Li, Parallel algorithms for hierarchical clustering and cluster validity, IEEE Trans. Pattern Anal. Mach. Intelligence 12 (1990), 1088–1092.

20. X. Li and Z. Fang, Parallel clustering algorithms, Parallel Comput. 11 (1989), 275–290.

21. K. Li, Constant time Boolean matrix multiplication on a linear array with a reconfigurable pipelined bus system, J. Supercomput. 11 (1997), 391–403.

22. K. Li and V. Y. Pan, Parallel matrix multiplication on a linear array with a reconfigurable pipelined bus system, in ``Proc. International Parallel Processing Symposium/SPDP,'' pp. 31–35, 1999.

23. K. Li, Y. Pan, and M. Hamdi, Solving graph theory problems using reconfigurable pipelined optical buses, in ``Proc. the 3rd Workshop on Optics and Computer Science (Parallel and Distributed Processing),'' pp. 911–923, 1999.

24. K. Li, Y. Pan, and S. Q. Zheng, Fast and processor efficient parallel matrix multiplication algorithms on a linear array with a reconfigurable pipelined bus system, IEEE Trans. Parallel Distrib. Systems 9 (1998), 705–720.

25. B. M. Maggs and S. A. Plotkin, Minimum-cost spanning tree as a path-finding problem, Inform. Process. Lett. 26 (1988), 291–293.

26. M. Middendorf and ElGindy, Matrix multiplication on processor arrays with optical buses, Informatica (1999), to appear.

27. R. Miller, V. K. P. Kumar, D. Reisis, and Q. F. Stout, Image computations on reconfigurable VLSI arrays, in ``Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition,'' pp. 925–930, 1988.


28. H. Okayama, A. Matoba, R. Shibuya, and T. Ishida, Optical switch matrix with simplified N×N tree structure, J. Lightwave Technol. 7 (1989), 1023–1028.

29. Y. Pan and M. Hamdi, Quicksort on a linear array with a reconfigurable pipelined bus system, in ``Proc. International Symposium on Parallel Architectures, Algorithms and Networks,'' pp. 313–319, 1996.

30. Y. Pan and K. Li, Linear array with a reconfigurable pipelined bus system: concepts and applications, Inform. Sci. 106 (1998), 237–258.

31. Y. Pan, K. Li, and S. Q. Zheng, Fast nearest neighbor algorithms on a linear array with a reconfigurable pipelined bus system, Parallel Algorithms Appl. 13 (1998), 1–25.

32. S. Pavel and S. G. Akl, On the power of arrays with reconfigurable optical bus, in ``Proc. Int. Conf. on Parallel and Distributed Processing Techniques and Applications,'' pp. 1443–1454, 1996.

33. S. Pavel and S. G. Akl, Matrix operations using arrays with reconfigurable optical buses, Parallel Algorithms Appl. 8 (1996), 223–242.

34. S. Pavel and S. G. Akl, Integer sorting and routing in arrays with reconfigurable optical buses, Int. J. Foundations Comput. Sci. 9 (1998), 99–120.

35. C. Qiao and R. G. Melhem, Time-division communications in multiprocessor arrays, IEEE Trans. Comput. 42 (1993), 577–590.

36. C. Qiao, R. G. Melhem, D. Chiarulli, and S. Levitan, Dynamic reconfiguration of optically interconnected networks with time division multiplexing, J. Parallel Distrib. Comput. 22, 8 (1994), 268–278.

37. H. R. Tsai, S. J. Horng, S. S. Lee, S. S. Tsai, and T. W. Kao, Parallel hierarchical clustering algorithms on processor arrays with a reconfigurable bus system, Pattern Recognition 30 (1997), 801–815.

38. B. F. Wang and G. H. Chen, Constant time algorithms for transitive closure and some related graph problems on processor arrays with reconfigurable bus systems, IEEE Trans. Parallel Distrib. Systems 1 (1990), 500–507.
