
Parallel Computing 20 (1994) 313-324

Multiscattering on the Cube-Connected Cycles

Jian-jin Li *

Institut des Sciences de l'Ingénieur CUST, Campus Universitaire des Cézeaux, 24, avenue des Landais - B.P. 206, 63174 Aubière Cedex, France

(Received 15 April 1993; revised 21 June 1993)

Abstract

This paper presents several multiscattering algorithms on the Cube-Connected Cycles (CCC). We first implement a network-independent greedy algorithm. Then we propose two specialized algorithms for multiscattering on the CCC: the first approach uses only one hypercube link of each cycle, whereas the second approach uses all hypercube links at each phase. The theoretical formulas given in this paper show that multiscattering on the CCC with N = H·2^S (H ≥ S) processors can be performed in time [(H + 1)S + ⌈H/2⌉]β + [(H + 1)SH + ⌊H/2⌋(⌈H/2⌉ + 1)]2^{S-1}Lτ or [2S + ⌈S/2⌉]β + [(S² − 1)2^S + ⌊S/2⌋(⌈S/2⌉ + 1)2^{S-1} + 1]Lτ, where L is the message length, β is the startup and τ is the inverse of the bandwidth of communication. The difference between the complexity of the two approaches is due to the number of hypercube links used at each phase. We carry out experiments with the greedy algorithm on the ring and the exchange-perfect shuffle, and we compare the two multiscattering algorithms on the CCC with the greedy algorithm and the non-oriented ring algorithm on a transputer-based machine with 32 processors.

Key words: Multiscattering; Cube-Connected Cycles; Exchange-perfect shuffle; Complexity; Transputer networks

1. Introduction

In the paper [10], Saad and Schultz study various basic communication kernels in parallel architectures. They point out that interprocessor communication is often one of the main obstacles to increasing the performance of parallel algorithms for multiprocessors. They consider the following data exchange operations:

* Email: [email protected]

0167-8191/94/$07.00 © 1994 Elsevier Science B.V. All rights reserved. SSDI 0167-8191(93)E0083-8


(1) One-to-one: moving data from one processor to another.
(2) Broadcast: moving the same data packet from one processor to all the others.
(3) Total exchange: moving a data packet from each processor to every other processor. This is a broadcast operation (2) from each node.
(4) Scattering: a node sends a packet to every other processor. These packets, although different, are ideally all of the same size.
(5) Multiscattering (or data transposition): every processor sends a different packet to every other processor. This is in effect a scattering operation (4) from each node.

The communication model that we use is "store and forward". A packet of L words can be moved from one processor to any of its neighbors in time β + Lτ, where β is the startup and τ is the inverse of the bandwidth of communication. Each link connecting two processors can be used for the bidirectional transmission of data, and each processor can communicate at the same time with all its neighbors. On some machines, oriented (one-way) communication is faster than bidirectional communication. We can use this to increase the effectiveness of certain algorithms.
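As a small illustration of this cost model, the sketch below (in Python, with β and τ as placeholder machine constants rather than measured values, and transfer_time as a name of our own) shows how the startup term penalizes sending the same data as several small packets instead of one concatenated message; this is why the algorithms below concatenate messages on each link.

# Store-and-forward cost model: moving a packet of `length` words to a neighbour costs
# beta + length*tau; splitting it into `packets` pieces pays the startup `packets` times.
def transfer_time(length, beta, tau, packets=1):
    return packets * beta + length * tau

# Placeholder constants (startup in microseconds, inverse bandwidth in microseconds per word):
# transfer_time(1024, beta=100.0, tau=1.1) versus transfer_time(1024, beta=100.0, tau=1.1, packets=8)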

Multiscattering is clearly the most complex communication operation in the list. However, it arises very frequently, for instance when transposing a matrix [13] which is distributed by rows or columns among the processors, or when dynamically reallocating data to processors so as to balance the workload, or in the parallel image synthesis problem [6].

Saad and Schultz [10] have proposed multiscattering algorithms for various models of parallel architectures, such as the broadcast bus, the shared memory, the ring, the mesh-connected array, the hypercube, and the switch-connected model. In this paper, different solutions for implementing multiscattering are examined. We first implement a network-independent greedy algorithm. Then we propose two specialized algorithms for multiscattering on the CCC, and we compare them with the greedy algorithm on several topologies and with the multiscattering algorithm on the non-oriented ring. The experiments are carried out on the CCC, the ring and the exchange-perfect shuffle.

With a fixed number of processors and a given message size, the time for multiscattering depends on many parameters of the interconnection network, such as its diameter, its degree and its regularity. The complexity of scattering is not known for any interconnection network but the ring [10], even when all messages are of the same length. The problem of determining the best interconnection network of maximum degree Δ for multiscattering messages of length L among p processors is even more complex! This is why we first propose a general greedy algorithm that enables us to multiscatter on any interconnection network. We have not succeeded in giving a complexity analysis for this algorithm, but we shall report experimental data on several networks. Then we move to two specialized algorithms for the CCC, which is of degree 3 and still has a logarithmic diameter. The recursive nature of the CCC enables us to derive a timing formula for the performance of our algorithms. The first algorithm uses a fraction of the links at each phase, and the second uses all links at each step, similarly to the multiscattering


algorithms on the hypercube proposed by Johnsson and Ho [5]. We experimentally compare these algorithms with the greedy one, showing that the first offers around 60% and the second around 70% performance improvement.

In this paper, we let N be the number of processors, and we assume that all elementary messages have the same length L.

All our experiments have been performed on a T-node, a transputer-based machine with 32 processors and a reconfigurable interconnection network. Each transputer has four communication links.

2. A greedy algorithm for arbitrary interconnection networks

2.1. Algorithm

The problem can be stated formally as follows: there are p processors P_0, ..., P_{p-1}. Each processor P_i has a message m_i^j to send to P_j, for every j ≠ i. Assume that processor P_i has out(P_i) output links and in(P_i) input links. The greedy algorithm proceeds as follows:
(a) Each processor P_i computes a routing table, that is, an array A[1, ..., out(P_i)] of out(P_i) lists. For each output link k, 1 ≤ k ≤ out(P_i), the list A[k] is composed of the processor indices j such that a shortest path from P_i to P_j starts using link k.
(b) Each processor P_i sends out(P_i) messages. On each link k, the message is the concatenation (with adequate control) of all the messages m_i^j, j ∈ A[k].
(c) The algorithm then consists of several identical steps. At each step, each processor P_i receives some messages on its input links. P_i identifies the destination processors y of all the messages m_x^y that it has received. It stores the messages with y = i and composes out(P_i) messages using its routing table: for all k, 1 ≤ k ≤ out(P_i), it concatenates all the messages m_x^y such that y ∈ A[k]. Then P_i sends these messages.
(d) The total number of communication steps is equal to D, the diameter of the connected graph.

After one execution step, the message elements are brought one step closer to their destination. Since the largest distance along the shortest paths equals D, all message elements reach their destination after D steps.
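To make steps (b)-(d) concrete, here is a minimal sequential simulation of the forwarding process (a sketch in Python; the names and the dictionary-based data layout are ours, not the transputer implementation used in the experiments), assuming the routing tables of step (a) are already available.

# Sequential simulation of the greedy multiscattering, steps (b)-(d).
def greedy_multiscatter(neighbors, tables, diameter, messages):
    """neighbors[i][k] : processor reached through output link k of P_i
       tables[i][k]    : destinations j whose shortest path from P_i starts on link k
       messages[i]     : dict {j: payload} of the messages P_i has to send
       Returns received[j] = dict {i: payload} of the messages delivered to P_j."""
    p = len(neighbors)
    received = [dict() for _ in range(p)]
    holding = [[(i, j, messages[i][j]) for j in messages[i]] for i in range(p)]
    for _ in range(diameter):                    # step (d): D identical steps
        forwarded = [[] for _ in range(p)]
        for i in range(p):
            for src, dst, payload in holding[i]:
                if dst == i:                     # step (c): keep the messages addressed to me
                    received[i][src] = payload
                    continue
                for k, neigh in enumerate(neighbors[i]):
                    if dst in tables[i][k]:      # forward on the link whose list contains dst
                        forwarded[neigh].append((src, dst, payload))
                        break
        holding = forwarded
    for i in range(p):                           # messages delivered on the very last step
        for src, dst, payload in holding[i]:
            if dst == i:
                received[i][src] = payload
    return received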

For building up the routing table, we introduce the notion of the shortest paths tree. Given a connected graph G having N vertices P_0, P_1, ..., P_{N-1}, a shortest paths tree for the vertex P_i is a subgraph T of G satisfying the following conditions:
* T is a spanning tree of G,
* the path between P_j and P_i (0 ≤ j < N) in T is a shortest path between P_j and P_i within G.
A shortest paths tree and the diameter of the interconnection network are computed by the Dijkstra algorithm [1], where the function

cost: Vertex × Vertex → Real


is defined as follows:

cost(i, i) = 0, if 0 ≤ i < N;
cost(i, j) = 1, if there exists an output link of P_i towards P_j;
cost(i, j) = ∞, otherwise.
Then, the numbers of input and output links can be expressed as out(P_i) = |{j | cost(i, j) = 1, 0 ≤ j < N}| and in(P_i) = |{j | cost(j, i) = 1, 0 ≤ j < N}|. From the root of a shortest paths tree for P_i, we can obtain a forest by suppressing the processor P_i and its output links. Each tree of this forest represents one routing table entry A[k] of P_i, 1 ≤ k ≤ out(P_i).
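The routing tables and the diameter can be obtained as sketched below. Since every link has cost 1, a breadth-first search yields the same shortest paths trees as the Dijkstra computation; the adjacency-list input format neighbors[i] and the function name are assumptions of this sketch.

from collections import deque

# For every processor i, build the routing lists A[k] of step (a) and the network diameter.
# Removing P_i and its output links from the shortest paths tree gives a forest with one
# tree per output link; first_link records which tree each destination belongs to.
def routing_tables(neighbors):
    p = len(neighbors)
    tables, diameter = [], 0
    for i in range(p):
        dist, first_link = {i: 0}, {}
        queue = deque()
        for k, v in enumerate(neighbors[i]):      # one tree of the forest per output link
            if v not in dist:
                dist[v], first_link[v] = 1, k
                queue.append(v)
        while queue:
            u = queue.popleft()
            for w in neighbors[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    first_link[w] = first_link[u]  # stay in the same tree of the forest
                    queue.append(w)
        tables.append([[j for j in range(p) if first_link.get(j) == k]
                       for k in range(len(neighbors[i]))])
        diameter = max(diameter, max(dist.values()))
    return tables, diameter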

2.2. Performance measurements

It is clear that the degree of the interconnection network has a big influence on the running time of the greedy algorithm, so choosing an interconnection network with the highest possible degree is the first idea for obtaining the smallest multiscattering time. On our target machine, we must reserve a transputer which can communicate with the host machine. Thus the degree of our topology cannot be greater than 4, and one transputer must keep a degree smaller than 4. The exchange-perfect shuffle satisfies these conditions [12]. We recall its definition in the case of N processors numbered from 0 to N − 1:
* shuffle links: processor P_i is connected to P_{2i} if 0 ≤ i < N/2, and to P_{2i+1−N} if N/2 ≤ i < N;
* ring links: processor P_i is connected to P_{(i+1) mod N}, 0 ≤ i < N.
An example is shown in Fig. 1.
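A sketch of the corresponding adjacency lists (undirected links, as in the communication model of Section 1) might look as follows; exchange_perfect_shuffle is our own helper name, and the self-loops of processors 0 and N − 1 under the shuffle are simply dropped.

# Adjacency lists of the exchange-perfect shuffle on n processors (n even):
# one shuffle link per processor plus the ring links i <-> (i + 1) mod n.
def exchange_perfect_shuffle(n):
    adj = [set() for _ in range(n)]
    for i in range(n):
        shuffle = 2 * i if i < n // 2 else 2 * i + 1 - n
        adj[i].add(shuffle); adj[shuffle].add(i)          # shuffle link
        adj[i].add((i + 1) % n); adj[(i + 1) % n].add(i)  # ring link
    return [sorted(s - {i}) for i, s in enumerate(adj)]

# Example: the 16-processor network of Fig. 1, usable as input to the routing_tables sketch above.
# exchange_perfect_shuffle(16)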

We show that the degree of the interconnection network plays an important part in the multiscattering time, by comparing the multiscattering time on the oriented ring and on the non-oriented ring with the exchange-perfect shuffle. Given the number of processors, the degree of the oriented ring < the degree of the non-oriented ring < the degree of the exchange-perfect shuffle.


Fig. 1. An exchange-perfect shuffle with N = 16.



Fig. 2. Greedy multiscattering times on three topologies with N = 32.


Fig. 2 shows the multiscattering time of the greedy algorithm on the ring and the exchange-perfect shuffle. We can see that the exchange-perfect shuffle is always more efficient than the non-oriented ring, and the non-oriented ring is more efficient than the oriented ring.

3. Multiscattering on the cube-connected cycles

Let CCC(H, S) (for H ≥ S) be a CCC of dimension S with H nodes per cycle. The number of nodes is N = H·2^S. The nodes are grouped into 2^S cycles, each cycle consisting of H nodes. Each node can be expressed as a pair (h, s), where h refers to the address of the node within the cycle, and s refers to the address of the cycle to which the node belongs. Here h = 0, 1, ..., H − 1 and s = 0, 1, ..., 2^S − 1. A definition of a CCC(H, S) connection is given in [2] as follows:
* (h, s) is connected to its successor ((h + 1) modulo H, s),
* (h, s) is connected to its predecessor ((h − 1) modulo H, s),
* (h, s) is connected to (h, ξ_h(s)) if h < S, where ξ_h(s) denotes s with bit h complemented.
The nodes with h = 0, 1, ..., S − 1 have three ports: succ, pred, lat (mnemonic for successor, predecessor, and lateral), whereas the nodes with h = S, ..., H − 1 have only succ and pred ports.
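For concreteness, the neighbour computation can be written as the following small sketch, where the lateral neighbour ξ_h(s) is taken as s with bit h complemented (s XOR 2^h); the function name is ours, not the paper's.

# Neighbours of node (h, s) in CCC(H, S), following the three connection rules above.
def ccc_neighbors(h, s, H, S):
    succ = ((h + 1) % H, s)
    pred = ((h - 1) % H, s)
    if h < S:                           # only nodes 0..S-1 of a cycle have a lateral port
        lat = (h, s ^ (1 << h))         # xi_h(s): cycle address with bit h complemented
        return [succ, pred, lat]
    return [succ, pred]

# Example on CCC(4, 2): node (0, 1) is connected to (1, 1), (3, 1) and (0, 0).
# ccc_neighbors(0, 1, 4, 2)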

The diameter of a CCC is no less than twice the diameter of the hypercube of degree S [7]. It is equal to 6 if H = S = 3 and to H + S if H ≥ 2S − 1; closed-form expressions for the intermediate cases H = S > 3 and S + 1 ≤ H ≤ 2S − 2 are given in [7].



Fig. 3. A couple of cycles connected by link 0.

The structure of the CCC has a recursive nature: a CCC(H, S + 1) can be constructed by combining two CCC(H, S). We will use this recursive nature in our multiscattering algorithms. We first number the lateral links according to the dimension: the link connecting processor (d, s) to processor (d, ξ_d(s)), for d < S, is numbered d. We denote by m_{h,s}^{h',s'} the message to be sent by processor (h, s) to processor (h', s'), for all h, h' and all s, s', and by C_{d1}^{d2}(s) the sub-CCC CCC(H, S − d2 + d1 − 1) to which cycle s belongs when the lateral links d1, d1 + 1, ..., d2 (d2 ≥ d1) are suppressed.
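The sub-CCC notation can be read directly off the binary representation of the cycle addresses: with ξ_d(s) = s XOR 2^d, suppressing the lateral links d1, ..., d2 leaves two cycles in the same component exactly when they agree on bits d1 through d2. A small helper of our own, for illustration only:

# True when cycles s1 and s2 belong to the same sub-CCC C_{d1}^{d2}(.), i.e. when they
# agree on the bit positions d1..d2 whose lateral links have been suppressed.
def same_sub_ccc(s1, s2, d1, d2):
    mask = ((1 << (d2 - d1 + 1)) - 1) << d1
    return (s1 ^ s2) & mask == 0

# Example with link 0 suppressed: cycles 2 and 6 lie in the same sub-CCC, cycles 2 and 3 do not.
# same_sub_ccc(2, 6, 0, 0), same_sub_ccc(2, 3, 0, 0)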

3.1. First approach

3.1.1. Algorithm
We consider the case H ≥ S in this approach. Our algorithm is recursive in S steps. Each recursive step d corresponds to a data exchange between two cycles connected by link d. Fig. 3 shows an example of a pair connected by link 0, with H = 4.

The data exchange between two cycles s and ξ_d(s) consists of the message exchange between processors (h, s) and (h, ξ_d(s)), for 0 ≤ d ≤ S − 1. Suppose that processor (h, s) has a message M to exchange with processor (h, ξ_d(s)). The data-exchange algorithm between cycles s and ξ_d(s) can be described as:

algorithm data_exchange(M, d); (for each processor (h, s))
begin
  if h = d then M_lat = M; M_succ = ∅;
  else M_succ = M;
  for phase = 0 to H do
    if h = d then do in parallel
      send M_lat to processor (d, ξ_d(s));
      send M_succ to processor ((h + 1) mod H, s);
      M_lat = message received from processor ((h − 1) mod H, s);
      M_succ = message received from processor (d, ξ_d(s)) − messages for me;
    else do in parallel
      send M_succ to processor ((h + 1) mod H, s);
      M_succ = message received from processor ((h − 1) mod H, s) − messages for me;
  M = message received from processor ((h − 1) mod H, s) at phase H;


end data_exchange;

With this data-exchange algorithm for each step, the algorithm of the first approach can be described as follows:

algorithm MSccc1; (for each processor (h, s))
begin
  M' = {m_{h,s}^{h',s'} | (h', s') ∈ C_0^0(ξ_0(s))}; M = {m_{h,s}^{h',s'} | (h', s') ∈ C_0^0(s)};
  for d = 0 to S − 1 do in sequential
    data_exchange(M', d);
    if d < S − 1 then
      M' = {m_{h'',s''}^{h',s'} | m_{h'',s''}^{h',s'} ∈ {M, M'}, (h', s') ∈ C_0^{d+1}(ξ_{d+1}(s))};
      M = {m_{h'',s''}^{h',s'} | m_{h'',s''}^{h',s'} ∈ {M, M'}, (h', s') ∈ C_0^{d+1}(s)};
  multiscattering on cycle s with the message length 2^S L;
end MSccc1.

The final multiscattering on cycle s with message length 2^S L is necessary because M' does not cross all processors (h, ξ_{d+1}(s)) for 0 ≤ h < H at each phase d, so it is difficult for each processor to extract systematically the messages addressed to it as they pass by.

3.1.2. Complexity
This algorithm has S data-exchange phases plus a multiscattering on the ring, and each phase is composed of H + 1 steps. At each phase, the number of message elements is half that of the previous phase while their length is doubled, so the length of the message sent (or received) is always the same, equal to H·2^{S−1}L. Thus each step has the same processing time β + H·2^{S−1}Lτ. The multiscattering time on the ring of H processors with message length L is equal to T_R2(H, L) = ⌈H/2⌉β + (⌊H/2⌋(⌈H/2⌉ + 1)/2)Lτ.

The processing time T_C1 of the algorithm on a CCC is

T_C1(H, S, L) = [(H + 1)S + ⌈H/2⌉]β + [(H + 1)SH + ⌊H/2⌋(⌈H/2⌉ + 1)]·2^{S−1}Lτ.
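The closed form can be evaluated numerically with a few lines of Python; t_ring implements the ring multiscattering formula quoted above, and the constants β and τ must be supplied by the caller (the values shown in the example are placeholders, not measurements from the T-node).

from math import ceil, floor

# Closed-form timings of this section: multiscattering on a non-oriented ring of p processors,
# and the first approach on CCC(H, S): S data-exchange phases of H + 1 steps, followed by a
# multiscattering on each cycle with message length 2^S * L.
def t_ring(p, L, beta, tau):
    return ceil(p / 2) * beta + (floor(p / 2) * (ceil(p / 2) + 1) / 2) * L * tau

def t_ccc1(H, S, L, beta, tau):
    return (H + 1) * S * (beta + H * 2 ** (S - 1) * L * tau) + t_ring(H, 2 ** S * L, beta, tau)

# Example with placeholder constants (beta in us, tau in us/byte):
# t_ccc1(4, 4, 256, beta=100.0, tau=1.1)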

3.1.3. Performance measurements
We report timings for the general greedy algorithm on the CCC and the exchange-perfect shuffle networks, and for the special multiscattering algorithms on the CCC and on the non-oriented ring [6,10] (see Fig. 4).

The right part is a close-up of the section [4...260] of the left part. In the left part, we can see the asymptotic behavior of the multiscattering curves. The multiscattering time on a non-oriented ring is always the smallest except when the message length is very large; in this case, the greedy algorithm on the exchange-perfect shuffle performs better.


Fig. 4. Multiscattering times of the first approach and the greedy algorithm on the CCC.

The greedy algorithm is slower than the first approach except when the message length is very large, because our algorithm has redundant communication in order to regularize the algorithm. Redundant communication means that a processor has not taken the messages addressed to it when these messages crossed it, which forces us to perform a multiscattering on each cycle


at the end of the algorithm. The larger the message size, the bigger the redundant communication. The greedy algorithm on the exchange-perfect shuffle is better than the special algorithm on the non-oriented ring, since the degree of the non-oriented ring is smaller than that of the exchange-perfect shuffle, and the processing time gained by the special algorithm on the non-oriented ring is outweighed by the increase of the message size. In the right part, we clearly see that the greedy algorithm strongly depends upon the degree of the interconnection network: the lower the degree, the higher the processing time.

Note that 32 processors is too small a number to illustrate the comparison between the special algorithms on the non-oriented ring and on the CCC. For further explanation, we compare the theoretical formulas for the multiscattering time T_C1 on the CCC and the multiscattering time T_R2(H·2^S, L) on the non-oriented ring. We have

T_R2(H·2^S, L) = H·2^{S−1}(β + ((H·2^{S−1} + 1)/2)Lτ).

Consider the case where H = S and L is large enough. With H = S and the experimental value τ = 1.1 μs/byte, we obtain T_C1 < T_R2 when S ≥ 4. In other words, as soon as the number of processors is at least 64, multiscattering is faster on the CCC than on the non-oriented ring.

3.2. Second approach

3.2.1. Algorithm
In this approach, we limit ourselves to the CCC with H = S. As with the second algorithm on the hypercube, the idea is to use all hypercube links at each phase of message transmission between hypercube-node pairs, while respecting the recursive nature of the CCC(H, S). In this approach, however, the messages are not cut into S parts, because each hypercube node of CCC(S, S) has S processors and the communication along each hypercube link can only use the elementary messages of the corresponding processor. The algorithm proceeds as follows (see algorithm MSccc2):
(a) Initialize M_{hs} = {m_{h,s}^{h',s'} | (h', s') ∈ C_h^h(ξ_h(s))} and N_{hs} = {m_{h,s}^{h',s'} | (h', s') ∈ C_h^h(s)} (a small sketch of this initialization follows this list).
(b) In the first step, each processor (h, s) exchanges the message M_{hs} with processor (h, ξ_h(s)), sends the message N_{hs} to processor ((h + 1) mod S, s), and receives the message from processor ((h − 1) mod S, s). In the second step, each processor (h, s) sends the message received from (h, ξ_h(s)) to processor ((h + 1) mod S, s), except for the messages destined to itself, and receives the message from processor ((h − 1) mod S, s).
(c) Recompute the messages M_{hs} and N_{hs} from the two messages received from processor ((h − 1) mod S, s) at phase (b), and repeat phase (b) S − 1 times.
(d) Perform a multiscattering on each cycle.
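A sketch of the initialization of step (a), representing each elementary message simply by its destination pair and assuming, as in this section, that H = S so that every node has a lateral link (the helper name is ours):

# Initial message sets of processor (h, s): M crosses the lateral link of dimension h
# (destination cycles whose bit h differs from s), N stays on the cycle.
def initial_sets(h, s, H, S):
    M, N = [], []
    for s2 in range(2 ** S):
        for h2 in range(H):
            if (h2, s2) == (h, s):
                continue
            (M if ((s ^ s2) >> h) & 1 else N).append((h2, s2))
    return M, N

# Example on CCC(3, 3): processor (0, 0) sends the destinations of odd-numbered cycles across link 0.
# initial_sets(0, 0, 3, 3)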


3.2.2. Complexity
This algorithm has S similar phases plus a multiscattering on the ring, and each phase is composed of 2 steps. At each phase, M_{hs} has half as many message elements as at the previous one, but the message element length is doubled, so that the processing time of the first step is still the same, that is β + S·2^{S−1}Lτ. At the second step, there are 2^i messages intended for processor (h, s) among the S·2^{S−1} messages received from processor (h, ξ_h(s)) at the first step, so the processing time of this step is β + (S·2^{S−1} − 2^i)Lτ, where i = 0, ..., S − 1 is the phase number. At the ith phase, 2^i messages at the first step and 2^{i−1} messages at the second step reach their destination for each processor. So the multiscattering performed at the end of the algorithm is also different for each S. For example, it is sufficient for each processor to perform one transmission of length 4L to its successor when S = 3, but a multiscattering with messages of length 2^{S−1}L is necessary when S = 5. For the general case, we take T_R2(S, 2^S L) as the multiscattering time on the ring. The processing time for this second approach is

T_C2(S, L) = 2S(β + S·2^{S−1}Lτ) − (2^S − 1)Lτ + T_R2(S, 2^S L)
           = [2S + ⌈S/2⌉]β + [(S² − 1)2^S + ⌊S/2⌋(⌈S/2⌉ + 1)2^{S−1} + 1]Lτ.

algorithm MSccc2; (for each processor (h, s))
begin
  M_send = {m_{h,s}^{h',s'} | (h', s') ∈ C_h^h(ξ_h(s))}; M'_send = {m_{h,s}^{h',s'} | (h', s') ∈ C_h^h(s)};
  for d = 0 to S − 1 do in sequential
    in parallel
      send M_send to processor (h, ξ_h(s));
      send M'_send to processor ((h + 1) mod H, s);
      M_recv = message received from processor (h, ξ_h(s));
      M'_recv = message received from processor ((h − 1) mod H, s);
    in parallel
      send M_recv to processor ((h + 1) mod H, s);
      M_recv = message received from processor ((h − 1) mod H, s);
    if d < S − 1 then
      M_send = {m_{h'',s''}^{h',s'} | m_{h'',s''}^{h',s'} ∈ {M_recv, M'_recv}, (h', s') ∈ C_h^h(ξ_h(s))};
      M'_send = {m_{h'',s''}^{h',s'} | m_{h'',s''}^{h',s'} ∈ {M_recv, M'_recv}, (h', s') ∈ C_h^h(s)};
  multiscattering of M_recv and M'_recv on cycle s;
end MSccc2;

3.2.3. Performance measurements
Next, we report the experiments with the second approach on a CCC of 24 processors, and compare them with those of the unilateral CCC (first approach) and the non-oriented ring (see Fig. 5). The right part is a close-up of the section [4...260] of the left part.


Fig. 5. Multiscattering times on the ring and CCC.


The second approach performs twice as efficiently as the first approach on the unilateral CCC. Its processing time is less than that of the non-oriented ring when the message length L is less than some threshold L_0. When the number of processors S·2^S is fixed, we have L_0 = [S·2^{S−1} − 2S − ⌈S/2⌉]β / {[(S² − 1)2^S + ⌊S/2⌋(⌈S/2⌉ + 1)2^{S−1} + 1 − S²·2^{2S−3} − S·2^{S−2}]τ}. Theoretically, L_0 is positive for S > 3, but in our experiments it is already positive for S = 3, because we have used two unidirectional hypercube links instead of one bidirectional hypercube link. This choice is based on a property of the T-node: communication in one direction is faster than communication in two directions. In the right part, we can see that this threshold is around 130 bytes. For L > L_0, the non-oriented ring always performs more efficiently than the others. Note that L_0 increases as the number of processors increases.


4. Conclusion

We have developed two multiscattering algorithms on the CCC, similar to those on the hypercube. However, there is a difference of a factor log N between the complexity of our multiscattering on the CCC and that on the hypercube, because a CCC has H processors per cycle. But with the same number of processors, the degree of the CCC is fixed while that of the hypercube is logarithmic in N. The first algorithm, although less efficient, offers around a 60% performance improvement in comparison with the greedy algorithm. When H is close to S, multiscattering with the first approach on the CCC is faster than on the non-oriented ring as soon as the number of processors is at least 64. The second approach performs more efficiently than the non-oriented ring when the message length is less than some threshold; otherwise, up to 32 processors, the non-oriented ring is always the most efficient.

Through the experiments reported in this paper, we have shown that the greedy algorithm is useful when the message length is large enough, or in situations where the message lengths are unpredictable. In addition, since the diameter of the interconnection network has a big influence on the greedy algorithm, choosing the network with the smallest diameter is quite a natural idea.

5. References

[1] T.H. Cormen, C.E. Leiserson and R.L. Rivest, Introduction to Algorithms (MIT Press, Cambridge, MA, 1990).

[2] P. Banerjee, The Cubical Ring Connected Cycles: A fault-tolerant parallel computation network, IEEE Trans. Comput. 37, (5) (May 1988) 632-636.

[3] P. Fraigniaud, S. Miguet and Y. Robert, Scattering on a ring of processors, Parallel Comput. 13 (1990) 377-383.

[4] B.N. Jain, An analysis of Cube-Connected Cycles and Circular Shuffle Networks for Parallel Computation, J. Parallel Distributed Comput. 5 (1989) 741-754.

[5] S.L. Johnsson and C.-T. Ho, Optimum broadcasting and personalized communication in hyper- cubes, IEEE Trans. Comput. 38, (9) (Sep. 1989) 1249-1268.

[6] J.-J. Li and S. Miguet, Z-buffer on a transputer-based machine, Proc. DMCC6, Q. Stout et al., eds. (IEEE Computer Society Press 1991) 315-322.

[7] D.S. Meliksetian and C.Y. Roger Chen, Communication aspects of the cube-connected cycles, 1990 Internat. Conf. on Parallel Processing (1990) I-579-I-560.

[8] D.A. Nicole, Esprit Project 1085, Reconfigurable transputer processor architecture, in CONPAR 88, C.R. Jesshope et al., eds. (Cambridge University Press 1989) 81-89.

[9] F.P. Preparata and J. Vuillemin, The cube-connected cycles: A versatile network for parallel computation, Commun. ACM 24 (May 1981) 300-309.

[10] Y. Saad and M.H. Schultz, Data communication in parallel architectures, Parallel Comput. 11 (1989) 131-150.

[11] M. Simmen, Comments on broadcast algorithms for two-dimensional grids, Parallel Comput. 17 (1991) 109-112.

[12] H.S. Stone, Parallel processing with the Perfect Shuffle, IEEE Trans. Comput. C-20 (2) (Feb. 1971) 153-161.

[13] E.F. Van de Velde, Data redistribution and concurrence, Parallel Comput. 16, (1990) 125-138.