
Broadcasting in all-output-port cube-connected cycles with distance-insensitive switching


J. Parallel Distrib. Comput. 66 (2006) 1103–1110 www.elsevier.com/locate/jpdc


Petr Salinger a, Pavel Tvrdík b,∗

a T-Systems Czech, Belehradska 126, 120 00 Praha 2, Czech Republic

b Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University, Karlovo nám. 13, 121 35, Praha 2, Czech Republic

Received 23 December 2004; received in revised form 28 July 2005; accepted 9 May 2006

Abstract

In this paper, we consider the problem of one-to-all broadcast (OAB) in an interconnection network with the topology of the n-dimensional cube-connected cycles, $CCC_n$, under the following conditions. Routers use distance-insensitive switching, e.g., wormhole. An OAB proceeds in rounds that consist of message passing between pairs of nodes using disjoint paths. Routers also have all-output-port functionality, i.e., they can send a packet via all output ports simultaneously. In $CCC_n$, this assumption implies that the lower bound on the number of rounds of the OAB is $\log_4(|V(CCC_n)|)$. The main result of this paper is an algorithm for the OAB in $CCC_n$ whose number of rounds is greater than the lower bound by at most 1 for all $n < 123$. The algorithm is depth-contention-free and therefore, it works correctly even if nodes participating in the broadcast run asynchronously. We also show how to make the algorithm deadlock-free.

© 2006 Elsevier Inc. All rights reserved.

Keywords: Cube-connected-cycles; One-to-all broadcast; Distance-insensitive switching; All-output-port capability; Depth-contention-free communication; Deadlock-free communication

1. Introduction

The one-to-all broadcast (OAB) is one of the most important collective communication operations. One source node disseminates the same information to all other nodes. The OAB appears in many parallel algorithms in linear algebra, neural networks, optimization problems, and so on. Algorithms for the OAB are an important part of communication libraries.

Efficient OAB algorithms depend strongly on the hardware capabilities of the communication subsystem, mainly on the switching techniques and channel utilization. Today, most interconnection networks of massively parallel computers implement some kind of distance-insensitive message switching, such as wormhole or virtual cut-through. We assume that edges of interconnection topology graphs represent full-duplex channels and each channel consists of two antiparallel unidirectional channels, called links. Another common assumption is that a collective communication operation consists of rounds. One round consists of link-disjoint paths connecting pairs of simultaneously communicating processors. In a round, any node, s, can communicate with any other node, d, as long as during that round no other source-destination pair communication path shares an edge with the path from s to d. The validity of this model is attributed to the experimental results demonstrating the distance insensitivity of wormhole routing presented in [6]. These networks are referred to as WH networks in the further text.

This research has been supported by MŠMT under research program MSM 6840770014, by ČVUT under grant #300009203, and by FRVŠ under grant #580/2000.

∗ Corresponding author. Fax: +420 224 923 325. E-mail addresses: [email protected] (P. Salinger), [email protected] (P. Tvrdík).

0743-7315/$ - see front matter © 2006 Elsevier Inc. All rights reserved. doi:10.1016/j.jpdc.2006.05.003

If a link contention between paths from different sources in successive rounds can appear, the rounds must be synchronized, for example, by barriers. Then a new round can start only after all the communication paths from the previous rounds have been released. However, barriers may introduce substantial overhead. If no collision between paths can appear even if the communication paths are established asynchronously, then the algorithm is said to be depth-contention-free. This property leads to both implementation simplicity and efficiency. Each node can start a new round independently of the other nodes, as soon as it receives the packet or finishes its previous round.

Another common feature of routers is the all-output-port capability, or more generally, the multicast capability. A packet received by a router in round i is stored in the memory of its processor and it can be injected into the network repeatedly via all its output external channels in rounds $k > i$. This capability implies a very strong lower bound on the number of rounds for the OAB.

Lemma 1. If G is an N-node d-regular network, then the lower bound on the number of rounds is $\tau_{OAB}(G) = \lceil \log_{d+1} N \rceil$.

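Proof. The best possible scenario is clearly the following: each node, once informed, is able to inform d uninformed nodes in every remaining round. Then $n_t$, the number of informed nodes after round t, is $n_t = n_{t-1} + d \cdot n_{t-1} = (d+1)\, n_{t-1} = (d+1)^t$, since $n_0 = 1$. All N nodes can be informed only when $(d+1)^t \ge N$, that is, when $t \ge \lceil \log_{d+1} N \rceil$. □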
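The counting argument of Lemma 1 can be checked numerically. The following is a minimal sketch, not part of the original paper; the helper names `ceil_log`, `oab_lower_bound`, and `informed_after` are illustrative assumptions. Integer arithmetic is used so that the ceiling of the logarithm is exact.

```python
def ceil_log(base: int, x: int) -> int:
    """Return the smallest integer r with base**r >= x (for x >= 1)."""
    r, power = 0, 1
    while power < x:
        power *= base
        r += 1
    return r


def oab_lower_bound(num_nodes: int, degree: int) -> int:
    """Lemma 1: the informed set grows at most (degree + 1)-fold per round."""
    return ceil_log(degree + 1, num_nodes)


def informed_after(rounds: int, degree: int) -> int:
    """Best case n_t = n_{t-1} + d * n_{t-1}, starting from n_0 = 1."""
    n_t = 1
    for _ in range(rounds):
        n_t += degree * n_t
    return n_t


# Example: CCC_3 has 3 * 2**3 = 24 nodes and is 3-regular.
print(oab_lower_bound(24, 3))  # 3
```

Two rounds reach at most 16 nodes in a 3-regular network, so 24 nodes need at least three rounds.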

Asymptotically or nearly optimal OAB algorithms have been designed for several regular topologies. An asymptotically optimal OAB algorithm for the n-dimensional binary hypercube $Q_n$ has been described in [4]. It uses e-cube routing. A nearly optimal algorithm for the hypercube is given in [1]. For $n \le 31$, the algorithm needs at most one extra round with respect to the trivial lower bound $\tau_{OAB}(Q_n) = \lceil n / \log_2(n+1) \rceil$, but it does not use e-cube routing. Significant effort has been spent on finding optimal algorithms for low-dimensional meshes and tori. An elegant and nearly optimal OAB algorithm for 2-D tori is described in [9]. It requires at most two (five) rounds more than the trivial lower bound $\lceil \log_5 N \rceil$ for N-node square (rectangular, respectively) 2-D tori. It again uses dimension-ordered routing. Nearly optimal OAB algorithms for 2-D and 3-D meshes of trees have been given in [7].

In this paper, we propose a nearly optimal OAB algorithm in all-output-port WH n-dimensional cube-connected cycles.¹ For network instances of practical sizes, our algorithm needs at most one extra round compared to the trivial lower bound. It is also noteworthy that our algorithm uses dimension-ordered routing and is depth-contention-free. Since cube-connected cycles are spanning subgraphs of wrapped butterflies, the same algorithm provides an efficient OAB algorithm in wrapped butterflies.

2. Cube-connected cycles and the OAB problem

Let $B$ denote the binary alphabet $\{0, 1\}$ and $B^n$ the set of all binary strings of n bits. The inversion of bit i in string $c \in B^n$ is written as $neg_i(c)$. The n-dimensional cube-connected cycles, denoted by $CCC_n$, belong to sparse hypercubic networks, a network family that also contains butterflies. $CCC_n$ are derived from the n-dimensional binary hypercube by replacing each hypercube node with an n-node cycle so that each cycle node is incident with a single hypercube edge (see Fig. 1). Hence, $CCC_n$ have $n 2^n$ nodes and $3n 2^{n-1}$ edges. A node is labeled with a pair $(i, c)$ where c is an n-bit binary string denoting the address of its cycle and i, $0 \le i \le n-1$, is its offset within this cycle. Two nodes $(i_1, c_1)$ and $(i_2, c_2)$ are adjacent if and only if either $c_1 = c_2$ and $i_1 - i_2 \equiv \pm 1 \pmod{n}$, or $i_1 = i_2$ and $c_1$ and $c_2$ differ only in bit $i_1$. Being a 3-regular node-symmetric graph, $CCC_n$ are the sparsest hypercubic network. They are a spanning subgraph of the wrapped butterfly of the same dimension, which is a 4-regular graph. It is known that all sparse hypercubic networks are quasiisometric. Cube-connected cycles are weakly-orthogonal: a displacement in a cycle or along a hypercube edge does not change the other coordinates, but from a given node only one hypercube dimension can be traversed. Routing consists in alternating hypercube edges and cycle subpaths. A path from node a to node b will be denoted by $P(a, b)$.

Fig. 1. Cube-connected cycles $CCC_3$ (cycle addresses 000, 001, 011, 010, 110, 111, 101, 100; cycle offsets 0, 1, 2).

¹ This is a revised version of conference paper "Broadcasting in all-output-port cube-connected-cycles with distance-insensitive routing" by the same authors that appeared in SIROCCO 8, pp. 321–336. Carleton Scientific, Canada, 2001.
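The adjacency rule above translates directly into code. The sketch below is illustrative only; the function names and the encoding of a cycle address c as an n-bit integer (bit i of the integer is bit i of c) are assumptions of this example, not notation from the paper.

```python
from itertools import product


def ccc_neighbors(i: int, c: int, n: int):
    """Neighbors of node (i, c) in CCC_n; c is the cycle address as an int."""
    return {
        ((i + 1) % n, c),    # cycle edge to the next offset
        ((i - 1) % n, c),    # cycle edge to the previous offset
        (i, c ^ (1 << i)),   # hypercube edge: invert bit i of c
    }


def build_ccc(n: int):
    """Adjacency map of CCC_n (n >= 3, so the three neighbors are distinct)."""
    return {
        (i, c): ccc_neighbors(i, c, n)
        for i, c in product(range(n), range(2 ** n))
    }


graph = build_ccc(3)
num_edges = sum(len(nbrs) for nbrs in graph.values()) // 2
print(len(graph), num_edges)  # 24 36
```

For n = 3 this reproduces the counts $n 2^n = 24$ nodes and $3n 2^{n-1} = 36$ edges stated in the text.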

In the all-output-port $CCC_n$ with distance-sensitive switching, such as store-and-forward, the trivial flooding OAB algorithm is optimal. The number of rounds is equal to the diameter. The same holds for the 2-output-port model. In the 1-port model, the best known algorithm requires two more rounds [3]. Algorithms for all-to-all broadcast in the 1-port combining $CCC_n$ have been studied for both half- and full-duplex models in [5].

In 1-port WH $CCC_n$, there is a simple and optimal algorithm. In the first n rounds, each cycle is informed by the standard hypercube OAB algorithm based on the spanning binomial tree, and the rest of the nodes are informed in $\lceil \log_2 n \rceil$ rounds within cycles.

Paper [8] gives an efficient unicast-based multicast algorithm in 1-port WH $CCC_n$. A multicast within a group of m nodes can be achieved in $\lceil \log_2 m \rceil$ rounds supposing the hardware provides two virtual channels per one physical cycle channel. The algorithm is not depth-contention-free.

3. Preliminary results

3.1. OAB in linear arrays and cycles from internal sources

Consider an N-node linear array A (1-D mesh) with nodes labeled consecutively $0, \ldots, N-1$. Given two integers $b \ge 0$ and $1 \le c \le N - 1 - b$, we define the (b, c)-subarray of A to be the part of A, starting in node b and having c nodes. The leader of the (b, c)-subarray is the central node $leader(b, c) = b + \lfloor c/2 \rfloor$.


Fig. 2. Pseudocode of algorithm PathOAB.

If the source of an OAB is in the middle of a linear array, then the linear array can be split into three equal parts (up to 1 node). The source can inform the two boundary parts in 1 round and the same algorithm is applied recursively in all three parts in parallel. The algorithm is therefore depth-contention-free. A pseudocode for such a 3-ary decomposition algorithm, called PathOAB, is given in Fig. 2. The OAB from node $\lfloor N/2 \rfloor$ is performed by executing PathOAB(0, N). The algorithm proves constructively the following statement.

Lemma 2. An OAB in an all-output-port WH linear array of N nodes from source node $\lfloor N/2 \rfloor$ can be done in $\lceil \log_3 N \rceil$ rounds, which is optimal, using a depth-contention-free algorithm.
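The 3-ary decomposition behind Lemma 2 can be sketched in code. The pseudocode of Fig. 2 is not reproduced in this text, so the following is a hedged reconstruction: the function names, the schedule format, and in particular the rule for splitting c nodes into three parts are assumptions of this sketch, chosen so that the leader of the middle part coincides with the current source.

```python
def ceil_log(base, x):
    """Smallest integer r with base**r >= x (for x >= 1)."""
    r, power = 0, 1
    while power < x:
        power *= base
        r += 1
    return r


def leader(b, c):
    """Central node of the (b, c)-subarray: b + floor(c / 2)."""
    return b + c // 2


def path_oab(b, c, rnd=1, schedule=None):
    """Collect (round, source, destination) triples for a broadcast in the
    (b, c)-subarray whose leader already holds the packet."""
    if schedule is None:
        schedule = []
    if c <= 1:
        return schedule
    src = leader(b, c)
    if c == 2:                       # leader is b + 1; only node b is left
        schedule.append((rnd, src, b))
        return schedule
    q, r = divmod(c, 3)
    left = q + (1 if r == 2 else 0)  # assumed split: sizes differ by <= 1
    mid = q + (1 if r == 1 else 0)
    parts = [(b, left), (b + left, mid), (b + left + mid, c - left - mid)]
    assert leader(*parts[1]) == src  # the source leads the middle part
    schedule.append((rnd, src, leader(*parts[0])))
    schedule.append((rnd, src, leader(*parts[2])))
    for pb, pc in parts:             # the three parts proceed independently
        path_oab(pb, pc, rnd + 1, schedule)
    return schedule


schedule = path_oab(0, 13)
print(max(rnd for rnd, _, _ in schedule))  # 3
```

Because every recursive call works inside its own subarray, paths created in different parts never share a link, which is the property the lemma relies on.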

Since an N-node cycle is node-symmetric, we get easily

Corollary 3. An OAB in an all-output-port WH cycle of N nodes from any internal source s can be done in $\lceil \log_3 N \rceil$ rounds, using a depth-contention-free algorithm.

Let us denote the corresponding algorithm by CycleOAB(s, N).

3.2. OAB in linear arrays and cycles from external sources

Our algorithm for the OAB in $CCC_n$ will induce OABs in cycles from external sources. Let ext(i) denote an external node connected with node i of a linear array or cycle. Let us start with linear arrays.

Lemma 4. An OAB in an all-output-port WH linear array of N nodes with the source ext(0) can be done within $\lceil \log_3(2N+1) \rceil$ rounds, which is optimal, using a depth-contention-free algorithm.

Proof. The lower bound on the number of rounds is as follows: let $t_i$ be the number of informed nodes after round i. Then $t_0 = 0$ and for $i \ge 1$, $t_i = 3 \cdot t_{i-1} + 1$, since each internal informed node can inform two new nodes and the external node can inform one new node. The least i such that $t_i = (3^i - 1)/2 \ge N$ is $i = \lceil \log_3(2N+1) \rceil$. Note that $\lceil \log_3(2N+1) \rceil \le 1 + \lceil \log_3 N \rceil$.

The pseudocode of the algorithm ExtPathOAB, which proves constructively the lemma, is described in Fig. 3. It has one parameter count, which is again to be substituted by the linear array size, i.e., N. The external node disseminates the information in the ever diminishing subarrays taken from the main linear array from the right to the left towards the node 0, whereas in the parts with informed leaders, the algorithm PathOAB from Fig. 2 completes the broadcast. Fig. 4 shows three rounds of ExtPathOAB(13). Similarly to the algorithm PathOAB, since possibly concurrent OABs operate within disjoint subarrays, the algorithm ExtPathOAB is depth-contention-free. □
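The recurrence in the proof and the shrinking subarrays can be illustrated as follows. Since Fig. 3 is not reproduced in this text, the greedy split below is an assumption of this sketch (as are all function names): in round i the external source is taken to inform the leader of a subarray of at most $3^{R-i}$ nodes cut from the right end, where R is the total number of rounds, so that the 3-ary decomposition can finish that subarray in the remaining $R - i$ rounds.

```python
def ceil_log(base, x):
    """Smallest integer r with base**r >= x (for x >= 1)."""
    r, power = 0, 1
    while power < x:
        power *= base
        r += 1
    return r


def informed_bound(i):
    """t_i = 3 * t_{i-1} + 1 with t_0 = 0: nodes informable within i rounds."""
    t = 0
    for _ in range(i):
        t = 3 * t + 1
    return t


def ext_path_subarrays(N):
    """Return (R, plan) where plan lists (round, first node b, size c)."""
    R = ceil_log(3, 2 * N + 1)
    plan, end = [], N                  # nodes end..N-1 are already assigned
    for i in range(1, R + 1):
        c = min(3 ** (R - i), end)     # assumed greedy size for round i
        if c == 0:
            break
        plan.append((i, end - c, c))
        end -= c
    return R, plan


print(ext_path_subarrays(13))  # (3, [(1, 4, 9), (2, 1, 3), (3, 0, 1)])
```

For N = 13 the three subarrays have 9, 3, and 1 nodes, which matches the 3-round broadcast mentioned for ExtPathOAB(13).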

A similar approach applies to cycles, which are node-symmetric.

Corollary 5. An OAB in an all-output-port WH cycle of N nodes with the source ext(x) can be done within $\lceil \log_3(2N+1) \rceil$ rounds.

Let us denote the corresponding OAB algorithm by ExtCycleOAB(x, N). By applying Lemma 4 twice, we get

Corollary 6. An OAB in an all-output-port WH cycle of N nodes where N is even (odd) from two external sources ext(x) and $ext((x + \lfloor N/2 \rfloor) \bmod N)$ can be done in $\lceil \log_3(N+1) \rceil$ ($\lceil \log_3(N+2) \rceil$, respectively) rounds.

Fig. 3. Pseudocode of algorithm ExtPathOAB.

Fig. 4. The whole broadcast tree of the 3-round ExtPathOAB(13) (array nodes 0 to 12).

Let us denote the corresponding algorithm by Ext2CycleOAB(x, N).

4. The proposed OAB algorithm in CCCn

From the previous results, the lower bound on the number of rounds for OAB in the all-output-port WH cube-connected cycles follows easily.

Corollary 7 (of Lemma 1). $\tau_{OAB}(CCC_n) = \lceil \log_4 |V(CCC_n)| \rceil = \lceil (n + \log_2 n)/2 \rceil$.
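The second equality can be spelled out in one line. The derivation below only uses facts stated earlier in the text ($CCC_n$ is 3-regular and has $n 2^n$ nodes); the worked value for n = 3 is added here as an illustration.

```latex
\[
\left\lceil \log_4 |V(CCC_n)| \right\rceil
  = \left\lceil \log_4 \left( n\,2^n \right) \right\rceil
  = \left\lceil \frac{\log_2 n + \log_2 2^n}{\log_2 4} \right\rceil
  = \left\lceil \frac{n + \log_2 n}{2} \right\rceil ,
\qquad
n = 3:\ \left\lceil \frac{3 + 1.585}{2} \right\rceil = \lceil 2.29 \rceil = 3 .
\]
```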

Our main result is a nearly optimal OAB algorithm in the all-output-port WH $CCC_n$. It uses the standard minimal routing in $CCC_n$. The corresponding routing function is denoted by R. If there are two shortest paths for a given source-destination pair, the algorithm always chooses one of them by deciding the output port so that all paths starting from a given source in a given round are minimal and link-disjoint. Since $CCC_n$ are node-symmetric, we assume without loss of generality that the source is node $(0, 0^n)$. The algorithm depends on the parity of n. The variants for odd n and even n are described by pseudocodes in Figs. 5 and 7, respectively. The main idea of both variants is to disseminate the packet in the fastest possible way into a sufficiently large number of cycles (more precisely, their nodes with index 0) so that this subset of informed cycles is uniformly distributed across the whole network. After this first phase, the broadcast is finished within these cycles or within cycles nearby by the trivial OAB in cycles from internal or external sources, as explained in Section 3. The key to the appropriate selection of the informed cycles in the first phase is a function that, for an already informed cycle, generates three new uninformed cycles in every subsequent round, so that the number of informed cycles quadruples in every round. Such a function is defined below.

Definition 8. Let $n \ge 3$ and $0 < i < n$ be integers. Let $s = (0, c)$ be a node of $CCC_n$, $c \in B^n$. Then $dst(s, i, n)$ is defined as a set of three nodes:

$$dst(s, i, n) = \{(0, neg_i(neg_0(c))),\ (0, neg_{n-i}(neg_i(c))),\ (0, neg_{n-i}(c))\}.$$

Let us now describe and prove the algorithms, first for odd n and then for even n.

4.1. n is odd

Theorem 9. If n is odd, then the algorithm OddCCCOAB(n) whose pseudocode is in Fig. 5 completes an OAB in an all-output-port WH $CCC_n$ in $(n-1)/2 + \lceil \log_3(2n+1) \rceil$ rounds. Moreover, it is a depth-contention-free algorithm.

Fig. 5. Pseudocode of algorithm OddCCCOAB.

Proof. Let us first prove that at the end of Phase 1, exactly one half of the cycles of $CCC_n$ contain exactly one informed leader and exactly one half of the cycles are uninformed. More specifically, let us prove that the cardinality of S, the set of all informed leaders, at the end of Phase 1 is $2^{n-1}$ and

$$\forall w \in B^{n-1}\ \exists \beta \in B \text{ such that } (0, w\beta) \in S \text{ and } (0, w\bar{\beta}) \notin S, \qquad (1)$$

which means that each informed cycle has one uninformed adjacent cycle. (1) can be proved inductively by proving that at the end of round j, $1 \le j \le \frac{n-1}{2}$, of Phase 1

$$\forall x, y \in B^{j}\ \exists \beta \in B \text{ such that } (0, 0^k x y 0^k \beta) \in S \text{ and } (0, 0^k x y 0^k \bar{\beta}) \notin S, \qquad (2)$$

where $k = (n-1)/2 - j$. This statement holds at the end of round $j = 1$, since

$$S = \{(0, 0^n)\} \cup dst((0, 0^n), \tfrac{n-1}{2}, n) = \{(0, 0^{\frac{n-1}{2}-1}\,00\,0^{\frac{n-1}{2}-1}\,0),\ (0, 0^{\frac{n-1}{2}-1}\,01\,0^{\frac{n-1}{2}-1}\,1),\ (0, 0^{\frac{n-1}{2}-1}\,11\,0^{\frac{n-1}{2}-1}\,0),\ (0, 0^{\frac{n-1}{2}-1}\,10\,0^{\frac{n-1}{2}-1}\,0)\}.$$

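Let us assume that at the end of round $j < \frac{n-1}{2}$, (2) holds and let us consider any $s = (0, z) \in S$, $z = 0^{\frac{n-1}{2}-j}\, x y\, 0^{\frac{n-1}{2}-j}\, \beta$. After round $j + 1$, $dst(s, \frac{n-1}{2} - j, n)$ is added to S due to s.

$$\begin{aligned}
dst(s, \tfrac{n-1}{2} - j, n)
&= \{(0, neg_{\frac{n-1}{2}-j}(neg_0(z))),\ (0, neg_{n-(\frac{n-1}{2}-j)}(neg_{\frac{n-1}{2}-j}(z))),\ (0, neg_{n-(\frac{n-1}{2}-j)}(z))\} \\
&= \{(0, neg_{\frac{n-1}{2}-j}(neg_0(z))),\ (0, neg_{\frac{n+1}{2}+j}(neg_{\frac{n-1}{2}-j}(z))),\ (0, neg_{\frac{n+1}{2}+j}(z))\} \\
&= \{(0, 0^{\frac{n-1}{2}-j-1}\,0xy1\,0^{\frac{n-1}{2}-j-1}\,\bar{\beta}),\ (0, 0^{\frac{n-1}{2}-j-1}\,1xy1\,0^{\frac{n-1}{2}-j-1}\,\beta),\ (0, 0^{\frac{n-1}{2}-j-1}\,1xy0\,0^{\frac{n-1}{2}-j-1}\,\beta)\}.
\end{aligned}$$

Hence, $\{s\} \cup dst(s, \frac{n-1}{2} - j, n) = \{(0, 0^{\frac{n-1}{2}-(j+1)}\, \alpha x y \gamma\, 0^{\frac{n-1}{2}-(j+1)}\, \beta') \mid \alpha, \gamma \in B\}$ where $x, y \in B^j$ are given and $\beta' = \bar{\beta}$ if $\alpha = 0$ and $\gamma = 1$, and $\beta' = \beta$ otherwise. Since this argument applies to any $s \in S$, the induction step is proved. This proves (1).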
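The induction can be checked by direct simulation for small odd n. The sketch below is illustrative: it encodes a cycle address as an integer whose bit i is bit i of the string (so bit 0 is the rightmost symbol $\beta$), and the names `neg`, `dst`, and `odd_phase1` are assumptions of this example. Only the addresses of the informed leaders are tracked, since all leaders have offset 0; paths are not modeled.

```python
def neg(i, c):
    """Invert bit i of the cycle address c."""
    return c ^ (1 << i)


def dst(c, i, n):
    """Cycle addresses of the three nodes in dst((0, c), i, n), Definition 8."""
    return {neg(i, neg(0, c)), neg(n - i, neg(i, c)), neg(n - i, c)}


def odd_phase1(n):
    """Addresses of informed leaders after the (n - 1) / 2 rounds of Phase 1,
    using i = (n - 1) / 2 - j in round j + 1 as in the proof above."""
    assert n % 2 == 1 and n >= 3
    informed = {0}                     # the source is node (0, 0^n)
    for j in range((n - 1) // 2):
        i = (n - 1) // 2 - j
        new = set()
        for c in informed:
            new |= dst(c, i, n)
        informed |= new
    return informed


print(sorted(format(c, "03b") for c in odd_phase1(3)))
# ['000', '011', '100', '110']
```

Dropping bit 0 from each address (`c >> 1`) yields every string w of n - 1 bits exactly once, which is statement (1).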

In Phase 2, each informed leader $(0, w\beta) \in S$ becomes the source of an OAB in its own cycle $(*, w\beta)$ and simultaneously, it becomes the external source of an OAB in its "complementary" uninformed cycle $(*, w\bar{\beta})$.

Due to Corollaries 3 and 5, Phase 2 is depth-contention-free. Let us show that Phase 1 is depth-contention-free, too. This follows from two facts:

1. All minimal paths going from all nodes $s = (0, z) \in S$ to the corresponding nodes in $dst(s, \frac{n-1}{2} - j, n)$ in round $j + 1$ are node-disjoint. The path $P((0, z), (0, neg_{\frac{n-1}{2}-j}(neg_0(z))))$ starts by 1 hypercube edge, continues by several cycle edges and 1 hypercube edge, and ends by several cycle edges. The path $P((0, z), (0, neg_{n-(\frac{n-1}{2}-j)}(neg_{\frac{n-1}{2}-j}(z))))$ starts by cycle edges, continues by 1 hypercube edge, several cycle edges, and 1 more hypercube edge, and ends by several cycle edges. The path $P((0, z), (0, neg_{n-(\frac{n-1}{2}-j)}(z)))$ starts by cycle edges, traverses a hypercube edge, and ends by cycle edges.

2. Any path starting in node $(0, 0^k x y 0^k \beta) \in S$, $x, y \in B^j$, $k = \frac{n-1}{2} - j$, in subsequent rounds traverses only nodes $(l, c)$ where $c = \gamma x y \delta \beta$ or $c = \gamma x y \delta \bar{\beta}$ and $\gamma, \delta \in B^k$. Therefore, any two paths originated in two different leaders in any two successive rounds are node-disjoint. □

Fig. 6 shows 3 rounds of OddCCCOAB(3). Phase 1 has only 1 round and Phase 2 has 2 rounds.

Fig. 6. The whole broadcast tree of the 3-round OddCCCOAB(3) (cycle addresses 000, 001, 011, 010, 110, 111, 101, 100; cycle offsets 0, 1, 2; one panel per round).

4.2. n is even

Theorem 10. If n is even, then the algorithm EvenCCCOAB(n) whose pseudocode is in Fig. 7 completes an OAB in an all-output-port WH $CCC_n$ in $n/2 + \lceil \log_3(n+1) \rceil$ rounds. Moreover, it is a depth-contention-free algorithm.

Fig. 7. Pseudocode of algorithm EvenCCCOAB.

Proof. Similar to the proof of Theorem 9. At the end of Phase 1, in exactly one quarter of the cycles of $CCC_n$, there is exactly one informed leader. This follows from the fact that after Phase 1

$$\forall x, y \in B^{n/2-1}\ \exists \beta \in B \text{ such that } (0, x0y\beta) \in S.$$

This can again be proved inductively by showing that after round j of Phase 1,

$$\forall x, y \in B^{j}\ \exists \beta \in B \text{ such that } (0, 0^k x 0 y 0^k \beta) \in S \text{ and } (0, 0^k x 0 y 0^k \bar{\beta}) \notin S,$$

where $k = n/2 - 1 - j$. In the single round of Phase 2a, each leader informs two further nodes so that $\forall x, y \in B^{n/2-1}\ \forall \beta \in B$, cycle $(*, x1y\beta)$ has one informed leader and $\forall x, y \in B^{n/2-1}\ \exists \beta \in B$ such that cycle $(*, x0y\beta)$ has one informed leader, whereas cycle $(*, x0y\bar{\beta})$ is uninformed. In Phase 2b, all cycles with informed leaders perform OABs using algorithm CycleOAB(n) and at the same time, uninformed cycles perform OABs initiated by two external sources, adjacent to nodes 0 and n/2 in these cycles, respectively. □

Fig. 8 shows a part of the broadcast tree of EvenCCCOAB(6) from the source $(0, 0^6)$. Phase 1 consists of two rounds and Phase 2 consists of 1 + 2 rounds.

Since the Odd(Even)CCCOAB algorithm uses the minimal routing R, we call it the R-algorithm.

Fig. 8. Samples of the broadcast tree of the 5-round EvenCCCOAB(6) in the vicinity of the source node $(0, 0^6)$ (panels: round 1 to round 5; cycle offsets 0 to 5).

5. Deadlock avoidance

The R-algorithm is depth-contention-free. So, a deadlock cannot appear even if the broadcast tree grows asynchronously. However, the minimal routing R is not deadlock-free, since the minimal routing even in cycles is not deadlock-free. If two asynchronous OABs with different sources run concurrently or if an OAB proceeds concurrently with some other communication routed by R, a deadlock can appear, since the channel dependency graph (see [2] for the definition) for $CCC_n$ and R, $CDG(CCC_n, R)$, is not acyclic.

In this section, we propose a solution based on the well-known notion of virtual channels. Recall that we have assumed full-duplex channels all the time: each full-duplex physical channel can be thought of as a pair of antiparallel directed links. Instead of R, we define a nonminimal routing function R′ on virtual channels, derived from the e-cube routing. To keep $CCC_n$ connected, we only need to split cycle links in one direction, say down, into two virtual links, denoted by A and B. Note that these requirements are smaller than the requirements for the HC routing introduced in [8], which needs two virtual links in both directions.

Definition 11. Let $0 \le i \le n - 1$ and $\alpha \in B^n$. We label links incident to or from node $(i, \alpha)$ as follows:

$cube_i$: the link from $(i, \alpha)$ to $(i, neg_i(\alpha))$,

$up_i$: the link from $(i, \alpha)$ to $((i + 1) \bmod n, \alpha)$,

$down^A_i$: the virtual link A from $((i + 1) \bmod n, \alpha)$ to $(i, \alpha)$,

$down^B_i$: the virtual link B from $((i + 1) \bmod n, \alpha)$ to $(i, \alpha)$.

Definition 12. The links within a cycle of $CCC_n$ are ordered as follows (see Fig. 9).

$$down^A_{n-1} < down^A_{n-2} < \cdots < down^A_0 < cube_0 < up_0 < cube_1 < up_1 < \cdots < cube_{n-1} < up_{n-1} < down^B_{n-1} < down^B_{n-2} < \cdots < down^B_0.$$

Definition 13. Let R′ be the routing function that generates only paths in strictly increasing order of the link labels. Each path from $(a, \alpha)$ to $(b, \beta)$ conforming with R′ consists of up to three subpaths:

1. The subpath from source node $(a, \alpha)$ to node $(i, \alpha)$ where i is the lowest bit in which $\alpha$ and $\beta$ differ. This subpath can use all $down^A$ links and $up_x$, $0 \le x < i$, links. Hence, this subpath is not guaranteed to be minimal.

2. The subpath from node $(i, \alpha)$ to node $(j, \beta)$ where j is the highest bit in which $\alpha$ and $\beta$ differ. This subpath traverses dimensions in increasing order and it can use $cube_x$ and $up_x$, $i \le x < j$, links.

3. The subpath from node $(j, \beta)$ to node $(b, \beta)$. This subpath can use $up_x$, $j \le x < n$, and all $down^B$ links. Again, this subpath is not guaranteed to be minimal.

Since the $CDG(CCC_n, R')$ does not contain cycles, R′ is deadlock-free.
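The ordering of Definition 12 and the three subpaths of Definition 13 can be illustrated with a small router. This is a hedged sketch, not the authors' routing function: it constructs one particular path with strictly increasing link labels (R′ as defined may admit others), it additionally uses $cube_j$ to invert the highest differing bit, and the names `rank`, `route`, and `walk` as well as the integer encoding of addresses are assumptions of this example.

```python
def rank(kind, idx, n):
    """Position of a link label in the total order of Definition 12."""
    if kind == "downA":
        return n - 1 - idx
    if kind == "cube":
        return n + 2 * idx
    if kind == "up":
        return n + 2 * idx + 1
    return 3 * n + (n - 1 - idx)          # downB


def route(src, dst):
    """One label sequence from src = (a, alpha) to dst = (b, beta)."""
    (a, alpha), (b, beta) = src, dst
    diff = alpha ^ beta
    if diff == 0:                         # same cycle: no hypercube edge
        if a <= b:
            return [("up", x) for x in range(a, b)]
        return [("downA", x) for x in range(a - 1, b - 1, -1)]
    i = (diff & -diff).bit_length() - 1   # lowest differing bit
    j = diff.bit_length() - 1             # highest differing bit
    # Subpath 1: reach offset i (at most one of the two lists is non-empty).
    path = [("downA", x) for x in range(a - 1, i - 1, -1)]
    path += [("up", x) for x in range(a, i)]
    # Subpath 2: fix differing bits in increasing order of dimension.
    for x in range(i, j + 1):
        if diff >> x & 1:
            path.append(("cube", x))
        if x < j:
            path.append(("up", x))
    # Subpath 3: reach offset b (again at most one list is non-empty).
    path += [("up", x) for x in range(j, b)]
    path += [("downB", x) for x in range(j - 1, b - 1, -1)]
    return path


def walk(src, path, n):
    """Follow the links of Definition 11 and return the node reached."""
    pos, addr = src
    for kind, idx in path:
        if kind == "cube":
            assert pos == idx
            addr ^= 1 << idx
        elif kind == "up":
            assert pos == idx
            pos = (idx + 1) % n
        else:                             # downA / downB leave (idx + 1) mod n
            assert pos == (idx + 1) % n
            pos = idx
    return pos, addr
```

Splitting the down links into two virtual classes is what makes the construction work: a descent before the first hypercube edge uses the low-ranked A links, and a descent after the last hypercube edge uses the high-ranked B links, so the label sequence never decreases.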

Fig. 9. Ordering of links within a cycle of $CCC_n$. Each edge of a cycle is replaced by one up link and two down links.

Theorem 14. There exists an OAB R′-algorithm in an all-output-port WH $CCC_n$, whose number of rounds is $1 + \frac{n-1}{2} + \lceil \log_3(n+2) \rceil$ for odd n and $1 + \frac{n-2}{2} + \lceil \log_3(2n+1) \rceil$ for even n. Moreover, it is a depth-contention-free algorithm.

Proof. For any source $(0, \alpha)$, $\alpha \in B^n$, the R′-algorithm, the OAB algorithm conforming with R′, has exactly the same pseudocode as the R-algorithm in Section 4. The routing function R′ generates the same paths as routing function R in case of such a source.

Consider an OAB R′-algorithm with source $(i, \alpha)$, $i \ne 0$. It can be obtained by a slight modification of the R-algorithm. In the first round, source $(i, \alpha)$ sends the packet to nodes $(0, \alpha)$ and $(0, neg_{\lfloor n/2 \rfloor}(\alpha))$ using R′. After that, we can view $CCC_n$ as split into two halves, which are almost isomorphic to $CCC_{n-1}$, except that each cycle has one extra node. Now, in both $CCC_{n-1}$-like halves in parallel, we can perform OABs using a minor modification of the R-algorithm from Section 4, respecting the fact that cycles have n nodes instead of $n - 1$. □

6. Conclusions

The time complexity of the presented algorithms is evaluated in Table 1. The greatest n when the R-algorithm is optimal is 80. The smallest n such that the difference between the trivial lower bound and the number of rounds of the R-algorithm is greater than 1 is 123. Note that the $CCC_{123}$ has around $1.3 \times 10^{39}$ nodes. The deadlock-free R′-algorithm needs at most 1 round more than the R-algorithm.

Since $CCC_n$ are a spanning subgraph of the n-dimensional wrapped butterfly $wBF_n$, our algorithms performing an OAB in $CCC_n$ can run without any modification on $wBF_n$ in nearly $\tau_{OAB}(wBF_n) \cdot \log_4 5$ rounds.

Table 1
The time complexity of our OAB algorithms in $CCC_n$

n      log_2 n    Lower bound ⌈(n + log_2 n)/2⌉    Rounds of R-algorithm    Rounds of R′-algorithm
3      1.585      3                                 3                        4
4      2.000      3                                 4                        4
5      2.322      4                                 5                        5
6      2.585      5                                 5                        6
7      2.807      5                                 6                        6
8      3.000      6                                 6                        7
9      3.170      7                                 7                        8
10     3.322      7                                 8                        8
11     3.459      8                                 8                        9
12     3.585      8                                 9                        9
13     3.700      9                                 9                        10
14     3.807      9                                 10                       11
15     3.906      10                                11                       11
...
78     6.285      43                                43                       44
79     6.303      43                                44                       44
80     6.322      44                                44                       45
81     6.340      44                                45                       46
...
121    6.919      64                                65                       66
122    6.931      65                                66                       67
123    6.942      65                                67                       67
124    6.954      66                                67                       68
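The entries of Table 1 follow from the closed-form round counts of Corollary 7 and Theorems 9, 10, and 14. The sketch below recomputes them with exact integer arithmetic; the function names are illustrative assumptions, and the even-n formula for the R-algorithm is taken as $n/2 + \lceil \log_3(n+1) \rceil$, which is the reading that matches the table.

```python
def ceil_log(base, x):
    """Smallest integer r with base**r >= x (for x >= 1)."""
    r, power = 0, 1
    while power < x:
        power *= base
        r += 1
    return r


def lower_bound(n):
    """Corollary 7: ceil(log_4(n * 2**n))."""
    return ceil_log(4, n * 2 ** n)


def rounds_r(n):
    """Rounds of the R-algorithm (Theorem 9 for odd n, Theorem 10 for even n)."""
    if n % 2:
        return (n - 1) // 2 + ceil_log(3, 2 * n + 1)
    return n // 2 + ceil_log(3, n + 1)


def rounds_r_prime(n):
    """Rounds of the deadlock-free R'-algorithm (Theorem 14)."""
    if n % 2:
        return 1 + (n - 1) // 2 + ceil_log(3, n + 2)
    return 1 + (n - 2) // 2 + ceil_log(3, 2 * n + 1)


for n in (3, 4, 5, 6, 80, 123):
    print(n, lower_bound(n), rounds_r(n), rounds_r_prime(n))
# 3 3 3 4
# 4 3 4 4
# 5 4 5 5
# 6 5 5 6
# 80 44 44 45
# 123 65 67 67
```

Scanning a range of n with these functions also reproduces the two thresholds quoted in the conclusions: n = 80 is the largest value at which the R-algorithm meets the lower bound, and n = 123 is the first value at which it exceeds the bound by more than one round.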

References

[1] J.-C. Bermond, T. Kodate, S. Perennes, A. Bonnecaze, P. Solé, Broadcasting in hypercubes in the circuit switched model, in: Proceedings of the 14th International Parallel and Distributed Processing Symposium (IPDPS 2000) (May 2000), IEEE Computer Society Press, Silver Spring, MD, pp. 21–26.

[2] W.J. Dally, C.L. Seitz, Deadlock-free message routing in multiprocessor interconnection networks, IEEE Trans. Comput. 36 (5) (1987) 547–553.

[3] P. Fraigniaud, E. Lazard, Methods and problems of communication in usual networks, Discrete Appl. Math. 53 (1994) 79–133.

[4] C.-T. Ho, M.-Y. Kao, Optimal broadcast in all-port wormhole routed hypercubes, IEEE Trans. Parallel Distributed Comput. 6 (2) (1995) 200–204.

[5] J. Hromkovic, C.-D. Jeschke, B. Monien, Optimal algorithms for dissemination of information in some interconnection networks, Algorithmica 10 (1993) 24–40.

[6] L.M. Ni, P.K. McKinley, A survey of wormhole routing techniques in direct networks, Computer 26 (2) (1993) 62–76.

[7] P. Salinger, P. Tvrdík, Broadcasting in all-output-port meshes of trees with distance-insensitive switching, J. Parallel Distributed Comput. 62 (2002) 1272–1294.

[8] J. Song, Z. Hou, Y. Shi, Optimal multicast communication in wormhole-routed cube connected cycles, in: Proceedings of PDCS'99, Cambridge, USA (November 1999), vol. 2, Acta Press, pp. 725–731.

[9] Y.-C. Tseng, A dilated-diagonal-based scheme for broadcast in a wormhole-routed 2D torus, IEEE Trans. Comput. 46 (8) (1997) 947–952.

Petr Salinger received his M.Eng. and Ph.D. in Computer Science from the Czech Technical University, Prague in 1997 and 2001, respectively. He is currently with T-Systems Czech. His research interests include communication algorithms and operating systems.

Pavel Tvrdík received his M.Eng. and Ph.D. in Computer Science from the Czech Technical University, Prague, in 1980 and 1991, respectively. Currently, he is a Full Professor at the Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University. His research interests include computer architectures, parallel algorithms, and cluster computing.