13
JOURNAL OF PARALLEL AND DISTRIBUTEDCOMPUTING 11, 263-275 (1991) Optimal Communication Algorithms for Hypercubes* D. P. BERTSEKAS, C. OZVEREN, G. D. STAMOULIS, P. TSENG, ANDJ.N. TSITSIKLISt Laboratory for Information and Decision Systems, Room 35-210, Massachusetts Institute o/Technology, Cambridge,Massachusetts 02139 We consider the following basic communication problems in a hypercube network of processors: the problem of a single pro- cessorsending a different packet to each of the other processors, the problem of simultaneous broadcastof the same packet from every processor to all other processors,and the problem of si- multaneous exchange of different packets between every pair of processors.The algorithms proposed for these problems are op- timal in terms of execution time and communication resource requirements; that is, they require the minimum possible number of time steps and packet transmissions. In contrast, algorithms in the literature are optimal only within an additive or multi- plicative factor. @ 1991 Academic Press,lnc. --' 1. INTRODUCTION AND PROBLEM FORMULATION When algorithms are executedin a network of processors, it is necessary to exchange some intermediate information between the processors. The interprocessor communication time may be substantial relative to the time neededexclu- sively for computations, so it is important to carry out the information exchange as efficiently as possible.There are a number of generic communication problems that arise fre- quently in numerical and other algorithms. In this paper, we describenew algorithms for solving some of theseprob- lems on a hypercube.Algorithms for solving suchproblems have been studied in such works as [15,7], among others. In this paper, we present some new algorithms for the hy- percubethat are optimal, in the sense that they execute the required communication tasks in the minimum possible number of time stepsand link transmissions. To define a hypercubenetwork (or d-cube), we consider the set of points in d-dimensional space with each coordinate equal to 0 or I. We let these points correspond to processors, and we considera communication link for every two points differing in a single coordinate.We thus obtain an undirected graph with the processors as nodesand the communication links as arcs.The binary string of length d that corresponds * Research supported by the NSF under Grants ECS-85 19058 and ECS- 8552419, with matching funds from Bellcore Inc., the ARO under Grant DAALO3-86-K-O171, and the AFOSR under Grant AFOSR-88-OO32. t Laboratory for Information and Decision Systems, M.I. T., Cambridge, MA 02139. 263 Copyright@ \99\ by All rights of reproduction i to the coordinates of a node of the d-cube is referred to as the identity number of the node. We recall that a hypercube of any dimension can be constructed by connecting lower- dimensional cubes,starting with a l-cube. In particular, we can start with two (d -I )-dimensional cubes and introduce a link connecting each pair of nodes with the same identity number (see, e.g., [I, Sect. 1.3]). This constructsad-cube with the identity number of each node obtained by adding a leading0 or a leading I to its previous identity, depending on whetherthe node belongs to the first (d -I )-dimensional cube or the second (see Fig. I). When confusion cannot arise, we refer to a d-cube node interchangeablyin terms of its identity number (a binary string of length d) and in terms of the decimal representationof its identity number. Thus, for example, the nodes (00. ..0),(00. ..1),and(11. ..1) are also referred to as nodes0, I, and 2d -1, respectively. The Hamming distancebetween two nodesis the number of bits in which their identity numbersdiffer. Two nodes,are directly connected with a communication link if and only if their Hamming distance is unity, that is, if and only if their identity numbers differ in exactlyone bit. The number of links on any path connecting two nodes cannot be less than the Hamming distanceof the nodes. Furthermore, there is a path with a number of links that is equalto the Hamming distance, obtained, for example, by switching in sequence the bits in which the identity numbers of the two nodesdiffer (equivalently, by traversing the corresponding links of the hypercube). Such a path is referred to as a shortest path in this paperand a tree consisting of shortest paths from some node to all other nodesis referred to as a shortest path tree. Information is transmitted along the hypercubelinks in groups of bits called packets. In our algorithms we assume that the time required to cross any link is the same for all packets and is takento beone unit. Thus, our analysis applies to communication problems where all packets have roughly equallength. We assume that packetscanbe simultaneously transmitted along a link in both directions and that their transmission is error free. Only one packet can travel along a link in each direction at anyone time; thus, if ~ore than one packet is available at a node and is scheduled to be transmitted on the sameincident link of the node, then only one of these packets can be transmitted at the next time period, while the remaining packets must be stored at the node while waiting in queue. 0743-7315/91 $3.00 AcademicPress. Inc. in any form reserved.

Optimal Communication Algorithms for Hypercubes*dimitrib/OptimalCA.pdfThe Hamming distance between two nodes is the number of bits in which their identity numbers differ. Two nodes,are

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Optimal Communication Algorithms for Hypercubes*dimitrib/OptimalCA.pdfThe Hamming distance between two nodes is the number of bits in which their identity numbers differ. Two nodes,are

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING 11, 263-275 (1991)

Optimal Communication Algorithms for Hypercubes*

D. P. BERTSEKAS, C. OZVEREN, G. D. STAMOULIS, P. TSENG, AND J.N. TSITSIKLISt

Laboratory for Information and Decision Systems, Room 35-210, Massachusetts Institute o/Technology, Cambridge, Massachusetts 02139

We consider the following basic communication problems ina hypercube network of processors: the problem of a single pro-cessor sending a different packet to each of the other processors,the problem of simultaneous broadcast of the same packet fromevery processor to all other processors, and the problem of si-multaneous exchange of different packets between every pair ofprocessors. The algorithms proposed for these problems are op-timal in terms of execution time and communication resourcerequirements; that is, they require the minimum possible numberof time steps and packet transmissions. In contrast, algorithmsin the literature are optimal only within an additive or multi-plicative factor. @ 1991 Academic Press,lnc.

--'

1. INTRODUCTION AND PROBLEMFORMULATION

When algorithms are executed in a network of processors,it is necessary to exchange some intermediate informationbetween the processors. The interprocessor communicationtime may be substantial relative to the time needed exclu-sively for computations, so it is important to carry out theinformation exchange as efficiently as possible. There are anumber of generic communication problems that arise fre-quently in numerical and other algorithms. In this paper,we describe new algorithms for solving some of these prob-lems on a hypercube. Algorithms for solving such problemshave been studied in such works as [15,7], among others.In this paper, we present some new algorithms for the hy-percube that are optimal, in the sense that they execute therequired communication tasks in the minimum possiblenumber of time steps and link transmissions.

To define a hypercube network (or d-cube), we considerthe set of points in d-dimensional space with each coordinateequal to 0 or I. We let these points correspond to processors,and we consider a communication link for every two pointsdiffering in a single coordinate. We thus obtain an undirectedgraph with the processors as nodes and the communicationlinks as arcs. The binary string of length d that corresponds

* Research supported by the NSF under Grants ECS-85 19058 and ECS-

8552419, with matching funds from Bellcore Inc., the ARO under GrantDAALO3-86-K-O171, and the AFOSR under Grant AFOSR-88-OO32.

t Laboratory for Information and Decision Systems, M.I. T., Cambridge,

MA 02139.

263

Copyright@ \99\ byAll rights of reproduction i

to the coordinates of a node of the d-cube is referred to asthe identity number of the node. We recall that a hypercubeof any dimension can be constructed by connecting lower-dimensional cubes, starting with a l-cube. In particular, wecan start with two (d -I )-dimensional cubes and introducea link connecting each pair of nodes with the same identitynumber (see, e.g., [I, Sect. 1.3]). This constructs ad-cubewith the identity number of each node obtained by addinga leading 0 or a leading I to its previous identity, dependingon whether the node belongs to the first (d -I )-dimensionalcube or the second (see Fig. I). When confusion cannotarise, we refer to a d-cube node interchangeably in terms ofits identity number (a binary string of length d) and in termsof the decimal representation of its identity number. Thus,for example, the nodes (00. ..0),(00. ..1),and(11. ..1)are also referred to as nodes 0, I, and 2d -1, respectively.

The Hamming distance between two nodes is the numberof bits in which their identity numbers differ. Two nodes,aredirectly connected with a communication link if and onlyif their Hamming distance is unity, that is, if and only iftheir identity numbers differ in exactly one bit. The numberof links on any path connecting two nodes cannot be lessthan the Hamming distance of the nodes. Furthermore, thereis a path with a number of links that is equal to the Hammingdistance, obtained, for example, by switching in sequencethe bits in which the identity numbers of the two nodes differ(equivalently, by traversing the corresponding links of thehypercube). Such a path is referred to as a shortest path inthis paper and a tree consisting of shortest paths from somenode to all other nodes is referred to as a shortest path tree.

Information is transmitted along the hypercube links ingroups of bits called packets. In our algorithms we assumethat the time required to cross any link is the same for allpackets and is taken to be one unit. Thus, our analysis appliesto communication problems where all packets have roughlyequal length. We assume that packets can be simultaneouslytransmitted along a link in both directions and that theirtransmission is error free. Only one packet can travel alonga link in each direction at anyone time; thus, if ~ore thanone packet is available at a node and is scheduled to betransmitted on the same incident link of the node, then onlyone of these packets can be transmitted at the next timeperiod, while the remaining packets must be stored at thenode while waiting in queue.

0743-7315/91 $3.00Academic Press. Inc.in any form reserved.

Page 2: Optimal Communication Algorithms for Hypercubes*dimitrib/OptimalCA.pdfThe Hamming distance between two nodes is the number of bits in which their identity numbers differ. Two nodes,are

264 BERTSEKAS ET AL.

110 111

ro;;

7101

000 001

0110

0010~

11101010//~""

-~1oij:::7I'1"

,

~

'-- ~1 V

0000 I ~ ~ ~~~~~~~~~~~ ~1 IXXI 11 00~. 0101 1101

~1XX11 1001

FIG. 1. Construction of a 3-cube and a 4-cube by connecting the cor-responding nodes of two identicallower-dimensional cubes. A node belongsto the first lower-dimensional cube or the second depending on whether itsidentity has a leading 0 or a leading I.

In a generalized version of the single node broadcast, wewant to do a single node broadcast simultaneously from allnodes (we call this a multinode broadcast). To solve themultinode broadcast problem, we need to specify one span-ning tree per root node. The difficulty here is that some linksmay belong to several spanning trees; this complicates thetiming analysis, because several packets can arrive simulta-neously at a node and require transmission on the same linkwith a queueing delay resulting.

Another interesting communication problem is sending apacket from every node to every other node (here a nodesends different packets to different nodes, in contrast withthe multinode broadcast problem, where a node sends thesame packet to every other node). We call this the totalexchange problem. A related problem, called the single nodescatter problem, involves sending a separate packet from asingle node, called the root, to every other node.

Table 1 gives the main results for the preceding threecommunication problems. One of the columns gives thenumber of time units required to solve the problems. Weshow that each of these numbers is a lower bound on thenumber of time units taken by any algorithm that solves thecorresponding problem, and we describe an algorithm thatattains the lower bound. The other column gives the numberof packet transmissions required to solve the correspondingcommunication problems. These numbers are lower boundson the number of packet transmissions taken by any algo-rithms that solve the corresponding problems, and the lowerbounds are attained by the same algorithms that attain thecorresponding lower bounds for the execution time. Thus,there are algorithms that are simultaneously optimal in termsof execution time and number of packet transmissions forour communication problems.

Related Research

Algorithms for the communication problems of this paperwere first considered in [15], which also discusses the effects

Each node is assumed to have infinite storage space.Moreover, we assume that all incident links of a node canbe used simultaneously for packet transmission and recep-tion; this is called the Multiple Link Availability (or MLA)assumption. Another possibility is the Single Link Avail-ability (or SLA) assumption, where it is assumed that, atany time, a node can transmit a packet along at most oneincident link and can simultaneously receive a packet alongat most one incident link. Optimal algorithms under thisassumption are considerably simpler than the ones for theMLA case and are not considered here (see [IS, 1,7]). Fi-nally, we assume that each of the algorithms proposed inthis paper is simultaneously initiated at all processors. Thisis a somewhat restrictive assumption, essentially implyingthat all processors can be synchronized with a global clock.For recent research on communication problems wherepackets are generated at random times, see [5, 20, 24].

We now describe the communication problems that weare concerned with.

TABLE I

Optimal Times and Optimal Numbers of Packet Transmis-sions for Solving the Three Basic Communication Problems ona d-Cube for the Case Where Simultaneous Transmission AlongAll Incident Links of a Node Is Allowed (the MLA Assumption)

The Communication Problems

One of the simplest problems is the single node broadcast,where we want to send the same packet from a given node,called the root, to every other node. Oearly, to solve thisproblem, it is sufficient to transmit the packet along a directedspanning tree emanating from the root node, that is, a span-ning tree of the network together with a direction on eachlink of the tree such that there is a unique directed path onthe tree from the root to every other node (all links of thepath must be oriented away from the root). This problemhas been discussed extensively under various assumptionsin [15, 12,7], so we do not discuss it further.

Number oftransmissionsProblem Time

r~l d2d-Single node scatterd

r~lMultinode broadcast 2d(2d_l)d

2d-1

d22d-'

Total exchange

Note. We assume that each packet requires unit time for transmission onany link.

Page 3: Optimal Communication Algorithms for Hypercubes*dimitrib/OptimalCA.pdfThe Hamming distance between two nodes is the number of bits in which their identity numbers differ. Two nodes,are

COMMUNICATION ALGORITHMS FOR HYPERCUBES 265

of the packet overhead and the data rate (denoted by fJ and'T, respectively, in [15]) on the transmission times. Theproblems are named somewhat differently in [15] than here.We essentially follow the communication model of [15],except that the hypercube links are assumed to be unidirec-tional in that work; this increases the algorithm executiontimes by a factor of 2 for the multinode broadcast and thetotal exchange problems. To compare the results of [ 15] withthose of the present paper, the times of [15] should be usedwith fJ = 0, m = 1, and 'T = 1 and, for the aforementionedproblems, should be divided by 2. A multinode broadcastalgorithm (under the MLA assumption) which is slightlysuboptimal (by no more than dtime units) is given in [15].This algorithm is constructed by specifying a packet trans-mission schedule at a single node and then properly repli-cating that schedule at each node, exploiting the symmetryof the d-cube. In contrast, we obtain optimal multinodebroadcast algorithms starting from a suitable single nodebroadcast algorithm and replicating that algorithm at eachnode. This approach was used for meshes in general in [14],where a slightly suboptimal (by no more than d -3 timeunits) multinode broadcast algorithm was given. The sameapproach was also used in [7]. The total exchange algorithm,given in Ref. [15] under the MLA assumption, assumes thateach node has m packets to send to every other node. Thisalgorithm is optimal only ifm is a multiple of d; for m = 1,it is suboptimal by a factor of d. An algorithm similar to thetotal exchange algorithm of [15] is also given in [12]. Analternative approach to optimal algorithms for the total ex-change problem was recently presented in [24, 23]. Optimalalgorithms for single node scatter and total exchange werealso given in [15] under the SLA assumption (even thoughthe SLA assumption does not explicitly appear in [15]).

The problems of this paper have also been considered in[7], where optimal and nearly optimal algorithms are givenon the basis of a different model of communication. Thismodel differs from ours in that it quantifies the effects ofsetup time (or overhead>. per packet, while it allows packetsto have variable length and to be split and be recombinedprior to transmission on any link in order to save on setuptime. In the model of [7], each packet may consist of dataoriginating at different nodes and / or destined for differentnodes. The extra overhead for splitting and combining pack-ets is considered negligible in the model of[7]. Our modelmay be viewed as the special case of the model of [7] inwhich packets have a fixed length and splitting and combin-ing of packets is not allowed. Under the assumptions of ourmodel, the algorithms given in [7] for single node scatter,multinode broadcast, and total exchange are not exactly op-timal, although some of them are optimal up to a small ad-ditive term, and are exactly optimal when d is a prime num-ber (they are also optimal if each node has a multiple of dpackets to send to each destination node). In contrast, ourcorresponding algorithms are exactly optimal for all d and

are unimprovable as far as time and communication re-quirements are concerned.

Even if there is no incentive to combine packets into largerpackets to save on setup time, there is sometimes an incentivefor splitting packets into smaller packets that can travel in-dependently through the network. The idea here is to par-allelize communication through pipelining the smallerpackets over paths with multiple links and is inherent inproposals for virtual cut-through and wormhole routing [II,3). A little thought shows that (as long as the associatedextra overhead is not excessive) the single node broadcasttime can be reduced by dividing packets into smaller packets(see also [7). On the other hand this is essentially impossiblefor the three basic communication problems considered inthis paper (under the MLA assumption); from Table I it isseen that in an optimal algorithm, there is almost 100% uti-lization of some critical communication resource (the d linksoutgoing from the root in single node scatter and all of thed2d directed network links in multinode broadcast and totalexchange). Any communication algorithm for these prob-lems that divides packets into smaller packets cannot reducethe total usage of the corresponding critical resource andtherefore cannot enjoy any pipelining advantage.

We also note that in addition to [15, 12, 7J, there areseveral other works dealing with various communicationproblems and network architectures related to those discussedin the present paper (see [2, 4, 6, 9, 10, 16-19,21,22).

To summarize, the ~ew results of the present paper arethe optimal algorithms for single node scatter, multinodebroadcast, and total exchange (under the MLA assumption).In all of our algorithms, all packets originating from the samenode are routed through a spanning tree rooted at this node.Our single node scatter algorithm uses a new perfectly bal-anced spanning tree construction, that is, a spanning treewith d subtrees that are as close to being equal in size aspossible. To the best of our knowledge, this is the first span-ning tree that is provably perfectly balanced. (A spanningtree proposed in [7) attains this property only when d is aprime number.) Also, our multi node broadcast algorithmuses another new and interesting spanning tree construction.These spanning trees could prove useful in other algorithms;in general, our algorithms can be used for the optimal so-lution of various other communication problems, by intro-ducing appropriate modifications (see [I, Sect. 1.3). Re-garding the appropriateness of our assumptions, our view isthat it is important to consider several types of communi-cation models, given the broad variety of present and futurecommunication hardware. Our fixed packet length modelhas the advantages of simplicity and flexibility; we believethat algorithms based on our model are likely to be adaptablewith small variations to many types of communication con-texts. (For such an adaptation and corresponding analysisof our multinode broadcast algorithm in the case where thepacket lengths are random, see [24 ).) It is also worth noting

Page 4: Optimal Communication Algorithms for Hypercubes*dimitrib/OptimalCA.pdfThe Hamming distance between two nodes is the number of bits in which their identity numbers differ. Two nodes,are

266 BERTSEKAS ET AL

that the emerging standard for high-speed communications,the Asynchronous Transfer Mode (see, e.g., [13]), is basedon fixed packet lengths as well as minimal packet processingat the nodes, which favors neither splitting nor combiningpackets.

2. MUL TINODE BROADCAST UNDERTHE MLA ASSUMPTION

= {(OO- --O)} and Sj C {(OO- --O)} U (U~-:'\Ek). Fur-

thermore, every nonzero node identity must belong to someEj. The set of all nodes together with the set of links(Uf= I Aj) must form a subgraph that contains a spanning tree(see Fig. 2a); in fact, to minimize the number of packettransmissions, the sets of links AI, A2, ..., Aq should bedisjoint and should form a spanning tree.

Consider now a d-bit string t representing the identitynumber of some node on the d-cube. For any node identityz, we denote by t e z the d-bit string obtained by performingmodulo 2 addition of the jth bit of t and z for eachj = 1,2,..., d. It can be seen ti\at an algorithm for broadcasting apacket from the node with identity t can be specified by thesets

We first note that in a multinode broadcast each nodemust receive a total of 2 d -1 packets over its d incidentlinks, so r (2 d -1)/ dl is a lower bound for the time required

by any multinode broadcast algorithm under the MLA as-sumption. We obtain an algorithm that attains this lowerbound.

As a first step toward constructing such an algorithm, werepresent any single node broadcast algorithm from node(00. ..0) to all other nodes that takes q time units by asequence of sets of directed links A I , Az, ..., Aq. Each Ai isthe set of links on which transmission of the packet beginsat time i-I and ends at time i. Naturally, the sets Ai mustsatisfy certain consistency requirements for accomplishingthe single node broadcast. In particular, if Si and Ei are thesets of identity numbers of the start nodes and endnodes, respectively, of the links in Ai, we must have S.

Aj(t) = {(t0x,t0y)l(x,y)EAj}, ,2,

i=

,q,

where Ai(t) denotes the set of links on which transmissionof the packet begins at time i-I and ends at time i. Theproof of this is based on the fact that t e x and t e y differin a particular bit if and only if x and y differ in the samebit, so (t e x, t e y) is a link if and only if(x, y) is a link.Figure 2 illustrates the sets Ai(t) corresponding to all possiblet for the case where d = 3.

A, A2 AJ

1

1

(at

A,IOO1! AVOOll A3(OOll

?-.~~ ~O..?g~ 0

~~l

011 111 101

0 1~~~~~~~;= =;1

'-~'~~~~~~~--~

101 001 011

1 OO~l~:~~~~~~. ~ 1'"~::~~~; v

010 110 100

001~

010 001111 100 000 010 110 010 (XX)

(b)

FIG. 2. Generation of a multinode broadcast algorithm for the d-cube, starting from a single node broadcast algorithm. (a) The algorithm tha~broadcasts a packet from the node with identity (00. ..0) to all other nodes is specified by a sequence of sets of directed links A I , A2, ., ., Aq. Each Ajis the set of links on which transmission begins at time i-I and ends at time i. (b) A corresponding broadcast algorithm for each root node identity t isspecified by the sets of links Aj(t) = {(I e x, t e y) I (x, y) EAj}, where we denote by t e z the d-bit string obtained by performing modulo 2 additionof the jth bit of t and z for j = 1, 2, ..., d. The multinode broadcast algorithm is specified by the requirement that transmission of the packet of node tstarts on each link in A;(t) at time i-I. The figure shows the construction for an example where d = 3. Here the set A2 has two links of the same typeand the multinode broadcast cannot be executed in three time units. However, if the link (000, 0 10) belonged to A I instead of A2, the required time wouldbe the optimal three time units.

Page 5: Optimal Communication Algorithms for Hypercubes*dimitrib/OptimalCA.pdfThe Hamming distance between two nodes is the number of bits in which their identity numbers differ. Two nodes,are

COMMUNICATION ALGORITHMS FOR HYPERCUBES 267

is 1,2, ..., d, 1,2, ..., d, 1,2, ...(cf. Fig. 3 for the cased = 4). We specify the order of node identities within eachset Rkn as follows: the first element t in each set Rkn is chosenso that the relation

the bit in position m(t) from the right is a 1 )

is satisfied, and the subsequent elements in Rkn are chosenso that each element is obtained by a single bit rotation tothe left of the preceding element. Also, for the elements t ofRkl, we require that the bit in position m(t) -1 [if m(t)> 1] or d [if m(t) = 1] from the right be a O. For i = 1, 2,

...,r(2d-l)jdl-l,define

~

n(t) ~ id},Ej = {t I (i -l)d +

and for i = 0 and i = q = r(2d -l)/dl, define

EO={(OO""_,-

Eq = {t I (q -l)d +

0)1.

~ n(t) ~ 2d -

We define the set of links Ai as fol~ows:

For i = 1, 2, ..., q, each set Ai consists of the linksthat connect the node identities t E Ei with the corre-sponding node identities of U~~~ Ek obtained from t byreversing the bit in position m(t) [which is always a 1by property (1 )]. In particular, the node identities ineach set in RkJ are connected with corresponding node'identities in R(k-l) J, because, by construction, the bitin position m(t) lies next to a 0 for each node identityt in the set RkJ.

We now describe a procedure for generating a multinodebroadcast algorithm specified by the sets Ai( t) for all possiblevalues of i and t, starting from a single node broadcast al-gorithm specified by the sets A I, A2, ..., Aq. Let us say thata link (x, y) is of type j if x and y differ in the jth bit. Wemake the following key observation: consider a single nodebroadcast algorithm specified by the link sets AI, ..., Aq.If, for each i, the links in Ai are of different types, then, foreach i, the sets Ai(t), where t ranges over all possible iden-tities, are disjoint. [If, for t + t', two links (t e x, t e y)E Ai(t) and (t' ex', t' e y') E Ai(t') were the same, thenthe links (x, y) and (x', y') would be different (since t + t'),and they would be of the same type because (x, y) and (x',y') are of the same type as (t e x, t e y) and (t' ex', t'e y'), respectively, which contradicts the fact that (x, y) and(x', y') belong to Ai.] This implies that the single nodebroadcasts of all nodes t can be executed simultaneously,without any further delay. In particular, we have a multinodebroadcast algorithm that takes q time units. We proceed togive a method for selecting the sets Ai with the links in eachAi being of different types. Furthermore, we ensure that eachone of the sets A I, ..., Aq-1 has exactly d elements, whichis the maximum possible (since there exist only d link types),thereby resulting in the minimum possible execution time

ofq=r(2d-l)jdlunits.Let Nk, k = 0, 1, ..., d, be the set of node identities

having k unity bits and d -k zero. bits. The number ofelements in Nk is (t) = d!j(k!(d -k)!). In particular, Noand Nd contain one element, the strings (00. ..0) and(11. ..1), respectively; the sets NI and Nd-1 contain d ele-ments; and for 2 ~ k ~ d -2 and d ~ 5, Nk contains at least2d elements (when d = 4, the number of elements of N2 issix, as shown in Fig. 3). We partition each set Nk, k = 1,..., d -1, into disjoint subsets Rkl, ..., Rknk' which areequivalence classes under a single bit rotation to the left. Weimpose the restriction that Rkl is the equivalence class of theelement whose k rightmost bits are unity. We associate eachnode identity t with a distinct number n(t) E {O, 1,2, ...,2d -I} in the order

(OO...O)R11R21

..R2n2 ...Rkl ...Rknk

.R(d-2) I ...R(d-2)nd-2R(d-l) I ( ...1)

[i.e., n(OO.. .0) = 0, n(ll.. .1) = 2d- I, and the othernode identities are numbered consecutively in the above or-der between 1 and 2 d -2]. Let

+ [(n(t) -l)(mod d)].m(t) =

Thus, the sequence of numbers m(t) corresponding to thesequence of node identities

To show that this definition of the sets Aj is legitimate, weneed to verify that by reversing the specified bit of a nodeidentity t E Ej, we indeed obtain a node identity t' that be-longs to U~-::;.\ Ej, as opposed to Ej. [It cannot belong to Ekfor k > i, because n(t') < n(t).]

To see this in the exceptional case where t = (II. ..I),note that by the preceding rule, t' is the element of R(d-l)lwith a zero bit in position m(t) from the right. The elementsof R(d-l)1 are ordered so that the bit oft' in position m(t')-1 (ifm(t') > 1) orin position d(ifm(t') = 1) is aD. Since2d -1 is not divisible by d(see the appendix), we have m(t)+ d. Thus, the zero bit of t' cannot be in position d, so itmust be in position m(t') -1, implying that m(t) = m(t')-1. The set R(d-l) 1 has d elements, and as a result, its firstelement t" satisfies m(t") = m(t), so t' must be the secondelement ofR(d-l)l. Since m(t) + d, Eq has at most d -1elements, and thus, we obtain the desired conclusion t'

~Eq.In the case where t + ( 11 ...1), it is sufficient to show

that n(t) -n(t') ~ d. We consider two cases: (a) 1ft E Rknfor some n > 1, then all of the d elements of Rkl are betweenR11R21R22o 0 OR(d-l)1

Page 6: Optimal Communication Algorithms for Hypercubes*dimitrib/OptimalCA.pdfThe Hamming distance between two nodes is the number of bits in which their identity numbers differ. Two nodes,are

268 BERTSEKAS ET AL.

001 101

B

(al

No N, N2 N3 N.r-"-" ~" .,,' ,~(KX)OI ((XXI1) (00101 (01001 (1(XX11 (0011) (0110) (11001 (10011 (0101) (1010) (11011 (10111 (01111 (1110) (1111)

, ~ ' , .' ' " .'

R" R21 R22 R31

m((XXI1}=1 m(1001)=4 m(11011=J, ~

m(OOIII=1 I

~m(OI10)a2

~::~~m(II00I-3 I

(::=~

m(101It=4 m(IIIII=3--0 ,0

m(OI0It=1 m(Olllt=1D -:""~

m(10101s2 m(1110)=2

-~

(~)..B

m(1~A,

A2 AJ A.

R"

N2

(00011) (00110) (011001 (11000j (10001) (001011 (010101 (101001 (010011 (10010) .

R21 R22

N3,

(00111) (011101 (111001 (110011 (10011) (01011) (101101 (01101) (11010) (10101)

R32

R.,10001~1 01$X)1 11$X>1 01)01 11101

00911 1~10 1~11 11910 11911

B

~~~==~-6~~~~:

~ 11

01100 01010 01110 01011 01111~o o- 0 0 <>

1<XXXJ 11000 10100 11100 10110 11110 11111~o.-~~~ 0

(cl

FIG. 3. Construction of a multinode broadcast algorithm for a d-cube that takes r(2d -1)/ dl time.

verified that (k~ 1) -d ~ d, and we are done. The cases d= 3 and d = 4 can be handled individually (see Fig. 3). Thecases k = I, 2 create no' difficulties because R 11 = E 1 ,RZ1 = Ez.

t' and t, and the inequality n(t) -n(t') ~ d follows. (b) Ift E Rkl, then t' E R(k-l)l, and all the elements of the setsR(k-l)2, ..., R(k-l)nk-1 are between t' and t. There are(k~l) -d such elements. If 2 < k < d and d ~ 5, it can be

Page 7: Optimal Communication Algorithms for Hypercubes*dimitrib/OptimalCA.pdfThe Hamming distance between two nodes is the number of bits in which their identity numbers differ. Two nodes,are

COMMUNICATION ALGORITHMS FOR HYPERCUBES 269

With this rule, s starts transmitting its last packet to the sub-tree T j no later than time Nj -I, where Nj denotes the num-ber of nodes in Tj, and all nodes in Tj receive their packetno later than time Nj. (To see the latter, note that all packetsdestined for the nodes in T j that are k links away from s aresent no later than time Nj -k, and each of these packetscompletes its journey in exactly k time units.) Therefore, allpackets are received at their respective destinations inmax {Nt, N2, N,} time units. Hence, the above algo-rithm attains the optimal time if and only if T has the prop-erty that s has d neighbors in T and that each subtree T j, i= I,..., d, contains at mostr(2d -I )/dl nodes. If Tis inaddition a shortest path tree from s, then each packet travelsalong the shortest path to its destination and this algorithmalso attains the optimal number of packet transmissions.

We assume without loss of generality that s = (00. ..0)in what follows. To construct a spanning tree T with theabove two properties, let us consider the equivalence classesRkn introduced in Section 2 in connection with the multinodebroadcast problem. As in Section 2, we order the classes as

We have thus shown that the sets Ai are properly defined,and we also note that any two links in each set Ai are ofdifferent types, implying that the corresponding multinodebroadcast algorithm takes q = r(2d -1)/ dl time units. Thus,the algorithm attains the lower bound of execution time overall multinode broadcast algorithms under the MLA as-sumption and is optimal.

The preceding algorithm requires 2d(2d -I) packettransmissions. This is also a lower bound on the numberrequired by any multi node broadcast algorithm, since eachof the 2d nodes must receive a total of2d -I different packets(one from each of the other nodes). Therefore the algorithmis also optimal in terms of total required communicationresource.

(00 O)R11R21' RZn2 ...Rkl ...Rknk

...R(d-2) .R(d-2)nd-2R(d-I)I(II.. .1

3. SINGLE NODE SCA lTER UNDER THEMLA ASSUMPTION

Consider the d -cube and the problem of single node scatterwith root node s. Since 2d -I different packets must betransmitted by the root node over its d incident links, anyalgorithm solving these problems requires at least r(2d -1)/dl time units under the MLA assumption. This time can beachieved by modifying the corresponding optimal multinodebroadcast algorithm of the previous section, thereby justifyingthe entries of Table 1 for single node scatter.

The modified multinode broadcast algorithm is not, how-ever, optimal for the scatter problem with respect to thenumber of packet transmissions. To see this, note that apacket destined for some node must travel a number of linksat least equal to the Hamming distance between that nodeand the root. Therefore, a lower bound for the optimal num-ber of packet transmissions is the sum of the Hammingdistances of all nodes to the root. There are (t) = d!/(k!(d -k)!) nodes that are at distance k from the root, sothis bound is

(2)

and we consider the numbers n (t) and m (t) for each identityt, but for the moment, we leave the choice of the first elementin each class Rkn unspecified. We denote by mkn the numberm( t) of the first element t of Rkn and we note that this numberdepends only on Rkn and not on the choice of the first elementwithin Rkn.

We say that class R(k-l)n' is compatible with class Rkn ifR(k-l)n' has d elements (node identities) and there existidentities t' E R(k-l)n' and t E Rkn such that t' is obtainedfrom t by changing some unity bit of t to a O. Since theelements of R(k-l)n' and Rkn are obtained by left shifting thebits of t' and t, respectively, it is seen that for every elementx' of R(k-l)n' there is an element x of Rkn such that x' isobtained from x by changing one of its unity bits to a O. Thereverse is also true, namely that for every element x of Rknthere is an element x' of R(k-l)n' such that x is obtained fromx' by changing one of its zero bits to unity.

An important fact for the subsequent spanning tree con-struction is that for every class Rkn with 2 ~ k ~ d -1, thereexists a compatible class R(k-l )n'. Such a class can be obtainedas follows: Take any identity t E Rkn whose rightmost bit isa 1 and leftmost bit is a O. Let (1 be a string of consecutiveOs with maximal number of bits and let t' be the identityobtained from t by changing to 0 the unity bit immediatelyto the right of (1. [For example, if t = (0010011), then t'= (00 10001) or t' = (0000011), and if t = (0010001), thent' = (00 10000).] Then the equivalence class oft' is compatiblewith Rkn, because it has delements [t' * (00- --0) and t'contains a unique substring of consecutive Os with maximal

The lower bound ofEq. (2) is much smaller than the 2d(2d-I) packet transmissions required by a multinode broadcast.While it is possible to extract from the optimal multinodebroadcast algorithm a single node scatter algorithm whichattains the lower bound of Eq. (2), such an algorithm isquite complex to visualize and to implement. The followingalternative algorithm is much simpler.

For any spanning tree Tofthe d-cube, let,be the numberof neighbor nodes of the root node s in T, and let T i be thesubtree of T rooted at the ith neighbor of s. Consider thefollowing rule for s to send packets to each subtree T i:

Continuously send packets to distinct nodes in the sub-tree (using only links in T), giving priority to nodesfurthest away from s (break ties arbitrarily).

Page 8: Optimal Communication Algorithms for Hypercubes*dimitrib/OptimalCA.pdfThe Hamming distance between two nodes is the number of bits in which their identity numbers differ. Two nodes,are

270

BERTSEKAS

ET AL.

number of bits, so it cannot be replicated by left rotation ofless than dbits].

The spanning tree T with the desired properties is con-structed sequentially by adding links incident to elements ofthe classes Rkn as follows (see Fig. 4):

Initially T contains no links. We choose arbitrarily thefirst element of class R11 and we add to T the linksconnecting (00' ..0) with all the elements of RII. Wethen consider the classes Rkn (2 ~ k ~ d -I) one-by-one in the order indicated above, and for each Rkn, wefind a compatible class R(k-l)n' and the element t' inR(k-l)n' such that m(t') = mkn (this is possible becauseR(k-l)n' has d elements). We then choose as the firstelement of Rkn an element t such that t' is obtainedfrom t by changing one of its unity bits to a O. SinceR(k-l)n' has d elements and Rkn has at most- d elements,it can be seen that, for any x in Rkn, we have m(x')= m(x), where x' is the element ofR(k-l)n' obtainedby shifting t' to the left by the same amount as thatneeded to obtain x by shifting t to the left. Moreover,x' can be obtained from x by changing some unity bitof x to a O. We add to Tthe links (x', x), for all x E Rkn(with x' defined as above for each x). After exhaustingthe classes Rkn, 2 ~ k ~ d -I, we finally add to Tthelink (x,( 11. ..1», where x is the element of R(d-I)1with m(x) = m( 11. ..1).

The construction of T is such that each node x* (00' ..0) is in the subtree T m(x). Since there are at mostr(2d -1)/ dl nodes x having the same value of m(x), eachsubtree contains at most r ( 2 d -1) / dl nodes. Furthermore,

the number of links on the path of T connecting any nodeand (00. ..0) is the corresponding Hamming distance.Hence, T is also a shortest path tree from (00. ..0), asdesired.

Note that for the spanning tree constructed above, eachof the subtrees Tj, contains eitherl(2d -1)/ dJ orr(2d -1)/

~~I

cube with certain properties to be stated shortly, and we usethis algorithm to perform an optimal total exchange in the(d + I )-cube. The construction is as follows: we decomposethe (d + I )-cube into two d-cubes, denoted C1 and C2 (cf.the construction of Fig. I). Without loss of generality weassume that C1 contains nodes 0,. .., 2d -I, and that theircounterparts in C2 are nodes 2d, ..., 2d+1 -I, respectively.The total exchange algorithm for the (d + I )-cube consistsof three phases. In the first phase, there is a total exchange(using the optimal algorithm for the d -cube) within each ofthe cubes C1 and C2 (each node in C1 and C2 exchanges itspackets with the other nodes in C1 and C2, respectively). Inthe second phase, each node transmits to its counterpart nodein the opposite d-cube all of the 2d packets that are destinedfor the nodes of the opposite d-cube. In the third phase,there is an optimal total exchange in each of the two d-cubesof the packets received in phase 2 (see Fig. 5). Phase 3 must

001 01 I III

QO~~~~~IO 110000

100 101

mIll-!

~m('1-2

m(II-3

m(tl-1

m("-2

~m(tJ-3

~~_.:-.~~--~, """...mltl-4FIG. 4. Spanning tree construction for optimal single node scatter under

the MLA assumption for d = 3 and d = 4.

dl nodes. (This follows from the discussion above and fromthe fact that there are at least l (2 d -I) / d J nodes x having

the same value of m(x).) Hence, T is perfectly balanced inthe sense that the Ti's are as close to being equal in size aspossible.

4. TOTAL EXCHANGE UNDER THEMLA ASSUMPTION

Consider the total exchange problem under the MLA as-sumption. We showed in the preceding section that for anysingle node scatter algorithm in the d-cube, the number ofpacket transmissions is bounded below by d2d-I, and it isequal to d2d-1 if and only if packets follow shortest pathsfrom the root to all other nodes. Since a total exchange canbe viewed as 2 d separate versions of single node scatter, a

lower bound for the total number of transmissions is

d2d2d-l. (3)

Since each node has d incident links, at most d2d transmis-sions may take place at each time unit. Therefore, if T d isthe execution time of a total exchange algorithm in the d-cube, we have

Td~ 2d-l. (4)

For an algorithm to achieve this lower bound, it is necessarythat packets follow shortest paths and that all links are busy(in both directions) during all of the 2d-1 time units. In whatfollows, we present an algorithm for which T d = 2d-l. Inlight of the above, this algorithm is optimal with respect toboth the time and the number of packet transmissions criteriaand achieves 100% link utilization.

We construct the algorithm recursively. We assume thatwe have an optimal algorithm for total exchange in the d-

Page 9: Optimal Communication Algorithms for Hypercubes*dimitrib/OptimalCA.pdfThe Hamming distance between two nodes is the number of bits in which their identity numbers differ. Two nodes,are

COMMUNICATION ALGORITHMS FOR HYPERCUBES 271

Nd(i, n) ~ 2d-t + n -1,(5)'"' - 1 2d-I' - 0 2d 1vn -,..., ,1-, ..., -

FIG. 5. Recursive construction of a total exchange algorithm for the d-cube. Let the (d + 1 )-cube be decomposed into two d-cubes denoted C.and C2. The algorithm consists of three phases. In the first phase, there is atotal exchange within each of the cubes C. and C2. In the second phase,each node transmits to its counterpart node in the opposite d-cube all ofthe 2 d packets that are destined for the nodes of the opposite d-cube. In the

third phase, there is a total exchange in each of the two d-cubes of thepackets received in phase 2.

to see this, note that the left-hand quantity in Eq. (5 )is thenumber of packets of node i E C I that must be transmittedby node i + 2d during the first n time units of phase 3, whilethe right-hand quantity in Eq. (5) is the number of availabletime units within phase 2 for transferring these packets fromnode i to node i + 2 d. There is also a requirement analogousto Eq. (5) for the nodes i ofC2.

We proceed by induction, using the requirement of Eq.(5) as part of the inductive hypothesis. In particular, weprove that for every d, there exists a total exchange algorithmfor the d-cube satisfying

T d = 2d-1(6)

and Nd(i,n) ~ 2d-t + n -1,"" - 1 2d-t. - 0vn -,..., ,1-, , 2d- 1.

We have Tl = 1 and N1(i, 1) = 1, for i = 0,1, whichproves the inductive hypothesis for d = I. Assume that forsome d, we have a total exchange algorithm for the d-cubethat satisfies the inductive hypothesis (6), and let s(i,j, d)denote the time unit in this algorithm during which node itransmits its own packet that is destined for node j. We con-struct a three-phase total exchange algorithm for the (d + 1)-cube of the type described above that satisfies the inductivehypothesis. Suppose that packets are transmitted in phase 2according to the following rules (in view of the symmetryof the transmissions of nodes of the d-cubes C1 and C2, wedescribe the rules for phase 2 packet transmissions for. onlythe nodes ofC1):

(a) Each node i E C1 transmits its packets to node i + 2din the order in which the latter node forwards them in phase3 (ties are broken arbitrarily); i.e., the packet destined for jE C2, j ", i + 2d, is transmitted before the packet destinedforf E C2,j'", i + 2d, if

s(i,j -2d, d) < s(i,j' -2d, d).

(b) Each node i E C1 transmits its packet destined fornode i + 2d last.

We claim that, under the above rules, phase 3 can proceeduninterrupted after phase I. To show this, consider any iE C1 (the case of i E Cz can be treated analogously). At theend of phase 1 node i has received exactly 2 d-1 packets from

node i + 2d (since phase 1 lasts 2d-1 time units by induction).Hence, n time units after the end of phase 1, node i hasreceived exactly 2d-1 + n packets from node i + 2d. On theother hand, the total number of packets of node i + 2 d that

node i forwards after n + 1 time units of phase 3 is exactlyNd(i, n + 1). Since [cf. Eq. (6)] Nd(i, n + 1) ~ 2d-1 + nfor all n = 0, 1,. .., 2d-1 -I and node i + 2dtransmits its

be carried out after phase I because during phase I all thelinks of the cubes C1 and C2 are continuously busy (sincethe d-cube total exchange algorithm is assumed optimal).On the other hand, phase 2 may take place simultaneouslywith both phase 1 and phase 3. In an algorithm presentedin [I], phase 3 starts after the end of phase 2, resulting in anexecution time of 2 d -1 units. Here, we improve on this

time by allowing phase 3 to start before phase 2 ends. Toillustrate how this is possible, consider the packet originatingat some node i E C1 and destined for its counterpart nodein C2, namely i + 2d, and the packet originating at i + 2dand destined for i. These packets are not transmitted at allduring phase 3. Therefore, if they are transmitted last inphase 2 then phase 3 can start one time unit before the endof phase 2. This idea can be generalized as follows: clearly,if it were guaranteed that packets going from C1 to C2 andfrom C2 to C I arrive sufficiently early at C2 and at C I, re-spectively, then phase 3 may be carried out just after phase1, without completing phase 2. In such a case, the first halfof phase 2 would be carried out simultaneously with phase1, while the second half would be carried out simultaneouslywith phase 3, and we would have T d+1 = 2T d. Since, byassumption, T d is equal to the lower bound 2d-1 ofEq. (4)for a total exchange in the d-cube, we would have T d+1 = 2d,implying that such an algorithm would achieve the lowerbound of Eq. (4) for the (d + 1 )-cube; We prove that thisis indeed feasible.

Suppose that an optimal total exchange algorithm has al-ready been devised for the d-cube. Let Nd(i, n) denote thenumber of its own packets that node i has transmitted upto and including time n, forn = 1,. .., 2d-1 [Nd(i, n)rangesfrom 1 to 2d -1]. We can use Nd(i, n) to express the re-quirement that phase 3 packets originating at nodes of C Iare available in time at the appropriate nodes of C2, so thatphase 3 begins right after phase 1 and continues withoutdelay. In particular, it is necessary that

Page 10: Optimal Communication Algorithms for Hypercubes*dimitrib/OptimalCA.pdfThe Hamming distance between two nodes is the number of bits in which their identity numbers differ. Two nodes,are

272 BERTSEKAS ET AL.

packets to node i according to the above rules, node i alwayshas enough packets from node i + 2d for transmission ifphase 3 begins immediately after phase I. Since i E C1 waschosen arbitrarily, this holds for all i E C1.

Consider the total exchange algorithm for the (d + 1)-cube whereby phase 3 proceeds uninterrupted immediatelyfollowing phase I as described above. Since according to theinductive hypothesis, each of phases I and 3 takes time T d= 2d-l, this algorithm takes time 2T d = 2d. There remainsto show that the second part of the inductive hypothesis issatisfied for d+ I. For any node i, let Nd+l(i, n) denote thenumber of node z"s own packets that i has transmitted up toand including time n in this algorithm. Since, in the first2d-1 time units of this algorithm, phases I and 2 executesimultaneously, we obtain

Nd+l(i,n)=Nd(i,n)+n, Vn=I,...,2d-l.

By combining this equation with the inequality Nd( i, n) ~ 2d-I, which holds for all n, we obtain

Nd+l(i,n)~2d+n-l, Vn=I,...,2d-l.

Since, in the next 2d-l time units of this algorithm, phases3 and 2 execute simultaneously (and i does not transmit anypacket of its own in phase 3), we have

Time Link 1

1 0001

Nd+t(i, n) = 2d- Vn = 2d-1 + 2d, .+ n,Time Link 1 Link 2

1 0001 0011By combining the last two relations, it follows that Nd+l(i,n) satisfies

0010

Nd+J(i,n)~2d+n-

,2d.Vn =

Time Link 1 Link 2 Link 3

1 0001 0011 0101

2 0010 0111

3 OliO

4 0100

Since the choice of i was arbitrary, this implies that the in-ductive hypothesis (6) holds for the (d + I)-cube.

Implementation of the Optimal AlgorithmIn what follows, we present the rules used by the nodes

of the d-cube for transmitting their own packets and for-warding the packets they receive from other nodes, wheneverthey require further transmission. We write the identity ofnode i as (id, ..., il), where each ik, k = I, ..., d, is a 0or a 1. Moreover, we denote by ek the identity of node 2 k-1 ,for k = 1, ..., d. The link between i and i e ek is called thekth link incident to i. Finally 4 denotes the reverse of bit ik;namely 4 = (ik + 1) mod 2.

We first describe the order in which an arbitrary node itransmits its own packets. It can be seen that during timeunits 1, ..., 2k-l, node i transmits all its packets destinedfor nodes

(id,...,ik+t,4,Xk-t,...,Xt),where Xm = 0 or I, for m = ,k-

Time Link 1 Link 2 Link 3 Link 4

1 0001 0011 0101 1001

2 0010 0111 1011

3 0110 1101

4 0100 1010

5 1111

6 1110

7 1100

8 1000FIG.

6. Implementation of the total exchange algorithm for d = 4.

through its kth incident link, for k = 1, ..., d. The lastpacket to be transmitted in this group is the one destined fornode i (t) ek. For i = 0, the exact order in which i transmitsits packets on each of its incident links may be derived byusing a sequence of d tables, which may be constructed it-eratively. The kth table consists of k columns, the mth ofwhich contains the destinations of the packets transmittedthrough link m. The first table contains only e,. For k = 2,..., d, the first k -1 columns of the kth table are identicalto those of the (k -I )st table, whereas its last column consistsof ek and the entries of the (k -I )st table with their kth bitbeing set to I. In the last column, entries corresponding tothe same row of the (k -I )st table appear one after theother, ordered (arbitrarily) from left to right; entries corre-sponding to different rows of the (k -I )st table are orderedfrom top to bottom. The last element of the last column isek. This scheme follows from the recursive construction ofthe algorithm. As an example, we present the scheme for d= 4 in Fig. 6.

For any other node i, the corresponding order of destin-ations may be obtained by forming the (t) operation of eachentry of the dth table of 0 with i.

Page 11: Optimal Communication Algorithms for Hypercubes*dimitrib/OptimalCA.pdfThe Hamming distance between two nodes is the number of bits in which their identity numbers differ. Two nodes,are

COMMUNICATION ALGORITHMS FOR HYPERCUBES

APPENDIX: INDIVISIBILITY OF 2d -I BY d

The proof of the following result is due to David Gillmanand Arthur Mattuck of the M.I. T. Department of Mathe-matics and is given here with their kind permission.

PROPOSITION. For any integer d ~ 2, 2d -1 is not di-visible by d.

Proof For any positive integers a, b, and d we use thenotation a = b (mod d) to indicate that a and b give thesame remainder when divided by d; this remainder is denoteda mod d or b mod d. We note that for all positive integersa, b, d, and t we have

We now consider the packets arriving at some node i andpresent the rules under which these packets are forwardedby i (whenever necessary). Packets arrive in i, through thekth link, in groups of 2k-l, for k = I, ..., d. Each groupcontains all the packets originating from the same node (Y d,..., Yk+l, 4. ik-l, ..., il) (where Ym = 0 or I, for m = k+ I, ..., d) and destined for all nodes of the form (id, ...,ik, Xk-l, ..., XI) (where Xm = 0 or I, for m = I, ..., k -1).The order of group arrivals is lexicographic on (Yd0 id, ...,Yk+1

0 ik+I)' Routing is accomplished as follows:

A packet destined for node (id, ..., ik, Xk-l, ...,XI) is placed in the queue which contains packets to betransmitted by i through the ~th link, where

(mod d) => ta = tb (mod d).a=b~= Max {mlxm=lm}

l~m~k-1 (Write a = Pad + w, b = Pbd + w, tw = pd + r, where w= a mod d = b mod d and r = (tw) mod d, and note that r= (ta) mod d = (tb) mod d.)

It suffices to consider odd d ~ 2. We argue by contradic-tion. If the claim does not hold, let d be the smallest oddinteger which is larger than I and is such that

2d= 1 (mod d).

Let m be the smallest positive integer for which

Packets originating from different nodesj ,j' and placedin the same queue are ordered according to the lexi-cographic order between j e i and j' e i. Packets orig-inating from the same node and placed in the samequeue preserve their order of arrival. Forwarding pack-ets in the kth link starts at time 2k-1 + 1, for k = 1,..., d -1; no forwarding takes place in the dth link.

The rules presented above follow from the recursive con-struction of the algorithm. Our earlier analysis guaranteesthat packets are always in time at the intermediate nodes (ifany) of the paths they have to traverse. Note that the travelingschedule of each packet may be locally determined at theintermediate nodes of its path by examining the packet'sorigin and destination, so packets do not have to carry timinginformation.

(mod d). (A3)2m=

We claim that m < d. To see this, note that the numbers

2 mod d 22 mod d ...2d mod d, "

belong to {I, ..., d -I}. Since there are d such numbers,some of them must repeat; i.e., for some integers rand swith 1 ~ r < s ~ d,

5.

CONCLUSIONS

2' = 25 (mod d).

Using Eq. (AI) with a = 2', b = 2s, and t = 2d-s, and using

also Eq. (A2), we obtain

2d-s+r = 2d = (mod d).

Excessive communication time is widely recognized asthe principal obstacle for achieving large speedup in manyproblems using massively parallel computing systems. Thisemphasizes the importance of optimal communication al-gorithms. In this paper, we have shown that a very strongform of optimality can be achieved for some basic com-munication problems in the hypercube architecture.

Our methodological ideas may find application in otherrelated contexts. In particular, variations of our algorithmscan be investigated under different and possibly less restric-tive assumptions. Furthermore, some of our algorithmicconstructions can be applied in other architectures to obtainoptimal or nearly optimal algorithms for the communicationproblems of this paper. Finally, it is worth considering thepotential existence of optimal algorithms for specializedcommunication tasks, arising in the context of specific nu-merical and other methods.

Since m is the smallest positive integer for which Eq. (A3)holds, we obtain m ~ d -s + r, so finally m < d.

Let us now express d as d = pm + r, where r = d mod m.By multiplying by 2m in Eq. (A3) and using Eq. (AI), weobtain

22m=2m= (mod d).

By multiplying again by 2m and using Eq. (A3) and (AI),we have

Page 12: Optimal Communication Algorithms for Hypercubes*dimitrib/OptimalCA.pdfThe Hamming distance between two nodes is the number of bits in which their identity numbers differ. Two nodes,are

274 BERTSEKAS ET AL.

23m = 2 m =(mod d),

and by continuing similarly,

2pmE (mod d).

By multiplying by 2' in this equation and again using Eq.(AI) together with Eq. (A2), we obtain

2' ~ 2pm+, ~ 2d ~ (mod d).

Since r < m, our hypothesis on m implies that r = O. Thus,d is divisible by m. This implies that m is odd (since d isodd) and that 2m= I (mod m) (since Eq. (A3) holds). Inview of the definition of d and the fact d> m, we must havem = 1, which contradicts Eq. (A3). Q.E.D.

REFERENCES

16. Saad Y., and Schultz, M. H. Data communication in parallel architec-tures. Yale University Report, Mar. 1986.

17. Saad Y., and Schultz, M. H. Parallel direct methods for solving bandedlinear systems. Linear Algebra Appl. 88/89 (1987),623-650.

18. Saad Y., and Schultz, M. H. Topological properties of hypercubes. IEEETrans. Comput. 37(1988), 867-872.

19. Saad, Y ~ Communication complexity of the Gaussian elimination al-gorithm on multiprocessors. Linear Algebra Appl. 77 (1986),315-340.20.

Stamoulis, G. D., and Tsitsiklis, J. N. Efficient routing schemes formultiple broadcasts in hypercubes. Proc. 29th IEEEConf Decision and

.Control. Honolulu, H:awaii, 1990, pp. 1349-1354.

21. Stout, Q. F., and Wagar, B. Passing messages in link-bound hypercubes.Proc. 1986 Hypercube Conference. SIAM, Philadelphia, 1987, pp. 251-257.

22. Topkis, D. M. Concurrent broadcast for information dissemination.IEEE Trans. Software Engrg. 13 (1983),207-231.

23. Varvarigos, E. A., and Bertsekas, D. P. Communication algorithms forisotropic tasks in hypercubes and wraparound meshes. Rep. LIDS-P-1972, Laboratory for Information and Decision Systems, MIT, Cam-bridge, MA,.Mar. 1990; submitted for publication.

24. Varvarigos, E. A. Optimal communication algorithms for multiprocessorcomputers. M.S. thesis, Department of Electrical Engineering, MIT,Cambridge, MA; Rep. CICS-TH-192, Center for Intelligent ControlSystems, MIT, Cambridge, MA, Jan. 1990.

DIMITRI P. BERTSEKAS was born in Athens, Greece, in 1942. Hereceived a combined B.S.E.E. and B.S.M.E. from the National TechnicalUniversity of Athens, Greece, in 1965, the M.S.E.E. degree from GeorgeWashington University in 1969, and the Ph.D. degree in system ~iencefrom the Massachusetts Institute of Technology in 1971. Dr. Bertsekas hasheld faculty positions with the Engineering-Economic systems Departmentof Stanford University (1971-1974) and the Electrical Engineering De-partment of the University of Illinois, Urbana (1974-1979). He is currentlya professor of electrical engineering and computer science at the Massachu-setts Institute of Technology. He consults regularly with private industry.and has held editorial positions in several journals. He was elected Fellowof the IEEE in 1983. Professor Bertsekas has done research in the areas ofestimation and control of stochastic systems, linear, nonlinear, and dynamicprogramming, data communication networks, and parallel and distributedcomputation and has written numerous papers in each of these areas. He isthe author of Dynamic Programming and Stochastic Control, AcademicPress, 1976, Constrained Optimization and Lagrange Multiplier Methods,Academic Press, 1982, and Dynamic Programming: Deterministic and Sto-chastic Models. Prentice-Hall, 1987 and coauthor of Stochastic OptimalControl: The Discrete-Time Case, Academic Press, 1978, Data Networks,1987, and Parallel and Distributed Computation: Numerical Methods,Prentice-Hall, 1989.

CUNEYT QZVEREN was born in Istanbul, Turkey, on July 20, 1962.He received the B.S. and M.S. degrees in electrical engineering and computerscience, the Electrical Engineer degree, the M.S. degree from the Sloan Schoolof Management, and the Ph.D. degree in electrical engineering, all from theMassachusetts Institute of Technology, Cambridge, in 1984, 1987, 1987,1989, and 1989, respectively. He is currently a senior engineer at DigitalEquipment Corp., working on the design and the implementation of a high-speed communications switch. From January to August 1988 he conductedresearch at the Institut de Recherche en Informatique et Sysremes Aleatoires,France, and from September to December 1989 he was a postdoctoral re-search associate at the Laboratory for Information and Decision Systems atM.I. T. His interests are in the analysis and control of large-scale dynamicsystems, including applications to communications systems, manufacturingsystems, and economics. Dr. Ozveren is a member of Sigma Chi, Tau Beta

I. Bertsekas, D. P., and Tsitsiklis, J. N. Parallel and Distributed Compu-tation: Numerical Methods. Prentice-Hall, Englewood Oilfs, NJ, 1989.

2. Bhatt, S. N., and Ipsen, I. C. F. How to embed trees in hypercubes.Res. Rep. Y ALEU /DCS/RR-443, Department of Computer Science,Yale Univel1!ity, 1985.

3. Dally, W. J., and Seitz, C. L. Deadlock-free message routing in multi-processor interconnection networks. IEEE Trans. Comput. C-36 (1987),547-553.

4. Dekel, E., Nassimi, D., and Sahni, S. Parallel matrix and graph algo-rithms. SIAM J. Comput. 10 (1981),657-673.

5. Greenberg, A. G., and Hajek, B. Deflection routing in hypercube net-works. Unpublished report, 1989.

6. Hedetniemi, S. M., Hedetniemi, S. T., and Liestman, A. L. A surveyof gossiping and broadcasting in communication networks. Networks18 (1988),319-349.

7. Johnsson, S. L., and Ho, C. T. Optimum broadcasting and personalizedcommunication in hypercubes. IEEE Trans. Comput. 38 (1989), 1249-1268.

8. Johnsson, S. L. Cyclic reduction on a binary tree. Comput. Phys. Comm.37(1985),195-203.

9. Johnsson, S. L. Communication efficient basic linear algebra compu-tations on hypercube architectures. J. Para/lei Distrib. Comput. 4 ( 1987),133-172.

10. Krumme, D. W., Venkataraman, K. N., and Cybenko, G. The tokenexchange problem. Tech. Rep. 88-2, Tufts Univel1!ity, 1988.

11. Kermani, P., and Kleinrock, L. Virtual cut-through: A new computercommunicating switching technique. Comput. Networks 3 (1979), 267-286.

12. McBryan, O. A., and Van de Velde, E. F. Hypercube algorithms andtheir implementations. SIAM J. Sci. Stat. Comput. 8 (1987),227-287.

13. Minzer, S. E. Broadband ISDN and asynchronous transfer mode (A TM).IEEEComm. Mag. (Sept. 1989),17-57.

14. Ozveren, C. Communication aspects of parallel processing. Rep. LIDS-P-1721, Laboratory for Information and Decision Systems, MIT, Cam-bridge, MA, 1987.

15. Saad, Y., and Schultz, M. H. Data communication in hypercubes. Res.Rep. YALEU/DCS/RR-428, Yale Univel1!ity, Oct. 1985 (revised Aug.1987).

Page 13: Optimal Communication Algorithms for Hypercubes*dimitrib/OptimalCA.pdfThe Hamming distance between two nodes is the number of bits in which their identity numbers differ. Two nodes,are

COMMUNICATION ALGORITHMS FOR HYPERCUBES 275

Pi, Eta Kappa Nu, and IEEE. In 1989 he was a finalist for the 18th IEEEConference on Decision and Control Best Student Pap<;r Award. He is alsothe 1989 recipient of the Pugh-Roberts Associates Prize in Computer Sim-ulation Applied to Corporate Strategy.

Columbia, and from 1987 to 1990, he was a Postdoctoral Associate at M.I. T.He is currently an assistant professor in the Department of Mathematics,University of Washington, Seattle. His research interests are in optimization

algorithms.

JOHN N. TSITSIKLIS was born in Thessaloniki, Greece, in 1958. He

received the B.S. degree in mathematics ( 1980) and the B.S. (1980), M.S.(1981 ) and Ph.D. (1984) degrees in electrical engineering, all from the Mas-sachusetts Institute of Technology, Cambridge, Massachusetts. During theacademic year 1983-1984 he was an acting assistant professor of electricalengineering at Stanford University, Stanford, California. Since 1984 he hasbeen with the Electrical Engineering and Computer Science Department atthe Massachusetts Institute of Technology, where he is currently associateprofessor. His research interests are in the areas of parallel and distributed

computation, systems and control theory, and applied probability. Dr. Tsi-tsiklis is the coauthor, with Dimitri Bertsekas, of Parallel and Distributed

Computation: Numerical Methods (1989). He has been a recipient of anIBM Faculty Development Award ( 1983), an NSF Presidential Young In-vestigator Award (1986), an Outstanding Paper Award from the IEEE Con-trol Systems Society (for a paper coauthored with M. Athans, 1986), andthe Edgerton Faculty Achievement Award from M.I. T. (1989). He is anAssociate Editor of the Applied Mathematical Letters and the IEEE Trans-actions on Automatic Control.

GEORGE D. STAMOULIS was born in Athens, Greece, in 1964. Hereceived a diploma in electrical engineering (with highest honors) from theNational Technical University of Athens, Greece, in 1987, and the M.S,degree in electrical engineering from the Massachusetts Institute of Tech-nology, Cambridge, Massachusetts, in 1988. He is currently completing hisPh.D. degree in electrical engineering at M.I. T. His research interests are inthe areas of routing and performance evaluation of multiprocessing systems,communication networks, and queueing theory. Mr. Stamoulis was amongthe winners of the Greek Mathematic Olympiad in both 1981 and 1982.He also participated in the 23rd International Mathematic Olympiad, inBudapest, in July 1982. He is a member of the Technical Chamber of Greeceand Sigma Xi.

PAUL Y. TSENG received the engineering degree in mathematics fromQueen's University, Kingston, Canada, in 1981 and the Ph.D. degree inoperations ~h from the Massachusetts Institute of Technology in 1986.From 1986 to 1987, he served on the faculty of the University of British

Received February 21, 1989; revised August 23,1990; accepted September12,1990