Avoiding Communication Hot-Spots in Interconnection Networks

D. Franco, I. Garcés and E. Luque
{d.franco, iarq23, e.luque}@cc.uab.es; Web: http://aows1.uab.es
Unitat d'Arquitectura d'Ordinadors i Sistemes Operatius, Dept. d'Informàtica
Universitat Autònoma de Barcelona, 08193-Bellaterra, Barcelona, Spain

Abstract

In creating interconnection networks for parallel computers, an efficient design is crucial because of its impact on performance. A high-speed routing scheme that minimises contention and avoids the formation of hot-spots should be included. We have developed a new method, called Distributed Routing Balancing (DRB), to uniformly balance communication traffic over the interconnection network, based on limited and load-controlled path expansion in order to maintain low message latency. DRB defines how to create alternative paths to expand single paths (expanded path definition) and when to use them depending on traffic load (expanded path selection). An evaluation in terms of latency and bandwidth is presented, and conclusions and comparisons with existing methods are given. It is demonstrated that DRB is a method to effectively balance network traffic.

Keywords: Interconnection networks, adaptive routing, random routing, hot-spot avoidance, traffic distribution, uniform latency, distributed routing balancing.

1. Introduction

Cluster computing based on workstations is a realistic alternative to High Performance Computing. Currently, these cluster configurations, which can be considered as a kind of parallel machine, incorporate a specific interconnection network made of commodity switches. This interconnection network is a high-speed network which is an alternative to the typical bus-based LAN connection among the workstations. High-speed networks are composed of point-to-point connections between the workstations forming a topology. The switches have many ports to which workstations and other switches are connected.
Examples of switches are ATM switches [8] and Myrinet switches [15]. The performance of such an interconnection network is the determining factor for allowing high performance computing on these cluster configurations. The computation takes place in the workstations while messages are exchanged between them, travelling from switch to switch.

One of the most crucial problems that affects the performance of communications, in both parallel computer and workstation cluster interconnection networks, is message contention, i.e. when two or more messages want to use the same link at the same moment. Message contention generates message latency, i.e. the time the message has to wait for the resources it needs.

A sustained message contention situation can produce hot-spots [17]. A hot-spot is a region of the network which is saturated (i.e., there is more bandwidth demand than the network can offer). Messages that enter this region suffer a very high latency while other regions of the network are less loaded and have bandwidth available. The problem is an incorrect communication load distribution over the network topology: although the total communication bandwidth requirements do not surpass the total bandwidth offered by the interconnection network, this uneven distribution generates saturated points. This saturation is produced when the buffers of the routers in the hot-spot region are full, while other network regions have free resources. In addition, the hot-spot situation propagates rapidly to contiguous areas in a domino effect which can collapse the whole interconnection network. This effect is even worse in the case of wormhole routing, because a blocked packet occupies a large number of links spread over the network. Nowadays, in order to maintain a low and stable latency, networks are operated at low load to avoid hot-spot generation. This fact leads to an under-use of the network.
[Work supported by the Spanish Comisión Interministerial de Ciencia y Tecnología (CICYT) under contract number TIC 95/0868. Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999. 0-7695-0001-3/99 $10.00 (c) 1999 IEEE.]

The parallel program running on the parallel machine is described as a collection of processes and channels, and there is a mapping which assigns each process to a processor. Processes execute concurrently and communicate by logical channels. Regarding the total execution time of the program, communication latency must be avoided in order to make communications faster. But it is more important to avoid big latency variations than any given amount of latency. The reason is that a bounded amount of latency can be tolerated by hiding it

through the method of assigning an "excess" of parallelism, i.e. having enough processes per processor and scheduling any ready process while other processes wait for their messages. But, if latency undergoes big unpredictable variations from the expected values (due to hot-spots, for example), idle processors will appear, because all their processes are blocked waiting for their corresponding messages, and, consequently, the total execution time of the application is increased.

Many mechanisms have been developed to avoid hot-spot generation due to message contention in interconnection networks, such as the dynamic routing algorithms that try to adapt to traffic conditions. Some examples are Planar Adaptive Routing [4], the Turn Model [16], Duato's Algorithm [9], Compressionless Routing [12], Chaos Routing [13], Random Routing [20, 14] and other methods presented in [10]. The main disadvantages of adaptive routing are the high overheads because of information monitoring, path changing and the necessity to guarantee deadlock, live-lock and starvation freedom. These drawbacks have limited the implementation of these techniques in commercial machines.

The work presented in this paper focuses on developing a new method to distribute messages in the interconnection network using network-load-controlled path expansion. The method is called Distributed Routing Balancing (DRB) and its objective is to uniformly balance traffic load over all paths of the whole interconnection network. The method is based on creating, under certain load values, simultaneous alternative paths between each source and destination node in order to allow a bandwidth use increment and to maintain a low message latency. DRB defines how to create alternative paths to expand single paths (multi-lane path definition) and when to use them, depending on traffic load (multi-lane path selection).

The method offers a broad range of alternatives which move from minimal static to random routing methods. In fact, both static and random routings are included in the DRB specification as particular cases.

The next two sections explain the Distributed Routing Balancing technique. DRB has two components: first, a systematic methodology to generate multi-lane paths according to the network topology and, second, a routing algorithm to monitor traffic load and select multi-lane paths to get the message distribution according to traffic load. Section 2 gives some concept definitions and the multi-lane path generation methodology. Section 3 presents the DRB Routing algorithm. Section 4 shows the evaluation of the DRB method carried out by experimentation with a network simulator. Section 5 presents a discussion of the method and comparison with other existing methods. Finally, section 6 presents the conclusions and future work.

2. Distributed Routing Balancing

Distributed Routing Balancing is a method of creating alternative source-destination paths in the interconnection network using a load-controlled path expansion. DRB distributes every source-destination message load over a "multi-lane path" made of several paths. This distribution is controlled by the path load level. The objective of DRB is a uniform distribution of the traffic load over the whole interconnection network in order to maintain a low message latency and avoid the generation of hot-spots. This message distribution will maintain a uniform and low latency in the whole interconnection network, providing that total communication bandwidth demand does not exceed interconnection network capacity. In addition, due to hot-spot avoiding, network throughput is increased and the network is allowed to be used at higher average load rates. Depending on the traffic load and its distribution pattern, the DRB method configures paths to distribute load from more loaded to less loaded paths.

The method's principal idea is based on message latency behaviour according to message traffic load level in interconnection networks. The typical behaviour has been studied by many authors [1], [7], [10]. It can be represented by a non-linear curve in which two regions can be identified: first, a flat region at low-load level with near-linear behaviour (where big changes in the communication load cause small changes in latency) and, second, a sharp rise from a threshold load level (where small changes in the communication load cause big changes in latency). Also, a threshold latency value can be identified where the curve changes from the flat to the rise region. This Threshold Latency (ThL) is the point of the curve with the minimum radius of curvature.

The sharp slope curve pattern is not desirable because it means latency is not stable and undergoes big changes in relation to small traffic load changes. According to this latency behaviour, the DRB method moves the working point of the congested area from the saturation point to a lower latency point in the flat region of the curve. This effect is achieved by modifying path distribution to reduce traffic on the most loaded paths. The result is big latency reductions on congested paths and small latency increments on non-congested paths, because they still have some bandwidth available and, therefore, the global effect is positive. The resulting latency configuration is uniform and low for all paths.
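As an illustration of this idea (not part of the original method), the knee of such a latency-vs-load curve can be located numerically. The sketch below is our own: it models the curve with an M/M/1-style law L(rho) = 1/(1 - rho) and finds the point of maximum curvature (minimum radius of curvature) on axis-normalised data; the function name and the synthetic curve are assumptions for illustration only.

```python
# Illustrative sketch (not from the paper): locate the Threshold Latency
# (ThL) as the point of minimum radius of curvature, i.e. maximum
# curvature, computed on axis-normalised data so the "knee" does not
# depend on the units of the two axes.
def threshold_point(load, latency):
    lx, ly = max(load), max(latency)
    xs = [x / lx for x in load]          # normalise both axes to [0, 1]
    ys = [y / ly for y in latency]
    best_kappa, best_i = -1.0, 1
    for i in range(1, len(xs) - 1):
        h = xs[i + 1] - xs[i - 1]
        d1 = (ys[i + 1] - ys[i - 1]) / h                        # 1st derivative
        d2 = (ys[i + 1] - 2 * ys[i] + ys[i - 1]) / (h / 2) ** 2  # 2nd derivative
        kappa = abs(d2) / (1 + d1 * d1) ** 1.5                   # curvature
        if kappa > best_kappa:
            best_kappa, best_i = kappa, i
    return load[best_i], latency[best_i]

# Flat region followed by a sharp rise: a synthetic M/M/1-style curve.
load = [i / 1000.0 for i in range(0, 950)]
latency = [1.0 / (1.0 - rho) for rho in load]
rho_th, thl = threshold_point(load, latency)
```

On this synthetic curve the detected knee falls in the high-load region where the sharp rise begins, which is where DRB would start expanding paths.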

The DRB method fulfils the following objectives:

1. Reduction of the message latency under a certain threshold value by dynamically varying the number of alternative paths used by the source-destination pair, while

maintaining a uniform latency for all messages. This is achieved by maximising the use of the interconnection network resources in order to minimise communication delays.

2. Minimisation of path-lengthening. This point is important for Wormhole and Cut-Through flow controls, because two main latency factors, bandwidth use and collision points, are increased. For Store&Forward networks it is also important because Transmission Delay depends directly on the length of the message path.

3. Maximisation of the use of the source and destination node links (node grade), distributing messages fairly over all processor links.

In order to show how DRB works to create and use the alternative paths, we make the following definitions:

Definition 0:

• An Interconnection Network I is defined as a directed graph I=(N,E), where N = ∪_{i=0..MaxN} N_i is the set of nodes and E a set of arcs connecting pairs of nodes. Usually, every node is composed of a router and is connected to other nodes by means of links, represented by the arcs. The topology can be regular or irregular, depending on the network. For example, for k-ary n-cubes [6], n is the dimension and k the size.

• If two nodes N_i and N_j are directly connected by a link, then N_i and N_j are adjacent nodes.

• Distance(N_i, N_j) is the minimum number of links that must be crossed to go from N_i to N_j according to the graph I.

• A path P(N_i, N_j) between two nodes N_i and N_j is the set of nodes selected between N_i and N_j according to the minimal static routing defined for the interconnection network (for example, Dimensional Order Routing for k-ary n-cubes). N_i is the source node and N_j the destination node. The length of a path P, Length(P), is the number of links between N_i and N_j following the defined routing. In the case of minimal static routing:

Length(P(N_i, N_j)) = Distance(N_i, N_j)    (1)

Definition 1: A Supernode S(type, size, N_0^S, V(S)) = ∪_{i=0..l} N_i^S is defined as a structured region of the interconnection network consisting of l adjacent nodes N_i^S around a "central" node N_0^S, provided that N_i^S complies with a given property specified in type and Distance(N_i^S, N_0^S) <= size.

As particular cases, any single node and the whole interconnection network are Supernodes. A Supernode which contains only a single node is called a minor canonical form; if a Supernode contains all the nodes of the network it is called a major canonical form. A node can belong to more than one Supernode.

DRB defines two different Supernode types suitable for any topology. The first one is called Gravity Area and the second Subtopology. The parameters type and size of the Supernode determine which nodes are included in the Supernode and also the following properties: topological shape, number of nodes l, Supernode Grade (number of N_i node links not connected to other N_i node links, i.e. links connected outside the Supernode), and number of Supernode nodes connected to N_0^S.

Gravity Area Supernode: A Gravity Area Supernode S("Gravity Area", size, N_0^S, V(S)) is the set of nodes at a distance from the node N_0^S smaller than or equal to the size. This type expands the Supernode selecting a higher number of nodes surrounding the central node. It is suitable for regular or irregular networks.

Subtopology Supernode: A Supernode S("Subtopology", size, N_0^S, V(S)) has the same full/partial topological shape as the interconnection network, but its dimension and/or size is reduced. Therefore, the Subtopology Supernode can be considered a kind of topological "projection" of the network topology. It can be applied to regular networks with a structured topology, dimension and size. For example, in a k-ary n-cube, a Subtopology Supernode is any j-ary m-cube with j<k and/or m<n. (A more detailed description of these Supernode types for k-ary n-cubes [6] and Midimew networks [2] and the evaluation of their above-mentioned properties can be found in [11].)
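To make the Gravity Area notion concrete, the following hypothetical Python sketch enumerates such a Supernode on a k-ary n-cube torus; `torus_distance` and `gravity_area` are our own illustrative helpers, not part of the paper.

```python
from itertools import product

def torus_distance(a, b, k):
    # Minimum number of links between two nodes of a k-ary n-cube (torus).
    return sum(min(abs(x - y), k - abs(x - y)) for x, y in zip(a, b))

def gravity_area(center, size, k, n):
    # Gravity Area Supernode: all nodes at distance <= size from the
    # central node N0 (here `center`), per Definition 1.
    return {node for node in product(range(k), repeat=n)
            if torus_distance(node, center, k) <= size}
```

For an 8-ary 2-cube, gravity_area((0, 0), 1, 8, 2) contains the central node plus its four neighbours, while size 0 yields the minor canonical form (the central node alone).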

Definition 2: A Multi-step Path Ps(SOrigin, N_i^SOrigin, N_j^SDest, SDest) is the path generated between two Supernodes, SOrigin and SDest, as

Ps = P1(N_0^SOrigin, N_i^SOrigin) ∗ P2(N_i^SOrigin, N_j^SDest) ∗ P3(N_j^SDest, N_0^SDest)

where ∗ means path concatenation and P1, P2 and P3 are single paths.

The MSP is composed of the following steps:


Step 1: From the central node of the Supernode Supernode_Origin, N_0^SOrigin, to a node belonging to Supernode_Origin, N_i^SOrigin.

Step 2: From N_i^SOrigin to a node belonging to Supernode_Destination, N_j^SDest.

Step 3: From N_j^SDest to the central node of the Supernode Supernode_Destination, N_0^SDest.

Step 1 and/or Step 3 can be null if Supernode_Origin = {N_0^SOrigin} (it is minor canonical) and/or Supernode_Destination = {N_0^SDest} (it is minor canonical). If both Supernodes Origin and Destination are in minor canonical form, then the Multi-step Path is in canonical form, and it is equal to the path following minimal static routing.

The length of a multi-step path, Length(Ps), is defined as the sum of each individual step length following static routing:

Length(Ps) = Length(P1(N_0^SOrigin, N_i^SOrigin)) + Length(P2(N_i^SOrigin, N_j^SDest)) + Length(P3(N_j^SDest, N_0^SDest))    (2)
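Under the assumption of a k-ary n-cube, where the minimal path length equals the torus distance, equation (2) can be sketched as follows; `msp_length` and `torus_distance` are hypothetical helpers of our own, not from the paper.

```python
def torus_distance(a, b, k):
    # Minimal static path length between two nodes of a k-ary n-cube.
    return sum(min(abs(x - y), k - abs(x - y)) for x, y in zip(a, b))

def msp_length(n0_orig, ni_orig, nj_dest, n0_dest, k):
    # Eq. (2): the Multi-step Path length is the sum of the three step
    # lengths, each following minimal static routing.
    return (torus_distance(n0_orig, ni_orig, k)     # Step 1
            + torus_distance(ni_orig, nj_dest, k)   # Step 2
            + torus_distance(nj_dest, n0_dest, k))  # Step 3
```

As the text notes next, such a path may be longer than the direct minimal path between the two central nodes.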

From this definition, it can be seen that some of the Multi-step Paths (MSP in the following) between N_0^SOrigin and N_0^SDest can be of non-minimal length. This length is a measure of the transmission time of the Ps.

MSP Latency(Ps) is the addition of the transmission time and the waiting time spent by one message, on routers' queues due to message contention, to travel from N_0^SOrigin to N_j^SDest:

Latency(Ps) = Transmission_Time + Σ_{∀Node∈Ps} Queuing_Delay(Node)    (3)

MSP Bandwidth(Ps) is the inverse of the Latency:

Bandwidth(Ps) = Latency(Ps)^-1    (4)

Definition 3: A Metapath P*(Supernode_Origin, Supernode_Destination) is the set of all multi-step paths Ps_i generated between the Supernodes Supernode_Origin and Supernode_Destination:

P* = ∪_{∀i,j} Ps(SOrigin, N_i^SOrigin, N_j^SDest, SDest)

Suppose l to be the number of nodes of Supernode_Origin and k the number of nodes of Supernode_Destination. The Metapath Width s is the number of Multi-step Paths which compose the Metapath:

Metapath Width = s = l*k    (5)

When Supernode Origin and Destination are in minor canonical form, the Metapath P* is in canonical form, i.e. it is composed of the one and only minimal static path (s=1).

Metapath Length (Length(P*)) is the average of all the individual multi-step path lengths that compose it:

Length(P*) = (1/s) Σ_{∀s} Length(Ps)    (6)

Metapath Latency (Latency(P*)) is the equivalent latency, defined as the inverse of the addition of the individual MSP-Latency inverses. These inverse latencies are, in fact, bandwidths and their addition is the equivalent bandwidth. The physical concept is the same as adding an association of elements in parallel, which can be found in electronic systems, for example.

Latency(P*) = ( Σ_{∀s} Latency(MSP_s)^-1 )^-1    (7)

Canonical Latency of a Metapath P* is the time a message of a determined size uses to travel along the Metapath without other messages in the network, when the Metapath is canonical.

Metapath Bandwidth is defined as the inverse of the Latency:

Bandwidth(P*) = Latency(P*)^-1 = Σ_{∀s} Bandwidth(MSP_s)    (8)

Canonical Metapath Bandwidth (BWc) is defined as the bandwidth of the canonical Metapath in the absence of other messages in the network. It is the inverse of the Canonical Latency. It is the maximum number of messages per unit time that the path can accept.
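Equations (7) and (8) combine like conductances of resistors in parallel, as the text notes. A minimal sketch, with function names of our own choosing:

```python
# Sketch of eqs. (7)-(8): MSP bandwidths (inverse latencies) add like
# parallel conductances; the Metapath latency is the inverse of the sum.
def metapath_bandwidth(msp_latencies):
    return sum(1.0 / lat for lat in msp_latencies)   # eq. (8)

def metapath_latency(msp_latencies):
    return 1.0 / metapath_bandwidth(msp_latencies)   # eq. (7)
```

For two MSPs of latency 2 each, the equivalent Metapath latency is 1: two equally loaded lanes double the available bandwidth.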

3. DRB Routing

For a given application as described in the introduction, a Metapath P* is designated for each logical channel by assigning a Source Supernode to the source node and a Destination Supernode to the destination node. The Source Supernode is a Message Scattering Area (MeSA) from the source node. The Destination Supernode is a Message Gathering Area (MeGA) to the destination.

Then, for each message that the source process sends, a Multi-step Path Ps(SOrigin, N_i^SOrigin, N_j^SDest, SDest) belonging to the Metapath P* is selected and the message is sent through it.

Under this scheme, the communication between source and destination can be seen as if it were using a wide multi-lane "Metapath" of potentially higher bandwidth than the original path, from a source "Supernode" to a destination "Supernode". This multi-lane path can be likened to a highway, and the MeSA and MeGA to the highway access and exit areas, respectively. Several Multi-Step Paths can be non-disjoint and share some of the


links, but as they are used alternatively, an effective extra bandwidth is available for the Metapath. Using DRB Routing there is a double effect on communication. First, latency on a single path is reduced because path occupancy is lower. This reduction is high because of the non-linear behaviour of the latency, as explained in Section 2. Second, for a source-destination pair several paths are used in parallel, resulting in a higher throughput.

DRB Routing: DRB Routing is in charge of dynamically configuring Metapaths and distributing messages between the Multi-Step Paths of the Metapath. The fundamentals of DRB Routing are to detect the latency experienced by the messages in the network, to configure the Metapath depending on the latency, and to distribute the messages among the Multi-Step Paths of the Metapath. Therefore, DRB Routing is divided into three phases: Traffic Load Monitoring, Dynamic Metapath Configuration and Multi-Step Path Selection. These phases are independently executed by each channel of the application. The Monitoring activity is carried out by the messages themselves and its objective is to record the latency the message experiences. Dynamic Metapath Configuration is carried out at channel level each time a message arrives at destination, and Multi-Step Path Selection is carried out at channel level each time a message is injected. Fig. 1a shows the actions performed by DRB Routing at process flow diagram level. Fig. 1b shows the DRB Routing procedure at the beginning, when a message travels following minimal static routing and the latency is acknowledged back to the sender. Fig. 1c shows the DRB Routing procedure once a Metapath has been expanded and messages are sent through several Multi-Step Paths.

[Figure 1 appears here.]

Fig. 1. DRB Routing: a) Process flow diagram, b) Monitoring and configuration phases, c) Path expansion phase.

In order to carry out this function, DRB Routing performs the following actions:

1 - Traffic Load Monitoring: Traffic load monitoring is carried out by the messages themselves. The latency experienced is recorded and carried by the message itself (Latency recording in Fig. 1). The message records information about the contention it experiences at each node it traverses, when it is blocked because of contention with other messages. The monitoring determines Latency(Ps) according to expression (3).

When a message arrives at its destination carrying the latency information from the MSP it followed, the latency is sent back to the sender by an acknowledge message.

As it is carried out by all the messages in the network, the monitoring activity objective is to identify the current traffic pattern, to detect high and low loaded regions in the network. The following pseudo-code shows the Traffic Load Monitoring phase:

Traffic_Load_Monitoring() /* Performed at each intermediate router */
Begin
  1. For each hop,
     1.1. Accumulate latency (queuing time) to calculate Latency(MSP)
  End For
  2. At destination, the Latency(MSP) is sent back to the sender and delivered to the Metapath Configuration function.
End Monitoring

2 - Dynamic Metapath Configuration: The objective of this phase is to determine the Metapath type and size according to Latency(P*).

When the sender receives a MSP latency, it calculates the new Latency(P*) [using (7)] and decides to increase or reduce the P* Supernode sizes depending on whether Latency(P*) is outside an interval defined by [Thl-Tol, Thl+Tol]. The Threshold value (Thl) identifies the latency saturation point (the change from the flat region to the rise region), as explained in Section 2. Tol defines the tolerated deviation of the actual bandwidth from the canonical bandwidth. The interval determined by Tol defines the range where the Metapath is not changed.
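A minimal sketch of this decision rule, with hypothetical names (`configure_step`, `width`) and a unit step standing in for the ∆size computed by the width formulas below:

```python
# Illustrative sketch (our own names): the Supernode sizes only change
# when Latency(P*) leaves the tolerated interval [Thl - Tol, Thl + Tol].
def configure_step(latency_pstar, thl, tol, width):
    if latency_pstar > thl + tol:
        return width + 1          # saturating: expand the Metapath
    if latency_pstar < thl - tol:
        return max(1, width - 1)  # underused: contract the Metapath
    return width                  # inside the interval: leave P* unchanged
```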


The Supernode sizes are modified to find the new Metapath width according to the relation between the canonical Bandwidth(P*) (BWc) and the current BandWidth(P*) (BW), as stated by the following rules:

- Find the ∆size which satisfies

BWc*Tol < BW + (1/k) Σ_{i=1..∆size} BWc_i < BWc*(1 + 1/k)    (9)

and increment Metapath Width = Metapath Width + ∆size, if BW < BWc*Tol.

- Find the ∆size which satisfies

BWc*(1 - 1/k) < BW - (1/k) Σ_{i=1..∆size} BWc_i < BWc*Tol    (10)

and decrement Metapath Width = Metapath Width - ∆size, if BW > BWc*Tol.

- k is a parameter which defines the channel use and depends on the number of logical channels of the application and the network size.

The configuration of the Supernodes takes into account the latency values as well as the topology of the interconnection network and the physical distance of the source and destination nodes, in order to balance Metapath bandwidth and lengthening.

The following pseudo-code shows the Metapath Configuration phase:

Metapath_Configuration(Threshold Latency Thl, Tolerance Tol) /* Executed at source nodes when a new MSP Latency arrives */
/* The threshold latency Thl is the latency at the change from the flat region to the rise region defined in Sec. 2. Tol defines the interval within which the Metapath does not change */
Var MSP_Latencies: Array[1..SSNSize*DSNSize] of int;
Begin
  1. Receive Latency(MSP)
  2. Calculate Latency(P*) using (7)
  3. If (Latency(P*) > Thl+Tol) Increase Supernode Sizes according to (9)
     ElseIf (Latency(P*) < Thl-Tol) Decrease Supernode Sizes according to (10)
     EndIf

End Metapath Configuration

3 - MSP Selection: This function selects a MSP for each message, to distribute the load among the Multi-step Paths of the Metapath. For each message sent, a MSP is selected depending on the MSP Bandwidth, which is the inverse of the MSP Latency; the more available Bandwidth, the more frequent the use. Suppose MSP(k) is the kth MSP of the Metapath and BW(MSP(k)) is its associated bandwidth. The MSP Bandwidths are used as the values of a discrete probabilistic distribution of the MSPs. The procedure is as follows:

1. Convert the distribution to a cumulative distribution function P, obtaining the proportions P(MSP(k)) by adding and normalising the discrete Bandwidths of each MSP:

P(MSP(k)) = Σ_{i=1..k} BW(MSP(i)) / Σ_{i=1..s} BW(MSP(i));  P(MSP(s)) = 1;  s = Number of MSPs    (11)

2. Generate a uniform random value R between [0, 1).

3. Find the value k which satisfies

P(MSP(k-1)) < R <= P(MSP(k))    (12)

The following pseudo-code shows the Multi-Step Path Selection phase:

Multi_Step_Path_Selection() /* Executed at the source node when it injects a message */
Begin
  1. Build the cumulative distribution function, adding and normalising the MSP_BandWidths according to (11)
  2. Generate a random number between [0, 1)
  3. Select a MSP using the cumulative distribution function according to (12)
End Selection

DRB Routing has been designed with the aim of minimising overheads and with a view to being scalable. In this sense, there is no periodic information exchange and it is fully distributed. It has the characteristic that, under low traffic loads, the monitoring activity is minimum and the paths follow minimal static routing.
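The selection phase is a weighted random choice over the MSPs. A runnable sketch of equations (11) and (12), with function names of our own:

```python
from bisect import bisect_left
import random

# Sketch of the Multi-Step Path Selection phase: MSP bandwidths define a
# discrete probability distribution, and paths are drawn from its
# cumulative distribution function, so wider paths are used more often.
def build_cdf(bandwidths):
    total = float(sum(bandwidths))
    acc, cdf = 0.0, []
    for bw in bandwidths:
        acc += bw / total        # normalised cumulative proportion, eq. (11)
        cdf.append(acc)
    return cdf

def select_msp(cdf, r=None):
    if r is None:
        r = random.random()      # uniform R in [0, 1), step 2
    return bisect_left(cdf, r)   # smallest k with P(MSP(k)) >= R, eq. (12)
```

With bandwidths [3, 1], the CDF is [0.75, 1.0]: draws below 0.75 pick the wider first MSP, the rest pick the second, so the first path carries about three quarters of the messages.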

The memory space and the execution time overhead of the algorithm are very low, because the implied actions are very simple. In addition, these activities are executed a number of times linearly dependent on the number of logical channels of the application and the number of messages sent.

Regarding the time overhead, the monitoring task is just latency recording by the message itself, i.e. storing an integer value, and the Metapath Configuration algorithm is a local and simple computation applied only when a latency rise is detected.

Regarding the space overhead, the latency record is one or a few integers that the message carries itself in its header, and the only information to keep in the source node is the MSP Latencies array for each logical channel.
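The bookkeeping just described can be sketched with two minimal data structures; the class and field names here are hypothetical, not taken from the paper, and only illustrate the stated overheads (one latency integer in the message header, one MSP Latencies array per logical channel at the source).

```python
from collections import defaultdict

class Message:
    """The header carries the MSP index used and a single latency
    counter that the message records itself."""
    def __init__(self, channel, msp_index):
        self.channel = channel
        self.msp_index = msp_index
        self.latency = 0               # incremented while waiting for links

    def wait(self, cycles):
        self.latency += cycles

class SourceNode:
    """Keeps only the MSP Latencies array for each logical channel."""
    def __init__(self, msps_per_channel=3):
        self.msp_latencies = defaultdict(lambda: [0] * msps_per_channel)

    def on_latency_report(self, msg):
        # Update the recorded latency for the MSP this message used.
        self.msp_latencies[msg.channel][msg.msp_index] = msg.latency

src = SourceNode()
m = Message(channel=("A", "B"), msp_index=1)
m.wait(7); m.wait(3)
src.on_latency_report(m)
print(src.msp_latencies[("A", "B")])   # -> [0, 10, 0]
```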

It is important to remark that, in order to achieve an effective uniform load distribution, a global action is needed and that, for this reason, all source-destination nodes are able to expand their paths depending on the message traffic load between them during program execution.

DRB Routing takes advantage of the spatial and temporal locality of parallel program communications, like cache memory systems do with memory references. The algorithm adapts the Metapath configurations to the current traffic pattern. While this pattern is constant, latencies will be low and the Metapath Configuration does not activate. If the application changes to a new traffic pattern and message latencies change, the DRB Routing Metapath Configuration will adapt the Metapaths to the new situation. DRB is useful for persistent communication patterns, which are the ones that can cause the worst hot-spot situations. Also, DRB reduces injection latency by configuring several disjoint paths between each source-destination pair. In addition, this Metapath adaptability is specific and can be different for each source-destination pair depending on their static distance or latency conditions, so it can be adapted to each different behaviour pattern.

Proceedings of the 32nd Hawaii International Conference on System Sciences - 1999

The effect of this method is to allow a high level of accepted traffic, i.e. network saturation arrives at higher rates of traffic. This means that the granularity of the application processes (the computation/communication ratio) can be lower and can present higher variations, because these variations are better tolerated. Given this scenario, the only issue the application must take into consideration is that the total bandwidth requirements do not exceed the network bandwidth; bandwidth distribution requirements are no longer a preoccupation.

4. DRB Evaluation

We have developed a time-driven network simulator to estimate the performance of DRB routing. In this section we show the results for different traffic patterns and network loads with a fixed message length, and compare the performance with that of static routing. The simulator implements the DRB Random Routing presented in section 3. Basic static Dimensional Order Routing is also user selectable. The simulator simulates different network topologies of any size (k-ary n-cubes, midimews) and different flow control techniques (Store&Forward, Wormhole and Cut-through).

The simulations consisted of sending packets through the network links according to a specific traffic pattern. The simulations were conducted for various topologies and sizes. The selected topologies are the Torus, Hypercube and Midimew, whose sizes range from 16 to 256 nodes. We have assumed wormhole flow control and 10 flits per packet. Each link was bi-directional and had only one flit buffer associated. The packet generation rate followed an exponential distribution whose average is the message inter-arrival time. The results were run many times with different seeds and were observed to be consistent. Each simulation was carried out for 1,000,000 packets. The effects of the first 50,000 delivered packets are not included in the results in order to lessen the transient effects in the simulations.
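The packet-generation scheme just described (exponentially distributed inter-arrival gaps, 1,000,000 packets, with the first 50,000 discarded as transient) can be sketched as follows; the function name and the seed are our own assumptions.

```python
import random

def generate_arrival_times(n_packets, mean_interarrival, seed=42):
    """Packet injection times whose inter-arrival gaps are exponentially
    distributed with mean `mean_interarrival` cycles."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n_packets):
        t += rng.expovariate(1.0 / mean_interarrival)
        times.append(t)
    return times

times = generate_arrival_times(1_000_000, mean_interarrival=30)
warmup = 50_000                     # discard the transient, as in the text
steady = times[warmup:]
print(len(steady))                  # -> 950000
```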

We have chosen some of the communication patterns commonly used to evaluate interconnection networks [10]. Uniform, hot-spot, bit-reversal, butterfly, perfect shuffle and matrix transpose communication patterns were considered in our study. The bit-reversal, butterfly, perfect shuffle and matrix transpose patterns take into account the permutations that are usually performed in many parallel numerical algorithms.

Under the uniform traffic pattern, every node sends messages to the others with the same probability. Under hot-spot traffic some destinations are fixed in order to increase the traffic in a particular area of the network and cause the already explained saturated zones called hot-spots. Under bit-reversal traffic the node with binary co-ordinates $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$ communicates with the node $a_0, a_1, \ldots, a_{n-2}, a_{n-1}$. Butterfly traffic is formed by swapping the most and least significant bits: the node with binary co-ordinates $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$ communicates with the node $a_0, a_{n-2}, \ldots, a_1, a_{n-1}$. In Matrix Transpose the node with binary co-ordinates $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$ communicates with the node $a_{n/2-1}, \ldots, a_0, a_{n-1}, \ldots, a_{n/2}$. Perfect Shuffle rotates left one bit: the node with binary co-ordinates $a_{n-1}, a_{n-2}, \ldots, a_1, a_0$ communicates with the node $a_{n-2}, a_{n-3}, \ldots, a_0, a_{n-1}$.
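For concreteness, the four permutations can be written as bit manipulations on the n-bit node address. This is a sketch; the function names are ours, and `n` is the address width (n even for matrix transpose).

```python
def bit_reversal(node, n):
    """a_{n-1}..a_0 -> a_0..a_{n-1}: reverse the address bits."""
    return int(format(node, f"0{n}b")[::-1], 2)

def butterfly(node, n):
    """Swap the most and least significant bits."""
    hi, lo = (node >> (n - 1)) & 1, node & 1
    node &= ~((1 << (n - 1)) | 1)          # clear both end bits
    return node | (lo << (n - 1)) | hi

def matrix_transpose(node, n):
    """Swap the upper and lower halves of the address."""
    half = n // 2
    return ((node & ((1 << half) - 1)) << half) | (node >> half)

def perfect_shuffle(node, n):
    """Rotate the address left by one bit."""
    return ((node << 1) | (node >> (n - 1))) & ((1 << n) - 1)

# On 16 nodes (n = 4), destinations of node 0b0011 under each pattern:
print(bit_reversal(0b0011, 4),     # -> 12 (0b1100)
      butterfly(0b0011, 4),        # -> 10 (0b1010)
      matrix_transpose(0b0011, 4), # -> 12 (0b1100)
      perfect_shuffle(0b0011, 4))  # -> 6  (0b0110)
```

Each function maps every source node to exactly one destination, so the resulting traffic is a permutation of the node set, as the text states.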

We have studied the average communication latency, the average throughput of the network, and the traffic load distribution in the network. The communication latency was measured as the total time the packets wait to access the links along the path. The throughput was calculated as the percentage relation between the accepted load (amount of information delivered) and the applied communication load (injection rate). These communication loads were measured as the number of messages per unit time. In order to show the traffic load distribution, we calculated the average latency in each link of the network. The experiments were conducted for a range of communication traffic loads from low load to saturation.

In the DRB_Routing experiments, looking for lower congested paths, we selected some Supernodes in order to form a Metapath composed of three Multi-Step Paths, which included the original and two additional paths. With this configuration we expected a high latency reduction due to a low path occupancy.

Next, we will present and compare the results of the network performance experiments quantitatively for static DOR routing and DRB routing for an 8x8 Torus and a 6D Hypercube. The results for other sizes are not shown because the same behaviour was observed.

Results Analysis: Under uniform traffic, there is no load imbalance and, therefore, DRB routing does not modify the load distribution of the network, resulting in almost the same average latency and average throughput of the network for all ranges of load. This is the expected behaviour according to DRB's definition; consequently, DRB cannot improve this situation. For the other four traffic patterns, the communication load unbalance is great and the behaviour of static and DRB Routing changes in a very different way. The following graphs show two curves for each pattern, one for static routing and the other for DRB routing, plotted according to the latency measured as explained before. Load is represented in decreasing values from left to right.

Fig. 2 shows the latency results for the hot-spot traffic pattern for two tori of 16 and 64 nodes. Figures 3 and 4 show the latency results for the bit-reversal, butterfly, perfect shuffle and matrix transpose traffic patterns, for an 8x8 2D-Torus and a 6D Hypercube respectively. The results are very similar for all patterns, so we will comment on them together.

In general, DRB routing performs better than static routing. The difference between the static routing and DRB routing curves for each pattern increases as the network traffic load grows. The static routing curves show a greater rise in latency as the load is increased than those for DRB routing. DRB routing makes an automatic configuration of the Metapath depending on the traffic load.

It can be seen that at big message generation intervals (bigger than 100), DRB behaves nearly equally to static routing. This means that the DRB method does not change the network when it is not necessary. At intermediate intervals (between 60 and 100) DRB begins to use two multi-step paths, reducing latency. At smaller intervals (less than 60) it uses the maximum number of MSPs allowed, resulting in higher latency reductions. While load is increasing, latency improvements increase too, resulting in latency reductions bigger than 50% at the highest load. At the same time, as these latency improvements are achieved, the throughput is increased, as can be seen in Fig. 2 for the hot-spot traffic pattern, and in Fig. 3 and Fig. 4 for the bit-reversal, butterfly, perfect shuffle and matrix transpose traffic patterns. The throughput is improved by up to 50% for DRB routing, while static routing performs worse since it gets saturated earlier.

Fig. 2. Performance results for the hot-spot traffic pattern


Fig. 3. Performance results for torus

Fig. 4. Performance results for hypercubes

In order to show how DRB Routing distributes load and eliminates hot-spots, we compare in Fig. 5 the latency surface for the links of the network using static routing and DRB routing for the hot-spot traffic pattern. We used a value of 30 cycles as the message generation interval. Each grid point represents the average latency of the links of a torus node. It can be seen that, using static routing, big hot-spots appear in the network while other regions of the network are only slightly used. The maximum average latency in the hot-spots is around 15 cycles. When using DRB Routing, these hot-spots are effectively eliminated because the excess load of the hot-spot nodes is distributed among other links. The maximum average latency in this case is about 3.5 cycles.

The conclusions are that, using DRB Routing, more messages are sent and with less latency. DRB routing maintains a uniform load distribution, getting a better use of the network resources, and the network saturation point is reached at higher load rates, minimising the appearance of hot-spots. These results are produced because DRB sends messages by new and different paths which are less loaded, and uses them in a parallel way.


Fig. 5. Latency distribution for the hot-spot traffic pattern

5. Discussion and Comparison

Many adaptive methods try to modify the current path when a message arrives at a congested node. This is the case, for example, of Chaos routing [13], which uses randomisation to misroute messages when the message is blocked. The difference with DRB is that DRB does not act at the individual message level, but tries to adapt the communication flow between source and destination nodes to non-congested paths.

Random routing algorithms [19] [14] uniformly distribute bandwidth requirements over the whole machine, independent of the traffic pattern generated by the application, but at the expense of doubling the path length. A closer view shows that paths of maximum length are not lengthened, but paths of length one are extended, on average, up to the mean distance for regular networks. So, the shortest paths are extremely affected. This is due to the method being "blind": namely, it does not take into account current traffic and it distributes all messages by "brute force" over the entire machine. Although DRB shares some objectives with random routing, the difference is that DRB does not only try to maintain throughput, but also maintains limited individual message latency, because path lengthening can be controlled.
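For contrast, the classic two-phase random routing scheme of [19] can be sketched as below: every message is first routed to a uniformly chosen intermediate node and only then to its real destination, which on average doubles the path length. The routing itself is only sketched here as dimension-order hops on a hypercube, and the function names are our own.

```python
import random

def hypercube_path(src, dst, n):
    """Dimension-order (e-cube) path on an n-dimensional hypercube:
    flip differing address bits from lowest to highest."""
    path, cur = [src], src
    for bit in range(n):
        if (cur ^ dst) & (1 << bit):
            cur ^= 1 << bit
            path.append(cur)
    return path

def valiant_route(src, dst, n, rng=random.Random(0)):
    """Two-phase random routing [19]: route to a uniformly chosen
    intermediate node first, then on to the real destination."""
    mid = rng.randrange(1 << n)
    return hypercube_path(src, mid, n) + hypercube_path(mid, dst, n)[1:]

# A length-1 path (0 -> 1 on a 4-cube) is typically stretched towards
# the mean distance, illustrating why the shortest paths suffer most.
route = valiant_route(0b0000, 0b0001, 4)
print(route[0], route[-1])  # -> 0 1
```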


On the contrary, on average, random routing doubles the path lengthening, with the negative effect on the latency mentioned above. It can be seen that static routing is an extreme case in which both Supernodes, source and destination, contain only the source or destination node, respectively; and that random routing is the other extreme, in which the source Supernode contains all nodes of the interconnection network.

A similar but restricted, less flexible and non-adaptive solution is offered by the IBM SP2 routing algorithm RTG (Route Table Generator), which statically selects four paths for each source-destination node pair that are used in a "round-robin" fashion to utilise the network more uniformly [18]. The Meiko CS-2 machine also pre-establishes all source-destination paths and selects alternative paths to balance the network traffic [3].
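The round-robin idea behind the SP2 scheme can be sketched as follows. This is a hypothetical fragment of our own (the real RTG computes its path tables offline [18]); it only illustrates why such a scheme is non-adaptive: paths are cycled through regardless of current traffic.

```python
from itertools import cycle

class RoundRobinRoutes:
    """Cycle statically through four precomputed paths for one
    source-destination pair, ignoring current traffic conditions."""
    def __init__(self, paths):
        assert len(paths) == 4          # four paths per pair, as in RTG
        self._next = cycle(paths)

    def pick(self):
        return next(self._next)

routes = RoundRobinRoutes(["p0", "p1", "p2", "p3"])
print([routes.pick() for _ in range(6)])  # -> ['p0', 'p1', 'p2', 'p3', 'p0', 'p1']
```

Unlike DRB, which weights path usage by measured latency, every path here receives exactly one quarter of the messages even if one of them crosses a hot-spot.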

DRB is independent of, and in fact can be applied to, any direct or indirect network with any topology and switching technique (Store and Forward, Wormhole, Virtual Cut Through, etc.). Depending on the topology and the switching technique, DRB can introduce the possibility of deadlock into the network. A technique to avoid deadlock, such as Duato's method presented in [10], should be used. This technique assigns extra virtual channels [5] to avoid cycles in the extended channel dependency graph.

In addition, it can be seen that, by definition, DRB is livelock free, because it never produces infinite path lengths, and also starvation free, because no node is prevented from injecting its messages indefinitely. Furthermore, message ordering must be preserved, and under DRB only messages belonging to the same logical channel must be ordered. This is easy to do by numbering them. Message pre-fetching can be used to hide this message disordering.

6. Conclusions and Future Work

Distributed Routing Balancing is a new method for message traffic distribution in interconnection networks. DRB has been developed with the aim of fulfilling the design objectives for parallel computer interconnection networks. These objectives are all-to-all connection and low and uniform latency between any pair of nodes and under any message traffic load.

The evaluation made to validate DRB has revealed very good improvements in latency, effectively eliminating hot-spots from the network. DRB is useful for persistent communication patterns, which are the ones which can produce the worst hot-spot situations.

Latency is reduced by up to 50% and throughput is increased by up to 50%, too. Overheads are minimal because at low loads performance is not reduced.

In relation to future work, we have designed a DRB extension to work at two levels in order to create alternative paths to balance network traffic. The first level is the method presented in this paper, an internode level which changes message paths without moving source or destination positions, and the second one is an intranode level which migrates processes. The intranode level is needed in the following case: suppose that a single node has more message injection or delivery bandwidth requirements from its local processing elements than its links can accept. Then, path distribution cannot be improved and source or destination processes should be moved. We are currently implementing and making simulations to test this new DRB feature and its overheads.

References

[1] A. Agarwal, "Limits on Interconnection Network Performance", IEEE Transactions on Parallel and Distributed Systems, Vol. 2, N. 4, Oct. 1991, pp. 398-412.
[2] R. Beivide, E. Herrada, J. L. Balcázar and A. Arruabarrena, "Optimal Distance Networks of Low Degree for Parallel Computers", IEEE Trans. on Computers, Vol. 40, N. 10, Oct. 1991, pp. 1109-1124.
[3] S. Bokhari, "Multiphase Complete Exchange on Paragon, SP2 and CS-2", IEEE Parallel and Distributed Technology, Vol. 4, N. 3, Fall 1996, pp. 45-49.
[4] A. A. Chien and J. H. Kim, "Planar Adaptive Routing: Low-Cost Adaptive Networks for Multiprocessors", Proc. of the 19th Symposium on Computer Architecture, May 1992, pp. 268-277.
[5] W. J. Dally and C. L. Seitz, "Deadlock-Free Message Routing in Multiprocessor Interconnection Networks", IEEE Trans. on Computers, Vol. C-36, N. 5, May 1987, pp. 547-553.
[6] W. J. Dally, "Performance Analysis of k-ary n-cube Interconnection Networks", IEEE Trans. on Computers, Vol. C-39, N. 6, June 1990, pp. 775-785.


[7] W. J. Dally, "Virtual-Channel Flow Control", IEEE Transactions on Parallel and Distributed Systems, Vol. 3, N. 2, March 1992, pp. 194-205.
[8] M. De Prycker, "Asynchronous Transfer Mode, Solutions for Broadband ISDN", Prentice-Hall, 1995.
[9] J. Duato, "A New Theory of Deadlock-Free Adaptive Routing in Wormhole Networks", IEEE Transactions on Parallel and Distributed Systems, 4(12), Dec. 1993, pp. 1320-1331.
[10] J. Duato, S. Yalamanchili and L. Ni, "Interconnection Networks, an Engineering Approach", IEEE Computer Society Press, 1997.
[11] I. Garcés, D. Franco and E. Luque, "Improving Parallel Computer Communication: Dynamic Routing Balancing", Proceedings of the Sixth Euromicro Workshop on Parallel and Distributed Processing (IEEE-Euromicro) PDP98, Madrid, Spain, January 21-23, 1998, pp. 111-119.
[12] J. Kim, Z. Liu and A. Chien, "Compressionless Routing: A Framework for Adaptive and Fault-Tolerant Routing", Proc. of the 21st International Symposium on Computer Architecture, April 1994, pp. 289-300.
[13] S. Konstantinidou and L. Snyder, "Chaos Router: Architecture and Performance", Proc. of the 18th International Symposium on Computer Architecture, May 1991, pp. 212-221.
[14] M. D. May and P. W. Thompson, P. H. Welch (Eds.), "Networks, Routers and Transputers: Function, Performance and Application", IOS Press, 1993.
[15] Myricom Home Page, http://www.myri.com
[16] C. Glass and L. Ni, "The Turn Model for Adaptive Routing", Proc. of the 19th International Symposium on Computer Architecture, IEEE Computer Society, May 1992, pp. 278-287.
[17] G. F. Pfister and A. Norton, "Hot-Spot Contention and Combining in Multistage Interconnection Networks", IEEE Trans. on Computers, Vol. 34, N. 10, Oct. 1985, pp. 943-948.
[18] M. Snir, P. Hochschild, D. D. Frye and K. J. Gildea, "The Communication Software and Parallel Environment of the IBM SP2", IBM Systems Journal, Vol. 34, N. 2, 1995, pp. 205-221.
[19] L. G. Valiant and G. J. Brebner, "Universal Schemes for Parallel Communication", Proc. of the 13th ACM Symposium on Theory of Computing (STOC), Milwaukee, 1981, pp. 263-277.
