
Proceedings of the IEEE Scalable High Performance Computing Conference, Knoxville, TN, USA, 23-25 May 1994 (IEEE Computer Society Press).


A Universal Approach for Task Scheduling for Distributed Memory Multiprocessors

Kanad Ghose and Neelima Mehdiratta

Department of Computer Science, State University of New York at Binghamton, Binghamton, NY 13902-6000

{ghose, neelima}@cs.binghamton.edu

Abstract

We present a static one-step list scheduling technique for scheduling a task graph onto a distributed memory multiprocessor, taking into account the interconnection constraints and channel conflicts. We use a priority list to directly schedule nodes in a task graph onto the processor architecture. Our scheme departs from conventional schedulers in its use of a "bottom-up" approach for scheduling the task graph nodes. This scheduling technique is applicable to any type of processor architecture and routing strategy. Experimental results indicate the performance advantages of our scheduler.

Keywords: Channel Assignment, Channel Contention, Distributed Memory Multiprocessors, Mapping Problem, Task Scheduling.

1: Introduction

It has been shown that the general problem of mapping task graphs onto arbitrary processor structures, factoring in the inter-task communication time, is NP-complete. Existing scheduling techniques thus use one or more heuristics to realize acceptable solutions with polynomial complexity. Many existing task scheduling techniques for distributed memory multiprocessors use a two-step approach to scheduling. In the first step, heuristics are used to group task nodes that have a large amount of communication among themselves into clusters, assuming full processor connectivity [GeYa 92]. In the second step, clusters are assigned to processors, taking into account the interconnection constraints. This assumption of complete connectivity can be quite unrealistic, since delays due to channel conflicts and multi-hop transits can be significant. The resulting schedule thus has room for improvement. One-step schedulers directly schedule nodes of the task graph onto the underlying processor architecture. A variety of one-step task scheduling techniques have been proposed for scheduling task graphs onto arbitrary processor graphs. Many of these make unrealistic assumptions about the communication costs. For example, it has been assumed that every communication has the same cost [Bok 81], that communication occurs in phases [LeAg 87], that communication channels have infinite bandwidth [YBN 91], or that channel conflicts are absent [LHCA 88].

To date, very few one-step scheduling schemes, such as the ones proposed in [ElLe 90], [KoSa 93] and [ChAg 93], address the real-world constraints of the processor topology, channel contention and routing techniques. The one-step list scheduling approach presented in [ElLe 90] schedules task graphs to arbitrary processor graphs. This approach uses a table to keep track of channel usage to account for channel contention. A communication path is scheduled by picking the path that has the least contention delay (i.e., waiting time). This does not, in general, result in choosing a path with the minimum delay, since the overall communication delay is the sum of the contention delay and the propagation delay on the path. The second factor is more critical since in a message-switched network, as is assumed in [ElLe 90], the propagation delay is a function of the hop count. Further, updates to the routing table are made only at the point of processing the transmission or reception of a message. Thus, the contention figures available from the table are inaccurate, leading to a possibly inefficient schedule. In addition, the scheme presented in [ElLe 90] has a problem common to all existing list-based schedulers, as discussed in Section 2. In [KoSa 93], a scheduling technique for a circuit-switched hypercube interconnection topology is presented. A fairly complex but complete scheme for scheduling task graphs onto processors, using a modification of the pairwise exchange method that takes into account processor connectivity and channel contention, is presented in [ChAg 93].



2: Motivations

List schedulers take as input a directed acyclic graph (DAG) or task graph as the representation of a program, and a processor graph representing the processing nodes and their interconnections. The nodes of the task graph represent tasks, and the node weights represent the computation times of the tasks. The edges represent inter-task communication, with the edge weights representing the costs of these communications. In the processor graph, nodes represent the processors and the edges represent direct physical links between the processors. Scheduling involves assigning the tasks in the task graph onto the nodes in the processor graph. In list scheduling, two priority levels are associated with each node (or task):

precedence level (p-level): This level is measured by the maximum number of edges along any path to the node from the start node(s). The precedence level essentially implies how soon a task becomes eligible for scheduling. In many schemes the p-level is used implicitly as the scheduling priority.

scheduling level (s-level): This level indicates the order in which tasks eligible for scheduling are actually assigned to processors. The assignment of s-level values is specific to the particular scheduling strategy. A number of list scheduling schemes use the length of the longest path from the task to a terminal node (including the computation costs for nodes and the communication costs for edges on the path) as the value of the s-level [Hu 61], [Cof 76], [ElLe 90].
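The two levels can be made concrete with a small sketch. The following Python code is our own illustration, not code from the paper; the names preds, succs, node_wt and edge_wt are ours, mapping each task to its neighbor sets and to the computation and communication weights:

    def p_levels(tasks, preds):
        """p-level: maximum number of edges on any path from a start node."""
        level = {}
        def visit(t):
            if t not in level:
                level[t] = 0 if not preds[t] else 1 + max(visit(p) for p in preds[t])
            return level[t]
        for t in tasks:
            visit(t)
        return level

    def s_levels(tasks, succs, node_wt, edge_wt):
        """s-level: length of the longest path from the task to a terminal node,
        counting computation (node) and communication (edge) weights."""
        level = {}
        def visit(t):
            if t not in level:
                tail = max((edge_wt[t, s] + visit(s) for s in succs[t]), default=0)
                level[t] = node_wt[t] + tail
            return level[t]
        for t in tasks:
            visit(t)
        return level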

List schedulers construct a list of all the nodes in the DAG (i.e., the task graph) in decreasing order of their s-levels, and the first element from the list is taken and assigned to a processor. The ordering in the list ensures that parent nodes are scheduled before the child nodes, so that the data dependencies in the program are maintained. List scheduling schemes differ in the way they assign priorities to tasks. Since the computation times of nodes are fixed, an overall reduction in execution time is obtained by reducing communication costs. The general approach is to schedule nodes that need to communicate with each other on the same processor, or on processors that are as close to each other as possible.

The list scheduling approaches of [ElLe 90] and others such as [Hu 61] and [Cof 76] do not provide an accurate measure for creating scheduling priorities when interprocessor communication is factored in. The s-level associated with a node is a function of the communication costs associated with the successor nodes along the longest path. When two data dependent tasks are scheduled on the same processor, the communication cost between them drops to zero. If they are scheduled onto two non-adjacent processors, the communication cost changes to reflect the distance between the two processors. This change can be quite significant depending on the routing strategy used. In the schemes mentioned above, tasks are scheduled from the start nodes (nodes with no predecessors), or top-down. In the top-down approach, at the time of scheduling a node, the communication costs associated with its successor nodes are assumed to be fixed, although in reality these costs change depending on where the successor nodes get scheduled. Thus the s-level used to schedule a node can be inaccurate. Note that the basic nature of the top-down approach does not allow for a remedy of this problem, since successor nodes have to be scheduled later.

3: The proposed scheduler

We use a static non-preemptive approach to schedule tasks of the task graph onto the processor graph with the aim of reducing the overall execution time of the program. In our approach we schedule the tasks of the task graph on the processors starting from the terminal nodes (nodes with no successors), and schedule a node only after all its successors have been scheduled. Our approach is thus a bottom-up approach. The s-level of a node is still computed as the length of the longest path from the terminal node. Thus, when a node is to be scheduled, it is known exactly where its successors have been scheduled. Consequently, the s-level accurately reflects the actual communication costs that should contribute to the scheduling level. We also use various heuristics to assign a task to a processor that factor in the processor interconnection topology (and thus, the impact of multi-hop delays) and the load on the processors. In general, bottom-up scheduling requires the scheduling level of a node to be re-evaluated when that node is scheduled. This, as we will see later, can be done in constant time.

Tasks can be executed only when they have received messages from all their predecessors, and can send messages to their successors only after they have finished executing. Note that since the start node(s) is (are) scheduled last, it is not possible to determine the time at which a task can start executing (i.e., the startup time of the task). Consequently, link contention cannot be accurately determined until all the tasks are scheduled on the processors. We therefore make a second pass through the scheduled task graph, top-down, to send messages through routes of minimum communication delays. We have developed heuristics to schedule communications in an attempt to minimize link contention.


The main steps for the assignment of tasks to processors are as follows:

1. Compute the p-level (precedence level) of all nodes in the task DAG.

2. Compute the s-level (scheduling level) of all tasks in the DAG. This is computed bottom-up, p-level by p-level. A list of the nodes on the critical path (i.e., the longest path from the start node to a terminal node) is formed as the s-levels are computed.

3. Find the critical path of the task DAG from the start node. If there is more than one critical path, randomly choose one.

4. Assign tasks on the chosen critical path to the same processor. Mark these tasks as assigned, and update the processor load after each assignment. Mark as zero the (communication) edges among all nodes assigned in this step, since they are assigned to the same processor.

5. Assign the rest of the nodes in order of their p-levels, starting with the highest p-level (i.e., task nodes are assigned bottom-up), using the following steps:

a. Re-compute the scheduling level (s-level) of all nodes at that p-level.

b. Sort the nodes at the same p-level in decreasing order of their s-levels, creating list L.

c. Assign each task T from the front of list L as follows:

(i) Let Q be the successor of T to which T has the highest communication, and let Z be the processor to which Q has been assigned.

(ii) Let r = (weight of the edge between T and Q) / (node weight of T). The edge weight between T and Q corresponds to the inter-task communication cost between the two tasks had they been assigned to adjacent processors (processors connected by a direct link).

(iii) If r > τ (where τ is a threshold, chosen as 1 in our experiments), then assign T on Z.

(iv) If r ≤ τ, then identify the processor(s) with the minimum current load. (The current load on a processor is the sum of the execution times of the tasks assigned to it thus far.) If there is more than one such processor, choose the one closest to processor Z. If there is more than one processor that is closest to Z, choose any one at random. Let this be processor Y.

(v) Let C be the communication cost between processor Y and processor Z, assuming no channel conflicts. (For a message-switched, store-and-forward routing strategy, C = the weight of the edge between T and Q times the hop distance between Y and Z. For other routing strategies, the communication costs can be estimated in a similar fashion.)

(vi) Re-compute r = C / (node weight of T).

(vii) If r < τ, assign T to Y and change the edge weight between T and Q to C; otherwise assign T on Z.

(viii) Identify any successors of T on the processor to which T is assigned. The weights of the edges from T to these successors are marked as zero.

d. Re-compute the s-levels of all nodes at the p-level just assigned.

e. Update the load for the processor to which T is assigned. (Go to Step 5.)

Assigning tasks on the critical path to the same processor in Step 4 reflects an attempt to reduce the overall computation time of the program by removing the communication costs between tasks on the longest path in the task graph.

In step 5(c), a task T is assigned to the same processor Z as the successor Q to which it has the highest communication if the ratio r of the communication cost to the execution time of task T exceeds a threshold value τ. This again is done to help keep the communication cost as low as possible. If the ratio r is below the threshold, then an attempt is made to assign the task to the closest processor with minimum load (processor Y). For this purpose, the communication cost between task T and task Q is re-computed, assuming T is assigned to Y, and the ratio r is re-evaluated. If this ratio is still below the threshold, then task T is assigned to processor Y; otherwise it is assigned to the same processor (processor Z) as its successor Q. Choosing a processor with the minimum load reflects an effort to balance the load among the processors; finding the closest processor attempts to reduce the adverse effect of communication between tasks T and Q on the execution time of the program.
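Sub-steps (i) through (vii) can be summarized in the following Python sketch. This is our own illustrative rendering, not the paper's implementation: the function and parameter names are ours, hops stands for the pre-computed hop-count table assumed in Section 5, and the edge zeroing of sub-step (viii) and the load update of step 5(e) are left out for brevity.

    def assign_task(T, succs, edge_wt, node_wt, proc_of, load, hops, tau=1.0):
        """Sketch of step 5(c): choose a processor for task T, assuming all of
        T's successors have already been placed (terminal and critical-path
        nodes are handled by the earlier steps)."""
        # (i) successor Q with the highest communication, and its processor Z
        Q = max(succs[T], key=lambda s: edge_wt[T, s])
        Z = proc_of[Q]
        # (ii) communication-to-computation ratio, assuming adjacent processors
        r = edge_wt[T, Q] / node_wt[T]
        if r > tau:                          # (iii) communication dominates: co-locate
            return Z
        # (iv) least-loaded processor(s), ties broken by distance to Z
        min_load = min(load.values())
        candidates = [p for p in load if load[p] == min_load]
        Y = min(candidates, key=lambda p: hops[p][Z])
        # (v)-(vi) re-estimate the cost for message-switched store-and-forward routing
        C = edge_wt[T, Q] * hops[Y][Z]
        r = C / node_wt[T]
        if r < tau:                          # (vii) ratio still below threshold: use Y
            edge_wt[T, Q] = C
            return Y
        return Z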


Tasks assigned to the same processor are executed in the order in which they become ready. A task is ready to be executed when it has received messages from all its predecessors. For task graphs with a high fan-in (task graphs where a node has several predecessors), the scheduling heuristic would tend to assign a number of predecessors to the same processor as the successor to which each has the highest communication (processor Z) if the ratio r exceeds the threshold τ. This could result in a skewed load across the processors and consequently increase the overall execution time of the program. This problem can be remedied if we allow the assignment of a predecessor task to the same processor as its successor to which it has the highest communication only if the assignment does not result in an unbalanced load among the processors. Two variations of the heuristic to prevent this skewing are presented in [MeGh 94].

4: Channel allocation heuristics

Once task assignments are made, communication channels are assigned in a top-down fashion, as explained earlier. We developed a variety of heuristics for channel assignment. In all of these heuristics, channel allocations that lead to deadlocks are avoided. The heuristic that provides the best result is what we call the adaptive channel assignment technique. Note that in step 5(c) of the scheduler, we implicitly assumed that the communication path between a node and the successor to which it has the highest amount of communication uses a path with the shortest hop count and zero contention delay. The adaptive channel assignment strategy attempts to duplicate these conditions. The heuristic is adaptive because it adapts to the current load on the interconnection links of the system. We present the adaptive channel assignment for a hypercube-connected system. Similar adaptive channel assignment strategies for other interconnections like meshes, tori and trees are possible.

The adaptive channel assignment for a hypercube interconnection is essentially a routing strategy that is a modification of the standard e-cube routing. In the e-cube routing scheme, the exclusive-or (XOR) of the bit addresses of the node (say src) that has either generated or received a message for routing, and the destination node for the message (say dest), is computed. The purpose of each routing step is to reduce the number of 1's in the XOR. The adaptive channel assignment heuristic finds the link of minimum contention to route the message at every step of the routing. At each routing step, the links corresponding to the positions of the 1's in the XOR are scanned from the most significant bit position. The first free link encountered in the process is chosen. If all links are busy, then the link with the least delay corresponding to the position of a 1 in the XOR is chosen:

    compute rel = src XOR dest;    /* XOR = bit-wise exclusive-or */
    if (rel == 0) then
        /* message has reached its destination */
    else
        let K = position of the most significant 1 in rel;
        let B = number of 1's in rel;
        if all links corresponding to the positions of 1's in rel are busy then
            use the link with the least delay corresponding to a 1 in rel
        else
            while (B > 0) do
                if (contention delay on link #K == 0) then
                    route the message out through link #K; exit;
                else
                    B = B - 1;
                    K = position of the next 1 in rel;
            done
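For concreteness, the link-selection step above can be rendered as a small executable Python sketch. This is our own reading of the pseudocode, under stated assumptions: link_delay is a hypothetical callback standing in for the scheduler's channel-usage table, and n is the hypercube dimension.

    def pick_link(src, dest, n, link_delay):
        """Adaptive e-cube link choice (sketch): scan the 1-bits of src XOR dest
        from the most significant position; take the first contention-free link,
        otherwise the 1-bit link with the least contention delay."""
        rel = src ^ dest
        if rel == 0:
            return None                      # message is already at its destination
        ones = [k for k in range(n - 1, -1, -1) if (rel >> k) & 1]   # MSB first
        for k in ones:
            if link_delay(src, k) == 0:      # first free link encountered wins
                return k
        return min(ones, key=lambda k: link_delay(src, k))  # all busy: least delay

For example, with src = 0b0110 and dest = 0b0011, rel = 0b0101, so the links in dimensions 2 and 0 are the candidates, scanned in that order.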

Note that the adaptive routing heuristic attempts to assign a channel that has the least hop count. Additionally, it attempts to reduce the contention delays as much as possible. This heuristic therefore tries to achieve the channel/routing characteristics assumed at the time of scheduling the task nodes. The two other channel allocation heuristics are ranked channel assignment and greedy channel assignment, and these follow a fixed route. Ranked channel assignment is also based on the e-cube routing scheme. It selects three paths with the minimum hop count from the many paths that are possible and routes the message on the path with the least delay from among these three paths. The first path is found by scanning bits from the least significant bit position in the XOR at each step and picking the link corresponding to the first 1 encountered in the process. For the second path, bits are scanned from the most significant bit position in the XOR at each step. The third path is found by picking a link corresponding to a 1 in a randomly picked position in the XOR. The greedy channel assignment uses Dijkstra's shortest path algorithm to pick the route with the least delay. Details of these schemes can be found in [MeGh 94].
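The three candidate routes of ranked channel assignment can be sketched as follows. This is our own illustration of the bit-scanning rules described above, not code from the paper; the real scheme then routes on whichever of the three paths has the least delay.

    import random

    def ranked_paths(src, dest, n):
        """Build the three minimum-hop candidate routes used by ranked channel
        assignment (sketch) on an n-dimensional hypercube."""
        def route(pick):                     # generic e-cube walk, one bit per hop
            node, path = src, [src]
            while node != dest:
                ones = [k for k in range(n) if ((node ^ dest) >> k) & 1]
                node ^= 1 << pick(ones)      # correct the chosen address bit
                path.append(node)
            return path
        lsb_first = route(lambda ones: ones[0])    # scan from least significant bit
        msb_first = route(lambda ones: ones[-1])   # scan from most significant bit
        randomized = route(random.choice)          # a 1 in a randomly picked position
        return lsb_first, msb_first, randomized

All three walks flip exactly one differing address bit per hop, so each path has the minimum hop count (the number of 1's in src XOR dest).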


5: Complexity of the bottom-up scheduling scheme

We now estimate the worst-case complexity of our bottom-up scheduling technique. We assume that for the purpose of the task allocation heuristic, a table listing the hop count for the default routing between any two processors is available. In any case, the complexity of computing such a table is O(P^2), where P is the number of processing nodes in the system. We assume that the task graph has T task nodes and E edges; we also assume that a node can have at most k successors. For each task, the list of its successors is pre-computed. The complexity of creating these lists depends on the number of successors for a task in the task graph and is O(k*T). Lists of tasks at the same p-level are also available, and the complexity of computing these lists is O(T) for T tasks in the task graph.
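For a hypercube, the hop-count table assumed above is trivial to pre-compute, since the e-cube hop distance between two nodes is the number of 1's in the XOR of their addresses. A minimal sketch (our own code, not the paper's):

    def hop_table(n):
        """Hop counts under default e-cube routing on an n-dimensional hypercube:
        the distance between nodes a and b is the popcount of a XOR b. Building
        the full P x P table costs O(P^2) for P = 2**n."""
        P = 1 << n
        return [[bin(a ^ b).count("1") for b in range(P)] for a in range(P)]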

The worst-case complexity of each of the main steps outlined in Section 3 is as follows:

Step 1: Computing the p-levels of the task graph depends on the number of edges E in the graph and has a complexity of O(E).

Step 2: The complexity of computing the s-levels is also O(E).

Step 3: The critical path is found by using the information computed during the computation of the s-levels. The identification of the critical path has a complexity of O(k*T), since at most T nodes can be on the critical path.

Step 4: This step has a complexity of O(T) in the worst case.

Step 5 is a loop that is executed at most T times (i.e., at most once for each task node). The complexities of the various sub-steps within step 5 are as follows:

Step 5(a): Over all iterations of the loop, the complexity of this step is O(k*T), since the s-level of a node is re-computed soon after a successor of the node is scheduled.

Step 5(b): Sorting all nodes at the same p-level has a worst-case complexity of O(T log T). The actual number of task nodes sorted by this step is considerably smaller than T in practice. Since this sorting is done for each iteration of the loop of Step 5, the contribution of this step to the overall complexity is O(T^2 log T).

Step 5(c): Sub-steps (i) and (ii) of this step require constant time because of the use of table lookups. Sub-step (iii) is again a constant-time operation. Finding the processor with the minimum load is at most an O(P) operation. Sub-step (iv) thus has a complexity of O(P), since the hop count is available from a pre-computed table. Sub-steps (v) through (vii) are done in constant time. Sub-step (viii) has a complexity of O(k). The contribution of sub-step 5(c) towards the overall complexity over all iterations of the loop of step 5 is thus O(T*(P+k)).

Step 5(d): The re-computation of the s-levels by this step has a complexity of O(k*T) over all iterations of the loop.

Step 5(e): The processor load is updated in constant time.

The overall complexity of Step 5 (including all its sub-steps) is thus O(k*T + T^2 log T + P*T). The overall complexity of our bottom-up scheduling technique is thus O(E + T + k*T + T^2 log T + P*T). If we assume that k is a constant and T > P (and T > 2), this reduces to O(T^2 log T), since E is O(T^2) in the worst case.

6: The complexity of channel assignment

For a system that uses an n-dimensional hypercube interconnection, the number of processors is P = 2^n. The worst-case complexity of the adaptive channel allocation and ranked channel allocation heuristics is O(n^2), i.e., O((log P)^2). The worst-case complexity of the greedy channel assignment strategy is O(P^2). If the complexity of deadlock detection is ignored for the moment, the worst-case complexity of the overall channel allocation process is O(E*H), where E is the number of edges in the task graph and H is the worst-case complexity of the communication heuristic. If the complexity of deadlock detection is factored in, the overall complexity is O(E*H + E^2).

7: Experimental assessment and comparison

To assess the performance of our bottom-up scheduling scheme and its associated channel allocation heuristics (together called the bottom-up scheduler), we scheduled randomly generated task graphs using our scheduler for hypercube-connected multiprocessors. The use of randomly generated task graphs is quite common in assessing the performance of schedulers [ElLe 90], [KoSa 93] and [ScJa 93]. Although our approach is applicable to any network topology (hypercubes, meshes, tori, trees, etc.) and any message routing strategy (message-switched, circuit-switched, wormhole, etc.), we present results for a message-switched hypercube topology. Results for the mesh and torus can be found in [MeGh 94]. We constructed a "generic" version of the bottom-up scheduler that allows the use of switches to turn on and off the various heuristics that were proposed, and thus permits variations of the basic bottom-up scheduler to be instantiated. We also implemented a detailed simulator to simulate the execution of the task schedule on the underlying architecture.

We compared our bottom-up scheduler with a top-down scheduler that we implemented. This top-down scheduler is similar to the bottom-up scheduler, except that tasks are scheduled from the start nodes and s-levels are not re-computed, since the change in the weights of the edges does not impact the s-levels of nodes that have not yet been assigned. A second pass is made through the task graph to do communication scheduling, using the same routing strategies as used in the bottom-up approach.

We also compared the bottom-up scheduler using adaptive hypercube routing with El-Rewini and Lewis' MH (Mapping Heuristic) scheme [ElLe 90]. The MH scheduler was an appropriate choice for comparison for two reasons. First, it is a top-down, one-step scheduler based on the message-switched routing strategy. Second, like the bottom-up scheduler, it also takes into account the topology of the underlying multiprocessor system and the delays due to channel contention when scheduling.

Task graphs with 200-1000 nodes were randomly generated, with communication times (edge weights) and computation times (node weights) ranging from 0-100. The tasks of the task graphs were scheduled on hypercubes of dimensions 1-5 using the top-down and bottom-up techniques with the three communication scheduling heuristics. The scheduler used a value of 1 for τ, the threshold of the communication-to-computation ratio used in the processor allocation heuristics (step 5(c)). We assumed that the architecture consisted of a set of homogeneous processors connected by a static interconnection network. A separate communication processor existed per PE, to allow for communication and computation overlap. The communication processors also allowed for simultaneous transmission and reception of messages. Links were assumed to be full duplex, i.e., separate links existed for sending and receiving messages. For the simulation we also assumed that adequate buffering capacity was available at each processor. Messages were assumed to be typed, so that they could be sent and received in any order.

Figure 1 shows the execution time for both communication-intensive and computation-intensive task graphs assigned to different numbers of processors using the bottom-up scheme with the adaptive channel allocation strategy. The graph shows that the execution times for both communication-intensive and computation-intensive task graphs decrease with an increase in the number of processors, indicating that our bottom-up scheduler performs the mapping in a fairly scalable manner. These results also indicate that the bottom-up scheduler is effectively mapping the tasks as well as the communication channels. Figure 2 depicts the minimum and maximum percentage deviation of processor load for task graphs consisting of 400, 600, 800 and 1000 nodes assigned to 16 processors, again using the bottom-up scheme with adaptive channel allocation. The small deviations show that the bottom-up heuristic effectively balances the load among the processors, which consequently decreases the overall execution time of the task graph. (The percentage deviations were computed for 50 randomly generated task graphs.) Figure 3 depicts how the bottom-up scheduler (with adaptive channel assignment) compares with the Mapping Heuristic (MH) scheduler of [ElLe 90]. The results show the average execution times (rounded) for 50 task graphs on hypercube-based systems with 16 processing nodes, for task graphs with 400, 600, 800 and 1000 nodes. The results of Figure 3 clearly indicate that our proposed scheme does significantly better than the top-down MH scheme of [ElLe 90], achieving in excess of a 40% reduction in the execution time on average over the MH scheme. Similar results were seen for task graphs of other sizes and different numbers of processors. We attribute the inefficiency of the MH scheduler to its use of inaccurate scheduling precedences as well as its use of approximate link status information.

Figure 4(a) is a pie chart that indicates the percentage of instances where each channel allocation strategy produces a better result than the others (including a tie with any other scheme) for 50 randomly generated task graphs with 200 nodes that were scheduled on a 16-processor hypercube using the bottom-up technique with the three channel assignment strategies. As seen from Figure 4(a), adaptive channel assignment does better than the other channel assignment strategies in 56% of the cases. This was a general trend observed for task graphs of other sizes and for other processor configurations: adaptive channel assignment performed better than the other channel allocation strategies. This is presumably due to the fact that adaptive channel assignment tries to duplicate the communication conditions assumed by the bottom-up scheduling technique. Figure 4(b) depicts the average execution times (rounded to the nearest integer) using the three channel allocation variations for the 50 test cases.

To compare our bottom-up approach with the top-down scheduling approach, we also implemented a top-down list scheduler that used a second pass to assign communication channels using the adaptive and ranked channel assignment strategies, as explained earlier. We also studied other variations of the bottom-up scheduler. Figure 5 summarizes some of the results of these experiments, depicting the average execution time for 50 randomly generated task graphs with 200 task nodes each on a 16-node hypercube. As seen from Figure 5, the channel assignment strategies do not appear to have much impact on the execution time for the top-down scheme. In general, the bottom-up scheme results in a better schedule than the top-down approach, improving the average execution time by roughly 60%. The execution time for the schedule obtained with the bottom-up scheduler goes up when the nodes on the critical path are not assigned to the same processor. This is not surprising, since scheduling the nodes on the critical path on the same processor removes the communication cost among these nodes, resulting in improvements in the overall execution time.


8: Conclusions

We have presented a list scheduling technique that makes a significant departure from conventional list schedulers by scheduling the nodes in a task graph bottom-up. This results in an accurate estimate of the scheduling weight of the task nodes. Another feature of our scheduler is the use of a simple processor allocation heuristic that attempts to balance the overall loading of each processor while at the same time keeping the communication costs to a minimum. Our channel allocation heuristics take channel contention into account and assign communication channels after the task nodes have been scheduled. The experimental results show that processor loads are balanced extremely well. This is one of the factors that contribute to the superior performance of our scheduler over the MH scheduler of El-Rewini and Lewis, a scheduler that is comparable to ours. Although we present experimental results for a message-switched hypercube system, our task assignment technique is universally applicable due to its ability to handle any arbitrary processor interconnection graph and other routing strategies (such as circuit switching and wormhole routing). We have adapted the proposed channel allocation heuristics to other interconnection topologies (meshes and tori) [MeGh 94]. Currently, we are in the process of adapting the scheduler to the wormhole routing strategy.

Figure 1: Execution time vs. number of processing nodes (P = 2, 4, 8, 16, 32) for a randomly generated 1000-node communication-intensive and computation-intensive task graph, using the bottom-up scheme with adaptive channel allocation.

Figure 2: Minimum and maximum percentage standard deviation of processor load for 400, 600, 800 and 1000 node task graphs on 16 processors, using the bottom-up scheme with adaptive channel allocation over 50 randomly generated task graphs.


Figure 3: Average execution time for the bottom-up (BU) and MH scheduling schemes for 400, 600, 800 and 1000 node task graphs, averaged over 50 task graphs assigned to 16 processors.

Figure 4: (a) Comparison of the bottom-up scheduler with the three variations of communication channel allocation for 200-node task graphs assigned to 16 processors, using 50 randomly generated task graphs; (b) average execution times for the bottom-up scheduler under the three variations for the 50 task graphs (rounded bar values: 1064, 1128, 1127).

Figure 5: Average execution times for variations of the bottom-up scheduler and the top-down scheduler. A: bottom-up with adaptive channel assignment; B: bottom-up without critical path assignment, with adaptive channel assignment; C: top-down with adaptive channel assignment; D: top-down with ranked channel assignment.

References

[ACD 74] T. Adam, K. Chandy and J. Dickson, "A Comparison of List Schedules for Parallel Processing Systems," Communications of the ACM, vol. 17, pp. 685-690, Dec. 1974.

[Bok 81] S. Bokhari, "A Shortest Tree Algorithm for Optimal Assignments Across Space and Time in a Distributed Processor System," IEEE Transactions on Software Engineering, vol. SE-7, no. 6, Nov. 1981.

[ChAg 93] V. Chaudhary and J. K. Aggarwal, "A Generalized Scheme for Mapping Parallel Algorithms," IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 3, pp. 328-346, Mar. 1993.

[Cof 76] E. Coffman, Computer and Job-Shop Scheduling Theory, Wiley, New York, 1976.

[ElLe 90] H. El-Rewini and T. G. Lewis, "Scheduling Parallel Program Tasks onto Arbitrary Target Machines," Journal of Parallel and Distributed Computing, vol. 9, pp. 138-153, 1990.

[GeYa 92] A. Gerasoulis and T. Yang, "A Comparison of Clustering Heuristics for Scheduling Directed Acyclic Graphs on Multiprocessors," Journal of Parallel and Distributed Computing, vol. 16, pp. 276-291, 1992.


[Hu 61] T. Hu, "Parallel Sequencing and Assembly Line Problems," Operations Research, vol. 9, pp. 841-848, 1961.

[KoSa 93] S. Kon'ya and T. Satoh, "Task Scheduling on a Hypercube with Link Contentions," Proc. International Parallel Processing Symposium, pp. 363-368, 1993.

[LeAg 87] S. Y. Lee and J. K. Aggarwal, "A Mapping Strategy for Parallel Processing," IEEE Trans. on Computers, vol. C-36, pp. 433-442, Apr. 1987.

[LHCA 88] C. Y. Lee, J. J. Hwang, Y. C. Chow and F. D. Anger, "Multiprocessor Scheduling with Interprocessor Communication Delays," Operations Research Letters, vol. 7, no. 3, pp. 141-147, June 1988.

[MeGh 94] N. Mehdiratta and K. Ghose, "Scheduling Task Graphs onto Distributed Memory Multiprocessors Under Realistic Constraints," to appear in Proc. Parallel Architectures and Languages Europe (PARLE '94), Springer-Verlag LNCS, 1994.

[ScJa 93] L. Schwiebert and D. N. Jayasimha, "Mapping to Reduce Contention in Multiprocessor Architectures," Proc. International Parallel Processing Symposium, pp. 248-253, 1993.

[YBN 91] J. Yang, L. Bic and A. Nicolau, "A Mapping Strategy for MIMD Computers," Proc. International Conference on Parallel Processing, vol. I, pp. 102-109, 1991.
