[IEEE 4th International Symposium on Parallel and Distributed Computing (ISPDC'05), Lille, France, 4-6 July 2005]



Adaptive window scheduling for a hierarchical agent system

Holly Dail and Frédéric Desprez
LIP, ENS Lyon

46 Allée d'Italie, 69364 Lyon, France
[email protected], [email protected]

Abstract

DIET (Distributed Interactive Engineering Toolbox) is a toolbox for the construction of Network Enabled Server (NES) systems. For most NES systems, as for most grid middleware systems, the scheduling system is centralized and can suffer from poor scalability. DIET provides an alternative: low-latency, scalable scheduling services based on a distributed hierarchy of scheduling agents. However, the online scheduling model currently used in DIET can overload interactive servers in high-load conditions and does not allow adaptation to task or data dependencies. In this article we consider an alternative model based on active management of the flow of requests throughout the system. We have added support for (1) limiting the number of concurrent requests on interactive servers, (2) server- and agent-level queues, and (3) window-based scheduling algorithms whereby the request release rate to servers can be controlled and some re-arrangement of request-to-host mappings is possible. We present experiments demonstrating that these approaches can improve performance and that the overheads introduced are not significantly different from those of the standard DIET approach.

1. Introduction

The use of distributed resources available through high-speed networks has gained broad interest in recent years. The number of resources made available grows every day, and the scalability of middleware platforms is becoming a key issue. A variety of middleware approaches have been developed to cope with the heterogeneity and dynamic nature of the target resource platforms while trying to hide the complexity of the platform as much as possible from the user [9, 11, 17].

Among middleware approaches, one simple approach consists of using servers available in different administrative domains through the classical client-server or Remote Procedure Call (RPC) paradigm. Several such network enabled server (NES) environments have been developed for the grid [3, 5, 6, 15]; GridRPC [18] provides a standard API that is enabling unification of the interfaces used by these projects. In these systems, clients submit computation requests to a scheduling agent which identifies suitable servers available on the grid. While many NES environments are based on a single, centralized scheduling agent, we have found that the scalability of this approach is limited. We have thus developed the Distributed Interactive Engineering Toolbox (DIET) [3], an NES environment that uses a hierarchical arrangement of scheduling agents to provide better scalability.

This work has been supported by the Grid'5000, VTHD, RNTL, and ACI GRID projects.

DIET has traditionally used on-line scheduling whereby client requests are scheduled immediately in FIFO order. This approach provides fast response time and fairness for users, but it can also lead to scheduling too far in advance at the server level and does not account for inter-task dependencies or task-machine affinities.

In this paper, we present an extension of the DIET scheduling approach to support control over the flow of requests in the system. The goal of our approach is to provide the low response time of on-line scheduling under low-load conditions while improving system performance under high-load conditions. First, we add queue-like semantics at the server level to limit the number of concurrent jobs allowed on a server. Second, we introduce a queue at the DIET root agent to control the rate at which requests are released into the scheduling system. Third, we present a window-based task scheduling approach whereby tasks are released into the scheduling hierarchy as a group in each window and task placements can be re-arranged within a window to account for task-machine affinities and data dependencies. The interest of the algorithm presented here is its adaptation to the distributed nature of scheduling in DIET; specifically, server information is only available via the scheduling of at least one request, and global information on all servers is never available in one place.

This paper is organized as follows. The next section describes related work in hierarchical scheduling, Section 3 provides an overview of the architecture of DIET, Section 4 presents our extensions to DIET, Section 5 presents experiments that test the performance and overheads of our extensions, and Section 6 presents conclusions and a preview of future work.

Proceedings of the 4th International Symposium on Parallel and Distributed Computing (ISPDC'05), 0-7695-2434-6/05 $20.00 © 2005 IEEE

2. Related work

Centralized scheduling systems for clusters of workstations and supercomputers have been studied extensively in the literature [10]. Hierarchical approaches have also been studied, including work in fields outside of classical job scheduling. For example, 2-level hierarchical scheduling systems for clusters of WWW servers are discussed in [2]. The authors seek to load-balance HTTP requests across multiple clusters with a scheduling approach that can either move data to unloaded hosts or place computation where the data resides. A hierarchical disk scheduler for multimedia systems is presented in [4]. Benefits are shown with a 2-level scheduling scheme for disk I/O. In [14], a decentralized service discovery system is discussed for global computing grids. This work presents scalable approaches for service discovery but is not concerned with scheduling tasks; the work is thus complementary to ours.

Hierarchical scheduling heuristics for shared and distributed memory machines and for database query processing are presented in [8]. Although these heuristics are evaluated in simulation, the idealized parameters used do not seem to reflect real conditions on actual platforms. Simulation is also used to test several hierarchical job scheduling algorithms in [16]. In this paper, the authors compare several heuristics such as First Come First Served (FCFS) and Shortest Job First (SJF) at the Global Resource Manager (GRM) level. It is shown that careful resource management at the global level can improve performance of the platform as a whole. In [12], other algorithms are tested by carefully studying the cost of the parallel jobs sent to the schedulers.

The authors of [13] present a distributed scheduling approach using Global Resource Managers connecting Local Resource Managers (LRM); the GRMs are replicated for better fault-tolerance. The largest difference with our work is that the LRM level is a classical batch system and thus does not afford the same control over job scheduling. A similar approach is used in [19], but simultaneous requests are sent to every global scheduler; once one request finishes, the others are cancelled.

3. DIET overview

As shown in Figure 1, DIET is based on several components arranged in a hierarchical fashion; this arrangement provides scalability and can be adapted to diverse environments including heterogeneous network hierarchies. A DIET Client is an application that submits requests for services. A SeD, or server daemon, provides the interface to computational servers and can offer any number of application-specific computational services. Agents provide higher-level services such as scheduling and data management. Each DIET deployment contains exactly one Master Agent (MA), the root node of the hierarchy and the node to which clients submit requests. Any tree-based arrangement of Local Agents (LA) can be introduced between the MA and the SeDs to improve scalability.

Figure 1. DIET hierarchical organization. (The original figure shows a client C connected to the MA at the root, a tree of LAs beneath it, and SeDs at the leaves.)

When a client submits a request for a computational service to the MA, the MA forwards the request in the DIET hierarchy and the LAs, if any exist, forward the request onwards until the request reaches the SeDs. The SeDs then evaluate their own capacity to perform the requested service; capacity can be measured in a variety of ways including an application-specific performance prediction, general server load, or local availability of data-sets specifically needed by the application. The SeDs send their responses back up the agent hierarchy. The agents perform a distributed collation and reduction of server responses until finally the MA returns to the client a list of possible server choices sorted in order of desirability. Additional details on the DIET scheduling system are available in a related article [7].
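The collation-and-reduction step described above can be sketched as a merge of sorted response lists. The sketch below is illustrative only: the (predicted_completion_time, server_id) response shape and the function name are assumptions, not the DIET API.

```python
import heapq

def aggregate_responses(child_lists, max_responses):
    """Merge the sorted response lists received from an agent's children
    and keep only the best max_responses entries, in the spirit of the
    reduction each DIET agent performs on its way up the hierarchy."""
    # Each response is a (predicted_completion_time, server_id) pair,
    # and each child's list is assumed to be sorted best-first.
    merged = list(heapq.merge(*child_lists))
    return merged[:max_responses]
```

For example, merging `[[(1.0, 's1'), (3.0, 's2')], [(2.0, 's3'), (4.0, 's4')]]` with `max_responses=2` keeps the two most desirable servers across both subtrees.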

The use of on-line scheduling in the current version of DIET has several advantages. Requests are scheduled as quickly as possible, resulting in low scheduling latencies. The gathering of performance information from servers is only done upon request, contrary to other NES systems like NetSolve or Ninf [15]. This avoids unnecessary communications when few requests are sent to the agents. Also, data transfer related to performance prediction is low since only the synthesized performance predictions are communicated; in systems with centralized performance prediction, all relevant data must be communicated.


There are, however, conditions in which an on-line strategy can be less desirable. For example, the system continues to launch new executions even if the target server is already fully utilized. This can lead to reduced throughput due to interference between jobs, especially for jobs with intensive resource requirements. Also, the on-line strategy is not able to support any re-ordering of requests to account for task or data dependencies. In the following section we discuss extensions to the DIET scheduling approach designed to address these limitations.

4. Scheduling extensions

Our extensions to the DIET scheduling system can be grouped into changes at the SeD level and changes at the MA level. To the extent possible, we have designed these extensions to function both separately and together to provide the maximum flexibility. All design decisions have been taken with the additional goal of maintaining as much as possible the stability and low-overhead characteristics of the standard DIET scheduling approach.

At the SeD level, we have added the ability to control the number of requests actively using resources at the same time; this addition reduces resource contention. When requests reach the SeD, their request thread is placed in what we will call a SeD-level “queue”. To keep overheads low we implement a lightweight approach based on a counting semaphore. We use a number of heuristics to estimate “queue” statistics to be used in the scheduling process. Specifically, each SeD monitors the number of jobs waiting for execution, the sum of the predicted execution times for all waiting jobs, and the predicted completion times for any running jobs. With these statistics a rough estimate can be made of the predicted completion time of a new job.
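A minimal sketch of this counting-semaphore "queue" is given below. The class name, fields, and estimation formula are hypothetical illustrations of the mechanism described above, not DIET's actual implementation (which is written as part of the SeD, not as a standalone Python class).

```python
import threading
import time

class SeDQueue:
    """Lightweight SeD-level 'queue': a counting semaphore bounds the
    number of concurrently executing requests, while simple counters
    track the statistics used for rough completion-time estimates."""

    def __init__(self, max_concurrent):
        self.slots = threading.Semaphore(max_concurrent)
        self.lock = threading.Lock()
        self.num_waiting = 0          # jobs waiting for an execution slot
        self.waiting_work = 0.0       # sum of predicted times of waiting jobs
        self.running_done_at = []     # predicted completion times of running jobs

    def estimate_completion(self, predicted_time):
        """Rough predicted completion time for a new job: all waiting work
        plus the new job itself, starting once the earliest running job
        is predicted to finish."""
        now = time.time()
        with self.lock:
            # Drop predicted completions that are already in the past.
            self.running_done_at = [t for t in self.running_done_at if t > now]
            earliest_free = min(self.running_done_at, default=now)
            return earliest_free + self.waiting_work + predicted_time

    def run(self, job, predicted_time):
        """Block until an execution slot is free, then run the job."""
        with self.lock:
            self.num_waiting += 1
            self.waiting_work += predicted_time
        self.slots.acquire()
        with self.lock:
            self.num_waiting -= 1
            self.waiting_work -= predicted_time
            self.running_done_at.append(time.time() + predicted_time)
        try:
            job()
        finally:
            self.slots.release()
```

The semaphore gives the bounded-concurrency behavior without a full queue implementation, which is the trade-off discussed in the limitations below: the statistics are cheap to maintain but only approximate.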

The second group of extensions involves scheduling at the MA level. We augment the MA such that in high-load conditions the flow of requests to servers can be slowed down and task assignments can be re-arranged to account for task and data affinities. The adaptive window scheduling approach we have added at the MA level is especially of interest because it is this framework that will allow the exploration of a variety of task scheduling algorithms in the unique context of a system of hierarchical agents.

In the standard DIET system, requests are each assigned an independent thread in the MA process and that thread persists until the request has been forwarded in the DIET hierarchy, the response received, and the final response forwarded on to the client. In this approach, the only data object shared among threads is a counter that is used to assign a unique request ID to every request.

In the modified MA, we introduce one additional thread that provides higher-level management of request flow; the approach used for this GlobalTaskManager is presented in Algorithm 1. Scheduling proceeds in distinct phases called windows, and both the number of requests scheduled in a window and the time interval spent between windows are configurable. An interesting aspect of this algorithm is that the MA can only discern characteristics of the DIET hierarchy, such as server availability, by forwarding a request in the hierarchy. We avoid sending any task twice in the hierarchy; thus the GlobalTaskManager must schedule one or more jobs in order to have information about server loads and queue lengths.

The maximum size of the scheduling window (maxWinSize) derives from the distributed nature of DIET: it should be equal to the number of responses maxResponses forwarded by the aggregation routines in the DIET agents (this is itself a configurable property of DIET).

The minWinTime is the minimum amount of time to sleep between windows and is introduced to avoid busy-wait behavior when request volume is low and the GlobalTaskManager runs in an on-line mode; in our experience a value of 5 msec is effective, but the value may be changed according to how aggressively one wants to use the host machine and what level of responsiveness one wants from the scheduler. The window time is re-calculated after each scheduling window with the following goals: when request volume is low, the window should be very short to provide fast response time; when the servers become loaded and jobs are already waiting in the SeD-level queues, the window should be longer to avoid scheduling too far into the future; and when jobs have just been scheduled, the window should be long enough to allow the clients of the just-scheduled requests to submit the requests to the servers. This last point is a problem for the standard DIET approach as well: the time required to schedule several DIET tasks is much shorter than the time required for the scheduling decision to take effect (that is, the time required for the client to receive the first response and start the solve on the selected server). Thus, multiple jobs can be assigned the same server before the server reports that a new job has been launched. For the GlobalTaskManager algorithm, we introduce the parameter taskLatency as the minimum window time to use after some tasks are scheduled; in our experience a value of one second was reasonable, however this value is certainly platform dependent and additional experience will be needed to gain intuition as to the correct value.

The minQueueWait returned from the RefinePlacement call provides the expected time until the least-loaded SeD has finished all of its work. The windowTime for the next window is adapted to the current minQueueWait: the next window must occur before the least-loaded SeD is left idle, yet if the windows are too short the MA will schedule too far in the future. The winAdaptFactor is a factor between 0 and 1 that controls how aggressively


Algorithm 1 Global task flow management algorithm.

procedure GlobalTaskManager(minWinTime, maxWinSize, winAdaptFactor, taskLatency)
  windowTime ← minWinTime
  while true do
    numWaiting ← GetWaitingTaskCount
    if numWaiting == 0 then
      windowTime ← windowTime / 2
      if windowTime < minWinTime then
        windowTime ← minWinTime
      end if
    else
      windowSize ← Minimum (numWaiting, maxWinSize)
      requests ← GetFirstNRequests (windowSize)
      ReleaseRequestsToHierarchy (requests)
      responses ← WaitNResponsesFromHierarchy (windowSize)
      (finalResponses, minQueueWait) ← RefinePlacement (responses)
      ReleaseResponsesToClient (finalResponses)
      if minQueueWait < taskLatency then
        windowTime ← taskLatency
      else
        windowTime ← winAdaptFactor × minQueueWait
      end if
    end if
    Sleep (windowTime)
  end while

the next window should be scheduled. In practice, the winAdaptFactor should be larger in situations with accurate performance prediction, stable resource performance, and low latency for scheduling jobs, and smaller in cases where those characteristics are missing.

The RefinePlacement call in Algorithm 1 provides the opportunity to experiment with rearranging the allocation of servers to tasks. DIET is a multi-user system and thus we are concerned with providing fairness to users; we consider that job placements can only be re-arranged within the scheduling window, and the window can be made smaller to improve fairness or larger to improve opportunities for performance gains. Algorithm 2 provides a simple RefinePlacement approach; this example avoids placing tasks in the same window on the same host, if possible. While in this paper we focus on the practical aspects of our extensions, in future work we plan to investigate more sophisticated RefinePlacement algorithms incorporating, for example, data dependencies and task inter-dependencies.

Limitations: Despite the proposed advantages of our extensions, there are also a number of possible limitations. For the SeD-level extension, requests that are in the SeD “queue” are in fact resident on the server. Thus, if the parameters of the problem sent in the solve phase include large data sets, memory or disk-usage conflicts could arise. Some

Algorithm 2 Global schedule refinement algorithm

procedure RefinePlacement(responses)
  numTasks ← GetSize (responses)
  resp ← responses[1]
  finalPlacements[1] ← resp.servers[1]
  for i ← 2, 3, ..., numTasks do
    resp ← responses[i]
    for all server ∈ resp.servers do
      if server ∉ finalPlacements then
        finalPlacements[i] ← server
        Break
      end if
    end for
  end for
  Return (finalPlacements, minQueueWait)
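The placement heuristic of Algorithm 2 can be written as a short runnable sketch. Note one assumption made explicit here: when every server in a task's list is already taken, this sketch falls back to the task's best-ranked server, a case Algorithm 2 leaves implicit.

```python
def refine_placement(responses):
    """Runnable sketch of the RefinePlacement heuristic: within one
    scheduling window, give each task the best-ranked server not yet
    assigned in this window, if such a server exists.
    responses[i] is task i's candidate server list, sorted best-first."""
    placements = []
    used = set()
    for servers in responses:
        # First unused server in preference order; fall back to the
        # task's best server if all candidates are taken (assumption).
        choice = next((s for s in servers if s not in used), servers[0])
        used.add(choice)
        placements.append(choice)
    return placements
```

For example, three tasks that all rank server "a" first are spread across "a", "b", and "c" instead of piling onto "a" within the same window.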

DIET applications use a different approach for transferring their data whereby only the file location is sent in the problem parameters and the data is retrieved at the beginning of the solve. The impact of this problem will therefore depend on the data approach used by the application. This issue is also minimized when the MA-level approach is used since the number of jobs enqueued at the servers can be controlled at the MA; since DIET never includes large data sets in scheduling requests, the queue at the MA is unlikely to be resource-limited. Another limitation lies in the use of a lightweight queue-like structure at the SeD; the heuristics used to predict performance metrics can be inaccurate in some cases. We plan to compare the performance of this lightweight approach against a full queue implementation to evaluate whether more accurate performance predictions warrant a more expensive queueing approach.

The primary limitation of the MA-level extensions is the dependence on relatively accurate performance predictions. If job execution times are significantly underpredicted, the window length may be too short and the MA will schedule too far in advance; if overprediction occurs, the window length may be too long and some SeDs will be left idle waiting for the next scheduling window.

5. Experiments

5.1. General configuration

Testbed: To test our algorithms, we performed experiments on the Grid'5000 testbed [1]. Specifically, we used a 56-node cluster located at the École normale supérieure de Lyon (ENS Lyon). Each node in this cluster has 2 GB of memory, dual 2 GHz AMD Opteron processors, and 1 MB of cache. The cluster had just been installed at the time of these experiments and was not yet in common use by other users. Thus, although we did not have explicit dedicated access, all observations lead to the conclusion that, in practice, we did in fact have dedicated use of the cluster. Dedicated access provides us the opportunity to study the performance of DIET and our scheduling approaches in a detailed, controlled, and reproducible manner on a real system. These types of experiments are not often done by middleware designers and thus detailed, comparable results on middleware performance are rare. However, these experiments are also limited in scope and do not adequately test issues of robustness to heterogeneity, load variability, or failures. In future work we plan to complement these experiments with case studies in distributed, heterogeneous environments.

User model: We define a usage scenario for each experiment and then implement the scenario by scripts that submit jobs following the pattern defined by the given usage scenario. We define two general usage styles: sequential users and batch users. In the sequential user model, each user has a number of tasks to run sequentially. This model emulates users who use the results of previous runs to select parameters for their next run; steering an interactive instrument such as a large telescope would create a workload of this type. Since DIET is a multi-user system, we test varying system loads by varying the number of sequential users that are performing this style of interactive usage at once. In the batch user model, we consider users that have a number of tasks to submit all at once and who are more interested in the completion of the group rather than each individual task. Parameter sweep applications typically create this type of workload.

Scheduling approaches: We compare the traditional DIET on-line scheduling approach, termed the Standard approach, against our extended version of DIET, termed the Task Scheduler approach.

Application model: As an initial study, we use a simple matrix application with varying problem sizes. Specifically, we use a BLAS DIET server that internally uses the dgemm library function to solve problems of type C = αC + βAB. A matrix operation of this sort is so computationally simple relative to the communication required that there are few conditions in practice under which it would be run remotely. A more reasonable scenario for grid execution is one in which many different matrix operations are performed remotely for each communication of the matrices; this is, in fact, the behavior of a broad variety of scientific applications. We thus modified the DIET BLAS server to perform 10 iterations of the matrix operations; this provides a more realistic user scenario while allowing us to maintain a simple application as an initial case study. We also specifically selected a BLAS implementation that is not very cache or memory sensitive and thus is not very sensitive to competing load in terms of performance; this choice was made to allow comparison of the scheduling strategies themselves rather than simply their ability to avoid contention between concurrent computations.
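The modified server's workload, 10 repetitions of C = αC + βAB, can be sketched as follows. This is a plain-Python stand-in for illustration only; the actual server calls an optimized BLAS dgemm on 1000x1000-class matrices, and note that the paper's operation differs from the conventional dgemm definition (C = αAB + βC) in the roles of α and β.

```python
def dgemm_task(A, B, C, alpha=1.0, beta=1.0, iterations=10):
    """Emulates the modified DIET BLAS server's workload: repeated
    application of C = alpha*C + beta*A*B on square n x n matrices
    (lists of lists), as described in the application model."""
    n = len(A)
    for _ in range(iterations):
        # AB = A * B (naive triple loop; a real server uses BLAS dgemm)
        AB = [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
              for i in range(n)]
        C = [[alpha * C[i][j] + beta * AB[i][j] for j in range(n)]
             for i in range(n)]
    return C
```

With A = B = I and C = 0, each iteration adds I to C, so 10 iterations yield 10·I, which makes the repeated-operation structure easy to check.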

5.2. Performance for bursty request arrival

In this experiment we test an interactive scenario: a user that quickly submits a group of tasks and who is interested in the completion of the group. In real-world situations clients may send any sized burst of tasks at any rate; in these experiments we test the robustness of each approach to user send rates.

Experimental design: We use a DIET hierarchy with one MA, two LAs, and eight SeDs (four attached to each LA). We model a user scenario in which the user has a group of eight tasks to run and submits them all as a batch to DIET; all requests are for dgemm with square matrices of size 1500x1500. We test task inter-arrival times of 0, 0.25, 0.5, 0.75, and 1 second. We chose one second as the upper limit after some exploratory testing indicated that, for our particular test system, scheduler behaviors for inter-arrival times larger than one second match those of one second. For each submission of 8 tasks, we measured the mean request turnaround time and the makespan for completion of all eight tasks. This test was repeated five times for each configuration of period and scheduling approach.
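The two metrics used throughout the experiments can be stated precisely with a small helper. This is a hypothetical illustration of the metric definitions, not part of the measurement scripts.

```python
def batch_metrics(submit_times, finish_times):
    """Computes the two reported metrics for a batch of tasks:
    - makespan: first submission to last completion of the batch;
    - mean turnaround: average of per-task (finish - submit) times.
    submit_times[i] and finish_times[i] belong to the same task."""
    makespan = max(finish_times) - min(submit_times)
    turnarounds = [f - s for s, f in zip(submit_times, finish_times)]
    return makespan, sum(turnarounds) / len(turnarounds)
```

For instance, two tasks submitted at t = 0 and t = 1 that finish at t = 5 and t = 4 give a makespan of 5 and a mean turnaround of 4.0.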


Results: Figure 2 shows the mean and standard deviation of the makespan (Figure 2a) and the task turnaround time (Figure 2b). The most important result shown in these figures is that the Task Scheduler approach provides consistently good performance regardless of the rate at which the user sends tasks. The Standard approach achieves a similar level of performance for an inter-arrival time of one second, but performed more poorly for shorter inter-arrival times. Thus, in real-world scenarios where users send at a variety of rates, the Task Scheduler should provide consistently good performance while the Standard approach may be more unreliable.

Figure 2. The effect of client send behavior on scheduler performance as shown by (a) the makespan for completion of all eight tasks and (b) the mean task turnaround time. (Both panels plot seconds against task inter-arrival time from 0 to 1 second, for the Standard and Task Scheduler approaches.)

5.3. Performance under changing resource conditions

In this experiment, we test the capacity of each approach to adapt to changing conditions; this is an important feature as DIET is targeted for distributed, heterogeneous grid systems where conditions may change rapidly.

Experimental design: We again use a base DIET hierarchy with one MA, two LAs, and eight servers (four attached to each LA). We also use the same user model, except we now model a user who has a group of 54 tasks to run in total, where all requests are for dgemm with square matrices of size 1200x1200. To remove the effect of task inter-arrival times on performance, we use the results of the last section and introduce a fixed sleep of one second between each task submission.

The tests were run as follows: first all 54 clients were launched at a rate of one per second, then a pause of ten seconds was taken, and then four additional SeDs were attached to the DIET hierarchy (two on each LA). After all 54 client requests finished, we calculated the makespan (based on the time of launch of the first client to the completion of the last request in the system) and the mean turnaround time.

Results: The Task Scheduler resulted in better performance with a makespan of 312.7 seconds and a mean task turnaround time of 147.6 ± 52.7 seconds. By comparison, the Standard approach resulted in a makespan of 460.1 seconds and a mean task turnaround time of 254.7 ± 61.7 seconds. The Task Scheduler provides better performance because it does not schedule requests as far in advance; when the new resources are added to the system, the MA can adapt its scheduling decisions for all tasks remaining at the MA level and therefore utilize the newly added SeDs.

5.4. Steady-state performance studies

The previous sections described execution environments where our extensions proved effective. However, it is difficult to differentiate the overhead introduced by a strategy from the performance difference due to the scheduling strategy itself. In this section, we present experiments designed to reveal any overheads or other performance problems introduced by our extensions. The experiments use a steady, long-running, heavy request volume of uniform tasks, homogeneous resources, and no competing load. Assuming that all servers are kept continuously busy, there can be no performance benefit due to the scheduling strategy used. Instead, any differences in performance can be attributed to overheads or an inability to keep all servers busy.

Experimental design: For these experiments we use a DIET architecture with 1 MA, 4 LAs, and 24 servers; 6 servers are attached to each LA. Each server and agent was provided a dedicated machine. We use a sequential user model with 96 concurrent users of DIET; thus, on average, there are four users per server, or two per processor. Each user is emulated with a script that submits one request at a time in a continuous loop; all requests were for dgemm with square matrices of size 1000x1000. Since we had a limited number of resources available for these experiments, we placed 6 users per client machine.

The user scripts were launched at a rate of one every five seconds until all 96 were in place (about eight minutes). Three hours later the user scripts were stopped. For the calculation of statistics, tasks are considered only if they entered the system more than 60 minutes after the start of the first user script and more than 15 minutes before the scripts were stopped.

Proceedings of the 4th International Symposium on Parallel and Distributed Computing (ISPDC'05) 0-7695-2434-6/05 $20.00 © 2005 IEEE
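The sequential user model above can be emulated with a simple driver: each user thread submits one request at a time in a continuous loop, and users are ramped up at a fixed interval. This is an illustrative sketch, not the scripts used in the experiment; the `submit` callable is a hypothetical stand-in for the blocking DIET client call.

```python
import threading
import time

def emulated_user(submit, stop_event):
    """One sequential user: submit one request at a time in a loop."""
    while not stop_event.is_set():
        submit()  # blocks until the request completes

def run_workload(submit, n_users=96, ramp_interval=5.0, duration=3 * 3600):
    """Launch one user every `ramp_interval` seconds, run for `duration`
    seconds of steady state, then stop all users."""
    stop = threading.Event()
    users = []
    for _ in range(n_users):
        t = threading.Thread(target=emulated_user, args=(submit, stop))
        t.start()
        users.append(t)
        time.sleep(ramp_interval)
    time.sleep(duration)
    stop.set()
    for t in users:
        t.join()
```

With the paper's parameters (96 users, one launched every five seconds), the ramp-up phase alone takes about eight minutes, matching the figure quoted above.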

Request handling: Figure 3 shows a 50-request snapshot of task execution for each strategy during the three-hour experiment; since each full experiment included about 10,000 requests, these snapshots represent only about 0.5% of each experiment. For the Standard approach trace, shown in Figure 3(a), each bar represents the total time spent for each request, and two dominant activities are highlighted within each bar: the time required for the DIET hierarchy to select the best server(s) for the request (Server Selection) and the time for the SeD to do the computation (Compute time). For this scenario, the computation time is dominant and the scheduling response time is excellent. Task completion (i.e., the end of the Compute time phase) is highly irregular because each of the 24 compute servers operates independently and may have different numbers of concurrently executing requests.

Figure 3. A snapshot of a 50-request portion of the steady-state experiment executions for (a) the Standard DIET approach and (b) the Task Scheduler approach. Both panels plot time progression (sec) against request number; panel (a) breaks each bar into Server Selection and Compute time, while panel (b) breaks each bar into Task Scheduler, SeD Queue, and Compute time.

Figure 3(b) provides a trace for the Task Scheduler approach; the total task time is broken down into three activities: the Task Scheduler phase, the SeD Queue phase, and the Compute time phase. The Task Scheduler phase encompasses the time spent waiting at the MA-level task scheduler for entrance into the hierarchy, the time required for the DIET hierarchy to identify the best servers, and the time for the Task Scheduler to finalize the scheduling decisions. The SeD Queue phase is the time the request spends waiting at the SeD for permission to begin the solve. The Compute time phase is the time for the SeD to do the computation. The release of requests in groups can be easily seen in the stair-step nature of the Task Scheduler portion of the bar graph. Finally, the Compute times are very stable and relatively short because jobs do not execute concurrently.

Request throughput: In steady-state conditions the average request completion rate provides a useful metric for comparing approaches. For each approach, we calculated the number of requests finished by all clients per minute; we then calculated an average completion rate over the entire steady-state experiment, excluding the startup and shutdown phases. The Standard approach processed an average of 55.5 requests per minute with a standard deviation of 5.1 requests per minute, while the Task Scheduler approach processed an average of 53.1 ± 2.6 requests per minute. The Task Scheduler approach thus causes a 4.3% loss in performance, but also offers more throughput stability over time, with a standard deviation roughly half as large. For many scenarios we consider the added overheads of the Task Scheduler approach reasonable given the various benefits of this approach.
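The throughput statistic above can be reconstructed as follows: bin task completion times into one-minute buckets inside the steady-state window, then take the mean and standard deviation over the buckets. The relative overhead is the gap between the two mean rates. This is a sketch of the computation, not the scripts used for the paper; the sample finish times are invented.

```python
from collections import Counter
from statistics import mean, stdev

def completion_rate(finish_times, t_start, t_end):
    """Mean and std dev of requests finished per minute, counting only
    tasks that finished inside the window [t_start, t_end) seconds."""
    per_minute = Counter(int(t // 60) for t in finish_times
                         if t_start <= t < t_end)
    # Include minutes with zero completions inside the window.
    minutes = range(int(t_start // 60), int(t_end // 60))
    counts = [per_minute.get(m, 0) for m in minutes]
    return mean(counts), stdev(counts)

# Toy data: four completions over a three-minute window.
rate_mean, rate_sd = completion_rate([0.0, 30.0, 70.0, 130.0], 0, 180)

# Relative overhead of the Task Scheduler vs. the Standard approach,
# using the mean rates reported above (55.5 vs. 53.1 requests/minute):
overhead = (55.5 - 53.1) / 55.5  # about 0.043, i.e. roughly 4.3%
```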

6. Discussion

This paper has focused on hierarchical scheduling algorithms for distributed, heterogeneous resource environments. As computational grid systems grow in size, distributed scheduling algorithms will become an important aspect of providing scalability in grid middleware. While our approach has been developed for an NES system, our hierarchical approach and window-based algorithm could be useful for other grid scheduling systems in need of scalability.

We have described two extensions made to the standard hierarchical scheduling approach used in DIET. The first extension provides the ability to limit the number of concurrent jobs that can run on a server at a time. This feature reduces resource contention between requests and provides stable computation times. The second extension enables some global task scheduling control at the MA level. Specifically, we have added a queue-like structure at the MA to control the rate of release of requests into the system. The release of requests from this queue is controlled by an adaptive window-based scheduling algorithm that seeks to maintain the low scheduler response time of the standard DIET approach under low-load conditions while ensuring that requests are not scheduled too far in advance under high-load conditions. We also presented a simple algorithm for task placement refinement that improves load balance in the scheduling decisions made during each window. The algorithms we have introduced are particularly interesting because they are designed to function in a distributed scheduling environment in which global information on all servers and tasks is never available and where resource information can only be obtained by scheduling at least one task.
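The two extensions can be sketched together: a semaphore caps the number of concurrent solves at each SeD, while the MA releases queued requests in windows whose size adapts to load. This is an illustrative reconstruction, not the DIET implementation; in particular, the round-robin placement and the grow-when-drained/shrink-when-backlogged adaptation rule are simplified assumptions standing in for the actual window algorithm.

```python
import queue
import threading

class SeD:
    """Server daemon that limits the number of concurrent solves."""
    def __init__(self, max_concurrent=1):
        self._slots = threading.Semaphore(max_concurrent)

    def solve(self, request, compute):
        with self._slots:  # wait for a free slot, then compute
            return compute(request)

class MasterAgent:
    """MA-level queue releasing requests in adaptive windows."""
    def __init__(self, servers, initial_window=4):
        self.pending = queue.Queue()
        self.servers = servers
        self.window = initial_window

    def submit(self, request):
        self.pending.put(request)

    def release_window(self):
        """Release up to `window` requests and map each to a server
        (round-robin stand-in for the placement refinement step)."""
        batch = []
        while len(batch) < self.window and not self.pending.empty():
            batch.append(self.pending.get())
        mapping = [(self.servers[i % len(self.servers)], req)
                   for i, req in enumerate(batch)]
        # Simplified adaptation: grow the window when the queue drained
        # (low load, keep response time low), shrink it when requests
        # are still waiting (high load, avoid scheduling too far ahead).
        if self.pending.empty():
            self.window += 1
        else:
            self.window = max(1, self.window - 1)
        return mapping
```

In this sketch a request released by the MA still waits at the SeD for a semaphore slot before its solve begins, which corresponds to the SeD Queue phase measured in the experiments.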

We then presented experiments comparing the performance of the Standard on-line DIET scheduling approach against that of the Task Scheduler approach, which used both the SeD-level and MA-level extensions. We demonstrated that under bursty load conditions, the Task Scheduler provides more reliable performance regardless of user behavior. We then demonstrated that the Task Scheduler is better able to adapt to changing resource conditions. Finally, we presented long-running, high-load experiments designed to test the overheads introduced by our extensions. We found that the Task Scheduler approach caused a 4.3% loss in performance as compared to the Standard approach. In combination, these results indicate that we succeeded in introducing useful new functionality to the hierarchical DIET scheduling approach with only a small cost in performance relative to the standard on-line approach.

Our future work will consist of experimenting with distributed scheduling algorithms at different levels of the hierarchy. First, we plan to experiment with different task/server affinity models. When data dependencies occur between requests on the same server (or even between different servers), we need to move requests in the queues to optimize data management.

We also plan to extend our tests on the Grid'5000 [1] experimental platform to incorporate multiple clusters distributed throughout France. This platform provides a unique opportunity to perform experiments in a relatively controlled environment that is also distributed and heterogeneous. We expect the need for a distributed scheduling approach will be evident on this larger-scale platform.

Acknowledgements

The authors would like to thank Alan Su, Raphael Bolze, and Eddy Caron for many lively discussions about hierarchical scheduling.

References

[1] Grid'5000 project. http://www.grid5000.org.

[2] D. Andresen and T. McCune. H-SWEB: A Hierarchical Scheduling System for Distributed WWW Server Clusters. Concurrency: Practice and Experience, 12:189–210, 2000.

[3] E. Caron, F. Desprez, F. Lombard, J.-M. Nicod, M. Quinson, and F. Suter. A Scalable Approach to Network Enabled Servers. In Proc. of EuroPar 2002, Paderborn, Germany, 2002.

[4] J. Carreto, J. Fernandez, F. García, and A. Chouhary. A Hierarchical Disk Scheduler for Multimedia Systems. Future Generation Computer Systems, 19:23–35, 2003.

[5] H. Casanova and J. Dongarra. NetSolve: A network server for solving computational science problems. In Proceedings of the Supercomputing Conference (SC'96), November 1996.

[6] H. Casanova, S. Matsuoka, and J. Dongarra. Network-Enabled Server Systems: Deploying Scientific Simulations on the Grid. In High Performance Computing Symposium (HPC'01), Seattle, Washington (USA), April 2001.

[7] H. Dail and F. Desprez. Experiences with hierarchical request flow management for network-enabled server environments. International Journal of High Performance Computing Applications, 2006. To appear.

[8] S. Dandamudi. Hierarchical Scheduling in Parallel and Cluster Systems. Kluwer Academic/Plenum Publishers, New York, USA, 2003.

[9] D. Thain, T. Tannenbaum, and M. Livny. Distributed computing in practice: The Condor experience. Concurrency and Computation: Practice and Experience, 2005.

[10] D. Feitelson, L. Rudolph, U. Schwiegelshohn, K. Sevcik, and P. Wong. Theory and practice in parallel job scheduling. In D. Feitelson and L. Rudolph, editors, Job Scheduling Strategies for Parallel Processing, volume 1291 of Lecture Notes in Computer Science, pages 1–34. Springer Verlag, 1997.

[11] I. Foster and C. Kesselman. The Globus Project: A status report. Pages 4–18, IEEE Press, 1998.

[12] J. Gehring and T. Preiss. Scheduling a Metacomputer with Uncooperative Subschedulers. In Proc. IPPS Workshop on Job Scheduling Strategies for Parallel Processing, volume 1659 of LNCS, Puerto Rico, April 1999. Springer.

[13] A. Halderen, B. Overeinder, and P. Sloot. Hierarchical Resource Management in the Polder Metacomputing Initiative. Parallel Computing, 24:1807–1825, 1998.

[14] Z. Juhasz, A. Andics, and S. Pota. Towards a Robust and Fault-Tolerant Multicast Discovery Architecture for Global Computing Grids. In 4th Austrian-Hungarian Workshop on Distributed and Parallel Systems (DAPSYS 2002), Linz, Austria, Sept. 2002.

[15] H. Nakada, M. Sato, and S. Sekiguchi. Design and implementations of Ninf: Towards a global computing infrastructure. Future Generation Computing Systems, 15(5–6):649–658, 1999.

[16] J. Santoso, G. van Albada, B. Nazief, and P. Sloot. Simulation of Hierarchical Job Management for Meta-Computing Systems. International Journal of Foundations of Computer Science, 12(5):629–643, 2001.

[17] E. Seidel, G. Allen, A. Merzky, and J. Nabrzyski. GridLab: a grid application toolkit and testbed. Future Generation Computer Systems, 18(8):1143–1153, 2002.

[18] K. Seymour, H. Nakada, S. Matsuoka, J. Dongarra, C. Lee, and H. Casanova. An Overview of GridRPC: A Remote Procedure Call API for Grid Computing. In 3rd International Workshop on Grid Computing, Nov. 2002.

[19] V. Subramani, R. Kettimuthu, S. Srinivasan, and P. Sadayappan. Distributed Job Scheduling on Computational Grids Using Multiple Simultaneous Requests. In Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing (HPDC'02). IEEE Computer Society, 2002.
