Real-Time Syst (2012) 48:264–293
DOI 10.1007/s11241-011-9145-6

Scheduling real-time divisible loads with advance reservations

Anwar Mamat · Ying Lu · Jitender Deogun · Steve Goddard

Published online: 19 January 2012
© Springer Science+Business Media, LLC 2012

Abstract Providing QoS and performance guarantees to arbitrarily divisible loads has become a significant problem for many cluster-based research computing facilities. While progress is being made in scheduling arbitrarily divisible loads, previous approaches have no support for advance reservations. However, with the emergence of Grid applications that require simultaneous access to multi-site resources, supporting advance reservations in a cluster has become increasingly important. In this paper we propose a new real-time divisible load scheduling algorithm that supports advance reservations in a cluster. The impact of advance reservations on system performance is systematically studied. Simulation results show that, with the proposed algorithm and appropriate advance reservations, the system performance can be maintained at the same level as in the no-reservation case. Our approach thus enforces real-time agreements while addressing under-utilization concerns.

Keywords Real-time scheduling · Parallel computing · Divisible load · Cluster computing · Resource reservation

A. Mamat · Y. Lu · J. Deogun · S. Goddard
Department of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE 68588, USA
e-mail: [email protected]

A. Mamat e-mail: [email protected]
J. Deogun e-mail: [email protected]
S. Goddard e-mail: [email protected]



1 Introduction

Arbitrarily divisible or embarrassingly parallel workloads can be partitioned into an arbitrarily large number of independent load fractions, and are quite common in bioinformatics as well as high energy and particle physics. For example, the CMS (Compact Muon Solenoid) (2008) and ATLAS (A Toroidal LHC Apparatus) (2008) projects, associated with the Large Hadron Collider (LHC) at CERN (European Laboratory for Particle Physics), execute cluster-based applications with arbitrarily divisible loads.

In a large-scale cluster, the resource management system (RMS), which provides real-time guarantees or QoS, is central to its operation. As a result, the real-time scheduling of arbitrarily divisible loads is becoming a significant problem for cluster-based research computing facilities like the U.S. CMS Tier-2 sites (Swanson 2005). Due to this increasing importance (Robertazzi 2003), a few efforts (He et al. 2004; Lee et al. 2003; Lin et al. 2007b; Chuprat and Baruah 2008) have been made in real-time divisible load scheduling, with significant initial progress in important theories and novel approaches.

To support real-time applications at the Grid level, advance reservations of cluster resources play a key role. For instance, real-time process data collection and analysis requires resources to be available for predefined time intervals. Since a Grid scheduler usually does not have direct control of resources, to provide real-time guarantees, especially to large resource-consuming applications, advance resource reservations at multiple cluster sites are necessary. However, in clusters, advance reservations have been largely ignored due to under-utilization concerns and the lack of support for agreement enforcement (Siddiqui et al. 2006). In this paper, we investigate real-time divisible load scheduling with advance reservations; its challenges are carefully analyzed and addressed. In a cluster with no reservations, resources are allocated to tasks until they finish processing. If, however, advance reservations are supported in a cluster, computing nodes and the communication channel can be reserved for a period of time and become unavailable for regular tasks. Due to these constraints, it becomes very difficult to efficiently count the available resources and schedule real-time tasks.

Two major contributions are made in this paper. First, a multi-stage real-time divisible load scheduling algorithm that supports advance reservations is proposed. The novelty of our approach is that we consider reservation blocks on both the computing nodes and the communication channel. According to Sotomayor et al. (2006), many applications have huge deployment overheads, requiring large and costly file staging before the applications start. To provide real-time guarantees, it is therefore essential to take the reservation's data transmission into account. Second, the effects of advance reservations on system performance are thoroughly investigated. Thus far, it has been believed that advance reservations result in under-utilization of system resources and thus a cost-inefficient system. Our study demonstrates that with our proposed algorithm and appropriate advance reservations, we can avoid under-utilizing the real-time cluster.

The rest of the paper is organized as follows. Section 2 describes the system and task models. Real-time scheduling with advance reservations is investigated in


Sect. 3. In Sect. 4, we establish the correctness of our approach. We evaluate and analyze the system performance in Sect. 5 and present the related work in Sect. 6. Section 7 concludes the paper.

2 Task and system models

In this paper, we adopt similar task and system models as in our previous work (Lin et al. 2007b, 2007c; Mamat et al. 2008). For completeness, we briefly present these below.

Task model There are two types of task: the reservation and the regular task. A reservation R^i is specified by the tuple (R^i_a, R^i_s, n^i, R^i_e, IO^i_ratio), where R^i_a is the arrival time of the reservation request, R^i_s and R^i_e are respectively the start time and the finish time of the reservation, n^i is the number of nodes to be reserved in the [R^i_s, R^i_e] interval, and IO^i_ratio specifies the data transmission time relative to the length of the reservation. It is assumed that for a reservation, data transmission happens at the beginning and computation follows. Let R^i_io = R^i_s + (R^i_e − R^i_s) × IO^i_ratio. We have data transmission in the interval [R^i_s, R^i_io] and computation in the interval [R^i_io, R^i_e]. For a regular (non-reservation) task, a real-time aperiodic task model is assumed, in which each aperiodic task Ti consists of a single invocation specified by (Ai, σi, Di), where Ai is the task arrival time, σi is the total data size of the task, and Di is its relative deadline. The task absolute deadline is given by Ai + Di. Assuming Ti is arbitrarily divisible, the task execution time is dynamically computed based on the total data size σi, the resources allocated (i.e., processing nodes and bandwidth) and the partitioning method applied to parallelize the computation (Sect. 3).
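The two tuples above can be sketched as plain data structures; this is a hypothetical transcription for illustration (the class and field names are ours, not the paper's):

```python
from dataclasses import dataclass

@dataclass
class Reservation:
    """Reservation R = (R_a, R_s, n, R_e, IO_ratio) from the task model."""
    arrival: float       # R_a: when the reservation request arrives
    start: float         # R_s: start of the reserved interval
    nodes: int           # n: number of nodes reserved in [R_s, R_e]
    end: float           # R_e: end of the reserved interval
    io_ratio: float      # fraction of [R_s, R_e] spent transmitting data

    @property
    def io_end(self) -> float:
        """R_io = R_s + (R_e - R_s) * IO_ratio: data transmission ends here."""
        return self.start + (self.end - self.start) * self.io_ratio

@dataclass
class Task:
    """Regular aperiodic task T = (A, sigma, D)."""
    arrival: float       # A: task arrival time
    data_size: float     # sigma: total data size
    rel_deadline: float  # D: relative deadline

    @property
    def abs_deadline(self) -> float:
        """Absolute deadline A + D."""
        return self.arrival + self.rel_deadline

r = Reservation(arrival=0.0, start=10.0, nodes=2, end=20.0, io_ratio=0.5)
assert r.io_end == 15.0   # data sent in [10, 15], computation in [15, 20]
```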

System model A cluster consists of a head node, denoted by P0, connected via a switch to N processing nodes, P1, P2, . . . , PN. We assume that all processing nodes have the same computational power and bandwidth to the switch. The system model assumes a typical cluster environment in which the head node does not participate in computation. The role of the head node is to accept or reject incoming tasks, execute the scheduling algorithm, divide the workload and distribute data chunks to processing nodes. Since different nodes process different data chunks, the head node sequentially sends every data chunk to its corresponding processing node via the switch. As in many previous studies on divisible load theory (Veeravalli et al. 2003; Cheng and Robertazzi 1990), we assume a single-level tree network (SLTN) topology in our system model. Thus, data transmission is assumed not to occur in parallel, although it is straightforward to generalize our model to include the case where some pipelining of communication may occur. For arbitrarily divisible loads, tasks and subtasks are independent. Therefore, there is no need for processing nodes to communicate with each other.

According to divisible load theory (Veeravalli et al. 2003), linear models are used to represent the processing and transmission times of regular tasks. In the simplest scenario, the computation time of a load σ is calculated by a cost function Cp(σ) = σCps, where Cps represents the time to compute a unit of workload on


Fig. 1 Cluster nodes with no reservation

a single processing node. The transmission time of a load σ is calculated by a cost function Cm(σ) = σCms, where Cms is the time to transmit a unit of workload from the head node to a processing node. In this paper, we assume a linear relationship between data size and processing time. It is, however, straightforward to generalize our algorithms to handle the case where computing time is non-linearly related to data size. For many applications the output data is just a short message and is negligible, particularly considering the very large size of the input data. Therefore, in this paper we only model the transmission of application input data but not that of output data. If the amount of output data is non-negligible, then to accept and schedule a task the scheduler has to consider the transmission of the task's output data (for instance, to identify if there is enough bandwidth available for the output data transmission; and once scheduled, a task's output data transmissions could be treated as advance reservations to avoid conflicts with other tasks). The extension to consider output data transfer is, however, out of the scope of this paper.
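As a concrete illustration of the linear cost model, the two cost functions can be written directly in code; the unit costs here are illustrative values, not figures from the paper:

```python
# Linear cost model from divisible load theory (a sketch; the Cms and Cps
# values below are illustrative, not taken from the paper).
CMS = 0.2   # time to transmit one unit of workload to a processing node
CPS = 1.0   # time to compute one unit of workload on a processing node

def transmission_time(sigma: float) -> float:
    """Cm(sigma) = sigma * Cms."""
    return sigma * CMS

def computation_time(sigma: float) -> float:
    """Cp(sigma) = sigma * Cps."""
    return sigma * CPS

# A load of 50 units: 10 time units on the wire, 50 time units to compute.
assert transmission_time(50) == 10.0
assert computation_time(50) == 50.0
```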

3 Scheduling with advance reservation

This section presents a real-time scheduling algorithm that supports advance reservations in a cluster.

In Lin et al. (2007b, 2007c), we investigated the problem of real-time divisible load scheduling in clusters. Our previous work, however, does not address the challenges of supporting advance reservations. Without advance reservations, computing resources are allocated to tasks until they finish computation. If, however, advance reservations are supported in a cluster, a computing node could be reserved for a period of time and become unavailable for regular tasks. The reservations thus block the processing of regular tasks and cast severe constraints on the real-time scheduling. In the following, we use an example to illustrate the challenge.

In Fig. 1, we show three processing nodes P1, P2 and P3 available at time points S1, S2 and S3 respectively. Once available, they could be allocated to execute a new task. Upon arrival of a new task, the real-time scheduler considers the system status and determines if enough processing power is available to finish the task before its deadline. The decision process is simple when there is no reservation block: for node Pi, any time between Si and the task deadline could be allocated to the new task. It, however, becomes a complicated process when there are advance reservations. For instance, as shown in Fig. 2, there is an advance reservation R occupying node P2 from time Rs to time Re. During the reserved period, the time from Rs to Rio

is used for transmitting data to node P2 and the time from Rio to Re is used for computation. Because of the reservation, node P2 becomes unavailable in the time period [Rs, Re]. Furthermore, the reservation interferes with activities on other nodes. During the time period [Rs, Re], nodes P1 and P3 could be used to compute tasks.


Fig. 2 Cluster nodes with a reservation

However, data transmission to P1 or P3 is not allowed in the interval [Rs, Rio] when data is transmitted to node P2. Because of these constraints, it becomes a challenge to efficiently count the available processing power and schedule real-time tasks. The remainder of this section discusses how we overcome this challenge and design an algorithm that supports advance reservations in a real-time cluster.

3.1 Admission control algorithm

As is typical for dynamic real-time scheduling algorithms (Dertouzos and Mok 1989; Manimaran and Murthy 1998; Ramamritham et al. 1990), when a task arrives, the scheduler dynamically determines if it is feasible to schedule the new task without compromising the guarantees for previously admitted tasks. The pseudocode for the schedulability test is shown in Algorithm 1.

According to the newly arrived task's type, the algorithm invokes an admission test. For a reservation, the test (Algorithm 2) first checks if enough processing nodes are available to accommodate the reservation. Because data transmission does not happen in parallel, we must then ensure that the new reservation will not cause any I/O conflict. The function IO_Overlap(Rk, R) verifies if the data transmissions for Rk and R overlap. If so, the new reservation R is rejected. If the admission test is successful, it proves that accepting R will not compromise the guarantees for previously accepted reservations. Its impact on previously accepted regular tasks is yet to be analyzed, which is the third step of the algorithm. For a regular task T, the admission test (Algorithm 3) checks if T is schedulable with the accepted reservations. When a new regular task T arrives, it is added to the waiting queue of regular tasks. We adopt the EDF (Earliest Deadline First) scheduling algorithm and order the queue by task absolute deadlines. The schedulability test (Algorithm 1) invokes the admission test (Algorithm 3) for each task in the queue. If they are all successful, it proves that accepting T will not compromise the guarantees for previously accepted tasks, including all reservations and regular tasks.

As mentioned, Algorithm 3 tests the schedulability of a regular task. We have N processing nodes in the cluster, available at time points S1, S2, . . . , SN. Assume reservations are made on these nodes for specific periods of time. To determine whether or not a regular task T(A, σ, D) is schedulable, the N nodes' total processing time that could be allocated to task T by its absolute deadline A + D is computed. If the total time is enough to process the task, deadline D can be satisfied and task T is schedulable. To derive the processing time, we first compute the blocking time when a processing node cannot be utilized.

Blocking happens because data cannot be transmitted in parallel. Transmission to a processing node blocks transmissions to all other nodes. There are three types of blocking: (1) blocking caused by a reservation's data transmission; (2) among the nodes allocated to a task, data transmission to one node blocks the other transmissions of the same task; (3) blocking caused by another task's data transmissions.


Algorithm 1 boolean Schedulability_Test(T)
1:  if isResv(T) then
2:      // Reservation admission control
3:      if !ResvAdmTest(T) then
4:          return false
5:      end if
6:  end if
7:  // Reservation list
8:  TempResvList ← ResvQueue
9:  // Regular task list
10: TempTaskList ← TaskWaitingQueue
11: if isResv(T) then
12:     TempResvList.add(T)
13: else
14:     TempTaskList.add(T)
15: end if
16: // EDF scheduling of regular tasks
17: order TempTaskList by task absolute deadline
18: order TempResvList by reservation start time
19: while TempTaskList != ∅ do
20:     TempTaskList.remove(T)
21:     // Regular task admission control
22:     if !AdmTest(T) then
23:         return false
24:     else
25:         TempScheduleQueue.add(T)
26:     end if
27: end while
28: TaskWaitingQueue ← TempScheduleQueue
29: ResvQueue ← TempResvList
30: return true

Algorithm 2 boolean ResvAdmTest(R(Ra, Rs, n, Re, IOratio))
1:  // Check if the number of available nodes
2:  // in the [Rs, Re] time period is less than n
3:  if MinAvailableNode(Rs, Re) < n then
4:      return false
5:  end if
6:  for Rk ∈ ResvQueue do
7:      if IO_Overlap(Rk, R) then
8:          return false
9:      end if
10: end for
11: reserve n available nodes for R
12: return true
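A minimal executable sketch of Algorithm 2's two checks, under simplifying assumptions: reservations are plain dictionaries, and the node-availability check conservatively counts every reservation that intersects the requested interval. The helper names mirror the pseudocode but are ours:

```python
def io_end(r):
    """R_io = R_s + (R_e - R_s) * IO_ratio for a reservation dict."""
    return r["start"] + (r["end"] - r["start"]) * r["io_ratio"]

def io_overlap(a, b):
    """True if the data-transmission intervals [R_s, R_io] of a and b overlap."""
    return a["start"] < io_end(b) and b["start"] < io_end(a)

def min_available_nodes(resv_queue, start, end, total_nodes):
    """Worst-case free nodes in [start, end]: subtract every reservation
    whose reserved interval intersects it (a conservative estimate)."""
    busy = sum(r["nodes"] for r in resv_queue
               if r["start"] < end and start < r["end"])
    return total_nodes - busy

def resv_adm_test(resv_queue, r, total_nodes):
    """Sketch of Algorithm 2: node-count check, then I/O-conflict check."""
    if min_available_nodes(resv_queue, r["start"], r["end"], total_nodes) < r["nodes"]:
        return False
    if any(io_overlap(rk, r) for rk in resv_queue):
        return False
    resv_queue.append(r)   # reserve n available nodes for R
    return True

queue = []
r1 = {"start": 10, "end": 20, "nodes": 2, "io_ratio": 0.3}  # I/O in [10, 13]
r2 = {"start": 12, "end": 25, "nodes": 1, "io_ratio": 0.2}  # I/O in [12, 14.6]
assert resv_adm_test(queue, r1, total_nodes=4)
assert not resv_adm_test(queue, r2, total_nodes=4)  # I/O intervals overlap
```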


To count the available processing power and decide a task's schedulability, we have to consider all these blocking factors. However, the degree of blocking varies, depending on node available times, task start times and reservation lengths. For instance, a reservation with a long data transmission could lead to lengthy blocking. If a reservation and a task start at the same time, the task is blocked during the reservation's data transmission. Computing the exact blocking time is complicated. Thus, to simplify the computation, the worst-case blocking scenario is initially assumed in the admission test, and only if a task is determined schedulable will it be sent to the task partition procedure to accurately compute the task's schedule.

The worst-case blocking happens when all nodes become available at the same time as the reservation's data transmission. In this case, all nodes are blocked for the whole duration of the reservation's data transmission. Assuming this worst-case blocking scenario, Algorithm 3 derives the amount of data σsum that can be processed in the total available time. When σsum is larger than the task data size σ, enough nodes are found to finish the task.

The algorithm sorts the N nodes in non-decreasing order of node available time, i.e., making S1 ≤ S2 ≤ · · · ≤ SN. Following this order, each node's available processing time μi is computed. First, μi is initialized to A + D − max(Si, A), the longest time that Pi could be allocated to task T by its absolute deadline A + D. Then, considering the effects of reservations, μi is adjusted. For the worst-case blocking, a node Pi is assumed not utilized while data is transmitted for reservations. Let t0 = max(A, S1). The total time (ResvIO) consumed by the reservations' I/O in the interval [t0, A + D] is computed, which is the blocking time relevant to task T's schedule. If a reservation is on node Pi, task T cannot utilize Pi during the reservation's computation (ResvCPi). The algorithm thus reduces μi by ResvIO and ResvCPi. After considering the type 1 blocking caused by reservations, the algorithm considers the other two blocking factors. Bi−1 counts the type 2 blocking time on node Pi, which is caused by the same task's data transmissions to previously assigned nodes (P1, P2, . . . , Pi−1). Again, the worst-case blocking is assumed, i.e., Pi is assumed to be blocked for the time period of transmitting data to the first i − 1 nodes. μi is, therefore, reduced by Bi−1. This gives the final value of μi. The algorithm then derives the size of data that can be processed in μi amount of time. The total sum of data that can be processed by the first i nodes is recorded in σsum. Once σsum is larger than the task data size σ, enough nodes nmin are found for the task. The algorithm concludes that task T is schedulable, assigns nmin nodes to it and invokes the task partition procedure (MSTaskPartition, Algorithm 4) to accurately schedule the task. Processing task T may block tasks scheduled in the future, which leads to type 3 blocking. Considering this factor, the algorithm properly adjusts the node available times Si in the MSTaskPartition procedure (Algorithm 4).

3.2 Task partitioning algorithm

The previous section discusses how the real-time scheduling algorithm makes the admission control decision. As mentioned, upon admitting a regular task T(A, σ, D), a certain number (n) of nodes are allocated to it at certain time points (S1, S2, . . . , Sn). According to the admission controller, these nodes can finish processing T by deadline A + D. This section presents the next step of the scheduling algorithm: the task


Algorithm 3 boolean AdmTest(T(A, σ, D))
1:  σsum = 0
2:  if a node Pi becomes available during a reservation's data transmission, reset Pi's available time Si to the finish time of that data transmission
3:  sort nodes in non-decreasing order of node available time
4:  t0 = max(A, S1)
5:  // Compute the total reservation I/O time (ResvIO)
6:  // and the reservation computation time (ResvCPi)
7:  // on node Pi in the period [t0, A + D]
8:  ResvIO = 0
9:  for i ← 1 : N do
10:     ResvCPi = 0
11: end for
12: for R(Ra, Rs, n, Re, IOratio) ∈ ResvQueue do
13:     if Rs > t0 then
14:         Rio = Rs + (Re − Rs) × IOratio
15:         if Rio > A + D then
16:             ResvIO += A + D − Rs
17:         else
18:             ResvIO += Rio − Rs
19:         end if
20:         for Pi ∈ {nodes assigned to R} do
21:             ResvCPi += Re − Rio
22:         end for
23:     end if
24: end for
25: B0 = 0
26: for i ← 1 : N do
27:     // Compute node Pi's available processing time
28:     μi = A + D − max(Si, A)
29:     μi = μi − ResvIO − ResvCPi − Bi−1
30:     // Compute the size of data that can be processed
31:     σi = μi / (Cms + Cps)
32:     σsum = σsum + σi
33:     if σsum ≥ σ then
34:         nmin = i
35:         assign the first nmin nodes to T at their corresponding available times S1, S2, . . . , Snmin
36:         MSTaskPartition(T(A, σ, D, nmin, S1, S2, . . . , Snmin))
37:         return true
38:     end if
39:     υi = σi × Cms
40:     Bi = Bi−1 + υi
41: end for
42: return false
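The worst-case check in Algorithm 3 can be sketched compactly in code. This is a simplification under stated assumptions: reservations are (R_s, R_e, IO_ratio) triples, no node in the example holds a reservation (so ResvCPi is omitted), and the variable names and values are illustrative:

```python
# A sketch of Algorithm 3's worst-case schedulability check.
def adm_test(task, avail, resv_queue, cms, cps):
    """task = (A, sigma, D); avail = node available times S_i;
    resv_queue = [(R_s, R_e, IO_ratio), ...]. Returns (schedulable, n_min)."""
    A, sigma, D = task
    deadline = A + D
    S = sorted(avail)                    # non-decreasing node available times
    t0 = max(A, S[0])
    # Type-1 blocking: every reservation's I/O in [t0, deadline] blocks all
    # nodes (reservation computation ResvCPi is omitted in this sketch,
    # since no node here is assigned a reservation).
    resv_io = 0.0
    for rs, re, ratio in resv_queue:
        if rs > t0:
            rio = rs + (re - rs) * ratio
            resv_io += (deadline - rs) if rio > deadline else (rio - rs)
    sigma_sum, blocked = 0.0, 0.0        # blocked plays the role of B_{i-1}
    for i, s in enumerate(S, start=1):
        mu = deadline - max(s, A) - resv_io - blocked
        if mu <= 0:
            continue
        sigma_i = mu / (cms + cps)       # data node i can receive and process
        sigma_sum += sigma_i
        if sigma_sum >= sigma:
            return True, i               # schedulable on the first i nodes
        blocked += sigma_i * cms         # its transmission blocks later nodes
    return False, None

ok, n_min = adm_test(task=(0.0, 40.0, 100.0), avail=[0.0, 5.0, 10.0],
                     resv_queue=[(50.0, 80.0, 0.5)], cms=0.5, cps=2.0)
assert ok and n_min == 2
```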


Fig. 3 Multi-stage scenario 1

Fig. 4 Multi-stage scenario 2

partition procedure. We describe how a task is partitioned and executed on the allocated n nodes.

Without loss of generality, the n nodes are assumed to be sorted in a non-decreasing order of their available times. Task T(A, σ, D)'s processing on the n nodes should not interfere with reservations in the interval [t0, A + D], where t0 = max(A, S1). Once accepted, a reservation R(Ra, Rs, k, Re, IOratio) is guaranteed a certain number (k) of nodes at the specified start time (Rs). On the reserved nodes, the processing of regular tasks must stop before the reservation starts at Rs. If the reservation requires data transmission in the interval [Rs, Rio], where Rio = Rs + (Re − Rs) × IOratio, data transmissions to other nodes cannot be scheduled in the same interval. The proposed algorithm considers these constraints when partitioning and processing a task. To be applicable to a broader range of systems, our solution does not require the support of task preemption.

Assume during the interval [t0, A + D] there are m reservations, R1, R2, . . . , Rm, in the cluster. According to each of these reservations' data transmission intervals (denoted by [R^i_s, R^i_io], where R^i_io = R^i_s + (R^i_e − R^i_s) × IO^i_ratio), we divide the interval [t0, A + D] into M stages. The first stage starts at t0 and ends at R^1_io (the data transmission finish time of reservation R1). The second stage starts at R^1_io and ends at R^2_io. In general, the interval [R^(i−1)_io, R^i_io] is the ith stage when i is neither the first nor the last stage. The last, Mth stage ends at the task deadline A + D. If A + D is not in the last reservation's data transmission interval, i.e., A + D > R^m_io, M = m + 1 stages are generated, and there is no reservation's data transmission in the last stage (Fig. 3); otherwise, M = m and the last stage ends in the middle of Rm's data transmission (Fig. 4).
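The stage construction just described can be sketched as follows. We assume the reservations' data-transmission intervals are disjoint and sorted by start time (as guaranteed by the I/O-conflict check of Algorithm 2); the values are illustrative:

```python
# Sketch of the stage division: split [t0, A+D] at each reservation's
# data-transmission finish time R^i_io.
def build_stages(t0, deadline, reservations):
    """reservations: list of (R_s, R_e, IO_ratio), sorted by R_s, with
    non-overlapping I/O intervals. Returns the stage boundary list
    [t0, R^1_io, ..., A + D]; the number of stages is len(result) - 1."""
    bounds = [t0]
    for rs, re, ratio in reservations:
        rio = rs + (re - rs) * ratio
        if rio < deadline:               # deadline past this R_io: new stage
            bounds.append(rio)
    bounds.append(deadline)              # the last stage always ends at A + D
    return bounds

# Two reservations whose I/O finishes before the deadline:
# M = m + 1 = 3 stages, and the last stage holds no reservation I/O.
stages = build_stages(0.0, 100.0, [(10.0, 20.0, 0.5), (40.0, 60.0, 0.25)])
assert stages == [0.0, 15.0, 45.0, 100.0]
assert len(stages) - 1 == 3
```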


After dividing the interval [t0, A + D], we form M stages and each stage includes at most one reservation's data transmission interval (i.e., the interval [R^i_s, R^i_io]), which will occur at the end of the stage. When partitioning task T into subtasks T^i_j for the jth node in the ith stage, where j = 1, 2, . . . , n and i = 1, 2, . . . , M, the following constraints must be satisfied. If reservation Ri is on node Pj, subtask T^i_j must finish its data transmission and computation before the reservation starts at R^i_s. On the other hand, if Ri is not on Pj, subtask T^i_j can continue its computation until the end of the stage, R^i_io, but T^i_j must finish its data transmission before R^i_s, when Ri's data transmission starts. The multi-stage task partition procedure is shown in Algorithm 4.

Algorithm complexity analysis Our Schedulability_Test (Algorithm 1) has time complexity O(nN logN + n^3). Algorithm 1's time complexity is determined by the while loop (lines 19–27), where it invokes Algorithm 3 n times. Algorithm 3 has the same time complexity, O(N logN + n^2), as Algorithm 4, because it invokes Algorithm 4 once (line 36) when σsum ≥ σ. Thus, our Schedulability_Test algorithm's time complexity is O(nN logN + n^3).

4 Proof of algorithm correctness

In this section, we prove the correctness of the proposed real-time scheduling algorithm. We show that a task accepted by the admission control algorithm (Sect. 3.1) will be dispatched by the task partitioning algorithm (Sect. 3.2) and completed before the task deadline (this section).

We start by proving the algorithm's correctness in special scenarios and then generalize the proof.

Case 1: a group of processors are available simultaneously For this case, we assume that a group of l (l ≤ N) processors {P1, . . . , Pl} are available at the same time x (see Fig. 5). During the closed time interval [Rs, y], a reservation R is transmitting data. We consider the admission and partitioning of a task among the l processors in the time interval [x, y] and prove that the workload admitted to execute on the l processors in the time interval [x, y] can be completed.

Lemma 4.1 If the allocated processors are available at the same time, then the task will meet the deadline.

Proof It is easy to see that the workload admitted by the admission controller is no more than the workload that can be dispatched by the partitioning algorithm.

According to the admission control algorithm (Sect. 3.1), we estimate that in the time interval [x, y], the available processing time on node Pk, k ≤ l, is μk = y − x − ResvIO − Bk−1, where ResvIO represents the time consumed by the reservation's I/O and Bk−1 counts the blocking time caused by the regular task's data transmissions to nodes {P1, . . . , Pk−1}, leading to μk = Rs − x − Bk−1. That is, due to the reservation's I/O in [Rs, y], the admission control algorithm assumes that processors are only available for processing regular tasks in the time interval [x, Rs].


Algorithm 4 MSTaskPartition(T(A, σ, D, n, S1, S2, . . . , Sn))
1:  // Input:
2:  // task T and its allocated nodes
3:  // M: number of stages in the interval [t0, A + D],
4:  // where t0 = max(A, S1)
5:  // Output:
6:  // task partition vector a[n, M], where
7:  // 0 ≤ a[j, i] ≤ 1 is the fraction of data allocated
8:  // to the jth node in the ith stage, and
9:  // Σ_{j=1..n} Σ_{i=1..M} a[j, i] = 1
10: σsum = 0
11: σlast = σ
12: // Initialize tj, Pj's start time in the current stage
13: for j ← 1 : n do
14:     tj = Sj
15: end for
16: for i ← 1 : M do
17:     sort the n nodes by their start times
18:     for j ← 1 : n do
19:         if Pj ∈ {nodes assigned to Ri} then
20:             // Fraction of data that can be processed before R^i_s
21:             a[j, i] = (R^i_s − tj) / (σ(Cms + Cps))
22:         else
23:             // Fraction of data that can be transmitted
24:             // before R^i_s and processed before R^i_io
25:             tmp = min(R^i_s − tj, (R^i_io − tj) × Cms/(Cms + Cps))
26:             a[j, i] = tmp / (σCms)
27:         end if
28:         // Update Pj+1's start time, considering blocking
29:         tj+1 = max(tj + a[j, i]σCms, tj+1)
30:         // Compute Pj's start time in the next stage
31:         if Pj ∈ {nodes assigned to Ri} then
32:             tj = R^i_e
33:         else
34:             tj = R^i_io
35:         end if
36:         σsum = σsum + a[j, i]σ
37:         if σsum ≥ σ then
38:             a[j, i] = σlast / σ
39:             // Record these multi-stage data transmissions
40:             // in the n nodes. Update the other
41:             // nodes' available times considering the
42:             // blocking caused by T's first stage data
43:             // transmission. When scheduling other tasks
44:             // in the future, a later stage data transmission
45:             // of T will be treated the same as
46:             // a reservation's data transmission.
47:             UpdateNodeStatus()
48:             return
49:         end if
50:         σlast = σ − σsum
51:     end for
52: end for
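The per-node allocation rule at the heart of Algorithm 4 (lines 19–27) can be isolated as a small function. This is a sketch of one stage's decision for one node; the function name and the numeric values are illustrative:

```python
# Sketch of Algorithm 4's per-stage allocation rule for node Pj in stage i:
# a node holding reservation Ri must finish both transmission and
# computation before R^i_s; any other node must finish transmission before
# R^i_s but may keep computing until R^i_io.
def stage_fraction(tj, r_s, r_io, on_reservation, sigma, cms, cps):
    """Fraction a[j, i] of the task's data handled by this node this stage.
    tj is the node's start time in the current stage."""
    if on_reservation:
        # Data that can be transmitted AND processed in [tj, R^i_s]
        return max(0.0, (r_s - tj) / (sigma * (cms + cps)))
    # Transmission must end by R^i_s; computation may run until R^i_io
    tmp = min(r_s - tj, (r_io - tj) * cms / (cms + cps))
    return max(0.0, tmp / (sigma * cms))

sigma, cms, cps = 100.0, 0.5, 2.0
# Reserved node, free in [0, R_s = 10]: 10 / (100 * 2.5) of the data.
f_resv = stage_fraction(0.0, 10.0, 15.0, True, sigma, cms, cps)
assert abs(f_resv - 0.04) < 1e-12
# Unreserved node: bounded by computing until R_io = 15, so tmp = 3.0
# and the fraction is 3 / (100 * 0.5).
f_free = stage_fraction(0.0, 10.0, 15.0, False, sigma, cms, cps)
assert abs(f_free - 0.06) < 1e-12
```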


Fig. 5 Single-stage Case 1: l processors with the same available time

We use σest and σact to respectively denote the workload estimated by the admission control algorithm and the actual workload that can be processed by the l processors in the time interval [x, y]. As we have derived in our earlier work (Lin et al. 2007b), when simultaneously allocating l processors to a divisible task T of size σ, the task execution time is

E = ((1 − β)/(1 − β^l)) σ (Cms + Cps)   (1)

where

β = Cps/(Cms + Cps)   (2)

In addition, since the admission control algorithm assumes that the l processors are only available in the time interval [x, Rs], we have

σest = (Rs − x) / (((1 − β)/(1 − β^l))(Cms + Cps))   (3)
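Equations (1)–(3) translate directly into code. As a sanity check, with a single node (l = 1) the execution time reduces to σ(Cms + Cps), and σest is exactly the load whose execution time fills the window [x, Rs]; the Cms and Cps values are illustrative:

```python
# Eqs. (1)-(3): execution time of a size-sigma divisible task on l nodes,
# and the admission controller's workload estimate for the window [x, R_s].
def beta(cms, cps):
    """Eq. (2): beta = Cps / (Cms + Cps)."""
    return cps / (cms + cps)

def exec_time(sigma, l, cms, cps):
    """Eq. (1): E = ((1 - beta) / (1 - beta^l)) * sigma * (Cms + Cps)."""
    b = beta(cms, cps)
    return (1 - b) / (1 - b**l) * sigma * (cms + cps)

def sigma_est(x, r_s, l, cms, cps):
    """Eq. (3): the load whose execution time on l nodes equals R_s - x."""
    b = beta(cms, cps)
    return (r_s - x) / ((1 - b) / (1 - b**l) * (cms + cps))

# l = 1: E reduces to sigma * (Cms + Cps).
assert exec_time(10.0, 1, 0.5, 2.0) == 25.0
# sigma_est inverts exec_time over the window [x, R_s].
est = sigma_est(0.0, 30.0, 4, 0.5, 2.0)
assert abs(exec_time(est, 4, 0.5, 2.0) - 30.0) < 1e-9
```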

In contrast, the task partition algorithm leverages the fact that in the time interval [Rs, y], although data transmission cannot be done for regular tasks due to the conflict with the reservation's I/O, idle processors can still run computations for regular tasks. Following the task partitioning algorithm, we distribute workload to the l processors. There could be two typical cases for the actual workload processing.

Subcase 1.1 The transmission time of regular tasks is ≥ Rs − x (see Fig. 6). In this case, data has to be transmitted continuously till Rs. We want to prove that σact ≥ σest.

The actual workload σact dispatched to the l processors is the maximum amount of workload that can be transmitted during the time interval [x, Rs]. Thus, we have

σact = (Rs − x)/Cms   (4)


Fig. 6 Actual workload processing: typical Case 1

According to (3) and (2), we get

σest = (Rs − x) / (Cms/(1 − β^l))   (5)

From (2), we know 0 < β < 1. Thus, we have

1 > 1 − β^l   (6)

⇒ Cms < Cms/(1 − β^l)   (7)

⇒ (Rs − x)/Cms > (Rs − x) / (Cms/(1 − β^l))   (8)

That is

σact > σest   (9)
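The chain (4)–(9) can be checked numerically; the parameters below are illustrative values, not from the paper:

```python
# Numeric check of Subcase 1.1, Eqs. (4)-(9): the workload actually
# transmittable in [x, R_s] strictly exceeds the admission controller's
# estimate, since sigma_est is scaled down by the factor (1 - beta^l) < 1.
def beta(cms, cps):
    return cps / (cms + cps)

x, r_s, l, cms, cps = 0.0, 30.0, 4, 0.5, 2.0
b = beta(cms, cps)
sigma_act = (r_s - x) / cms                      # Eq. (4): 60 units
sigma_est = (r_s - x) / (cms / (1 - b**l))       # Eq. (5): about 35.4 units
assert 0 < b < 1                                 # premise behind Eq. (6)
assert sigma_act > sigma_est                     # Eq. (9)
```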

Subcase 1.2 The data transmission time of regular tasks is < Rs − x (see Fig. 7). This indicates that enough regular workload has been transmitted to fully utilize idle processors in the time interval [x, y]. Consider a node Pk, k ≤ l, that is not part of the reservation. Then Pk can process data in the entire interval [x, y]. For the node Pk, we denote the amount of workload that is processed in the time interval [Rs, y] as σ′k and its data transmission time as t′k. Among the workload σk for the processor Pk, the portion σ′k is processed in the interval [Rs, y], as demonstrated in Fig. 8. Then we have

σ′k = (y − Rs)/Cps   (10)


Fig. 7 Actual workload processing: typical Case 2

Fig. 8 Demonstration of σ_Δk, t_Δk and σ_[k,l]

t_Δk = σ_Δk · Cms    (11)

If a node Pk is part of the reservation in the closed interval [Rs, y] (like nodes Pi, ..., Pi+j in Fig. 8), we have σ_Δk = 0. We also define the following terms:

• σ_act_k: the workload dispatched to node Pk (Fig. 9);
• σ_est_[k,l]: the workload that can be processed by nodes {Pk, Pk+1, ..., Pl} in the time interval [x + B_{k−1}, Rs];
• σ_[k,l]: the workload that can be processed by nodes {Pk, Pk+1, ..., Pl} in the time interval [x + B_{k−1} + t_Δk, Rs];
• σ_sum_[k,l]: σ_[k,l] + σ_Δk.


Fig. 9 Demonstration of σ_act_k and σ_est_[k+1,l]

We have

σ_sum_[k,l] = σ_[k,l] + σ_Δk

= [Rs − (x + B_{k−1} + t_Δk)] / [(1 − β)/(1 − β^{l−k+1}) · (Cps + Cms)] + t_Δk/Cms

= [Rs − (x + B_{k−1})] / [(1 − β)/(1 − β^{l−k+1}) · (Cps + Cms)] − t_Δk / (Cms/(1 − β^{l−k+1})) + t_Δk/Cms

= σ_est_[k,l] − t_Δk / (Cms/(1 − β^{l−k+1})) + t_Δk/Cms

> σ_est_[k,l]    (12)

The last inequality holds because 0 < 1 − β^{l−k+1} < 1, so t_Δk / (Cms/(1 − β^{l−k+1})) < t_Δk/Cms. In addition, since σ_sum_[k,l] = σ_[k,l] + σ_Δk = σ_est_[k+1,l] + σ_act_k, we get

σ_est_[k+1,l] + σ_act_k > σ_est_[k,l]    (13)

⇒ σ_act_k > σ_est_[k,l] − σ_est_[k+1,l]    (14)

It follows that

σ_act = Σ_{k=1}^{l} σ_act_k > Σ_{k=1}^{l} (σ_est_[k,l] − σ_est_[k+1,l])

Thus

σ_act > σ_est_[1,l] − σ_est_[l+1,l]    (15)


Fig. 10 Single-stage Case 2: multiple same-available-time processor groups

Since we are considering the workload allocation to nodes {P1, P2, ..., Pl}, σ_est_[l+1,l] = 0. Substituting σ_est = σ_est_[1,l] into (15), we have

σ_act > σ_est    (16)

In the above, we have proved the algorithm correctness for Case 1 (Fig. 5), where l processors are available at the same time, by showing that the workload admitted to execute on the l processors (i.e., σ_est) is never more than what can be processed by the l processors (i.e., σ_act). □

Case 2: a group of processors are available at different times

Lemma 4.2 If the allocated processors are available at different times, then the task will meet its deadline.

Proof We prove Case 2 in a general scenario where there are several processor groups, each of which has the same available time (Fig. 10). That is, we prove

Σ_{i=1}^{m} σ_act_g[i] ≥ Σ_{i=1}^{m} σ_est_g[i]    (17)

The proof is based on induction. Base case: since σ_act ≥ σ_est holds for Case 1 (Fig. 5), we know the following inequality is true:

σ_act_g[1] ≥ σ_est_g[1]    (18)


Induction: we assume that (17) holds for all j ≤ k, that is,

Σ_{i=1}^{j} σ_act_g[i] ≥ Σ_{i=1}^{j} σ_est_g[i],  j = 1, 2, ..., k    (19)

Now we want to prove that (17) also holds for k + 1 groups. From (19) with j = k, we have

σ_Δg[k] = Σ_{i=1}^{k} σ_act_g[i] − Σ_{i=1}^{k} σ_est_g[i] ≥ 0    (20)

This workload difference σ_Δg[k] is caused by the admission control algorithm's pessimistic assumptions:

1. even though all processors are available, they cannot be used for regular tasks in the time interval [Rs, y]; and
2. a processor experiences the worst-case blocking time, i.e., when nodes {P1, P2, ..., Pk−1} are transmitting data for regular workload, node Pk is blocked and thus assumed unavailable.

On the other hand, the partition algorithm utilizes these idle resources and processes σ_Δg[k] extra workload on the first k groups of processors. Due to this workload increase, the blocking time encountered by group G_{k+1} increases and its processors' available time is reduced by t_Δg[k] = σ_Δg[k] × Cms.

We define the following terms:

• B_est_g[k]: the first k groups' data transmission time, i.e., group G_{k+1}'s blocking time estimated by the admission control algorithm;
• σ_g[k+1]: the workload that can be processed by group G_{k+1} in the time interval [x + B_est_g[k] + t_Δg[k], Rs] (see Fig. 10);
• l_g[k]: the number of processors in group G_k.

Since σ_Δg[k] = t_Δg[k]/Cms, we have

σ_Δg[k] ≥ t_Δg[k] / (Cms/(1 − β^{l_g[k+1]}))    (21)

i.e.,

σ_Δg[k] ≥ t_Δg[k] / [(1 − β)/(1 − β^{l_g[k+1]}) · (Cms + Cps)]    (22)

Again, since

σ_g[k+1] = [Rs − (x + B_est_g[k] + t_Δg[k])] / [(1 − β)/(1 − β^{l_g[k+1]}) · (Cms + Cps)]

we get

σ_Δg[k] + σ_g[k+1] ≥ [Rs − (x + B_est_g[k])] / [(1 − β)/(1 − β^{l_g[k+1]}) · (Cms + Cps)]

i.e.,


σ_Δg[k] + σ_g[k+1] ≥ σ_est_g[k+1]    (23)

Considering that group G_{k+1} has l_g[k+1] processors in the time interval [x + B_est_g[k] + t_Δg[k], y] and leveraging the result proved for Case 1 (Fig. 5), we get

σ_act_g[k+1] ≥ σ_g[k+1]    (24)

From (23) and (24), it follows that

σ_act_g[k+1] ≥ σ_est_g[k+1] − σ_Δg[k]    (25)

Substituting σ_Δg[k] by Σ_{i=1}^{k} σ_act_g[i] − Σ_{i=1}^{k} σ_est_g[i], it becomes

σ_act_g[k+1] ≥ σ_est_g[k+1] − (Σ_{i=1}^{k} σ_act_g[i] − Σ_{i=1}^{k} σ_est_g[i])    (26)

i.e.,

Σ_{i=1}^{k+1} σ_act_g[i] ≥ Σ_{i=1}^{k+1} σ_est_g[i]    (27)

This completes the induction-based proof of Σ_{i=1}^{m} σ_act_g[i] ≥ Σ_{i=1}^{m} σ_est_g[i] for Case 2 (see Fig. 10). We conclude that σ_act ≥ σ_est holds for the single-stage Case 2. □

Combining Lemma 4.1 and Lemma 4.2, we have the following lemma.

Lemma 4.3 In a stage, tasks will meet their deadlines.

Case 3: multi-stage scenario Next, we present the proof for the multi-stage case and show that σ_act ≥ σ_est holds for the multi-stage scenario, i.e., Case 3 shown in Fig. 11.

Lemma 4.4 The multi-stage dispatch algorithm's actual dispatched workload is no less than the workload estimated by the admission controller.

Proof According to the result proved for the single-stage case (Fig. 10), we know that for each stage k in Fig. 11 we have σ_act_s[k] ≥ σ_est_s[k]. Therefore

Σ_{k=1}^{M} σ_act_s[k] ≥ Σ_{k=1}^{M} σ_est_s[k]    (28)

i.e.,

σ_act ≥ Σ_{k=1}^{M} σ_est_s[k]    (29)

We now prove that σ_est = Σ_{k=1}^{M} σ_est_s[k] holds; if this equality is true, we can conclude σ_act ≥ σ_est.


Fig. 11 Case 3: multi-stage scenario

According to the admission control algorithm (Algorithm 3), σ_est = Σ_{i=1}^{n} σ_est[i], where σ_est[i] is the estimated workload for node Pi in the time interval [x, y]. We use σ_est[i, k] to represent the estimated workload for node Pi in stage k and prove by induction that ∀j ∈ {1, 2, ..., n}, Σ_{i=1}^{j} σ_est[i] = Σ_{i=1}^{j} Σ_{k=1}^{M} σ_est[i, k].

We define the following terms:

• I_s[k]: the reservation I/O in stage k;
• C[i, k]: the reservation's computation on node Pi in stage k;
• μ[i, k]: node Pi's available time in stage k;
• L: y − x.

According to Algorithm 3, we have σ_est[i] = μ_i/(Cms + Cps) and

μ_i = y − x − ResvIO − ResvCP_i − B_{i−1}
    = L − Σ_{k=1}^{M} (I_s[k] + C[i, k]) − B_{i−1}    (30)

Base case: since B_0 = 0, we get

μ_1 = L − Σ_{k=1}^{M} (I_s[k] + C[1, k])

⇒ μ_1 = Σ_{k=1}^{M} μ[1, k]

⇒ μ_1/(Cms + Cps) = Σ_{k=1}^{M} μ[1, k]/(Cms + Cps)

i.e.,

σ_est[1] = Σ_{k=1}^{M} σ_est[1, k]    (31)


Induction: we assume

Σ_{i=1}^{h} σ_est[i] = Σ_{i=1}^{h} Σ_{k=1}^{M} σ_est[i, k]    (32)

We prove that Σ_{i=1}^{h+1} σ_est[i] = Σ_{i=1}^{h+1} Σ_{k=1}^{M} σ_est[i, k] also holds. Algorithm 3 assumes that B_h = Σ_{i=1}^{h} σ_est[i] × Cms. According to the assumption made in Step 2, we have

B_h = Σ_{i=1}^{h} Σ_{k=1}^{M} σ_est[i, k] × Cms    (33)

Substituting the above equation into (30), it becomes

μ_{h+1} = L − Σ_{k=1}^{M} (I_s[k] + C[h + 1, k] + Σ_{i=1}^{h} σ_est[i, k] × Cms)    (34)

⇒ μ_{h+1} = Σ_{k=1}^{M} μ[h + 1, k]

⇒ μ_{h+1}/(Cms + Cps) = Σ_{k=1}^{M} μ[h + 1, k]/(Cms + Cps)

i.e.,

σ_est[h + 1] = Σ_{k=1}^{M} σ_est[h + 1, k]    (35)

Adding (32) and (35), we have

Σ_{i=1}^{h+1} σ_est[i] = Σ_{i=1}^{h+1} Σ_{k=1}^{M} σ_est[i, k]    (36)

which completes the induction-based proof of ∀j ∈ {1, 2, ..., n}, Σ_{i=1}^{j} σ_est[i] = Σ_{i=1}^{j} Σ_{k=1}^{M} σ_est[i, k]. Therefore,

Σ_{i=1}^{n} σ_est[i] = Σ_{i=1}^{n} Σ_{k=1}^{M} σ_est[i, k]

i.e.,

σ_est = Σ_{i=1}^{n} Σ_{k=1}^{M} σ_est[i, k]

⇒ σ_est = Σ_{k=1}^{M} σ_est_s[k]    (37)

With (29) and (37), we conclude that σ_act ≥ σ_est holds for Case 3 and the multi-stage real-time scheduling algorithm is correct.


Theorem 4.5 The proposed multi-stage algorithm is correct.

Proof The proof follows from Lemmas 4.3 and 4.4. □

5 Performance evaluation

In the previous section, we presented a real-time scheduling algorithm that supports advance reservations. In this section, we experimentally evaluate the performance of the algorithm.

5.1 Simulation configurations

We develop a discrete simulator to simulate a wide range of clusters. Three parameters, N, Cms and Cps, are specified for every cluster.

For a set of regular tasks Ti = (Ai, σi, Di): Ai, the task arrival time, is specified by assuming that the interarrival times follow an exponential distribution with a mean of 1/λ; task data sizes σi are assumed to be normally distributed with the mean and the standard deviation equal to Avgσ; task relative deadlines Di are assumed to be uniformly distributed in the range [AvgD/2, 3AvgD/2], where AvgD is the mean relative deadline. To specify AvgD, we use the term DCRatio defined in Lin et al. (2007b). It is defined as the ratio of the mean deadline to the mean minimum execution time (cost), that is AvgD/E(Avgσ, N), where E(Avgσ, N) is the execution time assuming the task has an average data size Avgσ and is allocated to run on N fully available nodes simultaneously (Lin et al. 2007b). Given a DCRatio, the cluster size N and the average data size Avgσ, AvgD is implicitly specified as DCRatio × E(Avgσ, N). Thus, task relative deadlines are related to the average task execution time. In addition, a task's relative deadline Di is chosen to be larger than its minimum execution time E(σi, N). In summary, we specify the following parameters for a simulation: N, Cms, Cps, 1/λ, Avgσ, and DCRatio.
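The task generation described above can be sketched as follows (a simplified illustration; the function name and the truncation of negative Gaussian samples are our own assumptions, not from the paper):

```python
import random

def generate_tasks(n_tasks, lam, avg_sigma, avg_d, seed=0):
    """Generate (arrival time, data size, relative deadline) triples:
    exponential interarrivals (mean 1/lam), normal data sizes
    (mean = std = avg_sigma), uniform deadlines in [avg_d/2, 3*avg_d/2]."""
    rng = random.Random(seed)
    tasks, t = [], 0.0
    for _ in range(n_tasks):
        t += rng.expovariate(lam)                          # arrival time
        sigma = max(1.0, rng.gauss(avg_sigma, avg_sigma))  # truncate nonpositive sizes
        d = rng.uniform(avg_d / 2, 3 * avg_d / 2)          # relative deadline
        tasks.append((t, sigma, d))
    return tasks
```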

To analyze the cluster load for a simulation, we use the metric SystemLoad (Lee et al. 2003). It is defined as SystemLoad = E(Avgσ, 1) × λ/N, which is the same as SystemLoad = (TotalTaskNumber × E(Avgσ, 1))/(TotalSimulationTime × N). For a simulation, we can specify SystemLoad instead of the average interarrival time 1/λ. Configuring (N, Cms, Cps, SystemLoad, Avgσ, DCRatio) is equivalent to specifying (N, Cms, Cps, 1/λ, Avgσ, DCRatio), because 1/λ = E(Avgσ, 1)/(SystemLoad × N).

To generate reservations, some regular tasks are selected from the aforementioned workload and converted to reservations. To study the algorithm's performance under varied conditions, different percentages of the workload are converted to reservations. Algorithm 5 describes the procedure of converting a regular task to a reservation. To ensure that the newly generated workload, mixed with reservations and regular tasks, leads to the same SystemLoad as the original workload, the reservation start time is made equal to the regular task's arrival time. In addition, the number of nodes reserved equals the minimum number of nodes n_min required to finish the regular task before its deadline. It follows that the reservation length is equal to the regular task's execution time E(σ, n_min), and the reservation's IOratio is defined according to the regular task's I/O ratio.
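The conversion from SystemLoad to the mean interarrival time 1/λ is a one-line computation; a worked example of our own, using the baseline values from this section (N = 256, Cms = 1, Cps = 1000, Avgσ = 2000):

```python
# E(Avgsigma, 1): execution time on a single node; with l = 1,
# Eq. (1) reduces to E = sigma * (Cms + Cps).
cms, cps, avg_sigma, n_nodes = 1.0, 1000.0, 2000.0, 256
e_single = avg_sigma * (cms + cps)

def interarrival_from_load(system_load):
    """1/lambda = E(Avgsigma, 1) / (SystemLoad * N)."""
    return e_single / (system_load * n_nodes)
```

Halving SystemLoad doubles the mean interarrival time, as expected for a Poisson arrival process.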


Algorithm 5 RegTask2Resv(T(A, σ, D), Δ)
1: // Input:
2: // Regular task T(A, σ, D) and
3: // advance factor Δ
4: // Output:
5: // Reservation R(Ra, Rs, n, Re, IOratio)
6: // Make ResvStartTime = RegTaskArrivalTime
7: Rs = A
8: // Compute ResvLength (Re − Rs) based
9: // on RegTaskExecutionTime E(σ, nmin)
10: γ = 1 − σCms/D
11: β = Cps/(Cps + Cms)
12: nmin = ⌈ln γ / ln β⌉
13: E = (1 − β)/(1 − β^nmin) · σ · (Cms + Cps)
14: // ResvLength = RegTaskExecutionTime E(σ, nmin)
15: Re = Rs + E
16: // Nodes reserved = minimum nodes required by
17: // RegTask at A to complete before its deadline D
18: n = nmin
19: // Make Resv IOratio =
20: // RegTaskTransmissionTime/RegTaskExecutionTime
21: IOratio = σCms/E
22: // The request for Resv arrives Δ time units in advance
23: Ra = Rs − Δ
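A runnable sketch of Algorithm 5 in Python (the function and field names are our own; the ceiling in line 12 follows from requiring E(σ, n) ≤ D):

```python
import math

def reg_task_to_resv(a, sigma, d, delta, cms, cps):
    """Convert a regular task T(A, sigma, D) into a reservation
    R(Ra, Rs, n, Re, io_ratio), mirroring Algorithm 5."""
    rs = a                                  # ResvStartTime = arrival time
    gamma = 1 - sigma * cms / d             # requires sigma*Cms < D (task feasible)
    beta = cps / (cps + cms)
    n_min = math.ceil(math.log(gamma) / math.log(beta))
    e = (1 - beta) / (1 - beta**n_min) * sigma * (cms + cps)
    re = rs + e                             # ResvLength = E(sigma, n_min)
    io_ratio = sigma * cms / e              # transmission time / execution time
    ra = rs - delta                         # request arrives delta time units early
    return ra, rs, n_min, re, io_ratio
```

By construction, the reservation ends within the original task's deadline: Re − Rs = E(σ, n_min) ≤ D.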

Fig. 12 An example of mixed workload generation

Advance factor for reservation Since a reservation is often made in advance, we use Δ, called the advance factor, to specify the time difference between the arrival of the reservation request (Ra) and the reservation start time (Rs), i.e., Δ = Rs − Ra. Figure 12 shows an example where task T9 is selected and converted to reservation R9, which is then assumed to arrive before task T4. Therefore, for the new workload, tasks T1, T2, T3 and reservation R9 will be scheduled first, followed by tasks T4, T5, ..., T8 and T10. The earlier a reservation is made, the greater its chance of being accepted. To study an advance reservation's impact on system performance, different advance factors are simulated.

The simulation in this paper is divided into several experiments. Each experiment consists of several runs, where each run is further divided into ten tests. The parameters N, Cms, Cps remain constant over all runs. However, SystemLoad, Avgσ, DCRatio, Δ and the reservation percentage vary from run to run. For each test, different random numbers are generated for task arrival times Ai, data sizes σi and deadlines Di. For all figures in this section, a point on a curve corresponds to the average performance of the ten tests in a specific run of an experiment. For each simulation, the TotalSimulationTime is 10,000,000 time units, which is sufficiently long.

Baseline configuration For our basic simulation model we chose the following parameters: number of processing nodes in the cluster N = 256; unit data transmission time Cms = 1; unit data processing time Cps = 1000; SystemLoad varying in the range {0.1, 0.2, ..., 1.0}; average data size Avgσ = 2000; and the ratio of the average deadline to the average execution time DCRatio = 2. Our simulation has a three-fold objective. First, we verify the correctness of the proposed algorithm. Second, we study the effects of the reservation percentage. Third, we investigate the effects of the advance factor Δ.
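Under this baseline, AvgD follows from DCRatio × E(Avgσ, N); a worked computation of our own, using Eq. (1):

```python
n_cluster, cms, cps, avg_sigma, dc_ratio = 256, 1.0, 1000.0, 2000.0, 2.0
beta = cps / (cms + cps)

# E(Avgsigma, N): execution time of an average-size task on all N nodes, Eq. (1)
e_full = (1 - beta) / (1 - beta**n_cluster) * avg_sigma * (cms + cps)
avg_d = dc_ratio * e_full   # mean relative deadline; deadlines are then
                            # drawn uniformly from [avg_d/2, 3*avg_d/2]
```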

5.2 Simulation results

To validate that the proposed algorithm works correctly, we check all simulation results to verify that the real-time requirements of every accepted task are satisfied. There are enough resources to guarantee that reservations start and finish at the specified times. Once accepted, tasks are successfully processed by their deadlines.

Effects of reservation percentage We conducted experiments with the baseline configuration. To study the effects of the reservation percentage, 0%, 10%, 30%, 50%, 80% and 100% of the workload were set to be reservations, respectively.

In the first experiment, we set the advance factor Δ = 0. That is, all reservations request to be started immediately. Figures 13a and 13b show the simulation results. We can see that for a workload of all regular tasks (i.e., 0% reservation), the scheduler rejects the fewest tasks (TRR) and leads to the highest system utilization (UTIL). As the reservation percentage of the workload increases from 0% to 100%, the TRR increases and the UTIL decreases. These results follow the common intuition that making reservations can reduce system performance. Reservations must start at the requested time and execute continuously until completion. Reservation tasks offer little flexibility to the scheduler. In contrast, regular tasks are flexible. That is, they can start at any time as long as they finish before their deadlines. Some parallel tasks are also flexible in their required number of nodes, allowing the scheduler to dictate the allocated amount of resources. The scheduler can start them earlier with fewer nodes or later with more nodes. In particular, the arbitrarily divisible tasks considered in this paper give the scheduler the maximum flexibility. Such tasks can be divided into subtasks to utilize any available processing time in the cluster. These factors explain why the system performs best with no reservations in the workload.

In the second experiment, we instead let the advance factor equal the average task interarrival time: Δ = 1/λ. That is, all reservations are made 1/λ time units in advance of their start times. Figures 13c and 13d show the simulation results. From Figure 13c, we can see that the scheduler achieves similar TRRs for workloads with 0%, 10%, 30% and 50% reservations, while the TRRs for the workloads with 80% and 100% reservations increase significantly. Figure 13d shows that the UTIL obtained with


Fig. 13 Effects of reservation percentage

a workload of no reservations is higher than those obtained with mixed workloads. However, the utilization differences between workloads of 0%, 10%, 30% and 50% reservations are quite small. These results are due to the fact that reservations are made plenty of time in advance. Since an advance reservation requests some resources in the future, the earlier the reservation is made, the more likely the required resources have not yet been occupied. Therefore, advance reservations are more likely to be accepted. After cluster resources are booked by reservations, fewer resources are left to serve regular tasks arriving in the future. As a result, more regular tasks are rejected. However, thanks to the flexibility in scheduling regular tasks, many of them can still be accepted. This explains why the overall system performance (i.e., TRR and UTIL) does not deteriorate as the percentage of advance reservations increases from 0% to 50%.

Next, to understand how much earlier a reservation should be made, we investigate the effects of the advance factor Δ.

Effects of advance factor We again conducted experiments with the baseline configuration, where either 30% or 50% of the workload was set aside for reservations. To study the effects of the advance factor, we set Δ = 0, 1/λ, 2/λ, and 10/λ, respectively. Figures 14a and 14b show the simulation results.

From both figures, we observe that when Δ increases, the TRR decreases. The improvement is significant until Δ = 2/λ, and the TRRs with advance factors Δ = 2/λ


Fig. 14 Effects of advance factor

Fig. 15 Advance factor effect

and Δ = 10/λ are similar. Since the UTIL curves have the same trend, we omit them to save space.

In the following, we use an example to illustrate how the advance factor affects task acceptance. Figure 15 shows a regular task Ti arriving at Ai and a reservation Ri+1 requesting to start at Rs. In this simple example, we assume there is only one processing node. If Ri+1 arrives after Ai, it is rejected because the node is allocated to Ti. If Ri+1 arrives before Ti, Ri+1 is booked on the node before Ti arrives. Upon Ti's arrival, the scheduler may still accept Ti and let it utilize the node before and after Ri+1, while still finishing before its deadline. In general, since a reservation does not affect a regular task as much as a regular task affects a reservation, it is beneficial to make reservations in advance so that the scheduler can consider them before all competing regular tasks.

On average, when the advance reservation factor is equal to or greater than the average task interarrival time (i.e., Δ ≥ 1/λ), the competition for resources between reservations and regular tasks is less constraining and results in improved performance. When a reservation R arrives, the scheduler decides if it is feasible to schedule R without compromising the guarantees for previously admitted tasks. Therefore, R only competes with reservations and regular tasks that have already been committed to by the system. Moreover, among admitted regular tasks, only those whose deadlines are later than R's start time actually compete with R for resources. Consequently, if the advance reservation factor is at least as large as the average task deadline (i.e., Δ ≥ AvgD), the competition for resources between reservations and regular tasks is almost negligible. For the simulated workloads, since SystemLoad = E(Avgσ, N) × λ and AvgD = DCRatio × E(Avgσ, N), we have


AvgD = DCRatio × SystemLoad × 1/λ ≤ 2/λ. This explains why we observe significant performance improvements as Δ increases until Δ = 2/λ, and why the curves for workloads with Δ ≥ 2/λ are close to each other with less performance improvement. If a reservation is rejected, it is most likely due to conflicts with other advance reservations. As reservations compete for resources with each other, the task accept ratio decreases significantly, which explains why, as the reservation percentage increases beyond 50%, system performance degrades drastically (Figs. 13c and 13d).
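The bound AvgD ≤ 2/λ can be checked arithmetically for the simulated configurations (DCRatio = 2, SystemLoad ≤ 1; the rate λ below is an arbitrary positive value of our choosing):

```python
dc_ratio = 2.0
lam = 0.01  # any positive arrival rate; illustrative assumption

for i in range(1, 11):
    system_load = 0.1 * i                          # 0.1, 0.2, ..., 1.0
    avg_d = dc_ratio * system_load * (1 / lam)     # AvgD = DCRatio * SystemLoad * (1/lambda)
    assert avg_d <= 2 / lam                        # holds whenever SystemLoad <= 1
```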

6 Related work

Real-time scheduling on distributed or multiprocessor systems has been extensively studied (Amin et al. 2004; Ammar and Alhamdan 2002; Eltayeb et al. 2004; Anderson and Srinivasan 2000; Funk and Baruah 2005; Isovic and Fohler 2000; Qin and Jiang 2001; Zhang 2002). The scheduling models investigated most often (e.g., Anderson and Srinivasan 2000; Funk and Baruah 2005; Isovic and Fohler 2000) assume periodic or aperiodic sequential jobs that must be allocated to a single resource and executed by their deadlines. With the evolution of cluster computing, researchers have begun to investigate real-time scheduling of parallel applications on a cluster (Amin et al. 2004; Ammar and Alhamdan 2002; Eltayeb et al. 2004; Qin and Jiang 2001; Zhang 2002). However, most of these studies assume the existence of some form of task graph to describe communication and precedence relations between computational units called subtasks (i.e., nodes in a task graph).

Due to the increasing importance of arbitrarily divisible applications (Robertazzi 2003), a few researchers (He et al. 2004; Lee et al. 2003; Chuprat and Baruah 2008; Chuprat et al. 2008), including ourselves (Lin et al. 2007a, 2007b, 2007c), have investigated real-time divisible load scheduling. In Lin et al. (2007b), we developed several scheduling algorithms for real-time divisible loads. Following those algorithms, a task must wait until a sufficient number of processors become available. This could waste processing power, as some processors sit idle while the system waits for enough processors to become available. This system inefficiency is referred to as Inserted Idle Times (IITs) (Lin et al. 2007c). To reduce or completely eliminate IITs, several algorithms have been developed (Chuprat and Baruah 2008; Chuprat et al. 2008; Lin et al. 2007c), which enable a task to utilize processors at different processor available times. These algorithms lead to better cluster utilization, with increased scheduling overheads. The time complexity of the algorithms proposed in Chuprat and Baruah (2008), Chuprat et al. (2008) is O(nN log N), and the algorithm in Lin et al. (2007c) has a time complexity of O(nN³). In this paper, we develop a new algorithm that can also support advance reservations. Similar to the algorithms in Chuprat and Baruah (2008), Chuprat et al. (2008), Lin et al. (2007c), our new algorithm eliminates IITs. In addition, although our algorithm can schedule not only regular divisible loads but also advance reservations, its time complexity of O(nN log N + n³) does not increase relative to those algorithms.

To offer QoS support, researchers have investigated resource reservation for networks (Degermark et al. 1997; Ferrari et al. 1997; Wolf and Steinmetz 1997), CPUs (Chu and Nahrstedt 1999; Smith et al. 2000; Cucinotta et al. 2009), and co-reservation


for resources of different types (Czajkowski and Foster 2002; Liu et al. 2002). The most well-known architectures that support resource reservations include GRAM (Czajkowski et al. 1998), GARA (Foster et al. 2000, 2004) and SNAP (Czajkowski and Foster 2002). These research efforts mainly focus on resource reservation protocols and QoS support architectures. Our work, on the other hand, focuses on scheduling mechanisms to meet specific QoS objectives, which could be integrated into architectures like GARA (Foster et al. 2000, 2004) to satisfy Grid users' QoS requirements.

Advance reservation and resource co-allocation in Grids (Foster et al. 1999, 2004; Huang and Qiu 2005; Margo et al. 2007; Smith et al. 2000) assume the support of advance reservations at local cluster sites. Cluster schedulers like PBS PRO, Maui and LSF (MacLaren 2003) support advance reservations. However, they are not widely applied in practice due to under-utilization concerns. In MacLaren (2003), Cao and Zimmermann (2004), backfilling is used to improve system utilization. However, the results still exhibit a significant reduction in performance and resource utilization when advance reservations are supported. Furthermore, these schedulers do not provide real-time guarantees to regular tasks. To improve system utilization, a probabilistic approach was developed (Cucinotta et al. 2009) to support advance reservations for distributed real-time workflows, enabling a resource provider to make proper trade-offs between resource utilization and real-time guarantees.

This paper differs from the previous work in that it investigates real-time divisible load scheduling with advance reservations. Considering reservation blocks on both computing and communication resources, we propose a multi-stage scheduling algorithm for real-time divisible loads.

7 Conclusion

Providing QoS and performance guarantees to arbitrarily divisible loads has become a significant problem. Despite the role of advance reservations in supporting Grid QoS, the technique is not widely applied in clusters due to performance concerns. This paper investigates the challenging problem of real-time divisible load scheduling with advance reservations in a cluster. To address the under-utilization concerns, we thoroughly study the effects of advance reservations. A multi-stage real-time scheduling algorithm is proposed. Simulation results show that: (1) our algorithm works correctly and provides real-time guarantees to accepted tasks; and (2) proper advance reservations can avoid system performance degradation.

References

Amin A, Ammar R, Dessouly AE (2004) Scheduling real time parallel structure on cluster computing with possible processor failures. In: Proc of 9th IEEE international symposium on computers and communications, July 2004, pp 62–67

Ammar RA, Alhamdan A (2002) Scheduling real time parallel structure on cluster computing. In: Proc of 7th IEEE international symposium on computers and communications, Taormina, Italy, July 2002, pp 69–74


Anderson J, Srinivasan A (2000) Pfair scheduling: beyond periodic task systems. In: Proc of the 7th international conference on real-time systems and applications, Cheju Island, South Korea, December 2000, pp 297–306

ATLAS (A Toroidal LHC Apparatus) experiment, CERN (European lab for particle physics) (2008) ATLAS web page. http://atlas.ch/

Cao J, Zimmermann F (2004) Queue scheduling and advance reservations with COSY. In: Parallel and distributed processing symposium, April 2004, p 63a

Cheng Y, Robertazzi T (1990) Distributed computation for a tree network with communication delays. IEEE Trans Aerosp Electron Syst 26(3):511–516

Chu H-H, Nahrstedt K (1999) CPU service classes for multimedia applications. In: ICMCS, vol 1, pp 296–301

Chuprat S, Baruah S (2008) Scheduling divisible real-time loads on clusters with varying processor start times. In: 14th IEEE international conference on embedded and real-time computing systems and applications (RTCSA '08), Aug 2008, pp 15–24

Chuprat S, Salleh S, Baruah S (2008) Evaluation of a linear programming approach towards scheduling divisible real-time loads. In: International symposium on information technology, Aug 2008, pp 1–8

Compact Muon Solenoid (CMS) experiment for the Large Hadron Collider at CERN (European lab for particle physics) (2008) CMS web page. http://cmsinfo.cern.ch/Welcome.html/

Cucinotta T, Konstanteli K, Varvarigou T (2009) Advance reservations for distributed real-time workflows with probabilistic service guarantees. In: IEEE international conference on service-oriented computing and applications (SOCA), December 2009

Czajkowski K, Foster I (2002) SNAP: a protocol for negotiating service level agreements and coordinating resource management in distributed systems. In: 8th workshop on job scheduling strategies for parallel processing, pp 153–183

Czajkowski K, Foster IT, Karonis NT, Kesselman C, Martin S, Smith W, Tuecke S (1998) A resource management architecture for metacomputing systems. In: IPPS/SPDP '98: proceedings of the workshop on job scheduling strategies for parallel processing, London, UK. Springer, Berlin, pp 62–82

Degermark M, Kohler T, Pink S, Schelen O (1997) Advance reservations for predictive service in the Internet. Multimed Syst 5(3):177–186

Dertouzos ML, Mok AK (1989) Multiprocessor online scheduling of hard-real-time tasks. IEEE Trans Softw Eng 15(12):1497–1506

Eltayeb M, Dogan A, Özgüner F (2004) A data scheduling algorithm for autonomous distributed real-time applications in grid computing. In: Proc of 33rd international conference on parallel processing, Montreal, Canada, August 2004, pp 388–395

Ferrari D, Gupta A, Ventre G (1997) Distributed advance reservation of real-time connections. Multimed Syst 5(3):187–198

Foster I, Kesselman C, Lee C, Lindell R, Nahrstedt K, Roy A (1999) A distributed resource management architecture that supports advance reservations and co-allocation. In: Proceedings of the international workshop on quality of service, pp 27–36

Foster I, Roy A, Sander V (2000) A quality of service architecture that combines resource reservation and application adaptation. doi:10.1109/IWQOS.2000.847954

Foster IT, Fidler M, Roy A, Sander V, Winkler L (2004) End-to-end quality of service for high-end applications. Comput Commun 27(14):1375–1388

Funk S, Baruah S (2005) Task assignment on uniform heterogeneous multiprocessors. In: Proc of 17th Euromicro conference on real-time systems, July 2005, pp 219–226

He L, Jarvis SA, Spooner DP, Chen X, Nudd GR (2004) Hybrid performance-oriented scheduling of moldable jobs with QoS demands in multiclusters and grids. In: Proc of the 3rd international conference on grid and cooperative computing, Wuhan, China, October 2004, pp 217–224

Huang Z, Qiu Y (2005) A bidding strategy for advance resource reservation in sequential ascending auctions. In: Autonomous decentralized systems, 2005. ISADS, April 2005, pp 284–288

Isovic D, Fohler G (2000) Efficient scheduling of sporadic, aperiodic, and periodic tasks with complex constraints. In: Proc of 21st IEEE real-time systems symposium, Orlando, FL, November 2000, pp 89–98

Lee WY, Hong SJ, Kim J (2003) On-line scheduling of scalable real-time tasks on multiprocessor systems. J Parallel Distrib Comput 63(12):1315–1324

Lin X, Lu Y, Deogun J, Goddard S (2007a) Enhanced real-time divisible load scheduling with different processor available times. In: 14th international conference on high performance computing, December 2007


Lin X, Lu Y, Deogun J, Goddard S (2007b) Real-time divisible load scheduling for cluster computing. In: 13th IEEE real-time and embedded technology and application symposium, Bellevue, WA, April 2007, pp 303–314

Lin X, Lu Y, Deogun J, Goddard S (2007c) Real-time divisible load scheduling with different processor available times. In: International conference on parallel processing, September 2007, p 20

Liu C, Yang L, Foster I, Angulo D (2002) Design and evaluation of a resource selection framework for grid applications. In: Proceedings of the 11th IEEE symposium on high-performance distributed computing, July 2002

MacLaren J (2003) Advance reservations: State of the art. http://www.fz-juelich.de/zam/RD/coop/ggf/graap/graap-wg.html

Mamat A, Lu Y, Deogun J, Goddard S (2008) Real-time divisible load scheduling with advance reservations. In: 20th Euromicro conference on real-time systems (ECRTS), July 2008

Manimaran G, Murthy CSR (1998) An efficient dynamic scheduling algorithm for multiprocessor real-time systems. IEEE Trans Parallel Distrib Syst 9(3):312–319

Margo MW, Yoshimoto K, Kovatch P, Andrews P (2007) Impact of reservations on production job scheduling. In: 13th workshop on job scheduling strategies for parallel processing, June 2007

Qin X, Jiang H (2001) Dynamic, reliability-driven scheduling of parallel real-time jobs in heterogeneous systems. In: Proc. of 30th international conference on parallel processing, Valencia, Spain, September 2001, pp 113–122

Ramamritham K, Stankovic JA, Shiah P-F (1990) Efficient scheduling algorithms for real-time multiprocessor systems. IEEE Trans Parallel Distrib Syst 1(2):184–194

Robertazzi TG (2003) Ten reasons to use divisible load theory. Computer 36(5):63–68

Siddiqui M, Villazón A, Fahringer T (2006) Grid allocation and reservation—grid capacity planning with negotiation-based advance reservation for optimized QoS. In: SC '06: Proceedings of the 2006 ACM/IEEE conference on supercomputing, New York, NY, USA. ACM, New York, p 103

Smith W, Foster I, Taylor V (2000) Scheduling with advanced reservations. In: 14th international parallel and distributed processing symposium (IPDPS), May 2000, pp 127–132

Sotomayor B, Keahey K, Foster I (2006) Overhead matters: a model for virtual resource management. In: VTDC '06: Proceedings of the 2nd international workshop on virtualization technology in distributed computing, Washington, DC, USA. IEEE Comput Soc, Los Alamitos, p 5

Swanson D (2005) Personal communication. Director, UNL Research Computing Facility (RCF) and UNL CMS Tier-2 Site, August 2005

Veeravalli B, Ghose D, Robertazzi TG (2003) Divisible load theory: a new paradigm for load scheduling in distributed systems. Clust Comput 6(1):7–17

Wolf LC, Steinmetz R (1997) Concepts for resource reservation in advance. Multimed Tools Appl 4(3):255–278

Zhang L (2002) Scheduling algorithm for real-time applications in grid environment. In: Proc. of IEEE international conference on systems, man and cybernetics, Hammamet, Tunisia, October 2002, vol 5, p 6

Anwar Mamat received his B.E. and M.S. degrees from Chang'an University, Xian, China, in 1999 and 2002, respectively. He received the Ph.D. degree from the University of Nebraska-Lincoln in 2011. His research interests are grid and cluster computing, high-performance computing, and real-time systems.


Ying Lu received the B.S. degree in Computer Science in 1996 from the Southwest Jiaotong University, Chengdu, China, and the M.C.S. and Ph.D. degrees in Computer Science in 2001 and 2005, respectively, from the University of Virginia. She joined the Department of Computer Science and Engineering, University of Nebraska-Lincoln in 2005 as an Assistant Professor. In 2011, she was promoted to Associate Professor. Her research interests include real-time systems, autonomic computing, cluster and grid computing, and web systems. She has done significant work on feedback control of computing systems.

Jitender Deogun received his B.S. (1966) with Honors from Punjab University, Chandigarh, India, and M.S. (1974) and Ph.D. (1979) degrees from the University of Illinois, Urbana, IL. He joined the University of Nebraska in 1981 as an Assistant Professor of Computer Science. He was promoted to Associate Professor in 1984 and Full Professor in 1988. His research interests include high-performance optical switch architectures, routing and wavelength assignment, real-time scheduling, network coding, and structural and algorithmic graph theory.

Steve Goddard is a John E. Olsson Professor of Engineering and Chair of the Department of Computer Science & Engineering at the University of Nebraska–Lincoln. He received the B.A. degree in computer science and mathematics from the University of Minnesota (1985). He received the M.S. and Ph.D. degrees in computer science from the University of North Carolina at Chapel Hill (1995, 1998). His research interests are embedded, real-time and distributed systems with an emphasis in real-time, rate-based scheduling theory.