
STANFORD UNIVERSITY

COMPUTER SCIENCE TECHNICAL REPORT

CSTR 2014-05 11/25/2014

Tarcil: High Quality and Low Latency Scheduling in Large, Shared Clusters

Christina DELIMITROU Daniel SANCHEZ Christos KOZYRAKIS

November 25, 2014


Tarcil: High Quality and Low Latency Scheduling in Large, Shared Clusters

Abstract

Scheduling diverse applications in large, shared clusters is particularly challenging. Recent research on cluster management focuses either on scheduling speed, using sampling techniques to quickly assign tasks to resources, or on scheduling quality, using centralized algorithms that examine the cluster state to find the most suitable resources that improve both task performance and cluster utilization.

We present Tarcil, a distributed scheduler that targets both scheduling speed and quality, making it appropriate for large, highly-loaded clusters running both short and long jobs. Tarcil uses an analytically derived sampling framework that dynamically adjusts the sample size based on load and provides guarantees on the quality of scheduling decisions with respect to resource heterogeneity and workload interference. It also implements admission control when sampling is unlikely to find suitable resources for a task. We evaluate Tarcil on clusters with hundreds of servers on EC2. For highly-loaded clusters running short jobs, Tarcil improves task execution time by 41% over a distributed, sampling-based scheduler. For more general workload scenarios, Tarcil increases the fraction of tasks that achieve near-ideal performance by 4x and 2x compared to sampling-based and centralized scheduling, respectively.

1. Introduction

An increasing and diverse set of applications is now hosted in private and public datacenters [4, 10, 18]. The large size of these clusters (up to tens of thousands of servers) and the high arrival rate of jobs (up to millions of tasks per second) make cluster scheduling quite challenging. The cluster scheduler must determine which hardware resources, e.g., specific servers and cores, should be used by each job. Ideally, scheduling decisions lead to three desirable properties. First, each workload receives resources that enable it to achieve predictably high performance. Second, jobs are packed tightly on available servers, achieving high cluster utilization. Third, decisions introduce minimal scheduling overheads, allowing the scheduler to handle large clusters and high job arrival rates.

Recent research on cluster scheduling can be examined along two dimensions: scheduling concurrency (throughput) and scheduling speed (latency).

With respect to scheduling concurrency, there are two groups of work. In the first, scheduling is serialized, with a centralized scheduler making all decisions [14, 20]. In the second group, decisions are parallelized through two-level, distributed, or shared-state designs. Two-level schedulers, such as Mesos and YARN, use a centralized coordinator to divide resources between frameworks like Hadoop and MPI [19, 34]. Each framework uses its own scheduler to assign resources to incoming tasks. Since neither the coordinator nor the framework schedulers have a complete view of the cluster state and all task characteristics, scheduling is suboptimal [31]. Shared-state schedulers like Omega [31] allow multiple schedulers to concurrently access the whole cluster state using atomic transactions. Finally, Sparrow uses multiple concurrent, stateless schedulers to sample and allocate resources [28].

Figure 1: Distribution of job performance on a 200-server cluster under (a) a concurrent, sampling-based scheduler [28], (b) a centralized greedy scheduler [13], and (c) Tarcil, for three scenarios: 1) short, homogeneous Spark [40] tasks (100ms average duration), 2) Spark tasks of medium duration (1s–10s), and 3) long Hadoop analytics tasks (10s–10min). The ideal performance (100%) assumes no scheduling overheads and no performance degradation due to interference. The cluster utilization is 80%.

With respect to the speed at which scheduling decisions happen, there are again two groups of work. The first group examines most of (or all) the cluster state to determine the most suitable resources for incoming tasks, in a way that addresses the performance impact of hardware heterogeneity and interference on shared resources [13, 17, 24, 26, 39, 42]. For instance, Quasar [14] uses classification to determine the resource preferences of incoming jobs. Then, it uses a greedy scheduler to search the cluster state for resources that meet the application's demands on servers with minimal contention. Similarly, Quincy [20] formulates scheduling as a cost optimization problem that accounts for preferences with respect to locality, fairness, and starvation-freedom. Such schedulers make high quality decisions that lead to high application performance and high cluster utilization. Unfortunately, they need to greedily inspect the cluster state on every scheduling event. Their decision overhead can be prohibitively high for large clusters, and in particular for the very short jobs of real-time analytics (100ms–10s) [28, 40]. Using multiple greedy schedulers improves scheduling throughput but not latency, and terminating the greedy search early typically lowers the decision quality, especially at high cluster loads.

The second group improves the speed of each scheduling decision by examining only a small number of machines. Sparrow reduces scheduling latency through resource sampling [28]. The scheduler examines the state of two randomly-selected servers for each required core and selects the one that becomes available first. While Sparrow improves scheduling speed, its decisions can be poor because it ignores the heterogeneity and interference preferences of jobs. Typically, concurrent schedulers follow sampling schemes, while centralized systems are paired with sophisticated scheduling algorithms.

Figure 1 illustrates the tradeoff between scheduling speed and quality. Figure 1a shows the probability distribution function (PDF) of application performance for three scenarios of variable job duration using Sparrow [28] on a 200-server EC2 cluster. For very short jobs (100ms), fast scheduling allows most workloads to achieve 80% to 95% of the ideal performance on this cluster. In contrast, jobs of medium (1s–10s) or long duration (10s–10min) suffer significant degradation and achieve only 50% to 30% of the ideal performance. As duration increases, jobs become more heterogeneous in their resource requirements (e.g., preference for high-end cores), and interference between jobs sharing a server matters. In contrast, the speed of scheduling decisions is not as critical.

Figure 1b shows the PDF of job performance using the Quasar scheduler, which accounts for heterogeneity and interference [14]. The centralized scheduler leads to near-optimal performance for long jobs. In contrast, medium and short jobs are penalized by the latency of scheduling decisions, which can exceed the execution time of the shortest jobs. Even if we use multiple schedulers to increase the scheduling throughput [31], the per-job overhead remains prohibitively high.

We propose Tarcil, a scheduler that achieves the best of both worlds: high quality and high speed decisions, making it appropriate for large, highly-loaded clusters that host both short and long jobs. Similar to Quasar [13, 14], Tarcil starts with rich information on the resource preferences and interference sensitivity of incoming jobs. Similar to Sparrow [28], it uses sampling to avoid examining the whole cluster state on every decision. However, there are two key differences in Tarcil's architecture. First, Tarcil uses sampling not merely to find available resources but to identify the resources that best match a job's resource preferences. The sampling scheme is derived using analytical methods that provide statistical guarantees on the quality of scheduling decisions. Tarcil additionally adjusts the sample size dynamically based on the quality of available resources. Second, Tarcil uses admission control to avoid scheduling a job that is unlikely to find appropriate resources. To handle the tradeoff between long queueing delays and suboptimal allocations, Tarcil uses a small amount of coarse-grain information on the quality of available resources.

We use two clusters of 100 and 400 servers on Amazon EC2 to show that Tarcil leads to low scheduling overheads and predictably high performance for a wide range of workload scenarios. For a heavily-loaded, heterogeneous cluster running short Spark jobs, Tarcil improves average performance by 41% over Sparrow [28], with some jobs running 2-3x faster. For a cluster running a wide range of applications, from short Spark tasks to long Hadoop jobs and low-latency services, Tarcil achieves near-optimal performance for 92% of jobs, in contrast with only 22% of jobs under a distributed, sampling-based scheduler and 48% under a centralized greedy scheduler [14]. Finally, Figure 1c shows that Tarcil enables close-to-ideal performance for the vast majority of jobs in all three scenarios.

2. Background

Our work draws from related efforts to improve scheduling speed and quality in large, shared datacenters:

Concurrent scheduling: Scheduling becomes a bottleneck for clusters with thousands of servers and high workload churn. An obvious solution is to schedule multiple jobs in parallel [19, 31]. We assume a structure similar to Google's Omega [31], where multiple scheduling agents can access the whole cluster state. As long as these agents rarely attempt to assign work to the same servers (infrequent conflicts), they proceed concurrently without additional delays. Section 5 discusses conflict resolution.

Sampling-based scheduling: Based on results from randomized load balancing [25, 29], we can design sampling-based cluster schedulers [8, 15, 28]. Sampling the state of just a few servers reduces the latency of scheduling decisions and the probability of conflicts between concurrent scheduling agents, and is likely to find available resources in lightly- or medium-loaded clusters. The recently-proposed Sparrow scheduler uses batch sampling and late binding [28]. Batch sampling examines the state of two servers for each of the m cores required by an incoming job and selects the m best cores. If the selected cores are busy, tasks are queued locally in the sampled servers and assigned to the machine where resources become available first. In our evaluation we compare Tarcil with Sparrow.

Heterogeneity & interference-aware scheduling: Hardware heterogeneity occurs in large clusters because servers are populated and replaced over time [13, 39]. Moreover, the performance of tasks sharing a server may degrade significantly due to interference on shared resources such as caches, memory, and I/O channels [13, 17, 24, 27]. A scheduler can improve task performance significantly by taking its resource preferences into consideration. For instance, a particular task may perform much better on 2.3GHz Ivy Bridge cores than on 2.6GHz Nehalem cores, while another task may be particularly sensitive to interference from cache-intensive workloads executing on the same server.

The key challenge in heterogeneity- and interference-aware scheduling is knowing the preferences of incoming jobs. We start with a system like Quasar that automatically estimates resource preferences and interference sensitivity [13, 14]. Quasar profiles each incoming job for a few seconds on two server types, while two microbenchmarks place pressure on two shared resources. The sparse profiling signal on resource preferences is transformed into a dense signal using collaborative filtering [6, 21, 30, 37]. Collaborative filtering projects the signal against all information available from previously-run jobs, identifying similarities in resource and interference preferences. These include, for example, the preferred core frequency and cache size for a job, or the memory and network contention it generates. Quasar performs profiling and collaborative filtering online. We perform this analysis offline, given that workloads like real-time analytics are repeated multiple times, potentially over different data (e.g., daily or hourly).

3. The Tarcil Scheduler

3.1. Overview

Tarcil is a shared-state scheduler that allows multiple concurrent agents to operate on the cluster state [31]. In this section, we describe the operation of a single agent.

The scheduler processes incoming workloads as follows. Upon submission, Tarcil first looks up the job's resource and interference sensitivity preferences [13, 14]. This information provides estimates of the job's relative performance on the different server platforms, as well as estimates of the interference the workload can tolerate and generate in shared resources (caches, memory, I/O channels). Next, Tarcil performs admission control. Given statistics on the cluster state, it determines whether the scheduler is likely to quickly find resources of satisfactory quality for the job, or whether it should queue it for a while. Admission control is useful when the cluster is highly loaded. A queued application waits until it has a high probability of finding appropriate resources or until a queueing-time threshold is reached. Tarcil maintains coarse-grained statistics on available resources for admission control decisions. These statistics are updated as jobs begin and end execution.

For admitted jobs, Tarcil performs sampling-based scheduling, with the sample size adjusted to satisfy statistical guarantees on the quality of allocated resources. The scheduler also uses batch sampling if a job requests multiple cores. Tarcil examines the quality of sampled resources to select those best matching the job's preferences. It additionally monitors the performance of running jobs. If a job runs significantly below its expected performance, the scheduler adjusts its scheduling decisions. This is useful for long-running workloads; for short jobs, the initial scheduling decision determines performance, with little space for adjustments.

Figure 2: Distribution of resource quality Q (resource quality CDFs, with Poor, Good, and Perfect regions marked) for two workloads with T_W = 0.3 (left) and T_W = 0.8 (right).

3.2. Analytical Framework

We use the following framework to design and analyze sampling-based scheduling in Tarcil.

Resource unit (RU): Tarcil manages resources at RU granularity using Linux containers [11]. Each RU consists of one core and an equally partitioned fraction of the server's memory and storage capacity and the provisioned network bandwidth. For example, a server with 16 cores, 64GB DRAM, 480GB of Flash, and a 10GE NIC has 16 RUs, each with 1 core, 4GB DRAM, 30GB of Flash, and 625Mbps of network bandwidth. Because all our experiments are on public cloud providers where the network topology is unknown, we do not partition network bandwidth in our evaluation.

RU quality: The utility an application can extract from an RU depends on the hardware type (e.g., 2GHz vs. 3GHz core) and the interference on shared resources from other jobs on the same server. Classification [13, 14] obtains the interference preferences of an incoming job using a small set of microbenchmarks to inject pressure of increasing intensity (from 0 to 99%) on one of ten shared resources of interest [12]. Interference preferences capture, first, the amount of pressure t_i a job can tolerate in each shared resource i (i ∈ [1, N]), and second, the amount of pressure c_i it itself will generate in the same resource. High values of t_i or c_i imply that a job will tolerate or cause a lot of interference on resource i; t_i and c_i take values in [0, 99]. In most cases, jobs that cause a lot of interference in a resource are also sensitive to interference on the same resource. Hence, to simplify the rest of the analysis, we assume that t_i = 99 − c_i and express resource quality as a function of caused interference in an RU.

Let W be an incoming job and V_W the vector of interference it will cause in the N shared resources, V_W = [c_1, c_2, ..., c_N]. To capture the fact that different jobs are sensitive to interference on different resources [24], we reorder the elements of V_W by decreasing value of c_i and get V'_W = [c_j, c_k, ..., c_n], with c_j > c_k > ... > c_n. Finally, we obtain a single value for the resource requirements of W using an order-preserving encoding scheme that transforms V'_W into a concatenation of its elements:

$V_W^{enc} = c_j \cdot 10^{2(N-1)} + c_k \cdot 10^{2(N-2)} + \ldots + c_n$   (1)

For example, if V'_W = [84, 31] then V_W^{enc} = 8431.


Figure 3: Resource quality CDFs under the uniformity assumption, in linear and log scale, for sample sizes R = 8, 16, 32, and 64.

Figure 4: Comparison of resource quality CDFs under the uniformity assumption and as measured in a 100-server cluster, for sample sizes R = 8, 16, 32, and 64 (linear and log scale).

The expression above is provably the densest encoding that preserves the full entropy of the values of vector V'_W and their ordering, for general V'_W. Finally, for simplicity we normalize V_W^{enc} to [0,1] and derive the target resource quality for job W:

$T_W = \frac{V_W^{enc}}{10^{2N} - 1}, \quad T_W \in [0, 1]$   (2)

A high value for the quality target T_W implies that job W is resource-intensive; its performance will depend heavily on the quality of the scheduling decision.

We now need to find RUs that closely match this target quality. To determine whether an available resource unit H is appropriate for job W, we calculate the interference caused on this RU by all other jobs occupying RUs on the same server. Assuming M resource units in the server, the total interference H experiences on resource i is:

$C_i = \frac{\sum_{m \neq H} c_i}{M - 1}$   (3)

Starting with vector V_H = [C_1, C_2, ..., C_N] for H, and using the same reordering and order-preserving encoding as for T_W, we calculate the quality of resource H as:

$U_H = 1 - \frac{V_H^{enc}}{10^{2N} - 1}, \quad U_H \in [0, 1]$   (4)

The higher the interference from co-located tasks, the lower U_H will be. Resources with low U_H are more appropriate for jobs that can tolerate a lot of interference, and vice versa.

Comparing U_H for an RU against T_W allows us to judge the quality of resource H for incoming job W:

$Q = \begin{cases} 1 - (U_H - T_W), & \text{if } U_H \ge T_W \\ T_W - U_H, & \text{if } U_H < T_W \end{cases}$   (5)

If Q equals 1, we have an ideal assignment, with the server tolerating as much interference as the new job generates. If Q is within [0, T_W], selecting RU H will degrade the job's performance. If Q is within [T_W, 1), the assignment will preserve the workload's performance but is suboptimal; it would be better to assign a more demanding job to this resource unit.

Resource quality distribution: Figure 2 shows the distribution of Q for a 100-server cluster with ~800 RUs (see Section 6 for cluster details) and one hundred 10-min Hadoop jobs as resident load (50% cluster utilization). For a non-demanding new job with T_W = 0.3 (left), there are many appropriate RUs at any point in time. In contrast, for a demanding job with T_W = 0.8, only a small number of resources will lead to good performance. Obviously, the scheduler must adjust the sample size for incoming jobs based on T_W.
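To make these quantities concrete, the following minimal Python sketch (our illustration, not the paper's code) implements equations (1)-(5) for N = 2 resources; the helper names and the example co-located jobs are hypothetical:

def encode(scores):
    """Order-preserving encoding (eq. 1): sort descending, then
    concatenate the two-digit scores into a single integer."""
    enc = 0
    for c in sorted(scores, reverse=True):
        enc = enc * 100 + c
    return enc

def target_quality(caused):
    """T_W (eq. 2): normalized encoding of the interference job W causes."""
    n = len(caused)
    return encode(caused) / (10 ** (2 * n) - 1)

def ru_quality(neighbor_caused):
    """U_H (eqs. 3-4): quality of an RU, given the caused-interference
    vectors of the other jobs on the same server (the M-1 other RUs)."""
    m = len(neighbor_caused)
    n = len(neighbor_caused[0])
    avg = [sum(v[i] for v in neighbor_caused) // m for i in range(n)]  # C_i
    return 1.0 - encode(avg) / (10 ** (2 * n) - 1)

def match_quality(u_h, t_w):
    """Q (eq. 5): 1.0 is a perfect match."""
    return 1.0 - (u_h - t_w) if u_h >= t_w else t_w - u_h

# Example from the text: V'_W = [84, 31] -> encoding 8431, T_W ~ 0.843.
t_w = target_quality([31, 84])
u_h = ru_quality([[20, 10], [35, 5]])  # two hypothetical co-located jobs
print(round(t_w, 3), round(u_h, 3), round(match_quality(u_h, t_w), 3))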

3.3. Sampling-based Scheduling with Guarantees

We can now derive the sample size that provides statistical guarantees on the quality of scheduling decisions.

Assumptions and analysis: To make the analysis independent of cluster load, we make Q an absolute ordering of RUs in the cluster. Starting with equation (5), we sort RUs based on Q for the incoming job W, breaking any ties in quality with a fair coin, and distribute them uniformly in [0,1], i.e., for N_RU total RUs, Q(i) = i/(N_RU − 1), i ∈ [0, N_RU − 1]. Because Q is now a probability distribution function of resource quality, we can derive the sample size in the following manner.

Assume that the scheduler samples R RU candidates for each RU needed by an incoming workload. If we treat the qualities of these R candidates as random variables Q_i (Q_1, Q_2, ..., Q_R ~ U[0,1]) that are uniformly distributed by construction and statistically independent of each other (i.i.d.), we can derive the distribution of quality Q after sampling. The cumulative distribution function (CDF) of the resource quality of each candidate is F_{Q_i}(x) = Prob(Q_i ≤ x) = x, x ∈ [0,1].¹ Since the candidate with the highest quality is selected from the sampled set, its resource quality is the random variable A = max{Q_1, Q_2, ..., Q_R}, and its CDF is:

$F_A(x) = \mathrm{Prob}(A \le x) = \mathrm{Prob}(Q_1 \le x \wedge \ldots \wedge Q_R \le x) = \mathrm{Prob}(Q_i \le x)^R = x^R, \quad x \in [0,1]$   (6)

¹This assumes the Q_i are continuous variables, although in practice they are discrete; it makes the analysis independent of the cluster size N_RU. The result holds for the discretized version of the equation.

This implies that the distribution of quality after sampling depends only on the sample size R. Figure 3 shows CDFs of resource quality distributions under the uniformity assumption for sample sizes R = {8, 16, 32, 64}. The higher the value of R, the more skewed to the right the distribution is; hence the probability of finding only candidates of low quality quickly diminishes to 0. For example, for R = 64 there is a 10^{-6} probability that none of the sampled RUs has a resource quality of at least Q = 0.8 (Prob(Q < 0.8 for all sampled RUs) = 10^{-6}).

Figure 4 validates the uniformity assumption on a 100-server EC2 cluster running short Spark tasks (100msec ideal duration) and longer Hadoop jobs (1-10min). The cluster load is 70-75% (see methodology in Section 6). In all cases, the deviation between the analytically derived and measured distributions of Q is minimal, which shows that the analysis above holds in practice. In general, the larger the cluster, the more closely the quality distribution approximates uniformity.

Large jobs: For jobs that need multiple RUs, Tarcil uses batch sampling [28, 29]. For m requested units, the scheduler samples R·m RUs and selects the m best among them, as shown in Figure 5a. Some applications experience locality between sub-tasks or benefit from allocation of all resources in a small set of machines (e.g., within a single rack). In such cases, for each sampled RU, Tarcil examines its neighboring resources and makes a decision based on their aggregate quality, as shown in Figure 5b. Alternatively, if a job prefers distributing its resources across machines, the scheduler will allocate RUs in different machines, racks, and/or cluster switches, assuming knowledge of the cluster's topology. Placement preferences for reasons such as security [32] can also be specified in the form of attributes at submission time by the user.

Sampling at high load: Equation (6) estimates the probability of finding near-optimal resources accurately when resources are not scarce. When the cluster operates at high load, we must increase the sample size to guarantee the same probability of finding a candidate of equally high quality as when the system is unloaded. Assume a system with N_RU = 100 RUs. Its discrete CDF is F_A(x) = P[A ≤ x] = x, for x = 0, 0.01, 0.02, ..., 1. For sample size R, this becomes F_A(x) = x^R, and a quality target of Pr[Q < 0.8] = 10^{-3} is achieved with R = 32. Now assume that 60% of the RUs are already busy. If, for example, only 8 of the top 20 candidates for this task are available at this point, we need to set R such that Pr[Q < 0.92] = 10^{-3}, which requires a sample size of R = 82. Hence, the sample size for a highly loaded cluster can be quite high, degrading scheduling latency. In the next section, we introduce an admission control scheme that bounds sample size and scheduling latency, while still allocating high quality resources.
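The required sample size follows directly from equation (6); a small Python sketch (our illustration) reproduces the numbers above up to rounding:

import math

def sample_size(q_floor, fail_prob):
    """Smallest R such that Pr[every sampled RU has quality < q_floor]
    = q_floor ** R <= fail_prob (from eq. 6)."""
    return math.ceil(math.log(fail_prob) / math.log(q_floor))

# Unloaded cluster: Pr[Q < 0.8] ~ 1e-3 gives R in the low 30s
# (the text quotes R = 32; q^R crosses 1e-3 between R = 31 and 32).
print(sample_size(0.8, 1e-3))    # -> 31

# 60% of RUs busy, only 8 of the top 20 candidates free: the same
# guarantee now needs quality >= 0.92 among available RUs, i.e.
# R in the low 80s (the text quotes R = 82).
print(sample_size(0.92, 1e-3))   # -> 83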

Figure 5: Batch sampling in Tarcil with sample size R = 4 for (a) a job with two independent tasks A and B, and (b) a job with two subtasks A1 and A2 that exhibit locality. x-marked RUs are already allocated, striped RUs are sampled, and solid black RUs are allocated to the incoming job after sampling.

4. Admission Control

4.1. Pre-scheduling Queueing

When available resources are plentiful, jobs are immediately scheduled using the sampling scheme described in Section 3. However, when load is high, the number of resources of sufficient quality may be very small, and the sample size needed to find them can become quite large. Tarcil employs a simple admission control scheme that queues jobs until resources of proper quality become available, and estimates how long an application should wait at admission control.

A simple indication for triggering job queueing is the count of available RUs in the cluster. This, however, does not yield sufficient insight into the quality of available RUs. If most RUs have poor quality for an incoming job, it may be better for the job to wait. Unfortunately, a naïve quality check involves accessing the state of the whole cluster, which would introduce prohibitive overheads. Instead, we maintain a small amount of coarse-grain information that allows for a fast check. We leverage the contention scores already maintained for each RU to construct a contention score vector [C_1 C_2 ... C_N] from the resource contention C_i the RU experiences in each of its resources due to interference from neighboring RUs. We use locality-sensitive hashing (LSH) based on random selection to hash these vectors into a small set of buckets [1, 9, 30]. LSH computes the cosine distance between vectors and assigns RUs with similar contention scores in the respective resources to the same bucket. We also separate RUs by platform type to account for heterogeneity. We only keep a single count of available RUs for each bucket. The hash for an RU (and the counter of the corresponding bucket) needs to be recalculated upon instantiation or completion of a job in an RU. Updating the per-bucket counters is a fast operation, out of the critical path for scheduling. Note that, excluding updates in RU status, LSH is only performed once.
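As a rough illustration of this bucketing step, the following Python sketch (our own; the class name, bit count, and seed are assumptions) hashes contention vectors with random-hyperplane cosine LSH and keeps one availability counter per bucket:

import random

class ContentionLSH:
    """Sketch of bucketing RUs by contention vector with random-hyperplane
    (cosine) LSH, in the spirit of [1, 9, 30]. Parameters are illustrative."""
    def __init__(self, dims, bits=5, seed=42):
        rng = random.Random(seed)
        # 'bits' random hyperplanes -> up to 2^bits buckets per platform type.
        self.planes = [[rng.gauss(0, 1) for _ in range(dims)]
                       for _ in range(bits)]
        self.free_count = {}          # bucket id -> number of available RUs

    def bucket(self, contention, platform):
        """Vectors with small cosine distance tend to fall on the same side
        of every hyperplane, landing in the same bucket."""
        sig = 0
        for plane in self.planes:
            dot = sum(p * c for p, c in zip(plane, contention))
            sig = (sig << 1) | (dot >= 0)
        return (platform, sig)        # separate buckets per platform type

    def update(self, contention, platform, delta):
        """Called when a job finishes (delta=+1) or starts (delta=-1) on an RU."""
        b = self.bucket(contention, platform)
        self.free_count[b] = self.free_count.get(b, 0) + delta

lsh = ContentionLSH(dims=10)
lsh.update([5, 80, 12, 0, 3, 40, 7, 1, 9, 2], "r3.2xlarge", +1)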


Figure 6: Actual and estimated (dot) probability for a target RU to exist as a function of waiting time, for three buckets.

Admission control works as follows. We check the bucket(s) that correspond to the resources with quality that matches the incoming job's preferences. If these buckets have counters close to the number of RUs the job needs, the application is queued. Queued applications wait until the probability that resources are freed increases or until an upper bound on waiting time is reached. To estimate waiting time, Tarcil records the rate at which RUs of each bucket became available in recent history. Specifically, it uses a simple feedback loop to estimate when the probability that an appropriate RU exists approaches 1 for a target bucket. The distribution is updated every time an RU from that bucket is freed. Tarcil also sets an upper bound for waiting time at µ + 2σ, where µ and σ are the mean and standard deviation of the corresponding "time-until-free" PDF. If the estimated waiting time is less than the upper bound, the job waits for resources to be freed; otherwise it is scheduled to avoid excessive queueing. Although admission control adds some complexity, in practice it only delays workloads at very high cluster utilizations (over 80%-85%).

Validation of waiting time estimation: Fig. 6 shows the probability that a desired RU will become available within time t for different buckets of a heterogeneous 100-server EC2 cluster running short Spark tasks and longer Hadoop jobs. The cluster utilization is approximately 85%. We show the probabilities for r3.2xlarge (8 vCPUs) instances with CPU contention (A), r3.2xlarge instances with network contention (B), and c3.large (2 vCPUs) instances with memory contention (C). The distributions are obtained from recent history and vary across buckets. The dot in each line shows the waiting time estimated by Tarcil, which closely approximates the measured time for an appropriate RU to be freed (less than 8% deviation on average). In all experiments, we use 20 buckets and a history of the past 2 hours, which was sufficient to make accurate estimations of available resources. The number of buckets and/or history length may vary for different systems.
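A minimal sketch of the waiting-time bound, assuming the "time-until-free" history is kept as release timestamps (the function name and decision structure are our illustration):

import statistics

def admission_decision(release_times, now):
    """From recent timestamps at which RUs of a bucket were freed, estimate
    the time until the next one frees, and cap waiting at mu + 2*sigma of
    the 'time-until-free' distribution."""
    gaps = [b - a for a, b in zip(release_times, release_times[1:])]
    mu, sigma = statistics.mean(gaps), statistics.stdev(gaps)
    elapsed = now - release_times[-1]
    estimate = max(mu - elapsed, 0.0)   # expected wait for the next free RU
    bound = mu + 2 * sigma              # upper bound on queueing time
    if estimate < bound:
        return ("queue", estimate)      # wait: an RU should free up soon
    return ("schedule", None)           # avoid excessive queueing

print(admission_decision([0.0, 1.1, 2.0, 3.2, 4.1], now=4.5))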

4.2. Post-scheduling Queueing

A job that exceeds the upper bound on queueing may still require a high sample size. To avoid excessive scheduling overheads, we cap the sample size at 32 and instead use late binding on the sampled servers until resources become available [28]. If the best two of the 32 sampled RUs are currently busy, the job is locally queued in both until the first RU is freed, and it is subsequently removed from the queue of the second RU. Note that local queueing is unlikely in practice.

5. Tarcil Implementation

5.1. Tarcil Components

Figure 7 shows the components of the scheduler. Tarcil is a distributed, shared-state scheduler and, unlike Quincy or Mesos, it does not have a central coordinator [19, 20]. Scheduling agents work in parallel and are load-balanced by the cluster front-end; each agent has a local copy of the shared server state, which contains the list and status of all RUs in the cluster.

Since all schedulers have full access to the cluster state, conflicts are possible. Conflicts between agents are resolved using lock-free optimistic concurrency, as discussed in [31]. The system maintains one resilient master copy of the state. Each scheduling agent has a local copy of this state, which is updated frequently. When an agent makes a scheduling decision, it attempts to update the master copy of the state using an atomic write operation. While an agent performs this action, no other agent can update these resources in the master copy. Once the commit is successful, the resources are yielded to the corresponding agent. Any other agent with conflicting decisions needs to resample resources. The local copy of state of each agent is periodically synced (every 5-10sec) with the master. The timing of the updates includes a small random seed so that not all agents update their state at exactly the same time, which would make the master a bottleneck. When the sample size is small, decisions of scheduling agents rarely overlap and each scheduling action is fast (~10-20ms for a 100-server cluster and R = 8, over an order of magnitude faster than centralized approaches). When the number of sampled RUs increases beyond R = 32 for very large jobs, conflicts can become more frequent; we resolve them using incremental transactions on the non-conflicting resources [31]. In the event that a scheduling agent crashes, an idle cluster server resumes its role once it has obtained a copy of the master state.
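The commit step can be sketched as follows (our illustration; a shared-state scheduler like Omega uses finer-grained atomic operations than the single lock shown here):

import threading

class MasterState:
    """Sketch of the lock-free commit: an agent's commit succeeds only if
    the RUs it selected are still unallocated at commit time; otherwise
    the agent must resample from its local state copy."""
    def __init__(self, n_rus):
        self.allocated = [None] * n_rus        # RU -> job id, or None
        self._commit_lock = threading.Lock()   # stands in for an atomic write

    def try_commit(self, job_id, selected_rus):
        with self._commit_lock:                # atomic check-and-set
            if any(self.allocated[ru] is not None for ru in selected_rus):
                return False                   # conflict: caller resamples
            for ru in selected_rus:
                self.allocated[ru] = job_id
            return True

master = MasterState(n_rus=800)
while not master.try_commit("job-42", selected_rus=[3, 17]):
    pass  # on conflict: resample RUs from the local state copy and retry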

Each worker server has a local monitor module that handles scheduling requests, federates resource usage in the server, and updates the quality of RUs. When a new task is assigned to a server by a scheduling agent, the monitor updates the status of the RU in the master copy and notifies the scheduling agent and admission control. Finally, a per-RU load monitor evaluates performance in real time. When the monitor detects that a job's performance deviates from its expected target, it notifies the proper agent for a possible allocation adjustment. The load monitor also notifies agents of CPU or memory saturation, which triggers resource autoscaling (see Sec. 5.2).

We currently use Linux containers to partition servers into RUs [3]. Containers enable CPU, memory, and I/O isolation. Each container is configured with a single core and a fair share of the memory and storage subsystems and the network bandwidth. Containers can be merged to accommodate multicore workloads, using cgroups. Virtual machines (VMs) can also be used to enable workload migration [27, 35, 36, 38], but would incur higher overheads.
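For illustration, an RU along these lines could be carved out with cgroup v1 controllers; this is a hedged sketch assuming a standard v1 mount (applying it requires root), not Tarcil's actual container setup, which would go through a container runtime:

import os

def ru_cgroup_config(name, cpus, mem_bytes):
    """Hedged sketch: cgroup v1 writes that pin an RU to one core and an
    equal memory share. Paths assume a standard v1 mount."""
    return [
        (f"/sys/fs/cgroup/cpuset/{name}/cpuset.cpus", cpus),  # "3", or "2-3" for a merged 2-core RU
        (f"/sys/fs/cgroup/cpuset/{name}/cpuset.mems", "0"),
        (f"/sys/fs/cgroup/memory/{name}/memory.limit_in_bytes", str(mem_bytes)),
    ]

def apply_config(writes):
    """Perform the writes (needs root and a cgroup v1 hierarchy)."""
    for path, value in writes:
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:
            f.write(value)

# 16-core, 64GB server -> 16 RUs of 1 core and 4GB each (equal shares).
print(ru_cgroup_config("ru3", cpus="3", mem_bytes=4 * 2**30))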

Figure 7: The different components of the scheduler and their interactions: scheduling agents (each with a local copy of the cluster state), the master cluster state, and worker servers with local monitors, local task queues, and resource units (RUs), together with the load monitor and admission control.

Figure 8 traces a scheduling event. Once a job is submitted, admission control evaluates whether it should be queued or not. Once the assigned scheduling agent sets the sample size according to the job's constraints, it samples the shared cluster state for the required number of RUs. Sampling happens locally in each agent. The agent computes the resource quality of the sampled resources and selects the ones that should be allocated to the job. The actual selection takes into account the resource quality and platform preferences, as well as any locality preferences of a task. The agent then attempts to update the master copy of the state. Upon a successful commit, the agent notifies the local monitor of the selected server(s) over RPC and launches the task in the target RU(s). The local monitor notifies admission control and the master copy to update their state. Once the task completes, the local monitor issues RPCs that update the master state and notify the agent and admission control; the scheduling agent then informs the cluster front-end.

5.2. Adjusting Allocations

For short-running tasks, the quality of the initial assignment is particularly important. For long-running tasks, we must also consider the different phases the program can go through [22]. Similarly, we must consider cases where Tarcil makes a suboptimal allocation due to inaccurate classification, deviations from fully random selection in the sampling process, or a compromise in resource quality at admission control. Tarcil uses the per-server load monitor, i.e., a local daemon running in each RU, to measure the performance of active workloads in real time. This can correspond to instructions per second (IPS), packets per second, or a high-level application metric, depending on the application type. Tarcil compares this metric to any performance targets the job provides or that are available from previous runs of the same application. If there is a large deviation, the scheduler takes action. Since we are using containers, the primary action we take is to avoid scheduling other jobs on the same server. For scale-out workloads, the system also employs a simple autoscale service that allocates more RUs (locally or not) to improve the job's performance.
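A sketch of the load monitor's decision, assuming a 15% tolerance threshold (the threshold and names are our assumptions, not values from the paper):

def monitor_action(measured_ips, target_ips, tolerance=0.15, scale_out=False):
    """Compare the measured metric (e.g., IPS) against the job's target and
    pick the corrective action described in the text."""
    if measured_ips >= (1.0 - tolerance) * target_ips:
        return "ok"                    # performance within target
    if scale_out:
        return "autoscale: allocate more RUs"
    return "isolate: stop placing new jobs on this server"

print(monitor_action(measured_ips=7.2e9, target_ips=9.0e9))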

Figure 8: Trace of a scheduling event in Tarcil across the cluster frontend, admission control, the scheduling agent (sampleRU(), computeQ(), selectRU()), the master state copy, the local monitor, and the resource unit(s) (execTask()).

5.3. Fairness

Users can submit jobs with priorities. Jobs with higher priority will bypass others at admission control and preempt lower-priority jobs during resource selection. Tarcil also allows the user to select between incremental scheduling, where tasks from a job get progressively scheduled as resources become available, and all-or-nothing gang scheduling, where either all tasks of a job are scheduled or none is. We leave the experimental evaluation of priorities and other policies to future work.

6. Evaluation

6.1. Tarcil Analysis

We first evaluate Tarcil's scalability and its sensitivity to parameters such as the sample size and task duration.

Sample size: Fig. 9 shows the sensitivity of sampling overheads and response times to the sample size for homogeneous Spark jobs with 100msec duration and cluster loads varying from 10% to 90% on the 110-server EC2 cluster. All machines are r3.2xlarge memory-optimized instances (61GB of RAM). 10 servers are used by the scheduling agents, and the remaining 100 machines serve incoming load. The boundaries of the boxplots depict the 25th and 75th percentiles, the whiskers the 5th and 95th percentiles, and the horizontal line in each boxplot shows the mean. As the sample size increases, the overheads increase. Until R = 32, overheads are marginal even at high loads, but they increase substantially for R ≥ 64, primarily due to the overhead of resolving conflicts between the 10 scheduling agents used. Hence, we cap the sample size at R = 32 even under high load. Response times are more sensitive to the sample size. At low load, high quality resources are plentiful and increasing R makes little difference to performance. As load increases, sampling with R = 2 or R = 4 is unlikely to find good resources. A sample size of R = 8 is optimal for both low and high cluster loads in this scenario.

Figure 9: Sensitivity of sampling overheads and response times to the sample size R, for cluster loads between 10% and 90%.

Figure 10: Response times (mean and 95th percentile vs. target) when (a) increasing cluster load, and (b) decreasing task duration at constant load.

Cluster load: Fig. 10a shows the average and 95th percentile response times as we scale the cluster load on the 110-server EC2 cluster. The incoming jobs are homogeneous Spark tasks with 100msec target duration. We increase the task arrival rate to increase cluster load. The target performance of 100msec includes no scheduling overheads or degradation due to suboptimal scheduling. The reported response times include the task execution time and all overheads. The mean response time with Tarcil remains almost constant until loads over 85%. At very high loads, admission control and the large sample size increase the scheduling overheads, affecting performance. The 95th percentile is more volatile at high loads, but only exceeds 250msec at cluster loads of 80% or higher. Tasks with very high response times are typically those delayed by admission control until the wait-time threshold is reached. Sampling itself adds marginal overheads until 90% load. At very high loads, scheduling overheads are dominated by queueing time and increased sample sizes.

Task duration: Fig. 10b shows the average and 95th percentile response times as a function of task duration, which ranges from 10msec to 600sec. The cluster load is 80% in all cases. For long tasks, the mean and 95th percentile closely approximate the target performance. When task duration is below or close to 100msec, the scheduling overhead dominates. Despite this, the mean and 95th percentile remain very close, which shows that performance unpredictability is limited. For long jobs, configuring and allocating large amounts of resources dominates the scheduling overheads, while for large numbers of short tasks, queueing delay dominates.

6.2. Comparison with Other Schedulers

Methodology: We compare Tarcil to Sparrow [28] and Quasar [14]. Sparrow uses multiple scheduling agents and a sampling ratio of R = 2 servers for every core required, as recommended in [28]. Quasar has a centralized greedy scheduler that searches the cluster state with a scheduling timeout of 2 seconds. Sparrow does not take into account heterogeneity or interference preferences for incoming jobs, while Tarcil and Quasar do. We evaluate these schedulers on the same 110-server EC2 cluster with r3.2xlarge memory-optimized instances (61GB of RAM). 10 servers are dedicated to the scheduling agents for Tarcil and Sparrow, and a single server for Quasar. While we could replicate Quasar's scheduler for fault tolerance, it would not help with the latency of each scheduling decision. Additionally, Quasar schedules applications at job, not task, granularity (when applicable), which reduces its scheduling load. Unless otherwise specified, Tarcil uses sample sizes of R = 8 during low load.

6.2.1. TPC-H workload

We compare the three schedulers on the TPC-H decision support benchmark. TPC-H is a standard proxy for the ad-hoc, low-latency queries that comprise a large fraction of load in shared clusters. We use a setup similar to the one used to evaluate Sparrow [28]. TPC-H queries are compiled into Spark tasks using Shark [16], a distributed SQL data analytics platform. The Spark plugin for Tarcil is 380 lines of code in Scala. Each task triggers a scheduling request for the distributed schedulers (Tarcil and Sparrow), while Quasar schedules all tasks from the same computation stage jointly. We constrain tasks in the first stage of each query to the machines holding their input data (3-way replication). All other tasks are unconstrained. We run each experiment for 30 minutes, with multiple users submitting randomly-ordered TPC-H queries to the cluster. The results discard the initial 10 minutes (warm-up) and capture a total of 40k TPC-H queries and approximately 134k jobs. Utilization at steady state is 75-82%.

Unloaded cluster: We first examine the case where TPC-H is the only workload present in the cluster. Figure 11a shows the response times for seven representative query types [41].

Figure 11: Response times for different query types (left) and CDFs of scheduling overheads (right) for the Centralized scheduler, Sparrow, Tarcil, and the ideal scheduler: (a) initially unloaded cluster, (b) cluster with initial resident load, (c) heterogeneous cluster with initial resident load.

Response times include all scheduling overheads from sampling or the greedy selection, and queueing. Boundaries show the 25th and 75th percentiles and whiskers the 5th and 95th percentiles. The ideal scheduler corresponds to a system that identifies the resources of optimal quality (including heterogeneity and interference preferences) with zero delay. Fig. 11a shows that the centralized scheduler experiences the highest variability in performance. Although some queries complete very fast because they receive high quality resources, most experience high scheduling delays. To verify this, we also show the scheduling time CDF on the right of Fig. 11a. While Tarcil and Sparrow have tight bounds on scheduling overheads, the centralized scheduler adds up to 2 seconds of delay (the timeout threshold). Comparing query performance under Sparrow and Tarcil, we see that the difference is small, 8% on average. Tarcil approximates the ideal scheduler more closely, as it accounts for each task's resource preferences. Additionally, Tarcil constrains performance unpredictability: the 95th percentile is reduced by 80%-2.4x compared to Sparrow.

Cluster with resident load: The difference in scheduling quality becomes clearer when we introduce cross-application interference. Figure 11b shows a setup where 40% of the cluster is busy servicing background applications, including other Spark jobs for machine learning, long Hadoop workloads, and latency-critical services like memcached. These jobs are not being scheduled by the examined schedulers. While the centralized scheduler still adds considerable overhead to each job (Fig. 11b, right), its performance is now comparable to Sparrow's. Since Sparrow does not account for sensitivity to interference, the response time of queries that experience resource contention is high. Apart from the average response time, the 95th percentile also increases significantly (poor predictability). In contrast, Tarcil accounts for resource preferences and only places tasks on machines with acceptable interference levels. It maintains an average performance only 6% higher than in the unloaded cluster across query types. More importantly, it preserves low performance jitter by bounding the 95th percentile of response times.

Heterogeneous cluster with resident load: Next, in addition to interference, we also introduce hardware heterogeneity. The cluster size remains constant, but 75% of the worker machines are replaced with less or more powerful servers, ranging from general purpose medium and large instances to quadruple compute- and memory-optimized instances. Fig. 11c shows the new performance for the TPC-H queries. As expected, response times increase, since some of the high-end machines are replaced by less powerful servers. More importantly, performance unpredictability increases when the resource preferences of incoming jobs are not accounted for. In some cases (q9, q10), the centralized scheduler now outperforms Sparrow despite its much higher scheduling overheads. Tarcil preserves response times close to those in the unloaded cluster and very close to those achieved with the ideal scheduler.

Figure 12: Performance of scheduled Spark tasks and resident memcached load (aggregate and over time).

6.2.2. Impact on Resident Memcached Load

Finally, we examine the impact of scheduling decisions on resident cluster load. In the same heterogeneous cluster (110 nodes on EC2: 100 workers and 10 schedulers), we place long-running memcached instances as resident load. These instances serve read and write queries following the characteristics of Facebook's etc workload [2]. etc is the large memcached deployment at Facebook; it has a 3:1 read:write ratio and a value distribution between 1B and 1KB. Memcached occupies about 40% of the total system capacity and has a QoS target of 200usec for the 99th percentile of response latency.

The incoming jobs are homogeneous, short Spark tasks (100msec ideal duration, 20 tasks per job) that perform logistic regression. A total of 300k jobs are submitted over 900 seconds. Fig. 12a shows the response times of the Spark tasks for the three schedulers. The centralized scheduler adds significant overheads, while Sparrow and Tarcil lead to small overheads and behave similarly for 80% of the tasks. For the remaining tasks, Sparrow increases response times significantly, as it is unaware of the interference induced by memcached. Tarcil maintains low response times for most tasks.

It is also important to consider the impact on the memcached load. Fig. 12b shows the latency CDF of the memcached requests. The black diamond depicts the QoS constraint of 200usec for the 99th request percentile. With Tarcil and the centralized scheduler, memcached does not suffer, as both schedulers attempt to minimize interference. Sparrow, however, leads to large latency increases for memcached. Even though the performance of the short tasks is satisfactory, not accounting for resource preferences has an impact on the longer jobs in the cluster. Finally, Fig. 12c shows how the 99th percentile of memcached requests changes throughout the execution of the experiment. Initially, memcached meets its QoS under all three schedulers. As the cluster becomes more loaded, the tail latency increases significantly for Sparrow.

Note that a naïve coupling of Sparrow (for short jobs) with Quasar (for long jobs) is inadequate for three reasons. First, Tarcil achieves higher performance for short tasks because it accounts for their resource preferences. Second, even if the long-running resident load were scheduled using Quasar, scheduling short tasks with Sparrow would degrade its performance. Third, while the difference in execution time achieved by Quasar and Tarcil for long jobs is small, Tarcil's scheduling overheads are significantly lower, without sacrificing scheduling decision quality.

6.3. Large-Scale Evaluation

Methodology: We also evaluated Tarcil on a 400-server EC2 cluster with 10 server types ranging from 4 to 32 cores. The total core count in the cluster is 4,178. All servers are dedicated and managed only by the examined schedulers, and there is no external interference from other workloads.

We use applications including short Spark tasks, longer Hadoop jobs, streaming Storm jobs [33], latency-critical services (memcached [23] and Cassandra [7]), and single-server benchmarks (SPECCPU2006, PARSEC [5], etc.). In total, 7,200 workloads are submitted with 1 second inter-arrival times. These applications stress different resources, including CPU, memory, and I/O (network, storage). We measure job performance (from submission to completion), cluster utilization, scheduling overheads, and the quality of allocation decisions.

We compare Tarcil, Quasar, and Sparrow. Because this scenario includes long-running jobs, such as memcached, that are not supported by the open-source implementation of Sparrow, we use Sparrow when applicable (e.g., Spark) and a Sampling-based scheduler that follows Sparrow's principles (sample size 2, batch sampling, and late binding) for the remaining jobs.

Performance: Fig. 13a shows the performance (time between submission and completion) of the 7,200 workloads, ordered from worst- to best-performing and normalized to their optimal performance in this cluster. Optimal corresponds to the performance on the best available resources with zero scheduling delay. The Sampling-based scheduler degrades performance for more than 75% of jobs. While the Centralized scheduler behaves better, achieving an average of 82% of optimal, it still violates QoS for a large fraction of applications, particularly short-running workloads (jobs 0-3900 for this scheduler). Tarcil outperforms both schedulers, leading to 97% average performance and bounding the maximum performance degradation to 8%.

Cluster utilization: Figure 13b shows the system utilization across the 400 servers of the cluster when incoming jobs are scheduled with Tarcil. CPU utilization is averaged across the cores of each server and sampled every 2 sec. Utilization is 70% on average at steady state (the middle of the scenario), when there are enough jobs to keep servers load-balanced.

Figure 13: (a) Performance across 7,200 jobs on a 400-server EC2 cluster for the Sampling-based and Centralized schedulers and Tarcil, normalized to optimal performance; (b) cluster utilization achieved by Tarcil throughout the duration of the experiment; (c) quality of resource allocation across all RUs; and (d) scheduling overheads in Tarcil and the Centralized scheduler.

Figure 14: Resource quality CDFs over time for (a) the Sampling-based scheduler and (b) Tarcil.

Figure 15: Resource quality distributions for the Sampling-based scheduler and Tarcil with R = 8 and 16 RU candidates, across different permutations of the EC2 scenario.

The maximum of the x-axis is set to the time it takes the Sampling-based scheduler to complete the scenario (∼35,000 sec). The additional time corresponds to jobs that run on suboptimal resources and take longer to complete.

Core allocation: Figure 13c shows a snapshot of the RU quality across the cluster, as observed by the job occupying each RU when scheduled with Tarcil. The snapshot is taken at 8,000 sec, when all applications have arrived and the cluster operates at maximum utilization. White tiles correspond to unallocated resources. Dark blue tiles denote jobs whose resources are very close to their target quality. Lighter blue RUs correspond to jobs that received good but suboptimal resources. The graph shows that the majority of jobs are given appropriate resources. Note that high Q does not imply low server utilization; utilization at the time of the snapshot is approximately 75%.

Scheduling overheads: Figure 13d shows the scheduling overheads for the Centralized scheduler and Tarcil. The results are consistent with the TPC-H experiment in Section 6.2. The overheads of the Centralized scheduler increase significantly with scale, adding approximately 1 sec to most workloads. Tarcil keeps overheads low, adding less than 150 msec to more than 80% of workloads, which is essential for scalability. At high load, Tarcil increases the sample size to preserve the statistical guarantees and/or resorts to local queueing. The overheads of the Sampling-based scheduler are similar to Tarcil's and are omitted from the graph for clarity.

Predictability: Figure 14 shows the fraction of allocated RUs that exceed a given resource quality at each point in the scenario. Results are shown for the Sampling-based scheduler (left) and Tarcil (right). Darker colors towards the bottom of the graph denote that a larger fraction of allocated RUs have poor quality. At time 16,000 sec, when the cluster is highly loaded, the Sampling-based scheduler leads to 70% of allocated cores having quality less than 0.4; with Tarcil, only 18% of cores have quality less than 0.9. Also note that, as the scenario progresses, the Sampling-based scheduler starts allocating resources of worse quality, while Tarcil maintains almost the same quality throughout the experiment.
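To make the sample-size adjustment mentioned above concrete, here is a minimal sketch of the underlying order-statistics argument (our illustration, not Tarcil's actual code; the function and parameter names are ours). Assume a fraction phi of the cluster's RUs currently meets a job's quality target. The probability that R independently sampled RUs all miss it is (1 - phi)^R, so guaranteeing at least one hit with probability conf requires R >= ln(1 - conf) / ln(1 - phi). As load rises, phi shrinks and the required sample size grows, which is when a scheduler would fall back to local queueing instead of sampling ever more candidates.

    import math

    def required_sample_size(phi, confidence, max_r):
        """Smallest R with 1 - (1 - phi)**R >= confidence, i.e., at least
        one of R sampled RUs meets the quality target with the desired
        probability. Returns None if R would exceed max_r (queue locally)."""
        if phi <= 0.0:
            return None          # no suitable RUs exist right now
        if phi >= 1.0:
            return 1             # any single sample works
        r = math.ceil(math.log(1.0 - confidence) / math.log(1.0 - phi))
        return r if r <= max_r else None

    # As utilization rises, fewer RUs qualify and R grows:
    for phi in (0.5, 0.2, 0.05):
        print(phi, required_sample_size(phi, confidence=0.99, max_r=64))
    # prints roughly: 0.5 -> 7, 0.2 -> 21, 0.05 -> None (90 exceeds the cap)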

Figure 15 explains this dissimilarity. It shows the CDF of resource quality for this scenario and for five random permutations of it (i.e., different job submission orders). We show the CDFs for the Sampling-based scheduler and for Tarcil with 8 and 16 candidates, and omit the Centralized scheduler, which allocates resources of high quality most of the time. The Sampling-based scheduler deviates significantly from the uniform distribution, since it does not account for the quality of allocated resources. In contrast, Tarcil closely follows the uniform distribution, improving the predictability of scheduling decisions.
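A quick simulation illustrates why more candidates tighten the quality distribution (a sketch under the simplifying assumption that RU qualities are i.i.d. uniform on [0, 1]; the paper does not claim this holds in real clusters). The best of R sampled qualities has CDF x^R, so it concentrates near 1 as R grows from 2 (Sparrow-like) to 8 or 16.

    import random

    def best_of_r_quality(r, trials=100_000, seed=1):
        """Empirically sample the quality of the best RU among r uniform draws."""
        rng = random.Random(seed)
        return sorted(max(rng.random() for _ in range(r)) for _ in range(trials))

    for r in (2, 8, 16):
        samples = best_of_r_quality(r)
        p10 = samples[len(samples) // 10]   # 10th-percentile allocated quality
        print(f"R={r:2d}: 10th pct quality = {p10:.2f}")
    # R=2 leaves a long tail of poor allocations (10% below ~0.32);
    # R=16 keeps ~90% of allocations above ~0.87.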

7. Conclusions

We have presented Tarcil, a cluster scheduler that improves both scheduling speed and quality, making it appropriate for large, highly-loaded clusters running both short and long jobs. Tarcil uses an analytically-derived sampling framework that provides guarantees on the quality of allocated resources and adjusts the sample size to match application preferences. It also employs admission control to avoid excessive sampling and poor scheduling decisions at high load. We have compared Tarcil to existing parallel and centralized schedulers for a variety of workload scenarios on 100- to 400-server clusters on Amazon EC2, and shown that it provides low scheduling overheads, high application performance, and high cluster utilization. Moreover, it reduces performance jitter, improving predictability in large, shared clusters.

