
Finding the Right Cloud Configuration for Analytics Clusters

Muhammad Bilal∗
UCLouvain and IST(ULisboa)/INESC-ID

Marco Canini
KAUST

Rodrigo Rodrigues
IST(ULisboa)/INESC-ID

ABSTRACT
Finding good cloud configurations for deploying a single distributed system is already a challenging task, and it becomes substantially harder when a data analytics cluster is formed by multiple distributed systems, since the search space becomes exponentially larger. In particular, recent proposals for single-system deployments rely on benchmarking runs that become prohibitively expensive as we shift to joint optimization of multiple systems, as users have to wait until the end of a long optimization run to start the production run of their job.

We propose Vanir, an optimization framework designed to operate in an ecosystem of multiple distributed systems forming an analytics cluster. To deal with this large search space, Vanir takes the approach of quickly finding a good enough configuration and then attempts to further optimize the configuration during production runs. This is achieved by combining a series of techniques in a novel way, namely a metrics-based optimizer for the benchmarking runs, and a Mondrian forest-based performance model and transfer learning during production runs. Our results show that Vanir can find deployments that perform comparably to the ones found by state-of-the-art single-system cloud configuration optimizers while spending 2× fewer benchmarking runs. This leads to an overall search cost that is 1.3-24× lower compared to the state-of-the-art. Additionally, when transfer learning can be used, Vanir can minimize the benchmarking runs even further, and use online optimization to achieve a performance comparable to the deployments found by today's single-system frameworks.

∗Work done in part while author was interning at KAUST.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
SoCC '20, October 19–21, 2020, Virtual Event, USA
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-8137-6/20/10 . . . $15.00
https://doi.org/10.1145/3419111.3421305

CCS CONCEPTS
• Computer systems organization → Cloud computing; • Social and professional topics → Management of computing and information systems.

ACM Reference Format:
Muhammad Bilal, Marco Canini, and Rodrigo Rodrigues. 2020. Finding the Right Cloud Configuration for Analytics Clusters. In ACM Symposium on Cloud Computing (SoCC '20), October 19–21, 2020, Virtual Event, USA. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3419111.3421305

1 INTRODUCTION
Data analytics is part of the computational ecosystem of virtually every organization that seeks to extract value from the data it collects. In such computational infrastructures, a wide variety of use cases do not execute on a standalone framework — instead, as depicted in Figure 1, they execute in an ecosystem of multiple frameworks that connect in a graph-like structure to form an analytics cluster. The pipeline depicted in this figure is very similar to the real-world deployments for Periodic ETL [3], or Lambda architectures for aggregating clickstream events [11].

A common way to deploy these analytics clusters is to acquire resources in a cloud environment and do so on demand, whenever (typically recurring) jobs need to execute. However, when executing jobs in the cloud, ensuring that the right resources are allocated is paramount, not only due to cost efficiency but also to satisfy strict service-level objectives (SLOs) on job completion times. Thus, deploying analytics clusters in the cloud requires solving a cloud configuration problem: for each framework, determine (1) how many instances to use, and (2) which type of instances to use. Choosing the right configuration is crucial because, when such decisions are wrong or sub-optimal, jobs fail to meet their deadlines and/or execution costs increase several-fold, as we show in §2.2.

Prior methods for selecting cloud configurations [13, 17, 27–29, 42] do not consider the existence of multiple frameworks in an analytics cluster, but instead focus on configuring a single framework, such as Spark or Hadoop. Furthermore, all possible ways of applying these prior methods in the multi-framework analytics cluster setting have significant limitations.


Figure 1: Example data analytics cluster (a pipeline of HDFS, Spark, and Cassandra).

The first way is to use these methods to optimize each framework individually. However, as our results show, the coupling between different frameworks and incorrect distribution of budget causes this approach to miss opportunities to improve performance. The second way is to consider the entire analytics cluster as if it were a "single-large-system," simultaneously modifying the configuration of all frameworks. This approach enables joint optimization but increases the size of the configuration search space exponentially with the number of frameworks. Exploring a larger space naturally takes longer, causing a costly delay before users can deploy their clusters in production.

We present Vanir, a system for finding the right cloud configuration for multi-framework data analytics clusters. Vanir is designed for a setting where a user needs to provision and set up an on-demand analytics cluster for each run of a batch processing job. In this scenario, it is often the case that a large fraction of these deployments are recurring, as supported by reports that more than 40% of the jobs in production clusters are recurring computations [12, 24, 39]. After the job executes, the cluster is terminated or scaled down to avoid any further costs [5]. The main principle that Vanir adopts to cope with a large configuration search space is to find a good enough configuration via a fast benchmarking phase, and optimize that configuration during production runs, as the job recurs. A good enough configuration is one that satisfies user-defined SLO constraints on both cost and execution time. The design of Vanir combines in a novel way several techniques needed to jointly optimize the resource requirements for multi-framework analytics jobs. In particular, during a preliminary benchmarking phase, Vanir uses a metrics-based optimizer to quickly determine an initial configuration. In this stage, Vanir also attempts, whenever applicable, to use transfer learning and similarity between jobs to reuse knowledge from previous jobs. Then, to tackle the inherent limitations of a quick benchmarking search within a large search space, Vanir fine-tunes the configuration with the help of a performance model that is updated continuously with each successive run of a recurrent job. To select new configurations, Vanir employs Mondrian forests [32], an online random forest approach, aimed at incrementally improving the configuration.

We implement a Vanir prototype (which we plan to release as open-source) and use it to guide the deployment of nine different jobs on two different pipelines (including the one in Figure 1) using AWS EC2. We evaluate Vanir against two baselines corresponding to the above-mentioned ways of using existing optimization methods to handle multi-framework analytics clusters. Our results show that Vanir finds configurations that are comparable to completely offline methods. More importantly, this is achieved while requiring only half of the benchmarking runs, and with a total cost of optimization that is 1.3-24× lower than state-of-the-art methods.

2 ANALYTICS CLUSTERS CONFIG
Analytics clusters are widely prevalent in modern data-driven organizations. A rich ecosystem of software systems gives organizations the freedom to create sophisticated analytics clusters composed of a multitude of frameworks such as Hadoop, Spark, Cassandra, or Flink. We begin by defining the cloud configuration problem and then motivate our work by demonstrating the importance of selecting good cloud configurations. We further outline the challenges and discuss baseline solutions.

2.1 The cloud configuration problem
We formalize the cloud configuration problem as the problem of finding a configuration of resources that minimizes job execution time. A job is a combination of an application (for example, a random forest application in Spark) and the input data. Since a cluster comprises multiple frameworks, we want to jointly identify both the type and number of instances for each framework within the cluster. Formally, a cloud configuration is denoted as a vector C = {⟨N1, I1⟩, . . . , ⟨Nn, In⟩}, where NF is the number of instances for framework F ∈ {1, . . . , n} and IF is the corresponding instance type.

As a validity condition, we require that any valid cloud configuration satisfies user-specified constraints on the maximum execution time and maximum execution cost. In addition, we consider that the number of instances of each framework is chosen from a finite set of possible values. The minimum and maximum values across all frameworks yield the search space bounds.

An instance type defines the CPU, memory, and storage capabilities of an instance. Popular cloud providers group available instance types based on their instance family (e.g., general-purpose, compute-optimized) and instance size (e.g., large, xlarge). The instance family expresses the class of hardware specifications that may best meet the requirements of different applications as well as a CPU-memory ratio. The instance size, in turn, allows for choosing the number of virtual CPUs, amounts of memory, etc.
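To make the formalism concrete, the sketch below shows one possible way to represent a cloud configuration and its validity check in Python. This is an illustrative sketch, not Vanir's implementation; the instance catalog, the prices, and the cost calculation are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical per-hour prices and capacities; real values come from the cloud provider.
INSTANCE_CATALOG = {
    "m5.large":  {"vcpus": 2, "mem_gb": 8,  "price": 0.096},
    "c5.xlarge": {"vcpus": 4, "mem_gb": 8,  "price": 0.17},
    "r5.large":  {"vcpus": 2, "mem_gb": 16, "price": 0.126},
}

@dataclass
class CloudConfig:
    # One (count, instance_type) pair per framework, e.g. {"hdfs": (8, "m5.large"), ...}
    per_framework: Dict[str, Tuple[int, str]]

    def hourly_cost(self) -> float:
        return sum(n * INSTANCE_CATALOG[itype]["price"]
                   for n, itype in self.per_framework.values())

def is_valid(config: CloudConfig, exec_time_s: float,
             max_time_s: float, max_cost: float) -> bool:
    """A configuration is valid if the measured run meets both SLO constraints."""
    exec_cost = config.hourly_cost() * exec_time_s / 3600.0
    return exec_time_s <= max_time_s and exec_cost <= max_cost
```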

2.2 Selecting good configurations matters
We illustrate the value of picking a good cloud configuration by reporting the ratio of maximum to minimum execution time and execution cost for all the configurations that were used in our evaluation of several optimization methods (§7) and for the 9 jobs mentioned in §7.1.


Job                            Max-Min Ratio
                               Time    Cost
Logistic Regression (lr)       2.7×    2.4×
Random Forests (rf)            5.7×    8.3×
PageRank (pr)                  3.4×    3.6×
Gradient Boosted Trees (gbt)   8.0×    77.1×
Nweight (nw)                   3.5×    5.0×
Shortest Paths (sp)            5.2×    4.3×
Connected Components (cc)      2.3×    9.8×
Label Prop. Algorithm (lpa)    2.0×    2.8×
Price Predictor (pp)           2.9×    2.7×

Table 1: Max to Min ratio for execution time and execution cost of the tested configurations.

Table 1 shows that selecting a good cloud configuration can drastically decrease job execution time. In particular, the max to min ratio for execution time and execution cost is roughly 2-8× and 2.4-77×, respectively. Thus, optimizing the cloud configuration can be crucial to satisfying time and/or monetary SLOs. Note that we do not show configurations where jobs take longer than our maximum time constraint (2,400s). Additionally, we omit results from several configurations that lead to job failures (mainly due to insufficient memory resources). As such, the ratios in Table 1 are conservative.

2.3 Challenges
While there are a few recent solutions for finding an appropriate configuration for deploying a single framework in the cloud [13, 27–29, 42], moving to multi-framework analytics clusters raises several additional challenges.

Intrinsic performance coupling. The cloud configuration problem does not decompose into several isolated, per-framework optimizations. This is because the end-to-end system performance depends on the performance of each framework, which is coupled with the performance of other interacting frameworks. Moreover, performance coupling means that one cannot simply optimize the cluster configuration one framework at a time following a natural, pre-established sequential order. The baseline method (Baseline 1) described below highlights that not accounting for performance coupling leads to sub-optimal configurations (§7.2).

Huge configuration space. Jointly optimizing multiple frameworks inevitably leads to a large configuration space, which increases exponentially with every additional framework. Thus the search for possible configurations has inherent scalability problems in this new setting. As such, traditional solutions that use exclusively offline optimization are not practical with a large configuration space. Our results in §7.2 show that the search time and search cost of an exclusively offline method can be high, and it may not be practical to use them.

Uninformative offline profiling. Gathering profiling information offline is a typical strategy to enhance the configuration search speed [18, 19, 31, 43]. However, in these methods, it is necessary to profile information for all possible configurations, which becomes untenable in our setting, since the number of configurations grows exponentially with the number of frameworks. We show that, in our solution, a short benchmarking phase (combined with transfer learning where possible) eliminates the need for comprehensive offline profiling.

2.4 Baseline solutions
As we outlined, there are two ways of adapting the state-of-the-art solutions to work with multi-framework analytics clusters. We will use the following two solutions as baselines to compare against our method.

Baseline 1. Baseline 1 consists of applying a traditional black-box algorithm like Bayesian optimization with Gaussian processes (used in Cherrypick [13]) to optimize one framework at a time. The main issue is that the order in which each framework is optimized influences the optimization outcome. Since the best order depends on the application and the cluster composition under consideration, we follow the DAG of the analytics cluster as the default order. Note that, even though the baseline optimizes one framework at a time, the frameworks are not optimized in isolation. In other words, to optimize a given framework, we execute the job in a multi-framework setting, so that the optimization of that single framework is done in its actual execution environment. At a high level, the main limitation of Baseline 1 is that it has a limited view of the configuration search space, since it determines the best configuration for only one framework at a time in the specified sequence. Thus, it only observes a smaller subset of the full configuration space and might be unable to explore better configurations.

Baseline 2. For Baseline 2, we also use Bayesian optimization with Gaussian processes, except that this baseline treats the whole cluster as a single black-box system. Thus it configures all of the frameworks simultaneously. Unlike Baseline 1, Baseline 2 is capable of exploring the entire configuration search space.

Baseline 1 and 2 are offline methods that require all optimization runs to be executed on a representative input dataset before the job can be deployed in production.

3 OVERVIEW OF VANIR
Vanir solves the cloud configuration problem by splitting the optimization process into two phases. Phase 1: a quick heuristics-based benchmarking phase that determines a good enough initial configuration. Phase 2: a production optimization phase that uses machine learning to progressively find a better configuration.


Figure 2: Configuration optimization workflow in Vanir. (Workflow stages: if the job is known, it goes directly to a production run with the online optimizer; if it is similar to a prior job, the performance model is bootstrapped via transfer learning; otherwise, profile runs with the offline optimizer create the initial performance model. Each production run updates the performance model.)

To further improve the performance and scalability of the solution, Vanir adopts a similarity scoring mechanism to bypass the benchmarking phase, if possible.

The two phases use separate optimizer components. The benchmarking phase uses the offline optimizer (whose runs are utilized for benchmarking purposes). The production optimization phase uses the online optimizer (wherein each run is an actual production run that is used to incrementally update the performance model for improving the configuration for the next job iteration). The online optimization process concludes after a user-specified number of runs.

We now discuss Vanir's usage for different high-level workflow scenarios. The next sections present each component in detail and describe our design rationale.

Workflow. Figure 2 overviews Vanir's optimization process. Whenever a job request arrives, Vanir distinguishes between three scenarios:
Known: The job is the same as a previously executed job.
Similar: The job is new (i.e., not seen before), yet it is similar to a previously executed job.
Unknown: The job is new and not sufficiently similar to any previously executed job.

Vanir handles each of these scenarios as follows:
Known. The request for another run of a recurring job is passed directly to the online optimizer. The online optimizer uses the performance model for this job to select a new cloud configuration to test. Once a production run is concluded, Vanir updates the performance model for future optimization.
Similar. When a new job arrives, Vanir executes it on an initial profiling configuration and scores it based on similarity with previously executed jobs. If the job has a sufficient degree of similarity to one of the existing jobs, the job is passed to the online optimizer. The online optimizer loads the performance model for the most similar job and uses transfer learning to bootstrap a performance model for the new job. This performance model is then used as an initial state for the online optimizer to search for an appropriate configuration of the new job.
Unknown. When a new job does not satisfy the similarity filter, the job request is passed to the offline optimizer. The optimizer finds a good enough configuration that meets the user constraints and then launches a production run using that configuration. Subsequent submissions of the same job are optimized using the online optimizer.
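The following sketch illustrates how this three-way dispatch could look in code. It is a schematic rendering of the workflow in Figure 2, not Vanir's implementation; the optimizer objects, the model store, and the helper method names are assumptions made for illustration.

```python
SIMILARITY_THRESHOLD = 0.85  # threshold used in §6.1

def handle_job_request(job, model_store, offline_opt, online_opt):
    """Dispatch a job request to the offline or online optimizer (schematic)."""
    if job.name in model_store:                       # Known: recurring job
        model = model_store[job.name]
        return online_opt.run_production(job, model)

    # New job: one profiling run to compute its signature (see §6.1).
    signature = offline_opt.profile_signature(job)
    best_match, score = model_store.most_similar(signature)

    if best_match is not None and score >= SIMILARITY_THRESHOLD:   # Similar
        model = online_opt.bootstrap_via_transfer(job, model_store[best_match])
    else:                                                          # Unknown
        model = offline_opt.benchmark(job)            # metrics-based search (§4)

    model_store[job.name] = model
    return online_opt.run_production(job, model)
```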

4 OFFLINE OPTIMIZER
The goal of the offline optimizer is to quickly find a good enough configuration that meets user-provided performance constraints, even if this comes at the expense of the quality of the chosen configuration. Owing to the incremental improvements that the online optimizer will apply for subsequent job submissions, this is a reasonable trade-off to scale to large configuration search spaces.

Our design uses a metrics-based algorithm as the offline optimizer, which uses CPU and memory resource utilization metrics (monitored during profiling runs) to determine the configuration of each framework. Note that our aim in this part of the design is to illustrate the advantages of decomposing the configuration problem into an offline and an online phase; one could replace our offline optimizer with an alternate one or even a static configuration based on domain knowledge.

The algorithm proceeds iteratively and consists of three phases: Init, Resource Increase, and Resource Adjustment. Before we dive into the details of these phases, let us first develop the intuition behind our offline algorithm. Our first goal is to find a good enough configuration that meets the user-specified constraints. To achieve this, we start by rapidly increasing the resource allocation so that we swiftly reach a reasonable set of valid configurations (i.e., that satisfy both execution time and cost constraints), allowing the algorithm to choose the one with the best execution time. However, if this phase only finds one valid configuration or none,¹ then, at a finer granularity, the algorithm reduces the execution cost by decreasing the resources allocated to frameworks with low resource utilization.

This simple algorithm does not deal with different instance families; it sticks with a general-purpose instance and modifies instance size and number of instances. This is intended to make the benchmarking phase short. The online optimizer handles different instance types.

¹A user could specify execution cost or execution time constraints that are too stringent for the given search space. If the offline optimizer fails to find any valid configuration, then the constraints should be revised.


4.1 Metrics-based algorithm
Init Phase: The optimizer first performs a profiling run on a pre-specified configuration to obtain an initial performance sample. We use an initial configuration with 4 instances of instance type m5.large for each framework.

Resource Increase Phase: Then, in a sequence of profiling runs, the number of instances NF allocated to each framework F increases according to monitored CPU and memory utilization as follows:

    N_F = \begin{cases} 2 \cdot N_F, & \text{if } cpu_F + mem_F > \mu_1 \\ N_F + s, & \text{otherwise} \end{cases}

That is, NF doubles as long as the mean utilization of CPU (cpuF) and memory (memF) cumulatively remain above a threshold µ1. Otherwise, a lower sum of CPU and memory utilization indicates that doubling resources would be an over-allocation. In this case, NF increases by a constant s with each profiling run. We use s = 2, and we discuss how to set the thresholds in §4.2. Since the search space bounds impose a limit on the number of instances of a particular size, if NF reaches this limit, the algorithm shifts to the next larger instance size.
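A minimal sketch of this update rule, assuming cpu_util and mem_util are the mean utilizations (each in 0-100) observed during the last profiling run, and that the caller handles the shift to the next larger instance size when the instance-count bound is hit:

```python
MU_1 = 100   # threshold on combined CPU + memory utilization (§4.2)
S = 2        # constant additive step

def next_instance_count(n_f: int, cpu_util: float, mem_util: float,
                        max_instances: int) -> int:
    """Resource Increase Phase: double when utilization is high, otherwise add s."""
    if cpu_util + mem_util > MU_1:
        candidate = 2 * n_f
    else:
        candidate = n_f + S
    # The search-space bound caps the count; the caller then moves to a larger size.
    return min(candidate, max_instances)
```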

The resource increase phase applies to all the frameworks at once until one of two conditions occurs: (1) the execution time of the current configuration worsens as compared to the execution time of the previous configuration, or (2) the previous configuration was valid but the current configuration is not.

When either of these two conditions occurs, the optimizer checks whether it found more than one valid configuration. If that is the case, it terminates by returning the configuration with the best execution time. Otherwise, the optimizer enters a fine-grained resource adjustment phase to find a better configuration.

Resource Adjustment Phase: The optimizer seeks to find valid configurations by decreasing resources to decrease the overall execution cost. During another series of profiling runs, any framework whose CPU and memory utilization are not above certain thresholds sees its resources decreased. This terminates when the current configuration is not valid while the previous one was. The adjustment in the number of instances is done as:

    N_F = \begin{cases} N_F / 2, & \text{if } (cpu_F < \mu_2) \wedge (mem_F < \mu_3) \\ N_F, & \text{if } (cpu_F \geq \mu_2) \wedge (mem_F \geq \mu_3) \\ N_F - s, & \text{otherwise} \end{cases}

The rationale behind the way we change NF is that we want to have the ability to make significant adjustments (halving or doubling NF) when the utilization metrics are either too high or too low. Otherwise, we enable fine-grained adjustments (increments or decrements to NF) to avoid over- or under-allocating the resources by a large margin.
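The adjustment rule can be sketched analogously, assuming the same utilization inputs (in 0-100) and the thresholds from §4.2; the lower search-space bound used here is an assumed default:

```python
MU_2, MU_3 = 50, 50  # CPU and memory utilization thresholds (§4.2)

def adjust_instance_count(n_f: int, cpu_util: float, mem_util: float,
                          min_instances: int = 2, step: int = 2) -> int:
    """Resource Adjustment Phase: halve when both utilizations are low,
    keep when both are high, otherwise decrement by the step s."""
    if cpu_util < MU_2 and mem_util < MU_3:
        candidate = n_f // 2
    elif cpu_util >= MU_2 and mem_util >= MU_3:
        candidate = n_f
    else:
        candidate = n_f - step
    return max(candidate, min_instances)  # stay within the search-space lower bound
```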

4.2 Tuning the offline optimizer
The resource increase and adjustment phases depend on three thresholds that we set empirically. µ1 controls the rate at which resources are increased w.r.t. resource utilization. A value of 100 means that the combined utilization of CPU and memory is 100%. The maximum value of this parameter is 200. Setting the value close to 200 makes the offline optimizer more conservative in increasing resources and likely slower. Conversely, setting it closer to 0 makes the optimizer more aggressive, which is likely to double the resources at every benchmarking run. µ2 and µ3 control the aggressiveness with which the resource adjustment phase decreases resources. The possible values range from 0 to 100. A low value for these parameters means fewer resources taken back from the frameworks and vice versa. In our experiments, we use the following threshold values: µ1 = 100, µ2 = 50, and µ3 = 50. These values strike a balance between aggressiveness and over-allocation of the resources in the different phases of the offline optimizer. A full analysis of these thresholds is outside the scope of this paper.

5 ONLINE OPTIMIZER
The online optimizer aims to further refine the configuration during recurring production runs of the same job. There are three key capabilities that an online optimizer should have: (1) a method to model the job performance on different configurations, (2) the ability to update the performance model at the end of each production run, and (3) a way to select the next configuration (i.e., an acquisition method) that potentially improves performance.

In Vanir, the online optimizer maintains and uses a job-specific performance model M : C → t to predict the job execution time t of any given valid configuration C. Due to the challenges described in §2.3, M is generally non-convex and its derivative is not available. This precludes standard optimization methods for finding the best cloud configuration. Moreover, M is not necessarily accurate, which requires exploratory production runs to make the model more accurate. To tackle these challenges, we use Mondrian forests as a fundamental building block.

5.1 Mondrian forests
Mondrian forests (MFs) use Mondrian processes [36] to construct ensembles of random decision trees. They can be trained in batch or online mode. We use MFs because of three important features [32]. First, MFs gracefully handle predictions on data points outside the space seen in the training data. In the absence of an extensive training dataset, as in our case, this property is of particular importance. Second, online training of MFs has higher accuracy than online RFs for the same amount of training data. Third, online MFs can provide prediction accuracy comparable to batch-trained RFs.


Figure 3: Acquisition method used by the online optimizer. (Steps shown: ① centroid selection — all previous configurations Cx whose execution time ET(Cx) is within ε of the best previous configuration; ② candidate set selection — configurations within distance d from the centroids, plus instance type variations, evaluated with the performance model over the configuration space; ③ configuration selection — sample scoring, with the next configuration C ∈ Q chosen at random from the acquired samples weighted by the normalized rating max_{X∈Q}(t(X)) / t(C).)

These features make MFs more suitable for our case than the online versions of RFs, such as the ones presented in [20, 37].

In Vanir, we use MFs as regressors. Configuration C is the vector of input features, and the output prediction is the job execution time t. Each production run of a configuration is used to update the model. We train the initial MF model M for a job using the data available from the offline optimization phase. The exception is when a new unknown job has high similarity to other jobs; in that case, we use transfer learning to bootstrap the initial model from the most similar job (cf. §6).
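As an illustration, the sketch below trains an online forest regressor on the configurations measured during benchmarking and updates it after each production run. It assumes the MondrianForestRegressor from the scikit-garden package (an assumption about the environment; any incrementally trainable regressor with a partial_fit method could be substituted) and a simple hand-rolled featurization of per-framework (count, vCPUs, memory) triples.

```python
import numpy as np
from skgarden import MondrianForestRegressor  # assumed dependency: scikit-garden

def encode(per_framework_specs):
    """Featurize a configuration given per-framework (count, vcpus, mem_gb) triples."""
    return np.array([v for spec in per_framework_specs for v in spec], dtype=float)

def build_initial_model(benchmark_samples):
    """benchmark_samples: list of (per_framework_specs, measured_seconds) pairs
    collected during the offline (benchmarking) phase."""
    model = MondrianForestRegressor(n_estimators=25, random_state=0)
    X = np.array([encode(specs) for specs, _ in benchmark_samples])
    y = np.array([seconds for _, seconds in benchmark_samples])
    model.partial_fit(X, y)
    return model

def update_model(model, specs, measured_seconds):
    """Fold a new production-run observation into the online model."""
    model.partial_fit(encode(specs).reshape(1, -1), np.array([measured_seconds]))

def predict_seconds(model, specs) -> float:
    """Predicted execution time for a candidate configuration."""
    return float(model.predict(encode(specs).reshape(1, -1))[0])
```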

5.2 Acquisition method
The acquisition method picks a new test configuration that potentially improves job performance and helps refine the model's accuracy. However, as discussed, the performance model is built incrementally and may be inaccurate (especially in the early stages of a recurring job). Also, due to the large search space, a brute-force evaluation of the performance model is inefficient. Therefore, our acquisition method evaluates the performance model on a narrowed-down candidate set of configurations before selecting the next configuration for a production run.

We start by providing an intuitive idea of how the algorithm works, as illustrated in Figure 3. At a high level, the acquisition method uses the previously tested configurations as centroids for a neighborhood search. In particular, the neighboring configurations around the centroids are used as potential candidates from which a better configuration might emerge. This potential candidate set is then filtered using constraints on the predicted execution time, and a single configuration is selected randomly using a weighted probability. The job is then executed with this selected configuration, and its execution time is used to update the performance model.

5.2.1 Centroid selection. The algorithm selects previously executed configurations as centroids – all valid configurations with execution time within an ε factor of the current lowest execution time – whose neighboring configurations can provide better configurations.

The algorithm attempts to select at least l centroids. Given that fewer than l configurations may satisfy the ε-factor condition (especially early in the online optimization phase), the algorithm uses a scoring function to extend the selection to previously executed configurations that meet the execution time constraint and represent good choices in terms of their execution cost. Let ETmin be the lowest execution time, and ECmax be the highest cost across all executed configurations. A configuration C is scored as follows (higher is better):

    score(C) = \begin{cases} \dfrac{(1+\epsilon)\, ET_{min}}{ET(C)} \cdot \dfrac{EC_{max}}{EC(C)}, & \text{if } EC(C) > EC_{max} \\[4pt] \dfrac{(1+\epsilon)\, ET_{min}}{ET(C)} + \dfrac{EC_{max}}{EC(C)}, & \text{otherwise} \end{cases}

Intuitively, this mechanism assigns higher scores to configurations that satisfy cost constraints and lower scores to configurations that do not, proportionally to the amount by which they exceed the constraints. This ensures that, within the configurations that meet the time constraint, preference is given to searching the neighborhood of valid configurations (as far as their cost is concerned) with lower execution times, followed by configurations that violate the cost constraint with the lowest margin.
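A sketch of this scoring function, under the assumption that the reconstruction above is faithful to the garbled original (i.e., the multiplicative branch discounts configurations whose cost EC(C) exceeds ECmax in proportion to the overshoot):

```python
def centroid_score(et_c: float, ec_c: float,
                   et_min: float, ec_max: float, eps: float = 0.2) -> float:
    """Score a previously executed configuration C (higher is better).

    et_c, ec_c: execution time and cost of C; et_min: best execution time so far;
    ec_max: the cost reference ECmax; eps: the epsilon slack from §5.3.
    """
    time_term = (1.0 + eps) * et_min / et_c
    cost_term = ec_max / ec_c
    if ec_c > ec_max:
        return time_term * cost_term   # discount proportionally to the cost overshoot
    return time_term + cost_term       # reward both low time and low cost
```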

5.2.2 Candidate set selection. Based on the set of centroids O, the algorithm forms a candidate set of promising but untested configurations. This set is the union of configurations drawn from the neighborhood of each centroid and through instance type variations of the currently best configuration. Figure 4 shows an example of this process.

Neighborhood selection. For each centroid, the algorithm selects every configuration within distance d from the centroid. We define a distance metric based on the framework-wise difference of CPU and memory capacity. We say that the capacity capF(C) of a configuration C w.r.t. framework F is the vector ⟨total CPU count, total memory⟩, where total refers to the sum over all NF instances of F.


Figure 4: Example candidate set based on a 20% maximum distance from a centroid Cx = {⟨8, m5.l⟩, ⟨20, m5.l⟩, ⟨16, r5.l⟩} for the middle framework (⟨20, m5.l⟩). The valid configurations in the neighborhood are those with a variation of ±10% memory and ±10% number of CPUs (combined ±20%), e.g., C1-C3 with ⟨24, m5.l⟩, ⟨22, m5.l⟩, and ⟨18, m5.l⟩ for the middle framework. A selected configuration based on instance type variation (C4 with ⟨20, c5.xl⟩) doubles the number of CPUs and maintains the same memory.

The distance between two configurations X, Y for framework F is distF(X, Y) = ||capF(X) − capF(Y)||₁. Therefore, so far the candidate set is

    ⋃_{o ∈ O} { C | distF(C, o) ≤ d, ∀F }.

The threshold d is set as max(ds, dmax), where ds denotes the distance of a single step away from the centroid. This only applies when the hyperparameter dmax, which is specified in relative terms, overly restricts the candidate set. Note that, because of our definition of distance, in theory there could be configurations that differ from a centroid but whose distance from it is zero. These configurations are also included.

Instance type variations. We observe that the above candidate set would rarely include an instance family different from those of the centroids, because changing the instance family (for the same number of instances) can double or halve CPU and memory resources (we expect dmax to be smaller than 100%, see §5.3). To allow for instance family variations, we include in the candidate set configurations more distant than dmax from a centroid. Namely, the algorithm enumerates all possible instance type variations for the centroid configuration and includes those that do not reduce the total available memory. The intuition behind this is that changes in CPU resources can affect the job execution time, but drastic reductions to memory capacity can cause job failures. As reference centroid, the algorithm only considers the current best configuration, and all framework-wise instance type variations are considered. Since only the best configuration is used for this step, the possible decrease in the execution time is not severe.
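To make the distance-based neighborhood concrete, here is a small sketch of the capacity vector and the per-framework L1 distance, reusing the hypothetical CloudConfig/INSTANCE_CATALOG types from the earlier sketch. For simplicity the threshold d is treated as an absolute capacity distance here, whereas the paper specifies dmax in relative terms; enumerating the surrounding search space is left to the caller.

```python
def capacity(config: CloudConfig, framework: str):
    """<total CPU count, total memory> of one framework's allocation."""
    n, itype = config.per_framework[framework]
    spec = INSTANCE_CATALOG[itype]
    return (n * spec["vcpus"], n * spec["mem_gb"])

def framework_distance(x: CloudConfig, y: CloudConfig, framework: str) -> float:
    """L1 distance between the capacity vectors of one framework."""
    cx, cy = capacity(x, framework), capacity(y, framework)
    return abs(cx[0] - cy[0]) + abs(cx[1] - cy[1])

def in_neighborhood(c: CloudConfig, centroid: CloudConfig, d: float) -> bool:
    """C is a candidate if it is within distance d of the centroid for every framework."""
    return all(framework_distance(c, centroid, f) <= d
               for f in centroid.per_framework)
```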

5.2.3 Configuration selection. Using the candidate set Q, the algorithm selects a configuration in Q for use in the next production run. First, the algorithm filters the candidate set to discard any configuration whose predicted execution time is too high compared to the currently best configuration and the best predicted execution time of the configurations in Q. Namely, C ∈ Q is discarded if t(C) > β · min_{X∈Q}(t(X)) or if t(C) > γ · ETmin. Here, β and γ are hyperparameters whose tuning we describe in §5.3.

Next, the algorithm could pick from the remaining candidates the best configuration as predicted by the model. However, given the small number of executed configurations in the beginning, the model predictions may not be very accurate. Instead, the algorithm rates each configuration C according to its predicted execution time t(C): rate(C) = max_{X∈Q}(t(X)) / t(C); then the algorithm randomly picks the next configuration with a probability p following its normalized rating: p(C) = rate(C) / Σ_{X∈Q} rate(X). This allows for some diversity while also biasing the choice to configurations with better predicted execution time.
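A sketch of this filter-and-sample step, assuming predict_seconds is a callable wrapping the model prediction (as in the earlier sketch) and that beta and gamma correspond to the β and γ hyperparameters of §5.3:

```python
import random

def select_next_config(candidates, predict_seconds, et_min: float,
                       beta: float = 1.5, gamma: float = 2.0):
    """Filter candidates by predicted time, then sample one weighted by its rating."""
    preds = [predict_seconds(c) for c in candidates]
    best_pred = min(preds)
    kept = [(c, t) for c, t in zip(candidates, preds)
            if t <= beta * best_pred and t <= gamma * et_min]
    if not kept:
        return None  # nothing passes the filter; the caller may relax the thresholds

    worst_kept = max(t for _, t in kept)
    ratings = [worst_kept / t for _, t in kept]       # rate(C) = max_X t(X) / t(C)
    total = sum(ratings)
    weights = [r / total for r in ratings]            # normalized rating -> probability
    chosen = random.choices(range(len(kept)), weights=weights, k=1)[0]
    return kept[chosen][0]
```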

5.3 Tuning the acquisition method
ε controls the closeness of the centroids to the best valid configuration in terms of the execution time. For instance, a value of 0.2 means that only points whose execution time is at most 20% higher than the best-known execution time are chosen as centroids. l determines the desired number of centroids. dmax controls the neighborhood size around the centroids. β and γ limit the configurations that can be selected based on their predicted execution time w.r.t. the best predicted execution time in the candidate set and the best-known execution time. γ should be set slightly higher than β to take into account the model prediction error. γ controls how conservatively the algorithm behaves; e.g., a value of 2 means that the algorithm will not select a configuration whose predicted execution time is twice the best-known execution time.

Higher values for all these hyperparameters allow for more exploration but might also lead to more execution time constraint violations. For our evaluation, we use the following values: ε = 0.2, β = 1.5, γ = 2, l = 5, and dmax = 20%. These values worked well for our scenarios since, with these settings, the algorithm maintains a conservative approach in terms of the execution time constraint, while still being able to explore. A complete hyperparameter exploration is left for future work.

6 LEARNING FROM OTHER JOBS
To try to reduce the number of benchmarking runs even further, Vanir uses the similarity between jobs and transfer learning to quickly bootstrap a performance model for the online optimizer, using knowledge from other jobs.

6.1 Telling jobs are similar
Before bootstrapping the performance model for a new job from a previous one, we need to first determine whether the jobs are similar. Inspired by collaborative filtering algorithms [38], Vanir uses cosine similarity as a similarity metric applied to job signatures.


We explored several types of signatures and settled on a simple one: a job signature is the pair of CPU %idle and memory utilization histograms. The histogram is created from a list of observations of the respective utilization metric. Observations from each instance of a framework are taken continuously and appended to create that list. In particular, these observations are collected every 5 seconds during a job run, and the histograms use bins at 10% granularity. Cosine similarity uses the concatenation of these two histograms in vectorized form.

We say that the pair of a new job and a prior job is similar when their similarity score is above a threshold (0.85 in our case). Vanir uses the model of the prior job with the highest similarity to bootstrap a performance model.
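A minimal sketch of this similarity check, assuming the utilization samples have already been collected (e.g., every 5 seconds from each instance) and using NumPy for the histogramming and cosine similarity:

```python
import numpy as np

def job_signature(cpu_idle_samples, mem_util_samples):
    """Concatenated 10%-bin histograms of CPU %idle and memory utilization."""
    bins = np.arange(0, 101, 10)  # 10 bins covering 0-100%
    cpu_hist, _ = np.histogram(cpu_idle_samples, bins=bins)
    mem_hist, _ = np.histogram(mem_util_samples, bins=bins)
    return np.concatenate([cpu_hist, mem_hist]).astype(float)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_similar(sig_new: np.ndarray, sig_prior: np.ndarray,
               threshold: float = 0.85) -> bool:
    return cosine_similarity(sig_new, sig_prior) >= threshold
```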

Unlike in [18, 19, 31, 43], we do not use collaborative filtering to get recommendations regarding the configurations based on prior jobs. This is because collaborative filtering requires certain sparsity conditions [16, 18, 19, 31], which might be impractical to achieve. Furthermore, these prior works rely on extensive offline profiling, in which a set of benchmark applications are profiled on all configurations. This is also impractical in our case, wherein the total number of possible configurations (Table 2) is roughly 900k. However, collaborative filtering or machine learning methods for workload characterization [18, 19, 29, 31, 41, 43] can be useful once there is a substantial history of a large number of jobs and job runs.

While cosine similarity between jobs allows bootstrapping models for unknown jobs, it is not a perfect method and may lead to false-positive matches. We add a simple test to avoid the effects of false positives: we measure the performance of the new job on the best and worst configurations of the most similar job, and, if the new job performs better on the best configuration from the known job compared to the worst configuration, then we bootstrap the model; otherwise, we fall back to the benchmarking phase. Thus, a successful application of similarity limits the number of benchmarking runs to a total of 3 (1 for similarity calculation and 2 to avoid false positives).

6.2 Bootstrapping models
Vanir uses transfer learning to adapt an MF model trained on a previously seen job and provide better results with fewer training samples for a new job. However, several challenges arise when putting this idea into practice. First, to the best of our knowledge, there is no existing transfer learning method for Mondrian forests. Additionally, we want to minimize the number of samples needed from the new job to perform transfer learning. Therefore, we propose a simple technique that uses the difference in execution time between the two jobs (the new job and a previously seen job) for the same configuration to offset the rest of the predictions of the model.

Parameters                          Possible Values
Instance Type (in AWS EC2)          m5.large, m5.xlarge, c5.xlarge, c5.2xlarge, r5.large, r5.xlarge
No. of Instances (per framework)    [2, 32] - step size of 2

Table 2: Possible values of configuration parameters.

Assume that job J has the highest similarity with a new job H. In this case, we want to transfer J's performance model M_J to create a performance model M_H for H. To this end, we run H with the best configuration from model M_J and obtain the execution time ET_H. The difference ET_H − ET_J between this execution time and the execution time obtained for the same configuration on J is then used as an offset to adjust the execution time predictions t_H(C) for M_H as: t_H(C) = t_J(C) + ET_H − ET_J for all training configurations C used to build M_J.
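A sketch of this offset-based bootstrapping, assuming the (feature vector, execution time) training pairs of the similar job J are available and that the new model is an online forest as in §5.1; the helper names and the model factory are illustrative only:

```python
import numpy as np

def bootstrap_model_via_offset(similar_job_samples, et_similar_on_best: float,
                               et_new_on_best: float, make_model):
    """Shift job J's training samples by (ET_H - ET_J) and fit a fresh model for H.

    similar_job_samples: list of (feature_vector, seconds) pairs used to build M_J.
    et_similar_on_best / et_new_on_best: execution time of J and H on J's best config.
    make_model: factory returning an untrained online regressor (e.g., a Mondrian forest).
    """
    offset = et_new_on_best - et_similar_on_best
    X = np.array([x for x, _ in similar_job_samples])
    y = np.array([t + offset for _, t in similar_job_samples])  # t_H(C) = t_J(C) + offset
    model = make_model()
    model.partial_fit(X, y)
    return model
```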

7 EVALUATION
We evaluate Vanir against Baseline 1 and Baseline 2 (§2.4), which we implement atop the Spearmint [10] library for Bayesian optimization. All experiments run in AWS.

Table 2 lists the configuration search space, which yields 96 configurations for a single framework (6 instance types × 16 possible instance counts), while the total number of possible configurations is 884,736 (96³ for the three frameworks in Figure 1). As such, it is impractical to search the entire space to determine the best configuration of each job as a ground truth.

We limit the optimization budget for each job to 18 runs in total. Baseline 1 uses 3 initial samples for each framework, while Baseline 2 uses a total of 6 initial samples (generated through Spearmint's default Sobol sampling).

7.1 Applications
We run a total of 9 realistic benchmarks (Table 1), 8 of which run on the analytics cluster shown in Figure 1. These benchmarks are adapted from applications included in the HiBench suite [6] and a few additional GraphX [4] jobs (sp, cc, lpa). To evaluate Vanir with different pipelines, we further run the price predictor (pp) job on a second cluster comprising a DAG of batch jobs. Our benchmarks, configs, and datasets are public at [8].

As for the dimension of the input datasets, following the naming in HiBench, we use "gigantic" for pr, nw, and lr; "huge" for gbt and rf; and "small" for sp, lpa, and cc. These sizes are chosen so that all jobs run in a reasonable amount of time (to reduce our costs) on at least a subset of the possible configurations. The datasets are generated synthetically using HiBench for each run.

We use the default values for all system parameters except for the parallelism level and memory configurations for Spark.


Figure 5: (a) Best execution time of the best configuration (in seconds) per workload (left). (b) Number of benchmarking iterations (right). Methods compared: Baseline 1, Baseline 2, and Vanir, across the workloads gbt, rf, lr, nw, pr, cc, lpa, and sp.

The parallelism level is set to 2× the number of cores available in the cluster allocated for Spark. In turn, the amount of executor memory and driver memory is set to the amount of available memory in the instances used for Spark.
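For illustration, these settings map to standard Spark properties roughly as in the sketch below. The exact values are derived from whichever instance type and count the optimizer picked for Spark; in practice the memory figure would likely leave headroom for system overhead, which this sketch ignores.

```python
def spark_conf_for(num_instances: int, cores_per_instance: int,
                   mem_gb_per_instance: int) -> dict:
    """Derive the Spark settings described above for the chosen Spark allocation."""
    total_cores = num_instances * cores_per_instance
    return {
        "spark.default.parallelism": str(2 * total_cores),   # 2x the available cores
        "spark.executor.memory": f"{mem_gb_per_instance}g",  # memory of the Spark instances
        "spark.driver.memory": f"{mem_gb_per_instance}g",
    }
```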

7.2 Results
We compare Vanir against both baseline methods across the quality and cost of optimization, the speed of search, and the contribution of the different components of Vanir. We also discuss the best configurations selected by Vanir and both baselines, as well as the effects of the bounded-search limitations of the baseline methods. Lastly, we generalize our results by looking at a different pipeline and variations in input dataset size.

7.2.1 Quality of optimization. We analyze the best configuration found by an optimization algorithm within a given search budget. In our case, this represents the best execution time found while satisfying the constraints.

Figure 5a shows the execution time of the best configuration found by each optimization method. We repeat the job with the best configuration for each method 5 times, and the error bars indicate the standard deviation. For all jobs, Vanir finds configurations that perform comparably to those found by Baseline 2. Baseline 1, in turn, performs comparably to Vanir and Baseline 2 for 6 of the 8 jobs. The two jobs where Baseline 1 is outperformed are rf and lr. For these jobs, the best execution time found by Baseline 1 is 2-2.5× higher than Baseline 2 and Vanir. This is due to the limited view of the configuration space that Baseline 1 explores, since it optimizes one framework at a time, and decides on the best configuration for a given framework before advancing to optimizing the next one.

Figure 5b shows the number of benchmarking iterations required by each algorithm. For Vanir, that is simply the number of iterations of the offline optimizer. For Baseline 1 and 2, this is the number of runs to reach the best found configuration within the optimization budget. In practice, there is no way to know if an algorithm has yet reached the best configuration it will find. We therefore assume an oracle that predicts whether the baselines have reached their best configuration.

Figure 6: Cost of optimization (normalized optimization cost per workload for Baseline 1, Baseline 2, and Vanir). Baseline 1 and 2 are generally 1.5-6 times more expensive than Vanir.

Therefore, this result shows the best-case scenario for the baselines. The results show that Vanir uses an average of 6 benchmarking iterations. In turn, Baseline 1 and 2 use 14 and 12 benchmarking iterations, respectively – thus, on average, requiring users to wait at least 2 times as long before starting production runs. Using only half the benchmarking iterations, Vanir finds configurations comparable to Baseline 2, thanks to online optimization in production runs.

7.2.2 Cost of optimization. It is important to consider the monetary cost of the (benchmarking) optimization phase, since this consists of running an offline optimization algorithm, which costs money (and time) without doing any useful work. Ideally, this should be kept to a minimum.

Figure 6 shows the cost of optimization, normalized by the cost of the benchmarking phase of Vanir. The bars for Baseline 1 and 2 represent the total cost of the optimization over the budget of 18 iterations compared to the benchmarking phase of Vanir. Vanir (Full) includes both benchmarking and production phases. The cost of the production phase is only the extra cost for each iteration of the production phase compared to what it would cost to simply run Vanir's best found configuration.² The results show that Vanir incurs substantially lower optimization costs, due to its virtuous combination of techniques, namely a quick estimate of a good configuration in the benchmarking phase followed by gradually improving the configuration at production time.

Using only the benchmarking phase of Vanir, the cost is up to 60% lower than running both the benchmarking and production optimization phases. However, that reduction in optimization cost comes at the expense of decreased quality of optimization, as shown later in this section. On the other hand, Baseline 1 and 2 are roughly 1.3-4.6× and 1.6-5.9× more expensive than Vanir (Full), respectively, for 7 out of 8 jobs. The exception is gbt, where Baseline 1 costs 5.9× and Baseline 2 costs 24× more than Vanir.

²As if an oracle revealed the best configuration that Vanir will find.


Figure 7: Progression of optimization for the target jobs (gbt, rf, lr, nw, pr, cc, lpa, sp): best execution time found (in seconds) versus iteration number, for Baseline 1, Baseline 2, and Vanir.

Figure 8: Speed of optimization: fraction of the best execution time (ET) reached versus iteration number, for Baseline 1, Baseline 2, and Vanir. On average, Vanir takes only half the number of iterations compared to Baseline 2 to reach within 15% of the best execution time found.

This is primarily because gbt does not require a lot of resources; in fact, bigger cluster sizes worsen its performance. Vanir's systematic approach with its benchmarking phase quickly determines that characteristic. Therefore, it avoids selecting larger configurations, while the baselines often explore those regions of the search space, thus incurring high optimization costs.

7.2.3 Speed of search. Another way to compare optimization algorithms is to evaluate which method is the fastest to reach a good configuration for a given search budget.

Figure 8 shows, on average across the 8 jobs, how quickly each algorithm can get close to the best execution time found by any optimization method. The results show that Vanir is the fastest. On average, Baseline 2 takes 2× longer than Vanir to get to a configuration within 15% of the best one. Because Vanir uses a metrics-based method as an offline optimizer, it keeps the benchmarking phase short by systematically changing the resources assigned to different frameworks instead of blindly generating initial samples, as is the case for completely black-box methods. Vanir quickly finds a good configuration to start the production runs of the recurring job. Subsequent runs of the job can benefit from the production optimization phase.

Figure 7 shows the progression of the best-found execution time for each job separately. Vanir is fastest for 6 of the 8 jobs; the exceptions are rf and pr, where Baseline 2 by chance finds a valid configuration during initial sampling.

7.2.4 Constraint violations. Since Vanir performs optimization during production runs, it is worth evaluating the number of violations experienced (for time or cost constraints). Vanir assumes that cost constraint violations are tolerable during the production optimization phase. However, execution time constraint violations are more severe in production environments, and therefore an online optimizer should limit them. Any unsuccessful run of a job due to failure (e.g., memory exhaustion) is also counted as a violation.

Figure 9 shows the progression of execution time and execution cost as Vanir optimizes the configuration for each job. The green circle indicates the best valid configuration found. The blue vertical line is the iteration number at which the benchmarking phase ends and the production runs start. The black horizontal line is the cost constraint (associated with the right y-axis). Optimization runs that fail to execute or take longer than the execution time constraint are shown as iterations with zero execution time.

In our experiments, the online phase of Vanir leads to only one failure due to a lack of resources (for iteration 11 of rf). However, if such an error occurs, the online optimizer quickly learns to avoid that region of the search space. It is clear that during the production phase, Vanir tests configurations that lead to cost constraint violations. This is by design, since making Vanir too conservative would hamper the quality of the configurations it finds.

7.2.5 Contribution of different optimizers. To understand the contribution of Vanir's components to the overall optimization process, we break down the best execution time achieved by using different components. Figure 10 shows that the offline optimizer alone reaches performance close to the best configuration found by Vanir (Full) in 3 of the 8 jobs (sp, rf, and lr). In these cases, the online optimizer improves execution time by 0%, 1%, and 6%, respectively.




Figure 9: Progression of Vanir optimization across eight jobs. Only one execution time constraint violation occurred in the online phase.

Figure 10: Contribution of different components of Vanir. Compared to offline-only optimization, the online phase lowers execution time by up to 36%.


7.2.6 Contribution of similarity and transfer learning. We now use a model learned on one job to optimize the configuration of another job. For a job J, using the best-known configuration from the most similar job leads to an execution time that is close to J's best-known execution time. Figure 10 shows that by using similarity (Sim), Vanir finds a configuration with an execution time that is at worst only 55% higher than the best-known configuration. However, these configurations do not always satisfy the execution cost constraints, a task handled by the production optimization phase that follows model bootstrapping.

Using transfer learning, we bootstrap a model for each job. Then we run the production optimization phase on the bootstrapped model to further improve performance. This is shown as Vanir (TL) in Figure 10. The improvement in execution time after running a production optimization phase on the bootstrapped model is 3.4%, 7.7%, 11%, and 35% for pr, rf, cc, and lr, respectively. There is no improvement in the case of nw. For sp and lpa, the execution time is worse than the one shown for Sim; that is because the configuration found using similarity did not satisfy the cost constraint, and so the configuration found after the production run phase in Vanir (TL), which does satisfy the cost constraint, has a higher execution time. Interestingly, Vanir (TL) provides a trade-off between lowering the number of benchmarking runs and the quality of optimization. In 4 of the 7 cases, given the limited budget of 18 iterations, Vanir (TL) finds a configuration that is up to 20% slower than the one found by Vanir (Full). Hence, shortening the benchmarking phase is not always the right choice when optimization quality is more important.
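The following Python sketch shows one way such a similarity-based bootstrap could look; the cosine-similarity metric over per-job resource-usage vectors, the layout of the history structure, and the use of scikit-learn's RandomForestRegressor as a stand-in for the Mondrian forest are all assumptions made for illustration, not Vanir's actual implementation.

import numpy as np
from sklearn.ensemble import RandomForestRegressor  # stand-in for a Mondrian forest

def most_similar_job(new_job_metrics, history):
    # history maps job name -> (metrics vector, list of (config vector, runtime) samples)
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(history, key=lambda job: cosine(new_job_metrics, history[job][0]))

def bootstrap_model(new_job_metrics, history):
    # Warm-start the performance model with the samples collected while
    # optimizing the most similar previously seen job (transfer learning).
    donor = most_similar_job(new_job_metrics, history)
    configs, runtimes = zip(*history[donor][1])
    model = RandomForestRegressor(n_estimators=50)
    model.fit(np.array(configs), np.array(runtimes))
    start_config = configs[int(np.argmin(runtimes))]  # donor's best-known configuration
    return model, start_config

The production optimization phase then refines start_config while updating the bootstrapped model with every new production run.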

7.2.7 Unbounded search. Bayesian optimization, as used in Cherrypick [13] and Arrow [28], requires users to define the bounds of the optimization search space; specifically, the minimum and maximum number of instances as well as the possible instance types. While defining instance types is straightforward, defining the bounds on the number of instances is not easy without prior knowledge. While Bayesian optimization with Gaussian processes is limited to bounded search spaces, Vanir is not. This can be a useful property of our design, since defining a bound on the search space can be difficult. In particular, when defining this bound, there is a trade-off between the size of the search space and the cost of the optimization: a larger search space would likely require a larger budget, and vice versa. Given that a user would likely not know a job's performance profile across different configurations, it is easy to make mistakes when defining bounds on the search space.
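One simple way to realize such an unbounded search over instance counts is to keep growing the allocation geometrically while the measured execution time improves and the cost constraint holds. The sketch below is a hedged illustration of this idea, assuming a hypothetical run_job(n) helper; it is not Vanir's exact policy.

def unbounded_instance_search(run_job, start=4, cost_limit=1.5):
    # run_job(n) is assumed to return (execution_time_s, execution_cost_usd)
    best_n, best_time = None, float("inf")
    n = start
    while True:
        exec_time, exec_cost = run_job(n)
        if exec_cost <= cost_limit and exec_time < best_time:
            best_n, best_time = n, exec_time
            n *= 2            # keep expanding: no user-defined upper bound needed
        else:
            break             # cost violated or performance got worse; stop expanding
    return best_n, best_time

This mirrors the behavior observed in the experiment below, where Vanir tries 64 and then 128 Spark instances and stops expanding once the larger allocation no longer helps.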

Until now, we have compared both baseline methods with Vanir using a bounded search space, for a fair comparison. We now want to gauge the potential gains from an unbounded search: we keep the same bounds as in Table 2 for Baseline 2, while we set no upper limit on the number of instances for Vanir. Since the type and size of the instances would, in practice, be limited to a small number, we still keep the bounds on them. In this experiment, we use sp with the same dataset size used in previous experiments but with an execution cost constraint of $1.5 instead of the $0.8 used before.




Figure 11: Comparing Vanir's performance on an unbounded search space vs. Baseline 2 using sp. Vanir automatically determines the right upper resource limits and avoids incorrect user-defined search space bounds.

Figure 11 shows the performance of the configurations that both methods find. The results show that Vanir finds a valid configuration with an execution time of 185s, while Baseline 2 only finds a configuration with an execution time of 272s. This is because Vanir explores beyond the bounds that Baseline 2 is limited to. In particular, it tests configurations with 64 instances allocated to Spark and then also with 128 instances, and automatically finds that while 64 instances still provide a speedup, ~128 instances worsen the performance. Thus, unbounded search allows Vanir to find a configuration that is 32% faster than Baseline 2, for the same constraints and the same optimization budget.

7.2.8 In-depth look at the best configurations. We now want to reason about why one optimization algorithm performs better than another by looking at the best configurations found by each of them. Answering this question can also help us understand the inefficiencies in each optimization algorithm. We only discuss a select few cases here.

Generally, HDFS and Cassandra are allocated fewer resources than Spark due to cost constraints. We also observe different best configurations across different workloads. HDFS and Cassandra are generally allocated a similar amount of resources, except for the rf workload, where HDFS is given 2x the resources of Cassandra (in the best configuration).

We find that Baseline 1 generally ends up only assigning more resources to Spark. In contrast, the best configurations found by both Vanir and Baseline 2 assign more resources to other frameworks as well. Thus, Baseline 1 misses opportunities that the other two methods explore (since they perform joint optimization). This is particularly important for rf and lr, which is why Baseline 1 performs poorly compared to the other two methods for those jobs. In addition, we find that different instance types matter.


Figure 12: Comparing Vanir's performance on a pipeline of batch jobs (pp) vs. Baseline 2. Performance is comparable; Vanir spends just 6 runs in the benchmarking phase.

Specifically, Vanir's best configurations for the pr and gbt jobs use c5.2xlarge and c5.xlarge instances, respectively. These changes improve the execution time by 18% and 30% (compared to using m5 instances) for gbt and pr, respectively. Interestingly, none of the best configurations use the maximum allowed resources for every framework because: (1) the execution cost constraint means that even if additional resources decrease the execution time, they might incur a high execution cost, and (2) in some cases, assigning more resources can degrade execution time, as in the case of gbt.

7.2.9 Generalizing to a pipeline of batch jobs. To validate that Vanir's approach applies to other types of data analytics pipelines, we use a pipeline of batch processing jobs. This kind of pipeline is common in industry use cases [2, 5, 9], enabled by tools such as Apache Airflow [1] and Luigi [7], which are designed to facilitate data pipelines in the form of a DAG of batch jobs. In our case, the pipeline consists of a linear DAG of three Spark jobs [8], where the results of one job serve as input to the next.
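For concreteness, such a linear pipeline can be expressed as follows; this is a hedged sketch assuming Apache Airflow 2.x, and the stage names (other than the Update stage mentioned below) and script paths are illustrative placeholders rather than the exact jobs from our benchmark.

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="pp_pipeline", start_date=datetime(2020, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    stages = []
    for name in ["ingest", "update", "report"]:   # linear DAG of three Spark jobs
        stages.append(BashOperator(
            task_id=name,
            bash_command=f"spark-submit --deploy-mode cluster /jobs/{name}.py",
        ))
    # Each stage consumes the output written by the previous one.
    stages[0] >> stages[1] >> stages[2]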

Figure 12 shows the comparison between Vanir and Baseline 2 for this data pipeline. Both methods eventually find best configurations (highlighted with green circles) that perform comparably. However, Vanir does not have significant performance variations during the production runs (beyond the vertical blue line) and thus it is more suitable for optimization using production runs. In contrast, Baseline 2 only reaches its best configuration at the end of its offline optimization budget (18 iterations). For this scenario, Vanir determines that the Update stage requires more resources than the other two stages.

7.2.10 Handling input data size changes. Since Vanir performs part of the optimization using production runs, one could expect that variations in the input data size across job invocations affect performance. Vanir handles these changes (whenever the input data size changes significantly, say by at least 10%) by adapting the job's performance model to different input data sizes, treating the latter as an extra model feature. The initial configuration used is a linearly and proportionally scaled version of the best configuration for the original input size.




Figure 13: Handling changes in input data size for cc. Vanir improves the execution time by 30% within 10 optimization runs for input data size increases of 10% and 50%.


Figure 13 shows the progression of the optimization of cc over 10 iterations when the data size is 10% and 50% larger than the data size on which the original cc model was created. Vanir finds configurations (highlighted with green circles) that yield execution times of 170s and 200s for the inputs with 10% and 50% more data, respectively. This represents an improvement of 30% over the initial (proportionally larger) configuration.
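A minimal sketch of this bootstrap step for a changed input size follows; the configuration dictionary layout and the example numbers are assumptions used only to illustrate the linear, proportional scaling described above (in the model itself, the input size is simply appended to the configuration features).

import math

def scale_configuration(best_config, old_size_gb, new_size_gb):
    # Linearly and proportionally scale the per-framework instance counts of the
    # best-known configuration; instance types are kept, counts are rounded up.
    factor = new_size_gb / old_size_gb
    return {
        framework: {
            "instance_type": cfg["instance_type"],
            "count": max(1, math.ceil(cfg["count"] * factor)),
        }
        for framework, cfg in best_config.items()
    }

# Example: a 10% larger input for a Spark/HDFS/Cassandra cluster.
best = {
    "spark":     {"instance_type": "m5.xlarge", "count": 16},
    "hdfs":      {"instance_type": "m5.xlarge", "count": 4},
    "cassandra": {"instance_type": "m5.xlarge", "count": 4},
}
print(scale_configuration(best, old_size_gb=100, new_size_gb=110))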

8 RELATED WORK
Cloud configuration optimization. Ernest [42] adopts an analytical model to predict the execution time of Spark jobs, mainly targeting ML applications. The model is limited to Spark and is not applicable to the performance prediction of multi-framework analytics clusters. Elastisizer [26] uses a mix of black-box and white-box models to optimize cloud configurations as well as job-level configurations for map-reduce jobs. However, several of the models used are limited to the map-reduce style of jobs.

Cherrypick [13], Arrow [28], Micky [27] and Lynceus [17] treat the data analytics framework as a black box and perform cloud configuration optimization using offline methods. These initial proposals targeted a single framework and therefore only had to consider a configuration search space on the order of tens of configurations. We build on that line of research but consider a much larger search space, where a completely offline method becomes impractical. Instead, Vanir splits the optimization into offline and online phases.

Scout [29], Selecta [31] and PARIS [43] use historical data to quickly choose cloud configurations for new jobs. Selecta incorporates different storage options into the cloud configuration search space. Lessons learned from Selecta could be incorporated into Vanir's offline optimizer; the resulting increase in the size of the search space would make Vanir even more attractive. However, these systems were designed for and evaluated in single-system deployments, and the amount of historical data required for handling clusters with multiple systems might be impractical. Additionally, given their different target deployment, Scout and PARIS did not need to improve the configuration suggested by historical data at deployment time, which Vanir does through online optimization.

System configuration optimization. Research on configuration optimization tackles the orthogonal problem of tuning system configurations, such as parameter tuning for Hadoop [14, 25], Spark [21, 23], Storm [15, 40], or databases [22, 34], or general-purpose frameworks like BestConfig [44]. Tools such as LearnConf [33] can be used to determine the most important system parameters that impact performance and to guide the tuning.

Resource allocation in data centers. Paragon [18] is a data center heterogeneity- and interference-aware scheduler. It uses collaborative filtering to classify an unknown incoming job in order to assign resources to it. Quasar [19] is a follow-up to Paragon that also uses collaborative filtering for unknown workload classification. Although this is a different problem scenario, we note that a downside of collaborative filtering is that it requires an offline training set, which is impractical to obtain when the configuration space is large. DejaVu [41] also tackles the problem of allocating resources to workloads in a datacenter setting. Its method of creating workload signatures might be a complementary alternative to the simple similarity metric used by Vanir when there is a large set of jobs and performance data. Morpheus [30] is designed to provide high cluster utilization and predictable performance for enterprise clusters. It extracts SLOs implicitly and performs job placement and dynamic re-provisioning to achieve the SLO. In contrast to Vanir, Morpheus is designed for a data center environment, where there are no cost constraints or considerations of heterogeneous types of machines. Perforator [35] introduces a methodology to perform resource optimization for Hive. Several analytical models are designed and used in that work, but they are not applicable in our multi-framework setting.

9 CONCLUSION
Vanir performs automatic cloud configuration optimization by splitting the optimization task into benchmarking and production phases. This allows Vanir to handle large configuration search spaces without requiring long offline benchmarking or optimization. Vanir finds configurations comparable to benchmarking-based offline methods despite a 3x shorter benchmarking phase. Vanir has 1.3-24x lower optimization search cost, and it is 2x faster to reach within 15% of the execution time of the best configuration found.



ACKNOWLEDGMENTS
Muhammad Bilal was supported by a fellowship from the Erasmus Mundus Joint Doctorate in Distributed Computing (EMJD-DC) program funded by the European Commission (EACEA) (FPA 2012-0030). This research was supported by Fundação para a Ciência e a Tecnologia (FCT), under projects UIDB/50021/2020, CMUP-ERI/TIC/0046/2014, and POCI-01-0247-FEDER-045915.

REFERENCES
[1] Apache Airflow. https://airflow.apache.org. [Online; accessed 25/05/2020].
[2] Build a Concurrent Data Orchestration Pipeline Using Amazon EMR and Apache Livy. https://aws.amazon.com/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/. [Online; accessed 25/05/2020].
[3] Flink Use cases. https://flink.apache.org/usecases.html. [Online; accessed 01/03/2020].
[4] GraphX lib github. https://github.com/apache/spark/tree/master/graphx/src/main/scala/org/apache/spark/graphx/lib.
[5] How Verizon Media Group migrated from on-premises Apache Hadoop and Spark to Amazon EMR. https://aws.amazon.com/blogs/big-data/how-verizon-media-group-migrated-from-on-premises-apache-hadoop-and-spark-to-amazon-emr/. [Online; accessed 25/05/2020].
[6] Intel's HiBench benchmark. https://github.com/intel-hadoop/HiBench.
[7] Luigi. https://github.com/spotify/luigi. [Online; accessed 25/05/2020].
[8] Modified version of Intel's HiBench benchmark. https://github.com/MBtech/HiBench.
[9] Qubole Pipeline. https://www.qubole.com/developers/spark-getting-started-guide/workflow/. [Online; accessed 25/05/2020].
[10] Spearmint Github repo. https://github.com/HIPS/Spearmint.
[11] Walmart Labs: Lambda Architecture. https://medium.com/walmartlabs/how-we-built-a-data-pipeline-with-lambda-architecture-using-spark-spark-streaming-9d3b4b4555d3. [Online; accessed 01/03/2020].
[12] S. Agarwal, S. Kandula, N. Bruno, M.-C. Wu, I. Stoica, and J. Zhou. Re-optimizing data-parallel computing. In NSDI, pages 21–21. USENIX Association, 2012.
[13] O. Alipourfard, H. H. Liu, J. Chen, S. Venkataraman, M. Yu, and M. Zhang. CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics. In NSDI, 2017.
[14] S. Babu. Towards Automatic Optimization of MapReduce Programs. In SoCC, pages 137–142. ACM, 2010.
[15] M. Bilal and M. Canini. Towards automatic parameter tuning of stream processing systems. In SoCC, pages 189–200. ACM, 2017.
[16] F. Cacheda, V. Carneiro, D. Fernández, and V. Formoso. Comparison of collaborative filtering algorithms: Limitations of current techniques and proposals for scalable, high-performance recommender systems. ACM Transactions on the Web (TWEB), 5(1):2, 2011.
[17] M. Casimiro, D. Didona, P. Romano, L. Rodrigues, W. Zwaenepoel, and D. Garlan. Lynceus: Cost-efficient tuning and provisioning of data analytic jobs. arXiv preprint arXiv:1905.02119, 2019.
[18] C. Delimitrou and C. Kozyrakis. Paragon: QoS-aware Scheduling for Heterogeneous Datacenters. In ASPLOS, pages 77–88. ACM, 2013.
[19] C. Delimitrou and C. Kozyrakis. Quasar: Resource-efficient and QoS-aware Cluster Management. In ASPLOS, pages 127–144. ACM, 2014.
[20] M. Denil, D. Matheson, and N. Freitas. Consistency of online random forests. In ICML, pages 1256–1264, 2013.
[21] H. Du, P. Han, W. Chen, Y. Wang, and C. Zhang. Otterman: A novel approach of Spark auto-tuning by a hybrid strategy. In ICSAI, pages 478–483, 2018.
[22] S. Duan, V. Thummala, and S. Babu. Tuning Database Configuration Parameters with iTuned. In PVLDB, pages 1246–1257. VLDB Endowment, 2009.
[23] A. Fekry, L. Carata, T. Pasquier, A. Rice, and A. Hopper. Tuneful: An online significance-aware configuration tuner for big data analytics. arXiv preprint arXiv:2001.08002, 2020.
[24] A. D. Ferguson, P. Bodik, S. Kandula, E. Boutin, and R. Fonseca. Jockey: guaranteed job latency in data parallel clusters. In EuroSys, pages 99–112. ACM, 2012.
[25] H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of MapReduce programs. In PVLDB, pages 1111–1122. VLDB Endowment, 2011.
[26] H. Herodotou, F. Dong, and S. Babu. No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics. In SoCC, page 18. ACM, 2011.
[27] C. Hsu, V. Nair, T. Menzies, and V. Freeh. Micky: A cheaper alternative for selecting cloud instances. In CLOUD, volume 00, pages 409–416. IEEE, 2018.
[28] C.-J. Hsu, V. Nair, V. W. Freeh, and T. Menzies. Arrow: Low-Level Augmented Bayesian Optimization for Finding the Best Cloud VM. In ICDCS, pages 660–670. IEEE, 2018.
[29] C.-J. Hsu, V. Nair, T. Menzies, and V. W. Freeh. Scout: An experienced guide to find the best cloud configuration. arXiv preprint arXiv:1803.01296, 2018.
[30] S. A. Jyothi, C. Curino, I. Menache, S. M. Narayanamurthy, A. Tumanov, J. Yaniv, R. Mavlyutov, I. Goiri, S. Krishnan, J. Kulkarni, et al. Morpheus: Towards automated SLOs for enterprise clusters. In OSDI, pages 117–134. USENIX Association, 2016.
[31] A. Klimovic, H. Litz, and C. Kozyrakis. Selecta: Heterogeneous cloud storage configuration for data analytics. In USENIX ATC, pages 759–773, 2018.
[32] B. Lakshminarayanan, D. M. Roy, and Y. W. Teh. Mondrian forests: Efficient online random forests. In NIPS, pages 3140–3148, 2014.
[33] C. Li, S. Wang, H. Hoffmann, and S. Lu. Statically inferring performance properties of software configurations. In EuroSys, pages 1–16, 2020.
[34] A. Mahgoub, P. Wood, S. Ganesh, S. Mitra, W. Gerlach, T. Harrison, F. Meyer, A. Grama, S. Bagchi, and S. Chaterji. Rafiki: a middleware for parameter tuning of NoSQL datastores for dynamic metagenomics workloads. In Middleware, pages 28–40, 2017.
[35] K. Rajan, D. Kakadia, C. Curino, and S. Krishnan. PerfOrator: Eloquent Performance Models for Resource Optimization. In SoCC, pages 415–427. ACM, 2016.
[36] D. M. Roy, Y. W. Teh, et al. The Mondrian process. In NIPS, pages 1377–1384, 2008.
[37] A. Saffari, C. Leistner, J. Santner, M. Godec, and H. Bischof. On-line random forests. In ICCV, pages 1393–1400. IEEE, 2009.
[38] B. M. Sarwar, G. Karypis, J. A. Konstan, J. Riedl, et al. Item-based collaborative filtering recommendation algorithms. In WWW, volume 1, pages 285–295, 2001.
[39] L. Shao, Y. Zhu, S. Liu, A. Eswaran, K. Lieber, J. Mahajan, M. Thigpen, S. Darbha, S. Krishnan, S. Srinivasan, C. Curino, and K. Karanasos. Griffon: Reasoning about Job Anomalies with Unlabeled Data in Cloud-Based Platforms. In SoCC, pages 441–452. ACM, 2019.
[40] M. Trotter, T. Wood, and J. Hwang. Forecasting a Storm: Divining Optimal Configurations using Genetic Algorithms and Supervised Learning. In ICAC, pages 136–146. IEEE, 2019.
[41] N. Vasić, D. Novaković, S. Miučin, D. Kostić, and R. Bianchini. DejaVu: accelerating resource allocation in virtualized environments. In ASPLOS, pages 423–436. ACM, 2012.


[42] S. Venkataraman, Z. Yang, M. J. Franklin, B. Recht, and I. Stoica. Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics. In NSDI, pages 363–378. USENIX Association, 2016.
[43] N. J. Yadwadkar, B. Hariharan, J. E. Gonzalez, B. Smith, and R. H. Katz. Selecting the Best VM Across Multiple Public Clouds: A Data-driven Performance Modeling Approach. In SoCC, pages 452–465. ACM, 2017.
[44] Y. Zhu, J. Liu, M. Guo, Y. Bao, W. Ma, Z. Liu, K. Song, and Y. Yang. BestConfig: tapping the performance potential of systems via automatic configuration tuning. In SoCC, pages 338–350, 2017.