
Ripple: A Practical Declarative Programming Framework for Serverless Compute

Shannon Joyner1, Michael MacCoss2, Christina Delimitrou1, and Hakim Weatherspoon1

1 Cornell University, 2 University of Washington
1 {sj677,delimitrou,hw228}@cornell.edu, 2 [email protected]

Abstract

Serverless computing has emerged as a promising alternative to infrastructure- (IaaS) and platform-as-a-service (PaaS) cloud platforms for applications with ample parallelism and intermittent activity. Serverless promises greater resource elasticity, significant cost savings, and simplified application deployment. All major cloud providers, including Amazon, Google, and Microsoft, have introduced serverless to their public cloud offerings. For serverless to reach its potential, there is a pressing need for programming frameworks that abstract the deployment complexity away from the user. This includes simplifying the process of writing applications for serverless environments, automating task and data partitioning, and handling scheduling and fault tolerance.

We present Ripple, a programming framework designed specifically to take applications written for single-machine execution and allow them to take advantage of the task parallelism of serverless. Ripple exposes a simple interface that users can leverage to express the high-level dataflow of a wide spectrum of applications, including machine learning (ML) analytics, genomics, and proteomics. Ripple also automates resource provisioning, meeting user-defined QoS targets, and handles fault tolerance by eagerly detecting straggler tasks. We port Ripple over AWS Lambda and show that, across a set of diverse applications, it provides an expressive and generalizable programming framework that simplifies running data-parallel applications on serverless, and can improve performance by up to 80x compared to IaaS/PaaS clouds for similar costs.

1. Introduction

An increasing number of popular applications are hosted on public clouds [22, 23, 33, 44, 54]. These include many interactive, latency-critical services with intermittent activity, for which the current Infrastructure-as-a-Service (IaaS) or Platform-as-a-Service (PaaS) resource models result in cost inefficiencies, due to idle resources. Serverless computing has emerged over the past few years as a cost-efficient alternative for such applications, with the added benefit that the cloud provider handles all data management, simplifying deployment and maintenance for the end user. Under a serverless framework, the cloud provider offers fine-grained resource allocations to users who only pay a small amount for them when a task is executing, resulting in much lower hosting costs.

Several cloud providers have introduced serverless offerings to their public cloud models, including AWS Lambda [4], Google Functions [10], and Azure Functions [11]. In all cases, with small variations, a user launches one or more “functions”, which the provider maps to one or more machines. Functions take less than a second to spawn, and most providers allow the user to spawn hundreds of functions in parallel. This makes serverless a good option for extremely parallelizable tasks, or for services with intermittent activity and long idle periods.

Big data, scientific, and certain classes of machine learning (ML) analytics are good candidates for serverless, as they tend to have ample task parallelism, and their storage and compute requirements continue to increase exponentially [27, 47, 65, 70]. Hosting such services on traditional IaaS and PaaS clouds incurs significantly higher costs compared to serverless, since instances need to be maintained for long periods of time to ensure low start-up latencies. Alternatively, if optimizing for cost, tasks from large jobs are subject to long queueing times until resources become available, despite tasks being ready to execute. When the input load is bursty, the reverse is also true, with provisioned resources remaining idle during periods of low load. While cloud providers have auto-scaling systems in place to provide some elasticity in the number of allocated instances as load fluctuates [3], elasticity is offered at instance granularity. Additionally, scaling out is not instantaneous, resulting in unpredictable or degraded performance, which is especially harmful for short-running tasks.

The premise of serverless is improving resource elasticity and cost efficiency. To realize this, we need programming frameworks that can (i) extract parallelism from existing cloud applications, (ii) simplify writing new applications for serverless compute, (iii) manage data dependencies between functions transparently to the user, and (iv) automatically handle resource elasticity, data partitioning, fault tolerance, and task scheduling to mask the performance unpredictability public clouds are subject to [21, 34, 35, 45, 66, 67, 74–76].

Serverless frameworks additionally need to be programmable enough for users to express generic applications at a high level, and flexible enough to circumvent the limitations of current serverless offerings, such as AWS Lambda, which limits each function to 512MB of disk space and a 15-minute runtime. Existing analytics and ML frameworks, like Hadoop [68], Spark [2], and TensorFlow [16], can handle some of these requirements, but need to run over an active cluster by default. This is problematic in the case of serverless frameworks, where resources are frequently allocated and reclaimed, requiring the framework to be re-deployed across runs. Additionally, such general-purpose frameworks are storage-demanding, in terms of both memory and disk, consuming a large fraction of the limited system resources allocated per serverless function.



Figure 1: Overview of Ripple’s operation, from converting a monolithic application to serverless, to profiling and deploying the service to AWS Lambda.

We present Ripple, a programming framework designed specifically for serverless computing platforms that run highly data-parallel applications. Ripple exposes a high-level declarative programming interface that enables programmers to express the inherent parallelism in their computation, as well as phases that must be executed serially. The framework is general enough to capture execution patterns of a wide range of analytics and scientific applications, including ML analytics and bioinformatics. Ripple uses serverless for all computation phases, removing the need for maintaining long-term instances. It also handles work partitioning automatically, assigning input data chunks to serverless functions, and synchronizing them when required. Ripple also allows users to express priorities and deadlines for their jobs, implementing a variety of scheduling policies, including round-robin, FIFO, and deadline-based scheduling. Furthermore, Ripple leverages information about previously-seen jobs to automatically infer the degree of concurrency (number of Lambdas per phase) that will allow a new job to meet its QoS requirements. Finally, it handles fault tolerance transparently to the user, respawning failing or under-performing tasks.

We evaluate Ripple using AWS Lambda. The design of the framework is portable across serverless frameworks, including Google Functions and Azure Functions, with minimal modifications. We focus on three large-scale cloud frameworks: a proteomics framework, DNA compression, and a kNN-based classification service used to identify buildings in satellite imaging data [15, 24, 63]. We show that Ripple offers a simple way for users to express the pipelines of complex applications in a few lines of code, and that serverless computing offers dramatic performance improvements for data-parallel applications, especially as the size of their datasets increases. Specifically, we show that Ripple can be up to 80× faster than an EC2 cluster, and 30% cheaper than PyWren [49] for a similar runtime. We also demonstrate that Ripple automatically handles the resource provisioning of a given job to meet its QoS target, and can support different scheduling and fault tolerance mechanisms that further improve performance predictability. Ripple is open-source software under a GPL license.

2. Background

2.1. Advantages of Serverless

Serverless allocates fine-grained resources for a limited time, which reduces the overprovisioning problem current datacenters experience [30–34, 36, 57, 66]. It also simplifies cloud management; users simply upload the code they wish to execute to the cloud, and cloud providers handle the state management and resource provisioning. If an application has ample parallelism, serverless achieves much better performance for comparable cost to traditional PaaS or IaaS platforms [40, 42, 43].

Users only pay when a serverless function is running, making the model well-suited for applications with intermittent activity. Traditional cloud instances can be rented and freed up on-demand, but initializing a new instance takes at least 30s. In comparison, spawning and destroying a serverless function takes a few milliseconds. For short-running tasks that complete in a few seconds, the overhead of instantiating and releasing cloud instances can have a dramatic impact on execution latency. Some of the most popular serverless computing offerings at the moment include AWS Lambda, Google Functions, Azure Functions, OpenLambda, and OpenWhisk [1, 4, 10, 11, 48].

2.2. Limitations of Serverless

Despite its increasing popularity, serverless still needs to overcome several system challenges. First, while cloud providers make triggering serverless functions easy, they do not currently offer an easy way to share state between functions. If a user wants to combine results from multiple functions, or save state to use later in the pipeline, this state needs to be saved in remote storage (S3 for AWS Lambda), potentially degrading application performance [4, 5]. Second, current serverless functions are resource-constrained, typically being limited to a single CPU, a couple GB of memory, and a few hundred MB of disk space. This limits the type of computation that can be hosted on serverless, especially given the fact that current platforms do not offer a way for users to partition their work to functions in a programmable and systematic way. This is especially challenging for non-expert cloud users running applications with complex workflows, where the degree of available parallelism varies across execution phases. Third, despite the increased visibility cloud providers have into serverless applications as opposed to traditional PaaS- or IaaS-hosted applications, current platforms still do not have a way for users to express a quality-of-service (QoS) target for their applications that the cloud provider must meet. This is especially detrimental in serverless frameworks where resource sharing, even of a single CPU, is much more prevalent than in traditional cloud systems, and hence applications are more prone to unpredictable performance due to interference.


Ripple Programming Interface (Function | Description | Arguments)

split | Split a file into small data chunks | split_size: (optional) Number of lines per chunk; default 1MB.
combine | Combine multiple files | identifier: (optional) Field to sort resulting items by.
top | Return the top items in a file | identifier: Field to sort items by. number: Number of items to return.
match | Match files to a keyword | find: Property to look for; e.g., highest score sum. identifier: Field to sort items by.
map | Map each item to an input | input_key: Key parameter to set input name to for next phase. map_table: Table containing list of files to map. table_key: Key parameter to set table name to for next phase. directories: (optional) Use table directories to map.
sort | Sort file using Radix sort | identifier: Field to sort items by.
partition | Return n equally spaced ranges (used for Radix sort) | identifier: Field used to partition the items into ranges.
run | Invoke Lambda function | application: The name of the application file to execute. output_format: (optional) Format of the output file.

Table 1: The programming interface of Ripple.

The challenges and opportunities of serverless put increased pressure on programming frameworks that simplify porting traditional cloud applications onto serverless platforms in a way that masks their current limitations.

3. Ripple Design

We propose Ripple, a general-purpose programming framework for serverless that targets applications with ample parallelism. Ripple follows the model of a high-level declarative language; a user expresses a job’s dataflow at a high level, and Ripple abstracts away the complexity of data partitioning (Sec. 3.1), task scheduling (Sec. 3.4), resource provisioning (Sec. 3.2), and fault tolerance (Sec. 3.3). Fig. 1 shows the high-level overview of the framework.

3.1. Ripple Programming Framework

Ripple exposes a concise programming interface of eight principal functions that users can leverage to express the most common parts of their services’ dataflow, including parallel and serial phases, and dependencies between dataflow stages. Each function can be used in one or more stages of a job. The programming interface is shown in Table 1. Apart from the eight functions in the table, users can also upload arbitrary operations, and invoke them through the run function.

Listing 1 shows an example Ripple application for porting a simple DNA compression algorithm to AWS Lambda. The config parameter contains the basic setup parameters, such as the default amount of memory to use for functions. The pipeline is initialized by specifying the job configuration and the result and logs locations. Next, an input variable is declared.

import ripple

# Configure region and Lambda resources
config = {
    "region": "us-west-2",
    "role": "aws-role",
    "memory_size": 2240,
}

# Express computation phases
pipeline = ripple.Pipeline(
    name="compression",
    table="s3://my-bucket",
    log="s3://my-log",
    timeout=600,
    config=config,
)

# Define input data
input = pipeline.input(format="new_line")

# Declare how to sort input data
step = input.sort(
    identifier="start_position",
    params={"split_size": 500 * 1000 * 1000},
    config={"memory_size": 3008},
)

# Declare compression method
step = step.run(
    "compress_methyl",
    params={"pbucket": "s3://my-program"},
)

# Create and upload functions
pipeline.compile("json/compile.json")

Listing 1: Ripple Configuration for DNA Compression.

The input variable declares the data format of the input dataset, which helps Ripple determine how to split and manipulate the data. The input DNA files can be gigabytes in size, significantly more than the 512MB of disk space or 3GB of memory offered by Lambda. The compression algorithm relies on finding sequences with similar prefixes. Therefore, Ripple calls sort on the input file, which sorts the file into 500MB chunks. The user can also provide an application-specific hint on the split size a function should handle.

In this example, sort needs more than 2240MB, so the memory allocation is 3008MB instead. The next step compresses the input shards, invoked via Ripple’s run function. The application function returns the paths of the files that Ripple should pass to the next pipeline step.

Once the user specifies all execution steps, they can compile the pipeline, which generates the JSON file Ripple uses to set up the serverless functions. Ripple then uploads the code and dependencies to AWS, and schedules the generated tasks.

3.2. Automating Resource Provisioning

One of the main premises of serverless is simplifying resource provisioning, by letting the cloud provider handle resource allocation and elastic scaling. Serverless frameworks today still do not achieve that premise, requiring the user to specify the number of serverless functions a job is partitioned across. In addition to allowing users to specify priorities among their jobs (Sec. 3.4), Ripple also automates the resource provisioning and scaling process for submitted applications. Specifically, when a new application is submitted to AWS Lambda via Ripple, the user also expresses a deadline that the job needs to complete before [69, 72]. Ripple uses that deadline to determine the right resources for that job, depending on whether the job consists of one or multiple phases.

Single-phase jobs: This is the simplest type of serverless job, where the input data is partitioned, processed, and the result is combined, and either stored on S3 or sent directly to the user. In this case, Ripple just needs to determine the number of Lambdas needed for processing. Ripple uses a two-step process to determine the appropriate degree of concurrency. First, it selects a small partition of the original input dataset, min(20MB, full_input), called the canary input.1 By default, Ripple selects the canary input starting from the beginning of the dataset, unless otherwise specified by the user. It then uses the canary input as the input dataset for two small jobs, each with a different data split size per Lambda. One job runs with the default split size for AWS Lambda, 1MB, and the other with the split size Lambdas would have if the job used its maximum allowed concurrency limit (1,000 on AWS by default). To track task progress when enforcing different resource allocations, Ripple implements a simple tracing system detailed in Sec. 4. Ripple’s tracing system collects the per-Lambda and per-job execution time (including the overheads for work partitioning and result combining), and uses this information to infer the performance of the job with any split size. Inference happens via Stochastic Gradient Descent (SGD), where SGD’s input is a table with jobs as rows and split sizes as columns. SGD has been shown to accurately infer the performance of applications on non-profiled configurations, using only a small amount of profiling information [30, 33, 51, 53], in the context of cluster scheduling, heterogeneous multicore configuration, and storage system configuration. As new jobs are scheduled by Ripple, the number of rows in the table increases. Given that input dataset sizes vary across jobs, the table includes profiling information on a diverse set of split sizes, including one split size (default 1MB) that is consistent across all jobs.

1 The 20MB lower limit is determined empirically to be sufficient for our applications, and can be tuned for different services.

Once Ripple determines the level of concurrency that will achieve a job’s specified deadline, it launches the full application and monitors its performance. If measured performance deviates from the estimated performance with that degree of concurrency, Ripple updates the table with the measured information to improve prediction accuracy over time.
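To make the inference step concrete, the sketch below illustrates the general technique: previously scheduled jobs form a sparse table R of measured runtimes (rows are jobs, columns are split sizes), and SGD fits low-rank factors that fill in the missing entries, i.e., the expected runtime of a job at split sizes it was never profiled on. This is a minimal illustration of this family of models, assuming a NaN-for-missing encoding; the matrix layout, hyperparameters, and function name are ours, not Ripple's exact implementation.

import numpy as np

def predict_runtimes(R, rank=4, lr=0.01, reg=0.1, epochs=500):
    """Fill in missing entries of the (jobs x split sizes) runtime table R."""
    n_jobs, n_splits = R.shape
    P = 0.1 * np.random.rand(n_jobs, rank)    # per-job latent factors
    Q = 0.1 * np.random.rand(n_splits, rank)  # per-split-size latent factors
    observed = [(i, j) for i in range(n_jobs) for j in range(n_splits)
                if not np.isnan(R[i, j])]
    for _ in range(epochs):
        for i, j in observed:
            err = R[i, j] - P[i] @ Q[j]
            p_old = P[i].copy()
            # Gradient step on the squared error with L2 regularization
            P[i] += lr * (err * Q[j] - reg * P[i])
            Q[j] += lr * (err * p_old - reg * Q[j])
    return P @ Q.T  # dense runtime estimate for every (job, split size) pair

After its two canary runs, a new job contributes a row with only two observed entries; the reconstructed row then provides runtime estimates for every other split size, from which a degree of concurrency that meets the deadline (or minimizes runtime) can be chosen.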

Apart from specifying per-job deadlines, users can also request that Ripple tries to achieve the best possible performance for a job, given AWS’s resource limits, or alternatively, they can specify an upper limit for the cost they are willing to spend for a job, in which case Ripple maximizes performance under that cost constraint.

Multi-phase jobs: Typical serverless applications involve multiple phases of high concurrency, which may be interleaved with sequential phases that combine intermediate results. The process Ripple follows to provision such jobs is similar to the one described above, with the difference that columns in the SGD table are now ratios of concurrency degrees across phases. Ripple also needs to increase the number of canary runs to ensure high prediction accuracy, from two for single-phase jobs to four for multi-phase jobs. Split sizes per phase are selected in [1MB, fullDataset/maxLambdas] as before. If measured performance deviates from estimated, Ripple again updates the corresponding column in the table to improve accuracy the next time a similar job is scheduled.

Note that higher Lambda concurrency does not necessarily translate to better performance, for three reasons. First, getting a speedup as Lambdas increase is contingent on the available parallelism in a job’s computation. Applications with low inherent parallelism will not benefit from a higher degree of concurrency, while also incurring higher costs for the increased number of serverless functions. Second, parallelism does not come for free, as work and data need to be partitioned and distributed across tasks, and results need to be reassembled and combined before computation can proceed. This is especially the case for multi-stage job pipelines, where parallel phases are interleaved with phases that combine intermediate results. Third, a larger number of concurrent serverless functions also increases the probability that some of these tasks will become stragglers and degrade the entire job’s execution time.

Obtaining a small amount of profiling information before provisioning a new job allows Ripple to identify whether either of the first two reasons that would prevent a job from benefiting from higher concurrency is present. The third reason is impacted both by job characteristics, e.g., faulty data chunks can cause some Lambdas to become stragglers, and by the state of the serverless cluster, e.g., some Lambdas can underperform due to interference between tasks sharing system resources. To address underperforming tasks, we also implement a fault tolerance mechanism in Ripple, detailed below.

3.3. Fault Tolerance

Task failures are common in data-parallel analytics jobs. Furthermore, straggler tasks are a well-documented occurrence, where a small number of under-performing tasks delay the completion of the entire application [17, 19, 20, 46, 55, 73]. AWS Lambda does not provide handlers to running functions, therefore a function’s progress cannot be directly monitored. Instead, Ripple collects execution logs for each Lambda request based on when they write to S3. These logs not only prevent duplicate requests, but also contain the payload information needed to re-execute failed Lambda processes. If Ripple detects that a function has not been logged within the user-specified timeout period from its instantiation, it will re-execute that function. This also helps minimize the impact of stragglers, as Ripple can eagerly respawn tasks that are likely to under-perform.

3.4. Scheduling Policies

Cloud users typically submit more than one job to a public cloud provider. AWS Lambda at the moment implements a simple FIFO2 scheduling policy, which means that Lambdas submitted first will also start running first. This does not allow a lot of flexibility, especially when different jobs have different priorities or must meet execution time deadlines.

2 Launching Lambdas happens in a “mostly” FIFO order; however, the exact ordering is not strictly enforced.

Ripple implements several scheduling policies users can select when they submit their applications. To prevent conflicts between scheduling policies specified in different jobs, a scheduling policy applies to all active jobs managed by Ripple. Ripple precomputes the best task schedule based on the specified scheduling policies, and enforces them by reordering tasks client-side, and deploying them in that order on AWS Lambda. Below we summarize each scheduling policy supported in Ripple.

FIFO: This is the default scheduling policy implemented by AWS Lambda. Functions are executed in the order of submission. Under scenarios of high load, FIFO can lead to long queueing latencies.

Round-Robin: Ripple interleaves the phases of each job to allocate approximately equivalent time intervals to each application. Round-Robin penalizes the execution time of the first few jobs, but reduces queueing delays, and improves fairness compared to FIFO.

Priorities: Priorities in Ripple are defined at job granularity, with high-priority jobs superseding concurrently-submitted low-priority jobs. If a high-priority job arrives when the user has reached their resource quota on Lambda (1,000 active Lambdas by default), Ripple will pause low-priority jobs, schedule the high-priority application, and resume the former when the high-priority job has completed. Applications of equal priority revert to Round-Robin. Unless otherwise specified, we use the FIFO scheduler by default.
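The client-side reordering described above can be sketched as follows. This is an illustration of the three policies, not Ripple's internal scheduler; a task here is any opaque object, and each job is simply the list of tasks it wants to deploy, in order.

from itertools import zip_longest

def fifo(jobs):
    # jobs: list of task lists, in submission order; deploy jobs back to back
    return [task for job in jobs for task in job]

def round_robin(jobs):
    # Interleave tasks from all jobs so each active job gets a similar share
    ordered = []
    for batch in zip_longest(*jobs):
        ordered.extend(task for task in batch if task is not None)
    return ordered

def with_priorities(jobs, priorities):
    # Deploy higher-priority jobs first; jobs of equal priority fall back
    # to round-robin among themselves
    groups = {}
    for job, priority in zip(jobs, priorities):
        groups.setdefault(priority, []).append(job)
    ordered = []
    for priority in sorted(groups, reverse=True):
        ordered.extend(round_robin(groups[priority]))
    return ordered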

4. Implementation

Ripple is written in approximately 5,000 lines of Python code, including application-specific code and configurations. The eight functions shown in Table 1 are also written in Python. A user can create arbitrary functions, as long as the functions adhere to Ripple’s run template. Currently, Ripple supports interactions with AWS Lambda and S3. However, Ripple’s API has been designed to be portable to other providers, including Windows Azure and Google.

Job configuration: A user ports a new application by using the Ripple API to create the application pipeline. This pipeline is compiled to a JSON configuration file. The configuration file indicates which Lambda functions to use and the order in which to execute the functions. If the application requires manipulation of file formats that the framework does not support by default (such as newline-separated, TSV, FASTA, or mzML), users simply need to implement a format file for the framework to know how to parse the input files. Users also need to create two tables: one to store the input, intermediate, and output files, and another for the log files Ripple maintains to manage the execution of functions.

To deploy an application, the user follows an automated setup process in Ripple, which takes their configuration file and uploads all necessary code to AWS. The maximum deployment package size AWS currently allows is 50MB. However, AWS provides layers for users to import extra dependencies. Ripple checks the user’s program for a list of common dependencies, such as numpy or scikit-learn, and links the functions to their corresponding dependencies [12, 13]. If the layer does not exist, the user can create and add the new layer to Ripple. Ripple does not use any additional containerization or virtualization beyond what AWS Lambda implements by default.

Tracing and monitoring: Ripple needs to monitor the execution of active functions to adhere to the specified scheduling policies, and to ensure performant and fault-tolerant operation. Since AWS Lambda does not provide handlers to active functions, Ripple periodically polls S3 to see if a function has finished, and records the function instantiation and completion events in its execution log. The asynchronous nature of Lambda makes identifying failed functions challenging. For every job, Ripple spawns a thread to monitor the job’s progress based on its logs. Ripple waits until the timeout period specified by the user for a given function’s log. If the log does not appear, Ripple re-invokes the function. For scheduling algorithms, such as deadline-based scheduling, Ripple needs the ability to “pause” a job. To enable this, we specify a pause parameter for each job that indicates when functions for that job should stop executing. To re-execute, Ripple re-triggers the pipeline, and checks if the task has already executed. If so, the function simply re-triggers its child processes. We have verified that the tracing system has no noticeable impact on either Lambda throughput or latency.
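As a rough illustration of this monitoring loop, the sketch below polls S3 for per-function log objects and re-invokes any function whose log has not appeared within the timeout. The bucket name, key layout, and task dictionaries are hypothetical; only the boto3 calls themselves are standard.

import json
import time
import boto3

s3 = boto3.client("s3")
aws_lambda = boto3.client("lambda")

def monitor_job(tasks, timeout_s, poll_s=5):
    # tasks: list of dicts with "function", "payload", and the S3 "log_key"
    # that the function is expected to write when it completes
    pending = {t["log_key"]: (t, time.time()) for t in tasks}
    while pending:
        time.sleep(poll_s)
        listing = s3.list_objects_v2(Bucket="my-log", Prefix="job-1234/")
        finished = {obj["Key"] for obj in listing.get("Contents", [])}
        for key, (task, started) in list(pending.items()):
            if key in finished:
                del pending[key]            # log appeared: the function completed
            elif time.time() - started > timeout_s:
                # No log within the timeout: re-invoke the function asynchronously
                aws_lambda.invoke(FunctionName=task["function"],
                                  InvocationType="Event",
                                  Payload=json.dumps(task["payload"]))
                pending[key] = (task, time.time())  # reset the timer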


Application | JSON file | run functions
SpaceNet | 16 | 250
Proteomics | 25 | 86
DNA Compression | 13 | 36

Table 2: Lines of Ripple code for each application. "JSON file" shows the LoC for the configuration file, and "run functions" shows the LoC for the specific logic of the application.


Testing library: Due to the asynchronous nature of serverless, testing can be difficult, as logs are spread over multiple files. Ripple contains logic to make testing code easier. A user can specify files to use as source data, and Ripple will locally and serially go through each stage of the pipeline.
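The local testing mode can be approximated with the generic pattern below, which is not Ripple's actual test API: each stage is modeled as a plain function from input file paths to output file paths, so a whole pipeline can be replayed serially on a local machine.

from typing import Callable, List

Stage = Callable[[List[str]], List[str]]

def run_pipeline_locally(stages: List[Stage], source_files: List[str]) -> List[str]:
    files = source_files
    for stage in stages:
        # Run each stage serially on the outputs of the previous one,
        # mirroring what the Lambda pipeline does asynchronously on S3 objects
        files = stage(files)
    return files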

Fault tolerance: Finally, Ripple itself may fail for a number of reasons, including network outage, server failure, or internal error. To ensure that such a failure does not leave active functions unmanaged, we maintain Ripple’s execution log in persistent storage, such that, if a failure occurs, a hot stand-by master can take over management of the service.

5. Applications

We use three applications to demonstrate the practicality and generality of Ripple. In all cases, the modifications needed to port the original applications to Ripple are minimal. Ripple exposes a high-level declarative interface, which enables users to take code designed to run on a single machine and port it to a serverless platform. Below we describe each of the application frameworks we port to Ripple. The logic of the protein analytics and DNA compression applications did not require any changes to be ported to Ripple. The SpaceNet Building Border Identification application was reimplemented in Python, given AWS’s language limitations. Table 2 shows the Ripple lines of code needed for the Ripple JSON configuration file, and for the run functions that are specific to each application’s logic.

5.1. SpaceNet Building Border Identification

SpaceNet hosts a series of challenges, including identifying buildings from satellite images [15]. The developers provide TIFF training and test images of different areas, only some of which contain buildings. We implemented our own solution for this challenge using a K-Nearest Neighbor (kNN) algorithm [14]. For each pixel in the training images, we use the SpaceNet solutions to classify the pixel as a border, inside, or outside the building. In the test image, we identify the most similar pixels in the training images, and classify the test pixels accordingly. SpaceNet offers both high-resolution, 3-band images and low-resolution, 8-band images. For our experiments, we used the 3-band training and test images. Fig. 2 shows the Ripple pipeline for SpaceNet.

Figure 2: Pipeline for SpaceNet. The convert stage shards the input data. map pairs test with training data chunks. The KNN stage performs a brute-force kNN. The first combine stage finds the absolute nearest neighbors for every pixel, and the second combine stage combines all results into one file. The last stage colors the identified border pixels.
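For reference, the per-chunk classification stage can be sketched with scikit-learn's brute-force kNN as follows. It mirrors the approach described above (features are the RGB values of a pixel and its 8 surrounding neighbors, 100 nearest neighbors), but it is a simplified stand-in for the actual Lambda function; data loading and the surrounding Ripple stages are omitted.

from sklearn.neighbors import KNeighborsClassifier

def classify_chunk(train_feats, train_labels, test_feats, k=100):
    # train_feats: (n_train, 27) array (9 pixels x RGB); train_labels: border /
    # inside / outside; test_feats: features for one 1,000-pixel test chunk
    knn = KNeighborsClassifier(n_neighbors=k, algorithm="brute")
    knn.fit(train_feats, train_labels)
    # Distances and indices of the k nearest training pixels per test pixel,
    # which a later combine stage can merge across training-data chunks
    distances, indices = knn.kneighbors(test_feats, return_distance=True)
    return distances, indices, knn.predict(test_feats)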

5.2. Proteomics

Tide is a protein analytics tool used for analyzing protein sequences [24]. Tide takes experimental data and scores them against theoretical sequences, defined in a file containing protein sequences predicted from the genome sequence [24, 39]. Experimental sequences are in mzML format, an XML-based format [25, 59], and theoretical sequences are in FASTA format [38]. Results from Tide are often given to an ML application called Percolator to score the confidence of the protein composition identification [50]. We evaluate both Tide and Percolator in a single pipeline. Fig. 3 shows the Ripple pipeline for the proteomics application.

Figure 3: Proteomics with Tide [24]. We first split the input into shards. Next, Tide determines the protein composition of each chunk. The output is processed by Percolator.

5.3. DNA Compression

METHCOMP is an application to compress and decompress DNA methylation data [63]. The data from this pipeline are stored in the BED format, which is newline-delimited [6]. This format remains human-readable, but readability comes at the expense of high storage costs. Nonetheless, METHCOMP is both memory- and disk-intensive, and the algorithm does not use threads or multiple cores by default. Figure 4 shows the Ripple pipeline for the METHCOMP application. The compression pipeline was modified to sort the input dataset.

6. Methodology

We run each application on AWS Lambda using Ripple. Unless otherwise specified, we run experiments in regions that had a max concurrency of 1,000 functions. We compare Ripple to two other setups: AWS EC2 with resource autoscaling [3], and PyWren [49].

Figure 4: Pipeline for DNA compression. The input BED file is sorted using Radix sort. This involves finding pivots in chunk segments. Once pivots are found, the file is split, and chunks are compressed in parallel.

EC2 Autoscaling: We use the Amazon Elastic MapReduce (EMR) cluster to compare Ripple to EC2 autoscaling. EMR only allows servers to be added if usage for one or more resources (CPU, memory) is above a certain threshold for more than five minutes. Figure 5 shows how Tide scales using EMR’s default policy if a job is sent every 10 seconds for an hour. A server is added if CPU utilization is above 70%, and removed if CPU utilization is below 30%. The bottom portion of the graph shows the number of pending jobs. The default 5-minute scaling policy could not handle rapid load increases, taking almost 2 hours to process all requests and an additional 2 hours to terminate all servers.

To show that even a more agile version of EC2’s autoscaling has suboptimal elasticity compared to Ripple, we also implemented a version of EC2 autoscaling that increments and decrements instances at 10s granularity. We implemented this policy using the boto3 API, and ran all jobs inside Docker containers to simplify scaling out [7, 8]. Since the Docker containers often needed significant disk space, we allocated a 26GB volume of persistent storage for every EC2 machine.

PyWren: PyWren is an open-source programming framework for serverless, and the most closely-related prior work to Ripple [49]. PyWren is designed following the MapReduce style of distributed processing, with mappers running on AWS Lambda as a single map phase, followed by reducers running on traditional EC2 instances.

Below we describe each application’s setup in Ripple, PyWren, and EC2.

6.1. SpaceNet Building Border Identification

We use 1,000 images to train the border classifier. The feature vector for each pixel is the RGB value of the pixel and its 8 surrounding neighbors. We store these results in an S3 bucket. For each test image, we use the 100 closest neighbors to classify a pixel. Finally, we color the identified border pixels. For all implementations, we use the scikit-learn library to compute kNN using brute force [13]. For SpaceNet, we ran all Lambda experiments in a region where the max concurrency was 5K functions.

Ripple setup: We create a function to convert TIFF testing images to their feature vectors. Given the amount of training and testing data, we split testing images into subsets of 1,000 pixels, and map each subset to a subset of training data to run kNN. This typically results in approximately 180 subsets of pixels, each paired with around 40 chunks of training data, and a total of 7K Lambdas. Results are combined in two phases; the first phase finds the absolute 100 nearest neighbors for each pixel, and the second phase concatenates the partial outputs. Each pipeline stage is triggered when the output of the previous phase is written to S3.

Figure 5: Tide runtime on EC2 for a uniform simulation using the default autoscaling policy (in 5-minute intervals).

PyWren setup: For fairness, we attempted to run all parallel tasks on Lambda; however, the first combine step caused out-of-memory errors on Lambda; we subsequently moved the combine tasks to an r4.16xlarge EC2 instance.

EC2 setup: Computing the k nearest neighbors is very memory-intensive, so we use r5a.xlarge machines.

6.2. Proteomics

Ripple setup: Tide frequently consumes more than the maximum 3GB of memory offered by Lambda. Therefore, the first step is splitting the mzML input file into small chunks. We then run Tide on each file. Afterwards, we combine the results into a single output file. For all experiments, we used human FASTA files.

PyWren setup: Tide is both CPU- and memory-intensive, therefore we use Lambda for parallel phases, and a t2.xlarge EC2 instance for serial computation.

EC2 setup: We again use t2.xlarge instances.
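The Ripple setup above (split the mzML input, run Tide on each chunk, combine the results, run Percolator) could be expressed with the interface of Table 1 and Listing 1 roughly as follows. The application names, bucket names, chunk size, and argument spellings are illustrative rather than the exact configuration used in the paper.

import ripple

config = {"region": "us-west-2", "role": "aws-role", "memory_size": 3008}

pipeline = ripple.Pipeline(name="proteomics", table="s3://tide-bucket",
                           log="s3://tide-log", timeout=600, config=config)

# Shard the mzML input into chunks small enough for a single Lambda
input = pipeline.input(format="mzML")
step = input.split(params={"split_size": 100 * 1000 * 1000})

# Score each chunk against the FASTA database with Tide
step = step.run("tide", params={"database_bucket": "s3://fasta-files"})

# Merge the per-chunk results into a single output file, then run Percolator
step = step.combine()
step = step.run("percolator")

pipeline.compile("json/tide.json")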

6.3. DNA Compression

Ripple setup: We first use Ripple’s sort to sort the input so that similar sequences are near each other, and subsequently compress each data chunk.

PyWren setup: We again use Lambda for parallel phases, and a t2.xlarge EC2 instance for serial computation.

EC2 setup: As before, we use t2.xlarge instances.

7. Evaluation

Below, we answer five questions, pertaining to the performance, cost, fault tolerance, and scalability of Ripple, as well as its ability to preserve the QoS requirements of diverse jobs.
• How well does Ripple’s provisioning system preserve QoS for incoming jobs (Sec. 7.1)?
• How robust is Ripple across job distributions (Sec. 7.2)?
• How does Ripple compare to PyWren (Sec. 7.3)?
• How does Ripple scale as job concurrency increases (Sec. 7.4)?
• How does the fault tolerance mechanism in Ripple affect application performance (Sec. 7.5)?


Figure 6: (a) Error between execution time estimated by SGD in Ripple across the three applications, and measured execution time with the corresponding resource configuration. (b) Performance with Ripple compared to using the default 1MB data chunks per Lambda, and the maximum number of Lambdas allowed, across the three applications.

Application | Ripple | 1MB | Max Lambdas
SpaceNet | $2.77 | $3.81 | $4.61
Proteomics | $0.42 | $0.61 | $0.53
DNA Compression | $0.36 | $0.51 | $0.42

Table 3: Cost across applications for (i) Ripple, (ii) the default split size, and (iii) the max number of concurrent Lambdas.


7.1. Automated Provisioning

Performance estimation accuracy: We first examine Ripple’s ability to correctly provision resources in a serverless cluster. We assume that jobs from all three applications are submitted, and for simplicity, each job is trying to maximize its performance, instead of specifying a particular deadline. As discussed in Section 3, Ripple will first launch two short-lived canary runs for each job, with different degrees of concurrency, apply SGD to infer their performance with all other concurrency degrees, and determine the configuration that minimizes execution time for each new job. Fig. 6a shows the distribution of error between the performance estimated by Ripple using Stochastic Gradient Descent (SGD) and the actual performance measured on an AWS Lambda cluster when running with the corresponding degree of concurrency. Errors are low across all three applications, which allows Ripple to accurately determine the number of Lambdas per execution phase that will maximize an application’s performance. The higher errors at the tails of the three violin plots correspond to the first few jobs that arrive, at which point Ripple’s knowledge of application characteristics is limited, and the number of applications in the SGD matrix is small. As more applications arrive in the system, the estimation accuracy improves.

Profiling and inference with SGD add some scheduling overhead to each application. For the examined applications, profiling overheads range from 260ms for jobs with 1-3 stages to 6s for jobs with many stages, like the kNN classification. Inference overheads are less than 60ms in all cases. Aggregate overheads are always less than 4% of the job’s execution time.

Performance predictability: Fig. 6b shows the distribution of execution time for 100 jobs of each application class with Ripple’s provisioning mechanism, compared to (i) using the default 1MB data chunk size, and (ii) always using the maximum number of Lambdas that AWS allows (1,000 for our setting). Across all application classes Ripple achieves the best performance, especially in the case of kNN classification, where resource demands are high, and incorrect provisioning has a severe impact on execution time. More importantly, performance with Ripple is highly predictable, despite the framework not having control over where serverless functions are physically placed on AWS. This is in part because Ripple selects the number of Lambdas to be sufficient to exploit the job’s parallelism, but not high enough to cause long queueing delays from AWS’s resource limit, and in part because of Ripple’s straggler detection technique. Straggler respawning reduces variability by respawning the few tasks that are underperforming due to interference or faulty data records.

In comparison, execution time for the 1MB and maxLambdas policies varies widely and experiences long tails, especially when using the maximum concurrency allowed. There are two reasons for this. First, when using the default data chunk size with very large input datasets, as is the case with the examined applications, the number of total Lambdas that need to execute far exceeds the total allowed limit, leading to long queueing delays, as functions wait for resources to become available. Even if AWS’s resource limit increased, fine-grained parallelism does not come for free, incurring overheads for work partitioning, synchronization, and combining the per-function outputs. Second, when using the maximum number of allowed Lambdas with large datasets, e.g., 500GB, each function is still responsible for a large amount of data, 333MB in this case. Given the strict time limit Lambdas have today, this causes several functions to time out and be respawned by AWS. Note that this is not a case of a function becoming a straggler, and Ripple’s straggler respawning technique would only increase the system load further by respawning functions that will time out again.

Table 3 also shows the cost for each policy across applications. Ripple incurs the lowest cost, since it avoids excessive task queueing and respawning. The maxLambdas policy has the highest cost in the resource-intensive SpaceNet application, which results in many Lambdas timing out and needing to rerun, while for the other two scenarios, the excessive number of Lambdas used by the 1MB default split size policy incurs the highest cost, since the increased number of Lambdas does not also result in faster execution.

7.2. Workload Distributions

We now show Ripple’s elasticity under different job arrival distributions, compared to Amazon EC2. We examine uniform, bursty, and diurnal load patterns.


Figure 7: Tide performance under uniform load, with 1 job per 10s. Panels: (a) Ripple, (b) Amazon EC2. (a) and (b) show the total vCPUs used (green), and the total (brown) and running (red) jobs on the top. On the bottom, we show the pending jobs waiting to run (yellow).

Figure 8: Tide performance under bursty load (1 job/min). Every 90 min, a burst of 100 jobs arrives. Panels: (a) Ripple, (b) EC2. (a) and (b) show the used vCPUs (green), and the total (brown) and running (red) jobs. The bottom shows the pending jobs (yellow).

Uniform: In the case of the Tide proteomics framework, we create a uniform workload by sending a new job every 10s for an hour. Figure 7 shows that, both for EC2 and Lambda, due to the frequency of job arrivals, there is always at least one task running, with Ripple being able to immediately scale to meet the resource demands. EC2 was able to handle the input load once enough machines were launched, but as a result of the initial adjustment period, it took an additional 6 minutes to finish the entire scenario. With Ripple, the average job completion time was 4.5× faster than EC2, 2 minutes on average, whereas the average job completion time for EC2 was approximately 8.5 minutes. The cost for Lambda is also lower compared to EC2. Similarly, DNA compression achieves 8× faster execution with Ripple compared to EC2 under a uniform load (85s with Ripple as opposed to 11 minutes with EC2).

We also examine a uniform workload distribution for SpaceNet, where we sent one job every 5 minutes for an hour. Figure 10 shows the results for Ripple and EC2. Both Ripple and EC2 were immediately able to handle new incoming requests; however, due to the memory requirements of each classification job, the average job completion time for Ripple was more than an order of magnitude faster (i.e., 80× faster) at 4.5 minutes, with EC2 requiring approximately 6 hours to complete an equal amount of work.

Bursty: We now initiate a Tide job every minute for 8 hours.

Figure 9: Tide performance under diurnal load. Over 30 min, we increased the jobs from 0 to 15 and back to 0. Panels: (a) Ripple, (b) EC2. (a) and (b) show the used vCPUs (green), and the total (brown) and running (red) jobs. The bottom shows pending jobs (yellow).

Figure 10: SpaceNet performance under a uniform workload. Over an hour, we send 1 job every 5 min. Panels: (a) Ripple, (b) EC2. Both graphs show the vCPUs used by running jobs (green), as well as the total number of jobs actively processed (red).

Every 90 minutes, we additionally send a burst of 100 jobs to emulate a sudden load spike. Figure 8 shows that, while the burst is happening, Ripple used almost 700 vCPUs for approximately 400 concurrent functions, which allowed it to immediately handle all 100 jobs with no performance degradation for any individual job. EC2, on the other hand, incurs the VM instantiation overhead, and hence takes a long time to scale resources to meet the requirements of all 100 jobs. The figure shows that initially most jobs remained idle in the admission queue, waiting for more instances to be spawned. After the initial burst, EC2 did a better job at handling the bursts; however, this was mostly because the machines were not given enough time to terminate during the non-bursty periods. Ripple, on the other hand, was able to handle the bursts by rapidly scaling the number of active Lambdas up and down, without incurring high costs by retaining unused resources during periods of low load. As a result, the average job completion time was 5× faster for Ripple at about 2 minutes, compared to 10 minutes for EC2. The results are similar for DNA compression, with Ripple achieving 8× faster execution compared to EC2.

Diurnal: Finally, we examine a diurnal job arrival distribution, which emulates the load fluctuation of many user-interactive cloud applications [37, 56, 57, 60]. During a 30-minute interval within a larger 4-hour period, we progressively increase the number of arriving jobs, and then progressively decrease back to zero new jobs. We repeat this pattern multiple times within the 4-hour period.


Figure 11: Comparison between PyWren and Ripple for SpaceNet: (a) shows the cumulative cost of Lambda for Ripple and PyWren, as well as the cost of EC2 and the overall cost for PyWren; (b) shows a comparison of the number of vCPUs used at each point in time by the two frameworks.

Figure 12: Performance of (a) 100 versus (b) 1,000 concurrent Tide jobs. The graphs show a breakdown of the number of Lambdas actively used during each execution phase.

Figure 9 shows the diurnal results for Ripple and EC2. The results show that the average job completion time was 6.75× faster for Ripple, at about 2 minutes, whereas it was 13.5 minutes for EC2.

7.3. PyWren Comparison

We now use the SpaceNet application to compare Ripple and PyWren. We send one SpaceNet job to each framework. To allow for the increased compute and memory requirements of SpaceNet, we perform this experiment in an AWS region where the max function concurrency was 5,000 Lambdas. Figure 11 shows a comparison of the number of vCPUs and the cumulative cost for each framework. The runtime for Ripple was 25.7% faster, at about 140 seconds, while the runtime for PyWren was about 176 seconds. Part of the reason is that Ripple can explicitly invoke the next stage of the pipeline, whereas stages in PyWren have to wait for S3 results. For SpaceNet, the maximum concurrency of a stage is 6,764 functions. This meant PyWren had to launch almost 7K functions simultaneously. Because Ripple provisions each execution phase separately, it did not need 7K functions for every stage of the pipeline, avoiding resource and cost inefficiencies. This translates to a $2.77 cost for a single run for Ripple and $3.61 for PyWren. The cost of the EC2 machine during this period was $0.13. The rest of the cost difference was caused by the difference in the runtime of individual serverless functions.

Figure 13: (a) CDF of Lambda completion time for 20 DNA compression jobs, with and without Ripple’s fault tolerance. (b) Comparison between normal and re-spawned tasks by Ripple.

7.4. Job Concurrency

We now evaluate how well Ripple scales with the number of concurrent jobs. Figure 12 shows the per-stage execution time breakdown for 100 concurrent Tide jobs versus 1,000 jobs. 100 concurrent jobs do not generate enough Lambdas to reach the function limit, while 1,000 concurrent jobs almost immediately hit this limit. The fluctuation in the number of active Lambdas across phases is approximately the same for both concurrency levels, despite the higher-concurrency experiment reaching the upper Lambda limit. The total runtime was about twice as long for 1,000 concurrent jobs and, in general, used twice the number of serverless functions.

7.5. Fault Tolerance

We now quantify the effectiveness of Ripple’s fault tolerance mechanism. We run 20 DNA compression jobs in parallel, and force a 10% failure probability on each task. In the case of Ripple, under-performing tasks, i.e., tasks not making sufficient progress, are proactively re-spawned to avoid performance degradation. Figure 13 shows the impact of the fault tolerance mechanism on execution time, compared to the baseline AWS system. When no fault tolerance is used, only 4 jobs complete successfully without their Lambdas reaching the 5-minute timeout limit and being terminated. Instead, when Ripple proactively detects and re-spawns stragglers, all jobs complete within AWS Lambda’s time limit, as seen in the CDF of Fig. 13a. Fig. 13b shows the tasks that had to be re-spawned by Ripple to avoid degrading the overall job’s execution time.

8. Related Work

8.1. Programming Frameworks for Serverless

Over the past few years, a number of programming frameworks for serverless have been designed that either target specific applications or aim for generality.

PyWren: PyWren targets general-purpose computation, and offers a simple programming framework in Python that evaluates the viability of serverless for distributed computing applications [49]. The authors show that embarrassingly parallel workflows, such as computational imaging, solar physics, and object recognition, can be implemented in a straightforward way using AWS Lambda. PyWren divides computation into map and reduce phases, similar to the MapReduce model [29], with mappers using Lambdas to run in parallel, while reducers rely on long-running EC2 instances for serial computation. Locus [64] extends PyWren, and uses a mixture of S3 and Redis to scale sorting on Lambda.

PyWren highlights the potential of serverless for jobs with a single parallel phase and computationally-expensive serial phases, such as reduce, which can run on a long-running EC2 instance. Unfortunately, the fact that PyWren still relies on long-running instances for serial execution phases reduces the performance and cost benefits of serverless. Additionally, since PyWren uses both serverless and EC2 for the job's workflow, cloud maintenance remains complicated for users, who must manage both serverless and on-demand resources. Finally, since PyWren provisions Lambdas for the entire job at once, it cannot adjust to the different degrees of parallelism that different computation phases often have, hurting resource efficiency and limiting the complexity of supported tasks.
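For reference, the map-style interface described above looks roughly like the following. API names follow the PyWren project, but exact signatures may vary across versions, so treat this as a sketch rather than canonical usage.

```python
# Minimal PyWren-style map example; exact API details may differ by version.
import pywren

def hypot(xy):
    x, y = xy
    return (x ** 2 + y ** 2) ** 0.5

pwex = pywren.default_executor()                 # each map task runs in its own Lambda
futures = pwex.map(hypot, [(3, 4), (5, 12), (8, 15)])
results = [f.result() for f in futures]          # reduce-style aggregation then runs serially
print(results)                                   # [5.0, 13.0, 17.0]
```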

EMR Cluster: EMR Cluster dynamically adjusts the size of a cluster depending on a user's resource needs [9]. Spark can run on top of EMR Cluster, and users are charged at per-second granularity. However, it can take several minutes to launch a new machine in an EMR Cluster, unlike serverless, where functions are initialized in under a second.

mu: mu is a serverless framework targeting a specific class of applications, namely video processing, and is used by the ExCamera video encoding service [43]. mu uses AWS Lambda for both map and reduce tasks, each framed as a short-lived worker. In addition to AWS Lambda, mu requires a long-lived EC2 server for the Coordinator, and another long-lived EC2 server to act as the Rendezvous Server. The Coordinator assigns tasks to workers and tracks application state, and the Rendezvous Server passes messages between workers.

gg: gg is a serverless framework for computations whose inter-task dependencies, such as those of distributed compilation, are explicitly known in advance. Similar to mu, gg exploits serverless to parallelize interdependent tasks [41], showing significant speedups for computation with irregular dependencies, such as software compilation. However, gg requires the dependency tree for a task to be pre-computed, making it impractical for cases where job dependencies are not known in advance or depend on the input.

SAND: SAND is a framework designed to simplify interactions between serverless functions [18]. Currently, public clouds do not provide a way for the user to express how functions interact with each other, optimizing instead for individual function execution. SAND co-locates functions from the same pipeline on the same physical machine, and reuses the same containers for a pipeline's functions. While beneficial for performance and resource utilization, SAND assumes full cluster control to enforce the placement of functions on machines, which is not possible on public clouds.

Pocket: The short execution times of Lambda functions require fast communication between functions and storage. Like Locus, Pocket addresses this by dynamically scaling storage to meet a given performance and cost constraint [52, 53]. Pocket relies on user-provided hints on the performance criticality of different jobs, their storage requirements, and their provisioning needs in terms of Lambda functions.

While all these frameworks highlight the increasing importance of developing systems for serverless compute, they currently do not provide the automated resource provisioning, scheduling, and fault tolerance mechanisms cloud applications need to meet their QoS requirements.

8.2. Dataflow Frameworks

Dataflow frameworks have been an active area of research in both academic and industrial settings for over 15 years, particularly in the context of analytics and ML applications.

Analytics Frameworks: MapReduce [29] pioneered the use of parallel, stateless dataflow computation for large analytics jobs, while Spark [2] focused on improving the performance and fault tolerance of interactive and iterative jobs by adding in-memory caching to persist state across computation phases. DryadLINQ [71] is another example of a framework that provides abstractions on top of a MapReduce-like system, enabling developers to write high-level programs which the system transparently compiles into distributed computation.

ML Frameworks: GraphLab [58] and Naiad [62] are both programming frameworks optimized for ML applications. DistBelief [28], TensorFlow [16], and MXNet [26] provide similar capabilities, focusing on deep neural networks. All three provide higher-level APIs for defining dataflows which consist of stateless, independent workers and stateful servers for sharing global parameters. TensorFlow's dataflow graph is static; unlike Ripple, it cannot handle dynamically-generated computation graphs. Ray [61] enables dynamic graphs, but its design is specific to reinforcement learning pipelines, while Ripple targets general parallel computation.
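To illustrate why a static dataflow graph can be limiting, the fan-out of a serverless pipeline's next stage is often only known after the input has been split at runtime. The snippet below is purely illustrative and does not correspond to any particular framework's API.

```python
# Illustrative only (no specific framework's API): the number of downstream
# tasks is decided at runtime, after the input has been split, so the
# dataflow graph cannot be fixed ahead of time as in a static-graph system.
def split_records(records, chunk_size):
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def next_stage_tasks(records, chunk_size=1000):
    chunks = split_records(records, chunk_size)
    # One downstream task per chunk; len(chunks) depends on the input size.
    return [{"stage": "process", "chunk_id": i, "payload": c}
            for i, c in enumerate(chunks)]
```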

9. Conclusion

We presented Ripple, a high-level declarative programming framework for serverless compute. Ripple allows users to express the dataflow of complex pipelines at a high level, and handles work and data partitioning, task scheduling, fault tolerance, and provisioning automatically, abstracting much of the low-level system complexity away from the user. We port three large ML, genomics, and proteomics applications to Ripple, and show significant performance benefits compared to traditional cloud deployments. We also show that Ripple is able to accurately determine the degree of concurrency needed for an application to meet its performance requirements, and further improves performance predictability by eagerly detecting and re-spawning straggler tasks. Finally, we showed that Ripple is robust across different job arrival distributions and different degrees of concurrency, and implements a number of scheduling policies, including round-robin, priority, and deadline-based scheduling.

Code Availability
Ripple is open-source software under a GPL license (https://github.com/delimitrou/Ripple), and is already in use by several research groups.

Acknowledgements
We sincerely thank Robbert van Renesse, Zhiming Shen, Shuang Chen, Yanqi Zhang, Yu Gan, and Neeraj Kulkarni for their feedback on earlier versions of this manuscript. This work was partially supported by NSF grant CNS-1704742, NSF grant CNS-1422088, a Facebook Faculty Research Award, a VMware Research Award, a John and Norma Balen Sesquicentennial Faculty Fellowship, and generous donations from GCE, Azure, and AWS.

References
[1] Apache OpenWhisk. https://openwhisk.apache.org.
[2] Apache Spark. https://spark.apache.org/.
[3] AWS Auto Scaling. https://aws.amazon.com/autoscaling/.
[4] AWS Lambda. https://aws.amazon.com/lambda/.
[5] AWS Step Functions. https://aws.amazon.com/step-functions.
[6] bedMethyl. https://www.encodeproject.org/wgbs#outputs.
[7] Boto 3. https://boto3.amazonaws.com.
[8] Docker. https://www.docker.com/.
[9] EMR Cluster. https://aws.amazon.com/emr/.
[10] Google Cloud Functions. https://cloud.google.com/functions/.
[11] Microsoft Azure Functions. https://azure.microsoft.com/en-us/services/functions/.
[12] NumPy. http://www.numpy.org/.
[13] scikit-learn. https://scikit-learn.org/.
[14] An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 1992.
[15] SpaceNet on Amazon Web Services (AWS). The SpaceNet Catalog, 2018.
[16] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16), pages 265–283, Berkeley, CA, USA, 2016. USENIX Association.

[17] Faraz Ahmad, Srimat T. Chakradhar, Anand Raghunathan, and T. N. Vijaykumar. Tarazu: Optimizing MapReduce on heterogeneous clusters. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). London, UK, 2012.
[18] Istemi Ekin Akkus, Ruichuan Chen, Ivica Rimac, Manuel Stein, Klaus Satzke, Andre Beck, Paarijaat Aditya, and Volker Hilt. SAND: Towards high performance serverless computing. In 2018 USENIX Annual Technical Conference (USENIX ATC'18), 2018.
[19] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Effective straggler mitigation: Attack of the clones. In Proceedings of the USENIX Symposium on Networked Systems Design and Implementation (NSDI). Lombard, IL, 2013.
[20] Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. Reining in the outliers in map-reduce clusters using Mantri. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI). Vancouver, CA, 2010.
[21] Aman Bakshi and Yogesh B. Dujodwala. Securing cloud from DDoS attacks using intrusion detection system in virtual machine. In Proc. of the Second International Conference on Communication Software and Networks (ICCSN). 2010.
[22] Luiz Barroso. Warehouse-scale computing: Entering the teenage decade. ISCA Keynote, SJ, June 2011.
[23] Luiz Barroso and Urs Hoelzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. MC Publishers, 2009.
[24] BJ Diament and WS Noble. Faster SEQUEST searching for peptide identification from tandem mass spectra. In Journal of Proteome Research, 2011.
[25] Matthew C Chambers, Brendan Maclean, Robert Burke, Dario Amodei, Daniel L Ruderman, Steffen Neumann, Laurent Gatto, Bernd Fischer, Brian Pratt, Jarrett Egertson, Katherine Hoff, Darren Kessner, Natalie Tasman, Nicholas Shulman, Barbara Frewen, Tahmina A Baker, Mi-Youn Brusniak, Christopher Paulse, David Creasy, Lisa Flashner, Kian Kani, Chris Moulding, Sean L Seymour, Lydia M Nuwaysir, Brent Lefebvre, Frank Kuhlmann, Joe Roark, Rainer Paape, Detlev Suckau, Tina Hemenway, Andreas Huhmer, James Langridge, Brian Connolly, Trey Chadick, Krisztina Holly, Josh Eckels, Eric W Deutsch, Robert L Moritz, Jonathan E Katz, David B Agus, Michael MacCoss, David L Tabb, and Parag Mallick. A cross-platform toolkit for mass spectrometry and proteomics. Nature Biotechnology, 2012.
[26] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015.
[27] Charles E. Cook, Mary Todd Bergman, Robert D. Finn, Guy Cochrane, Ewan Birney, and Rolf Apweiler. The European Bioinformatics Institute in 2016: Data growth and integration. Nucleic Acids Research, 2016.
[28] Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. Large scale distributed deep networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS'12), pages 1223–1231, USA, 2012. Curran Associates Inc.
[29] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI), pages 137–150. San Francisco, CA, December 2004.

[30] Christina Delimitrou and Christos Kozyrakis. Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Houston, TX, USA, 2013.
[31] Christina Delimitrou and Christos Kozyrakis. QoS-Aware Scheduling in Heterogeneous Datacenters with Paragon. In ACM Transactions on Computer Systems (TOCS), Vol. 31, Issue 4. December 2013.
[32] Christina Delimitrou and Christos Kozyrakis. Quality-of-Service-Aware Scheduling in Heterogeneous Datacenters with Paragon. In IEEE Micro Special Issue on Top Picks from the Computer Architecture Conferences. May/June 2014.
[33] Christina Delimitrou and Christos Kozyrakis. Quasar: Resource-Efficient and QoS-Aware Cluster Management. In Proceedings of the Nineteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Salt Lake City, UT, USA, 2014.
[34] Christina Delimitrou and Christos Kozyrakis. HCloud: Resource-Efficient Provisioning in Shared Cloud Systems. In Proceedings of the Twenty First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 2016.
[35] Christina Delimitrou and Christos Kozyrakis. Bolt: I Know What You Did Last Summer... In The Cloud. In Proc. of the Twenty Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.
[36] Christina Delimitrou, Daniel Sanchez, and Christos Kozyrakis. Tarcil: Reconciling Scheduling Speed and Quality in Large Shared Clusters. In Proceedings of the Sixth ACM Symposium on Cloud Computing (SOCC), August 2015.
[37] Christina Delimitrou, Sriram Sankar, Aman Kansal, and Christos Kozyrakis. ECHO: Recreating Network Traffic Maps for Datacenters of Tens of Thousands of Servers. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). San Diego, CA, USA, 2012.
[38] Lipman DJ and Pearson WR. Rapid and sensitive protein similarity searches. Science, 1985.
[39] JK Eng, AL McCormack, and JR Yates. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. In Journal of the American Society for Mass Spectrometry, 1994.


[40] Sadjad Fouladi, John Emmons, Emre Orbay, Catherine Wu, Riad S. Wahby, and Keith Winstein. Salsify: Low-latency network video through tighter integration between a video codec and a transport protocol. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pages 267–282, Renton, WA, April 2018. USENIX Association.
[41] Sadjad Fouladi, Dan Iter, Shuvo Chatterjee, Christos Kozyrakis, Matei Zaharia, and Keith Winstein. A thunk to remember: make -j1000 (and other jobs) on functions-as-a-service infrastructure. 2017.
[42] Sadjad Fouladi, Francisco Romero, Dan Iter, Qian Li, Shuvo Chatterjee, Christos Kozyrakis, Matei Zaharia, and Keith Winstein. From laptop to lambda: Outsourcing everyday jobs to thousands of transient functional containers. In 2019 USENIX Annual Technical Conference (USENIX ATC 19), pages 475–488, Renton, WA, July 2019. USENIX Association.
[43] Sadjad Fouladi, Riad S. Wahby, Brennan Shacklett, Karthikeyan Vasuki Balasubramaniam, William Zeng, Rahul Bhalerao, Anirudh Sivaraman, George Porter, and Keith Winstein. Encoding, fast and slow: Low-latency video processing using thousands of tiny threads. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 363–376, Boston, MA, March 2017. USENIX Association.
[44] Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayantara Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jackson, Kelvin Hu, Meghna Pancholi, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Yuan He, and Christina Delimitrou. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud and Edge Systems. In Proceedings of the Twenty Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 2019.
[45] Yu Gan, Yanqi Zhang, Kelvin Hu, Yuan He, Meghna Pancholi, Dailun Cheng, and Christina Delimitrou. Seer: Leveraging Big Data to Navigate the Complexity of Performance Debugging in Cloud Microservices. In Proceedings of the Twenty Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 2019.
[46] Rohan Gandhi and Amit Sabne. Finding stragglers in Hadoop. In Technical Report. 2011.
[47] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell., 6(6):721–741, November 1984.
[48] Scott Hendrickson, Stephen Sturdevant, Tyler Harter, Venkateshwaran Venkataramani, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. Serverless computation with OpenLambda. In 8th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 16), 2016.
[49] Eric Jonas, Shivaram Venkataraman, Ion Stoica, and Benjamin Recht. Occupy the cloud: Distributed computing for the 99%. CoRR, abs/1702.04024, 2017.
[50] L Kall, JD Canterbury, J Weston, WS Noble, and MJ MacCoss. Semi-supervised learning for peptide identification from shotgun proteomics datasets. 2007.
[51] Ana Klimovic, Heiner Litz, and Christos Kozyrakis. Selecta: Heterogeneous cloud storage configuration for data analytics. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), pages 759–773, Boston, MA, July 2018. USENIX Association.
[52] Ana Klimovic, Yawen Wang, Christos Kozyrakis, Patrick Stuedi, Jonas Pfefferle, and Animesh Trivedi. Pocket: Ephemeral storage for serverless analytics. USENIX ATC, 2018.
[53] Ana Klimovic, Yawen Wang, Christos Kozyrakis, Patrick Stuedi, Animesh Trivedi, and Jonas Pfefferle. Pocket: Ephemeral storage for serverless analytics. October 2018.
[54] Christos Kozyrakis, Aman Kansal, Sriram Sankar, and Kushagra Vaid. Server engineering insights for large-scale online services. IEEE Micro, 30(4):8–19, July 2010.
[55] Jimmy Lin. The curse of Zipf and limits to parallelization: A look at the stragglers problem in MapReduce. In Proceedings of the LSDS-IR Workshop. Boston, MA, 2009.

[56] David Lo, Liqun Cheng, Rama Govindaraju, Luiz André Barroso, and Christos Kozyrakis. Towards energy proportionality for large-scale latency-critical workloads. In Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA). Minneapolis, MN, 2014.
[57] David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. Heracles: Improving resource efficiency at scale. In Proc. of the 42nd Annual International Symposium on Computer Architecture (ISCA). Portland, OR, 2015.
[58] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. GraphLab: A new framework for parallel machine learning. CoRR, abs/1006.4990, 2010.
[59] Lennart Martens, Matthew Chambers, Marc Sturm, Darren Kessner, Fredrik Levander, Jim Shofstahl, Wilfred H. Tang, Andreas Rompp, Steffen Neumann, Angel D. Pizarro, Luisa Montecchi-Palazzi, Natalie Tasman, Mike Coleman, Florian Reisinger, Puneet Souda, Henning Hermjakob, Pierre-Alain Binz, and Eric W. Deutsch. mzML - a community standard for mass spectrometry data. In Molecular and Cellular Proteomics, 2010.
[60] David Meisner, Christopher M. Sadler, Luiz André Barroso, Wolf-Dietrich Weber, and Thomas F. Wenisch. Power management of online data-intensive services. In Proceedings of the 38th Annual International Symposium on Computer Architecture, pages 319–330, 2011.
[61] Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: A distributed framework for emerging AI applications. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'18), pages 561–577, Berkeley, CA, USA, 2018. USENIX Association.
[62] Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. Naiad: A timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13), pages 439–455, New York, NY, USA, 2013. ACM.
[63] Jianhao Peng, Olgica Milenkovic, and Idoia Ochoa. Methcomp: A special purpose compression platform for DNA methylation data. Bioinformatics, 34(15):2654–2656, 2018.
[64] Qifan Pu, Shivaram Venkataraman, and Ion Stoica. Shuffling, fast and slow: Scalable analytics on serverless infrastructure. NSDI, 2019.
[65] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 693–701. Curran Associates, Inc., 2011.
[66] Charles Reiss, Alexey Tumanov, Gregory Ganger, Randy Katz, and Michael Kozych. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of SOCC. 2012.
[67] Thomas Ristenpart, Eran Tromer, Hovav Shacham, and Stefan Savage. Hey, you, get off of my cloud: Exploring information leakage in third-party compute clouds. In Proc. of the ACM Conference on Computer and Communications Security (CCS). Chicago, IL, 2009.
[68] Konstantin Shvachko, Hairong Kuang, and Sanjay Radia. The Hadoop distributed file system. 2010.
[69] Anand Srinivasan and Sanjoy Baruah. Deadline-based scheduling of periodic task systems on multiprocessors. Inf. Process. Lett., 84(2):93–98, October 2002.
[70] Zachary D. Stephens, Skylar Y. Lee, Faraz Faghri, Roy H. Campbell, Chengxiang Zhai, Miles J. Efron, Ravishankar Iyer, Michael C. Schatz, Saurabh Sinha, and Gene E. Robinson. Big data: Astronomical or genomical? PLoS Biol, 2015.
[71] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI'08), pages 1–14, Berkeley, CA, USA, 2008. USENIX Association.
[72] Matei Zaharia, Dhruba Borthakur, Joydeep Sen Sarma, Khaled Elmeleegy, Scott Shenker, and Ion Stoica. Delay scheduling: A simple technique for achieving locality and fairness in cluster scheduling. In Proceedings of the European Conference on Computer Systems (EuroSys). Paris, France, 2010.
[73] Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, and Ion Stoica. Improving MapReduce performance in heterogeneous environments. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI). San Diego, CA, 2008.
[74] Yinqian Zhang, Ari Juels, Alina Oprea, and Michael K. Reiter. HomeAlone: Co-residency detection in the cloud via side-channel analysis. In Proc. of the IEEE Symposium on Security and Privacy. Oakland, CA, 2011.
[75] Yinqian Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. Cross-tenant side-channel attacks in PaaS clouds. In Proc. of the ACM SIGSAC Conference on Computer and Communications Security (CCS). Scottsdale, AZ, 2014.
[76] Yinqian Zhang, Ari Juels, Michael K. Reiter, and Thomas Ristenpart. Cross-VM side channels and their use to extract private keys. In Proceedings of the ACM Conference on Computer and Communications Security (CCS). Raleigh, NC, 2012.
