
Benchmarking Elasticity of FaaS Platforms as a Foundation for Objective-driven Design of Serverless Applications

Jörn Kuhlenkamp, ISE, TU Berlin, Berlin, Germany ([email protected])
Sebastian Werner, ISE, TU Berlin, Berlin, Germany ([email protected])
Maria C. Borges, ISE, TU Berlin, Berlin, Germany ([email protected])
Dominik Ernst, TU Berlin, Berlin, Germany ([email protected])
Daniel Wenzel, TU Berlin, Berlin, Germany ([email protected])

Preprint, to appear at The 35th ACM/SIGAPP Symposium on Applied Computing (SAC ’20), March 30-April 3, 2020, Brno, Czech Republic; final version available at https://doi.org/10.1145/3341105.3373948

ABSTRACT
Application providers have to solve the trade-off between performance and deployment costs by selecting the "right" amount of provisioned computing resources for their application. The high value of changing this trade-off decision at runtime fueled a decade of combined efforts by industry and research to develop elastic applications. Despite these efforts, the development of elastic applications still demands significant time and expertise from application providers.

To address this demand, FaaS platforms shift responsibilities associated with elasticity from the application developer to the cloud provider. While this shift is highly promising, FaaS platforms do not quantify elasticity; thus, application developers are unaware of how elastic FaaS platforms are. This lack of knowledge significantly impairs effective objective-driven design of serverless applications.

In this paper, we present an experiment design and corresponding toolkit for quantifying elasticity and its associated trade-offs with latency, reliability, and execution costs. We present results for the evaluation of four popular FaaS platforms by AWS, Google, IBM, Microsoft, and show significant differences between the service offers. Based on our results, we assess the applicability of the individual FaaS platforms in three scenarios under different objectives: web serving, online data analysis, and offline batch processing.

CCS CONCEPTS
• Applied computing → Event-driven architectures; Service-oriented architectures; • Software and its engineering → Cloud computing; Software performance; • Computer systems organization → Cloud computing;

KEYWORDS
Serverless, Benchmarking, Experimentation, Elasticity

1 INTRODUCTION
Serverless Computing, or Function-as-a-Service (FaaS), offers a new alternative to develop cloud-based applications and has since emerged as a popular choice among application providers. Instead of using the cloud infrastructure directly, developers provide short-running code in the form of functions to be executed by a FaaS platform provider. This simplified programming model lets developers create applications without the burden of server-related operational tasks such as provisioning, managing, or scaling of resources.

However, by relinquishing control of the infrastructure, developers have to trust that the FaaS platform provider will manage the application in accordance with their application objectives. FaaS platforms often claim that the provided services can scale dynamically under volatile workloads. However, with no knowledge of the underlying architecture or the provider’s load balancing strategies, it is hard for developers to verify this claim. Without explicit information, application developers must select and use a FaaS platform based on gut feeling and intuition. We argue that this significantly impairs the objective-driven design of serverless applications and, thus, the dissemination of serverless applications in general.

Consequently, the need to benchmark different characteristics and qualities of FaaS platforms has been recognized by research and industry [19, 21, 26]. Elasticity has been highlighted by Jonas et al. [11] to be of particular interest, due to the unpredictable performance of cloud functions observed to date.

Here, we advocate for a comparison between FaaS platforms under volatile workloads. Such a comparison will allow developers to assess the impact of such workloads on specific objectives set during the design phase. With our work, we provide the first contributions to answer the research question:

Does the quality of a FaaS platform change from a client-side perspective under volatile workloads, and is this change relevant in an application context?

We build on our extensive expertise and previous work on quality-driven design and evaluation of cloud-based systems [2, 3, 14, 17] and propose a new experiment design and corresponding toolkit for application developers and researchers to use in the experimental evaluation.

Further, based on the results of the experiments, we discuss the current applicability of the various FaaS platforms for different application scenarios. Thus, in our work, we present the following contributions over the state-of-the-art:

• C1: A novel experiment design and corresponding toolkit for the assessment of FaaS platform quality under volatile workloads from a client-side perspective.


• C2: An experiment-based evaluation of four major FaaS platform providers based on C1.

• C3: A discussion of the applicability of FaaS platforms in different application scenarios based on results from C2.

The remainder of the paper is structured as follows: we present relevant background (sec. 2), give a refined problem statement (sec. 3), discuss related work (sec. 4), present our experiment design (sec. 5) and results (sec. 6), discuss applicability of FaaS platforms in different scenarios (sec. 7), and conclude in sec. 8.

2 BACKGROUND
We present background on FaaS platforms and experiment-driven evaluation of cloud-based systems and applications.

2.1 FaaS Platforms
FaaS platforms facilitate general serverless computing. FaaS is available from all major cloud providers, with offerings such as AWS Lambda or Google Cloud Functions. These providers manage and run serverless infrastructure in their massive data centers. Besides these commercial offerings, many open source FaaS solutions exist (e.g., Fission, Knative), which can run on commodity hardware.

All these implementations share the following properties: First, the unit of computation, a function, is defined by a developer with a deployment package. The deployment package contains all the code and libraries necessary to run the function. The FaaS platform provides and manages the operating system and the function program runtime, e.g., Python or Node.js, used to execute that function.

Second, the function configuration defines the environment and runtime constraints of a function. The function configuration contains the events that activate the execution of a function in the platform, referred to as a function trigger. Furthermore, the function configuration also describes the function memory and function timeout, which govern the maximum amount of resources a function can use.

Lastly, most FaaS platforms support many different event triggers, such as HTTP requests or platform events such as the insertion of a new value in a managed database. The platforms, therefore, have strict limitations on the input and output interfaces, or handlers, of a function.

FaaS platforms also have similar lifecycles regardless of implementation. Generally, a function is only provisioned and run when an event triggers an execution. After execution, the function runtime is kept alive only for a short period before it is destroyed again. A function runtime, however, is reused, should a new event arrive before the runtime is destroyed. On the other hand, it is also common that no runtime is provisioned, and therefore no resource is used.

2.2 Experimentation
Our experiment methods are based on a mix of (i) the combination of system performance evaluation and Design of Experiments (DoE) proposed by [12] and (ii) concepts proposed in [3, 23].

The goal of an experiment is the evaluation of a system under test (SUT). An experiment typically involves two roles: an experiment is requested by a decision-maker and executed by an experimenter.

An objective is a non-functional property or quality of a SUT of interest to a decision-maker. The term experiment environment describes all assumptions about the environment of a SUT that are not under the control of a decision-maker. The experiment environment is encoded in environment variables. An experiment is executed for a well-defined time, also called a run.

During a run, the SUT is observed. Observations that are persisted are referred to as measurements. Based on measurements, high-level metrics are calculated that characterize one or more objectives. During a run, the experimenter executes one or more treatments, a controlled change of a SUT. Treatments are encoded in treatment variables. In contrast to environment variables, treatment variables are under the control of a decision-maker.

3 PROBLEM STATEMENT
Application developers design applications to fulfill application objectives. Examples of such application objectives are targets for throughput, latency, reliability, variable execution costs for single computations, and fixed deployment costs.

Software architecture is about deferring decisions by providing options [6]. These options allow changing an application objective in the future under known conditions. We argue that these options should have no impact on other application objectives or well-defined impacts on other application objectives that are acceptable within a specific application context.

For example, using a NoSQL database with a configurable consistency level such as Cassandra allows changing consistency in the future, thus giving the option to increase consistency objectives according to future demands with a quantifiable negative impact on execution latency [12].

FaaS platforms promise the core capability to execute code without requiring an application provider to manage resource allocations [11]. Thus, the cloud automatically provisions resources to execute application code for incoming events. In an application architectural context, this capability implies that a FaaS platform provides an option for throughput-based application objectives. We refer to this option as elasticity.

Elasticity is highly valuable for many types of applications with unknown or changing throughput objectives, for example, web serving, explorative data analysis, and periodical batch processing. While the available choices for fully managed FaaS platforms and open-source FaaS platforms are considerable and continue to increase, and more applications migrate to FaaS platforms, application developers do not have a clear understanding of a core option that drives architectural design decisions: the option on variable throughput objectives.

We argue that such a clear understanding of elasticity includes: (i) quantifying the amount by which a throughput objective can change in a particular time, and (ii) quantifying the impact of changing throughput objectives on other application objectives such as latency, reliability, and execution costs. In conclusion, we define our problem statement as follows:


• P1: Do FaaS platforms deliver on the promise to process volatile target throughput?

• P2: What are the impacts of volatile target throughput on latency, reliability, and execution costs?

• P3: What are the implications of P1 and P2 on the applicability of FaaS platforms in different scenarios?

4 RELATED WORK
This paper addresses the experiment-driven evaluation of FaaS platform elasticity. We discuss related work from the perspective of benchmarking elasticity of traditional cloud platforms based on virtual machines and benchmarking FaaS platforms in general.

There are many publications on the design objectives of system benchmarks [3, 4, 15]. These benchmarks all share common well-established properties, regardless of the difference in objectives. Namely, all benchmarks need to be relevant, repeatable, fair, understandable, and portable [3, 7, 25]. These properties must also apply to the evaluation proposed in this paper.

Furthermore, there is a substantial body of research on cloud service benchmarking [2, 3, 14]. Analyzing cloud services comes with additional challenges, such as less accessible measurement points and opaque changes of services over time. Therefore, benchmarking FaaS platforms must take the lessons learned in cloud service benchmarking into account. Thus, we base our experiment design on these established design guidelines for benchmarks.

In the literature, a body of work addresses elasticity benchmarking of traditional, VM-based cloud platforms [5, 8, 13, 27]. All experiment designs assume that clients have control over the provisioning of computing resources. This assumption does not hold in the context of FaaS platforms; thus, these approaches are not applicable for benchmarking FaaS platforms.

Industry and research have already started to use experiment-based evaluations to analyze FaaS platforms [9, 18, 20–22]. Most of these efforts focus on performance [16]. However, the evaluation of elasticity in FaaS platforms has largely been ignored in the literature. A first proposal by Lee et al. [19] evaluates latency changes in response to volatile throughput. The proposal motivates our work, and we significantly extend its scope.

5 EXPERIMENT DESIGN
In this section, we describe our experiment design, which entails a detailed description of experiment goals, SUT setup, and workloads. The goals associated with this detailed design are understandability, verifiability, and reproducibility of results by interested readers or third parties.

5.1 Experiment Goal
The goal of our experiment design is to quantify the elasticity of FaaS platforms from a client-side perspective. More precisely, we quantify the ability of a FaaS platform to process volatile target workloads, i.e., changing numbers of events per second for a provisioned function, and the associated impacts on other client-observable qualities and execution costs. Based on our results, we want to guide application developers to account for FaaS platform-specific behavior to enable objective-driven design decisions.

5.2 System Under Test
5.2.1 FaaS Platforms. We evaluate four fully managed FaaS platforms in different data centers in London: Amazon Web Services Lambda (AWS), Google Cloud Functions (GCF), IBM Cloud Functions (ICF), and Microsoft Azure Functions (MAF).

5.2.2 Memory Size. Through preliminary experiments, we observed out-of-memory errors on ICF for memory configurations smaller than 1024 MB. Thus, we set the memory size of function handlers to 1024 MB for AWS, GCF, and ICF. MAF manages memory allocations implicitly; thus, clients are not able to configure memory size explicitly.

5.2.3 Handler Timeout. Across all application scenarios, we consider a client-side request-response latency of 30 seconds as the tolerable maximum. Thus, we configure a hard handler timeout of 30 seconds.

5.2.4 Handler Implementation. The target programming platform for all handler implementations is Node.js in major version 10. The function handler takes a natural number n as handler input and returns true or false as handler output, depending on whether n is a prime number, based on the Sieve of Eratosthenes algorithm. We use a non-optimized basic implementation, which has a time complexity of O(n log log n) and a memory requirement of O(n) for the array indicating which numbers are prime. Consequently, the function handler can be characterized as CPU- and memory-intensive. The handler implementation does not cache any results.
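For illustration, a minimal Node.js handler along these lines could look as follows. This is a sketch, not the released experiment code: the event/return shape shown is AWS Lambda-style, and the field names are our own.

```javascript
// Minimal sketch of the prime-check handler described above.
// It also records the execution timestamps EStart and EEnd (see section 5.4.2)
// and returns them to the client with the response.
function isPrime(n) {
  // Basic Sieve of Eratosthenes: O(n log log n) time, O(n) memory.
  if (n < 2) return false;
  const composite = new Uint8Array(n + 1); // non-zero marks a composite number
  for (let i = 2; i * i <= n; i++) {
    if (!composite[i]) {
      for (let j = i * i; j <= n; j += i) composite[j] = 1;
    }
  }
  return !composite[n];
}

exports.handler = async (event) => {
  const eStart = Date.now();           // EStart (UNIX ms timestamp)
  const n = Number(event.n);           // handler input, around 7.5 million
  const result = isPrime(n);           // RResult
  const eEnd = Date.now();             // EEnd (UNIX ms timestamp)
  return { statusCode: 200, body: JSON.stringify({ result, eStart, eEnd }) };
};
```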

5.2.5 Function Trigger. We trigger all functions via a public HTTPS endpoint. Thus, we use the corresponding auxiliary services for each FaaS platform to provision the HTTPS endpoint. For example, we use the API Gateway service for AWS. Each auxiliary service is configured through the serverless framework using default settings.

5.3 Workloads
Our workload model uses three sequential phases (P0-P2): warmup phase (P0), scaling phase (P1), and cooldown phase (P2). In addition, we split each phase into one-second intervals. In each one-second interval, clients send a well-defined request target to the SUT. We refer to this target as target requests per second (trps). In our workload design, a single client request results in a single event that a FaaS platform has to process.

The warmup phase P0 lasts for 60 seconds and ensures that a SUT is in a well-defined state before introducing any changes. During P0, we apply a constant rate of 0 or 60 trps. The reason behind this choice is that a FaaS platform must initialize new execution environments, which can result in visible impacts for clients, a phenomenon known in the literature as "cold starts" [22]. With a value of 0 trps in P0, we try to capture this effect in P1.

Similar to the warmup phase P0, the scaling phase P1 lasts for 60 seconds. During the scaling phase P1, we increase trps in each interval by a constant increment of 0.5, 1, or 2 requests per second. Thus, we increase trps linearly for 60 seconds.

In response to a scaling phase P1, a SUT can potentially change its behavior over a long period. Thus, the duration of the cooldown phase (P2) is 180 seconds. P2 allows monitoring a SUT for a more extended period after changing its environment in P1. We argue that a FaaS platform should be able to respond to workload changes within 3 minutes. Similar to P0, we apply a constant number of trps in P2. The number of trps equals the trps of the last interval of the scaling phase. In summary, we expose a SUT to a stable environment (P0), change the environment in a well-defined way (P1), and continue to observe the SUT in a stable environment (P2).

Table 1 summarizes all five workloads WL0-WL4 used in our experiments.

Table 1: Target Requests per Second (trps) as a Function of Elapsed Time t for the Three Phases P0 (Warmup), P1 (Scaling), and P2 (Cooldown) for the Five Workloads WL0-WL4.

Name   P0   P1              P2
WL0    0    0 + ⌈0.5 ∗ t⌉   15
WL1    0    0 + 1.0 ∗ t     60
WL2    0    0 + 2.0 ∗ t     120
WL3    60   60 + ⌈0.5 ∗ t⌉  75
WL4    60   60 + 1.0 ∗ t    120

With each request, we want to trigger an execution that should take around 1-2 seconds. Thus, the function handler input is 7.5 million plus a random number that is drawn uniformly from a small range of [0, 1000]. We use the range to avoid potential caching mechanisms within a SUT and to improve the fairness of the workload.
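As an illustration, the following sketch (our own, not part of the released toolkit) reproduces the schedule of Table 1 and the handler-input sampling; it assumes that t in Table 1 denotes seconds elapsed within the scaling phase and that the phases last 60 s (P0), 60 s (P1), and 180 s (P2) as described above.

```javascript
// Sketch of the workload schedule from Table 1 and the handler-input sampling.
const WORKLOADS = {
  WL0: { p0: 0,  slope: 0.5, ceil: true,  p2: 15 },
  WL1: { p0: 0,  slope: 1.0, ceil: false, p2: 60 },
  WL2: { p0: 0,  slope: 2.0, ceil: false, p2: 120 },
  WL3: { p0: 60, slope: 0.5, ceil: true,  p2: 75 },
  WL4: { p0: 60, slope: 1.0, ceil: false, p2: 120 },
};

// Target requests per second for a workload at elapsed experiment time t (seconds).
function trps(name, t) {
  const wl = WORKLOADS[name];
  if (t < 60) return wl.p0;                               // P0: warmup
  if (t < 120) {                                          // P1: scaling
    const inc = wl.slope * (t - 60);
    return wl.p0 + (wl.ceil ? Math.ceil(inc) : inc);
  }
  return wl.p2;                                           // P2: cooldown, value from Table 1
}

// Handler input: 7.5 million plus a uniform random offset in [0, 1000],
// used to sidestep potential result caching inside a SUT.
function handlerInput() {
  return 7500000 + Math.floor(Math.random() * 1001);
}
```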

5.4 Measurement Points
A measurement point observes a SUT or the experiment environment during an experiment run. Table 2 summarizes all measurement points that are observed during a run.

5.4.1 Result. We create a unique id (RId) for each request on the client. For each response, we record the HTTP status code (RCode) and the result of the handler (RResult) to determine if a request was successful.

5.4.2 Timestamps. For each request, we measure up to four timestamps with millisecond precision: the start of a request (RStart) and the end of a request (REnd) on the client side. Furthermore, we record the start of execution (EStart) and the end of execution (EEnd) in a function handler. We send EStart and EEnd to the client with the response. If a request has no response, we cannot measure EStart, EEnd, and REnd. Figure 1 visualizes the measurement points.
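For illustration, capturing these measurement points on the client side could look roughly as follows. This is a sketch, not the Artillery setup used in the paper; it assumes a Node.js version that provides a global fetch and AbortSignal.timeout, and that the handler returns eStart and eEnd as in the earlier handler sketch.

```javascript
// Sketch of the client-side capture of RId, RStart, REnd, RCode, RResult,
// EStart, and EEnd for a single request against the provisioned HTTPS endpoint.
const crypto = require('crypto');

async function measuredRequest(endpointUrl, n) {
  const rId = crypto.randomBytes(5).toString('hex').slice(0, 9); // RId (9 characters)
  const rStart = Date.now();                                     // RStart (UNIX ms)
  try {
    const res = await fetch(`${endpointUrl}?n=${n}`, {
      signal: AbortSignal.timeout(30000), // client-side timeout of 30 seconds
    });
    const body = await res.json();        // expected to contain result, eStart, eEnd
    const rEnd = Date.now();              // REnd (UNIX ms)
    return { rId, rStart, rEnd, rCode: res.status, rResult: body.result,
             eStart: body.eStart, eEnd: body.eEnd };
  } catch (err) {
    // No response within 30 seconds (or a transport error):
    // REnd, EStart, and EEnd cannot be measured; the request counts as failed.
    return { rId, rStart, failed: true };
  }
}
```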

5.4.3 Execution Environment. We track the execution environment (container) and the host of the execution environment (host) to quantify the internal state of a FaaS platform. As this is not trivial from a client-side perspective, we use an approach based on Lloyd et al. [20]. Each execution checks for a global field that contains a unique identifier for the container (CId). If the field does not exist, i.e., for the first execution in a container, we create the field. Similarly, we obtain a unique host identifier (HId) by inspecting /proc/uptime. Because this is not possible for MAF, we inspect the network interfaces and hostname to obtain the HId.
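A possible Node.js sketch of this identification scheme is shown below. It is our own illustration of the approach based on Lloyd et al. [20]; the identifier format and helper names are hypothetical, not the paper's released code.

```javascript
// Sketch of container (CId) and host (HId) identification inside the handler.
const fs = require('fs');
const os = require('os');
const crypto = require('crypto');

// CId: created once per execution environment. It survives warm reuse because
// module-level/global state is kept alive between invocations in the same container.
if (!global.containerId) {
  global.containerId = crypto.randomBytes(5).toString('hex').slice(0, 9);
  global.containerStart = Date.now(); // CStart
}

// HId: approximated via the host's boot time, which stays constant for all
// containers that run on the same host.
function hostId() {
  try {
    const uptime = parseFloat(fs.readFileSync('/proc/uptime', 'utf8').split(' ')[0]);
    return String(Math.round(Date.now() / 1000 - uptime)); // boot time in epoch seconds
  } catch (err) {
    // Fallback used for MAF, where /proc/uptime is not accessible:
    // derive an identifier from the hostname and network interfaces.
    return os.hostname() + JSON.stringify(os.networkInterfaces());
  }
}
```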

5.4.4 Internal Validation. We measure the operating system (COs) and programming platform (CPlat) of a container for internal validation of an experiment.

[Figure 1: Illustration of client-side and platform-side measurement points and latency-based metrics for a request. The request passes from the benchmarking client (black-box view) through the FaaS platform (trigger, routing, scheduling) to the function handler (white-box view); the measurement points RStart, EStart, EEnd, and REnd delimit the latencies RLat, DLat, and ELat.]

5.5 Metrics
Based on our measurements, we calculate two sets of high-level metrics. We use client-side metrics to analyze impacts on application objectives. We use platform-side metrics to characterize the internal behavior of a FaaS platform to develop mitigation strategies for application designers.

5.5.1 Reliability. We consider a request successful if the RCode is 200 and the RResult is correct. For all other cases, we denote a request as failed. We terminate open requests on the client side after 30 seconds and count the request as failed. For each one-second interval of an experiment run, we report the total number of successful requests RSuccess and failed requests RFailed. We argue that the ratio of RSuccess and RFailed represents an indicator of reliability application objectives.

5.5.2 Request-Response Latency. For each request, we calculate the request-response latency RLat as the difference between REnd and RStart. For each one-second interval of an experiment run, we report the minimum, median, and maximum RLat using all successful requests. RLat is an indicator of latency-based performance application objectives.

5.5.3 Request Throughput. For each one-second interval of an experiment run, we report the total number of successful requests RSuccess to characterize the throughput of a function. RSuccess is an indicator of throughput-based application objectives.

5.5.4 Execution Cost. To characterize the impacts on application cost objectives, we calculate the execution cost of each request. The pricing models of all four SUTs are a linear function of the execution time ELat, the difference between EEnd and EStart. Table 3 summarizes the different pricing models. For each one-second interval of an experiment run, we report the minimum, median, and maximum ELat based on all successful requests as an indicator of execution costs. While mean latency is generally a non-significant metric to characterize performance, we report mean ELat to aggregate execution costs.

¹ Accessed 2019/09/05: https://aws.amazon.com/lambda/pricing
² Accessed 2019/09/05: https://cloud.google.com/functions/pricing
³ Accessed 2019/09/05: https://cloud.ibm.com/functions/learn/pricing
⁴ Accessed 2019/09/05: https://azure.microsoft.com/pricing/details/functions


Table 2: Measurements for Experiment 1.

Name     Location  Value Domain      Description
RId      Client    9 characters      Unique id for each request
RStart   Client    UNIX ms TS        Start of a request from the client
REnd     Client    UNIX ms TS        Receiving a response at the client
RCode    Handler   HTTP status code  HTTP status code sent with a response [200, 429, 500, 502, 503]
RResult  Handler   N                 Result of a handler execution
EStart   Handler   UNIX ms TS        Start of an execution in a function handler
EEnd     Handler   UNIX ms TS        End of an execution in a function handler
HId      Handler   9 characters      Identifier of a host
CId      Handler   9 characters      Identifier of a container
COs      Handler   String            Identifier of the OS of a container
CPlat    Handler   String            Identifier of the handler programming platform
CStart   Handler   UNIX ms TS        Start of a container

Table 3: Pricing Model for Execution Costs. MAF is an Estimate due to Limited FaaS Platform Exposure.

Provider  EVar [µ$·s/GB]  EFix [µ$]  ECost [µ$]
AWS¹      16.67           0.2        16.67 ∗ ELat + 0.2
GCF²      16.50           0.4        16.50 ∗ ELat + 0.4
ICF³      17.00           0.0        17.00 ∗ ELat + 0.0
MAF⁴      16.00           0.2        16.00 ∗ ELat + 0.2
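For illustration, the per-request execution cost can be computed as follows (a sketch; the 1 GB memory configuration of our experiments is already folded into the EVar values of Table 3):

```javascript
// Per-request execution cost in µ$ as a linear function of ELat in seconds,
// following the pricing models of Table 3 (memory factor of 1 GB folded into EVar).
const PRICING = {
  AWS: { eVar: 16.67, eFix: 0.2 },
  GCF: { eVar: 16.5,  eFix: 0.4 },
  ICF: { eVar: 17.0,  eFix: 0.0 },
  MAF: { eVar: 16.0,  eFix: 0.2 }, // estimate, see Table 3
};

function executionCostMicroDollar(provider, eLatSeconds) {
  const { eVar, eFix } = PRICING[provider];
  return eVar * eLatSeconds + eFix;
}

// Example: the mean ELat of 1.637 s reported for AWS in Table 4 yields
// 16.67 * 1.637 + 0.2 ≈ 27.5 µ$ per request.
console.log(executionCostMicroDollar('AWS', 1.637).toFixed(2));
```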


5.5.5 Host Heat. For each host, we characterize the heat/load over a one-second interval as HHeat. First, we assign each successful execution to a specific host based on the HId. Second, we assign an execution to an interval if the interval lies at least partially between EStart and EEnd. Therefore, HHeat indicates the total number of executions that run on a host over a one-second interval.

5.5.6 Container Heat. A host can run multiple containers. We calculate the heat of a container (CHeat) similarly to HHeat by using the CId instead of the HId.
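The following sketch (our own, operating on the measurement records of Table 2) illustrates how such a heat value can be computed per one-second interval; grouping by HId yields HHeat, and grouping by CId yields CHeat.

```javascript
// Heat per one-second interval: the number of executions whose [EStart, EEnd]
// span overlaps the interval, grouped by host (HId) or container (CId).
// executions: successful requests with EStart/EEnd in UNIX ms and the chosen id.
function computeHeat(executions, key, runStartMs, runLengthSec) {
  const heat = {}; // heat[id][i] = executions overlapping interval i
  for (const e of executions) {
    const id = e[key]; // 'HId' for HHeat, 'CId' for CHeat
    if (!heat[id]) heat[id] = new Array(runLengthSec).fill(0);
    const first = Math.floor((e.EStart - runStartMs) / 1000);
    const last = Math.floor((e.EEnd - runStartMs) / 1000);
    for (let i = Math.max(first, 0); i <= Math.min(last, runLengthSec - 1); i++) {
      heat[id][i] += 1; // the execution at least partially covers interval i
    }
  }
  return heat;
}
```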

5.5.7 Delivery Latency. If no execution environment is available to process an event, a FaaS platform might need to queue the event. We try to characterize this behavior with the metric delivery latency (DLat), which is the difference between RLat and ELat.
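Putting the per-request client-side metrics together, a minimal sketch could look as follows; it assumes the measurement fields of Table 2 and a hypothetical expectedResult field known from the handler input.

```javascript
// Per-request client-side metrics derived from the measurement points of Table 2.
function requestMetrics(r) {
  // Success criterion of section 5.5.1: HTTP 200 and a correct handler result.
  const success = r.RCode === 200 && r.RResult === r.expectedResult;
  const rLat = r.REnd ? (r.REnd - r.RStart) / 1000 : null;          // RLat [s]
  const eLat = r.EEnd ? (r.EEnd - r.EStart) / 1000 : null;          // ELat [s]
  const dLat = rLat !== null && eLat !== null ? rLat - eLat : null; // DLat = RLat - ELat [s]
  return { success, rLat, eLat, dLat };
}
```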

5.6 Plots
In our result section, we make heavy use of two types of plots to visualize an experiment run. We refer to the two plots as client-view and platform-view. Due to their non-trivial nature, we discuss each plot in detail. We do not show measurements for the last 20 seconds of each run (see sec. 6 for explanation).

5.6.1 Client-View Plot. The design goal of the client-view plot is a combined view of all client-side metrics for a workload run. On the x-axis, the client-view plot shows the elapsed time of an experiment run in one-second intervals and the three workload phases. The plot shows the total number of trps, RFailed, and the minimum, median, and maximum RLat and ELat.

If RFailed is close to trps, the RLat and ELat distributions only characterize a small number of requests. Thus, in this situation, ELat and RLat are less meaningful. By definition, an ELat value must always be smaller than its corresponding RLat value.

Changing ELat values over time indicate that the required time to execute a function handler changes over time. Increasing RLat values over time with constant ELat imply an increasing DLat. An increasing DLat indicates increasing wait or queuing times for in-flight events within a FaaS platform.

5.6.2 Platform-View Plot. The design goal of the platform-view plot is a combined view of (i) execution environments that are available to process incoming events and (ii) the heat of individual execution environments. Similar to the client-view plot, the platform-view plot shows the elapsed time of an experiment run in one-second intervals and the three workload phases.

We indicate the value of CHeat or HHeat with different intensities of red. The highest intensity indicates a HHeat ≥ 12 for a one-second interval. If the minimum ELat is at least one second, a CHeat or HHeat ≥ 3 indicates parallel executions. Therefore, executions are potentially less isolated and can cause noisy neighbor effects.

With increasing values for RFailed, less data is available to generate the specific heat map. Thus, the corresponding plots must be interpreted carefully.

5.7 Experiment Artifacts
The functions were deployed using the serverless framework (v. 1.3.5). On AWS Lambda, GCF, and ICF, the calculation was executed on the main thread. MAF computes multiple requests in the same container concurrently; to keep the main thread unblocked, calculations were performed in separate threads on MAF. Requests were sent using the artillery.io framework. We deployed the workload clients on a TU Berlin virtual machine with an Intel Xeon E3-1220 CPU and 64 GB memory, located in Germany. The experiment is fully automated through a single script, available on GitHub⁷.

All measurements are available in an aggregated relational dataset to enable easy custom analysis for other researchers and practitioners. The code to execute all analyses and generate figures is available as a Jupyter notebook⁵.

⁵ https://ipython.org/notebook.html


To increase understandability, verifiability, and reproducibility, all code artifacts and results are available on GitHub⁷. For easy referencing in publications, all measurements and figures are available on Zenodo⁶.

⁶ http://doi.org/10.5281/zenodo.3565651
⁷ https://github.com/tawalaya/FaaSElasticityEngineering

6 EXPERIMENT RESULTS
We present and discuss results for experiments based on the experiment design presented in section 5.

6.1 Individual FaaS Platforms
We present our experiment results for the four FaaS platforms AWS, GCF, ICF, and MAF, and the five workloads WL0-WL4. We visualize select experiment runs using client-view and platform-view plots (see section 5.6). The complete set of figures and raw measurements for all runs is available (see section 5.7). Measurements for RLat and ELat vary significantly for different FaaS platforms. Thus, we use different ranges for the latency y-axis in the client-view plots for each FaaS platform provider. All reported numbers are based on requests that are started before 270 seconds. This is due to a configured client-side timeout of 30 seconds and no more workload generation after 300 seconds. Thus, requests started after 270 seconds are potentially partially executed in a changed environment. Table 4 shows statistics for each platform aggregated over all workloads WL0-WL4.

[Figure 2: Results for WL2 and WL4 on AWS. Client-view panels plot requests [#] (trps, RFailed) and latency [s] (RLat, ELat) over elapsed time [s]; platform-view panels plot parallel executions [#] per host (HStart, HHeat) over elapsed time [s].]

6.1.1 AWS Lambda. For all workloads, we measure 844 failed requests out of a total number of 88452 requests, which indicates a failure rate of 0.954%. We measure a 99-percentile RLat of 2.051s that significantly increases for the worst case (to 17.082s). Because RLat measurements take the network latency between client and SUT into account, we argue that the 99-percentile is significant. We measure similar values for the median ELat of 1.635s and the 90-percentile ELat of 1.679s, with a worst case of 2.393s.

We continue to analyze results for the individual workloads WL2 and WL4. Both workloads show no failed requests before the 240th second. During the scaling phase, RLat varies between (i) 1.462s and 2.400s with a median of 1.716s for WL2 and (ii) 1.560s and 2.410s with a median of 1.721s for WL4. Values for median RLat and ELat differ only within a small band between one-second intervals. After sending the first requests, the median RLat and ELat increase for under 10 seconds by less than 0.5s. We observed the same effect for the first 10 seconds of (i) the scaling phase of WL0-WL2 and (ii) the warmup phase of WL3-WL4. Our results indicate that AWS can handle sudden trps increases of different sizes equally well.

AWS does not start new hosts after the end of the scaling phase. The start of new hosts correlates with increasing trps. Visual inspection of the platform-view plot for WL4 (see figure 2) reveals that the platform successfully starts executions on 55 different hosts within the first second of the warmup phase.

Our results indicate that a single host rarely executes events in parallel. Due to our measurement approach, our results could even indicate that a single host executes events sequentially. Our results do not indicate that executions cause noisy neighbor effects.

[Figure 3: Results for WL0 and WL2 on GCF. Client-view panels plot requests [#] (trps, RFailed) and latency [s] (RLat, ELat) over elapsed time [s]; platform-view panels plot parallel executions [#] per host (HStart, HHeat) over elapsed time [s].]

6.1.2 Google Cloud Functions. Across all workloads, we measure 2324 failed requests out of a total number of 83698 requests, which indicates a failure rate of 2.777%. ELat varies between the median of 1.391s and a maximum of 2.616s, in a range of 1.224s. In contrast, RLat varies between the median and 90-percentile by 2.836s and between the median and 99-percentile by 19.815s. Thus, our results indicate that the execution time and execution cost for executions are stable, but the time between starting a request and beginning an execution can vary significantly.


Table 4: Comparison of FaaS Platforms Based on Client-side Qualities Aggregated Over Workloads WL0-WL4.

          Reliability                         RLat [s]                          ELat [s]                                   ECost [µ$]
Provider  Req [#]   Failed [#]  Failed [%]    Med     90-p    99-p    Max       Med     90-p    99-p    Max      Mean      Mean
AWS       88452     844         0.954         1.711   1.761   2.051   17.082    1.635   1.679   1.843   2.393    1.637     27.489
GCF       83698     2324        2.777         1.509   4.345   21.324  30.033    1.391   1.511   1.740   2.615    1.350     22.675
ICF       88534     2457        2.775         7.081   10.564  15.329  29.898    3.193   5.224   7.382   14.943   3.396     57.732
MAF       86657     83170       95.976        20.002  28.400  29.940  31.081    0.729   4.297   10.087  14.251   1.676     (27.016)

We conduct an in-depth analysis for WL0 and WL2 (see figure 3). In response to increasing trps, the median and maximum RLat temporarily increase significantly. For WL4, we observe increases in RLat up to 30 seconds, which results in client-side timeouts. Our results indicate that temporary increases in RLat correlate with temporary trps increases. We continue to analyze this behavior based on the visual inspection of the corresponding platform-view plots (see figure 3). The GCF platform continues starting new hosts after the scaling phase up to t = 240s, around 2 minutes into the cooldown phase. These late starts indicate that GCF initially underestimates the target workloads. Moreover, we observe HHeat > 2. Thus, our results indicate that multiple executions run on the same host at the same time. However, the combination of high HHeat and stable ELat indicates the isolation of different executions on the same host.

The combined visual inspection of the client-view and platform-view plots for WL2 shows that RLat increases occur with high HHeat values on multiple hosts. Around t = 170s, HHeat suddenly drops on multiple hosts, accompanied by a sudden drop in maximum RLat. Our results indicate that the platform queues incoming events and only a well-defined number of events execute on available hosts, thus preventing volatile execution latency.

[Figure 4: Results for WL2 and WL4 on ICF. Client-view panels plot requests [#] (trps, RFailed) and latency [s] (RLat, ELat) over elapsed time [s]; platform-view panels plot parallel executions [#] per host (HStart, HHeat) over elapsed time [s].]

6.1.3 IBM Cloud Functions. We measure 2457 failed requests out of a total number of 88534 requests, which indicates a failure rate of 2.775%. We measure a median RLat of 7.081s and a median ELat of 3.193s. RLat and ELat significantly increase for high percentiles, up to 29.898s and 14.943s, respectively. Our results indicate that clients observe significant variations in performance and execution costs.

We continue to discuss WL2 and WL4 in more detail (see figure 4). Except for WL3, we do not observe any failures during experiment runs. The failures in WL3 occur after 1 minute into the cooldown phase. The number of failed requests RFailed is close to trps, thus resulting in white vertical bars in the corresponding platform-view plot.

We observe quite stable values for median RLat and ELat for different one-second intervals in the same run. We observe a difference of around 2 seconds in median ELat and around 6 seconds in median RLat between the cold workloads WL0-WL2 and the warm workloads WL3-WL4. Furthermore, our results show significant variations in maximum ELat of over 2 seconds and in maximum RLat of over 6 seconds throughout complete runs. The variations in maximum ELat in combination with high values for HHeat indicate noisy neighbor effects caused by different events of the same function. The ICF platform does not start new hosts after the end of the scaling period. Thus, our results indicate that the platform underestimates the initial target workload.

6.1.4 Microsoft Azure Functions. We measure 83170 failed requests out of a total number of 86657 requests, which indicates a failure rate of 95.976%. Over 80% of these failures are due to the client-side timeout of 30 seconds; the other failures were all related to service errors (HTTP 500). These failure rates are highly significant. Because of these high failure rates, we do not show any platform-view plots. We want to stress that the RLat and ELat statistics only represent less than 5% of the total number of started requests. Thus, they are less representative compared to the previous three platforms. However, we observe a short median ELat of 0.729s. From a client-side perspective, we observe a median RLat of 20.002s. Visual inspection of individual runs (see figure 5) shows that failure rates increase up to 100% for the warm workloads WL3-WL4.

Our results indicate that the FaaS platform is not able to stabilize within 300 seconds of an experiment run.

6.2 FaaS Platform Comparison
Table 4 shows a comparison of the four FaaS platforms. Our results indicate low failure rates under 2.777% for three FaaS platforms, with similar values for GCF and ICF and the lowest failure rate of 0.954% for AWS. We measure the highest failure rate of 95.976% for MAF, which is highly significant in contrast to the other three FaaS platforms.


[Figure 5: Results for WL0-WL1 and WL3-WL4 on MAF. Client-view panels plot requests [#] (trps, RFailed) and latency [s] (RLat, ELat) over elapsed time [s].]


We measure similar ELats for AWS and GCF, indicating stable execution times and thus stable execution costs. Our results indicate dense multiplexing of different executions on single hosts for GCF under performance isolation. In contrast, AWS seems to rely on a larger number of different hosts and sequential executions. Similar to GCF, ICF places multiple executions on single hosts. However, higher variations in ELat measurements indicate less performance isolation between parallel executions compared to GCF.

Based on our results, GCF RLat is significantly sensitive to steep increases in target workloads on cold functions.

In general, our results support the claim that FaaS platforms deliver on the promise of automatically provisioning the compute resources required to execute highly volatile rates of incoming events. Execution costs of different executions with the same complexity can vary for the same FaaS platform. All FaaS platforms show sensitivities to different target workloads. Thus, depending on application objectives, there is a need to mitigate some of the platform behaviors to achieve stable and reliable client-side qualities.

6.3 Limitations
Our experiments are subject to several limitations. An experiment must make assumptions about a concrete handler implementation, handler inputs, and target workloads. However, these assumptions might not represent a specific application accurately. We address this limitation by fostering easy extensibility through a detailed description of a simple experiment design and by making all experiment artifacts and results permanently available.

We use Artillery as a workload generator. However, Artillery does not support an easy distributed deployment. Thus, the scalability of the target workload is limited. While the scalability was sufficient to generate our workloads, it limits extensibility. Furthermore, we found that the used version of Artillery does not provide native capabilities for tracking client-side failures, thus impeding verification of experiment results.

We did not deploy our workload generator within the same datacenter as the FaaS platforms. While this choice improves the identification of client-side errors, it adds the latency of the network to RLat and DLat. We try to mitigate this limitation by reporting multiple high percentiles for latency measurements.

Compared to the other three FaaS platforms, MAF does not expose the same capabilities for configuration of memory size. Thus, MAF should be compared to the other FaaS platforms carefully. Moreover, MAF does not expose the amount of memory that is used by individual executions. Thus, we could only approximate the execution costs for MAF using ELat.

The short runtime of 300s for all workloads might not seem long enough for all platforms to stabilize. A longer duration could favor specific platforms more; thus, it could be considered an unfair comparison. However, we investigate FaaS platforms for the objective-driven design of serverless applications. Thus, we argue that FaaS platforms should be able to react to workload changes within 240 seconds.

The platform-view plot requires sufficient data to represent a run accurately. Thus, its expressiveness is limited for runs with high failure rates.

7 IMPLICATIONS FOR APPLICATION DESIGN
In this section, we present the implications of our experiment results for application developers looking to build elastic serverless applications. For this purpose, we introduce three possible application scenarios and derive appropriate application objectives. Based on the results presented in the previous section, we discuss the applicability of different FaaS platforms for our three application scenarios.

7.1 Application Scenarios
We propose three application scenarios and determine suitable latency, reliability, and execution cost targets. Each of the scenarios represents a current or up-and-coming use case for serverless computing.

7.1.1 Scenario 1 - Web Serving. The first application scenario is online web serving, perhaps the most common type of application currently deployed in FaaS platforms. Typically, web applications need to provide quick feedback to users, and the number of concurrent clients interacting with the app can fluctuate substantially.

Web applications are characterized by strict low-latency requirements. Research has shown that even small variances in response latency (>1 second) are noticeable to users, which leads to less interaction with the app [1] and can potentially impact its monetization. Besides low latency, web serving also requires high reliability. Generally, it is recommended that the error rate stay around 1%. Lastly, execution costs should always be considered but do not play a significant role in this scenario, since web functions tend to have very low execution times.


7.1.2 Scenario 2 - Exploratory Data Analysis. The second type of application is serverless exploratory analytics, which has seen increased adoption and support by research [10] and industry [24] alike. Business analysts will often perform initial investigations on data to discover patterns or test hypotheses. Due to its infrequent and sporadic nature, exploratory analytics is also prone to volatile workloads, with many executions being triggered all at once.

This scenario has more relaxed requirements for response latency because analysts do not need instant feedback from the application. However, latency should still not surpass 20 seconds to ensure smooth coordination between functions and promote the productivity of the analyst. Because functions can be re-executed in case of an error, the reliability requirements are also not as strict. Still, dealing with errors can lead to coordination efforts and introduce more deployment costs. Finally, exploratory data analysis use cases are more sensitive to execution cost than web applications because of their long function execution times.

7.1.3 Scenario 3 - Periodical Batch Processing. Similar to the second application type, this scenario also makes use of the serverless ecosystem for big data processing. In this scenario, jobs are executed periodically and are triggered automatically. The execution of batch processing jobs, e.g., [28], can also be characterized as a volatile workload.

The lack of user interaction lowers the latency requirements in this scenario. Latency and reliability should still be taken into account because they influence the coordination of functions and therefore cause more initial deployment costs. Like in the previous scenario, batch processing applications are also quite sensitive regarding execution costs.

7.2 Discussion of FaaS Platform Applicability
As described in section 6, we observed a failure rate of over 95% for MAF during our experiments, which indicates that the platform is not able to scale under volatile workloads. We, therefore, deem it unsuitable for all three of our application scenarios and urge interested developers to continue to monitor its performance until the platform matures. The following sections discuss the applicability of AWS, GCF, and ICF for our three application scenarios.

7.2.1 FaaS platforms for web serving. Scenario 1 assumes high reliability as well as very low and consistent latency as application objectives. All three of our remaining candidates (AWS, GCF, and ICF) performed reliably, with failure rates of under 2.77%. Of the three, AWS proved to be the most reliable with a failure rate of only 0.95%. With regard to latency, not all three candidates were able to meet our application objective. ICF's median response latency of over 7.08 seconds is less than ideal, and the latency variance between median and 99-percentile shows an even dimmer outlook for potential users of the application. We, therefore, deem ICF unsuitable for serving web applications. In contrast, GCF and AWS both show median latencies of under 1.72 seconds, with a slight advantage to GCF. However, the latency variance between the median and the 99-percentile of almost 20s reveals GCF's inability to handle a sudden increase in requests. GCF's mediocre performance during the scaling phase could potentially be improved through mitigation strategies to be investigated in future work. Therefore, we do not rule out GCF outright. AWS performs consistently and is consequently the most suitable candidate for a web serving scenario.

7.2.2 FaaS platforms for exploratory data analysis. Scenario 2 can tolerate failure rates of up to 10% and allows for slower response times of up to 20s. Besides this, execution costs should be minimized as far as possible.

Based on our results, all three candidates performed very reliably, with failure rates well below our 10% threshold. Regarding latency, we can see a good median performance by all three candidates too. Looking at slow requests, represented by the RLat 99-percentile, AWS clearly stands out as the fastest, while ICF's performance of 15.33s still meets our objective. GCF misses our 20s target by 1.32s, which again reveals its inability to scale when many executions are triggered in a short amount of time. Still, as previously mentioned, this behavior could be tolerated or potentially improved through mitigation strategies.

With all three candidates still under consideration, we look at execution cost as a potential decider. The execution cost of each function depends on its execution latency (see also table 3). Throughout our experiments, the complexity of the execution is assumed to stay constant, so variances in ELat indicate noisy neighbors and insufficient provisioning of resources. Our results show low mean ELat for both GCF and AWS. ICF, meanwhile, underestimated the workload and provisioned too few resources, significantly impacting the execution time and, therefore, also the execution cost. AWS and GCF both show rather consistent ELat, but GCF outperforms AWS with a better mean execution latency, making it the cheapest of the three candidates.

To summarize, for exploratory data analysis, we recommend AWS for application providers that value low latency, while GCF is the better option for those who want to optimize execution costs.

7.2.3 FaaS platforms for periodical batch processing. Scenario 3 is similar to the previous scenario, with relaxed reliability requirements and the goal of minimizing execution costs. However, latency does not play an important role here.

Correspondingly, neither reliability nor latency outright disqualifies any of the three candidates, so cost becomes the tiebreaker between the three. Execution costs are marginally lower for GCF than for AWS. Still, as mentioned in section 7.1.3, inconsistent latencies and function failures can influence the coordination of functions and therefore cause more initial deployment costs. Thus, we recommend GCF for application providers who prioritize low execution costs and AWS for those looking for a straightforward deployment.

8 CONCLUSION
We presented a novel experiment design and corresponding toolkit for measuring the elasticity of FaaS platforms and evaluated four cloud providers: AWS, Google, IBM, and Microsoft. Our results show significant quality and cost differences between all four platforms.

Based on the results, we discussed the applicability of the four FaaS platforms in three common application scenarios. Our experiment design and associated toolkit can be used to verify, reproduce, or extend our results or run custom benchmarks.


For future work, we intend to extend the scope of our experiments. We will improve the scalability of our workload generator by supporting distributed deployments. In addition, we intend to provide additional handler implementations to include different computational profiles and triggers. Finally, we will evaluate new mitigation strategies that application developers can use to counteract undesirable characteristics of specific FaaS platforms.

ACKNOWLEDGMENTS
The work in this paper was performed in the context of the DITAS and BloGPV.Blossom projects. DITAS is partially supported by the European Commission through the Horizon 2020 Research and Innovation program under contract 731945. BloGPV.Blossom is partially funded by the German Federal Ministry for Economic Affairs and Energy (BMWi) under grant no. 01MD18001E. The authors assume responsibility for the content.

