

Grupo de Arquitectura de Computadores, Comunicaciones y Sistemas.

Universidad Carlos III de Madrid

Towards Unification of HPC And Big Data Paradigms. TIN2016-79637-P

Report on techniques for data management on integrated HPC and Big Data platforms

Work Package 2
Technical Report D2.2

Jesus Carretero, Felix Garcia, Francisco Javier Garcia, Florin Isaila, Silvina Caino, Estefania Serrano
March 10, 2018


Contents

1 Introduction
2 Spark-DIY
  2.1 Background
  2.2 Spark-DIY Architecture
  2.3 Deployment Overview
  2.4 Evaluation
3 Data-parallel buffering for workflows
  3.1 Main contributions
  3.2 Design and implementation
  3.3 Bufferflow architecture
  3.4 Evaluation
  3.5 Dataflow decoupling
  3.6 Evaluation of the allocation scheduling policies
4 A Heterogeneous Mobile Cloud Computing Model for Hybrid Clouds
  4.1 Introduction
  4.2 Proposed Model
    4.2.1 Terms
    4.2.2 Aims and Goals
    4.2.3 Architecture
    4.2.4 Volunteer Platforms
    4.2.5 Application Scenarios
    4.2.6 Deployment and Mobile Application Adaptation
    4.2.7 Incentive Scheme
    4.2.8 Security Aspects
  4.3 Evaluation
    4.3.1 Devices characterization
    4.3.2 Case Studies
    4.3.3 Case study 1: Processing Service
    4.3.4 Case study 2: Storage Service
    4.3.5 Case study 3: Speech Recognition
5 Conclusions


1 Introduction

The scale of contemporary HPC platforms has increased significantly in the past few years. This explosive growth of computational resources engendered new and innovative scientific practices capable of generating a vast amount of data.

The composition of multiple computations into workflows is an active area of research in the HPC community as it progresses toward the exascale era [25]. Moreover, the HPC community requires the capacity to use Big Data frameworks to analyze data, while the Data Analytics and Deep Learning communities require HPC capabilities, without losing the programming models and ML libraries they already rely on. In the BIGHPC project, we have developed three new mechanisms to cope with these challenges:

• A middleware that unifies the Spark and MPI environments, named Spark-DIY.

• A new dataflow buffering management architecture, named Bufferflow, compatible with POSIX and get/put interfaces as well as with the underlying Big Data and parallel file systems.

• A new mobile cloud computing model, in which platforms of volunteer devices provide part of the resources of the cloud, inspired by both the volunteer computing and mobile edge computing paradigms.

In the next sections, these three solutions are presented.

2 Spark-DIY

The authors of [15] determined that the usage of merged Big Data models presents limitations, such as high memory consumption and low efficiency in the communication between cooperating processes. Analyzing several use cases and comparing frameworks (MPI vs. Spark) and platforms [16], the authors concluded that the shuffle phase of the Spark framework is not able to support current HPC applications, which are oriented towards CPU-intensive tightly-coupled models and assume that computing and storage resources are decoupled. On the other hand, the data abstractions and application model of Spark are not easily supported using MPI.

In order to integrate the data abstractions and application model of Spark with an underlying MPI communication model, in this work we introduce a framework named Spark-DIY that allows the usage of native Big Data programming models, layered on top of HPC programming models, by using the highly scalable data-intensive communication pattern library DIY (Do It Yourself Block Parallelism) [42]. Spark-DIY runs on top of MPI to enable the execution of data analysis applications on a supercomputer. Our goal is to preserve the usability and flexibility of Big Data tools combined with the high performance, high bandwidth, and low latency of HPC. The objective of our design is to allow scientists to analyze and visualize their simulation results in a productive and efficient manner. The resulting framework allows users to move data freely between Spark and DIY data structures, employing DIY for the MPI communication-intensive parts of the Spark application and returning the results to Spark so that it can continue other tasks natively.


We choose to work with Apache Spark (version 2.2.0; see https://spark.apache.org/) because today it is one of the most popular Big Data programming models, and many libraries for machine learning or graph analytics are available for Spark users [57, 41]. We choose MPI (via DIY) for similar reasons, because MPI is the de facto communication mechanism in HPC.

The main contributions of this section are the implementation of the Big Data-HPC framework, the definition of an interoperable data model between Spark and DIY, and the ability to offload communication-intensive parts of the Spark application to DIY, while maintaining the Spark programming model for users.

2.1 Background

Convergence Challenges

Using Spark for HPC applications, while appealing, poses important convergence challenges. Gittens et al. explored in [28] the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. The results showed poor performance of Spark vs. MPI for matrix multiplications: a performance gap from 2X to 25X. Thus they concluded that "it may be worthwhile to investigate better mechanisms for integrating and interfacing with MPI-based runtimes with Spark. The cost associated with copying data between the runtimes may not be prohibitive." Slota et al. [51] introduced a methodology for graph processing to bridge the gap between graph computing and HPC. Evaluations made on the Blue Waters supercomputer showed poor scalability of Spark vs. MPI+OpenMP for graph operations: a difference of two orders of magnitude was recorded.

Likewise, in [48] we reported our experience combining traditional HPC and cloud-based approaches with Big Data analytics paradigms, in the context of scientific ensemble workflows, by comparing a representative MPI-based iterative workflow from the hydrology domain (EnKF-HGS) and its equivalent implementation using the Spark framework. The results showed that Spark displayed large memory requirements and low communication capacity for cooperating processes, mainly due to the shuffle phase in large-scale reductions. This result was also corroborated by other works, like [20], where major inefficiencies of the Spark shuffle (explosion of files, high I/O contention, TTL cleaner overhead, etc.) were identified and some optimizations were proposed inside the Spark framework. In [44], a novel shuffle data transfer strategy for Spark was proposed, with good performance results and lower memory utilization.

Still, achieving a data model fully compatible with both Spark and MPI, while accelerating and scaling the shuffle phase, is a challenge not fully addressed by any platform today, and this is the goal pursued in our work.

Spark

Spark is arguably the most popular Big Data processing framework for data analysis, and it also supports numerous other tools for machine learning, graph analytics, and stream processing, among others.

Being initially inspired by the Map-Reduce model, Spark supports extended functionality and operates primarily in memory by means of its core data abstraction: the resilient distributed dataset (RDD) [59]. An RDD is a read-only, resilient collection of objects partitioned across multiple nodes that holds provenance information (lineage) and can be rebuilt in case of failures by partial recomputation from ancestor RDDs. An RDD can be created in several ways: by implicit partitioning of an input file stored in the underlying distributed file system, by explicit partitioning of a native collection (e.g., an array), or by operating on already existing RDDs.

Furthermore, RDDs are by default ephemeral, which means that once computed and consumed, they are discarded from memory. Since some RDDs might be repeatedly needed during computations, the user can explicitly mark them as persistent, which moves them into a dedicated cache for persistent objects.

Two types of operations can be executed in Spark: transformations, which execute a function independently in each partition, and actions, which trigger data shuffles between the partitions. Transformations are executed in a lazy manner and are triggered by actions. The operations that are contained between two communication points are called stages.

Spark can be executed in standalone mode or on top of several resource managers such as YARN (https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html) and Mesos (http://mesos.apache.org/), and it allows the main driver process of a job to be placed inside one of its workers (cluster mode) or in the machine that submits the job (client mode).
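To make these concepts concrete, the following minimal Spark driver (a sketch in Scala, using only the standard Spark API; the input path and application name are hypothetical) shows lazy transformations, explicit persistence, and an action that triggers execution and a shuffle.

    import org.apache.spark.sql.SparkSession

    object RddBasics {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("rdd-basics").getOrCreate()
        val sc = spark.sparkContext

        // Transformations (flatMap, map) are only recorded; nothing executes yet.
        val words = sc.textFile("hdfs:///data/input.txt") // hypothetical input path
          .flatMap(_.split(" "))
          .map(w => (w, 1))

        // Mark the RDD as persistent so it stays cached across several actions.
        words.persist()

        // reduceByKey introduces a shuffle; count() is the action that triggers the job.
        val counts = words.reduceByKey(_ + _)
        println(s"distinct words: ${counts.count()}")

        spark.stop()
      }
    }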

DIY

DIY is an MPI-based library that offers efficient and highly scalable communication patterns over a generic block-based data model. It does so by decomposing the analysis problem among a large number of data-parallel sub-problems and efficiently exchanging data among them using regular local and global communication patterns whose implementation has been tuned for HPC.

The abstraction enabling these capabilities is block parallelism; blocks and their message queues are mapped onto processing elements (MPI processes or threads) and are migrated between memory and storage by the DIY runtime. Configurable data partitioning, scalable data exchange, and efficient parallel I/O are the main components of DIY. DIY supports distributed- and shared-memory parallel algorithms that can run both in- and out-of-core with the same code. The same program can be executed with one or more threads per MPI process and with one or more data blocks resident in main memory.

DIY has demonstrated efficient scaling on leadership-class supercomputers in a diverse array of science and analysis codes, including cosmology, molecular dynamics, nuclear engineering, astrophysics, combustion, and synchrotron light source imaging. For example, benchmarks of strong and weak scaling of parallel Delaunay tessellations [45], one of the libraries built on top of DIY, demonstrated parallel efficiency of over 90% on up to 128K MPI processes.

In DIY, algorithms are written in terms of data blocks that constitute the basic units of domain decomposition and parallel work. Blocks are linked forming neighborhoods that represent the domain in a distributed manner. The assignment of blocks to MPI processes, often multiple DIY blocks per MPI rank, is controlled by the DIY runtime transparently to the user. Given a block decomposition and an assignment to MPI processes, the user is able to run reusable communication patterns, either between local blocks in a neighborhood or as global operations such as reductions over all blocks. Therefore, DIY users can execute common communication patterns just by defining the block type and domain topology, without knowledge of the underlying communication details.

The similarity between Spark RDDs and DIY block parallelism, and the resemblance between Spark map-shuffle-reduce and DIY merge-reduce communication patterns, are the basis for our integration of these two models.

2.2 Spark-DIY Architecture

There are three motivations for interoperability between Spark and DIY: (1) Spark users can use HPC platforms to scale their workloads; (2) HPC users gain access to Spark libraries, which increases productivity; and (3) both types of users can benefit from additional data patterns exposed by DIY (e.g., local neighborhood exchange).

As we have seen, however, research challenges remain in order to achieve the convergence of Big Data and HPC. Our approach is to integrate Spark with DIY without enforcing the usage of one model or the other, by allowing the user to freely switch between the two models and select the one that adapts better to each stage of the problem.

Design Goals

Guided by our objective to offer the user the best features of both computing models, we formulate the following design goals for the integrated framework.

Interoperability. DIY and Spark target different canonical problems; therefore, adapting a problem from Spark to DIY and vice versa should be explicit. To make the user aware of which model is currently active, we keep both platforms separated, but interoperable through explicit conversions.

Production-readiness. We believe that the viability of our solution depends on being able to use standard versions of Spark and DIY without any changes required to those platforms. Thus, the adaptation must be made to both of them using a middleware layer, transparently to the user, so that applications for Spark or DIY should run almost immediately.

Usability. Although the user must be aware of the explicit interoperability (including the overheads associated with switching contexts), the knowledge of the underlying data model required should be minimal, to preserve the nature of the Spark programming and data interface. This reduces the learning curve and minimizes the impact on existing code.

Flexibility. We want to support multiple data types and provide flexibility for different datasets to coexist in the same application.

Performance. The data locality capabilities of Spark are one of its key features and must be enforced as much as possible. On the other hand, the efficiency and scalability of the communication patterns of DIY should be exploited whenever possible to accelerate communication-intensive (e.g., shuffle) operations.

Interoperation Mechanisms

Given the previous design goals, three aspects of Spark and DIY need to be connected: the data abstraction, the programming model, and the execution model (see Fig. 2.1). The adaptations needed to connect each of these components between the two models are detailed below.

Data Abstraction


Figure 2.1: Interoperation mechanisms between Spark and DIY: data abstractions are mapped (RDDs to blocks), the programming model is translated (extended MapReduce to local exchange and global communication patterns), and the execution model is adapted (task-based dynamic staging on the JVM to static MPI processes).

The first aspect that must be aligned is the way in which both frameworks represent their data abstractions. Both in the case of Spark and of DIY, the way data are arranged determines the development of algorithms and the behavior of the runtime.

Since preserving the RDD abstraction of Spark is key to maintaining the usability and interoperability with upper layers, it is necessary to map RDDs to a block-based data structure in DIY. If we think of the RDD as the equivalent of the global DIY (distributed) domain, a data partition in an RDD maps directly to a data block in DIY. In this context, the RDD dataset is partitioned into independent DIY blocks as shown in Fig. 2.2, where each partition Pi maps to a corresponding block Bi, preserving the same data elements inside the partition and respecting locality, since no data transfers occur to build the DIY dataset. As a consequence, the DIY dataset constitutes a distributed collection that reflects the inner structure of an RDD, while adding topology information for the DIY-based communication patterns. Data are moved between Spark and DIY transparently, holding the bindings for each partition and interacting with the Spark context to control partitioning.

• Advantages: lower Spark overhead, since only the interface is preserved and core components (resource management, DAG scheduler, etc.) are bypassed; efficient shuffles and collectives provided by DIY; and the potential to enhance multithreading locally per partition.

• Limitations: provenance and the natural interoperability between RDDs are lost (the design could be extended to preserve these features in the future); the driver code requires changes to call the wrapper; callbacks do not follow Spark's lambda conventions; and the resulting RDDs do not follow the usual RDD contracts, which is mitigated by the ability to convert to and from RDDs, although it could be improved.
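As a small illustration of the partition-to-block correspondence described above, the following sketch (plain Spark API in Scala; sc is an existing SparkContext) creates an RDD with a fixed number of partitions, each of which would back exactly one DIY block, so no data movement is needed to build the DIY dataset.

    // Each of the nBlocks partitions Pi would back one DIY block Bi.
    val nBlocks = 4
    val rdd = sc.parallelize(1 to 1000000, numSlices = nBlocks)

    // One partition per block; locality is preserved because the block is built
    // from the partition contents in place, inside the executor that owns it.
    assert(rdd.getNumPartitions == nBlocks)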

Programming Model

Once the data abstractions are mapped, the translation of the programming model from the Spark interface to the underlying DIY communication patterns follows naturally. The way we mapped data abstractions facilitates the algorithmic mapping, because we are able to preserve the independence between partitions and to map data shuffles to underlying DIY communication patterns.

Spark operations on RDDs are internally expressed as algorithms built on top of DIY patterns to mimic the functionality expected from Spark. For example, the map transformation in Spark can be translated to a foreach pattern in DIY, since both of them represent parallel and independent operations on the dataset. Similarly, reduceByKey in Spark was translated to an algorithm based on the swap-reduce DIY pattern, which conducts several rounds of data exchanges between blocks, effectively shuffling data across the partitions.
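The sketch below condenses this translation side by side, using the wrapper names that appear in the word-count listing later in this section (PairDDD, DIYPairMap, DIYPairReduce). The exact signatures are those of our prototype and should be read as assumptions; words (an RDD of strings), pairRecords (an RDD of PairRecord elements), callback, nBlocks, args, and sc are provided by the surrounding driver.

    // Plain Spark: the shuffle in reduceByKey is handled by Spark itself.
    val sparkCounts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // Spark-DIY sketch: the same logical steps over an RDD of PairRecord elements,
    // with the map run as a DIY foreach over blocks and the reduction as a DIY
    // swap-reduce; the callback supplies the unary/binary operators.
    val inputDDD   = new PairDDD(pairRecords, nBlocks, args, sc)
    val mappedRDD  = inputDDD.DIYPairMap(callback)      // foreach pattern over blocks
    val mappedDDD  = new PairDDD(mappedRDD, nBlocks, args, sc)
    val reducedRDD = mappedDDD.DIYPairReduce(callback)  // swap-reduce pattern (shuffle)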


Figure 2.2: Mapping of the data partitions of an RDD (P1, P2, P3) to DIY blocks (B1, B2, B3).

To preserve the programming interface of Spark as much as possible, operations on partitions are triggered by the inner algorithms in DIY but expressed as callbacks written by the user in Scala, who also defines the data type of the records and the supported operators (e.g., unary for independent transformations, binary for reductions, and hash for partitioning).

Execution Model

Besides translating the programming model into DIY patterns, it is necessary to provide the proper execution support. In this particular case, we must connect the dynamic task-based execution model of the Spark framework to the set of MPI processes that DIY assumes to exist at the beginning of its execution.

To achieve this, we wrap each Spark worker into an MPI process that forms a basic communicator for DIY. Since executors are spawned inside these processes, we can update the global communicator to include these child processes using MPI, in a similar way as depicted in Spark-MPI [40], a solution that extends the Spark ecosystem to MPI applications using the Process Management Interface to allow the creation of MPI processes from Spark. In Spark-MPI, however, everything must afterwards be programmed on the MPI side, as there are no facilities to manage data models and topologies, as we have in DIY.

2.3 Deployment Overview

Figure 2.3 shows the interaction between the main components of the proposed architecture. The following sections explain their role from the end user's perspective, and the accompanying internal behavior of the system.

User View

The end user is exposed to a limited number of additional elements of the interoperation layer, in addition to the basic Spark interface. The driver of the Spark application (in Scala or Java) must define and use these components as follows:

1. Select the record data type: The RDDs to be processed through DIY are collections of data records that we can convert to C++ data types through the Java Native Interface (JNI). To ease this process, a catalogue is offered where users can select a pre-built data type that handles type conversion and memory management from and to the C++ code. Since users may want to use a custom data type not present in the catalogue, we developed the internals of Spark-DIY in a generic manner. New data types can be defined in a helper file later used by the JNI code generation utility of choice, which is SWIG in our particular case. New data types must define a serialization function, since both RDD and DIY block elements need to be serializable.

Figure 2.3: Architecture overview including the Spark core objects and deployment units in cluster mode (driver, master, workers, Spark context, RDDs), the Spark-DIY binding elements (DIY wrap, DIY dataset, callback interface, user-defined operators, and the DIY/MPI Java bindings over the C++ DIY library), and the DIY MPI processes with their block distribution.

2. Define the callback operators for the record: As in Spark, the operations to be conducted on the data must be defined. In order to access these operators from DIY, users must implement the proper methods in an object that extends the callback interface. For example, the interface exposes a unary operator for map-like transformations, a binary operator for reductions, and a hash operator for partitioning.

3. Delegate execution on a DIY dataset: A DIY dataset contains an RDD and mimics the operations the user would normally run on the RDD. Once an RDD is created along with its operators, we can run the Spark-equivalent transformations and actions implemented using the communication patterns of DIY, running on MPI. The result of this operation is a new RDD that can be further used in the driver with subsequent combinations of Spark functions or DIY algorithms.

4. The DDD abstraction: A DDD mimics an RDD (it supports transformations and actions, and might keep track of lineage in the future) but does not extend it in any way. Each DDD holds its own DIY wrapper object to avoid interference with other DDDs in the Spark program. When using DDDs, the resource manager is no longer needed (a dedicated virtual Spark cluster per application is assumed, with no job coexistence, i.e., batch-like jobs), and neither is the Spark task scheduler, whose role is taken over by DIY.

5. The record: DDDs and RDDs that aim to be interoperable should contain the same basic entry type to support conversions through JNI calls. Record type instantiation is done in the SWIG interface, so the selection or definition of the proper data type is the user's responsibility.

6. The operators: The wrapper defines a common interface for callbacks. Driver programs implement this interface with the proper operators for their target transformations and actions. For instance, to run a map, the user implements the unary method of the interface and passes an instance of the callback to the map function of the DDD. An alternative design would use a separate single-method interface per operator (as Hadoop does), but in practice this proved confusing and caused naming issues; with a single interface, several callback objects can be defined and only the methods that are actually needed have to be implemented. This design decision can be revisited in the future if it proves suboptimal.

The following listing shows an example of a word-count callback.

    object Wordcount {
      class WCCallback extends PairDIYCallback {
        override def binary(x: PairRecord, y: PairRecord): PairRecord = {
          x.set_second(x.get_second + y.get_second)
          return x
        }
        override def unary(x: PairRecord): PairRecord = {
          x.set_second(1)
          return x
        }
        override def key_hash(x: PairRecord): Long = {
          return x.get_first.##
        }
      }

      def main(args: Array[String]) {
        // Library configuration, Spark settings, parameter checks ...
        val callback = new WCCallback()

        var inputRDD = sc.textFile(file)
          .flatMap(_.split(" "))
          .map(x => new PairRecord(x, 0))

        var inputDDD = new PairDDD(inputRDD, nBlocks, args, sc)
        var mappedRDD = inputDDD.DIYPairMap(callback)

        var mappedDDD = new PairDDD(mappedRDD, nBlocks, args, sc)
        val outputRDD = mappedDDD.DIYPairReduce(callback)
      }
    }

Internals

When a user triggers a function in a job that is delegated to the algorithms implemented in DIY, several tasks are conducted internally to pass data from the Java side to the C++ side:

1. Spawn executors: Since DIY algorithms are block-parallel, we exploit the one-to-one association between each partition of an RDD and the corresponding block in the DIY domain. We let Spark handle data serialization, partitioning, and executor creation by wrapping the partition-block conversion in a function that is passed to a mapPartitions Spark operator (see the sketch after this list). This creates executors that live in the MPI environment and contain the data of the corresponding partition, which enforces locality.

2. Convert each partition to a DIY block: The partition set is converted to a DIY domain, where each partition corresponds to a block. Transformations can be conducted with independent blocks following a similar approach to the Spark counterpart, with shuffle operations translated to DIY communication patterns.

3. Delegate algorithm to DIY: Once the domain is established, we can run the DIY operations through a wrapper in JNI that executes the user-defined callbacks for computation. The results are retrieved afterwards and converted back to an RDD, and the execution is resumed in Spark.
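A minimal sketch of the wrapping pattern from step 1, written against the standard Spark API in Scala; runDiyOnBlock is a hypothetical stand-in for the JNI entry point that hands one partition (one DIY block) to the DIY runtime inside the MPI-enabled executor.

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    object DiyDelegation {
      // Hypothetical JNI entry point: processes the elements of one DIY block and
      // returns the transformed elements; here it is an identity placeholder.
      def runDiyOnBlock[T](blockId: Int, elems: Iterator[T]): Iterator[T] = elems

      // Spark keeps handling serialization, partitioning, and executor placement;
      // each partition is handed to DIY in place, which preserves data locality.
      def delegateToDiy[T: ClassTag](rdd: RDD[T]): RDD[T] =
        rdd.mapPartitionsWithIndex { (partitionId, elems) =>
          runDiyOnBlock(partitionId, elems)
        }
    }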

2.4 Evaluation

Execution Environment

We evaluated a prototype of the platform on bare-metal nodes of the Chameleon cloud at the University of Chicago. Each node has an Intel Xeon processor with 12 physical cores and 135 GB of RAM. Both the Spark and Spark-DIY clusters were configured with single-core workers to limit the number of executors, in order to obtain a fair comparison against the MPI deployment. Therefore, each executor is mapped to one worker, and each worker is mapped to an MPI process.

Spark-DIY vs. Spark Weak Scaling

To test the weak scaling of Spark-DIY vs. Spark, we conducted reductions on key-value datasets to analyze whether Spark-DIY can be competitive with Spark in the situations where communication matters the most, namely the operations that involve data shuffles. This test runs the reduceByKey phase of a word-count program with synthetic data generated in the driver and evenly distributed across a number of partitions equal to the number of workers in the deployment. The problem size per worker remains constant at a block size of 100 MB, and we increase the number of workers to increase the problem size. Thus, we can determine how the behaviour of both frameworks evolves as the communication for data distribution between workers increases.

The results in Fig. 2.4 show competitive performance and a similar scaling trend between both platforms, although they both fail to scale linearly as the problem size increases. As may be seen, Spark-DIY follows the trend of Spark. This shared trend indicates a scaling issue in the Spark platform, which is in charge of parallelization and task generation in both cases, and this is the price we pay for keeping compatibility and the native Spark and DIY frameworks unmodified. We expect the performance of Spark-DIY to improve when we test more complex reduction (distribution) patterns. Note also that, by using Spark-DIY, we can provide more patterns than the ones already predefined in Spark.

Spark-DIY Strong Scaling

Since we have shown that the behaviour of the Spark-DIY reduceByKey is comparable to its Spark counterpart, we now focus on its scalability as the problem size increases for a fixed number of workers. We conduct the same reduceByKey benchmark with a variable dataset size in the range 200 MB–6.4 GB.

Figure 2.5 shows the scaling results up to 128 workers. As the number of workers and the problem size increase, the beneficial effects of DIY communication can be clearly appreciated in the figure, in comparison to the lower-scale cases. As seen in the weak scaling results, data parallelization and task management take a large portion of the overall execution time. Therefore, Spark-DIY operations are meaningful in those cases where communication is involved and represents a significant portion of the problem. This effect is clearer as the dataset size increases, which again is a good feature of Spark-DIY, as it is intended for very large datasets.

Figure 2.4: Weak scaling results for Spark-DIY and Spark running a reduceByKey with 100 MB per worker (execution time in seconds vs. number of workers, 8 to 128).

Figure 2.5: Strong scaling results for Spark with increasing overall dataset size (execution time in seconds vs. dataset size, 200 MB to 6.4 GB, and number of workers, 8 to 128).

3 Data-parallel buffering for workflows

HPC and Big Data workflows are constructed from multiple data-parallel producer-consumer dataflows, which are the primitives of data transfer between components. These dataflows require coordination between the endpoints to avoid overflows, but synchronous control is often undesirable because it forces the producer and consumer processes to advance at the same pace. Decoupling a tightly coupled producer-consumer pipeline into a buffered dataflow leads to key improvements such as better overall performance and fault tolerance. In order to achieve this decoupling, a buffering scheme must be introduced between the participants in the data exchange so that the producers can function at the required speed while buffered data is asynchronously sent to the consumers.

The current software I/O stack of large-scale HPC systems does not offer proper mechanisms to address this problem. This stack consists of several layers: scientific libraries (e.g., HDF5 [5]), middleware (e.g., MPI-IO [4]), I/O forwarding (e.g., IOFSL [6]), and file systems (e.g., GPFS [49], Lustre [3]). Improving the scaling of this distributed stack by several orders of magnitude is complex because of the lack of data flow coordination across layers. In particular, data buffering is hard-wired in each stack layer; it is managed mostly statically; and it typically lacks support for either on-demand space management, efficient vectorial operations, or collective I/O. For instance, the MPI-IO implementations [55] offer support for collective I/O, but the buffer management is not elastic; the data shuffling is hard-wired; small granular noncontiguous access is inefficient [54]; and control and data management are intermingled and not customizable [34].

An important topic of discussion is the need to decompose monolithic systems such as large-scale parallel file systems into independent services that can go beyond their current uses on contemporary HPC infrastructures. For instance, today there is poor support for scientific workflow I/O or for popular data analytics patterns, such as MapReduce, streaming, and iterative computations [47]. Also, monolithic systems cannot cope with the increasing depth and variety of the storage hierarchy [31]. Additionally, computing and storage I/O need to be decoupled in order to allow each system to scale independently.

Our work on CLARISSE (Cross-Layer Abstractions and Run-time for I/O Software Stack of Extreme-scale systems) [35] started to address the poor extensibility and the lack of global data flow coordination of the existing I/O stack. CLARISSE decomposes the software I/O stack into three layers: data, control, and policy [33]. The control backplane coordinates the data transfer, reacting to different system events coming from a publish/subscribe infrastructure. Policies such as load balancing, fault tolerance, and data staging can easily be implemented by using the control plane API. The data plane in CLARISSE is composed of multiple producer-consumer pipelines through which the data is transferred. This section introduces Bufferflow, a novel buffering scheme that efficiently supports parallel dataflows between producers and consumers in the CLARISSE data plane. Bufferflow has been integrated in CLARISSE, but it can be used as an independent service by any third-party workflow middleware.

The work presented in this section is motivated by the following steps that are needed to improve the scalability of the I/O stack: (1) decoupling the tightly coupled producer-consumer pipelines; (2) identifying novel primitives that can be used for composing highly efficient services for the future-generation software I/O stack; (3) opening up the memory and storage management of each data plane layer to user-specific allocation and replacement policies (both of these policies are currently embedded in various layers of the I/O stack); and (4) compensating for the lack of scalability of parallel file systems with customizable buffering policies that tailor end-to-end performance optimization of data flows.

3.1 Main contributions

In this work we take an important step toward decomposing the system software stack into independent services. We present a novel buffering service that can be used as a building block for scalable storage systems. Our buffering system productively and efficiently supports various requirements, such as adaptiveness to unpredicted data volume variations and low-overhead data transfers for composite workflows of applications.

The main contributions of this work are the following. First, Bufferflow goes beyond traditional put/get stores such as Memcached by offering data-parallel primitives in both scalar and vector versions. These primitives can be used to efficiently compose parallel applications (e.g., simulation, analysis, visualization) into complex workflows. Second, Bufferflow is the first system to offer collective access semantics for efficiently sharing buffers among decoupled parallel producers and consumers. Third, Bufferflow provides an adaptive buffering mechanism that reacts to either a high or a low volume of data access operations. Its adaptability enables the library to expand or shrink its internal memory pools when specific watermarks are reached, asynchronously from the consumer and producer execution contexts. Fourth, we present a novel portable MPI-IO-based implementation of coordinated collective write and read operations leveraging the CLARISSE middleware and the Bufferflow framework.
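To make the distinction between scalar and vector primitives and the collective access semantics more concrete, the following interface sketch (in Scala) is purely illustrative: the names, types, and signatures are our assumptions, not Bufferflow's actual C/MPI API.

    // Hypothetical sketch of the kinds of primitives described above.
    trait BufferStore {
      // Scalar put/get, as in a traditional key-value buffer store.
      def put(handle: Long, data: Array[Byte]): Unit
      def get(handle: Long): Array[Byte]

      // Vector variants: one call moves many buffers, amortizing per-call overhead.
      def putv(handles: Seq[Long], data: Seq[Array[Byte]]): Unit
      def getv(handles: Seq[Long]): Seq[Array[Byte]]

      // Collective access: all participants of a data-parallel producer (or consumer)
      // enter the call together and share the same logical buffer.
      def collectivePut(rank: Int, nParticipants: Int, handle: Long, chunk: Array[Byte]): Unit
      def collectiveGet(rank: Int, nParticipants: Int, handle: Long): Array[Byte]
    }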

3.2 Design and implementation

Figure 3.1 displays a high-level view of the integration between CLARISSE and Bufferflow. The figure shows np producers and nc consumers performing data-parallel accesses on ns servers. Each server leverages Bufferflow for local buffer management, while CLARISSE orchestrates the global data flows. Data and control planes are decoupled in the design of CLARISSE, so that novel policies can be enforced on top of the control layer. This work focuses on novel techniques for integrating Bufferflow into the data plane.

Figure 3.1: High-level architecture of the integration of CLARISSE and Bufferflow.

3.3 Bufferflow architecture

Bufferflow is composed of three layers, shown in Figure 3.2: the coordination, scheduling, and allocation layers. Each of the layers exposes a clean interface and an implementation agnostic of the other layers' internal details, thus allowing developers to easily plug different implementations of entire layers into Bufferflow. This is an important design decision, because it opens up the buffering service to custom implementations depending on the application needs. The lack of scalability of the current software I/O stack is due in part to the inflexible embedding of buffering at various layers, which prevents the design and insertion of novel policies for key aspects such as access coordination, scheduling, and allocation.

The coordination layer, shown in the upper part of the figure, is responsible for coordinating concurrent data-parallel accesses to the buffering system. This access coordination allows Bufferflow to go beyond many existing systems that offer buffer accesses only through a put/get interface. As one of the key contributions, Bufferflow offers mechanisms for coordinating data-parallel accesses both within one application and across applications, that is, between data-parallel producers and consumers. To achieve this coordination, the coordination layer manages the life cycle of the high-level abstraction of a buffer, including the transitions of a buffer through different states. Moreover, the coordination layer provides synchronization mechanisms for the concurrent data-parallel participants, thus enabling Bufferflow to offer powerful API semantics, described in the following section.


Figure 3.2: Bufferflow high-level architecture.

The scheduling layer, shown in the middle area of the figure, mediates the interaction between the coordination and allocation layers. As the most complex part of the architecture, the scheduling layer makes Bufferflow adaptive to various conditions that may occur in the system. These conditions include inefficient memory management due to differences in speed between producer and consumer processes, exceeding buffer memory footprint thresholds, and slow memory allocations/deallocations due to certain data access patterns. The scheduling layer offers asynchronous mechanisms for both buffer allocation (asynchronous buffer allocation) and swapping (buffer swapper). Additionally, the buffer selection for swapping is driven by a customizable buffer eviction policy.

The allocation layer, shown in the lower part of Figure 3.2, specializes in managing chunks of raw data. Bufferflow provides a modular allocation layer structure consisting of a buffer allocator and a chunk pool. The buffer allocator is efficient for the bulk allocations/deallocations required by the asynchronous allocation policy implemented in the scheduling layer. However, the modular design allows for a straightforward implementation of any other allocation policy in the buffer allocator module.

Many allocation mechanisms have been researched in the last two decades, but the only conclusion we can draw after so many years of effort is that there is no silver bullet for allocation. Each of the techniques that have been studied offers better performance for a specific set of applications with a specific allocation pattern and worse performance for the others. Therefore, the design and flexibility of such an allocation mechanism is of utmost importance for Bufferflow, especially due to the asynchronous allocation policy implemented in the scheduling layer, which leverages the efficiency of bulk allocation/deallocation operations.

3.4 Evaluation

We start by describing the experimental setup. We then present results from running Bufferflow both in a controlled environment and integrated into the CLARISSE middleware. In the evaluation, we used synthetic benchmarks and kernels of real scientific applications.

All results presented in the following sections were obtained by deploying Bufferflow on the Archer supercomputer, located at EPCC and part of the UK National Supercomputing Service. Archer is based on the Cray XC30 MPP [2] with the addition of external login nodes, postprocessing nodes, and a storage system. The Cray XC30 consists of 4,290 compute nodes, each with two 12-core Intel Ivy Bridge processors and 64 GB of memory. Each 12-core processor in a compute node is an independent NUMA region with 32 GB of local memory.

Archer is equipped with the Cray proprietary Aries interconnect [1], which links all the compute nodes in a dragonfly topology. The topology consists of Aries routers that connect four compute nodes, cabinets that are composed of 188 compute nodes, and groups that consist of two cabinets grouped together. Aries connects groups to each other by all-to-all optical links and the nodes within a group by 2D all-to-all electrical links. All these characteristics enable Archer to provide low-latency, high-throughput data transfer between the compute nodes. The data infrastructure in Archer relies on multiple high-performance parallel Lustre [3] filesystems that manage a total amount of 4.4 PB of storage.

In our experiments, we used a synthetic benchmark that allows evaluating Bufferflow in a controlled environment. The synthetic benchmark performs a parallel file copy based on the MPI framework. It assumes the existence of a global namespace for buffers (e.g., handled by an underlying distributed file system).

Figure 3.3: Benchmark architecture.

The benchmark architecture, illustrated in Figure 3.3, decouples the parallel processes of an application into three categories: producers, consumers, and servers. Instead of tightly coupling parallel producer and consumer processes in order to transfer the content of a file, we chose to have the producers delegate the data transfer to the servers so that they can continue to execute the rest of the application logic without facing the high time penalty of data transfers.

We assigned equal pieces of the file to each producer process in a MapReduce fashion. Instead of starting to send the data to the consumer processes, the producers send metadata requests to the servers, which are responsible for later transferring the data to the consumers based on the information listed in the metadata received from the producers. The metadata consists of the operation type (put for producers and get for consumers), the Bufferflow buffer handler, and the access size. The benchmark executes a configurable number of server processes. Each producer request goes through a shuffling phase that redirects it to a particular server, thus ensuring a balanced load on each server.

Inside each server process there are four components linked in a pipeline: a task processor responsible for forwarding the requests from the producers to the disk accessor component. The disk accessor fetches the data from the system storage based on the metadata received from the task processor and sends it to Bufferflow. Later, the data are read by the task processor responsible for transmitting them to the consumer processes.
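A hypothetical sketch of the request routing just described, in Scala: a producer builds a metadata record and a simple hash of the buffer handle selects the target server, which keeps the load balanced. The record fields and the routing function are our illustration, not the benchmark's actual code.

    object RequestRouting {
      // Hypothetical metadata record exchanged between producers/consumers and servers.
      sealed trait Op
      case object Put extends Op
      case object Get extends Op
      final case class Metadata(op: Op, bufferHandle: Long, accessSize: Int)

      // Shuffling phase: map each request to one of nServers so the load stays balanced.
      def serverFor(meta: Metadata, nServers: Int): Int =
        (meta.bufferHandle % nServers).toInt match {
          case r if r < 0 => r + nServers
          case r          => r
        }

      def main(args: Array[String]): Unit = {
        // Example: a producer delegating a 64 KB chunk identified by handle 42.
        val request = Metadata(Put, bufferHandle = 42L, accessSize = 64 * 1024)
        println(s"request goes to server ${serverFor(request, nServers = 64)}")
      }
    }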


Figure 3.4: Decoupling MPI data transfers through Bufferflow servers

3.5 Dataflow decoupling

In this section we compare the benchmark described above, which we will refer to as Bufferflow-Copy, with a tightly coupled producer-consumer implementation of the same functionality based on MPI, which we will refer to as MPI-Copy. Both benchmarks copy a file. For Bufferflow-Copy, the producers divide the file into chunks and delegate the copying process to Bufferflow servers by evenly shuffling the assigned chunks. The Bufferflow-Copy consumers receive the chunks and write them to the file. For MPI-Copy, the producers divide the file into chunks, shuffle the chunks among themselves, read the assigned chunks, and send them to consumers using MPI point-to-point communications. The MPI-Copy consumers simply receive chunks and write them to the file.

Our hypothesis is that tightly coupling producers and consumers causes higher data transfer times due to the synchronization of the involved processes.

In this experiment, we set up the producer processes to be faster than the consumers. This case is commonly found in applications where producers generate data at very high rates (e.g., scientific simulations, social media streams) and consumers perform costly data analytics at a lower consumption rate.

To evaluate Bufferflow-Copy, we deployed 64 parallel producers, 64 parallel consumers, and 64 servers for copying a file of 1 GB. We varied the Bufferflow buffer size between 4 KB and 1 MB. The low watermark is 10% and the high watermark is 60%. In this experiment we intentionally avoided reaching the low watermark by setting the aggregate buffering size of all 64 servers to 16 GB (i.e., 256 MB per server); in other words, 10% of 16 GB (1.6 GB) is larger than the 1 GB file.

Figure 3.4 shows the results. As expected, the Bufferflow producers delegating the file copy to servers finish fast. For 4 KB buffers the producers finish in 34 milliseconds on average. As the buffer size increases, the number of transfers decreases, causing the finish time of the producers to drop to 2 milliseconds for buffers of 1 MB. In comparison, the MPI-Copy producers require on average between 4.3 seconds and 10.4 seconds, because they have to synchronously carry out the data transfer to the consumers.


Figure 3.5: Memory footprint for adaptive, on-demand, and static allocation scheduling

The significant speedup of the producers is explained by delegating the asynchronous transfers to the servers. However, the interesting part of this experiment is the consumer results. The average transfer time of the consumers is up to 2.5 times shorter for buffers of 64 KB. The slow transfers of MPI-Copy can be explained by the tight coupling between producers and consumers, causing an MPI-Copy consumer process to wait for a matching producer request to complete.

This experiment shows that the decoupling technique provided by Bufferflow can significantly benefit data-parallel producer-consumer applications at the cost of the additional processes and memory used by the Bufferflow servers. The additional processes are not expected to consume critical resources, given that the computational and networking capacity of HPC machines is increasing significantly more than the storage I/O bandwidth. However, memory is expected to become a scarcer resource. Thus, using memory efficiently will become increasingly important.

3.6 Evaluation of the allocation scheduling policies

In this section we analyze the three allocation scheduling policies: adaptive, static, and on-demand. In this evaluation we used the benchmarking framework from Figure 3.3, with 64 producers and 64 consumers concurrently generating 16,384 requests to 1 server. The size of each request was 64 KB; thus, in this case the benchmark was also copying a file of 1 GB. For the adaptive policy the low watermark was 10% and the high watermark was 50%. Figure 3.5 illustrates the buffer memory footprint for this evaluation.

The static allocation scheduling policy has a constant buffer memory footprint during the execution of the benchmark, namely the maximum number of buffers Bufferflow was configured with in the initialization phase.

The on-demand scheduling generates a constant increase in the allocated buffers, with the ascending part matching the duration of the producers' execution, which continuously requests buffers from the scheduler. When the producers finish their jobs, the consumers release buffers until the benchmark finishes.

The memory footprint of the adaptive policy evolves as a step function, because the policy allocates and deallocates buffers in chunks when it reaches the low and high watermarks. The adaptive scheduling policy has a higher buffer memory footprint than the on-demand policy, but lower than the static allocation policy. For this experiment, the average memory footprint of the adaptive allocation policy is 10,303, which is 37% less than that of the static allocation policy and 29% higher than that of the on-demand allocation policy.
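The following sketch (in Scala) illustrates one plausible reading of the adaptive policy described above: watermarks are fractions of the configured maximum, the pool expands in bulk when free buffers fall below the low watermark, and shrinks in bulk when free buffers rise above the high watermark. The names, the chunk size, and the exact trigger conditions are our assumptions, not Bufferflow's implementation.

    // Sketch of an adaptive allocation scheduler driven by low/high watermarks.
    // maxBuffers: configured limit; chunk: bulk allocation/deallocation size.
    final class AdaptivePool(maxBuffers: Int, lowWm: Double, highWm: Double, chunk: Int) {
      private var allocated = chunk   // start with one chunk of buffers
      private var inUse = 0

      private def free: Int = allocated - inUse

      // Called on every producer request for a buffer.
      def acquire(): Unit = {
        if (free.toDouble / maxBuffers < lowWm && allocated + chunk <= maxBuffers)
          allocated += chunk          // expand event (bulk allocation)
        inUse += 1
      }

      // Called whenever a consumer releases a buffer.
      def release(): Unit = {
        inUse -= 1
        if (free.toDouble / maxBuffers > highWm && allocated - chunk >= inUse)
          allocated -= chunk          // shrink event (bulk deallocation)
      }

      def footprint: Int = allocated  // the step-function memory footprint over time
    }

Under this reading, extreme watermarks (e.g., 1% and 100%) would make the pool expand repeatedly and never shrink, while watermarks that are too close together would make it oscillate between expand and shrink events, which matches the behaviour discussed in the figures below.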

The following figures illustrate the memory footprint generated by the adaptive scheduling policy for the Bufferflow-Copy benchmark. For each figure, we ran Bufferflow with specific values of the low and high watermarks, expressed as percentages of the maximum memory limit configured in the initialization phase.

Figure 3.6 illustrates a deployment strategy for the scheduling layer in Bufferflow where extreme values are used for the low and high watermarks, that is, 1% and 100% of the maximum memory configured in the initialization phase. The results are as expected: the low watermark is hit repeatedly, while the high watermark is not. As we can see in the plot, the buffer memory footprint increases until the producers finish their execution; then it reaches a standstill because none of the consumers triggers a shrink event, since the high watermark is too high. Such an approach is undesirable: it hurts the producers' access time because the low watermark is hit too frequently; it may overload the internal event queues in Bufferflow; and it is memory inefficient, because the high watermark does not get hit at all.

A 10%-90% choice for the low-high watermarks is better than the previous choice, as shown in Figure 3.7. It both makes producers hit the low watermark less frequently and allows a small amount of flexibility for consumers to trigger memory shrinks.

Figure 3.8 illustrates the buffer memory footprint for a 20%-80% low-high watermark choice. As we can observe from the plot, the producers trigger fewer expand events: with more free buffers waiting to be requested, the producers are less likely to trigger an expansion. The interesting fact in this figure is that, given a lower high watermark than in the previous figure, one would expect a shorter plateau at the maximum value of the buffer memory footprint, and thus more shrink events on the descending side of the plot. In fact, the standstill of the buffer memory footprint is longer. This is due to an overlapping of the producer and consumer execution such that, when the buffer memory footprint reached its top value, the producers achieved the same access rate as the consumers.

We also present a case that should be avoided at all costs: choosing the low and high watermarks too close to each other. As we see in Figure 3.9, the left side of the plot is dominated by an aggressive burst of expand and shrink events. This kind of unstable behavior results in increased latency penalties for both consumers and producers.

No perfect solution exists for picking the low and high watermark values. Each application has a different data access pattern and different rates for producers and consumers, and runs on platforms with specific characteristics. Thus, the choice of the low and high watermarks should be made empirically, e.g., based on benchmarks such as the one presented in this section.


Figure 3.6: 1% low watermark, 100% high watermark

Figure 3.7: 10% low watermark, 90% high watermark

Figure 3.8: 20% low watermark, 80% high watermark

Figure 3.9: 40% low watermark, 60% high watermark


4 A Heterogeneous Mobile Cloud Computing Model for Hybrid Clouds

4.1 Introduction

Throughout the last few years, cloud computing (CC) has provided computing solutions to many companies, organizations, and individual users in the form of services over the Internet. CC provides on-demand, pay-per-use, and highly scalable computing capabilities for services that enhance the user experience in a way that is transparent to the user [11]. Meanwhile, with the current exponential growth of mobile devices, an emerging concept called mobile cloud computing (MCC) has arisen to integrate CC into the mobile environment [22]. In MCC, user applications are computed in remote clouds rather than on the users' own mobile devices, providing multiple benefits to the mobile users, such as a longer battery lifetime or a lower processing load.

Among the different approaches to MCC, one option is to bring the computation capabilities closer to the mobile users. This model locates small-scale servers or cloudlets at the edge of the network (e.g., base stations or coffee shops) in order to avoid the latency and bandwidth issues that CC experiences. This approach is related to novel paradigms such as fog and mobile edge computing and is expected to be a key aspect of 5G [32, 56]. On the other hand, it needs a periodic synchronization between the edge servers and the cloud, so several questions arise: when should the edge servers upload data to the cloud servers? How will the cloud handle such amounts of data from multiple edge servers located all over the world? How will these systems guarantee consistency (one of the desired properties of a distributed system according to Brewer's theorem [27])? There are only a few published works related to these issues [38, 23], and all of them are theoretical. Besides, this approach has numerous security issues (e.g., authentication, mobility, or access control) [53, 58], and not all companies and organizations will be able to deploy multiple servers at the edge of the network due to the high investment that it entails.

For all these reasons, we have developed a heterogeneous mobile cloud computing model that can provide most of the benefits of the fog and mobile edge computing solutions, but that can also be deployed easily and inexpensively by enterprises on their current cloud systems. More specifically, our work provides the following contributions:

• A heterogeneous mobile cloud computing model, which combines the current mobile cloud architecture with the utilization of volunteer platforms as resource providers.

• A complete description of this model and how it can be deployed in public, private, and hybrid clouds by using the BOINC open-source software: the devices that form the volunteer platforms should run the BOINC client software, and the cloud side should run the BOINC server software.

• A modeling of the proposed approach using ComBoS, an open-source simulator for volunteer computing and desktop grids created by the authors, as a starting point.

• An explanation of the benefits of our solution, including cost savings, elasticity, scalability, load balancing, and efficiency.

• An extensive simulation-based evaluation considering several realistic scenarios that demonstrates that our proposed model is a feasible solution for different cloud services.


4.2 Proposed Model

In this section, we describe our solution in detail. More specifically, Section 4.2.1 defines some basic terms, Section 4.2.2 outlines the aims and goals of this approach, Section 4.2.3 shows the architecture of the proposed model, Section 4.2.4 describes the volunteer platforms that we consider in our solution, Section 4.2.5 depicts the two main application scenarios, Section 4.2.6 describes the deployment and the adaptation of mobile applications, Section 4.2.7 presents the incentive scheme we propose for the volunteer users, and finally Section 4.2.8 discusses the security aspects needed.

4.2.1 Terms

We have split the actors of our model into three types:

• Mobile users: the final clients that consume the cloud services.

• Participating devices: desktop computers or mobile devices that collaborate in a cloud system by donating their idle resources. In other words, they act as intermediate service providers. They form the volunteer platforms.

• Cloud infrastructure: hardware and software components that provide the cloud services, without considering the participating devices.

4.2.2 Aims and Goals

As we showed in previous work [8], some clouds are experiencing a saturation of their networks and servers due to the high number of user devices accessing the services offered. In fact, this issue is only going to worsen in the next few years because, as mentioned earlier, the company CISCO Systems predicted in 2011 that there would be 50 billion devices with Internet access by 2020 [26], and a huge percentage of these devices is going to access mobile cloud services. Some solutions from the previous literature provide mechanisms to solve these bandwidth saturation issues, in addition to allowing for communications with less latency (even real-time applications). These solutions, which we have also described earlier, consist of deploying small-scale clouds or servers on the edge of the network. Unfortunately, they have not yet been implemented worldwide. Besides, not all mobile cloud applications have real-time execution as their priority, and, most importantly, many companies lack the capital to cope with the expense of deploying small clouds at multiple base stations or other locations at the edge of the network.

For all these reasons, we propose a new mobile cloud computing (MCC) model that, unlike the existing solutions, can be applied to current clouds without substantial disbursement. Our solution involves groups of volunteer users forming virtual platforms that act as resources for one or more clouds. Apart from cost savings, the goals of our proposed model are:

• Elasticity: a cloud system that uses our solution can use the computing resources provided by the volunteer platforms whenever needed, enabling the system to adapt to significant workload changes. With our solution, a mobile cloud application can have many volunteers subscribed and can adapt its execution in an elastic way.

• Scalability: after all, the volunteer platforms provide an extension to the cloud computing and storage capabilities, so cloud systems that use our proposed model can allow more users to access their resources. As we demonstrate in the evaluation section, our solution improves the scalability of the system when the number of mobile users increases.

• Efficiency: in some cases, mobile users would rather access a device from a volunteer platform than a remote cloud server (geographical proximity means fewer hops), thus reducing latency.

• Load balancing: as we explain later, the cloud controllers process the user requests and provide the mobile users with the corresponding cloud services, either through their own clouds or through devices from the volunteer platforms that collaborate with the cloud system. This scenario allows for the implementation of various load balancing schemes that prevent the cloud from becoming saturated. This feature is not analyzed in this work and is left as future work.

• Easy deployment: the clouds and the devices from the volunteer platforms must run the open-source BOINC server and client software, respectively, so this solution does not require significant alterations of the cloud infrastructure. We describe this deployment in Section 4.2.6.

4.2.3 Architecture

The architecture of our proposed model is shown in Figure 4.1, which is a variation of the basic MCC architecture described earlier.

Figure 4.1: Architecture of our proposed model, based on the utilization of volunteer platforms.

The novel part of this approach is the utilization of volunteer platforms. A volunteer platform consists of multiple participating devices that want to donate their idle computing and storage resources to cloud systems, in a similar way to the millions of devices that currently contribute to BOINC scientific projects. A participating device that wants to contribute to a cloud system should download a variation of the BOINC open-source software [19] (available as a Docker container⁴ and a virtual machine⁵), which executes in the idle CPU periods of the device, and should request work from the clouds that the device collaborates with. Once a cloud system has the collaboration of multiple participating devices, it can distribute the devices in logical volunteer platforms or even define hierarchies, depending on their capabilities. For example, the volunteer platforms can be defined based on the storage capacity of the participating nodes, so that when a mobile user requests a cloud application to store a file, the cloud system can replicate this file in a number of cloud servers and participating devices (from a volunteer platform) that are able to store a file of that size.

Each mobile user application that wants to use a cloud service should access the cloud system in the ordinary way (via the Internet). The cloud controller is then responsible for dealing with the user application request and providing the mobile user with the requested service. Nevertheless, in this model there are two options: providing the services using (1) the cloud servers of the system or (2) the volunteer resources of some participating devices (see Algorithm 4.1b⁶). From the point of view of a device that wants to donate resources to cloud services, it is necessary that it first subscribe to a cloud system as a participating device. Then, the cloud system runs some benchmarks on the participating device in order to test its capabilities. Once this has been done, and depending on the type of service, the participating device asks the cloud for tasks during its idle CPU time (see Algorithm 4.1a).

Algorithm 4.1a:
 1: procedure Subscribe(srvc)                ▷ subscription of a device into cloud service srvc
 2:   if not subscribed then
 3:     send subscription request to srvc
 4:     benchmarks ← receive answer from cloud      ▷ the cloud sends the benchmarks in order to know the capabilities of the device
 5:     res ← execute benchmarks
 6:     device_info_file ← create response file     ▷ this file should contain the benchmark results (res) and all other device information required (CPU model, RAM, GPS location, etc.)
 7:     send device_info_file to the cloud
 8:     url ← receive URL from cloud                ▷ this URL has the code the participating device should execute in order to collaborate in the service (e.g., code that is able to receive computation requests and execute a neural network for a music identification service)
 9:     code ← download code from url
10:     subscribed ← true
11:   end if
12:   run code in background
13: end procedure

Algorithm 4.1b:
 1: procedure Execute(tsk)                   ▷ remote execution of task tsk (e.g., a recorded audio)
 2:   send request to cloud
 3:   list ← receive answer from cloud             ▷ list of participating devices that are able to process the task; if the list is empty, the task should be executed by the cloud
 4:   if list then                                 ▷ list is not empty
 5:     err ← send tsk to N participating devices
 6:     if not err then                            ▷ there is no error
 7:       res_list ← receive answers from the N participating devices
 8:       res, err ← verify res_list               ▷ check if the quorum is reached
 9:     end if
10:   end if
11:   if err or not list then                      ▷ list is empty or there was an error related to the participating devices
12:     send tsk to cloud
13:     res ← receive answer from the cloud
14:   end if
15:   return res                                   ▷ computational result of tsk (e.g., identification that the short audio stored in tsk corresponds to the song X)
16: end procedure

Algorithm 4.1: Examples of: (a) subscription of a participating device in a cloud service; (b) remote execution of a mobile user task.

⁴ https://boinc.berkeley.edu/trac/wiki/BoincDocker
⁵ https://boinc.berkeley.edu/trac/wiki/VmServer
⁶ BOINC provides a form of redundant computing in which each computation is performed on multiple clients, the results are compared, and they are accepted only when a 'consensus' is reached. In some cases, new results must be created and sent. In Algorithm 4.1b, N is the replication factor, and the cloud administrators should choose it.
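As an illustration of the quorum check in line 8 of Algorithm 4.1b, the following minimal Python sketch accepts the most common reply when at least `quorum` of the N participating devices agree; otherwise it flags an error so the task falls back to the cloud. The function name and signature are illustrative, not part of BOINC or of our prototype.

```python
from collections import Counter

def verify(res_list, quorum):
    """Quorum check of Algorithm 4.1b: return (res, err)."""
    if not res_list:
        return None, True                      # nothing to compare, fall back to the cloud
    value, count = Counter(res_list).most_common(1)[0]
    if count >= quorum:
        return value, False                    # consensus reached
    return None, True                          # no consensus, fall back to the cloud

# Example: N = 3 replies, quorum = 2
print(verify(["Mr. Bean Street", "Mr. Bean Street", "Mr. Bean St"], quorum=2))
```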


4.2.4 Volunteer Platforms

Volunteer platforms consist of groups of multiple participating devices with similar computing capabilities (a decision made by the company). As the participating devices run the BOINC client software, they are basically desktop computers and mobile devices. Since the participating devices are going to process tasks or store data of the mobile users, it is important to exercise caution over the battery lifetime of mobile devices. Fortunately, the BOINC client software for mobile devices computes only under the following conditions (as mentioned earlier); a minimal gating check is sketched after the list:

• The mobile device is plugged into a power source (AC or USB).

• The battery is over 90% of charge.

• The screen is off.
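A minimal sketch of that gating policy, with illustrative names (the real BOINC client exposes these conditions through its own preferences, not through this function):

```python
def may_compute(plugged_in: bool, battery_level: float, screen_on: bool) -> bool:
    """Return True if a mobile volunteer device may run cloud tasks:
    it must be on AC/USB power, above 90% of charge, and with the screen off."""
    return plugged_in and battery_level > 0.90 and not screen_on

# e.g. a phone charging overnight with the screen off
assert may_compute(plugged_in=True, battery_level=0.95, screen_on=False)
```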

Figure 4.2: Mobile users access participating devices from volunteer platforms that are closer and are able to process the tasks needed.

In this way, the cloud tasks will not significantly reduce the battery life or the recharge time. For instance, an anonymous user can collaborate with a cloud system by just plugging their mobile volunteer device into a power source before going to sleep. Hence, the mobile volunteer device can participate in a cloud service while its owner is sleeping. Moreover, the ideal of this model is that mobile users leverage the idle computing and storage resources of volunteer devices that are geographically closer than the remote cloud servers, thereby preventing saturation of cloud networks and servers and also bypassing latency issues (because participating devices may be much nearer than the remote servers, so fewer hops are needed in order to arrive at the destination), as Figure 4.2 shows. However, as the resources provided by the participating nodes are volunteered, there is no assurance that these resources are going to be long-lasting. They are 'volatile' resources, and the availability of participating devices is therefore vitally important. That is why our solution does not consist exclusively of volunteer platforms. The main processing and storage resources are still the ones provided by the cloud infrastructure, in order to tolerate failures of the participating devices and thus prevent data loss. That said, the volunteer platforms provide many benefits because they can back up files in storage services, process tasks, etc., even when there are no more available resources in the cloud. In other words, this solution does not change the current behavior of cloud services; it only provides more (inexpensive) resources to them and reduces the workload of the cloud.

4.2.5 Application Scenarios

Our proposed solution can be applied to different scenarios, among which we highlight storage and computing services.

Storage Services Our solution, which consists of the usage of volunteer platforms as resource providers, can be applied to typical mobile cloud storage services [18], such as Dropbox, Google Drive, or OneDrive. In this scenario, once a participating device has subscribed to the cloud service, when a mobile user wants to upload a file to the cloud, it sends the file to the cloud (for simplicity, we ignore all the protocol details of these kinds of services). The file is then stored in a number of cloud nodes (depending on the replication factor of the storage system), and the encrypted file is sent to a number of participating devices of one or more volunteer platforms. In this way, each file is backed up in several places (for example, in two cloud servers and in two participating devices), so that the mobile user can download the file from either the cloud servers or the participating devices (for instance, based on proximity), and then verify its integrity by checking the hash against the cloud. This behavior is shown in Figure 4.3.

Figure 4.3: Example of a storage scenario: (1) subscription of a participating device in the system; (2) a mobile user device uploads a file to the cloud system; (3) the mobile device downloads the previously stored file from a participating device.

In addition, as the files stored by the participating devices are encrypted (e.g., using AES-256 [24]), there is no security risk in untrusted users storing private information, since the participating devices cannot access the file contents. Finally, we also assume that the volunteer users can specify the maximum storage they want to donate. For example, the default value can be 5% of the total storage capacity of the device (e.g., 25 GB for a computer with a hard disk of 500 GB).
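The following Python sketch illustrates, under simplifying assumptions, the encrypt-replicate-verify flow of Figure 4.3: the cloud encrypts the uploaded file with AES-256 (here AES-256-GCM from the `cryptography` package), replicates the ciphertext to two cloud nodes and two participating devices, and the mobile user later checks the plaintext hash against the value kept by the cloud. The node and device objects with a `store` method, and the key handling, are hypothetical and only for illustration.

```python
import os
import hashlib
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def cloud_store(plaintext: bytes, cloud_nodes, volunteer_devices):
    """Cloud side: encrypt, hash, and replicate (2 cloud + 2 volunteer copies)."""
    key = AESGCM.generate_key(bit_length=256)          # AES-256 key
    nonce = os.urandom(12)
    ciphertext = nonce + AESGCM(key).encrypt(nonce, plaintext, None)
    file_hash = hashlib.sha256(plaintext).hexdigest()  # kept by the cloud for integrity checks
    for replica in list(cloud_nodes)[:2] + list(volunteer_devices)[:2]:
        replica.store(ciphertext)                      # hypothetical store() method
    return key, file_hash

def mobile_download_check(downloaded: bytes, key: bytes, expected_hash: str) -> bytes:
    """Mobile side: decrypt a copy fetched from a participating device and
    verify its hash against the one provided by the cloud."""
    nonce, ciphertext = downloaded[:12], downloaded[12:]
    plaintext = AESGCM(key).decrypt(nonce, ciphertext, None)
    if hashlib.sha256(plaintext).hexdigest() != expected_hash:
        raise ValueError("integrity check failed")
    return plaintext
```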


Processing Services Our model allows for the execution of multiple processing services. Music identification services (e.g., Shazam or ACRCloud) and optical character recognition (OCR) services are examples of this kind of processing service. In these scenarios, a mobile device sends a task (an audio file or a picture) to a remote cloud where the data is processed (identifying the song from the audio or recognizing the text in the picture), and the results of the computation are sent back to the mobile device. Without our solution, all the processing is performed within the cloud infrastructure. With our model, the processing task can be performed by the participating devices, thus reducing the load on the cloud. In our approach, when a participating device subscribes to a cloud service, it downloads from the cloud the application that it needs to execute (e.g., the binaries with the algorithms or the neural network to use). A mobile user device that wants to process some data first sends the processing request to the cloud system, which answers with a list of addresses of participating devices (usually the addresses of all the devices of the same volunteer platform). Then, the user sends the task to a number of participating devices (two or more) in order to be able to trust results produced by untrusted users. If the replies received from the participating devices match, the result is considered correct. This behavior is shown in Figure 4.4.

Figure 4.4: Example of a processing scenario: (1) subscription of a participating device in the system; (2) the processing task requested by the mobile device is processed by two or more participating devices.

In contrast to storage services, where the participating devices receive and store encrypted files, in processing services the computation tasks may be performed by untrusted users (the participating devices) over unencrypted data. Therefore, in order to avoid security risks, it is compulsory that the participating devices only receive public content, such as street pictures or music recordings that the user wants to identify. There are some novel techniques that try to perform computation over encrypted data [39], so in the future our processing model could probably also be applied to tasks that use private information.

4.2.6 Deployment and Mobile Application Adaptation

This section describes the deployment of our solution for a typical application. This deployment is based on the BOINC behavior described earlier. The first step is to transform the cloud application into a BOINC project. This step does not require changes in the BOINC software, since only some transformations in the cloud application are needed. In this transformation, we obtain two BOINC applications: one for the participating devices and another for the mobile users. Each application consists of a program and a set of workunits and results. The BOINC servers receive two types of requests:

• Requests from the mobile users.

• Requests from the participating devices. The server also stores the addresses of the different participating devices.


From the BOINC point of view, there are two types of clients: participating devices and mobile users, and both execute different applications. When a mobile user wants to execute a mobile cloud application, it sends the request to the server and obtains a new workunit. This workunit only includes a list containing the addresses of the participating devices. Then the mobile device selects one of these addresses (as shown in Algorithm 4.1) and sends the data to the selected participating device. After that, the participating device processes the application and returns the result to the mobile device, without the intervention of the BOINC server, thus lowering the cloud load.

In order to avoid possible bottlenecks, several BOINC servers can be deployed, as is done in other BOINC projects such as SETI@Home. Moreover, several BOINC servers can be deployed in different clouds to further reduce bottlenecks.

As in other BOINC projects, all the services needed for the application are installed on the device when it subscribes to the system. BOINC uses virtualization to allow the execution of applications on different hardware and operating systems. The virtualization solution used by BOINC is VirtualBox, which is free and multiplatform; the recommended BOINC installer for Windows also includes VirtualBox, and this is transparent to the user who installs BOINC. Similarly, the BOINC client installer could provide the installation of Docker in a way that is transparent to the users, simplifying its deployment on the volunteer nodes.

From the installation point of view, we have evaluated the time needed to install BOINC for a typical project on several desktop computers and smartphones, and to subscribe them to different projects. The installation process took an average of 30 seconds for both the desktop computers and the smartphones, while the subscription phase took less than 10 seconds for all devices.

4.2.7 Incentive Scheme

Why would anonymous users want to donate their resources to cloud services? In BOINC, users donate their idle processing and storage resources to contribute to scientific projects, such as Climateprediction.net⁷, which helps fight climate change; Rosetta@Home⁸, which helps to find cures for cancer and Alzheimer's; or SETI@Home⁹, which helps to search for extraterrestrial intelligence. However, this is not enough; that is why BOINC has an incentive scheme based on credits. BOINC projects grant credit to users to encourage the volunteer users to contribute to the system. Credit has no monetary value; it is only a measure of how much a volunteer has contributed to a project (credits are calculated from the floating point operations that a device has computed) [9]. In our solution, companies and organizations should also include an incentive scheme based on credits in their services, as BOINC does. In this way, volunteer users would be rewarded for their contribution to the mobile cloud computing services they collaborate with.

Apart from that, the enterprises that want to deploy our model can also reward the volunteer users with some 'special' functionalities. For example, a company that offers storage services to its mobile clients could grant the volunteer users some premium features, or even a free professional account for one of its mobile applications.

⁷ http://www.climateprediction.net/
⁸ https://boinc.bakerlab.org/
⁹ https://setiathome.berkeley.edu/
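As a small illustration of the FLOP-based credit scheme described above, the sketch below assumes the commonly cited Cobblestone definition (one credit for 1/200 of a day on a 1 GFLOPS reference machine, i.e. 4.32e11 floating point operations); the constant and the function are illustrative assumptions, not taken from the BOINC source code.

```python
# Assumed Cobblestone definition: 1 credit = (86,400 s / 200) * 1 GFLOPS = 4.32e11 FLOPs
FLOPS_PER_CREDIT = (86_400 / 200) * 1e9

def credits_granted(flops_computed: float) -> float:
    """Credit granted to a volunteer device for validated work."""
    return flops_computed / FLOPS_PER_CREDIT

# e.g. the desktop computer of Section 4.3.1: 3,628,800 GigaFLOPs in two days
print(credits_granted(3_628_800 * 1e9))   # -> 8400.0 credits
```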


4.2.8 Security Aspects

BOINC allows project designers to use the Secure Sockets Layer (SSL) in their projects, so HTTPS (port 443) can be used in the log-in processes. Besides, BOINC uses ports 31416 and 1043 to exchange data, so the client has to unblock them if it is behind a firewall. Similarly, the implementation of our approach must use specific ports that should be unblocked in the firewall to manage the access between clients. We propose two alternatives to ensure secure communication between the mobile users and the participating devices:

• Transport Layer Security (TLS, whose latest version is 1.2) [21]: it is available to most TCP applications (e.g., FTPS, SMTPS, and HTTPS).

• Simple Object Access Protocol (SOAP, whose latest version is 1.2) [29]: it is a protocol for exchanging data using XML files. It can be combined with WS-Security (whose latest version is WS-Security 1.1) [43] in order to add security. WS-Security is a protocol that guarantees authentication, confidentiality, and integrity of the data exchanged.

As we explained in Section 4.2.5 (see Figure 4.3), when a mobile user wants to upload a file using a storage service, it first has to agree on an encryption key with the cloud through a key-agreement protocol. For that purpose, we propose the Diffie-Hellman Ephemeral (DHE) or the Elliptic Curve Diffie-Hellman Ephemeral (ECDHE) [13] key-agreement protocols, because they ensure Perfect Forward Secrecy [37]. The encryption key should then be stored in a secure local keystore by both the mobile device and the cloud. Besides, the file should be transmitted from the mobile device to the cloud via a secure channel (e.g., TLS), and the file should then be encrypted on the cloud side in order to offload the computation from the mobile device. A good option is to use a symmetric-key algorithm, such as AES-256 [24] or 3DES [12]. Once the file is encrypted, the cloud can send it to multiple participating devices while ensuring confidentiality. When the mobile device downloads the encrypted file from a participating device, it just has to verify the file hash with the cloud (to check integrity) and decrypt it using the encryption key previously stored in its keystore.
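A minimal sketch of such an ephemeral elliptic-curve key agreement, using the X25519 primitives of the Python `cryptography` package and HKDF to derive a 256-bit file-encryption key; key serialization, authentication of the public keys, and keystore handling are omitted, and the `info` label is an arbitrary assumption of this sketch.

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric.x25519 import X25519PrivateKey
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

# Each side generates an ephemeral key pair (discarded after the session,
# which is what provides Perfect Forward Secrecy).
mobile_priv = X25519PrivateKey.generate()
cloud_priv = X25519PrivateKey.generate()

# Public keys are exchanged over the network (serialization omitted here);
# both sides compute the same shared secret.
mobile_secret = mobile_priv.exchange(cloud_priv.public_key())
cloud_secret = cloud_priv.exchange(mobile_priv.public_key())
assert mobile_secret == cloud_secret

# Derive the symmetric file-encryption key (e.g., for AES-256) from the secret.
file_key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                info=b"mcc-storage-key").derive(mobile_secret)
```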

Apart from that, when a mobile user wants to execute a task, the cloud can reply to the mobile user with the list of participating devices that are able to execute the task. Exactly as in BOINC, the mobile user has to send the task to N different users and, after receiving the computation results from all of them, check whether the quorum is reached. For instance, suppose that a mobile user wants to apply an OCR program to the text in a poster, N is 3, and the quorum is 2. The user first takes a picture of the text and then asks the cloud service to process it, so the cloud replies with the list of participating devices that are able to process the task (normally, a whole volunteer platform). Then, the mobile user application sends the picture to three different participating devices (the value of N) and checks whether at least two of the answers (the quorum value) match. If the quorum is reached (e.g., two of the participating devices answer "Mr. Bean Street"), the result is considered to be correct. Otherwise, the mobile user requests it directly from the cloud. This behavior is also shown in Algorithm 4.1b. Apart from that, as described in [10], BOINC prevents the distribution of malware among the volunteer computers because applications only have access to their own input and output files via sandboxing. Besides, the BOINC software is also able to use virtualization support [14], which would facilitate the deployment of our proposed model.


4.3 Evaluation

In this section we present the evaluation performed. In Section 4.3.1 we detail an analysis of the volunteer devices that participate in the well-known SETI@Home project, together with a description of how we characterized three different individual devices. We have used these results to perform the experiments presented in Section 4.3.2, which consist of different case studies analyzed through realistic simulations.

4.3.1 Devices characterization

We have analyzed the CPU performance of the 138,252 computers of the SETI@Home project that were active on June 12, 2017, 22:02:19 UTC (published in [50]). After analyzing all the CPU models, we found that 134,182 (97.06%) of the total number of devices were desktop computers and laptops, while the remaining 4,070 (2.94%) computers were mobile devices. Figure 4.5 shows the CPU performance (GigaFLOPS/core or GigaFLOPS/computer) of the aforementioned SETI@Home volunteer devices. The large difference (3.13 versus 17.5 GigaFLOPS) between the performance per core (Figure 4.5a) and per computer (Figure 4.5b) arises because the SETI@Home tasks use the maximum number of cores available for computation, ranging from 1 to 102 cores. As can be seen in the figure, mobile devices are on average much less powerful than the desktop and laptop computers (4.46 vs 17.91 GigaFLOPS/computer). We have used these SETI@Home CPU traces to model the power of the participating devices that form the volunteer platforms in the simulations presented in Section 4.3.2.

Figure 4.5: CPU performance of the volunteer computers of the SETI@Home project: (a) GFLOPS/core, (b) GFLOPS/computer.

In order to model the availability of the participating devices, we used the results obtained in [36]. This research analyzed about 230,000 availability traces obtained from the volunteer computers that participate in the SETI@Home project. According to this paper, 21% of the volunteer computers exhibit truly random availability intervals, and it also measured the goodness of fit of the resulting distributions using standard probability-probability (PP) plots. For availability, the authors noted that in most cases the Weibull distribution is a good fit; for unavailability, the distribution that offers the best fit is the log-normal. The parameters used for the Weibull distribution are shape = 0.393 and scale = 2.964. For the log-normal, the parameters obtained and used in ComBoS are a distribution with mean µ = −0.586 and standard deviation σ = 2.844. All these parameters were obtained from [36] too. The availability and unavailability modeling allows us to simulate volunteer resources entering and leaving the system.
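For illustration, a minimal NumPy sketch of how such availability and unavailability intervals can be sampled with the parameters above (the time units and the alternating ON/OFF trace construction are assumptions of this sketch, not a description of the ComBoS implementation):

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_intervals = 10

# Availability intervals: Weibull with shape = 0.393 and scale = 2.964 [36].
# NumPy's weibull() samples with scale 1, so we multiply by the scale parameter.
availability = 2.964 * rng.weibull(0.393, size=n_intervals)

# Unavailability intervals: log-normal with mu = -0.586 and sigma = 2.844 [36].
unavailability = rng.lognormal(mean=-0.586, sigma=2.844, size=n_intervals)

# Alternate ON/OFF periods to build one device's availability trace.
trace = np.empty(2 * n_intervals)
trace[0::2] = availability
trace[1::2] = unavailability
print(trace[:6])
```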

Furthermore, because the software of the participating devices in our proposed model is based on a small variation of the BOINC client software, we are also interested in evaluating the performance of individual devices participating in a real BOINC volunteer computing project. To make this possible, we have used the following devices:

• Desktop computer: Intel® Core™ i7-4790 (4 cores (8 threads), 3.60 GHz), OS: Ubuntu 16.04.2 LTS, 8 GB of RAM.

• Mobile device: Woxter Zielo ZX840HD (8 cores, 1.7 GHz), OS: Android 4.4.2, 2 GB of RAM.

• ARM device: ODROID-C2 (4 cores, 1.5 GHz), OS: Ubuntu 16.04.2 LTS, 2 GB of RAM.

Each device has collaborated in the most famous BOINC project, SETI@Home. The results obtained are:

• Desktop computer: 3,628,800 GigaFLOPs executed in 2 days.

• Mobile device: 345,600 GigaFLOPs executed in 2 days.

• ARM device: 322,600 GigaFLOPs executed in 2 days.

These results demonstrate that desktop computers currently provide more computational power to volunteer computing than the other kinds of devices that participate in these projects.

4.3.2 Case Studies

We have evaluated three different mobile cloud computing services as case studies: a generic processing service, a storage service, and a speech recognition processing service. For the evaluation, we have used ComBoS [7], a complete BOINC simulator created by the authors in previous work, as a starting point. ComBoS is publicly available software¹⁰, was implemented in the C programming language with the help of the tools provided by the MSG API of SimGrid [17], and is able to perform realistic simulations of the whole BOINC infrastructure, considering all its features: projects, servers, network, redundant computing, scheduling, etc. In order to evaluate these case studies, we have modified ComBoS to evaluate the scenario shown in Figure 4.6. This scenario consists of two groups of mobile devices that access a cloud in order to use the services. It also has four volunteer platforms that provide computing and storage resources to the cloud. The bandwidth and latency values of the networks that connect the different components are also specified in Figure 4.6. All other parameters relevant to the simulations (number of devices of each type, power, etc.) are specified in each case study.

Table 4.2 shows the details of the platform used to simulate the case studies. Every execution in this section has simulated 100 hours. In order to account for the randomness of the simulations and to deem the results reliable, each simulation result presented in this section is based on the average of 20 runs. For a 95% confidence interval, the error is less than ±2% for all values.

¹⁰ ComBoS can be downloaded from: https://github.com/arcos-combos/combos
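For reference, a small Python sketch of how such a 95% confidence bound can be computed from the 20 runs of one metric (the sample values below are synthetic, only to make the snippet runnable):

```python
import numpy as np
from scipy import stats

def relative_ci95(samples):
    """Half-width of the 95% confidence interval of the mean (Student's t),
    expressed relative to the mean, e.g. the +/- 2% bound reported above."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    sem = samples.std(ddof=1) / np.sqrt(n)
    half_width = stats.t.ppf(0.975, df=n - 1) * sem
    return half_width / samples.mean()

runs = np.random.default_rng(1).normal(loc=30.0, scale=0.4, size=20)  # 20 simulated values
print(f"relative 95% CI half-width: {relative_ci95(runs):.2%}")
```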


Figure 4.6: Scenario simulated in the experiments.

Table 4.2: Platform used in the evaluation.

Processor: Intel® Core™ i7-920 (4 cores (8 threads), 2.67 GHz)
RAM: 32 GB
Operating System: Ubuntu 14.04.5 LTS
Kernel: 3.13.0-119-generic
SimGrid version: 3.11

4.3.3 Case study 1: Processing Service

A good case study to evaluate our proposed model is to analyze its performance for processing services. These processing services can range from a music identification service (e.g., Shazam) to a text recognition service (e.g., an OCR). We considered the scenario shown in Figure 4.6, where each volunteer platform has 250 participating devices, and there are from 20,000 to 100,000 mobile users. The cloud infrastructure consists of 20 nodes with a computing power of 50 GigaFLOPS each, and each mobile device requests the cloud to compute a task of 20 GigaFLOPs¹¹ (based on the results of [46]) and 5 MB every 30 minutes on average. We have considered three different configurations:

• Configuration 1: it corresponds to the original behavior of a cloud system (without using volunteer platforms). All the processing tasks are performed in the cloud.

• Configuration 2: both volunteer platforms are formed by participating devices in the same proportion and with the same properties (power and availability) as in the SETI@Home project (see Section 4.3.1).

¹¹ We distinguish between FLOPS (floating point operations per second) and FLOPs (floating point operations).


• Configuration 3: both volunteer platforms are formed only by mobile devices (the participating devices are only mobile devices in this configuration), with the same properties (power and availability) as in the SETI@Home project (see Section 4.3.1).

In configurations 2 and 3, the tasks are computed either by the cloud or by the participating devices on a round-robin basis. When a task is computed by a volunteer platform instead of by the cloud, three different participating devices compute the task with a quorum of two.

Figure 4.7: Case study 1: performance of the processing service for the three different configurations (cloud load, volunteer platforms load, total throughput, and average time per task execution versus the number of mobile users).

Figure 4.7 shows the results of this experiment: the load¹² of both the cloud and the volunteer platforms, the total throughput of the system in PetaFLOPs, and the average time in which a task is executed. With configuration 1, the cloud became saturated at almost 85,000 mobile users. By contrast, this did not happen with configurations 2 and 3, because the computation of the tasks is shared by both the participating devices and the cloud, not only by the cloud as in the previous configuration. As shown in the figure, the use of volunteer platforms allows for an increase in the scalability and in the total throughput of the system, since more users can process their tasks in the system. Finally, although the average time per task execution in configuration 1 is lower than in the other configurations (except when the cloud is saturated), this difference is not significant, especially for configuration 2 (less than 200 ms), which shows that our approach would not have a negative impact on the user experience.

¹² We considered the load as the maximum of the network and the CPU load.


4.3.4 Case study 2: Storage Service

This second case study is about a file storage service. The scenario is the same as in the previous case (Figure 4.6), where each volunteer platform has 25,000 participating devices. The cloud infrastructure consists of 200 nodes, each with 3.2 Terabytes, making a total of 640 Terabytes of storage. In this case study there are 1 million cloud users, and each one uploads an average of 4.5 files of 50 MB each (following an exponential distribution) to the cloud service. We have again considered three different configurations:

• Configuration 1: it corresponds to the original behavior of a cloud system (without using volunteer platforms). All the files are downloaded from the cloud. Each file should be replicated three times in the cloud servers.

• Configuration 2: both volunteer platforms are formed by participating devices with a storage capacity that follows a normal distribution with µ = 5 and σ = 0.75 (on average, 5 GB per device). Each file should be replicated two times in the cloud servers and two times in the participating devices.

• Configuration 3: both volunteer platforms are formed by participating devices with a storage capacity that follows a normal distribution with µ = 10 and σ = 1 (on average, 10 GB per device). Each file should be replicated two times in the cloud servers and two times in the participating devices.

Figure 4.8: Case study 2: performance of the storage service for the three different configurations: (a) storage consumed (%), (b) terabytes stored, (c) average time per file download (s).

Figure 4.8 shows the results of this experiment. As can be seen in graphs (a) and (b), with the first configuration (without using volunteer platforms) the cloud is not able to store more files, because there is no more available space in its nodes. On the other hand, with configurations 2 and 3, the cloud is not saturated, and the service is then able to store and back up all the files from the 1 million users thanks to the storage resources donated by the participating users. Moreover, in graph (c) we show the average time required by the mobile users to download a file, assuming the mobile users download a file every 2 hours on average. With configuration 1, the cloud network becomes saturated soon; that is why the download time is higher than in configurations 2 and 3, where the mobile users also download files from the volunteer platforms.


4.3.5 Case study 3: Speech Recognition

In [30], the authors show that the new multimedia applications are changing the current cloud computing architecture. They analyze different multimedia applications, such as face and speech recognition, or mobile augmented reality. In the paper, the authors performed all their experiments using a Dell Latitude 2102 as the only mobile device. In this section, however, we have simulated their speech recognition application considering the scenario of Figure 4.6 and using the same configurations as in the first case study (Processing Service, Section 4.3.3). Each volunteer platform has 250 participating devices, and there are from 20,000 to 200,000 mobile users. The speech recognition application is based on an open-source speech-to-text framework that uses Hidden Markov Model (HMM) recognition systems [52]. In this application, the average request size is 243 KB, and the cloud response size is 60 bytes on average. The request and response sizes and the rest of the parameters are also obtained from [30]. We have simulated cases where 1 to 22 words are recognized.

Figure 4.9: Case study 3: performance of the speech recognition service for the three different configurations (cloud load, volunteer platforms load, total throughput, and average time per task execution versus the number of mobile users).

Figure 4.9 shows the results of this last experiment, considering the cloud and volunteer platform loads, the total throughput, and the average time in which a task is executed (similarly to the first case study). With configuration 1, the cloud became saturated at almost 120,000 mobile users. As in the generic processing service, this did not happen with configurations 2 and 3. As shown in the figure, the use of volunteer platforms allows for an increase in the scalability and in the total throughput of the system, since more users can process their tasks in the system. Finally, although the average time per task execution in configuration 1 is lower than in the other configurations (except when the cloud is saturated), this difference is not significant, especially for configuration 2 (less than 50 ms), which shows that our approach would not have a negative impact on the user experience. The participating devices help avoid a performance loss when a high number of clients consume the services, and also improve the availability and usability of the service.

5 Conclusions

In this report we have explored the potential benefits of integrating a popular Big Data platform, Apache Spark, with HPC-oriented communication techniques represented by DIY block parallelism, together with a new block storage architecture named Bufferflow. We analyzed the literature to derive the key design features that would interest both the HPC and Big Data communities, and proposed two techniques to reflect these goals: a framework for interoperability of Spark and MPI, which makes data structures compatible and preserves the programming interface of the Spark environment, thus remaining compatible with any Spark-based application and tool, while providing efficient shuffle and collectives through DIY, a powerful library built on top of MPI; and a novel buffering service that can be used as a building block for scalable storage systems. Our buffering system productively and efficiently supports requirements such as adaptiveness to unpredicted data volume variations and low-overhead data transfers between the applications of composite workflows.

We have developed a promising prototype that shows good performance and scalability for data-intensive and communication-intensive operations. These results are, however, preliminary, and further assessment is needed to optimize the framework. Future evaluations include analyzing memory consumption, profiling compute overheads for serialization and deserialization, studying scalability with applications running multiple stages with independent transformations, and adapting real-world use cases that rely on higher-level libraries.

The work presented is relevant for the Big Data community, since we offer improved performance and reduced latency for shuffle and other communication-intensive phases of Spark workflows, which is a major requirement for data analytics in scientific computing. In addition, we expose the benefits of using supercomputing infrastructures without changing the Spark framework. On the other hand, the HPC community can benefit from the myriad of libraries built on top of Spark without giving up scalability. Spark's resilience, provenance, and ease of use are lacking in the HPC software stack, and Spark-DIY affords HPC practitioners such characteristics, which are commonplace in the Big Data world.

Future work could enhance the architecture to support heterogeneity and accelerate independent transformations, and even extend the Spark programming model to exploit other DIY communication patterns, such as local neighborhood block exchanges, that are available in DIY but have no Spark counterpart. We also plan to integrate Spark's elasticity into our architecture, along with MPI-I/O support to benefit from the highly optimized parallel I/O of HPC systems as an alternative to current storage systems like HDFS.

This document also presents a new MCC model that can provide more computing and storage resources to public, private, or hybrid clouds. The proposed heterogeneous model uses the computing and storage resources of devices from the general public to contribute to cloud systems, so organizations can leverage the idle periods of these devices to gain computing and storage resources for their cloud services, in a similar way to how volunteer devices contribute to BOINC projects.


Acknowledgment

This work has been partially funded by the Spanish Ministerio de Economía y Competitividad under grant TIN2016-79637-P.

References

[1] Archer hardware. Available at http://www.archer.ac.uk/about-archer/hardware/.

[2] Cray xc30. Available at http://www.cray.com/products/computing/xc-series.

[3] Lustre parallel file system. Available at http://lustre.org/.

[4] MPI Forum., 2017. Available at http://www.mpi-forum.org/.

[5] The HDF group., 2017. Available at http://www.hdfgroup.org/HDF5/.

[6] Nawab Ali, Philip Carns, Kamil Iskra, Dries Kimpe, Samuel Lang, Robert Latham, Robert Ross, Lee Ward, and P. Sadayappan. Scalable I/O Forwarding Framework for high-performance computing systems. In Proceedings of IEEE Cluster, September 2009.

[7] Saúl Alonso-Monsalve, Félix García-Carballeira, and Alejandro Calderón. ComBoS: A complete simulator of volunteer computing and desktop grids. Simulation Modelling Practice and Theory, 77:197–211, 2017.

[8] Saúl Alonso-Monsalve, Félix García-Carballeira, and Alejandro Calderón. Fog computing through public-resource computing and storage. In The 2nd International Conference on Fog and Mobile Edge Computing (FMEC 2017), pages 81–87. IEEE, May 2017.

[9] D. P. Anderson and J. McLeod. Local scheduling for volunteer computing. In 2007 IEEE International Parallel and Distributed Processing Symposium, pages 1–8, March 2007.

[10] David P. Anderson. Volunteer computing: The ultimate cloud. Crossroads, 16(3):7–10, March 2010.

[11] Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy Katz, Andy Konwinski, Gunho Lee, David Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. A view of cloud computing. Commun. ACM, 53(4):50–58, April 2010.

[12] William C. Barker, Elaine Barker, U.S. Department of Commerce, National Institute of Standards and Technology. Recommendation for the Triple Data Encryption Algorithm (TDEA) Block Cipher: NIST Special Publication 800-67, Revision 2. CreateSpace Independent Publishing Platform, USA, 2012.

[13] Simon Blake-Wilson, Bodo Moeller, Vipul Gupta, Chris Hawk, and Nelson Bolyard. Elliptic curve cryptography (ECC) cipher suites for transport layer security (TLS), 2006.

[14] BOINC. VirtualBox. http://boinc.berkeley.edu/wiki/VirtualBox, 2016.

[15] Silvina Caíno-Lores, Alberto García Fernández, Félix García-Carballeira, and Jesús Carretero Pérez. A cloudification methodology for multidimensional analysis: Implementation and application to a railway power simulator. Simulation Modelling Practice and Theory, 55:46–62, 2015.


[16] Silvina Caíno-Lores, Andrei Lapin, Peter G Kropf, and Jesús Carretero. Lessons learned from applying big data paradigms to large scale scientific workflows. In WORKS@SC, pages 54–58, 2016.

[17] Henri Casanova, Arnaud Giersch, Arnaud Legrand, Martin Quinson, and Frédéric Suter. Versatile, scalable, and accurate simulation of distributed applications and platforms. Journal of Parallel and Distributed Computing, 74(10):2899–2917, June 2014.

[18] Y. Cui, Z. Lai, and N. Dai. A first look at mobile cloud storage services: architecture, experimentation, and challenges. IEEE Network, 30(4):16–21, July 2016.

[19] D. P. Anderson et al. BOINC. https://github.com/BOINC/boinc, 2017.

[20] Aaron Davidson and Andrew Or. Optimizing shuffle performance in Spark. University of California, Berkeley, Department of Electrical Engineering and Computer Sciences, Tech. Rep., 2013.

[21] T. Dierks and E. Rescorla. The transport layer security (TLS) protocol version 1.2. In IETF RFC 5246, pages 1–19, 2008.

[22] Hoang T Dinh, Chonho Lee, Dusit Niyato, and Ping Wang. A survey of mobile cloud computing: architecture, applications, and approaches. Wireless Communications and Mobile Computing, 13(18):1587–1611, 2013.

[23] Francisco Rodrigo Duro, Javier Garcia Blas, Daniel Higuero, Oscar Perez, and Jesus Carretero. Cosmic: A hierarchical cloudlet-based storage architecture for mobile clouds. Simulation Modelling Practice and Theory, 50:3–19, 2015. Special Issue on Resource Management in Mobile Clouds.

[24] Morris J Dworkin, Elaine B Barker, James R Nechvatal, James Foti, Lawrence E Bassham, E Roback, and James F Dray Jr. Advanced encryption standard (AES). Federal Inf. Process. Stds. (NIST FIPS)-197, 2001.

[25] Ewa Deelman et al. PANORAMA: An approach to performance modeling and diagnosis of extreme scale workflows. International Journal of High Performance Computing Applications, 31(1):4–18, 2017.

[26] Dave Evans. The internet of things: How the next evolution of the internet is changing everything. Whitepaper, CISCO Internet Business Solutions Group (IBSG), 1:1–11, April 2011.

[27] Seth Gilbert and Nancy Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. SIGACT News, 33(2):51–59, June 2002.

[28] A. Gittens, A. Devarakonda, E. Racah, M. Ringenburg, L. Gerhardt, J. Kottalam, J. Liu, K. Maschhoff, S. Canon, J. Chhugani, P. Sharma, J. Yang, J. Demmel, J. Harrell, V. Krishnamurthy, M. W. Mahoney, and Prabhat. Matrix factorizations at scale: A comparison of scientific data analytics in Spark and C+MPI using three case studies. In 2016 IEEE International Conference on Big Data (Big Data), pages 204–213, Dec 2016.

[29] Martin Gudgin, Marc Hadley, Noah Mendelsohn, Jean-Jacques Moreau, Henrik Frystyk Nielsen, Anish Karmarkar, and Yves Lafon. Simple object access protocol (SOAP) 1.2. World Wide Web Consortium, 2003.


[30] K. Ha, P. Pillai, G. Lewis, S. Simanta, S. Clinch, N. Davies, and M. Satyanarayanan. Theimpact of mobile multimedia applications on data center consolidation. In 2013 IEEEInternational Conference on Cloud Engineering (IC2E), pages 166–176, March 2013.

[31] Nicole Hemsoth. The slow death of the parallel file system. Next Plat-form, January 2016. Available at http://www.nextplatform.com/2016/01/12/the-slow-death-of-the-parallel-file-system/.

[32] Yun Chao Hu, Milan Patel, Dario Sabella, Nurit Sprecher, and Valerie Young. Mobileedge computing—a key technology towards 5g. ETSI White Paper, 11, 2015.

[33] F. Isaila and J. Carretero. Making the case for data staging coordination and control forparallel applications. 2015. Available at http://www.epigram-project.eu/wp-content/uploads/2015/07/Isaila-ExaMPI15.pdf.

[34] F. Isaila, J. Garcia, J. Carretero, R. Ross, and D. Kimpe. Making the Case for Reformingthe I/O Software Stack of Extreme-Scale Systems. Elsevier’s Journal Advances inEngineering Software, 2016.

[35] Florin Isaila, Jesus Carretero, and Rob Ross. Clarisse: A middleware for data-stagingcoordination and control on large-scale hpc platforms. 2016 16th IEEE/ACM CCGrid,pages 346–355, 2016.

[36] B. Javadi, D. Kondo, J. M. Vincent, and D. P. Anderson. Discovering statistical models of availability in large distributed systems: An empirical study of SETI@home. IEEE Transactions on Parallel and Distributed Systems, 22(11):1896–1903, Nov 2011.

[37] Hugo Krawczyk. Perfect Forward Secrecy. Springer US, Boston, MA, 2011.

[38] G. Lewis, S. Echeverría, S. Simanta, B. Bradshaw, and J. Root. Tactical cloudlets: Moving cloud computing to the edge. In 2014 IEEE Military Communications Conference, pages 1440–1446, Oct 2014.

[39] Feng-Hao Liu. Computation over encrypted data. Cloud Computing Security: Foundations and Challenges, page 305, 2016.

[40] N. Malitsky, A. Chaudhary, S. Jourdain, M. Cowan, P. O'Leary, M. Hanwell, and K. K. Van Dam. Building near-real-time processing pipelines with the Spark-MPI platform. In 2017 New York Scientific Data Summit (NYSDS), pages 1–8, Aug 2017.

[41] Xiangrui Meng, Joseph Bradley, Burak Yavuz, Evan Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, DB Tsai, Manish Amde, Sean Owen, et al. MLlib: Machine learning in Apache Spark. The Journal of Machine Learning Research, 17(1):1235–1241, 2016.

[42] D. Morozov and T. Peterka. Block-parallel data analysis with DIY2. In 2016 IEEE 6th Symposium on Large Data Analysis and Visualization (LDAV), pages 29–36, Oct 2016.

[43] Anthony Nadalin, Chris Kaler, Ronald Monzillo, and Phillip Hallam-Baker. Web Services Security: SOAP Message Security 1.1 (WS-Security 2004). OASIS Standard Specification, 2006.

[44] B. Nicolae, C. H. A. Costa, C. Misale, K. Katrinis, and Y. Park. Leveraging adaptive I/O to optimize collective data shuffling patterns for big data analytics. IEEE Transactions on Parallel and Distributed Systems, 28(6):1663–1674, June 2017.

[45] Tom Peterka, Dmitriy Morozov, and Carolyn Phillips. High-performance computation of distributed-memory parallel 3D Voronoi and Delaunay tessellation. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 997–1007. IEEE Press, 2014.

[46] B. Ramesh, A. Bhardwaj, J. Richardson, A. D. George, and H. Lam. Optimization and evaluation of image- and signal-processing kernels on the TI C6678 multi-core DSP. In 2014 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6, Sept 2014.

[47] Daniel A. Reed and Jack Dongarra. Exascale Computing and Big Data. Commun. ACM,58(7):56–68, June 2015.

[48] S. Caíno-Lores, A. Lapin, P. Kropf, and J. Carretero. Applying big data paradigms to a large-scale scientific workflow: Lessons learned and future directions. Future Generation Computer Systems, 2018.

[49] Frank Schmuck and Roger Haskin. GPFS: A shared-disk file system for large computing clusters. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST '02), Berkeley, CA, 2002. USENIX Association.

[50] SETI@home. CPU performance. https://setiathome.berkeley.edu/cpu_list.php (accessed 12 June 2017, 22:02:19 UTC). Online.

[51] G. M. Slota, S. Rajamanickam, and K. Madduri. A case study of complex graph analysis in distributed memory: Implementation and optimization. In 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 293–302, May 2016.

[52] Sphinx-4. Sphinx-4: A speech recognizer written entirely in the Java programming language. http://cmusphinx.sourceforge.net/sphinx4/. Online.

[53] I. Stojmenovic and S. Wen. The fog computing paradigm: Scenarios and security issues. In 2014 Federated Conference on Computer Science and Information Systems, pages 1–8, Sept 2014.

[54] François Tessier, Preeti Malakar, Venkatram Vishwanath, Emmanuel Jeannot, and Florin Isaila. Topology-aware Data Aggregation for Intensive I/O on Large-scale Supercomputers. In Proceedings of the 1st Workshop on Optimization of Communication in HPC, pages 73–81, 2016.

[55] R. Thakur, W. Gropp, and E. Lusk. Data sieving and collective I/O in ROMIO. In Proceedings of the Seventh Symposium on the Frontiers of Massively Parallel Computation (Frontiers '99), pages 182–189, Feb 1999.

[56] T. X. Tran, A. Hajisami, P. Pandey, and D. Pompili. Collaborative mobile edge computing in 5G networks: New paradigms, scenarios, and challenges. IEEE Communications Magazine, 55(4):54–61, April 2017.

[57] Ian H Witten, Eibe Frank, Mark A Hall, and Christopher J Pal. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2016.

[58] Shanhe Yi, Cheng Li, and Qun Li. A survey of fog computing: Concepts, applications and issues. In Proceedings of the 2015 Workshop on Mobile Big Data, Mobidata '15, pages 37–42, New York, NY, USA, 2015. ACM.

[59] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, pages 2–2. USENIX Association, 2012.
