Synchronous Parallel Processing of Big-Data Analytics Services to Optimize Performance in Federated Clouds

Gueyoung Jung, Nathan Gnanasambandam

Xerox Research Center Webster

Webster, USA

{gueyoung.jung, nathang}@xerox.com

Tridib Mukherjee

Xerox Research Center India

Bangalore, India

[email protected]

Abstract—Parallelization of big-data analytics services over a federation of heterogeneous clouds has been considered as a way to improve performance. However, contrary to common intuition, there is an inherent tradeoff between the level of parallelism and the performance of big-data analytics, principally because of the significant delay incurred when big-data is transferred over the network. The data transfer delay can be comparable to, or even higher than, the time required to compute the data. To address this tradeoff, this paper determines: (a) how many and which computing nodes in federated clouds should be used for parallel execution of big-data analytics; (b) an opportunistic apportioning of big-data to these computing nodes that enables synchronized completion at best-effort performance; and (c) the sequence in which apportioned data chunks of different sizes are computed in each node, so that the transfer of a chunk is overlapped as much as possible with the computation of the previous chunk in the node. In this regard, the Maximally Overlapped Bin-packing driven Bursting (MOBB) algorithm is proposed, which improves performance by up to 60% against existing approaches.

Keywords-federated clouds; big-data analytics; parallelization

I. INTRODUCTION

Deploying big-data analytics services into clouds is more than just a contemporary trend. We are living in an era where data is generated from many different sources such as sensors, social media, click-streams, log files, and mobile devices. Collected data can now exceed hundreds of terabytes and be continuously generated. Such big-data represents data sets that can no longer be easily analyzed with traditional data management methods and infrastructures [1][2][3]. In order to promptly derive insight from big-data, enterprises have to deploy big-data analytics on an extraordinarily scalable delivery platform. The advent of Cloud Computing has enabled enterprises to analyze such big-data by leveraging vast amounts of computing resources available on demand at low resource usage cost.

One of the research challenges in this regard is figuring out how to best use federated cloud resources to maximize the performance of big-data analytics. In this paper, we mainly focus on parallel data mining, such as topic mining and pattern mining, that can be run in multiple computing nodes simultaneously. Parallel data mining consumes a lot of computing resources to analyze large amounts of unstructured data, especially when executed under a time constraint. Cloud service providers may have enough capacity dedicated to such data-intensive services in their own data centers. However, a loosely coupled federation of clouds consisting of legacy resources and applications is often a better choice, as analytics can be carried out partly on local private resources while the rest of the big-data is transferred to external computing nodes that are optimized for processing big-data analytics. This paradigm is more flexible and has obvious cost benefits compared to using a single data center [4][5].

In order to optimize parallel data mining not just across multiple computing nodes but across different clouds separated by relatively high latencies, this paper addresses: (a) node determination, i.e., “how many” and “which” computing nodes in federated clouds should be used; (b) synchronized completion, i.e., how to optimally apportion big-data across parallelized computation environments to ensure synchronization, where synchronization refers to completing all workload portions at the same time even when the resources and inter-networks are heterogeneous and situated in multiple Internet-separated clouds; and (c) data partition determination, i.e., how to serialize data chunks of different sizes to the computing nodes so as to avoid overflow or underflow at the nodes.

To address these problems, we develop a heuristic cloud-bursting algorithm, referred to as Maximally Overlapped Bin-packing driven Bursting (MOBB). Specifically, we improve the benefit of data mining parallelization by considering the time overlap (a) across computing nodes and (b) between the data transfer delay and the computation time within each computing node. While unequal loads may be apportioned to the parallel computing nodes, our algorithm still ensures that outputs are produced at the same time, without any single slow node acting as a bottleneck.

When data mining is run on a pool of inter-connected clouds, extended periods of data transfer delay are often experienced, and the delay depends on the location of each computing node. Fast transfer of data chunks to a slow computing node can cause data overflow, whereas slow transfer of chunks to a fast node can lead to underflow that leaves the node idle. Our MOBB algorithm reduces both such overflow and underflow.


To evaluate our approach, we employ frequent pattern mining [6] as a specific type of parallel analytics whose inputs are huge but whose outputs are far smaller. We deploy the analytics on small Hadoop [7] clusters in four different clouds. The experimental results show that our approach outperforms existing load-balancing methods for big-data analytics.

II. RELATED WORK

The problem of load distribution has previously been studied for different distributed computing systems, including computing grids [8], parallel architectures [9], and data centers [10][11][12]. Load-balancing for parallel applications has typically involved distributing loads to computing nodes so as to maximize performance. Although cost and delay overheads have been considered in many cases, such overheads usually involve application delays in check-pointing and context switching. This paper, on the other hand, focuses on the distribution of big-data over computing nodes separated far apart. In this setting, the overhead of data transfer can be significant, since the network latencies between nodes may be high and the amount of data being transferred is large.

Although recent research work [13][14] has considered the impact of data transfer delay on performance when selecting clouds to which service loads are redirected, it does not consider the further optimization of overlapping the transfer delay of one data chunk with the computation time of the previous chunk. The need for such overlapping has been identified for clusters [15]. Continuation-based overlapping of data transfers with instruction execution has been investigated for many-core architectures [9], but such overlapping is restricted by a pre-defined order of instruction sets. This paper instead determines the order in which data chunks of different sizes are transferred to individual nodes, thereby achieving the maximum overlap between data transfer and data computation.

Another way to optimize the performance of big-data analytics is to schedule sub-tasks among computing nodes. For example, back-filling tasks earlier than their originally scheduled sequence has been considered as a part of batch-scheduling in data centers and computing clusters [10]. However, such scheduling is geared towards batch processing within data centers, not towards big-data apportioning in heterogeneous federated clouds.

Some approaches have introduced task schedulers with load-balancing techniques for heterogeneous computing environments. For example, CometCloud [16] has a task scheduler that runs analytics on hybrid clouds with a load-balancing technique, and heuristics have been proposed to schedule tasks on heterogeneous computing nodes [17]. None of these approaches deals with the potential tradeoff between the data transfer delay and the computation time in parallel execution environments. A scheduling algorithm has also been introduced [18] to decide which tasks in a task queue should run in an internal cloud and which should be sent to an external cloud; it mainly focuses on preserving the order of tasks in the queue while increasing performance by utilizing an external cloud on demand. However, it does not consider how many and which clouds are required, nor how much data should be allocated to each chosen cloud for parallel processing.

Similar to our approach, research efforts [5][19] have been made to deploy parallel applications using MapReduce over massively distributed computing environments. Using only a local cluster with dynamic provisioning [20] may outperform these distributed approaches by reducing the data transfer delay, provided the local cluster has enough computational power; however, the distributed approach is more flexible and has cost benefits [4][5]. This paper provides a precise load-balancing algorithm for distributed parallel applications dealing with potentially big data, such as continuous data stream analysis [5] and life pattern extraction for healthcare applications in federated clouds [19].

III. PROBLEM STATEMENT

To achieve the optimal performance of big-data analytics services in federated clouds, we have to determine “how many” and “which” computing nodes in the clouds are required, where each computing node can be a cluster of servers in a data center, and how to apportion the given big-data to the chosen computing nodes. Specifically, we address these problems for frequent pattern mining employed as a big-data analytics service (see Section V).

Fig. 1. Hypothetical curve representing the relation between the overall execution time and the number of parallel computing nodes (x-axis: number of parallel computing nodes; y-axis: overall execution time; annotations mark the SLA target, the minimum of the curve, and the goal of finding the set of best computing nodes)

The input big-data to the frequent pattern mining algorithm (e.g., a log file containing users’ web transactions) is typically collected in a central place over a certain period (e.g., a year) and is processed to generate an output (e.g., frequent user behavior patterns). To execute the mining task on multiple computing nodes, the big-data is first divided into a number of data chunks (e.g., logs for user groups), and those data chunks are transferred to the computing nodes. Intuitively, as the number of computing nodes increases, the overall execution time can decrease, but the amount of data to be transferred increases. As shown in Fig. 1, the overall execution time can start to increase once we use more than a certain number of computing nodes, because the delay for transferring data chunks starts to dominate the overall execution time. Meanwhile, adding computing nodes can optionally be stopped once a target execution time specified in a Service Level Agreement (SLA) is met.


Our MOBB algorithm is designed to address the problem of how many and which computing nodes to use. It identifies the number of computing nodes by starting from a single node and increasing the count one node at a time. At each step, the best set of computing nodes is identified by estimating the data transfer delay and the computation time of each node for the given big-data.

Minimizing the frequency of data synchronization in the parallel process is practically one of the best ways to optimize performance, so we have to understand the characteristics of the input data before designing the parallel process. One characteristic of frequent pattern mining is that the data is in temporal order and mixes many users’ activities. To generate the frequent behavior patterns of each individual user (i.e., to extract personalized information from the big-data), we divide the given big-data into user groups. Distributing and executing individual users’ data on different computing nodes reduces data synchronization, since these data chunks are independent of each other. In this regard, to address the problem of how to apportion the given big-data to computing nodes, we have to consider a set of data chunks, each of a different size, and a set of computing nodes, each with different network and computing capacities.

Fig. 2. Data allocation to clouds having different capacities (a set of data chunks at the central cloud holding the big-data is allocated over network links of different capacities to a low-capacity local cloud, a medium-capacity local cloud, a medium-capacity remote cloud, and a high-capacity remote cloud)

We encode this problem as a bin-packing problem, as shown in Fig. 2. Our MOBB algorithm aims at minimizing the difference in execution time among computing nodes by optimally apportioning the given data chunks to the nodes. In other words, it maximizes the time overlap across computing nodes when performing the parallel data mining.

Fig. 3. Ideal time overlap when serializing a series of data chunks to a cloud (on a time axis, the transfer delay of data chunk 2 fully overlaps the computation time of data chunk 1, the transfer delay of data chunk 3 fully overlaps the computation of chunk 2, and so on)

Moreover, we simultaneously consider improving the time overlap between the data transfer delay and the computation time while distributing data chunks to computing nodes. In practice, this overlap can be achieved because one data chunk can be computed at a node while the next chunk is being transferred to it. As shown in Fig. 3, our algorithm ideally selects, for each node, a data chunk whose transfer delay equals the node’s computation time for the previous data chunk.

Our algorithm optimizes the performance of the parallel mining by simultaneously maximizing the time overlap both across computing nodes and between the data transfer delay and the computation time within each node.

IV. MOBB APPROACH

A. Maximally Overlapped Cloud-Bursting

The first part of our approach decides “how many” and “which” computing nodes to use. Based on estimates of the data transfer delay and the data computation time (see Section IV-B), our algorithm chooses a set of parallel computing nodes with shorter delays than the other candidates, identifying the next best node and adding it to the set one at a time.

Algorithm 1 Cloud-bursting to determine computing nodes

N ← {n_1, n_2, n_3, ..., n_n}
for each n_i in N do
    t_i ← EstDelay(n_i)
    e_i ← EstCompute(n_i)
end for
Sort(N) by (t_i + e_i)
S ← {n_0}; p ← 1; x_p ← EstExecTime(n_0)
while x_p ≥ SLA or x_p < x_{p-1} do
    n_min ← SelectNextBest(N − S)
    S ← S ∪ {n_min}
    p ← p + 1
    PerformBinPacking(S)
    x_p ← EstExecTime(S)
end while

Algorithm 1 describes our cloud-bursting approach. Given a set of candidate computing nodes N, the approach estimates the data transfer delay t to each node and the node’s computation time e for the data assigned. The estimation is made using a unit of data (i.e., a fixed data size). The candidate nodes are then sorted by the total estimate (i.e., t + e), and each node is added to the execution pool S as required. The approach starts executing the mining task with the central node n_0, which is the node where the big-data is collected, or the node to which the service provider initially allocates the big-data before bursting to multiple nodes. In one extreme case, only n_0 is used, if the execution time estimated by EstExecTime(n_0) meets the exit condition.

If the SLA cannot be met using the current set of parallel nodes (i.e., x_p ≥ SLA), or we want to further reduce the overall execution time by utilizing more nodes (i.e., x_p < x_{p-1}), the approach increases the level of parallelization by adding one node to the pool using SelectNextBest(N − S). The newly added node is the candidate with the minimum (t + e). Once the set of parallel nodes is determined, the approach performs the maximally overlapped bin-packing, PerformBinPacking(S), which attempts to maximize the time overlap across these nodes as well as the time overlap between data transfer and computation within each node. Then, EstExecTime(S) computes the optimal execution time that can be achieved by utilizing these nodes.
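To make the bursting loop concrete, the following minimal Python sketch mirrors Algorithm 1. It is a sketch under our own naming: the estimator callbacks (est_delay, est_compute, est_exec_time) and the packing routine (perform_bin_packing) stand in for the components described in Sections IV-B and IV-C and are not taken from the paper’s prototype.

def cloud_burst(candidates, central, sla, est_delay, est_compute,
                est_exec_time, perform_bin_packing):
    """Grow the execution pool one node at a time (cf. Algorithm 1).
    Estimators work on a fixed unit of data, as in Section IV-B."""
    # Rank candidates by estimated per-unit total delay t + e.
    ranked = sorted(candidates, key=lambda n: est_delay(n) + est_compute(n))
    pool = [central]                      # start with the central node n0
    exec_time = est_exec_time(pool)
    prev_time = exec_time                 # n0 alone may already meet the SLA
    while (exec_time >= sla or exec_time < prev_time) and ranked:
        pool.append(ranked.pop(0))        # next best node: minimum t + e
        perform_bin_packing(pool)         # Section IV-C data allocation
        prev_time, exec_time = exec_time, est_exec_time(pool)
    return pool, exec_time

The loop keeps bursting while the SLA is still violated or while the last added node reduced the estimated execution time, reproducing the exit condition of Algorithm 1.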

B. Estimation of Data Computation and Transfer Delay

Many estimation techniques have been introduced for the data computation time and the data transfer delay, and our approach does not depend on any specific one. For computation time estimation, we can adopt the response surface model introduced in [18]: by profiling each node, we can build an initial model and then tune it based on observations over time. Another well-known technique is the queueing model [21], in which a node has a task queue and processes each task under a defined discipline (e.g., FCFS or PS). The data transfer delay between different clouds can be more dynamic than the computation time because of various factors such as bandwidth congestion and re-routing due to network failures. To estimate the data transfer delay, we can adopt an auto-regressive moving average (ARMA) filter [21][22] by periodically profiling the network latency: we periodically inject a small unit of data to the target node and record the delay. From the historical record, we build the model and tune it over time by recording the error rate. We have used an ARMA filter for estimating both the computation time and the data transfer delay in the current prototype.
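As an illustration of the estimation step, the sketch below fits a purely autoregressive model to a history of probe delays by least squares and makes a one-step-ahead prediction. It is a simplified stand-in for the ARMA filter cited above (the moving-average terms and the error-rate tuning are omitted), and the function names are ours.

import numpy as np

def fit_ar(history, p=3):
    """Least-squares fit of an AR(p) model to a delay history
    (a simplified ARMA without the moving-average terms)."""
    y = np.asarray(history, dtype=float)
    # Column i of the design matrix holds the lag-(i+1) samples.
    X = np.column_stack([y[p - i - 1 : len(y) - i - 1] for i in range(p)])
    coef, *_ = np.linalg.lstsq(X, y[p:], rcond=None)
    return coef

def predict_next(history, coef):
    """One-step-ahead prediction of the per-unit delay."""
    p = len(coef)
    recent = np.asarray(history[-p:], dtype=float)[::-1]  # newest first
    return float(np.dot(coef, recent))

In the prototype’s terms, the periodically recorded probe delays would feed fit_ar, and predict_next would supply the per-unit estimates t_i and e_i consumed by Algorithm 1.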

C. Maximally Overlapped Bin-Packing

Fig. 4 illustrates the three main steps of our algorithm for data allocation.

• Pre-processing: This step involves (a) determining the bucket size for each node, (b) sorting the data chunks in descending order of size, and (c) sorting the node buckets in descending order of size. The bucket sizes are determined such that a node with higher delay gets a smaller bucket; sorting the buckets then essentially boils down to giving higher preference to nodes with lower delay.

• Greedy bin-packing: In this step, the sorted list of data chunks is assigned to the node buckets such that larger data chunks are handled by nodes with higher preference. Any fragmentation of a bucket is handled in this step (as shown in Algorithm 2).

• Post-processing: After the data chunks are assigned to buckets, this step organizes the sequence of chunks in each bucket such that data transfer and computation are overlapped maximally.

1) Determining bucket size: The algorithm parallelizes the mining task by dividing the input data among multiple computing nodes. The size of data given to a particular node depends on the average delay of the mining task on that node. If the average delay of the task for a unit of data on node i is denoted d_i, then the overall delay for executing a data size s_i (provided to node i for the mining task) is s_i d_i. In order to ensure ideal parallelization over n nodes for a data set with sizes s_1, s_2, ..., s_n, the following must be satisfied:

s_1 d_1 = s_2 d_2 = \cdots = s_n d_n,

where d_1, d_2, ..., d_n are the delays per unit of data at the respective nodes. After such an assignment, if the overall execution time of the mining task is assumed to be r, then the size of data assigned to each node is

s_1 = r / d_1,  s_2 = r / d_2,  ...,  s_n = r / d_n.   (1)

Let s be the total amount of the n input logs distributed across the nodes (i.e., s = \sum_{i=1}^{n} s_i). Then

r = \lceil s / \sum_{i=1}^{n} (1/d_i) \rceil.   (2)

Eq. 2 gives the overall execution time of the mining task under full parallelization. This can be achieved if the data assigned to each node i is limited by an upper bound s_i, obtained by substituting r from Eq. 2 into Eq. 1:

s_i = \lceil s / (d_i \sum_{j=1}^{n} (1/d_j)) \rceil.   (3)

Note that s_i is higher for a node i whose d_i is lower compared to the other nodes. Hence, Eq. 3 can be used to determine the bucket size for each node in a way that gives higher preference to nodes with lower delay.
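As a quick sanity check of Eq. 3, the following sketch (with a helper name of our choosing) computes the bucket upper bounds from the per-unit delays:

import math

def bucket_sizes(total_size, delays):
    """Bucket upper bound per node (Eq. 3):
    s_i = ceil(s / (d_i * sum_j 1/d_j))."""
    inv_sum = sum(1.0 / d for d in delays)
    return [math.ceil(total_size / (d * inv_sum)) for d in delays]

# Example: s = 100 units over nodes with per-unit delays 1, 2, and 4
# yields buckets of 58, 29, and 15 units; nodes with lower delay get
# proportionally more data, and the ceilings may overshoot s slightly
# since each bucket size is only an upper bound for the packing step.
print(bucket_sizes(100, [1.0, 2.0, 4.0]))   # -> [58, 29, 15]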

2) Greedy bin-packing: Once the bucket sizes are determined, the next step assigns the data chunks to the node buckets. We use a greedy bin-packing approach in which the largest data chunks are assigned to the nodes with the lowest delay (hence reducing the overall delay), as shown in Fig. 4. There are two main intuitions:

• Weighted load distribution: Each node is loaded according to its overall delay, i.e., the node with lower delay gets more data to handle. This is guaranteed by the upper bound on the size of data (i.e., the bucket size) to be handled by the node (as described in Section IV-C1).

• Delay-based node preference: Larger data chunks are assigned to nodes with larger bucket sizes (i.e., with lower overall delay) so that individual data chunks are treated fairly with respect to their overall delay. This is guaranteed by sorting the input data chunks in descending order (in the pre-processing step in Fig. 4) and filling the nodes with larger bucket sizes first (Algorithm 2).

To reduce fragmentation of the buckets, the buckets are completely filled one at a time; i.e., the bucket with the lowest delay is exhausted first, followed by the next one, and so on (Algorithm 2). This also fills more data into the nodes with lower delay.


Fig. 4. Steps to allocate data chunks to computing nodes in our algorithm: pre-processing sorts the input data chunk list (total input size s) and the bucket (node/cloud) list by the per-unit delay d_i = t_i + e_i of each node; greedy bin-packing assigns more data (larger chunks) to nodes with less delay, bounded by the bucket sizes s_1, ..., s_n; post-processing organizes the sequence of data chunks in each bucket to maximize the overlap between computation time and subsequent data transfer (buckets may be of type t_i > e_i or t_i < e_i)

3) Maximizing the overlap between data transfer and computation: The above approach achieves parallelization of the analytics over a large set of data chunks. However, the delay d_i for a unit of data on node i decomposes into the data transfer delay from the central node to node i and the actual computation time on node i. It is therefore possible to further reduce the overall execution time by transferring data to a node in parallel with its computation on a different data chunk. Ideally, the transfer delay of a data chunk should be exactly equal to the computation time of the previous data chunk; otherwise, delay is incurred either by queuing (when the computation time is higher) or by unavailability of data to compute (when the transfer delay is higher). If the computation time and the transfer delay are not exactly equal, a sequence of data chunks must be chosen intelligently for each node’s bucket so that the difference between the computation time of each data chunk and the transfer delay of the chunk immediately following it is minimized.

Depending on the ratio of data transfer delay to computation time, a node i can be categorized as follows:

• type 1, for which the transfer delay t_i per unit of data is higher than the computation time e_i per unit of data;

• type 2, for which the computation time e_i per unit of data is higher than the transfer delay t_i per unit of data.

It is important to understand the required characteristics of the sequence of data chunks sent to each type of node. If s_{ij} and s_{i(j+1)} are the sizes of data chunks j and j+1 assigned to node i, then for complete parallelization of the transfer of chunk j+1 with the computation of chunk j, the following must hold:

s_{i(j+1)} t_i = s_{ij} e_i.

Note that if t_i ≥ e_i, then ideally s_{i(j+1)} ≤ s_{ij}. Thus, the data chunks in the bucket of a type 1 node should be in descending order of size. Similarly, for a type 2 node, the data chunks should be in ascending order, as shown in the post-processing step of Fig. 4 and ensured at the end of Algorithm 2 (where the descending order of data chunks is reversed to ascending when t_i < e_i).
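As a numeric illustration (the numbers are ours, not from the paper): on a type 1 node with t_i = 2 and e_i = 1 time units per unit of data, the condition s_{i(j+1)} t_i = s_{ij} e_i gives s_{i(j+1)} = s_{ij} (e_i / t_i) = s_{ij} / 2, so a chunk of size 8 should be followed by chunks of sizes 4, 2, 1, and so on: each transfer (e.g., 4 × 2 = 8 time units) exactly hides behind the computation of the previous chunk (8 × 1 = 8 time units).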

Algorithm 2 Maximally overlapped bin-packing

Sort DataChunkList in descending order of chunk size
Determine bucket sizes s_1, s_2, ..., s_n (Eq. 3)
Sort BucketList in descending order of bucket size
repeat
    for i = 1 to n do
        Remove the first element from DataChunkList
        Insert the element at the tail of BucketList[i]
        for j = 1 to remainingNumberOfDataChunks do
            if (DataChunkList is not empty) and (the first element in DataChunkList fits in BucketList[i]) then
                Remove the first element from DataChunkList
                Insert the element at the tail of BucketList[i]
            end if
        end for
    end for
until all BucketLists are full
for i = 1 to numberOfNodes do
    if t_i < e_i then
        Reverse the order of data chunks in BucketList[i]
    end if
end for
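The following Python sketch condenses Algorithm 2 under our own naming, assuming the chunk sizes, the bucket capacities of Eq. 3, and the per-unit t_i and e_i are given. It fills buckets one at a time in descending order of capacity (i.e., ascending delay) and reverses the chunk sequence for type 2 nodes.

def mobb_pack(chunks, buckets, t, e):
    """Greedy maximally overlapped bin-packing (cf. Algorithm 2).
    chunks: data chunk sizes; buckets: per-node capacities (Eq. 3);
    t, e: per-unit transfer delay and computation time per node."""
    order = sorted(range(len(buckets)), key=lambda i: buckets[i], reverse=True)
    remaining = sorted(chunks, reverse=True)      # largest chunks first
    assignment = [[] for _ in buckets]
    for i in order:                               # exhaust one bucket at a time
        free = buckets[i]
        for c in list(remaining):                 # greedy first-fit into bucket i
            if c <= free:
                assignment[i].append(c)
                free -= c
                remaining.remove(c)
    for i in order:
        if t[i] < e[i]:                           # type 2 node: ascending sizes,
            assignment[i].reverse()               # so transfers hide behind compute
    return assignment, remaining                  # leftovers, if capacity ran out

Each per-node list is returned in transfer order: a type 1 node (t_i ≥ e_i) keeps the descending order of chunk sizes, while a type 2 node receives the reversed, ascending order, matching the post-processing step of Fig. 4.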

V. EXPERIMENTAL EVALUATION

We demonstrate the efficiency of our approach by deploying the frequent pattern mining as a big-data analytics service on four different computing nodes. We first describe the experimental setup and then the results.


A. Experimental Setup

1) Frequent Pattern Mining: Frequent pattern mining [6] aims to extract frequent patterns from a log file. The log contains users’ activities in a system in temporal order. A typical example is a web server access log, which contains a history of web page accesses by users. Enterprises need to analyze such web server access logs to discover valuable information such as website traffic patterns and user activity patterns by time of day, week, or year. These frequent patterns can also be used to generate rules that predict the future activity of a certain user within a certain time interval based on the user’s past patterns. To extract such frequent patterns and prediction rules, our mining process parses the log while identifying the patterns of each user, and then generates a set of prediction rules for each user.

As a sample log for our experiments, we have combined a phone call log obtained from a call center with a web access log. The log was collected over a year, and its size is up to 200 GB. It contains several million activities generated by more than a hundred thousand users, concerning human resources topics such as health insurance, 401K, and retirement plans. The objective of the frequent pattern mining is to obtain the patterns of each user’s activities on human resource information systems. Our approach therefore first divides the log into a set of user logs and then executes these user logs in parallel over the federated computing nodes.

2) Computing nodes: For our experiments, we employ four small computing nodes to run the frequent pattern mining. Three nodes are local clusters located in the north-eastern part of the US, and one is a remote cluster located in the mid-western part of the US. One local node, referred to as the Low-end Local Central node (LLC), is used as the central node that collects the big-data to be analyzed. It consists of 5 virtual machines (VMs), each with two 2.8 GHz cores, 1 GB of memory, and a 1 TB hard drive. Another local node, referred to as the Low-end Local Worker (LLW), has a configuration similar to LLC’s. The third local node, referred to as the High-end Local Worker (HLW), is a cluster of 6 non-virtualized servers, each with 24 2.6 GHz cores, 48 GB of memory, and 10 TB of disk. All the local nodes are connected by a high-speed local network, so data can be moved very quickly between them. The remote node, referred to as the Mid-end Remote Worker (MRW), has 9 VMs, each with two 2.8 GHz cores, 4 GB of memory, and a 1 TB hard drive. We deploy Hadoop on all of these computing nodes.

In our scenario, we assume HLW is shared by other applications: three other data mining tasks run while our frequent pattern mining is running. Meanwhile, we intentionally impose a network overhead on LLW by moving large files to it during the experiments, which increases the data transfer delay between LLC and LLW.

B. Experiment Results

We first show the performance characteristics of the computing nodes in terms of data computation time and data transfer delay. Fig. 5 shows each node’s performance characteristics when we run the mining task with the entire log on that node.

Fig. 5. Performance characteristics of computing nodes (execution time in seconds, split into computation time and data transfer delay, for LLC, LLW, MRW, and HLW)

LLC has low-end servers, but the log is stored in its local storage; thus, its computation time is higher than that of the other nodes, while it incurs no data transfer delay. To run the mining task on any other node, we must first copy the log to the target node and then execute it. Since LLW is another local node subjected to the intentional network delay, its data transfer delay is high; it also has low-end servers, so its computation takes a long time. MRW, located remotely, has a mid-end configuration, so its computation time is lower than LLC’s; however, as shown in the figure, its overall execution time is similar to LLC’s due to the large transfer delay. Meanwhile, HLW has high-end servers, and the log is stored within the same local area network; thus, its computation time is much lower than the others’ and its data transfer delay is small.

Fig. 6. Impact of data transfer delay on the overall execution time (execution time in seconds vs. relative input data size in %, comparing a single node, two nodes, and two nodes with delay)

1) Impact of data transfer delay on overall execution time: Using LLC and LLW, we compare the execution time with an intentional transfer delay to LLW against the execution time without the delay. We use a small log (18.6 GB) in this experiment and resize it to show how the execution time changes with the size of the input log. As shown in Fig. 6, the execution time increases exponentially as the size increases, which demonstrates the need for parallel execution of such big-data analytics to improve performance. When using two nodes without the transfer delay, the execution time decreases by almost half. However, when we execute the mining task with the transfer delay, the execution time is higher, and the gap slightly widens as the input size increases. The data transfer delay therefore has to be handled carefully.

2) Effect of cloud-bursting on total execution time: Our MOBB algorithm attempts to use only LLC first, and then integrates one node at a time, choosing the node with the lowest estimated execution time. Nodes are added in this way until the SLA is met. In this experiment, our approach adds HLW, then MRW, and finally uses all four nodes to run the mining task.

Fig. 7. Cloud-bursting with four computing nodes (execution time in seconds for LLC only, LLC + HLW, LLC + HLW + MRW, and all four nodes)

As shown in Fig. 7, the execution time decreases as nodes are added. However, the execution time does not improve significantly beyond three nodes: the contributions of MRW and LLW to the performance are small, and the transfer delay they introduce starts to affect the overall execution time. Therefore, using two or three nodes can be a better choice than using all four, which incurs higher resource usage cost.

3) Comparison of MOBB with other load-balancing approaches: We run the mining task using MOBB and three other methods used in many prior load-balancing approaches [9][13][14][15][16], and compare the results. The methods used in this comparison are as follows:

• Fair division: This method divides the input data equally and distributes the parts to the nodes. We use this naive method as a baseline.

• Computation-based division: This method considers only the computation power of each node when load-balancing, rather than both the computation and the data transfer delay.

• Delay-based division: This method considers both each node’s computation time and its data transfer delay in load-balancing. However, it does not consider the queuing delay incurred at each node by blindly distributing user logs to the nodes (i.e., it ignores the time overlap between the transfer delay and the computation time).

Fig. 8. Comparison of different load-balancing algorithms (execution time in seconds for MOBB, Fair division, Computation-based division, and Delay-based division)

Fig. 8 shows the results when we run the mining task on HLW and MRW. As shown in the figure, our algorithm outperforms the other three methods. Since MRW has a large transfer delay, the execution time of Computation-based division is very close to that of Fair division. Table I explains this situation in terms of slack time (i.e., the time difference between nodes’ task completions). Although Computation-based division considers the computation powers of MRW and HLW when load-balancing, MRW becomes a bottleneck due to its large transfer delay. Meanwhile, Delay-based division considers both the computation time and the transfer delay, as MOBB does, which significantly reduces the slack time. However, some data chunks accumulate in the queue before being computed: many small data chunks arrive and pile up in the waiting queue while the mining task is still computing large data chunks. When this happens too often, significant delay is incurred. Our MOBB algorithm considers all of these factors simultaneously when allocating the given big-data. As evident from the results in Fig. 8, MOBB achieves a minimum of 20% (and up to 60%) improvement compared to the other approaches.

TABLE I
SLACK TIMES OF DIFFERENT LOAD-BALANCING ALGORITHMS

             MOBB    Fair       Comp.-based    Delay-based
Slack (sec)  41.2    13952.8    12719.5        1030.8

Under an ideal optimal data allocation, the slack time must be 0 (i.e., computation on all computing nodes completes at the same time). Table I shows that MOBB has around 40 seconds of slack time, which is very small compared to the overall time taken by the parallel data mining. To further evaluate the optimality of MOBB, we conducted multiple small experiments with LLC and LLW, executing 20 data chunks randomly selected out of 50K data chunks. We observed that in most cases (more than 90%), the slack time is caused by the last data chunk of the sequence assigned to the slower node (i.e., no data chunk remains in the slower node’s queue); only in very few cases does the slower node have more than one data chunk in its queue when the faster node has completed all of its assigned chunks. This indicates that MOBB provides a close to optimal allocation.

4) Efficiency of the MOBB algorithm: Our MOBB algorithm is efficient and scales as the number of data chunks and applications increases. As described in Algorithm 2, the complexity of pre-processing is O(n log n + m log m), where n is the number of computing nodes and m is the number of data chunks to be assigned, since it sorts both the given nodes and the data chunks. Typically, m is much larger than n, so the complexity is effectively O(m log m). In our experiments, we used 4 nodes to run 50K data chunks. The complexity of the rest of the MOBB algorithm (i.e., greedy bin-packing and post-processing) is O(m). Therefore, the overhead of MOBB is mainly incurred in sorting the data chunks. We used an existing quicksort implementation in our prototype; handling 50K data chunks with 4 nodes took less than 60 seconds, a very small portion of the overall approach including data transfer and parallel data computation.


VI. DISCUSSION

Our MOBB approach has been designed for data-intensive tasks (e.g., big-data analytics) that typically require special platforms, such as MapReduce clusters, and that can run in parallel. One of the best situations in which to apply MOBB is when the target task can be divided into a set of independent, identical sub-tasks. For the data mining task used in our evaluation, our data pre-processing system divides the input data into a set of data chunks; each sub-task then runs independently, in parallel with the other sub-tasks, to generate the frequent patterns of an individual user from a subset of data chunks. An extended situation considered in our MOBB approach is the case where multiple independent data-intensive tasks with different sizes of input data sit in a task queue. If a task in the queue cannot be divided into independent sub-tasks, as with iterative algorithms in which the data transfer occurs just once but the computation runs multiple times over the same data, MOBB treats the task as a unit task and attempts to parallelize it with the other tasks and sub-tasks in the queue. This is because running such algorithms across federated clouds may not be practical: they may require considerable communication among computing nodes (e.g., iteratively merging and redistributing intermediate results). We are planning to extend MOBB to such task queues holding multiple independent data-intensive tasks.

Another extension we are considering is to dynamically re-sort the computing nodes and re-target the sequence of data chunks. In the current prototype, the decision making is based on the status of the network and computation capacities at the time MOBB is invoked, as described in Algorithm 1. However, the status can change dynamically due to various unexpected events, such as node failures and network congestion, after the nodes have been sorted based on the previous status and data chunks have been allocated. One possible solution is for MOBB to periodically check the available computation capacities and network delays of the nodes. Another is for distributed monitoring systems to push events to MOBB when the status changes significantly. In either case, a status change triggers MOBB to re-sort the nodes and re-target the sequence of the remaining data chunks to the next available computing nodes, while the data chunks already assigned continue at their corresponding nodes.

VII. CONCLUSION

In this paper, we have described a cloud-bursting, maximally overlapped load-balancing algorithm that optimizes the performance of big-data analytics running in loosely coupled, distributed computing environments such as federated clouds. More specifically, our algorithm supports decisions on: (a) how many and which computing nodes in federated clouds should be used; (b) the opportunistic apportioning of big-data to these nodes so as to enable synchronized completion; and (c) the sequence of apportioned data chunks to be computed in each node, so that the transfer of a chunk is overlapped as much as possible with the computation of the previous chunk in that node. We have compared our algorithm with other load-balancing schemes; the results show that performance improves by at least 20% and by up to 60% over the other approaches.

REFERENCES

[1] A. Jacobs, “The pathologies of big data,” Jul. 2009. [Online]. Available: http://queue.acm.org/detail.cfm?id=1563874
[2] T. White, Hadoop: The Definitive Guide. O’Reilly, 2009.
[3] D. Kusnetzky, “What is big data?” Feb. 2010. [Online]. Available: http://blogs.zdnet.com/virtualization/?p=1708
[4] S. Rozsnyai, A. Slominski, and Y. Doganata, “Large-scale distributed storage system for business provenance,” in Proc. Int’l Conf. on Cloud Computing, 2011, pp. 516–524.
[5] Q. Chen, M. Hsu, and H. Zeller, “Experience in continuous analytics as a service (CaaaS),” in Proc. Int’l Conf. on Extending Database Technology, 2011, pp. 509–514.
[6] R. Srikant and R. Agrawal, “Mining sequential patterns: Generalizations and performance improvements,” in Proc. Int’l Conf. on Extending Database Technology, Feb. 1996, pp. 3–17.
[7] Apache, “Hadoop,” 2011. [Online]. Available: http://hadoop.apache.org/
[8] Y. Li and Z. Lan, “A survey of load balancing in grid computing,” Computational and Information Science, Springer, vol. 3314, pp. 280–285, 2005.
[9] T. Miyoshi, K. Kise, H. Irie, and T. Yoshinaga, “CODIE: Continuation-based overlapping data-transfers with instruction execution,” in Proc. Int’l Conf. on Networking and Computing, Nov. 2010, pp. 71–77.
[10] D. Tsafrir, Y. Etsion, and D. Feitelson, “Backfilling using system-generated predictions rather than user runtime estimates,” IEEE TPDS, vol. 18, pp. 789–803, 2007.
[11] T. Mukherjee, A. Banerjee, G. Varsamopoulos, S. Gupta, and S. Rungta, “Spatio-temporal thermal-aware job scheduling to minimize energy consumption in virtualized heterogeneous data centers,” Computer Networks, vol. 53, pp. 2888–2904, 2009.
[12] Z. Liu, M. Lin, A. Wierman, S. Low, and L. Andrew, “Greening geographical load balancing,” in Proc. SIGMETRICS Joint Conf. on Measurement and Modeling of Computer Systems, 2011, pp. 233–244.
[13] P. Fan, J. Wang, Z. Zheng, and M. Lyu, “Toward optimal deployment of communication-intensive cloud applications,” in Proc. Int’l Conf. on Cloud Computing, 2011, pp. 460–467.
[14] M. Andreolini, S. Casolari, and M. Colajanni, “Autonomic request management algorithms for geographically distributed internet-based systems,” in Proc. Int’l Conf. on Self-Adaptive and Self-Organizing Systems, 2008, pp. 171–180.
[15] K. Reid and M. Stumm, “Overlapping data transfer with application execution on clusters,” in Proc. Workshop on Cluster-Based Computing, May 2000.
[16] H. Kim and M. Parashar, “CometCloud: An autonomic cloud engine,” in Cloud Computing: Principles and Paradigms. Wiley, 2011, ch. 10.
[17] M. Maheswaran, S. Ali, H. Siegal, D. Hensgen, and R. Freund, “Dynamic matching and scheduling of a class of independent tasks onto heterogeneous computing systems,” in Proc. Heterogeneous Computing Workshop, 1999, pp. 30–44.
[18] S. Kailasam, N. Gnanasambandam, J. Dharanipragada, and N. Sharma, “Optimizing service level agreements for autonomic cloud bursting schedulers,” in Proc. Int’l Conf. on Parallel Processing Workshops, 2010, pp. 285–294.
[19] Y. Huang, Y. Ho, C. Lu, and L. Fu, “A cloud-based accessible architecture for large-scale ADL analysis services,” in Proc. Int’l Conf. on Cloud Computing, 2011, pp. 646–653.
[20] D. Alves, P. Bizarro, and P. Marques, “Deadline queries: Leveraging the cloud to produce on-time results,” in Proc. Int’l Conf. on Cloud Computing, 2011, pp. 171–178.
[21] G. Jung, K. Joshi, M. Hiltunen, R. Schlichting, and C. Pu, “A cost-sensitive adaptation engine for server consolidation of multi-tier applications,” in Proc. Int’l Conf. on Middleware, 2009, pp. 163–183.
[22] G. Box, G. Jenkins, and G. Reinsel, Time Series Analysis: Forecasting and Control. Prentice Hall, 1994.
