Synchronous Parallel Processing of Big-Data Analytics Services to Optimize Performance in Federated Clouds
Gueyoung Jung, Nathan Gnanasambandam
Xerox Research Center Webster
Webster, USA
{gueyoung.jung, nathang}@xerox.com
Tridib Mukherjee
Xerox Research Center India
Bangalore, India
Abstract—Parallelization of big-data analytics services over a federation of heterogeneous clouds has been considered to improve performance. However, contrary to common intuition, there is an inherent tradeoff between the level of parallelism and the performance of big-data analytics, principally because of the significant delay for big data to be transferred over the network. The data transfer delay can be comparable to, or even higher than, the time required to compute the data. To address this tradeoff, this paper determines: (a) how many and which computing nodes in federated clouds should be used for parallel execution of big-data analytics; (b) how to opportunistically apportion big-data to these computing nodes so as to enable synchronized completion at best-effort performance; and (c) the sequence in which apportioned data chunks of different sizes should be computed in each node, so that the transfer of a chunk overlaps as much as possible with the computation of the previous chunk in the node. In this regard, the Maximally Overlapped Bin-packing driven Bursting (MOBB) algorithm is proposed, which improves performance by up to 60% against existing approaches.
Keywords-federated clouds; big-data analytics; parallelization
I. INTRODUCTION
Deploying big-data analytics services into clouds is more
than just a contemporary trend. We are living in an era where
data is being generated from many different sources such
as sensors, social media, click-stream, log files, and mobile
devices. Recently collected data can exceed hundreds of terabytes
and be continuously generated. Such big-data represents
data sets that can no longer be easily analyzed with traditional
data management methods and infrastructures [1][2][3]. In or-
der to promptly derive insight from big-data, enterprises have
to deploy big-data analytics into an extraordinarily scalable
delivery platform. The advent of Cloud Computing has been
enabling enterprises to analyze such big-data by leveraging
vast amounts of computing resources available on demand
with low resource usage cost.

One of the research challenges in this regard is figuring
out how to best use federated cloud resources to maximize
the performance of big-data analytics. In this paper, we
mainly focus on parallel data mining such as topic mining
and pattern mining that can be run in multiple computing
nodes simultaneously. Parallel data mining consumes a lot of
computing resources to analyze large amounts of unstructured
data, especially when it is executed under a time constraint.
Cloud service providers may have enough capacity dedicated
to perform such data-intensive services in their own data
centers. However, facilitating loosely coupled and federated
clouds consisting of legacy resources and applications is often
a better choice, as analytics can be carried out partly on local
private resources while the rest of the big-data is transferred
to external computing nodes that are optimized for processing
big-data analytics. This paradigm can be more flexible and has
obvious cost benefits compared to using a single data center [4][5].
In order to optimize parallel data mining across not just
multiple computing nodes but different clouds separated by
relatively high latencies, this paper addresses: (a) node determination, i.e., “how many” and “which” computing nodes in
federated clouds should be used, (b) synchronized completion,
i.e., how to optimally apportion big-data across parallelized
computation environments to ensure synchronization, where
synchronization refers to completing all workload portions
at the same time even when resources and inter-networks
are heterogeneous and situated in multiple Internet-separated
clouds, and (c) data partition determination, i.e., how to
serialize different data chunks to computing nodes so as to avoid
overflow or underflow at the nodes.
To address these problems, we develop a heuristic cloud-
bursting algorithm, referred to as Maximally Overlapped Bin-
packing driven Bursting (MOBB). Specifically, we amplify
the benefit of data mining parallelization by considering
the time overlap: (a) across computing nodes; and (b) between
data transfer delay and computation time in each computing
node. While unequal loads may be apportioned to the parallel
computing nodes, our algorithm can still make sure that
outputs are produced at the same time without any single slow
node acting as a bottleneck.
When a data mining task is run on a pool of inter-connected
clouds, extended periods of data transfer delay are often ex-
perienced, and the data transfer delay depends on the location
of each computing node. Fast transfer of data chunks to a
slow computing node can cause data overflow, whereas slow
transfer of chunks to a fast node can lead to underflow, causing
the node to be idle. Our MOBB algorithm can reduce such
data overflow or underflow.
To evaluate our approach, we employ a frequent pattern
mining analytics [6] as a specific type of parallel analytics
whose inputs are huge but whose outputs are far smaller. Then, we
deploy the analytics on multiple small Hadoop [7] clusters in
four different clouds. The experimental results show that our
approach outperforms other existing load-balancing methods
for big-data analytics.
II. RELATED WORK
The problem of load distribution has previously been
studied for different distributed computing systems includ-
ing computing grids [8], parallel architectures [9], and data
centers [10][11][12]. Load-balancing for parallel applications
has typically involved the distribution of loads to computing
nodes so as to maximize performance. Although the cost and
delay overheads have been considered in many cases, such
overheads usually involve application delays in check-pointing
and context switching. This paper, on the other hand, focuses
on the distribution of big-data over computing nodes separated
far apart. In this setting, the overhead of the data transfer can
be significant, since the network latencies between nodes may be
high and the amount of data being transferred is large.
Although recent research work [13][14] has considered the
impact of the data transfer delay on performance when selecting
clouds to which service loads are redirected, it does not consider
further optimization by overlapping the transfer delay of a data chunk
with the computation time of the previous chunk. The need for such
overlapping has been identified for clusters [15]. Continuation-
based overlapping of data transfers with instruction execution
has been investigated for many-core architecture [9]. However,
such overlapping is restricted by a pre-defined order of instruc-
tion sets. This paper instead determines the order of different
sizes of data chunks to be transferred to individual nodes. By
doing so, we can have the maximum overlap between the data
transfer and the data computation.
Another way to optimize the performance of big-
data analytics is to schedule sub-tasks among computing nodes.
For example, back-filling of tasks at an earlier time than the
originally scheduled sequence has been considered as a part of
batch-scheduling in data centers and computing clusters [10].
However, such scheduling is geared towards batch processing
within data centers and not for big-data apportioning in
heterogeneous federated clouds.
Some approaches have introduced task schedulers with load-
balancing techniques in heterogeneous computing environ-
ments. For example, CometCloud [16] has a task scheduler
to run analytics on hybrid clouds with a load-balancing
technique. [17] has introduced some heuristics to schedule
tasks for heterogeneous computing nodes. None of the above
approaches have dealt with the potential tradeoff between
the data transfer delay and the computation time in parallel
execution environments. [18] has introduced a scheduling
algorithm to address which tasks in a task queue have to be
run in the internal cloud and which have to be sent to the external cloud. They have
mainly focused on keeping the order of tasks in the queue
while increasing performance by utilizing an external cloud
on demand. However, they do not consider how many and
which clouds are required and how much data is allocated to
each chosen cloud for parallel processing.
Similar to our approach, research efforts [5][19] have
been made to deploy parallel applications using MapReduce
over massively distributed computing environments. Using
only a local cluster with dynamic provisioning [20] may out-
perform these distributed approaches by reducing the data
transfer delay if the local cluster has enough computational
power. However, the distributed approach is more flexible and
has cost benefits [4][5]. This paper provides a precise load-
balancing algorithm for distributed parallel applications
dealing with potentially big data, such as continuous data
stream analysis [5] and life-pattern extraction for healthcare
applications in federated clouds [19].
III. PROBLEM STATEMENT
To achieve the optimal performance of big-data analytics
services in federated clouds, we have to determine “how
many” and “which computing nodes” in clouds are required,
where each computing node can be a cluster of servers in a
data center, and how to apportion given big-data to chosen
computing nodes. Specifically, we address these problems for
a frequent pattern mining employed as a big-data analytics
service (see Section V).
Fig. 1. Hypothetical curve representing the relation between the overall execution time and the number of parallel computing nodes (the curve reaches a minimum at the best set of computing nodes, with the SLA target shown as a threshold)
The input big-data to the frequent pattern mining algo-
rithm (e.g., a log file containing users’ web transactions) is
typically collected in a central place over a certain period
(e.g., for a year), and is processed to generate an output (e.g.,
frequent user behavior patterns). To execute the mining task
in multiple computing nodes for given big-data, the big-data
is first divided into a certain number of data chunks (e.g.,
logs for user groups), and those data chunks are transferred
to computing nodes. Intuitively, as the number of computing
nodes increases, the overall execution time can decrease, but
the amount of data chunks to be transferred increases. As
shown in Fig. 1, the overall execution time can start to increase
if we use more than a certain number of computing nodes.
This is because the delay taken to transfer data chunks starts
to dominate the overall execution time. Meanwhile, adding
computing nodes can optionally be stopped once a target execution
time specified in the Service Level Agreement (SLA) is met.
Our MOBB algorithm is designed to address the problem of
how many and which computing nodes are used. It identifies
the number of computing nodes by starting from a single
node and increasing the number of nodes one at a time. At
each step, the best set of computing nodes can be identified
by estimating the data transfer delay and computation time of
each node for given big-data.
Minimizing the frequency of data synchronization in the
parallel process is practically one of the best ways to optimize
the performance. Thus, we have to understand the characteris-
tics of the input data before designing the parallel process. One
of the characteristics of frequent pattern mining is that the data
is in temporal order and mixed with many users’ activities.
To generate frequent behavior patterns of each individual
user (i.e., extract personalized information from big-data), we
divide the given big-data into user groups. Distributing and
executing individual user data in different computing nodes
can reduce the data synchronization since these data chunks
are independent of each other. In this regard, to address the
problem of how to apportion given big-data to computing
nodes, we have to consider a set of data chunks, each of which
has a different size, and a set of computing nodes, each of
which has different network and computing capacities.
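To make this apportioning input concrete, the following is a minimal sketch of how such per-user data chunks could be formed; the record layout (timestamp, user_id, activity) and the function name are our own illustrative assumptions, not the system's actual interface.

```python
from collections import defaultdict

def split_log_by_user(records):
    """Group temporally ordered log records into independent per-user
    data chunks. The (timestamp, user_id, activity) record layout is a
    hypothetical assumption for this sketch."""
    chunks = defaultdict(list)
    for timestamp, user_id, activity in records:
        # Per-user sub-logs are mutually independent, so they can be
        # mined in parallel without cross-node data synchronization.
        chunks[user_id].append((timestamp, activity))
    return list(chunks.values())
```

Because users generate different amounts of activity, the resulting chunks differ widely in size, which is precisely what the bin-packing formulation below must accommodate.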
Fig. 2. Data allocation to clouds having different capacities (a set of data chunks at the central cloud holding the big-data is allocated, over links of differing network capacity, to low- and medium-capacity local clouds and to medium- and high-capacity remote clouds)
We encode this problem into a bin-packing problem as
shown in Fig. 2. Our MOBB algorithm aims at minimizing
the execution time difference among computing nodes by
optimally apportioning given data chunks into computing
nodes. In other words, it maximizes the time overlap across
computing nodes when it performs the parallel data mining.
Fig. 3. Ideal time overlap when serializing a series of data chunks to a cloud (the transfer delay of data chunk 2 is hidden behind the computation time of chunk 1, and the transfer of chunk 3 behind the computation of chunk 2)
Moreover, we simultaneously consider improving the time
overlap between the data transfer delay and the computation
time while distributing data chunks to computing nodes.
Practically, this overlap can be achieved since a data chunk can
be computed in a node while the next chunk is being transferred
to the node. As shown in Fig. 3, ideally, our algorithm attempts
to select a data chunk whose transfer delay to a node equals
the node's computation time for the previous data chunk.
Our algorithm optimizes the performance of the parallel
mining by maximizing the time overlap not only across
computing nodes, but also between the data transfer delay and
the computation time in each node, simultaneously.
IV. MOBB APPROACH
A. Maximally Overlapped Cloud-Bursting
The first part of our approach is to decide “how
many” and “which” computing nodes are used. Based on
estimates of the data transfer delay and the data computation
time (see Section IV-B), our algorithm chooses a set of
parallel computing nodes, which have shorter delay than other
candidates, by identifying the next best node and adding it to
the set one at a time.
Algorithm 1 Cloud-bursting to determine computing nodes

  N ← {n1, n2, n3, ..., nn}
  for each ni in N do
      ti ← EstDelay(ni)
      ei ← EstCompute(ni)
  end for
  Sort(N) by (ti + ei)
  S ← {n0}; p ← 1; xp ← EstExecTime(n0)
  while xp ≥ SLA or xp < xp−1 do
      nmin ← SelectNextBest(N − S)
      S ← S ∪ {nmin}
      p ← p + 1
      PerformBinPacking(S)
      xp ← EstExecTime(S)
  end while
Algorithm 1 describes our cloud-bursting approach. Given
a set of candidate computing nodes N, our approach estimates
the data transfer delay t to each node and the node’s computa-
tion time e for the data assigned. The estimation is made using
the unit of data (i.e., a fixed size of data). Then, the candidate
nodes are sorted by the total estimate (i.e., t + e), and each node
is added to the execution pool S, if required. Our approach
starts the execution of the mining task with the central node n0.
n0 in our approach is the node that has the big-data collected,
or the node to which the service provider initially allocates the
big-data before bursting to multiple nodes. One extreme case
is to use only n0, if the estimated execution time from
EstExecTime(n0) meets the exit condition.
If the SLA cannot be met using the current set of parallel nodes
(i.e., xp ≥ SLA), or we want to further reduce the overall
execution time by utilizing more nodes (i.e., xp < xp−1), our
approach increases the level of parallelization by adding one
node into the pool using SelectNextBest(N−S). The newly
added node is the best node that has the minimum (t + e)
among all candidates. Once the set of parallel nodes is deter-
mined from the above step, our approach performs the maximally
overlapped bin-packing, PerformBinPacking(S), which
attempts to maximize the time overlap across these nodes
and the time overlap between data transfer and computation
in each node. Then, EstExecTime(S) computes the optimal
execution time that can be achieved by utilizing these nodes.
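For illustration, the following is a minimal Python sketch of Algorithm 1 under our own naming; the estimator and bin-packing callables (est_delay, est_compute, est_exec_time, perform_bin_packing) are assumed to be supplied by the surrounding system and are placeholders, not the paper's actual implementation.

```python
def cloud_burst(nodes, central, sla,
                est_delay, est_compute, est_exec_time, perform_bin_packing):
    """Sketch of Algorithm 1: grow the execution pool one node at a
    time, always adding the candidate with the smallest estimated
    per-unit delay t + e, while the SLA is unmet or adding still helps."""
    # Estimate per-unit transfer delay and computation time per node.
    cost = {n: est_delay(n) + est_compute(n) for n in nodes}
    candidates = sorted((n for n in nodes if n != central), key=cost.get)
    pool = [central]                    # start from the central node n0
    prev_time = float("inf")
    exec_time = est_exec_time(pool)
    while exec_time >= sla or exec_time < prev_time:
        if not candidates:
            break
        pool.append(candidates.pop(0))  # next best (minimum t + e) node
        perform_bin_packing(pool)       # maximally overlapped bin-packing
        prev_time, exec_time = exec_time, est_exec_time(pool)
    return pool, exec_time
```

As in the pseudocode, the loop keeps bursting while the SLA is not met or while one more node still lowers the estimated execution time.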
B. Estimation of Data Computation and Transfer Delay
Many estimation techniques have been introduced for the
data computation time and the data transfer delay, but our
approach does not depend on a specific technique. For the
computation time estimation, we can adopt the response sur-
face model introduced in [18]. By profiling each node for the
computation time, we can build the initial model and tune the
model based on observations over time. Another well-known
technique is the queueing model [21]. The node has a task
queue and processes each task with a defined discipline (e.g.,
FCFS or PS). The data transfer delay between different clouds
can be more dynamic than the computation time because of
various factors such as bandwidth congestion and re-routing
due to network failures. To estimate the data transfer delay,
we can adopt the auto-regressive moving average (ARMA)
filter [21][22] by periodically profiling the network latency.
To profile the latency, we can periodically inject a small unit
of data into the target node and record the delay. With the
historical record, we can build the model and tune it over
time by recording the error rate. We have used the ARMA filter
for estimating both the computation time and the data transfer
delay in the current prototype.
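As an illustration of this profiling loop, here is a minimal one-step predictor in an ARMA(1,1) style; the paper does not fix the model order or coefficients, so the values and class below are assumptions for the sketch (in practice the coefficients would be fitted to the probe history).

```python
class ArmaDelayEstimator:
    """Hypothetical ARMA(1,1)-style one-step predictor of per-unit
    transfer delay, fed by periodic probes of a small unit of data."""

    def __init__(self, phi=0.8, theta=0.2):
        self.phi = phi        # autoregressive coefficient (assumed)
        self.theta = theta    # moving-average coefficient (assumed)
        self.last_obs = None  # last profiled latency
        self.last_err = 0.0   # last one-step prediction error

    def update(self, observed_delay):
        """Record a new probe and track the prediction error."""
        predicted = self.predict()
        self.last_err = 0.0 if predicted is None else observed_delay - predicted
        self.last_obs = observed_delay

    def predict(self):
        """One-step-ahead estimate from the last probe and last error."""
        if self.last_obs is None:
            return None
        return self.phi * self.last_obs + self.theta * self.last_err
```

One estimator instance would be kept per node, and the same scheme can be used to profile computation time, as in the prototype.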
C. Maximally Overlapped Bin-Packing
Fig. 4 illustrates three main steps of our algorithm for data
allocation.
• Pre-processing: This step involves: (a) the determination
of bucket size for each node; (b) sorting of data chunks
in descending order of their sizes; and (c) sorting node
buckets in descending order of their sizes. The buckets’
sizes are determined in a way that a node with higher
delay will have a smaller bucket size. Sorting the buckets then
essentially boils down to giving higher preference to
nodes which have lower delay.
• Greedy bin-packing: In this step, the sorted list of data
chunks is assigned to node buckets in a way that larger
data chunks are handled by nodes with higher preference.
Any fragmentation of the bucket is handled in this step
(as shown in Algorithm 2).
• Post-processing: After the data chunks are assigned to
the buckets, this step organizes the sequence of chunks for
each bucket such that the data transfer and computation
are overlapped maximally.
1) Determining bucket size: The algorithm intends to par-
allelize the mining task by dividing input data for multiple
computing nodes. For parallelization, the size of data given to a
particular node depends on the average delay of the mining
task on that node. If the average delay of the task for a unit
of data on a node i is denoted as $d_i$, then the overall delay
for executing data of size $s_i$ (provided to node i for the
mining task) is $s_i d_i$. In order to ensure ideal parallelization
for n nodes and a set of data whose sizes are $s_1, s_2, \ldots, s_n$,
the following must be satisfied:

$$s_1 d_1 = s_2 d_2 = \cdots = s_n d_n,$$
where $d_1, d_2, \ldots, d_n$ are the delays per unit of data at each node,
respectively. After such an assignment, if the overall execution
time of the mining task is assumed to be r, then the size of
data assigned to each node would be as follows:

$$s_1 = \frac{r}{d_1}, \quad s_2 = \frac{r}{d_2}, \quad \ldots, \quad s_n = \frac{r}{d_n} \qquad (1)$$
Let s be the total amount of input data distributed across the
n nodes (i.e., $s = \sum_{i=1}^{n} s_i$). Then, we get

$$r = \left\lceil \frac{s}{\sum_{i=1}^{n} 1/d_i} \right\rceil \qquad (2)$$
Eq. 2 provides the overall execution time of the mining
task under full parallelization. This can be achieved if the data
assigned to each node i is limited by an upper bound $s_i$, given
(by substituting r from Eq. 2 into Eq. 1) as follows:

$$s_i = \left\lceil \frac{s}{d_i \sum_{j=1}^{n} 1/d_j} \right\rceil \qquad (3)$$
Note here that $s_i$ is higher for a node i if $d_i$ is lower
(compared to other nodes). Hence, Eq. 3 can be used to
determine the bucket size for each node in a way where higher
preference is given to nodes with lower delay.
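As a concrete check of Eq. 3, the following minimal sketch computes the bucket sizes from the per-unit delays; the function and variable names are ours.

```python
import math

def bucket_sizes(total_size, unit_delays):
    """Per-node upper bounds s_i of Eq. 3, given the total input size s
    and the per-unit delays d_i (where d_i = t_i + e_i for node i)."""
    inv_sum = sum(1.0 / d for d in unit_delays)
    return [math.ceil(total_size / (d * inv_sum)) for d in unit_delays]

# Example with assumed numbers: s = 1200 units, d = (1, 2, 3).
# The harmonic weighting yields buckets of 655, 328, and 219 units,
# so the node with the lowest delay receives the most data.
print(bucket_sizes(1200, [1.0, 2.0, 3.0]))  # -> [655, 328, 219]
```

Note that the ceilings can make the buckets sum to slightly more than s, which is harmless since they are upper bounds.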
2) Greedy Bin Packing: Once the bucket sizes are deter-
mined, the next step involves assigning the data chunks to the
node buckets. We use a greedy bin packing approach where
the largest data chunks are assigned to nodes with lowest delay
(hence reducing the overall delay), as shown in Fig. 4. There
are two main intuitions:
• Weighted load distribution: This involves loading a
node based on its overall delay, i.e., the node with lower
delay gets more data to handle. This is guaranteed by
providing an upper bound on the size of data (i.e. the
bucket size) to be handled by the node (as described in
Section IV-C1).
• Delay-based node preference: Larger data chunks are
assigned to a node with larger bucket size (i.e. with lower
overall delay) so that individual data chunks get fairness
in their overall delay. This is guaranteed by sorting the
input data chunks in descending order (in preprocessing
step in Fig. 4) and filling up nodes with larger bucket
size first (Algorithm 2).
To reduce fragmentation of buckets, the buckets are completely
filled one at a time; i.e., the bucket with the lowest delay is
exhausted first, followed by the next one, and so on (Algorithm
2). This approach also enables filling more data into nodes with
lower delay.
Fig. 4. Steps to allocate data chunks to computing nodes in our algorithm: preprocessing sorts the input data chunk list and the bucket (node/cloud) list; greedy bin-packing assigns more data (larger chunks) to nodes with less delay, bounded by the bucket sizes $s_1, \ldots, s_n$ derived from the per-unit delays $d_i = t_i + e_i$; post-processing organizes the sequence of data chunks in each bucket to maximize the overlap between computation time and subsequent data transfer.
3) Maximizing the overlap between data transfer and computation: The above approach achieves parallelization of the
analytics over a large set of data chunks. However, the delay
$d_i$ for a unit of data to run on node i can be decomposed into
the data transfer delay from the central node to node i and the
actual computation time on node i. Therefore, it is possible to further
reduce the overall execution time by transferring data to a node
in parallel with computation on a different data chunk. Ideally,
the transfer delay of a data chunk should be exactly equal to
the computation time of the previous data chunk. Otherwise, there
can be delay incurred by queuing (in case the computation
time is higher) or by unavailability of data for computation (in case
the transfer delay is higher). If computation time and transfer
delay are not exactly equal, it is required to smartly select
a sequence of data chunks for the data bucket of each node
so that the difference between the computation time of each
data chunk and the transfer delay of the data chunk immediately
following it is minimized.

Depending on the ratio of data transfer delay and computation time, a node i can be categorized as follows:
• type 1, for which the transfer delay $t_i$ per unit of data is
higher than the computation time $e_i$ per unit of data;
• type 2, for which the computation time $e_i$ per unit of data
is higher than the transfer delay $t_i$ per unit of data.
It is important to understand the required characteristics of the
sequence of data chunks sent to each of these types of nodes. If
$s_{ij}$ and $s_{i(j+1)}$ are the sizes of data chunks j and j+1 assigned
to node i, then for complete parallelization of the data transfer
of chunk j+1 and the computation of chunk j, the following holds:

$$s_{i(j+1)} t_i = s_{ij} e_i$$

It should be noted here that if $t_i \geq e_i$, then ideally $s_{i(j+1)} \leq s_{ij}$. Thus, data chunks in the bucket for a type 1 node should
be in descending order of their sizes. Similarly, it can be
concluded that for a node of type 2, data chunks should be in
ascending order, as shown in the post-processing step in Fig.
4 and ensured at the end of Algorithm 2 (where the descending
order of data chunks is reversed to make the order ascending
in case $t_i < e_i$).
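To make the ordering rule concrete with assumed numbers: for a type 1 node with $t_i = 2$ and $e_i = 1$ (seconds per unit of data), the condition $s_{i(j+1)} t_i = s_{ij} e_i$ gives $s_{i(j+1)} = s_{ij}/2$, i.e., a descending sequence of chunk sizes such as 8, 4, 2, 1; for a type 2 node with $t_i = 1$ and $e_i = 2$, it gives $s_{i(j+1)} = 2 s_{ij}$, i.e., an ascending sequence such as 1, 2, 4, 8.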
Algorithm 2 Maximally overlapped bin-packing

  Sort DataChunkList by descending order of chunk size
  Determine bucket sizes s1, s2, ..., sn (Eq. 3)
  Sort BucketList by descending order of bucket size
  repeat
      for i = 1 to n do
          Remove first element from DataChunkList
          Insert the element at the tail of BucketList[i]
          for j = 1 to remainingNumberOfDataChunks do
              if (BucketList[i] is not full) and (first element in DataChunkList can fit in BucketList[i]) then
                  Remove first element from DataChunkList
                  Insert the element at the tail of BucketList[i]
              end if
          end for
      end for
  until all the BucketLists are full
  for i = 1 to numberOfNodes do
      if ti < ei then
          Reverse order of data chunks in BucketList[i]
      end if
  end for
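The following is a runnable Python rendering of Algorithm 2 as a first-fit variant under our own naming; chunk and bucket sizes are plain numbers, and any chunks that fit nowhere are simply left over in this sketch.

```python
def mobb_pack(chunk_sizes, bucket_caps, transfer_faster):
    """Sketch of Algorithm 2. chunk_sizes: data chunk sizes;
    bucket_caps: per-node capacities s_i from Eq. 3;
    transfer_faster[i]: True when t_i < e_i for node i (type 2),
    whose sequence is then reversed to ascending order."""
    chunks = sorted(chunk_sizes, reverse=True)       # largest chunks first
    order = sorted(range(len(bucket_caps)),          # largest buckets first
                   key=lambda i: bucket_caps[i], reverse=True)
    buckets = {i: [] for i in range(len(bucket_caps))}
    remaining = list(bucket_caps)
    for i in order:                  # exhaust one bucket at a time
        j = 0
        while j < len(chunks):
            if chunks[j] <= remaining[i]:            # first fit, sorted order
                remaining[i] -= chunks[j]
                buckets[i].append(chunks.pop(j))
            else:
                j += 1
    for i, ascending in enumerate(transfer_faster):
        if ascending:                # type-2 node: ascending chunk sizes
            buckets[i].reverse()
    return buckets
```

For example, mobb_pack([9, 7, 5, 3, 2, 1], [15, 12], [False, True]) yields [9, 5, 1] for the larger bucket and [2, 3, 7] (ascending, type 2) for the other.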
V. EXPERIMENTAL EVALUATION
We demonstrate the efficiency of our approach by deploying
the frequent pattern mining as a big-data analytics service
to four different computing nodes. We first describe the
experimental setup followed by the results.
A. Experimental Setup
1) Frequent Pattern Mining: Frequent pattern mining [6]
aims to extract frequent patterns from a log file. The log
contains users' activities in a system in temporal order. A
typical example is a web server access log, which contains
a history of web page accesses from users. Enterprises need
to analyze such web server access logs to discover valuable
information such as website traffic patterns and user activity
patterns by time of day, time of week, or time of year. These
frequent patterns can also be used to generate rules to predict
future activity of a certain user within a certain time interval
based on the user’s past patterns. To extract such frequent
patterns and prediction rules, our mining process parses the
log while identifying patterns of each user. Then, it generates
a set of prediction rules for each user.
As a sample log for our experiments, we have combined a
phone call log obtained from a call center and a web access log.
The log has been collected over a year, and its size is up to 200
GB. The log contains several millions of activities generated
by more than a hundred thousand users regarding human
resources such as health insurance, 401K, and retirement plans.
The objective of the frequent pattern mining is to obtain
patterns of each user's activities on human resource information
systems. As such, our approach first divides the log into a set
of user logs and then executes these user logs in parallel over
federated computing nodes.
2) Computing nodes: For our experiments, we employ four
small computing nodes to run the frequent pattern mining.
Three nodes are local clusters located in the north-eastern part
of US, and one is a remote cluster located in the mid-western
part of US. One local node, referred to as a Low-end Local
Central node (LLC), is used as a central node that collects big-
data to be analyzed. This node consists of 5 virtual machines
(VMs), each of which has two 2.8 GHz cores, 1 GB memory,
and 1 TB hard drive. Another local node, referred to as a
Low-end Local Worker (LLW), has a similar configuration
to LLC. The third local node, referred to as a High-end
Local Worker (HLW ), is a cluster that has 6 non-virtualized
servers, each of which has 24 2.6 GHz cores, 48 GB memory,
and 10 TB hard drive. All these local nodes are configured
with a high speed local connection so that data can be moved
very fast between nodes. The remote node, referred to as a
Mid-end Remote Worker (MRW), has 9 VMs, each of which
has two 2.8 GHz cores, 4 GB memory, and 1 TB hard drive.
We deploy Hadoop into all these computing nodes.
In our scenario, we assume HLW is shared by other appli-
cations; three other data mining tasks run concurrently with
our aforementioned frequent pattern mining. Meanwhile, we
intentionally impose a network overhead on LLW by moving
large files to LLW during the experiments. This increases
the data transfer delay between LLC and LLW.
B. Experiment Results
We first show the performance characteristics of computing
nodes in the context of data computation time and data transfer
delay. Fig. 5 shows nodes’ performance characteristics when
we run the mining task with the entire log in each node.

Fig. 5. Performance characteristics of computing nodes (execution time in seconds, split into computation time and data transfer delay, for LLC, LLW, MRW, and HLW)
LLC has low-end servers, but the log is stored in the local
storage. Thus, its computation time is higher than that of other nodes,
while there is no data transfer delay. To run the mining task
in other nodes, we must first copy the log to the target
node and then execute it. Since LLW is another local node
with an intentional network delay, its data transfer delay is high.
It also has low-end servers, so computation takes a long
time. MRW, which is located remotely, has a
mid-end configuration and, thereby, its computation time is
lower than that of LLC. However, as shown in the figure, the overall
execution time is similar to that of LLC due to the large transfer
delay. Meanwhile, HLW has high-end servers, and the log is
stored in the local area network. Thus, the computation time
is much lower than that of the others, and the data transfer delay is small.
Fig. 6. Impact of data transfer delay on the overall execution time (execution time in seconds versus relative input data size in %, for a single node, two nodes, and two nodes with delay)
1) Impact of data transfer delay on overall execution time: Using LLC and LLW, we compare the execution time with
intentional transfer delay to LLW against one without the delay.
We use a small log (18.6 GB) in this experiment and
resize it to show how the execution time changes with the
size of the input log. As shown in Fig. 6, the execution time
increases exponentially as the size increases. This explains
the need for parallel execution of such big-data analytics to
improve the performance. When using two nodes without the
transfer delay, the execution time decreases by almost half.
However, when we execute the mining task with the transfer
delay, the execution time is higher than the previous result,
and the gap slightly increases as the size increases. Therefore,
we have to deal with the data transfer delay carefully.
2) Effect of cloud-bursting on total execution time: Our
MOBB algorithm attempts to use only LLC first, and then
integrates, one at a time, the node with the lowest estimated
execution time. This addition is performed until the SLA is met.

Fig. 7. Cloud-bursting with four computing nodes (execution time in seconds for LLC only; LLC + HLW; LLC + HLW + MRW; and all four nodes)
In this experiment, our approach adds HLW, then MRW, and
finally uses all four nodes to run the mining task.
As shown in Fig. 7, the execution time decreases as nodes
are added. However, the execution time is not significantly
improved beyond three nodes. This is because the contri-
butions of MRW and LLW to the performance are small,
and the transfer delay caused by MRW and LLW starts to
impact the overall execution time. Therefore, using two or
three nodes can be a better choice than using four nodes,
which incurs higher resource usage cost.
3) Comparison of MOBB with other load balancing approaches: We run the mining task using MOBB and three
other methods that are used in many prior load-
balancing approaches [9][13][14][15][16] and then compare
the results. The methods used in this comparison are as follows:
• Fair division: This method equally divides the input data
and distributes it to the nodes. We use this naive
method as a baseline.
• Computation-based division: This method only considers
the computation power of each node when it performs
load-balancing, rather than considering both computation
and data transfer delay.
• Delay-based division: This method considers both each
node’s computation time and data transfer delay in load-
balancing. However, it does not consider the queuing
delay in each node incurred by blindly distributing user
logs to nodes (i.e., not considering the time overlap
between the transfer delay and computation time).
Fig. 8. Comparison of different load-balancing algorithms (execution time in seconds for MOBB, Fair, Comp.-based, and Delay-based division)
Fig. 8 shows the result when we run the mining task in
HLW and MRW. As shown in the figure, our algorithm
outperforms the other three methods. Since MRW has the large
transfer delay, the execution time of Computation-based division
is very close to that of the Fair division. Table I explains this
situation with slack time (i.e., the measured difference between
nodes' task completion times).

TABLE I
SLACK TIMES OF DIFFERENT LOAD-BALANCING ALGORITHMS

              MOBB    Fair       Comp.-based    Delay-based
Slack (sec)   41.2    13952.8    12719.5        1030.8

Although
Computation-based division considers the computation powers
of MRW and HLW when load-balancing, MRW becomes
a bottleneck due to its large transfer delay. Meanwhile, the
Delay-based division considers both the computation time and the
transfer delay, as MOBB does. This significantly reduces the
slack time. However, some data chunks accumulate in the
queue before being computed. This happens when many
small data chunks arrive and accumulate in the waiting
queue while the mining task is computing some large data
chunks. When this situation occurs too often, significant
delay can be incurred. Our MOBB algorithm
considers all these factors simultaneously when it allocates
given big-data. As evident from the results in Fig. 8, MOBB
can achieve a minimum of 20% (and up to 60%) improvement
compared to the other approaches.
If an ideal optimal data allocation is made, the slack time
must be 0 (i.e., computation in multiple computing nodes is
completed at the same time). Table I shows that our MOBB
has around 40 seconds of slack time. However, this is very small
compared to the overall time taken for the parallel data mining.
To further evaluate the optimality of our MOBB approach,
we have conducted multiple small experiments with LLC and
LLW to execute 20 randomly selected data chunks out of 50K
data chunks. We have observed that in most cases (i.e., more
than 90%), the slack time is caused by the last data chunk of
the sequence assigned to the slower node (i.e., there is no data
chunk in the slower node’s queue), and in very few cases, the
slower node has more than one data chunk in its queue while
the faster node has completed all assigned data chunks. This
indicates that our MOBB provides close to optimal allocation.
4) Efficiency of MOBB algorithm: Our MOBB algorithm is
efficient and scalable as the number of data chunks and
applications increases. As described in Algorithm 2, the complexity
of preprocessing is O(n log n + m log m), where n is the number
of computing nodes and m is the number of data chunks
to be assigned, since it sorts the given nodes and data chunks.
Typically, m is much larger than n and, thereby, the complexity
is effectively O(m log m). In our experiments, we have used 4 nodes
to run 50K data chunks. The complexity of the rest of the MOBB
algorithm (i.e., greedy bin-packing and post-processing) is
O(m). Therefore, the overhead of our MOBB algorithm is
mainly incurred in sorting a large number of data chunks. We have
used an existing quicksort algorithm in our prototype. Handling
50K data chunks with 4 nodes has taken less than
60 seconds, which is a very small portion of the overall approach
including data transfer and parallel data computation.
VI. DISCUSSION
Our MOBB approach has been designed for data-intensive
tasks (e.g., big-data analytics) that typically require special
platforms such as a MapReduce cluster and, especially, can run in
parallel. One of the best situations in which to apply our MOBB
approach is the case where the target task can be divided into
a set of independent, identical sub-tasks. For the data mining
task used in our evaluation, our data preprocessing system has
divided the input data into a set of data chunks. Then, each
sub-task is run independently and in parallel with other sub-
tasks to generate frequent patterns of each individual user with
a subset of data chunks. An extended situation considered in
our MOBB approach is the case where multiple independent
data-intensive tasks are in a task queue with different sizes
of input data. If a task in the queue cannot be divided
into independent sub-tasks, as with iterative algorithms in which
data transfer occurs just once but the computation is
run multiple times on the same data, our MOBB approach
considers the task as a unit task and attempts to parallelize
with other tasks and sub-tasks in the queue. This is because
running these algorithms across federated clouds may not
be practical since these algorithms may require considerable
communications among computing nodes (e.g., merging and
redistributing intermediate results iteratively). We are planning
to extend our MOBB approach to apply to such a task queue
containing multiple independent data-intensive tasks.
Another extension we are considering for our MOBB
approach is to dynamically re-sort computing nodes and re-
target the sequence of data chunks. In the current prototype,
the decision making is based on the current status of network
and computation capacities when it is invoked as described in
Algorithm 1. However, the status can change dynamically,
due to various unexpected events such as node failures and
network congestion, while nodes are sorted based on the previous
status and data chunks are then allocated. One possible solution
is for our MOBB approach to periodically check the
available computation capacities and network delays of nodes.
Another solution is for distributed monitoring systems
to push events into MOBB when the status changes
significantly. In either case, the status change triggers MOBB to re-
sort nodes and re-target the sequence of remaining data chunks
into the next available computing nodes, while already assigned
data chunks continue at the corresponding nodes.
VII. CONCLUSION
In this paper, we have described a cloud-bursting approach based
on a maximally overlapped load-balancing algorithm to optimize
the performance of big-data analytics that can be run in
loosely-coupled and distributed computing environments such
as federated clouds. More specifically, our algorithm
supports decisions on: (a) how many and which
computing nodes in federated clouds should be used; (b)
opportunistic apportioning of big-data to these nodes in a
way to enable synchronized completion; and (c) sequence of
apportioned data chunks to be computed in each node so that
transfer of a chunk is overlapped as much as possible with
the computation of the previous chunk in the node. We have
compared our algorithm with other load-balancing schemes.
Results show that performance can be improved by at least 20%,
and up to 60%, against other approaches.
REFERENCES
[1] A. Jacobs. (2009, Jul.) The pathologies of big data. [Online]. Available: http://queue.acm.org/detail.cfm?id=1563874
[2] T. White, Hadoop: The Definitive Guide. O'Reilly, 2009.
[3] D. Kusnetzky. (2010, Feb.) What is big data? [Online]. Available: http://blogs.zdnet.com/virtualization/?p=1708
[4] S. Rozsnyai, A. Slominski, and Y. Doganata, "Large-scale distributed storage system for business provenance," in Proc. Int'l. Conf. on Cloud Computing, 2011, pp. 516–524.
[5] Q. Chen, M. Hsu, and H. Zeller, "Experience in continuous analytics as a service (CaaaS)," in Proc. Int'l. Conf. on Extending Database Technology, 2011, pp. 509–514.
[6] R. Srikant and R. Agrawal, "Mining sequential patterns: Generalizations and performance improvements," in Proc. Int'l. Conf. on Extending Database Technology, Feb. 1996, pp. 3–17.
[7] Apache. (2011) Hadoop. [Online]. Available: http://hadoop.apache.org/
[8] Y. Li and Z. Lan, "A survey of load balancing in grid computing," Springer, Computational and Information Science, vol. 3314, pp. 280–285, 2005.
[9] T. Miyoshi, K. Kise, H. Irie, and T. Yoshinaga, "CODIE: Continuation-based overlapping data-transfers with instruction execution," in Int'l. Conf. on Networking and Computing, Nov. 2010, pp. 71–77.
[10] D. Tsafrir, Y. Etsion, and D. Feitelson, "Backfilling using system-generated predictions rather than user runtime estimates," IEEE TPDS, vol. 18, pp. 789–803, 2007.
[11] T. Mukherjee, A. Banerjee, G. Varsamopoulos, S. Gupta, and S. Rungta, "Spatio-temporal thermal-aware job scheduling to minimize energy consumption in virtualized heterogeneous data centers," Comp. Net., vol. 53, pp. 2888–2904, 2009.
[12] Z. Liu, M. Lin, A. Wierman, S. Low, and L. Andrew, "Greening geographical load balancing," in Proc. SIGMETRICS Joint Conf. on Measurement and Modeling of Computer Systems, 2011, pp. 233–244.
[13] P. Fan, J. Wang, Z. Zheng, and M. Lyu, "Toward optimal deployment of communication-intensive cloud applications," in Proc. Int'l. Conf. on Cloud Computing, 2011, pp. 460–467.
[14] M. Andreolini, S. Casolari, and M. Colajanni, "Autonomic request management algorithms for geographically distributed internet-based systems," in Proc. Int'l. Conf. on Self-Adaptive and Self-Organizing Systems, 2008, pp. 171–180.
[15] K. Reid and M. Stumm, "Overlapping data transfer with application execution on clusters," in Proc. Workshop on Cluster-Based Computing, May 2000.
[16] H. Kim and M. Parashar, CometCloud: An Autonomic Cloud Engine. Cloud Computing: Principles and Paradigms, Wiley, Chapter 10, 2011.
[17] M. Maheswaran, S. Ali, H. Siegal, D. Hensgen, and R. Freund, "Dynamic matching and scheduling of a class of independent tasks onto heterogeneous computing systems," in Proc. Heterogeneous Computing Workshop, 1999, pp. 30–44.
[18] S. Kailasam, N. Gnanasambandam, J. Dharanipragada, and N. Sharma, "Optimizing service level agreements for autonomic cloud bursting schedulers," in Proc. Int'l. Conf. on Parallel Processing Workshops, 2010, pp. 285–294.
[19] Y. Huang, Y. Ho, C. Lu, and L. Fu, "A cloud-based accessible architecture for large-scale ADL analysis services," in Proc. Int'l. Conf. on Cloud Computing, 2011, pp. 646–653.
[20] D. Alves, P. Bizarro, and P. Marques, "Deadline queries: Leveraging the cloud to produce on-time results," in Proc. Int'l. Conf. on Cloud Computing, 2011, pp. 171–178.
[21] G. Jung, K. Joshi, M. Hiltunen, R. Schlichting, and C. Pu, "A cost-sensitive adaptation engine for server consolidation of multi-tier applications," in Proc. Int'l. Conf. on Middleware, 2009, pp. 163–183.
[22] G. Box, G. Jenkins, and G. Reinsel, Time Series Analysis: Forecasting and Control. Prentice Hall, 1994.