OATS: online aggregation with two-level sharing
Distrib Parallel Databases (2014) 32:467–505



Keywords: Online aggregation · MapReduce · Cloud · Two-level sharing

    1 Introduction

Big data is being produced in various applications, including user information from social networks, sensor data, scientific data and a variety of log data, and big data analytics plays an important role in today's fast-paced, data-driven businesses [1]. Processing a tremendous amount of data to derive the latent useful information has become an urgent demand; however, it is difficult to support in traditional databases due to the large volumes of data and the complexity and diversity of queries [2]. To handle this issue, cloud-based distributed data processing platforms, such as Hadoop [3] and Hyracks [4], have been proposed to provide more efficient and cost-effective solutions for big data analytics, and a large number of contributions covering task scheduling, replica management, execution mechanisms, etc., have significantly improved the overall performance of these cloud platforms (e.g., [5–11]).

Performing statistical analytics and delivering exact final query results in the parallel cloud, however, can be computationally expensive (long processing time) [12] and resource intensive [13] due to the massive volumes of data involved; overloaded systems and high delays are incompatible with a good user experience, while early approximate answers that are accurate enough are often of much greater value to users than tardy exact results [14]. One effective technique for handling this problem is online aggregation (OLA) [15], which aims to respond to large-scale aggregation queries¹ with a statistically valid estimate of the final result delivered early (trading accuracy for time). The basic idea behind OLA is to compute an approximate result over unbiased random samples and refine the result as more samples are received. In this way, users can terminate a running query prematurely once an acceptable estimate has been reached.

OLA has been studied for RDBMSs (relational database management systems), streaming data management systems and P2P environments [15–20], and it now emerges as a new research area for the cloud environment (especially MapReduce-oriented cloud systems) with the development of cloud computing [2, 21–26]. The main benefits of running OLA in a cloud system can be summarised as follows: (1) it makes the original platform much more flexible by providing a fast and effective way to obtain approximate results within a prescribed level of accuracy, rather than the exact results generated by the complete processing model, which can significantly improve analytic performance against massive volumes of data; (2) it reduces the economic cost for users of typically pay-as-you-go cloud systems, that is, a user can save money by monitoring the estimated result and killing the

computation early once sufficient accuracy has been obtained; and (3) it increases the overall throughput of the cloud system, since the resources released by early-terminated OLA queries can be delivered to the other running OLA queries immediately, which helps to increase the degree of parallelism and resource utilization.

1 Aggregation queries are among the most common query types for data analysis; they often scan a large amount of tuples to generate summary and statistical results.

Although OLA is very suitable for the cloud environment, there are also some limitations that affect the overall performance due to inefficiencies arising from the rigid execution model of the cloud system, among which a major one considered in this paper is the sharing problem. For example, in a MapReduce-oriented cloud system the sharing problem stems from the fact that all queries sample and estimate their own results independently. Given a set of OLA queries, the time consumption can roughly be divided into three parts: task initialization cost, sampling cost and computation (estimation) cost. The performance drawbacks without sharing can then be summarised as follows: (1) a high proportion of task initialization cost. Since OLA queries obtain approximate results from random samples rather than the whole input data, the computational cost is relatively low, which increases the proportion of task initialization cost; executing a large number of OLA queries independently therefore results in considerable initialization cost. (2) Large sampling cost. Each OLA query needs to establish its own I/O pipeline to draw samples, leading to large redundant disk I/O cost for OLA queries with the same input data. (3) Repetitive statistical computation cost. After receiving the samples, each OLA query conducts the statistical computation to estimate the approximate result independently, even though optimization opportunities (such as reuse of partial statistical results) exist among multiple OLA queries, leading to a large amount of repetitive statistical computation.

To overcome the three performance drawbacks mentioned above, we propose and develop a system called OATS (online aggregation with two-level sharing strategy in cloud) to bridge the gap between the current mechanism of the MapReduce paradigm and the requirements of OLA. The main contributions of this paper include:

– We propose an OLA system with a two-level sharing strategy in the cloud, called OATS, which is tailored for the MapReduce framework, to improve the overall performance of running multiple OLA queries in the cloud.
– We exploit the first-level sharing for sampling with a customized sample management mechanism, which can significantly reduce redundant disk I/O cost.
– We present the second-level sharing for statistical computation with a heuristic sharing-group generation algorithm called SLSA, which not only improves the overall performance but also scales well to a larger number of sharing OLA queries.
– We introduce the details of the MapReduce implementation of our two-level sharing strategy for single-relation and multi-relation OLA queries, including the Map and Reduce logic and the corresponding statistical calculation approach for result estimation.
– We deploy OATS based on the modified version of Hadoop called HOP on SEU-Cloud,² and conduct extensive experiments to demonstrate the efficiency and effectiveness of OATS.

2 Southeast University Cloud Platform, which supports data processing applications of the whole university.


The remainder of this paper is organized as follows. In the next section, we give an overview of OATS. In Sect. 3, we present the details of our two-level sharing strategy and describe its algorithms. Section 4 introduces the implementation of OATS over MapReduce. Section 5 describes the statistical computation and accuracy estimation. In Sect. 6, we present the experimental setup and report the results of the experimental evaluation. Finally, we review related work in Sect. 7 and conclude in Sect. 8.

    2 Overview of OATS

In this section, we introduce the system architecture of OATS on top of a MapReduce-oriented cloud system, including the data flow and processing logic. To simplify the presentation and help readers easily grasp the basic idea of OATS, we first give some necessary preliminaries.

    2.1 Preliminaries

    2.1.1 Underlying system platform

Implementing OATS in the cloud requires extensions to the traditional MapReduce framework, because the batch-oriented original MapReduce framework cannot keep pace with the requirements of the interactive OLA processing model. Therefore, we choose the Hadoop Online Prototype (HOP) [23] as a natural candidate for the underlying query processing engine. HOP is a modified version of the original MapReduce framework that constructs a pipeline between Map and Reduce so that a reduce task can start as soon as any Map output is generated. This pipelining property helps support OLA by returning an early approximate result of the query and refining it as the query progresses.

    2.1.2 Data organization

In general, data in the cloud is typically organized and processed in blocks on a distributed file system [2,27]; for example, the input data of a MapReduce job is divided into smaller equal-size blocks (called size-aware partitioning) and reorganized into logical chunks called input splits for parallel processing. We note that this basic solution is not optimal for some common operations such as sessionization and join queries. This is expected, as both sessionization and join are repartition-based, and repartitioning data is an expensive operation in MapReduce, since it requires local sorting in each mapper, shuffling data across the network, and merging of sorted files in each reducer [28]. However, it is well known that such costly data repartitioning can be avoided if the data is already organized into partitions that correspond to the query, and [19,24,28] adopt a preprocessing step at load time in which they partition the input data according to the query, improving the overall performance significantly. Motivated by this, in this paper we apply the same preprocessing step as [24] to partition input data according to its content, that is, the block boundaries are created


such that each new block has uniform intervals of the partitioning columns (called content-aware partitioning; interested readers may find the details of this preprocessing step in [24]). By doing so, two additional benefits besides the reduced repartitioning cost can be obtained to improve the performance of running OLA queries, namely (1) increased sampling efficiency and thus better resource utilization, and (2) much better scalability of the OLA queries with regard to skewed data distributions. Given these advantages, we take the content-aware partition as the default data organization and study a universal two-level sharing strategy in Sect. 3, which is also adequate for the traditional size-aware partition of the original MapReduce paradigm, since the sharing issue of size-aware partitions can be recognised as a special case of the content-aware one.

    2.1.3 Sharing opportunities

The task of OATS is to exploit the latent sharing opportunities among multiple OLA queries and to combine the appropriate queries together into a single job, improving the overall performance. In general, there are two major non-trivial sharing opportunities when running OLA queries over a MapReduce-oriented cloud system: sharing of sampling and sharing of statistical computation. To share sampling, we merge two pipelines into a single pipeline and scan the input data only once to generate the common samples, reducing the cost of sampling. In the case of sharing statistical computation, we calculate the statistics incrementally by reusing intermediate results, improving the performance of estimation. The sharing issue in the cloud has been considered in [29], but the MRShare framework proposed in [29] is not applicable to our requirements for the following reasons: (1) MRShare is tailored for the original MapReduce framework, in which the size-aware partition method is adopted to organize the input data, and it cannot easily be extended to the novel MapReduce framework in which content-aware partitioning is adopted as a preprocessing step; and (2) MRShare assumes that all data need to be scanned to obtain the exact query result, which does not hold for OLA, since OLA is a sampling-based method that obtains approximate results from a subset of the data. In Sect. 3, we consider the characteristics of content-aware partitioning and sampling to discuss the sharing problems of OATS based on these two sharing opportunities in detail, and present a two-level sharing strategy to improve the overall performance of OATS.

    2.2 System architecture of OATS

We are now ready to discuss the architecture and workflow of OATS in the cloud, as shown in Fig. 1. Note that the input data is organized in blocks by the content-aware preprocessing step mentioned in Sect. 2.1, which means these blocks have uniform intervals but unequal block sizes, and are distributed over different data nodes for storage. Given the incoming OLA queries, OATS does not execute them immediately but collects them and analyses the potential sharing opportunities using the Query Collector component. The set of incoming queries is then combined into a grouped dynamic


    Fig. 1 System architecture of OATS

MapReduce job. This grouped job is decomposed into a series of map tasks, which are assigned to the data nodes holding the input blocks that overlap with the predicates of the involved shared OLA queries. The reason we call this grouped job dynamic is that each grouped map task needs to be dynamically configured at run-time based on the sharing information for further processing. Afterwards, OATS applies the two-level sharing strategy according to the sharing information of each mapper in the map phase to provide sharing for sampling and computation, eliminating the redundant disk I/O cost and reducing the repetitive statistical computation cost. The reduce phase then estimates the approximate result for each involved shared OLA query once the reducer has received sufficient map output (the pipelined model of HOP). If the accuracy obtained is unsatisfactory, the above Map/Reduce phases are repeated to retrieve samples continuously, calculate the statistics incrementally and update the approximate result progressively. The final result is returned to the user when the desired accuracy is reached, and the user can stop the job early before its completion.

    3 Two-level sharing strategy

In OATS, we address two major challenges: how to find the sharing opportunities among OLA queries, and how to provide an effective sharing mechanism so that these OLA queries share in an optimal way.

Towards the first challenge, we first define the queries with overlapped predicates as the candidates for sharing. Consider the query set Q_thd = {Q_1, Q_2, ..., Q_n} collected for sharing (according to the sharing threshold) with the predicates P = {P_1, P_2, ..., P_n}, and the input file F with the block set B = {B_1, B_2, ..., B_m}.


Table 1 Sharing opportunities of different cases

Input data   Predicates   Aggregate type   Sharing opportunities
Same         Different    –                Sampling
Same         Same         Different        Sampling
Same         Same         Same             Sampling and computation
Different    –            –                None

The overlapped queries for each B_i ∈ B can be obtained by checking the predicate set P, denoted Q^i_olp = {Q_x ∈ Q_thd | B_i ∈ (P_x ∩ B)}. In the traditional MapReduce framework, each Q_x ∈ Q^i_olp contains a map task M^i_x to process B_i; in OATS, all of {M^i_x | Q_x ∈ Q^i_olp} are combined to form a new grouped map task M^i_com. On the other hand, we need to determine the sample size required for each Q_x ∈ Q^i_olp, since OLA processing requires that all collected samples be unbiased. The basic sampling method of OATS can therefore be described as follows. Consider the query Q_x ∈ Q_thd with predicate P_x; then each B_i ∈ (P_x ∩ B) provides k^i_x samples for Q_x, where

$$k^i_x = \frac{|B_i|}{\sum_{i=1}^{|P_x \cap B|} |B_i|} \cdot k$$

and k = Σ_i k^i_x is the total required sample size in each iteration of OLA, which is defined by the user. Given the overlapped queries Q^i_olp with their required sample sizes K = {k^i_x | Q_x ∈ Q^i_olp} for B_i ∈ B, the second challenge can be stated as how to provide an effective sharing mechanism that satisfies the different sample requirements (the differences among the items of K) of Q^i_olp while reducing the cost of sampling and statistical computation as much as possible.
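For illustration, the per-block quota computation above can be sketched in a few lines of Java (an illustrative sketch with hypothetical block names, not the OATS implementation):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch: proportional per-block sample quotas k_x^i for one OLA query.
// Block names and sizes are illustrative; OATS derives them from the
// content-aware partitions that overlap the query predicate.
public class SampleAllocation {
    public static Map<String, Long> allocate(Map<String, Long> overlappedBlockSizes, long k) {
        long total = overlappedBlockSizes.values().stream().mapToLong(Long::longValue).sum();
        Map<String, Long> quotas = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : overlappedBlockSizes.entrySet()) {
            // k_x^i = |B_i| / sum(|B_i|) * k, rounded to the nearest integer
            quotas.put(e.getKey(), Math.round((double) e.getValue() / total * k));
        }
        return quotas;
    }

    public static void main(String[] args) {
        Map<String, Long> blocks = new LinkedHashMap<>();
        blocks.put("B1", 400L);
        blocks.put("B2", 100L);
        blocks.put("B3", 500L);
        // With k = 100 samples per iteration: B1 -> 40, B2 -> 10, B3 -> 50
        System.out.println(allocate(blocks, 100));
    }
}
```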

Before we discuss the solution to the second challenge, we need to clarify the real demand for sharing in our OATS scenario. Table 1 shows the different cases with their possible sharing opportunities. Given two queries Q_1 and Q_2 with the series of map tasks M_1 = {M^1_1, M^2_1, ..., M^m_1} and M_2 = {M^1_2, M^2_2, ..., M^n_2}, the sharing opportunities in Table 1 can be explained as follows. Consider two map tasks M^i_1 and M^j_2 belonging to M_1 and M_2 respectively. The sharing of sampling always exists if both map tasks have the same input, that is, B_i equals B_j (the two tasks read the same block), so that we can merge the two sampling pipelines into a single pipeline and scan the input data only once. Besides the same input, if the predicates and aggregate type of M^i_1 and M^j_2 are also the same, then there is another sharing opportunity, namely the sharing of computation, since each map task performs the same operation. Otherwise, if the input data of M^i_1 and M^j_2 are different, then there are no sharing opportunities regardless of whether the other two factors are the same.

In the rest of this section, we introduce a two-level sharing strategy in response to the second challenge mentioned above. Figure 2 gives a general example of our sharing mechanism for a given block B_i. Given the set of map tasks {M^i_x | Q_x ∈ Q^i_olp} assigned to B_i, the basic procedure of our sharing mechanism is as follows. Firstly, a single sampling pipeline is constructed to draw samples from B_i continuously, and a sample buffer is built to maintain the samples in main memory. In the case of the first-level sharing, i.e., the sharing of sampling, a component called SSO (sharing sampling optimizer) is responsible for assigning appropriate samples


    Fig. 2 Instance of two-level sharing mechanism

to support the shared map tasks in M^i_com that satisfy cases 1 and 2 in Table 1. In the case of the second-level sharing, i.e., the sharing of computation, a component called SCO (sharing computation optimizer) finds the optimal sharing plan among the set of map tasks that satisfy case 3 in Table 1, then calculates and shares the intermediate statistical results within each sharing group.

    3.1 First level: the sharing of sampling

In order to reduce the redundant disk I/O cost, we propose a shared sampling strategy for multiple OLA queries. The basic idea behind shared sampling is to combine the multiple sampling pipelines and collect samples in a single pass over the disk. Concretely, OATS builds a sample buffer for each involved block B_i in main memory to maintain and manage the samples drawn from B_i. The grouped task M^i_com then collects samples from this sample buffer rather than from disk, and assigns the appropriate number of samples to each involved M^i_x, reducing the redundant disk I/O. Two major issues need to be considered for shared sampling: (1) how to construct the sample buffer for all queries of Q^i_olp, and (2) how to efficiently update the sample buffer to match the progress of each Q_x ∈ Q^i_olp.

Concerning the first issue, we initialize the sample buffer with a control parameter pointers, which is used to realize the buffer management. In our implementation, pointers includes the pointers to the first and last sample that have been visited by all queries of Q^i_olp, denoted first and last respectively, and the pointer to the next sample to be visited by a given Q_x ∈ Q^i_olp, denoted cur(Q_x) (also called the progress pointer). Note that the set of progress pointers {cur(Q_1), cur(Q_2), ..., cur(Q_n)} is used to control the sampling procedure for all queries of Q^i_olp, and the interval [first, last] indicates the samples in this buffer that need to be updated, where


last = Min{cur(Q_1), cur(Q_2), ..., cur(Q_n)}. Concerning the second issue, the grouped map task M^i_com adopts an aggressive method to update the sample buffer: for each incoming sample, OATS adds it to the sample buffer if the buffer is not full, and otherwise swaps out the expired samples in [first, last].
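A minimal sketch of such a pointer-managed sample buffer is given below (class and method names are our own illustration, not the OATS source; eviction of the consumed prefix stands in for the [first, last] bookkeeping):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of a public sample buffer with per-query progress pointers.
public class SampleBuffer<S> {
    private final List<S> samples = new ArrayList<>();
    private final Map<String, Integer> cur = new HashMap<>(); // progress pointer per query
    private final int capacity;

    public SampleBuffer(int capacity, List<String> queryIds) {
        this.capacity = capacity;
        for (String q : queryIds) cur.put(q, 0);
    }

    // Add an incoming sample if the buffer is not full; otherwise evict the
    // prefix that every query has already consumed and shift the pointers.
    public void add(S sample) {
        if (samples.size() >= capacity) {
            int last = cur.values().stream().mapToInt(Integer::intValue).min().orElse(0);
            samples.subList(0, last).clear();
            cur.replaceAll((q, p) -> p - last);
        }
        samples.add(sample);
    }

    // True if query q can draw k more samples from the buffer.
    public boolean hasEnoughSamples(String q, int k) {
        return samples.size() - cur.get(q) >= k;
    }

    // Hand the next k samples to query q and advance its progress pointer.
    public List<S> draw(String q, int k) {
        int from = cur.get(q);
        List<S> out = new ArrayList<>(samples.subList(from, from + k));
        cur.put(q, from + k);
        return out;
    }
}
```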

3.2 Second level: the sharing of statistical computation

Our second-level sharing is closely related to the sharing problem of aggregate continuous queries (ACQs) over data streams.³ Given an ACQ with range r and slide s (r > s) defined over a window, the ACQ computes the aggregate result over the range r and updates it every slide s. Consider a set of ACQs with the same input data and the same predicates: the sharing problem is generally solved by composing the multiple sliced windows of the ACQs into a common sliced window; the partial aggregates computed over the common sliced window can then be used to answer each individual ACQ (interested readers may find the details of this sharing problem in [30,31]). Returning to our sharing of computation, the grouped M^i_com is composed of the series {M^i_x | Q_x ∈ Q^i_olp} for a given block B_i, and the task of each M^i_x is to compute the statistical result against its k^i_x samples iteratively (the computation proceeds round by round, and the sample size in each round is k^i_x). Note that our sharing of computation can be seen as a special case of the ACQ sharing problem with r = s = k^i_x, so that the full-share solution in [30] can be selected as a candidate for the second-level sharing.

In OLA, the statistical result is calculated in an incremental way. When the first set of samples is received, we calculate the statistical result of these samples with a partial aggregation operator. When another set of samples is collected, we not only need to calculate its partial aggregate but also require an extra final aggregation to calculate the statistical result over all samples collected so far. In the full-sharing solution, the cost of partial aggregation is reduced at the price of an increased final aggregation cost, so there is a trade-off between no sharing and full sharing. We illustrate this trade-off through the following example.

3 To the best of our knowledge, OATS is the first work studying shared OLA with a two-level sharing strategy over MapReduce, and the greedy algorithm proposed in [31] is the latest solution to the second-level sharing problem. Therefore, we select this greedy algorithm as the candidate against which to compare our SLSA algorithm.


Example 1 Consider two tasks M^i_a and M^i_b for block B_i with required sample sizes k^i_a = 9 and k^i_b = 6 respectively, and use the number of partial aggregates generated per sample as the metric for the unit cost of final aggregation operations (motivated by [31]). If the two tasks are processed independently, their partial aggregate operators produce one partial aggregate every 9 and every 6 samples respectively. The unit final aggregation costs of M^i_a and M^i_b are then C_a = 1/9 ≈ 0.11 and C_b = 1/6 ≈ 0.17, so the total unit final aggregation cost is about 0.28. In the case of full sharing, a common sliced window of size CS = lcm(k^i_a, k^i_b) = 18 is built first, and this common sliced window is divided into 4 fragments according to the involved k^i_x, i.e., (6, 9, 12, 18) in this example. The unit final aggregation cost of each of the two tasks is then C_{a,b} = 4/18 ≈ 0.22, so the total unit final aggregation cost is about 0.44, which is larger than in the unshared case.
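The arithmetic of Example 1 can be reproduced directly; the following illustrative sketch computes the unshared and fully shared unit final aggregation costs for a set of per-round sample sizes:

```java
import java.util.Arrays;

// Reproduces the arithmetic of Example 1: unit final aggregation cost
// without sharing versus with full sharing over a common sliced window.
public class SharingCostExample {
    static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }
    static long lcm(long a, long b) { return a / gcd(a, b) * b; }

    // Unshared: each task contributes 1/k_x final aggregations per sample.
    static double unsharedUnitCost(long[] k) {
        return Arrays.stream(k).mapToDouble(v -> 1.0 / v).sum();
    }

    // Full sharing: the common sliced window has length lcm(k_1..k_n) and is cut
    // at every multiple of every k_x; each of the n tasks pays E/|T| per sample.
    static double fullShareUnitCost(long[] k) {
        long T = Arrays.stream(k).reduce(1L, SharingCostExample::lcm);
        long fragments = 0;
        for (long edge = 1; edge <= T; edge++) {
            final long e = edge;
            if (Arrays.stream(k).anyMatch(v -> e % v == 0)) fragments++;
        }
        return k.length * (double) fragments / T;
    }

    public static void main(String[] args) {
        long[] k = {9, 6};
        System.out.printf("unshared = %.2f, full share = %.2f%n",
                unsharedUnitCost(k), fullShareUnitCost(k));
        // prints: unshared = 0.28, full share = 0.44
    }
}
```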

Given this clear trade-off between no sharing and full sharing, the ultimate goal of our second-level sharing strategy is to find an optimal way to share among all involved M^i_x. The greedy algorithm proposed in [31] is a candidate for this problem; however, its time complexity is relatively high, especially for larger sharing thresholds. To obtain an acceptable sharing plan efficiently, we propose a heuristic sharing algorithm called SLSA based on a few heuristic rules, and we compare its complexity with that of the original greedy algorithm.

    3.2.1 Formalization

Given n map tasks {M^i_x | Q_x ∈ Q^i_olp} for B_i, with sample size k^i_x for each M^i_x, the costs of no sharing and full sharing are as follows. In the unshared case, the unit final aggregation cost of M^i_x is 1/k^i_x. If |T| samples have been processed, the overall final aggregation cost of a certain M^i_x is |T|/k^i_x, and the overall final aggregation cost of all M^i_x is Σ_x |T|/k^i_x. On the other hand, the partial aggregation cost of each M^i_x is |T|, so the total partial aggregation cost of all M^i_x is n|T|. Therefore, the unit cost without sharing is

$$C_{unshare} = \frac{n|T| + \sum_x \frac{|T|}{k^i_x}}{|T|} = n + \sum_x \frac{1}{k^i_x} \qquad (1)$$

In the case of full sharing, suppose the length of the common sliced window is lcm = |T| and the number of fragments in |T| is E. The number of fragments for each k^i_x is then \frac{E}{|T|} k^i_x, and the unit final aggregation cost is \frac{E k^i_x / |T|}{k^i_x} = \frac{E}{|T|}. Thus, the overall final aggregation cost of a certain M^i_x is \frac{E}{|T|} |T| = E, and the overall final aggregation cost of all M^i_x is nE. On the other hand, the total partial aggregation cost of all M^i_x is |T|. Therefore, the unit cost of full sharing is

$$C_{share} = \frac{|T| + nE}{|T|} = 1 + \frac{nE}{|T|} \qquad (2)$$


Based on the above analysis, the general form of the sharing cost for {M^i_x | Q_x ∈ Q^i_olp} is

$$C_{general} = n_1 + \sum_{x=1}^{n_1} \frac{1}{k^i_x} + \sum_{m}\left(1 + \frac{n_m E_m}{|T_m|}\right) \qquad (3)$$

where n_1 is the number of unshared tasks, m ranges over the sharing groups, |T_m| is the length of the common sliced window of the n_m tasks in group m, E_m is the number of fragments in |T_m|, and Σ_m n_m = n_2 with n_1 + n_2 = n, the number of M^i_x in B_i. In particular, our objective is to find the numbers n_1 and m that yield the lowest sharing cost in an efficient way (as an alternative to the greedy approach of [31]).
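As a concrete illustration of Eq. (3), the following self-contained sketch (our own illustration, not the OATS code) evaluates the general sharing cost of a candidate grouping, treating singleton groups as unshared tasks:

```java
import java.util.Arrays;
import java.util.List;

// Evaluates the general sharing cost of Eq. (3) for a candidate grouping:
// a singleton group is an unshared task (cost 1 + 1/k); a larger group pays
// 1 + n_m * E_m / |T_m| over its common sliced window.
public class GeneralSharingCost {
    static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }
    static long lcm(long a, long b) { return a / gcd(a, b) * b; }

    static double groupCost(long[] k) {
        if (k.length == 1) return 1.0 + 1.0 / k[0];          // unshared task
        long T = Arrays.stream(k).reduce(1L, GeneralSharingCost::lcm);
        long E = 0;                                           // fragments in |T|
        for (long edge = 1; edge <= T; edge++) {
            final long e = edge;
            if (Arrays.stream(k).anyMatch(v -> e % v == 0)) E++;
        }
        return 1.0 + (double) k.length * E / T;
    }

    static double generalCost(List<long[]> grouping) {
        return grouping.stream().mapToDouble(GeneralSharingCost::groupCost).sum();
    }

    public static void main(String[] args) {
        // Two candidate plans for sample sizes {9, 6}: unshared vs. fully shared.
        System.out.println(generalCost(List.of(new long[]{9}, new long[]{6}))); // ~2.28
        System.out.println(generalCost(List.of(new long[]{9, 6})));             // ~1.44
    }
}
```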

    3.2.2 Second level sharing algorithm (SLSA)

The second sharing problem can be considered an extended generalized task assignment problem, which is known to be NP-hard [32]: the input is the set of required sample sizes {k^i_x} and a set of sharing groups, where each k^i_x incurs a sharing cost when it is assigned to a sharing group; the output is an assignment of all k^i_x to groups that minimizes the total sharing cost. However, we do not know the optimal number of sharing groups in advance, which would be necessary as input, and the increment (reduction) of sharing cost caused by adding (deleting) a k^i_x to (from) a sharing group is not constant, as it depends on the other k^i_x already assigned to that group. For these two reasons, we cannot directly use any of the classical algorithms for the task assignment problem (e.g., dynamic programming) to solve our second sharing problem. We therefore propose a heuristic algorithm called SLSA, which guarantees a relatively low sharing cost with an acceptable time complexity compared to the greedy algorithm. The two heuristic rules it obeys are listed below.

Definition 1 (Rule 1: separability of two sample sizes) Given two sample sizes k^i_a and k^i_b with a large difference in value, we should avoid assigning them to the same sharing group as far as possible. Note that the unit final aggregation cost is 1/k^i_b if only k^i_b exists in the sharing group, whereas if k^i_a also belongs to the group, the unit final aggregation cost becomes (1 + k^i_b / k^i_a) / k^i_b, where k^i_b / k^i_a indicates the increase in fragments. Given k^i_b, the smaller k^i_a is, the larger the unit final aggregation cost.

Definition 2 (Rule 2: associativity of two sample sizes) Given two sample sizes k^i_a and k^i_b with more common edges in the common sliced window, we should assign them to the same sharing group as far as possible. Note that the common sliced window composed by k^i_a and k^i_b is divided into several fragments separated by edges, and the number of fragments is proportional to the number of edges. If k^i_a shares more common edges with k^i_b in the common sliced window, then the total number of edges will in most cases be smaller than for a sample size with fewer common edges, leading to a smaller increase in the unit final aggregation cost.


Fig. 3 Procedure of SLSA

Based on the heuristic rules mentioned above, our SLSA proceeds in two separate stages, partition and adjustment, as shown in Fig. 3. The task of the partition stage is to determine an initial set of sharing groups with a low sharing cost based on the first heuristic rule in a greedy way, while the task of the adjustment stage is to further reduce the sharing cost according to the second heuristic rule, in which the sample size k^i_x that has the fewest common edges with the other k^i_x's is moved to the adjacent sharing group.

We are now ready to describe the detailed procedure of SLSA, shown in Algorithm 1. SLSA is processed for each block B_i, so its input is the set of sample sizes K = {k^i_x} and its output is the optimal sharing groups SG. Firstly, we sort K in ascending order, which makes the subsequent partition stage more convenient. Then the initial sharing groups are built with K as the only element (lines 3-4). In the partition stage, the input K is divided into several sharing groups based on the first heuristic rule; the adjustment stage then takes the output of the partition stage as its input, improves the sharing plan based on the second heuristic rule, and finally returns the optimal sharing groups (lines 5-7).
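Algorithm 1 is not reproduced here, but its overall shape, as described above, is roughly the following (a structural sketch under our reading of the text; partitionStage and adjustPartition stand in for Algorithms 2 and 3):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Rough shape of the SLSA driver: sort the sample sizes, start from a single
// group containing all of K, then run the partition and adjustment stages.
public class Slsa {
    public static List<List<Long>> run(List<Long> K) {
        List<Long> sorted = new ArrayList<>(K);
        Collections.sort(sorted);          // sort K in ascending order
        List<List<Long>> SG = new ArrayList<>();
        SG.add(sorted);                    // initial sharing groups contain only K
        SG = partitionStage(SG);           // split greedily by Rule 1
        SG = adjustPartition(SG);          // migrate candidates by Rule 2
        return SG;                         // resulting sharing groups
    }

    // Placeholder: greedily split the group with the largest variance (Algorithm 2).
    static List<List<Long>> partitionStage(List<List<Long>> SG) { return SG; }

    // Placeholder: migrate candidates between adjacent groups (Algorithm 3).
    static List<List<Long>> adjustPartition(List<List<Long>> SG) { return SG; }
}
```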

Partition stage The procedure of the partition stage is shown in Algorithm 2. The basic idea behind the partition stage is to divide the set of sample sizes according to their variance. We take the sharing group sg with the largest variance from SG as the candidate for partitioning (line 4), since it has the most potential for cost reduction based on the first heuristic rule, and divide it into two sub-partitions based on the average of sg.


Algorithm 3 adjustPartition (Adjustment Stage)
1: // right → left
2: for i = SG.size() to 1 do
3:   sgr = SG.get(i)
4:   sgl = SG.get(i - 1)
5:   avgr = getAverage(sgr)
6:   cand = getCandidate(sgr, avgr)
7:   for j = 0 to cand.size() do
8:     egr = getComEdge(cand[j], sgr)
9:     CEr.put(egr, cand[j])
10:    egl = getComEdge(cand[j], sgl)
11:    CEl.put(egl, cand[j])
12:  end for
13:  CEr.sort(egr)
14:  CEl.sort(egl, Comparator)
15:  for j = 0 to cand.size() do
16:    rIndex = CEr.getIndex(cand[j])
17:    lIndex = CEl.getIndex(cand[j])
18:    rank = win * rIndex + wout * lIndex
19:    wcand.put(cand[j], rank)
20:  end for
21:  wcand.sort(rank)
22:  wcand = wcand.filter(threshold)
23:  for all elements ∈ wcand do
24:    SG' = migration(SG, wcand.nextElement())
25:    Ccur = getShareCost(SG')
26:    if Ccur < Cmin then
27:      SG = SG'
28:      Cmin = Ccur
29:    end if
30:  end for
31: end for
32: // left → right: similar to the case of right → left
33: ......
34: return SG

We take the case of right → left as an example to illustrate the adjustment procedure, since the two cases involve similar operations.

Firstly, we define the sharing group sgr as the i-th element of SG and sgl as the (i-1)-th one, where 1 ≤ i ≤ SG.size(). Given these two sharing groups, we need to select the candidates for migration from sgr. Based on the first heuristic rule, all candidates for migration must be smaller than the average avgr of sgr (lines 5-6). This is expected, as removing a smaller k^i_x has a higher chance of reducing the sharing cost of sgr, because more final aggregation operations may be saved; on the other hand, inserting a smaller k^i_x has a lower chance of increasing the sharing cost of sgl, since the difference between k^i_x and the elements in sgl is relatively small. However, these initial candidates for migration do not yet account for the second heuristic rule, so we introduce a weighting mechanism to evaluate the importance of common edges. We calculate the number of k^i_x's in sgr that have common edges with cand[j], denoted egr, and egl is calculated analogously for sgl (lines 8, 10). Note that a smaller egr indicates that removing cand[j] may lead to a larger sharing cost reduction for sgr, while a bigger egl indicates that inserting cand[j] may lead to a smaller sharing cost


increment for sgl, based on the second heuristic rule. Therefore, we need to consider the effect of both egr and egl to calculate an integrated priority for cand[j]. We record the pairs (egr, cand[j]) and (egl, cand[j]) in CEr and CEl respectively (lines 9, 11), then sort CEr in ascending order and CEl in descending order (lines 13-14). Afterwards, we use the indices of cand[j] in CEr and CEl, denoted rIndex and lIndex, as normalized parameters to measure the final priority of cand[j] (denoted rank), coupling the effects of egr and egl with the following formula:

$$rank = w_{in} \cdot rIndex + w_{out} \cdot lIndex \qquad (4)$$

where w_in and w_out are the weights assigned to rIndex and lIndex,⁴ and w_in + w_out = 1. By adjusting w_in and w_out, we can obtain an optimal migration instance that reduces the sharing cost as far as possible.

Then, we record the pair (cand[j], rank) in wcand and sort the candidates by rank in ascending order (lines 19, 21); a smaller rank means the corresponding cand[j] should be selected for migration first. In our implementation, we only select threshold elements from wcand as the final candidates to simplify the processing (line 22). For each candidate in wcand, we migrate it from sgr to sgl to generate a new sharing grouping SG' with current sharing cost Ccur, and we take the sharing groups with the minimum Ccur as the final sharing instance for the right → left case (lines 23-30). Then we conduct

another adjustment from left to right, which is similar to the right → left case but with two differences: (1) cand is composed of the k^i_x's larger than the average of sgl, rather than the k^i_x's smaller than the average of sgr, and (2) the k^i_x's migrated from sgr in the right → left case are not considered for re-migration. After the two rounds of migration, the adjustment is completed and the final sharing groups SG are returned (line 34).

Figures 4 and 5 show a detailed example of SLSA with (w_r = w_l = 0.5). Given the set of sample sizes K required by a shared map task M^i_com in a certain block B_i, namely {2, 3, 4, 5, 7, 21, 23, 25} (only for illustration), the result of the partition stage (shown in Fig. 4) is a sharing grouping SG that contains 3 sub-groups and has a sharing cost of 6.01, less than the initial cost of 7.25. In the adjustment stage, the right → left case does not reduce the sharing cost, but in the left → right case the optimal sharing groups are obtained through 4 migrations, and the input K is divided into three sub-groups, {{2, 4}, {3, 5}, {7, 21, 23, 25}}, with an optimal final sharing cost of 5.78 (the sharing cost of the original greedy algorithm is 5.99 for this example).

    3.2.3 Complexity of SLSA

The task of both SLSA and the original greedy algorithm is to find an appropriate set of sharing groups with the lowest sharing cost according to Eq. (3). To

4 In the case of right → left, w_in indicates the weight assigned to the right sg; otherwise, it indicates the left sg.


Fig. 4 Instance of SLSA (partition stage)

calculate the sharing cost, two parameters need to be computed first: (1) the length of the common sliced window, denoted |T|, and (2) the number of fragments in |T|, denoted E.

For the first parameter, we compute the least common multiple of all the sample sizes, lcm(k_1, k_2, ..., k_n), as the value of |T|. Consider the basic lcm of two inputs: lcm(k_1, k_2) = (k_1 k_2)/gcd(k_1, k_2). Since most of the time consumption of lcm(k_1, k_2) comes from computing the greatest common divisor gcd(k_1, k_2), the complexity of lcm(k_1, k_2) is O(log m) [33], where m = Max{k_1, k_2}, and we denote the time cost of lcm(k_1, k_2) as t(m). The complexity of the lcm of n inputs is then O(n · t(s)), denoted l(n, s), where lcm(k_1, k_2, ..., k_n) = s. For the second parameter, we adopt a hash table to realize the computation of E; its complexity is a function O(s), so we use e(s) to represent this time consumption.

Given l(n, s) and e(s), the time consumption of the sharing cost computation for full sharing is g(n, s) = l(n, s) + e(s), which is a function of n and s. The complexities of the original greedy algorithm and of our SLSA algorithm can then be expressed in terms of g(n, s). The greedy algorithm conducts a three-layer loop to compute the sharing cost of each candidate group in a predefined traversal sequence. For each candidate group G_i, the time cost is g(n_i, s_i), where s_i = lcm(k_j | k_j ∈ G_i, j = 1, ..., n_i). Without loss of generality, n_i ≤ n (the n_i elements of G_i belong to the set {k_1, k_2, ..., k_n}) and hence s_i ≤ s, so the time cost of each candidate group is no more than g(n, s). Therefore, the total time cost of the greedy algorithm is n^3 · g(n, s) in the worst case. In the case of SLSA,


Fig. 5 Instance of SLSA (adjustment stage)


there are two stages to consider. In the partition stage, we select the group with the largest variance as the candidate for further partitioning until there is no more sharing cost reduction, so the time cost of the i-th partition iteration is T^i_var = v(n_1) + v(n_2) + ... + v(n_i), where v(n_i) denotes the cost of the variance computation for group G_i, with complexity v(n_i) = O(n_i). Since n_i ≤ n, the time cost T^i_var is no more than i · v(n). In the worst case, there are n iterations in the partition stage (generating n sharing groups), so the overall time cost of the partition stage is Σ_{i=1}^{n} i · v(n) and its complexity is O(n^3), denoted f(n). In the adjustment stage, two factors affect the overall time cost: the number of adjustment operations and the cost of each adjustment operation. The value of the first factor is 2(n-1) in the worst case (two-direction adjustment when there are n sharing groups), and the value of the second factor is 2 g(n, s) in the worst case (two groups are affected by each adjustment operation). The time consumption of the adjustment stage is therefore 4(n-1) · g(n, s). Hence the overall time cost of SLSA is f(n) + 4(n-1) · g(n, s), and its complexity is O(n^3 + n · g(n, s)).

Given that the complexities of the greedy and SLSA algorithms are T_greedy = O(n^3 · g(n, s)) and T_SLSA = O(n^3 + n · g(n, s)) respectively, we need to show that our algorithm is better than the original greedy algorithm. Consider T_SLSA: if the first term n^3 dominates the other, the complexity is O(n^3), which is less than T_greedy = O(n^3 · g(n, s)); if the second term n · g(n, s) dominates, the complexity is O(n · g(n, s)), which is also less than T_greedy; and if neither term dominates the other, then O(n^3 + n · g(n, s)) = O(C_1 · n^3) = O(C_2 · n · g(n, s)), and both expressions are less than T_greedy = O(n^3 · g(n, s)). Based on the above analysis, SLSA has better complexity than the original greedy algorithm.

4 Implementation over MapReduce

In this section, we implement OATS over the MapReduce paradigm. The query processing scheme can be split into two phases. In the first phase, OATS does not execute the incoming queries immediately but collects them and analyses the potential sharing opportunities to determine the two-level sharing plan. The set of incoming queries is then combined into a grouped dynamic MapReduce job, which is initialized and submitted to the JobTracker for parallel processing. In the second phase, all the grouped map tasks are processed according to the sharing of sampling and statistical computation, and the reduce tasks collect the map output to estimate the query accuracy.

Algorithm 4 illustrates the general idea of the initialization phase of query processing in OATS. Given a set of incoming queries Q, OATS collects a subset of queries, denoted Q_thd, of size threshold (line 3), and analyses the potential sharing opportunities among the queries of Q_thd, denoted share (line 4). To support the query accuracy estimation, we also record the estimation parameters in the global variable estimate (line 5). Finally, all the queries of Q_thd are combined into a grouped MapReduce job (with share and estimate as its configuration), which is initialized and submitted


Algorithm 4 QueryProcess
1: input: QuerySet Q
2: while Q.hasNextQuery() do
3:   QuerySet Qthd = collect(Q, threshold)
4:   ShareInfo share = new ShareInfo(Qthd)
5:   EstimateInfo estimate = new EstimateInfo(confidence, errRate, aggType)
6:   JobConf OLAjob = new JobConf(OLA.class)
7:   // initialize the OLAjob
8:   DefaultStringifier.store(OLAjob, share, "share")
9:   DefaultStringifier.store(OLAjob, estimate, "estimate")
10:  JobClient.runJob(OLAjob)
11: end while

Fig. 6 Example of share and estimate

for processing (lines 6-10). The structure of share is shown in Fig. 6, where each tuple has three elements: (1) the involved block B_i of each grouped map task, (2) the sharing-sampling information for the queries that satisfy cases 1 and 2 in Table 1, and (3) the sharing statistical computation information for the queries that satisfy case 3 in Table 1. Note that the second element is composed of pairs (M^i_x : k^i_x), indicating the required sample size of M^i_x in each estimation round, while the third element records the optimal sharing groups SG generated by SLSA. On the other hand, estimate records the estimation parameters of each involved M^i_x for B_i; e.g., the pair M^i_x : (0.95, 0.01, AVG) indicates that the query Q_x has aggregate type AVG, a confidence of 0.95 and an error rate of 0.01.
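Read literally, the ShareInfo and EstimateInfo objects of Algorithm 4 could be modelled along the following lines (a plausible sketch; the field names are ours, not taken from the OATS source, and real classes would also implement Writable so that DefaultStringifier can serialize them):

```java
import java.util.List;
import java.util.Map;

// Plausible shapes for the ShareInfo/EstimateInfo objects of Algorithm 4,
// read off the description of Fig. 6.
class ShareInfo {
    Map<String, Map<String, Integer>> samplingShare;   // B_i -> (M_x^i -> k_x^i), cases 1-2
    Map<String, List<List<String>>> computationShare;  // B_i -> sharing groups SG, case 3
}

class EstimateInfo {
    // per map task M_x^i: confidence, error rate and aggregate type, e.g. (0.95, 0.01, AVG)
    Map<String, Double> confidence;
    Map<String, Double> errorRate;
    Map<String, String> aggType;
}
```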

We are now ready to discuss the implementation of the execution phase in OATS. Note that the detailed MapReduce implementation differs between the single-relation and multi-relation cases, so we consider these two cases separately in the next two subsections.

    4.1 Queries for single relation

The Map logic for the single-relation case is shown in Algorithm 5. Given a block B_i, the corresponding grouped map task M^i_com loads share from OLAjob, which can be implemented by overriding the configure function of MapReduceBase, extracts the sharing-sampling information for B_i into the Hashtable variable Ms, whose keys are the M^i_x that only have the sampling sharing opportunity and whose values are the corresponding k^i_x, and records the sharing computation information in the variable SG, each item of which is a


Algorithm 5 Map Logic for Single Relation
1: input: LongWritable key, Text value
2: output: OutputCollector<IntWritable, Text> outputKV
3: ShareInfo share = DefaultStringifier.load(OLAjob, "share", ShareInfo.class)
4: EstimateInfo estimate = DefaultStringifier.load(OLAjob, "estimate", EstimateInfo.class)
5: Hashtable Ms = getFirstShare(share)
6: Vector SG = getSecondShare(share)
7: PSB.init(pointers)
8: for <key, value> in input do
9:   PSB.update(key, value, pointers)
10:  // sharing sampling
11:  for M^i_x ∈ Ms do
12:    if PSB.hasEnoughSample(M^i_x) then
13:      samples = getSamples(k^i_x, PSB)   // k^i_x = Ms.get(M^i_x)
14:      PSB.update(pointers)
15:      aggType = getAggType(M^i_x, estimate)
16:      stats^i_x = getStats(samples, aggType, stats^i_x)
17:      outputKV.collect(Q_x, <stats^i_x, this.taskID>)
18:    end if
19:  end for
20:  // sharing statistical computation
21:  for sg ∈ SG do
22:    if PSB.hasEnoughSample(sg) then
23:      samples = getSamples(k_max, PSB)   // k_max = MAX{k^i_x ∈ sg}
24:      PSB.update(pointers)
25:      aggType = getAggType(sg, estimate)
26:      partialStats.update(samples, aggType)
27:      for M^i_x ∈ sg do
28:        if partialStats.hasEnoughStats(M^i_x) then
29:          stats^i_x = getFinalStats(partialStats, stats^i_x)
30:          outputKV.collect(Q_x, <stats^i_x, this.taskID>)
31:        end if
32:      end for
33:      partialStats.update()
34:    end if
35:  end for
36: end for

certain sharing group sg (lines 3, 5-6). Afterwards, M^i_com adopts the shared sampling strategy of Sect. 3.1 to build a public sample buffer (PSB) (line 7); each M^i_x ∈ Ms and each sg ∈ SG then collects samples from the PSB rather than from disk, reducing the redundant disk I/O cost. The difference is that each sg draws samples to calculate partial statistical results for reuse among {M^i_x ∈ sg}, while each M^i_x ∈ Ms draws samples for its own statistical computation.

The grouped map task M^i_com adopts an aggressive method to update the PSB for each incoming input <key, value> (adding samples to the PSB if it is not full and swapping out the expired samples, line 9), and conducts the two-level sharing strategy for each sharing entity. In the case of sharing sampling, M^i_x ∈ Ms collects k^i_x samples once the PSB has enough samples for the given M^i_x (lines 12-13). Given the drawn samples for M^i_x, the statistical result, denoted stats^i_x (the details of the statistics calculation are discussed in Sect. 5), is computed in an incremental way by merging the previously calculated stats^i_x


(line 16). Finally, the output is collected with Q_x as the key and the combination of stats^i_x and the task ID of M^i_com as the value (line 17). In the case of sharing statistical computation, sg ∈ SG collects k_max = MAX{k^i_x ∈ sg} samples once the PSB has enough samples for the given sg (lines 22-23). Given the drawn samples for sg, the buffer of partial statistics, denoted partialStats, is updated by inserting the new partial statistics calculated from the k_max samples (line 26). Each M^i_x ∈ sg then computes its own final statistics stats^i_x by merging the previous stats^i_x and the corresponding partial statistics in partialStats, provided enough partial statistics exist in partialStats (lines 27-29). Finally, the output collection is the same as in the sharing sampling case (line 30). After the final statistics calculation, partialStats needs to be updated by removing the expired partial statistics⁵ (line 33).
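Loading share and estimate inside the mapper, as described above, could be done roughly as follows with Hadoop's old (mapred) API; this is a sketch mirroring lines 3-4 of Algorithm 5, not the actual OATS mapper, and the ShareInfo/EstimateInfo placeholders would need to be Writable for DefaultStringifier to work:

```java
import java.io.IOException;

import org.apache.hadoop.io.DefaultStringifier;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Skeleton of a grouped map task that loads the sharing metadata in configure().
public class GroupedOlaMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {

    private ShareInfo share;
    private EstimateInfo estimate;

    @Override
    public void configure(JobConf job) {
        try {
            share = DefaultStringifier.load(job, "share", ShareInfo.class);
            estimate = DefaultStringifier.load(job, "estimate", EstimateInfo.class);
        } catch (IOException e) {
            throw new RuntimeException("cannot load sharing metadata", e);
        }
    }

    @Override
    public void map(LongWritable key, Text value,
                    OutputCollector<IntWritable, Text> output, Reporter reporter)
            throws IOException {
        // Feed the record into the public sample buffer and apply the two-level
        // sharing logic of Algorithm 5 (omitted in this sketch).
    }
}

// Placeholders standing in for Writable versions of the ShareInfo/EstimateInfo
// classes sketched after Fig. 6.
class ShareInfo {}
class EstimateInfo {}
```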

Algorithm 6 Reduce Logic for Single Relation
1: input: IntWritable key, Iterator<Text> values
2: output: OutputCollector<IntWritable, Text> outputKV
3: EstimateInfo estimate = DefaultStringifier.load(OLAjob, "estimate", EstimateInfo.class)
4: Hashtable container = new Hashtable()
5: while values.hasNext() do
6:   update(container, values.next())
7:   if container.isAvailable() then
8:     unistats = uniStats(container)
9:     result = estimate(unistats)
10:    if isAccuracy(result, estimate, unistats) then
11:      outputKV.collect(key, <result, accept>)
12:    else
13:      outputKV.collect(key, result)
14:    end if
15:  end if
16: end while

The reduce phase is responsible for collecting statistics from the map tasks and calculating the estimate of the final query result. In our implementation, each reduce task is responsible for a single Q_x (it collects the map output whose key is Q_x), and Algorithm 6 illustrates the logic of the reduce phase. Given a certain Q_x, the reduce task R_x loads the accuracy parameter estimate from the OLAjob and initializes a variable container to classify the input values from all involved M^i_com of Q_x (lines 3-4). For each M^i_com, there is a corresponding item in container that records the statistics of Q_x, with key M^i_com.taskID and value {stats^i_1, ..., stats^i_n}, where stats^i_j indicates the statistics calculated in the j-th iteration of map task M^i_com. Afterwards, R_x checks container to make sure it is available for accuracy estimation, which means container has received statistics from all M^i_com. R_x then calculates the unified statistics, denoted unistats, by merging all the {stats^i_j} together as Σ_i stats^i_j (lines 7-8). The approximate result can then be computed by the function estimate based on unistats (line 9). Afterwards, R_x invokes isAccuracy to conduct the accuracy estimation based on the parameters in estimate, such as the aggregate

5 The expired partial statistics are defined as the statistics that have been reused by all M^i_x ∈ sg.


type, confidence and error rate. Based on the result of isAccuracy, R_x emits the output pair (Q_x, <result, accept>) if the result satisfies the user's expectation; otherwise the output is constructed as (Q_x, result) (lines 10-14).
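The incremental statistics flowing through getStats/uniStats can be carried by a simple (sum, count, sum of squares) triple; the following sketch illustrates the general idea (our own illustration, not the OATS code), showing how per-round and per-task partial statistics merge associatively, which is exactly the reuse exploited by the second-level sharing:

```java
// Minimal running statistics for incremental OLA estimation: partial
// (sum, count, sum of squares) triples merge associatively, so per-round and
// per-task statistics can be combined in any order.
public class RunningStats {
    double sum;    // sum of expression values of qualifying samples
    long count;    // number of qualifying samples
    double sumSq;  // sum of squared values, kept for variance estimation

    void addSample(double value, boolean satisfiesPredicate) {
        if (!satisfiesPredicate) return;
        sum += value;
        count++;
        sumSq += value * value;
    }

    // Merge another partial statistic into this one (e.g. a shared group's
    // partial aggregate, or a later round of the same task).
    void merge(RunningStats other) {
        sum += other.sum;
        count += other.count;
        sumSq += other.sumSq;
    }

    double avg() { return count == 0 ? 0.0 : sum / count; }

    public static void main(String[] args) {
        RunningStats round1 = new RunningStats();
        round1.addSample(10, true);
        round1.addSample(4, false);   // filtered out by the predicate
        RunningStats round2 = new RunningStats();
        round2.addSample(20, true);
        round1.merge(round2);
        System.out.println(round1.avg()); // 15.0
    }
}
```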

    4.2 Queries for multi-relations

For queries involving multiple relations, two approaches can be applied. We can precompute the join result and store it as a regular file in HDFS, then adopt the method used for single-relation queries to estimate the final result; alternatively, we can draw samples from each relation and calculate the estimates on the fly in one MapReduce job. Note that the first approach trades storage for accuracy, while the second sacrifices some accuracy (while still satisfying the accuracy requirement) for flexibility. In OATS, we implement both approaches and configure the second one as the default.

The map logic of the default approach is similar to Algorithm 5, with two differences: (1) this map logic does not calculate statistics over the random samples in each mapper, since the statistics depend on samples from both relations; its task is only to collect unbiased samples and assign them to the corresponding reducer for further processing, so there is only the first-level sharing of sampling, without the second-level sharing; and (2) this map logic redesigns the structure of the output key/value pair as (Q_x, rTag, samples) to satisfy the requirements of shared sampling, where rTag indicates the source relation of the drawn samples. Note that output with the same Q_x, regardless of the value of rTag, is delivered to the same reducer, and we modify the default GroupingComparator class to make sure that input from different relations falls into the same group, which is convenient for the statistics calculation in the reduce logic. Algorithm 7 shows the reduce logic for multi-relation queries.

Several parameters are defined in Algorithm 7: the static Vectors rPre and sPre hold the samples already received from relations R and S respectively, while rCur and sCur denote the incoming samples to be processed. Firstly, each reducer classifies the incoming samples into rCur and sCur (lines 5-11). Then the partial statistics statsPP, statsPC, statsCP and statsCC are computed (lines 12-18), where statsPP indicates the statistics calculated from rPre and sPre, and the other partial statistics are defined analogously. In order to calculate the approximate estimate, we also combine these partial statistics into a final statistic, denoted stats, which is used for the approximate result calculation and accuracy estimation (lines 19-23). Finally, each reducer emits the output pair (Q_x, <result, accept>) if the result satisfies the user's expectation; otherwise the output is constructed as (Q_x, result) (lines 24-29).

    5 Statistical estimation

The goal of OLA is to provide efficient, accurate estimates and their confidence intervals, which are updated regularly. Let ε and c be the running error bound and confidence respectively; ε and c give a probabilistic estimate for the approximate result v̄, which means


Algorithm 7 Reduce Logic for Multi-Relations
1: input: LongPair key, Iterator<Text> values
2: output: OutputCollector<IntWritable, Text> outputKV
3: Vector rCur, sCur, rPre, sPre  // rPre, sPre are static variables
4: EstimateInfo estimate = DefaultStringifier.load(OLAjob, "estimate", EstimateInfo.class)
5: while values.hasNext() do
6:   if values.next().getTag() == R then
7:     rCur.add(values.next())
8:   else
9:     sCur.add(values.next())
10:  end if
11: end while
12: if rPre.equals(null) && sPre.equals(null) then
13:   statsPP = getStats(rPre, sPre)
14: else
15:   statsPC = getStats(rPre, sCur)
16:   statsCP = getStats(rCur, sPre)
17:   statsCC = getStats(rCur, sCur)
18: end if
19: stats = merge(statsPP, statsPC, statsCP, statsCC)
20: statsPP = stats
21: rPre = rPre.addAll(rCur)
22: sPre = sPre.addAll(sCur)
23: result = estimate(stats)
24: if isAccuracy(result, estimate, stats) then
25:   M^i_x = key.getFirst()
26:   outputKV.collect(Q_x, <result, accept>)
27: else
28:   outputKV.collect(Q_x, result)
29: end if

the exact result v lies in the interval [v̄ − ε, v̄ + ε] with probability c. We say the approximate result achieves the user's expectation if ε ≤ v̄ · e (e is the predefined error rate). Note that the confidence c and error rate e are parameters of estimate in Algorithm 4, and v̄ is the approximate result calculated by the function estimate in Algorithms 6 and 7.

In this section, we take SUM, COUNT and AVG as examples to show how estimates and confidence intervals can be obtained in OATS for a single-relation query. Consider a typical single-relation query Q_x such as:

SELECT op(expression) FROM R WHERE predicate

Given the block set B, S denotes the sample set collected from the relevant block set {P_x ∩ B}, where P_x is the predicate of Q_x, and each B_i ∈ {P_x ∩ B} provides |S_i| samples, with Σ_{i=1}^{|P_x ∩ B|} |S_i| = |S|. Each M^i_x calculates its stats according to the aggregate type in estimate, including the sum and count of S_i, that is, sum(S_i) = Σ_{s_j ∈ S'_i} s_j and count(S_i) = |S'_i|, where S'_i ⊆ S_i denotes the subset of samples that satisfy the query predicate. Note that the values of sum(S_i) and count(S_i) are calculated directly from the collected samples in the case of sharing sampling; otherwise, they are calculated from the partial statistics in the case of sharing statistical computation. The reduce task can then calculate the value of Σ_{s_j ∈ S} exp_p(s_j) as follows, which is one of the parameters in unistats.

    1 3

  • 8/10/2019 OATS online aggregation with two-level sharing.pdf

    24/39

    490 Distrib Parallel Databases (2014) 32:467505

$$\sum_{s_j \in S} \mathit{exp}_p(s_j) = \begin{cases} \sum_{i=1}^{|P_x B|} \mathrm{sum}(S'_i), & op = \mathrm{SUM} \\ \sum_{i=1}^{|P_x B|} \mathrm{count}(S'_i), & op = \mathrm{COUNT} \end{cases} \qquad (5)$$

Then the estimated aggregate result can be calculated in the reduce task as follows (processed by the function estimate in Algorithm 6).

$$\tilde{v}_{s|c} = \frac{T}{|S|} \sum_{s_j \in S} \mathit{exp}_p(s_j), \qquad \tilde{v}_a = \frac{\tilde{v}_s}{\tilde{v}_c} \qquad (6)$$

where the variable $T = \sum_{B_i \in \{P_x B\}} |B_i|$ indicates the total number of tuples in the relevant block set of Q_x, and exp_p(s_i) equals s_i for SUM and 1 for COUNT if s_i satisfies the predicate, and 0 otherwise.

Besides the estimated aggregate results, the reduce task also needs the corresponding variances of these results to calculate the error bound for accuracy estimation. To simplify the computation of the variance, we apply the computational formula of variance:

$$\sigma^2(X) = E(X^2) - E(X)^2$$
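As a small worked example (ours, not from the paper), for the samples {2, 4, 6}:

$$E(X^2) = \frac{2^2 + 4^2 + 6^2}{3} = \frac{56}{3}, \qquad E(X)^2 = 4^2 = 16, \qquad \sigma^2(X) = \frac{56}{3} - 16 = \frac{8}{3}$$

so a task only needs to maintain the running sum and quadratic sum of its samples for the reducer to recover the variance.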

In our implementation, M_i^x also maintains the quadratic sum of samples $X_i^2 = \sum_{s_j \in S_i} s_j^2$ in the stats, which is used to calculate the variances. The reduce task collects all X_i^2 and calculates $\sum_{i=1}^{|P_x B|} X_i^2$, which is another parameter in unistats. In the function isAccuracy, the input parameters result and unistats are first used to calculate the variance according to the aggregate type in the estimate, as follows.

$$\sigma_a^2 = \frac{\sum_{i=1}^{|P_x B|} X_i^2}{\sum_{i=1}^{|P_x B|} \mathrm{count}(S_i)} - (\tilde{v}_a)^2 \qquad (7)$$

$$\sigma_s^2 = \frac{\sum_{i=1}^{|P_x B|} X_i^2}{|S|} \, T^2 - (\tilde{v}_s)^2 \qquad (8)$$

$$\sigma_c^2 = \frac{\sum_{i=1}^{|P_x B|} \mathrm{count}(S_i)}{|S|} \, T^2 - (\tilde{v}_c)^2 \qquad (9)$$

Afterwards, the function isAccuracy needs to compute the error bound ε based on the variance above and the confidence c in the estimate. Based on the Central Limit Theorem, $\sqrt{|S|}\,(v - \tilde{v})/\sigma$ is distributed approximately as a standardized normal distribution when |S| is large enough, where σ²/|S| is the variance of ṽ. Given a predefined confidence c, ε can be computed by the following formula:

$$P\{|v - \tilde{v}| \le \varepsilon\} \approx 2\Phi\!\left(\frac{\varepsilon \sqrt{|S|}}{\sigma}\right) - 1 \qquad (10)$$


where P{|v − ṽ| ≤ ε} is the predefined confidence c. If ε/ṽ ≤ e, then we can say the result is acceptable to the user.
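To illustrate how these pieces fit together, the Java sketch below computes the single-relation SUM estimate of Eq. (6), its variance via Eq. (8) and the error bound implied by Eq. (10) from per-block statistics, and performs the acceptance check. It is an illustrative reconstruction under stated assumptions, not the OATS implementation: the class, field and method names are ours, the quadratic sum is taken over the predicate-satisfying sample values, and the normal quantile 1.96 is hard-coded for the default confidence c = 95 %.

```java
// Illustrative sketch of the single-relation SUM estimation, Eqs. (6), (8) and (10).
// Class, field and method names are assumptions, not the OATS API.
public class SingleRelationEstimator {

    /** Per-block statistics collected by a map task M_i^x. */
    public static class BlockStats {
        double sum;      // sum(S'_i): sum of the samples that satisfy the predicate
        long   sampled;  // |S_i|: number of samples drawn from block B_i
        double quadSum;  // X_i^2: quadratic sum of the predicate-satisfying sample values
    }

    /**
     * Returns true if the SUM estimate is accurate enough, i.e. epsilon / v_s <= e.
     *
     * @param blocks per-block statistics over the relevant block set {P_x B}
     * @param T      total number of tuples in the relevant block set
     * @param e      user-defined error rate
     */
    public static boolean sumIsAccurate(Iterable<BlockStats> blocks, long T, double e) {
        double sum = 0, quadSum = 0;
        long sampled = 0;
        for (BlockStats b : blocks) {                // aggregate per-block stats, Eq. (5)
            sum += b.sum;
            sampled += b.sampled;
            quadSum += b.quadSum;
        }
        double vS = (double) T / sampled * sum;      // SUM estimate, Eq. (6)

        double T2 = (double) T * T;
        double varS = quadSum / sampled * T2 - vS * vS;   // variance of the SUM estimate, Eq. (8)

        // Error bound implied by Eq. (10): epsilon = z * sigma / sqrt(|S|),
        // with z = 1.96 hard-coded for the default confidence c = 95 %.
        double epsilon = 1.96 * Math.sqrt(varS / sampled);

        return epsilon / vS <= e;                    // acceptance check
    }
}
```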

For the case of a multi-relation query such as:

SELECT op(expression) FROM R, S WHERE predicate

Given the block sets B_r and B_s for relations R and S respectively, S_r = rPre (S_s = sPre) indicates the sample set that has been collected from the relevant block set {P_x B_r} ({P_x B_s}). Each reduce task needs to calculate the partial stats such as statsCC, etc. For example, statsCC includes the sum and count of S'_cc, that is, $\mathrm{sum}(S'_{cc}) = \sum_{s_i \in S'_{cc}} s_i$ and $\mathrm{count}(S'_{cc}) = |S'_{cc}|$, where S'_cc ⊆ S_cc indicates the sample set that satisfies the query predicate and S_cc = rCur × sCur. Then, the reduce task can calculate the value of $\sum_{s_i \in S_r \times S_s} \mathit{exp}_p(s_i)$ by adding these four partial stats together, which is part of the parameters in stats. The estimated aggregate result can then be calculated by formula (6), but with T and |S| replaced by T_r · T_s and |S_r| · |S_s| respectively, where $T_r = \sum_{B_i \in \{P_x B_r\}} |B_i|$ and |S_join| = |S_r| · |S_s|.

Moreover, the partial stats such as statsCC of each reducer also maintain the quadratic sum of samples, denoted as $X^2_{cc} = \sum_{s_i \in S_{cc}} s_i^2$, etc. The reducer can then calculate the value of X²_join by adding up the quadratic sums from all partial stats. Therefore, the variance of samples for each relation can be computed by Eqs. (11)–(13), and the variance of the joined samples can be calculated as in Eq. (14).

$$\sigma_a^2(R) = \sigma_a^2(S) = \frac{X^2_{join}}{|S_{join}|} - (\tilde{v}_a)^2 \qquad (11)$$

$$\sigma_s^2(R) = \sigma_s^2(S) = \frac{X^2_{join}}{|S_r| \cdot |S_s|} \,(T_r \cdot T_s)^2 - (\tilde{v}_s)^2 \qquad (12)$$

$$\sigma_c^2(R) = \sigma_c^2(S) = \frac{|S_{join}|}{|S_r| \cdot |S_s|} \,(T_r \cdot T_s)^2 - (\tilde{v}_c)^2 \qquad (13)$$

$$\sigma_{join}^2 = \frac{\sigma^2(R)}{|S_r|} + \frac{\sigma^2(S)}{|S_s|} \qquad (14)$$

where S_join indicates the sample set that satisfies the query predicate, and S_join = S_pp + S_pc + S_cp + S_cc. Finally, each reducer invokes the function isAccuracy to calculate the error bound by Eq. (10) based on the variance above and the input parameter estimate.
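In the same illustrative spirit, the sketch below merges the four partial join statistics and evaluates Eqs. (12) and (14) for a SUM query. The names are again ours, 1.96 stands in for the normal quantile of the default 95 % confidence, and the error bound is taken as z·σ_join by analogy with the single-relation case.

```java
// Illustrative sketch of the multi-relation (join) SUM estimation, Eqs. (12) and (14).
// Names are assumptions; this is not the OATS implementation.
public class JoinEstimator {

    /** One partial statistic (statsPP, statsPC, statsCP or statsCC). */
    public static class JoinStats {
        double sum;      // sum over the predicate-satisfying joined samples
        double quadSum;  // quadratic sum over the predicate-satisfying joined samples (part of X^2_join)
    }

    /**
     * Returns { SUM estimate, error bound } from the merged join statistics.
     *
     * @param parts statsPP, statsPC, statsCP, statsCC
     * @param Tr    total tuples in the relevant blocks of R
     * @param Ts    total tuples in the relevant blocks of S
     * @param sR    sample size |S_r|
     * @param sS    sample size |S_s|
     */
    public static double[] estimateSum(JoinStats[] parts, long Tr, long Ts, long sR, long sS) {
        double sum = 0, quadSum = 0;
        for (JoinStats p : parts) {                  // add the four partial stats together
            sum += p.sum;
            quadSum += p.quadSum;
        }
        double samplePairs = (double) sR * sS;       // |S_r| * |S_s|
        double totalPairs  = (double) Tr * Ts;       // T_r * T_s

        double vS = totalPairs / samplePairs * sum;                     // SUM estimate, Eq. (6) adapted
        double varS = quadSum / samplePairs * totalPairs * totalPairs
                      - vS * vS;                                        // sigma_s^2(R) = sigma_s^2(S), Eq. (12)
        double varJoin = varS / sR + varS / sS;                         // sigma_join^2, Eq. (14)
        double epsilon = 1.96 * Math.sqrt(varJoin);                     // error bound at c = 95 % (our assumption)

        return new double[] { vS, epsilon };
    }
}
```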

    6 Experimental evaluation

We have used Hadoop-hop-0.2, which is currently based on Hadoop-0.19.2, to implement our OATS, and we run experiments on a virtual cluster with 32 nodes, comprising 30 data nodes, 1 name node and 1 secondary name node. Each node has 2 VCPUs, 10 GB of main memory and 100 GB of disk. These virtual nodes are provided by SEUCloud, whose architecture is depicted in Fig. 7.

In this section, we conduct a set of experiments to evaluate the effectiveness and study the performance characteristics of our OATS under different degrees of skew in the input data and under different data sizes.


    Fig. 7 The hardware architecture of SEUCloud

Table 2  Properties of datasets

Scale   Size (GB)   Rows (million)
10      7.8         66
20      15.8        132
40      37.6        264
100     78.7        660

A modified TPC-H toolkit [34] is employed to generate skewed datasets with a Zipf distribution, determined by the Zipf parameter z (z varies over 0, 1.2 and 1.6, where 0 represents the uniform distribution), derived from the LINEITEM and ORDERS tables as our test data. The scale factor is varied over 10, 20, 40 and 100. Table 2 summarizes the properties of the generated datasets. In our experiments, we generate queries based on the Single-relation Template (T1) and the Multi-relations Template (T2) as follows.

T1: SELECT sum(Ci) | count(Ci) | avg(Ci) FROM LINEITEM
    WHERE [l_discount > x and l_discount < x+y] &
          [l_quantity > x and l_quantity < x+y] &
          [l_extendedprice > x and l_extendedprice < x+y]

T2: SELECT sum(Ci) | count(Ci) | avg(Ci) FROM LINEITEM L, ORDERS O
    WHERE L.orderkey = O.orderkey &
          [l_discount > x and l_discount < x+y] &
          [l_quantity > x and l_quantity < x+y] &
          [l_extendedprice > x and l_extendedprice < x+y] &
          [o_totalprice > x and o_totalprice < x+y]

Note that in template 1, Ci = l_quantity | l_discount | l_extendedprice, while in template 2, Ci = l_quantity | l_discount | l_extendedprice | o_totalprice. The parameter x is a random value that belongs to the value range of Ci, and the parameter y varies from 10 to 90 % of the value range of Ci for each x.
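A workload generator along the following lines can produce queries from template T1. This is our own sketch, not the toolkit used in the experiments, and the per-column value range used below is a placeholder rather than the actual TPC-H ranges.

```java
import java.util.Random;

// Our own sketch of a T1 query generator; the numeric range is a placeholder.
public class T1QueryGenerator {
    private static final String[] OPS  = { "sum", "count", "avg" };
    private static final String[] COLS = { "l_quantity", "l_discount", "l_extendedprice" };
    private static final Random RAND   = new Random();

    /** Builds one random query following template T1. */
    public static String nextQuery() {
        String op = OPS[RAND.nextInt(OPS.length)];
        String ci = COLS[RAND.nextInt(COLS.length)];
        StringBuilder sql = new StringBuilder("SELECT " + op + "(" + ci + ") FROM LINEITEM WHERE ");
        for (int i = 0; i < COLS.length; i++) {
            double range = 100.0;                                 // placeholder value range of the column
            double x = RAND.nextDouble() * range;                 // random lower bound x
            double y = (0.1 + 0.8 * RAND.nextDouble()) * range;   // y: 10-90 % of the value range
            if (i > 0) sql.append(" AND ");
            sql.append(COLS[i]).append(" > ").append(x)
               .append(" AND ").append(COLS[i]).append(" < ").append(x + y);
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        System.out.println(nextQuery());
    }
}
```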

The aggregate type (AVG, COUNT or SUM) for each query is randomly selected during the query generation process. The default error rate e and confidence c used for the accuracy estimation are 0.01 and 95 %, and the default partition size (ps) and query collection threshold (qct) are 4 and 20, respectively.


We set the default parameters of the second-level sharing in our SLSA as w_in = w_out = 0.5.

For comparison purposes, we implemented 7 OLA methods with different partition manners and sharing strategies: (1) hadoop-Complete is a straightforward method without online aggregation that returns the exact results; (2) SZ-unshare is the OLA-based method for the default size-aware data partition manner; (3) CT-unshare is another online-aggregation-based method for the modified content-aware data partition manner adopted in this paper; (4) CT-share_level1 is the extended version of CT-unshare that deploys only the first-level sharing strategy; (5) CT-share_level2:all adds the general second-level sharing strategy, which shares statistical computation among all involved queries, on top of CT-share_level1; (6) CT-share_level2:greedy extends CT-share_level1 with a greedy second-level sharing strategy; and (7) CT-share_level2:SLSA extends CT-share_level1 with our proposed SLSA. Note that SZ-unshare and CT-unshare are the methods deployed in [24] to study the effect of the content-aware partition, and we use them as the baseline implementations without sharing in our experimental study, while the remaining 4 OLA-based methods are all special cases of our OATS. To facilitate the result analysis, we implemented the first-level sharing for sampling and the second-level sharing for computation as optional components, which can be added to CT-unshare for different comparison purposes to study the effect of our two-level sharing strategy. In general, we execute 100 queries 5 times in each experiment to remove any side effects, while for the last two methods (6–7) we set the experiment repeat factor to 100 to evaluate the execution performance of the second-level sharing algorithms, since such sharing algorithms are significantly more sensitive to different input data.

We acknowledge that the experimental comparison in this paper considers only one optimized OLA method without sharing, called CT-unshare. We also understand that using other methods (such as [2,21,24]) may produce different results. However, through our preliminary study of the latest work, called COLA [2], we note that COLA and CT-unshare obtain similar performance in the same order of magnitude under the same configuration (same workload, same test data and same hardware platform). Therefore, the comparison between OATS and CT-unshare gives us the necessary understanding of the differences between sharing and not sharing.6 In the future, we will extend our comparison to include other methods.

    6.1 Effect of data size and distribution

In this experiment, we vary the data size from 10 G to 100 G and evaluate the performance of the seven different methods with the uniform data distribution. We set the parameters to their default values, that is, c = 95 %, e = 0.01, ps = 4, qct = 20 and w_in = w_out = 0.5.

6 The reasons we do not show the comparison result with COLA in this paper are that: (1) several implementation details of our COLA prototype may differ from the original COLA due to the limited information given in [2], which may affect the comparison result to some extent, and (2) the Hadoop platform of our prototype is deployed on virtual clusters rather than the physical clusters mentioned in [2], which results in further differences between our prototype and the original COLA.


Figure 8 presents the results for template 1 and template 2, respectively. We note that both results show a similar trend, but the joined one takes much more time to process, since more tasks need to be initialized in the map phase and the join operation in the reduce phase takes extra time. Taking the result of template 1 as an example, Fig. 8 indicates that the performance of hadoop-Complete decreases as the data size grows, whereas the OLA-based methods are scalable with regard to the data size, and their time cost is much lower than that of hadoop-Complete. This is because the OLA-based methods only need a small sample set to calculate the approximate results rather than the whole data set. Among the OLA-based methods, CT-unshare performs better than SZ-unshare, as shown in Table 3. This is expected, as the sample quality is improved by our content-aware partition method, leading to early acceptable results. However, the improvement resulting from the content-aware partition decreases as the data size increases, since the enlarged block size increases the overall sampling overhead of CT-unshare. We also note that the performance of CT-share_level1 is better than that of the methods without sharing, because the redundant disk I/O is significantly decreased by our first-level sharing strategy. On the other hand, the methods with the second-level sharing strategy outperform CT-share_level1, since the intermediate statistics are reused among all the shared queries, reducing the overall statistical computation time. Table 3 also indicates that our CT-share_level2:SLSA and CT-share_level2:greedy both perform better than CT-share_level2:all. This is expected, as the cost of partial statistics computation is significantly reduced by the well-designed sharing groups. Besides the performance study, we also need to illustrate the stability of our sharing strategy. As shown in Table 4, all the methods can be processed stably, since the data size is not the major factor that affects the estimation performance of OLA.

Moreover, in order to study the effect of data distribution in detail, we compare the performance on the 100 G dataset for different data distributions in Fig. 9. Taking the result of T1 as an example, we note that the content-aware methods, besides the hadoop-Complete method, are also scalable with regard to the data distribution, while the performance of SZ-unshare is significantly decreased for more skewed data distributions. This is because the content-aware partition can improve performance for skewed data distributions by pruning off the unnecessary data, increasing the sampling efficiency, as shown in Table 5. Looking inside the CT-share based methods, the same conclusion as in the above experiment (effect of data size) can be drawn from Table 5: the CT-share_level2 based methods perform better than CT-share_level1, and our SLSA has performance similar to the greedy one, both performing better than CT-share_level2:all. On the other hand, Table 6 shows the standard deviation of the 5 repeated tests. Note that the results of SZ-unshare become more stable with more skewed data distributions. This is because a representative sample set is hardly obtained for a skewed data distribution, resulting in a longer time to achieve an acceptable result for the majority of the queries. For the content-aware partition methods, in contrast, the standard deviation becomes larger for more skewed data distributions. This is expected, as a more skewed data distribution results in more and more outliers in the blocks, which may affect the sample quality for certain queries, causing some performance fluctuations.


Fig. 8  Effect of data size


Table 3  Study of performance improvement for different data size (uniform, T1)

Data size   SZ-unshare vs.   CT-unshare vs.   level1 vs.       level1 vs.        level1 vs.
            CT-unshare (%)   level1 (%)       level2:all (%)   level2:SLSA (%)   level2:greedy (%)
10          72.1             51.5             24.2             53.3              54.5
20          69.7             54.5             23.6             51.9              53.1
40          67.1             61.5             22.6             45.9              43.9
100         55.9             64.3             23.1             48.8              49.8

Table 4  Standard deviation of five repeated tests for different data size (uniform, T1)

Data size   SZ-unshare   CT-unshare   CT-share_level1   CT-share_level2:all   CT-share_level2:SLSA   CT-share_level2:greedy
10          302.73       69.51        34.26             28.52                 16.75                  15.11
20          287.64       72.35        33.16             27.56                 16.63                  14.87
40          238.72       71.62        34.86             26.93                 16.47                  14.73
100         293.78       70.86        36.17             28.24                 16.05                  14.01

    6.2 Effect of query collection threshold

In this test, we study the effect of the query collection threshold (qct), which is an important parameter that controls the degree of our two-level sharing strategy. We vary qct from 20 to 100, and Fig. 10 shows the results of all CT-share based methods for the uniform data distribution and the 100 G dataset with the default configuration, that is, c = 95 %, e = 0.01, ps = 4 and w_in = w_out = 0.5.

As shown in the figure, both the greedy and the SLSA methods perform better than CT-share_level2:all, regardless of the value of qct. The details of the performance improvement are shown in Table 7. Note that the greedy and SLSA methods have a similar trend: the performance improves at first until qct reaches a specific value, and then degrades as qct increases further. This can be explained by the fact that our two-level sharing strategy affects the overall performance in four ways, two positive and two negative: (1) it reduces the redundant disk I/O cost and the statistical computation cost, (2) it reduces the initialization time of the query tasks due to the multi-query combination, (3) it extends the execution time of each combined query task, and (4) it adds the execution time of the greedy algorithm and of our SLSA for generating the optimal sharing groups. For smaller qct, the performance improvement caused by (1) and (2) is larger than the degradation resulting from (3) and (4), since there are few involved M_i^x in each M_i^com and the execution time of the greedy and SLSA algorithms is relatively small (a lower percentage of the overall time cost). This situation is reversed for larger qct.


Fig. 9  Effect of data distribution


Table 5  Study of performance improvement for different data distribution (100G, T1)

Data distribution   SZ-unshare vs.   CT-unshare vs.   level1 vs.       level1 vs.        level1 vs.
                    CT-unshare (%)   level1 (%)       level2:all (%)   level2:SLSA (%)   level2:greedy (%)
Uniform             55.9             64.3             23.1             48.8              49.8
zipf-1.2            66.1             65.4             21.9             46.1              49.3
zipf-1.6            81.5             62.5             22.5             48.9              51.7

Table 6  Standard deviation of five repeated tests for different data distribution (100G, T1)

Data distribution   SZ-unshare   CT-unshare   CT-share_level1   CT-share_level2:all   CT-share_level2:SLSA   CT-share_level2:greedy
Uniform             293.78       70.86        36.17             28.24                 16.05                  14.01
zipf-1.2            203.51       92.41        57.17             44.64                 25.37                  22.14
zipf-1.6            125.68       108.25       65.26             61.95                 43.21                  39.71

For larger qct, the number of involved M_i^x in each M_i^com increases (leading to a longer time to process the combined map task), and the execution time of the second-level sharing algorithm grows with qct. However, the performance degradation of CT-share_level2:SLSA is less than that of CT-share_level2:greedy, since our SLSA has an acceptable time complexity for larger qct. Moreover, we add the standard deviation of the data to the figure to illustrate the stability of our sharing strategy. Note that CT-share_level2:SLSA is more stable than CT-share_level2:greedy, since the time cost of our SLSA algorithm is always less than that of the original greedy algorithm, reducing the side effects generated by a larger qct. In our experimental environment, qct = 40 is optimal, but this value may change for other experimental configurations. Consider a situation in which an overloaded set of queries needs to be processed: the initialization time then accounts for a higher percentage of the overall time consumption, so a larger qct should be selected, and CT-share_level2:SLSA is a better candidate for deployment than the other methods, since it not only achieves acceptable sharing groups but also has an acceptable execution complexity compared to CT-share_level2:greedy.

    6.3 Performance of the second-level sharing algorithm

In this experiment, we set all the parameters to their default values (c = 95 %, e = 0.01, ps = 4 and w_in = w_out = 0.5) to evaluate the algorithm performance of the second-level sharing strategies, namely the greedy and SLSA algorithms, for varied qct. Given a set of grouped map tasks {M_i^x ∈ M_i^com}, the set of required sample sizes K = {k_i^x} is the input to both the greedy and SLSA algorithms, and the output is the set of generated sharing groups.


Fig. 11  Execution time of SLSA


Fig. 12  Effect of w_in and w_out

Note that the adjustment stage of our SLSA is composed of two migrations, namely the right→left and left→right cases. In order to evaluate the effect of each case of migration and to find the optimal pair w_in : w_out for each migration, we construct two lightweight variants of CT-share_level2:SLSA, each of which contains only one case of migration (right→left or left→right).

Figure 12 shows the results of these two lightweight methods for different parameter configurations. For the right→left case, a larger w_in indicates that more consideration is given to the action that migrates a smaller k_i^x, which has fewer common edges to the other items of the sharing group sg_i, to the neighbor sg_j. The overall cost reduction of sg_i then has a relatively high probability of being larger than the cost increment of sg_j, since the number of final aggregation operations of all the k_i^x in sg_i except the migrated one is decreased, while the number of final aggregation operations of sg_j is increased only for the migrated k_i^x. Therefore, the performance for a larger w_in is better than for smaller values. On the other hand, the left→right case shows a similar trend. Note that a larger w_out indicates that more consideration is given to the action that migrates a larger k_i^x with fewer common edges to the other items of sg_i to the neighbor sg_j, in which there are more common edges to k_i^x, again making the overall cost reduction of sg_i likely to exceed the cost increment of sg_j. Therefore, the performance for a larger w_out is better than for smaller values. Based on the above analysis, we conclude that a relatively larger w_in is needed for the right→left case (such as 0.8:0.2 in our experiment), while a relatively smaller w_in is required for the left→right case (such as 0.3:0.7).

    7 Related work

In many real applications, such as OLAP, aggregation queries are used widely and frequently. However, calculating exact results for these queries incurs long response times and is not always required. To answer queries within a short processing time with acceptable results, approximate query processing (AQP) has been proposed.


Online aggregation [15] is a commonly-used AQP technique that provides a time-accuracy tradeoff for aggregation queries. An approximate answer with a running confidence interval is produced during the early stages of query processing and gradually refined until it satisfies the user's expectation. The running confidence interval indicates to the user the estimated proximity of each running aggregation query to its final result. In [16], Haas illustrates how central limit theorems, simple bounding arguments and the delta method can be used to derive formulas for both large-sample and deterministic confidence intervals. To support the join operation for OLA, Haas and Hellerstein introduced a novel join method called the ripple join in [17]. But the convergence of ripple joins can be slow when memory overflows. To handle this problem, the hash ripple join algorithm was proposed in [18], which combines parallelism with sampling to speed up convergence and also maintains good performance in the presence of memory overflow. However, the works in [15–18] all focus on single-query processing rather than multiple-query optimization. Therefore, Wu et al. proposed a new OLA system called COSMOS to process multiple aggregation queries efficiently [12]. COSMOS organizes queries into a dissemination graph to exploit the dependencies across queries, so that partial answers can be reused by the linked queries. In addition, Wang et al. present a partition-based OLA called POAS to overcome the side effect of skewed data distribution and further improve query performance.

In fact, these centralized OLA methods and systems cannot easily be extended to a distributed setting, so well-designed distributed OLA systems have been proposed along with the development of P2P and cloud computing [2,20–26]. Wu et al. extend OLA to a P2P context where sites are maintained in a DHT network [20], which maintains synopses that can be reused by different queries. However, this P2P-based distributed OLA transfers all the samples among processing nodes to guarantee load balancing, generating a lot of network traffic. In addition, [22,23] demonstrate a modified version of the Hadoop MapReduce framework that supports OLA, but they only implement a lightweight version, which returns the query progress without any precision estimation. And [21] proposed a new OLA system that supports MapReduce jobs based on the open-source project Hyracks [4], discussing a Bayesian framework for producing estimates and confidence intervals for OLA. Although this method allows the user to see how close the result is to the real final result, the estimation method is complex and hard to implement in the MapReduce framework, and the additional estimation module adds significant accidental complexity, restricting the overall performance. In the work of [2], the authors formulate a statistical foundation that supports block-level sampling for single-table OLA and develop a two-phase stratified sampling method to support multi-relation OLA. They also addressed the skew issue that affects COLA performance by proposing a keep-order algorithm to ensure that intermediate results delivered to downstream operators are consumed in a statistically random fashion [26]. In the work of [24], the authors focus on the optimization of running OLA over a MapReduce-based cloud system, proposing a fair-allocation strategy to guarantee storage and computation load balancing. Note that none of the works mentioned above consider the overhead generated by the statistical estimation, so [25] designs the first framework for parallel OLA that incurs no overhead on top of the actual execution.


However, there is a serious limitation that restricts the performance of running OLA over MapReduce, namely the sharing issue among multiple OLA queries, and none of the above papers focuses on this problem, which we have addressed in this paper to improve the performance of OLA over MapReduce. In order to unlock the potential of running OLA in the cloud, we study how to design and implement a parallel processing model that serves as a target for running OLA in the MapReduce framework, with a two-level sharing strategy to optimize the overall performance.

    8 Conclusions

Supporting OLA in the cloud makes it possible to save users' time and computation cost by accepting approximate early results. However, two major limitations restrict OLA performance: the large redundant I/O cost and the repeated statistical computation cost, which stem from the original cloud paradigm's lack of support for the necessary sharing operations. OATS provides a promising framework to deal with this performance issue. The major contributions of OATS include: a system architecture with two-level sharing strategies tailored for the MapReduce framework to support large-scale concurrent query processing, a first-level sharing for sampling with a customized sample management mechanism, a more scalable SLSA algorithm in the second-level sharing for statistical computation, and the implementation of our OATS. To the best of our knowledge, OATS is the first work on studying the sharing issues of OLA in the cloud. Our experiments demonstrate that OATS can produce acceptable approximate results within a time period two orders of magnitude shorter than that needed for exact results, and is on average 80 % more efficient than the OLA-based solution without sharing operations for different data sizes and data distributions. Besides the sharing issue of online aggregation in the cloud, there exist other problems that need to be handled to further improve OLA performance, such as the estimation failure of OLA caused by the difficulty of obtaining an unbiased sample set, and we will take this problem as our future work.

Acknowledgments  This work is supported by National Key Basic Research Program of China under Grants No. 2010CB328104, National Natural Science Foundation of China under Grants No. 61320106007, No. 61070161, No. 61003257, No. 61202449, No. 61272054, China National Key Technology R&D Program under Grants No. 2010BAI88B03 and No. 2011BAK21B02, China Specialized Research Fund for the Doctoral Program of Higher Education under Grants No. 20110092130002, China National Science and Technology Major Project under Grants No. 2010ZX01044-001-001, Jiangsu Provincial Natural Science Foundation of China under Grants No. BK2008030, Jiangsu research prospective joint research project under Grants No. BY2012202, No. BY2013073-01, Jiangsu Provincial Key Laboratory of Network and Information Security under Grants No. BM2003201, Key Laboratory of Computer Network and Information Integration of Ministry of Education of China under Grants No. 93K-9, and Shanghai Key Laboratory of Scalable Computing and Systems (2010DS680095).

    References

1. Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B., Babu, S.: Starfish: a self-tuning system for big data analytics. In: Biennial Conference on Innovative Data Systems Research (CIDR), pp. 261–272, 2011


2. Shi, Y., Meng, X., Wang, F., Gan, Y.: You can stop early with COLA: online processing of aggregate queries in the cloud. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM), pp. 1223–1232, 2012
3. Hadoop: The Apache Software Foundation. http://hadoop.apache.org. Accessed Nov 2012
4. Borkar, V., Carey, M., Grover, R., Onose, N., Vernica, R.: Hyracks: a flexible and extensible foundation for data-intensive computing. In: Proceedings of the IEEE International Conference on Data Engineering (ICDE), pp. 1151–1162, 2011
5. Kolodziej, J., Khan, S.U.: Data scheduling in data grids and data centers: a short taxonomy of problems and intelligent resolution techniques. Trans. Comput. Collect. Intell. X. 7777, 103–119 (2013)
6. Zaharia, M., Borthakur, D., Sarma, J.S., et al.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In: Proceedings of the 5th European Conference on Computer Systems (EuroSys), pp. 265–278, 2010
7. Jin, H., Yang, X., Sun, X., et al.: ADAPT: availability-aware MapReduce data placement for non-dedicated distributed computing. In: Proceedings of the 32nd International Conference on Distributed Computing Systems (ICDCS), pp. 516–525, 2012
8. Grover, R., Carey, M.J.: Extending map-reduce for efficient predicate-based sampling. In: Proceedings of the 28th International Conference on Data Engineering (ICDE), pp. 486–497, 2012
9. Moseley, B., Dasgupta, A., Kumar, R., et al.: On scheduling in map-reduce and flow-shops. In: Proceedings of the 23rd ACM Symposium on Par