
SARAH – Statistical Analysis for Resource Allocation in Hadoop

Bruce Martin
Cloudera, Inc.
Palo Alto, California, USA
[email protected]

Abstract—Improving the performance of big data applications requires understanding the size and distribution of the input and intermediate data sets. Obtaining this understanding and then translating it into resource settings is challenging. SARAH provides a set of tools that analyze input and intermediate data sets and recommend configuration settings and performance optimizations. Statistics generated by SARAH are persistently stored, incrementally updated and operate across the several processing frameworks available in Apache Hadoop. In this paper we present the SARAH tool set, describe several Hadoop use cases for utilizing statistics and illustrate the effectiveness of utilizing statistics for balancing reduce workload on Map-Reduce jobs on web server log file data.

Keywords—big data; statistical analysis; Hadoop; Map-Reduce; performance tuning

I. INTRODUCTION

The performance of big data applications is typically a function of the size and distribution of input, intermediate and output data sets. The Apache Hadoop platform [2] offers developers, system administrators, data scientists and analysts¹ dozens of configuration parameters to specify the cluster resources needed by a big data application and to influence how the big data application executes. While taking advantage of such flexibility can result in a finely tuned system, the challenge of effectively setting those parameters is great. It requires understanding the size and distribution of input and intermediate data sets and the algorithms of the big data application. It also requires understanding the operation and configuration of Hadoop processing frameworks, the available resources of a given cluster and the overall workload of the cluster.

Consider the problem in the Map-Reduce [8] framework of determining the number of reducers that an application needs and balancing the load of intermediate data across those reducers [10]. In the Map-Reduce and Hive [3] frameworks, the user sets a property to specify the number of reducers. In the Pig [1] and Spark [11] frameworks, the user specifies an optional parameter to commands that typically run in reducers. To come up with a meaningful value, the user needs to understand the size and distribution of records in the intermediate data sets and, given that understanding, have a way to influence the assignment of records to reducers. At best, the informed user understands the intermediate data and can carefully calculate the number of reducers. At worst, the uninformed user relies on system defaults or makes a random guess.

¹ Throughout the paper we will refer to the developer, administrator, data scientist and analyst as “the user”.

Advanced relational database systems gather and utilize statistics about tables in query optimization [16]. Such systems are closed systems; they offer a single relational model, a single query language and the storage format is defined by the system. Hadoop, on the other hand, supports unlimited storage formats defined by the user, multiple models and multiple processing frameworks with varying degrees of metadata. The Pig, Hive and Impala frameworks utilize some statistics about a job’s data but the statistics generated in one framework are not usable in the others [14], [15], [18].

Computing statistics for big data sets is expensive. It makes little sense to spend more time computing statistics than is saved through the more efficient use of resources that they enable. On the other hand, if the statistics are persistently saved, used across subsequent executions of applications and available in multiple processing frameworks, then this cost can be amortized over time. Furthermore, if updates to analyzed data sets only require an incremental statistical analysis cost, then the cost of generated statistics can be amortized over a long time. We view persistence, cross-framework access and incremental update as requirements for any big data environment that gathers and utilizes big data statistics.

SARAH (Statistical Analysis for Resource Allocation in Hadoop) is a test bed we have built to experiment with the generation of statistics of big data and the use of those statistics at runtime. SARAH-generated statistics are used to enhance performance and help the user in setting resource properties. Statistics generated by SARAH are persistently saved, incrementally updatable and can be utilized across processing frameworks.

Concretely, SARAH is a set of tools run by users on their data sets for their big data applications. SARAH contains a tool set for each supported processing framework to generate, store and update statistics. The statistics generated by the tool in one framework can be used at runtime in other frameworks.

Having systems automatically generate, update and use statistics without user involvement is appealing. SARAH takes a more pragmatic approach, requiring user involvement but in a high-level, productive fashion. User input is needed to determine when statistics should be gathered and incrementally updated and to map those statistics across platforms.

Appeared in: 3rd IEEE Conference on Big Data Science and Engineering (BSDE14), September 2014

II. HADOOP USE CASES FOR STATISTICAL ANALYSIS

The Hadoop Map-Reduce, Pig, Hive, Impala and Spark frameworks have many configurations that allow or require users to specify resources. Our goal is that SARAH-generated statistics support these and other use cases.

A. Smart Input Split

Hadoop processing frameworks divide the input data set into subsets of records for parallel processing. The default behavior is to split each file into 64 MB blocks and assign each block to a map task. This approach is often adequate because the amount of work each parallel map task does is not sensitive to the distribution of the data, as it is with reducers. Each map task is given 64 MB of data.

There is overhead in creating and initializing a task, most notably the overhead to create and initialize a Java Virtual Machine. If a task has too little work, this overhead dominates and a larger block size is appropriate. Statistical analysis of the cost of applying the map function to the input data can estimate an effective value for the block size.

An input data set that consists of many small files, that is, files that are smaller than 64 MB, results in too many small map tasks because the default behavior in this case is to assign one map task to each file. Statistical analysis of the input data set can recognize this.
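For illustration only (this is not SARAH code), the following sketch shows one common way a user might act on such a finding in a Java Map-Reduce driver: packing many small files into larger combined splits with CombineTextInputFormat. The 256 MB cap, the class name and the paths are arbitrary example values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "small-files-example");
    job.setJarByClass(SmallFilesExample.class);
    job.setMapperClass(Mapper.class);            // identity mapper, for illustration only
    job.setNumReduceTasks(0);                    // map-only job
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Pack many small files into combined splits instead of one map task per file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    // Upper bound on the bytes packed into each combined split (example value).
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}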

B. Appropriate Number of Balanced Reducers

Users can specify the number of reducers to use in a Map-Reduce job, including those executed by the Hive framework. Similarly, users of the Pig and Spark frameworks can set an additional parameter in the commands that are usually executed in parallel reducers. Statistical analysis of the intermediate data can estimate the number of reducers. Such analysis needs to take into account the size of the intermediate data and the overhead for creating and initializing a task.
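For reference, the sketch below shows where that setting lives in a Java Map-Reduce driver; the value 5 is only a placeholder for whatever estimate the user, or a tool such as SARAH, arrives at. In Pig the counterpart is the PARALLEL clause on operators such as GROUP and JOIN.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "reducer-count-example");
    // Equivalent to passing -D mapreduce.job.reduces=5 on the command line.
    job.setNumReduceTasks(5);   // placeholder estimate of the number of reducers
    // ... remaining job setup (mapper, reducer, input and output paths) omitted
    return job;
  }
}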

Estimating the number of reducers using only the size of the intermediate data is insufficient. Intermediate data is susceptible to data skew. Statistical analysis of the distribution of the intermediate data can break the intermediate data into similarly sized partitions.

We describe this use case with SARAH in more detail in section IV.

C. Skewed Joins

Joining two large data sets can result in unbalanced parallel reducers if the joined data is skewed [7]. Statistical analysis of both data sets can estimate the number of reducers. Furthermore, by analyzing the distribution of the joined data, multiple reducers can be assigned to popular join keys. This approach requires replicating some of the records across reducers. Pig does this kind of analysis for skewed joins [14]; however, the analysis does not persist, cannot be incrementally updated and is not available across processing frameworks.

D. Combiner Benefit

In the Hadoop processing frameworks, a combiner is a function that is applied to subsets of intermediate data. For large intermediate data, a combiner almost always improves performance and lessens network utilization. When a reduce function is not commutative and associative, the reduce function cannot simply be reused as a combiner. Instead, the user must program a separate function. Statistical analysis of the input and intermediate data can advise on the benefits of coding an additional combiner function.

In the Pig framework, combiners are automatically determined by the execution plan. The Pig framework does not apply combiners when the script invokes a user-defined function because it treats the function as a black box. Pig does apply combiners, however, if the user code is declared as “algebraic” and provided as initial, intermediate and final functions [13]. Again, statistical analysis of the input and intermediate data can advise of the benefits of this additional coding.
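As a generic (non-SARAH) illustration of the distinction, a reduce function that is a plain sum is commutative and associative, so the same class can safely be registered as the combiner; a minimal sketch:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class SumWithCombiner {
  // A commutative and associative reduce: summing counts per key.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void configure(Job job) {
    job.setReducerClass(SumReducer.class);
    // Reusing the reducer as the combiner is safe here because summation is
    // commutative and associative; otherwise a separate class is required.
    job.setCombinerClass(SumReducer.class);
  }
}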

E. Task Memory Allocation

Hadoop processing frameworks define several properties that specify task memory requirements. These properties are defined prior to executing the job. The properties include a map task’s heap size, the size of the map task’s in-memory buffer for intermediate data and a reduce task’s heap size. Statistical analysis of input and intermediate data can estimate values for these memory specifications.

While not required by the Hadoop framework, some reducers buffer all of the values associated with a key. Statistical analysis of intermediate data can estimate an upper bound on the amount of memory a reduce function requires.
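To make these properties concrete, the sketch below sets a few of the common memory-related knobs from a Java driver. The values are arbitrary examples rather than recommendations, and the property names are the Hadoop 2 (YARN-era) names.

import org.apache.hadoop.conf.Configuration;

public class TaskMemoryExample {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.memory.mb", 2048);        // container size for each map task
    conf.set("mapreduce.map.java.opts", "-Xmx1638m");    // map task JVM heap
    conf.setInt("mapreduce.task.io.sort.mb", 512);       // in-memory buffer for map output
    conf.setInt("mapreduce.reduce.memory.mb", 4096);     // container size for each reduce task
    conf.set("mapreduce.reduce.java.opts", "-Xmx3276m"); // reduce task JVM heap
    return conf;
  }
}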

F. Balanced Total Order Sort

The Map-Reduce framework sorts partitions by key. It does not, however, sort across all the partitions. The total order partitioner [6] ensures the sorted keys in one partition are less than the sorted keys in the next partition, effectively producing a total sort of the data set. The user provides keys that divide the partitions and the partitioner builds the partitions at runtime. The partitions can be unbalanced since they depend on the keys provided by the user. Statistical analysis of the intermediate data can estimate the distribution of the keys and calculate keys for balanced partitions.
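SARAH aims to derive those dividing keys from its own statistics. For comparison, the sketch below shows the stock Hadoop wiring, in which the partition (interval) file is produced by sampling the input at job-submission time; it assumes the input and the map output both use Text keys, and the path and sampler parameters are example values.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSortExample {
  // Assumes the rest of the job (input format, mapper, output types) is already set
  // and that the input keys match the map output key type (Text here).
  public static void configure(Job job) throws Exception {
    Configuration conf = job.getConfiguration();
    job.setNumReduceTasks(5);                          // one totally ordered partition per reducer
    job.setPartitionerClass(TotalOrderPartitioner.class);

    // The partition (interval) file holds numReducers - 1 dividing keys.
    TotalOrderPartitioner.setPartitionFile(conf, new Path("/tmp/partitions.lst"));

    // Stock Hadoop alternative to a precomputed interval file: sample roughly
    // 10,000 keys at 1% frequency from at most 100 input splits.
    InputSampler.writePartitionFile(
        job, new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 100));
  }
}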

G. Compressed Intermediate Data

Hadoop processing frameworks transmit intermediate data over the network. The user can specify whether this data should be compressed and the compression algorithm that should be used. If the amount of intermediate data is large and the overhead of compressing and decompressing the data is not too great, then compressing the data improves performance. Statistical analysis of intermediate data can estimate whether compression is worthwhile.
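For reference, enabling compression of map output in the Map-Reduce framework amounts to a pair of properties; Snappy is shown only as a common low-overhead choice, and the property names are the Hadoop 2 names.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;

public class IntermediateCompressionExample {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.output.compress", true);  // compress intermediate (map) output
    conf.setClass("mapreduce.map.output.compress.codec",     // codec used for that compression
        SnappyCodec.class, CompressionCodec.class);
    return conf;
  }
}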

H. Parallel Data Transfer

Hadoop processing frameworks transmit intermediate data over the network in parallel. In the Map-Reduce framework, reducers pull sorted intermediate data from multiple mappers and merge the sorted data. A property controls how many streams are received and sorted in parallel. Statistical analysis of the intermediate data can estimate appropriate values for this property.
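In the Map-Reduce framework the property in question is mapreduce.reduce.shuffle.parallelcopies, whose stock default is 5; a one-line sketch of setting it, with the value chosen purely for illustration:

import org.apache.hadoop.conf.Configuration;

public class ShuffleParallelismExample {
  public static Configuration configure() {
    Configuration conf = new Configuration();
    // Number of map outputs each reducer fetches in parallel during the shuffle.
    conf.setInt("mapreduce.reduce.shuffle.parallelcopies", 10);
    return conf;
  }
}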

I. Estimating Cluster Workload

The previous use cases utilize statistical analysis of input, intermediate and output data sets and algorithms for optimizing the performance of a single job. Data and algorithm statistics can also be used across jobs and over time. Job and data statistics are useful in expanding a cluster, that is, in determining additional hardware to deploy. The analysis can also be useful in determining service level agreements and scheduling policies.

III. THE SARAH TEST BED

SARAH is a test bed for generating and using cross-framework, persistent and incremental statistics for Hadoop. Some of the SARAH tools generate statistics; other tools produce artifacts from the generated statistics. Some of the artifacts are used at runtime for better resource utilization, others are used to configure a job, others are used when developing and testing software and still others are used for informational purposes.

SARAH tools are framework specific. For Hadoop’s Map-Reduce framework, SARAH provides a set of generic Map-Reduce jobs to compute statistics and generate artifacts. For the Pig framework, SARAH provides a set of parameterized Pig scripts. For the Hive and Impala frameworks, SARAH provides a set of parameterized HiveQL scripts. For the Spark framework, SARAH provides a Scala API for computing statistics on intermediate RDDs.

Metadata differ between frameworks. In Hive and Impala, data sets are completely described as tables. The table definitions are stored in the Hive metastore. In the Map-Reduce, Pig and Spark frameworks, metadata are embedded in programs and incomplete. Such differences necessitate separate tools for generating statistics. SARAH artifact-generating tools are also framework specific because execution costs differ between frameworks and many of the artifacts themselves are framework specific.

Since generating statistics on big data sets is costly, SARAH saves generated statistics persistently. Furthermore, SARAH tracks changes to input data sets and users can request SARAH to incrementally update the generated statistics. All of the tools represent generated statistics in a common format. Statistics generated by one tool set can be used in another framework. In particular, artifact-generating tools from one framework can use statistics generated in another framework.

A. Input Data Sets

Hadoop processing frameworks typically operate on sets of files, stored in HDFS. Hive and Impala equate tables with HDFS directories and support partitioning of tables as subdirectories. Other frameworks are more flexible, allowing the user to define an input data set as an arbitrary set of files.

SARAH requires users to name and define data sets. Data sets are either a directory in the file system or an explicitly defined set of files. SARAH then bases incremental updates of generated statistics on files added to or removed from the input data set. SARAH does not track changes within a file since HDFS files are immutable.

SARAH calculates and stores basic statistics from input data sets, including the number of records in the data set, the size of the data set in bytes, the minimum, maximum and average record size in bytes and the distribution of record sizes.
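The paper does not show SARAH's implementation; as a rough sketch of the kind of computation involved, the mapper below emits one record-length observation per input record and a single reducer folds the observations into count, total bytes, minimum, maximum and mean. The class names are invented, and routing everything through one key is for clarity rather than scale.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RecordSizeStats {
  // Emits the byte length of each input record under a single synthetic key
  // so that one reducer sees every observation.
  public static class SizeMapper extends Mapper<LongWritable, Text, NullWritable, LongWritable> {
    private final LongWritable size = new LongWritable();
    @Override
    protected void map(LongWritable offset, Text record, Context context)
        throws IOException, InterruptedException {
      size.set(record.getLength());
      context.write(NullWritable.get(), size);
    }
  }

  // Folds the observations into basic statistics of the kind SARAH stores.
  public static class StatsReducer extends Reducer<NullWritable, LongWritable, Text, Text> {
    @Override
    protected void reduce(NullWritable key, Iterable<LongWritable> sizes, Context context)
        throws IOException, InterruptedException {
      long count = 0, total = 0, min = Long.MAX_VALUE, max = Long.MIN_VALUE;
      for (LongWritable s : sizes) {
        long v = s.get();
        count++;
        total += v;
        min = Math.min(min, v);
        max = Math.max(max, v);
      }
      double mean = count == 0 ? 0.0 : (double) total / count;
      context.write(new Text("records"), new Text(Long.toString(count)));
      context.write(new Text("bytes"), new Text(Long.toString(total)));
      context.write(new Text("minRecordBytes"), new Text(Long.toString(min)));
      context.write(new Text("maxRecordBytes"), new Text(Long.toString(max)));
      context.write(new Text("meanRecordBytes"), new Text(Double.toString(mean)));
    }
  }
}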

B. Intermediate Data Sets and Functions

An intermediate data set results from applying a function to an input data set. A function is named and realized in different ways in different frameworks. A given named function can have multiple realizations. By naming different realizations of the function with the same name, the user is indicating that they produce the same intermediate data set, given the same input data set.

In the Map-Reduce framework, a map method in a Java mapper class is an example of a function. It produces an intermediate data set. In Hive and Impala, a simple query that operates on a single record is an example. In Pig, a simple script that operates on a single record is an example. In Spark, a Scala or Java function that is defined on a single record is an example.

Unlike input data sets, intermediate data sets are not necessarily realized as files in HDFS. An intermediate data set in the Map-Reduce framework is initially generated and partitioned at all the mappers and then transmitted over the network to the reducers. It is always a distributed data structure, never being stored in a single place. In Spark, an intermediate data set may only live in the cluster-wide cache. Advanced database systems, Hive [5] and Impala [18] generate so-called column statistics. Such systems are essentially generating statistics on intermediate data sets generated from simple column selection functions.

An intermediate data set can represent a resource-intensive state of a big data application. It represents the result of applying a function to an input data set. Statistics about the intermediate data set are useful in different resource allocation contexts in different frameworks.

For each function applied to an input data set, SARAH calculates and stores basic statistics from the resulting intermediate data sets, including the average execution time of the function, the number of intermediate records, the size of the intermediate data set in bytes, the minimum, maximum and average intermediate record size in bytes and the distribution of intermediate records.

C. Artifacts

Once statistics are generated for input and intermediate data sets, SARAH tools can generate useful artifacts from those statistics. Some of the artifacts are used at runtime for better resource utilization, others are used to configure a job, others are used when developing and testing software and still others are used for informational purposes.


An example of a runtime artifact in the Map-Reduce framework is an interval file that can be used with the Total Order Partitioner [6] to balance or sort the load across reducers.

The number of reducers in a Map-Reduce job, the value for a parallel parameter in a Pig statement and an estimation of the amount of memory that a task needs to process a data set are all examples of configuration artifacts.

SARAH computes random samples of input and intermediate data sets for generating statistics. Users specify a sample percentage between 0 and 100%. The generated samples are saved as artifacts, and users can use them for development, testing and analysis.

Besides random samples, SARAH can create other kinds of samples of input and intermediate data, including samples with outliers and samples of data sets to be joined that reflect the resulting distribution of joined data. Such samples are useful in development, testing and analysis of big data applications. The join samples are useful in understanding and addressing skewed joins.

SARAH generates artifacts that are useful for a user to understand the input and intermediate data sets. SARAH can generate visualizations of data distributions.

Figure 1 visualizes the distribution of applying the months() function to a weblog input data set. SARAH generates the distribution data as well as an R [17] script to create the graphic.

IV. USING SARAH TO BALANCE LOAD ACROSS REDUCERS

We now illustrate the use of SARAH to estimate an appropriate number of reducers in the Map-Reduce framework and to balance the load across those reducers. While this use case is also relevant to Hive, Pig and Spark, we limit the description to generating statistics and artifacts for Hadoop Map-Reduce jobs.

To begin, the user issues the following command:

sarah statistics [sample-%] input-data-set function1 .. functionN

The statistics command generates or updates statistics for an input data set and the n intermediate data sets defined by function1 .. functionN. The user can specify an optional sample percentage.

The first phase of the statistics command computes and saves the n+1 samples in a single map-only job. The mappers randomly select records in the input to generate the input sample. The mappers also apply each function to each record in the input data set and randomly select records in each intermediate data set for each intermediate sample.

The first, sample-generating phase is efficient. It requires only a single, parallel processing pass over the input data. Each record is considered once, in parallel, and the n functions are applied to it.
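The implementation of this phase is not listed in the paper; the sketch below only illustrates what such a single-pass sampler could look like in the Map-Reduce tool set. The Function interface, the configuration key sarah.sample.fraction, the tag keys and the hard-coded placeholder function are all invented for illustration.

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SamplingMapper extends Mapper<LongWritable, Text, Text, Text> {
  // Hypothetical record-level function of the kind SARAH names and tracks.
  public interface Function {
    String name();
    String apply(String record);
  }

  private double fraction;
  private Random random;
  private List<Function> functions;

  @Override
  protected void setup(Context context) {
    fraction = context.getConfiguration().getDouble("sarah.sample.fraction", 0.01);
    random = new Random();
    // In a real tool the functions would be loaded dynamically; hard-coded here.
    functions = Arrays.asList(new Function() {
      public String name() { return "months"; }
      public String apply(String record) { return record.split(" ")[0]; } // placeholder logic
    });
  }

  @Override
  protected void map(LongWritable offset, Text record, Context context)
      throws IOException, InterruptedException {
    // Sample the input data set.
    if (random.nextDouble() < fraction) {
      context.write(new Text("input"), record);
    }
    // Apply every named function once and sample each intermediate data set.
    for (Function f : functions) {
      String intermediate = f.apply(record.toString());
      if (random.nextDouble() < fraction) {
        context.write(new Text(f.name()), new Text(intermediate));
      }
    }
  }
}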

The second phase of the statistics command generates statistics from the samples. For samples that are small enough to fit in memory, SARAH generates all statistics as a single, efficient map-only job. Each mapper processes a single sample. For larger samples, SARAH generates multiple Map-Reduce jobs to compute the statistics for all of the data sets.

To obtain SARAH’s recommendations for the number of reducers and artifacts for balancing the load across reducers, the user issues the following command:

sarah balanced-reducers [split-values] input-data-set function1 .. functionN

For each function, SARAH estimates a number of reducers by simply dividing the estimated number of records in the associated intermediate data set by the ideal partition size; that is, it divides the estimate by the number of records ideally processed by each reducer. The latter is a constant.
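The paper does not state the value of that constant. As a worked example under an assumed ideal partition size of 1,200,000 records per reducer, an intermediate data set of 6 million records would yield ceil(6,000,000 / 1,200,000) = 5 reducers; the sketch below simply captures that arithmetic.

public class ReducerEstimate {
  // Records ideally processed by each reducer; an assumed value for illustration only.
  private static final long IDEAL_RECORDS_PER_REDUCER = 1_200_000L;

  public static int estimateReducers(long estimatedIntermediateRecords) {
    long n = (estimatedIntermediateRecords + IDEAL_RECORDS_PER_REDUCER - 1)
        / IDEAL_RECORDS_PER_REDUCER;                  // ceiling division
    return (int) Math.max(1, n);                      // at least one reducer
  }

  public static void main(String[] args) {
    System.out.println(estimateReducers(6_000_000L)); // prints 5 under the assumed constant
  }
}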

For each function, SARAH also generates an interval file. The interval file is an artifact that serves as input to a get_partition function used by the Map-Reduce framework to assign records to reducers. By default, SARAH generates an interval file that can be provided to the TotalOrderPartitioner in the Map-Reduce framework. This partitioner does not split the set of values that are associated with a key. This limits the ability to balance the load across reducers. If the number of values associated with a single key is greater than the ideal partition size, the partitioning is less than optimal.

If the user sets the optional split-values parameter to true, the intermediate data set is exactly balanced over all of the reducers. For keys whose set of values is split, the interval file also contains the percentage of records that are included in each partition. SARAH provides an extension to the TotalOrderPartitioner that uses the percentages to split the set of values associated with a single key. The extended partitioner violates the rule that all values associated with a single key are provided to a single reducer. If the user wishes to do this, the application must accommodate the non-standard partitioning.
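SARAH's extended partitioner is not published with the paper. The sketch below only illustrates the idea: keys listed in the interval file with split percentages are spread over a range of reducers by hashing each record's value against per-key cumulative shares, while all other keys fall back to hash partitioning. The interval-file contents are stood in for by a hard-coded entry, and every name here is an assumption rather than SARAH's actual code.

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ValueSplittingPartitioner extends Partitioner<Text, Text> implements Configurable {
  // Describes how one hot key's values are divided over consecutive reducers.
  private static class Split {
    final int firstReducer;     // first reducer assigned to this hot key
    final double[] cumulative;  // cumulative share of the key's values per reducer
    Split(int firstReducer, double[] cumulative) {
      this.firstReducer = firstReducer;
      this.cumulative = cumulative;
    }
  }

  private Configuration conf;
  private final Map<String, Split> hotKeys = new HashMap<String, Split>();

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
    // A real implementation would load the interval file here; a single invented
    // entry stands in for it: this hot key gets reducers 3 and 4, split 60%/40%.
    hotKeys.put("hypothetical-hot-key", new Split(3, new double[] {0.6, 1.0}));
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    Split split = hotKeys.get(key.toString());
    if (split == null) {
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;  // default hash behavior
    }
    // Map a hash of the value into [0,1) and pick the reducer whose cumulative
    // share covers it, dividing the key's values in the stated proportions.
    double u = (value.hashCode() & Integer.MAX_VALUE) / (double) Integer.MAX_VALUE;
    for (int i = 0; i < split.cumulative.length; i++) {
      if (u < split.cumulative[i]) {
        return (split.firstReducer + i) % numPartitions;
      }
    }
    return split.firstReducer % numPartitions;
  }
}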

A. Measuring SARAH's Effectiveness in Balancing Reducers

Measuring the value of SARAH requires comparing performance and resource utilization of big data applications that have been configured using SARAH artifacts to those that have been configured manually.

We compare the utilization of reducers on a Map-Reduce job using SARAH artifacts to the same job configured manually. We consider three cases of “manual” configuration for the balanced reducer use case:

• The naïve user understands neither the Map-Reduce job being executed nor the input and intermediate data sets. The naïve user accepts all of the defaults for running the job. Since the default number of reducers is 1, there is no need to partition the intermediate data set.

• The rule of thumb user learned some simple rule for allocating reducers. For our purposes the rule is based on the size of the input data set. Hive and Pig provide this as a default [9], [12]. The job configured by the rule of thumb user runs with the default HashPartitioner that simply assigns records to reducers by hashing the key.

• The educated guess user attempts to understand the intermediate data but the understanding is incomplete. In particular, the user measures the size of the intermediate data set by running the mapper but fails to measure the distribution of the intermediate data set. Again, the job configured by the educated guess user runs with the default HashPartitioner.

We ran a Map-Reduce job that processes an 18 million record weblog data set generated by an Apache Web Server. The job analyzes the weblog data to understand the differences by month of users who access the web server in the evening. The intermediate data consisted of 6 million records.

The naïve user ran the job with a single reducer. The single reducer processed all 6 million records. This approach obviously does not scale. The intermediate data set grows as a function of the input data set, eventually overloading the single reducer.

The rule of thumb user set the number of reducers as a function of the input size, running the job with 12 reducers. Figure 2 illustrates the result of running the job configured by the rule of thumb user. The partitioning of data to reducers was not very good. Three reducers had no records to process and the workload of the remaining reducers was not balanced. Furthermore, the balanced reducers were underutilized, processing fewer than 600,000 records each.

We used SARAH to generate a sample of the intermediate data and used the tool’s recommendation of 5 reducers and the interval file generated by the tool. We did not split the set of values, so an exact partitioning is not possible. Figure 3 illustrates the distribution of records to reducers that SARAH generated. The first four reducers are fairly balanced. The fifth reducer handled a single key with 1.7 million records.


Since we did not choose to break up sets of records, the fifth reducer had more work than the others.

Finally, the educated guess user ran the job with 5 reducers but without considering the distribution of the data. Figure 4 illustrates the distribution of records to reducers as a result of running the job configured by the educated guess user. The number of reducers is appropriate but the skew in the intermediate data is not handled. Notice that in this case the third, rather than the fifth, reducer processed the 1.7 million records associated with the same key.

V. CONCLUSIONS AND FUTURE WORK

The SARAH test bed produces statistics for large input and intermediate data sets. The statistics are persistently saved, incrementally updated and usable across frameworks. From these statistics, SARAH generates artifacts that are useful for allocating resources in the desired framework. We illustrated this with the problem of allocating reducers and balancing the workload across those reducers. The analogous problem exists in the Pig, Hive and Spark frameworks.

We continue to develop the SARAH test bed for the Map-Reduce, Hive, Impala, Pig and Spark frameworks. We continue experimenting with different algorithms and use cases. We continue to expand the SARAH test cases to better experiment with the algorithms.

To date, we have concentrated on using SARAH to estimate resources for a single computation. We have not yet addressed any of the use cases that use collected data set statistics over time to estimate cluster-wide sizing and scheduling.

Finally, in building SARAH we realized the need to define a runtime API and service that makes the statistics and generated artifacts available to Hadoop applications and frameworks.

VI. REFERENCES

[1] A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava, "Building a High-Level Dataflow System on top of Map-Reduce: The Pig Experience," Proc. of the VLDB Endowment, vol. 2, no. 2, 2009.

[2] Apache Hadoop Project. http://hadoop.apache.org

[3] A. Thusoo, J. Sen Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy, "Hive – A Petabyte Scale Data Warehouse Using Hadoop," 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), pp. 996-1005, 2010.

[4] Cloudera Impala Project. http://impala.io/

[5] Column Statistics in Apache Hive. http://blog.cloudera.com/blog/2012/08/column-statistics-in-hive/

[6] D. Miner and A. Shook, MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O'Reilly Media, December 2012.

[7] D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri, "Practical Skew Handling in Parallel Joins," Proceedings of the 18th International Conference on Very Large Data Bases, pp. 27-40, August 23-27, 1992.

[8] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Communications of the ACM, vol. 51, no. 1, 2008.

[9] E. Capriolo, D. Wampler, and J. Rutherglen, Programming Hive, O'Reilly Media, September 2012.

[10] L. Kolb, A. Thor, and E. Rahm, "Load Balancing for MapReduce-based Entity Resolution," Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, pp. 618-629, April 1-5, 2012.

[11] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster Computing with Working Sets," Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, June 22-25, 2010, Boston, MA.

[12] Pig Reducer Estimation. http://pig.apache.org/docs/r0.11.1/perf.html#reducer-estimation

[13] Pig User-Defined Functions. http://pig.apache.org/docs/r0.12.1/udf.html

[14] PigSkewedJoinSpec. https://wiki.apache.org/pig/PigSkewedJoinSpec

[15] Statistics in Hive. https://cwiki.apache.org/confluence/display/Hive/StatsDev

[16] S. Chakkappen, T. Cruanes, B. Dageville, L. Jiang, U. Shaft, H. Su, and M. Zait, "Efficient and Scalable Statistics Gathering for Large Databases in Oracle 11g," Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, June 9-12, 2008, Vancouver, Canada.

[17] The R Project for Statistical Computing. http://www.r-project.org

[18] Tuning Impala for Performance. http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/Impala/Installing-and-Using-Impala/ciiu_performance.html