Optimization Strategies for A/B Testing on HADOOP

Andrii Cherniak, University of Pittsburgh, 135 N Bellefield Ave, Pittsburgh, PA 15260
Huma Zaidi, eBay Inc., 2065 Hamilton Ave, San Jose, CA 95125
Vladimir Zadorozhny, University of Pittsburgh, 135 N Bellefield Ave, Pittsburgh, PA 15260

ABSTRACT

In this work, we present a set of techniques that considerably improve the performance of executing concurrent MapReduce jobs. Our proposed solution relies on proper resource allocation for concurrent Hive jobs based on data dependency, inter-query optimization, and modeling of Hadoop cluster load. To the best of our knowledge, this is the first work on Hive/MapReduce job optimization that takes Hadoop cluster load into consideration.

We perform an experimental study that demonstrates a 233% reduction in execution time for the concurrent versus the sequential execution schema. We report up to 40% additional reduction in execution time for concurrent job execution after resource usage optimization.

The results reported in this paper were obtained in a pilot project to assess the feasibility of migrating A/B testing from a Teradata + SAS analytics infrastructure to Hadoop. This work was performed on an eBay production Hadoop cluster.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Articles from this volume were invited to present their results at The 39th International Conference on Very Large Data Bases, August 26th-30th 2013, Riva del Garda, Trento, Italy. Proceedings of the VLDB Endowment, Vol. 6, No. 11. Copyright 2013 VLDB Endowment 2150-8097/13/09... $10.00.

1. INTRODUCTION

Big Data challenges involve various aspects of large-scale data utilization. Addressing this challenge requires advanced methods and tools to capture, manage, and process the data within a tolerable time interval. The challenge is threefold: it involves data volume increase, accelerated growth rate, and increase in data diversity [23]. The ability to perform efficient big data analysis is crucial for successful operations of large enterprises.

A common way to deal with big data analytics is to set up a pipeline of a high-performance data warehouse (e.g., Teradata [12] or Vertica [13]), an efficient analytics engine (e.g., SAS [9] or SPSS [7]), and an advanced visualization tool (e.g., MicroStrategy [8] or Tableau [11]). However, the cost of such infrastructure may be considerable [14]. Meanwhile, not every data analytics task is time-critical or requires the full functionality of a high-performance data analysis infrastructure. For non-time-critical applications, it often makes sense to use other computational architectures, such as Hadoop/MapReduce. Amazon [1], Google [33], Yahoo [15], Netflix [28], Facebook [35], and Twitter [24] all use the MapReduce computing paradigm for big data analytics and large-scale machine learning.

A/B testing [22] is a mantra at eBay to verify the performance of each new feature introduced on the web site. We may need to run hundreds of concurrent A/B tests, analyzing billions of records in each test.
Since A/B tests are typically scheduled on a weekly basis and are not required to provide results in real time, they are good candidates for migration from the expensive conventional platform to a more affordable architecture, such as Hadoop [4]. This approach avoids the need for continuous extension of an expensive analytics infrastructure. While providing less expensive computational resources, a corporate Hadoop cluster has impressive, but still finite, computational capabilities. The total amount of resources in the cluster is fixed once the cluster setup is complete. Each job submitted to a Hadoop cluster needs to be optimized to use those limited resources effectively. One way is to use as many resources as possible in the expectation of decreasing the execution time of a job. However, hundreds of A/B tests need to run concurrently, and Hadoop resources need to be shared between those jobs. Even a single A/B test may require as many as 10-20 MapReduce jobs to be executed at once. Each of these jobs may need to process terabytes of data, and thus even a single A/B test can introduce substantial cluster load. Some jobs may depend on other jobs. Thus, for optimization purposes, we need to consider all jobs that belong to the same A/B test as a whole, not as independent processes. In addition, each job has to co-exist with other jobs on the cluster and not compete for unnecessary resources.

The results reported in this paper were obtained in a pilot project to assess the feasibility of migrating A/B testing from a Teradata + SAS analytics infrastructure to Hadoop. Preliminary work was conducted at eBay in the fall of 2011. A month-long A/B test experiment execution and cluster resource monitoring was completed in the fall of 2012. All our experiments were executed on the Ares Hadoop cluster at eBay, which in spring 2012 had 1008 nodes, 4000+ CPU cores, 24000 vCPUs, and 18 PB of disk storage [25]. The cluster uses the capacity scheduler. All our analytics jobs were implemented using Apache Hive.

While performing the data analytics migration, we tried to answer two questions:

- how to minimize the execution time of a typical A/B test on Hadoop;
- how to optimize resource usage for each job, so that our A/B test can co-exist with other tasks.

Figure 1: A/B test execution monitoring. Top plot: map slot usage in the entire cluster. Bottom plot: map slot usage by the A/B test job.

Consider an example Hive job, repetitively executed on Hadoop, as shown in Figure 1. We monitored the amount of resources (here, the number of map slots) used by the job together with the total map slot usage in the entire Hadoop cluster. The upper plot shows how many map slots were in use in the entire Hadoop cluster during the experiment. The bottom plot shows how many map slots our sample MapReduce job received during the execution. We observe that when the cluster becomes busy, MapReduce jobs have difficulty obtaining the desired amount of resources. There are three major contributions we provide in this paper:

1. We provide empirical evidence that each MapReduce job execution is impacted by the load on the Hadoop cluster, and this load has to be taken into consideration for job optimization purposes.
2. Based on our observations, we propose a probabilistic extension of the MapReduce cost model.
3. We provide an algorithm for optimization of concurrent Hive/MapReduce jobs using this probabilistic cost model.

Our paper is organized as follows. We start by providing some background on the performance of the Teradata data warehouse infrastructure and Hadoop in Section 2. We continue by presenting the updated MapReduce cost model in Section 3. Finally, we apply this cost model to optimize a real A/B test running on a production Hadoop cluster and report the results in Section 4.
2. BACKGROUND

Dealing with big data analysis requires an understanding of the underlying information infrastructure. This is crucial for proper assessment of the performance of analytical processing. In this section, we start with an overview of Hadoop, a general-purpose framework for data-intensive distributed applications. We also discuss the existing methods for data analytics optimization on Hadoop.

Figure 2: MapReduce execution schema.

Apache Hadoop is a general-purpose framework for distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Hadoop provides the tools for distributed data storage (HDFS: Hadoop distributed file system [29]) and data processing (MapReduce). Each task submitted to a Hadoop cluster is executed in the form of a MapReduce job [16], as shown in Figure 2. The JobTracker [4] is the service within Hadoop that farms out MapReduce tasks to specific nodes in the cluster. The JobTracker submits the work to the chosen TaskTracker nodes. A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker. A TaskTracker is configured with a set of slots (map and reduce), which indicate the number of tasks that it can accept at a time [41]. It is up to the scheduler to distribute those resources between MapReduce jobs.

There are several ways to write MapReduce jobs. The most straightforward one is to write a Java program using the MapReduce API. While this method provides the highest data manipulation flexibility, it is the most difficult and error-prone. Other tools, such as functional languages (Scala [10]), data processing workflows (Cascading [6]), and data analysis languages (Pig [3], Hive [35]), help to cover the underlying details of MapReduce programming, which can speed up development and eliminate typical errors. Apache Hive was chosen as the tool for these experiments because of its similarity to SQL, and to eliminate the need for low-level programming of MapReduce jobs.

2.1 MapReduce data flow

Every MapReduce job takes a set of files on HDFS as its input and, by default, generates another set of files as the output. The Hadoop framework divides the input to a MapReduce job into fixed-size pieces called splits. Hadoop creates one map task for each split [5]. Hadoop takes care of scheduling, monitoring and rescheduling MapReduce jobs. It will try to execute map processes as close as possible [29] to the data nodes which hold the data blocks for the input files. We cannot explicitly control the number of map tasks [5]. The actual number of map tasks will be proportional to the number of HDFS blocks of the input files. It is up to the scheduler to determine how many map and reduce slots are needed for each job.

2.1.1 Capacity scheduler

Our Hadoop cluster was configured to use the capacity scheduler [2]. The whole cluster was divided into a set of queues with their configured map and reduce slot capacities. Hard capacity limits specify the minimum number of slots each queue will provide to the jobs. Soft limits specify how much extra capacity this queue can take from the cluster, provided that the cluster resources are under-utilized. When the cluster gets fully loaded, those extra resources will be reclaimed for the appropriate queues. In Figure 1, we observe the effect of this reclamation: when the total load on the cluster reached its maximum, MapReduce jobs in our queue were not able to obtain as many resources as they had before. A similar effect happens when we submit yet another job to a queue: each job from that queue will receive fewer resources compared to what they had before.
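As an illustration of the data-flow constraints above, the sketch below (ours, not from the paper) estimates how many map tasks Hadoop would create for a given input, assuming the default one-split-per-HDFS-block behavior and a hypothetical 128 MB block size; the reduce side, by contrast, can be requested explicitly.

    # Illustrative sketch (not from the paper): the number of map tasks cannot
    # be set explicitly; it follows from the number of input splits, which by
    # default correspond to HDFS blocks. The 128 MB block size is an example.
    import math

    HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # bytes; cluster-specific

    def expected_map_tasks(input_size_bytes: int) -> int:
        """Roughly one map task per HDFS block of the input files."""
        return max(1, math.ceil(input_size_bytes / HDFS_BLOCK_SIZE))

    # A 2 TB input would be split into roughly 16,000 map tasks:
    print(expected_map_tasks(2 * 1024**4))  # -> 16384
    # The reduce side, in contrast, can be requested explicitly, e.g. by
    # issuing "SET mapred.reduce.tasks=300;" before a Hive query.
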
2.2 Current approaches for Hadoop/MapReduce optimization

We can group existing efforts in MapReduce job optimization into the following categories: scheduler efficiency, Hadoop and MapReduce (MR) system parameter optimization, and MR execution modeling.

2.2.1 Hadoop scheduler optimization

The default Hadoop scheduler is FIFO [41]. However, this may not be the best scheduler for certain tasks. The existing work in scheduling optimization addresses some of the most typical issues with execution optimization of MR jobs. One sample issue is resource sharing between MR jobs (Capacity scheduler [2], Fair Scheduler [45]), so that one job does not suppress the execution of other jobs or occupy too many of the resources. [27] provides a summary of some of the existing schedulers for Hadoop. Here we provide a summary of the existing approaches to improve scheduling for Hadoop and discuss how their optimization is applicable to our task of A/B testing.

The resource-aware scheduler [44] suggests using fine-granularity scheduling through monitoring of resource usage (CPU, disk, or network) by each MR job. Delay scheduling [46] tries to address the issue of data locality for MR job execution. The Coupling Scheduler [31] approach is aimed at reducing the burstiness of MR jobs by gradually launching Reduce tasks as the data from the map part of a MR job becomes available. These schedulers treat each MR job as completely independent of one another, which is not true for our case. In addition, these schedulers say nothing about how to redistribute Hadoop resources between multiple concurrent MR jobs.

[38] introduced the earliest-deadline-first approach for scheduling MR jobs. The proposed approach (SLO-based scheduler) provides sufficient resources to a process so that it can be finished by the specified deadline. The authors report that when the cluster runs out of resources, the scheduler cannot provide any guarantees regarding process execution time. This happens because the SLO-based scheduler is backed by the FIFO scheduler [47]. Thus, first, we still need to perform off-line calculations of the optimal timing for each MR task; [38] says nothing about how to undertake this optimization when there are many interdependent MR jobs, each of which is big enough to use all of the cluster resources available. Second, this approach is based on the FIFO scheduler; it therefore provides the capability to control resources for each process, but this is not available for the capacity scheduler which is running on our Hadoop cluster. And third, this approach does not consider the background Hadoop load caused by other processes, which can be launched at any arbitrary time.

[37] considers optimization of a group of independent concurrent MR jobs, using the FIFO scheduler with no background MR jobs (from other users) running on the cluster. We use the capacity scheduler instead of the FIFO scheduler on our cluster, and we cannot explicitly control resource assignment for each MR job. Typically, A/B testing jobs show strong dependencies between each other, and this dependence impacts the performance.

[42] presented a scheduler which tries to optimize multiple independent MR jobs with given priorities. It functions by providing the requested amount of resources to jobs with higher priority, and redistributes the remaining slack resources to jobs with lower priorities. [42] provides batch optimization for a set of MR jobs where each job is assigned its priority. This is not true for our case: other users submit their jobs whenever they want and can take a significant portion of the available resources. Thus, the approach has to take on-line load into consideration.

2.2.2 Hadoop/MapReduce system parameter optimization

There have been several approaches to model Hadoop performance with different levels of granularity. For instance, [40] considers a detailed simulation of the Hadoop cluster functionality. It requires the definition of a large number of system parameters, many of which may not be available to a data analyst. Similarly, [19] is based on quite detailed knowledge of the cluster setup.
The approaches of [21] and [20] relax the need for up-front knowledge of the cluster system parameters and provide a visual approach to investigate bottlenecks in system performance and to tune the parameters. While these approaches address very important issues of MR job and Hadoop cluster parameter tuning, they do not address a more subtle issue: concurrent job optimization for a cluster which is shared between many users executing all sorts of MapReduce jobs. These approaches do not consider how to adjust Hadoop/MR parameters so that many MR jobs can effectively co-exist.

2.2.3 MapReduce cost model

While MR job profiling [20] can help to optimize a single job, it is much more practical to have a simplified generative model which establishes the relationship between the MR job execution time, the size of the input dataset, and the amount of resources which a MR job receives. Thus, we can calibrate this model once by executing a few sample MR jobs, measure their execution time, and use the obtained model for a quick approximate prediction of a MR job completion time.

[37], [36], [39], [43], [19], and [40] attempt to derive an analytical cost model to approximate MR execution timing. Each of these approaches specifies an elementary cost model for the data processing steps in MR and then combines the elementary cost estimates into an overall performance assessment. It is interesting to mention that different approaches differ even with this simplified model. For instance, [39] and [37] do not account for sorting-associated costs and the overhead required to launch map and reduce jobs, as [36] does. These differences are associated with the fact that the MR framework can launch in-memory or external sorting [41], and thus cost models obtained for different sizes of the dataset will differ.

The Hadoop/MR framework has a substantial number of tuning parameters. Tuning some of those parameters may have a dramatic influence on job performance, while tuning others may have a less significant effect. Moreover, some of the tuning may have a non-deterministic impact if we optimize the usage of a shared resource. For instance, assume that after the job profiling from [21] we decided to increase the sort buffer size for a MR job, and let us do the same for every MR job. If, after this buffer tuning, we launch too many MR jobs at once, some nodes may run out of RAM and will start using disk space for swapping to provide the requested buffer size. Thus, from time to time, instead of speeding up, we will slow down concurrent job execution. This non-deterministic collective behavior has to be reflected in the cost model.

We based our cost model on the approaches proposed in [36] and [43] to reflect the deterministic characteristics of MR execution. We extend those models by adding a probabilistic component of MR execution. In Section 3, we elaborate on the MapReduce cost model and describe how we use it to estimate our Hadoop cluster performance.

2.2.4 Applicability limits of the existing solutions towards large-scale analytics tasks

While the Hadoop community has made great strides towards Hadoop/MR optimization, those results are not exactly plug-and-play for our A/B testing problem. Our production cluster has the capacity scheduler running and we cannot change that. Approaches assuming any other scheduler cannot be directly applied here. We cannot explicitly control the amount of resources for each MR job, because the capacity scheduler controls resource sharing dynamically. We can only explicitly set the number of reduce tasks for a MR job. Our results were obtained for Hive using the speculative execution model [41]; thus Hadoop can launch more reduce tasks than we specified. Production analytics jobs need to be executed on a real cluster which is shared between many people who submit their jobs at arbitrary times. Thus, any optimization strategy disregarding this external random factor is not directly applicable.
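To make the calibration idea of Section 2.2.3 concrete, here is a minimal sketch (ours, not the authors' code) of fitting a simplified cost model to a few measured sample jobs with ordinary least squares. The functional form anticipates the cost terms of Section 3; the sample measurements and coefficient values are made up.

    # Illustrative sketch (not from the paper): calibrating a simplified
    # MapReduce cost model from a handful of measured sample jobs.
    import numpy as np

    # Each measurement: (input size M, map output size kM, reducers R,
    #                    observed completion time T in seconds) -- hypothetical.
    samples = [
        (2000.0, 500.0, 100, 1400.0),
        (2000.0, 500.0, 300, 900.0),
        (4000.0, 1000.0, 300, 1700.0),
        (8000.0, 2000.0, 500, 3100.0),
    ]

    def features(M, kM, R):
        # beta_0 + beta_1*M + beta_2*(kM/R) + beta_3*(kM/R)*log(kM/R)
        x = kM / R
        return [1.0, M, x, x * np.log(x)]

    X = np.array([features(M, kM, R) for M, kM, R, _ in samples])
    y = np.array([T for *_, T in samples])
    betas, *_ = np.linalg.lstsq(X, y, rcond=None)

    def predict(M, kM, R):
        """Quick approximate completion-time prediction from the fitted model."""
        return float(np.dot(features(M, kM, R), betas))

    print(predict(4000.0, 1000.0, 500))
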
3. UPDATED MAP-REDUCE COST MODEL

3.1 Concurrent MapReduce optimization

Let us assume that we submit MapReduce jobs to a Hadoop cluster with a queue Q, which has limits: M - the maximum number of map slots, and R - the maximum number of reduce slots. At first, let us have two independent MapReduce jobs (colored in red and blue). We will look at different scenarios for executing those jobs. If we submit them sequentially, as shown in Figure 3a, then the cluster resources will remain idle for much of the time. When the map part of the red job is over, map slots will remain idle while Hadoop is processing the reduce part of the job: from 3a:t11 to 3a:t1 the map slots are idle.

The most obvious solution to improve resource utilization is to monitor the progress of running jobs and launch new jobs when the cluster is idle. Figure 3b demonstrates this principle. At time 3b:t11, when the map part of the red job is complete, the blue job gets launched. Thus, from 3b:t11 to 3b:t1 the Hadoop cluster has both its map and reduce slots utilized. This schedule will reduce the total execution time: 3b:t2 < 3a:t2.

For practical purposes, we can replace the nominal number of map slots m in the cost model (Equation 1) with the allocation a job actually receives:

M/m \rightarrow M/\bar{m}    (2)

Figure 4 shows how many map slots were assigned to the same MR job over time when it was recursively executed. During a MR job execution, resource usage is not constant and depends on how many resources are available. Thus, we should replace the parameter m with its effective value, shown in Equation 3:

M/m \rightarrow M \cdot \frac{1}{N} \sum_{m_i \in m^+} \frac{1}{m_i} = M \cdot I_m, \quad \|m^+\| = N    (3)

where m^+ is the set of measurements, taken during a single MR job, in which map slot usage by this MR job was > 0. Similar reasoning goes for R/r.

Consider Figure 5, which shows a part of the experiment on reduce slot usage. In this experiment, we sequentially executed the same Hive job, asking for different numbers of reduce tasks to be executed: [50, 100, 200, 300, 500, 700, 900].

Figure 5: Reduce slot usage as a function of Hadoop cluster load and speculative execution.

The red line in Figure 5 shows the number of reducers we would expect to be executed for the job. The black line shows the actual number of reducers which were executed. We see that the number of reducers is not constant: it changes during the reduce part. One of the reasons is that the cluster may be busy during the job execution, and some of the reducers will be reclaimed back to the pool. We observe examples when we aimed to use 900 reducers but received only about 100. It also happens that Hadoop launches more reduce tasks than we asked for. This happens because Hadoop detects that some of the reducers made less progress than the majority, and launches copies of the slow-running reducers.
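A minimal sketch (ours, not the paper's implementation) of how the effective allocation factor I_m of Equation 3 could be computed from slot-usage samples collected while a job runs; the sampled values are hypothetical, and the capped variant anticipates Equation 4 below.

    # Illustrative sketch: effective inverse allocation from polled slot counts.
    def effective_inverse_allocation(samples):
        """I = (1/N) * sum(1/s) over the N samples with s > 0 (Equation 3)."""
        positive = [s for s in samples if s > 0]
        if not positive:
            return float("inf")
        return sum(1.0 / s for s in positive) / len(positive)

    def effective_inverse_allocation_capped(samples, requested):
        """Reduce-side variant: speculative reducers above the requested count
        are not credited (cf. Equation 4 below)."""
        positive = [min(s, requested) for s in samples if s > 0]
        return sum(1.0 / s for s in positive) / len(positive)

    # Map-slot samples for one job, taken at a fixed polling interval (made up):
    map_slots = [0, 120, 480, 510, 500, 350, 90, 0]
    I_m = effective_inverse_allocation(map_slots)
    # M * I_m replaces M/m in the cost model; the more the allocation
    # fluctuates during the job, the larger I_m becomes.
    print(I_m)
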

Based on this reasoning, we transform R/r into Equation 4. Here the denominator is the area under the reduce slot usage curve. To eliminate the counting of extra reducers (from speculative execution), we limit the maximum number of reduce slots to R. For example, if we requested 300 slots but at some point got 350 slots, we would count only 300.

R/r \rightarrow R \cdot \frac{1}{N} \sum_{r_i \in r^+} \frac{1}{\min(r_i, R)} = R \cdot I_r, \quad \|r^+\| = N    (4)

where r^+ is the set of measurements, taken during a single MR job, in which reduce slot usage by this MR job was > 0. Combining Equations 1, 2, and 4, we obtain Equation 5:

T = \beta_0 + \beta_1 M I_m + R I_r \left[ \beta_2 \frac{kM}{R} + \beta_3 \frac{kM}{R} \log\frac{kM}{R} \right] + \beta_4 M + \beta_5 R    (5)

In Equation 5, \beta_0 is only a constant; \beta_1 M/m is the time required to read the MR job input; kM is the size of the output of the map part of the MR job, with k >= 0; \beta_2 kM/R is the time needed to copy the data by each reducer; \beta_3 (kM/R) \log(kM/R) is the time required to merge-sort the data in each reducer; \beta_4 M + \beta_5 R is the cost associated with the overhead of launching map and reduce tasks, as suggested in [36]. However, from the dynamics of map and reduce tasks shown in Figure 4 and Figure 5, it is not clear which number should be put into that calculation. We will assume that this extra cost is relatively small compared to the timing caused by the actual data processing, and will omit this overhead part, as, for instance, in [38]. We obtain Equation 6:

T = \beta_0 + \beta_1 M I_m + \beta_2 k M I_r + \beta_3 k M I_r \log\frac{kM}{R}    (6)

3.3 Probabilistic resource allocation

Job completion time in Equation 6 depends on how many resources a MR job will receive. We conducted 150 iterations of the same cycle of MR jobs; each cycle consists of 7 sequential MR jobs, asking for [50, 100, 200, 300, 500, 700, 900] reducers, totaling 1050 job executions over a 9-day duration.

Resource allocation for the MR jobs is shown in Figure 6 for map slot allocation and Figure 7 for reduce slot allocation.

Figure 6: CDF plot for map slot allocation to a MR job as a function of requested reducers.

Figure 7: CDF plot for reduce slot allocation to a MR job as a function of requested reducers.

From Figure 6, we observe that, in probability, each MR job received the same number of map slots. This happens because we cannot explicitly control map slot assignment and must rely on the scheduler to obtain resources. The situation is completely different with the reduce slots: we can explicitly ask for a specific amount of resources. However, even if we ask for more slots, it does not mean that we will receive them. If we ask for fewer resources, most likely the scheduler will be able to provide those to a MR job. In Figure 7, when we ask for 50 reduce slots, in 90% of cases we will receive those 50 slots. However, when we ask for 900 reducers, in 90% of cases we will receive fewer than 550 reducers. From Equation 6, the volatility of the MR job completion time will increase when we ask for more reduce slots.

Figure 8: MapReduce job completion time as a function of requested reduce slots (average time, upper and lower bounds of the 95% interval).

Figure 8 provides another perspective on the volatility of the MR execution. We have plotted the average completion time of a MR job as a function of the number of requested reduce slots, together with the 95% interval of the observed measurements. We observe that for jobs requesting 300 reducers or more, the average completion time remains almost constant; however, the 95% upper bound of the execution time keeps increasing. This time increase is connected to the dynamic nature of Hadoop cluster load: some of the nodes which execute our MR job's reduce tasks may arbitrarily receive extra load (other users submit their MR jobs). Speculative execution [41] was introduced to Hadoop to mitigate this problem; however, it does not solve the issue completely. The higher the number of reducers requested, the higher the chances are that at least one reducer will get stuck on a busy node. To reflect this effect, we propose to alter the MapReduce cost model of Equation 6 into Equation 7:

T = \beta_0 + \beta_1 M I_m + \left[ \beta_2 k M I_r + \beta_3 k M I_r \log\frac{kM}{R} \right] (1 + D_R(R))    (7)

where D_R(R) is a probabilistic extra delay, associated with the chance that a reducer gets stuck on a busy node. For each R, D_R(R) is a set containing pairs (delay, p): a possible extra delay together with the probability of observing this extra delay. At first, we assume that D_R(R) = 0 and obtain the coefficients for Equation 7. Then we add the [70..95] interval of all measurements and learn D_R(R).
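The following sketch (ours, not the authors') shows one way Equations 6 and 7 could be evaluated in practice, with D_R(R) approximated by the relative overruns of past runs that requested the same number of reducers; the percentile band mirrors the [70..95] interval mentioned above, and all names and numbers are illustrative.

    # Illustrative sketch: evaluating Equations 6/7 with calibrated betas and
    # measured effective factors I_m, I_r. All numbers are hypothetical.
    import math
    from collections import defaultdict

    def t_eq7(betas, M, k, I_m, I_r, R, d_R=0.0):
        """Equation 7; with d_R = 0 this reduces to Equation 6."""
        b0, b1, b2, b3 = betas
        kM = k * M
        reduce_part = b2 * kM * I_r + b3 * kM * I_r * math.log(kM / R)
        return b0 + b1 * M * I_m + reduce_part * (1.0 + d_R)

    def estimate_d_R(history, lo_pct=0.70, hi_pct=0.95):
        """history: iterable of (R, predicted_T, observed_T) from past runs.
        Returns, per requested R, a (low, high) band of extra relative delay."""
        overruns = defaultdict(list)
        for R, predicted, observed in history:
            overruns[R].append(max(0.0, observed / predicted - 1.0))
        band = {}
        for R, values in overruns.items():
            values.sort()
            last = len(values) - 1
            band[R] = (values[int(lo_pct * last)], values[int(hi_pct * last)])
        return band

    betas = (60.0, 0.004, 0.002, 0.0005)  # hypothetical calibration
    print(t_eq7(betas, M=2.0e6, k=0.3, I_m=1 / 450.0, I_r=1 / 250.0, R=300))
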

3.4 Functional dependencies for resource allocation

Figure 9 shows how map resources were assigned to a local queue as a function of the total cluster map slot usage. We observe that the distribution of the measurements does not change significantly until the total cluster load reaches approximately 3 * 10^4 map slots. Below that threshold, queue load and total cluster load seem to be uncorrelated, i.e., p(Q|HC) = p(Q), where p(Q) is the probability of observing a certain queue load and p(HC) is the probability of observing a certain total Hadoop cluster load. When the cluster load becomes higher than 3 * 10^4 in Figure 9, we can clearly see that the total cluster load influences the number of slots which a queue will get: Q is no longer independent of HC, and p(Q|HC) != p(Q).

Figure 9: Map slot usage in a queue as a function of total cluster load.

Obviously, the amount of resources which a MR job obtains depends on how many resources in a queue are used by other jobs, and on whether map slots can be borrowed from the rest of the cluster. If we denote by r_J the amount of resources for a job J, then r_J depends on HC and Q. For given values of hc_i (a particular Hadoop load) and q_i (a particular queue resource usage), p(r_J) = p(r_J | q_i, hc_i).

Figure 9 is intrinsically 3D, where the third dimension - the number of map slots which a MR job obtained - is collapsed. Hadoop load analysis provides us with (Hadoop load, queue load, MR resources) tuples, which allow us to build a probabilistic model describing how many resources we would receive given what we know about the total cluster load and the other jobs in the queue, as shown in Equation 8:

p(r_J) = \sum_{hc_i \in HC} \sum_{q_i \in Q} p(r_J | q_i, hc_i) \cdot p(q_i | hc_i) \cdot p(hc_i)    (8)

Figure 10: Reduce slot usage in a queue as a function of total cluster load and the number of requested reducers: (a) 50 reducers requested; (b) 900 reducers requested.

Figures 10a and 10b show how reduce resources are distributed in a queue as a function of the total reduce usage in the cluster. What we conclude is that the nature of the distribution changes depending on how many reducers we are seeking.
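A hedged sketch of how the decomposition in Equation 8 could be estimated from monitored (cluster load, queue load, granted slots) tuples; the binning widths and the sample history are hypothetical. Equation 9 below is the same computation with the history additionally filtered by the requested reducer count R.

    # Illustrative sketch: empirical estimate of p(r_J) via Equation 8.
    from collections import Counter, defaultdict

    def resource_distribution(history, bin_cluster, bin_queue):
        """history: iterable of (cluster_load, queue_load, granted_slots)."""
        hc_counts = Counter()
        q_given_hc = defaultdict(Counter)
        r_given_q_hc = defaultdict(Counter)
        for hc, q, r in history:
            hc_b, q_b = bin_cluster(hc), bin_queue(q)
            hc_counts[hc_b] += 1
            q_given_hc[hc_b][q_b] += 1
            r_given_q_hc[(hc_b, q_b)][r] += 1
        n = sum(hc_counts.values())
        p_r = Counter()
        for hc_b, n_hc in hc_counts.items():              # p(hc_i)
            for q_b, n_q in q_given_hc[hc_b].items():     # p(q_i | hc_i)
                cell = r_given_q_hc[(hc_b, q_b)]
                n_cell = sum(cell.values())
                for slots, n_r in cell.items():           # p(r_J | q_i, hc_i)
                    p_r[slots] += (n_r / n_cell) * (n_q / n_hc) * (n_hc / n)
        return dict(p_r)

    # Made-up monitoring history and 5000-slot / 500-slot bins:
    history = [(31000, 2200, 180), (8000, 900, 450), (33000, 2500, 150)]
    make_bin = lambda width: (lambda x: x // width)
    # In practice p(hc_i) could be replaced by a forecast of the cluster load
    # expected at submission time; that is what makes the decomposition useful.
    print(resource_distribution(history, make_bin(5000), make_bin(500)))
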

We can derive Equation 9 for probabilistic reduce resource allocation, similar to Equation 8:

p(r_J, R) = \sum_{hc_i \in HC} \sum_{q_i \in Q} p(r_J | q_i, hc_i, R) \cdot p(q_i | hc_i, R) \cdot p(hc_i)    (9)

where the parameter R specifies how many reducers we actually asked for.

Figure 11: Resource sensitivity explained.

3.5 Applying stochastic optimization for concurrent MapReduce jobs

Before we proceed to the algorithm description, we need to introduce the notion of resource sensitivity for a MR job. Assume that our task consists of a few MR jobs with certain data dependencies between them, e.g., Figure 11. Resource sensitivity shows how the task execution time changes if we reduce the amount of resources for a particular job J. In the example in Figure 11, the task starts at T = 0 and finishes at T = t. Now let us assume that we decrease the number of map or reduce slots for a particular job J by \Delta R, which may lead to an increase of the task execution time from T = t to T = t + \Delta t. We use Equation 7 to predict a job completion time. The task resource sensitivity for a particular job J is shown in Equation 10:

RS(J) = \Delta t / \Delta R    (10)

In the example shown in Figure 11, RS(D) = RS(E) = 0, because the increase in the execution time of jobs D and E does not influence the total timing: job C finishes its execution later than jobs D and E (even after we reduce the amount of resources for jobs D and E). However, RS(C) > 0, because job C finishes its execution last, and any resource reduction for job C increases the task's total execution time.

The central idea of our stochastic optimization is to find those MR jobs whose results are needed much later during the A/B test execution and which therefore require fewer resources or can be assigned a lower execution priority. In the capacity scheduler we cannot explicitly control map slot assignment to a job. However, when we set a lower priority for a MR job, we force this job to use map and reduce slots only when they are not used by other processes in the same queue from the same user. When there are many users submitting their jobs to a queue, the capacity scheduler provides a share of the queue resources to each user. Thus, even the lowest-priority Hive job will obtain a portion of the resources. All these considerations help us to derive Algorithm 1 for MR job optimization strategies.
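Before Algorithm 1, a small sketch (ours, not the production code) of how the resource sensitivity of Equation 10, used by Algorithm 1 below, could be computed for a toy dependency graph in the spirit of Figure 11. The job names and the duration model are invented, and the completion-time estimate ignores slot contention.

    # Illustrative sketch: resource sensitivity RS(J) = delta_t / delta_R.
    def finish_time(jobs, deps, duration):
        """Completion time of the whole task given per-job durations.
        jobs: names; deps: {job: [prerequisites]}; duration: {job: seconds}.
        Assumes ready jobs can run concurrently (no slot contention)."""
        done = {}
        def finish(j):
            if j not in done:
                done[j] = duration[j] + max((finish(p) for p in deps.get(j, [])),
                                            default=0.0)
            return done[j]
        return max(finish(j) for j in jobs)

    def resource_sensitivity(job, delta_R, jobs, deps, reducers, dur_model):
        """Equation 10: change in task time per reducer removed from one job."""
        base = finish_time(jobs, deps, {j: dur_model(j, reducers[j]) for j in jobs})
        trimmed = dict(reducers, **{job: max(1, reducers[job] - delta_R)})
        new = finish_time(jobs, deps, {j: dur_model(j, trimmed[j]) for j in jobs})
        return (new - base) / delta_R

    # Toy DAG in the spirit of Figure 11: C is on the critical path, D and E are not.
    jobs = ["A", "B", "C", "D", "E"]
    deps = {"C": ["A", "B"]}
    reducers = {"A": 200, "B": 200, "C": 400, "D": 100, "E": 100}
    base_secs = {"A": 300, "B": 250, "C": 900, "D": 150, "E": 120}
    dur_model = lambda j, r: base_secs[j] * (1 + 50.0 / r)
    print(resource_sensitivity("D", 50, jobs, deps, reducers, dur_model))  # 0.0
    print(resource_sensitivity("C", 50, jobs, deps, reducers, dur_model))  # > 0
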
Algorithm 1: Stochastic optimization for A/B test

input: MR job dependence list; MR job data sizes; MR performance model (Equation 7); Hadoop load model (Equations 8, 9)
output: reducer allocation per MR job in the task; priorities for each MR job

begin
  1: obtain pairs (Resources_i; p_i) - the amount of resources for the task together with the probability of obtaining those resources, using Equations 8, 9
  2: using Resources_i, instantiate each MR job with map and reduce resources, proportionally to the job input data size
  repeat
    repeat
      1: compute reducer resource sensitivity using Equation 10
      2: add \Delta R reducers to those jobs which can help to decrease the task execution time
      if total timing does not increase then
        1: decrease the number of reducers by \Delta R for the jobs with the lowest reducer resource sensitivity
        2: use Equation 7 to compute the task's total timing
      end
    until no improvement;
    foreach job in the test do
      1: using Equation 10, compute map resource sensitivity, as if the job were assigned lower priority
    end
    1: choose the job J with the lowest map resource sensitivity
    2: assume that job J is set to lower priority; recalculate map and reduce slot assignment to the other MR jobs and the total timing using Equation 7
    if total timing did not increase then
      1: assign lower priority to job J
      2: recalculate map and reduce slot assignment to the other MR jobs
    end
  until no improvement;
  1: repeat for each pair (Resources_i, p_i)
  2: report the resource assignment for each job as the expectation of the resource assignments
end

4. CASE STUDY: MIGRATING A/B TEST FROM TERADATA TO HADOOP

In this paper, we focus on one particular example of Big Data analytics: large-scale A/B testing. We performed an A/B test for the eBay Shopping Cart dataset. An outline of the test is shown in Table 1. Originally this test was executed using both Teradata and SAS: Teradata would pre-process the data and send the result to a SAS server, which would finalize the computations. We start with an overview of Teradata - a high-performance data warehousing infrastructure - and then move to the discussion of the A/B test schema.

4.1 Teradata

Teradata [12] is an example of a shared-nothing [30] massively parallel processing (MPP) relational database management system (RDBMS) [34], [26]. A simplified view of its architecture is shown in Figure 12. The key architecture components of Teradata are: PE - parsing engine; AMP - Access Module Processor; BYNET - communication between VPROCs. AMPs and PEs are VPROCs - virtual processors, self-contained instances of the processor software. They are the basic units of parallelism [18]. VPROCs run as multi-threaded processes to enable Teradata's parallel execution of tasks without using specialized physical processors. They are labeled as virtual processors because they perform similar processing functions as physical processors (CPUs). In the Teradata architecture, each parallel unit (AMP) owns and manages its own portion of the database [26], [17]. Tuples are hash-partitioned and evenly distributed between AMPs, and the workload for joining, aggregate calculation and sorting can be distributed so that the load spreads equally between AMPs. As with any RDBMS, Teradata uses indexing [34] on tables, which allows speeding up data access.

Table 1: A/B test schema

procedure   | equivalent SQL                                       | query engine
extraction  | A = t1 join ... join t5                              | Teradata
pruning     | A1 = select A.c1, ... where ...                      | Teradata
aggregation | A2 = select stddev(A1.c1), ... group by ...          | Teradata
capping     | update A1 set A1.c1 = min(A1.c1, k * A2.stdc1), ...  | SAS
aggregation | A3 = select stddev(A1.c1), ... group by ...          | SAS
lifts, CIs  | computed using SAS                                   | SAS

Table 2: Dataset size

table | original number of tuples | number of tuples after pruning
t1    | 9,125                     | 48
t2    | 429,926                   | 215,000
t3    | 2,152,362,400             | 533,812,569
t4    | 4,580,836,258             | 263,214,093
t5    | 6,934,101,343             | 3,250,605,119

4.2 Test schema

While select, join and aggregate computations are routine RDBMS operations, the real challenge is the size of the data in tables t1, ..., t5, shown in Table 2 as the original number of tuples. In our sample A/B test, all the heavy data processing was executed on Teradata, and a much smaller data set would be sent to SAS. Thus, it is more important to compare the Teradata part of the A/B test with its Hadoop equivalent. The Teradata part of the A/B test from Table 1 can be translated 1-to-1 into Hive queries.
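For illustration only, the sketch below shows how one step of Table 1 - the aggregation A2 - might look as a HiveQL statement submitted from a driver script; the table names, column names and chosen settings are hypothetical, not the production queries.

    # Illustrative sketch (not the production code): submitting one Hive
    # translation of a Table 1 step from a Python driver.
    import subprocess

    AGGREGATION_A2 = """
        CREATE TABLE a2 AS
        SELECT treatment_id,
               STDDEV_POP(metric_c1) AS std_c1,
               AVG(metric_c1)        AS avg_c1
        FROM   a1
        GROUP BY treatment_id;
    """

    def submit(query: str, reducers: int, priority: str = "NORMAL") -> None:
        """Run a query via the Hive CLI, requesting reduce tasks and a priority."""
        subprocess.check_call([
            "hive",
            "-hiveconf", f"mapred.reduce.tasks={reducers}",
            "-hiveconf", f"mapred.job.priority={priority}",
            "-e", query,
        ])

    submit(AGGREGATION_A2, reducers=300)
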
Figure 12: Teradata architecture (VPROCs - PEs and AMPs - connected by the BYNET).

The SAS part of the A/B test cannot be translated completely into Hive. We use the PHP scripting language to finalize the computations which cannot be translated into Hive. However, the PHP scripts perform only a very small portion of the final data assembly.

The schema in Table 1 possesses strong dependencies between the different execution steps. For instance, we cannot proceed with capping unless we have computed the aggregates of the dataset; likewise, we cannot compute lifts and confidence intervals unless we have capped the data. After data dependency analysis, we transform the Teradata part of Table 1 into the diagram shown in Figure 13a. Letters A through K denote the data processing operations, as shown in Table 3. So now we can proceed with the comparison.

Figure 13: A/B test scenarios: (a) scenario 1; (b) scenario 2.

Table 3: A/B test data loading: extraction, pruning, and aggregation

A = t4 join sigma(t1)
B = sigma(t5)
C = sigma(t2)
D = sigma_{t3.c = v3}(t3)
E = sigma_{t3.c != v3}(t3)
F = A join B
G = F join C
H = G join D
I = G join E
J = sum(...), count(...) from H
K = sum(...), count(...) from I

4.3 A/B test without explicit resource control

We run sequential and parallel versions of the data loading routines from Table 3 on Hadoop and compare the obtained results with the Teradata timing. In these experiments, we do not control the amount of resources (the number of reduce tasks per Hive job); we let Hive infer this based on the data size of each job. However, all of the experiments were performed during the weekends, when the queue to which we were submitting MR jobs was completely free. Each result was averaged over 10 executions.

In the first experiment, we execute the data loading routines completely sequentially. In the second experiment, we schedule those routines concurrently, preserving the data dependencies between them, as shown in Figure 13a: we launch jobs A, B, C, D, and E simultaneously, and proceed with the other jobs as the data becomes available.

Figure 14: Timing for data extraction, pruning, and aggregation on Teradata.

We conducted experiments with different sizes of table t3 for data loading on both Hadoop and Teradata. Data load timing for Hadoop is shown in Figure 15, and for Teradata in Figure 14. At first, we observe that sequential data loading on Hadoop takes approximately 70 minutes (Figure 15) when table t3 contains 14% of the data. For the concurrent Hadoop loading routines, it takes approximately 20 minutes to execute the routines with 100% of the data in table t3.

For Teradata, it takes only 2 minutes to load the data with 100% of the size of table t3. However, the slope of the plot for Teradata in Figure 14 is much steeper than for Hadoop in Figure 15. When the amount of data in table t3 increases, execution time for Teradata increases almost linearly, whereas on Hadoop data selection from the table happens in the background of other running processes: processes D and E in Figure 13a select data from the table, and their results are required only in later stages of the job execution.

In Figure 16, we compare the total timing to compute an A/B test using Hadoop only with a combination of Teradata + SAS. While Teradata is about 10 times faster than Hadoop at loading the data, Teradata + SAS appears to be about 5 times slower at computing the whole A/B test than doing it on Hadoop.
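A minimal sketch (ours, not the production driver) of the concurrent schedule used in the second experiment: the jobs of Table 3 are launched as soon as their inputs are ready, so A-E start immediately and F-K follow as dependencies complete. The dependency map follows Table 3; the job body is a placeholder for the corresponding Hive query.

    # Illustrative sketch: dependency-driven concurrent execution of Table 3 jobs.
    from concurrent.futures import ThreadPoolExecutor
    import threading

    DEPS = {"A": [], "B": [], "C": [], "D": [], "E": [],
            "F": ["A", "B"], "G": ["F", "C"], "H": ["G", "D"],
            "I": ["G", "E"], "J": ["H"], "K": ["I"]}

    def run_job(name: str) -> None:
        print(f"submitting Hive job {name}")   # placeholder for hive -e ...

    def run_concurrently(deps):
        done = {j: threading.Event() for j in deps}
        def worker(job):
            for d in deps[job]:
                done[d].wait()                 # block until prerequisites finish
            run_job(job)
            done[job].set()
        with ThreadPoolExecutor(max_workers=len(deps)) as pool:
            for job in deps:
                pool.submit(worker, job)

    run_concurrently(DEPS)
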
4.4 Applying stochastic optimization for A/B test

For this evaluation, we modify the data loading schema of the A/B test from Figure 13a into the one shown in Figure 13b. In this scenario, we added an extra job L which processes a substantial amount of data, but whose results are needed only at the later stages of the A/B test. We would like to investigate how this extra job impacts the performance. To be able to use stochastic optimization, we need the corresponding data sizes for each of these jobs; those numbers are shown in Table 4.

When we apply Algorithm 1 to our modified A/B test as shown in Figure 13b, we note that jobs D, E, and L receive a lower execution priority. Also, job L receives 150 reducer slots instead of the 800 slots originally determined by Hive.

Figure 15: Timing for data extraction, pruning, and aggregation on Hadoop.