A Parallel Algorithms Framework Based on Hadoop Mapreduce

A Parallel Genetic Algorithms Framework based onHadoop MapReduceFilomena FerrucciPasquale SalzaUniversity of Salerno, Italy{fferrucci,psalza}@unisa.itM-Tahar KechadiUniversity College [email protected] SarroUniversity College LondonUnited [email protected] paper describes a framework for developing parallel Ge-neticAlgorithms (GAs)ontheHadoopplatform, followingtheparadigmof MapReduce. Theframeworkallowsdevel-operstofocusontheaspectsofGAthatarespecictotheproblemtobeaddressed. UsingtheframeworkaGAappli-cation has been devised to address the FeatureSubsetSelec-tionproblem. Apreliminaryperformanceanalysisshowedpromisingresults.Categories and Subject DescriptorsC.2.4 [Computer-communication Networks]: Dis-tributed SystemsDistributed applications; D.1.3[Programming Techniques]: Concurrent Program-mingDistributedprogrammingGeneral TermsAlgorithms,Experimentation,PerformanceKeywordsHadoop,MapReduce,ParallelGeneticAlgorithms1. INTRODUCTIONGenetic Algorithms (GAs) are powerful metaheuristictechniques used to address many software engineering prob-lems [7] that involve nding a suitable balance between com-petingandpotentiallyconictinggoals,suchaslookingforthesetof requirementsthatbalancesoftwaredevelopmentcostandcustomersatisfactionorsearchingforthebestal-locationofresourcestoasoftwaredevelopmentproject.GAs simulateseveral aspectsofthe DarwinsEvolutionTheory toconvergetowardsoptimal ornear-optimal solu-tions [5]. Startingfromaninitial populationof individu-als,eachrepresentedasastring(chromosome)overanitealphabet(genes), everyiterationof thealgorithmgener-atesanewpopulationexecutinggeneticoperationsonse-lectedindividuals, suchascrossover andmutation. AlongPermission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for prot or commercial advantage and that copiesbear this notice and the full citation on the rst page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specicpermission and/or a fee.ACM SAC15 April 13-17, 2015, Salamanca, Spain.Copyright 2015 ACM 978-1-4503-3196-8/15/04...$15.00.http://dx.doi.org/10.1145/2695664.2696060...$15.00.thechainof generations, thepopulationtends toimproveitsindividualsbykeepingthestrongestindividuals(charac-terisedbythebesttnessvalue)andrejectingtheweakestones. The generations continue until aspeciedstoppingconditionisveried. Theproblemsolutionisgivenbytheindividual withthe best tness value in the lastpopulation.GAsare usually executed on single machines as sequentialprograms, soscalabilityissuespreventthattheyareeec-tivelyappliedtoreal-worldproblems. However,thesealgo-rithms arenaturallyparallelisable, as it is possibletousemore than one population and execute the operators in par-allel. Parallel systems arebecomingcommonplacemainlyduetotheincreasingpopularityof CloudSystems andtheavailability of distributed platforms, such as ApacheHadoop[12], characterizedbyeasyscalability, reliabilityandfault-tolerance. Itischaracterizedbytheuseof theMapReducealgorithmandthedistributedlesystemHDFS,whichrunonlargeclustersofcommoditymachines.Inthispaper, itispresentedaframeworkfordevelopingGAsthatcanbeexecutedontheHadoopplatform,follow-ingtheparadigmof MapReduce. Themainpurposeof theframework is to completely hide the inner aspects of Hadoopandallowuserstofocusonthedenitionoftheirproblemsin terms of GAs. An example of use of the framework for theFeature Subset Selectionproblem is also provided togetherwithpreliminaryperformanceresults.2. RELATED WORKAs described by Di Martino et al. [3], there exist tree pos-siblemaingrainparallelismimplementationsofGAsbyex-ploiting the MapReduceparadigm: FitnessEvaluationLevel(Global ParallelisationModel ); PopulationLevel (Coarse-grainedParallelisationModel or IslandModel ); IndividualLevel (Fine-grainParallelisationModel orGridModel )[3].The Global ParallelisationModel was implementedbyDiMartinoetal. [3] withthreedierentGoogleAppEngineMapReduce platformcongurations to address automatictestdatageneration.Forthesamepurpose, Di Geronimoetal. [2] developedCoarse-grained Parallelisation Modelon the Hadoop MapRe-duceplatform.MRPGAis a parallel GAsframework based on .NetandAneka,aplatformwhichsimpliesthecreationandcom-municationofnodesinaGridenvironment[8]. ItexploitstheMapReduceparadigm, butwithasecondreducephase.Amaster coordinator manages the parallel GAexecutionamongnodes,followingtheislandsparallelisationmodel,inwhicheachnodecomputestheGAoperatorsforaportionofthepopulationonly. Duringtherststep, everymappernode reads a portion of individuals and computes the tnessvalues. Then, in the rst reducephase, each node applies theselectionoperatorfortheinputindividuals, givingalocaloptimum relative to each island. The eventual single reducercomputes the global optimum and applies the remaining GAoperators. ComparingtheresultsfromMRPGAwithase-quential versionof thesamealgorithm, theyhavehadaninterestingdegreeof scalabilityoverhead factorbutalsoahighoverhead.The approachbyVerma et al. [11] was developedonHadoop. The number of Mapperand Reducernodes are un-related. The main dierence with MRPGA lies in the fact asortof migration isperformedamongtheindividualsasitrandomlysendstheoutcomeoftheMappernodestodier-entReducernodes. ResultsshowedthepotentialofparallelGA approach for the solution of big computation problems.3. THE PROPOSED FRAMEWORKThe proposed framework aims to reduce the eort for im-plementingparallel GAs, exploitingthefeaturesofHadoopin terms of scalability, reliability and fault tolerance, thus al-lowingdevelopersdonotworryabouttheinfrastructurefora distributed computation. One of the problems with the ex-ecutionofGAsinadistributedsystemistheoverheadthatmight be reducedif the networkmovements are reduced.Soitisimportanttominimisethenumberofphases,when-evertheyareassociatedtoanexecutionwhichpassesfromamachinetoanewone. For this reason, theframeworkimplementstheislands model (mostlyfollowingthedesignprovided from Di Martino et al. [3]), forcing the machines toworkalwayswiththesameportionofpopulationandusingjustoneReducerphase.Inthefollowingitisdescribedthedesignoftheinvolvedcomponents,startingfromtheDriver,whichisthefulcrumoftheframework.3.1 The DriverItmanagestheinteractionwiththeuserandlaunchjobsonthecluster. Itisexecutedonamachineevenseparatedfrom the distributed cluster and it also computes some func-tionsinordertomanagethereceivedresults,generationbygeneration.TheelementsinvolvedintotheDriverprocessare:Initialiser: theusercandenehowtogeneratetherstindividualsforeachisland, butthedefaultimplementationmakes it randomly. The Initialiser is a MapReduce jobwhichproduces theindividuals intheformof asequenceofles,storedinaserialisedformdirectlyinHDFS;Generator: it executes one generation on the cluster andproducesthenewpopulation,storingeachindividualagainin HDFSwith the tness values and a ag indicating if an in-dividual satises the stoppingconditiondened by the user;Terminator: attheendof eachgeneration, theTermi-nator componentchecksthestoppingconditions: thestan-dard one of if the maximum number of generations has beenreached oradenedbyuserone,bycheckingthepresenceof the ag in the HDFS. Once terminated, the population isdirectlysubmittedtotheSolutionsFilterjob;Migrator: thisjoballowsmovingindividualsfromanis-land to another, according to the criteria dened by the usersuch as the period of migration,number and destinations ofmigrants andthe selectedmethodfor choosing migrants.Thisphaseisoptionalanditcanoccurbysettinga migra-tionperiod parameter;SolutionsFilter: whenthejobterminates, all theindi-vidualsofthelastgenerationarelteredaccordingtothosethatsatisfythestoppingconditionandthosewhichdonot.Then,theresultsofthewholeprocessisstoredinHDFS.3.2 The Generator and the other componentsEach generatorjob makes the population evolve. In ordertodevelopthe complexstructure describedbelow, it wasneededtousemultipleMapTasks andReduceTasks. Thisispossibleusingaparticularversionof ChainMapper andChainReducer classesofHadoop, slightlymodiedinordertotreatAvroobjectsratherthantherawserialisationusedbyHadoop as default method. Thus, usingAvro [1] it iseasier to store objects and to save space on the disk, also al-lowing any external treatment of them and a quick exchangeofdataamongthepartiesinvolvedintheMapReducecom-munication.Achainallowstomanagethetasksintheformdescribedbythepattern:(MAP)+(REDUCE) (MAP)whichmeansoneormoreMapTasks,followedbyoneRe-duceTaskandotherpossibleMapTasks.ThegenerationworkisdistributedonnodesoftheCloudasfollows:Splitter: the Splittertakes as input the population, dese-rialisesitandsplitstheindividualsintoJgroups(islands),accordingtotheorderofindividuals. Eachsplitcontainsalistofrecords,onepereachindividual:split : individuals list (individual, null)Duringthedeserialisation, thesplitter adds someeldstotheobjectswhichwillbeusefulforthenextstepsofthecomputation,suchasthetnessfunction;Fitness: here according to Jislands, the JMapperscom-putethetnessvaluesforeachindividualofitscorrespond-ingisland:map : (individual, null) (individualF, null)The user denes how to evaluate the tnessand the valuesarestoredinsidethecorrespondingeldoftheobjects;TerminationCheck: withoutleavingthesamemachineof the previous map, the secondmap acts inachain. Itchecks if the current individuals satises the stoppingcondi-tion. This is useful, for instance, when a lower limit target ofthetnessvalueisknown. Ifatleastoneindividualgivesapositive answer to the test, the event is notied to the otherphases andislands byaagstoredinHDFS. This avoidstheexecutionsofthenextphasesandgenerations;Selection: if the stopping conditionhas not been satisedyet, thisisthemomenttochoosetheindividualsthatwillbetheparentsduringthecrossover forthenextiteration.The users can dene this phase in their own algorithms. Thecoupleswhichhavebeenselectedareallstoredinthekey:map : (individualF, null) (coupleinformation, individualF)If anindividual hasbeenchosenmorethanonetime, itisreplicated. Thenall theindividuals, includingthosenotchosen if the elitismis active, leave the current machine andgotothecorrespondentReducerforthenextstep;Crossover: inthis phase, individuals are groupedbythecouplesestablishedduringtheselection. TheneachRe-ducerapplies the criteria dened by the user and makes thecrossover:reduce : (coupleinformation, list (individualF)) (individual, true)Thisproduces theospring(markedwiththevaluetruein the value eld) that is read during the next step, togetherwiththepreviouspopulation;Mutation: during this phase, the chained Mappersman-agetomakethemutationofthegenesdenedbytheuser.Onlytheospringcanbemutated:map : (individual, true) (individualM, null)Elitism: inthelastphase. Iftheuserchoosestousetheoptional elitism, thedenitivepopulationischosenamongthe individuals of the ospring and the previous population:map : (individual, null) (individual, null)At this point, the islands are readytobe writtenintoHDFS.The architecture of the generator component provides twolevelsofabstraction:Therstone, whichiscalledthe core level, allowsthewholejobtowork;The second one, which is called the user level, allowstheusertodevelophisownGeneticAlgorithmandtoexecuteitontheCloud.The core is the base of the frameworkwithwhichtheend-userdoesnotneedtointeract. Indeed, thefactthataMapReducejob is executed is totally invisible to the user. Itconsistsof everythingthatisneededtorunanapplicationonHadoop. Thenal usercandevelophisownGAsimplyby implementing the relative classes, without having to dealwith mapor reducedetails. If the user does not extend theseclasses, a simple behaviour is implemented by doing nothingelsethanforwardingtheinputasoutput. Theframeworkalsomakessomedefaultclassesavailable,thustheusercanusethemformostcases.TheTerminator component plays tworoles indierenttimes: aftertheexecutionof everygenerationjob, bycall-ingthemethodsof theTerminator classonthesamema-chine where the Driveris running; during the generator job,throughtheuseoftheTerminationCheckMapper. Itchecksif thestoppingconditionshaveoccurred, i.e. thecountofthemaximumnumber of generations has beenreachedorat least one individual has beenmarkedof satisfyingthestoppingconditionduring the most recent generation phase.Thecountismaintainedbystoringalocalcountervariablethatisupdatedaftertheendofeachgeneration. Thecheckfor the presence of marked individuals is done by looking forpossible ags in HDFS. If it terminates, the execution of theSolutionsFiltercomponentwilleventuallyfollow.TheInitialiser computesaninitial populationaccordingtothedenitionoftheuser.AnotheroptionalcomponentistheMigrator whichshuf-es individuals among the islands according to the denitionof theuser. Itisatthesametimealocal componentandajobexecutedonthecluster. ItisstartedbytheDriveraccordingtoaperiodcounter.Thecomponent SolutionsFilter isinvokedonlywhenatleastoneindividual hasprovokedtheterminationbysatis-fying the stopping condition. It simply lters the individualsof the last population by dividing those that satisfy the con-ditionfromthose whichdonot. More details about theproposedframeworkcanbefoundin[4].4. PRELIMINARY ANALYSISMany software engineering problems involve predictionandclassicationtasks(e.g. eortestimation[9] andandfault prediction [10]). Classication consists of learning fromexample data, calledtrainingdataset, withthepurposeof properlyclassifyinganynewinput data. The trainingdatasetincludes a list of records, also calledinstances, con-sistingofalistof attributes (features). Several classiersexist, amongthemC4.5 builds aDecisionTree, whichisabletogivealikelyclass of membershipfor everyrecordinanewdataset. Theeectivenessof theclassiercanbemeasuredbytheaccuracy of theclassicationof thenewdata:accuracy =correctclassicationstotal ofclassicationsFeature Subset Selection(FSS) helps to improve predic-tion/classication accuracy by reducing noise in the trainingdataset [6],selectingtheoptimalsubsetoffeatures. Never-theless the problem is NP-hard. GAscan be used to identifyanear-optimalsubsetoffeaturesinatrainingdataset. Theframework was exploited in order to develop a GA to addressthe problem. For the development of the application of FSS,only535linesofcodewerewrittenagainstthe3908oftheframeworkinfrastructure. Intermsof theframework, theDriveristhemainpartofthealgorithmandisexecutedononemachine. Itneedssomeadditional informationbeforeexecution: theparametersfortheGAenvironment,suchasthe number of individuals in the initial population, the max-imumnumberofgenerations,etc.;thetrainingdataset;thetestdataset.Everyindividual (subset) is encodedas anarrayof mbit, whereeachbitshowsif thecorrespondingenumeratedattributeispresentintothesubset(value1)ornot(value0). Duringeachgeneration, everysubset is evaluatedbycomputingtheaccuracy valueandall theGAs operationsareapplieduntil target accuracy isachievedor maximumnumber of generations is reached, according to the followingsteps:Fitness: for each subset of attributes, the training datasetisreducedwithrespectoftheattributesinthesubset. Thetnessvalueiscomputedbyapplyingthesteps:1. BuildtheDecisionTree throughtheC4.5 algorithmand computing the accuracyby submitting the currentdataset;2. The operations are repeatedaccordingtothe cross-validationfoldsparameter;3. Themeanofaccuraciesisreturned.4.1 SubjectTheChicago Crimedataset (fromthe UCI MachineLearningRepositorywasused. Thedatasethas13featuresand 10000 instances. It was divided into two parts: the rst60 %astrainingdataset andthelast40 %astestdataset.Thetestbenchwascomposedbythreeversions:Sequential: asinglemachineexecutesaGAversionsimilartotheframeworkonebyusingtheWekaJavalibrary,fortheUniversityofWaikato;Pseudo-distributed: heretheframeworkversionofthealgorithmisused. Itisexecutedonasinglema-chineagain, settingthenumberof islandstoone. ItrequiresHadoopinordertobeexecuted;Distributed: theframeworkisexecutedonaclusterof computersonHadoopplatform. Of course, thisistheobjectofinterest.All the versions runonthe same remote AmazonEC2clusterandall themachinesinvolvedhadthesamecong-uration, inorder togiveastrongfactor of fairness. Thisconguration is named by Amazonas m1.large, a machinewith2CPUs,7.5 GBofRAMand840 GBforstorage.Thesequential versionwasexecutedonasinglemachineandalsothepseudo-distributed,butoveraHadoopinstalla-tion. Instead, thedistributedversionneededafull Hadoopclustercomposedby1masterand4computingslaves.Thechosencongurationforthetestconsistedof10gen-erations, amigrationperiodof 5with10migrants. Itwasspecied as a stopping conditiononly themaximum numberofgenerations.Thenalaccuracyiscomputedontestdataset usingtheclassierbuiltwiththebestindividualidentied.4.2 ResultsAs for the accuracy, the results were 89.55 % for sequential,89.40 %for pseudo-distributed and89.45 %for distributedversion. Thisconrmsthattheapplicationachievesitsob-jective,bygivingasubsetthathasbothareducednumberofattributesandasuitableaccuracyvalue.As for the running time, the distributed version wonagainst other versions, with 1384 s against 2993 s for pseudo-distributed and2179 s for sequential. It is clear that thedistributed version must face some considerable intrinsicHadoop bottlenecks suchas map initialisation, dataread-ing, copying and memorisation etc. and it has an additionalphase of Migration not present inthe sequential version.Theproposedframeworkdoesnotconsidereverysingleas-pectofHadoopandthissuggestthatafurtheroptimisationmightincreasetheperformanceoftheparallelversioncom-paredwiththe sequential one. It is predictable that thedistributedversionperformsbetterwhentheamountofre-quired computation is large [4]. Since the pseudo-distributedis second to the distributed, this suggests that is worth split-ting the work among multiple nodes. More details about theanalysiscarriedoutcanbefoundin[4].5. CONCLUSIONS AND FUTURE WORKThisworkproposedaframeworkforGeneticAlgorithmswith the aim of simplifying the development of Genetic Algo-rithmson Cloud. The obtained results showed an interestingeort ratio between the number of lines of codefor the appli-cationagainsttheonesfortheframework. Thepreliminarytests about the performance conrmed the natural tendencyofGAs toparallelisationbutthethreatofoverheadisstillaroundthecorner. Indeed, theuseofparallelisationseemstobeworthonlyinpresenceof computationallyintensiveandnottrivialproblems[4]butscalabityissuesneedtobedeeper addressed. Because of the use of specic components(e.g. MapTasks andReduceTasks chains), itdoesnotseemunreasonabletosuggest that givingan ad-hoc optimisa-tionofHadoopmightimprovetheperformanceandthereisalsoapossibilitythatthechangeofmodelofparallelisation[3]couldgivebetterresults.6. REFERENCES[1] D.Cutting.Datainteroperabilitywithapacheavro.ClouderaBlog,2011.[2] L.DiGeronimo,F.Ferrucci,A.Murolo,andF.Sarro.Aparallelgeneticalgorithmbasedonhadoopmapreducefortheautomaticgenerationofjunittestsuites.InSoftwareTesting,VericationandValidation(ICST),2012IEEE5thInternational Conferenceon,pages785793.IEEE,2012.[3] S.DiMartino,F.Ferrucci,V.Maggio,andF.Sarro.TowardsMigratingGeneticAlgorithmsforTestDataGenerationtotheCloud,chapter6,pages113135.IGIGlobal,2012.[4] F.Ferrucci,M.-T.Kechadi,P.Salza,andF.Sarro.Aframeworkforgeneticalgorithmsbasedonhadoop.arXivpreprintarXiv:1312.0086,2013.[5] D.E.Goldberg.GeneticAlgorithmsinSearch,OptimizationandMachineLearning.Addison-WesleyLongmanPublishingCo.,Inc.,1989.[6] M.A.Hall.Correlation-basedFeatureSelectionforMachineLearning.PhDthesis,TheUniversityofWaikato,1999.[7] M.HarmanandB.F.Jones.Search-basedsoftwareengineering.InformationandSoftwareTechnology,43(14):833839,2001.[8] C.Jin,C.Vecchiola,andR.Buyya.Mrpga: Anextensionofmapreduceforparallelizinggeneticalgorithms.InE-Science(e-Science),2008IEEE4thInternational Conferenceon,pages214221.IEEE,2008.[9] F.Sarro,S.DiMartino,F.Ferrucci,andC.Gravino.Afurtheranalysisontheuseofgeneticalgorithmtoconguresupportvectormachinesforinter-releasefaultprediction.InAppliedComputing(SAC),2012ACM27thSymposiumon,pages12151220.ACM,2012.[10] F.Sarro,F.Ferrucci,andC.Gravino.Singleandmultiobjectivegeneticprogrammingforsoftwaredevelopmenteortestimation.InAppliedComputing(SAC),2012ACM27thSymposiumon,pages12211226.ACM,2012.[11] A.Verma,X.Llor` a,D.E.Goldberg,andR.H.Campbell.Scalinggeneticalgorithmsusingmapreduce.InIntelligentSystemsDesignandApplications(ISDA),20099thInternationalConferenceon,pages1318.IEEE,2009.[12] T.White.Hadoop: TheDenitiveGuide.OReiily,thirdedition,2012.

Documents

A Parallel Algorithms Framework Based on Hadoop Mapreduce