ENAR short course


1. Statistical Computing For Big Data
   Deepak Agarwal
   LinkedIn Applied Relevance Science
   dagarwal@linkedin.com
   ENAR 2014, Baltimore, USA

2. Main Collaborators
   - Several others at both Y! and LinkedIn; I wouldn't be here without them, extremely lucky to work with such talented individuals
   - Bee-Chung Chen, Liang Zhang, Bo Long, Jonathan Traupman, Paul Ogilvie

3. Structure of This Tutorial
   - Part I: Introduction to Map-Reduce and the Hadoop System
     - Overview of Distributed Computing
     - Introduction to Map-Reduce
     - Some statistical computations using Map-Reduce: Bootstrap, Logistic Regression
   - Part II: Recommender Systems for Web Applications
     - Introduction
     - Content Recommendation
     - Online Advertising

4. Big Data becoming Ubiquitous
   - Bioinformatics, Astronomy, Internet, Telecommunications, Climatology

5. Big Data: Some size estimates
   - 1000 human genomes: >100 TB of data (1000 Genomes Project)
   - Sloan Digital Sky Survey: 200 GB of data per night (>140 TB aggregated)
   - Facebook: a billion monthly active users
   - LinkedIn: roughly >280M members worldwide
   - Twitter: >500 million tweets a day
   - Over 6 billion mobile phones in the world generating data every day

6. Big Data: Paradigm shift
   - Classical statistics: generalize using small data
   - Paradigm shift with big data: we now have an almost infinite supply of data
   - Easy statistics? Just appeal to asymptotic theory, so the issue is mostly computational? Not quite
     - More data comes with more heterogeneity
     - Need to change our statistical thinking to adapt
   - Classical statistics is still invaluable for thinking about big data analytics

7. Some Statistical Challenges
   - Exploratory analysis (EDA), visualization
     - Retrospective (on Terabytes)
     - More real time (streaming computations every few minutes/hours)
   - Statistical modeling
     - Scale (computational challenge)
     - Curse of dimensionality: millions of predictors, heterogeneity
     - Temporal and spatial correlations

8. Statistical Challenges continued
   - Experiments
     - To test new methods, test hypotheses from randomized experiments
     - Adaptive experiments
   - Forecasting
     - Planning, advertising
   - Many more I am not fully well versed in

9. Defining Big Data
   - How to know you have the big data problem?
     - Is it only the number of terabytes?
     - What about dimensionality, structured/unstructured data, computations required, ...
   - No clear definition, different points of view
   - One working definition: when the desired computation cannot be completed in the stipulated time with the current best algorithm using the cores available on a commodity PC

10. Distributed Computing for Big Data
    - Distributed computing is an invaluable tool to scale computations for big data
    - Some distributed computing models: multi-threading, Graphics Processing Units (GPU), Message Passing Interface (MPI), Map-Reduce

11. Evaluating a method for a problem
    - Scalability: process X GB in Y hours
    - Ease of use for a statistician
    - Reliability (fault tolerance), especially in an industrial environment
    - Cost: hardware and cost of maintenance
    - Good for the computations required? E.g., iterative versus one pass
    - Resource sharing

12. Multi-threading
    - Multiple threads take advantage of multiple CPUs
    - Shared memory
    - Threads can execute independently and concurrently
    - Can only handle Gigabytes of data
    - Reliable

13. Graphics Processing Units (GPU)
    - Number of cores: CPU on the order of 10; GPU has smaller cores, on the order of 1000
    - Can be >100x faster than a CPU: parallel, computationally intensive tasks are off-loaded to the GPU
    - Good for certain computationally intensive tasks
    - Can only handle Gigabytes of data
    - Not trivial to use: requires a good understanding of the low-level architecture for efficient use
    - But things are changing; it is getting more user friendly

14. Message Passing Interface (MPI)
    - Language-independent communication protocol among processes (e.g. computers)
    - Most suitable for the master/slave model
    - Can handle Terabytes of data
    - Good for iterative processing
    - Fault tolerance is low

15. Map-Reduce (Dean & Ghemawat, 2004)
    [Diagram: Data -> Mappers -> Reducers -> Output]
    - Computation is split into Map (scatter) and Reduce (gather) stages
    - Easy to use: the user needs to implement only two functions, a Mapper and a Reducer (a toy R sketch of this flow follows)
    - Easily handles Terabytes of data
    - Very good fault tolerance (failed tasks automatically get restarted)
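To make the two-function contract concrete, here is a toy, single-machine R sketch of the Map (scatter) / Reduce (gather) flow. simulate_mapreduce() and its chunk/key handling are illustrative inventions for intuition only, not the Hadoop API.

    # Toy single-machine simulation of Map-Reduce: the user supplies only a
    # mapper (chunk -> data frame of key/value pairs) and a reducer (one key
    # plus its values -> output record). simulate_mapreduce() is a made-up helper.
    simulate_mapreduce <- function(chunks, mapper, reducer) {
      # Map stage: each data chunk is processed independently
      pairs <- do.call(rbind, lapply(chunks, mapper))
      # Shuffle stage: values with the same key are brought together
      grouped <- split(pairs$value, pairs$key)
      # Reduce stage: the user-defined reducer runs once per key
      mapply(reducer, names(grouped), grouped, SIMPLIFY = FALSE)
    }

The sketch hides what a real cluster adds on top of this skeleton: distributed storage, disk-based shuffling, and automatic restarts of failed tasks, as described on the surrounding slides.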
16. Comparison of Distributed Computing Methods

                                Multi-threading   GPU               MPI         Map-Reduce
    Scalability (data size)     Gigabytes         Gigabytes         Terabytes   Terabytes
    Fault tolerance             High              High              Low         High
    Maintenance cost            Low               Medium            Medium      Medium-High
    Iterative process cost      Cheap             Cheap             Cheap       Usually expensive
    Resource sharing            Hard              Hard              Easy        Easy
    Easy to implement?          Easy              Needs low-level   Easy        Easy
                                                  GPU knowledge

17. Example Problem
    - Tabulating word counts in a corpus of documents
    - Similar to the table() function in R

18. Word Count Through Map-Reduce
    [Diagram: documents "Hello World Bye World" and "Hello Hadoop Goodbye Hadoop" are split between Mapper 1 and Mapper 2; Reducer 1 receives words from A-G, Reducer 2 receives words from H-Z]

19. Key Ideas about Map-Reduce
    [Diagram: Big Data -> Partition 1, Partition 2, ..., Partition N -> Mapper 1, Mapper 2, ..., Mapper N -> Reducer 1, ..., Reducer M -> Output 1, ..., Output M]

20. Key Ideas about Map-Reduce
    - Data are split into partitions and stored on many different machines on disk (distributed storage)
    - Mappers process data chunks independently and emit <Key, Value> pairs
    - Data with the same key are sent to the same reducer; one reducer can receive multiple keys
    - Every reducer sorts its data by key
    - For each key, the reducer processes the values corresponding to that key according to the customized reducer function and outputs the result

21. Compute Mean for Each Group

    ID   Group No.   Score
    1    1           0.5
    2    3           1.0
    3    1           0.8
    4    2           0.7
    5    2           1.5
    6    3           1.2
    7    1           0.8
    8    2           0.9
    9    4           1.3

22. Key Ideas about Map-Reduce (applied to the group-mean example)
    - Data are split into partitions and stored on many different machines on disk (distributed storage)
    - Mappers process data chunks independently and emit <Key, Value> pairs
      - For each row: Key = Group No., Value = Score
    - Data with the same key are sent to the same reducer; one reducer can receive multiple keys
      - E.g., with 2 reducers: Reducer 1 receives data with keys 1 and 2; Reducer 2 receives data with keys 3 and 4
    - Every reducer sorts its data by key
      - E.g., Reducer 1: <1, (0.5, 0.8, 0.8)>, <2, (0.7, 1.5, 0.9)>
    - For each key, the reducer processes the values corresponding to that key according to the customized reducer function and outputs the result
      - E.g., Reducer 1 output: <1, 0.70>, <2, 1.03>

23. Key Ideas about Map-Reduce
    - Same as the previous slide, highlighting "what you need to implement": the mapper's emit step and the customized reducer function

24. Pseudo Code (in R)

    Mapper:
      Input: Data
      for (row in Data) {
        groupNo <- row$groupNo
        score   <- row$score
        Output(c(groupNo, score))
      }

    Reducer:
      Input: Key (groupNo), Value (the list of scores that belong to the Key)
      count <- 0
      sum   <- 0
      for (v in Value) {
        sum   <- sum + v
        count <- count + 1
      }
      Output(c(Key, sum / count))

25. Exercise 1
    - Problem: average height per {Grade, Gender}?
    - What should be the mapper output key?
    - What should be the mapper output value?
    - What is the reducer input?
    - What is the reducer output?
    - Write the mapper and reducer for this

    StudentID   Grade   Gender   Height (cm)
    1           3       M        120
    2           2       F        115
    3           2       M        116

26. Problem: average height per Grade and Gender?
    - What should be the mapper output key? {Grade, Gender}
    - What should be the mapper output value? Height
    - What is the reducer input? Key: {Grade, Gender}, Value: list of Heights
    - What is the reducer output? {Grade, Gender, mean(Heights)}
    - (Same student table as above; a minimal R sketch of this mapper and reducer follows)
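A minimal R sketch of the Exercise 1 mapper and reducer, in the spirit of the slides' pseudocode. It reuses the toy simulate_mapreduce() driver sketched earlier; the function names and the two-chunk split are illustrative choices, not Hadoop code.

    # Exercise 1 sketch: average height per {Grade, Gender}
    students <- data.frame(
      StudentID = c(1, 2, 3),
      Grade     = c(3, 2, 2),
      Gender    = c("M", "F", "M"),
      Height    = c(120, 115, 116)
    )

    # Mapper: for each row emit key = {Grade, Gender}, value = Height
    height_mapper <- function(chunk) {
      data.frame(key   = paste(chunk$Grade, chunk$Gender, sep = ","),
                 value = chunk$Height)
    }

    # Reducer: for each key output the mean of the heights it received
    height_reducer <- function(key, values) {
      c(key, mean(values))
    }

    # Pretend the table arrives in two partitions, then run the toy driver
    chunks <- split(students, c(1, 1, 2))
    simulate_mapreduce(chunks, height_mapper, height_reducer)
    # keys "2,F", "2,M", "3,M" with means 115, 116, 120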
27. Exercise 2
    - Problem: number of students per {Grade, Gender}?
    - What should be the mapper output key?
    - What should be the mapper output value?
    - What is the reducer input?
    - What is the reducer output?
    - Write the mapper and reducer for this
    - (Same student table as above)

28. Problem: number of students per {Grade, Gender}?
    - What should be the mapper output key? {Grade, Gender}
    - What should be the mapper output value? 1
    - What is the reducer input? Key: {Grade, Gender}, Value: list of 1s
    - What is the reducer output? {Grade, Gender, sum(value list)}, or equivalently {Grade, Gender, length(value list)}
    - (Same student table as above)

29. More on Map-Reduce
    - Depends on a distributed file system
    - Typically the mappers are the data storage nodes
    - Map/Reduce tasks automatically get restarted when they fail (good fault tolerance)
    - Map and Reduce I/O are all on disk
      - Data transmission from mappers to reducers is through disk copy
    - Iterative processing through Map-Reduce
      - Each iteration becomes a map-reduce job
      - Can be expensive, since the map-reduce overhead is high

30. The Apache Hadoop System
    - Open-source software for reliable, scalable, distributed computing
    - The most popular distributed computing system in the world
    - Key modules:
      - Hadoop Distributed File System (HDFS)
      - Hadoop YARN (job scheduling and cluster resource management)
      - Hadoop MapReduce

31. Major Tools on Hadoop
    - Pig: a high-level language for Map-Reduce computation
    - Hive: a SQL-like query language for data querying via Map-Reduce
    - HBase: a distributed and scalable database on Hadoop
      - Allows random, real-time read/write access to big data
      - Voldemort is similar to HBase
    - Mahout: a scalable machine learning library

32. Hadoop Installation
    - Setting up Hadoop on your desktop/laptop:
      http://hadoop.apache.org/docs/stable/single_node_setup.html
    - Setting up Hadoop on a cluster of machines:
      http://hadoop.apache.org/docs/stable/cluster_setup.html

33. Hadoop Distributed File System (HDFS)
    - Master/slave architecture
      - NameNode: a single master node that controls which data block is stored where
      - DataNodes: slave nodes that store data and do read/write operations
      - Clients (Gateway): allow users to log in, interact with HDFS, and submit Map-Reduce jobs
    - Big data is split into equal-sized blocks; each block can be stored on different DataNodes
    - Disk failure tolerance: data is replicated multiple times

34. Load the Data into Pig
    - A = LOAD 'Sample-1.dat' USING PigStorage() AS (ID:int, groupNo:int, score:float);
    - The path of the data on HDFS goes after LOAD
    - USING PigStorage() means the data are delimited by tab (can be omitted)
    - If the data are delimited by another character, e.g. space, use USING PigStorage(' ')
    - The data schema is defined after AS
    - Variable types: int, long, float, double, chararray, ...

35. Structure of This Tutorial
    - Part I: Introduction to Map-Reduce and the Hadoop System
      - Overview of Distributed Computing
      - Introduction to Map-Reduce
      - Introduction to the Hadoop System
      - Examples of Statistical Computing for Big Data: Bag of Little Bootstraps, Large Scale Logistic Regression

36. Bag of Little Bootstraps
    - Kleiner et al. 2012

37. Bootstrap (Efron, 1979)
    - A re-sampling based method to obtain the statistical distribution of sample estimators
    - Why are we interested? Re-sampling is embarrassingly parallelizable
    - For example
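To illustrate the "embarrassingly parallel" point, here is a minimal sketch (not taken from the course material) that runs bootstrap replicates in parallel with base R's parallel package; the toy data, the sample-mean statistic, the replicate count, and the core count are arbitrary choices.

    library(parallel)

    # Each bootstrap replicate resamples the data with replacement and
    # recomputes the estimator; replicates are independent of one another,
    # so they can be farmed out to separate cores (or machines) directly.
    x <- rnorm(1e4)                # toy data; the estimator here is a sample mean
    one_replicate <- function(i) {
      mean(sample(x, length(x), replace = TRUE))
    }
    boot_means <- mclapply(1:1000, one_replicate, mc.cores = 4)  # use parLapply on Windows
    quantile(unlist(boot_means), c(0.025, 0.975))                # bootstrap 95% interval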