Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
CompSci 516DatabaseSystems
Lecture12Map-Reduce
andSpark
Instructor:Sudeepa Roy
1DukeCS,Fall2017 CompSci516:DatabaseSystems
Announcements
• Practicemidtermpostedonsakai– Firstprepareandthenattempt!
• MidtermnextWednesday10/11inclass– Closedbook/notes,noelectronicdevices– EverythinguntilandincludingLecture12included
• HW2duein2weeks– Firstrunyourcodeonlocalmachinetoensurethatitiscorrect,thenon
AWS– RemembertostopAWSinstances!
2DukeCS,Fall2017 CompSci516:DatabaseSystems
Wherearewenow?
Welearntü RelationalModelandQueryLanguages
ü SQL,RA,RCü Postgres(DBMS)ü XML(overview)§ HW1
ü DatabaseNormalizationü DBMSInternals
– Storage– Indexing– QueryEvaluation– OperatorAlgorithms– Externalsort– QueryOptimization
• Today:– MapReduceandSpark
DukeCS,Fall2017 CompSci516:DatabaseSystems 3
ReadingMaterial• Recommended(optional)readings:
– Chapter2(Sections1,2,3)ofMiningofMassiveDatasets,byRajaraman andUllman:http://i.stanford.edu/~ullman/mmds.html
– OriginalGoogleMRpaperbyJeffDeanandSanjayGhemawat,OSDI’04:http://research.google.com/archive/mapreduce.html
– “ResilientDistributedDatasets:AFault-TolerantAbstractionforIn-MemoryClusterComputing”(seecoursewebsite)– byMatei Zahariaetal.- 2012
4
Acknowledgement:SomeofthefollowingslideshavebeenborrowedfromProf.Shivnath Babu,Prof.DanSuciu,Prajakta Kalmegh,andJunghoon Kang
DukeCS,Fall2017 CompSci516:DatabaseSystems
MapReduce
DukeCS,Fall2017 CompSci516:DatabaseSystems 5
BigData
itcannotbe storedinonemachine
storethedatasetsonmultiplemachines
GoogleFileSystem
itcannotbe processed inonemachine
parallelizecomputationonmultiplemachines
MapReduce
Ack:SlidebyJunghoon Kang
TheMap-ReduceFramework
• GooglepublishedMapReducepaperinOSDI2004,ayearaftertheGoogleFileSystempaper
• Ahighlevelprogrammingparadigm– allowsmanyimportantdata-orientedprocessestobewrittensimply
• processeslargedataby:– applyingafunctiontoeachlogicalrecordintheinput(map)
– categorizeandcombinetheintermediateresultsintosummaryvalues(reduce)
DukeCS,Fall2017 CompSci516:DatabaseSystems 7
WheredoesGoogleuseMapReduce?
MapReduce
Input
Output
● crawleddocuments● webrequestlogs
● invertedindices● graphstructureofwebdocuments● summariesofthenumberofpages
crawledperhost● thesetofmostfrequentqueriesinaday
Ack:SlidebyJunghoon Kang
StorageModel
• Dataisstoredinlargefiles(TB,PB)– e.g.market-basketdata(morewhenwedodatamining)
– orwebdata• Filesaredividedintochunks– typicallymanyMB(64MB)– sometimeseachchunkisreplicatedforfaulttolerance(laterindistributedDBMS)
DukeCS,Fall2017 CompSci516:DatabaseSystems 9
Map-ReduceSteps
• Inputistypically(key,value)pairs– butcouldbeobjectsofanytype
• MapandReduceareperformedbyanumberofprocesses– physicallylocatedinsomeprocessors
DukeCS,Fall2017 CompSci516:DatabaseSystems 10
samekey
Map ReduceShuffleInputkey-valuepairs
outputlistssortbykey
Map-ReduceSteps
1. ReadData2. Map– extractsomeinfoofinterest
in(key,value)form3. Shuffleandsort
– sendsamekeystothesamereduceprocess
DukeCS,Fall2017 CompSci516:DatabaseSystems 11
samekey
Map ReduceShuffleInputkey-valuepairs
outputlistssortbykey
4. Reduce– operateonthevaluesofthesamekey– e.g.transform,aggregate,summarize,
filter5. Outputtheresults(key,final-result)
SimpleExample:Map-Reduce
• Wordcounting• Invertedindexes
Ack:SlidebyProf.Shivnath Babu
DukeCS,Fall2017 CompSci516:DatabaseSystems 12
MapFunction
• Eachmapprocessworksonachunkofdata• Input:(input-key,value)• Output:(intermediate-key,value)-- maynotbethesameasinputkeyvalue• Example:listalldocidscontainingaword
– outputofmap(word,docid)– emitseachsuchpair– wordiskey,docid isvalue– duplicateeliminationcanbedoneatthereducephase
DukeCS,Fall2017 CompSci516:DatabaseSystems 13
samekey
Map ReduceShuffleInputkey-valuepairs
outputlistssortbykey
ReduceFunction
• Input:(intermediate-key,list-of-values-for-this-key)– listcanincludeduplicates– eachmapprocesscanleaveitsoutputinthelocaldisk,reduceprocesscanretrieveits
portion• Output:(output-key,final-value)• Example:listalldocidscontainingaword
– outputwillbealistof(word,[doc-id1,doc-id5,….])– ifthecountisneeded,reducecounts#docs,outputwillbealistof(word,count)
DukeCS,Fall2017 CompSci516:DatabaseSystems 14
samekey
Map ReduceShuffleInputkey-valuepairs
outputlistssortbykey
ExampleProblem:MapReduceExplainhowthequerywillbeexecutedinMapReduce
• SELECTa,max(b)astopb• FROMR• WHEREa>0• GROUPBYa
SpecifythecomputationperformedinthemapandthereducefunctionsDukeCS,Fall2017 CompSci516:DatabaseSystems 15
Map
• Eachmaptask– ScansablockofR– Callsthemapfunctionforeachtuple– Themapfunctionappliestheselectionpredicatetothetuple
– Foreachtuple satisfyingtheselection,itoutputsarecordwithkey=aandvalue=b
SELECTa,max(b)astopbFROMRWHEREa>0GROUPBYa
•Wheneachmaptaskscansmultiplerelations,itneedstooutputsomethinglikekey=aandvalue=(‘R’,b)whichhastherelationname‘R’
DukeCS,Fall2017 CompSci516:DatabaseSystems 16
Shuffle
• TheMapReduce enginereshufflestheoutputofthemapphaseandgroupsitontheintermediatekey,i.e.theattributea
SELECTa,max(b)astopbFROMRWHEREa>0GROUPBYa
•Notethattheprogrammerhastowriteonlythemapandreducefunctions,theshufflephaseisdonebytheMapReduceengine(althoughtheprogrammercanrewritethepartitionfunction),butyoushouldstillmentionthisinyouranswers
DukeCS,Fall2017 CompSci516:DatabaseSystems 17
ReduceSELECTa,max(b)astopbFROMRWHEREa>0GROUPBYa
• Eachreducetask• computes the aggregate value max(b) = topb for each group
(i.e. a) assigned to it (by calling the reduce function) • outputs the final results: (a, topb)
•Multipleaggregatescanbeoutputbythereducephaselikekey=aandvalue=(sum(b),min(b)) etc.
• Sometimesasecond(thirdetc)levelofMap-Reducephasemightbeneeded
A local combiner can be used to compute local max before data gets reshuffled (in the map tasks)
DukeCS,Fall2017 CompSci516:DatabaseSystems 18
MoreTerminology
• AMap-Reduce“Job”– e.g.countthewordsinalldocs– complexqueriescanhavemultipleMRjobs
• MaporReduce“Tasks”– Agroupofmaporreduce“functions”– scheduledonasingle“worker”
• Worker– aprocessthatexecutesonetaskatatime– oneperprocessor,so4-8permachine
• Amastercontroller– dividesthedataintochunks– assignsdifferentprocessorstoexecutethemapfunctiononeach
chunk– other/sameprocessorsexecutethereducefunctionsontheoutputsof
themapfunctions
DukeCS,Fall2017 CompSci516:DatabaseSystems 19
however,thereisnouniformterminologyacrosssystems
Ack:SlidebyProf.DanSuciu
WhyisMap-ReducePopular?
• DistributedcomputationbeforeMapReduce– howtodividetheworkloadamongmultiplemachines?– howtodistributedataandprogramtoothermachines?– howtoscheduletasks?– whathappensifataskfailswhilerunning?– …and…and...
• DistributedcomputationafterMapReduce– howtowriteMapfunction?– howtowriteReducefunction?
• Developers’tasksmadeeasy
DukeCS,Fall2017 CompSci516:DatabaseSystems 20
Ack:SlidebyJunghoon Kang
HandlingFaultToleranceinMR
• Althoughtheprobabilityofamachinefailureislow,theprobabilityofamachinefailingamongthousandsofmachinesiscommon
• WorkerFailure– Themastersendsheartbeattoeachworkernode– Ifaworkernodefails,themasterreschedulesthetaskshandledbytheworker
• MasterFailure– ThewholeMapReducejobgetsrestartedthroughadifferentmaster
DukeCS,Fall2017 CompSci516:DatabaseSystems 21Ack:SlidebyJunghoon Kang
OtheraspectsofMapReduce• Locality
– TheinputdataismanagedbyGFS– ChoosetheclusterofMapReducemachinessuchthatthose
machinescontaintheinputdataontheirlocaldisk– Wecanconservenetworkbandwidth
• Taskgranularity– Itispreferabletohavethenumberoftaskstobemultiplesof
workernodes– Smallerthepartitionsize,fasterfailoverandbettergranularity
inloadbalance,butitincursmoreoverhead– Needabalance
• BackupTasks– Inordertocopewitha“straggler”,themasterschedulesbackup
executionsoftheremainingin-progresstasks
DukeCS,Fall2017 CompSci516:DatabaseSystems 22Ack:SlidebyJunghoon Kang
ApacheHadoop
• ApacheHadoophasanopen-sourceversionofGFSandMapReduce– GFS->HDFS(HadoopFileSystem)– GoogleMapReduce->HadoopMapReduce
• YoucandownloadthesoftwareandimplementyourownMapReduceapplications
DukeCS,Fall2017 CompSci516:DatabaseSystems 23Ack:SlidebyJunghoon Kang
MapReduceProsandCons
• MapReduceisgoodforoff-linebatchjobsonlargedatasets
• MapReduceisnotgoodforiterativejobsduetohighI/Ooverheadaseachiterationneedstoread/writedatafrom/toGFS
• MapReduceisbadforjobsonsmalldatasetsandjobsthatrequirelow-latencyresponse
DukeCS,Fall2017 CompSci516:DatabaseSystems 24Ack:SlidebyJunghoon Kang
Spark
DukeCS,Fall2017 CompSci516:DatabaseSystems 25
SeetheRDDpaperfromthecoursewebsite
WhatisSpark?
• NotamodifiedversionofHadoop• Separate,fast,MapReduce-likeengine– In-memorydatastorageforveryfastiterativequeries–Generalexecutiongraphsandpowerfuloptimizations–Upto40xfasterthanHadoop–Upto100xfaster(2-10xondisk)
• CompatiblewithHadoop’sstorageAPIs– Canread/writetoanyHadoop-supportedsystem,includingHDFS,HBase,SequenceFiles,etc
Borrowed slide
Distributedin-memorylargescaledataprocessingengine!
Ack:SlidebyPrajakta Kalmegh
Applications(BigDataAnalysis)
• In-memoryanalytics&anomalydetection(Conviva)
• Interactivequeriesondatastreams(Quantifind)• Exploratoryloganalysis(Foursquare)• Trafficestimationw/GPSdata(MobileMillennium)
• Twitterspamclassification(Monarch)• ...
Borrowed slide
Ack:SlidebyPrajakta Kalmegh
WhyaNewProgrammingModel?• MapReducegreatlysimplifiedbigdataanalysis• Butassoonasitgotpopular,userswantedmore:–Morecomplex,multi-stageiterative applications(graphalgorithms,machinelearning)–Moreinteractive ad-hocqueries–Morereal-time onlineprocessing• Allthreeoftheseappsrequirefastdatasharingacrossparalleljobs
Borrowed slide
NOTE: What were the workarounds in MR world? Ysmart [1], Stubby[2], PTF[3], Haloop [4], Twister [5]
Ack:SlidebyPrajakta Kalmegh
DataSharinginMapReduce
iter. 1 iter. 2 . . .
Input
HDFSread
HDFSwrite
HDFSread
HDFSwrite
Input
query 1
query 2
query 3
result 1
result 2
result 3
. . .
HDFSread
Slow due to replication, serialization, and disk IOBorrowed slide
Ack:SlidebyPrajakta Kalmegh
iter. 1 iter. 2 . . .
Input
DataSharinginSpark
Distributedmemory
Input
query 1
query 2
query 3
. . .
one-timeprocessing
10-100× faster than network and diskBorrowed slide
Ack:SlidebyPrajakta Kalmegh
RDD:SparkProgrammingModel
• Keyidea:ResilientDistributedDatasets(RDDs)–Distributedcollectionsofobjectsthatcanbecachedinmemoryorstoredondiskacrossclusternodes–Manipulatedthroughvariousparalleloperators–Automaticallyrebuiltonfailure (How?UseLineage)
Borrowed slide
Ack:SlidebyPrajakta Kalmegh
AdditionalSlidesonSpark(OptionalReading)
DukeCS,Fall2017 CompSci516:DatabaseSystems 32
Ack:ThefollowingslidesarebyPrajakta Kalmegh
More on RDDs• Transformations: Created through deterministic operations on either
‣ data in stable storage or
‣ other RDDs
• Lineage: RDD has enough information about how it was derived from other datasets
• Immutable: RDD is a read-only, partitioned collection of records
‣ Checkpointing of RDDs with long lineage chains can be done in the background.
‣Mitigating stragglers: We can use backup tasks to recompute transformations on RDDs
• Persistence level: Users can choose a re-use storage strategy (caching in memory, storing the RDD only on disk or replicating it across machines; also chose a persistence priority for data spills)
• Partitioning: Users can ask that an RDD’s elements be partitioned across machines based on a key in each record
RDD Transformations and Actions
*http://www.tothenew.com/blog/spark-1o3-spark-internals/
*https://spark.apache.org/docs/1.0.1/cluster-overview.html
Note:LazyEvaluation:Averyimportantconcept
DAGofRDDs
*https://trongkhoanguyenblog.wordpress.com/2014/11/27/understand-rdd-operations-transformations-and-actions/
FaultTolerance
• RDDstracktheseriesoftransformationsusedtobuildthem(theirlineage)torecomputelostdata
• E.g:messages = textFile(...).filter(_.contains(“error”))
.map(_.split(‘\t’)(2))
HadoopRDDpath = hdfs://…
FilteredRDDfunc = _.contains(...)
MappedRDDfunc = _.split(…)
Borrowed slide
Tradeoff:LowComputationcost(cachemoreRDDs)
VSHighmemorycost(notmuchworkforGC)
Representing RDDs• Graph-based representation. Five components :
Representing RDDs (Dependencies)
one-to-one many-to-one many-to-many
shuffle
Representing RDDs (An example)
Advantages of the RDD model
Checkpoint!
• DataSharinginSparkandSomeApplications• RDDDefinition,Model,Representation,Advantages
OtherEngineFeatures:Implementation
• Notcoveredindetails• SomeSummary:• SparklocalvsSparkStandalonevsSparkcluster(ResourcesharinghandledbyYarn/Mesos)
• JobScheduling: DAGSchedulervsTaskScheduler (FairvsFIFOattaskgranularity)
• MemoryManagement:serializedin-memory(fastest)VSdeserializedin-memoryVSon-diskpersistent
• SupportforCheckpointing:TradeoffbetweenusinglineageforrecomputingpartitionsVScheckpointingpartitionsonstablestorage
• InterpreterIntegration: Shipexternalinstancesofvariablesreferencedinaclosurealongwiththeclosureclasstoworkernodesinordertogivethemaccesstothesevariables