Map-Reduce and Spark

CompSci 516: Database Systems

Lecture 12: Map-Reduce and Spark

Instructor: Sudeepa Roy

Duke CS, Fall 2017

Announcements

• Practice midterm posted on Sakai
  – First prepare and then attempt!
• Midterm next Wednesday 10/11 in class
  – Closed book/notes, no electronic devices
  – Everything until and including Lecture 12 included
• HW2 due in 2 weeks
  – First run your code on a local machine to ensure that it is correct, then on AWS
  – Remember to stop AWS instances!


Where are we now?

We learnt
✓ Relational Model and Query Languages
  ✓ SQL, RA, RC
  ✓ Postgres (DBMS)
  ✓ XML (overview)
  • HW1
✓ Database Normalization
✓ DBMS Internals
  – Storage
  – Indexing
  – Query Evaluation
  – Operator Algorithms
  – External sort
  – Query Optimization

• Today:
  – MapReduce and Spark


Reading Material

• Recommended (optional) readings:
  – Chapter 2 (Sections 1, 2, 3) of Mining of Massive Datasets, by Rajaraman and Ullman: http://i.stanford.edu/~ullman/mmds.html
  – Original Google MR paper by Jeff Dean and Sanjay Ghemawat, OSDI’04: http://research.google.com/archive/mapreduce.html
  – “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” (see course website), by Matei Zaharia et al., 2012

Acknowledgement: Some of the following slides have been borrowed from Prof. Shivnath Babu, Prof. Dan Suciu, Prajakta Kalmegh, and Junghoon Kang

MapReduce


Big Data

• It cannot be stored in one machine
  – store the data sets on multiple machines: Google File System
• It cannot be processed in one machine
  – parallelize computation on multiple machines: MapReduce

Ack: Slide by Junghoon Kang

The Map-Reduce Framework

• Google published the MapReduce paper in OSDI 2004, a year after the Google File System paper
• A high-level programming paradigm
  – allows many important data-oriented processes to be written simply
• Processes large data by:
  – applying a function to each logical record in the input (map)
  – categorizing and combining the intermediate results into summary values (reduce)


Where does Google use MapReduce?

Input (to MapReduce):
● crawled documents
● web request logs

Output (from MapReduce):
● inverted indices
● graph structure of web documents
● summaries of the number of pages crawled per host
● the set of most frequent queries in a day


Storage Model

• Data is stored in large files (TB, PB)
  – e.g. market-basket data (more when we do data mining)
  – or web data
• Files are divided into chunks
  – typically many MB (64 MB)
  – sometimes each chunk is replicated for fault tolerance (later in distributed DBMS)


Map-Reduce Steps

• Input is typically (key, value) pairs
  – but could be objects of any type
• Map and Reduce are performed by a number of processes
  – physically located in some processors

[Figure: input key-value pairs → Map → Shuffle (sort by key; pairs with the same key go to the same reducer) → Reduce → output lists]

Map-Reduce Steps

1. Read Data
2. Map
   – extract some info of interest in (key, value) form
3. Shuffle and sort
   – send same keys to the same reduce process
4. Reduce
   – operate on the values of the same key
   – e.g. transform, aggregate, summarize, filter
5. Output the results (key, final-result)

Simple Example: Map-Reduce

• Word counting
• Inverted indexes

Ack: Slide by Prof. Shivnath Babu
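A minimal single-process sketch of word counting may make the five steps concrete. It only simulates the data flow in plain Python; the names `map_fn`, `shuffle`, `reduce_fn`, `word_count` and the two sample documents are illustrative assumptions, not part of any MapReduce API.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit (word, 1) for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle and sort: group values by key and return keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(docs):
    """Run read -> map -> shuffle -> reduce -> output on a dict of documents."""
    intermediate = [pair
                    for doc_id, text in docs.items()
                    for pair in map_fn(doc_id, text)]
    return [reduce_fn(word, counts) for word, counts in shuffle(intermediate)]


# Hypothetical input: two tiny documents keyed by doc id.
docs = {"d1": "the cat sat", "d2": "the cat ran"}
result = word_count(docs)
print(result)
```

In a real cluster the map calls run in parallel over chunks and the shuffle moves data across the network; here everything is an in-memory list so the logic of each step stays visible.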


Map Function

• Each map process works on a chunk of data
• Input: (input-key, value)
• Output: (intermediate-key, value) -- may not be the same as the input key and value
• Example: list all doc ids containing a word
  – output of map: (word, docid); emits each such pair
  – word is key, docid is value
  – duplicate elimination can be done at the reduce phase





Reduce Function

• Input: (intermediate-key, list-of-values-for-this-key)
  – list can include duplicates
  – each map process can leave its output on its local disk; the reduce process can retrieve its portion
• Output: (output-key, final-value)
• Example: list all doc ids containing a word
  – output will be a list of (word, [doc-id1, doc-id5, ….])
  – if the count is needed, reduce counts #docs; output will be a list of (word, count)
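The doc-id example can be sketched as follows. This is an in-memory simulation under assumed names (`map_fn`, `reduce_fn`, `inverted_index`) and made-up documents; it shows the map side emitting duplicate (word, docid) pairs and the reduce side eliminating them and, optionally, counting.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit (word, doc_id) for every occurrence; duplicates are allowed."""
    for word in text.split():
        yield word, doc_id


def reduce_fn(word, doc_ids):
    """Reduce: remove duplicate doc ids and also report how many docs remain."""
    unique = sorted(set(doc_ids))
    return unique, len(unique)


def inverted_index(docs):
    """Group map output by word, then reduce each group."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for word, emitted_id in map_fn(doc_id, text):
            groups[word].append(emitted_id)
    return {word: reduce_fn(word, ids) for word, ids in groups.items()}


# Hypothetical documents; "big" occurs twice in doc1, so map emits a duplicate.
docs = {"doc1": "big data big ideas", "doc5": "big clusters"}
index = inverted_index(docs)
print(index["big"])
```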





Example Problem: MapReduce

Explain how the query will be executed in MapReduce

• SELECT a, max(b) as topb
• FROM R
• WHERE a > 0
• GROUP BY a

Specify the computation performed in the map and the reduce functions

Map

• Each map task
  – Scans a block of R
  – Calls the map function for each tuple
  – The map function applies the selection predicate to the tuple
  – For each tuple satisfying the selection, it outputs a record with key = a and value = b

SELECT a, max(b) as topb FROM R WHERE a > 0 GROUP BY a

• When each map task scans multiple relations, it needs to output something like key = a and value = (‘R’, b), which has the relation name ‘R’


Shuffle

• The MapReduce engine reshuffles the output of the map phase and groups it on the intermediate key, i.e. the attribute a

SELECT a, max(b) as topb FROM R WHERE a > 0 GROUP BY a

• Note that the programmer has to write only the map and reduce functions; the shuffle phase is done by the MapReduce engine (although the programmer can rewrite the partition function), but you should still mention this in your answers
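To illustrate what a partition function does during the shuffle, here is a small sketch. The original MapReduce paper describes a default of the form hash(key) mod R for R reduce tasks; the use of `zlib.crc32` as the hash, the function names `partition` and `shuffle`, and the sample pairs are assumptions made for this example.

```python
import zlib


def partition(key, num_reducers):
    """Assign a key to a reducer: stable hash of the key, modulo R."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers


def shuffle(map_output, num_reducers):
    """Route every (key, value) pair to its reducer and group values by key."""
    buckets = [dict() for _ in range(num_reducers)]
    for key, value in map_output:
        reducer = partition(key, num_reducers)
        buckets[reducer].setdefault(key, []).append(value)
    return buckets


# Hypothetical map output for the query: key = a, value = b.
map_output = [(1, 10), (2, 7), (1, 30), (3, 5), (2, 9)]
buckets = shuffle(map_output, num_reducers=2)
for reducer_id, groups in enumerate(buckets):
    print(reducer_id, groups)
```

Whatever hash is chosen, the property that matters is that all values for one key land at the same reducer, which is what lets the reduce function see the whole group.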


Reduce

SELECT a, max(b) as topb FROM R WHERE a > 0 GROUP BY a

• Each reduce task
  – computes the aggregate value max(b) = topb for each group (i.e. a) assigned to it (by calling the reduce function)
  – outputs the final results: (a, topb)
• Multiple aggregates can be output by the reduce phase, like key = a and value = (sum(b), min(b)), etc.
• Sometimes a second (third, etc.) level of Map-Reduce phase might be needed

A local combiner can be used to compute a local max before data gets reshuffled (in the map tasks)
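Putting the three phases of this example together, the query could be simulated as below. This is a sketch under stated assumptions: R is represented as a list of blocks of (a, b) tuples, one map task per block, and the names `map_fn`, `combine`, `reduce_fn`, `run_query` are invented for illustration.

```python
from collections import defaultdict


def map_fn(row):
    """Map: apply the selection a > 0 and emit key = a, value = b."""
    a, b = row
    if a > 0:
        yield a, b


def combine(pairs):
    """Local combiner: keep only the local max of b per key within one map task."""
    local_max = {}
    for a, b in pairs:
        local_max[a] = max(b, local_max.get(a, b))
    return list(local_max.items())


def reduce_fn(a, values):
    """Reduce: compute topb = max(b) for one group."""
    return a, max(values)


def run_query(blocks):
    """SELECT a, max(b) AS topb FROM R WHERE a > 0 GROUP BY a."""
    groups = defaultdict(list)
    for block in blocks:                      # one map task per block of R
        mapped = [pair for row in block for pair in map_fn(row)]
        for a, b in combine(mapped):          # shrink the data before the shuffle
            groups[a].append(b)               # shuffle: group on the key a
    return sorted(reduce_fn(a, bs) for a, bs in groups.items())


# Hypothetical relation R(a, b), split into two blocks.
blocks = [[(1, 5), (1, 9), (-2, 100)], [(1, 7), (3, 4), (0, 8)]]
result = run_query(blocks)
print(result)
```

The combiner is safe here because max is associative and commutative: taking the max of local maxima gives the same answer as taking the max over all values.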


More Terminology

• A Map-Reduce “Job”
  – e.g. count the words in all docs
  – complex queries can have multiple MR jobs
• Map or Reduce “Tasks”
  – A group of map or reduce “functions”
  – scheduled on a single “worker”
• Worker
  – a process that executes one task at a time
  – one per processor, so 4-8 per machine
• A master controller
  – divides the data into chunks
  – assigns different processors to execute the map function on each chunk
  – other/same processors execute the reduce functions on the outputs of the map functions

However, there is no uniform terminology across systems

Ack: Slide by Prof. Dan Suciu

Why is Map-Reduce Popular?

• Distributed computation before MapReduce
  – how to divide the workload among multiple machines?
  – how to distribute data and program to other machines?
  – how to schedule tasks?
  – what happens if a task fails while running?
  – … and … and ...
• Distributed computation after MapReduce
  – how to write the Map function?
  – how to write the Reduce function?
• Developers’ tasks made easy



Handling Fault Tolerance in MR

• Although the probability that any single machine fails is low, a failure somewhere among thousands of machines is common
• Worker Failure
  – The master sends a heartbeat to each worker node
  – If a worker node fails, the master reschedules the tasks handled by the worker
• Master Failure
  – The whole MapReduce job gets restarted through a different master

Ack: Slide by Junghoon Kang
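The first point can be checked with a quick calculation. The sketch below assumes machines fail independently with the same probability p during a job, so the chance that at least one of n machines fails is 1 - (1 - p)^n; the value p = 0.001 and the cluster sizes are hypothetical numbers chosen only for illustration.

```python
def prob_any_failure(p, n):
    """Probability that at least one of n independent machines fails."""
    return 1.0 - (1.0 - p) ** n


# Hypothetical per-machine failure probability during one job.
p_one = 0.001
for n in (1, 100, 1000, 5000):
    print(f"{n:5d} machines: P(at least one failure) = {prob_any_failure(p_one, n):.3f}")
```

With these assumed numbers, a single machine almost never fails, while a 1000-machine job sees at least one failure with probability of about 0.63, which is why the framework has to handle worker failure automatically.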

Other aspects of MapReduce

• Locality
  – The input data is managed by GFS
  – Choose the cluster of MapReduce machines such that those machines contain the input data on their local disk
  – We can conserve network bandwidth
• Task granularity
  – It is preferable for the number of tasks to be a multiple of the number of worker nodes
  – The smaller the partition size, the faster the failover and the finer the granularity of load balancing, but it incurs more overhead
  – Need a balance
• Backup Tasks
  – In order to cope with a “straggler”, the master schedules backup executions of the remaining in-progress tasks


Apache Hadoop

• Apache Hadoop has an open-source version of GFS and MapReduce
  – GFS -> HDFS (Hadoop Distributed File System)
  – Google MapReduce -> Hadoop MapReduce
• You can download the software and implement your own MapReduce applications


MapReduce Pros and Cons

• MapReduce is good for off-line batch jobs on large data sets
• MapReduce is not good for iterative jobs due to high I/O overhead, as each iteration needs to read/write data from/to GFS
• MapReduce is bad for jobs on small datasets and jobs that require low-latency response


Spark


See the RDD paper from the course website

What is Spark?

• Not a modified version of Hadoop
• Separate, fast, MapReduce-like engine
  – In-memory data storage for very fast iterative queries
  – General execution graphs and powerful optimizations
  – Up to 40x faster than Hadoop
  – Up to 100x faster (2-10x on disk)
• Compatible with Hadoop’s storage APIs
  – Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

Borrowed slide

Distributed in-memory large-scale data processing engine!

Ack: Slide by Prajakta Kalmegh

Applications (Big Data Analysis)

• In-memory analytics & anomaly detection (Conviva)
• Interactive queries on data streams (Quantifind)
• Exploratory log analysis (Foursquare)
• Traffic estimation w/ GPS data (Mobile Millennium)
• Twitter spam classification (Monarch)
• ...

Borrowed slide


Why a New Programming Model?

• MapReduce greatly simplified big data analysis
• But as soon as it got popular, users wanted more:
  – More complex, multi-stage iterative applications (graph algorithms, machine learning)
  – More interactive ad-hoc queries
  – More real-time online processing
• All three of these apps require fast data sharing across parallel jobs

Borrowed slide

NOTE: What were the workarounds in the MR world? Ysmart [1], Stubby [2], PTF [3], Haloop [4], Twister [5]


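Data Sharing in MapReduce

[Figure: an iterative job (Input → iter. 1 → iter. 2 → . . .) with an HDFS read before and an HDFS write after each iteration; and a set of queries (query 1, query 2, query 3, . . .) that each do an HDFS read of the same Input to produce result 1, result 2, result 3.]

Slow due to replication, serialization, and disk IO

Borrowed slide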


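Data Sharing in Spark

[Figure: the same iterative job (Input → iter. 1 → iter. 2 → . . .) and the same queries (query 1, query 2, query 3, . . .), now sharing data through distributed memory after one-time processing of the Input.]

10-100× faster than network and disk

Borrowed slide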


RDD: Spark Programming Model

• Key idea: Resilient Distributed Datasets (RDDs)
  – Distributed collections of objects that can be cached in memory or stored on disk across cluster nodes
  – Manipulated through various parallel operators
  – Automatically rebuilt on failure (How? Use Lineage)

Borrowed slide


Additional Slides on Spark (Optional Reading)

Ack: The following slides are by Prajakta Kalmegh

More on RDDs

• Transformations: Created through deterministic operations on either
  ‣ data in stable storage or
  ‣ other RDDs
• Lineage: An RDD has enough information about how it was derived from other datasets
• Immutable: An RDD is a read-only, partitioned collection of records
  ‣ Checkpointing of RDDs with long lineage chains can be done in the background.
  ‣ Mitigating stragglers: We can use backup tasks to recompute transformations on RDDs
• Persistence level: Users can choose a re-use storage strategy (caching in memory, storing the RDD only on disk, or replicating it across machines); they can also choose a persistence priority for data spills
• Partitioning: Users can ask that an RDD’s elements be partitioned across machines based on a key in each record

RDD Transformations and Actions

*http://www.tothenew.com/blog/spark-1o3-spark-internals/

*https://spark.apache.org/docs/1.0.1/cluster-overview.html

Note: Lazy Evaluation: a very important concept
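Lazy evaluation means that transformations only record what should be done, and nothing is computed until an action asks for a result. The toy class below is a plain-Python stand-in used to show the idea; `LazyDataset` and its methods are invented for this sketch and are not Spark's API.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, source, ops=()):
        self._source = source          # the underlying list of records
        self._ops = tuple(ops)         # pending ("map" | "filter", function) steps

    def map(self, fn):
        """Transformation: returns a new dataset, runs nothing."""
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        """Transformation: returns a new dataset, runs nothing."""
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def _iterate(self):
        """Apply the recorded steps to each record, one record at a time."""
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):
        """Action: triggers the computation and returns all results."""
        return list(self._iterate())

    def count(self):
        """Action: triggers the computation and returns the number of results."""
        return sum(1 for _ in self._iterate())


calls = []                             # records every time square() really runs


def square(x):
    calls.append(x)
    return x * x


squares = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda x: x > 4)
calls_before_action = len(calls)       # no action yet, so square() has not run
result = squares.collect()             # the action forces evaluation
print(calls_before_action, result, len(calls))
```

Deferring work this way lets an engine see the whole chain of transformations before running anything, so it can pipeline steps and avoid materializing intermediate results.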

DAG of RDDs

*https://trongkhoanguyenblog.wordpress.com/2014/11/27/understand-rdd-operations-transformations-and-actions/

Fault Tolerance

• RDDs track the series of transformations used to build them (their lineage) to recompute lost data
• E.g.:

    messages = textFile(...).filter(_.contains("error"))
                            .map(_.split('\t')(2))

  Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))

Borrowed slide

Tradeoff: Low computation cost (cache more RDDs) VS High memory cost (not much work for GC)
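The lineage idea can be mimicked in a few lines of plain Python. This is a sketch, not Spark: `ToyRDD`, its methods, and the two in-memory "log partitions" are assumptions made for illustration, and only one-to-one (per-partition) dependencies are modelled. Each dataset remembers its parent and the function that derives it, so a lost cached partition can be rebuilt from the source.

```python
class ToyRDD:
    """Toy dataset that keeps its lineage: a parent plus a per-partition function."""

    def __init__(self, name, parent=None, fn=None, partitions=None):
        self.name = name
        self.parent = parent           # the dataset this one was derived from
        self.fn = fn                   # how to derive one partition from the parent's
        self.source = partitions       # raw partitions (root dataset only)
        self.cached = {}               # partition index -> materialized data

    def filter(self, pred):
        return ToyRDD("FilteredRDD", self, lambda part: [x for x in part if pred(x)])

    def map(self, f):
        return ToyRDD("MappedRDD", self, lambda part: [f(x) for x in part])

    def _recompute(self, i):
        """Rebuild partition i by replaying the lineage from the root."""
        if self.parent is None:
            return list(self.source[i])
        return self.fn(self.parent._recompute(i))

    def get(self, i):
        """Return partition i from the cache, recomputing it if it is missing."""
        if i not in self.cached:
            self.cached[i] = self._recompute(i)
        return self.cached[i]

    def lineage(self):
        """Names of the datasets on the path from the root to this one."""
        chain, node = [], self
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))


# Hypothetical tab-separated log lines, split into two partitions.
log_partitions = [
    ["error\tdisk\tfull", "info\tok\tfine"],
    ["warn\tslow\tio", "error\tnet\ttimeout"],
]
lines = ToyRDD("HadoopRDD", partitions=log_partitions)
messages = lines.filter(lambda s: "error" in s).map(lambda s: s.split("\t")[2])

first = [messages.get(i) for i in range(2)]
del messages.cached[1]                 # simulate losing one cached partition
recovered = messages.get(1)            # rebuilt from lineage, partition 1 only
print(messages.lineage(), first, recovered)
```

Only the lost partition is recomputed, and only along its own chain of parents, which is the reason lineage-based recovery can be cheaper than replicating every dataset.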

Representing RDDs

• Graph-based representation. Five components:

Representing RDDs (Dependencies)

[Figure: dependency types: one-to-one, many-to-one, many-to-many (shuffle)]

Representing RDDs (An example)

Advantages of the RDD model

Checkpoint!

• Data Sharing in Spark and Some Applications
• RDD Definition, Model, Representation, Advantages

Other Engine Features: Implementation

• Not covered in detail
• Some Summary:
• Spark local vs Spark Standalone vs Spark cluster (resource sharing handled by YARN/Mesos)
• Job Scheduling: DAGScheduler vs TaskScheduler (Fair vs FIFO at task granularity)
• Memory Management: deserialized in-memory (fastest) VS serialized in-memory VS on-disk persistent
• Support for Checkpointing: Tradeoff between using lineage for recomputing partitions VS checkpointing partitions on stable storage
• Interpreter Integration: Ship external instances of variables referenced in a closure along with the closure class to worker nodes in order to give them access to these variables
