Upload
others
View
11
Download
0
Embed Size (px)
Citation preview
InteractiveDebuggingforBigDataAnalytics
Muhammad Ali Gulzar, Xueyuan Han, Matteo Interlandi,Shaghayegh Mardani, Sai Deep Tetali, Tyson Condie, Todd Millstein,Miryung KimUniversity of California, Los Angeles
DebuggingBigDataAnalytics
• Today’splatformslackdebuggingsupport– Programs(i.e.,queries, jobs)arebatchexecuted /blackboxes– Errorsreflect low-leveldetails (e.g.,taskid?!)notrelevanttothe logicalbug– Longprogramexecution time =>longdevelopment cycles
• Whatdoprogrammersdo?– Trialanderror debuggingonsample data– Post-mortem analysisoferrorlogs– Analyzephysicalviewoftheexecution (ajobid,failednode,etc).
“IwouldliketounderstandtheflowofcontrolthroughtheSparksourcecodeontheworkernodeswhenIsubmitmyapplication…IamassumingI
shouldsetupSparkonEclipse…toenablesteppingthroughSparksourcecodeontheworkernodes.”
TryingtodebugaSparkApplicationonacluster…
Afterayear,stillnogoodanswers!
BigDebug ProjectOverviewBigDebug:DebuggingPrimitives
forInteractiveBigDataProcessinginSpark
[ICSE2016]
SimulatedBreakpointOn-DemandWatchpointCrashCulpritRemediationForwardBackwardTracing
Titian:DataProvenanceforFine-GrainedTracing[PVLDB2016]
Vega:IncrementalComputationforInteractiveDebugging
[UnderReview]
ExampleQueryDevelopmentSession
• Dataset:NYCOpenDataProject– Callstonon-emergencyservicecenters– Datasetcontainscallrecordsfor2010-2015• Recordcontents:calltime,agency,callerlocation,etc.
• Query:Identifytheagencies thatreceivedthemostcallsduringbusyhours– E.g.,busyhourifnumberofcalls>10,000
SparkProgram
caseclassCalls(id:String,hour:Int,agency:String,...)format=newSimpleDateFormat("M/d/yh:m:sa")input=sc.textFile("hdfs://...")calls=input.map(_.split(","))
.map(r=>Calls(r(0),format.parse(r(1)).getHours,r(2),...)calls.registerTempTable("calls")hist =sqlContext.sql("
SELECTagency,count(*)FROMcallsJOIN(
SELECThourFROMcallsGROUPBYhourHAVINGcount(*)>100000)counts
ONcalls.hour =counts.hourGROUPBYagency")
hist.show()
Extract DatasetfromHDFSTransform itintoaDataFrame (i.e.,table)Load itintoSparkSQL
caseclassCalls(id:String,hour:Int,agency:String,...)format=newSimpleDateFormat("M/d/yh:m:sa")input=sc.textFile("hdfs://...")calls=input.map(_.split(","))
.map(r=>Calls(r(0),format.parse(r(1)).getHours,r(2),...)calls.registerTempTable("calls")hist =sqlContext.sql("
SELECTagency,count(*)FROMcallsJOIN(
SELECThourFROMcallsGROUPBYhourHAVINGcount(*)>100000)counts
ONcalls.hour =counts.hourGROUPBYagency")
hist.show()
ExpressQueryinSparkSQL
caseclassCalls(id:String,hour:Int,agency:String,...)format=newSimpleDateFormat("M/d/yh:m:sa")input=sc.textFile("hdfs://...")calls=input.map(_.split(","))
.map(r=>Calls(r(0),format.parse(r(1)).getHours,r(2),...)calls.registerTempTable("calls")hist =sqlContext.sql("
SELECTagency,count(*)FROMcallsJOIN(
SELECThourFROMcallsGROUPBYhourHAVINGcount(*)>100000)counts
ONcalls.hour =counts.hourGROUPBYagency")
hist.show()
Identifythebusyhoursi.e.,#calls>10,000
Joinbusyhourswithcallsthengroupbyagencyandcountthenumberof“calls”receivedbyeachagency
DebuggingQueryResults• Analystobservessomeunexpectedresults– Agenciesthatshouldnotappear• e.g.,BrooklynPublicLibrary
– Expectedagenciesthatshouldappear• e.g,NYPD,NYFD
• Titiansupportforquerytriage– Analystcantracebackfromoutlierresultstocontributingdataatsomeintermediatestage
– Analystcanexecutequeriesagainstintermediatedataleadingtooutlierresults
QueryTriagewithTitian• Intermediateresultsforsubquery– Tracebacktosubqueryandshowdistributionofcallsperhour– Onintermediatedataleadingtooutlierresults
Significant skewinthemidnight hour=0!
SELECThour,count(*)FROMcallsGROUPBYhour
IdentifyBugandRevisetheQuery• TheBug
– Systemassignsdefaultvaluehour=0for…– Callsthatdidnotlogatime
• Possiblecourseofaction– Filteroutcallsassignedtohour=0
SELECTagency,count(*)FROMcallsJOIN(
SELECThourFROMcallsWHEREhour!=0GROUPBYhourHAVINGcount(*)>100000)counts
ONcalls.hour =counts.hourGROUPBYagency
Introducepredicatethatfiltersoutmidnight hour
Vega:Re-executerevisedQuery• Vegamaterializesintermediatestageresults– i.e.,Theprevioussubqueryresultissaved
• VegaQueryRewriterleveragesthistorewritethequeryinto…
SELECTagency,count(*)FROMcallsJOINcountsWHEREcounts.hour !=0ONcalls.hour =counts.hourGROUPBYagency
MaterializedresultfrompreviousexecutionRewritefiltertoremovehour0fromjoining records
Vega:ModifiedQueryEvaluation• Executeanincrementaljoin– “Diff”recordsspecifychangesinthe(join)result– Forthisexample,weincrementallyremoveallrecordsforhour0fromjoinandfinalaggregationresults
• VegaOptimizerResultsConsequence:overanorder-of-magnituderuntimeimprovement
• Whenaprogramfails,ausermaywanttoinvestigateasubsetoftheoriginalinputinducingacrash,afailure,orawrongoutcome.
• DeltaDebugging[Zeller1999]–Wellknowndebuggingalgorithmforminimizingfailure-inducinginputs
– Requiresmultiplerunstoisolatefailure-inducinginputs
AutomatedIsolationofFailure-InducingInputsforBigDataAnalytics
Firstwerunthetesttofindthefailureinducinginputdataset
Background:DeltaDebugging[Zeller,FSE1999]
TestFails
First,werunthetesttofindthefailureinducinginputdataset
Background:DeltaDebugging[Zeller,FSE1999]
Second,wesplitthefailinginputdata
TestFails Split
Background:DeltaDebugging[Zeller,FSE1999]
TestFails Split
TestPasses
TestFails
Background:DeltaDebugging[Zeller,FSE1999]
TestFails Split
TestPasses
TestFailsSplit
Background:DeltaDebugging[Zeller,FSE1999]
TestFails Split
TestPasses
TestFailsSplit
…...
Background:DeltaDebugging[Zeller,FSE1999]
ScalableAutomatedIsolationofFailure-InducingInputs
• Leveragedataprovenancetoreducesearchspace– Avoidcostlyexecutionsondatanotrelevanttothebug
• LeverageVegaoptimizesubsequentruns.
DeltaDebuggingTitian
Conclusion• BigDebug Project– DebuggingPrimitivesforInteractiveBigDataProcessinginApacheSpark– https://sites.google.com/site/sparkbigdebug/
• Titian:InteractiveDataProvenance– Supportstracebackqueriesfromasetofresults– Executionreplayfromanintermediatepoint
• Vega:Optimizingmodifiedqueryexecution– Novelqueryrewritemechanismthatpusheschangesbackwardstosavework– Incrementalevaluationthatoperatesondatachangesinducedbyquerymodifications