Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

InteractiveDebuggingforBigDataAnalytics

Muhammad Ali Gulzar, Xueyuan Han, Matteo Interlandi,Shaghayegh Mardani, Sai Deep Tetali, Tyson Condie, Todd Millstein,Miryung KimUniversity of California, Los Angeles

DebuggingBigDataAnalytics

• Today’splatformslackdebuggingsupport– Programs(i.e.,queries, jobs)arebatchexecuted /blackboxes– Errorsreflect low-leveldetails (e.g.,taskid?!)notrelevanttothe logicalbug– Longprogramexecution time =>longdevelopment cycles

• Whatdoprogrammersdo?– Trialanderror debuggingonsample data– Post-mortem analysisoferrorlogs– Analyzephysicalviewoftheexecution (ajobid,failednode,etc).

“IwouldliketounderstandtheflowofcontrolthroughtheSparksourcecodeontheworkernodeswhenIsubmitmyapplication…IamassumingI

shouldsetupSparkonEclipse…toenablesteppingthroughSparksourcecodeontheworkernodes.”

TryingtodebugaSparkApplicationonacluster…

Afterayear,stillnogoodanswers!

BigDebug ProjectOverviewBigDebug:DebuggingPrimitives

forInteractiveBigDataProcessinginSpark

[ICSE2016]

SimulatedBreakpointOn-DemandWatchpointCrashCulpritRemediationForwardBackwardTracing

Titian:DataProvenanceforFine-GrainedTracing[PVLDB2016]

Vega:IncrementalComputationforInteractiveDebugging

[UnderReview]

ExampleQueryDevelopmentSession

• Dataset:NYCOpenDataProject– Callstonon-emergencyservicecenters– Datasetcontainscallrecordsfor2010-2015• Recordcontents:calltime,agency,callerlocation,etc.

• Query:Identifytheagencies thatreceivedthemostcallsduringbusyhours– E.g.,busyhourifnumberofcalls>10,000

SparkProgram

caseclassCalls(id:String,hour:Int,agency:String,...)format=newSimpleDateFormat("M/d/yh:m:sa")input=sc.textFile("hdfs://...")calls=input.map(_.split(","))

.map(r=>Calls(r(0),format.parse(r(1)).getHours,r(2),...)calls.registerTempTable("calls")hist =sqlContext.sql("

SELECTagency,count(*)FROMcallsJOIN(

SELECThourFROMcallsGROUPBYhourHAVINGcount(*)>100000)counts

ONcalls.hour =counts.hourGROUPBYagency")

hist.show()

Extract DatasetfromHDFSTransform itintoaDataFrame (i.e.,table)Load itintoSparkSQL






hist.show()

ExpressQueryinSparkSQL






hist.show()

Identifythebusyhoursi.e.,#calls>10,000

Joinbusyhourswithcallsthengroupbyagencyandcountthenumberof“calls”receivedbyeachagency

DebuggingQueryResults• Analystobservessomeunexpectedresults– Agenciesthatshouldnotappear• e.g.,BrooklynPublicLibrary

– Expectedagenciesthatshouldappear• e.g,NYPD,NYFD

• Titiansupportforquerytriage– Analystcantracebackfromoutlierresultstocontributingdataatsomeintermediatestage

– Analystcanexecutequeriesagainstintermediatedataleadingtooutlierresults

QueryTriagewithTitian• Intermediateresultsforsubquery– Tracebacktosubqueryandshowdistributionofcallsperhour– Onintermediatedataleadingtooutlierresults

Significant skewinthemidnight hour=0!

SELECThour,count(*)FROMcallsGROUPBYhour

IdentifyBugandRevisetheQuery• TheBug

– Systemassignsdefaultvaluehour=0for…– Callsthatdidnotlogatime

• Possiblecourseofaction– Filteroutcallsassignedtohour=0


SELECThourFROMcallsWHEREhour!=0GROUPBYhourHAVINGcount(*)>100000)counts

ONcalls.hour =counts.hourGROUPBYagency

Introducepredicatethatfiltersoutmidnight hour

Vega:Re-executerevisedQuery• Vegamaterializesintermediatestageresults– i.e.,Theprevioussubqueryresultissaved

• VegaQueryRewriterleveragesthistorewritethequeryinto…

SELECTagency,count(*)FROMcallsJOINcountsWHEREcounts.hour !=0ONcalls.hour =counts.hourGROUPBYagency

MaterializedresultfrompreviousexecutionRewritefiltertoremovehour0fromjoining records

Vega:ModifiedQueryEvaluation• Executeanincrementaljoin– “Diff”recordsspecifychangesinthe(join)result– Forthisexample,weincrementallyremoveallrecordsforhour0fromjoinandfinalaggregationresults

• VegaOptimizerResultsConsequence:overanorder-of-magnituderuntimeimprovement

• Whenaprogramfails,ausermaywanttoinvestigateasubsetoftheoriginalinputinducingacrash,afailure,orawrongoutcome.

• DeltaDebugging[Zeller1999]–Wellknowndebuggingalgorithmforminimizingfailure-inducinginputs

– Requiresmultiplerunstoisolatefailure-inducinginputs

AutomatedIsolationofFailure-InducingInputsforBigDataAnalytics

Firstwerunthetesttofindthefailureinducinginputdataset

Background:DeltaDebugging[Zeller,FSE1999]

TestFails

First,werunthetesttofindthefailureinducinginputdataset


Second,wesplitthefailinginputdata

TestFails Split


TestFails Split

TestPasses

TestFails


TestFails Split

TestPasses

TestFailsSplit


TestFails Split

TestPasses

TestFailsSplit

…...


ScalableAutomatedIsolationofFailure-InducingInputs

• Leveragedataprovenancetoreducesearchspace– Avoidcostlyexecutionsondatanotrelevanttothebug

• LeverageVegaoptimizesubsequentruns.

DeltaDebuggingTitian

Conclusion• BigDebug Project– DebuggingPrimitivesforInteractiveBigDataProcessinginApacheSpark– https://sites.google.com/site/sparkbigdebug/

• Titian:InteractiveDataProvenance– Supportstracebackqueriesfromasetofresults– Executionreplayfromanintermediatepoint

• Vega:Optimizingmodifiedqueryexecution– Novelqueryrewritemechanismthatpusheschangesbackwardstosavework– Incrementalevaluationthatoperatesondatachangesinducedbyquerymodifications

Documents

Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan