23
Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan Han, Matteo Interlandi, Shaghayegh Mardani, Sai Deep Tetali, Tyson Condie, Todd Millstein, Miryung Kim University of California, Los Angeles

Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

  • Upload
    others

  • View
    11

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

InteractiveDebuggingforBigDataAnalytics

Muhammad Ali Gulzar, Xueyuan Han, Matteo Interlandi,Shaghayegh Mardani, Sai Deep Tetali, Tyson Condie, Todd Millstein,Miryung KimUniversity of California, Los Angeles

Page 2: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

DebuggingBigDataAnalytics

• Today’splatformslackdebuggingsupport– Programs(i.e.,queries, jobs)arebatchexecuted /blackboxes– Errorsreflect low-leveldetails (e.g.,taskid?!)notrelevanttothe logicalbug– Longprogramexecution time =>longdevelopment cycles

• Whatdoprogrammersdo?– Trialanderror debuggingonsample data– Post-mortem analysisoferrorlogs– Analyzephysicalviewoftheexecution (ajobid,failednode,etc).

Page 3: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

“IwouldliketounderstandtheflowofcontrolthroughtheSparksourcecodeontheworkernodeswhenIsubmitmyapplication…IamassumingI

shouldsetupSparkonEclipse…toenablesteppingthroughSparksourcecodeontheworkernodes.”

TryingtodebugaSparkApplicationonacluster…

Page 4: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

Afterayear,stillnogoodanswers!

Page 5: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

BigDebug ProjectOverviewBigDebug:DebuggingPrimitives

forInteractiveBigDataProcessinginSpark

[ICSE2016]

SimulatedBreakpointOn-DemandWatchpointCrashCulpritRemediationForwardBackwardTracing

Titian:DataProvenanceforFine-GrainedTracing[PVLDB2016]

Vega:IncrementalComputationforInteractiveDebugging

[UnderReview]

Page 6: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

ExampleQueryDevelopmentSession

• Dataset:NYCOpenDataProject– Callstonon-emergencyservicecenters– Datasetcontainscallrecordsfor2010-2015• Recordcontents:calltime,agency,callerlocation,etc.

• Query:Identifytheagencies thatreceivedthemostcallsduringbusyhours– E.g.,busyhourifnumberofcalls>10,000

Page 7: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

SparkProgram

caseclassCalls(id:String,hour:Int,agency:String,...)format=newSimpleDateFormat("M/d/yh:m:sa")input=sc.textFile("hdfs://...")calls=input.map(_.split(","))

.map(r=>Calls(r(0),format.parse(r(1)).getHours,r(2),...)calls.registerTempTable("calls")hist =sqlContext.sql("

SELECTagency,count(*)FROMcallsJOIN(

SELECThourFROMcallsGROUPBYhourHAVINGcount(*)>100000)counts

ONcalls.hour =counts.hourGROUPBYagency")

hist.show()

Page 8: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

Extract DatasetfromHDFSTransform itintoaDataFrame (i.e.,table)Load itintoSparkSQL

caseclassCalls(id:String,hour:Int,agency:String,...)format=newSimpleDateFormat("M/d/yh:m:sa")input=sc.textFile("hdfs://...")calls=input.map(_.split(","))

.map(r=>Calls(r(0),format.parse(r(1)).getHours,r(2),...)calls.registerTempTable("calls")hist =sqlContext.sql("

SELECTagency,count(*)FROMcallsJOIN(

SELECThourFROMcallsGROUPBYhourHAVINGcount(*)>100000)counts

ONcalls.hour =counts.hourGROUPBYagency")

hist.show()

Page 9: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

ExpressQueryinSparkSQL

caseclassCalls(id:String,hour:Int,agency:String,...)format=newSimpleDateFormat("M/d/yh:m:sa")input=sc.textFile("hdfs://...")calls=input.map(_.split(","))

.map(r=>Calls(r(0),format.parse(r(1)).getHours,r(2),...)calls.registerTempTable("calls")hist =sqlContext.sql("

SELECTagency,count(*)FROMcallsJOIN(

SELECThourFROMcallsGROUPBYhourHAVINGcount(*)>100000)counts

ONcalls.hour =counts.hourGROUPBYagency")

hist.show()

Identifythebusyhoursi.e.,#calls>10,000

Joinbusyhourswithcallsthengroupbyagencyandcountthenumberof“calls”receivedbyeachagency

Page 10: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

DebuggingQueryResults• Analystobservessomeunexpectedresults– Agenciesthatshouldnotappear• e.g.,BrooklynPublicLibrary

– Expectedagenciesthatshouldappear• e.g,NYPD,NYFD

• Titiansupportforquerytriage– Analystcantracebackfromoutlierresultstocontributingdataatsomeintermediatestage

– Analystcanexecutequeriesagainstintermediatedataleadingtooutlierresults

Page 11: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

QueryTriagewithTitian• Intermediateresultsforsubquery– Tracebacktosubqueryandshowdistributionofcallsperhour– Onintermediatedataleadingtooutlierresults

Significant skewinthemidnight hour=0!

SELECThour,count(*)FROMcallsGROUPBYhour

Page 12: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

IdentifyBugandRevisetheQuery• TheBug

– Systemassignsdefaultvaluehour=0for…– Callsthatdidnotlogatime

• Possiblecourseofaction– Filteroutcallsassignedtohour=0

SELECTagency,count(*)FROMcallsJOIN(

SELECThourFROMcallsWHEREhour!=0GROUPBYhourHAVINGcount(*)>100000)counts

ONcalls.hour =counts.hourGROUPBYagency

Introducepredicatethatfiltersoutmidnight hour

Page 13: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

Vega:Re-executerevisedQuery• Vegamaterializesintermediatestageresults– i.e.,Theprevioussubqueryresultissaved

• VegaQueryRewriterleveragesthistorewritethequeryinto…

SELECTagency,count(*)FROMcallsJOINcountsWHEREcounts.hour !=0ONcalls.hour =counts.hourGROUPBYagency

MaterializedresultfrompreviousexecutionRewritefiltertoremovehour0fromjoining records

Page 14: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

Vega:ModifiedQueryEvaluation• Executeanincrementaljoin– “Diff”recordsspecifychangesinthe(join)result– Forthisexample,weincrementallyremoveallrecordsforhour0fromjoinandfinalaggregationresults

• VegaOptimizerResultsConsequence:overanorder-of-magnituderuntimeimprovement

Page 15: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

• Whenaprogramfails,ausermaywanttoinvestigateasubsetoftheoriginalinputinducingacrash,afailure,orawrongoutcome.

• DeltaDebugging[Zeller1999]–Wellknowndebuggingalgorithmforminimizingfailure-inducinginputs

– Requiresmultiplerunstoisolatefailure-inducinginputs

AutomatedIsolationofFailure-InducingInputsforBigDataAnalytics

Page 16: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

Firstwerunthetesttofindthefailureinducinginputdataset

Background:DeltaDebugging[Zeller,FSE1999]

Page 17: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

TestFails

First,werunthetesttofindthefailureinducinginputdataset

Background:DeltaDebugging[Zeller,FSE1999]

Page 18: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

Second,wesplitthefailinginputdata

TestFails Split

Background:DeltaDebugging[Zeller,FSE1999]

Page 19: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

TestFails Split

TestPasses

TestFails

Background:DeltaDebugging[Zeller,FSE1999]

Page 20: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

TestFails Split

TestPasses

TestFailsSplit

Background:DeltaDebugging[Zeller,FSE1999]

Page 21: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

TestFails Split

TestPasses

TestFailsSplit

…...

Background:DeltaDebugging[Zeller,FSE1999]

Page 22: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

ScalableAutomatedIsolationofFailure-InducingInputs

• Leveragedataprovenancetoreducesearchspace– Avoidcostlyexecutionsondatanotrelevanttothebug

• LeverageVegaoptimizesubsequentruns.

DeltaDebuggingTitian

Page 23: Interactive Debugging for Big Data Analyticsweb.cs.ucla.edu/~gulzar/assets/pdf/hotcloud16_slides_gulzar.pdf · Interactive Debugging for Big Data Analytics Muhammad Ali Gulzar, Xueyuan

Conclusion• BigDebug Project– DebuggingPrimitivesforInteractiveBigDataProcessinginApacheSpark– https://sites.google.com/site/sparkbigdebug/

• Titian:InteractiveDataProvenance– Supportstracebackqueriesfromasetofresults– Executionreplayfromanintermediatepoint

• Vega:Optimizingmodifiedqueryexecution– Novelqueryrewritemechanismthatpusheschangesbackwardstosavework– Incrementalevaluationthatoperatesondatachangesinducedbyquerymodifications