Cloud Data Analytics With Performance SLAs · homes.cs.washington.edu/...teradata-april2017.pdf

Magdalena Balazinska
School of Computer Science & Engineering
University of Washington
http://www.cs.washington.edu/people/faculty/magda

Cloud Data Analytics With Performance SLAs

Build big data management and analytics systems

• Open source
• Real users
• Intra- & inter-group collaborations

• Data cleaning
• Query-driven deduplication
• Machine learning
• Complex analytics, images and videos
• Federated analytics
• Cloud SLAs
• Cloud operation
• Elastic scaling: CPU & memory
• Performance explanations
• Parallel query evaluation
• Iterative queries
• Efficient query evaluation
• Data summaries
• Array processing

Acknowledgments

Work presented, done together with:
• Jennifer Ortiz (PhD student, lead student)
• Victor Almeida (now at Petrobras)
• Brendan Lee (now at Tableau)
• Joseph L. Hellerstein (eScience Institute at UW)
• Johannes Gehrke (Microsoft)

Our sponsors for the project:
• NSF, ISTC Big Data, Petrobras, Amazon, EMC, and Facebook

The Setting

• Data scientist needs to analyze data
• She wants to use a cloud service
• Myria is a big data system & service from our group

[Figure: the user's data and a cloud cluster of nodes 1 to N]

The User’s Questions

• Price
  – How much will it cost me?
  – Will I accidentally spend too much money?
• Capabilities
  – Will I be able to express all my queries?
• Performance
  – Will all my queries run fast?
  – Which ones will be fast and which ones slow?

Cloud Data Service Pricing Today

• Example: Amazon
• Example: Azure

Mismatch between user’s and cloud’s perspectives


Our solution: Personalized Service-Level Agreements (PSLAs)

User buys a performance level


Personalized SLAs

User: "Hi Cloud, so I have this cool data."
Cloud: "Let me see… here are some options for you."

Option 1: $0.10/hour
• Select and aggregate < 30 sec
• Joins < 5 min

Option 2: $0.50/hour
• Select and aggregate < 10 sec
• Joins < 1 min

[Figure: PSLAManager and PerfEnforce on the cloud side]

Performance-Centric SLAs

• PSLAManager system [CIDR’15]
  – Takes as input a user’s database (schema & stats)
  – Generates a Personalized SLA (PSLA)
• PerfEnforce system [SIGMOD’16 demo + submission]
  – Takes as input a PSLA and a stream of queries
  – Elastically scales the cluster to meet the PSLA at low cost

[Figure: PerfEnforce, PSLAManager, and the Myria master node]

Example: Myria’s Personalized Service Level Agreement

[Figure: example PSLA, annotated with the following labels]
• Fixed, hourly price
• Service tiers
• Expected performance
• Templates capture capabilities

Challenges

• What makes a good PSLA?
• How to generate a good PSLA?
• How to guarantee runtimes in PSLA?


PSLA Quality Metrics

• Complexity: Number of query templates
• Error: Error between advertised time threshold and expected query runtimes
• Capability coverage: Relational algebra operations
• Optimization goal: Given a database, a set of query capabilities, and a set of cloud service configurations, generate a PSLA that minimizes a combination of complexity and error while preserving capabilities
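The two quantitative metrics above can be made concrete with a small sketch. The slides do not give formulas, so everything here is an assumption for illustration: error is taken as the mean relative gap between a cluster's advertised threshold and the predicted runtimes of its queries, and `psla_score`, `weight`, and `max_templates` are hypothetical names for one possible way to combine the two metrics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    threshold_s: float                 # advertised time threshold for the cluster
    predicted_runtimes_s: List[float]  # predicted runtimes of the queries it covers
    num_templates: int                 # query templates shown to the user


def psla_complexity(clusters):
    """Complexity: total number of query templates in the PSLA."""
    return sum(c.num_templates for c in clusters)


def psla_error(clusters):
    """Assumed definition: mean relative gap between threshold and predicted runtime."""
    gaps = [(c.threshold_s - t) / c.threshold_s
            for c in clusters for t in c.predicted_runtimes_s]
    return sum(gaps) / len(gaps) if gaps else 0.0


def psla_score(clusters, weight=0.5, max_templates=20):
    """Hypothetical objective: weighted mix of normalized complexity and error."""
    normalized = min(psla_complexity(clusters) / max_templates, 1.0)
    return weight * normalized + (1 - weight) * psla_error(clusters)


clusters = [Cluster(10, [5, 10], 2), Cluster(60, [30], 1)]
print(psla_complexity(clusters), round(psla_error(clusters), 3))
```

A lower score is better under this assumed objective; fewer templates reduce complexity but usually widen the gap between thresholds and actual runtimes.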


PSLA Generation Overview

Inputs: data (schema) and cloud service. Output: PSLA.

• Workload generation: user does not have a concrete set of queries
• Performance prediction model: accurate query time estimation is hard
• Workload clustering & compression: tradeoff between complexity and accuracy

Workload Generation

• Start with simple queries
  – Table F, D1, D2, and D3
  – Select some fraction of the rows
  – Look at some fraction of the columns
• Build toward more complex ones
  – Join increasingly many tables together
  – F joins with D1, then D2, then D3
• For each query pattern, generate the query that will process the most data
  – Goal to focus on most expensive queries
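The enumeration described above can be sketched as follows. This is a minimal illustration, not PSLAManager's code: the table names F and D1 to D3 come from the slide, while the fraction grids, the dictionary layout, and the helper `most_expensive_projection` are assumptions.

```python
from itertools import product

FACT = "F"
DIMENSIONS = ["D1", "D2", "D3"]     # table names from the slide
COLUMN_FRACTIONS = [0.1, 0.5, 1.0]  # hypothetical grid of column fractions
ROW_FRACTIONS = [0.1, 1.0]          # hypothetical grid of row fractions


def join_patterns(fact, dims):
    """Single tables first, then F joined with D1, then D2, then D3."""
    patterns = [(fact,)] + [(d,) for d in dims]
    for k in range(1, len(dims) + 1):
        patterns.append((fact, *dims[:k]))
    return patterns


def most_expensive_projection(column_widths, fraction):
    """Pick the widest columns so the generated query touches the most bytes."""
    n = max(1, round(len(column_widths) * fraction))
    return sorted(column_widths, key=column_widths.get, reverse=True)[:n]


def generate_workload():
    """One query pattern per (join pattern, column fraction, row fraction)."""
    return [
        {"tables": tables, "column_fraction": cols, "row_fraction": rows}
        for tables, cols, rows in product(
            join_patterns(FACT, DIMENSIONS), COLUMN_FRACTIONS, ROW_FRACTIONS)
    ]


workload = generate_workload()
print(len(workload), workload[-1])
```

Choosing the widest columns is one way to realize "the query that will process the most data" for a pattern; other choices, such as the least selective predicate, would serve the same goal.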

Performance Model

• Predict runtime from query features
• Train model offline on other data and queries
• Based on "Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning" [Ganapathi et al. 2009]

[Figure: for training queries {q1, q2 … qk}, a query feature vector (est. rows, est. IO, avg. row, …) together with the cloud configuration maps to a runtime]

Tier Selection

Workload Compression into a PSLA

[Figure: predicted times (s) for generated queries in each service tier, e.g. Configuration 1 at $0.10/hour]

• Step 1: Cluster
• Step 2: Set time thresholds
• Step 3: Identify representatives
• Step 4: Translate into query templates

Two Approaches to Clustering

a) Threshold-based clustering
b) Density-based clustering

[Figure: predicted query times (s) per configuration, clustered with each approach]
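The slides name the two approaches without detailing them, so the sketch below rests on assumptions: threshold-based clustering is read as bucketing predicted runtimes under a fixed list of time thresholds, and density-based clustering is replaced by a simple one-dimensional stand-in that starts a new cluster whenever the gap between consecutive runtimes exceeds `max_gap_s`. The threshold values and function names are hypothetical.

```python
from bisect import bisect_left

THRESHOLDS_S = [10, 60, 600, 3600]  # hypothetical, fixed in advance


def threshold_clusters(predicted, thresholds=THRESHOLDS_S):
    """Group queries under the smallest fixed threshold that covers them."""
    clusters = {}
    for query, runtime in predicted.items():
        i = bisect_left(thresholds, runtime)
        limit = thresholds[i] if i < len(thresholds) else float("inf")
        clusters.setdefault(limit, []).append(query)
    return clusters


def density_clusters(predicted, max_gap_s=10.0):
    """1-D stand-in: split sorted runtimes wherever the gap exceeds max_gap_s."""
    ordered = sorted(predicted.items(), key=lambda kv: kv[1])
    groups, current = [], []
    for query, runtime in ordered:
        if current and runtime - current[-1][1] > max_gap_s:
            groups.append(current)
            current = []
        current.append((query, runtime))
    if current:
        groups.append(current)
    # The advertised threshold of each group is its slowest predicted runtime.
    return {group[-1][1]: [q for q, _ in group] for group in groups}


predicted = {"q1": 3, "q2": 9, "q3": 45, "q4": 50, "q5": 700}
print(threshold_clusters(predicted))
print(density_clusters(predicted))
```

Under these assumptions, fixed thresholds give round, easy-to-read times but leave a larger gap to the actual runtimes, while data-driven thresholds sit closer to the predicted runtimes.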


Complexity-Error Trade-off

Data: 10 GB TPC-H/SSB Benchmark

Translating Queries into Templates

Concrete query Q:
SELECT … FROM F JOIN D1 ON … WHERE …

Follows template:
SELECT < N attributes > FROM F JOINS < K Dimensions > WHERE < p % of ROWS >


Capability Dominance

Template format:
SELECT < N attributes > FROM F JOINS < K Dimensions > WHERE < p % of ROWS >

Template T1 dominates T2 iff K1 >= K2 and p1 >= p2 and N1 >= N2

Retain only root templates in each cluster
– Enough to capture all query capabilities
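The dominance rule translates directly into code. In this sketch the `Template` tuple and the function names are illustrative; only the comparison itself (K, p, and N all greater than or equal) comes from the slide.

```python
from typing import NamedTuple


class Template(NamedTuple):
    k: int    # number of dimension tables joined with the fact table
    p: float  # percent of rows selected
    n: int    # number of attributes projected


def dominates(t1, t2):
    """T1 dominates T2 iff K1 >= K2 and p1 >= p2 and N1 >= N2."""
    return t1.k >= t2.k and t1.p >= t2.p and t1.n >= t2.n


def root_templates(cluster):
    """Keep only the templates that no other distinct template dominates."""
    unique = set(cluster)
    return sorted(t for t in unique
                  if not any(o != t and dominates(o, t) for o in unique))


cluster = [
    Template(1, 100, 9),   # fact + 1 dimension, 9 attributes, 100% of rows
    Template(1, 10, 9),
    Template(0, 100, 10),  # fact table only, 10 attributes, 100% of rows
    Template(0, 10, 5),
]
print(root_templates(cluster))
```

Here two roots survive because neither dominates the other: one joins more tables, the other projects more attributes.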


Compressing Across Tiers

• To reduce complexity, PSLA only shows what improves from one tier to the next

[Figure: time (s) per tier, with templates such as (Fact+1D, 9, 100%), (Fact+1D, 8, 10%), (Fact+1D, 9, 10%), and (Fact, 10, 100%)]

Summary: PSLA Generation

• Input: schema
• Workload generation
• Runtime prediction
• Workload compression into PSLA
  – Query clustering
  – Template generation
  – Cross-tier pruning
• Output: PSLA

[Figure: two-tier PSLA for a 10 GB instance of the Star Schema Benchmark and the Myria DBMS service]

Performance-Centric SLAs

• PSLAManager system
  – Takes as input a user’s database (schema & stats)
  – Generates a Personalized SLA (PSLA)
  – PSLAs sell performance levels rather than resources
• PerfEnforce system
  – Takes as input a PSLA and a stream of queries
  – Elastically scales the cluster to meet the PSLA at low cost


Challenges

• What makes a good PSLA?
• How to generate a good PSLA?
• How to guarantee runtimes in PSLA?


From PSLA to Performance Guarantees

Once user purchases tier of service:

Challenges

Query time estimates are inaccurate
• Reason 1: Cardinality estimation is hard
  – Example: How many tuples after joining 3 tables?
• Reason 2: Query time estimation is hard
  – Model predicts runtime from query plan features
  – Testing data can be very different from training


[Figure: SLA Generator, PerfEnforce, and DBMS; PerfEnforce grows or shrinks the cluster (steps 1 to 4)]

Solution

Problem: How to guarantee PSLA runtimes?

Solution: Scale cluster elastically
• How?
• When?


Performance for Networked Storage

[Figure: shorter queries vs. longer queries; 12 workers, 100 random queries, 10 GB data]

• Networked storage can add latency (Amazon S3 or low-IOPS EBS)
• Warm cache avoids the problem but will add variance
• Best solution: local storage (ephemeral) or high-IOPS EBS
  – Cost of adding instance: time to re-attach EBS volume or re-ingest into local storage
  – This cost is on top of cost of adding a new virtual machine (VM)
  – EBS solution adds extra cost of paying for EBS volumes

Cluster Scaling Method 1

• Shuffle to re-scale:
  – Start as many VMs as user purchases
  – Ingest data into local storage on these VMs
  – When need to resize: add VM & reshuffle data

Slow to reconfigure
• In contrast: 5 sec to attach and 10 sec to detach EBS volume

Cluster Scaling Method 2

• Separate data and compute:
  – Separate data and compute nodes
  – Scale compute nodes only

Expensive

Cluster Scaling Method 3

• Replicate data:
  – Spin up maximum number of VMs
  – Ingest data with careful replication
  – Schedule query on as few VMs as possible
• Consistent with EBS approach

Fast and inexpensive

[Figure: nodes N1 to N4 plus replicas N5a, N5b, N6a, N6b; panels: query runtimes, initial data preparation]

Cluster Scaling – Bottom Line

• Step 1: Ingest data from Amazon S3 (or other) into fast networked storage such as Amazon EBS
  – Replicate such that subsets of volumes have all data
• Step 2: Add and remove VMs as needed
  – When adding a VM, attach an EBS volume
  – When removing a VM, detach the EBS volume
• Cost: Cost of VMs + EBS volumes
• Scaling overhead: Time to add VM + attach EBS volume


Solution

Problem: How to guarantee PSLA runtimes?

Solution: Scale cluster elastically
• How?
• When?
  – Scheduling: How many workers for a query?
  – Provisioning: When to add/remove VMs?
    • Per-tenant cluster
    • Shared cluster

[Figure: virtual machines]

Query Scheduling Goal


PerfEnforce Query Scheduling

• Goal: Use just enough machines to meet SLA time
• Reactive approaches:
  – Proportional Integral Controller
  – Reinforcement Learning: Multi-Armed Bandit
  – Do not work well because best action depends on incoming query rather than historical errors
• Proactive approaches:
  – Contextual Multi-Armed Bandit: Better because takes query features into account to decide how to run query
  – Online Learning: Best because
    • Captures correlations between cluster sizes (= faster learning)
    • Offline model protects from major workload changes
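A proactive scheduler in the spirit of the list above can be sketched as: predict the runtime of the incoming query at each cluster size and pick the smallest size that meets the SLA time. This is not PerfEnforce's implementation. The cluster sizes match the experiment setup reported in the talk, but `OnlineModel`, its single multiplicative correction factor, and the toy predictor are simplifying assumptions that stand in for the real online learning model.

```python
CLUSTER_SIZES = [4, 8, 12, 16, 20, 24, 28, 32]  # machine counts used in the talk's experiments


class OnlineModel:
    """Offline predictor plus one correction factor updated from observed runtimes."""

    def __init__(self, offline_predict, learning_rate=0.3):
        self.offline_predict = offline_predict  # callable(features, size) -> seconds
        self.learning_rate = learning_rate      # hypothetical smoothing parameter
        self.correction = 1.0

    def predict(self, features, size):
        return self.correction * self.offline_predict(features, size)

    def observe(self, features, size, actual_s):
        """Move the correction toward the observed/predicted ratio."""
        ratio = actual_s / self.offline_predict(features, size)
        self.correction += self.learning_rate * (ratio - self.correction)


def choose_cluster_size(model, features, sla_time_s, sizes=CLUSTER_SIZES):
    """Smallest cluster size whose predicted runtime meets the SLA time."""
    for size in sorted(sizes):
        if model.predict(features, size) <= sla_time_s:
            return size
    return max(sizes)  # nothing meets the SLA: use the largest cluster


def toy_predict(features, size):
    """Toy offline model (assumption): parallel work plus a fixed overhead."""
    return features["work_s"] / size + 1.0


model = OnlineModel(toy_predict)
query = {"work_s": 240.0}
print(choose_cluster_size(model, query, sla_time_s=30.0))
model.observe(query, 12, actual_s=63.0)  # the query ran slower than predicted
print(choose_cluster_size(model, query, sla_time_s=30.0))
```

Because the decision uses the features of the incoming query, it does not depend on the errors of earlier, unrelated queries, which is the weakness the slide attributes to the reactive approaches.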


Query Scheduling Results

• Amazon EC2 with 4, 8, 12, 16, 20, 24, 28, or 32 machines; 100 GB Star Schema Benchmark
• Each point: one set of configuration parameters and average from 10 workloads
• Each workload: 100 random queries of a given type (large joins, small joins, short, long, etc.)

PerfEnforce Resource Provisioning

• Adding and removing VMs takes time
• Two deployment modes
  – Independent Tenants Mode
    • Each tenant has own cluster
  – Multitenant Mode
    • Set of tenants share pool of warm instances
• Two algorithms
  – Resource Utilization
  – Simulation: Learn past tenant behavior and resize cluster assuming same behavior in next window


Resource Provisioning - Utilization

• SLA over-estimated query times: need to scale down
• SLA under-estimated query times: need to scale up
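One way to read the scale-up and scale-down rule is as a comparison of observed runtimes against SLA times over a recent window. The sketch below is an assumption-laden illustration, not the utilization algorithm from PerfEnforce: the window format, the `tolerance` band, the one-VM step, and the VM limits are all hypothetical.

```python
def provisioning_decision(window, current_vms, min_vms=1, max_vms=32, tolerance=0.1):
    """Return the new VM count given recent (actual_s, sla_s) pairs.

    A mean ratio above 1 means queries ran slower than the SLA promised
    (the SLA under-estimated query times), so the cluster scales up; a mean
    ratio below 1 means the SLA over-estimated them, so it scales down.
    """
    if not window:
        return current_vms
    ratio = sum(actual / sla for actual, sla in window) / len(window)
    if ratio > 1 + tolerance:
        return min(current_vms + 1, max_vms)
    if ratio < 1 - tolerance:
        return max(current_vms - 1, min_vms)
    return current_vms


print(provisioning_decision([(12.0, 10.0), (15.0, 10.0)], current_vms=8))
print(provisioning_decision([(5.0, 10.0), (6.0, 10.0)], current_vms=8))
```

The tolerance band keeps the cluster from oscillating when runtimes hover around the SLA times, which matters because adding and removing VMs takes time.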

Resource Provisioning - Simulation


Conclusion

• Can we make cloud services easier to use?
• Can we sell performance rather than resources?
• Yes, with PSLAManager & PerfEnforce
  – PSLAManager generates PSLAs
  – PerfEnforce enforces runtimes through scaling

Source code available on Myria website: http://myria.cs.washington.edu

