Magdalena Balazinska
School of Computer Science & Engineering, University of Washington
http://www.cs.washington.edu/people/faculty/magda
Cloud Data Analytics With Performance SLAs
Node 1 Node 2 Node N…
Data Cleaning
Query-driven Deduplication
Machine Learning
Complex Analytics, Images and Videos
Federated Analytics
Build big data management and analytics systems
Open Source, Real Users
Cloud SLAs
Cloud Operation
Elastic Scaling: CPU & memory
Performance Explanations
Parallel Query Evaluation
Iterative Queries
Efficient Query Evaluation
Data Summaries
Array Processing
Intra- & inter-group collaborations
Acknowledgments
Work presented was done together with:
• Jennifer Ortiz (PhD student – lead student)
• Victor Almeida (now at Petrobras)
• Brendan Lee (now at Tableau)
• Joseph L. Hellerstein (eScience Institute at UW)
• Johannes Gehrke (Microsoft)
Our sponsors for the project:
• NSF, ISTC Big Data, Petrobras, Amazon, EMC, and Facebook
The Setting
Data
Node 1 Node 2 Node N…
Data scientist needs to analyze data
She wants to use a cloud service
Myria is a big data system & service from our group
The User’s Questions
• Price
– How much will it cost me?
– Will I accidentally spend too much money?
• Capabilities
– Will I be able to express all my queries?
• Performance
– Will all my queries run fast?
– Which ones will be fast and which ones slow?
Cloud Data Service Pricing Today
Example: Amazon
Cloud Data Service Pricing Today
Example: Azure
Mismatch between user’s and cloud’s perspectives
Our solution: Personalized Service-Level Agreements (PSLAs)
User Buys a Performance Level
Personalized SLAs
User: Hi Cloud, so I have this cool data
Cloud (PSLAManager, PerfEnforce): Let me see… here are some options for you
Option 1: $0.10/hour
– Select and aggregate < 30 sec
– Joins < 5 min
Option 2: $0.50/hour
– Select and aggregate < 10 sec
– Joins < 1 min
Performance-Centric SLAs
• PSLAManager System [CIDR’15]
– Takes as input a user’s database (schema & stats)
– Generates a Personalized SLA (PSLA)
• PerfEnforce System [SIGMOD’16 Demo + Submission]
– Takes as input PSLA and stream of queries
– Elastically scales cluster to meet PSLA at low cost
[Diagram labels: PerfEnforce, PSLAManager, Myria Master Node]
Example: Myria’s Personalized Service Level Agreement
Fixed, hourly price
Service tiers
Expected performance
Templates capture capabilities
Challenges
• What makes a good PSLA?
• How to generate a good PSLA?
• How to guarantee runtimes in PSLA?
PSLA Quality Metrics
• Complexity: Number of query templates
• Error: Error between advertised time threshold and expected query runtimes
• Capability Coverage: Relational algebra operations
• Optimization goal: Given a database, a set of query capabilities, and a set of cloud service configurations, generate a PSLA that minimizes a combination of complexity and error while preserving capabilities
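As an illustration of the complexity and error metrics, here is a minimal Python sketch. The `Cluster` record, the function names, and the choice of a mean relative gap as the error measure are assumptions made for this example; the slides only state that error compares the advertised time threshold with expected query runtimes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    threshold: float              # advertised time threshold, in seconds
    predicted_times: List[float]  # predicted runtimes of the queries it covers
    num_templates: int            # query templates shown for this cluster


def psla_complexity(clusters: List[Cluster]) -> int:
    """Complexity: total number of query templates in the PSLA."""
    return sum(c.num_templates for c in clusters)


def psla_error(clusters: List[Cluster]) -> float:
    """Assumed error measure: mean relative gap between threshold and predicted time."""
    gaps = [
        (c.threshold - t) / c.threshold
        for c in clusters
        for t in c.predicted_times
    ]
    return sum(gaps) / len(gaps) if gaps else 0.0
```

A PSLA with fewer, wider clusters lowers `psla_complexity` but tends to raise `psla_error`, which is the trade-off named in the optimization goal.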
PSLA Generation Overview
Inputs: Data (Schema), Cloud Service
Output: PSLA
• Workload Generation: User does not have a concrete set of queries
• Performance Prediction Model: Accurate query time estimation is hard
• Workload Clustering & Compression: Tradeoff between complexity and accuracy
Workload Generation
• Start with simple queries
– Tables F, D1, D2, and D3
– Select some fraction of the rows
– Look at some fraction of the columns
• Build toward more complex ones
– Join increasingly many tables together
– F joins with D1, then D2, then D3
• For each query pattern, generate the query that will process the most data
– Goal to focus on most expensive queries
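The enumeration described above can be sketched in a few lines of Python. This is only an illustration under assumed names: `generate_workload`, its parameters, and the dictionary layout of a query pattern are hypothetical, and the real generator also picks, for each pattern, the concrete query that processes the most data.

```python
from itertools import product
from typing import Dict, List


def generate_workload(
    fact: str,
    dims: List[str],
    num_columns: List[int],
    selectivities: List[float],
) -> List[Dict]:
    """Enumerate query patterns from simple scans to joins with all dimensions."""
    patterns = []
    # k = 0 scans the fact table alone; each further k joins one more dimension.
    for k in range(len(dims) + 1):
        tables = [fact] + dims[:k]
        for n, p in product(num_columns, selectivities):
            patterns.append(
                {"tables": tables, "num_attributes": n, "selectivity": p}
            )
    return patterns


workload = generate_workload("F", ["D1", "D2", "D3"], [1, 5], [0.1, 1.0])
```

With three dimension tables, two column counts, and two selectivities, this yields 4 × 2 × 2 = 16 patterns, starting with a scan of F and ending with F joined to D1, D2, and D3.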
Performance Model
[Diagram: query feature vectors {q1, q2, …, qk} (Est. Rows, Est. IO, Avg. Row) together with the cloud configuration are mapped to a runtime]
Based on “Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning” [Ganapathi et al. 2009]
Train model offline on other data and queries
Predict runtime from query features
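To make the idea of predicting runtime from query features concrete, here is a small Python sketch. It uses a plain k-nearest-neighbor average, which is a stand-in chosen for brevity and not the model from the cited paper; the function name and the feature layout are assumptions, and in practice the features would need to be normalized first.

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], float]  # (feature vector, observed runtime in seconds)


def predict_runtime(train: List[Example], features: Sequence[float], k: int = 3) -> float:
    """Average the runtimes of the k training queries closest in feature space."""
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], features))[:k]
    return sum(runtime for _, runtime in nearest) / len(nearest)


# Hypothetical training data: (est. rows, est. IO) -> runtime, collected offline.
train = [((1.0, 1.0), 10.0), ((2.0, 2.0), 20.0), ((10.0, 10.0), 100.0)]
estimate = predict_runtime(train, (1.5, 1.5), k=2)
```

Here the two nearest training queries have runtimes of 10 and 20 seconds, so the estimate is their average.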
Tier Selection
Workload Compression into a PSLA
[Plot: predicted times (s) for generated queries in each service tier: Configuration 1 ($0.10/hour), Configuration 2 ($0.20/hour), Configuration 3 ($0.50/hour), …]
Step 1: Cluster
Step 2: Set time thresholds
Step 3: Identify representatives
Step 4: Translate into query templates
Two Approaches to Clustering
a) Threshold-based clustering
b) Density-based clustering
[Plots: Time (s) by Configuration for each approach]
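A minimal Python sketch of threshold-based clustering follows. It assumes a fixed, pre-chosen list of time thresholds and assigns each predicted runtime to the smallest threshold that covers it; the function name and the example thresholds are hypothetical.

```python
from bisect import bisect_left
from typing import Dict, List


def threshold_clusters(times: List[float], thresholds: List[float]) -> Dict[float, List[float]]:
    """Group predicted runtimes under the smallest threshold that is >= the runtime."""
    ordered = sorted(thresholds)
    clusters: Dict[float, List[float]] = {t: [] for t in ordered}
    for runtime in times:
        i = bisect_left(ordered, runtime)
        if i == len(ordered):
            raise ValueError(f"runtime {runtime} exceeds the largest threshold")
        clusters[ordered[i]].append(runtime)
    # Thresholds that cover no query do not need to appear in the PSLA.
    return {t: members for t, members in clusters.items() if members}


groups = threshold_clusters([3, 10, 45, 70], [10, 60, 300])
```

Density-based clustering would instead derive the groups from where the predicted times are concentrated, so the thresholds come out of the data and are not fixed in advance.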
Complexity-Error Trade-off
Data: 10 GB TPC-H/SSB Benchmark
Translating Queries into Templates
Concrete query Q:
SELECT … FROM F JOIN D1 ON … WHERE …
Follows template:
SELECT <N attributes> FROM F JOINS <K Dimensions> WHERE <p% of ROWS>
Capability Dominance
Template format:
SELECT <N attributes> FROM F JOINS <K Dimensions> WHERE <p% of ROWS>
Template T1 dominates T2 iff K1 >= K2 and p1 >= p2 and N1 >= N2
Retain only root templates in each cluster
– Enough to capture all query capabilities
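The dominance rule translates directly into code. The Python sketch below is an illustration only; the `Template` type and the function names are hypothetical, while the comparison itself is the one stated on the slide.

```python
from typing import List, NamedTuple


class Template(NamedTuple):
    k: int    # number of joined dimension tables (K)
    n: int    # number of projected attributes (N)
    p: float  # percent of rows selected (p)


def dominates(t1: Template, t2: Template) -> bool:
    """T1 dominates T2 iff K1 >= K2 and p1 >= p2 and N1 >= N2."""
    return t1.k >= t2.k and t1.p >= t2.p and t1.n >= t2.n


def root_templates(cluster: List[Template]) -> List[Template]:
    """Keep only templates that no other distinct template in the cluster dominates."""
    unique = list(dict.fromkeys(cluster))  # drop exact duplicates, keep order
    return [
        t for t in unique
        if not any(other != t and dominates(other, t) for other in unique)
    ]


cluster = [Template(1, 9, 100), Template(1, 8, 10), Template(0, 10, 100), Template(1, 9, 10)]
roots = root_templates(cluster)
```

In this example the template with one join, 9 attributes, and 100% of rows dominates the two smaller one-join templates, while the fact-only template with 10 attributes survives because it projects more attributes.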
Compressing Across Tiers
• To reduce complexity, PSLA only shows what improves from one tier to the next
[Plot: Time (s) with template labels (Fact+1D, 9, 100%), (Fact+1D, 9, 100%), (Fact+1D, 8, 10%), (Fact, 10, 100%), (Fact+1D, 9, 10%), (Fact, 10, 100%)]
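One way to read "only shows what improves" is sketched below in Python. The rule used here, that a template is shown in a tier only if its time threshold is strictly lower than in every cheaper tier, is an assumption made for illustration, as are the function name and the data layout.

```python
from typing import Dict, List


def prune_across_tiers(tiers: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """tiers: cheapest first; each maps a template label to its threshold in seconds."""
    best_so_far: Dict[str, float] = {}
    shown: List[Dict[str, float]] = []
    for tier in tiers:
        visible = {}
        for template, threshold in tier.items():
            # Show the template only if this tier is faster than all cheaper tiers.
            if template not in best_so_far or threshold < best_so_far[template]:
                visible[template] = threshold
                best_so_far[template] = threshold
        shown.append(visible)
    return shown


pruned = prune_across_tiers([
    {"(Fact+1D, 9, 100%)": 60, "(Fact, 10, 100%)": 300},
    {"(Fact+1D, 9, 100%)": 10, "(Fact, 10, 100%)": 300},
])
```

The second tier then lists only the template whose threshold drops from 60 to 10 seconds; the unchanged one is omitted.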
Summary: PSLA Generation
Two-tier PSLA for a 10 GB instance of the Star Schema Benchmark and the Myria DBMS service.
[Diagram: Schema, Workload Generation, Runtime Prediction, Workload Compression into PSLA (Query Clustering, Template Generation, Cross-Tier Pruning), PSLA]
Performance-Centric SLAs
• PSLAManager System
– Takes as input a user’s database (schema & stats)
– Generates a Personalized SLA (PSLA)
– PSLAs sell performance levels rather than resources
• PerfEnforce System
– Takes as input PSLA and stream of queries
– Elastically scales cluster to meet PSLA at low cost
Challenges
• What makes a good PSLA?
• How to generate a good PSLA?
• How to guarantee runtimes in PSLA?
From PSLA to Performance Guarantees
Once user purchases tier of service:
Challenges
Query time estimates are inaccurate
• Reason 1: Cardinality estimation is hard
– Example: How many tuples after joining 3 tables?
• Reason 2: Query time estimation is hard
– Model predicts runtime from query plan features
– Testing data can be very different from training
[Diagram: SLA Generator, PerfEnforce, and DBMS, with numbered steps 1 to 4 and cluster grow/shrink actions]
Solution
Problem: How to guarantee PSLA runtimes?
Solution: Scale cluster elastically
• How?
• When?
Performance for Networked Storage
[Plot: 12 workers, 100 random queries, 10 GB data; panels: Shorter Queries, Longer Queries]
Networked storage can add latency (Amazon S3 or EBS-Low IOPS)
Warm cache avoids problem but will add variance
Best solution: Local storage (ephemeral) or EBS-High IOPS
• Cost of adding instance: Time to re-attach EBS volume or re-ingest into local storage
• This cost is on top of cost of adding a new virtual machine (VM)
• EBS solution adds extra cost of paying for EBS volumes
Cluster Scaling Method 1
• Shuffle to re-scale:
– Start as many VMs as user purchases
– Ingest data into local storage on these VMs
– When need to resize: add VM & reshuffle data
Slow to reconfigure
• In contrast: 5 sec to attach and 10 sec to detach EBS volume
Cluster Scaling Method 2
• Separate data and compute:
– Separate data and compute nodes
– Scale compute nodes only
Expensive
Cluster Scaling Method 3
• Replicate data:
– Spin up maximum number of VMs
– Ingest data with careful replication
– Schedule query on as few VMs as possible
Fast and inexpensive
• Consistent with EBS approach
[Diagram: nodes N1, N2, N3, N4, N5a, N6a, N5b, N6b; plots: Query runtimes, Initial data preparation]
Cluster Scaling – Bottom Line
• Step 1: Ingest data from Amazon S3 (or other) into fast networked storage such as Amazon EBS
– Replicate such that subsets of volumes have all data
• Step 2: Add and remove VMs as needed
– When adding a VM, attach an EBS volume
– When removing a VM, detach the EBS volume
• Cost: Cost of VMs + EBS volumes
• Scaling overhead: Time to add VM + attach EBS volume
Solution
Problem: How to guarantee PSLA runtimes?
Solution: Scale cluster elastically
• How?
• When?
– Scheduling: How many workers for a query?
– Provisioning: When to add/remove VMs?
• Per-tenant cluster
• Shared cluster
Query Scheduling Goal
[Diagram label: Virtual Machines]
PerfEnforce Query Scheduling
• Goal: Use just enough machines to meet SLA time
• Reactive Approaches:
– Proportional Integral Controller
– Reinforcement Learning: Multi-Armed Bandit
– Do not work well because best action depends on incoming query rather than historical errors
• Proactive Approaches:
– Contextual Multi-Armed Bandit: Better because takes query features into account to decide how to run query
– Online Learning: Best because
  • Captures correlations between cluster sizes (= faster learning)
  • Offline model protects from major workload changes
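The proactive idea, deciding per query how many machines to use, can be sketched as follows in Python. The `predict` callable stands in for whatever learned model estimates runtime for a given query and cluster size; it, the function name, and the toy predictor below are assumptions for illustration and not PerfEnforce's actual interface.

```python
from typing import Callable, Dict, Iterable


def pick_cluster_size(
    predict: Callable[[Dict, int], float],
    query_features: Dict,
    sla_seconds: float,
    sizes: Iterable[int],
) -> int:
    """Return the smallest cluster size whose predicted runtime meets the SLA."""
    ordered = sorted(sizes)
    for size in ordered:
        if predict(query_features, size) <= sla_seconds:
            return size
    # No size is predicted to meet the SLA: fall back to the largest cluster.
    return ordered[-1]


def toy_predict(features: Dict, size: int) -> float:
    # Toy predictor (assumption): runtime shrinks linearly with machines.
    return features["work"] / size


size = pick_cluster_size(toy_predict, {"work": 240.0}, 20.0, [4, 8, 12, 16, 20, 24, 28, 32])
```

Because the decision uses the incoming query's features, it does not have to wait for an SLA miss before changing the cluster size, which is the weakness the slide attributes to the reactive approaches.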
Query Scheduling Results
Amazon EC2 with 4, 8, 12, 16, 20, 24, 28, or 32 machines – 100 GB – Star Schema Benchmark
Each point: One set of configuration parameters and average from 10 workloads
Each workload: 100 random queries of a given type (large joins, small joins, short, long, etc.)
PerfEnforce Resource Provisioning
• Adding and removing VMs takes time
• Two deployment modes
– Independent Tenants Mode
  • Each tenant has own cluster
– Multitenant Mode
  • Set of tenants share pool of warm instances
• Two algorithms
– Resource Utilization
– Simulation: Learn past tenant behavior and resize cluster assuming same behavior in next window
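A rough Python sketch of the simulation idea: replay the previous window's query log and find the peak number of VMs in use at once, then size the pool for the next window accordingly. The log format `(start, end, vms)` and the function name are assumptions; the slides do not specify how the simulation is carried out.

```python
from typing import List, Tuple

QueryRun = Tuple[float, float, int]  # (start time, end time, VMs used)


def peak_vm_demand(query_log: List[QueryRun]) -> int:
    """Replay a past window and return the maximum number of VMs busy at once."""
    events = []
    for start, end, vms in query_log:
        events.append((start, vms))   # VMs acquired when the query starts
        events.append((end, -vms))    # VMs released when the query ends
    # At equal timestamps, process releases (negative deltas) before acquisitions.
    events.sort(key=lambda e: (e[0], e[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Assume the next window behaves like the last one and provision for its peak.
next_window_vms = peak_vm_demand([(0, 10, 4), (5, 15, 8), (10, 20, 4)])
```

In the multitenant mode, the same replay could be run over the combined logs of all tenants to size the shared pool of warm instances.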
Resource Provisioning - Utilization
SLA over-estimated query times – Need to scale down
SLA under-estimated query times – Need to scale up
Resource Provisioning - Simulation
Conclusion
• Can we make cloud services easier to use?
• Can we sell performance rather than resources?
• Yes with PSLAManager & PerfEnforce
– PSLAManager generates PSLAs
– PerfEnforce enforces runtimes through scaling
Source code available on Myria website: http://myria.cs.washington.edu