Magdalena Balazinska, School of Computer Science & Engineering, University of Washington, http://www.cs.washington.edu/people/faculty/magda
Cloud Data Analytics With Performance SLAs



  • Build big data management and analytics systems

    Open Source, Real Users

    [Diagram: cluster of Node 1, Node 2, … Node N, labeled with research topics]

    • Data Cleaning
    • Query-driven Deduplication
    • Machine Learning
    • Complex Analytics Images and Videos
    • Federated Analytics
    • Cloud SLAs
    • Cloud Operation
    • Elastic Scaling: CPU & memory
    • Performance Explanations
    • Parallel Query Evaluation
    • Iterative Queries
    • Efficient Query Evaluation
    • Data Summaries
    • Array Processing
    • Intra- & inter-group collaborations

  • Acknowledgments

    The work presented was done together with:
    • Jennifer Ortiz (PhD student, lead student)
    • Victor Almeida (now at Petrobras)
    • Brendan Lee (now at Tableau)
    • Joseph L. Hellerstein (eScience Institute at UW)
    • Johannes Gehrke (Microsoft)

    Our sponsors for the project:
    • NSF, ISTC Big Data, Petrobras, Amazon, EMC, and Facebook

    Magdalena Balazinska - University of Washington

  • The Setting

    [Diagram: Data stored across Node 1, Node 2, … Node N]

    • Data scientist needs to analyze data
    • She wants to use a cloud service
    • Myria is a big data system & service from our group

  • The User's Questions

    • Price
      – How much will it cost me?
      – Will I accidentally spend too much money?
    • Capabilities
      – Will I be able to express all my queries?
    • Performance
      – Will all my queries run fast?
      – Which ones will be fast and which ones slow?

  • Cloud Data Service Pricing Today

    Example: Amazon

  • Cloud Data Service Pricing Today

    Example: Azure

  • Mismatch between user's and cloud's perspectives

  • Our solution: Personalized Service-Level Agreements (PSLAs)

    User Buys a Performance Level

  • Personalized SLAs

    User: "Hi Cloud, so I have this cool data"
    Cloud: "Let me see… here are some options for you"

    [Diagram components: PSLAManager, PerfEnforce]

    Option 1: $0.10/hour
    • Select and aggregate < 30 sec
    • Joins < 5 min

    Option 2: $0.50/hour
    • Select and aggregate < 10 sec
    • Joins < 1 min

  • Performance-Centric SLAs

    • PSLAManager System [CIDR'15]
      – Takes as input a user's database (schema & stats)
      – Generates a Personalized SLA (PSLA)
    • PerfEnforce System [SIGMOD'16 Demo + Submission]
      – Takes as input PSLA and stream of queries
      – Elastically scales cluster to meet PSLA at low cost

    [Diagram: PerfEnforce and PSLAManager next to the Myria master node]

  • Example: Myria's Personalized Service Level Agreement

    [Figure annotations:]
    • Fixed, hourly price
    • Service tiers
    • Expected performance
    • Templates capture capabilities

  • Challenges

    • What makes a good PSLA?
    • How to generate a good PSLA?
    • How to guarantee runtimes in PSLA?

  • PSLA Quality Metrics

    • Complexity: Number of query templates
    • Error: Error between advertised time threshold and expected query runtimes
    • Capability Coverage: Relational algebra operations
    • Optimization goal: Given a database, a set of query capabilities, and a set of cloud service configurations, generate a PSLA that minimizes a combination of complexity and error while preserving capabilities
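As a rough illustration of the two numeric metrics above, the sketch below scores one candidate PSLA tier. The data layout (a template name mapped to an advertised threshold plus the predicted runtimes of the queries it covers) and the use of a mean relative gap as the error are assumptions made for this example; the slides do not give the exact formula.

```python
from statistics import mean

def psla_complexity(tier):
    """Complexity: number of query templates shown to the user."""
    return len(tier)

def psla_error(tier):
    """Mean relative gap between advertised threshold and predicted runtimes.

    Assumed definition for illustration: (threshold - runtime) / threshold,
    averaged over every query covered by the tier.
    """
    gaps = [
        (threshold - runtime) / threshold
        for threshold, runtimes in tier.values()
        for runtime in runtimes
    ]
    return mean(gaps)

# Hypothetical tier: template -> (advertised threshold in s, predicted runtimes in s)
tier = {
    "select_aggregate": (10.0, [5.0, 10.0]),
    "joins": (60.0, [30.0, 60.0]),
}
```

Fewer templates lower the complexity but force looser thresholds, which raises the error; a PSLA generator would have to trade one against the other.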

  • PSLA Generation Overview

    Inputs: Data (Schema), Cloud Service

    Stages: Workload Generation, Performance Prediction Model, Workload Clustering & Compression

    Output: PSLA

    Difficulties noted on the slide:
    • User does not have a concrete set of queries
    • Accurate query time estimation is hard
    • Tradeoff between complexity and accuracy

  • Workload Generation

    • Start with simple queries
      – Table F, D1, D2, and D3
      – Select some fraction of the rows
      – Look at some fraction of the columns
    • Build toward more complex ones
      – Join increasingly many tables together
      – F joins with D1, then D2, then D3
    • For each query pattern, generate the query that will process the most data
      – Goal to focus on most expensive queries
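The enumeration described above could be sketched roughly as follows. The dimension row counts, the selectivity and column-fraction grids, and the rule "pick the dimension combination with the most rows" are all hypothetical choices made for this example, not values from the talk.

```python
from itertools import combinations

# Hypothetical row counts for the dimension tables of a star schema.
DIMENSIONS = {"D1": 200_000, "D2": 30_000, "D3": 2_500}

def generate_workload(selectivities=(0.1, 1.0), column_fractions=(0.5, 1.0)):
    """Enumerate query patterns, from scans of F up to joins with all dimensions."""
    workload = []
    for k in range(len(DIMENSIONS) + 1):          # number of joined dimensions
        # Among all ways to join k dimensions, keep the one touching the most rows.
        heaviest = max(
            combinations(DIMENSIONS, k),
            key=lambda dims: sum(DIMENSIONS[d] for d in dims),
        )
        for p in selectivities:                   # fraction of rows selected
            for c in column_fractions:            # fraction of columns projected
                workload.append({"joins": heaviest, "rows": p, "columns": c})
    return workload

workload = generate_workload()
```

Keeping only the heaviest query per pattern keeps the workload small while still bounding the runtime of every cheaper query that matches the same pattern.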

  • Performance Model

    [Diagram: a query feature vector {q1, q2 … qk} (Est. Rows, Est. IO, Avg. Row) together with a cloud configuration maps to a runtime]

    Based on "Predicting Multiple Metrics for Queries: Better Decisions Enabled by Machine Learning" [Ganapathi et al. 2009]

    • Train model offline on other data and queries
    • Predict runtime from query features
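To make the "train offline, then predict from features" idea concrete, here is a minimal sketch that swaps in a plain k-nearest-neighbor regressor. It is not the method of Ganapathi et al. and not necessarily what PSLAManager uses; the feature layout and all numbers are invented for the example.

```python
import math

def predict_runtime(training, features, k=2):
    """Average the runtimes of the k training queries closest in feature space.

    training: list of (feature_vector, runtime_seconds) pairs gathered offline.
    features: feature vector of the new query, same layout as the training rows.
    """
    nearest = sorted(training, key=lambda row: math.dist(row[0], features))[:k]
    return sum(runtime for _, runtime in nearest) / len(nearest)

# Hypothetical features: (est. rows in millions, est. IO in GB, avg. row bytes, workers)
training = [
    ((1.0, 0.5, 100.0, 4.0), 12.0),
    ((1.0, 0.5, 100.0, 8.0), 7.0),
    ((6.0, 3.0, 120.0, 4.0), 60.0),
    ((6.0, 3.0, 120.0, 8.0), 34.0),
]
estimate = predict_runtime(training, (1.0, 0.5, 100.0, 6.0))
```

The cloud configuration (here, the worker count) is simply one more feature, so the same model can be queried once per service tier. A real model would normalize the features first, since raw units distort the distance.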

  • Tier Selection

  • Workload Compression into a PSLA

    [Plot: predicted times for generated queries, Time (s) per service tier: Configuration 1 ($0.10/hour), Configuration 2 ($0.20/hour), Configuration 3 ($0.50/hour)]

    Step 1: Cluster
    Step 2: Set time thresholds

  • Workload Compression into a PSLA

    [Plot: predicted times for generated queries, Time (s) per service tier: Configuration 1 ($0.10/hour), Configuration 2 ($0.20/hour), Configuration 3 ($0.50/hour)]

    Step 1: Cluster
    Step 2: Set time thresholds
    Step 3: Identify representatives
    Step 4: Translate into query templates

  • Two Approaches to Clustering

    [Plots: Time (s) per configuration]

    a) Threshold-based clustering
    b) Density-based clustering
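One plausible reading of threshold-based clustering is that a fixed list of time thresholds partitions the predicted runtimes. The sketch below follows that reading; the threshold values, query names, and the handling of queries slower than every threshold are assumptions for illustration only.

```python
from bisect import bisect_left

def threshold_clusters(predicted, thresholds):
    """Group queries under the smallest threshold that covers their predicted time.

    predicted: dict mapping a query name to its predicted runtime in seconds.
    thresholds: ascending list of candidate time thresholds in seconds.
    Queries slower than every threshold land under the key None.
    """
    clusters = {}
    for query, seconds in predicted.items():
        i = bisect_left(thresholds, seconds)
        key = thresholds[i] if i < len(thresholds) else None
        clusters.setdefault(key, []).append(query)
    return clusters

# Hypothetical predicted times for one configuration.
predicted = {"q1": 4.0, "q2": 9.5, "q3": 42.0, "q4": 60.0, "q5": 400.0}
clusters = threshold_clusters(predicted, [10, 60, 300])
```

A density-based variant would instead let the gaps in the predicted times decide where clusters begin and end, rather than fixing the thresholds up front.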

  • Complexity-Error Trade-off

    Data: 10GB TPC-H/SSB Benchmark


  • Translating Queries into Templates

    Concrete query Q:
      SELECT … FROM F JOIN D1 ON … WHERE …

    Follows template:
      SELECT < N attributes > FROM F JOINS < K Dimensions > WHERE < p % of ROWS >

  • Capability Dominance

    Template format:
      SELECT < N attributes > FROM F JOINS < K Dimensions > WHERE < p % of ROWS >

    Template T1 dominates T2 iff K1 >= K2 and p1 >= p2 and N1 >= N2

    Retain only root templates in each cluster
      – Enough to capture all query capabilities
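The dominance rule above translates directly into code. In the sketch below, the `Template` class and the example cluster are hypothetical; only the dominance condition itself comes from the slide.

```python
from typing import NamedTuple

class Template(NamedTuple):
    k: int      # number of joined dimension tables
    n: int      # number of projected attributes
    p: float    # percentage of rows selected

def dominates(t1, t2):
    """T1 dominates T2 iff K1 >= K2 and p1 >= p2 and N1 >= N2."""
    return t1.k >= t2.k and t1.p >= t2.p and t1.n >= t2.n

def root_templates(cluster):
    """Keep templates that no other distinct template in the cluster dominates."""
    unique = set(cluster)
    return sorted(
        t for t in unique
        if not any(dominates(other, t) for other in unique if other != t)
    )

cluster = [Template(1, 9, 100.0), Template(1, 8, 10.0),
           Template(0, 10, 100.0), Template(1, 9, 10.0)]
roots = root_templates(cluster)
```

Here two roots survive: the fact-only template with 10 attributes and the one-dimension join with 9 attributes at 100% of rows. Neither dominates the other, because one projects more attributes and the other joins more tables.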

  • Compressing Across Tiers

    • To reduce complexity, PSLA only shows what improves from one tier to the next

    [Plot: Time (s) per tier, with template labels in order of appearance: (Fact+1D, 9, 100%); (Fact+1D, 9, 100%); (Fact+1D, 8, 10%), (Fact, 10, 100%); (Fact+1D, 9, 10%); (Fact, 10, 100%)]
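A simplified way to picture cross-tier compression: walk the tiers from cheapest to most expensive and show a template in a tier only if its threshold is lower than anything a cheaper tier already offered. The tier contents and thresholds below are invented, and the actual pruning in PSLAManager may work differently.

```python
def compress_across_tiers(tiers):
    """Show, for each tier after the first, only templates that got faster.

    tiers: list of dicts ordered from cheapest to most expensive tier; each dict
    maps a template label to its advertised time threshold in seconds.
    """
    compressed = [dict(tiers[0])]
    best = dict(tiers[0])                      # best threshold seen so far
    for tier in tiers[1:]:
        shown = {t: s for t, s in tier.items() if s < best.get(t, float("inf"))}
        compressed.append(shown)
        best.update(shown)
    return compressed

# Hypothetical two-tier PSLA before compression.
tiers = [
    {"(Fact, 10, 100%)": 60, "(Fact+1D, 9, 100%)": 300},
    {"(Fact, 10, 100%)": 60, "(Fact+1D, 9, 100%)": 60},
]
result = compress_across_tiers(tiers)
```

In this example the second tier lists only the join template, because the fact-only template runs no faster there than in the cheaper tier.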

  • Summary: PSLA Generation

    [Figure: two-tier PSLA for a 10GB instance of the Star Schema Benchmark and the Myria DBMS service.]

    [Diagram labels: Schema, Workload Generation, Runtime Prediction, Workload Compression into PSLA (Query Clustering, Template Generation, Cross-Tier Pruning), PSLA]

  • Performance-Centric SLAs

    • PSLAManager System
      – Takes as input a user's database (schema & stats)
      – Generates a Personalized SLA (PSLA)
      – PSLAs sell performance levels rather than resources
    • PerfEnforce System
      – Takes as input PSLA and stream of queries
      – Elastically scales cluster to meet PSLA at low cost

  • Challenges

    • What makes a good PSLA?
    • How to generate a good PSLA?
    • How to guarantee runtimes in PSLA?

  • From PSLA to Performance Guarantees

    Once user purchases tier of service:

  • Challenges

    Query time estimates are inaccurate

    • Reason 1: Cardinality estimation is hard
      – Example: How many tuples after joining 3 tables?
    • Reason 2: Query time estimation is hard
      – Model predicts runtime from query plan features
      – Testing data can be very different from training

  • Solution

    Problem: How to guarantee PSLA runtimes?

    Solution: Scale cluster elastically
    • How?
    • When?

    [Diagram: SLA generator, DBMS, and PerfEnforce, with numbered steps 1 to 4; PerfEnforce issues grow and shrink actions]

  • Performance for Networked Storage

    [Plot: 12 workers, 100 random queries, 10GB data; panels: Shorter Queries, Longer Queries]

    • Networked storage can add latency (Amazon S3 or EBS with low IOPS)
    • Warm cache avoids the problem but will add variance
    • Best solution: local storage (ephemeral) or EBS with high IOPS
      – Cost of adding an instance: time to re-attach the EBS volume or re-ingest into local storage
      – This cost is on top of the cost of adding a new virtual machine (VM)
      – EBS solution adds the extra cost of paying for EBS volumes

  • Cluster Scaling Method 1

    • Shuffle to re-scale:
      – Start as many VMs as user purchases
      – Ingest data into local storage on these VMs
      – When need to resize: add VM & reshuffle data

    Slow to reconfigure
    • In contrast: 5 sec to attach and 10 sec to detach an EBS volume

  • Cluster Scaling Method 2

    • Separate data and compute:
      – Separate data and compute nodes
      – Scale compute nodes only

    Expensive

  • Cluster Scaling Method 3

    • Replicate data:
      – Spin up maximum number of VMs
      – Ingest data with careful replication
      – Schedule query on as few VMs as possible

    Fast and inexpensive
    • Consistent with EBS approach

    [Diagram: nodes N1, N2, N3, N4, N5a, N6a, N5b, N6b; plots: Query runtimes, Initial data preparation]

  • Cluster Scaling: Bottom Line

    • Step 1: Ingest data from Amazon S3 (or other) into fast networked storage such as Amazon EBS
      – Replicate such that subsets of volumes have all data
    • Step 2: Add and remove VMs as needed
      – When adding a VM, attach an EBS volume
      – When removing a VM, detach the EBS volume
    • Cost: Cost of VMs + EBS volumes
    • Scaling overhead: Time to add VM + attach EBS volume

  • Solution

    Problem: How to guarantee PSLA runtimes?

    Solution: Scale cluster elastically
    • How?
    • When?
      – Scheduling: How many workers for a query?
      – Provisioning: When to add/remove VMs?
    • Per-tenant cluster
    • Shared cluster

    [Diagram label: Virtual Machines]

  • Query Scheduling Goal

  • PerfEnforce Query Scheduling

    • Goal: Use just enough machines to meet SLA time
    • Reactive approaches:
      – Proportional Integral Controller
      – Reinforcement Learning: Multi-Armed Bandit
      – Do not work well because the best action depends on the incoming query rather than historical errors
    • Proactive approaches:
      – Contextual Multi-Armed Bandit: better because it takes query features into account to decide how to run a query
      – Online Learning: best because
        • Captures correlations between cluster sizes (= faster learning)
        • Offline model protects from major workload changes
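The proactive idea (look at the incoming query, pick the smallest cluster size predicted to meet the SLA, then learn from the observed runtime) can be sketched as below. This is a deliberately simple stand-in, not PerfEnforce's algorithm: the single multiplicative correction shared by all cluster sizes, the learning rate, and the toy offline model are all assumptions for illustration.

```python
CLUSTER_SIZES = [4, 8, 12, 16, 20, 24, 28, 32]

class OnlineScheduler:
    """Pick the smallest cluster size whose corrected prediction meets the SLA."""

    def __init__(self, offline_predict, learning_rate=0.5):
        self.offline_predict = offline_predict   # (features, size) -> seconds
        self.correction = 1.0                    # shared across all cluster sizes
        self.learning_rate = learning_rate

    def predict(self, features, size):
        return self.correction * self.offline_predict(features, size)

    def choose_size(self, features, sla_seconds):
        for size in CLUSTER_SIZES:
            if self.predict(features, size) <= sla_seconds:
                return size
        return CLUSTER_SIZES[-1]                 # best effort: largest cluster

    def observe(self, features, size, actual_seconds):
        # Move the correction toward the observed / offline-predicted ratio.
        ratio = actual_seconds / self.offline_predict(features, size)
        self.correction += self.learning_rate * (ratio - self.correction)

# Toy offline model (hypothetical): runtime shrinks linearly with cluster size.
scheduler = OnlineScheduler(lambda features, size: features["work"] / size)
query = {"work": 240.0}
first = scheduler.choose_size(query, sla_seconds=20.0)
scheduler.observe(query, first, actual_seconds=40.0)   # ran twice as slow as predicted
second = scheduler.choose_size(query, sla_seconds=20.0)
```

Because the correction is shared, one slow run on 12 machines also raises the predictions for every other size, which loosely mirrors the "captures correlations between cluster sizes" point, while the offline model stays as the baseline.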

  • Query Scheduling Results

    • Amazon EC2 with 4, 8, 12, 16, 20, 24, 28, or 32 machines; 100GB Star Schema Benchmark
    • Each point: one set of configuration parameters and the average from 10 workloads
    • Each workload: 100 random queries of a given type (large joins, small joins, short, long, etc.)

  • PerfEnforce Resource Provisioning

    • Adding and removing VMs takes time
    • Two deployment modes
      – Independent Tenants Mode: each tenant has own cluster
      – Multitenant Mode: set of tenants share pool of warm instances
    • Two algorithms
      – Resource Utilization
      – Simulation: learn past tenant behavior and resize cluster assuming same behavior in next window

  • Resource Provisioning - Utilization

    • SLA over-estimated query times: need to scale down
    • SLA under-estimated query times: need to scale up
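One way to read the rule above is as a comparison of observed runtimes against their SLA times over a recent window. The sketch below follows that reading; the window format, the averaging, and the two cut-off values are assumptions for this example, not parameters from the talk.

```python
def provisioning_decision(window, scale_up_at=1.0, scale_down_at=0.8):
    """Decide whether to add or remove a VM from recent query outcomes.

    window: list of (actual_seconds, sla_seconds) pairs for recent queries.
    A mean ratio above scale_up_at means queries run slower than the SLA
    promised; a mean ratio below scale_down_at means there is slack to give back.
    """
    ratios = [actual / sla for actual, sla in window]
    average = sum(ratios) / len(ratios)
    if average > scale_up_at:
        return "add_vm"
    if average < scale_down_at:
        return "remove_vm"
    return "hold"
```

For example, `provisioning_decision([(12, 10), (15, 10)])` asks for another VM, while `provisioning_decision([(5, 10), (6, 10)])` releases one. The gap between the two cut-offs keeps the cluster from oscillating when runtimes sit close to the SLA.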

  • Resource Provisioning - Simulation

  • Conclusion

    • Can we make cloud services easier to use?
    • Can we sell performance rather than resources?
    • Yes, with PSLAManager & PerfEnforce
      – PSLAManager generates PSLAs
      – PerfEnforce enforces runtimes through scaling

    Source code available on the Myria website: http://myria.cs.washington.edu