Detecting Discontinuities in Large-Scale Systems

Haroon Malik (Postdoctoral Fellow), Ian John Davis (Research Associate), Michael Godfrey (Associate Professor): Software Architecture Group (SWAG), University of Waterloo, Waterloo, Canada
Douglas Neuse (Infrastructure Management), Serge Mankovskii (Research Staff): CA Technologies, USA




Data Centers Require Good Forecasts

To ensure SLAs are met while minimizing infrastructure costs, data center operators need to know the expected workload ahead of time, via both short- and long-term forecasts. Operators use short-term forecasts (based on a week to a month of the data center's recent performance history) for dynamic provisioning and placement of tasks, especially for load balancing to avoid performance bottlenecks. Long-term forecasting of the workload, on the other hand, is necessary for capacity planning, to ensure that the cloud infrastructure supports growth and the evolution of client requirements.

The accuracy of the forecasting results depends on the quality of the performance data (i.e., performance counters such as CPU utilization, bandwidth consumption, network traffic, and disk IOPS) fed to the forecasting algorithms.

In the next 20 minutes, I will walk you through the forecasting steps for a typical data center, describe the challenge data centers face in deriving quality data for forecasts from counters distributed across thousands of machines, explain our proposed methodology to overcome this challenge, and share some of the results we obtained.


Forecasting Steps

1. Determine purpose
2. Select technique
3. Prepare data
4. Prepare forecast
5. Monitor forecast


Initially a department, team, or stakeholder requests a forecast. Usually, a dedicated group or team of analysts is responsible for handling the forecast requisition. The analysts gather preliminary information from the requestor: a) the forecast's purpose (e.g., operations wants to know the expected workload volume on a daily to weekly basis for load balancing and dynamic placement of machines, whereas marketing and sales are more concerned with growth in customers, scheduling, and purchases), and b) a time horizon for the forecast (seconds, hours, days, months, quarters, or years).


Based on the forecast horizon and the requestor's purpose, the analyst selects an appropriate technique (e.g., moving averages with exponential smoothing for short-term forecasts, and trend equations for long-term forecasts). Often, the analyst uses more than one forecasting technique to obtain independent forecasts. If the selected techniques produce approximately the same results, this gives increased confidence; disagreement among the forecasts indicates that the analyst needs to revisit the choice of technique.
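As a concrete illustration of the short-term techniques mentioned above, a minimal exponential-smoothing forecaster might look like this (a sketch in Python; the function names and the choice of alpha are illustrative, not from the talk):

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: each smoothed value blends the
    newest observation with the previous smoothed value."""
    smoothed = [series[0]]  # seed with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

def one_step_forecast(series, alpha=0.3):
    """The forecast for the next period is the last smoothed value."""
    return exponential_smoothing(series, alpha)[-1]
```

A higher alpha reacts faster to recent changes; a lower alpha smooths out noise, which is the trade-off an analyst tunes for a given workload.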


This is the most important and expensive forecasting step for analysts; poor forecasts can result from inadequate data preparation. In this step, analysts sanitize and preprocess the data to make it suitable for the forecasting techniques selected in the previous step. During sanitation, missing, ignorable, erroneous, and empty performance counter variables are treated [2-4]. Counter data is missing when a performance monitor fails to record an instance of a performance counter; a counter is empty when a resource cannot start the required service. Analysts then preprocess the data using custom-written scripts to aggregate performance counters across several subsystems of a data center and derive customer-perceived counters [5] such as transaction response time, latency, user wait time, and perceived throughput. These values capture users' interaction with the system as their transactions/requests/jobs flow through the various subsystems of the data center. Preprocessing also involves putting the data into the format required by the selected forecast techniques; therefore, analysts preprocess (i.e., extrapolate, scale, and standardize) the data accordingly.
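A tiny sketch of the sanitation and standardization just described (illustrative Python, not the analysts' actual scripts): forward-filling missing counter samples, and z-scoring counters so that different numerical ranges become comparable:

```python
def fill_missing(values):
    """Replace None (missing counter samples) with the last seen value;
    leading gaps fall back to the first recorded value."""
    first = next(v for v in values if v is not None)
    out, last = [], first
    for v in values:
        last = v if v is not None else last
        out.append(last)
    return out

def standardize(values):
    """z-score standardization so counters measured on different
    numerical ranges can be compared and aggregated."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5 or 1.0  # guard against constant counters
    return [(v - mean) / std for v in values]
```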


In this step, the analyst uses the prepared time-series training data and the selected forecast technique to create a forecast model with a minimal error rate, i.e., whose predicted values are close to the actual time-series values, without either underfitting or overfitting. The analyst tunes the parameters of the forecast technique several times to find the form of the model that best satisfies the requestor's forecast objective.
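The tuning loop described above can be sketched as a small grid search that picks the smoothing parameter minimizing the one-step-ahead error (illustrative Python; the grid and the error metric are assumptions, not details from the talk):

```python
def one_step_errors(series, alpha):
    """Walk the series, forecasting each point from the smoothed history,
    and collect the squared one-step-ahead errors."""
    level = series[0]
    errors = []
    for x in series[1:]:
        errors.append((x - level) ** 2)  # forecast for this step is the current level
        level = alpha * x + (1 - alpha) * level
    return errors

def tune_alpha(series, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick the smoothing parameter with the smallest mean squared
    one-step-ahead error: a crude stand-in for the analyst's tuning loop."""
    return min(grid, key=lambda a: sum(one_step_errors(series, a)) / (len(series) - 1))
```

On a steadily trending series, the search favors a large alpha, since a responsive model tracks the trend better than a heavily smoothed one.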



This step is composed of two substeps: active and passive monitoring of the forecasts. In active monitoring, an analyst validates a forecast for a predetermined period of time before it is deployed in production or the model is handed over to the requestor. The analyst verifies assumptions, compares the forecasted values (transaction volume, workload, or resource utilization of machines) to the actual values observed in the data center, and identifies any external or internal events that affect the results of the forecast. Once the forecasting model is communicated to the requestor, a recurring monitoring checkpoint (i.e., monthly, quarterly, or every six months) is established to look for any evidence of significant variance between the actual and predicted results, and to identify deviation factors such as discontinuities. Any variance greater than the allowed maximum is investigated, and the forecast model is either adjusted to accommodate the variance or retrained to account for the discontinuity.
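The monitoring checkpoint can be sketched as a simple variance check between forecasted and observed values (illustrative Python; the 20% relative-variance threshold is an assumed example, not a value from the talk):

```python
def checkpoint(actuals, forecasts, max_rel_variance=0.2):
    """Return the indices where |actual - forecast| / |forecast| exceeds
    the allowed variance; any hit triggers investigation and possibly
    retraining of the forecast model."""
    return [i for i, (a, f) in enumerate(zip(actuals, forecasts))
            if abs(a - f) / abs(f) > max_rel_variance]
```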


Challenges: (a) large volumes of performance data, (b) limited time, (c) domain knowledge.


Discontinuities vs. Anomalies

Discontinuities
Reasons:
- Company merger
- Hardware upgrade
- Software change (new release)
- Workload change
- Promotional customers


[Figure: symptoms of discontinuities, panels (a)-(d); panel (b) marks transition points T1-T3, panel (c) highlights a transition period]

Why Care About Discontinuities?
- Measurements taken before the discontinuity can skew the forecast.
- Detecting a discontinuity provides analysts with a reference point at which to retrain their forecasting models and make the necessary adjustments.
- We propose an automated approach to help analysts identify discontinuities in performance data.

Steps Involved in the Proposed Approach

Input: performance logs
Approach: 1. Data preparation -> 2. Metric selection -> 3. Anomaly detection -> 4. Discontinuity identification
Output: report of discontinuities

1. Data Preparation
The performance logs from production contain noise:
- Missing counters
- Empty counters
- Different numerical ranges


We used statistical techniques to filter the noise in the data.

2. Metric Selection
Production logs contain thousands of counters that are:
- Highly correlated
- Invariant
- Configuration constants

We used Principal Component Analysis (PCA) to select the important metrics.
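A minimal sketch of PCA-based metric selection (illustrative Python using NumPy; ranking counters by the magnitude of their loadings on the top principal components is one common reduction, not necessarily the paper's exact procedure):

```python
import numpy as np

def select_metrics(X, names, k=2):
    """Rank performance counters (columns of X) by the magnitude of their
    loadings on the top-k principal components; correlated or constant
    counters contribute little and fall to the bottom of the ranking."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)          # covariance between counters
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k components
    importance = np.abs(top).sum(axis=1)    # loading magnitude per counter
    order = np.argsort(importance)[::-1]
    return [names[i] for i in order]
```

In this sketch a configuration constant (a counter that never varies) gets zero loading on every principal component and is ranked last, matching the intuition behind the filtering above.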

3. Anomaly Detection (Quadratic Modelling)
- A quadratic function that minimizes the least-squares error (LSE)
- A greedy algorithm to replace performance counter time-series data
- A cost metric to reflect the goodness of fit


The largest costs suggest the positions in the time series where the most egregious anomalies and discontinuities occur.
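The cost idea can be sketched by fitting a least-squares quadratic to sliding windows and taking the residual as each window's cost (illustrative Python; the talk's greedy replacement algorithm is more elaborate than this sketch):

```python
import numpy as np

def window_costs(series, width=5):
    """Fit a least-squares quadratic to each sliding window and record
    the residual sum of squares as the window's cost; windows straddling
    an abrupt change fit poorly and receive the largest costs."""
    y = np.asarray(series, dtype=float)
    x = np.arange(width)
    costs = []
    for start in range(len(y) - width + 1):
        seg = y[start:start + width]
        coeffs = np.polyfit(x, seg, 2)          # degree-2 LSE fit
        resid = seg - np.polyval(coeffs, x)
        costs.append(float((resid ** 2).sum()))
    return costs
```

On a series with a level shift, the windows covering the jump dominate the cost ranking, which is exactly the signal used to localize anomalies and discontinuities.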

3. Anomaly Detection (Quadratic Model)

[Figure: quadratic model fitted to a performance counter time series, and the resulting cost]

4. Discontinuity Identification
- Distribution comparison
- Difference of means between two populations
- Quantifying the difference of means between two populations


Difference of Mean Between Two Populations

[Figure: a % CPU utilization time series annotated with two anomalies, their transition periods, a discontinuity, and the corresponding cost values]

Wilcoxon rank-sum test, H0: the two distributions are the same.
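The distribution comparison above can be sketched with a hand-rolled Wilcoxon rank-sum test using the normal approximation (illustrative Python; ties are ignored in this sketch, and production code would use a library implementation):

```python
from statistics import NormalDist

def rank_sum_test(x, y):
    """Wilcoxon rank-sum test: under H0 the two samples (e.g., counter
    values before and after a candidate discontinuity) come from the same
    distribution. Returns a two-sided p-value via the normal approximation."""
    combined = sorted((v, 0 if i < len(x) else 1)
                      for i, v in enumerate(list(x) + list(y)))
    # rank-sum of the first sample, ranks 1..n (no tie correction here)
    ranks_x = sum(rank for rank, (_, src) in enumerate(combined, 1) if src == 0)
    n1, n2 = len(x), len(y)
    mu = n1 * (n1 + n2 + 1) / 2                    # expected rank sum under H0
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5  # its standard deviation
    z = (ranks_x - mu) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

Two well-separated populations yield a tiny p-value (reject H0, a shift has occurred); interleaved populations yield a large one.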

Quantify the Difference of Mean Between Two Populations

Analysts set the effect size based on their domain trends and the required granularity. It acts as a tunable threshold that reduces false-positive identifications of discontinuities by our approach.

Cohen's d categories: large, medium, small, trivial.
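Cohen's d and its conventional category thresholds can be computed as follows (illustrative Python; the 0.2/0.5/0.8 cut-offs are the standard conventions, used here as the tunable threshold described above):

```python
def cohens_d(a, b):
    """Cohen's d effect size between two populations: difference of
    means divided by the pooled standard deviation."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((v - ma) ** 2 for v in a) / (len(a) - 1)
    vb = sum((v - mb) ** 2 for v in b) / (len(b) - 1)
    pooled = (((len(a) - 1) * va + (len(b) - 1) * vb)
              / (len(a) + len(b) - 2)) ** 0.5
    return (ma - mb) / pooled

def effect_label(d):
    """Conventional bands: trivial < 0.2 <= small < 0.5 <= medium < 0.8 <= large."""
    d = abs(d)
    return ("trivial" if d < 0.2 else "small" if d < 0.5
            else "medium" if d < 0.8 else "large")
```

An analyst wanting fewer false alarms would only flag candidate discontinuities whose effect size is, say, "large".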

Subjects of Study

- DVD Store: open-source system; domain: e-commerce; type of data: performance tests
- Simulation: domain: cloud computing; type of data: synthetic data
- Industrial system: domain: cloud computing; type of data: production data

Fault Injection

Category: Anomalies
- CPU stress
- Memory stress
- Interfering workload

Category: Discontinuities
- Workload as a multiplicative factor
- Change in transaction pattern
- Hardware & software upgrade

We had no prior knowledge of the underlying faults in the data obtained from the industrial system.

Results: 0.92, 0.72, 0.83

The proposed technique has high accuracy in detecting discontinuities.

Limitations of Our Approach: Sensitivity

- We can tune the sensitivity of our approach by adjusting the effect size; using a large effect size reduces false alarms, but may result in an analyst overlooking significant discontinuities.
- Analysts have to conduct multiple experiments.
- Determining a threshold value is a problem: an automated technique generally cannot decide whether an identified discontinuity is important or is noise.

Limitations of Our Approach: Distinguishability

- The approach cannot distinguish between overlapping discontinuities and different types of discontinuities.
- Analysts have to manually inspect each identified discontinuity and take action.


QUESTIONS

Haroon Malik, [email protected]
