ruxit theme 2014.05.15 The definition of normal An introduction and guide to anomaly detection Alois Reitbauer, ruxit @aloisreitbauer

The definition of normal - An introduction and guide to anomaly detection


Page 1: The definition of normal - An introduction and guide to anomaly detection

ruxit theme 2014.05.15

The definition of normal
An introduction and guide to anomaly detection

Alois Reitbauer, ruxit
@aloisreitbauer

Page 2: The definition of normal - An introduction and guide to anomaly detection

Some background
Who I am and what I do

Page 3: The definition of normal - An introduction and guide to anomaly detection


Page 4: The definition of normal - An introduction and guide to anomaly detection


Page 5: The definition of normal - An introduction and guide to anomaly detection

Anomaly Detection
What is an anomaly anyways?

Page 6: The definition of normal - An introduction and guide to anomaly detection

What is an anomaly?

In data mining, anomaly detection (or outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset. Typically the anomalous items will translate to some kind of problem such as ……….

Source: Wikipedia

Page 7: The definition of normal - An introduction and guide to anomaly detection

How many metrics would we have to look at?

3 metrics per service
5 metrics per host
5 metrics per runtime

40 services × 3 = 120 metrics
20 hosts × 5 = 100 metrics
40 runtimes × 5 = 200 metrics

Total: 420 metrics

Page 8: The definition of normal - An introduction and guide to anomaly detection

We cannot watch 400+ metrics
So we need to find ways to automate finding anomalies

Page 9: The definition of normal - An introduction and guide to anomaly detection

Anomaly Detection Workflow

[Diagram. Nodes: Historic Data, “Normal” Model, New Data, Hypothesis, Likeliness, Judgement, Anomaly? Edge labels: update, calculate, derive, test, produces, defines.]

Page 10: The definition of normal - An introduction and guide to anomaly detection

We will look at three types of data

Response Times: Did our response times increase significantly?

Error Rates: Did the error rate of any of our services change?

Load: Is there anything unusual happening to our service load?

Page 11: The definition of normal - An introduction and guide to anomaly detection

Finding error rate anomalies
Are we having more errors than usual?

Page 12: The definition of normal - An introduction and guide to anomaly detection

How can we get our baseline?

Average or Mean: Easy to calculate, but does not learn over time.

Median: Needs more raw data than the average, but is more precise. Does not learn over time either.

Exponential Smoothing: Easy to calculate and learns over time.

Page 13: The definition of normal - An introduction and guide to anomaly detection

Using exponential smoothing for the baseline

Source: Wikipedia
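
The formula shown on this slide did not survive extraction. As a minimal sketch, assuming the simple (single) form of exponential smoothing, s_t = α · x_t + (1 − α) · s_{t−1} with s_0 = x_0, a baseline could be computed as follows. The function name, the sample error rates and the choice α = 0.5 are illustrative assumptions, not values from the talk.

```python
def exponential_smoothing(values, alpha):
    """Return the smoothed series s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        if not smoothed:
            smoothed.append(float(x))  # s_0 = x_0
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical error rates in percent, one value per measurement interval.
error_rates = [3.0, 3.0, 5.0, 3.0]
baseline = exponential_smoothing(error_rates, alpha=0.5)
print(baseline)
```

A larger α makes the baseline follow recent data more closely; a smaller α makes it change more slowly.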

Page 14: The definition of normal - An introduction and guide to anomaly detection


Example

Page 15: The definition of normal - An introduction and guide to anomaly detection

Is this an anomaly?

Our Observation:

Typical error rate of 3 percent at 10,000 transactions/min

Current System Behavior:

During the night we see 5 errors in 100 requests

Page 16: The definition of normal - An introduction and guide to anomaly detection

Binomial Distribution
Tells us how likely it is to see n successes in a certain number of trials

Page 17: The definition of normal - An introduction and guide to anomaly detection

Applying the binomial distribution to our problem

[Chart: likeliness of at least n errors, for n = 1 to 19]

18 % probability to see 5 or more errors
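
The 18 % figure can be checked with a few lines of code. This is a minimal sketch, assuming each of the 100 night-time requests fails independently with the baseline probability of 3 percent; the function name `prob_at_least` is an illustrative choice.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), as one minus the lower tail."""
    lower_tail = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - lower_tail


# 5 or more errors in 100 requests at a 3 percent baseline error rate.
p_five_or_more = prob_at_least(5, n=100, p=0.03)
print(f"{p_five_or_more:.1%}")
```

The result is roughly 18 percent, so seeing 5 errors in 100 requests is not unusual for a service whose normal error rate is 3 percent.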

Page 18: The definition of normal - An introduction and guide to anomaly detection

Derive an anomaly from a forecast
What is unlikely enough to be interpreted as an anomaly?

Page 19: The definition of normal - An introduction and guide to anomaly detection


95 % Probability Window

Borrowing from the Standard Deviation
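
One way to turn a 95 % window into an alerting rule for the error-rate example is sketched below. It assumes a one-sided test (only too many errors are interesting) and a binomial model with independent requests; both are assumptions for illustration, as are the function names.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


def alert_threshold(n, p, alpha=0.05):
    """Smallest error count k whose tail probability P(X >= k) is below alpha."""
    for k in range(n + 1):
        if prob_at_least(k, n, p) < alpha:
            return k
    return n + 1  # no count within n trials is unlikely enough


# 100 requests at a 3 percent baseline error rate.
threshold = alert_threshold(n=100, p=0.03)
print(threshold)
```

Any observed error count at or above `threshold` falls outside the 95 % window and would be treated as an anomaly under these assumptions.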

Page 20: The definition of normal - An introduction and guide to anomaly detection

Response Time Anomalies
Are our response times higher than usual?

Page 21: The definition of normal - An introduction and guide to anomaly detection

Challenges in finding response time anomalies

Page 22: The definition of normal - An introduction and guide to anomaly detection

Data representation is important

Page 23: The definition of normal - An introduction and guide to anomaly detection

Proper data representation with the median

Page 24: The definition of normal - An introduction and guide to anomaly detection

If our data were normally distributed …

Mean: 500 ms
Std. Dev.: 100 ms

68 %: 400 ms – 600 ms
95 %: 300 ms – 700 ms
99.7 %: 200 ms – 800 ms

[Chart: normal curve over response times from 100 ms to 900 ms]
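
The share of values inside each window can be verified with Python's standard library. This is a small sketch using the slide's mean of 500 ms and standard deviation of 100 ms; the helper name `coverage` is an illustrative choice.

```python
from statistics import NormalDist

MEAN_MS = 500
STD_DEV_MS = 100
dist = NormalDist(mu=MEAN_MS, sigma=STD_DEV_MS)


def coverage(k):
    """Share of values within k standard deviations of the mean."""
    return dist.cdf(MEAN_MS + k * STD_DEV_MS) - dist.cdf(MEAN_MS - k * STD_DEV_MS)


for k in (1, 2, 3):
    low, high = MEAN_MS - k * STD_DEV_MS, MEAN_MS + k * STD_DEV_MS
    print(f"{low} ms to {high} ms: {coverage(k):.1%}")
```

The catch, as the title of this slide hints, is that real response times are usually not normally distributed, so these windows only hold as an idealization.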

Page 25: The definition of normal - An introduction and guide to anomaly detection

However, we can generalize the model

μ corresponds to the median: 50 percent of values lie below it
μ + 2σ corresponds to roughly the 97th percentile: 97.7 percent of values lie below it

Page 26: The definition of normal - An introduction and guide to anomaly detection

Is this an anomaly?

Our Observation:

Usually we see a median response time of 300 ms.

Current System Behavior:

During the night, with low traffic, response times go up to 600 ms.

Page 27: The definition of normal - An introduction and guide to anomaly detection

Testing against new data

Our median response time is 300 ms

and we measure

200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms

Page 28: The definition of normal - An introduction and guide to anomaly detection

Using the binomial distribution on the median

Check all values above 300 ms

200 ms, 400 ms, 350 ms, 200 ms, 600 ms, 500 ms, 150 ms, 350 ms, 400 ms, 600 ms

7 values are higher than the median. Is this normal?

Page 29: The definition of normal - An introduction and guide to anomaly detection

Applying percentile drift detection

We have a 50 percent likelihood of seeing a value above the median.

How likely is it that 7 out of 10 samples are higher?
The probability is 17 percent, so we should not alert.
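
The check on this slide can be sketched in code: count the samples above the baseline median, then ask how likely a count at least that high is when each sample has a 50 percent chance of exceeding the median. The 5 percent alerting cutoff and all names below are assumptions for illustration; the slide only states that 17 percent is not unlikely enough to alert.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))


baseline_median_ms = 300
samples_ms = [200, 400, 350, 200, 600, 500, 150, 350, 400, 600]

# By definition, a sample exceeds the median with probability 0.5.
above = sum(1 for s in samples_ms if s > baseline_median_ms)
p_value = prob_at_least(above, len(samples_ms), 0.5)

ALERT_CUTOFF = 0.05  # assumed cutoff, not taken from the talk
alert = p_value < ALERT_CUTOFF
print(above, f"{p_value:.1%}", alert)
```

The same test works for any percentile: for the 97th percentile, replace 0.5 with 0.03 and count the samples above that percentile's baseline value.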

Page 30: The definition of normal - An introduction and guide to anomaly detection

Load Anomalies
Are we seeing unusually high or low load?

Page 31: The definition of normal - An introduction and guide to anomaly detection

We will look at three types of data

Seasonality: Load is often directly related to time-based usage.

Trend: Growth patterns are not necessarily the source of a problem.

We need a different approach

Page 32: The definition of normal - An introduction and guide to anomaly detection

Holt-Winters Seasonal Forecasting
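
The slide content for this method did not survive extraction. As a rough sketch, the additive variant of Holt-Winters keeps three smoothed components (level, trend and one seasonal offset per position in the season) and combines them into a forecast. The initialization used here (first-season mean, average change between the first two seasons) is one simple choice among several, and the function name, parameters and sample load values are illustrative assumptions.

```python
def holt_winters_additive(series, season_len, alpha, beta, gamma, horizon):
    """Forecast `horizon` points past the end of `series` (additive form)."""
    if len(series) < 2 * season_len:
        raise ValueError("need at least two full seasons of data")

    first = series[:season_len]
    second = series[season_len:2 * season_len]
    level = sum(first) / season_len
    trend = (sum(second) - sum(first)) / season_len**2
    seasonal = [x - level for x in first]

    for t in range(season_len, len(series)):
        x = series[t]
        s = seasonal[t % season_len]
        prev_level = level
        # Level: deseasonalized observation vs. previous level plus trend.
        level = alpha * (x - s) + (1 - alpha) * (prev_level + trend)
        # Trend: change in level vs. previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
        # Seasonal offset for this position in the season.
        seasonal[t % season_len] = gamma * (x - level) + (1 - gamma) * s

    n = len(series)
    return [
        level + h * trend + seasonal[(n + h - 1) % season_len]
        for h in range(1, horizon + 1)
    ]


# Hypothetical load with a repeating four-step pattern and no trend.
load = [10, 20, 30, 20] * 3
forecast = holt_winters_additive(load, 4, alpha=0.5, beta=0.5, gamma=0.5, horizon=4)
print(forecast)
```

An observed load value that is far from the forecast for its time slot is then a candidate anomaly, while the regular daily or weekly pattern and steady growth are absorbed by the seasonal and trend components.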

Page 33: The definition of normal - An introduction and guide to anomaly detection


Example

Page 34: The definition of normal - An introduction and guide to anomaly detection

Causality Analysis of Anomalies
How to derive meaningful information from anomalies.

Page 35: The definition of normal - An introduction and guide to anomaly detection

Anomalies vs. Health

Anomaly: A system does not show the expected behavior.

Health: A system does not operate within well-defined boundaries.

Page 36: The definition of normal - An introduction and guide to anomaly detection

Health and Anomaly Matrix

               Healthy              Unhealthy
No Anomalies   Operating normally   Unstable system
Anomalies      Resilient            Operational issues

Page 37: The definition of normal - An introduction and guide to anomaly detection

Judging Anomalies by Impact

1st Degree Anomaly - CPU saturation on a host, or similar

2nd Degree Anomaly - Application functionality affected

3rd Degree Anomaly - Externally visible effects that users notice

Page 38: The definition of normal - An introduction and guide to anomaly detection

Relationships of anomalies
Transferring system knowledge to monitoring systems

Page 39: The definition of normal - An introduction and guide to anomaly detection

The model

Page 40: The definition of normal - An introduction and guide to anomaly detection

Interpretation with expert knowledge

Strong Relationship: Response time slowdown impacted by CPU saturation

Potential Relationship: Response time slowdown potentially impacted by code deployment

No Relationship: CPU saturation not impacted by load drop

Page 41: The definition of normal - An introduction and guide to anomaly detection

Distinguish Impact from Cause
How to infer root cause information from monitoring data

Page 42: The definition of normal - An introduction and guide to anomaly detection

Automated Analysis of Problems

Service slowdown

Page 43: The definition of normal - An introduction and guide to anomaly detection

Automated Analysis of Problems

Service slowdown
Dependent services slow down

Page 44: The definition of normal - An introduction and guide to anomaly detection

Automated Analysis of Problems

Service slowdown
Dependent services slow down
Users are affected

Page 45: The definition of normal - An introduction and guide to anomaly detection

Automated Analysis of Problems

Service slowdown
Dependent services slow down
Users are affected

Analyze dependencies

Page 46: The definition of normal - An introduction and guide to anomaly detection

Automated Analysis of Problems

Service slowdown
Dependent services slow down
Users are affected

Analyze dependencies
Exclude non-relevant services

Page 47: The definition of normal - An introduction and guide to anomaly detection

Automated Analysis of Problems

Service slowdown
Dependent services slow down
Users are affected

Analyze dependencies
Exclude non-relevant services
Follow causality chain

Page 48: The definition of normal - An introduction and guide to anomaly detection

Automated Analysis of Problems

Service slowdown
Dependent services slow down
Users are affected

Analyze dependencies
Exclude non-relevant services
Follow causality chain
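
The three analysis steps can be illustrated with a small graph walk. This is only a sketch under simplifying assumptions, not the approach used by any particular product: the dependency map, the service names and the function name are all hypothetical, and a service counts as relevant only if it currently shows an anomaly.

```python
def root_cause_candidates(entry, depends_on, anomalous):
    """Follow anomalous dependencies from `entry`; return the deepest ones."""
    candidates = []
    seen = set()
    stack = [entry]
    while stack:
        service = stack.pop()
        # Exclude non-relevant services: healthy or already visited.
        if service in seen or service not in anomalous:
            continue
        seen.add(service)
        # Analyze dependencies: which callees are anomalous as well?
        bad_deps = [d for d in depends_on.get(service, []) if d in anomalous]
        if bad_deps:
            stack.extend(bad_deps)  # follow the causality chain downwards
        else:
            candidates.append(service)  # nothing below explains the problem
    return sorted(candidates)


# Hypothetical topology: which service calls which.
depends_on = {
    "frontend": ["checkout", "search"],
    "checkout": ["payment", "db"],
    "search": ["index"],
}
anomalous = {"frontend", "checkout", "db"}
print(root_cause_candidates("frontend", depends_on, anomalous))
```

In this example the user-facing slowdown on `frontend` is traced through `checkout` to `db`, while `search`, `index` and `payment` are excluded because they behave normally.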

Page 49: The definition of normal - An introduction and guide to anomaly detection

Real World Example

Page 50: The definition of normal - An introduction and guide to anomaly detection

Alois Reitbauer
[email protected]
blog.ruxit.com