Anton Lebedevich

Practical Statistics for Finding Anomalies in Load Testing and Production

A Lot of Graphs

Contents

● Real Data and Ideal Models
● Load Testing (Tuning)
● Production Monitoring
● Correlation
● Tools

Real Data vs. Ideal Models

● noise (human actions)
● outliers
● missing data
● different resolutions
● counter update frequencies
● quantization
● not Gaussian and not random walk
● what is normal for the system?

Outlier vs. Changepoint



Resolution

● >=5min
● 1min
● 10s
● <=1s

Load Testing (Tuning)

● goal
● beware of transient response
● find failure
● filter data
● find bottleneck and fix
● rinse and repeat

Transient Response

Failure on Target Metric



Filtration

● constants
● coefficient of variation (sd/mean)
● apply system knowledge
    – tasks migrated by scheduler
    – dependent (disk used/free)
    – interface traffic < 10 packets/s
    – load average < 0.5
    – …
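
A minimal sketch of the first two filters, assuming each metric is a plain list of samples with None for missing points. The sd/mean ratio is the coefficient of variation. The metric names and the min_cv threshold are hypothetical and not from the talk.

```python
from statistics import mean, stdev


def is_informative(values, min_cv=0.05):
    """Return True if a series varies enough to be worth analysing.

    values may contain None for missing samples.
    min_cv is a hypothetical threshold on sd/mean.
    """
    present = [v for v in values if v is not None]
    if len(present) < 2:
        return False          # missing or too short
    sd = stdev(present)
    if sd == 0:
        return False          # constant series
    m = mean(present)
    if m == 0:
        return True           # sd/mean is undefined, but the series does vary
    return sd / abs(m) >= min_cv


# Hypothetical metrics collected during one test run.
metrics = {
    "cpu.user": [12.0, 35.0, 28.0, 60.0, 41.0],
    "disk.total": [500.0] * 5,                          # constant
    "eth1.packets": [None] * 5,                         # missing
    "mem.used": [1000.0, 1001.0, 1000.0, 999.0, 1000.0],  # almost constant
}
kept = sorted(name for name, v in metrics.items() if is_informative(v))
```

Rules based on system knowledge (for example, dropping interfaces below 10 packets/s) would be separate per-metric checks layered on top of this.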

Missing or Constant

Changed Mean

Nonlinear

ndiffs: difference the series until the KPSS test reports it as stationary
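
The ndiffs idea can be sketched as follows. This is a standard-library approximation and not the R implementation: the KPSS statistic is the level-stationarity form with a Bartlett-weighted long-run variance, the lag rule is an assumption, and 0.463 is the usual 5% critical value.

```python
KPSS_CRIT_5PCT = 0.463  # 5% critical value, level stationarity


def kpss_level_stat(x, lags=None):
    """KPSS statistic for the null hypothesis 'x is stationary around a mean'."""
    n = len(x)
    if lags is None:
        lags = int(4 * (n / 100.0) ** 0.25)   # assumed bandwidth rule
    m = sum(x) / n
    e = [v - m for v in x]

    # Sum of squared partial sums of the residuals.
    partial = 0.0
    sum_sq = 0.0
    for v in e:
        partial += v
        sum_sq += partial * partial

    # Long-run variance with Bartlett weights.
    lrv = sum(v * v for v in e) / n
    for k in range(1, lags + 1):
        weight = 1.0 - k / (lags + 1.0)
        gamma_k = sum(e[t] * e[t - k] for t in range(k, n)) / n
        lrv += 2.0 * weight * gamma_k

    if lrv < 1e-12:
        return 0.0        # (numerically) constant: treat as stationary
    return sum_sq / (n * n * lrv)


def diff(x):
    """First difference of a series."""
    return [b - a for a, b in zip(x, x[1:])]


def ndiffs(x, max_d=2):
    """Number of differences applied before KPSS stops rejecting stationarity."""
    d = 0
    while d < max_d and len(x) > 3 and kpss_level_stat(x) > KPSS_CRIT_5PCT:
        x = diff(x)
        d += 1
    return d


# A deterministic trending series needs at least one difference.
trend = [0.5 * t + (t % 7) for t in range(200)]
d_trend = ndiffs(trend)
```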

Production Classics

● Control charts
    – fixed window moving average (MA)
    – exponentially weighted moving average (EWMA)
● Holt-Winters

Test Subject

Moving Average

Exponentially-Weighted Moving Average

Control Charts

● stationary
● Gaussian/Poisson
● outliers
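
A minimal EWMA control chart sketch, assuming a clean baseline window is available to estimate the mean and standard deviation. The smoothing factor 0.3 and the 3-sigma width are conventional illustrative choices and not values from the talk.

```python
import math
from statistics import mean, stdev


def ewma_alarms(baseline, series, lam=0.3, width=3.0):
    """Return indexes in series where the EWMA leaves its control limits."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    z = mu                      # the EWMA starts at the baseline mean
    alarms = []
    for i, x in enumerate(series):
        z = lam * x + (1.0 - lam) * z
        t = i + 1
        # Exact (time-varying) EWMA variance factor; tends to lam / (2 - lam).
        factor = lam / (2.0 - lam) * (1.0 - (1.0 - lam) ** (2 * t))
        if abs(z - mu) > width * sigma * math.sqrt(factor):
            alarms.append(i)
    return alarms


baseline = [10, 11, 9] * 10
series = [10, 11, 9] * 3 + [15] * 5     # level shift at index 9
alarms = ewma_alarms(baseline, series)
```

The limits are only meaningful under the assumptions listed on the slide: a stationary, roughly Gaussian series. Outliers in the baseline inflate sigma and widen the limits.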

Two Weeks

Holt-Winters

triple exponential smoothing
● needs a lot of data
● sensitive to outliers
● can't handle 3 seasons + holidays
● overfitting
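
For reference, one common additive form of triple exponential smoothing, written as a sketch that returns one-step-ahead forecast errors. The initialisation from the first two seasons is deliberately naive and the smoothing parameters are hypothetical defaults.

```python
def holt_winters_residuals(x, period, alpha=0.3, beta=0.05, gamma=0.2):
    """One-step-ahead forecast errors of additive Holt-Winters.

    The first `period` entries are 0.0 because that season is used
    only for initialisation.
    """
    if len(x) < 2 * period:
        raise ValueError("need at least two full seasons")
    first = sum(x[:period]) / period
    second = sum(x[period:2 * period]) / period
    level = first
    trend = (second - first) / period
    season = [v - first for v in x[:period]]

    residuals = [0.0] * period
    for t in range(period, len(x)):
        s_old = season[t % period]
        forecast = level + trend + s_old
        residuals.append(x[t] - forecast)
        # Update level, trend and the seasonal component for this slot.
        new_level = alpha * (x[t] - s_old) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        season[t % period] = gamma * (x[t] - new_level) + (1 - gamma) * s_old
        level = new_level
    return residuals


# A perfectly periodic series with one injected spike at index 17.
x = [1.0, 3.0, 2.0, 6.0] * 6
x[17] += 10.0
res = holt_winters_residuals(x, period=4)
```

The residuals after the spike are no longer zero: the outlier leaks into the level and seasonal estimates, which is the sensitivity noted on the slide.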

Time Shifting

Production Experimental

● autocorrelation
● non-parametric 2 sample tests

Autocorrelation

Ljung-Box Test
● non-stationary
● mean shift
● trends
● seasonal
● periodic (cron jobs, sampling)
● aggregated (MA, EWMA)
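
A sketch of the Ljung-Box Q statistic using only the standard library. The lag count of 10 is an arbitrary choice; 18.307 is the 95% quantile of the chi-squared distribution with 10 degrees of freedom, so Q above that value rejects "no autocorrelation" at the 5% level.

```python
CHI2_95_DF10 = 18.307   # 95% quantile of chi-squared with 10 degrees of freedom


def ljung_box_q(x, max_lag=10):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(x)
    m = sum(x) / n
    e = [v - m for v in x]
    denom = sum(v * v for v in e)
    if denom == 0:
        raise ValueError("constant series has no defined autocorrelation")
    q = 0.0
    for k in range(1, max_lag + 1):
        rho_k = sum(e[t] * e[t - k] for t in range(k, n)) / denom
        q += rho_k * rho_k / (n - k)
    return n * (n + 2) * q


# A mean shift makes neighbouring points strongly correlated.
step = [0.0] * 50 + [1.0] * 50
q_step = ljung_box_q(step)

# So does a strictly periodic signal.
periodic = [1.0, -1.0] * 50
q_periodic = ljung_box_q(periodic)
```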

Distribution Change

2-Sample Tests: Good

Kolmogorov–Smirnov, Cramér–von Mises
● good for request size and latency (unaggregated)
● work on periodic data
● outlier resistant
● good for data exploration

2-Sample Tests: Bad

Kolmogorov–Smirnov, Cramér–von Mises
● false positives on trends and seasonal changes
● need many unique values
● computational complexity
● bad for alerting
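
A sketch of the two-sample Kolmogorov–Smirnov statistic with the large-sample critical value. The "before" and "after" samples are made-up latencies. The asymptotic threshold assumes continuous data with few ties, which is the "need many unique values" caveat above.

```python
import math
from bisect import bisect_right


def ks_2samp_stat(a, b):
    """Maximum distance between the empirical CDFs of two samples."""
    sa, sb = sorted(a), sorted(b)
    n, m = len(sa), len(sb)
    d = 0.0
    for v in sa + sb:
        fa = bisect_right(sa, v) / n    # share of a that is <= v
        fb = bisect_right(sb, v) / m    # share of b that is <= v
        d = max(d, abs(fa - fb))
    return d


def ks_critical(n, m, alpha=0.05):
    """Asymptotic critical value for the two-sample statistic."""
    c = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c * math.sqrt((n + m) / (n * m))


# Hypothetical latency samples (ms) from two time windows.
before = list(range(100))
after = [v + 50 for v in before]
d = ks_2samp_stat(before, after)
changed = d > ks_critical(len(before), len(after))
```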

Finding Similar Graphs

● correlation (Pearson, Spearman)
● Euclidean distance
● dynamic time warping (DTW)
● discrete Fourier transform (DFT)
● discrete wavelet transform (DWT)
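
To illustrate why the choice of distance matters, here is a sketch comparing Euclidean distance with a textbook O(n·m) DTW, with no warping window. The two series are made up: the same bump, shifted by one sample.

```python
import math


def euclidean(a, b):
    """Point-by-point distance; requires equal lengths."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def dtw(a, b):
    """Dynamic time warping distance with absolute-difference cost."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)       # row for the empty prefix of a
    for p in a:
        cur = [inf]
        for j, q in enumerate(b, start=1):
            cost = abs(p - q)
            # Best of: advance a only, advance both, advance b only.
            cur.append(cost + min(prev[j], prev[j - 1], cur[j - 1]))
        prev = cur
    return prev[-1]


a = [0, 0, 1, 2, 1, 0, 0]
b = [0, 1, 2, 1, 0, 0, 0]    # same shape, one sample earlier
```

Here euclidean(a, b) is 2.0 while dtw(a, b) is 0.0, because DTW is allowed to stretch the time axis.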

Cluster Centers

Cluster Members

Clustering

● non-euclidean (ultrametric) space
● many small clusters
● local clustering around events
● false positives
    – cron jobs (log rotation)
    – human actions (restarts, reconfigurations)
    – cache expirations
    – …

Tools

● collectd
● statsd
● graphite
● whisper-fetch
● R

R

    library(zoo)  # provides coredata() and index()

    # Build a long-format data frame with the raw values and a
    # block-averaged "Smooth" column for every series in a zoo object.
    add.smooth <- function(m) {
      r <- nrow(m)
      ms <- sapply(m, function(y) {
        ave(coredata(y), seq.int(r) %/% max(3, r %/% 150),
            FUN = function(x) mean(x, na.rm = TRUE))
      })
      df <- data.frame(index(m)[rep.int(1:r, ncol(m))],
                       factor(rep(1:ncol(m), each = r), levels = 1:ncol(m)),
                       as.vector(coredata(m)),
                       as.vector(coredata(ms)))
      names(df) <- c("Index", "Series", "Value", "Smooth")
      df
    }

Kale Stack

● github.com/etsy/skyline
● github.com/etsy/oculus

Skyline

image from github.com/etsy/skyline

Skyline Internals

● Horizon agent
● Redis
● Analyzer agent
● Flask (Python) Web App

Skyline Algorithms

● median absolute deviation
● grubbs
● first hour average
● stddev from average
● stddev from moving average
● mean subtraction cumulation
● least squares
● histogram bins
● ks test
● second order anomalies
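
As an illustration of the first algorithm in the list, a median absolute deviation check on the latest datapoint can be sketched as below. This is not Skyline's code, and the threshold of 6 is an assumed default.

```python
from statistics import median


def mad_is_anomalous(values, threshold=6.0):
    """True if the last value is far from the median, measured in MADs."""
    med = median(values)
    deviations = [abs(v - med) for v in values]
    mad = median(deviations)
    if mad == 0:
        return False      # more than half the points equal the median
    return deviations[-1] / mad > threshold


spike = [1, 2, 3, 2, 1, 2, 3, 2, 100]
normal = [1, 2, 3, 2, 1, 2, 3, 2, 3]
```

Unlike the stddev-based checks, the median and MAD barely move when a single spike enters the window, so one outlier does not mask the next one.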

Oculus

image from github.com/etsy/oculus

Oculus Internals

● Skyline Import Script and Cronjob
● Resque workers
● ElasticSearch
● Sinatra (Ruby) Web App

Q&A

Anton Lebedevich

[email protected]

twitter.com/widdoc

github.com/mabrek
