23
BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral

BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

BD|CESGAAnomaly Detection at Scale

Javier López CacheiroOrquídea Seijas SalinasSamuel Soutullo Sobral

Page 2: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Measurement is the first step that leads to control and eventually to improvement. If you can’t measure something, you can’t understand it. If you can’t understand it, you can’t control it. If you can’t control it, you can’t improve it.

H. James Harrington

Page 3: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Anomaly Detection

● Phase 1: Measure → Collect & Store● Phase 2: Understand → Analyze&Visualize● Phase 3: Control → Monitoring● Phase 4: Improve → Anomaly Detection

Page 4: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Phase 1: Measure

Page 5: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Collect & Store

● System logs● Process

accounting● Server metrics● IPMI sensors● SMART disk info● Chillers & AHUs ● UPSs● Power Meters

● Network traffic

● Application module usage

● SLURM Accounting

● Temperature/Humidity sensors

Page 6: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

33487 metrics

10 Million time series

Page 7: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Phase 2: Understand

Page 8: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Dashboards

Page 9: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Phase 3: Control

Page 10: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Monitoring

Page 11: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Phase 4: Improve

Page 12: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Anomaly Detection

● Types of anomalies in time series● Outliers● Change points● Anomalous time series

● Generic Anomaly Detection Systems

Page 13: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Caution: Alert overload

Page 14: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Server Anomalous Performance

● Problem: Parallel jobs are cancelled because some of the nodes have poor performance. Computation is lost.

● Detection: Analyze & visualize server metrics to spot the anomalous node

● Objective: Automatically detect low performance nodes

Page 15: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Analyze & Visualize

Page 16: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Anomalous Performance Detection

Page 17: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Results

● 6 months● 11965 jobs● 22 anomalies detected

● Precision: 100%● Recall: 96%● F-score: 0.98

Page 18: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Conclusions

● No longer needed to delete old data● Understand your data● Generic anomaly detecion systems

generate too many alerts● Target specific use cases● Maintain number of alerts low

Page 19: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Thanks!

Page 20: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

SSH Attack Detection

● Problem: Daily our public servers are scanned and attacked

● Detection: Correlate real-time SSH connection information to detect attacks

● Objective: Automatically update router configuration to stop the attacks

Page 21: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

SSH Attack Detection

Page 22: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Results

Page 23: BD|CESGA Anomaly Detection at Scale - WGML 2017€¦ · BD|CESGA Anomaly Detection at Scale Javier López Cacheiro Orquídea Seijas Salinas Samuel Soutullo Sobral. Measurement is

Results