Page 1: Monitoring by Zabbix: The Final Frontier

Monitoring by Zabbix: the Final Frontier

Detect problems way before end users

Page 2: Monitoring by Zabbix: The Final Frontier

Agenda

Programming languages we use to build our software

Standard approach to monitoring

How does Zabbix do it?

Page 3: Monitoring by Zabbix: The Final Frontier

Who am I?

Alexei Vladishev

Creator of Zabbix

CEO and Architect

@avladishev

Riga | Tokyo | New York

Page 4: Monitoring by Zabbix: The Final Frontier
Page 7: Monitoring by Zabbix: The Final Frontier

Runtime issues

Memory leaks

Uninitialised pointers

Require discipline!

Runtime issues

Out of memory

GC affects execution

Runtime issues

Out of memory

Slow execution

Hard to predict resource usage

Page 8: Monitoring by Zabbix: The Final Frontier

No guarantees: performance, resource usage, availability, etc.

Page 9: Monitoring by Zabbix: The Final Frontier

Confluence KB: How to fix out of memory errors by increasing available memory?

We aren't really able to give a concrete recommendation for the amount of memory to allocate, because that will depend greatly on your server setup, the size of your user base, and their behaviour. You will need to find a value that works for you, ie no noticeable GC pauses, and no OutOfMemory errors.

Solution: Increase Xmx in small increments (eg 512mb at a time), until you no longer experience the OutOfMemory error.

Page 10: Monitoring by Zabbix: The Final Frontier

Too many bad things may happen at runtime

Page 11: Monitoring by Zabbix: The Final Frontier

That’s why we need monitoring!

Page 12: Monitoring by Zabbix: The Final Frontier

Monitoring is about describing abnormal behaviour of our systems

Page 13: Monitoring by Zabbix: The Final Frontier

How to detect it?

Page 14: Monitoring by Zabbix: The Final Frontier

Typical approach

Chart: CPU load over time (10:00 to 10:50, scale 0 to 10)

CPU load > 5

Page 15: Monitoring by Zabbix: The Final Frontier

Typical approach

Chart: CPU load over time (10:00 to 10:50, scale 0 to 10)

CPU load > 5

Result: three Problem events and two Recovery events

Page 16: Monitoring by Zabbix: The Final Frontier

Too sensitive

Flapping
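The flapping described above can be reproduced with a minimal sketch. It assumes the trigger simply compares each new sample with the threshold; the function name and the sample readings are hypothetical, not taken from the talk.

```python
def naive_events(samples, threshold=5.0):
    """Compare every sample with the threshold and report each state change."""
    events = []
    in_problem = False
    for i, value in enumerate(samples):
        if value > threshold and not in_problem:
            in_problem = True
            events.append((i, "PROBLEM"))
        elif value <= threshold and in_problem:
            in_problem = False
            events.append((i, "RECOVERY"))
    return events


# Hypothetical CPU load readings hovering around the threshold of 5
cpu_load = [4.0, 5.5, 4.8, 5.2, 4.9, 6.0, 6.1]
naive_result = naive_events(cpu_load)
print(naive_result)
```

With these readings the trigger changes state five times (three problems, two recoveries), even though the load barely moves.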

Page 17: Monitoring by Zabbix: The Final Frontier

Zabbix does it the smart way

Page 18: Monitoring by Zabbix: The Final Frontier

History

Analysis

Data collection

Zabbix server

Page 19: Monitoring by Zabbix: The Final Frontier

History

Analysis

Data collection

Alerts

Zabbix server

Page 20: Monitoring by Zabbix: The Final Frontier

Analyse history

Chart: CPU load over time (10:00 to 11:10, scale 0 to 10)

CPU load for the last 10 minutes > 5

Page 21: Monitoring by Zabbix: The Final Frontier

Analyse history

Chart: CPU load over time (10:00 to 11:10, scale 0 to 10)

CPU load for the last 10 minutes > 5

Problem!

Recovery
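A minimal sketch of a history-based condition follows. It assumes one sample per minute and reads "for the last 10 minutes" as "every sample in the window exceeds the threshold"; the slide does not say whether a minimum or an average is meant, so that reading, the function name and the data are all assumptions.

```python
from collections import deque


def windowed_events(samples, window=10, threshold=5.0):
    """Raise a problem only when all samples in the window exceed the threshold."""
    recent = deque(maxlen=window)  # keeps only the newest `window` samples
    events = []
    in_problem = False
    for i, value in enumerate(samples):
        recent.append(value)
        sustained = len(recent) == window and min(recent) > threshold
        if sustained and not in_problem:
            in_problem = True
            events.append((i, "PROBLEM"))
        elif not sustained and in_problem:
            in_problem = False
            events.append((i, "RECOVERY"))
    return events


# Hypothetical data: a 3-minute spike, then 12 minutes of sustained high load
cpu_load = [3.0] * 5 + [7.0] * 3 + [3.0] * 5 + [7.0] * 12 + [3.0] * 3
windowed_result = windowed_events(cpu_load)
print(windowed_result)
```

The short spike is ignored; only the sustained load produces a problem event.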

Page 22: Monitoring by Zabbix: The Final Frontier

Problem disappeared != problem is resolved

Page 23: Monitoring by Zabbix: The Final Frontier

Problem: free disk space <= 10%

Now free disk space is 10.001%

Have we resolved our problem?

Page 24: Monitoring by Zabbix: The Final Frontier

Problem: free disk space <= 10%

Now free disk space is 10.001%

Problem resolved?

Page 25: Monitoring by Zabbix: The Final Frontier

Different conditions

Chart: CPU load over time (10:00 to 10:50, scale 0 to 10)

Problem: CPU load > 5

Recovery: CPU load < 1

Page 26: Monitoring by Zabbix: The Final Frontier

Different conditions

Chart: CPU load over time (10:00 to 10:50, scale 0 to 10)

Problem: CPU load > 5

Recovery: CPU load < 1

Problem!

Recovery

Page 27: Monitoring by Zabbix: The Final Frontier

No flapping!
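Separate problem and recovery conditions (hysteresis) can be sketched as below, using the thresholds from the slide (problem above 5, recovery below 1). The function name and the readings are hypothetical.

```python
def hysteresis_events(samples, problem_above=5.0, recover_below=1.0):
    """Use one threshold to enter the problem state and a lower one to leave it."""
    events = []
    in_problem = False
    for i, value in enumerate(samples):
        if not in_problem and value > problem_above:
            in_problem = True
            events.append((i, "PROBLEM"))
        elif in_problem and value < recover_below:
            in_problem = False
            events.append((i, "RECOVERY"))
    return events


# Hypothetical readings: load hovers around 5, then really drops
cpu_load = [4.0, 5.5, 4.8, 5.2, 4.9, 6.0, 3.0, 0.8, 0.5]
hysteresis_result = hysteresis_events(cpu_load)
print(hysteresis_result)
```

Values between the two thresholds leave the state unchanged, so oscillation around 5 no longer generates events.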

Page 28: Monitoring by Zabbix: The Final Frontier

Smarter approach

Problem if free disk space < 10%

Recovery if free disk space > 30% for the last 15 minutes

Problem if 3 consecutive checks of the REST service fail

Recovery if 10 consecutive checks of the REST service are OK
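The consecutive-check rule for the REST service can be sketched as a streak counter. This is an illustration only: the function name, the boolean encoding of check results and the sample sequence are assumptions.

```python
def consecutive_events(check_results, fail_count=3, ok_count=10):
    """Problem after `fail_count` failures in a row, recovery after `ok_count` successes in a row."""
    events = []
    in_problem = False
    streak_value = None  # result shared by the current run of checks
    streak_len = 0
    for i, ok in enumerate(check_results):
        if ok == streak_value:
            streak_len += 1
        else:
            streak_value, streak_len = ok, 1
        if not in_problem and not ok and streak_len >= fail_count:
            in_problem = True
            events.append((i, "PROBLEM"))
        elif in_problem and ok and streak_len >= ok_count:
            in_problem = False
            events.append((i, "RECOVERY"))
    return events


# Hypothetical check history: True means the check succeeded
checks = [True, False, False, True] + [False] * 3 + [True] * 4 + [False] + [True] * 10
consecutive_result = consecutive_events(checks)
print(consecutive_result)
```

Two isolated failures do not raise a problem, and a single failure during recovery restarts the count of successful checks.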

Page 29: Monitoring by Zabbix: The Final Frontier

Anomaly detection

Chart: metric over time (10:00 to 11:10, scale 0 to 10)

Compare current system state with the past

Anomaly!
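One simple way to compare the current state with the past is to check how far the current value lies from a historical baseline. The slide does not describe the method Zabbix uses, so the sketch below is a generic deviation test; the function name, the tolerance and the data are assumptions.

```python
from statistics import mean, stdev


def is_anomaly(current, past_values, tolerance=3.0):
    """Flag the current value if it is more than `tolerance` standard deviations from the past mean."""
    baseline = mean(past_values)
    spread = max(stdev(past_values), 1e-9)  # avoid a zero spread for flat history
    return abs(current - baseline) > tolerance * spread


# Hypothetical values observed at the same time of day on previous days
past_load = [2.0, 2.2, 1.9, 2.1, 2.0, 1.8, 2.0]
normal_reading = is_anomaly(2.3, past_load)
unusual_reading = is_anomaly(7.5, past_load)
print(normal_reading, unusual_reading)
```

A value of 7.5 would not trip a fixed threshold of 10, yet it is far outside what this history suggests is normal.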

Page 30: Monitoring by Zabbix: The Final Frontier

Forecasting

Chart: metric over time (7:00 to 21:00, scale 0 to 50)

Page 31: Monitoring by Zabbix: The Final Frontier

Forecasting

Chart: metric over time (7:00 to 21:00, scale 0 to 50) with a fitted trend line

y = -2.9455x + 48.309

Predict when a threshold will be reached and what the value will be after a period of time

Problem in the future
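Both forecasting questions can be answered from a fitted straight line. The sketch below uses an ordinary least-squares fit and generates points from the slide's trend line; the function names are hypothetical, and the unit of x is not stated on the slide, so it is left abstract.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def value_after(slope, intercept, x):
    """Predicted value at position x on the trend line."""
    return slope * x + intercept


def x_when_reaches(slope, intercept, threshold):
    """Position at which the trend line crosses the threshold, or None for a flat line."""
    if slope == 0:
        return None
    return (threshold - intercept) / slope


# Hypothetical samples lying exactly on the slide's line y = -2.9455x + 48.309
xs = list(range(10))
ys = [-2.9455 * x + 48.309 for x in xs]
slope, intercept = fit_line(xs, ys)
predicted = value_after(slope, intercept, 12)
exhausted_at = x_when_reaches(slope, intercept, 0.0)
print(round(predicted, 3), round(exhausted_at, 2))
```

If the metric were free disk space, `exhausted_at` would estimate when it runs out, which allows an alert before the problem actually occurs.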

Page 32: Monitoring by Zabbix: The Final Frontier

Conclusion

Monitoring by Zabbix is your best friend

Use smart problem detection, do not spam DevOps

Detect problems way before end users notice

Anomalies

Forecasting

Page 33: Monitoring by Zabbix: The Final Frontier

Thank you!

Learn more about Zabbix at our booth!

@avladishev

Email: [email protected]