Stop using Nagios (so it can die peacefully)

You shouldn't use Nagios any more - it sucks. Let's build a new, better, more awesome monitoring system.

Please stop using Nagios (so it can die peacefully)

Andy Sykes, Devops @ Forward3D
@supersheep, [email protected]

Do you use Nagios? Tell me why you picked it. Go on. If you don't, why don't you?

Reasons for choosing Nagios

•  stupid simple plugin system
•  billions* of existing plugins
•  years of development behind it
•  you can hire people who know it

"Everybody uses it."**

* may not actually be true
** except me. and maybe you. and that guy at the back, who really likes Zabbix. you know who you are.

So why did you pick Nagios?

Because it's the "safe", default choice. Because we've grown accustomed to the things that really, really suck about it. It's a little like we've all got Stockholm Syndrome.

What Nagios gets right

Incredibly simple plugin model.
Fairly secure (SSL between agents + master).
Very simple conceptually.
Reliable.

Nagios, I hate thee; let me count thy ways

Doesn't scale. At all.
World's second most horrible configuration*.
Horrendous interface**.
Assumes a static infrastructure.
No decent programmatic interfaces***.
Throws away perfdata.
Stupid wire format for clients (NRPE/NSCA).

* the world's most horrible configuration is, obviously, Sendmail.
** even the paid Nagios XI one is ugly as sin and unusable.
*** if I catch you parsing status.dat, I will beat your ass.

Expansion about config

Configuration has to be in two places:

Server has to know what checks to invoke via NRPE.

Client has to know what checks it will be asked to invoke with NRPE.

THIS IS MADNESS.

Scaling, or lack of it

No such thing as a Nagios cluster.
More checks = more work = longer before you know something's happened!
Every check increases your master's load average.
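
A back-of-envelope sketch of the "more checks = longer before you know" point. This is not how Nagios actually schedules checks; it assumes a single master that runs checks in fixed-size parallel batches, and the function name, the 2-second check duration, and the 50 workers are all made-up numbers for illustration.

```python
import math


def full_pass_seconds(num_checks, avg_check_seconds, parallel_workers):
    """Rough time for one master to run every check exactly once.

    Assumption: checks run in batches of `parallel_workers`, and each
    batch takes `avg_check_seconds`. Real schedulers are messier.
    """
    batches = math.ceil(num_checks / parallel_workers)
    return batches * avg_check_seconds


# Hypothetical fleet sizes: the delay grows linearly with the check count
# because there is only one machine doing the work.
for num_checks in (1_000, 10_000, 50_000):
    delay = full_pass_seconds(num_checks, avg_check_seconds=2.0, parallel_workers=50)
    print(f"{num_checks:>6} checks -> one full pass takes about {delay:.0f}s")
```

With these invented numbers, going from 1,000 to 50,000 checks stretches a full pass from 40 seconds to over half an hour, and with no cluster there is no second master to share the load.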

Okay, yes, there’s mod_gearman

But it’s a hack at best. No redundancy for the machine that distributes the checks, so it’s not a real cluster.

API poverty

Can't easily integrate with other systems.
Can't easily write custom dashboards.
Can't get information out again!

Assumes a static infra

Master has to be told about a client before things can happen.

The bandaids we make

Interface: Opsview, Icinga, Shinken, others

API: Parsing status.dat, NDO

Client wire format: Opsview's NRPE, NRD

Config management: Puppet types, Chef cookbooks

None of it is good enough.

The take-home point:

"If we keep using Nagios, we'll never get anything better." (Writing monitoring systems is hard, and needs community involvement and real world adoption. Nagios steals mindshare by being just good enough. It's the monitoring system we deserve, but not the one we need right now.)

So, smart guy. What do we do?

Steal all the things that are great about Nagios. (existing plugin investment, simplicity, security, reliability)

Strap them to something more awesome. (scalable, API-ready, config management friendly, modern!)

THIS DOESN’T MEAN WRITING YOUR OWN MONITORING SYSTEM

Points for thought:

●  What else are people using?
●  Should we greenfield or lift existing tools?
●  What tools could we go with?

My suggestion:

Like OMD, but better. Wrap up a series of “best in breed” tools to make one kickass monitoring tool.

What we need:

Core
Agent
Graphing
Anomaly detection
Alerting
UI

Core:

Holds configuration about hosts / services
Distributed across X masters
Check execution (poke)
Results queue (poke response)

There’s something we can use for this.

Sensu! Sensu is often described as the “monitoring router”.

{
  "checks": {
    "chef_client": {
      "command": "check-chef-client.rb",
      "subscribers": [ "production" ],
      "interval": 60,
      "handlers": [ "pagerduty", "irc" ]
    }
  }
}

Only on the server

Client requires no registration for the server to know about it
Uses Nagios status return codes
Doesn’t talk to the server - talks to RabbitMQ
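
Since the client reuses the Nagios status return codes, an existing plugin keeps working and a new one is tiny. A minimal sketch of the convention (exit 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN, plus one line of output). The disk check, its thresholds, and the function names are assumptions picked for illustration, not something from the slides.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_disk(used_percent, warn=80.0, crit=90.0):
    """Map a disk usage percentage to (exit_code, status_line).

    The 80/90 thresholds are arbitrary example values.
    """
    if used_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"DISK WARNING - {used_percent:.1f}% used"
    return OK, f"DISK OK - {used_percent:.1f}% used"


def main(path="/"):
    """Run the check once, print the status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Anything we cannot measure is UNKNOWN, not OK.
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    status, message = classify_disk(usage.used / usage.total * 100)
    print(message)
    return status
```

A real plugin script would finish with `sys.exit(main())` so the calling agent sees the exit code. The point is that the whole contract is "one line of text and a number", which is why the existing plugin investment carries over.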

What we need:

Core - Sensu-server
Agent - Sensu-client
Graphing
Anomaly detection
Alerting
UI

Graphing is easy now.

If you’re not using Graphite, you should be. Sensu “metric” checks can pump data to it.
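
For a feel of what "pumping data" at Graphite looks like on the wire, here is a minimal sketch of Graphite's plaintext format: one `path value timestamp` line per data point, sent over TCP (port 2003 by default). How a particular Sensu setup forwards metric check output is not covered in these slides, so the helper names, the metric path, and the `graphite.example.com` host are hypothetical.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point in Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_to_graphite(lines, host="graphite.example.com", port=2003):
    """Send already formatted lines to a Carbon listener.

    The host is a placeholder; nothing is sent unless this is called.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Example data point with a fixed timestamp so the output is predictable.
print(graphite_line("production.web01.load.one_minute", 0.42, 1375000000), end="")
```

A metric check only has to print lines in this shape; something downstream ships them to Graphite.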

What we need:

Core - Sensu-server
Agent - Sensu-client
Graphing - Graphite
Anomaly detection
Alerting
UI

Anomaly detection is hard.

We’ve got all this metric data, but how do we check it?
- Skyline/Oculus (Etsy)
- Grok (very early days)
- ???
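
To make "how do we check it?" concrete, here is about the simplest thing that could be called anomaly detection: flag a value that sits more than three standard deviations from the recent mean. This is a toy under stated assumptions, not a description of how Skyline, Oculus, or Grok work, and the sample numbers are invented.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, sigmas=3.0):
    """Return True if `latest` is far outside the spread of `history`.

    Assumes `history` is a short window of recent, roughly stable values.
    """
    if len(history) < 2:
        return False  # not enough data to estimate a spread
    mu = mean(history)
    sd = stdev(history)
    if sd == 0:
        return latest != mu  # perfectly flat series: any change stands out
    return abs(latest - mu) > sigmas * sd


# Invented request-rate samples: mean 100, sample standard deviation 2.
history = [100, 102, 98, 101, 99, 100, 103, 97]
print(is_anomalous(history, 101))  # inside the normal band
print(is_anomalous(history, 160))  # way outside it
```

It falls over on anything seasonal or trending, which is a fair illustration of why this slot in the list is still "???".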

What we need:

Core - Sensu-server
Agent - Sensu-client
Graphing - Graphite
Anomaly detection - ???
Alerting
UI

Alerting is tricky, but mostly solved.

Flapjack! - flapjack.io

Alerting is not the concern of your monitoring tool.

Push all alerts at Flapjack
- define gateways (PagerDuty, email)
- create relationships between checks and gateways
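
The "gateways plus relationships" idea can be sketched in a few lines. To be clear, this is not Flapjack's API or data model; `AlertRouter` and its methods are hypothetical names, and the gateways here just append to lists instead of paging anyone.

```python
from collections import defaultdict


class AlertRouter:
    """Toy alert router: checks are related to named gateways."""

    def __init__(self):
        self.gateways = {}               # gateway name -> send function
        self.routes = defaultdict(list)  # check name -> gateway names

    def define_gateway(self, name, send):
        self.gateways[name] = send

    def relate(self, check, gateway):
        if gateway not in self.gateways:
            raise KeyError(f"unknown gateway: {gateway}")
        self.routes[check].append(gateway)

    def push(self, check, message):
        """Deliver an alert to every gateway related to the check."""
        delivered = []
        for name in self.routes.get(check, []):
            self.gateways[name](check, message)
            delivered.append(name)
        return delivered


# Stand-in gateways that record what they would have sent.
pages, emails = [], []
router = AlertRouter()
router.define_gateway("pagerduty", lambda check, msg: pages.append((check, msg)))
router.define_gateway("email", lambda check, msg: emails.append((check, msg)))
router.relate("chef_client", "pagerduty")
router.relate("chef_client", "email")
router.relate("disk_usage", "email")

print(router.push("chef_client", "CRITICAL: chef-client has not run"))
print(router.push("disk_usage", "WARNING: 85% used"))
```

The monitoring side only ever calls `push`; who gets woken up is decided entirely by the relationships, which is the separation the slide is arguing for.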

What we need:

Core - Sensu-server
Agent - Sensu-client
Graphing - Graphite
Anomaly detection - ???
Alerting - Flapjack
UI

User interfaces are hard.

What do we need from it?
- What’s broken
- When it broke, when it broke in the past
- Say “OK, I know it’s broken”
- View graphs to see how quickly it broke
- See every check everywhere, and filter the list

The Sensu Dashboard sucks.

No history!
Acknowledgements aren’t easy to do.
No graphing.
Can’t see anything that’s reporting an OK status.

This won’t do.

I’m going to have to write a UI. Sigh.

What we need:

Core - Sensu-server
Agent - Sensu-client
Graphing - Graphite
Anomaly detection - ???
Alerting - Flapjack
UI - ???

In Summary

Nagios sucks. There are good tools for each concern of monitoring. If we can package them together, we can have something that rocks.

Thank You.

Contact [email protected] (@supersheep)