Deploying Nagios in a Large Enterprise - USENIX · Deploying Nagios in a Large Enterprise Carson...

Deploying Nagios in a Large Enterprise

Carson GasparGoldman Sachs

carson.gaspar@gs.com

or...If you strap enough

rockets to a brick, you can make it fly

In the beginning...

• New Linux HPC skunkworks project

• Catastrophic success

• Need monitoring added yesterday

Looking for Solutions

• Why not use what we already had?

• Stability problems

• Resource utilization problems

• Custom agents were hard

• No support for our new Linux platform

Why Nagios?

• Purchasing a new commercial solution was politically difficult

• At the time (2003) nagios was the most mature of the open source solutions

• Good community support

The Naive Approach(and why it didn’t work)

• Performance Problems

• Configuration Management Problems

• Availability Problems

• Integration / Automation Requirements

Performance Problems

• State check performance

• Active checks: ~3 checks / second maximum

• Statistics performance

• fork()/exec() for every sample

• Web UI performance

• Large configurations take a long time to display (much improved in 2.x)

Configuration Management Problems

• Configuration files are very verbose (even with templates)

• Syntax errors are easy

• Keeping up with a high churn rate in monitored servers is expensive

• Hardware / software failures

• Building power-downs

• Patches / upgrades

• Who watches the watchers?

Availability Problems

Integration / Automation Requirements

• Alarms need to be dispatched to our existing alerting and escalation system

• Alarms need to be suppressed by existing maintenance tools

• Provisioning should flow from our existing provisioning system

Solving the Problems

• Move to passive checks

• Run multiple nagios instances

• Deploy HA nagios servers

• Use data-driven configuration file generation

• Create a custom notification back end

Passive Checks

• Move most of the work to the clients

• Batch server updates unless a state change occurs

• Randomize server update times to avoid spikes

• Queue the results on the server

• Send statistics to a different back end

Passive Data FlowClient 1 Client n

... Server n

monqueue

stats-catcher

... nagios n

nagios 1

Server 1

nagios-agent nagios-agent

monqueue

stats-catcher

... nagios n

nagios 1

Multiple Nagios Instances

• Run many copies of nagios on one server

• Improve web UI performance

• Show each group only their own servers, so the top level dashboard is more useful

• Allow per-group customizations

HA Nagios Servers

• Run multiple nagios servers on different machines in different buildings

• All clients update all servers

• A heartbeat is published through each server to its counterpart

• Notifications are suppressed on slaves if the heartbeat service is OK

• Partitioned masters can cause duplicate alerts

HA Data Flowprimary server secondary server

client

nagios-agent

nagios

monqueueheartbeat

notify

monqueue

nagios

notify

heartbeat

Data-driven Configuration File Generation

• Leverage our existing host database and provisioning tools

• Generate client and server configurations via cfengine based on templates and database lookups

• Mostly driven by database data, with some per-server threshold overrides managed in cfengine master files

Custom NotificationBack End

• Custom code integrates with our Netcool infrastructure

• Alarm suppression based on external criteria

• Also supports email alerts, optionally batched

Design Trade-offs

• Batch updates mean slow detection of “zombie” hosts (ping-able, but not running user processes)

• nagios’ notification escalation doesn’t work well without active checks, especially if updates are batched

• Requires configuration management

• More complexity = more bugs

nagios-agentDesign Criteria

• Lightweight

• Easy to write and deploy additional agents

• Avoid fork()/exec() where possible

• Support agent callbacks to avoid blocking

• Support “proxy” agents to monitor other devices where we can’t deploy nagios-agent

• Evaluate all thresholds locally and batch server updates

nagios-agentnagios-agent

Class: Ping

Instance: db

Categories: appdev, sa

hostlist => /my/dblist,

latency => [ 50, 100],

losspct => [ 66, 100 ],

count => 3

Class: Ping

Instance: LDN

Categories: sa

hostlist => /my/ldnlist,

latency => [ 100, 200],

losspct => [ 66, 100 ],

count => 3

Server

Class: monqueue

Categories: appdev

auth => GSSAPI

keytab => mykeytab

server => myserver,

queue => 0

Server

Class: monqueue

Categories: sa

auth => GSSAPI

keytab => mykeytab

server => myserver,

queue => 1

monqueueDesign Criteria

• Fast

• Secure

• Accept data from clients, and dispatch to multiple output queues

• Supply heartbeats to nagios

• Supply queue depth stats to stats-catcher

Client Evolution• nagios-agent slowly grew features as they became

required

• multiple agent instances

• agent instance to server mapping

• auto reload of configuration, modules on update

• auto re-exec of nagios-agent on update

• stats collection

• SASL authentication to monqueue

Server Evolution

• Started as one monolithic instance

• As deployment spread, split into multiple instances based on administrative domain

• added HA

• added SASL authentication and authorization

• added monitoring of monqueue itself, and service dependencies (so a monqueue failure didn’t trigger alarms for all services)

• Originally for one project, fewer than 200 hosts

• Eventually used for large sections of the environment

• Documentation and internal consultancy are critical for user acceptance

One of Our Servers

• A single HP DL385 G1 (dual 2.6GHz Opteron, 4GB RAM), running RHEL4 U4, nagios 2.9

• 11 nagios instances

• 27,000+ services (mostly 2 minute intervals)

• 6,600+ hosts

• ~10% CPU

• ~500 MB RAM

Still to Come• Source code release

• Encryption of nagios-agent / monqueue communications

• Support for “pulling” status from nagios-agent to better support DMZ environments

• Statistical analysis of multiple data samples to determine service status

• Yet more agent plugins

• nagios-agent support for traditional nagios plugins

Deploying Nagios in a Large Enterprise

Carson GasparGoldman Sachs

carson.gaspar@gs.com

Deploying Nagios in a Large Enterprise - USENIX · Deploying Nagios in a Large Enterprise Carson...

Documents

Monitoring sieci Nagios - Maciej Kalkowskikalkowski.name/dydaktyka/2011-2012-L/ASL/referaty/nagios/nagios.pdf · Monitoring sieci Nagios Wstęp HomePage : Nagios – program do monitorowania

Nagios-NetEye Conference 2010: Jan Josephson on Nagios

Deploying F5 with Nagios Open Source Network Monitoring System

Nagios Conference 2014 - Mike Weber - Nagios Rapid Deployment Options

Nagios introduction Dhruba Raj Bhandari - Internet …ws.edu.isoc.org/workshops/2007/sanog10/day4/nagios/nagios.pdf · Nagios introduction Dhruba Raj Bhandari ... – Nagios® is

Nagios Conference 2014 - Scott Wilkerson - Log Monitoring and Log Management With Nagios - Introducing Nagios Log Server

Nagios XI â€“ Monitoring VMware - Nagios - The Industry Standard

Scala Collections Performance GS.com/Engineering March, 2015 Craig Motlin

Nagios Conference 2014 - Dave Williams - Multi-Tenant Nagios Monitoring

Nagios Conference 2012 - Bryan McLellan - Using Nagios With Chef

Nagios server and Nagios host in docker containers

Network Monitoring & Management: Nagios Monitoring & Management: Nagios Network Startup Resource Center ... • A Debian tutorial on Nagios

Nagios Conference 2012 - John Sellens - Non-Obvious Nagios

Nagios Conference 2013 - Spenser Reinhardt - Securing Your Nagios Server

Nagios Conference 2013 - David Stern - The Nagios Light Bar

Nagios 3 J. Gabès Nagios 3 - jecogite.free.frjecogite.free.fr/acces/INFO/nagios/nagios 3.pdf · Nagios 3 pour la supervision et la métrologie Déploiement, conﬁ guration et optimisation

Nagios Conference 2013 - Avleen Vig - Nagios at Etsy

New Deploying Nagios in a Large Enterprise - USENIX · 2019. 2. 25. · Deploying Nagios in a Large Enterprise Carson Gaspar Goldman Sachs carson.gaspar@gs.com. or... If you strap

Utilisation Nagios-Centreon-Nagvis · Utilisation Nagios-Centreon-Nagvis Le serveur FAN (Fully automated Nagios) qui regroupe nagios ,centreon et nagvis a pour adresse

TAKING THE HEAT - goldmansachs.com · TAKING THE HEAT Amanda Hindlian amanda.hindlian@gs.com Sandra Lawson sandra.lawson@gs.com Sonya Banerjee sonya.banerjee@gs.com The Goldman Sachs