Upload
duongphuc
View
254
Download
0
Embed Size (px)
Citation preview
S6 — Nagios: Advanced Topics
orNon-Obvious Nagios
John Sellens
USENIX LISA 27, 2013
November 3, 2013
S6 — Nagios: Advanced Topics
Contents
Preamble and Introduction 3
Nagios Basics 9
Nagios Past, Present and Future 20
Nagios Plugins 31
More on Configuration 38
Theory and Practice 71
Getting Larger 86
Tips and Tricks 92
c©2003-2013 John Sellens USENIX LISA 27, 2013 1
S6 — Nagios: Advanced Topics
Abusing Nagios 103
Plugin Pointers 121
Nagios Addons 123
Wrap Up 150
c©2003-2013 John Sellens USENIX LISA 27, 2013 2
S6 — Nagios: Advanced Topics Preamble and Introduction
Preamble and Introduction
c©2003-2013 John Sellens USENIX LISA 27, 2013 3
S6 — Nagios: Advanced Topics Preamble and Introduction
Overview
• Nagios is well-established and widely used
– Over a million known servers running 3.X
• We’re going to look at some of the non-obvious bits
– Alternate title: Non-Obvious Nagios
• And ways to extend Nagios
• We’ll assume a basic knowledge of Nagios and what it does
• We’ll look at actual Nagios data
c©2003-2013 John Sellens USENIX LISA 27, 2013 4
Notes:
• I’m assuming you’ve already chosen Nagios for your environment
– Or you’re very careful when making decisions and don’t want to
rush into anything
• Or at least I hope what we’re covering is non-obvious and/or non-trivial
• The server count is from Ethan Galstad’s talk at Ohio LinuxFest 2010 —
servers that check for available updates.
• I sure hope the network and my laptop are both happy . . .
• Both USENIX and I will very much appreciate your feedback — please fill
out (and return) the evaluation form later today.
S6 — Nagios: Advanced Topics Preamble and Introduction
Outline/Timetable
• Preamble / Introduction / Outline
• Basics – Build, Install, Configure, Run
• Plugins, Configuration Details
• — Break — 10:30 to 11:00
• Theory and Practice
• Getting Larger
• Add-ons and extensions
• Tips, tricks, etc.
• Wrap up – 12:30pm
c©2003-2013 John Sellens USENIX LISA 27, 2013 5
Notes:
• Scheduled for 9:00 - 12:30pm with one half hour break
• Tutorial lunch is from 12:30-1:30 I think
• Feel free to ask questions anytime, or ask for clarification, or add some
information. Interactivity is good!
S6 — Nagios: Advanced Topics Preamble and Introduction
Questions?
• Got a Question?
• A Clarification?
• Some Confusion?
• A Point of Interest?
• Ask!
c©2003-2013 John Sellens USENIX LISA 27, 2013 6
Notes:
• This slide is here to be even more explicit that questions and comments
are more than welcome, and that interactivity is good.
• Get my attention through any appropriate means, but if you’re throwing
something, please lob, and keep it light.
• Though please consider the time we have available before you start on a
long, involved anecdote of what once happened to a friend of yours.
S6 — Nagios: Advanced Topics Preamble and Introduction
About the Instructor
• John Sellens
• 25+ years as UNIX system administrator
• University of Waterloo, UUNET Canada, Certainty Solutions
Canada, SYONEX, FreshBooks, . . .
• “Running the Numbers: System, Network, and Environment
Monitoring”, co-author with Dan Klein; “System and Network
Administration for Higher Reliability”, lapsed ;login: author,
• Previous LISA PC, long time USENIX and LISA attendee
• Nagios World 2012 and 2013, OLF 2010, PICC 2011
c©2003-2013 John Sellens USENIX LISA 27, 2013 7
Notes:
• FreshBooks is a cloud accounting app, and normally I would be there
instead of here
• Feel free to contact me here or by email if you have any questions
S6 — Nagios: Advanced Topics Preamble and Introduction
Viewpoints and Religion
• Monitoring is Exceptions, Trending, History
• UNIX philosophy: Effective tools, not kitchen sink
– Choose the best tool(s) for the job
• SNMP is Your Friend
– Use it whenever you can
• Solve any problem in computer science with another level of
indirection
c©2003-2013 John Sellens USENIX LISA 27, 2013 8
S6 — Nagios: Advanced Topics Nagios Basics
Nagios Basics
c©2003-2013 John Sellens USENIX LISA 27, 2013 9
S6 — Nagios: Advanced Topics Nagios Basics
Nagios Basics
• Nagios is a host and service monitor
• Web GUI and display interface
• Extensible, featureful
• Built from components, rather than one big blob
– Well defined interfaces between components
• Very well documented and supported
– Version 3 documents are HTML or a 358 page PDF
• Very widely used
c©2003-2013 John Sellens USENIX LISA 27, 2013 10
Notes:
• http://www.nagios.org/
• Since 1999: http://www.nagios.org/about/history
• http://demos.nagios.com/
• New version 4
– Nagios 4.0.0 was released September 20, 2013
• Established version 3
– Nagios 3.5.1 was released August 30, 2013
– Nagios 3.0 was released March 13, 2008
• By Ethan Galstad; Licensed under the GPL
• Nagios R© is a registered trademark of Ethan Galstad
• Nagios Enterprises http://nagios.com/
– Commercial support and enhanced versions
S6 — Nagios: Advanced Topics Nagios Basics
Nagios Features
• Monitors services: SMTP, HTTP, etc.
• Monitors system state: CPU, memory, disk, etc.
• Plugin model — easily extensible service checks
• Efficient service check engine
• Host and service dependencies
• Highly configurable notifications and escalations
• Event handlers to automate problem resolution
• Very effective web interface
• And more . . .
c©2003-2013 John Sellens USENIX LISA 27, 2013 11
S6 — Nagios: Advanced Topics Nagios Basics
Nagios Implementation
• Nagios itself is a polling and reporting engine
• All service and host checks are via external “plugins”
– There’s a wide variety of plugins available, and it’s easy to
write more
– Notifications are external commands as well
• The Nagios core engine is written in C
• As are the CGIs that implement the web interface
• The plugins are variously written in shell, perl, C, etc.
c©2003-2013 John Sellens USENIX LISA 27, 2013 12
Notes:
• There is perpetually ongoing talk of how the CGIs would be much better
in PHP
– But see the mentions of some of the addons starting at page 144
S6 — Nagios: Advanced Topics Nagios Basics
Nagios Overview
Stolen from Ethan Galstad
c©2003-2013 John Sellens USENIX LISA 27, 2013 13
Notes:
• Shamefully stolen from Ethan Galstad’s FOSDEM 2005 presentation
http://www.nagios.org/fosdem2005
S6 — Nagios: Advanced Topics Nagios Basics
Operational Overview
• Nagios runs as a daemon
• Scheduling and executing host and service checks
• Running notification commands as required
• Accepting external commands and updates
• Web interface provides user interface
• Mechanisms for hooks and extensions
c©2003-2013 John Sellens USENIX LISA 27, 2013 14
S6 — Nagios: Advanced Topics Nagios Basics
Building Nagios
• The typical build process, with a few tweaks
• Lots (and lots) of configure options . . .
• Is --enable-embedded-perl a good idea?
– Gone in version 4
• Likely want --enable-event-broker
• Don’t forget make install-commandmode
– Creates the external command file (named pipe)
• Setup Apache with the example conf entries
c©2003-2013 John Sellens USENIX LISA 27, 2013 15
Notes:
• Quickstart Installation Guides
http://nagios.sourceforge.net/docs/3_0/quickstart.html
• Embedding a Perl interpreter just seems not quite right to me
– But see the documentation for pros and cons:
http://nagios.sourceforge.net/docs/3_0/embeddedperl.html
– Removed in version 4
• make install-webconf might do the Apache config for you
• make install-init may set up your boot script
• make fullinstall may do it all for you
• I use the FreeBSD port . . .
• Most people likely just install the “standard packages”
S6 — Nagios: Advanced Topics Nagios Basics
Running Nagios
• Nagios is typically started at boot time, and mostly left running
– Needs to be restarted to pick up any configuration changes
– A sigHUP will restart if your configs are perfect
• It comes with a handy daemon-init script that does start, stop,
reload
• Before re-starting Nagios for a configuration change, verify it
first:
nagios -v nagios.cfg
• This will tell you about problems before you start breaking things
c©2003-2013 John Sellens USENIX LISA 27, 2013 16
Notes:
• If your configuration contains errors (as shown by nagios -v then a
restart or a kill -HUP will leave you with no running Nagios
S6 — Nagios: Advanced Topics Nagios Basics
Not Running Nagios?
• Who watches the watcher?
• In the past, the cgi.cfg setting nagios_check_command
checked that nagios is running
• Not any more!
• If the status.dat file is left around, and nagios is dead,
nothing notices
– As far as I can see . . .
– CGIs are happy with a days old status.dat
• Run check_file_age from cron?
c©2003-2013 John Sellens USENIX LISA 27, 2013 17
Notes:
• At least up to version 3.4.1
• In practice, I’ve never seen this happen, but it’s good to be paranoid
• I don’t know how best to solve this
• I think it should be addressed in the CGIs at least
• Or perhaps cfengine, puppet, etc. will fix
S6 — Nagios: Advanced Topics Nagios Basics
Nagios Web Interface
• CGIs and HTML implement a web interface to Nagios
• Provides access to a variety of commands
• Some ask “why isn’t it PHP?”
• Some ask “why isn’t it database backed?”
• Documented as “optional”
– Which highlights the separate components
c©2003-2013 John Sellens USENIX LISA 27, 2013 18
Notes:
• The different CGIs, what they do, and authorization requirements, etc.
are described very effectively at
http://nagios.sourceforge.net/docs/3_0/cgis.html
• There are efforts underway to re-write/re-implement the web interface.
• More information on front ends later starting at page 144.
S6 — Nagios: Advanced Topics Nagios Basics
Configuration Basics
• Arbitrary probes, flexible parameters. macro substitution
• Time-variable behaviours to cover work-day vs off-hours issues
• Template based, inheritable, grouping, etc.
• Consistent and well-documented
• Some tools try to provide a web interface to the configs
• Three required files:
– nagios.cfg — overall configuration, refers to other files
– resource.cfg — global variables and database access
– cgi.cfg — controls web interface behaviour and access
c©2003-2013 John Sellens USENIX LISA 27, 2013 19
Notes:
• Some people say: “it should be in a database”
• I like text files that I can manipulate or generate
• Version 3 cleans up a bunch of configuration “issues” and makes things
much better
– Not that things were bad before, but they are even better now
S6 — Nagios: Advanced Topics Nagios Past, Present and Future
Nagios Past, Present and Future
c©2003-2013 John Sellens USENIX LISA 27, 2013 20
S6 — Nagios: Advanced Topics Nagios Past, Present and Future
Nagios Past, Present and Future
• Version 2.0 was released early 2006
– Regular expression matching for host, hostgroup and service
names in various object definitions
– Code and logic enhancements
• Version 3.0 was released March 2008
– More features, extensibility
– Enhancements for large sites and scalability
• Version 4.0 was released September 2013
• Steady, ongoing development
c©2003-2013 John Sellens USENIX LISA 27, 2013 21
Notes:
• Various companies have sponsored work on the project
• It may be a little early to use version 4 in full production, but it looks pretty
good, and has been in development for quite a while
• For new installations, you should likely use version 3, unless there is a
particular tool you need that does not work with version 3
• I’m going to assume we’re using version 3, unless I say otherwise
S6 — Nagios: Advanced Topics Nagios Past, Present and Future
New Features in Nagios 2.0
• Adaptive monitoring
– Modify on the fly how, what and when things are monitored
and processed
– e.g. Change which services monitored when hosts failover
• Event Broker API
– Loadable modules can process event and status data in real
time
– e.g. Process performance data, or insert into a database
• nagiostats command gets processing stats from the running
nagios
• Some regular expressions can be used in object definitions
c©2003-2013 John Sellens USENIX LISA 27, 2013 22
Notes:
• Version 2 documentation is likely no longer available online
S6 — Nagios: Advanced Topics Nagios Past, Present and Future
New Features in Nagios 3.0
• Much improved host check logic and parallelism
• External commands can use regular files (not the pipe)
• Multiple template inheritance, object definition updates
• Performance improvements, esp. for large sites
• Multiline plugin output
• More consistent state information storage
• Better retention across restarts
• And more!
c©2003-2013 John Sellens USENIX LISA 27, 2013 23
Notes:
• Big host check performance gains
• More detail at
http://www.nagios.org/development/history/
• What’s new in 3 at:
http://nagios.sourceforge.net/docs/3_0/whatsnew.html
S6 — Nagios: Advanced Topics Nagios Past, Present and Future
New Features in Nagios 4.0
• Performance, scalability, interfaces
• Query Handler interface for checks and more
– Core Worker processes run checks and report to Core
∗ Fewer fork()s, smaller, faster
– Nagios Event Radio Dispatcher (NERD) provides event
streams
• Internal speedups – config check, event queue, macro search
• Configuration – host address optional, service parents
• Host/service/contact values – no alerts for “minor” problems
• No more embedded Perl
c©2003-2013 John Sellens USENIX LISA 27, 2013 24
Notes:
• Announced: Nagios Core 4.0.0 Now Available
http://labs.nagios.com/2013/09/20/nagios-core-4-now-available/
• What’s new in 4 at:
http://nagios.sourceforge.net/docs/nagioscore/4/en/whatsnew.html
• I have not spent a lot of time with the 4.0 release (yet!)
– Released the Friday before the Monday these notes were due
– But seemed easy enough to get going in my quick testing
• Core Worker processes default to 1.5 x #CPUs, minimum 4
• I think service parents are effectively just easier to configure service de-
pendencies
• I suspect that host/service values will be complicated to use effectively
– Though relatively easy to only notify a manager if more than a few
things are broken
S6 — Nagios: Advanced Topics Nagios Past, Present and Future
Upgrading to Version 4
• In general, v4 is upwardly compatible with v3
– Most major changes are internal and less user visible
• Go through nagios.cfg and cgi.cfg for changes
– Or replace with current and backport your changes
• Some settings are now deprecated
– A few errors previously acceptable now fixed
• Some minor changes to macros
• Read the docs
c©2003-2013 John Sellens USENIX LISA 27, 2013 25
Notes:
• http://nagios.sourceforge.net/docs/nagioscore/4/en/upgrading.html
S6 — Nagios: Advanced Topics Nagios Past, Present and Future
What is This Icinga Thing?
• Icinga is a fork of Nagios
• Perception: Nagios single developer model
• Perception: Nagios Enterprises becoming more “corporate”
• Perception: Progress too slow
• Reality: Maybe not so clear cut
• Stated intention: Maintain compatibility
• Announced May 2009
c©2003-2013 John Sellens USENIX LISA 27, 2013 26
Notes:
• http://www.icinga.org/
S6 — Nagios: Advanced Topics Nagios Past, Present and Future
What is This Shinken Thing?
• A Nagios-like tool, redesigned and rewritten from scratch
• Discovery, GUI
• Python
• Compatible configuration
c©2003-2013 John Sellens USENIX LISA 27, 2013 27
Notes:
• http://www.shinken-monitoring.org/
S6 — Nagios: Advanced Topics Nagios Past, Present and Future
Predicting the Future
• Nagios now has more core developers
• Many enhancement projects
– Commercial and non-commercial
• Obvious examples of outreach and community building
– e.g. ideas.nagios.org
• Making the Nagios project/product sustainable?
c©2003-2013 John Sellens USENIX LISA 27, 2013 28
Notes:
• Development Roadmaps are currently a little sparse
http://wiki.nagios.org/index.php/Development_Roadmaps
• My view is that the result of going corporate (Nagios Enterprises) is a
good thing
– Momentum, support, choices, new tools, . . .
S6 — Nagios: Advanced Topics Nagios Past, Present and Future
What is this Nagios XI Thing?
• The Next Generation of Nagios
• An integrated, supported commercial product
• New web interface
• New web configuration interface
• An integrated all-in-one Nagios
– Includes NagiosQL and PNP, plus more!
• Comparable to GroundWork, Opsview, Op5, Centreon, . . . ??
c©2003-2013 John Sellens USENIX LISA 27, 2013 29
Notes:
• Announced in mid to late 2009, available since early 2010 or so
• I haven’t looked closely at it
– I’m lazy
– And cheap
– And I like my vi’d config files
S6 — Nagios: Advanced Topics Nagios Past, Present and Future
Nagios Fusion
• Distrbuted monitoring solution
– Centralized view of de-centralized Nagios servers
– Links to remote servers for “drill down”
• Aggregates tactical overview
• Simple configuration through the web
• Should scale well
– Another level of indirection . . .
c©2003-2013 John Sellens USENIX LISA 27, 2013 30
Notes:
• http://www.nagios.com/products/nagiosfusion
• Commercial product
• Works with both Nagios XI and Nagios Core
• Coming soon . . .
S6 — Nagios: Advanced Topics Nagios Plugins
Nagios Plugins
c©2003-2013 John Sellens USENIX LISA 27, 2013 31
S6 — Nagios: Advanced Topics Nagios Plugins
Nagios Plugins
• All Nagios host and service checks are performed by external
“plugins”
– Hand wave away any question about plugins run with
embedded Perl interpreter
• A separate nagiosplug development team
– Standard syntax and output
– Consistent coding standards and processing
• The nagiosplug distribution has helpers for your own plugins
– Such as Nagios::Plugin for Perl
c©2003-2013 John Sellens USENIX LISA 27, 2013 32
Notes:
• http://nagiosplugins.org/
• Version 1.4 February 3, 2005; 1.4.16 June 27, 2012
• i.e. Pretty stable at this point
S6 — Nagios: Advanced Topics Nagios Plugins
Theory of Plugins
• A plugin command will be invoked by Nagios as required, with
arguments as specified in the command definition that was used
• Standard options include
– who to check: --hostname= or -H
– where to check: --ipaddress= or -I
– the port to check: --port= or -P
– critical error level: --critical= or -c
– warning level: --warning= or -w
– provide usage help: --help or -h
• Critical and warning levels are in units that make sense for the
plugin being used
c©2003-2013 John Sellens USENIX LISA 27, 2013 33
Notes:
• The “ROADMAP” file in the current nagiosplug source provides additional
information
S6 — Nagios: Advanced Topics Nagios Plugins
Theory of Plugins (cont’d)
• Plugins return an informative message, and an exit code
– Limited to about 350 characters of message (pre-V3)
– The message is displayed in the web status interface
• Exit codes indicate current state
– 0 OK
– 1 WARNING
– 2 CRITICAL
– 3 UNKNOWN
• Nagios reacts based on exit code
– Invokes notifications, exception handlers
• Optional “performance data” is for statistics collection
c©2003-2013 John Sellens USENIX LISA 27, 2013 34
Notes:
• Version 3 allows long (4,000 character), multi-line output from plugins
– I think first line goes in web display
– All output available via macros
• Plugins may also return “performance data” by appending an or-bar (“|”)and “key=value” information to the message
• Performance data can be processed via appropriate settings in the na-
gios.cfg file
– So you can stuff your plugin results into a database, or a graph,
etc.
S6 — Nagios: Advanced Topics Nagios Plugins
Plugin Extra-Opts
• C plugins can consult a run-time config file
check_whatever --extra-opts=[section][@file]
• Config file is “ini” style
[section]
key=value
• File defaults to plugins.ini or nagios-plugins.ini
• Searched for in $NAGIOS_CONFIG_PATH or various /etc dirs
• Configure with --enable-extra-opts
c©2003-2013 John Sellens USENIX LISA 27, 2013 35
Notes:
• http://nagiosplugins.org/extra-opts hmm - wish there was per-host option
• I don’t think you can do per-hostaddress settings within the file itself
– Though you could have different sections named for a macro used
on the command line
• Extra-opts capability was added in 1.4.12
• Early on I didn’t find documentation on how to find the file, so I looked at
the code — fixed now
• The code to file the config file seems a little convoluted
• Ethan Galstad pointed out that this is handy for hiding userids and pass-
words and other secrets from the ps command
S6 — Nagios: Advanced Topics Nagios Plugins
Plugin Development
• It’s very easy to write your own plugins
– A shell script that checks a file or process and returns an exit
code is dead easy
• Wrappers around existing data
– Another level of indirection . . .
• For Perl: Nagios::Plugin module
• Performance/speed can be an issue as you monitor more
services
• Try it yourself!
c©2003-2013 John Sellens USENIX LISA 27, 2013 36
Notes:
• Nagios::Plugin comes with the nagios plugins distribution
– Configure with --enable-perl-modules
• Nagios Plugin API: http://nagios.sourceforge.net/docs/3_0/pluginapi.html
• I tried it myself: snagtools plugins collection
www.syonex.com/resources/software.html
S6 — Nagios: Advanced Topics Nagios Plugins
Existing Plugins
• There are many, many plugins already available
• Typically divide into local and remote checks
– Local checks something on the Nagios server
– Remote checks use SNMP or connect to a remote service
port to check status
• Mechanisms for executing “local” plugins on remote machines
– e.g check_by_ssh, NRPE
– Religion: I check remote machine state via SNMP
c©2003-2013 John Sellens USENIX LISA 27, 2013 37
Notes:
• More on NRPE later
• That may be my religion, but some times I am a heretic
• See exchange.nagios.org
– Lots of tools and plugins there
– Pointers to various places
• We’ll discuss plugin efficiency and overhead later
S6 — Nagios: Advanced Topics More on Configuration
More on Configuration
c©2003-2013 John Sellens USENIX LISA 27, 2013 38
S6 — Nagios: Advanced Topics More on Configuration
Configuration Details
• Recall: Text files, 3 core files plus object definitions
• Recall: Read and digested when Nagios starts
• Recall: Can be a little complicated at first glance
– Likely only because there are so many possibilities
– But consistent and well documented
• Let’s look at it in some more depth
c©2003-2013 John Sellens USENIX LISA 27, 2013 39
Notes:
• Version 3 allows pre-digesting configs before startup
– Speeds startup time in large environments
• Version 4 config parsing is much faster and better
S6 — Nagios: Advanced Topics More on Configuration
Required Files
• nagios.cfg resource.cfg and cgi.cfg
– nagios.cfg location is specified on nagios command line
– resource.cfg location is set in nagios.cfg
– nagios.cfg location is set in cgi.cfg
– cgi.cfg location is compiled into the CGIs
• Syntax for these: variable = value
• Variables are case-sensitive
c©2003-2013 John Sellens USENIX LISA 27, 2013 40
S6 — Nagios: Advanced Topics More on Configuration
nagios.cfg Settings
• Configures overall Nagios operation
• File and directory locations
• Timing intervals and timeouts
• Log rotation, and logging operations
• Service check scheduling options
• Administrator email and pager addresses
• And more . . .
c©2003-2013 John Sellens USENIX LISA 27, 2013 41
Notes:
• The documentation has all the details
• Named nagios.cfg by convention — the startup script explicitly refers to
the file
• Of course, going with the flow, and naming your file “nagios.cfg” will make
your life easier
S6 — Nagios: Advanced Topics More on Configuration
nagios.cfg Settings (cont’d)
• Worth a particular mention is check_result_reaper_frequency
• It tells Nagios how often (in seconds) to gather results from
service checks
– Default setting is every 10 seconds
• Child processes (service check plugins) hang around until they
are “reaped”
• If you’re doing a non-trivial number of service checks, setting
this lower will (typically)
– Reduce the number of processes waiting, taking up space
– Lower the local load average numbers, sometimes
c©2003-2013 John Sellens USENIX LISA 27, 2013 42
Notes:
• Mostly obsolete in version 4 due to the workier process implementation
• In version 2, check_result_reaper_frequency was called
service_reaper_frequency
• Load average depends on how processes blocked on I/O are counted on
your particular system
– I think
S6 — Nagios: Advanced Topics More on Configuration
resource.cfg Settings
• Not actually required, and you can have multiple resource files
• Can have restrictive permissions to hide secrets from the web
interface
• Defines up to 32 $USERx$ macros for substitution into
commands
– $USER1$ — path to the plugins
– $USER2$ — path to event handlers (if any)
– Various other substitutions
c©2003-2013 John Sellens USENIX LISA 27, 2013 43
Notes:
• And resource files don’t all have to be called resource.cfg
• I think it’s actually 256 $USERx$ macros now, but the documentation is
inconsistent (when last I looked)
• Database connectivity information won’t do you much good if you didn’t
build Nagios to use databases to store various bits of information
S6 — Nagios: Advanced Topics More on Configuration
cgi.cfg Settings
• Used by the web interface
• File locations, including a pointer to nagios.cfg
• User access controls, permissions, guest userid
• How to check if Nagios is running
• Extended host and service information for identifying images
and status map coordinates
– Or references to config files that contain that information
c©2003-2013 John Sellens USENIX LISA 27, 2013 44
Notes:
• The “How to check if Nagios is running” no longer exists/works as far as I
can tell
S6 — Nagios: Advanced Topics More on Configuration
Object Configuration
• nagios.cfg specifies the locations of “object configuration files”
– With the cfg_file and cfg_dir variables
– Which can be repeated to refer to multiple files and
directories
– cfg_dir directories are recursive
• Hosts, services, contacts, etc. are defined in template-based
definitions in those files
• Template inheritance provides a reasonably effective
mechanism
c©2003-2013 John Sellens USENIX LISA 27, 2013 45
Notes:
• cfg_dir recursion was added in version 2
S6 — Nagios: Advanced Topics More on Configuration
Object Configuration (cont’d)
• Object definitions look like
define type {
directive value
directive value
...
}
• Definitions include a “type_name” directive and value
• Templates include a “name” directive and value, and the
directive “register” with a value of “0”
• To inherit from a template, a definition includes a “use” directive
with a template name as the value
c©2003-2013 John Sellens USENIX LISA 27, 2013 46
Notes:
• Directive names are case-sensitive
• Comments with # in column 1 or semi-colon anywhere
• Newer documentation uses “directive” and “variable” interchangeably
• It sounds a little more complicated than it really is
• You can also inherit from a “registered” object, such as some other host
or service
– e.g. host1 is fully defined, host2 is “just like host1”
• See “Object Inheritance” in the docs
http://nagios.sourceforge.net/docs/3_0/objectinheritance.html
S6 — Nagios: Advanced Topics More on Configuration
Object Templates
• The “use” directive causes a definition to inherit the directives
declared in a template definition
• You can only have one “use” directive in a given template
– Its location in the template is irrelevant, anything local to the
template overrides anything inherited from a “use” directive
• Value for “use” directive is comma-separated list
– You can have a “tree” of definition inheritance from a
common root
– Or multiple roots
• Reasonably powerful . . .
c©2003-2013 John Sellens USENIX LISA 27, 2013 47
Notes:
• Version 3 added multiple template inheritance
– In earlier versions you could inherit from only one template
– Though you could have a template “chain”
• I think template inheritance must take the form of an acyclic directed
graph.
– There — my math degree proves useful once again
S6 — Nagios: Advanced Topics More on Configuration
Object Template Inheritance
• Host devweb1 definition contains use 1, 4, 8
• First template (1 in this case) has highest priority
c©2003-2013 John Sellens USENIX LISA 27, 2013 48
Notes:
• Stolen from Nagios 3 documentation
• Multiple inheritance sources were added in version 3
• Very handy with multiple locations, variable overrides, etc.
S6 — Nagios: Advanced Topics More on Configuration
Directives and Values
• Each object type has pre-defined directive names
– Generally consistent across different object types
• Append to an inherited value: directive +value
• Delete inherited value: directive null
c©2003-2013 John Sellens USENIX LISA 27, 2013 49
Notes:
• Appending with + is called additive inheritence
• No subtractive inheritence i.e. can’t remove an item from a list unless you
re-set the list
S6 — Nagios: Advanced Topics More on Configuration
Custom Directives
• Custom directives start with underscore
– And are case IN-sensitive
– e.g. _snmp_community, etc.
• Use in host, service and contact definitions
• Refer to as macros or environment variables
– e.g. _bloop in a host definition becomes
∗ macro $_HOSTBLOOP$
∗ environment variable NAGIOS_ _HOSTBLOOP
c©2003-2013 John Sellens USENIX LISA 27, 2013 50
Notes:
• The documentation calls these custom variables, not directives
• Note that the macros and environment variables are uppercase
• And similarly for SERVICE and CONTACT custom variables
• I don’t know if you can use custom directives in other objects, or how you
would refer to them
• More on macros and environment variables later
S6 — Nagios: Advanced Topics More on Configuration
Implied Inheritance
• Nagios will sometimes assume a value from a related object
• Service objects will inherit
– contact_groups, notification_interval, notification_period
from the associated host
• Hostescalations and serviceescalations will similarly inherit as
well
– Except notification_period becomes escalation_period
c©2003-2013 John Sellens USENIX LISA 27, 2013 51
Notes:
• Which is convenient and makes sense, and saves keeping the same in-
formation consistent in multiple places
S6 — Nagios: Advanced Topics More on Configuration
Object Types
• There are quite a few different object types that can be defined
• host, hostdependency, hostescalation
• hostgroup
• contact, contactgroup
• service, servicegroup, servicedependency, serviceescalation
• hostextinfo, serviceextinfo
• timeperiod
• command
c©2003-2013 John Sellens USENIX LISA 27, 2013 52
Notes:
• I think “quite a few” equals 14
• More or less self explanatory
• The “extinfo” types provide “extended” information for hosts and services
for the web interface
• hostgroupescalation removed in 2.x — you can now use hostgroup_name
in hostescalation definitions
• 2.x added servicegroup primarily for CGI display purposes
• Servicegroup can be referred to by servicedependency and serviceesca-
lation definitions
• hostextinfo and serviceextinfo are now deprecated
– All directives are now part of host and service definitions
S6 — Nagios: Advanced Topics More on Configuration
Object Definitions
• Object definitions have many possible directives
– You’ll default many, and inherit many from master templates
• The directives are fairly consistent across different object types
• The sample configuration files are well-documented
• The samples used to be in different files by object type
– Not necessary to split them up that way
– Many find that a “cfg_dir” full of .cfg files is very convenient
• Order is unimportant — multiple files are treated the same as a
single file
c©2003-2013 John Sellens USENIX LISA 27, 2013 53
Notes:
• You should read/review the sample config files
• cfg_dir is recursive as of 2.x
– Which is handy
S6 — Nagios: Advanced Topics More on Configuration
Timeperiod Definitions
• Timeperiods are used to define when to do service checks, and
when notifications can be sent
define timeperiod{
timeperiod_name nonwork
alias Time to be Not Working
sunday 00:00-24:00
monday 00:00-09:00,18:00-24:00
}
• Most object types have a “type_name” directive, which is used
by other objects to refer to the object being defined, and for
some display purposes
• “alias” defines a more verbose description of the object
c©2003-2013 John Sellens USENIX LISA 27, 2013 54
Notes:
• Apparently we work 24 hours a day from Tuesday to Saturday
• Most object types also have an “alias” directive
• http://nagios.sourceforge.net/docs/3_0/timeperiods.html
S6 — Nagios: Advanced Topics More on Configuration
Timeperiod Definitions (cont’d)
• Version 3 added lots of timeperiod features
– Dates and date ranges, day of month
– Offset weekday e.g. 3rd Monday of a month or all months
– And more!
• Exclude one timeperiod from another with the exclude directive
• Very powerful, handy for scheduling or recurring windows,
holidays, vacations, etc.
• Non-weekday definitions are called “exceptions”
– Which seems confusing to me
c©2003-2013 John Sellens USENIX LISA 27, 2013 55
Notes:
• Lots and lots of examples at
http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html#timeperiod
• I think you can do just about anything
– Including making it completely incomprehensible
• Version 3 has some timeperiod exclusion bugs, fixed in version 4
S6 — Nagios: Advanced Topics More on Configuration
Command Definitions
• All service and host checks are performed by commands
defined in command definitions
define command{
command_name check_tcp
command_line $USER1$/check_tcp
-H $HOSTADDRESS$ -p $ARG1$
}
• Note the use of variable substitution to pass parameters to the
actual command
• A service definition that is checking for the gopher port to be
listening would use
check_command check_tcp!70
c©2003-2013 John Sellens USENIX LISA 27, 2013 56
Notes:
• I wrapped the command_line to fit, but you can’t do that in a real definition
• There are a number of macros defined based on the values of directives
in definition types
– We’ll touch on macros later
• $HOSTADDRESS$ is set from the “address” directive in the appropriate
host definition
• Command line quoting is sometimes challenging, so try to avoid special
characters in your arguments
• Do you remember gopher?
S6 — Nagios: Advanced Topics More on Configuration
contact Definitions
• A contact is a person that may be notified, or who has access to
the web interface
• host_notification_period and service_notification_period take a
timeperiod_name
• host_notification_options are d,u,r,f,s,n for down, unreachable,
recovery (up), flapping start/stop, scheduled downtime
start/stop, or none
• service_notification_options are w,u,c,r,f,s,n for warning,
unknown, critical, etc.
• The email, pager, host_notification_commands, and
service_notification_commands directives are “obvious”
c©2003-2013 John Sellens USENIX LISA 27, 2013 57
Notes:
• I think by now you probably understand the definition syntax
• Notification periods define when you can notify the person
• I’ve left out the contact_name and alias directives
• Notification options are comma-separated lists of code letters
• You might not want to include unreachable by default
– If a key router, switch or firewall goes down, you’ll get a lot of noise
– As long as you’ve properly defined host parent/child relationships
S6 — Nagios: Advanced Topics More on Configuration
contactgroup Definitions
• Think of a contactgroup as a work team that is responsible for
some hosts and/or services
– It provides a level of indirection or abstraction in the
configuration of areas of responsibility
• You might also define a “managers” contactgroup to be used in
escalations
• The members directive is a comma separated list of
contact_names of people in this group
c©2003-2013 John Sellens USENIX LISA 27, 2013 58
Notes:
• Recall that all problems in computer science can be solved by another
level of indirection
S6 — Nagios: Advanced Topics More on Configuration
host Definitions
• A host definition specifies a host or device which provides
“services”
• A host has both a host_name and an address
– address can be an IP address or FQDN
– defaults to host_name in v4
– An IP address avoids alerts if DNS fails, but is harder to
maintain
• The check_command is used to see if a host is up
– Typically a “ping” test of some form
– Only used if service checks fail
• parents — a list of routers, gateways between here and there
c©2003-2013 John Sellens USENIX LISA 27, 2013 59
Notes:
• Remember to define the check_command (somewhere!) otherwise your
host checks will show as “pending”
– Depending on firewalls, sometimes pings won’t work and I use
check_ssh as the check_command
• The “parents” directive lets you describe your network topology, so that if
a network link goes down, you’ll get notified about the link, not that all the
unreachable hosts are down
• An unreachable host can cause a “route verification” to take place
– If I was a marketer, I would say “root cause analysis” here
S6 — Nagios: Advanced Topics More on Configuration
Groups for Hosts
• Hosts are included in hostgroups via
– The hostgroups directive in the host definition
– The members directive in hostgroup definitions
• contact_groups directive is per-host, not per-hostgroup
– Consistent with use in service definitions
– Need to be a contact for all hosts in a hostgroup to have
access to the hostgroup
c©2003-2013 John Sellens USENIX LISA 27, 2013 60
Notes:
• Contact groups used to be defined for a hostgroup
• I think this means that hostgroups and servicegroups are more or less
equivalent
– Just different names for grouping things you want to group
S6 — Nagios: Advanced Topics More on Configuration
hostgroup Definitions
• A hostgroup defines an administrative grouping of hosts
• Used to organize output in the web interface, and to determine
access restrictions
– You can only access those hosts/services that you are
responsible for
• members lists the hosts that are members
• hostgroup_members includes other hostgroups in this one
c©2003-2013 John Sellens USENIX LISA 27, 2013 61
Notes:
• Can have multiple members directives for convenience
– I think
• Contacts used to be attached to hosts only by the old contact_groups
directive in hostgroups
S6 — Nagios: Advanced Topics More on Configuration
service Definitions
• In Nagios terms, a “service” could be an aspect of a running
system, like disk capacity, or memory utilization
– A “service” needn’t be offered externally to a device
• Nagios tests services based on
– max_check_attempts — how many times to check a service
before concluding it is actually down
– normal_check_interval — how many “time units” to wait
between regular service checks
– retry_check_interval – how many “time units” to wait before
checking a service that is not “OK”
• contact_groups — who to complain to in case of a problem
c©2003-2013 John Sellens USENIX LISA 27, 2013 62
Notes:
• See the documentation on “Service Check Scheduling” at
http://nagios.sourceforge.net/docs/2_0/checkscheduling.html
– Nagios works hard to be efficient and effective at doing checks
– Not yet written/updated for Nagios 3.x
S6 — Nagios: Advanced Topics More on Configuration
Service check_command Directives
• Each service definition must include a check_command
directive
• You can refer to a defined command
– Arguments can be provided, separated by !
• For example
check_command check_ntserv!w3svc
• Open question: where should you quote special characters?
c©2003-2013 John Sellens USENIX LISA 27, 2013 63
Notes:
• You used to be able to provide a “raw” command to be executed, sur-
rounded by double quotes
– But I don’t think you can anymore
– Not often used — you’re better off to use the indirection provided
by command definitions
S6 — Nagios: Advanced Topics More on Configuration
Notification Configuration
• Host and service definitions also define when notifications
should be sent using the following directives
– notification_interval — number of “time units” (default:
minutes) before re-notifying of a problem
– notification_period — the timeperiod during which
notifications may be sent
– notification_options — hosts: d,u,r,f,s,n; services: w,u,c,r,f,s
– notifications_enabled — 1 for yes, 0 for no
• The contact_group for the hostgroup or service is notified, using
the rules for the individual contacts in group
c©2003-2013 John Sellens USENIX LISA 27, 2013 64
Notes:
• “Time units” are in terms of the “interval_length” defined in the nagios.cfg
file
– Which defaults to 60 seconds
– So the number given in an object definition for an “interval” is usu-
ally the number of minutes
• Hosts: down, unreachable, recovery (up), flapping start/stop, scheduled
downtime start/stop, none
• Services: warning, unknown, critical, recovered, flapping start/stop, sched-
uled downtime start/stop, none
• See the detailed rules in “Notifications”, at
http://nagios.sourceforge.net/docs/3_0/notifications.html
S6 — Nagios: Advanced Topics More on Configuration
Notes and Links
• You can define
– notes notes_url action_url
for
– host hostgroup service servicegroup
objects
• Show up in reasonable places in the web interface
• Allow you to link to documentation or other things you can do to
the host
c©2003-2013 John Sellens USENIX LISA 27, 2013 65
Notes:
• These are relatively recent - late version 2, or version 3?
• Or perhaps they were always in the hostextinfo and serviceextinfo objects
which I never used?
S6 — Nagios: Advanced Topics More on Configuration
Dependency Definitions
• The hostdependency and servicedependency definitions allow
you to define relationships between hosts and services
• For example, you could declare that your WWW service
depends on your SQL service
– If SQL is down, don’t bother checking WWW, because we
already know it will fail
• Or your web host may depend on your nfs-server host
– Similar but different from a host’s “parents”
– parents defines network topology
c©2003-2013 John Sellens USENIX LISA 27, 2013 66
Notes:
• Leave dependent_host_name and dependent_hostgroup_name empty
(or null) for “same host”
• See “Host and Service Dependencies” at
http://nagios.sourceforge.net/docs/3_0/dependencies.html
• More on dependencies later
S6 — Nagios: Advanced Topics More on Configuration
Escalation Definitions
• If something is broken, you may want/need to “escalate” the
problem if it’s not resolved quickly
• Define these if so: hostescalation hostgroupescalation
serviceescalation
• first_notification and last_notification — which notifications to
escalate
• notification_interval — how often to send them
• contact_groups — who to send them to
– Remember to include the “lower level” contactgroups
– Or use additive inheritance (with a + sign)
c©2003-2013 John Sellens USENIX LISA 27, 2013 67
Notes:
• Notifications and escalations provide a very flexible mechanism
• For example, you could set up different contactgroups and contacts for
the same people, but with different ways to contact them, so that you
could
– first email problems reports to the contacts
– then page them
– then send SMS messages to their phones
– then call their home phone numbers
– and so on
• See “Notification Escalations” at
http://nagios.sourceforge.net/docs/2_0/escalations.html
• hostgroupescalation removed in 2.x — you can now use hostgroup_name
in hostescalation definitions
S6 — Nagios: Advanced Topics More on Configuration
Definition Shortcuts
• General rule: anywhere you can list a host_name or
hostgroup_name you can
– use a comma-separated list of hosts/groups
– exclude with !
– use a wildcard host_name of “∗”, meaning “all hosts”
to have it apply (or not) to multiple hosts
• e.g. A service definition for the HTTP service might include
hostgroup_name webservers
to cause the service to be defined for all hosts in the
webservers hostgroup
• This can save a lot of repetition in your configs
c©2003-2013 John Sellens USENIX LISA 27, 2013 68
Notes:
• e.g. For a service, dependency, or escalation
• Note that this is a “general rule” and won’t necessarily apply in every
possible instance
• In nagios.cfg set use_regexp_matching=1
• See “Time-Saving Tricks For Object Definitions” at
http://nagios.sourceforge.net/docs/3_0/objecttricks.html
S6 — Nagios: Advanced Topics More on Configuration
Nagios Macros
• Nagios defines a number of macros for use in commands
– Some implicitly from definition directives, etc.
– Some explicitly, in the resource.cfg file
• These macros can be substituted into host and service check
commands, notifications, event handlers, etc.
– Different macros are available at different times
• An effective way of passing variable data outside of the Nagios
core
c©2003-2013 John Sellens USENIX LISA 27, 2013 69
Notes:
• See “Understanding Macros and How They Work” at
http://nagios.sourceforge.net/docs/3_0/macros.html
http://nagios.sourceforge.net/docs/3_0/macrolist.html
for all the details
• Including a handy reference table of what’s available when, and what all
the macros mean
S6 — Nagios: Advanced Topics More on Configuration
More on Macros
• Lots of macros — all sorts of variant information is available
• Most are added to environment e.g. NAGIOS_SERVICESTATE
– Including any custom variables
• “On-Demand Macros” allow you to refer to values from other
config settings e.g.
$SERVICESTATEID:novellserver:DS Database$
• “On-Demand Group Macros” get you a comma-separated list of
all values in a host, service or contact group e.g.
$HOSTSTATEID:hg1:,$
c©2003-2013 John Sellens USENIX LISA 27, 2013 70
Notes:
• Can disable environment variables by setting the
enable_environment_macrosvariable to 0
– Avoids a bunch of overhead, or so they say
– But it removes a bunch of information that can be useful for plugins
and other commands
– Use the env command with plugins to add what you need
• On-demand macros are not added to the environment
• Environment variables, on-demand macros and more added in version 2
• Even more macros added in version 3
S6 — Nagios: Advanced Topics Theory and Practice
Theory and Practice
c©2003-2013 John Sellens USENIX LISA 27, 2013 71
S6 — Nagios: Advanced Topics Theory and Practice
Theory of Operation
• Documentation contains “Theory of Operation” information
• It covers many of the details of how things actually work
– Status and reachability of network hosts
– Determination of network outages
– Service check scheduling, service state
– Notifications, timeperiods
– Plugins
• You should at least review this information, as it will help you
understand both what is happening, and what is possible
c©2003-2013 John Sellens USENIX LISA 27, 2013 72
Notes:
• Used to be a separate section, now at the end of “The Basics”
• And review the “Advanced Topics” section as well
• And all the rest of the documentation while you’re at it
S6 — Nagios: Advanced Topics Theory and Practice
State of the Network
• Current state of a service or host: two things
– Status: OK, Warning, Up, Down, etc.
– State: Soft, Hard, Unreachable
• Soft state: a check failed, but still have retries to do
– Logged, and event handler run
• Hard state: When we’re sure
– Notification logic invoked
c©2003-2013 John Sellens USENIX LISA 27, 2013 73
Notes:
• I was tempted to try a pun related to the state of the network address, but
I held back
• State Types: http://nagios.sourceforge.net/docs/3_0/statetypes.html
• That document doesn’t mention “unreachable” as a state, but I think it
likely is an actual state
S6 — Nagios: Advanced Topics Theory and Practice
Stalking and Volatility
• If enabled, stalking logs any changes in plugin output
– Even with no state change
– e.g. RAID check was “1 disk dead” and is now “2 disks dead”
– Logged for later review/analysis
• Volatile services
– Something that resets to OK after each check
– Need attention every time there is a problem
– Notification and event handler happen once per failure
– e.g. Alert on a port scan
c©2003-2013 John Sellens USENIX LISA 27, 2013 74
Notes:
• Most people likely won’t want to use stalking
• Enabled on host and service definitions
• http://nagios.sourceforge.net/docs/3_0/stalking.html
• I’m thinking you could likely get the same result as volatility by setting only
1 check, no recovery notifications
– But I could be wrong
• http://nagios.sourceforge.net/docs/3_0/volatileservices.html
S6 — Nagios: Advanced Topics Theory and Practice
Topology Matters
• Parents directive in host definitions defines topology
• Parents are typically routers, firewalls, switches, etc.
– i.e. How the packets get there from here
• Can have multiple parents with redundant network paths
• Notification_option “u” sends on UNREACHABLE state
• Buzzphrase: root cause analysis
c©2003-2013 John Sellens USENIX LISA 27, 2013 75
Notes:
• Unless your network is tiny, flat or otherwise trivial
• Used in drawing network maps in a sensible way
• Parents in a comma-separated list of all parent hosts
• See the docs: “Determining Status and Reachability of Network Hosts”
http://nagios.sourceforge.net/docs/3_0/networkreachability.html
S6 — Nagios: Advanced Topics Theory and Practice
Depending on Others
• Host and service dependencies define operational requirements
– e.g. Web server can’t work unless file server is working
• execution_failure_criteria and
notification_failure_criteria determine what we do
if something we depend on fails, e.g.
– if file server down, don’t execute web check
– and don’t notify me about web problem
• Set inherits_parent to inherit dependencies in definitions
c©2003-2013 John Sellens USENIX LISA 27, 2013 76
Notes:
• Failure criteria are o (OK), w (warning), u (up), c (critical), p (pending), n
(none)
• I think inherits_parent is perhaps misnamed – parents are topo-
logical, dependencies are different
• http://nagios.sourceforge.net/docs/3_0/dependencies.html
S6 — Nagios: Advanced Topics Theory and Practice
Cached Checks
• Can cache and re-use host or service check results
• Used only for “On-Demand Checks”
– Checking that host is up if a service fails
– Checking topological reachability
– For “predictive dependency checks”
• i.e. Checking for “collateral damage”
• Lower overhead, good results
– You should enable and tune the cache
c©2003-2013 John Sellens USENIX LISA 27, 2013 77
Notes:
• “Predictive dependency checks” – in a network outage, schedule more
topological checks earlier, since we’re likely to need that information
• http://nagios.sourceforge.net/docs/3_0/cachedchecks.html
• Don’t blame me for the “cached checks” pun, because in Canada “cached
cheques” makes no sense at all
S6 — Nagios: Advanced Topics Theory and Practice
Event Handlers
• In a perfect world, nothing would ever go wrong
– In a semi-perfect world, problems would fix themselves
• Event handlers are one of Nagios’ ways of moving closer to
perfection
• An event handler is a command that is run in response to a
state change
– Canonical example: restart httpd if WWW service fails
– Open a trouble ticket on failure?
• Complications: runs as the nagios user, on the nagios server
• Global and specific host and service event handlers
c©2003-2013 John Sellens USENIX LISA 27, 2013 78
Notes:
• And incidentally, in a perfect world, tutorial notes would never contain any
typos
• A state change is (simplistically speaking) a failure or a recovery
– I’m ignoring “hard” vs “soft” states here
• As documented, sudo and ssh can be useful in event handling commands
for elevating permissions and access to remote services
– And recall that SSH key files can allow only specific commands
– I restart a Windows IIS service from Nagios via a script and ssh
• Host and service definitions can use the event_handler directive
• Event handlers can (and should) be passed all sorts of state and check
attempt information
S6 — Nagios: Advanced Topics Theory and Practice
External Commands
• The Nagios server maintains a named pipe in the file system for
accepting various commands from other processes
• External commands are used most often by the web interface to
record information and modify Nagios’ behaviour
– But you can do lots of things from shell scripts . . .
• Some of the available functionality
– Add/delete host or service comments
– Schedule downtime, enable/disable notifications
– Reschedule host or service checks
– Submit passive service check results
– Restart or stop the Nagios server
c©2003-2013 John Sellens USENIX LISA 27, 2013 79
Notes:
• Written to the named pipe as a single line, with multiple fields
– Syntax is timestamp, command, then ;-separated arguments
• e.g. to get Nagios to restart, write this:
[1041175870] RESTART_PROGRAM;1041175870
• Documented (of course) at
http://nagios.sourceforge.net/docs/3_0/extcommands.html
• Exhaustive list of 157 available commands, with examples, at
http://www.nagios.org/developerinfo/externalcommands/
S6 — Nagios: Advanced Topics Theory and Practice
Passive Service Checks
• Nagios can accept service check results from other programs
– Since Nagios did not initiate the check, these are called
“passive service checks”
• These are useful for
– Asyncronous events (SNMP traps, say)
– Results from other existing programs
– Results from remote or secured systems
• You’ll recall that the NSCA addon uses these
c©2003-2013 John Sellens USENIX LISA 27, 2013 80
Notes:
• Submitted through the external command interface
S6 — Nagios: Advanced Topics Theory and Practice
Distributed Monitoring
• Nagios supports distributed monitoring of a certain style
• Remote Nagios servers are essentially probe engines,
submitting their results to a central server with passive service
check results
• The configuration on the remote servers is a subset of the
central configuration
• The central server is configured to notice if the passive results
stop coming from the remote server
c©2003-2013 John Sellens USENIX LISA 27, 2013 81
S6 — Nagios: Advanced Topics Theory and Practice
Distributed Monitoring (cont’d)
• This seems like a fair amount of duplication of effort to me, but it
gets you all the status on one central console
• I tend to set up independent Nagios servers
– And use my check_nagios_status plugin to “screen-scrape”
the remote web interface
– Providing a central summary and click-through to the remote
server
• Tools like DNX and Mod-Gearman can spread the load
– But still retain one Nagios host doing the scheduling
• Your mileage may vary
c©2003-2013 John Sellens USENIX LISA 27, 2013 82
Notes:
• More on DNX on page 139
• More on Mod-Gearman on page 140
• My simple-minded mbdivert (page 141) distributes some checks
• The “central aggregation” approach is used by a number of more recent
tools, such as Nagios Fusion (page 30), Thruk (page 148), MNTOS (page
148), and Multisite (page 148)
S6 — Nagios: Advanced Topics Theory and Practice
Adaptive Monitoring
• Can change things during runtime via external commands
– e.g. schedule changes, or from an exception handler
• Can change
– Check commands and arguments
– Check interval, max attempts, timeperiod
– Event handler commands and arguments
• Likely just for very specific uses or situations
c©2003-2013 John Sellens USENIX LISA 27, 2013 83
Notes:
• An previous audience member gave an example of an active/passive
cluster that is behind a hardware load balancer
– Checking the hosts directly
– Need to move the checks if the service fails over
– No service address that moves between the hosts
• Added in version 2
• http://nagios.sourceforge.net/docs/3_0/adaptive.html
S6 — Nagios: Advanced Topics Theory and Practice
Obsession
• OCSP: Obsessive Compulsive Service Processor
• OCHP: Obsessive Compulsive Host Processor
• Commands that may be executed after every service or host
check
• Allows you to pass results to external applications
– e.g. Used to submit distributed monitoring results with
send_nsca
• Efficient? Commonly Used? Scalable?
c©2003-2013 John Sellens USENIX LISA 27, 2013 84
S6 — Nagios: Advanced Topics Theory and Practice
NEB – Nagios Event Broker
• Allows you to add code to the Nagios core
• Dynamically loaded module
• Has access to internal Nagios events and data
• Limited documentation — helloworld.c in source
• Starting to be used for interesting things
– Logging to database
– DNX – check distribution
c©2003-2013 John Sellens USENIX LISA 27, 2013 85
Notes:
• USENIX ;login: articles on NEB Modules by David Josephsen in October
and December 2008
http://www.usenix.org/publications/login/2008-10/index.html
http://www.usenix.org/publications/login/2008-12/index.html
S6 — Nagios: Advanced Topics Getting Larger
Getting Larger
c©2003-2013 John Sellens USENIX LISA 27, 2013 86
S6 — Nagios: Advanced Topics Getting Larger
Scaling Up
• Nagios can handle a lot without much effort
• As you get larger, advanced features are more important
– Use parent/child and host/service dependencies
– More efficient for humans and machines
• You will need to be more rigorous in your configuration
– Consistency, completeness, tuning
• Version 3 adds scalability and tuning features
• Version 4 adds even more
c©2003-2013 John Sellens USENIX LISA 27, 2013 87
S6 — Nagios: Advanced Topics Getting Larger
Check Execution
• A check is typically fork(), fork(), exec()
• Theory is that running the plugins uses lots of resources
• Distribute plugin execution for more capacity
– Distributed monitoring, multiple servers, DNX, etc.
• Perhaps embedded perl is a practical tool?
– Pros and cons – not all Perl will embed nicely
– Force/avoid ePN: # nagios: +epn
– In first 10 lines of Perl script . . .
– No longer available in version 4
c©2003-2013 John Sellens USENIX LISA 27, 2013 88
Notes:
• ePN is “embedded Perl Nagios”
• Explicit - if first 10 lines of a script contain
– # nagios: +epn
– # nagios: -epn
• Implicit use of ePN via configuration options
• Embedded perl overview and pros and cons at
http://nagios.sourceforge.net/docs/3_0/embeddedperl.html
S6 — Nagios: Advanced Topics Getting Larger
More on Timeperiods
• Timeperiods can be quite specific
• Use and exclude of timeperiods are very flexible
– e.g. define “holidays” and exclude from “workhours”
• Can describe on-call schedules, maintenance windows, etc.
• Avoid check overhead when you don’t care
• Not quite as useful for the one-person shop . . .
c©2003-2013 John Sellens USENIX LISA 27, 2013 89
Notes:
• All the cool timeperiod stuff was added in version 3
• “Time Periods”
http://nagios.sourceforge.net/docs/3_0/timeperiods.html
• “On-Call Rotations”
http://nagios.sourceforge.net/docs/3_0/oncallrotation.html
S6 — Nagios: Advanced Topics Getting Larger
Large Installation Tweaks
• use_large_installation_tweaks option
• No summary macros in the environment to avoid overhead
– e.g. TOTALHOSTSUP, etc.
• Lazy, but more efficient memory freeing in children
• Checks are single, not double, fork()
c©2003-2013 John Sellens USENIX LISA 27, 2013 90
Notes:
• “Large Installation Tweaks”
http://nagios.sourceforge.net/docs/3_0/largeinstalltweaks.html
S6 — Nagios: Advanced Topics Getting Larger
Tuning for Performance
• Lots of tunable configuration parameters
• Keep performance graphs of Nagios
– MRTG, nagiostats, etc.
• Disable environment macros
– Use the env command with plugins to add what you need
• Use passive checks if you can
– Not my favorite idea . . .
• Avoid interpreted plugins, or offload checks
• Use Fast Startup Options – pre-cache configs
c©2003-2013 John Sellens USENIX LISA 27, 2013 91
Notes:
• Lots of good information in the documentation
• “Tuning Nagios For Maximum Performance”
http://nagios.sourceforge.net/docs/3_0/tuning.html
• “Graphing Performance Info With MRTG”
http://nagios.sourceforge.net/docs/3_0/mrtggraphs.html
• “Fast Startup Options”
http://nagios.sourceforge.net/docs/3_0/faststatup.html
S6 — Nagios: Advanced Topics Tips and Tricks
Tips and Tricks
c©2003-2013 John Sellens USENIX LISA 27, 2013 92
S6 — Nagios: Advanced Topics Tips and Tricks
Tips and Tricks
• Use the parent/child topology
– Pre Nagios 3, host checks are not parallelized
– Host checks of a down segment can block all other checks
• Be consistent and use templates and groups
– Make it easy to add another similar host
– Make it easy to add a service to a group of hosts
• Smarter plugins make life (configuration) easier
c©2003-2013 John Sellens USENIX LISA 27, 2013 93
S6 — Nagios: Advanced Topics Tips and Tricks
Hostgroups Are Your Friends
• Hostgroups are really handy for grouping checks
• Use the +groupname syntax e.g.hostgroups +webservers,dbserverswhich makes it easy to add on checks as you include templates
• With multiple Nagios servers useallow_empty_hostgroup_assignment=1
– You can define machine types as common hostgroups
– Even if you don’t have every type on every Nagios server
c©2003-2013 John Sellens USENIX LISA 27, 2013 94
S6 — Nagios: Advanced Topics Tips and Tricks
Organize Your Config Files
• Put files in different directories
• One host per config file
• Generate configs from other information you already have
– Or use a script to generate from a list
• Take advantage of your naming convention
– Wildcards in host names based on FQDNs
c©2003-2013 John Sellens USENIX LISA 27, 2013 95
S6 — Nagios: Advanced Topics Tips and Tricks
Simpler Configs
• Simple commands in Nagios configs
• Same config for every machine
• Set limits outside of Nagios configs
– Manually or automatically
– Move details/smarts outside the configs
c©2003-2013 John Sellens USENIX LISA 27, 2013 96
Notes:
• Well, maybe not every machine, but try to avoid per-machine configs
S6 — Nagios: Advanced Topics Tips and Tricks
Separate Your Problem Space
• Multiple locations? Use multiple servers!
– Distributed monitoring
– Or separate systems, aggregated or summarized centrally
• Can you delegate to different internal groups?
– One system for networks, one for servers, . . .
– Scales software, and your time
c©2003-2013 John Sellens USENIX LISA 27, 2013 97
S6 — Nagios: Advanced Topics Tips and Tricks
Another Level of Indirection and “Smarts”
• Wrap your plugins in smarter scripts
– How about a master checker that knows all and checks
everything on a host?
• Have your plugins determine what’s “normal” or not
– So you don’t have to pre-set thresholds, etc.
– Time of day, trends, past experience
– Based on other current state/activity
• Use inherited defaults in configs, override as needed
• Let the machines make config changes, instead of you
c©2003-2013 John Sellens USENIX LISA 27, 2013 98
Notes:
• There is mathematics that will tell you if something is unusual
– Cricket can use Holt-Winters Forecasting for aberrent behaviour
detection
– Undoubtedly other techniques (standard deviation perhaps?)(
– I have forgotten everything I once knew about math
S6 — Nagios: Advanced Topics Tips and Tricks
Custom Object Variables for Limits
• Define a custom variable, use it in a check command
• In a global host template, set defaults
• Use other templates to set defaults for locations, hostgroups
• Set per-host values in host definition
Host template or host:_LOADWARN 5,3,2_LOADCRIT 7,5,4
Command definition:command_line $USER1$/check_load
--warning=$_HOSTLOADWARN$--critical=$_HOSTLOADCRIT$
c©2003-2013 John Sellens USENIX LISA 27, 2013 99
Notes:
• Useful with multiple inheritence:
define host {name busymachines_LOADWARN 10,6,4_LOADCRIT 15,10,6register 0}define host {use busymachines,generic...}
• Silly me, I only twigged to this after using Nagios 3.x for a long time, when
I was trying to solve a particular problem
S6 — Nagios: Advanced Topics Tips and Tricks
Custom Wraps
• Multi-stage plugin checks with a shell script
• Check multiple things
– Is at least one interface up?
– Is at least one redundant server up?
• Expect scripts for interactions
• Web form posting tools
c©2003-2013 John Sellens USENIX LISA 27, 2013 100
S6 — Nagios: Advanced Topics Tips and Tricks
Custom Wraps (cont’d)
• “Pervasive wrapper” — redefine $USER1$$USER1$=/usr/local/mywrapper/usr/local/libexec/nagios
• Custom object variables in the environment_web_regexp SomeRegExpNAGIOS__HOSTWEB_REGEXP
• Environment macros mean your plugins can know everything
c©2003-2013 John Sellens USENIX LISA 27, 2013 101
Notes:
• i.e. You can make your configs suddenly very different
• enable_environment_macros=1
S6 — Nagios: Advanced Topics Tips and Tricks
Dynamic Behaviour Changes
• Want to disable load and disk I/O checks during backup
– But backup takes a variable amount of time
• checkbook sets various tag files and runs a commandcheckbook -t load -e run-backup– Can set time limits, and recommended plugin result code
– Remove tag files when command finishes
• checkmate is a plugin wrapper that observes the tag files and
manipulates the plugin return codecheckmate -t load -- \check_load -w 1,2,3 -c 3,4,5
c©2003-2013 John Sellens USENIX LISA 27, 2013 102
Notes:
• Influenced by LISA 2000 paper on “eEMU enterprise Event Management
Utility” which provided ways for monitoring clients to advise the server
– “eEMU: A Practical Tool and Language for System Monitoring and
Event Management”, Jarra Voleynik, eEMUconcept Pty Ltd, LISA
2000,
http://www.usenix.org/event/lisa2000/voleynik.html
• Will be available at
http://www.syonex.com/resources/software/
S6 — Nagios: Advanced Topics Abusing Nagios
Abusing Nagios
c©2003-2013 John Sellens USENIX LISA 27, 2013 103
S6 — Nagios: Advanced Topics Abusing Nagios
Contact Convolution
• Note that people to contacts need not be one to one
• Sometimes you want to be paged, sometimes mailed,
sometimes not
• Consider three contact groups:
– sysadmin, sysadmin-email, sysadmin-page
• Contactgroup directives include contactgroup_members
• Define generic contact templates with notification commands
• Define per-person contact templaces with details
• Define 3 contacts for each, use-ing the templates, in the
contactgroups
c©2003-2013 John Sellens USENIX LISA 27, 2013 104
Notes:
• If this isn’t clear, please let me know, and I’ll send a sample file
S6 — Nagios: Advanced Topics Abusing Nagios
What to Say: genoa
• GEneric NOtification Author/Arranger/Artist
• Uses template files and environment variables to format
notifications
• Choose a template based on hostname, problem type,
notification type, custom object variables
c©2003-2013 John Sellens USENIX LISA 27, 2013 105
Notes:
• Via Nagios Exchange
• Or http://www.syonex.com/resources/software/
S6 — Nagios: Advanced Topics Abusing Nagios
genoa in Configs
define command {
command_name host-by-email
command_line genoa -s
}
define command {
command_name host-by-pager
command_line genoa -p -t
}
c©2003-2013 John Sellens USENIX LISA 27, 2013 106
Notes:
• Isn’t that simpler than the typical printf piped into mail?
• Uses all the environment variables set by
enable_environment_macros=1
• Could also of course use the env command to provide more environment
variables
S6 — Nagios: Advanced Topics Abusing Nagios
Sample genoa TemplateSubject: $NOTIFICATIONTYPE$ - $HOSTNAME$/$SERVICEDESC$is $SERVICESTATE$ - alert $NOTIFICATIONNUMBER$To: $CONTACTEMAIL$
$NOTIFICATIONTYPE$: $SERVICEOUTPUT$
Service: $SERVICEDESC$
Host: $HOSTNAME$ / $HOSTALIAS$State: $SERVICESTATE$ for $SERVICEDURATION$Address: $HOSTADDRESS$
Date/Time: $SHORTDATETIME$
genoa template $GENOATEMPLATE$
c©2003-2013 John Sellens USENIX LISA 27, 2013 107
Notes:
• Rules for template searching based on
– HOSTNAME or HOSTADDRESS
– _GENOACLASS custom object variable
– For HOST or SERVICE problem
– NOTIFICATIONTYPE - problem, acknowledgement, etc
– Falls back on a default file
• Nagios sometimes sets variables that I didn’t expect, so trying to deter-
mine whether the notification is for a HOST problem or a SERVICE prob-
lem is more convoluted than I had expected.
S6 — Nagios: Advanced Topics Abusing Nagios
Sample genoa Pager Templates
tdir/pager/SERVICE_PROBLEM
XX $HOSTNAME$ $SERVICEDESC$ $SERVICESTATE$$SERVICEOUTPUT$ for $SERVICEDURATION$$SHORTDATETIME$
tdir/pager/SERVICE_ACKNOWLEDGEMENT
ACK $HOSTNAME$ $SERVICEDESC$ $SERVICEACKAUTHOR$: $SERVICEACKCOMMENT$ : $SERVICESTATE$$SERVICEOUTPUT$ for $SERVICEDURATION$$SHORTDATETIME$
c©2003-2013 John Sellens USENIX LISA 27, 2013 108
Notes:
• I was surprised how expressive I could be with just the standard environ-
ment variables
• But one could imagine running templates through, say, m4
S6 — Nagios: Advanced Topics Abusing Nagios
Give Me a Call: tellitto
• Implements multiple notification methods
• Keeps trying until one succeeds
– e.g. Try SMS service, modem, mail, etc.
• Puts phone/pager numbers in one place
• Pipe message into tellitto, or use genoa -t
c©2003-2013 John Sellens USENIX LISA 27, 2013 109
Notes:
• Via Nagios Exchange
• Or http://www.syonex.com/resources/software/
S6 — Nagios: Advanced Topics Abusing Nagios
Sample tellitto Config
# Don’t put secrets in here# You can use different orders per contact# for better deliverability
bob pageuser bobbob smsgateway 12125551234bob clickatellgate 12125551234bob mail -s tellitto-notification
sally clickatell 12125557890sally smsgateway 12125557890sally pageuser sallysally mail -s tellitto-notification
c©2003-2013 John Sellens USENIX LISA 27, 2013 110
Notes:
• We have a command smsgateway which uses curl to send to our
provider
• Ditto for clickatellgate
• We have a command pageuser which tries to use Hylfax’s sendpageto send via a modem
S6 — Nagios: Advanced Topics Abusing Nagios
Cheap (Check) Tricks
• Check for an open localhost TCP port (e.g. 3366)
tcp.tcpConnTable.tcpConnEntry.tcpConnState
.127.0.0.1.3366.0.0.0.0.0
= listen(2)
– Which may not work on Windows . . .
• I put together a check_allstorage plugin
– Don’t need to set limits in nagios config
– Gets list of filesystems from device, cache in /tmp dir
– Estimates thresholds based on current usage
• And check_netapp_df for NetApp volumes
c©2003-2013 John Sellens USENIX LISA 27, 2013 111
Notes:
• I needed to check that the PureMessage Milters were actually running
and listening locally on remote mail servers
• I find check_allstorage very handy, we used nagmin, which I found a little
cumbersome
– I can tweak the automatic thresholds by editing the /tmp file
• http://www.syonex.com/resources/
S6 — Nagios: Advanced Topics Abusing Nagios
Getting There from Here
• I try to avoid the standard tools to run remote plugins
– Like NRPE, check_by_ssh, etc.
• My check_snmpexec does an SNMP query to run plugin
remotely
– Net-SNMP snmpd exec functionality
• I check Windows services with check_winsvc
– Uses snmptable to get
enterprises.lanmanager
.lanmgr-2.server.svSvcTable
– And looks for the desired services in the output
c©2003-2013 John Sellens USENIX LISA 27, 2013 112
Notes:
• Fits nicely with my SNMP religion
• And another level of indirection
• And avoids having more ports and services on your network
S6 — Nagios: Advanced Topics Abusing Nagios
Windows Service Checks via SNMP
• I check Windows services with check_winsvc
– Uses snmptable to get
enterprises.lanmanager
.lanmgr-2.server.svSvcTable
– And looks for the desired services in the output
• check_winservices uses files to know what should be running
– Calls check_winsvc with per-host services
– Initializes per-host lists if missing
c©2003-2013 John Sellens USENIX LISA 27, 2013 113
Notes:
• check_winservices has global FORCE and IGNORE files, used to initial-
ize per-host lists
• Currently no way to check that a service is not running
– My most common problem is services that stop, rather than ser-
vices that show up
• I’m not very Windows inclined – there are Windows addons and WMI
tools
• But SNMP seems to get me most of what I currently need
S6 — Nagios: Advanced Topics Abusing Nagios
Web Server Abuse
• There’s lots of different transports
• Got a visible web server that can run PHP or CGI?
• Set up a “hidden” web page to run your check
– Use Auth or allow/deny rules to limit access
– Use check_http to look for a regular expression
– Get remote status over port 80
c©2003-2013 John Sellens USENIX LISA 27, 2013 114
Notes:
• We do a few variants on this to get status and state out of our public web
servers
S6 — Nagios: Advanced Topics Abusing Nagios
Put Yourself in Someone Else’s Shoes
• Sometimes you don’t “own” a remote network
– Central networking group
– Service provider
• My theory: a tiny utility server solves many problems
– NRPE, SSH, plugins
• My idea: MonBOX Remote Monitoring Appliance
– Consider this a gratuitous plug
c©2003-2013 John Sellens USENIX LISA 27, 2013 115
Notes:
• Commercial product
• http://www.monbox.com/
• You can of course do something similar yourself
• mbdivert wrapper makes it easy to send some checks to a remote server
S6 — Nagios: Advanced Topics Abusing Nagios
Ghost Hosts
• We had a bunch of SMTP servers, and seven locations
• We had DNS names like smtp.location.company.com
– Which are DNS A records with multiple addresses
– So they can be sendmail “smart hosts”
• How do we know that the smtp names still work?
• Define a “virtual” host called smtp-servers with address
127.0.0.1
• And a bunch of check_smtp service checks for the various
names
c©2003-2013 John Sellens USENIX LISA 27, 2013 116
Notes:
• At former employer
S6 — Nagios: Advanced Topics Abusing Nagios
Service as a Host
• We had an outsourced help desk
– They watched nagios, but only cared about “down” hosts
• How did we get them to notice a down link between up routers?
• Made up a hostname “link-tor-det” and the host check is
check_hops
• So the link down looks like a host down
c©2003-2013 John Sellens USENIX LISA 27, 2013 117
S6 — Nagios: Advanced Topics Abusing Nagios
Nonsense with Negate
• The negate plugin inverts the result of a plugin
• No webserver (or telnet, or . . . ) allowed:
negate check_http -H hostname
• File doesn’t exist:
negate check_file_age -H hostname
• Another level of indirection . . .
c©2003-2013 John Sellens USENIX LISA 27, 2013 118
S6 — Nagios: Advanced Topics Abusing Nagios
Hey! Wake Up!
• At FreshBooks, we have “hack offs”
• I have a remote control power bar
• I bought at 12 volt revolving light at the auto supply
• A few scripts watch the status file
– Look for unhandled problems
• Can be used with a klaxon horn as well . . .
• I now have this hooked into my Asterisk box at home . . .
c©2003-2013 John Sellens USENIX LISA 27, 2013 119
S6 — Nagios: Advanced Topics Abusing Nagios
Remember . . .
• Anything you care about can be monitored
• It does not need to be a “service”
• Or even something on a computer or network device
• Simple shell plugins are powerful
c©2003-2013 John Sellens USENIX LISA 27, 2013 120
S6 — Nagios: Advanced Topics Plugin Pointers
Plugin Pointers
c©2003-2013 John Sellens USENIX LISA 27, 2013 121
S6 — Nagios: Advanced Topics Plugin Pointers
Hardware Happiness
• A variety of ways to get hardware state
– Fans, temperature, power, etc.
• IPMI and ipmitool(1) from OpenIPMI-tools
– And a check_ipmi plugin, or roll your own
– e.g. ipmitool sdr or ipmitool sensor
• Various hardware vendors have specific tools
– e.g. Dell’s OpenManage and the omreport command
• Remember you can wrap anything and make a plugin . . .
c©2003-2013 John Sellens USENIX LISA 27, 2013 122
Notes:
• OpenIPMI-tools rpm in Centos, etc.
• http://ipmitool.sourceforge.net
• http://openipmi.sourceforge.net
• http://www.qwirx.com/check_ipmi
S6 — Nagios: Advanced Topics Nagios Addons
Nagios Addons
c©2003-2013 John Sellens USENIX LISA 27, 2013 123
S6 — Nagios: Advanced Topics Nagios Addons
Visualizing Nagios Data
• Core Nagios has some visualization built-in
– History graphs, availability, etc.
• There are many addons for more graphing detail
– Performance data graphing
• I’m not sure there are clear winners
– Yet
c©2003-2013 John Sellens USENIX LISA 27, 2013 124
Notes:
• Many people subscribe to the theory that a picture is worth n words,
where n is a positive integer, usually of several orders of magnitude
S6 — Nagios: Advanced Topics Nagios Addons
Nagios and RRDtool
• Nagios doesn’t have RRDtool support built-in
• The Apan project aims to make it simple
• Provides a wrapper to use around Nagios plugins
• Uses the plugin’s output as RRD data
• Could use a little more polish (especially in the docs), but looks
interesting
• http://apan.sourceforge.net/
c©2003-2013 John Sellens USENIX LISA 27, 2013 125
Notes:
• Apan version 0.3.0-sql released December 16, 2003
• Configuration is now in an SQL database, rather than text files
– Which I suppose could be a blessing or a curse
• My unproven theory is that the perfdata output from the plugins would be
a good way to pass data to RRDtool
• But I could be completely wrong!
S6 — Nagios: Advanced Topics Nagios Addons
More Graphing: nagiostat and nagiosgraph
• Parses performance data
• Stores in RRDtool
• Graphs on the fly through CGI
• Called via the “service_perfdata_command” setting for each
plugin result
• i.e. Extra fork/exec for every plugin
• nagiostat and nagiosgraph appear very similar
• Nagios Grapher looks interesting, and active
c©2003-2013 John Sellens USENIX LISA 27, 2013 126
Notes:
• Not to be confused with “nagiostats” from Nagios 2.x which tells you about
the Nagios process itself
• http://nagiostat.sourceforge.net/
• http://nagiosgraph.sourceforge.net/
• http://www.nagiosexchange.org/NagiosGrapher.84.0.html or
http://sourceforge.net/projects/nagiosgrapher/
S6 — Nagios: Advanced Topics Nagios Addons
More Graphing: PerfParse
• Similar in purpose to Nagiostat
• But can get performance data in a variety of ways
– Including periodically reading perfdata log files
• Inserts data into MySQL
• Graphs through a CGI that integrates into the Nagios interface
• I think this is likely the best approach of the three
c©2003-2013 John Sellens USENIX LISA 27, 2013 127
Notes:
• http://perfparse.sourceforge.net/
• Version 0.106.1 released April 11, 2006
– Though the web site references are lagging a bit
S6 — Nagios: Advanced Topics Nagios Addons
Even More Graphing: PNP
• PNP (PNP is not PerfParse)
• Snazzy graph viewing features
– AJAX, zoom, calendar
– Show all graphs for a host
• Templates for different types of data/checks
• Bells and whistles!
c©2003-2013 John Sellens USENIX LISA 27, 2013 128
Notes:
• Seems to have some “traction”
• I’m not clear on the history of the name
– Is this a fork?
• http://pnp4nagios.sourceforge.net/
• Version 0.4.9, May 15, 2008
S6 — Nagios: Advanced Topics Nagios Addons
Still More Graphing: n2rrd
• Nagios to RRD
• Uses “perfdata command” hooks
• A little more easily configurable
• Active development
• Can interface with Cacti and Drraw
• Worth a look
c©2003-2013 John Sellens USENIX LISA 27, 2013 129
Notes:
• http://n2rrd.diglinks.com
• n2cacti derivative: http://sourceforge.net/projects/nagios2cacti
S6 — Nagios: Advanced Topics Nagios Addons
NagVis — Visualization
• Display IT process like a mail system or a network infrastructure
• Sort of like an interactive host/service “map”
• Could have an overview showing state, and “drill down”
• Active development
• Uses NDOUtils
c©2003-2013 John Sellens USENIX LISA 27, 2013 130
Notes:
• Haven’t tried it yet, but looks really neat
• http://nagvis.org/
S6 — Nagios: Advanced Topics Nagios Addons
Nagios Addons
• The “downloads” section of the Nagios web site includes
references to a number of “addons” for Nagios
– Providing interesting additional functionality
• The extras are now listed on nagiosexchange.org
• Have a browse, and you’ll get some interesting ideas
• A few addons, in particular, are worth special mention
c©2003-2013 John Sellens USENIX LISA 27, 2013 131
Notes:
• See my “Tools You Need” notes for more addon mentions
• http://www.nagios.org/download/addons/
• http://exchange.nagios.org/
S6 — Nagios: Advanced Topics Nagios Addons
NRPE and NSCA
• These addons were written to address the need to do remote
service checks of various types
• NRPE – Nagios Remote Plugin Executor
– A client “check_nrpe” and server “nrpe” pair
– Lets a Nagios server connect to a remote daemon, which will
run plugins on the remote machine, and return the results
– Moderate security features
• For Windows: try nrpe_nt
– A re-implementation for Windows servers
c©2003-2013 John Sellens USENIX LISA 27, 2013 132
Notes:
• Both NRPE and NSCA were written by Ethan Galstad (the Nagios au-
thor), so you can sort of consider them as just outside the Nagios core
• Despite my religion, I will sometimes admit that not all problems can be
solved with SNMP
– But don’t attempt to quote me on that!
• Was going to be split out at
http://sourceforge.net/projects/nrpe
• nrpe_nt for Windows at
http://www.miwi-dv.com/nrpent/
• Plugins for Windows NRPE: search for “Windows NRPE” at
http://exchange.nagios.org/
S6 — Nagios: Advanced Topics Nagios Addons
NRPE and NSCA Illustrated
c©2003-2013 John Sellens USENIX LISA 27, 2013 133
Notes:
• I stole these from
http://www.nagios.org/images/addons/nrpe/nrpe.png
and
http://www.nagios.org/images/addons/nsca/nsca.png
S6 — Nagios: Advanced Topics Nagios Addons
NRPE and NSCA (cont’d)
• NRPE uses popen() which means a shell is involved on the
remote host
– Leads to problems with special characters and quoting
• NSCA – Nagios Service Check Acceptor
– A client “send_nsca” and server “nsca” pair
– Lets a remote system run a local check and submit a
“passive service result” to the Nagios server
– Can also be used to set up distributed monitoring, with
service results aggregated on a central server
c©2003-2013 John Sellens USENIX LISA 27, 2013 134
Notes:
• My slightly modified NRPE avoids some problems and tries to make some
things easier
• http://www.syonex.com/resources/software.html
S6 — Nagios: Advanced Topics Nagios Addons
NRDP — Nagios Remote Data Processor
• Designed to replace NSCA
• Standard ports and protocols — HTTP(S) and XML
• Can be used for other purposes as well
– Send data, status or commands to a server
– Let server do something with it
c©2003-2013 John Sellens USENIX LISA 27, 2013 135
Notes:
• http://exchange.nagios.org/directory/Addons/Passive-Checks/NRDP–2D-
Nagios-Remote-Data-Processor/details
S6 — Nagios: Advanced Topics Nagios Addons
NDOUtils — Status and Events to DB
• Nagios Data Output Utilities
• Event broker and intermediate programs to stuff status and
event data into a database
– So you can crunch the details later
– Into pretty reports and graphs
• Real time (via pipe) or periodic (via files)
• Can import old logs
• Still alpha/beta, and still “experimental”
• “likely play a central role in the new Nagios web interface”
c©2003-2013 John Sellens USENIX LISA 27, 2013 136
Notes:
• Also by Ethan Galstad, so just outside the nagios code
• Currently MySQL only, but PostgreSQL will likely be added at some point
– Or so the README says
– I think the icinga equivalent may have PostgreSQL support now
S6 — Nagios: Advanced Topics Nagios Addons
NDOUtils — Simple Diagram
c©2003-2013 John Sellens USENIX LISA 27, 2013 137
Notes:
• Stolen from http://www.nagios.org/images/addons/ndoutils/ndoutils.png
S6 — Nagios: Advanced Topics Nagios Addons
NDOUtils – building
• On FreeBSD, I had to do:% env CPPFLAGS=-I/usr/local/include% LDFLAGS=-L/usr/local/lib./configure--with-mysql-inc=/usr/local/include--with-mysql-lib=/usr/local/lib/mysql
• If I added --with-pgsql-lib it failed to find mysql
• But it wouldn’t compile . . .
– No -I on gcc commands
c©2003-2013 John Sellens USENIX LISA 27, 2013 138
S6 — Nagios: Advanced Topics Nagios Addons
DNX – Distributed Nagios Executor
• NEB module, intercepts check commands before execution
• Passes work off to a “worker node”
• Bypasses external command pipe/file
• Workers register with the server
– Can come and go as they please (or die)
• Assumption: no worker is preferred
– i.e. doesn’t address local vs remote
c©2003-2013 John Sellens USENIX LISA 27, 2013 139
Notes:
• http://dnx.sourceforge.net/
• From the LDS Church
– They do > 10, 000 checks every 5 minutes
– And expect “sharp increases” in requirements
S6 — Nagios: Advanced Topics Nagios Addons
Mod-Gearman – Distributed Nagios Workers
• Event broker module, Gearman master daemon
• Worker nodes register with master
– Run checks
– Can select hostgroups to work on – remote locations
– Distributed, load balancing
• Likely the current preferred approach
c©2003-2013 John Sellens USENIX LISA 27, 2013 140
Notes:
• http://mod-gearman.org/
• From Sven Nierlein, author of Thruk
S6 — Nagios: Advanced Topics Nagios Addons
mbdivert - Divert Checks Elsewhere
• mbdivert – plugin wrapper
• Sends plugin checks elsewhere based on hostname/IP/regex
• Uses check_nrpe or check_by_ssh or . . .
• “Seamless” intergration into existing Nagios configs
• Makes use of a slightly enhanced NRPE
• Geographic, administrative or per-network diversion/distribution
• Bypass firewalls restrictions – divert SNMP over SSH
c©2003-2013 John Sellens USENIX LISA 27, 2013 141
Notes:
• This is one of my little projects
• http://www.syonex.com/resources/software/
• Built to work with the MonBOX Remote Monitoring Appliance
– But is useful on its own
– e.g. We use it at FreshBooks to run some checks locally on remote
isolated machines
S6 — Nagios: Advanced Topics Nagios Addons
Nagios Business Process Addons
• “Roll Up” host/service checks to represent “business processes”
• Adds new entries in web menu
• Sort of a process summary view
• Seems like a useful abstraction
• http://nagiosbp.sourceforge.net/
c©2003-2013 John Sellens USENIX LISA 27, 2013 142
S6 — Nagios: Advanced Topics Nagios Addons
Merlin or Module for Effortless Redundancy and
Loadbalancing In Nagios
• Effort to make distributed Nagios easy
– Alternative to NSCA
• Load balancing, redundancy, distributed
• Status database
• Active development, part of op5 products?
c©2003-2013 John Sellens USENIX LISA 27, 2013 143
Notes:
• From the good folks at op5.com, who have Nagios-based products.
• Requires Nagios 4
– Or so the README in the git repository says
• http://www.op5.org/community/plugin-inventory/op5-projects/merlin
S6 — Nagios: Advanced Topics Nagios Addons
V-Shell — Alternative Front End for Nagios
• Standard web interface is thought to be “long in the tooth”
• Alternative web interface to Nagios Core
– From the Nagios Enterprises team
• PHP, CSS, valid XHTML
• Basic, functional rather than web 2.0 whizbang
• Work in progress
c©2003-2013 John Sellens USENIX LISA 27, 2013 144
Notes:
• V-Shell released in early October 2010
• List of notable themes and web interfaces at
http://www.nagios.org/download/frontends
• Nagios V-Shell is referenced there and on
http://exchange.nagios.org
S6 — Nagios: Advanced Topics Nagios Addons
Adagios – Web Based Nagios Configuration
• Web configuration interface
• No database backend
– Stores configs in standard files
– Uses pynag Python library http://pynag.org/
• Experimental status view
• Seems like a convenient approach
c©2003-2013 John Sellens USENIX LISA 27, 2013 145
Notes:
• http://adagios.org/
S6 — Nagios: Advanced Topics Nagios Addons
NINJA or Nagios Is Now Just Awesome
• Attempt to develop an alternative Nagios GUI
• PHP, scalability, better searching and filtering
• Multi-language, templates/skins
• Relies on Merlin, Nagios 4, livestatus
• Used in op5’s monitoring product
c©2003-2013 John Sellens USENIX LISA 27, 2013 146
Notes:
• Also from the good folks at op5.com.
• http://www.op5.org/community/plugin-inventory/op5-projects/ninja
S6 — Nagios: Advanced Topics Nagios Addons
Naglite2 — Simple Status Screen
• At FreshBooks we use Naglite2 for a status screen
• Quick summary of currently un-ack’d problems
• Handy for NOC status wall screens
• By Laurie Denness, at Last.fm
• http://laurie.denness.net/blog/2010/03/naglite2-finally-released/
• http://github.com/lozzd/Naglite2
c©2003-2013 John Sellens USENIX LISA 27, 2013 147
S6 — Nagios: Advanced Topics Nagios Addons
Multiple Nagios Aggregators
• MNTOS — Multi Nagios Tactical Overview System
– Simple summary of multiple Tactical Overview pages
• Thruk Monitoring Webinterface
– Terrific – Aggregates multiple backends
– Uses MKLivestatus for communication and commands
• Multisite — aggregator, multiple views
• Check_MK — all-in-one, check everything plugin
c©2003-2013 John Sellens USENIX LISA 27, 2013 148
Notes:
• MNTOS from http://www.sorkmos.com/index.php?page=mntos
• http://www.thruk.org/index.php
• Thruk supports Nagios and “similar” tools
– From Sven Nierlein, author of Mod-Gearman
• MKLivestatus, Check_MK, Multisite from Mathias Kettner
• MKLivestatus NEB module from http://mathias-kettner.de/checkmk_livestatus.html
• Multisite from http://mathias-kettner.de/checkmk_multisite.html
• Check_MK from http://mathias-kettner.com/check_mk.html
S6 — Nagios: Advanced Topics Nagios Addons
Nagios-based Distributions
• FAN – Fully Automated Nagios
– CentOS, Nagios, Centreon, NagVis
– Full distribution, or addon RPMs
• OMD – The Open Monitoring Distribution
– Icinga, Shinken, NagVis, Thruk, etc.
– Multiple instances
– Install on Ubunto, SLES, RedHat, etc.
c©2003-2013 John Sellens USENIX LISA 27, 2013 149
Notes:
• http://www.fullyautomatednagios.org/
• http://omdistro.org/
S6 — Nagios: Advanced Topics Wrap Up
Wrap Up
c©2003-2013 John Sellens USENIX LISA 27, 2013 150
S6 — Nagios: Advanced Topics Wrap Up
Summary
• We’ve tried to hit the key areas for more advanced Nagios use
• We didn’t cover everything
– The community is large
– The extensions are extensive
• Hopefully you’ve learned some of the more interesting aspects
• And can apply them in your own implementations
c©2003-2013 John Sellens USENIX LISA 27, 2013 151
S6 — Nagios: Advanced Topics Wrap Up
Where to Get Nagios Help
• The Nagios web site contains a lot of good information
• The documentation is very good
• A growing FAQ collection
• Links to addons and so on
• Mailing lists for both Nagios and the plugins: announce, users,
developers, help, checkins
• network.nagios.org — master list of Nagios sites
– community.nagios.org – news and updates
– exchange.nagios.org – lots of plugins and addons
– etc. . . .
c©2003-2013 John Sellens USENIX LISA 27, 2013 152
Notes:
• http://www.nagios.org/ of course
• community.nagios.org is “Where the Community Connects”
S6 — Nagios: Advanced Topics Wrap Up
And Finally!
• Please take the time to fill out the tutorial evaluations
– The tutorial evaluations help USENIX offer the best possible
tutorial programs
– Comments, suggestions, criticisms gratefully accepted
– All evaluations are carefully reviewed, by USENIX and by the
presenter (me!)
• Feel free to contact me directly if you have any unanswered
questions, either now, or later: [email protected]
• Questions? Comments?
• Thank you for attending!
c©2003-2013 John Sellens USENIX LISA 27, 2013 153
Notes:
• Thank you very much for taking this tutorial, and I hope that it was (and
will be) informative and useful for you.
• I would be very interested in your feedback, positive or negative, and sug-
gestions for additional things to include in future versions of this tutorial,
on the comment form, here at the conference, or later by email.