Nagios Conference 2013 - Daniel Wittenberg - Scaling Nagios Core 4

Scaling Nagios 4

Daniel Wittenberg

[email protected]

About Me

Unix/Linux admin since mid 90's

Nagios/Netsaint user since early 2000's

Owned/operated consulting business for almost 10 years that provided distributed monitoring using Nagios

Previously employed by Fortune 50 Insurance company

Currently Monitoring Platform Manager at IPsoft Inc.

About IPsoft

Provider of Remote Infrastructure Management and automation services

ITIL and 6 Sigma compliance management framework

Automation that resolves 56% of all incidents, and 90% of L1 incidents

Monitoring, Automation, Event Correlation, Management....

Offices around the world in ten countries

http://www.ipsoft.com

Last year...

What is BIG ?

My Configuration

~700 Nagios Servers

~130,000 Monitored Devices

~3,000,000 Service Checks

Mix of customized Nagios 3.2.3 and 4.0.0

Scientific Linux 6.2/6.4

Managed by Puppet 3.x

2/3 on VMware ESX, the rest are bare metal

Adding new Nagios servers almost daily

What's different with Nagios 4

SPEED!

Current testing shows Nagios 4 is on average about 500% faster than 3.2.3

What's different with Nagios 4

Some things that would impact performance/stability

http://nagios.sourceforge.net/docs/nagioscore/4/en/whatsnew.html

Embedded Perl - Gone

external_command_buffer_slots - Gone

-x option to not verify circular paths no longer needed in rc scripts

Configuration Verification algorithm changes, massive startup speed increase

Event Queue algorithm changes, helps with CPU utilization (see Andreas's 2012 presentation)

Disk I/O reduced to virtually 0

NEW query handler interface, better communication with core

NEW core workers reduces I/O, memory, CPU

Completely re-written spec file for better installs, debug modes

Perf Testing Lab Setup

Servers are all ESX 5 based VM's on the same cluster

Variable CPU cores, 4GB memory

Metrics used to consider a test failure:

CPU Block Queue > 3

CPU I/O Wait > 3

CPU Idle < 10%

Service Check Latency > 1s

Host Check Latency > 1s

30 minute run time, > 3% failure rate failed the test

Fully automated increasing work load, consistent results

Each step adds 1 host + 1 service check, trying to get best-case numbers without check latency
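The pass/fail rule above can be sketched in a few lines of Python. This is only an illustration, not the lab's actual harness: the `Sample` fields, the reading of "CPU Block Queue" as the blocked-process count, and the interpretation of "> 3% failure rate" as the share of samples that breach any threshold are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One measurement taken during the 30 minute run (field names are hypothetical)."""
    block_queue: float      # blocked processes (assumed: vmstat "b" column)
    io_wait: float          # CPU I/O wait
    cpu_idle: float         # CPU idle, percent
    service_latency: float  # service check latency, seconds
    host_latency: float     # host check latency, seconds


def sample_failed(s: Sample) -> bool:
    """True if the sample breaches any of the thresholds listed on the slide."""
    return (
        s.block_queue > 3
        or s.io_wait > 3
        or s.cpu_idle < 10
        or s.service_latency > 1.0
        or s.host_latency > 1.0
    )


def run_failed(samples, max_failure_rate=0.03) -> bool:
    """True if more than 3% of the samples failed (assumed meaning of the rule)."""
    if not samples:
        raise ValueError("no samples collected")
    failures = sum(1 for s in samples if sample_failed(s))
    return failures / len(samples) > max_failure_rate
```

A driver would collect one `Sample` per polling interval, call `run_failed` at the end of the run, and only then add the next host and service check.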

Test Lab Architecture

Test Results

CPU Cores   Service Checks (Version 3.2.3)   Service Checks (Version 4.0.0rc1)   Difference

1           1700                             10500                               617%

2           3300                             20800                               630%

4           6500                             35300                               543%

8           11700                            45100                               385%
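The Difference column appears to be the 4.0.0rc1 count expressed as a percentage of the 3.2.3 count, truncated to a whole number. The short Python check below reproduces the four figures under that assumption; the variable and function names are illustrative.

```python
# Service checks sustained per core count: (version 3.2.3, version 4.0.0rc1)
RESULTS = {
    1: (1700, 10500),
    2: (3300, 20800),
    4: (6500, 35300),
    8: (11700, 45100),
}


def difference_percent(v3: int, v4: int) -> int:
    """4.0.0rc1 throughput as a percentage of 3.2.3, truncated (assumed rounding)."""
    return v4 * 100 // v3


if __name__ == "__main__":
    for cores, (v3, v4) in RESULTS.items():
        print(f"{cores} cores: {difference_percent(v3, v4)}%")
```

Averaging the four percentages gives roughly 544%, which matches the "about 500%" headline figure.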

Other software used

Customized livestatus based on Andreas's updates for Nagios 4

https://github.com/ageric/livestatus

Developing custom single pane interface to replace CGI/Check_mk Multisite

Developing full REST API to talk to QH, livestatus and config files

nagios-qh.rb - Query Handler interface to gather loadctl metrics

https://www.dropbox.com/s/h6zn0ecycqb1xrc/nagios-qh.rb

Custom load control daemon that talks to QH

Custom Event Broker to send perf data directly to ActiveMQ for post-processing

Custom agent, like NRPE on steroids without limitations like buffer size
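For readers who have not used the query handler, the following Python sketch shows roughly what a loadctl query could look like. It is not the nagios-qh.rb script mentioned above. The socket path, the `#core loadctl` query string, the NUL terminator, and the `key=value;` response format are assumptions to verify against the query handler documentation for your Nagios build.

```python
import socket


def parse_kv_response(raw: bytes) -> dict:
    """Split a 'key=value;key=value;' style response into a dict (assumed format)."""
    text = raw.decode("utf-8", errors="replace").strip("\0\n ")
    fields = {}
    for part in text.split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


def query_handler(path: str, query: str, timeout: float = 5.0) -> bytes:
    """Send one NUL-terminated query to the QH unix socket and read the reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(path)
        sock.sendall(query.encode("utf-8") + b"\0")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:               # server closed the connection
                break
            chunks.append(data)
            if data.endswith(b"\0"):   # assumed end-of-response marker
                break
    return b"".join(chunks)


if __name__ == "__main__":
    # Hypothetical socket location; it depends on how Nagios was installed.
    reply = query_handler("/usr/local/nagios/var/rw/nagios.qh", "#core loadctl")
    for key, value in parse_kv_response(reply).items():
        print(f"{key} = {value}")
```

Keeping the parser separate from the socket code makes it easy to feed the metrics to a load control daemon or a graphing backend.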

Other performance tweaks

Sysctl Changes

net.ipv4.tcp_fin_timeout

net.ipv4.tcp_keepalive_probes

net.ipv4.tcp_tw_recycle

net.ipv4.tcp_tw_reuse

No longer need RAMDISK, but still in the default sysconfig/RC script for now

Keep logging levels as low as possible

Disable CGI's whenever possible

Disable Environment Macros

Don't use resource macros when you don't need to, they are not cached
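Before changing any of the sysctl keys listed above, it helps to record their current values. The Python sketch below reads a subset of them straight from `/proc/sys`; the helper names are illustrative, and the slides do not give target values, so none are assumed here.

```python
from pathlib import Path

# Subset of the TCP keys named on the slide.
SYSCTL_KEYS = [
    "net.ipv4.tcp_fin_timeout",
    "net.ipv4.tcp_tw_recycle",
    "net.ipv4.tcp_tw_reuse",
]


def sysctl_path(key: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl key to its file under /proc/sys."""
    return Path(root) / key.replace(".", "/")


def read_sysctl(key: str, root: str = "/proc/sys"):
    """Return the current value as a string, or None if the key is unavailable."""
    try:
        return sysctl_path(key, root).read_text().strip()
    except OSError:
        # Some keys do not exist on every kernel (tcp_tw_recycle was removed in Linux 4.12).
        return None


if __name__ == "__main__":
    for key in SYSCTL_KEYS:
        print(f"{key} = {read_sysctl(key)}")
```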

Other performance tweaks

/etc/security/limits.d/nagios.conf

ipmon soft nofile 131072

ipmon hard nofile 131072

ipmon soft nproc 131072

ipmon hard nproc 131072

Nearly disable the OOM killer for the nagios process, so it is killed last

echo '-16' > /proc//oom_adj

Re-nice puppet to run at 10 so it has less impact (true for any extra services)

/etc/sysconfig/puppet NICELEVEL=10

This should apply to any other running services that might take resources
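A quick way to confirm that the limits.d entries took effect is to inspect the limits of a process started as the monitoring user. The Python sketch below uses the standard `resource` module; the `meets_limit` helper is illustrative, and the 131072 target is taken from the entries above.

```python
import resource

WANTED = 131072  # nofile/nproc value from the limits.d entries above


def meets_limit(soft: int, hard: int, wanted: int = WANTED) -> bool:
    """True if both the soft and hard limit are unlimited or at least `wanted`."""
    def ok(value: int) -> bool:
        return value == resource.RLIM_INFINITY or value >= wanted
    return ok(soft) and ok(hard)


def report() -> None:
    """Print the current process limits for open files and processes."""
    checks = (("nofile", resource.RLIMIT_NOFILE), ("nproc", resource.RLIMIT_NPROC))
    for name, res in checks:
        soft, hard = resource.getrlimit(res)
        status = "ok" if meets_limit(soft, hard) else "too low"
        print(f"{name}: soft={soft} hard={hard} ({status})")


if __name__ == "__main__":
    report()
```

The limits reported are those of the current process, so the script needs to run as the same user, and through the same login path, as the Nagios daemon.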

Common Perf Tools

vmstat / top - CPU/memory

iostat / iotop - disk usage

iptraf - network

sar - CPU/memory/disk

strace - immediate debugging, also debugging QA

esxtop - VM stats

tuned - can dynamically tune system

perf record -p / perf list / perf top -u nagios

How to keep it running well

Monitor everything...you can never have too much info!

CPU load and CPU stats (idle/wait/user/system)

Disk space, inodes free

All application/system logs (apache, syslog, nagios.log, etc.)

Hardware status

Swap / Physical Memory Usage

Puppet state (state.yaml)

Apache Stats (if you have a GUI/API)

Network performance and stats (errors, throughput, etc.)

NTP time and drift (more important on VM's)
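As an illustration of the CPU stats item (idle/wait/user/system), the Python sketch below derives percentages from two readings of the aggregate `cpu` line in `/proc/stat`. The function names and the one second sampling interval are arbitrary choices for the example, not part of the platform described here.

```python
import time

FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line: str) -> dict:
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("not the aggregate cpu line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_percentages(before: dict, after: dict) -> dict:
    """Percentage of time spent in each state between two readings."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("readings are identical or out of order")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def read_cpu() -> dict:
    """Read the first line of /proc/stat, which holds the aggregate counters."""
    with open("/proc/stat") as handle:
        return parse_cpu_line(handle.readline())


if __name__ == "__main__":
    first = read_cpu()
    time.sleep(1)
    usage = cpu_percentages(first, read_cpu())
    print(f"idle={usage['idle']:.1f}% iowait={usage['iowait']:.1f}%")
```

Wrapped in a check plugin, the idle and iowait figures can be compared against thresholds and reported with the usual OK/WARNING/CRITICAL exit codes.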

Our Platform Architecture (simplified)

Known Issues (and complaints)

Smaller (1-2 core) systems are easily overloaded by the number of workers

No remote workers (yet)

Still have to restart to add new hosts/services

No REST API natively

Livestatus (or similar) not native

Questions ?

[email protected]

[email protected]

@dwittenberg2008

www.linkedin.com/in/dwittenberg

nagios and nagios-devel IRC

Nagios Users and Devel mailing lists

Always looking to hire new people so contact me!
