Nagios Implementation Case: Eastman Kodak Company

Eric Loyd, Founder & CEO

Bitnetix Incorporated

[email protected]

877.BITNETIX

© 2012 Bitnetix Incorporated

About Eric Loyd and Bitnetix

Founder and CEO of Bitnetix Incorporated

VOIP services and IT/network consulting

25 Years in IT at places like

Eastman Kodak

Frontier Communications

Global Crossing

Bitnetix started its seventh year in July, 2012

2012 Digital Rochester GREAT Award Finalist in Communications Technology

Using Nagios to monitor our client equipment and VOIP platform; in use at Kodak since 2004

A History of Eastman Kodak’s kodak.com Web Server Infrastructure (non-confidential)


History of kodak.com

Pre-2004

Machines located in Rochester, NY

Public Apache servers

Reverse proxy Apache servers

Application servers (ATG/Dynamo, Tomcat, etc)

Database boxes, Production Support, etc.

2004 – Moved ~80 machines from ROC -> ???

ROC <-> ??? Firewalls

Bandwidth requirements

Minimal user impact

Flipped the switch, went live


History of kodak.com

Some of the things kodak.com did at the time

Consumer store and product information

B2B portal and wholesaler purchasing

“Picture Of The Day” (www.kodak.com/go/potd)

Warranty registration

Photo lab calibration strips

“Phone home” reports for printers, docks, cameras, etc

Software/firmware updates

Corporate press releases, bios, and regulatory information

Reverse proxy for internal information through secure channels

Dozens of sitelets for products and campaigns


Why Kodak Chose Nagios to Monitor kodak.com


Why Nagios?

No centralized corporate monitoring software

Nothing to compete with internally

Nothing to build on, either

Cost

No additional cost beyond existing human resources

Framework

Nagios worked with firewalls without needing agents

Leverage SSH, HTTP and other remote protocols

Custom checks and notifications (very important)
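A custom check is any program that prints one status line and exits with the standard Nagios plugin codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The following is a minimal sketch of such a check, not one of the checks used at Kodak; the function name `check_usage`, the thresholds, and the `robot@apphost` account in the comment are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical custom check: classify a usage percentage using the
# standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

# check_usage PERCENT WARN CRIT -> prints one status line, returns the code
check_usage() {
    pct="$1"; warn="$2"; crit="$3"
    case "$pct" in
        ''|*[!0-9]*) echo "UNKNOWN - could not read usage"; return 3;;
    esac
    if [ "$pct" -ge "$crit" ]; then
        echo "CRITICAL - usage ${pct}%"; return 2
    elif [ "$pct" -ge "$warn" ]; then
        echo "WARNING - usage ${pct}%"; return 1
    fi
    echo "OK - usage ${pct}%"; return 0
}

# Without an agent, the value could be fetched over SSH and classified
# locally (hypothetical account and host):
#   pct=$(ssh robot@apphost "df -P /var | awk 'NR==2 {sub(/%/, \"\", \$5); print \$5}'")
#   check_usage "$pct" 80 90; exit $?
```

Keeping the classification in a function and the transport (SSH, HTTP) outside it means the same logic can be reused for any remote protocol.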

Initial Hurdles in the New Complex Server Environment

kodak.com Network



Initial hurdles

Firewalls

Public load balancers on external Internet IPs

Public Apaches in Zone 1, Kodak network

Reverse proxy, app servers in Zone 2, semi-secure

Nagios machine in internal Zone 3, most secure

Complex “top” and “bottom” checks for web site

Is the site working from the user’s perspective (top)?

From the application side (bottom)?

How to separate apparent from actual failure
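One way to separate apparent from actual failure is to combine the results of the two checks. The sketch below is an assumption about how that combination might be read, not the logic used at Kodak; `classify_site` and its messages are hypothetical, and each argument is simply the exit code of a "top" or "bottom" check.

```shell
#!/bin/sh
# classify_site TOP_RC BOTTOM_RC
# Each argument is the exit code of a check (0 = passed, non-zero = failed).
# The interpretation of each combination is an illustrative assumption.
classify_site() {
    top="$1"; bottom="$2"
    if [ "$top" -eq 0 ] && [ "$bottom" -eq 0 ]; then
        echo "OK - site healthy from both ends"; return 0
    elif [ "$top" -ne 0 ] && [ "$bottom" -eq 0 ]; then
        # Application answers, so the fault lies between it and the user
        echo "CRITICAL - application up, user path broken"; return 2
    elif [ "$top" -eq 0 ] && [ "$bottom" -ne 0 ]; then
        # Users are still served, e.g. by other instances behind the proxy
        echo "WARNING - users served, application check failing"; return 1
    fi
    echo "CRITICAL - application failure visible to users"; return 2
}
```

For example, `classify_site 1 0` reports a broken user path even though the application itself is healthy, which points at the load balancer, firewall, or proxy layer rather than the application servers.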


Initial hurdles

No Internal Nagios Knowledge

It was a contractor who set up Nagios (me)

Contractors typically have a finite lifespan at Kodak

Contractor made custom checks, event handlers, and all Nagios configurations. Uh-oh…

Escalation and Paging

Screw it – let’s email everyone, every time and let Thunderbird sort it all out

Paging done via texting gateway email address

Which means email gateway failure = notification failure

Twitter API as backup / current primary notification
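The single point of failure described above can be reduced by trying notification channels in order. This is a minimal sketch under stated assumptions: `notify`, `CHANNELS`, and the two stand-in channel functions are hypothetical names, and real channels would call a mail command or an HTTP API client instead.

```shell
#!/bin/sh
# notify MESSAGE: try each channel named in $CHANNELS, in order,
# and stop at the first one that succeeds.
notify() {
    msg="$1"
    for channel in $CHANNELS; do
        if "$channel" "$msg"; then
            echo "sent via $channel"
            return 0
        fi
    done
    echo "all notification channels failed" >&2
    return 1
}

# Stand-in channels for illustration only.
send_sms_gateway() { return 1; }                       # simulate a gateway outage
send_backup()      { printf '%s\n' "$1" >/dev/null; }  # pretend delivery worked

CHANNELS="send_sms_gateway send_backup"
```

With the stand-ins above, `notify "Dynamo CRITICAL"` falls through the failed gateway and reports that the backup channel delivered the message.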

SSH to Remote Servers


SSH to the rescue

One user, one key, infinite access

Software apps run as second user, with SSH auth

Additional robot accounts can be added at any time

Wrap existing checks in an SSH shell

Provides additional control, error handling, reporting

Allows all checks to submit results to SQL database

SQL Database Side Note – all custom scripts executed CLI Perl code that locked a file, logged to it, and unlocked it. A Perl cron job woke up every 5 minutes, locked the file, read it, pushed things to Oracle, unlocked, and deleted log file. A second cron pruned Oracle daily to 400 days of data and collapsed checks older than 30 days so that successive checks with the same status were removed.
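The original logging code was Perl; the sketch below shows the same lock, log, unlock idea in shell, wrapped around an arbitrary check. The names `with_lock`, `append_line`, `run_check`, the log path, and the pipe-separated record layout are all assumptions for illustration. The lock is a directory because `mkdir` either creates it or fails, with no window in between.

```shell
#!/bin/sh
# Hypothetical paths; override LOG and LOCK as needed.
LOG="${LOG:-/tmp/check_results.log}"
LOCK="${LOCK:-$LOG.lock}"

# with_lock CMD...: take the lock, run CMD, release the lock.
with_lock() {
    tries=0
    until mkdir "$LOCK" 2>/dev/null; do
        tries=$((tries + 1))
        if [ "$tries" -ge 50 ]; then return 75; fi   # give up after ~50 s
        sleep 1
    done
    rc=0
    "$@" || rc=$?
    rmdir "$LOCK"
    return "$rc"
}

append_line() { printf '%s\n' "$1" >> "$LOG"; }

# run_check HOST SERVICE CMD...: run the check, log one record,
# then pass the check's output and exit code through unchanged.
run_check() {
    host="$1"; service="$2"; shift 2
    status=0
    output=$("$@" 2>&1) || status=$?
    with_lock append_line "$(date +%s)|$host|$service|$status|$output"
    echo "$output"
    return "$status"
}
```

A periodic job could take the same lock, move the log aside, and load its records into a database, which is the role the five-minute cron job plays in the side note.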

Managing Nagios Configuration Files


Configuration Management

SCCS

Solaris’s “poor man’s CVS”

Pre-installed, no additional cost, existing expertise

Current configuration is managed through SVN

Rsync – the workhorse to move config files

Configuration Repository and Push (CRaP) directory

Cfengine

Local versus remote execution

Post-install, ignore pid files, deploy/restart, etc.

Makefile – the “CLI” to the entire process
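A push step of the kind a Makefile target might call can be sketched as verify, copy, restart. This is an assumption about the general shape, not the Kodak process: `push_config`, `run`, the directory names, the host name, and the init script path are hypothetical. `nagios -v` is the Nagios pre-flight configuration check and `rsync -az --delete` mirrors a directory tree.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# push_config SRC_DIR HOST DEST_DIR
push_config() {
    src="$1"; host="$2"; dest="$3"
    # Refuse to push a configuration that fails the pre-flight check.
    run nagios -v "$src/nagios.cfg" || return 1
    # Mirror the tree; --delete removes files dropped from the repository.
    run rsync -az --delete "$src/" "$host:$dest/" || return 1
    # Restart remotely so the new configuration is loaded
    # (the init script path is an assumption).
    run ssh "$host" /etc/init.d/nagios restart
}

# Usage (hypothetical names): push_config ./crap nagios01 /etc/nagios
```

Running it in dry-run mode first shows exactly which commands a real push would execute, which is useful when the same target is invoked by several people.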

Common Event Handler


Common Event Handler

EKrestart – That Which Does

Setup

• Arguments
• Conversions
• do_soft/hard?
• do_something?
• do_restart

do_restart

• Lock, logs, SQL
• send_nagios
• SSH to remote
• Remote EKrestart
• Process args
• do_<service>
• send_nagios
• Unlock, log, SQL
• Terminate

do_<service>

• Locks (level 2)
• Instance mapping
• Port mapping
• App restart
• Email & log
• Exit


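A Closer Look at EKrestart

#!/bin/sh
PATH=...

# Invoked with -r on the remote machine: run the client-side code instead
# (in the real script the function definitions must appear before this point)
[ "$1" = "-r" ] && client_code

# Event handler arguments
host="$1"
service="$2"
baseService=`echo "$service" | awk -F: '{print $1}'`
state="$3"
type="$4"
tries="$5"
perfdata="$6"
class="<based on machine name, e.g., x-y-CLASS-nnn.kodak.com>"
number="<based on machine name, e.g., x-y-class-NNN.kodak.com>"

# Dispatch on the service state
case "$state" in
    OK)       do_fixit;;
    WARNING)  do_nothing;;
    UNKNOWN)  do_nothing;;
    CRITICAL) do_something;;
    *)        do_nothing;;
esac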


A Closer Look at EKrestart

do_fixit() {
    case "$baseService" in
        Workers) do_restart;;
        *)       do_nothing;;
    esac
}

do_nothing() {
    $debug && echo "$service is in $state state ($type) for $tries tries."
}

do_something() {
    case "$type" in
        SOFT) do_soft;;     # Take action before it's too late?
        HARD) do_restart;;  # Hard CRITICAL - Our last chance to take action
        *)    do_nothing;;
    esac
}

do_soft() {
    case "$tries" in
        3|4|5) do_restart;;  # Okay, let's restart it before it goes hard
        *)     do_nothing;;  # Don't restart yet
    esac
}


A Closer Look at EKrestart

do_restart() {
    # <figure some stuff out, set up lock files, send_nagios, log to SQL, etc>
    ssh $machine <EKrestart> -r do_$service <parameters>
    # <tear down, unlock, close log, send_nagios, log to SQL, etc>
    exit
}

# On the client side, we use the same EKrestart script, but start at client_code()
client_code() {
    host=`hostname`
    function="$2"
    service="$3"
    # (etc)
    eval "$function"
    exit
}

# Example function
do_Dynamo() {
    # lock file processing
    # turn off new sessions, wean existing ones
    # /etc/init.d/restart_dynamo_$instance
    # tear down
    return
}
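The dispatch logic in the EKrestart excerpts can be exercised offline by collapsing it into one function that only prints the decision. The sketch below is a restatement for testing, not part of EKrestart: `decide` is a hypothetical name, and it assumes the intent of do_soft is to act on the third, fourth, or fifth try.

```shell
#!/bin/sh
# decide STATE TYPE TRIES BASE_SERVICE -> prints "restart" or "nothing"
decide() {
    state="$1"; type="$2"; tries="$3"; base="$4"
    case "$state" in
        OK)
            # Mirrors do_fixit: only the Workers service is restarted on OK
            case "$base" in
                Workers) echo restart;;
                *)       echo nothing;;
            esac;;
        CRITICAL)
            case "$type" in
                HARD) echo restart;;
                SOFT)
                    # Mirrors do_soft: act before the state goes hard
                    case "$tries" in
                        3|4|5) echo restart;;
                        *)     echo nothing;;
                    esac;;
                *)    echo nothing;;
            esac;;
        *)  echo nothing;;   # WARNING, UNKNOWN, anything else
    esac
}
```

For example, `decide CRITICAL SOFT 2 Dynamo` prints `nothing` and `decide CRITICAL SOFT 3 Dynamo` prints `restart`. Note that shell `case` patterns separate alternatives with `|`; a pattern written as `3,4,5` would only match that literal string.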

Integrating Nagios into Operational Procedures


Integration with Operations

Homebrew API

nchart, send_nagios, nlog – all portable to other installations of Nagios on other machines

Integrate with start/stop scripts

Lock files. Lots of lock files! TOO MANY lock files!!

The “Rippler”

Leverage EKrestart, cron, and send_nagios

Pager / Twitter and lots of private twitter feeds

Inter-group notifications

Predominately with procmail

Predictive Failure Recovery and a Good Night’s Sleep


Predictive Failure Recovery

On ATG/Dynamo (and other) services

do_soft triggers do_restart on third failure

do_hard always triggers restart

Notifications on fourth failure

Escalation to pager only on fifth notification

Nagios has time to restart things that are bad, or are going bad, prior to sending out notifications

Service check dependencies allow us to know whether it’s a bad application, server, or user experience

Twitter – follow private tweets with smartphone, use apps to acknowledge problems, and get an even better night’s sleep!!
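The ordering described above (restart first, notify later, page last) can be laid out as a timeline. This sketch is a simplification under stated assumptions: `timeline` is a hypothetical helper, steps 1 to 4 stand for consecutive failed checks with the fourth being the first hard failure, and every step after that stands for one repeat notification while the problem persists.

```shell
#!/bin/sh
# timeline N: print the assumed action at each of N steps of an
# unresolved failure.
timeline() {
    n="$1"; step=1
    while [ "$step" -le "$n" ]; do
        if [ "$step" -lt 3 ]; then
            action="soft failure, no action"
        elif [ "$step" -eq 3 ]; then
            action="soft failure, restart"
        elif [ "$step" -eq 4 ]; then
            action="hard failure, restart, notification 1"
        else
            action="still failing, notification $((step - 3))"
            if [ $((step - 3)) -ge 5 ]; then
                action="$action, escalate to pager"
            fi
        fi
        echo "step $step: $action"
        step=$((step + 1))
    done
}
```

Calling `timeline 8` shows two restart attempts before the first notification and four notifications before anyone is paged, which is where the good night's sleep comes from.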

Questions

Eric Loyd, Founder & CEO

Bitnetix Incorporated

[email protected]

877.BITNETIX
