



Jonathan Thurman

5 Unsung Tools of DevOps

5 Unsung Tools of DevOps
by Jonathan Thurman

Copyright © 2014 Jonathan Thurman. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

October 2013: First Edition

Revision History for the First Edition:

2013-10-09: First release

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. 5 Unsung Tools of DevOps and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-491-94517-9


Table of Contents

5 Unsung Tools of DevOps
    RANCID
    Cacti
    lldpd
    IPerf
    MUltihost SSH Wrapper
    Conclusion


“It has long been an axiom of mine that the little things are infinitely the most important.”
—Sir Arthur Conan Doyle

5 Unsung Tools of DevOps

The tools we use play a critical role in how effective we are. In today’s ever-changing world of technology, we tend to focus on the latest and greatest solutions and overlook the simple tools that are available. Constant improvement of tools is an important aspect of the DevOps movement, but improvement doesn’t always warrant replacement.

So here are five tools that I use almost every day. They either provide insight into or control over the environment around me while requiring minimal installation and configuration. They are not the flashiest tools, but they are time tested and just work.

RANCID

Configuration management (CM) tools like Puppet and Chef are really useful for keeping your systems in line, but what about your infrastructure? The Really Awesome New Cisco confIg Differ—or RANCID for short—is the first step in tackling this problem. In essence, RANCID is a suite of utilities that enables automatic retention of your configurations in revision control. If you have a physical infrastructure at any level, you should be working to have the same level of control as you do on your servers with your CM solution.

So that sounds good, but what problem does RANCID really solve? The core usage is to create an audit trail of software configurations and hardware information for the devices that glue servers together. The configuration of your switches, routers, and load balancers may not be changing as fast as the code in your Rails app (at least I hope not!), but it does change over time. The rate of change is usually tied to how fast your environment is either changing or expanding.

Auditing is a great first step in the automation process. You have to know where you are to get where you’re going, after all! RANCID does this out of the box for devices from Cisco, Juniper, F5, and many other vendors. The audit process requires a basic installation of RANCID and configuration of a read-only user on the devices you want to monitor. The result is a current configuration automatically pulled on a regular schedule, committed to revision control, and an email detailing the changes in your inbox.

So now you’re ready to try out RANCID, but where to start? If you are a Subversion shop, you’re set—go grab the latest tarball and follow along with the Getting Started guide. Git users can grab a fork of RANCID that is patched to add git functionality. Though git isn’t natively supported by the maintainer of RANCID, I prefer git to Subversion, so that’s the codebase that I use.

Once you have RANCID installed, there are a few base configuration items that need to be set in /etc/rancid/rancid.conf (or wherever your rancid.conf was installed).

rancid.conf
RCSSYS=git
FILTER_PWDS=YES
LIST_OF_GROUPS="pdx slc ord"

Configuring RANCID to pull configs requires the username and password used to connect to each device, optionally using hostname matching. RANCID supports connecting to devices over telnet and ssh. I’m sure this goes without saying, but don’t use telnet, and don’t even enable it on the devices! Some devices support key-based authentication (like Juniper and F5) and do not require passwords. In this case, RANCID will use the ssh key if one has been configured for the rancid user, and RANCID is configured to connect via ssh. Otherwise you configure passwords in the .cloginrc file, which can be found in the rancid user’s home directory. Here is an example:


~/.cloginrc
# We only use SSH
add method * {ssh}

# Wildcard for all devices
add password *.example.com LoginPassword EnablePassword
add password router.example.com OtherPass AndAnotherPass

Finally you need to configure which group a device belongs to. You will need a directory with the same name as each group you identified above, and each should contain a configuration file. This is where RANCID shows some of its heritage, as the configuration file for this is called router.db. It’s not limited to routers, however, and it’s not a database but a simple text file. Each line in the file represents a device in the form of hostname:type:status, where type is the type of device from the list of supported devices and status is either up or down. Devices that are marked down are not queried for their configuration, but they remain in revision control. Here is an example:

~/pdx/router.db
switch.example.com:cisco:up
router.example.com:juniper:up
balance.example.com:f5:up

Now, assuming you configured the user RANCID runs as on the above devices, as the RANCID user, you should be able to manually run rancid-run to gather all the configs. Once the devices have been queried, the full config is available at ~/pdx/configs/switch.example.com and a diff is emailed to rancid-pdx, which should have been aliased to you previously.
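If you want to sanity-check that first run by hand, a minimal sketch looks like the following. The group name pdx follows the example above; the rancid user’s home directory and logs location are assumptions based on a typical install, so adjust the paths to match yours.

# Run the collection manually as the rancid user
sudo -u rancid rancid-run
# The pulled configs should now exist under the group directory
sudo -u rancid ls ~rancid/pdx/configs/
# Per-run logs land in the rancid user's logs directory if something goes wrong
sudo -u rancid ls ~rancid/logs/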

Phew—now you can rest knowing that the configurations and hardware details for all of your configured devices are safely on the system running RANCID. That might be good enough, but having that repo pushed to, say, your local git server, is probably better. That’s another easy setup.

Set up a remote for the git repo
$ git remote add upstream <git url>

Create the following file at ~/.git/hooks/post-commit
#!/bin/sh
# Push the local repo to my upstream on commit
git push upstream
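One detail worth calling out: git only runs hooks that are executable, so after creating the file (path as in the example above), flip the execute bit.

# git silently skips hooks that are not executable
chmod +x ~/.git/hooks/post-commit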


Now that you’re armed with the basic details of setting up RANCID and a newly found tool for keeping track of your configurations, go forth and hack at it! With the goal of controlling your equipment, you can extend your current CM solution to reach down into the depths of the networking stack.

For more details, check out http://www.shrubbery.net/rancid/ and be sure to take a look at the other tools available from Shrubbery Networks, Inc.

Cacti

I think of Cacti as the granddaddy of Graphite. It is a round robin database–based statistics graphing tool primarily targeted at network equipment using SNMP (Simple Network Management Protocol), and you can find it at http://www.cacti.net/. It’s not trendy and it’s not written in Node, so why would you consider it? Cacti is a great fit when you need to poll devices to gather information instead of having them report data in. The configuration is centralized to the server it runs on, and for the most part, it Just Works™.

Cacti provides the Web UI that you need to get up and going quickly, including user, device, and graph management. For the backend, Cacti leverages RRDTool to store the time-series data collected from all the devices that you have configured. RRD is convenient for storing this data for a set period of time, as the file never grows. Cacti handles longer retention by storing data in multiple round robin archives (RRAs). RRAs define how many data points to store (Rows) over a specific length of time (Timespan), and how to aggregate that data (Steps). Steps is the number of data points to average into one data point for that RRA.

You can of course adjust the defaults as well as create your own RRAs for 18 months, 2 years, or any other timespan that you want. The important items to note here are the Steps and Rows. The step size defines how many data points are aggregated into one data point in the RRA. Timespan defines how many seconds to use when creating the actual graph from the data.
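If the relationship between Steps, Rows, and retention feels abstract, it can help to peek at the RRDTool layer underneath. The following is purely illustrative: Cacti creates and manages the .rrd files for you, and the file name and data source here are invented, though the archive sizes resemble Cacti’s stock defaults.

# Illustrative only: an RRD with a 300-second step and two archives
rrdtool create port_traffic.rrd --step 300 \
  DS:traffic_in:COUNTER:600:0:U \
  RRA:AVERAGE:0.5:1:600 \
  RRA:AVERAGE:0.5:24:775
# RRA:AVERAGE:0.5:1:600  -> 600 rows of 1-step (5-minute) averages, roughly 2 days
# RRA:AVERAGE:0.5:24:775 -> 775 rows of 24-step (2-hour) averages, roughly 2 months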

One of the strengths of Cacti is template-based configuration, which allows for excellent customization. To start, there are templates for different types of devices called Host Templates. The Host Template defines which Graph Templates are associated with a certain type of device. For example, there is a built-in Host Template named Cisco Router. When you assign this to a device, Cacti knows which graph templates are relevant. It would quickly become overwhelming if you had to sort through the entire graph template list!

So how would Cacti know how many ports your switch has? The short answer is that Cacti asks the device using SNMP or another custom Data Query. Yes, SNMP is getting long in the tooth, but it’s still a quick and easy way to get structured data, and that’s where Cacti’s Data Queries come into play. Data Queries like SNMP - Interface Statistics know that there is an index value within the result and use that to walk through the results and gather the relevant information. If you have data that is not available via SNMP, Cacti supports custom scripts that are run on the server to collect data via whatever means required.
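To get a feel for the kind of indexed data such a Data Query walks, you can run the equivalent query by hand with the net-snmp command-line tools. The hostname, community string, and the sample output below are placeholders, not taken from the book.

# Walk the interface table much as Cacti's SNMP Data Query does
snmpwalk -v2c -c mycommunity switch.example.com IF-MIB::ifDescr
#   example output:
#   IF-MIB::ifDescr.1 = STRING: FastEthernet0/1
#   IF-MIB::ifDescr.2 = STRING: FastEthernet0/2
snmpwalk -v2c -c mycommunity switch.example.com IF-MIB::ifInOctets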

Configuring a new device is done through a simple web form that asks a lot of questions, but it boils down to Description, Hostname, and Host Template. Most other settings can be inherited from system-wide defaults, such as timeouts and the SNMP connection details. When configuring SNMP on the remote hosts, be sure to change the community and not use the default of “public.” It is also strongly recommended to limit which hosts can query SNMP data or even what data those hosts are able to see.
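On a Linux host running net-snmp, that hardening might look roughly like the sketch below. The community string, the Cacti server address, and the view name are examples only, and the OIDs shown cover just the system and interfaces subtrees.

# /etc/snmp/snmpd.conf (sketch)
# Read-only community, only answerable from the Cacti server,
# and limited to a restricted view rather than the whole MIB tree
rocommunity n0tpublic 192.168.1.50 -V cactiview
view cactiview included .1.3.6.1.2.1.1
view cactiview included .1.3.6.1.2.1.2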

Once you have successfully created the host, you’ll be redirected to the details of that host and have the option to create graphs. Clicking on that link brings you to a page listing the Graph Templates that are associated with the device. Simply check the box next to the graphs that you want to create and click Create.


Within the next five minutes (the default polling interval), you should start seeing new data being graphed for the device you just created. Here is an example Hourly graph for a very low-usage switch port. The stair steps are due to the default polling cycle, which at five minutes is probably a bit too long by today’s standards. The green area is the inbound traffic, and the blue line represents the outbound. This graph template also includes the 95th percentile (not actually visible on the graph), a very common way of billing network traffic, which is usually bursty in nature.

You can easily look at graphs for a specific host, but that’s not always the most useful way to see the data. Another feature of Cacti is called Graph Trees. Graph Trees allow you to create a folder-like structure for sorting and viewing your graphs. Want to look at the network throughput of all of your Raspberry Pis at once? No problem! Here we create a Switch heading in the Default Tree that only contains graphs we care about.


To view all our hard work, we move away from Console to the Graph view in Cacti. These two modes distinguish configuration from normal use and slightly change the interface. Don’t worry, though: it’s just clicking what looks like a tab at the top of the page to toggle back and forth.

This is the Graphs view of Cacti, which contains the Graph Tree on the left, a quick filter across the top, and the resulting graphs in the main body. This is also where Cacti shows some of its age. While the page is mostly dynamic content, there is no AJAX in use, so the page refreshes via a meta tag, and you cannot directly interact with the graphs to change the time range you’re looking at.

If you’re paying close attention, you might have noticed that these graphs are for the Last Hour and are showing data from the Hourly RRA. If you want to see all the time periods for a specific graph, just click on it and you will be taken into the entire stored history of that data. This is very useful for identifying trends over time, which hopefully lets you plan for future growth.

lldpd

Link Layer Discovery Protocol (LLDP) is one of the most under-utilized yet extremely useful networking protocols you may never have heard of. Ever unplug the wrong server from a switch because of out-of-date documentation or spaghetti wiring? Yeah, me either…but now you can know exactly which port a server is plugged into with confidence! You just need to enable LLDP on your switch and install lldpd.

It is important to note that there are other Link Layer protocols that have been implemented by multiple network equipment vendors over the years. LLDP was defined by IEEE 802.1AB to provide a vendor-neutral specification. This was an important step, as cross-vendor devices could finally exchange information, and network engineers were pleased. Now it’s time to spread the information out to a broader audience.

While the inner workings of LLDP are beyond the scope of this paper, the basics are quite simple. A device, be it a server, switch, router, or anything else, sends information about itself at regular intervals out of all connected network interfaces. This information typically includes the system name, the name of the interface the data was sent on, and the system management IP address.

The receiving device then collects that data, adds what interface it saw that data coming from, and stores it for a specific amount of time. The data is only exchanged between devices directly connected over Ethernet, so you can now be certain which neighbor is really on a specific interface.

Capability Codes: R - Router, T - Trans Bridge, S - Switch, H - Host,
                  I - IGMP, r - Repeater

Device ID       Local Intrfce     Capability     Platform     Port ID
rpi-1           Fas 0/2           H              Linux        eth0
rpi-2           Fas 0/1           H              Linux        eth0

In this example from an old Cisco switch, we have two Linux hosts connected. So if I need to disconnect the eth0 interface from rpi-1, I know that it is plugged into the local port FastEthernet 0/2 of the switch. I can also tell that rpi-2 is not another switch, as the Capability column identifies it as H, which means it is a host.


There are a few implementations of LLDP for Linux: Open-LLDP, ladvd, and lldpd. I prefer lldpd for its simplicity of configuration, its ability to speak other proprietary discovery protocols like Cisco Discovery Protocol, and the multiple output formats of its client utility.

Depending on what distribution you are running, lldpd might not be available as a package, but compilation and installation are standard. Once you have the package installed, there really isn’t much to the configuration. For example, when using the Raspbian package for installation, all you need to do is start the daemon to get up and running. Enabling CDP requires a slight modification to the configuration, as follows:

/etc/defaults/lldpd
# Start SNMP subagent and enable CDP
DAEMON_ARGS="-x -c"

Once you have lldpd installed and running, it only takes a few seconds for data to start coming in from your neighboring devices. The command to view the current LLDP information is lldpctl, and by default it prints out some very verbose information. In the following example, you can see that we are actually using CDP to communicate with a very old Cisco 2924 Switch:

-----------------------------------------------------------
LLDP neighbors:
-----------------------------------------------------------
Interface:    eth0, via: CDPv2, RID: 4, Time: 8 days, 00:58:39
  Chassis:
    ChassisID:    local switch.example.com
    SysName:      switch.example.com
    SysDescr:     cisco WS-C2924-XL
    MgmtIP:       192.168.1.2
    Capability:   Bridge, on
  Port:
    PortID:       ifname FastEthernet0/2
    PortDescr:    FastEthernet0/2
  VLAN:         1, pvid: yes VLAN #1

So far all of this has been useful information for humans to parse, but that’s not really the scale I want to work at. lldpctl helps us out by providing multiple output formats including key-value and XML. Here is the same example in key-value format for comparison:

$ lldpctl -f keyvalue
lldp.eth0.via=CDPv2
lldp.eth0.rid=4
lldp.eth0.age=8 days, 01:00:23
lldp.eth0.chassis.local=switch.example.com
lldp.eth0.chassis.name=switch.example.com
lldp.eth0.chassis.descr=cisco WS-C2924-XL
lldp.eth0.chassis.mgmt-ip=192.168.1.2
lldp.eth0.chassis.Bridge.enabled=on
lldp.eth0.port.ifname=FastEthernet0/2
lldp.eth0.port.descr=FastEthernet0/2
lldp.eth0.vlan.vlan-id=1
lldp.eth0.vlan.pvid=yes
lldp.eth0.vlan=VLAN #1

Now we have information on the server about how it’s connected to the rest of the world, in a parsable format that, with a little work, we could pass to our configuration management software. One reason this becomes important is that it allows for automatic discovery of parent-child relationships between servers and the network equipment they’re attached to. With some simple wrapping—perhaps a custom fact if you are a Puppet user—you have the hostname of the switch you’re attached to (your parent). Now when you dynamically generate your monitoring configuration, you can pass along that you are connected to switch.example.com.
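As a rough sketch of that wrapping, an external fact for Puppet’s Facter can be nothing more than a script that plucks the chassis name out of lldpctl’s key-value output. The path, fact name, and uplink interface below are assumptions for illustration, not part of the original example.

#!/bin/sh
# Hypothetical external fact, e.g. /etc/facter/facts.d/parent_switch.sh
# Assumes lldpd is running and eth0 is the uplink interface
parent=$(lldpctl -f keyvalue | awk -F= '/^lldp\.eth0\.chassis\.name=/ {print $2}')
echo "parent_switch=${parent}"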

IPerf

A network can be a weird place, with ever-changing asymmetric paths moving bits from one place to another. Sometimes this leads to very different throughput performance between servers that appear to be otherwise identical. After running a few common utilities like traceroute—or perhaps curl on an HTTP service—you are still unable to replicate the problem.

Enter iperf, the network testing tool. Iperf is designed to measure the throughput between two points and runs as a client/server pair. It supports both UDP and TCP and can either test bidirectionally or unidirectionally from each endpoint in a single command. The power of iperf is how efficiently it is able to saturate a network connection. For TCP, it reports the overall throughput. For UDP, you can adjust the datagram size, and the report includes throughput, packet loss, and jitter.

Installation is easy, as the package exists in most major Linux distributions and is supported for cross-platform compiling, and Windows binaries are available from third parties. To run iperf, you do need to have your firewall configured to allow connections between the two endpoints, with the default server port for TCP and UDP being 5001.
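If you run host firewalls, opening that port on the server side might look like the following with iptables. The rules are illustrative only; in practice you would scope them to trusted source addresses.

# Allow iperf's default port 5001 in on the server under test
sudo iptables -A INPUT -p tcp --dport 5001 -j ACCEPT
sudo iptables -A INPUT -p udp --dport 5001 -j ACCEPT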


You also need to have iperf running on both sides of the connection to test. Iperf can either run in Server or Client mode, one on each end of the connection, as in the following examples.

TCP Server Mode
$ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 192.168.1.101 port 5001 connected with 192.168.1.100 port 43697
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.1 sec   114 MBytes  94.1 Mbits/sec

TCP Client Mode
$ iperf -c 192.168.1.101
------------------------------------------------------------
Client connecting to 192.168.1.101, TCP port 5001
TCP window size: 22.9 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.100 port 43697 connected with 192.168.1.101 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   114 MBytes  95.0 Mbits/sec

Performing an initial TCP test with iperf requires starting the server, then firing the client at it. In the above example, a Raspberry Pi is running as the server, with another system sending traffic to it. The end result is the complete throughput as measured from the client to the server. You can optionally measure both directions one at a time with the -r flag, or run both directions at the same time with the -d flag passed to the client. Which method to use depends on your expected workload, but in general you should get statistically similar results from the independent tests when each system is running on similar hardware.
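For reference, both variants are driven entirely from the client side; a quick sketch using the server address from the examples above:

# Test each direction in turn (client to server, then server back to client)
iperf -c 192.168.1.101 -r
# Test both directions at the same time
iperf -c 192.168.1.101 -d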

UDP Server Mode
$ iperf -s -u
------------------------------------------------------------
Server listening on UDP port 5001
Receiving 1470 byte datagrams
UDP buffer size: 160 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.101 port 5001 connected with 192.168.1.100 port 55272
[ ID] Interval       Transfer     Bandwidth       Jitter    Lost/Total
[  3]  0.0-10.0 sec   114 MB      95.7 Mbps       0.158 ms  0/81482
[  3]  0.0-10.0 sec   1 datagrams received out-of-order

UDP Client Mode
$ iperf -u -c 192.168.1.101 -b 100M
------------------------------------------------------------
Client connecting to 192.168.1.101, UDP port 5001
Sending 1470 byte datagrams
UDP buffer size: 208 KByte (default)
------------------------------------------------------------
[  3] local 192.168.1.100 port 55272 connected with 192.168.1.101 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   114 MB      95.8 Mbps
[  3] Sent 81483 datagrams
[  3] Server Report:
[  3]  0.0-10.0 sec   114 MB      95.7 Mbps  0.158 ms  0/81482
[  3]  0.0-10.0 sec   1 datagrams received out-of-order

Since UDP is stateless, it has no idea how far it can push the network like TCP does. So iperf has some sane defaults for sending UDP traffic, and by default targets 1 Mbps of throughput. The previous examples worked around this by specifying our target bandwidth of 100M using the -b flag.

You may also notice that the UDP view on the client side also presents a Server Report. This is required since there are no guarantees with UDP—the client could happily send packets all day long to a server that was dropping them on the floor. Be mindful of this when comparing UDP results, as the client and server should agree.

MUltihost SSH Wrapper

MUltihost SSH Wrapper, or mussh, is truly an old utility that you can still find good use for today. At its core, mussh is just a shell script wrapper around SSH that allows you to execute the same command across multiple hosts either in sequence or parallel. The script has been filling a gap that until recently most CM tools have ignored: running one-off commands across your distributed systems.

We have configuration management, so why would I possibly want to do something outside of that? While it is true that you should be enforcing system state with a bigger hammer, sometimes you only need to perform a task once, or infrequently.


Take for example NTP clock skew. I recently ran into an issue where I found that a service had a dramatically increased network queue time after a hardware upgrade. NTP configuration and service state were being enforced through CM, so I knew that it was running. I wanted to verify that all the system clocks were synchronized, and I didn’t want to manually ssh to each of them.

$ mussh -l pi -m 2 -h rpi-1 rpi-2 -c 'sudo ntpdate -u ntp.home'
pi@rpi-1: ntpdate: adjust time server ntp.home offset 0.0151 sec
pi@rpi-2: ntpdate: adjust time server ntp.home offset 0.0006 sec

Whoa, that’s a long command. So let’s break it down.

Sync NTP across multiple hosts concurrently
$ mussh \
    -l pi \                         # Set the ssh username
    -m 2 \                          # Run on two hosts concurrently
    -h rpi-1 rpi-2 \                # Hostnames for the command
    -c 'sudo ntpdate -u ntp.home'   # Sync ntp with this peer

The resulting output is in the format of hostname: output of the command, which is useful when you run commands that you expect a single line from. You can optionally pass the -b flag to have output buffered by mussh; otherwise, the output from all systems will display interwoven with each other if you are expecting multiple lines of output per host.
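For example, a command with multi-line output is much easier to read with buffering turned on. The hosts and command here are just placeholders reusing the lab setup from above.

# -b buffers each host's output so the lines don't interleave
mussh -b -l pi -h rpi-1 rpi-2 -c 'df -h /'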

We also make three assumptions in this example. The first is that you have key-based ssh authentication working between your computer and the list of hosts (you are using keys, right?). You also need to have ssh-agent running and working so you’re not prompted for your private key password for every host. Finally, this example assumes that the pi user is able to execute sudo on ntpdate without a password prompt, which works well for demos in a lab environment but is not a best practice for production.
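The ssh-agent part of that setup is a one-time step per login session, roughly as follows (key path is the common default; substitute your own):

# Start an agent for this session and load your key once
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa    # enter the passphrase here instead of once per host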

Well, that’s all fine and dandy, but I don’t want to type all the server names in for every command! Luckily, mussh supports reading the hosts one line at a time out of a file with the -H option. So instead, grab them from whatever storage mechanism you have (flat-file, database, RESTful endpoint) and pipe them in!

$ grep "rpi-" servers.txt | mussh -l pi -H - -c 'uptime'
pi@rpi-1: 21:12:54 up 4 days, 23:18, 1 user, load average: 0.00
pi@rpi-2: 21:12:54 up 4 days, 23:18, 1 user, load average: 0.00


In this snippet, I’m searching through a text file for my Raspberry Pi “servers” and sending those through mussh. The secret is in the -H -, which tells mussh that the file to read from is stdin.

Conclusion

We still have a lot to learn from the past about where we can go in the future. The tools of our past live on and inspire innovation, daring us to replace them with the next generation. Some of the tools discussed here have successors that are still in their infancy. Some of them are still in active development and could use an influx of motivated developers to push them to the next level. All of them have contributed to getting us where we are today and deserve—if nothing else—a tip of the hat.


About the Author

Jonathan Thurman is a Site Reliability Engineer at New Relic, where he applies his systems, networking, and coding skills to automate their infrastructure. Over the last 15 years, he has worked on a variety of projects including unified communications, performance testing, automation, and scaling for the Web.

Stay up-to-date with the O’Reilly Velocity Newsletter
Get web performance and operations news and insight delivered weekly to your inbox.

oreilly.com/velocity/newsletter