

SOFTWARE—PRACTICE AND EXPERIENCE, VOL. 27(10), 1163–1176 (OCTOBER 1997)

Pulsar: An Extensible Tool for Monitoring Large Unix Sites

RAPHAEL A. FINKEL*

Computer Science Department, University of Kentucky, Lexington, KY 40506-0046, U.S.A. (email:[email protected])

SUMMARY

Many problems can crop up unexpectedly on Unix computers, and the administrator must be able to detect and react to these problems quickly. If a site has more than a few computers, the effort needed to keep abreast of problems can lead to unresponsive administration. Pulse monitors form a simple but effective tool to assist the administrator in this task. This paper describes the Pulsar pulse-monitor package. It is composed of a presenter, which provides a graphical user interface to the administrator, a set of individual pulse monitors, which examine aspects of the status on a host and communicate their results to the presenter, and a scheduler, which executes pulse monitors according to the frequency specified by its configuration files. The set of pulse monitors is easily extended by the administrator to provide warnings about any situation that can be algorithmically detected. © 1997 by John Wiley & Sons, Ltd.

KEY WORDS: Unix; administration; monitoring

INTRODUCTION

Networks containing tens or even hundreds of Unix hosts under a single administrative authority are becoming more common. Such sites present administrative challenges due to the lack of homogeneity among the hosts (architecturally and with respect to Unix version) and the size of the site. Configuring hosts to share a common view has been addressed in various ways [1,2,3].

The second challenge, addressed by this paper, is everyday maintenance, which involves scrutiny of each host to ensure compliance with standards. These standards include proper behavior of background programs such as mail, adequate supply of resources such as disk and swap space, and security matters such as authentic copies of trusted software.

Some problems build up slowly. A long-running background program with a memory leak gradually eats away at swap space. Other problems can occur suddenly. A host might suffer a software crash, or a network link may become unusable due to a gateway or router failure.

Faced with a plethora of possible problem areas across a multitude of hosts, administrators often find themselves reacting to problems only as users submit complaints. Software to help the administrator discover problems in a timely fashion is clearly needed.

Pulse monitors are simple programs that run at regular intervals on each monitored host. Each monitor has a specific task, such as discovering the current CPU load. It also has a notion as to the comfort level of the data it discovers. For example, a CPU load of 0.10 might be considered excellent, but 7.6 might be considered uncomfortable. Pulse monitors report values to a presenter that displays the global situation to the administrator.

* This research was supported in part by NSF grant CDA-9502645 and DOE-EPSCoR grant DE-FC02-91ER75661.

CCC 0038–0644/97/101163–14 $17.50. Received 10 October 1996; revised 29 January 1997. © 1997 by John Wiley & Sons, Ltd.

This paper discusses Pulsar, an implementation of the pulse-monitor idea. This implementation is freely available from ftp://ftp.cs.uky.edu/cs/software/pulsar.tar.gz.

It is quite easy to code new pulse monitors and embed them in a running Pulsar environment without changing existing software.

ARCHITECTURE

The entire Pulsar pulse monitor package is written in Tcl/Tk [4] and is intended to run under the X Window System [5]. This language choice allows Pulsar to be easily ported to almost any Unix environment.† The choice also makes the code fairly easy to read and maintain.

The Pulsar package is composed of three principal components: pulse monitors, the scheduler, and the presenter.

Pulse monitors

Pulsar pulse monitors all invoke Unix commands to determine the current situation (‘measure the pulse’) and send the result, which is called an alarm, to the presenter.

A typical pulse monitor first determines the name of its host and the version of Unix running on that host. It then consults a configuration file specific to that pulse monitor to see how such a host should be treated. The current situation is then evaluated by means of other programs, such as df for disk usage and uptime for CPU usage. The calling convention for these programs is often operating-system dependent; differences are usually hardcoded in the pulse monitor in a switch statement. The results of the invoked programs are converted to a value that reflects the comfort level, which is reported to the presenter via the report program.

The first and second parameters to report are the major and minor name of the alarm. By convention, the major name of an alarm is usually the host name, and the minor name is a resource class, such as cpu. Practically any names may be used; the presenter builds new alarm indicators as needed. The third parameter to report is a numeric measure of comfort. These values follow a convention:

(a) 0: no problem; remove from display if present.
(b) 1–9: no problem; show in green.
(c) 10–19: potential problem; show in yellow.
(d) 20–: problem; show in red.

For example, the CPU pulse monitor multiplies the current load by 15 and adds 1 to generate a comfort value. Load 0 has value 1 (just fine), load 0.5 has value 8.5 (no problem), load 1.0 has value 16 (potential problem), and load 2.0 has value 31 (problem). The fourth parameter to report is an informative message text associated with the alarm.

Report is a short Tcl script. It could be embedded in any pulse monitor written in Tcl; for modularity, it is kept in a separate file. Pulsar also contains a version of report written in Perl [6]. Report first reads a configuration file built by the presenter to discover the presenter’s host name and port number. It then establishes a socket to the presenter based on that name and port number. As a simple protocol check, it sends the modification time of the presenter configuration file. It then sends the alarm itself.

† Tcl and Tk run under other platforms as well; Pulsar might easily port to those platforms.
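The comfort-value convention and the CPU example above can be sketched in a few lines. This is an illustrative Python sketch, not Pulsar's Tcl code; the function names are hypothetical, and the treatment of fractional values such as 8.5 (assigned to the band below the next threshold) is an assumption, since the convention is stated only for whole numbers.

```python
def cpu_comfort(load):
    """Comfort value as described for the CPU pulse monitor: load * 15 + 1."""
    return load * 15 + 1


def comfort_color(value):
    """Map a comfort value to the display convention.

    Returns None for 0 (remove from display). Assumption: a fractional
    value belongs to the band below the next threshold (8.5 is green).
    """
    if value <= 0:
        return None
    if value < 10:
        return "green"
    if value < 20:
        return "yellow"
    return "red"


# Reproduce the four loads discussed in the text.
for load in (0, 0.5, 1.0, 2.0):
    value = cpu_comfort(load)
    print(load, value, comfort_color(value))
```

Keeping the scaling (load to comfort value) separate from the banding (comfort value to color) mirrors the split described above: each monitor decides what is comfortable, while the color convention is shared.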


Our earlier versions of Pulsar had report use the Tk send command to execute an acceptVal procedure directly in the presenter. Using sockets is better, because (1) Tk requires X authorization to be in force to enable send; (2) the pulse monitors would need to have permission to write to the X session of the administrator, which would be a security problem; and (3) every active Tk script opens a connection to the X server, which may have a limited number of available connections.

The scheduler

The scheduler is a short Tcl program that reads a configuration file and then invokes pulse monitors as specified in that file. Each line in the configuration file specifies a particular pulse monitor, the interval in minutes between activations, and the hosts to which the line applies, independently restricting by host name, architecture, and operating system. For example, the lines

* * SunOS 10 {exec cpu.test.tcl &}
miles * * 24*60 {exec disk.test.tcl &}

indicate that hosts running SunOS should run the CPU pulse monitor program every 10 minutes, and that host ‘miles’ should run the disk pulse monitor once a day.
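To make the line format concrete, the following Python sketch parses the two example lines and decides whether an entry applies to a given host. It is only an illustration: the real scheduler is written in Tcl, the function names are hypothetical, and two details are assumptions not stated in the text, namely that an interval such as 24*60 is a product of integers and that restrictions are matched as shell-style patterns.

```python
import re
from fnmatch import fnmatchcase

# host  architecture  operating-system  interval  {command}
LINE = re.compile(r"^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+\{(.*)\}\s*$")


def parse_interval(expr):
    """Evaluate an interval such as '10' or '24*60' (assumed: integer product)."""
    minutes = 1
    for factor in expr.split("*"):
        minutes *= int(factor)
    return minutes


def parse_line(line):
    """Split one scheduler line into its five fields."""
    match = LINE.match(line.strip())
    if match is None:
        raise ValueError("malformed scheduler line: %r" % line)
    host, arch, opsys, interval, command = match.groups()
    return {"host": host, "arch": arch, "os": opsys,
            "minutes": parse_interval(interval), "command": command}


def applies(entry, host, arch, opsys):
    """True if all three restrictions accept this host ('*' matches anything)."""
    return (fnmatchcase(host, entry["host"])
            and fnmatchcase(arch, entry["arch"])
            and fnmatchcase(opsys, entry["os"]))
```

With these helpers, the first example line yields an interval of 10 minutes for any SunOS host, and the second yields 1440 minutes for host ‘miles’ only.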

One instance of the scheduler runs on every host in the site. The same configuration file may be used for all instances; network-shared files, such as those provided by NFS, are the easiest way to have all instances share both configuration files and executables for the scheduler and the pulse monitors. It is also possible to copy the files if network-shared files are not possible.

Most pulse monitors run on every host according to the same schedule. Some pulse monitors do not run on particular hosts because they don’t apply there. Pulse monitors that check global statistics run on only a single host.

The scheduler uses the Tcl after command to cause events to occur when specified. Pulse monitors usually take only a few seconds to complete.

The presenter

The presenter generates an X window for displaying alarms. To avoid confusion between mouse buttons and Tk buttons, the latter are called ‘icons’ here. The presenter shows a display with three regions: top, middle and bottom.

Unless the administrator requests more, only the top region is shown. This region holds an icon for each major alarm name, arranged in alphabetical order. Each icon is colored green, yellow or red according to the most uncomfortable value of any alarm with that major name. At first the display is empty; as alarms are registered by pulse monitors, they are shown. The presenter also logs each alarm into a log file.

If the administrator clicks on a top icon, the middle region is displayed, with an icon for each minor name associated with the major name. For example, if all alarms associated with host ‘miles’ have major name ‘miles’, then clicking on its icon displays all alarms associated with this host. Each middle icon is colored by the same convention. If the administrator clicks on a middle icon, the bottom region is displayed, showing details of the alarm, including a list of the most recent messages. The administrator can then remove the bottom and middle regions from the display.

The color of an icon depends on the most recent value that has been received for it. The administrator has three ways to control the appearance of icons:


Figure 1. A sample display

• The administrator might wish to ignore a situation, such as mail waiting for root, unless it changes in any way. Such an alarm may be temporarily ‘ignored’. The middle icon of an ignored alarm is displayed in blue, and its top icon ignores the value of this alarm.

• Alarms can also be ‘tolerated’, which means that their current value is to be treated as the end of the green range. The middle icon of a tolerated alarm is displayed in green. As the cursor passes over the middle icon of a tolerated or ignored alarm, the icon shows its true color.

• The administrator can delete both major and minor icons entirely from the display.
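One way to read the ignore and tolerate rules is sketched below in Python. The alarm record (a dict with hypothetical keys value, ignored_at and tolerated) and the function names are inventions for illustration; the presenter's actual Tcl data structures are not described in the text. The sketch also assumes that a top icon with only ignored alarms shows no color, which the text does not specify.

```python
SEVERITY = {"green": 1, "yellow": 2, "red": 3}


def band(value):
    """Color band for a raw comfort value (0 means nothing to show)."""
    if value <= 0:
        return None
    if value < 10:
        return "green"
    return "yellow" if value < 20 else "red"


def middle_color(alarm):
    """Displayed color of one minor-name icon.

    'ignored_at' holds the value at the time the alarm was ignored (or None);
    'tolerated' holds the value the administrator chose to tolerate (or None).
    """
    value = alarm["value"]
    if alarm.get("ignored_at") is not None and alarm["ignored_at"] == value:
        return "blue"       # ignored until the value changes
    tolerated = alarm.get("tolerated")
    if tolerated is not None and 0 < value <= tolerated:
        return "green"      # tolerated value acts as the end of the green range
    return band(value)


def top_color(alarms):
    """Worst displayed color for a major name; ignored alarms do not count."""
    colors = [c for c in map(middle_color, alarms) if c not in (None, "blue")]
    return max(colors, key=SEVERITY.get, default=None)
```

Under this reading, a tolerated alarm returns to its true color as soon as its value rises above the tolerated value, and an ignored alarm does so as soon as its value changes at all.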

Figure 1 shows the display with all three regions.

In addition to the primary presenter, the Pulsar package also contains a secondary presenter to let another staff member view alarms independently. The secondary presenter connects to the primary presenter, which first sends a dump of all current alarms and then arranges to send all new alarms as they arrive. Each presenter has an independent log file and internal variables indicating what alarms are currently ignored or tolerated, so each administrator has independent control over how alarms are displayed. If the primary should fail, the secondaries introduce a red pseudo-alarm with major name ‘internal’, which indicates that the connection has been lost.

Pulsar allows multiple primary and secondary presenters. Multiple primary presenters are appropriate if a single site contains several disjoint administrative or NFS domains. Each secondary presenter may connect to a list of primary presenters, so a single secondary presenter may view the status of all the domains. Multiple secondary presenters are appropriate if there are multiple staff members who need to access the data.

Table I. Commonly used pulse monitors

Name        Purpose

Limited resources
cpu         CPU load
disk        remaining disk capacity
mem         large processes
swap        swap space available

Program behavior
mail        non-empty mail queue
sat         sat daemon operation
process     daemon processes running
source      up-to-date source code
excess      proper number of processes

Hardware behavior
ping        running status of hosts
collis      rate of collisions on ethernet
connect     latency across network links
printers    status of printers

Security
rootmail    mail sent to root
log         problems noted in logs
xhost       insecure X servers
md5         proper version of programs

PARTICULAR PULSE MONITORS

Pulsar comes with pulse monitors that have been helpful at our site. It is straightforward to customize these and to build others. We have written our pulse monitors in Tcl; another good choice would be Perl [6]. This section gives a feeling for what we currently test with Pulsar, as summarized in Table I.

To keep pulse monitors as portable as possible, those that need lists of site-specific information, such as local printers, hosts to check for liveness and gateways to ping, call the enumerate script, which is customized to the site. Some pulse monitors have individual customization files as well. Many pulse monitors have code that checks the particular operating system type and adjusts behavior accordingly. For example, the equivalent to df on HP-UX is bdf. Our scripts have been generalized to work under Solaris, Linux, IRIX, HP-UX, AIX, OSF1 and BSDI.

Limited resources

We check the CPU load every 30 minutes. A high load often indicates a runaway process; X clients sometimes fail in this manner when they lose connection to the X server.

We look at remaining disk capacity every day. The pulse monitor checks local disks with the df command. A disk that is 92 per cent full is in the yellow range. The red range starts at 95 per cent full. Disks that are only 88 per cent full are not even shown.

Memory load is checked in two ways. First, every five hours we check for large processes. A few daemons, like lpNet under Solaris, appear to have memory leaks; when they get too big, it is time to kill them and restart them. The pulse monitor for memory load uses a configuration file that lists specific yellow ranges for particular programs. Programs that are not listed follow a default rule. The default rule is architecture-specific, because we find that Linux programs are generally small (the yellow region starts at 500K), Solaris ones larger (yellow starts at 3M), and IRIX64 even larger (yellow starts at 5M). We don’t use red alarms for memory load.

The second memory test measures available swap space every two hours. We set 20M as the yellow limit and 5M as the red limit.

Program behavior

Every day we check to make sure that the mail queue is empty on all our hosts. We also check that SAT is functioning properly. We use SAT to maintain global configuration [3]. More generally, we check that all standard daemons are running. Occasionally, a host boots and fails to start a daemon, or a daemon exits.

Once a month, we check the sites from which we have loaded source code to see if more recent code is available. A configuration file records the site, directory, prefix, version, and suffix of each software package. For instance,

ftp.cc.gatech.edu /pub/gnu ispell- 4.0 .tar.gz

indicates that we have version 4.0 of ispell and that it comes from ftp.cc.gatech.edu. The pulse monitor connects to this site and sees if a higher-numbered version of ispell is available. This pulse monitor is written in Expect [7], an extension to Tcl for interaction with programs like ftp. The major name for its alarm is ‘software’, and the minor name is the software package.
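The version comparison at the heart of this monitor can be illustrated without any network access. The Python sketch below is not the Expect script; it takes a directory listing as a plain list of file names and assumes that versions are dotted integers such as 4.0, which the text does not guarantee. The function names are hypothetical.

```python
def parse_entry(line):
    """Split a configuration line into site, directory, prefix, version, suffix."""
    site, directory, prefix, version, suffix = line.split()
    return site, directory, prefix, version, suffix


def version_key(text):
    """Turn '4.0' into (4, 0) so that 4.10 sorts after 4.9."""
    return tuple(int(part) for part in text.split("."))


def newer_versions(listing, prefix, version, suffix):
    """Return the versions found in `listing` that are higher than `version`."""
    found = []
    for name in listing:
        if name.startswith(prefix) and name.endswith(suffix):
            candidate = name[len(prefix):len(name) - len(suffix)]
            try:
                if version_key(candidate) > version_key(version):
                    found.append(candidate)
            except ValueError:
                continue  # not a plain dotted number; skip it
    return sorted(found, key=version_key)
```

Comparing versions as integer tuples rather than strings avoids ranking 4.9 above 4.10. A non-empty result would correspond to an alarm with major name ‘software’ and the package as minor name.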

Every seven hours, we check if multiple copies of particular programs are running, particularly lpNet. Two copies cause a green alarm, three a yellow alarm and five a red alarm.

Hardware behavior

One host pings all the others every hour. Unfortunately, there are software failure modes in which a host responds to ping but is in fact not functioning correctly. Hosts that respond to ping are therefore probed further by telnet to the daytime request port. The list of hosts that need to be queried is generated by enumerate. This program may be configured to use NIS [1] or /etc/hosts. We use SAT [3] to get an up-to-date list of those hosts we expect to be up.


Our local network is generally reliable. Still, we check the rate of collisions every hour by running netstat -i twice, five seconds apart, and computing the rate of collisions. The yellow range starts at 20 per cent collision rate and red at 40 per cent. The major name for this alarm is ‘network’; the minor name is ‘collisions’. This pulse monitor is only run on one host per local network.
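The two-sample computation might look like the Python sketch below. The text does not say how the rate is defined; the sketch assumes it is the number of new collisions as a percentage of new output packets between the two samples, and it takes the counters as plain numbers rather than parsing netstat -i output, whose layout varies by system. The function names are hypothetical.

```python
def collision_rate(before, after):
    """Percentage of output packets that collided between two samples.

    Each sample is a hypothetical (output_packets, collisions) pair of
    cumulative counters. Assumed definition: new collisions divided by
    new output packets.
    """
    sent = after[0] - before[0]
    collided = after[1] - before[1]
    if sent <= 0:
        return 0.0          # no traffic in the interval
    return 100.0 * collided / sent


def collision_color(rate):
    """Yellow starts at 20 per cent, red at 40 per cent."""
    if rate >= 40:
        return "red"
    if rate >= 20:
        return "yellow"
    return "green"
```

Using the difference between two samples, rather than the cumulative counters, measures the current rate instead of the average since boot.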

Our network crosses several ATM links which have been experiencing strange behavior. We run ping across the links every hour for five seconds. If one message is lost, the link is in the yellow region. If more are lost, the link is in the red region. If the maximum round-trip time is worse than 6 ms, the situation is yellow; if worse than 20 ms, it is red. The major name for this alarm is ‘network’; the minor name is ‘stats’.
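The two link criteria can be combined as in the following Python sketch. Taking the worse of the loss result and the latency result is an assumption; the text gives the two rules separately and does not say how they interact. The function name and parameters are hypothetical.

```python
def link_color(sent, received, max_rtt_ms):
    """Classify a link from packet loss and maximum round-trip time.

    Assumption: the reported color is the worse of the two criteria.
    """
    lost = sent - received
    if lost > 1 or max_rtt_ms > 20:
        return "red"
    if lost == 1 or max_rtt_ms > 6:
        return "yellow"
    return "green"
```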

A pulse monitor on one host runs lpstat every four hours to check the health of printers. Any printer that does not respond is considered yellow. The major name for this alarm is ‘printers’; the minor name is the particular printer.

Security

Mail is not usually sent to root, but occasionally, people outside the site don’t know a better way to alert us to a problem or to ask for help. We often forget to check for root mail, and we prefer not to forward it. Once a day, a pulse monitor checks the length of /usr/spool/mail/root and sends an alarm if the length is not zero.

System logs are a fruitful guide to problems, including security problems. Once a day, a pulse monitor checks logs for lines that contain specific strings but not others. These specifications are listed in a configuration file. For example,

HP-UX /usr/adm/syslog root ishmael|ahab security su:%h 25

specifies that on hosts running HP-UX, the pulse monitor should look in /usr/adm/syslog for lines that contain the word ‘root’ but do not match the names of our superusers. If such a line is found, it should generate a red alarm (value 25) with major name ‘security’ and minor name ‘su:’ followed by the host name. We use this pulse monitor to track both attempts to become root and repeated login failures. It would be helpful if syslogd could be configured to invoke report directly for certain classes of problems. Unfortunately, standard releases of syslogd do not include the ability to invoke arbitrary programs.
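The rule format can be exercised with a small Python sketch. The field order follows the example line (operating system, log file, required pattern, excluded pattern, major name, minor name template, value); treating both patterns as regular expressions, expanding %h to the host name, and producing one alarm per matching line are assumptions made for illustration. The function names are hypothetical.

```python
import re


def parse_rule(line):
    """Split one log-monitor rule into its seven whitespace-separated fields."""
    opsys, path, include, exclude, major, minor, value = line.split()
    return {"os": opsys, "path": path,
            "include": re.compile(include), "exclude": re.compile(exclude),
            "major": major, "minor": minor, "value": int(value)}


def scan(rule, lines, host):
    """Return (major, minor, value, message) for each suspicious line.

    A line is suspicious if it matches the include pattern but not the
    exclude pattern. The tuple mirrors the four parameters of report.
    """
    alarms = []
    for line in lines:
        if rule["include"].search(line) and not rule["exclude"].search(line):
            minor = rule["minor"].replace("%h", host)
            alarms.append((rule["major"], minor, rule["value"], line.strip()))
    return alarms
```

For the example rule, a log line mentioning root and one of the listed superusers is skipped, while a line mentioning root and any other user produces an alarm with value 25.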

Users sometimes fail to run their X sessions in a reasonably secure fashion. Every three hours, a pulse monitor attempts to connect to the X server on every host. If it succeeds and access control is disabled, it flashes an annoying warning on the offending screen for 10 seconds and sends a red alarm. If it succeeds in connecting but access control is properly enabled, a green alarm is sent. The major name of these alarms is ‘security’; the minor name is composed of the host name and ‘xhost’.

Once a day, a pulse monitor checks the MD5 signatures [8] of many important utilities such as login, ps, telnetd, ifconfig, ftpd, rexecd and rshd. The expected MD5 signatures are stored in a configuration file with entries keyed to operating system name, version and architecture. Any file that is expected but missing elicits a yellow alarm; any file that has the wrong signature elicits a red alarm. This monitor has discovered configuration errors, such as failing to apply security patches to a host. The major name for the alarm is ‘security’; the minor name is composed of the host name and the name of the utility that seems wrong. There are other tools for this kind of monitoring, particularly tripwire [9]; the advantage of using Pulsar is that security and performance checks are integrated into a common interface.
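The signature check itself is simple, as the Python sketch below suggests. It uses the standard hashlib module rather than whatever MD5 utility the Tcl monitor invokes. The comfort values 10 and 20 are assumed (the lowest yellow and red values under Pulsar's convention), as is returning 0 for a matching file. The function names are hypothetical.

```python
import hashlib
import os


def md5_of(path):
    """Hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_file(path, expected):
    """Return a comfort value for one utility.

    Assumed values: 10 (yellow) if the file is missing, 20 (red) if the
    signature differs, 0 (nothing to report) if it matches.
    """
    if not os.path.exists(path):
        return 10
    if md5_of(path) != expected:
        return 20
    return 0
```

A monitor built this way would look up the expected digest by operating system name, version and architecture, as described above, before calling check_file for each utility.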


Other associated tools

The Pulsar package includes scripts for maintenance. One script, killSched, terminates all schedulers. It does this by touching a file inspected periodically by the scheduler. This technique assumes that all schedulers run from the same directory mounted via NFS. One alternative is to run schedulers as Tk scripts and to send an exit command to each, but that method uses X resources that we would rather spare and introduces security worries. Another alternative is for each scheduler to listen to a particular port; killSched would send a message to that port. This method requires that killSched know which hosts are running schedulers and also needs to be protected against malicious misuse.

Another script, queryPulse, is useful for an administrator working from another terminal without visual access to the presenter output. It sends the presenter a command that causes it to enumerate its current alarms and their values and messages. These are sorted worst first and written to standard output.

The startat script is used to start schedulers on all hosts in its parameter list. It is used in conjunction with notReporting, a script that outputs the names of all hosts that have no recent entries in the presenter’s log file and need to have the scheduler started.

DESIGN AND IMPLEMENTATION RATIONALE

Pulsar’s design directs all communication from pulse monitors, which are evanescent, to the presenter, which is intended to continue running for long periods. Schedulers do not communicate either with the presenter or with running pulse monitors. The result is a form of stateless communication. This design has some pleasant effects:

(a) A failed pulse monitor does not interfere with the rest of Pulsar.
(b) Pulse monitors can be redesigned while Pulsar is operating; the next time the scheduler runs a pulse monitor, it always uses the current one.
(c) The presenter does not need to be restarted, much less rewritten, to accommodate new pulse monitors. It is easy to terminate and restart all schedulers when needed, such as when the scheduler configuration file changes.
(d) The presenter can be redesigned while Pulsar is operating. Simple changes can be tested in a running presenter by sending it new procedure code.
(e) The presenter may be terminated and restarted. While it is down, any pulse monitor that tries to contact it fails. The only effects are that some alarms are lost and that secondary presenters lose their connection. As soon as the presenter comes up again, it begins to display newly arriving alarms, although alarms that occurred while it was down are lost, and secondary presenters must be restarted. The presenter remembers the tolerances set by the administrator.

At the same time, this design has a flaw: an alarm cannot be customized by the administrator. The way an alarm is displayed can be modified slightly (ignored, tolerated or removed), but the only way to change the semantics of an alarm is to modify the pulse monitor that generates it. If a single pulse monitor needs to apply different tests or different comfort levels on different hosts, those differences must be coded into the pulse monitor, typically by having it read a configuration file.

Under the Pulsar design, pulse monitors all report to a single presenter, which then feeds information to any secondary presenters connected to it. An alternative organization would have each pulse monitor directly contact each presenter. The alternative would add some complexity to the report program and the configuration file that indicates the presenter’s identity. It would be more robust, in that failure of a single presenter would not affect the others. We have not had problems with robustness; the presenter runs for months without problems, generally until its host goes down. The alternative organization would not remove the need for secondary presenters, because they are able to collect data from several primary presenters and display it as a unified whole. The reason we want this facility is that large sites are likely to have multiple administrative domains, each of which should have its own presenter.

Any network program raises security concerns. The presenter expects that pulse monitors obey a simple protocol on sockets; this protocol is hard for an intruder outside the local site to employ, because it includes the date of a file. The worst effect a successful intruder can have on the presenter is to send an erroneous or malformed alarm, possibly hiding a real alarm. The primary presenter never reads any data sent by secondary presenters. However, Pulsar currently has no mechanism to restrict connections from secondary presenters; a remote intruder who can guess the port number of the primary presenter can view local alarms.

Very large sites might run into scalability problems. The presenter uses a constant-time algorithm to update its display when an alarm arrives with a previously known major name, so the computational load on the presenter is not affected adversely by the number of hosts. In fact, the total amount of time spent by the presenter is quite small. On our site with approximately 50 machines, each of which reports at least twice an hour, less than 16 minutes of compute time (Sun SparcStation 20, 125 MHz, Solaris 2.5.1) accrues in a week of operation. The amount of network traffic generated by the pulse monitors is also quite low, so it should not prevent Pulsar from running well on thousands of nodes. However, the amount of display area that must be dedicated to the presenter grows with the number of major names. If less space is allocated, scroll bars allow the administrator to access the hidden parts, but the display can no longer be checked at a glance. This problem can be addressed to some extent by judicious choice of major and minor names. Instead of reserving a major name for each host, hosts could be grouped into classes, each of which would have its own major name.

Alarms have a fine-grain value (an integer) as well as a coarse-grain value (a color). Both values serve important roles. The fine-grain value allows the administrator to distinguish between severities of alarms and to ignore alarms until they get worse. For example, if disk space is running short, but there is still 50M available on the disk, the administrator might choose to tolerate the current alarm value, which might be 13, in the yellow range. Later, as free disk space declines to 40M, the alarm again becomes apparent, even though the discomfort, perhaps now at 15, is still in the yellow range. On the other hand, numeric values are not easy for an administrator to assimilate. The panel of mostly green icons is understood at a glance. The few problem areas are immediately apparent.

In some ways, the scheduler is similar to the cron program, which is standard under Unix. One distinction is that for Pulsar’s scheduler, a single configuration file indicates for each type of host exactly what pulse monitors to invoke. Cron requires a separate file on each host, so the cost of installing and maintaining pulse monitors would be far higher under a cron approach. Another distinction is that cron tasks are specified by the time at which they are to run, not their frequency. The cron entry for a pulse monitor that must run every half hour, the frequency at which we run the CPU monitor, would be fairly cumbersome. An entry for a pulse monitor that is to run every five hours would be worse. On the other hand, it is not so important to run most pulse monitors on off hours, and others might be better run during off hours; cron might be appropriate for such monitors. Pulsar’s design does not prohibit such usage; pulse monitors may be started manually, by the scheduler, or by any other means desired.

Because pulse monitors are scheduled at some rate, problems can arise and not be noticeduntil the next time the associated pulse monitor is run. Most system-administration problems,luckily, develop over the course of time and can be noticed well before they become critical.However, we have seen programs with severe memory leaks use up all of swap space beforean alarm appears, and processes that accidentally spawn many CPU-bound children drive theCPU load up faster than an administrator can notice in a timely fashion. Users who run X inan insecure fashion may not be doing so during the fairly infrequent probe for this problem. Itis not clear how such problems can be caught in a timely fashion without increasing the rateof monitoring, which at some point becomes a computational burden.

Tcl/Tk is a great tool for implementing software like Pulsar because it provides socket,which allows easy inter-application communication,send, which lets the implementer interactwith running applications, and after, which makes a scheduler easy to write. New pulsemonitors are easy to build and old ones are easy to modify to fit changing needs. Further, Tkallows fairly pleasant GUIs to be built without undue effort. This power has a price. First, Tclis interpreted, so it is not particularly fast. Luckily, speed is not crucial to Pulsar. Second, thesyntax of Tcl is error-prone and hard to get used to. However, most errors crop up early indebugging and are readily fixed.

The fact that the presenter is written in Tk means that any other Tk application connected to the same X display can send it arbitrary and potentially malicious code. Tk mitigates this problem by (1) disallowing send if the X server does not have authentication properly enabled, (2) allowing applications to explicitly remove the send command, and (3) allowing applications to enter a safe interpreter, which is quite restricted. It is wise not to run any Tk script with root privilege, because X authentication is not necessarily strong enough to prevent attacks in which intruders gain access to the X terminal. Some pulse monitors must be run with root privilege; for example, the pulse monitor that checks MD5 signatures must check files that are not publicly readable. These pulse monitors should perhaps be coded in Perl,6 which is careful to be safe when running as root, or in a safe Tcl interpreter. Such pulse monitors should call report as non-root when they need to send an alarm.

RELATED WORK

The idea of monitoring performance is not new. Metric uses network probes to measure software behavior on hosts connected by an Ethernet.10 Monitored software includes calls to probe to send messages to an accountant process somewhere in the network. The accountant filters messages and saves them on the disk to be analysed offline.

Commercial monitoring packages use the Simple Network Management Protocol (SNMP)11,12 and other means to monitor the health of hosts on a network. Sun Microsystems markets the SunNet Manager program.13 SGI machines come with a graphical tool called gr_osview for watching host statistics in a network. Hewlett-Packard markets OpenView, a facility that can build a graphical map of a network and monitor the devices that respond to SNMP requests. In contrast to these commercial packages, Pulsar is small (about 1400 lines of code), integrated (any measurable quantity can be monitored), expandable (it is easy to add new pulse monitors, even as Pulsar is running), portable and freely available.

Several Unix-based projects are much closer to Pulsar in concept and implementation. Palantír provides monitoring, error detection and administration of networked computers.14 Like Pulsar, it has pulse monitors (called error-detecting modules), which can be written in any language. Unlike Pulsar, which has two levels (monitors that send messages to the presenter), Palantír has four levels: multiple modules on a host communicate with a daemon on that host, which communicates with a netserver (usually on a different host), which communicates with a single database server. The database is fairly heavy-duty (it allows SQL queries); it contains expected host properties. Palantír’s Version 2 is scheduled for release in July 1997.

RSCAN offers a uniform way to run any number of independent scans on any number of computers and organize the results of all the scans in a formatted report.15 Typically, the scans are conducted to check for security flaws and configuration errors. Each pulse monitor (called a module) explicitly contains OS-independent, OS-dependent, and OS-version-dependent parts. Reports are either in ASCII or in HTML, which allows for WWW access. During a scan, modules are copied from the supervisor host to each host being scanned. In contrast, Pulsar allows each monitor to be scheduled at an appropriate rate, and the current report is always available in a visual presentation.

The System Administrator’s Cockpit (satool) developed at the University of Colorado is used for early warning of problems on hosts.16 It is composed of an SNMP-aware agent running on each host, a database that polls for data from those agents, and a presenter written in Tcl/Tk that allows the administrator to access the database in a hierarchical fashion. The presenter is configured with alarm conditions for values of monitored variables. In contrast, Pulsar uses a unidirectional flow of information from pulse monitor to presenter; the presenter does not query the pulse monitors. This simpler approach means that the presenter does not have to be customized to deal with new pulse monitors that might be developed later, nor is polling needed. On the other hand, Pulsar’s presenter cannot request that a pulse monitor check the current status; it must wait until the pulse monitor is scheduled to run. In practice, we find it easy to run any pulse monitor on any machine manually when needed.
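The unidirectional flow described above can be sketched as follows. This is an illustration in Python, not Pulsar's Tcl code: the Presenter class, its receive method, the numeric alarm levels, and the CPU comfort rule are all hypothetical. The point is only that a presenter which accepts whatever reports arrive needs no change when a new kind of monitor appears.

```python
class Presenter:
    """Holds the latest report per (host, resource); it never queries monitors."""

    def __init__(self):
        self.latest = {}

    def receive(self, host, resource, level, text):
        # Any (host, resource) pair is accepted, so a new monitor needs no
        # change to the presenter.
        self.latest[(host, resource)] = (level, text)

    def worst_level(self, host):
        """Highest level currently reported for a host, or 0 if nothing has arrived."""
        levels = [lvl for (h, _), (lvl, _) in self.latest.items() if h == host]
        return max(levels, default=0)


def cpu_monitor(presenter, host, load):
    # Hypothetical comfort rule: level 0 is fine, level 2 is alarming.
    level = 2 if load > 4.0 else 0
    presenter.receive(host, "cpu", level, f"load {load:.2f}")
```

The cost of this design is the one noted in the text: the presenter holds only what monitors choose to send, so a fresh reading requires running the monitor again.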

Scotty is a graphical tool written in Tcl/Tk that allows administrators to keep track of the status of networked equipment.17 The administrator must set up the presenter’s graphical display manually, introducing icons for each equipment object. Objects can be grouped together into a single icon, which can be expanded interactively. Some monitors are built into the presenter, particularly pinging to see that a piece of equipment is up, accepting SNMP traps from equipment, sending SNMP queries based on a table of acceptable requests for different kinds of equipment and manufacturers, and collecting and filtering syslog entries. Information is displayed either as text or a graph, updated typically once per minute. If a value exceeds a threshold, Scotty can be configured to write a message to a log file, flash an icon, or pop up a window. Other pulse monitors can be built by having them write to the syslog.

The System Diagnostic Console is currently under development at Berkeley as part of the Network of Workstations (NOW) project.18 Like Pulsar, it uses a graphical display on a single screen to present statistics from every host of the site. It also groups related information to allow it to display data from hundreds of hosts. Unlike Pulsar, it can calculate aggregate information such as averages from data to reduce the amount it must display; Pulsar data is typically snapshot information.

EXPERIENCE

We have been using Pulsar regularly for several months in different sites at the University of Kentucky. These sites include a fairly homogeneous shop of about 50 hosts, mostly running Solaris, but also with Linux, AIX and IRIX components, another shop of about 40 hosts with mostly BSDI components, and a small shop of about five servers running HP-UX, Solaris, and Linux.

We have tried to make the pulse-monitor scripts as general as possible so that they can run easily on any architecture and flavor of Unix. For example, the pulse monitor that determines free swap space uses the sar program under IRIX, free under Linux, swapinfo under HP-UX, vmstat under Solaris, and a custom C program under OSF1. The pulse monitor for checking how much memory is being occupied by each process calls ps piped through cut; each operating system requires different parameters to these two programs. Each script therefore checks the operating system identity before collecting information. Scripts that need local-configuration information, such as those that see if hosts, printers and networks are up, invoke the site-specific enumerate script to get lists of hosts or other information.
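The per-system dispatch described above might look roughly like the following Python sketch. Only the program names come from the text; the function name, the table, and the lookup keys are assumptions, and no command-line parameters are shown because the paper does not give them. The custom C program used under OSF1 is omitted because its name is not stated.

```python
import platform

# Program used to find free swap space, keyed by the kernel name that
# platform.system() (like `uname -s`) reports. Solaris reports itself as SunOS.
SWAP_TOOLS = {
    "IRIX": "sar",
    "Linux": "free",
    "HP-UX": "swapinfo",
    "SunOS": "vmstat",
}


def swap_tool(os_name=None):
    """Return the swap-reporting program for the given or the current system."""
    name = os_name if os_name is not None else platform.system()
    # Assumption: 64-bit IRIX systems may report IRIX64; treat them as IRIX.
    if name.startswith("IRIX"):
        name = "IRIX"
    if name not in SWAP_TOOLS:
        raise LookupError(f"no swap tool known for {name}")
    return SWAP_TOOLS[name]
```

Keeping the mapping in one table means that porting a monitor to another system is a one-line change plus whatever parameters that system's program needs.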

Pulsar has often allowed us to prevent problems before users notice them. For example, insufficient swap space has occasionally been caused by failures in the lpNet program, which spawns multiple copies of itself. Very low free disk space has resulted from saving too much netscape caching onto local disks. Continually high CPU loads have been caused by X Window System programs whose server has disconnected. Security alarms from the MD5 program have been symptoms of improperly configured software and overlooked software patches. We see them most often after upgrading software. About half the alarms are due to mistakes of configuration, which we could perhaps have avoided by using cfengine2 during upgrade; the other half are mistakes in updating the MD5 configuration file itself. Security alarms from the xhost pulse monitor are almost always caused by insufficiently educated users. These alarms have diminished over time. In each of these cases, the problem has been caught promptly, diagnosed quickly and treated easily.

Because we establish alarm values within the pulse monitors, all hosts that use the same monitor compute alarm values identically. But what might be a reasonable amount of free disk space on one host might be dangerously low on another. Therefore, we tend to set the warning levels fairly conservatively. As monitors run, the administrator establishes tolerance for alarms that are not worthy of attention. These tolerance values are saved in a configuration file in the administrator’s home directory and automatically configure the presenter each time it is run. Nuisance alarms therefore tend to disappear quickly from the view of the administrator.
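One way to picture the tolerance mechanism is the Python sketch below. It rests on assumptions the text does not confirm: alarm levels are small integers where higher is worse, a tolerance is the highest level the administrator has chosen to ignore for one host and resource, and the saved file is JSON. The function names and the key format are hypothetical.

```python
import json


def load_tolerances(text):
    """Parse saved tolerances, a JSON object mapping "host/resource" to a level."""
    return json.loads(text) if text else {}


def visible_alarms(reports, tolerances):
    """Keep only reports whose level exceeds the tolerance set for that alarm."""
    shown = []
    for host, resource, level in reports:
        # An alarm with no saved tolerance is shown whenever its level is above 0.
        if level > tolerances.get(f"{host}/{resource}", 0):
            shown.append((host, resource, level))
    return shown
```

With this arrangement, the monitors can stay conservative for every host while the administrator silences a known nuisance on a single host.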

The tool is not perfect. We usually start schedulers manually, which is particularly annoying if a host dies and then comes back up. However, a scheduler can be invoked during Unix startup, and individual pulse monitors can be invoked from cron.

There is no built-in way to notice that a particular pulse has not been measured for a while, which can mean that the scheduler has failed. We often run the notReporting script to check the presenter’s log and list hosts that have not reported anything for the last day or so.
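A check in the spirit of notReporting could be sketched as follows in Python. The log format is an assumption (the paper does not describe it): each line starts with a timestamp in seconds followed by the host name. The function name and the one-day default are likewise illustrative.

```python
def silent_hosts(log_lines, now, max_age=24 * 3600):
    """Return hosts whose most recent log entry is older than max_age seconds."""
    last_seen = {}
    for line in log_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or truncated lines
        stamp, host = float(fields[0]), fields[1]
        # Remember only the newest timestamp seen for each host.
        last_seen[host] = max(stamp, last_seen.get(host, stamp))
    return sorted(h for h, t in last_seen.items() if now - t > max_age)
```

A host that has never reported does not appear in the log at all, so a fuller version would compare against a list of expected hosts, such as one supplied by a site-specific script.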

Several enhancements to Pulsar are worth considering:

(a) Pulsar uses a two-level hierarchy of alarm names, where the top level is generally a host name and the second level is a particular resource class. In practice, this two-level arrangement appears adequate. Still, an arbitrary hierarchy might be appealing if the number of pulse monitors or hosts grows.

(b) Pulsar stores the three most recent values of each alarm. The depth of history could be customizable for each alarm.

(c) The presenter could have a graphing option to plot the history of each alarm’s value. Our experience is that enough historical information is usually contained in the last few events, which the presenter displays in the bottom region.

(d) The administrator might want to know if certain alarms have encountered a red period overnight while the presenter’s display was not being viewed. The color of the presenter’s icons is determined solely by the current alarm value and any explicit ignoring action instituted by the administrator. A color could be reserved for ‘has been worse’, resettable to normal color by the administrator. This facility is called ‘decay to blue’ in the SunNet Manager.

(e) The concept of ‘has been worse’ can be generalized to arbitrary aggregate functions, such as max, min, mean, median, and exponentially smoothed averages. Such functions could be implemented in the presenter and thereby apply automatically to every type of alarm. This feature is present in the System Diagnostic Console.18

(f) The security of the presenter could be improved, particularly its susceptibility to unauthorized secondary presenters. A cryptographic solution that allows the presenter to authenticate prospective secondary presenters should not be hard to implement.

(g) When a secondary presenter loses connection to the primary presenter, it could periodically attempt to re-establish connection. As it stands, the secondary presenter displays a warning but no longer receives new information.
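Enhancements (b), (d) and (e) fit together naturally, as the following Python sketch suggests. It is not part of Pulsar; the class name, method names, and smoothing factor are hypothetical, and it assumes that alarm values are numbers where a higher value is worse.

```python
from collections import deque
from statistics import mean, median


class AlarmHistory:
    """Bounded history of one alarm, with aggregates over the retained values."""

    def __init__(self, depth=3, smoothing=0.5):
        self.values = deque(maxlen=depth)  # customizable depth, as in (b)
        self.smoothing = smoothing
        self.smoothed = None  # exponentially smoothed average
        self.worst = None     # worst value since the last reset

    def record(self, value):
        self.values.append(value)
        self.worst = value if self.worst is None else max(self.worst, value)
        if self.smoothed is None:
            self.smoothed = value
        else:
            a = self.smoothing
            self.smoothed = a * value + (1 - a) * self.smoothed

    def has_been_worse(self):
        """True if some value since the last reset was worse than the current one."""
        return bool(self.values) and self.worst > self.values[-1]

    def reset(self):
        """Administrator acknowledgement: forget the earlier worse period."""
        self.worst = self.values[-1] if self.values else None

    def summary(self):
        """Aggregates over the retained values only; needs at least one value."""
        vals = list(self.values)
        return {"max": max(vals), "min": min(vals),
                "mean": mean(vals), "median": median(vals)}
```

Because the worst value is tracked separately from the bounded history, an overnight red period still shows up after the values that caused it have been pushed out.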

Our initial experience with Pulsar is quite positive. It is a small tool, but a valuable one. Perhaps it is wise not to load it with too many features.

ACKNOWLEDGEMENTS

The anonymous referees were very helpful in improving the content and presentation of this paper. Jeff Carr, K. Lakshman and Paul Linton have used versions of Pulsar and provided helpful feedback.

REFERENCES

1. R. Ramsey, All About Administering NIS+, Prentice-Hall, Englewood Cliffs, NJ, 1994.
2. M. Burgess, ‘A site configuration engine’, Computing Systems, 8(3), 309–338 (Summer 1995).
3. B. Sturgill and R. Finkel, ‘System administration tools — the SAT package’, Technical Report 147-89, University of Kentucky, Department of Computer Science, 1992.
4. J. K. Ousterhout, Tcl and the Tk Toolkit, Addison-Wesley, Reading, MA, 1994.
5. R. Scheifler and J. Gettys, ‘The X Window System’, ACM Transactions on Graphics, 5(2), 79–109 (April 1986).
6. L. Wall and R. Schwartz, Programming Perl, O’Reilly and Associates, 1990.
7. D. Libes, Exploring Expect: A Tcl-based Toolkit for Automating Interactive Programs, O’Reilly and Associates, December 1994.
8. R. Rivest, ‘The MD5 message-digest algorithm’, Network Working Group, RFC 1321 (April 1992).
9. G. H. Kim and E. H. Spafford, ‘The design and implementation of Tripwire: A file system integrity checker’, Technical Report TR-CSD-93/71, Purdue University Department of Computer Science, 1994.
10. G. McDaniel, ‘Metric: A kernel instrumentation system for distributed environments’, Proc. of the 6th SOSP; Operating Systems Review, 11(5), 93–99 (November 1977).
11. J. Case, M. Fedor, M. Schoffstall and C. Davin, ‘A simple network management protocol (SNMP)’, Network Working Group, RFC 1098 (April 1989).
12. K. McCloghrie and M. Rose, ‘Management information base for network management of TCP/IP-based internets: MIB-II’, Network Working Group, RFC 1213 (March 1991).
13. SunSoft, SunNet Manager 2.2.2 Reference Manual, SunSoft, August 1994.
14. M. Hanshaugen. ‘Palantír, May 1996’. (http://www.palantir.uio.no/)
15. N. Sammons, ‘Multi-platform interrogation and reporting with RSCAN’, Proc. of the 9th USENIX/SAGE Conference on System Administration, Monterey, CA, September 1995, pp. 75–87.
16. T. Miller, C. Stirlen and E. Nemeth, ‘satool — a system administrator’s cockpit, an implementation’, Proc. of the 7th USENIX/SAGE Conference on System Administration, November 1993, pp. 119–129.
17. J. Schonwalder and H. Langendorfer, ‘Tcl extensions for network management applications’, Proc. of the 3rd Tcl/Tk Workshop, Toronto, Canada, July 1995, pp. 279–288.
18. E. Anderson, A. Goto, and D. Patterson. ‘A system diagnostic console for networks of computers’ (September 1996). (http://now.cs.berkeley.edu/Sysadmin/sys-diag-console/abstract.ps)