Czech Technical University in Prague Faculty of Electrical Engineering Bachelor's Project Sun servers open--source software systems management Ondřej Jakubčík Supervisor: Ing. Josef Hajas Study Program: Electrical Engineering and Information Technology Computer Engineering May 27, 2010



  • Acknowledgement

    I would like to thank my family, my friends and my colleagues for their insight, support and wisdom. I am truly grateful for being surrounded by such brilliant people.

  • Declaration

    I hereby declare that I have completed this project independently and that I have listed all the literature and publications used.

    I have no objection to the use of this work in compliance with § 60 of Act No. 121/2000 Sb. (the Copyright Act), and with the rights connected with the Copyright Act, including the changes in the act.

    In . . . . . . . . . . . . . . . . . . . . . . . on . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

  • Abstrakt

    The aim of this bachelor's thesis is to analyze the available software products for systems management, both commercial and open-source, to further analyze the possibilities of integration with servers made by Oracle (Sun), and to implement an integration solution for a selected tool.

    The analysis also includes a theoretical part focused on the usefulness of systems management, the methods used to acquire data, and the protocols used for monitoring and managing servers.

    Abstract

    The objective of this bachelor's project is to analyze available systems management products, both commercial and open-source. It analyzes the possibilities of integration with servers made by Oracle (Sun), and its result is an integration into a selected software product.

    The analysis also includes a theoretical part focused on the benefits of systems management, the available methods of data acquisition, and the protocols used for monitoring and managing servers.

  • Contents

    1 Introduction

    2 Systems management software
      2.1 Commercial offerings
      2.2 Open-source offerings

    3 Protocols for system management
      3.1 Simple Network Management Protocol
        3.1.1 Monitoring over SNMP
        3.1.2 Important terms related to SNMP
        3.1.3 Management Information Base
          3.1.3.1 ASN.1
      3.2 Intelligent Platform Management Interface
      3.3 Web-Based Enterprise Management
      3.4 Other protocols
        3.4.1 Remote shell access
        3.4.2 Other protocols

    4 Approaches to system management
      4.1 Way of communication
        4.1.1 In-band communication
        4.1.2 Out-of-band communication
        4.1.3 Side-band communication
      4.2 By means of data gathering
        4.2.1 Active monitoring
        4.2.2 Passive monitoring
        4.2.3 Combination of active and passive monitoring
      4.3 Final comparison

    5 Sensors and components

    6 Management interfaces of Oracle Sun servers
      6.1 System controllers
      6.2 Command-line interface
      6.3 SNMP
        6.3.1 Oracle Sun MIBs
          6.3.1.1 Origin and purpose of these MIBs
          6.3.1.2 Notifications
          6.3.1.3 Polled data
      6.4 IPMI
      6.5 Other interfaces

    7 Zenoss integration
      7.1 Choosing an approach
      7.2 Development environment
      7.3 Important design decisions
        7.3.1 Event classes
        7.3.2 Per-trap mapping vs. default mapping
      7.4 Development steps
        7.4.1 Compiling MIBs
        7.4.2 Creating Event classes
        7.4.3 Creating Event mappings
        7.4.4 Adding products
        7.4.5 Final modifications
      7.5 Testing
      7.6 Future extension

    8 Conclusion

    A CD Contents

  • 1 Introduction

    Systems management has become a very important topic in almost every organisation that depends on IT services. It encompasses the entire life cycle of IT infrastructure, including, for example, tracking and documenting requirements, purchasing and renewing equipment, license management, and fault and risk monitoring. While systems management has always been present in some way in the IT departments of mid-size to big enterprises, the approach to it was often defined in a company-specific way, with no standardization.

    However, many companies now span a number of countries or even continents. For all but the biggest companies, it would be very inefficient to invest in the development of a complete in-house solution for systems management; these companies rely on third-party solutions that offer a cheaper, well-tested and supported alternative.

    Decentralization of IT resources is a very important factor in the need for systems management. It has become quite common to have more than one datacenter, often in remote locations, possibly quite far apart from each other, so that in case of an accident at or near one of them, the operations of a company can continue relatively uninterrupted (by accident we mean here either a natural phenomenon, like flooding, storm or fire, or an act of ill will, such as a terrorist attack). Because IT support may not always be present on site, an advance warning of possible component failures is very important. Some, albeit not all, system management software suites can even tie individual systems, groups of systems or even components to a service, so when a failure is imminent, one can see which services are in jeopardy.

    Businesses of today rely on IT more than ever before. Even a minute-long outage can cost thousands of dollars. Therefore, some companies (notably telecommunication companies, banks, etc.) build systems with a certain level of redundancy, so that when one system fails, another system takes over in a reasonable amount of time and the interruption is barely noticeable. System management is necessary in this case, as it provides information about the nature of the failure and helps with selecting and migrating to a different system.

    Computing power (in the sense of CPU processing speeds, RAM and storage sizes, etc.) keeps growing and its price is falling. However, the workload is so variable that it may not load a processing node enough, so that the cost of the node's power consumption is actually higher than the value of the work it produces.

    This led to a rebirth of one IT industry: virtualization. Virtualization has been possible at various levels since 1967, at that time on the IBM CP-40. However, the main reason back then was to enable various software to run unmodified or simultaneously (computers were batch oriented and most software was not designed for any level of multitasking). Now, the reasons for virtualization are consolidation, power consumption reduction and control of expenses.

    Availability of relatively cheap but powerful commodity hardware has led to a new architecture of IT: instead of renting a dedicated machine (although this is still possible), one can rent virtual machines, running on a possibly very different set of hardware. With properly set up infrastructure (Fibre Channel or iSCSI disk arrays, virtualization software supporting live migration, etc.), it is possible to achieve very high availability and reliability.

    However, cheaper systems are built from cheaper components that are more prone to failure, so the need for proper monitoring is high. With proper software, migration of virtual machines in case of a hardware malfunction can be automated.

    Power consumption monitoring is a very important part of systems management. With power becoming more expensive, careful monitoring of power consumption in relation to the tasks performed is required to manage the costs of one's IT operations or to properly bill the customers (the latter applies specifically to cloud computing customers).

    This bachelor's project focuses on one area of systems management: system health monitoring. With the above in mind, we aim for a clear design that will allow the features described above to be implemented or connected with existing features already in place.

    The objective is to design and implement a Zenoss extension (also known as a ZenPack) that will allow the user to discover, monitor and report the system health status of some Oracle Sun servers. Zenoss was chosen because it is a very advanced integration platform, with advanced features such as graphing, so future extensions like recording and analyzing power consumption trends can be implemented. The selection was done in unpublished work by the author, available separately [1].

  • 2 Systems management software

    In this chapter, an incomplete list of both commercial and non-commercial software used for systems management is presented. Where possible, the manageability features of Oracle servers under these particular software solutions are also described.

    While there are many software solutions available from various vendors, only a few are listed in this section, just to give a brief overview of the features on offer. The objective is to make the resulting integration with open-source software comparable to already existing integrations.

    2.1 Commercial offerings

    The following commercial products have been used by the author to manage Oracle Sun servers:

      • CA Unicenter NSM
      • HP Operations Manager
      • IBM Director
      • IBM Tivoli Enterprise Console
      • IBM Tivoli NetCool OMNIbus

    All of these products can do passive monitoring: they listen for events received via SNMP traps, system logs or some other mechanism (such as direct database entry, command line tool execution, etc.).

    The Tivoli Enterprise Console, also known as TEC, is one of the oldest systems management packages. It relies on the Tivoli Management Framework, which also provides a way to install other extensions and patches. TEC itself has a rather simple GUI written in Java, but the backend consists of many helper programs, usually written in C. TEC is used for passive monitoring only: it waits for events, which are then processed by an internal engine (some of its parts are based on the Prolog language). This software package, however, requires a preinstalled database system.

    Figure 2.1 IBM Tivoli Enterprise Console with graphed amount of incoming events

    NetCool OMNIbus is similar to TEC, but it has a more modern GUI. Being a product obtained through an acquisition, it is written not in Java but in a compiled language. It uses a completely different language for writing custom extensions and, as one of the few, it comes with its own bundled database.

    Operations Manager, Director and Unicenter NSM are products of different companies, but they have one feature in common: they support active polling. Other than that, they offer similar features, and all can receive and process notifications from Oracle Sun servers.

    The following features are present in all integrations with these products:

      • Translating SNMP traps and notifications into user-readable form.
      • Removing duplicates of events.
      • Having events with lower severity automatically close events with higher severity.
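
    The last two features can be illustrated with a minimal Python sketch. It is not taken from any of the products above: the `Event` and `EventConsole` names, the numeric severity scale (0 meaning "clear", higher meaning worse) and the choice to match events by device and component are all assumptions made for illustration.

```python
from dataclasses import dataclass

CLEAR = 0  # assumed severity scale for this sketch: 0 = clear, higher = worse


@dataclass
class Event:
    device: str
    component: str
    severity: int
    count: int = 1  # number of identical events folded into this one


class EventConsole:
    """Hypothetical event list applying de-duplication and auto-closing."""

    def __init__(self):
        self.open_events = []

    def receive(self, device, component, severity):
        source = (device, component)
        # An event with a lower severity closes open events with a higher
        # severity that came from the same source.
        self.open_events = [
            e for e in self.open_events
            if (e.device, e.component) != source or e.severity <= severity
        ]
        if severity == CLEAR:
            return None  # a clear event only closes others; it is not kept
        # A repeated event only increments the counter of the open one.
        for e in self.open_events:
            if (e.device, e.component) == source and e.severity == severity:
                e.count += 1
                return e
        event = Event(device, component, severity)
        self.open_events.append(event)
        return event
```

    Receiving the same fan failure twice leaves a single open event with `count == 2`; a later event of lower severity for the same fan closes it.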

    Figure 2.2 TEC showing new events

    Integrations that support polling can usually at least display the state of system LEDs; some (CA Unicenter NSM) can display a hierarchy of sensors.

    2.2 Open-source offerings

    In the open-source market, there are currently the following major products:

      • Nagios
      • OpenNMS
      • Zabbix
      • Zenoss

    Figure 2.3 CA Unicenter NSM showing hierarchy of sensors

    Nagios is the oldest and most mature open-source product. It is very scalable and well documented, but its web GUI lacks some modern features, which of course means it is very fast, albeit sometimes not very user friendly.

    It is written mainly in C, which is another reason for its high speed. Monitoring data is obtained by running checks, either built-in or user-supplied scripts called plugins, whose exit code and (optionally) output are processed and evaluated by Nagios.
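
    As a rough illustration of the plugin mechanism, the following Python sketch maps a temperature reading to the exit codes of the Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and a one-line status message. The check itself, its thresholds and the function names are hypothetical and serve only as an example.

```python
# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_temperature(value, warn=60.0, crit=75.0):
    """Map a reading to (exit_code, status_line).

    The thresholds are made-up defaults for this sketch.
    """
    if value is None:
        return UNKNOWN, "TEMP UNKNOWN - no reading available"
    if value >= crit:
        return CRITICAL, f"TEMP CRITICAL - {value:.1f} C"
    if value >= warn:
        return WARNING, f"TEMP WARNING - {value:.1f} C"
    return OK, f"TEMP OK - {value:.1f} C"


def main(argv):
    """Read the value from the command line, print the status line and
    return the exit code that Nagios would evaluate."""
    try:
        reading = float(argv[1])
    except (IndexError, ValueError):
        reading = None
    code, message = check_temperature(reading)
    print(message)
    return code

# A real plugin would finish with: sys.exit(main(sys.argv))
```

    Nagios only looks at the exit code and the printed text, so the same convention works for plugins written in any language.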

    Checks can be run either locally or remotely using a tool called NRPE (Nagios Remote Plugin Executor). In addition to having Nagios run a check actively (see subsection 4.2.1), one can also feed data into Nagios asynchronously (see subsection 4.2.2). For more information, please see www.nagios.org or [2].

    OpenNMS is another network monitoring and management software package. While Nagios achieves portability across different platforms by using C as its programming language, OpenNMS is written in Java, which also makes it very portable. It requires a backing database. It provides a more modern GUI to the user; otherwise its features are mostly comparable to the others.

    Figure 2.4 Nagios showing status of services (image from www.nagios.org)

    From [3]:

    Zabbix is an enterprise-class open source distributed monitoring solution. Zabbix is software that monitors numerous parameters of a network and the health and integrity of servers. Zabbix uses a flexible notification mechanism that allows users to configure e-mail based alerts for virtually any event. This allows a fast reaction to server problems. Zabbix offers excellent reporting and data visualisation features based on the stored data. This makes Zabbix ideal for capacity planning.

    Figure 2.5 OpenNMS event list

    Zabbix supports both polling and trapping. All Zabbix reports and statistics, as well as configuration parameters, are accessed through a web-based front end. A web-based front end ensures that the status of your network and the health of your servers can be assessed from any location. Properly configured, Zabbix can play an important role in monitoring IT infrastructure. This is equally true for small organisations with a few servers and for large companies with a multitude of servers.

    Zabbix is written in C and PHP and requires a backing database.

    Finally, we look at Zenoss, which is our integration platform. The official documentation [4] says:

    Zenoss is today's premier open source IT management solution. Through integrated monitoring, it enables you to manage the status and health of your infrastructure through a single, Web-based console.

    The power of Zenoss starts with its in-depth Inventory and Configuration Management Database (CMDB). Zenoss creates this database by discovering managed resources (servers, networks, and other devices) in your IT environment. The resulting environment model provides a complete inventory of your key systems, down to the level of resource components (interfaces, services, and processes, and installed software.)

    With the model built, you can use Zenoss' integrated availability and performance monitoring features to monitor and report on all aspects of your IT infrastructure. Zenoss also provides events and fault management features that tie into the CMDB. These features help drive operational efficiency and productivity by automating many of the notification, alerts, escalation, and remediation tasks you perform each day.

    Zenoss is written in Python and is based on the Zope application platform. Like most of the previously mentioned software products, it requires a database, specifically MySQL.

    Figure 2.6 Zenoss with list of manufacturers

  • 3 Protocols for system management

    Systems management can be thought of as a network application. As such, it needs one or more protocols that allow the user to gather data (for a description of data gathering methods, please see chapter 4). These protocols differ in their complexity, reliability and verbosity.

    Some devices may also implement two or more protocols simultaneously, but the amount of data exposed through each may not be the same, even for the same device. Also, the level of support for these protocols varies considerably (e.g. very few software packages support IPMI out of the box). In this chapter we describe some of the protocols most commonly used for systems management.

    3.1 Simple Network Management Protocol

    Taken from Wikipedia [5]:

    Simple Network Management Protocol (SNMP) is a UDP-based network protocol. It is used mostly in network management systems to monitor network-attached devices for conditions that warrant administrative attention. SNMP is a component of the Internet Protocol Suite as defined by the Internet Engineering Task Force (IETF). It consists of a set of standards for network management, including an application layer protocol, a database schema, and a set of data objects.

    SNMP exposes management data in the form of variables on the managed systems, which describe the system configuration. These variables can then be queried (and sometimes set) by managing applications.

    Although in the early days of the internet the term network device mostly meant a computer, the specification is very much device independent, so devices such as

      • servers
      • routers
      • racks
      • switches
      • wireless access points
      • uninterruptible power supplies

    can be monitored. Since SNMP can be implemented even on very small devices, it is also available for equipment such as air conditioning controllers.

    Currently, SNMP exists in three versions (the years of standardization by the Internet Engineering Task Force are given in parentheses):

      • SNMP v1 (1988-1990) [6-8]
      • SNMP v2c (1993)
      • SNMP v3 (2002)

    Even though the latest version of SNMP brings very important new features, such as authentication and encryption, it is still not supported by some network management software suites.

    3.1.1 Monitoring over SNMP

    A network infrastructure that implements monitoring contains two important software components: the agent and the network management software, also known as NMS.

    The agent implements the SNMP protocol and uses it to expose data. The structure of the data is defined using a Management Information Base (see below). Vendors usually choose to define their MIBs very broadly, so an agent implementing a particular MIB may not make use of all its structures.

    The network management software also uses the MIB to gather and translate the data it gets from the agent, and performs further processing, such as statistics, error notification and automated error processing.

    SNMP supports both active and passive monitoring. In active monitoring, the NMS uses SNMP requests (gets or sets) to get data or set configuration parameters on the managed device directly. When monitoring passively, the NMS only listens for SNMP data coming from the managed device (SNMP uses two terms, trap and notification; they are often used interchangeably, although the first term refers to SNMP v1 and the latter to SNMP v2c and v3). Version 2c also specifies an inform packet, which differs from a trap or notification in that it makes the NMS send a confirmation when the packet is received. However, this mechanism is rarely used.

    SNMP is a datagram protocol, so data may be lost en route. This is especially important for passive monitoring: network elements such as routers can cause UDP packets to be lost, and in the case of a fatal error (by which we mean an error that powers off the monitored device) the notification may not be received at all, so that the error is only discovered through some other malfunction (typically a network segment being down, or a service such as a database or web server being inaccessible).

    3.1.2 Important terms related to SNMP

    When working with SNMP-based technologies, one often comes across the following terms:

      • OID
      • varbind
      • table
      • scalar
      • index

    OID is an abbreviation for object identifier. It is represented as a dotted n-tuple of integers (MIBs actually describe the textual representation of these OIDs).

    Varbind stands for variable binding. It consists of an OID and its value, which can itself be an OID, a number, a string, or any other data structure expressible in ASN.1.

    A scalar value is defined in a MIB and is always referenced using a single OID.

    A table is defined in a MIB too, but to access the rows in its columns, one must append an index after the OID of the column. A table is simply a set of columns.
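
    The terms above can be made concrete with a small Python sketch that models OIDs as tuples of integers and varbinds as entries of a dictionary. The two OIDs are the standard MIB-II objects sysDescr (a scalar, whose single instance is addressed by appending .0) and ifDescr (a column of ifTable); the values in `agent_data` and the helper names are invented for illustration.

```python
def parse_oid(text):
    """Turn a dotted OID string into a tuple of integers."""
    return tuple(int(part) for part in text.strip(".").split("."))


def format_oid(oid):
    """Turn a tuple of integers back into dotted notation."""
    return ".".join(str(number) for number in oid)


SYS_DESCR = parse_oid("1.3.6.1.2.1.1.1")      # scalar object (MIB-II sysDescr)
IF_DESCR = parse_oid("1.3.6.1.2.1.2.2.1.2")   # table column (MIB-II ifDescr)

# Hypothetical agent content: each entry is one varbind (OID -> value).
agent_data = {
    SYS_DESCR + (0,): "example server",
    IF_DESCR + (1,): "lo0",
    IF_DESCR + (2,): "eth0",
}


def get_scalar(data, oid):
    """A scalar has exactly one instance, addressed as <oid>.0."""
    return data[oid + (0,)]


def get_column(data, column_oid):
    """Return {index: value} for every row of one table column."""
    n = len(column_oid)
    return {
        oid[n:]: value
        for oid, value in sorted(data.items())
        if len(oid) > n and oid[:n] == column_oid
    }
```

    Here `get_column(agent_data, IF_DESCR)` yields one entry per row, keyed by the index that was appended to the column OID.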

    3.1.3 Management Information Base

    As mentioned in subsection 3.1.1, there is a special format that describes the data sent over SNMP. The format of a MIB is derived from ASN.1 (see subsection 3.1.3.1). Formally, it is defined in [9], which states:

    Management information is viewed as a collection of managed objects, residing in a virtual information store, termed the Management Information Base (MIB). Collections of related objects are defined in MIB modules. These modules are written using an adapted subset of OSI's Abstract Syntax Notation One, ASN.1 [10]. It is the purpose of this document, the Structure of Management Information (SMI), to define that adapted subset, and to assign a set of associated administrative values.

    3.1.3.1 ASN.1

    Abstract Syntax Notation One is one of many approaches to describing data structures. What makes it stand out is that it not only allows the structure to be specified, but also describes its encoding and decoding in various formats (ranging from binary formats to XML).

    ASN.1 is an international standard adopted by the International Telecommunication Union (ITU) and by ISO/IEC. It has been standardized as [10-13]. Due to their versatility, ASN.1 and its hierarchical data model are used in other application protocols as well, including internet telephony (H.323) and directory services (LDAP).
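
    As an example of one of the binary formats, the sketch below encodes an OID using the Basic Encoding Rules (BER), the ASN.1 encoding that SNMP uses on the wire. The first two arcs are packed into one value (40 times the first plus the second), and every value is written in base 128 with the high bit set on all bytes but the last. The function names are illustrative, and only short-form lengths (up to 127 content bytes) are handled.

```python
def encode_base128(value):
    """Encode a non-negative integer in base 128, most significant group
    first, with the high bit set on every byte except the last."""
    chunks = [value & 0x7F]
    value >>= 7
    while value:
        chunks.append((value & 0x7F) | 0x80)
        value >>= 7
    return bytes(reversed(chunks))


def encode_oid(oid_text):
    """Return the BER encoding (tag, length, content) of a dotted OID."""
    arcs = [int(part) for part in oid_text.split(".")]
    if len(arcs) < 2:
        raise ValueError("an OID needs at least two arcs")
    # The first two arcs share a single encoded value.
    content = encode_base128(40 * arcs[0] + arcs[1])
    for arc in arcs[2:]:
        content += encode_base128(arc)
    if len(content) > 127:
        raise ValueError("long-form lengths are not handled in this sketch")
    return bytes([0x06, len(content)]) + content  # 0x06 = OBJECT IDENTIFIER


print(encode_oid("1.3.6.1.2.1.1.1.0").hex())  # 06082b06010201010100
```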

    3.2 Intelligent Platform Management Interface

    Rather than being a single protocol specification, IPMI specifies a full set of physical interfaces to a system controller, a communication protocol and a data representation. It is specified in [14], a standard designed by a consortium of computer manufacturers led by Intel. To quote [14]:

    The IPMI specifications define standardized, abstracted interfaces to the platform management subsystem. IPMI includes the definition of interfaces for extending platform management between board within the main chassis, and between multiple chassis.

    The term platform management is used to refer to the monitoring and control functions that are built in to the platform hardware and primarily used for the purpose of monitoring the health of the system hardware. This typically includes monitoring elements such as system temperatures, voltages, fans, power supplies, bus errors, system physical security, etc. It includes automatic and manually driven recovery capabilities such as local or remote system resets and power on/off operations. It includes the logging of abnormal or out-of-range conditions for later examination and alerting where the platform issues the alert without aid of run-time software. Lastly it includes inventory information that can help identify a failed hardware unit.

    3.3 Web-Based Enterprise Management

    A modified excerpt from [15]:

    WBEM is a set of management and Internet standard technologies developed to unify the management of distributed computing environments, facilitating the exchange of data across otherwise disparate technologies and platforms.

    It consists of a core set of standards developed by DMTF (Distributed Management Task Force), which includes the Common Information Model (CIM), CIM-XML, CIM Query Language, WBEM Discovery using Service Location Protocol (SLP) and WBEM Universal Resource Identifier (URI) mapping. In addition, the DMTF has developed a WBEM Management Profile template, allowing for simplified profile development to deliver a complete, standalone definition for the management of a particular system, subsystem, service or other entity.

    WBEM is extensible, facilitating the development of platform-neutral, reusable infrastructure, tools and applications. In addition to its use by vendors, end users and the open source community, WBEM is enabling other industry organizations to build on its foundation in areas including Web services, security, storage, grid and utility computing.

    The openness of the WBEM specifications led to the development of several implementations, notably OpenPegasus [16] and WMI (Windows Management Instrumentation). WMI does not rely on Web Services, but rather on COM objects and RPC calls.

    WBEM is now part of many operating systems: apart from Windows' WMI, it is present in most enterprise Linux distributions and in commercial Unix systems such as Oracle Solaris and HP-UX.

    3.4 Other protocols

    3.4.1 Remote shell access

    System management has traditionally used a particularly simple approach: a serial line or its alternatives, telnet or secure shell access to the system controller or to the system itself.

    The system controller on most server platforms offers a broad range of system management possibilities. Besides power control and console control, it also gives the system administrator the ability to display the status of sensors and to list system events.

    # ssh root@myhost
    Password:
    Waiting for daemons to initialize...

    Daemons ready

    Sun(TM) Integrated Lights Out Manager

    Version 3.0.6.1.d r48331

    Copyright 2009 Sun Microsystems, Inc. All rights reserved.
    Use is subject to license terms.

    -> show /SYS product_name

      /SYS
        Properties:
            product_name = SPARC-Enterprise-T5220

    Figure 3.1 Output from service console

    Although the output is optimized for human reading and not for programmatic analysis, there are well-established tools that can parse this output (such as expect [17]) and feed the resulting data to system management software.
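
    The idea of parsing such output can be sketched in a few lines of Python. This is not how expect works; it only shows the extraction step, applied to the end of the session from Figure 3.1. The regular expression and the function name are assumptions made for this example, and the exact whitespace of real controller output may differ.

```python
import re

# Tail of the session from Figure 3.1, as captured text.
SAMPLE_OUTPUT = """\
-> show /SYS product_name

  /SYS
    Properties:
        product_name = SPARC-Enterprise-T5220
"""

# Matches lines of the form "name = value", ignoring surrounding whitespace.
PROPERTY_LINE = re.compile(r"^\s*(\w+)\s*=\s*(.*?)\s*$")


def parse_properties(text):
    """Collect all 'name = value' pairs found in the console output."""
    properties = {}
    for line in text.splitlines():
        match = PROPERTY_LINE.match(line)
        if match:
            properties[match.group(1)] = match.group(2)
    return properties


print(parse_properties(SAMPLE_OUTPUT))
# {'product_name': 'SPARC-Enterprise-T5220'}
```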

    This technique applies not only to system controllers, but also to BIOSes and even operating system command line utilities. There are a few Zenoss extensions (ZenPacks) that parse text output to deliver information on processes, CPU load, storage status and more.

    # cat /proc/partitions
    major minor  #blocks  name

       8     0  312571224 sda
       8     1  309917916 sda1
       8     2          1 sda2
       8     5    2650693 sda5

    Figure 3.2 Output of cat command
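
    Output like that in Figure 3.2 is regular enough to be parsed by splitting each line on whitespace. The following Python sketch shows one possible way; it is not taken from any ZenPack, and the function name and the shape of the returned dictionary are choices made for this example.

```python
SAMPLE = """\
major minor  #blocks  name

   8     0  312571224 sda
   8     1  309917916 sda1
   8     2          1 sda2
   8     5    2650693 sda5
"""


def parse_partitions(text):
    """Return {name: {'major': ..., 'minor': ..., 'blocks': ...}}."""
    partitions = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and empty lines: data rows have four fields
        # and start with a number.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        major, minor, blocks, name = fields
        partitions[name] = {
            "major": int(major),
            "minor": int(minor),
            "blocks": int(blocks),
        }
    return partitions
```

    On a live Linux system the same function could be fed the content of /proc/partitions, obtained locally or over a remote shell.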

    3.4.2 Other protocols

    In addition to the protocols listed above, there are some other protocols used for system management. One of the mature ones is the syslog protocol.

    The Unix system log protocol is specified in [18]. It was designed with networking in mind, so although it is generally used on the local host, the daemon can be set up to filter and forward messages to a network host, where further processing can be done. Traditional syslog usually does not record the originating host name, so either a special daemon or a special configuration of the system logging daemon is needed.

    Being a very old protocol, syslog has almost no security (besides facilities like rejecting hosts that are not on a list), and by generating a flood of messages it is possible to overload the daemon or fill the /var/log filesystem, which may lead to unexpected failures.
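
    Filtering before forwarding can be sketched as follows, assuming messages in the BSD syslog wire format, where each message starts with a priority value in angle brackets computed as facility * 8 + severity (lower severity numbers are more urgent). The function names and the default threshold are hypothetical; a real daemon is configured rather than programmed this way.

```python
import re

PRI_PATTERN = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)


def decode(message):
    """Split a raw syslog message into (facility, severity, rest)."""
    match = PRI_PATTERN.match(message)
    if not match:
        raise ValueError("message does not start with a <PRI> field")
    facility, severity = divmod(int(match.group(1)), 8)
    return facility, severity, match.group(2)


def should_forward(message, max_severity=3):
    """Forward only messages at least as urgent as max_severity
    (3 = error in the syslog severity scale; an assumed default)."""
    _, severity, _ = decode(message)
    return severity <= max_severity
```

    For example, a message starting with `<34>` decodes to facility 4 and severity 2 (critical) and would be forwarded, while one starting with `<14>` (severity 6, informational) would be dropped.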

    Commercial products (especially those that contain or can be used with their own agents on remote hosts) also use various RPC mechanisms. The most common are the following:

      • ONC RPC (Open Network Computing Remote Procedure Call) [19]
      • CORBA (Common Object Request Broker Architecture) [20]
      • SOAP (Simple Object Access Protocol) [21]
      • XML-RPC (XML Remote Procedure Call) [22]

    A description of these protocols is beyond the scope of this project; for further information please consult the references. In the case of proprietary software, details about the usage of these protocols may not be fully known, so using them as a communication protocol with custom software may be very challenging.

  • 4 Approaches to system management

    In this chapter we describe possible approaches to system management and compare them in terms of protocol requirements, generated network traffic and reliability.

    Possibly the simplest approach to system management (more specifically, system health monitoring) is simply to wait until the device stops working, rendering some service or services unusable. While this is possible (indeed, the author has observed such an approach in an educational institution), there is no warning in advance, and therefore such an approach is only feasible in environments where setting up monitoring would be more expensive than repairing failed systems.

    4.1 Way of communication

    To be able to monitor any system, there must be a way to connect to it. In systems management, we usually use one of the following four communication channels:

      • local only
      • in-band communication
      • out-of-band communication
      • side-band communication

    By local communication we usually mean non-network communication with the monitored system. This may involve manually connecting a serial console (e.g. a laptop with a serial line) or a display, keyboard and mouse. Watching status LEDs in person can also be used for a quick system status check. For the purpose of this project, we will not consider this a viable method of system monitoring. All other communication channels are described below.

    4.1.1 In-band communication

    In-band communication is a way of system monitoring and management communication where the monitoring data is sent over the same network channel as production data (e.g. web traffic).

    This implies that the operating system on the monitored device has to support management traffic handling (usually, this is accomplished by running a so-called agent). It also means that management traffic occupies (at least partially) useful bandwidth and that the agent will use some CPU cycles.

    On the other hand, this type of communication poses no additional requirements on the existing network infrastructure: no additional cabling is required and no changes to network switches and routers need to be made. Especially when dealing with many servers, savings on network infrastructure may be significant.

    One significant drawback of this approach is that without the operating system running, management may not be possible (although servers with Wake-on-LAN capability can at least be turned on remotely).

    4.1.2 Out-of-band communication

    Out-of-band communication is complementary to in-band communication. It uses its own network port or, in some setups, a serial line connected to a network terminal server.

    Monitoring capabilities therefore do not depend on a running operating system, nor does the monitoring traffic affect production network bandwidth or CPU load. Depending on the system controller (this term is used mainly in connection with SPARC systems; other terms in use are BMC, baseboard management controller, and SP, service processor), additional features may be offered to the system administrator, for example console redirection, storage redirection and management, firmware update, etc. Power control is one of the basic features.

    This type of communication requires additional cabling and switching, so the resulting network infrastructure is denser and also more expensive. The system controller, on the other hand, does not use any special network features, so very low cost commodity switches can be used.

    Security of this dedicated management network is of vital concern to the user. A breach may lead to disruption of management traffic, and it may be possible to overload the system controller. By breaking into the system controller, an adversary could not only take the entire system down (possibly damaging production data), but might also boot a totally different operating system from redirected storage, leading to a data leak or intentional corruption. Of course, booting a different operating system through a direct (i.e. production network) breach is also possible, but this channel is expected to be much more secure (strong passwords, firewall, etc.). A separate network, however, may lead to the temptation to keep default passwords, so it is very important to develop and enforce security guidelines with the same strictness as the guidelines applying to operating system and network security.

    In conclusion, the drawback of this approach is higher network infrastructure cost, but for setups requiring additional features such as storage redirection, this approach is beneficial.

    4.1.3 Side-band communication

    Side-band communication combines the best features of both communication methods described above. It usually involves a system controller that uses the same network port as the production network, but operates in a separate virtual LAN (VLAN).

    Features are usually comparable to those of out-of-band communication, yet there are some savings in network infrastructure. Setting up network components to correctly route traffic based on VLAN information may be more complicated than with the other methods.

    Finally, not all service controllers support this type of communication, so unless a larger number of servers supports it, investing time into setting up side-band monitoring in addition to any of the previously mentioned methods is probably not a worthwhile effort.

    4.2 By means of data gathering

    4.2.1 Active monitoring

    By active monitoring we mean a setup where the monitoring station (i.e. a box running monitoring software) actively queries managed (monitored) devices.

    Certain protocols (like IPMI) support only this type of monitoring; others (like SNMP) support both active and passive monitoring.

    During active monitoring, the following data (although not all of it may be available) is usually gathered and/or updated at regular time intervals:

    - list of hardware components with their statuses
    - list of sensors with current values, thresholds and statuses
    - overall system health status

    Depending on the verbosity of the data obtained and on the time intervals, active monitoring can cause significant network traffic (this may be unfavourable especially when using in-band communication). However, the amount of data transmitted may be regulated by selecting only a subset of data (e.g. checking the system status and reading an extended set of data only when the status changes).
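
    The traffic-saving strategy mentioned above (check a cheap status value first, read the extended data only on change) can be sketched as follows. This is a minimal illustration, not part of the integration: the two callables read_status and read_details are hypothetical stand-ins for whatever protocol queries (SNMP, IPMI) a real monitoring station would issue.

```python
from typing import Callable, Dict, Optional


class StatusFirstPoller:
    """Poll a cheap health summary every cycle; fetch details only on change."""

    def __init__(self,
                 read_status: Callable[[], str],
                 read_details: Callable[[], Dict[str, float]]) -> None:
        self._read_status = read_status      # hypothetical cheap query
        self._read_details = read_details    # hypothetical expensive query
        self._last_status: Optional[str] = None
        self.details: Dict[str, float] = {}

    def poll_once(self) -> bool:
        """Run one polling cycle; return True if the full data set was read."""
        status = self._read_status()
        if status == self._last_status:
            return False                     # nothing changed, skip the big read
        self._last_status = status
        self.details = self._read_details()
        return True
```

    The first cycle always reads the extended data, because no previous status is known yet.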

    An advantage of active monitoring is reliability: even when using unreliable data transfer (the UDP protocol used with SNMP), the monitoring station can usually detect missing data and request it again.

    Another major advantage is the ability to gather statistically relevant data to be stored and processed (like power consumption, network port traffic etc.). Advanced features of monitoring software can include graphing and reporting, which can in turn be used to consolidate computing resources in a power-efficient way.

    This type of monitoring is supported by most network devices, ranging from servers to low-cost switches.

    4.2.2 Passive monitoring

    Passive monitoring is the opposite (and complementary) approach to active monitoring: this time, it is the responsibility of the monitored device to report a status change to the monitoring software. Based on the received information, the monitoring station performs some action, either predefined or defined by the user. Actions can range from operator notification using paging or SMS to automatic failure correction (like starting a migration of virtual machines).

    However, when using unreliable data transport (UDP), a passive notification may not be received at all. Also, especially when using SNMP, the management station does not usually send a reception confirmation. Multiple switches en route can adversely affect datagrams, causing the message to be delayed, received out of order or lost entirely. To prevent this, some management and monitoring software can listen for SNMP notifications in the local network and send them to the master management host using some reliable protocol (in most software this is implemented as RPC: either the original ONC RPC, a web service call or a proprietary protocol).

    A major advantage is that very little network traffic is generated, and the method is also very CPU friendly (neither the agent or system controller nor the monitoring station processes large amounts of data).

    This method may not be supported by all devices.

    4.2.3 Combination of active and passive monitoring

    When both of the above approaches are combined, possibly the most reliable monitoring system can be built. However, not all monitoring packages allow these two approaches to be combined.

    The modus operandi is as follows:

    1. The monitoring station reads all data using the active approach (i.e. the full repository).
    2. Monitored hosts issue notifications based on their status changes.
    3. The monitoring station updates its data either by:
       a. using solely the data from the passive notification
       b. refreshing all data from the appropriate monitored device
    4. Once in a while, the monitoring station refreshes all data (in case a notification was lost).
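
    The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: poll_all is a hypothetical callable that returns a sensor-to-status dictionary for a host, timestamps are passed in explicitly so the logic stays testable, and the trust_notifications flag chooses between variants 3a and 3b.

```python
from typing import Callable, Dict


class CombinedMonitor:
    """Active baseline, passive updates, periodic safety refresh."""

    def __init__(self,
                 poll_all: Callable[[str], Dict[str, str]],
                 refresh_interval: float,
                 trust_notifications: bool = True) -> None:
        self._poll_all = poll_all                  # hypothetical full poll of one host
        self._refresh_interval = refresh_interval  # seconds between safety refreshes
        self._trust = trust_notifications
        self._last_refresh: Dict[str, float] = {}
        self.state: Dict[str, Dict[str, str]] = {}

    def full_refresh(self, host: str, now: float) -> None:
        # Steps 1 and 4: read the full repository from the device.
        self.state[host] = dict(self._poll_all(host))
        self._last_refresh[host] = now

    def on_notification(self, host: str, sensor: str, status: str,
                        now: float) -> None:
        # Steps 2 and 3: a monitored host reported a status change.
        if self._trust and host in self.state:
            self.state[host][sensor] = status      # 3a: use the notification only
        else:
            self.full_refresh(host, now)           # 3b: re-read everything

    def tick(self, now: float) -> None:
        # Step 4: refresh hosts whose data is older than the interval.
        for host, last in list(self._last_refresh.items()):
            if now - last >= self._refresh_interval:
                self.full_refresh(host, now)
```

    Variant 3a keeps traffic low, while variant 3b costs one extra poll per notification but does not depend on the notification carrying complete data.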

    4.3 Final comparison

    To be able to choose correctly between the various approaches to monitoring, it is best to compare these methods in tables:

    Feature                      In-band           Out-of-band   Side-band
    OS independent               no                yes           yes
    Communication port           shared            separate      shared
    Uses host CPU                yes               no            no
    Special net. requirements    none              yes, cabling  yes, setup
    Display/storage redirection  needs OS support  yes           yes
    Power management             limited           yes           yes

    Table 4.1 Comparison of communication methods

    Feature                  Active           Passive           Combination
    Comm. initiator          management host  monitored device  both
    Network traffic          high             low               medium
    Reliability              high             lower             highest
    Stat. data available     yes              no                yes
    Mgmt software support    medium           very high         very low
    Managed devices support  high             lower

    Table 4.2 Comparison of data acquisition methods

    The selection in a particular setup will depend on the available software, the number and type of devices, the current network infrastructure hierarchy and also the time and budget allotted.


    5 Sensors and components

    Before we look deeper into the actual data presented by Oracle Sun system controllers and agents, we need to define and explain the terms connected with a server.

    A component is any functional part of the server. A component may be in one of a number of states:

    - present
    - absent
    - functioning
    - about to malfunction
    - malfunctioning
    - unknown

    A very closely related term is sensor. Sensors are usually connected with components, although they may be connected with the whole system. There are fundamentally two types of sensors:

    - physical (e.g. voltage, fan speed, etc.)
    - virtual (e.g. system is OK)

    The difference is that virtual sensors are computed from physical sensors. It should be noted that for some virtual sensors, the underlying physical sensors may be hidden.

    Physical sensors usually detect values being out of range or simple true/false conditions. Some types of physical sensors:

    - button sensor (power buttons, chassis intrusion detection)
    - fan speed sensor
    - current sensor
    - presence sensor
    - temperature sensor
    - voltage sensor

    Virtual sensors are those whose condition is based on the state of other sensors (e.g. a power sensor measuring in watts is calculated from the appropriate voltage and current sensors) or on a condition detected by software. For example:

    - memory ECC error sensor
    - OK/not-OK sensor
    - power sensor
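
    To make the notion of a virtual sensor concrete, the following sketch derives two virtual readings from physical ones. The function names and the "ok" status string are illustrative assumptions; real system controllers compute these values internally and may hide the physical inputs.

```python
from typing import Iterable, Optional


def power_watts(voltage_v: Optional[float],
                current_a: Optional[float]) -> Optional[float]:
    """Virtual power sensor: P = U * I, or None if an input is unavailable."""
    if voltage_v is None or current_a is None:
        return None
    return voltage_v * current_a


def system_ok(statuses: Iterable[str]) -> bool:
    """Virtual OK/not-OK sensor: true only if every underlying sensor is 'ok'."""
    return all(status == "ok" for status in statuses)
```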

    Some sensors (mostly physical) have thresholds set up. A threshold is a value which the measured value must reach and cross for the sensor to change its state. Usually, only sensors that measure continuous values (numeric sensors, the opposite being discrete sensors) have thresholds defined:

    - non-critical
    - critical
    - non-recoverable

    When a non-critical threshold is crossed, a notification is usually generated, but the condition is not severe and does not impact the function of the system. Staying beyond a critical threshold may affect reliability and endurance. Crossing a non-recoverable threshold usually signals that something has gone very wrong, and the system is immediately shut down (although this can be modified and sometimes disabled).

    Thresholds can also be low and high. For example, a temperature sensor measuring ambient temperature has all six thresholds defined (high temperature is as undesirable as freezing temperature).
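
    The six-threshold scheme can be illustrated with a small classifier. This is a sketch under assumptions: the field names are our own, undefined thresholds are modelled as None, and a threshold counts as crossed only when the value lies strictly beyond it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thresholds:
    """Low and high thresholds of a numeric sensor; None means not defined."""
    low_non_recoverable: Optional[float] = None
    low_critical: Optional[float] = None
    low_non_critical: Optional[float] = None
    high_non_critical: Optional[float] = None
    high_critical: Optional[float] = None
    high_non_recoverable: Optional[float] = None


def classify(value: float, t: Thresholds) -> str:
    """Return the most severe threshold class the value has crossed."""
    checks = [  # ordered from most to least severe
        ("non-recoverable", t.low_non_recoverable, t.high_non_recoverable),
        ("critical", t.low_critical, t.high_critical),
        ("non-critical", t.low_non_critical, t.high_non_critical),
    ]
    for severity, low, high in checks:
        if low is not None and value < low:
            return severity
        if high is not None and value > high:
            return severity
    return "ok"
```

    For an ambient temperature sensor with (made-up) thresholds of -10, 0, 5, 35, 40 and 50 degrees Celsius, a reading of 22 is classified as ok and a reading of 42 as critical.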

    Discrete sensors can only be in one of a certain set of states. Here is an incomplete list of discrete values certain sensors can have:

    - disabled
    - memory error detected
    - OK/fail
    - present/absent

    Both kinds of sensors have so-called assertions and deassertions, which are opposite to each other. Assertion means that the sensor assumes some state (usually an error state); deassertion means that the sensor leaves the state that was previously asserted.

    However, this may sometimes be tricky. Let's see an example. We have a sensor HDD0 (the names are usually longer, but for the sake of the example let's keep this one) that has the following states:

    - Device Present
    - Device Absent
    - Hot Spare
    - Rebuild In Progress

    and for all of them, both assertion and deassertion are enabled. In this particular example, having the sensor in Device Present Assert means that the particular device is present. Similarly, Device Absent Assert means that the device has been removed.

    There is, however, one more approach: reporting Device Absent Deassert and Device Present Deassert. These mean the same as the states in the previous paragraph: the device has been inserted (is no longer absent) and the device has been removed (is no longer present), respectively. Any integration dealing with sensors must be aware of this and should preferably translate incoming notifications into one common format, discarding the less common and more confusing one.
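
    Such a translation into one common format might look like the sketch below. It is only an illustration for the HDD0 example: the state names come from the text, while the function and the choice of the assertion form as canonical are our assumptions.

```python
from typing import Tuple

# Pairs of states where deasserting one means the same as asserting the other.
_OPPOSITE = {
    "Device Present": "Device Absent",
    "Device Absent": "Device Present",
}


def normalize(state: str, asserted: bool) -> Tuple[str, bool]:
    """Rewrite 'X Deassert' as 'opposite-of-X Assert' where an opposite exists."""
    if not asserted and state in _OPPOSITE:
        return _OPPOSITE[state], True
    return state, asserted
```

    States without a natural opposite (Hot Spare, Rebuild In Progress) are passed through unchanged, so a deassertion of such a state stays a deassertion.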


    6 Management interfaces of Oracle Sun servers

    Since this project focuses on systems management of Oracle Sun servers, we first need to describe the management capabilities of these servers.

    Oracle (and previously Sun) has a very broad portfolio of servers. However, for this project, we will focus on the following hardware families:

    - Oracle Sun Fire x86 Servers (X2000 and X4000 series)
    - Oracle Sun SPARC Enterprise Server (T1000, T2000 and T5000 series)
    - Oracle Sun Blade Server Modules (X6000 and T6000 series)
    - older Sun Fire Servers (SPARCs, V210 for example)

    The work will be done primarily on the latest available servers (i.e. not End-of-Life ones). Although it may seem a waste of time to also target servers no longer in production, it is the author's belief that these servers may still be present, especially in educational institutions, where their performance is still sufficient and where an open-source monitoring tool will be more than beneficial.

    6.1 System controllers

    All the servers mentioned above have a special, independent on-board computer that controls power, monitors environmental and system characteristics (voltages, device presence, fan speeds etc.) and reports them using the methods described below. This computer is called a system controller on SPARCs and a service processor on x86 servers.

    On the Oracle Sun servers mentioned above, one may encounter the following system controllers:

    - Advanced Lights Out Manager (ALOM) [23 and 24]
    - Embedded Lights Out Manager (eLOM) [25]
    - Integrated Lights Out Manager (ILOM) [26]

    ALOM is the oldest of the three, and one can find it only on older SPARC servers (there are two versions, ALOM and ALOM-CMT, the first being used on sun4u platforms and the latter on servers with the UltraSPARC T1 processor; these processors can run several threads in parallel, also called Chip Multithreading, hence the abbreviation CMT).

    ALOM has only a command-line interface and can send e-mail to the administrator in the event of a malfunction; newer versions of ALOM-CMT also support the SNMP protocol. There is no web GUI, though. ALOM is primarily out-of-band (using a serial line or its own network port), but it can be configured from within Solaris using the scadm(1M) command. The features are fairly standard:

    - power control
    - serial console redirection
    - logical domains (on CMT machines, [27])
    - environment monitoring
    - listing, disabling and enabling components

    eLOM, on the other hand, can be found only on older x86 platforms. It offers a command-line interface, an SNMP interface and a web interface. In addition to the features listed for ALOM (except logical domains), eLOM offers:

    - graphical console redirection
    - storage redirection

    ILOM is the latest, actively developed system controller software. It can be found on both SPARC and x86 servers and it offers everything ALOM and eLOM offer together.

    6.2 Command--line interface

    The command-line interface is available on all three service controllers. However, the syntax of commands differs considerably (to ease this for veteran SPARC administrators, ILOM on SPARC can be run in ALOM-compatible mode, so that most commands, and possibly even scripts, that these administrators know or have written will work as expected). Please see the examples:

    # ssh root@alom-server

    Copyright 2008 Sun Microsystems, Inc. All rights reserved.
    Use is subject to license terms.

    Sun(tm) Advanced Lights Out Manager CMT v1.7.6

    Please login: admin
    Please Enter password: *****

    sc> showhost
    Sun-Fire-T2000 System Firmware 6.7.6 2009/10/29 16:06

    Host flash versions:
    OBP 4.30.4 2009/08/19 07:24
    Hypervisor 1.7.3.a 2009/10/29 15:50
    POST 4.30.4 2009/08/19 07:47

    Figure 6.1 ALOM example: information about server

    The command-line interface can be accessed over the following interfaces:

    - serial line
    - telnet (may be disabled for security reasons)
    - secure shell
    - internally via an OS tool (e.g. scadm(1M))

    6.3 SNMP

    The SNMP interface is arguably the most used interface for system management. Both eLOM and ILOM have supported SNMP from their very first versions; ALOM-CMT started to support SNMP directly relatively late.

    However, either due to the absence of an SNMP interface (ALOM-CMT prior to v1.4) or due to a simple wish to monitor the system in-band, there are so-called agents. There are currently two:

    - Monitoring Agent for Sun Fire and Netra Systems (MASF) [28]
    - Oracle Server Hardware Management Agent [29]

    MASF is available only on SPARC systems, but it supports both ALOM (including the CMT variant) and ILOM system controllers. On the other hand, the Hardware Management Agent supports only x86 systems and only those running specific versions of ILOM.

    # ssh root@elom-host
    root@elom-host's password:

    Sun(TM) Embedded Lights Out Manager

    Copyright 2004-2006 Sun Microsystems, Inc. All rights reserved.

    Version 2.91

    Hostname: SUNSP0016365B97FB

    IP address: 10.18.141.146

    MAC address: 00:16:36:5B:97:FB

    System serial number: 0624QC0029

    /SP -> show /SP/SystemInfo/ProductInfo

    /SP/SystemInfo/ProductInfo
    Targets:

    Properties:
    ProductManufacturer = Sun Microsystems
    ProductProductName = Sun Fire X2200 M2
    ProductPartlNumber = 1S39U9ZST61
    ProductSerialNumber = 0624QC0029
    AssetTag =

    Target Commands:
    show

    Figure 6.2 eLOM example: information about server

    # ssh root@sparc-ilom
    Password:
    Waiting for daemons to initialize...

    Daemons ready

    Sun(TM) Integrated Lights Out Manager

    Version 3.0.6.1.d r48331

    Copyright 2009 Sun Microsystems, Inc. All rights reserved.
    Use is subject to license terms.

    Warning: password is set to factory default.

    -> show /SYS
    ...

    Properties:
    type = Host System
    ipmi_name = /SYS
    keyswitch_state = Normal
    product_name = SPARC-Enterprise-T5220
    product_part_number = 602-3821-08
    product_serial_number = BEL07513TT
    product_manufacturer = SUN MICROSYSTEMS
    fault_state = OK
    power_state = On

    ...

    Figure 6.3 ILOM example: information about server

    All system controllers supporting SNMP and both agents can be configured to accept incoming SNMP requests for data (useful when monitoring these systems actively, also known as polling) and/or to send SNMP traps or notifications on their own (passive monitoring). However, the format of the data differs considerably among the types of service controllers and agents. Its structure is important for further work on the integration with Zenoss, so the data structure (described using MIBs) will be discussed in the next section.

    6.3.1 Oracle Sun MIBs

    The format and purpose of a MIB have already been defined (see section 3.1.3 on page 13). Oracle Sun systems (or more precisely, their system controllers and agents) implement some of the following MIBs:

    - ENTITY-MIB
    - SUN-PLATFORM-MIB
    - SUN-ILOM-PET-MIB
    - SUN-HW-TRAP-MIB
    - SUN-HW-MONITORING-MIB
    - SUN-ASR-NOTIFICATION-MIB

    In the following paragraphs, we will look at these MIBs in more detail.

    6.3.1.1 Origin and purpose of these MIBs

    ENTITY-MIB is the only MIB that has not been defined by Oracle (formerly Sun). It is defined in an independent specification [30]. The purpose of the MIB is given as follows ([30]):

        In particular, it (this MIB) describes managed objects used for managing multiple logical and physical entities managed by a single SNMP agent.

    ENTITY-MIB contains structures that (in terms of server management) describe various components of the server, including details such as the count and type of processors, the manufacturer of DIMM modules etc.

    SUN-PLATFORM-MIB is a MIB that extends ENTITY-MIB with details about operational state; it also contains tables that identify and list system sensors, together with their thresholds and current values. In addition, this MIB defines some notifications that can be used to dynamically modify the model of the monitored system and/or can be translated and displayed to the user. However, these traps do not carry all the information (like the type of sensor issuing the warning), so additional action is required to get it (typically, this is done using a regular expression that looks for a certain pattern in sensor names). Using regular expressions is a quick and functional way, but the author believes the correct approach is to poll the agent or system controller for the correct sensor type based on the OIDs present in the notifications. These two MIBs are supported in MASF (SPARC) and in all ILOMs and eLOMs.
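
    The regular-expression approach can be illustrated as follows. The patterns and sensor names below are assumptions chosen for the example (a V_, T_ or I_ prefix in the last path component, or a TACH/RPM suffix); they are not taken from any MIB and would have to be checked against the names a given platform really reports.

```python
import re
from typing import Optional

# Illustrative patterns only: (compiled regex, guessed sensor type).
_PATTERNS = [
    (re.compile(r"/V_[^/]*$"), "voltage"),
    (re.compile(r"/T_[^/]*$"), "temperature"),
    (re.compile(r"/I_[^/]*$"), "current"),
    (re.compile(r"/(TACH|RPM)$"), "fan speed"),
]


def guess_sensor_type(sensor_name: str) -> Optional[str]:
    """Guess the sensor type from its name; None if no pattern matches."""
    for pattern, sensor_type in _PATTERNS:
        if pattern.search(sensor_name):
            return sensor_type
    return None
```

    The weakness of this approach is visible in the None branch: any sensor whose name does not follow the expected convention stays unclassified, which is why polling the agent for the sensor type is the more robust option.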

    SUN-ILOM-PET-MIB is one of the MIBs that does not use the typical Sun (Oracle) OID tree; instead it uses the wiredformgmt (Wired for Management) tree. This is an OID tree reserved by Intel [31] for so-called PETs (Platform Event Traps). These largely correspond to IPMI events and often carry similar data. However, such a trap carries a computed specific type (a number that identifies the type of trap or notification being sent). Most NMSes cannot deal with dynamic specific types; they expect these numbers to be assigned statically and defined in the MIB, and that is the purpose of this MIB. However, if there is another PET MIB by a different vendor, the two will share the OID tree and the numbers will collide. Not only will the names and descriptions of most or all notifications be different, but some may have a totally different meaning.
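
    How a specific type is computed can be sketched as below. The bit layout used here (sensor type in bits 23..16, event type in bits 15..8, a deassertion flag in bit 7 and the event offset in bits 3..0) is our reading of the PET format and should be treated as an assumption to be verified against the specification [31]; the function and type names are our own.

```python
from typing import NamedTuple


class PetSpecificTrap(NamedTuple):
    sensor_type: int    # e.g. 0x02 for a voltage sensor in IPMI terms
    event_type: int     # e.g. 0x01 for threshold-based events
    event_offset: int   # which event within the event type
    deassertion: bool   # True if the event is a deassertion


def encode_specific_trap(sensor_type: int, event_type: int,
                         event_offset: int, deassertion: bool = False) -> int:
    """Pack the three fields into one number (layout assumed, see lead-in)."""
    offset_byte = (0x80 if deassertion else 0x00) | (event_offset & 0x0F)
    return ((sensor_type & 0xFF) << 16) | ((event_type & 0xFF) << 8) | offset_byte


def decode_specific_trap(value: int) -> PetSpecificTrap:
    """Split a specific trap number back into its fields."""
    return PetSpecificTrap(
        sensor_type=(value >> 16) & 0xFF,
        event_type=(value >> 8) & 0xFF,
        event_offset=value & 0x0F,
        deassertion=bool(value & 0x80),
    )
```

    Because the number is derived purely from generic field values, two vendors describing the same combination in their own PET MIBs end up with the same specific type, which is the collision described above.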

    SUN-HW-TRAP-MIB was designed relatively recently with a single purpose: to eliminate the need for regular expression matching or for polling the agent when a trap is received. Hence, displaying these traps directly is preferred.

    SUN-HW-MONITORING-MIB was designed to remove the dependency on ENTITY-MIB and to provide more information about the monitored system. It features data like cumulative state, which is computed on the monitored host. The main advantage of this approach is reduced network traffic: the NMS may poll only a few values in the MIB and fetch the full tree only when something goes wrong. This MIB is implemented only in the Hardware Management Agent.

    SUN-ASR-NOTIFICATION-MIB is currently implemented by the ASR agent. The description from [32]:

        ASR is a secure, scalable, customer-installable software feature of warranty and SunSpectrum support that provides auto-case generation when specific hardware faults occur. ASR is designed to enable faster problem resolution by eliminating the need to initiate contact with Sun for hardware failures, reducing both the number of phone calls needed and overall phone time required. ASR also simplifies support operations by utilizing electronic diagnostic data.

    When a hardware error is detected, the ASR agent sends details about the error, together with a unique identifier of the system, to Oracle, where the data is filtered and entered as a Service Request on behalf of the customer. This saves time and communication effort. In addition, ASR generates an SNMP notification to inform the customer that a Service Request has been created on their behalf.

    6.3.1.2 Notifications

    It is not feasible to describe every single notification declared in all MIBs, as that would make this document excessively long and also very quickly outdated. In this section, we will describe the basic principles behind notifications in Oracle (Sun) MIBs.

    ENTITY-MIB has only one notification, entConfigChange. Its sole purpose is to inform the NMS that a configuration change has occurred and that it should reread all data.

    SUN-PLATFORM-MIB has at present twelve notifications defined. These notifications were designed to work in cooperation with ENTITY-MIB, and as such each notification carries an OID that points to the ENTITY-MIB and contains some additional information. However, this is not practical for integrations that only translate notifications, so there is an additional varbind, sunPlatNotificationAdditionalInfo, that contains a human-readable text of the event that occurred.

    SUN-ILOM-PET-MIB has already been briefly described. What is interesting about its notifications is that they contain only one varbind, holding a string of encoded binary data. This data includes a sensor name, which is often decoded from the trap while the rest is discarded, as the meaning of the notification is already given by the specification.

    SUN-HW-TRAP-MIB is the only MIB designed solely for the purpose of sending traps. As of now, it has seventy-three notifications defined. The names of the notifications contain both the type of sensor on which the event occurred and the threshold that was crossed. The additional varbinds carry the full name of the sensor, the threshold value and the current value. Example:

    - sunHwTrapVoltageNonCritThresholdExceeded: a non-critical threshold was exceeded
    - sunHwTrapVoltageOk: the voltage is OK now

    Please bear in mind that SNMP is UDP based and therefore each trap with lower severity (e.g. one suggesting the system is getting into a better condition) should automatically close all previous events with higher severity, if they were sent for the same sensor.
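
    The auto-closing rule can be sketched as a small in-memory event list. This is an illustration only: the severity names and their ordering are our assumptions, and a real NMS such as Zenoss has its own event model and clearing mechanism.

```python
from typing import Dict, List

# Assumed ordering of severities, lowest first.
SEVERITY = {"ok": 0, "non-critical": 1, "critical": 2, "non-recoverable": 3}


class EventConsole:
    """Keeps open events per sensor and closes those a newer trap supersedes."""

    def __init__(self) -> None:
        self.open_events: Dict[str, List[str]] = {}

    def on_trap(self, sensor: str, severity: str) -> None:
        level = SEVERITY[severity]
        # Close every open event of this sensor with a higher severity.
        kept = [s for s in self.open_events.get(sensor, [])
                if SEVERITY[s] <= level]
        # An "ok" trap only clears; anything else is recorded once.
        if level > 0 and severity not in kept:
            kept.append(severity)
        if kept:
            self.open_events[sensor] = kept
        else:
            self.open_events.pop(sensor, None)
```

    With this rule, a lost intermediate trap does no lasting harm: if the trap announcing the return from critical to non-critical never arrives, the later OK trap still closes the critical event.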

    SUN-ASR-NOTIFICATION-MIB has only five notifications:

    - sunAsrSrCreatedTrap
    - sunAsrSrCreationInProgressTrap
    - sunAsrSrUpdatedTrap
    - sunAsrSrDelayedTrap
    - sunAsrSrFailureTrap

    With these notifications, the NMS can display appropriate messages when a service request has been created, is being created, has been updated, is delayed or has failed, respectively.

    6.3.1.3 Polled data

    ENTITY-MIB contains the following tables:

    - entPhysicalTable
    - entLogicalTable
    - entLPMappingTable
    - entAliasMappingTable
    - entPhysicalContainsTable

    It also contains the entLastChangeTime scalar value. Taken from [30]:

        The entPhysicalTable contains one row per physical entity, and must always contain at least one row for an overall physical entity, which should have an entPhysicalClass value of 'stack(11)', 'chassis(3)' or 'module(9)'.

        Each row is indexed by an arbitrary, small integer, and contains a description and type of the physical entity. It also optionally contains the index number of another entPhysicalEntry indicating a containment relationship between the two.

        The entLogicalTable contains one row per logical entity. Each row is indexed by an arbitrary, small integer and contains a name, description, and type of the logical entity. It also contains information to allow access to the MIB information for the logical entity.

        The entLPMappingTable contains mappings between entLogicalIndex values (logical entities) and entPhysicalIndex values (the physical components supporting that entity). A logical entity can map to more than one physical component, and more than one logical entity can map to (share) the same physical component.

        The entAliasMappingTable contains mappings between entLogicalIndex, entPhysicalIndex pairs and 'alias' object identifier values. This allows resources managed with other MIBs (e.g., repeater ports, bridge ports, physical and logical interfaces) to be identified in the physical entity hierarchy. Note that each alias identifier is only relevant in a particular naming scope.

        The entPhysicalContainsTable contains simple mappings between 'entPhysicalContainedIn' values for each 'container/containee' relationship in the managed system. The indexing of this table allows an NMS to quickly discover the 'entPhysicalIndex' values for all children of a given physical entity.

        Scalar object entLastChangeTime represents the value of sysUpTime when any part of the Entity MIB configuration last changed.
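
    The containment relationship described in the quotation can be turned into a component tree on the NMS side. The sketch below assumes that the polled rows are already available as a dictionary keyed by entPhysicalIndex, with a description and the entPhysicalContainedIn value per row (0 meaning the entity is not contained in anything); the function names and sample data are our own.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# entPhysicalIndex -> (entPhysicalDescr, entPhysicalContainedIn)
Rows = Dict[int, Tuple[str, int]]


def build_children(rows: Rows) -> Dict[int, List[int]]:
    """Map each container index to the sorted list of indexes it contains."""
    children = defaultdict(list)
    for index, (_descr, parent) in sorted(rows.items()):
        children[parent].append(index)
    return dict(children)


def render(rows: Rows, children: Dict[int, List[int]],
           parent: int = 0, depth: int = 0) -> List[str]:
    """Return the hierarchy as indented lines, starting at the top level."""
    lines = []
    for index in children.get(parent, []):
        lines.append("  " * depth + rows[index][0])
        lines.extend(render(rows, children, index, depth + 1))
    return lines
```

    Called with four sample rows (a chassis containing a motherboard and a fan, and the motherboard containing a CPU), render returns the chassis on the first line with the contained entities indented below it.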

    SUN-PLATFORM-MIB is an extension of ENTITY-MIB. Specifically, it augments entPhysicalTable with information about Oracle/Sun specific equipment and, most importantly, it adds information about sensors (i.e. when a row in entPhysicalTable refers to a sensor, the agent implementing the MIB fills in details about this sensor, like sensor type, thresholds and values, into the appropriate table under the same index as the row in entPhysicalTable).
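
    Because the augmenting tables share their index with entPhysicalTable, combining them on the NMS side is a simple join on that index. In the sketch below, entPhysicalName and entPhysicalClass are ENTITY-MIB columns, while sensor_type and current_value are simplified placeholder names, not the real SUN-PLATFORM-MIB column names.

```python
from typing import Any, Dict

Table = Dict[int, Dict[str, Any]]   # index -> {column name: value}


def merge_sensor_rows(physical: Table, sensors: Table) -> Table:
    """Join sensor details onto physical rows that share the same index."""
    merged: Table = {}
    for index, phys_row in physical.items():
        row = dict(phys_row)                  # copy, keep the input untouched
        row.update(sensors.get(index, {}))    # non-sensor rows get no extras
        merged[index] = row
    return merged
```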

    SUN-HW-MONITORING-MIB is independent of ENTITY-MIB and is complemented by SUN-HW-TRAP-MIB, which contains the definitions of notifications.

    This MIB contains data similar to ENTITY-MIB, but the data is spread among more tables:

    - sunHwMonInventoryTable
    - sunHwNumericVoltageSensorTable
    - sunHwDiscreteVoltageSensorTable
    - sunHwNumericCurrentSensorTable
    - sunHwDiscreteCurrentSensorTable
    - sunHwNumericPowerDeviceSensorTable
    - sunHwDiscretePowerDeviceSensorTable
    - sunHwNumericCoolingDeviceSensorTable
    - sunHwDiscreteCoolingDeviceSensorTable
    - sunHwNumericTemperatureSensorTable
    - sunHwDiscreteTemperatureSensorTable
    - sunHwNumericProcessorSensorTable
    - sunHwDiscreteProcessorSensorTable
    - sunHwNumericMemorySensorTable
    - sunHwDiscreteMemorySensorTable
    - sunHwNumericHardDriveSensorTable
    - sunHwDiscreteHardDriveSensorTable
    - sunHwNumericIOSensorTable
    - sunHwDiscreteIOSensorTable
    - sunHwNumericSlotOrConnectorSensorTable
    - sunHwDiscreteSlotOrConnectorSensorTable
    - sunHwNumericOtherSensorTable
    - sunHwDiscreteOtherSensorTable
    - sunHwMonIndicatorTable
    - sunHwMonTotalPowerConsumption

    As one can see, this MIB is more fine-grained than ENTITY-MIB. In addition to these tables, certain values of interest are also directly available as scalars, which radically simplifies writing management extensions. There are quite a few scalars; only some are listed below (for a full list and description see the MIB itself, which is well commented):

    - sunHwMonProductName
    - sunHwMonProductType
    - sunHwMonCumulativeSensorAlarmStatus
    - sunHwMonIndicatorServiceName
    - sunHwMonIndicatorServiceCurrentStatus

    6.4 IPMI

    IPMI is supported only in eLOM and ILOM. Utilities that access system controllers over IPMI (e.g. ipmitool(1M), [33]) can use two connection methods:

    - out-of-band or side-band over the network
    - locally over the KCS interface

    While the first is always available, KCS (Keyboard Controller Style) was not available on SPARC systems until recently; this was caused by a missing driver, not a hardware defect [35].

    6.5 Other interfaces

    All of the system controllers can send notifications by e-mail and can also forward events to a system logging daemon running on a remote host. To the author's knowledge, these interfaces are seldom used.

    The web interface, however, is used quite often: it offers a quick way to check the server status and server components and also to upgrade firmware remotely without having to run a TFTP server.

    Figure 6.4 ILOM login screen


    7 Zenoss integration

    Now that all management protocols, approaches and the available interfaces of Oracle Sun servers have been described, we can start designing and implementing the Zenoss integration. The resource materials [36-41] were invaluable and provided all the information needed for designing and implementing the integration.

    7.1 Choosing an approach

    Zenoss supports both the active and the passive approach. To be able to actively poll system controllers or agents for data, it is necessary to develop plugins in Python that extend the Zenoss object model. While the API is not overly complex and ENTITY-MIB modelling is already present, it would be time consuming to implement the other MIB (SUN-HW-MONITORING-MIB), and management capabilities would thus be limited to system controllers with ILOM and eLOM and to SPARC hosts running MASF.

    On the other hand, implementing trap handling is easier, and as a result of implementing support for SUN-PLATFORM-MIB and SUN-HW-TRAP-MIB notifications, many more platforms will be supported:

    Ultimately, the desired functionality is that of the existing integrations with IBM Tivoli Enterprise Console [42] or IBM Tivoli NetCool OMNIbus [43].

    7.2 Development environment

    The development environment was a VirtualBox virtual machine running Debian GNU/Linux 5.0 with the Zenoss 2.5.1 stack installed (recently updated to 2.5.2). Development was done in accordance with Jane Curry's [40]: the development tree was stored outside of Zenoss and versioned in a Mercurial repository.

    7.3 Important design decisions

    7.3.1 Event classes

    Zenoss organizes events into event classes. Certain classes already exist, like /Hw/Perf etc. Two approaches were possible:

    1. extend existing event classes
    2. create a completely separate namespace with new event classes

    While the first approach would let the integration fit seamlessly into an existing environment (especially helpful when users already have paging, e-mail or other notifications set up), the second approach guarantees that there will be no clashes with the existing setup (unless, of course, the user creates their own event classes with the same names).

    As this integration should not break anything in the end user's setup, it has been decided to create a completely separate namespace.

    7.3.2 Per-trap mapping vs. defaultmapping

    When Zenoss receives an event (in this case caused by receiving an SNMP notification), it tries to process the event using the Event Class Key, which is usually the name of the SNMP notification (provided the MIB is loaded and compiled). To do that, it searches its database for Event Class Mappings, which play a role similar to rules in other software.

    Figure 7.1 Zenoss Event Processing

    When the mapping is not found, it looks for defaultmapping, which may process the generated event. Although it would be simpler to develop just one block of code to process these events, there is a concern that running a larger block of code for every single notification would make the application much slower. Hence, a decision has been made to create a mapping for every single SNMP notification.
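
    The lookup order described above (a specific mapping first, then the default one) can be illustrated with a plain dictionary. This is a conceptual sketch, not Zenoss code: the dispatch function, the handler signature and the event class names assigned by the handlers are our assumptions; only the eventClassKey field name follows Zenoss terminology.

```python
from typing import Callable, Dict, Optional

Event = Dict[str, str]
Handler = Callable[[Event], Event]


def dispatch(event: Event,
             mappings: Dict[str, Handler],
             default_mapping: Optional[Handler] = None) -> Event:
    """Process an event by its class key, falling back to the default mapping."""
    handler = mappings.get(event["eventClassKey"], default_mapping)
    if handler is None:
        return event                # no mapping at all: leave the event as is
    return handler(event)


def map_voltage_ok(event: Event) -> Event:
    """Small per-trap handler: only the work needed for this one trap."""
    return dict(event, eventClass="/Oracle/Voltage", severity="clear")


def map_default(event: Event) -> Event:
    """Catch-all handler: would have to inspect every possible trap."""
    return dict(event, eventClass="/Oracle/Other")
```

    With per-trap mappings, each notification runs only its own small handler, whereas a single default handler would need to contain (and execute) the decision logic for all notifications.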

    7.4 Development steps

    In this section we describe the steps taken to develop this integration. One step is common to all subsequent steps: once it has been verified that the described action was successful, the resulting objects are added to the ZenPack (called ZenPacks.ojakubcik.OracleHwMonitoring), the ZenPack is exported and then committed to the Mercurial repository.

    7.4.1 Compiling MIBs

    This is arguably the simplest step. It involves copying the used MIBs to the location where Zenoss expects them ($ZENHOME/share/mibs/site). The $ZENHOME environment variable is set by default for user zenoss.

    Then, as user zenoss, one has to run the command

    $ zenmib -v 10

    to process the new MIBs and load them into Zenoss.
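The two actions above (copy, then compile) could be scripted as in the following minimal sketch. The function names `install_mibs` and `compile_mibs` are hypothetical; the target directory and the zenmib invocation are taken from the text, and the script is assumed to run as user zenoss.

```python
import os
import shutil
import subprocess


def install_mibs(mib_files, zenhome):
    """Copy MIB files into the directory where Zenoss expects site MIBs."""
    site_dir = os.path.join(zenhome, "share", "mibs", "site")
    if not os.path.isdir(site_dir):
        os.makedirs(site_dir)
    copied = []
    for path in mib_files:
        target = os.path.join(site_dir, os.path.basename(path))
        shutil.copy(path, target)
        copied.append(target)
    return copied


def compile_mibs():
    """Run the command quoted above and return its exit code."""
    return subprocess.call(["zenmib", "-v", "10"])
```

A typical call would pass the paths of the MIB files and the value of os.environ["ZENHOME"] to install_mibs and then call compile_mibs().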

    7.4.2 Creating Event classes

    Before creating mappings, all event classes to which we want to map events must exist. Based on the two MIBs currently used, the following classes are created:

    /Events/Oracle
    /Events/Oracle/Voltage
    /Events/Oracle/Temperature
    /Events/Oracle/Electrical Current
    /Events/Oracle/Fan Speed
    /Events/Oracle/Other
    /Events/Oracle/Power Supply
    /Events/Oracle/Fan
    /Events/Oracle/Processor
    /Events/Oracle/Memory
    /Events/Oracle/Hard Drive
    /Events/Oracle/IO
    /Events/Oracle/Slot or Connector
    /Events/Oracle/Component
    /Events/Oracle/FRU
    /Events/Oracle/Power Consumption

    These can be created from the GUI by following the Events menu item in the left navigation bar and then clicking Add New Organizer in the menu to the left of Subclasses.

    However, it is also possible to do this using the zendmd tool, which is essentially a Python interpreter with preloaded Zenoss classes [44] (this is just a skeleton script; the full version can be found on the CD in the scripts directory as the file createEventClasses.py):

        import Globals
        from transaction import commit
        from Products.ZenUtils.ZenScriptBase import ZenScriptBase
        dmd = ZenScriptBase(connect=True).dmd

        event_classes = ['/Events/Oracle', '/Events/Oracle/Voltage', ...]

        for ec in event_classes:
            dmd.Events.manage_addOrganizer(ec)

        commit()

    As a result, all the event classes we need are now in place and we can proceed to creating the event mappings.
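The event_classes list used by the skeleton script can also be generated instead of typed out. The sketch below is one possible way to do it and is not necessarily how createEventClasses.py on the CD works; `ROOT`, `SUBCLASSES` and `build_event_classes` are names introduced here, while the class names are those listed earlier.

```python
# Class names come from the list in this section; the helper is illustrative.
ROOT = '/Events/Oracle'
SUBCLASSES = [
    'Voltage', 'Temperature', 'Electrical Current', 'Fan Speed', 'Other',
    'Power Supply', 'Fan', 'Processor', 'Memory', 'Hard Drive', 'IO',
    'Slot or Connector', 'Component', 'FRU', 'Power Consumption',
]


def build_event_classes(root=ROOT, subclasses=SUBCLASSES):
    """Return the root path followed by one full path per subclass."""
    return [root] + [root + '/' + name for name in subclasses]


event_classes = build_event_classes()
```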

    7.4.3 Creating Event mappings

    The recommended procedure for creating Event Class mappings is to have the Zenoss SNMP daemon receive all possible notifications and then create the mappings from the GUI. These mappings can then be modified, again from the GUI [39].


    However, if we do that for just one notification, we can observe that the following attributes are present (filled values are in parentheses); the rest must be filled in manually:

    Name (SNMP trap name, e.g. sunPlatObjectCreation)
    Event Class Key (SNMP trap name, e.g. sunPlatObjectCreation)
    Sequence (number, in my case 7)
    Rule
    Regex
    Example (snmp trap sunPlatObjectCreation)
    Transform
    Explanation
    Resolution

    The meaning of these fields is described in [36]:

    Name: An identifier for this event class mapping. Not important for matching events.

    Event Class Key: Must match the incoming event's eventClassKey field for this mapping to be considered as a match for events.

    Sequence: Sequence number of this mapping, among mappings with an identical event class key property. Go to the Sequence tab to alter its position.

    Rule: Provides a programmatic secondary match requirement. It takes a Python expression. If the expression evaluates to True for an event, this mapping is applied.

    Regex: The regular expression match is used only in cases where the rule property is blank. It takes a Perl Compatible Regular Expression (PCRE). If the regex matches an event's message field, then this mapping is applied.

    Transform: Takes Python code that will be executed on the event only if it matches this mapping. For more details on transforms, see the section titled Event Class Transform.

    Explanation: Free-form text field that can be used to add an explanation field to any event that matches this mapping.

    Resolution: Free-form text field that can be used to add a resolution field to any event that matches this mapping.
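How these fields interact during matching can be illustrated with a simplified model. This is a sketch under stated assumptions, not the Zenoss implementation: mappings and events are plain dictionaries here (Zenoss uses objects), Python's re module stands in for PCRE, and treating a mapping with neither rule nor regex as a match on the key alone is an assumption. The names `mapping_matches` and `first_match` are hypothetical.

```python
import re


def mapping_matches(mapping, evt):
    """Apply the key, rule and regex checks in the order described above."""
    if mapping["eventClassKey"] != evt.get("eventClassKey"):
        return False
    rule = mapping.get("rule", "")
    if rule:
        # The rule is a Python expression evaluated against the event.
        return bool(eval(rule, {}, {"evt": evt}))
    regex = mapping.get("regex", "")
    if regex:
        # The regex is consulted only when the rule is blank.
        return re.search(regex, evt.get("message", "")) is not None
    return True


def first_match(mappings, evt):
    """Return the matching mapping with the lowest sequence number, if any."""
    for mapping in sorted(mappings, key=lambda m: m["sequence"]):
        if mapping_matches(mapping, evt):
            return mapping
    return None
```

The sequence number only matters when several mappings share the same event class key, which is why it is compared last, after the candidates have been ordered.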

    Although we could possibly enter all mappings using the GUI, this would be error-prone and not very efficient. Luckily, as Zenoss is based on Zope, every GUI action has a corresponding Python function that can be called.


    To manipulate event classes, we first need to get the object that represents them. This can be done with the following method:

    dmd.Events.getOrganizer(name)

    where name is the full path to the event class organizer.

    Each organizer has a method createInstance that takes one parameter: the identifier of the created mapping (in our case, this will be the name of the notification). This method returns an instance of EventClassInst, which we will manipulate further.

    EventClassInst has attributes that correspond to the fields described earlier (e.g. eventClassKey). After creating the new mapping instance, all we need to do is set the corresponding attributes using standard Python syntax and finally commit everything into the ZODB (Zope Object Database) by calling the commit() procedure.

    The following list describes which attributes need to or should be set, and how:

    eventClassKey and id shall be set to the translated name of the SNMP notification.
    example shall be set to snmp trap followed by the name of the notification.
    transform shall contain Python code that will modify the received event text and severity, and possibly set other values so that clearing will work.
    explanation and resolution may contain text explaining the nature of the event.
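Put together, creating a single mapping might look like the sketch below. The wrapper `create_mapping` and its parameters are hypothetical; getOrganizer, createInstance and the attribute names are the ones named in this section. It assumes a dmd object as provided by zendmd or by ZenScriptBase, and it leaves the final commit() to the caller.

```python
def create_mapping(dmd, organizer_path, trap_name, transform,
                   explanation='', resolution=''):
    """Create one event class mapping and set the attributes listed above."""
    # Look up the event class organizer by its full path.
    org = dmd.Events.getOrganizer(organizer_path)
    # The identifier of the mapping is the name of the notification.
    inst = org.createInstance(trap_name)
    inst.eventClassKey = trap_name
    inst.example = 'snmp trap ' + trap_name
    inst.transform = transform
    inst.explanation = explanation
    inst.resolution = resolution
    return inst
```

For example, create_mapping(dmd, '/Events/Oracle/Hard Drive', 'sunHwTrapHardDriveStatus', transform_text) followed by commit() would store one mapping; transform_text stands for whatever transform code is appropriate for that notification.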

    The Transform field, corresponding to the transform attribute, will contain different Python code for notifications from different MIBs. Some of them may be dropped automatically:

        # Drop this event
        evt._action = "history"


    Most of the traps from SUN-HW-TRAP-MIB will have processing similar to this (please note that although MIBs do specify a user-friendly mapping of integers to names, Zenoss does not use these mappings):

        # Get interesting attributes
        component = getattr(evt, 'sunHwTrapComponentName', None)
        threshold_type = getattr(evt, 'sunHwTrapThresholdType', None)
        threshold_value = getattr(evt, 'sunHwTrapThresholdValue', None)
        reading = getattr(evt, 'sunHwTrapSensorValue', None)
        if threshold_type == 1:
            # Upper
            thr_type_text = "upper"
            thr_word = "over"
            thr_compare = ">="
        elif threshold_type == 2:
            # Lower
            thr_type_text = "lower"
            thr_word = "below"
            thr_compare = "<="
        ...


        definitions = []

        # No /Events/Oracle needed, that is added automatically
        # Sun HW Trap MIB - threshold notifications
        for sensor_short, sensor_type, zen_group in [
                ('Voltage', 'Voltage', '/Voltage'),
                ('Temp', 'Temperature', '/Temperature'), ...]:
            for thr_value, severity, threshold_type in [
                    ('Fatal', 5, 'non-recoverable'),
                    ('Crit', 4, 'critical'),
                    ('NonCrit', 3, 'non-critical')]:
                # Parentheses allow the expression to span two lines
                name = ('sunHwTrap' + sensor_short + thr_value +
                        'ThresholdExceeded')
                organizer = zen_group
                transform = hw_thr_assert % {
                    'severity': severity,
                    'type': sensor_type,
                    'threshold_type': threshold_type}
                d = {'name': name,
                     'organizer': organizer,
                     'transform': transform}
                definitions.append(d)

    Here, hw_thr_assert and hw_thr_deassert are strings that contain the template of the transformation script to be entered into Zenoss.

    When the definitions array is filled with transformation rules, we can cycle through it and create the mappings in Zenoss:
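To make the templating step concrete, the sketch below renders a transform from a template using %-style dictionary formatting, the same mechanism the generator script relies on. The template text is hypothetical and much shorter than a real one would be; it is not the author's hw_thr_assert. The placeholder names severity, type and threshold_type are those used in the script, and evt.severity and evt.summary are standard Zenoss event fields.

```python
# Hypothetical template; the author's real templates are not reproduced here.
example_assert_template = """\
# %(type)s sensor crossed its %(threshold_type)s threshold
evt.severity = %(severity)d
evt.summary = '%(type)s %(threshold_type)s threshold exceeded'
"""


def render_transform(template, severity, sensor_type, threshold_type):
    """Fill the template placeholders using %-style dictionary formatting."""
    return template % {
        'severity': severity,
        'type': sensor_type,
        'threshold_type': threshold_type,
    }


transform = render_transform(example_assert_template, 5, 'Voltage',
                             'non-recoverable')
```

One consequence of this design is that any literal percent sign inside a template has to be written as %%, otherwise the formatting step fails.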

        for definition in definitions:
            # Prepend the common namespace to the organizer path
            org = dmd.Events.getOrganizer('/Events/Oracle' +
                                          definition['organizer'])
            inst = org.createInstance(definition['name'])
            inst.example = 'snmp trap ' + definition['name']
            inst.transform = definition['transform']

    Finally, we need to add some preamble to the script:


        import Globals
        from transaction import commit
        from Products.ZenUtils.ZenScriptBase import ZenScriptBase
        dmd = ZenScriptBase(connect=True).dmd

    We also need to commit the changes to the database:

    commit()

    7.4.4 Adding products

    Finally, we may want to add a new manufacturer and a list of products. This again can be done from the GUI or from the command line using zendmd.

    However, the syntax here is not as easy as in the first example, so for the purposes of this project, products were created by hand using the GUI.

    Manufacturer Oracle was added to Zenoss, and a list of servers was created:

    Oracle Sun Fire X2250 Server
    Oracle Sun Fire X2270 Server
    Oracle Sun Fire X4100 M2 Server
    Oracle Sun Fire X4200 M2 Server
    Oracle Sun Fire X4600 M2 Server
    Oracle Sun Fire X4540 Server
    Oracle Sun Fire X4140 Server
    etc.

    7.4.5 Final modifications

    Even though scripting the creation of the mappings saved a considerable amount of time, the script inevitably may not generate all messages and severities

    correctly. Hence, a walkthrough of the generated mappings is recommended, and modifying the generated code to make it more efficient for the given purpose is encouraged.

    Small modifications were needed especially for notifications that cover more than one event (sunHwTrapHardDriveStatus) and for most SUN-PLATFORM-MIB notifications.

    7.5 Testing

    The optimal approach to testing would be to create automation that simulates failures on physical machines, which would in turn respond with notifications. Semi-manual checking would then be required to confirm that the integration works as expected.

    However, due to time constraints and the unavailability of all testing machines, a different approach was chosen. One server (Oracle Sun SPARC Enterprise T5220 Server) was configured to send notifications from the system controller and the MASF agent to the same IP address, where Zenoss with this integration was running. Hard drives, power supplies and fans were then removed and reinstalled to verify that traps are received and cleared.

    7.6 Future extension

    As of now, the integration has just basic functionality. The following paragraphs describe possible new features to be developed, possibly as future work by the author.

    Testing framework. To ensure this software works, a complete automated testingframework supporting physical servers needs to be developed and regularly run.

    Better clearing mechanism. Right now, due to the way Zenoss handles clearing events (i.e. only events with the cleared severity can clear others), notifications ending with Deassert have a severity of cleared. This may not be accurate, because even if the sensor reading drops below the non-recoverable threshold, the reading is still critical and not OK.

    Polling. This would mean developing a Zenoss plugin that would discover and model the server using data obtained by periodically reading MIB data.
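The intended behaviour of a better clearing mechanism can be sketched as a small function that returns the severity a sensor is still in after a given threshold is deasserted. This is only an illustration of the idea and not part of the integration: `severity_after_deassert` and the two tables are hypothetical, the numeric severities are the ones used by the generator script (Fatal 5, Crit 4, NonCrit 3), and 0 is the Zenoss Clear severity. How such a value would be fed back into Zenoss, given that only cleared events clear others, is left open here.

```python
# Severities as used by the generator script; 0 is the Zenoss Clear severity.
ASSERT_SEVERITY = {'Fatal': 5, 'Crit': 4, 'NonCrit': 3}
# Threshold levels ordered from least to most severe.
ORDER = ['NonCrit', 'Crit', 'Fatal']


def severity_after_deassert(level):
    """Severity the sensor is still in after leaving the given threshold."""
    index = ORDER.index(level)
    if index == 0:
        # Below the lowest threshold: the sensor is genuinely OK.
        return 0
    # Otherwise the reading is still beyond the next lower threshold.
    return ASSERT_SEVERITY[ORDER[index - 1]]
```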

    Model updates from traps. Instead of, or in addition to, writing to the event console when an SNMP notification is received, a previously obtained model of the server could either be updated or a reread of all data could be forced. This of course requires functional polling, and to function properly, the model will need to be updated from time to time anyway, just to make sure that an SNMP notification wasn't lost en route.

    Graphing and reporting. Based on data obtained by the previous two extensions, it would be possible to implement graphing and reporting, showing for example temperature trends and, more importantly, power consumption.


    8 Conclusion

    This project was partially research-oriented and partially implementation-oriented. As a result, a brief yet hopefully useful description of system management motivations, technologies and software was given.

    In addition, a basic but functional integration into an open-source system management tool was developed and tested (albeit only in a limited way), by which this project fulfilled its assignment.

    The author implemented a new and previously unknown (or at least not publicly described) way to create Event Class mappings programmatically.

    However, the original idea of a complete monitoring solution that would do polling, graphing and notifications simultaneously was not realized. Nonetheless, even though this solution does not use all features of Zenoss, there is room for improvement, as described earlier.


    References

    [1] O. Jakubčík, Selecting open-source system management solution for integrating with Sun servers (unpublished, 2009). Available on CD.
    [2] E. Galstad, Nagios Core Version 3.x Documentation (2009).
    [3] Zabbix SIA, Zabbix 1.8 manual.
    [4] Zenoss, Inc., Zenoss getting started (Zenoss, Inc., 2009).
    [5] Wikipedia, Simple network management protocol (2010).
    [6] M. Rose and K. McCloghrie, RFC1155: Structure and identification of management information for TCP/IP-based internets (IETF, 1990).
    [7] K. McCloghrie and M. Rose, RFC1156: Management Information Base for network management of TCP/IP-based internets (IETF, 1990).
    [8] J. Case, M. Fedor, M. Schoffstall, and J. Davin, RFC1157: Simple Network Management Protocol (SNMP) (IETF, 1990).
    [9] K. McCloghrie, D. Perkins, and J. Schoenwaelder, RFC2578: Structure of Management Information Version 2 (SMIv2) (IETF, 1999).
    [10] ITU, Abstract Syntax Notation One: Specification of basic notation (ITU, 2002a).
    [11] ITU, Abstract Syntax Notation One: Information object specification (ITU, 2002b).
    [12] ITU, Abstract Syntax Notation One: Constraint specification (ITU, 2002c).
    [13] ITU, Abstract Syntax Notation One: Parameterization of ASN.1 specifications (ITU, 2002d).
    [14] Intel, HP, NEC, and Dell, Intelligent Platform Management Interface Specification (Intel, 2009). Second generation, v2.0.
    [15] DMTF, Inc., Web-based enterprise management (wbem) faqs (DMTF, Inc., 2010).
    [16] The Open Group, OpenPegasus (2010). www.openpegasus.org.
    [17] D. Libes, The expect home page (Don Libes, 2009). http://expect.nist.gov/.
    [18] R. Gerhards, RFC5424: The Syslog Protocol (IETF, 2009).
    [19] R. Thurlow, RFC5531: RPC: Remote Procedure Call Protocol Specification Version 2 (IETF, 2009).
    [20] Object Management Group, Inc., Common Object Request Broker Architecture (CORBA) Specification, Version 3.1 (2008).
    [21] World Wide Web Consortium, SOAP Version 1.2 Part 1: Messaging Framework (2007). Second edition.
    [22] D. Winer, Xml-rpc specification (xml-rpc.com, 1999).
    [23] Sun Microsystems, Inc., Sun Advanced Lights Out Manager (ALOM) 1.6 Administration Guide (2007b). 819-2445-11.


    [24] Sun Microsystems, Inc., Advanced Lights Out Management (ALOM) CMT v1.4 Guide (2007a). 819-7991-10.
    [25] Sun Microsystems, Inc., Embedded Lights Out Manager Administration Guide For the Sun Fire X2200 M2 and Sun Fire X2100 M2 Servers (2009). 819-6588-14.
    [26] Oracle, Inc., Oracle Integrated Lights Out Manager (ILOM) 3.0 Getting Started Guide (2010c). 820-5523-11.
    [27] Oracle, Inc., Oracle VM Server for SPARC (2010e). (formerly LDOMS).
    [28] Sun Microsystems, Inc., Sun SNMP Management Agent for Sun Fire and Netra Systems (2004).
    [29] Oracle, Inc., Sun Server Management Agents 2.0 User's Guide (2010b). 821-1610.
    [30] K. McCloghrie and A. Bierman, RFC2737: Entity MIB (Version 2) (IETF, 1999). Obsoleted by RFC 4133.
    [31] Intel, HP, NEC, and Dell, Platform Event Trap Format Specification. v1.0.
    [32] Oracle, Inc., Auto Service Request (ASR) v2.6 Installation and Operations Guide (2010a). http://wikis.sun.com/display/ASRSO/Home.
    [33] D. Laurie, IPMItool (2007). http://ipmitool.sourceforge.net/.
    [34] Oracle, Inc., IPMItool (2010d). http://www.sun.com/system-management/tools.jsp.
    [35] Sun Microsystems, Inc., PSARC 2008/119 sun4v /dev/bmc (Sun Microsystems, Inc., 2008). (not available publicly).
    [36] Zenoss, Inc., Zenoss Administration (2010b).
    [37] Zenoss, Inc., Zenoss Developer's Guide (2010c).
    [38] Zenoss, Inc., Zenoss 2.5 source code documentation (Zenoss, Inc., 2010a).
    [39] J. Curry, Zenoss Event Management (2010). version 3.
    [40] J. Curry, Creating Zenoss ZenPacks (Jane Curry, 2009a).
    [41] J. Curry, Crafting Zenoss Core users for events and zProperties (2009b). draft.
    [42] Sun Microsystems, Inc., Monitoring Sun Servers in an IBM Tivoli Enterprise Console Environment (2009b).
    [43] Sun Microsystems, Inc., Monitoring Sun Servers in an IBM Tivoli Netcool/OMNIbus Environment (2009a).
    [44] N. Brockett, batchaddlocations.py (Zenoss, Inc., 2009).


    A CD Contents

    As a part of this project, a CD was created. It contains the following files and directories:

    Others/: Directory containing other documents.
    Project/: Directory containing PDF file of this project.
    RFC/: Directory containing RFCs.
    ZenPack/: Directory containing source files for ZenPack.
    README: Description of files on CD.
