NGOP Overview J.Fromm K.Genser T.Levshina M.Mengel

NGOP Overview

J. Fromm, K. Genser, T. Levshina, M. Mengel

6/24/2001 Large Scale Cluster Computing Workshop at Fermilab


Next Generation Operation Group

Integrated Systems Development Department: Krzysztof Genser, Terry Jones, Tanya Levshina, Igor Mandrichenko, Don Petravick

Operating Systems Support Department: Troy Dawson, Jim Fromm, Lisa Giacchetti, Marc Mengel, Ken Schumacher, Steven Timm

Computing Services Department: Rick Thies, Rich Thompson



Current way of monitoring

• Various monitoring tools, thus no comprehensive picture of the status of services
– Xfalive
– Patrol
– NOC (network)
– Fermi software (Enstore, FBS ….)

• Actions are initiated by users' problem reports
– Sometimes misleading information
– Postmortem investigation



Fermi Computing Environment

• Heterogeneous clusters
– Various OSs
– Different services (batch, interactive, farms)

• Various sets of applications (lsf, fbs, enstore, sam)

• Mixed management
– system administrators
– software administrators

• Computer Services Department (CSD) provides a single point of contact for reporting problems



NGOP Goals

• Active monitoring
• Problem diagnostics
• Early error detection and problem prevention
• Centralized data collection
• Status of service evaluation
• Execution of corrective and notification actions
• Performance analysis



NGOP Project Phases

8/1999 – 3/2000: Creation of NGOP group. Gathering requirements for Distributed Monitoring System. Evaluation of available commercial and freeware products.

3/2000 – 12/2000: Design and development of NGOP prototype.

1/2001 – present: Prototype deployment on the farms. Farms monitoring by system administrators and operators. Prototype evaluation. Extending “xfalive” service to all nodes monitored by CSD.



Prototype Statistics

• Some implementation details:
– Written primarily in Python (some modules in C)
– Uses XML (and partially MathML) for all configuration files

• Some deployment details:
– Monitoring a total of 512 nodes
  • Checking for node being down and node reset
  • On four farms (CDF, D0, two Fixed Target experiment farms; 270 nodes):
    – System daemons presence
    – Critical file systems presence and size
    – CPU load, memory and swap utilization
    – Number of users and users’ processes
    – Number of processors off-line
    – Baseboard temperature and fan speed
    – NFS timeouts
    – Disk errors
– Number of Monitored Objects ~ 6,500
– About 5 instances of “ngop monitor” (GUI) are running simultaneously
– Events are stored in an Oracle database
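To illustrate how an XML configuration can expand into thousands of monitored objects, here is a minimal Python sketch. The element and attribute names (`farm`, `check`, `prefix`, `first`, `last`, `system`, `element`) are invented for this example and are not the real NGOP schema. The id layout `system.host.element.host` is an assumption inferred from the sample ids shown on the Report Generator slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: element and attribute names are invented for
# this sketch and are not the real NGOP schema.
CONFIG = """
<farm name="FixTarget" prefix="fnpc" first="201" last="250">
  <check system="OSHealth" element="cpuLoad"/>
  <check system="OSHealth" element="memory"/>
  <check system="Hardware" element="baseTemp"/>
</farm>
""".strip()


def expand_monitored_objects(xml_text):
    """Return one monitored object id per (node, check) pair in the config."""
    farm = ET.fromstring(xml_text)
    prefix = farm.get("prefix")
    first, last = int(farm.get("first")), int(farm.get("last"))
    ids = []
    for number in range(first, last + 1):
        host = f"{prefix}{number}"
        for check in farm.findall("check"):
            # Assumed id layout: system.host.element.host
            ids.append(
                f"{check.get('system')}.{host}.{check.get('element')}.{host}"
            )
    return ids


objects = expand_monitored_objects(CONFIG)
print(len(objects), objects[0])
```

With 50 nodes and 3 checks this yields 150 ids, which suggests how a few hundred nodes with a dozen or so checks each could reach the order of the ~6,500 monitored objects quoted above.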



Current Configuration

[Diagram: current NGOP configuration. Labels recovered from the figure:]

• Central components: NGOP Central Server, Config File Management, Archive Service, Action Client, MAs (Ping); server FNCDUH
• NGOP Monitor instances on user nodes
• Farms, each shown with MA (OSHealth), an FBS MA and Swatch:
– Old FixTarget Farm: fnpc 1 - 37, fnsfh, MA (OFT_FBS)
– CDF Farm: fncdf 1 - 90, cdffarm1, MA (CDF_FBS)
– FixTarget Farm: fnpc 201 - 250, fnsfo, MA (FT_FBS)
– D0 Farm: fnd0 1 - 100, d0bbin, MA (D0_FBS)
• Other node groups shown: WWW, Mail Servers, License Servers, Enstore, CMS, SDSS, Division Servers, MISCOMP, Kerberos, D0, CDF, BTEV, FNALU, KTEV, MINOS, ODS, HPSS, PPD



Summary Of Occurred Events

• Detected Problems:

– Node reset

– Node is down

– One CPU is missing after reboot

– File system not mounted

– System daemon is dead

– FBS Batch Manager is down

• Raised Alarms:

– Memory usage is high

– Swap usage is high

– CPU Load is high

– File System is full

– Baseboard temperature is high

– Specific messages found in syslog: NFS timeouts, drive timeouts …
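The syslog-based alarms can be pictured as pattern matching over log lines. The sketch below is only an illustration in Python: the regular expressions and the sample log lines are made up for this example, and it does not claim to reproduce how Swatch or the NGOP agents actually match messages.

```python
import re

# Hypothetical patterns for the two message classes named on the slide;
# real syslog wording varies by kernel and driver.
PATTERNS = {
    "nfs timeout": re.compile(r"nfs: server \S+ not responding", re.IGNORECASE),
    "drive timeout": re.compile(r"\b(hd|sd)[a-z]+: .*timeout", re.IGNORECASE),
}


def scan(lines):
    """Yield (label, line) for every line matching a known pattern."""
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                yield label, line
                break  # report each line at most once


# Made-up sample lines, for illustration only.
SAMPLE = [
    "Jun 24 10:01:02 fnpc204 kernel: nfs: server fnsfo not responding, still trying",
    "Jun 24 10:01:09 fnpc204 kernel: hda: irq timeout: status=0xd0",
    "Jun 24 10:02:00 fnpc204 sshd[411]: session opened",
]

alarms = list(scan(SAMPLE))
for label, line in alarms:
    print(label, "->", line)
```

Keeping the patterns in a dictionary keyed by alarm label makes it easy to add further message classes without touching the scanning loop.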



GUI Monitor Snapshots



Report Generator (MISCOMP Web Query Interface)

Monitoring Agent id | Monitored Object id               | Event type | Event value | Description
fnpc242_health      | OSHealth.fnpc242.cpuLoad.fnpc242  | sysUsage   | 5.88        | Average load on the node is less or equal to 8 and greater than 5
fnpc208_health      | OSHealth.fnpc208.memory.fnpc208   | sysUsage   | 86          | Memory usage is greater or equal to 80% and less 95%
fnpc204_health      | Hardware.fnpc204.baseTemp.fnpc204 | Hardware   | 45.0        | Temperature is between 45C and 50C
fnpc108_health      | OSHealth.fnpc108.rstatd.fnpc108   | Daemon     | 0           | rstatd is not running
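The descriptions in these sample events read like threshold rules applied to measured values. A minimal Python sketch of such rule evaluation follows. The bounds are read off the descriptions above, but the rule structure, the `describe` function, and the choice to treat the 50C upper bound as inclusive are assumptions made for illustration, not NGOP's actual rule format.

```python
# Each rule: (event type, element, predicate, description).
# Bounds come from the sample descriptions; the structure is hypothetical.
RULES = [
    ("sysUsage", "cpuLoad", lambda v: 5 < v <= 8,
     "Average load on the node is less or equal to 8 and greater than 5"),
    ("sysUsage", "memory", lambda v: 80 <= v < 95,
     "Memory usage is greater or equal to 80% and less 95%"),
    ("Hardware", "baseTemp", lambda v: 45 <= v <= 50,  # inclusive bound assumed
     "Temperature is between 45C and 50C"),
    ("Daemon", "rstatd", lambda v: v == 0,
     "rstatd is not running"),
]


def describe(element, value):
    """Return (event type, description) for the first matching rule, else None."""
    for event_type, rule_element, matches, description in RULES:
        if rule_element == element and matches(value):
            return event_type, description
    return None


samples = [("cpuLoad", 5.88), ("memory", 86), ("baseTemp", 45.0), ("rstatd", 0)]
for element, value in samples:
    print(element, value, describe(element, value))
```

A value outside every range, such as a load of 4.0, returns `None` in this sketch, meaning no event would be reported for it.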



What’s next?

• NGOP Production (end of summer 2001)
• Wish List:
– Provide Monitoring Client API
– Implement Correlation (aka Looping) Agents
– Implement historical rules and escalating alarms
– Implement “snapshot” (“give me the updated system status now”) feature
– Provide other than Python Monitoring Agent API
– Fully Kerberize
– Provide Standard Win2000 Monitoring Agents
– Design and provide dynamic handling of configuration changes for the Monitoring Client
– Allow for easier handling of multiple configurations
– Improve Admin (Configuration Client) Client GUI
– Provide Configuration GUI (hoping for a good free XML Editor though)
– Provide Performance Data Framework
– Redesign/Rewrite GUI (for scalability and friendliness)
– Provide GUI for non-Linux platforms if really needed
– Work on scalability up to 10000 hosts



More Info

url: http://www-isd.fnal.gov/ngop/

E-mail: [email protected]

