NGOP Overview
J. Fromm, K. Genser, T. Levshina, M. Mengel
6/24/2001 Large Scale Cluster Computing Workshop at Fermilab
Next Generation Operation Group
Integrated Systems Development Department: Krzysztof Genser, Terry Jones, Tanya Levshina, Igor Mandrichenko, Don Petravick
Operating Systems Support Department: Troy Dawson, Jim Fromm, Lisa Giacchetti, Marc Mengel, Ken Schumacher, Steven Timm
Computing Services Department: Rick Thies, Rich Thompson
Current way of monitoring
• Various monitoring tools, thus no comprehensive picture of the status of services
– Xfalive
– Patrol
– NOC (network)
– Fermi software (Enstore, FBS, …)
• Actions are initiated by users' problem reports
– Sometimes misleading information
– Postmortem investigation
Fermi Computing Environment
• Heterogeneous clusters
– Various OSs
– Different services (batch, interactive, farms)
• Various sets of applications (LSF, FBS, Enstore, SAM)
• Mixed management
– System administrators
– Software administrators
• Computer Services Department (CSD) provides a single point of contact for reporting problems
NGOP Goals
• Active monitoring
• Problem diagnostics
• Early error detection and problem prevention
• Centralized data collection
• Evaluation of service status
• Execution of corrective and notification actions
• Performance analysis
NGOP Project Phases
8/1999 – 3/2000: Creation of the NGOP group. Gathering requirements for a Distributed Monitoring System. Evaluation of available commercial and freeware products.
3/2000 – 12/2000: Design and development of the NGOP prototype.
1/2001 – present: Prototype deployment on the farms. Farm monitoring by system administrators and operators. Prototype evaluation. Extending the “xfalive” service to all nodes monitored by CSD.
Prototype Statistics
• Some implementation details:
– Written primarily in Python (some modules in C)
– Uses XML (and partially MathML) for all configuration files
• Some deployment details:
– Monitoring a total of 512 nodes
  • Checking for node being down and node reset
  • On four farms (CDF, D0, two Fixed Target experiment farms; 270 nodes):
    – System daemons presence
    – Critical file systems presence and size
    – CPU load, memory and swap utilization
    – Number of users and users’ processes
    – Number of processors off-line
    – Baseboard temperature and fan speed
    – NFS timeouts
    – Disk errors
– Number of monitored objects: ~6,500
– About 5 instances of “ngop monitor” (GUI) are running simultaneously
– Events are stored in an Oracle database
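The slides state that configuration is kept in XML but do not show the format. As a minimal sketch, assuming a hypothetical layout (the element names, attribute names, and threshold values below are invented for illustration and are not NGOP's actual schema), a Python agent could load its thresholds like this:

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; NGOP's real XML schema is not shown in the slides.
CONFIG = """
<agent name="example_health">
  <monitored_object id="cpuLoad" warn="5" alarm="8"/>
  <monitored_object id="memory" warn="80" alarm="95"/>
</agent>
"""

def load_thresholds(xml_text):
    """Return {monitored object id: (warn, alarm)} parsed from the XML text."""
    root = ET.fromstring(xml_text)
    return {
        mo.get("id"): (float(mo.get("warn")), float(mo.get("alarm")))
        for mo in root.findall("monitored_object")
    }

thresholds = load_thresholds(CONFIG)
```

Keeping thresholds in a configuration file rather than in agent code would let administrators tune checks per farm without touching the agent itself.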
Current Configuration
[Diagram: NGOP deployment. Component labels: NGOP Central Server, Config File Management, Archive Service, Action Client, Server FNCDUH, MAs (Ping), and NGOP Monitor instances on user nodes.
Monitored farms, each shown with Swatch and two Monitoring Agents (MA):
– Old FixTarget Farm: fnpc 1 - 37, fnsfh; MA(OSHealth), MA(OFT_FBS)
– CDF Farm: fncdf 1 - 90, cdffarm1; MA(OSHealth), MA(CDF_FBS)
– FixTarget Farm: fnpc 201 - 250, fnsfo; MA(OSHealth), MA(FT_FBS)
– D0 Farm: fnd0 1 - 100, d0bbin; MA(OSHealth), MA(D0_FBS)
Other labels in the diagram: WWW, Mail Servers, License Servers, Enstore, CMS, SDSS, Division Servers, MISCOMP, Kerberos, D0, CDF, BTEV, FNALU, KTEV, MINOS, ODS, HPSS, PPD.]
Summary Of Occurred Events
• Detected problems:
– Node reset
– Node is down
– One CPU is missing after reboot
– File system not mounted
– System daemon is dead
– FBS Batch Manager is down
• Raised alarms:
– Memory usage is high
– Swap usage is high
– CPU load is high
– File system is full
– Baseboard temperature is high
– Specific messages found in syslog: NFS timeouts, drive timeouts, …
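The syslog-based alarms can be illustrated with a small pattern scanner. This is only a sketch of the general idea: the regular expressions, the event names, and the sample log lines are all assumptions made for illustration, since the slides do not give the actual Swatch rules or message texts.

```python
import re

# Hypothetical patterns; the real rules used on the farms are not given.
PATTERNS = {
    "nfs_timeout": re.compile(r"nfs: server \S+ not responding", re.IGNORECASE),
    "drive_timeout": re.compile(r"\b(?:hd|sd)[a-z]+: .*timeout", re.IGNORECASE),
}

def scan_syslog(lines):
    """Yield (event type, line) for every line matching a known pattern."""
    for line in lines:
        for event_type, pattern in PATTERNS.items():
            if pattern.search(line):
                yield event_type, line.rstrip("\n")
                break  # report each line at most once

# Made-up sample lines, not real log output.
SAMPLE = [
    "Jun 24 10:01:02 node1 kernel: nfs: server srv1 not responding, still trying\n",
    "Jun 24 10:01:05 node1 sshd[123]: session opened\n",
    "Jun 24 10:02:11 node1 kernel: hda: lost interrupt, timeout\n",
]
events = list(scan_syslog(SAMPLE))
```

A generator keeps the scanner usable on a live log stream as well as on a stored file, because lines are processed one at a time.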
GUI Monitor Snapshots
Report Generator (MISCOMP Web Query Interface)
Monitoring Agent id | Monitored Object id | Event type | Event value | Description
fnpc242_health | OSHealth.fnpc242.cpuLoad.fnpc242 | sysUsage | 5.88 | Average load on the node is less than or equal to 8 and greater than 5
fnpc208_health | OSHealth.fnpc208.memory.fnpc208 | sysUsage | 86 | Memory usage is greater than or equal to 80% and less than 95%
fnpc204_health | Hardware.fnpc204.baseTemp.fnpc204 | Hardware | 45.0 | Temperature is between 45C and 50C
fnpc108_health | OSHealth.fnpc108.rstatd.fnpc108 | Daemon | 0 | rstatd is not running
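The event descriptions in the report read like range rules applied to each reported value. A minimal sketch of such a check follows, assuming that the third dot-separated field of a Monitored Object id names the monitored element. The `Rule` class, the `RULES` table, and the `evaluate` function are hypothetical names and not NGOP's API; only the numeric ranges are taken from the report rows.

```python
from typing import NamedTuple, Optional

class Rule(NamedTuple):
    low: float
    high: float
    low_inclusive: bool
    high_inclusive: bool
    description: str

    def matches(self, value: float) -> bool:
        above = value >= self.low if self.low_inclusive else value > self.low
        below = value <= self.high if self.high_inclusive else value < self.high
        return above and below

# Ranges copied from the report; the rule structure itself is an assumption.
RULES = {
    "cpuLoad": Rule(5, 8, False, True,
                    "Average load on the node is less than or equal to 8 and greater than 5"),
    "memory": Rule(80, 95, True, False,
                   "Memory usage is greater than or equal to 80% and less than 95%"),
}

def evaluate(object_id: str, value: float) -> Optional[str]:
    """Return the matching description, or None if no rule fires."""
    element = object_id.split(".")[2]  # assumed layout: system.host.element.host
    rule = RULES.get(element)
    if rule is not None and rule.matches(value):
        return rule.description
    return None

load_event = evaluate("OSHealth.fnpc242.cpuLoad.fnpc242", 5.88)
memory_event = evaluate("OSHealth.fnpc208.memory.fnpc208", 86)
```

Making the inclusive and exclusive bounds explicit avoids gaps or overlaps when several adjacent ranges are defined for the same element.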
What’s next?
• NGOP Production (end of summer 2001)
• Wish list:
– Provide Monitoring Client API
– Implement Correlation (aka Looping) Agents
– Implement historical rules and escalating alarms
– Implement “snapshot” (“give me the updated system status now”) feature
– Provide a Monitoring Agent API for languages other than Python
– Fully Kerberize
– Provide standard Win2000 Monitoring Agents
– Design and provide dynamic handling of configuration changes for the Monitoring Client
– Allow for easier handling of multiple configurations
– Improve Admin Client (Configuration Client) GUI
– Provide Configuration GUI (hoping for a good free XML editor though)
– Provide Performance Data Framework
– Redesign/rewrite GUI (for scalability and friendliness)
– Provide GUI for non-Linux platforms if really needed
– Work on scalability up to 10,000 hosts
More Info
url: http://www-isd.fnal.gov/ngop/
E-mail: [email protected]