BNL: ATLAS Computing 1
A Scalable and Resilient PanDA Service
US ATLAS Computing Facility and Physics Application Group
Brookhaven National Lab
Presented by Dantong Yu
Motivation and Outline
Build a scalable and resilient PanDA service for ATLAS:
- Support ATLAS VOs and thousands of ATLAS users and jobs.
- Reliable, scalable, and high performance.
- Cost-effective and flexible deployment.
A joint effort between the Physics Application Group and the RACF Grid Computing Group to deploy and operate every component in the PanDA system.
In this talk:
- BNL PanDA architecture
- PanDA components
- PanDA hardware
- Required software infrastructure and grid middleware
- Infrastructure and procedure to download and install required RPMs
- Nagios-based PanDA monitoring systems
- Operation procedures
- Experienced problems
BNL PanDA architecture
Reliable/High Performance ATLAS Job Management Architecture (PanDA)
[Figure: architecture diagram. Labels: Clients, VIP, Virtual Services, Physical Servers (PanDA Server, Mnt. Server, AutoPilot, PanDA DB, PanDA Archive).]
The F5 server load-balancing switch rewrites the source and destination addresses in the IP header (IP relay).
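The idea of one virtual IP (VIP) fronting several physical servers can be illustrated with a small toy model. This is only a sketch under stated assumptions: the round-robin policy, the `healthy` flag, and all addresses are hypothetical, and nothing here describes how the F5 switch is implemented or configured at BNL.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Backend:
    address: str          # hypothetical physical server address
    healthy: bool = True  # a real balancer would set this from health checks


class VirtualService:
    """Toy model of one virtual IP (VIP) in front of several physical servers."""

    def __init__(self, vip, backends):
        self.vip = vip
        self.backends = list(backends)
        self._ring = cycle(self.backends)

    def pick_backend(self):
        # Round-robin over the backends, skipping any marked unhealthy.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate.healthy:
                return candidate
        raise RuntimeError("no healthy backend behind " + self.vip)

    def relay(self, packet):
        # The client addressed the VIP. The relayed packet gets a new
        # destination (the chosen server) and a new source (the VIP), so
        # replies come back through the balancer.
        backend = self.pick_backend()
        return {
            "src": self.vip,
            "dst": backend.address,
            "client": packet["src"],
            "payload": packet["payload"],
        }
```

Because clients only ever see the VIP, a server marked unhealthy simply stops receiving new requests, which is what makes stateless services easy to place behind such a switch.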
Panda Components
Production Panda Components
Production system:
- Front-end load balancers: the F5 switch provides load balancing and reliability. Its transparency allows flexible management of the heterogeneous service, with only minimal application-level configuration and coding necessary to support integration with the smart switch.
- Panda Server, Panda Monitoring Service, and Panda Logging Servers (stateless):
  - Panda Server: dispatches jobs to pilots as they request them; HTTPS-based and stateless. It needs to connect to the central Panda DB.
  - Panda Monitoring Service: provides a graphical, read-only view of Panda functions via HTTP. The GUI is also stateless. It needs to connect to the central Panda DB.
  - Panda Logging Servers: log Panda Server events into the Panda DB.
- Autopilot submission systems (stateful): use Condor-G and site gatekeepers to fill sites with pilots.
- Panda pilot wrapper code distributor: Subversion with a Web front-end. The pilot wrapper script is downloaded dynamically from the Subversion web cache.
- Panda database system.
https://www.racf.bnl.gov/experiments/usatlas/gridops/griduiconfig/
Panda Development and Testbed Systems
Panda testbed and development systems:
- Panda Monitoring Service and Panda Server
- Database system
Panda Hardware
Each component group requires a separate set of hosts and hardware. Most servers should be standalone, with a few exceptions.
- Front-end load balancers: two F5 3600 load-balancing switches.
- Panda Monitor, Panda Server, and Panda Logging Servers: three servers, each with dual quad-core Intel Xeon E5335 CPUs @ 2.00GHz (eight cores per host), 16 GB memory, and six 750GB SATA drives (software RAID 10 provides 2TB of local storage).
- Autopilot submission systems (local pilots and global pilots), stateful: four servers, each with dual quad-core Intel Xeon E5430 CPUs @ 2.66GHz (eight cores per host), 16 GB memory, and two 750GB SATA drives (mirrored disks).
- Panda pilot wrapper code distributor (Subversion with Web front-end): one server with dual quad-core Intel Xeon E5430 CPUs @ 2.66GHz (eight cores per host), 8 GB memory, and two 150GB SAS drives (mirrored disks). An archive system is needed to recover if disk storage is lost.
- Web Apache server.
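As a quick check of the "2TB" figure quoted for six 750GB drives in RAID 10, the usable capacity can be worked out directly. This is a back-of-the-envelope sketch: it assumes drive sizes are in decimal gigabytes and ignores filesystem and RAID metadata overhead, neither of which the slides specify.

```python
def raid10_usable_gb(drive_count, drive_gb):
    """Usable capacity of RAID 10: drives are mirrored in pairs, then striped."""
    if drive_count < 4 or drive_count % 2:
        raise ValueError("RAID 10 needs an even number of drives, at least four")
    return (drive_count // 2) * drive_gb


def raid1_usable_gb(drive_gb):
    """A two-disk mirror exposes the capacity of a single drive."""
    return drive_gb


def gb_to_tib(gigabytes):
    """Convert decimal gigabytes (10**9 bytes) to binary tebibytes (2**40 bytes)."""
    return gigabytes * 10**9 / 2**40


# Six 750GB drives in RAID 10: three mirrored pairs striped together.
usable = raid10_usable_gb(6, 750)       # 2250 GB
usable_tib = gb_to_tib(usable)          # a little over 2 TiB
```

Half of the raw 4500 GB is spent on mirroring, leaving 2250 GB, which is consistent with the roughly 2TB stated for these hosts. The mirrored two-disk hosts expose 750 GB and 150 GB respectively.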
BNL ATLAS MySQL Production and Development Servers
The following BNL production MySQL servers are used:
- 2 Panda-production MySQL servers (InnoDB): primary and spare, dual dual-core processors with 16GB memory and a 64-bit OS.
- 4 Panda-archive MySQL servers (MyISAM): 2 primary + 2 spare, 2 quad-core processors with 16GB memory and a 64-bit OS.
- Daily text-based backup (database content) of all databases on the production servers above, with an extra disk copy on a special data server that has an interface to tape.
- 64-bit architecture, x86_64, 2.6.9-55.0.9.Elsm.
- Six 15k rpm SAS drives, each with 145GB of disk space.
Details can be found at https://www.racf.bnl.gov/experiments/usatlas/gridops/atlasdbinfo.
ATLAS MySQL Production Databases at BNL: Details and Performance
Panda production MySQL server and its replica server with identical hardware:
- “Fast-buffer” database.
- Keeps the information about all Panda-managed reprocessing, MC-production, and user-analysis jobs for up to 2 weeks; a cron job periodically moves the data into the archive.
- Designed initially for USATLAS; since September 2007 it supports 10 different ATLAS clouds (CERN, CA, DE, ES, FR, NL, UK, US, TW and 2 instances for Nordugrid: ND, NDGF).
- Runs MySQL version 5.0.X.
- Engine InnoDB, simple structure, autoincrement for IDs, no foreign keys.
- 31 tables, max number of rows ~16,500,000.
- Provides fast, multiple parallel connections to the basic Panda components: Panda-server, Panda-monitor, and Logger.
Performance:
- Access pattern: ~380-440 parallel threads open simultaneously all the time (max ~600).
- Throughput: average ~360 q/sec (max > 800).
- Query types: select ~35%, update ~35%, insert ~25%, others (delete, etc.) ~5%.
- Monitoring interface (Panda-monitor): http://pandamon.usatlas.bnl.gov:25880/server/pandamon/query?dash=prod
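The periodic move of old job records from the fast-buffer database into the archive can be sketched as a copy-then-delete step. This is an illustration only: it uses an in-memory SQLite database as a stand-in for the two MySQL servers, and the table names `jobs_active` and `jobs_archive`, the column names, and the function names are all hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta

RETENTION = timedelta(days=14)  # "up to 2 weeks" in the fast-buffer database


def make_demo_db():
    """Create an in-memory database with a hypothetical active/archive table pair."""
    conn = sqlite3.connect(":memory:")
    for table in ("jobs_active", "jobs_archive"):
        conn.execute(
            f"CREATE TABLE {table} "
            "(job_id INTEGER PRIMARY KEY, status TEXT, modified TEXT)"
        )
    return conn


def archive_old_jobs(conn, now):
    """Copy jobs older than RETENTION into the archive table, then delete them."""
    cutoff = (now - RETENTION).isoformat()
    with conn:  # single transaction: both statements apply, or neither does
        conn.execute(
            "INSERT INTO jobs_archive (job_id, status, modified) "
            "SELECT job_id, status, modified FROM jobs_active WHERE modified < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM jobs_active WHERE modified < ?", (cutoff,)
        ).rowcount
    return moved
```

In the setup described on the slides the two tables live on different servers (InnoDB for production, MyISAM for the archive), so a single transaction like the one above would not be available; a real job would need to confirm the copy before deleting.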
Critical DBs on Four Panda Archival Database Servers
Panda Archive production MySQL server (along with a spare node):
- Database PandaArchiveDB: keeps the full archive of Panda-managed reprocessing, Monte Carlo production, and user analysis jobs since the end of 2005.
- Engine MyISAM, no autoincrement, replication from PandaDB through cron jobs.
- Partitioning: bi-monthly structure of job/file archive tables for better search performance.
- 44 tables, max number of rows ~33,000,000 per table.
Databases PandaLogDB and PandaMetaDB:
- Keep the archive of log-extract files for jobs, some monitoring information about pilots, and autopilot and scheduler-configuration support (schedconfig).
- Engine MyISAM, ~52-54 tables.
- Partitioning: bi-monthly structure for some tables.
- Max number of rows ~4,600,000 per table.
Access and performance:
- Access pattern: ~400-450 parallel threads open (max ~740).
- Performance: average ~1300-1600 q/sec (max ~2800); select ~80%, insert ~20%.
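The bi-monthly partitioning of the archive tables can be pictured as a mapping from a date to a per-period table. The sketch below assumes "bi-monthly" means one table per two-month period (January-February, March-April, and so on) and uses a made-up naming scheme `<base>_<year>_<first month>`; the actual table names are not given in the slides.

```python
from datetime import date


def bimonthly_table(base, day):
    """Return a hypothetical table name for the two-month period containing `day`."""
    first_month = day.month - (day.month - 1) % 2  # 1, 3, 5, 7, 9 or 11
    return f"{base}_{day.year}_{first_month:02d}"


def tables_for_range(base, start, end):
    """List the tables a search over the dates start..end would need to touch."""
    names = []
    year = start.year
    month = start.month - (start.month - 1) % 2
    while (year, month) <= (end.year, end.month):
        names.append(f"{base}_{year}_{month:02d}")
        month += 2
        if month > 12:
            year, month = year + 1, 1
    return names
```

A search restricted to a date range then only scans the few tables returned by `tables_for_range` rather than the whole archive, which is the search-performance benefit the slide refers to.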
Panda Server Infrastructure
PanDA Software Infrastructure
OS, grid middleware, and software requirements:
- OS (RHEL/SL 4) RPMs: mod_ssl, subversion, rrdtool, openmpi, gridsite, graphtool, matplotlib, MySQL.
- Glite-UI 3.1: setup from /etc/profile.d/.
- CA certificates installed/updated.
- Unix accounts w/ ssh-key access: sm
- Python 2.5 (from Tadashi) RPMs: python25, mod_python25, python25-curl, python25-numeric, MySQL-python25, python25-imaging.
PanDA Autopilot
- Glite-UI 3.1: setup from /etc/profile.d/
- CA certificates installed/updated
- Unix accounts w/ ssh-key access: sm (sm2 for grid autopilot, usatlas1 for local submission)
- Condor 7.3.0 w/ custom configuration
Panda System OS Administration
- Initial install: a semi-manual setup script is at /afs/usatlas.bnl.gov/mgmt/etc/gridui.usatlas.bnl.gov/system-setup.sh
- Ongoing package maintenance: BNL Red Hat satellite system.
- Condor admin: on systems with global Condor, config changes and restarts require root.
- Account management: occasional SSH key additions for new team members.
Panda Monitoring Systems
https://www.usatlas.bnl.gov/nagios/sla_array.html
USATLAS Tier 2 Sites
https://www.usatlas.bnl.gov/nagios/tier2.html
MySQL Servers Monitoring
We use three monitoring tools for MySQL servers:
- MySQLStat: provides a monitoring service for the internal ATLAS community, covering BNL ATLAS MySQL servers, CERN MySQL servers, and some other MySQL servers in the USA and Europe.
- Ganglia
- Nagios: provides critical server status, sends warnings and alarms if a service has a problem, opens RT tickets, and can do some simple automatic recovery.
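A Nagios check reports status through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and a one-line message. The sketch below shows the shape such a check could take for the number of open MySQL threads. The function name and the thresholds of 600 and 740 are hypothetical choices for illustration, not the values configured at BNL.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_threads(threads_connected, warn=600, crit=740):
    """Map a MySQL thread count to a Nagios status code and a one-line message."""
    if threads_connected is None:
        return UNKNOWN, "UNKNOWN - could not read thread count"
    if threads_connected >= crit:
        status = CRITICAL
    elif threads_connected >= warn:
        status = WARNING
    else:
        status = OK
    # Text after "|" is performance data in the Nagios plugin output format.
    message = (
        f"{LABELS[status]} - {threads_connected} threads connected"
        f" | threads={threads_connected};{warn};{crit}"
    )
    return status, message
```

A real plugin would obtain the count from the server, print the message, and exit with the status code; Nagios then decides from that code whether to raise a warning or an alarm.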
MySQL Servers Monitoring: MySQLstat
Panda Operation Procedure
[Figure: alert flow diagram. Labels: machines and services monitored by Nagios, Nagios, SLA, RT, OSG Footprints, GGUS.]
- In case of a failure of a critical machine or service, Nagios generates alarms and sends email alarms to the SLA system. When the service recovers, Nagios sends a notification to the SLA system again.
- The RACF SLA system provides a configurable alarm management layer that automates service alerts from the Nagios-based monitoring system.
- RT can exchange problem reports with external ticketing systems, such as OSG Footprints and GGUS.
- Escalation occurs if no response happens within the SLA-specified time window.
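The escalation rule, "escalate if there is no response within the SLA-specified time window", can be written down as a small predicate. This is a hypothetical sketch: the `Alarm` fields and the function name are invented for illustration and do not describe the RACF SLA system's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alarm:
    service: str
    raised_at: datetime
    acknowledged_at: Optional[datetime] = None  # set when someone responds
    recovered_at: Optional[datetime] = None     # set on the recovery notification


def needs_escalation(alarm, now, response_window):
    """True if nobody responded within the window and the service is still down."""
    if alarm.recovered_at is not None or alarm.acknowledged_at is not None:
        return False
    return now - alarm.raised_at > response_window


# Example: an alarm raised at 02:00 with a one-hour response window.
alarm = Alarm("panda-server", datetime(2009, 3, 1, 2, 0))
window = timedelta(hours=1)
still_waiting = needs_escalation(alarm, datetime(2009, 3, 1, 2, 30), window)  # False
overdue = needs_escalation(alarm, datetime(2009, 3, 1, 3, 30), window)        # True
```

The recovery notification matters here: once Nagios reports that the service is back, the alarm no longer escalates even if nobody acknowledged it.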
Experienced Operation Problems
Panda Server and Databases Problems
Panda server hanging:
- A cron job at the database server detects slow queries and disconnects the Panda server’s MySQL connection if it appears to be slow.
- Panda processes do not handle this disconnection and end up frozen.
- The Panda server had to be restarted either manually or automatically by Nagios.
Panda database server load:
- Enhanced the database monitoring capabilities, identified intrusive queries and the particular users and applications that initiated them, and worked with users to modify their MySQL queries.
- This effectively and significantly reduced the number of slow queries.
- Purchased licensed MySQL backup software to reduce the backup time.
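The server hang described above comes from a client that does not handle a connection being closed by the database side. One common way to handle it, shown here purely as an illustration and not as what the Panda code does, is to catch the disconnect, open a fresh connection, and retry. `ConnectionLost`, `run_query`, and `DemoConnection` are hypothetical names; a real MySQL driver raises its own exception type.

```python
class ConnectionLost(Exception):
    """Stand-in for the driver error raised when the server drops a connection."""


def run_query(connect, sql, retries=2):
    """Run `sql`, reconnecting if the server closed the connection mid-query."""
    last_error = None
    for _ in range(retries + 1):
        conn = connect()
        try:
            return conn.execute(sql)
        except ConnectionLost as err:
            last_error = err  # connection was killed; loop to open a fresh one
        finally:
            conn.close()
    raise RuntimeError(f"query failed after {retries + 1} attempts") from last_error


class DemoConnection:
    """Fake connection whose first query is cut off, as a watchdog kill would do."""

    attempts = 0

    def execute(self, sql):
        DemoConnection.attempts += 1
        if DemoConnection.attempts == 1:
            raise ConnectionLost("connection killed while query was running")
        return [("ok",)]

    def close(self):
        pass
```

Retrying is only safe for queries that can be repeated without side effects; an update that may have partly applied needs more care than this sketch provides.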
Condor-G Based Auto-Pilots
Problems:
- Condor-G and gatekeepers use GASS servers to synchronize job status; a large number of Condor-G jobs adds significant load and results in status loss and held jobs.
- Condor-G frequently freezes due to a large number of held jobs.
- The pilot job status reported by Condor-G is out of sync with the actual status of ATLAS jobs.
- Killing held pilot jobs caused good ATLAS jobs to abort early.
Work with the Univ. of Wisconsin to customize Condor-G:
- Stage-in and stage-out events are written into the user log for better diagnosis.
- More Condor-G tuning options for submitting and dispatching large numbers of jobs.
- More fine-tuning knobs with separate throttles, for example limiting jobmanagers by their role: submission vs. stage-out/removal.
- Efficiently process failed jobs and prevent bad jobs from clogging the submission system: when a gridmanager decides to put a job on hold, it instead uses the hold_reason as the abort_reason and aborts the job.
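The last customization, aborting a job at the point where it would otherwise be held, can be modelled in a few lines. This is a toy model, not Condor code: only `hold_reason` and `abort_reason` come from the slide, while the `PilotJob` class, the `state` values, and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PilotJob:
    job_id: int
    state: str = "submitted"  # "submitted", "running", "held" or "aborted"
    hold_reason: Optional[str] = None
    abort_reason: Optional[str] = None


def on_hold_decision(job, reason, abort_instead_of_hold=True):
    """Apply the policy from the slide: abort the job rather than holding it."""
    if abort_instead_of_hold:
        job.state = "aborted"
        job.abort_reason = reason  # the would-be hold_reason becomes the abort_reason
    else:
        job.state = "held"
        job.hold_reason = reason
    return job


def held_backlog(jobs):
    """Count jobs still sitting in the held state."""
    return sum(1 for job in jobs if job.state == "held")
```

With the policy enabled no job ever enters the held state, so the backlog of held jobs that the slide blames for the freezes cannot build up, and the reason for the failure is still recorded on the job.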
Panda Monitoring
- The front-end switch system hung due to expired licenses.
- The new Python version and Oracle clients require manual compilation.
- The certificate authority does not issue certificates with a DN containing a wildcard (*). Clients could not properly perform X.509 certificate-based authentication with multiple back-end servers behind the F5 switch.
Summary
Contributions:
- Innovation in hardware resilience, extensive monitoring, and automatic problem reporting and tracking.
- Significantly enhanced the reliability of the evolving Panda system.
- Support easy access to the system for software improvement.
Condor-G is slow to update pilot status, causing inconsistency between actual job status and Panda monitoring.
The frequent crashing of a Condor-G component was fixed after the Condor team provided Condor 7.3.0.