BNL: ATLAS Computing

A Scalable and Resilient PanDA Service

US ATLAS Computing Facility and Physics Application Group

Brookhaven National Lab

Presented by Dantong Yu




Motivation and Outline

Build a scalable and resilient PanDA service for ATLAS
- Support ATLAS VOs and thousands of ATLAS users and jobs.
- Reliable, scalable, and high performance.
- Cost-effective and flexible deployment.

A joint effort between the Physics Application Group and the RACF Grid Computing Group to deploy and operate every component in the PanDA system.

In this talk:
- BNL PanDA architecture
- PanDA components
- PanDA hardware
- Required software infrastructure and grid middleware
- Infrastructure and procedure to download and install required RPMs
- Nagios-based PanDA monitoring systems
- Operation procedures
- Experienced problems


BNL PanDA architecture


Reliable/High Performance ATLAS Job Management Architecture (PanDA)

[Figure: Clients connect to a VIP on the F5 server load-balancing switch. The switch exposes Virtual Services that map to Physical Servers: PanDA Server, Mnt. Server, AutoPilot, PanDA DB, and PanDA Archive.]

The F5 server load-balancing switch rewrites the source and destination addresses in the IP header and relays the packets.


Panda Components


Production Panda Components

Production System

Front-end load balancers
- The F5 switch provides load balancing and reliability.
- Its transparency allows flexible management of the heterogeneous service, with only minimal application-level configuration and coding necessary to support integration with the smart switch.

Panda Monitoring Service, Panda Server, and Panda Logging Servers (stateless)
- Panda Server: dispatches jobs to pilots as they request them. HTTPS-based and stateless. It needs to connect to the central Panda DB.
- Panda Monitoring Service: provides a graphical, read-only view of Panda operation via HTTP. The GUI is also stateless. It needs to connect to the central Panda DB.
- Panda Logging Server: logs Panda Server events into the Panda DB.

Autopilot submission systems (stateful)
- Use Condor-G and site gatekeepers to fill sites with pilots.

Panda pilot wrapper code distributor: Subversion with Web front end
- The pilot wrapper script is downloaded dynamically from the Subversion web cache.

Panda Database System

https://www.racf.bnl.gov/experiments/usatlas/gridops/griduiconfig/
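Because the Panda Server, Monitoring Service, and Logging Server are stateless, any healthy host behind the virtual IP can answer any request. The sketch below illustrates that idea with a simple round-robin selection that skips hosts marked down. It is only an illustration: the class, its methods, and the host names are hypothetical, and it does not describe how the F5 switch is actually configured or which balancing policy it uses.

```python
from itertools import count


class VirtualService:
    """Toy model of one virtual IP (VIP) in front of stateless backends."""

    def __init__(self, vip, backends):
        self.vip = vip
        self.backends = list(backends)      # physical servers, in fixed order
        self.healthy = set(self.backends)   # assume all are up initially
        self._ticket = count()              # monotonically increasing counter

    def mark_down(self, backend):
        """Record a failed health check for one backend."""
        self.healthy.discard(backend)

    def mark_up(self, backend):
        """Record a recovered backend (ignored if it is not configured)."""
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend in round-robin order."""
        candidates = [b for b in self.backends if b in self.healthy]
        if not candidates:
            raise RuntimeError("no healthy backend behind " + self.vip)
        return candidates[next(self._ticket) % len(candidates)]


if __name__ == "__main__":
    # Host names are placeholders, not the real BNL machines.
    svc = VirtualService("panda-vip.example", ["server-a", "server-b", "server-c"])
    print([svc.pick() for _ in range(4)])
    svc.mark_down("server-b")
    print([svc.pick() for _ in range(2)])
```

The stateful Autopilot hosts do not fit this pattern, since a request cannot be sent to an arbitrary host when the host itself holds the submission state.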


Panda Development and Testbed Systems

- Panda Monitoring Service and Panda Server
- Database System


Panda Hardware


Panda Hardware

Each component group requires a separate set of hosts and hardware. Most servers should be standalone, with a few exceptions.

- Front-end load balancers: two F5 3600 load-balancing switches.
- Panda Monitor, Panda Server, and Panda Logging Servers: three servers, each with dual quad-core Intel Xeon E5335 CPUs @ 2.00GHz (eight cores per host), 16 GB memory, and six 750GB SATA drives (software RAID 10 providing 2TB of local storage).
- Autopilot submission systems (local pilots and global pilots), stateful: four servers, each with dual quad-core Intel Xeon E5430 CPUs @ 2.66GHz (eight cores per host), 16 GB memory, and two 750GB SATA drives (mirrored disks).
- Panda pilot wrapper code distributor (Subversion with Web front end): one server with dual quad-core Intel Xeon E5430 CPUs @ 2.66GHz (eight cores per host), 8 GB memory, and two 150GB SAS drives (mirrored disks). An archive system is needed to recover the data if disk storage is lost.
- Web Apache server


BNL ATLAS MySQL Production and Development Servers

The following BNL production MySQL servers are used:
- 2 Panda-production MySQL servers (InnoDB): primary and spare, dual dual-core processors with 16GB memory and a 64-bit OS.
- 4 Panda-archive MySQL servers (MyISAM): 2 primary + 2 spare, 2 quad-core processors with 16GB memory and a 64-bit OS.
- Daily text-based backup (database content) of all databases on the production servers above, with an extra disk copy on a special data server that has an interface to tape.
- 64-bit architecture, x86_64, 2.6.9-55.0.9.Elsm.
- Six 15k rpm SAS drives, each with 145GB disk space.

Details can be found at https://www.racf.bnl.gov/experiments/usatlas/gridops/atlasdbinfo.


ATLAS MySQL Production Databases at BNL: Details and Performance

Panda production MySQL server and its replica server with identical hardware:
- "Fast-buffer" database.
- Keeps the information about all Panda-managed reprocessing, MC-production, and user-analysis jobs for up to 2 weeks; a cron job periodically moves the data into the archive.
- Designed initially for USATLAS; since September 2007 it supports 10 different ATLAS clouds (CERN, CA, DE, ES, FR, NL, UK, US, TW, and 2 instances for Nordugrid: ND, NDGF).
- Runs MySQL version 5.0.X.
- InnoDB engine, simple structure, auto-increment for IDs, no foreign keys.
- 31 tables; maximum number of rows ~16,500,000.
- Provides fast, multiple parallel connections to the basic Panda components: Panda-server, Panda-monitor, and Logger.

Performance:
- Access pattern: ~380-440 parallel threads open simultaneously at all times (max ~600).
- Throughput: average ~360 queries/sec (max > 800).
- Query types: select ~35%, update ~35%, insert ~25%, others (delete, etc.) ~5%.
- Monitoring interface through Panda-monitor: http://pandamon.usatlas.bnl.gov:25880/server/pandamon/query?dash=prod
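The "fast buffer plus periodic archiving" pattern described above can be sketched as a small job that moves rows older than the retention window from a live table to an archive table. This is a minimal illustration only: it uses an in-memory SQLite database so that it runs anywhere, whereas the real system moves data between separate MySQL servers, and the table names (jobs_live, jobs_archive) and columns are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta

RETENTION = timedelta(days=14)  # "up to 2 weeks" in the live database


def make_demo_db():
    """Create an in-memory database with a live and an archive table."""
    conn = sqlite3.connect(":memory:")
    for table in ("jobs_live", "jobs_archive"):
        conn.execute(
            f"CREATE TABLE {table} (job_id INTEGER PRIMARY KEY, modified TEXT)"
        )
    return conn


def archive_old_jobs(conn, now):
    """Move jobs last modified before (now - RETENTION) into the archive.

    Returns the number of rows removed from the live table.
    """
    cutoff = (now - RETENTION).isoformat()
    with conn:  # copy and delete in one transaction
        conn.execute(
            "INSERT INTO jobs_archive "
            "SELECT job_id, modified FROM jobs_live WHERE modified < ?",
            (cutoff,),
        )
        deleted = conn.execute(
            "DELETE FROM jobs_live WHERE modified < ?", (cutoff,)
        ).rowcount
    return deleted


if __name__ == "__main__":
    conn = make_demo_db()
    now = datetime(2009, 6, 15)
    old = (now - timedelta(days=20)).isoformat()
    recent = (now - timedelta(days=1)).isoformat()
    conn.execute("INSERT INTO jobs_live VALUES (1, ?)", (old,))
    conn.execute("INSERT INTO jobs_live VALUES (2, ?)", (recent,))
    print("archived rows:", archive_old_jobs(conn, now))
```

Keeping the live tables small is what lets the production server sustain many short parallel connections; long history searches go to the archive servers instead.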


Critical DBs on Four Panda Archival Database Servers

Panda Archive production MySQL server (along with a spare node)

Database PandaArchiveDB:
- Keeps the full archive of Panda-managed reprocessing, Monte Carlo production, and user analysis jobs since the end of 2005.
- MyISAM engine, no auto-increment, replication from PandaDB through cron jobs.
- Partitioning: bi-monthly structure of job/file archive tables for better search performance.
- 44 tables; maximum number of rows ~33,000,000 per table.

Databases PandaLogDB and PandaMetaDB:
- Keep the archive of log-extract files for jobs, some monitoring information about pilots, and autopilot and scheduler-configuration support (schedconfig).
- MyISAM engine, ~52-54 tables.
- Partitioning: bi-monthly structure for some tables.
- Maximum number of rows ~4,600,000 per table.

Performance:
- Access pattern: ~400-450 parallel threads open (max ~740).
- Throughput: average ~1300-1600 queries/sec (max ~2800); select ~80%, insert ~20%.


Panda Server Infrastructure


PanDA Software Infrastructure

OS, Grid Middleware, and Software Requirements
- OS (RHEL/SL 4) RPMs: mod_ssl, subversion, rrdtool, openmpi, gridsite, graphtool, matplotlib, MySQL.
- Glite-UI 3.1: setup from /etc/profile.d/.
- CA certificates installed/updated.
- Unix accounts w/ ssh-key access: sm
- Python 2.5 (from Tadashi) RPMs: python25, mod_python25, python25-curl, python25-numeric, MySQL-python25, python25-imaging.
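A requirements list like the one above lends itself to an automated check on each host. The sketch below compares the names from the slide against the set of installed packages. The names are copied as written on the slide and may not match the exact RPM package names on a given system; the helper functions are hypothetical and are not part of the BNL setup script.

```python
import subprocess

# Package names as listed on the slide; actual RPM names may differ.
REQUIRED_RPMS = [
    "mod_ssl", "subversion", "rrdtool", "openmpi", "gridsite", "graphtool",
    "matplotlib", "MySQL",
    "python25", "mod_python25", "python25-curl", "python25-numeric",
    "MySQL-python25", "python25-imaging",
]


def installed_rpms():
    """Return the set of installed package names reported by rpm."""
    result = subprocess.run(
        ["rpm", "-qa", "--qf", "%{NAME}\\n"],
        check=True, capture_output=True, text=True,
    )
    return set(result.stdout.split())


def missing_rpms(installed, required=REQUIRED_RPMS):
    """Return the required names that are absent, in the listed order."""
    return [name for name in required if name not in installed]


if __name__ == "__main__":
    # Only meaningful on an RPM-based host.
    print("missing:", missing_rpms(installed_rpms()))
```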


PanDA Autopilot

- Glite-UI 3.1: setup from /etc/profile.d/
- CA certificates installed/updated
- Unix accounts w/ ssh-key access: sm (sm2 for grid autopilot, usatlas1 for local submission)
- Condor 7.3.0 w/ custom configuration


Panda System OS Administration

- Initial install: the semi-manual setup script is at /afs/usatlas.bnl.gov/mgmt/etc/gridui.usatlas.bnl.gov/system-setup.sh
- Ongoing package maintenance: BNL Red Hat Satellite system.
- Condor admin: on systems with global Condor, configuration changes and restarts require root.
- Account management: occasional SSH key additions for new team members.


Panda Monitoring Systems


Panda Monitoring Systems

https://www.usatlas.bnl.gov/nagios/sla_array.html


USATLAS Tier 2 Sites

https://www.usatlas.bnl.gov/nagios/tier2.html


MySQL Servers Monitoring

We use three monitoring tools for MySQL servers:

- MySQLStat: provides a monitoring service for the internal ATLAS community, covering BNL ATLAS MySQL servers, CERN MySQL servers, and some other MySQL servers in the USA and Europe.

- Ganglia

- Nagios: provides critical server status, sends warnings and alarms if a service has a problem, opens RT tickets, and can do some simple automatic recovery.


MySQL Servers Monitoring: MySQLstat


Panda Operation Procedure


[Figure: alarm flow. Machines and services monitored by Nagios feed Nagios, which feeds the SLA system, which opens tickets in RT. RT is linked to OSG Footprints and GGUS. Escalation happens if there is no response within the SLA-specified time window.]

- In case of a failure of a critical machine or service, Nagios generates alarms and sends email alarms to the SLA system. When the service recovers, Nagios sends a notification to the SLA system again.
- The RACF SLA system provides a configurable alarm management layer that automates service alerts from the Nagios-based monitoring system.
- RT can exchange problem reports with external ticketing systems (OSG Footprints, GGUS).
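The escalation rule in this flow (escalate when nobody responds within the SLA-specified time window) can be sketched as a small decision function. This is an illustration under stated assumptions: the Alarm fields, the action names, and the 30-minute window in the example are hypothetical and are not taken from the RACF SLA system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Alarm:
    """One alarm raised by monitoring for a critical machine or service."""
    service: str
    raised_at: datetime
    response_window: timedelta            # SLA-specified time to respond
    acknowledged_at: Optional[datetime] = None
    recovered_at: Optional[datetime] = None


def next_action(alarm, now):
    """Decide what the alarm layer should do with this alarm at time `now`."""
    if alarm.recovered_at is not None:
        return "close"        # recovery notification received
    if alarm.acknowledged_at is not None:
        return "wait"         # someone is already working on it
    if now - alarm.raised_at > alarm.response_window:
        return "escalate"     # no response within the SLA window
    return "notify"           # still inside the window: notify the on-call


if __name__ == "__main__":
    t0 = datetime(2009, 1, 1, 12, 0)
    alarm = Alarm("panda-server", t0, timedelta(minutes=30))
    print(next_action(alarm, t0 + timedelta(minutes=10)))
    print(next_action(alarm, t0 + timedelta(minutes=45)))
```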


Experienced Operation Problems


Panda Server and Databases Problems

Panda Server hanging
- A cron job on the database server detects slow queries and disconnects the Panda server's MySQL connection if it appears to be slow.
- Panda processes do not handle this disconnection and end up frozen.
- The Panda Server had to be restarted either manually or automatically by Nagios.

Panda database server load
- Enhanced the database monitoring capabilities, identified intrusive queries and the particular users and applications that initiated them, and worked with users to modify their MySQL queries.
- This effectively and significantly reduced the number of slow queries.
- Purchased licensed MySQL backup software to reduce the backup time.
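The hanging problem comes from a client that does not react when the server side drops its connection. One common way to handle this is to catch the connection error, reconnect, and retry a bounded number of times. The sketch below shows the pattern with stand-in objects; it is not the fix applied in Panda, and ConnectionLost, connect, and query are hypothetical placeholders for whatever the real database driver provides.

```python
import time


class ConnectionLost(Exception):
    """Stand-in for the driver error raised when the server drops a connection."""


def run_with_reconnect(connect, query, retries=3, delay=0.0):
    """Run query(conn), reconnecting and retrying if the connection is lost.

    connect: callable returning a new connection object
    query:   callable taking a connection and returning a result
    Raises ConnectionLost if all attempts fail.
    """
    conn = connect()
    for attempt in range(retries + 1):
        try:
            return query(conn)
        except ConnectionLost:
            if attempt == retries:
                raise              # give up so a supervisor can restart us
            time.sleep(delay)      # brief pause before reconnecting
            conn = connect()


if __name__ == "__main__":
    state = {"connects": 0}

    def connect():
        state["connects"] += 1
        return state["connects"]

    def query(conn):
        # Pretend the first two connections are killed by the slow-query cron job.
        if conn < 3:
            raise ConnectionLost()
        return "ok"

    print(run_with_reconnect(connect, query), "after", state["connects"], "connects")
```

Failing loudly after the last retry matters here: a process that exits can be restarted by Nagios, while a process that silently freezes cannot be distinguished from a busy one.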


Condor-G Based Auto-Pilots

- Condor-G and gatekeepers use GASS servers to synchronize job status; a large number of Condor-G jobs adds significant load and results in status loss and held jobs.
- Condor-G frequently freezes due to the large number of held jobs.
- Pilot job status reported by Condor-G is out of sync with the actual status of ATLAS jobs.
- Killing held pilot jobs caused good ATLAS jobs to abort early.

Work with the Univ. of Wisconsin to customize Condor-G:
- Stage-in and stage-out events are written into the user log for better diagnosis.
- More Condor-G tuning options for large-scale job submission and dispatch.
- Finer tuning knobs with separate throttles, for example limiting jobmanagers by their role: submission vs. stage-out/removal.
- Efficiently process failed jobs and prevent bad jobs from clogging the submission system: when a gridmanager decides to put a job on hold, it instead uses the hold_reason as the abort_reason and aborts the job.
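The last customization above (abort instead of hold) can be illustrated with a toy function. It models a job as a plain dictionary and only shows the rule itself; it is not Condor-G code or configuration, and the function name, the status strings, and the abort_instead flag are hypothetical.

```python
def hold_or_abort(job, reason, abort_instead=True):
    """Return an updated copy of `job` after a failure with the given reason.

    With abort_instead=True the job is aborted and the would-be hold reason
    is recorded as the abort reason, so it leaves the queue instead of
    accumulating as a held job.
    """
    updated = dict(job)  # leave the caller's record untouched
    if abort_instead:
        updated["status"] = "aborted"
        updated["abort_reason"] = reason
    else:
        updated["status"] = "held"
        updated["hold_reason"] = reason
    return updated


if __name__ == "__main__":
    pilot = {"id": 7, "status": "running"}
    print(hold_or_abort(pilot, "stage-out failed"))
    print(hold_or_abort(pilot, "stage-out failed", abort_instead=False))
```

Since pilots are interchangeable and the autopilot keeps submitting new ones, dropping a failed pilot is cheaper than keeping it in a held state that slows the submission system.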


Panda Monitoring

- The front-end switch system hung due to expired licenses.

- The new Python version and Oracle clients require manual compilation.

- The certificate authority does not issue certificates with a DN containing a wildcard (*). Clients could not properly perform X.509 certificate-based authentication with multiple backend servers behind the F5 switch.


Summary

Contributions:
- Innovation in hardware resilience, extensive monitoring, and automatic problem reporting and tracking.
- Significantly enhanced the reliability of the evolving Panda system.
- Support easy access to the system for software improvement.

- Condor-G is slow to update pilot status, causing inconsistency between actual job status and Panda monitoring.
- Frequent crashes of a Condor-G component were fixed after the Condor team provided Condor 7.3.0.