Page 1

US CMS Testbed

A Grid Computing Case Study

Alan De Smet
Condor Project

University of Wisconsin at Madison
adesmet@cs.wisc.edu

http://www.cs.wisc.edu/~adesmet/

Page 2

Trust No One

• The grid will fail
• Design for recovery

Page 3

The Grid Will Fail

• The grid is complex
• The grid is relatively new and untested
  – Much of it is best described as prototypes or alpha versions
• The public Internet is out of your control
• Remote sites are out of your control

Page 4

Design for Recovery

• Provide recovery at multiple levels to minimize lost work

• Be able to start a particular task over from scratch if necessary

• Never assume that a particular step will succeed

• Allocate lots of debugging time

Page 5

Some Background

Page 6

Compact Muon Solenoid Detector

• The Compact Muon Solenoid (CMS) detector at the Large Hadron Collider will probe fundamental forces in our Universe and search for the yet-undetected Higgs Boson.

(Based on slide by Scott Koranda at NCSA)

Page 7

Compact Muon Solenoid

(Based on slide by Scott Koranda at NCSA)

Page 8

CMS - Now and the Future

• The CMS detector is expected to come online in 2006

• Software to analyze the enormous amount of data from the detector is being developed now.

• For testing and prototyping, the detector is being simulated now.

Page 9

What We’re Doing Now

• Our runs are divided into two phases:
  – Monte Carlo detector response simulation
  – Physics reconstruction

• The testbed currently only does simulation, but is moving toward reconstruction.

Page 10

Storage and Computational Requirements

• Simulating and reconstructing millions of events per year

• Each event requires about 3 minutes of processor time

• Events are generally processed in runs of about 150,000 events

• The simulation step of a single run will generate about 150 GB of data
  – Reconstruction has similar requirements
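
For scale, a rough back-of-the-envelope estimate from the figures above (single-CPU equivalent, so wall-clock time shrinks as CPUs are added):

    150,000 events/run × 3 CPU-minutes/event ≈ 450,000 CPU-minutes
                                             ≈ 7,500 CPU-hours (about 310 CPU-days) per run
    150 GB/run ÷ 150,000 events/run ≈ 1 MB of simulated output per event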

Page 11

Existing CMS Production

• Runs are assigned to individual sites
• Each site has staff managing their runs
  – Manpower intensive to monitor jobs, CPU availability, and disk space

• The local site uses Impala (the old way) or MCRunJob (the new way) to manage jobs running on its local batch system.

Page 12

Testbed CMS Production

• What I work on
• Designed to allow a single master site to manage jobs scattered to many worker sites

Page 13

CMS Testbed Workers

Site                                     CPUs
University of Wisconsin - Madison           5
Fermi National Accelerator Laboratory      12
California Institute of Technology          8
University of Florida                      42
University of California - San Diego        3

As we move from testbed to full production, we will add more sites and hundreds of CPUs.

Page 14

CMS Testbed Big Picture

(Diagram: at the master site, Impala hands jobs to MOP, which submits them through DAGMan and Condor-G; Globus delivers each job to a worker site, where Condor runs the real work.)

Page 15

Impala

• Tool used in current production
• Assembles jobs to be run
• Sends jobs out
• Collects results
• Minimal recovery mechanism
• Expects to hand jobs off to a local batch system
  – Assumes a local file system

Page 16

MOP

• Monte Carlo Distributed Production System
  – It could have been MonteDistPro (as in The Count of…)
• Pretends to be a local batch system for Impala
• Repackages jobs to run on a remote site

Page 17

MOP Repackaging

• Impala hands MOP a list of input files, output files, and a script to run.
• Binds site-specific information to the script
  – Path to binaries, location of scratch space, staging location, etc.
  – Impala is given placeholders like _path_to_gdmp_dir_, which MOP rewrites (see the sketch below)
• Breaks jobs into five-step DAGs
• Hands jobs off to DAGMan/Condor-G
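
A hedged sketch of the kind of placeholder rewriting MOP performs; only the _path_to_gdmp_dir_ placeholder comes from the slide, while the second placeholder, the concrete paths, the file names, and the use of sed are illustrative:

    # Bind site-specific locations into the job script produced by Impala
    sed -e 's|_path_to_gdmp_dir_|/opt/gdmp|g' \
        -e 's|_path_to_scratch_dir_|/scratch/cms|g' \
        impala_job.sh > site_bound_job.sh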

Page 18

MOP Job Stages

• Stage-in - Move input data and program to remote site
• Run - Execute the program
• Stage-out - Retrieve program logs
• Publish - Retrieve program output
• Cleanup - Delete files

(Diagram: the five MOP job stages.)
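
These stages map naturally onto a DAGMan input file. A minimal sketch of one five-stage group, with illustrative node and submit-file names (not MOP's actual ones) and assuming the five stages run strictly in sequence:

    # One MOP group as a chain of five DAGMan nodes
    JOB stagein  stagein.sub
    JOB run      run.sub
    JOB stageout stageout.sub
    JOB publish  publish.sub
    JOB cleanup  cleanup.sub
    PARENT stagein  CHILD run
    PARENT run      CHILD stageout
    PARENT stageout CHILD publish
    PARENT publish  CHILD cleanup

Such a file would be handed to DAGMan with condor_submit_dag.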

Page 19

MOP Job Stages

• A MOP “run” collects multiple groups into a single DAG which is submitted to DAGMan

(Diagram: a combined DAG containing many five-stage groups.)
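
A hedged sketch of what such a combined DAG might look like, with illustrative group-suffixed names; each group's five-stage chain is independent of the others:

    JOB stagein_g1 stagein_g1.sub
    JOB run_g1     run_g1.sub
    PARENT stagein_g1 CHILD run_g1
    # ... the remaining stages of group 1 ...
    JOB stagein_g2 stagein_g2.sub
    JOB run_g2     run_g2.sub
    PARENT stagein_g2 CHILD run_g2
    # ... the remaining stages of group 2, and so on for every group in the run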

Page 20

DAGMan, Condor-G, Globus, Condor

• DAGMan - Manages dependencies
• Condor-G - Monitors the job on the master site
• Globus - Sends jobs to the remote site
• Condor - Manages jobs and computers at the remote site

Page 21

Typical Network Configuration

(Diagram: the MOP master machine reaches each worker site's head node over the public Internet; the site's compute nodes sit behind the head node on a private network.)

Page 22

Network Configuration

• Some sites make compute nodes visible to the public Internet, but many do not.
  – Private networks will scale better as sites add dozens or hundreds of machines
  – As a result, any stage handling data transfer to or from the MOP Master must run on the head node; no other node can address the MOP Master
    • This is a scalability issue. We haven't hit the limit yet.

Page 23

When Things Go Wrong

• How recovery is handled

Page 24

Recovery - DAGMan

• Remembers current status
  – When restarted, determines current progress and continues
• Notes failed jobs for resubmission
  – Can automatically retry, but we don't
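
For reference, DAGMan's automatic retry is a one-line addition per node (which, as noted above, we do not use). A sketch using the illustrative node name from the earlier DAG:

    # Retry the run node up to 2 times before declaring the DAG node failed
    RETRY run 2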

Page 25

Recovery - Condor-G

• Remembers current status
  – When restarted, reconnects jobs to remote sites and updates status
  – Also runs DAGMan; when restarted, restarts DAGMan
• Retries in certain failure cases
• Holds jobs in other failure cases

Page 26

Recovery - Condor

• Remembers current status
• Running on remote site
• Recovers job state and restarts jobs on machine failure

Page 27

globus-url-copy

• Used for file transfer
• The client process can hang under some circumstances
• Wrapped in a shell script giving the transfer a maximum duration. If a run exceeds the duration, the job is killed and restarted.
• Using ftsh to write the script - Doug Thain's Fault Tolerant Shell.
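
A hedged sketch of such a wrapper, assuming ftsh's try/end loop syntax; the time limit, retry interval, and URLs are illustrative:

    #!/usr/bin/env ftsh
    # Keep retrying the transfer, but give up entirely after one hour;
    # a command that hangs past the limit is killed, as described above
    try for 60 minutes every 2 minutes
        globus-url-copy gsiftp://worker.example.edu/cms/run.out file:///stage/run.out
    end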

Page 28

Human Involvement in Failure Recovery

• Condor-G places some problem jobs on hold
  – By placing them on hold, we prevent the jobs from failing and provide an opportunity to recover.
• Usually Globus problems: expired certificate, jobmanager misconfiguration, bugs in the jobmanager

Page 29

Human Involvement in Failure Recovery

• A human diagnoses the jobs placed on hold
  – Is the problem transient? condor_release the job.
  – Otherwise, fix the problem, then release the job.
  – Can the problem not be fixed? Reset the GlobusContactString and release the job, forcing it to restart.
    • condor_qedit <clusterid> GlobusContactString X
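
A hedged sketch of that workflow from the command line; the cluster id 1234 is illustrative, and depending on your Condor version the condor_qedit value may need to be quoted as a ClassAd string:

    # See which jobs are on hold and why
    condor_q -hold

    # Transient problem (e.g. the network came back): just release the job
    condor_release 1234

    # Unfixable problem: make Condor-G forget the old Globus submission,
    # then release the job so it starts over
    condor_qedit 1234 GlobusContactString X
    condor_release 1234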

Page 30

Human Involvement in Failure Recovery

• Sometimes tasks themselves fail
• A variety of problems, typically external: disk full, network outage
  – DAGMan notes the failure. When all possible DAGMan nodes finish or fail, a rescue DAG file is generated.
  – Submitting this rescue DAG will retry all failed nodes.
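
A minimal sketch of that resubmission, assuming the rescue file follows DAGMan's <dagfile>.rescue naming of that era and an illustrative DAG file name:

    # The rescue DAG marks completed nodes DONE, so only failed nodes rerun
    condor_submit_dag mop_run.dag.rescue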

Page 31

Doing Real Work

Page 32

CMS Production Job 1828

• The US CMS Testbed was asked to help with real CMS production

• Given 150,000 events to do in two weeks.

Page 33

What Went Wrong

• Power outage
• Network outages
• Worker site failures
• Globus failures
• DAGMan failure
• Unsolved mysteries

Page 34

Power Outage

• A power outage at the UW took out the master site and the UW worker site for several hours

• During the outage worker sites continued running assigned tasks, but as they exhausted their queues we could not send additional tasks

• File transfers sending data back failed
• System recovered well

Page 35

Network Outages

• Several outages, most less than an hour, one for eleven hours

• Worker sites continued running assigned tasks

• Master site was unable to report status until network was restored

• File transfers failed
• System recovered well

Page 36

Worker Site Failures

• One site had a configuration change go bad, causing the Condor jobs to fail
  – Condor-G placed problem tasks on hold. When the situation was resolved, we released the jobs and they succeeded.
• Another site was incompletely upgraded during the run.
  – Jobs were held, released when fixed.

Page 37

Worker Site Failure / Globus Failure

• At one site, Condor jobs were removed from the pool using condor_rm, probably by accident

• The previous Globus interface to Condor wasn’t prepared for that possibility and erroneously reported the job as still running
  – Fixed in the newest Globus
• The job’s contact string was reset.

Page 38

Globus Failures

• globus-job-manager would sometimes stop checking the status of a job, reporting the last status forever

• When a job was taking unusually long, this was usually the problem

• Killing the globus-job-manager caused a new one to be started, solving the problem
  – Has to be done on the remote site
    • (Or via globus-job-run)
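
A hedged sketch of that cleanup, assuming pkill is available on the remote head node; note this kills all of this user's jobmanagers there, not just the stuck one, so identify the specific process if that matters:

    # Run on the remote site (directly, or through a fork job submitted
    # with globus-job-run); a replacement jobmanager is then started,
    # as described above
    pkill -u "$(whoami)" globus-job-manager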

Page 39

Globus Failures

• globus-job-manager would sometimes corrupt state files

• Wisconsin team debugged problem and distributed patched program

• Failed jobs had their GlobusContactStrings reset.

Page 40

Globus Failures

• Some globus-job-managers would report problems accessing input files
  – The reason has not been diagnosed.

• Affected jobs had their GlobusContactStrings reset.

Page 41

DAGMan failure

• In one instance a DAGMan managing 50 groups of jobs crashed.

• The DAG file was tweaked by hand to mark completed jobs as such and resubmitted
  – Finished jobs in a DAG simply have DONE added to the end of their entry
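
For example, using the illustrative node name from the combined-DAG sketch earlier; DONE is the actual DAGMan keyword:

    # Before hand-editing: the node would be resubmitted
    JOB run_g1 run_g1.sub
    # After hand-editing: DAGMan treats the node as already complete
    JOB run_g1 run_g1.sub DONE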

Page 42

Problems Previously Encountered

• We’ve been doing test runs for ~9 months. We’ve encountered and resolved many other issues.

• Consider building your own copy of the Globus tools out of CVS to stay on top of bugfixes.

• Monitor http://bugzilla.globus.org/ and the Globus mailing lists.

Page 43

The Future

Page 44

Future Improvements

• Currently our run stage runs as a vanilla universe Condor job on the worker site.
• If there is a problem, the job must be restarted from scratch.
• Switching to the standard universe would allow jobs to recover and continue aborted runs.
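
A hedged sketch of the submit-file difference; the executable and file names are illustrative, and the standard universe also requires relinking the executable with condor_compile:

    # Today: vanilla universe, so a failed job restarts from scratch
    #   universe   = vanilla
    # Possible future: standard universe, so the job can checkpoint and resume
    universe   = standard
    executable = cmsim.remote
    output     = run.out
    error      = run.err
    log        = run.log
    queue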

Page 45

Future Improvements

• Data transfer jobs are run as Globus fork jobs. They are completely unmanaged on the remote site. If the remote site has an outage, there is no information on the jobs.
  – Running these under Condor (Scheduler universe) would ensure that status was not lost.
  – Also looking at using the DaP Scheduler

Page 46

Future Improvements

• Jobs are assigned to specific sites by an operator

• Once assigned, changing the assigned site is nearly impossible

• Working to support “grid scheduling”: automatic assignment of jobs to sites and changing site assignment