Page 1

US CMS Testbed

A Grid Computing Case Study

Alan De Smet
Condor Project

University of Wisconsin at Madison
adesmet@cs.wisc.edu

http://www.cs.wisc.edu/~adesmet/

Page 2

Trust No One

• The grid will fail
• Design for recovery

Page 3

The Grid Will Fail

• The grid is complex
• The grid is relatively new and untested
  – Much of it is best described as prototypes or alpha versions
• The public Internet is out of your control
• Remote sites are out of your control

Page 4

Design for Recovery

• Provide recovery at multiple levels to minimize lost work

• Be able to start a particular task over from scratch if necessary

• Never assume that a particular step will succeed

• Allocate lots of debugging time

Page 5

Some Background

Page 6

Compact Muon Solenoid Detector

• The Compact Muon Solenoid (CMS) detector at the Large Hadron Collider will probe fundamental forces in our Universe and search for the yet-undetected Higgs Boson.

(Based on slide by Scott Koranda at NCSA)

Page 7

Compact Muon Solenoid

(Based on slide by Scott Koranda at NCSA)

Page 8

CMS - Now and the Future

• The CMS detector is expected to come online in 2006

• Software to analyze the enormous amount of data from the detector is being developed now.

• For testing and prototyping, the detector is being simulated now.

Page 9

What We’re Doing Now

• Our runs are divided into two phases:
  – Monte Carlo detector response simulation
  – Physics reconstruction

• The testbed currently only does simulation, but is moving toward reconstruction.

Page 10

Storage and Computational Requirements

• Simulating and reconstructing millions of events per year

• Each event requires about 3 minutes of processor time

• Events are generally processed in runs of about 150,000 events

• The simulation step of a single run will generate about 150 GB of data
  – Reconstruction has similar requirements
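
For scale, a rough back-of-the-envelope estimate from the figures above (single-CPU equivalent, so wall-clock time shrinks as CPUs are added):

    150,000 events/run × 3 CPU-minutes/event ≈ 450,000 CPU-minutes
                                             ≈ 7,500 CPU-hours (about 310 CPU-days) per run
    150 GB/run ÷ 150,000 events/run ≈ 1 MB of simulated output per event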

Page 11

Existing CMS Production

• Runs are assigned to individual sites
• Each site has staff managing their runs
  – Manpower intensive to monitor jobs, CPU availability, and disk space

• The local site uses Impala (the old way) or MCRunJob (the new way) to manage jobs running on its local batch system.

Page 12

Testbed CMS Production

• What I work on
• Designed to allow a single master site to manage jobs scattered to many worker sites

Page 13

CMS Testbed Workers

Site                                     CPUs
University of Wisconsin - Madison           5
Fermi National Accelerator Laboratory      12
California Institute of Technology          8
University of Florida                      42
University of California - San Diego        3

As we move from testbed to full production, we will add more sites and hundreds of CPUs.

Page 14

CMS Testbed Big Picture

(Diagram: at the master site, Impala hands jobs to MOP, which submits them through DAGMan and Condor-G; Globus delivers each job to a worker site, where Condor runs the real work.)

Page 15

Impala

• Tool used in current production
• Assembles jobs to be run
• Sends jobs out
• Collects results
• Minimal recovery mechanism
• Expects to hand jobs off to a local batch system
  – Assumes a local file system

Page 16

MOP

• Monte Carlo Distributed Production System
  – It could have been MonteDistPro (as in The Count of…)
• Pretends to be a local batch system for Impala
• Repackages jobs to run on a remote site

Page 17

MOP Repackaging

• Impala hands MOP a list of input files, output files, and a script to run.
• Binds site-specific information to the script
  – Path to binaries, location of scratch space, staging location, etc.
  – Impala is given placeholders like _path_to_gdmp_dir_, which MOP rewrites (see the sketch below)
• Breaks jobs into five-step DAGs
• Hands jobs off to DAGMan/Condor-G
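
A hedged sketch of the kind of placeholder rewriting MOP performs; only the _path_to_gdmp_dir_ placeholder comes from the slide, while the second placeholder, the concrete paths, the file names, and the use of sed are illustrative:

    # Bind site-specific locations into the job script produced by Impala
    sed -e 's|_path_to_gdmp_dir_|/opt/gdmp|g' \
        -e 's|_path_to_scratch_dir_|/scratch/cms|g' \
        impala_job.sh > site_bound_job.sh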

Page 18

MOP Job Stages

• Stage-in - Move input data and program to remote site
• Run - Execute the program
• Stage-out - Retrieve program logs
• Publish - Retrieve program output
• Cleanup - Delete files

(Diagram: the five MOP job stages.)
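
These stages map naturally onto a DAGMan input file. A minimal sketch of one five-stage group, with illustrative node and submit-file names (not MOP's actual ones) and assuming the five stages run strictly in sequence:

    # One MOP group as a chain of five DAGMan nodes
    JOB stagein  stagein.sub
    JOB run      run.sub
    JOB stageout stageout.sub
    JOB publish  publish.sub
    JOB cleanup  cleanup.sub
    PARENT stagein  CHILD run
    PARENT run      CHILD stageout
    PARENT stageout CHILD publish
    PARENT publish  CHILD cleanup

Such a file would be handed to DAGMan with condor_submit_dag.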

Page 19

MOP Job Stages

• A MOP “run” collects multiple groups into a single DAG which is submitted to DAGMan

(Diagram: a combined DAG containing many five-stage groups.)
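
A hedged sketch of what such a combined DAG might look like, with illustrative group-suffixed names; each group's five-stage chain is independent of the others:

    JOB stagein_g1 stagein_g1.sub
    JOB run_g1     run_g1.sub
    PARENT stagein_g1 CHILD run_g1
    # ... the remaining stages of group 1 ...
    JOB stagein_g2 stagein_g2.sub
    JOB run_g2     run_g2.sub
    PARENT stagein_g2 CHILD run_g2
    # ... the remaining stages of group 2, and so on for every group in the run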

Page 20

DAGMan, Condor-G, Globus, Condor

• DAGMan - Manages dependencies
• Condor-G - Monitors the job on the master site
• Globus - Sends jobs to the remote site
• Condor - Manages jobs and computers at the remote site

Page 21

Typical Network Configuration

(Diagram: the MOP master machine reaches each worker site's head node over the public Internet; the site's compute nodes sit behind the head node on a private network.)

Page 22

Network Configuration

• Some sites make compute nodes visible to the public Internet, but many do not.
  – Private networks will scale better as sites add dozens or hundreds of machines
  – As a result, any stage handling data transfer to or from the MOP Master must run on the head node; no other node can address the MOP Master
    • This is a scalability issue. We haven't hit the limit yet.

Page 23

When Things Go Wrong

• How recovery is handled

Page 24

Recovery - DAGMan

• Remembers current status
  – When restarted, determines current progress and continues
• Notes failed jobs for resubmission
  – Can automatically retry, but we don't
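
For reference, DAGMan's automatic retry is a one-line addition per node (which, as noted above, we do not use). A sketch using the illustrative node name from the earlier DAG:

    # Retry the run node up to 2 times before declaring the DAG node failed
    RETRY run 2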

Page 25

Recovery - Condor-G

• Remembers current status
  – When restarted, reconnects jobs to remote sites and updates status
  – Also runs DAGMan; when restarted, restarts DAGMan
• Retries in certain failure cases
• Holds jobs in other failure cases

Page 26

Recovery - Condor

• Remembers current status
• Running on remote site
• Recovers job state and restarts jobs on machine failure

Page 27

globus-url-copy

• Used for file transfer
• The client process can hang under some circumstances
• Wrapped in a shell script giving the transfer a maximum duration. If a run exceeds the duration, the job is killed and restarted.
• Using ftsh to write the script - Doug Thain's Fault Tolerant Shell.
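
A hedged sketch of such a wrapper, assuming ftsh's try/end loop syntax; the time limit, retry interval, and URLs are illustrative:

    #!/usr/bin/env ftsh
    # Keep retrying the transfer, but give up entirely after one hour;
    # a command that hangs past the limit is killed, as described above
    try for 60 minutes every 2 minutes
        globus-url-copy gsiftp://worker.example.edu/cms/run.out file:///stage/run.out
    end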

Page 28

Human Involvement in Failure Recovery

• Condor-G places some problem jobs on hold
  – By placing them on hold, we prevent the jobs from failing and provide an opportunity to recover.
• Usually Globus problems: expired certificate, jobmanager misconfiguration, bugs in the jobmanager

Page 29

Human Involvement in Failure Recovery

• A human diagnoses the jobs placed on hold
  – Is the problem transient? condor_release the job.
  – Otherwise, fix the problem, then release the job.
  – Can the problem not be fixed? Reset the GlobusContactString and release the job, forcing it to restart.
    • condor_qedit <clusterid> GlobusContactString X
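
A hedged sketch of that workflow from the command line; the cluster id 1234 is illustrative, and depending on your Condor version the condor_qedit value may need to be quoted as a ClassAd string:

    # See which jobs are on hold and why
    condor_q -hold

    # Transient problem (e.g. the network came back): just release the job
    condor_release 1234

    # Unfixable problem: make Condor-G forget the old Globus submission,
    # then release the job so it starts over
    condor_qedit 1234 GlobusContactString X
    condor_release 1234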

Page 30

Human Involvement in Failure Recovery

• Sometimes tasks themselves fail
• A variety of problems, typically external: disk full, network outage
  – DAGMan notes the failure. When all possible DAGMan nodes finish or fail, a rescue DAG file is generated.
  – Submitting this rescue DAG will retry all failed nodes.
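
A minimal sketch of that resubmission, assuming the rescue file follows DAGMan's <dagfile>.rescue naming of that era and an illustrative DAG file name:

    # The rescue DAG marks completed nodes DONE, so only failed nodes rerun
    condor_submit_dag mop_run.dag.rescue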

Page 31

Doing Real Work

Page 32

CMS Production Job 1828

• The US CMS Testbed was asked to help with real CMS production

• Given 150,000 events to do in two weeks.

Page 33

What Went Wrong

• Power outage
• Network outages
• Worker site failures
• Globus failures
• DAGMan failure
• Unsolved mysteries

Page 34

Power Outage

• A power outage at the UW took out the master site and the UW worker site for several hours

• During the outage worker sites continued running assigned tasks, but as they exhausted their queues we could not send additional tasks

• File transfers sending data back failed
• System recovered well

Page 35

Network Outages

• Several outages, most less than an hour, one for eleven hours

• Worker sites continued running assigned tasks

• Master site was unable to report status until network was restored

• File transfers failed
• System recovered well

Page 36

Worker Site Failures

• One site had a configuration change go bad, causing the Condor jobs to fail
  – Condor-G placed problem tasks on hold. When the situation was resolved, we released the jobs and they succeeded.
• Another site was incompletely upgraded during the run.
  – Jobs were held, released when fixed.

Page 37

Worker Site Failure / Globus Failure

• At one site, Condor jobs were removed from the pool using condor_rm, probably by accident

• The previous Globus interface to Condor wasn’t prepared for that possibility and erroneously reported the job as still running
  – Fixed in the newest Globus
• The job’s contact string was reset.

Page 38

Globus Failures

• globus-job-manager would sometimes stop checking the status of a job, reporting the last status forever

• When a job was taking unusually long, this was usually the problem

• Killing the globus-job-manager caused a new one to be started, solving the problem
  – Has to be done on the remote site
    • (Or via globus-job-run)
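
A hedged sketch of that cleanup, assuming pkill is available on the remote head node; note this kills all of this user's jobmanagers there, not just the stuck one, so identify the specific process if that matters:

    # Run on the remote site (directly, or through a fork job submitted
    # with globus-job-run); a replacement jobmanager is then started,
    # as described above
    pkill -u "$(whoami)" globus-job-manager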

Page 39

Globus Failures

• globus-job-manager would sometimes corrupt state files

• Wisconsin team debugged problem and distributed patched program

• Failed jobs had their GlobusContactStrings reset.

Page 40

Globus Failures

• Some globus-job-managers would report problems accessing input files
  – The reason has not been diagnosed.

• Affected jobs had their GlobusContactStrings reset.

Page 41

DAGMan failure

• In one instance a DAGMan managing 50 groups of jobs crashed.

• The DAG file was tweaked by hand to mark completed jobs as such and resubmitted
  – Finished jobs in a DAG simply have DONE added to the end of their entry
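
For example, using the illustrative node name from the combined-DAG sketch earlier; DONE is the actual DAGMan keyword:

    # Before hand-editing: the node would be resubmitted
    JOB run_g1 run_g1.sub
    # After hand-editing: DAGMan treats the node as already complete
    JOB run_g1 run_g1.sub DONE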

Page 42

Problems Previously Encountered

• We’ve been doing test runs for ~9 months. We’ve encountered and resolved many other issues.

• Consider building your own copy of the Globus tools out of CVS to stay on top of bugfixes.

• Monitor http://bugzilla.globus.org/ and the Globus mailing lists.

Page 43

The Future

Page 44

Future Improvements

• Currently our run stage runs as a vanilla universe Condor job on the worker site.
• If there is a problem, the job must be restarted from scratch.
• Switching to the standard universe would allow jobs to recover and continue aborted runs.
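
A hedged sketch of the submit-file difference; the executable and file names are illustrative, and the standard universe also requires relinking the executable with condor_compile:

    # Today: vanilla universe, so a failed job restarts from scratch
    #   universe   = vanilla
    # Possible future: standard universe, so the job can checkpoint and resume
    universe   = standard
    executable = cmsim.remote
    output     = run.out
    error      = run.err
    log        = run.log
    queue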

Page 45

Future Improvements

• Data transfer jobs are run as Globus fork jobs. They are completely unmanaged on the remote site. If the remote site has an outage, there is no information on the jobs.
  – Running these under Condor (Scheduler universe) would ensure that status was not lost.
  – Also looking at using the DaP Scheduler

Page 46

Future Improvements

• Jobs are assigned to specific sites by an operator

• Once assigned, changing the assigned site is nearly impossible

• Working to support “grid scheduling”: automatic assignment of jobs to sites and changing site assignment