Enabling Grids for E-sciencE
www.eu-egee.org
Analysis of the ATLAS Rome Production Experience on the LCG Computing Grid
Simone Campana, CERN/INFN
EGEE User Forum, CERN (Switzerland) March 1st – 3rd 2006
Outline
- The ATLAS Experiment
- The Computing Model and the Data Challenges
- The LCG Computing Grid: overview and architecture
- The ATLAS Production System
- The Rome Production on LCG: report and numbers of the production; achievements with respect to DC2; standing issues and possible improvements
- Conclusions
The ATLAS Experiment
[Figure: view of the LHC @ CERN and of the ATLAS detector]
ATLAS: A ToroidaL ApparatuS for LHC
The ATLAS Computing Model
The ATLAS computing can NOT rely on a SINGLE computer center model: the amount of required resources is too large. For the 1st year of data taking:
- 50.6 MSI2k of CPU
- 16.9 PB of space on tape
- 25.4 PB of space on disk
ATLAS decided to embrace the GRID paradigm, with a high level of decentralization.
Sites are organized in a multi-tier structure (a hierarchical model); tiers are defined by their ROLE in the ATLAS computing:
- Tier-0 at CERN: records RAW data, distributes a second copy to the Tier-1s, calibrates and does first-pass reconstruction
- Tier-1 centers: manage permanent storage (RAW, simulated, processed); capacity for reprocessing and bulk analysis
- Tier-2 centers: Monte Carlo event simulation; end-user analysis
In Grid terminology ATLAS is a Virtual Organization.
[Diagram: data flows in the tier model. Tier-0 hosts the online filter farm and the reconstruction farm; Tier-1s host re-reconstruction and analysis farms; Tier-2s host Monte Carlo and analysis farms. The arrows carry RAW, ESD, AOD, selected ESD/AOD and MC data between the tiers.]
Data Challenges
A Data Challenge is a validation of the Computing and Data Model and a test of the complete software suite: full simulation and reprocessing of data as if coming from the detector, using the same software and computing infrastructure to be employed for data taking.
ATLAS ran two major Data Challenges:
- DC1 in 2002-2003 (with direct access to local resources + NorduGrid, see later)
- DC2 in July - December 2004 (completely in a GRID environment)
A large-scale production ran in January - June 2005, familiarly called the "Rome Production": it provided data for physics studies for the ATLAS Rome Workshop in June 2005. It can be considered totally equivalent to a real Data Challenge: same methodology, and a large number of events produced. It also offered a unique opportunity to test improvements in the production framework, the Grid middleware and the reconstruction software.
ATLAS resources span three different grids: LCG, NorduGrid and OSG. In this talk I will present the "Rome Production" experience on the LHC Computing Grid infrastructure.
The LCG Infrastructure
May 2005: 140 Grid sites in 34 countries, 12000 CPUs, 8 PetaBytes
LCG architecture
The Workload Management System is responsible for the management and monitoring of jobs:
- A set of services running on the Resource Broker machine match job requirements to the available resources, schedule the job for execution on an appropriate Computing Element, track the job status and allow users to retrieve their job output.
- Each Computing Element is the front-end to a local batch system and manages a pool of Worker Nodes, where the job is eventually executed.
- The Logging and Bookkeeping service keeps the state information of a job and allows the user to query its status.
- Limited user credentials (proxies) can be automatically renewed through a Proxy Service.
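As a rough illustration of the Resource Broker's matchmaking step (not actual WMS code; the record structures and the ranking criterion are invented for the example), matchmaking amounts to filtering the published Computing Elements against the job's requirements and ranking the survivors:

```python
# Illustrative sketch of Resource Broker matchmaking: keep only the
# Computing Elements that satisfy every job requirement, then rank them
# (here simply by free slots, which is an assumption of this sketch).

def match(job_requirements, computing_elements):
    """Return the CEs satisfying every requirement, best-ranked first."""
    candidates = [
        ce for ce in computing_elements
        if all(req(ce) for req in job_requirements)
    ]
    return sorted(candidates, key=lambda ce: ce["free_slots"], reverse=True)

# Toy Information System content (invented values).
ces = [
    {"name": "ce.cern.ch", "vo": ["atlas"], "free_slots": 120},
    {"name": "ce.cnaf.infn.it", "vo": ["atlas", "cms"], "free_slots": 340},
    {"name": "ce.example.org", "vo": ["cms"], "free_slots": 50},
]
reqs = [lambda ce: "atlas" in ce["vo"]]  # the job runs under the ATLAS VO
ranked = match(reqs, ces)
```

The real WMS evaluates far richer requirement and rank expressions, but the filter-then-rank shape is the same.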
LCG Architecture
The Data Management System allows the user to move files in and out of the Grid, replicate files among different Storage Elements and locate files.
- Files are stored in Storage Elements, either disk-only or with a tape backend.
- A number of protocols allow data transfer; GridFTP is the most commonly used.
- Files are registered in a central catalogue: the Replica Location Service keeps information about file locations and about some file metadata.
The Information System provides information about the Grid resources and their status.
- Information is generated on every service and published by the GRIS.
- It is propagated in a hierarchical structure: a GIIS at every site, with the BDII as central collector.
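A minimal sketch of this hierarchical propagation, with invented record structures: each GRIS publishes local service records, a per-site GIIS aggregates them, and the central BDII collects every site's feed:

```python
# Illustrative GRIS -> GIIS -> BDII aggregation (record fields are invented).

def giis(site, gris_records):
    """Site-level collector: tag each service record with its site."""
    return [dict(rec, site=site) for rec in gris_records]

def bdii(site_feeds):
    """Central collector: flatten every site's GIIS output into one view."""
    return [rec for feed in site_feeds for rec in feed]

cern = giis("cern.ch", [{"service": "CE", "state": "Production"}])
cnaf = giis("cnaf.infn.it", [{"service": "SE", "state": "Production"}])
top = bdii([cern, cnaf])   # what a Resource Broker would query
```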
LCG architecture
Accounting logs resource usage and traces user jobs.
Monitoring services visualize and record the status of LCG resources; different systems are in place (R-GMA, GridICE, ...).
The ATLAS production system
An ATLAS central database holds Grid-neutral information about jobs.
A "supervisor" agent distributes jobs to Grid-specific agents called "executors", follows up their status, and validates them in case of success or flags them for resubmission.
The executors offer an interface to the underlying Grid middleware; the LCG executor, Lexor, provides an interface to the native LCG WMS.
File upload/download relies on Grid-specific client tools; the ATLAS Data Management System (Don Quijote) ensures high-level data management across the different Grids.
Job monitoring is performed through Grid-specific tools. In LCG, information collected from the production database and the GridICE server is merged and published through an interactive web interface.
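The supervisor/executor split can be sketched as follows; the classes and the success criterion are purely illustrative, not ATLAS production-system code:

```python
# Illustrative supervisor/executor pattern: the supervisor hands jobs to a
# Grid-specific executor, validates successes, and flags failures for
# resubmission.

class Executor:
    """Stand-in for a Grid-specific executor such as Lexor (LCG)."""
    def submit(self, job):
        # A real executor would translate the job into middleware calls;
        # here we simply pretend that every even-numbered job succeeds.
        return "done" if job["id"] % 2 == 0 else "failed"

def supervise(jobs, executor):
    """One supervisor pass over a batch of jobs from the central database."""
    validated, to_resubmit = [], []
    for job in jobs:
        if executor.submit(job) == "done":
            validated.append(job["id"])
        else:
            to_resubmit.append(job["id"])   # flagged for another attempt
    return validated, to_resubmit

ok, retry = supervise([{"id": i} for i in range(4)], Executor())
```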
[Diagram: the production database (ProdDB) feeds several supervisor instances; each supervisor drives one executor (LCG, GRID3, NG or batch), which submits jobs to its own system; Don Quijote and the per-Grid RLS catalogues handle the data.]
Task Flow for ATLAS production
[Diagram: the ATLAS production task flow. Pythia event generation produces HepMC events; Geant4 detector simulation turns them into hits and MC truth; digitization (including pile-up with minimum-bias events) produces digits (RDO); event mixing and byte-stream conversion produce raw-format data; reconstruction produces ESD. Approximate data volumes for 10^7 events: ~5 TB, 20 TB, 30 TB, 20 TB and 5 TB across the successive stages.]
But in fact only part of the full chain was used/tested for the Rome Production:
- No ByteStream
- No Event Mixing
- The reconstruction was performed on digitized events and only partially on piled-up events
[Diagram: the reduced chain actually exercised: Pythia event generation, Geant4 detector simulation, digitization (with pile-up), and reconstruction to ESD/AOD.]
Rome Production experience on LCG
On average, 8 concurrent instances of Lexor were active on the native LCG-2 system.
Four people were controlling the production process, checking for job failures and interacting with the middleware developers and the LCG Experiment Integration Support team.
The production for the Rome workshop consisted of:
- A total of 380k jobs submitted to the native LCG-2 WMS: 109k simulation jobs, 106k digitization jobs, 125k reconstruction jobs and 40k pile-up jobs
- A total of 1.4M files stored in LCG Storage Elements, corresponding to about 45 TB of data
This is a clear improvement with respect to DC2, where 91.5k jobs in total ran on LCG-2 and no reconstruction was performed.
Rome Production experience on LCG
[Chart: number of jobs per day on LCG-2 from June 2004 to July 2005, covering Data Challenge 2 and the Rome Production; the vertical scale reaches 8000 jobs per day.]
Rome Production experience on LCG
Jobs were distributed to 45 different computing resources.
The ratio is generally proportional to the size of the cluster, which indicates an overall good job distribution.
No site in particular ran the large majority of jobs: the site with the largest number of CPU resources (CERN) contributed about 11% of the ATLAS production, and the other major sites ran between 5% and 8% of the jobs each.
This is an achievement toward a more robust and fault-tolerant system that does not rely on a small number of large computing centers.
[Pie chart: the percentage of ATLAS jobs run at each LCG site: cern.ch 11%, cnaf.infn.it 7%, rl.ac.uk 7%, roma1.infn.it 5%, in2p3-cc.fr 5%, shef.ac.uk 5%, fzk.de 5%, nikhef.nl 5%, ft.uam.es 5%, other infn.it sites 5%, lnl.infn.it 4%, ific.uv.es 4%, prague.cz 3%, sinica.edu.tw 3%, ba.infn.it 2%, mi.infn.it 2%, ihep.su 2%, sara.nl 2%, triumf.ca 2%, ox.ac.uk 2%, other ac.uk sites 2%, in2p3-cppm.fr 1%, other French sites 1%, grnet.gr 1%, ifae.es 1%, others 5%.]
Improvements: the Information System
An unstable Information System can affect production in many aspects:
- Jobs might not match the full set of resources and flood a restricted number of sites, causing overload of some site services and leaving other available resources unused.
- Data management commands might fail to transfer input and output files, waste a large amount of CPU cycles, and cause an overhead for the submission system and the production team.
For the Rome production, several aspects were improved:
- Fixes in the BDII software
- BDII deployed as a load-balanced service: multiple backends under a DNS switch
This reduced the single-point-of-failure effect for both job submission and data management during job execution.
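A toy sketch of why the load-balanced deployment helps (hostnames and structures invented): the DNS alias resolves to several backends, and a client that walks down the list survives the loss of any single backend:

```python
# Illustrative failover behind a DNS-switched BDII alias.

def resolve(alias, dns):
    """DNS switch: one alias maps to several backend hosts."""
    return dns[alias]

def query_bdii(alias, dns, backend_status):
    """Try each backend in turn; only fail if every backend is down."""
    for host in resolve(alias, dns):
        if backend_status.get(host) == "up":
            return f"answered by {host}"
    raise RuntimeError("all BDII backends down")

dns = {"lcg-bdii.example.org": ["bdii01.example.org", "bdii02.example.org"]}
status = {"bdii01.example.org": "down", "bdii02.example.org": "up"}
answer = query_bdii("lcg-bdii.example.org", dns, status)
```

With a single backend the same outage would have stopped both job submission and data management lookups.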
Improvements: Site Configuration
Site misconfiguration was the main source of failures during DC2. No procedure was in place during DC2: problems were treated on a case-by-case basis. This is unthinkable in the long term, since the LCG infrastructure counts a very large number of resources, and LCG grows very rapidly and is widely distributed.
Many sites started careful monitoring of the number of job failures and developed automatic tools to identify problematic nodes.
The LCG Operations team developed a series of automatic tools for site sanity controls (see next slide).
Improvements: Site Configuration
The Site Functional Tests, running every day at every site, test the correct configuration of the Worker Nodes and the interaction with Grid services. They can now include VO-specific tests and allow a VO-specific view.
The GIIS monitor checks the consistency of the information published by each site in the Information System, in almost real time.
Improvements: Site Configuration
Freedom of Choosing Resources allows the user to exclude misconfigured resources from the BDII:
- Generally based on the SFT results
- Possible to whitelist or blacklist a resource
- Can exclude the Storage and Computing Resources of the same site separately
Improvements: WMS and others
The LCG Workload Management System is highly automated, designed to reduce human intervention to a minimum, and consists of a complex set of services interacting with external components. This complexity caused a certain unreliability of the WMS during DC2.
The system became more robust before the Rome production, thanks to several bug fixes and optimizations in the WMS workflow.
The heterogeneous and dynamic nature of a Grid environment implies a certain level of unreliability; the ATLAS application was improved to cope with it.
The production team and the LCG operations and support teams gathered a lot of experience during DC2 and benefited from it at the time of the Rome Production.
Issues (and possible improvements)
Failure rates and causes are shown in the table below. The failure rate is still quite high (~48%). Different failures imply different amounts of resource loss; obviously the most serious cause is Data Management.

System      Cause                   Rate
WMS         Total                   1.6%
DM          Download input files    26.4%
DM          Upload output files     0.8%
ATLAS/LCG   Application crash       9.1%
ATLAS       Proxy expired           0.3%
LCG         Site misconfiguration   0.9%
-           Unclassified            9%
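As a quick sanity check, the per-cause rates in the table do add up to the quoted ~48% overall failure rate:

```python
# Sum the per-cause failure rates from the table (values in percent).
rates = {
    "WMS total": 1.6,
    "DM download input files": 26.4,
    "DM upload output files": 0.8,
    "Application crash": 9.1,
    "Proxy expired": 0.3,
    "Site misconfiguration": 0.9,
    "Unclassified": 9.0,
}
total = sum(rates.values())   # 48.1, i.e. the ~48% quoted above
```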
Data Management: issues
NO Reliable File Transfer service was in place during the Rome production; data movement was performed through the LCG DM client tools.
- The LCG DM tools did not provide timeout and retry capabilities. WORKAROUND: a timeout and possible retry was implemented in Lexor at some point.
- The LCG DM tools do not always ensure consistency between files in the SE and entries in the catalog (if the catalog is down or unreachable, or the operation is killed prematurely). WORKAROUND: manual cleanup was needed.
Data access on mass storage systems was very problematic: data need to be moved (staged) from tape to disk before being accessed, and the middleware could not ensure the existence/persistency of data on disk. WORKAROUND: manual pre-staging of files was carried out by the production team.
The ATLAS strategy for file distribution must be (re)thought: output files were chaotically spread around 143 different Storage Elements, and a replication schema for frequently accessed files was not in place. This complicates analysis of the reconstructed samples and the production itself.
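The timeout-and-retry workaround mentioned above can be sketched like this (a generic wrapper, not the actual Lexor code; the transfer function below is a stand-in for an LCG DM client call):

```python
# Generic timeout-and-retry wrapper around an unreliable transfer operation.

def with_retries(operation, attempts=3, timeout=300.0):
    """Run `operation(timeout)`; on failure, retry up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation(timeout)
        except Exception as err:        # a timeout or a transfer error
            last_error = err
    raise RuntimeError(f"gave up after {attempts} attempts: {last_error}")

# Stand-in transfer that fails twice before succeeding, to exercise the loop.
calls = []
def flaky_copy(timeout):
    calls.append(timeout)
    if len(calls) < 3:
        raise IOError("transfer timed out")   # first two attempts fail
    return "replica registered"

result = with_retries(flaky_copy)
```

The point of the wrapper is that a transient failure costs one retry rather than a lost job and its CPU cycles.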
Data Management: improvements
Timeout and retry capabilities were introduced natively in the LCG DM tools, which were also improved to guarantee atomic operations.
A new catalog, the LCG File Catalog, has been developed: more stable, easier problem tracking, better performance and reliability.
The Storage Resource Manager interface was introduced as a front-end to every SE: agreed on between the experiments and the middleware developers, it standardizes storage access and management and offers more functionality for MSS access.
A reliable File Transfer Service was developed within the EGEE project. It is a SERVICE: it allows files to be replicated between SEs in a reliable way, is built on top of GridFTP and SRM, and is capable of dealing with data transfers from/to MSS.
Data Management: improvements
FTS and SRM SEs have been intensively tested during Service Challenge 3 (ongoing): a throughput exercise started in July 2005 and is continuing at a low rate even now, with data distribution from the CERN T0 to several T1s.
Some issues have been addressed: many are already fixed, and others are being fixed in time for Service Challenge 4 (April 2006). In general, very positive feedback from the experiments.
Strategy for files distribution
The new ATLAS DDM is already in already in placeplace Fully tested and employed during SC3
Data throughput from CERN to ATLAS Tier 1s
Target (80MB/s sustained for a week) fully achieved
ATLAS DM is being now integrated ATLAS DM is being now integrated with ATLAS Production Systemwith ATLAS Production System.
New DistributedDistributed ATLAS Data ManagementATLAS Data Management system
Enforce concept of “logical datasetlogical dataset” collection of files being moved and located
as a unique entity.
Dataset Subscription ModelSubscription Model the site declare the interest in holding a
dataset
ATLAS agentsagents trigger migration of files Integrated with LFC, FTS and SRM
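A minimal sketch of the subscription model, with invented dataset and site names: the agent compares a logical dataset's file list against what the subscribed site already holds and triggers transfers for the difference:

```python
# Illustrative dataset-subscription agent: a logical dataset is a list of
# files treated as one entity; the agent copies whatever the subscribed
# site is missing (the copy stands in for an FTS transfer).

def missing_files(dataset, site_contents):
    """Files of the dataset that the site does not yet hold."""
    return [f for f in dataset["files"] if f not in site_contents]

def fulfil_subscription(dataset, site, catalogue):
    """One agent pass over a site's subscription to a dataset."""
    site_contents = catalogue.setdefault(site, set())
    for f in missing_files(dataset, site_contents):
        site_contents.add(f)          # a real agent would invoke FTS here
    return sorted(site_contents)

# Invented example: a site already holds one file of a three-file dataset.
rome_ds = {"name": "rome.recon.0001", "files": ["f1", "f2", "f3"]}
catalogue = {"site-A": {"f1"}}
held = fulfil_subscription(rome_ds, "site-A", catalogue)
```

Repeated agent passes are idempotent: once the site holds the full dataset, nothing more is transferred.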
The Workload Management: issues
The performance of the WMS for job submission and handling is generally acceptable in normal conditions, but degrades under stress. WORKAROUND: several RBs dedicated to ATLAS, with different hardware solutions, have been deployed.
The EGEE project will provide an enhanced WMS:
- Possibility of bulk submission, bulk matching and bulk queries
- Improved communication with the Computing Elements at sites
- Possible improvement of job submission speed and job dispatching
Some preliminary tests show promising results, but several issues must still be clarified.
Monitoring: issues and improvements
Lack of VO-specific information about jobs at the sites: GridICE sensors are deployed at every site, but not correctly configured everywhere, giving partial information that is difficult to interpret.
Queries to the ATLAS Production Database could cause an excessive load.
The error diagnostics should be improved: currently performed by parsing executor log files and querying the DB, it should be formalized in proper tools.
Real-time job output inspection would have been helpful, especially to investigate the causes of hanging jobs.
An ATLAS team is building a global job monitoring system, based on the current tools and possibly integrating new components (R-GMA etc.).
Conclusions
The Rome Production on the LCG infrastructure has been an overall successful exercise: it exercised the ATLAS production system, contributed to the testing of the ATLAS Computing and Data model, stress-tested the LCG infrastructure... and produced a lot of simulated data for the physicists!
This must be seen as the consequence of several improvements: in the Grid middleware, in the ATLAS components, and in LCG operations.
Still, several components need improvements, both in terms of reliability and performance, and production still requires a lot of human attention.
The issues have been addressed to the relevant parties, and a lot of work has been done since the Rome Production: preliminary tests show promising improvements, which will be evaluated fully in Service Challenge 4 (April 2006).