Enabling Grids for E-sciencE
www.eu-egee.org
Analysis of the ATLAS Rome Production Experience on the LCG Computing Grid
Simone Campana, CERN/INFN
EGEE User Forum, CERN (Switzerland) March 1st – 3rd 2006
Outline
- The ATLAS Experiment
- The Computing Model and the Data Challenges
- The LCG Computing Grid: overview and architecture
- The ATLAS Production System
- The Rome Production on LCG: report and numbers of the production; achievements with respect to DC2; standing issues and possible improvements
- Conclusions
The ATLAS Experiment
[Figure: view of the LHC @ CERN and of the ATLAS detector]
ATLAS: A ToroidaL ApparatuS for LHC
The ATLAS Computing Model
The ATLAS computing can NOT rely on a SINGLE computer center model: the amount of required resources is too large. For the 1st year of data taking:
- 50.6 MSI2k of CPU
- 16.9 PB of space on tape
- 25.4 PB of space on disk
ATLAS decided to embrace the GRID paradigm, with a high level of decentralization.
Sites are organized in a multi-tier structure (a hierarchical model); tiers are defined by their ROLE in the ATLAS computing:
- Tier-0 at CERN: records RAW data, distributes a second copy to the Tier-1s, calibrates and does first-pass reconstruction
- Tier-1 centers: manage permanent storage (RAW, simulated, processed); capacity for reprocessing and bulk analysis
- Tier-2 centers: Monte Carlo event simulation; end-user analysis
In Grid terminology ATLAS is a Virtual Organization.
[Diagram: data flows in the tier model. Tier-0 hosts the online filter farm and the reconstruction farm; Tier-1s host re-reconstruction and analysis farms; Tier-2s host Monte Carlo and analysis farms. The arrows carry RAW, ESD, AOD, selected ESD/AOD and MC data between the tiers.]
Data Challenges
A Data Challenge is a validation of the Computing and Data Model and a test of the complete software suite: full simulation and reprocessing of data as if coming from the detector, using the same software and computing infrastructure to be employed for data taking.
ATLAS ran two major Data Challenges:
- DC1 in 2002-2003 (with direct access to local resources + NorduGrid, see later)
- DC2 in July - December 2004 (completely in a GRID environment)
A large-scale production ran in January - June 2005, familiarly called the "Rome Production": it provided data for physics studies for the ATLAS Rome Workshop in June 2005. It can be considered totally equivalent to a real Data Challenge: same methodology, and a large number of events produced. It also offered a unique opportunity to test improvements in the production framework, the Grid middleware and the reconstruction software.
ATLAS resources span three different grids: LCG, NorduGrid and OSG. In this talk I will present the "Rome Production" experience on the LHC Computing Grid infrastructure.
The LCG Infrastructure
May 2005: 140 Grid sites in 34 countries, 12000 CPUs, 8 PetaBytes
LCG architecture
The Workload Management System is responsible for the management and monitoring of jobs:
- A set of services running on the Resource Broker machine match job requirements to the available resources, schedule the job for execution on an appropriate Computing Element, track the job status and allow users to retrieve their job output.
- Each Computing Element is the front-end to a local batch system and manages a pool of Worker Nodes, where the job is eventually executed.
- The Logging and Bookkeeping service keeps the state information of a job and allows the user to query its status.
- Limited user credentials (proxies) can be automatically renewed through a Proxy Service.
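As a rough illustration of the Resource Broker's matchmaking step (not actual WMS code; the record structures and the ranking criterion are invented for the example), matchmaking amounts to filtering the published Computing Elements against the job's requirements and ranking the survivors:

```python
# Illustrative sketch of Resource Broker matchmaking: keep only the
# Computing Elements that satisfy every job requirement, then rank them
# (here simply by free slots, which is an assumption of this sketch).

def match(job_requirements, computing_elements):
    """Return the CEs satisfying every requirement, best-ranked first."""
    candidates = [
        ce for ce in computing_elements
        if all(req(ce) for req in job_requirements)
    ]
    return sorted(candidates, key=lambda ce: ce["free_slots"], reverse=True)

# Toy Information System content (invented values).
ces = [
    {"name": "ce.cern.ch", "vo": ["atlas"], "free_slots": 120},
    {"name": "ce.cnaf.infn.it", "vo": ["atlas", "cms"], "free_slots": 340},
    {"name": "ce.example.org", "vo": ["cms"], "free_slots": 50},
]
reqs = [lambda ce: "atlas" in ce["vo"]]  # the job runs under the ATLAS VO
ranked = match(reqs, ces)
```

The real WMS evaluates far richer requirement and rank expressions, but the filter-then-rank shape is the same.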
LCG Architecture
The Data Management System allows the user to move files in and out of the Grid, replicate files among different Storage Elements and locate files.
- Files are stored in Storage Elements, either disk-only or with a tape backend.
- A number of protocols allow data transfer; GridFTP is the most commonly used.
- Files are registered in a central catalogue: the Replica Location Service keeps information about file locations and about some file metadata.
The Information System provides information about the Grid resources and their status.
- Information is generated on every service and published by the GRIS.
- It is propagated in a hierarchical structure: a GIIS at every site, with the BDII as central collector.
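A minimal sketch of this hierarchical propagation, with invented record structures: each GRIS publishes local service records, a per-site GIIS aggregates them, and the central BDII collects every site's feed:

```python
# Illustrative GRIS -> GIIS -> BDII aggregation (record fields are invented).

def giis(site, gris_records):
    """Site-level collector: tag each service record with its site."""
    return [dict(rec, site=site) for rec in gris_records]

def bdii(site_feeds):
    """Central collector: flatten every site's GIIS output into one view."""
    return [rec for feed in site_feeds for rec in feed]

cern = giis("cern.ch", [{"service": "CE", "state": "Production"}])
cnaf = giis("cnaf.infn.it", [{"service": "SE", "state": "Production"}])
top = bdii([cern, cnaf])   # what a Resource Broker would query
```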
LCG architecture
Accounting logs resource usage and traces user jobs.
Monitoring services visualize and record the status of LCG resources; different systems are in place (R-GMA, GridICE, ...).
The ATLAS production system
An ATLAS central database holds Grid-neutral information about jobs.
A "supervisor" agent distributes jobs to Grid-specific agents called "executors", follows up their status, and validates them in case of success or flags them for resubmission.
The executors offer an interface to the underlying Grid middleware; the LCG executor, Lexor, provides an interface to the native LCG WMS.
File upload/download relies on Grid-specific client tools; the ATLAS Data Management System (Don Quijote) ensures high-level data management across the different Grids.
Job monitoring is performed through Grid-specific tools. In LCG, information collected from the production database and the GridICE server is merged and published through an interactive web interface.
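The supervisor/executor split can be sketched as follows; the classes and the success criterion are purely illustrative, not ATLAS production-system code:

```python
# Illustrative supervisor/executor pattern: the supervisor hands jobs to a
# Grid-specific executor, validates successes, and flags failures for
# resubmission.

class Executor:
    """Stand-in for a Grid-specific executor such as Lexor (LCG)."""
    def submit(self, job):
        # A real executor would translate the job into middleware calls;
        # here we simply pretend that every even-numbered job succeeds.
        return "done" if job["id"] % 2 == 0 else "failed"

def supervise(jobs, executor):
    """One supervisor pass over a batch of jobs from the central database."""
    validated, to_resubmit = [], []
    for job in jobs:
        if executor.submit(job) == "done":
            validated.append(job["id"])
        else:
            to_resubmit.append(job["id"])   # flagged for another attempt
    return validated, to_resubmit

ok, retry = supervise([{"id": i} for i in range(4)], Executor())
```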
[Diagram: the production database (ProdDB) feeds several supervisor instances; each supervisor drives one executor (LCG, GRID3, NG or batch), which submits jobs to its own system; Don Quijote and the per-Grid RLS catalogues handle the data.]
Task Flow for ATLAS production
[Diagram: the ATLAS production task flow. Pythia event generation produces HepMC events; Geant4 detector simulation turns them into hits and MC truth; digitization (including pile-up with minimum-bias events) produces digits (RDO); event mixing and byte-stream conversion produce raw-format data; reconstruction produces ESD. Approximate data volumes for 10^7 events: ~5 TB, 20 TB, 30 TB, 20 TB and 5 TB across the successive stages.]
But in fact only part of the full chain was used/tested for the Rome Production:
- No ByteStream
- No Event Mixing
- The reconstruction was performed on digitized events and only partially on piled-up events
[Diagram: the reduced chain actually exercised: Pythia event generation, Geant4 detector simulation, digitization (with pile-up), and reconstruction to ESD/AOD.]
Rome Production experience on LCG
On average, 8 concurrent instances of Lexor were active on the native LCG-2 system.
Four people were controlling the production process, checking for job failures and interacting with the middleware developers and the LCG Experiment Integration Support team.
The production for the Rome workshop consisted of:
- A total of 380k jobs submitted to the native LCG-2 WMS: 109k simulation jobs, 106k digitization jobs, 125k reconstruction jobs and 40k pile-up jobs
- A total of 1.4M files stored in LCG Storage Elements, corresponding to about 45 TB of data
This is a clear improvement with respect to DC2, where 91.5k jobs in total ran on LCG-2 and no reconstruction was performed.
Rome Production experience on LCG
[Chart: number of jobs per day on LCG-2 from June 2004 to July 2005, covering Data Challenge 2 and the Rome Production; the vertical scale reaches 8000 jobs per day.]
Rome Production experience on LCG
Jobs were distributed to 45 different computing resources.
The ratio is generally proportional to the size of the cluster, which indicates an overall good job distribution.
No site in particular ran the large majority of jobs: the site with the largest number of CPU resources (CERN) contributed about 11% of the ATLAS production, and the other major sites ran between 5% and 8% of the jobs each.
This is an achievement toward a more robust and fault-tolerant system that does not rely on a small number of large computing centers.
[Pie chart: the percentage of ATLAS jobs run at each LCG site: cern.ch 11%, cnaf.infn.it 7%, rl.ac.uk 7%, roma1.infn.it 5%, in2p3-cc.fr 5%, shef.ac.uk 5%, fzk.de 5%, nikhef.nl 5%, ft.uam.es 5%, other infn.it sites 5%, lnl.infn.it 4%, ific.uv.es 4%, prague.cz 3%, sinica.edu.tw 3%, ba.infn.it 2%, mi.infn.it 2%, ihep.su 2%, sara.nl 2%, triumf.ca 2%, ox.ac.uk 2%, other ac.uk sites 2%, in2p3-cppm.fr 1%, other French sites 1%, grnet.gr 1%, ifae.es 1%, others 5%.]
Improvements: the Information System
An unstable Information System can affect production in many aspects:
- Jobs might not match the full set of resources and flood a restricted number of sites, causing overload of some site services and leaving other available resources unused.
- Data management commands might fail to transfer input and output files, waste a large amount of CPU cycles, and cause an overhead for the submission system and the production team.
For the Rome production, several aspects were improved:
- Fixes in the BDII software
- BDII deployed as a load-balanced service: multiple backends under a DNS switch
This reduced the single-point-of-failure effect for both job submission and data management during job execution.
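A toy sketch of why the load-balanced deployment helps (hostnames and structures invented): the DNS alias resolves to several backends, and a client that walks down the list survives the loss of any single backend:

```python
# Illustrative failover behind a DNS-switched BDII alias.

def resolve(alias, dns):
    """DNS switch: one alias maps to several backend hosts."""
    return dns[alias]

def query_bdii(alias, dns, backend_status):
    """Try each backend in turn; only fail if every backend is down."""
    for host in resolve(alias, dns):
        if backend_status.get(host) == "up":
            return f"answered by {host}"
    raise RuntimeError("all BDII backends down")

dns = {"lcg-bdii.example.org": ["bdii01.example.org", "bdii02.example.org"]}
status = {"bdii01.example.org": "down", "bdii02.example.org": "up"}
answer = query_bdii("lcg-bdii.example.org", dns, status)
```

With a single backend the same outage would have stopped both job submission and data management lookups.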
Improvements: Site Configuration
Site misconfiguration was the main source of failures during DC2. No procedure was in place during DC2: problems were treated on a case-by-case basis. This is unthinkable in the long term, since the LCG infrastructure counts a very large number of resources, and LCG grows very rapidly and is widely distributed.
Many sites started careful monitoring of the number of job failures and developed automatic tools to identify problematic nodes.
The LCG Operations team developed a series of automatic tools for site sanity controls (see next slide).
Improvements: Site Configuration
The Site Functional Tests, running every day at every site, test the correct configuration of the Worker Nodes and the interaction with Grid services. They can now include VO-specific tests and allow a VO-specific view.
The GIIS monitor checks the consistency of the information published by each site in the Information System, in almost real time.
Improvements: Site Configuration
Freedom of Choosing Resources allows the user to exclude misconfigured resources from the BDII:
- Generally based on the SFT results
- Possible to whitelist or blacklist a resource
- Can exclude the Storage and Computing Resources of the same site separately
Improvements: WMS and others
The LCG Workload Management System is highly automated, designed to reduce human intervention to a minimum, and consists of a complex set of services interacting with external components. This complexity caused a certain unreliability of the WMS during DC2.
The system became more robust before the Rome production, thanks to several bug fixes and optimizations in the WMS workflow.
The heterogeneous and dynamic nature of a Grid environment implies a certain level of unreliability; the ATLAS application was improved to cope with it.
The production team and the LCG operations and support teams gathered a lot of experience during DC2 and benefited from it at the time of the Rome Production.
Issues (and possible improvements)
Failure rates and causes are shown in the table below. The failure rate is still quite high (~48%). Different failures imply different amounts of resource loss; obviously the most serious cause is Data Management.

System      Cause                   Rate
WMS         Total                   1.6%
DM          Download input files    26.4%
DM          Upload output files     0.8%
ATLAS/LCG   Application crash       9.1%
ATLAS       Proxy expired           0.3%
LCG         Site misconfiguration   0.9%
-           Unclassified            9%
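As a quick sanity check, the per-cause rates in the table do add up to the quoted ~48% overall failure rate:

```python
# Sum the per-cause failure rates from the table (values in percent).
rates = {
    "WMS total": 1.6,
    "DM download input files": 26.4,
    "DM upload output files": 0.8,
    "Application crash": 9.1,
    "Proxy expired": 0.3,
    "Site misconfiguration": 0.9,
    "Unclassified": 9.0,
}
total = sum(rates.values())   # 48.1, i.e. the ~48% quoted above
```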
Data Management: issues
NO Reliable File Transfer service was in place during the Rome production; data movement was performed through the LCG DM client tools.
- The LCG DM tools did not provide timeout and retry capabilities. WORKAROUND: a timeout and possible retry was implemented in Lexor at some point.
- The LCG DM tools do not always ensure consistency between files in the SE and entries in the catalog (if the catalog is down or unreachable, or the operation is killed prematurely). WORKAROUND: manual cleanup was needed.
Data access on mass storage systems was very problematic: data need to be moved (staged) from tape to disk before being accessed, and the middleware could not ensure the existence/persistency of data on disk. WORKAROUND: manual pre-staging of files was carried out by the production team.
The ATLAS strategy for file distribution must be (re)thought: output files were chaotically spread around 143 different Storage Elements, and a replication schema for frequently accessed files was not in place. This complicates analysis of the reconstructed samples and the production itself.
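The timeout-and-retry workaround mentioned above can be sketched like this (a generic wrapper, not the actual Lexor code; the transfer function below is a stand-in for an LCG DM client call):

```python
# Generic timeout-and-retry wrapper around an unreliable transfer operation.

def with_retries(operation, attempts=3, timeout=300.0):
    """Run `operation(timeout)`; on failure, retry up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation(timeout)
        except Exception as err:        # a timeout or a transfer error
            last_error = err
    raise RuntimeError(f"gave up after {attempts} attempts: {last_error}")

# Stand-in transfer that fails twice before succeeding, to exercise the loop.
calls = []
def flaky_copy(timeout):
    calls.append(timeout)
    if len(calls) < 3:
        raise IOError("transfer timed out")   # first two attempts fail
    return "replica registered"

result = with_retries(flaky_copy)
```

The point of the wrapper is that a transient failure costs one retry rather than a lost job and its CPU cycles.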
Data Management: improvements
Timeout and retry capabilities were introduced natively in the LCG DM tools, which were also improved to guarantee atomic operations.
A new catalog, the LCG File Catalog, has been developed: more stable, easier problem tracking, better performance and reliability.
The Storage Resource Manager interface was introduced as a front-end to every SE: agreed on between the experiments and the middleware developers, it standardizes storage access and management and offers more functionality for MSS access.
A reliable File Transfer Service was developed within the EGEE project. It is a SERVICE: it allows files to be replicated between SEs in a reliable way, is built on top of GridFTP and SRM, and is capable of dealing with data transfers from/to MSS.
Data Management: improvements
FTS and SRM SEs have been intensively tested during Service Challenge 3 (ongoing): a throughput exercise started in July 2005 and is continuing at a low rate even now, with data distribution from the CERN T0 to several T1s.
Some issues have been addressed: many are already fixed, and others are being fixed in time for Service Challenge 4 (April 2006). In general, very positive feedback from the experiments.
Strategy for files distribution
The new ATLAS DDM is already in already in placeplace Fully tested and employed during SC3
Data throughput from CERN to ATLAS Tier 1s
Target (80MB/s sustained for a week) fully achieved
ATLAS DM is being now integrated ATLAS DM is being now integrated with ATLAS Production Systemwith ATLAS Production System.
New DistributedDistributed ATLAS Data ManagementATLAS Data Management system
Enforce concept of “logical datasetlogical dataset” collection of files being moved and located
as a unique entity.
Dataset Subscription ModelSubscription Model the site declare the interest in holding a
dataset
ATLAS agentsagents trigger migration of files Integrated with LFC, FTS and SRM
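A minimal sketch of the subscription model, with invented dataset and site names: the agent compares a logical dataset's file list against what the subscribed site already holds and triggers transfers for the difference:

```python
# Illustrative dataset-subscription agent: a logical dataset is a list of
# files treated as one entity; the agent copies whatever the subscribed
# site is missing (the copy stands in for an FTS transfer).

def missing_files(dataset, site_contents):
    """Files of the dataset that the site does not yet hold."""
    return [f for f in dataset["files"] if f not in site_contents]

def fulfil_subscription(dataset, site, catalogue):
    """One agent pass over a site's subscription to a dataset."""
    site_contents = catalogue.setdefault(site, set())
    for f in missing_files(dataset, site_contents):
        site_contents.add(f)          # a real agent would invoke FTS here
    return sorted(site_contents)

# Invented example: a site already holds one file of a three-file dataset.
rome_ds = {"name": "rome.recon.0001", "files": ["f1", "f2", "f3"]}
catalogue = {"site-A": {"f1"}}
held = fulfil_subscription(rome_ds, "site-A", catalogue)
```

Repeated agent passes are idempotent: once the site holds the full dataset, nothing more is transferred.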
The Workload Management: issues
The performance of the WMS for job submission and handling is generally acceptable in normal conditions, but degrades under stress. WORKAROUND: several RBs dedicated to ATLAS, with different hardware solutions, have been deployed.
The EGEE project will provide an enhanced WMS:
- Possibility of bulk submission, bulk matching and bulk queries
- Improved communication with the Computing Elements at sites
- Possible improvement of job submission speed and job dispatching
Some preliminary tests show promising results, but several issues must still be clarified.
Monitoring: issues and improvements
Lack of VO-specific information about jobs at the sites: GridICE sensors are deployed at every site, but not correctly configured everywhere, giving partial information that is difficult to interpret.
Queries to the ATLAS Production Database could cause an excessive load.
The error diagnostics should be improved: currently performed by parsing executor log files and querying the DB, it should be formalized in proper tools.
Real-time job output inspection would have been helpful, especially to investigate the causes of hanging jobs.
An ATLAS team is building a global job monitoring system, based on the current tools and possibly integrating new components (R-GMA etc.).
Conclusions
The Rome Production on the LCG infrastructure has been an overall successful exercise: it exercised the ATLAS production system, contributed to the testing of the ATLAS Computing and Data model, stress-tested the LCG infrastructure... and produced a lot of simulated data for the physicists!
This must be seen as the consequence of several improvements: in the Grid middleware, in the ATLAS components, and in LCG operations.
Still, several components need improvements, both in terms of reliability and performance, and production still requires a lot of human attention.
The issues have been addressed to the relevant parties, and a lot of work has been done since the Rome Production: preliminary tests show promising improvements, which will be evaluated fully in Service Challenge 4 (April 2006).