Challenges and Success of HEP GRID
Faïrouz Malek, CNRS
3rd EGEE User Forum 2008, Clermont-Ferrand
The scales
High Energy Physics machines and detectors
[Detector cross-section: muon chambers, tracker, calorimeter]
LHC: pp @ √s = 14 TeV, L = 10^34 cm^-2 s^-1; 40 million collisions per second; LVL1: 1 kHz, LVL3: 100 Hz; 0.1 to 1 GB/s of digitized data recorded
Tevatron: L = 2×10^32 cm^-2 s^-1; 2.5 million collisions per second; LVL1: 10 kHz, LVL3: 50-100 Hz; 25 MB/s of digitized data recorded
LHC: 4 experiments … ready! First beam expected in autumn 2008
[Standard Model chart: three generations of quarks (u, c, t; d, s, b) and leptons (e, νe; μ, νμ; τ, ντ), the gauge bosons (γ, g, W, Z) and the Higgs (H)]
Professor Vangelis, what are you expecting from the LHC?
← CMS Simulation
Supersymmetry: a new world where each boson (e.g. the photon) and each fermion (e.g. the electron) has super-partner(s)
New dimensions of space, in which only some particles can propagate → gravitons, new bosons …
Towards string theory … gravitation is treated quantum mechanically, which holds only with 10 or more dimensions of space-time.
Alas! … or hopefully? The Standard Model is not so standard, and … hmm … maybe …
Calabi-Yau
Physicists see online/offline TRUE (top) events @ a running D0/Fermilab experiment
A collision @ LHC
@ CERN: acquisition, first-pass reconstruction, storage, distribution
The Data Acquisition
LHC computing: is it really a challenge?
• Signal/background ratio: 10^-9
• Data volume: high rate × large number of channels × 4 experiments → 15 PetaBytes of new data each year
• Compute power: event complexity × number of events × thousands of users → 60k of (today's) fastest CPUs
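A back-of-envelope check of the data-volume scale (a hedged sketch: the event rate, event size and live time below are illustrative assumptions, not figures from the talk):

    # Rough arithmetic behind "15 PB of new data each year" (illustrative numbers).
    rate_hz = 200          # assumed average recorded event rate per experiment
    event_size_mb = 1.5    # assumed average raw event size in MB
    live_seconds = 1e7     # a typical accelerator year is ~10^7 live seconds
    experiments = 4

    pb_per_year = rate_hz * event_size_mb * live_seconds * experiments / 1e9  # MB -> PB
    print(f"~{pb_per_year:.0f} PB of new raw data per year")  # ~12 PB, the right order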
Options as seen in 1996, before the Grid was invented
Timeline LHC Computing (1994-2008)
• 1994: LHC approved
• 1996: ATLAS & CMS approved; their Computing Technical Proposals (CTP) estimate 10^7 MIPS and 100 TB of disk
• 1997: ALICE approved
• 1998: LHCb approved
• 2001: "Hoffmann" Review: 7×10^7 MIPS and 1,900 TB of disk (ATLAS or CMS requirements for the first year at design luminosity)
• 2005: Computing TDRs: 55×10^7 MIPS (140 MSi2K) and 70,000 TB of disk
• 2008: LHC start
Evolution of CPU Capacity at CERN
[Plot: CPU capacity at CERN across accelerator eras: SC (0.6 GeV), PS (28 GeV), ISR (300 GeV), SPS (400 GeV), ppbar (540 GeV), LEP (100 GeV), LEP II (200 GeV), LHC (14 TeV). Costs (2007 Swiss Francs) include infrastructure (computer centre, power, cooling, …) and physics tapes. LHC tape & disk requirements: more than 10 times what CERN alone can provide.]
Timeline Grids (1994-2008)
• US projects: GriPhyN, iVDGL, PPDG → Grid3 → OSG
• European projects: EU DataGrid → EGEE 1 → EGEE 2 → EGEE 3
• LHC: LCG 1 → LCG 2, Data Challenges, Service Challenges, Cosmics, First physics
WLCG: a partially decentralized model: replicate the event data at about five regional centres, with data transfer via network or movable media
[Diagram: CERN connected to regional centres RC1, RC2, …]
The Tiers Model: Tier-0, Tier-1, Tier-2
WLCG Collaboration
• The Collaboration: 4 LHC experiments; ~250 computing centres; 12 large centres (Tier-0, Tier-1); 38 federations of smaller "Tier-2" centres; growing to ~40 countries; Grids: EGEE, OSG, NorduGrid
• Technical Design Reports: WLCG and the 4 experiments, June 2005
• Memorandum of Understanding (agreed in October 2005): guaranteed resources; quality of service (24/7, 4-hour intervention)
• Resources: 5-year forward look; target reliability and efficiency: 95%
Centers around the world form a Supercomputer
• The EGEE and OSG projects are the basis of the Worldwide LHC Computing Grid Project WLCG
Inter-operation between Grids is working!
Available Infrastructure
EGEE: ~250 sites, >45,000 CPUs; OSG: ~15 sites for LHC, >10,000 CPUs
¼ of the resources are contributed by groups external to the project
>25k simultaneous jobs
What about the Middleware?
• Security: Virtual Organization Management (VOMS); MyProxy
• Data management: file catalogue (LFC); File Transfer Service (FTS); Storage Element (SE); Storage Resource Management (SRM)
• Job management: Workload Management System (WMS); Logging and Bookkeeping (LB); Computing Element (CE); Worker Nodes (WN)
• Information system: BDII (Berkeley Database Information Index) and R-GMA (Relational Grid Monitoring Architecture) aggregate service information from multiple Grid sites, now moved to SAM (Site Availability Monitoring); monitoring & visualization (GridView, Dashboard, GridMap, etc.)
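To make the security and job-management chain concrete: a minimal sketch of driving these services from a user interface node, wrapped in Python. voms-proxy-init and glite-wms-job-submit are the standard gLite tools; the JDL contents and file names are illustrative.

    import subprocess

    # A minimal gLite job description (JDL). The attribute names are standard
    # JDL; the executable and sandbox files are placeholders.
    jdl = '''Executable    = "analysis.sh";
    Arguments     = "run2008.cfg";
    StdOutput     = "std.out";
    StdError      = "std.err";
    InputSandbox  = {"analysis.sh", "run2008.cfg"};
    OutputSandbox = {"std.out", "std.err"};
    '''
    with open("myjob.jdl", "w") as f:
        f.write(jdl)

    # 1) Security: obtain a VOMS proxy for your virtual organisation.
    subprocess.run(["voms-proxy-init", "--voms", "atlas"], check=True)

    # 2) Job management: submit through the gLite Workload Management System,
    #    which matches the job to a Computing Element and its Worker Nodes.
    subprocess.run(["glite-wms-job-submit", "-a", "myjob.jdl"], check=True)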
Grid Analysis Tools
• ATLAS: pathena/PANDA; GANGA together with gLite and NorduGrid
• CMS: CRAB together with gLite WMS and Condor-G
• LHCb: GANGA together with DIRAC
• ALICE: AliEn2, PROOF
GANGA
• User-friendly job submission tool, extensible via a plugin system
• Support for several applications: Athena, AthenaMC (ATLAS); Gaudi, DaVinci (LHCb); others …
• Support for several backends: LSF, PBS, SGE, etc.; gLite WMS, NorduGrid, Condor; DIRAC, PANDA
• GANGA job building blocks
• Various interfaces: command line, IPython, GUI
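A flavour of the GANGA command line (a minimal sketch: Job, Executable and LCG are genuine GANGA building blocks, while the executable and its arguments are placeholders):

    # Inside a GANGA session (an IPython shell):
    j = Job()                                    # assemble a job from building blocks
    j.application = Executable(exe='/bin/echo',  # the application to run (placeholder)
                               args=['hello grid'])
    j.backend = LCG()                            # target the EGEE/LCG grid
    j.submit()
    print(jobs)                                  # the job registry: status of all jobs

The plugin system is visible here: swapping only the backend attribute, e.g. to Local() or LSF(), moves the same job between a local test, a batch farm and the grid.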
In total 968 persons since January, 579 in ATLAS; per month ~275 users, 150 in ATLAS
[Pie chart of GANGA users: ATLAS, LHCb, others]
ATLAS Strategy
• On the EGEE and NorduGrid infrastructures, ATLAS uses direct submission to the middleware through GANGA (EGEE: LCG RB and gLite WMS; NorduGrid: ARC middleware)
• On OSG: the PANDA system, a pilot-based system (see the sketch below), also available at some EGEE sites
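The pilot idea in a nutshell: instead of binding a physics job to a site at submission time, a lightweight pilot lands on the worker node first and then pulls real work from a central queue. A minimal sketch, assuming a hypothetical HTTP task-queue endpoint (not PANDA's actual protocol):

    import json, subprocess, urllib.request

    TASK_QUEUE = "https://panda.example.org/getJob"   # hypothetical endpoint

    def pilot():
        """Runs on a worker node: pull payloads until the queue is empty."""
        while True:
            with urllib.request.urlopen(TASK_QUEUE) as resp:
                job = json.load(resp)       # e.g. {"id": 42, "cmd": [...]} or null
            if not job:
                break                       # nothing left: release the batch slot
            subprocess.run(job["cmd"])      # run the real payload
            # report success/failure back to the server (omitted in this sketch)

    if __name__ == "__main__":
        pilot()

Late binding is the payoff: a payload is assigned only once a healthy CPU is actually available.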
About 50k jobs since September
Tier:      0    1    2    3
Fraction:  8%  37%  40%  15%
Tier-1 share: 48% Lyon, 36% FZK
ATLAS Panda System
• Interoperability is important
• PANDA jobs on some EGEE sites
• PANDA is an additional backend for GANGA
• The positive aspect is that it gives ATLAS choices on how to evolve
CMS CRAB Features
• CMS Remote Analysis Builder: a user-oriented tool for grid submission and handling of analysis jobs
• Support for gLite WMS and Condor-G
• Command-line oriented tool: lets the user create and submit jobs, query status and retrieve output (sketched below)
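A hedged sketch of that command-line workflow (the crab.cfg keys below follow the CRAB 2 configuration format as best I recall and should be checked against the CRAB documentation; the dataset and file names are placeholders):

    import subprocess

    # Minimal crab.cfg: which dataset, which CMSSW config, how to split the jobs.
    cfg = """[CRAB]
    jobtype   = cmssw
    scheduler = glite

    [CMSSW]
    datasetpath            = /PlaceholderDataset/Sim/RECO
    pset                   = analysis_cfg.py
    total_number_of_events = -1
    events_per_job         = 10000

    [USER]
    return_data = 1
    """
    with open("crab.cfg", "w") as f:
        f.write(cfg)

    subprocess.run(["crab", "-create"], check=True)     # build the job collection
    subprocess.run(["crab", "-submit"], check=True)     # submit via gLite WMS / Condor-G
    subprocess.run(["crab", "-status"], check=True)     # query job status
    subprocess.run(["crab", "-getoutput"], check=True)  # retrieve output of finished jobs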
Mid-July to mid-August 2007: 645k jobs (20k jobs/day), 89% grid success rate
• LHCb: GANGA as user interface, DIRAC as backend
• ALICE: AliEn2
• AliEn and DIRAC are in many respects similar to PANDA
PROOF cluster
[Diagram of a PROOF cluster: the user's query (a data file list plus mySelector.C) goes to the master/scheduler, which dispatches it across the file catalog, storage and CPUs; feedback and the merged final output flow back to the user.]
• Cluster perceived as an extension of the local PC
• Same macro and syntax as in a local session (see the sketch below)
• More dynamic use of resources
• Real-time feedback
• Automatic splitting and merging
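What "same macro and syntax" means in practice, sketched through ROOT's Python bindings (TProof.Open, SetProof and Process are standard ROOT calls; the master URL, tree name and selector are placeholders):

    import ROOT

    # Connect to a PROOF cluster (placeholder master URL).
    proof = ROOT.TProof.Open("master.example.org")

    # Build the same TChain you would use in a local session ...
    chain = ROOT.TChain("events")   # placeholder tree name
    chain.Add("root://storage.example.org//data/run*.root")

    # ... then redirect its processing to the cluster:
    chain.SetProof()                # without this line, Process() runs locally
    chain.Process("mySelector.C+")  # the selector is compiled and run on the workers;
                                    # PROOF splits the files and merges the outputs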
Baseline Services
The basic baseline services, from the TDR (2005):
• Storage Element: Castor, dCache, DPM (with SRM 1.1); StoRM added in 2007; SRM 2.2, after long delays, is being deployed in production
• Basic transfer tools: GridFTP, …
• File Transfer Service (FTS)
• LCG File Catalog (LFC)
• LCG data management tools: lcg-utils (illustrated below)
• POSIX I/O: Grid File Access Library (GFAL)
• Synchronised databases, Tier-0 → Tier-1s: the 3D project
• Information System
• Compute Elements: Globus/Condor-C; web services (CREAM)
• gLite Workload Management: in production at CERN
• VO Management System (VOMS)
• VO Boxes
• Application software installation
• Job Monitoring Tools
… continuing evolution: reliability, performance, functionality, requirements
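A hedged sketch of the everyday data-management layer these services expose. lcg-cr, lcg-rep and lcg-cp are genuine lcg-utils commands; the logical file name, storage elements and paths are placeholders:

    import subprocess

    LFN = "lfn:/grid/atlas/user/demo/events.root"   # placeholder logical file name

    # Copy a local file to a Storage Element and register it in the LFC catalogue.
    subprocess.run(["lcg-cr", "--vo", "atlas",
                    "-l", LFN,                      # name to register in the catalogue
                    "-d", "se.example.org",         # placeholder destination SE
                    "file:/tmp/events.root"], check=True)

    # Replicate the file to a second site ...
    subprocess.run(["lcg-rep", "--vo", "atlas",
                    "-d", "se2.example.org", LFN], check=True)

    # ... and fetch it back anywhere on the grid via its logical name.
    subprocess.run(["lcg-cp", "--vo", "atlas",
                    LFN, "file:/tmp/events_copy.root"], check=True)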
3D: Distributed Deployment of Databases for LCG
• Oracle streaming with downstream capture (ATLAS, LHCb)
• SQUID/FroNTier web caching (CMS)
LHCOPN Architecture
[Diagram: the Tier-1 centres (IN2P3, TRIUMF, ASCC, FNAL, BNL, Nordic, CNAF, SARA, PIC, RAL, GridKa) interconnected around CERN on the LHC Optical Private Network, each with Tier-2s attached.]
Tier-2s and Tier-1s are inter-connected by the general-purpose research networks.
Any Tier-2 may access data at any Tier-1.
The usage, the number of jobs, the production: the real success!
Data Transfer out of Tier-0
Site Reliability
83 Tier-2 sites being monitored
Reliability targets and results, CERN + Tier-1s:
                 Before July | Jul 07 | Dec 07 | Avg. last 3 months
Each site:           88%     |   91%  |   93%  |        89%
8 best sites:        88%     |   93%  |   95%  |        93%
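For orientation, a toy version of how such a reliability figure falls out of periodic availability tests (a sketch under simplifying assumptions; WLCG derived the real numbers from SAM test results and also corrected for scheduled downtime, which is ignored here):

    # Toy reliability figure from a month of hourly SAM-style test results:
    # 1 = test passed, 0 = test failed (placeholder data).
    tests = [1] * 670 + [0] * 50    # 720 hours in a 30-day month

    reliability = sum(tests) / len(tests)
    print(f"site reliability: {reliability:.0%}")   # -> 93%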
Grid Production per VO in one year
HEP: 33 million jobs, ~110 million normalised CPU
HEP Grid Production in one year
[Chart: contributions from BaBar, D0, ILC, …]
CMS simulation, 2nd term 2007: ~675 M events
[Chart: contributions from CC-IN2P3, FNAL, PIC]
ATLAS: the data chain works (Sept 2007)
Tracks recorded in the muon chambers of the ATLAS detector were express-shipped to physicists all over the world, enabling simultaneous analysis at sites across the globe. About two million muon events were recorded over two weeks.
Terabytes of data were moved from the Tier-0 site at CERN to Tier-1 sites across Europe (seven sites), North America (one site in the U.S. and one in Canada) and Asia (one site in Taiwan). Data transfer rates reached the expected maximum. Real analysis (at Tier-2s) happened in quasi real-time at sites across Europe and the U.S.
Ramp-up Needed for Start-up
[Plots, Sep 06 / Jul 07 / Apr 08: installed capacity versus pledges and target usage, requiring growth factors of 3.7x, 3x, 2.9x, 2.3x and 3.7x across the five plots.]
The Grid is now in operation, working on: reliability, scaling up, sustainability
Summary
• Applications support in good shape
• WLCG service: baseline services in production, with the exception of SRM 2.2; continuously increasing capacity and workload; general site reliability is improving, but still a concern; data and storage remain the weak points
• Experiment testing progressing: now involving most sites, approaching full dress rehearsals
• Sites & experiments working well together to tackle the problems
• Major Combined Computing Readiness Challenge, Feb-May 2008, before the machine starts: essential to provide experience for site operations and storage systems, stressed simultaneously by all four experiments
• Steep ramp-up ahead to deliver the capacity needed for the 2008 run
Improving Reliability
• Monitoring
• Metrics
• Workshops
• Data challenges
• Experience
• Systematic problem analysis
• Priority from software developers