Challenges and Success of HEP GRID. Faïrouz Malek, CNRS. 3rd EGEE User FORUM 2008, Clermont-Ferrand


Page 1: Challenges and Success  of HEP GRID

Challenges and Success of HEP GRID

Faïrouz Malek, CNRS

3rd EGEE User FORUM 2008, Clermont-Ferrand

Page 2: Challenges and Success  of HEP GRID


The scales

Page 3: Challenges and Success  of HEP GRID

High Energy Physics machines and detectors

[Detector sketch labels: muon chambers, calorimeter, tracker]

Tevatron (L: 2×10^32 /cm^2/s): 2.5 million collisions per second; LVL1: 10 kHz, LVL3: 50-100 Hz; 25 MB/s digitized recording

LHC (pp @ √s = 14 TeV, L: 10^34 /cm^2/s): 40 million collisions per second; LVL1: 1 kHz, LVL3: 100 Hz; 0.1 to 1 GB/s digitized recording
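As a quick consistency check, the recording rate is simply the LVL3 output rate times the raw event size; the 1-10 MB/event size assumed below is an illustration consistent with, not stated on, the slide:

```latex
% LHC: LVL3 output rate x assumed raw event size
100\,\mathrm{Hz} \times (1\text{--}10)\,\mathrm{MB/event} \approx 0.1\text{--}1\,\mathrm{GB/s}
```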

Page 4: Challenges and Success  of HEP GRID


LHC: 4 experiments … ready! First beam expected in autumn 2008

Page 5: Challenges and Success  of HEP GRID

[Figure: the Standard Model particle content – three generations (1st, 2nd, 3rd) of quarks (u, c, t; d, s, b) and leptons (e, μ, τ; νe, νμ, ντ), the gauge bosons (γ, g, W, Z), and the Higgs boson H.]

Professor Vangelis, what are you expecting from the LHC ?

← CMS Simulation

Page 6: Challenges and Success  of HEP GRID

Supersymmetry: a new world in which every boson (e.g. the photon) or fermion (e.g. the electron) has one or more superpartners

New (spatial) dimensions, in which only some particles can propagate → gravitons, new bosons …

Towards string theory … gravitation is handled by quantum mechanics, which works only with 10 or more space-time dimensions.

Alas! … Hopefully? The SM is not so Standard AND … Hmmmm … Maybe …

Calabi-Yau

Page 7: Challenges and Success  of HEP GRID

Physicists see online/offline TRUE (top) events at the running D0/Fermilab experiment

Page 8: Challenges and Success  of HEP GRID


A collision @ LHC

Page 9: Challenges and Success  of HEP GRID

@ CERN: Acquisition, First-pass reconstruction, Storage, Distribution

Page 10: Challenges and Success  of HEP GRID


The Data Acquisition

Page 11: Challenges and Success  of HEP GRID

LHC computing: is it really a challenge?

• Signal/Background ~ 10^-9

• Data volume – high rate × large number of channels × 4 experiments
→ 15 PetaBytes of new data each year (rough check below)

• Compute power – event complexity × number of events × thousands of users
→ 60k of (today's) fastest CPUs
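A back-of-the-envelope check of the 15 PB/year figure, assuming roughly 0.5 GB/s of combined raw recording rate across the experiments and about 10^7 seconds of effective data taking per year (both assumptions, not numbers from the slide):

```latex
% raw data alone
0.5\,\mathrm{GB/s} \times 10^{7}\,\mathrm{s} \approx 5\,\mathrm{PB/year}
% reconstructed, derived and simulated copies bring the total to O(15)\,\mathrm{PB/year}
```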

Page 12: Challenges and Success  of HEP GRID

Options as seen in 1996, before the Grid was invented

Page 13: Challenges and Success  of HEP GRID

Timeline – LHC Computing (1994-2008)

Milestones along the way: LHC approved; ATLAS & CMS approved; ALICE approved; LHCb approved; the ATLAS & CMS Computing Technical Proposals (CTP); the "Hoffmann" Review; the Computing TDRs; LHC start.

Estimates of ATLAS (or CMS) requirements for the first year at design luminosity:
• CTP: 10^7 MIPS, 100 TB disk
• "Hoffmann" Review: 7×10^7 MIPS, 1,900 TB disk
• Computing TDRs: 55×10^7 MIPS, 70,000 TB disk (140 MSi2K)

Page 14: Challenges and Success  of HEP GRID

Evolution of CPU Capacity at CERN

[Chart: CPU capacity at CERN over time, marked with the successive machines – SC (0.6 GeV), PS (28 GeV), ISR (300 GeV), SPS (400 GeV), ppbar (540 GeV), LEP (100 GeV), LEP II (200 GeV), LHC (14 TeV). Costs are in 2007 Swiss Francs and include infrastructure (computer centre, power, cooling, …) and physics tapes. Tape & disk requirements: more than 10 times what CERN alone could provide.]

Page 15: Challenges and Success  of HEP GRID

Timeline – Grids

[Chart, 1994-2008: grid projects and milestones – GriPhyN, iVDGL, PPDG; EU DataGrid; LCG 1 and LCG 2; EGEE 1, EGEE 2, EGEE 3; GRID 3 and OSG; Data Challenges, Service Challenges, Cosmics, First physics.]

WLCG partially decentralized model:
– replicate the event data at about five regional centres (CERN, RC1, RC2, …)
– data transfer via network or movable media

Page 16: Challenges and Success  of HEP GRID

The Tiers Model: Tier-0, Tier-1, Tier-2

Page 17: Challenges and Success  of HEP GRID

WLCG Collaboration

• The Collaboration
– 4 LHC experiments
– ~250 computing centres
– 12 large centres (Tier-0, Tier-1)
– 38 federations of smaller "Tier-2" centres
– Growing to ~40 countries
– Grids: EGEE, OSG, NorduGrid

• Technical Design Reports
– WLCG, 4 experiments: June 2005

• Memorandum of Understanding (agreed in October 2005)
– Guaranteed resources
– Quality of service (24/7, 4h intervention)

• Resources
– 5-year forward look
– Target reliability and efficiency: 95%

Page 18: Challenges and Success  of HEP GRID


Centers around the world form a Supercomputer

• The EGEE and OSG projects are the basis of the Worldwide LHC Computing Grid Project WLCG

Inter-operation between Grids is working!

Page 19: Challenges and Success  of HEP GRID

Available Infrastructure

EGEE: ~250 sites, >45,000 CPUs
OSG: ~15 sites for LHC, >10,000 CPUs

¼ of the resources are contributed by groups external to the project

>25k simultaneous jobs

Page 20: Challenges and Success  of HEP GRID

What about the Middleware?

• Security
– Virtual Organization Management (VOMS)
– MyProxy

• Data management
– File catalogue (LFC)
– File Transfer Service (FTS)
– Storage Element (SE)
– Storage Resource Management (SRM)

• Job management
– Workload Management System (WMS)
– Logging and Bookkeeping (LB)
– Computing Element (CE)
– Worker Nodes (WN)

• Information System
– Monitoring: BDII (Berkeley Database Information Index) and R-GMA (Relational Grid Monitoring Architecture) aggregate service information from multiple Grid sites; now moved to SAM (Site Availability Monitoring)
– Monitoring & visualization (GridView, Dashboard, GridMap, etc.)
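To make the components above concrete, here is a minimal sketch of how a user typically drives them, with Python wrapping the standard gLite command-line clients for illustration. The JDL contents, file names and VO are placeholders; the commands (voms-proxy-init, glite-wms-job-submit, glite-wms-job-status, glite-wms-job-output) are the usual gLite client tools, but exact flags may differ between releases.

```python
# Sketch of a typical gLite job flow: VOMS proxy -> WMS submission -> status -> output.
# Command names are the standard gLite clients; options are from memory and may vary
# by release. "myjob.jdl", the scripts and the VO name are illustrative placeholders.
import subprocess

JDL = """
Executable    = "myanalysis.sh";
Arguments     = "run2008.cfg";
StdOutput     = "job.out";
StdError      = "job.err";
InputSandbox  = {"myanalysis.sh", "run2008.cfg"};
OutputSandbox = {"job.out", "job.err"};
"""

def run(cmd):
    print("$", " ".join(cmd))
    return subprocess.run(cmd, capture_output=True, text=True).stdout

with open("myjob.jdl", "w") as f:
    f.write(JDL)

# 1. Authenticate: create a short-lived proxy carrying VO (VOMS) attributes.
run(["voms-proxy-init", "--voms", "atlas"])

# 2. Submit through the gLite Workload Management System (WMS); -a delegates the proxy.
run(["glite-wms-job-submit", "-a", "-o", "jobids.txt", "myjob.jdl"])

# 3. Poll status (WMS + Logging & Bookkeeping) and fetch the output sandbox when done.
run(["glite-wms-job-status", "-i", "jobids.txt"])
run(["glite-wms-job-output", "-i", "jobids.txt", "--dir", "./output"])
```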

Page 21: Challenges and Success  of HEP GRID

• ATLAS
– pathena/PANDA
– GANGA, together with gLite and NorduGrid

• CMS
– CRAB, together with the gLite WMS and CondorG

• LHCb
– GANGA, together with DIRAC

• ALICE
– AliEn2, PROOF

GRID ANALYSIS TOOLS

Page 22: Challenges and Success  of HEP GRID

• User-friendly job submission tool
– Extensible thanks to its plugin system

• Support for several applications
– Athena, AthenaMC (ATLAS)
– Gaudi, DaVinci (LHCb)
– Others …

• Support for several backends
– LSF, PBS, SGE, etc.
– gLite WMS, NorduGrid, Condor
– DIRAC, PANDA

• GANGA job building blocks

• Various interfaces
– Command line, IPython, GUI
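As an illustration of the building blocks listed above (application, splitter, backend, interface), here is a minimal sketch of a GANGA session in Python, GANGA's native command-line/IPython language. The script and argument names are placeholders, and attribute names may differ slightly between GANGA versions.

```python
# Minimal GANGA-style session (run inside the ganga CLI / IPython prompt, where
# Job, Executable, ArgSplitter, LCG, ... are already in scope). The executable and
# arguments are placeholders; experiment application plugins (Athena, AthenaMC,
# Gaudi, ...) and other backends (NorduGrid, PANDA, DIRAC, LSF, ...) plug into
# the same slots.
j = Job(name="demo_analysis")

# Application plugin: what to run.
j.application = Executable(exe="myanalysis.sh", args=["run2008.cfg"])

# Splitter plugin: one subjob per argument set.
j.splitter = ArgSplitter(args=[["slice0"], ["slice1"], ["slice2"]])

# Backend plugin: where to run -- here the gLite/LCG grid; Local() would run on this PC.
j.backend = LCG()

j.submit()          # submit all subjobs through the chosen backend
print(j.status)     # 'submitted' -> 'running' -> 'completed'
```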

Page 23: Challenges and Success  of HEP GRID

In total 968 persons since January (579 in ATLAS); per month ~275 users (150 in ATLAS).

[Chart legend: ATLAS, LHCb, Others]

Page 24: Challenges and Success  of HEP GRID

• On the EGEE and NorduGrid infrastructures, ATLAS uses direct submission to the middleware through GANGA
– EGEE: LCG RB and gLite WMS
– NorduGrid: ARC middleware

• On OSG: the PANDA system
– Pilot-based system (sketched below)
– Also available at some EGEE sites

ATLAS Strategy
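The pilot-based approach can be summarised schematically: a lightweight pilot job is submitted through the normal grid middleware, and only once it is running on a worker node does it pull the real payload from a central task queue ("late binding"). The sketch below is a conceptual illustration of that pattern, not PanDA's actual code or API; the server URL, endpoint and payload format are invented.

```python
# Conceptual sketch of the pilot-job pattern used by PANDA (and, similarly, DIRAC/AliEn).
# This is NOT PanDA code: the endpoint and payload format are invented, purely to
# illustrate late binding of work to resources.
import json
import subprocess
import urllib.request

TASK_QUEUE = "https://panda-like-server.example.org/getjob"   # hypothetical endpoint

def fetch_payload():
    """Ask the central task queue for work suited to this worker node."""
    with urllib.request.urlopen(TASK_QUEUE) as resp:
        return json.load(resp)       # e.g. {"id": 42, "cmd": ["athena.py", "opts.py"]} or None

def pilot():
    # The pilot itself arrived on the worker node via normal grid submission;
    # it validates the local environment, then pulls real jobs until none remain.
    while True:
        job = fetch_payload()
        if not job:
            break                    # nothing to do: the pilot exits and frees the slot
        subprocess.run(job["cmd"])   # run the actual physics payload
        # ... report status and upload outputs back to the central service ...

if __name__ == "__main__":
    pilot()
```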

Page 25: Challenges and Success  of HEP GRID

About 50K jobs since September

Fraction of jobs per tier: Tier-0 8%, Tier-1 37%, Tier-2 40%, Tier-3 15%

Tier-1 share: 48% Lyon, 36% FZK

Page 26: Challenges and Success  of HEP GRID


ATLAS Panda System

• Interoperability is important

• PANDA jobs on some EGEE sites

• PANDA is an additional backend for GANGA

• The positive aspect is that it gives ATLAS choices on how to evolve

Page 27: Challenges and Success  of HEP GRID

• CMS Remote Analysis Builder
– User-oriented tool for grid submission and handling of analysis jobs

• Support for the gLite WMS and CondorG

• Command-line oriented tool
– Allows users to create and submit jobs, query their status and retrieve output (see the sketch below)

CMS CRAB FEATURES
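A typical CRAB session is sketched below, with Python driving the command line for consistency with the other examples. The crab.cfg keys follow CRAB 2 conventions as far as I recall them; the dataset and CMSSW configuration names are placeholders, and exact option names should be checked against the CRAB documentation.

```python
# Sketch of a CRAB (CMS Remote Analysis Builder) session.
# crab.cfg keys follow CRAB 2 conventions from memory; dataset and pset names are placeholders.
import subprocess

CRAB_CFG = """
[CRAB]
jobtype   = cmssw
# scheduler can be glite or condor_g, matching the gLite WMS / CondorG support
scheduler = glite

[CMSSW]
datasetpath            = /SomePrimaryDataset/Summer07-CSA07/AODSIM
pset                   = my_analysis_cfg.py
total_number_of_events = 100000
events_per_job         = 5000

[USER]
return_data = 1
"""

with open("crab.cfg", "w") as f:
    f.write(CRAB_CFG)

subprocess.run(["crab", "-create"])      # build the jobs from crab.cfg
subprocess.run(["crab", "-submit"])      # submit them to the grid
subprocess.run(["crab", "-status"])      # query their status
subprocess.run(["crab", "-getoutput"])   # retrieve the output once finished
```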

Page 28: Challenges and Success  of HEP GRID

Mid-July to mid-August 2007: 645K jobs (20K jobs/day), 89% grid success rate

Page 29: Challenges and Success  of HEP GRID

• LHCb
– GANGA as user interface
– DIRAC as backend

• ALICE
– AliEn2

• AliEn and DIRAC are in many respects similar to PANDA

Page 30: Challenges and Success  of HEP GRID

[Diagram: a PROOF cluster – file catalog, master, scheduler, storage and CPUs. The user submits a PROOF query (a data file list plus mySelector.C) and receives real-time feedback and the merged final output.]

• Cluster perceived as an extension of the local PC
• Same macro and syntax as in a local session
• More dynamic use of resources
• Real-time feedback
• Automatic splitting and merging
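The PROOF query shown in the diagram above can be expressed in a few lines. The sketch below uses ROOT's Python bindings for consistency with the other examples (natively this is usually done in a ROOT/C++ macro); the master host, tree name and file paths are placeholders.

```python
# Sketch of a PROOF analysis session via PyROOT; host, tree and file names are placeholders.
import ROOT

# Connect to the PROOF master, which coordinates the scheduler, workers and storage.
proof = ROOT.TProof.Open("proofmaster.example.org")

# Build the dataset exactly as in a local session: a TChain over the input files.
chain = ROOT.TChain("CollectionTree")
chain.Add("root://se.example.org//data/run1234/file_000.root")
chain.Add("root://se.example.org//data/run1234/file_001.root")

# Attach the chain to PROOF and process the same selector used locally.
# PROOF splits the work over the workers and merges the outputs automatically.
chain.SetProof()
chain.Process("mySelector.C+")   # '+' compiles the selector on the workers
```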

Page 31: Challenges and Success  of HEP GRID

Baseline Services

• Storage Element
– Castor, dCache, DPM (with SRM 1.1)
– StoRM added in 2007
– SRM 2.2 – long delays incurred – being deployed in production

• Basic transfer tools – GridFTP, …
• File Transfer Service (FTS)
• LCG File Catalog (LFC)
• LCG data management tools – lcg-utils
• Posix I/O
– Grid File Access Library (GFAL)

• Synchronised databases T0↔T1s
– 3D project

• Information System
• Compute Elements
– Globus/Condor-C
– web services (CREAM)

• gLite Workload Management
– in production at CERN

• VO Management System (VOMS)
• VO Boxes
• Application software installation
• Job Monitoring Tools

The Basic Baseline Services – from the TDR (2005)

... continuing evolution: reliability, performance, functionality, requirements
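To illustrate the data-management chain above (SE + SRM + LFC + lcg-utils), here is a hedged sketch of copying a file to a storage element and registering it in the LCG File Catalogue, again driving the standard command-line tools from Python. The SE host, VO and logical file name are placeholders; exact option names may differ between lcg-utils releases, and lfc-ls normally expects the LFC_HOST environment variable to be set.

```python
# Sketch of grid data management with lcg-utils and the LFC.
# SE hostname, VO and logical file name (LFN) are placeholders; option names are the
# usual lcg-utils ones but should be checked for your release.
import subprocess

VO  = "atlas"
SE  = "srm.example-t1.org"                            # hypothetical storage element
LFN = "lfn:/grid/atlas/user/example/ntuple_001.root"  # hypothetical logical file name

# Copy a local file to the SE and register it in the LFC in one step.
subprocess.run(["lcg-cr", "--vo", VO, "-d", SE, "-l", LFN,
                "file:/home/user/ntuple_001.root"])

# Later, anyone in the VO can resolve the LFN and copy the file back locally.
subprocess.run(["lcg-cp", "--vo", VO, LFN, "file:/tmp/ntuple_001.root"])

# Browse the catalogue namespace directly with the LFC client tools.
subprocess.run(["lfc-ls", "-l", "/grid/atlas/user/example"])
```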

Page 32: Challenges and Success  of HEP GRID

3D – Distributed Deployment of Databases for LCG

Oracle Streams replication with downstream capture

(ATLAS, LHCb)

SQUID/FRONTIER Web caching

(CMS)

Page 33: Challenges and Success  of HEP GRID

LHCOPN Architecture

Tier-2s and Tier-1s are inter-connected by the general-purpose research networks; any Tier-2 may access data at any Tier-1.

[Diagram: the LHC Optical Private Network linking CERN (Tier-0) to the Tier-1 centres – IN2P3, TRIUMF, ASCC, FNAL, BNL, Nordic, CNAF, SARA, PIC, RAL, GridKa – each with Tier-2 sites attached.]

Page 34: Challenges and Success  of HEP GRID


The usage

The number of jobs

The production

The real success !!!!

Page 35: Challenges and Success  of HEP GRID


Data Transfer out of Tier-0

Page 36: Challenges and Success  of HEP GRID


Site reliability

Page 37: Challenges and Success  of HEP GRID

Site Reliability

Tier-2 sites: 83 Tier-2 sites being monitored

Targets – CERN + Tier-1s:

                Before July 07   July 07   Dec 07   Avg. last 3 months
Each site            88%           91%      93%           89%
8 best sites         88%           93%      95%           93%

Page 38: Challenges and Success  of HEP GRID

GRID Production per VO in one year

HEP

33 million jobs ~ 110 million Norm. CPU

Page 39: Challenges and Success  of HEP GRID

HEP GRID Production in one year

BaBar

D0

ILC, …

Page 40: Challenges and Success  of HEP GRID

CMS simulation, 2nd term of 2007

CC-IN2P3

FNAL

PIC

~675 Mevents

Page 41: Challenges and Success  of HEP GRID

ATLAS: the data chain works – Sept 2007

Tracks recorded in the muon chambers of the ATLAS detector were shipped promptly to physicists all over the world, enabling simultaneous analysis at sites across the globe. About two million muons were recorded over two weeks.

Terabytes of data were moved from the Tier-0 at CERN to Tier-1 sites across Europe (seven sites), North America (one site in the USA and one in Canada) and Asia (one site in Taiwan). Data transfer rates reached the expected maximum. Real analysis (at Tier-2s) happened in quasi real-time at sites across Europe and the U.S.

Page 42: Challenges and Success  of HEP GRID

Ramp-up Needed for Start-up

[Chart: installed and pledged capacity versus target and actual usage, from Sep 2006 / Jul 2007 to Apr 2008; the required growth factors to reach the April 2008 pledges are 3.7×, 3×, 2.9×, 2.3× and 3.7×.]

Page 43: Challenges and Success  of HEP GRID


The Grid is now in operation, working on: reliability, scaling up, sustainability

Page 44: Challenges and Success  of HEP GRID

Summary

• Applications support in good shape

• WLCG service
– Baseline services in production, with the exception of SRM 2.2
– Continuously increasing capacity and workload
– General site reliability is improving – but still a concern
– Data and storage remain the weak points

• Experiment testing progressing
– now involving most sites, approaching full dress rehearsals

• Sites & experiments working well together to tackle the problems

• Major Combined Computing Readiness Challenge Feb-May 2008, before the machine starts – essential to provide experience for site operations and storage systems, stressed simultaneously by all four experiments

• Steep ramp-up ahead to deliver the capacity needed for the 2008 run

Page 45: Challenges and Success  of HEP GRID

Improving Reliability

• Monitoring
• Metrics
• Workshops
• Data challenges
• Experience
• Systematic problem analysis
• Priority from software developers