
Page 1:

LCG Status and Plans

GridPP13, Durham, UK
4th July 2005

Ian Bird, IT/GD, CERN

Page 2:

Overview

Introduction: project goals and overview
Status: Applications Area, Fabric, Deployment and Operations
Baseline Services
Service Challenges
Summary

Page 3:

LCG – Goals

The goal of the LCG project is to prototype and deploy the computing environment for the LHC experiments.

Two phases:
Phase 1 (2002-2005): build a service prototype based on existing grid middleware; gain experience in running a production grid service; produce the TDR for the final system.
Phase 2 (2006-2008): build and commission the initial LHC computing environment.

We are here: the LCG and Experiment Computing TDRs were completed and presented to the LHCC last week.

Page 4:

Project Areas & Management (joint with EGEE)

Project Leader: Les Robertson
Resource Manager: Chris Eck
Planning Officer: Jürgen Knobloch
Administration: Fabienne Baud-Lavigne

Applications Area (Pere Mato): development environment; joint projects; data management; distributed analysis
Middleware Area (Frédéric Hemmer): provision of a base set of grid middleware (acquisition, development, integration); testing, maintenance, support
CERN Fabric Area (Bernd Panzer): large cluster management; data recording; cluster technology; networking; computing service at CERN
Grid Deployment Area (Ian Bird): establishing and managing the Grid Service - middleware, certification, security, operations, registration, authorisation, accounting
Distributed Analysis - ARDA (Massimo Lamanna): prototyping of distributed end-user analysis using grid technology

Page 5:

Applications Area

Page 6:

Application Area Focus

Deliver the common physics applications software.

Organized to ensure focus on real experiment needs: experiment-driven requirements and monitoring; architects in management and execution; open information flow and decision making; participation of experiment developers; frequent releases enabling iterative feedback.

Success is defined by experiment validation: integration, evaluation, successful deployment.

Page 7:

Validation Highlights

POOL was successfully used in large-scale production in the ATLAS, CMS and LHCb data challenges in 2004, with ~400 TB of POOL data produced. The objective of a quickly developed persistency hybrid leveraging ROOT I/O and RDBMSes has been fulfilled (a minimal I/O sketch follows below).

Geant4 is firmly established as the baseline simulation in successful ATLAS, CMS and LHCb production: EM and hadronic physics validated; highly stable, with about one G4-related crash per O(1M) events.

SEAL components underpin POOL's success, in particular the dictionary system, now entering a second generation with Reflex.

SPI's Savannah project portal and external software service are accepted standards inside and outside the project.
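As a rough illustration of the ROOT I/O half of that hybrid (the RDBMS side holds catalogues and metadata), the sketch below writes a toy branch to a ROOT file. This is plain ROOT rather than POOL's own API, and the file and branch names are invented for the example.

#include "TFile.h"
#include "TTree.h"

int main() {
  TFile* f = TFile::Open("events.root", "RECREATE"); // illustrative file name
  TTree* tree = new TTree("Events", "toy event data");
  double energy = 0.0;
  tree->Branch("energy", &energy, "energy/D");       // one simple branch
  for (int i = 0; i < 1000; ++i) {
    energy = 0.1 * i;                                // fake payload
    tree->Fill();
  }
  f->Write();   // stream the tree's buffers to the file
  f->Close();   // the file owns the tree and cleans it up
  delete f;
  return 0;
}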

Page 8:

Current AA Projects

SPI - Software Process Infrastructure (A. Aimar): software and development services - external libraries, Savannah, software distribution, support for build, test, QA, etc.

ROOT - Core Libraries and Services (R. Brun): foundation class libraries, math libraries, framework services, dictionaries, scripting, GUI, graphics, etc.

POOL - Persistency Framework (D. Duellmann): storage manager, file catalogs, event collections, relational access layer, conditions database, etc.

SIMU - Simulation project (G. Cosmo): simulation framework, physics validation studies, MC event generators, participation in Geant4 and Fluka.

Page 9:

SEAL and ROOT Merge

The major change in the AA has been the merge of the SEAL project with the ROOT project.

Details of the merge are being discussed following a process defined by the AF: breakdown into a number of topics; proposals discussed with the experiments; public presentations; final decisions by the AF.

Current status: dictionary plans approved; the MathCore and Vector library proposals have been approved (see the sketch below); first development release of ROOT including these new libraries.
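For a sense of what the approved MathCore and Vector proposals cover, here is a minimal sketch using the ROOT::Math generic vector classes. The header and typedef names follow MathCore's conventions but should be treated as illustrative rather than a definitive reference.

#include "Math/Vector3D.h"   // ROOT::Math::XYZVector
#include "Math/Vector4D.h"   // ROOT::Math::XYZTVector (px, py, pz, E)
#include <iostream>

int main() {
  ROOT::Math::XYZVector  v(1.0, 2.0, 3.0);            // 3D displacement vector
  ROOT::Math::XYZTVector p(10.0, 0.0, 0.0, 10.5);     // Lorentz vector
  std::cout << "|v|  = " << v.R() << "\n"             // 3-vector magnitude
            << "mass = " << p.M() << "\n";            // invariant mass of p
  return 0;
}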

Page 10:

Ongoing Work

SPI: porting LCG-AA software to amd64 (gcc 3.4.4); finalizing software distribution based on Pacman; QA tools for test coverage and Savannah reports.

ROOT: development version v5.02 released last week, including the new libraries mathcore, reflex, cintex and roofit.

POOL: version 2.1 released, including new file catalog implementations - LFCCatalog (LFC), GliteCatalog (gLite Fireman) and GTCatalog (Globus Toolkit); new version 1.2 of the Conditions DB (COOL); adapting POOL to the new dictionaries (Reflex) - see the dictionary sketch below.

SIMU: new Geant4 public minor release 7.1 is being prepared; public release of Fluka expected by end of July; intense activity on the combined calorimeter physics validation with ATLAS, report in September; new MC generators (CASCADE, CHARYBDIS, etc.) being added to the already long list provided; prototyping persistency of Geant4 geometry with ROOT.
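The dictionary work (Reflex, and adapting POOL to it) is easiest to picture with a small example of what a reflection dictionary provides: creating and inspecting a class purely by name at run time. The sketch below uses ROOT's long-standing TClass interface rather than the new Reflex API, whose calls differ, purely to convey the idea.

#include "TClass.h"
#include "TList.h"
#include <iostream>

int main() {
  TClass* cl = TClass::GetClass("TNamed");     // look the class up by name
  if (!cl) { std::cerr << "no dictionary for TNamed\n"; return 1; }
  void* obj = cl->New();                       // instantiate without a compile-time type
  TIter next(cl->GetListOfDataMembers());      // walk its data members
  while (TObject* dm = next())
    std::cout << dm->GetName() << "\n";        // e.g. fName, fTitle
  cl->Destructor(obj);                         // destroy the object via the dictionary
  return 0;
}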

Page 11:

Fabric Area

Page 12:

CERN Fabric

Fabric automation has seen very good progress: the new systems for managing large farms are in production at CERN, and include technology developed by the European DataGrid.

(Diagram: the Extremely Large Fabric management system - configuration, installation and management of nodes; Lemon, LHC Era Monitoring, for system and service monitoring; LHC Era Automated Fabric for hardware / state management.)

Page 13:

CERN Fabric

Fabric automation has seen very good progress; the new systems for managing large farms are in production at CERN.

New CASTOR Mass Storage System: deployed first on the high-throughput cluster for the recent ALICE data recording computing challenge.

Agreement on collaboration with Fermilab on the Linux distribution: Scientific Linux, based on Red Hat Enterprise 3, improves uniformity between the HEP sites serving the LHC and Run 2 experiments.

CERN computer centre preparations: power upgrade to 2.5 MW; computer centre refurbishment well under way; acquisition process started.

Page 14:

Preparing for 7,000 boxes in 2008

Page 15:

High Throughput Prototype (openlab/LCG)

Experience with likely ingredients in LCG: 64-bit programming; next-generation I/O (10 Gb Ethernet, Infiniband, etc.).

High-performance cluster used for evaluations, and for data challenges with experiments.

Flexible configuration: components moved in and out of the production environment.

Co-funded by industry and CERN.

Page 16:

ALICE Data Recording Challenge

Target: one week sustained at 450 MB/sec.
Used the new version of the CASTOR mass storage system.
Note the smooth degradation and recovery after equipment failure.
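As a quick sanity check on what that target implies, the arithmetic below works out the volume recorded in a week of sustained running at 450 MB/sec (roughly 270 TB). The figures come straight from the slide; the code is just the calculation.

#include <iostream>

int main() {
  const double rate_MB_per_s    = 450.0;                // sustained target
  const double seconds_per_week = 7.0 * 24.0 * 3600.0;  // 604800 s
  const double total_TB = rate_MB_per_s * seconds_per_week / 1.0e6; // MB -> TB
  std::cout << "One week at 450 MB/s is about " << total_TB << " TB\n"; // ~272 TB
  return 0;
}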

Page 17:

Deployment and Operations

Page 18:

Computing Resources: June 2005

In LCG-2: 139 sites in 32 countries, ~14,000 CPUs, ~5 PB storage.
Includes non-EGEE sites: 18 sites in 9 countries.
(Map legend: countries providing resources / countries anticipating joining.)

The number of sites is already at the scale expected for LHC - this demonstrates the full complexity of operations.

Page 19:

Operations Structure

Operations Management Centre (OMC): at CERN - coordination, etc.

Core Infrastructure Centres (CIC): manage daily grid operations - oversight, troubleshooting; run essential infrastructure services; provide 2nd-level support to ROCs. Currently UK/I, France, Italy, CERN, plus Russia (M12); hope to get non-European centres.

Regional Operations Centres (ROC): act as front-line support for user and operations issues; provide local knowledge and adaptations. One in each region - many are distributed.

User Support Centre (GGUS): at FZK - support portal providing a single point of contact (service desk).

Page 20:

Grid Operations

The grid is flat, but a hierarchy of responsibility is essential to scale the operation.

The CICs act as a single Operations Centre: operational oversight (grid operator) responsibility rotates weekly between CICs; problems are reported to the ROC/RC; the ROC is responsible for ensuring that the problem is resolved; the ROC oversees the regional RCs.

ROCs are responsible for organising the operations in a region, and coordinate deployment of middleware, etc.

CERN coordinates sites not associated with a ROC.

(Diagram: the OMC, the CICs, and the ROCs with their Resource Centres. RC = Resource Centre.)

It is in setting up this operational infrastructure that we have really benefited from EGEE funding.

Page 21:

Grid monitoring

Selection of monitoring tools:
GIIS Monitor + Monitor Graphs
Site Functional Tests
GOC Data Base, Scheduled Downtimes
Live Job Monitor
GridIce - VO + fabric view
Certificate Lifetime Monitor
Operation of Production Service: real-time display of grid operations
Accounting information

Page 22:

Operations Focus

The main focus of activities now:

Improving operational reliability and application efficiency: automating monitoring alarms; ensuring a 24x7 service; removing sites that fail functional tests; operations interoperability with OSG and others.

Improving user support: demonstrate to users a reliable and trusted support infrastructure.

Deployment of gLite components: testing, certification, pre-production service; migration planning and deployment, while maintaining/growing interoperability.

Further developments now have to be driven by experience in real use.

(Timeline: LCG-2 (=EGEE-0) and LCG-3 (=EGEE-x?) across 2004-2005, each moving from prototyping to product.)

Page 23:

Recent ATLAS work

ATLAS jobs in EGEE/LCG-2 in 2005: in the latest period, up to 8K jobs/day, with ~10,000 concurrent jobs in the system.

This is several times the current capacity for ATLAS at CERN alone - it shows the reality of the grid solution.

(Plot: number of jobs per day.)

Page 24:

Baseline Services & Service Challenges

Page 25:

Baseline Services: Goals

Experiments and regional centres agree on baseline services that support the computing models for the initial period of LHC; these must therefore be in operation by September 2006.

Expose experiment plans and ideas.

Timescales: for the TDR - now; for SC3 - testing and verification, not all components; for SC4 - must have the complete set.

Define services with targets for functionality and scalability/performance metrics.

Very much driven by the experiments' needs - but try to understand site and other constraints.

Not done yet.

Page 26:

Baseline services

Storage management services, based on SRM as the interface
Basic transfer services: gridFTP, srmCopy
Reliable file transfer service
Grid catalogue services
Catalogue and data management tools
Database services, required at Tier-1 and Tier-2
Compute Resource Services
Workload management
VO management services - clear need for VOMS: roles, groups, subgroups
POSIX-like I/O service on local files, including links to catalogues (see the sketch below)
Grid monitoring tools and services, focussed on job monitoring
VO agent framework
Applications software installation service
Reliable messaging service
Information system

Nothing really surprising here - but a lot was clarified in terms of requirements, implementations, deployment, security, etc.
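To make the "POSIX-like I/O service" item concrete: the point is that opening a grid-resident file should look like opening a local one, with the access protocol hidden behind the URL. Below is a minimal sketch using ROOT's TFile::Open; the URL and host are placeholders, and which protocols actually work depends on the plugins deployed at a site.

#include "TFile.h"
#include <iostream>

int main() {
  // Local and remote access differ only in the URL handed to Open().
  TFile* f = TFile::Open("root://some.storage.element//store/data/file.root");
  if (!f || f->IsZombie()) {
    std::cerr << "could not open remote file\n";
    return 1;
  }
  std::cout << "opened, size = " << f->GetSize() << " bytes\n";
  f->Close();
  delete f;
  return 0;
}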

Page 27:

Preliminary: Priorities

Service                               ALICE  ATLAS  CMS  LHCb
Storage Element                         A      A     A     A
Basic transfer tools                    A      A     A     A
Reliable file transfer service          A      A    A/B    A
Catalogue services                      B      B     B     B
Catalogue and data management tools     C      C     C     C
Compute Element                         A      A     A     A
Workload Management                     B      A     A     C
VO agents                               A      A     A     A
VOMS                                    A      A     A     A
Database services                       A      A     A     A
POSIX-I/O                               C      C     C     C
Application software installation       C      C     C     C
Job monitoring tools                    C      C     C     C
Reliable messaging service              C      C     C     C
Information system                      A      A     A     A

A: high priority, mandatory service
B: standard solutions required, but experiments could select different implementations
C: common solutions desirable, but not essential

Page 28:

Service Challenges – ramp up to LHC start-up service

(Timeline 2005-2008: SC2, SC3, SC4, then LHC Service Operation; cosmics, first beams, first physics, full physics run.)

June05 - Technical Design Report
Sep05 - SC3 Service Phase
May06 - SC4 Service Phase
Sep06 - Initial LHC Service in stable operation
Apr07 - LHC Service commissioned

SC2 - Reliable data transfer (disk-network-disk): 5 Tier-1s, aggregate 500 MB/sec sustained at CERN.
SC3 - Reliable base service: most Tier-1s, some Tier-2s; basic experiment software chain; grid data throughput of 500 MB/sec, including mass storage (~25% of the nominal final throughput for the proton period) - see the arithmetic sketch below.
SC4 - All Tier-1s, major Tier-2s: capable of supporting the full experiment software chain, including analysis; sustain the nominal final grid data throughput.
LHC Service in Operation - September 2006: ramp up to full operational capacity by April 2007; capable of handling twice the nominal data throughput.

See Jamie's talk for more details.
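A quick back-of-the-envelope on the throughput targets quoted above: if 500 MB/sec is roughly 25% of the nominal final proton-period rate, the nominal and "twice nominal" figures follow directly. The sketch below just spells out that arithmetic; reading "twice the nominal data throughput" as 2 x the derived rate is an assumption.

#include <iostream>

int main() {
  const double sc3_MB_per_s     = 500.0;                   // SC3 target
  const double nominal_MB_per_s = sc3_MB_per_s / 0.25;     // ~2000 MB/s nominal
  const double twice_nominal    = 2.0 * nominal_MB_per_s;  // ~4000 MB/s
  std::cout << "nominal ~ " << nominal_MB_per_s << " MB/s, "
            << "twice nominal ~ " << twice_nominal << " MB/s\n";
  return 0;
}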

Page 29:

Baseline Services, Service Challenges, Production Service, Pre-production service, gLite deployment, …

… confused?

Page 30:

Services …

Baseline services: the set of essential services that the experiments need to be in production by September 2006; the components are verified in SC3 and SC4.

Service challenges: the ramp-up of the LHC computing environment - building up the production service, based on the results and lessons of the service challenges.

Production service: the evolving service, putting in place the new components prototyped in SC3 and SC4. No big-bang changes, but many releases!

gLite deployment: as new components are certified, they will be added to the production service releases, either in parallel with or replacing existing services.

Pre-production service: should be literally a preview of the production service, but at the moment it is a demonstration of gLite services - this has been forced on us by many other constraints (urgency to "deploy" gLite, need for reasonable-scale testing, …).

Page 31:

Releases and Distributions

We intend to maintain a single line of production middleware distributions: middleware releases come from [JRA1, VDT, LCG, …], and middleware distributions for deployment come from GDA/SA1. Remember: the announcement of a release is months away from a deployable distribution (based on the last 2 years' experience).

Distributions are still labelled "LCG-2.x.x"; we would like to change to something less specific to avoid LCG/EGEE confusion.

Frequent updates for Service Challenge sites - but these are only needed for SC sites.

Frequent updates as gLite is deployed; it is not clear whether all sites will deploy all gLite components immediately. This is unavoidable: a strong request from the LHC experiment spokesmen to the LCG POB was that "early, gradual and frequent releases of the [baseline] services is essential rather than waiting for a complete set".

Throughout all this, we must maintain a reliable production service, which gradually improves in reliability and performance.

Page 32:

Summary

We are at the end of LCG Phase 1 - a good time to step back and look at achievements and issues.

LCG Phase 2 has really started: consolidation of the AA projects; baseline services; service challenges and experiment data challenges; the acquisition process is starting.

No new developments: make what we have work absolutely reliably, and make it scalable and performant.

The timescale is extremely tight. We must ensure that we have appropriate levels of effort committed.