Overview of World Grids: Computing without Boundaries
SAR Workshop, April 18-19, 2003, Arlington, Texas
Lee Lueking, Fermilab Computing Division, CEPA Dept.
DØ Liaison to PPDG
Batavia, Illinois
• Collaborations
• Projects and testbeds
• Deployment and Operation
The Goal
All of these projects are working towards the common goal of providing transparent access to the massively distributed computing infrastructure that is needed to meet the challenges of modern experiments… (From the EU DataTAG proposal)
HENP Grid Timelines

[Timeline chart, 1999-2008. US projects: PPDG, continuing as PPDG-SciDAC; GriPhyN (NSF); iVDGL (NSF); other US Grid projects include IPG (NASA), TeraGrid (NSF), ESG (DOE), … European projects: DataGrid (EU), followed by EGEE (EU); DataTag (EU); LCG (CERN), a grid that is mainly fabric; other European Grid projects include GridPP (UK), INFN-Grid (Italy), CrossGrid (EU), … Experiments: BaBar (SLAC), CDF/D0 (Fermilab), ATLAS/CMS (CERN), RHIC experiments (BNL). The legend distinguishes grids with mainly fabric from those with no fabric, and experiments under construction from those in operation.]
A few of the Grid Projects with strong HEP collaboration

[Chart of US and European grid projects.]

Many national and regional Grid projects: GridPP (UK), INFN-Grid (Italy), NorduGrid, Dutch Grid, …
The U.S. Grid Projects – peer projects between experiments and computer science groups
• PPDG - Particle Physics Data Grid, ~$3M/year for 3 years
  – End-to-end applications, vertical integration
• GriPhyN - Grid Physics Network, ~$2M/year for 5 years
  – Research and development, challenge prototypes
• iVDGL - international Virtual Data Grid Laboratory, ~$3M/year for 5 years
  – Deployment of experiment grids
  – All applications will be able to run at all sites
  – Commit to sharing of resources
• Also collaborating with TeraGrid as application deployers
PPDG (DOE): STAR, JLAB, D0, BaBar, CMS, ATLAS, SRB, SRM, Globus, Condor
GriPhyN (NSF): LIGO, SDSS, CMS, ATLAS, Globus, Condor, SRB, SRM, Berkeley, Northwestern, UCSD
iVDGL (NSF): LIGO, SDSS, NVO, CMS, ATLAS, Globus, Condor
How the US Physics Grid Projects are organized towards deliverables

GriPhyN Challenge Problems:
– CP-1 Virtualize an application pipeline
– CP-2 High-speed data transfer to replicate results
– CP-3 Automated planning
– CP-4 Mixed replication and re-materialization at high speeds
– CP-5 Abstract generator functions added to virtualization
– CP-6 Jobs submitted from high-level tools/UIs
– CP-7 Intelligent job management: transparency, fault tolerance, advanced policy and scheduling
– CP-8 Monitoring and information synthesis

PPDG Common Services:
– CS-1&2 Job Description and Management
– CS-3 Information Services and Monitoring
– CS-4 Storage Management
– CS-5 Reliable File Transfer
– CS-6 Robust File Replication
– CS-7 Documentation
– CS-8 Evaluations/R&D
– CS-9 Authentication, Authorization, Accounting
– CS-10 End-to-End Application
– CS-11 Analysis Tools
– CS-12 Experiment Catalogs
– CS-13 Troubleshooting, Error Handling, Diagnosis

iVDGL Work Teams:
• Operations
• Core Software
• Facilities
• Applications
• Education and Outreach
European Data Grid (EDG)
• Middleware
  – WP1: Work Scheduling
  – WP2: Data Management
  – WP3: Monitoring Services
  – WP4: Fabric Management
  – WP5: Storage Management
  – WP6: Integration Testbed & Support
  – WP7: Network
• Applications
  – WP8: Particle Physics
  – WP9: Earth Observation
  – WP10: Biology
• Dissemination: WP11
  – Video
• Management: WP12
DataGrid and DataGrid WP1
• European DataGrid Project
  – Goal: Grid software projects meet real-life scientific applications (High Energy Physics, Earth Observation, Biology) and their deadlines, with mutual benefit
  – Middleware development and integration of existing middleware
  – Bring the issues of data identification, location, transfer, and access into the picture
  – Large-scale testbed
• WP1 (Grid Workload Management)
  – Mandate: “To define and implement a suitable architecture for distributed scheduling and resource management on a GRID environment“
  – This includes the following areas of activity:
    • Design and development of a useful (as seen from the DataGrid applications' perspective) grid scheduler, or Resource Broker
    • Design and development of a suitable job description and monitoring infrastructure
    • Design and implementation of a suitable job accounting structure
WP1 Workload Management System
• A working Workload Management System prototype was implemented by WP1 in the first phase of the EDG project (presented at CHEP 2001)
  – Ability to define a job (via a Job Description Language, JDL), submit it to the DataGrid testbed from any user machine, and control it (see the sketch below)
  – WP1's Resource Broker chooses an appropriate computing resource for the job, based on the constraints specified in the JDL: user authorization, job characteristics, data proximity
• Application users have now been exercising this first release of the workload management system for about a year and a half
  – Stress tests and semi-production activities (e.g. CMS stress tests, ATLAS efforts)
  – Significant achievements were exploited by the experiments, but various problems were also spotted, impacting in particular the reliability and scalability of the system
• A new WMS (v. 2.0) was presented at the 2nd EDG review and is scheduled for integration in April 2003
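To make the JDL concrete, here is a minimal sketch of what an EDG-style job description built on Condor ClassAds can look like. The script name, sandbox contents, and the attribute names used in Requirements and Rank are illustrative; exact attribute names varied between EDG releases.

  Executable    = "simulate.sh";
  StdOutput     = "sim.out";
  StdError      = "sim.err";
  InputSandbox  = {"simulate.sh"};
  OutputSandbox = {"sim.out", "sim.err"};
  Requirements  = other.Architecture == "INTEL" && other.OpSys == "LINUX";
  Rank          = other.FreeCPUs;

The Resource Broker matches Requirements against the attributes each computing resource advertises and, among the matches, prefers the resource with the highest Rank.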
Grid Projects
NorduGrid Project
• Launched in spring 2001, with the aim of creating a Grid infrastructure in the Nordic countries
• The idea is a MONARC-style architecture with a Nordic Tier 1 centre
• Partners from Denmark, Norway, Sweden, and Finland
• Initially meant to be the Nordic branch of the EU DataGrid (EDG) project
• 3 full-time researchers, with a few externally funded
NorduGrid Motivation and Goals
• Goal: have the ATLAS Data Challenge running by May 2002
• Use available Grid middleware:
  – The Globus Toolkit™
    • A toolbox, not a complete solution
  – European DataGrid software
    • Not mature at the beginning of 2002
    • Architecture problems
• No single point of failure
• Should be scalable
• Resource owners should have full control over their resources
• As few site requirements as possible:
  – Local cluster installation details should not be dictated
    • Method, OS version, configuration, etc.
  – Compute nodes should not be required to be on the public network
  – Clusters need not be dedicated to the Grid
NorduGrid Features at a Glance
• Dynamic information system, brokering, monitoring
• Independence from the Globus GASS cache (and its bugs)
• Own GridFTP server, pluggable with job submission
• Stable and tested Grid testbed
• Not Nordic- or HEP-specific
• Tested on RedHat 6.2, 7.2 (also Alpha), Mandrake, Debian, Slackware
• Can share resources with non-Grid applications
• Has been running ATLAS data challenges since May 2002
Grid Monitor
AliEn: ALICE Environment
Project Timeline

[Timeline chart, 2001-2005: start; first production (distributed simulation); Physics Performance Report (mixing & reconstruction); 10% Data Challenge (analysis); with the emphasis shifting from functionality, to interoperability, to performance, scalability, and standards.]
What is AliEn?
• Main features
  – Distributed file catalogue built on top of an RDBMS
  – File replica and cache manager with interfaces to MSS
    • CASTOR, HPSS, HIS, …
    • AliEnFS: a Linux file system that uses the AliEn file catalogue and replica manager
  – SASL-based authentication supporting various mechanisms (including Globus/GSSAPI)
  – Resource Broker with interfaces to batch systems
    • LSF, PBS, Condor, BQS, …
  – Various user interfaces
    • command line, GUI, web portal
  – Package manager (dependencies, distribution, …)
  – Metadata catalogue
  – C/C++/Perl/Java API
  – ROOT interface (TAliEn)
  – SOAP/Web Services
• EDG-compatible user interface
  – Common authentication
  – Compatible JDL (Job Description Language) based on Condor ClassAds
SAM-Grid
Meta Systems: MC RunJob
• MCRunJob approach by the CMS and DØ production teams
• Framework for dealing with multiple grid resources and testbeds (EDG, IGT)

[Diagram: the user asks, "I want to run applications A, B, and C." The Linker attaches Configurators A, B, and C; each Configurator is configured; and Make Job (Framework) drives the Script Generator, which emits a /bin/sh script to run each application plus a master script (expanded in the sketch after the quote below):]

#!/bin/env sh
scriptA
scriptB
scriptC

Source: G. Graham
“Fools make feasts and wise men eat them.” Poor Richard
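As an illustration of the generated master script idea, here is a minimal sketch of the kind of wrapper a script generator might emit, with simple error handling added. The stage names scriptA/scriptB/scriptC come from the diagram above and are hypothetical, not actual MCRunJob output.

  #!/bin/sh
  # Run each application stage in order, stopping at the first failure
  # so a later stage never consumes partial output from a failed one.
  for stage in ./scriptA ./scriptB ./scriptC; do
      "$stage" || { echo "stage $stage failed" >&2; exit 1; }
  done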
US-ATLAS Test Grid

Sites: Lawrence Berkeley National Laboratory, Brookhaven National Laboratory, Indiana University, Boston University, Argonne National Laboratory, U Michigan, University of Texas at Arlington, Oklahoma University

• Grid credentials (based on Globus CA)
  • updating to ESnet CA
• Grid software: Globus 1.1.4/2.0, Condor 6.3 (moving towards full VDT 1.x)
• ATLAS core software distribution at 2 sites (RH 6.2)
• ATLAS-related grid software:
  • Pacman - package manager
  • Magda - distributed data manager
  • Gridview - grid monitoring visualisation
  • Grappa - physics web portal
  • GRAT - Grid Application Toolkit for ATLAS grid applications (RH 7.2)
• Testbed has been functional for ~1 year
• Decentralized account management
US-CMS Test Grid

Sites: UCSD, Florida, Wisconsin, Caltech, Fermilab, Princeton

• Grid credentials (based on Globus CA)
  • updating to ESnet CA
• Grid software: VDT 1.0
  • Globus 2.0 beta
  • Condor-G 6.3.1
  • Condor 6.3.1
  • ClassAds 0.9
  • GDMP 3.0
  • Objectivity 6.1
• CMS grid-related software:
  • MOP - distributed CMS Monte carlO Production
  • VDGS - Virtual Data Grid System prototype
  • CLARENS - distributed CMS physics analysis
  • DAR - Distribution After Release for CMS applications (RH 6.2)
• Testbed has been functional for ~1/2 year
• Decentralized account management
Deployment and Operation
The Large Hadron Collider Project
4 detectors: CMS, ATLAS, LHCb, and ALICE

Storage: raw recording rate of 0.1-1 GBytes/sec, accumulating at 5-8 PetaBytes/year; 10 PetaBytes of disk
Processing: 200,000 of today's fastest PCs
• CERN will provide the data reconstruction & recording service (Tier 0), but only a small part of the analysis capacity

Summary of Computing Capacity Required for all LHC Experiments in 2008

                           ---------- CERN ----------   Other    Total    CERN as      Total     CERN as %
                           Tier 0    Tier 1    Total    Tier 1   Tier 1   % of Tier 1  Tier 0+1  of Tier 0+1
Processing (K SI2000)      12,000    8,000     20,000   49,000   57,000   14%          69,000    29%
Disk (PetaBytes)           1.1       1.0       2.1      8.7      9.7      10%          10.8      20%
Magnetic tape (PetaBytes)  12.3      1.2       13.5     20.3     21.6     6%           33.9      40%

• Current planning for capacity at CERN + principal Regional Centres:
  – 2002: 650 KSI2000 → <1% of the capacity required in 2008
  – 2005: 6,600 KSI2000 → <10% of 2008 capacity
LHC Computing Model

[Diagram: The LHC Computing Centre. A Tier 0 centre at CERN serves the experiments (CMS, ATLAS, LHCb); Tier 1 centres in Germany, USA, UK, France, Italy, …, plus a CERN Tier 1; Tier 2 centres serve regional groups, fed by labs (Lab a, b, c, m) and universities (Uni a, b, n, x, y); Tier 3 resources sit in physics departments; desktops (α, β, γ) at the edge. Source: Ian Bird]
The LHC Computing Grid (LCG) Project Goals
• Prepare and deploy the computing environment for the LHC experiments
• Common applications, tools, frameworks, and environments
• Move from testbed systems to real production services:
  – Operated and supported 24x7 globally
  – Computing fabrics run as production physics services
  – Computing environment must be robust, stable, predictable, and supportable
• Foster collaboration and coherence of the LHC computing centres
• LCG is not a middleware development or grid technology project: it is a grid deployment project
• Concentrate on four work areas: Applications, Grid Technology, Fabrics, Grid Deployment
Timeline for the LCG Computing Service

[Timeline figure:
– 2003: LCG-1, with VDT and EDG tools building up to basic functionality; stable 1st-generation middleware; developing management and operations tools; used for simulated event productions
– 2004: LCG-2; more stable 2nd-generation middleware; computing model TDRs; principal service for LHC data challenges (batch analysis and simulation); validation of computing models
– 2005: LCG-3; very stable, full-function middleware; Phase 2 TDR; validation of the computing service; acquisition, installation, and commissioning of the Phase 2 service (for LHC startup)
– 2006: Phase 2 service in production]
DØ Regional Model

[Map of DØ regional analysis centres: CINVESTAV, UO, UA, Rice, FSU, LTU, UTA; in Germany: Mainz, Wuppertal, Munich, Aachen, Bonn, GridKa (Karlsruhe), Freiburg.]

Centers also in the UK and France.
UK: Lancaster, Manchester, Imperial College, RAL.
France: CCin2p3, CEA-Saclay, CPPM Marseille, IPNL-Lyon, IRES-Strasbourg, ISN-Grenoble, LAL-Orsay, LPNHE-Paris.

Have you something to do tomorrow? Do it today.
Areas of Development and Challenge in Grid Deployment
• Grid middleware: Virtual Data Toolkit - Globus Toolkit, Condor-G, other grid tools
• Certification and testing activities at all levels: 1) component/unit tests, 2) basic functional tests, including tests of distributed (grid) services, 3) application-level tests based on HEPCAL use cases, 4) experiment beta-testing before release, 5) site configuration verification
• Packaging and distribution: 1) provide a tool that satisfies the needs of the participating sites, 2) interoperate with existing tools where appropriate and necessary, 3) do not force a solution on sites with established infrastructure, 4) provide a solution for sites with nothing
• Configuration: 1) essential to understand and validate correct site configuration, 2) effort will be devoted to providing configuration tools, 3) verification of correct configuration will be required before sites join LCG (a small illustrative check appears below)
• Operating and maintaining the grid infrastructure and associated services: 1) gateways, information services, resource broker, etc., i.e. grid-specific services, 2) coordinated between teams at CERN and at Regional Centres, 3) also responsible for the VO infrastructure and the Authentication and Authorization services, 4) security operations - incident response, etc.
• Grid Operations Center(s): 1) performance and problem monitoring, 2) troubleshooting and coordination with site operations, user support, network operations, etc., 3) accounting and reporting, 4) leverage existing experience and ideas, 5) assemble monitoring, reporting, performance, etc. tools
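To give a flavor of what "site configuration verification" can mean in practice, here is a minimal sketch of a pre-flight check a site might run. It only confirms that a few standard Globus Toolkit 2 client commands are installed and that a valid proxy certificate exists; it is illustrative, not an actual LCG verification tool.

  #!/bin/sh
  # Confirm the basic grid client tools are on the PATH.
  for cmd in grid-proxy-info globus-job-run globus-url-copy; do
      command -v "$cmd" >/dev/null 2>&1 || { echo "missing: $cmd" >&2; exit 1; }
  done
  # Confirm the user holds a currently valid grid proxy.
  grid-proxy-info -exists || { echo "no valid grid proxy" >&2; exit 1; }
  echo "basic site configuration checks passed"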
Some Restrictions May Apply
• Each computing center has rules and restrictions
  – Network firewalls
  – Some resources on private networks
  – Acceptance of certificates issued by various Certificate Authorities is subject to review
  – Some "adapters" are required for various services: local storage, mass storage, replica catalogs, …
• The many grids being built now are not interoperable: EDG/LCG, NorduGrid, SAM-Grid, … Interoperability is a major challenge for the future.
Operations
Expectation Management
“Blessed is he who expects nothing for he shall never be disappointed.”
Poor Richard
Collaboration Without Boundaries
• The Grid encourages unheralded collaboration
  – Never before has such collaboration and sharing been attempted among HEP experiments.
  – Inter-disciplinary sharing of resources is being charted among Bioinformatics, Climatology, Astrophysics, HEP, …
  – Fully interoperable grids and services promise to open new avenues to even more resources.
• This is all really hard
  – Sharing is a new word for HEP collider collaborations.
  – Sharing means being fair, waiting your turn, doing accounting, and being accountable.
  – Collaboration on a global scale means travel, meetings in virtual spaces, and new approaches to collaborative tools and environments. (cf. DAWN - Dynamic Analysis Workspace with kNowledge - an ITR recently submitted by ATLAS and CMS)

“Always taking out of the meal tub and never putting in, soon comes to the bottom.” Poor Richard
The END
Multi-Tiered View of LHC Computing

[Diagram: tiered centres connected by 2.5-10 Gbps and 1-10 Gbps links. CERN/outside resource ratio ~1:2; Tier 0 : (Σ Tier 1) : (Σ Tier 2) ~ 1:1:1. Tens of PetaBytes by 2007-8; an ExaByte ~5-7 years later.]
SAM-Grid Project @ FNAL

[Architecture diagram: Job and Data Management. Users submit a JOB through a User Interface on a Submission Client; a Broker's Match Making Service selects a site using an Information Collector fed by Grid Sensors. Execution sites #1 … #n each run a Queuing System with Computing Elements, Storage Elements, and a Data Handling System.]
Bandwidth Growth of Global HENP Networks
• Rate of progress >> Moore's Law (US-CERN example):
  – 9.6 kbps analog (1985)
  – 64-256 kbps digital (1989-1994) [X 7-27]
  – 1.5 Mbps shared (1990-3; IBM) [X 160]
  – 2-4 Mbps (1996-1998) [X 200-400]
  – 12-20 Mbps (1999-2000) [X 1.2k-2k]
  – 155-310 Mbps (2001-2) [X 16k-32k]
  – 622 Mbps (2002-3) [X 65k]
  – 2.5 Gbps λ (2003-4) [X 250k]
  – 10 Gbps λ (2005) [X 1M]
• A factor of ~1M over the period 1985-2005 (10 Gbps / 9.6 kbps ≈ 10^6), and a factor of ~5k during 1995-2005
• HENP has become a leading applications driver, and also a co-developer, of global networks

Source: Harvey Newman

Speed                  Year         Total Increase Factor
9.6 kbps Analog        1985         -
64-256 kbps Digital    1989-1994    X 7-27
1.5 Mbps Shared        1990-3; IBM  X 60
2-4 Mbps               1996-1998    X 200-400
12-20 Mbps             1999-2000    X 1.2k-2k
155-310 Mbps           2001-2002    X 16k-32k
622 Mbps               2002-2003    X 65k
2.5 Gbps λ             2003-2004    X 250k
10 Gbps λ              2005         X 1M
Grid Projects Timeline

[Gantt chart spanning Q3 2000 - Q1 2002, showing project start dates and funding levels: GriPhyN $11.9M + $1.6M; PPDG $9.5M; iVDGL $13.65M; EU DataGrid $9.3M; EU DataTAG 4M Euros; GridPP (amount not given).]
PPDG also collaborates with:

European projects:
• EDG - collaboration on software components (WP1, WP2, WP5); PPDG strategy is to continue and, if possible, increase this
• DataTAG - GLUE interoperability and experiment test grids
• PPARC - BaBarGrid, Run 2 SAM-Grid developments
• LHC Computing Grid - support for US ATLAS and CMS user facilities

HENP globally:
• High Energy Physics Intergrid Coordination Board
• GLUE - technical interoperability work is continuing in parallel with …

DOE SciDAC projects:
• "A High Performance Data Grid Toolkit: Enabling Technology for Wide Area Data-Intensive Applications" - Globus
• "Storage Resource Management for Data Intensive Applications" - SRM
• "Security and Policy for Group Collaboration" - Community Authorization Service
• "Scientific Data Management Enabling Technology Center"
• "DOE Science Grid Collaboratory Pilot" - CA/RA
• Metadata management (SAM - different from the D0 one)