Overview of World Grids: Computing without Boundaries
SAR Workshop, April 18-19, 2003, Arlington, Texas
Lee Lueking, Fermilab Computing Division, CEPA Dept.
DØ Liaison to PPDG
Batavia, Illinois
• Collaborations
• Projects and testbeds
• Deployment and Operation
The Goal
All of these projects are working towards the common goal of providing transparent access to the massively distributed computing infrastructure that is needed to meet the challenges of modern experiments… (From the EU DataTAG proposal)
HENP Grid Timelines

[Timeline chart, 1999-2008. US projects: PPDG, continuing as PPDG-SciDAC; GriPhyN (NSF); iVDGL (NSF); other US Grid projects include IPG (NASA), TeraGrid (NSF), ESG (DOE), … European projects: DataGrid (EU), followed by EGEE (EU); DataTag (EU); LCG (CERN), a grid that is mainly fabric; other European Grid projects include GridPP (UK), INFN-Grid (Italy), CrossGrid (EU), … Experiments: BaBar (SLAC), CDF/D0 (Fermilab), ATLAS/CMS (CERN), RHIC experiments (BNL). The legend distinguishes grids with mainly fabric from those with no fabric, and experiments under construction from those in operation.]
A few of the Grid Projects with strong HEP collaboration

[Chart of US and European grid projects.]

Many national and regional Grid projects: GridPP (UK), INFN-Grid (Italy), NorduGrid, Dutch Grid, …
The U.S. Grid Projects – peer projects between experiments and computer science groups
• PPDG - Particle Physics Data Grid, ~$3M/year for 3 years
  – End-to-end applications, vertical integration
• GriPhyN - Grid Physics Network, ~$2M/year for 5 years
  – Research and development, challenge prototypes
• iVDGL - international Virtual Data Grid Laboratory, ~$3M/year for 5 years
  – Deployment of experiment grids
  – All applications will be able to run at all sites
  – Commit to sharing of resources
• Also collaborating with TeraGrid as application deployers
PPDG (DOE): STAR, JLAB, D0, BaBar, CMS, ATLAS, SRB, SRM, Globus, Condor
GriPhyN (NSF): LIGO, SDSS, CMS, ATLAS, Globus, Condor, SRB, SRM, Berkeley, Northwestern, UCSD
iVDGL (NSF): LIGO, SDSS, NVO, CMS, ATLAS, Globus, Condor
How the US Physics Grid Projects are organized towards deliverables

GriPhyN Challenge Problems:
– CP-1 Virtualize an application pipeline
– CP-2 High-speed data transfer to replicate results
– CP-3 Automated planning
– CP-4 Mixed replication and re-materialization at high speeds
– CP-5 Abstract generator functions added to virtualization
– CP-6 Jobs submitted from high-level tools/UIs
– CP-7 Intelligent job management: transparency, fault tolerance, advanced policy and scheduling
– CP-8 Monitoring and information synthesis

PPDG Common Services:
– CS-1&2 Job Description and Management
– CS-3 Information Services and Monitoring
– CS-4 Storage Management
– CS-5 Reliable File Transfer
– CS-6 Robust File Replication
– CS-7 Documentation
– CS-8 Evaluations/R&D
– CS-9 Authentication, Authorization, Accounting
– CS-10 End-to-End Application
– CS-11 Analysis Tools
– CS-12 Experiment Catalogs
– CS-13 Troubleshooting, Error Handling, Diagnosis

iVDGL Work Teams:
• Operations
• Core Software
• Facilities
• Applications
• Education and Outreach
European Data Grid (EDG)
• Middleware
  – WP1: Work Scheduling
  – WP2: Data Management
  – WP3: Monitoring Services
  – WP4: Fabric Management
  – WP5: Storage Management
  – WP6: Integration Testbed & Support
  – WP7: Network
• Applications
  – WP8: Particle Physics
  – WP9: Earth Observation
  – WP10: Biology
• Dissemination: WP11
  – Video
• Management: WP12
DataGrid and DataGrid WP1
• European DataGrid Project
  – Goal: Grid software projects meet real-life scientific applications (High Energy Physics, Earth Observation, Biology) and their deadlines, with mutual benefit
  – Middleware development and integration of existing middleware
  – Bring the issues of data identification, location, transfer, and access into the picture
  – Large-scale testbed
• WP1 (Grid Workload Management)
  – Mandate: “To define and implement a suitable architecture for distributed scheduling and resource management on a GRID environment“
  – This includes the following areas of activity:
    • Design and development of a useful (as seen from the DataGrid applications' perspective) grid scheduler, or Resource Broker
    • Design and development of a suitable job description and monitoring infrastructure
    • Design and implementation of a suitable job accounting structure
WP1 Workload Management System
• A working Workload Management System prototype was implemented by WP1 in the first phase of the EDG project (presented at CHEP 2001)
  – Ability to define a job (via a Job Description Language, JDL), submit it to the DataGrid testbed from any user machine, and control it (see the sketch below)
  – WP1's Resource Broker chooses an appropriate computing resource for the job, based on the constraints specified in the JDL: user authorization, job characteristics, data proximity
• Application users have now been exercising this first release of the workload management system for about a year and a half
  – Stress tests and semi-production activities (e.g. CMS stress tests, ATLAS efforts)
  – Significant achievements were exploited by the experiments, but various problems were also spotted, impacting in particular the reliability and scalability of the system
• A new WMS (v. 2.0) was presented at the 2nd EDG review and is scheduled for integration in April 2003
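To make the JDL concrete, here is a minimal sketch of what an EDG-style job description built on Condor ClassAds can look like. The script name, sandbox contents, and the attribute names used in Requirements and Rank are illustrative; exact attribute names varied between EDG releases.

  Executable    = "simulate.sh";
  StdOutput     = "sim.out";
  StdError      = "sim.err";
  InputSandbox  = {"simulate.sh"};
  OutputSandbox = {"sim.out", "sim.err"};
  Requirements  = other.Architecture == "INTEL" && other.OpSys == "LINUX";
  Rank          = other.FreeCPUs;

The Resource Broker matches Requirements against the attributes each computing resource advertises and, among the matches, prefers the resource with the highest Rank.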
Grid Projects
NorduGrid Project
• Launched in spring 2001, with the aim of creating a Grid infrastructure in the Nordic countries
• The idea is a MONARC-style architecture with a Nordic Tier 1 centre
• Partners from Denmark, Norway, Sweden, and Finland
• Initially meant to be the Nordic branch of the EU DataGrid (EDG) project
• 3 full-time researchers, with a few externally funded
NorduGrid Motivation and Goals
• Goal: have the ATLAS Data Challenge running by May 2002
• Use available Grid middleware:
  – The Globus Toolkit™
    • A toolbox, not a complete solution
  – European DataGrid software
    • Not mature at the beginning of 2002
    • Architecture problems
• No single point of failure
• Should be scalable
• Resource owners should have full control over their resources
• As few site requirements as possible:
  – Local cluster installation details should not be dictated
    • Method, OS version, configuration, etc.
  – Compute nodes should not be required to be on the public network
  – Clusters need not be dedicated to the Grid
NorduGrid Features at a Glance
• Dynamic information system, brokering, monitoring
• Independence from the Globus GASS cache (and its bugs)
• Own GridFTP server, pluggable with job submission
• Stable and tested Grid testbed
• Not Nordic- or HEP-specific
• Tested on RedHat 6.2, 7.2 (also Alpha), Mandrake, Debian, Slackware
• Can share resources with non-Grid applications
• Has been running ATLAS data challenges since May 2002
Grid Monitor
AliEn: ALICE Environment
Project Timeline

[Timeline chart, 2001-2005: start; first production (distributed simulation); Physics Performance Report (mixing & reconstruction); 10% Data Challenge (analysis); with the emphasis shifting from functionality, to interoperability, to performance, scalability, and standards.]
What is AliEn?
• Main features
  – Distributed file catalogue built on top of an RDBMS
  – File replica and cache manager with interfaces to MSS
    • CASTOR, HPSS, HIS, …
    • AliEnFS: a Linux file system that uses the AliEn file catalogue and replica manager
  – SASL-based authentication supporting various mechanisms (including Globus/GSSAPI)
  – Resource Broker with interfaces to batch systems
    • LSF, PBS, Condor, BQS, …
  – Various user interfaces
    • command line, GUI, web portal
  – Package manager (dependencies, distribution, …)
  – Metadata catalogue
  – C/C++/Perl/Java API
  – ROOT interface (TAliEn)
  – SOAP/Web Services
• EDG-compatible user interface
  – Common authentication
  – Compatible JDL (Job Description Language) based on Condor ClassAds
SAM-Grid
Meta Systems: MC RunJob
• MCRunJob approach by the CMS and DØ production teams
• Framework for dealing with multiple grid resources and testbeds (EDG, IGT)

[Diagram: the user asks, "I want to run applications A, B, and C." The Linker attaches Configurators A, B, and C; each Configurator is configured; and Make Job (Framework) drives the Script Generator, which emits a /bin/sh script to run each application plus a master script (expanded in the sketch after the quote below):]

#!/bin/env sh
scriptA
scriptB
scriptC

Source: G. Graham
“Fools make feasts and wise men eat them.” Poor Richard
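As an illustration of the generated master script idea, here is a minimal sketch of the kind of wrapper a script generator might emit, with simple error handling added. The stage names scriptA/scriptB/scriptC come from the diagram above and are hypothetical, not actual MCRunJob output.

  #!/bin/sh
  # Run each application stage in order, stopping at the first failure
  # so a later stage never consumes partial output from a failed one.
  for stage in ./scriptA ./scriptB ./scriptC; do
      "$stage" || { echo "stage $stage failed" >&2; exit 1; }
  done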
US-ATLAS Test Grid

Sites: Lawrence Berkeley National Laboratory, Brookhaven National Laboratory, Indiana University, Boston University, Argonne National Laboratory, U Michigan, University of Texas at Arlington, Oklahoma University

• Grid credentials (based on Globus CA)
  • updating to ESnet CA
• Grid software: Globus 1.1.4/2.0, Condor 6.3 (moving towards full VDT 1.x)
• ATLAS core software distribution at 2 sites (RH 6.2)
• ATLAS-related grid software:
  • Pacman - package manager
  • Magda - distributed data manager
  • Gridview - grid monitoring visualisation
  • Grappa - physics web portal
  • GRAT - Grid Application Toolkit for ATLAS grid applications (RH 7.2)
• Testbed has been functional for ~1 year
• Decentralized account management
US-CMS Test Grid

Sites: UCSD, Florida, Wisconsin, Caltech, Fermilab, Princeton

• Grid credentials (based on Globus CA)
  • updating to ESnet CA
• Grid software: VDT 1.0
  • Globus 2.0 beta
  • Condor-G 6.3.1
  • Condor 6.3.1
  • ClassAds 0.9
  • GDMP 3.0
  • Objectivity 6.1
• CMS grid-related software:
  • MOP - distributed CMS Monte carlO Production
  • VDGS - Virtual Data Grid System prototype
  • CLARENS - distributed CMS physics analysis
  • DAR - Distribution After Release for CMS applications (RH 6.2)
• Testbed has been functional for ~1/2 year
• Decentralized account management
Deployment and Operation
The Large Hadron Collider Project
4 detectors: CMS, ATLAS, LHCb, and ALICE

Storage: raw recording rate of 0.1-1 GBytes/sec, accumulating at 5-8 PetaBytes/year; 10 PetaBytes of disk
Processing: 200,000 of today's fastest PCs
• CERN will provide the data reconstruction & recording service (Tier 0), but only a small part of the analysis capacity

Summary of Computing Capacity Required for all LHC Experiments in 2008

                           ---------- CERN ----------   Other    Total    CERN as      Total     CERN as %
                           Tier 0    Tier 1    Total    Tier 1   Tier 1   % of Tier 1  Tier 0+1  of Tier 0+1
Processing (K SI2000)      12,000    8,000     20,000   49,000   57,000   14%          69,000    29%
Disk (PetaBytes)           1.1       1.0       2.1      8.7      9.7      10%          10.8      20%
Magnetic tape (PetaBytes)  12.3      1.2       13.5     20.3     21.6     6%           33.9      40%

• Current planning for capacity at CERN + principal Regional Centres:
  – 2002: 650 KSI2000 → <1% of the capacity required in 2008
  – 2005: 6,600 KSI2000 → <10% of 2008 capacity
LHC Computing Model

[Diagram: The LHC Computing Centre. A Tier 0 centre at CERN serves the experiments (CMS, ATLAS, LHCb); Tier 1 centres in Germany, USA, UK, France, Italy, …, plus a CERN Tier 1; Tier 2 centres serve regional groups, fed by labs (Lab a, b, c, m) and universities (Uni a, b, n, x, y); Tier 3 resources sit in physics departments; desktops (α, β, γ) at the edge. Source: Ian Bird]
The LHC Computing Grid (LCG) Project Goals
• Prepare and deploy the computing environment for the LHC experiments
• Common applications, tools, frameworks, and environments
• Move from testbed systems to real production services:
  – Operated and supported 24x7 globally
  – Computing fabrics run as production physics services
  – Computing environment must be robust, stable, predictable, and supportable
• Foster collaboration and coherence of the LHC computing centres
• LCG is not a middleware development or grid technology project: it is a grid deployment project
• Concentrate on four work areas: Applications, Grid Technology, Fabrics, Grid Deployment
Timeline for the LCG Computing Service

[Timeline figure:
– 2003: LCG-1, with VDT and EDG tools building up to basic functionality; stable 1st-generation middleware; developing management and operations tools; used for simulated event productions
– 2004: LCG-2; more stable 2nd-generation middleware; computing model TDRs; principal service for LHC data challenges (batch analysis and simulation); validation of computing models
– 2005: LCG-3; very stable, full-function middleware; Phase 2 TDR; validation of the computing service; acquisition, installation, and commissioning of the Phase 2 service (for LHC startup)
– 2006: Phase 2 service in production]
DØ Regional Model

[Map of DØ regional analysis centres: CINVESTAV, UO, UA, Rice, FSU, LTU, UTA; in Germany: Mainz, Wuppertal, Munich, Aachen, Bonn, GridKa (Karlsruhe), Freiburg.]

Centers also in the UK and France.
UK: Lancaster, Manchester, Imperial College, RAL.
France: CCin2p3, CEA-Saclay, CPPM Marseille, IPNL-Lyon, IRES-Strasbourg, ISN-Grenoble, LAL-Orsay, LPNHE-Paris.

Have you something to do tomorrow? Do it today.
Areas of Development and Challenge in Grid Deployment
• Grid middleware: Virtual Data Toolkit - Globus Toolkit, Condor-G, other grid tools
• Certification and testing activities at all levels: 1) component/unit tests, 2) basic functional tests, including tests of distributed (grid) services, 3) application-level tests based on HEPCAL use cases, 4) experiment beta-testing before release, 5) site configuration verification
• Packaging and distribution: 1) provide a tool that satisfies the needs of the participating sites, 2) interoperate with existing tools where appropriate and necessary, 3) do not force a solution on sites with established infrastructure, 4) provide a solution for sites with nothing
• Configuration: 1) essential to understand and validate correct site configuration, 2) effort will be devoted to providing configuration tools, 3) verification of correct configuration will be required before sites join LCG (a small illustrative check appears below)
• Operating and maintaining the grid infrastructure and associated services: 1) gateways, information services, resource broker, etc., i.e. grid-specific services, 2) coordinated between teams at CERN and at Regional Centres, 3) also responsible for the VO infrastructure and the Authentication and Authorization services, 4) security operations - incident response, etc.
• Grid Operations Center(s): 1) performance and problem monitoring, 2) troubleshooting and coordination with site operations, user support, network operations, etc., 3) accounting and reporting, 4) leverage existing experience and ideas, 5) assemble monitoring, reporting, performance, etc. tools
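To give a flavor of what "site configuration verification" can mean in practice, here is a minimal sketch of a pre-flight check a site might run. It only confirms that a few standard Globus Toolkit 2 client commands are installed and that a valid proxy certificate exists; it is illustrative, not an actual LCG verification tool.

  #!/bin/sh
  # Confirm the basic grid client tools are on the PATH.
  for cmd in grid-proxy-info globus-job-run globus-url-copy; do
      command -v "$cmd" >/dev/null 2>&1 || { echo "missing: $cmd" >&2; exit 1; }
  done
  # Confirm the user holds a currently valid grid proxy.
  grid-proxy-info -exists || { echo "no valid grid proxy" >&2; exit 1; }
  echo "basic site configuration checks passed"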
Some Restrictions May Apply
• Each computing center has rules and restrictions
  – Network firewalls
  – Some resources on private networks
  – Acceptance of certificates issued by various Certificate Authorities is subject to review
  – Some "adapters" are required for various services: local storage, mass storage, replica catalogs, …
• The many grids being built now are not interoperable: EDG/LCG, NorduGrid, SAM-Grid, … Interoperability is a major challenge for the future.
Operations
Expectation Management
“Blessed is he who expects nothing for he shall never be disappointed.”
Poor Richard
Collaboration Without Boundaries
• The Grid encourages unheralded collaboration
  – Never before has such collaboration and sharing been attempted among HEP experiments.
  – Inter-disciplinary sharing of resources is being charted among Bioinformatics, Climatology, Astrophysics, HEP, …
  – Fully interoperable grids and services promise to open new avenues to even more resources.
• This is all really hard
  – Sharing is a new word for HEP collider collaborations.
  – Sharing means being fair, waiting your turn, doing accounting, and being accountable.
  – Collaboration on a global scale means travel, meetings in virtual spaces, and new approaches to collaborative tools and environments. (cf. DAWN - Dynamic Analysis Workspace with kNowledge - an ITR recently submitted by ATLAS and CMS)

“Always taking out of the meal tub and never putting in, soon comes to the bottom.” Poor Richard
The END
Multi-Tiered View of LHC Computing

[Diagram: tiered centres connected by 2.5-10 Gbps and 1-10 Gbps links. CERN/outside resource ratio ~1:2; Tier 0 : (Σ Tier 1) : (Σ Tier 2) ~ 1:1:1. Tens of PetaBytes by 2007-8; an ExaByte ~5-7 years later.]
SAM-Grid Project @ FNAL

[Architecture diagram: Job and Data Management. Users submit a JOB through a User Interface on a Submission Client; a Broker's Match Making Service selects a site using an Information Collector fed by Grid Sensors. Execution sites #1 … #n each run a Queuing System with Computing Elements, Storage Elements, and a Data Handling System.]
Bandwidth Growth of Global HENP Networks
• Rate of progress >> Moore's Law (US-CERN example):
  – 9.6 kbps analog (1985)
  – 64-256 kbps digital (1989-1994) [X 7-27]
  – 1.5 Mbps shared (1990-3; IBM) [X 160]
  – 2-4 Mbps (1996-1998) [X 200-400]
  – 12-20 Mbps (1999-2000) [X 1.2k-2k]
  – 155-310 Mbps (2001-2) [X 16k-32k]
  – 622 Mbps (2002-3) [X 65k]
  – 2.5 Gbps λ (2003-4) [X 250k]
  – 10 Gbps λ (2005) [X 1M]
• A factor of ~1M over the period 1985-2005 (10 Gbps / 9.6 kbps ≈ 10^6), and a factor of ~5k during 1995-2005
• HENP has become a leading applications driver, and also a co-developer, of global networks

Source: Harvey Newman

Speed                  Year         Total Increase Factor
9.6 kbps Analog        1985         -
64-256 kbps Digital    1989-1994    X 7-27
1.5 Mbps Shared        1990-3; IBM  X 60
2-4 Mbps               1996-1998    X 200-400
12-20 Mbps             1999-2000    X 1.2k-2k
155-310 Mbps           2001-2002    X 16k-32k
622 Mbps               2002-2003    X 65k
2.5 Gbps λ             2003-2004    X 250k
10 Gbps λ              2005         X 1M
Grid Projects Timeline

[Gantt chart spanning Q3 2000 - Q1 2002, showing project start dates and funding levels: GriPhyN $11.9M + $1.6M; PPDG $9.5M; iVDGL $13.65M; EU DataGrid $9.3M; EU DataTAG 4M Euros; GridPP (amount not given).]
PPDG also collaborates with:

European projects:
• EDG - collaboration on software components (WP1, WP2, WP5); PPDG strategy is to continue and, if possible, increase this
• DataTAG - GLUE interoperability and experiment test grids
• PPARC - BaBarGrid, Run 2 SAM-Grid developments
• LHC Computing Grid - support for US ATLAS and CMS user facilities

HENP globally:
• High Energy Physics Intergrid Coordination Board
• GLUE - technical interoperability work is continuing in parallel with …

DOE SciDAC projects:
• "A High Performance Data Grid Toolkit: Enabling Technology for Wide Area Data-Intensive Applications" - Globus
• "Storage Resource Management for Data Intensive Applications" - SRM
• "Security and Policy for Group Collaboration" - Community Authorization Service
• "Scientific Data Management Enabling Technology Center"
• "DOE Science Grid Collaboratory Pilot" - CA/RA
• Metadata management (SAM - different from the D0 one)