EGEE-II INFSO-RI-031688 EGEE and gLite are registered trademarks
The EGEE Production Grid
Ian Bird
EGEE Operations Manager
HEPiX
Jefferson Lab, 12th October 2006
Enabling Grids for E-sciencE
[email protected] HEPiX; JLab; 9th-13th October 2006 2
Outline
• Some history
– What led up to where we are now?
– The EGEE project
• What is the EGEE grid infrastructure today?
– What has been achieved?
– How is it used?
– How does it compare and relate to other production grids?
• Outlook
Some history … LHC EGEE Grid
• 1999 – Monarc Project
– Early discussions on how to organise distributed computing for LHC
• 2000 – growing interest in grid technology
– HEP community was the driver in launching the DataGrid project
• 2001-2004 – EU DataGrid project
– middleware & testbed for an operational grid
• 2002-2005 – LHC Computing Grid – LCG
– deploying the results of DataGrid to provide a production facility for LHC experiments
• 2004-2006 – EU EGEE project phase 1
– starts from the LCG grid
– shared production infrastructure
– expanding to other communities and sciences
• 2006-2008 – EU EGEE-II
– Building on phase 1
– Expanding applications and communities …
• … and in the future
– Worldwide grid infrastructure??
– Interoperating and co-operating infrastructures?
CERN
The EGEE project
• EGEE - €32 M
– 1 April 2004 – 31 March 2006
– 71 partners in 27 countries, federated in regional Grids
• EGEE-II - €35 M
– 1 April 2006 – 31 March 2008
– 91 partners in 32 countries
– 13 Federations
• Objectives
– Large-scale, production-quality infrastructure for e-Science
– Attracting new resources and users from industry as well as science
– Improving and maintaining “gLite” Grid middleware
The EGEE Infrastructure
• Test-beds & Services
– Certification testbeds (SA3)
– Pre-production service
– Production service
• Support Structures
– Operations Coordination Centre
– Regional Operations Centres
– Global Grid User Support
– EGEE Network Operations Centre (SA2)
– Operational Security Coordination Team
• Security & Policy Groups
– Operations Advisory Group (+NA4)
– Joint Security Policy Group
– EuGridPMA (& IGTF)
– Grid Security Vulnerability Group
Infrastructure:
• Physical test-beds & services
• Support organisations & procedures
• Policy groups
Certification & release preparation
• The goal is to produce a middleware distribution that can be deployed widely
– Not the same as middleware releases from development projects
– More like a Linux distribution – bringing together many pieces from several sources
• Extensive certification test-bed:
– Close to 100 machines involved, CERN + partners
– Emulates the main deployment environments
• Certification testing:
– Installation and configuration
– Component (service) functionality
– System testing (trying to emulate real workloads and stress testing)
– Beginning to use virtualization to simplify the testing environment
• Deployment into the pre-production system
– Final step of certification – validation by real sites
– Validation by applications – also allows applications to be prepared for new versions
Pre-production service
• Pre-production service is now ~20 sites
• Provides access to some 500 CPUs
– Some sites allow access to their full production batch systems for scale tests
• Sites install and test different configurations and sets of services
– Try to get a good feeling for the quality of the release or updates before general release to production
– Feedback to: certification, integration, developers, etc.
• P-PS is now used in the way it was intended
– For some time it was acting as a second certification test-bed for the gLite-1.x branch
– Some services may be demonstrated in this environment before going to production (or they may need more work)
Production service
Size of the infrastructure today:
• 196 sites in 42 countries
• ~32 000 CPU
• ~3 PB disk, + tape MSS
[Figure: number of sites and number of CPUs; CPU axis from 0 to 35000]
Usage of the infrastructure
[Figure: EGEE workload, jobs/month by VO, Jan-05 to Aug-06; axis from 0 to 1800000]
[Figure: Normalized CPU time, k.SI2k.hours by VO, Jan-05 to Aug-06; axis from 0 to 6000000]
VOs shown in both figures: alice, atlas, biomed, cms, compchem, dteam, egeode, egrid, esr, fusion, geant4, lhcb, magic, ops, planck, other VOs
• >50k jobs/day
• ~7000 CPU-months/month
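The two headline figures can be combined into a rough average job length. This is a minimal sketch, not a result from the talk: it assumes both figures describe the same period, that one CPU-month means one CPU busy for a whole month, and it ignores the difference between normalized and raw CPU time. The function names are illustrative only.

```python
# Rough averages implied by the two headline figures on the slide.
# Assumptions (not stated in the talk): both figures cover the same period,
# and one CPU-month means one CPU kept busy for a full month.

JOBS_PER_DAY = 50_000          # ">50k jobs/day", i.e. a lower bound
CPU_MONTHS_PER_MONTH = 7_000   # "~7000 CPU-months/month"
HOURS_PER_DAY = 24


def average_busy_cpus(cpu_months_per_month: float) -> float:
    """N CPU-months delivered per month is N CPUs busy on average."""
    return cpu_months_per_month


def average_cpu_hours_per_job(cpu_months_per_month: float, jobs_per_day: float) -> float:
    """CPU-hours delivered per day divided by jobs run per day."""
    cpu_hours_per_day = average_busy_cpus(cpu_months_per_month) * HOURS_PER_DAY
    return cpu_hours_per_day / jobs_per_day


if __name__ == "__main__":
    hours = average_cpu_hours_per_job(CPU_MONTHS_PER_MONTH, JOBS_PER_DAY)
    print(f"about {hours:.2f} CPU-hours per job on average")
```

Because the job rate is quoted as a lower bound, the value computed here is an upper bound on the average CPU time per job under these assumptions.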
Non-LHC VOs
[Figure: EGEE workload, jobs/month by VO; axis from 0 to 250,000; VOs shown: planck, ops, magic, geant4, fusion, esr, egrid, egeode, compchem, biomed, other VOs]
[Figure: Normalized CPU time, k.SI2k.hours by VO; axis from 0 to 800,000; VOs shown: planck, ops, magic, geant4, fusion, esr, egrid, egeode, dteam, compchem, biomed, other VOs]
Workloads of the “other VOs” start to be significant – approaching 8-10K jobs per day; and 1000 cpu-months/month
• one year ago this was the overall scale of work for all VOs
Use of the infrastructure
20k jobs running simultaneously
CPU Usage
[Figure: CPU usage by Virtual Organization, Jan. ’06 and Sep. ’06]
Use for massive data transfer
Large LHC experiments now transferring ~ 1PB/month each
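As an order-of-magnitude check, ~1 PB/month can be converted into a sustained average transfer rate per experiment. A minimal sketch, assuming decimal units (1 PB = 10^15 bytes) and a 30-day month; neither assumption is stated on the slide, and real transfers are bursty rather than constant.

```python
# Convert a monthly data volume into a sustained average rate.
# Assumptions (not from the slide): decimal petabytes and a 30-day month.

SECONDS_PER_DAY = 86_400
BYTES_PER_PB = 10**15
BYTES_PER_MB = 10**6


def average_rate_mb_per_s(petabytes_per_month: float, days_per_month: float = 30.0) -> float:
    """Average rate in MB/s needed to move the given volume in one month."""
    total_bytes = petabytes_per_month * BYTES_PER_PB
    return total_bytes / (days_per_month * SECONDS_PER_DAY) / BYTES_PER_MB


def mb_per_s_to_gbit_per_s(mb_per_s: float) -> float:
    """Convert megabytes per second to gigabits per second."""
    return mb_per_s * 8 / 1000


if __name__ == "__main__":
    rate = average_rate_mb_per_s(1.0)
    print(f"{rate:.0f} MB/s, about {mb_per_s_to_gbit_per_s(rate):.1f} Gbit/s")
```

Under these assumptions the average works out to a few hundred MB/s, so peak rates must be higher to absorb downtime and uneven load.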
Applications on EGEE
• More than 25 applications from an increasing number of domains
– Astrophysics
– Computational Chemistry
– Earth Sciences
– Financial Simulation
– Fusion
– Geophysics
– High Energy Physics
– Life Sciences
– Multimedia
– Material Sciences
– …
• Application types:
– Simulation
– Bulk Processing
– Responsive Apps.
– Workflow
– Parallel Jobs
– Legacy Applications
Simulation
• Examples
– LHC Monte Carlo simulation
– Fusion
– WISDOM: malaria/avian flu
• Characteristics
– Jobs are CPU-intensive
– Large number of independent jobs
– Run by few (expert) users
– Small input; large output
• Needs
– Batch-system services
– Minimal data management for storage of results
[Images: ATLAS, ITER]
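The "large number of independent jobs" pattern can be illustrated with a toy Monte Carlo run split into jobs that share nothing except a merge step. This is a sketch under stated assumptions, not how any EGEE application is implemented: `JobSpec`, `split_into_jobs`, `run_job` and `merge` are hypothetical names, the payload (estimating pi) is a stand-in for a real simulation, and the jobs run locally in a loop where a grid would hand each one to a batch system.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class JobSpec:
    """Everything one job needs: small input, no dependency on other jobs."""
    job_id: int
    seed: int
    n_events: int


def split_into_jobs(total_events: int, events_per_job: int, base_seed: int = 12345) -> List[JobSpec]:
    """Cut the total work into independent jobs, each with its own seed."""
    jobs = []
    remaining = total_events
    job_id = 0
    while remaining > 0:
        n = min(events_per_job, remaining)
        jobs.append(JobSpec(job_id=job_id, seed=base_seed + job_id, n_events=n))
        remaining -= n
        job_id += 1
    return jobs


def run_job(spec: JobSpec) -> int:
    """Toy payload: count random points that fall inside the unit quarter-circle."""
    rng = random.Random(spec.seed)  # private generator, so jobs are reproducible
    hits = 0
    for _ in range(spec.n_events):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def merge(jobs: List[JobSpec], hits_per_job: List[int]) -> float:
    """Combine per-job outputs into one estimate of pi."""
    total_events = sum(job.n_events for job in jobs)
    return 4.0 * sum(hits_per_job) / total_events


if __name__ == "__main__":
    jobs = split_into_jobs(total_events=10_500, events_per_job=1_000)
    results = [run_job(job) for job in jobs]  # on a grid: one batch job each
    print(len(jobs), "jobs, estimate:", merge(jobs, results))
```

Giving each job its own seed keeps the jobs statistically distinct and lets a failed job be resubmitted on its own, which is why such workloads need little beyond batch services and storage for the results.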
Drug Discovery
• WISDOM focuses on in silico drug discovery for neglected and emerging diseases.
• Malaria – Summer 2005
– 46 million ligands docked
– 1 million selected
– 1TB data produced; 80 CPU-years used in 6 weeks
• Avian Flu – Spring 2006
– H5N1 neuraminidase
– Impact of selected point mutations on eff. of existing drugs
– Identification of new potential drugs acting on mutated N1
• Fall 2006
– Extension to other neglected diseases
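The malaria figures give a feel for the scale of the run. A minimal sketch of the arithmetic, assuming a 365.25-day year and that the 80 CPU-years were spread evenly over the 6 weeks and over all 46 million dockings; these averaging assumptions are mine, not the talk's.

```python
# Back-of-envelope scale of the WISDOM malaria run, from the slide's figures.
# Assumptions (not from the slide): 365.25-day years, evenly spread load.

DAYS_PER_YEAR = 365.25
SECONDS_PER_DAY = 86_400

CPU_YEARS = 80
WEEKS = 6
LIGANDS_DOCKED = 46_000_000


def average_concurrent_cpus(cpu_years: float, weeks: float) -> float:
    """CPUs that must be busy on average to deliver the CPU time in the window."""
    cpu_days = cpu_years * DAYS_PER_YEAR
    return cpu_days / (weeks * 7)


def cpu_seconds_per_docking(cpu_years: float, dockings: int) -> float:
    """Average CPU time spent on each docked ligand."""
    cpu_seconds = cpu_years * DAYS_PER_YEAR * SECONDS_PER_DAY
    return cpu_seconds / dockings


if __name__ == "__main__":
    print(f"about {average_concurrent_cpus(CPU_YEARS, WEEKS):.0f} CPUs busy on average")
    print(f"about {cpu_seconds_per_docking(CPU_YEARS, LIGANDS_DOCKED):.0f} CPU-seconds per docking")
```

Under these assumptions the run kept roughly 700 CPUs busy throughout, with each docking costing on the order of a minute of CPU time.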
Bulk Processing
• Examples
– HEP processing of raw data, analysis
– Earth observation data processing
• Characteristics
– Widely-distributed input data
– Significant amount of input and output data
• Needs
– Job management tools (workload management)
– Meta-data services
– More sophisticated data management
Responsive Apps. (I)
• Examples
– Prototyping new applications
– Monitoring grid operations
– Direct interactivity
• Characteristics
– Small amounts of input and output data
– Not CPU-intensive
– Short response time (few minutes)
• Needs
– Configuration which allows “immediate” execution (QoS)
– Services must treat jobs with minimum latency
Responsive Apps. (II)
• Grid as a backend infrastructure:
– gPTM3D: interactive analysis of medical images
– GPS@: bioinformatics via web portal
– GATE: radiotherapy planning
– DILIGENT: digital libraries
– Volcano sonification
• Characteristics
– Rapid response: a human waiting for the result!
– Many small but CPU-intensive tasks
– User is not aware of “grid”!
• Needs
– Interfacing (data & computing) with non-grid application or portal
– User and rights management between front-end and grid
Workflow
• Examples
– “Bronze Standard”: image registration
– Flood prediction
• Characteristics
– Use of grid and non-grid services
– Complex set of algorithms for the analysis
– Complex dependencies between individual tasks
• Needs
– Tools for managing the workflow itself
– Standard interfaces for services (i.e. web services)
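The core job of a workflow tool, ordering tasks so that each runs only after the tasks it depends on, can be sketched with Python's standard `graphlib` module (Python 3.9+). The task names below are hypothetical, loosely modelled on an image-registration workflow; they are not taken from the “Bronze Standard” application, and the sketch only computes the order without invoking any grid or web service.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical workflow: each task maps to the set of tasks it depends on.
WORKFLOW: Dict[str, Set[str]] = {
    "fetch_images": set(),
    "register_a": {"fetch_images"},
    "register_b": {"fetch_images"},
    "compare": {"register_a", "register_b"},
    "report": {"compare"},
}


def execution_waves(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into waves; tasks within one wave can run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if dependencies are circular
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)                 # a real tool would wait for completion here
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(WORKFLOW), start=1):
        print(number, wave)
```

Tasks in the same wave have no mutual dependencies, so a workflow manager could submit them to different resources at once; circular dependencies are rejected before anything runs.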
Parallel Jobs
• Examples
– Climate modeling
– Earthquake analysis
– Computational chemistry
• Characteristics
– Many interdependent, communicating tasks
– Many CPUs needed simultaneously
– Use of MPI libraries
• Needs
– Configuration of resources for flexible use of MPI
– Pre-installation of optimized MPI libraries
Legacy Applications
• Examples
– Commercial or closed source binaries
– Geocluster: geophysical analysis software
– FlexX: molecular docking software
– Matlab, Mathematica, …
• Characteristics
– Licenses: control access to software on the grid
– No recompilation, so no direct use of grid APIs!
• Needs
– License server and grid deployment model
– Transparent access to data on the grid
Grid management: structure
• Operations Coordination Centre (OCC)
– management, oversight of all operational and support activities
• Regional Operations Centres (ROC)
– providing the core of the support infrastructure, each supporting a number of resource centres within its region
– Grid Operator on Duty
• Resource centres
– providing resources (computing, storage, network, etc.)
• Global Grid User Support (GGUS)
– At FZK, coordination and management of user support, single point of contact for users
Grid Monitoring
• Goal:
– Proactively monitor operational state & performance of the grid
– Trigger corrective actions at sites, ROCs, service managers
• Many tools used:
– Distributed responsibility for tools maintenance and operation
– Operator portal, Info sys monitor, SFT/SAM, job monitors, etc.
• Site Functional Tests (SFT) and Site Availability Monitor (SAM)
– Framework to sample/test services at sites and publish results
– Can include ad-hoc tests (e.g. VO-specific) in the framework or externally
– Allows dynamic look-up by VO of sites that are currently OK for them
– SAM: extends the concept to measure service availability
– Web service access to the data
– Intend to use this to generate trouble tickets and alarms
• Primary tools of the operator on duty are
– Information system monitoring and SFT/SAM
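The SFT/SAM idea of sampling services at sites, publishing the results, and letting a VO look up the sites that currently pass the tests it cares about can be sketched as follows. This is an illustration of the concept only: `SiteTestFramework`, its methods, the test names and the fake site data are all hypothetical and do not reflect the real SFT or SAM interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

# A test takes a site name and returns True if the sampled service is OK.
SiteTest = Callable[[str], bool]


@dataclass
class SiteTestFramework:
    """Hypothetical stand-in for a framework that tests sites and publishes results."""
    tests: Dict[str, SiteTest] = field(default_factory=dict)
    results: Dict[str, Dict[str, bool]] = field(default_factory=dict)  # site -> test -> ok

    def register(self, name: str, test: SiteTest) -> None:
        """Add a standard or ad-hoc (e.g. VO-specific) test."""
        self.tests[name] = test

    def run(self, sites: Iterable[str]) -> None:
        """Run every registered test against every site and record the outcome."""
        for site in sites:
            self.results[site] = {
                name: self._run_one(test, site) for name, test in self.tests.items()
            }

    @staticmethod
    def _run_one(test: SiteTest, site: str) -> bool:
        try:
            return bool(test(site))
        except Exception:
            return False  # a test that crashes counts as a failure

    def sites_ok_for(self, required_tests: Iterable[str]) -> List[str]:
        """Look-up for a VO: sites that passed all of the tests it requires."""
        required = list(required_tests)
        return sorted(
            site for site, outcome in self.results.items()
            if all(outcome.get(name, False) for name in required)
        )


# Fake service states standing in for real probes (illustrative data only).
FAKE_STATE = {
    "site-a": {"job_submit": True, "storage": True},
    "site-b": {"job_submit": True, "storage": False},
    "site-c": {"job_submit": False, "storage": True},
}


def make_probe(service: str) -> SiteTest:
    return lambda site: FAKE_STATE[site][service]


framework = SiteTestFramework()
framework.register("job_submit", make_probe("job_submit"))
framework.register("storage", make_probe("storage"))
framework.run(FAKE_STATE)

if __name__ == "__main__":
    print(framework.sites_ok_for(["job_submit", "storage"]))
```

Keeping the per-test results, rather than a single pass/fail flag per site, is what allows each VO to apply its own definition of a usable site.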
Site metrics - availability
Support - GGUS
The EGEE Network Operations Centre
• Creating a “Network Support unit” in the EGEE operational model;
• Tasks:
– Receive tickets from NRENs, and forward to GGUS if impact on grid
– Receive tickets from GGUS if a network issue
– Troubleshoot & follow up with sites or NRENs
[Diagram: Users, GGUS, Support Units, ENOC, NRENs, GÉANT2, EGEE Network]
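The ENOC task list amounts to a small routing rule for tickets. A minimal sketch, assuming a ticket carries just its source and two flags; the `Ticket` and `Action` types are hypothetical, and the two fall-through outcomes (`NO_GRID_ACTION`, `NOT_FOR_ENOC`) are my assumptions, since the slide only describes the cases in which the ENOC acts.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    FORWARD_TO_GGUS = "forward to GGUS"
    TROUBLESHOOT = "troubleshoot and follow up with sites or NRENs"
    NO_GRID_ACTION = "no grid action"          # assumption: not stated on the slide
    NOT_FOR_ENOC = "not a network issue"       # assumption: not stated on the slide


@dataclass(frozen=True)
class Ticket:
    source: str                  # "NREN" or "GGUS"
    affects_grid: bool = False   # relevant for tickets coming from NRENs
    network_issue: bool = False  # relevant for tickets coming from GGUS


def route(ticket: Ticket) -> Action:
    """Apply the ENOC rules from the slide to one incoming ticket."""
    if ticket.source == "NREN":
        # NREN tickets go to GGUS only when they have an impact on the grid.
        return Action.FORWARD_TO_GGUS if ticket.affects_grid else Action.NO_GRID_ACTION
    if ticket.source == "GGUS":
        # GGUS passes tickets to the ENOC when the problem is a network issue.
        return Action.TROUBLESHOOT if ticket.network_issue else Action.NOT_FOR_ENOC
    raise ValueError(f"unknown ticket source: {ticket.source!r}")


if __name__ == "__main__":
    print(route(Ticket(source="NREN", affects_grid=True)).value)
    print(route(Ticket(source="GGUS", network_issue=True)).value)
```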
Interoperation
• Interoperability and interoperation (or co-operation)
• EGEE has interoperability activities with: (enabling the middlewares to work together)
– Open Science Grid (U.S.) – quite far advanced
– NorduGrid (ARC) – task in EGEE-II, 4 workshops and ongoing activity
– UNICORE – task in EGEE-II
– NAREGI (Japan) – 1 workshop, continued activity
– GIN (OGF) – active in several areas
• EGEE has interoperation activities with: (enabling the infrastructures to co-operate)
– Open Science Grid – actually in use
– Anticipated with NorduGrid (NDGF) for WLCG
Interoperating information systems
[Diagram: information systems of EGEE, OSG, NAREGI, TeraGrid, PRAGMA and NorduGrid]
Related infrastructure projects
[Figure labels: DEISA, TeraGrid]
Coordination in SA1 for:
• EELA, BalticGrid, EUMedGrid, EUChinaGrid, SEE-GRID
Interoperation with
• OSG, NAREGI
SA3:
• DEISA, ARC, NAREGI
Sustainability: Beyond EGEE-II
• Need to prepare for permanent Grid infrastructure
– Maintain Europe’s leading position in global science Grids
– Ensure reliable and adaptive support for all sciences
– Independent of short project funding cycles
– Modelled on success of GÉANT
– Infrastructure managed in collaboration with national grid initiatives
Summary of status
• Today we have an operating production infrastructure
– Probably the largest in the world, supporting many science domains
– Relied upon by several as their primary source of computing
• We have a managed operations process addressing most areas
– Constantly evolving
• Inter/Co-operation is a fact and is becoming more important very quickly
– Several applications need to work across grids – and they need support for that
• A large fraction of the value of the operations activity is in the intangibles – processes, structures, expertise, etc.
• We recognise that there are many outstanding problems with the current state of things: reliability and robustness are the focus for the next year