EGEE-II INFSO-RI-031688
Enabling Grids for E-sciencE
www.eu-egee.org
EGEE and gLite are registered trademarks CGW’06 17 October 2006
Operating Central European EGEE ROC
Marcin Radecki, Tomasz Szepieniec, Aleksander Kusznir
and Marian Bubak
ACC CYFRONET AGH
CGW’06; Cracow; 15-18 October 2006
Outline
• Introduction
– EGEE and Central European (CE) Region
• Challenges for CE Regional Operating Centre
– Applications & Users
– Cooperation
– Grid Infrastructure
• Conclusions
EGEE – Community
• Possibly the largest production grid infrastructure, spanning 32 countries
• ca. 200 sites grouped under 11 ROCs
• Scientific community of over 2000 people
• EGEE’06 conference in Geneva
– 700 attendees
– 32 "partner" projects present
ID        Name       Discipline               Users
EGEE-001  Atlas      Physics                    890
EGEE-002  Alice      Physics                    175
EGEE-003  LHCb       Physics                    159
EGEE-004  CMS        Physics                    632
EGEE-010  ESR        Earth Sciences              42
EGEE-014  Biomed     Biomed                     114
EGEE-039  Comp Chem  Chemistry                   15
EGEE-040  Magic      Astro particle physics      16
EGEE-042  dteam      Infrastructure testing      30
EGEE-065  EGEODE     Geo-Physics                 33
EGEE-066  Planck     Astrophysics                 8
Total                                          2114
Central European Region in EGEE
• 7 countries, 22 sites, 1493 CPUs, 70 TB of storage space
• Supports 10 of the 11 EGEE-approved VOs plus many associated VOs
• Site sizes range from 2-3 to 300 CPUs
• Need for solutions suitable for both large computing centres and small sites
– Maintenance model
– Skills & experience
– Scalable across a site’s resources
Challenges for CE ROC
• We need to attract new users to the grid and enable them to work in the new environment so that the resources are used efficiently. We must provide the services the users require.
• The grid spans many administrative domains, each of which needs to cooperate actively in order to share resources and collaborate productively. This is an excellent opportunity for sharing expertise.
• Having resources is not enough; the infrastructure needs to be stable before real users start using it, and we should maximize its utilization as far as possible.
Grid-enabling users
• Means to gain and retain users
– Understand users’ needs and satisfy them
– Easy access, how-to-use documentation (in national languages)
– Stable working environment
– User Support infrastructure
• Results:
– Computational chemistry
  Mariusz Sterzel (CYFRONET) coordinates computational chemistry applications in EGEE
  Enabling commercial software: Gaussian VO
  Study on pyrazoloquinolines (PQ) used for laser light generation
– Bioinformatics
  Never Born Protein folding and function recognition, Prof. Irena Roterman’s team (CM-UJ)
– Others
  Many small teams work within the regional catch-all VO, VOCE
VOs in the Region
• Supported VOs: alice, atlas, auger, balticgrid, belle, biomed, cms, compass, compchem, crogrid, esr, euchina, gamess, gaussian, geant4, gear, geclipse, hone, hungrid, lhcb, magic, ops, skgrid, voce, vocet, zeus
• Service/Data Challenges and test productions
– Atlas Service Challenge 4
– World-wide In Silico Docking On Malaria data challenges, 1st and 2nd (ongoing)
– EGEE-ITU International digital broadcasting agreement: compatibility and complementary analysis for the new frequency plan
Management of CE ROC
[Organization chart. Partners: CYFRONET, IISAS/PSNC, CESNET/PSNC, ICM WARSAW. Boxes: ROC Manager; User Support Responsible; Operations Responsible; Security Responsible; 1st Line Support; Core Grid Services; Regional Certification of Middleware; Grid Operator On Duty; Pre-Production Service]
• ROC Manager
– Represents the region in the Project’s managerial bodies
– Supervises all Service Activities
• Operations
– Coordinates actions related to infrastructure and middleware
– Escalates problems that cannot be solved to the next level
– Fits the Project requirements to the region
• User Support
– Provides support tools for users
– Takes part in shifts handling all user tickets in the GGUS system
• Security
– Incident handling procedures
– Incident response team
Procedures and Commitments
• Well-defined procedures make collaboration more efficient
– Clear paths for handling issues help avoid misunderstandings
– Newcomers are always arriving
– People tend to forget things over time
• Examples of procedures:
– New site registration
– New site admin joining
– Site problem handling
– Sending Weekly Reports
• Monitoring of commitments makes people more motivated
Operations - coordinate the work
• Operations is the most time-consuming task
– To make sure that operational procedures are understood and followed properly
– To ensure production requirements are met at the sites
– To work out the best solutions for problems
– To understand expectations and needs
– To make sure problems are solved properly
– To ensure weekly reports are completed and sent
• Three styles of site administration observed
– Keeps all services ready all the time: "I’m the best admin in the city"
– Reacts only when a problem report arrives: "I’m a bit occupied"
– Reacts only when their name appears on a public "black list": "I’m working hard on… something important"
Resources and their usage
• Accounting in EGEE
– July-October ’06: over 672k CPU hours computed in the CE region, equivalent to 275 CPUs running 24x7
– Problems with "missing" data
– Update rate: daily
• Our approach to accounting
– Site performance efficiency study
  Up-to-date information on what is going on at a site
  Maximize site utilization: better to have jobs queued at a site than idle CPUs
– Is being extended towards a new system for fine-grained accounting
[Figure: jobs executing and jobs queued at a site, plotted against the maximum number of CPUs; annotation: avoid low usage periods]
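The equivalence between CPU hours and continuously running CPUs quoted on this slide can be checked with a short calculation. The sketch below is only an illustration: the exact reporting window is not given on the slide, so the dates used here (1 July to 11 October 2006, 102 days) are an assumption, and the function names `equivalent_cpus` and `utilization` are hypothetical helpers, not part of any EGEE accounting tool.

```python
from datetime import date


def equivalent_cpus(cpu_hours: float, start: date, end: date) -> float:
    """Average number of CPUs that would have to run 24x7 between
    start (inclusive) and end (exclusive) to deliver cpu_hours."""
    wall_hours = (end - start).days * 24
    if wall_hours <= 0:
        raise ValueError("end must be after start")
    return cpu_hours / wall_hours


def utilization(jobs_executing: int, max_cpus: int) -> float:
    """Fraction of a site's CPUs that are busy, assuming one job per CPU."""
    if max_cpus <= 0:
        raise ValueError("max_cpus must be positive")
    return min(jobs_executing, max_cpus) / max_cpus


# Assumed reporting window: 1 July to 11 October 2006 (102 days).
cpus = equivalent_cpus(672_000, date(2006, 7, 1), date(2006, 10, 11))
print(round(cpus))  # 275

# Hypothetical snapshot: 150 jobs running on a 300-CPU site.
print(utilization(150, 300))  # 0.5
```

With the assumed window, 672k CPU hours correspond to about 275 CPUs busy around the clock, which matches the figure on the slide. The utilization helper expresses the "jobs executing versus maximum CPUs" comparison from the site efficiency study.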
Stable infrastructure - social aspect
• How EGEE keeps the Grid stable
– Grid Operator on Duty (GOD) watches the entire grid
  CE joined this activity in the first round in EGEE-II
– Raises a ticket for each detected problem
– Diagnoses problems and suggests solutions
– Uses monitoring tools for problem detection and availability metrics
• 1st Line Support in CE: how to be better than average?
– Detect and fix failures before the GOD team notices them and raises a ticket
– Support site admins with remedial actions
– Suggest known, well-working practices (expertise sharing)
– Writing knowledge down is painful; although it saves a lot of time at work, people need a lot of encouragement to do it
Stable infrastructure - monitoring with NAGIOS
• Try to monitor as much functionality as possible
– E.g. certificate expiration dates on all machines
– Reasonable probe frequency
• Send a problem notification immediately, but…
– Do not spam every 5 minutes
• Allow the site admin to signal that the problem is being worked on
– Send no further notifications until the admin reports back
• Allow the site admin to schedule an extra check at will
– So that the admin can see at once how well a workaround is working
• Smart testing hierarchy
• Monitors CE Core Services
– Added tests for checking RB, BDII, LFC, VOMS
• Used by 1st line support
– Overview of the region
– Detailed check of services
– Schedule checks when working on fixes
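The notification rules on this slide (notify at once, do not repeat every few minutes, stay quiet once the admin has said the problem is being handled) can be sketched as a small state machine. This is a minimal illustration of the logic only; Nagios has its own notification and acknowledgement settings, and the class name `NotificationPolicy`, the one-hour repeat interval and the reset-on-recovery behaviour are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NotificationPolicy:
    """Decides whether a failing check should trigger a new notification."""

    repeat_interval_s: int = 3600  # assumed minimum gap between repeats
    last_sent_at: Optional[float] = None
    acknowledged: bool = False

    def should_notify(self, failing: bool, now: float) -> bool:
        if not failing:
            # Problem is gone: reset so the next failure notifies at once.
            self.last_sent_at = None
            self.acknowledged = False
            return False
        if self.acknowledged:
            # The admin has said the problem is being worked on.
            return False
        if (self.last_sent_at is not None
                and now - self.last_sent_at < self.repeat_interval_s):
            # Too soon after the previous notification: do not spam.
            return False
        self.last_sent_at = now
        return True

    def acknowledge(self) -> None:
        """Called when the site admin reports the problem is being handled."""
        self.acknowledged = True


policy = NotificationPolicy(repeat_interval_s=3600)
print(policy.should_notify(True, now=0))     # True: first failure
print(policy.should_notify(True, now=300))   # False: only 5 minutes later
policy.acknowledge()
print(policy.should_notify(True, now=7200))  # False: acknowledged
```

Resetting the state on recovery means that a fresh failure after a fix is reported immediately again, which fits the "schedule checks when working on fixes" workflow.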
Data from EGEE CIC portal: https://egee.in2p3.fr/CIC/index.php?id=cic&subid=cic_roc_metrics&scope=project&project=&metrics=sft
Operations metrics results
[Two charts covering Dec 05 to Sep 06, each comparing EGEE, CE and the best player: "Functional test failure % ratio" (y-axis: % of failures, 0-9) and "Time unavailable % ratio" (y-axis: % of time, 0-9)]
EGEE Operations metrics results from the last 10 months
Conclusions
• CYFRONET gained know-how on:
– Coordination of a large initiative
– Organization of work for different subtasks
– Running a stable production infrastructure
– Accurate Grid job accounting
– Sensible and precise Grid infrastructure monitoring
– Introducing application users to the Grid
• Experience gathered in CE ROC can easily be reused in building the national Polish grid
PL-Grid, Warsaw, 22.09.2006
PL-Grid: a nationwide Polish grid infrastructure
Team of the Academic Computer Centre CYFRONET AGH
Kraków, June to September 2006
This document presents the motivation, goals, concept and approach for creating a national grid infrastructure, which is essential for modern scientific research (e-Science) and consistent with the European infrastructure.
PL-Grid as an infrastructure for e-Science
Scientific research today requires advanced information technologies. A growing number of research teams collaborate intensively, and this calls for IT tools that make it possible to collect and exchange the acquired knowledge on a global scale. Experimental results are huge, distributed data sets of varied structure, and working with them requires tools for accessing, integrating and processing the data. Computer simulation is a fully accepted research method, and results from simulations and experiments are combined more and more often. This novel approach is most visible in high-energy physics, astrophysics, the biological and medical sciences, and the Earth sciences.
Realizing this new paradigm of scientific research, called e-Science, requires a grid infrastructure (also called Cyber-Science Infrastructure). It comprises software that enables the sharing of various computing resources, and tools that support cooperation between partners within so-called virtual organizations.
Fig. 1. PL-Grid as an infrastructure for e-Science
[Figure labels: users; access and application development layer; grid portals, programming tools; grid services; basic grid services; virtual organization management; job management; data management; security system; monitoring; Globus; UNICORE (DEISA); LCG/gLite (EGEE); grid resources; distributed computing resources; distributed data repositories; national computer network]
Simplified architecture of PL-Grid
[Figure labels: domain grids; PL-Grid; infrastructure (hardware, network); coordination; reports; recommendations; information; proposals; evaluation; Consortium Management Board (coordinator + members); Operations Centre; Users’ Council; Consortium Council]
Organizational structure of PL-Grid
[Gantt chart. Columns: task; months 0, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36. Tasks: project preparation and approval; consortium organization; staff recruitment; equipment purchases; research and training infrastructure; production infrastructure; software development; grid training; activity reviews. Phases: test phase; pilot phase; maintenance and development phase]
Work schedule