Grid Computing and Grid Site Infrastructure
David Groep, NIKHEF/PDP and UvA
Outline

• The challenges: LHC, remote sensing, medicine, …
• Distributed computing and the Grid
  – Grid model and community formation, security
  – Grid software: computing, storage, resource brokers, information system
• Grid clusters & storage: the Amsterdam e-Science Facility
  – Computing: installing and operating large clusters
  – Data management: getting throughput to disk and tape
• Monitoring and site functional tests
• Does it work?
The LHC Computing Challenge

Physics @ CERN
• LHC particle accelerator
• operational in 2007
• 4 experiments
• ~10 Petabyte per year (= 10 000 000 GByte)
• 150 countries
• > 10 000 users
• lifetime ~20 years

Trigger and data-acquisition cascade:
• 40 MHz (40 TB/sec) into level 1: special hardware
• 75 kHz (75 GB/sec) into level 2: embedded processors
• 5 kHz (5 GB/sec) into level 3: PC farm
• 100 Hz (100 MB/sec) to data recording & offline analysis

http://www.cern.ch/
LHC Physics Data Processing
Source: ATLAS introduction movie, NIKHEF and CERN, 2005
Tiered data distribution model

For one experiment:

Data distribution issue
• 10 PB of data to distribute
• 15 major centres in the world
• ~200 institutions
• ~10 000 people

Processing issue
• 1 ‘event’ takes ~90 s to process
• events arrive at 100 events/s
• need: 9 000 CPUs (today)
• but also reprocessing, simulation, &c: 2-5× that in total
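The CPU estimate follows directly from the event rate and the per-event processing time. A quick check of the arithmetic (a plain-Python sketch; numbers taken from this slide):

    event_rate_hz = 100      # events arriving per second
    seconds_per_event = 90   # one CPU processes one event in ~90 s

    # To keep up, the farm must finish 100 events every second:
    cpus_needed = event_rate_hz * seconds_per_event
    print(cpus_needed)                              # 9000

    # Reprocessing and simulation multiply this by 2-5x:
    print(2 * cpus_needed, "-", 5 * cpus_needed)    # 18000 - 45000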
T0-T1 Data Flow to Amsterdam

[Diagram: the Tier-0 disk buffer at CERN feeds the Amsterdam disk buffer, which fans out to tape and to disk storage.]

Assumptions: Amsterdam takes 13% of the total, running 24 hours/day with no inefficiencies; figures are for the year 2008.

RAW:  1.6 GB/file, 0.026 Hz, 2246 files/day, 41.6 MB/s, 3.4 TB/day
ESD1: 500 MB/file, 0.052 Hz, 4500 files/day, 26 MB/s, 2.2 TB/day
AODm: 500 MB/file, 0.04 Hz, 3456 files/day, 20 MB/s, 1.6 TB/day

Total T0-T1 bandwidth (RAW + ESD1 + AODm1): 10 000 files/day, 88 MByte/s, 700 Mbit/s
Total to tape (RAW): 41.6 MB/s, 2-3 drives, 18 tapes/day
Total to disk (ESD1 + AOD1): 46 MB/s, 4 TB/day, 8000 files/day

Overall: 7.2 TByte/day to every Tier-1.

ATLAS data flows (draft). Source: Kors Bos, NIKHEF
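The per-stream figures are internally consistent; a sketch that recomputes them from file size and rate alone (plain Python; the small differences from the TB/day figures above come down to decimal-vs-binary units and rounding):

    day = 86400  # seconds

    streams = {
        # name: (file size in bytes, file rate in Hz)
        "RAW":  (1.6e9, 0.026),
        "ESD1": (500e6, 0.052),
        "AODm": (500e6, 0.04),
    }

    total = 0.0
    for name, (size, hz) in streams.items():
        print(f"{name}: {hz * day:.0f} files/day, "
              f"{size * hz / 1e6:.1f} MB/s, "
              f"{size * hz * day / 1e12:.1f} TB/day")
        total += size * hz

    # Aggregate T0->T1 bandwidth, in MByte/s and Mbit/s:
    print(f"total: {total / 1e6:.0f} MB/s = {total * 8 / 1e6:.0f} Mbit/s")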
Other similar applications: Earth Obs
Source: Wim Som de Cerff, KNMI, De Bilt
WISDOM Malaria drug discovery

Anti-malaria drug discovery:
– in-silico docking of ligands onto malaria parasite targets
– 60 000 jobs, taking over 100 CPU-years in total
– completed in less than 2 months using 3 000 CPUs

Resources used:
– 47 sites
– 15 countries
– 3 000 CPUs
– 12 TByte of disk
Infrastructure to cope with this

Source: EGEE SA1, Grid Operations Centre, RAL, UK, as of March 2006
The Grid and Community Building
The Grid Concept
Protocols and standards
Authentication and Authorization
Three essential ingredients for a Grid

‘Access computing like the electrical power grid’

A grid combines resources that
– are not managed by a single organization,
– use a common, open, general-purpose protocol, and
– provide additional qualities of service, i.e., are usable as a collective and transparent resource.
Based on: Ian Foster, GridToday, November 2003
Grid AuthN/AuthZ and VOs

• Access to shared services
  – cross-domain authentication, authorization, accounting, billing
  – common protocols for collective services
• Support multi-user collaborations in “Virtual Organisations”:
  – a set of individuals or organisations,
  – not under single hierarchical control,
  – temporarily joining forces to solve a particular problem at hand,
  – bringing to the collaboration a subset of their resources,
  – sharing those under their own conditions
  – the user’s home organization may or may not know about their activities
• Need to enable ‘easy’ single sign-on
  – a user is typically involved in many different VOs
• Leave the resource owner always in control
Virtual vs. Organic structure

[Diagram, roughly: two real organizations and one virtual community. Organization A holds compute servers C1 and C2, file server F1 (disks A and B), and persons A (faculty), B (staff) and C (student); Organization B holds compute server C3 and persons D (staff), E (faculty) and F (faculty). Virtual Community C draws a subset from both: person A (principal investigator), person B (administrator), persons D and E (researchers), compute server C1', and file server F1 (disk A only).]

Graphic: GGF OGSA Working Group
More characteristic VO structure
Trust in Grid Security

‘Security’: Authentication and Authorization

• There is no a priori trust relationship between members or member organisations!
  – VO lifetime can vary from hours to decades
  – a VO is not necessarily persistent (both long- and short-lived ones exist)
  – people and resources are members of many VOs
• … but a relationship is required
  – as a basis for authorising access
  – for traceability and liability, incident handling, and accounting
Authentication vs. Authorization

• Single authentication token (“passport”)
  – issued by a party trusted by all,
  – recognised by many resource providers, users, and VOs
  – satisfies the traceability and persistency requirements
  – in itself does not grant any access, but provides a unique binding between an identifier and the subject
• Per-VO authorisations (“visa”)
  – granted to a person/service via a virtual organisation
  – based on the ‘passport’ name
  – embedded in the single-sign-on token (proxy)
  – acknowledged by the resource owners
  – providers can obtain lists of authorised users per VO, but can still ban individual users
Federated PKI for authentication

• A federation of many independent CAs
  – common minimum requirements
  – trust domain as required by users and relying parties
  – well-defined and peer-reviewed acceptance process
• User has a single identity
  – from a local CA close by
  – works across VOs, with single sign-on via ‘proxies’
  – the certificate itself is also usable outside the grid

[Diagram: CAs 1 … n, bound by a common charter, guidelines, and acceptance process, trusted by relying parties 1 … n.]
International Grid Trust Federation and EUGridPMA, see http://www.eugridpma.org/
Authorization Attributes (VOMS)

• VO-managed attributes embedded in the proxy
• attributes signed by the VO
• proxy signed by the user
• user certificate signed by the CA

[Diagram: a VOMS proxy with subject “/C=IT/O=INFN/L=CNAF/CN=Pinco Palla/CN=proxy”, carrying signed VOMS pseudo-certs; on a request, the service authenticates the proxy and queries its authorization database.]
Grid Services
Logical Elements in a Grid
Information System
Services in a Grid

• Computing Element: “front-end service for a (set of) computers”
  – cluster computing: typically Linux with IP interconnect
  – capability computing: typically shared-memory supercomputers
  – a ‘head node’ batches or forwards requests to the cluster
• Storage Element: “front-end service for disk or tape”
  – disk and tape based
  – varying retention time, QoS, uniformity of performance
  – ‘ought’ to be ACL-aware: mapping of grid authorization to POSIX ACLs
• File Catalogue
• Information System
  – directory-based for static information
  – monitoring and bookkeeping for real-time information
• Resource Broker
  – matches user job requirements to offers in the information system
  – the WMS allows disconnected operation of the user interface
Typical Grid Topology
Job Description Language

This is JDL that the user might send to the Resource Broker:

Executable         = "catfiles.sh";
StdOutput          = "catted.out";
StdError           = "std.err";
Arguments          = "EssentialJobData.txt LogicalJobs.jdl /etc/motd";
InputSandbox       = {"/home/davidg/tmp/jobs/LogicalJobs.jdl",
                      "/home/davidg/tmp/jobs/catfiles.sh"};
OutputSandbox      = {"catted.out", "std.err"};
InputData          = "LF:EssentialJobData.txt";
ReplicaCatalog     = "ldap://rls.edg.org/lc=WPSIX,dc=cnrs,dc=fr";
DataAccessProtocol = "gsiftp";
RetryCount         = 2;
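JDL is a flat list of attribute = value; assignments (a Condor ClassAd), so it is straightforward to generate from a job description held in code. A minimal sketch (plain Python; the quoting rules are simplified, and the attribute set is a subset of the example above):

    def to_jdl(attrs):
        """Render a dict as JDL: strings are quoted, lists become
        {...} sets, and numbers are emitted verbatim."""
        def fmt(v):
            if isinstance(v, list):
                return "{" + ", ".join(fmt(x) for x in v) + "}"
            if isinstance(v, str):
                return '"' + v + '"'
            return str(v)
        return "\n".join(f"{k} = {fmt(v)};" for k, v in attrs.items())

    job = {
        "Executable": "catfiles.sh",
        "StdOutput": "catted.out",
        "StdError": "std.err",
        "OutputSandbox": ["catted.out", "std.err"],
        "RetryCount": 2,
    }
    print(to_jdl(job))

The resulting file is then handed to the broker with the submission command of the middleware, edg-job-submit in the EDG/LCG releases described here.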
How do you see the Grid?

The broker matches the user’s request with a site:
• ‘information supermarket’ matchmaking (using Condor Matchmaking)
• uses the information published by the site

The Grid information system is ‘the only information a user ever gets about a site’, so it
• should be reliable, consistent and complete,
• uses a standard schema (GLUE) to describe sites, queues and storage (complex schema semantics), and
• is currently presented as an LDAP directory.
LDAP Browser Jarek Gawor: www.mcs.anl.gov/~gawor/ldap
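Since the information system is plain LDAP, any LDAP client can browse it. A sketch using python-ldap; the endpoint shown (lcg-bdii.cern.ch:2170) and the base DN are the conventional top-level BDII settings of the period, so treat them as assumptions for your own infrastructure:

    import ldap  # python-ldap

    # Anonymous bind to the top-level BDII (host and port are assumptions):
    conn = ldap.initialize("ldap://lcg-bdii.cern.ch:2170")
    conn.simple_bind_s()  # the information system allows anonymous reads

    # Ask every Computing Element for its published queue state:
    results = conn.search_s(
        "mds-vo-name=local,o=grid",    # conventional LCG base DN
        ldap.SCOPE_SUBTREE,
        "(objectClass=GlueCE)",
        ["GlueCEUniqueID", "GlueCEStateRunningJobs", "GlueCEStateTotalJobs"],
    )
    for dn, attrs in results:
        print(dn, attrs)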
Glue Attributes Set by the Site

• Site information
    SiteSysAdminContact: mailto: [email protected]
    SiteSecurityContact: mailto: [email protected]
• Cluster info
    GlueSubClusterUniqueID=gridgate.cs.tcd.ie
    HostApplicationSoftwareRunTimeEnvironment: LCG-2_6_0
    HostApplicationSoftwareRunTimeEnvironment: VO-atlas-release-10.0.4
    HostBenchmarkSI00: 1300
    GlueHostNetworkAdapterInboundIP: FALSE
    GlueHostNetworkAdapterOutboundIP: TRUE
    GlueHostOperatingSystemName: RHEL
    GlueHostOperatingSystemRelease: 3.5
    GlueHostOperatingSystemVersion: 3
    GlueCEStateEstimatedResponseTime: 519
    GlueCEStateRunningJobs: 175
    GlueCEStateTotalJobs: 248
• Storage: similar info (paths, max number of files, quota, retention, …)
Information system and brokering issues

• The size of the information system scales with #sites and #details
  – already 12 MByte of LDIF
  – matching a job takes ~15 sec
• Scheduling policies are infinitely complex
  – no static schema can likely express this information
• Much information (still) needs to be set up manually
  … the next slides show the situation as of Feb 3, 2006

The info system is the single most important grid service.

• The current broker tries to make an optimal decision … instead of a ‘reasonable’ one
Example: GlueServiceAccessControlRule

For your viewing pleasure: GlueServiceAccessControlRule
• 261 distinct values seen
• (one of) the least frequently occurring value(s), with 1 instance:
    /C=BE/O=BEGRID/OU=VUB/OU=IIHE/CN=Stijn De Weirdt
• (one of) the most frequently occurring value(s), with 310 instances: dteam
• (one of) the shortest value(s) seen: d0
• (one of) the longest value(s) seen:
    anaconda-ks.cfg configure-firewall install.log install.log.syslog j2sdk-1_4_2_08-linux-i586.rpm lcg-yaim-latest.rpm myproxy-addons myproxy-addons.051021 site-info.def site-info.def.050922 site-info.def.050928 site-info.def.051021 yumit-client-2.0.2-1.noarch.rpm
Example: GlueSEControlProtocolType

For your viewing pleasure: GlueSEControlProtocolType

 freq  value
    1  srm
    1  srm_v1
    1  srmv1
    3  SRM
    7  classic

… which means that of ~410 Storage Elements, only 13 publish interaction info. Ouch!
Example: GlueHostOperatingSystemRelease

Today's attribute: GlueHostOperatingSystemRelease

 freq  value
    1  3.02
    1  3.03
    1  3.2
    1  3.5
    1  303
    1  304
    1  3_0_4
    1  SL
    1  Sarge
    1  sl3
    2  3.0
    2  305
    4  3.05
    4  SLC3
    5  3.04
    5  SL3
   18  3.0.3
   19  7.3
   24  3
   37  3.0.5
   47  3.0.4
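Surveys like this one are a frequency count over an LDIF dump of the information system. A minimal sketch, assuming the whole tree has been dumped to a file with ldapsearch (the file name is illustrative):

    import re
    from collections import Counter

    ATTR = "GlueHostOperatingSystemRelease"
    counts = Counter()

    # bdii-dump.ldif: output of an ldapsearch over the whole BDII (assumed)
    with open("bdii-dump.ldif") as ldif:
        for line in ldif:
            m = re.match(ATTR + r":\s*(.+)", line)
            if m:
                counts[m.group(1).strip()] += 1

    for value, freq in sorted(counts.items(), key=lambda kv: kv[1]):
        print(f"{freq:5d}  {value}")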
Example: GlueSAPolicyMaxNumFiles

136 separate Glue attributes seen

For your viewing pleasure: GlueSAPolicyMaxNumFiles

 freq  value
    6  99999999999999
   26  999999
   52  0
   78  00
 1381  10

For your viewing pleasure: GlueServiceStatusInfo

 freq  value
    2  No Known Problems.
   55  No problems
  206  No Problems
LCG’s Most Popular Resource Centre
Example: SiteLatitude

Today's attribute: GlueSiteLatitude

 freq  value
    1  1.376059
    1  33.063924198120645
    1  37.0
    1  38.739925290125484
    1  39.21
    …
    1  45.4567
    1  55.9214118
    1  56.44
    1  59.56
    1  67
    1  GlueSiteWeb: http://rsgrid3.its.uiowa.edu
    2  40.8527
    2  48.7
    2  49.16
    2  50
    3  41.7827
    3  46.12
    8  0.0

(Eight sites publish latitude 0.0, presumably the ‘most popular resource centre’ of the previous slide.)
The Amsterdam e-Science Facilities
Building and Running a Grid Resource Centre
Compute Clusters
Storage and Disk Pools
Grid Resources in Amsterdam

• 2× 1.2 PByte in 2 robots
• 36 + 512 CPUs (IA32)
• disk caches of 10 + 50 TByte
• multiple 10 Gbit/s links

• 240 CPUs (IA32)
• 7 TByte disk cache
• 10 + 1 Gbit/s links to SURFnet
• 2 Gbit/s to SARA

(counting only resources with either GridFTP or Grid job management)

BIG GRID approved January 2006!
• an investment of € 29M over the next 4 years
• for: LCG, LOFAR, Life Sciences, Medical, DANS, Philips Research, …
• see http://www.biggrid.nl/
Computing
Cluster topology
Connectivity
System services setup
Grid Site Logical Layout
Batch Systems and Schedulers

• The batch system keeps the list of nodes and jobs
• The scheduler matches jobs to nodes based on policies, as in the toy pass below
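In miniature, one scheduling pass looks like this (a toy sketch in plain Python; real schedulers such as MAUI weigh many more policies, e.g. fair-share history and reservations):

    # Queued jobs and free nodes; all fields are illustrative.
    jobs = [
        {"id": 1, "vo": "atlas",  "cpus": 2},
        {"id": 2, "vo": "wisdom", "cpus": 1},
    ]
    nodes = [
        {"name": "wn001", "free_cpus": 1},
        {"name": "wn002", "free_cpus": 4},
    ]
    vo_priority = {"atlas": 10, "wisdom": 5}   # policy: per-VO priority

    # Highest-priority job first; first node with enough free CPUs wins.
    for job in sorted(jobs, key=lambda j: vo_priority[j["vo"]], reverse=True):
        node = next((n for n in nodes if n["free_cpus"] >= job["cpus"]), None)
        if node:
            node["free_cpus"] -= job["cpus"]
            print(f"job {job['id']} ({job['vo']}) -> {node['name']}")
        else:
            print(f"job {job['id']} stays queued")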
NDPF Logical Composition
NDPF Network Topology
Quattor and ELFms
System Installation
Installation

Installing and managing a large cluster requires a system that
• scales to O(10 000) nodes, with
  1. a wide variety of configurations (‘service nodes’)
  2. and also many instances of identical systems (‘worker nodes’)
• lets you validate new configurations before it’s too late
• can rapidly recover from node failures by commissioning a new box (i.e. in minutes)

Popular systems include:
• Quattor (from ELFms)
• xCAT
• OSCAR
• SystemImager & cfengine
Quattor

ELFms stands for ‘Extremely Large Fabric management system’. Its subsystems:
• quattor: configuration, installation and management of nodes
• LEMON: system / service monitoring
• LEAF: hardware / state management

• ELFms manages and controls most of the nodes in the CERN computer centre
  – ~2 100 nodes out of ~2 400
  – multiple functionalities and cluster sizes (batch nodes, disk servers, tape servers, DB, web, …)
  – heterogeneous hardware (CPU, memory, HD size, …)
  – supported OSes: Linux (RH7, RHES2.1, RHES3) and Solaris (9)

Source: German Cancio, CERN IT, see http://www.quattor.org/
Developed within the EU DataGrid project (http://www.edg.org/); development and maintenance are now coordinated by CERN/IT.
System Installation

Central configuration drives:
• the Automated Installation Infrastructure (AII):
  PXE and Red Hat KickStart or Solaris JumpStart
• Software Package Management (SPMA):
  transactional management based on RPMs or PKGs
• Node Configuration (NCM): autonomous agents running
  service configuration components for (re-)configuration
• the CDB, holding the ‘desired’ state of each node:
  – a two-tiered configuration language (PAN and LLD XML)
  – a self-validating, complete language (“swapspace = 2 * physmem”)
  – template inheritance and composition, sketched below
    (“tbn20 = base system + CE_software + pkg_add(emacs)”)
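PAN syntax itself is out of scope here, but the composition idea (a node profile is a base template plus overlays, with derived values validated against rules like “swapspace = 2 * physmem”) can be sketched language-neutrally. A plain-Python sketch; all names and values are illustrative:

    # Templates as dicts; composing a profile = merging overlays onto a base.
    base_system = {"physmem_mb": 2048, "packages": ["openssh", "syslog"]}
    CE_software = {"packages": ["globus-gatekeeper", "pbs-server"]}

    def compose(base, *overlays):
        profile = dict(base, packages=list(base["packages"]))
        for o in overlays:
            profile["packages"] += o.get("packages", [])
            profile.update({k: v for k, v in o.items() if k != "packages"})
        return profile

    def validate(profile):
        # the 'self-validating language' idea: derived values obey rules
        profile["swapspace_mb"] = 2 * profile["physmem_mb"]
        return profile

    # "tbn20 = base system + CE_software + pkg_add(emacs)"
    tbn20 = validate(compose(base_system, CE_software, {"packages": ["emacs"]}))
    print(tbn20)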
Configuration Database

[Diagram, roughly: GUIs, scripts and a CLI feed the CDB (via SOAP); the pan compiler turns templates into per-node XML profiles, fetched by each node over HTTP into the local cache of its Configuration Cache Manager (CCM), which serves the node management agents; the CDB is also mirrored into an RDBMS, queried via SQL by LEAF, LEMON and others.]

Source: German Cancio, CERN IT
An example of template hierarchy in the CDB (CERN CC):
• site level (CERN CC): name_srv1: 137.138.16.5, time_srv1: ip-time-1
• lxbatch cluster: cluster/name: lxbatch, master: lxmaster01, pkg_add(lsf5.1)
• lxplus cluster: cluster/name: lxplus, pkg_add(lsf5.1); plus disk_srv
• node level:
  – lxplus001: eth0/ip: 137.138.4.246, pkg_add(lsf6_beta)
  – lxplus020: eth0/ip: 137.138.4.225
  – lxplus029: (inherited defaults only)

Source: German Cancio, CERN IT
Managing (cluster) nodes

[Diagram, roughly: the Install Manager drives a node (re)install via the vendor system installer (RH73, RHES, Fedora, …) over dhcp/PXE and nfs/http, laying down the base OS. Software servers (SWRep) serve packages (RPM, PKG) over nfs, http or ftp. On the managed node, the Node Configuration Manager (NCM) configures system services (AFS, LSF, SSH, accounting, …) and the Software Package Manager (SPMA) installs kernel, system and application software from a local package cache; the CDB, via the CCM, steers both standard and managed nodes.]
Source: German Cancio, CERN IT
Node Management Agents

• NCM (Node Configuration Manager): a framework in which service-specific plug-ins, called Components, make the system changes needed to bring the node to its CDB desired state
  – regenerate local config files (e.g. /etc/sshd/sshd_config), restart/reload services (SysV scripts)
  – a large number of components is available (system and Grid services)
• SPMA (Software Package Management Agent) and SWRep: manage all or a subset of the packages on the nodes
  – full control on production nodes; on development nodes, non-intrusive, configurable management of system and security updates
  – a package manager, not only an upgrader (roll-back and transactions)
• Portability: a generic framework with plug-ins for NCM and SPMA
  – available for RHL (RH7, RHES3) and Solaris 9
• Scalability to O(10K) nodes
  – automated replication for redundant / load-balanced CDB/SWRep servers
  – scalable protocols, e.g. HTTP, and replication/proxy/caching technology

Source: German Cancio, CERN IT
Back in Amsterdam
User and directory management
Logging and auditing
Monitoring and fault detection
User directory and automount maps

A large number of alternatives exists (nsswitch.conf / pam.d):
• files-based (/etc/passwd, /etc/auto.home, …)
• YP/NIS, NIS+
• a database (MySQL/Oracle)
• LDAP

We went with LDAP:
• information is in a central location (like NIS)
• it can scale by adding slave servers (like NIS)
• it is secured by LDAP over TLS (unlike NIS)
• it can be managed by external programs (also unlike NIS)
(in due course we will map grid credentials to uids in real time)

But you will need nscd, or a large number of slave servers; see the sketch below.
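The client side of such a setup, trimmed to the essentials (standard glibc / nss_ldap conventions; the server name and base DN are assumptions, and exact option names vary per distribution):

    # /etc/nsswitch.conf: resolve users and automount maps from LDAP,
    # trying local files first
    passwd:     files ldap
    group:      files ldap
    automount:  files ldap

    # /etc/ldap.conf (nss_ldap): where to look, and over TLS
    host ldap.example.org
    base dc=example,dc=org
    ssl start_tls

nscd then caches lookups on every worker node, so that a burst of job starts does not turn into a burst of directory queries.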
Logging and Auditing

Auditing and logging:
• syslog (also for the grid gatekeeper, gsiftp, credential mapping)
• process accounting (psacct)

For the paranoid: use the tools included for CAPP/EAL3+, i.e. LAuS
• system call auditing
• highly detailed: useful both for debugging and incident response
• by default auditing is critical: the system will halt on audit errors

If your worker nodes are on private IP space:
• you need to preserve the logs of the NAT box as well
Grid Cluster Logging

Grid statistics and accounting:
• rrdtool views of the batch system load per VO
  – combine qstat and pbsnodes output via a script, cron, and RRD (a sketch follows below)
• the cricket network traffic grapher
• extract PBS accounting data into a dedicated database
  – grid users get a ‘generic’ uid from a dynamic pool, so the database must link that uid back to the grid DN and VO
• from the accounting db, upload anonymized records to APEL
  – APEL is the grid accounting system for VOs and funding agencies
  – the accounting db is also useful to charge costs to projects locally
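A hedged sketch of the cron job behind those per-VO load graphs: count running jobs per queue from qstat and push the totals into round-robin databases (plain Python; the parsing assumes the classic OpenPBS/Torque qstat columns, and the RRD files are assumed to exist):

    import subprocess, time

    # qstat default layout: Job id, Name, User, Time Use, S, Queue;
    # state 'R' is column 5, the queue name column 6.
    out = subprocess.run(["qstat"], capture_output=True, text=True).stdout

    running = {}
    for line in out.splitlines():
        cols = line.split()
        if len(cols) >= 6 and cols[4] == "R":
            running[cols[5]] = running.get(cols[5], 0) + 1

    # One RRD per queue/VO, created beforehand with 'rrdtool create':
    for queue, njobs in running.items():
        subprocess.run(["rrdtool", "update", queue + ".rrd",
                        f"{int(time.time())}:{njobs}"])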
NDPF Occupancy
Usage of the NIKHEF NDPF Compute farm
Average occupancy in 2005: ~ 78%
each colour represents a grid VO, black line is #CPUs available
But at times, in more detail
Auditing incident: a disk with less than 15% free space makes the syscall-audit system panic. New processes cannot write audit entries, which is fatal, so they wait, and wait, and … the head node has the most activity, and fails first!

An unresponsive node makes the MAUI scheduler wait for 15 minutes, then give up and start scheduling again, hitting the rotten node once more, and …

The PBS server tries desperately to contact a dead node whose CPU has turned into Norit (charcoal) … and can no longer serve any requests.
Black Holes

A mis-configured worker node accepts jobs that all die within seconds. Before long, the entire job population will be sucked into this black hole…
Clusters: what did we see?

• The Grid (and your cluster) are error amplifiers
  – “black holes” may eat your jobs piecemeal
  – dangerous “default” values can spoil the day (“GlueERT: 0”)
• Monitor! (and allow for some failures, and design for rapid recovery)
• Users don’t have a clue about your system beforehand (that’s the downside of those ‘autonomous organizations’)
• If you want users to have a clue, you must publish your clues correctly (the information system is all they can see)
• Grid middleware may effectively DoS your system
  – e.g. doing a qstat for every job, every minute, to feed the logging & bookkeeping …
• Power consumption is the greatest single limitation on CPU density
• And finally: keep your machine room tidy, and label everything … or your colleague will not be able to find that #$%^$*! machine in the middle of the night…
Data Storage
Typical data flows
Matching storage to computing
Disk pools and their interfaces
Atlas Tier-1 data flows

[Diagram: ATLAS Tier-1 data flows (draft): real data storage, reprocessing and distribution. Incoming from Tier-0 to the disk buffer: RAW (1.6 GB/file, 0.02 Hz, 1.7K files/day, 32 MB/s, 2.7 TB/day), ESD2 (0.5 GB/file, 0.02 Hz, 1.7K files/day, 10 MB/s, 0.8 TB/day), AOD2 (10 MB/file, 0.2 Hz, 17K files/day, 2 MB/s, 0.16 TB/day) and AODm2 (500 MB/file, 0.004 Hz, 0.34K files/day, 2 MB/s, 0.16 TB/day); together 0.044 Hz, 3.74K files/day, 44 MB/s, 3.66 TB/day. RAW continues to tape (32 MB/s, 2.7 TB/day). Disk storage exchanges ESD2 (10 MB/s, 0.8 TB/day) and AODm2 (0.036 Hz, 18 MB/s, 1.44 TB/day) with the other Tier-1s, serves ESD1/AODm1 (10 + 20 MB/s) to the CPU farm for reprocessing, and ships AODm2 (20 MB/s, 1.6 TB/day) to the Tier-2s. Plus simulation & analysis data flows.]

ATLAS data flows (draft). Source: Kors Bos, NIKHEF
SC3 storage network (SARA)

Disk-to-disk: 583 MByte/s, i.e. 4.6 Gbit/s, over the world.
Graphic: Mark van de Sanden, SARA
Tier-1 Architecture SARA (storage)
Graphic: Mark van de Sanden, SARA
Matching Storage to Computing

Doing the math. A simple job:
– read a 1 MByte piece of a file (typically 1 “event”)
– calculate on it for 30 seconds
– do this for the 2 000 events per file (i.e. 2 GByte files)
– on 1 000 files (1 day of running) this takes 700 days
– it needs 2 TByte in total, i.e. 4 IDE disks of 500 GB

On the Grid, spread out over 1 000 CPUs:
– all jobs start at the same time, each retrieving a 2 GByte input file
– the machine holding this 2 TByte disk is on a 100 Mbit/s link
– effective throughput: 10 MByte/s
– thus 10 kByte/s per machine
– it takes 55 hours before the file transfers finish!
– and after that, only 17 hours of calculation
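The punch line is plain arithmetic; recomputing it (a Python sketch with the numbers from this slide):

    GB, MB, kB = 1e9, 1e6, 1e3

    n_jobs    = 1000
    file_size = 2 * GB      # one input file per job
    link      = 10 * MB     # file server's effective throughput, bytes/s

    per_job_rate = link / n_jobs                 # bytes/s for each job
    transfer_h   = file_size / per_job_rate / 3600
    compute_h    = 2000 * 30 / 3600              # 2000 events x 30 s each

    print(f"{per_job_rate / kB:.0f} kB/s per job")  # 10 kB/s
    print(f"transfers: {transfer_h:.0f} h")         # ~56 h
    print(f"computing: {compute_h:.1f} h")          # ~16.7 h

Without replication, or without moving the jobs to the data, the network link of a single file server throttles the whole farm.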
Storage

Just for ATLAS, one of the experiments:

• RAW & ESD data flow: ~4 TByte/day (1.4 PB/year) to tape
  – expected to be a permanent “museum” copy
  – largely scheduled access (intelligent staging possible), read & write
  – disk buffers before the tape store can be smallish (~10%)
• ‘Chaotic’ access by real users: ~2-4 TByte/day of throughput
  – the lifetime of the data is finite but long (typically 2+ years)
  – access is needed from the worker nodes, i.e. from O(1 000) CPUs
  – random “skimming” access pattern
  – needs disk server farms of typically 500 TByte - 1 PByte
• Management of disk resources
  – split the ‘file system view’ (file metadata) from the object store
  – dCache & dcap, DPNS & DPM, GPFS & ObjectStore, …
DPM Manageability

Example: the Disk Pool Manager (DPM)

• Elements in DPM
  – SRM: scheduling of requests via a standard interface
  – DPM server: management, and translation of a filename into a transfer URL (on a disk node)
  – DPNS name service: ACLs, directory layout, ownership, size
  – object servers: effectuate the actual (GridFTP) transfers
• No central configuration files
  – disk nodes request to add themselves to the DPM
  – all state is kept in a DB (easy to restore after a crash)
• Easy to remove disks and partitions
  – allows simple reconfiguration of the disk pools
  – files are stored as whole chunks (safer than striping over O(100) servers!)
  – the administrator can temporarily remove file systems if a disk has crashed
  – the DPM automatically marks a file system “unavailable” when it cannot be contacted
Example: SRM put processing (1)

(Components: the client, the SRM daemon, the DPM daemon, the DPM database, the DPNS daemon, and data servers each running a GridFTP daemon.)

1a. the client sends an SRM Put to the SRM daemon
1b. the request is put into the request database
1c. the SRM request ID is returned to the client
Example: SRM put processing (2)

2a. the DPM daemon gets the request from the database
2b. it checks permissions and adds the file to the name service (DPNS)
2c. it picks the best data server to put the data onto
2d. it adds the TURL to the request database and marks the request ‘Ready’
2e. it adds the file to the replica table and sets its status to ‘Pending’
Example: SRM put processing (3)

3a. the client polls with SRM getRequestStatus
3b. the SRM daemon gets the TURL from the request database
3c. the TURL is returned to the client
Example: SRM put processing (4)

4a. the client sets the request ‘Running’ (SRM v1)
4b. the SRM daemon updates the status of the request
Example: SRM put processing (5)

5. the client puts the file to the selected data server via GridFTP
Example: SRM put processing (6)

6a. the client sets the request ‘Done’ (SRM v1)
6b. the SRM daemon notifies the DPM daemon
6c. the DPM daemon gets the file size from the data server
6d. it updates the replica metadata (size/status/pin time)
6e. it updates the status of the request
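Seen from the client, the whole put is: request, poll until a transfer URL appears, move the data, mark done. A sketch of that loop; the srm_* and gridftp_* helpers are hypothetical stand-ins for the SRM v1 client calls, stubbed here so the sketch runs:

    import time

    # Hypothetical stand-ins for an SRM v1 client library; these stubs
    # only log, a real client speaks the SRM web-service protocol.
    def srm_put(surl):                print("SRM Put", surl); return 42
    def srm_get_request_status(rid):  return ("Ready", "gsiftp://ds01/f1")
    def srm_set_status(rid, state):   print("set", state)
    def gridftp_transfer(src, turl):  print("gridftp", src, "->", turl)

    def put_file(local_path, surl):
        request_id = srm_put(surl)                 # steps 1a-1c
        while True:                                # steps 2-3: wait for TURL
            status, turl = srm_get_request_status(request_id)
            if status == "Ready":
                break
            time.sleep(5)
        srm_set_status(request_id, "Running")      # step 4
        gridftp_transfer(local_path, turl)         # step 5: the data moves
        srm_set_status(request_id, "Done")         # step 6

    put_file("run1234.raw", "srm://se.example.org/atlas/run1234.raw")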
Other disk storage solutions

• Cluster file systems: PVFS2, Lustre, GFS
  – built for parallel I/O performance
  – the cluster is not expected to grow (or shrink)
  – can obtain better throughput on a single file within the cluster
  – node failure can be quite catastrophic
• Some (e.g. GFS) are limited by block-device / file-system limits
  – a 32-bit Linux 2.4 kernel: 2 TByte
  – ext2/3, 4 kByte blocks: max file size 2 TB, max fs size 16 TB
  – ReiserFS 3.6: max file size 1 EB, max fs size 16 TB ???!
  – XFS: max file size 8 EB, max fs size 8 EB
Monitoring at the global scale
Monitoring
Site Functional Tests
Real-time throbbing job monitor
Success Rate

What’s the chance that the whole grid is working correctly?

If a single site has 98.5% reliability (i.e. it is down 5 days/year):
• with 200 sites, that gives you only a ~4% chance that the whole grid is working correctly (the arithmetic is below)
• and 98.5% is quite optimistic to begin with …

So:
• build the grid, both middleware and user jobs, for failure
• monitor sites with both system and functional tests
• dynamically exclude sites with a current malfunction
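The ~4% is just independence at work: 0.985 raised to the 200th power (a one-line check in Python):

    p_site = 0.985      # single-site availability: down ~5.5 days/year
    n_sites = 200

    p_all_up = p_site ** n_sites
    print(f"{p_all_up:.1%}")   # ~4.9%: 'everything up' is the rare event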
Monitoring Tools

1. GIIS Monitor
2. GIIS Monitor graphs
3. Site Functional Tests
4. GOC Data Base
5. Scheduled Downtimes
6. Live Job Monitor
7. GridIce: VO view
8. GridIce: fabric view
9. Certificate Lifetime Monitor
Source: Ian Bird, SA1 Operations Status, EGEE-4 Conference, Pisa, November 2005
Google Grid Map
Freedom of Choice
• Tool for VOs to make a site selection based on a set of standard tests
Success Rate: WISDOM
Average success rate for jobs: 70-80% (single submit)
[Chart: “Success rate (August)”: number of jobs per day (0-5000, left axis) over days 1-27, and the success rate (0-1, right axis); series shown: registered, success (final status), aborted (final status), cancelled (final status); success rate = success / (registered - cancelled).]
Source: N. Jacq, LPC and IN2P3/CNRS “Biomedical DC Preliminary Report WISDOM Application, 5 sept 2005
Failure reasons vary

Biomed data challenge: distribution of abort reasons (10/07/2005 - 27/08/2005):
• 63%: mismatching resources
• 28%: wrong configuration
• 4%: network/connection failures
• 4%: proxy problems
• 1%: JDL problems

Typical underlying causes: a failing middleware component, or a wrong request in the job JDL.

[Also shown: the abort reason distribution for all VOs, 01/2005 - 06/2005.]
Source: N. Jacq, LPC and IN2P3/CNRS “Biomedical DC Preliminary Report WISDOM Application, 5 sept 2005
Is the Grid middleware current?

• Common causes of failure:
  – the job specified an impossible combination of resources
  – a wrong middleware version at the site
  – not enough space in the proper place ($TMPDIR)
  – environment configuration ($VO_vo_SW_DIR, $LFC_HOST, …)
[Chart: number of sites publishing each middleware release (0-140, vertical axis) per week from 12/02/2005 to 25/06/2005, for the releases LCG-2_4_0, LCG-2_3_1 and LCG-2_3_0.]
Going From here
Does it work
How can we make it better
Going from here

Many nice things to do:
• Most of LCG provides a single OS (RHEL3), but users may need SLES, Debian, Gentoo, … or specific libraries. Virtualisation (Xen, VMware)?
• Scheduling user jobs
  – both the VO and the site want to set part of the priorities …
• Auditing and user tracing in this highly dynamic system
  – can we know for sure who is running what, and where? Or whether a user is DDoS-ing the White House right now?
  – out of 221 sites, we know for certain there is a compromise!
More things to do …
• Sparse file access: access data efficiently over the wide area
• Can we do something useful with the large disks in all the worker nodes? (our 240 CPUs share ~8 TByte of unused disk space!)
• There are new grid software releases every month, and the configuration comes from different sources … how can we combine and validate all these configurations quickly and easily?
Job submission live monitor
Source: Gidon Moont, Imperial College, London, HEP and e-Science Centre
Outlook
Towards a global persistent grid infrastructure:
• interoperability and persistency that are project-independent
  – Europe: EGEE-2, a ‘European Grid Organisation’
  – US: Open Science Grid
  – Asia-Pacific: APGrid & PRAGMA, NAREGI, APAC, K*Grid, …
  – GIN aim: cross-submission and file access by the end of 2006
• extension to industry
  – first: industrial engineering, financial scenario simulations
• new ‘middleware’
  – we are just starting to learn how it should work
• extend more into the sharing of structured data
A Bright Future!
Imagine that you could plug your computer into the wall and have direct access to huge computing resources immediately, just as you plug in a lamp to get instant light. …
Far from being science-fiction, this is the idea the [Grid] is about to make into reality.
The EU DataGrid project brochure, 2001