Building the Grid
Grid Middleware 8
David Groep, lecture series 2005-2006
Scale
A grid must handle large collaborations with significant amounts of data:
- LHC physics: much data, quite a few users
- Bioinformatics: a reasonable amount of data, very many users
- Biomedicine & pharma: highly confidential data, much computation, quite a few users
- …

The example is again LCG.
ATLAS Tier-1 data flows

[Diagram: nominal data flows between the Tier-0, the Tier-1 CPU farm, disk buffer, disk storage and tape, the other Tier-1s, and the Tier-2s. The per-stream rates from the diagram (the same stream appears on several links, sometimes at different rates):]

stream   file size   rate       files/day   bandwidth   volume
RAW      1.6 GB      0.02 Hz    1.7K        32 MB/s     2.7 TB/day
ESD1     0.5 GB      0.02 Hz    1.7K        10 MB/s     0.8 TB/day
ESD2     0.5 GB      0.02 Hz    1.7K        10 MB/s     0.8 TB/day
AOD2     10 MB       0.2 Hz     17K         2 MB/s      0.16 TB/day
AODm1    500 MB      0.04 Hz    3.4K        20 MB/s     1.6 TB/day
AODm2    500 MB      0.004 Hz   0.34K       2 MB/s      0.16 TB/day
AODm2    500 MB      0.036 Hz   3.1K        18 MB/s     1.44 TB/day
AODm2    500 MB      0.04 Hz    3.4K        20 MB/s     1.6 TB/day

Combined RAW + ESD2 + AODm2 input from the Tier-0: 0.044 Hz, 3.74K files/day, 44 MB/s, 3.66 TB/day.

Plus simulation & analysis data flow.

Real data storage, reprocessing and distribution.
ATLAS data flows (draft). Source: Kors Bos, NIKHEF
Example Grid Resource Centre
NDPF and the Amsterdam Tier-1
Grid Site Logical Layout
NDPF Logical Composition
Physical resources
Service machines (the 'grid tax'), ~10 systems:
  CE, RB, classic SE, SRM/DPM, MON, LFC, BDII, UI, install host

Compute clusters:
- private IP space, for convenience (I'm lazy)
- a mix of systems (in GLUE parlance: subClusters):
  - 66 dual AMD Athlon MP2000+ (home-built)
  - 27 dual Intel Xeon 2.8 GHz (Supermicro)
  - 35 dual Intel Xeon EM64T 3.2 GHz (Dell)
  - ~80 dual dual-core Intel Woodcrest, 700 kSI2k capacity (Dell, Aug 2006)
- in total ~560 cores or 1000 kSI2k capacity

Disk storage: 25 TByte in a DPM-managed pool.

How do we configure this to be an effective grid resource?
NDPF Network Topology
Batch Systems and Schedulers
The batch system keeps the list of nodes and jobs; the scheduler matches jobs to nodes based on policies.
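To make the division of labour concrete, here is a minimal toy sketch in Python (not modelled on any particular batch system; the names and the per-VO cap policy are illustrative) of one scheduling pass matching queued jobs to free nodes:

# Toy scheduler pass: match queued jobs to free nodes under simple policies.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cpus: int
    properties: set          # e.g. {"i686", "atlas-sw"}

@dataclass
class Job:
    jobid: str
    vo: str
    cpus: int
    required: set            # node properties the job needs

def schedule(jobs, nodes, max_running_per_vo):
    """One scheduling iteration: first-fit with a per-VO cap (a 'policy')."""
    running_per_vo = {}
    placements = []
    for job in jobs:                                  # queue order = priority order here
        if running_per_vo.get(job.vo, 0) >= max_running_per_vo.get(job.vo, 999):
            continue                                  # policy: VO over its cap, skip
        for node in nodes:
            if node.free_cpus >= job.cpus and job.required <= node.properties:
                node.free_cpus -= job.cpus            # claim the slots
                running_per_vo[job.vo] = running_per_vo.get(job.vo, 0) + 1
                placements.append((job.jobid, node.name))
                break                                 # first fit wins
    return placements

nodes = [Node("wn-01", 2, {"i686"}), Node("wn-02", 2, {"i686", "atlas-sw"})]
jobs = [Job("1.tbn20", "atlas", 1, {"atlas-sw"}), Job("2.tbn20", "dteam", 1, set())]
print(schedule(jobs, nodes, {"atlas": 100, "dteam": 10}))

Real schedulers such as MAUI layer fair-share, reservations and backfill on top of this basic matching loop.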
SC3 storage network (SARA)
Disk-to-disk: 583 MByte/s (i.e. 4.6 Gbps) across the world.
Graphic: Mark van de Sanden, SARA
Tier-1 Architecture SARA (storage)
Graphic: Mark van de Sanden, SARA
Matching Storage to Computing
Doing the math. A simple job:
- reads a 1 MByte piece of a file (typically one "event")
- computes on it for 30 seconds
- does this for 2000 events per file (i.e. 2 GByte files)
- over 1000 files (1 day of running) this takes ~700 days of CPU time
- and needs 2 TByte in total, i.e. 4 IDE disks of 500 GB

Now run it on the Grid, spread out over 1000 CPUs:
- all jobs start at the same time, each retrieving a 2 GByte input file
- the machine with this 2 TByte disk is on a 100 Mbps link
- effective throughput is 10 MByte/s, thus 10 kByte/s per machine
- it takes 55 hours before the file transfers finish!
- and after that, only 17 hours of calculation
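The arithmetic above, spelled out as a minimal sketch (pure arithmetic, using only the numbers stated on this slide):

# Back-of-the-envelope check of the storage-vs-computing example above.
EVENT_MB, SECONDS_PER_EVENT = 1, 30
EVENTS_PER_FILE, N_FILES = 2000, 1000

file_gb = EVENT_MB * EVENTS_PER_FILE / 1000          # 2 GByte per file
cpu_days = N_FILES * EVENTS_PER_FILE * SECONDS_PER_EVENT / 86400
print(f"single CPU: {cpu_days:.0f} days")            # ~694 days ("700 days")

# Grid case: 1000 workers share one 100 Mbps (~10 MByte/s) file server.
per_worker_kBps = 10_000 / 1000                      # 10 kByte/s each
transfer_hours = file_gb * 1e6 / per_worker_kBps / 3600
compute_hours = EVENTS_PER_FILE * SECONDS_PER_EVENT / 3600
print(f"transfer: {transfer_hours:.0f} h, compute: {compute_hours:.0f} h")
# -> transfer ~56 h, compute ~17 h: the network, not the CPU, dominates.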
Storage
Just for ATLAS (one of the experiments):
- RAW & ESD data flow: ~4 TByte/day (1.4 PB/year) to tape
  - expected to be a permanent "museum" copy
  - largely scheduled access (intelligent staging possible), read & write
  - disk buffers in front of the tape store can be smallish (~10%)
- 'chaotic' access by real users: ~2-4 TByte/day throughput
  - lifetime of the data is finite but long (typically 2+ years)
  - access needed from the worker nodes, i.e. from O(1000) CPUs
  - random "skimming" access pattern
  - needs disk server farms of typically 500 TByte - 1 PByte

Management of disk resources: split the 'file system view' (file metadata) from the object store, as in dCache & dcap, DPNS & DPM, GPFS & ObjectStore, … (a sketch of the idea follows below).
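To illustrate the split, a minimal sketch (illustrative data structures only, not the dCache or DPM internals): the namespace maps a logical path to replica locations, while the object store on each pool node only knows opaque object IDs.

# Sketch: separating the namespace (file metadata) from the object store.
# The namespace answers "what is /grid/atlas/f001 and where does it live?";
# pool nodes store opaque objects and never see logical path names.
namespace = {
    # logical path -> metadata + replica locations (pool node, object id)
    "/grid/atlas/f001": {"size": 2_000_000_000,
                         "replicas": [("pool-03", "0x00af12"),
                                      ("pool-07", "0x01b2c9")]},
}

pools = {  # each pool's local object store: object id -> byte content
    "pool-03": {"0x00af12": b"...event data..."},
    "pool-07": {"0x01b2c9": b"...event data..."},
}

def open_for_read(lfn):
    """Resolve a logical file name to the first available replica's bytes."""
    meta = namespace[lfn]                     # metadata lookup (DPNS-like role)
    for pool, oid in meta["replicas"]:
        store = pools.get(pool)
        if store and oid in store:            # data access (DPM/dcap-like role)
            return store[oid]
    raise IOError(f"no replica of {lfn} available")

print(len(open_for_read("/grid/atlas/f001")), "bytes")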
Grid Resources Amsterdam
SARA:
• 2x 1.2 PByte in 2 tape robots
• 36+1024 CPUs IA32
• disk caches 10 + 50 TByte
• multiple 10 Gbit/s links

NIKHEF (NDPF):
• 560 cores IA32/x86_64
• 25 TByte disk cache
• 10 Gbit/s link to SURFnet
• 2 Gbit/s to SARA

Counting only resources with either GridFTP or Grid job management.
BIG GRID approved January 2006!
Investment of €29M in the next 4 years.
For: LCG, LOFAR, Life Sciences, Medical, DANS, Philips Research, …
See http://www.biggrid.nl/
Configuring systems
Grid is what Murphy had in mind as he formulated his law …
How do you see the Grid?

The broker matches the user's request against the site 'information supermarket': matchmaking (using Condor Matchmaking) uses the information published by the site.

The Grid information system is 'the only information a user ever gets about a site'. So it should be reliable, consistent and complete. A standard schema (GLUE) describes sites, queues and storage (with complex schema semantics). It is currently presented as an LDAP directory.

LDAP Browser by Jarek Gawor: www.mcs.anl.gov/~gawor/ldap
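Any LDAP client can browse the information system directly. A minimal sketch using the python-ldap package (the hostname is a placeholder; port 2170 with base o=grid is the usual BDII convention):

# Query a (hypothetical) BDII for compute-element state, as a broker would.
import ldap

con = ldap.initialize("ldap://bdii.example.org:2170")   # placeholder host
con.simple_bind_s()                                      # BDIIs allow anonymous binds

results = con.search_s(
    "o=grid", ldap.SCOPE_SUBTREE,
    "(objectClass=GlueCE)",                              # all compute elements
    ["GlueCEUniqueID", "GlueCEStateRunningJobs",
     "GlueCEStateTotalJobs", "GlueCEStateEstimatedResponseTime"])

for dn, attrs in results:
    ce = attrs.get("GlueCEUniqueID", [b"?"])[0].decode()
    ert = attrs.get("GlueCEStateEstimatedResponseTime", [b"?"])[0].decode()
    print(ce, ert)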
Glue Attributes Set by the Site
Site information:
  SiteSysAdminContact: mailto: [email protected]
  SiteSecurityContact: mailto: [email protected]

Cluster info:
  GlueSubClusterUniqueID=gridgate.cs.tcd.ie
  HostApplicationSoftwareRunTimeEnvironment: LCG-2_6_0
  HostApplicationSoftwareRunTimeEnvironment: VO-atlas-release-10.0.4
  HostBenchmarkSI00: 1300
  GlueHostNetworkAdapterInboundIP: FALSE
  GlueHostNetworkAdapterOutboundIP: TRUE
  GlueHostOperatingSystemName: RHEL
  GlueHostOperatingSystemRelease: 3.5
  GlueHostOperatingSystemVersion: 3

  GlueCEStateEstimatedResponseTime: 519
  GlueCEStateRunningJobs: 175
  GlueCEStateTotalJobs: 248

Storage: similar info (paths, max number of files, quota, retention, …)
Information system and brokering issues
The size of the information system scales with #sites and #details: already 12 MByte of LDIF, and matching a job takes ~15 sec.

Scheduling policies are infinitely complex: no static schema can likely express this information.

Much information (still) needs to be set up manually … the next slides show the situation as of Feb 3, 2006.

The info system is the single most important grid service.

The current broker tries to make an optimal decision … instead of a 'reasonable' one.
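As an illustration of why 'optimal' is expensive: a broker-style matchmaking pass over published GLUE data must filter every candidate on the job's requirements and then rank them all. A minimal sketch (toy data; the real broker evaluates Condor ClassAd expressions against the full 12 MByte of LDIF):

# Toy broker pass: filter CEs on the job's requirements, then rank them.
ces = [  # toy snapshot of published GLUE attributes
    {"id": "ce1.example.org:2119/jobmanager-pbs-atlas",
     "GlueHostApplicationSoftwareRunTimeEnvironment":
         {"LCG-2_6_0", "VO-atlas-release-10.0.4"},
     "GlueCEStateEstimatedResponseTime": 519},
    {"id": "ce2.example.org:2119/jobmanager-pbs-atlas",
     "GlueHostApplicationSoftwareRunTimeEnvironment": {"LCG-2_6_0"},
     "GlueCEStateEstimatedResponseTime": 14},
]

required_tags = {"VO-atlas-release-10.0.4"}     # the job's Requirements
candidates = [ce for ce in ces
              if required_tags <= ce["GlueHostApplicationSoftwareRunTimeEnvironment"]]

# Rank: a lower estimated response time is better -- which is exactly why a
# bogus published value like "GlueERT: 0" attracts every job (see later).
best = min(candidates, key=lambda ce: ce["GlueCEStateEstimatedResponseTime"])
print("submitting to", best["id"])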
Example: GlueServiceAccessControlRule
For your viewing pleasure: GlueServiceAccessControlRule. 261 distinct values seen.

(one of) the least frequently occurring value(s), 1 instance:
  GlueServiceAccessControlRule: /C=BE/O=BEGRID/OU=VUB/OU=IIHE/CN=Stijn De Weirdt

(one of) the most frequently occurring value(s), 310 instances:
  GlueServiceAccessControlRule: dteam

(one of) the shortest value(s) seen:
  GlueServiceAccessControlRule: d0

(one of) the longest value(s) seen (apparently someone published a directory listing as an access-control rule):
  GlueServiceAccessControlRule: anaconda-ks.cfg configure-firewall install.log install.log.syslog j2sdk-1_4_2_08-linux-i586.rpm lcg-yaim-latest.rpm myproxy-addons myproxy-addons.051021 site-info.def site-info.def.050922 site-info.def.050928 site-info.def.051021 yumit-client-2.0.2-1.noarch.rpm
Example: GlueSEControlProtocolType
For your viewing pleasure: GlueSEControlProtocolType
freq  value
   1  GlueSEControlProtocolType: srm
   1  GlueSEControlProtocolType: srm_v1
   1  GlueSEControlProtocolType: srmv1
   3  GlueSEControlProtocolType: SRM
   7  GlueSEControlProtocolType: classic

… which means that of ~410 Storage Elements, only 13 publish interaction info. Ouch!
Example: GlueHostOperatingSystemRelease
Today's attribute: GlueHostOperatingSystemRelease

freq  value
   1  GlueHostOperatingSystemRelease: 3.02
   1  GlueHostOperatingSystemRelease: 3.03
   1  GlueHostOperatingSystemRelease: 3.2
   1  GlueHostOperatingSystemRelease: 3.5
   1  GlueHostOperatingSystemRelease: 303
   1  GlueHostOperatingSystemRelease: 304
   1  GlueHostOperatingSystemRelease: 3_0_4
   1  GlueHostOperatingSystemRelease: SL
   1  GlueHostOperatingSystemRelease: Sarge
   1  GlueHostOperatingSystemRelease: sl3
   2  GlueHostOperatingSystemRelease: 3.0
   2  GlueHostOperatingSystemRelease: 305
   4  GlueHostOperatingSystemRelease: 3.05
   4  GlueHostOperatingSystemRelease: SLC3
   5  GlueHostOperatingSystemRelease: 3.04
   5  GlueHostOperatingSystemRelease: SL3
  18  GlueHostOperatingSystemRelease: 3.0.3
  19  GlueHostOperatingSystemRelease: 7.3
  24  GlueHostOperatingSystemRelease: 3
  37  GlueHostOperatingSystemRelease: 3.0.5
  47  GlueHostOperatingSystemRelease: 3.0.4
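Frequency tables like these are straightforward to produce yourself. A minimal sketch that tallies the values of one attribute from an LDIF dump of the information system ("dump.ldif" is a placeholder filename):

# Count the distinct values of one GLUE attribute in an LDIF dump,
# producing "freq value" tables like the ones on these slides.
from collections import Counter

attribute = "GlueHostOperatingSystemRelease"
counts = Counter()

with open("dump.ldif") as ldif:
    for line in ldif:
        if line.startswith(attribute + ":"):
            value = line.split(":", 1)[1].strip()
            counts[value] += 1

for value, freq in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{freq:4d}  {attribute}: {value}")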
Example: GlueSAPolicyMaxNumFiles
136 separate Glue attributes seen in total.

For your viewing pleasure: GlueSAPolicyMaxNumFiles

freq  value
   6  GlueSAPolicyMaxNumFiles: 99999999999999
  26  GlueSAPolicyMaxNumFiles: 999999
  52  GlueSAPolicyMaxNumFiles: 0
  78  GlueSAPolicyMaxNumFiles: 00
1381  GlueSAPolicyMaxNumFiles: 10

For your viewing pleasure: GlueServiceStatusInfo

freq  value
   2  GlueServiceStatusInfo: No Known Problems.
  55  GlueServiceStatusInfo: No problems
 206  GlueServiceStatusInfo: No Problems
LCG’s Most Popular Resource Centre
Example: SiteLatitude
Today's attribute: GlueSiteLatitude

freq  value
   1  GlueSiteLatitude: 1.376059
   1  GlueSiteLatitude: 33.063924198120645
   1  GlueSiteLatitude: 37.0
   1  GlueSiteLatitude: 38.739925290125484
   1  GlueSiteLatitude: 39.21
   …
   1  GlueSiteLatitude: 45.4567
   1  GlueSiteLatitude: 55.9214118
   1  GlueSiteLatitude: 56.44
   1  GlueSiteLatitude: 59.56
   1  GlueSiteLatitude: 67
   1  GlueSiteLatitude: GlueSiteWeb: http://rsgrid3.its.uiowa.edu
   2  GlueSiteLatitude: 40.8527
   2  GlueSiteLatitude: 48.7
   2  GlueSiteLatitude: 49.16
   2  GlueSiteLatitude: 50
   3  GlueSiteLatitude: 41.7827
   3  GlueSiteLatitude: 46.12
   8  GlueSiteLatitude: 0.0

(eight sites publish latitude 0.0: presumably the 'most popular resource centre' of the slide title)
Operational Monitoring
Detecting faults and errors: experiences in the NDPF
User directory and automount maps
A large number of alternatives exist (nsswitch.conf / pam.d):
- files-based (/etc/passwd, /etc/auto.home, …)
- YP/NIS, NIS+
- database (MySQL/Oracle)
- LDAP

We went with LDAP:
- information is in a central location (like NIS)
- can scale by adding slave servers (like NIS)
- is secured by LDAP over TLS (unlike NIS)
- can be managed by external programs (also unlike NIS)

(in due course we will do real-time grid credential mapping to uids)

But you will need nscd, or a large number of slave servers. An example configuration follows below.
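For concreteness, a minimal sketch of the relevant glibc name-service configuration on a worker node (an illustrative /etc/nsswitch.conf fragment, assuming the nss_ldap module is installed and /etc/ldap.conf points at the directory server):

# /etc/nsswitch.conf (fragment) -- resolve users, groups and automount
# maps from local files first, then from the central LDAP directory.
passwd:     files ldap
shadow:     files ldap
group:      files ldap
automount:  files ldap

With this in place, nscd caches the LDAP lookups, so a burst of job starts does not hammer the directory server.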
Logging and Auditing
Auditing and logging:
- syslog (also for the grid gatekeeper, gsiftp, credential mapping)
- process accounting (psacct)

For the paranoid, use the tools included for CAPP/EAL3+: LAuS system call auditing. Highly detailed:
- useful both for debugging and incident response
- default auditing is critical: the system will halt on audit errors

If your worker nodes are on private IP space, you need to preserve a log of the NAT box as well.
Grid Cluster Logging
Grid statistics and accounting:
- rrdtool views from the batch system: load per VO
  - combine qstat and pbsnodes output via a script, cron and RRD (a sketch follows below)
- cricket network traffic grapher
- extract PBS accounting data into a dedicated database
  - grid users get a 'generic' uid from a dynamic pool; this needs to be linked in the database to the grid DN and VO
- from the accounting db, upload anonymized records to APEL
  - APEL is the grid accounting system for VOs and funding agencies
  - the accounting db is also useful to charge costs to projects locally
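A minimal sketch of such a cron script (illustrative only: the RRD filename is an assumption, and a real script would break the counts down per VO from `qstat -f` or the accounting logs):

# Cron-driven sketch: sample the batch system and feed counts into an RRD.
# Assumes a round-robin database created beforehand, e.g. with:
#   rrdtool create farm.rrd --step 300 DS:running:GAUGE:600:0:U \
#           DS:queued:GAUGE:600:0:U RRA:AVERAGE:0.5:1:8640
import subprocess

def pbs_job_states():
    """Count running/queued jobs from `qstat` output (default column layout)."""
    out = subprocess.run(["qstat"], capture_output=True, text=True).stdout
    running = queued = 0
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[4] == "R":
            running += 1
        elif len(fields) >= 5 and fields[4] == "Q":
            queued += 1
    return running, queued

running, queued = pbs_job_states()
# "N" means now; the DS order must match the rrdtool create statement above.
subprocess.run(["rrdtool", "update", "farm.rrd", f"N:{running}:{queued}"])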
NDPF Occupancy
Usage of the NIKHEF NDPF Compute farm
Average occupancy in 2005: ~ 78%
each colour represents a grid VO, black line is #CPUs available
But at times, in more detail
Auditing incident: a disk with less than 15% free makes the syscall-audit system panic; new processes cannot write audit entries, which is fatal, so they wait, and wait, and … a head node has the most activity and fails first!

An unresponsive node causes the scheduler (MAUI) to wait for 15 minutes, then give up and start scheduling again, hitting the rotten node, and …

The PBS server tries desperately to contact a dead node whose CPU has turned into Norit … and is unable to serve any more requests.
Black Holes
A mis-configured worker node accepts jobs that all die within seconds. Before long, the entire job population will be sucked into this black hole …
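Black holes are easy to spot after the fact in the accounting records: one node completing an implausible number of very short jobs. A minimal sketch (the thresholds and record format are illustrative, not from any specific monitoring tool):

# Flag potential "black hole" nodes: many completed jobs, implausibly
# short average walltime. Record format and thresholds are illustrative.
from collections import defaultdict

def find_black_holes(job_records, max_avg_walltime=60.0, min_jobs=20):
    """job_records: iterable of (node, walltime_seconds) for finished jobs."""
    per_node = defaultdict(list)
    for node, walltime in job_records:
        per_node[node].append(walltime)
    return [node for node, times in per_node.items()
            if len(times) >= min_jobs
            and sum(times) / len(times) < max_avg_walltime]

# Toy data: wn-13 "finishes" hundreds of jobs in seconds -> suspicious.
records = [("wn-13", 4.0)] * 300 + [("wn-02", 5400.0)] * 40
print(find_black_holes(records))        # ['wn-13']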
Clusters: what did we see?

The Grid (and your cluster) are error amplifiers:
- "black holes" may eat your jobs piecemeal
- dangerous "default" values can spoil the day ("GlueERT: 0")
- Monitor! (and allow for (some) failures, and design for rapid recovery)

Users don't have a clue about your system beforehand (that's the downside of those 'autonomous organizations'). If you want users to have a clue, you must publish your clues correctly: the information system is all they can see.

Grid middleware may effectively do a DoS on your system, e.g. doing qstat for every job every minute to feed the logging & bookkeeping …

Power consumption is the greatest single limitation on CPU density.

And finally: keep your machine room tidy, and label everything … or your colleague will not be able to find that #$%^$*! machine in the middle of the night …
Grid-wide monitoring
Success Rate
What’s the chance the whole grid is working correctly?
If a single site has 98.5% reliability (i.e. it is down ~5 days/year), then with 200 sites the chance that the whole grid is working correctly is 0.985^200, only about 4-5%. And the 98.5% is quite optimistic to begin with …
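The arithmetic, spelled out (assuming independent site failures; the reliability figures are the slide's own):

# Chance that ALL sites are up at once, assuming independent failures.
site_reliability = 0.985              # i.e. down ~5.5 days/year
n_sites = 200
print(site_reliability ** n_sites)    # ~0.049: a ~5% chance the whole
                                      # grid is working at any moment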
So build the grid, both middleware and user jobs, for failure:
- monitor sites with both system and functional tests
- dynamically exclude sites with a current malfunction
Monitoring Tools
1. GIIS Monitor
2. GIIS Monitor graphs
3. Site Functional Tests
4. GOC Data Base
5. Scheduled Downtimes
6. Live Job Monitor
7. GridIce – VO view
8. GridIce – fabric view
9. Certificate Lifetime Monitor
Source: Ian Bird, SA1 Operations Status, EGEE-4 Conference, Pisa, November 2005
Freedom of Choice
Tool for VOs to make a site selection based on a set of standard tests
Success Rate: WISDOM
Average success rate for jobs: 70-80% (single submit).

[Figure: number of jobs per day in August (registered, success, aborted and cancelled as final status; left axis 0-5000 jobs) against the success rate (right axis 0-1), where success rate = success / (registered - cancelled).]

Source: N. Jacq, LPC and IN2P3/CNRS, "Biomedical DC Preliminary Report, WISDOM Application", 5 Sept 2005
Failure reasons vary. Biomed data challenge, abort reasons distribution (10/07/2005 - 27/08/2005):

  63%  mismatching resources
  28%  wrong configuration
   4%  network/connection failures
   4%  proxy problems
   1%  JDL problems

i.e. either a failing middleware component, or a wrong request in the job JDL.

[A similar chart: abort reasons distribution for all VOs, 01/2005 - 06/2005.]

Source: N. Jacq, LPC and IN2P3/CNRS, "Biomedical DC Preliminary Report, WISDOM Application", 5 Sept 2005
Is the Grid middleware current?
Common causes of failure:
- specified an impossible combination of resources
- wrong middleware version at the site
- not enough space in the proper place ($TMPDIR)
- environment configuration ($VO_<vo>_SW_DIR, $LFC_HOST, …)

A defensive sanity-check sketch follows the figure note below.
[Figure: "Sites with release" per week, 12/02/2005 - 25/06/2005 (0-140 sites), for the LCG-2_4_0, LCG-2_3_1 and LCG-2_3_0 middleware releases.]
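Since several of these failure causes are visible from inside the job, a defensive job wrapper can check for them before starting the payload. A minimal sketch (the variable names follow the LCG conventions above with atlas as an example VO; the 1 GB threshold is an arbitrary illustration):

# Defensive pre-flight checks inside a grid job, catching the common
# failure causes listed above before the real payload starts.
import os
import sys

def preflight(min_tmp_gb=1.0):            # threshold is illustrative
    problems = []

    # Environment configuration: variables the job relies on.
    for var in ("LFC_HOST", "VO_ATLAS_SW_DIR"):    # example VO: atlas
        if var not in os.environ:
            problems.append(f"missing ${var}")

    # Enough space in the proper place ($TMPDIR)?
    tmpdir = os.environ.get("TMPDIR", "/tmp")
    st = os.statvfs(tmpdir)
    free_gb = st.f_bavail * st.f_frsize / 1e9
    if free_gb < min_tmp_gb:
        problems.append(f"only {free_gb:.1f} GB free in {tmpdir}")

    return problems

if __name__ == "__main__":
    issues = preflight()
    if issues:
        # Fail fast and loudly: much cheaper to diagnose than a job
        # that dies halfway through its 17 hours of computation.
        sys.exit("preflight failed: " + "; ".join(issues))
    print("preflight OK")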
Assorted issues at the fabric layer
Does it work? How can we make it better?
Going from here
Many nice things to do:
- Most of LCG provides a single OS (RHEL3), but users may need SLES, Debian, Gentoo, … or specific libraries. Virtualisation (Xen, VMware)?
- Scheduling user jobs: both the VO and the site want to set part of the priorities …
- Auditing and user tracing in this highly dynamic system: can we know for sure who is running what, where? Or whether a user is DDoS-ing the White House right now? Out of 221 sites, we know for certain there is a compromise somewhere!
More things to do …
Sparse file access: accessing data efficiently over the wide area.

Can we do something useful with the large disks in all the worker nodes? (our 240 CPUs share ~8 TByte of unused disk space!)

There are new grid software releases every month, and the configuration comes from different sources … how can we combine and validate all these configurations quickly and easily?
Job submission live monitor
Source: Gidon Moont, Imperial College, London, HEP and e-Science Centre
Outlook
Towards a global persistent grid infrastructure: interoperability and persistency that are project-independent.
- Europe: EGEE-2, a 'European Grid Organisation'
- US: Open Science Grid
- Asia-Pacific: APGrid & PRAGMA, NAREGI, APAC, K*Grid, …
- GIN aim: cross-submission and file access by end 2006

Extension to industry: first industrial engineering and financial scenario simulations.

New 'middleware': we are just starting to learn how it should work. Extend more into the sharing of structured data.