Building the Grid
Grid Middleware 8
David Groep, lecture series 2005-2006
Scale
A grid must handle large collaborations with significant amounts of data:
- LHC physics: much data, quite a few users
- Bioinformatics: a reasonable amount of data, very many users
- Biomedicine & pharma: highly confidential data, much computation, quite a few users
- …

The example is again LCG.
ATLAS Tier-1 data flows

[Diagram: nominal data flows between the Tier-0, the Tier-1 CPU farm, disk buffer, disk storage and tape, the other Tier-1s, and the Tier-2s. The per-stream rates from the diagram (the same stream appears on several links, sometimes at different rates):]

stream   file size   rate       files/day   bandwidth   volume
RAW      1.6 GB      0.02 Hz    1.7K        32 MB/s     2.7 TB/day
ESD1     0.5 GB      0.02 Hz    1.7K        10 MB/s     0.8 TB/day
ESD2     0.5 GB      0.02 Hz    1.7K        10 MB/s     0.8 TB/day
AOD2     10 MB       0.2 Hz     17K         2 MB/s      0.16 TB/day
AODm1    500 MB      0.04 Hz    3.4K        20 MB/s     1.6 TB/day
AODm2    500 MB      0.004 Hz   0.34K       2 MB/s      0.16 TB/day
AODm2    500 MB      0.036 Hz   3.1K        18 MB/s     1.44 TB/day
AODm2    500 MB      0.04 Hz    3.4K        20 MB/s     1.6 TB/day

Combined RAW + ESD2 + AODm2 input from the Tier-0: 0.044 Hz, 3.74K files/day, 44 MB/s, 3.66 TB/day.

Plus simulation & analysis data flow.

Real data storage, reprocessing and distribution.
ATLAS data flows (draft). Source: Kors Bos, NIKHEF
Example Grid Resource Centre
NDPF and the Amsterdam Tier-1
Grid Site Logical Layout
NDPF Logical Composition
Physical resources
Service machines (the 'grid tax'), ~10 systems:
  CE, RB, classic SE, SRM/DPM, MON, LFC, BDII, UI, install host

Compute clusters:
- private IP space, for convenience (I'm lazy)
- a mix of systems (in GLUE parlance: subClusters):
  - 66 dual AMD Athlon MP2000+ (home-built)
  - 27 dual Intel Xeon 2.8 GHz (Supermicro)
  - 35 dual Intel Xeon EM64T 3.2 GHz (Dell)
  - ~80 dual dual-core Intel Woodcrest, 700 kSI2k capacity (Dell, Aug 2006)
- in total ~560 cores or 1000 kSI2k capacity

Disk storage: 25 TByte in a DPM-managed pool.

How do we configure this to be an effective grid resource?
NDPF Network Topology
Batch Systems and Schedulers
The batch system keeps the list of nodes and jobs; the scheduler matches jobs to nodes based on policies.
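To make the division of labour concrete, here is a minimal toy sketch in Python (not modelled on any particular batch system; the names and the per-VO cap policy are illustrative) of one scheduling pass matching queued jobs to free nodes:

# Toy scheduler pass: match queued jobs to free nodes under simple policies.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cpus: int
    properties: set          # e.g. {"i686", "atlas-sw"}

@dataclass
class Job:
    jobid: str
    vo: str
    cpus: int
    required: set            # node properties the job needs

def schedule(jobs, nodes, max_running_per_vo):
    """One scheduling iteration: first-fit with a per-VO cap (a 'policy')."""
    running_per_vo = {}
    placements = []
    for job in jobs:                                  # queue order = priority order here
        if running_per_vo.get(job.vo, 0) >= max_running_per_vo.get(job.vo, 999):
            continue                                  # policy: VO over its cap, skip
        for node in nodes:
            if node.free_cpus >= job.cpus and job.required <= node.properties:
                node.free_cpus -= job.cpus            # claim the slots
                running_per_vo[job.vo] = running_per_vo.get(job.vo, 0) + 1
                placements.append((job.jobid, node.name))
                break                                 # first fit wins
    return placements

nodes = [Node("wn-01", 2, {"i686"}), Node("wn-02", 2, {"i686", "atlas-sw"})]
jobs = [Job("1.tbn20", "atlas", 1, {"atlas-sw"}), Job("2.tbn20", "dteam", 1, set())]
print(schedule(jobs, nodes, {"atlas": 100, "dteam": 10}))

Real schedulers such as MAUI layer fair-share, reservations and backfill on top of this basic matching loop.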
SC3 storage network (SARA)
Disk-to-disk: 583 MByte/s (i.e. 4.6 Gbps) across the world.
Graphic: Mark van de Sanden, SARA
Tier-1 Architecture SARA (storage)
Graphic: Mark van de Sanden, SARA
Matching Storage to Computing
Doing the math. A simple job:
- reads a 1 MByte piece of a file (typically one "event")
- computes on it for 30 seconds
- does this for 2000 events per file (i.e. 2 GByte files)
- over 1000 files (1 day of running) this takes ~700 days of CPU time
- and needs 2 TByte in total, i.e. 4 IDE disks of 500 GB

Now run it on the Grid, spread out over 1000 CPUs:
- all jobs start at the same time, each retrieving a 2 GByte input file
- the machine with this 2 TByte disk is on a 100 Mbps link
- effective throughput is 10 MByte/s, thus 10 kByte/s per machine
- it takes 55 hours before the file transfers finish!
- and after that, only 17 hours of calculation
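The arithmetic above, spelled out as a minimal sketch (pure arithmetic, using only the numbers stated on this slide):

# Back-of-the-envelope check of the storage-vs-computing example above.
EVENT_MB, SECONDS_PER_EVENT = 1, 30
EVENTS_PER_FILE, N_FILES = 2000, 1000

file_gb = EVENT_MB * EVENTS_PER_FILE / 1000          # 2 GByte per file
cpu_days = N_FILES * EVENTS_PER_FILE * SECONDS_PER_EVENT / 86400
print(f"single CPU: {cpu_days:.0f} days")            # ~694 days ("700 days")

# Grid case: 1000 workers share one 100 Mbps (~10 MByte/s) file server.
per_worker_kBps = 10_000 / 1000                      # 10 kByte/s each
transfer_hours = file_gb * 1e6 / per_worker_kBps / 3600
compute_hours = EVENTS_PER_FILE * SECONDS_PER_EVENT / 3600
print(f"transfer: {transfer_hours:.0f} h, compute: {compute_hours:.0f} h")
# -> transfer ~56 h, compute ~17 h: the network, not the CPU, dominates.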
Storage
Just for ATLAS (one of the experiments):
- RAW & ESD data flow: ~4 TByte/day (1.4 PB/year) to tape
  - expected to be a permanent "museum" copy
  - largely scheduled access (intelligent staging possible), read & write
  - disk buffers in front of the tape store can be smallish (~10%)
- 'chaotic' access by real users: ~2-4 TByte/day throughput
  - lifetime of the data is finite but long (typically 2+ years)
  - access needed from the worker nodes, i.e. from O(1000) CPUs
  - random "skimming" access pattern
  - needs disk server farms of typically 500 TByte - 1 PByte

Management of disk resources: split the 'file system view' (file metadata) from the object store, as in dCache & dcap, DPNS & DPM, GPFS & ObjectStore, … (a sketch of the idea follows below).
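To illustrate the split, a minimal sketch (illustrative data structures only, not the dCache or DPM internals): the namespace maps a logical path to replica locations, while the object store on each pool node only knows opaque object IDs.

# Sketch: separating the namespace (file metadata) from the object store.
# The namespace answers "what is /grid/atlas/f001 and where does it live?";
# pool nodes store opaque objects and never see logical path names.
namespace = {
    # logical path -> metadata + replica locations (pool node, object id)
    "/grid/atlas/f001": {"size": 2_000_000_000,
                         "replicas": [("pool-03", "0x00af12"),
                                      ("pool-07", "0x01b2c9")]},
}

pools = {  # each pool's local object store: object id -> byte content
    "pool-03": {"0x00af12": b"...event data..."},
    "pool-07": {"0x01b2c9": b"...event data..."},
}

def open_for_read(lfn):
    """Resolve a logical file name to the first available replica's bytes."""
    meta = namespace[lfn]                     # metadata lookup (DPNS-like role)
    for pool, oid in meta["replicas"]:
        store = pools.get(pool)
        if store and oid in store:            # data access (DPM/dcap-like role)
            return store[oid]
    raise IOError(f"no replica of {lfn} available")

print(len(open_for_read("/grid/atlas/f001")), "bytes")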
Grid Resources Amsterdam
SARA:
• 2x 1.2 PByte in 2 tape robots
• 36+1024 CPUs IA32
• disk caches 10 + 50 TByte
• multiple 10 Gbit/s links

NIKHEF (NDPF):
• 560 cores IA32/x86_64
• 25 TByte disk cache
• 10 Gbit/s link to SURFnet
• 2 Gbit/s to SARA

Counting only resources with either GridFTP or Grid job management.
BIG GRID approved January 2006!
Investment of €29M in the next 4 years.
For: LCG, LOFAR, Life Sciences, Medical, DANS, Philips Research, …
See http://www.biggrid.nl/
Configuring systems
Grid is what Murphy had in mind as he formulated his law …
How do you see the Grid?

The broker matches the user's request against the site 'information supermarket': matchmaking (using Condor Matchmaking) uses the information published by the site.

The Grid information system is 'the only information a user ever gets about a site'. So it should be reliable, consistent and complete. A standard schema (GLUE) describes sites, queues and storage (with complex schema semantics). It is currently presented as an LDAP directory.

LDAP Browser by Jarek Gawor: www.mcs.anl.gov/~gawor/ldap
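Any LDAP client can browse the information system directly. A minimal sketch using the python-ldap package (the hostname is a placeholder; port 2170 with base o=grid is the usual BDII convention):

# Query a (hypothetical) BDII for compute-element state, as a broker would.
import ldap

con = ldap.initialize("ldap://bdii.example.org:2170")   # placeholder host
con.simple_bind_s()                                      # BDIIs allow anonymous binds

results = con.search_s(
    "o=grid", ldap.SCOPE_SUBTREE,
    "(objectClass=GlueCE)",                              # all compute elements
    ["GlueCEUniqueID", "GlueCEStateRunningJobs",
     "GlueCEStateTotalJobs", "GlueCEStateEstimatedResponseTime"])

for dn, attrs in results:
    ce = attrs.get("GlueCEUniqueID", [b"?"])[0].decode()
    ert = attrs.get("GlueCEStateEstimatedResponseTime", [b"?"])[0].decode()
    print(ce, ert)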
Glue Attributes Set by the Site
Site information:
  SiteSysAdminContact: mailto: [email protected]
  SiteSecurityContact: mailto: [email protected]

Cluster info:
  GlueSubClusterUniqueID=gridgate.cs.tcd.ie
  HostApplicationSoftwareRunTimeEnvironment: LCG-2_6_0
  HostApplicationSoftwareRunTimeEnvironment: VO-atlas-release-10.0.4
  HostBenchmarkSI00: 1300
  GlueHostNetworkAdapterInboundIP: FALSE
  GlueHostNetworkAdapterOutboundIP: TRUE
  GlueHostOperatingSystemName: RHEL
  GlueHostOperatingSystemRelease: 3.5
  GlueHostOperatingSystemVersion: 3

  GlueCEStateEstimatedResponseTime: 519
  GlueCEStateRunningJobs: 175
  GlueCEStateTotalJobs: 248

Storage: similar info (paths, max number of files, quota, retention, …)
Information system and brokering issues
The size of the information system scales with #sites and #details: already 12 MByte of LDIF, and matching a job takes ~15 sec.

Scheduling policies are infinitely complex: no static schema can likely express this information.

Much information (still) needs to be set up manually … the next slides show the situation as of Feb 3, 2006.

The info system is the single most important grid service.

The current broker tries to make an optimal decision … instead of a 'reasonable' one.
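As an illustration of why 'optimal' is expensive: a broker-style matchmaking pass over published GLUE data must filter every candidate on the job's requirements and then rank them all. A minimal sketch (toy data; the real broker evaluates Condor ClassAd expressions against the full 12 MByte of LDIF):

# Toy broker pass: filter CEs on the job's requirements, then rank them.
ces = [  # toy snapshot of published GLUE attributes
    {"id": "ce1.example.org:2119/jobmanager-pbs-atlas",
     "GlueHostApplicationSoftwareRunTimeEnvironment":
         {"LCG-2_6_0", "VO-atlas-release-10.0.4"},
     "GlueCEStateEstimatedResponseTime": 519},
    {"id": "ce2.example.org:2119/jobmanager-pbs-atlas",
     "GlueHostApplicationSoftwareRunTimeEnvironment": {"LCG-2_6_0"},
     "GlueCEStateEstimatedResponseTime": 14},
]

required_tags = {"VO-atlas-release-10.0.4"}     # the job's Requirements
candidates = [ce for ce in ces
              if required_tags <= ce["GlueHostApplicationSoftwareRunTimeEnvironment"]]

# Rank: a lower estimated response time is better -- which is exactly why a
# bogus published value like "GlueERT: 0" attracts every job (see later).
best = min(candidates, key=lambda ce: ce["GlueCEStateEstimatedResponseTime"])
print("submitting to", best["id"])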
Example: GlueServiceAccessControlRule
For your viewing pleasure: GlueServiceAccessControlRule. 261 distinct values seen.

(one of) the least frequently occurring value(s), 1 instance:
  GlueServiceAccessControlRule: /C=BE/O=BEGRID/OU=VUB/OU=IIHE/CN=Stijn De Weirdt

(one of) the most frequently occurring value(s), 310 instances:
  GlueServiceAccessControlRule: dteam

(one of) the shortest value(s) seen:
  GlueServiceAccessControlRule: d0

(one of) the longest value(s) seen (apparently someone published a directory listing as an access-control rule):
  GlueServiceAccessControlRule: anaconda-ks.cfg configure-firewall install.log install.log.syslog j2sdk-1_4_2_08-linux-i586.rpm lcg-yaim-latest.rpm myproxy-addons myproxy-addons.051021 site-info.def site-info.def.050922 site-info.def.050928 site-info.def.051021 yumit-client-2.0.2-1.noarch.rpm
Example: GlueSEControlProtocolType
For your viewing pleasure: GlueSEControlProtocolType
freq  value
   1  GlueSEControlProtocolType: srm
   1  GlueSEControlProtocolType: srm_v1
   1  GlueSEControlProtocolType: srmv1
   3  GlueSEControlProtocolType: SRM
   7  GlueSEControlProtocolType: classic

… which means that of ~410 Storage Elements, only 13 publish interaction info. Ouch!
Example: GlueHostOperatingSystemRelease
Today's attribute: GlueHostOperatingSystemRelease

freq  value
   1  GlueHostOperatingSystemRelease: 3.02
   1  GlueHostOperatingSystemRelease: 3.03
   1  GlueHostOperatingSystemRelease: 3.2
   1  GlueHostOperatingSystemRelease: 3.5
   1  GlueHostOperatingSystemRelease: 303
   1  GlueHostOperatingSystemRelease: 304
   1  GlueHostOperatingSystemRelease: 3_0_4
   1  GlueHostOperatingSystemRelease: SL
   1  GlueHostOperatingSystemRelease: Sarge
   1  GlueHostOperatingSystemRelease: sl3
   2  GlueHostOperatingSystemRelease: 3.0
   2  GlueHostOperatingSystemRelease: 305
   4  GlueHostOperatingSystemRelease: 3.05
   4  GlueHostOperatingSystemRelease: SLC3
   5  GlueHostOperatingSystemRelease: 3.04
   5  GlueHostOperatingSystemRelease: SL3
  18  GlueHostOperatingSystemRelease: 3.0.3
  19  GlueHostOperatingSystemRelease: 7.3
  24  GlueHostOperatingSystemRelease: 3
  37  GlueHostOperatingSystemRelease: 3.0.5
  47  GlueHostOperatingSystemRelease: 3.0.4
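Frequency tables like these are straightforward to produce yourself. A minimal sketch that tallies the values of one attribute from an LDIF dump of the information system ("dump.ldif" is a placeholder filename):

# Count the distinct values of one GLUE attribute in an LDIF dump,
# producing "freq value" tables like the ones on these slides.
from collections import Counter

attribute = "GlueHostOperatingSystemRelease"
counts = Counter()

with open("dump.ldif") as ldif:
    for line in ldif:
        if line.startswith(attribute + ":"):
            value = line.split(":", 1)[1].strip()
            counts[value] += 1

for value, freq in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{freq:4d}  {attribute}: {value}")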
Example: GlueSAPolicyMaxNumFiles
136 separate Glue attributes seen in total.

For your viewing pleasure: GlueSAPolicyMaxNumFiles

freq  value
   6  GlueSAPolicyMaxNumFiles: 99999999999999
  26  GlueSAPolicyMaxNumFiles: 999999
  52  GlueSAPolicyMaxNumFiles: 0
  78  GlueSAPolicyMaxNumFiles: 00
1381  GlueSAPolicyMaxNumFiles: 10

For your viewing pleasure: GlueServiceStatusInfo

freq  value
   2  GlueServiceStatusInfo: No Known Problems.
  55  GlueServiceStatusInfo: No problems
 206  GlueServiceStatusInfo: No Problems
LCG’s Most Popular Resource Centre
Example: SiteLatitude
Today's attribute: GlueSiteLatitude

freq  value
   1  GlueSiteLatitude: 1.376059
   1  GlueSiteLatitude: 33.063924198120645
   1  GlueSiteLatitude: 37.0
   1  GlueSiteLatitude: 38.739925290125484
   1  GlueSiteLatitude: 39.21
   …
   1  GlueSiteLatitude: 45.4567
   1  GlueSiteLatitude: 55.9214118
   1  GlueSiteLatitude: 56.44
   1  GlueSiteLatitude: 59.56
   1  GlueSiteLatitude: 67
   1  GlueSiteLatitude: GlueSiteWeb: http://rsgrid3.its.uiowa.edu
   2  GlueSiteLatitude: 40.8527
   2  GlueSiteLatitude: 48.7
   2  GlueSiteLatitude: 49.16
   2  GlueSiteLatitude: 50
   3  GlueSiteLatitude: 41.7827
   3  GlueSiteLatitude: 46.12
   8  GlueSiteLatitude: 0.0

(eight sites publish latitude 0.0: presumably the 'most popular resource centre' of the slide title)
Operational Monitoring
Detecting faults and errors: experiences in the NDPF
User directory and automount maps
A large number of alternatives exist (nsswitch.conf / pam.d):
- files-based (/etc/passwd, /etc/auto.home, …)
- YP/NIS, NIS+
- database (MySQL/Oracle)
- LDAP

We went with LDAP:
- information is in a central location (like NIS)
- can scale by adding slave servers (like NIS)
- is secured by LDAP over TLS (unlike NIS)
- can be managed by external programs (also unlike NIS)

(in due course we will do real-time grid credential mapping to uids)

But you will need nscd, or a large number of slave servers. An example configuration follows below.
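For concreteness, a minimal sketch of the relevant glibc name-service configuration on a worker node (an illustrative /etc/nsswitch.conf fragment, assuming the nss_ldap module is installed and /etc/ldap.conf points at the directory server):

# /etc/nsswitch.conf (fragment) -- resolve users, groups and automount
# maps from local files first, then from the central LDAP directory.
passwd:     files ldap
shadow:     files ldap
group:      files ldap
automount:  files ldap

With this in place, nscd caches the LDAP lookups, so a burst of job starts does not hammer the directory server.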
Logging and Auditing
Auditing and logging:
- syslog (also for the grid gatekeeper, gsiftp, credential mapping)
- process accounting (psacct)

For the paranoid, use the tools included for CAPP/EAL3+: LAuS system call auditing. Highly detailed:
- useful both for debugging and incident response
- default auditing is critical: the system will halt on audit errors

If your worker nodes are on private IP space, you need to preserve a log of the NAT box as well.
Grid Cluster Logging
Grid statistics and accounting:
- rrdtool views from the batch system: load per VO
  - combine qstat and pbsnodes output via a script, cron and RRD (a sketch follows below)
- cricket network traffic grapher
- extract PBS accounting data into a dedicated database
  - grid users get a 'generic' uid from a dynamic pool; this needs to be linked in the database to the grid DN and VO
- from the accounting db, upload anonymized records to APEL
  - APEL is the grid accounting system for VOs and funding agencies
  - the accounting db is also useful to charge costs to projects locally
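A minimal sketch of such a cron script (illustrative only: the RRD filename is an assumption, and a real script would break the counts down per VO from `qstat -f` or the accounting logs):

# Cron-driven sketch: sample the batch system and feed counts into an RRD.
# Assumes a round-robin database created beforehand, e.g. with:
#   rrdtool create farm.rrd --step 300 DS:running:GAUGE:600:0:U \
#           DS:queued:GAUGE:600:0:U RRA:AVERAGE:0.5:1:8640
import subprocess

def pbs_job_states():
    """Count running/queued jobs from `qstat` output (default column layout)."""
    out = subprocess.run(["qstat"], capture_output=True, text=True).stdout
    running = queued = 0
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[4] == "R":
            running += 1
        elif len(fields) >= 5 and fields[4] == "Q":
            queued += 1
    return running, queued

running, queued = pbs_job_states()
# "N" means now; the DS order must match the rrdtool create statement above.
subprocess.run(["rrdtool", "update", "farm.rrd", f"N:{running}:{queued}"])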
NDPF Occupancy
Usage of the NIKHEF NDPF Compute farm
Average occupancy in 2005: ~ 78%
each colour represents a grid VO, black line is #CPUs available
But at times, in more detail
Auditing incident: a disk with less than 15% free makes the syscall-audit system panic; new processes cannot write audit entries, which is fatal, so they wait, and wait, and … a head node has the most activity and fails first!

An unresponsive node causes the scheduler (MAUI) to wait for 15 minutes, then give up and start scheduling again, hitting the rotten node, and …

The PBS server tries desperately to contact a dead node whose CPU has turned into Norit … and is unable to serve any more requests.
Black Holes
A mis-configured worker node accepts jobs that all die within seconds. Before long, the entire job population will be sucked into this black hole …
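Black holes are easy to spot after the fact in the accounting records: one node completing an implausible number of very short jobs. A minimal sketch (the thresholds and record format are illustrative, not from any specific monitoring tool):

# Flag potential "black hole" nodes: many completed jobs, implausibly
# short average walltime. Record format and thresholds are illustrative.
from collections import defaultdict

def find_black_holes(job_records, max_avg_walltime=60.0, min_jobs=20):
    """job_records: iterable of (node, walltime_seconds) for finished jobs."""
    per_node = defaultdict(list)
    for node, walltime in job_records:
        per_node[node].append(walltime)
    return [node for node, times in per_node.items()
            if len(times) >= min_jobs
            and sum(times) / len(times) < max_avg_walltime]

# Toy data: wn-13 "finishes" hundreds of jobs in seconds -> suspicious.
records = [("wn-13", 4.0)] * 300 + [("wn-02", 5400.0)] * 40
print(find_black_holes(records))        # ['wn-13']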
Clusters: what did we see?

The Grid (and your cluster) are error amplifiers:
- "black holes" may eat your jobs piecemeal
- dangerous "default" values can spoil the day ("GlueERT: 0")
- Monitor! (and allow for (some) failures, and design for rapid recovery)

Users don't have a clue about your system beforehand (that's the downside of those 'autonomous organizations'). If you want users to have a clue, you must publish your clues correctly: the information system is all they can see.

Grid middleware may effectively do a DoS on your system, e.g. doing qstat for every job every minute to feed the logging & bookkeeping …

Power consumption is the greatest single limitation on CPU density.

And finally: keep your machine room tidy, and label everything … or your colleague will not be able to find that #$%^$*! machine in the middle of the night …
Grid-wide monitoring
Success Rate
What’s the chance the whole grid is working correctly?
If a single site has 98.5% reliability (i.e. it is down ~5 days/year), then with 200 sites the chance that the whole grid is working correctly is 0.985^200, only about 4-5%. And the 98.5% is quite optimistic to begin with …
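The arithmetic, spelled out (assuming independent site failures; the reliability figures are the slide's own):

# Chance that ALL sites are up at once, assuming independent failures.
site_reliability = 0.985              # i.e. down ~5.5 days/year
n_sites = 200
print(site_reliability ** n_sites)    # ~0.049: a ~5% chance the whole
                                      # grid is working at any moment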
So build the grid, both middleware and user jobs, for failure:
- monitor sites with both system and functional tests
- dynamically exclude sites with a current malfunction
Monitoring Tools
1. GIIS Monitor
2. GIIS Monitor graphs
3. Site Functional Tests
4. GOC Data Base
5. Scheduled Downtimes
6. Live Job Monitor
7. GridIce – VO view
8. GridIce – fabric view
9. Certificate Lifetime Monitor
Source: Ian Bird, SA1 Operations Status, EGEE-4 Conference, Pisa, November 2005
Freedom of Choice
Tool for VOs to make a site selection based on a set of standard tests
Success Rate: WISDOM
Average success rate for jobs: 70-80% (single submit).

[Figure: number of jobs per day in August (registered, success, aborted and cancelled as final status; left axis 0-5000 jobs) against the success rate (right axis 0-1), where success rate = success / (registered - cancelled).]

Source: N. Jacq, LPC and IN2P3/CNRS, "Biomedical DC Preliminary Report, WISDOM Application", 5 Sept 2005
Failure reasons vary. Biomed data challenge, abort reasons distribution (10/07/2005 - 27/08/2005):

  63%  mismatching resources
  28%  wrong configuration
   4%  network/connection failures
   4%  proxy problems
   1%  JDL problems

i.e. either a failing middleware component, or a wrong request in the job JDL.

[A similar chart: abort reasons distribution for all VOs, 01/2005 - 06/2005.]

Source: N. Jacq, LPC and IN2P3/CNRS, "Biomedical DC Preliminary Report, WISDOM Application", 5 Sept 2005
Is the Grid middleware current?
Common causes of failure:
- specified an impossible combination of resources
- wrong middleware version at the site
- not enough space in the proper place ($TMPDIR)
- environment configuration ($VO_<vo>_SW_DIR, $LFC_HOST, …)

A defensive sanity-check sketch follows the figure note below.
[Figure: "Sites with release" per week, 12/02/2005 - 25/06/2005 (0-140 sites), for the LCG-2_4_0, LCG-2_3_1 and LCG-2_3_0 middleware releases.]
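Since several of these failure causes are visible from inside the job, a defensive job wrapper can check for them before starting the payload. A minimal sketch (the variable names follow the LCG conventions above with atlas as an example VO; the 1 GB threshold is an arbitrary illustration):

# Defensive pre-flight checks inside a grid job, catching the common
# failure causes listed above before the real payload starts.
import os
import sys

def preflight(min_tmp_gb=1.0):            # threshold is illustrative
    problems = []

    # Environment configuration: variables the job relies on.
    for var in ("LFC_HOST", "VO_ATLAS_SW_DIR"):    # example VO: atlas
        if var not in os.environ:
            problems.append(f"missing ${var}")

    # Enough space in the proper place ($TMPDIR)?
    tmpdir = os.environ.get("TMPDIR", "/tmp")
    st = os.statvfs(tmpdir)
    free_gb = st.f_bavail * st.f_frsize / 1e9
    if free_gb < min_tmp_gb:
        problems.append(f"only {free_gb:.1f} GB free in {tmpdir}")

    return problems

if __name__ == "__main__":
    issues = preflight()
    if issues:
        # Fail fast and loudly: much cheaper to diagnose than a job
        # that dies halfway through its 17 hours of computation.
        sys.exit("preflight failed: " + "; ".join(issues))
    print("preflight OK")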
Assorted issues at the fabric layer
Does it work? How can we make it better?
Going from here
Many nice things to do:
- Most of LCG provides a single OS (RHEL3), but users may need SLES, Debian, Gentoo, … or specific libraries. Virtualisation (Xen, VMware)?
- Scheduling user jobs: both the VO and the site want to set part of the priorities …
- Auditing and user tracing in this highly dynamic system: can we know for sure who is running what, where? Or whether a user is DDoS-ing the White House right now? Out of 221 sites, we know for certain there is a compromise somewhere!
More things to do …
Sparse file access: accessing data efficiently over the wide area.

Can we do something useful with the large disks in all the worker nodes? (our 240 CPUs share ~8 TByte of unused disk space!)

There are new grid software releases every month, and the configuration comes from different sources … how can we combine and validate all these configurations quickly and easily?
Job submission live monitor
Source: Gidon Moont, Imperial College, London, HEP and e-Science Centre
Outlook
Towards a global persistent grid infrastructure: interoperability and persistency that are project-independent.
- Europe: EGEE-2, a 'European Grid Organisation'
- US: Open Science Grid
- Asia-Pacific: APGrid & PRAGMA, NAREGI, APAC, K*Grid, …
- GIN aim: cross-submission and file access by end 2006

Extension to industry: first industrial engineering and financial scenario simulations.

New 'middleware': we are just starting to learn how it should work. Extend more into the sharing of structured data.