Grid Computing and Grid Site Infrastructure
David Groep, NIKHEF/PDP and UvA
Outline

• The challenges: LHC, remote sensing, medicine, …
• Distributed computing and the Grid
  – Grid model and community formation, security
  – Grid software: computing, storage, resource brokers, information system
• Grid clusters & storage: the Amsterdam e-Science Facility
  – Computing: installing and operating large clusters
  – Data management: getting throughput to disk and tape
• Monitoring and site functional tests
• Does it work?
The LHC Computing Challenge

Physics @ CERN
• LHC particle accelerator
• operational in 2007
• 4 experiments
• ~10 Petabyte per year (= 10 000 000 GByte)
• 150 countries
• > 10 000 users
• lifetime ~20 years

Trigger and data-acquisition cascade:
• 40 MHz (40 TB/sec) into level 1: special hardware
• 75 kHz (75 GB/sec) into level 2: embedded processors
• 5 kHz (5 GB/sec) into level 3: PC farm
• 100 Hz (100 MB/sec) to data recording & offline analysis

http://www.cern.ch/
LHC Physics Data Processing
Source: ATLAS introduction movie, NIKHEF and CERN, 2005
Tiered data distribution model

For one experiment:

Data distribution issue
• 10 PB of data to distribute
• 15 major centres in the world
• ~200 institutions
• ~10 000 people

Processing issue
• 1 ‘event’ takes ~90 s to process
• events arrive at 100 events/s
• need: 9 000 CPUs (today)
• but also reprocessing, simulation, &c: 2-5× that in total
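The CPU estimate follows directly from the event rate and the per-event processing time. A quick check of the arithmetic (a plain-Python sketch; numbers taken from this slide):

    event_rate_hz = 100      # events arriving per second
    seconds_per_event = 90   # one CPU processes one event in ~90 s

    # To keep up, the farm must finish 100 events every second:
    cpus_needed = event_rate_hz * seconds_per_event
    print(cpus_needed)                              # 9000

    # Reprocessing and simulation multiply this by 2-5x:
    print(2 * cpus_needed, "-", 5 * cpus_needed)    # 18000 - 45000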
T0-T1 Data Flow to Amsterdam

[Diagram: the Tier-0 disk buffer at CERN feeds the Amsterdam disk buffer, which fans out to tape and to disk storage.]

Assumptions: Amsterdam takes 13% of the total, running 24 hours/day with no inefficiencies; figures are for the year 2008.

RAW:  1.6 GB/file, 0.026 Hz, 2246 files/day, 41.6 MB/s, 3.4 TB/day
ESD1: 500 MB/file, 0.052 Hz, 4500 files/day, 26 MB/s, 2.2 TB/day
AODm: 500 MB/file, 0.04 Hz, 3456 files/day, 20 MB/s, 1.6 TB/day

Total T0-T1 bandwidth (RAW + ESD1 + AODm1): 10 000 files/day, 88 MByte/s, 700 Mbit/s
Total to tape (RAW): 41.6 MB/s, 2-3 drives, 18 tapes/day
Total to disk (ESD1 + AOD1): 46 MB/s, 4 TB/day, 8000 files/day

Overall: 7.2 TByte/day to every Tier-1.

ATLAS data flows (draft). Source: Kors Bos, NIKHEF
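The per-stream figures are internally consistent; a sketch that recomputes them from file size and rate alone (plain Python; the small differences from the TB/day figures above come down to decimal-vs-binary units and rounding):

    day = 86400  # seconds

    streams = {
        # name: (file size in bytes, file rate in Hz)
        "RAW":  (1.6e9, 0.026),
        "ESD1": (500e6, 0.052),
        "AODm": (500e6, 0.04),
    }

    total = 0.0
    for name, (size, hz) in streams.items():
        print(f"{name}: {hz * day:.0f} files/day, "
              f"{size * hz / 1e6:.1f} MB/s, "
              f"{size * hz * day / 1e12:.1f} TB/day")
        total += size * hz

    # Aggregate T0->T1 bandwidth, in MByte/s and Mbit/s:
    print(f"total: {total / 1e6:.0f} MB/s = {total * 8 / 1e6:.0f} Mbit/s")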
Other similar applications: Earth Obs
Source: Wim Som de Cerff, KNMI, De Bilt
WISDOM Malaria drug discovery

Anti-malaria drug discovery:
– in-silico docking of ligands onto malaria parasite targets
– 60 000 jobs, taking over 100 CPU-years in total
– completed in less than 2 months using 3 000 CPUs

Resources used:
– 47 sites
– 15 countries
– 3 000 CPUs
– 12 TByte of disk
Infrastructure to cope with this

Source: EGEE SA1, Grid Operations Centre, RAL, UK, as of March 2006
The Grid and Community Building
The Grid Concept
Protocols and standards
Authentication and Authorization
Three essential ingredients for a Grid

‘Access computing like the electrical power grid’

A grid combines resources that
– are not managed by a single organization,
– use a common, open, general-purpose protocol, and
– provide additional qualities of service, i.e., are usable as a collective and transparent resource.
Based on: Ian Foster, GridToday, November 2003
Grid AuthN/AuthZ and VOs

• Access to shared services
  – cross-domain authentication, authorization, accounting, billing
  – common protocols for collective services
• Support multi-user collaborations in “Virtual Organisations”:
  – a set of individuals or organisations,
  – not under single hierarchical control,
  – temporarily joining forces to solve a particular problem at hand,
  – bringing to the collaboration a subset of their resources,
  – sharing those under their own conditions
  – the user’s home organization may or may not know about their activities
• Need to enable ‘easy’ single sign-on
  – a user is typically involved in many different VOs
• Leave the resource owner always in control
Virtual vs. Organic structure

[Diagram, roughly: two real organizations and one virtual community. Organization A holds compute servers C1 and C2, file server F1 (disks A and B), and persons A (faculty), B (staff) and C (student); Organization B holds compute server C3 and persons D (staff), E (faculty) and F (faculty). Virtual Community C draws a subset from both: person A (principal investigator), person B (administrator), persons D and E (researchers), compute server C1', and file server F1 (disk A only).]

Graphic: GGF OGSA Working Group
More characteristic VO structure
Trust in Grid Security

‘Security’: Authentication and Authorization

• There is no a priori trust relationship between members or member organisations!
  – VO lifetime can vary from hours to decades
  – a VO is not necessarily persistent (both long- and short-lived ones exist)
  – people and resources are members of many VOs
• … but a relationship is required
  – as a basis for authorising access
  – for traceability and liability, incident handling, and accounting
Authentication vs. Authorization

• Single authentication token (“passport”)
  – issued by a party trusted by all,
  – recognised by many resource providers, users, and VOs
  – satisfies the traceability and persistency requirements
  – in itself does not grant any access, but provides a unique binding between an identifier and the subject
• Per-VO authorisations (“visa”)
  – granted to a person/service via a virtual organisation
  – based on the ‘passport’ name
  – embedded in the single-sign-on token (proxy)
  – acknowledged by the resource owners
  – providers can obtain lists of authorised users per VO, but can still ban individual users
Federated PKI for authentication

• A federation of many independent CAs
  – common minimum requirements
  – trust domain as required by users and relying parties
  – well-defined and peer-reviewed acceptance process
• User has a single identity
  – from a local CA close by
  – works across VOs, with single sign-on via ‘proxies’
  – the certificate itself is also usable outside the grid

[Diagram: CAs 1 … n, bound by a common charter, guidelines, and acceptance process, trusted by relying parties 1 … n.]
International Grid Trust Federation and EUGridPMA, see http://www.eugridpma.org/
Authorization Attributes (VOMS)

• VO-managed attributes embedded in the proxy
• attributes signed by the VO
• proxy signed by the user
• user certificate signed by the CA

[Diagram: a VOMS proxy with subject “/C=IT/O=INFN/L=CNAF/CN=Pinco Palla/CN=proxy”, carrying signed VOMS pseudo-certs; on a request, the service authenticates the proxy and queries its authorization database.]
Grid Services
Logical Elements in a Grid
Information System
Services in a Grid

• Computing Element: “front-end service for a (set of) computers”
  – cluster computing: typically Linux with IP interconnect
  – capability computing: typically shared-memory supercomputers
  – a ‘head node’ batches or forwards requests to the cluster
• Storage Element: “front-end service for disk or tape”
  – disk and tape based
  – varying retention time, QoS, uniformity of performance
  – ‘ought’ to be ACL-aware: mapping of grid authorization to POSIX ACLs
• File Catalogue
• Information System
  – directory-based for static information
  – monitoring and bookkeeping for real-time information
• Resource Broker
  – matches user job requirements to offers in the information system
  – the WMS allows disconnected operation of the user interface
Typical Grid Topology
Job Description Language

This is JDL that the user might send to the Resource Broker:

Executable         = "catfiles.sh";
StdOutput          = "catted.out";
StdError           = "std.err";
Arguments          = "EssentialJobData.txt LogicalJobs.jdl /etc/motd";
InputSandbox       = {"/home/davidg/tmp/jobs/LogicalJobs.jdl",
                      "/home/davidg/tmp/jobs/catfiles.sh"};
OutputSandbox      = {"catted.out", "std.err"};
InputData          = "LF:EssentialJobData.txt";
ReplicaCatalog     = "ldap://rls.edg.org/lc=WPSIX,dc=cnrs,dc=fr";
DataAccessProtocol = "gsiftp";
RetryCount         = 2;
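JDL is a flat list of attribute = value; assignments (a Condor ClassAd), so it is straightforward to generate from a job description held in code. A minimal sketch (plain Python; the quoting rules are simplified, and the attribute set is a subset of the example above):

    def to_jdl(attrs):
        """Render a dict as JDL: strings are quoted, lists become
        {...} sets, and numbers are emitted verbatim."""
        def fmt(v):
            if isinstance(v, list):
                return "{" + ", ".join(fmt(x) for x in v) + "}"
            if isinstance(v, str):
                return '"' + v + '"'
            return str(v)
        return "\n".join(f"{k} = {fmt(v)};" for k, v in attrs.items())

    job = {
        "Executable": "catfiles.sh",
        "StdOutput": "catted.out",
        "StdError": "std.err",
        "OutputSandbox": ["catted.out", "std.err"],
        "RetryCount": 2,
    }
    print(to_jdl(job))

The resulting file is then handed to the broker with the submission command of the middleware, edg-job-submit in the EDG/LCG releases described here.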
How do you see the Grid?

The broker matches the user’s request with a site:
• ‘information supermarket’ matchmaking (using Condor Matchmaking)
• uses the information published by the site

The Grid information system is ‘the only information a user ever gets about a site’, so it
• should be reliable, consistent and complete,
• uses a standard schema (GLUE) to describe sites, queues and storage (complex schema semantics), and
• is currently presented as an LDAP directory.
LDAP Browser Jarek Gawor: www.mcs.anl.gov/~gawor/ldap
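Since the information system is plain LDAP, any LDAP client can browse it. A sketch using python-ldap; the endpoint shown (lcg-bdii.cern.ch:2170) and the base DN are the conventional top-level BDII settings of the period, so treat them as assumptions for your own infrastructure:

    import ldap  # python-ldap

    # Anonymous bind to the top-level BDII (host and port are assumptions):
    conn = ldap.initialize("ldap://lcg-bdii.cern.ch:2170")
    conn.simple_bind_s()  # the information system allows anonymous reads

    # Ask every Computing Element for its published queue state:
    results = conn.search_s(
        "mds-vo-name=local,o=grid",    # conventional LCG base DN
        ldap.SCOPE_SUBTREE,
        "(objectClass=GlueCE)",
        ["GlueCEUniqueID", "GlueCEStateRunningJobs", "GlueCEStateTotalJobs"],
    )
    for dn, attrs in results:
        print(dn, attrs)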
Glue Attributes Set by the Site

• Site information
    SiteSysAdminContact: mailto: [email protected]
    SiteSecurityContact: mailto: [email protected]
• Cluster info
    GlueSubClusterUniqueID=gridgate.cs.tcd.ie
    HostApplicationSoftwareRunTimeEnvironment: LCG-2_6_0
    HostApplicationSoftwareRunTimeEnvironment: VO-atlas-release-10.0.4
    HostBenchmarkSI00: 1300
    GlueHostNetworkAdapterInboundIP: FALSE
    GlueHostNetworkAdapterOutboundIP: TRUE
    GlueHostOperatingSystemName: RHEL
    GlueHostOperatingSystemRelease: 3.5
    GlueHostOperatingSystemVersion: 3
    GlueCEStateEstimatedResponseTime: 519
    GlueCEStateRunningJobs: 175
    GlueCEStateTotalJobs: 248
• Storage: similar info (paths, max number of files, quota, retention, …)
Information system and brokering issues

• The size of the information system scales with #sites and #details
  – already 12 MByte of LDIF
  – matching a job takes ~15 sec
• Scheduling policies are infinitely complex
  – no static schema can likely express this information
• Much information (still) needs to be set up manually
  … the next slides show the situation as of Feb 3, 2006

The info system is the single most important grid service.

• The current broker tries to make an optimal decision … instead of a ‘reasonable’ one
Example: GlueServiceAccessControlRule

For your viewing pleasure: GlueServiceAccessControlRule
• 261 distinct values seen
• (one of) the least frequently occurring value(s), with 1 instance:
    /C=BE/O=BEGRID/OU=VUB/OU=IIHE/CN=Stijn De Weirdt
• (one of) the most frequently occurring value(s), with 310 instances: dteam
• (one of) the shortest value(s) seen: d0
• (one of) the longest value(s) seen:
    anaconda-ks.cfg configure-firewall install.log install.log.syslog j2sdk-1_4_2_08-linux-i586.rpm lcg-yaim-latest.rpm myproxy-addons myproxy-addons.051021 site-info.def site-info.def.050922 site-info.def.050928 site-info.def.051021 yumit-client-2.0.2-1.noarch.rpm
Example: GlueSEControlProtocolType

For your viewing pleasure: GlueSEControlProtocolType

 freq  value
    1  srm
    1  srm_v1
    1  srmv1
    3  SRM
    7  classic

… which means that of ~410 Storage Elements, only 13 publish interaction info. Ouch!
Example: GlueHostOperatingSystemRelease

Today's attribute: GlueHostOperatingSystemRelease

 freq  value
    1  3.02
    1  3.03
    1  3.2
    1  3.5
    1  303
    1  304
    1  3_0_4
    1  SL
    1  Sarge
    1  sl3
    2  3.0
    2  305
    4  3.05
    4  SLC3
    5  3.04
    5  SL3
   18  3.0.3
   19  7.3
   24  3
   37  3.0.5
   47  3.0.4
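Surveys like this one are a frequency count over an LDIF dump of the information system. A minimal sketch, assuming the whole tree has been dumped to a file with ldapsearch (the file name is illustrative):

    import re
    from collections import Counter

    ATTR = "GlueHostOperatingSystemRelease"
    counts = Counter()

    # bdii-dump.ldif: output of an ldapsearch over the whole BDII (assumed)
    with open("bdii-dump.ldif") as ldif:
        for line in ldif:
            m = re.match(ATTR + r":\s*(.+)", line)
            if m:
                counts[m.group(1).strip()] += 1

    for value, freq in sorted(counts.items(), key=lambda kv: kv[1]):
        print(f"{freq:5d}  {value}")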
Example: GlueSAPolicyMaxNumFiles

136 separate Glue attributes seen

For your viewing pleasure: GlueSAPolicyMaxNumFiles

 freq  value
    6  99999999999999
   26  999999
   52  0
   78  00
 1381  10

For your viewing pleasure: GlueServiceStatusInfo

 freq  value
    2  No Known Problems.
   55  No problems
  206  No Problems
LCG’s Most Popular Resource Centre
Example: SiteLatitude

Today's attribute: GlueSiteLatitude

 freq  value
    1  1.376059
    1  33.063924198120645
    1  37.0
    1  38.739925290125484
    1  39.21
    …
    1  45.4567
    1  55.9214118
    1  56.44
    1  59.56
    1  67
    1  GlueSiteWeb: http://rsgrid3.its.uiowa.edu
    2  40.8527
    2  48.7
    2  49.16
    2  50
    3  41.7827
    3  46.12
    8  0.0

(Eight sites publish latitude 0.0, presumably the ‘most popular resource centre’ of the previous slide.)
The Amsterdam e-Science Facilities
Building and Running a Grid Resource Centre
Compute Clusters
Storage and Disk Pools
Grid Resources in Amsterdam

• 2× 1.2 PByte in 2 robots
• 36 + 512 CPUs (IA32)
• disk caches of 10 + 50 TByte
• multiple 10 Gbit/s links

• 240 CPUs (IA32)
• 7 TByte disk cache
• 10 + 1 Gbit/s links to SURFnet
• 2 Gbit/s to SARA

(counting only resources with either GridFTP or Grid job management)

BIG GRID approved January 2006!
• an investment of € 29M over the next 4 years
• for: LCG, LOFAR, Life Sciences, Medical, DANS, Philips Research, …
• see http://www.biggrid.nl/
Computing
Cluster topology
Connectivity
System services setup
Grid Site Logical Layout
Batch Systems and Schedulers

• The batch system keeps the list of nodes and jobs
• The scheduler matches jobs to nodes based on policies, as in the toy pass below
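In miniature, one scheduling pass looks like this (a toy sketch in plain Python; real schedulers such as MAUI weigh many more policies, e.g. fair-share history and reservations):

    # Queued jobs and free nodes; all fields are illustrative.
    jobs = [
        {"id": 1, "vo": "atlas",  "cpus": 2},
        {"id": 2, "vo": "wisdom", "cpus": 1},
    ]
    nodes = [
        {"name": "wn001", "free_cpus": 1},
        {"name": "wn002", "free_cpus": 4},
    ]
    vo_priority = {"atlas": 10, "wisdom": 5}   # policy: per-VO priority

    # Highest-priority job first; first node with enough free CPUs wins.
    for job in sorted(jobs, key=lambda j: vo_priority[j["vo"]], reverse=True):
        node = next((n for n in nodes if n["free_cpus"] >= job["cpus"]), None)
        if node:
            node["free_cpus"] -= job["cpus"]
            print(f"job {job['id']} ({job['vo']}) -> {node['name']}")
        else:
            print(f"job {job['id']} stays queued")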
NDPF Logical Composition
NDPF Network Topology
Quattor and ELFms
System Installation
Installation

Installing and managing a large cluster requires a system that
• scales to O(10 000) nodes, with
  1. a wide variety of configurations (‘service nodes’)
  2. and also many instances of identical systems (‘worker nodes’)
• lets you validate new configurations before it’s too late
• can rapidly recover from node failures by commissioning a new box (i.e. in minutes)

Popular systems include:
• Quattor (from ELFms)
• xCAT
• OSCAR
• SystemImager & cfengine
Quattor

ELFms stands for ‘Extremely Large Fabric management system’. Its subsystems:
• quattor: configuration, installation and management of nodes
• LEMON: system / service monitoring
• LEAF: hardware / state management

• ELFms manages and controls most of the nodes in the CERN computer centre
  – ~2 100 nodes out of ~2 400
  – multiple functionalities and cluster sizes (batch nodes, disk servers, tape servers, DB, web, …)
  – heterogeneous hardware (CPU, memory, HD size, …)
  – supported OSes: Linux (RH7, RHES2.1, RHES3) and Solaris (9)

Source: German Cancio, CERN IT, see http://www.quattor.org/
Developed within the EU DataGrid project (http://www.edg.org/); development and maintenance are now coordinated by CERN/IT.
System Installation

Central configuration drives:
• the Automated Installation Infrastructure (AII):
  PXE and Red Hat KickStart or Solaris JumpStart
• Software Package Management (SPMA):
  transactional management based on RPMs or PKGs
• Node Configuration (NCM): autonomous agents running
  service configuration components for (re-)configuration
• the CDB, holding the ‘desired’ state of each node:
  – a two-tiered configuration language (PAN and LLD XML)
  – a self-validating, complete language (“swapspace = 2 * physmem”)
  – template inheritance and composition, sketched below
    (“tbn20 = base system + CE_software + pkg_add(emacs)”)
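PAN syntax itself is out of scope here, but the composition idea (a node profile is a base template plus overlays, with derived values validated against rules like “swapspace = 2 * physmem”) can be sketched language-neutrally. A plain-Python sketch; all names and values are illustrative:

    # Templates as dicts; composing a profile = merging overlays onto a base.
    base_system = {"physmem_mb": 2048, "packages": ["openssh", "syslog"]}
    CE_software = {"packages": ["globus-gatekeeper", "pbs-server"]}

    def compose(base, *overlays):
        profile = dict(base, packages=list(base["packages"]))
        for o in overlays:
            profile["packages"] += o.get("packages", [])
            profile.update({k: v for k, v in o.items() if k != "packages"})
        return profile

    def validate(profile):
        # the 'self-validating language' idea: derived values obey rules
        profile["swapspace_mb"] = 2 * profile["physmem_mb"]
        return profile

    # "tbn20 = base system + CE_software + pkg_add(emacs)"
    tbn20 = validate(compose(base_system, CE_software, {"packages": ["emacs"]}))
    print(tbn20)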
Configuration Database

[Diagram, roughly: GUIs, scripts and a CLI feed the CDB (via SOAP); the pan compiler turns templates into per-node XML profiles, fetched by each node over HTTP into the local cache of its Configuration Cache Manager (CCM), which serves the node management agents; the CDB is also mirrored into an RDBMS, queried via SQL by LEAF, LEMON and others.]

Source: German Cancio, CERN IT
An example of template hierarchy in the CDB (CERN CC):
• site level (CERN CC): name_srv1: 137.138.16.5, time_srv1: ip-time-1
• lxbatch cluster: cluster/name: lxbatch, master: lxmaster01, pkg_add(lsf5.1)
• lxplus cluster: cluster/name: lxplus, pkg_add(lsf5.1); plus disk_srv
• node level:
  – lxplus001: eth0/ip: 137.138.4.246, pkg_add(lsf6_beta)
  – lxplus020: eth0/ip: 137.138.4.225
  – lxplus029: (inherited defaults only)

Source: German Cancio, CERN IT
Managing (cluster) nodes

[Diagram, roughly: the Install Manager drives a node (re)install via the vendor system installer (RH73, RHES, Fedora, …) over dhcp/PXE and nfs/http, laying down the base OS. Software servers (SWRep) serve packages (RPM, PKG) over nfs, http or ftp. On the managed node, the Node Configuration Manager (NCM) configures system services (AFS, LSF, SSH, accounting, …) and the Software Package Manager (SPMA) installs kernel, system and application software from a local package cache; the CDB, via the CCM, steers both standard and managed nodes.]
Source: German Cancio, CERN IT
Node Management Agents

• NCM (Node Configuration Manager): a framework in which service-specific plug-ins, called Components, make the system changes needed to bring the node to its CDB desired state
  – regenerate local config files (e.g. /etc/sshd/sshd_config), restart/reload services (SysV scripts)
  – a large number of components is available (system and Grid services)
• SPMA (Software Package Management Agent) and SWRep: manage all or a subset of the packages on the nodes
  – full control on production nodes; on development nodes, non-intrusive, configurable management of system and security updates
  – a package manager, not only an upgrader (roll-back and transactions)
• Portability: a generic framework with plug-ins for NCM and SPMA
  – available for RHL (RH7, RHES3) and Solaris 9
• Scalability to O(10K) nodes
  – automated replication for redundant / load-balanced CDB/SWRep servers
  – scalable protocols, e.g. HTTP, and replication/proxy/caching technology

Source: German Cancio, CERN IT
Back in Amsterdam
User and directory management
Logging and auditing
Monitoring and fault detection
User directory and automount maps

A large number of alternatives exists (nsswitch.conf / pam.d):
• files-based (/etc/passwd, /etc/auto.home, …)
• YP/NIS, NIS+
• a database (MySQL/Oracle)
• LDAP

We went with LDAP:
• information is in a central location (like NIS)
• it can scale by adding slave servers (like NIS)
• it is secured by LDAP over TLS (unlike NIS)
• it can be managed by external programs (also unlike NIS)
(in due course we will map grid credentials to uids in real time)

But you will need nscd, or a large number of slave servers; see the sketch below.
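The client side of such a setup, trimmed to the essentials (standard glibc / nss_ldap conventions; the server name and base DN are assumptions, and exact option names vary per distribution):

    # /etc/nsswitch.conf: resolve users and automount maps from LDAP,
    # trying local files first
    passwd:     files ldap
    group:      files ldap
    automount:  files ldap

    # /etc/ldap.conf (nss_ldap): where to look, and over TLS
    host ldap.example.org
    base dc=example,dc=org
    ssl start_tls

nscd then caches lookups on every worker node, so that a burst of job starts does not turn into a burst of directory queries.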
Logging and Auditing

Auditing and logging:
• syslog (also for the grid gatekeeper, gsiftp, credential mapping)
• process accounting (psacct)

For the paranoid: use the tools included for CAPP/EAL3+, i.e. LAuS
• system call auditing
• highly detailed: useful both for debugging and incident response
• by default auditing is critical: the system will halt on audit errors

If your worker nodes are on private IP space:
• you need to preserve the logs of the NAT box as well
Grid Cluster Logging

Grid statistics and accounting:
• rrdtool views of the batch system load per VO
  – combine qstat and pbsnodes output via a script, cron, and RRD (a sketch follows below)
• the cricket network traffic grapher
• extract PBS accounting data into a dedicated database
  – grid users get a ‘generic’ uid from a dynamic pool, so the database must link that uid back to the grid DN and VO
• from the accounting db, upload anonymized records to APEL
  – APEL is the grid accounting system for VOs and funding agencies
  – the accounting db is also useful to charge costs to projects locally
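A hedged sketch of the cron job behind those per-VO load graphs: count running jobs per queue from qstat and push the totals into round-robin databases (plain Python; the parsing assumes the classic OpenPBS/Torque qstat columns, and the RRD files are assumed to exist):

    import subprocess, time

    # qstat default layout: Job id, Name, User, Time Use, S, Queue;
    # state 'R' is column 5, the queue name column 6.
    out = subprocess.run(["qstat"], capture_output=True, text=True).stdout

    running = {}
    for line in out.splitlines():
        cols = line.split()
        if len(cols) >= 6 and cols[4] == "R":
            running[cols[5]] = running.get(cols[5], 0) + 1

    # One RRD per queue/VO, created beforehand with 'rrdtool create':
    for queue, njobs in running.items():
        subprocess.run(["rrdtool", "update", queue + ".rrd",
                        f"{int(time.time())}:{njobs}"])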
NDPF Occupancy
Usage of the NIKHEF NDPF Compute farm
Average occupancy in 2005: ~ 78%
each colour represents a grid VO, black line is #CPUs available
But at times, in more detail
Auditing incident: a disk with less than 15% free space makes the syscall-audit system panic. New processes cannot write audit entries, which is fatal, so they wait, and wait, and … the head node has the most activity, and fails first!

An unresponsive node makes the MAUI scheduler wait for 15 minutes, then give up and start scheduling again, hitting the rotten node once more, and …

The PBS server tries desperately to contact a dead node whose CPU has turned into Norit (charcoal) … and can no longer serve any requests.
Black Holes

A mis-configured worker node accepts jobs that all die within seconds. Before long, the entire job population will be sucked into this black hole…
Clusters: what did we see?

• The Grid (and your cluster) are error amplifiers
  – “black holes” may eat your jobs piecemeal
  – dangerous “default” values can spoil the day (“GlueERT: 0”)
• Monitor! (and allow for some failures, and design for rapid recovery)
• Users don’t have a clue about your system beforehand (that’s the downside of those ‘autonomous organizations’)
• If you want users to have a clue, you must publish your clues correctly (the information system is all they can see)
• Grid middleware may effectively DoS your system
  – e.g. doing a qstat for every job, every minute, to feed the logging & bookkeeping …
• Power consumption is the greatest single limitation on CPU density
• And finally: keep your machine room tidy, and label everything … or your colleague will not be able to find that #$%^$*! machine in the middle of the night…
Data Storage
Typical data flows
Matching storage to computing
Disk pools and their interfaces
Atlas Tier-1 data flows

[Diagram: ATLAS Tier-1 data flows (draft): real data storage, reprocessing and distribution. Incoming from Tier-0 to the disk buffer: RAW (1.6 GB/file, 0.02 Hz, 1.7K files/day, 32 MB/s, 2.7 TB/day), ESD2 (0.5 GB/file, 0.02 Hz, 1.7K files/day, 10 MB/s, 0.8 TB/day), AOD2 (10 MB/file, 0.2 Hz, 17K files/day, 2 MB/s, 0.16 TB/day) and AODm2 (500 MB/file, 0.004 Hz, 0.34K files/day, 2 MB/s, 0.16 TB/day); together 0.044 Hz, 3.74K files/day, 44 MB/s, 3.66 TB/day. RAW continues to tape (32 MB/s, 2.7 TB/day). Disk storage exchanges ESD2 (10 MB/s, 0.8 TB/day) and AODm2 (0.036 Hz, 18 MB/s, 1.44 TB/day) with the other Tier-1s, serves ESD1/AODm1 (10 + 20 MB/s) to the CPU farm for reprocessing, and ships AODm2 (20 MB/s, 1.6 TB/day) to the Tier-2s. Plus simulation & analysis data flows.]

ATLAS data flows (draft). Source: Kors Bos, NIKHEF
SC3 storage network (SARA)

Disk-to-disk: 583 MByte/s, i.e. 4.6 Gbit/s, over the world.
Graphic: Mark van de Sanden, SARA
Tier-1 Architecture SARA (storage)
Graphic: Mark van de Sanden, SARA
Matching Storage to Computing

Doing the math. A simple job:
– read a 1 MByte piece of a file (typically 1 “event”)
– calculate on it for 30 seconds
– do this for the 2 000 events per file (i.e. 2 GByte files)
– on 1 000 files (1 day of running) this takes 700 days
– it needs 2 TByte in total, i.e. 4 IDE disks of 500 GB

On the Grid, spread out over 1 000 CPUs:
– all jobs start at the same time, each retrieving a 2 GByte input file
– the machine holding this 2 TByte disk is on a 100 Mbit/s link
– effective throughput: 10 MByte/s
– thus 10 kByte/s per machine
– it takes 55 hours before the file transfers finish!
– and after that, only 17 hours of calculation
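The punch line is plain arithmetic; recomputing it (a Python sketch with the numbers from this slide):

    GB, MB, kB = 1e9, 1e6, 1e3

    n_jobs    = 1000
    file_size = 2 * GB      # one input file per job
    link      = 10 * MB     # file server's effective throughput, bytes/s

    per_job_rate = link / n_jobs                 # bytes/s for each job
    transfer_h   = file_size / per_job_rate / 3600
    compute_h    = 2000 * 30 / 3600              # 2000 events x 30 s each

    print(f"{per_job_rate / kB:.0f} kB/s per job")  # 10 kB/s
    print(f"transfers: {transfer_h:.0f} h")         # ~56 h
    print(f"computing: {compute_h:.1f} h")          # ~16.7 h

Without replication, or without moving the jobs to the data, the network link of a single file server throttles the whole farm.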
Storage

Just for ATLAS, one of the experiments:

• RAW & ESD data flow: ~4 TByte/day (1.4 PB/year) to tape
  – expected to be a permanent “museum” copy
  – largely scheduled access (intelligent staging possible), read & write
  – disk buffers before the tape store can be smallish (~10%)
• ‘Chaotic’ access by real users: ~2-4 TByte/day of throughput
  – the lifetime of the data is finite but long (typically 2+ years)
  – access is needed from the worker nodes, i.e. from O(1 000) CPUs
  – random “skimming” access pattern
  – needs disk server farms of typically 500 TByte - 1 PByte
• Management of disk resources
  – split the ‘file system view’ (file metadata) from the object store
  – dCache & dcap, DPNS & DPM, GPFS & ObjectStore, …
DPM Manageability

Example: the Disk Pool Manager (DPM)

• Elements in DPM
  – SRM: scheduling of requests via a standard interface
  – DPM server: management, and translation of a filename into a transfer URL (on a disk node)
  – DPNS name service: ACLs, directory layout, ownership, size
  – object servers: effectuate the actual (GridFTP) transfers
• No central configuration files
  – disk nodes request to add themselves to the DPM
  – all state is kept in a DB (easy to restore after a crash)
• Easy to remove disks and partitions
  – allows simple reconfiguration of the disk pools
  – files are stored as whole chunks (safer than striping over O(100) servers!)
  – the administrator can temporarily remove file systems if a disk has crashed
  – the DPM automatically marks a file system “unavailable” when it cannot be contacted
Example: SRM put processing (1)

(Components: the client, the SRM daemon, the DPM daemon, the DPM database, the DPNS daemon, and data servers each running a GridFTP daemon.)

1a. the client sends an SRM Put to the SRM daemon
1b. the request is put into the request database
1c. the SRM request ID is returned to the client
Example: SRM put processing (2)

2a. the DPM daemon gets the request from the database
2b. it checks permissions and adds the file to the name service (DPNS)
2c. it picks the best data server to put the data onto
2d. it adds the TURL to the request database and marks the request ‘Ready’
2e. it adds the file to the replica table and sets its status to ‘Pending’
Example: SRM put processing (3)

3a. the client polls with SRM getRequestStatus
3b. the SRM daemon gets the TURL from the request database
3c. the TURL is returned to the client
Example: SRM put processing (4)

4a. the client sets the request ‘Running’ (SRM v1)
4b. the SRM daemon updates the status of the request
Example: SRM put processing (5)

5. the client puts the file to the selected data server via GridFTP
Example: SRM put processing (6)

6a. the client sets the request ‘Done’ (SRM v1)
6b. the SRM daemon notifies the DPM daemon
6c. the DPM daemon gets the file size from the data server
6d. it updates the replica metadata (size/status/pin time)
6e. it updates the status of the request
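Seen from the client, the whole put is: request, poll until a transfer URL appears, move the data, mark done. A sketch of that loop; the srm_* and gridftp_* helpers are hypothetical stand-ins for the SRM v1 client calls, stubbed here so the sketch runs:

    import time

    # Hypothetical stand-ins for an SRM v1 client library; these stubs
    # only log, a real client speaks the SRM web-service protocol.
    def srm_put(surl):                print("SRM Put", surl); return 42
    def srm_get_request_status(rid):  return ("Ready", "gsiftp://ds01/f1")
    def srm_set_status(rid, state):   print("set", state)
    def gridftp_transfer(src, turl):  print("gridftp", src, "->", turl)

    def put_file(local_path, surl):
        request_id = srm_put(surl)                 # steps 1a-1c
        while True:                                # steps 2-3: wait for TURL
            status, turl = srm_get_request_status(request_id)
            if status == "Ready":
                break
            time.sleep(5)
        srm_set_status(request_id, "Running")      # step 4
        gridftp_transfer(local_path, turl)         # step 5: the data moves
        srm_set_status(request_id, "Done")         # step 6

    put_file("run1234.raw", "srm://se.example.org/atlas/run1234.raw")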
Other disk storage solutions

• Cluster file systems: PVFS2, Lustre, GFS
  – built for parallel I/O performance
  – the cluster is not expected to grow (or shrink)
  – can obtain better throughput on a single file within the cluster
  – node failure can be quite catastrophic
• Some (e.g. GFS) are limited by block-device / file-system limits
  – a 32-bit Linux 2.4 kernel: 2 TByte
  – ext2/3, 4 kByte blocks: max file size 2 TB, max fs size 16 TB
  – ReiserFS 3.6: max file size 1 EB, max fs size 16 TB ???!
  – XFS: max file size 8 EB, max fs size 8 EB
Monitoring at the global scale
Monitoring
Site Functional Tests
Real-time throbbing job monitor
Success Rate

What’s the chance that the whole grid is working correctly?

If a single site has 98.5% reliability (i.e. it is down 5 days/year):
• with 200 sites, that gives you only a ~4% chance that the whole grid is working correctly (the arithmetic is below)
• and 98.5% is quite optimistic to begin with …

So:
• build the grid, both middleware and user jobs, for failure
• monitor sites with both system and functional tests
• dynamically exclude sites with a current malfunction
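The ~4% is just independence at work: 0.985 raised to the 200th power (a one-line check in Python):

    p_site = 0.985      # single-site availability: down ~5.5 days/year
    n_sites = 200

    p_all_up = p_site ** n_sites
    print(f"{p_all_up:.1%}")   # ~4.9%: 'everything up' is the rare event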
Monitoring Tools

1. GIIS Monitor
2. GIIS Monitor graphs
3. Site Functional Tests
4. GOC Data Base
5. Scheduled Downtimes
6. Live Job Monitor
7. GridIce: VO view
8. GridIce: fabric view
9. Certificate Lifetime Monitor
Source: Ian Bird, SA1 Operations Status, EGEE-4 Conference, Pisa, November 2005
Google Grid Map
Freedom of Choice
• Tool for VOs to make a site selection based on a set of standard tests
Success Rate: WISDOM
Average success rate for jobs: 70-80% (single submit)
[Chart: “Success rate (August)”: number of jobs per day (0-5000, left axis) over days 1-27, and the success rate (0-1, right axis); series shown: registered, success (final status), aborted (final status), cancelled (final status); success rate = success / (registered - cancelled).]
Source: N. Jacq, LPC and IN2P3/CNRS “Biomedical DC Preliminary Report WISDOM Application, 5 sept 2005
Failure reasons vary

Biomed data challenge: distribution of abort reasons (10/07/2005 - 27/08/2005):
• 63%: mismatching resources
• 28%: wrong configuration
• 4%: network/connection failures
• 4%: proxy problems
• 1%: JDL problems

Typical underlying causes: a failing middleware component, or a wrong request in the job JDL.

[Also shown: the abort reason distribution for all VOs, 01/2005 - 06/2005.]
Source: N. Jacq, LPC and IN2P3/CNRS “Biomedical DC Preliminary Report WISDOM Application, 5 sept 2005
Is the Grid middleware current?

• Common causes of failure:
  – the job specified an impossible combination of resources
  – a wrong middleware version at the site
  – not enough space in the proper place ($TMPDIR)
  – environment configuration ($VO_vo_SW_DIR, $LFC_HOST, …)
[Chart: number of sites publishing each middleware release (0-140, vertical axis) per week from 12/02/2005 to 25/06/2005, for the releases LCG-2_4_0, LCG-2_3_1 and LCG-2_3_0.]
Going From here
Does it work
How can we make it better
Going from here

Many nice things to do:
• Most of LCG provides a single OS (RHEL3), but users may need SLES, Debian, Gentoo, … or specific libraries. Virtualisation (Xen, VMware)?
• Scheduling user jobs
  – both the VO and the site want to set part of the priorities …
• Auditing and user tracing in this highly dynamic system
  – can we know for sure who is running what, and where? Or whether a user is DDoS-ing the White House right now?
  – out of 221 sites, we know for certain there is a compromise!
More things to do …
• Sparse file access: access data efficiently over the wide area
• Can we do something useful with the large disks in all the worker nodes? (our 240 CPUs share ~8 TByte of unused disk space!)
• There are new grid software releases every month, and the configuration comes from different sources … how can we combine and validate all these configurations quickly and easily?
Job submission live monitor
Source: Gidon Moont, Imperial College, London, HEP and e-Science Centre
Outlook
Towards a global persistent grid infrastructure:
• interoperability and persistency that are project-independent
  – Europe: EGEE-2, a ‘European Grid Organisation’
  – US: Open Science Grid
  – Asia-Pacific: APGrid & PRAGMA, NAREGI, APAC, K*Grid, …
  – GIN aim: cross-submission and file access by the end of 2006
• extension to industry
  – first: industrial engineering, financial scenario simulations
• new ‘middleware’
  – we are just starting to learn how it should work
• extend more into the sharing of structured data
A Bright Future!
Imagine that you could plug your computer into the wall and have direct access to huge computing resources immediately, just as you plug in a lamp to get instant light. …
Far from being science-fiction, this is the idea the [Grid] is about to make into reality.
The EU DataGrid project brochure, 2001