Evaluation of the Globus GRAM Service

Massimo Sgaravatto, INFN Padova



Evaluation of GRAM Service

[Figure: Site1, Site2, Site3. Labels: "Submit jobs (using Globus tools)"; "Information on characteristics and status of local resources"]

Evaluation of GRAM Service

Job submission tests using Globus tools (globusrun, globus-job-run, globus-job-submit)

GRAM as uniform interface to different underlying resource management systems

“Cooperation” between GRAM and GIS

Evaluation of RSL as uniform language to specify resources

Tests performed with Globus 1.1.2 and 1.1.3 on Linux machines

GRAM & fork system call

[Figure labels: Client; Server (fork)]



GRAM & Condor

[Figure labels: Client; Server (Condor front-end machine); Globus; Globus; Condor pool]

GRAM & Condor

Tests considering:

Standard Condor jobs (relinked with the Condor library)
  INFN WAN Condor pool configured as Globus resource
  ~200 machines spread across different sites
  Heterogeneous environment
  No single file system and UID domain

Vanilla jobs (“normal” jobs)
  PC farm configured as Globus resource
  Single file system and UID domain


[Figure labels: Server (LSF front-end machine); Globus; LSF; LSF Cluster]

Results

Some bugs found and fixed (fixes included in the INFNGRID 1.1 distribution)
  Standard output and error for vanilla Condor jobs
  globus-job-status
  …

Some bugs can be solved without major re-design and/or re-implementation:
  For LSF the RSL parameter (count=x) is translated into: bsub -n x …
    This just allocates x processors and dispatches the job to the first one
    Used for parallel applications
    Should be: bsub … x times
    Maybe we don’t need to solve this problem (see later…)
  …

Two major problems:
  Scalability
  Fault tolerance
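The (count=x) translation issue for LSF described above can be sketched as follows. This is only an illustration of the two translations named on the slide, not the actual Globus LSF script; the function names and the job command `./myjob` are hypothetical, and any other bsub options are left out.

```python
from typing import List


def bsub_current(count: int, command: str) -> List[List[str]]:
    """Translation reported on the slide: a single bsub asking for `count` processors."""
    return [["bsub", "-n", str(count), command]]


def bsub_per_instance(count: int, command: str) -> List[List[str]]:
    """Translation the slide says it should be: the same bsub issued `count` times."""
    return [["bsub", command] for _ in range(count)]


if __name__ == "__main__":
    # "./myjob" is a placeholder for the user's executable.
    print("current translation:")
    for argv in bsub_current(3, "./myjob"):
        print("  " + " ".join(argv))
    print("expected translation:")
    for argv in bsub_per_instance(3, "./myjob"):
        print("  " + " ".join(argv))
```

The first form yields one LSF job holding several processors, which suits parallel applications; the second yields independent jobs, which is what a user asking for x copies of a sequential job would expect.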

Globus GRAM Architecture

[Figure labels: pc1; pc2; Globus front-end machine; LSF/Condor/PBS/…]

pc1% globusrun -b -r pc2.pd.infn.it/jobmanager-xyz -f file.rsl

Scalability

One jobmanager for each globusrun

What if I want to submit 1000 jobs?
  1000 globusrun
  1000 jobmanagers running on the front-end machine!

% globusrun -b -r pc2.infn.it/jobmanager-xyz -f file.rsl

file.rsl:

It is not possible to specify in the RSL file 1000 different input files and 1000 different output files …
  (Condor provides $(Process) for this)

Problems with job monitoring (globus-job-status)

Therefore (count=x) with x>1 is not very useful!
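To make the scalability point concrete, here is a minimal sketch of the workaround it implies: since one RSL file cannot name a different input and output file per instance, a separate RSL file is generated per job, and each one needs its own globusrun (and therefore its own jobmanager). The file names `input.N`, `output.N`, `jobN.rsl` and the executable path are assumptions chosen for illustration; the globusrun flags are the ones shown on the slide.

```python
import tempfile
from pathlib import Path
from typing import List


def make_rsl(executable: str, index: int) -> str:
    """RSL for one job with its own input and output files (names are hypothetical)."""
    return f"&(executable={executable})(stdin=input.{index})(stdout=output.{index})"


def write_job_files(directory: Path, executable: str, n: int) -> List[Path]:
    """Write one RSL file per job, playing the role of $(Process) in Condor by hand."""
    paths = []
    for i in range(n):
        path = directory / f"job{i}.rsl"
        path.write_text(make_rsl(executable, i) + "\n")
        paths.append(path)
    return paths


def globusrun_commands(contact: str, rsl_files: List[Path]) -> List[List[str]]:
    """One globusrun per RSL file, hence one jobmanager per job on the front end."""
    return [["globusrun", "-b", "-r", contact, "-f", str(p)] for p in rsl_files]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        files = write_job_files(Path(tmp), "/bin/myjob", 3)
        for argv in globusrun_commands("pc2.pd.infn.it/jobmanager-xyz", files):
            print(" ".join(argv))
```

The commands are only printed, not executed. With n = 1000 the list has 1000 entries, which is exactly the load on the front-end machine that the slide warns about.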

Fault tolerance

The jobmanager is not persistent

If the jobmanager can’t be contacted, Globus assumes that the job(s) has been completed

Example of problem
  Submission of n jobs on a cluster managed by a local resource management system
  Reboot of the front-end machine
  The jobmanager(s) doesn’t restart
  Orphan jobs: Globus assumes that the jobs have been successfully completed
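The failure scenario above can be modelled with a small toy example. It is not Globus code: `JobManager`, `status_as_described` and `status_conservative` are hypothetical names, and the "conservative" variant is just one possible alternative shown for contrast, not something proposed in the slides.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    UNKNOWN = "unknown"


class JobManager:
    """Toy stand-in for a non-persistent jobmanager process."""

    def __init__(self, job_status: Status) -> None:
        self.job_status = job_status
        self.alive = True

    def query(self) -> Optional[Status]:
        # None models "the jobmanager can't be contacted".
        return self.job_status if self.alive else None


def status_as_described(jm: JobManager) -> Status:
    """Behaviour reported on the slide: an unreachable jobmanager means 'completed'."""
    answer = jm.query()
    return Status.DONE if answer is None else answer


def status_conservative(jm: JobManager) -> Status:
    """Hypothetical alternative: an unreachable jobmanager means 'unknown'."""
    answer = jm.query()
    return Status.UNKNOWN if answer is None else answer


if __name__ == "__main__":
    jm = JobManager(Status.RUNNING)
    jm.alive = False  # front-end machine rebooted, jobmanager not restarted
    # The job may still be running in the cluster, yet it is reported as done.
    print(status_as_described(jm))
    print(status_conservative(jm))
```

The point of the contrast is that the client cannot distinguish "finished" from "lost contact" unless the two cases map to different states.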

GRAM & GIS

How do the local GRAMs provide the GIS with characteristics and status of local resources?

Tests performed considering:
  Condor pool
  LSF cluster

GRAM & Condor & GIS


Must be fixed

Jobs & GIS

Info on Globus jobs published in the GIS:
  User
    Subject of certificate
    Local user name
  RSL string
  Globus job id
  LSF/Condor/… job id
  Status: Run/Pending/…
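The per-job information listed above could be held in a record like the one below. This is only a sketch of the data shape: the class and field names are hypothetical and do not reflect the actual GIS schema, and the values are placeholders rather than real identifiers.

```python
from dataclasses import asdict, dataclass


@dataclass
class GlobusJobInfo:
    """One job entry, with one field per item listed on the slide (names are hypothetical)."""

    certificate_subject: str  # User: subject of certificate
    local_user: str           # User: local user name
    rsl: str                  # RSL string
    globus_job_id: str        # Globus job id
    local_job_id: str         # LSF/Condor/... job id
    status: str               # Run/Pending/...


if __name__ == "__main__":
    job = GlobusJobInfo(
        certificate_subject="<certificate subject>",
        local_user="<local user name>",
        rsl="<RSL string>",
        globus_job_id="<Globus job id>",
        local_job_id="<local job id>",
        status="Pending",
    )
    for name, value in asdict(job).items():
        print(f"{name}: {value}")
```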

GRAM & GIS

The information on characteristics and status of local resources and on jobs is not enough
  As local resources we must consider farms and not the single workstations
  Other information (e.g. total and available CPU power) is needed

Fortunately the default schema can be integrated with other info provided by specific agents
  The needed information must be identified first

RSL

We need a uniform language to specify resources across different resource management systems

The RSL syntax model seems suitable to define even complicated resource specification expressions

The common set of RSL attributes is often not sufficient

The attributes not belonging to the common set are ignored

RSL

More flexibility is required
  Resource administrators should be allowed to define new attributes, and users should be allowed to use them in resource specification expressions (Condor ClassAds model)

Using the same language to describe the offered resources and the requested resources (Condor ClassAds model) seems a better approach
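The difference between a fixed common attribute set and freely extensible attributes can be illustrated with a toy matcher. This is neither RSL evaluation nor ClassAds syntax: it is a rough sketch in which offers and requests are both plain dictionaries, the common set `{"count", "memory"}` is an assumption, and `cms_software` is a made-up administrator-defined attribute.

```python
from typing import Dict, Optional, Set, Union

Value = Union[int, float, str]
Ad = Dict[str, Value]

# Hypothetical "common set" of attributes understood everywhere.
COMMON_ATTRIBUTES: Set[str] = {"count", "memory"}


def satisfies(offered: Value, wanted: Value) -> bool:
    """Strings must be equal; numbers are treated as minimum requirements."""
    if isinstance(offered, str) or isinstance(wanted, str):
        return offered == wanted
    return offered >= wanted


def match(request: Ad, offer: Ad, known: Optional[Set[str]] = None) -> bool:
    """Match a request against an offer.

    If `known` is given, request attributes outside it are silently ignored,
    mimicking the behaviour described for the common RSL attribute set.
    """
    for name, wanted in request.items():
        if known is not None and name not in known:
            continue  # attribute outside the common set: ignored
        if name not in offer or not satisfies(offer[name], wanted):
            return False
    return True


if __name__ == "__main__":
    farm_a: Ad = {"count": 16, "memory": 512, "cms_software": "installed"}
    farm_b: Ad = {"count": 16, "memory": 512}
    request: Ad = {"count": 4, "cms_software": "installed"}

    # Common set only: the custom attribute is dropped, so farm_b matches wrongly.
    print(match(request, farm_b, known=COMMON_ATTRIBUTES))
    # Same vocabulary for offers and requests: farm_b is rejected, farm_a accepted.
    print(match(request, farm_b))
    print(match(request, farm_a))
```

When both sides use the same open vocabulary, an attribute added by a site administrator becomes usable in requests without changing the matcher.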

Next steps

Bug fixes
  Modification of Globus LSF scripts for GIS
  Problem (count=x) with LSF ???

Tests with real applications and real environments (CMS fall production)

Define a small set of attributes of a Condor pool, LSF cluster, PBS cluster that should be reported to the GIS, and try to implement it
  Let’s start with information provided by the underlying resource management system

Tests with GRAM API

Tests with other resource management systems are not necessary

Scalability and robustness problems
  Not so simple and straightforward!
  Up to the Workload management WP; possible collaboration with Globus team and Condor team
