
Rochester Institute of Technology

Job Submission

Andrew Pangborn & Myles Maxfield

01/19/09 Service Oriented Cyberinfrastructure Lab, http://grid.rit.edu


The Grid

• Virtual organizations spanning multiple administrative domains
  – Different organizations and administrators
  – Different hardware
  – Different queuing systems

• How do we make sense of it all?


The Problem

• At one end are computing resources (the grid fabric) managed by batch queuing systems and middleware

• At the other end are end-users and their jobs/applications

• Need software and protocols for submitting jobs to the computing resources

• Also want to monitor jobs after submission and schedule them efficiently to achieve high throughput


Grid Architecture

[Figure: grid architecture diagram, annotated "Job Submission". Image from Ian Foster paper (The Anatomy of the Grid)]


Batch Queuing Systems

• Submitting a job directly to the batch queuing system
• One or more queues
  – Priorities
• Two common architectures
  – Client/server
  – Dynamic offloading
• User credential (delegation)
• Jobs have states (e.g. Pending, Running)
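The queue, priority and job-state ideas above can be sketched in a few lines of Python. This is a toy model and not the scheduler of any system named in this deck: the `BatchQueue` and `Job` names are hypothetical, and the policy (highest priority first, first-come-first-served within a priority) is an assumption chosen for illustration.

```python
import heapq
import itertools
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    PENDING = "Pending"
    RUNNING = "Running"


@dataclass
class Job:
    name: str
    priority: int = 0
    state: JobState = JobState.PENDING


class BatchQueue:
    """Toy single queue: highest priority first, FIFO within one priority."""

    def __init__(self):
        self._heap = []
        # Monotonic counter breaks ties so equal priorities stay in submit order.
        self._sequence = itertools.count()

    def submit(self, name, priority=0):
        job = Job(name, priority)
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._sequence), job))
        return job

    def run_next(self):
        """Move the best pending job to Running, or return None if idle."""
        if not self._heap:
            return None
        _, _, job = heapq.heappop(self._heap)
        job.state = JobState.RUNNING
        return job
```

Real batch systems layer fair-share, reservations and resource matching on top of this basic ordering.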


Batch Queuing Systems

• Important examples:
  – Portable Batch System
  – TORQUE
  – Xgrid
  – Sun Grid Engine
  – Load Sharing Facility
  – Condor


Portable Batch System (PBS)

• Originally developed for NASA
• Client/server architecture
• Server: pbs_server
• Client: pbs_mom (the daemon that runs jobs on each compute node)
• Works with MPI through built-in shell script variables (e.g. $PBS_NODEFILE)


PBS Example

    litherum@gras:~$ cat test.sh
    #!/bin/sh
    #testpbs
    echo This is a test
    echo today is `date`
    echo This is `hostname`
    echo The current working directory is `pwd`
    ls -alF /home
    uptime


PBS Example

    litherum@gras:~$ qsub test.sh
    6.gras.carrion.rit.edu
    litherum@gras:~$ qstat
    Job id                    Name             User            Time Use S Queue
    ------------------------- ---------------- --------------- -------- - -----
    6.gras                    test.sh          litherum        00:00:00 C batch
    litherum@gras:~$ cat test.sh.o6
    This is a test
    today is Sat Jan 17 18:20:20 EST 2009
    This is carrion02
    The current working directory is /home/litherum
    total 20
    drwxr-xr-x 31 litherum litherum 4096 Jan 17 18:19 litherum/
     18:20:20 up 131 days, 21:20,  0 users,  load average: 0.00, 0.00, 0.00
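Monitoring scripts often need the qstat table in a structured form. The sketch below parses the default tabular layout shown above into dictionaries. The `parse_qstat` helper is hypothetical, and it assumes whitespace-separated columns with no spaces inside job names, which holds for the sample but is not guaranteed in general.

```python
SAMPLE_QSTAT = """\
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
6.gras                    test.sh          litherum        00:00:00 C batch
"""

# A few common single-letter state codes; the sample above shows C.
STATE_NAMES = {"Q": "queued", "R": "running", "C": "completed"}


def parse_qstat(text):
    """Return one dict per job row found below the dashed separator line."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if line.startswith("---"):
            rows = lines[index + 1:]
            break
    else:
        return []  # no table present, e.g. an empty queue

    jobs = []
    for row in rows:
        fields = row.split()
        if len(fields) != 6:
            continue  # skip blank or unexpected lines
        job_id, name, user, time_use, state, queue = fields
        jobs.append({
            "job_id": job_id,
            "name": name,
            "user": user,
            "time_use": time_use,
            "state": state,
            "queue": queue,
        })
    return jobs


if __name__ == "__main__":
    for job in parse_qstat(SAMPLE_QSTAT):
        print(job["job_id"], STATE_NAMES.get(job["state"], job["state"]))
```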


Torque

• Built on top of PBS
• Supports reservations, where you can reserve specific resources for specific times.
• Supports partitions, where you can partition a cluster into smaller sub-clusters.


Torque

    litherum@gras:~$ showq
    ACTIVE JOBS--------------------
    JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

         0 Active Jobs       0 of    4 Processors Active (0.00%)
                             0 of    2 Nodes Active      (0.00%)

    IDLE JOBS----------------------
    JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

    0 Idle Jobs

    BLOCKED JOBS----------------
    JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

    Total Jobs: 0   Active Jobs: 0   Idle Jobs: 0   Blocked Jobs: 0


Xgrid

• Apple
• Essentially the same as Condor
• GUI! =)
• Client/server model

Image from: http://upload.wikimedia.org/wikipedia/en/6/62/XgridAdminTool.jpg


Sun Grid Engine

• Open source, like everything new Sun puts out
• Supports
  – Reservations
  – Job dependencies
  – Checkpointing
  – Multiple scheduling algorithms
  – Web interface
• Professional!


Middleware

• These queuing systems are hard to use
• There may be many systems employed in a given grid
• Wouldn’t it be nice if all this were unified in a single implementation?
• Middleware that handles job submission in a virtual organization across resources spread throughout multiple administrative domains would be useful!


Condor

• A tool for pooling and “scavenging” computing resources and distributing jobs
• Similar to a batch queuing system [2]
  – Job management
  – Scheduling policy
  – Priority scheme
  – Resource monitoring
  – Resource management
• Also focuses on high throughput and “opportunistic computing” [2]
  – Utilize computing resources whenever they are available

Condor image from: http://www.cs.wisc.edu/condor/


Condor Universes [1]

• Standard
  – Checkpointing, fault tolerance
  – Link job against Condor libraries
• Vanilla
  – Simpler, can run ordinary binaries (do not need to be “condor compiled”)
  – No support for partial execution or job relocation
• Others
  – PVM
  – MPI
  – Java


Condor Submission File Example [1]

    #hello.sub
    #condor job file example
    Universe = Vanilla
    Executable = hello
    Output = hello.out
    Input = hello.in
    Error = hello.err
    Log = hello.log
    Queue
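When many similar jobs are needed, the submit description is often generated rather than written by hand. The following is a minimal sketch of that idea. The `render_submit` helper is hypothetical; it only emits `key = value` lines followed by a `Queue` statement, mirroring the example above, and does not validate the keys against Condor.

```python
def render_submit(settings, queue_count=1):
    """Render a Condor submit description from a mapping of settings.

    `settings` keeps insertion order, so the output lines appear in the
    order the caller supplied them. `queue_count` is the number of job
    instances to queue.
    """
    if queue_count < 1:
        raise ValueError("queue_count must be at least 1")
    lines = [f"{key} = {value}" for key, value in settings.items()]
    # A bare "Queue" submits one job; "Queue N" submits N instances.
    lines.append("Queue" if queue_count == 1 else f"Queue {queue_count}")
    return "\n".join(lines) + "\n"


HELLO_JOB = {
    "Universe": "Vanilla",
    "Executable": "hello",
    "Output": "hello.out",
    "Input": "hello.in",
    "Error": "hello.err",
    "Log": "hello.log",
}

if __name__ == "__main__":
    print(render_submit(HELLO_JOB), end="")
```

The rendered text would then be saved to a file and handed to condor_submit.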


Some Condor Commands [5]

• condor_submit <job_file.sub>
  – Submit a Condor job
• condor_q
  – View the Condor job queue
• condor_status
  – Check the status of machines in the Condor pool
• condor_compile
  – Re-links jobs for use in the standard universe


Condor job structures

Programming models for larger scale jobs using condor agent

Master-Worker
• Single master process coordinates all the independent tasks
• Collects results as workers finish, distributes new jobs to workers

DAG (Directed Acyclic Graph)
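The master-worker pattern can be illustrated with Python's standard library. This is a local sketch only: threads stand in for worker jobs, and the `run_master` and `square` names are hypothetical. It does not use any Condor API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def square(value):
    """Stand-in for an independent unit of work."""
    return value * value


def run_master(tasks, worker, max_workers=4):
    """Hand tasks to a pool of workers and collect results as they finish.

    The executor keeps unstarted tasks queued and gives the next one to
    whichever worker becomes free, which is the "distribute new jobs to
    workers" step. Tasks must be hashable because they key the result map.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, task): task for task in tasks}
        # as_completed yields futures in completion order, not submit order.
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    print(run_master(range(5), square))
```

Because the tasks are independent, the master needs no ordering logic; a DAG adds exactly that, by running a job only after the jobs it depends on have finished.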


GRAM [4]

• Globus Resource Allocation Manager (GRAM)
  – Resource allocation
  – Process creation
  – Monitoring
  – Management
  – Maps requests expressed in a Resource Specification Language (RSL) into commands to local schedulers and computers.


GRAM

• Pluggable!
• Can’t make up their mind how to describe jobs (RSL in earlier versions, XML job descriptions in GT4)
• Will submit jobs to:
  – Condor
  – LSF
  – PBS/Torque
  – ???
• Unified interface, identifier for which cluster/service to use
• Job submission file
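The "pluggable" idea, one interface plus an identifier that selects the local scheduler, can be sketched as a small registry. This is not how GRAM is implemented. The `register_scheduler` and `build_submit_command` names are hypothetical, and a real adapter would also translate the job description into the scheduler's own format instead of just choosing a command.

```python
# Maps a scheduler identifier to the local command used to submit a job.
# qsub and condor_submit are the submit commands shown elsewhere in this deck.
_SUBMIT_COMMANDS = {
    "PBS": ["qsub"],
    "Condor": ["condor_submit"],
}


def register_scheduler(identifier, command):
    """Plug in another local scheduler without touching the callers."""
    _SUBMIT_COMMANDS[identifier] = list(command)


def build_submit_command(identifier, job_file):
    """Return the argument list that would submit job_file locally."""
    try:
        base = _SUBMIT_COMMANDS[identifier]
    except KeyError:
        raise ValueError(f"no adapter registered for {identifier!r}") from None
    return base + [job_file]


if __name__ == "__main__":
    print(build_submit_command("PBS", "test.sh"))
    print(build_submit_command("Condor", "hello.sub"))
```

Callers only ever see `build_submit_command`, so adding a scheduler is a one-line registration.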


GRAM Example

    maxfield@tg-login1:~> globusrun-ws -submit -factory https://tg-login.ornl.teragrid.org:8444/wsrf/services/ManagedJobFactoryService -factory-type PBS -streaming -job-command /bin/hostname
    Delegating user credentials...Done.
    Submitting job...Done.
    Job ID: uuid:89538014-e4f2-11dd-81df-0010180bb4e6
    Termination time: 01/18/2009 23:57 GMT
    Current job state: Pending
    Current job state: Active
    tg-c15
    Current job state: CleanUp-Hold
    Current job state: CleanUp
    Current job state: Done
    Destroying job...Done.
    Cleaning up any delegated credentials...Done.
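The streamed output above mixes state changes with the job's own stdout (here the hostname tg-c15). A script that wants only the state history can filter on the "Current job state:" prefix. The sketch below does that. The `job_states` and `finished_ok` helpers are hypothetical, and treating a final state of Done as success is an assumption based on this one transcript.

```python
STATE_PREFIX = "Current job state: "

SAMPLE_OUTPUT = """\
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:89538014-e4f2-11dd-81df-0010180bb4e6
Termination time: 01/18/2009 23:57 GMT
Current job state: Pending
Current job state: Active
tg-c15
Current job state: CleanUp-Hold
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
"""


def job_states(output):
    """Return the reported job states in the order they appeared."""
    return [
        line[len(STATE_PREFIX):].strip()
        for line in output.splitlines()
        if line.startswith(STATE_PREFIX)
    ]


def finished_ok(output):
    """True if at least one state was reported and the last one is Done."""
    states = job_states(output)
    return bool(states) and states[-1] == "Done"


if __name__ == "__main__":
    print(" -> ".join(job_states(SAMPLE_OUTPUT)))
```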


GRAM Input Example

    <job>
      <executable>/bin/echo</executable>
      <argument>this is an example string </argument>
      <argument>Globus was here</argument>
      <stdout>${GLOBUS_USER_HOME}/stdout</stdout>
      <stderr>${GLOBUS_USER_HOME}/stderr</stderr>
    </job>

http://www.globus.org/toolkit/docs/4.2/4.2.1/execution/gram4/user/#gram4-user-usagescenarios-jdd
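A job description like the one above can be produced programmatically, which avoids hand-escaping characters such as `<` or `&` in arguments. This sketch uses Python's standard XML library and only the element names that appear in the example (job, executable, argument, stdout, stderr). The `build_job_description` helper is hypothetical and does not check the result against the GRAM schema.

```python
import xml.etree.ElementTree as ET


def build_job_description(executable, arguments=(), stdout=None, stderr=None):
    """Return a <job> element serialized as a string."""
    job = ET.Element("job")
    ET.SubElement(job, "executable").text = executable
    # One <argument> element per command-line argument, in order.
    for argument in arguments:
        ET.SubElement(job, "argument").text = argument
    if stdout is not None:
        ET.SubElement(job, "stdout").text = stdout
    if stderr is not None:
        ET.SubElement(job, "stderr").text = stderr
    # encoding="unicode" makes tostring return str instead of bytes.
    return ET.tostring(job, encoding="unicode")


if __name__ == "__main__":
    print(build_job_description(
        "/bin/echo",
        ["this is an example string ", "Globus was here"],
        stdout="${GLOBUS_USER_HOME}/stdout",
        stderr="${GLOBUS_USER_HOME}/stderr",
    ))
```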


Condor-G [4]

• Condor-G is a Globus-enabled version of the Condor scheduler.
• It uses Globus to handle inter-organizational problems like:
  – Security
  – Resource management for supercomputers
  – Executable staging
• The same Condor tools that access local resources are now able to use the Globus protocols to access resources at multiple sites.
• It communicates with these resources and transfers files to and from them using Globus mechanisms, such as:
  – GSI for security
  – GRAM protocol for job submission
  – GASS for file transfer
• Condor-G can be used to submit jobs to systems managed by Globus.
• Globus tools can be used to submit jobs to systems managed by Condor.


Condor-G


Using Condor-G

• Set universe = globus in the Condor submit file
• Also need to specify the Globus scheduler hostname, for example:
    globusscheduler = example.org/jobmanager
• Still use the condor_submit command
• TeraGrid Condor-G example here:
  – http://www.teragrid.org/userinfo/jobs/condorg.php


UNICORE

• Alternative to Globus
• Primarily used in Europe
• Uses web services, similar to GT4
• GUI
• Abstract Job Objects
• User -> Server -> Virtual Site
• X.509 and SSL


UNICORE GUI


Upperware

• Abstract Job Objects? Workflows? What is all this nonsense?!
• Scientist (primary user) doesn’t care about this stuff
• Shouldn’t have to deal with writing XML description files or creating a complicated workflow
• Simply let them run their program


GridShell

• Unified command line interface
• Defer to resident experts


References

1. http://www.linuxjournal.com/node/9058/print - Getting started with Condor
2. Thain, D., Tannenbaum, T., & Livny, M. (2005). Distributed computing in practice: the Condor experience.
3. http://grid.rit.edu/seminar/lib/exe/fetch.php/users:jeremy_espenshade:condorjobsubmission.ppt
4. http://iag.iucc.ac.il/presentations/front2.ppt
5. http://www.cs.wisc.edu/condor/manual/v7.2/
6. http://www.globus.org/toolkit/docs/4.2/4.2.1/execution/gram4/user/#gram4-user-usagescenarios-jdd
7. http://upload.wikimedia.org/wikipedia/en/6/62/XgridAdminTool.jpg
8. Wikipedia
9. http://www.isgtw.org/images/Rudolph_expert_client_screenshot2.jpg
10. http://upload.wikimedia.org/wikipedia/commons/a/a4/Double_curvature_steel_lattice_Shell_by_Shukhov_in_Vyksa_1897_shell.jpg
