Condor Tutorial

Prabhaker Mateti
Wright State University

Acknowledgements

Many of these slides are adapted from tutorials by Miron Livny and his associates
University of Wisconsin-Madison
http://www.cs.wisc.edu/condor

Clusters with Part Time Nodes

Cycle Stealing: Running jobs on workstations that do not belong to the job's owner.

Definition of Idleness: E.g., no keyboard and no mouse activity

Tools/Libraries
– Condor
– PVM
– MPI

Performance v. Throughput

High Performance - Very large amounts of processing capacity over short time periods
– FLOPS - Floating Point Operations Per Second

High Throughput - Large amounts of processing capacity sustained over very long time periods
– FLOPY - Floating Point Operations Per Year

FLOPY = 365 x 24 x 60 x 60 x FLOPS?
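
The question on this slide can be made concrete with a little arithmetic. The sketch below is only an illustration: it treats the product as an upper bound that holds if a peak FLOPS rate were sustained all year, and the `availability` parameter is a hypothetical knob for this example, not a Condor setting.

```python
# Seconds in a non-leap year: 365 days x 24 hours x 60 minutes x 60 seconds.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def flopy_upper_bound(flops: float) -> float:
    """Operations per year if the peak FLOPS rate were sustained all year."""
    return flops * SECONDS_PER_YEAR


def flopy_delivered(flops: float, availability: float) -> float:
    """Scale the bound by the fraction of the year the machine is usable.

    `availability` (0.0 to 1.0) is a hypothetical parameter for illustration.
    """
    return flopy_upper_bound(flops) * availability
```

A cycle-stolen workstation is only available part of the time, which is why the delivered FLOPY is usually well below the bound.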

Cooperation

Workstations are “personal”

Others' use slows you down
– Immediate-Eviction
– Pause-and-Migrate

Willing to share
– Letting you cycle-steal

Willing to trust

Granularity of Migration

Process migration
– Process = Collection of objects
– at least one active object

Object migration
– Passive objects
– Active objects

Migration of Jobs: Technical Issues

Checkpointing: Preserving the state of the process so it can be resumed.

One architecture to another

Your “environment”
– keyboard, mouse, display, files, …

Condor

A system for high throughput computing by making use of idle computing resources

Lots of jobs over a long period of time, not a short burst of high performance

Manages both machines and jobs

Has been stable, and delivered thousands of CPU hours

Condor Techniques

Migratory programs
– Checkpointing
– Remote IO

Resource matching

Condor: Assumptions

Large numbers of workstations are idle most of the time

Owners of such machines would not mind their use by others while idle

Owners want their work to be given high priority

Roles

Owner offers his machine for use by others

User requests to run his jobs

Administrator manages the pool of available machines

Multiple roles possible

Classified Advertisements: Example

    MyType = "Machine"
    TargetType = "Job"
    Name = "froth.cs.wisc.edu"
    StartdIpAddr = "<128.105.73.44:33846>"
    Arch = "INTEL"
    OpSys = "SOLARIS26"
    VirtualMemory = 225312
    Disk = 35957
    KFlops = 21058
    Mips = 103
    LoadAvg = 0.011719
    KeyboardIdle = 12
    Cpus = 1
    Memory = 128
    Requirements = LoadAvg <= 0.300000 && KeyboardIdle > 15 * 60
    Rank = 0

Condor User Requests

Describes the program, and its needs

Example condor_submit File

    Universe = standard
    Executable = /home/wsu03/condor/my_job.condor
    Input = my_job.stdin
    Output = my_job.stdout
    Error = my_job.stderr
    Log = my_job.log
    Arguments = -arg1 -arg2
    InitialDir = /home/wsu03/condor/run_1
    Queue

ClassAds: Example for Jobs

    Requirements = Arch == "INTEL" && OpSys == "LINUX" && Memory > 20
    Rank = (Memory > 32) * ( (Memory * 100) + (IsDedicated * 10000) + Mips )

Condor Pool of Machines

“Pool” can be a single machine, or a group of machines volunteered by their owners

Determined by a “central manager” - the matchmaker and centralized information repository

Each machine runs various daemons to provide different services, either to the users who submit jobs, the machine owners, or the pool itself

Condor System Structure

Condor Agents

Condor Resource Agent
– condor_startd daemon
– allows a machine to execute Condor jobs
– enforces owner policy

Condor User Agent
– condor_schedd daemon
– allows a machine to submit jobs to a pool

Condor: Robustness

Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion

If an execute machine crashes, you only lose work done since the last checkpoint

Condor maintains a persistent job queue - if the submit machine crashes, Condor will recover

What’s Condor Good For?

Managing a large number of jobs
– You specify the jobs in a file and submit them to Condor, which runs them all and can send you email when they complete
– Mechanisms to help you manage huge numbers of jobs (1000’s), all the data, etc.
– Condor can handle inter-job dependencies (DAGMan)

Throughput

Checkpointing allows your job to run on opportunistic resources, not dedicated

Checkpointing permits migration - if a machine is no longer available, migrate

With remote system calls, you don’t even need an account on a machine where your job executes

Can your program work with Condor?

What kind of I/O does it do?

Does it use TCP/IP? (network sockets)

Can the job be resumed?

Multiple processes?
– fork(), pvm_addhost(), etc.

Typical IO

Interactive TTY

“Batch” TTY (just reads from STDIN and writes to STDOUT or STDERR, but you can redirect to/from files)

X Windows

NFS, AFS, or another network file system

Local file system

TCP/IP

Condor Universes

Different universes support different functionalities
– Vanilla
– Standard
– Scheduler
– PVM

Condor Universes: IO support

No support for interactive TTY

(x = supported, - = not supported)

                X11   NFS   LocalFiles   TCP
    Vanilla      x     x        -         x
    Standard     -     x        x         -
    Scheduler    x     x        x         x
    PVM          x     x        x         x

Condor Universes

PVM (Parallel Virtual Machine)
– Multiple processes in Condor

Scheduler
– The job is run on the submit machine, not on a remote execute machine
– Job is automatically restarted if the condor_schedd is shut down
– Used to schedule jobs

Submitting Jobs to Condor

Choosing a “Universe” for your job

Preparing your job
– Making it “batch-ready”
– Re-linking if checkpointing and remote system calls are desired (condor_compile)

Creating a submit description file

condor_submit your request to the User Agent (condor_schedd)

Making your job “batch-ready”

Must be able to run in the background: no interactive input, windows, GUI, etc.

Can still use STDIN, STDOUT, and STDERR, but files are used for these instead of the actual devices

If your job expects input from the keyboard, you have to put the input you want into a file

Preparing Your Job (cont’d)

If you are going to use the standard universe with checkpointing and remote system calls, you must re-link your job with Condor’s libraries

condor_compile gcc -o myjob myjob.c

Submit Description File

Tells Condor about your job:
– Which executable, universe, input, output and error files to use, command-line arguments, environment variables, any special requirements or preferences (more on this later)

Can describe many jobs at once (a “cluster”), each with different input, arguments, output, etc.

Example condor_submit File

    Universe = standard
    Executable = /home/wsu03/condor/my_job.condor
    Input = my_job.stdin
    Output = my_job.stdout
    Error = my_job.stderr
    Log = my_job.log
    Arguments = -arg1 -arg2
    InitialDir = /home/wsu03/condor/run_1
    Queue

Example Submit Description File

Submits a single job to the standard universe, specifies files for STDIN, STDOUT and STDERR, creates a UserLog, defines command line arguments, and specifies the directory the job should be run in

As if you did

    % cd /home/wsu03/condor/run_1
    % /home/wsu03/condor/my_job.condor -arg1 -arg2 \
          > my_job.stdout 2> my_job.stderr \
          < my_job.stdin

“Clusters” and “Processes”

A submit file describes one or more jobs

The collection of jobs is called a “cluster”

Each job is called a “process” or “proc”

A Condor “Job ID” is the cluster number, a period, and the proc number (e.g., 23.5)

Proc numbers always start at 0

A Cluster Submit Description File

    Universe = standard
    Executable = /home/wsu03/condor/my_job.condor
    Input = my_job.stdin
    Output = my_job.stdout
    Error = my_job.stderr
    Log = my_job.log
    Arguments = -arg1 -arg2
    InitialDir = /home/wsu03/condor/run_$(Process)
    Queue 500

A Cluster Submit Description File

“Queue 500” = submit 500 jobs at once

The initial directory for each job is specified with the $(Process) macro

$(Process) will be expanded to the process number for each job in the cluster; “run_0”, “run_1”, … “run_499” directories

All the input/output files will be in different directories
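
The expansion described here can be sketched in a few lines of Python. This is not how condor_submit is implemented; it is a minimal illustration that assumes a plain textual substitution of `$(Process)`. The function name `expand_cluster` and the cluster number 23 are hypothetical.

```python
def expand_cluster(cluster_id, initial_dir_template, count):
    """Return one record per job, mimicking what 'Queue <count>' produces."""
    jobs = []
    for proc in range(count):  # proc numbers always start at 0
        jobs.append({
            # Job ID = cluster number, a period, and the proc number
            "job_id": f"{cluster_id}.{proc}",
            # Assumption: exact, case-sensitive replacement of the macro text
            "initial_dir": initial_dir_template.replace("$(Process)", str(proc)),
        })
    return jobs


jobs = expand_cluster(23, "/home/wsu03/condor/run_$(Process)", 500)
```

Each job reads `my_job.stdin` and writes its output files relative to its own directory, so the 500 runs do not overwrite one another.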

condor_submit

condor_submit the-submit-file-name

condor_submit parses the file and creates a “ClassAd” that describes your job(s)

Creates the files you specified for STDOUT and STDERR

Sends your job’s ClassAd(s) and executable to the condor_schedd, which stores the job in its queue

Monitoring Your Jobs

Using condor_q

Using a “User Log” file

Using condor_status

Using condor_rm

Getting email from Condor

Using condor_history after completion

Using condor_q

Displays the status of your jobs, how much compute time each has accumulated, etc.

Many different options:
– A single job, a single cluster, all jobs that match a certain constraint, or all jobs
– Can view remote job queues, either individual queues, or “-global”

Using a “User Log” file

Specify in your submit file:
– Log = filename

Entries logged for:
– when it was submitted
– when it started executing
– if it is checkpointed or vacated
– if there are any problems, etc.

Using condor_status

Use the “-run” option to see
– Machines running jobs
– The user who submitted each job
– The machine they submitted from

Can also view the status of various submitters with “-submitter <name>”

Using condor_rm

Removes a job from the Condor queue

You can only remove jobs that you own

Root can condor_rm someone else’s jobs

You can give specific job IDs (cluster or cluster.proc), or you can remove all of your jobs with the “-a” option.

Getting Email from Condor

By default, Condor will send you email when your job completes

If you don’t want this email, put this in your submit file:
    notification = never

If you want email every time something happens to your job (checkpoint, exit, etc), use this:
    notification = always

Getting Email from Condor

If you only want email if your job exits with an error, use this:
    notification = error

By default, the email is sent to your account on the host you submitted from. If you want the email to go to a different address, use this:
    notify_user = [email protected]

Using condor_history

Once your job completes, it will no longer show up in condor_q

Now, you must use condor_history to view the job’s ClassAd

The status field (“ST”) will have either a “C” for “completed”, or an “X” if the job was removed with condor_rm

Classified Advertisements

A ClassAd is a set of named expressions
– Each named expression is an attribute

Expressions are similar to those in C …
– Constants, attribute references, operators

Classified Advertisements: Example

    MyType = "Machine"
    TargetType = "Job"
    Name = "froth.cs.wisc.edu"
    StartdIpAddr = "<128.105.73.44:33846>"
    Arch = "INTEL"
    OpSys = "SOLARIS26"
    VirtualMemory = 225312
    Disk = 35957
    KFlops = 21058
    Mips = 103
    LoadAvg = 0.011719
    KeyboardIdle = 12
    Cpus = 1
    Memory = 128
    Requirements = LoadAvg <= 0.300000 && KeyboardIdle > 15 * 60
    Rank = 0
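
To see what this ad's Requirements expression says about the machine right now, it can help to evaluate it by hand. The Python sketch below is an illustration only: it models the ad as a plain dictionary holding a few of the attributes above, and the function name `machine_requirements` is hypothetical, not part of Condor.

```python
# A few attributes copied from the machine ad above.
machine_ad = {
    "Name": "froth.cs.wisc.edu",
    "LoadAvg": 0.011719,
    "KeyboardIdle": 12,   # seconds since the last keyboard activity
    "Memory": 128,
}


def machine_requirements(ad):
    """Requirements = LoadAvg <= 0.300000 && KeyboardIdle > 15 * 60"""
    return ad["LoadAvg"] <= 0.3 and ad["KeyboardIdle"] > 15 * 60


# With KeyboardIdle = 12 the keyboard clause fails, so no job may start yet.
willing_now = machine_requirements(machine_ad)
# The same machine after 16 idle minutes would accept a job.
willing_later = machine_requirements({**machine_ad, "KeyboardIdle": 16 * 60})
```

The load average is already low enough; it is the recent keyboard activity that keeps this machine from accepting jobs.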

ClassAd Matching

ClassAds are always considered in pairs:
– Does ClassAd A match ClassAd B (and vice versa)?
– This is called “2-way matching”

If the same attribute appears in both ClassAds, you can specify which attribute you mean by putting “MY.” or “TARGET.” in front of the attribute name

ClassAd Matching “Example”

ClassAd A

    MyType = "Apartment"
    TargetType = "ApartmentRenter"
    SquareArea = 3500
    RentOffer = 1000
    OnBusLine = True
    Rank = UnderGrad==False + TARGET.RentOffer
    Requirements = MY.RentOffer - TARGET.RentOffer < 150

ClassAd B

    MyType = "ApartmentRenter"
    TargetType = "Apartment"
    UnderGrad = False
    RentOffer = 900
    Rank = 1/(TARGET.RentOffer + 100.0) + 50*HeatIncluded
    Requirements = OnBusLine && SquareArea > 2700
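
The two-way check on these ads can be sketched in Python. This is a simplified illustration, not Condor's matchmaker: each ad is a dictionary, each Requirements expression is a hand-translated function of (my, target), and the Rank expressions are left out. The name `two_way_match` is hypothetical.

```python
apartment = {
    "MyType": "Apartment", "TargetType": "ApartmentRenter",
    "SquareArea": 3500, "RentOffer": 1000, "OnBusLine": True,
    # Requirements = MY.RentOffer - TARGET.RentOffer < 150
    "Requirements": lambda my, target: my["RentOffer"] - target["RentOffer"] < 150,
}

renter = {
    "MyType": "ApartmentRenter", "TargetType": "Apartment",
    "UnderGrad": False, "RentOffer": 900,
    # Requirements = OnBusLine && SquareArea > 2700 (both come from the other ad)
    "Requirements": lambda my, target: target["OnBusLine"] and target["SquareArea"] > 2700,
}


def two_way_match(a, b):
    """True only if the types line up and each ad's Requirements accepts the other."""
    types_ok = a["TargetType"] == b["MyType"] and b["TargetType"] == a["MyType"]
    return bool(types_ok
                and a["Requirements"](a, b)
                and b["Requirements"](b, a))


matched = two_way_match(apartment, renter)
```

Here the apartment accepts the renter because 1000 - 900 is below 150, and the renter accepts the apartment because it is on a bus line and larger than 2700. A renter offering only 800 would be rejected by the apartment's side of the check.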

ClassAds in the Condor System

ClassAds allow Condor to be a general system
– Constraints and ranks on matches expressed by the entities themselves
– Only priority logic integrated into the Match-Maker

All principal entities in the Condor system are represented by ClassAds
– Machines, Jobs, Submitters

ClassAds: Example for Machines

    Friend = Owner == "tannenba" || Owner == "wright"
    ResearchGroup = Owner == "jbasney" || Owner == "raman"
    Trusted = Owner != "rival" && Owner != "riffraff"
    Requirements = Trusted && ( ResearchGroup || (LoadAvg < 0.3 && KeyboardIdle > 15*60) )
    Rank = Friend + ResearchGroup*10

ClassAd Machine Example

Machine will never start a job submitted by “rival” or “riffraff”

If someone from ResearchGroup (“jbasney” or “raman”) submits a job, it will always run

If anyone else submits a job, it will only run here if the keyboard has been idle for more than 15 minutes and the load average is less than 0.3

Machine Rank Example Described

If the machine is running a job submitted by owner “foo”, it will give this job a Rank of 0, since foo is neither a friend nor in the same research group

If “wright” or “tannenba” submits a job, it will be ranked at 1 (since Friend will evaluate to 1 and ResearchGroup is 0)

If “raman” or “jbasney” submits a job, it will have a rank of 10

While a machine is running a job, that job can be preempted in favor of a higher ranked job
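
The machine policy from this example can be traced in Python. The sketch below is a hand translation of the Friend, ResearchGroup, Trusted, Requirements, and Rank expressions; the function name `machine_policy` is hypothetical, and booleans are converted to 0 or 1 the way the slide's arithmetic assumes.

```python
def machine_policy(owner, load_avg, keyboard_idle):
    """Return (requirements, rank) for a job submitted by `owner`.

    `keyboard_idle` is in seconds, matching the 15*60 in the expression.
    """
    friend = owner in ("tannenba", "wright")
    research_group = owner in ("jbasney", "raman")
    trusted = owner not in ("rival", "riffraff")
    requirements = trusted and (
        research_group or (load_avg < 0.3 and keyboard_idle > 15 * 60)
    )
    rank = int(friend) + int(research_group) * 10
    return requirements, rank
```

For instance, `machine_policy("raman", 2.0, 0)` is accepted with rank 10 even on a busy machine, while `machine_policy("foo", 0.1, 12)` is refused because the keyboard has been idle for only 12 seconds.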

ClassAds: Example for Jobs

    Requirements = Arch == "INTEL" && OpSys == "LINUX" && Memory > 20
    Rank = (Memory > 32) * ( (Memory * 100) + (IsDedicated * 10000) + Mips )

Job Example Described

The job must run on an Intel CPU, running Linux, with more than 20 megs of RAM

All machines with 32 megs of RAM or less are ranked at 0

Machines with more than 32 megs of RAM are ranked according to how much RAM they have, if the machine is dedicated (which counts a lot to this job!), and how fast the machine is, as measured in MIPS
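
A worked version of the job's Requirements and Rank may make the weighting easier to see. The Python below is an illustrative translation only; the machine dictionaries are made-up examples, and `job_requirements` and `job_rank` are hypothetical names.

```python
def job_requirements(machine):
    """Requirements = Arch == "INTEL" && OpSys == "LINUX" && Memory > 20"""
    return (machine["Arch"] == "INTEL"
            and machine["OpSys"] == "LINUX"
            and machine["Memory"] > 20)


def job_rank(machine):
    """Rank = (Memory > 32) * ((Memory * 100) + (IsDedicated * 10000) + Mips)"""
    return int(machine["Memory"] > 32) * (
        machine["Memory"] * 100
        + int(machine["IsDedicated"]) * 10000
        + machine["Mips"]
    )


# Hypothetical machines, for illustration only.
small = {"Arch": "INTEL", "OpSys": "LINUX", "Memory": 32, "IsDedicated": False, "Mips": 103}
desktop = {"Arch": "INTEL", "OpSys": "LINUX", "Memory": 128, "IsDedicated": False, "Mips": 103}
dedicated = {"Arch": "INTEL", "OpSys": "LINUX", "Memory": 64, "IsDedicated": True, "Mips": 103}

ranks = {name: job_rank(m) for name, m in
         [("small", small), ("desktop", desktop), ("dedicated", dedicated)]}
```

All three machines satisfy Requirements, but the 32 MB machine ranks 0, and the dedicated 64 MB machine outranks the non-dedicated 128 MB desktop because dedication is worth 10000 points.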

ClassAd Attributes in your Pool

Condor defines a number of attributes by default, which are listed in the User Manual (“About Requirements and Rank”)

To see if machines in your pool have other attributes defined, use:
– condor_status -long <hostname>

A custom-defined attribute might not be defined on all machines in your pool, so you’ll probably want to use “meta-operators”

ClassAd “Meta-Operators”

Meta operators allow you to compare against “UNDEFINED” as if it were a real value:
– =?= is “meta-equal-to”
– =!= is “meta-not-equal-to”
– Color != "Red" (non-meta) would evaluate to UNDEFINED if Color is not defined
– Color =!= "Red" would evaluate to True if Color is not defined, since UNDEFINED is not "Red"
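
The difference between the ordinary and the meta operators can be modelled with a sentinel value. The sketch below is a simplification under stated assumptions: `UNDEFINED` is a stand-in object, the function names are hypothetical, and the real operators' type rules are not modelled.

```python
UNDEFINED = object()  # stand-in for the ClassAd UNDEFINED value


def ne(a, b):
    """Ordinary != : the result is UNDEFINED if either side is UNDEFINED."""
    if a is UNDEFINED or b is UNDEFINED:
        return UNDEFINED
    return a != b


def meta_eq(a, b):
    """=?= : UNDEFINED is compared as if it were a real value."""
    if a is UNDEFINED or b is UNDEFINED:
        return a is b
    return a == b


def meta_ne(a, b):
    """=!= : always True or False, never UNDEFINED."""
    return not meta_eq(a, b)


color = UNDEFINED  # the attribute Color is not defined on this machine
plain_result = ne(color, "Red")
meta_result = meta_ne(color, "Red")
```

With Color undefined, the plain comparison yields UNDEFINED while the meta comparison yields True, which is why meta-operators are the safer choice for custom attributes that some machines lack.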

Priorities In Condor

User Priorities
– Priorities between users in the pool to ensure fairness
– The lower the value, the better the priority

Job Priorities
– Priorities that users give to their own jobs to determine the order in which they will run
– The higher the value, the better the priority
– Only matters within a given user’s jobs

User Priorities in Condor

Each active user in the pool has a user priority

Viewed or changed with condor_userprio

The lower the number, the better

A given user’s share of available machines is inversely related to the ratio between user priorities.
– Example: Fred’s priority is 10, Joe’s is 20. Fred will be allocated twice as many machines as Joe.
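
The Fred and Joe example can be checked numerically. The sketch below assumes the simplest reading of "inversely related": each user's share is proportional to 1/priority. That proportional rule, the function name `machine_shares`, and the pool size of 30 are assumptions made for illustration, not Condor's actual allocation algorithm.

```python
def machine_shares(priorities, machines):
    """Split `machines` among users in proportion to 1 / priority value."""
    weights = {user: 1.0 / prio for user, prio in priorities.items()}
    total = sum(weights.values())
    return {user: machines * w / total for user, w in weights.items()}


shares = machine_shares({"fred": 10, "joe": 20}, machines=30)
```

Under this assumption Fred's share comes out at about 20 machines and Joe's at about 10, matching the two-to-one ratio on the slide.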

User Priorities in Condor, cont.

Condor continuously adjusts user priorities over time
– machines allocated > priority, priority worsens
– machines allocated < priority, priority improves

Priority Preemption
– Higher priority users will grab machines away from lower priority users (thanks to Checkpointing…)
– Starvation is prevented
– Priority “thrashing” is prevented

Job Priorities in Condor

Can be set at submit-time in your description file with:
    prio = <number>

Can be viewed with condor_q

Can be changed at any time with condor_prio

The higher the number, the more likely the job will run (only among the jobs of an individual user)

Managing a Large Cluster of Jobs

Condor can manage huge numbers of jobs

Special features of the submit description file make this easier

Condor can also manage inter-job dependencies with condor_dagman
– For example: job A should run first, then run jobs B and C; when those finish, submit D, etc…
– We’ll discuss DAGMan later

Submitting a Large Cluster

Each process runs in its own directory: InitialDir = dir.$(process)

Can either have multiple Queue entries, or put a number after Queue to tell Condor how many to submit: Queue 1000

A cluster is more efficient: Your jobs will run faster, and they’ll use less space

Can only have one executable per cluster: Different executables must be different clusters!

Inter-Job Dependencies with DAGMan

DAGMan handles a set of jobs that must be run in a certain order

Also provides “pre” and “post” operations, so you can have a program or script run before each job is submitted and after it completes

Robust: handles errors and submit-machine crashes

Using DAGMan

You define a DAG description file, which is similar in function to the submit file you give to condor_submit

DAGMan restrictions:
– Each job in the DAG must be in its own cluster (for now)
– All jobs in the DAG must have a User Log and must share the same file

DAGMan Description File

# is a comment

First section names the jobs in your DAG and associates a submit description file with each job

Second (optional) section defines PRE and POST scripts to run

Final section defines the job dependencies

Example DAGMan File

    Job A A.submit
    Job B B.submit
    Job C C.submit
    Job D D.submit
    Script PRE D d_input_checker
    Script POST A a_output_processor A.out
    PARENT A CHILD B C
    PARENT B C CHILD D
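
The ordering this file implies can be worked out with a topological sort. The Python sketch below is not DAGMan; it is a small illustration that parses only the Job and PARENT ... CHILD lines (Script lines are ignored) and uses the standard-library `graphlib` module, which needs Python 3.9 or later. The names `parse_dag` and `run_waves` are hypothetical.

```python
from graphlib import TopologicalSorter

DAG_TEXT = """\
Job A A.submit
Job B B.submit
Job C C.submit
Job D D.submit
Script PRE D d_input_checker
Script POST A a_output_processor A.out
PARENT A CHILD B C
PARENT B C CHILD D
"""


def parse_dag(text):
    """Map each job name to the set of jobs that must finish before it."""
    parents = {}
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue  # blank line or comment
        keyword = words[0].upper()
        if keyword == "JOB":
            parents.setdefault(words[1], set())
        elif keyword == "PARENT":
            split = [w.upper() for w in words].index("CHILD")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
    return parents


def run_waves(parents):
    """Group jobs into waves; every job in a wave has all its parents done."""
    sorter = TopologicalSorter(parents)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = run_waves(parse_dag(DAG_TEXT))
```

For this file the waves are A alone, then B and C together, then D, which is the order described earlier: A first, then B and C, and D once both have finished.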

Setting up a DAG for Condor

Create all the submit description files for the individual jobs

Prepare any executables you plan to use

Can have a mix of Vanilla and Standard jobs

Set up any PRE/POST commands or scripts you wish to use

Submitting a DAG to Condor

condor_submit_dag DAG-description-file

This will check your input file for errors and submit a copy of condor_dagman as a scheduler universe job with all the necessary command-line arguments

Removing a DAG

On shutdown, DAGMan will remove any jobs that are currently in the queue that are associated with its DAG

Once all jobs are gone, DAGMan itself will exit, and the scheduler universe job will be removed from the queue

Typical Problems

Special requirements expressions for vanilla jobs

You didn’t submit it from a directory that is shared

Condor isn’t running as root

You don’t have your file permissions set up correctly

Special Requirements Expressions for Vanilla Jobs

When you submit a vanilla job, Condor automatically appends two extra Requirements:
– UID_DOMAIN == <submit_uid_domain>
– FILESYSTEM_DOMAIN == <submit_fs>

Since there are no remote system calls with Vanilla jobs, they depend on a shared file system and a common UID space to run as you and access your files

Special Requirements Expressions for Vanilla Jobs

By default, each machine in your pool is in its own UID_DOMAIN and FILESYSTEM_DOMAIN, so your pool administrator has to configure your pool specially if there really is a common UID space and a network file system

If you don’t have an account on the remote system, Vanilla jobs won’t work

Shared Files for Vanilla Jobs

Not all directories may be shared

Initialdir = /tmp will probably cause trouble for Vanilla jobs!

You must be sure to set Initialdir to a shared directory (or cd into it to run condor_submit) for Vanilla jobs

Why Don’t My Jobs Run?

Try condor_q -analyze

Try specifying a User Log for your job

Look at condor_userprio: maybe you have a low priority and higher priority users are being served

Problems with file permissions or network file systems

Look at the SchedLog

Using condor_q -analyze

Analyzes your job’s ClassAd, gets all the ClassAds of the machines in the pool, and tells you what’s going on:

Will report errors in your Requirements expression (impossible to match, etc.)

Will tell you about user priorities in the pool (other people have better priority)

Looking at condor_userprio

You can look at condor_userprio yourself

If your priority value is a really high number (because you’ve been running a lot of Condor jobs), other users will have priority to run jobs in your pool

File Permissions in Condor

If Condor isn’t running as root, the condor_shadow process runs as the user the condor_schedd is running as (usually “condor”)

You must grant this user write access to your output files, and read access to your input files (both STDOUT, STDIN from your submit file, as well as files your job explicitly opens)

File Permissions in Condor

Often, there will be a “condor” group and you can make your files owned and write-able by this group

For vanilla jobs, even if the UID_DOMAIN setting is correct, and they match for your submit and execute machines, if Condor isn’t running as root, your job will be started as user Condor, not as you!

Problems with NFS in Condor

For NFS, sometimes the administrators will set up read-only mounts, or have UIDs remapped for certain partitions (the classic example is root = nobody, but modern NFS can do arbitrary remappings)

Problems with NFS in Condor

If your pool uses NFS automounting, the directory that Condor thinks is your InitialDir might not exist on a remote machine

With automounting, you always need to specify InitialDir explicitly
– InitialDir = /home/me/...

Problems with AFS in Condor

If your pool uses AFS, the condor_shadow, even if it’s running with your UID, will not have your AFS token.

You must grant an unauthenticated AFS user the appropriate access to your files

Some sites provide a better alternative than world-writable files
– Host ACLs
– Network-specific ACLs

Looking at the SchedLog

The log file of the condor_schedd, the “SchedLog” file, may give you a clue if there are problems.

Find it with: condor_config_val schedd_log

You might need your pool administrator to turn on a higher “debugging level” to see more verbose output

Other User Features

Submit-Only installation

Heterogeneous Submit

PVM jobs

Submit-Only Installation

Can install just a condor_master and condor_schedd on your machine

Can submit jobs into a remote pool

Special option to condor_install

Heterogeneous Submit

The job you submit doesn’t have to be the same platform as the machine you submit from
– Maybe you have access to a pool that is full of Alphas, but you have a Sparc on your desk, and moving all your data is a pain

You can take an Alpha binary, copy it to your Sparc, and submit it with a requirements expression that says you need to run on ALPHA/OSF1

PVM Jobs in Condor

Condor can run parallel applications
– PVM applications now
– Future work includes support for MPI

Master-Worker Paradigm

What does Condor-PVM do?

How to compile and submit Condor-PVM jobs

Master-Worker Paradigm

Condor-PVM is designed to run PVM applications based on the master-worker paradigm.

Master
– has a pool of work, sends pieces of work to the workers, manages the work and the workers

Worker
– gets a piece of work, does the computation, sends the result back

What does Condor-PVM do?

Condor acts as the PVM resource manager.

All pvm_addhost requests get re-mapped to Condor.
– Condor dynamically constructs PVM virtual machines out of non-dedicated desktop machines.

When a machine leaves the pool, the user gets notified via the normal PVM notification mechanisms.

Submission of Condor-PVM jobs

Binary Compatible
– Compile and link with PVM library just as normal PVM applications. No need to link with Condor.

In the submit description file, set:
    universe = PVM
    machine_count = <min>..<max>

Resource Agent Configuration Expressions

[Figure: flowchart of the resource agent policy expressions START, WANT_SUSPEND, SUSPEND, WANT_VACATE, VACATE, and KILL, with True/False branches between them.]

Resource Agent Configuration

Default Setup

    WANT_VACATE : True
    WANT_SUSPEND : True
    START : Keyboard_Idle && CPU_Idle
    SUSPEND : Keyboard_Busy || CPU_Busy
    CONTINUE : Keyboard and CPU idle again
    VACATE : If Suspended > 10 minutes
    KILL : If spent > 10 minutes in VACATE state
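
The default setup reads naturally as a small state machine. The Python sketch below is a loose illustration of that reading, not the condor_startd's real logic: the state names ("unclaimed", "running", "suspended", "vacating", "killed") are simplified labels chosen for this example, and checking CONTINUE before VACATE is an assumption.

```python
def next_state(state, keyboard_idle, cpu_idle, minutes_in_state):
    """Return the next state under the default policy on this slide."""
    machine_idle = keyboard_idle and cpu_idle
    if state == "unclaimed":
        # START: Keyboard_Idle && CPU_Idle
        return "running" if machine_idle else "unclaimed"
    if state == "running":
        # SUSPEND: Keyboard_Busy || CPU_Busy
        return "running" if machine_idle else "suspended"
    if state == "suspended":
        if machine_idle:
            return "running"  # CONTINUE: keyboard and CPU idle again
        # VACATE: suspended for more than 10 minutes
        return "vacating" if minutes_in_state > 10 else "suspended"
    if state == "vacating":
        # KILL: more than 10 minutes spent in the VACATE state
        return "killed" if minutes_in_state > 10 else "vacating"
    return state
```

A job on a machine whose owner returns is first suspended, resumes if the owner leaves again within 10 minutes, and is otherwise asked to vacate, with a hard kill only if vacating itself takes more than 10 minutes.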

condor_master

Watches/restarts other daemons

Sends Email if suspicious problems arise

Runs condor_preen

Provides administrator remote control

Condor Administrator Commands

    condor_off [ hostname … ]
    condor_on
    condor_restart
    condor_reconfig
    condor_vacate

Can be used by the Owner also

Host-based Access Control

HOSTALLOW_* and HOSTDENY_* settings grant machines (subnets, domains) different access levels:
– READ access
– WRITE access
– ADMINISTRATOR access
– OWNER access

Host-based Access Control Ex.

    HOSTDENY_READ = *.com
    HOSTALLOW_WRITE = *.cs.wright.edu
    HOSTDENY_WRITE = ppp*.wright.edu, 172.44.*
    HOSTALLOW_ADMINISTRATOR = osis111.cs.wright.edu
    HOSTALLOW_OWNER = $(FULL_HOSTNAME), $(HOSTALLOW_ADMINISTRATOR)
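
How such allow and deny lists might combine can be sketched with wildcard matching. The Python below is an assumption-laden illustration, not Condor's implementation: it assumes that a deny match always wins, that a non-empty allow list admits only hosts it matches, and that an empty allow list admits everyone not denied. Check the Condor manual for the exact rules. The function name `is_allowed` is hypothetical.

```python
from fnmatch import fnmatch


def is_allowed(identities, allow, deny):
    """Decide access for a host known by `identities` (hostname and/or IP).

    Assumed precedence: deny first, then allow, else open if no allow list.
    """
    def matches(patterns):
        return any(fnmatch(ident, pat) for ident in identities for pat in patterns)

    if matches(deny):
        return False
    if allow:
        return matches(allow)
    return True


# WRITE access, using the patterns from the example above.
WRITE_ALLOW = ["*.cs.wright.edu"]
WRITE_DENY = ["ppp*.wright.edu", "172.44.*"]

lab_host_ok = is_allowed(["osis111.cs.wright.edu"], WRITE_ALLOW, WRITE_DENY)
dialup_ok = is_allowed(["ppp12.wright.edu"], WRITE_ALLOW, WRITE_DENY)
```

Under these assumptions the departmental host gets WRITE access, while a dial-up host or any host with a 172.44 address is refused even if its name would match the allow list.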

Configuration File Hierarchy

condor_config
– Pool-wide default
– Condor pool administrator’s requirements

condor_config.local
– Overrides for a specific machine
– Reflects Owner’s requirements

condor_config.root
– System Administrator requirements

Obtaining Condor

Condor accounts available! [email protected]

Condor executables can be downloaded from http://www.cs.wisc.edu/condor

Complete Users and Administrators manual http://www.cs.wisc.edu/condor/manual