Transcript

Working with Condor

Links:

Condor’s homepage: http://www.cs.wisc.edu/condor/

Condor manual (for the version currently used): http://www.cs.wisc.edu/condor/manual/v6.8/

Table of contents

Condor overview
Useful Condor commands
Vanilla universe
Macros
Standard universe
Java universe
Matlab in Condor
ClassAds
DagMan

Condor overview

Condor is a system for running lots of jobs on a (preferably large) cluster of computers.

Condor is a specialized workload management system for compute-intensive jobs.

Condor overview

Condor’s inner structure: Condor is built of several daemons:

condor_master: This daemon is responsible for keeping all the rest of the Condor daemons running

condor_startd: This daemon represents a given machine to the Condor pool. It advertises attributes about the machine it’s running on. Must run on machines accepting jobs.

condor_schedd: This daemon is responsible for submitting jobs to condor. It manages the job queue (each machine has one!). Must run on machines submitting jobs.

condor_collector: Runs only on the Condor server. This daemon is responsible for collecting all the information about the status of a Condor pool. All other daemons periodically send updates to the collector.

condor_negotiator: Runs only on the condor server. This daemon is responsible for all the match-making within the Condor system.

condor_ckpt_server: Runs only on the checkpointing server. This is the checkpoint server. It services requests to store and retrieve checkpoint files.

Condor overview

Condor uses user priorities to allocate machines to users in a fair manner. A lower numerical value for user priority means higher priority. Each user starts out with the best user priority, 0.5.

If the number of machines a user currently has is greater than his priority, then his user priority will worsen (numerically increase) over time.

If the number of machines a user currently has is lower than his priority, then his priority will improve over time.

Use condor_userprio [-allusers] to see user priorities.

Useful Condor commands

condor_status shows all of the computers connected to Condor (not all are accepting jobs). Useful arguments:

-claimed shows only machines running Condor jobs (and who runs them).

-available shows only machines which are willing to run jobs now.

-long displays entire ClassAds (discussed later on).

-constraint <const.> shows only resources matching the given constraint.

Useful Condor commands

condor_status attributes

Arch: INTEL means a 32-bit Linux machine, X86_64 means a 64-bit Linux machine.

Activity:
"Idle" There is no job activity.
"Busy" A job is busy running.
"Suspended" A job is currently suspended.
"Vacating" A job is currently checkpointing.
"Killing" A job is currently being killed.
"Benchmarking" The startd is running benchmarks.

Useful Condor commands

condor_status more attributes

State:
"Owner" The machine owner is using the machine, and it is unavailable to Condor.
"Unclaimed" The machine is available to run Condor jobs, but a good match is either not available or not yet found.
"Matched" The Condor central manager has found a good match for this resource, but a Condor scheduler has not yet claimed it.
"Claimed" The machine is claimed by a remote machine and is probably running a job.
"Preempting" A Condor job is being preempted (possibly via checkpointing) in order to clear the machine, either for a higher priority job or because the machine owner wants the machine back.

Useful Condor commands

condor_q shows the state of jobs submitted from the calling computer (the one running condor_q). Useful arguments:

-analyze Perform schedulability analysis on jobs. Useful to see why a scheduled job isn't running, and whether it will ever run.

-dag Sort DAG jobs under their DAGMan.

-constraint <const.> Show only jobs matching the given constraint (ClassAds).

-global (-g) Get the global queue.

-run Get information about running jobs.

Useful Condor commands

condor_rm removes a scheduled job from the queue (of the scheduling computer).

condor_rm cluster.proc Remove the given job.
condor_rm cluster Remove the given cluster of jobs.
condor_rm user Remove all jobs owned by user.
condor_rm -all Remove all jobs.

Vanilla universe jobs

The Vanilla universe is used for running jobs with no special needs or features.

In the Vanilla universe, Condor runs the job just as it would run without Condor.

Start with a simple example.c:

#include <stdio.h>

int main() {
    printf("hello condor");
    return 0;
}

Compile as usual: gcc example.c -o example

Vanilla universe jobs

In order to submit the job to Condor we use the condor_submit command.

Usage: condor_submit <sub_file>

A simple submit file (sub_example):

Universe = Vanilla
Executable = example
Log = test.log
Output = test.out
Error = test.error
Queue

Notice that the submission commands are case insensitive.

Vanilla universe jobs

There are a few other useful commands:

arguments = arg1 arg2 ...
Run the executable with the given arguments.

Input = <input file>
The file given is used as standard input.

environment = "<var1>=<value1> <var2>=<value2> ..."
Runs the job with the given environment variables. To include spaces in an entry, use single quotes. To insert a quotation mark, use a doubled double-quote mark. Example:

environment = " a=""quote"" b='a ''b'' c' "
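Putting these commands together, a submit file for a job that takes arguments and reads standard input might look like this (a sketch; the arguments, input file, and variable names are hypothetical):

```
Universe = Vanilla
Executable = example
# hypothetical arguments and input file, for illustration only
arguments = 10 20
Input = data.txt
environment = "TMPDIR=/tmp MODE=fast"
Log = test.log
Output = test.out
Error = test.error
Queue
```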

Vanilla universe jobs

getenv = <True | False>

If getenv is set to True, then condor_submit will copy all of the user's current shell environment variables at the time of job submission into the job ClassAd. The job will therefore execute with the same set of environment variables that the user had at submit time. Defaults to False.

Vanilla universe jobs

A more advanced submission:

Universe = Vanilla
Executable = example
Log = test.$(cluster).$(process).log
Output = test.$(cluster).$(process).out
Error = test.$(cluster).$(process).error
Queue 7

Here we see a use of predefined macros. The $(cluster) macro gives us the value of the ClusterId job ClassAd attribute, and the $(process) macro supplies the value of the ProcId job ClassAd attribute.

Macros

More on macros:

A macro is defined as follows: <macro_name> = string
It can then be used by writing $(macro_name).

$$(attribute) is used to get a ClassAd attribute from the machine running the job.

$ENV(variable) gives us the environment variable 'variable' from the submitting machine, expanded at submit time.

For more on macros go to Condor's manual, condor_submit section.
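As a sketch of how these macros might appear together in a submit file (the macro name, arguments, and attribute are illustrative):

```
# hypothetical submit fragment, for illustration only
mydir = /tmp/results
Universe = Vanilla
Executable = example
# $ENV(HOME) is expanded at submit time from the submitter's environment;
# $$(OpSys) is filled in from the ClassAd of the machine the job runs on
arguments = $(mydir) $ENV(HOME) $$(OpSys)
Output = test.$(cluster).$(process).out
Queue
```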

Other universes

Standard universe Java universe

Standard universe

The Standard universe provides checkpointing and remote system calls.

Remote system calls: all system calls made by the job running in Condor are made on the submitting computer.

Checkpointing: save a snapshot of the current state of the running job, so the job can be restarted from the saved state in case of:
Migration to another computer
Machine crash or failure

Standard universe

In order to execute a program in the Standard universe it must be relinked with Condor's library.

To do so use condor_compile with your usual link command. Example: condor_compile gcc example.c

To manually cause a checkpoint use: condor_checkpoint hostname

There are some restrictions on jobs running in the standard universe:

Standard universe - restrictions

Multi-process jobs are not allowed. This includes system calls such as fork(), exec(), and system().

Interprocess communication is not allowed. This includes pipes, semaphores, and shared memory.

Network communication must be brief. A job may make network connections using system calls such as socket(), but a network connection left open for long periods will delay checkpointing and migration.

Sending or receiving the SIGUSR2 or SIGTSTP signals is not allowed. Condor reserves these signals for its own use. Sending or receiving all other signals is allowed.

Alarms, timers, and sleeping are not allowed. This includes system calls such as alarm(), getitimer(), and sleep().

Standard universe - restrictions

Multiple kernel-level threads are not allowed. However, multiple user-level threads are allowed.

Memory mapped files are not allowed. This includes system calls such as mmap() and munmap().

File locks are allowed, but not retained between checkpoints.

All files must be opened read-only or write-only. A file opened for both reading and writing will cause trouble if a job must be rolled back to an old checkpoint image. For compatibility reasons, a file opened for both reading and writing will result in a warning but not an error.

Your job must be statically linked (on Digital Unix (OSF/1), HP-UX, and Linux, and therefore on our school's machines).

Reading from or writing to files larger than 2 GB is not supported.

Java universe

Used to run Java programs. Example submit description file:

universe = java
executable = Example.class
arguments = Example
output = Example.output
error = Example.error
queue

Notice that the first argument is the main class of the job.

The JVM must be informed when submitting jar files; this is done in the following way: jar_files = example.jar

To run on a machine with a specific Java version: Requirements = (JavaVersion == "1.5.0_01")

Options to the Java VM itself can be set in the submit description file: java_vm_args = -DMyProperty=Value -verbose:gc ... These options go after the java command but before the main class (usage: java [options] class [args...]). Do not use this to set the classpath (Condor handles that itself).
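A sketch combining these options in one submit file (the jar name, program argument, and VM options are hypothetical):

```
universe = java
executable = Example.class
# main class first, then the program's own arguments
arguments = Example input.txt
jar_files = example.jar
Requirements = (JavaVersion == "1.5.0_01")
java_vm_args = -Xmx256m -verbose:gc
output = Example.output
error = Example.error
queue
```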

Matlab Functions

Matlab functions/scripts are written in .m files.

Structure: function {ret_var =} func_name(arg1, arg2, ...)

Running Matlab functions in condor

First method: calling Matlab. What we want to do is run:

matlab -nodisplay -nojvm -nosplash -r 'func(arg1, arg2, ...)'

Instead of transferring the Matlab executable we'll write a script (run.csh):

#!/bin/csh -f
matlab -nodisplay -nojvm -nosplash -r "$*"

Running Matlab functions in condor

First method: calling Matlab. The submission file:

executable = run.csh
log = mat.log
error = mat.error
output = mat.output
universe = vanilla
getenv = True
arguments = func(arg1, arg2, ...)
queue 1

Notice that in order to run Matlab we must set getenv = True.

Running Matlab functions in condor

Second method: compiling the function. First, we compile our Matlab script, example.m, into an executable:

mcc -mv example.m

The -v option is not mandatory; it is used to show details of the compilation process.

The files required for running will be "example" and example.ctf.

The compiled function requires Matlab's shared libraries in order to run. So, we'll send Condor a script which defines the necessary environment variables and then runs the executable.

Running Matlab functions in condor

Second method: compiling the function. The script:

#!/bin/tcsh

setenv LD_LIBRARY_PATH /usr/local/stow/matlab-7.0.4-R14SP2/lib/matlab-7.0.4-R14SP2/bin/glnx86:/usr/local/stow/matlab-7.0.4-R14SP2/lib/matlab-7.0.4-R14SP2/sys/os/glnx86:/usr/local/stow/matlab-7.0.4-R14SP2/lib/matlab-7.0.4-R14SP2/sys/java/jre/glnx86/jre1.4.2/lib/i386/client:/usr/local/stow/matlab-7.0.4-R14SP2/lib/matlab-7.0.4-R14SP2/sys/java/jre/glnx86/jre1.4.2/lib/i386:/usr/local/stow/matlab-7.0.4-R14SP2/lib/matlab-7.0.4-R14SP2/sys/opengl/lib/glnx86:

setenv XAPPLRESDIR /usr/local/stow/matlab-7.0.4-R14SP2/lib/matlab-7.0.4-R14SP2/X11/app-defaults

setenv LD_PRELOAD /lib/libgcc_s.so.1

./multi $1 $2

ClassAds

ClassAds are a flexible mechanism for representing the characteristics and constraints of machines and jobs in the Condor system.

Condor acts as a matchmaker for ClassAds. ClassAds are analogous to the classified advertising section in a newspaper.

All machines running Condor advertise their attributes. A machine also advertises under what conditions it is willing to run a job, and what type of job it would prefer.

When submitting a job, you specify your requirements and preferences. These attributes are bundled up into a job ClassAd.

ClassAds

ClassAd expressions are formed by composing literals, attribute references and other sub-expressions with operators and functions.

Literals may be:
Integers (including TRUE = 1 and FALSE = 0)
Reals
Strings: a list of characters between two double-quote characters. Use \ to include the following character in the string, irrespective of what that character is.
The UNDEFINED keyword (case insensitive)
The ERROR keyword (case insensitive)

ClassAds

Attributes

A pair (name, expression) is called an attribute. The attribute name is case insensitive.

An optional scope resolution prefix may be added: "MY." or "TARGET.".
MY. refers to an attribute defined in the current ClassAd.
TARGET. refers to an attribute defined in the ClassAd in which the current ClassAd is evaluated.

If no scope prefix is given, first try "MY."; if not found, try "TARGET."; if not found, try the ClassAd environment; if still not found, the value is UNDEFINED.

If there is a circular dependency between two ClassAds (e.g. A uses B and B uses A) then the value is ERROR.

ClassAds

Operators

The operators are similar to those of the C language. All operators are case insensitive for strings, with the following exceptions:
=?= the "is identical to" operator (similar to ==)
=!= the "is not identical to" operator (similar to !=)
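For example (a sketch; the attribute name is illustrative), == and =?= behave differently when one side is UNDEFINED:

```
TARGET.JavaVersion == "1.5.0_01"     evaluates to UNDEFINED if JavaVersion is undefined
TARGET.JavaVersion =?= "1.5.0_01"    evaluates to FALSE if JavaVersion is undefined
TARGET.JavaVersion =!= UNDEFINED     TRUE exactly when the attribute is defined
```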

ClassAds

Predefined functions. Examples:

Integer strcmp(AnyType Expr1, AnyType Expr2)
String strcat(AnyType Expr1 [, AnyType Expr2 ...])
Boolean isInteger(AnyType Expr)

Function names are case insensitive.

For a full list of the functions refer to the user manual, section 4.1.1.4.

ClassAds

When submitting a job, one gives requirements; only machines satisfying them may run the job.

One can also rank the machines available to run the job, and choose the highest-ranked machine to run it.

This is done using the Requirements and Rank commands in the submission file.

ClassAds submission commands

Requirements = <ClassAd Boolean Expression>

The job will run on a machine only if the requirements expression evaluates to TRUE on that machine.

Example: requirements = Memory >= 64 && Arch == "intel"

The running machine must have at least 64 MB of RAM and the INTEL architecture.

The computers in our school have two possible architecture names: "INTEL" if it is a 32-bit computer, or "X86_64" if it is a 64-bit computer.
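As a sketch, a requirements expression that accepts either of these architectures (the memory threshold is arbitrary) might look like:

```
# run on either 32-bit or 64-bit machines with enough memory
requirements = (Arch == "INTEL" || Arch == "X86_64") && Memory >= 512
```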

ClassAds submission commands

By default Condor adds the following to the requirements of a job:

Arch and OpSys the same as the submitting computer.

Disk >= DiskUsage. The DiskUsage attribute is initialized to the size of the executable plus the size of any files specified in a transfer_input_files command.

(Memory * 1024) >= ImageSize, to ensure the target machine has enough memory to run your job.

If Universe is set to Vanilla, FileSystemDomain is set equal to the submit machine's FileSystemDomain.

In order to see a submitted job's requirements (along with everything else about the job) use condor_q -l.

ClassAds submission commands

rank = <ClassAd Float Expression>

Sorts all matching machines by the given expression. Condor will give the job the machine with the highest rank.

The expression is a numeric expression (where boolean sub-expressions evaluate to 1.0 or 0.0).
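A sketch of a rank expression (the weights are arbitrary, chosen only to illustrate mixing a numeric attribute with a boolean sub-expression):

```
# prefer faster machines, with a strong bonus for large memory;
# the boolean (Memory >= 1024) contributes 1.0 or 0.0
rank = KFlops + 100000 * (Memory >= 1024)
```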

DagMan

Use a directed acyclic graph (DAG) to represent a set of jobs to be run in a certain order.

A basic DAG submit file:

JOB name1 submit_file1
JOB name2 submit_file2

If "DONE" is specified at the end of a JOB line then that job is considered complete and is not submitted.

DagMan

Additional DAG commands:

SCRIPT: sets processing to be done before/after running the job. These "scripts" run on the submitting machine.

SCRIPT PRE job_name executable [arguments]
Runs the executable before job_name is submitted.

SCRIPT POST job_name executable [arguments]
Runs the executable after job_name has completed its execution under Condor.
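A sketch of SCRIPT in a DAG file (the node name and script names are hypothetical):

```
# prepare input before A is submitted, and validate its output afterwards
JOB A a.submit
SCRIPT PRE A make_input.sh
SCRIPT POST A check_output.sh A.out
```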

DagMan

Additional DAG commands:

PARENT ... CHILD
Used to describe the dependencies between the jobs.
PARENT p1 p2 ... CHILD c1 c2 ...
Makes all pi's parents of all ci's (i.e. the ci's will be submitted only after all pi's have completed their execution).

RETRY
RETRY jobName NumOfRetries [UNLESS-EXIT value]
If the job fails it runs again, at most NumOfRetries times. If UNLESS-EXIT is specified and the value returned equals "value" then no further retries will be attempted.

DagMan

Additional DAG commands:

VARS
Defines macros that can be used in the submit description file of a job.
VARS jobName macroname="string" [macroname2="string" ...]

ABORT-DAG-ON
Aborts the entire DAG if a specific node returns a specific value. Stops all nodes within the DAG immediately, including nodes currently running.
ABORT-DAG-ON JobName AbortExitValue [RETURN DAGReturnValue]
By default the return value of the DAG is the value returned by the aborted node. If RETURN is specified then the return value is DAGReturnValue.
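A sketch of VARS in use (the node and macro names are hypothetical); the macro is then referenced in that node's submit file as an ordinary $(macroname):

```
# in the DAG file:
JOB A a.submit
VARS A dataset="run42"

# in a.submit, the macro can then be used, e.g.:
# arguments = $(dataset)
```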

DagMan

Example DAG file:

JOB A a.submit
JOB B b.submit
JOB C a.submit
PARENT A CHILD B C
RETRY C 3
ABORT-DAG-ON A 2

Submission of DAGs is done with: condor_submit_dag file.dag

In order to specify the max number of jobs submitted by the DagMan, add the argument: -maxjobs numOfJobs

If any node in a DAG fails, the DagMan continues to run the remainder of the nodes until no more forward progress can be made. Then it creates a rescue file (input_file.rescue), where for each node that completed its execution the corresponding JOB line ends with DONE. Submitting this file continues the DAG execution.

DagMan

It is possible to create a visualization of the DAG:
Add a line to the DAG file: DOT dot_file.dot
Submit the DAG.
Run: dot -Tps dot_file.dot -o dag.ps

A DAG inside a DAG: suppose you want to include inner.dag in outer.dag.
Execute: condor_submit_dag -no_submit inner.dag
Include the following JOB line in outer.dag:
JOB jobName inner.dag.condor.sub
inner.dag.condor.sub is the submission file for inner.dag.
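Putting the nesting steps together, outer.dag might look like this (the node names are hypothetical; inner.dag.condor.sub is the file produced by the -no_submit step):

```
# outer.dag: run the pre-processed inner DAG after job A completes
JOB A a.submit
JOB INNER inner.dag.condor.sub
PARENT A CHILD INNER
```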
