Working with Condor


Slide 2 - Links
- Condor's homepage
- Condor manual (for the version currently used)

Slide 3 - Table of contents
- Condor overview
- Useful Condor commands
- Vanilla universe
- Macros
- Standard universe
- Java universe
- Matlab in Condor
- ClassAds
- DagMan

Slide 4 - Condor overview
Condor is a system for running lots of jobs on a (preferably large) cluster of computers. Condor is a specialized workload management system for compute-intensive jobs.

Slide 5 - Condor overview
Condor's inner structure: Condor is built of several daemons:
- condor_master: responsible for keeping all the rest of the Condor daemons running.
- condor_startd: represents a given machine to the Condor pool. It advertises attributes about the machine it is running on. Must run on machines accepting jobs.
- condor_schedd: responsible for submitting jobs to Condor. It manages the job queue (each machine has one!). Must run on machines submitting jobs.
- condor_collector: runs only on the Condor server. Responsible for collecting all the information about the status of a Condor pool. All other daemons periodically send updates to the collector.
- condor_negotiator: runs only on the Condor server. Responsible for all the match-making within the Condor system.
- condor_ckpt_server: runs only on the checkpointing server. This is the checkpoint server; it services requests to store and retrieve checkpoint files.

Slide 6 - Condor overview
Condor uses user priorities to allocate machines to users in a fair manner. A lower numerical value for user priority means higher priority. Each user starts out with the best user priority, 0.5. If the number of machines a user currently has is greater than his priority, his user priority will worsen (numerically increase) over time. If the number of machines a user currently has is lower than his priority, his priority will improve over time.
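The daemon layout described on slide 5 can be inspected from the command line. A brief sketch (flag names as given in the Condor manual's condor_status options; verify them against your installed version):

```
condor_status            # default view: the slots advertised by each condor_startd
condor_status -master    # the condor_master daemons known to the collector
condor_status -schedd    # the schedulers (condor_schedd) in the pool
```

All of these query the condor_collector on the central manager, which is why every daemon must keep sending it updates.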
Use condor_userprio (optionally with -allusers) to see user priorities.

Slide 7 - Useful Condor commands: condor_status
Shows all of the computers connected to Condor (not all are accepting jobs).
Useful arguments:
- -claimed: shows only machines running Condor jobs (and who runs them).
- -avail: shows only machines that are willing to run jobs now.
- -long: displays entire ClassAds (discussed later on).
- -constraint <expression>: shows only resources matching the given constraint.

Slide 8 - Useful Condor commands: condor_status attributes
Arch:
- INTEL: a 32-bit Linux machine.
- X86_64: a 64-bit Linux machine.
Activity:
- Idle: there is no job activity.
- Busy: a job is busy running.
- Suspended: a job is currently suspended.
- Vacating: a job is currently checkpointing.
- Killing: a job is currently being killed.
- Benchmarking: the startd is running benchmarks.

Slide 9 - Useful Condor commands: condor_status, more attributes
State:
- Owner: the machine owner is using the machine, and it is unavailable to Condor.
- Unclaimed: the machine is available to run Condor jobs, but a good match is either not available or not yet found.
- Matched: the Condor central manager has found a good match for this resource, but a Condor scheduler has not yet claimed it.
- Claimed: the machine is claimed by a remote machine and is probably running a job.
- Preempting: a Condor job is being preempted (possibly via checkpointing) in order to clear the machine, either for a higher-priority job or because the machine owner wants the machine back.

Slide 10 - Useful Condor commands: condor_q
Shows the state of jobs submitted from the calling computer (the one running condor_q).
Useful arguments:
- -analyze: performs a schedulability analysis on jobs. Useful to see why a scheduled job isn't running, and whether it is ever going to run.
- -dag: sorts DAG jobs under their DAGMan.
- -constraint <expression>: shows only jobs matching the given ClassAd constraint.
- -global (-g): gets the global queue.
- -run: gets information about running jobs.

Slide 11 - Useful Condor commands: condor_rm
Removes a scheduled job from the queue (of the scheduling computer).
- condor_rm cluster.proc: remove the given job.
- condor_rm cluster: remove the given cluster of jobs.
- condor_rm user: remove all jobs owned by user.
- condor_rm -all: remove all jobs.

Slide 12 - Vanilla universe jobs
The Vanilla universe is used for running jobs without special needs and features. In the Vanilla universe Condor runs the job the same way it would run without Condor.
Start with a simple example.c:

    #include <stdio.h>

    int main() {
        printf("hello condor\n");
        return 0;
    }

Compile as usual:

    gcc example.c -o example

Slide 13 - Vanilla universe jobs
In order to submit the job to Condor we use the condor_submit command.
Usage: condor_submit <submit description file>
A simple submit file (sub_example):

    Universe   = Vanilla
    Executable = example
    Log        = test.log
    Output     = test.out
    Error      = test.error
    Queue

Notice that the submission commands are case insensitive.

Slide 14 - Vanilla universe jobs
There are a few other useful commands:
- arguments = arg1 arg2 ...: run the executable with the given arguments.
- Input = <file>: the given file is used as standard input.
- environment = <name>=<value> <name>=<value> ...: runs the job with the given environment variables. To use spaces in an entry, surround the value with single quotes; to insert a literal quote mark, write it twice. For example:

    environment = "one=1 two='a b c'"

Slide 15 - Vanilla universe jobs
getenv = <True|False>: if getenv is set to True, condor_submit will copy all of the user's current shell environment variables at the time of job submission into the job ClassAd. The job will therefore execute with the same set of environment variables that the user had at submit time. Defaults to False.

Slide 16 - Vanilla universe jobs
A more advanced submission:

    Universe   = Vanilla
    Executable = example
    Log        = test.$(cluster).$(process).log
    Output     = test.$(cluster).$(process).out
    Error      = test.$(cluster).$(process).error
    Queue 7

Here we see a use of predefined macros.
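To make the macro expansion concrete (the cluster number 42 below is assumed for illustration, not taken from the slides): if Condor assigns ClusterId 42 to a submission with Queue 7, processes 0 through 6 are created and the file names expand to:

```
test.42.0.log  test.42.0.out  test.42.0.error
test.42.1.log  test.42.1.out  test.42.1.error
...
test.42.6.log  test.42.6.out  test.42.6.error
```

Without the macros, all seven processes would overwrite the same three files.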
$(cluster) gives us the value of the ClusterId job ClassAd attribute, and the $(process) macro supplies the value of the ProcId job ClassAd attribute.

Slide 17 - Macros
More on macros:
- A macro is defined as follows: <macro_name> = string. It can then be used by writing $(macro_name).
- $$(attribute) is used to get a ClassAd attribute from the machine running the job.
- $ENV(variable) gives us the environment variable variable as seen at submit time.
For more on macros see the condor_submit section of Condor's manual.

Slide 18 - Other universes
- Standard universe
- Java universe

Slide 19 - Standard universe
The Standard universe provides checkpointing and remote system calls.
Remote system calls: all system calls made by the job running in Condor are made on the submitting computer.
Checkpointing: saves a snapshot of the current state of the running job, so the job can be restarted from the saved state in case of:
- migration to another computer
- machine crash or failure

Slide 20 - Standard universe
In order to execute a program in the Standard universe it must be relinked with Condor's library. To do so, use condor_compile with your usual link command. Example:

    condor_compile gcc example.c

To manually cause a checkpoint use condor_checkpoint hostname.
There are some restrictions on jobs running in the Standard universe:

Slide 21 - Standard universe restrictions
- Multi-process jobs are not allowed. This includes system calls such as fork(), exec(), and system().
- Interprocess communication is not allowed. This includes pipes, semaphores, and shared memory.
- Network communication must be brief. A job may make network connections using system calls such as socket(), but a network connection left open for long periods will delay checkpointing and migration.
- Sending or receiving the SIGUSR2 or SIGTSTP signals is not allowed; Condor reserves these signals for its own use. Sending or receiving all other signals is allowed.
- Alarms, timers, and sleeping are not allowed.
  This includes system calls such as alarm(), getitimer(), and sleep().

Slide 22 - Standard universe restrictions
- Multiple kernel-level threads are not allowed. However, multiple user-level threads are allowed.
- Memory-mapped files are not allowed. This includes system calls such as mmap() and munmap().
- File locks are allowed, but are not retained between checkpoints.
- All files must be opened read-only or write-only. A file opened for both reading and writing will cause trouble if a job must be rolled back to an old checkpoint image. For compatibility reasons, a file opened for both reading and writing will result in a warning but not an error.
- Your job must be statically linked (on Digital Unix (OSF/1), HP-UX, and Linux, and therefore at our school).
- Reading from or writing to files larger than 2 GB is not supported.

Slide 23 - Java universe
Used to run Java programs. Example submit description file:

    universe   = java
    executable = Example.class
    arguments  = Example
    output     = Example.output
    error      = Example.error
    queue

Notice that the first argument is the name of the main class of the job.
The JVM must be informed when submitting jar files; this is done in the following way:

    jar_files = example.jar

To run on a machine with a specific Java version:

    Requirements = (JavaVersion == "1.5.0_01")

Options to the Java VM itself can be set in the submit description file:

    java_vm_args = -DMyProperty=Value -verbose:gc

These options go after the java command but before the main class (usage: java [options] class [args...]). Do not use this to set the classpath (Condor handles that itself).

Slide 24 - Matlab functions
Matlab functions/scripts are written in .m files.
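A minimal sketch of such a function (the file and variable names here are made up for illustration), saved as add_two.m:

```matlab
% add_two.m - hypothetical example: one return value, two arguments
function ret = add_two(a, b)
    ret = a + b;
end
```

The function name must match the file name, which is what lets it be invoked by name from the command line later.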
Structure:

    function {ret_var =} func_name(arg1, arg2, ...)

Slide 25 - Running Matlab functions in Condor
First method: calling Matlab. What we want to do is run:

    matlab -nodisplay -nojvm -nosplash -r "func(arg1, arg2, ...)"

Instead of transferring the Matlab executable, we write a script (run.csh):

    #!/bin/csh -f
    matlab -nodisplay -nojvm -nosplash -r "$*"

Slide 26 - Running Matlab functions in Condor
First method: calling Matlab. The submission file:

    executable = run.csh
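A plausible rest of that submission file, sketched under the assumption that it follows the earlier vanilla-universe examples (every line below is assumed, not taken from the slides):

```
universe   = vanilla
executable = run.csh
arguments  = add_two(1, 2)
output     = matlab.$(cluster).$(process).out
error      = matlab.$(cluster).$(process).error
log        = matlab.$(cluster).$(process).log
queue
```

The arguments line is passed through the wrapper's "$*" straight to matlab -r, so it must be a valid Matlab expression.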