Grid Engine 6.2 Simple Workflow Intro - BioTeam · 2017-04-01 · chris@bioteam.net

Grid Engine 6.2

Simple Workflow Intro

Created by The BioTeam, http://blog.bioteam.net

Topics
  • Talk about using SGE more effectively
  • Enabling workflows & pipelines via:
    • Job Dependencies
    • Array Jobs
  • Some live examples
  • SGE Troubleshooting (time permitting)

Automating Workflows

  • A few comments:
    • Methods we will talk about today are great for flexibility & ad-hoc development
      • Especially for shell, Perl or Ruby scripters
    • There are more formal methods available for “serious” cluster-aware scientific software
      • DRMAA

DRMAA

  • Distributed Resource Management Application API (“DRMAA”)
  • Standard API for cluster job submission & control
  • Lets you write cluster-aware software that will be portable across different cluster schedulers
  • Available on:
    • SGE, PBS, PBSPro, Torque and Platform LSF*

DRMAA

  • Many DRMAA bindings
    • Perl, C, C++, C#, Java, Python, Ruby
  • Documentation & Tutorials
    • http://www.drmaa.org

Being more effective

Job Dependencies
  • The SGE scheduler does not promise to dispatch jobs in the order in which one submits them.
  • What if I have jobs that need to run in a certain order?
  • Imagine this scenario:
    • Step 1 - Data staging script
    • Step 2 - Data analysis script
    • Step 3 - Result QC & staging script
    • Step 4 - Cleanup script

Job Dependencies
  • SGE Job Dependency Syntax allows for ordered job execution
  • Hinges upon a simple SGE feature:
    • Job Names
  • Huh?
    • We need job names or some other identifier because we can’t be sure what SGE jobID the scheduler will assign our task
    • With assignable names we can reference jobs that are already pending, holding or running

Job Dependency Example

    qsub -N "worker1" my-job-script.sh
    qsub -N "worker2" my-job-script2.sh
    qsub -hold_jid worker1,worker2 cleanupJob.sh

  • See what we did up there?
    • Our worker scripts will run when resources are available
    • The cleanup script won’t run until the workers are done
  • It all hinges on this:
    • By “naming” our jobs we can now reference them when using the “-hold_jid” argument.
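
The four-step scenario from earlier (staging, analysis, QC, cleanup) could be chained with the same two switches. What follows is a minimal sketch, not a tested pipeline: the function name `submit_pipeline`, the job names and the script file names are all made up for illustration; only `-N` and `-hold_jid` come from the slides.

```shell
# Sketch: chain four jobs so each one waits for its predecessor.
# All job names and script names below are illustrative assumptions.
submit_pipeline() {
    # Step 1 has no dependency and runs when resources are available.
    qsub -N stage_data stage-data.sh
    # Each later step is held until the named predecessor has finished.
    qsub -N analyze -hold_jid stage_data analyze.sh
    qsub -N qc_results -hold_jid analyze qc-results.sh
    qsub -N cleanup -hold_jid qc_results cleanup.sh
}
```

Calling `submit_pipeline` from a submit host would queue all four jobs at once and leave the ordering to the scheduler.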

Job Dependency Live Demo

  • Using example scripts in:
    • { dag fill in path here! }

Array Jobs
  • Extremely common use case in life science clustering:
    • “I need to run my program 100,000 times against 100,000 different input files”
  • Most people would …
    • Use ‘qsub’ to submit 100,000 separate jobs
  • This will work but is not ideal
    • Each job consumes filehandles and other system resources on the SGE qmaster host. This can slow down or even crash SGE at large enough scales
    • For users it can be a pain to monitor 100,000 jobs via ‘qstat’
  • There is a better way!

Array Jobs

  • Array jobs let you submit many individual “tasks” within one job submission
  • Benefits:
    • Only one qsub required
    • Only one jobID or name to monitor in qstat
    • Significantly reduces load on SGE qmaster

Array Jobs: Qsub syntax

  • This is a 10 element task submission

    qsub -t 1-10:1 -N arrayJob \
        ./my-arrayJobScript.sh

Array Jobs: Qsub syntax

  • The “-t” switch:
    • -t [FirstTask]-[LastTask]:StepSize
  • Examples
    • What is the difference in these two commands?

      qsub -t 1-100:1 ./my-array-job.sh
      qsub -t 1-100:2 ./my-array-job.sh
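
The step size decides which task IDs exist: `1-100:1` produces tasks 1, 2, 3, ... 100 (100 tasks), while `1-100:2` produces tasks 1, 3, 5, ... 99 (50 tasks). A small sketch that mimics this enumeration with `seq`; the helper name `task_ids` is hypothetical and is not an SGE command.

```shell
# Sketch: list the task IDs that "-t FIRST-LAST:STEP" would generate.
# task_ids is a hypothetical helper, not part of SGE.
task_ids() {
    first=$1
    last=$2
    step=${3:-1}   # a step of 1 is assumed when none is given
    seq "$first" "$step" "$last"
}

task_ids 1 10 2    # prints the odd IDs from 1 to 9, one per line
```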

Array Jobs: How they work
  • The secret is simple
    • For each task in the array, SGE will populate a special environment variable
      • $SGE_TASK_ID
    • Running tasks can query this variable to learn their position in the array
    • Often used to build paths to input or output files
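
As a rough illustration of the path-building idea, an array job script might look like the sketch below. The directory layout and file naming (`input/sample_N.dat`, `output/sample_N.out`) and both helper functions are assumptions made for the example; only `$SGE_TASK_ID` comes from the slides.

```shell
#!/bin/sh
# Sketch of an array job script body. The directory layout and file
# naming (input/sample_N.dat -> output/sample_N.out) are assumptions.

# Map a task ID to its input and output file paths.
input_for_task()  { echo "input/sample_$1.dat"; }
output_for_task() { echo "output/sample_$1.out"; }

# SGE sets SGE_TASK_ID for each array task; fall back to 1 so the
# script can also be tried by hand outside the scheduler.
task="${SGE_TASK_ID:-1}"
if [ "$task" = "undefined" ]; then
    task=1    # SGE uses "undefined" when the job is not an array job
fi

echo "task $task reads $(input_for_task "$task")"
echo "task $task writes $(output_for_task "$task")"
```

Submitted with something like `qsub -t 1-100 ./my-array-job.sh`, each task would then work on its own pair of files.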

Array Jobs: Live Demo
  • Using example scripts from
    • { dag put path here! }

Array Jobs: Final
  • For advanced cases:
    • Recent SGE enhancement allows for job dependency conditions among individual array job task elements
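
The slide does not name the switch. In Grid Engine 6.2 this most likely refers to the `-hold_jid_ad` option of `qsub` (array dependency), which is worth confirming against the local man page. A hedged sketch, with hypothetical job and script names:

```shell
# Sketch: per-task dependencies between two array jobs.
# Job names and script names are hypothetical.
submit_array_chain() {
    # 100 independent conversion tasks.
    qsub -t 1-100 -N convert_step convert-task.sh
    # Task N of summarize_step waits only for task N of convert_step,
    # rather than for the whole convert_step array to finish.
    qsub -t 1-100 -N summarize_step -hold_jid_ad convert_step summarize-task.sh
}
```

Both arrays use the same task range here so that each task has a matching predecessor.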

Synchronous qsub
  • What if running a cluster job is only a tiny part of a larger workflow or pipeline?
  • Solution:
    • Synchronous job submission will “block” until job completes
    • This lets you embed a qsub call into some other script or workflow
    • When qsub completes, your script resumes
  • Example

      qsub -sync y -b y /bin/sleep 10
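
Embedding the call in a larger script could look roughly like this. It is a sketch only: `run_on_cluster` and `pipeline` are hypothetical names, and it relies on `qsub -sync y` passing the job's exit status back to the caller.

```shell
# Sketch: run one cluster job in the middle of a larger script and
# stop the pipeline if it fails. Function names are hypothetical.
run_on_cluster() {
    # -sync y makes qsub wait for the job to finish, so its exit
    # status can be checked like that of any other command.
    qsub -sync y -b y "$@"
}

pipeline() {
    echo "preparing inputs"
    if run_on_cluster /bin/sleep 10; then
        echo "cluster step finished, continuing"
    else
        echo "cluster step failed" >&2
        return 1
    fi
}
```

Calling `pipeline` blocks at the cluster step and only prints the "continuing" message once the job has ended successfully.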

Questions?

Troubleshooting

Grid Engine Troubleshooting
  • Let’s be honest
    • Not many user accessible troubleshooting methods
    • Best resource is still the output and error files that your jobs produce
    • The most powerful methods are available to cluster admins only

Grid Engine Troubleshooting
  • There are two core problem types
    • Job Level
      • Cluster seems OK, example scripts work fine
      • Some user jobs/apps fail
    • Cluster Level
      • Problems running all jobs
      • Problems submitting to certain PE/queue/Project
      • Problems with jobs on certain nodes

Grid Engine Troubleshooting
  • Dealing with Cluster Level problems
    • STDOUT/STDERR from user jobs still the best initial debug resource
    • SGE messages and logs are usually very helpful
        $SGE_ROOT/$SGE_CELL/spool/qmaster/messages
        $SGE_ROOT/$SGE_CELL/spool/qmaster/schedd/messages
    • Execd spool logs often hold job specific error data
      • Remember that local spooling may be used (!)
        $SGE_ROOT/$SGE_CELL/spool/<node>/messages
    • SGE panic location
      • Will log to /tmp on any node when $SGE_ROOT not found or not writable
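
To avoid retyping those paths, they can be generated from the environment. A small sketch: `sge_log_paths` is a hypothetical helper, and the fallback cell name `default` is an assumption that may not match every site (nor does it account for local spooling).

```shell
# Sketch: print the log files named above for a given exec node.
# sge_log_paths is a hypothetical helper, not an SGE command.
sge_log_paths() {
    node=$1
    root="${SGE_ROOT:?SGE_ROOT must be set}"
    cell="${SGE_CELL:-default}"   # assumed fallback cell name
    echo "$root/$cell/spool/qmaster/messages"
    echo "$root/$cell/spool/qmaster/schedd/messages"
    echo "$root/$cell/spool/$node/messages"
}
```

An admin could then run, for example, `tail -n 50 $(sge_log_paths node01)`, where `node01` stands in for a real exec host name.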

Job Level Troubleshooting
  • Job dies instantly
    • First pass
      • Check the .o and .e files in the job directory
      • Check .po and .pe files for parallel MPI jobs
      • Best resource, usually clear error messages found:
        • Permission problem, no license available, path problem, syntax error in app, etc.
    • Second pass (admin assistance required)
      • Check qmaster spool messages and node execd messages

Job Level Troubleshooting

  • Job dies instantly …
    • Third pass

        qsub -w v <full job request>

      • This will tell you if the job can run assuming:
        • All slots on all queues were empty
        • All load values were ignored
      • Good source of info on ‘why can’t my job be scheduled’ problems

Job Level Troubleshooting
  • Job pending forever
    • First Pass:

        qstat -j <job_id>

      • This will tell you why the job is pending and if there are any reasons why queues cannot accept the job
    • Possible root causes
      • Impossible resource requested, license not available
      • Scheduling oddness

Job Level Troubleshooting

  • Job pending forever
    • Second Pass (admin required)
        $SGE_ROOT/default/spool/qmaster/schedd/messages
      • Just to see if anything weird is going on with the scheduler

Job Level Troubleshooting

  • Job runs from command line on front end node, but not under Grid Engine
  • Most common root cause:
    • Difference in environment variables
    • Difference in shell execution environment
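
One way to spot such differences is to capture the environment in both places and compare the results. A minimal sketch; `dump_env` and the file names in the usage note are hypothetical.

```shell
# Sketch: capture the environment so an interactive shell and a batch
# job can be compared. dump_env is a hypothetical helper.
dump_env() {
    outfile=$1
    {
        echo "shell=$0"
        echo "cwd=$(pwd)"
        env | sort    # sorted so the two dumps line up in a diff
    } > "$outfile"
}
```

Run `dump_env env.login` on the front end node, call `dump_env env.batch` from a submitted job script, then inspect `diff env.login env.batch`.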

General Troubleshooting

  • Many times the problems are not SGE related
    • Permission, path or ENV problems
  • Best thing to do is watch STDERR and STDOUT
    • Use the qsub ‘-e’ and ‘-o’ switches to send output to a file that you can read
    • Use qsub ‘-j y’ to send STDOUT and STDERR to the same file (useful for debugging)

General Troubleshooting (cont.)

  • To get email listing why a job aborted
    • Use: ‘qsub -m a -M user@host [rest of command]’

General Troubleshooting (cont.)

  • Checking exit status and seeing if jobs ran to completion without error
    • Use: ‘qacct -j <job_id>’ to query the accounting data
    • Will also tell you if the job had to be requeued onto a different queue or exechost
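
The accounting record is long, so it can help to pull out just the status fields. A sketch under the assumption that `qacct -j` prints `exit_status` and `failed` as the first word of their lines; `job_exit_info` is a hypothetical helper.

```shell
# Sketch: extract the exit status and failure code from qacct output.
# job_exit_info is a hypothetical helper; it assumes the field name is
# the first word on the line and the value is the second.
job_exit_info() {
    qacct -j "$1" | awk '$1 == "exit_status" || $1 == "failed" { print $1 "=" $2 }'
}
```

For an array job the accounting data holds one record per task, so this would print one pair of lines for each task.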

Basic Debug Process

  • Verify for yourself that the cluster and SGE are happy before you do anything else
    • ‘qstat -f’, ‘qrsh hostname’, ‘qhost’, etc.
  • This will quickly identify systemic or cluster wide issues
  • Then move on to dealing with the specific issue
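
These first checks can be bundled into one helper. This is a sketch only: `cluster_ok` is a hypothetical name, and it merely tests whether each command exits successfully without looking at the output.

```shell
# Sketch: quick sanity check before debugging a specific job.
# cluster_ok is a hypothetical helper wrapping the commands above.
cluster_ok() {
    for check in "qstat -f" "qhost" "qrsh hostname"; do
        # Word splitting of $check is intentional: command plus args.
        if $check > /dev/null 2>&1; then
            echo "ok:   $check"
        else
            echo "FAIL: $check"
            return 1
        fi
    done
}
```

A non-zero return from `cluster_ok` points to a cluster wide problem rather than a problem with one job.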

Basic Debug Process

  • If problems persist, verify that the application actually runs OUTSIDE of Grid Engine
    • Easier to catch app/user/system issues
    • Good way to catch the super subtle stuff
    • This is especially useful for MPI parallel programs

Recommendation
  • Build a personal portfolio of simple testing scripts
      qrsh hostname
      $SGE_ROOT/examples/jobs/simple.sh
      $SGE_ROOT/examples/jobs/sleeper.sh
  • Get your users to supply you with example or dummy scripts that use real portfolio apps

Questions?
