Condor Project Computer Sciences Department University of Wisconsin-Madison [email protected] Eager, Lazy, and Just-in-Time

Condor Project
Computer Sciences Department
University of Wisconsin-Madison

[email protected]
http://www.cs.wisc.edu/condor

Eager, Lazy, and Just-in-Time Planning

Edinburgh Workshop

Oct 2003


Planning vs. Scheduling

› Can you control the resources? Yes? Scheduling. No? Planning.

› Planning is a ‘client’ operation.


The question of When

› Lots of planning open questions.

› An important consideration: When the planning occurs.

[Timeline, earliest to latest along the Time axis: Eager, Lazy, Just-in-Time]


Eager Example

› First Pass of EDG

[Figure: component stack. Labels: Resource Broker, RB DAGMan, Condor-G, Globus, Fabric, Site Scheduler]


Eager Condor-G Submit File

universe = globus
globussite = beak.cs.wisc.edu/jobmanager-lsf
executable = find_particle
arguments = …
output = …
log = …


EDG Resource Broker Gets Lazy…

› Addition of DAGMan callouts

› DAGMan is given a command (script) to run immediately before submission of a job to Condor-G (different from a PRE script on a node)

› The helper command is passed a copy of the job submit file when DAGMan is about to submit that node in the graph

› This allows changes to be made to the submit file (e.g. changing the globussite attribute) at the last minute
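As a rough illustration of what such a helper command might do, here is a minimal Python sketch that rewrites the globussite line in the text of a submit file. The function name `rewrite_globussite` and the replacement site are hypothetical. The slides do not say how the helper receives the submit file or how it chooses the new site, so this sketch works on a string and leaves both of those out.

```python
import re

# Matches a "globussite = ..." line, keeping everything up to the value in group 1.
_GLOBUSSITE = re.compile(r"^([ \t]*globussite[ \t]*=[ \t]*).*$",
                         re.IGNORECASE | re.MULTILINE)

def rewrite_globussite(submit_text: str, new_site: str) -> str:
    """Return submit_text with its globussite attribute set to new_site.

    Hypothetical helper: the real callout interface is not described
    in the slides, so this only handles the text transformation.
    """
    if _GLOBUSSITE.search(submit_text):
        # Replace the value on every globussite line, preserving the left-hand side.
        return _GLOBUSSITE.sub(lambda m: m.group(1) + new_site, submit_text)
    # No globussite line present: add one at the top of the file.
    return f"globussite = {new_site}\n" + submit_text

if __name__ == "__main__":
    original = (
        "universe = globus\n"
        "globussite = beak.cs.wisc.edu/jobmanager-lsf\n"
        "executable = find_particle\n"
    )
    # "other.example.org/jobmanager-pbs" is a made-up site for illustration.
    print(rewrite_globussite(original, "other.example.org/jobmanager-pbs"))
```

The point of the sketch is only that the resource choice moves from plan-creation time to the moment just before submission; everything else in the submit file is left untouched.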


Eager Example

› First Pass of EDG

[Figure: the same component stack with a callout added. Labels: Resource Broker, RB DAGMan, Condor-G, Globus, Fabric, Site Scheduler, callout]


Moving Condor-G to Just-In-Time

› Delay the binding of the task (job) to the resource until the resource is ready.

› Need to know when the resource is ready.

› One way: the unimplemented Globus 1.1 “queue wait time” estimate. Not really just-in-time, because of lies, lies, lies…

› Another way… Condor-G Glidein Mechanism.


How It Works

[Figure sequence: seven frames. Frame 1 labels the two sides “Condor-G” and “Globus Resource”. Text labels that survive in each frame:]

› Frame 1: Schedd, LSF, Collector, 600 Condor jobs

› Frame 2: Schedd, LSF, Collector, 600 Condor jobs, GlideIn jobs

› Frame 3: Schedd, LSF, Collector, GridManager, 600 Condor jobs, GlideIn jobs

› Frame 4: Schedd, JobManager, LSF, Collector, 600 Condor jobs, GlideIn jobs

› Frames 5 and 6: LSF, Startd, Collector, 600 Condor jobs, GlideIn jobs

› Frame 7: LSF, User Job, Startd, Collector, 600 Condor jobs, GlideIn jobs
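One reading of the frames above is that GlideIn jobs travel through Globus to the site's LSF queue, a Startd comes up when one of them runs, and only then is a user job bound to that resource. The toy Python simulation below sketches that late-binding idea under that assumption. The function name, job names, and site names are all hypothetical, and nothing here models the real Condor daemons or protocols.

```python
from collections import deque

def run_glidein_simulation(user_jobs, glidein_start_order):
    """Bind each user job to a resource only when a glidein reports in.

    user_jobs:            job names waiting in the (simulated) schedd queue
    glidein_start_order:  resources in the order their glideins actually start
    Returns (bindings, still_pending).
    """
    pending = deque(user_jobs)   # jobs not yet bound to any resource
    bindings = []                # (job, resource) pairs, in binding order
    for resource in glidein_start_order:
        # A glidein has started here, so the resource is known to be ready now.
        if not pending:
            break                # no work left for this or any later glidein
        job = pending.popleft()
        bindings.append((job, resource))
    return bindings, list(pending)

if __name__ == "__main__":
    bindings, left = run_glidein_simulation(
        ["job1", "job2", "job3"],   # hypothetical user jobs
        ["siteB", "siteA"],         # siteB's glidein happened to start first
    )
    print(bindings)
    print(left)
```

In this sketch the job-to-resource binding is decided by which glidein starts first, not by any queue-wait estimate made in advance.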


A Just-in-time Submit

executable = find_particle
requirements = TARGET.Arch == "Intel/Linux" || TARGET.Arch == "Sparc/Solaris"
# job describes the "power"
rank = MFlops * 10000 + Memory
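To make the roles of requirements and rank concrete, the following Python sketch mimics the selection this submit file asks for: filter the available machines by the requirements expression, then prefer the one with the highest rank. The machine list and its values are invented for illustration, and this is not how the Condor matchmaker is implemented.

```python
def requirements(target):
    """Mirror of the submit file's requirements expression."""
    return target["Arch"] in ("Intel/Linux", "Sparc/Solaris")

def rank(target):
    """Mirror of the submit file's rank expression: higher is preferred."""
    return target["MFlops"] * 10000 + target["Memory"]

def best_match(machines):
    """Return the acceptable machine with the highest rank, or None."""
    candidates = [m for m in machines if requirements(m)]
    return max(candidates, key=rank, default=None)

if __name__ == "__main__":
    # Hypothetical machine descriptions; attribute values are made up.
    machines = [
        {"Name": "a", "Arch": "Intel/Linux",   "MFlops": 50,  "Memory": 512},
        {"Name": "b", "Arch": "Sparc/Solaris", "MFlops": 80,  "Memory": 256},
        {"Name": "c", "Arch": "Alpha/OSF1",    "MFlops": 500, "Memory": 4096},
    ]
    print(best_match(machines)["Name"])
```

Machine "c" is the most powerful but fails the requirements, so it is never ranked. Nothing in the submit file names a specific site, which is what allows the binding to happen late.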


Another Just-in-time Submit

executable = find_particle
requirements = TARGET.Arch == "Intel/Linux" || TARGET.Arch == "Sparc/Solaris"
rank = sam_data_overlap(MY.dataset, TARGET.sam_site_name) + (TARGET.Mflops / 100000)
+dataset = "search_space_id_0133313"
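The rank expression above combines data locality with compute power, and a small Python sketch can show how the two terms interact. The slides do not define `sam_data_overlap`, so the `data_overlap` function here is a hypothetical stand-in that returns 1.0 when a site already holds the dataset and 0.0 otherwise. The sites and Mflops values are likewise invented.

```python
def data_overlap(dataset, site_name, site_holdings):
    """Hypothetical stand-in for sam_data_overlap: 1.0 if the site holds the dataset."""
    return 1.0 if dataset in site_holdings.get(site_name, set()) else 0.0

def rank(job, target, site_holdings):
    """Data-locality term plus a small compute-power term, as in the submit file."""
    overlap = data_overlap(job["dataset"], target["sam_site_name"], site_holdings)
    return overlap + target["Mflops"] / 100000

if __name__ == "__main__":
    job = {"dataset": "search_space_id_0133313"}
    # Assumed catalogue of which site caches which dataset.
    holdings = {"siteA": {"search_space_id_0133313"}}
    slow_with_data = {"sam_site_name": "siteA", "Mflops": 100}
    fast_without_data = {"sam_site_name": "siteB", "Mflops": 900}
    print(rank(job, slow_with_data, holdings) > rank(job, fast_without_data, holdings))
```

With the division by 100000, the power term stays below 1 for any machine under 100000 Mflops, so under this stand-in it acts mainly as a tie-breaker among sites with the same data overlap.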


Lots of Tradeoffs…

› Just-in-Time
  Pro: Dynamic. Resources can come and go. Can take advantage of changing circumstances.
  Con: Coordination of multiple resources

› Eager
  Pro: Easier to coordinate multiple resources
  Con: Hard to scale… how to know about all the resources in advance?
  Con: Plan falls apart if assumptions change.


Some observations

› A complete separation of task from resource is difficult. Lots and lots of structured data required. But this separation is required in order to achieve Just-In-Time planning.

› Grid Protocols that do not separate task from resource cannot realistically live on the grid. Virtualization can help.


Plan for failure

› Much effort on how to create a plan.

› How about a plan for when things fail?


Job Failure Policy Expressions

› Condor/Condor-G augmented so users can supply job failure policy expressions in the submit file.

› Can be used to describe a successful run, or what to do in the face of failure.

on_exit_remove = <expression>
on_exit_hold = <expression>
periodic_remove = <expression>
periodic_hold = <expression>


Job Failure Policy Examples

› Do not remove from queue (i.e. reschedule) if the job exits with a signal:

on_exit_remove = ExitBySignal == False

› Place on hold if the job exits with nonzero status or ran for less than an hour:

on_exit_hold = ((ExitBySignal == False) && (ExitCode != 0)) || ((ServerStartTime - JobStartDate) < 3600)

› Place on hold if the job has spent more than 50% of its time suspended:

periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)
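The three policies described above can be checked against sample job attributes with a short Python sketch. The attribute values are invented, and the functions only mirror the logic of the expressions; they are not Condor's evaluator. The nonzero-status check is written against `ExitCode`, on the assumption that this is the attribute the description "exits with nonzero status" intends.

```python
def on_exit_remove(job):
    """True (leave the queue) only if the job did not die from a signal."""
    return not job["ExitBySignal"]

def on_exit_hold(job):
    """True if the job exited normally with nonzero status, or ran under an hour."""
    nonzero_exit = (not job["ExitBySignal"]) and job["ExitCode"] != 0
    short_run = (job["ServerStartTime"] - job["JobStartDate"]) < 3600
    return nonzero_exit or short_run

def periodic_hold(job):
    """True if more than half of the wall-clock time was spent suspended."""
    return job["CumulativeSuspensionTime"] > job["RemoteWallClockTime"] / 2.0

if __name__ == "__main__":
    # Hypothetical job: exited normally with status 1 after a long run.
    failed = {
        "ExitBySignal": False, "ExitCode": 1,
        "ServerStartTime": 10000, "JobStartDate": 0,
        "CumulativeSuspensionTime": 10, "RemoteWallClockTime": 100,
    }
    print(on_exit_remove(failed), on_exit_hold(failed), periodic_hold(failed))
```

Writing the policies as plain predicates makes it easy to see which combinations of attributes trigger each action before committing them to a submit file.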


Thank you!

http://www.cs.wisc.edu/condor

[email protected]
