Jaime Frey Computer Sciences Department University of Wisconsin-Madison jfrey@cs.wisc.edu http://www.cs.wisc.edu/condor OGF 19 Condor Software Forum Routing Jobs to the Grid


Page 1

Jaime Frey
Computer Sciences Department
University of Wisconsin-Madison

jfrey@cs.wisc.edu
http://www.cs.wisc.edu/condor

OGF 19 Condor Software Forum

Routing Jobs to the Grid

Page 2

What’s a Job Router?

Job Router, a.k.a. Schedd On The Side: specialized scheduler operating on schedd’s jobs.

[Diagram labels: Schedd, job queue (Job 1, Job 2, Job 3, Job 4, Job 5, …, Job 4*), Job Router]

Page 3

Adapted Quill Technology

› Using Quill library to mirror job queue in memory
  o Efficient - just “tails” the log
  o Independent - mirror without clogging schedd command queue

› Modifying the job queue is another matter - must interact with schedd
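
A minimal sketch of the "tail the log" idea, not the Quill library itself: it keeps an in-memory mirror of a job queue by applying only the records appended since the last read. The record format (`NEW <job>`, `SET <job> <attr> <value>`, `DEL <job>`) and the function names are hypothetical, chosen for illustration; the real schedd job queue log has its own format.

```python
import io


def apply_record(mirror, record):
    """Apply one log record (hypothetical format) to the in-memory mirror."""
    parts = record.split(maxsplit=3)
    if len(parts) == 2 and parts[0] == "NEW":
        mirror[parts[1]] = {}
    elif len(parts) == 4 and parts[0] == "SET":
        mirror.setdefault(parts[1], {})[parts[2]] = parts[3]
    elif len(parts) == 2 and parts[0] == "DEL":
        mirror.pop(parts[1], None)
    # Unknown or malformed records are ignored in this sketch.


def tail_once(log, mirror):
    """Consume complete lines appended since the last call.

    A trailing line without a newline is treated as a partially written
    record: the read position is rewound so it is retried on the next call.
    Returns the number of records applied.
    """
    applied = 0
    while True:
        start = log.tell()
        line = log.readline()
        if not line.endswith("\n"):
            log.seek(start)
            break
        apply_record(mirror, line[:-1])
        applied += 1
    return applied


if __name__ == "__main__":
    log = io.StringIO("NEW 4.0\nSET 4.0 Cmd /bin/echo hi\n")
    mirror = {}
    tail_once(log, mirror)
    print(mirror)
```

The reader never sends a request to the writer, which is the sense in which the mirror is "independent". Changing a job still has to go through the schedd.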

Page 4

Usage Case

Routing: Vanilla -> Grid

Page 5

Condor Farm Story

[Diagram labels: Application, condor_submit, Schedd, job queue, Startd Resources, RandomSeed jobs]

• Now that this is working, how can I use my collaborator’s resources too?

Page 6

Option #1: Merge Farms

› Combine machines with collaborator into one Condor resource pool.
  o Everything works just like it did before.
  o Excellent option for small to medium clusters.
  o Requires bidirectional connectivity to all startds, or equivalent via GCB.
  o Requires some administrative coordination (e.g. upgrades, negotiator policy, security, etc.)

Page 7

Option #1b: submit to multiple pools

› condor_submit -remote …

› Works

› Ok for small scale

› Have to manually partition jobs

Page 8

Option #2: Flocking Together

[Diagram labels: Schedd, Local Startds, Remote Startds, RandomSeed jobs]

• full featured (std universe etc.)
• automatic matchmaking
• easy to configure

• requires bidirectional connectivity
• both sites must run Condor

Page 9

Option #3: Grid Universe

[Diagram labels: Schedd, Startds (vanilla), Gatekeeper X (site X), RandomSeed jobs]

• easier to live with private networks
• may use non-Condor resources

• restricted Condor feature set (e.g. no std universe over grid)
• must pre-allocate jobs between vanilla and grid universe

Page 10

Option #4: Routing Jobs

[Diagram labels: Schedd, Local Startds (vanilla), Schedd On The Side, Gatekeepers X, Y, Z (site X, site Y, site Z), RandomSeed jobs]

• dynamic allocation of jobs between vanilla and grid universes.
• not every job is appropriate for transformation into a grid job.

Page 11

Example Routing Table

[ GridResource = "gt2 gatekeeper.site1/jobmanager-pbs";
  MaxJobs = 500;
  MaxIdle = 50;
  set_GlobusRSL = "(…)" ]
[ GridResource = "condor schedd.site2 collector.site2";
  MaxJobs = 700;
  MaxIdle = 100;
  Requirements = other.ImageSize < 500 ]
…
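
One way to read a table like this is sketched below. This is an illustrative sketch, not the Job Router's actual matching code: the `Route` class and `pick_route` function are hypothetical, and it assumes that MaxJobs caps the number of jobs routed to an entry at once, that MaxIdle caps how many of those may still be idle, and that Requirements is evaluated against the candidate job.

```python
from dataclasses import dataclass
from typing import Callable, Optional


def any_job(job: dict) -> bool:
    """Requirements stand-in for an entry that accepts every job."""
    return True


@dataclass
class Route:
    grid_resource: str
    max_jobs: int                          # assumed: cap on jobs routed at once
    max_idle: int                          # assumed: cap on routed jobs still idle
    requirements: Callable[[dict], bool]   # assumed: predicate over the job ad
    routed: int = 0                        # current number of routed jobs
    idle: int = 0                          # current number of idle routed jobs


def pick_route(job: dict, routes: list) -> Optional[Route]:
    """Return the first entry with spare capacity whose requirements accept the job."""
    for route in routes:
        if route.routed >= route.max_jobs:
            continue
        if route.idle >= route.max_idle:
            continue
        if not route.requirements(job):
            continue
        return route
    return None  # no entry fits; the job stays where it is


if __name__ == "__main__":
    table = [
        Route("gt2 gatekeeper.site1/jobmanager-pbs", 500, 50, any_job),
        Route("condor schedd.site2 collector.site2", 700, 100,
              lambda job: job["ImageSize"] < 500),
    ]
    chosen = pick_route({"ImageSize": 400}, table)
    print(chosen.grid_resource if chosen else "not routed")
```

Taking the first matching entry is only one possible policy; the slides do not say how ties between entries are resolved.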

Page 12

What About I/O?

› Jobs must be sandboxable (i.e. specifying input/output via transfer-files mechanism).

› Routing of standard universe is not supported.

› Must have enough storage space at site for input/output files!

Page 13

What Types of Grids?

› Routing table may contain any combination of grid types supported by Condor’s grid universe.

› Example: Condor-C

[Diagram labels: Schedd, Schedd On The Side, Schedd X (site X), RandomSeed jobs]

• for two Condor sites, schedd-to-schedd submission requires no additional software
• however, still not as trivial to use as flocking

Page 14

Source Routing

› Routing the old-fashioned way:

universe = Grid
GridResource = condor site1 …
remote_universe = Grid
remote_GridResource = condor site2 …
remote_remote_universe = Grid
remote_remote_GridResource = pbs
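
The pattern in this submit description is mechanical: each further hop adds one more `remote_` prefix. A small sketch that generates such lines from an ordered list of hops is shown here. The `source_route` helper is hypothetical, and it reuses the attribute spelling from the slide rather than asserting the exact submit command names.

```python
def source_route(hops):
    """Build submit-description lines for a chain of grid hops.

    hops is an ordered list of GridResource values, nearest hop first.
    Hop number n (counting from zero) gets n copies of the remote_ prefix.
    """
    lines = []
    for depth, resource in enumerate(hops):
        prefix = "remote_" * depth
        lines.append(f"{prefix}universe = Grid")
        lines.append(f"{prefix}GridResource = {resource}")
    return lines


if __name__ == "__main__":
    for line in source_route(["condor site1 …", "condor site2 …", "pbs"]):
        print(line)
```

Every hop has to be known and written down at submit time, which is what makes this "the old-fashioned way" compared with a routing table.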

Page 15

Routing At the Site

[Diagram labels: Gatekeeper X, Schedd, Schedd On The Side, Schedd X3, X2]

• navigate internal firewalls
• provide custom routes for special users
• improve scalability
• However, keep in mind I/O requirements etc.

Page 16

Multicast in Future?

› Currently: route one job to one site

› Multicast: route one job to many sites

› Thin out all but first to germinate

› … or all but first to yield fruit.
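
The two thinning policies can be sketched as a small state machine. This is only an illustration of the idea, since multicast routing is described here as future work: the function names and the state names (`idle`, `running`, `completed`, `removed`) are hypothetical.

```python
def route_to_many(sites):
    """Multicast: one copy of the job per site, all initially idle."""
    return {site: "idle" for site in sites}


def on_state_change(copies, site, new_state, keep_on="running"):
    """Record a state change for one copy and thin out the others if it wins.

    keep_on="running"   : keep the first copy to start ("first to germinate")
    keep_on="completed" : keep the first copy to finish ("first to yield fruit")
    """
    if copies[site] == "removed":
        return copies  # late event from a copy that was already thinned out
    copies[site] = new_state
    if new_state == keep_on:
        for other in copies:
            if other != site and copies[other] != "completed":
                copies[other] = "removed"
    return copies


if __name__ == "__main__":
    copies = route_to_many(["X", "Y", "Z"])
    on_state_change(copies, "Y", "running")
    print(copies)
```

Keeping the first copy to start wastes the least remote capacity; keeping the first to finish tolerates a copy that starts and then stalls, at the cost of running duplicates.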

Page 17

Future Glidein Factory

[Diagram labels: home, Schedd, Startds, Schedd On The Side, glidein jobs, Gatekeeper X, site X, RandomSeed jobs]

• true late binding of jobs to resources
• may run on top of non-Condor sites
• supports full feature-set of Condor (e.g. standard universe)

• requires GCB for private networks

Page 18

Gliding in the Factory

[Diagram labels: Schedd, Schedd On The Side, glidein factory, site X, schedd-to-schedd, schedd-to-gatekeeper, RandomSeed jobs]

• hierarchical strategy for scalability and reliability
• better match for private networks

• may require some additional horsepower from gatekeeper machine, perhaps a dedicated element for “edge services”.

Page 19

Pluggable Router

› Beyond simple ClassAd transforms

› Plugins would fire when a job matches an entry in the routing table

› Don’t yet understand semantics

› There is work to do!

Page 20

Thanks

Interested? Let us know.

Jaime Frey
jfrey@cs.wisc.edu

We are currently using job routing for specific users at UW.

Future development will focus on more use-cases.