Routing Jobs to the Grid: Jaime Frey, Computer Sciences Department, University of Wisconsin-Madison, [email protected] (OGF 19 Condor Software Forum)

  • Slide 1

Routing Jobs to the Grid
OGF 19 Condor Software Forum
Jaime Frey, Computer Sciences Department, University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor

  • Slide 2

What's a Job Router?

Specialized scheduler operating on the schedd's jobs.

[Diagram: Schedd; Job Router (a.k.a. Schedd On The Side); job queue holding Job 1, Job 2, Job 3, Job 4, Job 5, and Job 4*]

  • Slide 3

Adapted Quill Technology

Using the Quill library to mirror the job queue in memory
  o Efficient - just tails the log
  o Independent - mirror without clogging the schedd command queue
Modifying the job queue is another matter - must interact with the schedd

  • Slide 4

Usage Case

Routing: Vanilla -> Grid

  • Slide 5

Condor Farm Story

[Diagram: Application, condor_submit, Schedd with job queue, Startd resources running jobs]

Now that this is working, how can I use my collaborator's resources too?

  • Slide 6

Option #1: Merge Farms

Combine machines with collaborator into one Condor resource pool.
  o Everything works just like it did before.
  o Excellent option for small to medium clusters.
  o Requires bidirectional connectivity to all startds, or equivalent via GCB.
  o Requires some administrative coordination (e.g. upgrades, negotiator policy, security, etc.)
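Returning to the log-tailing mirror described on Slide 3, the idea of keeping an in-memory copy of the job queue by reading only newly appended log records might be sketched as follows. This is a minimal illustration, not the Quill library or the schedd's actual log format: the `SET`/`DEL` record syntax, the `QueueMirror` class, and `apply_log_line` are all hypothetical names invented for this example.

```python
import io


def apply_log_line(mirror, line):
    """Apply one record of the hypothetical log format to the mirror.

    Assumed (invented) record syntax:
        SET <job_id> <attribute> <value>
        DEL <job_id>
    """
    parts = line.split(maxsplit=3)
    if not parts:
        return
    if parts[0] == "SET" and len(parts) == 4:
        _, job_id, attr, value = parts
        mirror.setdefault(job_id, {})[attr] = value
    elif parts[0] == "DEL" and len(parts) == 2:
        mirror.pop(parts[1], None)


class QueueMirror:
    """Read-only mirror of a job queue, rebuilt by tailing its log."""

    def __init__(self, log):
        self.log = log      # any seekable text file object
        self.offset = 0     # position just past the last complete record
        self.jobs = {}      # job_id -> {attribute: value}

    def poll(self):
        """Consume only the records appended since the previous poll."""
        self.log.seek(self.offset)
        while True:
            line = self.log.readline()
            if not line.endswith("\n"):
                break       # end of file or a partially written record
            self.offset = self.log.tell()
            apply_log_line(self.jobs, line.rstrip("\n"))
        return self.jobs
```

Because the mirror only reads the log, it never sends queries to the scheduler; any change to the queue would still have to go through the schedd, as the slide notes.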
  • Slide 7

Option #1b: Submit to Multiple Pools

condor_submit -remote
  o Works OK at small scale
  o Have to manually partition jobs

  • Slide 8

Option #2: Flocking Together

[Diagram: Schedd, Local Startds, Remote Startds]

  o full featured (std universe etc.)
  o automatic matchmaking
  o easy to configure
  o requires bidirectional connectivity
  o both sites must run Condor

  • Slide 9

Option #3: Grid Universe

[Diagram: Schedd, Startds (vanilla), Gatekeeper X, site X]

  o easier to live with private networks
  o may use non-Condor resources
  o restricted Condor feature set (e.g. no std universe over grid)
  o must pre-allocate jobs between the vanilla and grid universes

  • Slide 10

Option #4: Routing Jobs

[Diagram: Schedd, Local Startds (vanilla), Schedd On The Side, Gatekeepers X, Y, and Z, site X, site Y, site Z]

  o dynamic allocation of jobs between the vanilla and grid universes
  o not every job is appropriate for transformation into a grid job

  • Slide 11

Example Routing Table

  [ GridResource = gt2 gatekeeper.site1/jobmanager-pbs;
    MaxJobs = 500;
    MaxIdle = 50;
    set_GlobusRSL = () ]
  [ GridResource = condor schedd.site2 collector.site2;
    MaxJobs = 700;
    MaxIdle = 100;
    Requirements = other.ImageSize < 500 ]

  • Slide 12

What About I/O?

Jobs must be sandboxable (i.e. they specify their input/output via the transfer-files mechanism).
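One way to read the routing table on Slide 11 together with the sandboxing requirement above is as a first-fit selection over routes, each with its own limits. The sketch below is an assumption-laden illustration, not the Job Router's implementation: the `Route` class, `pick_route`, `is_sandboxable`, and the job keys `transfers_files` and `image_size` are hypothetical, and the way `MaxJobs` and `MaxIdle` are enforced here is a guess at their intent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


def accept_any(job: dict) -> bool:
    """Default requirement: every job qualifies."""
    return True


@dataclass
class Route:
    grid_resource: str
    max_jobs: int                                   # cap on jobs sent down this route
    max_idle: int                                   # cap on routed jobs not yet running
    requirements: Callable[[dict], bool] = accept_any
    routed: int = 0
    idle: int = 0


def is_sandboxable(job: dict) -> bool:
    """Hypothetical check: the job declares its files for transfer."""
    return bool(job.get("transfers_files"))


def pick_route(job: dict, routes: List[Route]) -> Optional[Route]:
    """Return the first route with spare capacity that accepts the job."""
    if not is_sandboxable(job):
        return None                                 # leave the job in the local pool
    for route in routes:
        if route.routed >= route.max_jobs or route.idle >= route.max_idle:
            continue
        if route.requirements(job):
            route.routed += 1
            route.idle += 1                         # routed jobs start out idle
            return route
    return None
```

With the limits scaled down for illustration, a job that no route accepts simply stays a vanilla job, which matches the "dynamic allocation" described on Slide 10.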
Routing of standard universe jobs is not supported.
The site must have enough storage space for the input/output files!

  • Slide 13

What Types of Grids?

The routing table may contain any combination of grid types supported by Condor's grid universe.

Example: Condor-C

[Diagram: Schedd On The Side, Schedd X, site X]

  o for two Condor sites, schedd-to-schedd submission requires no additional software
  o however, still not as trivial to use as flocking

  • Slide 14

Source Routing

Routing the old-fashioned way:

  universe = Grid
  GridResource = condor site1
  remote_universe = Grid
  remote_GridResource = condor site2
  remote_remote_universe = Grid
  remote_remote_GridResource = pbs

  • Slide 15

Routing At the Site

[Diagram: Gatekeeper X, Schedd On The Side, Schedd X, X2, X3]

  o navigate internal firewalls
  o provide custom routes for special users
  o improve scalability
However, keep in mind I/O requirements etc.

  • Slide 16

Multicast in Future?

Currently: route one job to one site
Multicast: route one job to many sites
Thin out all but the first to germinate, or all but the first to yield fruit.

  • Slide 17

Future Glidein Factory

[Diagram: home, Schedd, Startds, Schedd On The Side, glidein jobs, Gatekeeper X, site X]

  o true late binding of jobs to resources
  o may run on top of non-Condor sites
  o supports the full feature set of Condor (e.g. standard universe)
  o requires GCB for private networks

  • Slide 18

Glideing in the Factory

[Diagram: Schedd On The Side, glidein factory, site X, schedd-to-schedd, schedd-to-gatekeeper]

  o hierarchical strategy for scalability and reliability
  o better match for private networks
  o may require some additional horsepower from the gatekeeper machine, perhaps a dedicated element for edge services

  • Slide 19

Pluggable Router

  o Beyond simple ClassAd transforms
  o Plugins would fire when a job matches an entry in the routing table
  o Don't yet understand the semantics
There is work to do!

  • Slide 20

Thanks

Interested? Let us know.
We are currently using job routing for specific users at UW.
Future development will focus on more use cases.
Jaime Frey, [email protected]
