Dan Bradley Computer Sciences Department University of Wisconsin-Madison danb@cs.wisc.edu Schedd On The Side

  • Slide 1

Dan Bradley
Computer Sciences Department
University of Wisconsin-Madison
danb@cs.wisc.edu
http://www.cs.wisc.edu/condor

Schedd On The Side

  • Slide 2

Schedd On The Side: What is it?

A specialized scheduler operating on the schedd's jobs.

[Diagram: a job queue holding Job 1, Job 2, Job 3, Job 4, Job 5, and Job 4*.]

  • Slide 3

Condor Farm Story

[Diagram: Application, condor_submit, job queue, Schedd, Startd, Resources. The jobs in the diagram are labeled "Random Seed".]

Now that this is working, how can I use my collaborator's resources too?

  • Slide 4

Option #1: Merge Farms

Combine machines with the collaborator into one Condor resource pool.

o Everything works just like it did before.
o Excellent option for small to medium clusters.
o Requires bidirectional connectivity to all startds, or equivalent via GCB.
o Requires some administrative coordination (e.g. upgrades, negotiator policy, security, etc.)

  • Slide 5

Option #2: Flocking Together

[Diagram: Schedd, Local Startds, Remote Startds.]

o Full featured (std universe etc.)
o Automatic matchmaking
o Easy to configure
o Requires bidirectional connectivity
o Both sites must run Condor

  • Slide 6

Option #3: Grid Universe

[Diagram: Schedd, Startds (vanilla), Gatekeeper X (site X).]

o Easier to live with private networks
o May use non-Condor resources
o Restricted Condor feature set (e.g. no std universe over grid)
o Must pre-allocate jobs between vanilla and grid universes

  • Slide 7

Option #4: Routing Jobs

[Diagram: Schedd, Local Startds (vanilla), Schedd On The Side, Gatekeepers X, Y, Z (site X, site Y, site Z).]

o Dynamic allocation of jobs between vanilla and grid universes.
o Not every job is appropriate for transformation into a grid job.

  • Slide 8

What About Flow Control?

o May restrict routing to jobs which have been rejected by the negotiator.
o May limit maximum actively routed jobs on a per-site basis.
o May limit maximum idle routed jobs per site.
o Periodic removal of idle routed jobs is possible, but with no guarantee of optimal rescheduling.
o Routing table may be reconfigured dynamically.
o Multicast? Might be interesting to try.

  • Slide 9

What About I/O?

o Jobs must be sandboxable (i.e. specifying input/output via the transfer-files mechanism).
o Routing of standard universe is not supported.
o Additional restrictions may apply, depending on site network and disk.

  • Slide 10

What Types of Grids?

Routing table may contain any combination of grid types supported by the grid universe.

Example: Condor-C

[Diagram: Schedd On The Side, Schedd X (site X).]

o For two Condor sites, schedd-to-schedd submission requires no additional software.
o However, still not as trivial to use as flocking.

  • Slide 11

Routing Behind the Scenes

[Diagram: Gatekeeper X, Schedd On The Side, Schedd, X2, X3.]

o Navigate internal firewalls
o Provide custom routes for special users
o Improve scalability
o However, keep in mind I/O requirements etc.
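The flow-control limits from Slide 8 and the I/O restrictions from Slide 9 can be sketched as a small routing-table check. This is only an illustration under stated assumptions, not the schedd-on-the-side implementation: the names `Route`, `Job`, `is_routable`, and `pick_route` are hypothetical, "maximum actively routed jobs" is read here as a cap on idle plus running routed jobs at a site, and the first route with room wins.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Route:
    """One routing-table entry (hypothetical fields, not Condor's own schema)."""
    site: str
    max_jobs: int       # assumed cap on routed jobs at the site (idle + running)
    max_idle_jobs: int  # assumed cap on routed jobs still idle at the site
    running: int = 0
    idle: int = 0

    def has_room(self) -> bool:
        # Both per-site limits must hold before another job is routed here.
        under_total = (self.running + self.idle) < self.max_jobs
        under_idle = self.idle < self.max_idle_jobs
        return under_total and under_idle


@dataclass
class Job:
    job_id: int
    universe: str = "vanilla"
    uses_file_transfer: bool = True       # "sandboxable" in the sense of Slide 9
    rejected_by_negotiator: bool = False


def is_routable(job: Job, only_rejected: bool = True) -> bool:
    """Apply the eligibility rules named on the slides."""
    if job.universe != "vanilla":
        return False  # e.g. standard universe jobs are not routed
    if not job.uses_file_transfer:
        return False  # input/output must go through file transfer
    if only_rejected and not job.rejected_by_negotiator:
        return False  # optionally route only what the local pool turned down
    return True


def pick_route(job: Job, routes: List[Route],
               only_rejected: bool = True) -> Optional[Route]:
    """Return the first route with room and count the job as idle there."""
    if not is_routable(job, only_rejected):
        return None
    for route in routes:
        if route.has_room():
            route.idle += 1
            return route
    return None
```

With a table such as `[Route("site X", max_jobs=2, max_idle_jobs=1), Route("site Y", max_jobs=5, max_idle_jobs=2)]`, calling `pick_route` once per rejected job fills each site only up to its idle limit and then returns `None`, which leaves the remaining jobs in the local queue. Because the limits live in plain data, swapping in a new list of routes is one way to picture the dynamic reconfiguration of the routing table.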
  • Slide 12

Future Step: Glidein Factory

[Diagram: Schedd and Startds (home), Schedd On The Side, glidein jobs, Gatekeeper X (site X).]

o True late binding of jobs to resources
o May run on top of non-Condor sites
o Supports full feature set of Condor (e.g. standard universe)
o Requires GCB on network boundary (initiated by schedd-on-the-side?)

  • Slide 13

Gliding in the Works

[Diagram: Schedd On The Side, glidein factory, site X, with schedd-to-schedd and schedd-to-gatekeeper links.]

o Hierarchical strategy for scalability and reliability
o Better match for private networks
o May require some additional horsepower from the gatekeeper machine, perhaps a dedicated element for edge services.

  • Slide 14

Thanks

Interested? Let us know.

Dan Bradley
danb@cs.wisc.edu

o We are currently using job routing for specific users at UW.
o Future development will focus on more use-cases.