Slide 1
Condor-G: A Case in Distributed Job Delegation
Jaime Frey
Computer Sciences Department
University of Wisconsin-Madison
http://www.cs.wisc.edu/condor
Slide 2
Job Delegation
- Transfer of responsibility to schedule and execute a job
- Multiple delegations can form a chain
Slide 3
Job Delegation in Condor-G Today
(Diagram: Condor-G -> Globus GRAM -> Batch System Front-end -> Execute Machine)
Slide 4
Expanding the Model
- What can we do with new forms of job delegation?
- Some ideas
  - Mirroring
  - Load-balancing
  - Glide-in schedd
  - Multi-hop grid scheduling
Slide 5
Mirroring
- What it does
  - Jobs mirrored on two Condor-Gs
  - If primary Condor-G crashes, secondary one starts running jobs
  - On recovery, primary Condor-G gets job status from secondary one
- Removes Condor-G submit point as single point of failure
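The failover behavior on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SubmitPoint` and `MirrorPair` are hypothetical names, not Condor-G components, and crash detection is reduced to an `alive` flag.

```python
from dataclasses import dataclass, field


@dataclass
class SubmitPoint:
    """Hypothetical stand-in for one Condor-G submit point."""
    name: str
    alive: bool = True
    jobs: dict = field(default_factory=dict)  # job id -> status string


class MirrorPair:
    """Two submit points holding the same job queue (illustrative only)."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def submit(self, job_id):
        # Every job is recorded on both submit points.
        self.primary.jobs[job_id] = "idle"
        self.secondary.jobs[job_id] = "idle"

    def active(self):
        # The secondary manages jobs only while the primary is down.
        return self.primary if self.primary.alive else self.secondary

    def run_all(self):
        # Whichever submit point is active starts the jobs.
        manager = self.active()
        for job_id in manager.jobs:
            manager.jobs[job_id] = "running"
        return manager.name

    def recover_primary(self):
        # On recovery the primary copies job status from the secondary.
        self.primary.alive = True
        self.primary.jobs = dict(self.secondary.jobs)
```

In this sketch the user never has to resubmit: the job record survives on whichever submit point is still up.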
Slide 6
Mirroring Example
(Diagram: Condor-G 1, Matchmaker, Execute Machine, Condor-G 2)
Slide 7
Mirroring Example
(Diagram: Condor-G 1, Matchmaker, Execute Machine, Condor-G 2)
Slide 8
Load-Balancing
- What it does
  - Front-end Condor-G distributes all jobs among several back-end Condor-Gs
  - Front-end Condor-G keeps updated job status
- Improves scalability
- Maintains single submit point for users
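A rough Python sketch of the front-end/back-end split described here. The slide does not name a distribution policy, so least-loaded placement is an assumption, and `FrontEnd` and `BackEnd` are hypothetical names. For brevity the front end looks up status on demand rather than caching it.

```python
class BackEnd:
    """Hypothetical back-end submit point that just records its jobs."""

    def __init__(self, name):
        self.name = name
        self.jobs = {}  # job id -> status string

    def accept(self, job_id):
        self.jobs[job_id] = "idle"


class FrontEnd:
    """Single submit point that spreads jobs over several back ends."""

    def __init__(self, back_ends):
        self.back_ends = back_ends
        self.placement = {}  # job id -> back end holding the job

    def submit(self, job_id):
        # Assumed policy: pick the back end with the fewest jobs.
        target = min(self.back_ends, key=lambda b: len(b.jobs))
        target.accept(job_id)
        self.placement[job_id] = target
        return target.name

    def status(self, job_id):
        # Users query only the front end, which consults the back end.
        return self.placement[job_id].jobs[job_id]
```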
Glide-In Schedd
- What it does
  - Drop a Condor-G onto the front-end machine of a cluster
  - Delegate jobs to the cluster through the glide-in schedd
- Apply cluster-specific policies to jobs
Slide 11
Glide-In Schedd Example
(Diagram: Condor-G, Glide-In Schedd, Batch System)
Slide 12
Multi-Hop Grid Scheduling
- Match a job to a Virtual Organization (VO), then to a resource within that VO
- Easier to schedule jobs across multiple VOs and grids
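The two-stage match can be illustrated with a small Python sketch. The matching criteria are assumptions made for the example: the first hop matches on a hypothetical `experiment` field, the second on a hypothetical `memory_mb` field. Real resource brokers would use richer policies.

```python
def match_vo(job, vos):
    """First hop: choose a VO that accepts this job's experiment."""
    for vo in vos:
        if job["experiment"] in vo["accepts"]:
            return vo
    return None


def match_resource(job, vo):
    """Second hop: choose a resource inside the VO with enough memory."""
    for res in vo["resources"]:
        if res["memory_mb"] >= job["memory_mb"]:
            return res
    return None


def schedule(job, vos):
    """Return (VO name, resource name), or None if either hop fails."""
    vo = match_vo(job, vos)
    if vo is None:
        return None
    res = match_resource(job, vo)
    if res is None:
        return None
    return (vo["name"], res["name"])
```

The first hop needs to know only which VOs exist, not every resource inside them, which is what makes scheduling across several VOs simpler.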
Slide 13
Multi-Hop Grid Scheduling Example
(Diagram: Experiment Condor-G, Experiment Resource Broker, VO Condor-G, VO Resource Broker, Globus GRAM, Batch Scheduler)
Slide 14
Endless Possibilities
- These new models can be combined with each other or with other new models
- Resulting system can be arbitrarily sophisticated
Slide 15
Job Delegation Challenges
- New complexity introduces new issues and exacerbates existing ones
- A few
  - Transparency
  - Representation
  - Scheduling Control
  - Active Job Control
  - Revocation
  - Error Handling and Debugging
Slide 16
Transparency
- Full information about job should be available to user
  - Information from full delegation path
  - No manual tracing across multiple machines
- Users need to know what's happening with their jobs
Slide 17
Representation
- Job state is a vector
- How best to show this to user
  - Summary
    - Current delegation endpoint
    - Job state at endpoint
  - Full information available if desired
    - Series of nested ClassAds?
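One way to picture the nested representation is with plain Python dictionaries standing in for ClassAds. This is a sketch only: the attribute names `Manager`, `State`, and `Delegated` are invented for the example and are not real ClassAd attributes.

```python
def summarize(ad):
    """Walk nested records down to the current delegation endpoint."""
    hops = 0
    while "Delegated" in ad:
        ad = ad["Delegated"]
        hops += 1
    return {"endpoint": ad["Manager"], "state": ad["State"], "hops": hops}


# Full information: one record per hop, each nested inside the previous one.
job = {
    "Manager": "condor-g",
    "State": "delegated",
    "Delegated": {
        "Manager": "gram",
        "State": "delegated",
        "Delegated": {"Manager": "batch-system", "State": "running"},
    },
}

summary = summarize(job)
```

The summary gives the user the endpoint and its state at a glance, while the full nested record stays available for anyone who wants every hop.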
Slide 18
Scheduling Control
- Avoid loops in delegation path
- Give user control of scheduling
  - Allow limiting of delegation path length?
  - Allow user to specify part or all of delegation path
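Loop avoidance and a path-length limit can both be checked before each new delegation. A minimal Python sketch, assuming the job carries the list of schedulers that have already held it; `may_delegate` and `max_hops` are hypothetical names.

```python
def may_delegate(path, next_hop, max_hops=None):
    """Decide whether a job may be delegated to next_hop.

    path lists the schedulers that have held the job so far, starting
    with the original submit point, so len(path) - 1 delegations have
    already happened.
    """
    if next_hop in path:
        return False  # delegating here would form a loop
    if max_hops is not None and len(path) > max_hops:
        return False  # one more delegation would exceed the user's limit
    return True
```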
Slide 19
Active Job Control
- User may request certain actions
  - hold, suspend, vacate, checkpoint
- Actions cannot be completed synchronously for user
  - Must forward along delegation path
  - User checks completion later
Slide 20
Active Job Control (cont.)
- Endpoint systems may not support actions
  - If possible, execute them at furthest point that does support them
- Allow user to apply action in middle of delegation path
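The "furthest point that supports it" rule amounts to scanning the delegation path from the endpoint backwards. A small Python sketch, with a made-up path and made-up capability sets; it leaves out the asynchronous forwarding and later completion check.

```python
def executing_hop(path, action):
    """Return the name of the furthest hop that supports the action.

    path is ordered from the submit point to the endpoint. Returns None
    when no hop along the path supports the action.
    """
    for hop in reversed(path):
        if action in hop["supports"]:
            return hop["name"]
    return None


# Hypothetical delegation path; the capability sets are illustrative.
path = [
    {"name": "condor-g", "supports": {"hold", "suspend", "vacate"}},
    {"name": "gram", "supports": {"hold"}},
    {"name": "batch-system", "supports": set()},
]
```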
Slide 21
Revocation
- Leases
  - Lease must be renewed periodically for delegation to remain valid
  - Allows revocation during long-term failures
  - What are good values for lease lifetime and update interval?
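The lease mechanism can be sketched as a small Python class. Times are passed in explicitly so the example is deterministic, and the `Lease` class and its method names are hypothetical. The numbers used with it are arbitrary and do not answer the open question about good lifetimes.

```python
class Lease:
    """A delegation stays valid only while its lease is unexpired."""

    def __init__(self, lifetime, now):
        self.lifetime = lifetime
        self.expires = now + lifetime

    def valid(self, now):
        return now < self.expires

    def renew(self, now):
        # A lapsed lease cannot be revived; the delegation is revoked.
        if not self.valid(now):
            return False
        self.expires = now + self.lifetime
        return True
```

If the delegating side stops renewing, for example because it has lost contact with the far end for a long time, the lease simply runs out and the far end knows the delegation is no longer valid.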
Slide 22
Error Handling and Debugging
- Many more places for things to go horribly wrong
- Need clear, simple error semantics
- Logs, logs, logs
  - Have them everywhere
Slide 23
Current Status
- Done
  - Mirroring
- In Progress
  - Condor-G -> Condor-G delegation
    - User must specify hops
  - Glide-in schedd
    - Set up by hand