
Jaime Frey, Computer Sciences Department, University of Wisconsin-Madison. Condor-G: A Case in Distributed Job Delegation



  • Slide 1
  • Jaime Frey, Computer Sciences Department, University of Wisconsin-Madison. http://www.cs.wisc.edu/condor. Condor-G: A Case in Distributed Job Delegation
  • Slide 2
  • Job Delegation: Transfer of responsibility to schedule and execute a job. Multiple delegations can form a chain.
  • Slide 3
  • Job Delegation in Condor-G Today (diagram): Condor-G -> Globus GRAM -> Batch System Front-end -> Execute Machine
  • Slide 4
  • Expanding the Model: What can we do with new forms of job delegation? Some ideas: mirroring, load-balancing, glide-in schedd, multi-hop grid scheduling.
  • Slide 5
  • Mirroring: What it does: jobs are mirrored on two Condor-Gs. If the primary Condor-G crashes, the secondary one starts running jobs. On recovery, the primary Condor-G gets job status from the secondary one. Removes the Condor-G submit point as a single point of failure.
  • Slide 6
  • Mirroring Example (diagram): Condor-G 1, Matchmaker, Execute Machine, Condor-G 2
  • Slide 7
  • Mirroring Example (diagram, continued): Condor-G 1, Matchmaker, Execute Machine, Condor-G 2
  • Slide 8
  • Load-Balancing: What it does: a front-end Condor-G distributes all jobs among several back-end Condor-Gs, and the front-end Condor-G keeps updated job status. Improves scalability. Maintains a single submit point for users.
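
The load-balancing idea on this slide can be sketched in a few lines of Python. This is only an illustration, not Condor-G code: the `FrontEnd` class, the back-end names, the state strings, and the least-loaded placement policy are all assumptions, since the slide does not say how jobs are distributed.

```python
from typing import Dict, List


class FrontEnd:
    """Hypothetical front-end: spreads jobs over back-ends and tracks their status."""

    def __init__(self, backends: List[str]) -> None:
        self.load: Dict[str, int] = {name: 0 for name in backends}
        self.assignment: Dict[str, str] = {}  # job id -> back-end name
        self.status: Dict[str, str] = {}      # job id -> last reported state

    def submit(self, job_id: str) -> str:
        # Least-loaded placement (an assumed policy); ties go to the
        # back-end that was listed first.
        backend = min(self.load, key=lambda name: self.load[name])
        self.load[backend] += 1
        self.assignment[job_id] = backend
        self.status[job_id] = "idle"
        return backend

    def update(self, job_id: str, state: str) -> None:
        # Called when a back-end reports a state change, so users can keep
        # asking the single submit point for job status.
        finished_now = state == "completed" and self.status[job_id] != "completed"
        self.status[job_id] = state
        if finished_now:
            self.load[self.assignment[job_id]] -= 1
```

Users only ever talk to the `FrontEnd` object, which matches the slide's point about keeping a single submit point while the work is spread out.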
  • Slide 9
  • Load-Balancing Example (diagram): Condor-G Front-end, Condor-G Back-end 1, Condor-G Back-end 2, Condor-G Back-end 3
  • Slide 10
  • Glide-In Schedd: What it does: drop a Condor-G onto the front-end machine of a cluster; delegate jobs to the cluster through the glide-in schedd; apply cluster-specific policies to jobs.
  • Slide 11
  • Glide-In Schedd Example (diagram): Condor-G, Glide-In Schedd, Batch System
  • Slide 12
  • Multi-Hop Grid Scheduling: Match a job to a Virtual Organization (VO), then to a resource within that VO. Easier to schedule jobs across multiple VOs and grids.
  • Slide 13
  • Multi-Hop Grid Scheduling Example (diagram): Experiment Condor-G, Experiment Resource Broker, VO Condor-G, VO Resource Broker, Globus GRAM, Batch Scheduler
  • Slide 14
  • Endless Possibilities: These new models can be combined with each other or with other new models. The resulting system can be arbitrarily sophisticated.
  • Slide 15
  • Job Delegation Challenges: New complexity introduces new issues and exacerbates existing ones. A few: transparency, representation, scheduling control, active job control, revocation, error handling and debugging.
  • Slide 16
  • Transparency: Full information about the job should be available to the user. Information from the full delegation path. No manual tracing across multiple machines. Users need to know what's happening with their jobs.
  • Slide 17
  • Representation: Job state is a vector. How best to show this to the user? Summary: current delegation endpoint, job state at endpoint. Full information available if desired. Series of nested ClassAds?
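
One way to picture the "vector of states" and its summary is with nested records, one per hop. The sketch below uses plain Python dictionaries as a stand-in for nested ClassAds; the keys `Scheduler`, `JobState`, and `DelegatedTo`, the host names, and the state strings are hypothetical and are not real ClassAd attribute names.

```python
from typing import Any, Dict, List, Tuple


def delegation_path(ad: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten the nested records into a list, submit point first."""
    path = []
    current = ad
    while current is not None:
        path.append(current)
        current = current.get("DelegatedTo")  # None at the endpoint
    return path


def summary(ad: Dict[str, Any]) -> Tuple[str, str]:
    """Short view for the user: current endpoint and the job state there."""
    endpoint = delegation_path(ad)[-1]
    return endpoint["Scheduler"], endpoint["JobState"]


# Hypothetical three-hop delegation, written as nested records.
job_ad = {
    "Scheduler": "condor-g.example.org",
    "JobState": "delegated",
    "DelegatedTo": {
        "Scheduler": "gram.example.org",
        "JobState": "delegated",
        "DelegatedTo": {"Scheduler": "batch.example.org", "JobState": "running"},
    },
}

# The full state vector stays available when the user wants the details.
state_vector = [hop["JobState"] for hop in delegation_path(job_ad)]
```

Here `summary(job_ad)` gives the short answer, while `state_vector` holds the state at every hop.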
  • Slide 18
  • Scheduling Control: Avoid loops in delegation path. Give user control of scheduling. Allow limiting of delegation path length? Allow user to specify part or all of delegation path.
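
Loop avoidance and a path-length limit can both be enforced by carrying the delegation path with the job. The following is a minimal sketch under that assumption; `Job`, `delegate`, `DelegationError`, and the `max_hops` field are hypothetical names, not part of Condor-G.

```python
from dataclasses import dataclass, field
from typing import List


class DelegationError(Exception):
    """Raised when a hop would violate the delegation policy."""


@dataclass
class Job:
    job_id: str
    max_hops: int = 4  # hypothetical user-set limit on path length
    path: List[str] = field(default_factory=list)  # schedulers visited so far


def delegate(job: Job, scheduler: str) -> Job:
    """Record one more hop, refusing loops and over-long paths."""
    if scheduler in job.path:
        raise DelegationError(f"loop: {scheduler} already holds job {job.job_id}")
    if len(job.path) >= job.max_hops:
        raise DelegationError(f"job {job.job_id} reached its limit of {job.max_hops} hops")
    job.path.append(scheduler)
    return job
```

Because the path travels with the job, each scheduler can run the check locally before accepting a delegation.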
  • Slide 19
  • Active Job Control: User may request certain actions: hold, suspend, vacate, checkpoint. Actions cannot be completed synchronously for the user; they must be forwarded along the delegation path. User checks completion later.
  • Slide 20
  • Active Job Control (cont): Endpoint systems may not support actions. If possible, execute them at the furthest point that does support them. Allow user to apply action in middle of delegation path.
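
Choosing "the furthest point that does support them" amounts to walking the delegation path backwards from the endpoint. A small sketch, assuming each hop advertises the set of actions it supports; the `Hop` type, the hop names, and the capability sets are made up for illustration and say nothing about what any real system supports.

```python
from typing import FrozenSet, List, NamedTuple, Optional


class Hop(NamedTuple):
    name: str
    supported: FrozenSet[str]  # actions this hop can carry out itself


def furthest_supporting_hop(path: List[Hop], action: str) -> Optional[Hop]:
    """Return the hop closest to the endpoint that supports the action."""
    for hop in reversed(path):  # start at the endpoint, walk toward the user
        if action in hop.supported:
            return hop
    return None  # no hop on the path can perform the action


# Hypothetical path, ordered from submit point to endpoint.
path = [
    Hop("submit-point", frozenset({"hold", "suspend", "vacate", "checkpoint"})),
    Hop("middle-hop", frozenset({"hold", "suspend"})),
    Hop("endpoint", frozenset({"hold"})),
]
```

With this path, a hold request would be executed at the endpoint, while a checkpoint request would have to be handled back at the submit point.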
  • Slide 21
  • Revocation: Leases. Lease must be renewed periodically for delegation to remain valid. Allows revocation during long-term failures. What are good values for lease lifetime and update interval?
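
The lease mechanism can be sketched as a small record with an expiry time that the delegating side pushes forward on each renewal. This is an illustration only: the `Lease` class and its methods are hypothetical, and the numbers in the usage note are arbitrary, since the slide leaves the right lifetime and update interval as an open question. Time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass


class LeaseExpired(Exception):
    """Raised when a renewal arrives after the lease has already lapsed."""


@dataclass
class Lease:
    lifetime: float    # seconds the delegation stays valid without renewal
    expires_at: float  # absolute time at which the delegation lapses

    @classmethod
    def grant(cls, now: float, lifetime: float) -> "Lease":
        return cls(lifetime=lifetime, expires_at=now + lifetime)

    def is_valid(self, now: float) -> bool:
        return now < self.expires_at

    def renew(self, now: float) -> None:
        # A lapsed lease means the delegation was revoked; it cannot be revived.
        if not self.is_valid(now):
            raise LeaseExpired(f"lease lapsed at t={self.expires_at}")
        self.expires_at = now + self.lifetime
```

For example, a lease granted at t=0 with a 600-second lifetime and renewed at t=300 stays valid until t=900. If the delegator is unreachable for longer than the lifetime, the delegate sees `is_valid` turn false and can treat the job as revoked without being told.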
  • Slide 22
  • Error Handling and Debugging: Many more places for things to go horribly wrong. Need clear, simple error semantics. Logs, logs, logs: have them everywhere.
  • Slide 23
  • Current Status: Done: mirroring. In progress: Condor-G -> Condor-G delegation (user must specify hops); glide-in schedd (set up by hand).
  • Slide 24
  • Thank You! Questions?
