Condor: A Project and a System

Condor Project
Computer Sciences Department
University of Wisconsin-Madison
condor-admin@cs.wisc.edu
http://www.cs.wisc.edu/condor

Scientific Data Intensive Computing Workshop ‘04
Microsoft Research
May 2004



Outline

› What is the Condor Project?

› What is the Condor HTC Software?

› Recipe for using desktops for science

› Data!


The Condor Project (Established ‘85)

Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students who:

› face software engineering challenges in a heterogeneous distributed environment,

› are involved in national and international grid collaborations,

› actively interact with academic and commercial users,

› maintain and support large distributed production environments, and

› educate and train students.

Funding – US Govt. (DoD, DoE, NASA, NSF, NIH), AT&T, IBM, INTEL, Microsoft, UW-Madison, …


A Multifaceted Project

› Harnessing the power of clusters - opportunistic and/or dedicated (Condor)

› Job management services for Grid applications (Condor-G, Stork)

› Fabric management services for Grid resources (Condor, GlideIns, NeST)

› Distributed I/O technology (Parrot, Kangaroo, NeST)

› Job-flow management (DAGMan, Condor, Hawk)

› Distributed monitoring and management (HawkEye)

› Technology for Distributed Systems (ClassAd, MW)

› Packaging and Integration (NMI, VDT)


Outline

› What is the Condor Project?

› What is the Condor HTC Software?

› Recipe for using desktops for science

› Data!


What is Condor?

Condor converts collections of distributively owned workstations and dedicated clusters into a distributed fault-tolerant high-throughput computing (HTC) facility.

› Distributed Ownership: the decrease in the cost-performance ratio caused
  • a huge increase in an organization's aggregate computing capacity
  • a much smaller increase in the capacity accessible by a single person

› HTC
  • Large amounts of processing capacity sustained over very long time periods


Condor can manage a large number of jobs

› Managing a large number of jobs
  • You specify the jobs in a file and submit them to Condor, which runs them all and keeps you notified of their progress
  • Mechanisms to help you manage huge numbers of jobs (1000s), the data, etc.
  • Condor can handle workflow / inter-job dependencies (DAGMan)
  • Condor users can set job priorities
  • Condor administrators can set user priorities


Condor can manage Dedicated Resources…

› Dedicated Resources
  • Compute Clusters

› Manage
  • Node monitoring, scheduling
  • Job launch, monitor & cleanup


…and Condor can manage non-dedicated resources

› Non-dedicated resources examples:
  • Desktop workstations in offices
  • Workstations in student labs

› Non-dedicated resources are often idle --- ~70% of the time!

› Condor can effectively harness the otherwise wasted compute cycles from non-dedicated resources


Some HTC Challenges

› Condor does whatever it takes to run your jobs, even if some machines…
  • Crash (or are disconnected)
  • Run out of disk space
  • Don’t have your software installed
  • Are frequently needed by others
  • Are far away & managed by someone else


The Condor System

› Unix and Win2k/XP

› Operational since 1986

› Just at UW: more than 1800 CPUs in 10 pools on our campus

› Software available free on the web
  • Open license

› Adopted by the “real world” (Galileo, Maxtor, Micron, Oracle, Tigr, Xerox, NASA, Texas Instruments, … )


Downloads and Deployments


Outline

› What is the Condor Project?

› What is the Condor HTC Software?

› Recipe for using desktops for science

› Data!


Recipe Tip: Useful Distributed Ownership mechanisms in Condor

› Checkpoint / Migration
  • Checkpoint == picture of process state
  • Enables preempt/resume scheduling and migration, ensures forward progress

› Remote System Calls
  • Redirect I/O and other system calls back to the submit machine.

› Matchmaking with ClassAds
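
The checkpoint bullet above can be illustrated with a small sketch. This is only an application-level analogy written in Python: Condor checkpoints the state of the process itself, whereas here the program saves its own state explicitly. The function names, the checkpoint file name, and the `preempt_at` parameter are all hypothetical and exist only to simulate a preemption.

```python
import os
import pickle
import tempfile


def run(total_steps, ckpt_path, preempt_at=None, every=10):
    """Sum 0..total_steps-1, saving a checkpoint every `every` steps."""
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "acc": 0}

    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            # Simulated preemption: work done since the last checkpoint is lost.
            return None
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            # Write to a temporary file first so a crash never leaves
            # a half-written checkpoint behind.
            tmp_path = ckpt_path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp_path, ckpt_path)
    return state["acc"]


def demo():
    """Preempt the job at step 57, then resume it from the step-50 checkpoint."""
    ckpt_path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    assert run(100, ckpt_path, preempt_at=57) is None
    return run(100, ckpt_path)
```

Because the second call picks up from the saved state instead of step 0, the job makes forward progress even when it is evicted, which is the property the slide attributes to checkpointing.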


ClassAds

› Set of bindings of Attribute Names to Expressions

› Self-describing (no separate schema)

› Combine query and data

› Arbitrarily composed and nested

› Bilateral

Resource owners are generous if it doesn’t cost them anything!


Examples

[
  Type = "Job";
  Owner = "raman";
  Cmd = "run_sim";
  Args = "-Q 17 3200";
  Cwd = "/u/raman";
  Memory = 31;
  Qdate = 886799469;
  ...
  Rank = other.Kflops...
  Requirements = other.Type = ...
]

[
  Type = "Machine";
  Name = "xxy.cs. ...";
  Arch = "iX86";
  OpSys = "Solaris";
  Mips = 104;
  Kflops = 21893;
  State = "Unclaimed";
  LoadAvg = 0.042969;
  ...
  Rank = ...;
  Requirements = ...;
]


Attribute Expressions

› Constants 104, 0.042969, "iX86"

› References attr, self.attr, other.attr, expr.attr

› Operators +, *, >>, <, >=, &&, ...

› Functions strcat, substr, floor, member, ...

› Lists { expr, expr, ... }

› ClassAds [ name=expr; name=expr; ... ]


Examples

› Descriptive attributes

  Type = "Job";
  Owner = "raman";
  Arch = "iX86";
  OpSys = "Solaris";
  Memory = 64;    // megabytes
  Disk = 323496;  // k bytes


Examples

› Current state

  Daytime = 36017;      // secs past midnight
  KeyboardIdle = 1432;  // seconds
  State = "Unclaimed";
  LoadAvg = 0.042969;


Examples

› Parameters

  ResearchGrp = { "raman", "miron", "solomon", "jbasney" };
  Friends = { "tannenba", "wright" };
  Untrusted = { "rival", "riffraff" };
  WantCheckpoint = 1;


Examples

› Derived data

  Rank =    // machine's rank for job
    10 * member(other.Owner, ResearchGrp)
    + member(other.Owner, Friends);

  Rank =    // job's rank for machine
    Kflops/1E3 + other.Memory/32;
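
To make the arithmetic of the two Rank expressions concrete, here is a minimal Python sketch. It is not ClassAd code: `member`, `machine_rank`, and `job_rank` are hypothetical helpers, and treating `member` as returning 1 or 0 is an assumption made so that it can be multiplied as on the slide. The lists and sample values come from the earlier example slides.

```python
RESEARCH_GRP = ["raman", "miron", "solomon", "jbasney"]
FRIENDS = ["tannenba", "wright"]


def member(item, items):
    """Assumed numeric form of member(): 1 if item is in the list, else 0."""
    return 1 if item in items else 0


def machine_rank(job_owner):
    """Machine's rank for a job: research group members outrank friends."""
    return 10 * member(job_owner, RESEARCH_GRP) + member(job_owner, FRIENDS)


def job_rank(kflops, memory_mb):
    """Job's rank for a machine: prefers faster machines with more memory."""
    return kflops / 1e3 + memory_mb / 32
```

With these definitions a job owned by "raman" ranks 10 on the machine, a job owned by "wright" ranks 1, and any other owner ranks 0. A machine with Kflops = 21893 and Memory = 64 gets a job-side rank of about 23.9.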


Examples

› Job constraint

  Requirements =
    other.Type == "Machine"
    && Arch == "iX86"
    && OpSys == "Solaris"
    && Disk > 10000
    && other.Memory >= self.Memory;


Examples

› Machine constraint

  Requirements =
    ! member(other.Owner, Untrusted)
    && Rank >= 10 ? true
    : Rank > 0 ? (LoadAvg < 0.3 && KeyboardIdle > 15*60)
    : DayTime < 6*60*60 || DayTime > 18*60*60;


Matching Algorithm

› To match two ads A and B
  • Set up environment such that in A
      • self evaluates to A
      • other evaluates to B
      • other attributes are searched for first in A and then in B
    and vice versa (with A and B interchanged)
  • Check if A.Requirements and B.Requirements both evaluate to true
  • A.Rank and B.Rank for preferences
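
The matching steps above can be sketched in Python. This is an illustration under stated assumptions, not Condor's matchmaker: ads are plain dictionaries, expressions are Python lambdas rather than ClassAd syntax, and the names `Env`, `evaluate`, `matches`, and `best_match` are hypothetical. A missing attribute raises `KeyError` here instead of evaluating to UNDEFINED.

```python
from typing import Any, Dict, List, Optional

Ad = Dict[str, Any]


class Env:
    """Name lookup for one ad while it is evaluated against another."""

    def __init__(self, self_ad: Ad, other_ad: Ad) -> None:
        self.self_ad = self_ad
        self.other_ad = other_ad

    def me(self, name: str) -> Any:       # self.attr
        return self.self_ad[name]

    def other(self, name: str) -> Any:    # other.attr
        return self.other_ad[name]

    def attr(self, name: str) -> Any:     # bare attr: own ad first, then the other
        if name in self.self_ad:
            return self.self_ad[name]
        return self.other_ad[name]


def evaluate(ad: Ad, other: Ad, name: str) -> Any:
    """Evaluate attribute `name` of `ad` with `other` as the candidate match."""
    value = ad[name]
    return value(Env(ad, other)) if callable(value) else value


def matches(a: Ad, b: Ad) -> bool:
    """Bilateral match: both Requirements must evaluate to true."""
    return (evaluate(a, b, "Requirements") is True
            and evaluate(b, a, "Requirements") is True)


def best_match(job: Ad, machines: List[Ad]) -> Optional[Ad]:
    """Among matching machines, pick the one the job ranks highest."""
    candidates = [m for m in machines if matches(job, m)]
    return max(candidates, key=lambda m: evaluate(job, m, "Rank"), default=None)


JOB: Ad = {
    "Type": "Job", "Owner": "raman", "Memory": 31,
    "Requirements": lambda e: (e.other("Type") == "Machine"
                               and e.attr("Arch") == "iX86"
                               and e.attr("OpSys") == "Solaris"
                               and e.attr("Disk") > 10000
                               and e.other("Memory") >= e.me("Memory")),
    "Rank": lambda e: e.attr("Kflops") / 1e3 + e.other("Memory") / 32,
}

MACHINE: Ad = {
    "Type": "Machine", "Arch": "iX86", "OpSys": "Solaris",
    "Memory": 64, "Disk": 323496, "Kflops": 21893,
    "Untrusted": ["rival", "riffraff"],
    "Requirements": lambda e: e.other("Owner") not in e.me("Untrusted"),
}
```

Calling `matches(JOB, MACHINE)` succeeds because each side's Requirements holds when evaluated against the other. Changing the job's owner to "rival" or its Memory to 128 makes one side fail, so the pair no longer matches.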


Three-valued Logic

› All of the following are UNDEFINED if other has no "Memory" attribute:

  other.Memory > 32
  other.Memory == 32
  other.Memory != 32
  !(other.Memory == 32)

› The following is TRUE if either attribute exists and satisfies the given condition:

  other.Mips >= 10 || other.Kflops >= 1000
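
A minimal Python sketch of the behaviour described above, assuming a sentinel object stands in for UNDEFINED. The helper names (`attr`, `compare`, `tv_not`, `tv_or`) are hypothetical and are not part of any ClassAd library. Only the operators shown on the slide are modelled.

```python
import operator


class _Undefined:
    """Marker for the value of a missing attribute."""

    def __repr__(self) -> str:
        return "UNDEFINED"


UNDEFINED = _Undefined()


def attr(ad, name):
    """Look up an attribute; a missing one evaluates to UNDEFINED."""
    return ad.get(name, UNDEFINED)


def compare(op, left, right):
    """Comparisons are UNDEFINED as soon as either operand is UNDEFINED."""
    if left is UNDEFINED or right is UNDEFINED:
        return UNDEFINED
    return op(left, right)


def tv_not(x):
    """Negating UNDEFINED stays UNDEFINED."""
    return UNDEFINED if x is UNDEFINED else (not x)


def tv_or(x, y):
    """TRUE if either side is TRUE, even when the other side is UNDEFINED."""
    if x is True or y is True:
        return True
    if x is UNDEFINED or y is UNDEFINED:
        return UNDEFINED
    return False
```

For an ad such as `{"Kflops": 21893}`, every comparison on `Memory` comes out UNDEFINED, while the `Mips`/`Kflops` disjunction is still TRUE because the `Kflops` side is satisfied.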


Recipe Tip: Build from Bottom up!

› Start with a service for a single user, on a single machine.

› “Personal Condor”
  • Condor on your own workstation, no local system/root access required, no system administrator intervention needed


[Figure: your workstation (personal Condor, 600 Condor jobs)]


Personal Condor?!

What’s the benefit of a Condor “Pool” with just one user and one machine?


Your Personal Condor will ...

› … keep an eye on your jobs and will keep you posted on their progress

› … implement your policy on the execution order of the jobs

› … keep a log of your job activities

› … add fault tolerance to your jobs

› … implement your policy on when the jobs can run on your workstation


Expand from your desktop…

› Build a Condor pool inside your organization
  • Install Condor on multiple machines, pointing them to your initial machine as the manager.

› Utilize Condor resources at remote organizations (“build a grid”)
  • Takes advantage of your Condor-using friends…
  • Get permission to access their resources
  • Then configure your Condor pool to “flock” to these pools
  • Accounting system is flocking aware


[Figure: your workstation (personal Condor, 600 Condor jobs), Condor Pool, Friendly Condor Pool]


Condor-G

› What about resources at remote organizations that are NOT managed via Condor? (perhaps they are managed via PBS, SGE, LSF, …)

› Condor-G
  • Job task-broker for Grid Middleware.
  • Submit jobs to resources managed via grid middleware such as Globus (GT2 & GT3), Nordugrid, Unicore, or Oracle (or Condor)
  • Oracle: run PL/SQL programs on Oracle just like a normal job, via transactions, put in DAGs, etc.


Condor GlideIn

› Problems
  • What if the grid middleware or remote scheduler doesn’t provide services I want?
  • What about end-to-end semantic guarantees?

› Solution
  • Submit the Condor daemons to remote schedulers instead of the job
  • When the resources run these GlideIn jobs, they will temporarily join your Condor Pool, and run the job as usual.


[Figure: your workstation (personal Condor, 600 Condor jobs), Condor Pool, Friendly Condor Pool, and a Globus Grid of PBS, LSF, and Condor resources running glide-in jobs]


Outline

› What is the Condor Project?

› What is the Condor HTC Software?

› Recipe for using desktops for science

› Data!
  • Harmonize computation w/ data storage and data movement.


Data Movement: Stork

› Scheduler for wide-area data transfer

› Condor historically focused on CPU allocation
  • Data movement was implicit side-effect

› Stork elevates data movement to be a “first class citizen”
  • Data movement is another type of node within a job dependency graph
  • Data movement is now queued, scheduled, monitored, managed, check-pointed
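
The idea of data movement as just another node type in a job dependency graph can be sketched as follows. This is a Python illustration using the standard-library `graphlib` module (Python 3.9+), not DAGMan or Stork syntax. The node names, the "transfer" and "compute" labels, and the `schedule` function are hypothetical.

```python
from graphlib import TopologicalSorter

# node name -> (kind, names of the nodes it depends on)
NODES = {
    "fetch_input":  ("transfer", set()),
    "fetch_params": ("transfer", set()),
    "simulate":     ("compute",  {"fetch_input", "fetch_params"}),
    "store_output": ("transfer", {"simulate"}),
}


def schedule(nodes):
    """Return (name, kind) pairs in an order that respects every dependency."""
    sorter = TopologicalSorter({name: deps for name, (_, deps) in nodes.items()})
    return [(name, nodes[name][0]) for name in sorter.static_order()]
```

Both fetches are ordered before `simulate`, and `store_output` comes last. Because transfers sit in the same graph as compute jobs, a scheduler can queue, retry, or monitor them with the same machinery.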


Data Access: Parrot

Useful in distributed batch systems where one has access to many CPUs, but no consistent distributed filesystem (BYOFS!).

Works with legacy programs

% gv /gsiftp/www.cs.wisc.edu/condor/doc/usenix_1.92.ps
% grep Yahoo /http/www.yahoo.com
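
The two commands above suggest a naming convention in which the first path component names the protocol and the rest names the remote host and path. The Python sketch below illustrates only that convention. It does not show how Parrot intercepts a program's I/O, and `path_to_url` and `KNOWN_SCHEMES` are hypothetical names. The scheme set is limited to the two protocols that appear in the examples.

```python
from typing import Optional

# Only the protocols seen in the examples above; a real deployment may differ.
KNOWN_SCHEMES = {"http", "gsiftp"}


def path_to_url(path: str) -> Optional[str]:
    """Map '/<scheme>/<host>/<rest>' to '<scheme>://<host>/<rest>'.

    Returns None for ordinary local paths.
    """
    parts = path.split("/", 2)  # '/http/www.yahoo.com' -> ['', 'http', 'www.yahoo.com']
    if len(parts) < 3 or parts[0] != "" or parts[1] not in KNOWN_SCHEMES:
        return None
    return f"{parts[1]}://{parts[2]}"
```

Under this assumption `/http/www.yahoo.com` corresponds to `http://www.yahoo.com`, while a path such as `/etc/passwd` is left alone as a local file.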


Data Storage: NeST

› Storage management software

› Complementary piece of Condor software; adds storage management to the traditional CPU management

› Key features
  • User level
  • Guaranteed storage reservations that allow higher-level scheduling and planning (e.g. Stork)
  • Flexible, extendible protocol layer allows easy integration with existing middle-ware and applications
  • Easily deployable via glide-in


Gliding-in storage mgmt

› Practical and easily deployable
  • User-level; requires no privilege
  • Package NeST as standard batch jobs

› Result: Managed storage

› General; glide-in works everywhere

[Figure: Internet; eight SGE nodes, each with a NeST; Home store]


BirdBath

› SOAP Interfaces to Condor Services
  • LBNL: Workflow, ZSI (soon? LIGO, Laser Interferometer Gravitational-Wave Observatory)
  • IU: Portals
  • UK College of London | Cambridge: .NET


The Idea

Computing power is everywhere, we try to make it usable by anyone.


Thank you!

Condor Project on the Web: http://www.cs.wisc.edu/condor

Email: condor-admin@cs.wisc.edu