Condor COD (Computing On Demand)
Condor Week 5/5/2003

Derek Wright
Computer Sciences Department, UW-Madison
Lawrence Berkeley National Labs (LBNL)
[email protected]
http://www.cs.wisc.edu/condor
http://sdm.lbl.gov

www.cs.wisc.edu/condor

What problem are we trying to solve?

› Some people want to run interactive, yet compute-intensive applications

› Jobs that take lots of compute power over a relatively short period of time

› They want to use batch computing resources, but need them right away

› Ideally, when they’re not in use, resources would go back to the batch system

Some example applications:

› A distributed build/compilation of a large software system

› A very complex spreadsheet that takes a lot of cycles when you press “recalculate”

› High-energy physics (HEP) “analysis” jobs

› Visualization tools for data-mining, rendering graphics, etc.

Example application for COD

[Diagram: User’s Workstation (controller application, display); Compute Farm (batch jobs, on-demand workers, idle nodes); data]

What’s the Condor solution?

› Condor COD: “Computing on Demand”
  › Use Condor to manage the batch resources when they’re not in use by the interactive jobs
  › Allow the interactive jobs to come in with high priority and run instead of the batch job on any given resource

Why did we have to change Condor for that?

› Doesn’t Condor already notice when an interactive job starts on a CPU?

› Doesn’t Condor already provide checkpointing when that happens?

› Can’t I configure Condor to run whatever jobs I want with a higher priority on my own machines?

Well, yes… But that’s not good enough…

› Not all jobs can be checkpointed, and even those that can be checkpointed take some time to do it…

› We want this to be instantaneous, not waiting for the batch system to schedule tasks…

› You can configure Condor to run higher-priority jobs, but the other jobs are kicked off the machine…

What’s new about COD?

› “Checkpoint to swap space”
  › When a high-priority COD job appears, the lower-priority batch job is suspended
  › The COD job can run right away, while the batch job is suspended
  › Batch jobs (even those that can’t checkpoint) can resume instantly once there are no more active COD jobs

But wait, there’s more…

› The condor_startd can now manage multiple “claims” on each resource
  › If any COD claim becomes active, the regular Condor claim is automatically suspended
  › Without an active COD, regular claim resumes

› There is a new command-line tool to request, activate, suspend, resume and release these claims

› There’s even a C++ object to do all of that, if you really want it…
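
The claim-handling rule on this slide can be modeled in a few lines. The sketch below is a toy Python model, not Condor code: the `Slot` class, its method names, and the claim IDs are all hypothetical. It only illustrates the rule that the regular claim runs exactly when no COD claim on the same resource is active.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    """Toy model of one resource managed by a startd (hypothetical, not Condor's API)."""
    has_regular_claim: bool = True
    # Maps a made-up COD claim ID to True (active) or False (idle or suspended).
    cod_claims: dict = field(default_factory=dict)

    def request_cod(self, claim_id: str) -> None:
        # A newly requested claim is held in reserve but runs nothing yet.
        self.cod_claims[claim_id] = False

    def set_active(self, claim_id: str, active: bool) -> None:
        if claim_id not in self.cod_claims:
            raise KeyError(f"unknown COD claim: {claim_id}")
        self.cod_claims[claim_id] = active

    @property
    def regular_claim_state(self) -> str:
        if not self.has_regular_claim:
            return "none"
        # Any active COD claim suspends the regular claim; otherwise it runs.
        return "suspended" if any(self.cod_claims.values()) else "running"


slot = Slot()
slot.request_cod("cod-1")
slot.request_cod("cod-2")
print(slot.regular_claim_state)   # running: both COD claims are idle
slot.set_active("cod-1", True)
slot.set_active("cod-2", True)
print(slot.regular_claim_state)   # suspended: two COD claims are active
slot.set_active("cod-1", False)
print(slot.regular_claim_state)   # suspended: cod-2 is still active
slot.set_active("cod-2", False)
print(slot.regular_claim_state)   # running: no active COD claims remain
```

Deriving the regular claim's state from the COD claims, rather than storing it separately, keeps the two from drifting apart in the model.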

COD claim-management commands

› Request: authorizes the user and returns a unique claim ID for future commands

› Activate: spawns an application on a given COD claim, with various options to define the application, job ID, etc.
  › Suspends any regular Condor job
  › You can have multiple COD claims on a single resource, and they can all be running simultaneously

COD commands (cont’d)

› Suspend: Given COD claim is suspended
  › If there are no more active COD claims, a regular Condor batch job can now run

› Resume: Given COD claim is resumed, suspending the Condor batch job (if any)

› Deactivate: Kill the application but hold onto the COD claim

› Release: Get rid of the COD claim itself
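
The six commands can be read as transitions in a small state machine. The following Python sketch is one possible reading, not the startd's actual logic. The state names Idle, Running and Suspended are taken from the condor_status listing later in this talk. "Unclaimed" and "Released" are labels invented for the sketch, and the exact set of allowed transitions (for example, deactivating a suspended claim) is an assumption.

```python
# Allowed transitions, inferred from the command descriptions in this talk.
# Keys are (current state, command); values are the resulting state.
# "Unclaimed" and "Released" are labels made up for this sketch.
TRANSITIONS = {
    ("Unclaimed", "request"): "Idle",
    ("Idle", "activate"): "Running",
    ("Running", "suspend"): "Suspended",
    ("Suspended", "resume"): "Running",
    ("Running", "deactivate"): "Idle",
    ("Suspended", "deactivate"): "Idle",   # assumption: not stated in the talk
    ("Idle", "release"): "Released",
    ("Running", "release"): "Released",
    ("Suspended", "release"): "Released",
}


def run_commands(commands, state="Unclaimed"):
    """Apply commands in order and return the list of states visited."""
    visited = [state]
    for command in commands:
        key = (state, command)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {command} a claim in state {state}")
        state = TRANSITIONS[key]
        visited.append(state)
    return visited


# The command sequence used in the fractal walkthrough that follows.
print(run_commands(["request", "activate", "suspend", "resume", "release"]))
```

Keeping the transitions in a table makes it easy to see which commands are legal in each state and to reject the rest with a clear error.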

COD command protocol

› All commands use ClassAds
  › Allows for a flexible protocol
  › Excellent error propagation
  › Can use existing ClassAd technology

› Similar to existing Condor protocol
  › Separation of claiming from activation, so you can have hot-spares, etc.

How does all of that solve the problem?

› The interactive COD application starts up, and goes out to claim some compute nodes

› Once the helper applications are in place and ready, these COD claims are suspended, allowing batch jobs to run

› When the interactive application has work, it can instantly suspend the batch jobs and resume the COD applications to perform the computations
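
From the controller's side, this pattern amounts to wrapping each computation in a resume/suspend pair. The sketch below shows one way to express that in Python with a context manager. Everything in it is hypothetical: `send` only records what would be sent, the claim IDs are made up (in the real tools the ID comes back from the request), and the computation is a placeholder.

```python
from contextlib import contextmanager

issued = []  # log of (command, claim_id) pairs, standing in for real COD commands


def send(command, claim_id):
    """Hypothetical stand-in for sending one COD command to a compute node."""
    issued.append((command, claim_id))


@contextmanager
def burst(claim_ids):
    """Resume every COD claim for the duration of a computation, then suspend again."""
    for claim_id in claim_ids:
        send("resume", claim_id)
    try:
        yield
    finally:
        # Suspend even if the computation fails, so batch jobs get the nodes back.
        for claim_id in claim_ids:
            send("suspend", claim_id)


claims = ["claim-a", "claim-b"]  # made-up claim IDs, pre-chosen for simplicity

# Start-up: claim the nodes, start the helpers, then park them.
for claim_id in claims:
    send("request", claim_id)
    send("activate", claim_id)
    send("suspend", claim_id)

# Each time the user asks for work, the helpers run only inside the burst.
with burst(claims):
    result = sum(range(10))  # placeholder for the real computation

print(result)
print(issued[-4:])
```

The `finally` clause is the design point: whatever happens during the burst, the COD claims go back to the suspended state and the batch jobs can continue.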

Step 1: Initial state

[Diagram: User’s Workstation (shell prompt “%”), Compute Farm (batch jobs, idle nodes)]

Step 2: Application spawned

[Diagram: User’s Workstation (controller application spawned), Compute Farm (batch jobs, idle nodes)]

% fractal-gen -n 4

Step 3: Compute node setup

[Diagram: User’s Workstation, Compute Farm (batch jobs, idle nodes, on-demand workers); arrows labeled “request” and “activate”]

Claiming and initializing [4] compute nodes for rendering…
Got reply from:
c1.cluster.org
c6.cluster.org
c14.cluster.org
c17.cluster.org
SUCCESS!

Step 3: Commands used

% condor_cod_request -name c1.cluster.org \
    -classad c1.out
Successfully sent CA_REQUEST_CLAIM to startd at <128.105.143.14:55642>
Result ClassAd written to c1.out
ID of new claim is: "<128.105.143.14:55642>#1051656208#2"

% condor_cod_activate -keyword fractgen \
    -id "<128.105.143.14:55642>#1051656208#2"
Successfully sent CA_ACTIVATE_CLAIM to startd at <128.105.143.14:55642>

% …
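
Every later command takes the claim ID printed by the request, so a script driving these tools has to capture it. The Python sketch below pulls the ID out of the request output shown on this slide. It assumes the output format is exactly as in the transcript, and the function name `extract_claim_id` is made up for illustration.

```python
import re

# Output text copied from the transcript on this slide.
REQUEST_OUTPUT = """\
Successfully sent CA_REQUEST_CLAIM to startd at <128.105.143.14:55642>
Result ClassAd written to c1.out
ID of new claim is: "<128.105.143.14:55642>#1051656208#2"
"""

# Accept straight or curly quotes around the ID.
CLAIM_ID_LINE = re.compile(r'ID of new claim is:\s*["“](.+?)["”]')


def extract_claim_id(output: str) -> str:
    """Return the claim ID printed by the request command, or raise ValueError."""
    match = CLAIM_ID_LINE.search(output)
    if match is None:
        raise ValueError("no claim ID found in output")
    return match.group(1)


claim_id = extract_claim_id(REQUEST_OUTPUT)
print(claim_id)  # <128.105.143.14:55642>#1051656208#2
```

The request also writes its result ClassAd to the file named with -classad, which may be a sturdier source for the ID than the printed text; the file's contents are not shown in this talk, so the sketch sticks to the printed line.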

Step 4: “Checkpoint” to swap

[Diagram: User’s Workstation, Compute Farm (batch jobs, idle nodes, suspended worker); arrow labeled “suspend”]

SELECT FRACTAL TYPE
<Mandelbrot>
(more user input…)

Step 4: Commands used

› Rendering application on each COD node is suspended while interactive tool waits for input

› The resources are now available for batch Condor jobs

% condor_cod_suspend \
    -id "<128.105.143.14:55642>#1051656208#2"
Successfully sent CA_SUSPEND_CLAIM to startd at <128.105.143.14:55642>

% …

Step 5: Batch jobs can run

[Diagram: User’s Workstation, Compute Farm (batch jobs, idle nodes), batch queue]

SPECIFY PARAMETERS
max_iterations: 400000
TL: -0.65865, -0.56254
BR: -0.45865, -0.71254
(more user input…)

Step 6: Computation burst

[Diagram: User’s Workstation, Compute Farm (batch jobs, idle nodes, suspended batch job, interactive workers, on-demand workers); arrow labeled “resume”]

CLICK <RENDER> TO VIEW YOUR FRACTAL…
RENDER

Step 6: Commands used

› Batch Condor jobs on COD nodes are suspended

› All COD rendering applications are resumed on each node

% condor_cod_resume \
    -id "<128.105.143.14:55642>#1051656208#2"
Successfully sent CA_RESUME_CLAIM to startd at <128.105.143.14:55642>

% …

Step 7: Results produced

[Diagram: User’s Workstation (display), Compute Farm (batch jobs, idle nodes, suspended batch job, interactive workers, on-demand workers); data]

Step 8: User input while batch work resumes

[Diagram: User’s Workstation, Compute Farm (batch jobs, idle nodes, suspended worker); arrow labeled “suspend”]

ZOOM BOX COORDINATES:
TL = -0.60301, -0.61087
BR = -0.58037, -0.62785

Step 9: Computation burst #2

[Diagram: User’s Workstation (display), Compute Farm (batch jobs, idle nodes, suspended batch job, interactive workers, on-demand workers); arrow labeled “resume”; data]

RENDER

Step 10: Clean-up

[Diagram: User’s Workstation, Compute Farm (batch jobs, idle nodes); arrow labeled “release”]

REALLY QUIT? Y/N
Releasing compute nodes…
4 nodes terminated successfully!

Step 10: Commands used

› The jobs are cleaned up, claims released, and resources returned to batch system

% condor_cod_release \
    -id "<128.105.143.14:55642>#1051656208#2"
Successfully sent CA_RELEASE_CLAIM to startd at <128.105.143.14:55642>
State of claim when it was released: "Running"

% …

Other changes for COD:

› The condor_starter has been modified so that it can run jobs without communicating with a condor_shadow
  › All the great job control features of the starter without a shadow
  › Starter can write its own UserLog
  › Other useful features for COD

condor_status -cod

› New “-cod” option to condor_status to view COD claims in a Condor pool:

Name         ID    ClaimState  TimeInState  RemoteUser  JobId  Keyword
astro.cs.wi  COD1  Idle        0+00:00:04   wright
chopin.cs.w  COD1  Running     0+00:02:05   wright      3.0    fractgen
chopin.cs.w  COD2  Suspended   0+00:10:21   wright      4.0    fractgen

             Total  Idle  Running  Suspended  Vacating  Killing
INTEL/LINUX      3     1        1          1         0        0
      Total      3     1        1          1         0        0
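
The summary rows are simply a count of the claim rows by ClaimState. As a small sanity check, the Python sketch below tallies the three claim rows from this slide. It assumes whitespace-separated columns with the state in the third field; the function name is made up, and this is not how condor_status computes its totals internally.

```python
from collections import Counter

# Claim rows copied from the listing on this slide.
LISTING = """\
astro.cs.wi COD1 Idle 0+00:00:04 wright
chopin.cs.w COD1 Running 0+00:02:05 wright 3.0 fractgen
chopin.cs.w COD2 Suspended 0+00:10:21 wright 4.0 fractgen
"""


def tally_claim_states(listing: str) -> Counter:
    """Count claims per ClaimState, assuming the state is the third field of each row."""
    states = Counter()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            states[fields[2]] += 1
    return states


counts = tally_claim_states(LISTING)
print(sum(counts.values()), counts["Idle"], counts["Running"], counts["Suspended"])
# prints "3 1 1 1", matching the Total row
```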

What else could I use all these new features for?

› Short-running system administration tasks that need quick access but shouldn’t disturb the jobs in your batch system

› A “Grid Shell”
  › A condor_starter that doesn’t need a condor_shadow is a powerful job management environment that can monitor a job running under a “hostile” batch system on the grid

Future work

› More ways to tell COD about your application
  › For now, you define important attributes in your condor_config file and pre-stage the executables

› Ability to transfer files to and from a COD job at a remote machine
  › We’ve already got the functionality in Condor, so why rely on a shared filesystem or pre-staging?

More future work

› Accounting for COD jobs

› Working with some real-world applications and integrating these new COD features
  › Would the real users please stand up?

› Better “Grid Shell” support
  › This is really a separate-yet-related area of work…

How do you use COD?

› Upgrade to Condor version 6.5.3 or greater…
  › COD is already included

› There will be a new section in the Condor manual (coming soon)

› If you need more help, ask the ever-helpful [email protected]

› Find me at the BoF on Wednesday, 9am to Noon (room TBA)