Condor

These lecture notes were borrowed from Barry Wilkinson at UNC Charlotte.

Condor

First developed at the University of Wisconsin-Madison in the mid 1980's to convert a collection of distributed workstations and clusters into a high-throughput computing facility.

Key concept - using wasted computer power of idle workstations.

Uses

Consider the following scenario:
− I have a simulation that takes two hours to run on my high-end computer.
− I need to run it 1000 times with slightly different parameters each time.
− If I do this on one computer, it will take at least 2000 hours (or about 3 months).

From: “Condor: What it is and why you should worry about it,” by B. Beckles, University of Cambridge, Seminar, June 23, 2004

− Suppose my department has 100 PCs like mine that are mostly sitting idle overnight (say 8 hours a day).
− If I could use them when their legitimate users are not using them, so that I do not inconvenience them, I could get about 800 CPU hours/day.
− This is an ideal situation for Condor.

I could do my simulations in 2.5 days.

From: “Condor: What it is and why you should worry about it,” by B. Beckles, University of Cambridge, Seminar, June 23, 2004
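The arithmetic behind the scenario can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes every idle hour is fully usable with no scheduling or transfer overhead.

```python
def serial_hours(runs, hours_per_run):
    """CPU hours needed to run every simulation on one computer."""
    return runs * hours_per_run


def pool_days(total_cpu_hours, machines, idle_hours_per_day):
    """Days needed when the work is spread over the idle time of a pool."""
    cpu_hours_per_day = machines * idle_hours_per_day
    return total_cpu_hours / cpu_hours_per_day


total = serial_hours(1000, 2)       # 1000 runs of 2 hours each = 2000 CPU hours
print(total)                        # hours on a single computer
print(total / 24)                   # the same figure in days of nonstop running
print(pool_days(total, 100, 8))     # 100 PCs, 8 idle hours each per day
```

With the numbers from the slides this gives 2000 CPU hours in total and 2.5 days on the pool, since 100 PCs idle for 8 hours supply 800 CPU hours per day.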

Condor Features

Include:
− Resource finder
− Batch queue manager
− Scheduler
− Checkpoint/restart
− Process migration

Intended to run jobs even if:

− Machines crash
− Disk space exhausted
− Software not installed
− Machines are needed by others
− Machines are managed by others
− Machines are far away

How does Condor work?

A collection of machines running Condor is called a pool.

Individual pools can be joined together in a process called flocking.

From: “Condor: What it is and why you should worry about it,” by B. Beckles, University of Cambridge, Seminar, June 23, 2004

Machine Roles

Machines have one or more of 4 roles:
− Central manager
− Submit machine (Submit host)
− Execution machine (Execute host)
− Checkpoint server

Central Manager

Resource broker for a pool.

Keeps track of which machines are available, what jobs are running, negotiates which machine will run which job, etc.

Only one central manager per pool.

Submit Machine

Machine which submits jobs to pool.

Must be at least one submit machine in a pool, and usually more than one.

Execute Machine

Machine on which jobs can be run.

Must be at least one execute machine in a pool, and usually more than one.

Checkpoint Server

Machine which stores all checkpoint files produced by jobs which checkpoint.

Can only be one checkpoint machine in a pool.

Optional to have a checkpoint machine.

Possible Configuration

A central manager.

Some machines that can only be submit hosts.

Some machines that can only be execute hosts.

Some machines that can be both submit and execute hosts.
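The role rules stated on the preceding slides (exactly one central manager, at least one submit machine, at least one execute machine, at most one checkpoint server) can be written down as a small consistency check. This is an illustrative sketch only: the role strings, host names and the `check_pool` function are hypothetical and are not part of Condor.

```python
from collections import Counter


def check_pool(machines):
    """Return a list of rule violations for a pool description.

    `machines` maps a host name to the set of roles it plays.
    """
    counts = Counter(role for roles in machines.values() for role in roles)
    problems = []
    if counts["central_manager"] != 1:
        problems.append("need exactly one central manager")
    if counts["submit"] < 1:
        problems.append("need at least one submit machine")
    if counts["execute"] < 1:
        problems.append("need at least one execute machine")
    if counts["checkpoint_server"] > 1:
        problems.append("at most one checkpoint server allowed")
    return problems


# Hypothetical pool matching the "possible configuration" slide.
pool = {
    "node0": {"central_manager"},
    "node1": {"submit"},
    "node2": {"execute"},
    "node3": {"submit", "execute"},
}
print(check_pool(pool))
```

For the pool above the list of problems is empty; adding a second central manager would produce one entry.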

Types of Jobs

Classified according to environment it provides. Currently seven environments:
− Standard
− Vanilla
− PVM
− MPI
− Globus
− Java
− Scheduler

Standard

For jobs compiled with Condor libraries.

Allows for checkpointing and remote system calls.

Must be single threaded.

Not available under Windows.

Checkpointing

Certain jobs can checkpoint, both periodically for safety and when interrupted.

If a checkpointed job is interrupted, it will resume at the last checkpointed state when it starts again.

Generally no change to source code - need to link Condor’s Standard Universe support library.
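The resume-from-last-checkpoint behavior can be illustrated with a toy program. Note the assumption: this sketch saves its own state explicitly with `pickle`, which is application-level checkpointing and not how Condor does it (the slides say Condor needs no source changes, only linking with its support library). It also checkpoints only periodically, not on interruption. The `run` function and its parameters are invented for the illustration.

```python
import os
import pickle
import tempfile


def run(total_steps, checkpoint_path, checkpoint_every=10, interrupt_at=None):
    """Sum 0..total_steps-1, saving state every `checkpoint_every` steps.

    If `interrupt_at` is given, the job stops at that step to simulate an
    interruption; work done since the last checkpoint is lost.
    """
    # Resume from the last checkpoint if one exists, otherwise start fresh.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < total_steps:
        if interrupt_at is not None and state["step"] == interrupt_at:
            return None  # simulated interruption
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            with open(checkpoint_path, "wb") as f:
                pickle.dump(state, f)
    return state["total"]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    run(100, path, interrupt_at=25)   # stops at step 25; last checkpoint is step 20
    print(run(100, path))             # restarts from step 20 and finishes
```

The second call does not start from zero: it reloads the state saved at step 20, repeats steps 20 to 24, and then completes the remaining work.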

Vanilla

For jobs that cannot be compiled with Condor libraries, and for shell scripts and Windows batch files.

No checkpointing or remote system calls.

PVM
For PVM programs.

MPI
For MPI programs (MPICH).

Both PVM and MPI are message-passing libraries used in message-passing programs.

Used for local clusters of computers.

MPI could be used in grid computing - we will talk about this later in the course.

Globus
For submitting jobs to resources managed by Globus (version 2.2 and higher).

Java
For Java programs (written for the Java Virtual Machine).

Scheduler
Used with DAG scheduled jobs, see later.

Submitting a job

Job submitted to “submit host” using condor_submit command.

Job described in “submit description” file.

Submit description file includes details such as given in an RSL file in Globus, i.e. the name of the executable, arguments, etc.

Condor Submit Description File

Describes job to Condor.
Used with condor_submit command.

Description File Example

# This is a comment, condor submit file
Universe = vanilla
Executable = /home/abw/condor/myProg
Input = myProg.stdin
Output = myProg.stdout
Error = myProg.stderr
Arguments = -arg1 -arg2
InitialDir = /home/abw/condor/assignment4
Queue

Submitting Multiple Jobs

Submit file can specify multiple jobs
− Example: Queue 500 will submit 500 jobs at once

Condor calls a group of jobs a cluster.

Each job within a cluster is called a process.

Condor job ID is the cluster number, a period and the process number, for example 26.2

A single job is also a cluster, but with a single process (process 0).
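The job ID scheme can be made concrete with a short sketch. The `job_ids` helper is hypothetical, and the cluster number 26 is simply reused from the example ID 26.2; in practice Condor assigns the cluster number at submission.

```python
def job_ids(cluster, count):
    """IDs for `count` jobs queued as one cluster: <cluster>.<process>."""
    return [f"{cluster}.{process}" for process in range(count)]


ids = job_ids(26, 500)     # as if "Queue 500" had been assigned cluster 26
print(ids[0], ids[2], ids[-1])

single = job_ids(27, 1)    # a single job is a cluster with only process 0
print(single)
```

Process numbers start at 0, so the 500 jobs run from 26.0 to 26.499 and the example ID 26.2 is the third job in the cluster.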

Submitting a job with requirements and preferences

Done using Condor’s “ClassAd” mechanism, which may include:
− What it requires
− What it desires
− What it prefers, and
− What it will accept

These details start in submit description file.

The condor_submit command creates a “ClassAd” from the submit description file, which is then used in the ClassAd matchmaking mechanism.

Command:

condor_submit submit.prog1

[Figure: the submit description file (submit.prog1) is passed to condor_submit, which produces the ClassAd file.]

Specifying Requirements

A C/Java-like Boolean expression that evaluates to TRUE for a match.

# This is a comment, condor submit file
Universe = vanilla
Executable = /home/abw/condor/myProg
InitialDir = /home/abw/condor/assignment4
Requirements = Memory >= 512 && Disk > 10000
queue 500

ClassAd Matchmaking

Used to ensure job done according to constraints of users and owners.

Example of user constraints
“I need a Pentium IV with at least 512 Mbytes of RAM and speed of at least 3.8 GHz”

Example of machine owner constraints
“Never run jobs owned by Fred”

ClassAd Matchmaking Steps

• Agents (jobs) and resources (computers) advertise their characteristics and requirements in “classified advertisements.”

• Matchmaker scans ClassAds and creates pairs that satisfy each other's constraints and preferences.

• Matchmaker informs both parties of match.

• Agent and resource make contact.

[Figure: matchmaking diagram. Labels: Job, Job ClassAd, Machine, Machine ClassAd, Match.]

Job ClassAd Example

[
MyType = "Job"
TargetType = "Machine"
Requirements = ((other.Arch == "INTEL" && other.OpSys == "LINUX") && other.Disk > myDiskUsage)
DiskUsage = 6000
]

Notes on the example:
− The Requirements statement must evaluate to true.
− DiskUsage = 6000 means 6 MB.

Machine ClassAd Example

[
MyType = "Machine"
TargetType = "Job"
Machine = "coit-grid01.uncc.edu"
Requirements = ((LoadAvg <= 0.300000) && (KeyboardIdle > (15*60)))
Arch = "INTEL"
OpSys = "LINUX"
Disk = 1000000
]

Notes on the example:
− LoadAvg <= 0.300000: low load average.
− KeyboardIdle > (15*60): keyboard idle for more than 15 minutes.
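The two-sided nature of matchmaking can be sketched by writing the job and machine ClassAd examples as Python dictionaries and evaluating each side's Requirements against the other. Several things here are assumptions made for illustration: the `matches` function is a stand-in and not Condor's matchmaker, `myDiskUsage` is read as the job's own DiskUsage attribute, and the machine's current LoadAvg and KeyboardIdle values are invented because the slide gives only the thresholds.

```python
job = {
    "MyType": "Job",
    "TargetType": "Machine",
    "DiskUsage": 6000,
    # other.Arch == "INTEL" && other.OpSys == "LINUX" && other.Disk > own DiskUsage
    "Requirements": lambda my, other: (
        other["Arch"] == "INTEL"
        and other["OpSys"] == "LINUX"
        and other["Disk"] > my["DiskUsage"]
    ),
}

machine = {
    "MyType": "Machine",
    "TargetType": "Job",
    "Machine": "coit-grid01.uncc.edu",
    "Arch": "INTEL",
    "OpSys": "LINUX",
    "Disk": 1000000,
    "LoadAvg": 0.1,         # assumed current value
    "KeyboardIdle": 1200,   # assumed current value, in seconds
    # LoadAvg <= 0.3 && KeyboardIdle > 15*60
    "Requirements": lambda my, other: (
        my["LoadAvg"] <= 0.3 and my["KeyboardIdle"] > 15 * 60
    ),
}


def matches(a, b):
    """True when the types line up and BOTH Requirements expressions hold."""
    types_agree = a["TargetType"] == b["MyType"] and b["TargetType"] == a["MyType"]
    return types_agree and a["Requirements"](a, b) and b["Requirements"](b, a)


print(matches(job, machine))

# The same machine with a recently used keyboard no longer accepts the job.
busy = dict(machine, KeyboardIdle=60)
print(matches(job, busy))
```

The point of the sketch is that a match needs both expressions to be true: the job's constraints on the machine and the machine owner's constraints on when jobs may run.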

ClassAd’s Rank Statement

Can be used in job ClassAd for selection between compatible machines. Choose highest rank.

Rank expression should evaluate to a floating point number.

Example
Rank = (Memory * 10000) + KFlops

(KFlops: machine speed)
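Selection by rank amounts to evaluating the expression for every compatible machine and keeping the one with the largest value. The following sketch applies the example expression to two hypothetical machines; the machine names and their Memory and KFlops values are invented, and `best_machine` is an illustrative helper rather than Condor code.

```python
def rank(machine):
    """The example expression: Rank = (Memory * 10000) + KFlops."""
    return float(machine["Memory"] * 10000 + machine["KFlops"])


def best_machine(compatible_machines):
    """Pick the compatible machine with the highest rank."""
    return max(compatible_machines, key=rank)


# Hypothetical machines that already satisfy the job's Requirements.
machines = [
    {"Name": "fast-cpu", "Memory": 512, "KFlops": 20000},
    {"Name": "big-mem", "Memory": 1024, "KFlops": 5000},
]

for m in machines:
    print(m["Name"], rank(m))
print(best_machine(machines)["Name"])
```

Because Memory is multiplied by 10000, it dominates the result: the machine with more memory wins here even though its KFlops figure is lower, and KFlops mainly breaks ties between machines with equal memory.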

Rank Statement

Can also be used in Machines ClassAd in matchmaking.

Example
Rank = (other.Department == self.Department)

where Department is defined in the job ClassAd, say:

Department = "Computer Science"

Using rank in Machines ClassAd

Job ClassAd
[
MyType = "Job"
TargetType = "Machine"
…
Department = "Computer Science"
…
]

Machines ClassAd
[
MyType = "Machine"
TargetType = "Job"
…
Rank = (other.Department == self.Department)
…
]
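A machine-side rank lets the owner prefer some jobs over others. The sketch below mimics Rank = (other.Department == self.Department) under the assumption that a true comparison counts as 1.0 and a false one, or a missing Department attribute, counts as 0.0. The helper names, job names and the second department are hypothetical.

```python
def machine_rank(machine, job):
    """1.0 when the job's Department equals the machine's, otherwise 0.0."""
    department = job.get("Department")
    if department is not None and department == machine.get("Department"):
        return 1.0
    return 0.0


def preferred_job(machine, jobs):
    """The job this machine ranks highest among those offered to it."""
    return max(jobs, key=lambda job: machine_rank(machine, job))


machine = {"MyType": "Machine", "Department": "Computer Science"}
jobs = [
    {"MyType": "Job", "Name": "job-a", "Department": "Physics"},
    {"MyType": "Job", "Name": "job-b", "Department": "Computer Science"},
    {"MyType": "Job", "Name": "job-c"},  # no Department attribute
]

print(preferred_job(machine, jobs)["Name"])
```

Here the machine prefers the job from its own department. Jobs from other departments are still acceptable; they simply rank lower, which is the difference between a rank and a requirement.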

Directed Acyclic Graph Manager (DAGMan)

Meta-scheduler

We have already covered this material.

Summary of Key Condor Features

High throughput computing using an opportunistic environment.

Provides mechanisms for running jobs on remote machines.

Matchmaking

Checkpointing

DAG scheduling