Condor

These lecture notes were borrowed from Barry Wilkinson at UNC Charlotte.

Condor

First developed at the University of Wisconsin-Madison in the mid-1980s to convert a collection of distributed workstations and clusters into a high-throughput computing facility.

Key concept: using the wasted computing power of idle workstations.

Uses

Consider the following scenario:
− I have a simulation that takes two hours to run on my high-end computer.
− I need to run it 1000 times with slightly different parameters each time.
− If I do this on one computer, it will take at least 2000 hours (or about 3 months).

From: “Condor: What it is and why you should worry about it,” by B. Beckles, University of Cambridge, Seminar, June 23, 2004

− Suppose my department has 100 PCs like mine that are mostly sitting idle overnight (say 8 hours a day).
− If I could use them when their legitimate users are not using them, so that I do not inconvenience them, I could get about 800 CPU hours/day.
− This is an ideal situation for Condor.

I could do my simulations in 2.5 days.

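The figures in this scenario can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted above; it assumes every idle hour is usable and ignores scheduling overhead.

```python
# Worked arithmetic for the scenario above (all figures come from the slides).
RUN_HOURS = 2            # one simulation run on one PC
RUNS = 1000              # number of runs with different parameters
PCS = 100                # departmental PCs that sit idle overnight
IDLE_HOURS_PER_DAY = 8   # idle hours per PC per day

total_cpu_hours = RUN_HOURS * RUNS                     # work to be done
serial_days = total_cpu_hours / 24                     # one PC running non-stop
pool_cpu_hours_per_day = PCS * IDLE_HOURS_PER_DAY      # capacity of the idle PCs
pool_days = total_cpu_hours / pool_cpu_hours_per_day   # time using the idle PCs

print(f"total work:         {total_cpu_hours} CPU hours")
print(f"one computer:       {serial_days:.0f} days")
print(f"idle pool capacity: {pool_cpu_hours_per_day} CPU hours/day")
print(f"using the pool:     {pool_days} days")
```

Each 2-hour run fits four times into an 8-hour idle window, so 100 PCs complete 400 runs per night, which is where the 2.5 days comes from.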
Condor Features

Include:
− Resource finder
− Batch queue manager
− Scheduler
− Checkpoint/restart
− Process migration

Intended to run jobs even if:
− Machines crash
− Disk space is exhausted
− Software is not installed
− Machines are needed by others
− Machines are managed by others
− Machines are far away

How does Condor work?

A collection of machines running Condor is called a pool.

Individual pools can be joined together in a process called flocking.

Machine Roles

Machines have one or more of 4 roles:
− Central manager
− Submit machine (Submit host)
− Execution machine (Execute host)
− Checkpoint server

Central Manager

Resource broker for a pool.

Keeps track of which machines are available, what jobs are running, negotiates which machine will run which job, etc.

Only one central manager per pool.

Submit Machine

Machine which submits jobs to the pool.

Must be at least one submit machine in a pool, and usually more than one.

Execute Machine

Machine on which jobs can be run.

Must be at least one execute machine in a pool, and usually more than one.

Checkpoint Server

Machine which stores all checkpoint files produced by jobs that checkpoint.

Can only be one checkpoint machine in a pool.

Optional to have a checkpoint machine.

Possible Configuration

A central manager.

Some machines that can only be submit hosts.

Some machines that can only be execute hosts.

Some machines that can be both submit and execute hosts.

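The role rules from the preceding slides (exactly one central manager, at least one submit and one execute machine, at most one checkpoint server) can be written down as a small consistency check. This is a sketch only: the host names, the role labels and the `check_pool` helper are made up for illustration and are not part of Condor.

```python
from collections import Counter

# Role labels used in this sketch (hypothetical names, not Condor keywords).
ROLES = {"central_manager", "submit", "execute", "checkpoint_server"}


def check_pool(machines):
    """Return a list of rule violations for a pool description.

    `machines` maps a host name to the set of roles that host plays.
    """
    counts = Counter(role for roles in machines.values() for role in roles)
    problems = []
    unknown = set(counts) - ROLES
    if unknown:
        problems.append(f"unknown roles: {sorted(unknown)}")
    if counts["central_manager"] != 1:
        problems.append("a pool needs exactly one central manager")
    if counts["submit"] < 1:
        problems.append("a pool needs at least one submit machine")
    if counts["execute"] < 1:
        problems.append("a pool needs at least one execute machine")
    if counts["checkpoint_server"] > 1:
        problems.append("a pool can have at most one checkpoint server")
    return problems


# The configuration from the slide: one manager, submit-only, execute-only
# and combined submit/execute hosts (host names are invented).
pool = {
    "host-a": {"central_manager"},
    "host-b": {"submit"},
    "host-c": {"execute"},
    "host-d": {"submit", "execute"},
}
print(check_pool(pool))
```

An empty list means the description satisfies all four rules; the checkpoint server is optional, so its absence is not reported.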
Types of Jobs

Classified according to the environment Condor provides.

Currently seven environments:
− Standard
− Vanilla
− PVM
− MPI
− Globus
− Java
− Scheduler

Standard

For jobs compiled with Condor libraries.

Allows for checkpointing and remote system calls.

Must be single threaded.

Not available under Windows.

Checkpointing

Certain jobs can checkpoint, both periodically for safety and when interrupted.

If a checkpointed job is interrupted, it will resume at the last checkpointed state when it starts again.

Generally no change to source code is needed, but the job must be linked with Condor’s Standard Universe support library.

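The idea of resuming from the last checkpointed state can be illustrated with an application-level sketch. Note the assumption: Condor's Standard universe checkpoints the job transparently through the linked library, whereas the code below saves its own state explicitly, purely to show the periodic-save and resume pattern. The `run` function, the file name and the step counts are all invented for this example.

```python
import json
import os
import tempfile


def run(total_steps, checkpoint_path, checkpoint_every=100, stop_after=None):
    """Sum the integers 0..total_steps-1, saving progress periodically.

    `stop_after` simulates an interruption after that many steps in this run.
    Returns the sum, or None if the run was interrupted.
    """
    state = {"step": 0, "acc": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)          # resume from the last checkpoint
    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None                   # work since the last checkpoint is lost
        state["acc"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        if state["step"] % checkpoint_every == 0:
            with open(checkpoint_path, "w") as f:
                json.dump(state, f)       # periodic checkpoint
    return state["acc"]


path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
first = run(1000, path, stop_after=250)   # interrupted; last checkpoint was at step 200
second = run(1000, path)                  # restarts from step 200, not from step 0
print(first, second)
```

The interrupted run loses only the 50 steps done after the last checkpoint; the second run redoes those and then finishes.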
Vanilla

For jobs that cannot be compiled with Condor libraries, and for shell scripts and Windows batch files.

No checkpointing or remote system calls.

PVM
For PVM programs.

MPI
For MPI programs (MPICH).

Both PVM and MPI are message-passing libraries used in message-passing programs.

Used for local clusters of computers.

MPI could be used in grid computing; we will talk about this later in the course.

Globus
For submitting jobs to resources managed by Globus (version 2.2 and higher).

Java
For Java programs (written for the Java Virtual Machine).

Scheduler
Used with DAG scheduled jobs, see later.

Submitting a job

Job submitted to the “submit host” using the condor_submit command.

Job described in a “submit description” file.

Submit description file includes details such as those given in an RSL file in Globus, i.e. the name of the executable, arguments, etc.

Condor Submit Description File

Describes the job to Condor.
Used with the condor_submit command.

Description File Example

    # This is a comment, condor submit file
    Universe = vanilla
    Executable = /home/abw/condor/myProg
    Input = myProg.stdin
    Output = myProg.stdout
    Error = myProg.stderr
    Arguments = -arg1 -arg2
    InitialDir = /home/abw/condor/assignment4
    Queue

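When many similar jobs are needed, a submit description file like the one above can be generated by a script. The sketch below only builds the text of the file; it does not call Condor. The `render_submit_file` helper is hypothetical, and the settings are the ones shown in the example.

```python
def render_submit_file(settings, queue_count=1):
    """Build the text of a submit description file.

    `settings` maps command names (Universe, Executable, ...) to values.
    The file ends with a Queue line; a count is added when more than one
    job is requested.
    """
    lines = ["# This is a comment, condor submit file"]
    lines += [f"{key} = {value}" for key, value in settings.items()]
    lines.append("Queue" if queue_count == 1 else f"Queue {queue_count}")
    return "\n".join(lines) + "\n"


settings = {
    "Universe": "vanilla",
    "Executable": "/home/abw/condor/myProg",
    "Input": "myProg.stdin",
    "Output": "myProg.stdout",
    "Error": "myProg.stderr",
    "Arguments": "-arg1 -arg2",
    "InitialDir": "/home/abw/condor/assignment4",
}
text = render_submit_file(settings)
print(text)
```

The resulting text would be saved to a file and passed to condor_submit on a submit host.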
Submitting Multiple Jobs

Submit file can specify multiple jobs.
− Example: Queue 500 will submit 500 jobs at once.

Condor calls a group of jobs a cluster.

Each job within a cluster is called a process.

Condor job ID is the cluster number, a period, and the process number, for example 26.2.

A single job is also a cluster, but with a single process (process 0).

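The cluster.process form of a job ID is easy to handle in a script. The sketch below splits an ID such as 26.2 into its two numbers and lists the IDs for a cluster; the helper names are invented, and the sketch assumes process numbers count up from 0, consistent with a single job being process 0.

```python
from typing import NamedTuple


class JobId(NamedTuple):
    cluster: int   # number shared by all jobs from one submission
    process: int   # position of this job within the cluster


def parse_job_id(text):
    """Split a job ID like '26.2' into cluster and process numbers."""
    cluster, process = text.split(".")
    return JobId(int(cluster), int(process))


def ids_for_cluster(cluster, count):
    """List the job IDs of a cluster containing `count` processes."""
    return [f"{cluster}.{process}" for process in range(count)]


job = parse_job_id("26.2")
print(job.cluster, job.process)
print(ids_for_cluster(26, 3))
```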
Submitting a job with requirements and preferences

Done using Condor’s “ClassAd” mechanism, which may include:
− What it requires
− What it desires
− What it prefers, and
− What it will accept

These details start in the submit description file.

The condor_submit command creates a “ClassAd” from the submit description file, which is then used in the ClassAd matchmaking mechanism.

Command:

    condor_submit submit.prog1

Here submit.prog1 is the submit description file; the result is a ClassAd file.

Specifying Requirements

A C/Java-like Boolean expression that evaluates to TRUE for a match.

    # This is a comment, condor submit file
    Universe = vanilla
    Executable = /home/abw/condor/myProg
    InitialDir = /home/abw/condor/assignment4
    Requirements = Memory >= 512 && Disk > 10000
    queue 500

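To see how such an expression selects machines, the Requirements line above can be mirrored as an ordinary Boolean function. This is a sketch only: the machine names and attribute values are invented, and real matching is done by Condor against the attributes each machine advertises.

```python
def requirements(machine):
    """Mirror of: Requirements = Memory >= 512 && Disk > 10000"""
    return machine["Memory"] >= 512 and machine["Disk"] > 10000


# Hypothetical machines with the two attributes the expression refers to.
machines = [
    {"Name": "pc1", "Memory": 256, "Disk": 500000},    # too little memory
    {"Name": "pc2", "Memory": 1024, "Disk": 8000},     # too little disk
    {"Name": "pc3", "Memory": 512, "Disk": 1000000},   # satisfies both
]

matches = [m["Name"] for m in machines if requirements(m)]
print(matches)
```

Both conditions must hold for a match, so only the third machine qualifies.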
ClassAd Matchmaking

Used to ensure a job is done according to the constraints of users and owners.

Example of user constraints
“I need a Pentium IV with at least 512 Mbytes of RAM and speed of at least 3.8 GHz”

Example of machine owner constraints
“Never run jobs owned by Fred”

ClassAd Matchmaking Steps

• Agents (jobs) and resources (computers) advertise their characteristics and requirements in “classified advertisements.”

• Matchmaker scans ClassAds and creates pairs that satisfy each other’s constraints and preferences.

• Matchmaker informs both parties of the match.

• Agent and resource make contact.

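The steps above can be sketched as a small two-sided matching loop. This is a simplified illustration, not Condor's matchmaker: ads are plain dictionaries, requirements are Python functions taking (my, other), preferences (rank) are left out, and all names and values are invented.

```python
def matchmake(job_ads, machine_ads):
    """Pair each job with the first unclaimed machine where both sides agree."""
    pairs = []
    free = list(machine_ads)
    for job in job_ads:
        for machine in free:
            job_ok = job["requirements"](job, machine)
            machine_ok = machine["requirements"](machine, job)
            if job_ok and machine_ok:
                pairs.append((job["name"], machine["name"]))
                free.remove(machine)   # a matched machine is no longer offered
                break
    return pairs


# Step 1: agents and resources advertise characteristics and requirements.
jobs = [
    {"name": "job1", "owner": "alice", "disk_usage": 6000,
     "requirements": lambda my, other: (other["op_sys"] == "LINUX"
                                        and other["disk"] > my["disk_usage"])},
    {"name": "job2", "owner": "fred", "disk_usage": 100,
     "requirements": lambda my, other: other["op_sys"] == "LINUX"},
]
machines = [
    {"name": "m1", "op_sys": "LINUX", "disk": 1000000,
     "requirements": lambda my, other: other["owner"] != "fred"},
    {"name": "m2", "op_sys": "WINDOWS", "disk": 1000000,
     "requirements": lambda my, other: True},
]

# Step 2: the matchmaker scans the ads and creates compatible pairs.
pairs = matchmake(jobs, machines)

# Step 3: both parties are informed; step 4 (contact) happens between them.
for job_name, machine_name in pairs:
    print(f"notify {job_name} and {machine_name} of match")
```

Here job2 stays unmatched: the only remaining machine does not run Linux, and the Linux machine refuses jobs owned by fred.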
[Figure: a job advertises a Job ClassAd, each machine advertises a Machine ClassAd, and the matchmaker matches the job with a machine.]

Job ClassAd Example

    [
    MyType = "Job"
    TargetType = "Machine"
    Requirements = ((other.Arch == "INTEL" && other.OpSys == "LINUX") && other.Disk > my.DiskUsage)
    DiskUsage = 6000
    ]

The Requirements statement must evaluate to true.
DiskUsage = 6000 means 6 MB.

Machine ClassAd Example

    [
    MyType = "Machine"
    TargetType = "Job"
    Machine = "coit-grid01.uncc.edu"
    Requirements = ((LoadAvg <= 0.300000) && (KeyboardIdle > (15*60)))
    Arch = "INTEL"
    OpSys = "LINUX"
    Disk = 1000000
    ]

LoadAvg <= 0.300000: low load average.
KeyboardIdle > (15*60): keyboard idle for more than 15 minutes.

ClassAd’s Rank Statement

Can be used in the job ClassAd for selection between compatible machines. The machine with the highest rank is chosen.

Rank expression should evaluate to a floating point number.

Example

    Rank = (Memory * 10000) + KFlops

KFlops is the machine speed.

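Choosing the highest-ranked machine can be sketched by mirroring the example Rank expression as a function and taking the maximum. The machine names and attribute values below are invented for illustration.

```python
def rank(machine):
    """Mirror of: Rank = (Memory * 10000) + KFlops"""
    return float(machine["Memory"] * 10000 + machine["KFlops"])


# Hypothetical machines that already satisfy the job's Requirements.
compatible = [
    {"Name": "pc1", "Memory": 512, "KFlops": 900000},
    {"Name": "pc2", "Memory": 1024, "KFlops": 400000},
    {"Name": "pc3", "Memory": 512, "KFlops": 1200000},
]

for machine in compatible:
    print(machine["Name"], rank(machine))

best = max(compatible, key=rank)
print("chosen:", best["Name"])
```

With this weighting, memory dominates: the machine with twice the memory is chosen even though it is the slowest of the three.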
Rank Statement

Can also be used in the machine’s ClassAd in matchmaking.

Example

    Rank = (other.Department == self.Department)

where Department is defined in the job ClassAd, say:

    Department = "Computer Science"

Using rank in the machine’s ClassAd

Job ClassAd

    [
    MyType = "Job"
    TargetType = "Machine"
    …
    Department = "Computer Science"
    …
    ]

Machine’s ClassAd

    [
    MyType = "Machine"
    TargetType = "Job"
    …
    Rank = (other.Department == self.Department)
    …
    ]

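A machine-side rank that compares departments can be sketched in the same way. The assumptions here are that the comparison counts as 1.0 when true and 0.0 when false, and that a job ad with no Department attribute is treated as not matching; the job names and the `machine_rank` helper are invented.

```python
def machine_rank(machine, job):
    """Mirror of: Rank = (other.Department == self.Department)

    Returns 1.0 when the job's department equals the machine's, else 0.0.
    A job without a Department attribute ranks 0.0 in this sketch.
    """
    return 1.0 if job.get("Department") == machine["Department"] else 0.0


machine = {"Name": "m1", "Department": "Computer Science"}
jobs = [
    {"Name": "jobA", "Department": "Physics"},
    {"Name": "jobB", "Department": "Computer Science"},
    {"Name": "jobC"},   # no Department advertised
]

for job in jobs:
    print(job["Name"], machine_rank(machine, job))

preferred = max(jobs, key=lambda job: machine_rank(machine, job))
print("machine prefers:", preferred["Name"])
```

The machine still accepts jobs from other departments if its Requirements allow them; the rank only expresses which job it would rather run.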
Directed Acyclic Graph Manager (DAGMan)

Meta-scheduler

We have already covered this material.