Grids and Condor

Condor Project
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/condor

Barcelona, 2006



Agenda

• Extended user’s tutorial
• Advanced Uses of Condor
  • Java programs
  • DAGMan
  • Stork
  • MW
  • Grid Computing
• Case studies, and a discussion of your application’s needs


Resources

There are many resources (machines) in the world, and many are or can be made available!

Groups of machines may be labeled as grids

Welcome to the power of the grid!


Condor and Grids

Condor has always been a tool to harness grid computing

Condor’s mechanisms have evolved as technologies have evolved. Roughly categorized:

• Flocking
• Glidein
• The grid universe


Flocking

• A way for jobs to run within a different, separate Condor pool

• Condor runs here, and Condor runs there

[Figure: two Condor pools, labeled “here” and “there”]


Connect Condor Pools with Flocking

• Flocking is a Condor-specific technology
• Flocking is enabled with configuration
• Jobs flock from here to there when they cannot be run here due to lack of available machines
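The flocking rule can be modelled in a few lines. The Python sketch below only illustrates the decision described on this slide; it is not Condor's matchmaking. The `Pool` class, its `idle_machines` field, and `choose_pool` are hypothetical names, and trying the remote pools in listed order is an assumption of the sketch.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """Hypothetical stand-in for a Condor pool; not a Condor API."""
    name: str
    idle_machines: int


def choose_pool(local, flock_targets):
    """Return the pool a job would run in, or None if it must stay queued.

    The local pool is always tried first. Remote pools are considered,
    in the order given, only when no local machine is available.
    """
    if local.idle_machines > 0:
        return local
    for remote in flock_targets:
        if remote.idle_machines > 0:
            return remote
    return None


here = Pool("here", idle_machines=0)
there = Pool("there", idle_machines=12)
print(choose_pool(here, [there]).name)  # prints "there"
```

With at least one idle machine in `here`, the same call returns the local pool and the job never flocks.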


Configuration

Configuration files contain lots of the administrative information used by Condor

Format is like that in submit description files:

AttributeName = Value


Configuration here

• For jobs to be able to flock from here to there
• In the configuration file on the pool where jobs flock from:

FLOCK_TO = <central manager machine name>
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
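To see what the `$(NAME)` references in these settings resolve to, here is a rough Python sketch of macro substitution over `AttributeName = Value` lines. It is a simplified model and not Condor's own configuration parser. The host names `cm.here.example.org` and `cm.there.example.org` are made up, and treating an undefined macro as an empty string is an assumption of the sketch.

```python
import re

# Matches references of the form $(NAME)
MACRO = re.compile(r"\$\((\w+)\)")


def parse_config(text):
    """Parse 'AttributeName = Value' lines into a dict, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        settings[name.strip()] = value.strip()
    return settings


def expand(name, settings, depth=0):
    """Return settings[name] with every $(OTHER) reference substituted, recursively."""
    if depth > 20:
        raise ValueError("macro nesting too deep (circular reference?)")
    return MACRO.sub(
        lambda match: expand(match.group(1), settings, depth + 1),
        settings.get(name, ""),
    )


config = """
COLLECTOR_HOST = cm.here.example.org
FLOCK_TO = cm.there.example.org
FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
"""
settings = parse_config(config)
print(expand("HOSTALLOW_NEGOTIATOR_SCHEDD", settings))
# prints "cm.here.example.org, cm.there.example.org"
```

The expanded value shows why the last setting matters: the local schedd has to accept negotiation from both its own central manager and the one in the pool it flocks to.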


Configuration there

In the configuration file on the pool where jobs flock to:

FLOCK_FROM = <submit machine name>, . . . , <submit machine name>

To make security work:

HOSTALLOW_WRITE_COLLECTOR = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_WRITE_STARTD = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
HOSTALLOW_READ_COLLECTOR = $(HOSTALLOW_READ), $(FLOCK_FROM)
HOSTALLOW_READ_STARTD = $(HOSTALLOW_READ), $(FLOCK_FROM)


Submit Description File

Enable file transfer:

universe = vanilla
executable = myjob.exe
input = myjob.input
output = myjob.output
log = myjob.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue


The Glidein Concept

Assume: We need more machines, and we have permission to use a set of machines

Glidein temporarily adds a set of machines to the local pool


Glidein

In addition, Glidein solves the problem: “My job needs to run on that particular resource, and my job needs Condor.”

For example: a job that must run under the standard universe


Glidein

Condor sends and runs its own executables on the resource

The needed resource appears to temporarily join the local Condor pool!


Glidein

Run condor_glidein to add the remote resource to the local pool

[Figure: the local pool and the remote resource; the master and startd daemons become grid universe jobs using gt2]


Making Glidein Work

• Change the configuration to give access permission (HOSTALLOW_WRITE) to the remote resource
• No changes to jobs’ submit description files!
• But, do enable file transfer in the submit description file:

universe = vanilla
executable = myjob.exe
input = myjob.input
output = myjob.output
log = myjob.log
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
queue


Force Job to Glidein Resource

In the submit description file:

universe = standard
executable = ajob.exe
input = ajob.input
output = ajob.output
log = ajob.log
requirements = \
  ( machine == "example.mcs.anl.gov" ) \
  && Arch != "" && OpSys != ""
queue


The Grid Universe

Most useful when:

1. We want to send a job off to a far away machine
2. We want to hand a job to another batch processing system on the local machine
3. We want to send a job off to a far away machine, in order to hand that job to another batch processing system on that machine


The Grid Universe

• All handled in the submit description file
• Supports several back end types:
  • Globus: GT2, GT3, GT4
  • NorduGrid
  • UNICORE
  • Condor
  • PBS
  • LSF


Condor-G

Condor-G describes jobs to be handed off to a machine that is utilizing Globus middleware:

• gt2: Globus Toolkit 1 or 2, or the pre-web services GRAM
• gt3: Globus Toolkit 3
• gt4: Globus Toolkit 4, or WS GRAM


Submit Description File

For gt2:

universe = grid
input = job1.input
output = job1.result
log = job1.log
grid_resource = gt2 example.wisc.edu/jobmanager
queue

The jobmanager part of grid_resource is one of:

• jobmanager
• jobmanager-condor
• jobmanager-pbs
• jobmanager-lsf
• jobmanager-sge
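As an illustration of how the gt2 pieces fit together, the following Python sketch assembles the text of such a submit description and rejects a jobmanager name that is not in the list on this slide. The function `gt2_submit_description` and its parameters are hypothetical helpers for this example and are not part of Condor. It emits only the lines shown on the slide.

```python
# Jobmanager names taken from the slide
GT2_JOBMANAGERS = (
    "jobmanager",
    "jobmanager-condor",
    "jobmanager-pbs",
    "jobmanager-lsf",
    "jobmanager-sge",
)


def gt2_submit_description(host, jobmanager, input_file, output_file, log_file):
    """Build the text of a grid universe submit description for a gt2 resource."""
    if jobmanager not in GT2_JOBMANAGERS:
        raise ValueError(f"unknown gt2 jobmanager: {jobmanager!r}")
    lines = [
        "universe = grid",
        f"input = {input_file}",
        f"output = {output_file}",
        f"log = {log_file}",
        f"grid_resource = gt2 {host}/{jobmanager}",
        "queue",
    ]
    return "\n".join(lines) + "\n"


print(gt2_submit_description(
    "example.wisc.edu", "jobmanager-pbs",
    "job1.input", "job1.result", "job1.log",
))
```

Validating the jobmanager name before writing the file catches a misspelled back end locally, rather than after the job has been handed to the remote site.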


Submit Description File

For gt3:

universe = grid
input = job2.input
output = job2.result
log = job2.log
grid_resource = gt3 http://198.51.254.40:8080/osga/services/base/gram/XXXManagedJobFactoryService
queue

The address in grid_resource is given as IP address:Port number.

XXX is one of:

• Fork
• Condor
• PBS
• LSF
• SGE


Submit Description File

For gt4:

universe = grid
input = job3.input
output = job3.result
log = job3.log
grid_resource = gt4 https://198.51.254.40:8080/wsrf/service/ManagedJobFactoryService XXX
queue

The address in grid_resource is given as IP address:Port number OR Host name:Port number.

XXX is one of:

• Fork
• Condor
• PBS
• LSF
• SGE


Nordugrid and the Submit Description File

universe = grid
input = job4.input
output = job4.result
log = job4.log
grid_resource = nordugrid ngexample.com
queue


Unicore and the Submit Description File

universe = grid
input = job5.input
output = job5.result
log = job5.log
grid_resource = unicore usite.example.com vsite
keystore_file = /frieda/certificates/keystore
keystore_alias = "frieda"
keystore_passphrase_file = /frieda/private/passphrase
queue

vsite is the name of the Unicore virtual resource


PBS and the Submit Description File

Details of the PBS installation in $(GLITE_LOCATION)/etc/batch_gahp.config

universe = grid
input = job6.input
output = job6.result
log = job6.log
grid_resource = pbs
queue


LSF and the Submit Description File

Details of the LSF installation in $(GLITE_LOCATION)/etc/batch_gahp.config

universe = grid
input = job7.input
output = job7.result
log = job7.log
grid_resource = lsf
queue


Condor-C

Condor is running here, and Condor is running over there

For the case where we want to send a job off to a far away machine, in order to hand that job to another batch processing system on that machine


Condor-C and the Submit Description File

universe = grid
input = job8.input
output = job8.result
log = job8.log
grid_resource = condor [email protected] remotecentralmanager.example.com
+remote_jobuniverse = 5
+remote_requirements = True
+remote_ShouldTransferFiles = "YES"
+remote_WhenToTransferOutput = "ON_EXIT"
queue

In grid_resource, the first argument is the schedd name and the second is the collector machine name. The value 5 for remote_jobuniverse is the vanilla universe.


Credentials

Not just anybody can use any resource at any time. . .

Key concepts:

• Authentication: verification of an identity
• Authorization: permission to do something


Authentication

If Frieda says “I am Frieda,” how do we distinguish this from Frieda saying “I am George Bush”?


Authentication

• Bush can do whatever he pleases
• If Frieda claims to be Bush (and this is accepted), then Frieda can do whatever she pleases
• Authentication attempts to verify the identity of the entity that is communicating


Authorization

Who is allowed (permitted) to do what:

• Frieda may run gt4 jobs on the Open Science Grid machines
• Fred may write to files in /usr/bin
• The Unix user root may do anything!

Can be implemented with a list of those authorized
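A list of those authorized can be as simple as a table from actions to permitted users. The Python sketch below uses the three examples from this slide. The names `AUTHORIZED`, `SUPERUSERS`, and `is_authorized` are hypothetical, and the check assumes the user's identity has already been authenticated by some other means.

```python
# Action -> set of users permitted to perform it (examples from the slide)
AUTHORIZED = {
    "run gt4 jobs on Open Science Grid machines": {"frieda"},
    "write to files in /usr/bin": {"fred"},
}

# Users who may do anything
SUPERUSERS = {"root"}


def is_authorized(user, action):
    """Return True if the (already authenticated) user may perform the action."""
    if user in SUPERUSERS:
        return True
    return user in AUTHORIZED.get(action, set())


print(is_authorized("frieda", "run gt4 jobs on Open Science Grid machines"))  # True
print(is_authorized("frieda", "write to files in /usr/bin"))                  # False
print(is_authorized("root", "write to files in /usr/bin"))                    # True
```

An action that is missing from the table is denied to everyone except the superusers, so forgetting to list something fails closed.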


Condor and Authentication

Authentication within Condor comes in many forms. Here are three.

1. File system: Have the entity write a file. The OS attaches a name to the file owner. Condor checks that the entity’s claim is the same as the file owner.

2. GSI (Grid Security Infrastructure)

3. Kerberos
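The file system method in item 1 can be sketched in Python as below. This is a toy, single-process illustration of the idea and not Condor's implementation: the names `fs_authenticate` and `touch` are hypothetical, numeric user ids are compared instead of user names for simplicity, and a Unix-like system is assumed.

```python
import os
import tempfile


def fs_authenticate(claimed_uid, write_file):
    """Check a claimed numeric user id against the owner of a freshly written file.

    write_file(path) is called so that the entity creates the file. The OS,
    not the entity, records the owner, and that record is what is compared.
    """
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "fs_auth_challenge")
        write_file(path)
        return os.stat(path).st_uid == claimed_uid


def touch(path):
    """Create an empty file, standing in for the entity's side of the exchange."""
    with open(path, "w"):
        pass


# This process plays both roles, so claiming its own uid succeeds.
print(fs_authenticate(os.geteuid(), touch))
```

The check works only because both sides share a file system and trust the operating system's ownership records, which is why this method suits a single machine or a shared file system and not a remote grid site.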


Authentication Idea

• A centralized certificate authority (CA) does verification of an entity’s identity.

• When satisfied, the CA issues a signed certificate (also called a credential)

[Figure: a certificate reading “I am Frieda”, labeled CA]


Authentication

• To authenticate, the entity presents the certificate
• All is well, if we trust the CA and the remote machine

[Figure: a certificate reading “I am Frieda”, labeled CA]


GSI Authentication

• GSI uses X.509 certificates
• Grid universe jobs submitted to back end types using Globus middleware (gt2, gt3, gt4), as well as nordugrid and unicore, use X.509 certificates
• Condor can also use GSI


Revocation, Trust, and Proxies

• The CA may revoke a credential
• Frieda gives the signed credential to the remote machine. If the remote machine is malicious, it could impersonate Frieda. Therefore, a password protects the credential.
• A proxy is a credential that includes the password, but is only valid for a specific (short) time period.
• MyProxy software enables GSI proxy management
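The point of the short validity period can be shown with a small Python model. This is not GSI or MyProxy code: the `Proxy` class and `make_proxy` function are hypothetical, no cryptography is involved, and the 12-hour lifetime is just an example value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Proxy:
    """Toy model of a proxy: an identity plus an expiry time (no real signatures)."""
    subject: str
    not_after: datetime

    def is_valid(self, now):
        """A proxy is usable only before its expiry time."""
        return now < self.not_after


def make_proxy(subject, issued_at, lifetime=timedelta(hours=12)):
    """Create a proxy for subject that expires lifetime after issued_at."""
    return Proxy(subject, issued_at + lifetime)


issued = datetime(2006, 1, 1, 9, 0, tzinfo=timezone.utc)
proxy = make_proxy("frieda", issued)
print(proxy.is_valid(issued + timedelta(hours=1)))   # True
print(proxy.is_valid(issued + timedelta(hours=13)))  # False
```

If a malicious remote machine copies the proxy, it can impersonate Frieda only until `not_after`, which limits the damage compared with handing over the long-lived credential itself.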