This presentation explains why Condor is not suitable for use on user-owned machines, and why RemoteCondor is the best available solution to the problem.
Apr 2012 Remote Condor 1
UCSD HEP Group Trainings
Wedding convenience and control with RemoteCondor
by Igor Sfiligoi
RemoteCondor co-developed with J. Dost
UC San Diego
The Condor Batch System
● Condor is a Workload Management System
  ● i.e. a batch system
● Strong points
  ● Fault tolerant
  ● Robust feature set
  ● Flexible
● Large community base
  ● Both commercial and scientific
http://research.cs.wisc.edu/condor/
Condor Architecture
● Clearly separates
  ● Resource providers: machines (aka worker nodes) with CPUs, memory, IO, ...
  from
  ● Resource consumers: job queues (aka submit nodes) holding jobs submitted by users
● Each has a daemon process to represent it
  ● Startd for resource providers
  ● Schedd for resource consumers
● A central service connects them all
  ● Managed by a Collector/Negotiator pair
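The split between providers, consumers and the central service can be pictured with a toy model. This is only an illustrative sketch: the class names, the memory-only matching rule and the machine names are assumptions made for this example, and real Condor matchmaking is far richer than this.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """A resource provider; in Condor it is represented by a Startd."""
    name: str
    memory_mb: int
    busy: bool = False


@dataclass
class Job:
    """A unit of work waiting in a job queue."""
    job_id: int
    min_memory_mb: int


@dataclass
class JobQueue:
    """A resource consumer; in Condor it is represented by a Schedd."""
    name: str
    idle_jobs: list = field(default_factory=list)


def negotiate(machines, queues):
    """Toy stand-in for the Collector/Negotiator pair.

    Pairs each idle job with the first free machine that has enough memory
    and returns (queue name, job id, machine name) tuples.
    """
    matches = []
    for queue in queues:
        # Iterate over a copy, since matched jobs are removed from the queue.
        for job in list(queue.idle_jobs):
            for machine in machines:
                if not machine.busy and machine.memory_mb >= job.min_memory_mb:
                    machine.busy = True
                    queue.idle_jobs.remove(job)
                    matches.append((queue.name, job.job_id, machine.name))
                    break
    return matches


# Hypothetical pool: two worker nodes and one submit node with three jobs.
machines = [Machine("worker1", 2048), Machine("worker2", 8192)]
queues = [JobQueue("submit1", [Job(1, 4096), Job(2, 1024), Job(3, 1024)])]
matches = negotiate(machines, queues)
for queue_name, job_id, machine_name in matches:
    print(f"{queue_name}: job {job_id} -> {machine_name}")
```

Neither side talks to the other until the central service has paired them, which is why both the Startd and the Schedd need to be long-running daemons.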
Condor Architecture in a picture
[Figure: several Schedds and Startds, all connected through the central Collector/Negotiator]
The truth about submit nodes
● Corollary
  ● The submit node is a server!
● There is no real “Condor client”
  ● The cmdline tools are just a convenience to talk to the daemon process
[Figure: submit node running condor_submit, condor_q and the Schedd; Collector/Negotiator; Startd]
Implications
● Being a server has several implications
  ● Security implications
    ● Will have incoming connectivity
    ● All security configuration on the submit node
    ● Submit node controls user authentication and authorization
  ● Unfriendly to non-dedicated hardware
    ● Requires always-on operation
    ● Must be on a public & static IP address
● High exploit risk
● Requires high trust between all nodes in the cluster
● Impossible to use on a laptop
● Not suitable for an unmanaged user machine
What are the alternatives?
● Out of the box, Condor provides
  ● Remote submission
  ● Condor-C
● In the contrib section, you can find
  ● RemoteCondor
This presentation argues that RemoteCondor is the best solution
So what is wrong with remote submission and Condor-C?
Remote submission
● Essentially, connecting to a remote Schedd
  ● condor_submit -remote … + condor_transfer_data
  and
  ● condor_q -name ..., condor_rm -name ..., …
● So no daemon processes on the submit node
  ● A true client solution!
[Figure: condor_submit, condor_q and condor_transfer_data on the submit node talk, after authentication (Auth), to the Schedd on the Schedd node; Collector/Negotiator; Startd]
http://research.cs.wisc.edu/condor/manual/v7.6/condor_submit.html
http://research.cs.wisc.edu/condor/manual/v7.6/condor_transfer_data.html
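A remote-submission session could be sketched as below. The tools and the -remote and -name options are the ones named above; the Schedd name, the submit file name and the cluster id are hypothetical, and a real site may need further options (for example to select the pool). The sketch only builds and prints the command lines; it does not run them.

```python
import shlex

SCHEDD = "schedd.example.edu"  # hypothetical name of the remote Schedd


def remote_submit_cmd(submit_file, schedd=SCHEDD):
    """Command line that submits a job description to the remote Schedd."""
    return ["condor_submit", "-remote", schedd, submit_file]


def remote_query_cmd(schedd=SCHEDD):
    """Command line that polls the remote queue (there is no local user log)."""
    return ["condor_q", "-name", schedd]


def remote_fetch_output_cmd(cluster_id, schedd=SCHEDD):
    """Command line that retrieves the output of a finished cluster."""
    return ["condor_transfer_data", "-name", schedd, str(cluster_id)]


def remote_remove_cmd(cluster_id, schedd=SCHEDD):
    """Command line that removes a cluster from the remote queue."""
    return ["condor_rm", "-name", schedd, str(cluster_id)]


# Print the session; "job.sub" and cluster 42 are placeholders.
session = [
    remote_submit_cmd("job.sub"),
    remote_query_cmd(),
    remote_fetch_output_cmd(42),
]
for cmd in session:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each step is an explicit client call against the remote daemon, so the user has to remember to poll the queue and to fetch the output.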
So, what's the problem?
● No local user log file
  ● Must use condor_q to monitor progress
  ● Annoying at best
  ● High monitoring load
  ● And it does not work with DAGMan
● Fully Condor-based user authentication
  ● While rich, not what users expect (e.g. no user/password)
  ● Hard to tie into campus-wide auth
● Staged input data not shared
  ● Could be a problem with large datasets
Condor-C
● Based on the Grid paradigm
  ● Submit locally, then delegate to remote Schedd
● Still running a daemon process
  ● But requires no incoming connections
    ● Secure
    ● Laptop friendly
[Figure: condor_submit and condor_q talk to a local Schedd on the submit node, which delegates, after authentication (Auth), to the Schedd on the Schedd node; Collector/Negotiator; Startd]
http://research.cs.wisc.edu/condor/manual/v7.6/5_3Grid_Universe.html#sec:Condor-C
What are the drawbacks?
● Awkward syntax
  ● At least compared to Vanilla universe
  ● See the Condor manual for examples
● Has scalability problems
  ● Could likely be improved, but this is the current state-of-the-art
● Fully Condor-based user authentication (same as remote submission)
● Staged input data not shared (same as remote submission)
Can be mitigated with Job Router (but adds another layer of complexity)
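To give a feel for the syntax difference, the sketch below renders a vanilla submit description next to a Condor-C one. The Condor-C attributes are modeled on the example in the Condor manual section linked on the previous slide and should be checked against it; the host names and file names are hypothetical, and `render_submit` is a helper invented for this illustration.

```python
def render_submit(settings):
    """Render a dict of submit commands as submit description file text."""
    lines = [f"{key} = {value}" for key, value in settings.items()]
    lines.append("queue")
    return "\n".join(lines) + "\n"


# Settings shared by both variants; file names are placeholders.
common = {
    "executable": "myjob",
    "output": "myjob.out",
    "error": "myjob.err",
    "log": "myjob.log",
}

# Plain local submission in the vanilla universe.
vanilla = {"universe": "vanilla", **common}

# Condor-C: a grid universe job delegated to a remote Schedd.
condor_c = {
    "universe": "grid",
    # Assumed form: "condor <remote schedd> <remote collector>"
    "grid_resource": "condor schedd.example.edu collector.example.edu",
    # Universe the job should run in on the remote side (5 is vanilla)
    "+remote_jobuniverse": 5,
    **common,
}

print(render_submit(vanilla))
print(render_submit(condor_c))
```

The job itself is unchanged; all the extra lines only describe where and how to delegate it, which is the part users find awkward.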
Introducing RemoteCondor
What's the big idea?
● Let the users log in to a remote machine
  ● And run the cmdline tools there
  ● True client approach
Advantages:
● True local Condor experience
● Standard system authentication and authorization
● No admin privileges for the users
● Trust based on “central” Schedd admin skills
● Can regulate and transform Condor submissions
● Minimize security risk
● Central handling
● Familiar to users
● No exceptions
Big deal!
Where's the news?
● … while preserving the local look-and-feel
● RemoteCondor provides
  ● Wrappers around major Condor cmdline tools
  ● Integration with sshfs
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=RemoteCondor
RemoteCondor wrappers
● Provide wrappers that use ssh under the hood
  ● Users (almost) unaware of the trick
    ● But may be prompted for a password
    ● Works best with public key authentication
[Figure: the condor_submit and condor_q wrappers on the submit node connect, after authentication (Auth), to sshd on the Schedd node, where the real condor_submit and condor_q talk to the Schedd; Collector/Negotiator; Startd]
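The wrapper idea can be sketched in a few lines. This is not the RemoteCondor code itself, only an illustration of the technique: the `wrap` helper, the user name and the host name are assumptions made for this example.

```python
import shlex

REMOTE_USER = "alice"               # hypothetical account on the Schedd node
REMOTE_HOST = "schedd.example.edu"  # hypothetical Schedd node


def wrap(tool, args, user=REMOTE_USER, host=REMOTE_HOST):
    """Build the ssh command line that runs a Condor tool on the Schedd node.

    Arguments are quoted so that they survive the remote shell unchanged.
    """
    remote_cmd = " ".join(shlex.quote(part) for part in [tool, *args])
    return ["ssh", f"{user}@{host}", remote_cmd]


# The user types a normal-looking command; ssh carries it to the Schedd node.
cmd = wrap("condor_q", ["-constraint", "JobStatus == 2"])
print(cmd)
```

Passing the resulting list to `subprocess.run` would execute it; with public key authentication in place, no password prompt interrupts the call.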
RemoteCondor and sshfs
● But being able to talk to Condor is not enough
  ● Users must be able to create and read data!
● Using sshfs solves the problem
  ● Schedd-local disk mounted on submit node
    ● Disk local to Schedd for maximum performance
  ● Using ssh as a tunnel
  ● All in user space (FUSE)
● RemoteCondor will properly convert paths (within certain limits)
http://fuse.sourceforge.net/sshfs.html
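The mount and the path conversion could look roughly like the sketch below. It is an assumption-laden illustration, not RemoteCondor's actual logic: the mount point, the remote directory and both helper functions are hypothetical, and the conversion only handles paths that live under the mount point.

```python
import posixpath

LOCAL_MOUNT = "/home/alice/condor_mnt"  # hypothetical sshfs mount point
REMOTE_DIR = "/data/alice"              # hypothetical directory on the Schedd node


def sshfs_mount_cmd(user, host, remote_dir=REMOTE_DIR, local_mount=LOCAL_MOUNT):
    """Command line that mounts the Schedd-local directory on the submit node."""
    return ["sshfs", f"{user}@{host}:{remote_dir}", local_mount]


def to_remote_path(local_path, local_mount=LOCAL_MOUNT, remote_dir=REMOTE_DIR):
    """Translate a path under the local mount point into the Schedd-side path.

    Raises ValueError for paths outside the mount, which cannot be converted.
    """
    local_path = posixpath.normpath(local_path)
    mount = posixpath.normpath(local_mount)
    if local_path != mount and not local_path.startswith(mount + "/"):
        raise ValueError(f"{local_path} is not under {mount}")
    relative = posixpath.relpath(local_path, mount)
    return posixpath.normpath(posixpath.join(remote_dir, relative))


print(sshfs_mount_cmd("alice", "schedd.example.edu"))
print(to_remote_path("/home/alice/condor_mnt/run1/job.sub"))
```

A wrapper would apply such a conversion to file arguments before handing the command to ssh, so that the remote tools see paths that exist on the Schedd node.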
[Figure: the real disk on the Schedd node is exported through sshd and mounted via sshfs on the submit node; Schedd; Collector/Negotiator; Startd]
Using RemoteCondor
● Distributed in the Condor src tarball
  ● In the Contrib section
● Requires a “make install”
  ● To put the proper files in place
● Plus minimal configuration
  ● Where is the remote Schedd node?
  ● What username to use?
  ● Where to mount the sshfs partition?
https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=RemoteCondor
Summary
● Traditional Condor is not suitable for user machines
● Keeping Schedd nodes professionally maintained is highly desirable
  ● To minimize security risks and control job flow
● RemoteCondor allows this operation mode while preserving the local look-and-feel
  ● Requires minimal local install
Acknowledgements
This work is partially sponsored by
● the US National Science Foundation under Grants No. OCI-0943725 (STCI) and PHY-0612805 (CMS Maintenance & Operations), and
● the US Department of Energy under Grant No. DE-FC02-06ER41436 subcontract No. 647F290 (OSG).