Installing and Managing a Large Condor Pool Derek Wright Computer Sciences Department University of Wisconsin-Madison [email protected]

  • Installing and Managing a Large Condor Pool

    Derek Wright
    Computer Sciences Department
    University of Wisconsin-Madison
    [email protected]/condor

  • Talk Outline

    • What is Condor and why is it good for large clusters?
    • The Condor Daemons (the sys admin view)
    • A look at the UW-Madison Computer Science Condor Pool and Cluster
    • Some other features of Condor that help for big pools
    • Future work

  • What is Condor?

    • A system of daemons and tools that harness desktop machines and commodity computing resources for High Throughput Computing
      • Large numbers of jobs over long periods of time
      • Not High Performance Computing, which is short bursts of lots of compute power

  • What is Condor? (Cont'd)

    • Condor matches jobs with available machines using ClassAds
    • Available machines can be:
      • Idle desktop workstations
      • Dedicated clusters
      • SMP machines
    • Can also provide checkpointing and process migration (if you re-link your application against our library)
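The two-sided matching that the slide attributes to ClassAds can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain dictionaries and lambdas stand in for real ClassAds, and every attribute name and value here (`MemoryNeeded`, `node01.example.edu`, and so on) is hypothetical rather than taken from the talk.

```python
def is_match(job_ad, machine_ad):
    """A match requires each side's Requirements to accept the other ad."""
    return bool(job_ad["Requirements"](machine_ad)
                and machine_ad["Requirements"](job_ad))

# Hypothetical machine ad: accepts only jobs that fit in its memory.
machine_ad = {
    "Name": "node01.example.edu",
    "Arch": "INTEL",
    "OpSys": "LINUX",
    "Memory": 1024,
    "Requirements": lambda job: job["MemoryNeeded"] <= 1024,
}

# Hypothetical job ad: wants an Intel/Linux machine.
job_ad = {
    "Owner": "alice",
    "MemoryNeeded": 512,
    "Requirements": lambda m: m["Arch"] == "INTEL" and m["OpSys"] == "LINUX",
}

# Same job, but too large for the machine above.
big_job_ad = dict(job_ad, MemoryNeeded=4096)

print(is_match(job_ad, machine_ad))
print(is_match(big_job_ad, machine_ad))
```

The point of the sketch is the symmetry: the job constrains the machine and the machine constrains the job, so an idle desktop can refuse work its owner would not want.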

  • What's Condor Good For?

    • Managing a large number of jobs
      • You specify the jobs in a file and submit them to Condor, which runs them all and sends you email when they complete
      • Mechanisms to help you manage huge numbers of jobs (1000s), all the data, etc.
      • Condor can handle inter-job dependencies (DAGMan)

  • What's Condor Good For? (cont'd)

    • Managing a large number of machines
      • Condor daemons run on all the machines in your pool and are constantly monitoring machine state
      • You can query Condor for information about your machines
      • Condor handles all background jobs in your pool with minimal impact on your machine owners

  • Why is Condor Good for Large Clusters?

    • Fault-tolerance at all levels of Condor
    • Even dedicated resources should be treated like they might disappear at any minute (Condor has been doing this since 1985; we've got a lot of experience)
    • Checkpointing jobs (when possible) makes scheduling a lot easier, and ensures forward progress
    • Eases monitoring

  • Condor on Large Clusters (cont'd)

    • Manages ALL your resources and jobs under one system
    • Easier for users and administrators
    • Easy to install and use
    • No queues to configure or choose from
    • It's developed by former system administrators (all the full-time staff)
    • It's free (that scales really well)

  • What is a Condor Pool?

    • A pool can be a single machine or a group of machines
    • Determined by a central manager - the matchmaker and centralized information repository
    • Each machine runs various daemons to provide different services, either to the users who submit jobs, the machine owners, or the pool itself

  • Talk Outline

    • What is Condor and why is it good for large clusters?
    • The Condor Daemons (the sys admin view)
    • A look at the UW-Madison Computer Science Condor Pool and Cluster
    • Some other features of Condor that help for big pools
    • Future work

  • The Condor Daemons

    • condor_master: Administrator Agent
    • condor_collector: Centralized Repository of ClassAds
    • condor_negotiator: Performs Matchmaking
    • condor_startd: Resource Agent (Machine)
    • condor_schedd: User Agent (Jobs)
    • condor_starter: Monitors/Manages a Job Process
    • condor_shadow: Handles Remote System Calls, Intra-Job Resource Management
    • condor_dagman: Manages Inter-Job Dependencies
    • condor_eventd: Pool-Wide Events

  • Layout of a Personal Condor Pool

    [Diagram; legend: ClassAd Communication Pathway]

  • Layout of a General Condor Pool

    [Diagram; legend: ClassAd Communication Pathway]

  • condor_master daemon

    • Starts up all other Condor daemons
    • If there are any problems and a daemon exits, it restarts the daemon and sends email to the administrator
    • Checks the time stamps on the binaries it is configured to spawn, and if new binaries appear, the master will gracefully shut down the currently running version and start the new version
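The time-stamp check described on this slide can be illustrated with a short Python sketch. It is not the condor_master's actual implementation, only the general idea: remember the modification time of a binary and compare it later. The function name `has_new_binary` and the temporary file standing in for a daemon binary are assumptions made for the example.

```python
import os
import tempfile

def has_new_binary(path, last_seen_mtime):
    """Return True if the file at path was modified after last_seen_mtime."""
    return os.stat(path).st_mtime > last_seen_mtime

# Demonstration with a temporary file standing in for a daemon binary.
with tempfile.TemporaryDirectory() as tmp:
    binary = os.path.join(tmp, "condor_startd")
    with open(binary, "w") as f:
        f.write("version 1")

    seen = os.stat(binary).st_mtime
    unchanged = has_new_binary(binary, seen)   # nothing has been replaced yet

    # Simulate installing a newer binary by moving the mtime forward.
    os.utime(binary, (seen + 60, seen + 60))
    changed = has_new_binary(binary, seen)

print(unchanged, changed)
```

A supervisor built on this check would, on a change, stop the running daemon cleanly and start the new binary, which is the behavior the slide describes.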

  • condor_master (cont'd)

    • Provides access to many remote administration commands:
      • condor_reconfig
      • condor_restart, condor_off, condor_on
    • Default server for many other commands:
      • condor_config_val, etc.
    • Periodically runs condor_preen to clean up any files Condor might have left on the machine (the rest of the daemons clean up after themselves, as well)

  • condor_collector

    • Collects information from all other Condor daemons in the pool
    • Each daemon sends a periodic update called a ClassAd to the collector
    • Services queries for information:
      • Queries from other Condor daemons
      • Queries from users (condor_status)
    • Can store historical pool data

  • condor_eventd

    • Administrators specify events in a config file (similar to a crontab, but not exactly):
      • Date and time
      • What kind of event (currently, only shutdown is supported)
      • What machines the event affects (ClassAd constraint)

  • condor_eventd (cont'd)

    • When an event is approaching, the EventD will wake up and query the condor_collector for all machines that match the constraint
    • The EventD then knows how big all the jobs are that are currently running on the affected nodes, the network bandwidth to the nearest checkpoint servers, etc.
    • The EventD plans evictions to allow the most computation without flooding the network

  • Talk Outline

    • What is Condor and why is it good for large clusters?
    • The Condor Daemons (the sys admin view)
    • A look at the UW-Madison Computer Science Condor Pool and Cluster
    • Some other features of Condor that help for big pools
    • Future work

  • Large Condor Pools in HEP and Government Research

    • UW-Madison CS (~750 nodes)
    • INFN (~270 nodes)
    • CERN/Chorus (~100 nodes)
    • NASA Ames (~330 nodes)
    • NCSA (~200 nodes)

  • Layout of the UW-Madison Pool

    [Diagram; labels:]
    • Dedicated Linux Cluster (~200 CPUs)
    • Instructional Computer Labs (~225 CPUs)
    • Desktop Workstations (~325 CPUs)
    • Dedicated Scheduler
    • Flocking to other pools
    • Submit-only machines at other sites

  • Composition of the UW/CS Cluster

    • Current cluster: 100 dual Xeon 550 MHz with 1 GB of RAM (tower cases)
    • New nodes being installed: 150 dual 933 MHz Pentium III, 36 nodes with 2 GB of RAM, the rest with 1 GB (2U racks)
    • 100 Mbit switched Ethernet to nodes
    • Gigabit Ethernet to the file servers and checkpoint server

  • Composition of the rest of the UW/CS Pool

    • Instructional Labs
      • 60 Intel/Linux
      • 60 Sparc/Solaris
      • 105 Intel/NT
    • Desktop Workstations
      • Includes 12-way and 8-way Ultra E6000s, other SMPs, and real desktops, etc.
    • Central Manager - 600 MHz Pentium III running Solaris, 512 MB RAM

  • Talk Outline

    • What is Condor and why is it good for large clusters?
    • The Condor Daemons (the sys admin view)
    • A look at the UW-Madison Computer Science Condor Pool and Cluster
    • Some other features of Condor that help for big pools
    • Future work

  • Condor's Configuration

    • Condor's configuration is a concatenation of multiple files, in order - definitions in later files overwrite previous definitions
    • Layout and purpose of the different files:
      • Global config file
      • Other shared files
      • Local config file

  • Global Config File

    • All shared settings across your entire pool
    • Found either in the file pointed to by the CONDOR_CONFIG environment variable, /etc/condor/condor_config, or the home directory of the condor user
    • Most settings can be in this file
    • Only works as a global file if it is on a shared file system (HIGHLY recommended for large sites!)

  • Other shared files

    • You can configure a number of other shared config files:
      • Files to hold common settings to make it easier to maintain (for example, all policy expressions, which we'll see later)
      • Platform-specific config files

  • Local config file

    • Any machine-specific settings
      • Local policy settings for a given owner
      • Different daemons to run (for example, on the Central Manager)
    • Can either be on the local disk of each machine, or have separate files in a shared directory, each named by hostname
    • For large sites: keep them all on AFS or NFS, and in CVS, if possible
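The layering of global, shared, and local config files can be sketched as a simple ordered merge in which later definitions win. This is a minimal illustration in Python, not Condor's own config parser; the macro values and the `PLATFORM_NOTE` name are made up for the example.

```python
def merge_config_layers(layers):
    """Merge config layers in order; later definitions override earlier ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged

# Illustrative settings only; values are placeholders.
global_config = {"CONDOR_ADMIN": "admin@example.org", "START": "True"}
platform_config = {"PLATFORM_NOTE": "linux"}          # hypothetical shared file
local_config = {"START": "KeyboardIdle > 900"}        # one owner's local policy

effective = merge_config_layers([global_config, platform_config, local_config])
print(effective["START"])
```

Because the local file is read last, a machine owner's policy replaces the pool-wide default while every other shared setting is inherited unchanged.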

  • Daemon-specific configuration

    • You can also change daemon-specific settings with condor_config_val
    • Use the -set option for persistent changes, or -rset for memory-resident only
    • Used by the EventD
    • Can be used by other entities for various remote-administration tasks

  • Advertising Your Own Attributes in the Machine ClassAd

    • Add new macro(s) to the config file
      • This is usually done in the local config file
      • Can name the macros anything, so long as the names don't conflict with existing ones
    • Tell the condor_startd to include these other macros in the ClassAd it sends out
      • Edit the STARTD_EXPRS macro to include the names of the macros you want to advertise (comma separated)
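The two steps on this slide (define macros, then list them in STARTD_EXPRS) can be modeled with a small Python sketch. It only imitates the selection logic; it is not how the condor_startd is implemented. The macro names `HAS_MATLAB` and `BUILDING` are hypothetical examples of site-defined attributes.

```python
def advertised_attributes(config):
    """Pick out the macros named in the comma-separated STARTD_EXPRS list."""
    names = [n.strip() for n in config.get("STARTD_EXPRS", "").split(",")]
    return {name: config[name] for name in names if name and name in config}

# Hypothetical local config: two custom macros, both listed for advertising.
local_config = {
    "HAS_MATLAB": "True",
    "BUILDING": '"CS"',
    "UNLISTED_SETTING": "42",          # defined but not advertised
    "STARTD_EXPRS": "HAS_MATLAB, BUILDING",
}

extra = advertised_attributes(local_config)
print(extra)
```

Only macros that are both defined and named in the list end up in the result, which mirrors why a new attribute stays invisible until the list is edited.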

  • Host/IP Security in Condor

    • You can configure each machine in your pool to allow or deny certain actions from different groups of machines:
      • read access - querying information
        • condor_status, condor_q, etc.
      • write access - updating information
        • condor_submit, adding a node to the pool, etc.
      • administrator access
        • condor_on, condor_off, condor_reconfig, condor_restart...
      • owner access
        • Things a machine owner can do (vacate)
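A host-based allow/deny check over the four access levels can be sketched as follows. This is an illustration only: the wildcard matching via Python's `fnmatch`, the rule that a deny entry takes precedence over an allow entry, and all hostnames are assumptions for the example, not a description of Condor's actual evaluation order.

```python
from fnmatch import fnmatch

def is_allowed(host, level, allow, deny):
    """Check a host against per-level allow/deny patterns (deny wins here)."""
    if any(fnmatch(host, pattern) for pattern in deny.get(level, [])):
        return False
    return any(fnmatch(host, pattern) for pattern in allow.get(level, []))

# Hypothetical policy for one machine in the pool.
allow = {
    "read": ["*"],                                  # anyone may query
    "write": ["*.cs.example.edu"],                  # only local hosts may update
    "administrator": ["manager.cs.example.edu"],
    "owner": ["desk42.cs.example.edu"],
}
deny = {"write": ["lab-guest*.cs.example.edu"]}

print(is_allowed("node1.cs.example.edu", "write", allow, deny))
print(is_allowed("lab-guest3.cs.example.edu", "write", allow, deny))
print(is_allowed("node1.cs.example.edu", "administrator", allow, deny))
```

Keeping each level separate lets a site open read access widely while restricting the commands that can change or restart a machine.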

  • The Different Versions of Condor

    • We distribute two versions of Condor:
      • Stable Series
        • Heavily tested, recommended for use
        • 2nd number of version string is even (6.2.0)
      • Development Series
        • Latest features, not necessarily well-tested
        • 2nd number of version string is odd (6.3.0)
        • Not recommended unless you know what you are doing and/or need a new feature
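The even/odd rule for the second number of the version string is easy to check mechanically. The helper below is a small sketch; the function name `release_series` is made up for the example.

```python
def release_series(version):
    """Classify a version string like '6.2.0' by the parity of its 2nd number."""
    second = int(version.split(".")[1])
    return "stable" if second % 2 == 0 else "development"

print(release_series("6.2.0"))
print(release_series("6.3.0"))
```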

  • Condor Versions (cont'd)

    • All daemons advertise a CondorVersion attribute in the ClassAd they publish
    • You can also view the version string by running ident on any Condor binary
    • In general, all parts of Condor on a single machine should run the same version
    • Machines in a pool can usually run different versions and communicate with each other
    • It will be made very clear when a version is incompatible with older versions

  • Talk Outline

    • What is Condor and why is it good for large clusters?
    • The Condor Daemons (the sys admin view)
    • A look at the UW-Madison Computer Science Condor Pool and Cluster
    • Some other features of Condor that help for big pools
    • Future work
