Peter Couvares
Computer Sciences Department
University of Wisconsin-Madison
[email protected]
http://www.cs.wisc.edu/~pfc
High-Throughput Computing With Condor
www.cs.wisc.edu/condor
Who Are We?
The Condor Project (Established ‘85)
Distributed systems CS research performed by a team that faces:
› software engineering challenges in a Unix/Linux/NT environment,
› active interaction with users and collaborators,
› daily maintenance and support challenges of a distributed production environment, and
› educating and training students.
Funding: NSF, NASA, DoE, DoD, IBM, Intel, Microsoft, and the UW Graduate School.
The Condor System
› Unix and NT
› Operational since 1986
› More than 1300 CPUs at UW-Madison
› Available on the web
› More than 150 clusters worldwide in academia and industry
What is Condor?
› Condor converts collections of distributively owned workstations and dedicated clusters into a high-throughput computing facility.
› Condor uses matchmaking to make sure that everyone is happy.
What is High-Throughput Computing?
› High-performance: CPU cycles/second under ideal circumstances. “How fast can I run simulation X on this machine?”
› High-throughput: CPU cycles/day (week, month, year?) under non-ideal circumstances. “How many times can I run simulation X in the next month using all available machines?”
What is High-Throughput Computing?
› Condor does whatever it takes to run your jobs, even if some machines…
• Crash! (or are disconnected)
• Run out of disk space
• Don’t have your software installed
• Are frequently needed by others
• Are far away & admin’ed by someone else
What is Matchmaking?
› Condor uses Matchmaking to make sure that work gets done within the constraints of both users and owners.
› Users (jobs) have constraints: “I need an Alpha with 256 MB RAM”
› Owners (machines) have constraints: “Only run jobs when I am away from my desk and never run jobs owned by Bob.”
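The two-sided check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Condor's matchmaker or its ClassAd language: the dictionary fields (`wanted_arch`, `min_memory_mb`, `owner_away`, `banned_owners`), the function names, and the sample machines and jobs are all hypothetical.

```python
# Hypothetical sketch of two-sided matchmaking: a job and a machine
# are paired only if each side's constraints accept the other.

def job_accepts(job, machine):
    # User constraint from the slide: "I need an Alpha with 256 MB RAM"
    return (machine["arch"] == job["wanted_arch"]
            and machine["memory_mb"] >= job["min_memory_mb"])

def machine_accepts(machine, job):
    # Owner constraint from the slide: only run jobs when the owner is
    # away, and never run jobs owned by Bob.
    return machine["owner_away"] and job["owner"] not in machine["banned_owners"]

def match(jobs, machines):
    """Greedily pair each job with the first free machine that both sides accept."""
    free = list(machines)
    pairs = []
    for job in jobs:
        for machine in free:
            if job_accepts(job, machine) and machine_accepts(machine, job):
                pairs.append((job["name"], machine["name"]))
                free.remove(machine)
                break
    return pairs

# Made-up pool and job queue for illustration.
machines = [
    {"name": "m1", "arch": "ALPHA", "memory_mb": 256, "owner_away": True, "banned_owners": {"bob"}},
    {"name": "m2", "arch": "INTEL", "memory_mb": 512, "owner_away": True, "banned_owners": set()},
    {"name": "m3", "arch": "ALPHA", "memory_mb": 512, "owner_away": False, "banned_owners": set()},
]
jobs = [
    {"name": "alice-sim", "owner": "alice", "wanted_arch": "ALPHA", "min_memory_mb": 256},
    {"name": "bob-sim", "owner": "bob", "wanted_arch": "ALPHA", "min_memory_mb": 256},
]
pairs = match(jobs, machines)
print(pairs)
```

In this made-up pool only Alice's job is placed: Bob's job is refused by `m1` (its owner bans Bob), `m2` is the wrong architecture, and `m3`'s owner is at the desk.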
“What can Condor do for me?”
Condor can…
› …do your housekeeping.
› …improve reliability.
› …give performance feedback.
› …increase your throughput!
Some Numbers: UW-CS Pool
Period / group                 Hours        Equivalent
6/98-6/00                      4,000,000    ~450 years
“Real” Users                   1,700,000    ~260 years
    CS-Optimization              610,000
    CS-Architecture              350,000
    Physics                      245,000
    Statistics                    80,000
    Engine Research Center        38,000
    Math                          90,000
    Civil Engineering             27,000
    Business                         970
“External” Users                 165,000    ~19 years
    MIT                           76,000
    Cornell                       38,000
    UCSD                          38,000
    CalTech                       18,000
Condor & Physics
Current CMS Activity
› Simulation (CMSIM) for CalTech
• provided >135,000 CPU hours to date
• peak day ~4,000 CPU hours
• via NCSA Alliance, Condor has allocated 1,000,000 hours total to CalTech
› Simulation and Reconstruction (CMSIM + ORCA) for HEP group at UW-Madison
INFN Condor Pool - Italy
› Italian National Institute for Research in Nuclear and Subnuclear Physics
› 19 locations, each running a Condor pool
› from as few as 1 CPU to >100 CPUs
› each locally controlled
› each “flocks” jobs to other pools when available
Particle Physics Data Grid
› The PPDG Project is...
• a software engineering effort to design, implement, experiment with, evaluate, and prototype HEP-specific data-transfer and caching software tools for Grid environments
› For example...
Condor PPDG Work
› Condor Data Manager
• technology to automate & coordinate data movement from a variety of long-term repositories to available Condor computing resources & back again
• keeping the pipeline full!
• SRB (SDSC), SAM (Fermi), PPDG HRM
PPDG Collaborators
California Institute of Technology: Harvey B. Newman, Julian J. Bunn, Koen Holtman, Asad Samar, Takako Hickey, Iosif Legrand, Vladimir Litvin, Philippe Galvez, James C.T. Pool, Roy Williams
Argonne National Laboratory: Ian Foster, Steven Tuecke, Lawrence Price, David Malon, Ed May
Berkeley Laboratory: Stewart C. Loken, Ian Hinchcliffe, Doug Olson, Alexandre Vaniachine, Arie Shoshani, Andreas Mueller, Alex Sim, John Wu
Brookhaven National Laboratory: Bruce Gibbard, Richard Baker, Torre Wenaus
Fermi National Laboratory: Victoria White, Philip Demar, Donald Petravick, Matthias Kasemann, Ruth Pordes, James Amundson, Rich Wellner, Igor Terekhov, Shahzad Muzaffar
University of Florida: Paul Avery
San Diego Supercomputer Center: Margaret Simmons, Reagan Moore
Stanford Linear Accelerator Center: Richard P. Mount, Les Cottrell, Andrew Hanushevsky, Davide Salomoni
Thomas Jefferson National Accelerator Facility: Chip Watson, Ian Bird, Jie Chen
University of Wisconsin: Miron Livny, Peter Couvares, Tevfik Kosar
National Grid Efforts
› GriPhyN (Grid Physics Network)
› National Technology Grid - NCSA Alliance (NSF-PACI)
› Information Power Grid - IPG (NASA)
› close collaboration with the Globus project
I have 600 simulations to run.
How can Condor help me?
My Application …
› Simulate the behavior of F(x,y,z) for 20 values of x, 10 values of y and 3 values of z (20*10*3 = 600)
• F takes on the average 3 hours to compute on a “typical” workstation (total = 1800 hours)
• F requires a “moderate” (128MB) amount of memory
• F performs “moderate” I/O - (x,y,z) is 5 MB and F(x,y,z) is 50 MB
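The arithmetic behind these figures can be checked with a short Python sketch. The counts, the 3-hour average, and the 5 MB / 50 MB sizes come from the slide; the actual parameter values are not given, so `range(...)` stands in for them as a placeholder.

```python
from itertools import product

# Placeholder parameter values: the slide gives only the counts (20, 10, 3).
xs = range(20)
ys = range(10)
zs = range(3)

HOURS_PER_RUN = 3    # average on a "typical" workstation (from the slide)
INPUT_MB = 5         # size of one (x, y, z) input (from the slide)
OUTPUT_MB = 50       # size of one F(x, y, z) result (from the slide)

combos = list(product(xs, ys, zs))
total_hours = len(combos) * HOURS_PER_RUN
serial_days = total_hours / 24            # one workstation, running nonstop
total_input_mb = len(combos) * INPUT_MB
total_output_mb = len(combos) * OUTPUT_MB

print(len(combos), total_hours, serial_days)   # 600 1800 75.0
print(total_input_mb, total_output_mb)         # 3000 30000
```

Seventy-five days of uninterrupted computing on a single workstation is roughly 2.5 months.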
Step I - get organized!
› Write a script that creates 600 input files, one for each of the (x,y,z) combinations
› Write a script that will collect the data from the 600 output files
› Turn your workstation into a “Personal Condor”
› Submit a cluster of 600 jobs to your personal Condor
› Go on a long vacation … (2.5 months)
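The scripts in Step I might look roughly like the sketch below. It is an assumption-laden illustration, not the presenter's scripts: the file names (`in.N`, `out.N`, `err.N`, `F.log`, `F.submit`), the executable name `F`, the one-line input format, and the choice of the vanilla universe are all hypothetical. The keywords in the generated text (`universe`, `executable`, `input`, `output`, `error`, `log`, `queue`) and the `$(Process)` macro follow Condor's submit description format as commonly documented, and should be checked against the manual for the installed version.

```python
from itertools import product
from pathlib import Path
import tempfile

def write_inputs(workdir, xs, ys, zs):
    """Write one input file per (x, y, z) combination; return how many were written."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    count = 0
    for i, (x, y, z) in enumerate(product(xs, ys, zs)):
        # File name and "x y z" text format are assumptions for illustration.
        (workdir / f"in.{i}").write_text(f"{x} {y} {z}\n")
        count += 1
    return count

def submit_description(n_jobs, executable="F"):
    """Build a submit description queuing n_jobs; $(Process) takes values 0..n_jobs-1."""
    return "\n".join([
        "universe   = vanilla",
        f"executable = {executable}",
        "input      = in.$(Process)",
        "output     = out.$(Process)",
        "error      = err.$(Process)",
        "log        = F.log",
        f"queue {n_jobs}",
    ]) + "\n"

def collect_outputs(workdir, n_jobs):
    """Gather the contents of every out.<i> file that exists so far."""
    workdir = Path(workdir)
    results = {}
    for i in range(n_jobs):
        path = workdir / f"out.{i}"
        if path.exists():
            results[i] = path.read_text()
    return results

# Placeholder parameter values; a temporary directory keeps the sketch harmless.
workdir = tempfile.mkdtemp()
n = write_inputs(workdir, range(20), range(10), range(3))
(Path(workdir) / "F.submit").write_text(submit_description(n))
print(n, "input files written to", workdir)
```

The generated `F.submit` file would then be handed to `condor_submit`, and `collect_outputs` can be run at any time to pick up whichever results have already arrived.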
[Figure: your workstation, personal Condor, 600 Condor jobs]
Step II - build your personal Grid
› Install Condor on the desktop machine next door
› …and on the machines in the classroom.
› Install Condor on the department’s Linux cluster or the O2K in the basement.
› Configure these machines to be part of your Condor pool.
› Go on a shorter vacation ...
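How much shorter the vacation gets can be estimated with a deliberately idealized model. The sketch below assumes every job takes exactly 3 hours and every machine is identical and always available, which is precisely what a real opportunistic pool does not guarantee; the pool sizes are hypothetical.

```python
import math

def wall_clock_days(n_jobs, hours_per_job, n_machines):
    """Idealized makespan: jobs run in rounds of n_machines, each job taking the same time."""
    rounds = math.ceil(n_jobs / n_machines)
    return rounds * hours_per_job / 24

for machines in (1, 10, 50):  # hypothetical pool sizes
    print(machines, wall_clock_days(600, 3, machines))
# 1 75.0
# 10 7.5
# 50 1.5
```

Treat these numbers as a best case: machines reclaimed by their owners or lost to crashes stretch the real figure.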
[Figure: your workstation, personal Condor, 600 Condor jobs, Group Condor]
Step III - take advantage of your friends
› Get permission from “friendly” Condor pools to access their resources
› Configure your personal Condor to “flock” to these pools
› Reconsider your vacation plans ...
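The idea of flocking, where work that the local pool cannot absorb overflows to friendly pools, can be sketched as below. This is a toy model only, not Condor's flocking mechanism: the function name, the pool names, and the idle-machine counts are hypothetical, and the "local first, then friends in listed order" policy is an assumption made for the illustration.

```python
def place_jobs(n_jobs, local_idle, friendly_pools):
    """Assign jobs to the local pool first, then overflow to friendly pools in order.

    friendly_pools is a list of (name, idle_machines) pairs.
    Returns (placements, jobs_left_waiting).
    """
    placements = {}
    remaining = n_jobs
    for name, idle in [("local", local_idle)] + list(friendly_pools):
        used = min(remaining, idle)
        if used:
            placements[name] = used
        remaining -= used
        if remaining == 0:
            break
    return placements, remaining

# Made-up numbers: 40 idle machines locally, two friendly pools.
placements, waiting = place_jobs(600, 40, [("friend-a", 150), ("friend-b", 300)])
print(placements, waiting)
# {'local': 40, 'friend-a': 150, 'friend-b': 300} 110
```

The 110 jobs left over simply wait in the queue until machines free up somewhere.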
[Figure: your workstation, personal Condor, 600 Condor jobs, Group Condor, friendly Condor]
Think BIG.
Go to the Grid.
Upgrade to Condor-G
A Grid-enabled version of Condor that uses the inter-domain services of Globus to bring Grid resources into the domain of your Personal Condor
› Easy to use on different platforms
› Robust
› Supports SMPs & dedicated schedulers
Step IV - Go for the Grid
› Get access (account(s) + certificate(s)) to a “Computational” Grid
› Submit 599 “Grid Universe” Condor glide-in jobs to your personal Condor
› Take the rest of the afternoon off ...
[Figure: your workstation, personal Condor, 600 Condor jobs, Group Condor, friendly Condor, Globus Grid (PBS, LSF, Condor), 599 glide-ins]
What Have We Done with the Grid Already?
› NUG30 quadratic assignment problem
• 30 facilities, 30 locations
• minimize cost of transferring materials between them
• posed in 1968 as challenge, long unsolved
• but with a good pruning algorithm & high-throughput computing...
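To make the problem statement concrete, here is a small Python sketch of the quadratic assignment objective together with a very simple pruned search. It is not the algorithm used for NUG30, which the slides describe only as "a good pruning algorithm": the bound here is just the cost of the partial assignment, and the 3x3 flow and distance matrices are made-up data. The bound is valid only because all flows and distances are non-negative.

```python
def qap_cost(flow, dist, assignment):
    """Total cost when facility i is placed at location assignment[i]."""
    n = len(assignment)
    return sum(flow[i][j] * dist[assignment[i]][assignment[j]]
               for i in range(n) for j in range(n))

def solve_qap(flow, dist):
    """Depth-first search over assignments, pruning any partial assignment
    whose cost already reaches the best complete cost found so far."""
    n = len(flow)
    best = {"cost": float("inf"), "assignment": None}

    def extend(partial, used, cost_so_far):
        if cost_so_far >= best["cost"]:
            return  # prune: placing more facilities can only add cost
        k = len(partial)
        if k == n:
            best["cost"], best["assignment"] = cost_so_far, tuple(partial)
            return
        for loc in range(n):
            if loc in used:
                continue
            # Cost added by placing facility k at loc, relative to facilities already placed.
            added = sum(flow[k][i] * dist[loc][partial[i]] +
                        flow[i][k] * dist[partial[i]][loc]
                        for i in range(k))
            added += flow[k][k] * dist[loc][loc]
            extend(partial + [loc], used | {loc}, cost_so_far + added)

    extend([], frozenset(), 0)
    return best["cost"], best["assignment"]

# Made-up 3-facility instance (material flows and location distances).
flow = [[0, 5, 2],
        [5, 0, 3],
        [2, 3, 0]]
dist = [[0, 1, 4],
        [1, 0, 2],
        [4, 2, 0]]

cost, assignment = solve_qap(flow, dist)
print(cost, assignment)  # 38 (0, 1, 2)
```

With 30 facilities there are 30! possible assignments, which is why both strong pruning and a very large number of CPU hours were needed.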
NUG30 Personal Condor Grid
For the run we will be flocking to
-- the main Condor pool at Wisconsin (600 processors)
-- the Condor pool at Georgia Tech (190 Linux boxes)
-- the Condor pool at UNM (40 processors)
-- the Condor pool at Columbia (16 processors)
-- the Condor pool at Northwestern (12 processors)
-- the Condor pool at NCSA (65 processors)
-- the Condor pool at INFN (200 processors)
We will be using glide_in to access the Origin 2000 (through LSF) at NCSA.
We will use "hobble_in" to access the Chiba City Linux cluster and Origin 2000 here at Argonne.
NUG30 - Solved!!!
Sender: [email protected]
Subject: Re: Let the festivities begin.
Hi dear Condor Team,
you all have been amazing. NUG30 required 10.9 years of
Condor Time. In just seven days !
More stats tomorrow !!! We are off celebrating !
condor rules !
cheers,
JP.
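A quick calculation shows what the figures in this message imply. The 10.9 years and seven days come from the message, and the pool sizes from the NUG30 Personal Condor Grid slide; the 365-day year is an assumption, and the glide_in and hobble_in resources are left out because no processor counts are given for them.

```python
CONDOR_YEARS = 10.9   # CPU time reported in the message
WALL_DAYS = 7         # "In just seven days"

cpu_days = CONDOR_YEARS * 365      # assumes a 365-day year
avg_cpus = cpu_days / WALL_DAYS    # average number of CPUs busy at once

# Processor counts of the flocked pools listed on the NUG30 grid slide.
pools = {
    "Wisconsin": 600,
    "Georgia Tech": 190,
    "UNM": 40,
    "Columbia": 16,
    "Northwestern": 12,
    "NCSA": 65,
    "INFN": 200,
}

print(round(avg_cpus))        # 568
print(sum(pools.values()))    # 1123
```

On average, then, roughly 570 processors were working on NUG30 at any moment of the week.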
Conclusion
Computing power is everywhere; we try to make it usable by anyone.
Need more info?
› Condor Web Page (http://www.cs.wisc.edu/condor)
› Peter Couvares ([email protected])