glideinWMS - The Larger Picture
glideinWMS training
i.e. Is it something you would be interested in?
by Igor Sfiligoi (UCSD)
Why this talk?
If you have never heard of glideinWMS before,
you likely have no idea whether this is a product you would be
interested in using.
This talk presents glideinWMS in a larger context,
allowing you to understand
what this product is all about.
The basics
glideinWMS has been designed to address the needs of High Throughput Computing (HTC)
Better known as batch processing
In a nutshell, we are trying to facilitate
the effective use
of a large number of CPUs
by a large number of users
High Throughput Computing
The basic premise of HTC is that there is always more demand than available CPUs
We should make good use of those CPUs
Keep them busy, ideally 24x7x365
Sustained utilization is thus more important than peak performance
The measure of success is
FLOPY = floating-point operations per year
not
FLOPS = floating-point operations per second
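To make the FLOPY versus FLOPS distinction concrete, here is a minimal sketch. The peak rates and utilization fractions are made-up illustrative numbers, and `flopy` is a hypothetical helper name, not part of any HTC product.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 24x7x365, ignoring leap years


def flopy(peak_flops: float, utilization: float) -> float:
    """Floating-point operations delivered in one year.

    peak_flops  -- peak rate in operations per second (illustrative)
    utilization -- fraction of the year the CPUs are kept busy, 0..1
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return peak_flops * utilization * SECONDS_PER_YEAR


# Hypothetical systems: a faster one that sits idle most of the time,
# and a slower one that is kept busy almost all year.
fast_but_idle = flopy(peak_flops=2e12, utilization=0.30)
slow_but_busy = flopy(peak_flops=1e12, utilization=0.95)

# The slower system delivers more work per year.
print(slow_but_busy > fast_but_idle)
```

Under these assumed numbers the system with half the peak rate still delivers more operations per year, which is the sense in which sustained utilization matters more than peak performance.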
HTC from the user point of view
As a side effect, users must be HTC-aware
There are some negative aspects
No interactive access, only process queuing (usually referred to as user jobs)
Waiting in line to get access to CPUs
But the payoff is potentially huge
A single user can use 1000s of CPUs at a time
Performing in a few days computations that would take several years on a single machine
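The payoff can be estimated with simple arithmetic. The sketch below assumes the workload splits into fully independent jobs and ignores time spent waiting in the queue; `wall_clock_days` and its `efficiency` parameter are hypothetical names introduced only for illustration.

```python
def wall_clock_days(single_cpu_years: float, cpus: int, efficiency: float = 1.0) -> float:
    """Rough wall-clock time, in days, for a workload spread over many CPUs.

    single_cpu_years -- how long the work would take on one machine
    cpus             -- number of CPUs used at the same time
    efficiency       -- assumed fraction of ideal speed-up actually achieved
    """
    if cpus < 1 or not 0.0 < efficiency <= 1.0:
        raise ValueError("need cpus >= 1 and 0 < efficiency <= 1")
    return single_cpu_years * 365 / (cpus * efficiency)


# Example: five years of single-machine computing on 1000 CPUs.
days = wall_clock_days(single_cpu_years=5, cpus=1000)
print(f"{days:.3f} days")
```

With these illustrative inputs the ideal estimate is just under two days; lowering `efficiency` shows how quickly the estimate grows when CPUs are not kept busy.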
HTC in simplified picture
[Diagram: Scheduler and Repository]
User scheduling is usually not FIFO
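The slide does not say which policy replaces FIFO. As one common alternative, the sketch below uses a simple fair-share rule: the next job comes from the user who has been served least so far. The class `FairShareQueue` and the user names are hypothetical and do not correspond to any particular batch system.

```python
from collections import defaultdict, deque


class FairShareQueue:
    """Toy job repository that serves the user with the least usage so far."""

    def __init__(self):
        self._queues = defaultdict(deque)  # user -> that user's jobs, in submit order
        self._usage = defaultdict(int)     # user -> number of jobs already started

    def submit(self, user, job):
        self._queues[user].append(job)

    def next_job(self):
        waiting = [u for u, q in self._queues.items() if q]
        if not waiting:
            return None
        # Least-served user first; ties broken by user name for determinism.
        user = min(waiting, key=lambda u: (self._usage[u], u))
        self._usage[user] += 1
        return user, self._queues[user].popleft()


repo = FairShareQueue()
for i in range(3):
    repo.submit("alice", f"a{i}")  # alice submits three jobs first
repo.submit("bob", "b0")           # bob submits one job afterwards

order = []
while (item := repo.next_job()) is not None:
    order.append(item[1])
print(order)  # ['a0', 'b0', 'a1', 'a2']
```

Under strict FIFO, bob's job would run last; with this assumed fair-share rule it runs second, even though it was submitted after all of alice's jobs.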
HTC products
There are many HTC products available
Although most call themselves batch systems
A non-exhaustive list:
Condor
PBS, with variants like Torque/Maui
LSF
SGE, also known as Oracle Grid Engine
Why another system?
All of the HTC systems mentioned
assume full control
of the compute resources (i.e. CPUs)
And there are many places where
this is the case
glideinWMS was developed to
support non-dedicated use
of compute resources
i.e. when CPUs are given
to the system only
for a limited duration at a time
Non-dedicated resources
In the past decade, two paradigms emerged
Grid computing
Cloud computing
Both allow a user community to use compute resources they don't own
Often called resource elasticity
Managing a large number of Grid and Cloud resources by hand is impractical
glideinWMS creates an HTC system using them
Grid vs Cloud
(a short summary)
Grid computing is basically a federation of HTC clusters
Thus recently called Distributed HTC
Job queuing is a native paradigm
(Commercial) Clouds are about leasing resources on a pay-as-you-go basis
And they happen to use virtualization
Instances are expected to start almost immediately
So-called scientific clouds are typically
just Grid systems that use virtualization
(and a different middleware stack)
glideinWMS is
currently optimized
for the Grid model
glideinWMS and the Grid
(Cloud resources are used in a similar way)
glideinWMS creates
an overlay system on top of
the various HTC clusters
From the user community
point of view,
a single HTC system
Just a dynamic one
glideinWMS
completely automates
the process
[Diagram: glideinWMS on top of several HTC clusters, presented as one HTC system]
Implementation and support
glideinWMS is heavily based on Condor
Essentially a thin layer on top of it
Most of the software support thus comes from the Condor
development team
At the University of Wisconsin-Madison
http://research.cs.wisc.edu/condor/
The glideinWMS-specific layer is supported by a team spanning
Fermilab, UCSD and ISI
http://tinyurl.com/glideinWMS
glideinWMS and Condor
Condor handles the HTC system
Most Condor features are thus available
The glideinWMS role is limited to scheduling, configuring and starting the Condor process on the compute resources
[Diagram: Condor as the job repository of the HTC system; glideinWMS as the CPU handler; user job]
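The division of labor described above can be sketched as a toy model. Everything here is an assumption made for illustration: the classes `JobRepository`, `Worker` and `CpuHandler`, the site names, and the lease lengths are hypothetical and are not the real Condor or glideinWMS interfaces. Workers also run one after another, whereas real resources run concurrently.

```python
from collections import deque


class JobRepository:
    """Stands in for the Condor side: holds user jobs and hands them to workers."""

    def __init__(self, jobs):
        self.queue = deque(jobs)
        self.done = []  # (job, site) pairs, in completion order

    def fetch(self):
        return self.queue.popleft() if self.queue else None


class Worker:
    """Stands in for a Condor process started on a leased CPU."""

    def __init__(self, site, lease):
        self.site = site
        self.lease = lease  # number of time steps the CPU is available to us

    def run(self, repo):
        # Pull one job per time step until the lease expires or the queue drains.
        while self.lease > 0:
            job = repo.fetch()
            if job is None:
                break
            repo.done.append((job, self.site))
            self.lease -= 1


class CpuHandler:
    """Stands in for the glideinWMS side: starts workers on non-dedicated CPUs."""

    def __init__(self, sites):
        self.sites = sites  # site name -> lease length (hypothetical values)

    def provision(self):
        return [Worker(site, lease) for site, lease in self.sites.items()]


repo = JobRepository([f"job{i}" for i in range(5)])
handler = CpuHandler({"grid_site_a": 2, "grid_site_b": 2, "cloud_x": 3})
for worker in handler.provision():
    worker.run(repo)

print(repo.done)
```

In this sketch the user only ever talks to the single job repository, while the CPU handler decides where workers are started; the jobs end up spread over all three hypothetical sites without the user naming any of them.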
Summary
glideinWMS is an HTC product
i.e. enables effective use of a large number of CPUs by a large number of users
glideinWMS creates an HTC system out of non-dedicated compute resources
e.g. Grid and Cloud resources
glideinWMS is heavily based on Condor
thus benefits from the Condor team support
Pointers
The glideinWMS development team is reachable at
[email protected]
The official project Web page is
http://tinyurl.com/glideinWMS
OSG glidein factory at UCSD
http://hepuser.ucsd.edu/twiki2/bin/view/UCSDTier2/OSGgfactory
http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_Production_v4_1/factoryStatus.html
Acknowledgments
This document was sponsored by grants from the US NSF and US DOE,
and by the UC system