glideinWMS - The Larger Picture

glideinWMS training

glideinWMS - The Larger Picture
i.e. Is it something you would be interested in?

by Igor Sfiligoi (UCSD)

Why this talk?

If you have never heard of glideinWMS before,
you likely have no idea whether this is a product you would be interested in using.

This talk presents glideinWMS in a larger context,
allowing you to understand
what this product is all about.

The basics

glideinWMS has been designed to address the needs of High Throughput Computing (HTC)
Better known as batch processing

In a nutshell, we are trying to facilitate
the effective use
of a large number of CPUs
by a large number of users

High Throughput Computing

The basic premise of HTC is that there is always more demand than available CPUs

We should make good use of those CPUs
Keep them busy, ideally, 24x7x365

Sustained utilization is thus
more important than peak performance
Measure of success is
FLOPY = Floating Point Operations per Year
not
FLOPS = Floating Point Operations per Second
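
To make the FLOPY versus FLOPS distinction concrete, here is a minimal Python sketch. The peak rates and utilization figures are made-up illustrative numbers, and `delivered_flopy` is a hypothetical helper, not part of any HTC product.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000, ignoring leap years


def delivered_flopy(peak_flops: float, utilization: float) -> float:
    """Floating point operations actually delivered in one year.

    peak_flops  -- peak rate of the system, in operations per second
    utilization -- fraction of the year the CPUs are kept busy (0.0 to 1.0)
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return peak_flops * utilization * SECONDS_PER_YEAR


# Hypothetical system A: higher peak rate, but idle most of the time.
flopy_a = delivered_flopy(peak_flops=2.0e12, utilization=0.30)
# Hypothetical system B: half the peak rate, but busy almost all year.
flopy_b = delivered_flopy(peak_flops=1.0e12, utilization=0.95)

print(f"A: {flopy_a:.3e} FLOPY, B: {flopy_b:.3e} FLOPY")
```

Under these assumed numbers, system B delivers more operations per year than system A despite having half the peak rate, which is the sense in which sustained utilization matters more than peak performance.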

HTC from the user point of view

As a side effect, users must be HTC-aware

There are some negative aspects
No interactive access, only process queuing
Usually referred to as user jobs

Waiting in line to get access to CPUs

But the payoff is potentially huge
A single user can use 1000s of CPUs at a time

Performing in a few days
computations that would
take several years on a single machine
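
As a back-of-the-envelope check of that claim, the following Python sketch computes an ideal wall-clock time. It assumes the work splits into fully independent jobs and ignores time spent waiting in the queue; `ideal_days` and the numbers used are hypothetical.

```python
def ideal_days(cpu_years: float, cpus: int) -> float:
    """Ideal wall-clock days to finish `cpu_years` of work on `cpus` CPUs.

    Assumes perfectly parallel, independent jobs and no queuing delay.
    """
    if cpus < 1:
        raise ValueError("need at least one CPU")
    return cpu_years * 365 / cpus


# Three years of single-machine computation spread over 1000 CPUs.
days = ideal_days(cpu_years=3.0, cpus=1000)
print(f"{days:.3f} days")
```

In practice the time will be longer, because jobs wait in line and rarely split perfectly, but the order of magnitude is what makes the trade-off attractive.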

HTC in simplified picture

[Diagram labels: Scheduler, Repository]

User scheduling
usually not FIFO
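
One common alternative to FIFO is some form of fair-share ordering. The sketch below is a toy illustration of that idea only; it is not the algorithm used by Condor or any other product named in this talk, and `fair_share_order` is a hypothetical function.

```python
from collections import defaultdict, deque


def fair_share_order(jobs):
    """Return job names in the order a toy fair-share scheduler starts them.

    jobs -- list of (user, job_name) tuples in submission order.
    Each pick goes to the user who has been served the fewest jobs so far;
    ties are broken by whose oldest waiting job was submitted first.
    """
    queues = defaultdict(deque)  # per-user FIFO of (submit_index, job_name)
    for index, (user, name) in enumerate(jobs):
        queues[user].append((index, name))

    served = defaultdict(int)  # number of jobs already started per user
    order = []
    while any(queues.values()):
        waiting = [u for u, q in queues.items() if q]
        user = min(waiting, key=lambda u: (served[u], queues[u][0][0]))
        _, name = queues[user].popleft()
        served[user] += 1
        order.append(name)
    return order


# alice submits three jobs before bob submits one.
submitted = [("alice", "a1"), ("alice", "a2"), ("alice", "a3"), ("bob", "b1")]
print(fair_share_order(submitted))
```

With plain FIFO, bob's job would wait behind all three of alice's jobs; here it starts right after her first one.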

HTC products

There are many HTC products available
Although most call themselves batch systems

A non-exhaustive list:
Condor

PBS, with variants like Torque/Maui

LSF

SGE, also known as Oracle Grid Engine

Why another system?

All of the mentioned HTC systems
assume full control
of the compute resources (i.e. CPUs)
And there are many places where this is the case

glideinWMS was developed to
support non-dedicated use
of compute resources
i.e. when CPUs are given
to the system only
for a limited duration at a time

Non-dedicated resources

In the past decade, two paradigms emerged
Grid computing

Cloud computing

Both allow a user community to use compute resources they don't own
Often called resource elasticity

Managing a large number of Grid and Cloud resources by hand is impractical
glideinWMS creates an HTC system using them

Grid vs Cloud
(a short summary)

Grid computing is basically a federation of HTC clusters
Thus recently called Distributed HTC

Job queuing is a native paradigm

(Commercial) Clouds are about leasing resources on a pay-as-you-go basis
And they happen to use virtualization

Instances expected to start almost immediately

So-called scientific clouds are typically
just Grid systems that use virtualization
(and a different middleware stack)

glideinWMS is
currently optimized
for the Grid model

glideinWMS and the Grid
(Cloud resources are used in a similar way)

glideinWMS creates
an overlay system on top of
the various HTC clusters
From the user community
point of view,
a single HTC system

Just a dynamic one

glideinWMS
completely automates
the process

[Diagram: glideinWMS layered over several HTC clusters]

Implementation and support

glideinWMS is heavily based on Condor
Essentially a thin layer on top of it

Most of the software support thus comes from
the Condor development team
At the University of Wisconsin-Madison
http://research.cs.wisc.edu/condor/

The glideinWMS-specific layer is supported by a team spanning Fermilab, UCSD and ISI
http://tinyurl.com/glideinWMS

glideinWMS and Condor

Condor handles the HTC system
Most Condor features are thus available

glideinWMS's role is limited to scheduling, configuring and starting the Condor process on
the compute resources

[Diagram labels: Condor Job Repository, Condor CPU Handler, HTC, glideinWMS, User Job]
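
The division of labor on this slide can be illustrated with a toy model. This is a conceptual sketch only: `LeasedWorker` and `run_overlay` are hypothetical names that do not correspond to glideinWMS or Condor code, time is modelled as abstract ticks, and the matching rule is deliberately simplistic.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class LeasedWorker:
    """Stands in for a Condor process started on a borrowed compute resource."""
    site: str
    lease: int  # ticks of CPU time the resource is lent for (assumed)


def run_overlay(job_lengths, workers):
    """Match queued user jobs to workers until jobs or leases run out.

    job_lengths -- list of job run times in ticks, in queue order
    workers     -- LeasedWorker objects already started on remote resources
    Returns a list of (job_index, site) pairs for the jobs that ran.
    """
    queue = deque(enumerate(job_lengths))  # plays the job repository
    ran = []
    for worker in workers:
        remaining = worker.lease
        # A worker keeps pulling jobs while the next one fits in its lease.
        while queue and queue[0][1] <= remaining:
            index, length = queue.popleft()
            remaining -= length
            ran.append((index, worker.site))
    return ran


# Starting the workers is the part that glideinWMS would automate.
workers = [LeasedWorker("grid_site_a", lease=5), LeasedWorker("cloud_b", lease=4)]
# Handing queued jobs to running workers is the part Condor would handle.
placements = run_overlay([2, 2, 3, 1], workers)
print(placements)
```

In this model, building the `workers` list plays the role described for glideinWMS, while the loop inside `run_overlay` plays the role described for Condor. From the user's side there is just one queue, even though the jobs end up on two different sites.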

Summary

glideinWMS is an HTC product
i.e. enables effective use of a large number of CPUs by a large number of users

glideinWMS creates an HTC system out of non-dedicated compute resources
e.g. Grid and Cloud resources

glideinWMS is heavily based on Condor
thus benefits from the Condor team support

Pointers

glideinWMS development team is reachable at
[email protected]

The official project Web page is
http://tinyurl.com/glideinWMS

OSG glidein factory at UCSD
http://hepuser.ucsd.edu/twiki2/bin/view/UCSDTier2/OSGgfactory
http://glidein-1.t2.ucsd.edu:8319/glidefactory/monitor/glidein_Production_v4_1/factoryStatus.html

Acknowledgments

This document was sponsored by grants from the US NSF and US DOE,
and by the UC system