CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 1
CLOSER 2012
The glideinWMS approachto the ownership of System Images
in the Cloud World
presented by Igor Sfiligoi1
co-authors A.Tiradani2, B.Holzman2 and D.C.Bradley3
1UC San Diego, 2Fermilab, 3UW Madison
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 2
Our users
● Our primary target audience is the scientific community● In particular, those communities that need
massive amounts of CPU cycles
● Large number of users - O(1k)● Organized in groups (known as VOs)
● Batch processing is a must● Typical user task requires > 1 CPU year● Users understand the need of splitting the problem
in many (semi-)independent tasks
Virtual Organizations
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 3
Our environment
● World-wide distributed computing a must● No single resource provider can provide enough
CPU for all the users– Both due to logistical and political constraints
● Until now this meant Grid computing● i.e. federation of independent batch sys. providers● and essentially all research-funded
● Our VOs are now considering Cloud computing, too● Both commercial and research-funded
Cloud==IaaS
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 4
CLOUDGRID
Grid vs Cloud
● Similarities● Both are a way to provision resources
● Differences relevant to this talk
● Only bare (virtualized) hardware provided by the resource provider
● Must install OS before running the actual payload– But more flexibility
● OS and system libraries provided by the resource provider
● User just executes his payload– Must play well with
provided OS
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 5
CLOUDGRID
Grid vs Cloud
● Similarities● Both are a way to provision resources
● Differences relevant to this talk
● Only bare (virtualized) hardware provided by the resource provider
● Must install OS before running the actual payload– But more flexibility
● OS and system libraries provided by the resource provider
● User just executes his payload– Must play well with
provided OS
Some user groups findthe cloud model easier
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 6
Why can Cloud be easier?
● Counterintuitive● Installing and maintaining a whole OS a big task!
● Yet, easier than trying to adapt existing scientific application● Most code not actively maintained● Often written making system assumptions
● Plus, most (virtualized) hardware uniform● And HW API variations are usually much smaller
than OS API variations
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 7
Why can Cloud be easier?
● Counterintuitive● Installing and maintaining a whole OS a big task!
● Yet, easier than trying to adapt existing scientific application● Most code not actively maintained● Often written making system assumptions
● Plus, most (virtualized) hardware uniform● And HW API variations are usually much smaller
than OS API variations
But someone stillhas to manage the batch jobs
EnterglideinWMS
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 8
glideinWMS
● An overlay batch system on top of dynamic resources● Based on the pilot principle
● Hides provisioning from final users● Looks and feels like
a batch systemon top of dedicated resources to them
Provider A
Provider BProvider CProvider AProvider A
Glideinresource pool
http://tinyurl.com/glideinWMS Sfiligoi, I. et al., (2009). The pilot way to grid resources using glideinwms. In Computer Science and Information Engineering, 2009 WRI World Congress on, 2, pp.428-32. doi:10.1109/CSIE.2009.950
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 9
glideinWMS architecture
● Three independent components● Glidein Factory – interface to resources● VO Frontend – resource provisioning logic● The actual WMS – seen by the final users
ResourceProviders
VO Frontend
GlideinFactory
WMS
PilotPilotPilot
PilotJob
Job Pilot
Monitor Request
Workload Management System == Batch system
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 10
glideinWMS architecture
● Three independent components● Glidein Factory – interface to resources● VO Frontend – resource provisioning logic● The actual WMS – seen by the final users
ResourceProviders
VO Frontend
GlideinFactory
WMS
PilotPilotPilot
PilotJob
Job Pilot
Monitor RequestVO Frontend
GlideinFactory
WMS
GlideinFactory VO Frontend
VO Frontend
WMS
N-to-Mrelationship
Workload Management System == Batch system
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 11
Multiple operation teams
● Few groups operate the whole glideinWMS● Typically separate groups for
● Glidein Factory● VO Frontend + WMS
Sfiligoi, I. et al., (2011). Reducing the human cost of grid computing with glideinWMS. In CLOUD COMPUTING 2011, pp. 217-21.
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 12
Multiple operation teams
● Few groups operate the whole glideinWMS● Typically separate groups for
● Glidein Factory● VO Frontend + WMS
● Glidein Factory typically generic● Essentially an abstraction layer to resources● Was the interface to Grid resources
– Adding the support for Cloud resources now
Sfiligoi, I. et al., (2011). Reducing the human cost of grid computing with glideinWMS. In CLOUD COMPUTING 2011, pp. 217-21.Andrews, W. et al., (2011). Early experience on using glideinWMS in the cloud. J. Phys.: Conf. Ser. 331 062014
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 13
Multiple operation teams
● Few groups operate the whole glideinWMS● Typically separate groups for
● Glidein Factory● VO Frontend + WMS
● Glidein Factory typically generic● VO Frontend the actual brain
● Contains resource provisioning logic● Provides pilot credential(s)● Customizes the pilot
Sfiligoi, I. et al., (2011). Adapting to the Unknown With a few Simple Rules: The glideinWMS Experience. In ADAPTIVE 2011, pp. 25-28.Jan 2012 VO Frontend Training Session at UCSD
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 14
Multiple operation teams
● Few groups operate the whole glideinWMS● Typically separate groups for
● Glidein Factory● VO Frontend + WMS
● Glidein Factory typically generic● VO Frontend the actual brain
So...when movingto the Cloud,
who should ownthe OS image?
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 15
Option 1 – The Grid approach
● The easiest path forward is to mimic the Grid approach● Nobody, but the closest layer aware of the Cloud
● This implies● OS image owned by the Glidein Factory
– Since it is the one closest to the Cloud● Drop superuser privileges as soon as
the system boots up– Since superuser ops not allowed in the Grid paradigm
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 16
Option 1 - Analysis
Advantages:● VO unaware of the Cloud
● Only system services have superuser privileges– Minimize risk
● OS management shared between multiple VOs– Better quality– Lower security risk
Disadvantages:● VO cannot take advantage
of Cloud flexibility● Pilot cannot take full
advantage of OS features
● More work for the Factory admins– May need to maintain
several OS images
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 17
Option 2 – Adding root privileges
● One step further is allowing pilot to run as root (i.e. superuser)
● This implies● OS image still owned by the Glidein Factory● But pilot process now a system service● VO can customize OS configs, e.g.
– Install new software– Configure and start otherwise-dormant system services
● Side effect of VO Frontend being able to customize the pilot● Since running as root, it can change anything in the system
● Acceptable, since pilot running with credential provided by the VO FE
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 18
Option 2 - Analysis
Advantages:● VO only marginally aware
of the Cloud● VO can change
system settings– And the pilot now a proper
batch system daemon
● OS management shared between multiple VOs– Better quality– Lower security risk
Disadvantages:● VO cannot provide the
OS of their choice● Increased security risk
– Dynamic system configuration– Superuser processes
● VO must maintain any software it installs
● More work for the Factory admins– May need to maintain
several OS images
● Added effort● Increased security risk
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 19
Option 3 – Pure Cloud model
● Move OS handling down to the “user”● The VO Frontend admins in glideinWMS
● This implies● OS owned by the VO Frontend● Glidein Factory still needs to manipulate it
– Add pilot bits– Proper Cloud-provider-specific contextualization
● Pilot process now a system service
Not strictly needed, but very little reason not to
We do not want to expose it to the scientists
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 20
Option 3 - Analysis
Advantages:● VO can now provide
custom OS image● The pilot is a proper
batch system daemon
● Less work for the Glidein Factory admins
Disadvantages:● VO has to provide an
OS image● Increased security risk
– Dynamic system configuration– Superuser processes
● More work for VO(s)
– Must maintain the VO image● Potential image
manipulation problems– Given arbitrary OS image
in Glidein Factory
Expertise problem
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 21
The network traffic cost
● Most Cloud providers charge users both for CPU and WAN network usage● Network was “free” in the Grid world
● Data prestaging required to keep costs down● Data hosting still cheaper than networking● This includes software (including the OS)
● Implications on the OS ownership options:● VO-owned software likely more dynamic
– Thus higher cost
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 22
Making a choice
● No option a clear winner● Each option has significant advantages
and disadvantages
● Still want to pick one● To focus development● As the recommended
deployment strategyBut may end up
supporting all of themin the code
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 23
Option 2 chosen
● Option 3 not suitable for current user base● Not enough expertise in VOs to maintain full OS● Happy with only a few OS versions
● Option 1 too restrictive● Some VOs are eager to tweak system settings
– Major reason to look at the Cloud
● Major drawback of option 2 is security risk● Can be mitigated by using superuser privileges
only when absolutely needed
Manpower alsoa major issue
Compared to Option 1
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 24
Future work
● The glideinWMS project about to release the first Cloud-enabled version● Works fine in internal tests
● But no real-world Cloud experience yet● Just inferring from our Grid-based deployments
● Eager to put it in real-user hands● And collect their feedback
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 25
Conclusions
● Our user base lives in batch system paradigm● Extensive experience with Grid world● But would like to expand into Cloud resources
● OS ownership the major change● glideinWMS architecture allows for delegating the
ownership to an intermediate service● And lower the migration cost
● But is it a good idea?● Yes, if some customization allowed
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 26
Acknowledgments
● This work is partially sponsored by ● the US National Science Foundation under Grants
No. OCI-0943725 (STCI), PHY-1104549 (AAA), and PHY-0612805 (CMS Maintenance & Operations)
and ● the US Department of Energy under Grants No.
DE-SC0002298 (ANDSL) and DE-FC02-06ER41436 subcontract No. 647F290 (OSG).
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 27
Backup slides
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 28
glideinWMS in Numbers
● OSG Glidein Factory at UCSD● Serving ~10 Frontends
CLOSER 2012 - Apr 2012 glideinWMS - Ownership of VM 29
glideinWMS in Numbers
● CMS Frontend at CERN● Using 3 Glidein Factories