Budgeting: the Ugly Duckling of Cloud Computing?
Dr. Matteo Lanati ([email protected])
25th October 2016
LRZ, Distributed Resources Group, Matteo Lanati
Outline
● Introduction
● Update on last year's activity
● Budgeting for OpenNebula
● Why it is needed
● Rationale and ideas
● Current status
● Next steps
Leibniz Supercomputing Centre of the Bavarian Academy of Sciences and Humanities
● Scope:
– Munich
– Bavaria
– Germany
– Europe
– Worldwide
● Provision of traditional IT services
● High performance systems
SuperMUC
Phase 1 (2012)
● Westmere/Sandy Bridge
● > 155,000 cores
● > 3.0 PFlop/s total peak performance
Phase 2 (2015)
● Haswell
● > 86,000 cores
● > 3.5 PFlop/s total peak performance
https://www.lrz.de/services/compute/supermuc/
LRZ Compute Cloud: OpenNebula setup
[Architecture diagram. Labels:]
● Worker node 1 ... Worker node 88: 88 physical nodes, 736 cores / 7.5 TB RAM
● VMware high availability: 8 cores / 32 GB RAM
● NetApp NAS: 300 TB
● Datastore, System store 1 ... System store 10
● SSH commands, monitoring probes
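Update on last year's activity
Our user base
● March 2015 – October 2015: 200 accounts
● October 2015 – October 2016: 250 new accounts
[Two pie charts of accounts by institution. Slice labels in extraction order; some slices are missing:]
● LRZ 28%, Other 15%, Math / CS 28%, Mech. Eng. 12%
● Other 40%, LRZ 19%, Math / CS 15%, Bio 10%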
Update on last year's activity
Resource usage: computation
● March 2015 – October 2015: computation ~1 M CPU-hours, storage ~10 TB
● October 2015 – October 2016: computation ~3 M CPU-hours, storage ~30 TB

[Two pie charts of CPU-hours by institution, matched to the periods by their totals]

October 2015 – October 2016
Institution   CPU-hours   Share
Math / CS     1.2 M       41%
Geo           496 K       17%
Mech. Eng.    459 K       15%
Other         443 K       14%
LRZ           191 K       6%

March 2015 – October 2015
Institution   CPU-hours   Share
Geo           396 K       35%
Other         219 K       20%
Math / CS     128 K       12%
Physics       108 K       10%
LRZ           88 K        8%
Mech. Eng.    64 K        6%
What budgeting means
Goal: efficient use of resources (i.e., few idle VMs)
Manage the lifetime of a group of VMs according to:
● Number of cores (Nc)
● RAM (Mem)
● Datastore space (Ds)
● IPs
● Time
Cost function
(A * Nc + B * Mem + C * Ds + D * IPs) * <running time>
What budgeting means
A concrete proposal for the cost factors
0.01 * Nc * <hours> + 0.001 * Mem * <hours> + 0.01 * Ds * <months> + 0.50 * IPpublic * <months> + 0.10 * IPprivate * <months>

Item                  Time period   Cost
Core                  Hour          0.01 €
GB of RAM             Hour          0.001 €
GB in image store     Month         0.01 €
Public IP             Month         0.50 €
Private (campus) IP   Month         0.10 €
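
The proposed price list can be turned into a small calculator. The sketch below is only an illustration: the rates come from the table above, while the names `Prices` and `vm_cost` and the example VM are assumptions and not part of the LRZ implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Rates from the proposal, in euro."""
    core_hour: float = 0.01
    ram_gb_hour: float = 0.001
    image_gb_month: float = 0.01
    public_ip_month: float = 0.50
    private_ip_month: float = 0.10


def vm_cost(cores, ram_gb, image_gb, public_ips, private_ips,
            hours, months, prices=Prices()):
    """Cost in euro: cores and RAM are billed per hour,
    image store space and IP addresses per month."""
    hourly = (prices.core_hour * cores + prices.ram_gb_hour * ram_gb) * hours
    monthly = (prices.image_gb_month * image_gb
               + prices.public_ip_month * public_ips
               + prices.private_ip_month * private_ips) * months
    return hourly + monthly


# Hypothetical VM: 4 cores, 8 GB RAM, 20 GB image, one public IP,
# running for 720 hours within one billing month.
example = vm_cost(cores=4, ram_gb=8, image_gb=20, public_ips=1,
                  private_ips=0, hours=720, months=1)
print(f"{example:.2f} EUR")  # prints "35.26 EUR"
```

With these rates the hourly terms dominate: the example VM costs 34.56 € for cores and RAM and only 0.70 € for storage and the public IP.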
Why budgeting
Use cases
● Computational bursts
– 200 to 400 cores for a few weeks to a few months
● Multitenancy inside a group / project
– Support students' training activities
– Important feature: avoid budget overflow
● Resource management and planning
– To help us decide how and in which direction to grow
Budgeting: the big plan
Hardware Classes
● Regular
– Paid for by LRZ
● Reserved
– Brought in by the user
– Exclusive access
User Classes
● Normal (uninterruptible)
– No guarantees on start time
● Privileged (golden)
– Immediate start
Budgeting: the big plan (continued)
Hardware and user classes as above, plus:
Usage optimisation
● Opportunistic
● Interruptible
Budgeting: possible implementation
Hardware Classes
● Regular
● Reserved
– Permission/ownership
– Scheduling requirements
– Scheduler
User Classes
● Selected in the template
● Possible customisation of the GUI
Budgeting: important features
● Prepaid model
– Avoid budget overflow
– Mitigation in case the budget is exceeded => undeploy VMs
● External implementation
– Split the budget management from the sysadmin view
– Easier to use the cost function to run a prediction model
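
The prepaid model pairs naturally with a simple forecast: for a fixed set of running VMs the cost function gives a constant burn rate, so the remaining budget translates directly into remaining hours. The following is a minimal sketch under that linear assumption, using only the hourly core and RAM rates from the proposal (0.01 € and 0.001 €). The names `burn_rate` and `hours_until_exhausted` are hypothetical, and this is not the prediction model referred to in the slides.

```python
import math

CORE_HOUR_EUR = 0.01     # per core and hour, from the proposed price list
RAM_GB_HOUR_EUR = 0.001  # per GB of RAM and hour, from the proposed price list


def burn_rate(vms):
    """Euro per hour for a list of (cores, ram_gb) tuples."""
    return sum(CORE_HOUR_EUR * cores + RAM_GB_HOUR_EUR * ram_gb
               for cores, ram_gb in vms)


def hours_until_exhausted(remaining_budget_eur, vms):
    """Hours left before the prepaid budget reaches zero,
    assuming the set of running VMs does not change."""
    rate = burn_rate(vms)
    if rate == 0:
        return math.inf  # nothing is running, so nothing is consumed
    return max(remaining_budget_eur, 0.0) / rate


# Hypothetical group: one 4-core / 8 GB VM and one 8-core / 16 GB VM.
group = [(4, 8), (8, 16)]
print(round(hours_until_exhausted(144.0, group)))  # prints 1000
```

A forecast like this could be used to warn a group before the budget runs out, instead of reacting only once VMs have to be undeployed.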
Budgeting: the implementation so far
[Workflow diagram. Labels:]
● VM submission, VM running, VM undeployed
● Hook script
● Cron jobs
● ONE DB
● Budget thresholds
● Budget consumption: <# cores> * <running time>
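
One way to read the diagram is that a periodic job sums each group's consumption as <# cores> * <running time> and compares it with the group's budget thresholds. The sketch below illustrates such a check on plain Python data. It does not call OpenNebula, and the names `Vm` and `check_group` and the warning fraction of 0.8 are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Vm:
    vm_id: int
    cores: int
    running_hours: float


def consumption(vms):
    """Budget consumption in core-hours: <# cores> * <running time>."""
    return sum(vm.cores * vm.running_hours for vm in vms)


def check_group(vms, budget_core_hours, warn_fraction=0.8):
    """Return an action and the IDs of the VMs it applies to.

    "undeploy" once the budget is used up, "warn" above the
    (assumed) warning threshold, "ok" otherwise.
    """
    used = consumption(vms)
    if used >= budget_core_hours:
        return "undeploy", [vm.vm_id for vm in vms]
    if used >= warn_fraction * budget_core_hours:
        return "warn", []
    return "ok", []


# Hypothetical group: 4 cores for 10 h plus 2 cores for 5 h = 50 core-hours.
group = [Vm(1, 4, 10.0), Vm(2, 2, 5.0)]
print(check_group(group, budget_core_hours=60))  # prints ('warn', [])
```

In a real deployment the VM list would come from the OpenNebula database and the "undeploy" action would be carried out by the cron job; both steps are left out here.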
Next Steps
● Update to ONE 5.0.x
● Upgrade the hardware
● Focus on the security of VMs
– LRZ Security Scanner (LSS)
– Detect weak passwords
– Identify vulnerabilities
Thank you for your attention