
The CERN Cloud Computing Project


Page 1: The CERN Cloud Computing Project

Copyright © 2010 Platform Computing Corporation. All Rights Reserved.

The CERN Cloud Computing Project
William Lu, Ph.D.
Platform Computing

Page 2: The CERN Cloud Computing Project

LHC Computing Hierarchy
(Diagram credit: Markus Schulz, CERN. Emerging Vision: A Richly Structured, Global Dynamic System)

• Experiment and Online System: ~PByte/sec of raw detector data; ~100-1500 MBytes/sec written out to the CERN Center (a rough check of these figures follows below)
• Tier 0 +1 (CERN Center): PBs of disk plus a tape robot; tens of petabytes by 2010, an exabyte ~5-7 years later
• Tier 1 (FNAL, IN2P3, INFN, RAL centers): 10 Gbps and 2.5-10 Gbps wide-area links
• Tier 2 centers: ~2.5-10 Gbps links
• Tier 3 (institutes) and Tier 4 (workstations): physics data cache over 0.1 to 10 Gbps
• CERN/Outside resource ratio ~1:2; Tier0 / (Σ Tier1) / (Σ Tier2) ~ 1:1:1
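As a back-of-envelope check on the numbers above (my own arithmetic, not from the slide), the quoted link speeds and data volumes can be related as follows; the only inputs are the figures in the hierarchy.

```python
# Back-of-envelope arithmetic relating the quoted rates and volumes (illustrative only).

PB = 1e15                          # bytes in a petabyte (decimal)

# A 10 Gbps wide-area link moves about 1.25 GB/s.
link_bytes_per_s = 10e9 / 8

# Shipping one petabyte from Tier 0 to a Tier 1 centre over a single such link:
days = PB / link_bytes_per_s / 86400
print(f"1 PB over a 10 Gbps link: ~{days:.1f} days")       # ~9.3 days

# The detector produces ~1 PB/s, but only ~100-1500 MB/s is written out,
# so the online system keeps roughly one part in a million of the raw data.
reduction = 1e15 / 1.5e9
print(f"Online data reduction: ~1 in {reduction:.0e}")      # ~1 in 7e+05
```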

Page 3: The CERN Cloud Computing Project

Environment

Computers:
• 40,000 CPU cores used by multiple experiments

Storage:
• Disks + tapes
• The storage management system (CASTOR) is tightly integrated with workload management (Platform LSF) (see the submission sketch below)

Software:
• Applications: open source and home grown
• OS: Scientific Linux and other Linux distributions
• VMs: open-source Xen and KVM
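For context on how work enters this environment: jobs are submitted to Platform LSF, which drives the batch farm described above. The sketch below shows what a submission might look like when driven from Python; the queue name, resource string, and payload script are hypothetical examples, not CERN's actual configuration.

```python
# Hypothetical sketch: submitting a batch job to Platform LSF from Python.
# Queue name, resource requirement, and payload script are invented for illustration.
import subprocess

cmd = [
    "bsub",                     # LSF job submission command
    "-q", "atlas_prod",         # hypothetical per-experiment queue
    "-n", "8",                  # number of job slots requested
    "-R", "rusage[mem=2000]",   # per-slot memory requirement in MB
    "-o", "job_%J.out",         # stdout file; %J expands to the LSF job ID
    "./run_analysis.sh",        # hypothetical experiment payload
]
subprocess.run(cmd, check=True)
```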

Page 4: The CERN Cloud Computing Project

Challenges

IT serves users manually
• User requests for resources, OS, software stack, etc. are handled manually, which is slow

Users circumvent scheduling policies
• Users are not satisfied with the centrally managed scheduling policies because of their unique needs
• They submit a pilot job to occupy resources, then run scripts to prepare the application environment and schedule jobs within the reserved block; this causes low resource utilization (see the sketch below)

Legacy application issues
• Legacy applications need a legacy OS, which does not run on the latest hardware
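To make the pilot-job problem concrete, here is a minimal sketch (not CERN's or any experiment's actual code) of what such a pilot amounts to: a batch job that holds its slots for a long time and pulls work from the experiment's own queue, so LSF sees one continuously busy job even while the slots sit idle.

```python
# Illustrative sketch of the pilot-job pattern described above.
# fetch_next_task() is a hypothetical stand-in for the experiment's own task queue.
import subprocess
import time

def fetch_next_task():
    """Ask the experiment's private scheduler for the next command to run (stub)."""
    return None   # placeholder: no real task source here

def pilot_main():
    deadline = time.time() + 24 * 3600      # occupy the batch slot for a whole day
    while time.time() < deadline:
        task = fetch_next_task()
        if task is None:
            time.sleep(60)                   # idle, but still "running" as far as LSF can tell
            continue
        subprocess.run(task, check=False)    # run real work inside the reserved slot

if __name__ == "__main__":
    pilot_main()
```

Because LSF only sees one long-running job, it cannot reclaim the idle slots or apply its scheduling policies to the work running inside the pilot, which is the low resource utilization the slide refers to.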

Page 5: The CERN Cloud Computing Project

Virtualization

Batch virtualization requirements: how to
• Isolate the application environment
• Increase security
• Automate resource provisioning and management
• Keep management practices scalable

Page 6: The CERN Cloud Computing Project

Solution

Platform ISF + Platform ISF Adaptive Cluster
• Integration with Platform LSF to provision VMs based on workload
• Integration with the Quattor provisioning system
• Each experiment is able to schedule its own VM clusters with a unique application environment
• VM cluster capacity is elastic based on workload (a conceptual sketch of such a pool follows below)
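The slide names the products but not their configuration; the following is only a conceptual model, with invented field names and values, of what a per-experiment elastic VM pool carries: the experiment, the image holding its application environment, and the elastic capacity bounds.

```python
# Conceptual model of a per-experiment VM resource pool (illustrative field names,
# not Platform ISF's actual schema).
from dataclasses import dataclass

@dataclass
class VMPool:
    experiment: str        # one pool per experiment
    vm_image: str          # image carrying that experiment's application stack
    min_vms: int           # floor on pool size, set by the HPC administrator
    max_vms: int           # ceiling on pool size, set by the HPC administrator
    current_vms: int = 0   # VMs currently provisioned

# Hypothetical example pools:
pools = [
    VMPool("ATLAS", "slc5-atlas-worker", min_vms=10, max_vms=200),
    VMPool("CMS",   "slc5-cms-worker",   min_vms=10, max_vms=150),
]
```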

Page 7: The CERN Cloud Computing Project

How It Works

1. The HPC administrator sets up VM resource pools in Platform ISF, one for each experiment, drawn from a shared pool of resources.
2. The administrator also sets a minimum and maximum number of VMs for each pool.
3. A user submits a workload to Platform LSF that cannot be met by their experiment's VM resource pool.
4. Platform ISF Adaptive Cluster (AC) interacts with Platform ISF to adjust the size of the resource pool; the diagram also shows an external provider as an additional source of capacity. (A sketch of this resize decision follows below.)
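A minimal sketch of the decision behind steps 3 and 4: compare the pending workload against a pool's capacity and grow or shrink within the administrator's bounds. The function names, thresholds, and jobs-per-VM figure are assumptions made for illustration; the real policy lives inside Platform ISF Adaptive Cluster.

```python
# Illustrative resize policy for one experiment's VM pool (steps 3 and 4 above).
# All names and numbers are hypothetical; Platform ISF AC implements the real logic.

def desired_pool_size(active_jobs: int, min_vms: int, max_vms: int,
                      jobs_per_vm: int = 8) -> int:
    """How many VMs the pool should have for the current queued + running jobs."""
    needed = -(-active_jobs // jobs_per_vm)          # ceiling division
    return max(min_vms, min(max_vms, needed))        # clamp to the admin's bounds

def reconcile(pool: str, active_jobs: int, current_vms: int,
              min_vms: int, max_vms: int) -> None:
    target = desired_pool_size(active_jobs, min_vms, max_vms)
    if target > current_vms:
        print(f"{pool}: grow by {target - current_vms} VMs (max {max_vms})")
    elif target < current_vms:
        print(f"{pool}: shrink by {current_vms - target} VMs (min {min_vms})")
    else:
        print(f"{pool}: no change")

# Busy period: 400 jobs against 40 VMs -> grow by 10 (to 50).
reconcile("ATLAS", active_jobs=400, current_vms=40, min_vms=10, max_vms=200)
# Quiet period: 16 jobs against 40 VMs -> shrink by 30 (back towards the minimum of 10).
reconcile("ATLAS", active_jobs=16,  current_vms=40, min_vms=10, max_vms=200)
```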

Page 8: The CERN Cloud Computing Project

Results

Increased user service level
• Each experiment can control its own application stack and resource allocation policies

Redeploy servers quickly and efficiently

Reduced cost and power consumption
• Batch compute servers are shared with data management and database servers

Automated administration
• Enables the installation to scale

No hypervisor lock-in
• Freedom to choose among multiple VM hypervisors