VC3: Virtual Clusters for Community Computation


  • VC3: Virtual Clusters for Community Computation

    Douglas Thain, University of Notre Dame
    Rob Gardner, University of Chicago

    John Hover, Brookhaven National Lab

  • You have developed a large-scale workload that runs successfully on a university cluster.

    Now, you want to migrate and expand that application to national-scale infrastructure. (And allow others to easily access and run similar workloads.)

    Target platforms: a traditional HPC facility, a distributed HTC facility, or a commercial cloud.

  • IceCube Simulation DAG

    Diagram: a Signal Generator and a Background Generator feed four Photon Propagator tasks; their outputs pass through five Detector tasks, then five Filter tasks, and finally a single Cleanup step. The generators and the downstream Detector/Filter/Cleanup stages run on CPUs, while the Photon Propagators run on GPUs.
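
    The DAG's shape can be sketched in a few lines of Python; the task names and fan-out mirror the diagram above, while the edges and the dispatch loop are purely illustrative and not IceCube's production tooling.

    # Illustrative only: the DAG above as a dependency table plus a level-by-level
    # dispatch loop. Task names mirror the slide; actual commands are omitted.
    dag = {"signal": [], "background": []}                       # generators (CPU)
    for p in range(4):
        dag["photon_%d" % p] = ["signal", "background"]          # propagators (GPU)
    for d in range(5):
        dag["detector_%d" % d] = ["photon_%d" % p for p in range(4)]
        dag["filter_%d" % d] = ["detector_%d" % d]
    dag["cleanup"] = ["filter_%d" % d for d in range(5)]         # final CPU step

    def ready(done):
        """Tasks whose dependencies have all completed."""
        return [t for t, deps in dag.items()
                if t not in done and all(d in done for d in deps)]

    done = set()
    while len(done) < len(dag):
        level = ready(done)
        print("dispatch:", sorted(level))   # each level maps to CPU or GPU workers
        done.update(level)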

  • CMS Data Analysis w/Lobster

    Anna Woodard, Matthias Wolf, et al., Scaling Data Intensive Physics Applications to 10k Cores on Non-Dedicated Clusters with Lobster, IEEE Conference on Cluster Computing, September, 2015.

    Diagram: the Lobster master application sits on top of the Work Queue master library, driving a submit/wait loop over local files and programs. Tasks flow from the master through a layer of Foremen to pools of 16-core Workers drawn from non-dedicated resources.

    http://ccl.cse.nd.edu/research/papers/lobster-cluster-2015.pdf
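
    The submit/wait pattern at the heart of this picture comes from the Work Queue master library. A minimal sketch using the CCTools work_queue Python bindings follows; the command, file names, and port are placeholders rather than Lobster's actual configuration.

    # Minimal Work Queue submit/wait loop (requires the CCTools Python bindings).
    # The program, input/output names, and port are placeholders.
    import work_queue as wq

    q = wq.WorkQueue(port=9123)              # workers or foremen connect here
    print("work queue listening on port", q.port)

    for i in range(100):
        t = wq.Task("./analyze input.%d > output.%d" % (i, i))
        t.specify_input_file("analyze")                  # ship the local program
        t.specify_input_file("input.%d" % i)
        t.specify_output_file("output.%d" % i)
        t.specify_cores(1)
        q.submit(t)

    while not q.empty():
        t = q.wait(5)                        # returns a completed task or None
        if t:
            print("task", t.id, "exited with status", t.return_status)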

  • The Perils of Workload Migration

    • Dynamic resource configuration and scaling:
      – # nodes, cores/node, RAM/core, disk, GPUs
    • OS expectations:
      – Ubuntu, Cray, Red Hat, Debian, etc.
    • Software dependencies:
      – Script languages, installed libraries, supporting tools…
    • Online service dependencies:
      – Batch systems, databases, web proxies, …
    • Network accessibility:
      – Addressability, incoming/outgoing, port ranges, protocols…
    • Storage configuration:
      – Local, global, temporary, permanent, home/project/tmp…

  • Can we make HPC more like cloud?

    • User cluster specification:
      – 50-200 nodes of 24 cores and 64GB RAM/node
      – 150GB local disk per node
      – 100TB shared storage space
      – 10Gb outgoing public internet access for data
      – CMS software 8.1.3 and python 2.7
      – Running Condor or Spark or Makeflow . . .

    • Of course, we cannot unilaterally change other computing sites!
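
    Written down concretely, such a request might look like the sketch below; the field names are invented for illustration and are not a fixed VC3 schema.

    # Illustrative cluster specification; field names are hypothetical.
    cluster_spec = {
        "nodes": {"min": 50, "max": 200},
        "cores_per_node": 24,
        "ram_per_node_gb": 64,
        "local_disk_per_node_gb": 150,
        "shared_storage_tb": 100,
        "outgoing_bandwidth_gbps": 10,
        "software": ["cms-8.1.3", "python-2.7"],
        "middleware": "condor",              # or "spark", or "makeflow"
    }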

  • So, that means containers and VMs? Not necessarily.

    VMs and containers are great, and we will use them where needed, but:

    1) Not all sites deploy them.
    2) We want to use native hardware (and software) whenever possible.

  • Concept: Virtual Cluster

    • 200 nodes of 24 cores and 64GB RAM/node
    • 150GB local disk per node
    • 100TB shared storage space
    • 10Gb outgoing public internet access for data
    • CMS software 8.1.3 and python 2.7

    Diagram: a Virtual Cluster Service takes this specification and directs Virtual Cluster Factories to deploy services on a traditional HPC facility, a distributed HTC facility, and a commercial cloud, assembling the pieces into a single virtual cluster.

  • Project Status and Structure

    • Just getting started; funding began June 2016.
    • First milestone for the PI meeting today:
      – A VC across three sites at UC/ND runs IceCube.

    Team:
      – Notre Dame: CSE (Douglas Thain, Ben Tovar), CMS (Kevin Lannon, Michael Hildreth, Kenyi Hurtado), CRC (Paul Brenner)
      – Chicago: Robert Gardner, Lincoln Bryant, Benedikt Riedel
      – Brookhaven: John Hover, Jose Caballero

  • VC3 Architecture

    Diagram: the user portal submits a cluster spec to a VC3 Service Instance, which consults a software catalog and a site catalog ("Create a virtual cluster!"). The service drives a VC3 Pilot Factory that submits pilots to the batch systems of the resource providers. The pilots join a middleware scheduler as middleware (MW) nodes, and the end user accesses the resulting virtual cluster through its head node.

  • Teardown is Critical!

    Diagram: when the user asks the portal to destroy the virtual cluster, the VC3 Service Instance directs the pilot factory to retire its pilots from each provider's batch system and shuts down the middleware scheduler and its MW nodes.

  • Teardown is Critical! (continued)

    Diagram: the same architecture after teardown completes: the pilots and MW nodes are gone, leaving only the service instance, its catalogs, the pilot factory, and the providers' own batch systems.
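
    One way to make teardown hard to forget is to scope the cluster's lifetime, as in the sketch below; the virtual_cluster helper and the service/vc method names are hypothetical, not the VC3 API.

    # Hypothetical sketch: tie the cluster lifetime to a context manager so that
    # pilots, middleware, and allocations are always released, even on failure.
    from contextlib import contextmanager

    @contextmanager
    def virtual_cluster(service, spec):
        vc = service.create(spec)            # portal request -> service instance
        try:
            vc.deploy_pilots()               # factory submits pilots to each provider
            vc.start_middleware()            # e.g., bring up a Condor pool on the pilots
            yield vc
        finally:
            vc.stop_middleware()             # drain the MW nodes
            vc.retire_pilots()               # remove pilots from provider batch systems
            service.destroy(vc)              # release the allocation: teardown is critical

    # Usage sketch:
    # with virtual_cluster(vc3_service, cluster_spec) as vc:
    #     vc.run(workload)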

  • Inherent Challenges

    • Portal -> Service Instance:
      – Reliability, specification, collaboration, discoverability, lifecycle management.
    • Cluster Factory:
      – Configuration, impedance matching, response to outages, right-sizing to the workload, authentication, cost management.
    • Environment Construction:
      – Specification complexity and portability, detection of existing environments, environment sharing, resource consumption.
    • Performance Management:
      – Want small easy, big possible. Matching HW capability to middleware deployment. Environment compatible with manycore, GPU, FPGA.
    • Site Management:
      – Work with the site owners, not against them. Collect relevant configuration data. Make VC deployment transparent to sites.

  • Changing Technology Landscape

    • Resource Management Systems:
      – Condor, PBS, SLURM, Cobalt, UGE, Mesos, ???
    • User Interests in Middleware:
      – Workflows, GlideInWMS, PanDA, Hadoop, Spark, ????
    • Software Deployment Technologies:
      – VMs -> LXC -> Docker -> Singularity -> ???
      – CVMFS, Tarballs, NixOS, Spack, ???
    • Access to Resources:
      – Old way: SSH+Keys. New way: Two-Factor Auth.
    • Our approach:
      – Pick a place to stand, but keep specific technologies at arm's length and be prepared to change.

  • Prototype Implementation

    • Portal -> Service Instance:
      – (under construction)
    • Pilot Job Factory:
      – AutoPyFactory (APF) from BNL
      – SSH/BOSCO to connect to resource providers
    • Pilot Job and Environment Deployment:
      – Local software install via tarballs + PATH. (Groundwerk)
      – Access CVMFS via FUSE or Parrot, whichever is available.
    • User Visible Middleware:
      – Condor batch system (user-level “glide-in”)
    • Application:
      – IceCube data analysis
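
    A rough sketch of the pilot-side bootstrap described above: install software from a tarball, extend PATH, and prefer a FUSE-mounted /cvmfs with Parrot as the fallback. The tarball URL, repository path, and application command are placeholders, not the actual VC3 pilot.

    # Sketch of a pilot-side environment bootstrap; URLs and commands are placeholders.
    import os, subprocess, tarfile, urllib.request

    sandbox = os.path.abspath("vc3-sandbox")
    os.makedirs(sandbox, exist_ok=True)

    # 1. Local software install via tarball + PATH.
    tarball = os.path.join(sandbox, "env.tar.gz")
    urllib.request.urlretrieve("https://example.org/vc3/env.tar.gz", tarball)
    with tarfile.open(tarball) as tf:
        tf.extractall(sandbox)
    os.environ["PATH"] = os.path.join(sandbox, "bin") + os.pathsep + os.environ["PATH"]

    # 2. Use CVMFS via FUSE if the site mounts it; otherwise intercept with Parrot.
    job = ["run_icecube_task"]                              # placeholder command
    if os.path.isdir("/cvmfs/icecube.opensciencegrid.org"): # FUSE mount present
        cmd = job
    else:
        cmd = ["parrot_run"] + job                          # user-level interception

    subprocess.run(cmd, check=True)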

  • Key Idea:

    Specify requirements in the abstract. Deliver requirements by matching or creating, or both.*

    * (Only works if you can characterize requirements very accurately.)
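
    A toy version of "deliver by matching or creating": compare the abstract requirements against a site-catalog entry and report what the site satisfies natively versus what the pilot must create. The field names and values are invented for illustration.

    # Toy "match or create" planner; catalog fields are illustrative only.
    requirements = {"cores_per_node": 24, "ram_per_node_gb": 64,
                    "local_disk_gb": 150, "software": {"cms-8.1.3", "python-2.7"}}

    site = {"cores_per_node": 28, "ram_per_node_gb": 128,
            "local_disk_gb": 100, "software": {"python-2.7"}}

    def plan(req, site):
        matched, create = [], []
        for key, want in req.items():
            have = site.get(key)
            if isinstance(want, set):                   # software: install whatever is missing
                missing = want - (have or set())
                (create if missing else matched).append((key, missing or have))
            elif have is not None and have >= want:     # capacities: site must meet or exceed
                matched.append((key, have))
            else:
                create.append((key, want))
        return matched, create

    matched, create = plan(requirements, site)
    print("matched by site:", matched)
    print("must be created:", create)

    As the footnote warns, such a plan is only as good as the requirement characterization: anything missing from the requirements is silently assumed to match.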

  • Tour of First Milestone Prototype:

    • Application: IceCube Simulation
    • Environment Creation: VC3-Pilot
    • Cluster Factory: AutoPyFactory

  • IceCube Software and Jobs

    • Experiment-specific software stack:
      – Dependencies not typical for a particle physics experiment: Boost, HDF5, SuiteSparse, cfitsio, etc.
      – Distributed mostly through the CVMFS global filesystem now; tarballs are still used in edge cases, and containers are an issue.
      – Moving to shipping a C++11-compliant environment (own compiler, etc.)
    • Heavily invested in GPU accelerators.
    • Average job: 2-4 GB RAM, 10 GB disk, 2 hour wall time.
    • Tail-end job: 6+ GB RAM, 100 GB disk, 10s to 100s of hours.
    • Need to record all details about a job, forever: job configuration, where did it run, resource usage, efficiency, etc.
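
    The last requirement, recording every job's details permanently, can be sketched as an append-only log of JSON records; the fields follow the list above, but the format itself is illustrative.

    # Illustrative append-only job record: configuration, where it ran,
    # resource usage, and a derived CPU efficiency.
    import json, time

    def record_job(logfile, job_id, site, config, usage):
        entry = {
            "job_id": job_id,
            "timestamp": time.time(),
            "site": site,
            "config": config,
            "usage": usage,
            "cpu_efficiency": usage["cpu_time_s"] / (usage["cores"] * usage["wall_time_s"]),
        }
        with open(logfile, "a") as f:        # append-only: records are never rewritten
            f.write(json.dumps(entry) + "\n")

    record_job("jobs.log", "sim-00042", "nd-crc",
               config={"dataset": "example", "gpu": False},
               usage={"cores": 1, "ram_gb": 3.2, "disk_gb": 9.5,
                      "wall_time_s": 7200, "cpu_time_s": 6800})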

  • CVMFS Global Filesystem

    Diagram: software releases (e.g., the CMS software stack: 967 GB, 31M files) are built into content-addressable storage (CAS) published by a www server. A HEP task accesses the repository through the CVMFS driver, mounted via FUSE or intercepted by Parrot; metadata and data objects are fetched with HTTP GET through a hierarchy of squid proxies and kept in a local CAS cache.

    http://cernvm.cern.ch/portal/filesystem
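
    The content-addressable access pattern can be sketched as follows: each object is named by its hash, fetched with an HTTP GET through a site proxy, and kept in a local cache so repeated reads stay off the network. The repository URL, proxy address, and object layout below are placeholders, not the actual CVMFS wire format.

    # Sketch of content-addressable retrieval in the spirit of CVMFS; the URL
    # layout, proxy, and hash choice are placeholders.
    import os, hashlib, urllib.request

    REPO  = "http://cvmfs.example.org/data"      # placeholder repository URL
    PROXY = {"http": "http://squid.local:3128"}  # assumed site-local squid proxy
    CACHE = "cas-cache"

    opener = urllib.request.build_opener(urllib.request.ProxyHandler(PROXY))

    def fetch(object_hash):
        """Return the bytes of a content-addressed object, using the local cache."""
        path = os.path.join(CACHE, object_hash)
        if os.path.exists(path):                 # cache hit: no network traffic
            with open(path, "rb") as f:
                return f.read()
        url = "%s/%s/%s" % (REPO, object_hash[:2], object_hash[2:])
        data = opener.open(url).read()           # HTTP GET through the proxy
        assert hashlib.sha1(data).hexdigest() == object_hash, "content does not match its name"
        os.makedirs(CACHE, exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
        return data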

  • CVMFS + HPC Challenges

    • Need disk local to node (ideal) or site (ok) for local cache management. (Project: RAM $$$)
    • Need FUSE (ideal) to mount FS, otherwise use Parrot (ok) for user level interception.
    • Must have a local HTTP proxy, otherwise CVMFS becomes a denial of service attack.
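
    These constraints can be checked up front on each node before a pilot relies on CVMFS; the sketch below uses illustrative thresholds, paths, and environment variables.

    # Quick node-level checks for the constraints above: local cache space,
    # a FUSE-mounted /cvmfs (or Parrot as fallback), and a site HTTP proxy.
    import os, shutil

    def check_node(cache_dir="/tmp", min_cache_gb=20):
        free_gb = shutil.disk_usage(cache_dir).free / 1e9
        return {
            "cache":  free_gb >= min_cache_gb,                  # room for a local CVMFS cache
            "fuse":   os.path.isdir("/cvmfs"),                  # site already mounts CVMFS via FUSE
            "parrot": shutil.which("parrot_run") is not None,   # user-level fallback available
            "proxy":  bool(os.environ.get("http_proxy")),       # needed to avoid hammering the server
        }

    print(check_node())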