Open Cloud Testbed: Developing a Testbed for the Research ......Open Cloud Testbed: Developing a...

Preview:

Citation preview

Open Cloud Testbed: Developing a Testbed for the Research Community Exploring

Next-Generation Cloud Platforms

Michael Zink, David Irwin, UMass Amherst Orran Krieger & Martin Herbordt, Boston University,

Miriam Leeser & Peter Desnoyers, Northeastern University

What is MGHPCC?

What is Mass Open Cloud?

• Vision Statement“To create a self-sustaining at-scale public cloud based on the Open Cloud eXchange model… a marketplace for industry partners as well as a place for researchers and industry to innovate and expose innovation to users.”

• Project Overview and Goals– At-scale efficient production cloud for broad set of applications– Create and Deploy the OCX model– Testbed for research, open source developers, companies

Motivation I

● Cloud computing plays an important role in supporting most software we use in our daily lives

● Critical for enabling research into new cloud technologies (see demand for CloudLab and Chameleon)

● Demand for cloud testbeds often higher than available resources

Motivation II

● CISE researchers want to study users that are not CISE● MOC supports

○ real users○ access to real data sets○ can provide traces of real usage○ can allow services to be exposed to end-users (e.g., TTP)○ has access to production services at scale (e.g., NESE)○ infrastructure and services provided by industry partners

Research "in" the MOC

Motivation III

● CloudLab supports○ Large community (nationwide) of systems researchers○ Tools to configure experimental slices (a combination of bare

metal nodes and networking resources)○ Hard isolation from other users/experiments○ Profiles to describe hard- and software to build a cloud○ Designed specifically for reproducible research○ Software stack is open source○ Federation with other testbeds (e.g., GENI, FABRIC)

● Scientific infrastructure for cloud research

● Three clusters (Utah, Wisconsin, Clemson, and MGHPCC), which offer 15,000+ cores○ Each cluster has a different focus: storage and networking (using hardware from

Cisco, Seagate, and HP), high-memory computing (Dell), and energy-efficient computing (HP).

● A testbed for research and experimentation into new cloud platforms

● Combine proven software technologies and reproducibility features with a real production cloud

● Enhanced with programmable hardware (FPGA) capabilities; bump-in-the-wire (BITW); ~30 nodes

Open Cloud Testbed

OCT Approach

OCT Approach

OCT Approach

Research "in" the MOC

FPGAs in OCT

Research "in" the MOC

Open Cloud Testbed Concept

What’s new?● FPGA’s as reconfigurable compute and Bump-in-the-Wire

(BITW) fully accessible by users

What’s new?● FPGA’s as Bump-in-the-Wire (BITW) fully accessible by users● Make CloudLab dynamically scalable by adding and removing

third-party resources

What’s new?● FPGA’s as Bump-in-the-Wire (BITW) fully accessible by users● Make CloudLab dynamically scalable by adding and removing

third-party resources● Transfer new cloud mechanisms to production cloud (MOC)

What’s new?● FPGA’s as Bump-in-the-Wire (BITW) fully accessible by users● Make CloudLab dynamically scalable by adding and removing

third-party resources● Transfer new cloud mechanisms to production cloud (MOC)● Access to storage, data sets, cloud telemetry

What’s new?● FPGA’s as Bump-in-the-Wire (BITW) fully accessible by users● Make CloudLab dynamically scalable by adding and removing

third-party resources● Transfer new cloud mechanisms to production cloud (MOC)● Access to storage, data sets, cloud telemetry● Collaboration with industry (e.g.,180 nodes from Two Sigma)

What’s new?● FPGA’s as Bump-in-the-Wire (BITW) fully accessible by users● Make CloudLab dynamically scalable by adding and removing

third-party resources● Transfer new cloud mechanisms to production cloud (MOC)● Access to storage, data sets, cloud telemetry● Collaboration with industry (180 nodes from two sigma)● Usage of certain parts of CloudLab by users outside research

community

New Hardware

● Original plan: Add 10 new nodes to existing Mass CloudLab cluster● Two Sigma donation of ~200, 2-year old servers to MOC made us change

plan:○ Mass CloudLab (19 additional R630 => 380 additional cores)

○ 3 additional racks of servers (R630, R620, and R720/R730):■ ~ 1600 cores■ Part of CloudLab■ Part of MOC■ Can be used by industry and foundations

Challenges

● FPGAs● Out-of-band management● Network isolation● Community buy-in

FPGAs

● 15 in 2020 & 15 in 2022● Queried potential users of FPGAs (~20 beta users)

○ Cloud and Operating System

○ Middleware

○ FPGA systems

○ FPGA tools

○ Provider applications

○ Tenant applications

FPGAs

● Will most likely start with two models:○ Xilinx Alveo U280○ Intel D5005

● Toolchain● Implications on networking:

○ Rate limit switch

Out-of-band Management

● Whoever controls OBM controls server● OBM proxy will manage control● OBM interfaces with ESI

ESI

ESI Approach: long term● ESI controls access to servers and switches● CloudLab will have drivers for OBM, switch control, console, for servers

allocated● Scripts for admins to transfer nodes to/from CloudLab● Use Keylime for attesting nodes provided back to production:

○ MOC

○ NERC

○ HPC Clusters

● Eventually enable stateless CloudLab nodes - save and restore experiments

● Original plan was to do this; CloudLab not elastic until complete...

ESI Approach: new short term● Give CloudLab & ESI control software direct access to servers and

switches● Each will have in its inventory all servers; build simple mechanism to

allocate to NULL project servers used by other side● Cons: is unsecure, credentials for server out-of-band management and

switch management have to be shared● Pros: straight forward; we will be able to have an elastic cloud lab mid

year; can incrementally add drivers for OBM, console, network...

Network Isolation

AL2S● Guaranteeing QoS in the network is hard● Options:

○ Traffic shape at the server (tricky to enforce)○ Rate limit ports at switches (what is the impact on TCP?)

● Overprovisioning helps but does not guarantee isolation

Community buy-in

● This is where you come into the picture!● How can OCT support your systems research?● We need your feedback!!!● Good news:

○ CloudLab community can use it from day one

○ MOC community can use it from day one

Core Team

CNS-1925464

Recommended