Scalable Elastic Systems Architecture (SESA)

Scalable Elastic System Architecture (SESA)

Dan Schatzberg, Boston UniversityJonathan Appavoo, Boston University

Orran Krieger, VMwareEric Van Hensbergen, IBM Research Austin

The goal

Perform more computation with fewer resources

Fixed Resources

• Hardware as a fixed resource

• Focus on reducing computation’s need for hardware resources

• Multiplex hardware resources for different computations

Elastic Resources

• Cloud Computing

• Pay as you go hardware

• Focus on providing hardware to the computation that requires it

Time to scale hardware

Days

Fixed Hardware

Minutes

Cloud Computing


Days

Fixed Hardware

Minutes

Cloud Computing

Elastic Applications


Days

Fixed Hardware

Minutes

Cloud Computing

Elastic Applications

Milliseconds

?

Interactive HPC

• Medical imaging application

• interactive

• 1 megapixel image

• quadratic memory consumption - ~14TB

Interactive HPC

• Fixed Hardware

• Purchase a cluster

Interactive HPC

• Cloud Computing

• Allocate a cluster

• Maintain interactivity

• 650+ EC2 instances - $8000 dollars / 8 hour day

Can we do better?

Where we’re starting

Treat elasticity as a first-class system characteristic

OUTLINE1. THE PROBLEM

2. OBSERVATIONS

1. Top-Down Demand

2. Bottom-Up Support

3. Modularity

3. OUR TAKE ON A SOLUTION

4. PROTOTYPE & CHALLENGES

Top-Down DemandSystem Interface

Hardware

Software


Hardware


Hardware


Hardware


Hardware


Hardware

Events as Load

• Treat a service request as an event that is dispatched to resources

• As events occur, load increases

• As events are handled, load decreases

• Each layer being event-driven forces demand to flow top-down

Bottom-Up SupportSystem Interface

Hardware


HardwareAllocate/Deallocate

Resources


Hardware

Elastic Interface

• Support elasticity by interfacing via allocation and deallocation of physical or logical resources

• Each layer is constructed by being explicit with respect to resource consumption

• Be explicit with respect to time to meet a request

ModularitySystem Interface

Hardware


Hardware


Hardware


Hardware

Object model

• Objects can take advantage

• the semantics of their request patterns

• the lifetime of an instance

• the occupancy w.r.t memory, processing and communication

• We can optimize for elasticity by taking advantage of modularity in a system

OUTLINE

1. THE PROBLEM

2. OBSERVATIONS

3. OUR TAKE ON A SOLUTION : SESA

1. EBB’s : Elastic Building Blocks

2. SEE: Scalable Elastic Executive - A LibOS

3. EPIC: Events as Interrupts


Architecture Overview

Hardware


SSD

FAWN


Partitioning

SSD

FAWN


SSD

FAWN

Kittyhawk


System Software

SSD

FAWN

Kittyhawk


SSD

FAWN

Kittyhawk

HAL

Architecture OverviewApplications

SSD

FAWN

Kittyhawk

HAL


SSD

FAWN

Kittyhawk

?HAL


SSD

FAWN

Kittyhawk

?HAL

SESA

SEHAL SEHAL SEHAL

VM/Node

Elastic Partition of Nodes

VM/Node

VM/Node

SEMachine

SEExecutiveSEE SEE SEE

EBB Namespace

SE APP/SERVICE

Partitioning Layer

System Software Layers

Hardware Abstraction Layer

LibOS Layer

Component Layer

EBB’s

EBB NameSpace

A new Component Model for expressing and

encapsulating fine grain elasticity.

The Next Generation of Clustered Objects.

inc()

val() dec()

Memory

Processors

Memory

Processors

c c c c c c c c

inc

val dec

Cinc

val dec

Cinc

val dec

Cinc

val dec

Cinc

val dec

Cinc

val dec

Cinc

val dec

Cinc

val dec

C

dref(ctr)->inc();

val Clustered Objects (CO)

What did we learn?

• Event-driven architecture for lazy and dynamic instantiation of resources

• Mechanism to create scalable software

Elastic Building Blocks

• Programming Model for Elastic and Scalable Components

• Span multiple nodes

• Built in On Demand nature -- encapsulation of policies for both allocation and deallocation of resources

SEE

SEExecutive

SEE SEE SEE

A Distributed Library OS Model designed to enable Elastic Software within the

context of legacy environments.

Next Generation of Libra

Libra

PartitionController

Partition Partition PartitionApplication Application Application

JVMApp

App

AppApp

App

App

AppDB

LibraLibra LibraOSPurposeGeneral

Hypervisor

Gateway

Figure 1. Proposed system architecture.

nels, hypervisors run other operating systems with few or no mod-ifications [27, 5, 42]. By running an operating system (the con-troller) in a hypervisor partition, applications can be migrated in-crementally to an exokernel system. In the first step of the migra-tion, the application runs in its own partition instead of as an oper-ating system process but accesses system services on the controllerremotely through the 9P distributed filesystem protocol [32]. Later,as needed, these relatively expensive remote calls are replaced byefficient library implementations of the same abstractions.Proponents of exokernels object to hypervisors because “[hy-

pervisors] confine specialized operating systems and associatedprocesses to isolated virtual machines...sacrificing a single view ofthe machine” [25]. However, in our approach, a controller parti-tion provides a single view of the machine: applications runningin other partitions correspond to processes at the controller. This istrue even for applications that run on other machines in a cluster.Section 2 explains how Libra achieves this unified machine view.This paper makes three main contributions:

• The hypervisor approach for transforming existing systems intohigh-performance, specialized systems.

• A case study in which Nutch, a large distributed application thatincludes a commercial JVM (IBM’s J9 [4]), is transformed intosuch a system.

• Examples of Nutch-specific optimizations that result in goodperformance.The rest of the paper is organized as follows. Section 2 explains

the overall architecture we are proposing. Section 3 describes theimplementation of Libra and our port of J9 to Libra. Details ofour port of Nutch to J9/Libra, including specializations that im-prove its throughput, are described in Section 4. Section 5 evaluatesJ9/Libra’s performance on various benchmarks, including Nutch,the standard SPECjvm98 [39] and SPECjbb2000 [38] benchmarks,and two microbenchmarks. Section 6 discusses related work. Fi-nally, Section 7 outlines future work and Section 8 concludes thepaper.

2. DesignThis section explains our proposed architecture, which is depictedin Figure 1. At the bottom of the software stack, a hypervisorhosts a controller partition and one or more application partitions.The hypervisor interacts directly with the hardware to provisionhardware resources among partitions, providing high-level mem-ory, processor, and device management.Above the hypervisor, a general-purpose operating system such

as Linux runs as a “controller” partition. The controller is the pointof administrative control for the system and provides a familiar en-

vironment for both users and applications. Each application par-tition is launched from the controller by a script that invokes thehypervisor to create a new partition and load the application into it.This script also launches a gateway server that permits the applica-tion to access the controller’s resources, services, and environment.The gateway server is an extended version of Inferno [31],

which is a compact operating system that can run on other operatingsystems. Inferno creates a private, file-like namespace that containsservices such as the user’s console, the controller’s filesystem, andthe network (see Figure 2). The application accesses this names-pace remotely via the 9P resource sharing protocol [30], which runsover a shared-memory transport established between the controllerand application partitions. A more detailed description of our ex-tensions to Inferno and a preliminary performance analysis of thetransport is available in another paper [20].Note that nothing in the architecture requires applications to

access all resources through the gateway, because the hypervisorallows resources and peripherals to be dedicated to an applicationpartition and accessed directly. Alternatively, applications can use9P to access resources across the network, either directly or throughthe gateway. Facilities for redundant resource servers, fail over, andautomated recovery have also been explored [16].As in an exokernel system, each application is linked with a

library operating system (libOS). However, unlike in exokernelsystems, the libOS focuses only on performance-critical services;other services are obtained from the controller through the gateway.For example, Libra, the libOS described in this paper, contains athread library implementation but accesses files remotely.This approach not only reduces the cost of developing a libOS

but also reduces the cost of administering new partitions. Becauseapplications share filesystems, network configuration, and systemconfiguration with the controller, administering an additional ap-plication partition is cheaper than administering an additional op-erating system partition. Also, because 9P gateways run as the userwho launched the application, they inherit the same permissionsand limitations, taking advantage of existing security mechanismsand resource policies.Finally, because we use a hypervisor instead of an exokernel to

multiplex system resources, applications can use privileged execu-tion modes and instructions. This enables optimizations and easesmigration, because a libOS can be simply a pared-down, general-purpose operating system. The combination of supervisor-mode ca-pability and 9P for access to remote services is flexible: applicationpartitions can be like traditional operating systems, self-containedwith statically-allocated hardware resources; like microkernel ap-plications, with system services spread across various protectiondomains; or like a hybrid of the two architectures that makes sensefor the particular workload being executed.

3. J9/Libra ImplementationThis section describes the implementation of Libra and the port ofthe J9 [4] virtual machine to this new platform. It was a somewhatatypical porting process, since we were simultaneously portingJ9 to the Libra abstractions while designing and implementingnew Libra abstractions to support J9’s execution. We begin byintroducing the relevant aspects of J9 and briefly describing ourporting and debugging methodology. Next, we discuss the majorLibra subsystems needed by J9. Finally we highlight the majorlimitations of our current implementation.

3.1 J9 OverviewJ9 is one of IBM’s production JVMs and is deployed on morethan a dozen major platforms ranging from cell phones to zSeriesmainframes. Because it runs on such a diverse set of platforms, agreat deal of effort has been invested by the J9 team in defining

Architecture

Libra

PowerPC Blades: Libra Workers

Pool of Libra PartitionsX86 Linux Front Ends

9p

$

$

Libra



$ java -cp my.jar

9p

$

Libra



9p

$

$ java -cp my.jar

Libra



$ java -cp my.jar

9p

$ for ((i=0;i<44;i++))do java -cp my.jar &done

Libra



$ java -cp my.jar

9p




$ java -cp my.jar

9p


Libra

What did we learn?

• Specialized environment for each application

• Lightweight system layer implementing services for performance

• General purpose OS for non-performance critical services

SEE : A LibOS for SESA• Distributed LibOS that

can elastically span nodes

• Instances cooperate to support the allocation and deallocation of EBB’s

• Enables compatibility with Front End nodes running via unified 9p namespace

locality aware memory allocator

event dispatcher

inter-node communication protocols and

primitives

EBB InfrastructureFS Name Space Protocol (9p)

Scalable Elastic Executive (SEE)

Per-node EBB manifestations

SEHAL

SEMachines and EPICs

SEHAL SEHAL SEHAL

SEMachine Hardware Abstraction Layer : EPIC

Programmable Interrupt Controller

1 10 0

Interrupt

Execution

Source

1 11 0

Interrupt

Execution

Source


1 11 0

Event

Action

Source


Elastic Programmable Interrupt Controller

Event

Action

Source

1 11 0 0 001

1101

Event

Action

...

...

110 1

...


Source


• Programmed by the SEE

• Provides the minimum requirement of elastic applications - mapping load to resources

• Portable layer

• Take advantage of network features such as broadcast and multicast

OUTLINE

1. THE PROBLEM

2. OBSERVATIONS



PROTOTYPE APP

Kittyhaw

k

Elastic Matrix Cache

ElasticMatrixOps

Sage*

OL

SESA SAGE SERVICES

SEE: EBB’s + EH

AL

Traditional HW

Advanced HW

Challenges and Discussion

OUTLINE

1. THE PROBLEM

1. Pay as you go computing

2. Insufficient systems support for elasticity

2. OBSERVATIONS



Provider

ConsumerSoftware

Pay as you go hardware

Provider

ConsumerRequest

Software


Provider

ConsumerSoftware


Elastic Website

Provider

Consumer

Load Balancer

Elastic Website

Provider

Consumer

Load Balancer

Elastic Website

Provider

Consumer

Load Balancer

Other Elastic Applications

• Analytics

• Batch computation

• Stream processing

What’s the problem?

• Allocation/Boot-time

• Programmability

Medical Imaging Application

• Megapixel image

• Quadratic algorithm

• (1 mil pixels * 4 bytes/pixel)^2 ~ 14 TB

• On Amazon EC2 ~ $8000 per day

Snowflock

Provider

Consumer

Snowflock

Provider

Consumer

Snowflock

Provider

Consumer

Distributing an ObjectNon-Distributed Object Instance

LR0

R1

R2

Region List LockRegion List

Other Data

Structures

LR0

R1

R2

Root

LR0

Rep0

LR0

Rep2R2

LR1

Rep1


1 11 0

Event

Action

Source

Documents

Scalable Elastic Systems Architecture (SESA)