
Page 1: Mesos

Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica

University of California, Berkeley

Page 2: Mesos

Summary
Platform for sharing clusters across various computing frameworks
Improves cluster utilization and avoids replication
Introduces a distributed two-level scheduling mechanism
Results show near-optimal data locality

Page 3: Mesos

Background
Rapid innovation in cluster computing frameworks: Pig, Dryad, Pregel, Percolator, CIEL

Page 4: Mesos

Problem
Rapid innovation in cluster computing frameworks
No single framework optimal for all applications
Want to run multiple frameworks in a single cluster
» …to maximize utilization
» …to share data between frameworks
» …to reduce cost of computation

Page 5: Mesos

Where We Want to Go
[Figure: Hadoop, Pregel, and MPI on a shared cluster. Today: static partitioning; Mesos: dynamic sharing.]

Page 6: Mesos

Solution
Mesos is a common resource sharing layer over which diverse frameworks can run
[Figure: without Mesos, Hadoop and Pregel each run on their own dedicated nodes; with Mesos, Hadoop, Pregel, and other frameworks share all nodes through the Mesos layer.]

Page 7: Mesos

Other Benefits of Mesos
Run multiple instances of the same framework
» Isolate production and experimental jobs
» Run multiple versions of the framework concurrently
Build specialized frameworks targeting particular problem domains
» Better performance than general-purpose abstractions

Page 8: Mesos

Outline
Mesos Goals and Architecture
Implementation
Results
Related Work

Page 9: Mesos

Mesos Goals
High utilization of resources
Support diverse frameworks (current & future)
Scalability to 10,000s of nodes
Reliability in face of failures
Resulting design: small microkernel-like core that pushes scheduling logic to frameworks

Page 10: Mesos

Design Elements
Fine-grained sharing:
» Allocation at the level of tasks within a job
» Improves utilization, latency, and data locality
Resource offers:
» Simple, scalable, application-controlled scheduling mechanism

Page 11: Mesos

Element 1: Fine-Grained Sharing
[Figure: Coarse-Grained Sharing (HPC) gives Frameworks 1, 2, and 3 static blocks of nodes; Fine-Grained Sharing (Mesos) interleaves tasks from all three frameworks across the nodes. Both run over a storage system (e.g. HDFS).]
+ Improved utilization, responsiveness, data locality

Page 12: Mesos

Element 2: Resource Offers
Option: Global scheduler
» Frameworks express needs in a specification language; a global scheduler matches them to resources
+ Can make optimal decisions
– Complex: language must support all framework needs
– Difficult to scale and to make robust
– Future frameworks may have unanticipated needs

Page 13: Mesos

Element 2: Resource Offers
Mesos: Resource offers
» Offer available resources to frameworks; let them pick which resources to use and which tasks to launch
+ Keeps Mesos simple, lets it support future frameworks
– Decentralized decisions might not be optimal

Page 14: Mesos

Mesos Architecture
[Figure: An MPI job with its MPI scheduler and a Hadoop job with its Hadoop scheduler sit above the Mesos master, which contains the allocation module. Mesos slaves run MPI executors and their tasks. The allocation module picks a framework to offer resources to, and the master sends that framework a resource offer.]
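To make the two-level flow above concrete, here is a minimal C++ sketch using hypothetical type and function names rather than the real Mesos API: the allocation module picks a framework, the master packages the free resources into offers, and the chosen framework's scheduler replies with the tasks it wants to launch.

  // Minimal sketch of one round of two-level scheduling (hypothetical names).
  #include <map>
  #include <string>
  #include <vector>

  struct Resources { double cpus; double memGB; };
  struct Offer { std::string slave; Resources available; };
  struct Task { std::string slave; Resources used; std::string command; };

  // Framework-side scheduler: given offers, decide which tasks to launch.
  class FrameworkScheduler {
   public:
    virtual std::vector<Task> resourceOffers(const std::vector<Offer>& offers) = 0;
    virtual ~FrameworkScheduler() = default;
  };

  // Pluggable allocation module: pick which framework gets the next offer.
  class AllocationModule {
   public:
    virtual std::string pickFramework(
        const std::map<std::string, Resources>& freeBySlave) = 0;
    virtual ~AllocationModule() = default;
  };

  // One round of the master's offer loop.
  void offerRound(AllocationModule& allocator,
                  std::map<std::string, FrameworkScheduler*>& frameworks,
                  const std::map<std::string, Resources>& freeBySlave) {
    // 1. Allocation module decides which framework should be offered resources.
    std::string chosen = allocator.pickFramework(freeBySlave);

    // 2. Master packages the currently free resources into offers.
    std::vector<Offer> offers;
    for (const auto& [slave, res] : freeBySlave) offers.push_back({slave, res});

    // 3. The framework's scheduler picks resources and tasks to launch;
    //    launching and bookkeeping are omitted from this sketch.
    std::vector<Task> tasks = frameworks.at(chosen)->resourceOffers(offers);
    (void)tasks;
  }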

Page 15: Mesos

Mesos Architecture
[Figure: same architecture as the previous slide; the master picks a framework to offer resources to and sends it a resource offer.]
Resource offer = list of (node, availableResources)
E.g. { (node1, <2 CPUs, 4 GB>), (node2, <3 CPUs, 2 GB>) }
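As a concrete rendering of the offer format defined on this slide, a small illustrative C++ sketch (the names are assumptions, not Mesos's actual types) builds the example offer and shows how a framework might walk it:

  // Illustrative layout of a resource offer as a list of (node, resources).
  #include <string>
  #include <vector>

  struct Resources { int cpus; int memGB; };
  struct NodeOffer { std::string node; Resources available; };
  using ResourceOffer = std::vector<NodeOffer>;

  int main() {
    // The example from the slide:
    // { (node1, <2 CPUs, 4 GB>), (node2, <3 CPUs, 2 GB>) }
    ResourceOffer offer = {
      {"node1", {2, 4}},
      {"node2", {3, 2}},
    };

    // A framework scheduler would walk the offer and claim only what it needs,
    // e.g. one task per node with at least 2 CPUs and 2 GB free.
    for (const NodeOffer& n : offer) {
      if (n.available.cpus >= 2 && n.available.memGB >= 2) {
        // launchTask(n.node, ...);  // left abstract in this sketch
      }
    }
    return 0;
  }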

Page 16: Mesos

Mesos Architecture
[Figure: same architecture; the framework's scheduler performs framework-specific scheduling on the offered resources, and the slaves launch and isolate the executors (here an MPI executor and a Hadoop executor) that run the tasks.]

Page 17: Mesos

Architecture Overview

Master process manages the slave daemons; frameworks run tasks on these slaves
Each framework consists of two components:
» Scheduler that accepts resources offered by the master
» Executor that launches the framework's tasks
Mesos does not require frameworks to specify resource requirements or constraints. How?
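The two framework-side components can be pictured as two small callback interfaces. The following is a hypothetical C++ sketch of that split, not the actual Mesos framework API:

  // Hypothetical interfaces for the two framework components (not the real API).
  #include <string>
  #include <vector>

  struct Resources { double cpus; double memGB; };
  struct Offer { std::string slave; Resources available; };
  struct TaskDescription { std::string taskId; std::string slave; Resources used; };

  // Component 1: the scheduler registers with the master and reacts to offers.
  class Scheduler {
   public:
    virtual void registered(const std::string& frameworkId) = 0;
    // Accept or decline offered resources; return the tasks to launch on them.
    virtual std::vector<TaskDescription> resourceOffers(
        const std::vector<Offer>& offers) = 0;
    virtual ~Scheduler() = default;
  };

  // Component 2: the executor runs on slaves and launches the framework's tasks.
  class Executor {
   public:
    virtual void launchTask(const TaskDescription& task) = 0;
    virtual void killTask(const std::string& taskId) = 0;
    virtual ~Executor() = default;
  };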

Page 18: Mesos

Optimization: Filters
Let frameworks short-circuit rejection by providing a predicate on the resources to be offered
» E.g. "nodes from list L" or "nodes with > 8 GB RAM"
» Could generalize to other hints as well
The ability to reject offers still ensures correctness when needs cannot be expressed using filters
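A filter is just a predicate the master can evaluate before sending an offer to a framework. Below is a minimal illustrative C++ sketch of the two example filters; the names and signatures are assumptions, not the real API:

  // Illustrative filters evaluated by the master before sending offers.
  #include <functional>
  #include <set>
  #include <string>

  struct NodeInfo { std::string hostname; double memGB; };
  using Filter = std::function<bool(const NodeInfo&)>;

  // "only offer nodes from list L"
  Filter fromList(std::set<std::string> allowed) {
    return [allowed = std::move(allowed)](const NodeInfo& n) {
      return allowed.count(n.hostname) > 0;
    };
  }

  // "only offer nodes with > 8 GB RAM"
  Filter minMemory(double gb) {
    return [gb](const NodeInfo& n) { return n.memGB > gb; };
  }

  // The master skips nodes that fail a framework's filter; the framework can
  // still reject anything that slips through, which preserves correctness.
  bool shouldOffer(const Filter& f, const NodeInfo& n) { return f(n); }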

Page 19: Mesos

Resource Allocation
Allocation decisions are delegated to a pluggable allocation module
Two allocation modules implemented:
» Fair sharing based on a generalization of max-min fairness
» Strict priority-based allocation
Waiting for running tasks to finish usually works because most tasks are short; long tasks may need to be revoked, which causes problems for some frameworks, so Mesos supports guaranteed allocations
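For intuition on the fair-sharing module, max-min fairness over a single resource can be computed by progressive filling: satisfy the smallest demands first and split the leftover capacity among the rest. The sketch below is a generic illustration of that idea, not the module's actual code, which generalizes it across frameworks and resource types:

  // Max-min fair shares of a single resource via progressive filling.
  #include <algorithm>
  #include <vector>

  std::vector<double> maxMinFair(const std::vector<double>& demands, double capacity) {
    std::vector<double> alloc(demands.size(), 0.0);
    std::vector<size_t> order(demands.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = i;
    // Satisfy the smallest demands first; split leftovers among the rest.
    std::sort(order.begin(), order.end(),
              [&](size_t a, size_t b) { return demands[a] < demands[b]; });
    size_t remaining = demands.size();
    for (size_t idx : order) {
      double fairShare = capacity / remaining;
      alloc[idx] = std::min(demands[idx], fairShare);
      capacity -= alloc[idx];
      --remaining;
    }
    return alloc;
  }

  // Example: demands {2, 2.6, 4, 5} CPUs on a 10-CPU cluster
  // yield allocations {2, 2.6, 2.7, 2.7}.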

Page 20: Mesos

Framework Isolation
Mesos uses OS isolation mechanisms, such as Linux containers and Solaris projects
Containers currently support CPU, memory, IO, and network bandwidth isolation
Not perfect, but much better than no isolation
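As a rough illustration of the mechanism underneath Linux containers, the sketch below caps a task's memory with a control group and attaches its process. It assumes the legacy cgroup v1 filesystem layout and root privileges, and it is not Mesos's actual isolation code:

  // Rough cgroup v1 sketch: cap a task's memory and attach its process.
  // Assumes /sys/fs/cgroup/memory is mounted (legacy layout) and root privileges.
  #include <fstream>
  #include <string>
  #include <sys/stat.h>
  #include <sys/types.h>

  void isolateTask(pid_t pid, long memLimitBytes) {
    const std::string group = "/sys/fs/cgroup/memory/mesos_task_example";
    mkdir(group.c_str(), 0755);                                        // create the control group
    std::ofstream(group + "/memory.limit_in_bytes") << memLimitBytes;  // cap the task's RAM
    std::ofstream(group + "/tasks") << pid;                            // move the process in
  }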

Page 21: Mesos

Fault Tolerance
Mesos master is critical, so it is designed to hold only soft state
Master reports executor failures to frameworks
Mesos allows frameworks to register multiple schedulers for fault tolerance
Result: fault detection and recovery in ~10 sec

Page 22: Mesos

Implementation

Page 23: Mesos

Implementation Stats
10,000 lines of C++
Master failover using ZooKeeper
Frameworks ported: Hadoop, MPI, Torque
New specialized framework: Spark, for iterative jobs (up to 10× faster than Hadoop)
Open source in the Apache Incubator

Page 24: Mesos

Users
Twitter uses Mesos on > 100 nodes to run ~12 production services (mostly stream processing)
Berkeley machine learning researchers are running several algorithms at scale on Spark
Conviva is using Spark for data analytics
UCSF medical researchers are using Mesos to run Hadoop and eventually non-Hadoop apps

Page 25: Mesos

Results
» Utilization and performance vs static partitioning
» Framework placement goals: data locality
» Scalability
» Fault recovery

Page 26: Mesos
Page 27: Mesos
Page 28: Mesos

Dynamic Resource Sharing

Page 29: Mesos

Mesos vs Static Partitioning
Compared performance with a statically partitioned cluster where each framework gets 25% of the nodes

  Framework             Speedup on Mesos
  Facebook Hadoop Mix   1.14×
  Large Hadoop Mix      2.10×
  Spark                 1.26×
  Torque / MPI          0.96×

Page 30: Mesos

Data Locality with Resource Offers
Ran 16 instances of Hadoop on a shared HDFS cluster
Used delay scheduling [EuroSys '10] in Hadoop to get locality (wait a short time to acquire data-local nodes)
[Figure: data locality and job duration results; 1.7× improvement with delay scheduling]
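Delay scheduling itself is simple enough to sketch: a job declines offers on non-data-local nodes until it has waited past a small threshold, then takes whatever it can get. The code below is an illustration with an assumed 5-second threshold, not the Hadoop or Mesos implementation:

  // Illustrative delay-scheduling decision for one job (threshold is assumed).
  #include <set>
  #include <string>

  struct Job {
    std::set<std::string> localNodes;  // nodes holding the job's input data
    double waitedSecs = 0.0;           // time spent declining non-local offers
  };

  // Returns true if the job should launch a task on 'offeredNode' now.
  bool acceptOffer(Job& job, const std::string& offeredNode,
                   double offerIntervalSecs, double thresholdSecs = 5.0) {
    if (job.localNodes.count(offeredNode)) {
      job.waitedSecs = 0.0;   // got a data-local node; reset the wait
      return true;
    }
    if (job.waitedSecs >= thresholdSecs) {
      return true;            // waited long enough; accept a non-local node
    }
    job.waitedSecs += offerIntervalSecs;  // decline and keep waiting
    return false;
  }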

Page 31: Mesos

Scalability
Mesos only performs inter-framework scheduling (e.g. fair sharing), which is easier than intra-framework scheduling
[Figure: task start overhead (s) vs. number of slaves, up to 50,000 slaves]
Result: Scaled to 50,000 emulated slaves, 200 frameworks, 100K tasks (30s length)

Page 32: Mesos

Conclusion
Mesos shares clusters efficiently among diverse frameworks thanks to two design elements:
» Fine-grained sharing at the level of tasks
» Resource offers, a scalable mechanism for application-controlled scheduling
Enables co-existence of current frameworks and development of new specialized ones
In use at Twitter, UC Berkeley, Conviva, and UCSF