Mesos: A Platform for Fine-Grained Resource Sharing
in Data Centers (I)
UC BERKELEY
Anthony D. Joseph
LASER Summer School September 2013
My Talks at LASER 2013
1. AMP Lab introduction
2. The Datacenter Needs an Operating System
3. Mesos, part one
4. Dominant Resource Fairness
5. Mesos, part two
6. Spark
Collaborators
• Matei Zaharia
• Benjamin Hindman
• Andy Konwinski
• Ali Ghodsi
• Randy Katz
• Scott Shenker
• Ion Stoica
Modern Data Center Paradigm

Commodity machines (100s – 10,000s of machines) » Attached storage devices

Data distributed and replicated across nodes » Data locality to computation matters

Solution: Use a datacenter computing framework » Divide jobs into smaller tasks, so that jobs can take turns accessing each node and ideally locally accessing data » Tasks are fine-grained both in time (short) and in space (use a fraction of a machine)
Datacenter Scheduling Problem

Rapid innovation in datacenter computing frameworks (e.g., Hadoop, Dryad, Pregel, Pig, Percolator, CIEL)

No single framework optimal for all applications

Want to run multiple frameworks in a single datacenter » …to maximize utilization » …to share data between frameworks

Where We Want to Go

[Figure: today the cluster is statically partitioned into separate Hadoop, Pregel, and MPI partitions; the goal is dynamic sharing of a single shared cluster]
Solution: Apache Mesos

[Figure: Mesos runs as a thin layer over all cluster nodes; frameworks such as Hadoop and Pregel run on top of it rather than on static partitions]
Mesos is a common resource sharing layer over which diverse frameworks can run
Run multiple instances of the same framework » Isolate production and experimental jobs » Run multiple versions of the framework concurrently
Build specialized frameworks targeting particular problem domains » Better performance than general-purpose abstractions
http://mesos.apache.org/
Mesos Goals
High utilization of resources
Support diverse frameworks (current & future)
Scalability to 10,000s of nodes
Reliability in face of failures
Resulting design: Small microkernel-like core that pushes scheduling logic to frameworks
Previous Approaches

Locality less important (HPC & grid computing) » Expensive, dedicated storage (SANs, parallel FS) » Expensive, high-speed networks (InfiniBand)

Fine-grained task model infeasible » Ad-hoc programs (many barriers, tight message passing) » Legacy programs (Fortran 77)

Approach taken: coarse-grained sharing » Job specifies number of machines and amount of time needed » Scheduler queues job and allocates all machines at the same time
Mesos Design Elements

Fine-grained sharing: » Allocation at the level of tasks within a job » Improves utilization, latency, and data locality

Resource offers: » Simple, scalable, application-controlled scheduling mechanism
Element 1: Fine-Grained Sharing

[Figure: Coarse-Grained Sharing (HPC) gives Frameworks 1–3 static blocks of machines over the storage system (e.g. HDFS); Fine-Grained Sharing (Mesos) interleaves short tasks from all three frameworks across every machine. + Improved utilization, responsiveness, data locality]
Element 2: Resource Offers

Option: Global scheduler » Frameworks express needs in a specification language; global scheduler matches them to resources

+ Can make optimal decisions

– Complex: language must support all framework needs
– Difficult to scale and to make robust
– Future frameworks may have unanticipated needs
Element 2: Resource Offers

Mesos: Resource offers » Offer available resources to frameworks, let them pick which resources to use and which tasks to launch

+ Keeps Mesos simple, lets it support future frameworks
– Decentralized decisions might not be optimal
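The offer round trip can be sketched in a few lines. This is a toy model of the mechanism, not the actual Mesos API: the names `master_offer_round` and `GreedyFramework` are illustrative.

```python
# Minimal sketch of a resource-offer round: the master offers what is
# free, the framework picks resources and returns tasks to launch.

def master_offer_round(available, framework):
    """Offer every currently free resource to one framework; the
    framework decides which to use and which tasks to launch."""
    offer = list(available)                  # e.g. [("m1", 4.0, 8192), ...]
    accepted, tasks = framework.accept(offer)
    # Declined resources go back in the pool for the next framework.
    remaining = [r for r in offer if r not in accepted]
    return remaining, tasks

class GreedyFramework:
    """Toy framework that takes every offered machine, one task each."""
    def accept(self, offer):
        tasks = ["task-on-" + host for host, cpus, mem in offer]
        return list(offer), tasks
```

Because the framework (not the master) decides which resources to take, the master never needs a specification language for framework needs.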
Make the datacenter a real computer!

[Figure: the Datacenter "OS" (e.g., Apache Mesos) runs over heterogeneous machines with their own node OSes (e.g., Linux, Windows). The existing stack on top includes Hadoop, MPI, Hypertable, and Cassandra; the AMP stack adds Spark (supporting interactive and iterative data analysis, e.g., ML algorithms), SCADS (a consistency-adjustable data store), PIQL (a predictive & insightful query language), and Hive.]
Allocation Policies
Mesos controls how many resources each framework can get, but not which resources
Allocation policies are pluggable to suit organization needs
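A pluggable policy might look like the sketch below. This is hypothetical illustration, not Mesos's allocator interface: the policy decides *who is offered resources next* (how much each framework gets over time), while frameworks still decide which offered resources to use.

```python
# Sketch of pluggable allocation policies (class and method names are
# illustrative). The master asks the policy which framework to offer
# free resources to.

class AllocationPolicy:
    def next_framework(self, frameworks, usage):
        raise NotImplementedError

class FairSharePolicy(AllocationPolicy):
    """Offer to the framework furthest below its share, approximated
    here as the one with the least current usage."""
    def next_framework(self, frameworks, usage):
        return min(frameworks, key=lambda f: usage.get(f, 0))

class PriorityPolicy(AllocationPolicy):
    """Always offer to the highest-priority framework first."""
    def __init__(self, priorities):
        self.priorities = priorities
    def next_framework(self, frameworks, usage):
        return max(frameworks, key=lambda f: self.priorities.get(f, 0))
```

An organization swaps policies (fair sharing, priorities, quotas) without touching frameworks or the rest of the master.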
Example: Hierarchical Fair Sharing

[Figure: cluster share policy for Facebook.com: the cluster is split between the Spam and Ads departments (e.g., 20% / 80%), and the Spam share is split between User 1 and User 2 (e.g., 70% / 30%), so User 1's jobs get 14% and User 2's get 6% of the whole cluster. The utilization chart shows Jobs 1–4 receiving these shares over time, with a single job able to use up to 100% of the cluster when the rest of the hierarchy is idle.]
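The hierarchical shares above reduce to products of weights along each path from the root; the numbers come from the slide, and the helper name is illustrative.

```python
# A leaf's effective cluster share is the product of the weights on the
# path from the root of the sharing hierarchy down to that leaf.

def effective_share(path_weights):
    share = 1.0
    for w in path_weights:
        share *= w
    return share

# Spam Dept. holds 20% of the cluster; User 1 holds 70% of Spam:
user1 = effective_share([0.20, 0.70])   # 14% of the whole cluster
user2 = effective_share([0.20, 0.30])   # 6%
```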
Mesos Architecture

[Figure: framework schedulers (a Hadoop JobTracker and an MPI scheduler) sit above the Mesos master, which contains a pluggable allocation module; Hadoop and MPI executors run tasks on Slaves 1–3. Resource offers flow up to the schedulers, task launches flow down.]

Slaves send status updates about available resources

Pluggable policy picks which framework to offer resources to

Framework scheduler selects resources and provides tasks

Framework executors run tasks and may persist across tasks
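The framework-scheduler side of this picture can be sketched as below. The callback names echo the Mesos scheduler API, but the class and signatures here are hypothetical stand-ins, not the real bindings.

```python
# Sketch of a framework scheduler: the master calls resource_offers()
# with available resources, and executors report back via status_update().

class MyScheduler:
    def __init__(self, pending_tasks):
        self.pending = list(pending_tasks)
        self.last_status = None

    def resource_offers(self, offers):
        """Return (offer_id, task) pairs for tasks to launch on
        offers that meet this framework's minimum requirements."""
        launches = []
        for offer_id, cpus, mem in offers:
            if self.pending and cpus >= 1 and mem >= 1024:
                launches.append((offer_id, self.pending.pop(0)))
        return launches

    def status_update(self, task, state):
        """Called as executors report task state transitions."""
        self.last_status = (task, state)
```

Note that all job-specific scheduling logic lives here, outside the Mesos master, which is what keeps the master small and framework-agnostic.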
Resource Offer Details

A resource offer is a set of machine–resource tuples » { [m1, 1 CPU, 1 GB], [m2, 4 CPUs, 16 GB] }

Resource offers count towards a framework's share » Rescinded after a timeout (incentive to reply fast)

Optimizations » Frameworks indicate interest to get offers » Frameworks can set filters to automatically filter out certain nodes or nodes with too few resources
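The filter optimization amounts to a predicate the master applies before sending offers, so a framework is never asked about nodes it would always decline. A hedged sketch (the function name and parameters are illustrative, not the actual Mesos filter API):

```python
# Drop offers from blocked nodes or with too few resources before they
# are sent to the framework.

def apply_filters(offers, min_cpus=1.0, min_mem=1024, blocked_nodes=()):
    """Keep only offers from acceptable nodes with enough resources."""
    return [(host, cpus, mem)
            for host, cpus, mem in offers
            if host not in blocked_nodes
            and cpus >= min_cpus and mem >= min_mem]
```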
Dynamic Resource Sharing

[Figure: cluster utilization (0–100%) over time (seconds) as Torque and three Hadoop instances dynamically share the cluster through Mesos.]
Which Offers to Accept?

Delay scheduling » Initially only accept preferred (e.g., local) resources » Accept any resource after a timeout (1–5 seconds)

Can achieve near-optimal locality
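Delay scheduling in miniature: decline non-local offers at first, then accept any node once the wait exceeds the timeout. This is a sketch; the function name and the 5-second default are illustrative.

```python
# Prefer nodes holding the task's input data; give up on locality only
# after waiting past the timeout.

def choose_node(offered_nodes, preferred_nodes, waited_s, timeout_s=5.0):
    """Return a node to launch on, or None to keep waiting for locality."""
    local = [n for n in offered_nodes if n in preferred_nodes]
    if local:
        return local[0]              # data-local launch, no waiting needed
    if waited_s >= timeout_s and offered_nodes:
        return offered_nodes[0]      # timed out: accept a non-local node
    return None                      # decline; hope for a local offer soon
```

Because tasks are short, a local slot usually frees up within the timeout, which is why a wait of only a few seconds recovers most of the locality.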
Multiple Hadoops Experiment

[Figure: as in the fine-grained sharing diagram, three Hadoop instances either run on static partitions over the storage system (e.g. HDFS) or interleave their tasks across the shared cluster under Mesos.]
Data Locality on Mesos

16 Hadoop MapReduce instances over a shared file system

[Figure: local map tasks (%) and job running time (s) under four configurations: static partitioning, Mesos with no delay scheduling, Mesos with 1 s delay scheduling, and Mesos with 5 s delay scheduling.]
Some Related Datacenter Resource Managers

Hadoop YARN » Open-source follow-on to Hadoop with pluggable allocation policies » Primary focus is Hadoop jobs

Google's Omega resource manager » Closed-source follow-on to Google's original resource manager » Framework-specific schedulers use an optimistic concurrency model – all compete simultaneously to select resources
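Omega's optimistic concurrency can be sketched as a shared cluster state against which every scheduler submits claims; a claim commits only if its resources are still free, otherwise the scheduler retries on fresh state. The class and method names below are illustrative, not Omega's actual interface.

```python
# Toy model of optimistic concurrency over shared cluster state.

class SharedClusterState:
    def __init__(self, machines):
        self.free = set(machines)

    def try_commit(self, claim):
        """Atomically grant a set of machines, or fail on conflict."""
        if claim <= self.free:       # are all claimed machines still free?
            self.free -= claim
            return True
        return False                 # conflict: retry against fresh state
```

This contrasts with Mesos's offers: in Omega every scheduler sees the whole cluster and conflicts are resolved at commit time, while in Mesos the master avoids conflicts up front by offering each resource to one framework at a time.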
In Mesos Part II Lecture
Implementation Details and Supported Frameworks
Isolation
Handling Mesos Master Failure
Resource Revocation
Scalability
Results and Macrobenchmarks
Summary (Part One)
Mesos is a platform for sharing data centers among diverse cluster computing frameworks » Enables efficient fine-grained sharing » Gives frameworks control over scheduling » Supports current and future frameworks » Achieves high utilization
My Talks at LASER 2013
1. AMP Lab introduction
2. The Datacenter Needs an Operating System
3. Mesos, part one
4. Dominant Resource Fairness
5. Mesos, part two
6. Spark