Benjamin Hindman – @benh Apache Mesos Design Decisions mesos.apache.org @ApacheMesos

Benjamin Hindman – @benh

Apache Mesos Design Decisions

mesos.apache.org

@ApacheMesos

this is not a talk about YARN

at least not explicitly!

this talk is about Mesos!

a little history

Mesos started as a research project at Berkeley in early 2009 by Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica

our motivation

increase performance and

utilization of clusters

our intuition

① static partitioning considered harmful

static partitioning considered harmful

datacenter

faster!

higher utilization!

our intuition

② build new frameworks

“Map/Reduce is a big hammer, but not everything is a nail!”

Apache Mesos is a distributed system for running and building other distributed systems

Mesos is a cluster manager

Mesos is a resource manager

Mesos is a resource negotiator

Mesos replaces static partitioning of resources to frameworks with dynamic resource allocation

Mesos is a distributed system with a master/slave architecture

masters

slaves

frameworks register with the Mesos master in order to run jobs/tasks

masters

slaves

frameworks

Mesos @Twitter in early 2010

goal: run long-running services elastically on Mesos

Apache Aurora (incubating)

masters

Aurora is a Mesos framework that makes it easy to launch services written in Ruby, Java, Scala, Python, Go, etc!

masters

Storm, Jenkins, …

a lot of interesting design decisions along the way

many appear (IMHO) in YARN too

design decisions

① two-level scheduling and resource offers

② fair-sharing and revocable resources

③ high-availability and fault-tolerance

④ execution and isolation

⑤ C++

design decision ①: two-level scheduling and resource offers

frameworks get allocated resources from the masters

masters

framework

resources are allocated via resource offers

a resource offer represents a snapshot of available resources (one offer per host) that a framework can use to run tasks

offer: hostname, 4 CPUs, 4 GB RAM

frameworks use these resources to decide what tasks to run

masters

framework

a task can use a subset of an offer

task: 3 CPUs, 2 GB RAM
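The offer and task shown above can be sketched in a few lines. This is only an illustration of "a task can use a subset of an offer": the `Resources` class and its method names are hypothetical and are not part of the Mesos API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource vector; Mesos itself models resources differently."""
    cpus: float
    mem_gb: float

    def fits_in(self, other: "Resources") -> bool:
        # A task fits if it needs no more of any resource than the offer has.
        return self.cpus <= other.cpus and self.mem_gb <= other.mem_gb

    def minus(self, other: "Resources") -> "Resources":
        # What is left of an offer after carving out a task.
        return Resources(self.cpus - other.cpus, self.mem_gb - other.mem_gb)


offer = Resources(cpus=4, mem_gb=4)   # the offer from the slide
task = Resources(cpus=3, mem_gb=2)    # the task from the slide

fits = task.fits_in(offer)
leftover = offer.minus(task)
print(fits, leftover)
```

The leftover (1 CPU, 2 GB) is what the framework did not use from this offer.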

Mesos challenged the status quo of cluster managers

cluster manager status quo

cluster manager

application

specification

the specification includes as much information as possible to assist the cluster manager in scheduling and execution

cluster manager status quo

cluster manager

application wait for task to be executed

cluster manager status quo

cluster manager

application

result

problems with specifications

① hard to specify certain desires or constraints

② hard to update specifications dynamically as tasks execute and finish or fail

an alternative model

masters

framework

request: 3 CPUs, 2 GB RAM

a request is a purposely simplified subset of a specification, mainly including the required resources

question: what should Mesos do if it can’t satisfy a request?

① wait until it can …

② offer the best it can immediately

an alternative model

masters

framework

offer: hostname, 4 CPUs, 4 GB RAM

offer: hostname, 4 CPUs, 4 GB RAM

offer: hostname, 4 CPUs, 4 GB RAM

offer: hostname, 4 CPUs, 4 GB RAM

an alternative model

masters

framework

offer: hostname, 4 CPUs, 4 GB RAM

framework uses the offers to perform its own scheduling
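As a rough sketch of what "its own scheduling" could look like, a framework might place pending tasks onto the offers it holds with a simple first-fit pass. The `Offer`, `Task`, and `schedule` names are made up for this example, and first-fit is just one possible policy; real frameworks can use whatever logic suits them.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Offer:
    hostname: str
    cpus: float
    mem_gb: float


@dataclass(frozen=True)
class Task:
    name: str
    cpus: float
    mem_gb: float


def schedule(offers: List[Offer], pending: List[Task]) -> Dict[str, List[str]]:
    """First-fit placement (an assumed policy): put each task on the first offer with room."""
    remaining = {o.hostname: [o.cpus, o.mem_gb] for o in offers}
    placements: Dict[str, List[str]] = {o.hostname: [] for o in offers}
    for task in pending:
        for hostname, free in remaining.items():
            if task.cpus <= free[0] and task.mem_gb <= free[1]:
                free[0] -= task.cpus
                free[1] -= task.mem_gb
                placements[hostname].append(task.name)
                break  # task placed; tasks that fit nowhere stay pending
    return placements


offers = [Offer("host1", 4, 4), Offer("host2", 4, 4)]
pending = [Task("a", 3, 2), Task("b", 3, 2), Task("c", 1, 1)]
placements = schedule(offers, pending)
print(placements)
```

The point of two-level scheduling is that this placement logic lives in the framework, not in the Mesos master.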

an analogue: non-blocking sockets

kernel

application

write(s, buffer, size);

42 of 100 bytes written!
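The socket analogy can be tried directly. The sketch below uses a local socket pair and a payload large enough that the kernel usually accepts only part of it; the helper name `write_some` and the payload size are arbitrary choices for this example.

```python
import socket


def write_some(sock: socket.socket, data: bytes) -> int:
    """Write without blocking and report how many bytes the kernel accepted."""
    try:
        return sock.send(data)
    except BlockingIOError:
        # Nothing could be accepted right now; the caller may retry later.
        return 0


writer, reader = socket.socketpair()
writer.setblocking(False)

payload = b"x" * (16 * 1024 * 1024)  # far larger than typical socket buffers
written = write_some(writer, payload)
print(f"{written} of {len(payload)} bytes written!")

writer.close()
reader.close()
```

Like a resource offer, the return value is "the best the kernel can do right now", and the application decides what to do with the remainder.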

resource offers address asynchrony in resource allocation

IIUC, even YARN allocates “the best it can” to an application when it can’t satisfy a request

requests are complementary (but not necessary)

offers represent the currently available resources a framework can use

question: should resources within offers be disjoint?

masters

framework1 framework2

offer: hostname, 4 CPUs, 4 GB RAM

offer: hostname, 4 CPUs, 4 GB RAM

concurrency control

spectrum: optimistic ↔ pessimistic

optimistic: all offers overlap with one another, thus causing frameworks to “compete” first-come-first-served

pessimistic: offers made to different frameworks are disjoint

Mesos semantics: assume overlapping offers
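A toy model of overlapping offers, assuming the simplest possible conflict handling: both frameworks see the same snapshot, and whichever launches first wins. The `ToyMaster` class and the boolean return value are inventions for this sketch; the talk does not describe how a real master reports a lost race.

```python
from typing import Dict, Tuple


class ToyMaster:
    """Hypothetical master that hands every framework the same view of free resources."""

    def __init__(self, free: Dict[str, Tuple[float, float]]):
        self.free = dict(free)  # hostname -> (cpus, mem_gb)

    def offer(self) -> Dict[str, Tuple[float, float]]:
        # Optimistic: the snapshot is not locked or subtracted for anyone.
        return dict(self.free)

    def launch(self, hostname: str, cpus: float, mem_gb: float) -> bool:
        free_cpus, free_mem = self.free[hostname]
        if cpus > free_cpus or mem_gb > free_mem:
            return False  # someone else already used these resources
        self.free[hostname] = (free_cpus - cpus, free_mem - mem_gb)
        return True


master = ToyMaster({"host1": (4, 4)})
offer_for_framework1 = master.offer()
offer_for_framework2 = master.offer()   # overlaps with framework1's offer

first = master.launch("host1", 3, 2)    # framework1 gets there first
second = master.launch("host1", 3, 2)   # framework2 loses the race
print(first, second)
```

A pessimistic master would instead subtract resources at offer time, so the second framework would never have seen them.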

design comparison: Google’s Omega

the Omega model

database

framework

snapshot

a framework gets a snapshot of the cluster state from a database (note, does not make a request!)

the Omega model

database

framework

transaction

a framework submits a transaction to the database to “acquire” resources (which it can then use to run tasks)

failed transactions occur when another framework has already acquired sought resources

isomorphism?

observation: snapshots are optimistic offers

Omega and Mesos

database

framework

snapshot

masters

framework

offer: hostname, 4 CPUs, 4 GB RAM

Omega and Mesos

database

framework

transaction

masters

framework

task: 3 CPUs, 2 GB RAM

thought experiment: what’s gained by exploiting the continuous spectrum of pessimistic to optimistic?

optimistic ↔ pessimistic

design decision ②: fair-sharing and revocable resources

Mesos allocates resources to frameworks using a fair-sharing algorithm we created called Dominant Resource Fairness (DRF)

DRF, born of static partitioning

datacenter

static partitioning across teams

teams: promotions, trends, recommendations

fairly shared!

goal: fairly share the resources without static partitioning

partition utilizations (per team)

promotions: 45% CPU, 100% RAM

trends: 75% CPU, 100% RAM

recommendations: 100% CPU, 50% RAM

observation: a dominant resource bottlenecks each team from running any more jobs/tasks

dominant resource bottlenecks (per team)

promotions: 45% CPU, 100% RAM (bottleneck: RAM)

trends: 75% CPU, 100% RAM (bottleneck: RAM)

recommendations: 100% CPU, 50% RAM (bottleneck: CPU)

insight: allocating a fair share of each team’s dominant resource guarantees they can run at least as many jobs/tasks as with static partitioning!

… if my team gets at least 1/N of my dominant resource I will do no worse than if I had my own cluster, but I might do better when resources are available!

DRF in Mesos

masters

framework

① frameworks specify a role when they register (i.e., the team to charge for the resources)

② master calculates each role’s dominant resource (dynamically) and allocates appropriately
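To make the idea concrete, here is a small greedy sketch of dominant resource fairness: repeatedly give one more task to the role with the lowest dominant share, as long as that task still fits. The cluster size (9 CPUs, 18 GB) and per-task demands are illustrative numbers, not figures from this talk, and the function is a simplification rather than the Mesos allocator.

```python
from typing import Dict, Tuple

Vector = Tuple[float, float]  # (cpus, mem_gb)


def dominant_share(usage, total) -> float:
    """A role's dominant share is its largest fractional use of any one resource."""
    return max(u / t for u, t in zip(usage, total))


def drf_allocate(total: Vector, demands: Dict[str, Vector]) -> Dict[str, int]:
    """Greedy DRF sketch: returns how many tasks each role ends up running."""
    used = [0.0, 0.0]
    usage = {role: [0.0, 0.0] for role in demands}
    tasks = {role: 0 for role in demands}
    while True:
        # Only consider roles whose next task still fits in the cluster.
        runnable = [role for role, d in demands.items()
                    if all(used[i] + d[i] <= total[i] for i in range(2))]
        if not runnable:
            return tasks
        role = min(runnable, key=lambda r: dominant_share(usage[r], total))
        for i in range(2):
            used[i] += demands[role][i]
            usage[role][i] += demands[role][i]
        tasks[role] += 1


# Hypothetical per-task demands: promotions is RAM-heavy, trends is CPU-heavy.
allocation = drf_allocate(
    total=(9, 18),
    demands={"promotions": (1, 4), "trends": (3, 1)},
)
print(allocation)
```

With these numbers both roles end at a dominant share of 2/3: promotions holds 12 of 18 GB and trends holds 6 of 9 CPUs.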

Step 4: Profit (statistical multiplexing)

$

in practice, fair sharing is insufficient

weighted fair sharing

teams: promotions, trends, recommendations

weights: 0.17, 0.5, 0.33

Mesos implements weighted DRF

masters

masters can be configured with weights per role

resource allocation decisions incorporate the weights to determine dominant fair shares
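One common way to fold weights into DRF is to divide each role's dominant share by its weight and serve the role with the smallest result next. The talk does not spell out the exact formula Mesos uses, so treat the division below as an assumption; the usage figures are also made up, and only the weights come from the slide.

```python
def weighted_dominant_share(usage, total, weight):
    """Assumed formulation: dominant share scaled down by the role's weight."""
    return max(u / t for u, t in zip(usage, total)) / weight


weights = {"promotions": 0.17, "trends": 0.5, "recommendations": 0.33}
total = (100.0, 100.0)  # hypothetical cluster: 100 CPUs, 100 GB RAM
usage = {               # hypothetical current usage per role (cpus, mem_gb)
    "promotions": (10.0, 5.0),
    "trends": (20.0, 30.0),
    "recommendations": (10.0, 10.0),
}

shares = {role: weighted_dominant_share(usage[role], total, weights[role])
          for role in weights}
next_role = min(shares, key=shares.get)  # role that is furthest below its weighted share
print(shares, next_role)
```

Here trends uses the most resources in absolute terms, yet its higher weight keeps its weighted share close to that of promotions.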

in practice, weighted fair sharing is still insufficient

a non-cooperative framework (i.e., has long tasks or is buggy) can get allocated too many resources

Mesos provides reservations

slaves can be configured with resource reservations for particular roles (dynamic, time based, and percentage based reservations are in development)

resource offers include the reservation role (if any)

masters

framework (trends)

offer: hostname, 4 CPUs, 4 GB RAM, role: trends

reservations

(figure: pie chart of reservations with promotions 40%, trends 20%, and recommendations 40%; one slice is further split into used 10% and unused 30%)

reservations provide guarantees, but at the cost of utilization

revocable resources

masters

framework (promotions)

reserved resources that are unused can be allocated to frameworks from different roles but those resources may be revoked at any time

offer: hostname, 4 CPUs, 4 GB RAM, role: trends

preemption via revocation

… my tasks will not be killed unless I’m using revocable resources!
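The interaction between reservations and revocable resources can be sketched as follows. Everything here (`ReservedSlave`, `RunningTask`, the eviction order) is hypothetical; the sketch only mirrors the rule on the slides: tasks from other roles may borrow unused reserved resources, but they are the ones revoked when the reserving role needs them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunningTask:
    name: str
    role: str
    cpus: float
    revocable: bool


class ReservedSlave:
    """Toy slave whose CPUs are all reserved for one role (an assumption for brevity)."""

    def __init__(self, cpus: float, reserved_role: str):
        self.cpus = cpus
        self.reserved_role = reserved_role
        self.tasks: List[RunningTask] = []

    def free_cpus(self) -> float:
        return self.cpus - sum(t.cpus for t in self.tasks)

    def launch(self, name: str, role: str, cpus: float) -> List[str]:
        """Launch a task and return the names of tasks revoked to make room."""
        owner = role == self.reserved_role
        revocable_cpus = sum(t.cpus for t in self.tasks if t.revocable)
        available = self.free_cpus() + (revocable_cpus if owner else 0)
        if cpus > available:
            raise RuntimeError("not enough resources")
        revoked: List[str] = []
        while self.free_cpus() < cpus:
            # Only reached for the reserving role: reclaim borrowed resources.
            victim = next(t for t in self.tasks if t.revocable)
            self.tasks.remove(victim)
            revoked.append(victim.name)
        self.tasks.append(RunningTask(name, role, cpus, revocable=not owner))
        return revoked


slave = ReservedSlave(cpus=4, reserved_role="trends")
borrowed = slave.launch("batch-job", role="promotions", cpus=3)   # uses idle reserved CPUs
revoked = slave.launch("trends-service", role="trends", cpus=2)   # owner reclaims them
print(borrowed, revoked, [t.name for t in slave.tasks])
```

The "trends-service" task is never marked revocable, which matches the guarantee on the last slide.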

design decision ③: high-availability and fault-tolerance

high-availability and fault-tolerance a prerequisite @twitter

① framework failover

② master failover

③ slave failover

machine failure

process failure (bugs!)

upgrades

masters

① framework failover

framework

framework re-registers with master and resumes operation

all tasks keep running across framework failover!

framework

masters

② master failover

framework

after a new master is elected all frameworks and slaves connect to the new master

all tasks keep running across master failover!

③ slave failover

slave

mesos-slave

task task

(figure sequence: the mesos-slave process disappears from the slave and later returns; the two tasks are present in every frame)

③ slave failover @twitter

(large in-memory services, expensive to restart)

design decision ④: execution and isolation

execution

masters

framework

task: 3 CPUs, 2 GB RAM

frameworks launch fine-grained tasks for execution

if necessary, a framework can provide an executor to handle the execution of a task

slave

executor

mesos-slave

executor

task

task

goal: isolation

slave

isolation

mesos-slave

executor

task

task

containers

executor + task design means containers can have changing resource allocations
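A minimal sketch of what "changing resource allocations" could mean: the container's limits are recomputed as the sum of its tasks' resources whenever a task starts or finishes. The `Container` class is hypothetical, and the step that would actually push the new limits to the isolation mechanism is deliberately left out.

```python
from typing import Dict, Tuple


class Container:
    """Toy container whose limits track the tasks currently running in it."""

    def __init__(self) -> None:
        self.tasks: Dict[str, Tuple[float, float]] = {}  # task name -> (cpus, mem_gb)

    def limits(self) -> Tuple[float, float]:
        # Limits are simply the sum over running tasks (executor overhead ignored here).
        cpus = sum(c for c, _ in self.tasks.values())
        mem_gb = sum(m for _, m in self.tasks.values())
        return (cpus, mem_gb)

    def add_task(self, name: str, cpus: float, mem_gb: float) -> Tuple[float, float]:
        self.tasks[name] = (cpus, mem_gb)
        return self.limits()  # a real system would now grow the container

    def remove_task(self, name: str) -> Tuple[float, float]:
        del self.tasks[name]
        return self.limits()  # ...and shrink it again here


container = Container()
after_first = container.add_task("task-1", cpus=3, mem_gb=2)
after_second = container.add_task("task-2", cpus=1, mem_gb=1)
after_finish = container.remove_task("task-1")
print(after_first, after_second, after_finish)
```

Because tasks are first-class, the container never has to be sized for a peak that is not currently running.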

making the task first-class gives us true fine-grained resources sharing

requirement: fast task launching (i.e., milliseconds or less)

virtual machines: an anti-pattern

operating-system virtualization

containers (zones and projects)

control groups (cgroups)

namespaces

isolation support

tight integration with cgroups

CPU (upper and lower bounds)

memory

network I/O (traffic controller, in development)

filesystem (using LVM, in development)

statistics too

rarely does allocation == usage (humans are bad at estimating the amount of resources they’re using)

used @twitter for capacity planning (and oversubscription in development)

CPU upper bounds?

in practice, determinism trumps utilization

design decision ⑤: C++

requirements:

① performance

② maintainability (static typing)

③ interfaces to low-level OS (for isolation, etc)

④ interoperability with other languages (for library bindings)

garbage collection: a performance anti-pattern

consequences:

① antiquated libraries (especially around concurrency and networking)

② nascent community

github.com/3rdparty/libprocess

concurrency via futures/actors, networking via message passing

github.com/3rdparty/stout

monads in C++, safe and understandable utilities

but …

scalability simulations to 50,000+ slaves

@twitter we run multiple Mesos clusters each with 3500+ nodes

design decisions (recap)

① two-level scheduling and resource offers

② fair-sharing and revocable resources

③ high-availability and fault-tolerance

④ execution and isolation

⑤ C++

final remarks

frameworks

• Hadoop (github.com/mesos/hadoop)

• Spark (github.com/mesos/spark)

• DPark (github.com/douban/dpark)

• Storm (github.com/nathanmarz/storm)

• Chronos (github.com/airbnb/chronos)

• MPICH2 (in mesos git repository)

• Marathon (github.com/mesosphere/marathon)

• Aurora (github.com/twitter/aurora)

write your next distributed system with Mesos!

port a framework to Mesos: write a “wrapper”

~100 lines of code to write a wrapper (the more lines, the more you can take advantage of elasticity or other mesos features)

see http://github.com/mesos/hadoop

Thank You!

mesos.apache.org

mesos.apache.org/blog

@ApacheMesos

stateless master

to make master failover fast, we chose to make the master stateless

state is stored in the leaves, at the frameworks and the slaves

makes sense for frameworks that don’t want to store state (i.e., can’t actually failover)

consequences: slaves are fairly complicated (need to checkpoint), frameworks need to save their own state and reconcile (we built some tools to help, including a replicated log)
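A toy illustration of "state is stored in the leaves": a freshly elected master starts empty and rebuilds its view purely from what slaves report when they re-register. The class names, fields, and the `reregister` method are invented for this sketch and do not correspond to Mesos messages.

```python
from collections import defaultdict
from typing import Dict


class ToySlave:
    def __init__(self, hostname: str, checkpointed_tasks: Dict[str, str]):
        self.hostname = hostname
        # task id -> framework id, as the slave would have checkpointed it locally
        self.checkpointed_tasks = checkpointed_tasks


class ToyMaster:
    """Holds no durable state of its own; everything is learned from re-registrations."""

    def __init__(self) -> None:
        self.tasks_by_framework: Dict[str, Dict[str, str]] = defaultdict(dict)

    def reregister(self, slave: ToySlave) -> None:
        for task_id, framework_id in slave.checkpointed_tasks.items():
            self.tasks_by_framework[framework_id][task_id] = slave.hostname


slaves = [
    ToySlave("host1", {"task-1": "aurora", "task-2": "hadoop"}),
    ToySlave("host2", {"task-3": "aurora"}),
]

new_master = ToyMaster()          # just elected, knows nothing yet
for s in slaves:
    new_master.reregister(s)      # state flows back in from the leaves

rebuilt = {fw: dict(tasks) for fw, tasks in new_master.tasks_by_framework.items()}
print(rebuilt)
```

The cost noted on the slide shows up here too: the slaves must keep `checkpointed_tasks` correct, and frameworks still need to reconcile their own view.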

origins

Berkeley research project including Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, Ion Stoica

mesos.apache.org/documentation

ecosystem

mesos developers

operators

framework developers

a tour of mesos from different perspectives of the ecosystem

the operator

People who run and manage frameworks (Hadoop, Storm, MPI, Spark, Memcache, etc)

Tools: virtual machines, Chef, Puppet (emerging: PAAS, Docker)

“ops” at most companies (SREs at Twitter)

the static partitioners

for the operator, Mesos is a cluster manager

for the operator, Mesos is a resource manager

for the operator, Mesos is a resource negotiator

for the operator, Mesos replaces static partitioning of resources to frameworks with dynamic resource allocation

for the operator, Mesos is a distributed system with a master/slave architecture

masters

slaves

frameworks/applications register with the Mesos master in order to run jobs/tasks

masters

slaves

frameworks can be required to authenticate as a principal*

masters

SASL

SASL

CRAM-MD5 secret mechanism (Kerberos in development)

framework

masters initialized with secrets

Mesos is highly-available and fault-tolerant

the framework developer

Mesos uses Apache ZooKeeper for coordination

masters, slaves

Apache ZooKeeper

increase utilization with revocable resources and preemption

masters

framework1

hostname: 4 CPUs, 4 GB RAM, role: -

framework2 framework3

reservations

(figure: two reservation charts for framework1, framework2, and framework3, labeled 61% / 24% / 15% and 64% / 25% / 11%)

optimistic vs pessimistic

what to say here …

authorization*

principals can be used for:

authorizing allocation roles

authorizing operating system users (for execution)

agenda

motivation and overview

resource allocation

frameworks, schedulers, tasks, status updates

high-availability

resource isolation and statistics

security

case studies

I’d love to answer some questions with the help of my data!

I think I’ll try Hadoop.

your datacenter

+ Hadoop

happy?

Not exactly …

… Hadoop is a big hammer, but not everything is a nail!

I’ve got some iterative algorithms, I want to try Spark!

datacenter management

static partitioning

static partitioning considered harmful

(1) hard to share data

(2) hard to scale elastically (to exploit statistical multiplexing)

(3) hard to fully utilize machines

(4) hard to deal with failures

Hadoop …

(map/reduce)

(distributed file system)

HDFS

Could we just give Spark its own HDFS cluster too?

HDFS x 2

tee incoming data (2 copies)

periodic copy/sync

That sounds annoying … let’s not do that. Can we do any better though?

HDFS

During the day I’d rather give more machines to Spark but at night I’d rather give more machines to Hadoop!

I don’t want to deal with this!

the datacenter …

rather than think about the datacenter like this …

… is a computer

think about it like this …

datacenter computer

applications

resources

filesystem

(figure sequence: “mesos” and “kernel” labels are added to the stack, and “applications” becomes “frameworks”)

Step 1: filesystem

Step 2: mesos

run a “master” (or multiple for high availability)

run “slaves” on the rest of the machines

Step 3: frameworks

Step 4: profit $

Step 4: profit (statistical multiplexing) $

reduces CapEx and OpEx!

reduces latency!

Step 4: profit (utilize) $

Step 4: profit (failures) $

agenda: resource allocation

resource allocation

reservations

can reserve resources per slave to provide guaranteed resources

requires human participation (ops) to determine which roles should be reserved which resources

kind of like thread affinity, but across many machines (and not just for CPUs)

resource allocation

(1) allocate reserved resources to frameworks authorized for a particular role

(2) allocate unused reserved resources and unused unreserved resources fairly amongst all frameworks according to their weights

preemption

if a framework runs tasks outside of its reservations, they can be preempted (i.e., the task killed and the resources revoked) in favor of a framework running a task within its reservation

agenda: frameworks, schedulers, tasks, status updates

mesos

frameworks

kernel

framework ≈ distributed system

framework commonality

run processes/tasks simultaneously (distributed)

handle process failures (fault-tolerant)

optimize performance (elastic)

coordinate execution

frameworks are execution coordinators

frameworks are execution schedulers

end-to-end principle

“application-specific functions ought to reside in the end hosts of a network rather than intermediary nodes”

i.e., frameworks want to coordinate their tasks’ execution and they should be able to

framework anatomy

frameworks

scheduling API

scheduling

i’d like to run some tasks!

here are some resource offers!

resource offers

an offer represents the snapshot of available resources on a particular machine that a framework can use to run tasks

schedulers pick which resources to use to run their tasks

foo.bar.com: 4 CPUs, 4 GB RAM

“two-level scheduling”

mesos: controls resource allocations to schedulers

schedulers: make decisions about what to run given allocated resources

concurrency control

the same resources may be offered to different frameworks

pessimistic: no overlapping offers

optimistic: all overlapping offers

tasks

the “threads” of the framework, a consumer of resources (cpu, memory, etc)

either a concrete command line or an opaque description (which requires an executor)

tasks

here are some resources!

launch these tasks!

status updates

task status update!

more scheduling

i’d like to run some tasks!

agenda: high-availability

high-availability

high-availability (master)

task status update!

i’d like to run some tasks!

high-availability (framework)

high-availability (slave)

agenda: resource isolation and statistics

resource isolation

leverage Linux control groups (cgroups)

CPU (upper and lower bounds)

memory

network I/O (traffic controller, in progress)

filesystem (LVM, in progress)

resource statistics

rarely does allocation == usage (humans are bad at estimating the amount of resources they’re using)

per task/executor statistics are collected (for all fork/exec’ed processes too!)

can help with capacity planning

agenda: security

security

Twitter recently added SASL support, default mechanism is CRAM-MD5, will support Kerberos in the short term

agenda: case studies

framework commonality

run processes/tasks simultaneously (distributed)

handle process failures (fault-tolerant)

optimize performance (elastic)

as a “kernel”, mesos provides a lot of primitives that make writing a new framework easier, such as launching tasks and doing failure detection; why re-implement them each time!?

case study: chronos

distributed cron with dependencies

developed at airbnb

~3k lines of Scala!

distributed, highly available, and fault tolerant without any network programming!

http://github.com/airbnb/chronos

analytics

analytics + services

case study: aurora

“run 200 of these, somewhere, forever”

developed at Twitter

highly available (uses the mesos replicated log)

uses a python DSL to describe services

leverages service discovery and proxying (see Twitter commons)

http://github.com/twitter/aurora

frameworks

• Hadoop (github.com/mesos/hadoop)

• Spark (github.com/mesos/spark)

• DPark (github.com/douban/dpark)

• Storm (github.com/nathanmarz/storm)

• Chronos (github.com/airbnb/chronos)

• MPICH2 (in mesos git repository)

• Marathon (github.com/mesosphere/marathon)

• Aurora (github.com/twitter/aurora)

write your next distributed system with mesos!

port a framework to mesos: write a “wrapper” scheduler

~100 lines of code to write a wrapper (the more lines, the more you can take advantage of elasticity or other mesos features)

see http://github.com/mesos/hadoop

conclusions

datacenter management is a pain

mesos makes running frameworks on your datacenter easier as well as increasing utilization and performance while reducing CapEx and OpEx!

rather than build your next distributed system from scratch, consider using mesos

you can share your datacenter between analytics and online services!

Questions?

mesos.apache.org

@ApacheMesos

aurora

framework commonality

run processes simultaneously (distributed)

handle process failures (fault-tolerance)

optimize execution (elasticity, scheduling)

primitives

scheduler – distributed system “master” or “coordinator”

(executor – lower-level control of task execution, optional)

requests/offers – resource allocations

tasks – “threads” of the distributed system

scheduler

Apache Hadoop

Chronos

scheduler

(1) brokers for resources

(2) launches tasks

(3) handles task termination

brokering for resources

(1) make resource requests: 2 CPUs, 1 GB RAM, slave *

(2) respond to resource offers: 4 CPUs, 4 GB RAM, slave foo.bar.com

offers: non-blocking resource allocation

exist to answer the question:

“what should mesos do if it can’t satisfy a request?”

(1) wait until it can

(2) offer the best allocation it can immediately

resource allocation

Apache Hadoop

Chronos

request

allocator: dominant resource fairness, resource reservations

pessimistic: no overlapping offers

optimistic: all overlapping offers

offer

“two-level scheduling”

mesos: controls resource allocations to framework schedulers

schedulers: make decisions about what to run given allocated resources

end-to-end principle

“application-specific functions ought to reside in the end hosts of a network rather than intermediary nodes”

tasks

either a concrete command line or an opaque description (which requires a framework executor to execute)

a consumer of resources

task operations

launching/killing

health monitoring/reporting (failure detection)

resource usage monitoring (statistics)

resource isolation

cgroup per executor or task (if no executor)

resource controls adjusted dynamically as tasks come and go!

case study: chronos

distributed cron with dependencies

built at airbnb by @flo

before chronos

single point of failure (and AWS was unreliable)

resource starved (not scalable)

chronos requirements

fault tolerance

distributed (elastically take advantage of resources)

retries (make sure a command eventually finishes)

dependencies

chronos

leverages the primitives of mesos

~3k lines of scala

highly available (uses Mesos state)

distributed / elastic

no actual network programming!

after chronos

after chronos + hadoop

case study: aurora

“run 200 of these, somewhere, forever”

built at Twitter

before aurora

static partitioning of machines to services

hardware outages caused site outages

puppet + monit

ops couldn’t scale as fast as engineers

aurora

highly available (uses mesos replicated log)

uses a python DSL to describe services

leverages service discovery and proxying (see Twitter commons)

after aurora

power loss to 19 racks, no lost services!

more than 400 engineers running services

largest cluster has >2500 machines

(figure sequence: Mesos running Hadoop, Spark, MPI, Storm, and Chronos side by side across a shared pool of nodes)

Step 4: Profit (statistical multiplexing)

$
