Mesos: A Platform for Fine-Grained Resource Sharing
in Data Centers (I)
UC BERKELEY
Anthony D. Joseph
LASER Summer School September 2013
My Talks at LASER 2013
1. AMP Lab introduction
2. The Datacenter Needs an Operating System
3. Mesos, part one
4. Dominant Resource Fairness
5. Mesos, part two
6. Spark
Collaborators
• Matei Zaharia
• Benjamin Hindman
• Andy Konwinski
• Ali Ghodsi
• Randy Katz
• Scott Shenker
• Ion Stoica
Modern Data Center Paradigm

Commodity machines (100s – 10,000s of machines) » Attached storage devices

Data distributed and replicated across nodes » Data locality to computation matters

Solution: Use a datacenter computing framework » Divide jobs into smaller tasks, so that jobs can take turns accessing each node and ideally locally accessing data » Tasks are fine-grained both in time (short) and in space (use a fraction of a machine)
Datacenter Scheduling Problem

Rapid innovation in datacenter computing frameworks (e.g., Hadoop, Dryad, Pregel, Pig, Percolator, CIEL)

No single framework optimal for all applications

Want to run multiple frameworks in a single datacenter » …to maximize utilization » …to share data between frameworks

Where We Want to Go

[Figure: today the cluster is statically partitioned into separate Hadoop, Pregel, and MPI partitions; the goal is dynamic sharing of a single shared cluster]
Solution: Apache Mesos

[Figure: Mesos runs as a thin layer over all cluster nodes; frameworks such as Hadoop and Pregel run on top of it rather than on static partitions]
Mesos is a common resource sharing layer over which diverse frameworks can run
Run multiple instances of the same framework » Isolate production and experimental jobs » Run multiple versions of the framework concurrently
Build specialized frameworks targeting particular problem domains » Better performance than general-purpose abstractions
http://mesos.apache.org/
Mesos Goals
High utilization of resources
Support diverse frameworks (current & future)
Scalability to 10,000s of nodes
Reliability in face of failures
Resulting design: Small microkernel-like core that pushes scheduling logic to frameworks
Previous Approaches

Locality less important (HPC & grid computing) » Expensive, dedicated storage (SANs, parallel FS) » Expensive, high-speed networks (InfiniBand)

Fine-grained task model infeasible » Ad-hoc programs (many barriers, tight message passing) » Legacy programs (Fortran 77)

Approach taken: coarse-grained sharing » Job specifies number of machines and amount of time needed » Scheduler queues job and allocates all machines at the same time
Mesos Design Elements

Fine-grained sharing: » Allocation at the level of tasks within a job » Improves utilization, latency, and data locality

Resource offers: » Simple, scalable, application-controlled scheduling mechanism
Element 1: Fine-Grained Sharing

[Figure: Coarse-Grained Sharing (HPC) gives Frameworks 1–3 static blocks of machines over the storage system (e.g. HDFS); Fine-Grained Sharing (Mesos) interleaves short tasks from all three frameworks across every machine. + Improved utilization, responsiveness, data locality]
Element 2: Resource Offers

Option: Global scheduler » Frameworks express needs in a specification language; global scheduler matches them to resources

+ Can make optimal decisions

– Complex: language must support all framework needs
– Difficult to scale and to make robust
– Future frameworks may have unanticipated needs
Element 2: Resource Offers

Mesos: Resource offers » Offer available resources to frameworks, let them pick which resources to use and which tasks to launch

+ Keeps Mesos simple, lets it support future frameworks
– Decentralized decisions might not be optimal
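The offer round trip can be sketched in a few lines. This is a toy model of the mechanism, not the actual Mesos API: the names `master_offer_round` and `GreedyFramework` are illustrative.

```python
# Minimal sketch of a resource-offer round: the master offers what is
# free, the framework picks resources and returns tasks to launch.

def master_offer_round(available, framework):
    """Offer every currently free resource to one framework; the
    framework decides which to use and which tasks to launch."""
    offer = list(available)                  # e.g. [("m1", 4.0, 8192), ...]
    accepted, tasks = framework.accept(offer)
    # Declined resources go back in the pool for the next framework.
    remaining = [r for r in offer if r not in accepted]
    return remaining, tasks

class GreedyFramework:
    """Toy framework that takes every offered machine, one task each."""
    def accept(self, offer):
        tasks = ["task-on-" + host for host, cpus, mem in offer]
        return list(offer), tasks
```

Because the framework (not the master) decides which resources to take, the master never needs a specification language for framework needs.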
Make the datacenter a real computer!

[Figure: the Datacenter "OS" (e.g., Apache Mesos) runs over heterogeneous machines with their own node OSes (e.g., Linux, Windows). The existing stack on top includes Hadoop, MPI, Hypertable, and Cassandra; the AMP stack adds Spark (supporting interactive and iterative data analysis, e.g., ML algorithms), SCADS (a consistency-adjustable data store), PIQL (a predictive & insightful query language), and Hive.]
Allocation Policies
Mesos controls how many resources each framework can get, but not which resources
Allocation policies are pluggable to suit organization needs
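A pluggable policy might look like the sketch below. This is hypothetical illustration, not Mesos's allocator interface: the policy decides *who is offered resources next* (how much each framework gets over time), while frameworks still decide which offered resources to use.

```python
# Sketch of pluggable allocation policies (class and method names are
# illustrative). The master asks the policy which framework to offer
# free resources to.

class AllocationPolicy:
    def next_framework(self, frameworks, usage):
        raise NotImplementedError

class FairSharePolicy(AllocationPolicy):
    """Offer to the framework furthest below its share, approximated
    here as the one with the least current usage."""
    def next_framework(self, frameworks, usage):
        return min(frameworks, key=lambda f: usage.get(f, 0))

class PriorityPolicy(AllocationPolicy):
    """Always offer to the highest-priority framework first."""
    def __init__(self, priorities):
        self.priorities = priorities
    def next_framework(self, frameworks, usage):
        return max(frameworks, key=lambda f: self.priorities.get(f, 0))
```

An organization swaps policies (fair sharing, priorities, quotas) without touching frameworks or the rest of the master.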
Example: Hierarchical Fair Sharing

[Figure: cluster share policy for Facebook.com: the cluster is split between the Spam and Ads departments (e.g., 20% / 80%), and the Spam share is split between User 1 and User 2 (e.g., 70% / 30%), so User 1's jobs get 14% and User 2's get 6% of the whole cluster. The utilization chart shows Jobs 1–4 receiving these shares over time, with a single job able to use up to 100% of the cluster when the rest of the hierarchy is idle.]
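The hierarchical shares above reduce to products of weights along each path from the root; the numbers come from the slide, and the helper name is illustrative.

```python
# A leaf's effective cluster share is the product of the weights on the
# path from the root of the sharing hierarchy down to that leaf.

def effective_share(path_weights):
    share = 1.0
    for w in path_weights:
        share *= w
    return share

# Spam Dept. holds 20% of the cluster; User 1 holds 70% of Spam:
user1 = effective_share([0.20, 0.70])   # 14% of the whole cluster
user2 = effective_share([0.20, 0.30])   # 6%
```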
Mesos Architecture

[Figure: framework schedulers (a Hadoop JobTracker and an MPI scheduler) sit above the Mesos master, which contains a pluggable allocation module; Hadoop and MPI executors run tasks on Slaves 1–3. Resource offers flow up to the schedulers, task launches flow down.]

Slaves send status updates about available resources

Pluggable policy picks which framework to offer resources to

Framework scheduler selects resources and provides tasks

Framework executors run tasks and may persist across tasks
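The framework-scheduler side of this picture can be sketched as below. The callback names echo the Mesos scheduler API, but the class and signatures here are hypothetical stand-ins, not the real bindings.

```python
# Sketch of a framework scheduler: the master calls resource_offers()
# with available resources, and executors report back via status_update().

class MyScheduler:
    def __init__(self, pending_tasks):
        self.pending = list(pending_tasks)
        self.last_status = None

    def resource_offers(self, offers):
        """Return (offer_id, task) pairs for tasks to launch on
        offers that meet this framework's minimum requirements."""
        launches = []
        for offer_id, cpus, mem in offers:
            if self.pending and cpus >= 1 and mem >= 1024:
                launches.append((offer_id, self.pending.pop(0)))
        return launches

    def status_update(self, task, state):
        """Called as executors report task state transitions."""
        self.last_status = (task, state)
```

Note that all job-specific scheduling logic lives here, outside the Mesos master, which is what keeps the master small and framework-agnostic.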
Resource Offer Details

A resource offer is a set of machine–resource tuples » { [m1, 1 CPU, 1 GB], [m2, 4 CPUs, 16 GB] }

Resource offers count towards a framework's share » Rescinded after a timeout (incentive to reply fast)

Optimizations » Frameworks indicate interest to get offers » Frameworks can set filters to automatically filter out certain nodes or nodes with too few resources
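The filter optimization amounts to a predicate the master applies before sending offers, so a framework is never asked about nodes it would always decline. A hedged sketch (the function name and parameters are illustrative, not the actual Mesos filter API):

```python
# Drop offers from blocked nodes or with too few resources before they
# are sent to the framework.

def apply_filters(offers, min_cpus=1.0, min_mem=1024, blocked_nodes=()):
    """Keep only offers from acceptable nodes with enough resources."""
    return [(host, cpus, mem)
            for host, cpus, mem in offers
            if host not in blocked_nodes
            and cpus >= min_cpus and mem >= min_mem]
```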
Dynamic Resource Sharing

[Figure: cluster utilization (0–100%) over time (seconds) as Torque and three Hadoop instances dynamically share the cluster through Mesos.]
Which Offers to Accept?

Delay scheduling » Initially only accept preferred (e.g., local) resources » Accept any resource after a timeout (1–5 seconds)

Can achieve near-optimal locality
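Delay scheduling in miniature: decline non-local offers at first, then accept any node once the wait exceeds the timeout. This is a sketch; the function name and the 5-second default are illustrative.

```python
# Prefer nodes holding the task's input data; give up on locality only
# after waiting past the timeout.

def choose_node(offered_nodes, preferred_nodes, waited_s, timeout_s=5.0):
    """Return a node to launch on, or None to keep waiting for locality."""
    local = [n for n in offered_nodes if n in preferred_nodes]
    if local:
        return local[0]              # data-local launch, no waiting needed
    if waited_s >= timeout_s and offered_nodes:
        return offered_nodes[0]      # timed out: accept a non-local node
    return None                      # decline; hope for a local offer soon
```

Because tasks are short, a local slot usually frees up within the timeout, which is why a wait of only a few seconds recovers most of the locality.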
Multiple Hadoops Experiment

[Figure: as in the fine-grained sharing diagram, three Hadoop instances either run on static partitions over the storage system (e.g. HDFS) or interleave their tasks across the shared cluster under Mesos.]
Data Locality on Mesos

16 Hadoop MapReduce instances over a shared file system

[Figure: local map tasks (%) and job running time (s) under four configurations: static partitioning, Mesos with no delay scheduling, Mesos with 1 s delay scheduling, and Mesos with 5 s delay scheduling.]
Some Related Datacenter Resource Managers

Hadoop YARN » Open-source follow-on to Hadoop with pluggable allocation policies » Primary focus is Hadoop jobs

Google's Omega resource manager » Closed-source follow-on to Google's original resource manager » Framework-specific schedulers use an optimistic concurrency model – all compete simultaneously to select resources
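Omega's optimistic concurrency can be sketched as a shared cluster state against which every scheduler submits claims; a claim commits only if its resources are still free, otherwise the scheduler retries on fresh state. The class and method names below are illustrative, not Omega's actual interface.

```python
# Toy model of optimistic concurrency over shared cluster state.

class SharedClusterState:
    def __init__(self, machines):
        self.free = set(machines)

    def try_commit(self, claim):
        """Atomically grant a set of machines, or fail on conflict."""
        if claim <= self.free:       # are all claimed machines still free?
            self.free -= claim
            return True
        return False                 # conflict: retry against fresh state
```

This contrasts with Mesos's offers: in Omega every scheduler sees the whole cluster and conflicts are resolved at commit time, while in Mesos the master avoids conflicts up front by offering each resource to one framework at a time.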
In Mesos Part II Lecture
Implementation Details and Supported Frameworks
Isolation
Handling Mesos Master Failure
Resource Revocation
Scalability
Results and Macrobenchmarks
Summary (Part One)
Mesos is a platform for sharing data centers among diverse cluster computing frameworks » Enables efficient fine-grained sharing » Gives frameworks control over scheduling » Supports current and future frameworks » Achieves high utilization
My Talks at LASER 2013
1. AMP Lab introduction
2. The Datacenter Needs an Operating System
3. Mesos, part one
4. Dominant Resource Fairness
5. Mesos, part two
6. Spark