Firmament: Fast, centralized cluster scheduling at scale
Ionel Gog, Malte Schwarzkopf, Adam Gleave, Robert N. M. Watson, Steven Hand
Meet Sesame, Inc.
● Sesame’s employees run:
1. Interactive data analytics that must complete in seconds
2. Long-running services that must provide high performance
3. Batch processing jobs that increase resource utilization
Ideal scheduler
The cluster scheduler must achieve:
1. Good task placements
○ high utilization without interference
2. Low task scheduling latency
○ support interactive tasks
○ no idle resources
State of the art: centralized vs. distributed
● Centralized, sophisticated algorithms [Borg, Quincy, Quasar]: good task placements
● Distributed, simple heuristics [Sparrow, Tarcil, Yaq-d]: low scheduling latency
● Hybrid, split workload, provide either [Mercury, Hawk, Eagle]
Can’t get both good placements and low latency for the entire workload!
Firmament provides a solution!
Firmament: centralized vs. distributed
➢ Firmament is a min-cost flow-based centralized scheduler that finds optimal task placements
● Centralized architecture
● Good task placements
● Low task scheduling latency
● Scales to 10,000+ machines
It builds on Quincy [Quincy: Fair Scheduling for Distributed Computing Clusters, SOSP 2009].

Flow-based intro
Min-cost flow scheduler
[Figure: interactive, batch, and service tasks placed on machines across Rack 1 and Rack 2]
A flow-based scheduler schedules all tasks:
● It schedules all tasks at the same time, even when several tasks prefer the same rack.
● It considers running tasks for migration or preemption.
● The result is a globally optimal placement!
Introduction to min-cost flow scheduling
[Figure: bipartite flow network with task nodes T0–T5 on the left and machine nodes M0–M5 on the right]
● Tasks and machines become nodes in a flow network; arcs connect each task to the machines it may run on.
● Each arc carries a cost, e.g., cost 3 for T0 → M0 and cost 5 for T0 → M1.
● Min-cost flow places tasks with minimum overall cost.
● Each task supplies one unit of flow, and a sink node S demands one unit per task (here, a flow demand of 6).
● Pushing flow from the tasks through the machines into the sink drains all supply (supply and demand reach 0), assigning every task to a machine.
How well does the Quincy approach scale?
[Figure: scheduling latency of simulated Quincy on the Google cluster trace at 50% utilization: 66 sec on average]
Quincy doesn’t scale. Too slow! 30% of tasks wait to be scheduled for over 33% of their runtime and waste resources.
Goal: sub-second scheduling latency in the common case.
Firmament contributions
● Low task scheduling latency
○ Uses the best-suited min-cost flow algorithm
○ Incrementally recomputes the solution
● Good task placement
○ Same optimal placements as Quincy
○ Customizable scheduling policies
Firmament scheduler: intro
[Figure: Firmament architecture: a scheduler on the master holds the task table, task statistics, and the cluster topology; agents on each machine report to it; inside the scheduler, a scheduling policy maintains the flow graph]
Specifying scheduling policies
A Firmament policy specification defines the flow graph:

class QuincyPolicy {
  Cost_t TaskToResourceNodeCost(TaskID_t task_id) {
    return task_unscheduled_time * quincy_wait_time_factor;
  }
  ...
};

N.B.: More details in the paper.
Firmament scheduler: submitting to the solver
[Figure: the scheduling policy defines the flow graph; the scheduler submits the graph to a min-cost, max-flow solver; most scheduling time is spent in the solver]
Algorithm worst-case complexities
(E: number of arcs, V: number of nodes, U: largest arc capacity, C: largest cost value; E > V > C ≅ U)

Algorithm                  Worst-case complexity
Cost scaling               O(V²E log(VC))   (used by Quincy)
Successive shortest path   O(V²U log(V))    (lower worst-case complexity)
Relaxation                 O(E³CU²)         (highest worst-case complexity)

[Figure: solver runtime vs. cluster size on a subsampled Google trace, 50% slot utilization, Quincy policy]
● Cost scaling is too slow beyond 1,000 machines.
● Successive shortest path only scales to ~100 machines.
● Relaxation meets our sub-second latency goal.
Why is Relaxation fast?
[Figure: flow network with tasks T0–T4, machines M0–M3 grouped under racks R0 and R1, and sink S; each machine → sink arc has capacity 1 task]
Relaxation pushes flow in a single(-ish) pass; it is well-suited to the graph structure.
Slow Relaxation: pathological edge cases
[Figure: tasks T0–T4 and machines M0–M3 at high and medium utilization; fully utilized machines have capacity 0 tasks on their arc to the sink S]
Relaxation suffers in pathological edge cases: when machines fill up, flow pushed onto a full machine must be pushed back and rerouted (tasks → machine, machine → sink, machine → tasks), so Relaxation cannot push flow in a single pass any more.
How bad does Relaxation’s edge case get?
Experimental setup:
● Simulated 12,500-machine cluster
● Used the Quincy scheduling policy
● Utilization >90%, up to an oversubscribed cluster

[Figure: algorithm runtime vs. utilization; Quincy policy, 12,500-machine cluster, job of increasing size]
● Relaxation’s runtime increases with utilization.
● Cost scaling is unaffected by high utilization, and is the faster choice at very high utilization.
Scheduling policy: solver runs both algorithms
[Figure: the min-cost, max-flow solver runs both cost scaling and Relaxation on the input graph and takes the first output flow]
[Figure: algorithm runtime vs. utilization; Quincy policy, 12,500-machine cluster, job of increasing size]
Algorithm runtime is still high at utilization > 94%.
Incremental cost scaling
[Figure: ordinarily, cost scaling discards its graph state after producing the output flow; incremental cost scaling keeps the previous graph state and applies only the graph changes on the next run]
Incremental cost scaling results
[Figure: algorithm runtime vs. utilization; Quincy policy, 12,500-machine cluster, job of increasing size]
Incremental cost scaling is ~2x faster.
Note: many additional experiments in the paper.
Evaluation
Centralized vs. distributed:
● Centralized, sophisticated algorithms (e.g., Borg, Quincy, Quasar): good task placements
● Distributed, simple heuristics (e.g., Sparrow, Tarcil): low scheduling latency
Does Firmament choose good placements with low latency?
How do Firmament’s placements compare to other schedulers?
Experimental setup:
● Homogeneous 40-machine cluster, 10G network
● Mixed batch/service/interactive workload
[Figure: the scheduler places interactive and service tasks on machines M1–M10 across racks R1 and R2 behind an aggregation switch; network utilization ranges from low to high]
Firmament chooses good placements
[Figure: task response time distribution under each scheduler; a task takes 5 seconds on an idle cluster]
● Sparrow is unaware of resource utilization: 20% of tasks experience poor placement.
● Centralized Kubernetes and Docker still suffer: 20% of tasks experience poor placement.
● Firmament avoids co-location interference and outperforms both the centralized and distributed schedulers.
How well does Firmament handle challenging workloads?
Experimental setup:
● Simulated 12,500-machine Google cluster
● Used the centralized Quincy scheduling policy
● Utilization varies between 75% and 95%

Firmament handles challenging workloads at low latency
[Figure: scheduling latency on the Google trace, 12,500 machines, utilization between 75% and 90%; interactive workloads simulated by scaling down task runtimes (median task runtime: 420s unscaled, 1.7s at 250x acceleration)]
● Cost scaling: average latency is too high even without many short tasks (45 seconds on average, tuned, vs. the Quincy setup’s 66s).
● Relaxation at 250x acceleration: 75th-percentile latency of 2 sec, maximum of 57 sec.
● Firmament’s common-case latency is sub-second even at 250x acceleration.
Conclusions
● Low task scheduling latency
○ Uses the best algorithm at all times
○ Incrementally recomputes the solution
● Good task placement
○ Same optimal placements as Quincy
○ Customizable scheduling policies
Open-source and available at: firmament.io