Upload
hahanh
View
252
Download
2
Embed Size (px)
Citation preview
Sparrow Distributed Low-Latency Scheduling
Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica
Sparrow schedules tasks in clusters
using a decentralized, randomized approach
support constraints and fair sharing
and provides response times
within 12% of ideal
Scheduling Setting
…
…
Map Reduce/Spark/Dryad
Job
…
Task Task Task Map Reduce/Spark/Dryad
Job Task Task
Job Latencies Rapidly Decreasing
10 min. 10 sec. 100 ms 1 ms
2004: MapReduce batch job
2009: Hive query
2010: Dremel Query
2012: Impala query 2010:
In-memory Spark query
2013: Spark
streaming
10 min. 10 sec. 100 ms 1 ms
2004: MapReduce batch job
2009: Hive query
2010: Dremel Query
2012: Impala query 2010:
In-memory Spark query
2013: Spark
streaming
1000 16-core machines
26 decisions/
second
Scheduler throughput
1.6K decisions/
second
160K decisions/
second
16M decisions/
second
Millisecond Latency
Quality Placement
Fault Tolerant
High Throughput
Today: Completely Centralized
Less centralization
Sparrow: Completely Decentralized
✗
✗
✓
✗ ✓
✓
✓
?
Millisecond Latency
Quality Placement
Fault Tolerant
High Throughput
Today: Completely Centralized
Less centralization
Sparrow: Completely Decentralized
✗
✗
✓
✗ ✓
✓
✓
✓
Sparrow Decentralized approach
Existing randomized approaches Batch Sampling Late Binding Analytical performance evaluation
Handling constraints
Fairness and policy enforcement
Within 12% of ideal on 100 machines
Scheduling with Sparrow
Worker
Worker
Worker
Worker
Worker Scheduler
Scheduler
Scheduler
Scheduler Job
Worker
Simulated Results
100-task jobs in 10,000-node cluster, exp. task durations
Omniscient: infinitely fast centralized
scheduler
!"!#"!$""!$#"!%""!%#"!&""!&#"
!" !"'% !"'( !"') !"'* !$
+,-./0-,!123,!43-5
6/78
+708/39302-:2,0;
Per-task sampling
Worker
Worker
Worker
Worker
Worker Scheduler
Scheduler
Scheduler
Scheduler Job
Worker
Power of Two Choices
Per-task sampling
Worker
Worker
Worker
Worker
Worker Scheduler
Scheduler
Scheduler
Scheduler Job
Worker
Power of Two Choices
Per-task sampling
Worker
Worker
Worker
Worker
Worker Scheduler
Scheduler
Scheduler
Scheduler Job
Worker
Power of Two Choices
Per-task sampling
Worker
Worker
Worker
Worker
Worker Scheduler
Scheduler
Scheduler
Scheduler Job
Worker
Power of Two Choices
Simulated Results
100-task jobs in 10,000-node cluster, exp. task durations
!"!#"!$""!$#"!%""!%#"!&""!&#"
!" !"'% !"'( !"') !"'* !$
+,-./0-,!123,!43-5
6/78
+708/39,:;17-<=302->2,0?
70% cluster load
!"
!#"
!$""
!$#"
!%""
$ $" $"" $"""
&'()*+('!,-.'!/.(0
,1(2(34*5
6'7!,1(2
Response Time Grows with Tasks/Job!
Per-Task Sampling
Worker
Worker
Worker
Worker
Worker Scheduler
Scheduler
Scheduler
Scheduler Job
Worker
✓
✓
Task 1
Task 2
Per-task
Per-task Sampling
Worker
Worker
Worker
Worker
Worker Scheduler
Scheduler
Scheduler
Scheduler Job
Worker
Place m tasks on the least loaded of d�m slaves
Per-task
✓
✓
4 probes (d = 2)
Batch
Per-task Sampling
Worker
Worker
Worker
Worker
Worker Scheduler
Scheduler
Scheduler
Scheduler Job
Worker
Place m tasks on the least loaded of d�m slaves
Per-task
✓
✓
4 probes (d = 2)
Batch
Per-task versus Batch Sampling
70% cluster load
!"
!#"
!$""
!$#"
!%""
$ $" $"" $"""
&'()*+('!,-.'!/.(0
,1(2(34*5
6'7!,1(2 819:;
Simulated Results
100-task jobs in 10,000-node cluster, exp. task durations
!"!#"!$""!$#"!%""!%#"!&""!&#"
!" !"'% !"'( !"') !"'* !$
+,-./0-,!123,!43-5
6/78
+708/39,:;17-<=7>?@A302-?2,0>
Queue length poor predictor of wait time
Worker
Worker
80 ms 155 ms
530 ms
Poor performance on heterogeneous workloads
Late Binding
Worker
Worker
Worker
Worker
Worker Scheduler
Scheduler
Scheduler
Scheduler Job
Worker
Place m tasks on the least loaded of d�m slaves
4 probes (d = 2)
Late Binding
Scheduler
Scheduler
Scheduler
Scheduler Job
Place m tasks on the least loaded of d�m slaves
4 probes (d = 2)
Worker
Worker
Worker
Worker
Worker
Worker
Late Binding
Scheduler
Scheduler
Scheduler
Scheduler Job
Place m tasks on the least loaded of d�m slaves
4 probes (d = 2)
Worker
Worker
Worker
Worker
Worker
Worker
Late Binding
Scheduler
Scheduler
Scheduler
Scheduler Job
Place m tasks on the least loaded of d�m slaves
Worker
Worker
Worker
Worker
Worker
Worker
Late Binding
Scheduler
Scheduler
Scheduler
Scheduler Job
Place m tasks on the least loaded of d�m slaves
Worker requests
task Worker
Worker
Worker
Worker
Worker
Worker
Worker
Late Binding
Scheduler
Scheduler
Scheduler
Scheduler Job
Place m tasks on the least loaded of d�m slaves
Worker requests
task Worker
Worker
Worker
Worker
Worker
Simulated Results
100-task jobs in 10,000-node cluster, exp. task durations
!"!#"!$""!$#"!%""!%#"!&""!&#"
!" !"'% !"'( !"') !"'* !$
+,-./0-,!123,!43-5
6/78
+708/39,:;17-<=7>?@=7>?@A67>,!=20820BC302-?2,0>
Job Constraints
Scheduler
Scheduler
Scheduler
Scheduler Job
Worker
Worker
Worker
Worker
Worker
Worker
Restrict probed machines to those that satisfy the constraint
Per-Task Constraints
Scheduler
Scheduler
Scheduler
Scheduler Job
Worker
Worker
Worker
Worker
Worker
Worker
Probe separately for each task
Technique Recap
Scheduler
Scheduler
Scheduler
Scheduler Batch sampling
+ Late binding
+ Constraint handling
Worker
Worker
Worker
Worker
Worker
Worker
Spark on Sparrow
Worker
Worker
Worker
Worker
Worker
Worker
Query: DAG of Stages
Sparrow Scheduler
Job
Spark on Sparrow
Worker
Worker
Worker
Worker
Worker
Worker
Query: DAG of Stages
Sparrow Scheduler
Job
Spark on Sparrow
Worker
Worker
Worker
Worker
Worker
Worker
Query: DAG of Stages
Sparrow Scheduler
Job
How does Sparrow compare to Spark’s native scheduler?
!"!#"""!$"""!%"""!&"""!'"""!("""
!"!#"""!$"""!%"""!&"""!'"""!("""
)*+,-.+*!/01*!21+3
/4+5!678490-.!21+3
:,485!.490;*!+<=*>7?*8:,488-@A>*4?
100 16-core EC2 nodes, 10 tasks/job, 10 schedulers, 80% load
TPC-H Queries: Background
TPC-H: Common benchmark for analytics workloads
Sparrow
Spark: Distributed in-memory analytics framework
Shark: SQL execution engine
TPC-H Queries
100 16-core EC2 nodes, 10 schedulers, 80% load
!"!#""!$"""!$#""!%"""!%#""!&"""!&#""!'"""
(& (' () ($%
*+,-./,+!012+!32,4
'%$5!32+674 #&8)!32+674 599$!32+674
*:/6.2;+<=>:,?!,:2-@1/AB:>CD!,:2-@1/A
B:>CD!E!@:>+!F1/61/AG6+:@
95
75
25
50
Percentiles
5
TPC-H Queries
100 16-core EC2 nodes, 10 schedulers, 80% load
!"!#""!$"""!$#""!%"""!%#""!&"""!&#""!'"""
(& (' () ($%
*+,-./,+!012+!32,4
'%$5!32+674 #&8)!32+674 599$!32+674
*:/6.2;+<=>:,?!,:2-@1/AB:>CD!,:2-@1/A
B:>CD!E!@:>+!F1/61/AG6+:@!3"!6+@:H4
Within 12% of ideal Median queuing delay of 9ms
Fault Tolerance
Scheduler 1
Scheduler 2
Spark Client 1 ✗ Spark Client 2
Timeout: 100ms Failover: 5ms
Re-launch queries: 15ms
!"!#"""!$"""!%"""!&"""
!" !#" !$" !%" !&" !'" !(")*+,!-./
!"!#"""!$"""!%"""!&"""
01,23!2,.456.,!7*+,!-+./
89*:12,;492<!=:*,67!#
;492<!=:*,67!$
When does Sparrow not work as well?
High cluster load
!"!#"!$"!%"!&"!'""!'#"!'$"
!" !"(# !"($ !"(% !"(& !'
)*+,-.+*!/01*!21+3
4-56
7,588-9:1.0+;0*.<
Related Work
Centralized task schedulers: e.g., Quincy Two level schedulers: e.g., YARN, Mesos Coarse-grained cluster schedulers: e.g., Omega Load balancing: single task
Batch sampling +
Late binding +
Constraint handling
www.github.com/radlab/sparrow
Sparrows provides near-ideal job response times without global visibility
Scheduler
Scheduler
Scheduler
Scheduler
Worker
Worker
Worker
Worker
Worker
Worker