
Sparrow: Distributed, Low-Latency Scheduling

Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica

Sparrow schedules tasks in clusters using a decentralized, randomized approach, supports constraints and fair sharing, and provides response times within 12% of ideal.

Scheduling Setting

[Diagram: frameworks such as MapReduce, Spark, and Dryad submit jobs, each consisting of many parallel tasks]

Job Latencies Rapidly Decreasing

10 min. → 10 sec. → 100 ms → 1 ms
2004: MapReduce batch job
2009: Hive query
2010: Dremel query
2010: In-memory Spark query
2012: Impala query
2013: Spark streaming

Scheduling challenges:
Millisecond Latency
Quality Placement
Fault Tolerance
High Throughput

Scheduler throughput required for 1000 16-core machines:

10 min. tasks → 26 decisions/second
10 sec. tasks → 1.6K decisions/second
100 ms tasks → 160K decisions/second
1 ms tasks → 16M decisions/second
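These rates follow from simple arithmetic: with every core busy, the cluster completes (and must re-schedule) slots ÷ task-duration tasks per second. A minimal sketch of that calculation, using only the figures from the slide:

```python
# Required scheduler throughput = task slots / task duration:
# with all slots busy, this many tasks finish (and must be
# replaced by new scheduling decisions) every second.
slots = 1000 * 16  # 1000 machines x 16 cores

for label, duration_s in [("10 min", 600), ("10 sec", 10),
                          ("100 ms", 0.1), ("1 ms", 0.001)]:
    print(f"{label:>7} tasks: {slots / duration_s:,.1f} decisions/second")
```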

Millisecond Latency · Quality Placement · Fault Tolerance · High Throughput

Today: completely centralized ✗
← less centralization →
Sparrow: completely decentralized ✓ (?)

Sparrow: Decentralized Approach

Existing randomized approaches
Batch sampling
Late binding
Analytical performance evaluation
Handling constraints
Fairness and policy enforcement
Within 12% of ideal on 100 machines

Scheduling with Sparrow: Random Placement

[Diagram: multiple schedulers; each assigns a job's tasks to randomly selected workers]

Simulated Results

100-task jobs in 10,000-node cluster, exponentially distributed task durations
Omniscient baseline: an infinitely fast centralized scheduler
[Plot: response time (ms) vs. cluster load; Random placement vs. Omniscient]

Per-task Sampling: Power of Two Choices

[Diagram: for each task, the scheduler probes two randomly selected workers and places the task on the less loaded one]
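A minimal sketch of per-task sampling under the power of two choices; the in-memory queue lengths stand in for probe RPCs, and all names here are illustrative rather than Sparrow's actual API:

```python
import random

def place_tasks_per_task(queue_lengths, num_tasks, d=2):
    """Per-task sampling: for each task, probe d random workers
    and enqueue the task on the one with the shortest queue."""
    placements = []
    for _ in range(num_tasks):
        probed = random.sample(range(len(queue_lengths)), d)
        best = min(probed, key=lambda w: queue_lengths[w])
        queue_lengths[best] += 1  # the task joins that worker's queue
        placements.append(best)
    return placements

# 10,000-worker cluster with random pre-existing queues, 100-task job
queues = [random.randint(0, 4) for _ in range(10_000)]
print(place_tasks_per_task(queues, num_tasks=100))
```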

Simulated Results

100-task jobs in 10,000-node cluster, exponentially distributed task durations
[Plot: response time (ms) vs. cluster load; Random vs. Per-task sampling vs. Omniscient]

Response Time Grows with Tasks/Job!

70% cluster load
[Plot: response time (ms) vs. tasks/job, 1 to 1000; per-task sampling]

Per-task Sampling

[Diagram: Task 1 and Task 2 each probe two random workers independently]

Batch Sampling

Place m tasks on the least loaded of d·m slaves
[Diagram: a two-task job issues 4 probes (d = 2); per-task vs. batch]
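Batch sampling shares probe information across the whole job: instead of d independent probes per task, the scheduler issues d·m probes for an m-task job and keeps the m least-loaded workers. A sketch under the same simulated-queue assumptions as the per-task example above:

```python
import random

def place_tasks_batch(queue_lengths, num_tasks, d=2):
    """Batch sampling: probe d*m workers once for an m-task job
    and place the tasks on the m least-loaded probed workers."""
    probed = random.sample(range(len(queue_lengths)), d * num_tasks)
    chosen = sorted(probed, key=lambda w: queue_lengths[w])[:num_tasks]
    for worker in chosen:
        queue_lengths[worker] += 1
    return chosen
```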

Per-task versus Batch Sampling

70% cluster load
[Plot: response time (ms) vs. tasks/job, 1 to 1000; Per-task vs. Batch]

Simulated Results

100-task jobs in 10,000-node cluster, exponentially distributed task durations
[Plot: response time (ms) vs. cluster load; Random vs. Per-task vs. Batch vs. Omniscient]

Queue length is a poor predictor of wait time.

[Diagram: worker queues where the same queue length implies waits of 80 ms, 155 ms, or 530 ms]

This leads to poor performance on heterogeneous workloads.

Late Binding

Place m tasks on the least loaded of d·m slaves
[Diagram: 4 probes (d = 2); each probe enqueues a reservation rather than a task, and a worker requests the task from the scheduler only when it is ready to run it]
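Late binding thus defers placement until a worker actually frees up: the job's m tasks go to whichever of the d·m probed workers respond first, and the remaining reservations get nothing. A toy single-process sketch of that exchange (real Sparrow does this over RPC; the class and names are illustrative):

```python
from collections import deque

class LateBindingScheduler:
    """Holds a job's tasks until probed workers ask for them."""
    def __init__(self, tasks):
        self.unlaunched = deque(tasks)

    def worker_requests_task(self, worker_id):
        """Called when a worker reaches our reservation in its queue.
        The first m callers receive tasks; later callers receive None,
        i.e., their reservations are effectively cancelled."""
        return self.unlaunched.popleft() if self.unlaunched else None

# A 2-task job probed 4 workers (d = 2); they free up in this order:
sched = LateBindingScheduler(tasks=["task-0", "task-1"])
for worker in ["w3", "w8", "w1", "w6"]:
    print(worker, "->", sched.worker_requests_task(worker))
# w3 and w8 get the tasks; w1 and w6 get None
```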

Simulated Results

100-task jobs in 10,000-node cluster, exponentially distributed task durations
[Plot: response time (ms) vs. cluster load; Random vs. Per-task vs. Batch vs. Batch + Late Binding vs. Omniscient]

What about constraints?

Job Constraints

Restrict probed machines to those that satisfy the constraint
[Diagram: the scheduler probes only the workers that satisfy the job's constraint]

Per-Task Constraints

Probe separately for each task
[Diagram: each task's probes are drawn only from the machines satisfying that task's constraint]
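Both constraint types change only which machines may be probed. A sketch building on the batch-sampling example; the `satisfies` predicate is an illustrative stand-in for a real placement constraint (e.g., data locality), and per-task constraints would simply run this once per task with that task's own candidate set:

```python
import random

def place_with_job_constraint(queue_lengths, num_tasks, satisfies, d=2):
    """Job constraint: restrict probed machines to those that
    satisfy the constraint, then batch-sample among them."""
    candidates = [w for w in range(len(queue_lengths)) if satisfies(w)]
    probed = random.sample(candidates, min(d * num_tasks, len(candidates)))
    chosen = sorted(probed, key=lambda w: queue_lengths[w])[:num_tasks]
    for worker in chosen:
        queue_lengths[worker] += 1
    return chosen

# e.g., probe only machines holding a replica of the input data:
# place_with_job_constraint(queues, 10, satisfies=lambda w: w in replica_set)
```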

Technique Recap

Batch sampling + late binding + constraint handling
[Diagram: decentralized schedulers placing tasks on workers]

How does Sparrow perform on a real cluster?

Spark on Sparrow

Query: DAG of Stages
[Diagram: the query's stages are submitted one at a time to the Sparrow scheduler; each stage is a Sparrow job whose tasks run on workers]

How does Sparrow compare to Spark's native scheduler?

[Plot: response time (ms) vs. task duration (ms); Spark native scheduler vs. Sparrow]
100 16-core EC2 nodes, 10 tasks/job, 10 schedulers, 80% load

TPC-H Queries: Background

TPC-H: common benchmark for analytics workloads
[Stack: Shark (SQL execution engine) on Spark (distributed in-memory analytics framework) on Sparrow]

TPC-H Queries

100 16-core EC2 nodes, 10 schedulers, 80% load
[Bar chart: response time (ms) for TPC-H queries q3, q4, q6, q12, with 5th/25th/50th/75th/95th percentiles; randomized per-task sampling vs. batch sampling vs. batch + late binding]

TPC-H Queries

100 16-core EC2 nodes, 10 schedulers, 80% load
[Same bar chart, adding an Ideal (0 delay) baseline]

Within 12% of ideal; median queuing delay of 9 ms

Fault Tolerance

[Diagram: Spark Client 1's scheduler (Scheduler 1) fails ✗; the client fails over to Scheduler 2]
Timeout: 100 ms; Failover: 5 ms; Re-launch queries: 15 ms
[Plot: query response time (ms) vs. time (s) for Spark Client 1 and Spark Client 2, with the failure point marked]
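On the client side, fault tolerance is a timeout-and-retry loop: if a scheduler stops responding, the client fails over to another scheduler and re-launches its in-flight queries. A toy sketch of that loop; `schedulers` and `send` are illustrative placeholders, with the timing budget taken from the slide:

```python
def submit_with_failover(job, schedulers, send, timeout_s=0.100):
    """Try schedulers in order; on a timeout (~100 ms to detect,
    per the slide), fail over (~5 ms) and re-launch the job (~15 ms)."""
    for scheduler in schedulers:
        try:
            return send(scheduler, job, timeout=timeout_s)
        except TimeoutError:
            continue  # scheduler presumed dead; try the next one
    raise RuntimeError("no live scheduler reachable")
```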

When does Sparrow not work as well?

High cluster load

[Plot: response time (ms) vs. cluster load, 0 to 1; Sparrow vs. Omniscient]

Related Work

Centralized task schedulers: e.g., Quincy
Two-level schedulers: e.g., YARN, Mesos
Coarse-grained cluster schedulers: e.g., Omega
Load balancing: single task

Batch sampling + late binding + constraint handling

www.github.com/radlab/sparrow

Sparrow provides near-ideal job response times without global visibility.


Backup Slides

Can we do better without losing simplicity?

Policy Enforcement

Fair shares: serve queues using weighted fair queuing
[Diagram: a slave with per-user queues, User A (75%) and User B (25%)]

Priorities: serve queues based on strict priorities
[Diagram: a slave with high-priority and low-priority queues]
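Both policies can be enforced locally at each slave with one queue per user or priority level, with no scheduler coordination. A sketch of the two serving disciplines; the class is illustrative, and the fair-share variant is a lottery-style approximation of weighted fair queuing rather than the exact algorithm:

```python
import random
from collections import deque

class SlaveQueues:
    def __init__(self, weights):  # e.g. {"A": 0.75, "B": 0.25}
        self.weights = weights
        self.queues = {user: deque() for user in weights}

    def enqueue(self, user, task):
        self.queues[user].append(task)

    def next_by_priority(self, priority_order):
        """Strict priorities: always drain higher-priority queues first."""
        for user in priority_order:
            if self.queues[user]:
                return self.queues[user].popleft()
        return None

    def next_by_fair_share(self):
        """Weighted fair sharing: pick a backlogged queue with
        probability proportional to its share (lottery approximation)."""
        backlogged = [u for u, q in self.queues.items() if q]
        if not backlogged:
            return None
        user = random.choices(backlogged,
                              weights=[self.weights[u] for u in backlogged])[0]
        return self.queues[user].popleft()
```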