Sparrow Distributed Low-Latency Spark Scheduling Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica


Page 1: Sparrow

Sparrow

Distributed Low-Latency Spark Scheduling

Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica

Page 2: Sparrow

Outline

The Spark scheduling bottleneck

Sparrow’s fully distributed, fault-tolerant technique

Sparrow’s near-optimal performance

Page 3: Sparrow

Spark Today

[Diagram: User 1, User 2, User 3; Spark Context (Query Compilation, Storage, Scheduling); six Workers]

Page 5: Sparrow

Job Latencies Rapidly Decreasing

[Figure: timeline of job latencies; latency axis marked 10 min., 10 sec., 100 ms, 1 ms]

2004: MapReduce batch job

2009: Hive query

2010: Dremel query

2012: Impala query

2010: In-memory Spark query

2013: Spark streaming

Page 7: Sparrow

Job latencies rapidly decreasing + Spark deployments growing in size

Scheduling bottleneck!

Page 8: Sparrow

Spark scheduler throughput: 1500 tasks / second

Task Duration    Cluster size (# 16-core machines)
100 ms           10
1 second         100
10 seconds       1000
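
The pairing of task duration and cluster size above can be checked with back-of-the-envelope arithmetic. The sketch below assumes one task per core and fully busy machines (neither is stated on the slide), and `max_machines` is a hypothetical helper name.

```python
def max_machines(throughput_tasks_per_s, task_duration_s, cores_per_machine=16):
    """Largest cluster one scheduler can keep busy.

    Assumption (not from the slide): each core runs exactly one task at a time,
    so the scheduler must launch cores / task_duration tasks per second.
    """
    concurrent_tasks = throughput_tasks_per_s * task_duration_s
    return concurrent_tasks / cores_per_machine


for label, duration_s in [("100 ms", 0.1), ("1 second", 1.0), ("10 seconds", 10.0)]:
    print(label, round(max_machines(1500, duration_s), 1))
```

At 1500 tasks / second this gives roughly 9, 94, and 938 machines, in line with the slide's rounded figures of 10, 100, and 1000.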

Page 9: Sparrow

Optimizing the Spark Scheduler

0.8: Monitoring code moved off critical path

0.8.1: Result deserialization moved off critical path

Future improvements may yield 2-3x higher throughput

Page 10: Sparrow

Is the scheduler the bottleneck in my cluster?

Page 11: Sparrow

[Diagram: Cluster Scheduler and six Workers, labeled Task launch and Task completion]

Page 13: Sparrow

[Diagram: Cluster Scheduler and six Workers, labeled Task launch, Task completion, and Scheduler delay]

Page 14: Sparrow
Page 15: Sparrow
Page 16: Sparrow

Spark Today

[Diagram: User 1, User 2, User 3; Spark Context (Query Compilation, Storage, Scheduling); six Workers]

Page 17: Sparrow

Future Spark

[Diagram: User 1, User 2, User 3; three Scheduler + Query compilation pairs; six Workers]

Benefits: High throughput, Fault tolerance

Page 18: Sparrow

Future Spark

[Diagram: User 1, User 2, User 3; three Scheduler + Query compilation pairs; six Workers]

Storage: Tachyon

Page 19: Sparrow

Scheduling with Sparrow

[Diagram: a Stage, four Schedulers, six Workers]

Page 20: Sparrow

Batch Sampling

[Diagram: a Stage, four Schedulers, six Workers]

Place m tasks on the least loaded of 2m workers

4 probes (d = 2)
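
A minimal sketch of the batch-sampling rule, assuming queue length is the load signal returned by a probe. The function name `batch_sample`, the worker ids, and the queue lengths are hypothetical and are not taken from the Sparrow code base.

```python
import random


def batch_sample(queue_lengths, m, d=2, rng=random):
    """Choose workers for the m tasks of one stage.

    Probes d * m workers chosen at random and returns the m probed workers
    with the shortest queues. queue_lengths maps worker id -> queue length.
    """
    probed = rng.sample(sorted(queue_lengths), d * m)
    probed.sort(key=lambda worker: queue_lengths[worker])
    return probed[:m]


# Made-up load for six workers; m = 2 tasks with d = 2 means 4 probes.
queues = {"w1": 3, "w2": 0, "w3": 5, "w4": 1, "w5": 2, "w6": 4}
print(batch_sample(queues, m=2, d=2, rng=random.Random(0)))
```

Because all probes for the stage are pooled, one lightly loaded worker found by any probe can be used by any task, which is the difference from sampling d workers independently per task.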

Page 21: Sparrow

Queue length poor predictor of wait time

[Diagram: two Workers; task durations shown: 80 ms, 155 ms, 530 ms]

Poor performance on heterogeneous workloads

Page 22: Sparrow

Late Binding

[Diagram: a Stage, four Schedulers, six Workers]

Place m tasks on the least loaded of dm workers

4 probes (d = 2)

Page 24: Sparrow

Late Binding

[Diagram: a Stage, four Schedulers, six Workers]

Place m tasks on the least loaded of dm workers

Worker requests task
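
Late binding can be sketched as follows: the scheduler leaves a reservation on each of the dm probed workers and hands out a task only when a worker asks for one. This is an illustrative simulation, not Sparrow's implementation; the class and function names and the wait times are made up.

```python
from collections import deque


class LateBindingScheduler:
    """Hypothetical scheduler that holds tasks until a worker asks for one."""

    def __init__(self, tasks):
        self.pending = deque(tasks)

    def request_task(self, worker):
        # Called when a reservation reaches the front of a worker's queue.
        if self.pending:
            return self.pending.popleft()
        return None  # every task already launched; the reservation is dropped


def simulate(tasks, ms_until_free):
    """ms_until_free maps each probed worker to an assumed wait (ms) before it
    reaches the reservation. Returns {task: (worker, launch_time_ms)}."""
    scheduler = LateBindingScheduler(tasks)
    placement = {}
    # Workers reach their reservations in order of actual wait time.
    for worker, t in sorted(ms_until_free.items(), key=lambda kv: kv[1]):
        task = scheduler.request_task(worker)
        if task is not None:
            placement[task] = (worker, t)
    return placement


# m = 2 tasks, d = 2, so reservations go to 4 workers (wait times are made up).
print(simulate(["t1", "t2"], {"w1": 235, "w2": 530, "w3": 0, "w4": 90}))
```

The tasks land on the workers that actually become free first, so the scheduler never has to estimate wait time from queue length.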

Page 25: Sparrow

What about constraints?

Page 26: Sparrow

Per-Task Constraints

[Diagram: a Stage, four Schedulers, six Workers]

Probe separately for each task
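
A rough sketch of probing separately for each task, assuming each task carries a set of workers it is allowed to run on (for example, the machines holding its input data). For brevity it picks by queue length rather than by late binding; `probe_per_task` and all of the data are hypothetical.

```python
import random


def probe_per_task(task_constraints, queue_lengths, d=2, rng=random):
    """For each task, probe up to d workers from that task's allowed set
    and place the task on the least loaded probed worker."""
    placement = {}
    for task, allowed in task_constraints.items():
        candidates = sorted(allowed)
        probed = rng.sample(candidates, min(d, len(candidates)))
        placement[task] = min(probed, key=lambda worker: queue_lengths[worker])
    return placement


# Made-up constraints: each task may only run on two specific workers.
constraints = {"t1": {"w1", "w2"}, "t2": {"w3", "w4"}}
queues = {"w1": 3, "w2": 0, "w3": 5, "w4": 1}
print(probe_per_task(constraints, queues, d=2))
```

Each task draws its probes only from its own candidate set, so the probes can no longer be pooled across the whole stage as in batch sampling.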

Page 27: Sparrow

Technique Recap

Batch sampling + Late binding + Constraints

[Diagram: four Schedulers, six Workers]

Page 28: Sparrow

How well does Sparrow perform?

Page 29: Sparrow

How does Sparrow compare to Spark’s native scheduler?

100 16-core EC2 nodes, 10 tasks/job, 10 schedulers, 80% load

Page 30: Sparrow

TPC-H Queries: Background

TPC-H: Common benchmark for analytics workloads

[Diagram: software stack labeled Sparrow, Spark, Shark]

Shark: SQL execution engine

Page 31: Sparrow

TPC-H Queries

100 16-core EC2 nodes, 10 schedulers, 80% load

[Figure: percentiles shown: 5, 25, 50, 75, 95]

Within 12% of ideal

Median queuing delay of 9 ms

Page 32: Sparrow

Policy Enforcement

[Diagram: one Worker with High Priority and Low Priority queues; one Worker with queues for User A (75%) and User B (25%)]

Fair Shares: Serve queues using weighted fair queuing

Priorities: Serve queues based on strict priorities
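
The two worker-side policies can be sketched as below. This is a simplified illustration: the fair-share picker serves the backlogged queue with the lowest served-to-weight ratio and treats all tasks as equal in size, which is not a full weighted fair queuing implementation with virtual finish times. All class names are hypothetical.

```python
from collections import deque
from fractions import Fraction


class Worker:
    """Hypothetical worker with one FIFO queue per user or priority level."""

    def __init__(self, queue_names):
        self.queues = {name: deque() for name in queue_names}

    def enqueue(self, name, task):
        self.queues[name].append(task)


class StrictPriorityWorker(Worker):
    def next_task(self):
        # Queue names are listed highest priority first.
        for queue in self.queues.values():
            if queue:
                return queue.popleft()
        return None


class FairShareWorker(Worker):
    def __init__(self, weights):
        super().__init__(weights)
        self.weights = weights
        self.served = {name: 0 for name in weights}

    def next_task(self):
        backlogged = [name for name, queue in self.queues.items() if queue]
        if not backlogged:
            return None
        # Serve the queue that has received the least service per unit weight.
        name = min(backlogged,
                   key=lambda n: Fraction(self.served[n], self.weights[n]))
        self.served[name] += 1
        return self.queues[name].popleft()


worker = FairShareWorker({"A": 75, "B": 25})
for i in range(100):
    worker.enqueue("A", ("A", i))
    worker.enqueue("B", ("B", i))
first_100 = [worker.next_task()[0] for _ in range(100)]
print(first_100.count("A"), first_100.count("B"))
```

While both users stay backlogged, the launched tasks split in proportion to the 75% and 25% shares shown on the slide.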

Page 33: Sparrow

Weighted Fair Sharing

Page 34: Sparrow

Fault Tolerance

[Diagram: Scheduler 1, Scheduler 2, Spark Client 1 ✗, Spark Client 2]

Timeout: 100 ms

Failover: 5 ms

Re-launch queries: 15 ms

Page 35: Sparrow

Making Sparrow feature-complete

Interfacing with UI

Delay scheduling

Speculation

Page 36: Sparrow

(1) Diagnosing a Spark scheduling bottleneck

(2) Distributed, fault-tolerant scheduling with Sparrow

www.github.com/radlab/sparrow

[Diagram: four Schedulers, six Workers]