Sparrow: Distributed Low-Latency Spark Scheduling
Kay Ousterhout, Patrick Wendell, Matei Zaharia, Ion Stoica
Outline
The Spark scheduling bottleneck
Sparrow’s fully distributed, fault-tolerant technique
Sparrow’s near-optimal performance
Spark Today
[Diagram: User 1, User 2, User 3; a single Spark Context (Query Compilation, Storage, Scheduling); Workers]
Job Latencies Rapidly Decreasing
[Figure: job latency over time; latency axis marked 10 min., 10 sec., 100 ms, 1 ms. Data points: 2004 MapReduce batch job; 2009 Hive query; 2010 Dremel query; 2010 in-memory Spark query; 2012 Impala query; 2013 Spark Streaming]
Job latencies rapidly decreasing
+ Spark deployments growing in size
Scheduling bottleneck!
Spark scheduler throughput: 1500 tasks / second

Task Duration    Cluster size (# 16-core machines)
100 ms           10
1 second         100
10 seconds       1000
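The cluster sizes in the table follow from simple arithmetic on the 1500 tasks / second figure. The sketch below is one way to reproduce them, assuming one task per core at a time and fully busy cores; the function name `max_cluster_size` is made up for illustration.

```python
def max_cluster_size(throughput_tasks_per_s, task_duration_s, cores_per_machine=16):
    """Machines a single scheduler can keep busy (assumes one task per core)."""
    # A busy core needs a new task every task_duration_s seconds, so the
    # scheduler can feed at most throughput * task_duration cores.
    busy_cores = throughput_tasks_per_s * task_duration_s
    return busy_cores / cores_per_machine


if __name__ == "__main__":
    for duration_s in (0.1, 1.0, 10.0):
        machines = max_cluster_size(1500, duration_s)
        print(f"{duration_s:>5} s tasks: about {machines:.0f} machines")
```

This gives roughly 9, 94, and 938 machines, which the slide rounds to 10, 100, and 1000.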
Optimizing the Spark Scheduler
0.8: Monitoring code moved off critical path
0.8.1: Result deserialization moved off critical path
Future improvements may yield 2-3x higher throughput
Is the scheduler the bottleneck in my cluster?
[Diagram: Cluster Scheduler and Workers, with "Task launch" and "Task completion" marked; the final build highlights "Scheduler delay"]
Future Spark
[Diagram: User 1, User 2, User 3; three "Scheduler + Query compilation" boxes; Workers]
Benefits: high throughput, fault tolerance
Storage: Tachyon
Scheduling with Sparrow
[Diagram: Stage, four Schedulers, Workers]
Batch Sampling
[Diagram: four Schedulers, Workers]
Place m tasks on the least loaded of 2m workers
4 probes (d = 2)
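A minimal sketch of the rule "place m tasks on the least loaded of d·m workers", assuming load is measured by queue length and probes are answered instantly. The function name `batch_sample` and the list-of-queue-lengths representation are illustrative assumptions, not Sparrow's actual code.

```python
import random


def batch_sample(queue_lengths, m, d=2, rng=random):
    """Choose workers for m tasks by probing d*m randomly selected workers.

    queue_lengths[i] is the (assumed) current queue length of worker i.
    """
    num_probes = min(d * m, len(queue_lengths))
    probed = rng.sample(range(len(queue_lengths)), num_probes)
    # Keep the m probed workers with the shortest queues.
    probed.sort(key=lambda w: queue_lengths[w])
    chosen = probed[:m]
    for w in chosen:
        queue_lengths[w] += 1  # the task joins that worker's queue
    return chosen
```

With m = 2 and d = 2 this sends 4 probes, matching the slide.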
Queue length poor predictor of wait time
[Diagram: Workers with queued tasks lasting 80 ms, 155 ms, and 530 ms]
Poor performance on heterogeneous workloads
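To see why queue length can mislead, consider the three task durations from the slide. How they are split across two workers is an assumption made here for illustration (the extracted slide does not preserve the layout), as are the worker names.

```python
# Assumed layout: one long task on worker_a, two short tasks on worker_b (ms).
queues = {"worker_a": [530], "worker_b": [80, 155]}

# A probe that compares queue lengths prefers the shorter queue...
by_length = min(queues, key=lambda w: len(queues[w]))
# ...but the real wait is the sum of the queued task durations.
by_wait = min(queues, key=lambda w: sum(queues[w]))

wait_ms = {w: sum(tasks) for w, tasks in queues.items()}
```

Under this assumed layout, the shortest queue belongs to the worker with the longest wait (530 ms versus 235 ms).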
Late Binding
[Diagram: Stage, four Schedulers, Workers]
Place m tasks on the least loaded of dm workers
4 probes (d = 2)
Worker requests task
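Late binding can be sketched as follows: the scheduler leaves a reservation on each probed worker, and a task is bound to a worker only when that worker is free and asks for one. The slide says only "Worker requests task"; the reservation queue, the `None` reply once all tasks are handed out, and the class names are assumptions for this sketch.

```python
from collections import deque


class Scheduler:
    def __init__(self, tasks):
        self.pending = deque(tasks)  # tasks not yet bound to a worker

    def get_task(self):
        # Hand out the next unlaunched task, or None if all are taken.
        return self.pending.popleft() if self.pending else None


class Worker:
    def __init__(self, name):
        self.name = name
        self.reservations = deque()  # schedulers that probed this worker
        self.ran = []

    def reserve(self, scheduler):
        self.reservations.append(scheduler)

    def run_next(self):
        """Called when the worker has a free slot: request a task."""
        while self.reservations:
            task = self.reservations.popleft().get_task()
            if task is not None:
                self.ran.append(task)
                return task
        return None


if __name__ == "__main__":
    scheduler = Scheduler(["t1", "t2"])            # m = 2
    workers = [Worker(f"w{i}") for i in range(4)]  # d * m = 4 probes
    for w in workers:
        w.reserve(scheduler)
    # Whichever workers free up first get the tasks.
    print(workers[2].run_next(), workers[0].run_next(), workers[1].run_next())
```

The design point is that the scheduler never has to predict wait times: the first m workers to become free receive the m tasks, and later requests get nothing.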
What about constraints?
Per-Task Constraints
[Diagram: Stage, four Schedulers, Workers]
Probe separately for each task
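"Probe separately for each task" might look like the sketch below: each task samples d workers only from the set it is allowed to run on, and takes the least loaded of those. The function name, the constraint representation (a set of worker indices per task), and the use of queue length as load are assumptions.

```python
import random


def probe_per_task(task_constraints, queue_lengths, d=2, rng=random):
    """Place each task on the least loaded of d workers it may run on.

    task_constraints maps a task id to the set of worker indices that
    satisfy that task's constraint (hypothetical representation).
    """
    placement = {}
    for task, allowed in task_constraints.items():
        candidates = sorted(allowed)
        probed = rng.sample(candidates, min(d, len(candidates)))
        best = min(probed, key=lambda w: queue_lengths[w])
        queue_lengths[best] += 1
        placement[task] = best
    return placement
```

Unlike batch sampling, probes here cannot be shared across the tasks of a job, since each task may have a different set of eligible workers.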
Technique Recap
Batch sampling + Late binding + Constraints
[Diagram: four Schedulers, Workers]
How well does Sparrow perform?
How does Sparrow compare to Spark’s native scheduler?
100 16-core EC2 nodes, 10 tasks/job, 10 schedulers, 80% load
TPC-H Queries: Background
TPC-H: Common benchmark for analytics workloads
Shark: SQL execution engine
[Diagram labels: Sparrow, Spark, Shark]
TPC-H Queries
100 16-core EC2 nodes, 10 schedulers, 80% load
[Figure legend: percentiles 5, 25, 50, 75, 95]
Within 12% of ideal
Median queuing delay of 9ms
Policy Enforcement
Priorities: Serve queues based on strict priorities
[Diagram: Worker with High Priority and Low Priority queues]
Fair Shares: Serve queues using weighted fair queuing
[Diagram: Worker with queues for User A (75%) and User B (25%)]
Weighted Fair Sharing
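The two worker-side policies can be sketched as queue-selection functions. This is a simplified illustration that assumes all tasks are the same size, so the weighted fair queuing here reduces to serving the user with the least service relative to its weight; the function names and data layout are made up for the example.

```python
from collections import deque


def next_by_priority(queues):
    """queues: deques ordered from highest to lowest priority."""
    for q in queues:
        if q:
            return q.popleft()
    return None


def next_by_weighted_fair_share(queues, weights, served):
    """Serve the non-empty queue with the smallest served/weight ratio."""
    candidates = [user for user in queues if queues[user]]
    if not candidates:
        return None
    user = min(candidates, key=lambda u: served[u] / weights[u])
    served[user] += 1
    return queues[user].popleft()


if __name__ == "__main__":
    queues = {"A": deque(f"a{i}" for i in range(8)),
              "B": deque(f"b{i}" for i in range(8))}
    weights = {"A": 75, "B": 25}  # shares from the slide
    served = {"A": 0, "B": 0}
    for _ in range(8):
        next_by_weighted_fair_share(queues, weights, served)
    print(served)
```

With both users backlogged, 8 served tasks split 6 to User A and 2 to User B, matching the 75% / 25% shares.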
Fault Tolerance
[Diagram: Scheduler 1, Scheduler 2, Spark Client 1, Spark Client 2; a ✗ marks the failure]
Timeout: 100 ms
Failover: 5 ms
Re-launch queries: 15 ms
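A rough sketch of the client-side behavior these numbers suggest: when a scheduler stops responding, the client fails over to another scheduler and re-launches its in-flight queries there. The class names are hypothetical, and an exception stands in for the 100 ms timeout; none of this is taken from Sparrow's code.

```python
class SchedulerDown(Exception):
    """Stands in for a request that hit the timeout."""


class FakeScheduler:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.launched = []

    def launch(self, query):
        if not self.alive:
            raise SchedulerDown(self.name)
        self.launched.append(query)


class Client:
    def __init__(self, schedulers):
        self.schedulers = list(schedulers)
        self.current = 0
        self.in_flight = []  # queries submitted but not yet finished

    def submit(self, query):
        self.in_flight.append(query)
        try:
            self.schedulers[self.current].launch(query)
        except SchedulerDown:
            self.fail_over()

    def fail_over(self):
        # Switch to the next scheduler and re-launch everything in flight.
        self.current = (self.current + 1) % len(self.schedulers)
        for query in self.in_flight:
            self.schedulers[self.current].launch(query)
```

Because each scheduler keeps no state that the client cannot reconstruct, recovery in this sketch is just a matter of resubmitting.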
Making Sparrow feature-complete
Interfacing with UI
Delay scheduling
Speculation
(1) Diagnosing a Spark scheduling bottleneck
(2) Distributed, fault-tolerant scheduling with Sparrow
www.github.com/radlab/sparrow