Size-based scheduling for Hadoop providing both efficiency and fairness.
HFSP: the Hadoop Fair Sojourn Protocol
Mario Pastorelli, Antonio Barbuzzi, Damiano Carra, Matteo Dell’Amico, Pietro Michiardi
May 13, 2013
Outline
1 Hadoop and MapReduce
2 Fair Sojourn Protocol
3 HFSP Implementation
4 Experiments
Hadoop and MapReduce
MapReduce
Bring the computation to the data: input is split in blocks across the cluster
MAP
One task per block
Hadoop filesystem (HDFS) blocks: 64 MB by default
Stores key-value pairs locally
e.g., for word count: [(manzana,15), (melocoton,7), ...]
REDUCE
# of tasks set by the programmer
Mapper output is partitioned by key and pulled from “mappers”
The REDUCE function operates on all values for a single key
e.g., (melocoton, [7,42,13, ...])
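The word count example above can be sketched end to end; the grouping step stands in for Hadoop's shuffle between the MAP and REDUCE phases (names are illustrative, not Hadoop API):

```python
from collections import defaultdict

def map_fn(line):
    # MAP: emit a (word, 1) pair for every word in an input line
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # REDUCE: operates on all values for a single key at once
    return (word, sum(counts))

def run(lines):
    # Shuffle: mapper output is partitioned by key, so each reducer
    # sees every value for its keys, e.g. (melocoton, [7, 42, 13, ...])
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_fn(line):
            grouped[word].append(count)
    return dict(reduce_fn(w, c) for w, c in grouped.items())

print(run(["manzana melocoton manzana"]))  # {'manzana': 2, 'melocoton': 1}
```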
The Problem With Scheduling
Current Workloads
Huge job size variance
Running time: seconds to hours
I/O: KBs to TBs
[Chen et al., VLDB ’12; Ren et al., CMU TR ’12]
Consequence
Interactive jobs are delayed by long ones
In smaller clusters long queues exacerbate the problem
Fair Sojourn Protocol
Fair Sojourn Protocol [Friedman & Henderson, SIGMETRICS ’03]
[Figure: cluster usage (%) over time (s) for jobs 1–3, comparing a processor-sharing schedule with FSP, which dedicates the whole cluster to one job at a time]
Simulate completion times using a simulated processor-sharing discipline
Schedule all resources to the job that would complete first
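The two steps above can be sketched as follows: simulate a processor-sharing discipline to get each job's virtual completion time, then give all resources to the job that finishes first in the simulation. This is a minimal single-resource sketch, not the paper's implementation; job sizes are abstract work units.

```python
def virtual_completion_times(jobs, capacity=1.0):
    """Simulated processor sharing: all active jobs get an equal share."""
    remaining = dict(jobs)          # job name -> remaining size
    finish, clock = {}, 0.0
    while remaining:
        share = capacity / len(remaining)
        step = min(remaining.values()) / share  # time until next job drains
        clock += step
        for name in list(remaining):
            remaining[name] -= share * step
            if remaining[name] <= 1e-9:
                finish[name] = clock            # virtual completion time
                del remaining[name]
    return finish

def fsp_pick(jobs):
    # FSP: schedule all resources to the job that would complete first
    finish = virtual_completion_times(jobs)
    return min(finish, key=finish.get)
```

For example, with jobs of sizes 5, 20, and 10, processor sharing would finish them at virtual times 15, 35, and 25, so FSP runs the size-5 job first.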
Multi-Processor FSP
[Figure: cluster usage (%) over time (s) for jobs 1–3 under multi-processor FSP, compared with processor sharing]
In our case, some jobs may not require all cluster resources
HFSP Implementation
HFSP In A Nutshell
Job Size Estimation
Naive estimation at first
After the first s “training” tasks have run, we make a better estimation
s = 5 by default
On t task slots, we give priority to training tasks
limiting training to t slots avoids starving “old” jobs
training priority is a “shortcut” for very small jobs
Scheduling Policy
We treat MAP and REDUCE phases as separate jobs
A virtual cluster outputs a per-job simulated completion time
Preempt running tasks of jobs that complete later in the virtual cluster
Job Size Estimation (1)
Initial Estimation
ξ · k · l
k: # of tasks
l: average size of past MAP/REDUCE tasks
ξ ∈ [1,∞]: aggressiveness for scheduling jobs in the training phase
ξ = 1 (default): tend to schedule training jobs right away
they may have to be preempted
ξ =∞: wait for training to end before deciding
may require more “waves”
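As a sketch, the initial estimate is simply the product above (function and parameter names are illustrative):

```python
def initial_estimate(k, l, xi=1.0):
    # Initial job size estimate: xi * k * l
    # k:  number of tasks in the job
    # l:  average size of past MAP/REDUCE tasks
    # xi: aggressiveness, >= 1; xi = 1 (default) tends to schedule
    #     training jobs right away, larger values make HFSP wait
    return xi * k * l

# e.g., a 100-task job with past tasks averaging 2.5 units
initial_estimate(100, 2.5)  # 250.0
```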
Job Size Estimation (2)
MAP PhaseFrom the size of the s samples, generate an empirical CDF
(Least-square) fit to a parametric distribution
Predicted job size: k time the expected value of the fitteddistribution
Data Locality
Experimentally, we find it is not an issue
For the s sample jobs, there are plenty of unprocessed blocks around
We use delay scheduling [Zaharia et al., EuroSys ’10]
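The MAP-phase estimate can be sketched as below. The exponential family and the grid search over candidate means are assumptions for illustration; the slides do not name the parametric distribution or the fitting procedure.

```python
import math

def estimate_map_size(sample_sizes, k, mus=None):
    """Sketch: empirical CDF of s sample task sizes, least-squares fit
    to an exponential CDF 1 - exp(-x/mu), job size = k * E[fitted]."""
    xs = sorted(sample_sizes)
    n = len(xs)
    ecdf = [(i + 1) / n for i in range(n)]   # empirical CDF at each sample
    if mus is None:
        # candidate means around the sample mean (assumed search grid)
        mean = sum(xs) / n
        mus = [mean * f / 100 for f in range(50, 201)]
    def sse(mu):
        # least-squares objective against the empirical CDF
        return sum((F - (1 - math.exp(-x / mu))) ** 2 for x, F in zip(xs, ecdf))
    best_mu = min(mus, key=sse)              # fitted expected task size
    return k * best_mu                       # predicted total job size
```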
Job Size Estimation (3)
REDUCE Phase
Shuffle time: getting data to the reducer
time between scheduling a REDUCE task and executing the REDUCE function for the first time
estimated as the average of sample shuffle sizes, weighted by data size
Execution time
we set a timeout ∆ (default 60 s)
if the timeout is hit, the estimated execution time is ∆/p, where the progress p is the fraction of data processed
Compute the estimated reduce time as before
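The ∆/p extrapolation amounts to the following (names are illustrative):

```python
def estimated_reduce_exec_time(delta, p):
    # If a REDUCE task is still running when the timeout delta expires,
    # extrapolate its total execution time as delta / p, where p is the
    # fraction of data processed so far.
    if not 0 < p <= 1:
        raise ValueError("progress p must be in (0, 1]")
    return delta / p

# e.g., 60 s timeout with 25% of the data processed -> 240 s estimated
estimated_reduce_exec_time(60, 0.25)  # 240.0
```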
Virtual Cluster
Estimated job size is in a “serialized” single-machine format
Simulates a processor-sharing cluster to compute completion times, based on
number of tasks per job
available task slots in the real cluster
Simulation is updated when
new jobs arrive
tasks complete
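One detail worth illustrating is how the virtual cluster can share task slots when some jobs need fewer slots than their fair share (as noted earlier, some jobs may not require all cluster resources). The max-min style allocation below is a hypothetical sketch, not HFSP's actual code:

```python
def assign_slots(demands, total_slots):
    """Share task slots among jobs: equal shares, but no job gets more
    slots than it has runnable tasks; leftovers are redistributed."""
    alloc = {j: 0 for j in demands}
    active = {j for j, d in demands.items() if d > 0}
    slots = total_slots
    while active and slots > 0:
        share = slots // len(active)
        if share == 0:
            break                      # fewer leftover slots than jobs
        for j in list(active):
            give = min(share, demands[j] - alloc[j])
            alloc[j] += give
            slots -= give
            if alloc[j] == demands[j]:
                active.remove(j)       # job fully satisfied
    return alloc

# e.g., a 2-task job leaves its unused share to the two big jobs
assign_slots({"a": 2, "b": 10, "c": 10}, 12)  # {'a': 2, 'b': 5, 'c': 5}
```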
Job Preemption
Supported in Hadoop
KILL running tasks
wastes work
WAIT for them to finish
may take long
Our Choice
MAP tasks: WAIT
generally small
For REDUCE tasks, we implemented SUSPEND and RESUME
avoids the drawbacks of both WAIT and KILL
Job Preemption: SUSPEND and RESUME
Our Solution
We delegate to the OS: SIGSTOP and SIGCONT
The OS will swap tasks if and when memory is needed
no risk of thrashing: swapped data is loaded only when resuming
Configurable maximum number of suspended tasks
if reached, switch to WAIT
hard limit on memory allocated to suspended tasks
If not all running tasks should be preempted, suspend the youngest
likely to finish later
may have a smaller memory footprint
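The OS-delegation idea can be demonstrated in a few lines: SIGSTOP freezes a process, SIGCONT resumes it where it left off, and the OS swaps the stopped process out only if memory is actually needed. POSIX-only sketch; the sleeping child process stands in for a running REDUCE task.

```python
import os
import signal
import subprocess
import time

# A child process standing in for a running task
task = subprocess.Popen(["sleep", "30"])

os.kill(task.pid, signal.SIGSTOP)   # SUSPEND: task stops consuming CPU
time.sleep(0.1)                     # ... higher-priority job runs here ...
os.kill(task.pid, signal.SIGCONT)   # RESUME: task continues where it stopped

task.terminate()                    # clean up the demo child
task.wait()
```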
Experiments
Experimental Setup
Platform
100 m1.xlarge Amazon EC2 instances
4 x 2 GHz cores, 1.6 TB storage, 15 GB RAM each
Workloads
Generated with the SWIM workload generator [Chen et al., MASCOTS ’11]
Synthesized from Facebook traces [Chen et al., VLDB ’12]
FB2009: 100 jobs, most are small; 22 minutes submission schedule
FB2010: 93 jobs, small jobs filtered out; 1 h submission schedule
Configuration
We compare to Hadoop’s FAIR scheduler
similar to a processor-sharing discipline
Delay scheduling enabled for both FAIR and HFSP
FB2009
[Figure: CDFs (fraction of completed jobs) of sojourn time [min] for HFSP and FAIR, in three panels: small jobs, medium jobs, large jobs]
The FIFO scheduler would mostly fall outside of the graph
Small jobs (few tasks) are not problematic in either case
they are allocated enough task slots
Medium and large jobs instead require a significant amount of the cluster resources
“focusing” all resources of the cluster pays off
FB2010
[Figure: CDFs (fraction of completed jobs) of map time, reduce time, and sojourn time [min] for HFSP and FAIR, in three panels: MAP phase, REDUCE phase, aggregate]
Larger jobs, longer queues, more pressure on the scheduler
Median MAP sojourn time is more than halved
Main reason: fewer “waves”, because cluster resources are focused
On aggregate, when the first job completes with FAIR, 20% of jobs are already done with HFSP.
Cluster Size
[Figure: average sojourn time [min] vs. number of cluster nodes (10–100), for HFSP and FAIR]
Experiment done with Mumak, the official Hadoop emulator, and the FB2009 workload
For smaller clusters, scheduling makes a bigger difference
Robustness to Estimation Errors
[Figure: average sojourn time [s] vs. error parameter α (0.1–1), with FAIR and HFSP (α = 0) shown as reference lines]
Experimental settings as before: FB2009 and Mumak again
For a job size estimate of θ, we introduce an error and pick a value uniformly in [(1 − α)θ, (1 + α)θ]
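The error model from this experiment is straightforward to reproduce (the function name is illustrative):

```python
import random

def perturb_estimate(theta, alpha, rng=random):
    # Replace the true job size estimate theta with a value drawn
    # uniformly from [(1 - alpha) * theta, (1 + alpha) * theta];
    # alpha = 0 leaves the estimate exact.
    return rng.uniform((1 - alpha) * theta, (1 + alpha) * theta)
```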
Experiments Results
Preemption: Costs
Question
Could the costs associated with swapping make SUSPEND not worth it?
Measurements
Linux can read and write swap close to maximum disk speed
100 MB/s for us
Worst-Case Analysis
In the FB2010 experiment, 10% of REDUCE tasks are suspended
The JVM heap space for REDUCE tasks is 1 GB
as advised in Hadoop docs
Therefore, a SUSPEND/RESUME induces swapping for at most 20 s
one order of magnitude less than average size of preempted tasks
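The 20 s bound follows from back-of-the-envelope arithmetic over the slide's numbers (taking 1 GB ≈ 1000 MB):

```python
# Worst-case swap cost of one SUSPEND/RESUME: the task's whole JVM heap
# is written to swap and later read back at the measured disk speed.
heap_mb = 1000        # JVM heap per REDUCE task (~1 GB)
disk_mb_per_s = 100   # measured swap read/write throughput
swap_seconds = 2 * heap_mb / disk_mb_per_s  # write out + read back
print(swap_seconds)   # 20.0
```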
Take-Home Messages
Size-based scheduling on Hadoop is viable, and particularly appealing for companies with (semi-)interactive jobs and smaller clusters
Even simple approximate means for size estimation are sufficient, as HFSP is robust with respect to errors
Delegating to the OS via POSIX SIGSTOP and SIGCONT signals is an efficient way to perform preemption in Hadoop
HFSP is available as free software at http://bitbucket.org/bigfootproject/hfsp
Paper at http://arxiv.org/abs/1302.2749