Size-based scheduling for Hadoop providing both efficiency and fairness.
HFSP: the Hadoop Fair Sojourn Protocol
Mario Pastorelli, Antonio Barbuzzi, Damiano Carra, Matteo Dell’Amico, Pietro Michiardi
May 13, 2013
Outline
1 Hadoop and MapReduce
2 Fair Sojourn Protocol
3 HFSP Implementation
4 Experiments
Hadoop and MapReduce
MapReduce
Bring the computation to the data: input is split in blocks across the cluster
MAP
One task per block
Hadoop filesystem (HDFS) blocks: 64 MB by default
Stores key-value pairs locally
e.g., for word count: [(manzana,15), (melocoton,7), ...]
REDUCE
# of tasks set by the programmer
Mapper output is partitioned by key and pulled from “mappers”
The REDUCE function operates on all values for a single key
e.g., (melocoton, [7,42,13, ...])
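The word count example above can be sketched end to end; the grouping step stands in for Hadoop's shuffle between the MAP and REDUCE phases (names are illustrative, not Hadoop API):

```python
from collections import defaultdict

def map_fn(line):
    # MAP: emit a (word, 1) pair for every word in an input line
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # REDUCE: operates on all values for a single key at once
    return (word, sum(counts))

def run(lines):
    # Shuffle: mapper output is partitioned by key, so each reducer
    # sees every value for its keys, e.g. (melocoton, [7, 42, 13, ...])
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_fn(line):
            grouped[word].append(count)
    return dict(reduce_fn(w, c) for w, c in grouped.items())

print(run(["manzana melocoton manzana"]))  # {'manzana': 2, 'melocoton': 1}
```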
The Problem With Scheduling
Current Workloads
Huge job size variance
Running time: seconds to hours
I/O: KBs to TBs
[Chen et al., VLDB ’12; Ren et al., CMU TR ’12]
Consequence
Interactive jobs are delayed by long ones
In smaller clusters long queues exacerbate the problem
Fair Sojourn Protocol
Fair Sojourn Protocol [Friedman & Henderson, SIGMETRICS ’03]
[Figure: cluster usage (%) over time (s) for jobs 1–3, comparing a processor-sharing schedule with FSP, which dedicates the whole cluster to one job at a time]
Simulate completion times using a simulated processor-sharing discipline
Schedule all resources to the job that would complete first
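The two steps above can be sketched as follows: simulate a processor-sharing discipline to get each job's virtual completion time, then give all resources to the job that finishes first in the simulation. This is a minimal single-resource sketch, not the paper's implementation; job sizes are abstract work units.

```python
def virtual_completion_times(jobs, capacity=1.0):
    """Simulated processor sharing: all active jobs get an equal share."""
    remaining = dict(jobs)          # job name -> remaining size
    finish, clock = {}, 0.0
    while remaining:
        share = capacity / len(remaining)
        step = min(remaining.values()) / share  # time until next job drains
        clock += step
        for name in list(remaining):
            remaining[name] -= share * step
            if remaining[name] <= 1e-9:
                finish[name] = clock            # virtual completion time
                del remaining[name]
    return finish

def fsp_pick(jobs):
    # FSP: schedule all resources to the job that would complete first
    finish = virtual_completion_times(jobs)
    return min(finish, key=finish.get)
```

For example, with jobs of sizes 5, 20, and 10, processor sharing would finish them at virtual times 15, 35, and 25, so FSP runs the size-5 job first.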
Multi-Processor FSP
[Figure: cluster usage (%) over time (s) for jobs 1–3 under multi-processor FSP, compared with processor sharing]
In our case, some jobs may not require all cluster resources
HFSP Implementation
HFSP In A Nutshell
Job Size Estimation
Naive estimation at first
After the first s “training” tasks have run, we make a better estimation
s = 5 by default
On t task slots, we give priority to training tasks
limiting training to t slots avoids starving “old” jobs
training priority is a “shortcut” for very small jobs
Scheduling Policy
We treat MAP and REDUCE phases as separate jobs
A virtual cluster outputs a per-job simulated completion time
Preempt running tasks of jobs that complete later in the virtual cluster
Job Size Estimation (1)
Initial Estimation
ξ · k · l
k: # of tasks
l: average size of past MAP/REDUCE tasks
ξ ∈ [1,∞]: aggressiveness for scheduling jobs in the training phase
ξ = 1 (default): tend to schedule training jobs right away
they may have to be preempted
ξ =∞: wait for training to end before deciding
may require more “waves”
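As a sketch, the initial estimate is simply the product above (function and parameter names are illustrative):

```python
def initial_estimate(k, l, xi=1.0):
    # Initial job size estimate: xi * k * l
    # k:  number of tasks in the job
    # l:  average size of past MAP/REDUCE tasks
    # xi: aggressiveness, >= 1; xi = 1 (default) tends to schedule
    #     training jobs right away, larger values make HFSP wait
    return xi * k * l

# e.g., a 100-task job with past tasks averaging 2.5 units
initial_estimate(100, 2.5)  # 250.0
```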
Job Size Estimation (2)
MAP PhaseFrom the size of the s samples, generate an empirical CDF
(Least-square) fit to a parametric distribution
Predicted job size: k time the expected value of the fitteddistribution
Data Locality
Experimentally, we find it is not an issue
For the s sample jobs, there are plenty of unprocessed blocks around
We use delay scheduling [Zaharia et al., EuroSys ’10]
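The MAP-phase estimate can be sketched as below. The exponential family and the grid search over candidate means are assumptions for illustration; the slides do not name the parametric distribution or the fitting procedure.

```python
import math

def estimate_map_size(sample_sizes, k, mus=None):
    """Sketch: empirical CDF of s sample task sizes, least-squares fit
    to an exponential CDF 1 - exp(-x/mu), job size = k * E[fitted]."""
    xs = sorted(sample_sizes)
    n = len(xs)
    ecdf = [(i + 1) / n for i in range(n)]   # empirical CDF at each sample
    if mus is None:
        # candidate means around the sample mean (assumed search grid)
        mean = sum(xs) / n
        mus = [mean * f / 100 for f in range(50, 201)]
    def sse(mu):
        # least-squares objective against the empirical CDF
        return sum((F - (1 - math.exp(-x / mu))) ** 2 for x, F in zip(xs, ecdf))
    best_mu = min(mus, key=sse)              # fitted expected task size
    return k * best_mu                       # predicted total job size
```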
Job Size Estimation (3)
REDUCE Phase
Shuffle time: getting data to the reducer
time between scheduling a REDUCE task and executing the REDUCE function for the first time
estimated as the average of sample shuffle sizes, weighted by data size
Execution time
we set a timeout ∆ (default 60 s)
if the timeout is hit, the estimated execution time is ∆/p, where the progress p is the fraction of data processed
Compute the estimated reduce time as before
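The ∆/p extrapolation amounts to the following (names are illustrative):

```python
def estimated_reduce_exec_time(delta, p):
    # If a REDUCE task is still running when the timeout delta expires,
    # extrapolate its total execution time as delta / p, where p is the
    # fraction of data processed so far.
    if not 0 < p <= 1:
        raise ValueError("progress p must be in (0, 1]")
    return delta / p

# e.g., 60 s timeout with 25% of the data processed -> 240 s estimated
estimated_reduce_exec_time(60, 0.25)  # 240.0
```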
Virtual Cluster
Estimated job size is in a “serialized” single-machine format
Simulates a processor-sharing cluster to compute completion times, based on
number of tasks per job
available task slots in the real cluster
Simulation is updated when
new jobs arrive
tasks complete
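One detail worth illustrating is how the virtual cluster can share task slots when some jobs need fewer slots than their fair share (as noted earlier, some jobs may not require all cluster resources). The max-min style allocation below is a hypothetical sketch, not HFSP's actual code:

```python
def assign_slots(demands, total_slots):
    """Share task slots among jobs: equal shares, but no job gets more
    slots than it has runnable tasks; leftovers are redistributed."""
    alloc = {j: 0 for j in demands}
    active = {j for j, d in demands.items() if d > 0}
    slots = total_slots
    while active and slots > 0:
        share = slots // len(active)
        if share == 0:
            break                      # fewer leftover slots than jobs
        for j in list(active):
            give = min(share, demands[j] - alloc[j])
            alloc[j] += give
            slots -= give
            if alloc[j] == demands[j]:
                active.remove(j)       # job fully satisfied
    return alloc

# e.g., a 2-task job leaves its unused share to the two big jobs
assign_slots({"a": 2, "b": 10, "c": 10}, 12)  # {'a': 2, 'b': 5, 'c': 5}
```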
Job Preemption
Supported in Hadoop
KILL running tasks
wastes work
WAIT for them to finish
may take long
Our Choice
MAP tasks: WAIT
generally small
For REDUCE tasks, we implemented SUSPEND and RESUME
avoids the drawbacks of both WAIT and KILL
Job Preemption: SUSPEND and RESUME
Our Solution
We delegate to the OS: SIGSTOP and SIGCONT
The OS will swap tasks if and when memory is needed
no risk of thrashing: swapped data is loaded only when resuming
Configurable maximum number of suspended tasks
if reached, switch to WAIT
hard limit on memory allocated to suspended tasks
If not all running tasks should be preempted, suspend the youngest
likely to finish later
may have a smaller memory footprint
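The OS-delegation idea can be demonstrated in a few lines: SIGSTOP freezes a process, SIGCONT resumes it where it left off, and the OS swaps the stopped process out only if memory is actually needed. POSIX-only sketch; the sleeping child process stands in for a running REDUCE task.

```python
import os
import signal
import subprocess
import time

# A child process standing in for a running task
task = subprocess.Popen(["sleep", "30"])

os.kill(task.pid, signal.SIGSTOP)   # SUSPEND: task stops consuming CPU
time.sleep(0.1)                     # ... higher-priority job runs here ...
os.kill(task.pid, signal.SIGCONT)   # RESUME: task continues where it stopped

task.terminate()                    # clean up the demo child
task.wait()
```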
Experiments
Experimental Setup
Platform
100 m1.xlarge Amazon EC2 instances
4 x 2 GHz cores, 1.6 TB storage, 15 GB RAM each
Workloads
Generated with the SWIM workload generator [Chen et al., MASCOTS ’11]
Synthesized from Facebook traces [Chen et al., VLDB ’12]
FB2009: 100 jobs, most are small; 22 minutes submission schedule
FB2010: 93 jobs, small jobs filtered out; 1 h submission schedule
Configuration
We compare to Hadoop’s FAIR scheduler
similar to a processor-sharing discipline
Delay scheduling enabled for both FAIR and HFSP
FB2009
[Figure: CDFs (fraction of completed jobs) of sojourn time [min] for HFSP and FAIR, in three panels: small jobs, medium jobs, large jobs]
The FIFO scheduler would mostly fall outside of the graph
Small jobs (few tasks) are not problematic in either case
they are allocated enough task slots
Medium and large jobs instead require a significant amount of the cluster resources
“focusing” all resources of the cluster pays off
FB2010
[Figure: CDFs (fraction of completed jobs) of map time, reduce time, and sojourn time [min] for HFSP and FAIR, in three panels: MAP phase, REDUCE phase, aggregate]
Larger jobs, longer queues, more pressure on the scheduler
Median MAP sojourn time is more than halved
Main reason: fewer “waves”, because cluster resources are focused
On aggregate, when the first job completes with FAIR, 20% of jobs are already done with HFSP.
Cluster Size
[Figure: average sojourn time [min] vs. number of cluster nodes (10–100), for HFSP and FAIR]
Experiment done with Mumak, the official Hadoop emulator, and the FB2009 workload
For smaller clusters, scheduling makes a bigger difference
Robustness to Estimation Errors
[Figure: average sojourn time [s] vs. error parameter α (0.1–1), with FAIR and HFSP (α = 0) shown as reference lines]
Experimental settings as before: FB2009 and Mumak again
For a job size estimate of θ, we introduce an error and pick a value uniformly in [(1 − α)θ, (1 + α)θ]
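The error model from this experiment is straightforward to reproduce (the function name is illustrative):

```python
import random

def perturb_estimate(theta, alpha, rng=random):
    # Replace the true job size estimate theta with a value drawn
    # uniformly from [(1 - alpha) * theta, (1 + alpha) * theta];
    # alpha = 0 leaves the estimate exact.
    return rng.uniform((1 - alpha) * theta, (1 + alpha) * theta)
```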
Experiments Results
Preemption: Costs
Question
Could the costs associated with swapping make SUSPEND not worth it?
Measurements
Linux can read and write swap close to maximum disk speed
100 MB/s for us
Worst-Case Analysis
In the FB2010 experiment, 10% of REDUCE tasks are suspended
The JVM heap space for REDUCE tasks is 1 GB
as advised in Hadoop docs
Therefore, a SUSPEND/RESUME induces swapping for at most 20 s
one order of magnitude less than average size of preempted tasks
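The 20 s bound follows from back-of-the-envelope arithmetic over the slide's numbers (taking 1 GB ≈ 1000 MB):

```python
# Worst-case swap cost of one SUSPEND/RESUME: the task's whole JVM heap
# is written to swap and later read back at the measured disk speed.
heap_mb = 1000        # JVM heap per REDUCE task (~1 GB)
disk_mb_per_s = 100   # measured swap read/write throughput
swap_seconds = 2 * heap_mb / disk_mb_per_s  # write out + read back
print(swap_seconds)   # 20.0
```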
Take-Home Messages
Size-based scheduling on Hadoop is viable, and particularly appealing for companies with (semi-)interactive jobs and smaller clusters
Even simple approximate means for size estimation are sufficient, as HFSP is robust with respect to errors
Delegating to the OS via POSIX SIGSTOP and SIGCONT signals is an efficient way to perform preemption in Hadoop
HFSP is available as free software at http://bitbucket.org/bigfootproject/hfsp
Paper at http://arxiv.org/abs/1302.2749