Scheduling MapReduce Jobs in HPC Clusters
Marcelo Neves, Tiago Ferreto, Cesar De Rose ([email protected])
Faculty of Informatics, PUCRS, Porto Alegre, Brazil
August 30, 2012


Page 1: Scheduling MapReduce Jobs in HPC Clusters

Scheduling MapReduce Jobs in HPC Clusters

Marcelo Neves, Tiago Ferreto, Cesar De Rose
[email protected]

Faculty of Informatics, PUCRS, Porto Alegre, Brazil

August 30, 2012

Page 2: Scheduling MapReduce Jobs in HPC Clusters

Outline  

• Introduction
• HPC Clusters and MapReduce
• MapReduce Job Adaptor
• Evaluation
• Conclusion

Page 3: Scheduling MapReduce Jobs in HPC Clusters

Introduction

• MapReduce (MR)
– A parallel programming model
– Simplicity, efficiency and high scalability
– It has become a de facto standard for large-scale data analysis

• MR has also attracted the attention of the HPC community
– Simpler approach to address the parallelization problem
– Highly visible cases where MR has been successfully used by companies like Google, Facebook and Yahoo!

Page 4: Scheduling MapReduce Jobs in HPC Clusters

HPC Clusters and MapReduce

• HPC Clusters
– Shared among multiple users/organizations
– Resource Management System (RMS), such as PBS/Torque
– Applications are submitted as batch jobs
– Users have to explicitly allocate resources, specifying the number of nodes and amount of time

• MR implementations (e.g. Hadoop)
– Have their own complete job management system
– Users do not have to explicitly allocate resources
– Require a dedicated cluster

Page 5: Scheduling MapReduce Jobs in HPC Clusters

Problem

• Two distinct clusters are required

How to run MapReduce jobs in an existing HPC cluster along with regular HPC jobs?

Page 6: Scheduling MapReduce Jobs in HPC Clusters

Current solutions

• Hadoop on Demand (HOD) and MyHadoop
– Create on-demand MR installations as RMS jobs
– Not transparent: users still must specify the number of nodes and amount of time to be allocated

• Mesos
– Shares a cluster between multiple different frameworks
– Creates another level of resource management
– Management is taken away from the cluster's RMS

Page 7: Scheduling MapReduce Jobs in HPC Clusters

MapReduce Job Adaptor

[Diagram: an HPC user submits an HPC job (# of nodes, time) directly to the Resource Management System; an MR user submits an MR job (# of map tasks, # of reduce tasks, job profile) to the MR Job Adaptor, which translates it into an equivalent HPC job (# of nodes, time) for the RMS managing the cluster.]

Page 8: Scheduling MapReduce Jobs in HPC Clusters

MapReduce Job Adaptor

• The adaptor has three main goals:
– Facilitate the execution of MR jobs in HPC clusters
– Minimize the average turnaround time of the jobs
– Exploit unused resources in the cluster (a result of the various shapes of HPC job requests)

Page 9: Scheduling MapReduce Jobs in HPC Clusters

Completion time estimation

• MR performance model by Verma et al.¹
– Job profile with performance invariants
– Estimates upper/lower bounds of job completion time

• N_M^J = number of map tasks of job J
• N_R^J = number of reduce tasks of job J
• S_M^J = number of map slots allocated to job J
• S_R^J = number of reduce slots allocated to job J

1. Verma et al.: ARIA: automatic resource inference and allocation for MapReduce environments (2011)
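The bound equations on this slide did not survive extraction. A simplified sketch of the Verma et al. bounds in the notation above, where M_avg and R_avg are the average map and reduce task durations from the job profile, M_max and R_max the maxima, and the shuffle-phase term is omitted for brevity:

```latex
T_J^{low} \;\approx\; \frac{N_M^J \, M_{avg}}{S_M^J} \;+\; \frac{N_R^J \, R_{avg}}{S_R^J}
\qquad
T_J^{up} \;\approx\; \frac{(N_M^J - 1)\, M_{avg}}{S_M^J} + M_{max}
          \;+\; \frac{(N_R^J - 1)\, R_{avg}}{S_R^J} + R_{max}
```

Intuitively, the lower bound assumes tasks pack perfectly into waves over the available slots, while the upper bound accounts for the worst case where the longest task runs last.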

Page 10: Scheduling MapReduce Jobs in HPC Clusters

Algorithm  

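The algorithm figure was lost in extraction. The sketch below is a hypothetical reconstruction of the adaptor's logic as described on the surrounding slides, not the paper's actual code: for each feasible slot allocation, estimate the job's completion time from the job profile, query the RMS for the earliest start time of such a request, and submit the (nodes, time) pair that minimizes the estimated finish time. The function names and the `earliest_start` callback (standing in for a CBF hole-finding query to the RMS) are assumptions.

```python
import math

def estimate_runtime(profile, n_map, n_red, map_slots, red_slots):
    """Simplified completion-time estimate: tasks execute in waves over
    the available slots, using average task durations from the profile."""
    map_waves = math.ceil(n_map / map_slots)
    red_waves = math.ceil(n_red / red_slots) if n_red else 0
    return map_waves * profile["map_avg"] + red_waves * profile["reduce_avg"]

def adapt_mr_job(profile, n_map, n_red, cluster, earliest_start):
    """Translate an MR job into an HPC-style (nodes, time) request.

    cluster: dict with 'nodes' and 'slots_per_node'.
    earliest_start: hypothetical callback (nodes, runtime) -> start time,
    standing in for a query to the RMS scheduler.
    """
    best = None
    for nodes in range(1, cluster["nodes"] + 1):
        slots = nodes * cluster["slots_per_node"]
        runtime = estimate_runtime(profile, n_map, n_red, slots, slots)
        finish = earliest_start(nodes, runtime) + runtime
        # Keep the allocation with the earliest estimated completion
        if best is None or finish < best[0]:
            best = (finish, nodes, runtime)
    _, nodes, runtime = best
    return nodes, runtime
```

On an idle cluster the callback always returns 0, so the search simply picks the widest allocation that still shortens the job; under load, requests that fit into scheduling holes can win instead, which is how the adaptor exploits unused resources.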

Page 11: Scheduling MapReduce Jobs in HPC Clusters

Evaluation

• Simulated environment (using the SimGrid toolkit)
– Cluster composed of 128 nodes with 2 cores each
– RMS based on the Conservative Backfilling (CBF) algorithm
– Stream of job submissions

• HPC workload
– Synthetic workload based on the model by Lublin et al.¹
– Real-world HPC traces from the Parallel Workloads Archive (SDSC SP2)

• MR workload
– Synthetic workload derived from Facebook workloads described by Zaharia et al.²

1. Lublin et al.: The workload on parallel supercomputers: Modeling the characteristics of rigid jobs (2003)
2. Zaharia et al.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling (2010)

Page 12: Scheduling MapReduce Jobs in HPC Clusters

Turnaround Time and System Utilization

• Workload:
– HPC: "peak hour" of Lublin's model
– MR: an hour of Facebook-like job submissions

• The adaptor obtained shorter turnaround times and better cluster utilization in all cases
– MR-only: turnaround was reduced by ≈ 40%
– HPC+MR: overall turnaround was reduced by ≈ 15%
– HPC+MR: turnaround of MR jobs was reduced by ≈ 73%

Page 13: Scheduling MapReduce Jobs in HPC Clusters

Influence of the Job Size

• Shorter turnaround regardless of the job size
• Better results for bins with smaller jobs

Job sizes in the Facebook workload (based on Zaharia et al.):

Bin | # Map Tasks | # Reduce Tasks | % Jobs at Facebook
 1  |       1     |        0       | 39%
 2  |       2     |        0       | 16%
 3  |      10     |        3       | 14%
 4  |      50     |        0       |  9%
 5  |     100     |        0       |  6%
 6  |     200     |       50       |  6%
 7  |     400     |        0       |  4%
 8  |     800     |      180       |  4%
 9  |    2400     |        0       |  3%

[Figures: average turnaround time (minutes) per bin (1-9), Naive vs. Adaptor]

Page 14: Scheduling MapReduce Jobs in HPC Clusters

Influence of System Load

[Figures: average turnaround time (minutes) vs. mean MR job inter-arrival time (1-30 seconds) and vs. mean HPC job inter-arrival time (5-30 seconds), comparing the Adaptor and Naive algorithms]

Page 15: Scheduling MapReduce Jobs in HPC Clusters

Real-world Workload

• Workload:
– HPC: a day-long trace from SDSC SP2
– MR: 1000 Facebook-like MR jobs

• The adaptor's algorithm performed better in all cases

[Figure: turnaround reductions of ≈ 54% and ≈ 80%]

Page 16: Scheduling MapReduce Jobs in HPC Clusters

Conclusion

• MR has gained attention from the HPC community, but there is still the question of how to run MR jobs along with regular HPC jobs in an HPC cluster

• MR Job Adaptor
– Allows transparent MR job submission on HPC clusters
– Minimizes the average turnaround time
– Improves overall utilization by exploiting unused resources in the cluster

Page 17: Scheduling MapReduce Jobs in HPC Clusters

Thank  you!  
