Scheduling MapReduce Jobs in HPC Clusters
Marcelo Neves, Tiago Ferreto, Cesar De Rose ([email protected])
Faculty of Informatics, PUCRS, Porto Alegre, Brazil
August 30, 2012


Page 1: Scheduling MapReduce Jobs in HPC Clusters

Scheduling MapReduce Jobs in HPC Clusters

Marcelo Neves, Tiago Ferreto, Cesar De Rose
[email protected]

Faculty of Informatics, PUCRS, Porto Alegre, Brazil

August 30, 2012

Page 2: Scheduling MapReduce Jobs in HPC Clusters

Outline  

• Introduction
• HPC Clusters and MapReduce
• MapReduce Job Adaptor
• Evaluation
• Conclusion

Page 3: Scheduling MapReduce Jobs in HPC Clusters

Introduction

• MapReduce (MR)
– A parallel programming model
– Simplicity, efficiency and high scalability
– It has become a de facto standard for large-scale data analysis

• MR has also attracted the attention of the HPC community
– Simpler approach to address the parallelization problem
– Highly visible cases where MR has been successfully used by companies like Google, Facebook and Yahoo!

Page 4: Scheduling MapReduce Jobs in HPC Clusters

HPC Clusters and MapReduce

• HPC Clusters
– Shared among multiple users/organizations
– Resource Management System (RMS), such as PBS/Torque
– Applications are submitted as batch jobs
– Users have to explicitly allocate resources, specifying the number of nodes and amount of time

• MR implementations (e.g. Hadoop)
– Have their own complete job management system
– Users do not have to explicitly allocate resources
– Require a dedicated cluster

Page 5: Scheduling MapReduce Jobs in HPC Clusters

Problem

• Two distinct clusters are required

How to run MapReduce jobs in an existing HPC cluster along with regular HPC jobs?

Page 6: Scheduling MapReduce Jobs in HPC Clusters

Current solutions

• Hadoop on Demand (HOD) and MyHadoop
– Create on-demand MR installations as RMS jobs
– Not transparent: users still must specify the number of nodes and amount of time to be allocated

• Mesos
– Shares a cluster between multiple different frameworks
– Creates another level of resource management
– Management is taken away from the cluster's RMS

Page 7: Scheduling MapReduce Jobs in HPC Clusters

MapReduce Job Adaptor

[Diagram: an HPC user submits an HPC job (# of nodes, time) directly to the Resource Management System; an MR user submits an MR job (# of map tasks, # of reduce tasks, job profile) to the MR Job Adaptor, which translates it into an equivalent HPC job (# of nodes, time) for the RMS managing the cluster.]

Page 8: Scheduling MapReduce Jobs in HPC Clusters

MapReduce Job Adaptor

• The adaptor has three main goals:
– Facilitate the execution of MR jobs in HPC clusters
– Minimize the average turnaround time of the jobs
– Exploit unused resources in the cluster (a result of the various shapes of HPC job requests)

Page 9: Scheduling MapReduce Jobs in HPC Clusters

Completion time estimation

• MR performance model by Verma et al.¹
– Job profile with performance invariants
– Estimates upper/lower bounds of job completion time

• N_M^J = number of map tasks of job J
• N_R^J = number of reduce tasks of job J
• S_M^J = number of map slots allocated to job J
• S_R^J = number of reduce slots allocated to job J

1. Verma et al.: ARIA: automatic resource inference and allocation for MapReduce environments (2011)
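The bound equations on this slide did not survive extraction. A simplified sketch of the Verma et al. bounds in the notation above, where M_avg and R_avg are the average map and reduce task durations from the job profile, M_max and R_max the maxima, and the shuffle-phase term is omitted for brevity:

```latex
T_J^{low} \;\approx\; \frac{N_M^J \, M_{avg}}{S_M^J} \;+\; \frac{N_R^J \, R_{avg}}{S_R^J}
\qquad
T_J^{up} \;\approx\; \frac{(N_M^J - 1)\, M_{avg}}{S_M^J} + M_{max}
          \;+\; \frac{(N_R^J - 1)\, R_{avg}}{S_R^J} + R_{max}
```

Intuitively, the lower bound assumes tasks pack perfectly into waves over the available slots, while the upper bound accounts for the worst case where the longest task runs last.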

Page 10: Scheduling MapReduce Jobs in HPC Clusters

Algorithm  

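The algorithm figure was lost in extraction. The sketch below is a hypothetical reconstruction of the adaptor's logic as described on the surrounding slides, not the paper's actual code: for each feasible slot allocation, estimate the job's completion time from the job profile, query the RMS for the earliest start time of such a request, and submit the (nodes, time) pair that minimizes the estimated finish time. The function names and the `earliest_start` callback (standing in for a CBF hole-finding query to the RMS) are assumptions.

```python
import math

def estimate_runtime(profile, n_map, n_red, map_slots, red_slots):
    """Simplified completion-time estimate: tasks execute in waves over
    the available slots, using average task durations from the profile."""
    map_waves = math.ceil(n_map / map_slots)
    red_waves = math.ceil(n_red / red_slots) if n_red else 0
    return map_waves * profile["map_avg"] + red_waves * profile["reduce_avg"]

def adapt_mr_job(profile, n_map, n_red, cluster, earliest_start):
    """Translate an MR job into an HPC-style (nodes, time) request.

    cluster: dict with 'nodes' and 'slots_per_node'.
    earliest_start: hypothetical callback (nodes, runtime) -> start time,
    standing in for a query to the RMS scheduler.
    """
    best = None
    for nodes in range(1, cluster["nodes"] + 1):
        slots = nodes * cluster["slots_per_node"]
        runtime = estimate_runtime(profile, n_map, n_red, slots, slots)
        finish = earliest_start(nodes, runtime) + runtime
        # Keep the allocation with the earliest estimated completion
        if best is None or finish < best[0]:
            best = (finish, nodes, runtime)
    _, nodes, runtime = best
    return nodes, runtime
```

On an idle cluster the callback always returns 0, so the search simply picks the widest allocation that still shortens the job; under load, requests that fit into scheduling holes can win instead, which is how the adaptor exploits unused resources.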

Page 11: Scheduling MapReduce Jobs in HPC Clusters

Evaluation

• Simulated environment (using the SimGrid toolkit)
– Cluster composed of 128 nodes with 2 cores each
– RMS based on the Conservative Backfilling (CBF) algorithm
– Stream of job submissions

• HPC workload
– Synthetic workload based on the model by Lublin et al.¹
– Real-world HPC traces from the Parallel Workloads Archive (SDSC SP2)

• MR workload
– Synthetic workload derived from Facebook workloads described by Zaharia et al.²

1. Lublin et al.: The workload on parallel supercomputers: Modeling the characteristics of rigid jobs (2003)
2. Zaharia et al.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling (2010)

Page 12: Scheduling MapReduce Jobs in HPC Clusters

Turnaround Time and System Utilization

• Workload:
– HPC: "peak hour" of Lublin's model
– MR: an hour of Facebook-like job submissions

• The adaptor obtained shorter turnaround times and better cluster utilization in all cases
– MR-only: turnaround was reduced by ≈ 40%
– HPC+MR: overall turnaround was reduced by ≈ 15%
– HPC+MR: turnaround of MR jobs was reduced by ≈ 73%

Page 13: Scheduling MapReduce Jobs in HPC Clusters

Influence of the Job Size

• Shorter turnaround regardless of the job size
• Better results for bins with smaller jobs

Job sizes in the Facebook workload (based on Zaharia et al.):

Bin | # Map Tasks | # Reduce Tasks | % Jobs at Facebook
 1  |       1     |        0       | 39%
 2  |       2     |        0       | 16%
 3  |      10     |        3       | 14%
 4  |      50     |        0       |  9%
 5  |     100     |        0       |  6%
 6  |     200     |       50       |  6%
 7  |     400     |        0       |  4%
 8  |     800     |      180       |  4%
 9  |    2400     |        0       |  3%

[Figures: average turnaround time (minutes) per bin (1-9), Naive vs. Adaptor]

Page 14: Scheduling MapReduce Jobs in HPC Clusters

Influence of System Load

[Figures: average turnaround time (minutes) vs. mean MR job inter-arrival time (1-30 seconds) and vs. mean HPC job inter-arrival time (5-30 seconds), comparing the Adaptor and Naive algorithms]

Page 15: Scheduling MapReduce Jobs in HPC Clusters

Real-world Workload

• Workload:
– HPC: a day-long trace from SDSC SP2
– MR: 1000 Facebook-like MR jobs

• The adaptor's algorithm performed better in all cases

[Figure: turnaround reductions of ≈ 54% and ≈ 80%]

Page 16: Scheduling MapReduce Jobs in HPC Clusters

Conclusion

• MR has gained attention from the HPC community, but there is still the question of how to run MR jobs along with regular HPC jobs in an HPC cluster

• MR Job Adaptor
– Allows transparent MR job submission on HPC clusters
– Minimizes the average turnaround time
– Improves overall utilization by exploiting unused resources in the cluster

Page 17: Scheduling MapReduce Jobs in HPC Clusters

Thank  you!  
