DEGAS Program System via C++ Library Extension
Yili Zheng, Computational Research Division, Lawrence Berkeley National Laboratory
DEGAS Retreat 2013, 6/4/13



Page 1

DEGAS Program System via C++ Library Extension

Yili Zheng

Computational Research Division

Lawrence Berkeley National Laboratory

Page 2

C++ is Increasingly Important


Co-Design Center Proxy Apps (pie chart): 10 apps, 4 in C++, 3 in C, 3 in FORTRAN (source: Steve Hofmeyr). Other C++ examples: BoxLib, Combinatorial BLAS, ...

DEGAS Lib supports C/C++ apps, and can be used by FORTRAN.

Page 3

A "Compiler-Free" Approach

• Leverage the standardized C++ specifications and their compiler implementations
  – New features in C++11 for the DEGAS lib: async, future, lambda functions (a minimal sketch follows below)
• C++11 is well supported
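For reference, here is a minimal, self-contained C++11 sketch (standard library only, not DEGAS-specific) of the async, future, and lambda features the library builds on:

#include <future>
#include <iostream>

int main() {
  // std::async launches the lambda (here on its own thread) and
  // returns a std::future that will hold the result.
  std::future<int> f = std::async(std::launch::async,
                                  [](int x) { return x * x; }, 7);
  // future::get() blocks until the task finishes and returns its value.
  std::cout << "7 squared is " << f.get() << std::endl;
  return 0;
}

(Compile with, e.g., g++ -std=c++11 -pthread.)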


Page 4

DEGAS Lib Software Stack


[Stack diagram] C/C++ apps are written against the DEGAS template header files and compiled with an ordinary C++ compiler on top of the DEGAS Lib runtime; UPC apps go through the UPC compiler and UPC runtime. Both runtimes sit on the GASNet communication library, which runs over the network drivers and OS libraries.

Page 5

Design Goals

• PGAS
  – global address space memory management
  – one-sided communication
  – statically shared variables and arrays
  – synchronization primitives
• Dynamic
  – asynchronous task execution
• Hierarchical PGAS


Page 6

C++ Templates for PGAS Shared Data

• C++ templates enable generic programming
• Declaration:
    template<class T> struct matrix { T *elements; };
• Instantiation:
    matrix<double> A;
    matrix<std::complex<double>> B;
• In DEGAS we use templates to describe shared data:
    shared_var_t<int> s;        // = shared int s; in UPC
    shared_array_t<int> sa(8);  // = shared int sa[8]; in UPC
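As an aside, here is a hedged, single-process sketch of how such a wrapper type could use operator overloading to make shared data look like an ordinary variable; the class and member names below are illustrative stand-ins, not the actual DEGAS implementation:

// Illustrative only: a single-process stand-in for a PGAS shared variable.
// A real shared_var_t would route the put/get through the DEGAS runtime/GASNet.
#include <cstdio>

template <class T>
class toy_shared_var {
  T value_;  // stand-in for storage in the global address space
 public:
  toy_shared_var(T v = T()) : value_(v) {}
  // Writing through the wrapper would become a (possibly remote) put.
  toy_shared_var &operator=(const T &rhs) { value_ = rhs; return *this; }
  // Reading through the wrapper would become a (possibly remote) get.
  operator T() const { return value_; }
};

int main() {
  toy_shared_var<int> s;
  s = 41;          // reads like "shared int s; s = 41;"
  int local = s;   // the implicit conversion plays the role of the get
  std::printf("%d\n", local + 1);
  return 0;
}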


Page 7

Shared Variable Example

#include <degas.h>
using namespace degas;

shared_var_t<int> s = 0;  // shared by all processes
global_lock_t l;

// SPMD
void update() {
  for (int i = 0; i < 100; i++)
    s = s + 1;  // a race condition
  barrier();
  // shared data accesses with lock protection
  for (int i = 0; i < 100; i++) {
    l.lock();
    s = s + 1;
    l.unlock();
  }
}


Page 8

Translation Flow

Source code:

    shared_var_t<int> s;
    s = 1;  // "=" is overloaded

The C++ compiler translates the assignment into a call to operator=(s, 1), which enters the DEGAS Lib runtime (UPCR_PUT_PSHARED_VAL(sp, a)). The runtime asks: is "s" local? If yes, it performs a direct memory access; if no, it issues the put through GASNet.
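A hedged sketch of the locality dispatch this flow implies; global_ref, is_local(), and remote_put() are hypothetical stand-ins for the runtime and GASNet calls, not the real DEGAS internals:

// Illustrative locality dispatch behind an overloaded assignment.
#include <cstring>

struct global_ref {
  int owner;   // rank that owns the storage
  void *addr;  // address within the owner's segment
};

static int my_rank = 0;  // assumed SPMD rank of this process
static bool is_local(const global_ref &r) { return r.owner == my_rank; }
static void remote_put(const global_ref &r, const void *src, std::size_t n) {
  // A real implementation would issue a GASNet put here; this is a stub.
  (void)r; (void)src; (void)n;
}

template <class T>
void assign(global_ref ref, const T &value) {
  if (is_local(ref))
    std::memcpy(ref.addr, &value, sizeof(T));  // direct memory access
  else
    remote_put(ref, &value, sizeof(T));        // one-sided communication
}

int main() {
  int storage = 0;
  global_ref r = {0, &storage};
  assign(r, 1);        // local case: a plain store
  return storage - 1;  // returns 0 on success
}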


Page 9

Dynamic Shared Data

• Global address space pointers:
    global_ptr_t<data_type, place_type> p;
• Dynamic global memory allocation:
    global_ptr_t allocate(place_type place, size_t count);
• Data transfer functions (memory copy):
    copy(global_ptr_t src, global_ptr_t dst, size_t count);

allocate and copy are function templates. C++ compilers can instantiate the right version of allocate and copy based on the types of the pointers (metaprogramming).
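To illustrate the shape of this interface, here is a self-contained, single-process analog; gptr, allocate, and copy below are toy stand-ins (a place is just an integer and the copy is a memcpy), not the DEGAS signatures:

// Toy analog of global pointers, allocate(), and copy() in one process.
#include <cstdio>
#include <cstdlib>
#include <cstring>

typedef int place_t;  // a "place" is modeled as a rank id

template <class T>
struct gptr {         // global pointer = (owning place, raw address)
  place_t place;
  T *raw;
};

template <class T>
gptr<T> allocate(place_t place, std::size_t count) {
  // Real code would reserve memory in the given place's shared segment.
  return gptr<T>{place, static_cast<T *>(std::malloc(count * sizeof(T)))};
}

template <class T>
void copy(gptr<T> src, gptr<T> dst, std::size_t count) {
  // Real code would choose memcpy vs. an RDMA put/get based on the places.
  std::memcpy(dst.raw, src.raw, count * sizeof(T));
}

int main() {
  gptr<double> a = allocate<double>(0, 4);
  gptr<double> b = allocate<double>(1, 4);
  for (int i = 0; i < 4; i++) a.raw[i] = 1.5 * i;
  copy(a, b, 4);  // "remote" copy, here just a memcpy
  std::printf("b[3] = %g\n", b.raw[3]);
  std::free(a.raw);
  std::free(b.raw);
  return 0;
}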


Page 10

Advanced Communication Features

• Non-blocking one-sided communication
  – async_copy()
  – async_copy_fence()
• MPI-style collective operations built directly on GASNet
  – broadcast, gather, scatter, reduce, all-to-all, ...
• Can put to and get from any memory location
  – Leverages GASNet's dynamic memory registration feature
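The intended usage pattern is: initiate several transfers, overlap independent work, then fence. Below is a self-contained analog of that pattern using std::async and std::future as stand-ins for the DEGAS calls (not the real async_copy/async_copy_fence):

// Pattern sketch: start non-blocking copies, overlap work, then fence.
#include <cstdio>
#include <cstring>
#include <future>
#include <vector>

int main() {
  const int kBufs = 4;
  const std::size_t kLen = 1 << 16;
  std::vector<std::vector<char>> send(kBufs, std::vector<char>(kLen, 'x'));
  std::vector<std::vector<char>> recv(kBufs, std::vector<char>(kLen, 0));

  // "async_copy": launch each copy without blocking the caller.
  std::vector<std::future<void>> pending;
  for (int i = 0; i < kBufs; i++) {
    pending.push_back(std::async(std::launch::async, [&send, &recv, i, kLen] {
      std::memcpy(recv[i].data(), send[i].data(), kLen);
    }));
  }

  // ... independent computation could overlap with the copies here ...

  // "async_copy_fence": wait until all outstanding copies complete.
  for (auto &f : pending) f.wait();

  std::printf("recv[3][0] = %c\n", recv[3][0]);
  return 0;
}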


Page 11

Asynchronous Task Execution

• C++11 async library
    future = std::async(Function&& f, Args&&... args);
    future.wait();
    rv = future.get();

• DEGAS async library
    future = degas::async(Function&& f, Args&&... args);
    degas::async(place)(Function&& f, Args&&... args);
    degas::async_after(place, future)(Function&& f, Args&&... args);
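The async_after form expresses a dependence: the new task starts only after an earlier future completes. A self-contained standard-library analog of that idea (using std::shared_future; this is not degas::async_after itself):

// The second task waits on the first task's future before running.
#include <cstdio>
#include <future>

int main() {
  std::shared_future<int> f1 =
      std::async(std::launch::async, [] { return 21; });
  // Analog of async_after(place, f1)(g): g blocks on f1.get() first.
  std::future<int> f2 =
      std::async(std::launch::async, [f1] { return f1.get() * 2; });
  std::printf("%d\n", f2.get());  // prints 42
  return 0;
}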


Page 12

Async Task Example

#include <degas.h>
#include <cstdio>

// A regular function can be called by remote processes
void print_num(int num) {
  printf("arg: %d\n", num);
}

// Only the master process executes the main function
int main(int argc, char **argv) {
  node_range_t group(1, node_count, 2);  // odd-numbered processes
  async(group)(print_num, 123);          // call a function remotely
  degas::wait();  // wait for the remote tasks to complete
  return 0;
}


Page 13

Async with Lambda Function

// Process 0 spawns async tasks
for (int i = 0; i < node_count(); i++) {
  // spawn a task at place "i"
  // the task is expressed by a lambda function
  async(i)([] (int num) { printf("num: %d\n", num); },
           1000 + i);  // argument to the lambda function
  degas::wait();       // wait for all tasks to finish
}

mpirun -n 4 ./test_async

Output:
num: 1000
num: 1001
num: 1002
num: 1003


Page 14

DEGAS Async is a Building Block

• One-sided collective communication
• Remote atomic operations
• Active Message Broadcast


Page 15

Random Access Benchmark (GUPS)

// shared uint64_t Table[TableSize]; in UPC
shared_array_t<uint64_t> Table(TableSize);

// Main update loop
void RandomAccessUpdate() {
  uint64_t ran = starts(NUPDATE / NP * myid);
  for (uint64_t i = myid; i < NUPDATE; i += NP) {
    ran = (ran << 1) ^ ((int64_t)ran < 0 ? POLY : 0);
    Table[ran & (TableSize - 1)] ^= ran;
  }
}


[Figure] Logical data layout: one 16-element array, indices 0-15. Physical data layout: the elements are distributed cyclically, so Process 0 holds 0, 4, 8, 12; Process 1 holds 1, 5, 9, 13; Process 2 holds 2, 6, 10, 14; Process 3 holds 3, 7, 11, 15.
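A small self-contained sketch of the cyclic mapping shown in the figure, from a global index to its owning process and local slot; the distribution rule (owner = index % NP) is assumed from the figure, not read from the DEGAS headers:

#include <cstdio>

int main() {
  const int TableSize = 16, NP = 4;
  for (int i = 0; i < TableSize; i++) {
    int owner = i % NP;  // which process holds element i
    int slot  = i / NP;  // index within that process's local block
    std::printf("element %2d -> process %d, local slot %d\n", i, owner, slot);
  }
  return 0;
}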

Page 16

Experimental System: MIC

Photo source: http://newsroom.intel.com/community/intel_newsroom/blog/2012/11/12/

Page 17

Manycore - A Good Fit for PGAS

61 cores, 30 MB aggregate L2, 1 TFlops, 352 GB/s memory bandwidth.

[Figure] Block diagram of the Knights Corner (MIC) microarchitecture: each core has its own L2 cache and tag directory (TD), alongside the memory controllers (MC) and PCIe client logic.

[Figure] PGAS abstraction: each thread (Thread 0 through Thread 3 shown) has a private segment plus a shared segment, and the shared segments together form the global address space.

Page 18

GUPS Performance on MIC

[Charts] Random Access Latency: time (μs) vs. number of processes (1, 2, 4, 8, 16, 32, 60), comparing DEGAS C++ and UPC. Giga-Updates Per Second: GUPS vs. number of processes, comparing DEGAS C++ and UPC.

Overhead is only about 0.2 μs (~220 cycles).

Page 19

GUPS Performance on BlueGene/Q

Overhead is negligible at large scale.

[Charts] Random Access Latency: time (μsec) vs. number of processes (1 up to 8192), comparing DEGAS C++ and UPC. Giga-Updates Per Second: GUPS vs. number of processes (1 up to 8192), comparing DEGAS C++ and UPC.

Page 20

LULESH Proxy App

• Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics
• Proxy app for UHPC, ExMatEx, and LLNL ASC
• Written in C++, with MPI, OpenMP, and CUDA versions

https://codesign.llnl.gov/lulesh.php

Page 21

LULESH 3-D Data Partitioning

Page 22

LULESH Communication Pattern

26 neighbors:
• 6 faces
• 12 edges
• 8 corners
(A short enumeration sketch follows below.)

[Figure] Cross-section view of the 3-D processor grid.
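A self-contained sketch that enumerates the 26 neighbor offsets of a partition in the 3-D processor grid and classifies them as faces, edges, or corners (illustrative; not taken from the LULESH source):

// Offsets (dx, dy, dz): 1 nonzero component = face, 2 = edge, 3 = corner.
#include <cstdio>
#include <cstdlib>

int main() {
  int faces = 0, edges = 0, corners = 0;
  for (int dx = -1; dx <= 1; dx++)
    for (int dy = -1; dy <= 1; dy++)
      for (int dz = -1; dz <= 1; dz++) {
        int nonzero = std::abs(dx) + std::abs(dy) + std::abs(dz);
        if (nonzero == 0) continue;  // skip the partition itself
        if (nonzero == 1) faces++;
        else if (nonzero == 2) edges++;
        else corners++;
      }
  std::printf("faces=%d edges=%d corners=%d total=%d\n",
              faces, edges, corners, faces + edges + corners);
  return 0;  // prints faces=6 edges=12 corners=8 total=26
}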

Page 23

Data Layout of Each Partition

• 3-D array A[x][y][z] of size N³, row-major storage; the z index goes the fastest
• Blue planes are contiguous
• Green planes are stride-N² chunks
• Red planes are stride-N elements
(An indexing sketch follows below.)

[Figure] The N×N×N partition with axes x, y, z, highlighting the three kinds of boundary planes and their strides (N and N²).
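A self-contained sketch of the row-major indexing this describes: element (x, y, z) sits at offset (x*N + y)*N + z, so a fixed-x plane is one contiguous run, a fixed-y plane is N chunks of N elements spaced N² apart, and a fixed-z plane is N² single elements spaced N apart (illustrative, not LULESH code):

// Row-major offsets for A[x][y][z] with z fastest.
#include <cstdio>

int main() {
  const int N = 4;
  auto offset = [&](int x, int y, int z) { return (x * N + y) * N + z; };

  // Fixed x ("blue" plane): N*N consecutive elements.
  std::printf("x=1 plane starts at offset %d, contiguous for %d elements\n",
              offset(1, 0, 0), N * N);
  // Fixed y ("green" plane): N chunks of N elements, chunks N*N apart.
  std::printf("y=1 plane: chunks start at %d and %d (stride N^2 = %d)\n",
              offset(0, 1, 0), offset(1, 1, 0), N * N);
  // Fixed z ("red" plane): single elements, each N apart.
  std::printf("z=1 plane: elements at %d and %d (stride N = %d)\n",
              offset(0, 0, 1), offset(0, 1, 1), N);
  return 0;
}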

Page 24

Convert MPI to DEGAS Lib

MPI version:

    // Post non-blocking Recv
    MPI_Irecv(RecvBuf_1);
    ...
    MPI_Irecv(RecvBuf_N);

    // Post non-blocking Send
    Pack_Data_to_Buf();
    MPI_Isend(SendBuf_1);
    ...
    MPI_Isend(SendBuf_N);

    MPI_Wait(...);
    ...
    Unpack_Data(...);

DEGAS Lib version (a loop-based sketch over the neighbors follows below):

    Pack_Data_to_Buf();

    // Post non-blocking Copy
    async_copy(SendBuf_1, RecvBuf_1);
    ...
    async_copy(SendBuf_N, RecvBuf_N);

    async_copy_fence(...);
    ...
    Unpack_Data(...);
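To show the shape of the DEGAS-side exchange end to end, here is a self-contained, pattern-level analog: one non-blocking copy per neighbor buffer, then a single fence. toy_async_copy and toy_async_copy_fence are stand-ins built on std::async, not the real DEGAS calls:

#include <cstdio>
#include <cstring>
#include <future>
#include <vector>

static std::vector<std::future<void>> g_pending;

void toy_async_copy(const char *src, char *dst, std::size_t n) {
  g_pending.push_back(std::async(std::launch::async,
                                 [src, dst, n] { std::memcpy(dst, src, n); }));
}

void toy_async_copy_fence() {
  for (auto &f : g_pending) f.wait();  // one completion point for all copies
  g_pending.clear();
}

int main() {
  const int kNeighbors = 26;  // 6 faces + 12 edges + 8 corners
  const std::size_t kLen = 1024;
  std::vector<std::vector<char>> send(kNeighbors, std::vector<char>(kLen, 's'));
  std::vector<std::vector<char>> recv(kNeighbors, std::vector<char>(kLen, 0));

  // Pack_Data_to_Buf() would fill the send buffers here.
  for (int n = 0; n < kNeighbors; n++)
    toy_async_copy(send[n].data(), recv[n].data(), kLen);  // non-blocking
  toy_async_copy_fence();
  // Unpack_Data() would read the recv buffers here.

  std::printf("recv[25][0] = %c\n", recv[25][0]);
  return 0;
}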

Page 25

LULESH Performance on MIC

[Chart] Relative performance on MIC: DEGAS vs. MPI at 1, 8, 27, 64, and 125 processes.


Page 26

Porting LULESH to DEGAS is "Easy"

• DEGAS Lib and LULESH speak the same language: C++
• LULESH is well modularized, so it is easy to find the communication components
• Non-blocking one-sided data copy functions are handy and efficient: they leverage hardware RDMA and shared memory for PGAS
• MPI-style collectives are necessary


Page 27

Concluding Remarks

• DEGAS Lib: a minimally invasive technique for bringing DEGAS functionality to existing apps

• Many-core architectures with hardware shared memory (cache coherence not required) will be ideal for PGAS, HPGAS, and DEGAS
• High-level abstractions in the base language are crucial to productivity


Page 28

See You Tonight for PyGAS

Thank You!