Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand
Presented at GTC '15



Page 1:

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand

Presented at GTC ’15

Page 2:

Presented by

Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda

Khaled Hamidouche, The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~hamidouc

Page 3:

High-End Computing (HEC): Petaflop to Exaflop

100-150 PFlops in 2016-17

1 EFlops in 2020-2024?

Page 4:

Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

[Chart: Number of Clusters and Percentage of Clusters in the Top500 over time (Timeline); the 86% annotation marks the current share of clusters in the list.]

Page 5:

Drivers of Modern HPC Cluster Architectures

• Multi-core Processors
• High-Performance Interconnects - InfiniBand: <1 usec latency, >100 Gbps bandwidth
• Accelerators / Coprocessors: high compute density, high performance/watt, >1 TFlop DP on a chip

Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A

Multi-core processors are ubiquitous, and InfiniBand is very popular in HPC clusters. Accelerators/coprocessors are becoming common in high-end systems, pushing the envelope for Exascale computing.

Page 6:

Large-Scale InfiniBand Installations

225 IB clusters (45%) in the November 2014 Top500 list (http://www.top500.org). Installations in the Top 50 (21 systems):

• 519,640 cores (Stampede) at TACC (7th)
• 72,800 cores Cray CS-Storm in US (10th)
• 160,768 cores (Pleiades) at NASA/Ames (11th)
• 72,000 cores (HPC2) in Italy (12th)
• 147,456 cores (SuperMUC) in Germany (14th)
• 76,032 cores (Tsubame 2.5) at Japan/GSIC (15th)
• 194,616 cores (Cascade) at PNNL (18th)
• 110,400 cores (Pangea) at France/Total (20th)
• 37,120 cores T-Platform A-Class at Russia/MSU (22nd)
• 50,544 cores (Occigen) at France/GENCI-CINES (26th)
• 73,584 cores (Spirit) at USA/Air Force (31st)
• 77,184 cores (Curie thin nodes) at France/CEA (33rd)
• 65,320 cores iDataPlex DX360M4 at Germany/Max-Planck (34th)
• 120,640 cores (Nebulae) at China/NSCS (35th)
• 72,288 cores (Yellowstone) at NCAR (36th)
• 70,560 cores (Helios) at Japan/IFERC (38th)
• 38,880 cores (Cartesius) at Netherlands/SURFsara (45th)
• 23,040 cores (QB-2) at LONI/US (46th)
• 138,368 cores (Tera-100) at France/CEA (47th)
• and many more!

Page 7:

Parallel Programming Models Overview

• Shared Memory Model (e.g., SHMEM, DSM): processes P1, P2, P3 operate on a single shared memory
• Distributed Memory Model (e.g., MPI - Message Passing Interface): each process has its own memory
• Partitioned Global Address Space (PGAS) (e.g., Global Arrays, UPC, Chapel, X10, CAF, …): separate memories presented as a logical shared memory

Programming models provide abstract machine models. Models can be mapped onto different types of systems, e.g., Distributed Shared Memory (DSM), MPI within a node, etc.

Page 8:

MVAPICH2 Software

• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
  - MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  - MVAPICH2-X (MPI + PGAS), available since 2012
  - Support for NVIDIA GPUs, available since 2011
  - Used by more than 2,350 organizations (HPC centers, industry and universities) in 75 countries
  - More than 241,000 downloads from the OSU site directly
  - Empowering many TOP500 clusters (Nov '14 ranking):
    • 7th ranked 519,640-core cluster (Stampede) at TACC
    • 11th ranked 160,768-core cluster (Pleiades) at NASA
    • 15th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
    • and many others
  - Available with software stacks of many IB, HSE and server vendors, including Linux distros (RedHat and SuSE)
  - http://mvapich.cse.ohio-state.edu

Page 9:

Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU: CUDA-Aware MPI
• MVAPICH2-GPU with GPUDirect RDMA (MVAPICH2-GDR)
  - Two-sided Communication
  - One-sided Communication
  - MPI Datatype Processing
  - Optimizations (multi-GPU, device selection, topology detection)
• MPI and OpenACC
• Conclusions

Page 10:

Optimizing MPI Data Movement on GPU Clusters

[Diagram: Node 0 and Node 1, each with CPUs connected by QPI, GPUs attached over PCIe, memory buffers, and IB adapters linking the nodes.]

1. Intra-GPU
2. Intra-Socket GPU-GPU
3. Inter-Socket GPU-GPU
4. Inter-Node GPU-GPU
5. Intra-Socket GPU-Host
6. Inter-Socket GPU-Host
7. Inter-Node GPU-Host
8. Inter-Node GPU-GPU with IB adapter on remote socket
and more . . .

• Connected as PCIe devices - flexibility but complexity
• For each path there are different schemes: Shared_mem, IPC, GPUDirect RDMA, pipeline, …
• Critical for runtimes to optimize data movement while hiding the complexity

Page 11:

InfiniBand + GPU Systems (Past)

• Many systems today want to use both GPUs and high-speed networks such as InfiniBand
• Problem: lack of a common memory registration mechanism
  - Each device has to pin the host memory it will use
  - Many operating systems do not allow multiple devices to register the same memory pages
• Previous solution:
  - Use different buffers for each device and copy data

Page 12:

Sample Code - Without MPI Integration

Naïve implementation with standard MPI and CUDA (data is staged through host memory, then moves through the NIC and switch):

At Sender:
    cudaMemcpy(sbuf, sdev, . . .);
    MPI_Send(sbuf, size, . . .);

At Receiver:
    MPI_Recv(rbuf, size, . . .);
    cudaMemcpy(rdev, rbuf, . . .);

High Productivity and Poor Performance

Page 13:

Sample Code - User-Optimized Code

Pipelining at user level with non-blocking MPI and CUDA interfaces. Code at the sender side (and repeated at the receiver side):

At Sender:
    for (j = 0; j < pipeline_len; j++)
        cudaMemcpyAsync(sbuf + j * blksz, sdev + j * blksz, . . .);
    for (j = 0; j < pipeline_len; j++) {
        while (result != cudaSuccess) {
            result = cudaStreamQuery(…);
            if (j > 0) MPI_Test(…);
        }
        MPI_Isend(sbuf + j * blksz, blksz, . . .);
    }
    MPI_Waitall(…);

User-level copying may not match the internal MPI design.

High Performance and Poor Productivity

Page 14:

Can this be done within the MPI Library?

• Support GPU-to-GPU communication through standard MPI interfaces
  - e.g., enable MPI_Send and MPI_Recv from/to GPU memory
• Provide high performance without exposing low-level details to the programmer
  - Pipelined data transfer automatically provides optimizations inside the MPI library without user tuning
• A new design incorporated in MVAPICH2 to support this functionality

Page 15:

Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU: CUDA-Aware MPI
• MVAPICH2-GPU with GPUDirect RDMA (MVAPICH2-GDR)
  - Two-sided Communication
  - One-sided Communication
  - MPI Datatype Processing
  - Optimizations (multi-GPU, device selection, topology detection)
• MPI and OpenACC
• Conclusions

Page 16:

GPU-Direct

• Collaboration between Mellanox and NVIDIA to converge on one memory registration technique
• Both devices register a common host buffer
• The GPU copies data to this buffer, and the network adapter can directly read from this buffer (or vice versa)
• Note that GPU-Direct does not allow you to bypass host memory

Page 17:

Sample Code - MVAPICH2-GPU (CUDA-Aware MPI)

At Sender:
    MPI_Send(s_device, size, …);

At Receiver:
    MPI_Recv(r_device, size, …);

• MVAPICH2-GPU: standard MPI interfaces used
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers (handled inside MVAPICH2)
• High Performance and High Productivity
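For readers who want to try this, here is a minimal, self-contained sketch (not from the slides) of a two-rank exchange in which cudaMalloc'd device pointers are passed directly to MPI_Send/MPI_Recv; it assumes a CUDA-aware MPI build such as MVAPICH2-GPU, and the buffer size and tag are arbitrary.

    /* Minimal CUDA-aware MPI sketch: device pointers passed straight to MPI. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 1 << 20;                      /* 1M ints, arbitrary size */
        int *d_buf;
        cudaMalloc((void **)&d_buf, n * sizeof(int));

        if (rank == 0) {
            cudaMemset(d_buf, 0, n * sizeof(int));  /* fill the source buffer */
            /* The library stages the data or uses GPUDirect RDMA internally. */
            MPI_Send(d_buf, n, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_buf, n, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d ints into GPU memory\n", n);
        }

        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }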

Page 18:

Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU: CUDA-Aware MPI
• MVAPICH2-GPU with GPUDirect RDMA (MVAPICH2-GDR)
  - Two-sided Communication
  - One-sided Communication
  - MPI Datatype Processing
  - Optimizations (multi-GPU, device selection, topology detection)
• MPI and OpenACC
• Conclusions

Page 19:

GPU-Direct RDMA (GDR) with CUDA

• Network adapter can directly read/write data from/to GPU device memory
• Avoids copies through the host
• Fastest possible communication between GPU and IB HCA
• Allows for better asynchronous communication
• OFED with GDR support is under development by Mellanox and NVIDIA

[Diagram: GPU with GPU memory and CPU with system memory connected through the chipset to InfiniBand.]

Page 20:

GPU-Direct RDMA (GDR) with CUDA

[Diagram: IB adapter, system memory, GPU memory, GPU, CPU and chipset. Peer-to-peer limits: SNB E5-2670 - P2P write 5.2 GB/s, P2P read < 1.0 GB/s; IVB E5-2680 v2 - P2P write 6.4 GB/s, P2P read 3.5 GB/s.]

• MVAPICH2 design extended to use GPUDirect RDMA
• Hybrid design using GPU-Direct RDMA
  - GPUDirect RDMA and host-based pipelining
  - Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
  - Support for communication using multi-rail
  - Support for Mellanox Connect-IB and ConnectX VPI adapters
  - Support for RoCE with Mellanox ConnectX VPI adapters

Page 21:

Using the MVAPICH2-GPUDirect Version

• MVAPICH2-2.1a with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/#mv2gdr/
• System software requirements:
  - Mellanox OFED 2.1 or later
  - NVIDIA Driver 331.20 or later
  - NVIDIA CUDA Toolkit 6.0 or later
  - Plugin for GPUDirect RDMA: http://www.mellanox.com/page/products_dyn?product_family=116
  - Strongly recommended: the new GDRCOPY module from NVIDIA, https://github.com/NVIDIA/gdrcopy
• Has optimized designs for point-to-point communication using GDR
• Contact the MVAPICH help list with any questions related to the package: mvapich-[email protected]
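As an illustration only (the launcher and environment depend on the local installation; MV2_USE_CUDA and MV2_USE_GPUDIRECT_GDRCOPY are the runtime parameters that also appear later in this deck, and gpu_app is a placeholder executable), a CUDA-aware run with the mpirun_rsh launcher might look like:

    mpirun_rsh -np 2 node1 node2 MV2_USE_CUDA=1 MV2_USE_GPUDIRECT_GDRCOPY=1 ./gpu_app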

Page 22:

Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU: CUDA-Aware MPI
• MVAPICH2-GPU with GPUDirect RDMA (MVAPICH2-GDR)
  - Two-sided Communication
  - One-sided Communication
  - MPI Datatype Processing
  - More Optimizations
• MPI and OpenACC
• Conclusions

Page 23:

Enhanced MPI Design with GPUDirect RDMA

• The current MPI design using GPUDirect RDMA uses the Rendezvous protocol, which has higher latency for small messages
• Can the Eager protocol be supported to improve performance for small messages?
• Two schemes proposed and used:
  - Loopback (using the network adapter to copy data)
  - Fastcopy/GDRCOPY (using the CPU to copy data)

[Diagram: Rendezvous protocol (rndz_start, rndz_reply, data and fin messages between sender and receiver) vs. Eager protocol (a single send).]

R. Shi, S. Potluri, K. Hamidouche, M. Li, J. Perkins, D. Rossetti and D. K. Panda, "Designing Efficient Small Message Transfer Mechanism for Inter-node MPI Communication on InfiniBand GPU Clusters", IEEE International Conference on High Performance Computing (HiPC 2014).
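As a hedged reference (the parameter names and values below are copied from the HOOMD-blue configuration later in this deck, not a tuning recommendation), the eager path for GPU buffers is typically steered with runtime parameters such as:

    MV2_IBA_EAGER_THRESHOLD=32768
    MV2_VBUF_TOTAL_SIZE=32768
    MV2_USE_GPUDIRECT_GDRCOPY=1
    MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
    MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768

Roughly speaking, messages below the GDRCOPY/loopback limits can take the CPU- or adapter-based eager copy path, while larger messages fall back to the pipelined rendezvous path.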

Page 24:

Performance of MVAPICH2 with GPU-Direct RDMA: Latency

GPU-GPU Internode MPI Latency

[Chart: small-message latency (us) vs. message size (0 bytes to 2 KB) for MV2-GDR 2.1a, MV2-GDR 2.0b and MV2 without GDR; annotations mark 2.37 usec for the best case and 2.5X and 8.6X improvements.]

Testbed: Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 6.5, Mellanox OFED 2.1 with GPU-Direct RDMA.

Page 25:

Performance of MVAPICH2 with GPU-Direct RDMA: Bandwidth

GPU-GPU Internode MPI Uni-Directional Bandwidth

[Chart: small-message bandwidth (MB/s) vs. message size (1 byte to 4 KB) for MV2-GDR 2.1a, MV2-GDR 2.0b and MV2 without GDR, with 10x and 2.2X improvement markers.]

Testbed: Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 6.5, Mellanox OFED 2.1 with GPU-Direct RDMA.

Page 26:

Performance of MVAPICH2 with GPU-Direct RDMA: Bi-Bandwidth

GPU-GPU Internode MPI Bi-directional Bandwidth

[Chart: small-message bi-directional bandwidth (MB/s) vs. message size (1 byte to 4 KB) for MV2-GDR 2.1a, MV2-GDR 2.0b and MV2 without GDR, with 10x and 2x improvement markers.]

Testbed: Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 6.5, Mellanox OFED 2.1 with GPU-Direct RDMA.

Page 27:

Application-Level Evaluation (HOOMD-blue)

• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384

[Charts: average TPS vs. number of GPU nodes (4 to 64) for weak scalability with 2K particles/GPU and strong scalability with 256K particles.]

Page 28:

Application-Level Evaluation

• LBM-CUDA (Courtesy: Carlos Rosales, TACC)
  - Lattice Boltzmann Method for multiphase flows with large density ratios
  - One process/GPU per node, 16 nodes
• AWP-ODC (Courtesy: Yifeng Cui, SDSC)
  - Seismic modeling code, Gordon Bell Prize finalist at SC 2010
  - 128x256x512 data grid per process, 8 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070s, one Mellanox IB QDR HCA

[Charts: LBM-CUDA step time (s) per domain size (256*256*256 to 512*512*512) for MPI vs. MPI-GPU, with gains of 13.7%, 12.0%, 11.8% and 9.4%; AWP-ODC total execution time (s) for 1 GPU/process per node and 2 GPUs/processes per node, with gains of 11.1% and 7.9%.]

Page 29:

Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU: CUDA-Aware MPI
• MVAPICH2-GPU with GPUDirect RDMA (MVAPICH2-GDR)
  - Two-sided Communication
  - One-sided Communication
  - MPI Datatype Processing
  - Optimizations (multi-GPU, device selection, topology detection)
• MPI and OpenACC
• Conclusions

Page 30:

Performance of MVAPICH2 with GPU-Direct RDMA: MPI-3 RMA

GPU-GPU internode MPI Put latency (RMA put operation, device to device)

MPI-3 RMA provides flexible synchronization and completion primitives.

[Chart: small-message Put latency (us) vs. message size (0 bytes to 8 KB) for MV2-GDR 2.1a and MV2-GDR 2.0b; MV2-GDR 2.1a reaches 2.88 us, a 7X improvement.]

Testbed: Intel Ivy Bridge (E5-2680 v2) node with 20 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 6.5, Mellanox OFED 2.1 with GPU-Direct RDMA.
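To make the one-sided usage concrete, here is a minimal sketch (not from the slides) of an MPI-3 put into a window backed by GPU memory, assuming a CUDA-aware MPI such as MVAPICH2-GDR; the buffer size and the lock-all/flush pattern are illustrative choices.

    /* Sketch: MPI-3 RMA put between device buffers (assumes CUDA-aware MPI). */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int n = 4096;
        double *d_buf;
        cudaMalloc((void **)&d_buf, n * sizeof(double));

        /* Expose the device buffer as an RMA window. */
        MPI_Win win;
        MPI_Win_create(d_buf, n * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_lock_all(0, win);            /* passive-target epoch */
        if (rank == 0) {
            /* Put from local GPU memory into rank 1's GPU-resident window. */
            MPI_Put(d_buf, n, MPI_DOUBLE, 1, 0, n, MPI_DOUBLE, win);
            MPI_Win_flush(1, win);           /* complete the put at the target */
        }
        MPI_Win_unlock_all(win);

        MPI_Win_free(&win);
        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }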

Page 31:

Communication Kernel Evaluation with MPI-3 RMA: 3D Stencil

Based on MVAPICH2-2.0b.

[Chart: 3D stencil latency (usec) vs. message size (1 byte to 4 KB) on 16 GPU nodes, comparing Send/Recv with one-sided communication; the one-sided version is up to 42% better.]

Testbed: Intel Sandy Bridge (E5-2670) node with 16 cores, NVIDIA Tesla K40c GPU, Mellanox Connect-IB dual-FDR HCA, CUDA 5.5, Mellanox OFED 2.1 with the GPUDirect RDMA plugin.

Page 32:

Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU: CUDA-Aware MPI
• MVAPICH2-GPU with GPUDirect RDMA (MVAPICH2-GDR)
  - Two-sided Communication
  - One-sided Communication
  - MPI Datatype Processing
  - Optimizations (multi-GPU, device selection, topology detection)
• MPI and OpenACC
• Conclusions

Page 33:

Non-contiguous Data Exchange

• Multi-dimensional data
  - Row-based organization
  - Contiguous in one dimension
  - Non-contiguous in the other dimensions
• Halo data exchange
  - Duplicate the boundary
  - Exchange the boundary in each iteration

[Figure: halo data exchange between neighboring sub-domains.]

Page 34:

MPI Datatype Support in MVAPICH2

At Sender:
    MPI_Type_vector(n_blocks, n_elements, stride, old_type, &new_type);
    MPI_Type_commit(&new_type);
    …
    MPI_Send(s_buf, size, new_type, dest, tag, MPI_COMM_WORLD);

• Datatype support in MPI
  - Operate on customized datatypes to improve productivity
  - Enables the MPI library to optimize non-contiguous data
• Inside MVAPICH2
  - Uses datatype-specific CUDA kernels to pack data in chunks
  - Efficiently moves data between nodes using RDMA
  - Optimizes datatypes from GPU memories
  - Transparent to the user

H. Wang, S. Potluri, D. Bureddy, C. Rosales and D. K. Panda, "GPU-aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation", IEEE Transactions on Parallel and Distributed Systems.
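As a worked illustration (a sketch, not code from the slides), the vector datatype above can describe one non-contiguous column of a row-major matrix that lives in GPU memory; the matrix dimensions and the tag below are made up for the example, and a CUDA-aware MPI is assumed so the device pointer can be passed directly.

    /* Sketch: send a strided column of an N x M row-major matrix stored on
     * the GPU, using MPI_Type_vector with a CUDA-aware MPI. N, M and the tag
     * are illustrative values. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    enum { N = 1024, M = 512 };

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *d_matrix;
        cudaMalloc((void **)&d_matrix, N * M * sizeof(double));

        /* One element per row, stride of M elements between them: a column. */
        MPI_Datatype column;
        MPI_Type_vector(N, 1, M, MPI_DOUBLE, &column);
        MPI_Type_commit(&column);

        if (rank == 0)
            MPI_Send(d_matrix, 1, column, 1, 42, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(d_matrix, 1, column, 0, 42, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        MPI_Type_free(&column);
        cudaFree(d_matrix);
        MPI_Finalize();
        return 0;
    }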

Page 35:

MPI Datatype Processing

• Comprehensive support
  - Targeted kernels for regular datatypes: vector, subarray, indexed_block
  - Generic kernels for all other irregular datatypes
• Separate non-blocking stream for kernels launched by the MPI library
  - Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune the kernels
  - Vector: MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE, MV2_CUDA_KERNEL_VECTOR_YSIZE
  - Subarray: MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE, MV2_CUDA_KERNEL_SUBARR_XDIM, MV2_CUDA_KERNEL_SUBARR_YDIM, MV2_CUDA_KERNEL_SUBARR_ZDIM
  - Indexed_block: MV2_CUDA_KERNEL_IDXBLK_XDIM

R. Shi, X. Lu, S. Potluri, K. Hamidouche, J. Zhang and D. K. Panda, "HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement using MPI Datatypes on GPU Clusters", International Conference on Parallel Processing (ICPP '14).

Page 36:

Performance of Stencil3D (3D subarray)

Stencil3D communication kernel on 2 GPUs with various X, Y, Z dimensions using MPI_Isend/Irecv.
• DT: Direct Transfer, TR: Targeted Kernel
• The optimized (Enhanced) design gains up to 15%, 15% and 22% compared to TR, and more than 86% compared to DT, on X, Y and Z respectively

[Charts: latency (ms) vs. size of DimX [x,16,16], DimY [16,y,16] and DimZ [16,16,z], each from 1 to 256, comparing DT, TR and Enhanced; up to 86% improvement.]

Page 37:

Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU: CUDA-Aware MPI
• MVAPICH2-GPU with GPUDirect RDMA (MVAPICH2-GDR)
  - Two-sided Communication
  - One-sided Communication
  - MPI Datatype Processing
  - Optimizations (multi-GPU, device selection, topology detection)
• MPI and OpenACC
• Conclusions

Page 38:

Multi-GPU Configurations (K80 case study)

• Multi-GPU node architectures are becoming common
• Until CUDA 3.2
  - Communication between processes staged through the host
  - Shared memory (pipelined)
  - Network loopback (asynchronous)
• CUDA 4.0 and later
  - Inter-Process Communication (IPC)
  - Host bypass, handled by a DMA engine
  - Low latency and asynchronous
  - Requires creation, exchange and mapping of memory handles (overhead); see the sketch below

[Diagram: two processes on one node sharing a CPU, memory, an I/O hub, an HCA and two GPUs (GPU 0 and GPU 1).]
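For reference, a minimal sketch (not from the slides) of the CUDA IPC handle creation, exchange and mapping that a library performs under the covers for two processes on the same node; shipping the handle with MPI_Send/MPI_Recv is purely for illustration, and the buffer size is arbitrary.

    /* Sketch: process 0 exports a device allocation via CUDA IPC; process 1
     * on the same node maps it and copies from it with the DMA engine. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            float *d_src;
            cudaMalloc((void **)&d_src, 1 << 20);
            cudaIpcMemHandle_t handle;
            cudaIpcGetMemHandle(&handle, d_src);          /* create the handle */
            MPI_Send(&handle, sizeof(handle), MPI_BYTE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            cudaIpcMemHandle_t handle;
            MPI_Recv(&handle, sizeof(handle), MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            void *d_peer;
            cudaIpcOpenMemHandle(&d_peer, handle,         /* map peer memory */
                                 cudaIpcMemLazyEnablePeerAccess);
            float *d_dst;
            cudaMalloc((void **)&d_dst, 1 << 20);
            cudaMemcpy(d_dst, d_peer, 1 << 20,            /* device-to-device */
                       cudaMemcpyDeviceToDevice);
            cudaIpcCloseMemHandle(d_peer);                /* unmap when done */
        }

        MPI_Barrier(MPI_COMM_WORLD);  /* keep the exporter alive until done */
        MPI_Finalize();
        return 0;
    }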

Page 39:

Intra-Node Communication Enhancement

[Charts: intra-node GPU-GPU latency (us) and bandwidth (MB/s) vs. message size, MV2-2.1a vs. the Enhanced design, with 4.7X and 7.8X improvement markers.]

• Two processes sharing the same K80 GPUs
• The proposed designs achieve a 4.7X improvement in latency
• A 7.8X improvement is delivered for bandwidth
• Will be available with the next releases of MVAPICH2-GDR

Page 40:

Device Selection

• MVAPICH2 1.8, 1.9 and other MPI libraries require device selection before MPI_Init
• This restriction has been removed with MVAPICH2 2.0 and later
• The OSU micro-benchmarks still select the device before MPI_Init for backward compatibility
  - Uses node-level rank information exposed by launchers
  - We provide a script which exports this node-rank information for use by the benchmark; it can be modified to work with different MPI libraries without modifying the benchmarks themselves (see the sketch below)
  - Sample script provided with OMB: "get_local_rank"
        export LOCAL_RANK=$MV2_COMM_WORLD_LOCAL_RANK
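A minimal sketch (not from the slides) of how a wrapper-exported local rank can drive device selection before MPI_Init; the environment variable name LOCAL_RANK follows the get_local_rank convention above, and the modulo mapping is just one common choice.

    /* Sketch: pick a GPU from the node-local rank exported by a wrapper
     * script (e.g., LOCAL_RANK from get_local_rank) before MPI_Init. */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int local_rank = 0;
        const char *env = getenv("LOCAL_RANK");
        if (env)
            local_rank = atoi(env);

        int num_devices = 0;
        cudaGetDeviceCount(&num_devices);
        if (num_devices > 0)
            cudaSetDevice(local_rank % num_devices);  /* bind before MPI_Init */

        MPI_Init(&argc, &argv);
        /* ... application ... */
        MPI_Finalize();
        return 0;
    }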

Page 41:

Topology Detection

• Automatic detection of GPU and IB placement:
  - Avoids the inter-socket QPI bottlenecks
  - Selects the nearest HCA and binds the process to the associated CPU set
• Dynamic threshold selection between GDR and host-based transfers

Page 42:

Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU: CUDA-Aware MPI
• MVAPICH2-GPU with GPUDirect RDMA (MVAPICH2-GDR)
  - Two-sided Communication
  - One-sided Communication
  - MPI Datatype Processing
  - Optimizations (multi-GPU, device selection, topology detection)
• MPI and OpenACC
• Conclusions

Page 43:

OpenACC

• OpenACC is gaining popularity; several sessions during GTC
• A set of compiler directives (#pragma)
• Offload specific loops or parallelizable sections in code onto accelerators:

    #pragma acc region
    {
        for (i = 0; i < size; i++) {
            A[i] = B[i] + C[i];
        }
    }

• Routines to allocate/free memory on accelerators:

    buffer = acc_malloc(MYBUFSIZE);
    acc_free(buffer);

• Supported for C, C++ and Fortran
• Huge list of modifiers - copy, copyout, private, independent, etc.

Page 44:

Using MVAPICH2 with OpenACC 1.0

• acc_malloc to allocate device memory
  - No changes to MPI calls
  - MVAPICH2 detects the device pointer and optimizes data movement
  - Delivers the same performance as with CUDA

    A = acc_malloc(sizeof(int) * N);
    …
    #pragma acc parallel loop deviceptr(A)
    . . . // compute for loop
    MPI_Send(A, N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    …
    acc_free(A);

Page 45:

Using MVAPICH2 with OpenACC 2.0

• acc_deviceptr to get the device pointer (in OpenACC 2.0)
  - Enables MPI communication from memory allocated by the compiler when it is available in OpenACC 2.0 implementations
  - MVAPICH2 will detect the device pointer and optimize communication
  - Delivers the same performance as with CUDA

    A = malloc(sizeof(int) * N);
    …
    #pragma acc data copyin(A) . . .
    {
        #pragma acc parallel loop
        . . . // compute for loop
        MPI_Send(acc_deviceptr(A), N, MPI_INT, 0, 1, MPI_COMM_WORLD);
    }
    …
    free(A);

Page 46:

CUDA and OpenACC Extensions in OMB

• The OSU Micro-Benchmarks (OMB) are widely used to compare the performance of different MPI stacks and networks
• Enhancements to measure the performance of MPI communication from GPU memory
  - Point-to-point: latency, bandwidth and bi-directional bandwidth
  - Collectives: latency tests for all collectives
• Support for CUDA and OpenACC
• Flexible selection of data movement between CPU (H) and GPU (D): D->D, D->H and H->D
• Available from http://mvapich.cse.ohio-state.edu/benchmarks
• Available in an integrated manner with the MVAPICH2 stack
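For illustration only (command lines are not part of the slides and the exact launcher syntax may differ per installation), a device-to-device latency run with OMB under MVAPICH2 might look like:

    mpirun_rsh -np 2 node1 node2 MV2_USE_CUDA=1 ./get_local_rank ./osu_latency D D

where the trailing D D arguments select device buffers at both the sender and the receiver, and get_local_rank is the wrapper script described earlier.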

Page 47:

Outline

• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU: CUDA-Aware MPI
• MVAPICH2-GPU with GPUDirect RDMA (MVAPICH2-GDR)
  - Two-sided Communication
  - One-sided Communication
  - MPI Datatype Processing
  - Optimizations (multi-GPU, device selection, topology detection)
• MPI and OpenACC
• Conclusions

Page 48:

Conclusions

• MVAPICH2 optimizes MPI communication on InfiniBand clusters with GPUs
• Close partnership with NVIDIA during the last four years
• Provides optimized designs for point-to-point two-sided and one-sided communication, and for datatype processing
• Takes advantage of CUDA features like IPC, GPUDirect RDMA and GDRCOPY
• Delivers high performance and high productivity, with support for the latest NVIDIA GPUs and InfiniBand adapters

Page 49:

Acknowledgments

Dr. Davide Rossetti
Filippo Spiga and Stuart Rankin, HPCS, University of Cambridge (Wilkes Cluster)

Page 50:

Funding Acknowledgments

Funding support by: [sponsor logos]
Equipment support by: [vendor logos]

Page 51:

Personnel Acknowledgments

Current Students: A. Awan (Ph.D.), A. Bhat (M.S.), S. Chakraborthy (Ph.D.), C.-H. Chu (Ph.D.), N. Islam (Ph.D.), M. Li (Ph.D.), M. Rahman (Ph.D.), D. Shankar (Ph.D.), A. Venkatesh (Ph.D.), J. Zhang (Ph.D.)

Past Students: P. Balaji (Ph.D.), D. Buntinas (Ph.D.), S. Bhagvat (M.S.), L. Chai (Ph.D.), B. Chandrasekharan (M.S.), N. Dandapanthula (M.S.), V. Dhanraj (M.S.), T. Gangadharappa (M.S.), K. Gopalakrishnan (M.S.), W. Huang (Ph.D.), W. Jiang (M.S.), J. Jose (Ph.D.), S. Kini (M.S.), M. Koop (Ph.D.), R. Kumar (M.S.), S. Krishnamoorthy (M.S.), K. Kandalla (Ph.D.), P. Lai (M.S.), J. Liu (Ph.D.), M. Luo (Ph.D.), A. Mamidala (Ph.D.), G. Marsh (M.S.), V. Meshram (M.S.), S. Naravula (Ph.D.), R. Noronha (Ph.D.), X. Ouyang (Ph.D.), S. Pai (M.S.), S. Potluri (Ph.D.), R. Rajachandrasekar (Ph.D.), G. Santhanaraman (Ph.D.), A. Singh (Ph.D.), J. Sridhar (M.S.), S. Sur (Ph.D.), H. Subramoni (Ph.D.), K. Vaidyanathan (Ph.D.), A. Vishnu (Ph.D.), J. Wu (Ph.D.), W. Yu (Ph.D.)

Past Research Scientist: S. Sur
Current Post-Doc: J. Lin
Current Programmer: J. Perkins
Past Post-Docs: H. Wang, X. Besseron, H.-W. Jin, M. Luo, E. Mancini, S. Marcarelli, J. Vienne
Current Senior Research Associates: K. Hamidouche, X. Lu
Past Programmers: D. Bureddy, H. Subramoni
Current Research Specialist: M. Arnold

Page 52:

Thanks! Questions?

Contact: [email protected], [email protected]

http://mvapich.cse.ohio-state.edu
http://nowlab.cse.ohio-state.edu