Source: mvapich.cse.ohio-state.edu/static/media/talks/slide/osc_theater-PGAS.pdf

Page 1

Support Hybrid MPI+PGAS (UPC/OpenSHMEM/CAF) Programming Models through a Unified Runtime:

An MVAPICH2-X Approach

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Talk at the OSC theater (SC ’15)

Page 2

Parallel Programming Models Overview

[Figure: three abstract machine models. Shared Memory Model (DSM): processes P1-P3 share one memory. Distributed Memory Model (MPI, Message Passing Interface): P1-P3 each have their own memory. Partitioned Global Address Space (PGAS) (Global Arrays, UPC, Chapel, X10, CAF, …): P1-P3 have their own memories combined into a logical shared memory.]

• Programming models provide abstract machine models

• Models can be mapped on different types of systems

– e.g. Distributed Shared Memory (DSM), MPI within a node, etc.

• Additionally, OpenMP can be used to parallelize computation

within the node

• Each model has strengths and drawbacks and suits different problems or applications

2 SC15

Page 3

Partitioned Global Address Space (PGAS) Models

• Key features

- Simple shared memory abstractions

- Lightweight one-sided communication

- Easier to express irregular communication

• Different approaches to PGAS

- Languages

• Unified Parallel C (UPC)

• Co-array Fortran (CAF)

• X10

• Chapel

- Libraries

• OpenSHMEM

• Global Arrays

3 SC15

Page 4

OpenSHMEM: PGAS library

• SHMEM implementations – Cray SHMEM, SGI SHMEM, Quadrics SHMEM, HP SHMEM, GSHMEM

• Subtle differences in the API across implementations and versions made application codes non-portable – example:

                    SGI SHMEM      Quadrics SHMEM   Cray SHMEM
    Initialization  start_pes(0)   shmem_init       start_pes
    Process ID      _my_pe         my_pe            shmem_my_pe

• OpenSHMEM is an effort to address this: “A new, open specification to consolidate the various extant SHMEM versions into a widely accepted standard.” – OpenSHMEM Specification v1.0, by University of Houston and Oak Ridge National Lab; SGI SHMEM is the baseline
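To make the consolidated interface concrete, here is a minimal sketch of a portable OpenSHMEM 1.0 program; it assumes any conforming OpenSHMEM 1.0 library (for example the one bundled with MVAPICH2-X), and the ring exchange itself is only illustrative:

    #include <stdio.h>
    #include <shmem.h>

    int main(void)
    {
        static int src, dst;               /* static => symmetric data objects */

        start_pes(0);                      /* OpenSHMEM 1.0 initialization */
        int me   = _my_pe();
        int npes = _num_pes();

        src = me;
        shmem_barrier_all();

        /* One-sided put of this PE's id into dst on the next PE (ring). */
        shmem_int_put(&dst, &src, 1, (me + 1) % npes);
        shmem_barrier_all();

        printf("PE %d received %d\n", me, dst);
        return 0;
    }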

4 SC15

Page 5

• UPC: a parallel extension to the C standard

• UPC Specifications and Standards: – Introduction to UPC and Language Specification, 1999

– UPC Language Specifications, v1.0, Feb 2001

– UPC Language Specifications, v1.1.1, Sep 2004

– UPC Language Specifications, v1.2, June 2005

– UPC Language Specifications, v1.3, Nov 2013

• UPC Consortium – Academic Institutions: GWU, MTU, UCB, U. Florida, U. Houston, U. Maryland…

– Government Institutions: ARSC, IDA, LBNL, SNL, US DOE…

– Commercial Institutions: HP, Cray, Intrepid Technology, IBM, …

• Supported by several UPC compilers – Vendor-based commercial UPC compilers: HP UPC, Cray UPC, SGI UPC

– Open-source UPC compilers: Berkeley UPC, GCC UPC, Michigan Tech MuPC

• Aims for: high performance, coding efficiency, irregular applications, …

Unified Parallel C: PGAS language
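As a flavor of the language, here is a minimal UPC sketch; it assumes a UPC 1.2 compiler (e.g., Berkeley UPC or GCC UPC), and the array size, block size, and printed output are only illustrative:

    #include <upc_relaxed.h>
    #include <stdio.h>

    #define BLOCK 64

    /* Shared array with one block of BLOCK elements per thread. */
    shared [BLOCK] int a[BLOCK * THREADS];

    int main(void)
    {
        /* The affinity expression &a[i] runs each iteration on the
           thread that owns a[i], so all writes below are local. */
        upc_forall (int i = 0; i < BLOCK * THREADS; i++; &a[i])
            a[i] = MYTHREAD;

        upc_barrier;

        if (MYTHREAD == 0)                 /* remote read of a shared element */
            printf("a[last] is owned by thread %d and holds %d\n",
                   (int) upc_threadof(&a[BLOCK * THREADS - 1]),
                   a[BLOCK * THREADS - 1]);
        return 0;
    }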

5 SC15

Page 6

Co-array Fortran (CAF): Language-level PGAS support in Fortran

• An extension to Fortran to support global shared arrays in parallel Fortran applications

• CAF = CAF compiler + CAF runtime (libcaf)

• Coarray syntax and basic synchronization support in Fortran 2008

• Collective communication and atomic operations in upcoming Fortran 2015

E.g.:

    real :: a(n)[*]            ! coarray: one copy of a(n) on every image
    x(:)[q] = x(:) + x(:)[p]   ! one-sided remote read from image p and write to image q
    name[i] = name             ! put the local value of name to image i
    sync all                   ! barrier across all images
    ...
    ATOMIC_ADD
    ATOMIC_FETCH_ADD
    ...

6 SC15

Page 7

• Hierarchical architectures with multiple address spaces

• (MPI + PGAS) Model

– MPI across address spaces

– PGAS within an address space

• MPI is good at moving data between address spaces

• Within an address space, MPI can interoperate with other shared

memory programming models

• Applications can have kernels with different communication patterns

• Can benefit from different models

• Re-writing complete applications can be a huge effort

• Port critical kernels to the desired model instead

MPI+PGAS for Exascale Architectures and Applications

7 SC15

Page 8

• High Performance open-source MPI Library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)

– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002

– MVAPICH2-X (MPI + PGAS), Available since 2011

– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014

– Support for Virtualization (MVAPICH2-Virt), Available since 2015

– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015

– Used by more than 2,475 organizations in 76 countries

– More than 308,000 downloads from the OSU site directly

– Empowering many TOP500 clusters (June ‘15 ranking)

• 10th ranked 519,640-core cluster (Stampede) at TACC

• 13th ranked 185,344-core cluster (Pleiades) at NASA

• 25th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others

– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)

– http://mvapich.cse.ohio-state.edu

• Empowering Top500 systems for over a decade

– System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->

– Stampede at TACC (8th in Jun’15, 519,640 cores, 5.168 PFlops)

MVAPICH2 Software

8 SC15

Page 9

Hybrid (MPI+PGAS) Programming

• Application sub-kernels can be re-written in MPI/PGAS based

on communication characteristics

• Benefits:

– Best of Distributed Computing Model

– Best of Shared Memory Computing Model

• Exascale Roadmap*:

– “Hybrid Programming is a practical way to

program exascale systems”

* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420

[Figure: an HPC Application composed of Kernel 1 through Kernel N, all originally written in MPI; selected kernels (e.g., Kernel 2 and Kernel N) are re-written in PGAS.]

9 SC15

Page 10

MVAPICH2-X for Hybrid MPI + PGAS Applications

[Figure: MVAPICH2-X architecture – MPI, OpenSHMEM, UPC, CAF, or hybrid (MPI + PGAS) applications issue MPI Calls, OpenSHMEM Calls, UPC Calls, and CAF Calls into the Unified MVAPICH2-X Runtime, which runs over InfiniBand, RoCE, and iWARP.]

• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF available with

MVAPICH2-X 1.9 (2012) onwards!

– http://mvapich.cse.ohio-state.edu

• Feature Highlights

– Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, MPI(+OpenMP) + OpenSHMEM,

MPI(+OpenMP) + UPC

– MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard

compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH)

– Scalable Inter-node and intra-node communication – point-to-point and collectives

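To illustrate what the unified runtime supports, here is a minimal sketch of a hybrid MPI+OpenSHMEM program; it assumes an MVAPICH2-X-style installation where both models can be initialized in the same job, and the ring-plus-reduction pattern is only illustrative:

    #include <stdio.h>
    #include <mpi.h>
    #include <shmem.h>

    static int token;                      /* static => symmetric data object */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        start_pes(0);                      /* OpenSHMEM on the same processes */

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        token = rank;
        shmem_barrier_all();

        /* PGAS-style one-sided exchange ... */
        shmem_int_put(&token, &token, 1, (rank + 1) % size);
        shmem_barrier_all();

        /* ... mixed with MPI collectives in the same application. */
        int sum = 0;
        MPI_Allreduce(&token, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of received tokens = %d\n", sum);

        MPI_Finalize();
        return 0;
    }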

10 SC15

Page 11

OpenSHMEM Collective Communication: Performance

[Figure: four panels comparing MV2X-SHMEM, Scalable-SHMEM, and OMPI-SHMEM – Reduce (1,024 processes), Broadcast (1,024 processes), and Collect (1,024 processes) latency (us) vs. message size, and Barrier latency (us) vs. number of processes (128-2048). MV2X-SHMEM shows improvements of up to 180X, 18X, 12X, and 3X across the four benchmarks.]

11 SC15

Page 12

Hybrid MPI+UPC NAS-FT

• Modified NAS FT UPC all-to-all pattern using MPI_Alltoall

• Truly hybrid program

• For FT (Class C, 128 processes)

• 34% improvement over UPC (GASNet)

• 30% improvement over UPC (MV2-X)

[Figure: NAS-FT time (s) vs. NAS problem size and system size (B-64, C-64, B-128, C-128) for UPC (GASNet), UPC (MV2-X), and MPI+UPC (MV2-X); the hybrid version is 34% faster for Class C on 128 processes.]

J. Jose, M. Luo, S. Sur and D. K. Panda, Unifying UPC and MPI Runtimes: Experience with MVAPICH, Fourth Conference on Partitioned Global Address Space Programming Model (PGAS '10), October 2010


12 SC15

Page 13

CAF One-sided Communication: Performance

[Figure: CAF one-sided Put and Get bandwidth (MB/s) vs. message size (bytes) and latency (ms) for GASNet-IBV, GASNet-MPI, and MV2X, plus NAS-CAF execution time (sec) for bt.D.256, cg.D.256, ep.D.256, ft.D.256, mg.D.256, and sp.D.256.]

• Micro-benchmark improvement (MV2X vs. GASNet-IBV, UH CAF test suite)
   – Put bandwidth: 3.5X improvement at 4KB; Put latency: 29% reduction at 4B
• Application performance improvement (NAS-CAF one-sided implementation)
   – Execution time reduced by 12% (SP.D.256) and 18% (BT.D.256)

J. Lin, K. Hamidouche, X. Lu, M. Li and D. K. Panda, High-performance Co-array Fortran support with MVAPICH2-X: Initial experience and evaluation, HIPS’15

13 SC15

Page 14

Graph500 - BFS Traversal Time

• Hybrid design performs better than MPI implementations

• 16,384 processes - 1.5X improvement over MPI-CSR

- 13X improvement over MPI-Simple (Same communication characteristics)

• Strong Scaling Graph500 Problem Scale = 29

[Figure: left – BFS traversal time (s) vs. number of processes (4K, 8K, 16K), where the hybrid design is up to 7.6X and 13X faster; right – strong scaling, traversed edges per second vs. number of processes (1024-8192) for MPI-Simple, MPI-CSC, MPI-CSR, and Hybrid.]

14 SC15

Page 15

Hybrid MPI+OpenSHMEM Sort Application

[Figure: left – Execution Time (sec) vs. No. of Processes - Input Data (1024-1TB through 8192-1TB), where the hybrid designs are up to 45% faster than MPI; right – Weak Scalability, Sort Rate (TB/min) vs. No. of Processes - Input Data (512-1TB through 4096-8TB), where the hybrid designs are up to 38% faster; all panels compare MPI, Hybrid-SR, and Hybrid-ER.]

J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC’13), June 2013

• Performance of the Hybrid (MPI+OpenSHMEM) Sort Application
• Execution Time (seconds)
   - 1TB input size at 8,192 cores: MPI – 164, Hybrid-SR (Simple Read) – 92.5, Hybrid-ER (Eager Read) – 90.36
   - 45% improvement over the MPI-based design
• Weak Scalability (configuration: input size of 1TB per 512 cores)
   - At 4,096 cores: MPI – 0.25 TB/min, Hybrid-SR – 0.34 TB/min, Hybrid-ER – 0.38 TB/min
   - 38% improvement over the MPI-based design


15 SC15

Page 16

Accelerating MaTEx k-NN with Hybrid MPI and OpenSHMEM

[Figure: k-NN execution time with the full KDD workload (2.5GB) on 512 cores (9.0% improvement) and the truncated KDD workload (30MB) on 256 cores (27.6% improvement).]

• Benchmark: KDD Cup 2010 (8,407,752 records, 2 classes, k=5)
• For the truncated KDD workload on 256 cores, execution time is reduced by 27.6%
• For the full KDD workload on 512 cores, execution time is reduced by 9.0%

J. Lin, K. Hamidouche, J. Zhang, X. Lu, A. Vishnu, D. Panda. Accelerating k-NN Algorithm with Hybrid MPI and OpenSHMEM,

OpenSHMEM 2015

• MaTEx: MPI-based machine learning algorithm library
• k-NN: a popular supervised algorithm for classification
• Hybrid designs:
   – Overlapped Data Flow; One-sided Data Transfer; Circular-buffer Structure

16 SC15

Page 17

• Before CUDA 4: Additional copies

• Low performance and low productivity

• After CUDA 4: Host-based pipeline

• Unified Virtual Address

• Pipeline CUDA copies with IB transfers

• High performance and high productivity

• After CUDA 5.5: GPUDirect-RDMA support

• GPU to GPU direct transfer

• Bypass the host memory

• Hybrid design to avoid PCI bottlenecks

CUDA-Aware Concept

[Figure: node diagram with CPU, GPU, GPU memory, and chip set connected to InfiniBand, illustrating CUDA-aware data movement between GPU memory and the network.]

Page 18

Limitations of OpenSHMEM for GPU Computing

• OpenSHMEM memory model does not support disjoint memory address

spaces - case with GPU clusters

Existing OpenSHMEM Model with CUDA – GPU-to-GPU Data Movement:

    PE 0:
        host_buf = shmalloc (…)
        cudaMemcpy (host_buf, dev_buf, size, …)    /* device-to-host staging copy */
        shmem_putmem (host_buf, host_buf, size, pe)
        shmem_barrier (…)

    PE 1:
        host_buf = shmalloc (…)
        shmem_barrier (…)
        cudaMemcpy (dev_buf, host_buf, size, …)    /* host-to-device copy after the put */

• Copies severely limit the performance
• Synchronization negates the benefits of one-sided communication
• Similar issues with UPC

18 SC15

Page 19

Global Address Space with Host and Device Memory

[Figure: extended global address space – each PE has private and shared regions in both host memory and device memory; the shared regions across N PEs form a shared space on host memory and a shared space on device memory.]

• Extended APIs:
   – heap_on_device/heap_on_host – a way to indicate the location of the heap
   – host_buf = shmalloc (sizeof(int), 0);
   – dev_buf = shmalloc (sizeof(int), 1);

CUDA-Aware OpenSHMEM (same design for UPC):

    PE 0:
        dev_buf = shmalloc (size, 1);
        shmem_putmem (dev_buf, dev_buf, size, pe)

    PE 1:
        dev_buf = shmalloc (size, 1);

S. Potluri, D. Bureddy, H. Wang, H. Subramoni and D. K. Panda, Extending OpenSHMEM for GPU Computing, IPDPS’13
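A hypothetical sketch putting these extended calls together; the two-argument shmalloc that selects the host (0) or device (1) heap is the extension proposed in the IPDPS’13 work rather than standard OpenSHMEM, and the buffer size, neighbor choice, and initialization are only illustrative:

    #include <shmem.h>
    #include <cuda_runtime.h>

    void exchange_gpu_buffers(size_t size)
    {
        int me   = _my_pe();
        int npes = _num_pes();

        /* Symmetric buffer allocated directly on device memory. */
        void *dev_buf = shmalloc(size, 1);

        cudaMemset(dev_buf, me, size);     /* initialize on the GPU */
        shmem_barrier_all();

        /* One-sided GPU-to-GPU put; the runtime stages through the host,
           uses CUDA IPC, or uses GPUDirect RDMA underneath. */
        shmem_putmem(dev_buf, dev_buf, size, (me + 1) % npes);
        shmem_barrier_all();

        shfree(dev_buf);
    }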

SC15 19

Page 20

• After device memory becomes part of the global shared space:

- Accessible through standard UPC/OpenSHMEM communication APIs

- Data movement transparently handled by the runtime

- Preserves one-sided semantics at the application level

• Efficient designs to handle communication

- Inter-node transfers use host-staged transfers with pipelining

- Intra-node transfers use CUDA IPC

- Possibility to take advantage of GPUDirect RDMA (GDR)

• Goal: enabling high-performance one-sided communication semantics with GPU devices

CUDA-aware OpenSHMEM and UPC runtimes

20 SC15

Page 21

• GDR for small/medium message sizes

• Host-staging for large message to avoid PCIe

bottlenecks

• Hybrid design brings best of both

• 3.13 us Put latency for 4B (7X improvement )

and 4.7 us latency for 4KB

• 9X improvement for Get latency of 4B

Exploiting GDR: OpenSHMEM: Inter-node Evaluation

[Figure: latency (us) vs. message size (bytes) comparing Host-Pipeline and GDR – small-message shmem_put D-D (1B-4KB, 7X improvement), large-message shmem_put D-D (8KB-2MB), and small-message shmem_get D-D (9X improvement).]

SC15 21

Page 22

- Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB)
- New designs achieve 20% and 19% improvements on 32 and 64 GPU nodes

K. Hamidouche, A. Venkatesh, A. Awan, H. Subramoni, C.-H. Chu and D. K. Panda, Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for GPU Clusters, IEEE Cluster 2015

Application Evaluation: GPULBM and 2DStencil

[Figure: weak scaling on 8-64 GPU nodes comparing Host-Pipeline and Enhanced-GDR – GPULBM (64x64x64) evolution time (msec) and 2DStencil (2Kx2K) execution time (sec); Enhanced-GDR is 19% faster at 64 nodes.]

• Redesigned the application: CUDA-Aware MPI (Send/Recv) => hybrid CUDA-Aware MPI+OpenSHMEM
   – cudaMalloc => shmalloc(size,1)
   – MPI_Send/Recv => shmem_put + fence
• 53% and 45% improvements
• Degradation is due to the small input size
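A hypothetical sketch of that transformation on a single exchange; the function names and halo-exchange framing are illustrative, and the device buffers passed to the OpenSHMEM variant are assumed to come from the extended shmalloc(size, 1):

    #include <mpi.h>
    #include <shmem.h>

    /* Before: two-sided, CUDA-aware MPI on cudaMalloc'd device buffers. */
    void exchange_mpi(void *dev_send, void *dev_recv, int n, int peer)
    {
        MPI_Sendrecv(dev_send, n, MPI_BYTE, peer, 0,
                     dev_recv, n, MPI_BYTE, peer, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* After: one-sided put into the peer's device-resident symmetric
       buffer, ordered with a fence; completion at the target still needs
       a later shmem_barrier/shmem_quiet before the data is consumed. */
    void exchange_shmem(void *dev_send, void *dev_recv, size_t n, int peer)
    {
        shmem_putmem(dev_recv, dev_send, n, peer);
        shmem_fence();
    }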

22 SC15

Page 23

• Presented an overview of Hybrid MPI+PGAS models

• Outlined research challenges in designing efficient runtimes for

these models on clusters with InfiniBand

• Demonstrated the benefits of Hybrid MPI+PGAS models for a set of applications on host-based systems

• Presented the model extension and runtime support for

Accelerator-Aware hybrid MPI+PGAS model

• The Hybrid MPI+PGAS model is an emerging paradigm that can lead to high-performance and scalable implementations of applications on exascale computing systems using accelerators

Concluding Remarks

23 SC15

Page 24

Funding Acknowledgments

Funding Support by

Equipment Support by

24 SC15

Page 25

Personnel Acknowledgments

Current Students – A. Augustine (M.S.)

– A. Awan (Ph.D.)

– A. Bhat (M.S.)

– S. Chakraborthy (Ph.D.)

– C.-H. Chu (Ph.D.)

– N. Islam (Ph.D.)

Past Students – P. Balaji (Ph.D.)

– D. Buntinas (Ph.D.)

– S. Bhagvat (M.S.)

– L. Chai (Ph.D.)

– B. Chandrasekharan (M.S.)

– N. Dandapanthula (M.S.)

– V. Dhanraj (M.S.)

– T. Gangadharappa (M.S.)

– K. Gopalakrishnan (M.S.)

– G. Santhanaraman (Ph.D.)

– A. Singh (Ph.D.)

– J. Sridhar (M.S.)

– S. Sur (Ph.D.)

– H. Subramoni (Ph.D.)

– K. Vaidyanathan (Ph.D.)

– A. Vishnu (Ph.D.)

– J. Wu (Ph.D.)

– W. Yu (Ph.D.)

Past Research Scientist – S. Sur

Current Post-Docs – J. Lin

– D. Shankar

Current Programmer – J. Perkins

Past Post-Docs – H. Wang

– X. Besseron

– H.-W. Jin

– M. Luo

– W. Huang (Ph.D.)

– W. Jiang (M.S.)

– J. Jose (Ph.D.)

– S. Kini (M.S.)

– M. Koop (Ph.D.)

– R. Kumar (M.S.)

– S. Krishnamoorthy (M.S.)

– K. Kandalla (Ph.D.)

– P. Lai (M.S.)

– J. Liu (Ph.D.)

– M. Luo (Ph.D.)

– A. Mamidala (Ph.D.)

– G. Marsh (M.S.)

– V. Meshram (M.S.)

– A. Moody (M.S.)

– S. Naravula (Ph.D.)

– R. Noronha (Ph.D.)

– X. Ouyang (Ph.D.)

– S. Pai (M.S.)

– S. Potluri (Ph.D.)

– R. Rajachandrasekar (Ph.D.)

– K. Kulkarni (M.S.)

– M. Li (Ph.D.)

– M. Rahman (Ph.D.)

– D. Shankar (Ph.D.)

– A. Venkatesh (Ph.D.)

– J. Zhang (Ph.D.)

– E. Mancini

– S. Marcarelli

– J. Vienne

Current Research

Scientists – H. Subramoni

– X. Lu

Past Programmers – D. Bureddy

Current Research

Specialist – M. Arnold

Current Senior Research

Associate - K. Hamidouche

25 SC15

Page 26

[email protected]

Thank You!

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

The MVAPICH2 Project http://mvapich.cse.ohio-state.edu/

26 SC15