Efficient Parallelization of MATLAB Stencil Applications for Multi-Core Clusters
Johannes Spazier, Steffen Christgau, Bettina Schnor
University of Potsdam, Germany
WOLFHPC 2016, Salt Lake City, USA
November 13, 2016
Johannes Spazier (University Potsdam) Parallelization of MATLAB codes for Multi-Core Clusters WOLFHPC 16, Nov 13 1/29
Outline
1 Introduction
2 Message Passing Interface
3 Hybrid Programming
4 Conclusion
Introduction
MATLAB
well-established high-level language for scientific computing
significantly reduced implementation effort
enables fast prototyping of mathematical models
well-suited for stencil applications
drawback: slow execution due to interpretation
no out-of-the-box parallelization
⇒ insufficient performance for large data sets
Introduction
StencilPaC Overview
MATLAB to parallel C compiler
[Toolchain diagram: MATLAB source + command-line options → StencilPaC compiler → generated C code → C compiler (gcc) with MPI headers and libraries → final executable]
Introduction
StencilPaC Overview
automatic parallelization for matrix operations
B(X,Y) = M1(X1,Y1) ◦ ... ◦ Mn(Xn,Yn)
support for different architectures: shared and distributed memory systems, accelerators
built on common programming APIs: OpenMP, MPI, and OpenACC
[Diagram: matrix B of size X×Y computed as M1 (X1×Y1) ⊗ ... ⊗ Mn (Xn×Yn)]
Introduction
Applications
two grid-based stencil applications
domain update over multiple iterations
manual reference implementations in C/C++
EasyWave
tsunami simulation developed at the German Research Center for Geosciences
access pattern: 5-point-stencil
Cellular Automaton
idealized model for biological systems
9-point-stencil (Moore neighborhood)
Introduction
StencilPaC Overview
generated C code is much faster than MATLAB for both applications
improvements of more than
- 7 times with sequential code
- 21 times on an 8-core shared memory system
- 187 times with an NVIDIA Tesla K40m
for the memory-bound tsunami simulation EasyWave
even better results for the Cellular Automaton
Introduction
StencilPaC Overview
distributed systems are most challenging
- automatic partitioning of matrices
- generic handling of communication between processes
- partial computation
small runtime overhead is essential
focus on the MPI one-sided API in previous work
today: concepts of and comparison with
- two-sided communication
- hybrid programming
Message Passing Interface
Principles
degree of parallelization is given by the number of processes
distribute matrices evenly among the processes
one-dimensional domain decomposition (block of columns)
compute local parts in parallel
set up communication at runtime
provide appropriate ghost zones
[Diagram: matrix of size X×Y decomposed column-wise among processes 0, 1, and 2, each block bordered by left and right ghost zones (Lghost, Rghost)]
Message Passing Interface
Distributed Computation
choose base matrix B
B(X,Y) = M1(X1,Y1) ◦ ... ◦ Mn(Xn,Yn)
compute local part of B
for (j = 0; j < length(X); j++) {
  if (is_local( B, X(j) )) {
    for (k = 0; k < length(Y); k++) {
      B( X(j), Y(k) ) = M1( X1(j), Y1(k) ) ◦ ... ◦ Mn( Xn(j), Yn(k) );
    }
  }
}
Message Passing Interface
1. One-sided Communication
direct access to remote memory with MPI_Get
ghost zones can be fetched without involving other processes
ranks calculated based on equally sized partitioning
coarse-grained synchronization with MPI_Win_fence
⇒ simple API for generic data exchange
⇒ less administration at runtime
⇒ expensive synchronization
One-sided Communication
Generic Data Exchange
B(X,Y) = M1(X1,Y1) ◦ ... ◦ Mn(Xn,Yn)
MPI_Win_fence( 0, Mi.win );
for (j = 0; j < length(X); j++) {
  if (is_local( B, X(j) )) {
    if (!is_local( M1, X1(j) ))
      MPI_Get( M1.win, X1(j), ... );
    ...
    if (!is_local( Mn, Xn(j) ))
      MPI_Get( Mn.win, Xn(j), ... );
  }
}
MPI_Win_fence( 0, Mi.win );
Message Passing Interface
2. Two-sided Communication
exchange ghost zones via messages
both sender and receiver are involved
send operations must also be provided
use non-blocking operations to avoid deadlocks (MPI_Isend and MPI_Irecv)
synchronize with MPI_Waitall
⇒ pair-wise synchronization
⇒ additional administration required
Two-sided Communication
Generic Data Exchange
B(X,Y) = M1(X1,Y1) ◦ ... ◦ Mn(Xn,Yn)
for (j = 0; j < length(X); j++) {
  if (is_local( B, X(j) ))
    if (!is_local( M1, X1(j) ))
      MPI_Irecv( M1.vdata, X1(j), ... );
  if (is_local( M1, X1(j) ))
    if (!is_local( B, X(j) ))
      MPI_Isend( M1.vdata, X1(j), ... );
  /* Repeat for other matrices. */
}
MPI_Waitall( Mi.reqnr, Mi.requests, ... );
Mi.reqnr = 0;
Evaluation: One- vs. two-sided
EasyWave
[Plot: runtime in seconds (log scale) over 1 to 96 processes for EasyWave Manual MPI, EasyWave Auto MPI One-Sided, and EasyWave Auto MPI Two-Sided; runtimes fall from roughly 200 s on one process to 6-7 s on 96 processes]
Platform
12 dual-socket nodes
4-core Intel Xeon CPUs
InfiniBand Network
Open MPI 1.8.2 and GCC 4.9.1
Results
generated codes can keep up with the hand-written one
similar scaling of all versions
two-sided is 13% faster than one-sided
Evaluation: One- vs. two-sided
Cellular Automaton
[Plot: runtime in seconds (log scale) over 1 to 96 processes for CA Auto Char MPI One-Sided and CA Auto Char MPI Two-Sided; runtimes fall from roughly 363 s on one process to about 4 s on 96 processes]
Results
overall adequate scaling
speedup of at least 85 on 96 cores
6% improvement with the two-sided version
Evaluation: One- vs. two-sided
Summary
similar findings for both applications
satisfying scaling even for larger core counts
runtime of hand-written codes almost reached
two-sided MPI implementation performs better than one-sided
- despite higher runtime overhead
- benefiting from fine-grained synchronization
Hybrid Programming
Two-sided MPI + OpenMP
each MPI process spawns multiple threads
work is divided statically among these threads
simple combination leads to serious load imbalances
process distribution and thread partitioning interfere
amount of computational work varies
#pragma omp parallel for private(k)
for (j = 0; j < length(X); j++)
  if (is_local( B, X(j) ))
    for (k = 0; k < length(Y); k++)
      B( X(j), Y(k) ) = ...
[Diagram: domain columns 1..n assigned to process 0 and n+1..2n to process 1; static thread partitions within each process receive unequal amounts of work]
Hybrid Programming
1. Dynamic Scheduling
use dynamic scheduling of threads
chunks are assigned at runtime
better sharing of computational work expected
easy implementation with
#pragma omp parallel for schedule(dynamic)
additional runtime overhead
[Diagram: the same domain halves of processes 0 and 1, now split into chunks that are assigned dynamically to threads at runtime]
Hybrid Programming
2. Intersection Approach
optimization for special matrix access
based on MATLAB’s range index
start:step:end = [start, start+step, ..., end]
local portion can be determined in advance
no locality check at runtime anymore
enables static thread scheduling again
[Diagram: intersecting the global index range with the local matrix portion yields the local index range]
Evaluation: Hybrid Programming
Cellular Automaton
[Plot: runtime in seconds (log scale) over 1 to 96 processes for CA Auto Char MPI Two-Sided, CA Auto Char Hybrid Two-Sided Dynamic, and CA Auto Char Hybrid Two-Sided Intersect; runtimes fall from roughly 363 s on one process to about 4.0-4.6 s on 96 processes]
Results
pure MPI version performs best
similar results with hybrid intersection
overhead of 14% with dynamic scheduling
Evaluation: Hybrid Programming
EasyWave
[Plot: runtime in seconds (log scale) over 1 to 96 processes for EasyWave Auto MPI Two-Sided, EasyWave Auto Hybrid Two-Sided Dynamic, and EasyWave Auto Hybrid Two-Sided Intersect; the dynamic hybrid version reaches 4.76 s on 96 processes versus 6.23 s for pure MPI]
Results
pure MPI better than hybrid intersection
dynamic approach outperforms other versions
improvement of 24%
better load balancing for memory-bound applications
Conclusion
Successful extension of StencilPaC for multi-core clusters.
Message Passing Interface
well suited for automatic parallelization
slightly better results with two-sided communication
speedups of up to 91 on 96 cores
Hybrid Programming
benefit of hybrid versions depends on application demands
dynamic scheduling can reduce load imbalances
- improvement of 24% on a memory-bound simulation
intersection approach does not show any benefit
Conclusion
Future Work
examine a wider range of applications
consider additional platforms
- use other MPI implementations (e.g. Open MPI 2.0)
- compare different architectures and network types
deeper analysis of observed effects in hybrid programming
evaluate possible use of generated C code on FPGAs
Questions?
Thanks for your attention.