Efficient Parallelization of MATLAB Stencil Applications for Multi-Core Clusters
Johannes Spazier, Steffen Christgau, Bettina Schnor
University of Potsdam, Germany
WOLFHPC 2016, Salt Lake City, USA
November 13, 2016
Johannes Spazier (University Potsdam) Parallelization of MATLAB codes for Multi-Core Clusters WOLFHPC 16, Nov 13 1/29
Outline
1 Introduction
2 Message Passing Interface
3 Hybrid Programming
4 Conclusion
Introduction
MATLAB
well-established high-level language for scientific computing
significantly reduced implementation effort
enables fast prototyping of mathematical models
well-suited for stencil applications
drawback: slow execution due to interpretation
no out-of-the-box parallelization
⇒ insufficient performance for large data sets
Introduction
StencilPaC Overview
MATLAB to parallel C compiler
[Toolchain diagram: MATLAB source + command-line options → StencilPaC compiler → generated C code → C compiler (gcc) with MPI headers and libraries → final executable]
Introduction
StencilPaC Overview
automatic parallelization for matrix operations
B(X,Y) = M1(X1,Y1) ◦ ... ◦ Mn(Xn,Yn)
support for different architectures: shared and distributed memory systems, accelerators
built on common programming APIs: OpenMP, MPI, and OpenACC
[Diagram: matrix B of size X×Y computed as M1 (X1×Y1) ⊗ ... ⊗ Mn (Xn×Yn)]
Introduction
Applications
two grid-based stencil applications
domain update over multiple iterations
manual reference implementations in C/C++
EasyWave
tsunami simulation developed at the German Research Center for Geosciences
access pattern: 5-point-stencil
Cellular Automaton
idealized model for biological systems
9-point-stencil (Moore neighborhood)
Introduction
StencilPaC Overview
generated C code is much faster than MATLAB for both applications
improvements of more than
- 7 times with sequential code
- 21 times on an 8-core shared memory system
- 187 times with an NVIDIA Tesla K40m
for the memory-bound tsunami simulation EasyWave
even better results for the Cellular Automaton
Introduction
StencilPaC Overview
distributed systems are most challenging
- automatic partitioning of matrices
- generic handling of communication between processes
- partial computation
small runtime overhead is essential
focus on the MPI one-sided API in previous work
today: concepts of and comparison with
- two-sided communication
- hybrid programming
Message Passing Interface
Principles
degree of parallelization is given by the number of processes
distribute matrices evenly among the processes
one-dimensional domain decomposition (block of columns)
compute local parts in parallel
set up communication at runtime
provide appropriate ghost zones
[Diagram: matrix of size X×Y decomposed column-wise among processes 0, 1, and 2, each block bordered by left and right ghost zones (Lghost, Rghost)]
Message Passing Interface
Distributed Computation
choose base matrix B
B(X,Y) = M1(X1,Y1) ◦ ... ◦ Mn(Xn,Yn)
compute local part of B
for (j = 0; j < length(X); j++) {
  if (is_local( B, X(j) )) {
    for (k = 0; k < length(Y); k++) {
      B( X(j), Y(k) ) = M1( X1(j), Y1(k) ) ◦ ... ◦ Mn( Xn(j), Yn(k) );
    }
  }
}
Message Passing Interface
1. One-sided Communication
direct access to remote memory with MPI_Get
ghost zones can be fetched without involving other processes
ranks calculated based on equally sized partitioning
coarse-grained synchronization with MPI_Win_fence
⇒ simple API for generic data exchange
⇒ less administration at runtime
⇒ expensive synchronization
One-sided Communication
Generic Data Exchange
B(X,Y) = M1(X1,Y1) ◦ ... ◦ Mn(Xn,Yn)
MPI_Win_fence( 0, Mi.win );
for (j = 0; j < length(X); j++) {
  if (is_local( B, X(j) )) {
    if (!is_local( M1, X1(j) ))
      MPI_Get( M1.win, X1(j), ... );
    ...
    if (!is_local( Mn, Xn(j) ))
      MPI_Get( Mn.win, Xn(j), ... );
  }
}
MPI_Win_fence( 0, Mi.win );
Message Passing Interface
2. Two-sided Communication
exchange ghost zones via messages
both sender and receiver are involved
send operations must also be provided
use non-blocking operations to avoid deadlocks (MPI_Isend and MPI_Irecv)
synchronize with MPI_Waitall
⇒ pair-wise synchronization
⇒ additional administration required
Two-sided Communication
Generic Data Exchange
B(X,Y) = M1(X1,Y1) ◦ ... ◦ Mn(Xn,Yn)
for (j = 0; j < length(X); j++) {
  if (is_local( B, X(j) ))
    if (!is_local( M1, X1(j) ))
      MPI_Irecv( M1.vdata, X1(j), ... );
  if (is_local( M1, X1(j) ))
    if (!is_local( B, X(j) ))
      MPI_Isend( M1.vdata, X1(j), ... );
  /* Repeat for other matrices. */
}
MPI_Waitall( Mi.reqnr, Mi.requests, ... );
Mi.reqnr = 0;
Evaluation: One- vs. two-sided
EasyWave
[Plot: runtime in seconds (log scale) over 1 to 96 processes for EasyWave Manual MPI, EasyWave Auto MPI One-Sided, and EasyWave Auto MPI Two-Sided; runtimes fall from roughly 200 s on one process to 6-7 s on 96 processes]
Platform
12 dual-socket nodes
4-core Intel Xeon CPUs
InfiniBand Network
Open MPI 1.8.2 and GCC 4.9.1
Results
generated codes can keep up with the hand-written one
similar scaling of all versions
two-sided is 13% faster than one-sided
Evaluation: One- vs. two-sided
Cellular Automaton
[Plot: runtime in seconds (log scale) over 1 to 96 processes for CA Auto Char MPI One-Sided and CA Auto Char MPI Two-Sided; runtimes fall from roughly 363 s on one process to about 4 s on 96 processes]
Results
overall adequate scaling
speedup of at least 85 on 96 cores
6% improvement with the two-sided version
Evaluation: One- vs. two-sided
Summary
similar findings for both applications
satisfying scaling even for larger core counts
runtime of hand-written codes almost reached
two-sided MPI implementation performs better than one-sided
- despite higher runtime overhead
- benefiting from fine-grained synchronization
Hybrid Programming
Two-sided MPI + OpenMP
each MPI process spawns multiple threads
work is divided statically among these threads
simple combination leads to serious load imbalances
process distribution and thread partitioning interfere
amount of computational work varies
#pragma omp parallel for private(k)
for (j = 0; j < length(X); j++)
  if (is_local( B, X(j) ))
    for (k = 0; k < length(Y); k++)
      B( X(j), Y(k) ) = ...
[Diagram: domain columns 1..n assigned to process 0 and n+1..2n to process 1; static thread partitions within each process receive unequal amounts of work]
Hybrid Programming
1. Dynamic Scheduling
use dynamic scheduling of threads
chunks are assigned at runtime
better sharing of computational work expected
easy implementation with
#pragma omp parallel for schedule(dynamic)
additional runtime overhead
[Diagram: the same domain halves of processes 0 and 1, now split into chunks that are assigned dynamically to threads at runtime]
Hybrid Programming
2. Intersection Approach
optimization for special matrix access
based on MATLAB’s range index
start:step:end = [start, start+step, ..., end]
local portion can be determined in advance
no locality check at runtime anymore
enables static thread scheduling again
[Diagram: intersecting the global index range with the local matrix portion yields the local index range]
Evaluation: Hybrid Programming
Cellular Automaton
[Plot: runtime in seconds (log scale) over 1 to 96 processes for CA Auto Char MPI Two-Sided, CA Auto Char Hybrid Two-Sided Dynamic, and CA Auto Char Hybrid Two-Sided Intersect; runtimes fall from roughly 363 s on one process to about 4.0-4.6 s on 96 processes]
Results
pure MPI version performs best
similar results with hybrid intersection
overhead of 14% with dynamic scheduling
Evaluation: Hybrid Programming
EasyWave
[Plot: runtime in seconds (log scale) over 1 to 96 processes for EasyWave Auto MPI Two-Sided, EasyWave Auto Hybrid Two-Sided Dynamic, and EasyWave Auto Hybrid Two-Sided Intersect; the dynamic hybrid version reaches 4.76 s on 96 processes versus 6.23 s for pure MPI]
Results
pure MPI better than hybrid intersection
dynamic approach outperforms other versions
improvement of 24%
better load balancing for memory-bound applications
Conclusion
Successful extension of StencilPaC for multi-core clusters.
Message Passing Interface
well suited for automatic parallelization
slightly better results with two-sided communication
speedups of up to 91 on 96 cores
Hybrid Programming
benefit of hybrid versions depends on application demands
dynamic scheduling can reduce load imbalances
- improvement of 24% on a memory-bound simulation
intersection approach does not show any benefit
Conclusion
Future Work
examine a wider range of applications
consider additional platforms
- use other MPI implementations (e.g. Open MPI 2.0)
- compare different architectures and network types
deeper analysis of observed effects in hybrid programming
evaluate possible use of generated C code on FPGAs
Questions?
Thanks for your attention.