Höchstleistungsrechenzentrum Stuttgart
SUPERsmith
Hybrid Parallel Programming
Hybrid MPI and OpenMP Parallel Programming
MPI + OpenMP and other models on clusters of SMP nodes
Rolf Rabenseifner 1), Georg Hager 2), Gabriele Jost 3); [email protected], [email protected], [email protected]
1) High Performance Computing Center (HLRS), University of Stuttgart, Germany; 2) Regional Computing Center (RRZE), University of Erlangen, Germany; 3) Supersmith, Maximum Performance Software, USA
The author is currently working for Intel Corporation. All material presented was prepared while still affiliated with Supersmith; it does not represent the views of Intel Corporation.
Tutorial tut123 at SC13, November 17, 2013, Denver (CO) USA
Slide 1
Motivation
- Efficient programming of clusters of shared memory (SMP) nodes
- Hierarchical system layout
- Hybrid programming seems natural: MPI between the nodes, shared memory programming inside each SMP node (OpenMP, MPI-3 shared memory programming, accelerator support in OpenMP 4.0 and OpenACC)
(Figure: cluster of ccNUMA/SMP nodes; each SMP node/board holds CPU sockets with cores and shared memory, connected by the node interconnect)
Slide 2 / 175
Motivation
Which programming model is fastest?
- MPI everywhere?
- Fully hybrid MPI & OpenMP?
- Something in between? (Mixed model)
Often hybrid programming is slower than pure MPI: examples and reasons follow.
(Figure: two SMP nodes, each with two quad-core CPU sockets, connected by the node interconnect; options shown: one 4-way multi-threaded MPI process per socket, one 8-way multi-threaded MPI process per node, or pure MPI with one MPI process per core)
Slide 3 / 175
Goals of this tutorial
- Sensitize to problems on clusters of SMP nodes (see sections: Case studies, Mismatch problems)
- Technical aspects of hybrid programming (see sections: Programming models on clusters, Examples on hybrid programming)
- Opportunities with hybrid programming (see section: Opportunities: Application categories that can benefit from hybrid parallelization)
- Issues and their solutions (see sections: Thread-safety quality of MPI libraries, Tools for debugging and profiling for MPI+OpenMP)
Less frustration & more success with your parallel program on clusters of SMP nodes
Slide 4 / 175
Outline (slide number)
- Introduction / Motivation (2)
- Programming models on clusters of SMP nodes (6)
- Case Studies / pure MPI vs hybrid MPI+OpenMP (28)
- Practical How-To on hybrid programming (55)
- Mismatch Problems (91)
- Opportunities: Application categories that can benefit from hybrid parallelization (109)
- Other options on clusters of SMP nodes (118)
- Accelerators (135)
- Summary (151)
- Appendix (158)
- Content (detailed) (174)
Sessions: 08:30-10:00 and 10:30-12:00. Includes additional slides, marked as "skipped".
Slide 5 / 175
Outline
- Introduction / Motivation
- Programming models on clusters of SMP nodes
- Case Studies / pure MPI vs hybrid MPI+OpenMP
- Hybrid programming & accelerators
- Practical How-To on hybrid programming
- Mismatch Problems
- Opportunities: Application categories that can benefit from hybrid parallelization
- Other options on clusters of SMP nodes
- Summary
Slide 6 / 175
Major Programming models on hybrid systems
OpenMP (shared data) inside the SMP nodes; MPI (local data in each process) between the nodes via the node interconnect; sequential program on each core; explicit message passing by calling MPI_Send & MPI_Recv. Schematically:

  some_serial_code
  #pragma omp parallel for
  for (j=...; ...; j++)
      block_to_be_parallelized     /* master thread forks; other threads sleep otherwise */
  again_some_serial_code

- Pure MPI (one MPI process on each core)
- Hybrid MPI + OpenMP: shared memory OpenMP inside each SMP node, distributed memory MPI between the nodes
- Hybrid MPI + MPI: MPI message passing + MPI-3.0 shared memory programming
- Other: PGAS (UPC, Coarray Fortran, ...), possibly together with MPI
Often hybrid programming (MPI+OpenMP) is slower than pure MPI. Why?
Slide 7 / 175
Parallel Programming Models on Hybrid Platforms
- pure MPI: one MPI process on each core
- hybrid MPI+OpenMP: MPI for inter-node communication, OpenMP inside each SMP node
  - Masteronly: MPI only outside of parallel regions of the numerical application code; no overlap of communication and computation
  - Overlapping communication and computation: MPI communication by one or a few threads while other threads are computing
- hybrid MPI+MPI: MPI for inter-node communication + MPI-3.0 shared memory programming
  - Within shared memory nodes: halo updates through direct data copy, or
  - Within shared memory nodes: no halo updates, direct access to neighbor data
- OpenMP only: distributed virtual shared memory
Slide 8 / 175
Pure MPI (one MPI process on each core)
Advantages:
- No modifications to existing MPI codes
- MPI library need not support multiple threads
Major problems:
- Does the MPI library internally use different protocols? (Shared memory inside the SMP nodes, network communication between the nodes)
- Does the application topology fit the hardware topology?
- Unnecessary MPI communication inside the SMP nodes!
Discussed in detail later in the section Mismatch Problems.
Slide 9 / 175
Hybrid MPI+OpenMP, Masteronly style (MPI only outside of parallel regions)
Advantages:
- No message passing inside the SMP nodes
- No topology problem

  for (iteration ...) {
      #pragma omp parallel
          numerical code
      /* end omp parallel */
      /* on master thread only */
      MPI_Send (original data to halo areas in other SMP nodes)
      MPI_Recv (halo data from the neighbors)
  } /* end for loop */

Major problems:
- All other threads are sleeping while the master thread communicates!
- Which inter-node bandwidth can a single thread achieve?
- The MPI library must support at least MPI_THREAD_FUNNELED (see section: Thread-safety quality of MPI libraries)
Slide 10 / 175
MPI rules with OpenMP / automatic SMP parallelization
Special MPI-2 init for multi-threaded MPI processes. REQUIRED values (in increasing order):
- MPI_THREAD_SINGLE: only one thread will execute
- THREAD_MASTERONLY (virtual value, not part of the standard): MPI processes may be multi-threaded, but only the master thread will make MPI calls, and only while the other threads are sleeping
- MPI_THREAD_FUNNELED: only the master thread will make MPI calls
- MPI_THREAD_SERIALIZED: multiple threads may make MPI calls, but only one at a time
- MPI_THREAD_MULTIPLE: multiple threads may call MPI, with no restrictions
The returned PROVIDED level may be less than REQUIRED by the application.

  int MPI_Init_thread (int *argc, char ***argv,
                       int thread_level_required, int *thread_level_provided);
  int MPI_Query_thread (int *thread_level_provided);
  int MPI_Is_thread_main (int *flag);
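As a minimal sketch (not from the tutorial slides), the thread level is typically requested and checked like this in C:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[]) {
      int provided;
      /* request MPI_THREAD_FUNNELED for masteronly-style hybrid code */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      if (provided < MPI_THREAD_FUNNELED) {
          fprintf(stderr, "insufficient thread support (provided=%d)\n", provided);
          MPI_Abort(MPI_COMM_WORLD, 1);
      }
      /* ... hybrid MPI+OpenMP code; MPI calls on the master thread only ... */
      MPI_Finalize();
      return 0;
  }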
Slide 11 / 175
Calling MPI inside of OMP MASTER
- Inside of a parallel region, with OMP MASTER
- Requires MPI_THREAD_FUNNELED, i.e., only the master thread will make MPI calls
- Caution: there isn't any synchronization with OMP MASTER! Therefore, an OMP BARRIER is normally necessary to guarantee that data or buffer space from/for other threads is available before/after the MPI call:

  !$OMP BARRIER                 #pragma omp barrier
  !$OMP MASTER                  #pragma omp master
    call MPI_Xxx(...)               MPI_Xxx(...);
  !$OMP END MASTER
  !$OMP BARRIER                 #pragma omp barrier

- But this implies that all other threads are sleeping!
- The additional barrier also implies the necessary cache flush!
Slide 12 / 175
Why the barrier is necessary: example with MPI_Recv

Fortran:
  !$OMP PARALLEL
  !$OMP DO
    do i=1,1000
      a(i) = buf(i)
    end do
  !$OMP END DO NOWAIT
  !$OMP BARRIER
  !$OMP MASTER
    call MPI_RECV(buf,...)
  !$OMP END MASTER
  !$OMP BARRIER
  !$OMP DO
    do i=1,1000
      c(i) = buf(i)
    end do
  !$OMP END DO NOWAIT
  !$OMP END PARALLEL

C (the original listing was truncated in the source; reconstructed to mirror the Fortran):
  #pragma omp parallel
  {
    #pragma omp for nowait
    for (i=0; i<1000; i++)
      a[i] = buf[i];
    #pragma omp barrier
    #pragma omp master
      MPI_Recv(buf,...);
    #pragma omp barrier
    #pragma omp for nowait
    for (i=0; i<1000; i++)
      c[i] = buf[i];
  }
Example: thread support within Open MPI
In order to enable thread support in Open MPI, configure with:
  configure --enable-mpi-threads
This turns on:
- Support for full MPI_THREAD_MULTIPLE
- Internal checks when run with threads (--enable-debug)
  configure --enable-mpi-threads --enable-progress-threads
This (additionally) turns on:
- Progress threads to asynchronously transfer/receive data per network BTL
Additional feature:
- Compiling with debugging support, but without threads, will check for recursive locking
Courtesy of Rainer Keller, HLRS and ORNL
Slide 14 / 175
Overlapping communication and computation
MPI communication by one or a few threads while other threads are computing:

  if (my_thread_rank < ...) {
      MPI_Send/Recv ...
      /* i.e., communicate all halo data */
  } else {
      /* execute those parts of the application that do not
         need halo data (on the non-communicating threads) */
  }
  /* execute those parts of the application that need
     halo data (on all threads) */
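A compilable sketch of this pattern, assuming MPI_THREAD_FUNNELED and illustrative application routines exchange_halos(), compute_inner(), and compute_boundary() (these names are not from the tutorial):

  #include <mpi.h>
  #include <omp.h>

  void exchange_halos(void);    /* MPI_Send/MPI_Recv of all halo data */
  void compute_inner(void);     /* application parts that need no halo data */
  void compute_boundary(void);  /* application parts that need halo data */

  void timestep(void) {
      #pragma omp parallel
      {
          if (omp_get_thread_num() == 0)
              exchange_halos();   /* funneled: MPI on the master thread only */
          else
              compute_inner();    /* overlapped with the communication; in a real
                                     code this work is shared among threads 1..n-1 */
          #pragma omp barrier     /* halo data is now available to all threads */
          compute_boundary();     /* executed by all threads */
      }                           /* implicit barrier at end of parallel region */
  }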
Slide 15 / 175
Hybrid MPI + MPI-3 shared memory (MPI for inter-node communication + MPI-3.0 shared memory programming)
Advantages:
- No message passing inside the SMP nodes
- Only one parallel programming standard is used
- No OpenMP problems (e.g., thread-safety isn't an issue)
Major problems:
- The communicator must be split into shared memory islands
- To minimize shared memory communication overhead: halos (or the data accessed by the neighbors) must be stored in MPI shared memory windows
- Same work-sharing as with pure MPI
Slide 16 / 175
MPI-3 shared memory
- Split the main communicator into shared memory islands: MPI_Comm_split_type
- Define a shared memory window on each island: MPI_Win_allocate_shared
- Result (by default): a contiguous array, directly accessible by all processes of the island
- Accesses and synchronization:
  - Normal assignments and expressions
  - No MPI_PUT/GET!
  - Normal MPI one-sided synchronization, e.g., MPI_WIN_FENCE
Slide 17 / 175
Splitting the communicator & contiguous shared memory allocation
  MPI_Aint /*IN*/ local_window_count;   double /*OUT*/ *base_ptr;
  MPI_Comm comm_all, comm_sm;   int my_rank_all, my_rank_sm, size_sm, disp_unit;

  MPI_Comm_rank (comm_all, &my_rank_all);
  MPI_Comm_split_type (comm_all, MPI_COMM_TYPE_SHARED, 0,
                       MPI_INFO_NULL, &comm_sm);
  MPI_Comm_rank (comm_sm, &my_rank_sm);   MPI_Comm_size (comm_sm, &size_sm);
  disp_unit = sizeof(double);   /* shared memory should contain doubles */
  MPI_Win_allocate_shared (local_window_count*disp_unit, disp_unit, MPI_INFO_NULL,
                           comm_sm, &base_ptr, &win_sm);
(Figure: 16 MPI processes, my_rank_all = 0..15, split into four SMP nodes; within each node, sub-communicator comm_sm with my_rank_sm = 0..3; each process contributes local_window_count doubles at base_ptr to a contiguous shared memory window within each SMP node; the rank sequence in comm_sm follows that in comm_all)
In Fortran, MPI-3.0, page 341, Examples 8.1 (and 8.2) show how to convert buf_ptr into a usable array a. This mapping is based on a sequential ranking of the SMP nodes in comm_all.
Slide 18 / 175
Within each SMP node: essentials
- The allocated shared memory is contiguous across process ranks, i.e., the first byte of rank i starts right after the last byte of rank i-1.
- Processes can calculate remote address offsets with local information only.
- Remote accesses go through load/store operations, i.e., without MPI RMA operations (MPI_GET/PUT, ...).
- Although each process in comm_sm accesses the same physical memory, the virtual start address of the whole array may be different in all processes! Therefore: linked lists only with offsets in a shared array, but not with binary pointer addresses!
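For example (a sketch of the offset idea, not from the slides; item_t and the field names are hypothetical), a linked list in a shared window stores offsets instead of pointers:

  #include <mpi.h>

  typedef struct {
      double   payload;
      MPI_Aint next;     /* offset of the next node in the window; -1 = end of list */
  } item_t;

  /* base: this process's own mapping of the shared window (may differ per process) */
  double list_sum(const item_t *base, MPI_Aint head) {
      double s = 0.0;
      for (MPI_Aint off = head; off != -1; off = base[off].next)
          s += base[off].payload;   /* the same offset is valid in every process */
      return s;
  }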
The following slides show only the shared memory accesses, i.e., communication between the SMP nodes is not presented.
Slide 19 / 175
Shared memory access example
  MPI_Aint /*IN*/ local_window_count;   double /*OUT*/ *base_ptr;

  MPI_Win_allocate_shared (local_window_count*disp_unit, disp_unit, MPI_INFO_NULL,
                           comm_sm, &base_ptr, &win_sm);
  MPI_Win_fence (0, win_sm);   /* local store epoch can start */
  for (i=0; i<local_window_count; i++)  base_ptr[i] = ... ;   /* local stores (loop body truncated in the source) */
  MPI_Win_fence (0, win_sm);   /* local stores are finished, remote load epoch can start */
  if (my_rank_sm > 0)  printf("left neighbor's rightmost value = %lf \n", base_ptr[-1]);
  if (my_rank_sm < size_sm-1)  printf("right neighbor's leftmost value = %lf \n",
                                      base_ptr[local_window_count]);
(Figure: as before, 16 MPI processes in four SMP nodes, each with a sub-communicator for one SMP node and a contiguous shared memory window of local_window_count doubles per process; local stores fill each process's window portion; after synchronization, direct load access to the left/right neighbor's window portion, e.g., base_ptr[-1] and base_ptr[local_window_count])
In Fortran, before and after the synchronization, one must add CALL MPI_F_SYNC_REG (buffer) to guarantee that register copies of buffer are written back to memory, and read again from memory, respectively.
Slide 20 / 175
Establish comm_sm, comm_nodes, comm_all, if SMPs are not contiguous within comm_orig
(Figure: ranks 0..15 in comm_all; on each SMP node, a sub-communicator comm_sm with ranks my_rank_sm = 0..3; input communicator: comm_orig)
MPI_Comm_split_type (comm_orig, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &comm_sm);
MPI_Comm_size (comm_sm, &size_sm); MPI_Comm_rank (comm_sm, &my_rank_sm);
MPI_Comm_split (comm_orig, my_rank_sm, 0, &comm_nodes);
MPI_Comm_size (comm_nodes, &size_nodes);
if (my_rank_sm==0) {
MPI_Comm_rank (comm_nodes, &my_rank_nodes);
MPI_Exscan (&size_sm, &my_rank_all, 1, MPI_INT, MPI_SUM, comm_nodes);
if (my_rank_nodes == 0) my_rank_all = 0;
}
MPI_Comm_free (&comm_nodes);
MPI_Bcast (&my_rank_nodes, 1, MPI_INT, 0, comm_sm);
MPI_Comm_split (comm_orig, my_rank_sm, my_rank_nodes, &comm_nodes);
MPI_Bcast (&my_rank_all, 1, MPI_INT, 0, comm_sm); my_rank_all = my_rank_all + my_rank_sm;
MPI_Comm_split (comm_orig, /*color*/ 0, my_rank_all, &comm_all);
Notes:
- First, a communicator comm_sm with ranks my_rank_sm is established on each SMP node.
- comm_nodes combines all processes with a given my_rank_sm into a separate communicator. On processes with my_rank_sm > 0, this first comm_nodes is unused, because the node numbering within these comm_nodes may be different.
- The second MPI_Comm_split expands the node numbering (my_rank_nodes) from the comm_nodes with my_rank_sm == 0 to all new node-to-node communicators comm_nodes.
- MPI_Exscan does not return a value on the first rank, hence the special case for my_rank_nodes == 0.
- Finally, my_rank_all is calculated and the global communicator comm_all is established with sequential SMP subsets.
Alternative: non-contiguous shared memory
Using the info key "alloc_shared_noncontig", the MPI library can put the processes' window portions
- on page boundaries (internally, e.g., only one OS shared memory segment with some unused padding zones), or
- into the local ccNUMA memory domain, plus page boundaries (internally, e.g., each window portion is one OS shared memory segment).
Pros: faster local data accesses, especially on ccNUMA nodes.
Cons: higher programming effort for neighbor accesses: MPI_WIN_SHARED_QUERY.
NUMA effects? Significant impact of alloc_shared_noncontig.
Further reading: Torsten Hoefler, James Dinan, Darius Buntinas, Pavan Balaji, Brian Barrett, Ron Brightwell, William Gropp, Vivek Kale, Rajeev Thakur: MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory. http://link.springer.com/content/pdf/10.1007%2Fs00607-013-0324-2.pdf
Slide 22 / 175
Non-contiguous shared memory allocation
MPI_Aint /*IN*/ local_window_count; double /*OUT*/ *base_ptr;
disp_unit = sizeof(double); /* shared memory should contain doubles */
MPI_Info info_noncontig;
MPI_Info_create (&info_noncontig);
MPI_Info_set (info_noncontig, "alloc_shared_noncontig", "true");
MPI_Win_allocate_shared (local_window_count*disp_unit, disp_unit, info_noncontig,
comm_sm, &base_ptr, &win_sm );
(Figure: non-contiguous shared memory window within each SMP node; within the sub-communicator for one SMP node, each process's local_window_count doubles start at its own base_ptr, possibly on separate page boundaries)
Slide 23 / 175
Non-contiguous shared memory: neighbor access through MPI_WIN_SHARED_QUERY
- Each process can retrieve each neighbor's base_ptr with calls to MPI_WIN_SHARED_QUERY
- Example: only pointers to the window memory of the left & right neighbor
  if (my_rank_sm > 0)  MPI_Win_shared_query (win_sm, my_rank_sm - 1,
                           &win_size_left, &disp_unit_left, &base_ptr_left);
  if (my_rank_sm < size_sm-1)  MPI_Win_shared_query (win_sm, my_rank_sm + 1,
                           &win_size_right, &disp_unit_right, &base_ptr_right);
  MPI_Win_fence (0, win_sm);   /* local stores are finished, remote load epoch can start */
  if (my_rank_sm > 0)  printf("left neighbor's rightmost value = %lf \n",
                              base_ptr_left[ win_size_left/disp_unit_left - 1 ]);
  if (my_rank_sm < size_sm-1)  printf("right neighbor's leftmost value = %lf \n",
                              base_ptr_right[ 0 ]);
Thanks to Steffen Weise (TU Freiberg) for testing and correcting the example codes.
Slide 24 / 175
Other technical aspects of MPI_WIN_ALLOCATE_SHARED
Caution: on some systems, the number of shared memory windows and the total size of shared memory windows may be limited. Some OS systems may provide options, e.g., at job launch or MPI process start, to enlarge restricting defaults.
- If MPI shared memory support is based on POSIX shared memory: shared memory windows are located in the memory-mapped /dev/shm. Default: 25% or 50% of the physical memory, but a maximum of ~2043 windows! Root may change the size with: mount -o remount,size=6G /dev/shm
- Cray XT/XE/XC (XPMEM): no limits.
- On a system without virtual memory (like CNK on BG/Q), you have to reserve a chunk of address space when the node is booted (default is 64 MB).
Another restriction in a low-quality MPI: MPI_COMM_SPLIT_TYPE may always return MPI_COMM_SELF (due to the default limit of context IDs in mpich).
Thanks to Jeff Hammond and Jed Brown (ANL), Brian W Barrett (SANDIA), and Steffen Weise (TU Freiberg) for input and discussion.
Slide 25 / 175
Splitting the communicator without MPI_COMM_SPLIT_TYPE
Alternative, if you want to group based on a fixed number size_sm of shared memory cores in comm_all:
- Based on sequential ranks in comm_all
- Pro: comm_sm can be restricted to ccNUMA locality domains
- Con: MPI does not guarantee MPI_WIN_ALLOCATE_SHARED() on the whole SMP node (MPI_COMM_SPLIT_TYPE() may return MPI_COMM_SELF or a partial SMP node)
(Figure: ranks 0..15 in comm_all, grouped into blocks of size_sm = 4 sequential ranks, each block forming one comm_sm)
  MPI_Comm_rank (comm_all, &my_rank);   MPI_Comm_group (comm_all, &group_all);
  ranges[0][0] = (my_rank / size_sm) * size_sm;
  ranges[0][1] = ranges[0][0] + size_sm - 1;
  ranges[0][2] = 1;
  MPI_Group_range_incl (group_all, 1, ranges, &group_sm);
  MPI_Comm_create (comm_all, group_sm, &comm_sm);
  MPI_Win_allocate_shared (...);
e.g., ranges[0][]={4,7,1}
To guarantee shared memory, one may add an additional
  MPI_Comm_split_type (comm_sm, MPI_COMM_TYPE_SHARED, 0,
                       MPI_INFO_NULL, &comm_sm_really);
Slide 26 / 175
Pure OpenMP (on the cluster): distributed virtual shared memory
- A distributed shared virtual memory system is needed; it must support clusters of SMP nodes, e.g.:
  - Shared memory parallelism inside SMP nodes
  - Communication of modified parts of pages at OpenMP flush (part of each OpenMP barrier)
- i.e., the OpenMP memory and parallelization model is prepared for clusters!
- By rule of thumb: communication may be 10 times slower than with MPI
Slide 27 / 175
Outline
- Introduction / Motivation
- Programming models on clusters of SMP nodes
- Case Studies / pure MPI vs hybrid MPI+OpenMP (Gabriele Jost, Supersmith, Maximum Performance Software)
  - The Multi-Zone NAS Parallel Benchmarks
  - For each application we discuss: benchmark implementations based on different strategies and programming paradigms; performance results and analysis on different hardware architectures
  - Compilation and execution summary
- Hybrid programming & accelerators
- Practical How-To on hybrid programming
- Mismatch Problems
- Opportunities: Application categories that can benefit from hybrid parallelization
- Other options on clusters of SMP nodes
- Summary
Slide 28 / 175
The Multi-Zone NAS Parallel Benchmarks
(Figure: time-step structure of the benchmarks; per time step, boundaries are exchanged (MPI calls or OpenMP data copy + sync.), inter-zone parallelism via MPI or MLP processes, intra-zone parallelism via OpenMP; paradigms shown: sequential, MPI/OpenMP, and nested OpenMP/MLP)
Multi-zone versions of the NAS Parallel Benchmarks LU, SP, and BT
- Two hybrid sample implementations
- Load balance heuristics are part of the sample codes
- www.nas.nasa.gov/Resources/Software/software.html
Slide 29 / 175
MPI/OpenMP BT-MZ

  call omp_set_num_threads (weight)
  do step = 1, itmax
    call exch_qbc(u, qbc, nx, ...)        ! calls mpi_send/recv
    do zone = 1, num_zones
      if (iam .eq. pzone_id(zone)) then
        call zsolve(u, rsd, ...)
      end if
    end do
  end do
  ...

  subroutine zsolve(u, rsd, ...)
  ...
  !$OMP PARALLEL DEFAULT(SHARED)
  !$OMP& PRIVATE(m,i,j,k...)
  do k = 2, nz-1
  !$OMP DO
    do j = 2, ny-1
      do i = 2, nx-1
        do m = 1, 5
          u(m,i,j,k) = dt*rsd(m,i,j,k-1)
        end do
      end do
    end do
  !$OMP END DO NOWAIT
  end do
  ...
  !$OMP END PARALLEL
Slide 30 / 175
MPI/OpenMP LU-MZ

  call omp_set_num_threads (weight)
  do step = 1, itmax
    call exch_qbc(u, qbc, nx, ...)
    do zone = 1, num_zones
      if (iam .eq. pzone_id(zone)) then
        call ssor
      end if
    end do
  end do
  ...
Slide 31 / 175
Pipelined thread execution in SSOR

  subroutine ssor
  !$OMP PARALLEL DEFAULT(SHARED)
  !$OMP& PRIVATE(m,i,j,k...)
  call sync1 ()
  do k = 2, nz-1
  !$OMP DO
    do j = 2, ny-1
      do i = 2, nx-1
        do m = 1, 5
          rsd(m,i,j,k) = dt*rsd(m,i,j,k-1) + ...
        end do
      end do
    end do
  !$OMP END DO nowait
  end do
  call sync2 ()
  ...
  !$OMP END PARALLEL

  subroutine sync1
  neigh = iam - 1
  do while (isync(neigh) .eq. 0)
  !$OMP FLUSH(isync)
  end do
  isync(neigh) = 0
  !$OMP FLUSH(isync)

  subroutine sync2
  neigh = iam - 1
  do while (isync(neigh) .eq. 1)
  !$OMP FLUSH(isync)
  end do
  isync(neigh) = 1
  !$OMP FLUSH(isync)
Slide 32 / 175
Golden Rule for ccNUMA: first touch

  c---------------------------------------------------------------
  c  do one time step to touch all data
  c---------------------------------------------------------------
  do iz = 1, proc_num_zones
    zone = proc_zone_id(iz)
    call adi(rho_i(start1(iz)), us(start1(iz)),
  $          vs(start1(iz)), ws(start1(iz)), ...)
  end do
  do iz = 1, proc_num_zones
    zone = proc_zone_id(iz)
    call initialize(u(start5(iz)), ...)
  end do

A memory page gets mapped into the local memory of the processor that first touches it!
- "touch" means "write", not "allocate"
- All benchmarks use the first touch policy to achieve good memory placement!
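In C with OpenMP, the same rule looks like this (a minimal illustrative sketch, not from the benchmark sources): initialize with the same static schedule that the compute loops use, so each thread first-touches the pages it will later work on.

  #include <stdlib.h>

  #define N 10000000

  int main(void) {
      double *a = malloc(N * sizeof(double));   /* pages are not mapped yet */
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < N; i++)
          a[i] = 0.0;      /* first write: each page is mapped into the local domain */
      /* ... compute loops should use the same schedule(static) access pattern ... */
      free(a);
      return 0;
  }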
Slide 33 / 175
Benchmark characteristics
- Aggregate sizes:
  - Class D: 1632 x 1216 x 34 grid points
  - Class E: 4224 x 3456 x 92 grid points
- BT-MZ (block tridiagonal simulated CFD application):
  - Alternating Direction Implicit (ADI) method
  - #Zones: 1024 (D), 4096 (E)
  - Size of the zones varies widely: large/small ratio about 20; requires multi-level parallelism to achieve a good load balance
  - Expectation: pure MPI has load-balancing problems! Good candidate for MPI+OpenMP
- LU-MZ (LU decomposition simulated CFD application):
  - SSOR method (2D pipelined method)
  - #Zones: 16 (all classes)
  - Size of the zones identical: no load balancing required, but limited parallelism on the outer level
  - Expectation: limited MPI parallelism; MPI+OpenMP increases parallelism
- SP-MZ (scalar pentadiagonal simulated CFD application):
  - #Zones: 1024 (D), 4096 (E)
  - Size of the zones identical: no load balancing required
  - Expectation: load-balanced on the MPI level, pure MPI should perform best
Slide 34 / 175
Hybrid code on cc-NUMA architectures
- OpenMP:
  - Support only per MPI process
  - Version 3.1 has support for binding of threads via the OMP_PROC_BIND environment variable
  - Under consideration for version 4.0: OMP_PROC_SET or OMP_LIST to restrict the execution to a subset of the machine; OMP_AFFINITY to influence how the threads are distributed and bound on the machine
  - Version 4.0 announced at SC12
- MPI:
  - Initially not designed for NUMA architectures or mixing of threads and processes; MPI-2 supports threads in MPI
  - The API does not provide support for memory/thread placement
- Vendor-specific APIs to control thread and memory placement:
  - Environment variables
  - System commands like numactl, taskset, dplace, omplace, etc.
  - http://www.halobates.de/numaapi3.pdf
  - More in the How-To section
Slide 35 / 175
Dell Linux Cluster Lonestar
- Located at the Texas Advanced Computing Center (TACC), University of Texas at Austin (http://www.tacc.utexas.edu)
- 1888 nodes, two 6-core 64-bit Intel Xeon Westmere processors, 3.33 GHz, 24 GB memory per node, peak performance 160 Gflops per node; 3 channels from each processor's memory controller to 3 DDR3 ECC DIMMs, 1333 MHz
- Processor interconnect: QPI, 6.4 GT/s
- Node interconnect: InfiniBand Mellanox switches, fat-tree topology, 40 Gbit/s point-to-point bandwidth
- More details: http://www.tacc.utexas.edu/user-services/user-guides/lonestar-user-guide
- Compiling the benchmarks: ifort 11.1, options: -O3 -ipo -openmp -mcmodel=medium
- Running the benchmarks: MVAPICH2
    setenv OMP_NUM_THREADS ...
    ibrun tacc_affinity ./bt-mz.x
Slide 36 / 175
NUMA control (numactl): example run script

  #!/bin/csh
  #$ -cwd
  #$ -j y
  #$ -q systest
  #$ -pe 12way 24            # run 12 MPI processes per node, allocate 24 cores (2 nodes) altogether
  #$ -V
  #$ -l h_rt=00:10:00
  setenv OMP_NUM_THREADS 1   # 1 thread per MPI process
  setenv MY_NSLOTS 16        # only use 16 of the 24 cores for MPI. NOTE: 8 cores unused!
  ibrun tacc_affinity ./bin/sp-mz.D.

ibrun is the command to run the MPI job; tacc_affinity is a numactl script for process/thread placement.
Slide 37 / 175
NUMA operations

  purpose          command                arguments                  description
  Socket affinity  numactl -c             {0,1,2,3}                  Only execute the process on cores of this/these socket(s).
  Memory policy    numactl -l             (no argument)              Allocate on the current socket.
  Memory policy    numactl -i             {0,1,2,3}                  Allocate round robin (interleave) on these sockets.
  Memory policy    numactl --preferred=   {0,1,2,3} (select one)     Allocate on this socket; fall back to any other if full.
  Memory policy    numactl -m             {0,1,2,3}                  Only allocate on this/these socket(s).
  Core affinity    numactl -C             {0,1,...,15}               Only execute the process on this/these core(s).
Slide 38 / 175
Example numactl script

  myway=`echo $PE | sed s/way//`
  export MV2_USE_AFFINITY=0
  export MV2_ENABLE_AFFINITY=0
  my_rank=$PMI_RANK
  local_rank=$(( my_rank % myway ))
  if [ $myway -eq 12 ]; then
    numnode=$(( local_rank / 6 ))
  fi
  exec numactl -c $numnode -m $numnode $*
Slide 39 / 175
Dell Linux Cluster Lonestar topology
(Figure: node topology diagram)
Slide 40 / 175
Dell Linux Cluster Lonestar topology

  CPU type: Intel Core Westmere processor
  *************************************************************
  Hardware Thread Topology
  *************************************************************
  Sockets:           2
  Cores per socket:  6
  Threads per core:  1
  ---------------------------------
  Socket 0: ( 1 3 5 7 9 11 )
  Socket 1: ( 0 2 4 6 8 10 )
  ---------------------------------

Careful! The numbering scheme of cores is system-dependent.
Slide 41 / 175
(Figure: NPB-MZ Class D results on 64, 128, 256, 512, and 1024 nodes; no idle cores)
- BT-MZ improves using hybrid, as expected, due to better load balance
- Unexpected: SP-MZ improves in some cases using hybrid
Slide 42 / 175
Pitfall (1): running 2 threads on the same core
Running NPB BT-MZ Class D, 128 MPI processes, 12 threads each, 1 MPI process per node (1way):
- Pinning A: exec numactl -c 0 -m 0 $*
  Only uses cores and memory on socket 0: 12 threads on 6 cores!
- Pinning B: exec numactl -c 0,1 -m 0,1 $*
  Uses cores and memory on both sockets.
Slide 43 / 175
Pitfall (2): causing remote memory access
Running NPB BT-MZ Class D, 128 MPI processes, 6 threads each, 2 MPI processes per node:
- Pinning A (half of the threads access remote memory: ~600 Gflops):
    if [ $localrank == 0 ]; then
      exec numactl --physcpubind=0,1,2,3,4,5 -m 0 $*
    elif [ $localrank == 1 ]; then
      exec numactl --physcpubind=6,7,8,9,10,11 -m 1 $*
    fi
- Pinning B (only local memory access: ~900 Gflops):
    if [ $localrank == 0 ]; then
      exec numactl --physcpubind=0,2,4,6,8,10 -m 0 $*
    elif [ $localrank == 1 ]; then
      exec numactl --physcpubind=1,3,5,7,9,11 -m 1 $*
    fi
Slide 44 / 175
LU-MZ Class D scalability on Lonestar
- LU-MZ significantly benefits from hybrid mode: pure MPI is limited to 16 cores, due to #zones = 16 (idle cores otherwise)
- Decreased resource contention contributes a large part of the improvement
(Figure: Gflop/s for different MPI x OpenMP combinations)
Slide 45 / 175
Cray XE6 Hermit

  CPU type: AMD Interlagos processor
  Hardware Thread Topology
  Sockets:           2
  Cores per socket:  16
  Threads per core:  1
  Socket 0: 16 cores (0-15), each with a 16 kB L1 cache; one 2 MB L2 cache per pair of cores; two 6 MB L3 caches, one per group of eight cores (socket 1 analogous)

- Located at HLRS Stuttgart, Germany (https://wickie.hlrs.de/platforms/index.php/Cray_XE6)
- 3552 compute nodes, 113,664 cores
- Two AMD 6276 Interlagos processors with 16 cores each, running at 2.3 GHz (TurboCore 3.3 GHz), per node
- Around 1 Pflop theoretical peak performance
- 32 GB of main memory available per node
- 32-way shared memory system
- High-bandwidth interconnect using Cray Gemini communication chips
Slide 46 / 175
Cray XE6 Hermit scalability, continued
(Figure: results on 8K, 16K, and 32K cores)
- Expected: BT-MZ benefits from the hybrid approach; a high number of MPI processes yields bad workload distribution; the best MPIxOMP combination depends on problem size
- Expected: both benchmarks benefit from increased parallelism
- Unexpected: SP-MZ improves when reducing the number of MPI processes; BT-MZ 1024x32 shows unexpectedly low performance
Slide 47 / 175
Cray XE6: CrayPat performance analysis

  module load xt-craypat
  # Compilation:
  ftn -fastsse -r8 -mp[=trace] ...
  # Instrument:
  pat_build -w -g mpi,omp bt.exe bt.exe.pat
  # Execution:
  (export PAT_RT_HWPC={0,1,2,..})
  export OMP_NUM_THREADS=4
  aprun -n NPROCS -d 4 ./bt.exe.pat
  # Generate report:
  pat_report -O load_balance,thread_times,program_time,mpi_callers -O profile_pe.th $1

-d depth specifies the number of CPUs for each PE and its threads.
Slide 48 / 175
BT-MZ 32x4 function profile
(Figure: CrayPat function profile output)
Slide 49 / 175
BT-MZ load balance: 32x4 vs 128x1
(Figures: bt-mz-C.32x4 and bt-mz-C.128x1; maximum, median, and minimum PE are shown)
- bt-mz.C.128x1 shows a large imbalance in User and MPI time
- bt-mz.C.32x4 shows well-balanced times
Slide 50 / 175
IBM Power 6
- Results obtained by the courtesy of the HPCMO Program and the Engineer Research and Development Center Major Shared Resource Center, Vicksburg, MS (http://www.erdc.hpc.mil/index)
- The IBM Power 6 system is located at (http://www.navo.hpc.mil/davinci_about.html)
- 150 compute nodes, 32 4.7 GHz Power6 cores per node (4800 cores total)
- 64 GB of dedicated memory per node
- QLogic InfiniBand DDR interconnect
- IBM MPI: MPI 1.2 + MPI-IO
- Compilation: mpxlf_r -O4 -qarch=pwr6 -qtune=pwr6 -qsmp=omp
  (the flag was essential to achieve full compiler optimization in the presence of OMP directives!)
- Execution: poe launch $PBS_O_WORKDIR/sp.C.16x4.exe
Slide 51 / 175
NPB-MZ Class D on IBM Power 6: exploiting SMT for 2048-core results
- Results for 128-2048 cores (128, 256, 512, and 1024 cores shown; best of each category marked)
- Only 1024 cores were available for the experiments
- BT-MZ and SP-MZ show benefit from Simultaneous Multithreading (SMT): 2048 threads on 1024 cores (e.g., configuration 2048x1)
- Doubling the number of threads through hyperthreading (SMT):
    #!/bin/csh
    #PBS -l select=32:ncpus=64:mpiprocs=NP:ompthreads=NT
Slide 52 / 175
MPI+OpenMP memory usage of NPB-MZ
- At the same total number of cores, using more OpenMP threads reduces the memory usage substantially, up to five times on Hopper Cray XT5 (eight-core nodes).
- Hongzhang Shan, Haoqiang Jin, Karl Fuerlinger, Alice Koniges, Nicholas J. Wright: Analyzing the Effect of Different Programming Models Upon Performance and Memory Usage on Cray XT5 Platforms. Proceedings, CUG 2010, Edinburgh, GB, May 24-27, 2010.
Slide 53 / 175
Conclusions:
- BT-MZ: inherent workload imbalance on the MPI level. #nprocs = #nzones yields poor performance; #nprocs < #zones gives better workload balance but decreases parallelism. Hybrid MPI/OpenMP yields better load balance and maintains the amount of parallelism.
- SP-MZ: no workload imbalance on the MPI level; pure MPI should perform best. MPI/OpenMP outperforms MPI on some platforms due to contention for network access within a node.
- LU-MZ: hybrid MPI/OpenMP increases the level of parallelism.
- All benchmarks: decreased network pressure, lower memory requirements; good process/thread affinity is essential.
Slide 54 / 175
Outline
- Introduction / Motivation
- Programming models on clusters of SMP nodes
- Case Studies / pure MPI vs hybrid MPI+OpenMP
- Hybrid programming & accelerators
- Practical How-To on hybrid programming (Georg Hager, Regionales Rechenzentrum Erlangen (RRZE))
- Mismatch Problems
- Application categories that can benefit from hybrid parallelization
- Other options on clusters of SMP nodes
- Summary
Slide 55 / 175
Hybrid programming how-to: overview
- A practical introduction to hybrid programming: how to compile and link; getting a hybrid program to run on a cluster
- Running (hybrid) programs efficiently on multi-core clusters:
  - Affinity issues: ccNUMA, bandwidth bottlenecks
  - MPI and OpenMP on real hardware: intra-node anisotropy, MPI communication characteristics, OpenMP loop startup overhead
  - Thread/process binding
Slide 56 / 175
How to compile, link, and run
- Use the appropriate OpenMP compiler switch (-openmp, -fopenmp, -mp, -qsmp=openmp, ...) and the MPI compiler script (if available)
- Link with the MPI library: usually wrapped in the MPI compiler script; if required, specify linking against a thread-safe MPI library (often automatic when OpenMP or auto-parallelization is switched on)
- Running the code is highly non-portable! Consult the system docs (if available). If you are on your own, consider the following points:
  - Make sure OMP_NUM_THREADS etc. is available to all MPI processes, e.g., start "env VAR=VALUE <binary>" instead of your binary alone
  - Use Pete Wyckoff's mpiexec MPI launcher (see below): http://www.osc.edu/~pw/mpiexec
  - Figure out how to start fewer MPI processes than cores on your nodes
Slide 57 / 175
Examples for compilation and execution
- Cray XE6 (4 NUMA domains with 8 cores each):
    ftn -h omp ...
    export OMP_NUM_THREADS=8
    aprun -n nprocs -N nprocs_per_node \
          -d $OMP_NUM_THREADS a.out
- Intel Sandy Bridge (8-core 2-socket) cluster, Intel MPI/OpenMP:
    mpiifort -openmp ...
    OMP_NUM_THREADS=8 mpirun -ppn 2 -np 4 \
      -env I_MPI_PIN_DOMAIN socket \
      -env KMP_AFFINITY scatter ./a.out
Slide 58 / 175
Interlude: advantages of mpiexec or similar mechanisms
- The startup mechanism should use a resource manager interface to spawn MPI processes on nodes, as opposed to starting remote processes with ssh/rsh:
  - Correct CPU time accounting in the batch system
  - Faster startup
  - Safe process termination
  - Password-less user login between nodes not required
- Interfaces directly with the batch system to determine the number of procs
- Provisions for starting fewer processes per node than available cores (required for hybrid programming), e.g., the -pernode and -npernode # options; does not require messing around with nodefiles
Slide 59 / 175
Running the code: more examples (with mpiexec)
- Example for using mpiexec on a dual-socket quad-core cluster:
    $ export OMP_NUM_THREADS=8
    $ mpiexec -pernode ./a.out
- Same, but 2 MPI processes per node:
    $ export OMP_NUM_THREADS=4
    $ mpiexec -npernode 2 ./a.out
- Pure MPI:
    $ export OMP_NUM_THREADS=1   # or nothing if serial code
    $ mpiexec ./a.out
Where do the threads run? See later!
Slide 60 / 175
Running the code efficiently?
- Symmetric, UMA-type compute nodes have become rare animals: NEC SX; Intel 1-socket (Xeon 12XX), rare in cluster environments; Hitachi SR8000, IBM SP2, single-core multi-socket Intel Xeon (all dead)
- Instead, systems have become non-isotropic on the node level:
  - ccNUMA (AMD Opteron, SGI Altix, IBM Power7, Intel Sandy/Ivy Bridge)
  - Multi-core, multi-socket: shared vs. separate caches, multi-chip vs. single-chip, separate/shared buses
Slide 61 / 175
Issues for running code efficiently on non-isotropic nodes
- ccNUMA locality effects: penalties for access across locality domains; impact of contention; consequences of file I/O for page placement; placement of MPI buffers
- Multi-core / multi-socket anisotropy effects: bandwidth bottlenecks, shared caches; intra-node MPI performance (core-core vs. socket-socket); OpenMP loop overhead depends on the mutual position of threads in the team
Slide 62 / 175
A short introduction to ccNUMA
ccNUMA: the whole memory is transparently accessible by all processors, but physically distributed, with varying bandwidth and latency, and potential contention (shared memory paths).
(Figure: two locality domains, each with processors and caches (C) attached to local memory (M))
Slide 63 / 175
How much bandwidth does non-local access cost?
Example: AMD Magny Cours 4-socket system (8 chips, 4 sockets), STREAM Triad bandwidth measurements
(Figure: bandwidth between the eight ccNUMA domains 0-7)
Slide 64 / 175
How much bandwidth does non-local access cost?
Example: Intel Sandy Bridge 2-socket system (2 chips, 2 sockets), STREAM Triad bandwidth measurements
(Figure: bandwidth between the two ccNUMA domains 0 and 1)
General rule: the more ccNUMA domains, the larger the non-local access penalty.
Slide 65 / 175
ccNUMA memory locality problems
- Locality of reference is key to scalable performance on ccNUMA; less of a problem with pure MPI, but see below
- What factors can destroy locality?
  - MPI programming: processes lose their association with the CPU the mapping took place on originally; the OS kernel tries to maintain strong affinity, but sometimes fails
  - Shared memory programming (OpenMP, hybrid): threads lose their association with the CPU the mapping took place on originally; improper initialization of distributed data; lots of extra threads running on a node, especially for hybrid
  - All cases: other agents (e.g., the OS kernel) may fill memory with data that prevents optimal placement of user data
Slide 66 / 175
Avoiding locality problems
- How can we make sure that memory ends up where it is close to the CPU that uses it? See the following slides.
- How can we make sure that it stays that way throughout program execution? See the end of this section.
Slide 67 / 175
Solving memory locality problems: first touch
"Golden Rule" of ccNUMA: a memory page gets mapped into the local memory of the processor that first touches it!
- Except if there is not enough local memory available; this might be a problem, see later
- Some OSs allow influencing placement in more direct ways, cf. libnuma (Linux), MPO (Solaris), ...
- Caveat: "touch" means "write", not "allocate"
- Example (the loop body was truncated in the source; reconstructed):
    double *huge = (double*)malloc(N*sizeof(double));
    // memory not mapped yet
    for(i=0; i<N; i++)
        huge[i] = 0.0;   // mapping takes place here
Most simple case: explicit initialization

  ! serial initialization: all of A is first-touched
  ! by the master thread (bad placement)
  integer,parameter :: N=10000000
  double precision A(N), B(N)
  A=0.d0
  !$OMP parallel do
  do i = 1, N
    B(i) = function ( A(i) )
  end do
  !$OMP end parallel do

  ! parallel first touch with the same static schedule
  ! as the compute loop (good placement)
  integer,parameter :: N=10000000
  double precision A(N),B(N)
  !$OMP parallel
  !$OMP do schedule(static)
  do i = 1, N
    A(i)=0.d0
  end do
  !$OMP end do
  ...
  !$OMP do schedule(static)
  do i = 1, N
    B(i) = function ( A(i) )
  end do
  !$OMP end do
  !$OMP end parallel
Slide 69 / 175
ccNUMA problems beyond first touch
- The OS uses part of main memory for the disk buffer (FS) cache; if the FS cache fills part of memory, apps will probably allocate from foreign domains: non-local access!
- Locality problem even on hybrid and pure MPI with asymmetric file I/O, i.e., if not all MPI processes perform I/O
- Remedies:
  - Drop FS cache pages after the user job has run (admin's job); only prevents cross-job buffer cache heritage
  - Sweeper code (run by the user)
  - Flush the buffer cache after I/O if necessary (sync is not sufficient!)
(Figure: two locality domains; buffer cache (BC) pages from earlier I/O occupy one domain, pushing user data onto the other)
Slide 70 / 175
ccNUMA problems beyond first touch: buffer cache
Real-world example: ccNUMA and the Linux buffer cache. Benchmark:
1. Write a file of some size from LD0 to disk
2. Perform a bandwidth benchmark using all cores in LD0 and the maximum memory installed in LD0
Result: by default, the buffer cache is given priority over local page placement. Restrict to the local domain if possible! (Cray: aprun -ss)
Slide 71 / 175
Intra-node MPI characteristics: IMB Ping-Pong benchmark
Code (to be run on 2 cores of a Cray XE6 node):

  wc = MPI_WTIME()
  do i=1,NREPEAT
    if(rank.eq.0) then
      call MPI_SEND(buffer,N,MPI_BYTE,1,0,MPI_COMM_WORLD,ierr)
      call MPI_RECV(buffer,N,MPI_BYTE,1,0,MPI_COMM_WORLD, &
                    status,ierr)
    else
      call MPI_RECV(...)
      call MPI_SEND(...)
    endif
  enddo
  wc = MPI_WTIME() - wc

- Intranode (1S): aprun -n 2 -cc 0,1 ./a.out
- Intranode (2S): aprun -n 2 -cc 0,16 ./a.out
- Internode:      aprun -n 2 -N 1 ./a.out
Slide 72 / 175
IMB Ping-Pong: latency, intra-node vs. inter-node on Cray XE6
(Figure: latency in µs: internode 1.8; intranode 2S 0.56; intranode 1S 0.3)
Affinity matters!
Slide 73 / 175
IMB Ping-Pong: bandwidth characteristics, intra-node vs. inter-node on Cray XE6
(Figure: bandwidth vs. message size: between two cores of one socket; between two sockets of one node; between two nodes via InfiniBand)
Slide 74 / 175
The throughput-parallel vector triad benchmark: microbenchmarking for architectural exploration
Every core runs its own, independent triad benchmark: pure hardware probing, no impact from OpenMP overhead.

  double precision, dimension(:), allocatable :: A,B,C,D
  !$OMP PARALLEL private(i,j,A,B,C,D)
  allocate(A(1:N),B(1:N),C(1:N),D(1:N))
  A=1.d0; B=A; C=A; D=A
  do j=1,NITER
    do i=1,N
      A(i) = B(i) + C(i) * D(i)
    enddo
    if(.something.that.is.never.true.) then
      call dummy(A,B,C,D)
    endif
  enddo
  !$OMP END PARALLEL
Slide 75 / 175
Throughput vector triad on Sandy Bridge socket (3 GHz)
(Figure: scalable bandwidth in the L1, L2, and L3 caches; saturation effect in memory)
Slide 76 / 175
The OpenMP-parallel vector triad benchmark: visualizing OpenMP overhead
OpenMP work sharing in the benchmark loop:

  double precision, dimension(:), allocatable :: A,B,C,D
  allocate(A(1:N),B(1:N),C(1:N),D(1:N))
  A=1.d0; B=A; C=A; D=A
  !$OMP PARALLEL private(i,j)
  do j=1,NITER
  !$OMP DO
    do i=1,N
      A(i) = B(i) + C(i) * D(i)
    enddo
  !$OMP END DO        ! implicit barrier
    if(.something.that.is.never.true.) then
      call dummy(A,B,C,D)
    endif
  enddo
  !$OMP END PARALLEL
Slide 77 / 175
OpenMP vector triad on Sandy Bridge socket (3 GHz)
(Figure: bandwidth scalability across the memory interfaces; sync overhead grows with the number of threads)
Slide 78 / 175
Thread synchronization overhead on Sandy Bridge-EP
Direct measurement of barrier overhead in CPU cycles:

  2 threads      Intel 13.1.0   GCC 4.7.0   GCC 4.6.1
  Shared L3            384          5242        4616
  SMT threads         2509          3726        3399
  Other socket        1375          5959        4909

  Full domain    Intel 13.1.0   GCC 4.7.0   GCC 4.6.1
  Socket              1497         14546       14418
  Node                3401         34667       29788
  Node + SMT          6881         59038       58898

gcc is still not very competitive with the Intel compiler.
Slide 79 / 175
Thread/process affinity ("pinning")
- Highly OS-dependent system calls, but available on all systems:
  - Linux: sched_setaffinity(), PLPA, hwloc
  - Windows: SetThreadAffinityMask()
- Support for semi-automatic pinning in some compilers/environments: Intel compilers > V9.1 (KMP_AFFINITY environment variable); PathScale; generic Linux: taskset, numactl, likwid-pin (see below); OpenMP 4.0: support for affinity
- Affinity awareness in MPI libraries: Cray MPI, OpenMPI, Intel MPI
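As an illustration of the Linux system-call route (a hedged sketch, assuming core IDs equal thread IDs, which is not true on every machine; see the numbering caveats above):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      #pragma omp parallel
      {
          cpu_set_t set;
          CPU_ZERO(&set);
          CPU_SET(omp_get_thread_num(), &set);    /* assumed: core id = thread id */
          if (sched_setaffinity(0, sizeof(set), &set) != 0)   /* 0 = calling thread */
              perror("sched_setaffinity");
          printf("thread %d now on core %d\n", omp_get_thread_num(), sched_getcpu());
      }
      return 0;
  }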
Slide 80 / 175
How do we figure out the topology?
...and how do we enforce the mapping without changing the code? Compilers and MPI libs may give you ways to do that, but LIKWID supports all sorts of combinations:
LIKWID = Like I Knew What I'm Doing
Open source tool collection (developed at RRZE): http://code.google.com/p/likwid
Slide 81 / 175
LIKWID tool suite
- Command line tools for Linux: works with a standard Linux >= 2.6 kernel; supports Intel and AMD CPUs; supports all compilers whose OpenMP implementation is based on pthreads
- Current tools:
  - likwid-topology: print thread and cache topology (similar to lstopo from the hwloc package)
  - likwid-pin: pin a threaded application without touching its code
  - likwid-perfctr: measure performance counters
  - likwid-perfscope: performance oscilloscope with real-time display
  - likwid-powermeter: current power consumption of the chip (alpha stage)
  - likwid-features: view and enable/disable hardware prefetchers
  - likwid-bench: low-level bandwidth benchmark generator tool
  - likwid-mpirun: mpirun wrapper script for easy LIKWID integration
Slide 82 / 175
likwid-topology: topology information
- Based on cpuid information
- Functionality: measured clock frequency; thread topology; cache topology; cache parameters (-c command line switch); ASCII art output (-g command line switch)
- Currently supported: Intel Core 2 (45 nm + 65 nm); Intel Nehalem, Westmere, Sandy Bridge; AMD Magny Cours, Interlagos; Intel Xeon Phi (beta stage)
Slide 83 / 175
Output of likwid-topology:

  CPU name:  Intel Core i7 processor
  CPU clock: 2666683826 Hz
  *************************************************************
  Hardware Thread Topology
  *************************************************************
  Sockets:           2
  Cores per socket:  4
  Threads per core:  2
  -------------------------------------------------------------
  HWThread  Thread  Core  Socket
  0         0       0     0
  1         1       0     0
  2         0       1     0
  3         1       1     0
  4         0       2     0
  5         1       2     0
  6         0       3     0
  7         1       3     0
  8         0       0     1
  9         1       0     1
  10        0       1     1
  11        1       1     1
  12        0       2     1
  13        1       2     1
  14        0       3     1
  15        1       3     1
  -------------------------------------------------------------
Slide 84 / 175
likwid-topology continued:

  Socket 0: ( 0 1 2 3 4 5 6 7 )
  Socket 1: ( 8 9 10 11 12 13 14 15 )
  *************************************************************
  Cache Topology
  *************************************************************
  Level 1, size 32 kB,  cache groups: ( 0 1 ) ( 2 3 ) ( 4 5 ) ( 6 7 ) ( 8 9 ) ( 10 11 ) ( 12 13 ) ( 14 15 )
  Level 2, size 256 kB, cache groups: ( 0 1 ) ( 2 3 ) ( 4 5 ) ( 6 7 ) ( 8 9 ) ( 10 11 ) ( 12 13 ) ( 14 15 )
  Level 3, size 8 MB,   cache groups: ( 0 1 2 3 4 5 6 7 ) ( 8 9 10 11 12 13 14 15 )

...and also try the ultra-cool -g option!
Slide 85 / 175
likwid-pin
- Inspired by and based on ptoverride (Michael Meier, RRZE) and taskset
- Pins processes and threads to specific cores without touching code
- Directly supports pthreads, gcc OpenMP, Intel OpenMP
- Allows the user to specify a skip mask (i.e., supports many different compiler/MPI combinations)
- Can also be used as a replacement for taskset
- Uses logical (contiguous) core numbering when running inside a restricted set of cores; supports logical core numbering inside node, socket, core
- Usage examples:
    env OMP_NUM_THREADS=6 likwid-pin -c 0,1,2,4-6 ./myApp parameters
    env OMP_NUM_THREADS=6 likwid-pin -c S0:0-2@S1:0-2 ./myApp
Slide 86 / 175
Example: STREAM benchmark on 12-core Intel Westmere: anarchy vs. thread pinning
(Figure: bandwidth with no pinning vs. with pinning, round-robin across sockets, physical cores first)
Slide 87 / 175
likwid-pin example: Intel OpenMP
Running the STREAM benchmark with likwid-pin:

  $ export OMP_NUM_THREADS=4
  $ likwid-pin -c 0,1,4,5 ./stream
  [likwid-pin] Main PID -> core 0 - OK          (main PID always pinned)
  ----------------------------------------------
  Double precision appears to have 16 digits of accuracy
  Assuming 8 bytes per DOUBLE PRECISION word
  ----------------------------------------------
  [... some STREAM output omitted ...]
  The *best* time for each test is used
  *EXCLUDING* the first and last iterations
  [pthread wrapper] PIN_MASK: 0->1 1->4 2->5
  [pthread wrapper] SKIP MASK: 0x1
  [pthread wrapper 0] Notice: Using libpthread.so.0
    threadid 1073809728 -> SKIP                 (skip shepherd thread)
  [pthread wrapper 1] Notice: Using libpthread.so.0
    threadid 1078008128 -> core 1 - OK          (pin all spawned threads in turn)
  [pthread wrapper 2] Notice: Using libpthread.so.0
    threadid 1082206528 -> core 4 - OK
  [pthread wrapper 3] Notice: Using libpthread.so.0
    threadid 1086404928 -> core 5 - OK
  [... rest of STREAM output omitted ...]
Slide 88 / 175
Topology (mapping) choices with MPI+OpenMP: more examples using Intel MPI+compiler & a home-grown mpirun
- One MPI process per node:
    env OMP_NUM_THREADS=8 mpirun -pernode \
      likwid-pin -c 0-7 ./a.out
- One MPI process per socket:
    env OMP_NUM_THREADS=4 mpirun -npernode 2 \
      -pin "0,1,2,3_4,5,6,7" ./a.out
- OpenMP threads pinned round robin across the cores in a node:
    env OMP_NUM_THREADS=4 mpirun -npernode 2 \
      -pin "0,1,4,5_2,3,6,7" \
      likwid-pin -c L:0,2,1,3 ./a.out
- Two MPI processes per socket:
    env OMP_NUM_THREADS=2 mpirun -npernode 4 \
      -pin "0,1_2,3_4,5_6,7" \
      likwid-pin -c L:0,1 ./a.out
Slide 89 / 175
MPI/OpenMP hybrid how-to: take-home messages
- Learn how to take control of hybrid execution!
- Always observe the topology dependence of: intranode MPI; OpenMP overheads; saturation effects / scalability behavior with bandwidth-bound code
- Enforce proper thread/process-to-core binding, using appropriate tools (whatever you use, but use SOMETHING)
- Multi-LD OpenMP processes on ccNUMA nodes require correct page placement
Slide 90 / 175
Outline
- Introduction / Motivation
- Programming models on clusters of SMP nodes
- Case Studies / pure MPI vs hybrid MPI+OpenMP
- Hybrid programming & accelerators
- Practical How-To on hybrid programming
- Mismatch Problems
- Opportunities: Application categories that can benefit from hybrid parallelization
- Other options on clusters of SMP nodes
- Summary
Slide 91 / 175
Mismatch Problems
- None of the programming models fits the hierarchical hardware (cluster of SMP nodes): core, CPU (socket), SMP board, ccNUMA node, cluster of ccNUMA/SMP nodes
- Several mismatch problems: following slides
- Benefit through hybrid programming: opportunities, see next section
- The quantitative implications depend on your application; in most cases both categories apply! Examples:

                                            No. 1    No. 2
  Benefit through hybrid (see next section)  +30%     +10%
  Loss by mismatch problems                  -10%     -25%
  Total                                      +20%     -15%
Slide 92 / 175
The Topology Problem
Problem: the application topology must be mapped to the hardware topology
→ determines the communication topology and message sizes
→ determines the communication overhead.
Partially independent of the programming model (pure MPI, hybrid MPI+OpenMP, hybrid MPI+MPI).
Simplifications:
  SMP node: O(N³) data items per node
  Cluster network: O(N²) neighbor communication per neighbor
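To make these orders concrete, a short worked ratio (an added illustration, assuming a cubic N×N×N data domain per node with O(N²) halo faces):

% Halo traffic per node grows with the surface, computation with the volume:
\[
  \frac{T_{\mathrm{comm}}}{T_{\mathrm{comp}}}
  \;\sim\; \frac{c_1 N^2}{c_2 N^3} \;=\; O\!\left(\frac{1}{N}\right)
\]

i.e., the larger the per-node subdomain, the smaller the relative communication cost.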
Slide 93 / 175
The Topology Problem, without inner halo communication

Applies to: hybrid MPI+OpenMP, hybrid MPI+MPI

Communication only through neighbor accesses between the d ccNUMA domains.
Compare the ccNUMA communication between the ccNUMA domains
(s = communication size per domain; example: d = 8 ccNUMA domains):
  1-dimensional data decomposition: s ~ 2·N²
  3-dimensional data decomposition (2x2x2): s ~ 6·(N/2)² = 1.5·N²

→ No real win! Don't care about dimensions within the SMP nodes!
   Make your software simple!
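A worked check of the d = 8 numbers above (assuming a cubic N³ node domain and a 2×2×2 domain split):

\[
  s_{1\mathrm{dim}} \sim 2N^2, \qquad
  s_{3\mathrm{dim}} \sim 6\left(\tfrac{N}{2}\right)^{2} = 1.5N^2, \qquad
  \frac{s_{1\mathrm{dim}}}{s_{3\mathrm{dim}}} = \frac{2}{1.5} \approx 1.33
\]

The best possible 3-D decomposition saves only about 25% of the intra-node traffic, hence "no real win".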
Slide 94 / 175
The Topology Problem, with inner halo communication

Applies to: pure MPI, hybrid MPI+MPI

With halo cells and halo communication between the cores; we ignore differences in core-to-core communication speed within a ccNUMA domain and between ccNUMA domains of one SMP node (s = communication size per core).

Example with c = 32 cores per SMP node:
  c=32, 1-dimensional data decomposition: s ~ 2·N²
  c=32, 3-dimensional data decomposition (4x4x2):
    s ~ 2·(N/4)(N/4) + 2·(N/4)(N/2) + 2·(N/4)(N/2) = 0.63·N²

In general: win = s_1dim / s_3dim = c^(2/3) / 3
  c = 16, 32, 64, ... → win = factor 2, 3, 5, ...

→ Real win? You may not care as long as your inner-node communication is below xx%!
   Make your software simple!?
Slide 95 / 175
The Mapping Problem with mixed model
Several multi-threaded MPI processes per SMP node.
Problem: where are your processes and threads really located?
Solutions: depend on your platform, e.g., pinning with numactl (see the verification sketch below).
Do we have this? or that?
[Figure: two 4-way multi-threaded MPI processes on a two-socket quad-core SMP node, either with each process's threads confined to one socket, or with each process's threads spread across both sockets]
→ Case study on Sun Constellation Cluster Ranger with BT-MZ and SP-MZ
Further questions: Where is the NIC 1) located? Which cores share caches?
1) NIC = Network Interface Card
Applies to: pure MPI, hybrid MPI+OpenMP, hybrid MPI+MPI
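Whatever tool is used, the resulting placement can also be verified from inside the application. A minimal hybrid sketch (not from the tutorial; Linux-specific via sched_getcpu, and the file name is illustrative):

/* where.c: print where each MPI rank / OpenMP thread actually runs.
   Compile e.g.: mpicc -fopenmp where.c -o where */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>      /* sched_getcpu() */
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int rank, provided, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(host, &len);

    #pragma omp parallel
    {
        /* sched_getcpu() returns the core the calling thread runs on right now */
        printf("host %s  rank %d  thread %d  ->  core %d\n",
               host, rank, omp_get_thread_num(), sched_getcpu());
    }

    MPI_Finalize();
    return 0;
}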
Slide 96 / 175
Unnecessary intra-node communication
Problem: with several MPI processes on each SMP node
→ unnecessary intra-node communication.
Solution:
  MPI+OpenMP: only one MPI process per SMP node
  MPI+MPI: no halo communication within an SMP node
Remarks:
  MPI communication within an SMP node: 2 copies (user send buffer → shared memory → user recv buffer)
  MPI-3 shared memory halo communication: 1 copy (user send buffer → user recv buffer)
  MPI-3 with direct access to neighbor data: 0 copies (see the sketch below)
[Margin labels: pure MPI & mixed model (several multi-threaded MPI processes per SMP node); hybrid MPI+MPI (with halo communication); hybrid MPI+MPI (with direct neighbor access)]
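Both the 1-copy and the 0-copy variant rest on MPI-3 shared memory windows. A minimal sketch of allocating such a window and obtaining a direct pointer into a neighbor's segment (buffer size, neighbor choice, and the fence-based synchronization are illustrative simplifications):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Split off a communicator containing only the ranks of this SMP node */
    MPI_Comm nodecomm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &nodecomm);

    int nrank, nsize;
    MPI_Comm_rank(nodecomm, &nrank);
    MPI_Comm_size(nodecomm, &nsize);

    /* Each rank contributes 1000 doubles to one node-wide shared window */
    double *mine;
    MPI_Win win;
    MPI_Win_allocate_shared(1000 * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, nodecomm, &mine, &win);

    /* Direct (0-copy) access: query the base address of a neighbor's segment */
    int left = (nrank + nsize - 1) % nsize;
    MPI_Aint size;
    int disp;
    double *neighbor;
    MPI_Win_shared_query(win, left, &size, &disp, &neighbor);

    mine[0] = (double)nrank;
    MPI_Win_fence(0, win);          /* simplistic synchronization */
    printf("rank %d reads %f from its left neighbor\n", nrank, neighbor[0]);

    MPI_Win_fence(0, win);
    MPI_Win_free(&win);
    MPI_Comm_free(&nodecomm);
    MPI_Finalize();
    return 0;
}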
Slide 97 / 175
Sleeping threads and network saturation with masteronly

Problem 1: can the master thread saturate the network?
  Solution: if not, use the mixed model, i.e., several MPI processes per SMP node.
Problem 2: sleeping threads are wasting CPU time.
  Solution: overlapping of computation and communication.
Problems 1 & 2 together: producing even more idle time through the lousy bandwidth of the master thread.
for (iteration ...) {
  #pragma omp parallel
  {
    /* numerical code */
  } /* end omp parallel */

  /* on master thread only */
  MPI_Send(original data to halo areas in other SMP nodes);
  MPI_Recv(halo data from the neighbors);
} /* end for loop */
Masteronly: MPI only outside of parallel regions (a compilable sketch follows below)
[Figure: two SMP nodes with two sockets each; only the master thread in each node communicates over the node interconnect]
Applies to: hybrid MPI+OpenMP
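For reference, a compilable sketch of the masteronly pattern for a 1-D halo exchange (the ring topology, array size, and the placeholder numerics are assumptions, not tutorial code):

#include <mpi.h>

#define NLOCAL 1024

/* Masteronly pattern: all MPI outside of OpenMP parallel regions.
   Halo exchange with left/right neighbors in a 1-D ring. */
void iterate(double *u, int rank, int size, int niter)
{
    double halo_left, halo_right;
    for (int it = 0; it < niter; it++) {
        #pragma omp parallel for
        for (int i = 1; i < NLOCAL - 1; i++)
            u[i] = 0.5 * (u[i-1] + u[i+1]);   /* placeholder numerics */

        /* master thread only: exchange boundary values */
        int left  = (rank + size - 1) % size;
        int right = (rank + 1) % size;
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &halo_right, 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[NLOCAL-2], 1, MPI_DOUBLE, right, 1,
                     &halo_left, 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        u[0] = halo_left;
        u[NLOCAL-1] = halo_right;
    }
}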
Slide 98 / 175
OpenMP: Additional Overhead & Pitfalls

Using OpenMP may prohibit compiler optimization
→ may cause significant loss of computational performance.
Thread fork / join overhead.
On ccNUMA SMP nodes: loss of performance due to missing memory page locality or a missing first-touch strategy (see the sketch below).
E.g., with the masteronly scheme: one thread produces the data, the master thread sends it with MPI → the data may be internally communicated from one memory domain to the other.
Amdahl's law applies on each level of parallelism.
Using MPI-parallel application libraries? → Are they prepared for hybrid?
Using thread-local application libraries? → Are they thread-safe?
See, e.g., the necessary -O4 flag with mpxlf_r on IBM Power6 systems.
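The page-locality pitfall can usually be avoided by initializing data in parallel with the same schedule as the compute loops. A minimal first-touch sketch (array size and names are illustrative):

#include <stdlib.h>

#define N 10000000L

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));

    /* First touch in parallel: each page is mapped into the ccNUMA
       domain of the thread that touches it first ... */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = (double)i; }

    /* ... so the later compute loop, with the same static schedule,
       accesses mostly local memory. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++) a[i] += 2.0 * b[i];

    free(a);
    free(b);
    return 0;
}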
Applies to: hybrid MPI+OpenMP
Slide 99 / 175
MPI-3 shared memory programming

Pros:
  ISV and application libraries need not be thread-safe
  No additional OpenMP overhead, no OpenMP problems
Cons:
  Library calls (MPI_WIN_ALLOCATE_SHARED) instead of SHARED / PRIVATE compiler directives
  No work-sharing directives → loop scheduling must be programmed by hand
  No support for fine-grained or auto-balanced work-sharing (as with OpenMP tasks, and dynamic or guided loop schedules)
  Virtual addresses of a shared memory window may be different in each MPI process → no binary pointers, i.e., linked lists must be stored with offsets rather than pointers (see the sketch below)
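To illustrate the offset technique: a hypothetical node type and traversal helper for a linked list stored inside a shared memory window (the names are illustrative; the base pointer would come from MPI_Win_shared_query in each process):

#include <mpi.h>
#include <stddef.h>

/* A list node living inside an MPI-3 shared memory window.
   'next' is a byte offset from the window base, not a pointer,
   so it is valid in every process regardless of the mapping. */
typedef struct {
    double   value;
    MPI_Aint next;   /* offset of next node, or -1 for end of list */
} node_t;

/* Resolve an offset against this process's own base address. */
static node_t *node_at(char *base, MPI_Aint off)
{
    return (off < 0) ? NULL : (node_t *)(base + off);
}

/* Example traversal: sum all values starting at offset 'head'. */
static double list_sum(char *base, MPI_Aint head)
{
    double s = 0.0;
    for (node_t *n = node_at(base, head); n; n = node_at(base, n->next))
        s += n->value;
    return s;
}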
Applies to: hybrid MPI+MPI (MPI for inter-node communication + MPI-3.0 shared memory programming)
Slide 100 / 175
Parallel Programming Models on Hybrid Platforms

pure MPI: one MPI process on each core

hybrid MPI+OpenMP: MPI for inter-node communication, OpenMP inside of each SMP node
  - No overlap of communication + computation:
    Masteronly: MPI only outside of parallel regions of the numerical application code
  - Overlapping communication + computation (MPI communication by one or a few threads while other threads are computing):
    - Funneled: MPI only on the master thread
      - Funneled & Reserved: reserved thread for communication
      - Funneled with Full Load Balancing
    - Multiple: more than one thread may communicate (requires MPI_THREAD_MULTIPLE; see the sketch below)

hybrid MPI+MPI (new): MPI for inter-node communication + MPI-3.0 shared memory programming
  - Within shared memory nodes: halo updates through direct data copy (new)
  - Within shared memory nodes: no halo updates, direct access to neighbor data (new)

OpenMP only: distributed virtual shared memory
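Which of these models is usable also depends on the thread support level granted by the MPI library, so it should be requested and checked at startup. A minimal sketch:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;
    /* Masteronly/funneled needs MPI_THREAD_FUNNELED;
       the "multiple" style needs MPI_THREAD_MULTIPLE. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    if (provided < MPI_THREAD_FUNNELED) {
        fprintf(stderr, "insufficient MPI thread support (%d)\n", provided);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    /* fall back to funneled-style communication if only
       MPI_THREAD_FUNNELED / SERIALIZED is provided */

    MPI_Finalize();
    return 0;
}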
Slide 101 / 175
Overlapping communication and computation: three problems

1. The application problem:
   one must separate the application into code that can run before the halo data is received and code that needs the halo data
   → very hard to do!
2. The thread-rank problem:
   communication / computation split via thread rank
   → cannot use work-sharing directives
   → loss of major OpenMP support (see next slide)
3. The load balancing problem
if (my_thread_rank < 1) {
  MPI_Send/Recv....                        /* communication thread */
} else {                                   /* compute threads 1 .. num_threads-1 */
  my_range = (high - low - 1) / (num_threads - 1) + 1;
  my_low   = low + (my_thread_rank - 1) * my_range;
  my_high  = low + (my_thread_rank - 1 + 1) * my_range;
  my_high  = min(high, my_high);
  for (i = my_low; i < my_high; i++) {
    /* numerical code */
  }
}

Slide 102 / 175
Experiment: matrix-vector multiply (MVM) in a Jacobi-Davidson solver on IBM SP Power3 nodes with 16 CPUs per node.

funneled & reserved is always faster in this experiment.
Reason: the memory bandwidth is already saturated by 15 CPUs, see inset.
[Inset: speedup on 1 SMP node using different numbers of threads (funneled & reserved vs. masteronly)]
Source: R. Rabenseifner, G. Wellein: Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures. International Journal of High Performance Computing Applications, Vol. 17, No. 1, 2003, Sage Science Press.
[Figure: performance ratio (r) of the two schemes; annotated regions mark where funneled & reserved is faster and where masteronly is faster]
Slide 103 / 175
Overlapping: Using OpenMP tasks
The new OpenMP tasking model gives a new way to achieve more parallelism from hybrid computation.
Slides, courtesy of Alice Koniges, NERSC, LBNL
Alice Koniges et al.:Application Acceleration on Current and Future Cray Platforms.Proceedings, CUG 2010, Edinburgh, GB, May 24-27, 2010.
Slide 104 / 175
Case study: Communication and Computation in Gyrokinetic Tokamak Simulation (GTS) shift routine
Work on the particle array (packing for sending, reordering, adding after sending) can be overlapped with data-independent MPI communication using OpenMP tasks.
[Figure: GTS shift routine code, with regions marked INDEPENDENT, INDEPENDENT, and SEMI-INDEPENDENT]
Slides, courtesy of Alice Koniges, NERSC, LBNL
Slide 105 / 175
Overlapping can be achieved with OpenMP tasks (1st part)
Overlapping MPI_Allreduce with particle work:
  Overlap: the master thread (inside !$omp master) encounters the tasking statements and creates work for the thread team for deferred execution; the MPI_Allreduce call is executed immediately.
  The MPI implementation has to support at least MPI_THREAD_FUNNELED.
  Subdividing the tasks into smaller chunks allows better load balancing and scalability among threads (a minimal sketch follows below).

Slides, courtesy of Alice Koniges, NERSC, LBNL
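A minimal C sketch of this pattern (the actual GTS code is Fortran; work_on_particles and the chunk count are placeholders, not tutorial code):

#include <mpi.h>

static void work_on_particles(int chunk) { (void)chunk; /* placeholder particle work */ }

/* Overlap a collective with independent particle work using OpenMP tasks. */
void shift_sketch(double *local_sum, double *global_sum)
{
    #pragma omp parallel
    {
        #pragma omp master
        {
            /* 1. Spawn independent work as tasks (deferred execution) */
            for (int chunk = 0; chunk < 16; chunk++) {
                #pragma omp task firstprivate(chunk)
                work_on_particles(chunk);
            }
            /* 2. The master immediately performs the collective
                  (requires at least MPI_THREAD_FUNNELED) */
            MPI_Allreduce(local_sum, global_sum, 1, MPI_DOUBLE,
                          MPI_SUM, MPI_COMM_WORLD);
        }
        /* the other threads execute the queued tasks meanwhile */
    } /* implicit barrier: remaining tasks complete here */
}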
Slide 106 / 175
Overlapping can be achieved with OpenMP tasks (2nd part)
Overlapping particle reordering and the remaining MPI_Sendrecv:
Particle reordering of the remaining particles, adding the sent particles into the array, and sending or receiving the shifted particles can be executed independently.
Slides, courtesy of Alice Koniges, NERSC, LBNL
Slide 107 / 175
OpenMP tasking version outperforms original shifter, especially in larger poloidal domains
Performance breakdown of the GTS shifter routine using 4 OpenMP threads per MPI process, with varying domain decomposition and particles per cell, on Franklin (Cray XT4).
MPI communication in the shift phase uses a toroidal MPI communicator (of constant size 128).
Large performance differences between the 256-process run and the 2048-process run!
The speed-up is expected to be even higher on larger GTS runs with hundreds of thousands of CPUs, since MPI communication becomes more expensive.
[Figure panels: 256-process run and 2048-process run]
Slides, courtesy of Alice Koniges, NERSC, LBNL
Slide 108 / 175
Outline
Introduction / Motivation
Programming models on clusters of SMP nodes
Case Studies / pure MPI vs hybrid MPI+OpenMP
Hybrid programming & accelerators
Practical How-To on hybrid programming
Mismatch Problems
→ Opportunities: Application categories that can benefit from hybrid parallelization
Other options on clusters of SMP nodes
Summary
Slide 109 / 175
Nested Parallelism
Example NPB: BT-MZ (block-tridiagonal simulated CFD application)
Outer loop:
  limited number of zones → limited parallelism
  zones with different workloads → reduced speedup