Hybrid MPI and OpenMP Parallel Programming


  • Höchstleistungsrechenzentrum Stuttgart

    SUPERsmith

    Hybrid Parallel Programming

    Hybrid MPI and OpenMP Parallel Programming

    MPI + OpenMP and other models on clusters of SMP nodes

    Rolf Rabenseifner 1), Georg Hager 2), Gabriele Jost 3)
    [email protected], [email protected], [email protected]

    1) High Performance Computing Center (HLRS), University of Stuttgart, Germany
    2) Regional Computing Center (RRZE), University of Erlangen, Germany
    3) Supersmith, Maximum Performance Software, USA

    The author is currently working for Intel Corporation. All material presented has been prepared while still under the affiliation of Supersmith. It does not represent the views of Intel Corporation.

    Tutorial tut123 at SC13, November 17, 2013, Denver (CO) USA

    Slide 1

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Motivation

    Efficient programming of clusters of shared memory (SMP) nodes

    Hierarchical system layout

    Hybrid programming seems natural:
        MPI between the nodes
        Shared memory programming inside of each SMP node:
            OpenMP, or MPI-3 shared memory programming
            Accelerator support in OpenMP 4.0 and OpenACC

    [Figure: cluster of ccNUMA/SMP nodes: each SMP board holds multi-core CPUs (sockets) with shared memory; the ccNUMA nodes are coupled by a node interconnect]

    Slide 2 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    [Figure: two SMP nodes, each with two quad-core CPU sockets, connected by a node interconnect; depicted options range from one MPI process per core (pure MPI) over 4-way multi-threaded MPI processes (one per socket) to 8-way multi-threaded MPI processes (one per node)]

    Motivation

    Which programming model is fastest?
        MPI everywhere?
        Fully hybrid MPI & OpenMP?
        Something in between? (mixed model)

    Often hybrid programming is slower than pure MPI: examples and reasons follow.


    Slide 3 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Goals of this tutorial

    Sensitize to problems on clusters of SMP nodes
        see sections: Case studies, Mismatch problems
    Technical aspects of hybrid programming
        see sections: Programming models on clusters, Examples on hybrid programming
    Opportunities with hybrid programming
        see section: Opportunities: Application categories that can benefit from hybrid parallelization
    Issues and their solutions
        with sections: Thread-safety quality of MPI libraries, Tools for debugging and profiling for MPI+OpenMP

    Less frustration & more success with your parallel program on clusters of SMP nodes

    Slide 4 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Outline (slide number)

    Introduction / Motivation (2)
    Programming models on clusters of SMP nodes (6)
    Case Studies / pure MPI vs hybrid MPI+OpenMP (28)
    Practical How-To on hybrid programming (55)
    Mismatch Problems (91)
    Opportunities: Application categories that can benefit from hybrid parallelization (109)
    Other options on clusters of SMP nodes (118)
    Accelerators (135)
    Summary (151)
    Appendix (158)
    Content (detailed) (174)

    Sessions: 08:30-10:00 and 10:30-12:00
    Includes additional slides, marked as skipped

    Slide 5 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Outline

    Introduction / Motivation
    Programming models on clusters of SMP nodes
    Case Studies / pure MPI vs hybrid MPI+OpenMP
    Hybrid programming & accelerators
    Practical How-To on hybrid programming
    Mismatch Problems
    Opportunities: Application categories that can benefit from hybrid parallelization
    Other options on clusters of SMP nodes
    Summary

    Slide 6 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Major Programming models on hybrid systems

    [Figure: pure MPI vs. hybrid MPI+OpenMP. Pure MPI: a sequential program with local data on each core, explicit message passing by calling MPI_Send & MPI_Recv over the node interconnect. Hybrid: OpenMP (shared data) inside of the SMP nodes, MPI between the nodes via the node interconnect; code pattern: some_serial_code; #pragma omp parallel for; for (j=...; ...; j++) block_to_be_parallelized; again_some_serial_code; master thread active, other threads sleeping outside the parallel regions]

    Pure MPI (one MPI process on each core)
    Hybrid: MPI + OpenMP (shared memory: OpenMP; distributed memory: MPI)
    Hybrid: MPI message passing + MPI-3.0 shared memory programming
    Other: PGAS (UPC, Coarray Fortran, ...), possibly together with MPI

    Often hybrid programming (MPI+OpenMP) is slower than pure MPI. Why?

    Slide 7 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Parallel Programming Models on Hybrid Platforms

    pure MPI: one MPI process on each core

    hybrid MPI+OpenMP: MPI for inter-node communication, OpenMP inside of each SMP node
        Masteronly: MPI only outside of parallel regions of the numerical application code (no overlap of communication and computation)
        Overlapping communication and computation: MPI communication by one or a few threads while other threads are computing

    Hybrid MPI+MPI: MPI for inter-node communication + MPI-3.0 shared memory programming
        Within shared memory nodes: halo updates through direct data copy, or
        no halo updates, direct access to neighbor data

    OpenMP only: distributed virtual shared memory

    Slide 8 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Pure MPI

    Advantages:
        No modifications of existing MPI codes
        The MPI library need not support multiple threads

    Major problems:
        Does the MPI library internally use different protocols?
            Shared memory inside of the SMP nodes
            Network communication between the nodes
        Does the application topology fit the hardware topology?
        Unnecessary MPI communication inside of the SMP nodes!

    pure MPI: one MPI process on each core

    Discussed in detail later on in the section Mismatch Problems

    Slide 9 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Hybrid MPI+OpenMP Masteronly Style

    Advantages:
        No message passing inside of the SMP nodes
        No topology problem

    for (iteration ...) {
      #pragma omp parallel
        numerical code
      /* end omp parallel */

      /* on master thread only */
      MPI_Send (original data to halo areas in other SMP nodes)
      MPI_Recv (halo data from the neighbors)
    } /* end for loop */

    Masteronly: MPI only outside of parallel regions

    Major problems:
        All other threads are sleeping while the master thread communicates!
        Which inter-node bandwidth is achieved by a single communicating thread?
        The MPI library must support at least MPI_THREAD_FUNNELED
            (see section "Thread-safety quality of MPI libraries")
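    To make the masteronly style concrete, here is a minimal C sketch of one possible implementation (not from the original slides); the names u_new, u_old, n, left, and right are illustrative assumptions for a 1-D halo exchange:

    #include <mpi.h>

    /* Masteronly sketch: OpenMP-parallel compute phase, then an MPI halo
       exchange performed only outside the parallel region, i.e., by the
       master thread (MPI_THREAD_FUNNELED is sufficient). Interior points
       are 1..n-2, halo cells are u_new[0] and u_new[n-1]. */
    void masteronly_iteration(double *u_new, const double *u_old, int n,
                              int left, int right, MPI_Comm comm)
    {
        int i;
        #pragma omp parallel for
        for (i = 1; i < n - 1; i++)            /* numerical code on all threads */
            u_new[i] = 0.5 * (u_old[i - 1] + u_old[i + 1]);

        /* implicit barrier ends the parallel region; master thread communicates */
        MPI_Sendrecv(&u_new[1],     1, MPI_DOUBLE, left,  0,
                     &u_new[n - 1], 1, MPI_DOUBLE, right, 0,
                     comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u_new[n - 2], 1, MPI_DOUBLE, right, 1,
                     &u_new[0],     1, MPI_DOUBLE, left,  1,
                     comm, MPI_STATUS_IGNORE);
    }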

    Slide 10 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    MPI rules with OpenMP / automatic SMP-parallelization

    Special MPI-2 init for multi-threaded MPI processes:

    REQUIRED values (increasing order):
        MPI_THREAD_SINGLE: only one thread will execute
        THREAD_MASTERONLY (virtual value, not part of the standard): MPI processes may be multi-threaded, but only the master thread will make MPI calls, AND only while the other threads are sleeping
        MPI_THREAD_FUNNELED: only the master thread will make MPI calls
        MPI_THREAD_SERIALIZED: multiple threads may make MPI calls, but only one at a time
        MPI_THREAD_MULTIPLE: multiple threads may call MPI, with no restrictions

    The returned provided level may be less than the REQUIRED level requested by the application.

    int MPI_Init_thread(int *argc, char ***argv,
                        int thread_level_required,
                        int *thread_level_provided);
    int MPI_Query_thread(int *thread_level_provided);
    int MPI_Is_thread_main(int *flag);
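    As a usage illustration (not from the original slides), a minimal sketch of requesting MPI_THREAD_FUNNELED and checking what the library actually provides:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int required = MPI_THREAD_FUNNELED, provided;

        MPI_Init_thread(&argc, &argv, required, &provided);
        if (provided < required) {
            fprintf(stderr, "MPI library provides only thread level %d\n", provided);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        /* ... hybrid MPI+OpenMP work ... */
        MPI_Finalize();
        return 0;
    }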

    Slide 11 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Calling MPI inside of OMP MASTER

    Inside of a parallel region, with OMP MASTER

    Requires MPI_THREAD_FUNNELED, i.e., only the master thread will make MPI calls.

    Caution: There isn't any synchronization with OMP MASTER! Therefore, an OMP BARRIER is normally necessary to guarantee that data or buffer space from/for other threads is available before/after the MPI call!

    Fortran:
        !$OMP BARRIER
        !$OMP MASTER
            call MPI_Xxx(...)
        !$OMP END MASTER
        !$OMP BARRIER

    C:
        #pragma omp barrier
        #pragma omp master
            MPI_Xxx(...);
        #pragma omp barrier

    But this implies that all other threads are sleeping! The additional barrier also implies the necessary cache flush!

    Slide 12 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Why the barrier is necessary: example with MPI_Recv

    Fortran:
        !$OMP PARALLEL
        !$OMP DO
          do i=1,1000
            a(i) = buf(i)
          end do
        !$OMP END DO NOWAIT
        !$OMP BARRIER
        !$OMP MASTER
          call MPI_RECV(buf,...)
        !$OMP END MASTER
        !$OMP BARRIER
        !$OMP DO
          do i=1,1000
            c(i) = buf(i)
          end do
        !$OMP END DO NOWAIT
        !$OMP END PARALLEL

    C (the C column of the original slide is truncated here; a minimal reconstruction of the analogous code):
        #pragma omp parallel
        {
          #pragma omp for nowait
          for (i=0; i<1000; i++)
            a[i] = buf[i];
          #pragma omp barrier
          #pragma omp master
          MPI_Recv(buf,...);
          #pragma omp barrier
          #pragma omp for nowait
          for (i=0; i<1000; i++)
            c[i] = buf[i];
        }

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Example: Thread support within Open MPI

    In order to enable thread support in Open MPI, configure with:
        configure --enable-mpi-threads
    This turns on:
        Support for full MPI_THREAD_MULTIPLE
        Internal checks when run with threads (--enable-debug)

        configure --enable-mpi-threads --enable-progress-threads
    This (additionally) turns on:
        Progress threads to asynchronously transfer/receive data per network BTL

    Additional feature:
        Compiling with debugging support, but without threads, will check for recursive locking

    Courtesy of Rainer Keller, HLRS and ORNL

    Slide 14 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Overlapping communication and computation

    if (my_thread_rank < ) {
        MPI_Send/Recv ...
        i.e., communicate all halo data
    } else {
        Execute those parts of the application
        that do not need halo data
        (on non-communicating threads)
    }
    Execute those parts of the application
    that need halo data
    (on all threads)

    Overlapping communication and computation: MPI communication by one or a few threads while other threads are computing (a sketch follows below).
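    A minimal C sketch of this pattern (illustrative, not from the original slides), assuming MPI_THREAD_FUNNELED and a 1-D array u with halo cells u[0] and u[n-1]; the inner update is a placeholder:

    #include <mpi.h>
    #include <omp.h>

    /* Thread 0 (the master thread) exchanges halos while the remaining
       threads update the inner points, which need no halo data. */
    void overlapped_iteration(double *u, int n, int left, int right, MPI_Comm comm)
    {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            int nth = omp_get_num_threads();

            if (tid == 0) {
                /* communicating (master) thread: MPI_THREAD_FUNNELED suffices */
                MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  0,
                             &u[n - 1], 1, MPI_DOUBLE, right, 0,
                             comm, MPI_STATUS_IGNORE);
                MPI_Sendrecv(&u[n - 2], 1, MPI_DOUBLE, right, 1,
                             &u[0],     1, MPI_DOUBLE, left,  1,
                             comm, MPI_STATUS_IGNORE);
            } else {
                /* non-communicating threads: placeholder inner work on points
                   2..n-3, split cyclically among threads 1..nth-1 */
                int i;
                for (i = 2 + (tid - 1); i < n - 2; i += nth - 1)
                    u[i] = 0.99 * u[i];
            }
            #pragma omp barrier
            /* halo cells are now valid: update boundary-dependent points */
            #pragma omp single
            {
                u[1]     = 0.5 * (u[0]     + u[2]);
                u[n - 2] = 0.5 * (u[n - 3] + u[n - 1]);
            }
        }
    }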

    Slide 15 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Hybrid MPI + MPI-3 shared memory

    Advantages:
        No message passing inside of the SMP nodes
        Using only one parallel programming standard
        No OpenMP problems (e.g., thread-safety isn't an issue)

    Major problems:
        The communicator must be split into shared memory islands
        To minimize shared memory communication overhead: halos (or the data accessed by the neighbors) must be stored in MPI shared memory windows
        Same work-sharing as with pure MPI

    Hybrid MPI+MPI: MPI for inter-node communication + MPI-3.0 shared memory programming

    Slide 16 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    MPI-3 shared memory

    Split the main communicator into shared memory islands: MPI_Comm_split_type
    Define a shared memory window on each island: MPI_Win_allocate_shared
    Result (by default): a contiguous array, directly accessible by all processes of the island
    Accesses and synchronization:
        Normal assignments and expressions
        No MPI_PUT/GET!
        Normal MPI one-sided synchronization, e.g., MPI_WIN_FENCE

    Hybrid MPI+MPI: MPI for inter-node communication + MPI-3.0 shared memory programming

    Slide 17 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Splitting the communicator & contiguous shared memory allocation

    Hybrid Parallel Programming


    MPI_Aint /*IN*/  local_window_count;
    double   /*OUT*/ *base_ptr;
    MPI_Comm comm_all, comm_sm;
    int my_rank_all, my_rank_sm, size_sm, disp_unit;

    MPI_Comm_rank (comm_all, &my_rank_all);
    MPI_Comm_split_type (comm_all, MPI_COMM_TYPE_SHARED, 0,
                         MPI_INFO_NULL, &comm_sm);
    MPI_Comm_rank (comm_sm, &my_rank_sm);
    MPI_Comm_size (comm_sm, &size_sm);
    disp_unit = sizeof(double);   /* shared memory should contain doubles */
    MPI_Win_allocate_shared (local_window_count*disp_unit, disp_unit, MPI_INFO_NULL,
                             comm_sm, &base_ptr, &win_sm);

    [Figure: 16 MPI processes, my_rank_all = 0..15 in comm_all, split into four SMP-node sub-communicators comm_sm with my_rank_sm = 0..3 each; each process contributes local_window_count doubles, forming one contiguous shared memory window per SMP node starting at base_ptr; the sequence in comm_sm is the same as in comm_all]

    In Fortran, MPI-3.0 page 341, Examples 8.1 (and 8.2) show how to convert buf_ptr into a usable array a. This mapping is based on a sequential ranking of the SMP nodes in comm_all.

    Slide 18 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Within each SMP node: essentials

    The allocated shared memory is contiguous across process ranks, i.e., the first byte of rank i starts right after the last byte of rank i-1.
    Processes can calculate remote address offsets with local information only.
    Remote accesses go through load/store operations, i.e., without MPI RMA operations (MPI_GET/PUT, ...).
    Although each process in comm_sm accesses the same physical memory, the virtual start address of the whole array may be different in each process!
        Therefore: linked lists only with offsets in a shared array, but not with binary pointer addresses! (see the sketch below)

    The following slides show only the shared memory accesses, i.e., communication between the SMP nodes is not presented.
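    A small C sketch of the offset-based linking mentioned above (illustrative, not from the original slides): nodes in the shared window reference each other by array index instead of by virtual address, so every process can follow the links regardless of where the window is mapped.

    #include <mpi.h>

    /* Node stored inside the MPI shared memory window: "next" is an index
       into the same window, not a pointer, because the window's virtual
       base address may differ between processes. */
    typedef struct {
        double value;
        MPI_Aint next;          /* index of next node, or -1 for end of list */
    } shm_node;

    /* Walk a list starting at index 'head' inside the window that this
       process sees at 'base' (e.g., from MPI_Win_allocate_shared or
       MPI_Win_shared_query). */
    double sum_list(const shm_node *base, MPI_Aint head)
    {
        double s = 0.0;
        for (MPI_Aint i = head; i != -1; i = base[i].next)
            s += base[i].value;
        return s;
    }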

    Hybrid Parallel Programming


    Slide 19 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Shared memory access example

    Hybrid Parallel Programming


    MPI_Aint /*IN*/  local_window_count;
    double   /*OUT*/ *base_ptr;

    MPI_Win_allocate_shared (local_window_count*disp_unit, disp_unit, MPI_INFO_NULL,
                             comm_sm, &base_ptr, &win_sm);

    MPI_Win_fence (0, win_sm);   /* local store epoch can start */
    for (i=0; i<local_window_count; i++)  base_ptr[i] = ...   /* local stores */
    MPI_Win_fence (0, win_sm);   /* local stores are finished, remote load epoch can start */
    if (my_rank_sm > 0)  printf("left neighbor's rightmost value = %lf \n", base_ptr[-1]);
    if (my_rank_sm < size_sm-1)  printf("right neighbor's leftmost value = %lf \n",
                                        base_ptr[local_window_count]);

    [Figure: as before, 16 processes split into SMP-node sub-communicators; each process performs local stores to its portion of the contiguous shared memory window, synchronizes, and then has direct load access to the neighboring processes' window portions]

    In Fortran, before and after the synchronization, one must add CALL MPI_F_SYNC_REG (buffer) to guarantee that register copies of buffer are written back to memory, or read again from memory, respectively.

    Slide 20 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Establish comm_sm, comm_nodes, comm_all, if SMPs are not contiguous within comm_orig

    Hybrid Parallel Programming


    [Figure: my_rank_all = 0..15 in comm_all; sub-communicator comm_sm for one SMP node with my_rank_sm = 0..3]

    MPI_Comm_split_type (comm_orig, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &comm_sm);

    MPI_Comm_size (comm_sm, &size_sm); MPI_Comm_rank (comm_sm, &my_rank_sm);

    MPI_Comm_split (comm_orig, my_rank_sm, 0, &comm_nodes);

    MPI_Comm_size (comm_nodes, &size_nodes);

    if (my_rank_sm==0) {

    MPI_Comm_rank (comm_nodes, &my_rank_nodes);

    MPI_Exscan (&size_sm, &my_rank_all, 1, MPI_INT, MPI_SUM, comm_nodes);

    if (my_rank_nodes == 0) my_rank_all = 0;

    }

    MPI_Comm_free (&comm_nodes);

    MPI_Bcast (&my_rank_nodes, 1, MPI_INT, 0, comm_sm);

    MPI_Comm_split (comm_orig, my_rank_sm, my_rank_nodes, &comm_nodes);

    MPI_Bcast (&my_rank_all, 1, MPI_INT, 0, comm_sm); my_rank_all = my_rank_all + my_rank_sm;

    MPI_Comm_split (comm_orig, /*color*/ 0, my_rank_all, &comm_all);

    Annotations to the code above:
        MPI_Comm_split_type establishes a communicator comm_sm with ranks my_rank_sm on each SMP node.
        The first MPI_Comm_split creates comm_nodes, combining all processes with the same my_rank_sm into a separate communicator; on processes with my_rank_sm > 0 this comm_nodes is unused, because the node numbering within these comm_nodes may differ.
        MPI_Exscan does not return a value on the first rank; therefore my_rank_all is set to 0 there.
        The broadcasts expand the node numbering (my_rank_nodes) from the comm_nodes with my_rank_sm == 0 to all new node-to-node communicators comm_nodes, and distribute the node offset for my_rank_all.
        The final MPI_Comm_split calculates my_rank_all and establishes the global communicator comm_all with sequential SMP subsets.

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Alternative: Non-contiguous shared memory

    Using the info key "alloc_shared_noncontig", the MPI library can put the processes' window portions
        on page boundaries (internally, e.g., only one OS shared memory segment with some unused padding zones), or
        into the local ccNUMA memory domain, plus page boundaries (internally, e.g., each window portion is one OS shared memory segment).
    Pros: Faster local data accesses, especially on ccNUMA nodes.
    Cons: Higher programming effort for neighbor accesses: MPI_WIN_SHARED_QUERY.

    Hybrid Parallel Programming

    Further reading: Torsten Hoefler, James Dinan, Darius Buntinas, Pavan Balaji, Brian Barrett, Ron Brightwell, William Gropp, Vivek Kale, Rajeev Thakur: MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory. http://link.springer.com/content/pdf/10.1007%2Fs00607-013-0324-2.pdf


    NUMA effects? Significant impact of alloc_shared_noncontig.

    Slide 22 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Non-contiguous shared memory allocation

    Hybrid Parallel Programming


    MPI_Aint /*IN*/  local_window_count;
    double   /*OUT*/ *base_ptr;

    disp_unit = sizeof(double);   /* shared memory should contain doubles */
    MPI_Info info_noncontig;
    MPI_Info_create (&info_noncontig);
    MPI_Info_set (info_noncontig, "alloc_shared_noncontig", "true");
    MPI_Win_allocate_shared (local_window_count*disp_unit, disp_unit, info_noncontig,
                             comm_sm, &base_ptr, &win_sm);

    [Figure: each process's local_window_count doubles now form a separate, non-contiguous window portion within the SMP node, each starting at its own base_ptr]

    Slide 23 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Non-contiguous shared memory: neighbor access through MPI_WIN_SHARED_QUERY

    Each process can retrieve each neighbor's base_ptr with calls to MPI_WIN_SHARED_QUERY.
    Example: only pointers to the window memory of the left & right neighbor.

    Hybrid Parallel Programming


    if (my_rank_sm > 0)
        MPI_Win_shared_query (win_sm, my_rank_sm - 1,
                              &win_size_left, &disp_unit_left, &base_ptr_left);
    if (my_rank_sm < size_sm-1)
        MPI_Win_shared_query (win_sm, my_rank_sm + 1,
                              &win_size_right, &disp_unit_right, &base_ptr_right);

    MPI_Win_fence (0, win_sm);   /* local stores are finished, remote load epoch can start */

    if (my_rank_sm > 0)  printf("left neighbor's rightmost value = %lf \n",
                                base_ptr_left[ win_size_left/disp_unit_left - 1 ]);
    if (my_rank_sm < size_sm-1)  printf("right neighbor's leftmost value = %lf \n",
                                        base_ptr_right[ 0 ]);

    Thanks to Steffen Weise (TU Freiberg) for testing and correcting the example codes.

    Slide 24 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Other technical aspects with MPI_WIN_ALLOCATE_SHARED

    Caution: On some systems, the number of shared memory windows and the total size of shared memory windows may be limited. Some OS systems may provide options, e.g., at job launch or MPI process start, to enlarge restricting defaults.

    If MPI shared memory support is based on POSIX shared memory:
        Shared memory windows are located in memory-mapped /dev/shm
        Default: 25% or 50% of the physical memory, but a maximum of ~2043 windows!
        Root may change the size with: mount -o remount,size=6G /dev/shm
    Cray XT/XE/XC (XPMEM): no limits.
    On a system without virtual memory (like CNK on BG/Q), you have to reserve a chunk of address space when the node is booted (default is 64 MB).

    Thanks to Jeff Hammond and Jed Brown (ANL), Brian W Barrett (SANDIA), and Steffen Weise (TU Freiberg), for input and discussion.

    Hybrid Parallel Programming


    Another restriction in a low-quality MPI: MPI_COMM_SPLIT_TYPE may always return MPI_COMM_SELF (e.g., due to the default limit of context IDs in mpich).

    Slide 25 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Splitting the communicator without MPI_COMM_SPLIT_TYPE

    Alternative, if you want to group based on a fixed number size_sm of shared memory cores in comm_all:
        Based on sequential ranks in comm_all
        Pro: comm_sm can be restricted to ccNUMA locality domains
        Con: MPI does not guarantee MPI_WIN_ALLOCATE_SHARED() on a whole SMP node
             (MPI_COMM_SPLIT_TYPE() may return MPI_COMM_SELF or a partial SMP node)

    Hybrid Parallel Programming


    [Figure: comm_all ranks 0..15 grouped into consecutive blocks of size_sm = 4 ranks, each block forming one comm_sm]

    MPI_Comm_rank (comm_all, &my_rank);
    MPI_Comm_group (comm_all, &group_all);
    ranges[0][0] = (my_rank / size_sm) * size_sm;
    ranges[0][1] = ranges[0][0] + size_sm - 1;    /* e.g., ranges[0][] = {4,7,1} */
    ranges[0][2] = 1;
    MPI_Group_range_incl (group_all, 1, ranges, &group_sm);
    MPI_Comm_create (comm_all, group_sm, &comm_sm);
    MPI_Win_allocate_shared (...);

    To guarantee shared memory, one may add an additional
    MPI_Comm_split_type (comm_sm, MPI_COMM_TYPE_SHARED, 0,
                         MPI_INFO_NULL, &comm_sm_really);

    Slide 26 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Pure OpenMP (on the cluster)

    A distributed virtual shared memory system is needed. It must support clusters of SMP nodes, e.g.:
        Shared memory parallelization inside of SMP nodes
        Communication of modified parts of pages at OpenMP flush (part of each OpenMP barrier)

    OpenMP only: distributed virtual shared memory

    i.e., the OpenMP memory and parallelization model is prepared for clusters!

    Rule of thumb: communication may be 10 times slower than with MPI.

    Slide 27 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Outline
        Introduction / Motivation
        Programming models on clusters of SMP nodes
        Case Studies / pure MPI vs hybrid MPI+OpenMP
            The Multi-Zone NAS Parallel Benchmarks
            For each application we discuss:
                Benchmark implementations based on different strategies and programming paradigms
                Performance results and analysis on different hardware architectures
            Compilation and Execution Summary
            Gabriele Jost (Supersmith, Maximum Performance Software)
        Hybrid programming & accelerators
        Practical How-To on hybrid programming
        Mismatch Problems
        Opportunities: Application categories that can benefit from hybrid parallelization
        Other options on clusters of SMP nodes
        Summary

    Slide 28 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    The Multi-Zone NAS Parallel Benchmarks

    Hybrid Parallel Programming

    [Figure: time-stepping structure of the multi-zone benchmarks: sequential time step, boundary exchange (MPI send/recv or OpenMP data copy + sync), inter-zone parallelization (MPI processes or MLP processes), intra-zone parallelization with OpenMP; variants: sequential, MPI/OpenMP, nested OpenMP/MLP]

    Multi-zone versions of the NAS Parallel Benchmarks LU, SP, and BT
    Two hybrid sample implementations
    Load balance heuristics are part of the sample codes
    www.nas.nasa.gov/Resources/Software/software.html

    Slide 29 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    MPI/OpenMP BT-MZ

        call omp_set_numthreads (weight)
        do step = 1, itmax
          call exch_qbc(u, qbc, nx, ...)      ! calls mpi_send/recv
          do zone = 1, num_zones
            if (iam .eq. pzone_id(zone)) then
              call zsolve(u, rsd, ...)
            end if
          end do
        end do
        ...

        subroutine zsolve(u, rsd, ...)
        ...
        !$OMP PARALLEL DEFAULT(SHARED)
        !$OMP& PRIVATE(m,i,j,k...)
        do k = 2, nz-1
        !$OMP DO
          do j = 2, ny-1
            do i = 2, nx-1
              do m = 1, 5
                u(m,i,j,k) = dt*rsd(m,i,j,k-1)
              end do
            end do
          end do
        !$OMP END DO NOWAIT
        end do
        ...
        !$OMP END PARALLEL

    Slide 30 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    MPI/OpenMP LU-MZ

        call omp_set_numthreads (weight)
        do step = 1, itmax
          call exch_qbc(u, qbc, nx, ...)
          do zone = 1, num_zones
            if (iam .eq. pzone_id(zone)) then
              call ssor
            end if
          end do
        end do
        ...

    Hybrid Parallel Programming

    Slide 31 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Pipelined Thread Execution in SSOR

        subroutine ssor
        !$OMP PARALLEL DEFAULT(SHARED)
        !$OMP& PRIVATE(m,i,j,k...)
        call sync1 ()
        do k = 2, nz-1
        !$OMP DO
          do j = 2, ny-1
            do i = 2, nx-1
              do m = 1, 5
                rsd(m,i,j,k) = dt*rsd(m,i,j,k-1) + ...
              end do
            end do
          end do
        !$OMP END DO nowait
        end do
        call sync2 ()
        ...
        !$OMP END PARALLEL
        ...

        subroutine sync1
          neigh = iam - 1
          do while (isync(neigh) .eq. 0)
        !$OMP FLUSH(isync)
          end do
          isync(neigh) = 0
        !$OMP FLUSH(isync)

        subroutine sync2
          neigh = iam - 1
          do while (isync(neigh) .eq. 1)
        !$OMP FLUSH(isync)
          end do
          isync(neigh) = 1
        !$OMP FLUSH(isync)

    Slide 32 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Golden Rule for ccNUMA: First touch

        c---------------------------------------------------------------------
        c do one time step to touch all data
        c---------------------------------------------------------------------
              do iz = 1, proc_num_zones
                zone = proc_zone_id(iz)
                call adi(rho_i(start1(iz)), us(start1(iz)),
         $               vs(start1(iz)), ws(start1(iz)), ...)
              end do
              do iz = 1, proc_num_zones
                zone = proc_zone_id(iz)
                call initialize(u(start5(iz)), ...)
              end do

    Hybrid Parallel Programming

    A memory page gets mapped into the local memory of the processor that first touches it!

    "touch" means "write", not "allocate"

    All benchmarks use first touch policy to achieve good memory placement!

    Slide 33 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Benchmark Characteristics

    Aggregate sizes:
        Class D: 1632 x 1216 x 34 grid points
        Class E: 4224 x 3456 x 92 grid points

    BT-MZ (Block Tridiagonal simulated CFD application):
        Alternating Direction Implicit (ADI) method
        #Zones: 1024 (D), 4096 (E)
        Size of the zones varies widely: large/small ratio about 20
            requires multi-level parallelism to achieve a good load balance
        Expectation: pure MPI has load-balancing problems; good candidate for MPI+OpenMP

    LU-MZ (LU decomposition simulated CFD application):
        SSOR method (2D pipelined method)
        #Zones: 16 (all classes)
        Size of the zones identical: no load balancing required, but limited parallelism on the outer level
        Expectation: limited MPI parallelism; MPI+OpenMP increases parallelism

    SP-MZ (Scalar Pentadiagonal simulated CFD application):
        #Zones: 1024 (D), 4096 (E)
        Size of the zones identical: no load balancing required
        Expectation: load-balanced on the MPI level; pure MPI should perform best

    Slide 34 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid code on cc-NUMA architectures

    OpenMP:
        Affinity support only within each MPI process
        Version 3.1 has support for binding of threads via the OMP_PROC_BIND environment variable
        Under consideration for Version 4.0: OMP_PROC_SET or OMP_LIST to restrict execution to a subset of the machine; OMP_AFFINITY to influence how the threads are distributed and bound on the machine
        Version 4.0 announced at SC12
    MPI:
        Initially not designed for NUMA architectures or for mixing threads and processes; MPI-2 supports threads in MPI
        The API does not provide support for memory/thread placement
    Vendor-specific APIs to control thread and memory placement:
        Environment variables
        System commands like numactl, taskset, dplace, omplace, etc.
        http://www.halobates.de/numaapi3.pdf
        More in the How-To section

    Hybrid Parallel Programming

    Slide 35 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Dell Linux Cluster Lonestar

    Hybrid Parallel Programming

    Located at the Texas Advanced Computing Center (TACC), University of Texas at Austin (http://www.tacc.utexas.edu)
        1888 nodes, 2 Intel Xeon 6-core 64-bit Westmere processors per node, 3.33 GHz, 24 GB memory per node
        Peak performance 160 Gflops per node
        3 channels from each processor's memory controller to 3 DDR3 ECC DIMMs, 1333 MHz
        Processor interconnect: QPI, 6.4 GT/s
        Node interconnect: InfiniBand Mellanox switches, fat-tree topology, 40 Gbit/s point-to-point bandwidth
        More details: http://www.tacc.utexas.edu/user-services/user-guides/lonestar-user-guide

    Compiling the benchmarks:
        ifort 11.1, options: -O3 -ipo -openmp -mcmodel=medium
    Running the benchmarks:
        MVAPICH2
        setenv OMP_NUM_THREADS ...
        ibrun tacc_affinity ./bt-mz.x

    Slide 36 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    NUMA Control (numactl): example run script

    Hybrid Parallel Programming

        #!/bin/csh
        #$ -cwd
        #$ -j y
        #$ -q systest
        #$ -pe 12way 24
        #$ -V
        #$ -l h_rt=00:10:00
        setenv OMP_NUM_THREADS 1
        setenv MY_NSLOTS 16
        ibrun tacc_affinity ./bin/sp-mz.D.

    Notes:
        -pe 12way 24: run 12 MPI processes per node, allocate 24 cores (2 nodes) altogether
        OMP_NUM_THREADS 1: 1 thread per MPI process
        MY_NSLOTS 16: only use 16 of the 24 cores for MPI. NOTE: 8 cores unused!
        tacc_affinity: numactl script for process/thread placement
        ibrun: command to run the MPI job

    Slide 37 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    NUMA Operations

    Hybrid Parallel Programming

    Type              Command / option       Arguments                    Description
    Socket affinity   numactl -c             {0,1,2,3}                    Only execute the process on cores of this (these) socket(s).
    Memory policy     numactl -l             (no argument)                Allocate on the current socket.
    Memory policy     numactl -i             {0,1,2,3}                    Allocate round robin (interleave) on these sockets.
    Memory policy     numactl --preferred=   {0,1,2,3} (select only one)  Allocate on this socket; fall back to any other if full.
    Memory policy     numactl -m             {0,1,2,3}                    Only allocate on this (these) socket(s).
    Core affinity     numactl -C             {0,1,2,...,15}               Only execute the process on this (these) core(s).
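    For example (illustrative, combining the options listed above exactly as the tutorial's own pinning scripts on the following slides do), binding both execution and memory allocation of a process to socket 0:

        numactl -c 0 -m 0 ./a.out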

    Slide 38 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Example numactl script

    Hybrid Parallel Programming

        myway=`echo $PE | sed s/way//`
        export MV2_USE_AFFINITY=0
        export MV2_ENABLE_AFFINITY=0
        my_rank=$PMI_RANK
        local_rank=$(( my_rank % myway ))
        if [ $myway -eq 12 ]; then
          numnode=$(( local_rank / 6 ))
        fi
        exec numactl -c $numnode -m $numnode $*

    Slide 39 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Dell Linux Cluster Lonestar Topology

    Hybrid Parallel Programming

    Slide 40 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Dell Linux Cluster Lonestar Topology

    Hybrid Parallel Programming

        CPU type: Intel Core Westmere processor
        *************************************************
        Hardware Thread Topology
        *************************************************
        Sockets:          2
        Cores per socket: 6
        Threads per core: 1
        -------------------------------------------------
        Socket 0: ( 1 3 5 7 9 11 )
        Socket 1: ( 0 2 4 6 8 10 )
        -------------------------------------------------

    Careful! The numbering scheme of cores is system dependent.

    Slide 41 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    [Chart: SP-MZ and BT-MZ performance with no idle cores on 64, 128, 256, 512, and 1024 nodes]

    BT-MZ improves using hybrid mode, as expected, due to better load balance.
    Unexpected: SP-MZ improves in some cases using hybrid mode.

    Slide 42 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Pitfall (1): Running 2 threads on the same core

    Hybrid Parallel Programming

    09/26/07, Author: Gabriele Jost

    Running NPB BT-MZ Class D, 128 MPI processes, 12 threads each, 1 MPI process per node (1way):
        Pinning A: exec numactl -c 0 -m 0 $*
            Only use cores and memory on socket 0: 12 threads on 6 cores!
        Pinning B: exec numactl -c 0,1 -m 0,1 $*
            Use cores and memory on both sockets.

    Slide 43 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Pitfall (2): Causing remote memory access

    Hybrid Parallel Programming

    09/26/07, Author: Gabriele Jost

    Running NPB BT-MZ Class D, 128 MPI processes, 6 threads each, 2 MPI processes per node:
        Pinning A:
            if [ $localrank == 0 ]; then
              exec numactl --physcpubind=0,1,2,3,4,5 -m 0 $*
            elif [ $localrank == 1 ]; then
              exec numactl --physcpubind=6,7,8,9,10,11 -m 1 $*
            fi
        Pinning B:
            if [ $localrank == 0 ]; then
              exec numactl --physcpubind=0,2,4,6,8,10 -m 0 $*
            elif [ $localrank == 1 ]; then
              exec numactl --physcpubind=1,3,5,7,9,11 -m 1 $*
            fi

    [Chart annotations: ~600 Gflops when half of the threads access remote memory vs. ~900 Gflops with only local memory access]

    Slide 44 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    LU-MZ Class D Scalability on Lonestar

    Hybrid Parallel Programming

    LU-MZ significantly benefits from hybrid mode:
        Pure MPI is limited to 16 cores, due to #zones = 16 (all other cores stay idle)
        The decrease of resource contention contributes a large part of the improvement

    [Chart: Gflops for LU-MZ Class D with different MPI x OpenMP combinations]

    Slide 45 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Cray XE6 Hermit

    Hybrid Parallel Programming

        CPU type: AMD Interlagos processor
        *************************************************************
        Hardware Thread Topology
        *************************************************************
        Sockets:          2
        Cores per socket: 16
        Threads per core: 1
        [likwid-topology ASCII output per socket: 16 cores with 16 kB L1 each, 2 MB L2 shared by pairs of cores, and two 6 MB L3 caches shared by 8 cores each]

    Located at HLRS Stuttgart, Germany (https://wickie.hlrs.de/platforms/index.php/Cray_XE6)
        3552 compute nodes, 113,664 cores
        Two AMD 6276 Interlagos processors with 16 cores each, running at 2.3 GHz (TurboCore 3.3 GHz) per node
        Around 1 Pflop theoretical peak performance
        32 GB of main memory available per node
        32-way shared memory system
        High-bandwidth interconnect using Cray Gemini communication chips

    Slide 46 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    32K cores

    16K cores8K

    cores

    Expected:BT-MZ benefits from hybrid approach:- high number of MPI processes yields bad workload distribution-Best MPIxOMP combination depends on problem sizeExpected:-Both benchmarks benefit by increasing parallelismUnexpected:SP-MZ improves when reducing number of MPI processesBT-MZ 1024x32 unexpected low performance

    Cray XE6 Hermit Scalability, continued

    Slide 47 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Cray XE6: CrayPat Performance Analysis

        module load xt-craypat
    Compilation:
        ftn -fastsse -r8 -mp[=trace] ...
    Instrumentation:
        pat_build -w -g mpi,omp bt.exe bt.exe.pat
    Execution:
        (export PAT_RT_HWPC={0,1,2,...})
        export OMP_NUM_THREADS=4
        aprun -n NPROCS -d 4 ./bt.exe.pat
    Generate report:
        pat_report -O load_balance,thread_times,program_time,mpi_callers -O profile_pe.th $1

    Hybrid Parallel Programming

    -d depth specifies the number of CPUs for each PE and its threads.

    Slide 48 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    BT-MZ 32x4 Function Profile

    Hybrid Parallel Programming

    Slide 49 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    BT-MZ Load-Balance 32x4 vs 128x1

    Hybrid Parallel Programming

    [Charts: CrayPat load-balance view for bt-mz-C.32x4 and bt-mz-C.128x1; the maximum, median, and minimum PE are shown]

    bt-mz.C.128x1 shows large imbalance in User and MPI time.
    bt-mz.C.32x4 shows well balanced times.

    Slide 50 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    IBM Power 6

    Results obtained by the courtesy of the HPCMO Program and the Engineer Research and Development Center Major Shared Resource Center, Vicksburg, MS (http://www.erdc.hpc.mil/index)

    The IBM Power 6 system is located at (http://www.navo.hpc.mil/davinci_about.html):
        150 compute nodes
        32 4.7 GHz Power6 cores per node (4800 cores total)
        64 GB of dedicated memory per node
        QLogic InfiniBand DDR interconnect
        IBM MPI: MPI 1.2 + MPI-IO
    Compilation:
        mpxlf_r -O4 -qarch=pwr6 -qtune=pwr6 -qsmp=omp
        (this flag was essential to achieve full compiler optimization in the presence of OMP directives!)
    Execution:
        poe launch $PBS_O_WORKDIR/sp.C.16x4.exe

    Hybrid Parallel Programming

    Slide 51 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    NPB-MZ Class D on IBM Power 6: Exploiting SMT for 2048-Core Results

    Hybrid Parallel Programming

    Results for 128 to 2048 cores; only 1024 cores were available for the experiments.
    BT-MZ and SP-MZ show benefit from Simultaneous Multithreading (SMT): 2048 threads on 1024 cores.

    [Chart: performance at 128, 256, 512, 1024, and (via SMT) 2048 cores; the best MPI x OpenMP combination of each category is marked]

    Doubling the number of threads through hyperthreading (SMT):
        #!/bin/csh
        #PBS -l select=32:ncpus=64:mpiprocs=NP:ompthreads=NT

    Slide 52 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    MPI+OpenMP memory usage of NPB-MZ

    Hybrid Parallel Programming

    At the same total number of cores, using more OpenMP threads reduces the memory usage substantially, up to five times on Hopper, Cray XT5 (eight-core nodes).

    Hongzhang Shan, Haoqiang Jin, Karl Fuerlinger, Alice Koniges, Nicholas J. Wright: Analyzing the Effect of Different Programming Models Upon Performance and Memory Usage on Cray XT5 Platforms. Proceedings, CUG 2010, Edinburgh, GB, May 24-27, 2010.

    Slide 53 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Conclusions:

    Hybrid Parallel Programming

    BT-MZ:
        Inherent workload imbalance on the MPI level
        #nprocs = #nzones yields poor performance
        #nprocs < #nzones gives better workload balance, but decreases parallelism
        Hybrid MPI/OpenMP yields better load balance and maintains the amount of parallelism
    SP-MZ:
        No workload imbalance on the MPI level; pure MPI should perform best
        MPI/OpenMP outperforms pure MPI on some platforms due to contention for network access within a node
    LU-MZ:
        Hybrid MPI/OpenMP increases the level of parallelism
    All benchmarks:
        Hybrid mode decreases network pressure and lowers memory requirements
        Good process/thread affinity is essential

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Outline

    Introduction / Motivation
    Programming models on clusters of SMP nodes
    Case Studies / pure MPI vs hybrid MPI+OpenMP
    Hybrid programming & accelerators
    Practical How-To on hybrid programming
        Georg Hager, Regionales Rechenzentrum Erlangen (RRZE)
    Mismatch Problems
    Application categories that can benefit from hybrid parallelization
    Other options on clusters of SMP nodes
    Summary

    Hybrid Parallel Programming

    Slide 55 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Programming How-To: Overview

    A practical introduction to hybrid programming:
        How to compile and link
        Getting a hybrid program to run on a cluster
    Running (hybrid) programs efficiently on multi-core clusters:
        Affinity issues
        ccNUMA
        Bandwidth bottlenecks
    MPI and OpenMP on real hardware: intra-node anisotropy
        MPI communication characteristics
        OpenMP loop startup overhead
    Thread/process binding

    Hybrid Parallel Programming

    Slide 56 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    How to compile, link and run

    Use the appropriate OpenMP compiler switch (-openmp, -fopenmp, -mp, -qsmp=openmp, ...) and the MPI compiler script (if available).

    Link with the MPI library:
        Usually wrapped in the MPI compiler script
        If required, specify linking against a thread-safe MPI library; often automatic when OpenMP or auto-parallelization is switched on

    Running the code:
        Highly non-portable! Consult the system docs (if available).
        If you are on your own, consider the following points:
            Make sure OMP_NUM_THREADS etc. is available on all MPI processes, e.g., start "env VAR=VALUE <binary>" instead of your binary alone
            Use Pete Wyckoff's mpiexec MPI launcher (see below): http://www.osc.edu/~pw/mpiexec
            Figure out how to start fewer MPI processes than cores on your nodes

    Hybrid Parallel Programming

    Slide 57 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Examples for compilation and execution

    Hybrid Parallel Programming

    Cray XE6 (4 NUMA domains w/ 8 cores each):
        ftn -h omp ...
        export OMP_NUM_THREADS=8
        aprun -n nprocs -N nprocs_per_node \
              -d $OMP_NUM_THREADS a.out

    Intel Sandy Bridge (8-core 2-socket) cluster, Intel MPI/OpenMP:
        mpiifort -openmp ...
        OMP_NUM_THREADS=8 mpirun -ppn 2 -np 4 \
            -env I_MPI_PIN_DOMAIN socket \
            -env KMP_AFFINITY scatter ./a.out

    Slide 58 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Interlude: Advantages of mpiexec or similar mechanisms

    The startup mechanism should use a resource manager interface to spawn MPI processes on nodes, as opposed to starting remote processes with ssh/rsh:
        Correct CPU time accounting in the batch system
        Faster startup
        Safe process termination
        Password-less user login between nodes not required
    It interfaces directly with the batch system to determine the number of processes.
    Provisions for starting fewer processes per node than available cores:
        Required for hybrid programming
        E.g., the -pernode and -npernode # options; no messing around with nodefiles required

    Hybrid Parallel Programming

    Slide 59 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Running the code: more examples (with mpiexec)

    Example for using mpiexec on a dual-socket quad-core cluster:
        $ export OMP_NUM_THREADS=8
        $ mpiexec -pernode ./a.out

    Same, but 2 MPI processes per node:
        $ export OMP_NUM_THREADS=4
        $ mpiexec -npernode 2 ./a.out

    Pure MPI:
        $ export OMP_NUM_THREADS=1   # or nothing if serial code
        $ mpiexec ./a.out

    Hybrid Parallel Programming

    Where do the threads run? See later!

    Slide 60 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Running the code efficiently?

    Symmetric, UMA-type compute nodes have become rare animals:
        NEC SX
        Intel 1-socket (Xeon 12XX): rare in cluster environments
        Hitachi SR8000, IBM SP2, single-core multi-socket Intel Xeon (all dead)
    Instead, systems have become non-isotropic on the node level:
        ccNUMA (AMD Opteron, SGI Altix, IBM Power7, Intel Sandy/Ivy Bridge)
        Multi-core, multi-socket
            Shared vs. separate caches
            Multi-chip vs. single-chip
            Separate/shared buses

    Hybrid Parallel Programming

    Slide 61 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Issues for running code efficiently on non-isotropic nodes

    ccNUMA locality effects:
        Penalties for access across locality domains
        Impact of contention
        Consequences of file I/O for page placement
        Placement of MPI buffers
    Multi-core / multi-socket anisotropy effects:
        Bandwidth bottlenecks, shared caches
        Intra-node MPI performance: core-core vs. socket-socket
        OpenMP loop overhead depends on the mutual position of threads in the team

    Hybrid Parallel Programming

    Slide 62 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    A short introduction to ccNUMA

    ccNUMA: the whole memory is transparently accessible by all processors, but it is physically distributed, with varying bandwidth and latency, and potential contention (shared memory paths).

    Hybrid Parallel Programming

    [Figure: two locality domains, each with several cores (C) attached to local memory (M)]

    Slide 63 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    How much bandwidth does non-local access cost?

    Example: AMD Magny Cours 4-socket system (8 chips, 4 sockets), STREAM Triad bandwidth measurements.

    Hybrid Parallel Programming

    [Figure: STREAM Triad bandwidth matrix across the 8 ccNUMA domains (0..7)]

    Slide 64 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    How much bandwidth does non-local access cost?

    Example: Intel Sandy Bridge 2-socket system (2 chips, 2 sockets), STREAM Triad bandwidth measurements.

    Hybrid Parallel Programming

    [Figure: STREAM Triad bandwidth for local vs. non-local access between the two ccNUMA domains (0, 1)]

    General rule: the more ccNUMA domains, the larger the non-local access penalty.

    Slide 65 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    ccNUMA Memory Locality Problems

    Locality of reference is key to scalable performance on ccNUMA; less of a problem with pure MPI, but see below.

    What factors can destroy locality?
    MPI programming:
        Processes lose their association with the CPU the mapping took place on originally
        The OS kernel tries to maintain strong affinity, but sometimes fails
    Shared memory programming (OpenMP, hybrid):
        Threads lose their association with the CPU the mapping took place on originally
        Improper initialization of distributed data
        Lots of extra threads are running on a node, especially for hybrid
    All cases:
        Other agents (e.g., the OS kernel) may fill memory with data that prevents optimal placement of user data

    Hybrid Parallel Programming

    Slide 66 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Avoiding locality problems

    How can we make sure that memory ends up where it is close to the CPU that uses it? See the following slides

    How can we make sure that it stays that way throughout program execution? See end of section

    Hybrid Parallel Programming

    Slide 67 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Solving Memory Locality Problems: First Touch

    "Golden Rule" of ccNUMA:A memory page gets mapped into the local memory of the processor that first touches it! Except if there is not enough local memory available this might be a problem, see later Some OSs allow to influence placement in more direct ways

    cf. libnuma (Linux), MPO (Solaris), Caveat: "touch" means "write", not "allocate" Example:

    double *huge = (double*)malloc(N*sizeof(double));// memory not mapped yetfor(i=0; i
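    Because the OpenMP worker threads, not only the master, will later access the data, the first touch should itself be done in parallel; a minimal C sketch (illustrative, assuming a static schedule that matches the later compute loop):

    #include <stdlib.h>

    /* Initialize (first-touch) the array with the same static OpenMP
       schedule that the compute loop will use, so each page is mapped
       into the memory domain of the thread that will later work on it. */
    double *alloc_and_first_touch(long n)
    {
        double *a = malloc(n * sizeof(double));   /* memory not mapped yet */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++)
            a[i] = 0.0;                           /* pages mapped here, distributed */
        return a;
    }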

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Most simple case: explicit initialization

    Hybrid Parallel Programming

    Without parallel initialization (all pages of A end up in one locality domain):
        integer,parameter :: N=10000000
        double precision A(N), B(N)
        A=0.d0
        !$OMP parallel do
        do i = 1, N
          B(i) = function ( A(i) )
        end do
        !$OMP end parallel do

    With parallel (first-touch) initialization:
        integer,parameter :: N=10000000
        double precision A(N), B(N)
        !$OMP parallel
        !$OMP do schedule(static)
        do i = 1, N
          A(i)=0.d0
        end do
        !$OMP end do
        ...
        !$OMP do schedule(static)
        do i = 1, N
          B(i) = function ( A(i) )
        end do
        !$OMP end do
        !$OMP end parallel

    Slide 69 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    ccNUMA problems beyond first touch

    The OS uses part of main memory for the disk buffer (FS) cache. If the FS cache fills part of memory, apps will probably allocate from foreign domains, i.e., non-local access!
    This is a locality problem even on hybrid and pure MPI with asymmetric file I/O, i.e., if not all MPI processes perform I/O.

    Remedies:
        Drop FS cache pages after the user job has run (admin's job); only prevents cross-job buffer cache heritage
        Sweeper code (run by the user)
        Flush the buffer cache after I/O if necessary (sync is not sufficient!)

    Hybrid Parallel Programming

    [Figure: two ccNUMA domains with cores P0..P3, caches, and memory interfaces; buffer cache (BC) in one domain displaces application data into the other domain]

    Slide 70 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    ccNUMA problems beyond first touch: buffer cache

    Real-world example: ccNUMA and the Linux buffer cache. Benchmark:
        1. Write a file of some size from LD0 to disk
        2. Perform a bandwidth benchmark using all cores in LD0 and the maximum memory installed in LD0

    Result: By default, the buffer cache is given priority over local page placement; restrict it to the local domain if possible!

    Hybrid Parallel Programming

    Cray: aprun -ss

    Slide 71 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Intra-node MPI characteristics: IMB Ping-Pong benchmark

    Code (to be run on 2 cores), on a Cray XE6 node:
        Intranode (1S): aprun -n 2 -cc 0,1  ./a.out
        Intranode (2S): aprun -n 2 -cc 0,16 ./a.out
        Internode:      aprun -n 2 -N 1     ./a.out

    Hybrid Parallel Programming

        wc = MPI_WTIME()
        do i=1,NREPEAT
          if(rank.eq.0) then
            call MPI_SEND(buffer,N,MPI_BYTE,1,0,MPI_COMM_WORLD,ierr)
            call MPI_RECV(buffer,N,MPI_BYTE,1,0,MPI_COMM_WORLD, &
                          status,ierr)
          else
            call MPI_RECV(...)
            call MPI_SEND(...)
          endif
        enddo
        wc = MPI_WTIME() - wc

    Slide 72 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    IMB Ping-Pong: Latency, Intra-node vs. Inter-node on Cray XE6

    [Chart: small-message latency in µs: internode 1.8, intranode 2S (different sockets) 0.56, intranode 1S (same socket) 0.3]

    Hybrid Parallel Programming

    Affinity matters!

    Slide 73 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    IMB Ping-Pong: Bandwidth Characteristics, Intra-node vs. Inter-node on Cray XE6

    Hybrid Parallel Programming

    [Chart: bandwidth vs. message size: between two cores of one socket, between two sockets of one node, and between two nodes via the interconnect]

    Slide 74 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    The throughput-parallel vector triad benchmark: microbenchmarking for architectural exploration

    Every core runs its own, independent triad benchmark: pure hardware probing, no impact from OpenMP overhead.

    Hybrid Parallel Programming

        double precision, dimension(:), allocatable :: A,B,C,D

        !$OMP PARALLEL private(i,j,A,B,C,D)
        allocate(A(1:N),B(1:N),C(1:N),D(1:N))
        A=1.d0; B=A; C=A; D=A
        do j=1,NITER
          do i=1,N
            A(i) = B(i) + C(i) * D(i)
          enddo
          if(.something.that.is.never.true.) then
            call dummy(A,B,C,D)
          endif
        enddo
        !$OMP END PARALLEL

    Slide 75 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Throughput vector triad on Sandy Bridge socket (3 GHz)

    Hybrid Parallel Programming

    [Chart: scalable bandwidth in the L1, L2, and L3 caches; saturation effect in memory]

    Slide 76 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    The OpenMP-parallel vector triad benchmark: visualizing OpenMP overhead

    OpenMP work sharing in the benchmark loop

    Hybrid Parallel Programming

        double precision, dimension(:), allocatable :: A,B,C,D

        allocate(A(1:N),B(1:N),C(1:N),D(1:N))
        A=1.d0; B=A; C=A; D=A
        !$OMP PARALLEL private(i,j)
        do j=1,NITER
        !$OMP DO
          do i=1,N
            A(i) = B(i) + C(i) * D(i)
          enddo
        !$OMP END DO          ! implicit barrier
          if(.something.that.is.never.true.) then
            call dummy(A,B,C,D)
          endif
        enddo
        !$OMP END PARALLEL

    Slide 77 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    OpenMP vector triad on Sandy Bridge socket (3 GHz)

    Hybrid Parallel Programming

    sync overhead grows with # of threads

    bandwidth scalability across memory interfaces

    Slide 78 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Thread synchronization overhead on Sandy Bridge-EP: direct measurement of barrier overhead in CPU cycles

    Hybrid Parallel Programming

        2 Threads       Intel 13.1.0   GCC 4.7.0   GCC 4.6.1
        Shared L3            384          5242        4616
        SMT threads         2509          3726        3399
        Other socket        1375          5959        4909

        Full domain     Intel 13.1.0   GCC 4.7.0   GCC 4.6.1
        Socket              1497         14546       14418
        Node                3401         34667       29788
        Node + SMT          6881         59038       58898

    gcc is still not very competitive with the Intel compiler.

    Slide 79 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Thread/Process Affinity ("Pinning")

    Highly OS-dependent system calls, but available on all systems:
        Linux: sched_setaffinity(), PLPA, hwloc
        Windows: SetThreadAffinityMask()
    Support for semi-automatic pinning in some compilers/environments:
        Intel compilers > V9.1 (KMP_AFFINITY environment variable)
        Pathscale
        Generic Linux: taskset, numactl, likwid-pin (see below)
        OpenMP 4.0: support for affinity
    Affinity awareness in MPI libraries:
        Cray MPI, OpenMPI, Intel MPI

    Hybrid Parallel Programming

    Slide 80 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    How do we figure out the topology?

    ... and how do we enforce the mapping without changing the code? Compilers and MPI libs may give you ways to do that, but LIKWID supports all sorts of combinations:

    LIKWID = "Like I Knew What I'm Doing"

    Open source tool collection (developed at RRZE): http://code.google.com/p/likwid

    Hybrid Parallel Programming

    Slide 81 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Likwid Tool Suite

    Command line tools for Linux:
        Works with a standard Linux >= 2.6 kernel
        Supports Intel and AMD CPUs
        Supports all compilers whose OpenMP implementation is based on pthreads

    Current tools:
        likwid-topology: print thread and cache topology (similar to lstopo from the hwloc package)
        likwid-pin: pin a threaded application without touching the code
        likwid-perfctr: measure performance counters
        likwid-perfscope: performance oscilloscope w/ real-time display
        likwid-powermeter: current power consumption of the chip (alpha stage)
        likwid-features: view and enable/disable hardware prefetchers
        likwid-bench: low-level bandwidth benchmark generator tool
        likwid-mpirun: mpirun wrapper script for easy LIKWID integration

    Hybrid Parallel Programming

    Slide 82 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    likwid-topology: topology information

    Based on cpuid information. Functionality:
        Measured clock frequency
        Thread topology
        Cache topology
        Cache parameters (-c command line switch)
        ASCII art output (-g command line switch)
    Currently supported:
        Intel Core 2 (45 nm + 65 nm)
        Intel Nehalem, Westmere, Sandy Bridge
        AMD Magny Cours, Interlagos
        Intel Xeon Phi in beta stage

    Hybrid Parallel Programming

    Slide 83 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Output of likwid-topology

    Hybrid Parallel Programming

CPU name:   Intel Core i7 processor
CPU clock:  2666683826 Hz
*************************************************************
Hardware Thread Topology
*************************************************************
Sockets:            2
Cores per socket:   4
Threads per core:   2
-------------------------------------------------------------
HWThread  Thread  Core  Socket
0         0       0     0
1         1       0     0
2         0       1     0
3         1       1     0
4         0       2     0
5         1       2     0
6         0       3     0
7         1       3     0
8         0       0     1
9         1       0     1
10        0       1     1
11        1       1     1
12        0       2     1
13        1       2     1
14        0       3     1
15        1       3     1
-------------------------------------------------------------

    Slide 84 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    likwid-topology continued

    and also try the ultra-cool -g option!

    Hybrid Parallel Programming

Socket 0: ( 0 1 2 3 4 5 6 7 )
Socket 1: ( 8 9 10 11 12 13 14 15 )
-------------------------------------------------------------
*************************************************************
Cache Topology
*************************************************************
Level:  1
Size:   32 kB
Cache groups:  ( 0 1 ) ( 2 3 ) ( 4 5 ) ( 6 7 ) ( 8 9 ) ( 10 11 ) ( 12 13 ) ( 14 15 )
-------------------------------------------------------------
Level:  2
Size:   256 kB
Cache groups:  ( 0 1 ) ( 2 3 ) ( 4 5 ) ( 6 7 ) ( 8 9 ) ( 10 11 ) ( 12 13 ) ( 14 15 )
-------------------------------------------------------------
Level:  3
Size:   8 MB
Cache groups:  ( 0 1 2 3 4 5 6 7 ) ( 8 9 10 11 12 13 14 15 )
-------------------------------------------------------------

    Slide 85 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

likwid-pin

  Inspired by and based on ptoverride (Michael Meier, RRZE) and taskset
  Pins processes and threads to specific cores without touching code
  Directly supports pthreads, gcc OpenMP, Intel OpenMP
  Allows the user to specify a skip mask (i.e., supports many different compiler/MPI combinations)
  Can also be used as a replacement for taskset
  Uses logical (contiguous) core numbering when running inside a restricted set of cores
  Supports logical core numbering inside node, socket, core

Usage examples:
  env OMP_NUM_THREADS=6 likwid-pin -c 0,1,2,4-6 ./myApp parameters
  env OMP_NUM_THREADS=6 likwid-pin -c S0:0-2@S1:0-2 ./myApp

    Hybrid Parallel ProgrammingSlide 86 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

Example: STREAM benchmark on 12-core Intel Westmere: anarchy vs. thread pinning

(Charts: bandwidth without pinning vs. with pinning round-robin across sockets, physical cores first.)

    Slide 87 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

likwid-pin example (Intel OpenMP): running the STREAM benchmark with likwid-pin:

    Hybrid Parallel Programming

$ export OMP_NUM_THREADS=4
$ likwid-pin -c 0,1,4,5 ./stream
[likwid-pin] Main PID -> core 0 - OK
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
[... some STREAM output omitted ...]
The *best* time for each test is used
*EXCLUDING* the first and last iterations
[pthread wrapper] PIN_MASK: 0->1 1->4 2->5
[pthread wrapper] SKIP MASK: 0x1
[pthread wrapper 0] Notice: Using libpthread.so.0
        threadid 1073809728 -> SKIP
[pthread wrapper 1] Notice: Using libpthread.so.0
        threadid 1078008128 -> core 1 - OK
[pthread wrapper 2] Notice: Using libpthread.so.0
        threadid 1082206528 -> core 4 - OK
[pthread wrapper 3] Notice: Using libpthread.so.0
        threadid 1086404928 -> core 5 - OK
[... rest of STREAM output omitted ...]

(Callouts: the shepherd thread is skipped; the main PID is always pinned; all spawned threads are pinned in turn.)

    Slide 88 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

Topology (mapping) choices with MPI+OpenMP:
More examples using Intel MPI+compiler & home-grown mpirun

Hybrid Parallel Programming

One MPI process per node:
  env OMP_NUM_THREADS=8 mpirun -pernode \
      likwid-pin -c 0-7 ./a.out

One MPI process per socket:
  env OMP_NUM_THREADS=4 mpirun -npernode 2 \
      -pin "0,1,2,3_4,5,6,7" ./a.out

One MPI process per socket, OpenMP threads pinned round robin across the cores in the node:
  env OMP_NUM_THREADS=4 mpirun -npernode 2 \
      -pin "0,1,4,5_2,3,6,7" \
      likwid-pin -c L:0,2,1,3 ./a.out

Two MPI processes per socket:
  env OMP_NUM_THREADS=2 mpirun -npernode 4 \
      -pin "0,1_2,3_4,5_6,7" \
      likwid-pin -c L:0,1 ./a.out

    Slide 89 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    MPI/OpenMP hybrid how-to: Take-home messages

    Learn how to take control of hybrid execution!

Always observe the topology dependence of
  intranode MPI,
  OpenMP overheads,
  saturation effects / scalability behavior with bandwidth-bound code.

Enforce proper thread/process-to-core binding, using appropriate tools (whatever you use, but use SOMETHING).

Multi-LD (multi-locality-domain) OpenMP processes on ccNUMA nodes require correct page placement.

    Hybrid Parallel ProgrammingSlide 90 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

Outline
  Introduction / Motivation
  Programming models on clusters of SMP nodes
  Case Studies / pure MPI vs hybrid MPI+OpenMP
  Hybrid programming & accelerators
  Practical How-To on hybrid programming
  Mismatch Problems
  Opportunities: Application categories that can benefit from hybrid parallelization
  Other options on clusters of SMP nodes
  Summary

    Slide 91 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Mismatch Problems

None of the programming models fits the hierarchical hardware (cluster of SMP nodes).

Several mismatch problems (see the following slides).

Benefit through hybrid programming: opportunities, see next section.

Quantitative implications depend on your application.

Examples:                                   No. 1   No. 2
Benefit through hybrid (see next section)   +30%    +10%
Loss by mismatch problems                   -10%    -25%
Total                                       +20%    -15%

In most cases: both categories!

(Hardware hierarchy: core -> CPU (socket) -> SMP board -> ccNUMA node -> cluster of ccNUMA/SMP nodes.)

    Slide 92 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    The Topology Problem

Problem: the application topology must be mapped to the hardware topology;
this determines the communication topology and the message sizes,
and hence the communication overhead.

Partially independent of the programming model:
  pure MPI / hybrid MPI+OpenMP / hybrid MPI+MPI

Simplifications:
  SMP node: O(N^3) data items per node
  Cluster network: O(N^2) neighbor communication per neighbor

    Slide 93 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

The Topology Problem, without inner halo communication
(hybrid MPI+OpenMP / hybrid MPI+MPI)

Communication only through neighbor accesses between the d ccNUMA domains.
Compare the communication between the ccNUMA domains
(s = communication size per domain; example: d = 8 ccNUMA domains):

  1-dimensional data decomposition:           s ~ 2 N^2
  3-dimensional data decomposition (2x2x2):   s ~ 6 (N/2)^2 = 1.5 N^2

No real win! Don't care about dimensions within the SMP nodes!
Make your software simple!

    Hybrid Parallel ProgrammingSlide 94 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

The Topology Problem, with inner halo communication
(pure MPI / hybrid MPI+MPI; s = communication size per core)

With halo cells and halo communication between the cores;
we ignore differences in core-to-core communication speed within a ccNUMA domain
and between the ccNUMA domains of one SMP node.

Example with c = 32 cores per SMP node:

  c=32, 1-dimensional data decomposition:          s ~ 2 N^2
  c=32, 3-dimensional data decomposition (4x4x2):  s ~ 2 (N/4)(N/4) + 2 (N/4)(N/2) + 2 (N/4)(N/2) = 0.63 N^2

In general:  win = s_1dim / s_3dim = c^(2/3) / 3
  c = 16, 32, 64, ...   ->   win = factor 2, 3, 5, ...

Real win? You may not care as long as your inner-node communication is below xx% of the total!
Make your software simple!?
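As a worked check of the general formula (the intermediate numbers are mine):

  c = 16:  win = 16^(2/3) / 3 ≈ 6.3 / 3  ≈ 2
  c = 32:  win = 32^(2/3) / 3 ≈ 10.1 / 3 ≈ 3
  c = 64:  win = 64^(2/3) / 3 = 16 / 3   ≈ 5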

    Slide 95 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

The Mapping Problem with the mixed model

Several multi-threaded MPI processes per SMP node.

Problem: where are your processes and threads really located?
Solutions: depend on your platform, e.g., pinning with numactl (an illustrative call follows below).
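For instance (illustrative invocation; the NUMA node numbers depend on the machine), one multi-threaded MPI process could be bound to socket/NUMA node 0 and its local memory with:

  numactl --cpunodebind=0 --membind=0 ./a.out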

(Figure: an SMP node with two quad-core sockets running two 4-way multi-threaded MPI processes. Do we have this (each MPI process confined to one socket) or that (each MPI process spread across both sockets)?)

Case study on the Sun Constellation Cluster "Ranger" with BT-MZ and SP-MZ
(pure MPI / hybrid MPI+OpenMP / hybrid MPI+MPI).

Further questions:
  Where is the NIC 1) located?
  Which cores share caches?
1) NIC = Network Interface Card

    Slide 96 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Unnecessary intra-node communication

Problem: with several MPI processes on each SMP node there is unnecessary intra-node communication.

Solution:
  MPI+OpenMP: only one MPI process per SMP node
  MPI+MPI: no halo communication within an SMP node

Remarks:
  MPI communication within an SMP node: 2 copies
    (user send buffer -> shared memory -> user recv buffer)
  MPI-3 shared memory halo communication: 1 copy
    (user send buffer -> user recv buffer)
  MPI-3 with direct access to neighbor data: 0 copies

(Margin labels for the items above: mixed model = several multi-threaded MPI processes per SMP node; pure MPI; hybrid MPI+MPI with halo communication; pure MPI & mixed model; hybrid MPI+MPI with direct neighbor access.)

    Slide 97 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

Sleeping threads and network saturation with Masteronly (hybrid MPI+OpenMP)

Problem 1: Can the master thread saturate the network?
Solution: if not, use the mixed model, i.e., several MPI processes per SMP node.

Problem 2: Sleeping threads are wasting CPU time.
Solution: overlapping of computation and communication.

Problems 1 & 2 together: even more idle time is produced through the lousy bandwidth of the master thread.

Masteronly: MPI only outside of parallel regions

for (iteration ...) {
  #pragma omp parallel
  {
    /* numerical code */
  } /* end omp parallel */

  /* on master thread only */
  MPI_Send(original data to halo areas in other SMP nodes)
  MPI_Recv(halo data from the neighbors)
} /* end for loop */

(A compilable skeleton of this pattern follows below.)
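A minimal compilable sketch of the masteronly pattern (my own illustrative example: 1-D ring halo exchange, Jacobi-like kernel, fixed sizes; not one of the applications from the slides):

  #include <mpi.h>

  #define N 1000000
  static double u[N+2], unew[N+2];   /* one halo cell on each side */

  int main(int argc, char **argv)
  {
      int provided, rank, size;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      int left  = (rank - 1 + size) % size;
      int right = (rank + 1) % size;

      for (int iter = 0; iter < 100; iter++) {
          /* compute phase: all threads work, no MPI inside the parallel regions */
          #pragma omp parallel for
          for (int i = 1; i <= N; i++)
              unew[i] = 0.5 * (u[i-1] + u[i+1]);
          #pragma omp parallel for
          for (int i = 1; i <= N; i++)
              u[i] = unew[i];

          /* communication phase: master thread only (serial part of the code) */
          MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0,
                       &u[0], 1, MPI_DOUBLE, left,  0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          MPI_Sendrecv(&u[1],   1, MPI_DOUBLE, left,  1,
                       &u[N+1], 1, MPI_DOUBLE, right, 1,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }
      MPI_Finalize();
      return 0;
  }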

(Figure: two SMP nodes, each with two sockets; only the master thread of each node communicates over the node interconnect while the other threads sleep.)

    hybrid MPI+OpenMP

    Slide 98 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

OpenMP: Additional Overhead & Pitfalls

Using OpenMP
  may prohibit compiler optimization,
  may cause a significant loss of computational performance.

Thread fork/join overhead.

On ccNUMA SMP nodes:
  loss of performance due to missing memory page locality or a missing first-touch strategy
  (a first-touch sketch follows below);
  e.g., with the masteronly scheme: one thread produces the data, the master thread sends it with MPI
  -> the data may be internally communicated from one memory to the other.

Amdahl's law applies on each level of parallelism.

Using MPI-parallel application libraries? Are they prepared for hybrid?
Using thread-local application libraries? Are they thread-safe?
  (See, e.g., the necessary -O4 flag with mpxlf_r on IBM Power6 systems.)
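To illustrate the first-touch point above, a minimal sketch (array name and size are illustrative): initialize the data with the same OpenMP work sharing that the compute loops will use, so each thread's pages end up in its local memory.

  /* ccNUMA first-touch initialization sketch (compile with OpenMP). */
  #include <stdlib.h>

  int main(void)
  {
      const long n = 100000000L;
      double *a = malloc(n * sizeof(double));

      /* first touch in parallel: each thread touches "its" pages */
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; i++)
          a[i] = 0.0;

      /* later compute loops with the same static schedule access local memory */
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; i++)
          a[i] = 2.0 * a[i] + 1.0;

      free(a);
      return 0;
  }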

    hybrid MPI+OpenMP

    Slide 99 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

MPI-3 shared memory programming

Pros:
  ISV and application libraries need not be thread-safe
  no additional OpenMP overhead
  no OpenMP problems

Cons:
  library calls (MPI_WIN_ALLOCATE_SHARED) instead of SHARED/PRIVATE compiler directives
  no work-sharing directives
    -> loop scheduling must be programmed by hand
  no support for fine-grained or auto-balanced work-sharing
    (as with OpenMP tasks, and dynamic or guided loop schedules)
  virtual addresses of a shared memory window may be different in each MPI process
    -> no binary pointers, i.e., linked lists must be stored with offsets rather than pointers

(A minimal allocation and direct-access sketch follows below.)

Hybrid MPI+MPI: MPI for inter-node communication + MPI-3.0 shared memory programming
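A minimal sketch of this model (my own illustrative example: each rank contributes 1000 doubles to a node-wide window and reads one element of its left neighbor directly; error checking omitted):

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      /* communicator containing only the ranks of my shared-memory node */
      MPI_Comm nodecomm;
      MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                          MPI_INFO_NULL, &nodecomm);
      int nrank;
      MPI_Comm_rank(nodecomm, &nrank);

      /* contiguous node-wide shared window; each rank owns 1000 doubles */
      MPI_Win win;
      double *mypart;
      MPI_Win_allocate_shared(1000 * sizeof(double), sizeof(double),
                              MPI_INFO_NULL, nodecomm, &mypart, &win);

      MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
      mypart[999] = (double)nrank;      /* fill my boundary element */
      MPI_Win_sync(win);                /* memory sync of the window */
      MPI_Barrier(nodecomm);            /* neighbors have written by now */
      MPI_Win_sync(win);

      if (nrank > 0) {                  /* direct load from the left neighbor */
          MPI_Aint sz; int disp; double *left;
          MPI_Win_shared_query(win, nrank - 1, &sz, &disp, &left);
          double halo = left[999];      /* no send/recv, no extra copy */
          (void)halo;
      }
      MPI_Win_unlock_all(win);

      /* note: share offsets, not pointers - base addresses differ per process */
      MPI_Win_free(&win);
      MPI_Comm_free(&nodecomm);
      MPI_Finalize();
      return 0;
  }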

    Slide 100 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

Parallel Programming Models on Hybrid Platforms

pure MPI: one MPI process on each core

hybrid MPI+OpenMP: MPI for inter-node communication, OpenMP inside of each SMP node
  No overlap of comm. + comp.:
    Masteronly: MPI only outside of parallel regions of the numerical application code
  Overlapping comm. + comp.: MPI communication by one or a few threads while the other threads are computing
    Funneled: MPI only on the master thread
      Funneled & Reserved: reserved thread for communication
      Funneled with Full Load Balancing
    Multiple: more than one thread may communicate

Hybrid MPI+MPI (new): MPI for inter-node communication + MPI-3.0 shared memory programming
  Within shared memory nodes: halo updates through direct data copy (new)
  Within shared memory nodes: no halo updates, direct access to neighbor data (new)

OpenMP only: distributed virtual shared memory

    Slide 101 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

Overlapping communication and computation: three problems

The application problem:
  one must separate the application into
    code that can run before the halo data is received, and
    code that needs the halo data;
  very hard to do!

The thread-rank problem:
  splitting communication / computation via the thread rank
  -> cannot use work-sharing directives
  -> loss of major OpenMP support (see next slide)

The load balancing problem.

if (my_thread_rank < 1) {
  MPI_Send/Recv....              /* communication done by thread 0 only */
} else {
  /* the remaining threads split the index range [low, high) by hand */
  my_range = (high - low - 1) / (num_threads - 1) + 1;
  my_low   = low + (my_thread_rank - 1) * my_range;
  my_high  = min(high, my_low + my_range);
  for (i = my_low; i < my_high; i++) {
    ...                          /* computation */
  }
}

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

Experiment: matrix-vector multiply (MVM) in a Jacobi-Davidson solver
on IBM SP Power3 nodes with 16 CPUs per node.

funneled & reserved is always faster in these experiments.
Reason: the memory bandwidth is already saturated by 15 CPUs, see inset.
Inset: speedup on 1 SMP node using different numbers of threads
(curves: funneled & reserved, Masteronly).

Source: R. Rabenseifner, G. Wellein: Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures. International Journal of High Performance Computing Applications, Vol. 17, No. 1, 2003, Sage Science Press.

(Plot labels: "funneled & reserved is faster" / "masteronly is faster"; y-axis: performance ratio (r).)

    Slide 103 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Overlapping: Using OpenMP tasks

    Hybrid Parallel Programming

The NEW OpenMP tasking model gives a new way to achieve more parallelism from hybrid computation.

Slides, courtesy of Alice Koniges, NERSC, LBNL

Alice Koniges et al.: Application Acceleration on Current and Future Cray Platforms. Proceedings, CUG 2010, Edinburgh, GB, May 24-27, 2010.

    Slide 104 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Case study: Communication and Computation in Gyrokinetic Tokamak Simulation (GTS) shift routine

    Hybrid Parallel Programming

Work on the particle array (packing for sending, reordering, adding after sending) can be overlapped with data-independent MPI communication using OpenMP tasks.

(Figure: GTS shift routine; the particle-array work and the MPI communication steps are marked INDEPENDENT or SEMI-INDEPENDENT of each other.)

    Slides, courtesy of Alice Koniges, NERSC, LBNL

    Slide 105 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Overlapping can be achieved with OpenMP tasks (1st part)

    Hybrid Parallel Programming

Overlapping MPI_Allreduce with particle work:
  Overlap: the master thread encounters the tasking statements (!$omp master) and creates work for the thread team for deferred execution; the MPI_Allreduce call is executed immediately.
  The MPI implementation has to support at least MPI_THREAD_FUNNELED.
  Subdividing the tasks into smaller chunks allows better load balancing and scalability among threads (a sketch of the pattern follows below).

Slides, courtesy of Alice Koniges, NERSC, LBNL
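A minimal sketch of this pattern in C (my own illustrative example, not the GTS code; do_particle_chunk() and the chunk count are assumptions):

  #include <mpi.h>

  /* stand-in for independent particle work (packing, reordering, ...) */
  static void do_particle_chunk(int chunk) { (void)chunk; }

  static void overlap_allreduce(double *local, double *global, int n,
                                int nchunks, MPI_Comm comm)
  {
      #pragma omp parallel
      {
          #pragma omp master
          {
              /* create deferred tasks for the particle work ... */
              for (int c = 0; c < nchunks; c++) {
                  #pragma omp task firstprivate(c)
                  do_particle_chunk(c);
              }
              /* ... then start the collective immediately on the master
                 thread (MPI_THREAD_FUNNELED is sufficient), while the
                 other threads execute the queued tasks. */
              MPI_Allreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm);
          }
          /* implicit barrier: all tasks are finished when we leave here */
      }
  }

  int main(int argc, char **argv)
  {
      int provided;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      double local[4] = {1, 2, 3, 4}, global[4];
      overlap_allreduce(local, global, 4, 16, MPI_COMM_WORLD);
      MPI_Finalize();
      return 0;
  }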

    Slide 106 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Overlapping can be achieved with OpenMP tasks (2nd part)

    Hybrid Parallel Programming

Overlapping the particle reordering
Overlapping the remaining MPI_Sendrecv

Particle reordering of the remaining particles, adding the sent particles into the array, and the sending or receiving of the shifted particles can be executed independently of each other.

    Slides, courtesy of Alice Koniges, NERSC, LBNL

    Slide 107 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    OpenMP tasking version outperforms original shifter, especially in larger poloidal domains

    Hybrid Parallel Programming

Performance breakdown of the GTS shifter routine using 4 OpenMP threads per MPI process with varying domain decomposition and particles per cell on Franklin (Cray XT4).

MPI communication in the shift phase uses a toroidal MPI communicator (of constant size 128).

Large performance differences in the 256-MPI-process run compared to the 2048-process run!
The speed-up is expected to be higher on larger GTS runs with hundreds of thousands of CPUs, since MPI communication is more expensive there.

(Charts: 256-size run and 2048-size run.)

    Slides, courtesy of Alice Koniges, NERSC, LBNL

    Slide 108 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

Outline
  Introduction / Motivation
  Programming models on clusters of SMP nodes
  Case Studies / pure MPI vs hybrid MPI+OpenMP
  Hybrid programming & accelerators
  Practical How-To on hybrid programming
  Mismatch Problems
  Opportunities: Application categories that can benefit from hybrid parallelization
  Other options on clusters of SMP nodes
  Summary

    Slide 109 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Nested Parallelism

Example NPB: BT-MZ (block-tridiagonal simulated CFD application)
Outer loop:
  limited number of zones -> limited parallelism
  zones with different workloads -> speedup