Hybrid MPI and OpenMP Parallel Programming


  • Höchstleistungsrechenzentrum Stuttgart

    SUPERsmith

    Hybrid Parallel Programming

    Hybrid MPI and OpenMP Parallel Programming

    MPI + OpenMP and other models on clusters of SMP nodes

    Rolf Rabenseifner 1), Georg Hager 2), Gabriele Jost 3)
    [email protected], [email protected], [email protected]

    1) High Performance Computing Center (HLRS), University of Stuttgart, Germany
    2) Regional Computing Center (RRZE), University of Erlangen, Germany
    3) Supersmith, Maximum Performance Software, USA

    The author is currently working for Intel Corporation. All material presented has been prepared while still under the affiliation of Supersmith. It does not represent the views of Intel Corporation.

    Tutorial tut123 at SC13, November 17, 2013, Denver (CO) USA

    Slide 1

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Motivation

    Efficient programming of clusters of shared memory (SMP) nodes

    Hierarchical system layout

    Hybrid programming seems natural:
        MPI between the nodes
        Shared memory programming inside of each SMP node:
            OpenMP, or MPI-3 shared memory programming
            Accelerator support in OpenMP 4.0 and OpenACC

    [Figure: cluster of ccNUMA/SMP nodes: each SMP board holds multi-core CPUs (sockets) with shared memory; the ccNUMA nodes are coupled by a node interconnect]

    Slide 2 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    [Figure: two SMP nodes, each with two quad-core CPU sockets, connected by a node interconnect; depicted options range from one MPI process per core (pure MPI) over 4-way multi-threaded MPI processes (one per socket) to 8-way multi-threaded MPI processes (one per node)]

    Motivation

    Which programming model is fastest?
        MPI everywhere?
        Fully hybrid MPI & OpenMP?
        Something in between? (mixed model)

    Often hybrid programming is slower than pure MPI: examples and reasons follow.


    Slide 3 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Goals of this tutorial

    Sensitize to problems on clusters of SMP nodes
        see sections: Case studies, Mismatch problems
    Technical aspects of hybrid programming
        see sections: Programming models on clusters, Examples on hybrid programming
    Opportunities with hybrid programming
        see section: Opportunities: Application categories that can benefit from hybrid parallelization
    Issues and their solutions
        with sections: Thread-safety quality of MPI libraries, Tools for debugging and profiling for MPI+OpenMP

    Less frustration & more success with your parallel program on clusters of SMP nodes

    Slide 4 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Outline (slide number)

    Introduction / Motivation (2)
    Programming models on clusters of SMP nodes (6)
    Case Studies / pure MPI vs hybrid MPI+OpenMP (28)
    Practical How-To on hybrid programming (55)
    Mismatch Problems (91)
    Opportunities: Application categories that can benefit from hybrid parallelization (109)
    Other options on clusters of SMP nodes (118)
    Accelerators (135)
    Summary (151)
    Appendix (158)
    Content (detailed) (174)

    Sessions: 08:30-10:00 and 10:30-12:00
    Includes additional slides, marked as skipped

    Slide 5 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Outline

    Introduction / Motivation
    Programming models on clusters of SMP nodes
    Case Studies / pure MPI vs hybrid MPI+OpenMP
    Hybrid programming & accelerators
    Practical How-To on hybrid programming
    Mismatch Problems
    Opportunities: Application categories that can benefit from hybrid parallelization
    Other options on clusters of SMP nodes
    Summary

    Slide 6 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Major Programming models on hybrid systems

    [Figure: pure MPI vs. hybrid MPI+OpenMP. Pure MPI: a sequential program with local data on each core, explicit message passing by calling MPI_Send & MPI_Recv over the node interconnect. Hybrid: OpenMP (shared data) inside of the SMP nodes, MPI between the nodes via the node interconnect; code pattern: some_serial_code; #pragma omp parallel for; for (j=...; ...; j++) block_to_be_parallelized; again_some_serial_code; master thread active, other threads sleeping outside the parallel regions]

    Pure MPI (one MPI process on each core)
    Hybrid: MPI + OpenMP (shared memory: OpenMP; distributed memory: MPI)
    Hybrid: MPI message passing + MPI-3.0 shared memory programming
    Other: PGAS (UPC, Coarray Fortran, ...), possibly together with MPI

    Often hybrid programming (MPI+OpenMP) is slower than pure MPI. Why?

    Slide 7 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Parallel Programming Models on Hybrid Platforms

    pure MPI: one MPI process on each core

    hybrid MPI+OpenMP: MPI for inter-node communication, OpenMP inside of each SMP node
        Masteronly: MPI only outside of parallel regions of the numerical application code (no overlap of communication and computation)
        Overlapping communication and computation: MPI communication by one or a few threads while other threads are computing

    Hybrid MPI+MPI: MPI for inter-node communication + MPI-3.0 shared memory programming
        Within shared memory nodes: halo updates through direct data copy, or
        no halo updates, direct access to neighbor data

    OpenMP only: distributed virtual shared memory

    Slide 8 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Pure MPI

    Advantages:
        No modifications of existing MPI codes
        The MPI library need not support multiple threads

    Major problems:
        Does the MPI library internally use different protocols?
            Shared memory inside of the SMP nodes
            Network communication between the nodes
        Does the application topology fit the hardware topology?
        Unnecessary MPI communication inside of the SMP nodes!

    pure MPI: one MPI process on each core

    Discussed in detail later on in the section Mismatch Problems

    Slide 9 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Hybrid MPI+OpenMP Masteronly Style

    Advantages:
        No message passing inside of the SMP nodes
        No topology problem

    for (iteration ...) {
      #pragma omp parallel
        numerical code
      /* end omp parallel */

      /* on master thread only */
      MPI_Send (original data to halo areas in other SMP nodes)
      MPI_Recv (halo data from the neighbors)
    } /* end for loop */

    Masteronly: MPI only outside of parallel regions

    Major problems:
        All other threads are sleeping while the master thread communicates!
        Which inter-node bandwidth is achieved by a single communicating thread?
        The MPI library must support at least MPI_THREAD_FUNNELED
            (see section "Thread-safety quality of MPI libraries")
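    To make the masteronly style concrete, here is a minimal C sketch of one possible implementation (not from the original slides); the names u_new, u_old, n, left, and right are illustrative assumptions for a 1-D halo exchange:

    #include <mpi.h>

    /* Masteronly sketch: OpenMP-parallel compute phase, then an MPI halo
       exchange performed only outside the parallel region, i.e., by the
       master thread (MPI_THREAD_FUNNELED is sufficient). Interior points
       are 1..n-2, halo cells are u_new[0] and u_new[n-1]. */
    void masteronly_iteration(double *u_new, const double *u_old, int n,
                              int left, int right, MPI_Comm comm)
    {
        int i;
        #pragma omp parallel for
        for (i = 1; i < n - 1; i++)            /* numerical code on all threads */
            u_new[i] = 0.5 * (u_old[i - 1] + u_old[i + 1]);

        /* implicit barrier ends the parallel region; master thread communicates */
        MPI_Sendrecv(&u_new[1],     1, MPI_DOUBLE, left,  0,
                     &u_new[n - 1], 1, MPI_DOUBLE, right, 0,
                     comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u_new[n - 2], 1, MPI_DOUBLE, right, 1,
                     &u_new[0],     1, MPI_DOUBLE, left,  1,
                     comm, MPI_STATUS_IGNORE);
    }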

    Slide 10 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    MPI rules with OpenMP / automatic SMP-parallelization

    Special MPI-2 init for multi-threaded MPI processes:

    REQUIRED values (increasing order):
        MPI_THREAD_SINGLE: only one thread will execute
        THREAD_MASTERONLY (virtual value, not part of the standard): MPI processes may be multi-threaded, but only the master thread will make MPI calls, AND only while the other threads are sleeping
        MPI_THREAD_FUNNELED: only the master thread will make MPI calls
        MPI_THREAD_SERIALIZED: multiple threads may make MPI calls, but only one at a time
        MPI_THREAD_MULTIPLE: multiple threads may call MPI, with no restrictions

    The returned provided level may be less than the REQUIRED level requested by the application.

    int MPI_Init_thread(int *argc, char ***argv,
                        int thread_level_required,
                        int *thread_level_provided);
    int MPI_Query_thread(int *thread_level_provided);
    int MPI_Is_thread_main(int *flag);
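    As a usage illustration (not from the original slides), a minimal sketch of requesting MPI_THREAD_FUNNELED and checking what the library actually provides:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int required = MPI_THREAD_FUNNELED, provided;

        MPI_Init_thread(&argc, &argv, required, &provided);
        if (provided < required) {
            fprintf(stderr, "MPI library provides only thread level %d\n", provided);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        /* ... hybrid MPI+OpenMP work ... */
        MPI_Finalize();
        return 0;
    }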

    Slide 11 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Calling MPI inside of OMP MASTER

    Inside of a parallel region, with OMP MASTER

    Requires MPI_THREAD_FUNNELED, i.e., only the master thread will make MPI calls.

    Caution: There isn't any synchronization with OMP MASTER! Therefore, an OMP BARRIER is normally necessary to guarantee that data or buffer space from/for other threads is available before/after the MPI call!

    Fortran:
        !$OMP BARRIER
        !$OMP MASTER
            call MPI_Xxx(...)
        !$OMP END MASTER
        !$OMP BARRIER

    C:
        #pragma omp barrier
        #pragma omp master
            MPI_Xxx(...);
        #pragma omp barrier

    But this implies that all other threads are sleeping! The additional barrier also implies the necessary cache flush!

    Slide 12 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Why the barrier is necessary: example with MPI_Recv

    Fortran:
        !$OMP PARALLEL
        !$OMP DO
          do i=1,1000
            a(i) = buf(i)
          end do
        !$OMP END DO NOWAIT
        !$OMP BARRIER
        !$OMP MASTER
          call MPI_RECV(buf,...)
        !$OMP END MASTER
        !$OMP BARRIER
        !$OMP DO
          do i=1,1000
            c(i) = buf(i)
          end do
        !$OMP END DO NOWAIT
        !$OMP END PARALLEL

    C (the C column of the original slide is truncated here; a minimal reconstruction of the analogous code):
        #pragma omp parallel
        {
          #pragma omp for nowait
          for (i=0; i<1000; i++)
            a[i] = buf[i];
          #pragma omp barrier
          #pragma omp master
          MPI_Recv(buf,...);
          #pragma omp barrier
          #pragma omp for nowait
          for (i=0; i<1000; i++)
            c[i] = buf[i];
        }

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Example: Thread support within Open MPI

    In order to enable thread support in Open MPI, configure with:
        configure --enable-mpi-threads
    This turns on:
        Support for full MPI_THREAD_MULTIPLE
        Internal checks when run with threads (--enable-debug)

        configure --enable-mpi-threads --enable-progress-threads
    This (additionally) turns on:
        Progress threads to asynchronously transfer/receive data per network BTL

    Additional feature:
        Compiling with debugging support, but without threads, will check for recursive locking

    Courtesy of Rainer Keller, HLRS and ORNL

    Slide 14 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Overlapping communication and computation

    if (my_thread_rank < ) {
        MPI_Send/Recv ...
        i.e., communicate all halo data
    } else {
        Execute those parts of the application
        that do not need halo data
        (on non-communicating threads)
    }
    Execute those parts of the application
    that need halo data
    (on all threads)

    Overlapping communication and computation: MPI communication by one or a few threads while other threads are computing (a sketch follows below).
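    A minimal C sketch of this pattern (illustrative, not from the original slides), assuming MPI_THREAD_FUNNELED and a 1-D array u with halo cells u[0] and u[n-1]; the inner update is a placeholder:

    #include <mpi.h>
    #include <omp.h>

    /* Thread 0 (the master thread) exchanges halos while the remaining
       threads update the inner points, which need no halo data. */
    void overlapped_iteration(double *u, int n, int left, int right, MPI_Comm comm)
    {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            int nth = omp_get_num_threads();

            if (tid == 0) {
                /* communicating (master) thread: MPI_THREAD_FUNNELED suffices */
                MPI_Sendrecv(&u[1],     1, MPI_DOUBLE, left,  0,
                             &u[n - 1], 1, MPI_DOUBLE, right, 0,
                             comm, MPI_STATUS_IGNORE);
                MPI_Sendrecv(&u[n - 2], 1, MPI_DOUBLE, right, 1,
                             &u[0],     1, MPI_DOUBLE, left,  1,
                             comm, MPI_STATUS_IGNORE);
            } else {
                /* non-communicating threads: placeholder inner work on points
                   2..n-3, split cyclically among threads 1..nth-1 */
                int i;
                for (i = 2 + (tid - 1); i < n - 2; i += nth - 1)
                    u[i] = 0.99 * u[i];
            }
            #pragma omp barrier
            /* halo cells are now valid: update boundary-dependent points */
            #pragma omp single
            {
                u[1]     = 0.5 * (u[0]     + u[2]);
                u[n - 2] = 0.5 * (u[n - 3] + u[n - 1]);
            }
        }
    }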

    Slide 15 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Hybrid MPI + MPI-3 shared memory

    Advantages:
        No message passing inside of the SMP nodes
        Using only one parallel programming standard
        No OpenMP problems (e.g., thread-safety isn't an issue)

    Major problems:
        The communicator must be split into shared memory islands
        To minimize shared memory communication overhead: halos (or the data accessed by the neighbors) must be stored in MPI shared memory windows
        Same work-sharing as with pure MPI

    Hybrid MPI+MPI: MPI for inter-node communication + MPI-3.0 shared memory programming

    Slide 16 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    MPI-3 shared memory

    Split the main communicator into shared memory islands: MPI_Comm_split_type
    Define a shared memory window on each island: MPI_Win_allocate_shared
    Result (by default): a contiguous array, directly accessible by all processes of the island
    Accesses and synchronization:
        Normal assignments and expressions
        No MPI_PUT/GET!
        Normal MPI one-sided synchronization, e.g., MPI_WIN_FENCE

    Hybrid MPI+MPI: MPI for inter-node communication + MPI-3.0 shared memory programming

    Slide 17 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Splitting the communicator & contiguous shared memory allocation

    Hybrid Parallel Programming


    MPI_Aint /*IN*/  local_window_count;
    double   /*OUT*/ *base_ptr;
    MPI_Comm comm_all, comm_sm;
    int my_rank_all, my_rank_sm, size_sm, disp_unit;

    MPI_Comm_rank (comm_all, &my_rank_all);
    MPI_Comm_split_type (comm_all, MPI_COMM_TYPE_SHARED, 0,
                         MPI_INFO_NULL, &comm_sm);
    MPI_Comm_rank (comm_sm, &my_rank_sm);
    MPI_Comm_size (comm_sm, &size_sm);
    disp_unit = sizeof(double);   /* shared memory should contain doubles */
    MPI_Win_allocate_shared (local_window_count*disp_unit, disp_unit, MPI_INFO_NULL,
                             comm_sm, &base_ptr, &win_sm);

    [Figure: 16 MPI processes, my_rank_all = 0..15 in comm_all, split into four SMP-node sub-communicators comm_sm with my_rank_sm = 0..3 each; each process contributes local_window_count doubles, forming one contiguous shared memory window per SMP node starting at base_ptr; the sequence in comm_sm is the same as in comm_all]

    In Fortran, MPI-3.0 page 341, Examples 8.1 (and 8.2) show how to convert buf_ptr into a usable array a. This mapping is based on a sequential ranking of the SMP nodes in comm_all.

    Slide 18 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Within each SMP node: essentials

    The allocated shared memory is contiguous across process ranks, i.e., the first byte of rank i starts right after the last byte of rank i-1.
    Processes can calculate remote address offsets with local information only.
    Remote accesses go through load/store operations, i.e., without MPI RMA operations (MPI_GET/PUT, ...).
    Although each process in comm_sm accesses the same physical memory, the virtual start address of the whole array may be different in each process!
        Therefore: linked lists only with offsets in a shared array, but not with binary pointer addresses! (see the sketch below)

    The following slides show only the shared memory accesses, i.e., communication between the SMP nodes is not presented.
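    A small C sketch of the offset-based linking mentioned above (illustrative, not from the original slides): nodes in the shared window reference each other by array index instead of by virtual address, so every process can follow the links regardless of where the window is mapped.

    #include <mpi.h>

    /* Node stored inside the MPI shared memory window: "next" is an index
       into the same window, not a pointer, because the window's virtual
       base address may differ between processes. */
    typedef struct {
        double value;
        MPI_Aint next;          /* index of next node, or -1 for end of list */
    } shm_node;

    /* Walk a list starting at index 'head' inside the window that this
       process sees at 'base' (e.g., from MPI_Win_allocate_shared or
       MPI_Win_shared_query). */
    double sum_list(const shm_node *base, MPI_Aint head)
    {
        double s = 0.0;
        for (MPI_Aint i = head; i != -1; i = base[i].next)
            s += base[i].value;
        return s;
    }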

    Hybrid Parallel Programming


    Slide 19 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Shared memory access example

    Hybrid Parallel Programming


    MPI_Aint /*IN*/  local_window_count;
    double   /*OUT*/ *base_ptr;

    MPI_Win_allocate_shared (local_window_count*disp_unit, disp_unit, MPI_INFO_NULL,
                             comm_sm, &base_ptr, &win_sm);

    MPI_Win_fence (0, win_sm);   /* local store epoch can start */
    for (i=0; i<local_window_count; i++)  base_ptr[i] = ...   /* local stores */
    MPI_Win_fence (0, win_sm);   /* local stores are finished, remote load epoch can start */
    if (my_rank_sm > 0)  printf("left neighbor's rightmost value = %lf \n", base_ptr[-1]);
    if (my_rank_sm < size_sm-1)  printf("right neighbor's leftmost value = %lf \n",
                                        base_ptr[local_window_count]);

    [Figure: as before, 16 processes split into SMP-node sub-communicators; each process performs local stores to its portion of the contiguous shared memory window, synchronizes, and then has direct load access to the neighboring processes' window portions]

    In Fortran, before and after the synchronization, one must add CALL MPI_F_SYNC_REG (buffer) to guarantee that register copies of buffer are written back to memory, or read again from memory, respectively.

    Slide 20 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Establish comm_sm, comm_nodes, comm_all, if SMPs are not contiguous within comm_orig

    Hybrid Parallel Programming


    [Figure: my_rank_all = 0..15 in comm_all; sub-communicator comm_sm for one SMP node with my_rank_sm = 0..3]

    MPI_Comm_split_type (comm_orig, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &comm_sm);

    MPI_Comm_size (comm_sm, &size_sm); MPI_Comm_rank (comm_sm, &my_rank_sm);

    MPI_Comm_split (comm_orig, my_rank_sm, 0, &comm_nodes);

    MPI_Comm_size (comm_nodes, &size_nodes);

    if (my_rank_sm==0) {

    MPI_Comm_rank (comm_nodes, &my_rank_nodes);

    MPI_Exscan (&size_sm, &my_rank_all, 1, MPI_INT, MPI_SUM, comm_nodes);

    if (my_rank_nodes == 0) my_rank_all = 0;

    }

    MPI_Comm_free (&comm_nodes);

    MPI_Bcast (&my_rank_nodes, 1, MPI_INT, 0, comm_sm);

    MPI_Comm_split (comm_orig, my_rank_sm, my_rank_nodes, &comm_nodes);

    MPI_Bcast (&my_rank_all, 1, MPI_INT, 0, comm_sm); my_rank_all = my_rank_all + my_rank_sm;

    MPI_Comm_split (comm_orig, /*color*/ 0, my_rank_all, &comm_all);

    Annotations to the code above:
        MPI_Comm_split_type establishes a communicator comm_sm with ranks my_rank_sm on each SMP node.
        The first MPI_Comm_split creates comm_nodes, combining all processes with the same my_rank_sm into a separate communicator; on processes with my_rank_sm > 0 this comm_nodes is unused, because the node numbering within these comm_nodes may differ.
        MPI_Exscan does not return a value on the first rank; therefore my_rank_all is set to 0 there.
        The broadcasts expand the node numbering (my_rank_nodes) from the comm_nodes with my_rank_sm == 0 to all new node-to-node communicators comm_nodes, and distribute the node offset for my_rank_all.
        The final MPI_Comm_split calculates my_rank_all and establishes the global communicator comm_all with sequential SMP subsets.

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Alternative: Non-contiguous shared memory

    Using the info key "alloc_shared_noncontig", the MPI library can put the processes' window portions
        on page boundaries (internally, e.g., only one OS shared memory segment with some unused padding zones), or
        into the local ccNUMA memory domain, plus page boundaries (internally, e.g., each window portion is one OS shared memory segment).
    Pros: Faster local data accesses, especially on ccNUMA nodes.
    Cons: Higher programming effort for neighbor accesses: MPI_WIN_SHARED_QUERY.

    Hybrid Parallel Programming

    Further reading: Torsten Hoefler, James Dinan, Darius Buntinas, Pavan Balaji, Brian Barrett, Ron Brightwell, William Gropp, Vivek Kale, Rajeev Thakur: MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory. http://link.springer.com/content/pdf/10.1007%2Fs00607-013-0324-2.pdf


    NUMA effects? Significant impact of alloc_shared_noncontig.

    Slide 22 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Non-contiguous shared memory allocation

    Hybrid Parallel Programming


    MPI_Aint /*IN*/  local_window_count;
    double   /*OUT*/ *base_ptr;

    disp_unit = sizeof(double);   /* shared memory should contain doubles */
    MPI_Info info_noncontig;
    MPI_Info_create (&info_noncontig);
    MPI_Info_set (info_noncontig, "alloc_shared_noncontig", "true");
    MPI_Win_allocate_shared (local_window_count*disp_unit, disp_unit, info_noncontig,
                             comm_sm, &base_ptr, &win_sm);

    [Figure: each process's local_window_count doubles now form a separate, non-contiguous window portion within the SMP node, each starting at its own base_ptr]

    Slide 23 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Non-contiguous shared memory: neighbor access through MPI_WIN_SHARED_QUERY

    Each process can retrieve each neighbor's base_ptr with calls to MPI_WIN_SHARED_QUERY.
    Example: only pointers to the window memory of the left & right neighbor.

    Hybrid Parallel Programming


    if (my_rank_sm > 0)
        MPI_Win_shared_query (win_sm, my_rank_sm - 1,
                              &win_size_left, &disp_unit_left, &base_ptr_left);
    if (my_rank_sm < size_sm-1)
        MPI_Win_shared_query (win_sm, my_rank_sm + 1,
                              &win_size_right, &disp_unit_right, &base_ptr_right);

    MPI_Win_fence (0, win_sm);   /* local stores are finished, remote load epoch can start */

    if (my_rank_sm > 0)  printf("left neighbor's rightmost value = %lf \n",
                                base_ptr_left[ win_size_left/disp_unit_left - 1 ]);
    if (my_rank_sm < size_sm-1)  printf("right neighbor's leftmost value = %lf \n",
                                        base_ptr_right[ 0 ]);

    Thanks to Steffen Weise (TU Freiberg) for testing and correcting the example codes.

    Slide 24 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Other technical aspects with MPI_WIN_ALLOCATE_SHARED

    Caution: On some systems, the number of shared memory windows and the total size of shared memory windows may be limited. Some OS systems may provide options, e.g., at job launch or MPI process start, to enlarge restricting defaults.

    If MPI shared memory support is based on POSIX shared memory:
        Shared memory windows are located in memory-mapped /dev/shm
        Default: 25% or 50% of the physical memory, but a maximum of ~2043 windows!
        Root may change the size with: mount -o remount,size=6G /dev/shm
    Cray XT/XE/XC (XPMEM): no limits.
    On a system without virtual memory (like CNK on BG/Q), you have to reserve a chunk of address space when the node is booted (default is 64 MB).

    Thanks to Jeff Hammond and Jed Brown (ANL), Brian W Barrett (SANDIA), and Steffen Weise (TU Freiberg), for input and discussion.

    Hybrid Parallel Programming


    Another restriction in a low-quality MPI: MPI_COMM_SPLIT_TYPE may always return MPI_COMM_SELF (e.g., due to the default limit of context IDs in mpich).

    Slide 25 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Splitting the communicator without MPI_COMM_SPLIT_TYPE

    Alternative, if you want to group based on a fixed number size_sm of shared memory cores in comm_all:
        Based on sequential ranks in comm_all
        Pro: comm_sm can be restricted to ccNUMA locality domains
        Con: MPI does not guarantee MPI_WIN_ALLOCATE_SHARED() on a whole SMP node
             (MPI_COMM_SPLIT_TYPE() may return MPI_COMM_SELF or a partial SMP node)

    Hybrid Parallel Programming


    [Figure: comm_all ranks 0..15 grouped into consecutive blocks of size_sm = 4 ranks, each block forming one comm_sm]

    MPI_Comm_rank (comm_all, &my_rank);
    MPI_Comm_group (comm_all, &group_all);
    ranges[0][0] = (my_rank / size_sm) * size_sm;
    ranges[0][1] = ranges[0][0] + size_sm - 1;    /* e.g., ranges[0][] = {4,7,1} */
    ranges[0][2] = 1;
    MPI_Group_range_incl (group_all, 1, ranges, &group_sm);
    MPI_Comm_create (comm_all, group_sm, &comm_sm);
    MPI_Win_allocate_shared (...);

    To guarantee shared memory, one may add an additional
    MPI_Comm_split_type (comm_sm, MPI_COMM_TYPE_SHARED, 0,
                         MPI_INFO_NULL, &comm_sm_really);

    Slide 26 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Pure OpenMP (on the cluster)

    A distributed virtual shared memory system is needed. It must support clusters of SMP nodes, e.g.:
        Shared memory parallelization inside of SMP nodes
        Communication of modified parts of pages at OpenMP flush (part of each OpenMP barrier)

    OpenMP only: distributed virtual shared memory

    i.e., the OpenMP memory and parallelization model is prepared for clusters!

    Rule of thumb: communication may be 10 times slower than with MPI.

    Slide 27 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Outline
        Introduction / Motivation
        Programming models on clusters of SMP nodes
        Case Studies / pure MPI vs hybrid MPI+OpenMP
            The Multi-Zone NAS Parallel Benchmarks
            For each application we discuss:
                Benchmark implementations based on different strategies and programming paradigms
                Performance results and analysis on different hardware architectures
            Compilation and Execution Summary
            Gabriele Jost (Supersmith, Maximum Performance Software)
        Hybrid programming & accelerators
        Practical How-To on hybrid programming
        Mismatch Problems
        Opportunities: Application categories that can benefit from hybrid parallelization
        Other options on clusters of SMP nodes
        Summary

    Slide 28 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    The Multi-Zone NAS Parallel Benchmarks

    Hybrid Parallel Programming

    [Figure: time-stepping structure of the multi-zone benchmarks: sequential time step, boundary exchange (MPI send/recv or OpenMP data copy + sync), inter-zone parallelization (MPI processes or MLP processes), intra-zone parallelization with OpenMP; variants: sequential, MPI/OpenMP, nested OpenMP/MLP]

    Multi-zone versions of the NAS Parallel Benchmarks LU, SP, and BT
    Two hybrid sample implementations
    Load balance heuristics are part of the sample codes
    www.nas.nasa.gov/Resources/Software/software.html

    Slide 29 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    MPI/OpenMP BT-MZ

        call omp_set_numthreads (weight)
        do step = 1, itmax
          call exch_qbc(u, qbc, nx, ...)      ! calls mpi_send/recv
          do zone = 1, num_zones
            if (iam .eq. pzone_id(zone)) then
              call zsolve(u, rsd, ...)
            end if
          end do
        end do
        ...

        subroutine zsolve(u, rsd, ...)
        ...
        !$OMP PARALLEL DEFAULT(SHARED)
        !$OMP& PRIVATE(m,i,j,k...)
        do k = 2, nz-1
        !$OMP DO
          do j = 2, ny-1
            do i = 2, nx-1
              do m = 1, 5
                u(m,i,j,k) = dt*rsd(m,i,j,k-1)
              end do
            end do
          end do
        !$OMP END DO NOWAIT
        end do
        ...
        !$OMP END PARALLEL

    Slide 30 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    MPI/OpenMP LU-MZ

        call omp_set_numthreads (weight)
        do step = 1, itmax
          call exch_qbc(u, qbc, nx, ...)
          do zone = 1, num_zones
            if (iam .eq. pzone_id(zone)) then
              call ssor
            end if
          end do
        end do
        ...

    Hybrid Parallel Programming

    Slide 31 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Pipelined Thread Execution in SSOR

        subroutine ssor
        !$OMP PARALLEL DEFAULT(SHARED)
        !$OMP& PRIVATE(m,i,j,k...)
        call sync1 ()
        do k = 2, nz-1
        !$OMP DO
          do j = 2, ny-1
            do i = 2, nx-1
              do m = 1, 5
                rsd(m,i,j,k) = dt*rsd(m,i,j,k-1) + ...
              end do
            end do
          end do
        !$OMP END DO nowait
        end do
        call sync2 ()
        ...
        !$OMP END PARALLEL
        ...

        subroutine sync1
          neigh = iam - 1
          do while (isync(neigh) .eq. 0)
        !$OMP FLUSH(isync)
          end do
          isync(neigh) = 0
        !$OMP FLUSH(isync)

        subroutine sync2
          neigh = iam - 1
          do while (isync(neigh) .eq. 1)
        !$OMP FLUSH(isync)
          end do
          isync(neigh) = 1
        !$OMP FLUSH(isync)

    Slide 32 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Golden Rule for ccNUMA: First touch

        c---------------------------------------------------------------------
        c do one time step to touch all data
        c---------------------------------------------------------------------
              do iz = 1, proc_num_zones
                zone = proc_zone_id(iz)
                call adi(rho_i(start1(iz)), us(start1(iz)),
         $               vs(start1(iz)), ws(start1(iz)), ...)
              end do
              do iz = 1, proc_num_zones
                zone = proc_zone_id(iz)
                call initialize(u(start5(iz)), ...)
              end do

    Hybrid Parallel Programming

    A memory page gets mapped into the local memory of the processor that first touches it!

    "touch" means "write", not "allocate"

    All benchmarks use first touch policy to achieve good memory placement!

    Slide 33 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Benchmark Characteristics

    Aggregate sizes:
        Class D: 1632 x 1216 x 34 grid points
        Class E: 4224 x 3456 x 92 grid points

    BT-MZ (Block Tridiagonal simulated CFD application):
        Alternating Direction Implicit (ADI) method
        #Zones: 1024 (D), 4096 (E)
        Size of the zones varies widely: large/small ratio about 20
            requires multi-level parallelism to achieve a good load balance
        Expectation: pure MPI has load-balancing problems; good candidate for MPI+OpenMP

    LU-MZ (LU decomposition simulated CFD application):
        SSOR method (2D pipelined method)
        #Zones: 16 (all classes)
        Size of the zones identical: no load balancing required, but limited parallelism on the outer level
        Expectation: limited MPI parallelism; MPI+OpenMP increases parallelism

    SP-MZ (Scalar Pentadiagonal simulated CFD application):
        #Zones: 1024 (D), 4096 (E)
        Size of the zones identical: no load balancing required
        Expectation: load-balanced on the MPI level; pure MPI should perform best

    Slide 34 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid code on cc-NUMA architectures

    OpenMP:
        Affinity support only within each MPI process
        Version 3.1 has support for binding of threads via the OMP_PROC_BIND environment variable
        Under consideration for Version 4.0: OMP_PROC_SET or OMP_LIST to restrict execution to a subset of the machine; OMP_AFFINITY to influence how the threads are distributed and bound on the machine
        Version 4.0 announced at SC12
    MPI:
        Initially not designed for NUMA architectures or for mixing threads and processes; MPI-2 supports threads in MPI
        The API does not provide support for memory/thread placement
    Vendor-specific APIs to control thread and memory placement:
        Environment variables
        System commands like numactl, taskset, dplace, omplace, etc.
        http://www.halobates.de/numaapi3.pdf
        More in the How-To section

    Hybrid Parallel Programming

    Slide 35 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Dell Linux Cluster Lonestar

    Hybrid Parallel Programming

    Located at the Texas Advanced Computing Center (TACC), University of Texas at Austin (http://www.tacc.utexas.edu)
        1888 nodes, 2 Intel Xeon 6-core 64-bit Westmere processors per node, 3.33 GHz, 24 GB memory per node
        Peak performance 160 Gflops per node
        3 channels from each processor's memory controller to 3 DDR3 ECC DIMMs, 1333 MHz
        Processor interconnect: QPI, 6.4 GT/s
        Node interconnect: InfiniBand Mellanox switches, fat-tree topology, 40 Gbit/s point-to-point bandwidth
        More details: http://www.tacc.utexas.edu/user-services/user-guides/lonestar-user-guide

    Compiling the benchmarks:
        ifort 11.1, options: -O3 -ipo -openmp -mcmodel=medium
    Running the benchmarks:
        MVAPICH2
        setenv OMP_NUM_THREADS ...
        ibrun tacc_affinity ./bt-mz.x

    Slide 36 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    NUMA Control (numactl): example run script

    Hybrid Parallel Programming

        #!/bin/csh
        #$ -cwd
        #$ -j y
        #$ -q systest
        #$ -pe 12way 24
        #$ -V
        #$ -l h_rt=00:10:00
        setenv OMP_NUM_THREADS 1
        setenv MY_NSLOTS 16
        ibrun tacc_affinity ./bin/sp-mz.D.

    Notes:
        -pe 12way 24: run 12 MPI processes per node, allocate 24 cores (2 nodes) altogether
        OMP_NUM_THREADS 1: 1 thread per MPI process
        MY_NSLOTS 16: only use 16 of the 24 cores for MPI. NOTE: 8 cores unused!
        tacc_affinity: numactl script for process/thread placement
        ibrun: command to run the MPI job

    Slide 37 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    NUMA Operations

    Hybrid Parallel Programming

    Type              Command / option       Arguments                    Description
    Socket affinity   numactl -c             {0,1,2,3}                    Only execute the process on cores of this (these) socket(s).
    Memory policy     numactl -l             (no argument)                Allocate on the current socket.
    Memory policy     numactl -i             {0,1,2,3}                    Allocate round robin (interleave) on these sockets.
    Memory policy     numactl --preferred=   {0,1,2,3} (select only one)  Allocate on this socket; fall back to any other if full.
    Memory policy     numactl -m             {0,1,2,3}                    Only allocate on this (these) socket(s).
    Core affinity     numactl -C             {0,1,2,...,15}               Only execute the process on this (these) core(s).
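    For example (illustrative, combining the options listed above exactly as the tutorial's own pinning scripts on the following slides do), binding both execution and memory allocation of a process to socket 0:

        numactl -c 0 -m 0 ./a.out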

    Slide 38 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Example numactl script

    Hybrid Parallel Programming

        myway=`echo $PE | sed s/way//`
        export MV2_USE_AFFINITY=0
        export MV2_ENABLE_AFFINITY=0
        my_rank=$PMI_RANK
        local_rank=$(( my_rank % myway ))
        if [ $myway -eq 12 ]; then
          numnode=$(( local_rank / 6 ))
        fi
        exec numactl -c $numnode -m $numnode $*

    Slide 39 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Dell Linux Cluster Lonestar Topology

    Hybrid Parallel Programming

    Slide 40 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Dell Linux Cluster Lonestar Topology

    Hybrid Parallel Programming

        CPU type: Intel Core Westmere processor
        *************************************************
        Hardware Thread Topology
        *************************************************
        Sockets:          2
        Cores per socket: 6
        Threads per core: 1
        -------------------------------------------------
        Socket 0: ( 1 3 5 7 9 11 )
        Socket 1: ( 0 2 4 6 8 10 )
        -------------------------------------------------

    Careful! The numbering scheme of cores is system dependent.

    Slide 41 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    [Chart: SP-MZ and BT-MZ performance with no idle cores on 64, 128, 256, 512, and 1024 nodes]

    BT-MZ improves using hybrid mode, as expected, due to better load balance.
    Unexpected: SP-MZ improves in some cases using hybrid mode.

    Slide 42 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Pitfall (1): Running 2 threads on the same core

    Hybrid Parallel Programming

    09/26/07, Author: Gabriele Jost

    Running NPB BT-MZ Class D, 128 MPI processes, 12 threads each, 1 MPI process per node (1way):
        Pinning A: exec numactl -c 0 -m 0 $*
            Only use cores and memory on socket 0: 12 threads on 6 cores!
        Pinning B: exec numactl -c 0,1 -m 0,1 $*
            Use cores and memory on both sockets.

    Slide 43 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Pitfall (2): Causing remote memory access

    Hybrid Parallel Programming

    09/26/07, Author: Gabriele Jost

    Running NPB BT-MZ Class D, 128 MPI processes, 6 threads each, 2 MPI processes per node:
        Pinning A:
            if [ $localrank == 0 ]; then
              exec numactl --physcpubind=0,1,2,3,4,5 -m 0 $*
            elif [ $localrank == 1 ]; then
              exec numactl --physcpubind=6,7,8,9,10,11 -m 1 $*
            fi
        Pinning B:
            if [ $localrank == 0 ]; then
              exec numactl --physcpubind=0,2,4,6,8,10 -m 0 $*
            elif [ $localrank == 1 ]; then
              exec numactl --physcpubind=1,3,5,7,9,11 -m 1 $*
            fi

    [Chart annotations: ~600 Gflops when half of the threads access remote memory vs. ~900 Gflops with only local memory access]

    Slide 44 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    LU-MZ Class D Scalability on Lonestar

    Hybrid Parallel Programming

    LU-MZ significantly benefits from hybrid mode:
        Pure MPI is limited to 16 cores, due to #zones = 16 (all other cores stay idle)
        The decrease of resource contention contributes a large part of the improvement

    [Chart: Gflops for LU-MZ Class D with different MPI x OpenMP combinations]

    Slide 45 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Cray XE6 Hermit

    Hybrid Parallel Programming

        CPU type: AMD Interlagos processor
        *************************************************************
        Hardware Thread Topology
        *************************************************************
        Sockets:          2
        Cores per socket: 16
        Threads per core: 1
        [likwid-topology ASCII output per socket: 16 cores with 16 kB L1 each, 2 MB L2 shared by pairs of cores, and two 6 MB L3 caches shared by 8 cores each]

    Located at HLRS Stuttgart, Germany (https://wickie.hlrs.de/platforms/index.php/Cray_XE6)
        3552 compute nodes, 113,664 cores
        Two AMD 6276 Interlagos processors with 16 cores each, running at 2.3 GHz (TurboCore 3.3 GHz) per node
        Around 1 Pflop theoretical peak performance
        32 GB of main memory available per node
        32-way shared memory system
        High-bandwidth interconnect using Cray Gemini communication chips

    Slide 46 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    32K cores

    16K cores8K

    cores

    Expected:BT-MZ benefits from hybrid approach:- high number of MPI processes yields bad workload distribution-Best MPIxOMP combination depends on problem sizeExpected:-Both benchmarks benefit by increasing parallelismUnexpected:SP-MZ improves when reducing number of MPI processesBT-MZ 1024x32 unexpected low performance

    Cray XE6 Hermit Scalability, continued

    Slide 47 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Cray XE6: CrayPat Performance Analysis

        module load xt-craypat
    Compilation:
        ftn -fastsse -r8 -mp[=trace] ...
    Instrumentation:
        pat_build -w -g mpi,omp bt.exe bt.exe.pat
    Execution:
        (export PAT_RT_HWPC={0,1,2,...})
        export OMP_NUM_THREADS=4
        aprun -n NPROCS -d 4 ./bt.exe.pat
    Generate report:
        pat_report -O load_balance,thread_times,program_time,mpi_callers -O profile_pe.th $1

    Hybrid Parallel Programming

    -d depth specifies the number of CPUs for each PE and its threads.

    Slide 48 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    BT-MZ 32x4 Function Profile

    Hybrid Parallel Programming

    Slide 49 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    BT-MZ Load-Balance 32x4 vs 128x1

    Hybrid Parallel Programming

    [Charts: CrayPat load-balance view for bt-mz-C.32x4 and bt-mz-C.128x1; the maximum, median, and minimum PE are shown]

    bt-mz.C.128x1 shows large imbalance in User and MPI time.
    bt-mz.C.32x4 shows well balanced times.

    Slide 50 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    IBM Power 6

    Results obtained by the courtesy of the HPCMO Program and the Engineer Research and Development Center Major Shared Resource Center, Vicksburg, MS (http://www.erdc.hpc.mil/index)

    The IBM Power 6 system is located at (http://www.navo.hpc.mil/davinci_about.html):
        150 compute nodes
        32 4.7 GHz Power6 cores per node (4800 cores total)
        64 GB of dedicated memory per node
        QLogic InfiniBand DDR interconnect
        IBM MPI: MPI 1.2 + MPI-IO
    Compilation:
        mpxlf_r -O4 -qarch=pwr6 -qtune=pwr6 -qsmp=omp
        (this flag was essential to achieve full compiler optimization in the presence of OMP directives!)
    Execution:
        poe launch $PBS_O_WORKDIR/sp.C.16x4.exe

    Hybrid Parallel Programming

    Slide 51 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    NPB-MZ Class D on IBM Power 6: Exploiting SMT for 2048-Core Results

    Hybrid Parallel Programming

    Results for 128 to 2048 cores; only 1024 cores were available for the experiments.
    BT-MZ and SP-MZ show benefit from Simultaneous Multithreading (SMT): 2048 threads on 1024 cores.

    [Chart: performance at 128, 256, 512, 1024, and (via SMT) 2048 cores; the best MPI x OpenMP combination of each category is marked]

    Doubling the number of threads through hyperthreading (SMT):
        #!/bin/csh
        #PBS -l select=32:ncpus=64:mpiprocs=NP:ompthreads=NT

    Slide 52 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    MPI+OpenMP memory usage of NPB-MZ

    Hybrid Parallel Programming

    At the same total number of cores, using more OpenMP threads reduces the memory usage substantially, up to five times on Hopper, Cray XT5 (eight-core nodes).

    Hongzhang Shan, Haoqiang Jin, Karl Fuerlinger, Alice Koniges, Nicholas J. Wright: Analyzing the Effect of Different Programming Models Upon Performance and Memory Usage on Cray XT5 Platforms. Proceedings, CUG 2010, Edinburgh, GB, May 24-27, 2010.

    Slide 53 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Conclusions:

    Hybrid Parallel Programming

    BT-MZ:
        Inherent workload imbalance on the MPI level
        #nprocs = #nzones yields poor performance
        #nprocs < #nzones gives better workload balance, but decreases parallelism
        Hybrid MPI/OpenMP yields better load balance and maintains the amount of parallelism
    SP-MZ:
        No workload imbalance on the MPI level; pure MPI should perform best
        MPI/OpenMP outperforms pure MPI on some platforms due to contention for network access within a node
    LU-MZ:
        Hybrid MPI/OpenMP increases the level of parallelism
    All benchmarks:
        Hybrid mode decreases network pressure and lowers memory requirements
        Good process/thread affinity is essential

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Outline

    Introduction / Motivation
    Programming models on clusters of SMP nodes
    Case Studies / pure MPI vs hybrid MPI+OpenMP
    Hybrid programming & accelerators
    Practical How-To on hybrid programming
        Georg Hager, Regionales Rechenzentrum Erlangen (RRZE)
    Mismatch Problems
    Application categories that can benefit from hybrid parallelization
    Other options on clusters of SMP nodes
    Summary

    Hybrid Parallel Programming

    Slide 55 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Programming How-To: Overview

    A practical introduction to hybrid programming:
        How to compile and link
        Getting a hybrid program to run on a cluster
    Running (hybrid) programs efficiently on multi-core clusters:
        Affinity issues
        ccNUMA
        Bandwidth bottlenecks
    MPI and OpenMP on real hardware: intra-node anisotropy
        MPI communication characteristics
        OpenMP loop startup overhead
    Thread/process binding

    Hybrid Parallel Programming

    Slide 56 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    How to compile, link and run

    Use the appropriate OpenMP compiler switch (-openmp, -fopenmp, -mp, -qsmp=openmp, ...) and the MPI compiler script (if available).

    Link with the MPI library:
        Usually wrapped in the MPI compiler script
        If required, specify linking against a thread-safe MPI library; often automatic when OpenMP or auto-parallelization is switched on

    Running the code:
        Highly non-portable! Consult the system docs (if available).
        If you are on your own, consider the following points:
            Make sure OMP_NUM_THREADS etc. is available on all MPI processes, e.g., start "env VAR=VALUE <binary>" instead of your binary alone
            Use Pete Wyckoff's mpiexec MPI launcher (see below): http://www.osc.edu/~pw/mpiexec
            Figure out how to start fewer MPI processes than cores on your nodes

    Hybrid Parallel Programming

    Slide 57 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Examples for compilation and execution

    Hybrid Parallel Programming

    Cray XE6 (4 NUMA domains w/ 8 cores each):
        ftn -h omp ...
        export OMP_NUM_THREADS=8
        aprun -n nprocs -N nprocs_per_node \
              -d $OMP_NUM_THREADS a.out

    Intel Sandy Bridge (8-core 2-socket) cluster, Intel MPI/OpenMP:
        mpiifort -openmp ...
        OMP_NUM_THREADS=8 mpirun -ppn 2 -np 4 \
            -env I_MPI_PIN_DOMAIN socket \
            -env KMP_AFFINITY scatter ./a.out

    Slide 58 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Interlude: Advantages of mpiexec or similar mechanisms

    The startup mechanism should use a resource manager interface to spawn MPI processes on nodes, as opposed to starting remote processes with ssh/rsh:
        Correct CPU time accounting in the batch system
        Faster startup
        Safe process termination
        Password-less user login between nodes not required
    It interfaces directly with the batch system to determine the number of processes.
    Provisions for starting fewer processes per node than available cores:
        Required for hybrid programming
        E.g., the -pernode and -npernode # options; no messing around with nodefiles required

    Hybrid Parallel Programming

    Slide 59 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Running the code: more examples (with mpiexec)

    Example for using mpiexec on a dual-socket quad-core cluster:
        $ export OMP_NUM_THREADS=8
        $ mpiexec -pernode ./a.out

    Same, but 2 MPI processes per node:
        $ export OMP_NUM_THREADS=4
        $ mpiexec -npernode 2 ./a.out

    Pure MPI:
        $ export OMP_NUM_THREADS=1   # or nothing if serial code
        $ mpiexec ./a.out

    Hybrid Parallel Programming

    Where do the threads run? See later!

    Slide 60 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Running the code efficiently?

    Symmetric, UMA-type compute nodes have become rare animals:
        NEC SX
        Intel 1-socket (Xeon 12XX): rare in cluster environments
        Hitachi SR8000, IBM SP2, single-core multi-socket Intel Xeon (all dead)
    Instead, systems have become non-isotropic on the node level:
        ccNUMA (AMD Opteron, SGI Altix, IBM Power7, Intel Sandy/Ivy Bridge)
        Multi-core, multi-socket
            Shared vs. separate caches
            Multi-chip vs. single-chip
            Separate/shared buses

    Hybrid Parallel Programming

    Slide 61 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Issues for running code efficiently on non-isotropic nodes

    ccNUMA locality effects:
        Penalties for access across locality domains
        Impact of contention
        Consequences of file I/O for page placement
        Placement of MPI buffers
    Multi-core / multi-socket anisotropy effects:
        Bandwidth bottlenecks, shared caches
        Intra-node MPI performance: core-core vs. socket-socket
        OpenMP loop overhead depends on the mutual position of threads in the team

    Hybrid Parallel Programming

    Slide 62 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    A short introduction to ccNUMA

    ccNUMA: the whole memory is transparently accessible by all processors, but it is physically distributed, with varying bandwidth and latency, and potential contention (shared memory paths).

    Hybrid Parallel Programming

    [Figure: two locality domains, each with several cores (C) attached to local memory (M)]

    Slide 63 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    How much bandwidth does non-local access cost?

    Example: AMD Magny Cours 4-socket system (8 chips, 4 sockets), STREAM Triad bandwidth measurements.

    Hybrid Parallel Programming

    [Figure: STREAM Triad bandwidth matrix across the 8 ccNUMA domains (0..7)]

    Slide 64 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    How much bandwidth does non-local access cost?

    Example: Intel Sandy Bridge 2-socket system (2 chips, 2 sockets), STREAM Triad bandwidth measurements.

    Hybrid Parallel Programming

    [Figure: STREAM Triad bandwidth for local vs. non-local access between the two ccNUMA domains (0, 1)]

    General rule: the more ccNUMA domains, the larger the non-local access penalty.

    Slide 65 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    ccNUMA Memory Locality Problems

    Locality of reference is key to scalable performance on ccNUMA; less of a problem with pure MPI, but see below.

    What factors can destroy locality?
    MPI programming:
        Processes lose their association with the CPU the mapping took place on originally
        The OS kernel tries to maintain strong affinity, but sometimes fails
    Shared memory programming (OpenMP, hybrid):
        Threads lose their association with the CPU the mapping took place on originally
        Improper initialization of distributed data
        Lots of extra threads are running on a node, especially for hybrid
    All cases:
        Other agents (e.g., the OS kernel) may fill memory with data that prevents optimal placement of user data

    Hybrid Parallel Programming

    Slide 66 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Avoiding locality problems

    How can we make sure that memory ends up where it is close to the CPU that uses it? See the following slides

    How can we make sure that it stays that way throughout program execution? See end of section

    Hybrid Parallel Programming

    Slide 67 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Solving Memory Locality Problems: First Touch

    "Golden Rule" of ccNUMA:A memory page gets mapped into the local memory of the processor that first touches it! Except if there is not enough local memory available this might be a problem, see later Some OSs allow to influence placement in more direct ways

    cf. libnuma (Linux), MPO (Solaris), Caveat: "touch" means "write", not "allocate" Example:

    double *huge = (double*)malloc(N*sizeof(double));// memory not mapped yetfor(i=0; i
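    Because the OpenMP worker threads, not only the master, will later access the data, the first touch should itself be done in parallel; a minimal C sketch (illustrative, assuming a static schedule that matches the later compute loop):

    #include <stdlib.h>

    /* Initialize (first-touch) the array with the same static OpenMP
       schedule that the compute loop will use, so each page is mapped
       into the memory domain of the thread that will later work on it. */
    double *alloc_and_first_touch(long n)
    {
        double *a = malloc(n * sizeof(double));   /* memory not mapped yet */
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++)
            a[i] = 0.0;                           /* pages mapped here, distributed */
        return a;
    }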

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Most simple case: explicit initialization

    Hybrid Parallel Programming

    Without parallel initialization (all pages of A end up in one locality domain):
        integer,parameter :: N=10000000
        double precision A(N), B(N)
        A=0.d0
        !$OMP parallel do
        do i = 1, N
          B(i) = function ( A(i) )
        end do
        !$OMP end parallel do

    With parallel (first-touch) initialization:
        integer,parameter :: N=10000000
        double precision A(N), B(N)
        !$OMP parallel
        !$OMP do schedule(static)
        do i = 1, N
          A(i)=0.d0
        end do
        !$OMP end do
        ...
        !$OMP do schedule(static)
        do i = 1, N
          B(i) = function ( A(i) )
        end do
        !$OMP end do
        !$OMP end parallel

    Slide 69 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    ccNUMA problems beyond first touch

    The OS uses part of main memory for the disk buffer (FS) cache. If the FS cache fills part of memory, apps will probably allocate from foreign domains, i.e., non-local access!
    This is a locality problem even on hybrid and pure MPI with asymmetric file I/O, i.e., if not all MPI processes perform I/O.

    Remedies:
        Drop FS cache pages after the user job has run (admin's job); only prevents cross-job buffer cache heritage
        Sweeper code (run by the user)
        Flush the buffer cache after I/O if necessary (sync is not sufficient!)

    Hybrid Parallel Programming

    [Figure: two ccNUMA domains with cores P0..P3, caches, and memory interfaces; buffer cache (BC) in one domain displaces application data into the other domain]

    Slide 70 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    ccNUMA problems beyond first touch: buffer cache

    Real-world example: ccNUMA and the Linux buffer cache. Benchmark:
        1. Write a file of some size from LD0 to disk
        2. Perform a bandwidth benchmark using all cores in LD0 and the maximum memory installed in LD0

    Result: By default, the buffer cache is given priority over local page placement; restrict it to the local domain if possible!

    Hybrid Parallel Programming

    Cray: aprun -ss

    Slide 71 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Intra-node MPI characteristics: IMB Ping-Pong benchmark

    Code (to be run on 2 cores), on a Cray XE6 node:
        Intranode (1S): aprun -n 2 -cc 0,1  ./a.out
        Intranode (2S): aprun -n 2 -cc 0,16 ./a.out
        Internode:      aprun -n 2 -N 1     ./a.out

    Hybrid Parallel Programming

        wc = MPI_WTIME()
        do i=1,NREPEAT
          if(rank.eq.0) then
            call MPI_SEND(buffer,N,MPI_BYTE,1,0,MPI_COMM_WORLD,ierr)
            call MPI_RECV(buffer,N,MPI_BYTE,1,0,MPI_COMM_WORLD, &
                          status,ierr)
          else
            call MPI_RECV(...)
            call MPI_SEND(...)
          endif
        enddo
        wc = MPI_WTIME() - wc

    Slide 72 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    IMB Ping-Pong: Latency, Intra-node vs. Inter-node on Cray XE6

    [Chart: small-message latency in µs: internode 1.8, intranode 2S (different sockets) 0.56, intranode 1S (same socket) 0.3]

    Hybrid Parallel Programming

    Affinity matters!

    Slide 73 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    IMB Ping-Pong: Bandwidth Characteristics, Intra-node vs. Inter-node on Cray XE6

    Hybrid Parallel Programming

    [Chart: bandwidth vs. message size: between two cores of one socket, between two sockets of one node, and between two nodes via the interconnect]

    Slide 74 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    The throughput-parallel vector triad benchmark: microbenchmarking for architectural exploration

    Every core runs its own, independent triad benchmark: pure hardware probing, no impact from OpenMP overhead.

    Hybrid Parallel Programming

        double precision, dimension(:), allocatable :: A,B,C,D

        !$OMP PARALLEL private(i,j,A,B,C,D)
        allocate(A(1:N),B(1:N),C(1:N),D(1:N))
        A=1.d0; B=A; C=A; D=A
        do j=1,NITER
          do i=1,N
            A(i) = B(i) + C(i) * D(i)
          enddo
          if(.something.that.is.never.true.) then
            call dummy(A,B,C,D)
          endif
        enddo
        !$OMP END PARALLEL

    Slide 75 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Throughput vector triad on Sandy Bridge socket (3 GHz)

    Hybrid Parallel Programming

    [Chart: scalable bandwidth in the L1, L2, and L3 caches; saturation effect in memory]

    Slide 76 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    The OpenMP-parallel vector triad benchmark: visualizing OpenMP overhead

    OpenMP work sharing in the benchmark loop

    Hybrid Parallel Programming

        double precision, dimension(:), allocatable :: A,B,C,D

        allocate(A(1:N),B(1:N),C(1:N),D(1:N))
        A=1.d0; B=A; C=A; D=A
        !$OMP PARALLEL private(i,j)
        do j=1,NITER
        !$OMP DO
          do i=1,N
            A(i) = B(i) + C(i) * D(i)
          enddo
        !$OMP END DO          ! implicit barrier
          if(.something.that.is.never.true.) then
            call dummy(A,B,C,D)
          endif
        enddo
        !$OMP END PARALLEL

    Slide 77 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    OpenMP vector triad on Sandy Bridge socket (3 GHz)

    Hybrid Parallel Programming

    sync overhead grows with # of threads

    bandwidth scalability across memory interfaces

    Slide 78 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Thread synchronization overhead on Sandy Bridge-EP: direct measurement of barrier overhead in CPU cycles

    Hybrid Parallel Programming

        2 Threads       Intel 13.1.0   GCC 4.7.0   GCC 4.6.1
        Shared L3            384          5242        4616
        SMT threads         2509          3726        3399
        Other socket        1375          5959        4909

        Full domain     Intel 13.1.0   GCC 4.7.0   GCC 4.6.1
        Socket              1497         14546       14418
        Node                3401         34667       29788
        Node + SMT          6881         59038       58898

    gcc is still not very competitive with the Intel compiler.

    Slide 79 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Thread/Process Affinity ("Pinning")

    Highly OS-dependent system calls, but available on all systems:
        Linux: sched_setaffinity(), PLPA, hwloc
        Windows: SetThreadAffinityMask()
    Support for semi-automatic pinning in some compilers/environments:
        Intel compilers > V9.1 (KMP_AFFINITY environment variable)
        Pathscale
        Generic Linux: taskset, numactl, likwid-pin (see below)
        OpenMP 4.0: support for affinity
    Affinity awareness in MPI libraries:
        Cray MPI, OpenMPI, Intel MPI

    Hybrid Parallel Programming

    Slide 80 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    How do we figure out the topology?

    ... and how do we enforce the mapping without changing the code? Compilers and MPI libs may give you ways to do that, but LIKWID supports all sorts of combinations:

    LIKWID = "Like I Knew What I'm Doing"

    Open source tool collection (developed at RRZE): http://code.google.com/p/likwid

    Hybrid Parallel Programming

    Slide 81 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Likwid Tool Suite

    Command line tools for Linux:
        Works with a standard Linux >= 2.6 kernel
        Supports Intel and AMD CPUs
        Supports all compilers whose OpenMP implementation is based on pthreads

    Current tools:
        likwid-topology: print thread and cache topology (similar to lstopo from the hwloc package)
        likwid-pin: pin a threaded application without touching the code
        likwid-perfctr: measure performance counters
        likwid-perfscope: performance oscilloscope w/ real-time display
        likwid-powermeter: current power consumption of the chip (alpha stage)
        likwid-features: view and enable/disable hardware prefetchers
        likwid-bench: low-level bandwidth benchmark generator tool
        likwid-mpirun: mpirun wrapper script for easy LIKWID integration

    Hybrid Parallel Programming

    Slide 82 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    likwid-topology: topology information

    Based on cpuid information. Functionality:
        Measured clock frequency
        Thread topology
        Cache topology
        Cache parameters (-c command line switch)
        ASCII art output (-g command line switch)
    Currently supported:
        Intel Core 2 (45 nm + 65 nm)
        Intel Nehalem, Westmere, Sandy Bridge
        AMD Magny Cours, Interlagos
        Intel Xeon Phi in beta stage

    Hybrid Parallel Programming

    Slide 83 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Output of likwid-topology

    Hybrid Parallel Programming

CPU name:   Intel Core i7 processor
CPU clock:  2666683826 Hz
*************************************************************
Hardware Thread Topology
*************************************************************
Sockets:            2
Cores per socket:   4
Threads per core:   2
-------------------------------------------------------------
HWThread  Thread  Core  Socket
0         0       0     0
1         1       0     0
2         0       1     0
3         1       1     0
4         0       2     0
5         1       2     0
6         0       3     0
7         1       3     0
8         0       0     1
9         1       0     1
10        0       1     1
11        1       1     1
12        0       2     1
13        1       2     1
14        0       3     1
15        1       3     1
-------------------------------------------------------------

    Slide 84 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    likwid-topology continued

    and also try the ultra-cool -g option!

    Hybrid Parallel Programming

Socket 0: ( 0 1 2 3 4 5 6 7 )
Socket 1: ( 8 9 10 11 12 13 14 15 )
-------------------------------------------------------------
*************************************************************
Cache Topology
*************************************************************
Level:  1
Size:   32 kB
Cache groups:  ( 0 1 ) ( 2 3 ) ( 4 5 ) ( 6 7 ) ( 8 9 ) ( 10 11 ) ( 12 13 ) ( 14 15 )
-------------------------------------------------------------
Level:  2
Size:   256 kB
Cache groups:  ( 0 1 ) ( 2 3 ) ( 4 5 ) ( 6 7 ) ( 8 9 ) ( 10 11 ) ( 12 13 ) ( 14 15 )
-------------------------------------------------------------
Level:  3
Size:   8 MB
Cache groups:  ( 0 1 2 3 4 5 6 7 ) ( 8 9 10 11 12 13 14 15 )
-------------------------------------------------------------

    Slide 85 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

likwid-pin

  Inspired by and based on ptoverride (Michael Meier, RRZE) and taskset
  Pins processes and threads to specific cores without touching code
  Directly supports pthreads, gcc OpenMP, Intel OpenMP
  Allows the user to specify a skip mask (i.e., supports many different compiler/MPI combinations)
  Can also be used as a replacement for taskset
  Uses logical (contiguous) core numbering when running inside a restricted set of cores
  Supports logical core numbering inside node, socket, core

Usage examples:
  env OMP_NUM_THREADS=6 likwid-pin -c 0,1,2,4-6 ./myApp parameters
  env OMP_NUM_THREADS=6 likwid-pin -c S0:0-2@S1:0-2 ./myApp

    Hybrid Parallel ProgrammingSlide 86 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

Example: STREAM benchmark on 12-core Intel Westmere: anarchy vs. thread pinning

(Charts: bandwidth without pinning vs. with pinning round-robin across sockets, physical cores first.)

    Slide 87 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

likwid-pin example (Intel OpenMP): running the STREAM benchmark with likwid-pin:

    Hybrid Parallel Programming

$ export OMP_NUM_THREADS=4
$ likwid-pin -c 0,1,4,5 ./stream
[likwid-pin] Main PID -> core 0 - OK
----------------------------------------------
Double precision appears to have 16 digits of accuracy
Assuming 8 bytes per DOUBLE PRECISION word
----------------------------------------------
[... some STREAM output omitted ...]
The *best* time for each test is used
*EXCLUDING* the first and last iterations
[pthread wrapper] PIN_MASK: 0->1 1->4 2->5
[pthread wrapper] SKIP MASK: 0x1
[pthread wrapper 0] Notice: Using libpthread.so.0
        threadid 1073809728 -> SKIP
[pthread wrapper 1] Notice: Using libpthread.so.0
        threadid 1078008128 -> core 1 - OK
[pthread wrapper 2] Notice: Using libpthread.so.0
        threadid 1082206528 -> core 4 - OK
[pthread wrapper 3] Notice: Using libpthread.so.0
        threadid 1086404928 -> core 5 - OK
[... rest of STREAM output omitted ...]

(Callouts: the shepherd thread is skipped; the main PID is always pinned; all spawned threads are pinned in turn.)

    Slide 88 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

Topology (mapping) choices with MPI+OpenMP:
More examples using Intel MPI+compiler & home-grown mpirun

Hybrid Parallel Programming

One MPI process per node:
  env OMP_NUM_THREADS=8 mpirun -pernode \
      likwid-pin -c 0-7 ./a.out

One MPI process per socket:
  env OMP_NUM_THREADS=4 mpirun -npernode 2 \
      -pin "0,1,2,3_4,5,6,7" ./a.out

One MPI process per socket, OpenMP threads pinned round robin across the cores in the node:
  env OMP_NUM_THREADS=4 mpirun -npernode 2 \
      -pin "0,1,4,5_2,3,6,7" \
      likwid-pin -c L:0,2,1,3 ./a.out

Two MPI processes per socket:
  env OMP_NUM_THREADS=2 mpirun -npernode 4 \
      -pin "0,1_2,3_4,5_6,7" \
      likwid-pin -c L:0,1 ./a.out

    Slide 89 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    MPI/OpenMP hybrid how-to: Take-home messages

    Learn how to take control of hybrid execution!

Always observe the topology dependence of
  intranode MPI,
  OpenMP overheads,
  saturation effects / scalability behavior with bandwidth-bound code.

Enforce proper thread/process-to-core binding, using appropriate tools (whatever you use, but use SOMETHING).

Multi-LD (multi-locality-domain) OpenMP processes on ccNUMA nodes require correct page placement.

    Hybrid Parallel ProgrammingSlide 90 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

Outline
  Introduction / Motivation
  Programming models on clusters of SMP nodes
  Case Studies / pure MPI vs hybrid MPI+OpenMP
  Hybrid programming & accelerators
  Practical How-To on hybrid programming
  Mismatch Problems
  Opportunities: Application categories that can benefit from hybrid parallelization
  Other options on clusters of SMP nodes
  Summary

    Slide 91 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Mismatch Problems

None of the programming models fits the hierarchical hardware (cluster of SMP nodes).

Several mismatch problems (see the following slides).

Benefit through hybrid programming: opportunities, see next section.

Quantitative implications depend on your application.

Examples:                                   No. 1   No. 2
Benefit through hybrid (see next section)   +30%    +10%
Loss by mismatch problems                   -10%    -25%
Total                                       +20%    -15%

In most cases: both categories!

(Hardware hierarchy: core -> CPU (socket) -> SMP board -> ccNUMA node -> cluster of ccNUMA/SMP nodes.)

    Slide 92 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    The Topology Problem

Problem: the application topology must be mapped to the hardware topology;
this determines the communication topology and the message sizes,
and hence the communication overhead.

Partially independent of the programming model:
  pure MPI / hybrid MPI+OpenMP / hybrid MPI+MPI

Simplifications:
  SMP node: O(N^3) data items per node
  Cluster network: O(N^2) neighbor communication per neighbor

    Slide 93 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

The Topology Problem, without inner halo communication
(hybrid MPI+OpenMP / hybrid MPI+MPI)

Communication only through neighbor accesses between the d ccNUMA domains.
Compare the communication between the ccNUMA domains
(s = communication size per domain; example: d = 8 ccNUMA domains):

  1-dimensional data decomposition:           s ~ 2 N^2
  3-dimensional data decomposition (2x2x2):   s ~ 6 (N/2)^2 = 1.5 N^2

No real win! Don't care about dimensions within the SMP nodes!
Make your software simple!

    Hybrid Parallel ProgrammingSlide 94 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

The Topology Problem, with inner halo communication
(pure MPI / hybrid MPI+MPI; s = communication size per core)

With halo cells and halo communication between the cores;
we ignore differences in core-to-core communication speed within a ccNUMA domain
and between the ccNUMA domains of one SMP node.

Example with c = 32 cores per SMP node:

  c=32, 1-dimensional data decomposition:          s ~ 2 N^2
  c=32, 3-dimensional data decomposition (4x4x2):  s ~ 2 (N/4)(N/4) + 2 (N/4)(N/2) + 2 (N/4)(N/2) = 0.63 N^2

In general:  win = s_1dim / s_3dim = c^(2/3) / 3
  c = 16, 32, 64, ...   ->   win = factor 2, 3, 5, ...

Real win? You may not care as long as your inner-node communication is below xx% of the total!
Make your software simple!?
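As a worked check of the general formula (the intermediate numbers are mine):

  c = 16:  win = 16^(2/3) / 3 ≈ 6.3 / 3  ≈ 2
  c = 32:  win = 32^(2/3) / 3 ≈ 10.1 / 3 ≈ 3
  c = 64:  win = 64^(2/3) / 3 = 16 / 3   ≈ 5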

    Slide 95 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

The Mapping Problem with the mixed model

Several multi-threaded MPI processes per SMP node.

Problem: where are your processes and threads really located?
Solutions: depend on your platform, e.g., pinning with numactl (an illustrative call follows below).
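For instance (illustrative invocation; the NUMA node numbers depend on the machine), one multi-threaded MPI process could be bound to socket/NUMA node 0 and its local memory with:

  numactl --cpunodebind=0 --membind=0 ./a.out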

(Figure: an SMP node with two quad-core sockets running two 4-way multi-threaded MPI processes. Do we have this (each MPI process confined to one socket) or that (each MPI process spread across both sockets)?)

Case study on the Sun Constellation Cluster "Ranger" with BT-MZ and SP-MZ
(pure MPI / hybrid MPI+OpenMP / hybrid MPI+MPI).

Further questions:
  Where is the NIC 1) located?
  Which cores share caches?
1) NIC = Network Interface Card

    Slide 96 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Unnecessary intra-node communication

Problem: with several MPI processes on each SMP node there is unnecessary intra-node communication.

Solution:
  MPI+OpenMP: only one MPI process per SMP node
  MPI+MPI: no halo communication within an SMP node

Remarks:
  MPI communication within an SMP node: 2 copies
    (user send buffer -> shared memory -> user recv buffer)
  MPI-3 shared memory halo communication: 1 copy
    (user send buffer -> user recv buffer)
  MPI-3 with direct access to neighbor data: 0 copies

(Margin labels for the items above: mixed model = several multi-threaded MPI processes per SMP node; pure MPI; hybrid MPI+MPI with halo communication; pure MPI & mixed model; hybrid MPI+MPI with direct neighbor access.)

    Slide 97 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

Sleeping threads and network saturation with Masteronly (hybrid MPI+OpenMP)

Problem 1: Can the master thread saturate the network?
Solution: if not, use the mixed model, i.e., several MPI processes per SMP node.

Problem 2: Sleeping threads are wasting CPU time.
Solution: overlapping of computation and communication.

Problems 1 & 2 together: even more idle time is produced through the lousy bandwidth of the master thread.

Masteronly: MPI only outside of parallel regions

for (iteration ...) {
  #pragma omp parallel
  {
    /* numerical code */
  } /* end omp parallel */

  /* on master thread only */
  MPI_Send(original data to halo areas in other SMP nodes)
  MPI_Recv(halo data from the neighbors)
} /* end for loop */

(A compilable skeleton of this pattern follows below.)
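A minimal compilable sketch of the masteronly pattern (my own illustrative example: 1-D ring halo exchange, Jacobi-like kernel, fixed sizes; not one of the applications from the slides):

  #include <mpi.h>

  #define N 1000000
  static double u[N+2], unew[N+2];   /* one halo cell on each side */

  int main(int argc, char **argv)
  {
      int provided, rank, size;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      int left  = (rank - 1 + size) % size;
      int right = (rank + 1) % size;

      for (int iter = 0; iter < 100; iter++) {
          /* compute phase: all threads work, no MPI inside the parallel regions */
          #pragma omp parallel for
          for (int i = 1; i <= N; i++)
              unew[i] = 0.5 * (u[i-1] + u[i+1]);
          #pragma omp parallel for
          for (int i = 1; i <= N; i++)
              u[i] = unew[i];

          /* communication phase: master thread only (serial part of the code) */
          MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 0,
                       &u[0], 1, MPI_DOUBLE, left,  0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          MPI_Sendrecv(&u[1],   1, MPI_DOUBLE, left,  1,
                       &u[N+1], 1, MPI_DOUBLE, right, 1,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }
      MPI_Finalize();
      return 0;
  }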

(Figure: two SMP nodes, each with two sockets; only the master thread of each node communicates over the node interconnect while the other threads sleep.)

    hybrid MPI+OpenMP

    Slide 98 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

OpenMP: Additional Overhead & Pitfalls

Using OpenMP
  may prohibit compiler optimization,
  may cause a significant loss of computational performance.

Thread fork/join overhead.

On ccNUMA SMP nodes:
  loss of performance due to missing memory page locality or a missing first-touch strategy
  (a first-touch sketch follows below);
  e.g., with the masteronly scheme: one thread produces the data, the master thread sends it with MPI
  -> the data may be internally communicated from one memory to the other.

Amdahl's law applies on each level of parallelism.

Using MPI-parallel application libraries? Are they prepared for hybrid?
Using thread-local application libraries? Are they thread-safe?
  (See, e.g., the necessary -O4 flag with mpxlf_r on IBM Power6 systems.)
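To illustrate the first-touch point above, a minimal sketch (array name and size are illustrative): initialize the data with the same OpenMP work sharing that the compute loops will use, so each thread's pages end up in its local memory.

  /* ccNUMA first-touch initialization sketch (compile with OpenMP). */
  #include <stdlib.h>

  int main(void)
  {
      const long n = 100000000L;
      double *a = malloc(n * sizeof(double));

      /* first touch in parallel: each thread touches "its" pages */
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; i++)
          a[i] = 0.0;

      /* later compute loops with the same static schedule access local memory */
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; i++)
          a[i] = 2.0 * a[i] + 1.0;

      free(a);
      return 0;
  }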

    hybrid MPI+OpenMP

    Slide 99 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

MPI-3 shared memory programming

Pros:
  ISV and application libraries need not be thread-safe
  no additional OpenMP overhead
  no OpenMP problems

Cons:
  library calls (MPI_WIN_ALLOCATE_SHARED) instead of SHARED/PRIVATE compiler directives
  no work-sharing directives
    -> loop scheduling must be programmed by hand
  no support for fine-grained or auto-balanced work-sharing
    (as with OpenMP tasks, and dynamic or guided loop schedules)
  virtual addresses of a shared memory window may be different in each MPI process
    -> no binary pointers, i.e., linked lists must be stored with offsets rather than pointers

(A minimal allocation and direct-access sketch follows below.)

Hybrid MPI+MPI: MPI for inter-node communication + MPI-3.0 shared memory programming
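A minimal sketch of this model (my own illustrative example: each rank contributes 1000 doubles to a node-wide window and reads one element of its left neighbor directly; error checking omitted):

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      /* communicator containing only the ranks of my shared-memory node */
      MPI_Comm nodecomm;
      MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                          MPI_INFO_NULL, &nodecomm);
      int nrank;
      MPI_Comm_rank(nodecomm, &nrank);

      /* contiguous node-wide shared window; each rank owns 1000 doubles */
      MPI_Win win;
      double *mypart;
      MPI_Win_allocate_shared(1000 * sizeof(double), sizeof(double),
                              MPI_INFO_NULL, nodecomm, &mypart, &win);

      MPI_Win_lock_all(MPI_MODE_NOCHECK, win);
      mypart[999] = (double)nrank;      /* fill my boundary element */
      MPI_Win_sync(win);                /* memory sync of the window */
      MPI_Barrier(nodecomm);            /* neighbors have written by now */
      MPI_Win_sync(win);

      if (nrank > 0) {                  /* direct load from the left neighbor */
          MPI_Aint sz; int disp; double *left;
          MPI_Win_shared_query(win, nrank - 1, &sz, &disp, &left);
          double halo = left[999];      /* no send/recv, no extra copy */
          (void)halo;
      }
      MPI_Win_unlock_all(win);

      /* note: share offsets, not pointers - base addresses differ per process */
      MPI_Win_free(&win);
      MPI_Comm_free(&nodecomm);
      MPI_Finalize();
      return 0;
  }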

    Slide 100 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

Parallel Programming Models on Hybrid Platforms

pure MPI: one MPI process on each core

hybrid MPI+OpenMP: MPI for inter-node communication, OpenMP inside of each SMP node
  No overlap of comm. + comp.:
    Masteronly: MPI only outside of parallel regions of the numerical application code
  Overlapping comm. + comp.: MPI communication by one or a few threads while the other threads are computing
    Funneled: MPI only on the master thread
      Funneled & Reserved: reserved thread for communication
      Funneled with Full Load Balancing
    Multiple: more than one thread may communicate

Hybrid MPI+MPI (new): MPI for inter-node communication + MPI-3.0 shared memory programming
  Within shared memory nodes: halo updates through direct data copy (new)
  Within shared memory nodes: no halo updates, direct access to neighbor data (new)

OpenMP only: distributed virtual shared memory

    Slide 101 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

Overlapping communication and computation: three problems

The application problem:
  one must separate the application into
    code that can run before the halo data is received, and
    code that needs the halo data;
  very hard to do!

The thread-rank problem:
  splitting communication / computation via the thread rank
  -> cannot use work-sharing directives
  -> loss of major OpenMP support (see next slide)

The load balancing problem.

if (my_thread_rank < 1) {
  MPI_Send/Recv....              /* communication done by thread 0 only */
} else {
  /* the remaining threads split the index range [low, high) by hand */
  my_range = (high - low - 1) / (num_threads - 1) + 1;
  my_low   = low + (my_thread_rank - 1) * my_range;
  my_high  = min(high, my_low + my_range);
  for (i = my_low; i < my_high; i++) {
    ...                          /* computation */
  }
}

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

Experiment: matrix-vector multiply (MVM) in a Jacobi-Davidson solver
on IBM SP Power3 nodes with 16 CPUs per node.

funneled & reserved is always faster in these experiments.
Reason: the memory bandwidth is already saturated by 15 CPUs, see inset.
Inset: speedup on 1 SMP node using different numbers of threads
(curves: funneled & reserved, Masteronly).

Source: R. Rabenseifner, G. Wellein: Communication and Optimization Aspects of Parallel Programming Models on Hybrid Architectures. International Journal of High Performance Computing Applications, Vol. 17, No. 1, 2003, Sage Science Press.

(Plot labels: "funneled & reserved is faster" / "masteronly is faster"; y-axis: performance ratio (r).)

    Slide 103 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Overlapping: Using OpenMP tasks

    Hybrid Parallel Programming

The NEW OpenMP tasking model gives a new way to achieve more parallelism from hybrid computation.

Slides, courtesy of Alice Koniges, NERSC, LBNL

Alice Koniges et al.: Application Acceleration on Current and Future Cray Platforms. Proceedings, CUG 2010, Edinburgh, GB, May 24-27, 2010.

    Slide 104 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Case study: Communication and Computation in Gyrokinetic Tokamak Simulation (GTS) shift routine

    Hybrid Parallel Programming

Work on the particle array (packing for sending, reordering, adding after sending) can be overlapped with data-independent MPI communication using OpenMP tasks.

(Figure: GTS shift routine; the particle-array work and the MPI communication steps are marked INDEPENDENT or SEMI-INDEPENDENT of each other.)

    Slides, courtesy of Alice Koniges, NERSC, LBNL

    Slide 105 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Overlapping can be achieved with OpenMP tasks (1st part)

    Hybrid Parallel Programming

Overlapping MPI_Allreduce with particle work:
  Overlap: the master thread encounters the tasking statements (!$omp master) and creates work for the thread team for deferred execution; the MPI_Allreduce call is executed immediately.
  The MPI implementation has to support at least MPI_THREAD_FUNNELED.
  Subdividing the tasks into smaller chunks allows better load balancing and scalability among threads (a sketch of the pattern follows below).

Slides, courtesy of Alice Koniges, NERSC, LBNL
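A minimal sketch of this pattern in C (my own illustrative example, not the GTS code; do_particle_chunk() and the chunk count are assumptions):

  #include <mpi.h>

  /* stand-in for independent particle work (packing, reordering, ...) */
  static void do_particle_chunk(int chunk) { (void)chunk; }

  static void overlap_allreduce(double *local, double *global, int n,
                                int nchunks, MPI_Comm comm)
  {
      #pragma omp parallel
      {
          #pragma omp master
          {
              /* create deferred tasks for the particle work ... */
              for (int c = 0; c < nchunks; c++) {
                  #pragma omp task firstprivate(c)
                  do_particle_chunk(c);
              }
              /* ... then start the collective immediately on the master
                 thread (MPI_THREAD_FUNNELED is sufficient), while the
                 other threads execute the queued tasks. */
              MPI_Allreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm);
          }
          /* implicit barrier: all tasks are finished when we leave here */
      }
  }

  int main(int argc, char **argv)
  {
      int provided;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      double local[4] = {1, 2, 3, 4}, global[4];
      overlap_allreduce(local, global, 4, 16, MPI_COMM_WORLD);
      MPI_Finalize();
      return 0;
  }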

    Slide 106 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Overlapping can be achieved with OpenMP tasks (2nd part)

    Hybrid Parallel Programming

Overlapping the particle reordering
Overlapping the remaining MPI_Sendrecv

Particle reordering of the remaining particles, adding the sent particles into the array, and the sending or receiving of the shifted particles can be executed independently of each other.

    Slides, courtesy of Alice Koniges, NERSC, LBNL

    Slide 107 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    OpenMP tasking version outperforms original shifter, especially in larger poloidal domains

    Hybrid Parallel Programming

Performance breakdown of the GTS shifter routine using 4 OpenMP threads per MPI process with varying domain decomposition and particles per cell on Franklin (Cray XT4).

MPI communication in the shift phase uses a toroidal MPI communicator (of constant size 128).

Large performance differences in the 256-MPI-process run compared to the 2048-process run!
The speed-up is expected to be higher on larger GTS runs with hundreds of thousands of CPUs, since MPI communication is more expensive there.

(Charts: 256-size run and 2048-size run.)

    Slides, courtesy of Alice Koniges, NERSC, LBNL

    Slide 108 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

Outline
  Introduction / Motivation
  Programming models on clusters of SMP nodes
  Case Studies / pure MPI vs hybrid MPI+OpenMP
  Hybrid programming & accelerators
  Practical How-To on hybrid programming
  Mismatch Problems
  Opportunities: Application categories that can benefit from hybrid parallelization
  Other options on clusters of SMP nodes
  Summary

    Slide 109 / 175

  • Rabenseifner, Hager, Jost

    SUPERsmith

    Hybrid Parallel Programming

    Nested Parallelism

Example NPB: BT-MZ (block-tridiagonal simulated CFD application)
Outer loop:
  limited number of zones -> limited parallelism
  zones with different workloads -> speedup