
This document covers all the content of the FutureLearn course

MPI: A Short Introduction to One-Sided Communication

Learn about one-sided communication in MPI programming. Discover the advantages of one-sided communication in parallel programming!

Message Passing Interface (MPI) is a key standard for parallel computing architectures. In this course, you will learn essential concepts of one-sided communication in MPI, as well as the advantages of the MPI communication model. You will learn the details of how exactly MPI works, as well as how to use Remote Memory Access (RMA) routines. Examples, exercises, and tests will be used to help you learn and explore.

What topics will you cover?

MPI one-sided communication

Window creation and allocation

Remote Memory Access (RMA) routines

Synchronization calls

Examples and exercises

Who developed the course?

This course was developed by ASTRON, HLRS and SURFsara together with PRACE.

The Partnership for Advanced Computing in Europe (PRACE) is an international non-profit association whose mission is to enable high-impact scientific discovery and engineering research and development across all disciplines by offering world class computing and data management resources and services.


Who is the course for?

This course is for anyone familiar with MPI who wants to learn to program using one-sided communication.

What software or tools do you need?

To take this course you do not need a supercomputer – just an MPI environment on your laptop or computer.

Hours per week: 4

Hashtag: #MPIonesided

Learning outcomes - what will you achieve?

By the end of the course, you will be able to:

Apply MPI one-sided communication to the communication patterns in your MPI applications

Explain the main advantages and disadvantages of MPI one-sided communication

Design your program using methods of MPI communication that prevent deadlocks and ensure a correct program

Improve scalability by substituting non-scalable solutions with scalable one-sided approaches


Contents

Week 1: Overview and the principles of one-sided MPI communication
  Introduction
    1.1 Welcome
      Some FutureLearn course basics
      Handouts for this course
      Hardware and Software
    1.2 About PRACE
      PRACE Mission
      PRACE Research Infrastructure
    1.3 Introductions
      Your educators - about us
  Introduction to one-sided communication
    1.4 Why use one-sided communication?
    1.5 One-sided communication: How does it work?
      Typically all processes are both origin and target
    1.6 Quiz 1: Differences between one-sided and two-sided communication
    1.7 Which MPI-routines do I need?
      Routines for creating or allocating a window
      Routines for Remote Memory Access (RMA)
      Routines for synchronization
    1.8 Sequence of one-sided communication
  The three categories of one-sided routines
    1.9 Window creation & allocation - overview
      Using existing memory as windows
      Allocating new memory as windows
      Allocating shared memory windows
      Using existing memory dynamically as windows
    1.10 Remote Memory Access (RMA) routines
      RMA routines that are finished by subsequent window synchronization
      RMA routines that are completed with regular MPI_Wait
    1.11 Discussion of race conditions with Put and Get
    1.12 Quiz 2: Windows, RMA
    1.13 Synchronization calls
      Communication paradigms and synchronization models
      MPI Routines for synchronization calls
  Functional opportunities, summary and quiz
    1.14 One-sided: functional opportunities - an example
    1.15 Summary of week 1
    1.16 Quiz 3: with focus on synchronization

Week 2: Details & examples of One-Sided MPI Communication
  Windows and their remote access
    2.1 Window creation & allocation
      Window creation
      Allocate memory with MPI_Alloc_mem
      Window allocation
      Examples for Fortran programmers
    2.2 RMA routines Put, Get and Accumulate
      MPI_Put
      MPI_Get
      MPI_Accumulate
      Answer to questions in the discussion step from last week
  Synchronization routines
    2.3 MPI_Win_fence and Fortran-specific features with one-sided communication
      MPI_Win_fence
      Fortran-specific features with one-sided communication
    2.4 Post / Start / Complete / Wait
      Race conditions
      Definitions of these five routines
    2.5 Lock/Unlock
  Exercise, Summary and Quiz
    2.6 Exercise: Ring communication with fence
    2.7 Course Summary
    2.8 Quiz 4: General summary


Week 1: Overview and the principles of one-sided MPI communication

Introduction

Welcome and introduction to the course material and instructors, and the chance to introduce yourself.

1.1 Welcome

Welcome to our One-sided Communication course! If this is your first online course with FutureLearn, we have prepared some tips on our learning platform and offer some important information on the hardware and software you will need.

Some FutureLearn course basics

The course material we provide is only one part of the learning process. The other is the community of your fellow learners. You will learn a lot by contributing to the discussions! As instructors, we will also participate in the discussions from time to time. We encourage you to share your thoughts, ask questions if something is not clear for you or if you want to discuss new topics and, of course, answer questions posed by your fellow learners - even if you are not sure that your answer is correct.

Handouts for this course

Please use this document for your notes. Our trailer video gives a (very) short introduction to the course contents: https://fs.hlrs.de/projects/par/mooc/trailer-one-sided_mooc.mp4

Hardware and Software

For the exercises a conventional laptop is sufficient. You will need a C or Fortran programming environment and an MPI library. You can install the MPI library via the OpenMPI site or via the MPICH site.

1.2 About PRACE

This course is a PRACE project. In this article we briefly explain what PRACE is.

The video will give you a short overview of PRACE.

https://fs.hlrs.de/projects/par/mooc/2017_PRACE_HPCinEurope_short-version.mp4


PRACE Mission

The mission of PRACE (Partnership for Advanced Computing in Europe) is to enable high-impact scientific discovery and engineering research and development across all disciplines to enhance European competitiveness for the benefit of society. PRACE seeks to realise this mission by offering world class computing and data management resources and services through a peer review process.

PRACE Research Infrastructure

PRACE is established as an international not-for-profit association (aisbl) with its seat in Brussels. It has 26 member countries whose representative organisations create a pan-European supercomputing infrastructure, providing access to computing and data management resources and services for large-scale scientific and engineering applications at the highest performance level.

For more information you may visit the PRACE website.

1.3 Introductions

Please introduce yourself in the comments area on this page. Get to know your fellow learners and tell them and us something about yourself. Where do you come from? Why are you interested in MPI and one-sided communication and what is your experience in parallel programming?

Your educators - about us

Here’s a little about us.

Dr. Rolf Rabenseifner is the lead educator. He is responsible for HPC training at HLRS, the High Performance Computing Center in Stuttgart, Germany. He is a member of the MPI Forum. In workshops and summer schools he teaches parallel programming models at many universities and labs.

Lucienne Dettki also works at HLRS in the field of HPC training. She develops the conceptual design and contents for our online courses together with her colleagues and organizes PRACE and HLRS trainings.

Zheng Meyer-Zhao is an HPC software engineer at ASTRON in the Netherlands. She spends most of her time developing software, giving trainings on HPC-related topics and developing training materials.

Dr. Carlos Teijeiro Barjas is an HPC advisor at SURFsara in the Netherlands and supports researchers in their daily work. He coordinates trainings at SURFsara and contributes to training on different HPC topics.


Introduction to one-sided communication

In this section we explain why one-sided communication is used. The goal is to access the memory of other processes to either store (PUT) or retrieve (GET) data. We will also discuss the pros and cons of one-sided communication.

1.4 Why use one-sided communication?

Before we get into the details, let's try to understand the goals and issues of the one-sided communication model. What about the interface to the surrounding code?

The basic idea of one-sided communication is to separate the PUT and GET routines from the synchronization routines. In contrast to two-sided communication with implicit synchronization, you have to explicitly call synchronization routines when using one-sided communication. This means one-sided communication is nonblocking and therefore all PUTs and GETs must be surrounded by special synchronization calls.

But why one-sided communication? What are the advantages of using it?

Despite the need for explicit synchronization, one-sided communication can help to improve performance:

• Reduce synchronization
  With two-sided communication there is some implicit synchronization. For example, a receive operation cannot complete before the corresponding send has started. This means there has to be synchronization for every single data transfer. With one-sided communication, several independent PUT and GET operations can be completed with one synchronization step.

• No delay in sending data
  In one-sided communication, the PUTs and GETs are nonblocking. For example, this means that while a process is sending data to a remote process, the remote process can continue its work instead of waiting for the data.

• Functionality and scalability
  One-sided communication can solve some problems in scalable programs (i.e., on thousands of MPI processes) for which two-sided communication would require significantly more communication time. We will provide an example next week.

What do you think? Which problems are better solved by one-sided communication? When would you prefer two-sided communication?

1.5 One-sided communication: How does it work?

In this article we will point out the differences between one-sided communication and two-sided communication. At the same time, you will learn some new terms used in one-sided communication.

Let’s take a brief glance at two-sided communication before we start explaining one-sided communication:

In two-sided communication you have senders and receivers. Both processes are active peers in communication. Each side has to call a communication routine: the sender calls MPI_Send and the receiving process calls MPI_Recv. Each process can act as sender and as receiver.


In one-sided communication only one process is active: the so-called origin process. In the diagram below you can see process 0 taking action as the origin process. With MPI_Put process 0 sends data to the so-called target process, process 1, which receives this data without calling any receiving routine. This means that the execution of a put operation is similar to the execution of a send by the origin process and a matching receive by the target process. All arguments are provided by one call executed by the origin process.

If the origin process wants to get data from the target process, it calls MPI_Get. It works similarly to MPI_Put, except that the direction of data transfer is reversed. MPI_Get is therefore equivalent to the execution of a send by the target process and a corresponding receive by the origin process.

MPI_Put and MPI_Get are called Remote Memory Access (RMA) operations.

In the diagram above you can see the new term window. A window is a memory space which can be seen and accessed by other processes. Later in the course we will explain how to create and allocate windows within the communicator.

Typically all processes are both origin and target

Each process can act as an origin and target process. Their interaction is shown in the following sequence of figures, to give you a first impression of how one-sided operations are embedded into the MPI processes:

As indicated previously, all processes may be both origin and target. Here we see an example with four MPI processes.

The memory of each MPI process is fully protected against access from outside: that is, from other MPI processes.


Here is an example of two-sided communication (normal MPI_Send and MPI_Recv). One process calls MPI_Send, which reads out the data from the local send buffer. The corresponding MPI_Recv in the other process then stores the data in the local receive buffer. It is the job of the MPI library to transport the data from the sending process to the receiving process.

Now we will look at one-sided communication:

Each process has to specify a memory portion that should be accessible from the outside. With a collective call to MPI_Win_create, all processes make their windows accessible from the outside.


In this example, we can then use MPI_Put to store data from a local send buffer into the window of a remote process. Or we can use MPI_Get to fetch data from a remote window, and then store it into the local buffer.

You can see that there is no Remote Memory Access call on the target window process.

The origin side is the only process that must call the RMA routines (MPI_Put and MPI_Get).

In this sense, windows are peepholes into the memory of the MPI processes.

With RMA routines an origin process can put data into remote windows or can get data out from the remote windows.

Congratulations to those who have learned some new terms: origin process, target process and also window. In the next section we will explain the MPI routines you need in order to implement one-sided communications.

1.6 Quiz 1: Differences between one-sided and two-sided communication

Here is a short video introduction to the quizzes: https://www.youtube.com/watch?v=6hxccO2aPp4 or alternatively https://fs.hlrs.de/projects/par/mooc/quiz-intro.mp4

This quiz will test your understanding of the difference between one-sided and two-sided communications in MPI. Each of the 12 questions covers data movement with one or more processes taking the active role of the communication.

Local variables are shown in white with a black frame, and windows are shown in yellow with blue frames. You have to identify and describe the action in the most correct and accurate way.

Here you can find an image with the local variables A, B and C and the window variables W0, W1 and W2. This can help you to get familiar with the format of the questions in this quiz. It may also be useful in the future.

Ready to start? Let’s go!

1.7 Which MPI-routines do I need?

In order to use one-sided communication in your program you will need some special MPI-routines. First we will provide an overview. Following this, the routines will be explained in detail.

There are three major sets of routines, namely for window creation, for remote memory access (RMA) and for synchronization.


Routines for creating or allocating a window

As explained in the previous article 1.5, only the origin process directly calls RMA routines. In order to do this, a window in the target process is needed, created using one of the following functions in different situations:

• MPI_Win_create, when you already have an allocated memory buffer, which you want to make remotely accessible,

• MPI_Win_allocate and MPI_Win_allocate_shared, when you want to create a buffer and make it directly remotely accessible,

• MPI_Win_create_dynamic, when you do not know the needed buffer size yet.

These routines and some others will be explained in detail in step 1.9 “Window creation & allocation”.

All four window creation and allocation routines are collective calls over the process group of a given communicator.

Routines for Remote Memory Access (RMA)

For access to the remote window, the origin process uses routines like:

• MPI_Put to put data to the remote process
• MPI_Get to get data from the remote process
• MPI_Accumulate to combine the data moved to the target process with the data that resides at that process

More details will be explained in step 2.2 “RMA routines Put, Get and Accumulate”.

Routines for synchronization

Since all RMA routines are nonblocking routines, they have to be surrounded by synchronization calls, which guarantee that the RMA is locally and remotely completed and all necessary cache operations are implicitly done. There are two types of synchronization:


• active, which means that both origin and target process have to call synchronization routines, and

• passive, which means that only the origin process must call synchronization routines, whereas the target process is passive.

For active target synchronization you can use

• MPI_Win_fence to surround the RMA routines, or a combination of
• MPI_Win_post, MPI_Win_start, MPI_Win_complete and MPI_Win_wait to restrict the communication that implements the synchronization to a minimum.

For passive target synchronization we use routines like

• MPI_Win_lock and MPI_Win_unlock to lock and unlock a window.

Now you have learned about the sequence of one-sided operations and also some RMA routines. In the next steps we will explain them in more detail.

1.8 Sequence of one-sided communication

In the last step you got to know the major sets of routines for one-sided communication. Now we will show you how to use the sequence of one-sided operations.

The first thing to do is to create a window in each process. This window will be available for Remote Memory Access (RMA) operations. And after having done all the necessary RMA operations, you should “close” this window (i.e. free or deallocate it) as shown in the illustration below. Initializing and freeing a window are collective calls within the communicator.

As long as a window exists you can do some work with the remote memory, but you have to surround the RMA operations by synchronization calls. With these synchronization calls you define access epochs, for example RMA epochs or local load/store epochs.

In the role of a target process, i.e. owner of a window memory:

If you do some work on the local window memory, you have to separate it by a synchronization call from the work on this memory using RMA operations from remote processes.


In the role of being an origin process, i.e. calling RMA routines:

RMA calls must be surrounded by synchronization calls, which must match corresponding calls in the target process.

Therefore you are programming RMA epochs and local store epochs.
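To make this sequence concrete, here is a minimal sketch in C (assuming fence synchronization and a one-element integer window; the ring-style target rank and the transferred value are only illustrative, not part of the course text):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
   int rank, size, snd_buf, win_buf = 0;
   MPI_Win win;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   snd_buf = rank;

   /* Collective call: every process exposes win_buf as a window. */
   MPI_Win_create(&win_buf, (MPI_Aint)sizeof(int), sizeof(int),
                  MPI_INFO_NULL, MPI_COMM_WORLD, &win);

   MPI_Win_fence(0, win);                          /* opens the RMA epoch  */
   MPI_Put(&snd_buf, 1, MPI_INT,                   /* origin buffer        */
           (rank+1)%size, (MPI_Aint)0, 1, MPI_INT, /* target rank, disp    */
           win);
   MPI_Win_fence(0, win);                          /* closes the RMA epoch */

   /* Local load/store epoch: win_buf may now be read locally. */
   printf("Rank %d received %d\n", rank, win_buf);

   MPI_Win_free(&win);                             /* collective call      */
   MPI_Finalize();
   return 0;
}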

In the next step we explain the window creation and allocation.

The three categories of one-sided routines

This comes down to three questions:

• How to define/allocate the memory other processes can access?

• Which routines to use for accessing the exposed memory?

• Do we need memory “guards” to protect it from data corruption?

You will find the answers to these questions in this activity!

1.9 Window creation & allocation - overview

There are four different ways to create or allocate windows:

• Using existing memory as windows
• Allocating new memory as windows
• Allocating shared memory windows
• Using existing memory dynamically as windows

After being created or allocated using these methods, the windows can then be used by other processes in the communicator to perform RMA operations.

Using existing memory as windows

The MPI routine used to create a window from existing memory is MPI_Win_create. The existing memory can be memory already allocated in the program, e.g. a variable or an array, or memory allocated using MPI_Alloc_mem. Note that when memory is allocated with MPI_Alloc_mem, it is the responsibility of the software developer to free the allocated memory by calling MPI_Free_mem when the memory is no longer needed. The same is true for the created window: when it is not required anymore, i.e. when all RMA and synchronization operations within the window are finished, you must free the window using MPI_Win_free.

Allocating new memory as windows

Instead of using existing memory as windows, one can also choose to allocate new memory for RMA operations. MPI_Win_allocate can be used for this purpose. We will describe the usage of this routine later this week with an example. MPI_Win_allocate is now preferable over a combination of MPI_Alloc_mem and MPI_Win_create, because MPI_Win_allocate may allocate memory in a symmetric way that can result in better communication performance.

Allocating shared memory windows

It is also possible to allocate shared memory windows. However, as indicated by the term shared memory, this method can only be used within a shared memory node. To define a shared memory window, MPI_Win_allocate_shared can be used.


Processes may call MPI_Win_shared_query to retrieve the starting memory address (the base pointer) of any process in the group.

More on this method will be covered in a later course "MPI-3.0 Shared Memory" (in 2020).
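As a first impression, here is a minimal sketch (an assumption of this example is that node_comm contains only the processes of one shared memory node, e.g. obtained with MPI_Comm_split_type and MPI_COMM_TYPE_SHARED; the one-int-per-process layout is only illustrative):

#include <mpi.h>

void shared_window_sketch(MPI_Comm node_comm)
{
   int *my_part, *rank0_part, disp_unit;
   MPI_Aint size;
   MPI_Win win;

   /* Each process contributes one int to the shared segment. */
   MPI_Win_allocate_shared((MPI_Aint)sizeof(int), sizeof(int),
                           MPI_INFO_NULL, node_comm, &my_part, &win);

   /* Query the base pointer of the portion owned by rank 0, so that it
      can be accessed with ordinary loads and stores. */
   MPI_Win_shared_query(win, 0, &size, &disp_unit, &rank0_part);

   /* ... synchronization and work on the shared memory ... */

   MPI_Win_free(&win);
}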

Using existing memory dynamically as windows

MPI_Win_create_dynamic provides the functionality to create windows which can use existing memory dynamically, i.e. one can create a window using this routine, and decide later which existing memory to attach to this window. A window created with this approach can only be used after memory has been attached to it using MPI_Win_attach. After all RMA operations are completed, the memory can be detached from the window using MPI_Win_detach. Dynamic windows are mainly for special purposes, for example to implement other parallel programming models on top of MPI. A small sketch of this life cycle is shown below.
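The sketch assumes that the required size n only becomes known after the window exists; remote access to a dynamic window additionally needs the attached address, obtained with MPI_Get_address and communicated to the origin, which is not shown here:

#include <stdlib.h>
#include <mpi.h>

void dynamic_window_sketch(int n)
{
   int *buf;
   MPI_Win win;

   /* Collective call: create an (initially empty) dynamic window. */
   MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

   /* Later, when the needed size n is known: attach local memory (local call). */
   buf = (int *)malloc(n * sizeof(int));
   MPI_Win_attach(win, buf, (MPI_Aint)(n * sizeof(int)));

   /* ... RMA epochs using this window ... */

   MPI_Win_detach(win, buf);   /* local call, after all RMA has completed */
   free(buf);
   MPI_Win_free(&win);         /* collective call */
}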

All window creation and allocation routines are collective, whereas the memory allocation routines (MPI_Alloc_mem, MPI_Free_mem) and the dynamic routines MPI_Win_attach and MPI_Win_detach are local.

Except for the first method, “Using existing memory as windows”, the other three use routines newly defined in the MPI-3.0 standard. The methods for creating or allocating windows are summarized in the image below.

1.10 Remote Memory Access (RMA) routines

Now that you can create windows for Remote Memory Access and do the necessary synchronizations, we will get to know some routines that you can use for RMA operations.

All the routines explained in this article are nonblocking operations, which means the RMA communication call initiates the transfer, but the transfer may continue after the call returns. The transfer is only completed, either at the origin process or at both the origin and the target process, when a subsequent synchronization call is issued by the caller on the involved window object. Before the transfer is finished, the resulting data values, or outcome, of concurrent conflicting accesses to the same memory locations are undefined. These situations are known as race conditions.

To avoid race conditions, the local communication buffer of a put or accumulate call should not be updated, and the local communication buffer of a get call should not be accessed until the operation completes at the origin (local finishing). If a location is updated by a put or accumulate operation, the communication buffer at the target should not be accessed until the updating operation has completed (remote finishing).


RMA routines that are finished by subsequent window synchronization

The simplest RMA operations are those for remote stores and remote loads.

• MPI_Put

Stores data from the origin process (source) to the remote target process (destination). This is a remote store operation.

• MPI_Get

Gets data from the remote memory window of the target process, which is the source in this case, and loads it to the origin process, which is the destination. This is a remote load operation.

The following routines are elementwise atomic routines: this means that several processes can update the same window location and the operations are serialized in some sequence. Of course, the operations used must be commutative and associative, so that the sequence in which the processes are changing data in the remote window is irrelevant. Therefore there are no race conditions!

With accumulation routines, the data moved to the target process is combined with the data that resides in the target process. Many calls by many processes can be issued for the same target element. For example, this will allow the accumulation of a sum by having all involved processes add their contributions to a sum variable in the memory of one process.

• MPI_Accumulate

This function is like MPI_Put except that data is combined into the target memory instead of overwriting it.

• MPI_Get_accumulate

The remote data is returned to the caller before the sent data is accumulated into the remote data. These two functions work for single elements but also elementwise for a whole array.

• MPI_Fetch_and_op

The functionality of this routine is a subset of MPI_Get_accumulate. MPI_Fetch_and_op does the same as MPI_Get_accumulate, but only for one element. This allows a faster implementation.

• MPI_Compare_and_swap

The value at the origin is compared to the value at the target. The value at the target is only replaced by a third value if the comparison is true.

MPI_Get_accumulate, MPI_Fetch_and_op and MPI_Compare_and_swap are new in MPI-3.0.
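As a small illustration of these atomic routines, here is a sketch of a remote counter built with MPI_Fetch_and_op (assumptions: the window win exposes one int on rank 0 at displacement 0, initialized to 0, and a passive target epoch with lock/unlock is used; lock/unlock is explained in step 2.5):

#include <mpi.h>

/* Atomically adds 1 to the counter on rank 0 and returns its old value. */
int fetch_next_ticket(MPI_Win win)
{
   int one = 1, old_value;

   MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
   MPI_Fetch_and_op(&one, &old_value, MPI_INT, 0, (MPI_Aint)0, MPI_SUM, win);
   MPI_Win_unlock(0, win);

   return old_value;
}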

RMA routines that are completed with regular MPI_Wait

The following routines are also new since MPI-3.0. They all start with an “R”:

MPI_Rput, MPI_Rget, MPI_Raccumulate and MPI_Rget_accumulate.

The “R” indicates that they return a request handle, thus they can be completed with normal MPI_Wait routines. You do not need the one-sided synchronization routines.

Request-based RMA operations are only valid within a passive target epoch. Such a passive target epoch can also be an MPI_Win_lock_all with MPI_MODE_NOCHECK started once directly after the window creation or allocation and unlocked once before freeing the window at the end. You will get more details on the meaning of passive target epochs in the article about synchronization calls.
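A minimal sketch of a request-based call inside such an epoch (assuming the window was already created collectively; the target rank and the single double element are only illustrative):

#include <mpi.h>

void request_based_get(MPI_Win win, int target, double *local_buf)
{
   MPI_Request req;

   MPI_Win_lock_all(MPI_MODE_NOCHECK, win);   /* passive target epoch      */

   MPI_Rget(local_buf, 1, MPI_DOUBLE,         /* local destination buffer  */
            target, (MPI_Aint)0, 1, MPI_DOUBLE,
            win, &req);
   MPI_Wait(&req, MPI_STATUS_IGNORE);         /* local completion of Rget  */

   MPI_Win_unlock_all(win);                   /* before MPI_Win_free       */
}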


We have covered an overview of the RMA functions. The illustration below summarizes the explained routines.

In the next steps you will learn more about synchronization models for RMA routines. But, before that, be sure not to miss the following discussion on examples with race conditions. We look forward to discussing more examples with you!

1.11 Discussion of race conditions with Put and Get

In the previous section we explained several RMA routines. For remote store and load we discussed PUT and GET.

Remember: these operations are nonblocking. Can you think of an example which produces race conditions with PUT and GET?

Here is a scenario with race conditions:

You have three processes with memory windows. Process 2 is the target process. Process 0 and process 1 are doing a PUT operation on the target process 2 into the same location within its window. The faster process writes its data first in the target window and the slower process overwrites this data with its own data by its PUT operation. So you see that the data finally stored in the remote window depends on the sequence of the PUT operations.

This is a write-write conflict. The conflict can be resolved by adding a synchronization between the two PUT operations.

Which scenario will result in a write-read race condition?

Discuss your ideas with your fellow learners!

Where else do the results depend on the order of RMA operations, even though such dependencies are not race conditions?

Can you find the appropriate sections discussing details in the MPI standard?

Please compare the answers of this discussion with answers provided in some articles next week.


1.12 Quiz 2: Windows, RMA

Time for another quiz!

The quiz includes some general questions on window implementations, and also tests your ability to implement and identify different one-sided operations.

Do you remember the image with local variables A, B and C and window variables W0, W1 and W2 from Quiz 1? Here it is again:

You will see this again in some of the questions here, but this time you will be seeing two images: the first one (above) shows an initial configuration. The second one (below) shows the final status after executing one function... or maybe many functions!

You will have to choose the correct answer to the questions... or maybe many answers are valid!

Ready for the challenge? Go ahead!

1.13 Synchronization calls

As mentioned in previous steps, all RMA routines are nonblocking, which means that you only know when the RMA routines are started, but not when they are completed. Therefore, explicit synchronizations are needed. Moreover, all RMA operations must be surrounded by synchronization calls.

Communication paradigms and synchronization models

There are two communication paradigms to achieve synchronization around RMA operations: active target communication and passive target communication.

The active target synchronization communication paradigm is similar to the message passing model: both the origin and target processes participate in the communication. In this paradigm, the origin process performs synchronization and RMA operations, whereas the target process only participates in the synchronization. There are two such synchronization models: Fence and Post-Start-Complete-Wait.


The passive target synchronization communication paradigm is closer to the shared-memory model, because only the origin process is involved in the communication. The synchronization model Lock/Unlock is used.

MPI Routines for synchronization calls

Below you can find an overview of MPI routines that are available for each synchronization model. We will discuss these synchronization models in more detail in the articles about synchronization routines from Week 2.

Fence

• MPI_Win_fence

This routine works as a barrier to separate RMA epochs.

Post-Start-Complete-Wait

• MPI_Win_post, MPI_Win_start, MPI_Win_complete, MPI_Win_wait/MPI_Win_test

Lock/Unlock

• MPI_Win_lock, MPI_Win_unlock
• MPI_Win_lock_all, MPI_Win_unlock_all (New in MPI-3.0)
• MPI_Win_flush, MPI_Win_flush_local, MPI_Win_sync (New in MPI-3.0)
• MPI_Win_flush_all, MPI_Win_flush_local_all, MPI_Win_sync (New in MPI-3.0)


Discussion:

Can you think of a situation where passive target communication with Lock/Unlock is preferred over active target communication with Fence? Why?

Functional opportunities, summary and quiz

You may be asking whether one-sided communication is really needed. Why not always use send and receive routines? But sometimes one-sided communication has advantages over two-sided. We will look at an example now.

1.14 One-sided: functional opportunities - an example

In this example we will illustrate where one-sided communication is preferred over two-sided communication.

Imagine you need to write an MPI application where information needs to be exchanged between senders and receivers. Every sender process knows all its neighbors (receiver processes) who need information. However, the receiver processes don’t know which sender has such information, nor the number of processes that will send information. For each process in its role as a receiver, we abbreviate this number of sending processes with “nsp”, which should be small compared to the total number of processes started by the program.

If we use two-sided communication to solve the problem, it will not be scalable. The reason is the necessary use of very expensive operations (either MPI_Alltoall or MPI_Reduce_scatter), so that each sender tells all other processes whether they will get a message or not. Furthermore, considering that “nsp” is relatively small compared to the total number of processes, most of the messages exchanged using these expensive collective operations are actually not very useful: the sender is only telling the receiver “I don’t have any messages for you”.


On the other hand, one-sided communication can provide an efficient solution here. First of all, we let each process play the role of being a receiver. It creates a window that consists of the variable nsp (number of sending processes). The initial value of nsp is set to 0, because the receiver process doesn’t know the number of sending processes yet.

After creating the window, let each process begin in its role as a sender. It can now tell all its neighbors to whom it will send information: “I want to send you some information”. This step can be achieved by first calling MPI_Win_fence to synchronize all processes, which means to wait until all nsp variables are initialized with 0, and then making multiple MPI_Accumulate calls to add 1 to the nsp variable of each of its neighboring receiver processes. When this is done, it (still in its role as sender) will call MPI_Win_fence again. First, this informs all neighboring receiver processes that its updates of its receivers’ nsp variables have been completed (still in its role as a sender). Second, (now in its role as receiver) it will wait until all other neighboring sender processes have finished their accumulates to its own nsp variable.

Now each process in its role as receiver has the correct value for nsp. Therefore it can loop over its value of nsp to receive information with MPI_Irecv(MPI_ANY_SOURCE). After that, now in the role as sender, each process can loop over its receiving processes to send them information. Again in its role as receiver, the process will call MPI_Waitall() to wait for the completion of receiving data from all its neighbors (i.e., sending processes). As receiving process, it can then see the senders’ ranks in the statuses array of MPI_Waitall, which completes all receive calls. A sketch of this protocol is shown below.
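The sketch assumes that each process knows its receivers in nbrs[0..num_nbrs-1], that the exchanged information is a single int, and that error checking is omitted; the function and variable names are only illustrative:

#include <stdlib.h>
#include <mpi.h>

void exchange_with_unknown_senders(const int *nbrs, int num_nbrs, int payload)
{
   int nsp = 0;          /* number of sending processes (receiver role) */
   int one = 1, i;
   MPI_Win win;

   /* Expose nsp as a one-element window. */
   MPI_Win_create(&nsp, (MPI_Aint)sizeof(int), sizeof(int),
                  MPI_INFO_NULL, MPI_COMM_WORLD, &win);

   MPI_Win_fence(0, win);           /* all nsp counters are initialized to 0 */
   for (i = 0; i < num_nbrs; i++)   /* sender role: announce myself          */
      MPI_Accumulate(&one, 1, MPI_INT, nbrs[i], (MPI_Aint)0, 1, MPI_INT,
                     MPI_SUM, win);
   MPI_Win_fence(0, win);           /* accumulates finished; nsp is correct  */

   /* Receiver role: post nsp receives from (yet unknown) sources. */
   MPI_Request *reqs  = (MPI_Request *)malloc(nsp * sizeof(MPI_Request));
   MPI_Status  *stats = (MPI_Status  *)malloc(nsp * sizeof(MPI_Status));
   int *recv_buf = (int *)malloc(nsp * sizeof(int));
   for (i = 0; i < nsp; i++)
      MPI_Irecv(&recv_buf[i], 1, MPI_INT, MPI_ANY_SOURCE, 0,
                MPI_COMM_WORLD, &reqs[i]);

   /* Sender role: send the actual information to each receiver. */
   for (i = 0; i < num_nbrs; i++)
      MPI_Send(&payload, 1, MPI_INT, nbrs[i], 0, MPI_COMM_WORLD);

   MPI_Waitall(nsp, reqs, stats);   /* stats[i].MPI_SOURCE gives each sender */

   MPI_Win_free(&win);
   free(reqs); free(stats); free(recv_buf);
}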

Discussion:

Can you think of another problem where one-sided communication is preferred as a solution over two-sided communication? Please share it with others in the comments.


1.15 Summary of week 1

This past week we have learned about one-sided communication. Let’s have a look at what you have learned.

The idea of one-sided communication is to separate all data movements from synchronization calls. You should now understand the advantages of one-sided communication and have learned some new terms and MPI routines.

However, in order to start coding you will need more information about windows, remote memory access (RMA) routines and synchronization calls. In the next week we will cover these subjects in detail. We have also prepared some exercises and code snippets.

Perhaps you still haven’t shared your thoughts with your fellow learners? We hope you will and take part in the discussions.

We have prepared another quiz for you that focuses on synchronization. Go ahead and test your knowledge. Good luck!

1.16 Quiz 3: with focus on synchronization

In this quiz you will check your ability to identify one-sided operations. However, this time you’ll need to focus on synchronization requirements and check that the operations are performed with the correct synchronization calls.

Does the following look familiar to you?

Each question will present an image with an initial program state and a final program state. You will need to choose the right answer, which can be one of the given possibilities... or maybe more!


Ready for this new challenge? Let’s start!


Week 2: Details & examples of One-Sided MPI Communication

Windows and their remote access

Now we will take a deep dive into the definitions of the window creation/allocation routines, and their remote access routines.

2.1 Window creation & allocation

In this article we take a closer look at the routines required to manage memory access with windows. These are routines for window creation and for memory allocation. We also provide examples in Fortran.

Window creation

The MPI_Win_create routine specifies the already allocated region in memory that can be accessed by remote processes. It is a collective call over all processes in the communicator. It returns a window handle of type MPI_Win, which can be used to perform the remote memory access (RMA) operations. As you may already know, handles in MPI are a type of pointer to access opaque objects in the MPI system memory, in this case to access a window object.

The definition of the MPI_Win_create routine is as follows:

MPI_Win_create(win_base_addr, win_size, disp_unit, info, comm, win)

with the following input and output arguments:

• IN: win_base_addr is the base address of the (target) window. It is like a normal buffer argument, e.g. in C it may be &buffer_variable or directly buffer_array_name.

• IN: win_size is the size of the window. In contrast to point-to-point or collective communication, the length is not specified as a count. The length must be specified as a number of bytes.

• IN: The displacement unit argument disp_unit is the local unit size for displacements in bytes. Instead of a datatype handle, the size in bytes of the buffer’s datatype is passed, e.g. if the buffer is an array of doubles then sizeof(double) should be passed.

• IN: info is the info argument, to pass MPI_INFO_NULL or some optimization hints.

• IN: comm is the communicator.


• OUT: win is the window object returned by the call. This window handle then represents:

– all about the communicator
– its processes
– the location and size of the windows in all processes
– the disp_units in all processes

When all RMA operations within the window are finished, the window is freed using MPI_Win_free.

Below is the code snippet that creates and frees the window for the variable nsp (number of sending processes) described in the example in step 1.14, using the C programming language.

MPI_Win_create(&nsp, (MPI_Aint)sizeof(int), sizeof(int),
               MPI_INFO_NULL, MPI_COMM_WORLD, &win);
// Perform some RMA operations here ...
MPI_Win_free(&win);

Note that the window size must be specified as an address-sized integer, whereas the displacement unit is a normal integer.

Allocate memory with MPI_Alloc_mem

When the memory for a window is not allocated yet, the routine MPI_Alloc_mem can be used to allocate memory before calling MPI_Win_create. Note that it is the responsibility of the programmer to free the allocated memory by calling MPI_Free_mem when the memory is no longer needed.

The definitions of MPI_Alloc_mem and MPI_Free_mem are as follows:

MPI_Alloc_mem(size, info, baseptr)

with the following input and output arguments:

• IN: size is the size of the memory segment in bytes,
• IN: info is the info argument,
• OUT: the output variable baseptr is the start address of the allocated memory segment.

MPI_Free_mem(base)

where base is the beginning of the memory segment to be freed.

Below is a code snippet that shows how to use the memory allocated with MPI_Alloc_mem together with MPI_Win_create to create a window that contains 10 elements of type int.

MPI_Alloc_mem((MPI_Aint)(10*sizeof(int)), MPI_INFO_NULL, &snd_buf);
MPI_Win_create(snd_buf, (MPI_Aint)(10*sizeof(int)), sizeof(int),
               MPI_INFO_NULL, MPI_COMM_WORLD, &win);
...
MPI_Win_free(&win);
MPI_Free_mem(snd_buf);

In some systems, message-passing and remote-memory-access (RMA) operations run faster when accessing specially allocated memory. MPI_Alloc_mem and MPI_Free_mem are the routines provided by MPI for allocating and freeing this type of special memory. This means that MPI_Alloc_mem may return a memory pointer that enables faster one-sided communication than normally allocated memory. Portable MPI programs that want to use MPI_Win_create together with passive target synchronization must use MPI_Alloc_mem for the window memory.


Window allocation

The routine MPI_Win_allocate combines the behavior of MPI_Alloc_mem and MPI_Win_create. It is a collective call executed by all processes in the communicator comm. It allocates new memory as a window for RMA operations.

The definition of the MPI_Win_allocate routine is as follows:

MPI_Win_allocate (size, disp_unit, info, comm, baseptr, win)

with the following input and output arguments:

• IN: size is the size of the window in bytes, with type MPI_Aint
• IN: disp_unit is the local unit size for displacements in bytes
• IN: info is the info argument
• IN: comm is the communicator
• OUT: the output variable baseptr is the start address of the allocated memory segment
• OUT: win is the window object returned by the call

Below is the previous code snippet for window allocation, now rewritten using MPI_Win_allocate.

MPI_Win_allocate((MPI_Aint)(10*sizeof(int)), sizeof(int),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &snd_buf, &win);
...
// MPI_Win_free will free the memory allocated by MPI_Win_allocate.
MPI_Win_free(&win);

By allocating (potentially aligned) memory instead of allowing the user to pass in an arbitrary buffer, MPI_Win_allocate can improve the performance for systems with remote direct memory access. Therefore, for performance reasons, MPI_Win_allocate should be preferred over MPI_Alloc_mem + MPI_Win_create.

Examples for Fortran programmers

A memory allocation example with modern C-Pointer in the mpi_f08 or mpi module:

USE mpi_f08
USE, INTRINSIC :: ISO_C_BINDING
INTEGER :: max_length, disp_unit
INTEGER(KIND=MPI_ADDRESS_KIND) :: lb, size_of_real
REAL, POINTER, ASYNCHRONOUS :: buf(:)
TYPE(MPI_Win) :: win
INTEGER(KIND=MPI_ADDRESS_KIND) :: buf_size, target_disp
TYPE(C_PTR) :: cptr_buf
max_length = ...   ! or length_n = ..., length_m = ... in case of a two-dimensional array
CALL MPI_Type_get_extent(MPI_REAL, lb, size_of_real)
buf_size = max_length * size_of_real   ! or buf_size = length_m * length_n * size_of_real
disp_unit = size_of_real
CALL MPI_Win_allocate(buf_size, disp_unit, MPI_INFO_NULL, &
                      MPI_COMM_WORLD, cptr_buf, win)
CALL C_F_POINTER(cptr_buf, buf, (/max_length/))   ! or (/length_m, length_n/)

The following example of MPI_ALLOC_MEM with old-style “Cray”-Pointer can be used with the mpi module or mpif.h:


USE mpi
REAL a
POINTER (p, a(100))                    ! no memory is allocated
INTEGER (KIND=MPI_ADDRESS_KIND) buf_size
INTEGER length_real, win, ierror
CALL MPI_TYPE_EXTENT(MPI_REAL, length_real, ierror)
buf_size = 100*length_real
CALL MPI_ALLOC_MEM(buf_size, MPI_INFO_NULL, p, ierror)
CALL MPI_WIN_CREATE(a, buf_size, length_real, &
                    MPI_INFO_NULL, MPI_COMM_WORLD, win, ierror)
...
CALL MPI_WIN_FREE(win, ierror)
CALL MPI_FREE_MEM(a, ierror)

Within the new mpi_f08 module all memory allocation routines are defined with modern C-pointers TYPE(C_PTR). Within the mpi module and mpif.h, memory allocation is provided through older “Cray” pointers, and additionally overloaded with the C-pointer interface. In mpif.h this overloading is only optional, but the use of mpif.h has been strongly discouraged since the introduction of MPI-3.0.

2.2 RMA routines Put, Get and Accumulate

Now that you have learned the above routines, it is time to learn their parameters in detail in order to use them correctly in your code. We will explain them here.

In the last step you learned how to create windows. When all processes in a communicator have called MPI_Win_create, they have all received their window handle. Now each process can take the role of being an origin process calling RMA routines like MPI_Put, MPI_Get and MPI_Accumulate.

MPI_Put

MPI_Put(origin_address, origin_count, origin_datatype, target_rank, target_disp, target_count, target_datatype, win)

Please note: MPI_Put puts data from the local send/origin buffer into the remote (target) memory window. The execution of MPI_Put is similar to the execution of MPI_Send by the origin process and a matching MPI_Recv by the target process. However, with MPI_Put all arguments are specified by the origin process, including the arguments for the target process! The local send buffer is specified with origin_address, origin_count and origin_datatype.

MPI_Put is a nonblocking RMA routine executed inside an epoch, which has to be finished by a subsequent synchronization call. You are not allowed to modify the content in your buffer until this later synchronization is finished.

The start address of your origin buffer is origin_address. The data is written in the target buffer at address

target_addr = win_base_addr (of the target process) + target_disp (given by the origin process) * disp_unit (of the target process)

where win_base_addr and disp_unit have been defined by MPI_Win_create in the target process.

The displacement unit disp_unit is the size of one data element at the target, for example 8 bytes for doubles. This size has also been defined at the target process when the window was created with MPI_Win_create. This displacement unit is multiplied with the target displacement argument target_disp in the MPI_Put argument list when called on the origin process. Therefore this target displacement is like an index into the window array at the target process.
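The following sketch shows target_disp used as such an index (assumptions: every process has created win over a local array of 10 doubles with disp_unit = sizeof(double); MPI_Win_fence is collective, so all processes of the window's group call this function together):

#include <mpi.h>

/* Writes one double into element 'index' of the window on process 'target'. */
void put_into_element(MPI_Win win, int target, int index, double value)
{
   MPI_Win_fence(0, win);
   /* Target address: win_base_addr + index * sizeof(double) on 'target'. */
   MPI_Put(&value, 1, MPI_DOUBLE,
           target, (MPI_Aint)index, 1, MPI_DOUBLE, win);
   MPI_Win_fence(0, win);
}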

Processes may have different displacement units. The algorithm above works correctly independent of whether the displacement units on origin and target process are different. This is very useful when you are not using standard datatypes with the same size of data in each process, for example complicated structures.

But be careful: on heterogeneous platforms you should use a portable datatype with no byte specifications inside, that means with no explicit byte displacements. A datatype is portable if all displacements in the datatype are expressed in terms of extents of one predefined datatype.

All important information for MPI_Put is summarized in the following picture.

MPI_Get

MPI_Get(origin_address, origin_count, origin_datatype, target_rank, target_disp, target_count, target_datatype, win)

This routine is similar to MPI_Put, but with the reverse direction of data transfer: the origin process copies data from the remote target process, i.e. origin_address, origin_count and origin_datatype now specify the local receive buffer.

As for MPI_Put, all parameters are defined by the origin process. MPI_Get is also a nonblocking RMA routine. Therefore, you must not read or modify your local buffer (the receive/origin buffer) until its associated epoch has been finished by a subsequent synchronization call.

As with MPI_Put, use portable datatypes on heterogeneous platforms.
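For illustration, here is a short sketch (names are hypothetical; it assumes a window win that exposes one integer on each process and an enclosing fence epoch):

MPI_Win_fence(0, win);
MPI_Get(&rcv_buf, 1, MPI_INT,             /* local receive buffer               */
        left, (MPI_Aint)0, 1, MPI_INT,    /* read one int from rank 'left'      */
        win);
MPI_Win_fence(0, win);                    /* only now may rcv_buf be read       */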


MPI_Accumulate

MPI_Accumulate(origin_address, origin_count, origin_datatype, target_rank, target_disp, target_count, target_datatype, op, win)

As you can see above, MPI_Accumulate has an additional argument: the operation handle op. The accumulate operation is executed atomically per array element on the target window. This means that many origin processes are allowed to call MPI_Accumulate to accumulate their own data atomically into any given element of the window array of a given target process.

With op you define the atomic operation to be used. This operation cannot be user-defined: it has to be one of the predefined reduction operations supplied for MPI_Reduce. These are for example MPI_SUM, MPI_MIN and MPI_PROD, plus MPI_REPLACE, an additional operation defined for one-sided accumulate operations.

Here we have the same advice as before: don’t forget that MPI_Accumulate is nonblocking.
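As a short sketch (names are illustrative; it assumes a window win exposing at least one integer on process 0 and a surrounding fence epoch), every process atomically adds its rank into element 0 of the window of process 0:

MPI_Win_fence(0, win);
MPI_Accumulate(&my_rank, 1, MPI_INT,          /* my contribution                    */
               0, (MPI_Aint)0, 1, MPI_INT,    /* element 0 of the window on rank 0  */
               MPI_SUM, win);
MPI_Win_fence(0, win);    /* afterwards, element 0 of rank 0's window holds the sum of all ranks */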


Now that we have covered the parameters for the most important RMA routines, you should be able to use them in your code!

Answer to questions in the discussion step from last week

Let’s try to formulate an answer to the related questions in the discussion from step 1.11 from last week:

• Where else do the results depend on the order of RMA operations, although such dependencies are not race conditions?

• Could you find the appropriate sections discussing details in the MPI standard?

Accumulate operations are element-wise atomic (see MPI-3.1 Section 11.7.1 Atomicity). In the case of floating point data together with the operations MPI_SUM and MPI_PROD, the rounding errors of the result depend on the sequence of the operations. For example, see the first Advice to users in MPI-3.1 Section 5.9.1 Reduce.

The use of MPI_REPLACE at the same window location is allowed, but of course the result depends on which replace was executed last.

As an application developer, you can influence the order of the MPI_Accumulate calls. You can find the info key accumulate_ordering and its usage described in MPI-3.1, Section 11.7.2 Ordering.

In the next section we will take a look at synchronization with MPI_Win_fence and some Fortran-specific details when using one-sided communications.


Synchronization routines

We will cover different synchronization routines and their usage, and also discuss the Fortran problems with one-sided communication.

2.3 MPI_Win_fence and Fortran-specific features with one-sided communication

MPI_Win_fence

Fence is one of the synchronization models used in active target communication. The MPI_Win_fence routine synchronizes RMA operations on a specified window. It is a collective call over the process group of the window. The fence is like a barrier: it synchronizes a sequence of RMA calls (e.g. put, get, accumulate) and it should be used before and after that sequence.

The definition of the MPI_Win_fence routine is as follows:

MPI_Win_fence(assert, win)

where

• assert is the program assertion
• win is the window object

The assert argument is used to provide optimization hints to the implementation. A value of assert == 0 is always valid. Assert values are specified in the standard, e.g. MPI-3.1 Section 11.5.5 “Assertions” (page 450), and they may be combined with a bitwise OR operation (assert1 | assert2 in C or IOR(assert1, assert2) in Fortran). For performance optimization of the internal cache operations, an application should provide all valid assertions. To be correct, an application must not provide an invalid assertion.
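As an illustration only (whether these assertions are valid depends on what your program actually does around the fences, so treat this as a sketch):

MPI_Win_fence(MPI_MODE_NOPRECEDE, win);   /* no RMA calls precede this fence; must hold on all
                                             processes of the window group                      */
/* ... RMA calls, e.g. MPI_Put / MPI_Get / MPI_Accumulate ... */
MPI_Win_fence(MPI_MODE_NOSTORE | MPI_MODE_NOSUCCEED, win);
                                          /* no local stores into the window since the last
                                             fence, and no RMA calls follow this fence
                                             (NOSUCCEED must also hold on all processes)        */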

The code snippet that updates the nsp value in the example explained in step 1.14 looks like this:

MPI_Win_fence(0, win);
for (idx = 0; idx < nummsgs; idx++)
{
  count = 1; // indicating that 1 message will be sent to process dests[idx]
  MPI_Accumulate(&count, 1, MPI_INT, dests[idx],
                 (MPI_Aint)0, 1, MPI_INT, MPI_SUM, win);
}
MPI_Win_fence(0, win);

Fortran-specific features with one-sided communication

Fortran is a highly optimizing programming language. Therefore, it has some particularities when dealing with one-sided communication, which are due to its register optimization across subroutine calls.


In the example above, the result 999 may be printed instead of the expected 777, because the buff value 999 of process 2 may be stored in register_A for optimization. Even if the contents of buff have been modified by the MPI_Put between the store and the print operation, the stored register value is the one that may be printed.

In order to avoid this, there are two possible options:

• Declare the window memory as module data or in a COMMON block. This option is not available for allocated memory, e.g. using MPI_Alloc_mem or MPI_Win_allocate (or MPI_Win_allocate_shared).

• Declare the window memory, e.g. buff, as ASYNCHRONOUS and add

IF (.NOT. MPI_ASYNC_PROTECTS_NONBLOCKING) CALL MPI_F_SYNC_REG(buff)

before the 1st and after the 2nd MPI_Win_fence in process 2 (marked with dashed red lines). As the window memory (here buff) is visible in an argument list of the unanalyzable call MPI_F_SYNC_REG, register optimization is prohibited.

Additionally, in process 1, MPI_Put is a nonblocking call, and therefore any accesses to bbbb after the second MPI_Win_fence must not be moved by the compiler across that second fence call. For that you have the same possibilities as described for buff, but here for bbbb and only once, at the dashed blue line.

2.4 Post / Start / Complete / Wait

Post/Start/Complete/Wait is another synchronization model that is used in active target communication. There are five MPI routines that will help you to implement this model: MPI_Win_start, MPI_Win_complete, MPI_Win_post and MPI_Win_wait/MPI_Win_test.

Any given process in the process group of the window handle win may issue a call to MPI_Win_post to start an RMA exposure epoch, and thereby allow access to its local window buffer. In order to start an RMA access epoch to the exposed window buffer, a matching MPI_Win_start call from a process within the same process group should be issued on win. An access epoch is finished at the origin by calling MPI_Win_complete, whereas from the target side the routine MPI_Win_wait (not MPI_Wait!) terminates the exposure epoch.


MPI_Win_test is a nonblocking version of MPI_Win_wait. It returns flag == true if all access epochs with matching process group and window have been finalized using MPI_Win_complete. MPI_Win_test should be invoked only where MPI_Win_wait can be invoked.

When using the Start/Complete and Post/Wait synchronization model, all communication partners must be known. A target process can be accessed by several origin processes as shown in the figure, and also an origin process may access several target windows. Therefore, in a call to MPI_Win_post and MPI_Win_start the application must specify the partner processes by providing an appropriate group handle. For example, in our figure the target process must provide a group handle that consists of origin1 and origin2, whereas origin1 and origin2 must provide a group handle that includes only the target process. These group handles can be generated with MPI_Win_get_group + MPI_Group_incl: for further details about these routines, please check the MPI standard, available for download from the MPI Forum.

Local buffers must not be used before their RMA epochs have finished locally. The assert argument may be used for various optimizations. A value of assert == 0 is always valid.

The Post/Start/Complete/Wait synchronization model can be used for active target communication. Symmetric communication is also possible, as only MPI_Win_start and MPI_Win_wait may block.


An MPI implementation is allowed to do different optimizations, such as not to block within MPI_Win_start: this means, to implement a weak synchronization. For more details, you may read MPI-3.1 Section 11.5 Synchronization Calls, especially the explanations for Figure 11.3 on page 439.

Race conditions

In general, there is no atomicity if the targets of two MPI_Put calls overlap. If you execute either two MPI_Put commands, or an MPI_Put and an MPI_Get, by the same origin process or two different origin processes to the same target window address without any synchronization in between the two RMA calls, then the outcome of your MPI program is undefined.

Definitions of these five routines

MPI_Win_start(group, assert, win)

where

• group is the group of TARGET processes
• assert is the program assertion
• win is the window object

MPI_Win_complete(win)

where win is the window object.

MPI_Win_post(group, assert, win)

where

• group is the group of ORIGIN processes
• assert is the program assertion
• win is the window object

MPI_Win_wait(win)

where win is the window object.


MPI_Win_test(win, flag)

with the input and output arguments

• IN: win is the window object
• OUT: flag is the success flag
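Putting these routines together, here is a hedged sketch (all variable names are illustrative and not from the course): every process exposes its window to its right neighbor and reads one integer from the window of its left neighbor; the group handles are built with MPI_Win_get_group and MPI_Group_incl as mentioned above, assuming win was created on MPI_COMM_WORLD so that group ranks coincide with communicator ranks.

MPI_Group win_group, access_group, exposure_group;

MPI_Win_get_group(win, &win_group);
MPI_Group_incl(win_group, 1, &left,  &access_group);    /* I will access left's window   */
MPI_Group_incl(win_group, 1, &right, &exposure_group);  /* right will access my window   */

MPI_Win_post (exposure_group, 0, win);   /* expose my window to 'right'            */
MPI_Win_start(access_group,   0, win);   /* start an access epoch to 'left'        */
MPI_Get(&rcv_buf, 1, MPI_INT, left, (MPI_Aint)0, 1, MPI_INT, win);
MPI_Win_complete(win);                   /* finish my access epoch (origin side)   */
MPI_Win_wait(win);                       /* finish my exposure epoch (target side) */

MPI_Group_free(&access_group);
MPI_Group_free(&exposure_group);
MPI_Group_free(&win_group);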

2.5 Lock/Unlock

Lock/unlock is used in passive target communication, where only the origin process is involved in the communication.

MPI_Win_lock starts an RMA access epoch, during which the window at the process with rank rank can be accessed by RMA operations on win. The matching routine to complete this RMA access epoch is MPI_Win_unlock. RMA operations issued during this period will have completed both at the origin and at the target when the call to MPI_Win_unlock returns.

The use of lock/unlock does not guarantee a sequence. Locks are used to protect access to the locked target window affected by RMA calls issued between the lock and unlock calls, and to protect load/store access to a locked local (or shared memory) window executed between the lock and unlock calls.

Portable programs can use lock calls only for windows in memory allocated by MPI_Alloc_mem, MPI_Win_allocate or MPI_Win_attach. This is because an efficient implementation of passive target communication when memory is not shared may require an asynchronous software agent. Such an agent can be implemented more easily, and can achieve better performance, if it is restricted to specially allocated memory.

Note that MPI_Win_allocate_shared is currently missing from this list; see MPI-3.1 page 448, lines 1-4. This will be fixed in a later version of MPI. However, it is probably already fixed in all existing MPI libraries.

The definitions of the MPI_Win_lock and MPI_Win_unlock routines are as follows:


MPI_Win_lock(lock_type, rank, assert, win)

where

• lock_type is either MPI_LOCK_EXCLUSIVE or MPI_LOCK_SHARED
• rank is the rank of the locked window
• assert is the program assertion
• win is the window object

MPI_Win_unlock(rank, win)

where

• rank is the rank of the window
• win is the window object
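As a minimal sketch of passive target communication (names such as target_rank and win are hypothetical; it assumes the target's window memory was allocated with MPI_Win_allocate or MPI_Alloc_mem as required above), the origin locks the window of one target, puts one integer, and unlocks:

int value = 42;                          /* data to be written into the target window            */
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank, 0, win);
MPI_Put(&value, 1, MPI_INT, target_rank, (MPI_Aint)0, 1, MPI_INT, win);
MPI_Win_unlock(target_rank, win);        /* on return, the put has completed at origin and target */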

Exercise, Summary and Quiz

We have completed our sessions on one-sided communication. Now it is time for some fun exercises and a quiz.

2.6 Exercise: Ring communication with fence

In this exercise, we will pass information between a set of processes that are arranged in a ring.

Initialization:

1. Each process stores its rank (in MPI_COMM_WORLD) into an integer variable snd_buf.

Repeat steps 2-5 with size iterations, where size is the number of processes.

2. Each process passes the content of its snd_buf to its neighbor on the right.

3. It is saved into the rcv_buf of the neighbor process.


4. Each process assigns the value in rcv_buf to its snd_buf as preparation for the next iteration.

5. Each process calculates the sum of all received values.

Result: Each process calculated the sum of all ranks.

The code which shall be revised in this exercise is provided below, in C and Fortran. It uses nonblocking Issend / Recv / Wait communication for steps 2 and 3. You can modify this code by declaring the appropriate buffer as a window and substituting the Issend / Recv / Wait calls with an appropriate one-sided RMA call surrounded by appropriate synchronization calls.

Code for C programmers

#include <stdio.h>
#include <mpi.h>
#define to_right 201

int main (int argc, char *argv[])
{
  int my_rank, size, right, left;
  int snd_buf, rcv_buf, sum, i;
  MPI_Status  status;
  MPI_Request request;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  right = (my_rank+1)      % size;
  left  = (my_rank-1+size) % size;
  /* ... this SPMD-style neighbor computation with modulo has the same meaning as: */
  /* right = my_rank + 1; if (right == size) right = 0;   */
  /* left  = my_rank - 1; if (left  == -1)   left = size-1; */

  /* *** Here, you shall create the appropriate window */

  sum = 0;
  snd_buf = my_rank;

  for( i = 0; i < size; i++)
  {
    /* *** The following 3 lines shall be substituted by
       *** 1-sided communication and synchronization */
    MPI_Issend(&snd_buf, 1, MPI_INT, right, to_right, MPI_COMM_WORLD, &request);
    MPI_Recv(&rcv_buf, 1, MPI_INT, left, to_right, MPI_COMM_WORLD, &status);
    MPI_Wait(&request, &status);

    snd_buf = rcv_buf;
    sum += rcv_buf;
  }

  printf ("PE%i:\tSum = %i\n", my_rank, sum);

  MPI_Finalize();
}

Code for Fortran programmers

PROGRAM ring
  USE mpi_f08
  IMPLICIT NONE

  INTEGER, PARAMETER :: to_right=201
  INTEGER :: my_rank, size, right, left
  INTEGER :: i, sum
  INTEGER, ASYNCHRONOUS :: snd_buf
  INTEGER :: rcv_buf
  TYPE(MPI_Status)  :: status
  TYPE(MPI_Request) :: request
  INTEGER(KIND=MPI_ADDRESS_KIND) :: iadummy

  CALL MPI_Init()
  CALL MPI_Comm_rank(MPI_COMM_WORLD, my_rank)
  CALL MPI_Comm_size(MPI_COMM_WORLD, size)

  right = mod(my_rank+1,      size)
  left  = mod(my_rank-1+size, size)
  ! ... this SPMD-style neighbor computation with modulo has the same meaning as:
  ! right = my_rank + 1; IF (right .EQ. size) right = 0
  ! left  = my_rank - 1; IF (left  .EQ. -1)   left  = size-1

  ! *** Here, you shall create the appropriate window

  sum = 0
  snd_buf = my_rank

  DO i = 1, size
    ! *** The following 4 lines shall be substituted by
    ! *** 1-sided communication and synchronization
    CALL MPI_Issend(snd_buf, 1, MPI_INTEGER, right, to_right, MPI_COMM_WORLD, request)
    CALL MPI_Recv(rcv_buf, 1, MPI_INTEGER, left, to_right, MPI_COMM_WORLD, status)
    CALL MPI_Wait(request, status)
    IF (.NOT.MPI_ASYNC_PROTECTS_NONBLOCKING) CALL MPI_F_sync_reg(snd_buf)

    snd_buf = rcv_buf
    sum = sum + rcv_buf
  END DO

  WRITE(*,*) "PE", my_rank, ": Sum =", sum

  CALL MPI_Finalize()
END PROGRAM

Here are some additional hints to solve the exercise:

1. Use one-sided communication.

2. There are two choices:

• Use of rcv_buf as window (rcv_buf = window)

– MPI_Win_fence: the rcv_buf can be used to receive data

– MPI_Put: to write the content of the local variable snd_buf into the remote window (rcv_buf)

– MPI_Win_fence: the one-sided communication is finished, i.e. snd_buf is read out and rcv_buf is filled in.

• Use of snd_buf as window (snd_buf = window)

– MPI_Win_fence: the snd_buf is filled

– MPI_Get: to read the content of the remote window (snd_buf) into the local rcv_buf

– MPI_Win_fence: the one-sided communication is finished, i.e. snd_buf is read out and rcv_buf is filled in.

3. MPI_Win_create:

• base = reference to your rcv_buf or snd_buf variable

• disp_unit = number of bytes of one int/integer, because this is the datatype of the buffer (= window)

• size = same number of bytes, because buffer size = 1 value

• size and disp_unit have different internal representations, therefore:

– C/C++

MPI_Win_create(&rcv_buf, (MPI_Aint)sizeof(int), sizeof(int),
               MPI_INFO_NULL, ..., &win);

– Fortran:

INTEGER disp_unit
INTEGER (KIND=MPI_ADDRESS_KIND) winsize, lb, extent
CALL MPI_TYPE_GET_EXTENT(MPI_INTEGER, lb, extent, ierror)
...
disp_unit = extent
winsize = disp_unit * 1
CALL MPI_WIN_CREATE(rcv_buf, winsize, disp_unit, &
  &                 MPI_INFO_NULL, ..., ierror)

4. MPI_Put (or MPI_Get):

• target_disp

– C/C++:

MPI_Put(&snd_buf, 1, MPI_INT, right, (MPI_Aint) 0, 1,
        MPI_INT, win);

– Fortran:

INTEGER (KIND=MPI_ADDRESS_KIND) target_disp
target_disp = 0
...
CALL MPI_PUT(snd_buf, 1, MPI_INTEGER, right, &
  &          target_disp, 1, MPI_INTEGER, win, ierror)

• Register problem in Fortran with the destination buffer of nonblocking RMA operations:

– Access to the rcv_buf before the 1st and after the 2nd MPI_WIN_FENCE must not be moved by compiler optimizations across these calls to MPI_WIN_FENCE. Therefore, rcv_buf must be declared as asynchronous:

INTEGER, ASYNCHRONOUS :: rcv_buf

– Additionally, the following code should be called before the 1st and after the 2nd MPI_WIN_FENCE (i.e., two times in total).

IF (.NOT. MPI_ASYNC_PROTECTS_NONBLOCKING) &
  &  CALL MPI_F_SYNC_REG(rcv_buf)

– As MPI_PUT(snd_buf) is nonblocking from the start of the MPI_PUT call until the end of the 2nd MPI_WIN_FENCE, the same is also needed for snd_buf, but only after the 2nd MPI_WIN_FENCE.

INTEGER, ASYNCHRONOUS :: snd_buf
...
IF (.NOT. MPI_ASYNC_PROTECTS_NONBLOCKING) &
  &  CALL MPI_F_SYNC_REG(snd_buf)

– Solution with snd_buf as window: the same MPI_F_SYNC_REG calls, but with snd_buf instead of rcv_buf and vice versa.


The solution with MPI_PUT and window=rcv_buf is illustrated in the figure below:
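For reference, one possible way this solution could look in C is sketched below (this is an illustration following the hints above, with rcv_buf as the window and fence synchronization; it is not the official course solution):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int my_rank, size, right, left;
  int snd_buf, rcv_buf, sum, i;
  MPI_Win win;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  right = (my_rank+1)      % size;
  left  = (my_rank-1+size) % size;

  /* rcv_buf is the window: one int, disp_unit = sizeof(int) */
  MPI_Win_create(&rcv_buf, (MPI_Aint)sizeof(int), sizeof(int),
                 MPI_INFO_NULL, MPI_COMM_WORLD, &win);

  sum = 0;
  snd_buf = my_rank;

  for (i = 0; i < size; i++)
  {
    MPI_Win_fence(0, win);             /* rcv_buf may now be written remotely */
    MPI_Put(&snd_buf, 1, MPI_INT, right, (MPI_Aint)0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);             /* put finished: rcv_buf is valid      */

    snd_buf = rcv_buf;
    sum += rcv_buf;
  }

  printf("PE%i:\tSum = %i\n", my_rank, sum);

  MPI_Win_free(&win);
  MPI_Finalize();
}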

2.7 Course Summary

Excellent job! You have finished the course! We hope you enjoyed it and are happy with your new MPI skills. Let’s recap what you learned on one-sided communication and see what is next.

First we compared one-sided communication with two-sided communication. The advantages of one-sided communication are explained in step 1.4. Potential advantages are the reduction of synchronization, thanks to the possibility of several RMA calls within one epoch, no delay in sending data because of the nonblocking nature of RMA calls, and functional opportunities that help you write scalable codes. Do you remember the example in step 1.14 with the unknown number of sending processes we discussed? In this example, the receiving processes do not have any information about the sending processes, but we were still able to build a scalable solution that provides good performance for any number of processes.

In one-sided communication we have an origin process which executes the RMA routines on the window of a target process. Typically all processes act as both origin and target processes. We explained this in step 1.5.

We discussed three major sets of routines, namely for window creation / allocation (step 1.9 and step 2.1), for RMA and for synchronization. Important RMA routines are Put, Get and Accumulate (see step 1.10 and step 2.2). It is your responsibility to guarantee that no conflicting data accesses happen in your program. This means that you have to take care of synchronization: all RMA routines must be surrounded by synchronization routines, such as Fence or Lock/Unlock, as explained in step 1.13, step 2.3, step 2.4 and step 2.5. And please keep in mind that with these routines you should use assertions for better performance.

If you enjoyed this introductory MPI course and would like to learn more, look out for our upcoming FutureLearn course "MPI-3: A Guide to the New Shared Memory Interface", which will cover shared memory in more comprehensive detail.


And now, how comfortable do you feel with your new skills? Check it out by taking our last quiz, which contains questions on all the topics we covered. Good luck!

2.8 Quiz 4: General summary

This final quiz is a compilation of questions from previous quizzes with some additional challenges. You can expect questions from all the materials that we covered these past two weeks.

We are sure that you are an expert on this type of figure:

Therefore, we won’t explain anything else right now. All the necessary information to solve this can be found in the previous sections and quizzes. So if you have come this far, you know the way.

Let’s go for the final round!
