
Page 1:

MPI – advanced usage of point-to-point operations

Edgar Gabriel

High Performance Computing Center Stuttgart (HLRS)
[email protected]

Page 2:

Overview

• Point-to-point taxonomy and available functions
• What is the status of a message?
• Non-blocking operations
• Example: 2-D Laplace equation
• Concatenating independent elements into a single message
• Probing for messages

Page 3:

What you’ve learned so far

• Six MPI functions are sufficient for programming a distributed-memory machine

MPI_Init (int *argc, char ***argv);
MPI_Finalize ();

MPI_Comm_rank (MPI_Comm comm, int *rank);
MPI_Comm_size (MPI_Comm comm, int *size);

MPI_Send (void *buf, int count, MPI_Datatype dat,
          int dest, int tag, MPI_Comm comm);

MPI_Recv (void *buf, int count, MPI_Datatype dat,
          int source, int tag, MPI_Comm comm, MPI_Status *status);
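
As an illustration (not part of the original slides), a minimal complete program using only these six functions might look as follows; the transferred value 42 is arbitrary:

#include <mpi.h>
#include <stdio.h>

int main (int argc, char **argv)
{
    int rank, size, buf = 0;
    MPI_Status status;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    MPI_Comm_size (MPI_COMM_WORLD, &size);

    if (size > 1) {
        if (rank == 0) {
            buf = 42;                      /* send one int to rank 1 */
            MPI_Send (&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv (&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf ("rank 1 received %d\n", buf);
        }
    }
    MPI_Finalize ();
    return 0;
}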

Page 4:

So, why not stop here?

• Performance
  – need functions which can fully exploit the capabilities of the hardware
  – need functions to abstract typical communication patterns
• Usability
  – need functions to simplify often recurring tasks
  – need functions to simplify the management of parallel applications

Page 5:

So, why not stop here?

• Performance
  – asynchronous point-to-point operations
  – one-sided operations
  – collective operations
  – derived data-types
  – parallel I/O
  – hints
• Usability
  – process grouping functions
  – environmental and process management
  – error handling
  – object attributes
  – language bindings


Page 7:

Point-to-point operations

• Data exchange between two processes
  – both processes are actively participating in the data exchange → two-sided communication
• Large set of functions defined in MPI-1 (50+)

               Blocking     Non-blocking   Persistent
  Standard     MPI_Send     MPI_Isend      MPI_Send_init
  Synchronous  MPI_Ssend    MPI_Issend     MPI_Ssend_init
  Ready        MPI_Rsend    MPI_Irsend     MPI_Rsend_init
  Buffered     MPI_Bsend    MPI_Ibsend     MPI_Bsend_init

Page 8:

A message consists of…

• the data which is to be sent from the sender to the receiver, described by
  – the beginning of the buffer
  – a data-type
  – the number of elements of the data-type
• the message header (message envelope)
  – rank of the sender process
  – rank of the receiver process
  – the communicator
  – a tag

Page 9:

Rules for point-to-point operations

• Reliability: MPI guarantees that no message gets lost
• Non-overtaking rule: MPI guarantees that two messages posted from process A to process B arrive in the same order as they have been posted
• Message-based paradigm: MPI specifies that a single message cannot be received with more than one Recv operation (in contrast to sockets!)

[Figure: a single 4-element message cannot be split across two Recv buffers]

if (rank == 0 ) {
    MPI_Send(buf, 4, …);
}
if ( rank == 1 ) {
    MPI_Recv(buf, 3, …);
    MPI_Recv(&buf[3], 1, …);
}

Page 10:

Message matching (I)

• How does the receiver know whether the message it just received is the message it was waiting for?
  – the sender of the arriving message has to match the sender of the expected message
  – the tag of the arriving message has to match the tag of the expected message
  – the communicator of the arriving message has to match the communicator of the expected message

Page 11:

Message matching (II)

• What happens if the length of the arriving message does not match the length of the expected message?
  – the length of the message is not used for matching
  – if the received message is shorter than the expected message: no problem
  – if the received message is longer than the expected message:
      – an error code (MPI_ERR_TRUNC) will be returned
      – or your application will be aborted
      – or your application will deadlock
      – or your application writes a core dump

Page 12:

Message matching (III)

• Example 1: correct example

if (rank == 0 ) {
    MPI_Send(buf, 3, MPI_INT, 1, 1, MPI_COMM_WORLD);
}
else if ( rank == 1 ) {
    MPI_Recv(buf, 5, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}

[Figure: the 3-element message fills the first part of the Recv buffer; the remaining elements of the Recv buffer stay untouched]

Page 13:

Message matching (IV)

• Example 2: erroneous example

if (rank == 0 ) {
    MPI_Send(buf, 5, MPI_INT, 1, 1, MPI_COMM_WORLD);
}
else if ( rank == 1 ) {
    MPI_Recv(buf, 3, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}

[Figure: the 5-element message does not fit into the 3-element Recv buffer, potentially writing over the end of the Recv buffer]

Page 14:

Deadlock (I)

• Question: how can two processes safely exchange data at the same time?
• Possibility 1:

  Process 0                 Process 1
  MPI_Send(buf,…);          MPI_Send(buf,…);
  MPI_Recv(buf,…);          MPI_Recv(buf,…);

  – can deadlock, depending on the message length and the capability of the hardware/MPI library to buffer messages

Page 15:

Deadlock (II)

• Possibility 2: re-order MPI functions on one process (a code sketch follows below)

  Process 0                 Process 1
  MPI_Recv(rbuf,…);         MPI_Send(buf,…);
  MPI_Send(buf,…);          MPI_Recv(rbuf,…);

• Other possibilities:
  – asynchronous communication – shown later
  – use buffered send (MPI_Bsend) – not shown here
  – use MPI_Sendrecv – not shown here
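
A sketch of possibility 2 (not from the original slides): both processes can run identical source code if the call order is derived from a rank comparison; it assumes exactly two processes in MPI_COMM_WORLD.

#include <mpi.h>

int main (int argc, char **argv)
{
    int rank, sbuf, rbuf;
    MPI_Status status;

    MPI_Init (&argc, &argv);
    MPI_Comm_rank (MPI_COMM_WORLD, &rank);
    sbuf = rank;                       /* data to exchange          */
    int partner = 1 - rank;            /* rank of the other process */

    if (rank < partner) {              /* lower rank: receive first, then send */
        MPI_Recv (&rbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &status);
        MPI_Send (&sbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
    } else {                           /* higher rank: send first, then receive */
        MPI_Send (&sbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD);
        MPI_Recv (&rbuf, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &status);
    }
    MPI_Finalize ();
    return 0;
}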

Page 16:

Example

• Implementation of a ring using Send/Recv
  – Rank 0 starts the ring

MPI_Comm_rank (comm, &rank);
MPI_Comm_size (comm, &size);

if (rank == 0 ) {
    MPI_Send(buf, 1, MPI_INT, rank+1, 1, comm);
    MPI_Recv(buf, 1, MPI_INT, size-1, 1, comm, &status);
}
else if ( rank == size-1 ) {
    MPI_Recv(buf, 1, MPI_INT, rank-1, 1, comm, &status);
    MPI_Send(buf, 1, MPI_INT, 0, 1, comm);
}
else {
    MPI_Recv(buf, 1, MPI_INT, rank-1, 1, comm, &status);
    MPI_Send(buf, 1, MPI_INT, rank+1, 1, comm);
}

Page 17:

Wildcards

• Question: can I use wildcards for the arguments in Send/Recv?
• Answer:
  – for Send: no
  – for Recv:
      • tag: yes, MPI_ANY_TAG
      • source: yes, MPI_ANY_SOURCE
      • communicator: no
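
A short sketch (not from the original slides) of a receive using both wildcards; buf is assumed to be large enough for any matching message:

int buf[16];
MPI_Status status;

MPI_Recv (buf, 16, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
          MPI_COMM_WORLD, &status);

int src = status.MPI_SOURCE;   /* rank of the actual sender   */
int tag = status.MPI_TAG;      /* tag of the received message */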

Page 18:

Status of a message (I)

• the MPI status contains directly accessible information
  – who sent the message
  – what was the tag
  – what is the error-code of the message
• … and indirectly accessible information through function calls
  – how long is the message
  – has the message been cancelled

Page 19:

Status of a message (II) – usage in C

MPI_Status status;

MPI_Recv ( buf, cnt, MPI_INT, …, &status);

/* directly access source, tag, and error */
src = status.MPI_SOURCE;
tag = status.MPI_TAG;
err = status.MPI_ERROR;

/* determine message length and whether it has been cancelled */
MPI_Get_count (&status, MPI_INT, &rcnt);
MPI_Test_cancelled (&status, &flag);

Page 20:

Status of a message (III) – usage in Fortran

integer status(MPI_STATUS_SIZE)

call MPI_Recv(buf, cnt, MPI_INTEGER, …, status, ierr)

! directly access source, tag, and error
src = status(MPI_SOURCE)
tag = status(MPI_TAG)
err = status(MPI_ERROR)

! determine message length and whether it has been cancelled
call MPI_Get_count (status, MPI_INTEGER, rcnt, ierr)
call MPI_Test_cancelled (status, flag, ierr)

Page 21:

Status of a message (IV)

• If you are not interested in the status, you can pass
  – MPI_STATUS_IGNORE
  – MPI_STATUSES_IGNORE
  to MPI_Recv and all other MPI functions which return a status
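
For example (added sketch, not from the slides):

/* no status information is needed here */
MPI_Recv (buf, cnt, MPI_INT, src, tag, comm, MPI_STATUS_IGNORE);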

Page 22:

Non-blocking operations (I)

• A regular MPI_Send returns when “… the data is safely stored away”
• A regular MPI_Recv returns when the data is fully available in the receive buffer
• Non-blocking operations initiate the Send and Receive operations, but do not wait for their completion
• Functions which check or wait for the completion of an initiated communication have to be called explicitly
• Since the functions initiating communication return immediately, these MPI functions carry an I prefix (e.g. MPI_Isend or MPI_Irecv)

Page 23:

Non-blocking operations (II)

MPI_Isend (void *buf, int cnt, MPI_Datatype dat, int dest,
           int tag, MPI_Comm comm, MPI_Request *req);

MPI_Irecv (void *buf, int cnt, MPI_Datatype dat, int src,
           int tag, MPI_Comm comm, MPI_Request *req);
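
A small usage sketch (not from the original slides); partner, tag and N are assumed to be set up elsewhere, and the completion call MPI_Waitall is introduced on the following pages:

MPI_Request reqs[2];
double      sbuf[N], rbuf[N];

MPI_Irecv (rbuf, N, MPI_DOUBLE, partner, tag, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend (sbuf, N, MPI_DOUBLE, partner, tag, MPI_COMM_WORLD, &reqs[1]);

/* ... work that touches neither sbuf nor rbuf ... */

MPI_Waitall (2, reqs, MPI_STATUSES_IGNORE);   /* buffers may be reused afterwards */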

Page 24:

Non-blocking operations (III)

• After initiating a non-blocking communication, it is not allowed to touch (= modify) the communication buffer until completion
  – you cannot make any assumptions about when the message will really be transferred
• All immediate functions take an additional argument, a request
• A request uniquely identifies an ongoing communication and has to be used if you want to check/wait for the completion of a posted communication

Page 25:

Completion functions (I)

• Functions waiting for completion
  MPI_Wait     – wait for one communication to finish
  MPI_Waitall  – wait for all comm. of a list to finish
  MPI_Waitany  – wait for one comm. of a list to finish
  MPI_Waitsome – wait for at least one comm. of a list to finish
• Content of the status is not defined for Send operations

MPI_Wait (MPI_Request *req, MPI_Status *stat);
MPI_Waitall (int cnt, MPI_Request *reqs, MPI_Status *stats);
MPI_Waitany (int cnt, MPI_Request *reqs, int *index, MPI_Status *stat);
MPI_Waitsome (int incnt, MPI_Request *reqs, int *outcnt, int *indices,
              MPI_Status *stats);
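
A sketch (not from the original slides) of a typical MPI_Waitany loop; it assumes that NREQ requests have already been posted into reqs:

int i, index;
MPI_Status status;

for (i = 0; i < NREQ; i++) {
    MPI_Waitany (NREQ, reqs, &index, &status);
    /* reqs[index] has completed and is set to MPI_REQUEST_NULL;
       process the data belonging to that request here */
}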

Page 26:

Completion functions (II)

• Test functions verify whether a communication is complete
  MPI_Test     – check whether a comm. has finished
  MPI_Testall  – check whether all comm. of a list have finished
  MPI_Testany  – check whether one of a list of comm. has finished
  MPI_Testsome – check how many of a list of comm. have finished

MPI_Test (MPI_Request *req, int *flag, MPI_Status *stat);
MPI_Testall (int cnt, MPI_Request *reqs, int *flag, MPI_Status *stats);
MPI_Testany (int cnt, MPI_Request *reqs, int *index, int *flag,
             MPI_Status *stat);
MPI_Testsome (int incnt, MPI_Request *reqs, int *outcnt, int *indices,
              MPI_Status *stats);
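
A sketch (not from the original slides) of polling with MPI_Test; req is assumed to belong to a previously posted MPI_Irecv, and rbuf must not be touched inside the loop:

int flag = 0;
MPI_Status status;

while (!flag) {
    MPI_Test (&req, &flag, &status);
    if (!flag) {
        /* ... do some useful computation that does not use rbuf ... */
    }
}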

Page 27:

Deadlock problem revisited

• Question: how can two processes safely exchange data at the same time?
• Possibility 3: usage of non-blocking operations

  Process 0                        Process 1
  MPI_Irecv(rbuf,…, &req);         MPI_Irecv(rbuf,…, &req);
  MPI_Send (buf,…);                MPI_Send (buf,…);
  MPI_Wait (&req, &status);        MPI_Wait (&req, &status);

• Note:
  – you have to use 2 separate buffers!
  – many different ways of formulating this scenario
  – identical code for both processes

Page 28:

Example – 2D Laplace equation (I)

• 2-D Laplace equation

  ∆u = 0

• Central discretization leads to

  (u_{i-1,j} − 2 u_{i,j} + u_{i+1,j}) / ∆x² + (u_{i,j-1} − 2 u_{i,j} + u_{i,j+1}) / ∆y² = 0

[Figure: 5-point stencil around (i,j) with the neighbors (i−1,j), (i+1,j), (i,j−1), (i,j+1)]

Page 29:

Example – 2D Laplace equation (II)

• Parallel domain decomposition
• Data exchange at process boundaries required
  – not assuming periodic boundary conditions here

Page 30:

Example – 2D Laplace equation (III)

• Halo cells:
  – store a copy of the data which is held by another process, but which is required for the computation of the local data
  – how to implement the communication of this scheme efficiently?

Page 31:

Example – 2-D Laplace equation (IV)

• Process mapping and determining neighbor processes
• At boundaries: set the rank of the corresponding neighbor to MPI_PROC_NULL
  – a message sent to MPI_PROC_NULL will be ignored by the MPI library
• Hint: look at the Cartesian topology functions for another method to perform the same operations

[Figure: 4 × 3 process grid; ranks 0–11 with coordinates (x,y), rank 0 = (0,0) at the bottom left, rank 11 = (3,2) at the top right]

With np_x the number of processes in x-direction and np_y the number of processes in y-direction:

  n_left  = rank − 1
  n_right = rank + 1
  n_down  = rank − np_x
  n_up    = rank + np_x
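
A sketch (not from the original slides) of the neighbor computation for this grid; np_x and np_y are assumed to be known and np_x * np_y == size:

int x = rank % np_x;                        /* own coordinates in the process grid */
int y = rank / np_x;

int nleft  = (x == 0)        ? MPI_PROC_NULL : rank - 1;
int nright = (x == np_x - 1) ? MPI_PROC_NULL : rank + 1;
int ndown  = (y == 0)        ? MPI_PROC_NULL : rank - np_x;
int nup    = (y == np_y - 1) ? MPI_PROC_NULL : rank + np_x;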

Page 32:

Laplace equation – communication in y-direction

• u(i,j) is stored in a matrix  !! assuming C !!
• With n_xlocal the number of local points in x-direction and n_ylocal the number of local points in y-direction, the dimension of u on an inner process (= not being at a boundary) is

  u(n_xlocal + 2, n_ylocal + 2)

• with

  u(1 : n_xlocal, 1 : n_ylocal)

  containing the local data

Page 33:

Laplace equation – communication in y-direction

MPI_Request req[4];

/* receive the halo rows from the upper and lower neighbor */
MPI_Irecv(&u[1][nylocal+1], nxlocal, MPI_DOUBLE, nup,   tag, comm, &req[0]);
MPI_Irecv(&u[1][0],         nxlocal, MPI_DOUBLE, ndown, tag, comm, &req[1]);

/* send the first and last inner row to the upper and lower neighbor */
MPI_Isend(&u[1][nylocal],   nxlocal, MPI_DOUBLE, nup,   tag, comm, &req[2]);
MPI_Isend(&u[1][1],         nxlocal, MPI_DOUBLE, ndown, tag, comm, &req[3]);

MPI_Waitall (4, req, MPI_STATUSES_IGNORE);

Page 34:

Laplace equation – communication in x-direction

• Problem: the data which we have to send is not contiguous in memory
• Logical view of the matrix
• Layout in memory of the same matrix (in C)

Page 35:

Laplace equation – communication in x-direction

• How to implement the halo-cell exchange in x-direction?
  – Send/Recv every element in a separate message
      + works
      − very slow
  – copy the data into a separate vector/array and send this array
      + works
      − a more general interface is provided by MPI to pack data into a contiguous buffer before sending
      − an even more general interface – derived datatypes – is provided by MPI to avoid user-level packing and unpacking of messages (not handled here)

Page 36:

Packing a message

• MPI_Pack copies incount elements of type dat from inbuf into the user-provided buffer outbuf
  – outbuf has to be large enough to hold the data (its size in bytes is passed as outsize)
  – pos contains the position of the last packed data in outbuf; it has to be initialized to zero before the first usage
  – can be called several times to pack independent pieces of data
• Send and receive a message which has been packed using the MPI datatype MPI_PACKED

MPI_Pack (void *inbuf, int incount, MPI_Datatype dat,
          void *outbuf, int outsize, int *pos, MPI_Comm comm);

Page 37:

Packing a message (II)

outbuf before pack, pos = 0

MPI_Pack(inbuf1, 1, MPI_INT, outbuf, outsize, &pos, comm);

outbuf after 1st pack, pos = 6 (pos points behind an MPI-internal header and the packed int)

MPI_Pack(inbuf2, 1, MPI_FLOAT, outbuf, outsize, &pos, comm);

outbuf after 2nd pack, pos = 10

Page 38:

Unpacking a message

• MPI_Unpack copies outcount elements of type dat from inbuf into the user-provided buffer outbuf
  – inbuf holds the whole packed message (insize bytes)
  – pos contains the position of the last unpacked data in inbuf; it has to be initialized to zero before the first usage
  – can be called several times to unpack independent pieces of data

MPI_Unpack (void *inbuf, int insize, int *pos, void *outbuf,
            int outcount, MPI_Datatype dat, MPI_Comm comm);

Page 39:

Determining the size of the pack-buffer

• MPI_Pack_size returns in size the number of bytes required to pack incount elements of type dat using MPI_Pack
  – size might not be identical to incount * sizeof(original datatype)
  – several calls to MPI_Pack_size are required if you plan to pack more than one type of data
      • sum up the returned sizes
      • you can use size e.g. to malloc a buffer (see the sketch below)

MPI_Pack_size (int incount, MPI_Datatype dat, MPI_Comm comm, int *size);
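
A sketch (not from the original slides): sizing and allocating a buffer that will hold one int followed by nylocal doubles; comm and nylocal are assumed to exist:

int  s1, s2, bufsize;
char *buffer;

MPI_Pack_size (1,       MPI_INT,    comm, &s1);
MPI_Pack_size (nylocal, MPI_DOUBLE, comm, &s2);
bufsize = s1 + s2;                       /* sum of the individual contributions */
buffer  = (char *) malloc (bufsize);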

Page 40:

Laplace equation – communication in x-direction (I)

double *sbufleft, *sbufright, *rbufleft, *rbufright;
int bufsize, posleft = 0, posright = 0;

/* determine the required buffer size and allocate the buffers */
MPI_Pack_size (nylocal, MPI_DOUBLE, comm, &bufsize);
sbufleft  = malloc(bufsize);
sbufright = malloc(bufsize);
rbufleft  = malloc(bufsize);
rbufright = malloc(bufsize);

/* Pack the data before sending */
for (i=1; i<nylocal+1; i++) {
    MPI_Pack (&u[nxlocal][i], 1, MPI_DOUBLE, sbufright, bufsize,
              &posright, comm);
    MPI_Pack (&u[1][i],       1, MPI_DOUBLE, sbufleft,  bufsize,
              &posleft, comm);
}

Page 41:

Laplace equation – communication in x-direction (II)

/* Execute now the real communication */
MPI_Irecv(rbufleft,  bufsize,  MPI_PACKED, nleft,  tag, comm, &req[0]);
MPI_Irecv(rbufright, bufsize,  MPI_PACKED, nright, tag, comm, &req[1]);
MPI_Isend(sbufleft,  posleft,  MPI_PACKED, nleft,  tag, comm, &req[2]);
MPI_Isend(sbufright, posright, MPI_PACKED, nright, tag, comm, &req[3]);
MPI_Waitall (4, req, MPI_STATUSES_IGNORE);

/* Unpack the received data */
posright = posleft = 0;
for (i=1; i<nylocal+1; i++) {
    MPI_Unpack (rbufright, bufsize, &posright,
                &u[nxlocal+1][i], 1, MPI_DOUBLE, comm);
    MPI_Unpack (rbufleft,  bufsize, &posleft,
                &u[0][i],         1, MPI_DOUBLE, comm);
}

Page 42:

Overlapping communication and computation

Default algorithm:
• Data exchange
• Execute the calculation over the whole domain at once

Page 43:

Overlapping communication and computation (II)

Alternative algorithm (a code sketch follows below):
• Initiate communication (MPI_Isend/MPI_Irecv)
• Calculate inner values
• Finish communication (MPI_Waitall)
• Calculate boundary values
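
A sketch of this pattern (not from the original slides); post_halo_exchange, update_inner_points and update_boundary_points are hypothetical helper functions standing in for the code of the previous pages:

MPI_Request reqs[4];

post_halo_exchange (u, reqs);               /* hypothetical: 2 x MPI_Irecv + 2 x MPI_Isend */
update_inner_points (u);                    /* hypothetical: uses no halo data             */
MPI_Waitall (4, reqs, MPI_STATUSES_IGNORE); /* halo cells are now valid                    */
update_boundary_points (u);                 /* hypothetical: uses the received halo data   */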

Page 44:

Using more than one ghostcell

• Use n ghost cells and communicate only every n iterations
• Example for n = 2

[Figure: first iteration, second iteration, communication]

Page 45:

What else is there?

• Various send modes
  – buffered send, synchronous send, ready send
• Persistent request operations
• Sendrecv functions
  – MPI_Sendrecv, MPI_Sendrecv_replace
• Probing for a message (see the sketch below)
  – MPI_Probe, MPI_Iprobe
• Cancelling a message
  – MPI_Cancel
• Derived datatypes
• One-sided communication
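
As a pointer for the probing functions listed above, a sketch (not from the original slides) of receiving a message of unknown length; comm is assumed to exist:

MPI_Status status;
int        count;
double    *buf;

MPI_Probe (MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &status);     /* wait until a message arrives */
MPI_Get_count (&status, MPI_DOUBLE, &count);                /* how long is it?              */
buf = (double *) malloc (count * sizeof(double));
MPI_Recv (buf, count, MPI_DOUBLE, status.MPI_SOURCE,
          status.MPI_TAG, comm, &status);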