Collective Communication
Remainder of the Course
1. Why bother with HPC
2. What is MPI
3. Point to point communication
4. User-defined datatypes / Writing parallel code / How to use a super-computer
5. Collective communication (today)
6. Communicators (today)
7. Process topologies
8. File/IO and Parallel profiling
9. Hadoop/Spark
10. More Hadoop / Spark
11. Alternatives to MPI
12. The largest computations in history / general interest / exam prep
Last Time
• Point to point communication
  • Blocking and non-blocking
• User defined datatypes
  • Derived datatypes
• Packing
• General tips
Today
• Collective communication
• Communicators
Set things up once, communicate ‘once’
Collective Communication
Introduction
• Collective communications transmit data among all processes in a communicator.
• Barriers synchronise processes without passing extra data.
• Global communication functions with a variety of patterns
• Global reduction (max, min, sum etc.) across all processes
The communication function and the communicator itself work together to achieve high performance
• Collective communication functions can leverage special optimisations over many point-to-point calls.
Some semantics
• Some collective communication involves a single process sending information to all others
  • This process is the root (typically rank == 0)
• All collective communication functions come in two flavours
  • Simple → Data is stored contiguously
  • Vectored → Can ‘pick and choose’ from an array
We’ll introduce the basic patterns casually before looking at any code
Collective Communication – Broadcast
[Diagram: processes × data. The root holds block A0; after Broadcast, every process holds A0]
Broadcasts a lump of data to every process in a communicator
Collective Communication – Scatter/Gather
[Diagram: the root holds blocks A0 A1 A2 A3; Scatter gives one block to each process, Gather is the reverse]
Scatters an array across multiple processes / Gathers sections of data to a single process
Collective Communication – Allgather
[Diagram: each process holds one block (A0, B0, C0, D0); after Allgather, every process holds the full row A0 B0 C0 D0]
Gathers a split array and gives a copy to all processes
Collective Communication – Alltoall
[Diagram: process i holds its own row of blocks (e.g. A0 A1 A2 A3); after Alltoall, process j holds the j-th block from every row (Aj Bj Cj Dj), i.e. a transpose across processes]
Everyone trades a copy with a friend
Global Communication – Patterns
Three flavours:
• Root sends to all processes (itself included)
  • Broadcast, Scatter
• Root receives data from all processes (itself included)
  • Gather
• Each process communicates with each process (itself included)
  • Allgather and Alltoall
We’ll go through the basics in detail
but extra detail is available
‘Rationale’ – From the Designers
• Collective MPI functions are designed to be consistent with point-to-point communication
• To keep the number of arguments down, these functions are more restrictive than point-to-point functions:
  • The amount of data specified by the sender must exactly match that specified at the receiver
• Collective functions are blocking only
• Collective functions do not have tags → Order of execution matters
These restrictions are part of the MPI standard; an implementation may include extra features (such as automatic synchronisation)
‘Rationale’ – From the Designers
• All processes in a group need to call the same function with the same arguments
• User-defined datatypes must be the same on all processes
• Some functions require a root process, which may have special arguments reserved for the root
• Collective communication can use the same communicators as point-to-point operations. Any point-to-point messages generated by MPI will be kept separate
This allows implementers to either write specific methods for collective communication (exploiting hardware) or use point-to-point calls while keeping your code portable between machines.
Communicators – A Brief Note
We will go through communicators in more detail soon but for now:
• Collective communication revolves around a ‘group’ of processes
• Think of the communicator argument as a group name linked to a communicator
• Collective communication cannot span multiple groups (there are things called inter-communicators)
But we’ll get to this later.
MPI_BCAST(buffer, count, datatype, root, comm)
• buffer (INOUT) starting address of buffer
• count (IN) number of elements in buffer
• datatype (IN) datatype of the buffer
• root (IN) the rank of the root in the communicator
• comm (IN) the communicator
Sends a copy of data specified by the root to all other processes in the communicator.
MPI_BCAST
In a communicator of n processes, works as if
• The root called MPI_SEND n times
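A minimal sketch of a broadcast call (not from the slides; the variable names are illustrative): the root picks a value and every process ends up with a copy.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double timestep = 0.0;             /* only meaningful on the root before the call */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        timestep = 0.005;              /* the root decides the value */

    /* every process, root included, makes the same call */
    MPI_Bcast(&timestep, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("rank %d sees timestep %f\n", rank, timestep);
    MPI_Finalize();
    return 0;
}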
Global Reductions
• Intended to make life easier
• Global reductions perform some numerical operation in a distributed manner and are extremely useful in many cases
  • Analogous to reduction operators in OpenMP
• Many numerical algorithms can replace send/recv with broadcast/reduce given a correct topology
• Some operations which can be performed include:
  • Max
  • Min
  • Sum
  • Product etc. (there are others)
MPI_REDUCE(sendbuf, recvbuf, count, datatype, op, root, comm)
• sendbuf (IN) Address of send buffer
• recvbuf (OUT) Address of receive buffer
• count (IN) The number of elements in the send buffer
• datatype (IN) The datatype of elements in the buffer
• op (IN) *NEW* The reduce operation
• root (IN) Rank of root process
• comm (IN) Communicator
This is best seen through an example
Global Reductions – Reduce
[Diagram: process i holds the row Ai Bi Ci Di; after Reduce (+), the root holds A0+A1+A2+A3, B0+B1+B2+B3, C0+C1+C2+C3, D0+D1+D2+D3]
Combines the elements in all the sendbufs of each process (using an operation) and returns that value to the root.
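As a concrete sketch (illustrative, not from the slides), each process could reduce a locally computed partial sum onto the root like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, root = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = (double)rank;   /* stand-in for a locally computed partial result */
    double global_sum = 0.0;

    /* element-wise sum of every process's local_sum; the result lands only on the root */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, root, MPI_COMM_WORLD);

    if (rank == root)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}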
Global Reductions – AllReduce
[Diagram: as for Reduce, but every process holds the reduced row A0+A1+A2+A3, B0+B1+B2+B3, C0+C1+C2+C3, D0+D1+D2+D3]
Combines the elements in all the sendbufs of each process (using an operation) and returns that value to all processes.
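A small sketch of an allreduce (illustrative; the convergence-test use here is an assumption, not from the slides): every process obtains the global maximum of a per-process value and can branch on it consistently.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_residual = 1.0 / (rank + 1);   /* stand-in for a per-process error estimate */
    double global_residual;

    /* every process receives the maximum residual, so all ranks agree on convergence */
    MPI_Allreduce(&local_residual, &global_residual, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

    if (global_residual < 1e-6) {
        /* all processes take the same branch */
    }

    MPI_Finalize();
    return 0;
}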
Global Reductions – Reduce-Scatter
[Diagram: the reduced row is split up: process 0 receives A0+A1+A2+A3, process 1 receives B0+B1+B2+B3, and so on]
Combines the elements of all processes’ sendbufs (using an operation), then scatters the resulting array in chunks across the n processes
Global Reductions – Scan
[Diagram: process 0 keeps its own row A0 B0 C0 D0; process 1 holds A0+A1, B0+B1, C0+C1, D0+D1; process i holds the running reduction over processes 0..i]
Combines the elements in all the sendbufs of each process and the ‘prior’ result. i.e. Performs a prefix reduction.
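A sketch of a scan in practice (illustrative, not from the slides): an inclusive prefix sum of per-process counts gives each process the starting offset of its own items in a global array.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, count, offset;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    count = rank + 1;   /* stand-in for the number of items this process owns */

    /* inclusive prefix sum: process i receives count_0 + ... + count_i */
    MPI_Scan(&count, &offset, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* offset - count is the exclusive prefix, i.e. where this process's items start */
    printf("rank %d: items start at global index %d\n", rank, offset - count);

    MPI_Finalize();
    return 0;
}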
Custom Reductions
• It is possible to define your own reduction operation, as long as it is associative
  • ‘Gives the same result regardless of the grouping of input’
  • E.g. Max, Min, Avg, etc.
  • E.g. averaging the even numbers in an array, finding the absolute maximum, the absolute average, etc.
• The operation can be marked as commutative
  • The order of the operands doesn’t matter (e.g. Max, Min, Sum, etc.)
• The function must fit a specific definition and is then bound to an OP_HANDLE
• No MPI communication function can be inside your custom reduction
Custom Reductions
typedef void MPI_User_function(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype);
MPI_OP_CREATE(MPI_User_function *function, int commute, MPI_Op *op)
• function (IN) The user defined function
• commute (IN) true if commutative, false otherwise
• op (OUT) The operation
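A sketch of how such a function might be bound to an op handle and used in a reduction. The ‘absolute maximum’ operation below matches the example mentioned above; the function and variable names are illustrative.

#include <mpi.h>
#include <math.h>
#include <stdio.h>

/* combine function: element-wise "largest absolute value" */
void abs_max(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
    double *in = (double *)invec;
    double *inout = (double *)inoutvec;
    for (int i = 0; i < *len; ++i)
        if (fabs(in[i]) > fabs(inout[i]))
            inout[i] = in[i];
}

int main(int argc, char **argv)
{
    int rank;
    double local, result;
    MPI_Op op;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = (rank % 2 == 0) ? -(double)rank : (double)rank;   /* stand-in data */

    MPI_Op_create(abs_max, 1, &op);   /* second argument: 1 = commutative */
    MPI_Reduce(&local, &result, 1, MPI_DOUBLE, op, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("absolute maximum: %f\n", result);
    MPI_Op_free(&op);

    MPI_Finalize();
    return 0;
}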
Summary – Collective Communication
• Collective communication methods provide a simple way to do a lot of work
  • Useful to let MPI exploit knowledge of the machine
• Many different communication patterns
  • Broadcast
  • Scatter/Gather
  • Allgather
  • Alltoall
• Often used to compute a reduction (min, max, sum, OR, etc.)
  • Reduce
  • Allreduce
  • Reduce/Scatter
  • Scan
Communicators
Introduction
• Put simply, a communicator is a group of processes.
• But first, a quick reminder of why MPI exists – To make point to point and collective communication portable between machines.
• At the time, a few key problems existed in the field. Understanding these problems makes understanding MPI easier
Division of Processes
• In some applications, we’d like different groups of processes to do different independent tasks at a very coarse level
  • E.g. use 2/3 of our machine to predict weather patterns and 1/3 to process new data
• Sometimes we divide a task based on data. It makes sense that the operations acting on a part of our data are addressed to the processes holding it
  • E.g. performing operations on the diagonal of a matrix → it would be nice to reference the diagonal by name (no matter how many processes we have)
Avoiding Message Conflicts
• Library routines have had difficulty isolating their messages from other libraries
  • E.g. MPI_ANY_TAG being consumed by the wrong library
• MPI is designed to avoid this: communicators allow a library to segment traffic for itself
  • We don’t always know beforehand which modules will be run, so we need to be able to define these communicators at run time
Extensibility to Users
• Often, computing efficient communication patterns (for an arbitrary machine) given a particular routine is expensive
• But can be reused
• If this pre-computation builds a communicator, we only need to perform that operation once
• Also allows for logical naming of groups
Safety
• By requiring routines to be managed by communicators, MPI implementers can guarantee safe (and hopefully efficient) execution
• I doubt many of you would like to do this armed only with socket programming
Groups
• A group is an ordered set of process identifiers (called processes)
• Each process has an integer rank
• Ranks are contiguous and start at 0
• Some special groups
  • MPI_GROUP_EMPTY – Can be passed to some communication arguments
  • MPI_GROUP_NULL – Returned when a group is freed
Communicators
• A communicator is an opaque (magic sauce) object with a number of rules regarding creation, use and destruction
• Specifies a communication domain (used for point-to-point)
• An intra-communicator is used to communicate within a group and has two main attributes
  • The process group
  • The topology (logical layout of processes) (we’ll cover topologies later)
• An inter-communicator is used to communicate between disjoint groups of processes and has two attributes
  • A pair of process groups
  • No topology
• Communicators can also have user-defined attributes
Communicators
Functionality Intra-communicator Inter-communicator
Number of groups 1 2
Communication safety Yes Yes
Collective operations Yes No
Topologies Yes No
Caching (user-defined data) Yes Yes
Each process needs to build its own image of the world. Communicators hide most of the complexity. This image of the world is called a communication domain.
Communication Domains
• Given by a set of communicators (one at each process) each with the same number of processes (representing the group)
• Allows the address for the ‘1’ process in a group to be logically equivalent for all processes but physically different
  • And importantly, hidden from the user
• If we take all communication domains together we get a complete communication graph
Communication Domain – Example
MPI_COMM_WORLD for three nodes
[Diagram: each of the three processes holds its own copy of the rank labels 0, 1, 2]
Communication Domains – Rationale
• In order to make sure only valid communicators are … constructed
  • All communicator creation routines require a valid initial communicator as input
  • MPI_COMM_WORLD is guaranteed to exist and be correct
  • In other words, any communicator you can create is some subset of MPI_COMM_WORLD
• Using a single global communicator was standard practice pre-MPI and is thus the way much parallel code is written
  • However, more exotic communicator patterns allow for simpler MPI communication calls and hence more efficiency
• N.B. None of the group management functions require inter-process communication → efficient, but important to keep in mind
Group Management
N.B. MPI_COMM_GROUP(comm, &group) returns the group of a communicator
Group Accessors
• MPI_GROUP_SIZE(group, &size) – Returns the size of a group
• MPI_GROUP_RANK(group, &rank) – Returns the rank of the calling process in the group
• MPI_GROUP_TRANSLATE_RANKS(group1, n, ranks1, group2, ranks2)
  • Translates the n ranks in group1 to their counterparts in group2
• MPI_GROUP_COMPARE(group1, group2, result)
  • MPI_IDENT if they are the same object
  • MPI_SIMILAR if the same processes are in both groups with differing ranks
  • MPI_UNEQUAL otherwise
Group Constructors
• MPI_COMM_GROUP(comm, group) – Returns the group corresponding to the communicator
• MPI_GROUP_UNION(group1, group2, newgroup)
  • newgroup will contain all processes in group1 and group2
• MPI_GROUP_INTERSECTION(group1, group2, newgroup)
  • newgroup will contain the processes in both group1 and group2
• MPI_GROUP_DIFFERENCE(group1, group2, newgroup)
  • newgroup will contain the set difference between group1 and group2
Group Destruction
• MPI_GROUP_FREE(group) – returns MPI_GROUP_NULL
Group Management - N.B.
• All processes must call the same group routines with the same arguments for the group to be created
• No communicator is associated with a group by default
  • Needs construction
• However some communicator constructors create groups by themselves
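A sketch tying these routines together (illustrative, not from the slides): build a group of the even world ranks with MPI_Group_incl (a group constructor not listed above) and attach a communicator to it with MPI_COMM_CREATE from the next slide.

#include <mpi.h>

int main(int argc, char **argv)
{
    int world_size;
    MPI_Group world_group, even_group;
    MPI_Comm even_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* every process obtains the group behind MPI_COMM_WORLD */
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    /* build a group containing the even world ranks */
    int n = (world_size + 1) / 2;
    int ranks[n];
    for (int i = 0; i < n; ++i)
        ranks[i] = 2 * i;
    MPI_Group_incl(world_group, n, ranks, &even_group);

    /* all processes call the constructor; only group members get a valid communicator */
    MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);

    if (even_comm != MPI_COMM_NULL) {
        /* collective calls over even_comm go here */
        MPI_Comm_free(&even_comm);
    }

    MPI_Group_free(&even_group);
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}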
Communicator Management
Communicator Accessors
• MPI_COMM_SIZE(comm, size) – Returns the number of processes in the communicator
• MPI_COMM_RANK(comm, rank) – Returns the rank of the calling process in that communicator
• MPI_COMM_COMPARE(comm1, comm2, result)
  • MPI_IDENT – Same processes, same ranks
• MPI_SIMILAR – Same processes, different ranks
• MPI_UNEQUAL – Otherwise
Communicator Constructors
• MPI_COMM_DUP(comm, newcomm) – Duplicates the provided communicator (useful to copy and then manipulate)
• MPI_COMM_CREATE(comm, group, newcomm) – Creates a new intra-communicator using a subset of comm
• MPI_COMM_SPLIT(comm, color, key, newcomm) – Creates separate communicators where processes passing the same ‘color’ are grouped together
  • This is a rather exotic one and is worth thinking about carefully
  • Useful to segment processes into distinct subtasks (see the sketch below)
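A minimal sketch of MPI_COMM_SPLIT (illustrative; the grouping into ‘rows’ of four processes is an assumption, not from the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank;
    MPI_Comm row_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* colour picks the subgroup; key (the world rank) orders processes within it */
    int colour = world_rank / 4;
    MPI_Comm_split(MPI_COMM_WORLD, colour, world_rank, &row_comm);

    int row_rank, row_size;
    MPI_Comm_rank(row_comm, &row_rank);
    MPI_Comm_size(row_comm, &row_size);
    printf("world rank %d -> row %d, row rank %d of %d\n",
           world_rank, colour, row_rank, row_size);

    MPI_Comm_free(&row_comm);
    MPI_Finalize();
    return 0;
}

Every process makes exactly one call; the processes that passed the same colour end up sharing a new intra-communicator.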
Summary
• Collective communication can simplify many common patterns
  • Broadcast/Reduce, Scatter/Gather
• Collective communication is also dependent on the communicator supplied
• Communicators can be used to separate processes into separate jobs
• Communicators are created from groups
Examples
The N-Body Problem
The N-body Problem
• Many simulations compute the interactions between many small objects
• If the force between particles is described completely by adding the forces of all particle pairs together
  • This is the N-body problem
• Good choice for parallelisation
  • O(n) memory
  • O(n²) computation
  • Good speedups for large n
  • Small communication requirements
The N-body Problem
• How do we divide up the work?
• Simple: Divide the number of particles evenly among processes
  • Computing forces requires communication to all processes
  • Only suitable if simulating for a long time or forces are very tricky to compute
• Complex: Dividing the number of particles evenly but dynamically to reduce communication
  • We will not bother with this today
• What we are going to do
  • Define an MPI_PARTICLE datatype
  • Describe our approach
  • Use collective communication routines to implement the approach
N-body – Datatypes
typedef struct{
double x,y,z;
double mass;
} Particle;
Particle particles[MAX_PARTICLES];
MPI_Datatype MPI_PARTICLE;
MPI_Type_contiguous(4, MPI_DOUBLE, &MPI_PARTICLE);
MPI_Type_commit(&MPI_PARTICLE);
N-body – Approach
• Simple approach
  • Exchange all particles
• Compute forces for those locally concerned with
• Repeat
Particle *particleLoc[MAX_PROCS];   /* one receive buffer per process (MAX_PROCS: assumed bound) */
MPI_Comm_size(MPI_COMM_WORLD, &size);
for(int i = 0; i < size; ++i){
    MPI_Send(particles, count, MPI_PARTICLE, i, 0, MPI_COMM_WORLD);
}
for(int i = 0; i < size; ++i){
    MPI_Recv(particleLoc[i], MAX_PARTICLES, MPI_PARTICLE, i, 0, MPI_COMM_WORLD, &status);
}
N-body – Approach
• Simple approach
  • Exchange all particles
  • Compute forces for those locally concerned with
  • Repeat
• Some problems
  • Does not scale with the number of processes
  • May deadlock
  • Needs the locations in particleLoc to be computed beforehand
Particle *particleLoc[MAX_PROCS];   /* one receive buffer per process (MAX_PROCS: assumed bound) */
MPI_Comm_size(MPI_COMM_WORLD, &size);
for(int i = 0; i < size; ++i){
    MPI_Send(particles, count, MPI_PARTICLE, i, 0, MPI_COMM_WORLD);
}
for(int i = 0; i < size; ++i){
    MPI_Recv(particleLoc[i], MAX_PARTICLES, MPI_PARTICLE, i, 0, MPI_COMM_WORLD, &status);
}
N-body – Approach
• Collective communication can solve most of our problems here
• Allgather and Allgatherv allows us to realise this approach efficiently
• First problem: How many particles does each process consider?
  • Fill some array counts[] holding the number of particles for each process
N-body – Approach
int count;              /* number of particles owned by this process */
int counts[MAX_PROCS];  /* one entry per process (MAX_PROCS: assumed upper bound) */
int root = 0;
MPI_Gather(&count, 1, MPI_INT, counts, 1, MPI_INT, root, MPI_COMM_WORLD);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Bcast(counts, size, MPI_INT, root, MPI_COMM_WORLD);
• Allgather accomplishes this maneuver in one call
MPI_ALLGATHER (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
• sendbuf (IN) starting address of send buffer
• sendcount (IN) number of elements to send
• sendtype (IN) datatype of elements
• recvbuf (OUT) starting address of receive buffer
• recvcount (IN) number of elements to receive
• recvtype (IN) datatype of elements
• comm (IN) the communicator
The same as MPI_Gather but all processes receive a copy of the final result
MPI_ALLGATHER
In a communicator of n processes, works as if
• Each process calls MPI_GATHER n times, once with each process as the root
Of note:
• Remember, the sendbuf and recvbuf must be different
• recvcount indicates the number of items received from each process, not in total
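For the counts[] exchange shown earlier, the single-call version might look like this (a sketch; the per-process particle count is a stand-in value):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int count = rank + 1;   /* stand-in for this process's particle count */
    int counts[size];

    /* one call does the gather-then-broadcast of the earlier slide:
       every process ends up with the full counts[] array */
    MPI_Allgather(&count, 1, MPI_INT, counts, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d: counts[0] = %d, counts[%d] = %d\n",
           rank, counts[0], size - 1, counts[size - 1]);

    MPI_Finalize();
    return 0;
}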
N-Body – Approach
• This would work in the case where all processes have the same number of particles
• Obviously, this is not always the case
• Allgatherv allows processes to gather varying lengths of the array
  • Additionally expects the length of each sub-array and its displacement into the final array
• The length of each sub-array is fairly trivial to compute
• The displacement for process i is simply the sum of the counts of processes 0 to i−1
N-body – Approach
displacements[0] = 0;
for(int i = 1; i < size; ++i){
    displacements[i] = counts[i-1] + displacements[i-1];
}
/* allparticles is the receive buffer large enough for every process's particles */
MPI_Allgatherv(myparticles, count, MPI_PARTICLE,
               allparticles, counts, displacements, MPI_PARTICLE,
               MPI_COMM_WORLD);
N-Body – Putting it together
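The extracted slide text does not include the combined code, so here is a sketch of how the pieces above might fit together (MAX_PARTICLES and the uniform per-process count are assumptions):

#include <mpi.h>

#define MAX_PARTICLES 1024   /* assumed upper bound, as in the earlier slides */

typedef struct {
    double x, y, z;
    double mass;
} Particle;

int main(int argc, char **argv)
{
    int size;
    MPI_Datatype MPI_PARTICLE;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Type_contiguous(4, MPI_DOUBLE, &MPI_PARTICLE);
    MPI_Type_commit(&MPI_PARTICLE);

    /* each process owns `count` particles in myparticles[] (values omitted here) */
    int count = MAX_PARTICLES / size;
    Particle myparticles[MAX_PARTICLES];
    Particle allparticles[MAX_PARTICLES];

    int counts[size], displacements[size];

    /* 1. everyone learns how many particles each process owns */
    MPI_Allgather(&count, 1, MPI_INT, counts, 1, MPI_INT, MPI_COMM_WORLD);

    /* 2. displacements are the running sum of the counts */
    displacements[0] = 0;
    for (int i = 1; i < size; ++i)
        displacements[i] = displacements[i - 1] + counts[i - 1];

    /* 3. everyone receives every particle */
    MPI_Allgatherv(myparticles, count, MPI_PARTICLE,
                   allparticles, counts, displacements, MPI_PARTICLE,
                   MPI_COMM_WORLD);

    /* 4. compute forces on the local particles using allparticles[], then repeat */

    MPI_Type_free(&MPI_PARTICLE);
    MPI_Finalize();
    return 0;
}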
N-body – Approach 2
• There is another approach to accomplish the same computation
• Using allgatherv, communication and computation are distinct, non-overlapping phases
• Perhaps using nonblocking communication can give us an advantage?
  • TLDR: Yes
• This is considered an Advanced technique
N-body – Nonblocking Approach
• The simple solution is to create a pipeline of sorts
• Each process
  • Receives some data from the left
  • Sends some data to the right
• While data is arriving, computation on the previous data occurs
while(not_done){
    MPI_Irecv(buf1, ..., source = left, ..., &handles[0]);
    MPI_Isend(buf2, ..., dest = right, ..., &handles[1]);
    <compute on buf2>
    MPI_Waitall(2, handles, statuses);
    <swap buf1 and buf2>
}
N-Body – Nonblocking Approach
• When would this approach be useful?
  • Often simulations involve millions of timesteps
  • Opportunity: each timestep requires a new send and receive to be created and processed
  • It would be nice for MPI to ‘remember’ how it sent / received data before
• MPI supports this (persistent requests) → Advanced manoeuvre
  • Very similar calls to non-blocking communication
• No communication happens however
• Need to call MPI_Start(request) to actually communicate
• Only truly tricky part is handling different communication sizes
N-Body Nonblocking approach
/* Setup */
for(int i = 0; i < size-1; ++i){
    MPI_Send_init(sendbuf, counts[(rank+i)%size], MPI_PARTICLE, right, i,
                  MPI_COMM_WORLD, &request[2*i]);
    MPI_Recv_init(recvbuf, counts[(rank+i-1+size)%size], MPI_PARTICLE, left, i,
                  MPI_COMM_WORLD, &request[2*i+1]);   /* +size keeps the index non-negative */
}
• We setup a persistent non-blocking send / receive for communication to the left and right process using all process numbers as tags
• This exploits the fact that we are setting up all possible communications at once. We do not start all of them
N-Body Nonblocking Approach
• Here, each process in the pipeline is given the chance to send and receive data
  • Computing its own work in the meantime
• Finally, one must free all the communication requests before moving onto more work
/* run pipeline */
while(!done){
    <copy local particles into sendbuf>
    for(int i = 0; i < size; ++i){
        MPI_Status statuses[2];
        if(i != size - 1){
            MPI_Startall(2, &request[2*i]);
        }
        <compute using sendbuf>
        if(i != size - 1){
            MPI_Waitall(2, &request[2*i], statuses);
        }
        <copy recvbuf into sendbuf>
    }
    <compute new particle positions>
}
/* Free Requests */
for(int i = 0; i < 2*(size-1); ++i){
    MPI_Request_free(&request[i]);
}
N-Body – Nonblocking approach
Collective Communication
MPI_GATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
• sendbuf (IN) starting address of send buffer
• sendcount (IN) number of elements to send
• sendtype (IN) datatype of elements
• recvbuf (OUT) starting address of receive buffer
• recvcount (IN) number of elements to receive
• recvtype (IN) datatype of elements
• root (IN) the rank of the communicator root
• comm (IN) the communicator
Each process sends the contents of the send buffer to the root process.
MPI_GATHER
In a communicator of n processes, works as if
• Each process calls MPI_SEND to the root
• The root calls MPI_RECV n times, in rank order
Of note:
• Remember, the sendbuf and recvbuf must be different
• recv arguments are only significant for the root
• recvcount at the root indicates the number of items received from each process, not in total
MPI_SCATTER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
• sendbuf (IN) starting address of send buffer
• sendcount (IN) number of elements to send
• sendtype (IN) datatype of elements
• recvbuf (OUT) starting address of receive buffer
• recvcount (IN) number of elements to receive
• recvtype (IN) datatype of elements
• root (IN) the rank of the communicator root
• comm (IN) the communicator
The root process sends sendcount separate entries from its sendbuf to all other processes in the communicator
MPI_SCATTER
In a communicator of n processes, works as if
• Root called MPI_Send n times
  • For the i-th send, the i-th segment of sendcount items is sent to the i-th process
• Every other process called MPI_Recv once
Of note:
• All arguments are significant to root
• Only recv arguments significant to every other process
MPI_ALLGATHER (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
• sendbuf (IN) starting address of send buffer
• sendcount (IN) number of elements to send
• sendtype (IN) datatype of elements
• recvbuf (OUT) starting address of receive buffer
• recvcount (IN) number of elements to receive
• recvtype (IN) datatype of elements
• comm (IN) the communicator
The same as MPI_Gather but all processes receive a copy of the final result
MPI_ALLGATHER
In a communicator of n processes, works as if
• Each process calls MPI_GATHER n times, once with each process as the root
Of note:
• Remember, the sendbuf and recvbuf must be different
• recvcount indicates the number of items received from each process, not in total
MPI_ALLTOALL (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
• sendbuf (IN) starting address of send buffer
• sendcount (IN) number of elements to send
• sendtype (IN) datatype of elements
• recvbuf (OUT) starting address of receive buffer
• recvcount (IN) number of elements to receive
• recvtype (IN) datatype of elements
• comm (IN) the communicator
The same as MPI_ALLGATHER but each process sends a distinct chunk of data to each other process
MPI_ALLREDUCE(sendbuf, recvbuf, count, datatype, op, comm)
• sendbuf (IN) Address of the send buffer
• recvbuf (OUT) Address of the receive buffer
• count (IN) The number of items to process
• datatype (IN) The type of each element
• op (IN) The reduction operation
• comm (IN) The communicator
Performs the same function as MPI_REDUCE but the result appears in all processes
MPI_REDUCE_SCATTER(sendbuf, recvbuf, recvcounts, datatype, op, comm)
• sendbuf (IN) Address of the sending buffer
• recvbuf (OUT) Address of the receiving buffer
• recvcounts (IN) Integer array indicating how many elements to be reduced/scattered
• datatype (IN) Data type of input buffer
• op (IN) Reduction operation
• comm (IN) The Communicator
Performs an element-wise reduction over the whole send buffer, then scatters the result so that process i receives recvcounts[i] elements.
MPI_SCAN(sendbuf, recvbuf, count, datatype, op, comm)
• sendbuf (IN) Address of the send buffer
• recvbuf (OUT) Address of the receiving buffer
• count (IN) Number of elements in the input buffer
• datatype (IN) The type of each element
• op (IN) The reduction operation
• comm (IN) The communicator
Performs a prefix reduction over processes 0 to n−1: process i receives in its recvbuf the reduction of the send buffers of processes 0 to i (itself included).