Collective Communication
Remainder of the Course
1. Why bother with HPC
2. What is MPI
3. Point to point communication
4. User-defined datatypes / Writing parallel code / How to use a super-computer
5. Collective communication (today)
6. Communicators (today)
7. Process topologies
8. File/IO and Parallel profiling
9. Hadoop/Spark
10. More Hadoop / Spark
11. Alternatives to MPI
12. The largest computations in history / general interest / exam prep
Last Time
• Point to point communication
  • Blocking and non-blocking
• User defined datatypes
  • Derived datatypes
• Packing
• General tips
Today
• Collective communication
• Communicators
Set things up once, communicate ‘once’
Collective Communication
Introduction
• Collective communications transmit data among all processes in a communicator.
• Barriers synchronise processes without passing extra data.
• Global communication functions with a variety of patterns
• Global reduction (max, min, sum etc.) across all processes
The communication function and the communicator itself work together to achieve high performance
• Collective communication functions can leverage special optimisations over many point-to-point calls.
Some semantics
• Some collective communication involves a single process sending information to all others
  • This process is the root (typically rank == 0)
• All collective communication functions come in two flavours
  • Simple → Data is stored contiguously
  • Vectored → Can ‘pick and choose’ from an array
We’ll introduce the basic patterns casually before looking at any code
Collective Communication – Broadcast
[Diagram: processes × data. The root holds block A0; after Broadcast, every process holds A0]
Broadcasts a lump of data to every process in a communicator
Collective Communication – Scatter/Gather
[Diagram: the root holds blocks A0 A1 A2 A3; Scatter gives one block to each process, Gather is the reverse]
Scatters an array across multiple processes / Gathers sections of data to a single process
Collective Communication – Allgather
[Diagram: each process holds one block (A0, B0, C0, D0); after Allgather, every process holds the full row A0 B0 C0 D0]
Gathers a split array and gives a copy to all processes
Collective Communication – Alltoall
[Diagram: process i holds its own row of blocks (e.g. A0 A1 A2 A3); after Alltoall, process j holds the j-th block from every row (Aj Bj Cj Dj), i.e. a transpose across processes]
Everyone trades a copy with a friend
Global Communication – Patterns
Three flavours:
• Root sends to all processes (itself included)
  • Broadcast, Scatter
• Root receives data from all processes (itself included)
  • Gather
• Each process communicates with each process (itself included)
  • Allgather and Alltoall
We’ll go through the basics in detail
but extra detail is available
‘Rationale’ – From the Designers
• Collective MPI functions are designed to be consistent with point-to-point communication
• To keep the number of arguments down, these functions are more restrictive than point-to-point functions:
  • The amount of data specified by the sender must exactly match that specified at the receiver
• Collective functions are blocking only
• Collective functions do not have tags → Order of execution matters
These restrictions are part of the MPI standard; an implementation may include extra features (such as automatic synchronisation)
‘Rationale’ – From the Designers
• All processes in a group need to call the same function with the same arguments
• User-defined datatypes must be the same on all processes
• Some functions require a root process, which may have special arguments reserved for the root
• Collective communication can use the same communicators as point-to-point operations. Any point-to-point messages generated by MPI will be kept separate
This allows implementers to either write specific methods for collective communication (exploiting hardware) or use point-to-point calls while keeping your code portable between machines.
Communicators – A Brief Note
We will go through communicators in more detail soon but for now:
• Collective communication revolves around a ‘group’ of processes
• Think of the communicator argument as a group name linked to a communicator
• Collective communication cannot span multiple groups (there are things called inter-communicators)
But we’ll get to this later.
MPI_BCAST(buffer, count, datatype, root, comm)
• buffer (INOUT) starting address of buffer
• count (IN) number of elements in buffer
• datatype (IN) datatype of the buffer
• root (IN) the rank of the root in the communicator
• comm (IN) the communicator
Sends a copy of data specified by the root to all other processes in the communicator.
MPI_BCAST
In a communicator of n processes, works as if
• The root called MPI_SEND n times
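A minimal sketch of a broadcast call (not from the slides; the variable names are illustrative): the root picks a value and every process ends up with a copy.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double timestep = 0.0;             /* only meaningful on the root before the call */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        timestep = 0.005;              /* the root decides the value */

    /* every process, root included, makes the same call */
    MPI_Bcast(&timestep, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("rank %d sees timestep %f\n", rank, timestep);
    MPI_Finalize();
    return 0;
}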
Global Reductions
• Intended to make life easier
• Global reductions perform some numerical operation in a distributed manner and are extremely useful in many cases
  • Analogous to reduction operators in OpenMP
• Many numerical algorithms can replace send/recv with broadcast/reduce given a correct topology
• Some operations which can be performed include:
  • Max
  • Min
  • Sum
  • Product etc. (there are others)
MPI_REDUCE(sendbuf, recvbuf, count, datatype, op, root, comm)
• sendbuf (IN) Address of send buffer
• recvbuf (OUT) Address of receive buffer
• count (IN) The number of elements in the send buffer
• datatype (IN) The datatype of elements in the buffer
• op (IN) *NEW* The reduce operation
• root (IN) Rank of root process
• comm (IN) Communicator
This is best seen through an example
Global Reductions – Reduce
[Diagram: process i holds the row Ai Bi Ci Di; after Reduce (+), the root holds A0+A1+A2+A3, B0+B1+B2+B3, C0+C1+C2+C3, D0+D1+D2+D3]
Combines the elements in all the sendbufs of each process (using an operation) and returns that value to the root.
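As a concrete sketch (illustrative, not from the slides), each process could reduce a locally computed partial sum onto the root like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, root = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = (double)rank;   /* stand-in for a locally computed partial result */
    double global_sum = 0.0;

    /* element-wise sum of every process's local_sum; the result lands only on the root */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, root, MPI_COMM_WORLD);

    if (rank == root)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}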
Global Reductions – AllReduce
[Diagram: as for Reduce, but every process holds the reduced row A0+A1+A2+A3, B0+B1+B2+B3, C0+C1+C2+C3, D0+D1+D2+D3]
Combines the elements in all the sendbufs of each process (using an operation) and returns that value to all processes.
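A small sketch of an allreduce (illustrative; the convergence-test use here is an assumption, not from the slides): every process obtains the global maximum of a per-process value and can branch on it consistently.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_residual = 1.0 / (rank + 1);   /* stand-in for a per-process error estimate */
    double global_residual;

    /* every process receives the maximum residual, so all ranks agree on convergence */
    MPI_Allreduce(&local_residual, &global_residual, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);

    if (global_residual < 1e-6) {
        /* all processes take the same branch */
    }

    MPI_Finalize();
    return 0;
}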
Global Reductions – Reduce-Scatter
[Diagram: the reduced row is split up: process 0 receives A0+A1+A2+A3, process 1 receives B0+B1+B2+B3, and so on]
Combines the elements of all processes’ sendbufs (using an operation), then scatters the resulting array in chunks across the n processes
Global Reductions – Scan
[Diagram: process 0 keeps its own row A0 B0 C0 D0; process 1 holds A0+A1, B0+B1, C0+C1, D0+D1; process i holds the running reduction over processes 0..i]
Combines the elements in all the sendbufs of each process and the ‘prior’ result. i.e. Performs a prefix reduction.
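A sketch of a scan in practice (illustrative, not from the slides): an inclusive prefix sum of per-process counts gives each process the starting offset of its own items in a global array.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, count, offset;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    count = rank + 1;   /* stand-in for the number of items this process owns */

    /* inclusive prefix sum: process i receives count_0 + ... + count_i */
    MPI_Scan(&count, &offset, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* offset - count is the exclusive prefix, i.e. where this process's items start */
    printf("rank %d: items start at global index %d\n", rank, offset - count);

    MPI_Finalize();
    return 0;
}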
Custom Reductions
• It is possible to define your own reduction operation, as long as it is associative
  • ‘Gives the same result regardless of the grouping of input’
  • E.g. Max, Min, Avg, etc.
  • E.g. averaging the even numbers in an array, finding the absolute maximum, the absolute average, etc.
• The operation can be marked as commutative
  • The order of the operands doesn’t matter (e.g. Max, Min, Sum, etc.)
• The function must fit a specific definition and is then bound to an OP_HANDLE
• No MPI communication function can be inside your custom reduction
Custom Reductions
typedef void MPI_User_function(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype);
MPI_OP_CREATE(MPI_User_function *function, int commute, MPI_Op *op)
• function (IN) The user defined function
• commute (IN) true if commutative, false otherwise
• op (OUT) The operation
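A sketch of how such a function might be bound to an op handle and used in a reduction. The ‘absolute maximum’ operation below matches the example mentioned above; the function and variable names are illustrative.

#include <mpi.h>
#include <math.h>
#include <stdio.h>

/* combine function: element-wise "largest absolute value" */
void abs_max(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
    double *in = (double *)invec;
    double *inout = (double *)inoutvec;
    for (int i = 0; i < *len; ++i)
        if (fabs(in[i]) > fabs(inout[i]))
            inout[i] = in[i];
}

int main(int argc, char **argv)
{
    int rank;
    double local, result;
    MPI_Op op;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = (rank % 2 == 0) ? -(double)rank : (double)rank;   /* stand-in data */

    MPI_Op_create(abs_max, 1, &op);   /* second argument: 1 = commutative */
    MPI_Reduce(&local, &result, 1, MPI_DOUBLE, op, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("absolute maximum: %f\n", result);
    MPI_Op_free(&op);

    MPI_Finalize();
    return 0;
}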
Summary – Collective Communication
• Collective communication methods provide a simple way to do a lot of work
  • Useful to let MPI exploit knowledge of the machine
• Many different communication patterns
  • Broadcast
  • Scatter/Gather
  • Allgather
  • Alltoall
• Often used to compute a reduction (min, max, sum, OR, etc.)
  • Reduce
  • Allreduce
  • Reduce/Scatter
  • Scan
Communicators
Introduction
• Put simply, a communicator is a group of processes.
• But first, a quick reminder of why MPI exists – To make point to point and collective communication portable between machines.
• At the time, a few key problems existed in the field. Understanding these problems makes understanding MPI easier
Division of Processes
• In some applications, we’d like different groups of processes to do different independent tasks at a very coarse level
  • E.g. use 2/3 of our machine to predict weather patterns and 1/3 to process new data
• Sometimes we divide a task based on data. It makes sense that the operations acting on a part of our data are addressed to the processes holding it
  • E.g. performing operations on the diagonal of a matrix → it would be nice to reference the diagonal by name (no matter how many processes we have)
Avoiding Message Conflicts
• Library routines have had difficulty isolating their messages from other libraries
  • E.g. MPI_ANY_TAG being consumed by the wrong library
• MPI is designed to avoid this: communicators allow a library to segment traffic for itself
  • We don’t always know beforehand which modules will be run, so we need to be able to define these communicators at run time
Extensibility to Users
• Often, computing efficient communication patterns (for an arbitrary machine) given a particular routine is expensive
• But can be reused
• If this pre-computation builds a communicator, we only need to perform that operation once
• Also allows for logical naming of groups
Safety
• By requiring routines to be managed by communicators, MPI implementers can guarantee safe (and hopefully efficient) execution
• I doubt many of you would like to do this armed only with socket programming
Groups
• A group is an ordered set of process identifiers (called processes)
• Each process has an integer rank
• Ranks are contiguous and start at 0
• Some special groups
  • MPI_GROUP_EMPTY – Can be passed to some communication arguments
  • MPI_GROUP_NULL – Returned when a group is freed
Communicators
• A communicator is an opaque (magic sauce) object with a number of rules regarding creation, use and destruction
• Specifies a communication domain (used for point-to-point)
• An intra-communicator is used to communicate within a group and has two main attributes
  • The process group
  • The topology (logical layout of processes) (we’ll cover topologies later)
• An inter-communicator is used to communicate between disjoint groups of processes and has two attributes
  • A pair of process groups
  • No topology
• Communicators can also have user-defined attributes
Communicators
Functionality Intra-communicator Inter-communicator
Number of groups 1 2
Communication safety Yes Yes
Collective operations Yes No
Topologies Yes No
Caching (user-defined data) Yes Yes
Each process needs to build its own image of the world. Communicators hide most of the complexity. This image of the world is called a communication domain.
Communication Domains
• Given by a set of communicators (one at each process) each with the same number of processes (representing the group)
• Allows the address for the ‘1’ process in a group to be logically equivalent for all processes but physically different
  • And importantly, hidden from the user
• If we take all communication domains together we get a complete communication graph
Communication Domain – Example
MPI_COMM_WORLD for three nodes
[Diagram: each of the three processes holds its own copy of the rank labels 0, 1, 2]
Communication Domains – Rationale
• In order to make sure only valid communicators are … constructed
  • All communicator creation routines require a valid initial communicator as input
  • MPI_COMM_WORLD is guaranteed to exist and be correct
  • In other words, any communicator you can create is some subset of MPI_COMM_WORLD
• Using a single global communicator was standard practice pre-MPI and is thus the way much parallel code is written
  • However, more exotic communicator patterns allow for simpler MPI communication calls and hence more efficiency
• N.B. None of the group management functions require inter-process communication → efficient, but important to keep in mind
Group Management
N.B. MPI_COMM_GROUP(comm, &group) returns the group of a communicator
Group Accessors
• MPI_GROUP_SIZE(group, &size) – Returns the size of a group
• MPI_GROUP_RANK(group, &rank) – Returns the rank of the calling process in the group
• MPI_GROUP_TRANSLATE_RANKS(group1, n, ranks1, group2, ranks2)
  • Translates the n ranks in group1 to their counterparts in group2
• MPI_GROUP_COMPARE(group1, group2, result)
  • MPI_IDENT if they are the same object
  • MPI_SIMILAR if the same processes are in both groups with differing ranks
  • MPI_UNEQUAL otherwise
Group Constructors
• MPI_COMM_GROUP(comm, group) – Returns the group corresponding to the communicator
• MPI_GROUP_UNION(group1, group2, newgroup)
  • newgroup will contain all processes in group1 and group2
• MPI_GROUP_INTERSECTION(group1, group2, newgroup)
  • newgroup will contain the processes in both group1 and group2
• MPI_GROUP_DIFFERENCE(group1, group2, newgroup)
  • newgroup will contain the set difference between group1 and group2
Group Destruction
• MPI_GROUP_FREE(group) – returns MPI_GROUP_NULL
Group Management - N.B.
• All processes must call the same group routines with the same arguments for the group to be created
• No communicator is associated with a group by default
  • Needs construction
• However some communicator constructors create groups by themselves
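A sketch tying these routines together (illustrative, not from the slides): build a group of the even world ranks with MPI_Group_incl (a group constructor not listed above) and attach a communicator to it with MPI_COMM_CREATE from the next slide.

#include <mpi.h>

int main(int argc, char **argv)
{
    int world_size;
    MPI_Group world_group, even_group;
    MPI_Comm even_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* every process obtains the group behind MPI_COMM_WORLD */
    MPI_Comm_group(MPI_COMM_WORLD, &world_group);

    /* build a group containing the even world ranks */
    int n = (world_size + 1) / 2;
    int ranks[n];
    for (int i = 0; i < n; ++i)
        ranks[i] = 2 * i;
    MPI_Group_incl(world_group, n, ranks, &even_group);

    /* all processes call the constructor; only group members get a valid communicator */
    MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm);

    if (even_comm != MPI_COMM_NULL) {
        /* collective calls over even_comm go here */
        MPI_Comm_free(&even_comm);
    }

    MPI_Group_free(&even_group);
    MPI_Group_free(&world_group);
    MPI_Finalize();
    return 0;
}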
Communicator Management
Communicator Accessors
• MPI_COMM_SIZE(comm, size) – Returns the number of processes in the communicator
• MPI_COMM_RANK(comm, rank) – Returns the rank of the calling process in that communicator
• MPI_COMM_COMPARE(comm1, comm2, result)
  • MPI_IDENT – Same processes, same ranks
• MPI_SIMILAR – Same processes, different ranks
• MPI_UNEQUAL – Otherwise
Communicator Constructors
• MPI_COMM_DUP(comm, newcomm) – Duplicates the provided communicator (useful to copy and then manipulate)
• MPI_COMM_CREATE(comm, group, newcomm) – Creates a new intra-communicator using a subset of comm
• MPI_COMM_SPLIT(comm, color, key, newcomm) – Creates separate communicators where processes passing the same ‘color’ are grouped together
  • This is a rather exotic one and is worth thinking about carefully
  • Useful to segment processes into distinct subtasks (see the sketch below)
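A minimal sketch of MPI_COMM_SPLIT (illustrative; the grouping into ‘rows’ of four processes is an assumption, not from the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank;
    MPI_Comm row_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* colour picks the subgroup; key (the world rank) orders processes within it */
    int colour = world_rank / 4;
    MPI_Comm_split(MPI_COMM_WORLD, colour, world_rank, &row_comm);

    int row_rank, row_size;
    MPI_Comm_rank(row_comm, &row_rank);
    MPI_Comm_size(row_comm, &row_size);
    printf("world rank %d -> row %d, row rank %d of %d\n",
           world_rank, colour, row_rank, row_size);

    MPI_Comm_free(&row_comm);
    MPI_Finalize();
    return 0;
}

Every process makes exactly one call; the processes that passed the same colour end up sharing a new intra-communicator.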
Summary
• Collective communication can simplify many common patterns
  • Broadcast/Reduce, Scatter/Gather
• Collective communication is also dependent on the communicator supplied
• Communicators can be used to separate processes into separate jobs
• Communicators are created from groups
Examples
The N-Body Problem
The N-body Problem
• Many simulations compute the interactions between many small objects
• If the force between particles is described completely by adding the forces of all particle pairs together
  • This is the N-body problem
• Good choice for parallelisation
  • O(n) memory
  • O(n²) computation
  • Good speedups for large n
  • Small communication requirements
The N-body Problem
• How do we divide up the work?
• Simple: Divide the number of particles evenly among processes
  • Computing forces requires communication to all processes
  • Only suitable if simulating for a long time or forces are very tricky to compute
• Complex: Dividing the number of particles evenly but dynamically to reduce communication
  • We will not bother with this today
• What we are going to do
  • Define an MPI_PARTICLE datatype
  • Describe our approach
  • Use collective communication routines to implement the approach
N-body – Datatypes
typedef struct{
double x,y,z;
double mass;
} Particle;
Particle particles[MAX_PARTICLES];
MPI_Datatype MPI_PARTICLE;
MPI_Type_contiguous(4, MPI_DOUBLE, &MPI_PARTICLE);
MPI_Type_commit(&MPI_PARTICLE);
N-body – Approach
• Simple approach
  • Exchange all particles
• Compute forces for those locally concerned with
• Repeat
Particle *particleLoc[MAX_PROCS];   /* one receive buffer per process (MAX_PROCS: assumed bound) */
MPI_Comm_size(MPI_COMM_WORLD, &size);
for(int i = 0; i < size; ++i){
    MPI_Send(particles, count, MPI_PARTICLE, i, 0, MPI_COMM_WORLD);
}
for(int i = 0; i < size; ++i){
    MPI_Recv(particleLoc[i], MAX_PARTICLES, MPI_PARTICLE, i, 0, MPI_COMM_WORLD, &status);
}
N-body – Approach
• Simple approach
  • Exchange all particles
  • Compute forces for those locally concerned with
  • Repeat
• Some problems
  • Does not scale with the number of processes
  • May deadlock
  • Needs the locations in particleLoc to be computed beforehand
Particle *particleLoc[MAX_PROCS];   /* one receive buffer per process (MAX_PROCS: assumed bound) */
MPI_Comm_size(MPI_COMM_WORLD, &size);
for(int i = 0; i < size; ++i){
    MPI_Send(particles, count, MPI_PARTICLE, i, 0, MPI_COMM_WORLD);
}
for(int i = 0; i < size; ++i){
    MPI_Recv(particleLoc[i], MAX_PARTICLES, MPI_PARTICLE, i, 0, MPI_COMM_WORLD, &status);
}
N-body – Approach
• Collective communication can solve most of our problems here
• Allgather and Allgatherv allows us to realise this approach efficiently
• First problem: How many particles does each process consider?
  • Fill some array counts[] holding the number of particles for each process
N-body – Approach
int count;              /* number of particles owned by this process */
int counts[MAX_PROCS];  /* one entry per process (MAX_PROCS: assumed upper bound) */
int root = 0;
MPI_Gather(&count, 1, MPI_INT, counts, 1, MPI_INT, root, MPI_COMM_WORLD);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Bcast(counts, size, MPI_INT, root, MPI_COMM_WORLD);
• Allgather accomplishes this maneuver in one call
MPI_ALLGATHER (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
• sendbuf (IN) starting address of send buffer
• sendcount (IN) number of elements to send
• sendtype (IN) datatype of elements
• recvbuf (OUT) starting address of receive buffer
• recvcount (IN) number of elements to receive
• recvtype (IN) datatype of elements
• comm (IN) the communicator
The same as MPI_Gather but all processes receive a copy of the final result
MPI_ALLGATHER
In a communicator of n processes, works as if
• Each process calls MPI_GATHER n times, once with each process as the root
Of note:
• Remember, the sendbuf and recvbuf must be different
• recvcount indicates the number of items received from each process, not in total
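For the counts[] exchange shown earlier, the single-call version might look like this (a sketch; the per-process particle count is a stand-in value):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int count = rank + 1;   /* stand-in for this process's particle count */
    int counts[size];

    /* one call does the gather-then-broadcast of the earlier slide:
       every process ends up with the full counts[] array */
    MPI_Allgather(&count, 1, MPI_INT, counts, 1, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d: counts[0] = %d, counts[%d] = %d\n",
           rank, counts[0], size - 1, counts[size - 1]);

    MPI_Finalize();
    return 0;
}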
N-Body – Approach
• This would work in the case where all processes have the same number of particles
• Obviously, this is not always the case
• Allgatherv allows processes to gather varying lengths of the array
  • Additionally expects the length of each sub-array and its displacement into the final array
• The length of each sub-array is fairly trivial to compute
• The displacement for process i is simply the sum of the counts of processes 0 to i−1
N-body – Approach
displacements[0] = 0;
for(int i = 1; i < size; ++i){
    displacements[i] = counts[i-1] + displacements[i-1];
}
/* allparticles is the receive buffer large enough for every process's particles */
MPI_Allgatherv(myparticles, count, MPI_PARTICLE,
               allparticles, counts, displacements, MPI_PARTICLE,
               MPI_COMM_WORLD);
N-Body – Putting it together
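The extracted slide text does not include the combined code, so here is a sketch of how the pieces above might fit together (MAX_PARTICLES and the uniform per-process count are assumptions):

#include <mpi.h>

#define MAX_PARTICLES 1024   /* assumed upper bound, as in the earlier slides */

typedef struct {
    double x, y, z;
    double mass;
} Particle;

int main(int argc, char **argv)
{
    int size;
    MPI_Datatype MPI_PARTICLE;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Type_contiguous(4, MPI_DOUBLE, &MPI_PARTICLE);
    MPI_Type_commit(&MPI_PARTICLE);

    /* each process owns `count` particles in myparticles[] (values omitted here) */
    int count = MAX_PARTICLES / size;
    Particle myparticles[MAX_PARTICLES];
    Particle allparticles[MAX_PARTICLES];

    int counts[size], displacements[size];

    /* 1. everyone learns how many particles each process owns */
    MPI_Allgather(&count, 1, MPI_INT, counts, 1, MPI_INT, MPI_COMM_WORLD);

    /* 2. displacements are the running sum of the counts */
    displacements[0] = 0;
    for (int i = 1; i < size; ++i)
        displacements[i] = displacements[i - 1] + counts[i - 1];

    /* 3. everyone receives every particle */
    MPI_Allgatherv(myparticles, count, MPI_PARTICLE,
                   allparticles, counts, displacements, MPI_PARTICLE,
                   MPI_COMM_WORLD);

    /* 4. compute forces on the local particles using allparticles[], then repeat */

    MPI_Type_free(&MPI_PARTICLE);
    MPI_Finalize();
    return 0;
}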
N-body – Approach 2
• There is another approach to accomplish the same computation
• Using allgatherv, communication and computation are distinct, non-overlapping phases
• Perhaps using nonblocking communication can give us an advantage?
  • TLDR: Yes
• This is considered an Advanced technique
N-body – Nonblocking Approach
• The simple solution is to create a pipeline of sorts
• Each process
  • Receives some data from the left
  • Sends some data to the right
• While data is arriving, computation on the previous data occurs
while(not_done){
    MPI_Irecv(buf1, ..., source = left, ..., &handles[0]);
    MPI_Isend(buf2, ..., dest = right, ..., &handles[1]);
    <compute on buf2>
    MPI_Waitall(2, handles, statuses);
    <swap buf1 and buf2>
}
N-Body – Nonblocking Approach
• When would this approach be useful?
  • Often simulations involve millions of timesteps
  • Opportunity: each timestep requires a new send and receive to be created and processed
  • It would be nice for MPI to ‘remember’ how it sent / received data before
• MPI supports this (persistent requests) → Advanced manoeuvre
  • Very similar calls to non-blocking communication
• No communication happens however
• Need to call MPI_Start(request) to actually communicate
• Only truly tricky part is handling different communication sizes
N-Body Nonblocking approach
/* Setup */
for(int i = 0; i < size-1; ++i){
    MPI_Send_init(sendbuf, counts[(rank+i)%size], MPI_PARTICLE, right, i,
                  MPI_COMM_WORLD, &request[2*i]);
    MPI_Recv_init(recvbuf, counts[(rank+i-1+size)%size], MPI_PARTICLE, left, i,
                  MPI_COMM_WORLD, &request[2*i+1]);   /* +size keeps the index non-negative */
}
• We setup a persistent non-blocking send / receive for communication to the left and right process using all process numbers as tags
• This exploits the fact that we are setting up all possible communications at once. We do not start all of them
N-Body Nonblocking Approach
• Here, each process in the pipeline is given the chance to send and receive data
  • Computing its own work in the meantime
• Finally, one must free all the communication requests before moving onto more work
/* run pipeline */
while(!done){
    <copy local particles into sendbuf>
    for(int i = 0; i < size; ++i){
        MPI_Status statuses[2];
        if(i != size - 1){
            MPI_Startall(2, &request[2*i]);
        }
        <compute using sendbuf>
        if(i != size - 1){
            MPI_Waitall(2, &request[2*i], statuses);
        }
        <copy recvbuf into sendbuf>
    }
    <compute new particle positions>
}
/* Free Requests */
for(int i = 0; i < 2*(size-1); ++i){
    MPI_Request_free(&request[i]);
}
N-Body – Nonblocking approach
Collective Communication
MPI_GATHER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
• sendbuf (IN) starting address of send buffer
• sendcount (IN) number of elements to send
• sendtype (IN) datatype of elements
• recvbuf (OUT) starting address of receive buffer
• recvcount (IN) number of elements to receive
• recvtype (IN) datatype of elements
• root (IN) the rank of the communicator root
• comm (IN) the communicator
Each process sends the contents of the send buffer to the root process.
MPI_GATHER
In a communicator of n processes, works as if
• Each process calls MPI_SEND to the root
• The root calls MPI_RECV n times, in rank order
Of note:
• Remember, the sendbuf and recvbuf must be different
• recv arguments are only significant for the root
• recvcount at the root indicates the number of items received from each process, not in total
MPI_SCATTER(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
• sendbuf (IN) starting address of send buffer
• sendcount (IN) number of elements to send
• sendtype (IN) datatype of elements
• recvbuf (OUT) starting address of receive buffer
• recvcount (IN) number of elements to receive
• recvtype (IN) datatype of elements
• root (IN) the rank of the communicator root
• comm (IN) the communicator
The root process sends sendcount separate entries from its sendbuf to all other processes in the communicator
MPI_SCATTER
In a communicator of n processes, works as if
• Root called MPI_Send n times
  • For the i-th send, the i-th segment of sendcount items is sent to the i-th process
• Every other process called MPI_Recv once
Of note:
• All arguments are significant to root
• Only recv arguments significant to every other process
MPI_ALLGATHER (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
• sendbuf (IN) starting address of send buffer
• sendcount (IN) number of elements to send
• sendtype (IN) datatype of elements
• recvbuf (OUT) starting address of receive buffer
• recvcount (IN) number of elements to receive
• recvtype (IN) datatype of elements
• comm (IN) the communicator
The same as MPI_Gather but all processes receive a copy of the final result
MPI_ALLGATHER
In a communicator of n processes, works as if
• Each process calls MPI_GATHER n times, once with each process as the root
Of note:
• Remember, the sendbuf and recvbuf must be different
• recvcount indicates the number of items received from each process, not in total
MPI_ALLTOALL (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
• sendbuf (IN) starting address of send buffer
• sendcount (IN) number of elements to send
• sendtype (IN) datatype of elements
• recvbuf (OUT) starting address of receive buffer
• recvcount (IN) number of elements to receive
• recvtype (IN) datatype of elements
• comm (IN) the communicator
The same as MPI_ALLGATHER but each process sends a distinct chunk of data to each other process
MPI_ALLREDUCE(sendbuf, recvbuf, count, datatype, op, comm)
• sendbuf (IN) Address of the send buffer
• recvbuf (OUT) Address of the receive buffer
• count (IN) The number of items to process
• datatype (IN) The type of each element
• op (IN) The reduction operation
• comm (IN) The communicator
Performs the same function as MPI_REDUCE but the result appears in all processes
MPI_REDUCE_SCATTER(sendbuf, recvbuf, recvcounts, datatype, op, comm)
• sendbuf (IN) Address of the sending buffer
• recvbuf (OUT) Address of the receiving buffer
• recvcounts (IN) Integer array indicating how many elements to be reduced/scattered
• datatype (IN) Data type of input buffer
• op (IN) Reduction operation
• comm (IN) The Communicator
Performs an element-wise reduction over the whole send buffer, then scatters the result so that process i receives recvcounts[i] elements.
MPI_SCAN(sendbuf, recvbuf, count, datatype, op, comm)
• sendbuf (IN) Address of the send buffer
• recvbuf (OUT) Address of the receiving buffer
• count (IN) Number of elements in the input buffer
• datatype (IN) The type of each element
• op (IN) The reduction operation
• comm (IN) The communicator
Performs a prefix reduction over processes 0 to n−1: process i receives in its recvbuf the reduction of the send buffers of processes 0 to i (itself included).