Distributed Systems CS 15-440: Programming Models. Gregory Kesden. Borrowed and adapted from our good friends at CMU-Doha, Qatar: Majd F. Sakr, Mohammad Hammoud, and Vinay Kolar.


Page 1: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Distributed Systems CS 15-440

Programming Models

Gregory Kesden

Borrowed and adapted from our good friends at

CMU-Doha, Qatar

Majd F. Sakr, Mohammad Hammoud and Vinay Kolar

1

Page 2: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Objectives

Discussion on Programming Models

Why parallelism?

Parallel computer architectures

Traditional models of parallel programming

Examples of parallel processing

Message Passing Interface (MPI)

MapReduce

Why parallelism?

Page 3: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Amdahl’s Law

We parallelize our programs in order to run them faster

How much faster will a parallel program run?

Suppose that the sequential execution of a program takes T1 time units and the parallel execution on p processors takes Tp time units

Suppose that out of the entire execution of the program, s fraction of it is not parallelizable while 1-s fraction is parallelizable

Then the speedup is given by Amdahl's formula: Speedup = T1 / Tp = 1 / (s + (1 - s)/p)

3

 

Page 4: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Amdahl’s Law: An Example

Suppose that 80% of your program can be parallelized and that you use 4 processors to run the parallel version of the program

The speedup you can get according to Amdahl's formula is: Speedup = 1 / (0.2 + 0.8/4) = 1 / 0.4 = 2.5

Although you use 4 processors, you cannot get a speedup of more than 2.5 times; equivalently, the parallel running time is at best 40% of the serial running time

4

 

Page 5: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Ideal Vs. Actual Cases

Amdahl’s argument is too simplified to apply to real cases

When we run a parallel program, there is generally communication overhead and workload imbalance among the processes

Figure: 1. Parallel speed-up, an ideal case: the serial program consists of a part that cannot be parallelized (20 time units) and a part that can (80 time units); in the parallel version the 80 units are split evenly, 20 each, across Processes 1-4.

Figure: 2. Parallel speed-up, an actual case: the same split, but with load imbalance among the processes and communication overhead added, so the parallel run takes longer than in the ideal case.

Page 6: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Guidelines

In order to efficiently benefit from parallelization, we ought to follow these guidelines:

1. Maximize the fraction of our program that can be parallelized

2. Balance the workload of parallel processes

3. Minimize the time spent for communication

6

Page 7: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Objectives

Discussion on Programming Models

Why parallelism?

Parallel computer architectures

Traditional models of parallel programming

Examples of parallel processing

Message Passing Interface (MPI)

MapReduce

Parallel computer architectures

Page 8: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Parallel Computer Architectures

We can categorize the architecture of parallel computers in terms of two aspects:

Whether the memory is physically centralized or distributed

Whether or not the address space is shared

8

Memory \ Address Space | Shared | Individual
Centralized | UMA – SMP (Symmetric Multiprocessor) | N/A
Distributed | NUMA (Non-Uniform Memory Access) | MPP (Massively Parallel Processors)

Page 9: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Symmetric Multiprocessor

Symmetric Multiprocessor (SMP) architecture uses shared system resources that can be accessed equally from all processors

A single OS controls the SMP machine and it schedules processes and threads on processors so that the load is balanced

9

Figure: An SMP machine: four processors, each with its own cache, connected through a bus or crossbar switch to shared memory and I/O.

Page 10: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Massively Parallel Processors

Massively Parallel Processors (MPP) architecture consists of nodes with each having its own processor, memory and I/O subsystem

An independent OS runs at each node

10

Figure: An MPP machine: four nodes, each with its own processor, cache, memory, and I/O subsystem on a local bus, connected to one another by an interconnection network.

Page 11: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Non-Uniform Memory Access

Non-Uniform Memory Access (NUMA) architecture machines are built on a similar hardware model as MPP

NUMA typically provides a shared address space to applications using a hardware/software directory-based coherence protocol

The memory latency varies according to whether you access memory directly (local) or through the interconnect (remote). Thus the name non-uniform memory access

As in an SMP machine, a single OS controls the whole system

11

Page 12: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Objectives

Discussion on Programming Models

Why parallelize our programs?

Parallel computer architectures

Traditional Models of parallel programming

Examples of parallel processing

Message Passing Interface (MPI)

MapReduce

Traditional Models of parallel programming

Page 13: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Models of Parallel Programming

What is a parallel programming model?

A programming model is an abstraction provided by the hardware to programmers

It determines how easily programmers can express their algorithms as parallel units of computation (i.e., tasks) that the hardware understands

It determines how efficiently parallel tasks can be executed on the hardware

Main Goal: utilize all the processors of the underlying architecture (e.g., SMP, MPP, NUMA) and minimize the elapsed time of your program

13

Page 14: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Traditional Parallel Programming Models

14

Parallel Programming Models

Shared Memory | Message Passing

Page 15: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Shared Memory Model

In the shared memory programming model, the abstraction is that parallel tasks can access any location of the memory

Parallel tasks can communicate through reading and writing common memory locations

This is similar to threads from a single process which share a single address space

Multi-threaded programs (e.g., OpenMP programs) are the best fit for the shared memory programming model

15

Page 16: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Shared Memory Model

16

Figure: Single-thread case: one process executes S1, P1, P2, P3, P4, S2 one after another over time (Si = serial part, Pj = parallel part). Multi-thread case: the process executes S1, spawns threads that run P1, P2, P3, and P4 concurrently in a shared address space, joins them, and then executes S2.

Page 17: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Shared Memory Example

Sequential:

for (i=0; i<8; i++)
    a[i] = b[i] + c[i];
sum = 0;
for (i=0; i<8; i++)
    if (a[i] > 0)
        sum = sum + a[i];
Print sum;

Parallel:

begin parallel // spawn a child thread
private int start_iter, end_iter, i;
shared int local_iter=4;
shared double sum=0.0, a[], b[], c[];
shared lock_type mylock;

start_iter = getid() * local_iter;
end_iter = start_iter + local_iter;
for (i=start_iter; i<end_iter; i++)
    a[i] = b[i] + c[i];
barrier;

for (i=start_iter; i<end_iter; i++)
    if (a[i] > 0) {
        lock(mylock);
        sum = sum + a[i];
        unlock(mylock);
    }
barrier; // necessary

end parallel // kill the child thread
Print sum;
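For comparison only, here is a minimal OpenMP sketch of the same computation in C (this example is not part of the original slides; the sample values placed in b[] and c[] are an assumption):

#include <stdio.h>
#include <omp.h>

int main(void) {
    double a[8], b[8], c[8], sum = 0.0;
    for (int i = 0; i < 8; i++) { b[i] = i; c[i] = i; }   // sample data (assumption)

    #pragma omp parallel for                    // threads share a[], b[], c[]
    for (int i = 0; i < 8; i++)
        a[i] = b[i] + c[i];

    #pragma omp parallel for reduction(+:sum)   // reduction replaces the explicit lock
    for (int i = 0; i < 8; i++)
        if (a[i] > 0)
            sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}

Built with an OpenMP-enabled compiler (e.g., gcc -fopenmp), the reduction clause plays the role of the lock/unlock pair in the pseudo-code above.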

Page 18: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Traditional Parallel Programming Models

18

Parallel Programming Models

Shared Memory | Message Passing

Page 19: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Message Passing Model

In message passing, parallel tasks have their own local memories

One task cannot access another task’s memory

Hence, to communicate data they have to rely on explicit messages sent to each other

This is similar to the abstraction of processes which do not share an address space

MPI programs are the best fit for the message passing programming model

19

Page 20: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Message Passing Model

20

Figure: Single-thread case: one process executes S1, P1, P2, P3, P4, S2 over time (S = serial, P = parallel). Message passing case: Processes 0-3 run on Nodes 1-4; each executes S1, its own parallel part, and S2, and the processes exchange data by transmitting messages over the network.

Page 21: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Message Passing Example

Sequential:

for (i=0; i<8; i++)
    a[i] = b[i] + c[i];
sum = 0;
for (i=0; i<8; i++)
    if (a[i] > 0)
        sum = sum + a[i];
Print sum;

Parallel:

id = getpid(); local_iter = 4;
start_iter = id * local_iter;
end_iter = start_iter + local_iter;

if (id == 0)
    send_msg (P1, b[4..7], c[4..7]);
else
    recv_msg (P0, b[4..7], c[4..7]);

for (i=start_iter; i<end_iter; i++)
    a[i] = b[i] + c[i];

local_sum = 0;
for (i=start_iter; i<end_iter; i++)
    if (a[i] > 0)
        local_sum = local_sum + a[i];

if (id == 0) {
    recv_msg (P1, &local_sum1);
    sum = local_sum + local_sum1;
    Print sum;
}
else
    send_msg (P0, local_sum);
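As an illustration only (not from the original slides), the same computation can be written with actual MPI calls; this sketch assumes exactly two processes, that rank 0 owns the input arrays, and that b[] and c[] are filled with sample values:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int id;
    double a[8], b[8], c[8];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);        // this sketch assumes exactly 2 processes

    if (id == 0)                               // rank 0 owns the input data
        for (int i = 0; i < 8; i++) { b[i] = i; c[i] = i; }   // sample data (assumption)

    if (id == 0) {                             // ship the second half to rank 1
        MPI_Send(&b[4], 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&c[4], 4, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&b[4], 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&c[4], 4, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    int start = id * 4, end = start + 4;       // each rank works on its own half
    double local_sum = 0.0;
    for (int i = start; i < end; i++) {
        a[i] = b[i] + c[i];
        if (a[i] > 0) local_sum += a[i];
    }

    if (id == 0) {
        double remote_sum;
        MPI_Recv(&remote_sum, 1, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("sum = %f\n", local_sum + remote_sum);
    } else {
        MPI_Send(&local_sum, 1, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}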

Page 22: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Shared Memory Vs. Message Passing

Comparison between shared memory and message passing programming models:

22

Aspect | Shared Memory | Message Passing
Communication | Implicit (via loads/stores) | Explicit Messages
Synchronization | Explicit | Implicit (Via Messages)
Hardware Support | Typically Required | None
Development Effort | Lower | Higher
Tuning Effort | Higher | Lower

Page 23: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Objectives

Discussion on Programming Models

Why parallelize our programs?

Parallel computer architectures

Examples of parallel processing

Message Passing Interface (MPI)

MapReduce

Traditional Models of parallel programming

Examples of parallel processing

Page 24: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

SPMD and MPMD

When we run multiple processes with message-passing, there are further categorizations regarding how many different programs are cooperating in parallel execution

We distinguish between two models:

1. Single Program Multiple Data (SPMD) model

2. Multiple Programs Multiple Data (MPMD) model

24

Page 25: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

SPMD

In the SPMD model, there is only one program and each process uses the same executable working on different sets of data

25

Figure: The same executable (a.out) runs on Node 1, Node 2, and Node 3, each copy working on a different portion of the data.
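As a hedged aside (not from the original slides): with a standard MPI launcher, the SPMD model corresponds to starting several copies of one executable, for example mpiexec -n 3 ./a.out, and each copy then uses its rank to decide which portion of the data it works on.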

Page 26: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

MPMD

The MPMD model uses different programs for different processes, but the processes collaborate to solve the same problem

MPMD has two styles, the master/worker and the coupled analysis

Figure: 1. MPMD master/slave: a master program (a.out) and a worker program (b.out) run on different nodes. 2. MPMD coupled analysis: a.out, b.out, and c.out run on Node 1, Node 2, and Node 3, respectively.

Example: a.out = structural analysis, b.out = fluid analysis, and c.out = thermal analysis

Page 27: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

3 Key Points

To summarize, keep the following 3 points in mind:

The purpose of parallelization is to reduce the time spent for computation

Ideally, the parallel program is p times faster than the sequential program, where p is the number of processes involved in the parallel execution, but this is not always achievable

Message-passing is the tool to consolidate what parallelization has separated. It should not be regarded as the parallelization itself

27

Page 28: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Objectives

Discussion on Programming Models

Why parallelize our programs?

Parallel computer architectures

Examples of parallel processing

Message Passing Interface (MPI)

MapReduce

Traditional Models of parallel programming

Message Passing Interface (MPI)

Page 29: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Message Passing Interface

In this part, the following concepts of MPI will be described:

Basics

Point-to-point communication

Collective communication

29

Page 30: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

What is MPI?

The Message Passing Interface (MPI) is a message passing library standard  for writing message passing programs

The goal of MPI is to establish a portable, efficient, and flexible standard for message passing

By itself, MPI is NOT a library - but rather the specification of what such a library should be

MPI is not an IEEE or ISO standard, but has in fact, become the industry standard for writing message passing programs on HPC platforms

30

Page 31: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Reasons for using MPI

31

Reason | Description
Standardization | MPI is the only message passing library which can be considered a standard. It is supported on virtually all HPC platforms
Portability | There is no need to modify your source code when you port your application to a different platform that supports the MPI standard
Performance Opportunities | Vendor implementations should be able to exploit native hardware features to optimize performance
Functionality | Over 115 routines are defined
Availability | A variety of implementations are available, both vendor and public domain

Page 32: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Programming Model

MPI is an example of a message passing programming model

MPI is now used on just about any common parallel architecture including MPP, SMP clusters, workstation clusters and heterogeneous networks

With MPI the programmer is responsible for correctly identifying parallelism and implementing parallel algorithms using MPI constructs

32

Page 33: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Communicators and Groups

MPI uses objects called communicators and groups to define which collection of processes may communicate with each other to solve a certain problem

Most MPI routines require you to specify a communicator as an argument

The communicator MPI_COMM_WORLD is often used in calling communication subroutines

MPI_COMM_WORLD is the predefined communicator that includes all of your MPI processes

33

Page 34: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Ranks

Within a communicator, every process has its own unique, integer identifier referred to as rank, assigned by the system when the process initializes

A rank is sometimes called a task ID. Ranks are contiguous and begin at zero

Ranks are used by the programmer to specify the source and destination of messages

Ranks are often also used conditionally by the application to control program execution (e.g., if rank=0 do this / if rank=1 do that)

34
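To make ranks concrete, here is a minimal sketch, not taken from the slides, that prints each process's rank and the size of MPI_COMM_WORLD and branches on the rank as described above:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);                    // start the MPI runtime
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      // this process's rank within MPI_COMM_WORLD
    MPI_Comm_size(MPI_COMM_WORLD, &size);      // total number of processes

    if (rank == 0)
        printf("rank 0 of %d: doing the rank-0 work\n", size);
    else
        printf("rank %d of %d: doing worker work\n", rank, size);

    MPI_Finalize();
    return 0;
}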

Page 35: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Multiple Communicators

It is possible that a problem consists of several sub-problems where each can be solved concurrently

This type of application is typically found in the category of MPMD coupled analysis

We can create a new communicator for each sub-problem as a subset of an existing communicator

MPI allows you to achieve that by using MPI_COMM_SPLIT

35
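A hedged sketch of how MPI_COMM_SPLIT might be used to build the two sub-communicators of the example on the next slide; the choice of color values and the assumption of 8 processes are illustrative, not from the slides:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int world_rank;
    MPI_Comm sub_comm;                         // becomes Comm_Fluid or Comm_Struct

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Assumption: world ranks 0-3 handle fluid dynamics, world ranks 4-7 structural analysis.
    int color = (world_rank < 4) ? 0 : 1;      // same color = same new communicator
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);

    int sub_rank;
    MPI_Comm_rank(sub_comm, &sub_rank);        // rank within the new sub-communicator
    printf("world rank %d -> %s rank %d\n",
           world_rank, color == 0 ? "Comm_Fluid" : "Comm_Struct", sub_rank);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}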

Page 36: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Example of Multiple Communicators

Consider a problem with a fluid dynamics part and a structural analysis part, where each part can be computed in parallel

Figure: MPI_COMM_WORLD contains eight processes with ranks 0-7. The first four (world ranks 0-3) form Comm_Fluid, in which they have ranks 0-3; the other four (world ranks 4-7) form Comm_Struct, in which they also have ranks 0-3. (In the original figure, ranks within MPI_COMM_WORLD are printed in red, ranks within Comm_Fluid in green, and ranks within Comm_Struct in blue.)

Page 37: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Next Class

Discussion on Programming Models

Why parallelize our programs?

Parallel computer architectures

Examples of parallel processing

Message Passing Interface (MPI)

MapReduce

Traditional Models of parallel programming

Message Passing Interface (MPI)

Programming Models- Part II

Page 38: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Message Passing Interface

In this part, the following concepts of MPI will be described:

Basics

Point-to-point communication

Collective communication

38

Page 39: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Point-to-Point Communication

MPI point-to-point operations typically involve message passing between two, and only two, different MPI tasks

One task performs a send operation and the other performs a matching receive operation

Ideally, every send operation would be perfectly synchronized with its matching receive

This is rarely the case. Somehow or other, the MPI implementation must be able to deal with storing data when the two tasks are out of sync

39

Figure: Processor 1 executes send(A), the message crosses the network, and Processor 2 executes recv(A).

Page 40: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Two Cases

Consider the following two cases:

1. A send operation occurs 5 seconds before the receive is ready - where is the message stored while the receive is pending?

2. Multiple sends arrive at the same receiving task which can only accept one send at a time - what happens to the messages that are "backing up"?

40

Page 41: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Steps Involved in Point-to-Point Communication

On the sender side:

1. The data is copied to the user buffer by the user

2. The user calls one of the MPI send routines

3. The system copies the data from the user buffer to the system buffer

4. The system sends the data from the system buffer to the destination process

On the receiver side:

1. The user calls one of the MPI receive routines

2. The system receives the data from the source process and copies it to the system buffer

3. The system copies data from the system buffer to the user buffer

4. The user uses data in the user buffer

Figure: On the sender (Process 0), the send routine copies the data from sendbuf (user mode) to sysbuf (kernel mode); sendbuf can then be reused while the system sends the data from sysbuf to the destination. On the receiver (Process 1), the data arriving from the source is placed in sysbuf and then copied to recvbuf, after which recvbuf contains valid data.

Page 42: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Blocking Send and Receive

When we use point-to-point communication routines, we usually distinguish between blocking and non-blocking communication

A blocking send routine will only return after it is safe to modify the application buffer for reuse

Safe means that modifications will not affect the data intended for the receive task

This does not imply that the data was actually received by the receiver- it may be sitting in the system buffer at the sender side

42

Figure: Rank 0's sendbuf is delivered over the network into Rank 1's recvbuf; once the blocking send returns, it is safe to modify sendbuf.

Page 43: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Blocking Send and Receive

A blocking send can be:

1. Synchronous: Means there is a handshaking occurring with the receive task to confirm a safe send

2. Asynchronous: Means the system buffer at the sender side is used to hold the data for eventual delivery to the receiver

A blocking receive only returns after the data has arrived (i.e., stored at the application recvbuf) and is ready for use by the program

43

Page 44: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Non-Blocking Send and Receive (1)

Non-blocking send and non-blocking receive routines behave similarly

They return almost immediately

They do not wait for any communication events to complete such as:

Message copying from the user buffer to the system buffer

The actual arrival of a message

44

Page 45: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Non-Blocking Send and Receive (2)

However, it is unsafe to modify the application buffer until you make sure that the requested non-blocking operation was actually performed by the library

If you use the application buffer before the copy completes:

Incorrect data may be copied to the system buffer (in case of non-blocking send)

Or your receive buffer does not contain what you want (in case of non-blocking receive)

You can make sure of the completion of the copy by using MPI_WAIT() after the send or receive operations

45
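A hedged sketch (not from the slides) of this rule: rank 0 posts a non-blocking send, may overlap other work, and only reuses the buffer after MPI_WAIT; the two-process setup and the payload value are assumptions:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double buf = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      // assumes at least 2 processes

    if (rank == 0) {
        MPI_Request req;
        buf = 42.0;                            // payload (assumption)
        MPI_Isend(&buf, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        /* ... computation that does NOT touch buf can overlap here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);     // only now is it safe to modify buf
        buf = 0.0;
    } else if (rank == 1) {
        MPI_Recv(&buf, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", buf);
    }

    MPI_Finalize();
    return 0;
}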

Page 46: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Why Non-Blocking Communication?

Why do we use non-blocking communication despite its complexity?

Non-blocking communication is generally faster than its corresponding blocking communication

We can overlap computations while the system is copying data back and forth between application and system buffers

46

Page 47: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

 MPI Point-To-Point Communication Routines

47

Routine | Signature
Blocking send | int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm )
Non-blocking send | int MPI_Isend( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request )
Blocking receive | int MPI_Recv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status )
Non-blocking receive | int MPI_Irecv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request )

Page 48: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Message Order

MPI guarantees that messages will not overtake each other

If a sender sends two messages M1 and M2 in succession to the same destination, and both match the same receive, the receive operation will receive M1 before M2

If a receiver posts two receives R1 and R2, in succession, and both are looking for the same message, R1 will receive the message before R2

48

Page 49: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Fairness

MPI does not guarantee fairness – it is up to the programmer to prevent operation starvation

For instance, if task 0 and task 1 send competing messages (i.e., messages that match the same receive) to task 2, only one of the sends will complete

49

Figure: Task 0 and Task 1 both send message A to Task 2, which has posted one matching receive; which of the two messages Task 2 receives first is not determined.

Page 50: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Unidirectional Communication

When you send a message from process 0 to process 1, there are four combinations of MPI subroutines to choose from

1. Blocking send and blocking receive

2. Non-blocking send and blocking receive

3. Blocking send and non-blocking receive

4. Non-blocking send and non-blocking receive

50

Figure: Rank 0 sends from its sendbuf over the network into Rank 1's recvbuf.

Page 51: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Bidirectional Communication

When two processes exchange data with each other, there are essentially 3 cases to consider:

Case 1: Both processes call the send routine first, and then receive

Case 2: Both processes call the receive routine first, and then send

Case 3: One process calls send and receive routines in this order, and the other calls them in the opposite order

51

Figure: Rank 0 and Rank 1 each send from their own sendbuf and receive into their own recvbuf.

Page 52: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Bidirectional Communication- Deadlocks

With bidirectional communication, we have to be careful about deadlocks

When a deadlock occurs, the processes involved in the deadlock will not proceed any further

Deadlocks can take place:

1. Either due to the incorrect order of send and receive

2. Or due to the limited size of the system buffer

52


Page 53: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Case 1. Send First and Then Receive

Consider the following two snippets of pseudo-code:

MPI_ISEND immediately followed by MPI_WAIT is logically equivalent to MPI_SEND

53

IF (myrank==0) THEN
  CALL MPI_SEND(sendbuf, …)
  CALL MPI_RECV(recvbuf, …)
ELSEIF (myrank==1) THEN
  CALL MPI_SEND(sendbuf, …)
  CALL MPI_RECV(recvbuf, …)
ENDIF

IF (myrank==0) THEN
  CALL MPI_ISEND(sendbuf, …, ireq, …)
  CALL MPI_WAIT(ireq, …)
  CALL MPI_RECV(recvbuf, …)
ELSEIF (myrank==1) THEN
  CALL MPI_ISEND(sendbuf, …, ireq, …)
  CALL MPI_WAIT(ireq, …)
  CALL MPI_RECV(recvbuf, …)
ENDIF

Page 54: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Case 1. Send First and Then Receive

What happens if the system buffer is larger than the send buffer?

What happens if the system buffer is not larger than the send buffer?

Figure: If the system buffer is large enough to hold the message, each rank's sendbuf is copied into its sysbuf, both sends return, and the data is then delivered over the network into the recvbufs. If the system buffer cannot hold the message, both blocking sends wait for receives that are never reached: DEADLOCK!

Page 55: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Case 1. Send First and Then Receive

Consider the following pseudo-code:

The code is free from deadlock because:

The program immediately returns from MPI_ISEND and starts receiving data from the other process

In the meantime, data transmission is completed and the calls of MPI_WAIT for the completion of the send at both processes do not lead to a deadlock

IF (myrank==0) THEN
  CALL MPI_ISEND(sendbuf, …, ireq, …)
  CALL MPI_RECV(recvbuf, …)
  CALL MPI_WAIT(ireq, …)
ELSEIF (myrank==1) THEN
  CALL MPI_ISEND(sendbuf, …, ireq, …)
  CALL MPI_RECV(recvbuf, …)
  CALL MPI_WAIT(ireq, …)
ENDIF

Page 56: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Case 2. Receive First and Then Send

Would the following pseudo-code lead to a deadlock?

A deadlock will occur regardless of how much system buffer we have

What if we use MPI_ISEND instead of MPI_SEND?

Deadlock still occurs

IF (myrank==0) THEN
  CALL MPI_RECV(recvbuf, …)
  CALL MPI_SEND(sendbuf, …)
ELSEIF (myrank==1) THEN
  CALL MPI_RECV(recvbuf, …)
  CALL MPI_ISEND(sendbuf, …)
ENDIF

Page 57: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Case 2. Receive First and Then Send

What about the following pseudo-code?

It can be safely executed

IF (myrank==0) THEN
  CALL MPI_IRECV(recvbuf, …, ireq, …)
  CALL MPI_SEND(sendbuf, …)
  CALL MPI_WAIT(ireq, …)
ELSEIF (myrank==1) THEN
  CALL MPI_IRECV(recvbuf, …, ireq, …)
  CALL MPI_SEND(sendbuf, …)
  CALL MPI_WAIT(ireq, …)
ENDIF

Page 58: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Case 3. One Process Sends and Receives; the other Receives and Sends

What about the following code?

It is always safe to have the two processes call MPI_(I)SEND and MPI_(I)RECV in opposite orders

In this case, we can use either blocking or non-blocking subroutines

IF (myrank==0) THEN
  CALL MPI_SEND(sendbuf, …)
  CALL MPI_RECV(recvbuf, …)
ELSEIF (myrank==1) THEN
  CALL MPI_RECV(recvbuf, …)
  CALL MPI_SEND(sendbuf, …)
ENDIF

Page 59: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

A Recommendation

Considering the previous options, performance, and the avoidance of deadlocks, it is recommended to use the following code:

IF (myrank==0) THEN
  CALL MPI_ISEND(sendbuf, …, ireq1, …)
  CALL MPI_IRECV(recvbuf, …, ireq2, …)
ELSEIF (myrank==1) THEN
  CALL MPI_ISEND(sendbuf, …, ireq1, …)
  CALL MPI_IRECV(recvbuf, …, ireq2, …)
ENDIF
CALL MPI_WAIT(ireq1, …)
CALL MPI_WAIT(ireq2, …)
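For concreteness, a hedged C sketch of this recommended pattern (the two-process setup and the payload are assumptions; the MPI calls follow the routine signatures listed earlier):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double sendbuf, recvbuf;
    MPI_Request ireq1, ireq2;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int other = (rank == 0) ? 1 : 0;           // assumes exactly 2 processes
    sendbuf = 100.0 + rank;                    // toy payload (assumption)

    MPI_Isend(&sendbuf, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &ireq1);
    MPI_Irecv(&recvbuf, 1, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &ireq2);
    MPI_Wait(&ireq1, MPI_STATUS_IGNORE);       // sendbuf may now be reused
    MPI_Wait(&ireq2, MPI_STATUS_IGNORE);       // recvbuf now holds the other rank's data

    printf("rank %d received %f\n", rank, recvbuf);
    MPI_Finalize();
    return 0;
}

Because both non-blocking calls are posted before either wait, neither process can block the other, regardless of the system buffer size.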

Page 60: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Message Passing Interface

In this part, the following concepts of MPI will be described:

Basics

Point-to-point communication

Collective communication

60

Page 61: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Collective Communication

Collective communication allows you to exchange data among a group of processes

It must involve all processes in the scope of a communicator

The communicator argument in a collective communication routine should specify which processes are involved in the communication

Hence, it is the programmer's responsibility to ensure that all processes within a communicator participate in any collective operation

61

Page 62: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

Patterns of Collective Communication

There are several patterns of collective communication:

1. Broadcast

2. Scatter

3. Gather

4. Allgather

5. Alltoall

6. Reduce

7. Allreduce

8. Scan

9. Reduce Scatter

62

Page 63: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

1. Broadcast

Broadcast sends a message from the process with rank root to all other processes in the group

63

Figure: Before the broadcast, only P0 holds data item A; after the broadcast, P0, P1, P2, and P3 all hold A.

int MPI_Bcast ( void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )
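A hedged usage sketch (the buffer contents are an assumption, not from the slides): rank 0 fills a buffer, and after the call every rank, including the root, holds the same contents:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double A[4] = {0, 0, 0, 0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)                             // the root prepares the data
        for (int i = 0; i < 4; i++) A[i] = i + 1;

    MPI_Bcast(A, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);   // root = 0
    printf("rank %d now has A = %g %g %g %g\n", rank, A[0], A[1], A[2], A[3]);

    MPI_Finalize();
    return 0;
}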

Page 64: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

2-3. Scatter and Gather

Scatter distributes distinct messages from a single source task to each task in the group

Gather gathers distinct messages from each task in the group to a single destination task

Figure: Scatter: P0 holds A, B, C, and D and sends one distinct item to each of P0-P3. Gather: the reverse operation, collecting A, B, C, and D from P0-P3 into P0.

int MPI_Scatter ( void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm )

int MPI_Gather ( void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm )
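A hedged sketch (the input values and the assumption of exactly 4 processes are illustrative): root 0 scatters one value to each rank, each rank modifies its value, and root 0 gathers the results back:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    double send[4] = {10, 20, 30, 40};         // meaningful only on the root (assumption)
    double mine, gathered[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 4 && rank == 0)
        printf("note: this sketch assumes 4 processes\n");

    MPI_Scatter(send, 1, MPI_DOUBLE, &mine, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    mine += rank;                              // each rank works on its own piece
    MPI_Gather(&mine, 1, MPI_DOUBLE, gathered, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("gathered: %g %g %g %g\n", gathered[0], gathered[1], gathered[2], gathered[3]);

    MPI_Finalize();
    return 0;
}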

Page 65: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

4. All Gather

Allgather gathers data from all tasks and distributes it to all tasks. Each task in the group, in effect, performs a one-to-all broadcast within the group

Figure: Each of P0-P3 contributes one item (A, B, C, D); after the allgather, every process holds A, B, C, and D.

int MPI_Allgather ( void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm )

Page 66: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

5. All To All

With Alltoall, each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group in order by index

Figure: Before the all-to-all, P0 holds A0-A3, P1 holds B0-B3, P2 holds C0-C3, and P3 holds D0-D3. Afterwards, P0 holds A0, B0, C0, D0; P1 holds A1, B1, C1, D1; P2 holds A2, B2, C2, D2; and P3 holds A3, B3, C3, D3.

int MPI_Alltoall( void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, MPI_Comm comm )

Page 67: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

6-7. Reduce and All Reduce

Reduce applies a reduction operation on all tasks in the group and places the result in one task

Allreduce applies a reduction operation and places the result in all tasks in the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast

Figure: Reduce: P0-P3 hold A, B, C, and D; the combined result A*B*C*D (where * denotes the reduction operator) is placed in P0 only.

int MPI_Reduce ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm )

int MPI_Allreduce ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )

Figure: Allreduce: the combined result A*B*C*D is placed in every process P0-P3.
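A hedged sketch of the reduction above (the use of MPI_SUM as the operator and the per-rank values are assumptions): each rank contributes one value, MPI_Reduce leaves the combined result on rank 0 only, and the commented MPI_Allreduce variant would leave it on every rank:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double local, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    local = rank + 1.0;                        // this rank's contribution (assumption)

    // Combined result placed on root 0 only; MPI_Allreduce(&local, &total, 1,
    // MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD) would place it on every rank instead.
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over all ranks = %f\n", total);

    MPI_Finalize();
    return 0;
}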

Page 68: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

8. Scan

Scan computes the scan (partial reductions) of data on a collection of processes

Figure: P0-P3 hold A, B, C, and D; after the scan, P0 holds A, P1 holds A*B, P2 holds A*B*C, and P3 holds A*B*C*D.

int MPI_Scan ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )

Page 69: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

9. Reduce Scatter

Reduce Scatter combines values and scatters the results. It is equivalent to an MPI_Reduce followed by an MPI_Scatter operation. 

int MPI_Reduce_scatter ( void *sendbuf, void *recvbuf, int *recvcnts, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm )

Figure: P0-P3 hold rows A0-A3, B0-B3, C0-C3, and D0-D3; the element-wise reductions are computed and scattered so that P0 receives A0*B0*C0*D0, P1 receives A1*B1*C1*D1, P2 receives A2*B2*C2*D2, and P3 receives A3*B3*C3*D3.

Page 70: Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar Majd F. Sakr, Mohammad Hammoud

 Considerations and Restrictions

Collective operations are blocking

Collective communication routines do not take message tag arguments

Collective operations within subsets of processes are accomplished by first partitioning the subsets into new groups and then attaching the new groups to new communicators

70