
Chapter 04

Basic Communication Operations

Processes need to exchange data with other processes

This exchange of data can significantly impact the efficiency of parallel programs by introducing interaction delays during their execution

t_s + m t_w

This is the time for a simple exchange of an m-word message between two processes running on different nodes of an interconnection network with cut-through routing.

t_s: the latency, or startup time, for the data transfer
t_w: the per-word transfer time, which is inversely proportional to the available bandwidth between the nodes
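As a quick illustration, here is a minimal sketch of this cost model in Python (the function name and the sample values of t_s, t_w, and m are illustrative, not from the text):

```python
def transfer_time(t_s, t_w, m):
    """Time to exchange an m-word message between two nodes
    with cut-through routing: t_s + m * t_w."""
    return t_s + m * t_w

# Illustrative values: 50 us startup, 0.5 us per word, 1000-word message.
print(transfer_time(50e-6, 0.5e-6, 1000))  # 0.00055 seconds
```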

Many interactions in practical parallel programs occur in well-defined patterns involving more than two processes. Often either all processes participate together in a single global interaction operation, or subsets of processes participate in interactions local to each subset.

Let's dig into the different kinds of interactions.

Chapter 4.1

One-to-All Broadcast & All-to-One Reduction

One-to-All Broadcast

Parallel algorithms often require a single process to send identical data to all other processes or to a subset of them

Initially, only the source process has the data of size m that needs to be broadcast. At the termination of the procedure, there are p copies of the initial data, one belonging to each process.

All-to-One Reduction

Each of the p participating processes starts with a buffer M containing m words. The data from all processes are combined through an associative operator and accumulated at a single destination process into one buffer of size m

Usage of One-to-All Broadcast and All-to-One Reduction

- matrix-vector multiplication
- Gaussian elimination
- shortest paths
- vector inner product

Chapter 4.1.1

Ring or Linear Array


A naive way to perform one-to-all broadcast is to sequentially send p - 1 messages from the source to the other p - 1 processes.


This is inefficient because the source process becomes a bottleneck: only the connection between a single pair of nodes is used at a time.

How can we solve this bottleneck?

The solution is recursive doubling.

The source process first sends the message to another process. Now both these processes can simultaneously send the message to two other processes that are still waiting for the message. By continuing this procedure until all the processes have received the data, the message can be broadcast in log p steps.
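A minimal Python sketch of recursive doubling (the pairing of holders with waiting processes is illustrative; the actual pairing depends on the topology, as the following sections show):

```python
def recursive_doubling_steps(p, source=0):
    """In each step, every process that already has the message
    sends it to one process that does not; holders double each step."""
    has_msg = {source}
    steps = 0
    while len(has_msg) < p:
        waiting = [q for q in range(p) if q not in has_msg]
        # Each current holder sends to one distinct waiting process.
        for receiver in waiting[:len(has_msg)]:
            has_msg.add(receiver)
        steps += 1
    return steps

print(recursive_doubling_steps(8))   # 3 steps = log2(8)
print(recursive_doubling_steps(16))  # 4 steps = log2(16)
```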


Reduction on a linear array can be performed by simply reversing the direction and the sequence of communication.

Consider the problem of multiplying a matrix with a vector.

The n x n matrix is assigned to an n x n (virtual) processor grid. The vector is assumed to be on the first row of processors.

The first step of the product requires a one-to-all broadcast of the vector element along the corresponding column of processors. This can be done concurrently for all n columns.

Each processor computes the local product of its vector element and its local matrix entry.

In the final step, these products are accumulated using n concurrent all-to-one sum reductions along the rows, leaving the result vector on the first column of processors. (Since x[j] was broadcast down column j, the sum for result element i runs across row i.)
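A compact sketch of this pattern using NumPy, where the n x n processor grid is simulated by arrays and each array operation stands in for the corresponding collective (variable names are illustrative):

```python
import numpy as np

n = 4
rng = np.random.default_rng(0)
A = rng.integers(0, 10, size=(n, n))   # A[i, j] lives on grid processor (i, j)
x = rng.integers(0, 10, size=n)        # x[j] lives on the first-row processor (0, j)

# Step 1: one-to-all broadcast of x[j] down column j (n concurrent broadcasts).
x_grid = np.tile(x, (n, 1))            # processor (i, j) now holds x[j]

# Step 2: each processor computes its local product A[i, j] * x[j].
local = A * x_grid

# Step 3: n concurrent all-to-one sum reductions along the rows;
# y[i] accumulates at the first processor of row i.
y = local.sum(axis=1)

assert np.array_equal(y, A @ x)        # matches the sequential product
print(y)
```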

Chapter 4.1.2

Mesh

Each row and column of a square mesh of p nodes can be treated as a linear array of √p nodes.

The linear-array communication operation can be performed on the mesh in two phases:

1. In the first phase, the operation is performed along one or all rows by treating the rows as linear arrays.
2. In the second phase, the columns are treated similarly.

Consider a one-to-all broadcast on a 2-D mesh.

In the first phase, the data goes to the other √p - 1 nodes in the source's row.


Once all the nodes in a row of the mesh have acquired the data, they initiate a one-to-all broadcast in their respective columns.


At the end of the second phase, every node in the mesh has a copy of the initial message.
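A small sketch of the step count for these two phases, assuming recursive doubling within the source row and then within every column (function name is illustrative):

```python
import math

def mesh_broadcast_steps(p):
    """One-to-all broadcast on a sqrt(p) x sqrt(p) mesh: recursive doubling
    along the source's row, then along all columns concurrently."""
    side = math.isqrt(p)
    assert side * side == p, "p must be a perfect square"
    return 2 * math.ceil(math.log2(side))   # row phase + column phase

for p in (4, 16, 64):
    print(p, mesh_broadcast_steps(p))  # 2, 4, 6 -- the same log2(p) total as the ring
```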

Chapter 4.1.3

Hypercube

In a 3-D mesh, the rows of p^(1/3) nodes along each of the three dimensions would be treated as linear arrays.

As on the 2-D mesh, the procedure is carried out in three phases, one along each dimension.

In general, a d-dimensional mesh needs d phases. A hypercube with 2^d nodes is a d-dimensional mesh with two nodes in each dimension, so one-to-all broadcast on a hypercube takes d = log p steps.

Chapter 4.1.4

Balanced Binary Tree

The source processor is the root of this tree.

In the first step, the source sends the data to the right child (assuming the source is also the left child). The problem has now been decomposed into two problems with half the number of processors.
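A minimal sketch of this recursive decomposition (the list-of-nodes representation and the send log are illustrative):

```python
def tree_broadcast(nodes, msg, sends=None):
    """The source (first node of the range) sends to the node at the
    midpoint, then both halves recurse independently, in parallel."""
    if sends is None:
        sends = []
    if len(nodes) > 1:
        mid = len(nodes) // 2
        sends.append((nodes[0], nodes[mid], msg))  # source -> its right child
        tree_broadcast(nodes[:mid], msg, sends)    # left half, rooted at nodes[0]
        tree_broadcast(nodes[mid:], msg, sends)    # right half, rooted at nodes[mid]
    return sends

for send in tree_broadcast(list(range(8)), "X"):
    print(send)   # 7 sends in total, finishing in log2(8) = 3 parallel rounds
```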

Chapter 4.1.5

Algorithm: One-to-All Broadcast

All of the algorithms described above are adaptations of the same algorithmic template.

We illustrate the algorithm for a hypercube, but the algorithm, as has been seen, can be adapted to other architectures.

The hypercube has 2^d nodes, and my_id is the label for a node.

X is the message to be broadcast, which initially resides at the source node 0 (label 000 for d = 3).

The variable mask helps determine which nodes communicate in a particular iteration of the loop

Algorithm 4.1 works only if node 0 is the source of the broadcast. For an arbitrary source, we must relabel the nodes of the hypothetical hypercube by XORing the label of each node with the label of the source node before we apply this procedure.

This yields a general algorithm that works for an arbitrary source.
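A Python sketch of the procedure, simulating all nodes in one process; the buffer array stands in for per-node memory, and the XOR relabeling implements the arbitrary-source generalization described above:

```python
def one_to_all_broadcast(d, source, msg):
    """Simulate Algorithm 4.1 on a d-dimensional hypercube, generalized to
    an arbitrary source by XOR-relabeling the node ids."""
    p = 1 << d
    buf = [None] * p                     # buf[i] stands in for node i's memory
    buf[source] = msg
    mask = p - 1                         # all d bits of mask set
    for i in range(d - 1, -1, -1):       # loop over dimensions d-1 down to 0
        mask ^= 1 << i                   # clear bit i of mask
        for my_id in range(p):
            virtual = my_id ^ source     # relabel so the source becomes node 0
            if virtual & mask:           # lower i bits not all 0: idle this round
                continue
            partner = (virtual ^ (1 << i)) ^ source   # neighbor along dimension i
            if virtual & (1 << i) == 0:  # bit i is 0: this node is the sender
                buf[partner] = buf[my_id]
    return buf

print(one_to_all_broadcast(3, source=5, msg="X"))  # all 8 buffers hold "X"
```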

Cost Analysis

T = (t_s + m t_w) log p

Chapter 4.2

All-to-All Broadcast and Reduction

Generalization of One-to-All-Broadcast

A process sends the same m-word message to every other process, but different processes may broadcast different messages.

Usage of All-to-All-Broadcast

It is used in matrix operations such as matrix multiplication and matrix-vector multiplication.

The dual of all-to-all broadcast is all-to-all reduction.


All-to-All-Broadcast in Linear Array and Ring

While performing all-to-all broadcast on a linear array or a ring, all communication links can be kept busy simultaneously until the operation is complete because each node always has some information that it can pass along to its neighbor.

Simplest approach: perform p one-to-all broadcasts. This is not the most efficient way, though.

Each node first sends to one of its neighbors the data it needs to broadcast.

In subsequent steps, it forwards the data received from one of its neighbors to its other neighbor.

The algorithm terminates in p-1 steps.


Algorithm for All-to-All-Broadcast in Ring
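Since the original pseudocode is not reproduced here, the following Python sketch simulates the ring procedure (the synchronous all-at-once exchange is a simplification of the per-step sends):

```python
def ring_all_to_all_broadcast(msgs):
    """Each node starts with its own message; in each of the p-1 steps,
    every node forwards to its right neighbor whatever it received last."""
    p = len(msgs)
    collected = [[m] for m in msgs]
    in_flight = list(msgs)                # what each node sends next
    for _ in range(p - 1):
        received = [in_flight[(i - 1) % p] for i in range(p)]  # simultaneous sends
        for i in range(p):
            collected[i].append(received[i])
        in_flight = received              # forward what was just received
    return collected

for node, msgs in enumerate(ring_all_to_all_broadcast(["m0", "m1", "m2", "m3"])):
    print(node, msgs)                     # every node ends up with all p messages
```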

All-to-all reduction, the dual of all-to-all broadcast, can be performed by reversing the direction and the sequence of the messages.

Example: the first communication step of all-to-all broadcast becomes the last step of all-to-all reduction.

For instance, node 0 sends msg[1] to node 7 instead of receiving it. The only additional step required is that, upon receiving a message, a node must combine it with the local copy of the message that has the same destination as the received message before forwarding the combined message to the next neighbor.

All-to-All-Broadcast in Mesh

Performed in two phases:

- In the first phase, each row of the mesh performs an all-to-all broadcast using the procedure for the linear array. In this phase, all nodes collect √p messages corresponding to the √p nodes of their respective rows. Each node consolidates this information into a single message of size m√p.
- In the second phase, a columnwise all-to-all broadcast of the consolidated messages is performed.

Algorithm for All-to-All-Broadcast in Mesh
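A Python sketch of the two phases (the nested-list representation of the grid and of the consolidated messages is illustrative):

```python
def mesh_all_to_all_broadcast(grid_msgs):
    """grid_msgs[i][j] is the message of node (i, j) on a side x side mesh."""
    side = len(grid_msgs)
    # Phase 1: rowwise all-to-all broadcast; node (i, j) collects the
    # side messages of row i and consolidates them into one message.
    consolidated = [list(row) for row in grid_msgs]   # one consolidated message per row
    # Phase 2: columnwise all-to-all broadcast of the consolidated messages.
    result = [[None] * side for _ in range(side)]
    for j in range(side):
        column = [consolidated[i] for i in range(side)]   # messages moving in column j
        gathered = [m for block in column for m in block] # all p original messages
        for i in range(side):
            result[i][j] = gathered
    return result

msgs = [[f"m{i}{j}" for j in range(3)] for i in range(3)]
print(mesh_all_to_all_broadcast(msgs)[1][2])  # node (1, 2) holds all 9 messages
```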

All-to-All-Broadcast in Hypercube

The hypercube algorithm is a generalization of the mesh algorithm to log p dimensions.

The message size doubles at each of the log p steps.

Cost Analysis

For a linear array or ring: T = (t_s + m t_w)(p - 1)

For a mesh: T_first = (t_s + m t_w)(√p - 1) in the rowwise phase and T_second = (t_s + m √p t_w)(√p - 1) in the columnwise phase, giving

T_total = 2 t_s(√p - 1) + m t_w(p - 1)

For a hypercube, with the message size doubling at each of the log p steps:

T = t_s log p + m t_w(p - 1)

Chapter 4.3

All-Reduce and Prefix Sum Operation

All-Reduce

Each node starts with a buffer of size m, and the final result of the operation is an identical buffer of size m on each node, formed by combining the original p buffers using an associative operator.

There are two ways to perform all-reduce.

A simple method is to perform an all-to-one reduction followed by a one-to-all broadcast.

A faster way is to use the communication pattern of all-to-all broadcast. The key difference is that messages are combined rather than concatenated, so the message size stays m at every step instead of growing to mp.
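A Python sketch of the faster method on a hypercube, where partners across each dimension exchange buffers and combine them, so the buffer size never grows (the default operator is addition; any associative operator works):

```python
def hypercube_all_reduce(values, op=lambda a, b: a + b):
    """At step i, partners across dimension i exchange their current
    buffers and combine them; the buffer size stays m throughout."""
    p = len(values)
    d = p.bit_length() - 1
    assert 1 << d == p, "p must be a power of two"
    buf = list(values)
    for i in range(d):
        exchanged = [buf[node ^ (1 << i)] for node in range(p)]  # pairwise exchange
        buf = [op(buf[node], exchanged[node]) for node in range(p)]
    return buf

print(hypercube_all_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # every node ends with 36
```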


Chapter 4.4

Scatter and Gather


In the scatter operation, a single node sends a unique message of size m to every other node (also called a one-to-all personalized communication).

In the gather operation, a single node collects a unique message from each node.

While the scatter operation is fundamentally different from broadcast, the algorithmic structure is similar, except for differences in message sizes (messages get smaller in scatter and stay constant in broadcast).

The gather operation is exactly the inverse of the scatter operation.

Example of Scatter on Hypercube

In the first communication step, the source transfers half of the messages to one of its neighbors. In subsequent steps, each node that has some data transfers half of it to a neighbor that has yet to receive any data. There is a total of log p communication steps corresponding to the log p dimensions of the hypercube.
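A Python sketch of this scatter (the dictionary-of-held-messages representation is illustrative; it tracks which destinations' messages each node currently holds):

```python
def hypercube_scatter(d, source, messages):
    """messages[k] is destined for node k; the source starts with all of
    them and hands off half of what it holds at each of the log p steps."""
    p = 1 << d
    holding = {source: {dest: messages[dest] for dest in range(p)}}
    for i in range(d - 1, -1, -1):           # one step per dimension
        for node in list(holding):           # nodes that already hold data
            partner = node ^ (1 << i)
            # Hand over the messages whose destinations lie on the
            # partner's side of dimension i (relative to the source).
            give = {dest: m for dest, m in holding[node].items()
                    if (dest ^ source) & (1 << i) != (node ^ source) & (1 << i)}
            if give:
                holding[partner] = give
                holding[node] = {dest: m
                                 for dest, m in holding[node].items()
                                 if dest not in give}
    return holding

print(hypercube_scatter(3, 0, [f"m{k}" for k in range(8)]))
# each of the 8 nodes ends up holding exactly its own message
```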

The gather operation is simply the reverse of scatter.