Parallel Programming: Techniques and Applications using Networked Workstations and Parallel Computers
Barry Wilkinson and Michael Allen, Prentice Hall, 1998

Parallel Programming - Slides


Page 1: Parallel Programming - Slides

Figure 1.1 Astrophysical N-body simulation by Scott Linssen (undergraduate University of North Carolina at Charlotte [UNCC] student).

Page 2: Parallel Programming - Slides

Figure 1.2 Conventional computer having a single processor and memory. Main memory supplies instructions (to processor) and data (to or from processor).

Page 3: Parallel Programming - Slides

Figure 1.3 Traditional shared memory multiprocessor model: processors connected through an interconnection network to memory modules forming one address space.

Page 4: Parallel Programming - Slides

Figure 1.4 Message-passing multiprocessor model (multicomputer): computers, each with a processor and local memory, exchange messages over an interconnection network.

Page 5: Parallel Programming - Slides

Figure 1.5 Shared memory multiprocessor implementation: computers, each with a processor and part of the shared memory, exchange messages over an interconnection network.

Page 6: Parallel Programming - Slides

Figure 1.6 MPMD structure: each processor executes its own program on its own data, with instructions flowing from program to processor.

Page 7: Parallel Programming - Slides

Figure 1.7 Static link multicomputer: computers (processor P, memory M, communication interface C) connected by a network with direct links between computers.

Page 8: Parallel Programming - Slides

Figure 1.8 Node with a switch for internode message transfers: each computer (node) contains a processor, memory, and a switch with links to other nodes.

Page 9: Parallel Programming - Slides

Figure 1.9 A link between two nodes with separate wires in each direction.

Page 10: Parallel Programming - Slides


Figure 1.10 Ring.

Page 11: Parallel Programming - Slides

Figure 1.11 Two-dimensional array (mesh): computers/processors connected by links.

Page 12: Parallel Programming - Slides

Figure 1.12 Tree structure: processing elements connected by links, descending from a root.

Page 13: Parallel Programming - Slides

Figure 1.13 Three-dimensional hypercube with nodes numbered 000 to 111.

Page 14: Parallel Programming - Slides

Figure 1.14 Four-dimensional hypercube with nodes numbered 0000 to 1111.

Page 15: Parallel Programming - Slides

Figure 1.15 Embedding a ring onto a torus.

Page 16: Parallel Programming - Slides

Figure 1.16 Embedding a mesh into a hypercube: mesh coordinates x and y each run in Gray-code order 00, 01, 11, 10, giving nodal addresses such as 1011.

Page 17: Parallel Programming - Slides

Figure 1.17 Embedding a tree into a mesh, with the root and nodes (labeled A) placed on mesh positions.

Page 18: Parallel Programming - Slides

Figure 1.18 Distribution of flits: a packet with a head flit advances through flit buffers under control of request/acknowledge signal(s).

Page 19: Parallel Programming - Slides

Figure 1.19 A signaling method between processors for wormhole routing (Ni and McKinley, 1993): data and R/A (request/acknowledge) lines between source and destination processors.

Page 20: Parallel Programming - Slides

Figure 1.20 Network delay characteristics: network latency against distance (number of nodes between source and destination) for packet switching, circuit switching, and wormhole routing.

Page 21: Parallel Programming - Slides

Figure 1.21 Deadlock in store-and-forward networks: messages among nodes 1, 2, 3, and 4 block in a cycle.

Page 22: Parallel Programming - Slides

Figure 1.22 Multiple virtual channels, each with a route buffer, mapped onto a single physical link between nodes.

Page 23: Parallel Programming - Slides

Figure 1.23 Ethernet-type single wire network: workstations and a workstation/file server on one Ethernet.

Page 24: Parallel Programming - Slides

Figure 1.24 Ethernet frame format (in transmission order): preamble (64 bits), destination address (48 bits), source address (48 bits), type (16 bits), data (variable), frame check sequence (32 bits).

Page 25: Parallel Programming - Slides

Figure 1.25 Network of workstations and a workstation/file server connected via a ring.

Page 26: Parallel Programming - Slides

Figure 1.26 Star connected network: workstations connected to a central workstation/file server.

Page 27: Parallel Programming - Slides

Figure 1.27 Overlapping connectivity Ethernets forming a parallel programming cluster: (a) using specially designed adaptors; (b) using separate Ethernet interfaces.

Page 28: Parallel Programming - Slides

Figure 1.28 Space-time diagram of a message-passing program: four processes over time, showing computing, waiting to send a message, and messages whose slope indicates the time to send a message.

Page 29: Parallel Programming - Slides

Figure 1.29 Parallelizing a sequential problem — Amdahl's law. (a) One processor: execution time ts, split into a serial section f·ts and parallelizable sections (1 − f)·ts. (b) Multiple processors (n): parallel time tp = f·ts + (1 − f)·ts/n.
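The times in Figure 1.29 give the usual speedup formula directly: S(n) = ts/tp = n/(1 + (n − 1)f). A minimal sketch in C (the function name amdahl_speedup is ours, not the book's):

```c
#include <assert.h>

/* Speedup predicted by Amdahl's law: with serial fraction f and n
   processors, tp = f*ts + (1 - f)*ts/n, so
   S(n) = ts / tp = n / (1 + (n - 1)*f). */
double amdahl_speedup(int n, double f) {
    return (double)n / (1.0 + (n - 1) * f);
}
```

With f = 0 the speedup is the ideal n; with f = 1 it is 1 regardless of n, matching the curves in Figure 1.30.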

Page 30: Parallel Programming - Slides

Figure 1.30 (a) Speedup factor S(n) against number of processors n for serial fractions f = 0%, 5%, 10%, and 20%. (b) Speedup factor S(n) against serial fraction f for n = 16 and n = 256.

Page 31: Parallel Programming - Slides

Figure 2.1 Single program, multiple data operation: one source file is compiled to suit each processor, producing executables for processor 0 through processor n − 1.

Page 32: Parallel Programming - Slides

Figure 2.2 Spawning a process: process 1 calls spawn() to start execution of process 2.

Page 33: Parallel Programming - Slides

Figure 2.3 Passing a message between processes using send() and recv() library calls: process 1 executes send(&x, 2) and process 2 executes recv(&y, 1); the data moves from x to y.

Page 34: Parallel Programming - Slides

Figure 2.4 Synchronous send() and recv() library calls using a three-way protocol (request to send, acknowledgment, message). (a) When send() occurs before recv(), process 1 suspends until the request is acknowledged. (b) When recv() occurs before send(), process 2 suspends until the message arrives. In both cases, both processes then continue.

Page 35: Parallel Programming - Slides

Figure 2.5 Using a message buffer: process 1's send() deposits the message in a message buffer and the process continues; process 2's recv() later reads the message buffer.

Page 36: Parallel Programming - Slides

Figure 2.6 Broadcast operation: processes 0 through n − 1 each call bcast(); the data in the root's buf is delivered to every process.

Page 37: Parallel Programming - Slides

Figure 2.7 Scatter operation: processes 0 through n − 1 each call scatter(); successive elements of the root's buf are distributed, one to each process.

Page 38: Parallel Programming - Slides

Figure 2.8 Gather operation: processes 0 through n − 1 each call gather(); data from every process is collected into the root's buf.

Page 39: Parallel Programming - Slides

Figure 2.9 Reduce operation (addition): processes 0 through n − 1 each call reduce(); the data items are combined with + into the root's buf.

Page 40: Parallel Programming - Slides

Figure 2.10 Message passing between workstations using PVM: each workstation runs a PVM daemon and an application program (executable); messages are sent through the network between daemons.

Page 41: Parallel Programming - Slides

Figure 2.11 Multiple processes allocated to each processor (workstation): several application programs (executables) per workstation communicate through the PVM daemons and the network.

Page 42: Parallel Programming - Slides

Figure 2.12 pvm_psend() and pvm_precv() system calls: process 1 packs an array holding data into a send buffer and calls pvm_psend(), then continues; process 2 waits for the message in pvm_precv() and receives the data into an array.

Page 43: Parallel Programming - Slides

Figure 2.13 PVM packing messages, sending, and unpacking.

Process_1:
  pvm_initsend();
  pvm_pkint( … &x …);
  pvm_pkstr( … &s …);
  pvm_pkfloat( … &y …);
  pvm_send(process_2 … );

Process_2:
  pvm_recv(process_1 …);
  pvm_upkint( … &x …);
  pvm_upkstr( … &s …);
  pvm_upkfloat( … &y … );

The values x, s, and y pass from the send buffer, through the message, to the receive buffer.

Page 44: Parallel Programming - Slides

Figure 2.14 Sample PVM program. The master broadcasts the data; the slaves return partial results.

Master:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pvm3.h>
#define SLAVE "spsum"
#define PROC 10
#define NELEM 1000

main() {
  int mytid, tids[PROC];
  int n = NELEM, nproc = PROC;
  int no, i, who, msgtype;
  int data[NELEM], result[PROC], tot = 0;
  char fn[255];
  FILE *fp;

  mytid = pvm_mytid();                       /* Enroll in PVM */

  /* Start slave tasks */
  no = pvm_spawn(SLAVE, (char **)0, 0, "", nproc, tids);
  if (no < nproc) {
    printf("Trouble spawning slaves\n");
    for (i = 0; i < no; i++) pvm_kill(tids[i]);
    pvm_exit(); exit(1);
  }

  /* Open input file and initialize data */
  strcpy(fn, getenv("HOME"));
  strcat(fn, "/pvm3/src/rand_data.txt");
  if ((fp = fopen(fn, "r")) == NULL) {
    printf("Can't open input file %s\n", fn);
    exit(1);
  }
  for (i = 0; i < n; i++) fscanf(fp, "%d", &data[i]);

  /* Broadcast data to slaves */
  msgtype = 0;
  pvm_initsend(PvmDataDefault);
  pvm_pkint(&nproc, 1, 1);
  pvm_pkint(tids, nproc, 1);
  pvm_pkint(&n, 1, 1);
  pvm_pkint(data, n, 1);
  pvm_mcast(tids, nproc, msgtype);

  /* Get results from slaves */
  msgtype = 5;
  for (i = 0; i < nproc; i++) {
    pvm_recv(-1, msgtype);
    pvm_upkint(&who, 1, 1);
    pvm_upkint(&result[who], 1, 1);
    printf("%d from %d\n", result[who], who);
  }

  /* Compute global sum */
  for (i = 0; i < nproc; i++) tot += result[i];
  printf("The total is %d.\n\n", tot);

  pvm_exit();                                /* Program finished; exit PVM */
  return(0);
}

Slave:

#include <stdio.h>
#include "pvm3.h"
#define PROC 10
#define NELEM 1000

main() {
  int mytid, tids[PROC];
  int n, me, i, msgtype;
  int x, low, high, nproc, master;
  int data[NELEM], sum = 0;

  mytid = pvm_mytid();

  /* Receive data from master */
  msgtype = 0;
  pvm_recv(-1, msgtype);
  pvm_upkint(&nproc, 1, 1);
  pvm_upkint(tids, nproc, 1);
  pvm_upkint(&n, 1, 1);
  pvm_upkint(data, n, 1);

  /* Determine my index */
  for (i = 0; i < nproc; i++)
    if (mytid == tids[i]) { me = i; break; }

  /* Add my portion of data */
  x = n / nproc;
  low = me * x;
  high = low + x;
  for (i = low; i < high; i++)
    sum += data[i];

  /* Send result to master */
  pvm_initsend(PvmDataDefault);
  pvm_pkint(&me, 1, 1);
  pvm_pkint(&sum, 1, 1);
  msgtype = 5;
  master = pvm_parent();
  pvm_send(master, msgtype);

  pvm_exit();                                /* Exit PVM */
  return(0);
}

Page 45: Parallel Programming - Slides

Figure 2.15 Unsafe message passing with libraries. (a) Intended behavior: the application-level send(…,1,…) in process 0 matches the application-level recv(…,0,…) in process 1, and the sends and receives inside the library routines lib() match each other. (b) Possible behavior: a recv() inside lib() in process 1 accepts the message from the application-level send() in process 0, so the message goes to the wrong destination.

Page 46: Parallel Programming - Slides

Figure 2.16 Sample MPI program.

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAXSIZE 1000

void main(int argc, char *argv[]) {
  int myid, numprocs;
  int data[MAXSIZE], i, x, low, high, myresult = 0, result;
  char fn[255];
  FILE *fp;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);

  if (myid == 0) {  /* Open input file and initialize data */
    strcpy(fn, getenv("HOME"));
    strcat(fn, "/MPI/rand_data.txt");
    if ((fp = fopen(fn, "r")) == NULL) {
      printf("Can't open the input file: %s\n\n", fn);
      exit(1);
    }
    for (i = 0; i < MAXSIZE; i++) fscanf(fp, "%d", &data[i]);
  }

  /* Broadcast data */
  MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);

  /* Add my portion of data */
  x = MAXSIZE / numprocs;
  low = myid * x;
  high = low + x;
  for (i = low; i < high; i++)
    myresult += data[i];
  printf("I got %d from %d\n", myresult, myid);

  /* Compute global sum */
  MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (myid == 0) printf("The sum is %d.\n", result);

  MPI_Finalize();
}

Page 47: Parallel Programming - Slides

Figure 2.17 Theoretical communication time: time against number of data items (n), offset by the startup time.
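The straight line in Figure 2.17 is commonly modeled as tcomm = tstartup + n·tdata, with the startup time as the intercept. A small sketch (the function and parameter names are ours):

```c
#include <assert.h>

/* Theoretical communication time for sending n data items:
   t_comm = t_startup + n * t_data -- a straight line whose intercept
   is the startup (message latency) time and whose slope is the
   per-item transmission time. */
double comm_time(double t_startup, double t_data, int n) {
    return t_startup + n * t_data;
}
```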

Page 48: Parallel Programming - Slides

Figure 2.18 Growth of the function f(x) = 4x² + 2x + 12, bounded below by c₁g(x) = 2x² and above by c₂g(x) = 6x² for all x greater than some x₀.

Page 49: Parallel Programming - Slides

Figure 2.19 Broadcast in a three-dimensional hypercube: nodes 000–111 receive the message over three steps (1st, 2nd, and 3rd), one dimension per step.
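The step structure of Figures 2.19 and 2.20 can be sketched as follows, assuming broadcast from node 0 with one dimension flipped per step; the function names and the dimension ordering are ours (one of several equivalent choices):

```c
#include <assert.h>

/* One possible hypercube broadcast schedule from node 0 in a d-dimensional
   cube: at step s (1..d), every node whose id is below 2^(s-1) -- i.e. a
   node that already holds the message -- forwards it to the node obtained
   by flipping bit (s-1). After d steps all 2^d nodes hold the message. */
int broadcast_partner(int id, int step) {
    return id ^ (1 << (step - 1));   /* flip one dimension per step */
}

/* Number of nodes holding the message after a given number of steps:
   it doubles every step, so a d-dimensional cube needs only d steps. */
int nodes_reached(int steps) {
    return 1 << steps;
}
```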

Page 50: Parallel Programming - Slides

Figure 2.20 Broadcast as a tree construction: starting from P000, the message spreads in three steps — step 1: P000 → P001; step 2: P000 → P010 and P001 → P011; step 3: P000 → P100, P010 → P110, P001 → P101, and P011 → P111.

Page 51: Parallel Programming - Slides

Figure 2.21 Broadcast in a mesh: step numbers (1 through 6) mark when each node receives the message as it spreads from the source.

Page 52: Parallel Programming - Slides

Figure 2.22 Broadcast on an Ethernet network: a single message from the source reaches all destinations.

Page 53: Parallel Programming - Slides

Figure 2.23 1-to-N fan-out broadcast: the source sends sequentially to N destinations.

Page 54: Parallel Programming - Slides

Figure 2.24 1-to-N fan-out broadcast on a tree structure, with sequential message issue from the source and each intermediate node to its destinations.

Page 55: Parallel Programming - Slides

Figure 2.25 Space-time diagram of a parallel program: three processes over time, showing computing, waiting, message-passing system routines, and messages.

Page 56: Parallel Programming - Slides

Figure 2.26 Program profile: number of repetitions or time plotted against statement number or region of program (1 through 10).

Page 57: Parallel Programming - Slides

Figure 3.1 Disconnected computational graph (embarrassingly parallel problem): input data feeds independent processes that produce results.

Page 58: Parallel Programming - Slides

Figure 3.2 Practical embarrassingly parallel computational graph with dynamic process creation and the master-slave approach: the master spawn()s slaves, send()s initial data which each slave recv()s, and recv()s the results each slave send()s back.

Page 59: Parallel Programming - Slides

Figure 3.3 Partitioning a 640 × 480 display into regions for individual processes, with a map from each process to its region: (a) an 80 × 80 square region for each process; (b) a 640 × 10 row region for each process.

Page 60: Parallel Programming - Slides

Figure 3.4 Mandelbrot set, plotted in the complex plane with real and imaginary axes running from −2 to +2.

Page 61: Parallel Programming - Slides

Figure 3.5 Work pool approach: tasks such as (xa, ya), (xb, yb), (xc, yc), (xd, yd), and (xe, ye) wait in a work pool; a process takes a task and returns results/requests a new task.

Page 62: Parallel Programming - Slides

Figure 3.6 Counter termination: a count of rows outstanding in slaves is incremented when a row is sent and decremented when a row is returned; with rows 0 to disp_height, the computation terminates when all rows have been returned.

Page 63: Parallel Programming - Slides

Figure 3.7 Computing π by a Monte Carlo method: a circle of area π inscribed in a 2 × 2 square of total area 4.
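The construction in Figure 3.7 can be sketched sequentially: the fraction of random points in the square that land inside the circle approximates π/4. A sketch (the function name is ours; the C library rand() is used for brevity, not statistical quality):

```c
#include <stdlib.h>

/* Monte Carlo estimate of pi (Figure 3.7): throw points uniformly into
   the 2 x 2 square centered on the origin; the fraction landing inside
   the unit circle approximates pi / 4. Seeding is left to the caller. */
double estimate_pi(long samples) {
    long inside = 0;
    for (long i = 0; i < samples; i++) {
        double x = 2.0 * rand() / RAND_MAX - 1.0;  /* x in [-1, 1] */
        double y = 2.0 * rand() / RAND_MAX - 1.0;  /* y in [-1, 1] */
        if (x * x + y * y <= 1.0)
            inside++;                              /* point is in the circle */
    }
    return 4.0 * (double)inside / samples;
}
```

The estimate converges slowly (error shrinks as 1/√samples), which is why the figure's approach is usually parallelized, as in Figure 3.9.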

Page 64: Parallel Programming - Slides

Figure 3.8 Function being integrated in computing π by a Monte Carlo method: y = f(x) = √(1 − x²) for 0 ≤ x ≤ 1.

Page 65: Parallel Programming - Slides

Figure 3.9 Parallel Monte Carlo integration: a separate random number process supplies random numbers to the slaves on request; the slaves return partial sums to the master.

Page 66: Parallel Programming - Slides

Figure 3.10 Parallel computation of a sequence: terms x1 … xk−1, xk go to one process, xk+1, xk+2 … x2k−1, x2k to the next, and so on.

Page 67: Parallel Programming - Slides

Figure 4.1 Partitioning a sequence of numbers into m parts and adding the parts: partial sums of x0 … x(n/m)−1, xn/m … x(2n/m)−1, …, x(m−1)n/m … xn−1 are then added to give the sum.

Page 68: Parallel Programming - Slides

Figure 4.2 Tree construction: the initial problem is divided repeatedly until final tasks remain.

Page 69: Parallel Programming - Slides

Figure 4.3 Dividing a list into parts: the original list x0 … xn−1, held by P0, is split between P0 and P4, then among P0, P2, P4, and P6, until each of P0–P7 holds one part.

Page 70: Parallel Programming - Slides

Figure 4.4 Partial summation: partial sums from P0–P7 are combined pairwise (into P0, P2, P4, and P6, then into P0 and P4, then into P0) to give the final sum of x0 … xn−1.
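The pairwise combination in Figures 4.3 and 4.4 follows the divide-and-conquer pattern: split the list in half, sum each half, and add the two partial sums. A sequential sketch (the function name is ours):

```c
/* Divide-and-conquer summation (Figures 4.3/4.4): split the list in half,
   sum each half recursively, and add the two partial sums -- the same
   combining pattern the processes P0..P7 follow in the figures. */
int tree_sum(const int *x, int n) {
    if (n == 1)
        return x[0];                 /* a leaf holds one number */
    int half = n / 2;
    return tree_sum(x, half) + tree_sum(x + half, n - half);
}
```

Run in parallel, the additions at each level of the recursion proceed simultaneously, so n numbers are summed in about log n combining steps.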

Page 71: Parallel Programming - Slides

Figure 4.5 Part of a search tree: OR nodes combine found/not found results from the subtrees.

Page 72: Parallel Programming - Slides


Figure 4.6 Quadtree.

Page 73: Parallel Programming - Slides

Figure 4.7 Dividing an image: the image area is divided first into four parts, then each part is divided again.

Page 74: Parallel Programming - Slides

Figure 4.8 Bucket sort: unsorted numbers are placed into buckets, the contents of the buckets are sorted, and the lists are merged to give the sorted numbers.
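A sequential sketch of the bucket sort in Figure 4.8, assuming non-negative integers below a known maximum; the names, the number of buckets, and the fixed bucket capacity are ours (a real implementation would size buckets dynamically):

```c
#include <string.h>

#define NBUCKETS 4
#define BUCKET_CAP 64

/* Bucket sort (Figure 4.8) for n integers in [0, max): drop each number
   into one of NBUCKETS equal ranges, insertion-sort each bucket, and
   concatenate ("merge") the buckets back into a[]. Sketch only: no
   overflow checking on BUCKET_CAP. */
void bucket_sort(int *a, int n, int max) {
    int bucket[NBUCKETS][BUCKET_CAP];
    int count[NBUCKETS];
    memset(count, 0, sizeof(count));

    /* distribute into buckets by value range */
    for (int i = 0; i < n; i++) {
        int b = (long)a[i] * NBUCKETS / max;
        bucket[b][count[b]++] = a[i];
    }

    /* sort each bucket, then concatenate in bucket order */
    int k = 0;
    for (int b = 0; b < NBUCKETS; b++) {
        for (int i = 1; i < count[b]; i++) {     /* insertion sort */
            int v = bucket[b][i], j = i - 1;
            while (j >= 0 && bucket[b][j] > v) {
                bucket[b][j + 1] = bucket[b][j];
                j--;
            }
            bucket[b][j + 1] = v;
        }
        for (int i = 0; i < count[b]; i++)
            a[k++] = bucket[b][i];
    }
}
```

The parallel versions in Figures 4.9 and 4.10 assign one bucket (or one group of small buckets) per processor, so the per-bucket sorts proceed simultaneously.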

Page 75: Parallel Programming - Slides

Figure 4.9 One parallel version of bucket sort: p processors, one per bucket; the unsorted numbers are placed into the buckets, each processor sorts the contents of its bucket, and the lists are merged to give the sorted numbers.

Page 76: Parallel Programming - Slides

Figure 4.10 Parallel version of bucket sort: the unsorted numbers are divided among p processors (n/m numbers each); each processor separates its numbers into small buckets and empties the small buckets into the large buckets; the contents of the large buckets are sorted and the lists merged to give the sorted numbers.

Page 77: Parallel Programming - Slides

Figure 4.11 “All-to-all” broadcast: each of processes 0 through n − 1 sends the contents of its send buffer to every other process and receives from every other process into its buffer.

Page 78: Parallel Programming - Slides

Figure 4.12 Effect of “all-to-all” on an array: a 4 × 4 array A stored row-wise across P0–P3 (Pi holds Ai,0 Ai,1 Ai,2 Ai,3) becomes stored column-wise (Pi holds A0,i A1,i A2,i A3,i) — a transpose.

Page 79: Parallel Programming - Slides

Figure 4.13 Numerical integration of f(x) from a to b using rectangles: each strip of width δ is a rectangle with height taken from f, as at f(p) and f(q).

Page 80: Parallel Programming - Slides

Figure 4.14 More accurate numerical integration using rectangles: the height of each strip of width δ is taken at the midpoint of the interval between p and q.

Page 81: Parallel Programming - Slides

Figure 4.15 Numerical integration using the trapezoidal method: each strip of width δ between p and q is a trapezoid with parallel sides f(p) and f(q), over the interval [a, b].
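The trapezoidal method of Figure 4.15 sums strip areas of ½(f(p) + f(q))δ. A sequential sketch (the function names are ours):

```c
/* Trapezoidal rule (Figure 4.15): approximate the integral of f from a
   to b with n strips of width delta; each strip contributes the trapezoid
   area 0.5 * (f(p) + f(q)) * delta, where q = p + delta. */
double trapezoidal(double (*f)(double), double a, double b, int n) {
    double delta = (b - a) / n;
    double area = 0.0;
    for (int i = 0; i < n; i++) {
        double p = a + i * delta;
        double q = p + delta;
        area += 0.5 * (f(p) + f(q)) * delta;
    }
    return area;
}

/* Example integrand for testing: integral of x^2 over [0,1] is 1/3. */
static double square(double x) { return x * x; }
```

To parallelize, each process evaluates the strips in its own subinterval and the partial areas are combined with a reduce (addition), as in Figure 2.9.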

Page 82: Parallel Programming - Slides

Figure 4.16 Adaptive quadrature construction: the strip under f(x) is divided into regions A, B, and C, and subdivision continues until the smallest region's area is negligible.

Page 83: Parallel Programming - Slides

Figure 4.17 Adaptive quadrature with false termination: regions A and B of f(x) have equal and opposite areas, so C = 0 even though the computation has not actually converged.

Page 84: Parallel Programming - Slides

Figure 4.18 Clustering distant bodies: a distant cluster of bodies at distance r is replaced by its center of mass.

Page 85: Parallel Programming - Slides

Figure 4.19 Recursive division of two-dimensional space: particles are separated along alternating subdivision directions, producing a partial quadtree.

Page 86: Parallel Programming - Slides

Figure 4.20 Orthogonal recursive bisection method.

Page 87: Parallel Programming - Slides

Figure 4.21 Process diagram for Problem 4-12(b): a binary tree of + (addition) nodes combines the input numbers into the result in log n steps.

Page 88: Parallel Programming - Slides

88Parallel Programming: Techniques and Applications using Networked Workstations and Parallel Computers

Barry Wilkinson and Michael Allen Prentice Hall, 1998

Figure 4.22 Bisection method for finding the zero crossing location of a function: f(a) and f(b) have opposite signs, so f(x) crosses zero between a and b.
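The bisection method of Figure 4.22 repeatedly halves the interval [a, b], keeping whichever half still contains the sign change. A minimal C sketch (function names are ours, not from the book):

```c
#include <assert.h>

/* Bisection method of Figure 4.22: f(a) and f(b) must have opposite
   signs.  Each step keeps the half-interval containing the sign
   change, halving the search interval until it is narrower than tol. */
double bisect(double (*f)(double), double a, double b, double tol) {
    while (b - a > tol) {
        double mid = 0.5 * (a + b);
        if (f(a) * f(mid) <= 0.0)
            b = mid;                    /* zero crossing in lower half */
        else
            a = mid;                    /* zero crossing in upper half */
    }
    return 0.5 * (a + b);
}

/* Sample function with a zero crossing at sqrt(2). */
double shifted(double x) { return x * x - 2.0; }
```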

Figure 4.23 Convex hull (Problem 4-22).

Figure 5.1 Pipelined processes P0 to P5.

Figure 5.2 Pipeline for an unfolded loop: each stage holds one element a[i], takes sin from the previous stage, and passes on sout; the last stage produces the sum of a[0] to a[4].

Figure 5.3 Pipeline for a frequency filter: the stages successively remove frequencies f0, f1, f2, f3, and f4 from the signal f(t), producing the filtered signal.

Figure 5.4 Space-time diagram of a pipeline: instances 1, 2, 3, ... pass through processes P0-P5 over time; with p processes and m instances, execution takes p − 1 + m pipeline cycles.
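The space-time diagram gives the standard pipeline timing result: the first instance emerges after p cycles and one more completes every cycle thereafter, so m instances on p stages take p − 1 + m cycles. As a trivial helper (ours, purely illustrative):

```c
#include <assert.h>

/* Pipeline cycles for m instances on a p-stage pipeline (Figure 5.4):
   filling the pipeline costs p - 1 cycles, after which one instance
   completes per cycle, giving m + p - 1 cycles in total. */
int pipeline_cycles(int p, int m) {
    return m + p - 1;
}
```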

Figure 5.5 Alternative space-time diagram: instances 0 to 4 shown passing through processes P0-P5 over time.

Figure 5.6 Pipeline processing 10 data elements: (a) pipeline structure, with the input sequence d0 ... d9 entering P0 and passing through P1-P9; (b) timing diagram, taking p − 1 + n cycles.

Figure 5.7 Pipeline processing where information passes to the next stage before the end of the process: (a) processes with the same execution time; (b) processes not with the same execution time, where a transfer of sufficient information starts the next process.

Figure 5.8 Partitioning processes onto processors: processes P0-P11 mapped four at a time onto processors 0, 1, and 2.

Figure 5.9 Multiprocessor system with a line configuration, fed from a host computer.

Figure 5.10 Pipelined addition: each process Pi accumulates the partial sum of the numbers seen so far and passes it on, the last process producing the complete sum.

Figure 5.11 Pipelined addition of numbers with a master process and ring configuration: the master feeds d0, d1, ..., dn−1 into the slaves P0-Pn−1 and receives the sum back around the ring.

Figure 5.12 Pipelined addition of numbers with direct access to slave processes: the master distributes the numbers d0-dn−1 directly to the slaves P0-Pn−1 and collects the sum.

Figure 5.13 Steps in insertion sort with five numbers: the sequence 4, 3, 1, 2, 5 fed into processes P0-P4, sorted after 10 pipeline cycles.
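The pipelined insertion sort of Figures 5.13 and 5.14 can be simulated sequentially: each stage holds the largest number it has seen and passes anything smaller to the next stage, so the stages end up holding the numbers in descending order. A sketch under those assumptions (array and function names are ours):

```c
#include <assert.h>
#include <limits.h>

/* Sequential simulation of the pipelined insertion sort of
   Figures 5.13-5.14.  stage[i] plays the role of process Pi's held
   number (INT_MIN meaning "empty"); each incoming number ripples
   through the stages, every stage keeping the larger of the pair
   and passing the smaller one on. */
void pipeline_insertion_sort(int *stage, const int *x, int n) {
    for (int i = 0; i < n; i++)
        stage[i] = INT_MIN;
    for (int j = 0; j < n; j++) {
        int v = x[j];
        for (int i = 0; i < n && v != INT_MIN; i++) {
            if (v > stage[i]) {          /* keep the larger number */
                int t = stage[i];
                stage[i] = v;            /* pass the smaller one on */
                v = t;
            }
        }
    }
}
```

Feeding in 4, 3, 1, 2, 5 (the sequence of Figure 5.13) leaves stage[] holding 5, 4, 3, 2, 1.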

Figure 5.14 Pipeline for sorting using insertion sort: the series of numbers xn−1 ... x1 x0 enters the pipeline; each stage compares the incoming number with the largest it holds (xmax), keeps the larger, and passes the smaller numbers on.

Figure 5.15 Insertion sort with results returned to the master process using a bidirectional line configuration: the sequence d0, d1, ..., dn−1 is fed in and the sorted sequence returned.

Figure 5.16 Insertion sort with results returned (shown for n = 5): a sorting phase of 2n − 1 cycles followed by the return of the sorted numbers.

Figure 5.17 Pipeline for the sieve of Eratosthenes: the series xn−1 ... x1 x0 enters the pipeline; each stage holds a prime number (the 1st, 2nd, 3rd prime, ...) and passes on only the numbers that are not multiples of it.
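The sieve pipeline of Figure 5.17 can also be simulated sequentially: each stage latches the first number it receives (which is necessarily prime) and forwards only non-multiples of it. A sketch with our own function names:

```c
#include <assert.h>

/* Sequential simulation of the sieve pipeline of Figure 5.17.
   Feeding in 2, 3, ..., limit leaves the primes, in order, in
   prime[]; the return value is how many stages latched a number. */
int sieve_pipeline(int *prime, int nstages, int limit) {
    int used = 0;                        /* stages holding a prime so far */
    for (int x = 2; x <= limit; x++) {
        int v = x;
        for (int i = 0; i < used; i++)
            if (v % prime[i] == 0) {     /* multiple: not passed on */
                v = 0;
                break;
            }
        if (v != 0 && used < nstages)
            prime[used++] = v;           /* next stage latches a new prime */
    }
    return used;
}
```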

Figure 5.18 Solving an upper triangular set of linear equations using a pipeline: P0 computes x0, P1 computes x1, P2 computes x2, P3 computes x3, each passing the x values it has forward.

Figure 5.19 Pipeline processing using back substitution: processes P0-P5 over time, the first value computed being passed onward until each process holds its final computed value.

Figure 5.20 Operations in the back substitution pipeline over time: P0 divides and sends x0; each subsequent process Pi receives the earlier x values, performs multiply/add steps, a final divide/subtract, and sends xi onward.
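Sequentially, the back substitution of Figures 5.18-5.20 solves a triangular system in which equation i involves only x0..xi. A sketch (the fixed row width and names are our illustrative choices):

```c
#include <assert.h>

/* Sequential back substitution for the pipeline of Figures 5.18-5.20,
   for the triangular system  a[i][0]x0 + ... + a[i][i]xi = b[i].
   Process Pi would receive x0..x(i-1), do the multiply/add steps,
   then the divide/subtract step, and send xi on.  The row width 8
   is an illustrative fixed size. */
void back_substitute(int n, double a[][8], const double *b, double *x) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < i; j++)
            sum += a[i][j] * x[j];       /* multiply/add steps */
        x[i] = (b[i] - sum) / a[i][i];   /* divide/subtract step */
    }
}
```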

Figure 5.21 Pipeline for Problem 5-9: four stages, each with inputs x and yin, a coefficient a, and an output yout; the inputs x1-x4 and a1-a4 produce the outputs y1-y4.

Figure 5.22 Audio histogram display of digitized audio input: (a) pipeline solution; (b) direct decomposition.

Figure 6.1 Processes reaching the barrier at different times: each of P0-Pn−1 is active until it reaches the barrier, then waits until all arrive.

Figure 6.2 Library call barriers: each process P0 ... Pn−1 calls Barrier(); processes wait until all reach their barrier call.

Figure 6.3 Barrier using a centralized counter: each process's Barrier() call increments the counter C and checks for n.

Figure 6.4 Barrier implementation in a message-passing system. The master handles an arrival phase followed by a departure phase:

for (i = 0; i < n; i++) recv(Pany);    arrival phase
for (i = 0; i < n; i++) send(Pi);      departure phase

Each slave process, at its barrier, executes:

Barrier: send(Pmaster); recv(Pmaster);
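On a shared-memory machine, the counter barrier of Figure 6.3 is usually written with a mutex and condition variable, with a generation count so the barrier can be reused. This is our POSIX-threads sketch, not the book's code:

```c
#include <assert.h>
#include <pthread.h>

/* Counter barrier in the spirit of Figure 6.3: each arrival
   increments count; the last arrival resets it and advances the
   generation, releasing the waiters.  The generation number makes
   reuse of the barrier safe. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t all_here;
    int count, n, generation;
} barrier_t;

void barrier_init(barrier_t *b, int n) {
    pthread_mutex_init(&b->lock, NULL);
    pthread_cond_init(&b->all_here, NULL);
    b->count = 0;
    b->n = n;
    b->generation = 0;
}

void barrier_wait(barrier_t *b) {
    pthread_mutex_lock(&b->lock);
    int gen = b->generation;
    if (++b->count == b->n) {            /* last arrival */
        b->count = 0;
        b->generation++;
        pthread_cond_broadcast(&b->all_here);
    } else {
        while (gen == b->generation)     /* wait for the departure phase */
            pthread_cond_wait(&b->all_here, &b->lock);
    }
    pthread_mutex_unlock(&b->lock);
}
```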

Figure 6.5 Tree barrier: synchronizing messages from P0-P7 combine up the tree on arrival at the barrier and fan back down on departure.

Figure 6.6 Butterfly construction: three stages of pairwise synchronization among P0-P7 over time.

Figure 6.7 Data parallel computation: the single instruction a[] = a[] + k; causes every processor to execute a[i] = a[i] + k on its own element of a[0] ... a[n-1] simultaneously.

Figure 6.8 Data parallel prefix sum operation on the numbers x0-x15: at step j (j = 0, 1, 2, 3) each element i with i ≥ 2^j adds in the partial sum 2^j places to its left, so after the final step element i holds the sum of x0 through xi.
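The steps of Figure 6.8 can be rendered sequentially as follows (function name ours); iterating downwards within a step mimics the data parallel semantics, in which every addition of a step reads values from before the step:

```c
#include <assert.h>

/* Sequential rendering of the data parallel prefix sum of
   Figure 6.8: at the step with offset 2^j, every element
   i >= 2^j adds in the element 2^j places to its left. */
void prefix_sum(int *x, int n) {
    for (int shift = 1; shift < n; shift *= 2)
        for (int i = n - 1; i >= shift; i--)
            x[i] += x[i - shift];        /* x[i - shift] not yet updated */
}
```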

Figure 6.9 Convergence rate: the computed value at iterations t, t + 1, ... approaches the exact value, the error shrinking with each iteration.

Figure 6.10 Allgather operation: each of processes 0 to n − 1 calls Allgather(), contributing xi from its send buffer; every process ends with x0 ... xn−1 in its receive buffer.

Figure 6.11 Effects of computation and communication in Jacobi iteration: execution time (τ = 1, scale to 2 × 10^6) against the number of processors p (up to 32), showing the computation, communication, and overall curves.

Figure 6.12 Heat distribution problem: on the metal plate (enlarged view), point hi,j is updated from its four neighbours hi−1,j, hi+1,j, hi,j−1, and hi,j+1.
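The update of Figure 6.12 replaces each interior point by the average of its four neighbours. One Jacobi iteration as a C sketch (the fixed grid size N is our illustrative choice; boundary points are left to the caller):

```c
#include <assert.h>

/* One Jacobi iteration for the heat distribution problem of
   Figure 6.12: each interior point becomes the average of its four
   neighbours, written into a separate array hnew. */
#define N 8
void jacobi_step(double h[N][N], double hnew[N][N]) {
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            hnew[i][j] = 0.25 * (h[i - 1][j] + h[i + 1][j] +
                                 h[i][j - 1] + h[i][j + 1]);
}
```

In the partitioned versions, each process applies this update to its own block or strip after exchanging edge points with its neighbours.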

Figure 6.13 Natural ordering of the heat distribution problem: points x1 ... xk2 are numbered row by row in rows of k, so xi has neighbours xi−1, xi+1, xi−k, and xi+k.

Figure 6.14 Message passing for the heat distribution problem. Each process at row i, column j executes:

send(g, Pi-1,j); send(g, Pi+1,j); send(g, Pi,j-1); send(g, Pi,j+1);
recv(w, Pi-1,j); recv(x, Pi+1,j); recv(y, Pi,j-1); recv(z, Pi,j+1);

Figure 6.15 Partitioning the heat distribution problem among processes P0-Pp−1: blocks versus strips (columns).

Figure 6.16 Communication consequences of partitioning an n × n array: square blocks versus strips of width n/p.

Figure 6.17 Startup times for block and strip partitions: plotted against the number of processors p (log scale), tstartup values around 1000-2000 separate the region where the strip partition is best from the region where the block partition is best.

Figure 6.18 Configuring the array into contiguous rows for each process, with ghost points: one row of points is copied between the array held by process i and the array held by process i+1.

Figure 6.19 Room for Problem 6-14: a 10 ft × 10 ft room with a 4 ft feature, and boundaries at 20°C and 100°C.

Figure 6.20 Road junction for Problem 6-16, with a vehicle shown.

Figure 6.21 Figure for Problem 6-23: airflow over an object, actual dimensions selected at will.

Figure 7.1 Load balancing across processors P0-P5: (a) imperfect load balancing leading to increased execution time t; (b) perfect load balancing.

Figure 7.2 Centralized work pool: the master process keeps a queue of tasks; slave "worker" processes request tasks and are sent them, possibly submitting new tasks.

Figure 7.3 A distributed work pool: the master Pmaster distributes initial tasks among slave processes M0 ... Mn−1, each of which holds its own pool.

Figure 7.4 Decentralized work pool: processes exchange requests and tasks directly with one another.

Figure 7.5 Decentralized selection algorithm requesting tasks between slaves: slaves Pi and Pj each use a local selection algorithm to decide where to direct their requests.

Figure 7.6 Load balancing using a pipeline structure: a master process feeds tasks to P0, which passes them along the line P1 ... Pn−1.

Figure 7.7 Using a communication process in line load balancing: each stage pairs a task process Ptask with a communication process Pcomm; Pcomm requests a task if its buffer is empty and passes a task on if its buffer is full, while Ptask requests a task from the buffer when free.

Figure 7.8 Load balancing using a tree: tasks pass from P0 down through P1-P6 when requested.

Figure 7.9 Termination using message acknowledgments: a process becomes active on receiving its first task from its parent, acknowledges tasks received from other processes, and sends its final acknowledgment to its parent when it becomes inactive.

Figure 7.10 Ring termination detection algorithm: a token is passed from each of P0 ... Pn−1 to the next processor when its local termination condition has been reached.

Figure 7.11 Process algorithm for local termination: the incoming token is ANDed with the process's own terminated condition before being forwarded.

Figure 7.12 Passing a task to previous processes: a task sent from process Pi back to an earlier process Pj in the line P0 ... Pn−1.

Figure 7.13 Tree termination: terminated conditions are ANDed together up the tree.

Figure 7.14 Climbing a mountain: from the base camp A to the summit F, with possible intermediate camps B, C, D, and E.

Figure 7.15 Graph of mountain climb: vertices A-F connected by weighted edges (weights 10, 13, 17, 51, 8, 24, 9, and 14).

Figure 7.16 Representing a graph: (a) an adjacency matrix with source rows and destination columns A-F, holding the edge weights (10, 13, 17, 51, 8, 24, 9, 14) and ∞ where there is no edge; (b) an adjacency list, each source vertex A-F heading a NULL-terminated list of (destination, weight) pairs.

Figure 7.17 Moore's shortest-path algorithm: vertex i, with distance di, updates vertex j across the edge of weight wi,j so that dj becomes min(dj, di + wi,j).
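The relaxation step of Figure 7.17, applied repeatedly until no distance changes, gives the shortest distances. A sequential C sketch over an adjacency matrix (our names and conventions; the parallel versions give each vertex or slave its own row):

```c
#include <assert.h>

/* Moore's relaxation step (Figure 7.17), repeated to a fixed point:
   dj = min(dj, di + w[i][j]).  A weight of 0 means "no edge";
   INF marks a vertex not yet reached.  The matrix width 6 is an
   illustrative fixed size. */
#define INF 1000000
void moore_shortest_paths(int n, int w[][6], int src, int *d) {
    for (int j = 0; j < n; j++)
        d[j] = INF;
    d[src] = 0;
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (w[i][j] != 0 && d[i] + w[i][j] < d[j]) {
                    d[j] = d[i] + w[i][j];   /* shorter path through i */
                    changed = 1;
                }
    }
}
```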

Figure 7.18 Distributed graph search: the master process starts the search at the source vertex; processes A, B, C, ... each hold one vertex with its weights w[] and current distance dist, and send new distances to the other processes.

Figure 7.19 Sample maze for Problem 7-9, showing the entrance, exit, and a search path.

Figure 7.20 Plan of rooms for Problem 7-10, showing the entrance and the gold.

Figure 7.21 Graph representation for Problem 7-10: rooms A and B joined by a door.

Figure 8.1 Shared memory multiprocessor using a single bus: processors, each with a cache, and memory modules all attached to the bus.

a. Brinch Hansen, P. (1975), "The Programming Language Concurrent Pascal," IEEE Trans. Software Eng., Vol. 1, No. 2 (June), pp. 199–207.

b. U.S. Department of Defense (1981), "The Programming Language Ada Reference Manual," Lecture Notes in Computer Science, No. 106, Springer-Verlag, Berlin.

c. Bräunl, T., R. Norz (1992), Modula-P User Manual, Computer Science Report, No. 5/92 (August), Univ. Stuttgart, Germany.

d. Thinking Machines Corp. (1990), C* Programming Guide, Version 6, Thinking Machines System Documentation.

e. Gehani, N., and W. D. Roome (1989), The Concurrent C Programming Language, Silicon Press, New Jersey.

f. Fox, G., S. Hiranandani, K. Kennedy, C. Koelbel, U. Kremer, C. Tseng, and M. Wu (1990), Fortran D Language Specification, Technical Report TR90-141, Dept. of Computer Science, Rice University.

TABLE 8.1 SOME EARLY PARALLEL PROGRAMMING LANGUAGES

Language           Originator/date                Comments
Concurrent Pascal  Brinch Hansen, 1975a           Extension to Pascal
Ada                U.S. Dept. of Defense, 1979b   Completely new language
Modula-P           Bräunl, 1986c                  Extension to Modula 2
C*                 Thinking Machines, 1987d       Extension to C for SIMD systems
Concurrent C       Gehani and Roome, 1989e        Extension to C
Fortran D          Fox et al., 1990f              Extension to Fortran for data parallel programming

Figure 8.2 FORK-JOIN construct: the main program FORKs spawned processes, which later JOIN it.

Figure 8.3 Differences between a process and threads: (a) a process has code, heap, files, interrupt routines, and a single stack and instruction pointer (IP); (b) threads within a process share the code, heap, files, and interrupt routines, but each thread has its own stack and IP.

Figure 8.4 pthread_create() and pthread_join(). The main program executes

pthread_create(&thread1, NULL, proc1, &arg);
...
pthread_join(thread1, *status);

while thread1 runs

proc1(&arg)
{
    ...
    return(*status);
}
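A complete, compilable rendering of the Figure 8.4 pattern (the wrapper name run_thread_demo() is ours, and here the argument doubles as the returned status):

```c
#include <assert.h>
#include <pthread.h>

/* pthread_create()/pthread_join() pattern of Figure 8.4: the main
   thread spawns thread1 running proc1() on &arg, then joins and
   reads the pointer that proc1() returned. */
static void *proc1(void *arg) {
    int *x = (int *)arg;
    *x = *x + 1;                  /* the thread's work on its argument */
    return arg;                   /* becomes the joiner's status value */
}

int run_thread_demo(int start) {
    pthread_t thread1;
    void *status;
    int arg = start;
    pthread_create(&thread1, NULL, proc1, &arg);
    pthread_join(thread1, &status);
    return *(int *)status;        /* start + 1, computed by the thread */
}
```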

Figure 8.5 Detached threads: the main program issues pthread_create() calls; each detached thread runs to its own termination, with no join.

Figure 8.6 Conflict in accessing a shared variable: process 1 and process 2 each read shared variable x, add 1, and write the result back.

Figure 8.7 Control of critical sections through busy waiting. Process 1 and process 2 each execute

while (lock == 1) do_nothing;
lock = 1;
    critical section
lock = 0;

Figure 8.8 Deadlock (deadly embrace): (a) two-process deadlock between processes P1, P2 and resources R1, R2; (b) n-process deadlock among P1 ... Pn and R1 ... Rn.

Figure 8.9 False sharing in caches: processor 1 and processor 2 each cache the same eight-word block of main memory, selected by the address tag.

Figure 8.10 Shared memory locations for the Section 8.4.1 program example: array a[], sum, and addr.

Figure 8.11 Shared memory locations for the Section 8.4.2 program example: array a[], global_index, sum, and addr.

Figure 8.12 Sample logic circuit: inputs Test1, Test2, and Test3 feed gates 1, 2, and 3, producing Output1 and Output2.

TABLE 8.2 LOGIC CIRCUIT DESCRIPTION FOR FIGURE 8.12

Gate  Function  Input 1  Input 2  Output
1     AND       Test1    Test2    Gate1
2     NOT       Gate1    -        Output1
3     OR        Test3    Gate1    Output2

Figure 8.13 River and frog for Problem 8-23: logs moving on the river, with the frog.

Figure 8.14 Thread pool for Problem 8-24: the master keeps a pool of slave threads; a request signals a thread, which services the request.

Figure 9.1 Finding the rank in parallel: a[i] is compared with every element a[0] ... a[n-1], the counter x is incremented for each element found to be smaller, and finally b[x] = a[i].

Figure 9.2 Parallelizing the rank computation: the 0/1 results of comparing a[i] with a[0] ... a[3] are added in a tree, giving 0/1/2 partial counts and finally a rank in the range 0 to 4.
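Rank sort as in Figure 9.1 can be written sequentially as below (the tie-breaking rule for duplicates is our addition, so equal numbers get distinct ranks); each rank computation is independent, which is exactly what the compare/add tree of Figure 9.2 parallelizes:

```c
#include <assert.h>

/* Rank sort of Figure 9.1: the rank of a[i] is the number of
   elements smaller than it (counting earlier equal elements too),
   and b[rank] = a[i] places it directly into position. */
void rank_sort(const int *a, int *b, int n) {
    for (int i = 0; i < n; i++) {
        int x = 0;                       /* the counter x of Figure 9.1 */
        for (int j = 0; j < n; j++)
            if (a[j] < a[i] || (a[j] == a[i] && j < i))
                x++;
        b[x] = a[i];
    }
}
```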

Figure 9.3 Rank sort using a master andslaves.

a[] b[]

Slaves

Master

Readnumbers

Place selectednumber

Figure 9.4 Compare and exchange on a message-passing system — Version 1.

[Sequence of steps: (1) P1 sends A to P2; (2) P2 compares: if A > B it keeps A (load A) and sends B back, else it keeps B and sends A back; (3) P1 loads the returned, smaller value.]

Figure 9.5 Compare and exchange on a message-passing system — Version 2.

[Sequence of steps: (1) P1 sends A to P2 while (2) P2 sends B to P1; (3) both perform the same comparison: P1 keeps the smaller value (if A > B load B) and P2 keeps the larger (if A > B load A).]
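Version 2 can be captured in a few lines: both partners exchange values and then perform the same comparison, so no second message is needed. A minimal illustrative sketch (names are mine), modelling the two processes' decisions:

```python
def compare_exchange(A, B):
    """Version 2 compare-exchange: P1 holds A, P2 holds B.  After the
    simultaneous exchange, P1 keeps the smaller value and P2 the larger."""
    p1 = B if A > B else A        # P1: if A > B load B, else keep A
    p2 = A if A > B else B        # P2: if A > B load A, else keep B
    return p1, p2
```

In a real message-passing program each line after the exchange runs on a different process; the point is that both reach consistent decisions from the same comparison.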

Figure 9.6 Merging two sublists — Version 1.

[One process sends its sorted sublist (42, 43, 80, 98) to the other, which merges it with its own sublist (25, 28, 50, 88), keeps the higher numbers (50, 80, 88, 98), and returns the lower numbers (25, 28, 42, 43).]

Figure 9.7 Merging two sublists — Version 2.

[Both processes exchange their sorted sublists ((42, 43, 80, 98) and (25, 28, 50, 88)); each performs the same merge, P1 keeping the lower half (25, 28, 42, 43) and P2 the higher half (50, 80, 88, 98) as its final numbers.]

Figure 9.8 Steps in bubble sort.

[Original sequence: 4 2 7 8 5 1 3 6. Phase 1 places the largest number (8) at the end, phase 2 the next largest number, and so on.]

Figure 9.9 Overlapping bubble sort actions in a pipeline.

[Phase 2 can begin before phase 1 finishes: once a phase's leading comparison has passed a position, the next phase's comparisons can follow behind it, so phases 1, 2, 3, 4, ... overlap in time.]

Figure 9.10 Odd-even transposition sort sorting eight numbers.

[Eight processes P0-P7, one number each; starting sequence 4 2 7 8 5 1 3 6. Steps 0 to 7 alternate compare-exchanges between even-odd and odd-even neighbouring pairs, yielding 1 2 3 4 5 6 7 8.]
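A sequential model of odd-even transposition sort, with one list element standing in for each process (illustrative Python; whether the first phase pairs even-odd or odd-even indices is only a convention, since n phases suffice either way):

```python
def odd_even_transposition_sort(a):
    """n elements, n phases: even phases compare-exchange pairs
    (0,1), (2,3), ...; odd phases pairs (1,2), (3,4), ...."""
    a = list(a)
    n = len(a)
    for step in range(n):
        start = 0 if step % 2 == 0 else 1
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:                    # compare-exchange
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```

All pairs within one phase are disjoint, so on n processes each phase is a single parallel compare-exchange step.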

Figure 9.11 Snakelike sorted list.

[The smallest number at one end of the snake, the largest at the other.]

Figure 9.12 Shearsort.

(a) Original placement of numbers:
 4 14  8  2
10  3 13 16
 7 15  1  5
12  6 11  9

(b) Phase 1 — Row sort:
 2  4  8 14
16 13 10  3
 1  5  7 15
12 11  9  6

(c) Phase 2 — Column sort:
 1  4  7  3
 2  5  8  6
12 11  9 14
16 13 10 15

(d) Phase 3 — Row sort:
 1  3  4  7
 8  6  5  2
 9 11 12 14
16 15 13 10

(e) Phase 4 — Column sort:
 1  3  4  2
 8  6  5  7
 9 11 12 10
16 15 13 14

(f) Final phase — Row sort:
 1  2  3  4
 8  7  6  5
 9 10 11 12
16 15 14 13
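Shearsort can be modelled directly from the figure: alternate snakelike row sorts (even rows ascending, odd rows descending) with ascending column sorts, for ceil(log2 n) + 1 row phases on an n x n grid, ending with a row phase. An illustrative sketch (function name mine):

```python
import math

def shearsort(grid):
    """Shearsort on an n x n grid, ending in snakelike sorted order."""
    n = len(grid)
    grid = [list(row) for row in grid]
    row_phases = int(math.ceil(math.log2(n))) + 1
    for p in range(row_phases):
        for i, row in enumerate(grid):          # row phase, snakelike
            row.sort(reverse=(i % 2 == 1))
        if p < row_phases - 1:                  # column phase, ascending
            for j in range(n):
                col = sorted(grid[i][j] for i in range(n))
                for i in range(n):
                    grid[i][j] = col[i]
    return grid
```

Running it on the 4 x 4 grid of Figure 9.12(a) reproduces the snakelike result of panel (f).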

Figure 9.13 Using the transpose operation to maintain operations in rows.

(a) Operations between elements in rows. (b) Transpose operation. (c) Operations between elements in rows (originally columns).

Figure 9.14 Mergesort using tree allocation of processes.

[The unsorted list 4 2 7 8 5 1 3 6 is divided down a tree among processes P0-P7, then merged pairwise back up the tree (P0/P4, P0/P2, ...) until P0 holds the sorted list 1 2 3 4 5 6 7 8.]

Figure 9.15 Quicksort using tree allocation of processes.

[Unsorted list 4 2 7 8 5 1 3 6 with pivot 3: numbers up to the pivot stay with P0 and the rest pass to P4, the partitioning recursing down the tree (P0, P2, P4, P6, ...) until the sorted list 1 2 3 4 5 6 7 8 is obtained.]

Figure 9.16 Quicksort showing pivot withheld in processes.

[Unsorted list 4 2 7 8 5 1 3 6: the first pivot (4) is withheld by its process while the remainder is partitioned into 3 2 1 and 5 7 8 6; subsequent pivots are likewise withheld at each level, so the pivots themselves need not move again.]

Figure 9.17 Work pool implementation of quicksort.

[Slave processes request sublists from a work pool and return the sublists they generate.]

Figure 9.18 Hypercube quicksort algorithm when the numbers are originally in node 000.

(a) Phase 1: nodes 000-111. (b) Phase 2: the list is split by pivot p1 (≤ p1 / > p1). (c) Phase 3: further splits by pivots p2 and p3 (≤ p2 / > p2, ≤ p3 / > p3), and then by p4-p7.

Figure 9.19 Hypercube quicksort algorithm when numbers are distributed among nodes.

(a) Phase 1: pivot p1 is broadcast to all nodes 000-111. (b) Phase 2: pivots p2 and p3 are broadcast within the two halves (≤ p1 / > p1). (c) Phase 3: pivots p4-p7 are broadcast within the quarters (≤ p2 / > p2, ≤ p3 / > p3).

Figure 9.20 Hypercube quicksort communication.

(a) Phase 1 communication. (b) Phase 2 communication. (c) Phase 3 communication. [Each phase exchanges data across one dimension of the three-dimensional hypercube of nodes 000-111.]

Figure 9.21 Quicksort hypercube algorithm with Gray code ordering.

[As Figure 9.19, but the nodes are arranged in Gray code order 000, 001, 011, 010, 110, 111, 101, 100, so that communicating nodes are physically adjacent.]

Figure 9.22 Odd-even merging of two sorted lists.

[Sorted lists a[] = (2, 4, 5, 8) and b[] = (1, 3, 6, 7): the even-indexed elements are merged into c[] and the odd-indexed elements into d[]; a final row of compare-and-exchange operations produces the final sorted list e[] = 1 2 3 4 5 6 7 8.]
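Batcher's odd-even merge of Figures 9.22-9.23 can be sketched recursively for two sorted lists of equal power-of-two length (illustrative Python; names are mine):

```python
def odd_even_merge(a, b):
    """Merge two sorted lists of equal power-of-2 length: recursively
    merge the even-indexed and odd-indexed elements, then fix up the
    interleaving with one row of compare-exchanges."""
    if len(a) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    c = odd_even_merge(a[0::2], b[0::2])   # merge even-indexed elements
    d = odd_even_merge(a[1::2], b[1::2])   # merge odd-indexed elements
    e = [c[0]]
    for i in range(len(d) - 1):            # compare-exchange d[i], c[i+1]
        e += [min(d[i], c[i + 1]), max(d[i], c[i + 1])]
    e.append(d[-1])
    return e
```

The final loop is the single row of compare-and-exchange operations shown in the figure; every comparison in it is independent, so it takes one parallel step.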

Figure 9.23 Odd-even mergesort.

[Inputs a1 ... an and b1 ... bn: the even-indexed elements feed an "even mergesort" and the odd-indexed elements an "odd mergesort"; a final row of compare-and-exchange operations produces c1, c2, ..., c2n.]

Figure 9.24 Bitonic sequences.

(a) Single maximum: the values a0, a1, a2, a3, ..., an−2, an−1 rise and then fall. (b) Single maximum and single minimum.

Figure 9.25 Creating two bitonic sequences from one bitonic sequence.

[Bitonic sequence 3 5 8 9 7 4 2 1: compare-and-exchange of a[i] with a[i+4] gives 3 4 2 1 and 7 5 8 9, each itself bitonic, with every element of the first no greater than every element of the second.]

Figure 9.26 Sorting a bitonic sequence.

[Unsorted numbers 3 5 8 9 7 4 2 1 → 3 4 2 1 7 5 8 9 → 2 1 3 4 7 5 8 9 → sorted list 1 2 3 4 5 7 8 9, halving the compare-and-exchange distance at each step.]

Figure 9.27 Bitonic mergesort.

[Unsorted numbers pass through bitonic sorting operations with alternating directions of increasing numbers, producing a sorted list.]

Figure 9.28 Bitonic mergesort on eight numbers.

Compare and exchange ai with ai+n/2 (n numbers); lower/higher = bitonic list [Fig. 9.24(a) or (b)].

Step 1 (sort pairs, n = 2, ai with ai+1): 8 3 4 7 9 2 1 5 → 3 8 7 4 2 9 5 1
Steps 2-3 (form bitonic lists of four numbers: split n = 4, ai with ai+2; sort n = 2, ai with ai+1): → 3 4 7 8 5 9 2 1 → 3 4 7 8 9 5 2 1
Steps 4-6 (form and sort bitonic list of eight numbers: split n = 8, ai with ai+4; split n = 4, ai with ai+2; sort n = 2, ai with ai+1): → 3 4 2 1 9 5 7 8 → 2 1 3 4 7 5 9 8 → 1 2 3 4 5 7 8 9
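The bitonic mergesort of Figures 9.26-9.28 in recursive form (illustrative Python, names mine; assumes a power-of-two length). Sorting the two halves in opposite directions forms a bitonic sequence, which the merge then sorts:

```python
def bitonic_sort(a, ascending=True):
    """Sort each half in opposite directions to form a bitonic
    sequence, then merge it into sorted order."""
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    first = bitonic_sort(a[:mid], True)
    second = bitonic_sort(a[mid:], False)
    return bitonic_merge(first + second, ascending)

def bitonic_merge(a, ascending):
    """Sort a bitonic sequence: compare-exchange a[i] with a[i+n/2],
    then recurse on the two (still bitonic) halves."""
    if len(a) <= 1:
        return list(a)
    mid = len(a) // 2
    a = list(a)
    for i in range(mid):
        if (a[i] > a[i + mid]) == ascending:
            a[i], a[i + mid] = a[i + mid], a[i]
    return (bitonic_merge(a[:mid], ascending) +
            bitonic_merge(a[mid:], ascending))
```

On the figure's input 8 3 4 7 9 2 1 5 this performs exactly the six compare-and-exchange steps shown.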

Figure 9.29 Compare-and-exchange algorithm for Problem 9-5.

[Lists at successive steps: (25, 28, 50, 88) with (42, 43, 80, 98); (25, 28, 42, 43) with (50, 80, 88, 98); (25, 28, 42, 50) with (43, 80, 88, 98). Terminates when insertions occur at the top/bottom of the lists.]

Figure 10.1 An n × m matrix.

[Elements a0,0 ... an−1,m−1, the first subscript giving the row and the second the column.]

Figure 10.2 Matrix multiplication, C = A × B.

[Element ci,j is formed by multiplying row i of A by column j of B and summing the results.]
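The row-by-column computation of Figure 10.2, written out sequentially (illustrative Python; a parallel version would assign elements or blocks of C to separate processes, since every ci,j is independent):

```python
def mat_mul(A, B):
    """C = A x B: c[i][j] is the inner product of row i of A
    with column j of B (multiply, then sum the results)."""
    n, p, m = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(p))
             for j in range(m)]
            for i in range(n)]
```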

Figure 10.3 Matrix-vector multiplication, c = A × b.

[Element ci is the sum of the elementwise products of row i of A with the vector b.]

Figure 10.4 Block matrix multiplication.

[C = A × B computed block by block: block (p, q) of C is the sum over intermediate blocks of the products of the corresponding blocks of A and B.]

Figure 10.5 Submatrix multiplication.

(a) Matrices: 4 × 4 matrices A (elements a0,0 ... a3,3) and B (elements b0,0 ... b3,3), each partitioned into 2 × 2 submatrices.

(b) Multiplying A0,0 × B0,0 + A0,1 × B1,0 to obtain C0,0:

| a0,0b0,0+a0,1b1,0   a0,0b0,1+a0,1b1,1 |   | a0,2b2,0+a0,3b3,0   a0,2b2,1+a0,3b3,1 |
| a1,0b0,0+a1,1b1,0   a1,0b0,1+a1,1b1,1 | + | a1,2b2,0+a1,3b3,0   a1,2b2,1+a1,3b3,1 |

= | a0,0b0,0+a0,1b1,0+a0,2b2,0+a0,3b3,0   a0,0b0,1+a0,1b1,1+a0,2b2,1+a0,3b3,1 |
  | a1,0b0,0+a1,1b1,0+a1,2b2,0+a1,3b3,0   a1,0b0,1+a1,1b1,1+a1,2b2,1+a1,3b3,1 |

= C0,0

Figure 10.6 Direct implementation of matrix multiplication.

[Processor Pi,j holds row i of A (a[i][]) and column j of B (b[][j]) and computes c[i][j].]

Figure 10.7 Accumulation using a tree construction.

[c0,0 = a0,0b0,0 + a0,1b1,0 + a0,2b2,0 + a0,3b3,0: the four products are formed by P0-P3 and added pairwise up a tree (P0 and P2, then P0).]

Figure 10.8 Submatrix multiplication and summation.

[Submatrices App, Apq, Aqp, Aqq and Bpp, Bpq, Bqp, Bqq: the eight submatrix products are computed by processes P0-P7 and summed pairwise (P0 + P1, P2 + P3, P4 + P5, P6 + P7) to give the blocks Cpp, Cpq, Cqp, Cqq.]

Figure 10.9 Movement of A and B elements.

[Elements of A move along rows and elements of B along columns through processor Pi,j.]

Figure 10.10 Step 2 — Alignment of elements of A and B.

[Row i of A is shifted i places so that ai,j+i reaches Pi,j, and column j of B is shifted j places so that bi+j,j reaches Pi,j.]

Figure 10.11 Step 4 — One-place shift of elements of A and B.

[After each accumulation step, the elements of A shift one place along the rows and the elements of B one place along the columns of processors Pi,j.]

Figure 10.12 Matrix multiplication using a systolic array.

[Elements ai,j are pumped in along the rows, with a one-cycle delay between successive rows, and elements bi,j down the columns; each cell accumulates ci,j (c0,0 ... c3,3).]

Figure 10.13 Matrix-vector multiplication using a systolic array.

[Rows of A (ai,0 ... ai,3) are pumped through the cells while b0-b3 enter the array; the cells accumulate c0-c3.]

Figure 10.14 Gaussian elimination.

[Column i below row i is cleared to zero (columns to the left are already cleared); for each row j below row i, the algorithm steps through element aji.]
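The elimination pattern of Figure 10.14, followed by back substitution, as a sequential sketch (illustrative Python, names mine; no pivoting, so a nonzero diagonal is assumed, matching the simplified presentation here):

```python
def gaussian_elimination(a, b):
    """Solve a x = b by forward elimination and back substitution.
    For each row i, every row j > i subtracts the multiple of row i
    that clears a[j][i]."""
    n = len(a)
    a = [row[:] for row in a]
    b = b[:]
    for i in range(n):
        for j in range(i + 1, n):
            m = a[j][i] / a[i][i]          # assumes a[i][i] != 0
            for k in range(i, n):
                a[j][k] -= m * a[i][k]
            b[j] -= m * b[i]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):         # back substitution
        x[i] = (b[i] - sum(a[i][k] * x[k]
                           for k in range(i + 1, n))) / a[i][i]
    return x
```

The inner j-loop for a given i is what the parallel versions distribute: every row j > i can be updated independently once row i is known (hence the broadcast of Figure 10.15).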

Figure 10.15 Broadcast in parallel implementation of Gaussian elimination.

[The ith row (n − i + 1 elements, including b[i]) is broadcast to the rows below; the earlier columns are already cleared to zero.]

Figure 10.16 Pipeline implementation of Gaussian elimination.

[Rows are broadcast through a pipeline of processes P0, P1, P2, ..., Pn−1.]

Figure 10.17 Strip partitioning.

[Rows 0 to n/p − 1 go to P0, rows n/p to 2n/p − 1 to P1, and so on (boundaries at n/p, 2n/p, 3n/p) for p = 4 processes P0-P3.]

Figure 10.18 Cyclic partitioning to equalize workload.

[Rows are dealt out to P0, P1, ... in round-robin fashion, so each process holds rows scattered across the whole matrix and the shrinking active region is shared evenly.]

Figure 10.19 Finite difference method.

[The solution space for f(x, y) is discretized into points spaced Δ apart in x and y.]

Figure 10.20 Mesh of points numbered in natural order.

[Points x1 ... x100 in a 10 × 10 mesh, numbered row by row; the outermost points are boundary points (see text).]

Figure 10.21 Sparse matrix for Laplace's equation.

[A × x = 0, for x = (x1, x2, ..., xN−1, xN): the ith equation has −4 on the diagonal (ai,i) and 1s at ai,i−n, ai,i−1, ai,i+1, and ai,i+n; the right-hand side incorporates the boundary values, which also introduce some zero entries (see text). Equations with a boundary point on the diagonal are unnecessary for the solution.]

Figure 10.22 Gauss-Seidel relaxation with natural order, computed sequentially.

[Each point is computed from its neighbours in the sequential order of computation, so freshly updated values are used as soon as they are available.]

Figure 10.23 Red-black ordering.

[Mesh points are coloured alternately red and black, as on a checkerboard.]
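A sketch of red-black relaxation on a square grid with fixed boundary values (illustrative Python, names mine). All points of one colour depend only on points of the other colour, so within a sweep each colour class can be updated fully in parallel:

```python
def red_black_relax(u, iterations):
    """Red-black Gauss-Seidel for Laplace's equation: update all
    'red' points ((i + j) even) from black neighbours, then all
    'black' points from the freshly updated red ones.  The boundary
    rows and columns of u are held fixed."""
    n = len(u)
    for _ in range(iterations):
        for parity in (0, 1):                  # 0 = red, 1 = black
            for i in range(1, n - 1):
                for j in range(1, n - 1):
                    if (i + j) % 2 == parity:
                        u[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                          u[i][j-1] + u[i][j+1])
    return u
```

With boundary values taken from a linear function, the interior converges to that same linear function, which gives a simple correctness check.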


Figure 10.24 Nine-point stencil.

Figure 10.25 Multigrid processor allocation.

[The coarsest grid points coincide with a subset of the finer grid points held by each processor.]

Figure 10.26 Printed circuit board for Problem 10-18.

[Heat sources at 40°C, 50°C, and 60°C on the board; ambient temperature at the edges of the board = 20°C.]

Figure 11.1 Pixmap.

[A grid of picture elements (pixels) p(i, j), indexed from the origin (0, 0).]

Figure 11.2 Image histogram.

[Number of pixels plotted against gray level, 0 to 255.]

Figure 11.3 Pixel values for a 3 × 3 group.

x0 x1 x2
x3 x4 x5
x6 x7 x8

[x4 is the centre pixel.]

Figure 11.4 Four-step data transfer for the computation of mean.

Step 1: each pixel adds the pixel from its left. Step 2: each pixel adds the pixel from its right. Step 3: each pixel adds the pixel from above. Step 4: each pixel adds the pixel from below.

Figure 11.5 Parallel mean data accumulation.

(a) Step 1: partial sums x0 + x1, x3 + x4, x6 + x7 are formed. (b) Step 2: the row sums x0 + x1 + x2, x3 + x4 + x5, x6 + x7 + x8 are completed. (c) Step 3 and (d) Step 4: the three row sums are accumulated vertically into the centre.

Figure 11.6 Approximate median algorithm requiring six steps.

[The largest in a row, the next largest in the row, and the next largest in a column are eliminated in successive steps.]

Figure 11.7 Using a 3 × 3 weighted mask.

[Mask weights w0 ... w8 are applied (⊗) to pixels x0 ... x8; the result x4′ replaces the centre pixel.]
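Applying a 3 × 3 weighted mask as in Figure 11.7 (illustrative sketch, names mine; border pixels are simply left unchanged here, one of several common boundary policies):

```python
def apply_mask(image, w, k):
    """Each interior pixel becomes k * sum(w[m][n] * neighbour) over
    its 3x3 neighbourhood; w is the mask, k the scale factor."""
    h, wid = len(image), len(image[0])
    out = [row[:] for row in image]
    for i in range(1, h - 1):
        for j in range(1, wid - 1):
            s = sum(w[m][n] * image[i + m - 1][j + n - 1]
                    for m in range(3) for n in range(3))
            out[i][j] = k * s
    return out
```

With all weights 1 and k = 1/9 this computes the mean mask of Figure 11.8; other choices of w and k give the noise-reduction and sharpening masks that follow.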

Figure 11.8 Mask to compute mean.

k = 1/9, with mask
1 1 1
1 1 1
1 1 1

Figure 11.9 A noise reduction mask.

k = 1/16, with mask
1 1 1
1 8 1
1 1 1

Figure 11.10 High-pass sharpening filter mask.

k = 1/9, with mask
−1 −1 −1
−1  8 −1
−1 −1 −1

Figure 11.11 Edge detection using differentiation.

[An intensity transition produces a peak in the first derivative and a zero crossing in the second derivative.]

Figure 11.12 Gray level gradient and direction.

[For image f(x, y), the gradient is normal to the lines of constant intensity, at angle φ.]

Figure 11.13 Prewitt operator.

−1 −1 −1        −1 0 1
 0  0  0        −1 0 1
 1  1  1        −1 0 1

Figure 11.14 Sobel operator.

−1 −2 −1        −1 0 1
 0  0  0        −2 0 2
 1  2  1        −1 0 1
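The two Sobel masks of Figure 11.14 applied to an image, using the common |Gx| + |Gy| approximation to the gradient magnitude (illustrative sketch, names mine; border pixels are skipped):

```python
def sobel_magnitude(image):
    """Approximate gradient magnitude |G| ~ |Gx| + |Gy| from the
    two 3x3 Sobel masks applied at each interior pixel."""
    gx = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
    gy = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            sx = sum(gx[m][n] * image[i + m - 1][j + n - 1]
                     for m in range(3) for n in range(3))
            sy = sum(gy[m][n] * image[i + m - 1][j + n - 1]
                     for m in range(3) for n in range(3))
            out[i][j] = abs(sx) + abs(sy)
    return out
```

A sharp vertical edge gives a strong Gx response and zero Gy, which is easy to check on a tiny synthetic image.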

Figure 11.15 Edge detection with Sobel operator.

(a) Original image (Annabel). (b) Effect of Sobel operator.

Figure 11.16 Laplace operator.

 0 −1  0
−1  4 −1
 0 −1  0

Figure 11.17 Pixels used in Laplace operator.

[The centre pixel x4 with its upper (x1), left (x3), right (x5), and lower (x7) neighbours.]


Figure 11.18 Effect of Laplace operator.

Figure 11.19 Mapping a line into (a, b) space.

(a) The (x, y) plane: a line y = ax + b through pixel (x1, y1). (b) Parameter space: the pixel maps to the line b = −x1a + y1; the common intersection point (a, b) of the pixels' lines identifies the image line.

Figure 11.20 Mapping a line into (r, θ) space.

(a) The (x, y) plane: the line y = ax + b has normal parameters r and θ, with r = x cos θ + y sin θ. (b) The (r, θ) plane: the line maps to the point (r, θ).

Figure 11.21 Normal representation using image coordinate system.

[r and θ measured in the image's x, y coordinate system.]

Figure 11.22 Accumulators, acc[r][θ], for the Hough transform.

[A two-dimensional array of accumulators indexed by r (0, 5, 10, 15, ...) and θ (0°, 10°, 20°, 30°, ...).]

Figure 11.23 Two-dimensional DFT.

[Pixels xjk: transform the rows to obtain Xjm, then transform the columns to obtain Xlm.]

Figure 11.24 Convolution using Fourier transforms.

(a) Direct convolution: image fj,k convolved with filter/image gj,k gives hj,k. (b) Using Fourier transform: transform f(j, k) and g(j, k) to F(j, k) and G(j, k), multiply to obtain H(j, k), then inverse transform to h(j, k).

Figure 11.25 Master-slave approach for implementing the DFT directly.

[A master process distributes work; slave processes compute X[0], X[1], ..., X[n−1] using the factors w0, w1, ..., wn−1.]

Figure 11.26 One stage of a pipeline implementation of DFT algorithm.

[Process j receives a running sum X[k] and a factor a: it adds a × x[j] to X[k] and multiplies a by wk, passing both on as the values for the next iteration.]

Figure 11.27 Discrete Fourier transform with a pipeline.

(a) Pipeline structure: stages P0, P1, P2, ..., PN−1 hold x[0], x[1], ..., x[N−1]; the running sum X[k] starts at 0 and the factor a at 1, with wk fed through the pipeline. (b) Timing diagram: the output sequence X[0], X[1], X[2], X[3], ... emerges from the final stage once the pipeline has filled.

Figure 11.28 Decomposition of N-point DFT into two N/2-point DFTs.

[The input sequence x0, x1, x2, ..., xN−1 is split into even- and odd-indexed elements, each transformed by an N/2-point DFT; then, for k = 0, 1, ..., N/2 − 1, Xk = Xeven + Xodd × wk and Xk+N/2 = Xeven − Xodd × wk.]
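The decomposition of Figure 11.28 leads directly to the recursive FFT (illustrative Python using complex arithmetic, names mine; assumes the length is a power of two):

```python
import cmath

def fft(x):
    """Radix-2 decimation-in-time FFT: split into even- and
    odd-indexed halves, transform each, then combine with
    X[k] = E[k] + w^k O[k] and X[k + N/2] = E[k] - w^k O[k]."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    result = [0] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor w^k
        result[k] = even[k] + w * odd[k]
        result[k + n // 2] = even[k] - w * odd[k]
    return result
```

A constant input transforms to a single DC term, a quick sanity check; the repeated even/odd splitting is also what produces the bit-reversed input ordering of Figures 11.30-11.31.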

Figure 11.29 Four-point discrete Fourier transform.

[A butterfly network of additions and subtractions computing X0, X1, X2, X3 from x0, x1, x2, x3.]

Figure 11.30 Sixteen-point DFT decomposition.

Xk = Σ(0,2,4,6,8,10,12,14) + wkΣ(1,3,5,7,9,11,13,15)
   = {Σ(0,4,8,12) + wkΣ(2,6,10,14)} + wk{Σ(1,5,9,13) + wkΣ(3,7,11,15)}
   = {[Σ(0,8) + wkΣ(4,12)] + wk[Σ(2,10) + wkΣ(6,14)]} + wk{[Σ(1,9) + wkΣ(5,13)] + wk[Σ(3,11) + wkΣ(7,15)]}

Input order: x0 x8 x4 x12 x2 x10 x6 x14 x1 x9 x5 x13 x3 x11 x7 x15
(binary indices 0000 1000 0100 1100 0010 1010 0110 1110 0001 1001 0101 1101 0011 1011 0111 1111, i.e., bit-reversed order)

Figure 11.31 Sixteen-point FFT computational flow.

[Inputs x0, x8, x4, x12, x2, x10, x6, x14, x1, x9, x5, x13, x3, x11, x7, x15 in bit-reversed order; outputs X0 ... X15 in natural order after four butterfly stages.]

Figure 11.32 Mapping processors onto 16-point FFT computation.

[The sixteen rows 0000 ... 1111 of the computation are divided among four processes P0-P3, four contiguous rows per process (P/r = process/row); inputs are in bit-reversed order (x0, x8, x4, x12, ...), outputs X0 ... X15 in natural order.]

Figure 11.33 FFT using transpose algorithm — first two steps. The 16 points are held as a 4 × 4 array, one column per processor: P0 holds x0, x4, x8, x12; P1 holds x1, x5, x9, x13; P2 holds x2, x6, x10, x14; P3 holds x3, x7, x11, x15.

Figure 11.34 Transposing array for transpose algorithm. The same 4 × 4 array (P0: x0, x4, x8, x12; P1: x1, x5, x9, x13; P2: x2, x6, x10, x14; P3: x3, x7, x11, x15) is transposed so that each processor comes to hold what was previously a row.

Figure 11.35 FFT using transpose algorithm — last two steps. After the transpose each processor holds a contiguous block: P0 has x0–x3, P1 has x4–x7, P2 has x8–x11, P3 has x12–x15.
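Figures 11.33–11.35 are the data-movement view of the transpose (four-step) FFT: independent small DFTs, a twiddle-factor multiply, a transpose, then more independent DFTs. A sequential sketch of the arithmetic for N = n1 × n2 points (helper names are illustrative):

```python
import cmath

def dft(seq):
    """Direct DFT, used here as the small row/column transform."""
    n = len(seq)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(seq[t] * w ** (t * k) for t in range(n)) for k in range(n)]

def transpose_fft(x, n1, n2):
    """Transpose (four-step) FFT sketch for N = n1*n2 points."""
    N = n1 * n2
    w = cmath.exp(-2j * cmath.pi / N)
    # Arrange x as an n1 x n2 matrix: A[j1][j2] = x[j1 + n1*j2].
    A = [[x[j1 + n1 * j2] for j2 in range(n2)] for j1 in range(n1)]
    # Step 1: n2-point DFT along each row (independent transforms).
    B = [dft(row) for row in A]
    # Step 2: multiply by twiddle factors w^(j1*k2).
    C = [[B[j1][k2] * w ** (j1 * k2) for k2 in range(n2)] for j1 in range(n1)]
    # Step 3: transpose, so the next transforms are again over local rows.
    Ct = [[C[j1][k2] for j1 in range(n1)] for k2 in range(n2)]
    # Step 4: n1-point DFT along each transposed row.
    D = [dft(row) for row in Ct]
    # Output: X[n2*k1 + k2] = D[k2][k1].
    return [D[k % n2][k // n2] for k in range(N)]
```

In the parallel version each processor owns a group of rows, so steps 1, 2, and 4 need no communication; all data movement is concentrated in the transpose.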

Figure 11.36 Image for Problem 11-3: a 7 × 7 pixel grid (rows and columns numbered 1–7) with a mask.

Figure 12.1 State space tree. Each level corresponds to one choice: the first choice includes or does not include C0, the second choice C1, and so on through the final choice Cn−1.

Figure 12.2 Single-point crossover. Parents A = A1A2 and B = B1B2, each an m-bit string, are cut between bit positions p and p+1; child 1 inherits A1 and B2, child 2 inherits B1 and A2.
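A sketch of the operation in Figure 12.2, treating chromosomes as Python sequences (the function name and the random-point default are illustrative):

```python
import random

def single_point_crossover(parent_a, parent_b, p=None):
    """Single-point crossover: cut both m-element parents between
    positions p and p+1 (1-indexed, as in Figure 12.2) and swap tails."""
    m = len(parent_a)
    if p is None:
        p = random.randint(1, m - 1)  # crossover point strictly inside the string
    child1 = parent_a[:p] + parent_b[p:]  # A1 + B2
    child2 = parent_b[:p] + parent_a[p:]  # B1 + A2
    return child1, child2
```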

Figure 12.3 Island model. Each subpopulation evolves on its own island; migration paths connect the islands, and every island sends migrants to every other island.

Figure 12.4 Stepping stone model. Island subpopulations connected by limited migration paths, so each island exchanges individuals only with its neighbors.

Figure D.1 PRAM model. Processors with local memory, driven by a common clock, execute instructions from a program and read and write data in one shared memory.

Figure D.2 List ranking by pointer jumping. Each element i holds a distance d[i] and a successor pointer s[i] (the last element's successor is null); in each step every element adds its successor's distance to its own and jumps its pointer to its successor's successor, so the distances double (1, 2, 4, …) until every element knows its distance to the end of the list.
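A sequential simulation of the pointer-jumping rounds in Figure D.2. Each PRAM round is simulated by computing all new values before writing any of them (function and variable names are illustrative):

```python
def list_rank(succ):
    """List ranking by pointer jumping. succ[i] is the index of i's
    successor, or None for the last element. Returns each element's
    distance to the end of the list in O(log n) rounds."""
    n = len(succ)
    d = [0 if succ[i] is None else 1 for i in range(n)]
    s = list(succ)
    for _ in range(n.bit_length()):  # enough rounds for every pointer to reach the end
        # Simulate one synchronous PRAM step: read all, then write all.
        nd = [d[i] + (d[s[i]] if s[i] is not None else 0) for i in range(n)]
        ns = [s[s[i]] if s[i] is not None else None for i in range(n)]
        d, s = nd, ns
    return d
```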

Figure D.3 A view of the bulk synchronous parallel model. In each superstep, threads or processes perform local computation (maximum time w), then communication (a maximum of h sends or receives per process), followed by barrier synchronization.
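The conventional BSP cost of one such superstep is w + h·g + l, where g is the per-message communication gap and l the barrier cost (machine parameters of the BSP model, not shown in the figure). As a trivial sketch:

```python
def superstep_cost(w, h, g, l):
    """Standard BSP superstep cost: local computation w, an h-relation
    charged at g time units per message, plus barrier synchronization l."""
    return w + h * g + l
```

Summing this over all supersteps of a program gives its predicted BSP running time.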

Figure D.4 LogP parameters. Sending a message from processor Pi to Pk costs overhead o at the sender, latency L across the network, and overhead o at the receiver; successive messages from the same processor must be separated by at least the gap g before the next message can be sent.
