Introduction to Parallel Computing
Part IIb
What is MPI?
Message Passing Interface (MPI) is a standardised interface. Several implementations of this interface have been made. The MPI standard specifies three forms of subroutine interfaces:
(1) Language-independent notation;
(2) Fortran notation;
(3) C notation.
MPI Features
MPI implementations provide:
• Abstraction of hardware implementation
• Synchronous communication
• Asynchronous communication
• File operations
• Time measurement operations
Implementations
MPICH       Unix / Windows NT
MPICH-T3E   Cray T3E
LAM         Unix / SGI Irix / IBM AIX
Chimp       SunOS / AIX / Irix / HP-UX
WinMPI      Windows 3.1 (no network required)
Programming with MPI
What is the difference between programming using the traditional approach and the MPI approach?
1. Use of MPI library
2. Compiling
3. Running
Compiling (1)
When a program is written, compiling it should be done a little differently from the normal situation. Although the details differ between MPI implementations, there are two frequently used approaches.
Compiling (2)
First approach

$ gcc myprogram.c -o myexecutable -lmpi

Second approach

$ mpicc myprogram.c -o myexecutable
Running (1)
In order to run an MPI-enabled application we should generally use the command ‘mpirun’:

$ mpirun -np x myexecutable <parameters>

where x is the number of processes to use, and <parameters> are the arguments to the executable, if any.
Running (2)
The ‘mpirun’ program will take care of the creation of processes on selected processors. By default, ‘mpirun’ decides which processors to use; this is usually determined by a global configuration file. It is possible to specify processors explicitly, but the specification may be treated only as a hint.
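As a hedged sketch of specifying processors (the flag spelling varies per implementation; `-machinefile` is the MPICH form, and the host names here are hypothetical), a list of machines can be supplied in a file:

```
# machines.txt - one host name per line (hypothetical hosts)
node1
node2
node3
```

It would then be passed on the command line, e.g. `$ mpirun -np 3 -machinefile machines.txt myexecutable`.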
MPI Programming (1)
Implementations of MPI support Fortran, C, or both. Here we only consider programming using the C libraries. The first step in writing a program using MPI is to include the correct header:
#include "mpi.h"
MPI Programming (2)
#include "mpi.h"

int main (int argc, char *argv[])
{
  ...
  MPI_Init (&argc, &argv);
  ...
  MPI_Finalize ();
  return ...;
}
MPI_Init
int MPI_Init (int *argc, char ***argv)
The MPI_Init procedure should be called before any other MPI procedure (except MPI_Initialized). It must be called exactly once, at program initialisation. It removes the arguments that are used by MPI from the argument array.
MPI_Finalize
int MPI_Finalize (void)
This routine cleans up all MPI state. It should be the last MPI routine called in a program; no other MPI routine may be called after MPI_Finalize. Pending communication should be finished before finalisation.
Using multiple processes
When running an MPI-enabled program using multiple processes, each process will run an identical copy of the program. So there must be a way to know which process we are. This situation is comparable to that of programming using the ‘fork’ statement. MPI defines two subroutines that can be used.
MPI_Comm_size
int MPI_Comm_size (MPI_Comm comm, int *size)
This call returns the number of processes involved in a communicator. To find out how many processes are used in total, call this function with the predefined global communicator MPI_COMM_WORLD.
MPI_Comm_rank
int MPI_Comm_rank (MPI_Comm comm, int *rank)
This procedure determines the rank (index) of the calling process in the communicator. Each process is assigned a unique number within a communicator.
MPI_COMM_WORLD
MPI communicators are used to specify to which processes communication applies. A communicator is shared by a group of processes. The predefined MPI_COMM_WORLD applies to all processes. Communicators can be duplicated, created and deleted. For most applications, use of MPI_COMM_WORLD suffices.
Example ‘Hello World!’

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  int size, rank;

  MPI_Init (&argc, &argv);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);

  printf ("Hello world! from processor (%d/%d)\n", rank + 1, size);

  MPI_Finalize ();

  return 0;
}
Running ‘Hello World!’
$ mpicc -o hello hello.c
$ mpirun -np 3 hello
Hello world! from processor (1/3)
Hello world! from processor (2/3)
Hello world! from processor (3/3)
$ _
MPI_Send
int MPI_Send (void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm )
Synchronously sends a message to dest. Data is found in buf, which contains count elements of datatype. To identify the send, a tag has to be specified. The destination dest is the processor rank in communicator comm.
MPI_Recv

int MPI_Recv (void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
Synchronously receives a message from source. The buffer must be able to hold count elements of datatype. The status field is filled with status information. MPI_Recv and MPI_Send calls should match: equal tag, count and datatype.
Datatypes

MPI_CHAR             signed char
MPI_SHORT            signed short int
MPI_INT              signed int
MPI_LONG             signed long int
MPI_UNSIGNED_CHAR    unsigned char
MPI_UNSIGNED_SHORT   unsigned short int
MPI_UNSIGNED         unsigned int
MPI_UNSIGNED_LONG    unsigned long int
MPI_FLOAT            float
MPI_DOUBLE           double
MPI_LONG_DOUBLE      long double
(http://www-jics.cs.utk.edu/MPI/MPIguide/MPIguide.html)
Example send / receive

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  MPI_Status s;
  int size, rank, i, j;

  MPI_Init (&argc, &argv);
  MPI_Comm_size (MPI_COMM_WORLD, &size);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);

  if (rank == 0)   // Master process
  {
    printf ("Receiving data . . .\n");
    for (i = 1; i < size; i++)
    {
      MPI_Recv ((void *)&j, 1, MPI_INT, i, 0xACE5, MPI_COMM_WORLD, &s);
      printf ("[%d] sent %d\n", i, j);
    }
  }
  else
  {
    j = rank * rank;
    MPI_Send ((void *)&j, 1, MPI_INT, 0, 0xACE5, MPI_COMM_WORLD);
  }

  MPI_Finalize ();
  return 0;
}
Running send / receive
$ mpicc -o sendrecv sendrecv.c
$ mpirun -np 4 sendrecv
Receiving data . . .
[1] sent 1
[2] sent 4
[3] sent 9
$ _
MPI_Bcast

int MPI_Bcast (void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
Synchronously broadcasts a message from root to all processors in communicator comm (including itself). Buffer is used as the source in the root processor, and as the destination in the others.
MPI_Barrier
int MPI_Barrier (MPI_Comm comm)
Blocks until all processes defined in comm have reached this routine. Use this routine to synchronise processes.
Example broadcast / barrier

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  int rank, i;

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);

  if (rank == 0)
    i = 27;
  MPI_Bcast ((void *)&i, 1, MPI_INT, 0, MPI_COMM_WORLD);
  printf ("[%d] i = %d\n", rank, i);

  // Wait for every process to reach this code
  MPI_Barrier (MPI_COMM_WORLD);

  MPI_Finalize ();

  return 0;
}
Running broadcast / barrier
$ mpicc -o broadcast broadcast.c
$ mpirun -np 3 broadcast
[0] i = 27
[1] i = 27
[2] i = 27
$ _
MPI_Sendrecv

int MPI_Sendrecv (void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status)
int MPI_Sendrecv_replace( void *buf, int count, MPI_Datatype datatype, int dest, int sendtag, int source, int recvtag, MPI_Comm comm, MPI_Status *status )
MPI_Sendrecv combines a send and a receive in one call; the second variant, MPI_Sendrecv_replace, uses a single buffer for both the outgoing and the incoming message.
Other useful routines
• MPI_Scatter
• MPI_Gather
• MPI_Type_vector
• MPI_Type_commit
• MPI_Reduce / MPI_Allreduce
• MPI_Op_create
Example scatter / reduce

#include <stdio.h>
#include "mpi.h"

int main (int argc, char *argv[])
{
  int data[] = {1, 2, 3, 4, 5, 6, 7};   // Size must be >= #processors
  int rank, i = -1, j = -1;

  MPI_Init (&argc, &argv);
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);

  MPI_Scatter ((void *)data, 1, MPI_INT,
               (void *)&i, 1, MPI_INT, 0, MPI_COMM_WORLD);

  printf ("[%d] Received i = %d\n", rank, i);

  MPI_Reduce ((void *)&i, (void *)&j, 1, MPI_INT, MPI_PROD, 0,
              MPI_COMM_WORLD);

  printf ("[%d] j = %d\n", rank, j);

  MPI_Finalize ();

  return 0;
}
Running scatter / reduce

$ mpicc -o scatterreduce scatterreduce.c
$ mpirun -np 4 scatterreduce
[0] Received i = 1
[0] j = 24
[1] Received i = 2
[1] j = -1
[2] Received i = 3
[2] j = -1
[3] Received i = 4
[3] j = -1
$ _
Some reduce operations

MPI_MAX     Maximum value
MPI_MIN     Minimum value
MPI_SUM     Sum of values
MPI_PROD    Product of values
MPI_LAND    Logical AND
MPI_BAND    Bitwise AND
MPI_LOR     Logical OR
MPI_BOR     Bitwise OR
MPI_LXOR    Logical exclusive OR
MPI_BXOR    Bitwise exclusive OR
Measuring running time
double MPI_Wtime (void);
double timeStart, timeEnd;
...
timeStart = MPI_Wtime ();
// Code to measure time for goes here.
timeEnd = MPI_Wtime ();
...
printf ("Running time = %f seconds\n", timeEnd - timeStart);
Parallel sorting (1)
Sorting a sequence of numbers using the binary-sort method. This method divides a given sequence into two halves (until only one element remains) and sorts both halves recursively. The two halves are then merged together to form a sorted sequence.
Binary sort pseudo-code
sorted-sequence BinarySort (sequence)
{
  if (# elements in sequence > 1)
  {
    seqA = first half of sequence
    seqB = second half of sequence
    BinarySort (seqA);
    BinarySort (seqB);
    sorted-sequence = merge (seqA, seqB);
  }
  else
    sorted-sequence = sequence
}
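A minimal serial sketch of this pseudocode in C (the function and variable names are illustrative, not from the slides):

```c
#include <stdlib.h>
#include <string.h>

/* Merge the two sorted sequences a[0..na) and b[0..nb) into out. */
static void merge (const int *a, int na, const int *b, int nb, int *out)
{
    int ia = 0, ib = 0, io = 0;
    while (ia < na && ib < nb)
        out[io++] = (a[ia] <= b[ib]) ? a[ia++] : b[ib++];
    while (ia < na) out[io++] = a[ia++];
    while (ib < nb) out[io++] = b[ib++];
}

/* Recursively sort seq[0..n) using the binary-sort scheme above. */
void binary_sort (int *seq, int n)
{
    if (n <= 1)
        return;
    int half = n / 2;
    binary_sort (seq, half);           /* sort first half      */
    binary_sort (seq + half, n - half);/* sort second half     */
    int *tmp = malloc (n * sizeof (int));
    merge (seq, half, seq + half, n - half, tmp);
    memcpy (seq, tmp, n * sizeof (int));
    free (tmp);
}
```

In the parallel version described later, the two recursive calls are what gets handed to different processors.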
Merge two sorted sequences
(Figure: the two sorted halves are merged element by element into the sorted sequence 1 2 3 4 5 6 7 8.)
Example binary sort

(Figure: the sequence 1 7 5 2 8 4 6 3 is recursively split into halves, each half is sorted, and the halves are merged back into 1 2 3 4 5 6 7 8.)
Parallel sorting (2)
This way of dividing work and gathering the results is quite natural for a parallel implementation. Divide the work in two and hand the halves to two processors. Have each of these processors divide their work again, until either no data can be split anymore or no processors are available anymore.
Implementation problems
• Number of processors may not be a power of two
• Number of elements may not be a power of two
• How to achieve an even workload?
• Data size is less than number of processors
Parallel matrix multiplication
We use the following partitioning of data (p=4)
(Figure: both matrices are partitioned into four bands, one per process P1 to P4.)
Implementation
1. Master (process 0) reads data
2. Master sends size of data to slaves
3. Slaves allocate memory
4. Master broadcasts second matrix to all other processes
5. Master sends respective parts of first matrix to all other processes
6. Every process performs its local multiplication
7. All slave processes send back their result.
Multiplication 1000 x 1000

(Figure: running time in seconds versus number of processors, up to about 60, for a 1000 x 1000 matrix multiplication, comparing the measured time Tp with the ideal time T1 / p.)
Multiplication 5000 x 5000

(Figure: running time in seconds versus number of processors, up to about 35, for a 5000 x 5000 matrix multiplication, comparing the measured time Tp with the ideal time T1 / p.)
Gaussian elimination
We use the following partitioning of data (p=4)
(Figure: the matrix is partitioned into four bands, one per process P1 to P4.)
Implementation (1)
1. Master reads both matrices
2. Master sends size of matrices to slaves
3. Slaves calculate their part and allocate memory
4. Master sends each slave its respective part
5. Set sweeping row to 0 in all processes
6. Sweep matrix (see next sheet)
7. Slaves send back their result
Implementation (2)
While sweeping row not past final row do
A. Have every process decide whether they own the current sweeping row
B. The owner sends a copy of the row to every other process
C. All processes sweep their part of the matrix using the current row
D. Sweeping row is incremented
Programming hints
• Keep it simple!
• Avoid deadlocks
• Write robust code, even at the cost of speed
• Design in advance; debugging is more difficult (printing output is different)
• Error handling requires synchronisation; you can't just exit the program.
References (1)
MPI Forum Home Page
http://www.mpi-forum.org/index.html
Beginner's guide to MPI (see also /MPI/)
http://www-jics.cs.utk.edu/MPI/MPIguide/MPIguide.html
MPICH
http://www-unix.mcs.anl.gov/mpi/mpich/
References (2)
Miscellaneous
http://www.erc.msstate.edu/labs/hpcl/projects/mpi/
http://nexus.cs.usfca.edu/mpi/
http://www-unix.mcs.anl.gov/~gropp/
http://www.epm.ornl.gov/~walker/mpitutorial/
http://www.lam-mpi.org/
http://epcc.ed.ac.uk/chimp/
http://www-unix.mcs.anl.gov/mpi/www/www3/