
Computing Labs 2011


An Introduction to Parallel Computing With MPI

Computing Lab I

The purpose of the first programming exercise is to become familiar with the operating environment on a parallel computer, and to create and run a simple parallel program using MPI. In code development, it is always a good idea to start simple and develop/debug each piece of code before adding more complexity. This first code will implement the basic MPI structure, query communicator info, and output process rank – the classic “hello world” program. You will learn how to compile parallel programs and submit batch jobs using the scheduler.

• Write a basic “hello world” code which creates an MPI environment, determines the number of processes in the global communicator, and writes the rank of each process to standard output. You will have to use the correct MPI language binding for the programming language you are using: FORTRAN, C, or C++. The general program structure for each language is shown below. You can write your code in the editor “TextWrangler”, which is installed on the lab computers. This program allows you to edit and save source code files either locally on the lab computers or remotely on socrates, and to transfer files as needed using sftp.

FORTRAN90

program MPIhelloworld
  implicit none
  include "mpif.h"                              ! Include the MPI header file
  integer :: ierr, pid, np

  call MPI_INIT(ierr)                           ! Initialize MPI environment
  call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)  ! Get number of processes (np)
  call MPI_COMM_RANK(MPI_COMM_WORLD, pid, ierr) ! Get local rank (pid)
  write(*,*) "I am process: ", pid
  call MPI_FINALIZE(ierr)                       ! Terminate MPI environment

  stop
end program MPIhelloworld

C

#include <mpi.h>   /* Include the MPI header file */
#include <stdio.h>

int main(int argc, char *argv[])
{
  int ierr, pid, np;

  ierr = MPI_Init(&argc, &argv);       /* Initialize MPI environment */
  MPI_Comm_size(MPI_COMM_WORLD, &np);  /* Get number of processes (np) */
  MPI_Comm_rank(MPI_COMM_WORLD, &pid); /* Get local rank (pid) */
  printf("I am process: %d \n", pid);
  MPI_Finalize();                      /* Terminate MPI environment */
  return 0;
}

C++

#include <mpi.h>    // Include the MPI header file
#include <iostream>

int main(int argc, char **argv)
{
  int pid, np;

  MPI::Init(argc, argv);             // Initialize MPI environment
  np  = MPI::COMM_WORLD.Get_size();  // Get number of processes (np)
  pid = MPI::COMM_WORLD.Get_rank();  // Get local rank (pid)
  std::cout << "I am process: " << pid << std::endl;
  MPI::Finalize();                   // Terminate MPI environment
  return 0;
}

• Save your source code to your home directory on socrates (from the TextWrangler File menu select “Save to FTP/SFTP Server…” and log in). Now open a terminal program (such as Terminal or X11) and ssh to socrates. You should be able to log in with your NSID account. If you are not familiar with the UNIX command line environment, consult the attached document explaining all the basic commands you need to know. Your home directory is the location where you will keep all your source code, executables, and input and output data files. Your parallel jobs are submitted from the home directory, and your code can read and write files there. Most parallel computers provide a different directory with additional disk space should your program use very large data files.

socrates

Information about socrates is available at http://www.usask.ca/its/services/research_computing/socrates.php. Your account has been set up to use the OpenMPI implementation of the MPI standard; socrates also has MPICH and LAM MPI installed. The compilers gcc, g77, and gfortran are available. To compile a parallel MPI program you need to use the compiler scripts provided by OpenMPI, which link the native compilers to the proper MPI libraries. The compiler scripts are mpif77 or mpif90 for FORTRAN programs, mpicc for C, and mpiCC for C++ programs. They can be passed any flag accepted by the underlying compilers. To do a basic build, use one of the following commands:


[]$ mpif90 -o executable sourcecode.f90
[]$ mpicc -o executable sourcecode.c
[]$ mpiCC -o executable sourcecode.cpp

Socrates uses the TORQUE/Moab batching system (which evolved from software called PBS – the Portable Batch System) to manage the load distribution on the cluster. This load-leveling software creates a queue for the cluster, and users must submit their batch jobs to it. An outline of the basic TORQUE commands is given below. To submit a parallel job, you will need to create a job script. Using a text editor (TextWrangler), create a new file named myjobscript.pbs and type in all the commands required to submit your parallel job to the queue. A sample job script is shown below. Note that PBS directives are preceded by #PBS and comment lines are inserted with a single #.

#!/bin/sh
# Sample PBS Script for use with OpenMPI on Socrates
# Jason Hlady May 2010

# Specify the number of processors to use in the form of
# nodes=X:ppn=Y, where X = number of computers (nodes),
# Y = number of processors per computer
#PBS -l nodes=1:ppn=1

# Job name which will show up in queue, job output
#PBS -N <my job name>

# Optional: join error and output into one stream
#PBS -j oe

# Show what node the app started on--useful for serial jobs
echo `hostname`

cd $PBS_O_WORKDIR
echo "Current working directory is `pwd`"
echo "Starting run at: `date`"
echo "---------------------"

# Run the application
mpirun <my program name>

echo "Program finished with exit code $? at: `date`"
exit 0


When you submit a batch job, TORQUE will assign it a job ID number, and the standard output and standard error of the job will be stored in the file myjobname.oJOB_ID# in your working directory. To submit your batch job, simply enter

[]$ qsub myjobscript.pbs

The job ID number will be printed to the screen. To check the status of your job in the queue, type

[]$ qstat

To kill a job, enter

[]$ qdel JOB_ID#

You can view the man pages of any of these commands for more information and options.

Computing Lab II

Option 1: Jacobi Iteration on a Two-Dimensional Mesh

This is a classic problem for learning the basics of building a parallel Single Program Multiple Data (SPMD) code with a domain decomposition approach and data dependency between processes. These issues are common to many parallel algorithms used in scientific programs. We will keep the algorithm as simple as possible so that you can focus on implementing the parallel communication and thinking about program efficiency.

Consider solving the temperature distribution on a two-dimensional grid with fixed temperature values on the boundaries.

Figure 1: Uniform grid of temperature values. Boundary values indicated by grey nodes.


The temperature values at all grid points can be stored in a two-dimensional data array, T(i,j). Starting from an initial guess for the temperature distribution (say T = 0 at all interior nodes (white squares)), we can calculate the final temperature distribution by repeatedly applying the calculation

T^{new}_{i,j} = \frac{1}{4}\left( T^{old}_{i+1,j} + T^{old}_{i-1,j} + T^{old}_{i,j+1} + T^{old}_{i,j-1} \right)

over all interior nodes until the temperature values converge to the final solution. This is not a very efficient solver and it may take hundreds (or thousands) of sweeps of the grid before convergence, but it is the simplest algorithm you can use. An example FORTRAN 90 sequential program is given below.

program jacobi
  ! A program solving 2D heat equations using Jacobi iteration
  implicit none
  integer, parameter :: id=100, jd=100
  integer :: i, j, n, nmax
  real(kind=8), dimension(0:id+1,0:jd+1) :: Tnew, Told
  character(6) :: filename

  ! Initialize the domain
  Told = 0.0_8             ! initial condition
  Told(0,:)    = 80.0_8    ! right boundary condition
  Told(id+1,:) = 50.0_8    ! left boundary condition
  Told(:,0)    = 0.0_8     ! bottom boundary condition
  Told(:,jd+1) = 100.0_8   ! top boundary condition
  Tnew = Told

  ! Perform Jacobi iterations (nmax sweeps of the domain)
  nmax = 1000
  do n = 1, nmax
     ! Sweep interior nodes
     do i = 1, id
        do j = 1, jd
           Tnew(i,j) = (Told(i+1,j) + Told(i-1,j) + Told(i,j+1) + Told(i,j-1))/4.0_8
        end do
     end do
     ! Copy Tnew to Told and sweep again
     Told = Tnew
  end do

  ! Output field data to file
50 format(102f6.1)
  filename = "T.dat"
  open(unit=20, file=filename, status="replace")
  do j = jd+1, 0, -1
     write(20,50) (Tnew(i,j), i=0,id+1)
  end do

  stop
end program jacobi


Now parallelize the Jacobi solver. Use a simple one-dimensional domain decomposition as shown below.

Figure 2: 1D decomposition of the grid into strips assigned to processes 0, 1, …, n.

Each process will perform iterations only on its subdomain, and will have to exchange temperature values with neighboring processes at the subdomain boundaries. You should create a row of ghost points to store these communicated values. The external row of boundary values around the global domain can also be considered ghost points. If you keep things basic, you should be able to write the parallel program in less than 70 lines of code!
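
As a concrete (but by no means mandatory) starting point, the sketch below shows one possible way to lay out the local arrays for this decomposition. The names decomp_sketch and idloc are illustrative assumptions, not part of the required solution, and the sketch presumes that the number of interior columns id divides evenly by the number of processes.

program decomp_sketch
  ! Sketch only: local array layout for a 1D decomposition in the i-direction.
  ! Assumes id is divisible by np; decomp_sketch and idloc are illustrative names.
  implicit none
  include "mpif.h"
  integer, parameter :: id=100, jd=100     ! global interior grid size, as in the sequential code
  integer :: ierr, pid, np, idloc
  real(kind=8), allocatable, dimension(:,:) :: Tnew, Told

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, pid, ierr)

  idloc = id/np                            ! interior columns owned by this process
  ! i = 0 and i = idloc+1 are ghost layers: they hold either the physical
  ! boundary values (on the two end processes) or values received from a neighbour.
  allocate(Tnew(0:idloc+1, 0:jd+1), Told(0:idloc+1, 0:jd+1))
  write(*,*) "process ", pid, " owns global columns ", pid*idloc+1, " to ", (pid+1)*idloc

  deallocate(Tnew, Told)
  call MPI_FINALIZE(ierr)
end program decomp_sketch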

Some tips and hints:

• To keep things simple, directly program the mapping of the domain to the processes, i.e. process 0 is on the left boundary, process ‘n’ on the right boundary, the rest in the middle. You can also directly specify the different boundary conditions for each process.

• After every process sweeps its local nodes once, you will have to communicate the updated temperature values at the subdomain boundaries before the next sweep. This can be accomplished in two communication shifts – first everyone sends data to the process on the right and receives from the left, then everyone sends to the left and receives from the right. Make sure the communication pattern cannot deadlock (one possible exchange is sketched after this list).

• Since the data values you need to communicate may not be in contiguous memory locations in your 2D temperature data array, you can create a 1D buffer array and explicitly copy the data values in/out of the buffer and use the buffer array in the MPI_SEND and MPI_RECV calls.

• You may want to look at the data field when the computation is done, and the easiest way to do this is to have every process write its local data array to a separate data file. You will have to use a different file name for every process, and one way to automatically generate file names (in FORTRAN 90) with the process id as the file name is with ASCII number to character conversion: filename=achar((pid-mod(pid,10))/10+48) // achar(mod(pid,10)+48) // ".dat" which gives the file name “12.dat” for pid = 12.

• Try using MPI_SENDRECV instead of separate blocking send and receive calls. This will allow you to solve the case when the domain is periodic in the x-direction (roll the domain into a cylindrical shell with the two x-faces joined together) and process 0 communicates with process ‘n’.

• You can implement a grid convergence measure such as the rms of the difference between Tnew and Told on the global grid, and then stop the outer loop when the convergence measure is acceptably small (say 10^-5). To do this you will need to use collective communication calls to calculate the global convergence of the grid and to broadcast this value to all processes so that they stop at the same time.

• If you have the 1D domain decomposition working you can try a 2D domain decomposition which subdivides the domain into squares instead of strips. This is a more efficient decomposition since the number of subdomain ghost points is reduced.
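
The fragment below sketches one possible way to carry out the ghost-point exchange described above, combining the buffer-copy idea with MPI_SENDRECV and MPI_PROC_NULL so that the end processes need no special-case code. It is an illustration rather than the required solution: the subroutine name exchange_ghosts, the buffers sbuf and rbuf, and the neighbour variables left and right are assumptions, while idloc and jd are the local grid sizes from the layout sketch above.

subroutine exchange_ghosts(Told, idloc, jd, pid, np)
  ! Sketch only: two-shift exchange of the non-contiguous i-boundary data.
  ! Each boundary column is packed into a contiguous 1D buffer before sending.
  implicit none
  include "mpif.h"
  integer, intent(in) :: idloc, jd, pid, np
  real(kind=8), intent(inout) :: Told(0:idloc+1, 0:jd+1)
  real(kind=8) :: sbuf(jd), rbuf(jd)
  integer :: left, right, ierr, status(MPI_STATUS_SIZE)

  left  = pid - 1
  right = pid + 1
  if (pid == 0)      left  = MPI_PROC_NULL   ! sends/receives to MPI_PROC_NULL do nothing
  if (pid == np - 1) right = MPI_PROC_NULL

  ! Shift 1: send the right-most interior column to the right neighbour,
  !          receive the left ghost column from the left neighbour
  sbuf = Told(idloc, 1:jd)
  call MPI_SENDRECV(sbuf, jd, MPI_DOUBLE_PRECISION, right, 0, &
                    rbuf, jd, MPI_DOUBLE_PRECISION, left,  0, &
                    MPI_COMM_WORLD, status, ierr)
  if (left /= MPI_PROC_NULL) Told(0, 1:jd) = rbuf

  ! Shift 2: send the left-most interior column to the left neighbour,
  !          receive the right ghost column from the right neighbour
  sbuf = Told(1, 1:jd)
  call MPI_SENDRECV(sbuf, jd, MPI_DOUBLE_PRECISION, left,  1, &
                    rbuf, jd, MPI_DOUBLE_PRECISION, right, 1, &
                    MPI_COMM_WORLD, status, ierr)
  if (right /= MPI_PROC_NULL) Told(idloc+1, 1:jd) = rbuf
end subroutine exchange_ghosts

For the periodic case, left and right can simply wrap around instead of being set to MPI_PROC_NULL. For the convergence test, each process can compute its local sum of squared differences between Tnew and Told and combine the results with a single collective call such as call MPI_ALLREDUCE(locsum, glbsum, 1, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr), which both sums the values and makes the result available on every process (locsum and glbsum are again assumed names).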

Option 2: Numerical Integration of a Set of Discrete Data

This problem uses a master-worker model where the master process divides up the data and sends it to the workers, who perform local computations on the data and communicate results back to the master. There is no data dependency between workers (they don’t need to communicate with each other). This is an example of what is called an “embarrassingly parallel” problem.

Consider the numerical integration of a large set of discrete data values, which could represent points sampled from a function.

Figure 3: Discrete data values, f(xi), where i = 1,2,3,…,n.

To approximate the integral, we can fit straight lines between each pair of points and then compute the sum of the areas under each line segment. This is the trapezoid formula:

I \approx \sum_{i=1}^{n-1} \frac{f(x_i) + f(x_{i+1})}{2} \, (x_{i+1} - x_i)


The sample locations x_i may not be evenly spaced. An example FORTRAN 90 code is given below.

program integrate
  ! A program to numerically integrate discrete data from the file "ptrace.dat"
  implicit none
  integer, parameter :: n=960000     ! Number of points in data file
  integer :: i
  real(kind=8) :: integral
  real(kind=8), dimension(n) :: x, f

  ! Open data file and read in data
  open(unit=21, file="ptrace.dat", status="old")
  do i = 1, n
     read(21,*) x(i), f(i)
  end do

  ! Now compute global integral
  integral = 0.0_8
  do i = 1, n-1
     integral = integral + (x(i+1)-x(i))*(f(i)+f(i+1))/2.0_8   ! trapezoidal formula
  end do

  ! Output result
  write(*,*) "The integral of the data set is: ", integral

  stop
end program integrate

Now parallelize this program using the master-worker model. The master process (choose process 0, which is always present) reads in data from the file, divides it up evenly and distributes it to the workers (all other processes). The workers compute the integral of their portions of the data and return the results to the master. The master sums the results to find the global integral and outputs the result. If you keep things simple, you should be able to write the parallel program in less than 60 lines of code.
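
One possible shape for this program is sketched below. It deliberately ignores the refinements discussed in the tips that follow (chunked reading, load balancing, non-blocking calls), and the names integrate_mw_sketch, nw, cnt, partial, total, xl and fl are illustrative assumptions rather than a prescribed solution. Each worker's block shares one end point with its neighbour's block so that no trapezoid is lost at the block boundaries.

program integrate_mw_sketch
  ! Sketch only: a master-worker layout for the parallel integration.
  ! Run with at least 2 processes; the master (rank 0) does no integration here.
  implicit none
  include "mpif.h"
  integer, parameter :: n=960000                 ! number of points in the data file
  integer :: ierr, pid, np, nw, w, i, i1, i2, cnt
  integer :: status(MPI_STATUS_SIZE)
  real(kind=8) :: partial, total
  real(kind=8), allocatable :: x(:), f(:), xl(:), fl(:)

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, np, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, pid, ierr)
  nw = np - 1                                    ! number of workers

  if (pid == 0) then
     ! Master: read the data, hand each worker a block of points, sum the results.
     allocate(x(n), f(n))
     open(unit=21, file="ptrace.dat", status="old")
     do i = 1, n
        read(21,*) x(i), f(i)
     end do
     close(21)
     do w = 1, nw
        i1  = 1 + ((w-1)*(n-1))/nw               ! first point of worker w's block
        i2  = 1 + (w*(n-1))/nw                   ! last point (shared with worker w+1)
        cnt = i2 - i1 + 1
        call MPI_SEND(cnt,   1,   MPI_INTEGER,          w, 0, MPI_COMM_WORLD, ierr)
        call MPI_SEND(x(i1), cnt, MPI_DOUBLE_PRECISION, w, 1, MPI_COMM_WORLD, ierr)
        call MPI_SEND(f(i1), cnt, MPI_DOUBLE_PRECISION, w, 2, MPI_COMM_WORLD, ierr)
     end do
     total = 0.0_8
     do w = 1, nw
        call MPI_RECV(partial, 1, MPI_DOUBLE_PRECISION, w, 3, MPI_COMM_WORLD, status, ierr)
        total = total + partial
     end do
     write(*,*) "The integral of the data set is: ", total
  else
     ! Worker: receive a block, integrate it with the trapezoid formula, return the result.
     call MPI_RECV(cnt, 1, MPI_INTEGER, 0, 0, MPI_COMM_WORLD, status, ierr)
     allocate(xl(cnt), fl(cnt))
     call MPI_RECV(xl, cnt, MPI_DOUBLE_PRECISION, 0, 1, MPI_COMM_WORLD, status, ierr)
     call MPI_RECV(fl, cnt, MPI_DOUBLE_PRECISION, 0, 2, MPI_COMM_WORLD, status, ierr)
     partial = 0.0_8
     do i = 1, cnt-1
        partial = partial + (xl(i+1)-xl(i))*(fl(i)+fl(i+1))/2.0_8
     end do
     call MPI_SEND(partial, 1, MPI_DOUBLE_PRECISION, 0, 3, MPI_COMM_WORLD, ierr)
  end if

  call MPI_FINALIZE(ierr)
end program integrate_mw_sketch

Sending the block size first keeps the sketch simple; having each worker compute its own block size from its rank would work equally well and saves one message per worker.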

Some tips and hints:

• If the data array is very large, the master process may not have enough local memory to store the entire array. In this case it would be better to read in only part of the data set at a time and send it to one or more workers, before reading in more data (overwriting the previous values) and sending it to the remaining workers, and so on.

• In order to make this algorithm efficient, we need to minimize the idle time of the workers (and the master) and balance the computational work as evenly as possible. If the number of processes is small and the data set is large, we may want the master process to help compute part of the integral while it is waiting for the workers to finish. Also, if the amount of data communicated to each worker is large (lots of communication overhead – bandwidth related), other workers will be idling while they wait for their data. Would it be more efficient to send smaller parcels of data to each worker so that they all get to work quickly, and then repeatedly send more data as each worker finishes until all the work is done? But if the number of messages gets too large, then we will have increased latency-related overhead.

• You can try using non-blocking communication calls on the master process so that it can do other tasks while waiting for results from workers.

• You can also try using the scatter and reduce collective communication routines to implement the parallel program; a fragment illustrating the reduce step is sketched below.
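
As an illustration of the reduce half of that idea, the explicit receive-and-add loop on the master can be replaced by a single collective call. This is only a fragment, and the names partial, total and pid are assumed to be those of the surrounding program:

! Fragment only: sum every process's partial integral onto process 0.
call MPI_REDUCE(partial, total, 1, MPI_DOUBLE_PRECISION, MPI_SUM, 0, MPI_COMM_WORLD, ierr)
if (pid == 0) write(*,*) "The integral of the data set is: ", total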

Investigate Parallel Performance

Measure the parallel performance of your code and examine how the efficiency varies with process count and problem size.

• Implement timing routines in your parallel code as well as in a sequential version, and write the run time to standard output (a minimal MPI_WTIME sketch is shown after this list). When submitting timed parallel jobs to the queue, you want to make sure that resources are used exclusively for your job (i.e. other applications are not running at the same time on the same CPU). Also, the run time of your code may be affected by the mapping of processes to cores/sockets/nodes on the machine, so experiment with this. It is a good idea to launch the code several times and average the run time results.

• Measure the parallel efficiency and speedup of your code on different numbers of processes. You may also want to repeat the measurements on larger/smaller domains to examine the effects of problem size. The single-process run time T1 can be used to calculate the speedup on p processes, S = T1/Tp; a tougher measure is to use the sequential code run time Ts (S = Ts/Tp). The efficiency is E = S/p. Plot a curve of speedup versus number of processes used. Also plot efficiency versus number of processes.

• How well does your code scale? How does the problem size affect the efficiency? Are there ways that the parallel performance of your code can be improved? You may want to consider operation count in critical loops, memory usage, compiler optimization, communication overhead, etc. as ways to improve the speed of your code.
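
A minimal timing sketch is shown below, assuming pid and ierr are declared as in the earlier examples and that the section being timed is, for example, the Jacobi iteration loop:

real(kind=8) :: t1, t2

call MPI_BARRIER(MPI_COMM_WORLD, ierr)   ! optional: line the processes up before timing
t1 = MPI_WTIME()
! ... section of code being timed ...
t2 = MPI_WTIME()
if (pid == 0) write(*,*) "Run time (s): ", t2 - t1

MPI_WTIME returns wall-clock time in seconds as a double-precision value, so the same calls can bracket any section of the code; in the sequential version a Fortran intrinsic such as cpu_time or system_clock can be used instead.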