
Introduction to Parallel Programming

Student Workbook with Instructor’s Notes

Intel® Software College

Legal Lines and Disclaimers


The information contained in this document is provided for informational purposes only and represents the current view of Intel Corporation ("Intel") and its contributors ("Contributors"), as of the date of publication. Intel and the Contributors make no commitment to update the information contained in this document, and Intel reserves the right to make changes at any time, without notice.

DISCLAIMER. THIS DOCUMENT, IS PROVIDED "AS IS." NEITHER INTEL, NOR THE CONTRIBUTORS MAKE ANY REPRESENTATIONS OF ANY KIND WITH RESPECT TO PRODUCTS REFERENCED HEREIN, WHETHER SUCH PRODUCTS ARE THOSE OF INTEL, THE CONTRIBUTORS, OR THIRD PARTIES. INTEL, AND ITS CONTRIBUTORS EXPRESSLY DISCLAIM ANY AND ALL WARRANTIES, IMPLIED OR EXPRESS, INCLUDING WITHOUT LIMITATION, ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR ANY PARTICULAR PURPOSE, NON-INFRINGEMENT, AND ANY WARRANTY ARISING OUT OF THE INFORMATION CONTAINED HEREIN, INCLUDING WITHOUT LIMITATION, ANY PRODUCTS, SPECIFICATIONS, OR OTHER MATERIALS REFERENCED HEREIN. INTEL, AND ITS CONTRIBUTORS DO NOT WARRANT THAT THIS DOCUMENT IS FREE FROM ERRORS, OR THAT ANY PRODUCTS OR OTHER TECHNOLOGY DEVELOPED IN CONFORMANCE WITH THIS DOCUMENT WILL PERFORM IN THE INTENDED MANNER, OR WILL BE FREE FROM INFRINGEMENT OF THIRD PARTY PROPRIETARY RIGHTS, AND INTEL, AND ITS CONTRIBUTORS DISCLAIM ALL LIABILITY THEREFOR.

INTEL, AND ITS CONTRIBUTORS DO NOT WARRANT THAT ANY PRODUCT REFERENCED HEREIN OR ANY PRODUCT OR TECHNOLOGY DEVELOPED IN RELIANCE UPON THIS DOCUMENT, IN WHOLE OR IN PART, WILL BE SUFFICIENT, ACCURATE, RELIABLE, COMPLETE, FREE FROM DEFECTS OR SAFE FOR ITS INTENDED PURPOSE, AND HEREBY DISCLAIM ALL LIABILITIES THEREFOR. ANY PERSON MAKING, USING OR SELLING SUCH PRODUCT OR TECHNOLOGY DOES SO AT HIS OR HER OWN RISK.

Licenses may be required. Intel, its contributors and others may have patents or pending patent applications, trademarks, copyrights or other intellectual proprietary rights covering subject matter contained or described in this document. No license, express, implied, by estoppel or otherwise, to any intellectual property rights of Intel or any other party is granted herein. It is your responsibility to seek licenses for such intellectual property rights from Intel and others where appropriate.

Limited License Grant. Intel hereby grants you a limited copyright license to copy this document for your use and internal distribution only. You may not distribute this document externally, in whole or in part, to any other person or entity.

LIMITED LIABILITY. IN NO EVENT SHALL INTEL, OR ITS CONTRIBUTORS HAVE ANY LIABILITY TO YOU OR TO ANY OTHER THIRD PARTY, FOR ANY LOST PROFITS, LOST DATA, LOSS OF USE OR COSTS OF PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES, OR FOR ANY DIRECT, INDIRECT, SPECIAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF YOUR USE OF THIS DOCUMENT OR RELIANCE UPON THE INFORMATION CONTAINED HEREIN, UNDER ANY CAUSE OF ACTION OR THEORY OF LIABILITY, AND IRRESPECTIVE OF WHETHER INTEL, OR ANY CONTRIBUTOR HAS ADVANCE NOTICE OF THE POSSIBILITY OF SUCH DAMAGES. THESE LIMITATIONS SHALL APPLY NOTWITHSTANDING THE FAILURE OF THE ESSENTIAL PURPOSE OF ANY LIMITED REMEDY.

Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

*Other names and brands may be claimed as the property of others.

Copyright © 2007, Intel Corporation. All Rights Reserved.

Contents

Lab 1: Identifying Parallelism
Lab 2: Introducing Threads
Lab 3: Domain Decomposition with OpenMP
Lab 4: Critical Sections and Reductions with OpenMP
Lab 5: Implementing Task Decompositions
Lab 6: Analyzing Parallel Performance
Lab 7: Improving Parallel Performance
Lab 8: Choosing the Appropriate Thread Model
Instructor's Notes and Solutions


Lab 1: Identifying Parallelism

Time Required: Thirty minutes

Part A

For each of the following code segments, draw a dependence graph and determine whether the computation is suitable for parallelization. If the computation is suitable for parallelization, decide how it should be divided among three CPUs. You may assume that all functions are free of side effects.

Example 1:

   for (i = 0; i < 4; i++) {
      a[i] = 0.25 * i;
      b[i] = 4.0 / (a[i] * a[i]);
   }

Example 2:

   if (a < b) c = f(-1);
   else if (a == b) c = f(0);
   else c = f(1);

Example 3:

   for (i = 0; i < 4; i++)
      for (j = 0; j < 3; j++)
         a[i][j] = f(a[i][j] * b[j]);

Example 4:

   prime = 2;
   do {
      first = prime * prime;
      for (i = first; i < 10; i += prime) marked[i] = 1;
      while (marked[++prime]);
   } while (prime * prime < N);


Example 5:

   switch (i) {
      case 0:  a = f(x); b = g(y); break;
      case 1:  a = g(x); b = f(y); break;
      case -1: a = f(y); b = f(x); break;
   }

Example 6:

   sum = 0.0;
   for (i = 0; i < 9; i++)
      sum = sum + b[i];


Part B

Describe how parallelism could be used to reduce the time needed to perform each of the following tasks.

Example 7: A relational database table contains (among other things) student ID numbers and their cumulative GPAs. Find out the percentage of students with a cumulative GPA greater than 3.5.

Example 8: A ray-tracing program renders a realistic image by tracing one or more rays for each pixel of the display window.

Example 9: An operating system utility searches a disk and identifies every text file containing a particular phrase specified by the user.

Example 10: We want to improve a game similar to Civilization IV by reducing the amount of time the human player must wait for the virtual world to be set up.


Lab 2: Introducing Threads

Time Required: Thirty minutes

For each of the following programs or program segments:

1. determine whether the best parallelization approach is a domain decomposition or a task decomposition;

2. decide whether the best thread model is the fork/join model or the general threads model;

3. determine fork/join points (in the case of the fork/join model) or thread creation points (in the case of the general threads model); and

4. decide which variables should be shared and which variables should be private.

Example 1:

   /* Matrix multiplication */
   int i, j, k;
   double **a, **b, **c, tmp;
   ...
   for (i = 0; i < m; i++)
      for (j = 0; j < n; j++) {
         tmp = 0.0;
         for (k = 0; k < p; k++)
            tmp += a[i][k] * b[k][j];
         c[i][j] = tmp;
      }

Example 2:

   /* This program implements an Internet-based service that
      responds to number-theoretic queries */
   int main()
   {
      request r;
      ...
      while (1) {
         next_request (&r);
         acknowledge_request (r);
         switch (r.type) {


         case PRIME:   primality_test (r); break;
         case PERFECT: perfect_test (r); break;
         case WARING:  find_waring_integer (r); break;
         }
      }
      ...
   }

Example 3:

   double inner_product (double *x, double *y, int n)
   {
      int i;
      double result;

      result = 0.0;
      for (i = 0; i < n; i++)
         result += x[i] * y[i];
      return result;
   }

   int main (int argc, char *argv[])
   {
      double *d, *g, *t, w, x, y, z;
      int i;
      ...
      for (i = 0; i < n; i++)
         d[i] = -g[i] + (w/x) * d[i];
      y = inner_product (d, g, n);
      z = inner_product (d, t, n);
      ...
   }

Example 4:

   /* Finite difference method to solve string vibration problem
      (from Michael J. Quinn, "Parallel Programming in C with MPI
      and OpenMP", p. 325) */
   #include <stdio.h>
   #include <math.h>


   #define F(x) sin(3.14159*(x))
   #define G(x) 0.0
   #define a 1.0
   #define c 2.0
   #define m 2000
   #define n 1000
   #define T 1.0

   int main (int argc, char *argv[])
   {
      float h;
      int i, j;
      float k;
      float L;
      float u[m+1][n+1];

      h = a / n;
      k = T / m;
      L = (k*c/h)*(k*c/h);
      for (j = 0; j <= m; j++) u[j][0] = u[j][n] = 0.0;
      for (i = 1; i < n; i++) u[0][i] = F(i*h);
      for (i = 1; i < n; i++)
         u[1][i] = (L/2.0)*(u[0][i+1] + u[0][i-1]) +
            (1.0 - L) * u[0][i] + k * G(i*h);
      for (j = 1; j < m; j++)
         for (i = 1; i < n; i++)
            u[j+1][i] = 2.0*(1.0 - L) * u[j][i] +
               L*(u[j][i+1] + u[j][i-1]) - u[j-1][i];
      for (j = 0; j <= m; j++) {
         for (i = 0; i <= n; i++) printf ("%6.3f,", u[j][i]);
         putchar ('\n');
      }
      return 0;
   }


Lab 3: Domain Decomposition with OpenMP

Time Required: Fifty minutes

For each of the programs below

1. make the program parallel by adding the appropriate OpenMP pragmas;

2. compile the program;

3. execute the program for 1, 2, 3, and 4 threads; and

4. check the program outputs to verify they are the same.

Note: You will need to generate matrices for the matrix multiplication exercise; a utility program gen.c is included in the lab folder for this purpose. Compile this code, and run it to create files matrix_a and matrix_b; explicit usage is outlined in the code itself. Be sure to generate a workload sufficiently large (e.g., matrix dimensions 1000 x 1000) to be meaningful.

Program 1:

   /*
    * Matrix multiplication
    */
   #include <stdio.h>
   #include <stdlib.h>

   /*
    * Function 'rerror' is called when the program detects an
    * error and wishes to print an appropriate message and exit.
    */
   void rerror (char *s)
   {
      printf ("%s\n", s);
      exit (-1);
   }

   /*
    * Function 'allocate_matrix', passed the number of rows and columns,
    * allocates a two-dimensional matrix of floats.
    */
   void allocate_matrix (float ***subs, int rows, int cols)
   {
      int i;
      float *lptr, *rptr;
      float *storage;


      storage = (float *) malloc (rows * cols * sizeof(float));
      *subs = (float **) malloc (rows * sizeof(float *));
      for (i = 0; i < rows; i++)
         (*subs)[i] = &storage[i*cols];
      return;
   }

   /*
    * Given the name of a file containing a matrix of floats, function
    * 'read_matrix' opens the file and reads its contents.
    */
   void read_matrix (
      char *s,         /* File name */
      float ***subs,   /* 2D submatrix indices */
      int *m,          /* Number of rows in matrix */
      int *n)          /* Number of columns in matrix */
   {
      char error_msg[80];
      FILE *fptr;      /* Input file pointer */

      fptr = fopen (s, "r");
      if (fptr == NULL) {
         sprintf (error_msg, "Can't open file '%s'", s);
         rerror (error_msg);
      }
      fread (m, sizeof(int), 1, fptr);
      fread (n, sizeof(int), 1, fptr);
      allocate_matrix (subs, *m, *n);
      fread ((*subs)[0], sizeof(float), *m * *n, fptr);
      fclose (fptr);
      return;
   }

   /*
    * Passed a pointer to a two-dimensional matrix of floats and
    * the dimensions of the matrix, function 'print_matrix' prints
    * the matrix elements to standard output. If the matrix has more
    * than 10 columns, the output may not be easy to read.
    */
   void print_matrix (float **a, int rows, int cols)
   {
      int i, j;

      for (i = 0; i < rows; i++) {
         for (j = 0; j < cols; j++) printf ("%6.2f ", a[i][j]);
         putchar ('\n');
      }
      putchar ('\n');
      return;
   }


   /*
    * Function 'matrix_multiply' multiplies two matrices containing
    * floating-point numbers.
    */
   void matrix_multiply (float **a, float **b, float **c,
                         int arows, int acols, int bcols)
   {
      int i, j, k;
      float tmp;

      for (i = 0; i < arows; i++)
         for (j = 0; j < bcols; j++) {
            tmp = 0.0;
            for (k = 0; k < acols; k++)
               tmp += a[i][k] * b[k][j];
            c[i][j] = tmp;
         }
      return;
   }

   int main (int argc, char *argv[])
   {
      int m1, n1;       /* Dimensions of matrix 'a' */
      int m2, n2;       /* Dimensions of matrix 'b' */
      float **a, **b;   /* Two matrices being multiplied */
      float **c;        /* Product matrix */

      read_matrix ("matrix_a", &a, &m1, &n1);
      print_matrix (a, m1, n1);
      read_matrix ("matrix_b", &b, &m2, &n2);
      print_matrix (b, m2, n2);
      if (n1 != m2) rerror ("Incompatible matrix dimensions");
      allocate_matrix (&c, m1, n2);
      matrix_multiply (a, b, c, m1, n1, n2);
      print_matrix (c, m1, n2);
      return 0;
   }

Program 2:

   /*
    * Polynomial Interpolation
    *
    * This program demonstrates a function that performs polynomial
    * interpolation. The function is taken from "Numerical Recipes
    * in C", Second Edition, by William H. Press, Saul A. Teukolsky,
    * William T. Vetterling, and Brian P. Flannery.
    */
   #include <stdio.h>
   #include <stdlib.h>
   #include <math.h>

   #define N 20     /* Number of function sample points */


   #define X 14.5   /* Interpolate at this value of x */

   /* Function 'vector' is used to allocate vectors with
      subscript range v[nl..nh] */
   double *vector (long nl, long nh)
   {
      double *v;

      v = (double *) malloc ((nh-nl+2) * sizeof(double));
      return v-nl+1;
   }

   /* Function 'free_vector' is used to free up memory allocated
      with function 'vector' */
   void free_vector (double *v, long nl, long nh)
   {
      free ((char *) (v+nl-1));
   }

   /* Function 'polint' performs a polynomial interpolation */
   void polint (double xa[], double ya[], int n, double x,
                double *y, double *dy)
   {
      int i, m, ns = 1;
      double den, dif, dift, ho, hp, w;
      double *c, *d;

      dif = fabs(x-xa[1]);
      c = vector(1,n);
      d = vector(1,n);
      for (i = 1; i <= n; i++) {
         dift = fabs (x - xa[i]);
         if (dift < dif) {
            ns = i;
            dif = dift;
         }
         c[i] = ya[i];
         d[i] = ya[i];
      }
      *y = ya[ns--];
      for (m = 1; m < n; m++) {
         for (i = 1; i <= n-m; i++) {
            ho = xa[i] - x;
            hp = xa[i+m] - x;
            w = c[i+1] - d[i];
            den = ho - hp;
            den = w / den;
            d[i] = hp * den;
            c[i] = ho * den;
         }
         *y += (*dy = (2*ns < (n-m) ? c[ns+1] : d[ns--]));


      }
      free_vector (d, 1, n);
      free_vector (c, 1, n);
   }

   /* Functions 'sign' and 'init' are used to initialize the x and y
      vectors holding known values of the function. */
   int sign (int j)
   {
      if (j % 2 == 0) return 1;
      else return -1;
   }

   void init (int i, double *x, double *y)
   {
      int j;

      *x = (double) i;
      *y = sin(i);
   }

   /* Function 'main' demonstrates the polynomial interpolation function
      by generating some test points and then calling 'polint' with a
      value of x between two of the test points. */
   int main (int argc, char *argv[])
   {
      double x, y, dy;
      double *xa, *ya;
      int i;

      xa = vector (1, N);
      ya = vector (1, N);

      /* Initialize xa's and ya's */
      for (i = 1; i <= N; i++) {
         init (i, &xa[i], &ya[i]);
         printf ("f(%4.2f) = %13.11f\n", xa[i], ya[i]);
      }

      /* Interpolate polynomial at X */
      polint (xa, ya, N, X, &y, &dy);
      printf ("\nf(%6.3f) = %13.11f with error bound %13.11f\n",
              X, y, fabs(dy));

      free_vector (xa, 1, N);
      free_vector (ya, 1, N);
      return 0;
   }


Lab 4: Critical Sections and Reductions with OpenMP

Time Required: Twenty minutes

Exercise 1

Make this program parallel by adding the appropriate OpenMP pragmas and clauses. Compile the program, execute it on 1 and 2 threads, and make sure the program output is the same as the sequential program. Finally, compare the execution times of the sequential, single-threaded, and double-threaded programs.

   /*
    * A small college is thinking of instituting a six-digit student ID
    * number. It wants to know how many "acceptable" ID numbers there
    * are. An ID number is "acceptable" if it has no two consecutive
    * identical digits and the sum of the digits is not 7, 11, or 13.
    *
    * 024332 is not acceptable because of the repeated 3s.
    * 204124 is not acceptable because the digits add up to 13.
    * 304530 is acceptable.
    */
   #include <stdio.h>

   /*
    * Function "no_problem_with_digits" extracts the digits from
    * the ID number from right to left, making sure that there are
    * no repeated digits and that the sum of the digits is not 7,
    * 11, or 13.
    */
   int no_problem_with_digits (int i)
   {
      int j;
      int latest;   /* Digit currently being examined */
      int prior;    /* Digit to the right of "latest" */
      int sum;      /* Sum of the digits */

      prior = -1;
      sum = 0;
      for (j = 0; j < 6; j++) {
         latest = i % 10;
         if (latest == prior) return 0;
         sum += latest;
         prior = latest;
         i /= 10;
      }
      if ((sum == 7) || (sum == 11) || (sum == 13)) return 0;
      return 1;


   }

   /*
    * Function "main" iterates through all possible six-digit ID
    * numbers (integers from 0 to 999999), counting the ones that
    * meet the college's definition of "acceptable."
    */
   int main (void)
   {
      int count;   /* Count of acceptable ID numbers */
      int i;

      count = 0;
      for (i = 0; i < 1000000; i++)
         if (no_problem_with_digits (i)) count++;
      printf ("There are %d acceptable ID numbers\n", count);
      return 0;
   }

Exercise 2

Make this program parallel by adding the appropriate OpenMP pragmas and clauses. Compile the program, execute it on 1 and 2 threads, and make sure the program output is the same as the sequential program. Finally, compare the execution times of the sequential, single-threaded, and double-threaded programs.

   /*
    * This program uses the Sieve of Eratosthenes to determine the
    * number of prime numbers less than or equal to 'n'.
    *
    * Adapted from code appearing in "Parallel Programming in C with
    * MPI and OpenMP," by Michael J. Quinn, McGraw-Hill (2004).
    */
   #include <stdio.h>
   #include <stdlib.h>

   #define MIN(a,b) ((a)<(b)?(a):(b))

   int main (int argc, char *argv[])
   {
      int count;      /* Prime count */
      int first;      /* Index of first multiple */
      int i;
      int index;      /* Index of current prime */
      char *marked;   /* Marks for 2,...,'n' */
      int n;          /* Sieving from 2, ..., 'n' */
      int prime;      /* Current prime */

      if (argc != 2) {
         printf ("Command line: %s <n>\n", argv[0]);
         exit (1);
      }


      n = atoi(argv[1]);
      marked = (char *) malloc (n-1);
      if (marked == NULL) {
         printf ("Cannot allocate enough memory\n");
         exit (1);
      }
      for (i = 0; i < n-1; i++) marked[i] = 1;
      index = 0;
      prime = 2;
      do {
         first = prime * prime - 2;
         for (i = first; i < n-1; i += prime) marked[i] = 0;
         while (!marked[++index]);
         prime = index + 2;
      } while (prime * prime <= n);
      count = 0;
      for (i = 0; i < n-1; i++) count += marked[i];
      printf ("There are %d primes less than or equal to %d\n",
              count, n);
      return 0;
   }

Exercise 3

The Monte Carlo method refers to the use of statistical sampling to solve a problem. Some experts say that more than half of all supercomputing cycles are devoted to Monte Carlo computations. A Monte Carlo program can benefit from parallel processing in two ways: parallel processing can be used to reduce the time needed to find a solution of a particular resolution, or it can be used to find a more accurate solution in the same amount of time. This assignment is to reduce the time needed to find a solution of a particular accuracy.

The following C program uses the Monte Carlo method to come up with an approximation to pi. Add OpenMP directives to make the program suitable for execution on multiple threads. Divide the number of points to be generated evenly among the threads. Compare the execution times of the sequential, single-threaded, and double-threaded programs.

   /*
    * This program uses the Monte Carlo method to come up with an
    * approximation to pi. Taken from "Parallel Programming in C with
    * MPI and OpenMP," by Michael J. Quinn, McGraw-Hill (2004).
    */
   #include <stdio.h>
   #include <stdlib.h>

   int main (int argc, char *argv[])
   {
      int count;              /* Points inside unit circle */
      int i;
      int samples;            /* Number of points to generate */
      unsigned short xi[3];   /* Random number seed */


      double x, y;            /* Coordinates of point */

      /* Number of points and 3 random number seeds are
         command-line arguments. */
      if (argc != 5) {
         printf ("Command-line syntax: %s <samples> "
                 "<seed> <seed> <seed>\n", argv[0]);
         exit (-1);
      }
      samples = atoi (argv[1]);
      count = 0;
      xi[0] = atoi(argv[2]);
      xi[1] = atoi(argv[3]);
      xi[2] = atoi(argv[4]);
      for (i = 0; i < samples; i++) {
         x = erand48(xi);
         y = erand48(xi);
         if (x*x + y*y <= 1.0) count++;
      }
      printf ("Estimate of pi: %7.5f\n", 4.0 * count / samples);
      return 0;
   }


Lab 5: Implementing Task Decompositions

Time Required: Sixty minutes

Exercise 1

Make this quicksort program parallel by adding the appropriate OpenMP pragmas and clauses. Compile the program, execute it on 1 and 2 threads, and make sure the program is still correctly sorting the elements of array 'A'. Finally, compare the execution times of the sequential, single-threaded, and double-threaded programs.

   /*
    * Stack-based Quicksort
    *
    * The quicksort algorithm works by repeatedly dividing unsorted
    * sub-arrays into two pieces: one piece containing the smaller
    * elements and the other piece containing the larger elements.
    * The splitter element, used to subdivide the unsorted sub-array,
    * ends up in its sorted location. By repeating this process on
    * smaller and smaller sub-arrays, the entire array gets sorted.
    *
    * The typical implementation of quicksort uses recursion. This
    * implementation replaces recursion with iteration. It manages its
    * own stack of unsorted sub-arrays. When the stack of unsorted
    * sub-arrays is empty, the array is sorted.
    */
   #include <stdio.h>
   #include <stdlib.h>

   #define MAX_UNFINISHED 1000   /* Maximum number of unsorted sub-arrays */

   /* Global shared variables */

   struct {
      int first;   /* Low index of unsorted sub-array */
      int last;    /* High index of unsorted sub-array */
   } unfinished[MAX_UNFINISHED];   /* Stack */

   int unfinished_index;   /* Index of top of stack */
   float *A;               /* Array of elements to be sorted */
   int n;                  /* Number of elements in 'A' */

   /* Function 'swap' is called when we want to exchange two
      array elements */
   void swap (float *x, float *y)
   {


      float tmp;

      tmp = *x;
      *x = *y;
      *y = tmp;
   }

   /* Function 'partition' actually does the sorting by dividing an
      unsorted sub-array into two parts: those less than or equal to
      the splitter, and those greater than the splitter. The splitter
      is the last element in the unsorted sub-array. The splitter ends
      up in its final, sorted location. The function returns the final
      location of the splitter (its index). This code is an
      implementation of the algorithm appearing in "Introduction to
      Algorithms, Second Edition", by Cormen, Leiserson, Rivest, and
      Stein (The MIT Press, 2001). */
   int partition (int first, int last)
   {
      int i, j;
      float x;

      x = A[last];
      i = first - 1;
      for (j = first; j < last; j++)
         if (A[j] <= x) {
            i++;
            swap (&A[i], &A[j]);
         }
      swap (&A[i+1], &A[last]);
      return (i+1);
   }

   /* Function 'quicksort' repeatedly retrieves the indices of unsorted
      sub-arrays from the stack and calls 'partition' to divide these
      sub-arrays into two pieces. It keeps one of the pieces and puts
      the other piece on the stack of unsorted sub-arrays. Eventually
      it ends up with a piece that doesn't need to be sorted. At this
      point it gets the indices of another unsorted sub-array from the
      stack. The function continues until the stack is empty. */
   void quicksort (void)
   {
      int first;
      int last;
      int my_index;
      int q;   /* Split point in array */

      while (unfinished_index >= 0) {
         my_index = unfinished_index;
         unfinished_index--;
         first = unfinished[my_index].first;
         last = unfinished[my_index].last;
         while (first < last) {


            /* Split unsorted array into two parts */
            q = partition (first, last);

            /* Put upper portion on stack of unsorted sub-arrays */
            if ((unfinished_index+1) >= MAX_UNFINISHED) {
               printf ("Stack overflow\n");
               exit (-1);
            }
            unfinished_index++;
            unfinished[unfinished_index].first = q+1;
            unfinished[unfinished_index].last = last;

            /* Keep lower portion for next iteration of loop */
            last = q-1;
         }
      }
   }

   /* Function 'print_float_array', given the address and length of an
      array of floating-point values, prints the values to standard
      output, one element per line. */
   void print_float_array (float *A, int n)
   {
      int i;

      printf ("Contents of array:\n");
      for (i = 0; i < n; i++)
         printf ("%6.4f\n", A[i]);
   }

   /* Function 'verify_sorted' returns 1 if the elements of array 'A'
      are in monotonically increasing order; it returns 0 otherwise. */
   int verify_sorted (float *A, int n)
   {
      int i;

      for (i = 0; i < n-1; i++)
         if (A[i] > A[i+1]) return 0;
      return 1;
   }

   /* Function 'main' gets the array size and random number seed from
      the command line, initializes the array, prints the unsorted
      array, sorts the array, and prints the sorted array. */
   int main (int argc, char *argv[])
   {
      int i;
      int seed;               /* Seed component input by user */
      unsigned short xi[3];   /* Random number seed */


      if (argc != 3) {
         printf ("Command-line syntax: %s <n> <seed>\n", argv[0]);
         exit (-1);
      }
      seed = atoi (argv[2]);
      xi[0] = xi[1] = xi[2] = seed;
      n = atoi (argv[1]);
      A = (float *) malloc (n * sizeof(float));
      for (i = 0; i < n; i++) A[i] = erand48(xi);
      /* print_float_array (A, n); */
      unfinished[0].first = 0;
      unfinished[0].last = n-1;
      unfinished_index = 0;
      quicksort ();
      /* print_float_array (A, n); */
      if (verify_sorted (A, n)) printf ("Elements are sorted\n");
      else printf ("ERROR: Elements are NOT sorted\n");
      return 0;
   }


Lab 6: Analyzing Parallel Performance

Time Required: Thirty-five minutes

Exercise 1

You are responsible for maintaining a library of core functions used by a wide variety of programs in an application suite. Your supervisor has noted the availability of multi-core processors and wants to know whether rewriting the library of functions using threads would significantly improve the performance of the programs in the application suite. What do you need to do to provide a meaningful answer?

Exercise 2

Somebody wrote an OpenMP program to solve the problem posed in Lab 5 and benchmarked its performance sorting 25 million keys. Here are the run times of the program, as reported by the command-line utility "time":

   Threads   Run Time (sec)
      1           8.535
      2          21.183
      3          22.184
      4          25.060

What is the efficiency of the multithreaded program for 2, 3, and 4 threads? What can you conclude about the design of the parallel program? Can you offer any suggestions for improving the performance of the program?

Exercise 3

A co-worker has been working on converting a sequential program into a multithreaded program. At this point, only some of the functions of the program have been made parallel. On a key data set, the multithreaded program exhibits these execution times:

   Processors   Time (sec)
       1            5.34
       2            3.74
       3            3.31
       4            3.10

Is your co-worker on the right track? Would you advise your co-worker to continue the parallelization effort?


Exercise 4

You've worked hard to convert a key application to multithreaded execution, and you've benchmarked it on a quad-core processor. Here are the results:

   Threads   Time (sec)
      1         24.3
      2         14.6
      3         11.7
      4         10.6

Suppose an 8-core version of the processor becomes available.

(a) Predict the execution time of this algorithm on an 8-core processor.

(b) Give a reason why the actual speedup may be lower than expected.

(c) Give two reasons why the actual speedup may be higher than expected.

Exercise 5

You have benchmarked your multithreaded application on a system with CPU A, and it exhibits this performance:

   Threads   Time (sec)
      1        14.20
      2         7.81
      3         5.87
      4         4.72

Next you benchmark the same application on an otherwise identical system that has been upgraded with a newer processor, CPU B, and it exhibits this performance:

   Threads   Time (sec)
      1        11.83
      2         7.01
      3         5.42
      4         4.59

CPU B is clearly faster than CPU A: the execution times are lower for every thread count. However, while the single-thread performance improves by 20% with CPU B, the program running on four threads is only 3% faster. Explain how this can happen.


Exercise 6

Hard disk drives continue to improve in speed at a slower rate than microprocessors. What are the implications of this trend for developers of multithreaded applications? What can be done about it?


Lab 7: Improving Parallel Performance

Time Required: Forty-five minutes

Exercise 1

Recall that the parallel quicksort program developed in Lab 5 exhibited poor performance because of excessive contention among the tasks for access to the shared stack containing the indices of unsorted sub-arrays. You can dramatically improve the performance by reducing the frequency at which threads access the shared stack.

One way to reduce accesses to the shared stack is to switch to sequential quicksort for sub-arrays smaller than a threshold size. In other words, when a thread encounters a sub-array smaller than the threshold size and partitions it into two pieces, it does not put one piece on the stack and then work on the remaining piece. Instead, it sorts both pieces itself by recursively calling the sequential quicksort function.

Use this strategy, and the sequential quicksort function given below, to improve the performance of the parallel quicksort program you developed in Lab 5. Run some experiments to determine the best threshold size for switching to sequential quicksort. (A sketch of how the threshold test might be wired into the parallel loop follows the function.)

   void seq_quicksort (int first, int last)
   {
      int q;   /* Split point in array */

      if (first < last) {
         q = partition (first, last);
         seq_quicksort (first, q-1);
         seq_quicksort (q+1, last);
      }
   }
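The fragment below is a rough sketch, not a reference solution, of how the threshold test might be spliced into the inner loop of your Lab 5 parallel quicksort. THRESHOLD is a placeholder value you are expected to tune experimentally, and any bookkeeping your Lab 5 version does (such as counting sorted elements) is only indicated by comments.

   #define THRESHOLD 1000   /* placeholder cutoff; tune experimentally */

   /* Sketch of the inner loop of the parallel quicksort */
   while (first < last) {
      if (last - first + 1 < THRESHOLD) {
         /* Small sub-array: partition it and finish both halves locally
            with sequential quicksort, generating no stack traffic. */
         q = partition (first, last);
         seq_quicksort (first, q-1);
         seq_quicksort (q+1, last);
         break;   /* go back to the shared stack for more work */
      }
      /* Large sub-array: proceed exactly as in Lab 5 */
      q = partition (first, last);
      /* ... push [q+1, last] on the shared stack inside a critical
         section and keep [first, q-1] for the next iteration ... */
      last = q - 1;
   }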

Exercise 2

The following C program counts the number of primes less than n. Use OpenMP pragmas and clauses to enable it to run on a multiprocessor. Make as many changes as you can in the time allowed to improve the performance of the program on the maximum available number of processors.

   /*
    * This C program counts the number of primes between 2 and n.
    */
   #include <stdio.h>
   #include <stdlib.h>
   #include <math.h>
   #include <omp.h>

   /* Passed a positive integer 'p', function 'is_prime' returns 1 if


      'p' is prime and 0 if 'p' is not prime. */
   int is_prime (int p)
   {
      int i;

      if (p < 2) return 0;
      i = 2;
      while (i*i <= p) {
         if (p % i == 0) return 0;
         i++;
      }
      return 1;
   }

   int main (int argc, char *argv[])
   {
      int *a;      /* Keeps track of which numbers are primes */
      int count;   /* Number of primes between 2 and n */
      int i;
      int n;       /* We're finding primes up through this number */
      int t;       /* Desired number of concurrent threads */

      /* Get problem size and number of threads */
      if (argc != 3) {
         printf ("Command line syntax: %s <n> <threads>\n", argv[0]);
         exit (-1);
      }
      n = atoi(argv[1]);
      t = atoi(argv[2]);
      omp_set_num_threads (t);

      a = (int *) malloc (n * sizeof(int));
      for (i = 0; i < n; i++) a[i] = is_prime(i);
      count = 0;
      for (i = 0; i < n; i++) count += a[i];
      printf ("There are %d primes less than %d\n", count, n);
      return 0;
   }


Lab 8: Choosing the Appropriate Thread Model

Time Required: Forty minutes

For each of the following problems, decide if it would be more suitable for a parallel program based on OpenMP or a parallel program based on Win32/Java/POSIX threads.

Problem 1

You are working on a software package that will be able to evaluate a wide variety of complicated integrals, such as:

   ∫∫ 4 tan(x) cos(xy) dx dy .

There are a wide variety of techniques to evaluate integrals, and it is impossible to determine in advance which technique will be most effective in solving a particular problem. Your parallel program will simultaneously attempt many techniques, stopping as soon as one technique has successfully evaluated the integral. (Example from Practical Parallel Programming by Gregory V. Wilson, The MIT Press, 1995)

Problem 2

A radiation source emits neutrons that hit a homogeneous plate. The plate may reflect the neutron, absorb it, or allow it to be transmitted. Your parallel program will use a Monte Carlo method to estimate the probability of each of these three outcomes, based on the plate's make-up and thickness. The program will simulate the paths of millions of particles, and it will keep track of how many particles have each of the three possible outcomes. In the course of the simulation, some particles are reflected immediately, while others bounce around for a long time in the plate before finally being reflected, absorbed, or transmitted. Hence the variance in the amount of time needed to simulate the path of a single particle is quite large. (Example from Parallel Programming in C with MPI and OpenMP by Michael J. Quinn, McGraw-Hill, 2004)

Problem 3

You are on a team creating a parallel program that will give the user the ability to see satellite images from many locations on Earth. The program will have to perform a wide variety of tasks, including responding to keyboard presses and mouse clicks, translating an address into map coordinates, retrieving relevant satellite images and map information from a database server, displaying these images, and zooming and panning.

Problem 4

An existing sequential program converts a PostScript program into Portable Document Format (PDF). A PostScript program consists of a prolog followed by a script. The prolog contains application-specific function definitions. The script consists of a sequence of page descriptions. Each page description is independent of every other page description, depending only on the definitions in the prolog. A PDF document contains one or more pages. A page may contain text, graphics, and/or images. Because the sequential program executes too slowly, your task is to design a parallel program that reduces the conversion time.

Problem 5

Your team is developing a program supporting digital video editing. You are responsible for the algorithm that divides raw video footage into scenes. The algorithm analyzes the entire video frame by frame. A frame that is significantly different from the preceding frame should be marked as the beginning of a new scene. To make the process as quick as possible, you’ve been asked to implement a parallel version of the algorithm.


Instructor’s Notes and Solutions


Lab 1: Identifying Parallelism

Part A

Example 1: This is a straightforward loop amenable to a domain decomposition.

Example 2: Since only one of the three branches will ever execute, there is no point in using more than one CPU to execute this code segment.

Example 3: This code segment is amenable to domain decomposition. It is designed to raise the question of whether it is better to divide the inner loop or the outer loop among the CPUs. This is the sort of question we'll answer later in the short course.

Example 4: The inner for loop is amenable to a domain decomposition. The iterations of the outer do-while loop cannot be executed in parallel.

Example 5: It does no good to assign a different case to each CPU, since only one case will execute. However, we could do a functional decomposition within each of the cases, keeping two CPUs occupied. In other words, in case 0 one CPU could compute 'f(x)' while another CPU computed 'g(y)'; in case 1 one CPU could compute 'g(x)' while another CPU computed 'f(y)'; and in case -1 one CPU could compute 'f(y)' while another CPU computed 'f(x)'.

Example 6: As written, this code is not amenable to parallelization, since every addition involves the variable sum. This ought to give the students pause, since one example of domain decomposition in the slides was quite similar to this. We can transform this code into code suitable for domain decomposition by giving each CPU a temporary memory location it can use for accumulating the subtotal. This will be discussed in a subsequent lecture.
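A minimal sketch of that transformation, written with an OpenMP reduction clause (the mechanism is covered later in the course); the array contents are placeholders:

   #include <stdio.h>

   int main (void)
   {
      double b[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};   /* placeholder data */
      double sum = 0.0;
      int i;

      /* Each thread accumulates a private subtotal; the subtotals are
         added into 'sum' when the threads join. */
      #pragma omp parallel for reduction(+:sum)
      for (i = 0; i < 9; i++)
         sum = sum + b[i];

      printf ("sum = %f\n", sum);
      return 0;
   }

Compiled with an OpenMP-capable compiler (for example, gcc -fopenmp), this produces the same result as the sequential loop.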


Part B

Example 7: We could do a domain decomposition of the database table, giving each CPU its share of the rows. After each CPU has come up with its subtotals (the number of students considered and the number of students with GPAs > 3.5), the CPUs could add these into two global sums, at which point one CPU would compute the global fraction.

Example 8: This problem sounds easier than it is. Ray tracing is certainly amenable to domain decomposition. The problem is that some pixels take much longer to trace than others. For example, rays passing through glass are much more complicated than rays that strike no objects. We want to motivate thinking about how work should be allocated to CPUs.

Example 9: This application is amenable to a variety of different functional decompositions. One function is to search the file system and find text files. Another function searches text files for the phrase. Should a new process be created every time a text file is discovered? Would it be better to have a central list of files yet to be searched and have the text-file-scanning processes go to this list when they need work? The idea is to promote creative thinking, not answer the question definitively.

Example 10: This is another question designed to promote creative thinking. The world is initialized to a random state. Is it okay if the world created by a two-CPU system is different from a world created by a one-CPU system? Should each CPU be given an equal-sized region of the virtual world? What if one region has more continents than another? How will we balance the work among the CPUs? In what ways will the CPUs be able to work independently? In what ways will the CPUs need to exchange information with each other?


Lab 2: Introducing Threads

Example 1: The best parallelization approach is domain decomposition. The best loop to make parallel is the outermost loop indexed by i, because then there is only a single fork/join operation, minimizing overhead. Variables a, b, and c should be shared. Variables i, j, k, and tmp should be private. (A sketch of this choice, written as an OpenMP pragma, appears after Example 3 below.)

Example 2: We can make this program parallel using the general threads model. Instead of the main thread calling primality_test, perfect_test, or find_waring_integer, it forks another thread to perform the function and return the answer to the machine originating the request. Every created thread will have a private copy of r.

Some people would call this an example of task parallelism, since different threads are executing different functions. Others would call it domain parallelism, because the number of threads created is proportional to the size of the data set. Since the parallelism scales with the amount of data rather than the number of functions, the second characterization is probably better.

The advantage of this strategy is that it enables the main thread to get back quickly to function next_request, improving the responsiveness of the system, i.e., how quickly it acknowledges a request. The risk is that if requests come in too quickly, we could end up with far more threads than CPUs, greatly increasing the time users must wait for their requests to be handled. That leads to a nice discussion topic: how could this problem be avoided?

Example 3: Function inner_product is amenable to domain decomposition. The fork would happen at the beginning of the for loop, and the join would happen at the end of the for loop. Variable i would be private, and variables result, x, and y would be shared. Bright students should realize that we could have a problem if multiple threads update result at the same time. This is called a race condition. We'll discuss this problem in detail in the fourth lecture.

Another approach would be to use a functional decomposition in the main function to execute both calls to inner_product in parallel. That eliminates the problem described in the previous paragraph, because only one thread executes each call to inner_product. If we do a functional decomposition, variables i and result inside inner_product are private variables. The for loop in function main is amenable to a domain decomposition. Variable i is the only private variable.
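A sketch of the Example 1 decision expressed as an OpenMP pragma (the syntax is introduced in Lab 3). The function name is made up for illustration, and the matrices are assumed to be allocated elsewhere:

   /* Loop index i is private automatically; j, k, and tmp are made
      private explicitly; a, b, c, m, n, and p are shared by default.
      The fork occurs just before the i loop and the join just after it. */
   void matrix_multiply_sketch (double **a, double **b, double **c,
                                int m, int n, int p)
   {
      int i, j, k;
      double tmp;

      #pragma omp parallel for private(j, k, tmp)
      for (i = 0; i < m; i++)
         for (j = 0; j < n; j++) {
            tmp = 0.0;
            for (k = 0; k < p; k++)
               tmp += a[i][k] * b[k][j];
            c[i][j] = tmp;
         }
   }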


Example 4: This is a short but complete C program amenable to domain decomposition. The first three for loops can be parallelized with domain decomposition. The last for loop is not amenable to parallel execution, since we want to print the values in the correct order. The bulk of the processing time, however, is spent in the doubly-nested loop; the execution time of the other computational loops is trivial compared to the time spent there. For that reason, this loop ought to be made parallel first using domain decomposition.

Analyzing the doubly-nested loop, we see that we can't compute row j+1 of u until we've computed rows j and j-1. In other words, there is a data dependence from iteration j to iteration j+1 (and from iteration j-1 to iteration j). Hence we cannot execute all the iterations of the outer for loop in parallel. However, we can execute all the iterations of the inner for loop indexed by i in parallel. So there should be a fork every time we reach the inner for loop indexed by i and a join every time we finish that loop.
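A sketch of where the pragma would go in the Example 4 program; only the dominant doubly-nested loop is shown, using the variables declared in that program:

   /* The j loop carries the dependence, so only the inner i loop runs in
      parallel: a fork at each entry to the i loop and a join at its end.
      u, L, j, m, and n are shared; i is private. */
   for (j = 1; j < m; j++) {
      #pragma omp parallel for
      for (i = 1; i < n; i++)
         u[j+1][i] = 2.0*(1.0 - L) * u[j][i]
                   + L*(u[j][i+1] + u[j][i-1]) - u[j-1][i];
   }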


Lab 3: Domain Decomposition with OpenMP

At this point in the course, the variety of programs that the students can make parallel is limited because they do not yet know about critical sections. That means we have to stick with "embarrassingly parallel" applications in which the threads work completely independently. Here are two programs that fit the bill. At the end I give a third alternative.

Program 1: Matrix multiplication is a relatively easy program to make parallel. In function matrix_multiply, the two best candidates are the outermost for loop indexed by i or the middle for loop indexed by j. You may wish to talk with the students about the problems with trying to make parallel the inner for loop indexed by k. To decide between the i and the j loops, students should be thinking about grain size and how matrices are allocated in C (row-major order). Both of these considerations lead to making the outer loop the parallel loop.

Also note that this is the standard matrix multiplication algorithm found in most textbooks. Its weakness is that in order to compute a row of the product matrix, it references a row of the first factor matrix and ALL of the second factor matrix. When matrices are large, the second factor matrix is unlikely to fit completely in cache memory. This means that the cache hit rate of this algorithm is poor. A block-oriented matrix multiplication program can exhibit a much higher cache hit rate and significantly outperform this program. (A sketch of the block-oriented idea appears after the Program 2 notes below.)

Program 2: This example is designed to give students a greater challenge. Ultimately, they should figure out that the best candidate for parallelization is the for loop indexed by i inside the for loop indexed by m inside function polint. In truth, unless n is very large, this program is unlikely to benefit much from parallelization.
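For reference, a sketch of the block-oriented idea; the function name and the tile size BS are made up for illustration, and BS would have to be tuned to the target cache:

   #define BS 64   /* placeholder tile size */

   void matmul_blocked (float **a, float **b, float **c,
                        int arows, int acols, int bcols)
   {
      int i, j, k, ii, jj, kk;

      /* Clear the product, then accumulate it block by block so the
         BS x BS working set of 'b' stays resident in cache. */
      for (i = 0; i < arows; i++)
         for (j = 0; j < bcols; j++) c[i][j] = 0.0;

      for (ii = 0; ii < arows; ii += BS)
         for (kk = 0; kk < acols; kk += BS)
            for (jj = 0; jj < bcols; jj += BS)
               for (i = ii; i < arows && i < ii + BS; i++)
                  for (k = kk; k < acols && k < kk + BS; k++)
                     for (j = jj; j < bcols && j < jj + BS; j++)
                        c[i][j] += a[i][k] * b[k][j];
   }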

If these programs are unacceptable, another likely candidate would be a program to compute the Mandelbrot set. This program has several advantages. It is short (if the graphics routines are not counted), consumes a great deal of CPU time, produces a pretty output, and can generate interesting questions about load balancing among threads. The principal disadvantage of doing an exercise on the Mandelbrot set is that it has been done so many times before.


Lab 4: Critical Sections and Reductions with OpenMP

Exercise 1: This should be an easy program for the students to parallelize using the parallel for pragma with a reduction clause. They need to figure out that the loop to make parallel is the one inside function main, not the one inside function no_problem_with_digits, and they can easily rewrite the for loop to get a statement of the form count += v;

Exercise 2: The final for loop is an example of a reduction. The for loop inside the do...while loop is also amenable to parallelization. The initialization of array marked is another opportunity. Students should test their programs for small values of n and benchmark their programs for large values of n. There are 25 primes less than 100 and 168 primes less than 1000. A good value of n for benchmarking is 10 million.

Exercise 3: Making this program parallel is more difficult than it may first appear. It won't work simply to make the for loop parallel and add a reduction clause for the summation to count. The reason is subtle: variable xi, containing the random number seeds, must be private to each thread and must be initialized to different values for each thread. Otherwise, every thread will generate the same sequence of random numbers.

How can each thread come up with its own random number seed? We need a way to get a thread-dependent value for xi[0], xi[1], or xi[2]. The way to do this is through function omp_get_thread_num(). The idea behind this exercise, then, is to help the students get to the point where they realize they need a function like this, and then give them the function name. That means the entire block of code from the initialization of count to the end of the for loop needs to be in a parallel block. So I'd suggest, rather than using a reduction clause on the for loop, creating a private variable to keep track of the local count and then simply having a critical section at the very end of the block to add the local count to the global count.
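A sketch of that structure for Exercise 3, using the variables of the program given in the lab; folding the thread number into only the third seed component is just one reasonable choice, not the only one:

   count = 0;
   #pragma omp parallel
   {
      unsigned short my_xi[3];   /* per-thread random number seed */
      double x, y;
      int i, local_count = 0;
      int id = omp_get_thread_num();
      int nthreads = omp_get_num_threads();

      /* Give every thread a different seed so the streams differ. */
      my_xi[0] = xi[0];
      my_xi[1] = xi[1];
      my_xi[2] = xi[2] + id;

      /* Divide the samples among the threads. */
      for (i = id; i < samples; i += nthreads) {
         x = erand48 (my_xi);
         y = erand48 (my_xi);
         if (x*x + y*y <= 1.0) local_count++;
      }

      /* Fold the private subtotal into the shared count. */
      #pragma omp critical
      count += local_count;
   }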


Lab 5: Implementing Task Decompositions

Exercise 1: This is a challenging exercise, but it can teach students a lot about the issues they must confront when making a program parallel. When students get their parallel program executing correctly, they should feel a real sense of satisfaction.

In order to get the program working correctly, students must clear three hurdles. The first one is easy; the second and third are tougher.

First, students must ensure that all accesses to the shared stack of unsorted sub-arrays occur inside critical sections. All of these accesses occur inside function 'quicksort'. It's easy for students to make the critical sections too small.

Second, students must realize that just because there is something on the stack when a thread tests 'unfinished_index' in the outer while loop's condition, that doesn't mean there will still be something on the stack when the thread enters the while loop. In other words, there is a race condition. So the code for getting something off the stack must be put inside an if statement. If the stack index is < 0, then 'first' and 'last' must be given values such that function 'partition' doesn't get called.

Finally, students need to figure out that a thread can't simply stop when there is no work to do. Otherwise, there is a good chance all but one of the threads will exit 'quicksort' while one thread does the very first partitioning step. The loop condition must be rewritten so that threads keep looking for work as long as the array is unsorted. How do we know when the array is sorted? We know that every time we call function 'partition', one more array element is put in the right place. We also know that every time an unsorted sub-array has exactly one element in it, that element is in the right place. So we need to create a global counter that keeps track of how many elements are in their sorted positions. Threads should exit 'quicksort' only when this global counter reaches 'n'.

Making the program parallel by adding a new global variable, new logic, and OpenMP directives adds 30-40 lines to its length. With appropriate hints, students should be able to complete this exercise in about an hour.

Solution:

   /*
    * Stack-based Quicksort
    *
    * The quicksort algorithm works by repeatedly dividing unsorted
    * sub-arrays into two pieces: one piece containing the smaller
    * elements and the other piece containing the larger elements.
    * The splitter element, used to subdivide the unsorted sub-array,
    * ends up in its sorted location. By repeating this process on
    * smaller and smaller sub-arrays, the entire array gets sorted.
    *
    * The typical implementation of quicksort uses recursion. This


    * implementation replaces recursion with iteration. It manages its
    * own stack of unsorted sub-arrays. When the stack of unsorted
    * sub-arrays is empty, the array is sorted.
    */
   #include <stdio.h>
   #include <stdlib.h>
   #include <omp.h>

   #define MAX_UNFINISHED 1000   /* Maximum number of unsorted sub-arrays */

   /* Global shared variables */

   struct {
      int first;   /* Low index of unsorted sub-array */
      int last;    /* High index of unsorted sub-array */
   } unfinished[MAX_UNFINISHED];   /* Stack */

   int unfinished_index;   /* Index of top of stack */
   float *A;               /* Array of elements to be sorted */
   int n;                  /* Elements in A */
   int num_sorted;         /* Sorted elements in A */

   /* Function 'swap' is called when we want to exchange two
      array elements */
   void swap (float *x, float *y)
   {
      float tmp;

      tmp = *x;
      *x = *y;
      *y = tmp;
   }

   /* Function 'partition' actually does the sorting by dividing an
      unsorted sub-array into two parts: those less than or equal to
      the splitter, and those greater than the splitter. The splitter
      is the last element in the unsorted sub-array. The splitter ends
      up in its final, sorted location. The function returns the final
      location of the splitter (its index). */
   int partition (int first, int last)
   {
      int i, j;
      float x;

      x = A[last];
      i = first - 1;
      for (j = first; j < last; j++)
         if (A[j] <= x) {
            i++;


            swap (&A[i], &A[j]);
         }
      swap (&A[i+1], &A[last]);
      return (i+1);
   }

   /* Function 'quicksort' repeatedly retrieves the indices of unsorted
      sub-arrays from the stack and calls 'partition' to divide these
      sub-arrays into two pieces. It keeps one of the pieces and puts
      the other piece on the stack of unsorted sub-arrays. Eventually
      it ends up with a piece that doesn't need to be sorted. At this
      point it gets the indices of another unsorted sub-array from the
      stack. Each thread continues until every element is in place. */
   void quicksort (void)
   {
      int first;
      int id;
      int last;
      int my_count;
      int my_index;
      int q;   /* Split point in array */

      id = omp_get_thread_num();
      printf ("Thread %d enters quicksort\n", id);
      my_count = 0;
      while (num_sorted < n) {
         #pragma omp critical
         {
            if (unfinished_index >= 0) {
               my_index = unfinished_index;
               unfinished_index--;
               first = unfinished[my_index].first;
               last = unfinished[my_index].last;
            } else {
               first = 0;
               last = -1;
            }
         }
         while (first <= last) {
            if (first == last) {
               #pragma omp critical
               num_sorted++;
               my_count++;
               last = first - 1;
            } else {
               /* Split unsorted array into two parts */
               q = partition (first, last);
               #pragma omp critical
               num_sorted++;
               my_count++;


               /* Put upper portion on stack of unsorted sub-arrays */
               if ((unfinished_index+1) >= MAX_UNFINISHED) {
                  printf ("Stack overflow\n");
                  exit (-1);
               }
               #pragma omp critical
               {
                  unfinished_index++;
                  unfinished[unfinished_index].first = q+1;
                  unfinished[unfinished_index].last = last;
               }

               /* Keep lower portion for next iteration of loop */
               last = q-1;
            }
         }
      }
      printf ("Thread %d exits, having sorted %d\n", id, my_count);
   }

   /* Function 'print_float_array', given the address and length of an
      array of floating-point values, prints the values to standard
      output, one element per line. */
   void print_float_array (float *A, int n)
   {
      int i;

      printf ("Contents of array:\n");
      for (i = 0; i < n; i++)
         printf ("%6.4f\n", A[i]);
   }

   /* Function 'verify_sorted' returns 1 if the elements of array 'A'
      are in monotonically increasing order; it returns 0 otherwise. */
   int verify_sorted (float *A, int n)
   {
      int i;

      for (i = 0; i < n-1; i++)
         if (A[i] > A[i+1]) return 0;
      return 1;
   }

   /* Function 'main' gets the array size and random number seed from
      the command line, initializes the array, prints the unsorted
      array, sorts the array, and prints the sorted array. */
   int main (int argc, char *argv[])
   {
      int i;
      int seed;   /* Seed component input by user */

Introduction to Parallel Programming Intel® Software College Student Workbook with Instructor’s Notes © January 2007 Intel® Corporation 48

Instructor’s Notes and Solutions

unsigned short xi[3]; /* Random number seed */ int t; /* Number of threads */ if (argc != 4) { printf ("Command-line syntax: %s <n> <threads> <seed>\n", argv[0]); exit (-1); } seed = atoi (argv[3]); xi[0] = xi[1] = xi[2] = seed; t = atoi(argv[2]); omp_set_num_threads (t); n = atoi (argv[1]); A = (float *) malloc (n * sizeof(float)); for (i = 0; i < n; i++) A[i] = erand48(xi); /* print_float_array (A, n); */ unfinished[0].first = 0; unfinished[0].last = n-1; unfinished_index = 0; num_sorted = 0; #pragma omp parallel quicksort (); /* print_float_array (A, n); */ if (verify_sorted (A, n)) printf ("Elements are sorted\n"); else printf ("ERROR: Elements are NOT sorted\n"); return 0; }
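The listing above is complete. Assuming it is saved as qsort_omp.c (a file name chosen here for illustration; the workbook does not specify one), it can typically be built and run on a GCC-based system with something like:

   gcc -fopenmp -o qsort_omp qsort_omp.c
   ./qsort_omp 1000000 4 42

where the three arguments are the array size, the number of threads, and the random-number seed, matching the command-line syntax checked in 'main'.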


Lab 6: Analyzing Parallel Performance

Exercise 1 The first step should be to profile the programs in the application suite and see what percentage of their execution time is spent inside the core functions. If not enough time is spent in the core functions, making them execute faster will not reduce the overall execution time by that much. We can use Amdahl’s Law to estimate the maximum improvement. For example, suppose 40% of an application’s execution time is spent in core functions, and we get all our core functions to execute twice as fast. Then the maximum speedup of the application is

\[
   \frac{1}{0.6 + 0.4/2} \;=\; \frac{1}{0.8} \;=\; 1.25
\]
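For reference, the general form of Amdahl's Law applied above (a standard formula, not written out in the original answer) is

\[
   \psi_{\max} \;=\; \frac{1}{(1 - f) + f/s}
\]

where f is the fraction of execution time that benefits from the improvement and s is the speedup of that fraction. Substituting f = 0.4 and s = 2 yields the 1.25 figure above.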

Exercise 2 The complete table:

   Threads   Run Time (sec)   Speedup   Efficiency
      1          8.535          1.00       1.00
      2         21.183          0.40       0.20
      3         22.184          0.38       0.13
      4         25.060          0.34       0.09

The efficiency drops tremendously as soon as more than one thread executes the program. Even with 2 threads, each thread is doing productive work only 20% of the time. Since this program does not perform I/O, we can conclude that the threads are wasting a great deal of time waiting for access to critical sections. To improve the performance of the parallel program, we must increase the amount of useful work that gets done between critical sections of code.
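As a reminder, the derived columns in the table follow the standard definitions (not spelled out in the original answer):

\[
   \text{Speedup:}\; \psi(p) = \frac{T_1}{T_p}, \qquad
   \text{Efficiency:}\; \varepsilon(p) = \frac{\psi(p)}{p}
\]

For example, with 2 threads: ψ = 8.535 / 21.183 ≈ 0.40 and ε = 0.40 / 2 = 0.20.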

Exercise 3 We should look at the Karp-Flatt metric and see how the experimentally determined serial fraction is changing as threads are added.
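For reference, the experimentally determined serial fraction (the Karp-Flatt metric) for a measured speedup ψ on p processors is computed as follows (standard definition, not shown in the original answer):

\[
   e \;=\; \frac{1/\psi \,-\, 1/p}{1 \,-\, 1/p}
\]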

   Processors   Experimentally Determined Serial Fraction
       2                        0.40
       3                        0.43
       4                        0.44

The experimentally determined serial fraction is growing very slowly as threads are added; this is very good news. The principal limiting factor on the speedup is not parallel overhead. Instead, it's the large portion of time spent inside functions that have not yet been made parallel. That's a good omen for continuing forward with the parallelization effort.

Exercise 4 (a) To predict the performance on 8 processors, we should use the Karp-Flatt metric.

   Threads   Time (sec)   Speedup     e
      1        24.3         1.00      -
      2        14.6         1.66     0.20
      3        11.7         2.08     0.22
      4        10.6         2.29     0.25

The experimentally determined serial fraction 'e' is growing by about 0.025 per thread, so we estimate its value will be 0.35 when there are 8 threads. Now we use the formula

\[
   \psi \;=\; \frac{p}{e\,(p-1) + 1}
\]

to predict the speedup on 8 threads: substituting p = 8 and e = 0.35 gives 8 / (0.35 × 7 + 1) = 8 / 3.45 ≈ 2.32.

(b) The actual speedup may be lower than this because the parallel overhead may grow faster than the number of threads. For example, a critical section of code may become a bottleneck, and adding threads beyond a certain number may do no good.

(c) The speedup may be higher than expected if the additional cores bring the total amount of cache memory up to a point where the cache hit rate rises significantly.

Exercise 5 Even though CPU B is faster than CPU A, the speed of other system components, such as the I/O system, has not increased. Hence the fraction of program execution time devoted to inherently sequential operations has increased, reducing the speedup that can be achieved as processors are added.

Exercise 6 We’re back to Amdahl’s Law again. As the time required to execute the parallel portion of the program shrinks, the sequential component becomes relatively more significant. Processors and RAM are increasing in speed faster than hard disk drives. If we move a parallel application that does some disk I/O from an older system to a newer system with a much faster processor and a slightly faster hard disk drive, we would expect the execution times to be lower, but we would expect the speedup curves to be lower as well. One solution is to employ RAID technology to improve the throughput of the hard disk drive system. Another solution is to adopt a significantly faster secondary storage scheme, such as a solid state disk (also called solid state drive).


Lab 7: Improving Parallel Performance

Exercise 1 To complete this assignment, students must modify function 'quicksort' in the original parallel program to call 'seq_quicksort' when the difference between indices last and first is less than a particular threshold. Once they have their programs running, students must experiment to determine the best threshold size. If the threshold is too small, there will be too much contention for the stack. If it is too large, there may be a significant load imbalance among the tasks. Here is one solution:

/*
 * Stack-based Quicksort
 *
 * The quicksort algorithm works by repeatedly dividing unsorted
 * sub-arrays into two pieces: one piece containing the smaller
 * elements and the other piece containing the larger elements.
 * The splitter element, used to subdivide the unsorted sub-array,
 * ends up in its sorted location.  By repeating this process on
 * smaller and smaller sub-arrays, the entire array gets sorted.
 *
 * The typical implementation of quicksort uses recursion.  This
 * implementation replaces recursion with iteration.  It manages its
 * own stack of unsorted sub-arrays.  When the stack of unsorted
 * sub-arrays is empty, the array is sorted.
 */

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define MAX_UNFINISHED 10000  /* Max number of unsorted sub-arrays */
#define CHUNK_SIZE     1      /* Threshold below which a thread sorts
                                 sequentially; tune this experimentally */

/* Global shared variables */

struct {
   int first;                 /* Low index of unsorted sub-array */
   int last;                  /* High index of unsorted sub-array */
} unfinished[MAX_UNFINISHED]; /* Stack */

int    unfinished_index;      /* Index of top of stack */
float *A;                     /* Array of elements to be sorted */
int    n;                     /* Elements in A */
int    num_sorted;            /* Sorted elements in A */

/* Function 'swap' is called when we want to exchange two array elements */

void swap (float *x, float *y)
{
   float tmp;

   tmp = *x;
   *x = *y;
   *y = tmp;
}

/* Function 'partition' actually does the sorting by dividing an
   unsorted sub-array into two parts: those less than or equal to the
   splitter, and those greater than the splitter.  The splitter is the
   last element in the unsorted sub-array.  The splitter ends up in its
   final, sorted location.  The function returns the final location of
   the splitter (its index). */

int partition (int first, int last)
{
   int i, j;
   float x;

   x = A[last];
   i = first - 1;
   for (j = first; j < last; j++)
      if (A[j] <= x) {
         i++;
         swap (&A[i], &A[j]);
      }
   swap (&A[i+1], &A[last]);
   return (i+1);
}

/* Function 'seq_quicksort' implements the traditional, recursive
   quicksort algorithm.  It is called when we want a thread to be
   responsible for sorting an entire sub-array. */

void seq_quicksort (int first, int last)
{
   int q;                     /* Split point in array */

   if (first < last) {
      q = partition (first, last);
      seq_quicksort (first, q-1);
      seq_quicksort (q+1, last);
   }
}

/* Function 'quicksort' repeatedly retrieves the indices of unsorted
   sub-arrays from the stack and calls 'partition' to divide these
   sub-arrays into two pieces.  It keeps one of the pieces and puts the
   other piece on the stack of unsorted sub-arrays.  Eventually it ends
   up with a piece that doesn't need to be sorted.  At this point it
   gets the indices of another unsorted sub-array from the stack.  The
   function continues until the stack is empty. */

void quicksort (void)
{
   int first;
   int id;
   int last;
   int my_count;
   int my_index;
   int q;                     /* Split point in array */

   id = omp_get_thread_num();
   printf ("Thread %d enters quicksort\n", id);
   my_count = 0;
   while (num_sorted < n) {

      /* Pop an unsorted sub-array off the shared stack */
      #pragma omp critical
      {
         if (unfinished_index >= 0) {
            my_index = unfinished_index;
            unfinished_index--;
            first = unfinished[my_index].first;
            last = unfinished[my_index].last;
         } else {
            first = 0;
            last = -1;
         }
      }
      /* printf ("Thread %d has region %d-%d\n", id, first, last); */

      while (first <= last) {
         /* printf ("Thread %d now partitioning %d-%d\n", id, first, last); */
         if (first == last) {
            /* A one-element sub-array is already sorted */
            #pragma omp critical
            num_sorted++;
            my_count++;
            last = first - 1;
         } else if ((last - first) < CHUNK_SIZE) {
            /* Small sub-array: sort it sequentially without touching
               the shared stack */
            seq_quicksort (first, last);
            #pragma omp critical
            num_sorted += (last - first + 1);
            my_count += (last - first + 1);
            last = first - 1;
         } else {

            /* Split unsorted array into two parts */
            q = partition (first, last);
            #pragma omp critical
            num_sorted++;
            my_count++;

            /* Put upper portion on stack of unsorted sub-arrays */
            if ((unfinished_index+1) >= MAX_UNFINISHED) {
               printf ("Stack overflow\n");
               exit (-1);
            }
            #pragma omp critical
            {
               unfinished_index++;
               unfinished[unfinished_index].first = q+1;
               unfinished[unfinished_index].last = last;
            }

            /* Keep lower portion for next iteration of loop */
            last = q-1;
         }
      }
   }
   printf ("Thread %d exits, having sorted %d\n", id, my_count);
}

/* Function 'print_float_array', given the address and length of an
   array of floating-point values, prints the values to standard
   output, one element per line. */

void print_float_array (float *A, int n)
{
   int i;

   printf ("Contents of array:\n");
   for (i = 0; i < n; i++)
      printf ("%6.4f\n", A[i]);
}

/* Function 'verify_sorted' returns 1 if the elements of array 'A' are
   in monotonically increasing order; it returns 0 otherwise. */

int verify_sorted (float *A, int n)
{
   int i;

   for (i = 0; i < n-1; i++)
      if (A[i] > A[i+1]) return 0;
   return 1;
}

/* Function 'main' gets the array size, thread count, and random number
   seed from the command line, initializes the array, sorts the array,
   and verifies the result.  The calls that print the array are
   commented out. */

int main (int argc, char *argv[])
{
   int i;
   int seed;               /* Seed component input by user */
   unsigned short xi[3];   /* Random number seed */
   int t;                  /* Number of threads */

   if (argc != 4) {
      printf ("Command-line syntax: %s <n> <threads> <seed>\n", argv[0]);
      exit (-1);
   }
   seed = atoi (argv[3]);
   xi[0] = xi[1] = xi[2] = seed;
   t = atoi(argv[2]);
   omp_set_num_threads (t);
   n = atoi (argv[1]);
   A = (float *) malloc (n * sizeof(float));
   for (i = 0; i < n; i++)
      A[i] = erand48(xi);
   /* print_float_array (A, n); */

   unfinished[0].first = 0;
   unfinished[0].last = n-1;
   unfinished_index = 0;
   num_sorted = 0;

   #pragma omp parallel
      quicksort ();

   /* print_float_array (A, n); */

   if (verify_sorted (A, n))
      printf ("Elements are sorted\n");
   else
      printf ("ERROR: Elements are NOT sorted\n");
   return 0;
}


Exercise 2 To make the program parallel, all students have to do is put a pragma in front of the first for loop in function 'main'. The program will exhibit some speedup. Speedup improves further if students add a clause to the pragma telling the run-time system to use guided self-scheduling to allocate loop iterations to threads. Making the second for loop in 'main' parallel has little effect on the execution time because that loop takes an insignificant amount of time compared to the first. Ambitious students will notice that function 'is_prime' is very inefficient and will try to make it faster. A good way to do this is to add a pre-processing function, called from 'main', that finds all primes up to the square root of 'n' and puts them in a list. Function 'is_prime' runs dramatically faster if it simply consults that list of primes, rather than trying all positive integers greater than or equal to 2. This demonstrates that making the sequential program faster should be the first thing attempted.

/*
 * This C/OpenMP program counts the number of primes between 2 and n.
 */

#include <stdio.h>
#include <stdlib.h>    /* needed for malloc(), atoi(), and exit() */
#include <math.h>
#include <omp.h>

int *prime_list;       /* Contains primes up to sqrt(n) */
int  prime_list_len;   /* Number of primes in 'prime_list' */

/* Function 'sieve' fills array 'prime_list' with primes between 2 and
   sqrt(n).  It uses a rather crude algorithm to do this.  It would be
   easy to make this function faster, but its execution time is not too
   significant compared to the total execution time of the program. */

void sieve (int n)
{
   int i, j;
   int s;          /* Square root of 'n', rounded down */

   s = (int) sqrt(n);
   prime_list = (int *) malloc (s * sizeof(int));
   for (i = 0; i < s; i++) {
      if (i < 2) prime_list[i] = 0;
      else {
         prime_list[i] = i;
         j = 2;
         while (j*j <= i) {
            if (i % j == 0) {
               prime_list[i] = 0;
               break;
            }
            j++;
         }
      }
   }
   i = 0;
   for (j = 0; j < s; j++)
      if (prime_list[j]) prime_list[i++] = prime_list[j];
   prime_list_len = i;
   prime_list[prime_list_len] = n;   /* Sentinel stops the scan in 'is_prime' */
}

/* Function 'is_prime' returns 1 if 'p' is prime and 0 if 'p' is not
   prime. */

int is_prime (int p)
{
   int i;

   if (p < 2) return 0;
   i = 0;

   /* Check all primes less than or equal to the square root of 'p' to
      see if they divide evenly into 'p'. */
   while (prime_list[i]*prime_list[i] <= p) {
      if (p % prime_list[i] == 0) return 0;
      i++;
   }
   return 1;
}

int main (int argc, char *argv[])
{
   int *a;      /* 'i'th element is 1 iff 'i' is prime */
   int count;   /* Number of primes between 2 and 'n' */
   int i;
   int n;       /* Upper bound */
   int t;       /* Desired number of threads */

   /* Determine problem size and number of threads */
   if (argc != 3) {
      printf ("Command line syntax: %s <n> <threads>\n", argv[0]);
      exit (-1);
   }
   n = atoi(argv[1]);
   t = atoi(argv[2]);
   omp_set_num_threads (t);

   /* Identify primes up to square root of 'n' */
   sieve (n);

   a = (int *) malloc (n * sizeof(int));

   #pragma omp parallel for schedule(guided)
   for (i = 0; i < n; i++)
      a[i] = is_prime(i);

   count = 0;
   for (i = 0; i < n; i++)
      count += a[i];
   printf ("There are %d primes less than %d\n", count, n);
   return 0;
}
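Assuming the listing is saved as primes.c (again, a file name chosen here only for illustration), it can typically be built and run with:

   gcc -fopenmp -o primes primes.c -lm
   ./primes 10000000 4

where the arguments are the upper bound and the number of threads; the -lm flag links the math library needed for sqrt().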


Lab 8: Choosing the Appropriate Thread Model

These answers are based on the assumption that OpenMP programs are easier to write, debug, and maintain than programs based on Win32/Java/POSIX threads, and that the performance of well-written OpenMP programs is comparable to the performance of programs written using one of the more general thread models. Hence we will choose the OpenMP solution whenever it is feasible. Ideally, the students will enter into in-depth discussions of possible solution strategies for these problems, raising issues that go far beyond the bare-bones answers presented here.

Problem 1 We want to stop our search as soon as one technique has successfully evaluated the integral, strongly suggesting that our parallel program should be based on Win32/Java/POSIX threads. The fork/join model of OpenMP is better suited to situations where we allow every thread to complete its work.

Problem 2 Even though the variance in the amount of time needed to simulate the path of a single particle is quite large, the fact that we're simulating millions of particles means that if we have one thread per processor and give every thread an equal share of the particles to simulate, the threads will complete at roughly the same time (because of the law of large numbers). So we should code up the application using OpenMP, and a static allocation of iterations to threads would most likely be fine. If we're worried about threads finishing at significantly different times, we can switch to guided self-scheduling, but this will probably not be necessary. (A minimal scheduling sketch appears after Problem 3.)

Problem 3 Different threads are going to be responsible for different tasks. Some tasks may be so compute-intensive that they should be performed using multiple threads. For example, a single thread is probably sufficient to catch the keyboard and mouse events, but some of the image manipulations may benefit from parallel execution. There will be a wide variety of specialized interactions among threads, and it may even make sense to create threads for short-term tasks. All in all, the dynamic, asymmetric nature of the parallelism in this program points toward the use of Win32/Java/POSIX threads. Another idea is to create an explicit-threading/OpenMP hybrid: the compute-intensive image processing tasks can be parallelized more easily using OpenMP. Avoiding processor oversubscription could be the most difficult part of this parallel implementation. Neither explicit threading nor OpenMP provides a way to query the number of idle processors, and even if they did, the dynamic nature of the application could make the query result obsolete very quickly. On Windows, QueueUserWorkItem plus OpenMP in the compute-intensive tasks would help minimize processor oversubscription, because the operating system maintains the thread pool and decides when queued tasks run.
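The following minimal sketch illustrates the scheduling choice discussed in Problem 2. The names simulate_particle, results, and num_particles are placeholders invented for this example; they do not come from the lab materials.

#include <omp.h>

double simulate_particle (int i);   /* hypothetical: simulates one particle */

void simulate_all (double *results, int num_particles)
{
   int i;

   /* schedule(static): every thread receives an equal, contiguous share
      of the iterations.  With millions of particles, the law of large
      numbers keeps the per-thread workloads roughly equal. */
   #pragma omp parallel for schedule(static)
   for (i = 0; i < num_particles; i++)
      results[i] = simulate_particle (i);
}

If measurements showed the threads finishing at noticeably different times, changing the clause to schedule(guided) or schedule(dynamic) would be a one-word modification.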


Problem 4 A quick (linear-time) preprocessing step can put the prolog in shared memory and determine the number of pages to be formatted. The time needed to format a page varies quite a bit, depending on what is on the page. In addition, some documents are fairly short. Nevertheless, the same function will be called to process each page. With these factors in mind, it seems reasonable to use OpenMP's parallel for construct to convert the pages of the PostScript program to PDF. Because the number of pages may be small, the parallel program should use guided self-scheduling or dynamic scheduling to ensure all threads finish at roughly the same time.

Problem 5 This problem seems amenable to solution using OpenMP. The complete video probably contains thousands of frames. Each thread would scan a contiguous segment of the raw footage, marking the frames that represent new scenes. The segments scanned by the threads would have to overlap by a frame, to prevent the error of missing a scene that begins precisely at the start of a thread's allocated segment.
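A minimal sketch of the overlap idea in Problem 5, assuming a hypothetical is_new_scene(f) routine that decides whether frame f begins a new scene by comparing it with frame f-1; none of these names come from the lab materials.

#include <omp.h>

int is_new_scene (int f);   /* hypothetical: 1 if frame f starts a new scene */

void mark_scene_boundaries (int *boundary, int num_frames)
{
   #pragma omp parallel
   {
      int t = omp_get_thread_num();
      int p = omp_get_num_threads();
      int chunk = (num_frames + p - 1) / p;   /* frames per thread, rounded up */
      int first = t * chunk;                  /* start of this thread's segment */
      int last  = first + chunk;              /* one past the end of the segment */
      int f;

      if (last > num_frames) last = num_frames;

      /* Frame 0 has no predecessor, so thread 0 starts at frame 1 and the
         first frame trivially begins the first scene.  Every other thread
         starts at its own first frame, which is_new_scene() compares with
         the final frame of the previous segment -- the one-frame overlap
         described above. */
      if (first == 0 && num_frames > 0) boundary[0] = 1;
      for (f = (first == 0) ? 1 : first; f < last; f++)
         boundary[f] = is_new_scene (f);
   }
}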


Additional Resources

• For more information on Intel Software College, visit www.intel.com/software/college.

• For more information on software development products, services, tools, training, and expert advice, visit www.intel.com/software.

• Put your knowledge to the ultimate test by solving coding problems in multi-threading competitions for multi-core microprocessors and win cash prizes: the Intel Multi-Threading Competition Series at www.topcoder.com/intel.

• For more information about the latest technologies for computer product developers and IT professionals, look up Intel Press books at http://www.intel.com/intelpress/.

• Maximize application performance using Intel® Software Development Products: www.intel.com/software/products/.


www.intel.com/software/college

Copyright © 2006, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.