[IEEE 2009 DoD High Performance Computing Modernization Program Users Group Conference (HPCMP-UGC) - San Diego, CA, USA (2009.06.15-2009.06.18)] 2009 DoD High Performance Computing

Evaluating Parallel Extensions to High Level Languages Using the HPC Challenge Benchmarks

Laura Humphrey, Brian Guilfoos, Harrison Smith, Andrew Warnock, Jose Unpingco, Bracy Elton, and Alan Chalker

The Ohio Supercomputer Center, Columbus, OH {humphrey, guilfoos, bsmith, unpingco, elton, alanc}@osc.edu

Abstract

Recent years have seen the development of many new parallel extensions to high level languages. However, there does not yet seem to have been a concentrated effort to quantify their performance or qualify their usability. Toward this end, we have used several parallel extensions to implement four of the high performance computing (HPC) Challenge benchmarks—FFT, HPL, RandomAccess, and STREAM—according to the Class 2 specifications. The parallel extensions used here include pMatlab, Star-P, and the official Parallel Computing Toolbox for MATLAB; pMatlab for Octave; and Star-P for Python. We have recorded performance results for the benchmarks using these extensions on the Ohio Supercomputing Center’s supercomputer Glenn as well as several of the Department of Defense Supercomputing Resource Centers (DoD DSRCs). These results are compared to those of the original C benchmarks as run on Glenn. We also highlight some of the features of these parallel extensions, as well as those of gridMathematica for Mathematica and IPython for Python, which have not yet been fully benchmarked. 1. Introduction Parallel extensions to high level languages (HLLs) have become increasingly popular in recent years. These extensions allow programmers, scientists, and engineers to write parallel code in the same languages they use for serial development, increasing productivity and providing easier access to high performance computing. Popular HLLs include:

• MATLAB®, developed by The MathWorks • Octave (a free MATLAB clone) • Python and Mathematica®, developed by

Wolfram Research Each of these HLLs also has parallel extensions. For MATLAB, these include pMatlab, developed by Lincoln

Laboratory at MIT; Star-P®, developed by Interactive Supercomputing; and the official Parallel Computing Toolbox, developed by The MathWorks. pMatlab is available for Octave. Star-P and IPython are available for Python. And gridMathematica, developed by Wolfram Research, is available for Mathematica. In order to test performance and evaluate the features of each parallel extension, we have chosen to implement four of the high performance computing (HPC) Challenge benchmarks—STREAM, HPL, FFT, and RandomAccess (RA)—according to the HPC Challenge Class 2 specifications (http://www. hpcchallenge.org/class2 specs.pdf). In what follows, we first give an overview of each of the parallel extensions. We then present performance results obtained on Glenn and several of the Department of Defense Supercomputing Resource Centers (DoD DSRCs) for the selected HPC Challenge benchmarks. Finally, we conclude with a short conclusion and discussion of the results. 2. Overview of Parallel Extensions Each parallel extension has its own structure, programming flow, and set of parallel commands. Here, we provide an overview of each extension, including basic structure, features, and examples of parallel commands. 2.1. pMatlab pMatlab and MatlabMPI are both extensions to MATLAB developed by Lincoln Laboratory at MIT. pMatlab implements distributed arrays and uses MatlabMPI, which implements message passing functions, such as MPI_Send() and MPI_Recv() for sending data between parallel processes. pMatlab can be used interactively or configured to run with a batch system, such as PBS or LSF.

2009 DoD High Performance Computing Modernization Program Users Group Conference

978-0-7695-3946-1/10 $26.00 © 2010 IEEE

DOI 10.1109/HPCMP-UGC.2009.68

410

Basic pMATLAB code consists of a MATLAB script that runs on all processes. This script must initialize pMATLAB and create a communicator before parallel commands and distributed arrays can be used. Distributed arrays are initialized by specifying a map and using one of several initialization functions, for instance, dmap = map([4,1], {}, [0:3]);

A = ones(M,N,dmap); B = zeros(M,N,dmap); C = rand(M,N,dmap);

This map specifies an array divided row-wise into four blocks, with no overlap, among processes 0 through 3. The functions ones(), zeros(), and rand() create distributed arrays of size M by N filled with ones, zeros, and uniformly random values in a manner similar to that of the corresponding MATLAB functions. pMatlab includes functions for checking the global size, local sizes, global indices, and local indices of each distributed array, as well as commands for extracting local portions of an array and aggregating the entire array onto the rank 0 process. There are also basic MATLAB commands for distributed arrays, such as +, -, *, .*, <, >, ==, ~=, abs(), fft(), complex(), and conv2(). Sparse matrices are also supported. 2.2. Star-P Star-P, developed by Interactive Supercomputing, consists of an interactive front-end interface and a back-end computing resource. The back-end can be configured to use a batch system such as LSF or PBS. Star-P supports both serial and parallel development, with relatively easy conversion between the two. Star-P can be used with either Python or MATLAB. Features and commands for MATLAB and Python in Star-P are relatively similar, with commands being structured for either task parallelism or data parallelism. For data parallelism in MATLAB, one uses the *p operator to create an array that is distributed across the associated dimensions, e.g.

A = rand(m,n*p); B = zeros(n*p, n*p); C = ones(m*p,n);

For data parallelism in Python, one uses similar functions from the starp.numpy package. There are a variety of arithmetic operations and functions available for distributed arrays in both MATLAB and Python. For MATLAB, these include ‘, +, -, *, /, \, .* , ~=, <, <=, abs(), cos(), sin(), tan(), cond(), rank(), chol(), eig(), exp(), fft(), fft2(), find(), hadamard(), ifft(), ifft2(), inv(), log(), lu(), max(), mean(), median(), min(), mpower(), norm(), pinv(), qr(), rank(), sqrt(), svd(), and toeplitz(). There is also support for sparse matrices when using Star-P with MATLAB. For Python, a smaller number but similar variety of functions is available. Sparse matrices are not supported when using Star-P with Python.

Task parallelism is accomplished in through the function ppeval(), which applies a function to an array of data in a parallel manner. Users can also write their own data parallel or task parallel functions in C using the Star-P Software Development Kit (SDK). 2.3. Parallel Computing Toolbox The Parallel Computing Toolbox (PCT) is the official parallel toolbox for MATLAB. This toolbox consists of the standard interactive MATLAB front-end and a back-end computing resource that is divided into “workers.” It supports interactive development with up to eight workers as well as batch mode, which can be configured to use a scheduling system such as LSF or PBS. The PCT supports both task parallelism and data parallelism. Task parallelism can be achieved through parallel for loops using the parfor construct. Data parallelism can be achieved through “codistributed” arrays generated using codistributor() or codistributed(). The function codistributor() distributes an array among workers across a dimension or dimensions in a blocked cyclic manner, e.g., A = randn(m,n,codistributor(‘1d’); B = zeros(m,n,codistributor(‘1d’,2); C = ones(m,n,codistributor(‘2d’);

The function codistributed() creates a distributed array out of local data on each worker. There are a large number of arithmetic operations and functions available for codistributed arrays, such as ‘, +, -, *, /, \, .* , ~=, <, <=, abs(), cos(), sin(), tan(), chol(), eig(), exp(), fft(), find(), log(), lu(), max(), min(), norm(), sqrt(), svd(), swapbytes(), uint8(), uint16(), uint32(), and uint64(), as well as functions for checking the local and global indices of a codistributed array, extracting local portions of a codistributed array, gathering a codistributed array onto one or more workers, and redistributing a codistributed array. There is also support for sparse matrices. In addition to task parallelism and data parallelism, the toolbox also implements commands similar to those used for MPI. For instance, labsend() and labreceive() can be used to send information between workers, and labindex can be used to check each worker’s index or rank. 2.4. gridMathematica gridMathematica is the official parallel package for Mathematica. This package uses the standard Mathematica interactive front-end, and the back-end is composed of Mathematica kernels running on a computing resource. The kernels can be managed using a batch system such as LSF or PBS.

411

gridMathematica has several functions for generating or evaluating data in parallel, for instance ParallelMap[f, expr], which applies the function f to each element in expr; ParallelTable[expr, {i, imin, imax}], which generates a list of the values of expr as i runs from imin to imax;; Parallelize[expr] which attempts to evaluate expr using automatic parallelization; and ParallelEvaluate[expr], which evaluates expr on all kernels, to name a few. There are also functions for checking the number of kernels, the id of each kernel, and sharing variables among kernels. 2.5. IPython IPython is an open source parallel computing package for Python. It consists of an interactive front-end and back-end processes or “engines.” These are coordinated by a “controller.” IPython can be modified to work with a batch system such as PBS or LSF. At the time the benchmarks were coded, IPython had a relatively small number of commands: push() and pull()for transferring data between the front-end and back-end; push_function() and pull_function() for transferring functions; and execute() for executing a command on the engines, to name a few. However, IPython is still in development, and new features and functions are being added. 3. Results We have coded and run four of the HPC Challenge benchmarks—FFT, HPL, RandomAccess, and STREAM—according to the Class 2 specifications using each of the preceding parallel HLL extensions. These benchmarks were run on The Ohio Supercomputer Center’s (OSC’s) IBM Cluster 1350, “Glenn,” and several DoD systems—“MJM,” “Eagle,” and “Babbage.” The HPC Challenge benchmarks as coded in C were also run on Glenn for comparison purposes. Problem sizes were generally selected to match the Class 2 specifications: at least a quarter of system memory for FFT, less than or equal to half of system memory for RandomAccess, and

at least a quarter of system memory for STREAM. An exception is HPL. The main HPL matrix was set to use a quarter of system memory instead of half, since all the packages require a copy of this matrix. There are some important implementation details to note. First, STREAM results are the average of several triad iterations, not the specified maximum. Next, RandomAccess is restricted to the specified maximum of 1,024 64-bit doubles in the look-ahead buffer. Because data passes through the front-end in Star-P, IPython, and gridMathematica, we restricted the look-ahead buffer to 1,024 on the front-end, as opposed to a 1,024 look-ahead on each process. Finally, FFT sizes are powers of 2. A variety of configuration, security, compatibility, and other issues prevented us from fully running all the benchmarks. Available results are given here. The IPython and gridMathematica benchmarks were run at a much lower capacity and could be improved. Thus, preliminary results for these HLLs are given separately. Each benchmark was timed for a certain number of “runs” or submissions of the main benchmark program. These results were used to derive minimum, median, and maximum results for each benchmark. STREAM, FFT, and HPL also had a number of internal “iterations” or repetitions of the main benchmark operation that were used to obtain an average time. Also, note that pMatlab can be run with MATLAB or Octave and on top of either MatlabMPI or bcMPI, OSC’s implementation of MatlabMPI. Whereas MatlabMPI uses file I/O for transferring data between processes, bcMPI uses C message passing libraries. Table 1, Table 2, Table 3, and Table 4 show results for FFT, HPL, RandomAccess, and STREAM, respectively. Table 5 shows results for the C benchmarks on Glenn. For these benchmarks, grid sizes were roughly square, and the block size was set to 64. FFT used a quarter of system memory, HPL used half of system memory, RandomAccess used less than or equal to half of system memory, and STREAM used about half of system memory. Exceptions include the 16 and 64 CPU cases. In these two cases HPL, RandomAccess, and FFT used half of the aforementioned system memory.

412

Table 1. FFT benchmark results in GFlops. Minimum, median, and maximum values based on five runs. Results are the average of five internal iterations for each benchmark, except results for bcMPI/MatlabMPI, which were

based on three runs with five iterations each.

HLL Star-P Matlab Star-P Python

MATLAB PCT

pMatlab-Matlab-

MatlabMPI

pMatlab-Octave-bcMPI

pMatlab-Octave-MatlabMPI

Machine Glenn MJM Glenn Glenn Glenn Eagle Glenn Eagle

CPUs

4 Min 0.61 0.53 0.11 0.07 0.01 0.06 0.02 0.01 Med 0.67 0.54 0.57 0.07 0.01 0.07 0.02 0.01 Max 0.68 0.56 0.64 0.07 0.01 0.08 0.02 0.01

8 Min 0.50 1.00 0.39 0.09 0.02 0.06 0.02 0.02 Med 0.56 1.02 0.49 0.09 0.02 0.06 0.03 0.02 Max 0.57 1.03 0.57 0.09 0.02 0.07 0.03 0.02

16 Min 0.67 1.86 0.57 0.59 0.02 0.07 0.04 0.04 Med 0.81 1.91 0.79 0.73 0.05 0.07 0.05 0.04 Max 0.87 1.94 0.83 0.78 0.05 0.08 0.05 0.04

32 Min 1.09 3.26 1.02 1.42 0.01 0.05 0.06 0.08 Med 1.23 3.60 1.06 1.46 0.14 0.05 0.06 0.08 Max 1.53 3.73 1.51 1.46 0.15 0.07 0.07 0.08

64 Min 1.12 6.11 1.35 1.54 0.03 0.04 0.06 0.11 Med 1.88 6.53 2.50 1.81 0.03 0.04 0.06 0.11 Max 2.90 6.81 2.87 1.87 0.03 0.05 0.06 0.11

Table 2. HPL benchmark results in GFlops. All benchmarks use half the memory listed in the Class 2 specifications. Minimum, median, and maximum values based on five runs for each benchmark, except for the

MatlabMPI benchmark, which used only three runs.


MATLAB PCT

pMatlab-Matlab-

MatlabMPI

Machine Glenn MJM Glenn Glenn Glenn

CPUs

4 Min 12.19 19.18 12.42 12.95 3.69 Med 12.66 19.22 12.73 13.04 3.74 Max 12.89 19.26 13.80 13.04 4.14

8 Min 21.50 32.85 21.88 21.20 n/a Med 21.77 32.87 21.96 21.93 n/a Max 22.12 32.90 24.19 22.09 n/a




413

Table 3. RandomAccess benchmark results in GUPs. Minimum, median, and maximum values based on five runs for each benchmark, except for the bcMPI/MatlabMPI benchmarks, each of which used three runs.


MATLAB PCT

pMatlab-Matlab-bcMPI

pMatlab-Matlab-

MatlabMPI pMatlab-

Octave-bcMPI

pMatlab-Octave-

MatlabMPI

Machine Glenn MJM Glenn Glenn Glenn Glenn Glenn Eagle Glenn Eagle

CPUs

4 Min 8E-06 6E-05 3E-05 3E-03 2E-03 8E-06 3E-04 2E-04 5E-05 3E-05 Med 1E-05 6E-06 3E-05 4E-03 2E-03 9E-06 3E-04 2E-04 5E-05 3E-05 Max 2E-05 6E-06 3E-05 7E-03 2E-03 1E-05 3E-04 2E-04 5E-05 3E-05



32 Min 7E-06 2E-06 3E-05 9E-03 8E-03 8E-07 3E-03 9E-04 n/a 3E-06 Med 7E-06 3E-06 3E-05 9E-03 8E-03 8E-07 3E-03 1E-03 n/a 4E-06 Max 1E-05 4E-06 6E-05 9E-03 8E-03 8E-07 3E-03 1E-03 n/a 4E-06

64 Min 7E-06 4E-06 3E-05 1E-02 n/a n/a 3E-03 6E-04 n/a 2E-06 Med 7E-06 4E-06 3E-05 1E-02 n/a n/a 3E-03 8E-04 n/a 2E-06 Max 7E-06 7E-06 6E-05 1E-02 n/a n/a 4E-03 8E-04 n/a 2E-06

Table 4. STREAM benchmark results in GB/s. Minimum, median, and maximum values based on five runs. Results are the average of five internal iterations for each benchmark, except for the bcMPI/MatlabMPI benchmarks, which

used three runs with five iterations.

HLL Star-P Matlab

Star-P Python

MATLAB PCT

pMatlab-Matlab-bcMPI

pMatlab-Matlab-

MatlabMPI pMatlab-

Octave-bcMPI

pMatlab-Octave-

MatlabMPI Machine Glenn Glenn Glenn Glenn Glenn Glenn Eagle Glenn Eagle

CPUs

4 Min 2.59 2.34 4.72 7.72 5.03 2.12 1.44 2.96 1.73 Med 3.16 3.56 4.76 7.73 7.07 3.33 1.44 3.49 1.74 Max 3.44 3.65 4.82 7.79 7.10 3.39 1.47 3.60 1.74

8 Min 5.69 4.71 8.86 7.90 7.96 5.33 2.88 5.59 2.95 Med 6.09 6.10 9.43 7.93 9.35 5.51 2.90 6.23 3.22 Max 7.03 7.16 9.72 14.43 11.85 5.95 2.92 6.42 3.33

16 Min 9.60 12.05 17.92 18.47 21.97 14.27 3.95 13.36 4.32 Med 10.44 13.25 18.20 23.18 29.00 14.51 6.31 13.78 5.18 Max 11.56 14.00 18.89 23.96 30.74 14.70 6.33 13.94 5.35

32 Min 18.38 20.91 33.668 46.64 47.20 26.29 8.70 26.18 10.14 Med 22.87 24.73 34.29 60.80 56.08 28.19 10.65 26.25 10.23 Max 22.93 27.80 36.941 62.19 60.10 28.57 11.54 26.99 10.24

64 Min 37.30 47.63 76.179 87.27 1.85 44.94 22.53 39.04 10.27 Med 44.55 48.05 77.103 116.03 16.28 58.14 22.93 45.14 10.39 Max 49.81 51.94 87.613 121.15 73.81 59.29 23.71 46.11 11.83

414

Table 5. C benchmark results from Glenn

Benchmark FFT (GFlops) HPL (GFlops) RA (GUps) STREAM (GB/s)

CPUs

4 0.773 16.86 0.0075 6.244 8 1.092 48.58 0.0062 13.513

16 3.091 51.91 0.0128 28.225 32 4.767 93.51 0.0239 57.728 64 11.523 186.9 0.0415 113.314

Table 6. Average preliminary results for gridMathematica and IPython. *gridMathematica configuration issues restricted walltimes to one hour. Because of this limit, FFT size in gridMathematica was set to a quarter of that listed in the Class 2 specifications. **FFT results for IPython are extrapolations based on results from smaller

memory sizes. HLL gridMathematica IPython

Benchmark FFT*

(GFlops) RA

(GUps) STREAM

(GB/s) FFT**

(GFlops) RA

(GUps) STREAM

(GB/s)

CPUs or

nodes

4 0.02 8.9E-08 1.94 1.9E-03 1.3E-06 0.41 8 0.03 8.9E-08 3.87 1.9E-03 3.7E-06 0.68

16 0.03 8.9E-08 5.14 2.4E-03 1.7E-06 1.26 32 0.04 8.8E-08 14.96 2.0E-03 8.4E-07 2.74

64 n/a n/a n/a 1.7E-03 3.9E-07 5.90 4. Conclusions We have highlighted some of the features of several parallel HLL extensions. Additionally, we have used four of the HPC Challenge benchmarks—FFT, HPL, RandomAccess, and STREAM coded according to the Class 2 specifications—to evaluate performance. Performance results vary widely depending on which language is used. Some of the best performance results were obtained when built-in functions were available. For instance, Star-P showed the best performance on the FFT benchmark, and it was the only HLL to include a parallel one-dimensional FFT function. Star-P and the MATLAB Parallel Computing Toolbox showed the best performance on the HPL benchmark. Similarly, these HLLs were the only two with a built-in parallel function for solving a linear system. These built-in functions also made the benchmarks much easier to code. Other differences in performance were due to differences in benchmark implementation. Most notably, the communication scheme used for RandomAccess with the Parallel Computing Toolbox is implemented as a hypercube, whereas all-to-all communication is used in pMatlab. Star-P, IPython, and gridMathematica communicate results from each process to the front-end, which then sorts them and sends them back to the appropriate processes. It appears that the hypercube implementation is substantially faster than the all-to-all implementation, and communication mediated by the front-end is the slowest. It should be noted that pMatlab

benchmark could be modified to use a hypercube implementation for possible speedup. Remaining differences in results could be due to a variety of factors, including differences in communication methods, performance of underlying functions in each of the parallel HLLs, and HLL overhead. Also, since we were not using a dedicated cluster, variance in some of the results could be due to differences in cluster utilization depending on the time of day. There is especially high variance in some of the pMatlab results, since they depend heavily on the file system. In general, more analysis is needed to determine the exact source of the differences. Full implementation details for all the benchmarks, more preliminary results for gridMathematica and IPython, and a more detailed discussion of benchmark limitations are given in the final report for User Productivity, Enhancement, Technology Transfer and Training (PETTT) project CE-KY7-SP2, which is available for download on the OKC at https://okc.erdc.hpc.mil/index.jsp. Acknowledgments This publication was made possible through support provided by DoD HPCMP PET activities through Mississippi State University under contract No. GS04T01BFC0060. The opinions expressed herein are those of the author(s) and do not necessarily reflect the views of the DoD or Mississippi State University.

415

Documents

[IEEE 2009 DoD High Performance Computing Modernization Program Users Group Conference (HPCMP-UGC) - San Diego, CA, USA (2009.06.15-2009.06.18)] 2009 DoD High Performance Computing