High Performance Computing with MATLAB

Kadin Tseng
Scientific Computing and Visualization Group
Boston University

CNS, March 7, 2008

Outline

• Performance Issues
  – Memory Access
  – Vectorization
  – Compiler
  – Other Considerations
• Parallel MATLAB

Memory Access

Memory access patterns often affect computational performance. Here are some effective ways to enhance performance:

• Allocate array memory before using it
• For-loop ordering
• Compute and save arrays in place wherever possible

Allocate Array

Allocate array memory before using it.

MATLAB is designed primarily as an interactive, user-friendly environment, so no pre-allocation of memory is required. Often, however, array sizes are known a priori. Pre-allocating ensures that all array elements are allocated in one single, contiguous block right from the start.

Without pre-allocation:

    n = 5000;
    x(1) = 1;
    for i = 2:n
        x(i) = 2*x(i-1);
    end

Wallclock time = 0.0153 seconds

With pre-allocation:

    n = 5000;
    x = ones(n,1);
    x(1) = 1;
    for i = 2:n
        x(i) = 2*x(i-1);
    end

Wallclock time = 0.0002 seconds

The timing data were recorded on Katana. The actual times can vary significantly depending on the processor.
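The comparison can be reproduced with tic and toc. The following is a minimal sketch, not the script used for the Katana timings; the names x_grow, x_pre, t_grow and t_pre are illustrative, and the measured times will differ from machine to machine.

```matlab
n = 5000;

% Version 1: the array grows by one element on every pass.
tic
x_grow = 1;                      % starts as a scalar
for i = 2:n
    x_grow(i) = 2*x_grow(i-1);   % MATLAB must enlarge x_grow here
end
t_grow = toc;

% Version 2: the array is allocated once, before the loop.
tic
x_pre = ones(n,1);
for i = 2:n
    x_pre(i) = 2*x_pre(i-1);
end
t_pre = toc;

fprintf('growing: %.4f s, pre-allocated: %.4f s\n', t_grow, t_pre);
```

The values overflow to Inf beyond element 1024, which does not affect the timing comparison.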

For-loop Ordering

It is best if the inner-most for-loop runs over the left-most index of the array, the next loop over the next index, and so on.

MATLAB stores arrays in column-major order. For a multi-dimensional array x(i,j), stepping through the left-most index therefore walks through memory contiguously, just as the 1D representation of the same array, x(k), does.

Inner loop over columns (right-most index):

    n = 5000;
    x = zeros(n);
    for i = 1:n          % rows
        for j = 1:n      % columns
            x(i,j) = i+(j-1)*n;
        end
    end

Wallclock time = 0.88 seconds

Inner loop over rows (left-most index):

    n = 5000;
    x = zeros(n);
    for j = 1:n          % columns
        for i = 1:n      % rows
            x(i,j) = i+(j-1)*n;
        end
    end

Wallclock time = 0.48 seconds
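The loop body in this example assigns each element its own position in memory, since i+(j-1)*n is the column-major linear index of x(i,j). A small sketch, using MATLAB's sub2ind and an illustrative 4-by-4 array, checks this correspondence:

```matlab
n = 4;
x = zeros(n);
for j = 1:n            % columns (outer loop)
    for i = 1:n        % rows (inner loop, left-most index)
        x(i,j) = i + (j-1)*n;
    end
end

% x(k) with a single subscript walks through memory in storage order.
k = sub2ind(size(x), 3, 2);   % linear index of element (3,2)
disp(k)                        % 7, i.e. 3 + (2-1)*4
disp(isequal(x(:)', 1:n*n))    % 1 (true): storage order is 1, 2, ..., n^2
```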

Compute In-place

Computing and saving an array in place improves performance.

Result stored in a new array:

    x = randn(10000);
    tic
    y = x.^2;
    toc

Wallclock time = 1.23 seconds

Result stored in place:

    x = randn(10000);
    tic
    x = x.^2;
    toc

Wallclock time = 0.49 seconds

Other Considerations

• Use a function m-file instead of a script m-file whenever reasonable.
  – A script m-file is loaded into memory and evaluated one line at a time. Subsequent uses require reloading.
  – A function m-file is compiled into pseudo-code and is loaded once. Subsequent uses of the function will be faster without reloading.
• Avoid using virtual memory. Physical memory is much faster.
• Avoid passing large matrices to a function and modifying only a handful of elements.
• Use the MATLAB profiler (profile) to identify “hot spots” for performance enhancement.
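As a rough illustration of the profiler workflow, the sketch below wraps a piece of code between profile on and profile off. The workload (the matrix products on A) is a stand-in chosen for this example and does not come from the slides.

```matlab
profile on                 % start collecting timing data

% Code under study: a stand-in workload.
A = rand(200);
for k = 1:20
    B = A * A';            % matrix product
    s = sum(B(:));
end

profile off                % stop collecting
p = profile('info');       % results as a struct; p.FunctionTable lists the functions seen
% profile viewer           % uncomment to open the interactive report
```

In an interactive session, the report opened by profile viewer is usually the quickest way to see which lines dominate the run time.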

Vectorization

• The use of for-loops in MATLAB can, in general, be expensive, especially if the loop count is large or the for-loops are nested.
• Without array allocation, for-loops are very costly.
• From a performance standpoint, in general, a compact vector representation should be used in place of for-loops. Here is an example.

For-loop version:

    i = 0;
    for t = 0:.01:10
        i = i + 1;
        y(i) = sin(t);
    end

Wallclock time = 0.0045 seconds

Vectorized version:

    t = 0:.01:10;
    y = sin(t);

Wallclock time = 0.0005 seconds
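A short sketch can confirm that the loop and vectorized forms produce the same values. It also shows a second common idiom, a logical mask in place of a loop with an if-test, which is an addition for illustration and not part of the slides. The names y_loop, y_vec, n_pos and max_diff are illustrative.

```matlab
t = 0:.01:10;

% Loop version, with y_loop pre-allocated.
y_loop = zeros(size(t));
for k = 1:numel(t)
    y_loop(k) = sin(t(k));
end

% Vectorized version.
y_vec = sin(t);

% A logical mask replaces a loop with an if-test inside.
n_pos = sum(y_vec > 0);            % number of samples where sin(t) > 0

max_diff = max(abs(y_loop - y_vec));
```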

Compiler

• A MATLAB compiler, mcc, is available.
  – It compiles m-files into C codes, object libraries, or stand-alone executables.
  – A stand-alone executable generated with mcc can run on compatible platforms without an installed MATLAB or a MATLAB license.
• Many MATLAB general and toolbox licenses are available at BU. On special occasions, MATLAB access may be denied if all licenses are checked out. Running a stand-alone requires NO licenses and no waiting.
• Some compiled codes may run more efficiently than m-files because they are not run in interpretive mode.
• A stand-alone enables you to share it without revealing the source.
• http://scv.bu.edu/documentation/tutorials/MATLAB/compiler/

Is Parallel MATLAB the way to go?

• Even in the best case, it can’t compete with C/Fortran with MPI/OpenMP.
• It is an acceptable compromise if
  – Converting your MATLAB code to C/Fortran requires too big of an effort and you don’t have the time or inclination to do that.
  – A “big” job typically takes hours, rather than days, to run on a single processor.
  – You strongly prefer the relative ease and efficiency in programming a research code in MATLAB.
  – The appropriate multiprocessing MATLAB paradigm is at your disposal.

Multiprocessing MATLAB

1 MatlabMPI
2 pMatlab
3 SCV’s parallel MATLAB
4 Distributed Computing Toolbox
5 Star-P

1 MatlabMPI

• MatlabMPI is a parallel MATLAB package developed at Lincoln Lab in Lexington, MA.
• It does not require or make use of a high speed interconnect for communication among cluster nodes. Instead, it relies on the network file system being visible to, or shared by, all processors. With this, message passing is achieved through I/O to the file system.
• It has a small basic set of utility routines that mimic those of the Message Passing Interface (MPI) in functionality. While the MPI routines for sending and receiving messages operate over a high speed interconnect, the routines in this package accomplish the same tasks via I/O.
• It is good for “embarrassingly parallel” codes that require only infrequent communications.
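The idea of passing messages through a shared file system can be sketched as follows. This is not the MatlabMPI API: the directory, the file names and the flag-file convention are assumptions made for illustration, and a single script plays both the sending and the receiving process.

```matlab
% Illustrative only: both "processes" are played by this one script.
msg_dir   = tempname;                 % stand-in for a shared directory
mkdir(msg_dir);
data_file = fullfile(msg_dir, 'msg_from_0_to_1.mat');
flag_file = fullfile(msg_dir, 'msg_from_0_to_1.ready');

% --- Sender (rank 0): write the data, then a flag file to mark completion.
payload = magic(4);
save(data_file, 'payload');
fid = fopen(flag_file, 'w');
fclose(fid);

% --- Receiver (rank 1): poll until the flag file appears, then read.
while ~exist(flag_file, 'file')
    pause(0.01);
end
msg = load(data_file);                % returns a struct with field 'payload'
received = msg.payload;

rmdir(msg_dir, 's');                  % clean up
```

Each message costs at least one file write, one file read and some polling, which suggests why this style of communication suits codes that exchange data only occasionally.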

2 pMatlab

• pMatlab is a parallel MATLAB package also developed at Lincoln Lab in Lexington, MA. It is built on top of MatlabMPI.
• As such, it inherits all the properties of MatlabMPI. It can be thought of as providing higher-level wrapper functions to insulate programmers from having to deal with lower-level function calls to perform parallel tasks.
• It is good for embarrassingly parallel algorithms with a very modest amount of communication.

3 SCV’s parallel MATLAB

• SCV has a very simple parallel MATLAB package that, like MatlabMPI, is based on the shared network file system concept.
• It is subject to most of the same restrictions as MatlabMPI. However, there are two departures:
  1. There is only one batch script, and only two function m-files need to be inserted into your code.
  2. These include a barrier function to synchronize work performed on multiprocessing nodes. This is typically required for codes that contain serial and parallel sections.
• It is good for embarrassingly parallel algorithms with a very modest amount of communication.

Email or call Kadin if you want to use any of the above three packages. An example is given next.

SCV parallel MATLAB – Example 1

    % This example demonstrates the use of multiprocessors to compute C = A + B (matrix size is N^2)
    % Decomposition along columns; can also be decomposed along rows, or both.
    % C(:, range(rank)) = A(:, range(rank)) + B(:, range(rank))
    % In the above, range(rank) is the range of columns as a function of the processor rank
    % range(rank) = rank*n+1:rank*n+n (0<=rank<=nproc-1; n=N/nproc)
    % For simplicity, N is assumed to be divisible by nproc

    N = 8;                                       % size of global matrix A
    I = (1:N)';                                  % generate column vector
    A = I(:, ones(1,N))*10 + I(:, ones(1,N))';   % generate A on current (and all) processes

    [pbegin, pend, rank, nproc] = parallel_info(N);   % query for parallel info
    % rank (0<=rank<=nproc-1) is the current MATLAB process

    n = N/nproc;                  % distributed column size of matrix B
    b = I(:, ones(1,n))*10;       % generate N x n matrix b (local B)
    c = A(:, pbegin:pend) + b     % compute local c from A and local b
    save matrix_c;                % each current dir has own individual copy of c

SCV parallel MATLAB – Example 1 (cont’d)

    % Run barrier to synchronize all processors
    ierr = barrier(rank, nproc);

    % Finally, perform (serial) gather on c of all ranks into C on 0
    if (rank == 0)
        C = zeros(N);        % allocate C
        C(:,1:n) = c;        % starts with c from rank 0 which is already in memory
        for k = 1:nproc-1
            i = n*k+1;       % beginning location to which c will be inserted
            j = n*k+n;       % end location
            fk = ['../' num2str(k) '/matrix_c'];   % file name of c on process k
            load(fk, 'c');
            C(:,i:j) = c;
        end
        save('../matrixC', 'C');   % save C to parent dir
    end
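The column decomposition in Example 1 can be checked in a single MATLAB session. In the sketch below a loop plays the role of the separate processes; parallel_info and barrier belong to the SCV package and are not called. The value nproc = 4, the loop variable r (standing in for rank) and the global matrix B are assumptions made for this illustration.

```matlab
N     = 8;                 % size of global matrices
nproc = 4;                 % assumed number of processes (N divisible by nproc)
n     = N/nproc;           % columns per process

I = (1:N)';
A = I(:, ones(1,N))*10 + I(:, ones(1,N))';
B = I(:, ones(1,N))*10;    % global B, of which each process holds n columns

C = zeros(N);
for r = 0:nproc-1                  % each pass plays one process; r stands in for rank
    cols = r*n+1 : r*n+n;          % range(rank) from the example's comments
    b = I(:, ones(1,n))*10;        % local B
    c = A(:, cols) + b;            % local result
    C(:, cols) = c;                % the "gather" step
end
```

After the loop, C should equal A + B, which is the property the gather on rank 0 relies on.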

SCV parallel MATLAB Example 1 – batch script

    #!/bin/csh
    # Example SGE script for running parallel MATLAB jobs on Katana
    # Submit job with the command: qsub batch_sge.scv
    # "#$ qsub_option" is interpreted by qsub as if "qsub_option" was passed to qsub on commandline.
    # Set hard runtime (wallclock) limit, default is 2 hours. Format: -l h_rt=HH:MM:SS
    #$ -l h_rt=2:00:00
    # Merge stderr into the stdout file to reduce clutter.
    #$ -j y
    # Invoke Parallel Environment for N processors. No default value, it must be specified.
    # For MATLAB apps, DO NOT select omp
    #$ -pe 1_per_node 4
    # end of qsub options
    # By default, the script is executed in the directory from which it was submitted
    # with qsub. You might want to change directories before invoking mpirun ...
    cd $PWD
    # running the following script generates multiple concurrent copies of MATLAB
    # Use addpath in startup.m to add path to all necessary matlab m-files
    # batch_sge and sge_matlab should live in either $HOME/bin or $PWD
    sge_matlab $PWD scv_matlab_example.m

SCV parallel MATLAB – Example 2

The airplane is represented with patches of quadrilateral elements, and the integral formulation is discretized to yield

$$\sum_{j=1}^{N_e} A_{ij}\,\phi_j = \sum_{j=1}^{N_e} B_{ij}\,\psi_j, \qquad i = 1, \ldots, N_e$$

[The definitions of A_ij (in terms of I_ij and C_ij), of the integrals B_ij and C_ij, and of the distance r are not recoverable from the source.]

ψ is the known Neumann boundary condition.
φ is the unknown to be solved for.
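Once the coefficient matrices are assembled, a discretized system in which φ is unknown and ψ is known, assumed here to take the matrix form Aφ = Bψ, can be solved with MATLAB's backslash operator. The matrices below are random stand-ins chosen to be well conditioned; they are not the panel integrals of this example, and the size Ne = 50 is arbitrary.

```matlab
rng(0);                              % reproducible stand-in data
Ne  = 50;                            % illustrative number of elements
A   = eye(Ne) + 0.01*rand(Ne);       % stand-in coefficients, not the panel integrals
B   = rand(Ne);
psi = rand(Ne, 1);                   % known boundary data

phi = A \ (B*psi);                   % solve A*phi = B*psi for the unknown phi

residual = norm(A*phi - B*psi);
```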

[Figure: parallel MATLAB Example 2 – geometry]

[Figure: parallel MATLAB Example 2 – timings]

[Figure: How slow is MATLAB compared with C?]

4 Distributed Computing Toolbox

• The MathWorks offers the Distributed Computing Toolbox (DCT), a parallel MATLAB package that uses the cluster’s high speed interconnect for inter-processor communications.
• At present, DCT is not available on SCV machines.

5 Star-P

• Star-P is a parallel MATLAB product of Interactive Supercomputing, Inc. It bears some resemblance to the pMatlab package in that it enables parallel MATLAB while shielding programmers from most of the lower-level parallel programming.
• Like MathWorks’ DCT, Star-P is a parallel MATLAB package that uses a high speed interconnect for inter-processor communications.
• At present, this package is not available on SCV machines.

Useful SCV Info

• SCV home page (http://scv.bu.edu/)
• Resource Applications (https://acct.bu.edu/SCF)
• Help
  – Web-based tutorials (http://scv.bu.edu/) (MPI, OpenMP, MATLAB, IDL, Graphics tools)
  – HPC consultations by appointment
    • Kadin Tseng ([email protected])
    • Doug Sondak ([email protected])