Resources for parallel computing

BLAS

Basic linear algebra subprograms. Originally published in ACM TOMS (1979) (Linpack → Blas + Lapack). The routines implement matrix operations up to matrix-matrix multiplication and triangular solve, but not matrix factorizations or eigenvalue calculations. A reference implementation is on netlib.org.

Web page: www.netlib.org/blas

"Frequently asked questions" (BLAS)
1) What and where are the BLAS?
2) Are there legal restrictions on the use of BLAS reference implementation software?
3) Publications/references for the BLAS?
4) Is there a Quick Reference Guide to the BLAS available?
5) Are optimized BLAS libraries available? Where can I find vendor supplied BLAS?
6) Where can I find Java BLAS?
7) Is there a C interface to the BLAS?





8) Are prebuilt reference implementations of the Fortran77 BLAS available?
9) What about shared memory machines? Are there multithreaded versions of the BLAS available?

1) What and where are the BLAS?
The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-….
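To make "Level 1" concrete, here is a minimal C sketch of one vector-vector operation, y := a·x + y, with the stride (increment) arguments that BLAS routines use to step through a vector. The name axpy_sketch and the restriction to positive strides are assumptions made for illustration; this is not the netlib routine itself.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch only (hypothetical name, not a BLAS routine):
   y := a*x + y for n elements, where consecutive elements of x are
   incx apart in memory and those of y are incy apart.
   Only positive strides are handled here. */
void axpy_sketch(int n, double a, const double *x, int incx,
                 double *y, int incy)
{
    assert(n >= 0 && incx > 0 && incy > 0);
    for (int i = 0; i < n; i++)
        y[(size_t)i * (size_t)incy] += a * x[(size_t)i * (size_t)incx];
}
```

A stride of 2 on x, for example, picks every second element, which is how a row of a column-major matrix with two rows can be addressed without copying it.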

On Wikipedia

Example of a reference subroutine:

      SUBROUTINE DGEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX,
     $                   BETA, Y, INCY )
*     .. Scalar Arguments ..
      DOUBLE PRECISION   ALPHA, BETA
      INTEGER            INCX, INCY, LDA, M, N
      CHARACTER*1        TRANS
*     .. Array Arguments ..
      DOUBLE PRECISION   A( LDA, * ), X( * ), Y( * )
*
*  Purpose
*  =======
*
*  DGEMV  performs one of the matrix-vector operations
*
*     y := alpha*A*x + beta*y,   or   y := alpha*A'*x + beta*y,
*
*  where alpha and beta are scalars, x and y are vectors and A is an
*  m by n matrix.
*
*  Parameters
*  ==========
      ...
*  M      - INTEGER.
*           On entry, M specifies the number of rows of the matrix A.
*           M must be at least zero.
      ...
            IF( INCY.EQ.1 )THEN
               DO 60, J = 1, N
                  IF( X( JX ).NE.ZERO )THEN
                     TEMP = ALPHA*X( JX )
                     DO 50, I = 1, M
                        Y( I ) = Y( I ) + TEMP*A( I, J )
   50                CONTINUE
                  END IF
                  JX = JX + INCX
   60          CONTINUE
      ...
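The inner loops of the Fortran excerpt walk down one column of A at a time, which matches Fortran's column-major storage. A rough C transcription of just that branch (no transpose, unit strides, beta already applied to y) might look as follows. The function name gemv_notrans_sketch is hypothetical, and the sketch leaves out the argument checking and the other branches of the reference routine.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch only (hypothetical name): y := y + alpha*A*x where
   A is m-by-n, stored column by column with leading dimension lda,
   and x and y have unit stride. Mirrors the DO 60 / DO 50 loops above. */
void gemv_notrans_sketch(int m, int n, double alpha,
                         const double *a, int lda,
                         const double *x, double *y)
{
    assert(m >= 0 && n >= 0 && lda >= (m > 1 ? m : 1));
    for (int j = 0; j < n; j++) {
        if (x[j] != 0.0) {                            /* skip zero entries of x */
            double temp = alpha * x[j];
            const double *col = a + (size_t)j * (size_t)lda;  /* column j of A */
            for (int i = 0; i < m; i++)
                y[i] += temp * col[i];
        }
    }
}
```

Element A(i,j) in Fortran notation (1-based) sits at a[(i-1) + (j-1)*lda] in C, which is why the inner loop runs over contiguous memory.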

BLAS quick reference

(see www.netlib.org/blas/blasqr.pdf)



LAPACK

Fortran subroutines for linear equations (dense, banded), linear least squares problems, eigenvalue problems and singular values. There is a printed user guide (LAPACK Users' Guide, 11 authors, SIAM, 1999); part of this guide is available as several html documents on www.netlib.org/lapack/lug, and there are man pages on www.netlib.org/lapack/manpages.tgz which are worth installing. The routines were written with parallel computation in mind.

Web page




Atlas

www.math-atlas.sourceforge.net. Open source implementation of BLAS and a few LAPACK routines. These must be (or should be) built (with make) on the machine where they are to be used (may take a few hours?).

GotoBLAS (pronounced: gótóblas)

http://www.tacc.utexas.edu/resources/software/. The web page boasts: "currently the fastest implementation of the Basic Linear Algebra Subroutines".



Intel MKL (Math kernel library)

Academic license $160. Includes:
• BLAS
• Selected LAPACK routines
• Fortran 95 interface
• CBLAS (interface to call from C)
• Sparse BLAS
• Sparse linear equation solvers
• ScaLAPACK
• Some statistical functions (incl. random number generation)
• Some MPI support
• Fast Fourier transforms
• PDE solution support (I think incl. Poisson solver)
• Some numerical optimization

User manual and reference manual

These come with the program, and are also both available on


(107 pages and 3250 pages).

AMD Core Math Library (ACML)

web page




DGEMM benchmarks

From Intel web site:



From GotoBLAS web site:

From a neutral (?) web site:



Example of BLAS use

The following examples are from programs for evaluation of VARMA time-series likelihood that I wrote first in Matlab (~3500 lines) and am close to finishing translating into C (~7000 lines). Eventually I hope to call some parallel BLAS routines and report on timing comparison between the Matlab and C versions. The Matlab programs are on www.hi.is/~jonasson, and the C-programs are on the way there.

omega_factor.m

%OMEGA_FACTOR  Cholesky factorization of Omega
%
%  [Lu,Ll,info] = omega_factor(Su,Olow,p,q,ko) calculates the Cholesky
%  factorization Omega = L·L' of Omega which is stored in two parts, a full
%  upper left partition, Su, and a block-band lower partition, Olow, as returned
%  by omega_build. Omega is symmetric, only the lower triangle of Su is
%  populated, and Olow only stores diagonal and subdiagonal blocks. On exit, L =
%  [L1; L2] with L1 = [Lu 0] and L2 is stored in block-band-storage in Ll. Info
%  is 0 on success, otherwise the loop index resulting in a negative number
%  square root. P and q are the dimensions of the problem and ko is a vector
%  with ko(t) = number of observed values before time t.
%
%  In the complete data case ko should be 0:r:n*r. For missing values, Su and
%  Olow are the upper left and lower partitions of Omega_o = Omega with missing
%  rows and columns removed. In this case Lu and Ll return L_o, the Cholesky
%  factor of Omega_o.

function [Lu,Ll,info] = omega_factor(Su,Olow,p,q,ko)
  n = length(ko)-1;
  h = max(p,q);
  ro = diff(ko);
  Ll = zeros(size(Olow));
  [Lu,info] = chol(Su');  % upper left partition
  if info>0, return; end
  Lu = Lu';
  e = ko(h+1);            % order of Su
  for t = h+1 : n         % loop over block-lines in Olow
    K = ko(t)+1 : ko(t+1);
    KL = K - ko(t-q);
    JL = 1 : ko(t)-ko(t-q);
    tmin = t-q; tmax = t-1;
    Ll(K-e, JL) = omega_forward(Lu, Ll, Olow(K-e,JL)', p, q, ko, tmin, tmax)';
    [Ltt, info] = chol(Olow(K-e,KL) - Ll(K-e,JL)*Ll(K-e,JL)');
    if info>0, info = info + ko(t); return; end
    Ll(K-e, KL) = Ltt';
  end





OmegaFactor.c

#include "xAssert.h"
#include "BlasGateway.h"
#include "VarmaUtilities.h"
#include "Omega.h"

void OmegaFactor ( // Cholesky-factorize Omega, or, for missing values, Omega_o
  double Su[],     // in/out  upper left part of Omega, dimension mSu × mSu
  double Olow[],   // in/out  mOlow×nOlow, block-diagonals of lower part of Omega
  int p,           // in      number of autoregressive terms
  int q,           // in      number of moving average terms
  int n,           // in      length of time series
  int ko[],        // in      ko[t] = N of observed values before time t+1, t<=n
  int *info)       // out     0 if ok, otherwise k for first nonpositive Ltt
{
  // Finds Cholesky factorization of covariance matrix Omega_o for missing value
  // VARMA log-likelihood. Also handles complete data with ko[i]=r*i for all i.
  // The Cholesky factors overwrite Omega in memory. All matrices are stored in
  // Fortran fashion.
  double *U, *Ltt;
  int t, j, ro;
  int h = max(p,q);
  int mSu = ko[h];        // order of Su
  int mOlow = ko[n]-mSu;  // no. of rows in Olow

  if (mSu>0)
    potrf("Low", mSu, Su, mSu, info);
  else
    *info = 0;
  xAssert(*info >= 0);
  if (*info>0) return;

  U = Olow;
  for (t=h; t<n; t++) {
    ro = ko[t+1] - ko[t];
    if (ro>0) {
      j = ko[t] - ko[t-q];
      Ltt = U + mOlow*j;
      // Solve L(0:t-1,0:t-1)·U' = s' (U contains s on call):
      OmegaForward("T", Su, Olow, p, q, ko, n, U, mOlow, ro, t-q, t-1);
      // Ltt-U·U' and then Cholesky of that
      syrk("Low", "N", ro, j, -1.0, U, mOlow, 1.0, Ltt, mOlow);
      potrf("Low", ro, Ltt, mOlow, info);
      if (*info>0) { *info += ko[t]; return; }
    }
    U += ro;
  }
}
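The update step in the loop (subtract U·U' from the diagonal block Ltt, then factorize the result) may be easier to follow with the BLAS call written out as plain loops. The sketch below is a stand-in for the syrk call with "Low", "N", alpha = -1 and beta = 1 only. The name syrk_lower_sketch is hypothetical and the code is not taken from OmegaFactor.c or from any BLAS library; the Cholesky step that follows in the program (potrf) is not reproduced.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch only (hypothetical name): C := C - U*U', touching only
   the lower triangle of C. C is n-by-n and U is n-by-k, both stored
   column by column with leading dimensions ldc and ldu. */
void syrk_lower_sketch(int n, int k, const double *u, int ldu,
                       double *c, int ldc)
{
    assert(n >= 0 && k >= 0 && ldu >= n && ldc >= n);
    for (int j = 0; j < n; j++) {
        for (int l = 0; l < k; l++) {
            double ujl = u[j + (size_t)l * (size_t)ldu];   /* U(j,l) */
            for (int i = j; i < n; i++)                    /* rows on or below the diagonal */
                c[i + (size_t)j * (size_t)ldc] -= u[i + (size_t)l * (size_t)ldu] * ujl;
        }
    }
}
```

Because only the lower triangle is read and written, the strictly upper part of the block can hold anything, which is what allows the Cholesky factor to overwrite Omega in place.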





From omega_forward.m

  LT.LT = true;
  ⋮
  if m>0, X(1:m,:) = linsolve(Lu(j:end,j:end), Y(1:m,:), LT); end
  for t=max(h+1,tmin):tmax
    t1 = max(t-q,tmin);
    J = ko(t1)+1 : ko(t);
    K = ko(t)+1 : ko(t+1);
    JX = J - j + 1;
    KX = K - j + 1;
    JL = J - ko(t-q);
    KL = K - ko(t-q);
    X(KX,:) = X(KX,:) - Ll(K-e,JL)*X(JX,:);
    X(KX,:) = linsolve(Ll(K-e,KL), X(KX,:), LT);
  end


From OmegaForward.c

  ⋮
  if (transp && e>0)
    trsm("Right", "Low", "T", "NotUdia", nY, m, 1.0, Luii, e, Y, iY);
  else if (e>0)
    trsm("Left","Low", "NoT", "NotUdia", m, nY, 1.0, Luii, e, Y, iY);

  incY = transp ? iY : 1;
  tbeg = max(h,tmin);
  for (t=tbeg; t<=tmax; t++) {
    k1 = max(t-q,tmin);
    k = ko[t] - ko[k1];
    Yt = Y + (ko[t] - j)*incY;
    Yk = Yt - k*incY;
    Lt = Ll + ko[t] - e;
    Ltt = Lt + mLl*(ko[t] - ko[t-q]);
    Ltk = Ltt - mLl*k;
    ro = ko[t+1] - ko[t];
    if (transp && iY>0) {
      gemm("NT", "T", nY, ro, k, -1.0, Yk, iY, Ltk, mLl, 1.0, Yt, iY);
      trsm("Right", "Low", "T", "NotUdia", nY, ro, 1.0, Ltt, mLl, Yt, iY);
    }
    else if (!transp && mLl>0) {
      gemm("NT", "NT", ro, nY, k, -1.0, Ltk, mLl, Yk, iY, 1.0, Yt, iY);
      trsm("Left", "Low", "NT", "NotUdia", ro, nY, 1.0, Ltt, mLl, Yt, iY);
    }
  }
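Each pass of the loop first subtracts the contribution of the blocks already solved (gemm) and then solves with the triangular diagonal block (trsm). For the non-transposed case, the triangular solve amounts to forward substitution on every right-hand-side column. The following is a minimal sketch of that operation under the assumption of a non-unit lower-triangular matrix in column-major storage; trsm_left_lower_sketch is a hypothetical name, and the real trsm covers many more cases (right-side, transposed, upper, unit diagonal, scaling by alpha).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch only (hypothetical name): overwrite B with the solution
   X of L*X = B, where L is m-by-m lower triangular with nonzero diagonal and
   B is m-by-n. Both are stored column by column (leading dimensions ldl, ldb). */
void trsm_left_lower_sketch(int m, int n, const double *l, int ldl,
                            double *b, int ldb)
{
    assert(m >= 0 && n >= 0 && ldl >= m && ldb >= m);
    for (int c = 0; c < n; c++) {
        double *x = b + (size_t)c * (size_t)ldb;        /* right-hand side c */
        for (int j = 0; j < m; j++) {
            x[j] /= l[j + (size_t)j * (size_t)ldl];     /* divide by L(j,j) */
            for (int i = j + 1; i < m; i++)             /* eliminate x[j] from later rows */
                x[i] -= x[j] * l[i + (size_t)j * (size_t)ldl];
        }
    }
}
```

The column-oriented order of the two inner loops is chosen so that L is read down its columns, in keeping with the Fortran-style storage used throughout the program.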





Gateway function to reference Blas/Lapack

// Gateway function to reference Blas/Lapack
#include "BlasGateway.h"
#include "blasF.h"

void gemm(char *transa, char *transb, int m, int n, int k, double alpha,
          double a[], int lda, double b[], int ldb, double beta, double c[], int ldc) {
  dgemm(transa, transb, &m, &n, &k, &alpha, a, &lda, b, &ldb, &beta, c, &ldc, 1, 1);
}
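The gateway exists because Fortran routines receive every argument by reference, while C code is more naturally written with scalars passed by value. The two trailing 1s in the dgemm call are presumably the lengths of the two character arguments, which many (but not all) Fortran compilers expect as hidden extra arguments. The sketch below imitates this pattern with a stand-in written in C: mock_tscale_ and tscale are hypothetical names invented for illustration, and the calling convention shown is an assumption that depends on the compiler in use.

```c
#include <assert.h>

/* Stand-in for a Fortran-compiled routine (hypothetical, for illustration):
   every argument arrives as a pointer, and the length of the character
   argument is appended as a hidden trailing integer.
   op = "N": x := alpha*x;  any other op: x := x/alpha. */
void mock_tscale_(const char *op, const int *n, const double *alpha,
                  double *x, int op_len)
{
    assert(op_len >= 1 && *n >= 0);
    for (int i = 0; i < *n; i++)
        x[i] = (op[0] == 'N') ? x[i] * (*alpha) : x[i] / (*alpha);
}

/* Gateway in the style of gemm above: scalars are taken by value and their
   addresses are passed on, so callers can write literals such as 3.0. */
void tscale(const char *op, int n, double alpha, double x[])
{
    mock_tscale_(op, &n, &alpha, x, 1);
}
```

With the gateway a call reads tscale("N", 2, 3.0, x); without it the caller would need named variables for n and alpha just to take their addresses.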


BLAS/LAPACK from Fortran examples

Cholesky factorization using Netlib LAPACK95

From http://www.netlib.org/lapack95/html/DOC:


Cholesky factorization from MKL

Excerpt from MKL Reference Manual:



Matrix-matrix multiply

Using Intel MKL (help from the reference manual):

There is a description (incompatible with MKL) and a link to a reference implementation on Netlib: http://www.netlib.org/blas/blast-forum (difficult to download and use). Here is an example from the description:

For calling from Fortran 77 see the call to the reference version of DGEMM as shown above in the quote from the MKL reference manual.



Sparse BLAS

A Fortran 95 reference implementation was published in ACM TOMS in 2002:

Included in MKL.

Originally defined by the BLAS Technical Forum, see Netlib (http://www.netlib.org/blas/blast-forum):




Netlib has a description and an implementation (interface to the Fortran reference BLAS), see http://www.netlib.org/blas/blast-forum. Example:

Notice that the dimension arguments are passed by value, and not by reference as is necessary when calling the Fortran routines directly from C.

Information on MKL web: www.intel.com/software/products/mkl/docs/mklqref/

Atlas (see above) has a free implementation and a quick reference card.

GNU Scientific Library (http://www.gnu.org/software/gsl) has a BLAS interface of its own design (with their own vector / matrix data types).