
Page 1: Libraries and Their Performance

NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER

March 17, 2003

Libraries and Their Performance

Frank V. Hale

Thomas M. DeBoni

NERSC User Services Group

Page 2: Libraries and Their Performance


Part I: Single Node Performance Measurement

• Use of hpmcount for measurement of total code performance

• Use of HPM Toolkit for measurement of code section performance

• Vector operations generally give better performance than scalar (indexed) operations

• Shared-memory, SMP parallelism can be very effective and easy to use

Page 3: Libraries and Their Performance


Demonstration Problem

• Compute π using random points in the unit square (the ratio of points falling inside the inscribed circle to the total number of points approximates π/4)

• Use an input file containing a sequence of 134,217,728 uniformly distributed random numbers in the range 0-1, stored as unformatted 8-byte floating-point numbers (1 gigabyte of data)
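The slides do not show how this data file was produced. As a rough sketch only (hypothetical code, using the Fortran 90 random_number intrinsic rather than whatever generator was actually used), a file with one unformatted 8-byte value per record, matching the scalar layout the first codes read, could be written like this:

c     Hypothetical sketch only: write 2**27 uniform random numbers,
c     one 8-byte value per unformatted record; the actual generator
c     used for runiform1.dat is not shown in the slides.
      implicit none
      integer :: i
      real(kind=8) :: r
      open(10,file="runiform1.dat",form="unformatted",status="replace")
      do i=1,134217728
         call random_number(r)
         write(10) r
      enddo
      close(10)
      end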

Page 4: Libraries and Their Performance


A first Fortran code

% cat estpi1.f
      implicit none
      integer i,points,circle
      real*8 x,y
      read(*,*)points
      open(10,file="runiform1.dat",status="old",form="unformatted")
      circle = 0
c     repeat for each (x,y) data point: read and compute
      do i=1,points
        read(10)x
        read(10)y
        if (sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5) circle = circle + 1
      enddo
      write(*,*)"Estimated pi using ",points," points as ",
     .          ((4.*circle)/points)
      end

Page 5: Libraries and Their Performance


Compile and Run with hpmcount

% cat jobestpi1
#@ class = debug
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 00:29:00
#@ notification = always
#@ job_type = serial
#@ output = jobestpi1.out
#@ error = jobestpi1.out
#@ environment = COPY_ALL
#@ queue
setenv FC "xlf_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 "
$FC -o estpi1 estpi1.f
echo "10000" > estpi1.dat
hpmcount ./estpi1 <estpi1.dat
exit

Page 6: Libraries and Their Performance


Performance of first code

    Points          Pi   Wall Clock (sec.)   Mflip/s
        10     3.56000               0.055     0.007
       100     3.36000               0.030     0.033
     1,000    3.196000               0.038     0.189
    10,000     3.15000               0.120     0.587
   100,000     3.14700               0.936     0.748
 1,000,000     3.14099               8.979     0.780
10,000,000     3.14199              89.194     0.785

Page 7: Libraries and Their Performance


Performance of first code

[Chart: wall clock time (sec.) vs. number of points, 10 to 10^7, on log-log axes.]

Page 8: Libraries and Their Performance


Some Observations

• Performance is not very good at all: less than 1 Mflip/s (peak is 1,500 Mflip/s per processor)

• Scalar approach to computation

• Scalar I/O mixed with scalar computation

Suggestions:

• Separate I/O from computation

• Use vector operations on dynamically allocated vector data structures

Page 9: Libraries and Their Performance


A second code, Fortran 90

% cat estpi2.f
      implicit none
      integer :: i, points, circle
      integer, allocatable, dimension(:) :: ones
      real(kind=8), allocatable, dimension(:) :: x,y
c     dynamically allocated vector data structures
      read(*,*)points
      allocate (x(points))
      allocate (y(points))
      allocate (ones(points))
      ones = 1
      open(10,file="runiform1.dat",status="old",form="unformatted")
      do i=1,points
        read(10)x(i)
        read(10)y(i)
      enddo
      circle = sum(ones,(sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5))
      write(*,*)"Estimated pi using ",points," points as ",
     &          ((4.*circle)/points)
      end

Page 10: Libraries and Their Performance


Performance of second code

    Points          Pi   Wall Clock (sec.)   Mflip/s
        10     3.56000               0.090     0.004
       100     3.36000               0.030     0.034
     1,000     3.19000               0.039     0.197
    10,000     3.15000               0.120     0.612
   100,000     3.14700               0.967     0.755
 1,000,000     3.14099               9.152     0.798
10,000,000     3.14199              91.170     0.801

Page 11: Libraries and Their Performance


Performance of second code

[Chart: wall clock time (sec.) vs. number of points, 10 to 10^7, on log-log axes.]

Page 12: Libraries and Their Performance


Observations on Second Code

• Operations on whole vectors should be faster, but

• No real improvement in performance of total code was observed.

• Suspect that most time is being spent on I/O.

• I/O is now separate from computation, so the code is easy to instrument in sections

Page 13: Libraries and Their Performance


Instrument code sections with HPM Toolkit

Four sections to be separately measured:

• Data structure initialization

• Read data

• Estimate π

• Write output

Calls to f_hpmstart and f_hpmstop around each section.

Page 14: Libraries and Their Performance


Instrumented Code (1 of 2)

% cat estpi3.f
      implicit none
      integer :: i, points, circle
      integer, allocatable, dimension(:) :: ones
      real(kind=8), allocatable, dimension(:) :: x,y
#include "f_hpm.h"
      call f_hpminit(0,"Instrumented code")
      call f_hpmstart(1,"Initialize data structures")
      read(*,*)points
      allocate (x(points))
      allocate (y(points))
      allocate (ones(points))
      ones = 1
      call f_hpmstop(1)

Page 15: Libraries and Their Performance


Instrumented Code (2 of 2)

      call f_hpmstart(2,"Read data")
      open(10,file="runiform1.dat",status="old",form="unformatted")
      do i=1,points
        read(10)x(i)
        read(10)y(i)
      enddo
      call f_hpmstop(2)
      call f_hpmstart(3,"Estimate pi")
      circle = sum(ones,(sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5))
      call f_hpmstop(3)
      call f_hpmstart(4,"Write output")
      write(*,*)"Estimated pi using ",points," points as ",
     &          ((4.*circle)/points)
      call f_hpmstop(4)
      call f_hpmterminate(0)
      end

Page 16: Libraries and Their Performance


Notes on Instrumented Code

• The entire executable code is enclosed between f_hpminit and f_hpmterminate

• Code sections are enclosed between f_hpmstart and f_hpmstop

• Descriptive text labels appear in output file(s)

Page 17: Libraries and Their Performance


Compile and Run with HPM Toolkit

% cat jobestpi3
#@ class = debug
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 00:29:00
#@ notification = always
#@ job_type = serial
#@ output = jobestpi3.out
#@ error = jobestpi3.out
#@ environment = COPY_ALL
#@ queue
module load hpmtoolkit
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f"
$FC -o estpi3 estpi3.f
echo "10000000" > estpi3.dat
./estpi3 <estpi3.dat
exit

Page 18: Libraries and Their Performance


Notes on Use of HPM Toolkit

• Must load module hpmtoolkit

• Need to include the header file f_hpm.h in the Fortran code, and give preprocessor directions to the compiler with -qsuffix

• Performance output is written to a file named like perfhpmNNNN.MMMMM, where NNNN is the task id and MMMMM is the process id

• Message from the sample executable: libHPM output in perfhpm0000.21410

Page 19: Libraries and Their Performance


Comparison of Code Sections

Section              Wall Clock (sec.)   % Time   Mflip/s
Init Data Structs                0.248     0.27     0.000
Read Data                       89.933    99.02     0.000
Estimate Pi                      0.641     0.71   114.327
Write Output                     0.001     0.00     0.381
Total                           90.823   100.00     0.801

10,000,000 points

Page 20: Libraries and Their Performance


Observations on Sections

• Optimization of the estimation of π has little effect, because the code spends 99% of the time reading the data

• Can the I/O be optimized?

Page 21: Libraries and Their Performance


Reworking the I/O

• Whole-array I/O versus scalar I/O

• The scalar I/O file (one number per record) is twice as big (8 bytes for the number, 8 bytes for the end-of-record marker)

• The whole-array I/O file has only one end-of-record marker

• Only one call to the Fortran read routine is needed for whole-array I/O:

      read(10)xy

• Some fancy array footwork is needed to sort out x(1), y(1), x(2), y(2), ... x(n), y(n) from the xy array:

      x = xy(1::2)
      y = xy(2::2)

Page 22: Libraries and Their Performance


Revised Data Structures and I/O

% cat estpi4.f
      implicit none
      integer :: i, points, circle
      integer, allocatable, dimension(:) :: ones
      real(kind=8), allocatable, dimension(:) :: x, y, xy
#include "f_hpm.h"
      call f_hpminit(0,"Instrumented code")
      call f_hpmstart(1,"Initialize data structures")
      read(*,*)points
      allocate (x(points))
      allocate (y(points))
      allocate (xy(2*points))
      allocate (ones(points))
      ones = 1
      call f_hpmstop(1)
      call f_hpmstart(2,"Read data")
      open(10,file="runiform.dat",status="old",form="unformatted")
      read(10)xy
      x = xy(1::2)
      y = xy(2::2)
      call f_hpmstop(2)

Page 23: Libraries and Their Performance


Vector I/O Code Sections

Section              Wall Clock (sec.)   % Time   Mflip/s
Init Data Structs                0.252     6.00     0.000
Read Data                        3.162    75.34     0.000
Estimate Pi                      0.771    18.37    94.053
Write Output                     0.001     0.02     0.393
Total                            4.197   100.00      15.4

10,000,000 points

Page 24: Libraries and Their Performance


Observations on New Sections

• The time spent reading the data as a vector rather than a scalar was reduced from 89.9 to 3.16 seconds, a reduction of 96% of the I/O time.

• There was no performance penalty for the additional data structure complexity.

• I/O design can have very significant performance impacts!

• Total code performance with hpmcount is now 15.4 Mflip/s, roughly 20 times the 0.801 Mflip/s of the scalar-I/O code.

Page 25: Libraries and Their Performance


Automatic Shared-Memory (SMP) Parallelization

• IBM Fortran provides a –qsmp option for automatic, shared-memory parallelization, allowing multithreaded computation within a node.

• The default number of threads is 16; the number of threads is controlled by the OMP_NUM_THREADS environment variable (see the sketch after this list)

• Allows use of the SMP version of the ESSL library,

-lesslsmp
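As a usage sketch (the value 4 is only an illustration; the right thread count depends on the problem, as the thread-count charts later in the talk show), the thread count can be set in the csh job script before the SMP binary is run:

# Hypothetical csh fragment: pick the thread count, then run the
# -qsmp/-lesslsmp binary built on the next slide under hpmcount.
setenv OMP_NUM_THREADS 4
hpmcount ./estpi5 <estpi3.dat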

Page 26: Libraries and Their Performance


Compiler Options

• The source code is the same as the previous, vector operation example, estpi4.f

• The compiler option -qsmp and the link option -lesslsmp enable automatic shared-memory parallelism (SMP)

• Compiler command line:

  xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT \
    -qsuffix=cpp=f -qsmp -lesslsmp -o estpi5 estpi4.f

Page 27: Libraries and Their Performance


SMP Code Sections

Section              Wall Clock (sec.)   % Time   Mflip/s
Init Data Structs                0.534    10.87     0.000
Read Data                        4.311    87.78     0.000
Estimate Pi                      0.064     1.30    1,100. (up from 94)
Write Output                     0.002     0.04     0.117
Total                            4.911   100.00      15.4

10,000,000 points

Page 28: Libraries and Their Performance


Observations on SMP Code

• The computational section is now showing 1,100 Mflip/s, or 4.6% of the theoretical peak of 24,000 Mflip/s on a 16-processor node.

• The computational section is now 12 times faster, with no changes to the source code

• Recommendation: always use the thread-safe compilers (with the _r suffix) and -qsmp unless there is a good reason to do otherwise.

• There are no explicit parallelism directives in the source code; all threading is within the library.

Page 29: Libraries and Their Performance


Too Many Threads Can Spoil Performance

• Each node has 16 processors, and usually having more threads than processors will not improve performance

[Chart: computation Mflip/s (0 to 1,200) vs. number of threads (0 to 28).]

Page 30: Libraries and Their Performance


Sidebar: Cost of Misaligned Common Block

• User code with Fortran77 style common blocks may receive an innocuous warning:

1514-008 (W) Variable … is misaligned. This may affect the efficiency of the code.

• How much can this affect the efficiency of the code?

• Test: put arrays x and y in misaligned common, with a 1-byte character in front of them
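A minimal sketch of the kind of layout being tested (hypothetical variable names; the actual test code is not shown in the slides): a 1-byte character placed ahead of the real*8 arrays in a common block pushes them off their natural 8-byte alignment, which is what triggers the 1514-008 warning.

c     Hypothetical illustration of a misaligned common block
      character*1 pad
      real*8 x(10000000), y(10000000)
c     pad occupies 1 byte, so x and y start on odd byte offsets
      common /mydata/ pad, x, y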

Page 31: Libraries and Their Performance


Potential Cost of Misaligned Common Blocks

• 10,000,000 points used for computing π

• Properly aligned, dynamically allocated x and y: 0.064 seconds at 1,100 Mflip/s

• Misaligned, statically allocated x and y in a common block: 0.834 seconds at 88.4 Mflip/s

• Common block misalignment slowed the computation by a factor of about 12

Page 32: Libraries and Their Performance


Part I Conclusion

• hpmcount can be used to measure the performance of the total code

• HPM Toolkit can be used to measure the performance of discrete code sections

• Optimization effort must be focused effectively

• Fortran90 vector operations are generally faster than Fortran77 scalar operations

• Use of automatic SMP parallelization may provide an easy performance boost

• I/O may be the largest factor in “whole code” performance

• Misaligned common blocks can be very expensive

Page 33: Libraries and Their Performance


Part II: Comparing Libraries

• In the rich user environment on seaborg, there are many alternative ways to do the same computation

• The HPM Toolkit provides the tools to compare alternative approaches to the same computation

Page 34: Libraries and Their Performance


Dot Product Functions

• User coded scalar computation

• User coded vector computation

• Single processor ESSL ddot

• Multi-threaded SMP ESSL ddot

• Single processor IMSL ddot

• Single processor NAG f06eaf

• Multi-threaded SMP NAG f06eaf

Page 35: Libraries and Their Performance


Sample Problem

• Test the Cauchy-Schwarz inequality for N vectors of length N:

  (X·Y)² ≤ (X·X)(Y·Y)

• Generate 2N random numbers (array x2)

• Use the 1st N numbers for X; (X·X) is computed once

• Vary the vector Y: for i = 1, n

      y = 2.0*x2(i:n+(i-1))

  so the first Y is 2X, the second Y is 2(x2(2:N+1)), etc.

• Compute (2*N)+1 dot products of length N

Page 36: Libraries and Their Performance


Instrumented Code Section for Dot Products

      call f_hpmstart(1,"Dot products")
      xx = ddot(n,x,1,x,1)
      do i=1,n
        y = 2.0*x2(i:n+(i-1))
        yy = ddot(n,y,1,y,1)
        xy = ddot(n,x,1,y,1)
        diffs(i) = (xx*yy)-(xy*xy)
      enddo
      call f_hpmstop(1)
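For this kernel to compile against the single-processor libraries, ddot has to be declared with the right result type in the calling program. A minimal sketch of the declarations the loop assumes (the slides do not show them, so the names here simply follow the kernel):

      integer :: i, n
      real(kind=8) :: xx, yy, xy
      real(kind=8), allocatable, dimension(:) :: x, y, x2, diffs
      real(kind=8), external :: ddot   ! BLAS-style dot product from the linked library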

Page 37: Libraries and Their Performance


Two User Coded Functions

      real*8 function myddot(n,x,y)
      integer :: i,n
      real*8 :: x(n),y(n),dp
      dp = 0.
      do i=1,n
        dp = dp + x(i)*y(i)          ! User scalar loop
      enddo
      myddot = dp
      return
      end

      real*8 function myddot(n,x,y)
      integer :: i,n
      real*8 :: x(n),y(n)
      myddot = sum(x*y)              ! User vector computation
      return
      end

Page 38: Libraries and Their Performance


Compile and Run User Functions

module load hpmtoolkit
echo "100000" > libs.dat
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f"
$FC -o libs0 libs0.f
./libs0 <libs.dat
$FC -o libs0a libs0a.f
./libs0a <libs.dat

Page 39: Libraries and Their Performance


Compile and Run ESSL Versions

setenv FC "xlf90_r -q64 -O3 –qstrict -qarch=pwr3

-qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f

-lessl"

$FC -o libs1 libs1.f

./libs1 <libs.dat

setenv FC "xlf90_r -q64 -O3 –qstrict -qarch=pwr3

-qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f -qsmp

-lesslsmp"

$FC -o libs1smp libs1.f

./libs1smp <libs.dat

Page 40: Libraries and Their Performance


Compile and Run IMSL Version

module load imsl
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $IMSL"
$FC -o libs1imsl libs1.f
./libs1imsl <libs.dat
module unload imsl

Page 41: Libraries and Their Performance


Compile and Run NAG Versions

module load nag_64
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $NAG"
$FC -o libs1nag libsnag.f
./libs1nag <libs.dat
module unload nag

module load nag_smp64
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $NAG_SMP6 -qsmp=omp -qnosave "
$FC -o libs1nagsmp libsnag.f
./libs1nagsmp <libs.dat
module unload nag_smp64

Page 42: Libraries and Their Performance


First Comparison of Dot Product (N=100,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = Fastest)
User Scalar                246       203   1.72
User Vector                249       201   1.74
ESSL                       145       346   1.01
ESSL-SMP                   408       123   2.85   Slowest
IMSL                       143       351   1.00   Fastest
NAG                        250       200   1.75
NAG-SMP                    180       278   1.26

Page 43: Libraries and Their Performance


Comments on First Comparisons

• The best results, by just a little, were obtained using the IMSL library, with ESSL a close second

• Third best was the NAG-SMP routine, with benefits from multi-threaded computation

• The user coded routines and NAG were about 75% slower than the ESSL and IMSL routines. In general, library routines are highly optimized and better than user coded routines.

• The ESSL-SMP library did very poorly on this computation; this unexpected result may be due to data structures in the library, or perhaps the number of threads (default is 16).

Page 44: Libraries and Their Performance


ESSL-SMP Performance vs. Number of Threads

• All for N=100,000

• Number of threads controlled by the OMP_NUM_THREADS environment variable

[Chart: ddot Mflip/s (0 to 1,200) vs. number of threads (0 to 20); performance peaks near 4 threads (see the revised comparison on the next slide).]

Page 45: Libraries and Their Performance


Revised First Comparison of Dot Product (N=100,000)

Version                  Wall Clock (sec)   Mflip/s   Scaled Time (1 = Fastest)
User Scalar                           246       203   4.9
User Vector                           249       201   5.0
ESSL                                  145       346   2.9
ESSL-SMP (4 threads)                   50      1000   1.0   Fastest
IMSL                                  143       351   2.9
NAG                                   250       200   5.0   Slowest
NAG-SMP                               180       278   3.6

Tuning for the number of threads is very, very important for SMP codes!

Page 46: Libraries and Their Performance


Scaling up the Problem

• The first comparisons were for N=100,000 computing 200,001 dot products of vectors of length 100,000

• Second comparison for N=200,000 computes 400,001 dot products of vectors of length 200,000

• Increase computational complexity by a factor of 4 (twice as many dot products, each on vectors twice as long).

Page 47: Libraries and Their Performance


Second Comparison of Dot Product (N=200,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = Fastest)
User Scalar               1090       183   2.17
User Vector               1180       169   2.35   Slowest
ESSL                       739       271   1.47
ESSL-SMP                   503       398   1.00   Fastest
IMSL                       725       276   1.44
NAG                       1120       179   2.23
NAG-SMP                    864       231   1.72

Page 48: Libraries and Their Performance


Comments on Second Comparisons (N=200,000)

• Now the best results are from the ESSL-SMP library, with the default 16 threads

• The next best group is ESSL, IMSL and NAG-SMP, taking 50-75% longer than the ESSL-SMP routine.

• The worst results were seen from NAG (single thread) and the user code routines.

What is the impact of the number of threads on ESSL-SMP performance, given that it is already the fastest here?

Page 49: Libraries and Their Performance


ESSL-SMP Performance vs. Number of Threads

• All for N=200,000

• Number of threads controlled by the OMP_NUM_THREADS environment variable

[Chart: ddot Mflip/s (0 to 1,600) vs. number of threads (0 to 20); performance peaks near 6 threads (see the revised comparison on the next slide).]

Page 50: Libraries and Their Performance


Revised Second Comparison of Dot Product (N=200,000)

Version                  Wall Clock (sec)   Mflip/s   Scaled Time (1 = Fastest)
User Scalar                          1090       183   7.5
User Vector                          1180       169   8.1   Slowest
ESSL                                  739       271   5.1
ESSL-SMP (6 threads)                  146      1370   1.0   Fastest
IMSL                                  725       276   5.0
NAG                                  1120       179   7.7
NAG-SMP                               864       231   5.9

Page 51: Libraries and Their Performance


Scaling with Problem Size? (N1=100,000; N2=200,000; complexity ratio approx. 4)

Version       N2/N1 Wall Clock (sec)   N2/N1 Mflip/s
User Scalar                     4.45            0.90
User Vector                     4.75            0.84
ESSL                            5.10            0.78
ESSL-SMP                        2.92            1.37   (4 threads for N1; 6 threads for N2)
IMSL                            5.07            0.79
NAG                             4.48            0.90
NAG-SMP                         4.80            0.83

Page 52: Libraries and Their Performance


Comments on Scaling Problem Size

• The ESSL-SMP performance, when tuned for the optimal number of threads, increased by almost 40% with the increased problem size.

• The untuned ESSL-SMP performance increased by a factor of 3.2 with the increased problem size.

• The user codes, ESSL, IMSL, NAG and NAG-SMP routines all showed 10%-22% decreases in performance with the larger problem size.

• It is not possible to determine, a priori, how the performance of different, functionally equivalent routines will scale with problem size.

Page 53: Libraries and Their Performance


Matrix Multiplication

• User coded scalar computation

• Fortran intrinsic matmul

• Single processor ESSL dgemm

• Multi-threaded SMP ESSL dgemm

• Single processor IMSL dmrrrr (32-bit)

• Single processor NAG f01ckf

• Multi-threaded SMP NAG f01ckf

Page 54: Libraries and Their Performance


Sample Problem

• Multiply two dense N by N matrices, A and B

• A(i,j) = i + j

• B(i,j) = j – i

• Output C(N,N) to verify result

Page 55: Libraries and Their Performance


Kernel of user matrix multiply

      do i=1,n
        do j=1,n
          a(i,j) = real(i+j)
          b(i,j) = real(j-i)
        enddo
      enddo
      call f_hpmstart(1,"Matrix multiply")
      do j=1,n
        do k=1,n
          do i=1,n
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
          enddo
        enddo
      enddo
      call f_hpmstop(1)
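For contrast with the hand-coded loop, the library versions listed on the "Matrix Multiplication" slide reduce the multiply to a single statement or call. A sketch of what those look like (the slides do not show these calls; the dgemm arguments follow the standard BLAS interface):

! Fortran intrinsic version:
      c = matmul(a, b)

! ESSL / BLAS dgemm version: C = alpha*A*B + beta*C
! dgemm(transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc)
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)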

Page 56: Libraries and Their Performance


Comparison of Matrix Multiply (N1=5,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = Fastest)
User Scalar              1,490       168   106    Slowest
Intrinsic                1,477       169   106    Slowest
ESSL                       195     1,280   13.9
ESSL-SMP                    14    17,800   1.0    Fastest
IMSL                       194     1,290   13.8
NAG                        195     1,280   13.9
NAG-SMP                     14    17,800   1.0    Fastest

Page 57: Libraries and Their Performance


Observations on Matrix Multiply

• Fastest times were obtained by the two SMP libraries, ESSL-SMP and NAG-SMP, which both obtained 74% of the peak node performance

• All the single processor library functions took 14 times more wall clock time than the SMP versions, each obtaining about 85% of peak for a single processor

• Worst times were from the user code and the Fortran intrinsic, which took 100 times more wall clock time than the SMP libraries

Page 58: Libraries and Their Performance


Comparison of Matrix Multiply (N2=10,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time
ESSL-SMP                   101    19,800   1.01
NAG-SMP                    100    19,900   1.00

• Scaling with problem size (complexity increase approx. 8 times):

Version       Wall Clock (N2/N1)   Mflip/s (N2/N1)
ESSL-SMP                     7.2              1.10
NAG-SMP                      7.1              1.12

Both ESSL-SMP and NAG-SMP showed 10% performance gains with the larger problem size.

Page 59: Libraries and Their Performance


Observations on Scaling

• Scaling of problem size was only done for the SMP libraries, to fit into reasonable times.

• Doubling N results in an 8-fold increase in computational complexity for dense matrix multiplication (the operation count grows as N³)

• Performance actually increased for both routines for larger problem size.

Page 60: Libraries and Their Performance


ESSL-SMP Performance vs. Number of Threads

• All for N=10,000

• Number of threads controlled by the OMP_NUM_THREADS environment variable

[Chart: dgemm Mflip/s (0 to 20,000) vs. number of threads (0 to 36).]

Page 61: Libraries and Their Performance


Part II Conclusion

• The NERSC user environment provides a rich variety of mathematical libraries

• Performance can vary widely for the same computation, sometimes even for the same function name, from library to library; performance also varies with problem size and, for the SMP libraries, the number of threads

• It is not possible to know, a priori, which library will provide the best performance for a given function and problem size

• The HPM Toolkit provides a way to compare library routine performance and make informed choices

Page 62: Libraries and Their Performance


Part III: Moving to Multi-node Parallelism

• The examples so far have all been of single-processor or multi-processor, shared-memory (SMP-style) parallelism on a single 16-processor node

• The poe+ command is the multi-node equivalent of hpmcount, and poe+ can be used with MPI codes or multi-node, distributed-memory parallel libraries such as PESSL and ScaLAPACK.

• poe+ is a perl script developed by David Skinner of the NERSC User Services Group which aggregates hpmcount results for each distributed-memory process
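As a rough sketch of how such a run is launched (hypothetical LoadLeveler keywords chosen by analogy with the serial job scripts shown earlier; the actual parallel job script is not shown in the slides):

#@ class = debug
#@ job_type = parallel
#@ node = 4
#@ tasks_per_node = 16
#@ wall_clock_limit = 00:29:00
#@ queue
poe+ ./ABCp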

Page 63: Libraries and Their Performance


Kernel of PESSL/ScaLAPACK matrix multiply

! Call PESSL library routine
      call f_hpminit((me+1),"Instrumented code")
      call f_hpmstart((me+1),"Matrix multiply")
      call pdgemm('T','T',n,n,n,1.0d0, myA,1,1,ides_a, &
                  myB,1,1,ides_b, 0.d0,                &
                  myC,1,1,ides_c )
      call f_hpmstop(me+1)
      call f_hpmterminate(me+1)

Page 64: Libraries and Their Performance


Comments on PESSL/ScaLAPACK Code

• Although the kernel on the previous slide looks like a simple progression from the ESSL version, actually there is a lot of work involved in understanding PESSL/ScaLAPACK for new users

• There are a number of data structure complexities which do not exist for the single-node libraries

• The “complete” matrix does not exist on any processor, but is block-cyclic distributed among processors

• There are additional parameters for processor geometry and data distribution.

• New users should study the ScaLAPACK tutorial on the Web at http://www.netlib.org/scalapack/tutorial/

Page 65: Libraries and Their Performance


Prolog for PESSL/ScaLAPACK matrix multiply

! Initialize blacs processor grid
      call blacs_pinfo (me,procs)
      call blacs_get (0, 0, icontxt)
      call blacs_gridinit(icontxt, 'R', prow, pcol)
      call blacs_gridinfo(icontxt, prow, pcol, myrow, mycol)
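The grid shape prow x pcol (and the block size nb used on the next slide) are inputs the code has to set before calling blacs_gridinit; hypothetical values matching the 4x4 process grid of the 16-processor runs described later (the actual values, and the nb used, are not shown in the slides):

! Hypothetical settings, not shown in the slides
      prow = 4          ! process rows
      pcol = 4          ! process columns
      nb   = 64         ! block size (assumed value)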

Page 66: Libraries and Their Performance


More Prolog for PESSL/ScaLAPACK

! Construct local arrays
      myArows = numroc(n, nb, myrow, 0, prow)
      myAcols = numroc(n, nb, mycol, 0, pcol)
! Initialize local arrays
      allocate(myA(myArows,myAcols))
      allocate(myB(myArows,myAcols))
      allocate(myC(myArows,myAcols))
      do i=1,n
        call g2l(i,n,prow,nb,iproc,myi)
        if (myrow==iproc) then
          do j=1,n
            call g2l(j,n,pcol,nb,jproc,myj)
            if (mycol==jproc) then
              myA(myi,myj) = real(i+j)
              myB(myi,myj) = real(i-j)
              myC(myi,myj) = 0.d0
            endif
          enddo
        endif
      enddo
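The g2l helper used above (a global-to-local index mapping) is not shown in the slides; a sketch of what such a routine typically does for a 1-D block-cyclic distribution with block size nb over np processes (hypothetical implementation):

      subroutine g2l(g, n, np, nb, proc, loc)
      ! Map global index g (1-based) to the owning process and the local
      ! index for a block-cyclic distribution; n is unused here but kept
      ! to match the call site above.
      implicit none
      integer :: g, n, np, nb, proc, loc, blk
      blk  = (g-1)/nb                        ! which block g falls in
      proc = mod(blk, np)                    ! blocks dealt round-robin
      loc  = (blk/np)*nb + mod(g-1, nb) + 1  ! position in the local array
      end subroutine g2l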

Page 67: Libraries and Their Performance


Still More Prolog for PESSL/ScaLAPACK

! Prepare array descriptors for PESSL (ScaLAPACK style)
      ides_a(1) = 1        ! descriptor type
      ides_a(2) = icontxt  ! blacs context
      ides_a(3) = n        ! global number of rows
      ides_a(4) = n        ! global number of columns
      ides_a(5) = nb       ! row block size
      ides_a(6) = nb       ! column block size
      ides_a(7) = 0        ! initial process row
      ides_a(8) = 0        ! initial process column
      ides_a(9) = myArows  ! leading dimension of local array
      do i=1,9
        ides_b(i) = ides_a(i)
        ides_c(i) = ides_a(i)
      enddo

Page 68: Libraries and Their Performance


Compile Uninstrumented Codes and Run with poe+

setenv FC "mpxlf90 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 -bmaxdata:0x80000000 -bmaxstack:0x80000000 "

$FC -o ABCp -lblacs -lpessl ABCp.f

module load scalapack$FC -o ABCs -qfree $PBLAS $BLACS $SCALAPACK -lessl

ABCp.f

poe+ ./ABCp ! PESSL versionpoe+ ./ABCs ! ScaLAPACK version

Page 69: Libraries and Their Performance


Four Runs for PESSL and ScaLAPACK Codes

• N=5000, 16 processors (one node) in 4x4 processor array

• N=10,000, 16 processors (one node) in 4x4 processor array

• N=5000, 64 processors (four nodes) in 8x8 processor array

• N=10000, 64 processors (four nodes) in 8x8 processor array

• Compare “whole code” performance using poe+ with “whole code” results for single-node ESSL-SMP routine using hpmcount.

• poe+ returns average wall clock time across all processes, and aggregate Mflip/s of all processes

Page 70: Libraries and Their Performance


Comparison of PESSL/ScaLAPACK dgemm (n=5000, 16 processors, "whole code" performance)

Section       Wall Clock (sec.)   Mflip/s   Scaled Time (1.00 = ESSL-SMP, 22 s)
PESSL                      28.3     8,850   1.30
ScaLAPACK                  30.4     8,240   1.40

ESSL-SMP achieved 47% of theoretical peak performance for one node

PESSL achieved 37%, and ScaLAPACK achieved 34%.

Page 71: Libraries and Their Performance


Comparison of PESSL/ScaLAPACK dgemm (n=10000, 16 processors, "whole code")

Section       Wall Clock (sec.)   Mflip/s   Scaled Time (1.00 = ESSL-SMP, 120 s)
PESSL                      141.    14,230   1.20
ScaLAPACK                  160.    12,500   1.30

ESSL-SMP achieved 70% of theoretical peak performance for one node

PESSL achieved 59%, and ScaLAPACK achieved 52%.

Page 72: Libraries and Their Performance


Comparison of PESSL/ScaLAPACK dgemm (n=5000, 64 processors, "whole code")

Section       Wall Clock (sec.)   Mflip/s   Scaled Time (1.00 = ESSL-SMP, 22 s)
PESSL                      15.3    16,400   0.70
ScaLAPACK                  14.2    17,600   0.65

PESSL achieved 17% of the theoretical peak for 4 nodes (96,000 Mflip/s), and ScaLAPACK achieved 18%.

Page 73: Libraries and Their Performance


Comparison of PESSL/ScaLAPACK dgemm (n=10000, 64 processors, "whole code")

Section       Wall Clock (sec.)   Mflip/s   Scaled Time (1.00 = ESSL-SMP, 120 s)
PESSL                      51.5    38,900   0.43
ScaLAPACK                  58.3    34,400   0.49

PESSL achieved 41% of the theoretical peak for 4 nodes (96,000 Mflip/s), and ScaLAPACK achieved 36%.

Page 74: Libraries and Their Performance


Comments on PESSL and ScaLAPACK Codes

• For problem sizes that fit within one node, the shared-memory, SMP libraries may give better performance than the distributed-memory, parallel libraries because of differences in data communication costs

• As the number of nodes and processors is increased, wall-clock time for distributed-memory libraries may drop below shared-memory SMP libraries for the same problem size, but per-processor efficiency may also drop.

• For problems which cannot fit in a node, the distributed-memory parallel libraries provide the best solution

Page 75: Libraries and Their Performance


Comments on using HPM Toolkit with PESSL and ScaLAPACK Codes

• HPM Toolkit generates two output files per task (one for statistics, one for visualization).

• Performance statistics for each task are found in files with names perfhpmNNNN.PPPPP where NNNN is the task id (or processor number), and PPPPP is the AIX process id

• Performance variations between processors and nodes can be observed.

Page 76: Libraries and Their Performance


PESSL dgemm results for Small Instrumented Section

• For N=5,000, 16 processors (one node), PESSL pdgemm:
  – average time of 16.9 seconds
  – aggregate 14,800 Mflip/s
  – 62% of the theoretical peak performance for a node

• For N=10,000, 64 processors (four nodes), PESSL pdgemm:
  – average time of 40.1 seconds
  – aggregate 50,000 Mflip/s
  – 52% of the theoretical peak performance for four nodes

Page 77: Libraries and Their Performance


Variability in PESSL dgemm Small Instrumented Section

• For N=5,000, 16 processors (one node), PESSL pdgemm:
  – wall clock for each processor varies from 16.4 to 17.4 sec
  – Mflip/s for each processor varies from 850 to 1000

• For N=10,000, 64 processors (four nodes), PESSL pdgemm:
  – wall clock for each processor varies from 39.25 to 40.75 sec
  – Mflip/s for each processor varies from 730 to 830

Page 78: Libraries and Their Performance


PESSL dgemm Task Variation (n=5000, 16 processors)

[Scatter plot: per-task Mflip/s (840 to 1,020) vs. per-task wall clock (16.2 to 17.6 s).]

Page 79: Libraries and Their Performance


PESSL dgemm Task Variation (n=10000, 64 processors)

[Scatter plot: per-task Mflip/s (720 to 840) vs. per-task wall clock (39 to 41 s).]

Page 80: Libraries and Their Performance


Part III Conclusion

• NERSC provides a variety of distributed-memory, multi-node mathematical libraries (PESSL, ScaLAPACK and NAG Parallel).

• Performance of these libraries can be measured using “whole code” approaches with poe+, similar to hpmcount for single node codes

• The HPM Toolkit can be used to instrument small sections of codes for more detailed analysis, including variation between tasks; but a number of output files are produced and must be analyzed by the user.

Page 81: Libraries and Their Performance


References

• Information on hpmcount and poe+ for whole code performance measurement is available on the NERSC Website at http://hpcf.nersc.gov/software/ibm/hpmcount/

• Detailed information about the HPM Toolkit for measuring performance of discrete code sections is available on the NERSC Website at http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_2_4_2.html

• The list of mathematical libraries available on seaborg can be found on the NERSC Website at http://hpcf.nersc.gov/software/ibm/#mathlibs