
Page 1: Libraries and Their Performance

NATIONAL ENERGY RESEARCH SCIENTIFIC COMPUTING CENTER

March 17, 2003

Libraries and Their Performance

Frank V. Hale

Thomas M. DeBoni

NERSC User Services Group

Page 2: Libraries and Their Performance


Part I: Single Node Performance Measurement

• Use of hpmcount for measurement of total code performance

• Use of HPM Toolkit for measurement of code section performance

• Vector operations generally give better performance than scalar (indexed) operations

• Shared-memory, SMP parallelism can be very effective and easy to use

Page 3: Libraries and Their Performance


Demonstration Problem

• Compute π using random points in the unit square (the ratio of points falling inside the inscribed circle to the total number of points approximates π/4)

• Use an input file containing a sequence of 134,217,728 uniformly distributed random numbers in the range 0-1, stored as unformatted 8-byte floating-point numbers (1 gigabyte of data)
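The slides do not show how this data file was produced. As a rough sketch only (hypothetical code, using the Fortran 90 random_number intrinsic rather than whatever generator was actually used), a file with one unformatted 8-byte value per record, matching the scalar layout the first codes read, could be written like this:

c     Hypothetical sketch only: write 2**27 uniform random numbers,
c     one 8-byte value per unformatted record; the actual generator
c     used for runiform1.dat is not shown in the slides.
      implicit none
      integer :: i
      real(kind=8) :: r
      open(10,file="runiform1.dat",form="unformatted",status="replace")
      do i=1,134217728
         call random_number(r)
         write(10) r
      enddo
      close(10)
      end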

Page 4: Libraries and Their Performance


A first Fortran code

% cat estpi1.f
      implicit none
      integer i,points,circle
      real*8 x,y
      read(*,*)points
      open(10,file="runiform1.dat",status="old",form="unformatted")
      circle = 0
c     repeat for each (x,y) data point: read and compute
      do i=1,points
        read(10)x
        read(10)y
        if (sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5) circle = circle + 1
      enddo
      write(*,*)"Estimated pi using ",points," points as ",
     .          ((4.*circle)/points)
      end

Page 5: Libraries and Their Performance


Compile and Run with hpmcount

% cat jobestpi1
#@ class = debug
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 00:29:00
#@ notification = always
#@ job_type = serial
#@ output = jobestpi1.out
#@ error = jobestpi1.out
#@ environment = COPY_ALL
#@ queue
setenv FC "xlf_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 "
$FC -o estpi1 estpi1.f
echo "10000" > estpi1.dat
hpmcount ./estpi1 <estpi1.dat
exit

Page 6: Libraries and Their Performance


Performance of first code

    Points          Pi   Wall Clock (sec.)   Mflip/s
        10     3.56000               0.055     0.007
       100     3.36000               0.030     0.033
     1,000    3.196000               0.038     0.189
    10,000     3.15000               0.120     0.587
   100,000     3.14700               0.936     0.748
 1,000,000     3.14099               8.979     0.780
10,000,000     3.14199              89.194     0.785

Page 7: Libraries and Their Performance


Performance of first code

[Chart: wall clock time (sec.) vs. number of points, 10 to 10^7, on log-log axes.]

Page 8: Libraries and Their Performance


Some Observations

• Performance is not very good at all: less than 1 Mflip/s (peak is 1,500 Mflip/s per processor)

• Scalar approach to computation

• Scalar I/O mixed with scalar computation

Suggestions:

• Separate I/O from computation

• Use vector operations on dynamically allocated vector data structures

Page 9: Libraries and Their Performance


A second code, Fortran 90

% cat estpi2.f
      implicit none
      integer :: i, points, circle
      integer, allocatable, dimension(:) :: ones
      real(kind=8), allocatable, dimension(:) :: x,y
c     dynamically allocated vector data structures
      read(*,*)points
      allocate (x(points))
      allocate (y(points))
      allocate (ones(points))
      ones = 1
      open(10,file="runiform1.dat",status="old",form="unformatted")
      do i=1,points
        read(10)x(i)
        read(10)y(i)
      enddo
      circle = sum(ones,(sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5))
      write(*,*)"Estimated pi using ",points," points as ",
     &          ((4.*circle)/points)
      end

Page 10: Libraries and Their Performance


Performance of second code

    Points          Pi   Wall Clock (sec.)   Mflip/s
        10     3.56000               0.090     0.004
       100     3.36000               0.030     0.034
     1,000     3.19000               0.039     0.197
    10,000     3.15000               0.120     0.612
   100,000     3.14700               0.967     0.755
 1,000,000     3.14099               9.152     0.798
10,000,000     3.14199              91.170     0.801

Page 11: Libraries and Their Performance


Performance of second code

[Chart: wall clock time (sec.) vs. number of points, 10 to 10^7, on log-log axes.]

Page 12: Libraries and Their Performance


Observations on Second Code

• Operations on whole vectors should be faster, but

• No real improvement in performance of total code was observed.

• Suspect that most time is being spent on I/O.

• I/O is now separate from computation, so the code is easy to instrument in sections

Page 13: Libraries and Their Performance


Instrument code sections with HPM Toolkit

Four sections to be separately measured:

• Data structure initialization

• Read data

• Estimate π

• Write output

Calls to f_hpmstart and f_hpmstop around each section.

Page 14: Libraries and Their Performance


Instrumented Code (1 of 2)

% cat estpi3.f
      implicit none
      integer :: i, points, circle
      integer, allocatable, dimension(:) :: ones
      real(kind=8), allocatable, dimension(:) :: x,y
#include "f_hpm.h"
      call f_hpminit(0,"Instrumented code")
      call f_hpmstart(1,"Initialize data structures")
      read(*,*)points
      allocate (x(points))
      allocate (y(points))
      allocate (ones(points))
      ones = 1
      call f_hpmstop(1)

Page 15: Libraries and Their Performance


Instrumented Code (2 of 2)

      call f_hpmstart(2,"Read data")
      open(10,file="runiform1.dat",status="old",form="unformatted")
      do i=1,points
        read(10)x(i)
        read(10)y(i)
      enddo
      call f_hpmstop(2)
      call f_hpmstart(3,"Estimate pi")
      circle = sum(ones,(sqrt((x-0.5)**2 + (y-0.5)**2) .le. 0.5))
      call f_hpmstop(3)
      call f_hpmstart(4,"Write output")
      write(*,*)"Estimated pi using ",points," points as ",
     &          ((4.*circle)/points)
      call f_hpmstop(4)
      call f_hpmterminate(0)
      end

Page 16: Libraries and Their Performance


Notes on Instrumented Code

• The entire executable code is enclosed between f_hpminit and f_hpmterminate

• Code sections are enclosed between f_hpmstart and f_hpmstop

• Descriptive text labels appear in output file(s)

Page 17: Libraries and Their Performance


Compile and Run with HPM Toolkit

% cat jobestpi3
#@ class = debug
#@ shell = /usr/bin/csh
#@ wall_clock_limit = 00:29:00
#@ notification = always
#@ job_type = serial
#@ output = jobestpi3.out
#@ error = jobestpi3.out
#@ environment = COPY_ALL
#@ queue
module load hpmtoolkit
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f"
$FC -o estpi3 estpi3.f
echo "10000000" > estpi3.dat
./estpi3 <estpi3.dat
exit

Page 18: Libraries and Their Performance


Notes on Use of HPM Toolkit

• Must load module hpmtoolkit

• Need to include the header file f_hpm.h in the Fortran code, and give preprocessor directions to the compiler with -qsuffix

• Performance output is written to a file named like perfhpmNNNN.MMMMM, where NNNN is the task id and MMMMM is the process id

• Message from the sample executable: libHPM output in perfhpm0000.21410

Page 19: Libraries and Their Performance


Comparison of Code Sections

Section              Wall Clock (sec.)   % Time   Mflip/s
Init Data Structs                0.248     0.27     0.000
Read Data                       89.933    99.02     0.000
Estimate Pi                      0.641     0.71   114.327
Write Output                     0.001     0.00     0.381
Total                           90.823   100.00     0.801

10,000,000 points

Page 20: Libraries and Their Performance


Observations on Sections

• Optimization of the estimation of π has little effect, because the code spends 99% of the time reading the data

• Can the I/O be optimized?

Page 21: Libraries and Their Performance


Reworking the I/O

• Whole-array I/O versus scalar I/O

• The scalar I/O file (one number per record) is twice as big (8 bytes for the number, 8 bytes for the end-of-record marker)

• The whole-array I/O file has only one end-of-record marker

• Only one call to the Fortran read routine is needed for whole-array I/O:

      read(10)xy

• Some fancy array footwork is needed to sort out x(1), y(1), x(2), y(2), ... x(n), y(n) from the xy array:

      x = xy(1::2)
      y = xy(2::2)

Page 22: Libraries and Their Performance


Revised Data Structures and I/O

% cat estpi4.f
      implicit none
      integer :: i, points, circle
      integer, allocatable, dimension(:) :: ones
      real(kind=8), allocatable, dimension(:) :: x, y, xy
#include "f_hpm.h"
      call f_hpminit(0,"Instrumented code")
      call f_hpmstart(1,"Initialize data structures")
      read(*,*)points
      allocate (x(points))
      allocate (y(points))
      allocate (xy(2*points))
      allocate (ones(points))
      ones = 1
      call f_hpmstop(1)
      call f_hpmstart(2,"Read data")
      open(10,file="runiform.dat",status="old",form="unformatted")
      read(10)xy
      x = xy(1::2)
      y = xy(2::2)
      call f_hpmstop(2)

Page 23: Libraries and Their Performance


Vector I/O Code Sections

Section              Wall Clock (sec.)   % Time   Mflip/s
Init Data Structs                0.252     6.00     0.000
Read Data                        3.162    75.34     0.000
Estimate Pi                      0.771    18.37    94.053
Write Output                     0.001     0.02     0.393
Total                            4.197   100.00      15.4

10,000,000 points

Page 24: Libraries and Their Performance


Observations on New Sections

• The time spent reading the data as a vector rather than a scalar was reduced from 89.9 to 3.16 seconds, a reduction of 96% of the I/O time.

• There was no performance penalty for the additional data structure complexity.

• I/O design can have very significant performance impacts!

• Total code performance with hpmcount is now 15.4 Mflip/s, roughly 20 times the 0.801 Mflip/s of the scalar-I/O code.

Page 25: Libraries and Their Performance


Automatic Shared-Memory (SMP) Parallelization

• IBM Fortran provides a –qsmp option for automatic, shared-memory parallelization, allowing multithreaded computation within a node.

• The default number of threads is 16; the number of threads is controlled by the OMP_NUM_THREADS environment variable (see the sketch after this list)

• Allows use of the SMP version of the ESSL library,

-lesslsmp
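As a usage sketch (the value 4 is only an illustration; the right thread count depends on the problem, as the thread-count charts later in the talk show), the thread count can be set in the csh job script before the SMP binary is run:

# Hypothetical csh fragment: pick the thread count, then run the
# -qsmp/-lesslsmp binary built on the next slide under hpmcount.
setenv OMP_NUM_THREADS 4
hpmcount ./estpi5 <estpi3.dat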

Page 26: Libraries and Their Performance


Compiler Options

• The source code is the same as the previous, vector operation example, estpi4.f

• The compiler option -qsmp and the link option -lesslsmp enable automatic shared-memory parallelism (SMP)

• Compiler command line:

  xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT \
    -qsuffix=cpp=f -qsmp -lesslsmp -o estpi5 estpi4.f

Page 27: Libraries and Their Performance


SMP Code Sections

Section              Wall Clock (sec.)   % Time   Mflip/s
Init Data Structs                0.534    10.87     0.000
Read Data                        4.311    87.78     0.000
Estimate Pi                      0.064     1.30    1,100. (up from 94)
Write Output                     0.002     0.04     0.117
Total                            4.911   100.00      15.4

10,000,000 points

Page 28: Libraries and Their Performance


Observations on SMP Code

• The computational section is now showing 1,100 Mflip/s, or 4.6% of the theoretical peak of 24,000 Mflip/s on a 16-processor node.

• The computational section is now 12 times faster, with no changes to the source code

• Recommendation: always use the thread-safe compilers (with the _r suffix) and -qsmp unless there is a good reason to do otherwise.

• There are no explicit parallelism directives in the source code; all threading is within the library.

Page 29: Libraries and Their Performance


Too Many Threads Can Spoil Performance

• Each node has 16 processors, and usually having more threads than processors will not improve performance

[Chart: computation Mflip/s (0 to 1,200) vs. number of threads (0 to 28).]

Page 30: Libraries and Their Performance


Sidebar: Cost of Misaligned Common Block

• User code with Fortran77 style common blocks may receive an innocuous warning:

1514-008 (W) Variable … is misaligned. This may affect the efficiency of the code.

• How much can this affect the efficiency of the code?

• Test: put arrays x and y in misaligned common, with a 1-byte character in front of them
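A minimal sketch of the kind of layout being tested (hypothetical variable names; the actual test code is not shown in the slides): a 1-byte character placed ahead of the real*8 arrays in a common block pushes them off their natural 8-byte alignment, which is what triggers the 1514-008 warning.

c     Hypothetical illustration of a misaligned common block
      character*1 pad
      real*8 x(10000000), y(10000000)
c     pad occupies 1 byte, so x and y start on odd byte offsets
      common /mydata/ pad, x, y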

Page 31: Libraries and Their Performance


Potential Cost of Misaligned Common Blocks

• 10,000,000 points used for computing π

• Properly aligned, dynamically allocated x and y: 0.064 seconds at 1,100 Mflip/s

• Misaligned, statically allocated x and y in a common block: 0.834 seconds at 88.4 Mflip/s

• Common block misalignment slowed the computation by a factor of about 12

Page 32: Libraries and Their Performance


Part I Conclusion

• hpmcount can be used to measure the performance of the total code

• HPM Toolkit can be used to measure the performance of discrete code sections

• Optimization effort must be focused effectively

• Fortran90 vector operations are generally faster than Fortran77 scalar operations

• Use of automatic SMP parallelization may provide an easy performance boost

• I/O may be the largest factor in “whole code” performance

• Misaligned common blocks can be very expensive

Page 33: Libraries and Their Performance


Part II: Comparing Libraries

• In the rich user environment on seaborg, there are many alternative ways to do the same computation

• The HPM Toolkit provides the tools to compare alternative approaches to the same computation

Page 34: Libraries and Their Performance


Dot Product Functions

• User coded scalar computation

• User coded vector computation

• Single processor ESSL ddot

• Multi-threaded SMP ESSL ddot

• Single processor IMSL ddot

• Single processor NAG f06eaf

• Multi-threaded SMP NAG f06eaf

Page 35: Libraries and Their Performance


Sample Problem

• Test the Cauchy-Schwarz inequality for N vectors of length N:

  (X·Y)² ≤ (X·X)(Y·Y)

• Generate 2N random numbers (array x2)

• Use the 1st N numbers for X; (X·X) is computed once

• Vary the vector Y: for i = 1, n

      y = 2.0*x2(i:n+(i-1))

  so the first Y is 2X, the second Y is 2(x2(2:N+1)), etc.

• Compute (2*N)+1 dot products of length N

Page 36: Libraries and Their Performance


Instrumented Code Section for Dot Products

      call f_hpmstart(1,"Dot products")
      xx = ddot(n,x,1,x,1)
      do i=1,n
        y = 2.0*x2(i:n+(i-1))
        yy = ddot(n,y,1,y,1)
        xy = ddot(n,x,1,y,1)
        diffs(i) = (xx*yy)-(xy*xy)
      enddo
      call f_hpmstop(1)
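For this kernel to compile against the single-processor libraries, ddot has to be declared with the right result type in the calling program. A minimal sketch of the declarations the loop assumes (the slides do not show them, so the names here simply follow the kernel):

      integer :: i, n
      real(kind=8) :: xx, yy, xy
      real(kind=8), allocatable, dimension(:) :: x, y, x2, diffs
      real(kind=8), external :: ddot   ! BLAS-style dot product from the linked library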

Page 37: Libraries and Their Performance


Two User Coded Functions

      real*8 function myddot(n,x,y)
      integer :: i,n
      real*8 :: x(n),y(n),dp
      dp = 0.
      do i=1,n
        dp = dp + x(i)*y(i)          ! User scalar loop
      enddo
      myddot = dp
      return
      end

      real*8 function myddot(n,x,y)
      integer :: i,n
      real*8 :: x(n),y(n)
      myddot = sum(x*y)              ! User vector computation
      return
      end

Page 38: Libraries and Their Performance


Compile and Run User Functions

module load hpmtoolkit
echo "100000" > libs.dat
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f"
$FC -o libs0 libs0.f
./libs0 <libs.dat
$FC -o libs0a libs0a.f
./libs0a <libs.dat

Page 39: Libraries and Their Performance


Compile and Run ESSL Versions

setenv FC "xlf90_r -q64 -O3 –qstrict -qarch=pwr3

-qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f

-lessl"

$FC -o libs1 libs1.f

./libs1 <libs.dat

setenv FC "xlf90_r -q64 -O3 –qstrict -qarch=pwr3

-qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f -qsmp

-lesslsmp"

$FC -o libs1smp libs1.f

./libs1smp <libs.dat

Page 40: Libraries and Their Performance


Compile and Run IMSL Version

module load imsl
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $IMSL"
$FC -o libs1imsl libs1.f
./libs1imsl <libs.dat
module unload imsl

Page 41: Libraries and Their Performance


Compile and Run NAG Versions

module load nag_64
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $NAG"
$FC -o libs1nag libsnag.f
./libs1nag <libs.dat
module unload nag

module load nag_smp64
setenv FC "xlf90_r -q64 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 $HPMTOOLKIT -qsuffix=cpp=f $NAG_SMP6 -qsmp=omp -qnosave "
$FC -o libs1nagsmp libsnag.f
./libs1nagsmp <libs.dat
module unload nag_smp64

Page 42: Libraries and Their Performance


First Comparison of Dot Product (N=100,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = Fastest)
User Scalar                246       203   1.72
User Vector                249       201   1.74
ESSL                       145       346   1.01
ESSL-SMP                   408       123   2.85   Slowest
IMSL                       143       351   1.00   Fastest
NAG                        250       200   1.75
NAG-SMP                    180       278   1.26

Page 43: Libraries and Their Performance


Comments on First Comparisons

• The best results, by just a little, were obtained using the IMSL library, with ESSL a close second

• Third best was the NAG-SMP routine, with benefits from multi-threaded computation

• The user coded routines and NAG were about 75% slower than the ESSL and IMSL routines. In general, library routines are highly optimized and better than user coded routines.

• The ESSL-SMP library did very poorly on this computation; this unexpected result may be due to data structures in the library, or perhaps the number of threads (default is 16).

Page 44: Libraries and Their Performance


ESSL-SMP Performance vs. Number of Threads

• All for N=100,000

• Number of threads controlled by the OMP_NUM_THREADS environment variable

[Chart: ddot Mflip/s (0 to 1,200) vs. number of threads (0 to 20); performance peaks near 4 threads (see the revised comparison on the next slide).]

Page 45: Libraries and Their Performance


Revised First Comparison of Dot Product (N=100,000)

Version                  Wall Clock (sec)   Mflip/s   Scaled Time (1 = Fastest)
User Scalar                           246       203   4.9
User Vector                           249       201   5.0
ESSL                                  145       346   2.9
ESSL-SMP (4 threads)                   50      1000   1.0   Fastest
IMSL                                  143       351   2.9
NAG                                   250       200   5.0   Slowest
NAG-SMP                               180       278   3.6

Tuning for the number of threads is very, very important for SMP codes!

Page 46: Libraries and Their Performance


Scaling up the Problem

• The first comparisons were for N=100,000 computing 200,001 dot products of vectors of length 100,000

• Second comparison for N=200,000 computes 400,001 dot products of vectors of length 200,000

• Increase computational complexity by a factor of 4 (twice as many dot products, each on vectors twice as long).

Page 47: Libraries and Their Performance


Second Comparison of Dot Product (N=200,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = Fastest)
User Scalar               1090       183   2.17
User Vector               1180       169   2.35   Slowest
ESSL                       739       271   1.47
ESSL-SMP                   503       398   1.00   Fastest
IMSL                       725       276   1.44
NAG                       1120       179   2.23
NAG-SMP                    864       231   1.72

Page 48: Libraries and Their Performance


Comments on Second Comparisons (N=200,000)

• Now the best results are from the ESSL-SMP library, with the default 16 threads

• The next best group is ESSL, IMSL and NAG-SMP, taking 50-75% longer than the ESSL-SMP routine.

• The worst results were seen from NAG (single thread) and the user code routines.

What is the impact of the number of threads on ESSL-SMP performance, given that it is already the fastest here?

Page 49: Libraries and Their Performance


ESSL-SMP Performance vs. Number of Threads

• All for N=200,000

• Number of threads controlled by the OMP_NUM_THREADS environment variable

[Chart: ddot Mflip/s (0 to 1,600) vs. number of threads (0 to 20); performance peaks near 6 threads (see the revised comparison on the next slide).]

Page 50: Libraries and Their Performance


Revised Second Comparison of Dot Product (N=200,000)

Version                  Wall Clock (sec)   Mflip/s   Scaled Time (1 = Fastest)
User Scalar                          1090       183   7.5
User Vector                          1180       169   8.1   Slowest
ESSL                                  739       271   5.1
ESSL-SMP (6 threads)                  146      1370   1.0   Fastest
IMSL                                  725       276   5.0
NAG                                  1120       179   7.7
NAG-SMP                               864       231   5.9

Page 51: Libraries and Their Performance


Scaling with Problem Size? (N1=100,000; N2=200,000; complexity ratio approx. 4)

Version       N2/N1 Wall Clock (sec)   N2/N1 Mflip/s
User Scalar                     4.45            0.90
User Vector                     4.75            0.84
ESSL                            5.10            0.78
ESSL-SMP                        2.92            1.37   (4 threads for N1; 6 threads for N2)
IMSL                            5.07            0.79
NAG                             4.48            0.90
NAG-SMP                         4.80            0.83

Page 52: Libraries and Their Performance


Comments on Scaling Problem Size

• The ESSL-SMP performance, when tuned for the optimal number of threads, increased by almost 40% with the increased problem size.

• The untuned ESSL-SMP performance increased by a factor of 3.2 with the increased problem size.

• The user codes, ESSL, IMSL, NAG and NAG-SMP routines all showed 10%-22% decreases in performance with the larger problem size.

• It is not possible to determine, a priori, how the performance of different, functionally equivalent routines will scale with problem size.

Page 53: Libraries and Their Performance


Matrix Multiplication

• User coded scalar computation

• Fortran intrinsic matmul

• Single processor ESSL dgemm

• Multi-threaded SMP ESSL dgemm

• Single processor IMSL dmrrrr (32-bit)

• Single processor NAG f01ckf

• Multi-threaded SMP NAG f01ckf

Page 54: Libraries and Their Performance


Sample Problem

• Multiply two dense N by N matrices, A and B

• A(i,j) = i + j

• B(i,j) = j – i

• Output C(N,N) to verify result

Page 55: Libraries and Their Performance


Kernel of user matrix multiply

      do i=1,n
        do j=1,n
          a(i,j) = real(i+j)
          b(i,j) = real(j-i)
        enddo
      enddo
      call f_hpmstart(1,"Matrix multiply")
      do j=1,n
        do k=1,n
          do i=1,n
            c(i,j) = c(i,j) + a(i,k)*b(k,j)
          enddo
        enddo
      enddo
      call f_hpmstop(1)
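For contrast with the hand-coded loop, the library versions listed on the "Matrix Multiplication" slide reduce the multiply to a single statement or call. A sketch of what those look like (the slides do not show these calls; the dgemm arguments follow the standard BLAS interface):

! Fortran intrinsic version:
      c = matmul(a, b)

! ESSL / BLAS dgemm version: C = alpha*A*B + beta*C
! dgemm(transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc)
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)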

Page 56: Libraries and Their Performance


Comparison of Matrix Multiply (N1=5,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time (1 = Fastest)
User Scalar              1,490       168   106    Slowest
Intrinsic                1,477       169   106    Slowest
ESSL                       195     1,280   13.9
ESSL-SMP                    14    17,800   1.0    Fastest
IMSL                       194     1,290   13.8
NAG                        195     1,280   13.9
NAG-SMP                     14    17,800   1.0    Fastest

Page 57: Libraries and Their Performance


Observations on Matrix Multiply

• Fastest times were obtained by the two SMP libraries, ESSL-SMP and NAG-SMP, which both obtained 74% of the peak node performance

• All the single processor library functions took 14 times more wall clock time than the SMP versions, each obtaining about 85% of peak for a single processor

• Worst times were from the user code and the Fortran intrinsic, which took 100 times more wall clock time than the SMP libraries

Page 58: Libraries and Their Performance


Comparison of Matrix Multiply (N2=10,000)

Version       Wall Clock (sec)   Mflip/s   Scaled Time
ESSL-SMP                   101    19,800   1.01
NAG-SMP                    100    19,900   1.00

• Scaling with problem size (complexity increase approx. 8 times):

Version       Wall Clock (N2/N1)   Mflip/s (N2/N1)
ESSL-SMP                     7.2              1.10
NAG-SMP                      7.1              1.12

Both ESSL-SMP and NAG-SMP showed 10% performance gains with the larger problem size.

Page 59: Libraries and Their Performance


Observations on Scaling

• Scaling of problem size was only done for the SMP libraries, to fit into reasonable times.

• Doubling N results in an 8-fold increase in computational complexity for dense matrix multiplication (the operation count grows as N³)

• Performance actually increased for both routines for larger problem size.

Page 60: Libraries and Their Performance


ESSL-SMP Performance vs. Number of Threads

• All for N=10,000

• Number of threads controlled by the OMP_NUM_THREADS environment variable

[Chart: dgemm Mflip/s (0 to 20,000) vs. number of threads (0 to 36).]

Page 61: Libraries and Their Performance


Part II Conclusion

• The NERSC user environment provides a rich variety of mathematical libraries

• Performance can vary widely for the same computation, sometimes even for the same function name, from library to library; performance also varies with problem size and, for the SMP libraries, the number of threads

• It is not possible to know, a priori, which library will provide the best performance for a given function and problem size

• The HPM Toolkit provides a way to compare library routine performance and make informed choices

Page 62: Libraries and Their Performance


Part III: Moving to Multi-node Parallelism

• The examples so far have all been of single-processor or multi-processor, shared-memory (SMP-style) parallelism on a single 16-processor node

• The poe+ command is the multi-node equivalent of hpmcount, and poe+ can be used with MPI codes or multi-node, distributed-memory parallel libraries such as PESSL and ScaLAPACK.

• poe+ is a perl script developed by David Skinner of the NERSC User Services Group which aggregates hpmcount results for each distributed-memory process
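As a rough sketch of how such a run is launched (hypothetical LoadLeveler keywords chosen by analogy with the serial job scripts shown earlier; the actual parallel job script is not shown in the slides):

#@ class = debug
#@ job_type = parallel
#@ node = 4
#@ tasks_per_node = 16
#@ wall_clock_limit = 00:29:00
#@ queue
poe+ ./ABCp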

Page 63: Libraries and Their Performance


Kernel of PESSL/ScaLAPACK matrix multiply

! Call PESSL library routine
      call f_hpminit((me+1),"Instrumented code")
      call f_hpmstart((me+1),"Matrix multiply")
      call pdgemm('T','T',n,n,n,1.0d0, myA,1,1,ides_a, &
                  myB,1,1,ides_b, 0.d0,                &
                  myC,1,1,ides_c )
      call f_hpmstop(me+1)
      call f_hpmterminate(me+1)

Page 64: Libraries and Their Performance


Comments on PESSL/ScaLAPACK Code

• Although the kernel on the previous slide looks like a simple progression from the ESSL version, actually there is a lot of work involved in understanding PESSL/ScaLAPACK for new users

• There are a number of data structure complexities which do not exist for the single-node libraries

• The “complete” matrix does not exist on any processor, but is block-cyclic distributed among processors

• There are additional parameters for processor geometry and data distribution.

• New users should study the ScaLAPACK tutorial on the Web at http://www.netlib.org/scalapack/tutorial/

Page 65: Libraries and Their Performance


Prolog for PESSL/ScaLAPACK matrix multiply

! Initialize blacs processor grid
      call blacs_pinfo (me,procs)
      call blacs_get (0, 0, icontxt)
      call blacs_gridinit(icontxt, 'R', prow, pcol)
      call blacs_gridinfo(icontxt, prow, pcol, myrow, mycol)
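The grid shape prow x pcol (and the block size nb used on the next slide) are inputs the code has to set before calling blacs_gridinit; hypothetical values matching the 4x4 process grid of the 16-processor runs described later (the actual values, and the nb used, are not shown in the slides):

! Hypothetical settings, not shown in the slides
      prow = 4          ! process rows
      pcol = 4          ! process columns
      nb   = 64         ! block size (assumed value)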

Page 66: Libraries and Their Performance


More Prolog for PESSL/ScaLAPACK

! Construct local arrays
      myArows = numroc(n, nb, myrow, 0, prow)
      myAcols = numroc(n, nb, mycol, 0, pcol)
! Initialize local arrays
      allocate(myA(myArows,myAcols))
      allocate(myB(myArows,myAcols))
      allocate(myC(myArows,myAcols))
      do i=1,n
        call g2l(i,n,prow,nb,iproc,myi)
        if (myrow==iproc) then
          do j=1,n
            call g2l(j,n,pcol,nb,jproc,myj)
            if (mycol==jproc) then
              myA(myi,myj) = real(i+j)
              myB(myi,myj) = real(i-j)
              myC(myi,myj) = 0.d0
            endif
          enddo
        endif
      enddo
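The g2l helper used above (a global-to-local index mapping) is not shown in the slides; a sketch of what such a routine typically does for a 1-D block-cyclic distribution with block size nb over np processes (hypothetical implementation):

      subroutine g2l(g, n, np, nb, proc, loc)
      ! Map global index g (1-based) to the owning process and the local
      ! index for a block-cyclic distribution; n is unused here but kept
      ! to match the call site above.
      implicit none
      integer :: g, n, np, nb, proc, loc, blk
      blk  = (g-1)/nb                        ! which block g falls in
      proc = mod(blk, np)                    ! blocks dealt round-robin
      loc  = (blk/np)*nb + mod(g-1, nb) + 1  ! position in the local array
      end subroutine g2l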

Page 67: Libraries and Their Performance


Still More Prolog for PESSL/ScaLAPACK

! Prepare array descriptors for PESSL (ScaLAPACK style)
      ides_a(1) = 1        ! descriptor type
      ides_a(2) = icontxt  ! blacs context
      ides_a(3) = n        ! global number of rows
      ides_a(4) = n        ! global number of columns
      ides_a(5) = nb       ! row block size
      ides_a(6) = nb       ! column block size
      ides_a(7) = 0        ! initial process row
      ides_a(8) = 0        ! initial process column
      ides_a(9) = myArows  ! leading dimension of local array
      do i=1,9
        ides_b(i) = ides_a(i)
        ides_c(i) = ides_a(i)
      enddo

Page 68: Libraries and Their Performance


Compile Uninstrumented Codes and Run with poe+

setenv FC "mpxlf90 -O3 -qstrict -qarch=pwr3 -qtune=pwr3 -bmaxdata:0x80000000 -bmaxstack:0x80000000 "

$FC -o ABCp -lblacs -lpessl ABCp.f

module load scalapack$FC -o ABCs -qfree $PBLAS $BLACS $SCALAPACK -lessl

ABCp.f

poe+ ./ABCp ! PESSL versionpoe+ ./ABCs ! ScaLAPACK version

Page 69: Libraries and Their Performance


Four Runs for PESSL and ScaLAPACK Codes

• N=5000, 16 processors (one node) in 4x4 processor array

• N=10,000, 16 processors (one node) in 4x4 processor array

• N=5000, 64 processors (four nodes) in 8x8 processor array

• N=10000, 64 processors (four nodes) in 8x8 processor array

• Compare “whole code” performance using poe+ with “whole code” results for single-node ESSL-SMP routine using hpmcount.

• poe+ returns average wall clock time across all processes, and aggregate Mflip/s of all processes

Page 70: Libraries and Their Performance


Comparison of PESSL/ScaLAPACK dgemm (n=5000, 16 processors, "whole code" performance)

Section       Wall Clock (sec.)   Mflip/s   Scaled Time (1.00 = ESSL-SMP, 22 s)
PESSL                      28.3     8,850   1.30
ScaLAPACK                  30.4     8,240   1.40

ESSL-SMP achieved 47% of theoretical peak performance for one node

PESSL achieved 37%, and ScaLAPACK achieved 34%.

Page 71: Libraries and Their Performance


Comparison of PESSL/ScaLAPACK dgemm (n=10000, 16 processors, "whole code")

Section       Wall Clock (sec.)   Mflip/s   Scaled Time (1.00 = ESSL-SMP, 120 s)
PESSL                      141.    14,230   1.20
ScaLAPACK                  160.    12,500   1.30

ESSL-SMP achieved 70% of theoretical peak performance for one node

PESSL achieved 59%, and ScaLAPACK achieved 52%.

Page 72: Libraries and Their Performance


Comparison of PESSL/ScaLAPACK dgemm (n=5000, 64 processors, "whole code")

Section       Wall Clock (sec.)   Mflip/s   Scaled Time (1.00 = ESSL-SMP, 22 s)
PESSL                      15.3    16,400   0.70
ScaLAPACK                  14.2    17,600   0.65

PESSL achieved 17% of the theoretical peak for 4 nodes (96,000 Mflip/s), and ScaLAPACK achieved 18%.

Page 73: Libraries and Their Performance


Comparison of PESSL/ScaLAPACK dgemm (n=10000, 64 processors, "whole code")

Section       Wall Clock (sec.)   Mflip/s   Scaled Time (1.00 = ESSL-SMP, 120 s)
PESSL                      51.5    38,900   0.43
ScaLAPACK                  58.3    34,400   0.49

PESSL achieved 41% of the theoretical peak for 4 nodes (96,000 Mflip/s), and ScaLAPACK achieved 36%.

Page 74: Libraries and Their Performance


Comments on PESSL and ScaLAPACK Codes

• For problem sizes that fit within one node, the shared-memory, SMP libraries may give better performance than the distributed-memory, parallel libraries because of differences in data communication costs

• As the number of nodes and processors is increased, wall-clock time for distributed-memory libraries may drop below shared-memory SMP libraries for the same problem size, but per-processor efficiency may also drop.

• For problems which cannot fit in a node, the distributed-memory parallel libraries provide the best solution

Page 75: Libraries and Their Performance


Comments on using HPM Toolkit with PESSL and ScaLAPACK Codes

• HPM Toolkit generates two output files per task (one for statistics, one for visualization).

• Performance statistics for each task are found in files with names perfhpmNNNN.PPPPP where NNNN is the task id (or processor number), and PPPPP is the AIX process id

• Performance variations between processors and nodes can be observed.

Page 76: Libraries and Their Performance


PESSL dgemm results for Small Instrumented Section

• For N=5,000, 16 processors (one node), PESSL pdgemm:
  – average time of 16.9 seconds
  – aggregate 14,800 Mflip/s
  – 62% of the theoretical peak performance for a node

• For N=10,000, 64 processors (four nodes), PESSL pdgemm:
  – average time of 40.1 seconds
  – aggregate 50,000 Mflip/s
  – 52% of the theoretical peak performance for four nodes

Page 77: Libraries and Their Performance


Variability in PESSL dgemm Small Instrumented Section

• For N=5,000, 16 processors (one node), PESSL pdgemm:
  – wall clock for each processor varies from 16.4 to 17.4 sec
  – Mflip/s for each processor varies from 850 to 1000

• For N=10,000, 64 processors (four nodes), PESSL pdgemm:
  – wall clock for each processor varies from 39.25 to 40.75 sec
  – Mflip/s for each processor varies from 730 to 830

Page 78: Libraries and Their Performance


PESSL dgemm Task Variation (n=5000, 16 processors)

[Scatter plot: per-task Mflip/s (840 to 1,020) vs. per-task wall clock (16.2 to 17.6 s).]

Page 79: Libraries and Their Performance


PESSL dgemm Task Variation (n=10000, 64 processors)

[Scatter plot: per-task Mflip/s (720 to 840) vs. per-task wall clock (39 to 41 s).]

Page 80: Libraries and Their Performance


Part III Conclusion

• NERSC provides a variety of distributed-memory, multi-node mathematical libraries (PESSL, ScaLAPACK and NAG Parallel).

• Performance of these libraries can be measured using “whole code” approaches with poe+, similar to hpmcount for single node codes

• The HPM Toolkit can be used to instrument small sections of codes for more detailed analysis, including variation between tasks; but a number of output files are produced and must be analyzed by the user.

Page 81: Libraries and Their Performance


References

• Information on hpmcount and poe+ for whole code performance measurement is available on the NERSC Website at http://hpcf.nersc.gov/software/ibm/hpmcount/

• Detailed information about the HPM Toolkit for measuring performance of discrete code sections is available on the NERSC Website at http://hpcf.nersc.gov/software/ibm/hpmcount/HPM_2_4_2.html

• The list of mathematical libraries available on seaborg can be found on the NERSC Website at http://hpcf.nersc.gov/software/ibm/#mathlibs