Code Tuning and Optimization Doug Sondak [email protected] Boston University Scientific Computing and Visualization


Outline

• Introduction
• Example code
• Timing
• Profiling
• Cache
• Tuning

Information Services & Technology


04/21/23

Introduction

• Timing
    - Where is most time being used?
• Tuning
    - How to speed it up
    - Often as much art as science
• Parallel Performance
    - How to assess how well parallelization is working



Example Code



• Simulation of response of eye to stimuli
    - Response is affected by adjacent inputs
    - A dark area next to a bright area makes the bright area look brighter
• Based on Grossberg & Todorovic paper
    - Appendix in paper contains all equations
    - Errors in eqns (A4) and (A5): cross out “log2”
• Paper contains 6 levels of response
    - Our code only contains levels 1 through 5
    - Level 6 takes a long time to compute, and would skew our timings!



Example Code (cont’d)

• All calculations done on a square array
• Array size and other constants are defined in gt.h (C) or in the “mods” module at the top of the code (Fortran)
• Due to nature of algorithm, array is padded on all sides
    - npad is size of padding



Example Code – Level 1

• Luminance (input) distribution
• Paper (and code) use “yin-yang square” (Fig. 4 in paper), which has a bright region and a dark region
• Array I
    - magnitude of “bright” is ihigh
    - magnitude of “dark” is ilow

Example Code – Level 2

• Level 2 – Circular Concentric On and Off Units
• Excitation and inhibition vary with distance
• Fig. 5 in paper

Level 2 Equations


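$$C_{pqij} = C \exp\{-\alpha^{-2}[(p-i)^2 + (q-j)^2]\}$$

$$E_{pqij} = E \exp\{-\beta^{-2}[(p-i)^2 + (q-j)^2]\}$$

$$x_{ij} = \frac{\sum_{p,q} (B\,C_{pqij} - D\,E_{pqij})\, I_{pq}}{A + \sum_{p,q} (C_{pqij} + E_{pqij})\, I_{pq}}$$

$$X_{ij} = \max(x_{ij}, 0)$$

$I_{pq}$ = initial input (yin-yang)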

Example Code – Level 3

• Oriented Direction-of-Contrast-Sensitive Units
• Respond to angle
    - 12 discrete angles
• Respond to direction of contrast, i.e., light-to-dark or dark-to-light
• Fig. 6(d) in paper

Level 3 Equations


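$$G_{pqij} = \exp\{-\gamma^{-2}[(p-i)^2 + (q-j)^2]\}$$

$$H^{(k)}_{pqij} = \exp\{-\gamma^{-2}[(p-i-m_k)^2 + (q-j-n_k)^2]\}$$

$$F^{(k)}_{pqij} = G_{pqij} - H^{(k)}_{pqij}$$

$$y_{ijk} = \sum_{p,q} X_{pq}\, F^{(k)}_{pqij}$$

$$m_k = \sin\frac{2\pi k}{K}, \qquad n_k = \cos\frac{2\pi k}{K}$$

$$Y_{ijk} = \max(y_{ijk}, 0)$$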

Example Code - Level 4

• Oriented Direction-of-Contrast-Insensitive Units
• Respond to angle
• Do not respond to direction of contrast, i.e., light-to-dark or dark-to-light
• Fig. 8(a) in paper

Level 4 Equations


$$z_{ijk} = Y_{ijk} + Y_{ij(k+K/2)}$$

$$Z_{ijk} = \max(z_{ijk} - L, 0)$$

Example Code – Level 5

• Level 5 – Boundary Contour Units
• Pool nearby excitations
• Fig. 8(d) in paper

Level 5 Equation


$$Z_{ij} = \sum_k Z_{ijk}$$

Timing



• When tuning/parallelizing a code, need to assess effectiveness of your efforts
• Can time whole code and/or specific sections
• Some types of timers
    - unix time command
    - function/subroutine calls
    - profiler



CPU Time or Wall-Clock Time?

• CPU time
    - How much time the CPU is actually crunching away
    - User CPU time: time spent executing your source code
    - System CPU time: time spent in system calls such as i/o
• Wall-clock time
    - What you would measure with a stopwatch



CPU Time or Wall-Clock Time? (cont’d)

• Both are useful
• For serial runs without interaction from keyboard, CPU and wall-clock times are usually close
    - If you prompt for keyboard input, wall-clock time will accumulate if you get a cup of coffee, but CPU time will not



CPU Time or Wall-Clock Time? (3)

• Parallel runs
    - Want wall-clock time, since CPU time will be about the same or even increase as number of procs. is increased
• Wall-clock time may not be accurate if sharing processors
    - Wall-clock timings should always be performed in batch mode



Unix Time Command

• easiest way to time code
• simply type time before your run command
• output differs between C-type shells (csh, tcsh) and Bourne-type shells (bsh, bash, ksh)



Unix Time Command (cont’d)

    twister:~ % time mycode
    1.570u 0.010s 0:01.77 89.2% 75+1450k 0+0io 64pf+0w

    1.570u      user CPU time (s)
    0.010s      system CPU time (s)
    0:01.77     wall-clock time (s)
    89.2%       (u+s)/wc
    75+1450k    avg. shared + unshared text space
    0+0io       input + output operations
    64pf+0w     page faults + no. times proc. was swapped

Unix Time Command (3)

Bourne shell results


    $ time mycode
    Real   1.62
    User   1.57
    System 0.03

    Real      wall-clock time (s)
    User      user CPU time (s)
    System    system CPU time (s)

Exercise 1

• Copy files from /scratch/sondak/gt
      cp /scratch/sondak/gt/* .
• Choose C (gt.c) or Fortran (gt.f90)
• Compile with no optimization (-O0 is capital oh, zero; -o is small oh):
      pgcc -O0 -o gt gt.c
      pgf90 -O0 -o gt gt.f90
• Submit rungt script to batch queue
      qsub rungt

Exercise 1 (cont’d)

• Check status
      qstat -u username
• After run has completed a file will appear named rungt.o??????, where ?????? represents the process number
• File contains result of time command
• Write down wall-clock time
• Re-compile using -O3
• Re-run and check time



Function/Subroutine Calls

• often need to time part of code
• timers can be inserted in source code
• language-dependent



cpu_time

• intrinsic subroutine in Fortran
• returns user CPU time (in seconds)
    - no system time is included
• 0.01 sec. resolution on p-series

    real :: t1, t2
    call cpu_time(t1)
    ... do stuff to be timed ...
    call cpu_time(t2)
    print*, 'CPU time = ', t2-t1, ' sec.'

system_clock

• intrinsic subroutine in Fortran
• good for measuring wall-clock time
• on p-series:
    - resolution is 0.01 sec.
    - max. time is 24 hr.



system_clock (cont’d)

• t1 and t2 are tic counts
• count_rate is optional argument containing tics/sec.

    integer :: t1, t2, count_rate
    call system_clock(t1, count_rate)
    ... do stuff to be timed ...
    call system_clock(t2)
    print*,'wall-clock time = ', &
        real(t2-t1)/real(count_rate), ' sec'

times

• can be called from C to obtain CPU time
• 0.01 sec. resolution on p-series
• can also get system time with tms_stime

    #include <stdio.h>
    #include <sys/times.h>
    #include <unistd.h>

    int main(){
        long tics_per_sec;
        clock_t tic1, tic2;   /* tic counts are of type clock_t */
        struct tms timedat;

        tics_per_sec = sysconf(_SC_CLK_TCK);
        times(&timedat);
        tic1 = timedat.tms_utime;
        /* ... do stuff to be timed ... */
        times(&timedat);
        tic2 = timedat.tms_utime;
        printf("CPU time = %5.2f\n",
               (float)(tic2-tic1)/(float)tics_per_sec);
        return 0;
    }

gettimeofday

• can be called from C to obtain wall-clock time
• μsec resolution on p-series

    #include <stdio.h>
    #include <sys/time.h>

    int main(){
        struct timeval t;
        double t1, t2;

        gettimeofday(&t, NULL);
        t1 = t.tv_sec + 1.0e-6*t.tv_usec;
        /* ... do stuff to be timed ... */
        gettimeofday(&t, NULL);
        t2 = t.tv_sec + 1.0e-6*t.tv_usec;
        printf("wall-clock time = %5.3f\n", t2-t1);
        return 0;
    }

MPI_Wtime

• convenient wall-clock timer for MPI codes
• μsec resolution on p-series

MPI_Wtime (cont’d)

Fortran:

    double precision t1, t2
    t1 = mpi_wtime()
    ... do stuff to be timed ...
    t2 = mpi_wtime()
    print*,'wall-clock time = ', t2-t1

C:

    double t1, t2;
    t1 = MPI_Wtime();
    ... do stuff to be timed ...
    t2 = MPI_Wtime();
    printf("wall-clock time = %5.3f\n", t2-t1);

omp_get_wtime

• convenient wall-clock timer for OpenMP codes
• resolution available by calling omp_get_wtick()
• 0.01 sec. resolution on p-series

omp_get_wtime (cont’d)

Fortran:

    double precision t1, t2, omp_get_wtime
    t1 = omp_get_wtime()
    ... do stuff to be timed ...
    t2 = omp_get_wtime()
    print*,'wall-clock time = ', t2-t1

C:

    double t1, t2;
    t1 = omp_get_wtime();
    ... do stuff to be timed ...
    t2 = omp_get_wtime();
    printf("wall-clock time = %5.3f\n", t2-t1);

Timer Summary


               CPU         Wall
    Fortran    cpu_time    system_clock
    C          times       gettimeofday
    MPI                    MPI_Wtime
    OpenMP                 omp_get_wtime

Exercise 2

• Put wall-clock timer around each “level” in the example code
• Print time for each level
• Compile and run



PROFILING



Profilers

• profile tells you how much time is spent in each routine
• gives a level of granularity not available with previous timers
    - e.g., function may be called from many places
• various profilers available, e.g.
    - gprof (GNU)
    - pgprof (Portland Group)
    - Xprofiler (AIX)



gprof

• compile with -pg
• file gmon.out will be created when you run
• gprof executable > myprof
• for multiple procs. (MPI), copy or link gmon.out.n to gmon.out, then run gprof


gprof (cont’d)

    ngranularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

      %   cumulative    self                 self     total
     time   seconds   seconds      calls   ms/call   ms/call  name
     20.5     89.17     89.17         10   8917.00  10918.00  .conduct [5]
      7.6    122.34     33.17        323    102.69    102.69  .getxyz [8]
      7.5    154.77     32.43                                 .__mcount [9]
      7.2    186.16     31.39     189880      0.17      0.17  .btri [10]
      7.2    217.33     31.17                                 .kickpipes [12]
      5.1    239.58     22.25  309895200      0.00      0.00  .rmnmod [16]
      2.3    249.67     10.09        269     37.51     37.51  .getq [24]

gprof (3)


    ngranularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

                                      called/total      parents
    index  %time    self descendents  called+self   name          index
                                      called/total      children

                    0.00      340.50      1/1           .__start [2]
    [1]     78.3    0.00      340.50      1         .main [1]
                    2.12      319.50     10/10          .contrl [3]
                    0.04        7.30     10/10          .force [34]
                    0.00        5.27      1/1           .initia [40]
                    0.56        3.43      1/1           .plot3da [49]
                    0.00        1.27      1/1           .data [73]

pgprof

• compile with Portland Group compiler
    - pgf90 (pgf95, etc.)
    - pgcc
    - -Mprof=func (similar to -pg)
• run code
• pgprof -exe executable
• pops up window with flat profile



pgprof (3)

• To save profile data to a file:
    - re-run pgprof using -text flag
    - at command prompt type p > filename
        filename is the name you want to give the profile file
    - type quit to get out of profiler



Exercise 3

• Use pgprof to profile code
    - compile using -Mprof=func
    - run code
    - create profile using pgprof -exe gt
• Note which routines use most time
• Please close pgprof when you’re through
    - Leaving window open ties up a license



Line-Level Profiling

• Times individual lines
• For pgprof, compile with the flag -Mprof=line
• Optimizer will re-order lines
    - profiler will lump lines in some loops or other constructs
    - may want to compile without optimization, may not
• In flat profile, double-click on function to get line-level data



Exercise 4

• Compile code with -Mprof=line and -O0 and run
    - will take about 5 minutes to run due to overhead from line-level profiling and lack of optimization
• Examine line-level profile for most time-consuming routine
    - Note lines with longest time consumption
• Save your profile data to a file (we will need it later)
    - re-run pgprof using -text flag
    - at command prompt type p > prof



CACHE



Cache is a small chunk of fast memory between the main memory and the registers


[Figure: memory hierarchy showing registers, primary cache, secondary cache, and main memory]

Cache (cont’d)

• If variables are used repeatedly, code will run faster since cache memory is much faster than main memory
• Variables are moved from main memory to cache in lines
• L1 cache line sizes on our machines

      Opteron (katana cluster)        64 bytes
      Xeon (katana cluster)           64 bytes
      Power4 (p-series)              128 bytes
      PPC440 (Blue Gene)              32 bytes
      Pentium III (linux cluster)     32 bytes



Cache (3)

• Why not just make the main memory out of the same stuff as cache?
    - Expensive
    - Runs hot
    - This was actually done in Cray computers, with a liquid cooling system



Cache (4)

• Cache hit
    - Required variable is in cache
• Cache miss
    - Required variable not in cache
    - If cache is full, something else must be thrown out (sent back to main memory) to make room
• Want to minimize number of cache misses



Cache (5)


    for(i=0; i<10; i++) x[i] = i;

[Figure: main memory holds x[0] through x[9], followed by a, b, …; the “mini” cache holds 2 lines, 4 words each]

Cache (6)


    for(i=0; i<10; i++) x[i] = i;

• will ignore i for simplicity
• need x[0], not in cache: cache miss
• load line from memory into cache
• next 3 loop indices result in cache hits

[Figure: cache now holds one line, x[0] x[1] x[2] x[3]]

Cache (7)


    for(i=0; i<10; i++) x[i] = i;

• need x[4], not in cache: cache miss
• load line from memory into cache
• next 3 loop indices result in cache hits

[Figure: cache now holds two lines, x[0] x[1] x[2] x[3] and x[4] x[5] x[6] x[7]]

Cache (8)


    for(i=0; i<10; i++) x[i] = i;

• need x[8], not in cache: cache miss
• load line from memory into cache
• no room in cache!
• replace old line

[Figure: cache now holds x[4] x[5] x[6] x[7] and x[8] x[9] a b; the line holding x[0] through x[3] has been replaced]

Cache (9)

• Contiguous access is important
• In C, multidimensional array is stored in memory as

    a[0][0]
    a[0][1]
    a[0][2]
    …

Cache (10)

• In Fortran and Matlab, multidimensional array is stored the opposite way:

    a(1,1)
    a(2,1)
    a(3,1)
    …

Cache (11)

• Rule: Always order your loops appropriately
    - will usually be taken care of by optimizer
    - suggestion: don’t rely on optimizer

C:

    for(i=0; i<N; i++){
        for(j=0; j<N; j++){
            a[i][j] = 1.0;
        }
    }

Fortran:

    do j = 1, n
        do i = 1, n
            a(i,j) = 1.0
        enddo
    enddo

TUNING TIPS



• Some of these tips will be taken care of by compiler optimization
    - It’s best to do them yourself, since compilers vary
• Two important rules
    - minimize number of operations
    - access cache contiguously



Tuning Tips (cont’d)

• Access arrays in contiguous order
• For multi-dimensional arrays, rightmost index varies fastest for C and C++, leftmost for Fortran and Matlab

Bad (C):

    for(j=0; j<N; j++){
        for(i=0; i<N; i++){
            a[i][j] = 1.0;
        }
    }

Good (C):

    for(i=0; i<N; i++){
        for(j=0; j<N; j++){
            a[i][j] = 1.0;
        }
    }

Tuning Tips (3)

• Eliminate redundant operations in loops

Bad:

    for(i=0; i<N; i++){
        x = 10;
        …
    }

Good:

    x = 10;
    for(i=0; i<N; i++){
        …
    }

Tuning Tips (4)

• Minimize if statements within loops
• They may inhibit pipelining

    for(i=0; i<N; i++){
        if(i==0)
            perform i=0 calculations
        else
            perform i>0 calculations
    }

Tuning Tips (5)

Better Way:

    perform i=0 calculations
    for(i=1; i<N; i++){
        perform i>0 calculations
    }

Tuning Tips (6)

• Divides are expensive
• Intel x86 clock cycles per operation

      add         3-6
      multiply    4-8
      divide      32-45

Bad:

    for(i=0; i<N; i++)
        x[i] = y[i]/scalarval;

Good:

    qs = 1.0/scalarval;
    for(i=0; i<N; i++)
        x[i] = y[i]*qs;

Tuning Tips (7)

• There is overhead associated with a function call

Bad:

    for(i=0; i<N; i++)
        myfunc(i);

Good:

    myfunc( );

    void myfunc( ){
        for(int i=0; i<N; i++){
            do stuff
        }
    }

Tuning Tips (8)

• Minimize calls to math functions

Bad:

    for(i=0; i<N; i++)
        z[i] = log(x[i]) + log(y[i]);

Good:

    for(i=0; i<N; i++)
        z[i] = log(x[i] * y[i]);

Tuning Tips (9)

• recasting may be costlier than you think

Bad:

    sum = 0.0;
    for(i=0; i<N; i++)
        sum += (float) i;

Good:

    isum = 0;
    for(i=0; i<N; i++)
        isum += i;
    sum = (float) isum;

Exercise 5

• The example code that has been provided is written in a clear, readable style that also happens to violate lots of the tuning tips that we have just reviewed.
• Examine the line-level profile. What lines are using the most time? Is there anything we might be able to do to make it run faster?
• We will discuss options as a group
    - come up with a strategy
    - modify code
    - re-compile and run
    - compare timings
• Re-examine line-level profile, come up with another strategy, repeat procedure, etc.



Survey

Please fill out the survey for this tutorial at

http://scv.bu.edu/survey/tutorial_evaluation.html

