
Code Tuning and Optimization

Doug Sondak

sondak@bu.edu

Boston University

Scientific Computing and Visualization


Outline

• Introduction
• Example code
• Timing
• Profiling
• Cache
• Tuning

Information Services & Technology


04/21/23


Introduction

• Timing: where is most of the time being used?
• Tuning: how to speed it up; often as much art as science
• Parallel Performance: how to assess how well parallelization is working


Example Code


Example Code

• Simulation of the response of the eye to stimuli
  • The response is affected by adjacent inputs
  • A dark area next to a bright area makes the bright area look brighter
• Based on the Grossberg & Todorovic paper
  • The appendix of the paper contains all the equations
  • There are errors in eqns. (A4) and (A5): cross out “log2”
• The paper contains 6 levels of response
  • Our code contains only levels 1 through 5
  • Level 6 takes a long time to compute, and would skew our timings!


Example Code (cont’d)

• All calculations are done on a square array
• The array size and other constants are defined in gt.h (C) or in the “mods” module at the top of the code (Fortran)
• Due to the nature of the algorithm, the array is padded on all sides; npad is the size of the padding
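To make the padding concrete, here is a minimal C sketch of one way a padded square array can be laid out in a single flat buffer. The sizes N and NPAD and the helpers padded_index and alloc_padded are hypothetical; the real constants live in gt.h (or the mods module), and the example code may index its arrays differently.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sizes for illustration only; the real constants are in gt.h. */
#define N    8               /* interior (unpadded) array size */
#define NPAD 2               /* padding width on every side */
#define NTOT (N + 2 * NPAD)  /* full padded size */

/* Row-major flat index of point (i, j) given in interior coordinates.
   Interior points have 0 <= i, j < N; padding cells have
   -NPAD <= i < 0 or N <= i < N + NPAD (likewise for j). */
static size_t padded_index(int i, int j)
{
    return (size_t)(i + NPAD) * NTOT + (size_t)(j + NPAD);
}

/* Allocate the padded array, zero-filled so the padding starts at 0.0. */
static double *alloc_padded(void)
{
    return calloc((size_t)NTOT * NTOT, sizeof(double));
}
```

Because every interior point has NPAD cells of neighbors on each side, a kernel sum of radius up to NPAD can read a[padded_index(i + di, j + dj)] without bounds checks.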


Example Code – Level 1

• Luminance (input) distribution
• The paper (and code) use a “yin-yang square”
• Array I
  • magnitude of “bright” is ihigh
  • magnitude of “dark” is ilow

[Figure: yin-yang square input, with its “bright” and “dark” regions labeled (Fig. 4 in paper)]


Example Code – Level 2

• Level 2 – Circular Concentric On and Off Units
• Excitation and inhibition vary with distance

[Figure: Fig. 5 in paper]


Level 2 Equations


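$$C_{pqij} = C\exp\{-\alpha^{-2}[(p-i)^2 + (q-j)^2]\}$$

$$E_{pqij} = E\exp\{-\beta^{-2}[(p-i)^2 + (q-j)^2]\}$$

$$x_{ij} = \frac{\sum_{p,q}(B\,C_{pqij} - D\,E_{pqij})\,I_{pq}}{A + \sum_{p,q}(C_{pqij} + E_{pqij})\,I_{pq}}$$

$$X_{ij} = \max(x_{ij}, 0)$$

where $I_{pq}$ = initial input (yin-yang)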


Example Code – Level 3

• Oriented Direction-of-Contrast-Sensitive Units
• Respond to angle
  • 12 discrete angles
• Respond to direction of contrast, i.e., light-to-dark or dark-to-light

[Figure: Fig. 6(d) in paper]


Level 3 Equations


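$$G_{pqij} = \exp\{-\gamma^{-2}[(p-i)^2 + (q-j)^2]\}$$

$$H^{(k)}_{pqij} = \exp\{-\gamma^{-2}[(p-i-m_k)^2 + (q-j-n_k)^2]\}$$

$$F^{(k)}_{pqij} = G_{pqij} - H^{(k)}_{pqij}$$

$$m_k = \sin\frac{2\pi k}{K}, \qquad n_k = \cos\frac{2\pi k}{K}$$

$$y_{ijk} = \sum_{p,q} X_{pq}\,F^{(k)}_{pqij}$$

$$Y_{ijk} = \max(y_{ijk}, 0)$$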


Example Code - Level 4

• Oriented Direction-of-Contrast-Insensitive Units
• Respond to angle
• Do not respond to direction of contrast, i.e., light-to-dark or dark-to-light

[Figure: Fig. 8(a) in paper]


Level 4 Equations


$$z_{ijk} = Y_{ijk} + Y_{ij(k+K/2)}$$

$$Z_{ijk} = \max(z_{ijk} - L, 0)$$
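Level 4 combines each orientation with the opposite direction of contrast, k + K/2. A minimal C sketch for a single pixel follows. It assumes the K = 12 Level 3 outputs for that pixel are stored contiguously, and wrapping k + K/2 modulo K is our assumption about how indices past K-1 are handled; level4_pixel is a hypothetical name.

```c
#define K_ANGLES 12   /* number of orientations, as on the Level 3 slide */

/* Level 4 at one pixel: z_k = Y_k + Y_{k+K/2}, Z_k = max(z_k - L, 0).
   Y holds the Level 3 outputs Y_ijk for a fixed (i, j); the opposite
   orientation index is wrapped modulo K (an assumption). */
void level4_pixel(const double Y[K_ANGLES], double L, double Z[K_ANGLES])
{
    for (int k = 0; k < K_ANGLES; k++) {
        double z = Y[k] + Y[(k + K_ANGLES / 2) % K_ANGLES];
        Z[k] = (z - L > 0.0) ? z - L : 0.0;
    }
}
```

With this wrapping, Z_k and Z_{k+K/2} come out equal, which is the contrast-insensitivity the slide describes.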


Example Code – Level 5

• Level 5 – Boundary Contour Units
• Pool nearby excitations

[Figure: Fig. 8(d) in paper]


Level 5 Equation


$$Z_{ij} = \sum_k Z_{ijk}$$
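Level 5 is a plain sum over orientations. The C sketch below assumes, hypothetically, that the Level 4 outputs for one pixel are stored contiguously in k; with an array declared as Z[i][j][k] in C, k is the rightmost index, so this loop walks memory with unit stride.

```c
#define K_ANGLES 12   /* number of orientations */

/* Level 5 boundary contour at one pixel: Z_ij = sum over k of Z_ijk. */
double level5_pixel(const double Z[K_ANGLES])
{
    double sum = 0.0;
    for (int k = 0; k < K_ANGLES; k++)
        sum += Z[k];
    return sum;
}
```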


Timing


Timing

• When tuning/parallelizing a code, you need to assess the effectiveness of your efforts
• Can time the whole code and/or specific sections
• Some types of timers:
  • unix time command
  • function/subroutine calls
  • profiler


CPU Time or Wall-Clock Time?

• CPU time: how much time the CPU is actually crunching away
  • User CPU time: time spent executing your source code
  • System CPU time: time spent in system calls such as i/o
• Wall-clock time: what you would measure with a stopwatch


CPU Time or Wall-Clock Time? (cont’d)

• Both are useful
• For serial runs without interaction from the keyboard, CPU and wall-clock times are usually close
• If you prompt for keyboard input, wall-clock time will accumulate if you get a cup of coffee, but CPU time will not
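The coffee-break effect is easy to reproduce. This C sketch uses gettimeofday for wall-clock time and the standard C clock() function for CPU time (clock() is used here instead of times() for brevity); sleep() stands in for a program waiting at a keyboard prompt. The function names are illustrative.

```c
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

/* Wall-clock seconds from gettimeofday. */
static double wall_seconds(void)
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec + 1.0e-6 * t.tv_usec;
}

/* CPU seconds used so far by this process, from the standard C clock(). */
static double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Sleep for secs seconds (a stand-in for waiting at a keyboard prompt)
   and report the elapsed wall-clock and CPU time. */
void idle_and_measure(unsigned int secs, double *wall, double *cpu)
{
    double w0 = wall_seconds();
    double c0 = cpu_seconds();
    sleep(secs);                 /* the process is idle: no CPU time accrues */
    *wall = wall_seconds() - w0;
    *cpu  = cpu_seconds() - c0;
}
```

Calling idle_and_measure(2, &w, &c) should report a wall-clock time of about 2 s and a CPU time near zero.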


CPU Time or Wall-Clock Time? (3)

• Parallel runs: want wall-clock time, since CPU time will be about the same, or may even increase, as the number of procs. is increased
• Wall-clock time may not be accurate if sharing processors
  • Wall-clock timings should always be performed in batch mode


Unix Time Command

• Easiest way to time a code
• Simply type time before your run command
• Output differs between C-type shells (cshell, tcshell) and Bourne-type shells (bsh, bash, ksh)


Unix Time Command (cont’d)

twister:~ % time mycode
1.570u 0.010s 0:01.77 89.2% 75+1450k 0+0io 64pf+0w

  1.570u     user CPU time (s)
  0.010s     system CPU time (s)
  0:01.77    wall-clock time (min:sec)
  89.2%      (u+s)/wc
  75+1450k   avg. shared text + unshared data/stack space (KB)
  0+0io      input + output operations
  64pf+0w    page faults + no. of times proc. was swapped


Unix Time Command (3)

Bourne shell results:

$ time mycode
Real    1.62    wall-clock time (s)
User    1.57    user CPU time (s)
System  0.03    system CPU time (s)


Exercise 1

• Copy files from /scratch/sondak/gt
  cp /scratch/sondak/gt/* .
• Choose C (gt.c) or Fortran (gt.f90)
• Compile with no optimization:
  pgcc -O0 -o gt gt.c
  pgf90 -O0 -o gt gt.f90
  (In -O0 the first character is a capital letter O and the second is the digit zero; in -o it is a small letter o.)
• Submit the rungt script to the batch queue:
  qsub rungt


Exercise 1 (cont’d)

• Check status:
  qstat -u username
• After the run has completed, a file named rungt.o?????? will appear, where ?????? represents the job number
• The file contains the result of the time command
  • Write down the wall-clock time
• Re-compile using -O3
• Re-run and check the time


Function/Subroutine Calls

• Often need to time part of a code
• Timers can be inserted in the source code
• Language-dependent


cpu_time

• Intrinsic subroutine in Fortran
• Returns user CPU time (in seconds)
  • No system time is included
• 0.01 sec. resolution on p-series

  real :: t1, t2
  call cpu_time(t1)
  ! ... do stuff to be timed ...
  call cpu_time(t2)
  print*, 'CPU time = ', t2-t1, ' sec.'


system_clock

• Intrinsic subroutine in Fortran
• Good for measuring wall-clock time
• On p-series:
  • resolution is 0.01 sec.
  • max. time is 24 hr.


system_clock (cont’d)

• t1 and t2 are tic counts
• count_rate is an optional argument containing tics/sec.

  integer :: t1, t2, count_rate
  call system_clock(t1, count_rate)
  ! ... do stuff to be timed ...
  call system_clock(t2)
  print*, 'wall-clock time = ', &
          real(t2-t1)/real(count_rate), ' sec'


times

• Can be called from C to obtain CPU time
• 0.01 sec. resolution on p-series
• Can also get system time with tms_stime

  #include <stdio.h>
  #include <sys/times.h>
  #include <unistd.h>

  int main(void){
    long tics_per_sec;
    clock_t tic1, tic2;
    struct tms timedat;

    tics_per_sec = sysconf(_SC_CLK_TCK);   /* clock ticks per second */
    times(&timedat);
    tic1 = timedat.tms_utime;              /* user CPU time, in ticks */
    /* ... do stuff to be timed ... */
    times(&timedat);
    tic2 = timedat.tms_utime;
    printf("CPU time = %5.2f\n", (float)(tic2-tic1)/(float)tics_per_sec);
    return 0;
  }


gettimeofday

• Can be called from C to obtain wall-clock time
• μsec resolution on p-series

  #include <stdio.h>
  #include <sys/time.h>

  int main(void){
    struct timeval t;
    double t1, t2;

    gettimeofday(&t, NULL);
    t1 = t.tv_sec + 1.0e-6*t.tv_usec;   /* seconds + microseconds */
    /* ... do stuff to be timed ... */
    gettimeofday(&t, NULL);
    t2 = t.tv_sec + 1.0e-6*t.tv_usec;
    printf("wall-clock time = %5.3f\n", t2-t1);
    return 0;
  }


MPI_Wtime

• Convenient wall-clock timer for MPI codes
• μsec resolution on p-series


MPI_Wtime (cont’d)

Fortran:

  ! mpi_wtime is declared in mpif.h (or the mpi module)
  double precision t1, t2
  t1 = mpi_wtime()
  ! ... do stuff to be timed ...
  t2 = mpi_wtime()
  print*, 'wall-clock time = ', t2-t1

C:

  #include <stdio.h>
  #include <mpi.h>
  /* in the body of an MPI program, after MPI_Init(): */
  double t1, t2;
  t1 = MPI_Wtime();
  /* ... do stuff to be timed ... */
  t2 = MPI_Wtime();
  printf("wall-clock time = %5.3f\n", t2-t1);


omp_get_wtime

• Convenient wall-clock timer for OpenMP codes
• Resolution available by calling omp_get_wtick()
• 0.01 sec. resolution on p-series


omp_get_wtime (cont’d)

Fortran:

  double precision t1, t2, omp_get_wtime
  t1 = omp_get_wtime()
  ! ... do stuff to be timed ...
  t2 = omp_get_wtime()
  print*, 'wall-clock time = ', t2-t1

C:

  #include <stdio.h>
  #include <omp.h>

  double t1, t2;
  t1 = omp_get_wtime();
  /* ... do stuff to be timed ... */
  t2 = omp_get_wtime();
  printf("wall-clock time = %5.3f\n", t2-t1);


Timer Summary

             CPU         Wall
  Fortran    cpu_time    system_clock
  C          times       gettimeofday
  MPI                    MPI_Wtime
  OpenMP                 omp_get_wtime


Exercise 2

• Put a wall-clock timer around each “level” in the example code
• Print the time for each level
• Compile and run
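One possible shape for Exercise 2 in C is sketched below. The helper names wall_time and time_stage are hypothetical, and the example code may compute each level inline rather than in separate functions; in that case, call wall_time() directly before and after each level's loops.

```c
#include <stdio.h>
#include <sys/time.h>

/* Wall-clock seconds from gettimeofday. */
double wall_time(void)
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec + 1.0e-6 * t.tv_usec;
}

/* Run one stage, print its wall-clock time, and return it.
   `stage` stands in for one "level" of the example code (hypothetical). */
double time_stage(const char *label, void (*stage)(void))
{
    double t1 = wall_time();
    stage();
    double t2 = wall_time();
    printf("%s: wall-clock time = %8.3f s\n", label, t2 - t1);
    return t2 - t1;
}
```

With hypothetical functions level1() through level5(), the driver would call time_stage("level 1", level1); and likewise for the other four levels.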


PROFILING


Profilers

• A profile tells you how much time is spent in each routine
• Gives a level of granularity not available with the previous timers
  • e.g., a function may be called from many places
• Various profilers are available, e.g.,
  • gprof (GNU)
  • pgprof (Portland Group)
  • Xprofiler (AIX)


gprof

• Compile with -pg
• File gmon.out will be created when you run
• gprof executable > myprof
• For multiple procs. (MPI), copy or link gmon.out.n to gmon.out, then run gprof


gprof (cont’d)

granularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

  %    cumulative     self                 self      total
 time    seconds    seconds      calls   ms/call   ms/call  name
 20.5      89.17      89.17         10   8917.00  10918.00  .conduct [5]
  7.6     122.34      33.17        323    102.69    102.69  .getxyz [8]
  7.5     154.77      32.43                                 .__mcount [9]
  7.2     186.16      31.39     189880      0.17      0.17  .btri [10]
  7.2     217.33      31.17                                 .kickpipes [12]
  5.1     239.58      22.25  309895200      0.00      0.00  .rmnmod [16]
  2.3     249.67      10.09        269     37.51     37.51  .getq [24]


gprof (3)

granularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

                                    called/total     parents
index  %time     self  descendents  called+self  name          index
                                    called/total     children

                 0.00       340.50       1/1          .__start [2]
[1]     78.3     0.00       340.50       1            .main [1]
                 2.12       319.50      10/10         .contrl [3]
                 0.04         7.30      10/10         .force [34]
                 0.00         5.27       1/1          .initia [40]
                 0.56         3.43       1/1          .plot3da [49]
                 0.00         1.27       1/1          .data [73]


pgprof

• Compile with the Portland Group compiler
  • pgf90 (pgf95, etc.)
  • pgcc
  • -Mprof=func (similar to -pg)
• Run the code
• pgprof -exe executable
  • pops up a window with the flat profile


pgprof (cont’d)


pgprof (3)

• To save profile data to a file:
  • re-run pgprof using the -text flag
  • at the command prompt, type p > filename
    • filename is the name you want to give the profile file
  • type quit to get out of the profiler


Exercise 3

• Use pgprof to profile the code
  • compile using -Mprof=func
  • run the code
  • create a profile using pgprof -exe gt
• Note which routines use the most time
• Please close pgprof when you’re through
  • Leaving the window open ties up a license


Line-Level Profiling

• Times individual lines
• For pgprof, compile with the flag -Mprof=line
• The optimizer will re-order lines
  • the profiler will lump together lines in some loops or other constructs
  • you may or may not want to compile without optimization
• In the flat profile, double-click on a function to get line-level data


Line-Level Profiling (cont’d)


Exercise 4

• Compile the code with -Mprof=line and -O0, and run
  • it will take about 5 minutes to run, due to the overhead from line-level profiling and the lack of optimization
• Examine the line-level profile for the most time-consuming routine
• Note the lines with the longest time consumption
• Save your profile data to a file (we will need it later)
  • re-run pgprof using the -text flag
  • at the command prompt, type p > prof


CACHE


Cache

Cache is a small chunk of fast memory between the main memory and the registers

[Figure: memory hierarchy, from registers through primary cache and secondary cache to main memory]


Cache (cont’d)

• If variables are used repeatedly, the code will run faster, since cache memory is much faster than main memory
• Variables are moved from main memory to cache in lines
  • L1 cache line sizes on our machines:

      Opteron (katana cluster)       64 bytes
      Xeon (katana cluster)          64 bytes
      Power4 (p-series)             128 bytes
      PPC440 (Blue Gene)             32 bytes
      Pentium III (linux cluster)    32 bytes


Cache (3)

• Why not just make the main memory out of the same stuff as cache?
  • Expensive
  • Runs hot
  • This was actually done in Cray computers
    • Liquid cooling system


Cache (4)

• Cache hit
  • Required variable is in cache
• Cache miss
  • Required variable is not in cache
  • If the cache is full, something else must be thrown out (sent back to main memory) to make room
• Want to minimize the number of cache misses


Cache (5)

Main memory:
  x[0] x[1] x[2] x[3] | x[4] x[5] x[6] x[7] | x[8] x[9] a b …

“Mini” cache: holds 2 lines, 4 words each

  for(i=0; i<10; i++)
    x[i] = i;


Cache (6)

• Will ignore i for simplicity
• Need x[0], not in cache: cache miss
• Load line from memory into cache
• Next 3 loop indices result in cache hits

Cache contents:
  x[0] x[1] x[2] x[3]
  (empty)


Cache (7)

• Need x[4], not in cache: cache miss
• Load line from memory into cache
• Next 3 loop indices result in cache hits

Cache contents:
  x[0] x[1] x[2] x[3]
  x[4] x[5] x[6] x[7]


Cache (8)

• Need x[8], not in cache: cache miss
• Load line from memory into cache
• No room in cache!
• Replace old line

Cache contents:
  x[8] x[9] a b          (replaces the line holding x[0] to x[3])
  x[4] x[5] x[6] x[7]
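The cache walk-through can be replayed with a toy simulator. This C sketch models the “mini” cache as 2 fully associative lines of 4 words; least-recently-used replacement is an assumption (the slides only say the old line is replaced), and words are numbered so that x[0] through x[9] are words 0 to 9, with a and b after them. All names are illustrative.

```c
#include <stdbool.h>

/* Toy model of the "mini" cache: 2 lines of 4 words, fully associative,
   least-recently-used replacement (an assumption). */
#define LINES      2
#define LINE_WORDS 4

typedef struct {
    long tag[LINES];       /* memory line held by each cache line */
    bool valid[LINES];     /* false until the cache line is first filled */
    long last_use[LINES];  /* access number at last use, for LRU */
    long accesses;         /* number of accesses so far */
    long hits, misses;
} MiniCache;

void cache_init(MiniCache *c)
{
    for (int l = 0; l < LINES; l++) {
        c->tag[l] = -1;
        c->valid[l] = false;
        c->last_use[l] = 0;
    }
    c->accesses = 0;
    c->hits = 0;
    c->misses = 0;
}

/* Access one word; returns true on a cache hit. */
bool cache_access(MiniCache *c, long word)
{
    long tag = word / LINE_WORDS;      /* memory line containing the word */
    c->accesses++;

    for (int l = 0; l < LINES; l++) {
        if (c->valid[l] && c->tag[l] == tag) {
            c->last_use[l] = c->accesses;
            c->hits++;
            return true;
        }
    }

    /* Miss: fill an empty line if there is one, else evict the LRU line. */
    int victim = 0;
    for (int l = 0; l < LINES; l++) {
        if (!c->valid[l]) {
            victim = l;
            break;
        }
        if (c->last_use[l] < c->last_use[victim])
            victim = l;
    }
    c->tag[victim] = tag;
    c->valid[victim] = true;
    c->last_use[victim] = c->accesses;
    c->misses++;
    return false;
}
```

Running the x[i] = i loop through cache_access for words 0 to 9 gives 3 misses and 7 hits, matching the slides; touching only every fourth word (0, 4, 8, 12) misses on every access, which is the cost of non-contiguous access in miniature.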


Cache (9)

• Contiguous access is important
• In C, a multidimensional array is stored in memory as
  a[0][0]
  a[0][1]
  a[0][2]
  …
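A quick way to see row-major storage for yourself is to measure how far apart elements are in memory. In this C sketch (NR, NC and c_offset are illustrative names), the offset of a[i][j] from a[0][0] comes out as i*NC + j, so a[0][1] sits right after a[0][0].

```c
#include <stddef.h>

#define NR 3   /* illustrative sizes */
#define NC 4

/* Distance, in elements, of a[i][j] from a[0][0] in an NR x NC C array.
   Measured in bytes within the one array object, then converted. */
ptrdiff_t c_offset(double a[NR][NC], int i, int j)
{
    return ((char *)&a[i][j] - (char *)&a[0][0]) / (ptrdiff_t)sizeof(double);
}
```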


Cache (10)

• In Fortran and Matlab, a multidimensional array is stored the opposite way:
  a(1,1)
  a(2,1)
  a(3,1)
  …
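The same layout rule written out as arithmetic: this hypothetical C helper gives the 0-based position of the Fortran element a(i,j) in an array declared a(nrows, ncols), using Fortran's 1-based indices. It can be handy when C code has to walk an array that Fortran allocated.

```c
#include <stddef.h>

/* Hypothetical helper: 0-based position of the Fortran element a(i,j)
   (1-based i, j) in an array declared a(nrows, ncols). In column-major
   order the leftmost index varies fastest, so a(1,1), a(2,1), a(3,1), ...
   are adjacent in memory. */
size_t fortran_offset(size_t i, size_t j, size_t nrows)
{
    return (i - 1) + (j - 1) * nrows;
}
```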


Cache (11)

• Rule: Always order your loops appropriately
  • will usually be taken care of by the optimizer
  • suggestion: don’t rely on the optimizer

C:

  for(i=0; i<N; i++){
    for(j=0; j<N; j++){
      a[i][j] = 1.0;
    }
  }

Fortran:

  do j = 1, n
    do i = 1, n
      a(i,j) = 1.0
    enddo
  enddo


TUNING TIPS


Tuning Tips

• Some of these tips will be taken care of by compiler optimization
  • It’s best to do them yourself, since compilers vary
• Two important rules:
  • minimize the number of operations
  • access cache contiguously


Tuning Tips (cont’d)

• Access arrays in contiguous order
  • For multi-dimensional arrays, the rightmost index varies fastest for C and C++, the leftmost for Fortran and Matlab

Bad:

  for(j=0; j<N; j++){
    for(i=0; i<N; i++){
      a[i][j] = 1.0;
    }
  }

Good:

  for(i=0; i<N; i++){
    for(j=0; j<N; j++){
      a[i][j] = 1.0;
    }
  }
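To measure the difference on your own machine, the C sketch below fills the same array in both orders and times each with clock(). The 2048 × 2048 size is an arbitrary choice meant to exceed typical cache sizes, and the function names are illustrative. At -O0 the column-order loop is usually noticeably slower; at higher optimization levels some compilers interchange the loops and hide the difference.

```c
#include <time.h>

#define NB 2048   /* arbitrary size: 2048*2048 doubles = 32 MB */

static double a[NB][NB];

/* Good order for C: j (rightmost index) innermost, unit stride. */
void fill_row_order(void)
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            a[i][j] = 1.0;
}

/* Bad order for C: i innermost, a stride of NB doubles between accesses. */
void fill_column_order(void)
{
    for (int j = 0; j < NB; j++)
        for (int i = 0; i < NB; i++)
            a[i][j] = 1.0;
}

/* CPU seconds spent in one call of fill(). */
double time_fill(void (*fill)(void))
{
    clock_t c0 = clock();
    fill();
    return (double)(clock() - c0) / CLOCKS_PER_SEC;
}
```

Printing time_fill(fill_row_order) and time_fill(fill_column_order) side by side, once with -O0 and once with -O3, is a quick check of how much your compiler does for you.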


Tuning Tips (3)

• Eliminate redundant operations in loops

Bad:

  for(i=0; i<N; i++){
    x = 10;
    ...
  }

Good:

  x = 10;
  for(i=0; i<N; i++){
    ...
  }
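A concrete, hypothetical instance of the same rule: the product c*d does not change inside the loop, so it can be computed once. Both C functions below give identical results; the second simply does less work per iteration. Optimizing compilers often perform this loop-invariant code motion themselves, but compilers vary.

```c
/* Recomputes the loop-invariant product c*d on every iteration. */
void scale_redundant(double *x, const double *y, int n, double c, double d)
{
    for (int i = 0; i < n; i++) {
        double s = c * d;        /* same value every time through */
        x[i] = s * y[i];
    }
}

/* Computes c*d once, outside the loop. */
void scale_hoisted(double *x, const double *y, int n, double c, double d)
{
    double s = c * d;
    for (int i = 0; i < n; i++)
        x[i] = s * y[i];
}
```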


Tuning Tips (4)

• Minimize if statements within loops
  • They may inhibit pipelining

  for(i=0; i<N; i++){
    if(i==0){
      /* perform i=0 calculations */
    } else {
      /* perform i>0 calculations */
    }
  }


Tuning Tips (5)

• Better way:

  /* perform i=0 calculations */
  for(i=1; i<N; i++){
    /* perform i>0 calculations */
  }
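Here is that “loop peeling” applied to a hypothetical computation in C: the first element is special and the rest use a one-sided difference. Both versions produce identical output; the peeled one has no branch inside the loop. The n > 0 guard is our addition, since the peeled i=0 step assumes there is at least one element.

```c
/* Branch inside the loop, as in the first version. */
void diff_with_branch(double *d, const double *u, int n)
{
    for (int i = 0; i < n; i++) {
        if (i == 0)
            d[i] = 0.0;                /* i=0 calculations */
        else
            d[i] = u[i] - u[i - 1];    /* i>0 calculations */
    }
}

/* i=0 case peeled off; the loop body is branch-free. */
void diff_peeled(double *d, const double *u, int n)
{
    if (n > 0)
        d[0] = 0.0;                    /* i=0 calculations, done once */
    for (int i = 1; i < n; i++)
        d[i] = u[i] - u[i - 1];        /* i>0 calculations */
}
```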


Tuning Tips (6)

• Divides are expensive
• Intel x86 clock cycles per operation:

      add         3-6
      multiply    4-8
      divide     32-45

Bad:

  for(i=0; i<N; i++)
    x[i] = y[i]/scalarval;

Good:

  qs = 1.0/scalarval;
  for(i=0; i<N; i++)
    x[i] = y[i]*qs;
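Here is the same transformation as a pair of C functions (names are illustrative). One caution worth knowing: y/s and y*(1/s) are not always bit-for-bit identical in IEEE floating point, which is one reason compilers usually do not make this change on their own unless asked (GCC, for example, needs -freciprocal-math or -ffast-math).

```c
/* One division per element. */
void scale_divide(double *x, const double *y, int n, double scalarval)
{
    for (int i = 0; i < n; i++)
        x[i] = y[i] / scalarval;
}

/* One division in total, then one multiply per element.
   Results may differ from scale_divide in the last bit. */
void scale_reciprocal(double *x, const double *y, int n, double scalarval)
{
    double qs = 1.0 / scalarval;
    for (int i = 0; i < n; i++)
        x[i] = y[i] * qs;
}
```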


Tuning Tips (7)

• There is overhead associated with a function call

Bad:

  for(i=0; i<N; i++)
    myfunc(i);

Good:

  myfunc();

  void myfunc(void){
    for(int i=0; i<N; i++){
      /* do stuff */
    }
  }
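A small C illustration with a hypothetical “do stuff” (squaring each element): the first version pays the call overhead n times, the second only once. With optimization enabled, a compiler will often inline a small static function like square_one, so it is worth measuring before assuming a gain.

```c
/* One call per element, as in the "Bad" version. */
static void square_one(double *x, int i)
{
    x[i] = x[i] * x[i];
}

void square_per_call(double *x, int n)
{
    for (int i = 0; i < n; i++)
        square_one(x, i);          /* call overhead paid n times */
}

/* Loop moved inside the function, as in the "Good" version. */
void square_all(double *x, int n)
{
    for (int i = 0; i < n; i++)    /* call overhead paid once */
        x[i] = x[i] * x[i];
}
```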


Tuning Tips (8)

• Minimize calls to math functions

Bad:

  for(i=0; i<N; i++)
    z[i] = log(x[i]) + log(y[i]);

Good:

  for(i=0; i<N; i++)
    z[i] = log(x[i] * y[i]);


Tuning Tips (9)

• Recasting may be costlier than you think

Bad:

  sum = 0.0;
  for(i=0; i<N; i++)
    sum += (float) i;

Good:

  isum = 0;
  for(i=0; i<N; i++)
    isum += i;
  sum = (float) isum;
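A minimal C version of the comparison follows. Two deliberate departures from the slide, both our own: the integer accumulator is long long, since a 32-bit int sum of 0 to N-1 overflows once N exceeds 65,536, and the floating-point result is double, since a float can no longer represent every integer exactly once the sum passes 2^24 = 16,777,216.

```c
/* Converts i to floating point on every iteration. */
double sum_with_casts(long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; i++)
        sum += (double) i;
    return sum;
}

/* Integer accumulation, one conversion at the end. long long keeps the
   sum of 0..n-1 from overflowing for large n. */
double sum_then_cast(long n)
{
    long long isum = 0;
    for (long i = 0; i < n; i++)
        isum += i;
    return (double) isum;
}
```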


Exercise 5

• The example code that has been provided is written in a clear, readable style that also happens to violate lots of the tuning tips that we have just reviewed
• Examine the line-level profile
  • What lines are using the most time?
  • Is there anything we might be able to do to make it run faster?
• We will discuss options as a group
  • come up with a strategy
  • modify the code
  • re-compile and run
  • compare timings
• Re-examine the line-level profile, come up with another strategy, repeat the procedure, etc.


Survey

Please fill out the survey for this tutorial at

http://scv.bu.edu/survey/tutorial_evaluation.html
