Code Tuning and Optimization
Doug Sondak
Boston University
Scientific Computing and Visualization
Outline
Introduction
Example code
Timing
Profiling
Cache
Tuning
Information Services & Technology
2
04/21/23
Introduction
Timing
  Where is most time being used?
Tuning
  How to speed it up
  Often as much art as science
Parallel Performance
  How to assess how well parallelization is working
Example Code
Example Code
Simulation of response of eye to stimuli
  Response is affected by adjacent inputs
  A dark area next to a bright area makes the bright area look brighter
Based on Grossberg & Todorovic paper
  Appendix in paper contains all equations
  Errors in eqns. (A4) and (A5): cross out “log2”
Paper contains 6 levels of response
  Our code only contains levels 1 through 5
  Level 6 takes a long time to compute, and would skew our timings!
Example Code (cont’d)
All calculations are done on a square array
Array size and other constants are defined in gt.h (C) or in the “mods” module at the top of the code (Fortran)
Due to the nature of the algorithm, the array is padded on all sides
  npad is the size of the padding
Example Code – Level 1
Luminance (input) distribution
Paper (and code) use “yin-yang square”
Array I
  magnitude of “bright” is ihigh
  magnitude of “dark” is ilow
[Figure: yin-yang square with “bright” and “dark” regions labeled; Fig. 4 in paper]
Example Code – Level 2
Level 2 – Circular Concentric On and Off Units
Excitation and inhibition vary with distance
Fig. 5 in paper
Level 2 Equations
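$C_{pqij} = C \exp\{-\alpha^{-2}[(p-i)^2 + (q-j)^2]\}$
$E_{pqij} = E \exp\{-\beta^{-2}[(p-i)^2 + (q-j)^2]\}$
$x_{ij} = \dfrac{\sum_{p,q} (B\,C_{pqij} - D\,E_{pqij})\, I_{pq}}{A + \sum_{p,q} (C_{pqij} + E_{pqij})\, I_{pq}}$
$X_{ij} = \max(x_{ij}, 0)$
$I_{pq}$ = initial input (yin-yang)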
Example Code – Level 3
Oriented Direction-of-Contrast-Sensitive Units
Respond to angle
  12 discrete angles
Respond to direction of contrast, i.e., light-to-dark or dark-to-light
Fig. 6(d) in paper
Level 3 Equations
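$G_{pqij} = \exp\{-\gamma^{-2}[(p-i)^2 + (q-j)^2]\}$
$H^{(k)}_{pqij} = \exp\{-\gamma^{-2}[(p-i-m_k)^2 + (q-j-n_k)^2]\}$
$F^{(k)}_{pqij} = G_{pqij} - H^{(k)}_{pqij}$
$y_{ijk} = \sum_{p,q} X_{pq}\, F^{(k)}_{pqij}$
$m_k = \sin(2\pi k / K)$
$n_k = \cos(2\pi k / K)$
$Y_{ijk} = \max(y_{ijk}, 0)$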
Example Code - Level 4
Oriented Direction-of-Contrast-Insensitive Units
Respond to angle
Do not respond to direction of contrast, i.e., light-to-dark or dark-to-light
Fig. 8(a) in paper
Level 4 Equations
$z_{ijk} = Y_{ijk} + Y_{ij[k+(K/2)]}$
$Z_{ijk} = \max(z_{ijk} - L, 0)$
Example Code – Level 5
Level 5 – Boundary Contour Units
Pool nearby excitations
Fig. 8(d) in paper
Level 5 Equation
$Z_{ij} = \sum_{k} Z_{ijk}$
Timing
Timing
When tuning/parallelizing a code, you need to assess the effectiveness of your efforts
Can time whole code and/or specific sections
Some types of timers
  unix time command
  function/subroutine calls
  profiler
CPU Time or Wall-Clock Time?
CPU time
  How much time the CPU is actually crunching away
  User CPU time
    Time spent executing your source code
  System CPU time
    Time spent in system calls such as i/o
Wall-clock time
  What you would measure with a stopwatch
CPU Time or Wall-Clock Time? (cont’d)
Both are useful
For serial runs without interaction from keyboard, CPU and wall-clock times are usually close
  If you prompt for keyboard input, wall-clock time will accumulate if you get a cup of coffee, but CPU time will not
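The difference between the two clocks can be seen with a minimal sketch in C. It assumes a POSIX system; the helper names cpu_seconds and wall_seconds are made up for illustration. If the program sleeps (or waits for keyboard input) between two readings, the wall-clock difference grows while the CPU-time difference stays near zero.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* CPU time used by this process so far, in seconds.
   Hypothetical helper built on the standard clock() function. */
double cpu_seconds(void)
{
    clock_t c = clock();
    assert(c != (clock_t)-1);   /* clock() reports failure as (clock_t)-1 */
    return (double)c / CLOCKS_PER_SEC;
}

/* Wall-clock time in seconds since the epoch.
   Hypothetical helper built on POSIX gettimeofday(). */
double wall_seconds(void)
{
    struct timeval t;
    int rc = gettimeofday(&t, NULL);
    assert(rc == 0);
    (void)rc;                   /* keeps the build quiet when asserts are off */
    return (double)t.tv_sec + 1.0e-6 * (double)t.tv_usec;
}
```

Take a reading from each helper before and after the section of interest and subtract.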
CPU Time or Wall-Clock Time? (3)
Parallel runs
  Want wall-clock time, since CPU time will be about the same or even increase as number of procs. is increased
Wall-clock time may not be accurate if sharing processors
  Wall-clock timings should always be performed in batch mode
Unix Time Command
easiest way to time code
simply type time before your run command
output differs between C-type shells (csh, tcsh) and Bourne-type shells (bsh, bash, ksh)
Unix Time Command (cont’d)
twister:~ % time mycode
1.570u 0.010s 0:01.77 89.2% 75+1450k 0+0io 64pf+0w
Fields, in order:
  1.570u      user CPU time (s)
  0.010s      system CPU time (s)
  0:01.77     wall-clock time (min:sec)
  89.2%       (u+s)/wc
  75+1450k    avg. shared + unshared text space (KB)
  0+0io       input + output operations
  64pf+0w     page faults + no. times proc. was swapped
Unix Time Command (3)
Bourne shell results
$ time mycode
Real   1.62    wall-clock time (s)
User   1.57    user CPU time (s)
System 0.03    system CPU time (s)
Exercise 1
Copy files from /scratch/sondak/gt
  cp /scratch/sondak/gt/* .
Choose C (gt.c) or Fortran (gt.f90)
Compile with no optimization:
  pgcc -O0 -o gt gt.c
  pgf90 -O0 -o gt gt.f90
Submit rungt script to batch queue
  qsub rungt
Note: the optimization flag is a capital oh followed by a zero; the output-file flag is a small oh
Exercise 1 (cont’d)
Check status
  qstat -u username
After run has completed, a file will appear named rungt.o??????, where ?????? represents the process number
  File contains result of time command
  Write down wall-clock time
Re-compile using -O3
Re-run and check time
Function/Subroutine Calls
often need to time part of code
timers can be inserted in source code
language-dependent
cpu_time
intrinsic subroutine in Fortran
returns user CPU time (in seconds)
  no system time is included
0.01 sec. resolution on p-series
real :: t1, t2
call cpu_time(t1)
... do stuff to be timed ...
call cpu_time(t2)
print*, 'CPU time = ', t2-t1, ' sec.'
system_clock
intrinsic subroutine in Fortran
good for measuring wall-clock time
on p-series:
  resolution is 0.01 sec.
  max. time is 24 hr.
system_clock (cont’d)
t1 and t2 are tic counts
count_rate is optional argument containing tics/sec.
integer :: t1, t2, count_rate
call system_clock(t1, count_rate)
... do stuff to be timed ...
call system_clock(t2)
print*, 'wall-clock time = ', &
   real(t2-t1)/real(count_rate), ' sec'
times
can be called from C to obtain CPU time
0.01 sec. resolution on p-series
can also get system time with tms_stime
#include <stdio.h>
#include <sys/times.h>
#include <unistd.h>
int main(void){
    long tics_per_sec;
    clock_t tic1, tic2;
    struct tms timedat;
    tics_per_sec = sysconf(_SC_CLK_TCK);   /* clock tics per second */
    times(&timedat);
    tic1 = timedat.tms_utime;              /* user CPU time, in tics */
    /* ... do stuff to be timed ... */
    times(&timedat);
    tic2 = timedat.tms_utime;
    printf("CPU time = %5.2f\n", (double)(tic2-tic1)/(double)tics_per_sec);
    return 0;
}
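Since tms_stime is mentioned but not shown, here is a minimal sketch that reads both user and system CPU time with one call to times. The CpuTimes struct and the function name cpu_times_now are hypothetical, introduced only for this illustration; a POSIX system is assumed.

```c
#include <assert.h>
#include <sys/times.h>
#include <unistd.h>

/* Hypothetical container for the two CPU times, in seconds. */
typedef struct {
    double user;   /* from tms_utime */
    double sys;    /* from tms_stime */
} CpuTimes;

/* Returns user and system CPU time consumed by this process so far. */
CpuTimes cpu_times_now(void)
{
    struct tms timedat;
    long tics_per_sec = sysconf(_SC_CLK_TCK);   /* clock tics per second */
    assert(tics_per_sec > 0);
    clock_t rc = times(&timedat);
    assert(rc != (clock_t)-1);                  /* times() failed */
    (void)rc;
    CpuTimes t;
    t.user = (double)timedat.tms_utime / (double)tics_per_sec;
    t.sys  = (double)timedat.tms_stime / (double)tics_per_sec;
    return t;
}
```

Call it before and after the section to be timed and subtract the fields.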
gettimeofday
can be called from C to obtain wall-clock time
μsec resolution on p-series
#include <stdio.h>
#include <sys/time.h>
int main(void){
    struct timeval t;
    double t1, t2;
    gettimeofday(&t, NULL);
    t1 = t.tv_sec + 1.0e-6*t.tv_usec;      /* seconds + microseconds */
    /* ... do stuff to be timed ... */
    gettimeofday(&t, NULL);
    t2 = t.tv_sec + 1.0e-6*t.tv_usec;
    printf("wall-clock time = %5.3f\n", t2-t1);
    return 0;
}
MPI_Wtime
convenient wall-clock timer for MPI codes
μsec resolution on p-series
MPI_Wtime (cont’d)
Fortran:
double precision t1, t2
t1 = mpi_wtime()
... do stuff to be timed ...
t2 = mpi_wtime()
print*, 'wall-clock time = ', t2-t1
C:
double t1, t2;
t1 = MPI_Wtime();
... do stuff to be timed ...
t2 = MPI_Wtime();
printf("wall-clock time = %5.3f\n", t2-t1);
omp_get_wtime
convenient wall-clock timer for OpenMP codes
resolution available by calling omp_get_wtick()
0.01 sec. resolution on p-series
omp_get_wtime (cont’d)
Fortran:
double precision t1, t2, omp_get_wtime
t1 = omp_get_wtime()
... do stuff to be timed ...
t2 = omp_get_wtime()
print*, 'wall-clock time = ', t2-t1
C:
double t1, t2;
t1 = omp_get_wtime();
... do stuff to be timed ...
t2 = omp_get_wtime();
printf("wall-clock time = %5.3f\n", t2-t1);
Timer Summary
          CPU         Wall
Fortran   cpu_time    system_clock
C         times       gettimeofday
MPI                   MPI_Wtime
OpenMP                omp_get_wtime
Exercise 2
Put wall-clock timer around each “level” in the example code
Print time for each level
Compile and run
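One possible way to avoid repeating the timer calls around every level is a small wrapper, sketched below in C. This is only an illustration: time_section, wall_now, and dummy_level are hypothetical names, and dummy_level stands in for a routine that computes one level. The routines in gt.c may take arguments, in which case the timer calls would simply be placed around each call instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/time.h>

/* Current wall-clock time in seconds, from gettimeofday. */
static double wall_now(void)
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return (double)t.tv_sec + 1.0e-6 * (double)t.tv_usec;
}

/* Runs one section of the code, then prints and returns its wall-clock time. */
double time_section(const char *label, void (*section)(void))
{
    assert(label != NULL && section != NULL);
    double t1 = wall_now();
    section();
    double t2 = wall_now();
    printf("%s: wall-clock time = %.3f sec\n", label, t2 - t1);
    return t2 - t1;
}

/* Stand-in for one "level" of the example code (hypothetical). */
void dummy_level(void)
{
    volatile double s = 0.0;
    for (long i = 0; i < 1000000L; i++)
        s += (double)i;
}
```

A call such as time_section("level 2", dummy_level) prints one line per level.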
PROFILING
Profilers
profile tells you how much time is spent in each routine
gives a level of granularity not available with previous timers
  e.g., function may be called from many places
various profilers available, e.g.
  gprof (GNU)
  pgprof (Portland Group)
  Xprofiler (AIX)
gprof
compile with -pg
file gmon.out will be created when you run
gprof executable > myprof
for multiple procs. (MPI), copy or link gmon.out.n to gmon.out, then run gprof
gprof (cont’d)
granularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

  %   cumulative    self                 self      total
 time   seconds   seconds      calls   ms/call   ms/call  name
 20.5     89.17     89.17         10   8917.00  10918.00  .conduct [5]
  7.6    122.34     33.17        323    102.69    102.69  .getxyz [8]
  7.5    154.77     32.43                                 .__mcount [9]
  7.2    186.16     31.39     189880      0.17      0.17  .btri [10]
  7.2    217.33     31.17                                 .kickpipes [12]
  5.1    239.58     22.25  309895200      0.00      0.00  .rmnmod [16]
  2.3    249.67     10.09        269     37.51     37.51  .getq [24]
gprof (3)
granularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

                                    called/total      parents
index  %time    self  descendents   called+self    name       index
                                    called/total      children

                0.00       340.50       1/1           .__start [2]
[1]     78.3    0.00       340.50       1          .main [1]
                2.12       319.50      10/10          .contrl [3]
                0.04         7.30      10/10          .force [34]
                0.00         5.27       1/1           .initia [40]
                0.56         3.43       1/1           .plot3da [49]
                0.00         1.27       1/1           .data [73]
pgprof
compile with Portland Group compiler
  pgf90 (pgf95, etc.)
  pgcc
  -Mprof=func
    similar to -pg
run code
pgprof -exe executable
  pops up window with flat profile
pgprof (cont’d)
pgprof (3)
To save profile data to a file:
  re-run pgprof using -text flag
  at command prompt type p > filename
    filename is the name you want to give the profile file
  type quit to get out of profiler
Exercise 3
Use pgprof to profile code
  compile using -Mprof=func
  run code
  create profile using pgprof -exe gt
Note which routines use most time
Please close pgprof when you’re through
  Leaving window open ties up a license
Line-Level Profiling
Times individual lines
For pgprof, compile with the flag
  -Mprof=line
Optimizer will re-order lines
  profiler will lump lines in some loops or other constructs
  may want to compile without optimization, may not
In flat profile, double-click on function to get line-level data
Line-Level Profiling (cont’d)
Exercise 4
Compile code with -Mprof=line and -O0 and run
  will take about 5 minutes to run due to overhead from line-level profiling and lack of optimization
Examine line-level profile for most time-consuming routine
  Note lines with longest time consumption
Save your profile data to a file (we will need it later)
  re-run pgprof using -text flag
  at command prompt type p > prof
CACHE
Cache
Cache is a small chunk of fast memory between the main memory and the registers
[Figure: memory hierarchy showing registers, primary cache, secondary cache, and main memory]
Cache (cont’d)
If variables are used repeatedly, code will run faster since cache memory is much faster than main memory
Variables are moved from main memory to cache in lines
L1 cache line sizes on our machines:
  Opteron (katana cluster)       64 bytes
  Xeon (katana cluster)          64 bytes
  Power4 (p-series)             128 bytes
  PPC440 (Blue Gene)             32 bytes
  Pentium III (linux cluster)    32 bytes
Cache (3)
Why not just make the main memory out of the same stuff as cache?
  Expensive
  Runs hot
  This was actually done in Cray computers
    Liquid cooling system
Cache (4)
Cache hit
  Required variable is in cache
Cache miss
  Required variable not in cache
  If cache is full, something else must be thrown out (sent back to main memory) to make room
Want to minimize number of cache misses
Cache (5)
for(i=0; i<10; i++) x[i] = i;
Main memory: … x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] a b …
“mini” cache holds 2 lines, 4 words each
Cache (6)
for(i=0; i<10; i++) x[i] = i;
Main memory: … x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] a b …
• will ignore i for simplicity
• need x[0], not in cache: cache miss
• load line from memory into cache
• next 3 loop indices result in cache hits
Cache line 1: x[0] x[1] x[2] x[3]
Cache line 2: (empty)
Cache (7)
for(i=0; i<10; i++) x[i] = i;
Main memory: … x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] a b …
• need x[4], not in cache: cache miss
• load line from memory into cache
• next 3 loop indices result in cache hits
Cache line 1: x[0] x[1] x[2] x[3]
Cache line 2: x[4] x[5] x[6] x[7]
Cache (8)
for(i=0; i<10; i++) x[i] = i;
Main memory: … x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] a b …
• need x[8], not in cache: cache miss
• load line from memory into cache
• no room in cache!
• replace old line
Cache line 1: x[8] x[9] a b
Cache line 2: x[4] x[5] x[6] x[7]
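The walk-through above can be checked with a small simulation of the “mini” cache, sketched here in C. The function name count_misses is made up for this example, and two assumptions go beyond the slides: x[0] is taken to start a cache line, and the least recently used line is the one replaced (the slides only say “old line”). For the loop over x[0] through x[9] it counts 3 misses, so the other 7 accesses are hits.

```c
#include <assert.h>
#include <stddef.h>

#define NLINES     2   /* lines in the "mini" cache */
#define LINE_WORDS 4   /* words per cache line */

/* Counts cache misses for n non-negative word addresses accessed in order.
   Assumes address 0 starts a cache line and least-recently-used replacement. */
int count_misses(const int *addr, int n)
{
    assert(addr != NULL && n >= 0);
    int tag[NLINES];        /* which memory line each cache line holds */
    int last_used[NLINES];  /* access index of the most recent use */
    int misses = 0;

    for (int k = 0; k < NLINES; k++) {
        tag[k] = -1;        /* -1 marks an empty cache line */
        last_used[k] = -1;
    }
    for (int t = 0; t < n; t++) {
        int line = addr[t] / LINE_WORDS;
        int hit = -1, victim = 0;
        for (int k = 0; k < NLINES; k++) {
            if (tag[k] == line)
                hit = k;                 /* required word is in cache */
            if (last_used[k] < last_used[victim])
                victim = k;              /* oldest line seen so far */
        }
        if (hit < 0) {                   /* miss: load line over the oldest */
            misses++;
            tag[victim] = line;
            hit = victim;
        }
        last_used[hit] = t;
    }
    return misses;
}
```

Passing addresses that jump by a full line each time (0, 4, 8, 0, 4, 8) makes every access a miss, which is the situation the following slides on contiguous access warn about.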
Cache (9)
Contiguous access is important
In C, multidimensional array is stored in memory as
a[0][0]
a[0][1]
a[0][2]
…
Cache (10)
In Fortran and Matlab, multidimensional array is stored the opposite way:
a(1,1)
a(2,1)
a(3,1)
…
Cache (11)
Rule: Always order your loops appropriately
  will usually be taken care of by optimizer
  suggestion: don’t rely on optimizer
C:
for(i=0; i<N; i++){
  for(j=0; j<N; j++){
    a[i][j] = 1.0;
  }
}
Fortran:
do j = 1, n
  do i = 1, n
    a(i,j) = 1.0
  enddo
enddo
TUNING TIPS
Tuning Tips
Some of these tips will be taken care of by compiler optimization
  It’s best to do them yourself, since compilers vary
Two important rules
  minimize number of operations
  access cache contiguously
Tuning Tips (cont’d)
Access arrays in contiguous order
For multi-dimensional arrays, rightmost index varies fastest for C and C++, leftmost for Fortran and Matlab
Bad (C):
for(j=0; j<N; j++){
  for(i=0; i<N; i++){
    a[i][j] = 1.0;
  }
}
Good (C):
for(i=0; i<N; i++){
  for(j=0; j<N; j++){
    a[i][j] = 1.0;
  }
}
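To compare the two loop orders on your own machine, a sketch along the following lines could be wrapped in any of the wall-clock timers from the Timing section. The function names sum_contiguous and sum_strided are hypothetical, and the array is assumed to be stored row by row in a flat block, as C stores a[i][j]. Both functions perform the same additions; only the order of memory accesses differs, so any timing difference comes from cache behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Sums an n x n array stored row by row (C order). The inner loop runs
   over the rightmost index, so memory is read contiguously. */
double sum_contiguous(const double *a, int n)
{
    assert(a != NULL && n > 0);
    double s = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += a[(size_t)i * n + j];
    return s;
}

/* Same sum with the loops swapped: consecutive reads are n elements apart. */
double sum_strided(const double *a, int n)
{
    assert(a != NULL && n > 0);
    double s = 0.0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            s += a[(size_t)i * n + j];
    return s;
}
```

The gap is expected to grow once the array no longer fits in cache, although the optimizer may interchange the loops on its own at higher optimization levels.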
Tuning Tips (3)
Eliminate redundant operations in loops
Bad:
for(i=0; i<N; i++){
  x = 10;
  …
}
Good:
x = 10;
for(i=0; i<N; i++){
  …
}
Tuning Tips (4)
Minimize if statements within loops
They may inhibit pipelining
for(i=0; i<N; i++){
if(i==0)
perform i=0 calculations
else
perform i>0 calculations
}
Tuning Tips (5)
Better Way:
perform i=0 calculations
for(i=1; i<N; i++){
perform i>0 calculations
}
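As a concrete instance of this pattern, the sketch below computes differences between neighboring elements, where the first element has no predecessor. The function names and the choice of calculation are illustrative only and are not taken from the example code. Both versions produce the same output; the second handles i=0 once, outside the loop.

```c
#include <assert.h>
#include <stddef.h>

/* d[i] = x[i] - x[i-1], with d[0] = x[0]; the test is repeated every pass. */
void diff_branchy(const double *x, double *d, int n)
{
    assert(x != NULL && d != NULL && n > 0);
    for (int i = 0; i < n; i++) {
        if (i == 0)
            d[i] = x[i];
        else
            d[i] = x[i] - x[i - 1];
    }
}

/* Same result: the i = 0 case is handled once, before the loop. */
void diff_peeled(const double *x, double *d, int n)
{
    assert(x != NULL && d != NULL && n > 0);
    d[0] = x[0];
    for (int i = 1; i < n; i++)
        d[i] = x[i] - x[i - 1];
}
```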
Tuning Tips (6)
Divides are expensive
Intel x86 clock cycles per operation
  add       3-6
  multiply  4-8
  divide    32-45
Bad:
for(i=0; i<N; i++)
  x[i] = y[i]/scalarval;
Good:
qs = 1.0/scalarval;
for(i=0; i<N; i++)
  x[i] = y[i]*qs;
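A self-contained version of the two loops is sketched below; the function names are made up for illustration. One caveat worth knowing: multiplying by a precomputed reciprocal can differ from a true divide in the last bit of the result, because 1.0/scalarval is itself rounded. When scalarval is a power of two the reciprocal is exact and the results are identical.

```c
#include <assert.h>
#include <stddef.h>

/* One divide per element. */
void scale_divide(const double *y, double *x, int n, double scalarval)
{
    assert(y != NULL && x != NULL && n >= 0 && scalarval != 0.0);
    for (int i = 0; i < n; i++)
        x[i] = y[i] / scalarval;
}

/* One divide in total, then one multiply per element. */
void scale_multiply(const double *y, double *x, int n, double scalarval)
{
    assert(y != NULL && x != NULL && n >= 0 && scalarval != 0.0);
    double qs = 1.0 / scalarval;
    for (int i = 0; i < n; i++)
        x[i] = y[i] * qs;
}
```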
Tuning Tips (7)
• There is overhead associated with a function call
Bad:
for(i=0; i<N; i++)
  myfunc(i);
Good:
myfunc();
void myfunc(){
  for(int i=0; i<N; i++){
    do stuff
  }
}
Tuning Tips (8)
• Minimize calls to math functions
Bad:
for(i=0; i<N; i++)
  z[i] = log(x[i]) + log(y[i]);
Good:
for(i=0; i<N; i++)
  z[i] = log(x[i] * y[i]);
Tuning Tips (9)
• Recasting may be costlier than you think
Bad:
sum = 0.0;
for(i=0; i<N; i++)
  sum += (float) i;
Good:
isum = 0;
for(i=0; i<N; i++)
  isum += i;
sum = (float) isum;
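The two variants can be written as complete functions, as in the sketch below; the names sum_cast_each and sum_cast_once are hypothetical. For small N both return the same value. For large N they can differ, since the float accumulator rounds once the running sum is too large to be held exactly, while the integer accumulator stays exact until it overflows.

```c
#include <assert.h>

/* Converts i to float on every pass through the loop. */
float sum_cast_each(int n)
{
    assert(n >= 0);
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += (float)i;
    return sum;
}

/* Accumulates in an integer and converts once, after the loop. */
float sum_cast_once(int n)
{
    assert(n >= 0);
    long isum = 0;
    for (int i = 0; i < n; i++)
        isum += i;
    return (float)isum;
}
```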
Exercise 5
The example code that has been provided is written in a clear, readable style that also happens to violate lots of the tuning tips that we have just reviewed.
Examine the line-level profile. What lines are using the most time? Is there anything we might be able to do to make it run faster?
We will discuss options as a group
  come up with a strategy
  modify code
  re-compile and run
  compare timings
Re-examine line-level profile, come up with another strategy, repeat procedure, etc.
Survey
Please fill out the survey for this tutorial at
http://scv.bu.edu/survey/tutorial_evaluation.html