Performance Optimization
Getting your programs to run faster
Why optimize
- Better turn-around on jobs
- Run more programs/scenarios
- Release resources to other applications
- You want the job to finish before you retire
Ways to get more performance
- Run on bigger, faster hardware: clock speed, more memory, …
- Tweak your algorithm
- Optimize your code
Loop Unrolling
- Converting passes of a loop into in-line streams of code
- Useful when loops do calculations on data in arrays
- Unrolling can take advantage of pipelined processing units in processors
- Compiler may preload operands into CPU registers
Loop Unrolling – disadvantages
- May be limited by the number of floating-point registers
  - Pentium III: 8
  - Pentium 4: 8
  - Itanium: 128
Loop Unrolling – simple example
Loop:

do i=1,n
  a(i) = b(i) + x*c(i)
enddo
Unrolled loop (assuming n is a multiple of 4):

do i=1,n,4
  a(i)   = b(i)   + x*c(i)
  a(i+1) = b(i+1) + x*c(i+1)
  a(i+2) = b(i+2) + x*c(i+2)
  a(i+3) = b(i+3) + x*c(i+3)
enddo
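The unrolled loop above only works when n divides evenly by the unroll factor. A minimal C sketch of the same idea with a cleanup loop for the leftover iterations (function and variable names here are illustrative, not from the slides):

```c
/* Unroll by 4 with a cleanup loop, so n need not be a multiple of 4. */
void axpy_unrolled(int n, float *a, const float *b, const float *c, float x) {
    int i;
    int limit = n - (n % 4);          /* largest multiple of 4 <= n */
    for (i = 0; i < limit; i += 4) {  /* unrolled body */
        a[i]   = b[i]   + x * c[i];
        a[i+1] = b[i+1] + x * c[i+1];
        a[i+2] = b[i+2] + x * c[i+2];
        a[i+3] = b[i+3] + x * c[i+3];
    }
    for (; i < n; i++)                /* cleanup: at most 3 iterations */
        a[i] = b[i] + x * c[i];
}
```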
Loop Unrolling – simple example
Performance – rolled:
- P3 550 MHz – 13 Mflops
- Itanium – 30 Mflops

Performance – unrolled:
- P3 550 MHz – 30 Mflops
- Itanium – 107 Mflops
*from: LCI and NCSA
Loop Unrolling
Original loop:

int a[100];
int i;
for (i = 0; i < 100; i++) {
    a[i] = a[i] * 2;
}

Unrolled loop (factor 5; 100 is a multiple of 5):

int a[100];
int i;
for (i = 0; i < 100; i += 5) {
    a[i]   = a[i]   * 2;
    a[i+1] = a[i+1] * 2;
    a[i+2] = a[i+2] * 2;
    a[i+3] = a[i+3] * 2;
    a[i+4] = a[i+4] * 2;
}
Loop unrolling
Original nested loop:

int a[10][10];
int i, j;
for (i = 0; i < 10; i++) {
    for (j = 0; j < 10; j++) {
        a[i][j] = a[i][j] * 2;
    }
}

Inner loop fully unrolled:

int a[10][10];
int i;
for (i = 0; i < 10; i++) {
    a[i][0] = a[i][0] * 2;  a[i][1] = a[i][1] * 2;
    a[i][2] = a[i][2] * 2;  a[i][3] = a[i][3] * 2;
    a[i][4] = a[i][4] * 2;  a[i][5] = a[i][5] * 2;
    a[i][6] = a[i][6] * 2;  a[i][7] = a[i][7] * 2;
    a[i][8] = a[i][8] * 2;  a[i][9] = a[i][9] * 2;
}
Loop unrolling – Dot Product
Original loop:

float a[100];
float b[100];
float z = 0.0f;
int i;
for (i = 0; i < 100; i++) {
    z = z + a[i] * b[i];
}

Unrolled loop:

float a[100];
float b[100];
float z = 0.0f;
int i;
for (i = 0; i < 100; i += 2) {
    z = z + a[i]   * b[i];
    z = z + a[i+1] * b[i+1];
}
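Unrolling the dot product this way still chains every addition through the single variable z, which limits pipelining. A common refinement, not shown on the slides, is to use independent partial sums; a minimal sketch:

```c
/* Dot product with two independent accumulators: the two partial sums
 * have no dependency on each other, so their additions can overlap in
 * the floating-point pipeline. Assumes n is even, for brevity. */
float dot2(int n, const float *a, const float *b) {
    float z0 = 0.0f, z1 = 0.0f;
    for (int i = 0; i < n; i += 2) {
        z0 += a[i]   * b[i];
        z1 += a[i+1] * b[i+1];
    }
    return z0 + z1;   /* combine the partial sums once, at the end */
}
```

Note that this reassociates the sum, so, like the reciprocal trick later in these slides, it can change the last bits of the floating-point result.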
Unrolling Loops
You can do it automatically
Unrolling Loops – compiler options
GNU compilers:
  -funroll-loops
  -funroll-all-loops (not recommended)
PGI compilers:
  -Munroll
  -Munroll=c:N
  -Munroll=n:M
Unrolling Loops – Compiler Options
Intel compilers:
  -unrollM (unroll up to M times)
  -unroll
Design your program to minimize cache misses
Align data arrays with cache boundaries
If your algorithm makes repeated passes across specific rows or columns, try making your array dimensions match the cache buffer size of your computer.

For example: if your array is 1000 x 1000 (single-byte integers) and you have a 1024-byte cache, allocate the array as 1024 x 1024.
Align data arrays with cache boundaries
…or if your array is 500 x 500, allocate it as 512 x 512.
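A minimal C sketch of the 500-to-512 padding idea (names and sizes follow the slide's example; only the row stride changes, the algorithm still touches the 500 x 500 block):

```c
#define USED   500   /* logical size, from the slide's example */
#define PADDED 512   /* row length padded up to a power of two */

/* Padding each row to 512 elements gives every row a fixed
 * power-of-two stride, so row starts line up with cache boundaries.
 * Only the leading USED x USED block is ever used. */
static float grid[PADDED][PADDED];

float get_cell(int i, int j)          { return grid[i][j]; }
void  set_cell(int i, int j, float v) { grid[i][j] = v; }
```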
Align data arrays with cache boundaries
[Figure: array elements (1,1), (1,2), (1,3), … shown laid out linearly in memory, with and without padding; with padded dimensions, each row begins on a cache boundary and the unused padding elements sit at the end of each row.]
Taking Memory in Order
Optimizing the use of cache: row-major order vs. column-major order.

row major:    a(1,1), a(1,2), a(1,3), a(2,1), a(2,2), …
column major: a(1,1), a(2,1), a(3,1), a(1,2), a(2,2), …
Taking Memory in Order
Remember: C and Fortran store arrays in the opposite manner.
C – row major
Fortran – column major
Taking Memory in Order
Wrong order for Fortran (inner loop walks along a row, striding through memory):

do i=1,m
  do j=1,n
    a(i,j) = b(i,j) + c(i)
  end do
end do

•loop time: 23.42
•loop runs at 4.48 Mflops

Right order for Fortran (inner loop walks down a column, matching column-major storage):

do j=1,n
  do i=1,m
    a(i,j) = b(i,j) + c(i)
  end do
end do

•loop time: 2.80
•loop runs at 37.48 Mflops
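Since C is row-major, the cache-friendly order is the opposite of Fortran's: the inner loop should run over the second index. A sketch of the same computation in C (array sizes here are illustrative):

```c
#define M 4
#define N 5

/* C stores rows contiguously, so making the inner loop vary the
 * SECOND index walks through memory in order, taking memory in order
 * just as the second Fortran loop does for column-major storage. */
void add_rowwise(float a[M][N], float b[M][N], const float c[M]) {
    for (int i = 0; i < M; i++)        /* row */
        for (int j = 0; j < N; j++)    /* column: contiguous in memory */
            a[i][j] = b[i][j] + c[i];
}
```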
Floating Point Division
- FP division is very expensive in terms of processor time
- 20-60 clock cycles to compute
- Usually not pipelined
- FP division required by IEEE “rules”
Floating point division – use reciprocal

float a[100];
int i;
for (i = 0; i < 100; i++) {
    a[i] = a[i] / 2;
}

float a[100];
float denom = 1.0f / 2;   /* note: plain 1/2 is integer division and yields 0 */
int i;
for (i = 0; i < 100; i++) {
    a[i] = a[i] * denom;
}
Compiler options for IEEE compatibility:
- PGI compilers: -Knoieee
- Intel compilers: -mp
- GNU compilers: can’t do
Floating Point Division
- Compilers can’t optimize if divisor is not scalar
- Breaks IEEE “rules”
- May impact portability
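Why the reciprocal trick "breaks IEEE rules": x / d is a single correctly-rounded operation, while x * (1.0f / d) rounds twice, so the two can differ in the last bit when d is not a power of two. A minimal illustration (values are illustrative):

```c
#include <math.h>

/* For powers of two the reciprocal is exact, so both forms agree.
 * For a divisor like 3, the reciprocal 1/3 is already rounded, and
 * the product rounds again, so results may differ in the last bit. */
float div_direct(float x, float d)     { return x / d; }
float div_reciprocal(float x, float d) { return x * (1.0f / d); }
```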
Function Inlining
Build functions/subroutines in as inline parts of the program’s code, rather than as separate functions/subroutines; this minimizes function calls (and the management overhead of making them).
Function Inlining
Compile with -Minline:
  the compiler tries to inline whatever meets its criteria
-Minline=except:func
  excludes func from inlining
-Minline=func
  inlines only func
Function Inlining
-Minline=myfile.lib
  inlines functions from an inline library file
-Minline=levels:n
  inlines functions up to n levels of calls (default is usually 1)
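What inlining does, sketched in C: the call is replaced by the function body, removing call/return overhead. In C you can request it with `static inline` (the compiler still decides, just as -Minline applies its own criteria); names here are illustrative:

```c
/* The compiler can expand scale() in place at each call site,
 * eliminating the call/return and enabling further optimization
 * of the surrounding loop. */
static inline float scale(float x) { return 2.0f * x; }

float sum_scaled(const float *a, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += scale(a[i]);   /* likely expanded in place, no call */
    return s;
}
```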
MPI Tuning
- Minimize messages
  - Pointers/counts
  - MPI derived datatypes
  - MPI_Pack/MPI_Unpack
- Use shared memory for message passing:
  #PBS -l nodes=6:ppn=1   …but…
  #PBS -l nodes=3:ppn=2   …is better.
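Combining several small sends into one message amortizes per-message latency. The sketch below packs two fields into one buffer with plain memcpy to show the idea; in MPI, MPI_Pack/MPI_Unpack (or a derived datatype) play this role. The struct and names are illustrative:

```c
#include <string.h>
#include <stddef.h>

/* One buffer, one send, instead of a separate message per field. */
typedef struct { int count; double values[4]; } Payload;

size_t pack_payload(const Payload *p, char *buf) {
    size_t off = 0;
    memcpy(buf + off, &p->count, sizeof p->count);  off += sizeof p->count;
    memcpy(buf + off, p->values, sizeof p->values); off += sizeof p->values;
    return off;   /* bytes packed */
}

size_t unpack_payload(Payload *p, const char *buf) {
    size_t off = 0;
    memcpy(&p->count, buf + off, sizeof p->count);  off += sizeof p->count;
    memcpy(p->values, buf + off, sizeof p->values); off += sizeof p->values;
    return off;   /* bytes consumed */
}
```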
Compiler optimizations
-O0: no optimization
-O1: local optimization, register allocation
-O2: local/limited global optimization
-O3: aggressive global optimization
-Munroll: loop unrolling
-Mvect: vectorization
-Minline: function inlining
gcc Compiler Optimizations

See: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html