Performance Optimization

Getting your programs to run faster


Why optimize

Better turnaround on jobs
Run more programs/scenarios
Release resources to other applications
You want the job to finish before you retire

Ways to get more performance

Run on bigger, faster hardware (clock speed, more memory, …)
Tweak your algorithm
Optimize your code

Loop Unrolling

Converting passes of a loop into an in-line stream of code
Useful when loops do calculations on data in arrays
Unrolling can take advantage of pipelined processing units in the processor
The compiler may preload operands into CPU registers

Loop Unrolling – disadvantages

May be limited by the number of floating-point registers:
Pentium III: 8
Pentium 4: 8
Itanium: 128

Loop Unrolling – simple example

Loop:

do i = 1, n
  a(i) = b(i) + x*c(i)
enddo

Unrolled loop (n is assumed to be a multiple of 4; otherwise a cleanup loop is needed for the leftover iterations):

do i = 1, n, 4
  a(i)   = b(i)   + x*c(i)
  a(i+1) = b(i+1) + x*c(i+1)
  a(i+2) = b(i+2) + x*c(i+2)
  a(i+3) = b(i+3) + x*c(i+3)
enddo

Loop Unrolling – simple example

Performance, rolled:
P3 550 MHz – 13 Mflops
Itanium – 30 Mflops

Performance, unrolled:
P3 550 MHz – 30 Mflops
Itanium – 107 Mflops

*from: LCI and NCSA

Loop Unrolling

int a[100];
for (i = 0; i < 100; i++) {
    a[i] = a[i] * 2;
}

int a[100];
for (i = 0; i < 100; i += 5) {
    a[i]   = a[i]   * 2;
    a[i+1] = a[i+1] * 2;
    a[i+2] = a[i+2] * 2;
    a[i+3] = a[i+3] * 2;
    a[i+4] = a[i+4] * 2;
}

Loop unrolling

int a[10][10];
for (i = 0; i < 10; i++) {
    for (j = 0; j < 10; j++) {
        a[i][j] = a[i][j] * 2;
    }
}

int a[10][10];
for (i = 0; i < 10; i++) {
    a[i][0] = a[i][0] * 2;  a[i][1] = a[i][1] * 2;
    a[i][2] = a[i][2] * 2;  a[i][3] = a[i][3] * 2;
    a[i][4] = a[i][4] * 2;  a[i][5] = a[i][5] * 2;
    a[i][6] = a[i][6] * 2;  a[i][7] = a[i][7] * 2;
    a[i][8] = a[i][8] * 2;  a[i][9] = a[i][9] * 2;
}

Loop unrolling – Dot Product

float a[100];
float b[100];
float z = 0;
for (i = 0; i < 100; i++) {
    z = z + a[i] * b[i];
}

float a[100];
float b[100];
float z = 0;
for (i = 0; i < 100; i += 2) {
    z = z + a[i] * b[i];
    z = z + a[i+1] * b[i+1];
}

Unrolling Loops

You can have the compiler do it automatically

Unrolling Loops – compiler options

GNU compilers:
-funroll-loops
-funroll-all-loops (not recommended)

PGI compilers:
-Munroll
-Munroll=c:N
-Munroll=n:M

Unrolling Loops – Compiler Options

Intel compilers:
-unrollM (unroll up to M times)
-unroll

Design your program to minimize cache misses

Align data arrays with cache boundaries

Align data arrays with cache boundaries

If your algorithm has repetitive iterations across specific rows or columns, try making your array dimensions match the cache buffer size of your computer.

For example, if your array is 1000 x 1000 (single-byte integers) and you have a 1024-byte cache, allocate the array as 1024 x 1024.

Align data arrays with cache boundaries

…or if your array is 500 x 500, allocate it as 512 x 512.

Align data arrays with cache boundaries

[Diagram: elements (1,1), (1,2), … of a 2-D array shown laid out linearly in memory, illustrating how rows line up with cache-line boundaries when the dimensions are padded.]

Taking Memory in Order

Optimizing the use of cache: row-major order vs. column-major order

row major – a(1,1), a(1,2), a(1,3), …, a(2,1), a(2,2), …
column major – a(1,1), a(2,1), a(3,1), …, a(1,2), a(2,2), …

Taking Memory in Order

Remember, C and Fortran store arrays in the opposite manner:
C – row major
Fortran – column major

Taking Memory in Order

[Diagram: memory layout of the same 2-D array in C (row major) vs. Fortran (column major).]

Taking Memory in Order

Inner loop varies the second index (strides across memory in column-major Fortran):

do i = 1, m
  do j = 1, n
    a(i,j) = b(i,j) + c(i)
  end do
end do

loop time: 23.42
loop runs at 4.48 Mflops

Inner loop varies the first index (stride-1 in column-major Fortran):

do j = 1, m
  do i = 1, n
    a(i,j) = b(i,j) + c(i)
  end do
end do

loop time: 2.80
loop runs at 37.48 Mflops

Floating Point Division

FP division is very expensive in terms of processor time
20–60 clock cycles to compute
Usually not pipelined
FP division required by IEEE “rules”

Floating point division – use the reciprocal

float a[100];
for (i = 0; i < 100; i++) {
    a[i] = a[i] / 2;
}

float a[100];
float denom;
denom = 1.0f / 2.0f;   /* note: the integer expression 1/2 would yield 0 */
for (i = 0; i < 100; i++) {
    a[i] = a[i] * denom;
}

Compiler options for IEEE compatibility:

PGI compilers: -Knoieee
Intel compilers: -mp
GNU compilers: can’t do


Floating Point Division

Compilers can’t optimize if the divisor is not scalar
Breaks IEEE “rules”
May impact portability

Function Inlining

Build functions/subroutines in as inline parts of the program’s code…
…rather than as separate functions/subroutines
Minimizes function calls (and the management of…)

Function Inlining

Compile with -Minline
  the compiler tries to inline whatever meets its criteria
-Minline=except:func
  excludes func from inlining
-Minline=func
  inlines only func

Function Inlining

Compile with -Minline=myfile.lib
  inlines functions from an inline library file
-Minline=levels:n
  inlines functions up to n levels of calls (default is usually 1)

MPI Tuning

Minimize messages
Pointers/counts
MPI derived datatypes
MPI_Pack/MPI_Unpack
Use shared memory for message passing:
#PBS -l nodes=6:ppn=1 … but …
#PBS -l nodes=3:ppn=2 … is better.

Compiler optimizations

-O0 – no optimization
-O1 – local optimization, register allocation
-O2 – local/limited global optimization
-O3 – aggressive global optimization
-Munroll – loop unrolling
-Mvect – vectorization
-Minline – function inlining

gcc Compiler Optimizations

See: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html (recommended)