
Advanced OpenMP, SC'2000 1

Advanced Parallel Programming with OpenMP

Tim Mattson, Intel Corporation, Computational Sciences Laboratory

Rudolf Eigenmann, Purdue University, School of Electrical and Computer Engineering


Advanced OpenMP, SC'2000 2

SC’2000 Tutorial Agenda

OpenMP: A Quick Recap
OpenMP Case Studies
– including performance tuning
Automatic Parallelism and Tools Support
Common Bugs in OpenMP programs
– and how to avoid them
Mixing OpenMP and MPI
The Future of OpenMP


Advanced OpenMP, SC'2000 3

OpenMP: Recap

omp_set_lock(lck)

#pragma omp parallel for private(A, B)

#pragma omp critical

C$OMP parallel do shared(a, b, c)

C$OMP PARALLEL REDUCTION (+: A, B)

call OMP_INIT_LOCK (ilok)

call omp_test_lock(jlok)

setenv OMP_SCHEDULE “dynamic”

CALL OMP_SET_NUM_THREADS(10)

C$OMP DO lastprivate(XX)

C$OMP ORDERED

C$OMP SINGLE PRIVATE(X)

C$OMP SECTIONS

C$OMP MASTER

C$OMP ATOMIC

C$OMP FLUSH

C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)

C$OMP THREADPRIVATE(/ABC/)

C$OMP PARALLEL COPYIN(/blk/)

Nthrds = OMP_GET_NUM_PROCS()

!$OMP BARRIER

OpenMP: An API for Writing Multithreaded Applications

– A set of compiler directives and library routines for parallel application programmers

– Makes it easy to create multi-threaded (MT) programs in Fortran, C and C++

– Standardizes last 15 years of SMP practice



Advanced OpenMP, SC'2000 4

OpenMP: Supporters*

Hardware vendors
– Intel, HP, SGI, IBM, SUN, Compaq

Software tools vendors
– KAI, PGI, PSR, APR

Applications vendors
– ANSYS, Fluent, Oxford Molecular, NAG, DOE ASCI, Dash, Livermore Software, and many others

* These vendor names were taken from the OpenMP web site (www.openmp.org). We have made no attempt to confirm OpenMP support, verify conformity to the specifications, or measure the degree of OpenMP utilization.


Advanced OpenMP, SC'2000 5

OpenMP: Programming Model

Fork-Join Parallelism: Master thread spawns a team of threads as needed.

Parallelism is added incrementally: i.e. the sequential program evolves into a parallel program.

[Diagram: the master thread spawns a team of threads at each parallel region and joins them afterwards]


Advanced OpenMP, SC'2000 6

OpenMP: How is OpenMP typically used? (in C)

OpenMP is usually used to parallelize loops:
– Find your most time consuming loops.
– Split them up between threads.

Sequential Program:
  void main()
  {
    double Res[1000];
    for(int i=0;i<1000;i++) {
      do_huge_comp(Res[i]);
    }
  }

Parallel Program (split up this loop between multiple threads):
  #include "omp.h"
  void main()
  {
    double Res[1000];
    #pragma omp parallel for
    for(int i=0;i<1000;i++) {
      do_huge_comp(Res[i]);
    }
  }


Advanced OpenMP, SC'2000 7

OpenMP: How is OpenMP typically used? (Fortran)

OpenMP is usually used to parallelize loops:
– Find your most time consuming loops.
– Split them up between threads.

Sequential Program:
      program example
      double precision Res(1000)
      do I=1,1000
        call huge_comp(Res(I))
      end do
      end

Parallel Program (split up this loop between multiple threads):
      program example
      double precision Res(1000)
C$OMP PARALLEL DO
      do I=1,1000
        call huge_comp(Res(I))
      end do
      end


Advanced OpenMP, SC'2000 8

OpenMP: How do threads interact?

OpenMP is a shared memory model.
– Threads communicate by sharing variables.
Unintended sharing of data causes race conditions:
– race condition: when the program's outcome changes as the threads are scheduled differently.
To control race conditions:
– Use synchronization to protect data conflicts.
Synchronization is expensive, so:
– Change how data is accessed to minimize the need for synchronization.
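As a minimal illustration of these two points (a C sketch, not taken from the tutorial): protecting every shared update with synchronization is correct but costly, while restructuring the access, here with a reduction, removes most of the synchronization:

  #include <stdio.h>

  int main(void)
  {
      double sum1 = 0.0, sum2 = 0.0;

      /* Correct but slow: every update of the shared variable is synchronized. */
      #pragma omp parallel for
      for (int i = 1; i <= 1000000; i++) {
          #pragma omp critical
          sum1 += 1.0 / i;
      }

      /* Better: change how the data is accessed. Each thread accumulates a
         private partial sum; the partial sums are combined once at the end. */
      #pragma omp parallel for reduction(+:sum2)
      for (int i = 1; i <= 1000000; i++)
          sum2 += 1.0 / i;

      printf("%f %f\n", sum1, sum2);
      return 0;
  }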


Advanced OpenMP, SC'2000 9

Summary of OpenMP Constructs

Parallel Region
  C$omp parallel            #pragma omp parallel
Worksharing
  C$omp do                  #pragma omp for
  C$omp sections            #pragma omp sections
  C$omp single              #pragma omp single
  C$omp workshare           (Fortran only)
Data Environment
  directive: threadprivate
  clauses: shared, private, lastprivate, reduction, copyin, copyprivate
Synchronization
  directives: critical, barrier, atomic, flush, ordered, master
Runtime functions/environment variables
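For reference, a small C program (a sketch, not from the slides) that exercises several of these constructs together: a parallel region, a single directive, a worksharing loop, and a reduction clause:

  #include <stdio.h>
  #include <omp.h>

  int main(void)
  {
      double sum = 0.0;

      #pragma omp parallel                  /* parallel region */
      {
          #pragma omp single                /* executed by exactly one thread */
          printf("team size = %d\n", omp_get_num_threads());

          #pragma omp for reduction(+:sum)  /* worksharing loop + data clause */
          for (int i = 1; i <= 100; i++)
              sum += 1.0 / i;
      }

      printf("sum = %f\n", sum);
      return 0;
  }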


Advanced OpenMP, SC'2000 10

SC’2000 Tutorial Agenda

OpenMP: A Quick Recap
OpenMP Case Studies
– including performance tuning
Automatic Parallelism and Tools Support
Common Bugs in OpenMP programs
– and how to avoid them
Mixing OpenMP and MPI
The Future of OpenMP


Advanced OpenMP, SC'2000 11

Performance Tuning and Case Studies with Realistic Applications

1. Performance tuning of several benchmarks

2. Case study of a large-scale application


Advanced OpenMP, SC'2000 12

Performance Tuning Example 1: MDG

MDG: A Fortran code of the "Perfect Benchmarks". Automatic parallelization does not improve this code.

These performance improvements were achieved through manual tuning on a 4-processor Sun Ultra:

[Bar chart: speedup (scale 0 to 3.5) of the original code vs. tuning step 1 vs. tuning step 2]


Advanced OpenMP, SC'2000 13

MDG: Tuning Steps

Step 1: Parallelize the most time-consuming loop. It consumes 95% of the serial execution time. This takes:
– array privatization
– reduction parallelization

Step 2: Balance the iteration space of this loop.
– The loop is "triangular", so the default assignment of iterations to processors is unbalanced.


Advanced OpenMP, SC'2000 14

MDG Code Sample

Structure of the most time-consuming loop in MDG:

Original:
      c1 = x(1)>0
      c2 = x(1:10)>0

      DO i=1,n
        DO j=i,n
          IF (c1) THEN rl(1:100) = …
          IF (c2) THEN … = rl(1:100)
          sum(j) = sum(j) + …
        ENDDO
      ENDDO

Parallel:
      c1 = x(1)>0
      c2 = x(1:10)>0
      Allocate(xsum(1:#proc,n))

C$OMP PARALLEL DO
C$OMP+ PRIVATE (I,j,rl,id)
C$OMP+ SCHEDULE (STATIC,1)
      DO i=1,n
        id = omp_get_thread_num()
        DO j=i,n
          IF (c1) THEN rl(1:100) = …
          IF (c2) THEN … = rl(1:100)
          xsum(id,j) = xsum(id,j) + …
        ENDDO
      ENDDO

C$OMP PARALLEL DO
      DO i=1,n
        sum(i) = sum(i) + xsum(1:#proc,i)
      ENDDO
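Viewed from C (a sketch with illustrative names, not MDG code), the transformation combines the same two ideas: each thread accumulates into its own row of a scratch array, and schedule(static,1) deals the iterations of the triangular loop out cyclically so that the work is balanced:

  #include <stdlib.h>
  #include <omp.h>

  void triangular_sum(int n, const double *a, double *sum)
  {
      int nthreads = omp_get_max_threads();
      /* one row of partial sums per thread (the xsum array of the slide) */
      double *xsum = calloc((size_t)nthreads * n, sizeof *xsum);

      #pragma omp parallel
      {
          int id = omp_get_thread_num();
          /* schedule(static,1) deals iterations out cyclically, which
             balances the triangular i/j loop nest across the threads */
          #pragma omp for schedule(static,1)
          for (int i = 0; i < n; i++)
              for (int j = i; j < n; j++)
                  xsum[(size_t)id * n + j] += a[i] * a[j];   /* stand-in body */
      }

      /* combine the per-thread rows: the hand-coded array reduction */
      #pragma omp parallel for
      for (int j = 0; j < n; j++)
          for (int t = 0; t < nthreads; t++)
              sum[j] += xsum[(size_t)t * n + j];

      free(xsum);
  }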


Advanced OpenMP, SC'2000 15

Performance Tuning Example 2: ARC2D

ARC2D: A Fortran code of the "Perfect Benchmarks".

ARC2D is parallelized very well by available compilers. However, the mapping of the code to the machine could be improved.

[Bar chart: speedup (scale 0 to 6) of the original compiler-parallelized code vs. the locality-tuned version vs. the granularity-tuned version]


Advanced OpenMP, SC'2000 16

ARC2D: Tuning Steps

Step 1: Loop interchanging increases cache locality through stride-1 references

Step 2: Move parallel loops to outer positions

Step 3: Move synchronization points outward

Step 4: Coalesce loops


Advanced OpenMP, SC'2000 17

ARC2D: Code Samples

Loop interchanging increases cache locality.

Inner loop over k (non-stride-1 references):
!$OMP PARALLEL DO
!$OMP+PRIVATE(R1,R2,K,J)
      DO j = jlow, jup
        DO k = 2, kmax-1
          r1 = prss(jminu(j), k) + prss(jplus(j), k) + (-2.)*prss(j, k)
          r2 = prss(jminu(j), k) + prss(jplus(j), k) + 2.*prss(j, k)
          coef(j, k) = ABS(r1/r2)
        ENDDO
      ENDDO
!$OMP END PARALLEL DO

Inner loop over j (stride-1 references):
!$OMP PARALLEL DO
!$OMP+PRIVATE(R1,R2,K,J)
      DO k = 2, kmax-1
        DO j = jlow, jup
          r1 = prss(jminu(j), k) + prss(jplus(j), k) + (-2.)*prss(j, k)
          r2 = prss(jminu(j), k) + prss(jplus(j), k) + 2.*prss(j, k)
          coef(j, k) = ABS(r1/r2)
        ENDDO
      ENDDO
!$OMP END PARALLEL DO


Advanced OpenMP, SC'2000 18

ARC2D: Code Samples

Increasing parallel loop granularity through the NOWAIT clause:

!$OMP PARALLEL
!$OMP+PRIVATE(LDI,LD2,LD1,J,LD,K)
      DO k = 2+2, ku-2, 1
!$OMP DO
        DO j = jl, ju
          ld2 = a(j, k)
          ld1 = b(j, k)+(-x(j, k-2))*ld2
          ld  = c(j, k)+(-x(j, k-1))*ld1+(-y(j, k-1))*ld2
          ldi = 1./ld
          f(j, k, 1) = ldi*(f(j, k, 1)+(-f(j, k-2, 1))*ld2+(-f(j, k-1, 1))*ld1)
          f(j, k, 2) = ldi*(f(j, k, 2)+(-f(j, k-2, 2))*ld2+(-f(j, k-1, 2))*ld1)
          x(j, k) = ldi*(d(j, k)+(-y(j, k-1))*ld1)
          y(j, k) = e(j, k)*ldi
        ENDDO
!$OMP END DO NOWAIT
      ENDDO
!$OMP END PARALLEL


Advanced OpenMP, SC'2000 19

ARC2D: Code Samples

Increasing parallel loop granularity through loop coalescing:

Original:
!$OMP PARALLEL DO
!$OMP+PRIVATE(n,k,j)
      DO n = 1, 4
        DO k = 2, kmax-1
          DO j = jlow, jup
            q(j, k, n) = q(j, k, n)+s(j, k, n)
            s(j, k, n) = s(j, k, n)*phic
          ENDDO
        ENDDO
      ENDDO
!$OMP END PARALLEL DO

Coalesced:
!$OMP PARALLEL DO
!$OMP+PRIVATE(nk,n,k,j)
      DO nk = 0, 4*(kmax-2)-1
        n = nk/(kmax-2) + 1
        k = MOD(nk,kmax-2)+2
        DO j = jlow, jup
          q(j, k, n) = q(j, k, n)+s(j, k, n)
          s(j, k, n) = s(j, k, n)*phic
        ENDDO
      ENDDO
!$OMP END PARALLEL DO
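The same coalescing idea expressed in C (a sketch; the array names and flattened layout are illustrative): the n and k loops are fused into a single loop over nk so that all 4*(kmax-2) iterations, rather than only 4, are distributed among the threads:

  #include <stddef.h>

  /* q and s are flattened [4][kmax][jmax] arrays; all names are illustrative. */
  void scale_add(int kmax, int jmax, int jlow, int jup,
                 double *q, double *s, double phic)
  {
      #pragma omp parallel for
      for (int nk = 0; nk < 4 * (kmax - 2); nk++) {
          int n = nk / (kmax - 2);          /* recover the original n index */
          int k = nk % (kmax - 2) + 1;      /* ...and the original k index  */
          for (int j = jlow; j <= jup; j++) {
              size_t idx = ((size_t)n * kmax + k) * jmax + j;
              q[idx] += s[idx];
              s[idx] *= phic;
          }
      }
  }

Later OpenMP versions added a collapse clause that performs this coalescing automatically; it was not available at the time of this tutorial, so the index recovery is done by hand here.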


Advanced OpenMP, SC'2000 20

Performance Tuning Example 3: EQUAKE

EQUAKE: A C code of the new SPEC OpenMP benchmarks.

EQUAKE is hand-parallelized with relatively few code modifications. It achieves excellent speedup.

[Bar chart: speedup (scale 0 to 8) of the original sequential code vs. the initial OpenMP version vs. the version with improved memory allocation]


Advanced OpenMP, SC'2000 21

EQUAKE: Tuning Steps

Step 1: Parallelize the four most time-consuming loops
– inserted OpenMP pragmas for parallel loops and private data
– array reduction transformation
Step 2: A change in memory allocation


Advanced OpenMP, SC'2000 22

EQUAKE Code Samples

  /* malloc w1[numthreads][ARCHnodes][3] */

  #pragma omp parallel for
    for (j = 0; j < numthreads; j++)
      for (i = 0; i < nodes; i++) {
        w1[j][i][0] = 0.0; ...;
      }

  #pragma omp parallel private(my_cpu_id,exp,...)
  {
    my_cpu_id = omp_get_thread_num();
  #pragma omp for
    for (i = 0; i < nodes; i++)
      while (...) {
        ...
        exp = loop-local computation;
        w1[my_cpu_id][...][1] += exp;
        ...
      }
  }

  #pragma omp parallel for
    for (j = 0; j < numthreads; j++) {
      for (i = 0; i < nodes; i++) {
        w[i][0] += w1[j][i][0]; ...;
      }
    }


Advanced OpenMP, SC'2000 23

What Tools Did We Use for Performance Analysis and Tuning?

Compilers
– The starting point for our performance tuning of Fortran codes was always the compiler-parallelized program.
– It reports: parallelized loops, data dependences.
Subroutine and loop profilers
– Focusing attention on the most time-consuming loops is absolutely essential.
Performance tables
– Typically comparing performance differences at the loop level.


Advanced OpenMP, SC'2000 24

Guidelines for Fixing "Performance Bugs"

The methodology that worked for us:
– Use compiler-parallelized code as a starting point
– Get a loop profile and a compiler listing
– Inspect time-consuming loops (biggest potential for improvement)
  – Case 1. Check for parallelism where the compiler could not find it
  – Case 2. Improve parallel loops where the speedup is limited


Advanced OpenMP, SC'2000 25

Performance Tuning

Case 1: if the loop is not parallelized automatically, do this:

Check for parallelism:
– read the compiler explanation
– a variable may be independent even if the compiler detects dependences (compilers are conservative)
– check if a conflicting array is privatizable (compilers don't perform array privatization well)

If you find parallelism, add OpenMP parallel directives, or make the information explicit for the parallelizer.


Advanced OpenMP, SC'2000 26

Performance Tuning

Case 2: if the loop is parallel but does not perform well, consider several optimization factors:

[Diagram: serial program vs. parallel program on an SMP (CPUs sharing memory); parallelization overhead and spreading overhead]

High overheads are caused by:
• parallel startup cost
• small loops
• additional parallel code
• over-optimized inner loops
• less optimization for parallel code
• load imbalance
• synchronized sections
• non-stride-1 references
• many shared references
• low cache affinity


Advanced OpenMP, SC'2000 27

Case Study of a Large-Scale Application

Converting a Seismic Processing Application to OpenMP
– Overview of the Application
– Basic use of OpenMP
– OpenMP Issues Encountered
– Performance Results


Advanced OpenMP, SC'2000 28

Overview of Seismic

Representative of modern seismic processing programs used in the search for oil and gas.
20,000 lines of Fortran. C subroutines interface with the operating system.
Available in a serial and a parallel variant. The parallel code is available in a message-passing and an OpenMP form.
Part of the SPEChpc benchmark suite. Includes 4 data sets: small to x-large.


Advanced OpenMP, SC'2000 29

Seismic: Basic Characteristics

Program structure: 240 Fortran and 119 C subroutines.
Main algorithms: FFT, finite difference solvers
Running time of Seismic (at 500 MFlops):
– small data set: 0.1 hours
– x-large data set: 48 hours
IO requirement:
– small data set: 110 MB
– x-large data set: 93 GB


Advanced OpenMP, SC'2000 30

Basic OpenMP Use: Parallelization Scheme

Split the work into p parallel tasks (p = number of processors).

      Program Seismic
c     initialization
C$OMP PARALLEL
      call main_subroutine()
C$OMP END PARALLEL

– initialization done by the master processor only
– main computation enclosed in one large parallel region
– SPMD execution scheme
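A minimal C sketch of the same SPMD scheme (illustrative, not Seismic code): initialization outside, then one large parallel region in which each thread identifies itself and works on its own block:

  #include <stdio.h>
  #include <omp.h>

  void main_subroutine(void)
  {
      int id = omp_get_thread_num();
      int nthreads = omp_get_num_threads();
      /* each thread works on its own block of the decomposed problem */
      printf("thread %d of %d working on its block\n", id, nthreads);
  }

  int main(void)
  {
      /* initialization done by the master thread only */
      #pragma omp parallel      /* main computation in one large parallel region */
      main_subroutine();
      return 0;
  }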


Advanced OpenMP, SC'2000 31

Basic OpenMP Use: Data Privatization

Most data structures are private, i.e., each thread has its own copy.

Syntactic forms:

      Program Seismic
      ...
C$OMP PARALLEL
C$OMP+PRIVATE(a)
      a = "local computation"
      call x()
C$OMP END PARALLEL

      Subroutine x()
      common /cc/ d
c$omp threadprivate (/cc/)
      real b(100)
      ...
      b() = "local computation"
      d = "local computation"
      ...


Advanced OpenMP, SC'2000 32

Basic OpenMP Use: Synchronization and Communication

[Diagram: threads alternate compute and communicate phases]

Communicate phase:
  copy to shared buffer;
  barrier synchronization;
  copy from shared buffer;

The copy-synchronize scheme corresponds to message send-receive operations in MPI programs.
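A C sketch of the copy-synchronize idea (illustrative names, not the actual Seismic utility routines): each thread writes its boundary value to a shared buffer, all threads synchronize at a barrier, and each thread then reads its neighbor's value:

  #include <omp.h>

  double shared_buf[8];        /* one slot per thread; assumes at most 8 threads */

  /* Called by every thread inside the parallel region. */
  void exchange(double *my_boundary)
  {
      int id = omp_get_thread_num();
      int nthreads = omp_get_num_threads();

      shared_buf[id] = *my_boundary;                   /* copy to shared buffer   */
      #pragma omp barrier                              /* barrier synchronization */
      *my_boundary = shared_buf[(id + 1) % nthreads];  /* copy from shared buffer */
      #pragma omp barrier   /* don't let anyone overwrite data still being read */
  }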


Advanced OpenMP, SC'2000 33

OpenMP Issues: Mixing Fortran and C

The bulk of the computation is done in Fortran.
Utility routines are in C:
– IO operations
– data partitioning routines
– communication/synchronization operations
OpenMP-related issues:
– If a C/OpenMP compiler is not available, data privatization must be done through "expansion".
– Mixing Fortran and C is implementation dependent.

Data privatization in OpenMP/C:
  float item;
  #pragma omp threadprivate(item)
  void x() {
    ... = item;
  }

Data expansion in the absence of an OpenMP/C compiler:
  float item[num_proc];
  void x() {
    int thread;
    thread = omp_get_thread_num_();
    ... = item[thread];
  }


Advanced OpenMP, SC'2000 34

OpenMP Issues: Broadcast Common Blocks

      common /cc/ cdata
      common /dd/ ddata
c     initialization
      cdata = ...
      ddata = ...

C$OMP PARALLEL
C$OMP+COPYIN(/cc/, /dd/)
      call main_subroutine()
C$OMP END PARALLEL

Issues in Seismic:
– At the start of the parallel region it is not yet known which common blocks need to be copied in.
Solution:
– copy in all common blocks => overhead


Advanced OpenMP, SC'2000 35

OpenMP Issues: Multithreading IO and malloc

IO routines and memory allocation are called within parallel threads, inside C utility routines.
OpenMP requires all standard libraries and intrinsics to be thread-safe. However, the implementations are not always compliant.
=> system-dependent solutions need to be found

The same issue arises if standard C routines are called inside a parallel Fortran region or in non-standard syntax. Standard C compilers do not know anything about OpenMP and the thread-safety requirement.


Advanced OpenMP, SC'2000 36

OpenMP Issues: Processor Affinity

OpenMP currently does not specify or provide constructs for controlling the binding of threads to processors.
Threads can migrate between processors, causing overhead. This behavior is system-dependent.
System-dependent solutions may be available.

[Diagram: tasks 1-4 of a parallel region running on processors 1-4; tasks may migrate as a result of an OS event]


Advanced OpenMP, SC'2000 37

Performance Results

Speedups of Seismic on an SGI Challenge system:

[Charts: small and medium data sets; speedup (scale 0 to 7) on 1, 2, 4, and 8 processors, comparing the message-passing (PVM/MPI) and OpenMP versions]


Advanced OpenMP, SC'2000 38

SC’2000 Tutorial Agenda

OpenMP: A Quick Recap
OpenMP Case Studies
– including performance tuning
Automatic Parallelism and Tools Support
Common Bugs in OpenMP programs
– and how to avoid them
Mixing OpenMP and MPI
The Future of OpenMP


Advanced OpenMP, SC'2000 39

Generating OpenMP Programs Automatically

[Diagram: a source program becomes an OpenMP program either when the user inserts directives or when a parallelizing compiler inserts them; the user then tunes the program]

Source-to-source restructurers:
• F90 to F90/OpenMP
• C to C/OpenMP

Examples:
• SGI F77 compiler (-apo -mplist option)
• Polaris compiler


Advanced OpenMP, SC'2000 40

The Basics About Parallelizing Compilers

Loops are the primary source of parallelism in scientific and engineering applications.
Compilers detect loops that have independent iterations.

      DO I=1,N
        A(expression1) = …
        … = A(expression2)
      ENDDO

The loop is independent if, for different iterations, expression1 is always different from expression2.


Advanced OpenMP, SC'2000 41

Basic Program Transformations

Data privatization:

Original:
      DO i=1,n
        work(1:n) = ….
        . . .
        … = work(1:n)
      ENDDO

Transformed:
C$OMP PARALLEL DO
C$OMP+ PRIVATE (work)
      DO i=1,n
        work(1:n) = ….
        . . .
        … = work(1:n)
      ENDDO

Each processor is given a separate version of the private data, so there is no sharing conflict.


Advanced OpenMP, SC'2000 42

Basic Program Transformations

Reduction recognition:

Original:
      DO i=1,n
        ...
        sum = sum + a(i)
        …
      ENDDO

Transformed:
C$OMP PARALLEL DO
C$OMP+ REDUCTION (+:sum)
      DO i=1,n
        ...
        sum = sum + a(i)
        …
      ENDDO

Each processor will accumulate partial sums, followed by a combination of these parts at the end of the loop.


Advanced OpenMP, SC'2000 43

Basic Program Transformations

Induction variable substitution:

Original:
      i1 = 0
      i2 = 0
      DO i =1,n
        i1 = i1 + 1
        B(i1) = ...
        i2 = i2 + i
        A(i2) = …
      ENDDO

Transformed:
C$OMP PARALLEL DO
      DO i =1,n
        B(i) = ...
        A((i**2 + i)/2) = …
      ENDDO

The original loop contains data dependences: each processor modifies the shared variables i1 and i2.
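The same substitution expressed in C (a sketch; the function and array names are illustrative, and array sizes are the caller's responsibility):

  /* B needs at least n+1 elements, A at least (n*n + n)/2 + 1 elements. */
  void fill(int n, double *B, double *A)
  {
      /* i1 and i2 of the original loop are replaced by closed-form
         expressions in i, so the iterations are independent. */
      #pragma omp parallel for
      for (int i = 1; i <= n; i++) {
          B[i] = 1.0;                     /* was B(i1) with i1 = i1 + 1 */
          A[(i * i + i) / 2] = 2.0;       /* was A(i2) with i2 = i2 + i */
      }
  }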


Advanced OpenMP, SC'2000 44

Compiler Options

Examples of options from the KAP parallelizing compiler (KAP includes some 60 options):

optimization levels
– optimize: simple analysis, advanced analysis, loop interchanging, array expansion
– aggressive: pad common blocks, adjust data layout
subroutine inline expansion
– inline all, specific routines, how to deal with libraries
try specific optimizations
– e.g., recurrence and reduction recognition, loop fusion
(These transformations may degrade performance.)


Advanced OpenMP, SC'2000 45

More About Compiler Options

Limits on the amount of optimization:
– e.g., size of optimization data structures, number of optimization variants tried
Assumptions made by the compiler:
– e.g., array bounds are not violated, arrays are not aliased
Machine parameters:
– e.g., cache size, line size, mapping
Listing control

Note: compiler options can be a substitute for advanced compiler strategies. If the compiler has limited information, the user can help out.


Advanced OpenMP, SC'2000 46

Inspecting the Translated Program

Source-to-source restructurers:
– the transformed source code is the actual output
– Example: KAP
Code-generating compilers:
– typically have an option for viewing the translated (parallel) code
– Example: SGI f77 -apo -mplist

This can be the starting point for code tuning.


Advanced OpenMP, SC'2000 47

Compiler Listing

The listing gives many useful clues for improving the performance:
– Loop optimization tables
– Reports about data dependences
– Explanations about applied transformations
– The annotated, transformed code
– Calling tree
– Performance statistics

The types of reports to be included in the listing can be set through compiler options.


Advanced OpenMP, SC'2000 48

Performance of Parallelizing Compilers

[Chart: speedup (scale 0 to 5) on a 5-processor Sun Ultra SMP for the benchmarks ARC2D, BDNA, FLO52Q, HYDRO2D, MDG, SWIM, TOMCATV, and TRFD, comparing the native parallelizer, Polaris generating native directives, and Polaris generating OpenMP]


Advanced OpenMP, SC'2000 49

Tuning Automatically-Parallelized Code

This task is similar to explicit parallel programming.

Two important differences:
– The compiler gives hints in its listing, which may tell you where to focus attention. E.g., which variables have data dependences.
– You don't need to perform all transformations by hand. If you expose the right information to the compiler, it will do the translation for you. (E.g., C$assert independent)


Advanced OpenMP, SC'2000 50

Why Tune Automatically-Parallelized Code?

Hand improvements can pay off because:
– compiler techniques are limited
  E.g., array reductions are parallelized by only a few compilers
– compilers may have insufficient information
  E.g., the loop iteration range may be input data; variables may be defined in other subroutines (no interprocedural analysis)


Advanced OpenMP, SC'2000 51

Performance Tuning Tools

[Diagram: the same workflow as before (the user or a parallelizing compiler inserts directives into the program; the user then tunes it); we need tool support]


Advanced OpenMP, SC'2000 52

Profiling Tools

Timing profiles (subroutine or loop level)
– show the most time-consuming program sections
Cache profiles
– point out memory/cache performance problems
Data-reference and transfer volumes
– show performance-critical program properties
Input/output activities
– point out possible I/O bottlenecks
Hardware counter profiles
– large number of processor statistics


Advanced OpenMP, SC'2000 53

KAI GuideView: Performance Analysis

Speedup curves
– Amdahl's Law vs. actual times
Whole-program time breakdown
– Productive work vs. parallel overheads
Compare several runs
– Scaling processors
Breakdown by section
– Parallel regions, barrier sections, serial sections
Breakdown by thread
Breakdown of overhead
– Types of runtime calls
– Frequency and time


Advanced OpenMP, SC'2000 54

GuideView

Analyze each Parallel region

Find serial regions that are hurt by parallelism

Sort or filter regions to navigate to hotspots

www.kai.com


Advanced OpenMP, SC'2000 55

SGI SpeedShop and WorkShop

Suite of performance tools from SGI.
Measurements based on:
– pc-sampling and call-stack sampling
  – based on time [prof, gprof]
  – based on R10K/R12K hardware counters
– basic block counting [pixie]
Analysis on various domains:
– program graph, source and disassembled code
– per-thread as well as cumulative data


Advanced OpenMP, SC'2000 56

SpeedShop and WorkShop

Address the following performance issues:
Load imbalance
– Call stack sampling based on time (gprof)
Synchronization overhead
– Call stack sampling based on time (gprof)
– Call stack sampling based on hardware counters
Memory hierarchy performance
– Call stack sampling based on hardware counters


Advanced OpenMP, SC'2000 57

WorkShop: Call Graph View


Advanced OpenMP, SC'2000 58

WorkShop: Source View


Advanced OpenMP, SC'2000 59

Purdue Ursa Minor/Major

Integrated environment for compilation and performance analysis/tuning

Provides browsers for many sources of information:

call graphs, source and transformed program, compilation reports, timing data, parallelism estimation, data reference patterns, performance advice, etc.

www.ecn.purdue.edu/ParaMount/UM/


Advanced OpenMP, SC'2000 60

Ursa Minor/Major

Performance Spreadsheet

Program Structure View


Advanced OpenMP, SC'2000 61

TAU Tuning Analysis Utilities

Performance analysis environment for C++, Java, C, Fortran 90, HPF, and HPC++:
– compilation facilitator
– call graph browser
– source code browser
– profile browsers
– speedup extrapolation
www.cs.uoregon.edu/research/paracomp/tau/


Advanced OpenMP, SC'2000 62

TAU Tuning Analysis Utilities


Advanced OpenMP, SC'2000 63

SC’2000 Tutorial Agenda

OpenMP: A Quick Recap
OpenMP Case Studies
– including performance tuning
Automatic Parallelism and Tools Support
Common Bugs in OpenMP programs
– and how to avoid them
Mixing OpenMP and MPI
The Future of OpenMP


Advanced OpenMP, SC'2000 64

SMP Programming Errors

Shared memory parallel programming is a mixed bag:

It saves the programmer from having to map data onto multiple processors. In this sense, it's much easier.

It opens up a range of new errors coming from unanticipated shared resource conflicts.


Advanced OpenMP, SC'2000 65

2 major SMP errors

Race Conditions
– The outcome of a program depends on the detailed timing of the threads in the team.
Deadlock
– Threads lock up waiting on a locked resource that will never become free.


Advanced OpenMP, SC'2000 66

Race Conditions

The result varies unpredictably based on the detailed order of execution for each section.
Wrong answers are produced without warning!

C$OMP PARALLEL SECTIONS
      A = B + C
C$OMP SECTION
      B = A + C
C$OMP SECTION
      C = B + A
C$OMP END PARALLEL SECTIONS


Advanced OpenMP, SC'2000 67

Race Conditions: A Complicated Solution

In this example, we choose the assignments to occur in the order A, B, C.
– ICOUNT forces this order.
– FLUSH so each thread sees updates to ICOUNT. NOTE: you need the flush on each read and each write.

      ICOUNT = 0
C$OMP PARALLEL SECTIONS
      A = B + C
      ICOUNT = 1
C$OMP FLUSH (ICOUNT)
C$OMP SECTION
 1000 CONTINUE
C$OMP FLUSH (ICOUNT)
      IF (ICOUNT .LT. 1) GO TO 1000
      B = A + C
      ICOUNT = 2
C$OMP FLUSH (ICOUNT)
C$OMP SECTION
 2000 CONTINUE
C$OMP FLUSH (ICOUNT)
      IF (ICOUNT .LT. 2) GO TO 2000
      C = B + A
C$OMP END PARALLEL SECTIONS


Advanced OpenMP, SC'2000 68

Race Conditions

The result varies unpredictably because the value of X isn't dependable until the barrier at the end of the do loop.
Wrong answers are produced without warning!
Solution: be careful when you use NOWAIT.

C$OMP PARALLEL SHARED (X)
C$OMP& PRIVATE(TMP)
      ID = OMP_GET_THREAD_NUM()
C$OMP DO REDUCTION(+:X)
      DO 100 I=1,100
        TMP = WORK(I)
        X = X + TMP
  100 CONTINUE
C$OMP END DO NOWAIT
      Y(ID) = WORK(X, ID)
C$OMP END PARALLEL
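For comparison, a C sketch of a corrected version (illustrative names, not the original Fortran): the implied barrier at the end of the worksharing loop is kept, so X is complete before any thread uses it:

  #include <omp.h>

  /* y is assumed to have one slot per thread. */
  double work_sum(const double *work, double *y, int n)
  {
      double x = 0.0;
      #pragma omp parallel
      {
          int id = omp_get_thread_num();

          #pragma omp for reduction(+:x)   /* implied barrier kept (no nowait) */
          for (int i = 0; i < n; i++)
              x += work[i];

          y[id] = x * id;                  /* x is complete and visible here */
      }
      return x;
  }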


Advanced OpenMP, SC'2000 69

Race Conditions

The result varies unpredictably because access to the shared variable TMP is not protected.
Wrong answers are produced without warning!
The user probably wanted to make TMP private.

      REAL TMP, X
C$OMP PARALLEL DO REDUCTION(+:X)
      DO 100 I=1,100
        TMP = WORK(I)
        X = X + TMP
  100 CONTINUE
C$OMP END DO
      Y(ID) = WORK(X, ID)
C$OMP END PARALLEL

I lost an afternoon to this bug last year. After spinning my wheels and insisting there was a bug in KAI's compilers, the KAI tool Assure found the problem immediately!


Advanced OpenMP, SC'2000 70

Deadlock

This example shows a race condition and a deadlock:
– If A is locked by one thread and B by another, you have deadlock.
– If the same thread gets both locks, you get a race condition, i.e. different behavior depending on the detailed interleaving of the threads.
Avoid nesting different locks.

      CALL OMP_INIT_LOCK (LCKA)
      CALL OMP_INIT_LOCK (LCKB)
C$OMP PARALLEL SECTIONS
C$OMP SECTION
      CALL OMP_SET_LOCK(LCKA)
      CALL OMP_SET_LOCK(LCKB)
      CALL USE_A_and_B (RES)
      CALL OMP_UNSET_LOCK(LCKB)
      CALL OMP_UNSET_LOCK(LCKA)
C$OMP SECTION
      CALL OMP_SET_LOCK(LCKB)
      CALL OMP_SET_LOCK(LCKA)
      CALL USE_B_and_A (RES)
      CALL OMP_UNSET_LOCK(LCKA)
      CALL OMP_UNSET_LOCK(LCKB)
C$OMP END SECTIONS
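The usual cure, sketched here in C with the OpenMP lock API (the function names are illustrative; the lock calls are the real OpenMP routines): every thread acquires the two locks in the same order, so a circular wait cannot occur:

  #include <omp.h>

  omp_lock_t lcka, lckb;

  void init_locks(void)          /* call once, before the parallel region */
  {
      omp_init_lock(&lcka);
      omp_init_lock(&lckb);
  }

  void use_both(void)
  {
      /* Every caller takes lcka before lckb, so a circular wait cannot form. */
      omp_set_lock(&lcka);
      omp_set_lock(&lckb);
      /* ... use A and B ... */
      omp_unset_lock(&lckb);
      omp_unset_lock(&lcka);
  }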


Advanced OpenMP, SC'2000 71

Deadlock

This example shows a race condition and a deadlock:
– If A is locked in the first section and the IF statement branches around the unset lock, threads running the other section deadlock waiting for the lock to be released.
Make sure you release your locks.

      CALL OMP_INIT_LOCK (LCKA)
C$OMP PARALLEL SECTIONS
C$OMP SECTION
      CALL OMP_SET_LOCK(LCKA)
      IVAL = DOWORK()
      IF (IVAL .EQ. TOL) THEN
        CALL OMP_UNSET_LOCK (LCKA)
      ELSE
        CALL ERROR (IVAL)
      ENDIF
C$OMP SECTION
      CALL OMP_SET_LOCK(LCKA)
      CALL USE_B_and_A (RES)
      CALL OMP_UNSET_LOCK(LCKA)
C$OMP END SECTIONS


Advanced OpenMP, SC'2000 72

OpenMP death-traps

– Are you using thread-safe libraries?
– I/O inside a parallel region can interleave unpredictably.
– Make sure you understand what your constructors are doing with private objects.
– Private variables can mask globals.
– Understand when shared memory is coherent. When in doubt, use FLUSH.
– NOWAIT removes implied barriers.


Advanced OpenMP, SC'2000 73

Navigating through the Danger Zones

Option 1: Analyze your code to make sure every semantically permitted interleaving of the threads yields the correct results.

This can be prohibitively difficult due to the explosion of possible interleavings.

Tools like KAI’s Assure can help.


Advanced OpenMP, SC'2000 74

Navigating through the Danger Zones

Option 2: Write SMP code that is portable and equivalent to the sequential form.
– Use a safe subset of OpenMP.
– Follow a set of "rules" for Sequential Equivalence.


Advanced OpenMP, SC'2000 75

Portable Sequential Equivalence

What is Portable Sequential Equivalence (PSE)?

– A program is sequentially equivalent if its results are the same with one thread and many threads.

– For a program to be portable (i.e. runs the same on different platforms/compilers) it must execute identically when the OpenMP constructs are used or ignored.


Advanced OpenMP, SC'2000 76

Portable Sequential Equivalence

Advantages of PSE
– A PSE program can run on a wide range of hardware and with different compilers, which minimizes software development costs.
– A PSE program can be tested and debugged in serial mode with off-the-shelf tools, even if they don't support OpenMP.


Advanced OpenMP, SC'2000 77

2 Forms of Sequential Equivalence

Two forms of sequential equivalence, based on what you mean by the phrase "equivalent to the single-threaded execution":
– Strong SE: bitwise identical results.
– Weak SE: mathematically equivalent but, due to the quirks of floating point arithmetic, not bitwise identical.


Advanced OpenMP, SC'2000 78

Strong Sequential Equivalence: rules

Control data scope with the base language
– Avoid the data scope clauses.
– Only use private for scratch variables local to a block (e.g. temporaries or loop control variables) whose global initialization doesn't matter.
Locate all cases where a shared variable can be written by multiple threads.
– The access to the variable must be protected.
– If multiple threads combine results into a single value, enforce sequential order.
– Do not use the reduction clause.


Advanced OpenMP, SC'2000 79

Strong Sequential Equivalence: example

C$OMP PARALLEL PRIVATE(I, TMP)
C$OMP DO ORDERED
      DO 100 I=1,NDIM
        TMP = ALG_KERNEL(I)
C$OMP ORDERED
        CALL COMBINE (TMP, RES)
C$OMP END ORDERED
  100 CONTINUE
C$OMP END PARALLEL

Everything is shared except I and TMP. These can be private since they are not initialized and they are unused outside the loop.
The summation into RES occurs in the sequential order, so the result from the program is bitwise compatible with the sequential program.
Problem: this can be inefficient if threads finish in an order that's greatly different from the sequential order.


Advanced OpenMP, SC'2000 80

Weak Sequential Equivalence

For weak sequential equivalence only mathematically valid constraints are enforced.
– Floating point arithmetic is not associative and not commutative.
– In most cases, no particular grouping of floating point operations is mathematically preferred, so why take a performance hit by forcing the sequential order? In most cases, if you need a particular grouping of floating point operations, you have a bad algorithm.

How do you write a program that is portable and satisfies weak sequential equivalence? Follow the same rules as the strong case, but relax the sequential ordering constraints.


Advanced OpenMP, SC'2000 81

Weak Equivalence: Example

C$OMP PARALLEL PRIVATE(I, TMP)
C$OMP DO
      DO 100 I=1,NDIM
        TMP = ALG_KERNEL(I)
C$OMP CRITICAL
        CALL COMBINE (TMP, RES)
C$OMP END CRITICAL
  100 CONTINUE
C$OMP END PARALLEL

The summation into RES occurs one thread at a time, but in any order, so the result is not bitwise compatible with the sequential program.
Much more efficient, but some users get upset when low-order bits vary between program runs.


Advanced OpenMP, SC'2000 82

Sequential Equivalence Isn't a Silver Bullet

This program follows the weak PSE rules, but it's still wrong.
In this example, RAND() may not be thread safe. Even if it is, the pseudo-random sequences might overlap, thereby throwing off the basic statistics.

C$OMP PARALLEL
C$OMP& PRIVATE(I, ID, TMP, RVAL)
      ID = OMP_GET_THREAD_NUM()
      N = OMP_GET_NUM_THREADS()
      RVAL = RAND ( ID )
C$OMP DO
      DO 100 I=1,NDIM
        RVAL = RAND (RVAL)
        TMP = RAND_ALG_KERNEL(RVAL)
C$OMP CRITICAL
        CALL COMBINE (TMP, RES)
C$OMP END CRITICAL
  100 CONTINUE
C$OMP END PARALLEL


Advanced OpenMP, SC'2000 83

SC’2000 Tutorial Agenda

OpenMP: A Quick Recap
OpenMP Case Studies
– including performance tuning
Automatic Parallelism and Tools Support
Common Bugs in OpenMP programs
– and how to avoid them
Mixing OpenMP and MPI
The Future of OpenMP


Advanced OpenMP, SC'2000

What is MPI? The Message Passing Interface

MPI was created by an international forum in the early 90's.
It is huge: the union of many good ideas about message-passing APIs.
– over 500 pages in the spec
– over 125 routines in MPI 1.1 alone
– possible to write programs using only a couple of dozen of the routines
MPI 1.1: MPIch reference implementation.
MPI 2.0: exists as a spec; full implementations?


Advanced OpenMP, SC'2000 85

How do people use MPI? The SPMD Model

[Diagram: a sequential program working on a data set becomes a parallel program working on a decomposed data set, coordinated by passing messages]

– Replicate the program.
– Add glue code.
– Break up the data.


Advanced OpenMP, SC'2000 86

Pi program in MPI

  #include <mpi.h>
  void main (int argc, char *argv[])
  {
      int i, my_id, numprocs;
      double x, pi, step, sum = 0.0;
      step = 1.0/(double) num_steps;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
      MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
      my_steps = num_steps/numprocs;
      for (i = my_id*my_steps; i < (my_id+1)*my_steps; i++) {
          x = (i+0.5)*step;
          sum += 4.0/(1.0+x*x);
      }
      sum *= step;
      MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  }


Advanced OpenMP, SC'2000 87

How do people mix MPI and OpenMP?

[Diagram: a sequential program working on a data set becomes a set of MPI processes, each holding a piece of the decomposed data and running OpenMP threads]

– Create the MPI program with its data decomposition.
– Use OpenMP inside each MPI process.


Advanced OpenMP, SC'2000 88

Pi program in MPI and OpenMP

  #include <mpi.h>
  #include "omp.h"
  void main (int argc, char *argv[])
  {
      int i, my_id, numprocs;
      double x, pi, step, sum = 0.0;
      step = 1.0/(double) num_steps;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &my_id);
      MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
      my_steps = num_steps/numprocs;
  #pragma omp parallel for private(x) reduction(+:sum)
      for (i = my_id*my_steps; i < (my_id+1)*my_steps; i++) {
          x = (i+0.5)*step;
          sum += 4.0/(1.0+x*x);
      }
      sum *= step;
      MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  }

Get the MPI part done first, then add OpenMP pragmas where it makes sense to do so.


Advanced OpenMP, SC'2000 89

Mixing OpenMP and MPI: Let the programmer beware!

– Messages are sent to a process on a system, not to a particular thread.
– Safest approach: only do MPI inside serial regions,
  … or do them inside MASTER constructs,
  … or do them inside SINGLE or CRITICAL constructs.
  – But this only works if your MPI is really thread safe!
– Environment variables are not propagated by mpirun. You'll need to broadcast OpenMP parameters and set them with the library routines (see the sketch below).
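A C sketch of that last point (the MPI and OpenMP calls are real; the default value and overall structure are illustrative): rank 0 reads the desired thread count and broadcasts it, and every rank then sets it through the runtime library instead of relying on the environment:

  #include <stdlib.h>
  #include <mpi.h>
  #include <omp.h>

  int main(int argc, char *argv[])
  {
      int nthreads = 1;
      MPI_Init(&argc, &argv);

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      if (rank == 0) {
          /* mpirun may not forward OMP_NUM_THREADS to remote processes */
          const char *s = getenv("OMP_NUM_THREADS");
          nthreads = s ? atoi(s) : 4;          /* 4 is an arbitrary default */
      }

      /* broadcast the OpenMP parameter and set it via the library routine */
      MPI_Bcast(&nthreads, 1, MPI_INT, 0, MPI_COMM_WORLD);
      omp_set_num_threads(nthreads);

      /* ... MPI + OpenMP computation ... */

      MPI_Finalize();
      return 0;
  }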


Advanced OpenMP, SC'2000 90

SC’2000 Tutorial Agenda

OpenMP: A Quick Recap
OpenMP Case Studies
– including performance tuning
Automatic Parallelism and Tools Support
Common Bugs in OpenMP programs
– and how to avoid them
Mixing OpenMP and MPI
The Future of OpenMP


Advanced OpenMP, SC'2000

OpenMP Futures: The ARB

The future of OpenMP is in the hands of the OpenMP Architecture Review Board (the ARB):
– Intel, KAI, IBM, HP, Compaq, Sun, SGI, DOE ASCI
The ARB resolves interpretation issues and manages the evolution of new OpenMP APIs.
Membership in the ARB is open to any organization with a stake in OpenMP:
– Research organizations (e.g. DOE ASCI)
– Hardware vendors (e.g. Intel or HP)
– Software vendors (e.g. KAI)


Advanced OpenMP, SC'2000 92

The Future of OpenMP

OpenMP is an evolving standard. We will see to it that it is well matched to the changing needs of the shared-memory programming community.

Here's what's coming in the future:
– OpenMP 2.0 for Fortran:
  – A major update of OpenMP for Fortran 95.
  – Status: specification released at SC'00.
– OpenMP 2.0 for C/C++:
  – Work to begin in January 2001.
  – Specification complete by SC'01.

To learn more about OpenMP 2.0, come to the OpenMP BOF on Tuesday evening.


Advanced OpenMP, SC'2000 93

Reference Material on OpenMP

OpenMP Homepage www.openmp.org: the primary source of information about OpenMP and its development.

Books:
Chandra, Rohit. Parallel Programming in OpenMP. San Francisco, Calif.: Morgan Kaufmann; London: Harcourt, 2000. ISBN 1558606718.

Research papers:
Sosa CP, Scalmani C, Gomperts R, Frisch MJ. Ab initio quantum chemistry on a ccNUMA architecture using OpenMP. III. Parallel Computing, vol.26, no.7-8, July 2000, pp.843-56. Publisher: Elsevier, Netherlands.

Bova SW, Breshears CP, Cuicchi C, Demirbilek Z, Gabb H. Nesting OpenMP in an MPI application. Proceedings of the ISCA 12th International Conference. Parallel and Distributed Systems. ISCA. 1999, pp.566-71. Cary, NC, USA.

Gonzalez M, Serra A, Martorell X, Oliver J, Ayguade E, Labarta J, Navarro N. Applying interposition techniques for performance analysis of OPENMP parallel applications. Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000. IEEE Comput. Soc. 2000, pp.235-40. Los Alamitos, CA, USA.

J. M. Bull and M. E. Kambites. JOMP: an OpenMP-like interface for Java. Proceedings of the ACM 2000 Conference on Java Grande, 2000, pp. 44-53.


Advanced OpenMP, SC'2000 94

Chapman B, Mehrotra P, Zima H. Enhancing OpenMP with features for locality control. Proceedings of Eighth ECMWF Workshop on the Use of Parallel Processors in Meteorology. Towards Teracomputing. World Scientific Publishing. 1999, pp.301-13. Singapore.

Cappello F, Richard O, Etiemble D. Performance of the NAS benchmarks on a cluster of SMP PCs using a parallelization of the MPI programs with OpenMP. Parallel Computing Technologies. 5th International Conference, PaCT-99. Proceedings (Lecture Notes in Computer Science Vol.1662). Springer-Verlag. 1999, pp.339-50. Berlin, Germany.

Couturier R, Chipot C. Parallel molecular dynamics using OPENMP on a shared memory machine. Computer Physics Communications, vol.124, no.1, Jan. 2000, pp.49-59. Publisher: Elsevier, Netherlands.

Bova SW, Breshearsz CP, Cuicchi CE, Demirbilek Z, Gabb HA. Dual-level parallel analysis of harbor wave response using MPI and OpenMP. International Journal of High Performance Computing Applications, vol.14, no.1, Spring 2000, pp.49-64. Publisher: Sage Science Press, USA.

Scherer A, Honghui Lu, Gross T, Zwaenepoel W. Transparent adaptive parallelism on NOWS using OpenMP. ACM. Sigplan Notices (Acm Special Interest Group on Programming Languages), vol.34, no.8, Aug. 1999, pp.96-106. USA.

Ayguade E, Martorell X, Labarta J, Gonzalez M, Navarro N. Exploiting multiple levels of parallelism in OpenMP: a case study. Proceedings of the 1999 International Conference on Parallel Processing. IEEE Comput. Soc. 1999, pp.172-80. Los Alamitos, CA, USA.


Advanced OpenMP, SC'2000 95

Honghui Lu, Hu YC, Zwaenepoel W. OpenMP on networks of workstations. Proceedings of ACM/IEEE SC98: 10th Anniversary. High Performance Networking and Computing Conference (Cat. No. RS00192). IEEE Comput. Soc. 1998, pp.13 pp.. Los Alamitos, CA, USA.

Throop J. OpenMP: shared-memory parallelism from the ashes. Computer, vol.32, no.5, May 1999, pp.108-9. Publisher: IEEE Comput. Soc, USA.

Hu YC, Honghui Lu, Cox AL, Zwaenepoel W. OpenMP for networks of SMPs. Proceedings 13th International Parallel Processing Symposium and 10th Symposium on Parallel and Distributed Processing. IPPS/SPDP 1999. IEEE Comput. Soc. 1999, pp.302-10. Los Alamitos, CA, USA.

Parallel Programming with Message Passing and Directives; Steve W. Bova, Clay P. Breshears, Henry Gabb, Rudolf Eigenmann, Greg Gaertner, Bob Kuhn, Bill Magro, Stefano Salvini; SIAM News, Volume 32, No 9, Nov. 1999.

Still CH, Langer SH, Alley WE, Zimmerman GB. Shared memory programming with OpenMP. Computers in Physics, vol.12, no.6, Nov.-Dec. 1998, pp.577-84. Publisher: AIP, USA.

Chapman B, Mehrotra P. OpenMP and HPF: integrating two paradigms. [Conference Paper] Euro-Par'98 Parallel Processing. 4th International Euro-Par Conference. Proceedings. Springer-Verlag. 1998, pp.650-8. Berlin, Germany.

Dagum L, Menon R. OpenMP: an industry standard API for shared-memory programming. IEEE Computational Science & Engineering, vol.5, no.1, Jan.-March 1998, pp.46-55. Publisher: IEEE, USA.

Clark D. OpenMP: a parallel standard for the masses. IEEE Concurrency, vol.6, no.1, Jan.-March 1998, pp.10-12. Publisher: IEEE, USA.