Linux Clusters Institute: Introduction to parallel programming
Part 1: OpenMP
Dr. Alexei Kotelnikov, Engineering Computing Services (ECS), School of Engineering, Rutgers University
This document is a result of work by volunteer LCI instructors and is licensed under CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/).
Outline of the presentation
• Parallel computing overview
• OpenMP threading model
• OpenMP directives
• Basic techniques in OpenMP: loop parallelization, parallel sections, synchronization constructs
• Lab exercises with OpenMP
Parallel vs Serial computing
Serial: computational tasks are processed in sequential order, essentially on one CPU core.
Parallel: computational tasks are processed concurrently on different CPUs or cores.
The tasks can be different executable codes, code blocks, loops, subroutines, I/O calls, etc.
Parallel computing paradigms
• Shared memory: all CPUs or cores have access to the same global memory address space
Each CPU has its own cores and its own L1, L2, and L3 caches.
• Programming approach: multi-threading with POSIX threads (pthreads) and OpenMP
Parallel computing paradigms
• Distributed memory: the CPUs have access only to their own memory space. Data is exchanged via messages passed over the network.
The interconnect can vary: Ethernet, InfiniBand, Omni-Path.
• Programming approach: Message Passing Interface (MPI)
Parallel computing paradigms
• Hardware acceleration: some tasks are offloaded from CPU onto a multi-core GPU device. Data is transferred over a PCIe bus between the RAM and GPU.
• Programming approach: CUDA, OpenCL, OpenACC. The latest OpenMP, 5.0, supports offloading to accelerator devices.
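Although device offload is beyond the scope of this part, the bullet above can be made concrete with a minimal sketch of the OpenMP target construct (illustrative only, assuming a compiler built with offload support; without a device, the region simply runs on the host):

#include <stdio.h>

#define N 1000

int main() {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) {
        a[i] = i * 1.0f;
        b[i] = i * 2.0f;
    }

    /* Map the inputs to the device, compute there, map the result back */
    #pragma omp target teams distribute parallel for map(to: a, b) map(from: c)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N-1]);
    return 0;
}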
Shared memory systems and OpenMP
• All modern desktops and laptops have multi-core CPUs with access to a common memory address space, so they are considered shared memory architectures.
• You can use OpenMP if you program in C, C++, or Fortran (77/90).
• OpenMP support is available in many modern compilers, for example GNU, Intel, PGI, and Visual C++.
• No need for a computational cluster.
• Many commercial and open source software packages are already built with OpenMP and can take advantage of multiple CPU cores.
• Can be easily applied to parallelize loops and sections in a serial code.
• Scalability is limited by the number of CPU cores in a system.
What is OpenMP
• OpenMP is an API for parallel programming on shared memory computer architectures.
• It is implemented in C/C++ and Fortran compilers through compiler directives, runtime library routines, and environment variables.
• Thread-based parallelism: multiple threads concurrently compute tasks in the same code.
• Fork-join thread dynamics: the master thread forks a team of threads in a parallel region, and the threads terminate upon exiting the parallel region.
• Defined data scoping: variables are either shared among or private to the threads.
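To make the data scoping bullet concrete, here is a minimal sketch (illustrative, not from the original slides; it also uses the firstprivate clause, which is not covered here) contrasting a private scratch variable with one whose initial value is copied into each thread:

#include <omp.h>
#include <stdio.h>

int main() {
    int tid;        /* private: each thread gets its own scratch copy */
    int base = 10;  /* firstprivate: each thread's copy starts at 10 */

    #pragma omp parallel private(tid) firstprivate(base) num_threads(2)
    {
        tid = omp_get_thread_num();  /* a private copy must be assigned before use */
        base += tid;                 /* updates stay local to each thread */
        printf("thread %d: base = %d\n", tid, base);
    }
    return 0;
}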
OpenMP
The master thread forks a team of threads in the parallel region of the code.

[Fork-join diagram: serial task → threads 0, 1, 2, 3 execute the parallel region → serial task resumes.]

Implicit barrier: the threads wait for each other to complete the parallel region, unless the nowait clause is used.
OpenMP
C/C++ general code structure with OpenMP compiler directives
#include <omp.h>

int main() {
    int var1, var2;
    float var3;

    /* Serial code */

    /* Beginning of parallel section. Fork a team of threads.
       Specify variable scoping. */
    #pragma omp parallel private(var1, var2) shared(var3)
    {
        /* Parallel section executed by all threads */

        /* All threads join master thread and disband */
    }

    /* Resume serial code */
}
OpenMP
Compile and run with 4 threads:
gcc -fopenmp testcode.c
export OMP_NUM_THREADS=4
./a.out
• The number of threads can be defined either within the code (see the sketch after this list) or by the environment variable OMP_NUM_THREADS.
• Usually, OMP_NUM_THREADS shouldn’t exceed the total number of available CPU cores.
• If OMP_NUM_THREADS is undefined, the run will utilize all the cores in the system.
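As a complement to the environment variable, here is a minimal sketch (illustrative, not from the original slides) of setting the thread count from the code, using omp_set_num_threads() and the num_threads clause:

#include <omp.h>
#include <stdio.h>

int main() {
    /* Request 4 threads for subsequent parallel regions
       (takes precedence over OMP_NUM_THREADS for this run) */
    omp_set_num_threads(4);

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            printf("team size = %d\n", omp_get_num_threads());
    }

    /* Request a team size for one region only with the num_threads clause */
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0)
            printf("team size = %d\n", omp_get_num_threads());
    }
    return 0;
}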
OpenMP
OpenMP compiler directives: pragma, constructs, clauses

#pragma omp construct [clause [clause] ...]
For example:
#pragma omp parallel shared(a,b,c) private(i)
#pragma omp sections nowait
#pragma omp section
#pragma omp for
OpenMP
The constructs specify what the threads should do, for example:
• Compute blocks/sections of code concurrently: #pragma omp section
• Compute distributed chunks of loop iterations concurrently: #pragma omp for
• Thread synchronization: #pragma omp critical

The clauses define the parameters for the constructs, for example:
• Variable scope (private to each thread): private(i)
• Variable scope (shared among the threads): shared(a,b,c)
• Scheduling directives: nowait, schedule
• Synchronization related directives: critical, reduction
OpenMP
Runtime library functions set and query parameters for the threads, for example
omp_get_num_threads – returns the number of threads in the running team
omp_get_thread_num – returns the thread number
There are over 32 runtime library functions in OpenMP 3.1:
egrep 'omp_.*\(' /usr/lib/gcc/x86_64-redhat-linux/4.8.2/include/omp.h
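Beyond the two functions above, here is a short sketch (illustrative, not from the original slides) of a few other commonly used runtime library functions, omp_get_num_procs(), omp_get_max_threads(), and omp_get_wtime():

#include <omp.h>
#include <stdio.h>

int main() {
    printf("cores available: %d\n", omp_get_num_procs());
    printf("max team size  : %d\n", omp_get_max_threads());

    double t0 = omp_get_wtime();   /* wall-clock time in seconds */
    #pragma omp parallel
    {
        /* parallel work would go here */
    }
    double t1 = omp_get_wtime();
    printf("parallel region took %f s\n", t1 - t0);
    return 0;
}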
OpenMP
C/C++ parallel region example. “Hello World” from each thread
#include <omp.h>
#include <stdio.h>

int main() {
    int nthreads, tid;

    /* Fork a team of threads with each thread having a private tid variable */
    #pragma omp parallel private(tid)
    {
        /* Obtain and print thread id */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        /* Only master thread does this */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master thread and terminate */
}
OpenMP
Compile and run with 4 threads:
gcc -fopenmp hello.c
export OMP_NUM_THREADS=4
./a.out
Hello World from thread = 3
Hello World from thread = 2
Hello World from thread = 1
Hello World from thread = 0
Number of threads = 4
OpenMP
A simple work sharing construct: parallel loops
/* Fork a team of threads with private i, tid variables */
#pragma omp parallel private(i, tid)
{
    same_work();          // computed by every thread
    /* Compute the for loop iterations in parallel */
    #pragma omp for
    for (i = 0; i <= 4*N-1; i++) {
        thread(i);        // should be a thread-safe function: the final results
    }                     // are independent of the order of the loop iterations
}
[Diagram: with OMP_NUM_THREADS=4, each thread (tid = 0..3) first executes same_work(); then #pragma omp parallel / #pragma omp for assigns each a chunk of iterations: thread 0 computes thread(i) for i = 0..N-1, thread 1 for N..2N-1, thread 2 for 2N..3N-1, thread 3 for 3N..4N-1.]
OpenMP
The for loop construct
When used inside a parallel region, the for construct specifies that the iterations of the loop are executed in parallel by the team.
#pragma omp for [clause ...]
    schedule (type [,chunk])    // how the loop iterations are assigned to the threads
    private (list)              // variables with scope private to each thread
    shared (list)               // variables with scope shared among the threads
    reduction (operator: list)  // a reduction with the given operator is applied to the listed variables
    collapse (n)                // for nested loop parallelization
    nowait                      // if set, the threads do not synchronize at the end of the parallel loop

There are other possible clauses in the for construct not discussed here.
OpenMP
The for loop schedule clause
Describes how the iterations of the loop are divided among the threads in the team.
STATIC: loop iterations are divided into pieces of size chunk and then statically assigned to the threads.
DYNAMIC: loop iterations are divided into pieces of size chunk and dynamically scheduled among the threads; when a thread finishes one chunk, it is dynamically assigned another. The default chunk size is 1.
GUIDED: similar to DYNAMIC, except that the chunk size decreases each time a parcel of work is given to a thread.
RUNTIME: the schedule is defined by the environment variable OMP_SCHEDULE.
AUTO: the scheduling decision is delegated to the compiler and/or runtime system.
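For instance, here is a small sketch (illustrative; the file names loop.c/loop.x are hypothetical) of selecting the schedule at run time: compile the loop with schedule(runtime) and choose the policy through OMP_SCHEDULE.

#include <omp.h>
#include <stdio.h>

#define N 16

int main() {
    /* The schedule type and chunk size are read from OMP_SCHEDULE at run time */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < N; i++)
        printf("thread %d got i = %d\n", omp_get_thread_num(), i);
    return 0;
}

gcc -fopenmp -o loop.x loop.c
export OMP_SCHEDULE="dynamic,2"
./loop.x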
OpenMP
The loop iterations are computed concurrently:
#include <stdio.h>
#include <omp.h>

#define CHUNKSIZE 100
#define N 1000

int main() {
    int i, chunk, tid;
    float a[N], b[N], c[N];

    /* Some initializations */
    for (i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0;
    chunk = CHUNKSIZE;

    #pragma omp parallel shared(a,b,c,chunk) private(i,tid)
    {
        #pragma omp for schedule(dynamic,chunk) nowait
        for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
            tid = omp_get_thread_num();
            printf("thread = %d, i = %d\n", tid, i);
        }
    } /* end of parallel loop */
}
OpenMP
Nested loops. By default only the outer loop is parallelized:
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        /* do task(i,j) */
    }
}
To parallelize both loops:

#pragma omp parallel for collapse(2)
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        /* do task(i,j) */
    }
}
OpenMP
Parallel sections. Each section is executed by one thread.

#pragma omp parallel shared(a,b,c,d) private(i,tid)
{
    #pragma omp sections nowait
    {
        #pragma omp section
        {
            for (i = 0; i < N; i++)
                c[i] = a[i] + b[i];
            /* Obtain and print thread id and array index number */
            tid = omp_get_thread_num();
            printf("thread = %d, i = %d\n", tid, i);
        }

        #pragma omp section
        {
            for (i = 0; i < N; i++)
                d[i] = a[i] * b[i];
            /* Obtain and print thread id and array index number */
            tid = omp_get_thread_num();
            printf("thread = %d, i = %d\n", tid, i);
        }
    } /* end of sections */
} /* end of parallel section */
OpenMP
When things become messy: several threads are updating the same variable.

#pragma omp parallel for shared(sum) private(i)
for (i = 0; i < 1000000; i++) {
    sum = sum + a[i];
}
printf("sum=%lf\n", sum);

The threads overwrite the sum variable for each other, running into a "race condition".
The problem can be solved by using critical locks and critical sections, so that only one thread at a time updates the sum.
OpenMP
Synchronization construct.
The critical section specifies a region of code that must be executed by only one thread at a time.

#pragma omp parallel for shared(sum) private(i)
for (i = 0; i < 1000000; i++) {
    #pragma omp critical
    sum = sum + a[i];
}
printf("sum=%lf\n", sum);
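For a simple update like this one, a lighter-weight alternative (not shown in the original slides) is the atomic construct, which protects only the single memory update rather than a whole code block. A self-contained sketch, with an illustrative array a[]:

#include <stdio.h>

#define N 1000000

int main() {
    static double a[N];
    double sum = 0.0;
    int i;

    for (i = 0; i < N; i++)
        a[i] = 1.0;

    #pragma omp parallel for shared(sum) private(i)
    for (i = 0; i < N; i++) {
        #pragma omp atomic
        sum = sum + a[i];   /* only this single update is protected */
    }
    printf("sum=%lf\n", sum);
    return 0;
}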
OpenMP
Reduction clause.

A private copy of the variable result is created for each thread. At the end of the reduction, the private copies are combined and the result is written to the global shared variable.

Scalar product of two vectors, A and B:
result = A[0]*B[0] + A[1]*B[1] + ... + A[n-1]*B[n-1]

Have each thread compute partial sums (result) on a chunk of data, then reduce the partial sums and write the result to the global shared variable:

#pragma omp parallel for default(shared) private(i) \
        schedule(static,chunk) reduction(+:result)
for (i = 0; i < n; i++)
    result = result + (a[i] * b[i]);
printf("Final result= %f\n", result);
OpenMP
References.
https://computing.llnl.gov/tutorials/openMP
https://www.openmp.org/wp-content/uploads/omp-hands-on-SC08.pdf
https://people.sc.fsu.edu/~jburkardt/c_src/openmp/openmp.html
https://openmp.org
Conclusion and discussion
• We have touched base with the main elements of the OpenMP compiler directives and syntax.
• More advanced topics, such as thread affinity, vector CPU architectures, and offloading to GPU accelerators, can be found in the references.
• In the next hour, we'll do a lab session with the examples we have discussed this afternoon.
• Any questions or comments?
OpenMP LAB exercises.
Create a new directory for the exercises. Download the tarball into the directory. Extract the files:

mkdir OpenMP
cd OpenMP
wget https://linuxcourse.rutgers.edu/LCI2019/OpenMP.tgz
tar -zxvf OpenMP.tgz
OpenMP LAB exercises.
Compile hello.c with the -fopenmp flag and run it across 4 threads:

gcc -fopenmp -o hello.x hello.c
export OMP_NUM_THREADS=4
./hello.x

Similarly, compile for.c with -fopenmp, run it several times, and observe how the array elements are distributed across the threads:

gcc -fopenmp -o for.x for.c
./for.x
OpenMP LAB exercises.
Compile sections.c and sum.c with the -fopenmp flag and run them across 2 threads:

gcc -fopenmp -o sections.x sections.c
export OMP_NUM_THREADS=2
./sections.x
gcc -fopenmp -o sum.x sum.c
./sum.x

Modify sum.c by removing the line with the critical construct. Recompile it and run it several times; notice the different output results:

gcc -fopenmp -o sum.x sum.c
./sum.x