Linux Clusters Institute: Introduction to parallel programming
Part 1: OpenMP
Dr. Alexei Kotelnikov, Engineering Computing Services (ECS), School of Engineering, Rutgers University
This document is a result of work by volunteer LCI instructors and is licensed under CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/).
Outline of the presentation
• Parallel computing overview
• OpenMP threading model
• OpenMP directives
• Basic techniques in OpenMP: loop parallelization, parallel sections, synchronization constructs
• Lab exercises with OpenMP
Parallel vs Serial computing
Serial: computational tasks are processed in sequential order, essentially on one CPU core.
Parallel: computational tasks are processed concurrently on different CPUs or cores.
The tasks can be different executable codes, code blocks, loops, subroutines, I/O calls, etc.
Parallel computing paradigms
• Shared memory: all CPUs or cores have access to the same global memory address space
Each CPU has its own cores and its own L1, L2, and L3 caches.
• Programming approach: multi-threading with POSIX threads (pthreads) and OpenMP
Parallel computing paradigms
• Distributed memory: the CPUs have access only to their own memory space. Data is exchanged via messages passed over the network.
The interconnect can vary: Ethernet, InfiniBand, Omni-Path.
• Programming approach: Message Passing Interface (MPI)
Parallel computing paradigms
• Hardware acceleration: some tasks are offloaded from CPU onto a multi-core GPU device. Data is transferred over a PCIe bus between the RAM and GPU.
• Programming approach: CUDA, OpenCL, OpenACC. The latest OpenMP, 5.0, supports offloading to accelerator devices.
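Although device offload is beyond the scope of this part, the bullet above can be made concrete with a minimal sketch of the OpenMP target construct (illustrative only, assuming a compiler built with offload support; without a device, the region simply runs on the host):

#include <stdio.h>

#define N 1000

int main() {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) {
        a[i] = i * 1.0f;
        b[i] = i * 2.0f;
    }

    /* Map the inputs to the device, compute there, map the result back */
    #pragma omp target teams distribute parallel for map(to: a, b) map(from: c)
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N-1]);
    return 0;
}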
Shared memory systems and OpenMP
• All modern desktops and laptops have multi-core CPUs with access to a common memory address space, so they are considered shared memory architectures.
• You can use OpenMP if you program in C, C++, or Fortran (77/90).
• OpenMP support is available in many modern compilers, for example GNU, Intel, PGI, and Visual C++.
• No need for a computational cluster.
• Many commercial and open source software packages are already built with OpenMP and can take advantage of multiple CPU cores.
• Can be easily applied to parallelize loops and sections in a serial code.
• Scalability is limited by the number of CPU cores in a system.
What is OpenMP
• OpenMP is an API for parallel programming on shared memory computer architectures.
• It is implemented in C/C++ and Fortran compilers through compiler directives, runtime library routines, and environment variables.
• Thread-based parallelism: multiple threads concurrently compute tasks in the same code.
• Fork-join thread dynamics: the master thread forks a team of threads in a parallel region, and the threads terminate upon exiting the parallel region.
• Defined data scoping: variables are either shared among or private to the threads.
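To make the data scoping bullet concrete, here is a minimal sketch (illustrative, not from the original slides; it also uses the firstprivate clause, which is not covered here) contrasting a private scratch variable with one whose initial value is copied into each thread:

#include <omp.h>
#include <stdio.h>

int main() {
    int tid;        /* private: each thread gets its own scratch copy */
    int base = 10;  /* firstprivate: each thread's copy starts at 10 */

    #pragma omp parallel private(tid) firstprivate(base) num_threads(2)
    {
        tid = omp_get_thread_num();  /* a private copy must be assigned before use */
        base += tid;                 /* updates stay local to each thread */
        printf("thread %d: base = %d\n", tid, base);
    }
    return 0;
}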
OpenMP
The master thread forks a team of threads in the parallel region of the code.

[Fork-join diagram: serial task → threads 0, 1, 2, 3 execute the parallel region → serial task resumes.]

Implicit barrier: the threads wait for each other to complete the parallel region, unless the nowait clause is used.
OpenMP
C/C++ general code structure with OpenMP compiler directives
#include <omp.h>

int main() {
    int var1, var2;
    float var3;

    /* Serial code */

    /* Beginning of parallel section. Fork a team of threads.
       Specify variable scoping. */
    #pragma omp parallel private(var1, var2) shared(var3)
    {
        /* Parallel section executed by all threads */

        /* All threads join master thread and disband */
    }

    /* Resume serial code */
}
OpenMP
Compile and run with 4 threads:
gcc -fopenmp testcode.c
export OMP_NUM_THREADS=4
./a.out
• The number of threads can be defined either within the code (see the sketch after this list) or by the environment variable OMP_NUM_THREADS.
• Usually, OMP_NUM_THREADS shouldn’t exceed the total number of available CPU cores.
• If OMP_NUM_THREADS is undefined, the run will utilize all the cores in the system.
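As a complement to the environment variable, here is a minimal sketch (illustrative, not from the original slides) of setting the thread count from the code, using omp_set_num_threads() and the num_threads clause:

#include <omp.h>
#include <stdio.h>

int main() {
    /* Request 4 threads for subsequent parallel regions
       (takes precedence over OMP_NUM_THREADS for this run) */
    omp_set_num_threads(4);

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)
            printf("team size = %d\n", omp_get_num_threads());
    }

    /* Request a team size for one region only with the num_threads clause */
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0)
            printf("team size = %d\n", omp_get_num_threads());
    }
    return 0;
}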
OpenMP
OpenMP compiler directives: pragma, constructs, clauses

#pragma omp construct [clause [clause] ...]
For example:
#pragma omp parallel shared(a,b,c) private(i)
#pragma omp sections nowait
#pragma omp section
#pragma omp for
OpenMP
The constructs specify what the threads should do, for example:
• Compute blocks/sections of code concurrently: #pragma omp section
• Compute distributed chunks of loop iterations concurrently: #pragma omp for
• Thread synchronization: #pragma omp critical

The clauses define the parameters for the constructs, for example:
• Variable scope (private to each thread): private(i)
• Variable scope (shared among the threads): shared(a,b,c)
• Scheduling directives: nowait, schedule
• Synchronization related directives: critical, reduction
OpenMP
Runtime library functions set and query parameters for the threads, for example
omp_get_num_threads – returns the number of threads in the running team
omp_get_thread_num – returns the thread number
There are over 32 runtime library functions in OpenMP 3.1:
egrep 'omp_.*\(' /usr/lib/gcc/x86_64-redhat-linux/4.8.2/include/omp.h
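Beyond the two functions above, here is a short sketch (illustrative, not from the original slides) of a few other commonly used runtime library functions, omp_get_num_procs(), omp_get_max_threads(), and omp_get_wtime():

#include <omp.h>
#include <stdio.h>

int main() {
    printf("cores available: %d\n", omp_get_num_procs());
    printf("max team size  : %d\n", omp_get_max_threads());

    double t0 = omp_get_wtime();   /* wall-clock time in seconds */
    #pragma omp parallel
    {
        /* parallel work would go here */
    }
    double t1 = omp_get_wtime();
    printf("parallel region took %f s\n", t1 - t0);
    return 0;
}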
OpenMP
C/C++ parallel region example. “Hello World” from each thread
#include <omp.h>
#include <stdio.h>

int main() {
    int nthreads, tid;

    /* Fork a team of threads with each thread having a private tid variable */
    #pragma omp parallel private(tid)
    {
        /* Obtain and print thread id */
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        /* Only master thread does this */
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    } /* All threads join master thread and terminate */
}
OpenMP
Compile and run with 4 threads:
gcc -fopenmp hello.c
export OMP_NUM_THREADS=4
./a.out
Hello World from thread = 3
Hello World from thread = 2
Hello World from thread = 1
Hello World from thread = 0
Number of threads = 4
OpenMP
A simple work sharing construct: parallel loops
/* Fork a team of threads with private i, tid variables */
#pragma omp parallel private(i, tid)
{
    same_work();          // computed by every thread
    /* Compute the for loop iterations in parallel */
    #pragma omp for
    for (i = 0; i <= 4*N-1; i++) {
        thread(i);        // should be a thread-safe function: the final results
    }                     // are independent of the order of the loop iterations
}
[Diagram: with OMP_NUM_THREADS=4, each thread (tid = 0..3) first executes same_work(); then #pragma omp parallel / #pragma omp for assigns each a chunk of iterations: thread 0 computes thread(i) for i = 0..N-1, thread 1 for N..2N-1, thread 2 for 2N..3N-1, thread 3 for 3N..4N-1.]
OpenMP
The for loop construct
When used inside a parallel region, the for construct specifies that the iterations of the loop are executed in parallel by the team.
#pragma omp for [clause ...]
    schedule (type [,chunk])    // how the loop iterations are assigned to the threads
    private (list)              // variables with scope private to each thread
    shared (list)               // variables with scope shared among the threads
    reduction (operator: list)  // a reduction with the given operator is applied to the listed variables
    collapse (n)                // for nested loop parallelization
    nowait                      // if set, the threads do not synchronize at the end of the parallel loop

There are other possible clauses in the for construct not discussed here.
OpenMP
The for loop schedule clause
Describes how the iterations of the loop are divided among the threads in the team.
STATIC: loop iterations are divided into pieces of size chunk and then statically assigned to the threads.
DYNAMIC: loop iterations are divided into pieces of size chunk and dynamically scheduled among the threads; when a thread finishes one chunk, it is dynamically assigned another. The default chunk size is 1.
GUIDED: similar to DYNAMIC, except that the chunk size decreases each time a parcel of work is given to a thread.
RUNTIME: the schedule is defined by the environment variable OMP_SCHEDULE.
AUTO: the scheduling decision is delegated to the compiler and/or runtime system.
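For instance, here is a small sketch (illustrative; the file names loop.c/loop.x are hypothetical) of selecting the schedule at run time: compile the loop with schedule(runtime) and choose the policy through OMP_SCHEDULE.

#include <omp.h>
#include <stdio.h>

#define N 16

int main() {
    /* The schedule type and chunk size are read from OMP_SCHEDULE at run time */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < N; i++)
        printf("thread %d got i = %d\n", omp_get_thread_num(), i);
    return 0;
}

gcc -fopenmp -o loop.x loop.c
export OMP_SCHEDULE="dynamic,2"
./loop.x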
OpenMP
The loop iterations are computed concurrently:
#include <stdio.h>
#include <omp.h>

#define CHUNKSIZE 100
#define N 1000

int main() {
    int i, chunk, tid;
    float a[N], b[N], c[N];

    /* Some initializations */
    for (i = 0; i < N; i++)
        a[i] = b[i] = i * 1.0;
    chunk = CHUNKSIZE;

    #pragma omp parallel shared(a,b,c,chunk) private(i,tid)
    {
        #pragma omp for schedule(dynamic,chunk) nowait
        for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
            tid = omp_get_thread_num();
            printf("thread = %d, i = %d\n", tid, i);
        }
    } /* end of parallel loop */
}
OpenMP
Nested loops. By default only the outer loop is parallelized:
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        /* do task(i,j) */
    }
}
To parallelize both loops:

#pragma omp parallel for collapse(2)
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        /* do task(i,j) */
    }
}
OpenMP
Parallel sections. Each section is executed by one thread.

#pragma omp parallel shared(a,b,c,d) private(i,tid)
{
    #pragma omp sections nowait
    {
        #pragma omp section
        {
            for (i = 0; i < N; i++)
                c[i] = a[i] + b[i];
            /* Obtain and print thread id and array index number */
            tid = omp_get_thread_num();
            printf("thread = %d, i = %d\n", tid, i);
        }

        #pragma omp section
        {
            for (i = 0; i < N; i++)
                d[i] = a[i] * b[i];
            /* Obtain and print thread id and array index number */
            tid = omp_get_thread_num();
            printf("thread = %d, i = %d\n", tid, i);
        }
    } /* end of sections */
} /* end of parallel section */
OpenMP
When things become messy: several threads are updating the same variable.

#pragma omp parallel for shared(sum) private(i)
for (i = 0; i < 1000000; i++) {
    sum = sum + a[i];
}
printf("sum=%lf\n", sum);

The threads overwrite the sum variable for each other, running into a "race condition".
The problem can be solved by using critical locks and critical sections, so that only one thread at a time updates the sum.
OpenMP
Synchronization construct.
The critical section specifies a region of code that must be executed by only one thread at a time.

#pragma omp parallel for shared(sum) private(i)
for (i = 0; i < 1000000; i++) {
    #pragma omp critical
    sum = sum + a[i];
}
printf("sum=%lf\n", sum);
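For a simple update like this one, a lighter-weight alternative (not shown in the original slides) is the atomic construct, which protects only the single memory update rather than a whole code block. A self-contained sketch, with an illustrative array a[]:

#include <stdio.h>

#define N 1000000

int main() {
    static double a[N];
    double sum = 0.0;
    int i;

    for (i = 0; i < N; i++)
        a[i] = 1.0;

    #pragma omp parallel for shared(sum) private(i)
    for (i = 0; i < N; i++) {
        #pragma omp atomic
        sum = sum + a[i];   /* only this single update is protected */
    }
    printf("sum=%lf\n", sum);
    return 0;
}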
OpenMP
Reduction clause.

A private copy of the variable result is created for each thread. At the end of the reduction, the private copies are combined and the result is written to the global shared variable.

Scalar product of two vectors, A and B:
result = A[0]*B[0] + A[1]*B[1] + ... + A[n-1]*B[n-1]

Have each thread compute partial sums (result) on a chunk of data, then reduce the partial sums and write the result to the global shared variable:

#pragma omp parallel for default(shared) private(i) \
        schedule(static,chunk) reduction(+:result)
for (i = 0; i < n; i++)
    result = result + (a[i] * b[i]);
printf("Final result= %f\n", result);
OpenMP
References.
https://computing.llnl.gov/tutorials/openMP
https://www.openmp.org/wp-content/uploads/omp-hands-on-SC08.pdf
https://people.sc.fsu.edu/~jburkardt/c_src/openmp/openmp.html
https://openmp.org
Conclusion and discussion
• We have touched base with the main elements of the OpenMP compiler directives and syntax.
• More advanced topics, such as thread affinity, vector CPU architectures, and offloading to GPU accelerators, can be found in the references.
• In the next hour, we'll do a lab session with the examples we have discussed this afternoon.
• Any questions or comments?
OpenMP LAB exercises.
Create a new directory for the exercises. Download the tarball into the directory. Extract the files:

mkdir OpenMP
cd OpenMP
wget https://linuxcourse.rutgers.edu/LCI2019/OpenMP.tgz
tar -zxvf OpenMP.tgz
OpenMP LAB exercises.
Compile hello.c with the -fopenmp flag and run it across 4 threads:

gcc -fopenmp -o hello.x hello.c
export OMP_NUM_THREADS=4
./hello.x

Similarly, compile for.c with -fopenmp, run it several times, and observe how the array elements are distributed across the threads:

gcc -fopenmp -o for.x for.c
./for.x
OpenMP LAB exercises.
Compile sections.c and sum.c with the -fopenmp flag and run them across 2 threads:

gcc -fopenmp -o sections.x sections.c
export OMP_NUM_THREADS=2
./sections.x
gcc -fopenmp -o sum.x sum.c
./sum.x

Modify sum.c by removing the line with the critical construct. Recompile it and run it several times; notice the different output results:

gcc -fopenmp -o sum.x sum.c
./sum.x