
Introduction to parallel programming via OpenMP

August 20, 2019


Contents

1 Basic facts about parallel algorithms

2 How things are stored in computer memory

3 OpenMP


"Parallel" comes from ...


Amdahl's law

Gives the theoretical speedup formula:

S_{latency} = \frac{1}{a + \frac{1-a}{p}}

where

S_{latency} is the theoretical speedup of the algorithm,

a = \frac{\text{number of operations done sequentially}}{\text{total number of operations}} < 1,

p is the number of parallel workers.
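For instance (illustrative numbers): with a sequential fraction a = 0.1 and p = 8 workers,

S_{latency} = \frac{1}{0.1 + \frac{0.9}{8}} = \frac{1}{0.2125} \approx 4.7,

well below the naive factor of 8.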


Amdahl's law

It follows that

S_{latency}(p) \le \frac{1}{a}

\lim_{p \to \infty} S_{latency}(p) = \frac{1}{a}

Doesn't take into account many other factors: memory, various overheads, ...

But can be used as a first estimate!


Example: dot product

Dot product

Given vectors x and y of length N, compute

c = x \cdot y = \sum_{i=1}^{N} x_i y_i

where x_i, y_i are the coordinates of x and y.

Questions

What is the total number of operations?

What is the maximum speedup?

What about memory?
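For reference, a minimal sequential baseline (a sketch, assuming double arrays a and b of length N; the OpenMP versions below parallelize exactly this loop):

double dot(const double *a, const double *b, int N)
{
    double sum = 0.0;
    for (int i = 0; i < N; ++i)
    {
        /* one multiply and one add per element: about 2N flops total */
        sum += a[i] * b[i];
    }
    return sum;
}

Note the ratio: about 2N floating point operations against 2N memory reads (16 bytes per iteration), which is what the memory question above is pointing at.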


Amdahl's law (modi�ed)

S_{latency} = \frac{1}{a + \frac{1-a}{p} + c}

where:

S_{latency} is the theoretical speedup of the execution of the whole task,

a = \frac{\text{number of operations done sequentially}}{\text{total number of operations}} < 1,

p is the number of parallel workers,

c is the communication overhead (nonlinear!).
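With the same illustrative numbers (a = 0.1, p = 8) and an overhead of c = 0.05,

S_{latency} = \frac{1}{0.1 + \frac{0.9}{8} + 0.05} = \frac{1}{0.2625} \approx 3.8,

so the overhead alone costs nearly a full worker's worth of speedup.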


Main characteristics:

flops = floating point operations per second

memory access speed

memory latency = amount of time to satisfy an individual memory request

memory bandwidth = how much data a memory channel can transfer at once


Memory access speed


Memory bandwidth

True story (for graphics chips at least)

You cannot optimize more after you've reached the maximum memory bandwidth!


Measuring memory bandwidth

Theoretical bandwidth: number of RAM sticks × bus width × memory clock frequency for one stick.
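For example (illustrative numbers, dual-channel DDR3-1600): 2 sticks × 8 bytes bus width × 1600 MT/s ≈ 25.6 GB/s.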

Experimental bandwidth (simplest variant):

/* writes `size` bytes sequentially; used to estimate write bandwidth */
void write_memory_loop(void *array, size_t size)
{
    size_t *carray = (size_t *) array;
    size_t i;
    for (i = 0; i < size / sizeof(size_t); i++) {
        carray[i] = 1;
    }
}

Experimental bandwidth: accessed memory / time

Taken from:
1) http://codearcana.com/posts/2013/05/18/achieving-maximum-memory-bandwidth.html
2) https://github.com/awreece/memory-bandwidth-demo
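A minimal timing harness might look like this (a sketch, not from the sources above; it assumes write_memory_loop from this slide and uses OpenMP's omp_get_wtime for wall-clock time; a serious measurement would warm up the buffer and average several runs):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void write_memory_loop(void *array, size_t size);   /* defined above */

int main(void)
{
    size_t size = (size_t)1 << 30;          /* 1 GiB buffer */
    void *array = malloc(size);
    if (array == NULL) return 1;

    double t0 = omp_get_wtime();
    write_memory_loop(array, size);
    double t1 = omp_get_wtime();

    /* experimental bandwidth = accessed memory / time */
    printf("bandwidth: %.2f GB/s\n", size / (t1 - t0) / 1e9);
    free(array);
    return 0;
}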


OpenMP

Two main concepts of OpenMP:

the use of threads.

fork/join model of parallelism.

Website: http://www.openmp.org/

Alternatives working on shared memory: TBB, Cilk (high-level), Pthreads (low-level)


OpenMP

Let's try "hello openmp world".

int nthreads, tid;

#pragma omp parallel private(nthreads, tid)
{
    tid = omp_get_thread_num();
    printf("Hello OpenMP world from thread = %d\n", tid);
    if (tid == 0)
    {
        nthreads = omp_get_num_threads();
        printf("Number of threads = %d\n", nthreads);
    }
}

Compiling:

GNU: -fopenmp

Intel: -openmp

Environment variable: export OMP_NUM_THREADS=N
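For example, with GCC (assuming the snippet above is completed with a main() and saved as hello_omp.c; the file name is just illustrative):

gcc -fopenmp hello_omp.c -o hello_omp
export OMP_NUM_THREADS=4
./hello_omp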

OpenMP

Main tool: adding preprocessor directives. #pragma omp tells the compiler how to set up threads.

Syntax

#pragma omp directive-name [clause[ [,] clause]...] new-line

#pragma omp parallel

Memory speci�cation

Default: variables declared outside parallel regions are shared; variables declared inside are private.
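A small illustration of these defaults (a sketch; assumes <stdio.h> and <omp.h> are included):

int n = 10;                               /* declared outside: shared */
#pragma omp parallel
{
    int tid = omp_get_thread_num();       /* declared inside: private */
    printf("thread %d sees n = %d\n", tid, n);
}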


OpenMP

Example: vector dot product

Implementation 1

sum = 0.0;
#pragma omp parallel for
for (int i = 0; i < N; ++i)
{
    sum += a[i] * b[i];
}

Wrong result!


OpenMP

Why? It's a race condition. All threads are trying to update sum at the same time. Let's correct that!

Implementation 2

sum = 0.0;
#pragma omp parallel for
for (int i = 0; i < N; ++i)
{
    #pragma omp atomic
    sum += a[i] * b[i];
}

Correct, but now every update to sum is serialized, so most of the parallel gain is lost.


OpenMP

If there is a special construct for the job, it's a good idea to use it!

Implementation 3

sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; ++i)
{
    sum += a[i] * b[i];
}


OpenMP

Warning: don't use more software threads than your hardware allows!

Implementation 4

sum = 0.0;
#pragma omp parallel for reduction(+:sum) \
    num_threads(8)
for (int i = 0; i < N; ++i)
{
    sum += a[i] * b[i];
}


OpenMP

Fewer "defaults", fewer errors! Specify variables' sharing attributes explicitly.

Implementation 5

sum = 0.0;
#pragma omp parallel for reduction(+:sum) \
    shared(a, b, N)
for (int i = 0; i < N; ++i)
{
    sum += a[i] * b[i];
}


OpenMP

Let's try to add more flexibility. Dynamic should be good. Or...?

Implementation 6

sum = 0.0;
#pragma omp parallel for reduction(+:sum) \
    shared(a, b) schedule(dynamic)
for (int i = 0; i < N; ++i)
{
    sum += a[i] * b[i];
}


OpenMP

Think of how memory is read for a moment. Do you have any idea how to improve? Use blocks!

Implementation 7

sum = 0.0;
int M = 32;   /* block size; assumes N is a multiple of M */
#pragma omp parallel for reduction(+:sum) \
    shared(a, b)
for (int i = 0; i < N/M; i++)
{
    for (int j = i*M; j < (i+1)*M; j++)
    {
        sum += a[j] * b[j];
    }
}


OpenMP

Just an alternative.

Implementation 8

#pragma omp parallel sections
{
    #pragma omp section
    {
        for (int i = 0; i < N/2; i++)
        {
            sum1 += a[i] * b[i];
        }
    }
    #pragma omp section
    {
        for (int i = N/2; i < N; i++)
        {
            sum2 += a[i] * b[i];
        }
    }
}
/* final result: sum = sum1 + sum2 */

OpenMP

Things to learn:

#pragma omp atomic

#pragma omp flush

#pragma omp critical

#pragma omp barrier

#pragma omp single/master

KMP_AFFINITY

locks and mutexes

false sharing

OpenMP tasks


OpenMP in Python

Bad news (AFAIK)

Python is not a friend of OpenMP because of the GIL = Global Interpreter Lock. Python has the GIL, so only one thread is allowed to run at a given time.

Good news (AFAIK)

There are some Python extensions which can handle OpenMP: SciPy (to some extent), Cython, weave, other C extensions...


Thanks for coming!

