
Introduction to parallel programming via OpenMP

August 20, 2019


Contents

1 Basic facts about parallel algorithms

2 How things are stored in computer memory

3 OpenMP


"Parallel" comes from ...


Amdahl's law

Gives the theoretical speedup formula:

S_{latency} = \frac{1}{a + \frac{1-a}{p}}

where

S_{latency} is the theoretical speedup of the algorithm,

a = \frac{\text{number of operations done sequentially}}{\text{total number of operations}} < 1,

p is the number of parallel workers.
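For instance (illustrative numbers): with a sequential fraction a = 0.1 and p = 8 workers,

S_{latency} = \frac{1}{0.1 + \frac{0.9}{8}} = \frac{1}{0.2125} \approx 4.7,

well below the naive factor of 8.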


Amdahl's law

It follows that

S_{latency}(p) \le \frac{1}{a}

\lim_{p \to \infty} S_{latency}(p) = \frac{1}{a}

Doesn't take into account many other factors: memory, various overheads, ...

But can be used as a first estimate!


Example: dot product

Dot product

Given vectors x and y of length N, compute

c = x \cdot y = \sum_{i=1}^{N} x_i y_i

where x_i, y_i are the coordinates of x and y.

Questions

What is the total number of operations?

What is the maximum speedup?

What about memory?
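For reference, a minimal sequential baseline (a sketch, assuming double arrays a and b of length N; the OpenMP versions below parallelize exactly this loop):

double dot(const double *a, const double *b, int N)
{
    double sum = 0.0;
    for (int i = 0; i < N; ++i)
    {
        /* one multiply and one add per element: about 2N flops total */
        sum += a[i] * b[i];
    }
    return sum;
}

Note the ratio: about 2N floating point operations against 2N memory reads (16 bytes per iteration), which is what the memory question above is pointing at.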


Amdahl's law (modi�ed)

S_{latency} = \frac{1}{a + \frac{1-a}{p} + c}

where:

S_{latency} is the theoretical speedup of the execution of the whole task,

a = \frac{\text{number of operations done sequentially}}{\text{total number of operations}} < 1,

p is the number of parallel workers,

c is the communication overhead (nonlinear!).
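With the same illustrative numbers (a = 0.1, p = 8) and an overhead of c = 0.05,

S_{latency} = \frac{1}{0.1 + \frac{0.9}{8} + 0.05} = \frac{1}{0.2625} \approx 3.8,

so the overhead alone costs nearly a full worker's worth of speedup.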


Main characteristics:

flops = floating point operations per second

memory access speed

memory latency = amount of time to satisfy an individual memory request

memory bandwidth = how much data a memory channel can transfer at once


Memory access speed


Memory bandwidth

True story (for graphics chips at least)

You cannot optimize more after you've reached the maximum memory bandwidth!


Measuring memory bandwidth

Theoretical bandwidth: number of RAM sticks × bus width × memory clock frequency for one stick.
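For example (illustrative numbers, dual-channel DDR3-1600): 2 sticks × 8 bytes bus width × 1600 MT/s ≈ 25.6 GB/s.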

Experimental bandwidth (simplest variant):

/* writes `size` bytes sequentially; used to estimate write bandwidth */
void write_memory_loop(void *array, size_t size)
{
    size_t *carray = (size_t *) array;
    size_t i;
    for (i = 0; i < size / sizeof(size_t); i++) {
        carray[i] = 1;
    }
}

Experimental bandwidth: accessed memory / time

Taken from:
1) http://codearcana.com/posts/2013/05/18/achieving-maximum-memory-bandwidth.html
2) https://github.com/awreece/memory-bandwidth-demo
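A minimal timing harness might look like this (a sketch, not from the sources above; it assumes write_memory_loop from this slide and uses OpenMP's omp_get_wtime for wall-clock time; a serious measurement would warm up the buffer and average several runs):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void write_memory_loop(void *array, size_t size);   /* defined above */

int main(void)
{
    size_t size = (size_t)1 << 30;          /* 1 GiB buffer */
    void *array = malloc(size);
    if (array == NULL) return 1;

    double t0 = omp_get_wtime();
    write_memory_loop(array, size);
    double t1 = omp_get_wtime();

    /* experimental bandwidth = accessed memory / time */
    printf("bandwidth: %.2f GB/s\n", size / (t1 - t0) / 1e9);
    free(array);
    return 0;
}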


OpenMP

Two main concepts of OpenMP:

the use of threads.

fork/join model of parallelism.

Website: http://www.openmp.org/

Alternatives working on shared memory: TBB, Cilk (high-level), Pthreads (low-level)


OpenMP

Let's try "hello openmp world".

int nthreads, tid;

#pragma omp parallel private(nthreads, tid)
{
    tid = omp_get_thread_num();
    printf("Hello OpenMP world from thread = %d\n", tid);
    if (tid == 0)
    {
        nthreads = omp_get_num_threads();
        printf("Number of threads = %d\n", nthreads);
    }
}

Compiling:

GNU: -fopenmp

Intel: -openmp

Environment variable: export OMP_NUM_THREADS=N
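For example, with GCC (assuming the snippet above is completed with a main() and saved as hello_omp.c; the file name is just illustrative):

gcc -fopenmp hello_omp.c -o hello_omp
export OMP_NUM_THREADS=4
./hello_omp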

OpenMP

Main tool: adding preprocessor directives. #pragma omp tells the compiler how to set up threads.

Syntax

#pragma omp directive-name [clause[ [,] clause]...] new-line

#pragma omp parallel

Memory speci�cation

Default: variables declared outside parallel regions are shared; variables declared inside are private.
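A small illustration of these defaults (a sketch; assumes <stdio.h> and <omp.h> are included):

int n = 10;                               /* declared outside: shared */
#pragma omp parallel
{
    int tid = omp_get_thread_num();       /* declared inside: private */
    printf("thread %d sees n = %d\n", tid, n);
}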


OpenMP

Example: vector dot product

Implementation 1

sum = 0.0;
#pragma omp parallel for
for (int i = 0; i < N; ++i)
{
    sum += a[i] * b[i];
}

Wrong result!


OpenMP

Why? It's a race condition. All threads are trying to update sum at the same time. Let's correct that!

Implementation 2

sum = 0.0;
#pragma omp parallel for
for (int i = 0; i < N; ++i)
{
    #pragma omp atomic
    sum += a[i] * b[i];
}

Correct, but now every update to sum is serialized, so most of the parallel gain is lost.


OpenMP

If there is a special construct for the job, it's a good idea to use it!

Implementation 3

sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; ++i)
{
    sum += a[i] * b[i];
}


OpenMP

Warning: don't use more software threads than your hardware allows!

Implementation 4

sum = 0.0;
#pragma omp parallel for reduction(+:sum) \
    num_threads(8)
for (int i = 0; i < N; ++i)
{
    sum += a[i] * b[i];
}


OpenMP

Fewer "defaults", fewer errors! Specify variables' sharing attributes explicitly.

Implementation 5

sum = 0.0;
#pragma omp parallel for reduction(+:sum) \
    shared(a, b, N)
for (int i = 0; i < N; ++i)
{
    sum += a[i] * b[i];
}


OpenMP

Let's try to add more flexibility. Dynamic should be good. Or...?

Implementation 6

sum = 0.0;
#pragma omp parallel for reduction(+:sum) \
    shared(a, b) schedule(dynamic)
for (int i = 0; i < N; ++i)
{
    sum += a[i] * b[i];
}


OpenMP

Think of how memory is read for a moment. Do you have any idea how to improve? Use blocks!

Implementation 7

sum = 0.0;
int M = 32;   /* block size; assumes N is a multiple of M */
#pragma omp parallel for reduction(+:sum) \
    shared(a, b)
for (int i = 0; i < N/M; i++)
{
    for (int j = i*M; j < (i+1)*M; j++)
    {
        sum += a[j] * b[j];
    }
}


OpenMP

Just an alternative.

Implementation 8

#pragma omp parallel sections
{
    #pragma omp section
    {
        for (int i = 0; i < N/2; i++)
        {
            sum1 += a[i] * b[i];
        }
    }
    #pragma omp section
    {
        for (int i = N/2; i < N; i++)
        {
            sum2 += a[i] * b[i];
        }
    }
}
/* final result: sum = sum1 + sum2 */

OpenMP

Things to learn:

#pragma omp atomic

#pragma omp flush

#pragma omp critical

#pragma omp barrier

#pragma omp single/master

KMP_AFFINITY

locks and mutexes

false sharing

OpenMP tasks


OpenMP in Python

Bad news (AFAIK)

Python is not a friend of OpenMP because of the GIL = Global Interpreter Lock. Python has the GIL, so only one thread is allowed to run at a given time.

Good news (AFAIK)

There are some Python extensions which can handle OpenMP: SciPy (to some extent), Cython, weave, other C extensions...


Thanks for coming!

