PARALLEL PROGRAMMING WITH OPENMP
Ing. Andrea Marongiu ([email protected])
Programming model: OpenMP
De-facto standard for the shared memory programming model
A collection of compiler directives, library routines and environment variables
Easy to specify parallel execution within a serial code
Requires special support in the compiler
Generates calls to threading libraries (e.g. pthreads)
Focus on loop-level parallel execution
Popular in high-end embedded systems
Fork/Join Parallelism
Initially only the master thread is active
The master thread executes the sequential code
Fork: the master thread creates or awakens additional threads to execute the parallel code
Join: at the end of the parallel code, the created threads are suspended upon barrier synchronization

(Figure: fork/join execution of a sequential vs. parallel program)
Pragmas

A pragma is a compiler directive in C or C++
It stands for “pragmatic information”
A way for the programmer to communicate with the compiler
The compiler is free to ignore pragmas: the original sequential semantics is not altered
Syntax:
#pragma omp <rest of pragma>
Components of OpenMP

Directives
  Parallel regions: #pragma omp parallel
  Work sharing: #pragma omp for, #pragma omp sections
  Synchronization: #pragma omp barrier, #pragma omp critical, #pragma omp atomic

Clauses
  Data scope attributes: private, shared, reduction
  Loop scheduling: static, dynamic

Runtime Library
  Thread forking/joining: omp_parallel_start(), omp_parallel_end()
  Loop scheduling
  Thread IDs: omp_get_thread_num(), omp_get_num_threads()
Outlining parallelism: the parallel directive
Fundamental construct to outline parallel computation within a sequential program
Code within its scope is replicated among threads
Defers implementation of parallel execution to the runtime (machine-specific, e.g. pthread_create)
#pragma omp parallel

A sequential program...

int main() {
#pragma omp parallel
  {
    printf("\nHello world!");
  }
}

...is easily parallelized. The compiler replaces it with calls to the runtime library:

int main() {
  omp_parallel_start(&parfun, ...);
  parfun();
  omp_parallel_end();
}

int parfun(...) {
  printf("\nHello world!");
}

Code originally contained within the scope of the pragma is outlined to a new function within the compiler.
The #pragma construct in the main function is replaced with function calls to the runtime library:
1. First we call the runtime to fork new threads, and pass them a pointer to the function to execute in parallel.
2. Then the master itself calls the parallel function.
3. Finally we call the runtime to synchronize threads with a barrier and suspend them.
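For reference, a minimal, self-contained sketch of the hello-world example above, made compilable with an OpenMP-enabled compiler (the -fopenmp flag for GCC); the file name and thread count are arbitrary illustrations, not part of the original slides.

/* hello_omp.c — minimal sketch of the example above, made compilable.
 * Build:  gcc -fopenmp hello_omp.c -o hello_omp
 * Run:    OMP_NUM_THREADS=4 ./hello_omp
 */
#include <stdio.h>
#include <omp.h>

int main(void)
{
#pragma omp parallel
    {
        /* Each thread in the team executes this block once. */
        printf("\nHello world from thread %d of %d!",
               omp_get_thread_num(), omp_get_num_threads());
    }
    printf("\n");
    return 0;
}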
#pragma omp parallel — Data scope attributes

A slightly more complex example:

int main() {
  int id;
  int a = 5;
#pragma omp parallel shared(a) private(id)
  {
    id = omp_get_thread_num();
    if (id == 0)
      printf("Master: a = %d.", a * 2);
    else
      printf("Slave: a = %d.", a);
  }
}

Call the runtime to get the thread ID: every thread sees a different value.
Master and slave threads access the same variable a.
How do we inform the compiler about these different behaviors?

shared(a) — insert code to retrieve the address of the shared object from within each parallel thread.
private(id) — allow symbol privatization: each thread contains a private copy of this variable.

Correctness issues: what if a was not marked for shared access, or id was not marked for private access?
Performance issues: serialization.
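To make the correctness issue concrete, here is a sketch (not from the original slides) of the same example with id left shared: all threads race on the single id variable, so the roles printed become unpredictable.

/* race_on_id.c — illustrative sketch only: 'id' is NOT privatized here,
 * so every thread writes the same object and may read a value written
 * by another thread before reaching the if statement.
 */
#include <stdio.h>
#include <omp.h>

int id;   /* file-scope variable: shared by all threads */

int main(void)
{
    int a = 5;
#pragma omp parallel shared(a, id)
    {
        id = omp_get_thread_num();   /* data race: concurrent writes to id */
        if (id == 0)
            printf("Master: a = %d.\n", a * 2);
        else
            printf("Slave: a = %d.\n", a);
    }
    return 0;
}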
Sharing work among threads: the for directive

The parallel pragma instructs every thread to execute all of the code inside the block.
If we encounter a for loop that we want to divide among threads, we use the for pragma:

#pragma omp for

int main() {
#pragma omp parallel for
  for (i = 0; i < 10; i++)
    a[i] = i;
}

The compiler outlines this as:

int main() {
  omp_parallel_start(&parfun, ...);
  parfun();
  omp_parallel_end();
}

int parfun(...) {
  int LB = ...;
  int UB = ...;

  for (i = LB; i < UB; i++)
    a[i] = i;
}

The code of the for loop is moved inside the outlined function.
Every thread works on a different subset of the iteration space, since the lower and upper boundaries (LB, UB) are computed locally.
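A small self-contained sketch that makes the iteration-space split visible by printing which thread executes each iteration (array size and output format are arbitrary choices, not from the slides).

/* for_demo.c — sketch: each thread receives a different subset of the
 * iteration space of the parallel for loop.
 */
#include <stdio.h>
#include <omp.h>

#define N 10

int main(void)
{
    int a[N];
    int i;
#pragma omp parallel for
    for (i = 0; i < N; i++) {
        a[i] = i;
        printf("iteration %d executed by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}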
The schedule clause: static loop partitioning — schedule(static)

Example: 12 iterations (N), 4 threads (Nthr).

#pragma omp for schedule(static)
  for (i = 0; i < 12; i++)
    a[i] = i;

Data chunk:    C  = ceil(N / Nthr)   (here, 3 iterations per thread)
Lower bound:   LB = C * TID
Upper bound:   UB = min(C * (TID + 1), N)

Thread ID (TID):  0   1   2   3
LB:               0   3   6   9
UB:               3   6   9  12

Useful for:
• Simple, regular loops
• Iterations with equal duration

What happens with static scheduling when iterations have different duration?
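A sketch of the per-thread bound computation described above, written out by hand; it mirrors the formulas on the slide, not the code actually emitted by any particular compiler.

/* static_chunks.c — sketch: computing LB/UB per thread as in the formulas above. */
#include <stdio.h>
#include <omp.h>

#define N 12

int main(void)
{
    int a[N];
#pragma omp parallel
    {
        int nthr = omp_get_num_threads();
        int tid  = omp_get_thread_num();
        int C    = (N + nthr - 1) / nthr;                  /* C = ceil(N / Nthr) */
        int LB   = C * tid;
        int UB   = (C * (tid + 1) < N) ? C * (tid + 1) : N; /* min(C*(TID+1), N) */

        for (int i = LB; i < UB; i++)
            a[i] = i;
        printf("thread %d: iterations [%d, %d)\n", tid, LB, UB);
    }
    return 0;
}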
With schedule(static) and a variable amount of work in each iteration:

#pragma omp for schedule(static)
  for (i = 0; i < 12; i++) {
    int start = rand();
    int count = 0;

    while (start++ < 256)
      count++;

    a[count] = foo();
  }

(Figure: iterations 1–12 assigned statically to threads 1–4 on cores 1–4.)

With static scheduling, iterations of different duration lead to UNBALANCED workloads across the cores.
The schedule clause: dynamic loop partitioning — schedule(dynamic)

#pragma omp for schedule(dynamic)
  for (i = 0; i < 12; i++) {
    int start = rand();
    int count = 0;

    while (start++ < 256)
      count++;

    a[count] = foo();
  }

With dynamic scheduling, a unit of work (task) is generated for every single iteration.
A work queue is implemented in the runtime environment, where tasks are stored.
Work is fetched by threads from the runtime environment through synchronized accesses to the work queue:

int parfun(...) {
  int LB, UB;
  GOMP_loop_dynamic_next(&LB, &UB);

  for (i = LB; i < UB; i++) { ... }
}

(Figure: iterations 1–12 handed out to cores 1–4 as each core becomes free.)

Remember the results with static scheduling: dynamic scheduling yields BALANCED workloads — speedup!
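A sketch of how dynamic scheduling is requested at the source level; the per-iteration work here is synthetic (a simple counting loop rather than the rand()-based loop on the slide) just to make the imbalance visible.

/* dynamic_demo.c — sketch: requesting dynamic scheduling from OpenMP.
 * Each iteration performs a variable amount of work, so threads keep
 * fetching new iterations from the runtime's work queue as they finish.
 */
#include <stdio.h>
#include <omp.h>

#define N 12

int main(void)
{
    int a[N];
#pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < N; i++) {
        volatile int count = 0;
        for (int k = 0; k < (i + 1) * 100000; k++)   /* variable work per iteration */
            count++;
        a[i] = count;
        printf("iteration %d done by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}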
Sharing work among threads: the sections directive

The for pragma allows us to exploit data parallelism in loops.
OpenMP also provides a directive to exploit task parallelism:

#pragma omp sections

Task parallelism example

Identify independent nodes in the task graph, and outline parallel computation with the sections directive.
The sequential program:

int main() {
  v = alpha();
  w = beta();
  y = delta();
  x = gamma(v, w);
  z = epsilon(x, y);

  printf("%f\n", z);
}

FIRST SOLUTION — each parallel task within a sections block identifies a section; barriers are implied at the end of each sections block:

int main() {
#pragma omp parallel sections
  {
  #pragma omp section
    v = alpha();
  #pragma omp section
    w = beta();
  }
#pragma omp parallel sections
  {
  #pragma omp section
    y = delta();
  #pragma omp section
    x = gamma(v, w);
  }
  z = epsilon(x, y);

  printf("%f\n", z);
}

SECOND SOLUTION — a different grouping of the tasks; again, each parallel task within a sections block identifies a section, and a barrier is implied between the two blocks:

int main() {
#pragma omp parallel sections
  {
  #pragma omp section
    v = alpha();
  #pragma omp section
    w = beta();
  #pragma omp section
    y = delta();
  }
#pragma omp parallel sections
  {
  #pragma omp section
    x = gamma(v, w);
  #pragma omp section
    z = epsilon(x, y);
  }

  printf("%f\n", z);
}
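A self-contained sketch of the first solution; alpha() through epsilon() are dummy stand-ins for the real computations (the slides do not define them), and gamma is renamed gamma_ only to avoid a clash with the C library identifier.

/* sections_demo.c — compilable sketch of the first solution above. */
#include <stdio.h>

static double alpha(void)                 { return 1.0; }
static double beta(void)                  { return 2.0; }
static double gamma_(double v, double w)  { return v + w; }
static double delta(void)                 { return 3.0; }
static double epsilon(double x, double y) { return x * y; }

int main(void)
{
    double v, w, x, y, z;

#pragma omp parallel sections
    {
#pragma omp section
        v = alpha();
#pragma omp section
        w = beta();
    }
#pragma omp parallel sections
    {
#pragma omp section
        y = delta();
#pragma omp section
        x = gamma_(v, w);
    }
    z = epsilon(x, y);   /* all sections above have completed (implied barriers) */

    printf("%f\n", z);
    return 0;
}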
#pragma omp barrier

The most important synchronization mechanism in shared memory fork/join parallel programming.
All threads participating in a parallel region wait until everybody has finished before computation flows on.
This prevents later stages of the program from working with inconsistent shared data.
A barrier is implied at the end of parallel constructs, as well as of for and sections (unless a nowait clause is specified).
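A small sketch of an explicit barrier separating two phases inside one parallel region; the two-phase structure and array size are illustrative, not from the slides.

/* barrier_demo.c — sketch: an explicit barrier between two phases, so
 * phase 2 never reads partially written phase-1 data.
 */
#include <stdio.h>
#include <omp.h>

#define N 8

int main(void)
{
    int data[N];
#pragma omp parallel
    {
        int tid  = omp_get_thread_num();
        int nthr = omp_get_num_threads();

        /* Phase 1: each thread fills its own elements. */
        for (int i = tid; i < N; i += nthr)
            data[i] = i * i;

#pragma omp barrier   /* wait until every thread has finished phase 1 */

        /* Phase 2: threads may now safely read elements written by others. */
        if (tid == 0) {
            int sum = 0;
            for (int i = 0; i < N; i++)
                sum += data[i];
            printf("sum = %d\n", sum);
        }
    }
    return 0;
}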
#pragma omp critical
Critical section: a portion of code that only one thread at a time may execute.
We denote a critical section by putting the pragma
#pragma omp critical
in front of a block of C code.
π-finding code example

double area, pi, x;
int i, n;
#pragma omp parallel for private(x) shared(area)
  for (i = 0; i < n; i++) {
    x = (i + 0.5) / n;
    area += 4.0 / (1.0 + x*x);
  }
pi = area / n;

Race condition: accesses to the shared variable area must be synchronized to avoid inconsistent results.
Updates of area must be atomic; otherwise one thread may “race ahead” of another and ignore its changes.
Race condition (cont’d)

Over time (area initially 11.667):
• Thread A reads 11.667 into a local register
• Thread B reads 11.667 into a local register
• Thread A updates area with 11.667 + 3.765
• Thread B ignores the write from thread A and updates area with 11.667 + 3.563
π-finding code example

double area, pi, x;
int i, n;
#pragma omp parallel for private(x) shared(area)
  for (i = 0; i < n; i++) {
    x = (i + 0.5) / n;
#pragma omp critical
    area += 4.0 / (1.0 + x*x);
  }
pi = area / n;

#pragma omp critical protects the code within its scope by acquiring a lock before entering the critical section and releasing it after execution.
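The slides list #pragma omp atomic among the synchronization directives; for a single scalar update like this one it can be used in place of critical. A sketch, not part of the original example; n is an arbitrary choice.

/* atomic_pi.c — sketch: the same update protected with the atomic directive,
 * which is restricted to simple updates of a single memory location.
 */
#include <stdio.h>

int main(void)
{
    int n = 1000000;
    double area = 0.0, pi;

#pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double x = (i + 0.5) / n;   /* x is private: declared inside the loop */
#pragma omp atomic
        area += 4.0 / (1.0 + x * x);
    }
    pi = area / n;
    printf("pi = %f\n", pi);
    return 0;
}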
Correctness, not performance!

As a matter of fact, using locks makes execution sequential.
To limit this effect we should try to use fine-grained locking (i.e. make critical sections as small as possible).
A simple instruction to compute the value of area in the previous example is translated into many more, simpler instructions within the compiler!
The programmer is not aware of the real granularity of the critical section.

A dump of the compiler’s intermediate representation of the instruction area += 4.0/(1.0 + x*x); shows:
• a call into the runtime to acquire the lock
• the lock-protected operations (the critical section)
• a call into the runtime to release the lock
THIS IS DONE AT EVERY LOOP ITERATION!
(Figure: execution timeline of the π-finding loop with the critical section on cores 1–4 — little parallel work, mostly sequential execution and time spent waiting for the lock.)
Correctness, not performance!

A programming pattern such as area += 4.0/(1.0 + x*x); in which we:
• fetch the value of an operand
• add a value to it
• store the updated value
is called a reduction, and is commonly supported by parallel programming APIs.
OpenMP takes care of storing partial results in private variables and combining the partial results after the loop:

double area, pi, x;
int i, n;
#pragma omp parallel for private(x) reduction(+:area)
  for (i = 0; i < n; i++) {
    x = (i + 0.5) / n;
    area += 4.0 / (1.0 + x*x);
  }
pi = area / n;

The reduction(+:area) clause instructs the compiler to create private copies of the area variable for every thread. Each thread computes partial sums on its private copy; the shared area variable is only updated at the end of the loop, when the partial sums are combined.

(Figure: execution timeline on cores 1–4 — the loop now runs fully in parallel.)
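A self-contained version of the π example with the reduction clause; n and the build command are illustrative choices.

/* pi_reduction.c — compilable sketch of the example above.
 * Build: gcc -fopenmp pi_reduction.c -o pi_reduction
 */
#include <stdio.h>

int main(void)
{
    int n = 1000000;
    double area = 0.0, pi;

#pragma omp parallel for reduction(+:area)
    for (int i = 0; i < n; i++) {
        double x = (i + 0.5) / n;     /* x is private: declared inside the loop */
        area += 4.0 / (1.0 + x * x);  /* each thread accumulates a private partial sum */
    }
    pi = area / n;

    printf("pi ~= %f\n", pi);
    return 0;
}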
Summary
Customizing OpenMP for Efficient Exploitation of the Memory Hierarchy

Memory latency is well recognized as a severe performance bottleneck.
MPSoCs feature a complex memory hierarchy, with multiple cache levels and private and shared on-chip and off-chip memories.
Using the memory hierarchy efficiently is of the utmost importance to exploit the computational power of MPSoCs, but...
...it is a difficult task, requiring a deep understanding of the application and its memory access pattern.
The OpenMP standard doesn’t provide any facilities to deal with data placement and partitioning.
Customization of the programming interface would bring the advantages of OpenMP to the MPSoC world.
Extending OpenMP to support data distribution

We need a custom directive that enables specific code analysis and transformation.
When static code analysis can’t tell how to distribute data, we must rely on profiling.
The runtime is responsible for exploiting this information to efficiently map arrays to memories.
The entire process is driven by the custom #pragma omp distributed directive:

{
  int A[m];
  float B[n];
#pragma omp distributed (A, B)
  ...
}

is transformed into:

{
  int *A;
  float *B;

  A = distributed_malloc(m);
  B = distributed_malloc(n);
  ...
}

Originally stack-allocated arrays are transformed into pointers to allow for their explicit placement throughout the memory hierarchy within the program.
The transformed program invokes the runtime to retrieve profile information which drives data placement.
When no profile information is found, distributed_malloc returns a pointer to the shared memory.
Data partitioning

The OpenMP model is focused on loop parallelism.
In this parallelization scheme multiple threads may access different sections (discontiguous addresses) of shared arrays.
Data partitioning is the process of tiling data arrays and placing the tiles in memory such that a maximum number of accesses are satisfied from local memory.
The most obvious implementation of this concept is the data cache, but...
• Inter-array discontiguity often causes cache conflicts
• Embedded systems impose constraints on energy, predictability and real-time behavior that often make caches unsuitable
A simple example

#pragma omp parallel for
  for (i = 0; i < 4; i++)
    for (j = 0; j < 6; j++)
      A[i][j] = 1.0;

(Figure: a 4×6 iteration space; four CPUs, each with a scratchpad memory (SPM), connected through an interconnect to a shared memory.)

The iteration space is partitioned between the processors.
The array is accessed with the loop induction variables, A(i,j), so the data space overlaps with the iteration space: each processor accesses a different tile.
The compiler can therefore split the matrix into four smaller arrays and allocate them onto the scratchpads.
No access to remote memories through the bus is needed, since data is allocated locally.
Another example

#pragma omp parallel for
  for (i = 0; i < 4; i++)
    for (j = 0; j < 6; j++)
      hist[A[i][j]]++;

(Figure: the same 4×6 iteration space; the values stored in A, e.g.
  3 7 7 5 4 4
  3 2 2 3 0 0
  1 1 0 2 3 5
  1 1 4 5 5 4
are used to index the array hist.)

Different locations within the array hist are accessed by many processors.
In this case static code analysis can’t tell anything about the array access pattern.
How do we decide the most efficient partitioning?
• Split the array in as many tiles as there are processors
• Use access count information to map each tile to the processor that has the most accesses to it
Another example (cont’d)

(Figure: hist is split into four tiles, with per-tile access counts:
  TILE 1: PROC1 1, PROC2 2, PROC3 3, PROC4 2
  TILE 2: PROC1 1, PROC2 4, PROC3 2, PROC4 0
  TILE 3: PROC1 3, PROC2 0, PROC3 1, PROC4 4
  TILE 4: PROC1 2, PROC2 0, PROC3 0, PROC4 1
Each tile is mapped to the scratchpad of the processor that accesses it most.)

Now processors need to access remote scratchpads, since they work on multiple tiles!
Problem with data partitioning

If there is no overlap between the iteration space and the data space, it may happen that multiple processors need to access different tiles.
In this case data partitioning introduces addressing difficulties, because the data tiles can become discontiguous in physical memory.
How do we address the problem of generating efficient code to access data when performing loop and data partitioning?
We can further extend the OpenMP programming interface to deal with that!
The programmer only has to specify the intention of partitioning an array throughout the memory hierarchy, and the compiler does the necessary instrumentation.
Code instrumentation

In general, the steps for addressing an array element using tiling are:
1. Compute the offset w.r.t. the base address
2. Identify the tile to which this element belongs
3. Re-compute the index relative to the current tile
4. Load the tile base address from a metadata array

This metadata array is populated during the memory allocation step of the tool-flow. It relies on access count information to figure out the most efficient mapping of array tiles to memories.
Extending OpenMP to support data partitioning

#pragma omp parallel tiled(A)
{
  ...
  /* Access memory */
  A[i][j] = foo();
  ...
}

is transformed into:

{
  /* Compute offset, tile and index for the distributed array */
  int offset = ...;
  int tile   = ...;
  int index  = ...;

  /* Read tile base address */
  int *base = tiles[dvar][tile];

  /* Access memory */
  base[index] = foo();
  ...
}

The instrumentation process is driven by the custom tiled clause, which can be coupled with every parallel and work-sharing construct.
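An illustrative sketch of how the offset/tile/index computation above might look for a row-major 2-D array split into equally sized tiles. The tile size, the tile_storage buffers and the single-level tiles[] table are hypothetical simplifications (the slides use a per-variable table tiles[dvar][tile]), not the actual code generated by the tool-flow.

/* tiled_access.c — illustrative sketch of the four addressing steps. */
#include <stdio.h>

#define ROWS 4
#define COLS 6
#define NTILES 4
#define TILE_SIZE ((ROWS * COLS) / NTILES)   /* assume equally sized tiles */

int tile_storage[NTILES][TILE_SIZE];         /* stand-in for per-SPM buffers   */
int *tiles[NTILES] = { tile_storage[0], tile_storage[1],
                       tile_storage[2], tile_storage[3] };   /* metadata array */

/* Store 'value' into the tiled representation of A[i][j]. */
static void tiled_store(int i, int j, int value)
{
    int offset = i * COLS + j;          /* 1. offset w.r.t. the base address  */
    int tile   = offset / TILE_SIZE;    /* 2. tile this element belongs to    */
    int index  = offset % TILE_SIZE;    /* 3. index relative to the tile      */
    int *base  = tiles[tile];           /* 4. tile base address from metadata */
    base[index] = value;
}

int main(void)
{
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            tiled_store(i, j, i * COLS + j);

    printf("tile 2, index 3 holds %d\n", tiles[2][3]);
    return 0;
}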