PARALLEL PROGRAMMING WITH OPENMP
Ing. Andrea Marongiu ([email protected])


Page 1: Parallel Programming with OpenMP

PARALLEL PROGRAMMING WITH OPENMP
Ing. Andrea Marongiu ([email protected])

Page 2: Parallel Programming with OpenMP

Programming model: OpenMP

De-facto standard for the shared memory programming model

A collection of compiler directives, library routines and environment variables

Easy to specify parallel execution within a serial code

Requires special support in the compiler

Generates calls to threading libraries (e.g. pthreads)

Focus on loop-level parallel execution

Popular in high-end embedded systems

Page 3: Parallel Programming with OpenMP

Fork/Join Parallelism

Initially only the master thread is active; the master thread executes the sequential code

Fork: the master thread creates (or awakens) additional threads to execute the parallel code

Join: at the end of the parallel code, the created threads are suspended upon barrier synchronization

(Figure: execution timeline of a sequential program vs. a fork/join parallel program)

Page 4: Parallel Programming with OpenMP

Pragmas

Pragma: a compiler directive in C or C++

Stands for "pragmatic information": a way for the programmer to communicate with the compiler

The compiler is free to ignore pragmas: the original sequential semantics are not altered

Syntax:

#pragma omp <rest of pragma>
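A small illustration (my addition, not from the original slides): if the code is compiled without OpenMP support the pragma is simply ignored and the program keeps its sequential behavior; with GCC, OpenMP is enabled by the -fopenmp flag.

#include <stdio.h>

int main(void)
{
    /* If the compiler does not know OpenMP, this pragma is ignored and the
       block below is executed once, sequentially. */
    #pragma omp parallel
    {
        printf("Hello!\n");
    }
    return 0;
}

Built with gcc -fopenmp the block is executed by a team of threads; built without the flag it executes exactly once.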

Page 5: Parallel Programming with OpenMP

Components of OpenMP

Directives:
• Parallel regions: #pragma omp parallel
• Work sharing: #pragma omp for, #pragma omp sections
• Synchronization: #pragma omp barrier, #pragma omp critical, #pragma omp atomic

Clauses:
• Data scope attributes: private, shared, reduction
• Loop scheduling: static, dynamic

Runtime library:
• Thread forking/joining: omp_parallel_start(), omp_parallel_end()
• Loop scheduling
• Thread IDs: omp_get_thread_num(), omp_get_num_threads()

Page 6: Parallel Programming with OpenMP

Outlining parallelism: the parallel directive

Fundamental construct to outline parallel computation within a sequential program

Code within its scope is replicated among threads

Defers implementation of parallel execution to the runtime (machine-specific, e.g. pthread_create)

int main()
{
#pragma omp parallel
  {
    printf("\nHello world!");
  }
}

A sequential program ... is easily parallelized:

int main()
{
  omp_parallel_start(&parfun, …);
  parfun();
  omp_parallel_end();
}

int parfun(…)
{
  printf("\nHello world!");
}
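For reference, a complete, compilable version of the example (my addition; the #include lines, the return type and the use of omp_get_thread_num() are not in the slide):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    /* The block below is outlined by the compiler and executed by every thread. */
    #pragma omp parallel
    {
        printf("\nHello world from thread %d!", omp_get_thread_num());
    }
    printf("\n");
    return 0;
}

It can be built, for instance, with gcc -fopenmp.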

Page 7: Parallel Programming with OpenMP

#pragma omp parallel


Code originally contained within the scope of the pragma is outlined to a new function within the compiler

Page 8: Parallel Programming with OpenMP

#pragma omp parallel


The #pragma construct in the main function is replaced with function calls to the runtime library

Page 9: Parallel Programming with OpenMP

#pragma omp parallel


First we call the runtime to fork new threads, and pass them a pointer to the function to execute in parallel

Page 10: Parallel Programming with OpenMP

#pragma omp parallel


Then the master itself calls the parallel function

Page 11: Parallel Programming with OpenMP

#pragma omp parallel


Finally we call the runtime to synchronize threads with a barrier and suspend them

Page 12: Parallel Programming with OpenMP

#pragma omp parallel: data scope attributes

int main()
{
  int id;
  int a = 5;

#pragma omp parallel
  {
    id = omp_get_thread_num();
    if (id == 0)
      printf("Master: a = %d.", a*2);
    else
      printf("Slave: a = %d.", a);
  }
}

A slightly more complex example

Call the runtime to get the thread ID: every thread sees a different value

Master and slave threads access the same variable a

Page 13: Parallel Programming with OpenMP

#pragma omp parallel: data scope attributes


How to inform the compiler about these different behaviors?

Page 14: Parallel Programming with OpenMP

#pragma omp parallel: data scope attributes

int main()
{
  int id;
  int a = 5;

#pragma omp parallel shared(a) private(id)
  {
    id = omp_get_thread_num();
    if (id == 0)
      printf("Master: a = %d.", a*2);
    else
      printf("Slave: a = %d.", a);
  }
}

A slightly more complex example

Insert code to retrieve the address of the shared object from within each parallel thread

Allow symbol privatization: Each thread contains a private copy of this variable
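A compilable version of the example (my addition; the includes, the newline characters and the return statement are assumptions for illustration):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int id;
    int a = 5;

    /* a is shared by all threads; every thread gets its own private copy of id */
    #pragma omp parallel shared(a) private(id)
    {
        id = omp_get_thread_num();
        if (id == 0)
            printf("Master: a = %d.\n", a * 2);
        else
            printf("Slave: a = %d.\n", a);
    }
    return 0;
}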

Page 15: Parallel Programming with OpenMP

#pragma omp parallel: data scope attributes


Correctness issues

What if:
• a was not marked for shared access?
• id was not marked for private access?

Page 16: Parallel Programming with OpenMP

#pragma omp parallel: data scope attributes


(Figure: a single shared copy of a and a private copy of id in each thread. Serialization of accesses to shared data raises performance issues.)

Page 17: Parallel Programming with OpenMP

Sharing work among threads: the for directive

The parallel pragma instructs every thread to execute all of the code inside the block

If we encounter a for loop that we want to divide among threads, we use the for pragma

#pragma omp for
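A minimal sketch of the for directive in use (my addition): the iterations of the loop are divided among the threads of the enclosing parallel region.

#include <stdio.h>

#define N 10

int main(void)
{
    int a[N];
    int i;

    /* Combined construct: create a team of threads and split the loop iterations among them.
       The loop variable i is automatically made private. */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = i;

    for (i = 0; i < N; i++)
        printf("a[%d] = %d\n", i, a[i]);
    return 0;
}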

Page 18: Parallel Programming with OpenMP

#pragma omp for

int main()
{
#pragma omp parallel for
  for (i=0; i<10; i++)
    a[i] = i;
}

int main()
{
  omp_parallel_start(&parfun, …);
  parfun();
  omp_parallel_end();
}

int parfun(…)
{
  int LB = …;
  int UB = …;

  for (i=LB; i<UB; i++)
    a[i] = i;
}

The code of the for loop is moved inside the outlined function.

Page 19: Parallel Programming with OpenMP

#pragma omp for


Every thread works on a different subset of the iteration space..

Page 20: Parallel Programming with OpenMP

#pragma omp for


..since lower and upper boundaries (LB, UB) are computed locally

Page 21: Parallel Programming with OpenMP

The schedule clause: static loop partitioning

E.g., 12 iterations (N), 4 threads (Nthr)

#pragma omp for schedule(static)
for (i=0; i<12; i++)
  a[i] = i;

Data chunk:   C  = ceil(N / Nthr)   (here: 3 iterations per thread)
Lower bound:  LB = C * TID
Upper bound:  UB = min(C * (TID + 1), N)

Thread ID (TID):   0   1   2   3
LB:                0   3   6   9
UB:                3   6   9  12

Useful for:
• Simple, regular loops
• Iterations with equal duration
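The following sketch (my addition, not the actual compiler output) reproduces by hand what schedule(static) computes: each thread derives its own chunk boundaries from its thread ID using the formulas above.

#include <stdio.h>
#include <omp.h>

#define N 12

int main(void)
{
    int a[N];
    int i;

    #pragma omp parallel
    {
        int tid  = omp_get_thread_num();
        int nthr = omp_get_num_threads();
        int C    = (N + nthr - 1) / nthr;                     /* C  = ceil(N / Nthr)        */
        int LB   = C * tid;                                   /* LB = C * TID               */
        int UB   = (C * (tid + 1) < N) ? C * (tid + 1) : N;   /* UB = min(C * (TID + 1), N) */
        int j;

        for (j = LB; j < UB; j++)   /* each thread touches only its own chunk */
            a[j] = j;
    }

    for (i = 0; i < N; i++)
        printf("a[%d] = %d\n", i, a[i]);
    return 0;
}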

Page 22: Parallel Programming with OpenMP

The schedule clause: static loop partitioning


What happens with static scheduling when iterations have different durations?

Page 23: Parallel Programming with OpenMP

The schedule clause: static loop partitioning


#pragma omp for schedule(static)
for (i=0; i<12; i++) {
  int start = rand();
  int count = 0;

  while (start++ < 256)
    count++;

  a[count] = foo();
}

A variable amount of work in each iteration

(Figure: iterations 1-12 are statically assigned in fixed chunks, Thread 1 to Core 1, Thread 2 to Core 2, Thread 3 to Core 3, Thread 4 to Core 4. Since iterations take different amounts of time, the cores finish at very different times: UNBALANCED workloads.)

Page 24: Parallel Programming with OpenMP

The schedule clause: dynamic loop partitioning


schedule(dynamic)

A unit of work is generated for every single iteration

(Figure: iteration space, iterations 1-12)

Page 25: Parallel Programming with OpenMP

The schedule clause: dynamic loop partitioning

(Figure: iterations 1-12 of the iteration space are placed as tasks in a work queue managed by the runtime environment)

A work queue is implemented in the runtime environment, where tasks are stored

int parfun(…)
{
  int LB, UB;

  GOMP_loop_dynamic_next(&LB, &UB);

  for (i=LB; i<UB; i++) {…}
}

Work is fetched by threads from the runtime environment through synchronized accesses to the work queue
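At the source level this policy is requested with the schedule(dynamic) clause; a small self-contained sketch (my addition, with a chunk size of 1 and a rand()-based workload chosen only for illustration):

#include <stdio.h>
#include <stdlib.h>

#define N 12

int main(void)
{
    int b[N];
    int i;

    /* Each iteration becomes a separate unit of work (chunk size 1),
       fetched by the threads from the runtime work queue as they become free. */
    #pragma omp parallel for schedule(dynamic, 1)
    for (i = 0; i < N; i++) {
        int start = rand() % 256;   /* variable amount of work per iteration */
        int count = 0;

        while (start++ < 256)
            count++;

        b[i] = count;
    }

    for (i = 0; i < N; i++)
        printf("iteration %d did %d units of work\n", i, b[i]);
    return 0;
}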

Page 26: Parallel Programming with OpenMP

The schedule clause: dynamic loop partitioning

(Figure: with schedule(dynamic), iterations 1-12 are fetched by Cores 1-4 as they become free, so all cores finish at roughly the same time: BALANCED workloads. Remember the results with static scheduling, where the workloads were unbalanced. Speedup!)

Page 27: Parallel Programming with OpenMP

Sharing work among threads: the sections directive

The for pragma allows us to exploit data parallelism in loops

OpenMP also provides a directive to exploit task parallelism

#pragma omp sections
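A minimal sketch of the construct (my addition): each section is executed exactly once, by one of the threads of the team.

#include <stdio.h>

int main(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("task A, executed once by some thread\n");

        #pragma omp section
        printf("task B, possibly executed concurrently by another thread\n");
    }
    return 0;
}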

Page 28: Parallel Programming with OpenMP

Task Parallelism Example

Identify independent nodes in the task graph, and outline parallel computation with the sections directive

int main()
{
  v = alpha();
  w = beta();
  y = delta();
  x = gamma(v, w);
  z = epsilon(x, y);

  printf("%f\n", z);
}

FIRST SOLUTION

Page 29: Parallel Programming with OpenMP

Task Parallelism Example

Barriers implied here!

int main()
{
#pragma omp parallel sections
  {
    v = alpha();
    w = beta();
  }
#pragma omp parallel sections
  {
    y = delta();
    x = gamma(v, w);
  }
  z = epsilon(x, y);

  printf("%f\n", z);
}

Page 30: Parallel Programming with OpenMP

Task Parallelism Example

Each parallel task within a sections block identifies a section

Page 31: Parallel Programming with OpenMP

Task Parallelism Example

int main()
{
#pragma omp parallel sections
  {
    #pragma omp section
    v = alpha();
    #pragma omp section
    w = beta();
  }
#pragma omp parallel sections
  {
    #pragma omp section
    y = delta();
    #pragma omp section
    x = gamma(v, w);
  }
  z = epsilon(x, y);

  printf("%f\n", z);
}

Each parallel task within a sections block identifies a section

Page 32: Parallel Programming with OpenMP

Task Parallelism Example


SECOND SOLUTION

Page 33: Parallel Programming with OpenMP

Task Parallelism Example

int main()
{
#pragma omp parallel sections
  {
    v = alpha();
    w = beta();
    y = delta();
  }
#pragma omp parallel sections
  {
    x = gamma(v, w);
    z = epsilon(x, y);
  }

  printf("%f\n", z);
}

Barrier implied here!

Page 34: Parallel Programming with OpenMP

Task Parallelism Example

Each parallel task within a sections block identifies a section

Page 35: Parallel Programming with OpenMP

Task Parallelism Example

int main()
{
#pragma omp parallel sections
  {
    #pragma omp section
    v = alpha();
    #pragma omp section
    w = beta();
    #pragma omp section
    y = delta();
  }
#pragma omp parallel sections
  {
    #pragma omp section
    x = gamma(v, w);
    #pragma omp section
    z = epsilon(x, y);
  }

  printf("%f\n", z);
}

Each parallel task within a sections block identifies a section

Page 36: Parallel Programming with OpenMP

#pragma omp barrier

The most important synchronization mechanism in shared memory fork/join parallel programming

All threads participating in a parallel region wait until everybody has finished before computation flows on

This prevents later stages of the program from working with inconsistent shared data

It is implied at the end of parallel constructs, as well as for and sections (unless a nowait clause is specified)
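A small sketch (my addition) showing both an explicit barrier and the nowait clause that removes the implicit one at the end of a for construct:

#include <stdio.h>

#define N 100

int main(void)
{
    int a[N], b[N];

    #pragma omp parallel
    {
        int i;

        /* nowait: threads do not wait for each other at the end of this loop ... */
        #pragma omp for nowait
        for (i = 0; i < N; i++)
            a[i] = i;

        /* ... so an explicit barrier is needed before b[] reads values of a[]
           that may have been written by other threads. */
        #pragma omp barrier

        #pragma omp for
        for (i = 0; i < N; i++)
            b[i] = a[N - 1 - i];
    }

    printf("b[0] = %d\n", b[0]);
    return 0;
}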

Page 37: Parallel Programming with OpenMP

#pragma omp critical

Critical Section: a portion of code that only one thread at a time may execute

We denote a critical section by putting the pragma

#pragma omp critical

in front of a block of C code
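A minimal sketch (my addition): without the critical directive the increment below would be a data race; with it, only one thread at a time executes the update.

#include <stdio.h>

int main(void)
{
    int hits = 0;

    #pragma omp parallel
    {
        /* Only one thread at a time may execute the protected statement. */
        #pragma omp critical
        hits++;
    }

    printf("hits = %d\n", hits);   /* equals the number of threads in the team */
    return 0;
}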

Page 38: Parallel Programming with OpenMP

π-finding code example

double area, pi, x;
int i, n;

#pragma omp parallel for private(x) shared(area)
for (i=0; i<n; i++) {
  x = (i + 0.5)/n;
  area += 4.0/(1.0 + x*x);
}
pi = area/n;

Synchronize accesses to shared variable area to avoid inconsistent results

Page 39: Parallel Programming with OpenMP

Race condition

Ensure atomic updates of the shared variable area to avoid a race condition, in which one thread may "race ahead" of another and ignore its changes

Page 40: Parallel Programming with OpenMP

Race condition (Cont’d)

Timeline:
• Thread A reads 11.667 into a local register
• Thread B reads 11.667 into a local register
• Thread A updates area with 11.667 + 3.765
• Thread B ignores the write from thread A and updates area with 11.667 + 3.563

Page 41: Parallel Programming with OpenMP

π-finding code example

double area, pi, x;
int i, n;

#pragma omp parallel for private(x) shared(area)
for (i=0; i<n; i++) {
  x = (i + 0.5)/n;
#pragma omp critical
  area += 4.0/(1.0 + x*x);
}
pi = area/n;

#pragma omp critical protects the code within its scope by acquiring a lock before entering the critical section and releasing it after execution
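For a single scalar update like this one, the #pragma omp atomic directive listed earlier among the synchronization directives is a lighter-weight alternative; a sketch (my addition, with n chosen arbitrarily):

#include <stdio.h>

int main(void)
{
    double area = 0.0, pi, x;
    int i, n = 1000000;   /* n chosen arbitrarily for this sketch */

    #pragma omp parallel for private(x) shared(area)
    for (i = 0; i < n; i++) {
        x = (i + 0.5) / n;
        /* atomic: protects a single scalar update, typically cheaper than a critical section */
        #pragma omp atomic
        area += 4.0 / (1.0 + x * x);
    }
    pi = area / n;

    printf("pi = %f\n", pi);
    return 0;
}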

Page 42: Parallel Programming with OpenMP

Correctness, not performance!

As a matter of fact, using locks makes execution sequential

To limit this effect we should use fine-grained locking (i.e. make critical sections as small as possible)

A simple statement that updates area in the previous example is translated by the compiler into many simpler instructions!

The programmer is not aware of the real granularity of the critical section

Page 43: Parallel Programming with OpenMP

Correctness, not performance!

This is a dump of the intermediate representation of the program within the compiler

Page 44: Parallel Programming with OpenMP

Correctness, not performance!

This is how the compiler represents the instruction area += 4.0/(1.0 + x*x);

Page 45: Parallel Programming with OpenMP

Correctness, not performance!

(Compiler IR, annotated: a call into the runtime to acquire the lock, the lock-protected operations (the critical section), and a call into the runtime to release the lock. THIS IS DONE AT EVERY LOOP ITERATION!)

Page 46: Parallel Programming with OpenMP

π-finding code example

double area, pi, x;
int i, n;

#pragma omp parallel for private(x) shared(area)
for (i=0; i<n; i++) {
  x = (i + 0.5)/n;
#pragma omp critical
  area += 4.0/(1.0 + x*x);
}
pi = area/n;

(Figure: execution timeline on Cores 1-4. Only a small fraction of the time is spent in parallel work; most of it goes into executing the critical section sequentially or waiting for the lock.)

Page 47: Parallel Programming with OpenMP

Correctness, not performance!

A programming pattern such as area += 4.0/(1.0 + x*x); in which we:
• Fetch the value of an operand
• Add a value to it
• Store the updated value
is called a reduction, and is commonly supported by parallel programming APIs

OpenMP takes care of storing partial results in private variables and combining partial results after the loop
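To see what this support does conceptually, here is a hand-written equivalent (my sketch, not the compiler's actual output): each thread accumulates into a private partial sum, and the partial sums are combined under a critical section once per thread, after the loop.

#include <stdio.h>

int main(void)
{
    double area = 0.0, pi, x;
    int i, n = 1000000;   /* n chosen arbitrarily for this sketch */

    #pragma omp parallel private(x)
    {
        double partial = 0.0;              /* private partial sum */

        #pragma omp for
        for (i = 0; i < n; i++) {
            x = (i + 0.5) / n;
            partial += 4.0 / (1.0 + x * x);
        }

        /* combine the partial results once per thread, not once per iteration */
        #pragma omp critical
        area += partial;
    }
    pi = area / n;

    printf("pi = %f\n", pi);
    return 0;
}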

Page 48: Parallel Programming with OpenMP

Correctness, not performance!

double area, pi, x;
int i, n;

#pragma omp parallel for private(x) reduction(+:area)
for (i=0; i<n; i++) {
  x = (i + 0.5)/n;
  area += 4.0/(1.0 + x*x);
}
pi = area/n;

The reduction(+:area) clause instructs the compiler to create private copies of the area variable for every thread. At the end of the loop, the partial sums are combined into the shared area variable.
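The clause is not tied to this example; any associative accumulation can be written the same way. A small sketch (my addition) computing a dot product:

#include <stdio.h>

#define N 1000

int main(void)
{
    double a[N], b[N], dot = 0.0;
    int i;

    for (i = 0; i < N; i++) {      /* sequential initialization */
        a[i] = 1.0;
        b[i] = 2.0;
    }

    /* Each thread accumulates into a private copy of dot;
       the copies are summed into the shared dot after the loop. */
    #pragma omp parallel for reduction(+:dot)
    for (i = 0; i < N; i++)
        dot += a[i] * b[i];

    printf("dot = %f\n", dot);     /* prints 2000.000000 */
    return 0;
}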

Page 49: Parallel Programming with OpenMP

Correctness, not performance!


The shared variable is only updated at the end of the loop, when partial sums are computed

Each thread computes partial sums on private copies of the reduction variable

Page 50: Parallel Programming with OpenMP

Correctness, not performance!


(Figure: execution timeline on Cores 1-4 using the reduction clause.)

Page 51: Parallel Programming with OpenMP

(Figure: per-core execution timelines, continued.)

Page 52: Parallel Programming with OpenMP

Summary

Page 53: Parallel Programming with OpenMP

Customizing OpenMP for Efficient Exploitation of the Memory Hierarchy

Memory latency is well recognized as a severe performance bottleneck

MPSoCs feature complex memory hierarchy, with multiple cache levels, private and shared on-chip and off-chip memories

Using the memory hierarchy efficiently is of the utmost importance to exploit the computational power of MPSoCs, but..

Page 54: Parallel Programming with OpenMP

Customizing OpenMP for Efficient Exploitation of the Memory Hierarchy

It is a difficult task, requiring deep understanding of the application and its memory access pattern

The OpenMP standard doesn’t provide any facilities to deal with data placement and partitioning

Customization of the programming interface would bring the advantages of OpenMP to the MPSoC world

Page 55: Parallel Programming with OpenMP

Extending OpenMP to support Data Distribution

We need a custom directive that enables specific code analysis and transformation

When static code analysis can’t tell how to distribute data, we must rely on profiling

The runtime is responsible for exploiting this information to efficiently map arrays to memories

Page 56: Parallel Programming with OpenMP

Extending OpenMP to support Data Distribution

The entire process is driven by the custom #pragma omp distributed directive

Before the transformation:

{
  int A[m];
  float B[n];

#pragma omp distributed (A, B)
}

After the transformation:

{
  int *A;
  float *B;

  A = distributed_malloc (m);
  B = distributed_malloc (n);
}

Originally stack-allocated arrays are transformed into pointers to allow for their explicit placement throughout the memory hierarchy within the program

Page 57: Parallel Programming with OpenMP

Extending OpenMP to support Data Distribution

The transformed program invokes the runtime to retrieve profile information, which drives data placement

When no profile information is found, the distributed_malloc returns a pointer to the shared memory

Page 58: Parallel Programming with OpenMP

Data partitioning

The OpenMP model is focused on loop parallelism

In this parallelization scheme multiple threads may access different sections (discontiguous addresses) of shared arrays

Data partitioning is the process of tiling data arrays and placing the tiles in memory such that a maximum number of accesses are satisfied from local memory

The most obvious implementation of this concept is the data cache, but..
• Inter-array discontiguity often causes cache conflicts
• Embedded systems impose constraints on energy, predictability and real-time behavior that often make caches unsuitable

Page 59: Parallel Programming with OpenMP

A simple example

#pragma omp parallel for
for (i = 0; i < 4; i++)
  for (j = 0; j < 6; j++)
    A[ i ][ j ] = 1.0;

(Figure: the 4x6 iteration space, elements (0,0) through (3,5). Target architecture: four CPUs, each with a private scratchpad memory (SPM), connected through an interconnect to a shared memory.)

The iteration space is partitioned between the processors

Page 60: Parallel Programming with OpenMP

A simple example


Data space overlaps with iteration space. Each processor accesses a different tile

The array element A(i,j) is accessed with the loop induction variables

Page 61: Parallel Programming with OpenMP

A simple example


The compiler can actually split the matrix A(i,j) into four smaller arrays and allocate them onto the scratchpads. No access to remote memories through the bus is needed, since data is allocated locally.

Page 62: Parallel Programming with OpenMP

Another example

#pragma omp parallel for
for (i = 0; i < 4; i++)
  for (j = 0; j < 6; j++)
    hist[A[i][j]]++;

(Figure: the 4x6 iteration space over array A, on the same four-CPU architecture with per-CPU SPMs, an interconnect and a shared memory.)

Different locations within array hist are accessed by many processors

(Figure: the values stored in A, e.g.
3 7 7 5 4 4
3 2 2 3 0 0
1 1 0 2 3 5
1 1 4 5 5 4
are used to index the shared hist array.)

Page 63: Parallel Programming with OpenMP

Another example

In this case static code analysis can’t tell anything about the array access pattern

How to decide the most efficient partitioning?
• Split the array in as many tiles as there are processors
• Use access count information to map each tile to the processor that has the most accesses to it
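A sketch of that mapping decision (my addition; the access counts are the ones from the table on the following slide, and the data layout is hypothetical): for each tile, pick the processor with the highest access count as its owner.

#include <stdio.h>

#define NTILES 4
#define NPROCS 4

int main(void)
{
    /* Hypothetical profile data: access_count[t][p] = number of accesses
       of processor p to tile t (values taken from the example below). */
    int access_count[NTILES][NPROCS] = {
        {1, 2, 3, 2},   /* TILE 1 */
        {1, 4, 2, 0},   /* TILE 2 */
        {3, 0, 1, 4},   /* TILE 3 */
        {2, 0, 0, 1}    /* TILE 4 */
    };
    int owner[NTILES];

    for (int t = 0; t < NTILES; t++) {
        int best = 0;
        for (int p = 1; p < NPROCS; p++)
            if (access_count[t][p] > access_count[t][best])
                best = p;
        owner[t] = best;    /* map tile t to the SPM of the processor using it most */
    }

    for (int t = 0; t < NTILES; t++)
        printf("tile %d -> processor %d\n", t + 1, owner[t] + 1);
    return 0;
}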

Page 64: Parallel Programming with OpenMP

Another example


Now processors need to access remote scratchpads, since they work on multiple tiles!!!


Access counts per tile:
TILE 1: PROC1 1, PROC2 2, PROC3 3, PROC4 2
TILE 2: PROC1 1, PROC2 4, PROC3 2, PROC4 0
TILE 3: PROC1 3, PROC2 0, PROC3 1, PROC4 4
TILE 4: PROC1 2, PROC2 0, PROC3 0, PROC4 1

Page 65: Parallel Programming with OpenMP

Problem with data partitioning

If there is no overlap between the iteration space and the data space, multiple processors may need to access different tiles

In this case data partitioning introduces addressing difficulties, because the data tiles can become discontiguous in physical memory

How do we generate efficient code to access data when performing loop and data partitioning?

We can further extend the OpenMP programming interface to deal with that!

The programmer only has to specify the intention of partitioning an array throughout the memory hierarchy, and the compiler does the necessary instrumentation

Page 66: Parallel Programming with OpenMP

Code Instrumentation

In general, the steps for addressing an array element using tiling are:
• Compute the offset w.r.t. the base address
• Identify the tile to which this element belongs
• Re-compute the index relative to the current tile
• Load the tile base address from a metadata array

This metadata array is populated during the memory allocation step of the tool-flow. It relies on access count information to figure out the most efficient mapping of array tiles to memories.
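A sketch of the four steps (my addition; the array shape, the tile size and the layout of the tiles[] metadata are assumptions, not the tool-flow's actual data structures):

/* Address element A[i][j] of a ROWS x COLS array partitioned into
   row-block tiles of TILE_ROWS rows each. */
#include <stdio.h>

#define ROWS       4
#define COLS       6
#define TILE_ROWS  1
#define TILE_ELEMS (TILE_ROWS * COLS)
#define NTILES     (ROWS / TILE_ROWS)

int tile_storage[NTILES][TILE_ELEMS];          /* stand-in for per-SPM buffers */
int *tiles[NTILES];                            /* metadata: base address of each tile */

int main(void)
{
    for (int t = 0; t < NTILES; t++)           /* populated by the allocation step */
        tiles[t] = tile_storage[t];

    int i = 2, j = 3;

    int offset = i * COLS + j;                 /* 1. offset w.r.t. the array base    */
    int tile   = offset / TILE_ELEMS;          /* 2. tile the element belongs to     */
    int index  = offset % TILE_ELEMS;          /* 3. index relative to that tile     */
    int *base  = tiles[tile];                  /* 4. tile base address from metadata */

    base[index] = 42;                          /* the actual access: A[i][j] = 42    */
    printf("A[%d][%d] stored in tile %d at local index %d\n", i, j, tile, index);
    return 0;
}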

Page 67: Parallel Programming with OpenMP

Extending OpenMP to support data partitioning

Before the instrumentation:

#pragma omp parallel tiled(A)
{
  /* Access memory */
  A[i][j] = foo();
}

After the instrumentation:

{
  /* Compute offset, tile and index for the distributed array */
  int offset = …;
  int tile = …;
  int index = …;

  /* Read tile base address */
  int *base = tiles[dvar][tile];

  /* Access memory */
  base[index] = foo();
}

The instrumentation process is driven by the custom tiled clause, which can be coupled with every parallel and work-sharing construct.