
Parallel Programming Models

Jihad El-Sana

These slides are based on the book: Introduction to Parallel Computing, Blaise Barney, Lawrence Livermore National Laboratory


Overview

• Parallel programming models in common use:
  – Shared Memory
  – Threads
  – Message Passing
  – Data Parallel
  – Hybrid

• Parallel programming models are abstractions above hardware and memory architectures.


Shared Memory Model

• Tasks share a common address space, which they read and write asynchronously.

• Various mechanisms such as locks / semaphores may be used to control access to the shared memory.

• An advantage of this model, from the programmer's point of view, is that there is no notion of data "ownership", so there is no need to explicitly specify the communication of data between tasks.

• Program development can often be simplified.
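As a rough illustration (not from the original slides), the sketch below uses OpenMP in C to show tasks reading and writing a shared address space, with a critical section playing the role of the lock/semaphore mentioned above; it assumes a compiler with OpenMP support (e.g. the -fopenmp flag).

```c
/* Minimal shared-memory sketch with OpenMP (compile with -fopenmp).
   All threads read and write the same array asynchronously; the critical
   section acts as a lock so only one thread updates the shared total at a time. */
#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    double data[N];          /* shared address space: visible to every thread */
    double total = 0.0;

    for (int i = 0; i < N; i++) data[i] = i * 0.5;

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        double x = data[i] * data[i];   /* independent work on shared data */
        #pragma omp critical            /* lock-like mechanism around the shared write */
        total += x;
    }

    printf("total = %f\n", total);
    return 0;
}
```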


Disadvantage

• It is difficult to understand and manage data locality.
  – Keeping data local to the processor that works on it conserves memory accesses, cache refreshes and bus traffic that occur when multiple processors use the same data.
  – Unfortunately, controlling data locality is hard to understand and may be beyond the control of the average user.
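A hypothetical sketch of the locality issue: the two OpenMP loops below touch the same array, but the blocked version gives each thread a contiguous region, while the interleaved version spreads each thread's accesses across cache lines that other threads also touch; the array size and thread count are arbitrary.

```c
#include <stdio.h>
#include <omp.h>

#define N (1 << 20)
static double a[N];

int main(void) {
    int nthreads = 4;               /* arbitrary choice for illustration */

    /* Blocked: each thread owns one contiguous block of the array (good locality). */
    #pragma omp parallel num_threads(nthreads)
    {
        int t = omp_get_thread_num();
        int chunk = N / nthreads;
        for (int i = t * chunk; i < (t + 1) * chunk; i++)
            a[i] += 1.0;
    }

    /* Interleaved: threads touch alternating elements, so neighbouring elements
       (often in the same cache line) are used by different threads, which
       increases cache refreshes and bus traffic. */
    #pragma omp parallel num_threads(nthreads)
    {
        int t = omp_get_thread_num();
        for (int i = t; i < N; i += nthreads)
            a[i] += 1.0;
    }

    printf("a[0] = %f\n", a[0]);
    return 0;
}
```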


Implementations

• The native compilers translate user program variables into actual memory addresses, which are global.

• No common distributed memory platform implementations currently exist.

• Some implementations do provide a shared memory view of data even though the physical memory of the machine is distributed; this is often described as virtual shared memory.


Threads Model

• A single process can have multiple, concurrent execution paths.

• The main program loads and acquires all of the necessary system and user resources.

• It performs some serial work, and then creates a number of tasks (threads) that run concurrently.
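A minimal sketch of this structure using POSIX threads (names such as worker and NTHREADS are illustrative): the main program does some serial work and then creates threads that run concurrently until joined.

```c
/* Threads model sketch with pthreads (link with -lpthread). */
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4

void *worker(void *arg) {
    long id = (long)arg;                 /* each thread has its own local data */
    printf("thread %ld doing its share of the work\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];

    printf("main: serial setup work\n"); /* main acquires resources, does serial work */

    for (long t = 0; t < NTHREADS; t++)  /* create concurrent execution paths */
        pthread_create(&threads[t], NULL, worker, (void *)t);

    for (long t = 0; t < NTHREADS; t++)  /* main thread remains until all threads finish */
        pthread_join(threads[t], NULL);

    return 0;
}
```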


Threads Cont.

• The work of a thread can best be described as a subroutine within the main program.
• All threads share the process's memory space.
• Each thread also has its own local data.
• Threads save the overhead of replicating a program's resources.
• Threads communicate with each other through global memory.
• Threads require synchronization constructs to ensure that no more than one thread is updating the same global address at any time.
• Threads can come and go, but the main thread remains present to provide the necessary shared resources until the application has completed.
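The sketch below illustrates the synchronization point above, assuming POSIX threads: several threads update one global counter, and a mutex ensures that only one of them writes the shared address at a time.

```c
/* Threads communicating through global memory, protected by a mutex. */
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
#define ITERS 100000

long shared_counter = 0;                          /* global memory shared by all threads */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *add(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);                /* synchronization construct */
        shared_counter++;                         /* safe update of the same global address */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, add, NULL);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
    printf("counter = %ld (expected %d)\n", shared_counter, NTHREADS * ITERS);
    return 0;
}
```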


Message Passing Model

• The message passing model uses a set of tasks that use their own local memory during computation.

• Multiple tasks can reside on the same physical machine as well as across an arbitrary number of machines.


Message Passing Model Cont.

• Tasks exchange data through communications by sending and receiving messages.

• Data transfer usually requires cooperative operations to be performed by each process.

• The communicating processes may exist on the same machine or on different machines.
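A minimal message-passing sketch using MPI (assuming an MPI installation; run with, e.g., mpirun -np 2): each task keeps data in its own local memory, and the transfer only happens when a send on one task is matched by a receive on the other.

```c
/* Message passing model sketch: explicit, cooperative send/receive with MPI. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                          /* data in task 0's local memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* cooperative send ...          */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                         /* ... matched by a receive      */
        printf("task 1 received %d from task 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```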


Data Parallel Model

• Most of the parallel work focuses on performing operations on a data set.

• The data set is typically organized into a common structure.

• A set of tasks works collectively on the same data structure; however, each task works on a different partition of that data structure.


Data Parallel Model Cont.

• Tasks perform the same operation on their partition of work.

• On shared memory architectures, all tasks may have access to the data structure through global memory. On distributed memory architectures the data structure is split up and resides as "chunks" in the local memory of each task.
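A small data-parallel sketch in OpenMP (shared-memory variant): every thread performs the same operation, each on its own partition of one array; on a distributed memory machine the array would instead be split into per-task chunks.

```c
/* Data-parallel sketch: same operation, different partitions of one array. */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];

    for (int i = 0; i < N; i++) a[i] = i;

    /* The runtime divides the iteration space among threads; each thread
       applies the same operation to its own partition of the array. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        b[i] = 2.0 * a[i] + 1.0;

    printf("b[N-1] = %f\n", b[N - 1]);
    return 0;
}
```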


Designing Parallel Algorithms

• The programmer is typically responsible for both identifying and actually implementing parallelism.

• Manually developing parallel codes is a time consuming, complex, error-prone and iterative process.

• Currently, the most common type of tool used to automatically parallelize a serial program is a parallelizing compiler or pre-processor.


A Parallelizing Compiler

• Fully Automatic
  – The compiler analyzes the source code and identifies opportunities for parallelism.
  – The analysis includes identifying inhibitors to parallelism and possibly a cost weighting on whether or not the parallelism would actually improve performance.
  – Loops (do, for) are the most frequent target for automatic parallelization.

• Programmer Directed
  – Using "compiler directives" or possibly compiler flags, the programmer explicitly tells the compiler how to parallelize the code (see the sketch below).
  – May be used in conjunction with some degree of automatic parallelization.
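A sketch of the programmer-directed approach, assuming OpenMP: the directive tells the compiler to parallelize the loop and how to combine the partial sums; when the enabling flag (e.g. -fopenmp) is absent, the directive is ignored and the loop simply runs serially.

```c
/* Compiler-directive sketch: the programmer tells the compiler how to
   parallelize this loop; the reduction clause combines per-thread sums safely. */
#include <stdio.h>

#define N 1000000

int main(void) {
    static double x[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) x[i] = 0.001 * i;

    /* Directive: parallelize this loop and reduce the partial sums into `sum`. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    printf("sum = %f\n", sum);
    return 0;
}
```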


Automatic Parallelization Limitations

• Wrong results may be produced.
• Performance may actually degrade.
• Much less flexible than manual parallelization.
• Limited to a subset (mostly loops) of code.
• May actually not parallelize code if the analysis suggests there are inhibitors or the code is too complex.


The Problem & The Program

• Determine whether or not the problem is one that can actually be parallelized.
• Identify the program's hotspots:
  – Know where most of the real work is being done.
  – Profilers and performance analysis tools can help here.
  – Focus on parallelizing the hotspots and ignore those sections of the program that account for little CPU usage.
• Identify bottlenecks in the program:
  – Identify areas where the program is slow or bounded.
  – It may be possible to restructure the program or use a different algorithm to reduce or eliminate unnecessary slow areas.
• Identify inhibitors to parallelism. One common class of inhibitor is data dependence, as demonstrated by the Fibonacci sequence (see the sketch after this list).
• Investigate other algorithms if possible. This may be the single most important consideration when designing a parallel application.
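As a sketch of the Fibonacci inhibitor mentioned above: each iteration of the loop below reads values produced by the two previous iterations, so the iterations cannot be executed independently without changing the algorithm.

```c
/* Data dependence that inhibits parallelism: a loop-carried dependence. */
#include <stdio.h>

#define N 40

int main(void) {
    long fib[N];
    fib[0] = 0;
    fib[1] = 1;

    /* Iteration i reads the results of iterations i-1 and i-2,
       so the iterations must execute in order. */
    for (int i = 2; i < N; i++)
        fib[i] = fib[i - 1] + fib[i - 2];

    printf("fib[%d] = %ld\n", N - 1, fib[N - 1]);
    return 0;
}
```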


Partitioning

• Break the problem into discrete "chunks" of work that can be distributed to multiple tasks:
  – Domain decomposition
  – Functional decomposition


Domain Decomposition

• The data associated with a problem is decomposed.

• Each parallel task then works on a portion of the data.

• This partitioning can be done in different ways: rows, columns, blocks, cyclic, etc. (see the sketch below).
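A hypothetical sketch of two of these partitionings for a one-dimensional array: block decomposition assigns each task a contiguous range, while cyclic decomposition deals elements out round-robin.

```c
/* Block vs. cyclic decomposition of an array of size N among ntasks tasks. */
#include <stdio.h>

#define N 16

int main(void) {
    int ntasks = 4;

    /* Block decomposition: task t owns elements [t*chunk, (t+1)*chunk). */
    int chunk = N / ntasks;
    for (int t = 0; t < ntasks; t++)
        printf("block : task %d owns [%d, %d)\n", t, t * chunk, (t + 1) * chunk);

    /* Cyclic decomposition: element i belongs to task i %% ntasks. */
    for (int t = 0; t < ntasks; t++) {
        printf("cyclic: task %d owns", t);
        for (int i = t; i < N; i += ntasks)
            printf(" %d", i);
        printf("\n");
    }
    return 0;
}
```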


Functional Decomposition

• The problem is decomposed according to the work that must be done. Each task then performs a portion of the overall work.
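One possible sketch, using OpenMP sections with placeholder functions: the work, rather than the data, is divided, and each section runs a different part of the overall job concurrently (here the three functions are assumed to be independent).

```c
/* Functional decomposition sketch: each section performs a different portion of the work. */
#include <stdio.h>
#include <omp.h>

void read_input(void)    { printf("reading input\n"); }
void compute_model(void) { printf("computing model\n"); }
void write_output(void)  { printf("writing output\n"); }

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section
        read_input();
        #pragma omp section
        compute_model();
        #pragma omp section
        write_output();
    }
    return 0;
}
```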


Communications

• Cost of communications
• Latency vs. bandwidth
• Visibility of communications
• Synchronous vs. asynchronous communications
• Scope of communications
  – Point-to-point
  – Collective
• Efficiency of communications
• Overhead and complexity


Synchronization

• Barrier
• Lock / semaphore
• Synchronous communication operations
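A sketch of barrier synchronization using POSIX threads (pthread_barrier_t, available on most Linux systems): no thread begins its second phase until every thread has finished the first.

```c
/* Barrier sketch: all threads wait at the barrier before continuing. */
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4

pthread_barrier_t barrier;

void *phase_worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld: phase 1\n", id);
    pthread_barrier_wait(&barrier);      /* no thread starts phase 2 early */
    printf("thread %ld: phase 2\n", id);
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&threads[t], NULL, phase_worker, (void *)t);
    for (long t = 0; t < NTHREADS; t++)
        pthread_join(threads[t], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}
```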


Data Dependencies

• A dependence exists between program statements when the order of statement execution affects the results of the program.

• A data dependence results from multiple use of the same location(s) in storage by different tasks.

• Dependencies are important to parallel programming because they are one of the primary inhibitors to parallelism.
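A small illustration (not from the slides) of a dependence through storage: two tasks both read and write the same variable, so the final value depends on the order in which they execute; without extra synchronization the outcome is non-deterministic.

```c
/* Data dependence between tasks: both statements use the same storage location. */
#include <stdio.h>
#include <pthread.h>

int x = 5;                                                      /* shared location */

void *task_a(void *arg) { (void)arg; x = x * 2; return NULL; }  /* reads then writes x */
void *task_b(void *arg) { (void)arg; x = x + 3; return NULL; }  /* reads then writes x */

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, task_a, NULL);
    pthread_create(&b, NULL, task_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* (5*2)+3 = 13 or (5+3)*2 = 16 (or a lost update), depending on execution order */
    printf("x = %d\n", x);
    return 0;
}
```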


Load Balancing

• Load balancing refers to the practice of distributing work among tasks so that all tasks are kept busy all of the time. It can be considered a minimization of task idle time.

• Load balancing is important to parallel programs for performance reasons. For example, if all tasks are subject to a barrier synchronization point, the slowest task will determine the overall performance.
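As one hedged illustration of load balancing on shared memory, the OpenMP sketch below uses dynamic scheduling so that iterations of very uneven cost are handed out on demand; with a static division, the thread holding the heaviest iterations would determine the overall runtime, as noted above. The work function is a placeholder.

```c
/* Load-balancing sketch: dynamic scheduling of iterations with uneven cost. */
#include <stdio.h>
#include <omp.h>

#define N 1000

double work(int i) {             /* uneven cost: later iterations are heavier */
    double s = 0.0;
    for (int k = 0; k < i * 100; k++) s += k * 1e-9;
    return s;
}

int main(void) {
    double total = 0.0;

    /* schedule(dynamic) hands out iterations on demand, keeping all threads busy. */
    #pragma omp parallel for schedule(dynamic) reduction(+:total)
    for (int i = 0; i < N; i++)
        total += work(i);

    printf("total = %f\n", total);
    return 0;
}
```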