Page 1: Parallel Computing

Parallel Computing

Parallel

PVM (Parallel Virtual Machine)

MPI (Message Passing Interface)

CUDA (Compute Unified Device Architecture)

Parallel C# - Lightweight Parallelism

Page 2: Parallel Computing

There are several parallel programming models in common use:

Shared Memory

Threads

Message Passing

Data Parallel

Hybrid

https://computing.llnl.gov/tutorials/parallel_comp/

Page 3: Parallel Computing

Although it might not seem apparent, these models are NOT specific to a particular type of machine or memory architecture. In fact, any of these models can (theoretically) be implemented on any underlying hardware. Two examples:

1. Shared memory model on a distributed memory machine: the Kendall Square Research (KSR) ALLCACHE approach. Machine memory was physically distributed but appeared to the user as a single shared memory (global address space); generically, this approach is referred to as "virtual shared memory". Although KSR is no longer in business, there is no reason to suggest that a similar implementation will not be made available by another vendor in the future.

2. Message passing model on a shared memory machine: MPI on the SGI Origin. The SGI Origin employed the CC-NUMA type of shared memory architecture, where every task has direct access to global memory. However, the ability to send and receive messages with MPI, as is commonly done over a network of distributed memory machines, was not only implemented but very commonly used.

Parallel Programming Models are Abstract

https://computing.llnl.gov/tutorials/parallel_comp/

Page 4: Parallel Computing

In the shared-memory programming model, tasks share a common address space, which they read and write asynchronously.

Various mechanisms such as locks/semaphores may be used to control access to the shared memory.

An advantage of this model from the programmer's point of view is that the notion of data ownership is lacking, so there is no need to specify explicitly the communication of data between tasks. Program development can often be simplified.

An important disadvantage in terms of performance is that it becomes more difficult to understand and manage data locality.

Keeping data local to the processor that works on it reduces the memory accesses, cache refreshes, and bus traffic that occur when multiple processors use the same data.

Unfortunately, controlling data locality is hard to understand and beyond the control of the average user.

Shared Memory Model

https://computing.llnl.gov/tutorials/parallel_comp/
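A minimal sketch of this model, not part of the original slides, using POSIX threads in C: two tasks read and write the same address space, and a mutex serves as the lock controlling access.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                               /* shared data in the common address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                     /* serialize access to the shared counter */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)                                         /* compile with: gcc shared.c -pthread */
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);                /* 200000: no updates were lost */
    return 0;
}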

Page 5: Parallel Computing

https://computing.llnl.gov/tutorials/parallel_comp/

In the threads model of parallel programming, a single process can have multiple, concurrent execution paths.

Perhaps the simplest analogy that can be used to describe threads is the concept of a single program that includes a number of subroutines.

Threads are commonly associated with shared memory architectures and operating systems.

Threads Model

Page 6: Parallel Computing

https://computing.llnl.gov/tutorials/parallel_comp/

Unrelated standardization efforts have resulted in two very different implementations of threads: POSIX Threads and OpenMP.

POSIX Threads

* Library based; requires parallel coding
* Specified by the IEEE POSIX 1003.1c standard (1995)
* C language only
* Commonly referred to as Pthreads
* Most HW vendors offer Pthreads in addition to proprietary implementations
* Very explicit parallelism; requires significant programmer attention to detail

OpenMP

* Compiler directive based; can use serial code
* Endorsed by HW & SW vendors
* Portable/multi-platform, including Unix and Windows NT platforms
* Available in C/C++ and Fortran implementations
* Can be very easy and simple to use - provides for "incremental parallelism"

Two Types of Threads
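A hedged illustration, not from the original slides, of OpenMP's "incremental parallelism": the serial loop below becomes parallel by adding one compiler directive (compile with gcc -fopenmp). The equivalent Pthreads version would need explicit thread creation, work partitioning, and joining, which is the "very explicit parallelism" of the first column.

#include <stdio.h>

#define N 1000000

int main(void)
{
    static double x[N], y[N];
    double a = 2.0;

    for (int i = 0; i < N; i++) { x[i] = i; y[i] = 1.0; }

    #pragma omp parallel for          /* the directive, not the source code, expresses the parallelism */
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[42] = %f\n", y[42]);
    return 0;
}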

Page 7: Parallel Computing

https://computing.llnl.gov/tutorials/parallel_comp/

The message passing model demonstrates the following characteristics:

A set of tasks that use their own local memory during computation. Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines.

Tasks exchange data through communications by sending and receiving messages.

Data transfer usually requires cooperative operations to be performed by each process. For example, a send operation must have a matching receive operation.

Message Passing Model
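A minimal sketch of cooperative send/receive with MPI, not from the original slides: the send posted by task 0 must have a matching receive posted by task 1 (run with mpirun -np 2).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* send to task 1 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                              /* matching receive */
        printf("task 1 received %d from task 0\n", value);
    }

    MPI_Finalize();
    return 0;
}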

Page 8: Parallel Computing

https://computing.llnl.gov/tutorials/parallel_comp/

Data Parallel Model

The data parallel model demonstrates the following characteristics:

Most parallel work focuses on performing operations on a data set. The data set is typically organized into a common structure, such as an array or cube.

A set of tasks works collectively on the same data structure; however, each task works on a different partition of that structure.

Tasks perform the same operation on their partition of work, for example, "add 4 to every array element".

On shared memory architectures, all tasks may have access to the data structure through global memory. On distributed memory architectures the data structure is split up and resides as chunks in the local memory of each task.
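A hedged sketch of the data parallel model on distributed memory, not from the original slides: each MPI task holds one chunk of the array in its local memory and applies the same operation ("add 4") to its own partition.

#include <mpi.h>
#include <stdio.h>

#define N 16                           /* global array size (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;              /* assume size divides N evenly */
    int local[N];                      /* this task's partition (chunk elements used) */

    for (int i = 0; i < chunk; i++)
        local[i] = rank * chunk + i;   /* this task's slice of the global data */

    for (int i = 0; i < chunk; i++)
        local[i] += 4;                 /* same operation, own partition */

    printf("task %d: first element of its partition is now %d\n", rank, local[0]);
    MPI_Finalize();
    return 0;
}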

Page 9: Parallel Computing

https://computing.llnl.gov/tutorials/parallel_comp/

In this model, any two or more parallel programming models are combined.

Currently, a common example of a hybrid model is the combination of the message passing model (MPI) with either the threads model (POSIX threads) or the shared memory model (OpenMP). This hybrid model lends itself well to the increasingly common hardware environment of networked SMP machines.

Another common example of a hybrid model is combining the data parallel model with message passing. Data parallel implementations on distributed memory architectures use message passing to transmit data between tasks, transparently to the programmer.

Hybrid Parallel Model
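A hedged sketch of the MPI + OpenMP combination, not from the original slides (compile with something like mpicc -fopenmp hybrid.c):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* message passing between SMP nodes */

    #pragma omp parallel                    /* shared-memory threads within each node */
    {
        printf("MPI task %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}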

Page 10: Parallel Computing

SPMD is actually a "high level" programming model that can be built upon any combination of the previously mentioned parallel programming models.

A single program is executed by all tasks simultaneously.

At any moment in time, tasks can be executing the same or different instructions within the same program.

SPMD programs usually have logic programmed into them to allow different tasks to branch or conditionally execute only those parts of the program they are designed to execute. That is, tasks do not necessarily have to execute the entire program - perhaps only a portion of it.

All tasks may use different data

Single Program Multiple Data (SPMD)

https://computing.llnl.gov/tutorials/parallel_comp/
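A hedged SPMD sketch, not from the original slides; the function names are illustrative. Every task runs the same executable but branches on its rank, so each one executes only the part of the program it is designed to execute.

#include <mpi.h>
#include <stdio.h>

static void do_io_and_coordination(void) { printf("task 0: setup and I/O\n"); }
static void do_computation(int rank)     { printf("task %d: compute\n", rank); }

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        do_io_and_coordination();   /* only task 0 executes this branch */
    else
        do_computation(rank);       /* the remaining tasks execute this one */

    MPI_Finalize();
    return 0;
}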

Page 11: Parallel Computing

https://computing.llnl.gov/tutorials/parallel_comp/

Like SPMD, MPMD is actually a "high level" programming model that can be built upon any combination of the previously mentioned parallel programming models.

MPMD applications typically have multiple executable object files (programs). While the application is being run in parallel, each task can be executing the same or different program as other tasks.

All tasks may use different data

Multiple Program Multiple Data (MPMD)

Page 12: Parallel Computing

https://computing.llnl.gov/tutorials/parallel_comp/

Designing and developing parallel programs has characteristically been a very manual process. The programmer is typically responsible for both identifying and actually implementing parallelism.

Very often, manually developing parallel codes is a time consuming, complex, error-prone and iterative process.

For a number of years now, various tools have been available to assist the programmer with converting serial programs into parallel programs. The most common type of tool used to automatically parallelize a serial program is a parallelizing compiler or pre-processor.

Automatic vs. Manual Parallelization

Page 13: Parallel Computing

https://computing.llnl.gov/tutorials/parallel_comp/

A parallelizing compiler generally works in two different ways:

Fully Automatic

The compiler analyzes the source code and identifies opportunities for parallelism.

The analysis includes identifying inhibitors to parallelism and possibly a cost weighting on whether or not the parallelism would actually improve performance.

Loops (do, for) are the most frequent target for automatic parallelization.

Programmer Directed

Using "compiler directives" or possibly compiler flags, the programmer explicitly tells the compiler how to parallelize the code.

This may be used in conjunction with some degree of automatic parallelization.

Parallelizing Compilers
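As an illustration of why the programmer's knowledge matters, the hedged sketch below (not from the slides) shows the same loop twice: in the first version possible aliasing between the arrays is an inhibitor that a fully automatic parallelizer must assume, while in the second the programmer's restrict qualifier asserts independence, allowing a flag such as GCC's -ftree-parallelize-loops to parallelize it.

/* the compiler must assume a and b may overlap, so iterations
 * might not be independent */
void scale_maybe_aliased(double *a, const double *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}

/* the programmer guarantees the arrays do not overlap, so the
 * loop iterations are independent and safe to parallelize */
void scale_independent(double * restrict a, const double * restrict b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}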

Page 14: Parallel Computing

https://computing.llnl.gov/tutorials/parallel_comp/

Issues with Automatic Parallelization

If you are beginning with an existing serial code and have time or budget constraints, then automatic parallelization may be the answer. However, there are several important caveats that apply to automatic parallelization:

Wrong results may be produced

Performance may actually degrade

Much less flexible than manual parallelization

Limited to a subset (mostly loops) of code

May actually not parallelize code if the analysis suggests there are inhibitors or the code is too complex

Page 15: Parallel Computing

https://computing.llnl.gov/tutorials/parallel_comp/

Undoubtedly, the first step in developing parallel software is to first understand the problem that you wish to solve in parallel. If you are starting with a serial program, this necessitates understanding the existing code also.

Before spending time in an attempt to develop a parallel solution for a problem, determine whether or not the problem is one that can actually be parallelized.

Understanding the Problem

Page 16: Parallel Computing

https://computing.llnl.gov/tutorials/parallel_comp/

Example of a Parallelizable Problem:

Calculate the potential energy for each of several thousand independent conformations of a molecule. When done, find the minimum energy conformation.

This problem can be solved in parallel: each of the molecular conformations is independently determinable, and the calculation of the minimum energy conformation is also a parallelizable problem.

Example of a Non-parallelizable Problem:

Calculation of the Fibonacci series (1, 1, 2, 3, 5, 8, 13, 21, ...) by use of the formula:

F(n) = F(n-1) + F(n-2)

This is a non-parallelizable problem as stated, because the calculation entails dependent rather than independent computations: the value of F(n) uses those of both F(n-1) and F(n-2), so the terms cannot be calculated independently and therefore not in parallel. A code sketch of both cases follows.
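A hedged sketch of both cases in C, not from the original slides; potential_energy() here is a stand-in for the real chemistry, and the OpenMP directive is one way to exploit the independent iterations (compile with gcc -fopenmp).

#include <stdio.h>

#define M 10000

/* placeholder for the real calculation; each call is independent */
static double potential_energy(int conformation)
{
    return (double)(conformation % 7) * 0.5;
}

int main(void)
{
    static double energy[M];
    double min = 1e30;

    /* Parallelizable: every iteration is independent of the others. */
    #pragma omp parallel for
    for (int i = 0; i < M; i++)
        energy[i] = potential_energy(i);

    for (int i = 0; i < M; i++)          /* then find the minimum energy */
        if (energy[i] < min) min = energy[i];

    /* Not parallelizable as written: F(n) depends on F(n-1) and F(n-2). */
    long f1 = 1, f2 = 1, fn = 1;
    for (int n = 3; n <= 20; n++) { fn = f1 + f2; f1 = f2; f2 = fn; }

    printf("minimum energy = %.2f, F(20) = %ld\n", min, fn);
    return 0;
}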

Page 17: Parallel Computing

https://computing.llnl.gov/tutorials/parallel_comp/

1. Identify the program's hotspots. Know where most of the real work is being done; the majority of scientific and technical programs accomplish most of their work in a few places. Profilers and performance analysis tools can help here. Focus on parallelizing the hotspots and ignore sections that account for little CPU usage.

2. Identify bottlenecks in the program. Are there areas that are disproportionately slow? It may be possible to restructure the program or use a different algorithm.

3. Identify inhibitors to parallelism. One common class of inhibitor is data dependence, as demonstrated by the Fibonacci sequence above.

4. Investigate other algorithms if possible. This may be the single most important consideration when designing a parallel application.

Methods to Improve Parallel Algorithm Performance

Page 18: Parallel Computing

https://computing.llnl.gov/tutorials/parallel_comp/

One of the first steps in designing a parallel program is to break the problem into discrete chunks of work that can be distributed to multiple tasks. This is known as decomposition or partitioning.

There are two basic ways to partition computational work among parallel tasks:

domain decomposition and functional decomposition.

Partitioning

Page 19: Parallel Computing

https://computing.llnl.gov/tutorials/parallel_comp/

In this type of partitioning, the data associated with a problem is decomposed. Each parallel task then works on a portion of the data.

There are different ways to partition data:

Domain Decomposition

BLOCK and CYCLIC distributions
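A small sketch, not from the original slides, printing which global array indices each of P tasks would own under the two distributions named above:

#include <stdio.h>

#define N 12            /* global elements (illustrative) */
#define P 3             /* number of tasks (illustrative)  */

int main(void)
{
    for (int task = 0; task < P; task++) {
        printf("task %d BLOCK : ", task);
        int chunk = N / P;                       /* assume P divides N */
        for (int i = task * chunk; i < (task + 1) * chunk; i++)
            printf("%2d ", i);                   /* contiguous indices */

        printf("   CYCLIC: ");
        for (int i = task; i < N; i += P)
            printf("%2d ", i);                   /* strided, round-robin indices */
        printf("\n");
    }
    return 0;
}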

Page 20: Parallel Computing

https://computing.llnl.gov/tutorials/parallel_comp/

In this approach, the focus is on the computation that is to be performed rather than on the data manipulated by the computation. The problem is decomposed according to the work that must be done. Each task then performs a portion of the overall work.

Functional Decomposition

Page 21: Parallel Computing

https://computing.llnl.gov/tutorials/parallel_comp/

Each program calculates the population of a given group, where each group's growth depends on that of its neighbors. As time progresses, each process calculates its current state, then exchanges information with the neighbor populations. All tasks then progress to calculate the state at the next time step.

Ecosystem Modeling - Example of Functional Decomposition

Page 22: Parallel Computing

https://computing.llnl.gov/tutorials/parallel_comp/

An audio signal data set is passed through four distinct computational filters. Each filter is a separate process. The first segment of data must pass through the first filter before progressing to the second. When it does, the second segment of data passes through the first filter. By the time the fourth segment of data is in the first filter, all four tasks are busy.

Signal Processing
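A hedged sketch of this pipeline with MPI, not from the original slides: each rank acts as one filter, segments enter at rank 0 and are handed down the chain, and the filter bodies are placeholders (run with mpirun -np 4).

#include <mpi.h>
#include <stdio.h>

#define SEGMENTS 8
#define LEN      4        /* samples per segment (illustrative) */

int main(int argc, char **argv)
{
    int rank, size;
    double seg[LEN];
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int s = 0; s < SEGMENTS; s++) {
        if (rank == 0)
            for (int i = 0; i < LEN; i++) seg[i] = s;   /* "read" the next segment */
        else
            MPI_Recv(seg, LEN, MPI_DOUBLE, rank - 1, s, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                /* take it from the previous filter */

        for (int i = 0; i < LEN; i++) seg[i] += rank;   /* placeholder filter work */

        if (rank < size - 1)
            MPI_Send(seg, LEN, MPI_DOUBLE, rank + 1, s, MPI_COMM_WORLD);
        else
            printf("segment %d left the last filter\n", s);
    }

    MPI_Finalize();
    return 0;
}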

Page 23: Parallel Computing

IMADS - a tool for functional decomposition

Intelligent Multiple Agent Development System

Input

Function Module

Neural Network Module

Rule-Based Module

Output

http://concurrent.us

Page 24: Parallel Computing

Parallel Virtual Machine

Page 25: Parallel Computing

PVM (Parallel Virtual Machine) is a portable message-passing programming system, designed to link separate host machines to form a "virtual machine": a single, manageable computing resource.

The virtual machine can be composed of hosts of varying types, in physically remote locations.

PVM applications can be composed of any number of separate processes, or components, written in a mixture of C, C++ and Fortran.

The system is portable to a wide variety of architectures, including workstations, multiprocessors, supercomputers and PCs.

PVM is a by-product of ongoing research at several institutions, and is made available to the public free of charge.

NetLib (Bob Manchek) - http://www.netlib.org/pvm3/

Parallel Virtual Machine

Page 26: Parallel Computing

Parallel computing using a system such as PVM may be approached from three fundamental viewpoints, based on the organization of the computing tasks.

Crowd Computing

Tree Computation

Hybrid

PVM Modes of Operation

http://www.netlib.org/pvm3/book/

Page 27: Parallel Computing

Crowd Computing is a collection of closely related processes, typically executing the same code, that perform computations on different portions of the workload, usually with periodic exchange of intermediate results. This paradigm can be further subdivided into two categories:

The master-slave (or host-node) model, in which a separate "control" program termed the master is responsible for process spawning, initialization, collection and display of results, and perhaps timing of functions.

The slave programs perform the actual computation involved; they either are allocated their workloads by the master (statically or dynamically) or perform the allocations themselves.

The node-only model where multiple instances of a single program execute, with one process (typically the one initiated manually) taking over the non-computational responsibilities in addition to contributing to the computation itself.

Crowd Computing

http://www.netlib.org/pvm3/book/

Page 28: Parallel Computing

PVM Mandelbrot - example of Crowd Computing

{Master Mandelbrot algorithm.}

{Initial placement}
for i := 0 to NumWorkers - 1
    pvm_spawn(<worker name>)                 {Start up worker i}
    pvm_send(<worker tid>,999)               {Send task to worker i}
endfor

{Receive-send}
while (WorkToDo)
    pvm_recv(888)                            {Receive result}
    pvm_send(<available worker tid>,999)     {Send next task to available worker}
    display result
endwhile

{Gather remaining results.}
for i := 0 to NumWorkers - 1
    pvm_recv(888)                            {Receive result}
    pvm_kill(<worker tid i>)                 {Terminate worker i}
    display result
endfor

{Worker Mandelbrot algorithm.}
while (true)
    pvm_recv(999)                            {Receive task}
    result := MandelbrotCalculations(task)   {Compute result}
    pvm_send(<master tid>,888)               {Send result to master}
endwhile

http://www.netlib.org/pvm3/book/

Page 29: Parallel Computing

The second model supported by PVM is termed a tree computation. In this scenario, processes are spawned in a tree-like manner, thereby establishing a tree-like, parent-child relationship (as opposed to crowd computations, where a star-like relationship exists).

This paradigm, although less commonly used, is an extremely natural fit to applications where the total workload is not known a priori, for example, in branch-and-bound algorithms, alpha-beta search, and recursive divide-and-conquer algorithms.

Tree Computation

http://www.netlib.org/pvm3/book/

Page 30: Parallel Computing

{ Spawn and partition list based on a broadcast tree pattern. }
for i := 1 to N, such that 2^N = NumProcs
    forall processors P such that P < 2^i
        pvm_spawn(...)                       {process id P XOR 2^i}
        if P < 2^(i-1) then
            midpt := PartitionList(list)
            {Send list[0..midpt] to P XOR 2^i}
            pvm_send((P XOR 2^i),999)
            list := list[midpt+1..MAXSIZE]
        else
            pvm_recv(999)                    {receive the list}
        endif
    endfor
endfor

{ Sort remaining list. }
Quicksort(list[midpt+1..MAXSIZE])

{ Gather/merge sorted sub-lists. }
for i := N downto 1, such that 2^N = NumProcs
    forall processors P such that P < 2^i
        if P > 2^(i-1) then
            pvm_send((P XOR 2^i),888)        {Send list to P XOR 2^i}
        else
            pvm_recv(888)                    {receive temp list}
            merge templist into list
        endif
    endfor
endfor

PVM Split-Sort-Merge - example of Tree Computation

code written for a 4-node HyperCube - http://www.netlib.org/pvm3/book/

Page 31: Parallel Computing

The third model, which we term hybrid, can be thought of as a combination of the tree model and crowd model. Essentially, this paradigm possesses an arbitrary spawning structure: that is, at any point during application execution, the process relationship structure may resemble an arbitrary and changing graph.

Hybrid

http://www.netlib.org/pvm3/book/

Page 32: Parallel Computing

Message Passing Interface

Page 33: Parallel Computing

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int size, rank;
    int length;
    char name[80];

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &length);

    /* report this task's rank, the total number of tasks, and the host name */
    printf("Hello MPI! Process %d of %d on %s\n", rank, size, name);

    MPI_Finalize();
    return 0;
}

MPI Hello World

http://www.cs.earlham.edu/~lemanal/slides/mpi-slides.pdf

Page 34: Parallel Computing

Compute Unified Device Architecture (CUDA)

Page 35: Parallel Computing

to be continued...

Page 36: Parallel Computing
Page 37: Parallel Computing

http://www.netlib.org/pvm3/book/